Some applications must keep supporting obsolete URL syntaxes for pages that users of old versions have presumably bookmarked. Some applications generate long URLs with many parameters, so the URL does not fit into a standard e-mail line length or is otherwise inconvenient.
But even more importantly, many web crawlers have difficulty crawling web pages with parameters, and others will simply ignore them or value them less because they are considered highly dynamic.
So the purpose is Search Engine Optimization (SEO) and eventually to help humans to easily recognize the URL. (This is a portion of the user experience with a web page, so a "nice URL" could ultimately be considered a User Interface feature.)
In the rest of the document, 'long' URLs are those with named parameters after '?', while 'nice' URLs have data encoded in some other format.
It is possible to have pages that are called via 'nice' URLs using plain VHOST functionality and redirects to pages with long URLs, but this results in an extra HTTP round trip and breaks the browser's 'back' button navigation.
A more accurate solution is to provide an intermediate URL rewriter that can be enabled for some sorts of URLs and which transparently alters the parsed parameters:
The http_physical_path() function returns the path to the actually-called page.
The 'path' parameter remains the same, even if the actually-called page has a different location.
The 'params' parameter is not what's parsed from the 'nice' URL, but is the result of that parsing.
The 'lines' parameters are extended by 1 parameter line, called 'X-VirtuosoRewrite', for handlers that should know about both the source and the destination.
The implementation should be able to handle ill-formed URLs in an understandable way. Possible scenarios are:
The default rewriting is a rewriting rule, too. So Virtuoso has a list (or even a tree) of rewriting rules to apply to requested URLs. If none of these rules match, but the default rewriting rule is set, then Virtuoso will try to process it as is. If none of these rules match and the default rewriting rule is not set, then Virtuoso reports an error.
A 'rewriting rule' describes how to parse a single 'nice' syntax and how to compose the name of the page that should actually be called. Every rewriting rule has an IRI that is used to refer to the rule. The rule IRI is passed as an argument to the functions that operate on rules. Rules come in two kinds: sprintf-based and regex-based.
For purposes of 'nice' to 'long' conversion, the only difference between them is the syntax of format strings.
But 'long' to 'nice' conversion works only for sprintf-based rules, whereas regex-based rules are 'unidirectional'.
A sprintf-based rule contains the following data:
rule_iri - Rule name
nice_format - A format string used by sprintf_inverse to parse the URL into the vector of parts
nice_params - A vector of names of parsed parameters; the length of the vector should be equal to the number of '%' specifiers in the format string
nice_min_params - Minimum allowed number of parameters that should be parsed by sprintf_inverse to treat the parsing as successful
target_format - A format string used by sprintf to compose the URL of the destination page
target_params - A vector of names of parameters that should be passed to sprintf, in order of their use in the format string
target_expn - (Optional) SQL text that should be executed instead of an sprintf() call
A regex-based rule contains the following data:
rule_iri - Rule name
nice_match - A regex match expression to parse the URL into the vector of occurrences
nice_params - A vector of names of parsed parameters; the length of the vector should be equal to the number of '(...)' groups in the match expression
target_compose - A regex 'compose' expression for the URL of the destination page
target_params - A vector of names of parameters that should be passed to the 'compose' expression as $1, $2, and so on
target_expn - (Optional) SQL text that should be executed instead of a regex compose call
Some rewriting rules are preset; their IRIs start with 'sys:'.
The rule called 'sys:default' makes no changes in the source: its parse and sprintf formats are both '%s', and both vectors of parsed parameter names are vector('path').
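The two rule kinds described above can be illustrated with a small Python sketch. It is not Virtuoso code: the toy sprintf_inverse below supports only %s and %d, the rule dictionaries simply reuse the field names from the lists above, and the 'demo:' IRIs, paths and page names are hypothetical.

```python
import re

def sprintf_inverse(text, fmt):
    """Toy stand-in for sprintf_inverse: handles only %s and %d.

    Returns the list of parsed values, or None if text does not fit fmt.
    """
    pattern, kinds, i = "", [], 0
    while i < len(fmt):
        if fmt[i] == "%" and i + 1 < len(fmt) and fmt[i + 1] in "sd":
            kinds.append(fmt[i + 1])
            pattern += r"(\d+)" if fmt[i + 1] == "d" else r"(.+?)"
            i += 2
        else:
            pattern += re.escape(fmt[i])
            i += 1
    m = re.fullmatch(pattern, text)
    if m is None:
        return None
    return [int(v) if k == "d" else v for v, k in zip(m.groups(), kinds)]

def apply_sprintf_rule(rule, nice_path):
    """Parse with nice_format, then compose the long path with target_format."""
    values = sprintf_inverse(nice_path, rule["nice_format"])
    # The toy parser is all-or-nothing; nice_min_params is checked anyway
    # to mirror the rule description.
    if values is None or len(values) < rule["nice_min_params"]:
        return None
    params = dict(zip(rule["nice_params"], values))
    args = tuple(params[name] for name in rule["target_params"])
    return rule["target_format"] % args, params

def apply_regex_rule(rule, nice_path):
    """Parse with nice_match, then fill $1, $2, ... in target_compose."""
    m = re.fullmatch(rule["nice_match"], nice_path)
    if m is None:
        return None
    params = dict(zip(rule["nice_params"], m.groups()))
    names = rule["target_params"]
    long_path = re.sub(r"\$(\d+)",
                       lambda g: params[names[int(g.group(1)) - 1]],
                       rule["target_compose"])
    return long_path, params

# Hypothetical rules, for illustration only.
SPRINTF_RULE = {
    "rule_iri": "demo:blog_post",
    "nice_format": "/blog/%s/%d",
    "nice_params": ["user", "post"],
    "nice_min_params": 2,
    "target_format": "/blog.vsp?user=%s&post=%d",
    "target_params": ["user", "post"],
}
REGEX_RULE = {
    "rule_iri": "demo:news_month",
    "nice_match": r"/news/(\d{4})/(\d{2})",
    "nice_params": ["year", "month"],
    "target_compose": "/news.vsp?y=$1&m=$2",
    "target_params": ["year", "month"],
}

print(apply_sprintf_rule(SPRINTF_RULE, "/blog/alice/42"))
print(apply_regex_rule(REGEX_RULE, "/news/2007/05"))
```

Because the sprintf-based rule keeps both format strings, the same data could in principle be run in the opposite direction, which is why only that kind supports 'long' to 'nice' conversion.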
A 'rewriting rule list' is a named ordered list of rewriting rules or rule lists. Rules of the list are tried from top to bottom, and the first rule that matches is applied. When an IRI in a rule list belongs to another rule list, all rules and rule lists from the 'included' list are matched, except rules and rule lists that were already matched earlier in the recursion. Rule lists and rules may include each other with 'loops' and 'diamonds'; this is not a data inconsistency but an allowed use case. It is also legal for a rule list to be empty. Nevertheless, a rule list cannot directly include itself; that is treated as an explicit error. If no rule matches, a detailed error message is reported.
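The traversal described above can be sketched in Python as a depth-first walk that remembers which IRIs it has already tried, so loops and diamonds are harmless. This is only an illustration: rules are modelled as simple predicates, and all 'demo:' IRIs are hypothetical.

```python
calls = []  # records which rule predicates were actually evaluated

def starts_with(prefix):
    """Build a toy rule: it 'matches' when the path starts with prefix."""
    def test(path):
        calls.append(prefix)
        return path.startswith(prefix)
    return test

def first_match(rulelist_iri, rulelists, rules, path, visited=None):
    """Return the IRI of the first matching rule, or None if nothing matches."""
    if visited is None:
        visited = set()
    visited.add(rulelist_iri)
    for iri in rulelists[rulelist_iri]:      # top to bottom
        if iri in visited:
            continue                         # tried earlier in the recursion
        if iri in rulelists:                 # nested rule list: recurse
            found = first_match(iri, rulelists, rules, path, visited)
            if found is not None:
                return found
        else:                                # plain rule: try it once
            visited.add(iri)
            if rules[iri](path):
                return iri
    return None

RULES = {
    "demo:r_blog": starts_with("/blog/"),
    "demo:r_user": starts_with("/user/"),
}
RULELISTS = {
    "demo:top": ["demo:left", "demo:right"],
    "demo:left": ["demo:shared", "demo:top"],       # loop back to the top
    "demo:right": ["demo:shared", "demo:r_user"],   # diamond on demo:shared
    "demo:shared": ["demo:r_blog"],
    "demo:empty": [],                               # empty lists are legal
}

print(first_match("demo:top", RULELISTS, RULES, "/user/7"))
```

In this sketch a failed search returns None; the text above calls for a detailed error message in that case.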
It is essential to have both "rewriting rules" and "rewriting conditions".
Sometimes a URL matches more than one rule.
This is the case when there are "optional" parameters in the URL.
A "rewriting condition" checks whether a special pattern is present in a URL.
If it is, the next rewriting rule is executed; otherwise processing continues with the rest of the rewriting rule set.
This is the same as "RewriteCond" in the mod_rewrite Apache module.
Special care should be taken around URL encoding and URL decoding.
The mod_rewrite Apache module required a hack to handle a special case
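Returning to the rewriting conditions described above, the following Python sketch shows one way a condition could guard the rule that follows it. The entry layout (condition, match, target), the paths and the page names are assumptions made for illustration; they are not part of the rule format described in this document.

```python
import re

def rewrite(path, entries):
    """Try entries in order; an entry is (condition, match, target).

    condition is a regex that must be found in the path before the rule
    is tried, or None for an unconditional rule.
    """
    for condition, match, target in entries:
        if condition is not None and re.search(condition, path) is None:
            continue                      # condition failed: skip this rule
        m = re.fullmatch(match, path)
        if m is not None:
            return m.expand(target)       # \1, \2 refer to matched groups
    return None

# Both rules below match "/archive/2007/5"; the condition decides which
# one is allowed to handle it.
ENTRIES = [
    (r"/\d{4}/\d{2}$", r"/archive/(\d+)/(\d+)", r"/month.vsp?y=\1&m=\2"),
    (None, r"/archive/(\d+)(?:/\d+)?", r"/year.vsp?y=\1"),
]

print(rewrite("/archive/2007/05", ENTRIES))
print(rewrite("/archive/2007/5", ENTRIES))
```

The month in the second URL is optional from the point of view of the second rule, which is the kind of ambiguity the conditions are meant to resolve.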
DB.DBA.URLREWRITE_DROP_RULE ( rule_iri, force );
If rule_iri is in use as a rulelist IRI, an error is signaled.
If rule_iri is unknown, an error is signaled.
If rule_iri starts with 'sys:', an error is signaled.
If rule_iri is still in use in some rule lists, then either an error is signaled or it is removed from all rule lists, according to the value of the 'force' flag.
DB.DBA.URLREWRITE_CREATE_SPRINTF_RULE ( rule_iri, allow_update, nice_format, nice_params, nice_min_params, target_format, target_params, target_expn := NULL );
DB.DBA.URLREWRITE_CREATE_REGEX_RULE ( rule_iri, allow_update, nice_match, nice_params, target_compose, target_params, target_expn := NULL );
If rule_iri is already in use as a rulelist IRI, an error is signaled.
If rule_iri is already in use as a rule IRI and allow_update is zero, an error is signaled.
If rule_iri is already in use as a rule IRI and allow_update is non-zero, the existing rule is updated.
If rule_iri starts with 'sys:', no error is signaled, unlike DB.DBA.URLREWRITE_DROP_RULE, but this should not be done without good reason.
DB.DBA.URLREWRITE_DROP_RULELIST ( rulelist_iri, force );
If rulelist_iri is already in use as a rule IRI, an error is signaled.
If rulelist_iri is unknown, an error is signaled.
If rulelist_iri starts with 'sys:', no error is signaled.
If rulelist_iri is still in use in 'opts' of HTTP_PATH or in rule lists, then either an error is signaled or it is removed from all rule lists and 'opts', according to the value of the 'force' flag.
DB.DBA.URLREWRITE_CREATE_RULELIST ( rulelist_iri, allow_update, vector_of_rule_iris );
If rulelist_iri is already in use as a rule IRI, an error is signaled.
If vector_of_rule_iris contains rulelist_iri, an error is signaled.
If rulelist_iri is already in use as a rulelist IRI and allow_update is zero, an error is signaled.
If rulelist_iri is already in use as a rulelist IRI and allow_update is non-zero, the existing rule list is updated.
[...] 'opts' of HTTP_PATH.
If rulelist_iri starts with 'sys:', no error is signaled, but this should not be done without good reason.
DB.DBA.URLREWRITE_ENUMERATE_RULES ( like_pattern_for_rule_iris, dump_details );
This function lists all rules whose IRIs match the specified 'SQL like' pattern.
[...] varchar_or_null ),
DB.DBA.URLREWRITE_ENUMERATE_RULELISTS ( like_pattern_for_rulelist_iris, dump_details );
This function lists all rule lists whose IRIs match the specified 'SQL like' pattern.
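The checks listed above for creating and dropping rules can be modelled with a small in-memory registry. This Python sketch is only an illustration of the described behaviour: the class, its storage layout and the order of the checks are assumptions, and every error is reduced to a ValueError.

```python
class RewriteRegistry:
    """Hypothetical in-memory stand-in for rule and rule list storage."""

    def __init__(self):
        self.rules = {}      # rule IRI -> rule definition
        self.rulelists = {}  # rulelist IRI -> list of member IRIs

    def create_rule(self, rule_iri, allow_update, definition):
        if rule_iri in self.rulelists:
            raise ValueError("%s is already in use as a rulelist IRI" % rule_iri)
        if rule_iri in self.rules and not allow_update:
            raise ValueError("%s exists and allow_update is zero" % rule_iri)
        # 'sys:' IRIs are deliberately not rejected here, as described above.
        self.rules[rule_iri] = definition

    def drop_rule(self, rule_iri, force):
        if rule_iri in self.rulelists:
            raise ValueError("%s is in use as a rulelist IRI" % rule_iri)
        if rule_iri not in self.rules:
            raise ValueError("%s is unknown" % rule_iri)
        if rule_iri.startswith("sys:"):
            raise ValueError("preset rule %s cannot be dropped" % rule_iri)
        users = [iri for iri, members in self.rulelists.items()
                 if rule_iri in members]
        if users and not force:
            raise ValueError("%s is still used by %s" % (rule_iri, users))
        for iri in users:  # force is set: detach the rule everywhere
            self.rulelists[iri] = [m for m in self.rulelists[iri]
                                   if m != rule_iri]
        del self.rules[rule_iri]

registry = RewriteRegistry()
registry.create_rule("demo:r1", 0, {"nice_format": "/blog/%s"})
registry.rulelists["demo:list"] = ["demo:r1"]
registry.drop_rule("demo:r1", force=1)
print(registry.rulelists)
```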
An application enables URL rewriting by providing the 'url_rewrite' parameter in the 'opts' list of the vhost_define() call. The parameter value is the IRI of a rule list. If no matching rule is found in the list (during the recursive traversal), a detailed 404 error report is returned. If a matching rule is found, the 'physical' path is substituted, and a search is made for a virtual host definition whose host and port are equal to the original values and whose physical path matches. If more than one record matches, the one with the longest physical path prefix is chosen. If more than one record reaches the longest physical path length, and one of them is the record found for the original logical path, that record is chosen. Otherwise, a detailed 500 error report is returned.
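The selection of the target virtual host described above can be sketched as follows. This Python fragment is an illustration only: virtual host records are modelled as plain dictionaries with hypothetical keys, an unresolved tie is signalled with an exception standing in for the 500 report, and returning None when nothing matches is an assumption, since that case is not described here.

```python
def choose_target_vhost(vhosts, host, port, physical_path, original_vhost):
    """Pick the record whose physical path is the longest matching prefix."""
    candidates = [v for v in vhosts
                  if v["host"] == host and v["port"] == port
                  and physical_path.startswith(v["ppath"])]
    if not candidates:
        return None  # assumption: behaviour for "no match" is not specified
    best = max(len(v["ppath"]) for v in candidates)
    longest = [v for v in candidates if len(v["ppath"]) == best]
    if len(longest) == 1:
        return longest[0]
    if original_vhost in longest:
        return original_vhost  # tie broken in favour of the original record
    raise RuntimeError("ambiguous virtual host (500 error report)")

# Hypothetical virtual host records.
VHOSTS = [
    {"host": "example.com", "port": 80, "lpath": "/", "ppath": "/"},
    {"host": "example.com", "port": 80, "lpath": "/app", "ppath": "/apps/blog/"},
    {"host": "example.com", "port": 80, "lpath": "/app2", "ppath": "/apps/blog/"},
    {"host": "example.com", "port": 8080, "lpath": "/x", "ppath": "/apps/"},
]

chosen = choose_target_vhost(VHOSTS, "example.com", 80,
                             "/apps/blog/show.vsp", VHOSTS[1])
print(chosen["lpath"])
```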
Thus the order of actions that may affect the search for destination page and its properties/permissions is:
The path translation function should be publicly available:
DB.DBA.URLREWRITE_APPLY ( IN nice_url VARCHAR, IN post_params ANY, OUT long_url VARCHAR, OUT params ANY, OUT nice_vhost_pkey ANY, OUT top_rulelist_iri VARCHAR, OUT rule_iri VARCHAR, OUT target_vhost_pkey ANY );
The function gets nice_url and tries to find the appropriate HTTP_PATH.
If one is found, it performs a recursive traversal of the specified rulelist.
For every rule in the tree the function implements almost the same logic, regardless of whether the rule is sprintf-based or regex-based.
[...] split_and_decode().
[...] split_and_decode().
The final list of parameters that is visible to the application as 'params' is a concatenation of [...] HP_LISTEN_HIST, PH_HOST, HP_LPATH).
The function returns 1 if there was an actual non-default rewriting and stores its results via 'out' parameters:
long_url is the URL for the destination page,
params is the final vector of parameters,
nice_vhost_pkey is a vector of the pk columns of the HTTP_PATH row that matches the 'nice url',
top_rulelist_iri is the rulelist IRI found in HTTP_PATH,
rule_iri is the rule that actually made the successful parsing,
target_vhost_pkey is the vector of the pk columns of the HTTP_PATH row whose physical path matched the 'long url'.
The function may fill some outputs with NULLs if the execution did not reach some part of the processing.
URL rewriting results in an effect that is somewhat similar to IS_REDIRECT_REF(): the data returned relate to a resource that is not the requested resource.
The logical consequence of URL rewriting is quite similar to that of a redirect stored in DAV.
URL rewriting has smarter processing but almost the same final effect.
There is a non-obvious debugging problem.
Changes in URL rewriting and/or the VHOST_DEFINE table may result in frequent changes in the returned content, even if destination pages return constant data when called via their 'long' URLs.
To shorten cache lifetime, the standard HTTP response for a 'nice' URL should return a 'Last-Modified' value that is equal not to the resource creation date, but to the most recent of:
Note that this change does not affect the calculation of ETag value! ETag relates to the content of the resource regardless of its location or retrieval URL!
Two functions are responsible for composing 'nice' URLs. One gets a ready-to-send path part of 'long' URL and a rule or rulelist IRI; the other gets a rule IRI, an array of named parameters, optional protocol, host, port, and fragment parts. Both either return a 'nice' URL or signal an error, indicating that there's no way of composing the URL by source data. Both of them use a third (internal) function that returns diagnostics instead of signaling an error:
DB.DBA.URLREWRITE_TRY_INVERSE ( IN rule_iri VARCHAR, IN long_path VARCHAR, IN known_params ANY, IN param_retrieval_callback VARCHAR, IN param_retrieval_env ANY, INOUT param_retrieval_cache ANY, OUT nice_path VARCHAR, OUT nice_params ANY, OUT error_report VARCHAR );
The function tries to rewrite long_path using the given rule.
On failure, nice_path is NULL and error_report contains diagnostics as readable text.
On success, nice_path is a string and error_report is NULL.
The rule should exist (check needed) and should be sprintf-based; otherwise, the function will fail.
The call of sprintf_inverse applied to long_path may produce values of some parameters.
Other parameters may be passed via the known_params variable.
Even after that, the vector of nice parameters may contain more names than the union of these two lists.
To get additional values, the function will execute the text of param_retrieval_callback (if it is not null) and pass two values to the exec(): the parameter name and the value of param_retrieval_env.
To avoid extra exec() invocations, every retrieved value is cached in the param_retrieval_cache dictionary, so any given additional parameter is retrieved only once per rulelist.
The value of nice_path may contain fewer parameter values than were specified by the known_params vector.
On success, nice_params is filled with the values from known_params whose names did not appear in the list of 'nice' sprintf parameters.
The order is important: parameters in nice_params should be in the same relative order as they were in known_params.
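The inverse conversion described above can be sketched in Python. This is a simplified model, not the Virtuoso implementation: the toy sprintf_inverse handles only %s, known_params is modelled as a list of (name, value) pairs, the callback is an ordinary Python function instead of SQL text passed to exec(), and letting parsed values take precedence over known_params is an assumption. The rule, paths and parameter names are hypothetical.

```python
import re

def sprintf_inverse(text, fmt):
    """Toy inverse of a format string that uses only %s specifiers."""
    pattern = "(.+?)".join(re.escape(part) for part in fmt.split("%s"))
    m = re.fullmatch(pattern, text)
    return None if m is None else list(m.groups())

def try_inverse(rule, long_path, known_params, callback, env, cache):
    """Return (nice_path, nice_params, error_report) without raising."""
    parsed = sprintf_inverse(long_path, rule["target_format"])
    if parsed is None:
        return None, None, "long path %r does not fit %r" % (
            long_path, rule["target_format"])
    values = dict(zip(rule["target_params"], parsed))
    for name, value in known_params:
        values.setdefault(name, value)   # assumption: parsed values win
    for name in rule["nice_params"]:
        if name in values:
            continue
        if name not in cache:            # retrieve each extra value once
            if callback is None:
                return None, None, "no value for parameter %r" % name
            cache[name] = callback(name, env)
        values[name] = cache[name]
    nice_path = rule["nice_format"] % tuple(
        values[name] for name in rule["nice_params"])
    # Leftover known parameters keep their original relative order.
    leftovers = [(n, v) for n, v in known_params
                 if n not in rule["nice_params"]]
    return nice_path, leftovers, None

RULE = {
    "nice_format": "/blog/%s/%s/%s",
    "nice_params": ["lang", "user", "post"],
    "target_format": "/blog.vsp?user=%s&post=%s",
    "target_params": ["user", "post"],
}
lookups = []

def lookup(name, env):
    lookups.append(name)
    return env[name]

cache = {}
print(try_inverse(RULE, "/blog.vsp?user=alice&post=42",
                  [("sid", "abc"), ("realm", "wa")],
                  lookup, {"lang": "en"}, cache))
```

Passing the same cache to later calls is what keeps the callback from being invoked again for a parameter that has already been retrieved.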