URL Rewriting in Virtuoso

Why rewrite URLs?

Some applications must keep supporting obsolete URL syntaxes for pages that users of older versions have presumably bookmarked. Other applications generate long URLs with many parameters, so the URL does not fit into a standard e-mail line length or is otherwise inconvenient.

But even more importantly, many web crawlers have difficulty crawling web pages with parameters, and others will simply ignore them or value them less because they are considered highly dynamic.

So the purpose is Search Engine Optimization (SEO) and, ultimately, helping humans recognize the URL easily. (This is part of the user experience with a web page, so a "nice URL" could ultimately be considered a User Interface feature.)

In the rest of the document, 'long' URLs are those with named parameters after '?', while 'nice' URLs have data encoded in some other format.
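To make the distinction concrete, here is a minimal Python sketch that turns one 'nice' path into the corresponding 'long' URL. The '/products/...' syntax, the page name show_product.vsp, and the parameter names are hypothetical and chosen only for illustration; this is not how Virtuoso itself implements the mapping.

```python
import re
from urllib.parse import urlencode

# Hypothetical 'nice' syntax: /products/<category>/<id>
NICE_PATTERN = re.compile(r"^/products/([^/]+)/(\d+)$")

def nice_to_long(nice_path):
    """Return the 'long' URL for a 'nice' path, or None if it does not match."""
    m = NICE_PATTERN.match(nice_path)
    if m is None:
        return None
    # Data encoded in the path becomes named parameters after '?'.
    params = {"category": m.group(1), "id": m.group(2)}
    return "/show_product.vsp?" + urlencode(params)

print(nice_to_long("/products/books/42"))
```

The same data travels in both forms; only the encoding differs.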

It is possible to have pages that are called via 'nice' URLs using plain VHOST functionality and redirects to pages with long URLs, but this results in an extra HTTP round trip and breaks the browser's 'back' button navigation.

A more accurate solution is to provide an intermediate URL rewriter that can be enabled for some sorts of URLs and which transparently alters the parsed parameters:

The implementation should be able to handle ill-formed URLs in an understandable way. Possible scenarios are:

The default rewriting is a rewriting rule, too. So Virtuoso has a list (or even a tree) of rewriting rules to apply to requested URLs. If none of these rules matches but the default rewriting rule is set, Virtuoso will try to process the URL as is. If none of these rules matches and the default rewriting rule is not set, Virtuoso reports an error.

Rewriting Rules

A 'rewriting rule' describes how to parse a single 'nice' syntax and how to compose a name of the page that should be actually called. Every rewriting rule has an IRI that is used to refer to the rule. Rule IRI is passed as an argument to functions that

A sprintf-based rule contains the following data:

A regex-based rule contains the following data:

Some rewriting rules are preset; their IRIs start with 'sys:'. The rule called 'sys:default' makes no changes to the source: its parse and sprintf formats are both '%s', and both vectors of parsed parameter names are vector('path').

A 'rewriting rule list' is a named ordered list of rewriting rules or rule lists. The rules in the list are tried from top to bottom, and the first rule that matches is applied. When an IRI in a rule list belongs to another rule list, all rules and rule lists from the 'included' list are matched, except rules and rule lists that were matched before, during the recursion. Rule lists and rules may include each other with 'loops' and 'diamonds'; this is not a data inconsistency, but an allowed use case. It is also legal for a rule list to be empty. Nevertheless, a rule list cannot directly include itself; that is plainly a mistake. If no rule matches, a detailed error message is reported.
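The traversal described above can be sketched in Python as follows. This is only an illustration under stated assumptions: rules are modelled as plain predicates, the 'ex:' IRIs are hypothetical, and "matched before" is read as "already tried earlier in the same traversal". Where the real implementation reports a detailed error, the sketch simply returns None.

```python
def first_match(iri, url, rules, rulelists, visited=None):
    """Return the IRI of the first rule that matches url, or None.

    rules:     dict mapping rule IRI -> predicate taking the URL
    rulelists: dict mapping rule list IRI -> ordered list of member IRIs
    """
    if visited is None:
        visited = set()
    if iri in visited:
        return None  # already tried earlier; this breaks loops and diamonds
    visited.add(iri)
    if iri in rules:
        return iri if rules[iri](url) else None
    for member in rulelists.get(iri, []):  # an empty rule list is legal
        hit = first_match(member, url, rules, rulelists, visited)
        if hit is not None:
            return hit  # the first matching rule wins
    return None

# Hypothetical rules and rule lists with a loop and a diamond.
rules = {
    "ex:by_id": lambda url: url.startswith("/id/"),
    "ex:by_name": lambda url: url.startswith("/name/"),
}
rulelists = {
    "ex:top": ["ex:left", "ex:right"],
    "ex:left": ["ex:by_id", "ex:top"],       # loop back to ex:top
    "ex:right": ["ex:by_id", "ex:by_name"],  # ex:by_id is reachable twice
}

print(first_match("ex:top", "/name/alice", rules, rulelists))
```

Tracking visited IRIs in a single set is what lets the traversal terminate even when lists include each other.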

It is essential to have both "rewriting rules" and "rewriting conditions". Sometimes a URL matches more than one rule. This is the case when there are "optional" parameters in the URL. A "rewriting condition" checks whether a special pattern is present in a URL. If it is, the next rewriting rule is executed; otherwise, processing continues with the rest of the rewriting rule set. This is the same as "RewriteCond" in the mod_rewrite Apache module.

Special care should be taken around URL encoding and URL decoding. The mod_rewrite Apache module required a hack to handle a special case

Configuration API


DB.DBA.URLREWRITE_DROP_RULE 
   ( rule_iri, 
     force
   );


DB.DBA.URLREWRITE_CREATE_SPRINTF_RULE 
   ( rule_iri, 
     allow_update, 
     nice_format, 
     nice_params, 
     nice_min_params, 
     target_format, 
     target_params, 
     target_expn := NULL
   );

DB.DBA.URLREWRITE_CREATE_REGEX_RULE 
   ( rule_iri, 
     allow_update, 
     nice_match, 
     nice_params, 
     target_compose, 
     target_params, 
     target_expn := NULL
   );

Note: if rule_iri starts with 'sys:', no error is signaled (unlike DB.DBA.URLREWRITE_DROP_RULE), but 'sys:' rules should not be redefined without good reason.


DB.DBA.URLREWRITE_DROP_RULELIST 
   ( rulelist_iri, 
     force
   );

DB.DBA.URLREWRITE_CREATE_RULELIST
   ( rulelist_iri, 
     allow_update, 
     vector_of_rule_iris
   );

Note: Unlike rules, rule lists can be used either in other rule lists or in 'opts' of HTTP_PATH. If rulelist_iri starts with 'sys:', no error is signaled, but 'sys:' rule lists should not be redefined without good reason.

DB.DBA.URLREWRITE_ENUMERATE_RULES
   ( like_pattern_for_rule_iris, 
     dump_details
   );

This function lists all rules whose IRIs match the specified 'SQL like' pattern.


DB.DBA.URLREWRITE_ENUMERATE_RULELISTS
   ( like_pattern_for_rulelist_iris, 
     dump_details
   );

This function lists all rule lists whose IRIs match the specified 'SQL like' pattern.

Accepting Requests

An application enables URL rewriting by passing a 'url_rewrite' parameter in the 'opts' list of the vhost_define() call. The parameter value is the IRI of a rule list. If no matching rule is found in the list (during the recursive traversal), a detailed 404 error report is returned. If a matching rule is found, the 'physical' path is substituted, and a search is made for a virtual host definition whose host and port equal the original values and whose physical path matches. If more than one record matches, the one with the longest physical path prefix is chosen. If more than one record has the longest physical path length and one of them is the record found for the original logical path, that record is chosen. Otherwise, a detailed 500 error report is returned.
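The selection among matching virtual host records can be sketched in Python as follows. The record layout (a name plus a physical path prefix) and the function name are assumptions made for illustration, and the candidates are assumed to have been filtered by host and port already. The source does not say what happens when no record matches at all, so raising an error in that case is also an assumption.

```python
def choose_vhost(physical_path, candidates, original):
    """Pick the record whose physical path prefix is the longest match.

    candidates: list of (name, ppath) records for the same host and port
    original:   name of the record found for the original logical path
    """
    matching = [c for c in candidates if physical_path.startswith(c[1])]
    if not matching:
        raise LookupError("no virtual host matches the physical path")
    longest = max(len(ppath) for _, ppath in matching)
    best = [name for name, ppath in matching if len(ppath) == longest]
    if len(best) == 1:
        return best[0]
    if original in best:
        return original  # tie broken in favour of the original record
    raise LookupError("500: ambiguous virtual host definitions")

# Hypothetical records: 'b' and 'c' share the same physical path.
records = [("a", "/DAV/"), ("b", "/DAV/app/"), ("c", "/DAV/app/")]
print(choose_vhost("/DAV/app/page.vsp", records, original="c"))
```

If the original record is not among the tied candidates, the sketch raises an error, which corresponds to the 500 report described above.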

Thus the order of actions that may affect the search for destination page and its properties/permissions is:

  1. HTTP_PATH table may substitute the URL by
    • a mapping with 'url_rewrite' via rulelist (and change properties via search in same table for appropriate physical path) OR
    • a 'plain' virtual host (and set properties) OR
    • no change at all, in which case the file is taken from the filesystem as is (the default), with no 'special' permissions or redirects.
  2. According to 'noinherit' in 'opts' of the vhost, the URL used to find a resource may become as short as 'ppath'.
  3. According to 'redirectref' property (PROP_NAME) of a resource, the path part of the URL can be replaced.

Any of these three steps may or may not change the path, independently of the others.

The path translation function should be publicly available:


DB.DBA.URLREWRITE_APPLY 
   ( IN   nice_url           VARCHAR, 
     IN   post_params        ANY,
     OUT  long_url           VARCHAR, 
     OUT  params             ANY,
     OUT  nice_vhost_pkey    ANY,
     OUT  top_rulelist_iri   VARCHAR, 
     OUT  rule_iri           VARCHAR, 
     OUT  target_vhost_pkey  ANY
   );

The function takes nice_url and tries to find the appropriate HTTP_PATH record. If one is found, it performs a recursive traversal of the specified rule list. For every rule in the tree, the function applies almost the same logic regardless of whether the rule is sprintf-based or regex-based.

The function returns 1 if there was an actual non-default rewriting and stores its results via 'out' parameters:

The function may fill some outputs with NULLs, if the execution did not reach some part of processing.

HTTP Handlers

URL rewriting results in an effect that is somewhat similar to IS_REDIRECT_REF(): the data returned relate to a resource that is not the requested resource. The logical consequence of URL rewriting is quite similar to a redirect stored in DAV. URL rewriting has smarter processing but almost the same final effect.

There is a non-obvious debugging problem. Changes in URL rewriting and/or VHOST_DEFINE table may result in frequent changes in the returned content, even if destination pages return constant data when called via their 'long' URLs. To shorten cache lifetime, the standard HTTP response for a 'nice' URL should return a 'Last-Modified' value that equals not the resource creation date, but the most recent of:

Note that this change does not affect the calculation of ETag value! ETag relates to the content of the resource regardless of its location or retrieval URL!

Composing Nice URLs

Two functions are responsible for composing 'nice' URLs. One takes a ready-to-send path part of a 'long' URL and a rule or rule list IRI; the other takes a rule IRI, an array of named parameters, and optional protocol, host, port, and fragment parts. Both either return a 'nice' URL or signal an error indicating that the URL cannot be composed from the source data. Both of them use a third (internal) function that returns diagnostics instead of signaling an error:


DB.DBA.URLREWRITE_TRY_INVERSE
   (
     IN     rule_iri                  VARCHAR,
     IN     long_path                 VARCHAR, 
     IN     known_params              ANY,
     IN     param_retrieval_callback  VARCHAR, 
     IN     param_retrieval_env       ANY,
     INOUT  param_retrieval_cache     ANY,
     OUT    nice_path                 VARCHAR, 
     OUT    nice_params               ANY,
     OUT    error_report              VARCHAR
   );

The function tries to rewrite long_path using the given rule. On failure, nice_path is NULL and error_report contains diagnostics as readable text. On success, nice_path is a string and error_report is NULL. The rule should exist (check needed) and should be sprintf-based; otherwise, the function will fail.

The call of sprintf_inverse applied to long_path may produce values of some parameters. Other parameters may be passed via the known_params variable. Even after that, the vector of nice parameters may contain more names than the union of these two lists. To get the additional values, the function will execute the text of param_retrieval_callback (if it is not NULL) and pass two values to exec(): the parameter name and the value of param_retrieval_env. To avoid extra exec() invocations, every retrieved value is cached in the param_retrieval_cache dictionary, so any given additional parameter is retrieved only once per rule list.
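The lookup order and the caching can be sketched in Python as follows. This is a model under stated assumptions, not the Virtuoso implementation: parameters are held in dicts, the callback is an ordinary Python function rather than text passed to exec(), and the precedence of parsed values over known values is an assumption the source does not state.

```python
def resolve_params(needed, parsed, known, callback, env, cache):
    """Collect a value for every name in needed.

    Returns (values, None) on success or (None, missing_name) on failure.
    """
    values = {}
    for name in needed:
        if name in parsed:            # produced by parsing long_path
            values[name] = parsed[name]
        elif name in known:           # passed in by the caller
            values[name] = known[name]
        elif callback is not None:
            if name not in cache:     # call the callback once per name
                cache[name] = callback(name, env)
            values[name] = cache[name]
        else:
            return None, name
    return values, None

# Hypothetical callback that records how often it is invoked.
calls = []
def lookup(name, env):
    calls.append(name)
    return env[name]

cache = {}
env = {"theme": "dark"}
for _ in range(2):
    result = resolve_params(["id", "lang", "theme"],
                            {"id": "42"}, {"lang": "en"}, lookup, env, cache)
print(result, calls)
```

Because the cache is passed in and updated in place, reusing it across several rules keeps each additional parameter down to a single retrieval.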

The value of nice_path may contain fewer parameter values than were specified in the known_params vector. On success, nice_params is filled with the values from known_params whose names did not appear in the list of 'nice' sprintf parameters. The order is important: parameters in nice_params should be in the same relative order as they were in known_params.
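The ordering requirement can be illustrated with a short Python sketch. Modelling known_params as a list of (name, value) pairs is an assumption made for illustration, and the parameter names are hypothetical.

```python
def leftover_params(known_params, nice_names):
    """Keep the known parameters that the 'nice' format does not consume.

    known_params: list of (name, value) pairs in caller-supplied order
    nice_names:   names that appear in the 'nice' sprintf parameter list
    """
    consumed = set(nice_names)
    # A filtering comprehension preserves the original relative order.
    return [(name, value) for name, value in known_params
            if name not in consumed]

known = [("lang", "en"), ("id", "42"), ("page", "3"), ("sort", "asc")]
print(leftover_params(known, ["id", "page"]))
```

Filtering rather than rebuilding the list is what keeps the remaining parameters in their original relative order.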
