Normalization of UNICODE3 accented characters for Virtuoso free-text indexing

Normalization of UNICODE3 accented characters in a free-text index can be controlled by setting the XAnyNormalization configuration parameter in the [I18N] section of the Virtuoso configuration file, virtuoso.ini. This parameter controls whether accented UNICODE characters should be converted to their non-accented base variants when creating a free-text index or when parsing a free-text query string. The parameter's value is a bitmask integer, currently with only 2 bits in use:

XAnyNormalization value   Bit equivalent   Description
0                         00               Default. Nothing is normalized, so "Jose" and "José" are two distinct words.
1                         01               To be done.
2                         10               Any "combining character sequence" (a combination of a base character and one or more combining characters) is converted to its (smallest known) base. For example, "é" will lose its accent and become a plain ASCII "e".
3                         11               This combines 1 and 2, and so causes both conversions. Any pair of base character and combining character loses the second character, and characters with accents lose their accents.
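
The combined effect described for value 3 can be approximated outside Virtuoso to see which strings would end up as the same indexed word. The following is a minimal sketch in Python, assuming that Unicode NFD decomposition followed by dropping non-spacing marks is a reasonable stand-in; the function name strip_accents is hypothetical, and Virtuoso's own conversion tables may treat some characters differently.

```python
import unicodedata

def strip_accents(text):
    """Approximate XAnyNormalization = 3 (an assumption, not Virtuoso's code)."""
    # NFD splits a precomposed letter such as U+00E9 into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop non-spacing marks (general category Mn), keeping the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

precomposed = "Jos\u00e9"   # accented letter stored as one code point
combining = "Jose\u0301"    # base "e" followed by COMBINING ACUTE ACCENT
print(strip_accents(precomposed))  # Jose
print(strip_accents(combining))    # Jose
```

Both spellings collapse to the same ASCII word, which is the behavior the table describes for value 3.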

To enable both conversions, the relevant fragment of virtuoso.ini would look like:


...

[I18N]
XAnyNormalization = 3

...

  • XAnyNormalization = 3 is recommended for most scenarios requiring such normalization. In some rare cases, XAnyNormalization = 1 may be more appropriate.
  • The parameter should generally be set before creating a database, and must be set identically for all instances in a cluster configuration. If it is changed on an existing database, rebuild all free-text indexes that may contain non-ASCII data by running the following procedure from isql:

    VT_INDEX_DB_DBA_RDF_OBJ(0)

  • On a typical system, the parameter affects all text columns, XML columns, RDF literals, and queries. (Strictly speaking, it only affects items that use the default "x-any" language, or a language derived from x-any such as "en" or "en-US". Unless you have written C plug-ins for custom languages, you need not be concerned with this distinction.)
  • Note: We have had requests for a database function that normalizes characters in strings, as the free-text engine does with XAnyNormalization=3. This function will be provided as a separate patch/update, and will depend on XAnyNormalization.
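
Because the setting must be identical on every cluster instance, it can be worth comparing the configuration files before starting the servers. The following is a minimal sketch in Python using the standard configparser module. The helper names xany_normalization and settings_match are hypothetical, the sample file contents are made up for illustration, and a missing parameter is treated as 0 on the assumption that the documented default applies.

```python
import configparser

def xany_normalization(ini_text):
    """Return the XAnyNormalization bitmask found in virtuoso.ini text."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    # Fall back to 0, the documented default, when the parameter is absent.
    return parser.getint("I18N", "XAnyNormalization", fallback=0)

def settings_match(ini_texts):
    """True when every configuration text yields the same bitmask."""
    return len({xany_normalization(text) for text in ini_texts}) <= 1

# Made-up configuration fragments standing in for three cluster instances.
node1 = "[I18N]\nXAnyNormalization = 3\n"
node2 = "[I18N]\nXAnyNormalization = 3\n"
node3 = "[Parameters]\nServerPort = 1111\n"
print(settings_match([node1, node2]))         # True
print(settings_match([node1, node2, node3]))  # False
```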

Example

With XAnyNormalization=3, one can get the following:


SQL> SPARQL 
     INSERT 
       IN <http://InternationalNSMs/>
         {
           <s>  <sp>  "Índio João Macapá Júnior Tôrres Luís Araújo José"  ; 
                <ru>  "Он добавил картошки, посолил и поставил аквариум на огонь"   
         }
       ;

INSERT INTO <http://InternationalNSMs/>, 2 (or less) triples -- done


SQL> DB.DBA.RDF_OBJ_FT_RULE_ADD (NULL, NULL, 'InternationalNSMs.wb');

Done. -- 0 msec.

SQL> VT_INDEX_DB_DBA_RDF_OBJ(0);

Done. -- 26 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE 
         {
           ?s  ?p  ?o 
         }
       ORDER BY ASC (str(?o))
       ;

s  sp  Índio João Macapá Júnior Tôrres Luís Araújo José
s  ru  Он добавил картошки, посолил и поставил аквариум на огонь

2 Rows. -- 2 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE 
         { 
           ?s  ?p            ?o                                                    . 
           ?o  bif:contains  "'Índio João Macapá Júnior Tôrres Luís Araújo José'"  
         }
       ;

s  sp  Índio João Macapá Júnior Tôrres Luís Araújo José

1 Rows. -- 2 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE
         { 
           ?s  ?p            ?o                                                    . 
           ?o  bif:contains  "'Indio Joao Macapa Junior Torres Luis Araujo Jose'" 
         }
       ;

s  sp  Índio João Macapá Júnior Tôrres Luís Araújo José

1 Rows. -- 1 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE 
         { 
           ?s  ?p            ?o                         . 
           ?o  bif:contains  "'поставил аквариум на огонь'" 
         }
       ;

s  ru  Он добавил картошки, посолил и поставил аквариум на огонь
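
The reason both the accented and the plain ASCII query find the same literal is that the same folding is applied when the index is built and when the query string is parsed. The toy model below is a sketch in Python under stated assumptions, not Virtuoso's implementation: all function names are hypothetical, words are split on whitespace, and a query matches when every word is present (bif:contains phrase semantics are not modeled). It also shows why an index built before the setting was changed should be rebuilt.

```python
import unicodedata

def fold(word):
    # Assumed stand-in for XAnyNormalization = 3: decompose, drop marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

def words(text, normalize):
    tokens = text.lower().split()
    return [fold(t) if normalize else t for t in tokens]

def build_index(literals, normalize):
    # Map each (possibly folded) word to the set of literal ids containing it.
    index = {}
    for key, text in literals.items():
        for word in words(text, normalize):
            index.setdefault(word, set()).add(key)
    return index

def contains_all(index, query, normalize):
    hits = [index.get(word, set()) for word in words(query, normalize)]
    return set.intersection(*hits) if hits else set()

literals = {"sp": "Índio João Macapá Júnior Tôrres Luís Araújo José"}
fresh = build_index(literals, normalize=True)
print(contains_all(fresh, "Indio Joao Macapa", normalize=True))  # {'sp'}
print(contains_all(fresh, "Índio João Macapá", normalize=True))  # {'sp'}

# An index built without normalization still holds the accented words,
# so a folded query no longer finds them until the index is rebuilt.
stale = build_index(literals, normalize=False)
print(contains_all(stale, "Jose", normalize=True))               # set()
```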

Related