Normalization of UNICODE3 accented characters for Virtuoso free-text indexing

Normalization of UNICODE3 accented characters in a free-text index can be controlled by setting the XAnyNormalization configuration parameter in the [I18N] section of the Virtuoso configuration file, virtuoso.ini. This parameter controls whether accented UNICODE characters should be converted to their non-accented base variants when creating a free-text index or when parsing a free-text query string. The parameter's value is a bitmask integer, currently with only 2 bits in use:

XAnyNormalization value	bit equivalent	Description
`0`	`00`	Default. Nothing is normalized, so "Jose" and "Jos?" are two distinct words.
`1`	`01`	ToBeDone
`2`	`10`	Any "combining character sequence" (a combination of a base character and one or more combining characters) is converted to its (smallest known) base. For example, "?" will lose its accent, and become a plain ASCII "e".
`3`	`11`	This combines `1` and `2`, and so causes both conversions. Any pair of base character and combining character loses the second character, and characters with accents lose their accents.

So the fragment of virtuoso.ini would look like:

...

[I18N]
XAnyNormalization = 3

...

XAnyNormalization = 3 is recommended for most scenarios requiring such normalization. In some rare cases, XAnyNormalization = 1 may be more appropriate.

The parameter should generally be set before creating a database, and must be set identically for all instances in a cluster configuration. If changed on an existing database, you should rebuild all free-text indexes that may contain non-ASCII data by running the following procedure from isql --
```
VT_INDEX_DB_DBA_RDF_OBJ(0)
```
On a typical system, the parameter affects all text columns, XML columns, RDF literals, and queries. (Strictly speaking, it only affects items that use default "x-any" language, or a language derived from x-any such as "en" or "en-US". If you haven't tried writing new C plug-ins for custom languages, you need not look so deep.)

Note: We have had requests for a database function that normalizes characters in strings, as the free-text engine does with XAnyNormalization=3. This function will be provided as a separate patch/update, and will depend on XAnyNormalization.

Example

With XAnyNormalization=3, one can get the following:

SQL> SPARQL 
     INSERT 
       IN <http://InternationalNSMs/>
         {
           <s>  <sp>  "?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?"  ; 
                <ru>  "?? ??????? ????????, ??????? ? ???????? ???????? ?? ?????"   
         }
       ;

INSERT INTO <http://InternationalNSMs/>, 2 (or less) triples -- done


SQL> DB.DBA.RDF_OBJ_FT_RULE_ADD (NULL, NULL, 'InternationalNSMs.wb');

Done. -- 0 msec.

SQL> VT_INDEX_DB_DBA_RDF_OBJ(0);

Done. -- 26 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE 
         {
           ?s  ?p  ?o 
         }
       ORDER BY ASC (str(?o))
       ;

s  sp  ?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?
s  ru  ?? ??????? ????????, ??????? ? ???????? ???????? ?? ?????

2 Rows. -- 2 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE 
         { 
           ?s  ?p            ?o                                                    . 
           ?o  bif:contains  "'?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?'"  
         }
       ;

s  sp  ?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?

1 Rows. -- 2 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE
         { 
           ?s  ?p            ?o                                                    . 
           ?o  bif:contains  "'Indio Joao Macapa Junior Torres Luis Araujo Jose'" 
         }
       ;

s  sp  ?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?

1 Rows. -- 1 msec.

SQL> SPARQL 
     SELECT * 
       FROM <http://InternationalNSMs/> 
       WHERE 
         { 
           ?s  ?p            ?o                         . 
           ?o  bif:contains  "'???????? ???????? ?? ?????'" 
         }
       ;

s  ru  ?? ??????? ????????, ??????? ? ???????? ???????? ?? ?????

Virtuoso ini I18N section

Virtuoso Open-Source Edition

Normalization of UNICODE3 accented characters for Virtuoso free-text indexing

Example

Related