Normalization of UNICODE3 accented characters for Virtuoso free-text indexing
Normalization of UNICODE3 accented characters in a free-text index can be controlled by setting the XAnyNormalization configuration parameter in the [I18N] section of the Virtuoso configuration file, virtuoso.ini.
This parameter controls whether accented UNICODE characters should be converted to their non-accented base variants when creating a free-text index or when parsing a free-text query string.
The parameter's value is a bitmask integer, currently with only 2 bits in use:
XAnyNormalization value | Bit equivalent | Description |
---|---|---|
0 | 00 | Default. Nothing is normalized, so "Jose" and "José" are two distinct words. |
1 | 01 | ToBeDone |
2 | 10 | Any "combining character sequence" (a combination of a base character and one or more combining characters) is converted to its (smallest known) base. For example, "é" will lose its accent and become a plain ASCII "e". |
3 | 11 | This combines 1 and 2, and so causes both conversions. Any pair of base character and combining character loses the second character, and characters with accents lose their accents. |
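To make the table more concrete, here is a minimal Python sketch of the net effect described for value 3. It is not Virtuoso's implementation: it uses Python's standard `unicodedata` module as a stand-in, and the helper names `bit_equivalent` and `fold_accents` are hypothetical. It only illustrates that, with both bits set, an accented word ends up as the same plain base word whether the accent was stored as a single precomposed character or as a base character followed by a combining character.

```python
import unicodedata


def bit_equivalent(value: int) -> str:
    """Render an XAnyNormalization value as the two-bit string used in the table."""
    return format(value & 0b11, "02b")


def fold_accents(text: str) -> str:
    """Approximate the net effect of value 3 (an illustration, not Virtuoso's code)."""
    # NFD splits a precomposed character such as U+00E9 into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


precomposed = "Jos\u00e9"   # accented "e" stored as one code point
decomposed = "Jose\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(bit_equivalent(3))
print(fold_accents(precomposed) == fold_accents(decomposed) == "Jose")
```

With the default value 0, by contrast, the two spellings above would be indexed as different words from each other and from "Jose".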
To enable both conversions, the relevant fragment of virtuoso.ini would look like:

    ...
    [I18N]
    XAnyNormalization = 3
    ...

- XAnyNormalization = 3 is recommended for most scenarios requiring such normalization. In some rare cases, XAnyNormalization = 1 may be more appropriate.
- The parameter should generally be set before creating a database, and must be set identically for all instances in a cluster configuration. If it is changed on an existing database, you should rebuild all free-text indexes that may contain non-ASCII data by running the following procedure from isql: VT_INDEX_DB_DBA_RDF_OBJ(0)
- On a typical system, the parameter affects all text columns, XML columns, RDF literals, and queries. (Strictly speaking, it only affects items that use the default "x-any" language, or a language derived from x-any such as "en" or "en-US". Unless you are writing new C plug-ins for custom languages, you need not look so deep.)
- Note: We have had requests for a database function that normalizes characters in strings, as the free-text engine does with XAnyNormalization=3. This function will be provided as a separate patch/update, and will depend on XAnyNormalization.
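The requirement that every cluster instance use the same setting can be checked before startup. The following is a rough Python sketch, not a Virtuoso tool: the function names, the node names, and the idea of passing each instance's virtuoso.ini content as a string are all assumptions made for illustration. It treats a missing parameter as 0, matching the default listed in the table above.

```python
import configparser
from typing import Dict


def read_xany_normalization(ini_text: str) -> int:
    """Return XAnyNormalization from the [I18N] section, or 0 if it is absent."""
    # strict=False tolerates repeated sections/keys; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    return parser.getint("I18N", "XAnyNormalization", fallback=0)


def mismatched_instances(ini_by_node: Dict[str, str]) -> Dict[str, int]:
    """Return every node's value if the nodes disagree, else an empty dict."""
    values = {node: read_xany_normalization(text) for node, text in ini_by_node.items()}
    if len(set(values.values())) <= 1:
        return {}
    return values


# Hypothetical two-node cluster: node2 was left at the default.
cluster = {
    "node1": "[I18N]\nXAnyNormalization = 3\n",
    "node2": "[Parameters]\nServerPort = 1111\n",
}

for node, value in mismatched_instances(cluster).items():
    print(f"{node}: XAnyNormalization = {value}")
```

An empty result means all instances agree; any output lists each node with the value it would use.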
Example
With XAnyNormalization=3, one can get the following:

    SQL> SPARQL INSERT IN <http://InternationalNSMs/> { <s> <sp> "?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?" ; <ru> "?? ??????? ????????, ??????? ? ???????? ???????? ?? ?????" } ;
    INSERT INTO <http://InternationalNSMs/>, 2 (or less) triples -- done

    SQL> DB.DBA.RDF_OBJ_FT_RULE_ADD (NULL, NULL, 'InternationalNSMs.wb');
    Done. -- 0 msec.

    SQL> VT_INDEX_DB_DBA_RDF_OBJ(0);
    Done. -- 26 msec.

    SQL> SPARQL SELECT * FROM <http://InternationalNSMs/> WHERE { ?s ?p ?o } ORDER BY ASC (str(?o)) ;
    s sp ?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?
    s ru ?? ??????? ????????, ??????? ? ???????? ???????? ?? ?????
    2 Rows. -- 2 msec.

    SQL> SPARQL SELECT * FROM <http://InternationalNSMs/> WHERE { ?s ?p ?o . ?o bif:contains "'?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?'" } ;
    s sp ?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?
    1 Rows. -- 2 msec.

    SQL> SPARQL SELECT * FROM <http://InternationalNSMs/> WHERE { ?s ?p ?o . ?o bif:contains "'Indio Joao Macapa Junior Torres Luis Araujo Jose'" } ;
    s sp ?ndio Jo?o Macap? J?nior T?rres Lu?s Ara?jo Jos?
    1 Rows. -- 1 msec.

    SQL> SPARQL SELECT * FROM <http://InternationalNSMs/> WHERE { ?s ?p ?o . ?o bif:contains "'???????? ???????? ?? ?????'" } ;
    s ru ?? ??????? ????????, ??????? ? ???????? ???????? ?? ?????