%META:TOPICPARENT{name="VirtTipsAndTricksGuide"}%
---+ Normalization of UNICODE3 accented characters for Virtuoso free-text indexing
Normalization of UNICODE3 accented characters in a free-text index can be controlled by setting
the XAnyNormalization configuration parameter in the [I18N] section of the Virtuoso configuration
file, virtuoso.ini.
This parameter controls whether accented UNICODE characters should be converted to their non-accented
base variants when creating a free-text index or when parsing a free-text query string. The parameter's
value is a bitmask integer, currently with only 2 bits in use:
| *XAnyNormalization value* | *bit equivalent* | *Description* |
| 0 | 00 | Default. Nothing is normalized, so "Jose" and "José" are two distinct words. |
| 1 | 01 | ToBeDone |
| 2 | 10 | Any "combining character sequence" (a combination of a base character and one or more combining characters) is converted to its (smallest known) base. For example, "é" will lose its accent and become a plain ASCII "e". |
| 3 | 11 | This combines 1 and 2, and so causes both conversions. Any pair of base character and combining character loses the second character, and characters with accents lose their accents. |
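
The effect described for value 3 can be illustrated outside Virtuoso. The following is a minimal Python sketch, assuming that Unicode NFD decomposition followed by removal of non-spacing marks is a fair approximation of "losing the accent". It is not Virtuoso's implementation, and the function name =strip_combining_marks= is hypothetical.

```python
import unicodedata


def strip_combining_marks(text: str) -> str:
    """Decompose each character (NFD), then drop non-spacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The same word written two ways.
precomposed = "Jos\u00e9"   # accented letter as the single code point U+00E9
combining = "Jose\u0301"    # plain "e" followed by U+0301 COMBINING ACUTE ACCENT

print(strip_combining_marks(precomposed))  # Jose
print(strip_combining_marks(combining))    # Jose
```

Both spellings reduce to the same plain word, which is the property that lets an unaccented query match accented text.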
So the fragment of virtuoso.ini would look like:
...
[I18N]
XAnyNormalization = 3
...
* XAnyNormalization = 3 is recommended for most scenarios requiring such normalization. In some
rare cases, XAnyNormalization = 1 may be more appropriate.
* The parameter should generally be set before creating a database, and must be set identically
for all instances in a cluster configuration. If it is changed on an existing database, rebuild
all free-text indexes that may contain non-ASCII data by running the following procedure from isql:
VT_INDEX_DB_DBA_RDF_OBJ(0)
* On a typical system, the parameter affects all text columns, XML columns, RDF literals, and
queries. (Strictly speaking, it only affects items that use the default "x-any" language, or a
language derived from x-any, such as "en" or "en-US". Unless you have written C plug-ins for
custom languages, you need not look this deep.)
* Note: We have had requests for a database function that normalizes characters in
strings, as the free-text engine does with XAnyNormalization=3. This function will be
provided as a separate patch/update, and will depend on XAnyNormalization.
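
Because the parameter must be identical on every instance of a cluster, it can help to compare the configuration files before starting the servers. The following is a rough Python sketch, assuming each virtuoso.ini can be read by the standard =configparser= module and that a missing setting means the default of 0. The helper names =read_xany_normalization= and =settings_match= are hypothetical, and the sample file contents are made up for illustration.

```python
import configparser


def read_xany_normalization(ini_text: str) -> int:
    """Return XAnyNormalization from the [I18N] section, or 0 when it is absent."""
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, allow_no_value=True
    )
    parser.read_string(ini_text)
    return int(parser.get("I18N", "XAnyNormalization", fallback="0"))


def settings_match(ini_texts) -> bool:
    """True when every configuration yields the same XAnyNormalization value."""
    return len({read_xany_normalization(text) for text in ini_texts}) == 1


# Illustrative contents for two cluster instances.
node_a = "[I18N]\nXAnyNormalization = 3\n"
node_b = "[I18N]\nXAnyNormalization = 1\n"

print(settings_match([node_a, node_a]))  # True
print(settings_match([node_a, node_b]))  # False
```

In practice the texts would be read from each instance's own virtuoso.ini; a mismatch means the instances would index and parse accented text differently.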
---++ Example
With XAnyNormalization=3, one can get the following:
SQL> SPARQL
INSERT
IN
{
"Índio João Macapá Júnior Tôrres Luís Araújo José" ;
"Он добавил картошки, посолил и поставил аквариум на огонь"
}
;
INSERT INTO , 2 (or less) triples -- done
SQL> DB.DBA.RDF_OBJ_FT_RULE_ADD (NULL, NULL, 'InternationalNSMs.wb');
Done. -- 0 msec.
SQL> VT_INDEX_DB_DBA_RDF_OBJ(0);
Done. -- 26 msec.
SQL> SPARQL
SELECT *
FROM
WHERE
{
?s ?p ?o
}
ORDER BY ASC (str(?o))
;
s sp Índio João Macapá Júnior Tôrres Luís Araújo José
s ru Он добавил картошки, посолил и поставил аквариум на огонь
2 Rows. -- 2 msec.
SQL> SPARQL
SELECT *
FROM
WHERE
{
?s ?p ?o .
?o bif:contains "'Índio João Macapá Júnior Tôrres Luís Araújo José'"
}
;
s sp Índio João Macapá Júnior Tôrres Luís Araújo José
1 Rows. -- 2 msec.
SQL> SPARQL
SELECT *
FROM
WHERE
{
?s ?p ?o .
?o bif:contains "'Indio Joao Macapa Junior Torres Luis Araujo Jose'"
}
;
s sp Índio João Macapá Júnior Tôrres Luís Araújo José
1 Rows. -- 1 msec.
SQL> SPARQL
SELECT *
FROM
WHERE
{
?s ?p ?o .
?o bif:contains "'поставил аквариум на огонь'"
}
;
s ru Он добавил картошки, посолил и поставил аквариум на огонь
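
The session above works because the same normalization is applied both when the text is indexed and when the query string is parsed. A toy Python sketch of that idea follows, assuming whitespace tokenization and NFD-based accent stripping. It does not reflect how =bif:contains= is implemented, and the names =normalize_word= and =contains_phrase= are hypothetical.

```python
import unicodedata


def normalize_word(word: str) -> str:
    """Strip non-spacing marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


def contains_phrase(text: str, phrase: str) -> bool:
    """Check for the phrase as consecutive words, normalizing both sides."""
    words = [normalize_word(w) for w in text.split()]
    wanted = [normalize_word(w) for w in phrase.split()]
    n = len(wanted)
    if n == 0:
        return False
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


# The Portuguese sample phrase, written with escapes to keep the source ASCII-only.
stored = "\u00cdndio Jo\u00e3o Macap\u00e1 J\u00fanior T\u00f4rres Lu\u00eds Ara\u00fajo Jos\u00e9"

print(contains_phrase(stored, "Indio Joao Macapa Junior Torres Luis Araujo Jose"))  # True
print(contains_phrase(stored, "Araujo Jose"))  # True
print(contains_phrase(stored, "Jose Araujo"))  # False
```

If only one side were normalized, the accented and unaccented forms would no longer compare equal, which is why existing indexes need rebuilding after the parameter changes.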
---++ Related
* [[http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_I18N][Virtuoso ini I18N section]]