%META:TOPICPARENT{name="VirtTipsAndTricksGuide"}%
---+ Normalization of UNICODE3 accented characters for Virtuoso free-text indexing

Normalization of UNICODE3 accented characters in a free-text index is controlled by the XAnyNormalization configuration parameter in the [I18N] section of the Virtuoso configuration file, virtuoso.ini. This parameter determines whether accented UNICODE characters are converted to their non-accented base variants when a free-text index is created and when a free-text query string is parsed.

The parameter's value is an integer bitmask, of which only 2 bits are currently in use:

| *XAnyNormalization value* | *Bit equivalent* | *Description* |
| 0 | 00 | Default. Nothing is normalized, so "Jose" and "José" are two distinct words. |
| 1 | 01 | ToBeDone |
| 2 | 10 | Any "combining character sequence" (a combination of a base character and one or more combining characters) is converted to its (smallest known) base. For example, "é" loses its accent and becomes a plain ASCII "e". |
| 3 | 11 | Combines 1 and 2, so both conversions are applied. Any pair of base character and combining character loses the second character, and characters with accents lose their accents. |

The corresponding fragment of virtuoso.ini would look like this:

<verbatim>
...
[I18N]
XAnyNormalization = 3
...
</verbatim>

   * XAnyNormalization = 3 is recommended for most scenarios requiring such normalization. In some rare cases, XAnyNormalization = 1 may be more appropriate.
   * The parameter should generally be set before creating a database, and must be set identically for all instances in a cluster configuration. If it is changed on an existing database, rebuild all free-text indexes that may contain non-ASCII data by running the following procedure from isql: VT_INDEX_DB_DBA_RDF_OBJ(0)
   * On a typical system, the parameter affects all text columns, XML columns, RDF literals, and queries.
     (Strictly speaking, it only affects items that use the default "x-any" language, or a language derived from x-any, such as "en" or "en-US". Unless you have written new C plug-ins for custom languages, you need not look that deep.)
   * Note: We have had requests for a database function that normalizes characters in strings as the free-text engine does with XAnyNormalization=3. This function will be provided as a separate patch/update, and will depend on XAnyNormalization.

---++ Example

With XAnyNormalization=3, one can get the following:

<verbatim>
SQL> SPARQL INSERT IN { <s> <sp> "Índio João Macapá Júnior Tôrres Luís Araújo José" ; <ru> "Он добавил картошки, посолил и поставил аквариум на огонь" } ;
INSERT INTO , 2 (or less) triples -- done

SQL> DB.DBA.RDF_OBJ_FT_RULE_ADD (NULL, NULL, 'InternationalNSMs.wb');
Done. -- 0 msec.

SQL> VT_INDEX_DB_DBA_RDF_OBJ(0);
Done. -- 26 msec.

SQL> SPARQL SELECT * FROM WHERE { ?s ?p ?o } ORDER BY ASC (str(?o)) ;
s   sp   Índio João Macapá Júnior Tôrres Luís Araújo José
s   ru   Он добавил картошки, посолил и поставил аквариум на огонь
2 Rows. -- 2 msec.

SQL> SPARQL SELECT * FROM WHERE { ?s ?p ?o . ?o bif:contains "'Índio João Macapá Júnior Tôrres Luís Araújo José'" } ;
s   sp   Índio João Macapá Júnior Tôrres Luís Araújo José
1 Rows. -- 2 msec.

SQL> SPARQL SELECT * FROM WHERE { ?s ?p ?o . ?o bif:contains "'Indio Joao Macapa Junior Torres Luis Araujo Jose'" } ;
s   sp   Índio João Macapá Júnior Tôrres Luís Araújo José
1 Rows. -- 1 msec.

SQL> SPARQL SELECT * FROM WHERE { ?s ?p ?o . ?o bif:contains "'поставил аквариум на огонь'" } ;
s   ru   Он добавил картошки, посолил и поставил аквариум на огонь
</verbatim>

---++ Related

   * [[http://docs.openlinksw.com/virtuoso/databaseadmsrv.html#ini_I18N][Virtuoso ini I18N section]]
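
---++ Illustration: accent folding outside Virtuoso

The effect described for XAnyNormalization = 3 (accented characters and base-plus-combining pairs both reduced to the plain base character) can be approximated outside the database. The Python sketch below is only an illustration of that idea and is not Virtuoso's implementation: the function name =strip_accents= is hypothetical, and using Unicode NFD decomposition is an assumption that may differ from Virtuoso's own normalization tables in edge cases.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Approximate accent folding: decompose, then drop combining marks.

    Hypothetical helper for illustration only; Virtuoso's free-text
    engine may treat some characters differently.
    """
    # NFD splits a precomposed character such as "é" into "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Characters with a non-zero combining class are the accents to drop.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


precomposed = "Jos\u00e9"   # "José" written with a single precomposed "é"
combining = "Jose\u0301"    # "José" written as "e" + combining acute accent

print(strip_accents(precomposed))  # Jose
print(strip_accents(combining))    # Jose
```

Both spellings fold to the same ASCII word, which is why a query for "Jose" can match text indexed as "José" once the index and the query string are normalized the same way.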