Virtuoso Full Text
Text Indexing
Like many relational databases, Virtuoso allows creating a full text index on a character or character LOB column. Virtuoso offers a CONTAINS predicate with a full text matching language compatible with Microsoft SQL Server, with AND, OR, AND NOT, phrases, wildcards etc.
A text index can be maintained in strict transactional synchrony with the indexed table or with a batch delay. Text indices are maintained in the same database by the same engine and require no extra software or additional processes or separate maintenance or backing up.
Scores and Search Summaries
The contains predicate introduces a virtual column reflecting the relevance of the search criteria on the matched document. The score is a composite of matched word frequency and proximity in the matched document.
Given a match and a search expression, the search_excerpt SQL function composes a search-hit summary similar to the result excerpts shown by Google and others.
Tracking Changes
Virtuoso has a special full text trigger capability that will flag changes made to text-indexed columns. When a column changes, the data indexes a base of stored queries. The queries, which are potential hits then, get evaluated against the incoming data and results are stored in a hits table. In this way, personalized notifications can be implemented without having to periodically re-evaluate all the stored queries of the users.
This makes a dramatic difference in the scalability of any personalized change tracking. This can be applied to implementing services like pubsub, various types of news alerts and so forth.
Customizing Indexing, Off band Data
In almost all applications, full text conditions coexist with other search conditions. Dramatic performance increase is achieved if all or most conditions can be evaluated within the text index.
To support this, Virtuoso provides indexing and unindexing hooks that can add extra words or off band data to the indexed text as well as perform any format conversions such as stripping HTML tags off the text before indexing. For example, if a table holds text indexed emails, it makes sense to add the folder and the mailbox identifier strings as extra words. By adding these words to the search conditions, the search can be scoped to mailbox and folder without having to consult the emails table itself. Likewise, if the messages received datetime is stored as off band data, it is possible to filter on time without having to look up the row of the emails table.
The text index references the main table with a numeric row id. Sometimes this id can have application level semantic. If ids are allocated in ascending order, searching in descending order of id gives the most recent hits first. Thus, the natural order of the index can be exploited for application purposes without having to sort. Ranking on hit score always requires sorting since the score is a function of the search pattern and the hit and cannot be pre-calculated.
CategoryWebSite CategoryVirtuoso CategoryOpenSource