Sponger Cartridges (RDF Mappers or RDFizers) for Virtuoso

This article explains how to develop and test a custom Sponger Cartridge.

Concept

Sponger Cartridges provide a modular approach to RDF-oriented entity extraction and ontology mapping as part of a Linked Data production pipeline. Typical sources include (X)HTML pages, images, Office documents, and PDFs, among others.

Cartridges expose their functionality to service consumers via the Virtuoso Sponger ("Sponger"), a middleware layer that extracts and delivers RDF to other Virtuoso components such as the Web Crawler and the SPARQL Query Processor. In addition, it is directly exposed as a REST-style Web Service for exploitation by external applications and services.

A Cartridge is comprised of an initialization PL procedure (hook) and an entity extractor and mapper. The entity extractor and mapping component can be developed using PL, C, or any external language supported by Virtuoso via the Virtuoso Server Extensions APIs.

Once the cartridge has been developed, it is plugged into the Virtuoso Sponger by adding a record to the DB.DBA.SYS_RDF_MAPPERS table.

How Cartridges Work

SPARQL Query Processing

When a SPARQL query is dispatched to Virtuoso, it invokes the Sponger during the act of URI dereferencing; i.e., it actually crawls the Web, locates a resource via it's URI/IRI (using a variety of RDF discovery heuristics), and then processes the data that the URI exposes. If RDF is discovered, the cartridges play no role. On the other hand, if RDF isn't discovered the Sponger will look in the DB.DBA.SYS_RDF_MAPPERS table (in RM_ID order), and for every matching URI or MIME type pattern (depending on RM_TYPE column value), it will invoke the associated cartridge via the hook procedure. If the hook returns zero, the next cartridge will be tried. If the result is negative, the process stops, informing the SPARQL engine that nothing was retrieved. If the result is positive, the process stops, this time informing the SPARQL engine that RDF data was successfully retrieved.

Sponger Cartridge RDF Extractor PL hook Requirements

Every PL function used to plug a cartridge into the SPARQL engine must have the following parameter signature:

IN graph_iri VARCHAR the IRI of the local storage graph
IN new_origin_uri VARCHAR the URI of the target information resource
IN destination VARCHAR the IRI of the target graph
INOUT content ANY the content of the information resource retrieved for dispatch to the Sponger
INOUT async_queue ANY an asynchronous queue; can be used for background processing (if required)
INOUT ping_service ANY the value in the [SPARQL] section of a Virtuoso instance (i.e., PingService INI parameter) which enables RDF triple propagation and notification to pinger services such as http://pingthesemanticweb.com/
INOUT api_key ANY a plain text ID, either a single key value or serialized vector of keys (basically the value of RM_KEY column of the DB.DBA.SYS_RDF_MAPPERS table), which is used for handling of API keys for 3rd party Web Services

Note: the names of the parameters are not important, but their order and presence are vital.

Implementation

In the example script we implement a basic cartridge which maps a text/plain MIME type to an imaginary ontology, which extends the class Document from FOAF with properties 'txt:UniqueWords?' and 'txt:Chars', where we specity the prefix 'txt:' as 'urn:txt:v0.0:'.


USE DB;

CREATE PROCEDURE DB.DBA.RDF_LOAD_TXT_META
  (
       IN  graph_iri       VARCHAR,
       IN  new_origin_uri  VARCHAR,
       IN  dest            VARCHAR,
    INOUT  ret_body        ANY,
    INOUT  aq              ANY,
    INOUT  ps              ANY,
    INOUT  ser_key         ANY
  )
  {
    DECLARE  words, 
             chars   INT;
    DECLARE  vtb, 
             arr, 
             subj, 
             ses, 
             str     ANY;
    DECLARE  ses     ANY;
    -- if any error, we just say nothing can be done
    DECLARE EXIT HANDLER FOR SQLSTATE '*'
      {
        return 0;
      };
    subj  := COALESCE (dest, new_origin_uri);
    vtb   := vt_batch ();
    chars := LENGTH (ret_body);

    -- using the text index procedures, we get a list of words
    vt_batch_feed (vtb, ret_body, 1);
    arr := vt_batch_strings_array (vtb);
  
    -- the list has 'word' and positions array, so we must divide by 2
    words := LENGTH (arr) / 2;
    ses := string_output ();

    -- we compose a N3 literal
    http (sprintf ('<%s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .\n', subj), ses);
    http (sprintf ('<%s> <urn:txt:v0.0:UniqueWords> "%d" .\n', subj, words), ses);
    http (sprintf ('<%s> <urn:txt:v0.0:Chars> "%d" .\n', subj, chars), ses);
    str := string_output_string (ses);

    -- we push the N3 text into the local store
    DB.DBA.TTLP (str, new_origin_uri, subj);
    RETURN 1;
  }
;

DELETE FROM DB.DBA.SYS_RDF_MAPPERS 
   WHERE RM_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';

INSERT SOFT DB.DBA.SYS_RDF_MAPPERS 
  ( RM_PATTERN, 
    RM_TYPE, 
    RM_HOOK, 
    RM_KEY, 
    RM_DESCRIPTION
  )
  VALUES 
    ( '(text/plain)', 
      'MIME', 
      'DB.DBA.RDF_LOAD_TXT_META', 
      NULL, 
      'Text Files (demo)'
    )
;

-- here we set order to some large number so don't break existing mappers
UPDATE DB.DBA.SYS_RDF_MAPPERS 
  SET   RM_ID = 2000 
  WHERE RM_HOOK = 'DB.DBA.RDF_LOAD_TXT_META';

To test the cartridge, we just use the /sparql endpoint with option 'Retrieve remote RDF data for all missing source graphs' to execute:


SELECT * 
  FROM <URL-of-a-txt-file> 
  WHERE { ?s ?p ?o }

Note that the SPARQL_SPONGE role must be granted to any user account (by default, the "SPARQL" user) which will be issuing SPARQL queries that invoke Sponger cartridges, to enable persistence to local storage.

If the above is set correctly, then you can just hit this link.

More complex examples can be found in the cartridges package implementation.

Authentication in the Sponger

To enable user-defined authentication, there are additional parameters for the /proxy and /sparql endpoints, which should be sent by RDF browser or iSPARQL users --

  • for /proxy endpoint:

    'login=<account name>'

  • for /sparql endpoint:

    get-login=<account name>

References

CategoryRDF CategoryVirtuoso