Exploiting Virtuoso's Anytime Query Feature

What?

Virtuoso has a native mechanism for providing query results within a configurable time-frame.

Why?

This feature is called "Anytime Query", and it is vital to any practical DBMS-related offering (especially on the World Wide Web) that supports query access from an unpredictable number of clients performing a variety of queries.

How?

This functionality is available via the default HTML-based SPARQL Query Editor page that accompanies all Virtuoso instances. It is also available via a URI parameter (&timeout={time in milliseconds}) as part of Virtuoso's SPARQL Protocol implementation.
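As a minimal sketch of the URI-parameter route, the following Python snippet builds a SPARQL Protocol GET URL that carries the timeout parameter. The helper name build_anytime_url is hypothetical and used only for illustration; the endpoint, graph, and query are the ones used elsewhere on this page. The snippet only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_anytime_url(endpoint, query, timeout_ms, default_graph=None):
    """Return a GET URL for `endpoint` with the query and timeout (in ms).

    Hypothetical helper for illustration; it only assembles the URL.
    """
    params = {"query": query, "timeout": str(timeout_ms)}
    if default_graph is not None:
        params["default-graph-uri"] = default_graph
    # urlencode percent-encodes the values and joins them with '&'.
    return endpoint + "?" + urlencode(params)


url = build_anytime_url(
    "http://dbpedia.org/sparql",
    "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100",
    60000,
    default_graph="http://dbpedia.org",
)
print(url)
```

The resulting URL can be passed to cURL or any HTTP client. Building it with urlencode avoids hand-encoding characters such as ?, { and } in the query text.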

Sample Scenario

The following scenario demonstrates how to manage anytime SPARQL query execution using the timeout parameter at a Virtuoso SPARQL Endpoint.

Suppose a simple query:


SELECT DISTINCT ?p
WHERE
  {
    ?s ?p ?o
  }
LIMIT 100

Suppose execution of this query against a Virtuoso endpoint fails because it exceeds the execution time limit set in the virtuoso.ini configuration file. The query is essentially a table scan: it has to go through each quad and build a hash of the unique P values it sees. Once that is done, it returns the first 100 values of P from this hash.

Next, we look at the SQL translation of the query and its explain() output:


 SQL> SET sparql_translate ON;
 SQL> SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100;

 SPARQL_TO_SQL_TEXT
 VARCHAR
 _______________________________________________________________________________ 


  SELECT  __ro2sq ("s_2_0_rbc"."p") AS "p" FROM (SELECT DISTINCT TOP 
 100 __id2i ( "s_2_1-t0"."P" ) AS "p"
   FROM DB.DBA.RDF_QUAD AS "s_2_1-t0"
 OPTION (QUIETCAST)) AS "s_2_0_rbc"

 1 Rows. -- 1 msec.
 SQL> set sparql_translate off;

 SQL> set explain on;
 SQL> SELECT  __ro2sq ("s_2_0_rbc"."p") AS "p" FROM (SELECT DISTINCT 
 TOP 100 __id2i ( "s_2_1-t0"."P" ) AS "p"   FROM DB.DBA.RDF_QUAD AS 
 "s_2_1-t0" OPTION (QUIETCAST)) AS "s_2_0_rbc" ;
 REPORT
 VARCHAR
 _______________________________________________________________________________ 


 {
   local save: ($22 "set_no", $23 "set_no_save")
 Subquery 20
 {
 from DB.DBA.RDF_QUAD by RDF_QUAD_POGS    8.9e+07 rows
 Key RDF_QUAD_POGS  ASC ($25 "s_2_1-t0.P")



 After code:
       0: $28 "__id2i" := Call __id2i ($25 "s_2_1-t0.P")
       5: BReturn 0
 Distinct (HASH) ($25 "s_2_1-t0.P")

 After code:
       0: $21 "p" :=  := artm $28 "__id2i"
       4: BReturn 0
 Subquery Select(TOP  100 ) ($21 "p", <$27 "<DB.DBA.RDF_QUAD s_2_1-t0>" 
 spec 5>)
 }


 After code:
       0: $44 "p" := Call __ro2sq ($21 "p")
       5: BReturn 0
 Select ($44 "p")
 }

 26 Rows. -- 1 msec.

 SQL> set explain off;

As the plan shows, the query uses the RDF_QUAD_POGS index, which offers the best available basis for constructing the query plan. It would be marginally faster if there were an index that started with GP, but the current 2+3 index scheme does not include one.

So the query is almost a full table scan, which might not complete on most of our systems because of the limits set in the INI file for Web-accessible instances.

Set the Execution timeout on the /sparql form to enable the Anytime paradigm:

  1. Go to a Virtuoso SPARQL Endpoint, e.g., the DBpedia SPARQL Endpoint at http://dbpedia.org/sparql.



  2. Enter the query:

    SELECT distinct ?p WHERE { ?s ?p ?o } LIMIT 100

  3. For "Execution timeout", enter 60000.



  4. Click "Run Query".
  5. You will be redirected to the following URL (containing the parameter timeout=60000), which presents the results found:

    http://dbpedia.org/sparql/?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fp%0D%0AWHERE%0D%0A{%0D%0A+%3Fs+%3Fp+%3Fo%0D%0A}%0D%0ALIMIT+100%0D%0A&format=text%2Fhtml&timeout=60000&debug=on





Conclusions: With a timeout value of 10000 (10 sec), you get only one unique ?p value back. A timeout value of 60000 (60 sec) still yields only a few different values for ?p, which is logical because the index is sorted by predicate (the P slot in the RDF triple pattern). Once the scan finds the first unique ?p, it has to skip all the triples that have the same ?p value before it reaches the next one.
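The effect described above can be illustrated with a toy model. This is not Virtuoso's implementation: the function name distinct_with_budget, the sample index, and the use of a count of examined entries in place of wall-clock time are all assumptions made to keep the sketch deterministic. It scans entries grouped by predicate, keeps a hash set of the values seen, and returns whatever it has once the budget is spent.

```python
def distinct_with_budget(index_entries, limit, budget):
    """Collect up to `limit` distinct predicates, examining at most
    `budget` entries. Returns (values, complete).

    Toy model only: `budget` stands in for the query timeout.
    """
    seen = set()    # plays the role of the "Distinct (HASH)" step
    values = []     # distinct predicates in the order they were found
    examined = 0
    for predicate in index_entries:
        if examined >= budget:
            return values, False      # budget spent: partial result
        examined += 1
        if predicate not in seen:
            seen.add(predicate)
            values.append(predicate)
            if len(values) == limit:
                return values, True   # LIMIT satisfied
    return values, True               # whole index scanned


# Hypothetical index contents, grouped by predicate as in a P-first index.
index = ["rdf:type"] * 1000 + ["rdfs:label"] * 500 + ["owl:sameAs"] * 10

small, small_done = distinct_with_budget(index, limit=100, budget=100)
large, large_done = distinct_with_budget(index, limit=100, budget=1200)
print(small, small_done)
print(large, large_done)
```

With the small budget the scan never leaves the first run of identical predicates, so only one value comes back. A larger budget reaches the second run, mirroring the difference between the 10-second and 60-second timeouts.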

Using cURL

cURL Variant without the timeout parameter


$ curl -F "query=SELECT DISTINCT ?p WHERE  { ?s ?p ?o } LIMIT 100" http://dbpedia.org/sparql
Virtuoso S1T00 Error SR171: Transaction timed out

SPARQL query:
define sql:big-data-const 0 SELECT DISTINCT ?p WHERE  { ?s ?p ?o} LIMIT 100

cURL Variant using the timeout parameter


$ curl "http://dbpedia.org/sparql/?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+DISTINCT+%3Fp%0D%0AWHERE%0D%0A%7B%0D%0A+%3Fs+%3Fp+%3Fo%0D%0A%7D%0D%0ALIMIT+100%0D%0A&format=text%2Fhtml&timeout=60000&debug=on"
<table class="sparql" border="1">
  <tr>
    <th>p</th>
  </tr>
  <tr>
    <td>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2002/07/owl#equivalentClass</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2002/07/owl#sameAs</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2002/07/owl#equivalentProperty</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2000/01/rdf-schema#subClassOf</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2004/02/skos/core#broader</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2000/01/rdf-schema#comment</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2000/01/rdf-schema#label</td>
  </tr>
  <tr>
    <td>http://xmlns.com/foaf/0.1/name</td>
  </tr>
  <tr>
    <td>http://xmlns.com/foaf/0.1/nick</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2004/02/skos/core#prefLabel</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2003/01/geo/wgs84_pos#lat</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2003/01/geo/wgs84_pos#long</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2000/01/rdf-schema#domain</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2000/01/rdf-schema#range</td>
  </tr>
  <tr>
    <td>http://www.w3.org/2002/07/owl#versionInfo</td>
  </tr>
  <tr>
    <td>http://dbpedia.org/ontology/purpose</td>
  </tr>
  <tr>
    <td>http://dbpedia.org/ontology/supplementalDraftRound</td>
  </tr>
  <tr>
    <td>http://dbpedia.org/ontology/podiums</td>
  </tr>
  <tr>
    <td>http://dbpedia.org/ontology/buildingStartDate</td>
  </tr>
</table>

cURL Variant using the timeout parameter with additional response header

In this example, the response header contains X-SPARQL-MaxRows: 10000, which indicates the maximum result set size of the instance, as configured via an entry in the [SPARQL] section of the INI file.

For Anytime Query, this means that the timeout condition that triggers a response with additional headers (due to partial results) depends on the combination of MaxRows and QueryTimeout; i.e., a timeout condition accompanied by additional response headers occurs only when Virtuoso cannot produce a result set that matches MaxRows within the QueryTimeout setting.


$ curl -I "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fhtml&timeout=2000&debug=on"
HTTP/1.1 200 OK
Date: Fri, 12 Dec 2014 20:51:29 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 3804866
Connection: keep-alive
Vary: Accept-Encoding
Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu  VDB
X-SPARQL-default-graph: http://dbpedia.org
X-SPARQL-MaxRows: 10000
Expires: Fri, 19 Dec 2014 20:51:29 GMT
Cache-Control: max-age=604800
Accept-Ranges: bytes
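A client can inspect the response headers to decide whether a result is partial. The sketch below is a hedged illustration: this page does not name the additional headers that accompany a timeout, so the header names X-SQL-State and X-SQL-Message are assumptions that should be checked against your own instance, and describe_result_headers is a hypothetical helper. Only X-SPARQL-MaxRows is taken from the example above.

```python
def describe_result_headers(headers):
    """Summarize the anytime-related headers of a SPARQL response.

    Assumption: a partial (timed-out) result carries an X-SQL-State
    header. That name is not given on this page; verify it locally.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    max_rows = lowered.get("x-sparql-maxrows")
    sql_state = lowered.get("x-sql-state")
    return {
        "max_rows": int(max_rows) if max_rows is not None else None,
        "partial": sql_state is not None,
        "sql_state": sql_state,
        "sql_message": lowered.get("x-sql-message"),
    }


# Headers taken from the cURL example above.
summary = describe_result_headers({
    "Content-Type": "text/html; charset=UTF-8",
    "X-SPARQL-default-graph": "http://dbpedia.org",
    "X-SPARQL-MaxRows": "10000",
})
print(summary)
```

For the response shown above, the summary reports a MaxRows value of 10000 and no timeout indicator. This matches the explanation that additional headers appear only when MaxRows rows cannot be produced within the timeout.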
