VirtTipsAndTricksAnytimeSPARQLQuery

VirtTipsAndTricksAnytimeSPARQLQueryExploiting Virtuoso's Anytime Query FeatureExploiting Virtuoso's Anytime Query Feature What? Virtuoso has a native mechanism for providing query results within a configurable time-frame.Why? This feature is called "Anytime Query" and its vital to any practical DBMS related offering (especially on the World Wide Web) that supports query access from an unpredictable number of clients performing a variety of queries.How? This functionality is available via the default HTML based SPARQL Query Editor page that accompanies all Virtuoso instances. It is also available via a URI parameter (&timeout={time in milliseconds}) as part of Virtuoso's SPARQL Protocol implementation.Sample Scenario The following scenario demonstrates how to manage anytime SPARQL query execution using the timeout parameter at a Virtuoso SPARQL Endpoint. Suppose a simple query: SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100 Suppose the query execution fails against Virtuoso endpoint due to running out of execution time as set in virtuoso.ini configuration file. This is basically a table scan, since it has to go through each quad and create a HASH of unique P it sees. After that is done, it will return the first 100 values of P from this hash. Next we will use SQL translation and explain() of the query: SQL> SET sparql_translate ON; SQL> SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100; SPARQL_TO_SQL_TEXT VARCHAR _______________________________________________________________________________ SELECT __ro2sq ("s_2_0_rbc"."p") AS "p" FROM (SELECT DISTINCT TOP 100 __id2i ( "s_2_1-t0"."P" ) AS "p" FROM DB.DBA.RDF_QUAD AS "s_2_1-t0" OPTION (QUIETCAST)) AS "s_2_0_rbc" 1 Rows. -- 1 msec. SQL> set sparql_translate off; SQL> set explain on; SQL> SELECT __ro2sq ("s_2_0_rbc"."p") AS "p" FROM (SELECT DISTINCT TOP 100 __id2i ( "s_2_1-t0"."P" ) AS "p" FROM DB.DBA.RDF_QUAD AS "s_2_1-t0" OPTION (QUIETCAST)) AS "s_2_0_rbc" SQL> SELECT __ro2sq ("s_2_0_rbc"."p") AS "p" FROM (SELECT DISTINCT TOP 100 __id2i ( "s_2_1-t0"."P" ) AS "p" FROM DB.DBA.RDF_QUAD AS "s_2_1-t0" OPTION (QUIETCAST)) AS "s_2_0_rbc" ; REPORT VARCHAR _______________________________________________________________________________ { local save: ($22 "set_no", $23 "set_no_save") Subquery 20 { from DB.DBA.RDF_QUAD by RDF_QUAD_POGS 8.9e+07 rows Key RDF_QUAD_POGS ASC ($25 "s_2_1-t0.P") After code: 0: $28 "__id2i" := Call __id2i ($25 "s_2_1-t0.P") 5: BReturn 0 Distinct (HASH) ($25 "s_2_1-t0.P") After code: 0: $21 "p" := := artm $28 "__id2i" 4: BReturn 0 Subquery Select(TOP 100 ) ($21 "p", <$27 "<DB.DBA.RDF_QUAD s_2_1-t0>" spec 5>) } After code: 0: $44 "p" := Call __ro2sq ($21 "p") 5: BReturn 0 Select ($44 "p") } 26 Rows. -- 1 msec. SQL> set explain off; As we can see, it is going to use the RDF_QUAD_POGS index which offers the best query plan construction basis. It would be marginally faster if there was an index that started with GP, but the current 2+3 index scheme does not have such. So it is almost a full table scan, which might not complete on most of our systems due to limits set in the INI file for Web accessible instances Set the Execution timeout on the /sparql form to enable the Anytime paradigm: Go to a Virtuoso SPARQL Endpoint, e.g., the DBPpedia SPARQL Endpoint:

Enter the query: SELECT distinct ?p WHERE { ?s ?p ?o } LIMIT 100 For "Execution timeout", enter 60000.

Click "Run Query". You will be redirected to the following URL (containing parameter timeout=60000), that presents the found results: http://dbpedia.org/sparql/?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+distinct+%3Fp%0D%0AWHERE%0D%0A{%0D%0A+%3Fs+%3Fp+%3Fo%0D%0A}%0D%0ALIMIT+100%0D%0A&format=text%2Fhtml&timeout=60000&debug=on

Conclusions: When using the timeout value of 10000 (10 sec), you get only one unique ?p value back. Using a timeout value of 60000 (60 sec) would still only show a few different values for ?p, which is logical considering that the index is sorted by Predicate (the P slot in the RDF triple pattern). This means that once it gets the first unique ?p, it has to skip all the triples that have the same value of ?p, to get to the next one. Using cURL cURL Variant without the timeout parameter $ curl -F "query=SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100" http://dbpedia.org/sparql Virtuoso S1T00 Error SR171: Transaction timed out SPARQL query: define sql:big-data-const 0 SELECT DISTINCT ?p WHERE { ?s ?p ?o} LIMIT 100 cURL Variant using the timeout parameter $ curl "http://dbpedia.org/sparql/?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+DISTINCT+%3Fp%0D%0AWHERE%0D%0A%7B%0D%0A+%3Fs+%3Fp+%3Fo%0D%0A%7D%0D%0ALIMIT+100%0D%0A&format=text%2Fhtml&timeout=60000&debug=on" <table class="sparql" border="1"> <tr> <th>p</th> </tr> <tr> <td>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</td> </tr> <tr> <td>http://www.w3.org/2002/07/owl#equivalentClass</td> </tr> <tr> <td>http://www.w3.org/2002/07/owl#sameAs</td> </tr> <tr> <td>http://www.w3.org/2002/07/owl#equivalentProperty</td> </tr> <tr> <td>http://www.w3.org/2000/01/rdf-schema#subClassOf</td> </tr> <tr> <td>http://www.w3.org/2004/02/skos/core#broader</td> </tr> <tr> <td>http://www.w3.org/2000/01/rdf-schema#comment</td> </tr> <tr> <td>http://www.w3.org/2000/01/rdf-schema#label</td> </tr> <tr> <td>http://xmlns.com/foaf/0.1/name</td> </tr> <tr> <td>http://xmlns.com/foaf/0.1/nick</td> </tr> <tr> <td>http://www.w3.org/2004/02/skos/core#prefLabel</td> </tr> <tr> <td>http://www.w3.org/2003/01/geo/wgs84_pos#lat</td> </tr> <tr> <td>http://www.w3.org/2003/01/geo/wgs84_pos#long</td> </tr> <tr> <td>http://www.w3.org/2000/01/rdf-schema#domain</td> </tr> <tr> <td>http://www.w3.org/2000/01/rdf-schema#range</td> </tr> <tr> <td>http://www.w3.org/2002/07/owl#versionInfo</td> </tr> <tr> <td>http://dbpedia.org/ontology/purpose</td> </tr> <tr> <td>http://dbpedia.org/ontology/supplementalDraftRound</td> </tr> <tr> <td>http://dbpedia.org/ontology/podiums</td> </tr> <tr> <td>http://dbpedia.org/ontology/buildingStartDate</td> </tr> </table>cURL Variant using the timeout parameter with additional response header In this example, the Response header contains X-SPARQL-MaxRows: 10000, which is used to indicate the maximum result set size of the instance, configured via [SPARQL] INI section entry. For Anytime Query, it means that determining timeout that triggers response with additional headers (due to partial results) is based on the combination of MaxRows and QueryTimeout; i.e., only when Virtuoso cannot produce the ResultSet that matches MaxRows within QueryTimeout setting, will there be a timeout condition accompanied by addition response headers. curl -I "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=SELECT+*+WHERE+%7B+%3Fs+%3Fp+%3Fo+.+%7D&format=text%2Fhtml&timeout=2000&debug=on" HTTP/1.1 200 OK Date: Fri, 12 Dec 2014 20:51:29 GMT Content-Type: text/html; charset=UTF-8 Content-Length: 3804866 Connection: keep-alive Vary: Accept-Encoding Server: Virtuoso/07.10.3211 (Linux) x86_64-redhat-linux-gnu VDB X-SPARQL-default-graph: http://dbpedia.org *X-SPARQL-MaxRows: 10000* Expires: Fri, 19 Dec 2014 20:51:29 GMT Cache-Control: max-age=604800 Accept-Ranges: bytes Related Virtuoso Tips and Tricks Collection Product Documentation Section on Anytime Query Feature