VOS.VirtTipsAndTricksGuideBestPerformanceSPARQLExEndpoint

How can I achieve the best performance when executing a query against a SPARQL endpoint?

Assume the following query is to be executed against the DBpedia SPARQL endpoint:


PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX prop: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thumbnail ?abstract
WHERE 
  {
    ?location rdfs:label ?label;
          a dbo:PopulatedPlace . 
    OPTIONAL { ?location dbo:thumbnail ?thumbnail . }
    OPTIONAL { ?location dbo:abstract ?abstract . 
      FILTER langMatches(lang(?abstract), 'en') }
    FILTER REGEX(?label, 'Swanage', 'i')
  } 
LIMIT 1

What should we do if the endpoint either times out or returns no data at all? The DBpedia endpoint can time out for a number of reasons, such as:

  1. You send too many requests in a very short amount of time.
  2. You send very time-consuming queries and receive a timeout.
  3. Someone else is running very expensive queries.

We have several ACLs in place to deal with the above scenarios, but that does not stop some users from writing crawlers that ignore the HTTP status codes instead of acting on them properly. We are looking into the best way to deal with such cases.

Returning to the query above, the performance killer is the use of:
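As a rough illustration of what acting on HTTP status codes could look like on the client side, the sketch below computes how long a crawler should wait before retrying. The function name `retry_delay`, the set of status codes treated as retryable, and the backoff parameters are all assumptions made for this example; they are not documented behaviour of the DBpedia endpoint.

```python
from typing import Optional

# Status codes treated as "slow down and try again later".
# This set is an assumption for the sketch, not a documented endpoint policy.
RETRYABLE = {429, 503, 504}


def retry_delay(status: int, attempt: int, retry_after: Optional[str] = None,
                base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Return the number of seconds to wait before retrying, or None for no retry."""
    if status not in RETRYABLE:
        # Success, or an error that repeating the same request will not fix.
        return None
    if retry_after is not None and retry_after.strip().isdigit():
        # Honour a numeric Retry-After hint from the server, up to the cap.
        return min(float(retry_after.strip()), cap)
    # Otherwise back off exponentially: base, 2*base, 4*base, ... up to the cap.
    return min(base * (2 ** attempt), cap)
```

A crawler would call `retry_delay(response_status, attempt)` after each response and sleep for the returned time, giving up once the result is `None` or a maximum number of attempts is reached.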


FILTER regex(?label, 'Swanage', 'i')

which has to check the label of every individual place to see whether it matches. This is not an efficient way to get results quickly.

We suggest replacing the FILTER line with:


?label bif:contains '\'Düsseldorf\''

or


?label bif:contains " 'Antwerpen' "

which takes into account the fact that names can contain special characters. If you are dealing only with Latin1 characters, you can leave out the inner quotes. Also note that the spaces between the double and single quotes are not needed, but they improve readability.

Like REGEX with the 'i' option, the CONTAINS function finds antwerpen, Antwerpen, or any other mix of case, but it does so using an index, which is much faster.

Finally, the query should look like this:
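If the query text is assembled in a client program, the quoting described above can be wrapped in a small helper. The following is a minimal sketch; the function name `contains_pattern` is hypothetical, and the helper deliberately refuses terms that contain quotes or backslashes rather than guessing at escaping rules.

```python
def contains_pattern(var: str, term: str) -> str:
    """Build a triple pattern of the form: ?var bif:contains " 'term' " """
    if "'" in term or '"' in term or "\\" in term:
        # Escaping is out of scope for this sketch, so reject such terms outright.
        raise ValueError("quote and backslash characters are not handled by this sketch")
    # Inner single quotes protect special characters; the spaces are only for readability.
    return f"?{var} bif:contains \" '{term}' \""
```

For example, `contains_pattern("label", "Antwerpen")` produces the pattern used in the query that replaces the REGEX filter.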


PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX prop: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thumbnail ?abstract
WHERE 
  {
    ?location rdfs:label ?label;
          a dbo:PopulatedPlace .
    ?label bif:contains " 'Antwerpen' " .
    OPTIONAL { ?location dbo:thumbnail ?thumbnail . }
    OPTIONAL { ?location dbo:abstract ?abstract . 
      FILTER langMatches(lang(?abstract), 'en') }
  }
LIMIT 1

which is much faster.

Another trick you can use is to turn a SPARQL request into an ANYTIME query. This is done by adding:


&timeout=5000

to the end of the /sparql/?query=XXXX request, which instructs the Virtuoso SPARQL endpoint to return only the results it can find in approximately 5000 msec. Special HTTP response headers indicate whether the result set is partial or complete.
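A client can build such a request URL programmatically. The sketch below is only an illustration: the function name `anytime_url` is hypothetical, and the endpoint address used in the usage note is an assumption. It URL-encodes the query and places the timeout parameter at the end of the request.

```python
from urllib.parse import urlencode


def anytime_url(endpoint: str, query: str, timeout_ms: int = 5000) -> str:
    """Return a SPARQL request URL with the timeout parameter appended last."""
    # Dictionaries keep insertion order, so "timeout" ends up after "query".
    params = {"query": query, "timeout": str(timeout_ms)}
    return endpoint + "?" + urlencode(params)
```

For instance, `anytime_url("http://dbpedia.org/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 1")` yields a URL ending in `&timeout=5000`. The URL can then be fetched with any HTTP client.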

View the results of the query execution on the LOD instance.
