The best way to get a random sample of all triples for a subset of all the resources of a SPARQL endpoint is decimation in its original style:
SELECT ?s ?p ?o FROM <some-graph> WHERE { ?s ?p ?o . FILTER ( 1 > <bif:rnd> (10, ?s, ?p, ?o) ) }
By changing the first argument of bif:rnd() and the left side of the inequality, you can adjust the decimation ratio from 1/10 to any desired value.
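As a sketch of that adjustment, the query below keeps roughly 1 triple in 100 instead of 1 in 10. It assumes that bif:rnd (N) returns an integer from 0 to N-1, which is what the 1/10 ratio of the query above implies; <some-graph> is the same placeholder graph name.

```sparql
# Keep a row only when the random draw from 0..99 is 0, i.e. about 1 row in 100.
SELECT ?s ?p ?o
FROM <some-graph>
WHERE
  {
    ?s ?p ?o .
    FILTER ( 1 > <bif:rnd> (100, ?s, ?p, ?o) )
  }
```

Under the same assumption, FILTER ( 3 > <bif:rnd> (10, ?s, ?p, ?o) ) would keep about 3 triples in 10.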
Note that the SQL optimizer is free to execute bif:rnd (10) only once, at the beginning of the query, so we pass three additional arguments whose values are known only when a table row is fetched.
Thus, bif:rnd (10, ?s, ?p, ?o) is calculated for each and every row, and each row is returned or ignored independently of the others.
However, bif:rnd (10, ?s, ?p, ?o) contains a subtle inefficiency.
In the RDF store, graph nodes are stored as numeric IRI IDs, and literal objects may be stored in a separate table.
A SQL function call needs arguments of traditional SQL datatypes, so the query processor will extract the text of the IRI for each node and the full value for each literal object.
That is a significant waste of time.
The workaround is to tell the SPARQL front-end to omit redundant conversions of values by using the SHORT_OR_LONG tag, as shown here:
SELECT ?s ?p ?o FROM <some-graph> WHERE { ?s ?p ?o . FILTER ( 1 > <SHORT_OR_LONG::bif:rnd> (10, ?s, ?p, ?o)) }
The following SPARQL query shows random occurrences of dc:description on the LOD Cloud Cache instance:
SELECT * WHERE { ?s <http://purl.org/dc/elements/1.1/description> ?o FILTER ( 1 > <SHORT_OR_LONG::bif:rnd> (10, ?s, ?o)) } LIMIT 100