%META:TOPICPARENT{name="VirtSetCrawlerJobsGuide"}%
---+Setting up a Content Crawler Job to Retrieve Content from a SPARQL Endpoint
The following step-by-step guide walks you through the process of:
* Populating a Virtuoso Quad Store with data from a 3rd party SPARQL endpoint
* Generating RDF dumps that are accessible to basic HTTP or WebDAV user agents.
1 Sample SPARQL query producing a list of SPARQL endpoints:
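PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX scovo: <http://purl.org/NET/scovo#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX akt: <http://www.aktors.org/ontology/portal#>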
SELECT DISTINCT ?endpoint
WHERE
{
?ds a void:Dataset .
?ds void:sparqlEndpoint ?endpoint
}
1 Here is a sample SPARQL protocol URL constructed from one of the SPARQL endpoints returned by the query above:
http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&format=sparql
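A SPARQL protocol URL like the one above is the endpoint address followed by the percent-encoded query text and a result format parameter. The following is a minimal Python sketch of that encoding, not part of Virtuoso itself. The helper name =build_sparql_url= is hypothetical, and the query text is the one decoded from the sample URL, with whitespace normalised.

```python
from urllib.parse import urlencode

# Endpoint taken from the sample URL above.
ENDPOINT = "http://void.rkbexplorer.com/sparql/"

# Query text decoded from the sample URL (whitespace normalised).
QUERY = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "PREFIX void: <http://rdfs.org/ns/void#>\n"
    "SELECT DISTINCT ?url WHERE { ?ds a void:Dataset ; foaf:homepage ?url }"
)


def build_sparql_url(endpoint, query, result_format="sparql"):
    """Return a GET URL with the query and format percent-encoded.

    Hypothetical helper for illustration only.
    """
    return endpoint + "?" + urlencode({"query": query, "format": result_format})


if __name__ == "__main__":
    print(build_sparql_url(ENDPOINT, QUERY))
```

The printed URL can be pasted into the "Data Source Address (URL)" field or passed to cURL. Line endings are encoded as =%0A= here rather than the =%0D%0A= seen in the sample URL, which makes no difference to the query.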
1 Here is the cURL output from executing this SPARQL protocol URL against the 3rd party SPARQL endpoint:
$ curl "http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&format=sparql"
http://kisti.rkbexplorer.com/
http://epsrc.rkbexplorer.com/
http://test2.rkbexplorer.com/
http://test.rkbexplorer.com/
...
...
...
1 Go to the Conductor UI, for example http://localhost:8890/conductor :
%BR%%BR%%BR%%BR%
1 Enter the dba credentials.
1 Go to "Web Application Server" -> "Content Management" -> "Content Imports"
%BR%%BR%%BR%%BR%
1 Click "New Target"
%BR%%BR%%BR%%BR%
1 In the presented form, enter, for example:
* "Crawl Job Name": voiD store
* "Data Source Address (URL)": the URL from above, i.e.:
http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&format=sparql
* "Local WebDAV Identifier":
/DAV/void.rkbexplorer.com/content
* "Follow links matching (delimited with ;)":
%
* Uncheck "Use robots.txt";
* "XPath expression for links extraction":
//binding[@name="url"]/uri/text()
* Check "Semantic Web Crawling";
* "If Graph IRI is unassigned use this Data Source URL:": enter, for example:
http://void.collection
* Check "Follow URLs outside of the target host";
* Check "Run Sponger" and "Accept RDF".
%BR%%BR%
%BR%%BR%%BR%
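The XPath expression entered in the form selects the text of every =uri= element inside a =binding= named =url=, which is how the crawler finds the links to follow in the SPARQL results document. The sketch below reproduces that selection in Python on a hypothetical, abbreviated response. The function name =extract_urls= and the sample document are assumptions for illustration, and the two URIs are taken from the cURL output above. Python's ElementTree needs the SPARQL results namespace spelled out; how the crawler itself treats namespaces is not covered here.

```python
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML Format.
NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hypothetical, abbreviated response; the URIs come from the cURL output above.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="url"/></head>
  <results>
    <result><binding name="url"><uri>http://kisti.rkbexplorer.com/</uri></binding></result>
    <result><binding name="url"><uri>http://epsrc.rkbexplorer.com/</uri></binding></result>
  </results>
</sparql>"""


def extract_urls(xml_text, variable="url"):
    """Return the URI values bound to `variable`, in document order."""
    root = ET.fromstring(xml_text)
    # Namespace-qualified equivalent of //binding[@name="url"]/uri/text()
    path = f".//sr:binding[@name='{variable}']/sr:uri"
    return [element.text for element in root.findall(path, NS)]


if __name__ == "__main__":
    for url in extract_urls(SAMPLE):
        print(url)
```

Each extracted URL is a dataset homepage on a different host, which is why "Follow URLs outside of the target host" is checked and the link pattern is set to =%=.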
1 Click "Create".
1 The target should be created and presented in the list of available targets:
%BR%%BR%%BR%%BR%
1 Click "Import Queues":
%BR%%BR%%BR%%BR%
1 Click "Run" for the imported target:
%BR%%BR%%BR%%BR%
1 To check the retrieved content, go to "Web Application Server" -> "Content Management" -> "Content Imports" -> "Retrieved Sites":
%BR%%BR%%BR%%BR%
1 Click voiD store -> "Edit":
%BR%%BR%%BR%%BR%
1 To check the imported URLs, go to "Web Application Server" -> "Content Management" -> "Repository" and browse to the path /DAV/void.rkbexplorer.com/content:
%BR%%BR%%BR%%BR%
1 To check the data inserted into the RDF Quad Store, go to http://cname/sparql and execute the following query:
SELECT *
FROM <http://void.collection>
WHERE
{
?s ?p ?o
}
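The same check can be scripted instead of being run from the SPARQL page. The sketch below only builds the HTTP request and does not send it. It assumes the graph IRI http://void.collection entered in the form and keeps the =cname= host placeholder used above. The helper name =build_check_request= is hypothetical, and the JSON =Accept= header is an assumption about the result format you want back.

```python
import urllib.request
from urllib.parse import urlencode


def build_check_request(endpoint, graph_iri):
    """Build (but do not send) a GET request listing the triples in graph_iri.

    Hypothetical helper for illustration only.
    """
    query = f"SELECT * FROM <{graph_iri}> WHERE {{ ?s ?p ?o }}"
    url = endpoint + "?" + urlencode({"query": query})
    # Assumption: results are requested as SPARQL JSON.
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    request = build_check_request("http://cname/sparql", "http://void.collection")
    print(request.full_url)
```

Passing the returned object to =urllib.request.urlopen= would send the query. An empty result set suggests the crawl has not run yet or used a different graph IRI.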
%BR%%BR%%BR%%BR%
---++Related
* [[http://docs.openlinksw.com/virtuoso/rdfinsertmethods.html#rdfinsertmethodvirtuosocrawler][Setting up a Content Crawler Job to Add RDF Data to the Quad Store]]
* [[VirtSetCrawlerJobsGuideSitemaps][Setting up a Content Crawler Job to Retrieve Sitemaps]] (when the source includes RDFa)
* [[VirtSetCrawlerJobsGuideSemanticSitemaps][Setting up a Content Crawler Job to Retrieve Semantic Sitemaps]] (a variation of the standard sitemap)
* [[VirtSetCrawlerJobsGuideDirectories][Setting up a Content Crawler Job to Retrieve Content from Specific Directories]]
* [[VirtCrawlerGuideAtom][Setting up a Content Crawler Job to Retrieve Content from ATOM feed]]