%META:TOPICPARENT{name="VirtSetCrawlerJobsGuide"}%
---+Setting up a Content Crawler Job to Retrieve Content from a SPARQL Endpoint

The following step-by-step guide walks you through the process of:

   * Populating a Virtuoso Quad Store with data from a 3rd party SPARQL endpoint.
   * Generating RDF dumps that are accessible to basic HTTP or WebDAV user agents.

   1 Sample SPARQL query producing a list of SPARQL endpoints:
<verbatim>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX scovo: <http://purl.org/NET/scovo#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX akt: <http://www.aktors.org/ontology/portal#>

SELECT DISTINCT ?endpoint
WHERE
  {
    ?ds a void:Dataset .
    ?ds void:sparqlEndpoint ?endpoint
  }
</verbatim>
   1 Here is a sample SPARQL protocol URL constructed from one of the SPARQL endpoints in the result of the query above:
<verbatim>
http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&format=sparql
</verbatim>
   1 Here is the cURL output showing a Virtuoso SPARQL URL that executes against a 3rd party SPARQL endpoint URL:
<verbatim>
$ curl "http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&format=sparql"
http://kisti.rkbexplorer.com/
http://epsrc.rkbexplorer.com/
http://test2.rkbexplorer.com/
http://test.rkbexplorer.com/
...
...
...
</verbatim>
   1 Go to the Conductor UI, for example http://localhost:8890/conductor : %BR%%BR%

%BR%%BR%
   1 Enter the dba credentials.
   1 Go to "Web Application Server" -> "Content Management" -> "Content Imports": %BR%%BR%

%BR%%BR%
   1 Click "New Target": %BR%%BR%

%BR%%BR%
   1 In the presented form, enter for example:
      * "Crawl Job Name": voiD store
      * "Data Source Address (URL)": the URL from above, i.e.: http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&format=sparql
      * "Local WebDAV Identifier": /DAV/void.rkbexplorer.com/content
      * "Follow links matching (delimited with ;)": %
      * Uncheck "Use robots.txt".
      * "XPath expression for links extraction": //binding[@name="url"]/uri/text()
      * Check "Semantic Web Crawling".
      * "If Graph IRI is unassigned use this Data Source URL:": enter for example http://void.collection
      * Check "Follow URLs outside of the target host".
      * Check "Run Sponger" and "Accept RDF". %BR%%BR%

%BR%

%BR%%BR%
   1 Click "Create".
   1 The target should be created and presented in the list of available targets: %BR%%BR%

%BR%%BR%
   1 Click "Import Queues": %BR%%BR%

%BR%%BR%
   1 Click "Run" for the imported target: %BR%%BR%

%BR%%BR%
   1 To check the retrieved content, go to "Web Application Server" -> "Content Management" -> "Content Imports" -> "Retrieved Sites": %BR%%BR%

%BR%%BR%
   1 Click "voiD store" -> "Edit": %BR%%BR%

%BR%%BR%
   1 To check the imported URLs, go to "Web Application Server" -> "Content Management" -> "Repository" and browse to the path /DAV/void.rkbexplorer.com/content: %BR%%BR%

%BR%%BR%
   1 To check the data inserted into the RDF Quad Store, go to http://cname/sparql and execute the following query:
<verbatim>
SELECT *
FROM <http://void.collection>
WHERE { ?s ?p ?o }
</verbatim>
%BR%%BR%
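The "Data Source Address (URL)" used in the form follows the SPARQL protocol pattern: an endpoint URL plus a URL-encoded query parameter. The following is a minimal Python sketch of how such a URL could be assembled with the standard library only. The helper name build_sparql_url is hypothetical, and the format parameter is simply passed through as it appears in the sample URL; other endpoints may expect different values.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Query used in the sample Data Source Address (whitespace simplified).
VOID_QUERY = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "PREFIX void: <http://rdfs.org/ns/void#>\n"
    "SELECT DISTINCT ?url WHERE { ?ds a void:Dataset ; foaf:homepage ?url }"
)


def build_sparql_url(endpoint, query, result_format="sparql"):
    """Hypothetical helper: return endpoint?query=...&format=...

    urlencode() percent-encodes reserved characters and turns
    spaces into '+', matching the style of the sample URL.
    """
    params = {"query": query, "format": result_format}
    return endpoint + "?" + urlencode(params)


def decode_query(url):
    """Recover the original SPARQL text from a protocol URL."""
    return parse_qs(urlsplit(url).query)["query"][0]


# Build the crawl target address for the 3rd party endpoint.
crawl_url = build_sparql_url("http://void.rkbexplorer.com/sparql/", VOID_QUERY)
print(crawl_url)
```

Decoding the result with decode_query is a quick way to confirm that the encoded address still carries the intended query before pasting it into the Conductor form.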

%BR%%BR% %BR%%BR%
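The "XPath expression for links extraction" entered in the form, //binding[@name="url"]/uri/text(), selects the text of every uri element bound to the url variable in the SPARQL results. The Python sketch below illustrates what that selection yields, assuming the endpoint returns the W3C SPARQL Query Results XML Format. The sample document is hand-written for illustration (only the two homepage URLs are taken from the cURL output), the helper name extract_links is hypothetical, and this is not a description of how the Virtuoso crawler is implemented internally. Note that ElementTree needs the results namespace spelled out, whereas the expression in the form does not mention it.

```python
import xml.etree.ElementTree as ET

# Namespace of the SPARQL Query Results XML Format.
SPARQL_NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

# Hand-written sample of a results document (illustrative only).
SAMPLE_RESULTS = """<?xml version="1.0"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="url"/></head>
  <results>
    <result><binding name="url"><uri>http://kisti.rkbexplorer.com/</uri></binding></result>
    <result><binding name="url"><uri>http://epsrc.rkbexplorer.com/</uri></binding></result>
  </results>
</sparql>
"""


def extract_links(xml_text, variable="url"):
    """Hypothetical helper: list the URIs bound to the given variable."""
    root = ET.fromstring(xml_text)
    # Namespace-aware equivalent of //binding[@name="url"]/uri/text()
    path = f".//sr:binding[@name='{variable}']/sr:uri"
    return [el.text.strip() for el in root.findall(path, SPARQL_NS) if el.text]


for link in extract_links(SAMPLE_RESULTS):
    print(link)
```

Each extracted URL corresponds to a link the crawl job would follow, which is why "Follow URLs outside of the target host" is enabled in the form.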

%BR%%BR%
---++Related

   * [[http://docs.openlinksw.com/virtuoso/rdfinsertmethods.html#rdfinsertmethodvirtuosocrawler][Setting up a Content Crawler Job to Add RDF Data to the Quad Store]]
   * [[VirtSetCrawlerJobsGuideSitemaps][Setting up a Content Crawler Job to Retrieve Sitemaps]] (when the source includes RDFa)
   * [[VirtSetCrawlerJobsGuideSemanticSitemaps][Setting up a Content Crawler Job to Retrieve Semantic Sitemaps]] (a variation of the standard sitemap)
   * [[VirtSetCrawlerJobsGuideDirectories][Setting up a Content Crawler Job to Retrieve Content from Specific Directories]]
   * [[VirtCrawlerGuideAtom][Setting up a Content Crawler Job to Retrieve Content from ATOM feed]]