<docbook><section><title>VirtCrawlerSPARQLEndpoints</title><title>Setting up a Content Crawler Job to Retrieve Content from SPARQL endpoint</title>Setting up a Content Crawler Job to Retrieve Content from SPARQL endpoint
<para> The following step-by guide walks you through the process of:</para>
<itemizedlist mark="bullet" spacing="compact"><listitem>Populating a Virtuoso Quad Store with data from a 3rd party SPARQL endpoint </listitem>
<listitem>Generating RDF dumps that are accessible to basic HTTP or <ulink url="WebDAV">WebDAV</ulink> user agents.</listitem>
</itemizedlist><para> </para>
<orderedlist spacing="compact"><listitem>Sample SPARQL query producing a list SPARQL endpoints: <programlisting>PREFIX rdf:      &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX rdfs:     &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX owl:      &lt;http://www.w3.org/2002/07/owl#&gt;
PREFIX xsd:      &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX foaf:     &lt;http://xmlns.com/foaf/0.1/&gt;
PREFIX dcterms:  &lt;http://purl.org/dc/terms/&gt;
PREFIX scovo:    &lt;http://purl.org/NET/scovo#&gt;
PREFIX void:     &lt;http://rdfs.org/ns/void#&gt;
PREFIX akt:      &lt;http://www.aktors.org/ontology/portal#&gt;

SELECT DISTINCT ?endpoint 
WHERE 
  { 
    ?ds a void:Dataset . 
    ?ds void:sparqlEndpoint ?endpoint 
  }
</programlisting></listitem>
<listitem>Here is a sample SPARQL protocol URL constructed from one of the sparql endpoints in the result from the query above: <programlisting>http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&amp;format=sparql
</programlisting></listitem>
<listitem>Here is the cURL output showing a Virtuoso SPARQL URL that executes against a 3rd party SPARQL Endpoint URL: <programlisting>$ curl &quot;http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void
%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ah
omepage+%3Furl+%7D%0D%0A&amp;format=sparql&quot;
&lt;?xml version=&quot;1.0&quot;?&gt;
&lt;sparql xmlns=&quot;http://www.w3.org/2005/sparql-results#&quot;&gt;
  &lt;head&gt;
    &lt;variable name=&quot;url&quot;/&gt;
  &lt;/head&gt;
  &lt;results ordered=&quot;false&quot; distinct=&quot;true&quot;&gt;
    &lt;result&gt;
      &lt;binding name=&quot;url&quot;&gt;&lt;uri&gt;http://kisti.rkbexplorer.com/&lt;/uri&gt;&lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name=&quot;url&quot;&gt;&lt;uri&gt;http://epsrc.rkbexplorer.com/&lt;/uri&gt;&lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name=&quot;url&quot;&gt;&lt;uri&gt;http://test2.rkbexplorer.com/&lt;/uri&gt;&lt;/binding&gt;
    &lt;/result&gt;
    &lt;result&gt;
      &lt;binding name=&quot;url&quot;&gt;&lt;uri&gt;http://test.rkbexplorer.com/&lt;/uri&gt;&lt;/binding&gt;
    &lt;/result&gt;
    ...
    ...
    ...
  &lt;/results&gt;
&lt;/sparql&gt;
</programlisting></listitem>
<listitem>Go to Conductor UI.
 For ex.
 <ulink url="http://localhost:8890/conductor">http://localhost:8890/conductor</ulink> : <ulink url="VirtCrawlerSPARQLEndpoints/scp1.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp1.png" /></figure></ulink> </listitem>
<listitem>Enter dba credentials </listitem>
<listitem>Go to &quot;Web Application Server&quot;-&gt; &quot;Content Management&quot; -&gt; &quot;Content Imports&quot; <ulink url="VirtCrawlerSPARQLEndpoints/scp2.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp2.png" /></figure></ulink> </listitem>
<listitem>Click &quot;New Target&quot; <ulink url="VirtCrawlerSPARQLEndpoints/scp3.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp3.png" /></figure></ulink> </listitem>
<listitem>In the presented form enter for ex.: <itemizedlist mark="bullet" spacing="compact"><listitem>&quot;Crawl Job Name&quot;: voiD store </listitem>
<listitem>&quot;Data Source Address (URL)&quot;: the url from above i.e.: <programlisting>http://void.rkbexplorer.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E+%0D%0APREFIX+void%3A+++++%3Chttp%3A%2F%2Frdfs.org%2Fns%2Fvoid%23%3E++%0D%0ASELECT+distinct+%3Furl++WHERE+%7B+%3Fds+a+void%3ADataset+%3B+foaf%3Ahomepage+%3Furl+%7D%0D%0A&amp;format=sparql
</programlisting></listitem>
<listitem>&quot;Local <ulink url="WebDAV">WebDAV</ulink> Identifier&quot;: <programlisting>/DAV/void.rkbexplorer.com/content
</programlisting></listitem>
<listitem>&quot;Follow links matching (delimited with ;)&quot;: <programlisting>%
</programlisting></listitem>
<listitem>Un-hatch &quot;Use robots.txt&quot; ; </listitem>
<listitem>&quot;XPath expression for links extraction&quot;: <programlisting>//binding[@name=&quot;url&quot;]/uri/text()
</programlisting></listitem>
<listitem>Hatch &quot;Semantic Web Crawling&quot;; </listitem>
<listitem>&quot;If Graph IRI is unassigned use this Data Source URL:&quot;: enter for ex: <programlisting>http://void.collection
</programlisting></listitem>
<listitem>Hatch &quot;Follow URLs outside of the target host&quot;; </listitem>
<listitem>Hatch &quot;Run &quot;Sponger&quot; and &quot;Accept RDF&quot; <ulink url="VirtCrawlerSPARQLEndpoints/scp4.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp4.png" /></figure></ulink> <ulink url="VirtCrawlerSPARQLEndpoints/scp5.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp5.png" /></figure></ulink> </listitem>
</itemizedlist></listitem>
<listitem>Click &quot;Create&quot;.
</listitem>
<listitem>The target should be created and presented in the list of available targets: <ulink url="VirtCrawlerSPARQLEndpoints/scp7.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp7.png" /></figure></ulink> </listitem>
<listitem>Click &quot;Import Queues&quot;: <ulink url="VirtCrawlerSPARQLEndpoints/scp8.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp8.png" /></figure></ulink> </listitem>
<listitem>Click &quot;Run&quot; for the imported target: <ulink url="VirtCrawlerSPARQLEndpoints/scp9.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp9.png" /></figure></ulink> </listitem>
<listitem>To check the retrieved content go to &quot;Web Application Server&quot;-&gt; &quot;Content Management&quot; -&gt; &quot;Content Imports&quot; -&gt; &quot;Retrieved Sites&quot;: <ulink url="VirtCrawlerSPARQLEndpoints/scp11.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp11.png" /></figure></ulink> </listitem>
<listitem>Click voiD store -&gt; &quot;Edit&quot;: <ulink url="VirtCrawlerSPARQLEndpoints/scp12.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp12.png" /></figure></ulink> </listitem>
<listitem>To check the imported URLs go to &quot;Web Application Server&quot;-&gt; &quot;Content Management&quot; -&gt; &quot;Repository&quot; path <emphasis>DAV/void.rkbexplorer.com/content</emphasis>: <ulink url="VirtCrawlerSPARQLEndpoints/scp10.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp10.png" /></figure></ulink> </listitem>
<listitem>To check the inserted into the RDF QUAD data go to <ulink url="http://cname/sparql">http://cname/sparql</ulink> and execute the following query: <programlisting>SELECT * 
FROM &lt;http://void.collection&gt; 
WHERE 
  {
    ?s ?p ?o
  }
</programlisting><ulink url="VirtCrawlerSPARQLEndpoints/scp13.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp13.png" /></figure></ulink> <ulink url="VirtCrawlerSPARQLEndpoints/scp14.png"><figure><graphic fileref="VirtCrawlerSPARQLEndpoints/scp14.png" /></figure></ulink></listitem>
</orderedlist><para> </para>
<bridgehead class="http://www.w3.org/1999/xhtml:h2">Related</bridgehead>
<para> </para>
<itemizedlist mark="bullet" spacing="compact"><listitem><ulink url="http://docs.openlinksw.com/virtuoso/rdfinsertmethods.html#rdfinsertmethodvirtuosocrawler">Setting up a Content Crawler Job to Add RDF Data to the Quad Store</ulink> </listitem>
<listitem><ulink url="VirtSetCrawlerJobsGuideSitemaps">Setting up a Content Crawler Job to Retrieve Sitemaps</ulink> (when the source includes RDFa) </listitem>
<listitem><ulink url="VirtSetCrawlerJobsGuideSemanticSitemaps">Setting up a Content Crawler Job to Retrieve Semantic Sitemaps</ulink> (a variation of the standard sitemap) </listitem>
<listitem><ulink url="VirtSetCrawlerJobsGuideDirectories">Setting up a Content Crawler Job to Retrieve Content from Specific Directories</ulink> </listitem>
<listitem><ulink url="VirtCrawlerGuideAtom">Setting up a Content Crawler Job to Retrieve Content from ATOM feed</ulink> </listitem>
</itemizedlist></section></docbook>