Setting up a Content Crawler Job to retrieve Semantic Sitemaps

The following guide describes how to set up a crawler job for retrieving the content of a Semantic Sitemap, a variation of the standard sitemap:

  1. Go to the Conductor UI, for example at http://localhost:8890/conductor .
  2. Enter dba credentials.
  3. Go to "Web Application Server".



  4. Go to "Content Imports".



  5. Click "New Target".



  6. In the form that appears:
    • For "Crawl Job Name", enter:

      Semantic Web Sitemap Example

    • For "Data Source Address (URL)", enter:

      http://www.connexfilter.com/sitemap_en.xml

    • In the "Local WebDAV Identifier" text box, enter the location in the Virtuoso WebDAV repository where the crawled content should be stored; for example, if the user demo is available:

      /DAV/home/demo/semantic_sitemap/

    • Choose the "Local resources owner" for the collection from the list box, for example the user demo.
    • Check "Semantic Web Crawling":
      • Note: when you select this option, you can either:
        1. Leave the Store Function and Extract Function fields empty; in this case the system Store and Extract functions will be used for the Semantic Web Crawling process, or:
        2. Select your own Store and Extract functions. View an example of these functions.
    • Check "Accept RDF"




    • Optionally, you can check "Store metadata *" and specify which RDF Cartridges from the Sponger should be included:



  7. Click the "Create" button.



  8. Click "Import Queues".



  9. For the "Robot target" with the label "Semantic Web Sitemap Example", click "Run".
  10. The number of pages retrieved should be displayed as a result.



  11. Check the retrieved RDF data via your Virtuoso instance's SPARQL endpoint at http://cname:port/sparql, for example with the following query, which selects all the retrieved graphs:

    SELECT ?g FROM <http://localhost:8890/> WHERE { graph ?g { ?s ?p ?o } . FILTER ( ?g LIKE <http://www.connexfilter.com/%> ) }
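
    The FILTER in the query above uses a Virtuoso-specific LIKE comparison. As a sketch of an equivalent check in standard SPARQL 1.1 (assuming the same crawl target and a Virtuoso version with SPARQL 1.1 built-ins), the following query lists each retrieved graph together with its triple count:

      # count triples per graph whose IRI starts with the crawl target's base URL
      SELECT ?g ( COUNT(*) AS ?triples )
      WHERE
        {
          GRAPH ?g { ?s ?p ?o }
          FILTER ( STRSTARTS ( STR (?g), "http://www.connexfilter.com/" ) )
        }
      GROUP BY ?g
      ORDER BY DESC (?triples)

    To inspect the triples stored for a single crawled page, you can then query one of the graph IRIs returned above directly; the graph IRI below is only a placeholder and should be replaced with an IRI from your own results:

      # placeholder graph IRI: substitute a graph returned by the previous query
      SELECT ?s ?p ?o
      FROM <http://www.connexfilter.com/>
      WHERE
        {
          ?s ?p ?o
        }
      LIMIT 100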




