Setting up a Content Crawler Job to retrieve Sitemaps

The following guide describes how to set up a crawler job that retrieves the content of a basic Sitemap whose source pages include RDFa.
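
For orientation, a basic Sitemap is an XML document that lists page URLs in `<loc>` elements under the sitemaps.org namespace. The sketch below extracts those URLs from a sitemap document, which can help to preview what a crawl job is likely to fetch. The sample document and its example.com paths are hypothetical, and the sketch makes no claim about how the Virtuoso crawler itself processes a sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values of a sitemap document, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iterfind(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; the product paths are invented for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/product/one.html</loc></url>
  <url><loc>http://example.com/product/two.html</loc></url>
</urlset>"""

for url in sitemap_urls(SAMPLE):
    print(url)
```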

  1. From the Virtuoso Conductor user interface (i.e. http://cname:port/conductor), log in as the "dba" user.
  2. Go to the "Web Application Server" tab.



  3. Go to the "Content Imports" tab.



  4. Click on the "New Target" button.



  5. In the form displayed:
    • Enter a name of choice in the "Crawl Job Name" text-box:

      Basic Sitemap Crawling Example

    • Enter the URL of the site to be crawled in the "Data Source Address (URL)" text-box:

      http://psclife.pscdog.com/catalog/seo_sitemap/product/

    • Enter the location in the Virtuoso WebDAV repository where the crawled content should be stored in the "Local WebDAV Identifier" text-box. For example, if user demo is available:

      /DAV/home/demo/basic_sitemap/

    • Choose the "Local resources owner" for the collection from the list-box, for example, user demo.
    • Select the "Accept RDF" check-box.




  6. Click the "Create" button to create the import.



  7. Click the "Import Queues" button.
  8. For the "Robot targets" entry labeled "Basic Sitemap Crawling Example", click the "Run" button.
  9. This crawls the target site, stores the retrieved pages locally in DAV, and stores any sponged triples in the RDF Quad Store.



  10. Go to the "Web Application Server" -> "Content Management" tab.



  11. Navigate to the location of the newly created DAV collection:

    /DAV/home/demo/basic_sitemap/

  12. The retrieved content will be available in this location.
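
The contents of the collection can also be checked without the Conductor, through a standard WebDAV `PROPFIND` request with `Depth: 1`. The following is a minimal sketch, not part of the original procedure. It assumes the server is reachable at http://localhost:8890, that the demo account's password is known, and that the server accepts HTTP Basic authentication; all three are assumptions to adjust for a real instance. The helper names are hypothetical.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by WebDAV multistatus responses.
DAV_NS = {"D": "DAV:"}


def build_propfind(base_url, dav_path, user, password):
    """Build a Depth: 1 PROPFIND request for a WebDAV collection."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + dav_path,
        method="PROPFIND",
        # Basic authentication is an assumption; the server may require Digest.
        headers={"Depth": "1", "Authorization": "Basic " + token},
    )


def parse_hrefs(multistatus_xml):
    """Return the href of every resource listed in a multistatus response."""
    root = ET.fromstring(multistatus_xml)
    return [h.text for h in root.iterfind("D:response/D:href", DAV_NS) if h.text]


def list_collection(base_url, dav_path, user, password):
    """Send the PROPFIND request and return the hrefs found (network call)."""
    req = build_propfind(base_url, dav_path, user, password)
    with urllib.request.urlopen(req) as resp:
        return parse_hrefs(resp.read())
```

A call such as `list_collection("http://localhost:8890", "/DAV/home/demo/basic_sitemap/", "demo", "<password>")` would return the collection's own href followed by one href per stored resource.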


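
To confirm that the crawl also produced triples in the RDF Quad Store, the store can be queried over the SPARQL protocol. The sketch below builds a query that counts triples per named graph and parses a SPARQL JSON result. It assumes a SPARQL endpoint at http://localhost:8890/sparql and that sponged graphs carry IRIs beginning with the crawled site's URL; both are assumptions that may not hold for a given installation, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def graph_count_query(graph_prefix):
    """Build a SPARQL query counting triples in graphs whose IRI starts with graph_prefix."""
    if '"' in graph_prefix or "\\" in graph_prefix:
        raise ValueError("graph_prefix must not contain quotes or backslashes")
    return (
        "SELECT ?g (COUNT(*) AS ?triples) "
        "WHERE { GRAPH ?g { ?s ?p ?o } "
        'FILTER(STRSTARTS(STR(?g), "%s")) } '
        "GROUP BY ?g ORDER BY DESC(?triples)"
    ) % graph_prefix


def build_sparql_request(endpoint, query):
    """Build a SPARQL protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_counts(results_json):
    """Map each graph IRI to its triple count from a SPARQL JSON result."""
    data = json.loads(results_json)
    return {
        row["g"]["value"]: int(row["triples"]["value"])
        for row in data["results"]["bindings"]
    }


def count_sponged_triples(endpoint, graph_prefix):
    """Run the query against the endpoint and return the counts (network call)."""
    req = build_sparql_request(endpoint, graph_count_query(graph_prefix))
    with urllib.request.urlopen(req) as resp:
        return parse_counts(resp.read())
```

An empty result from `count_sponged_triples("http://localhost:8890/sparql", "http://psclife.pscdog.com/")` would suggest either that no triples were sponged or that the graphs are named differently than assumed here.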

Related