Setting up a Content Crawler Job to Retrieve Sitemaps
The following guide describes how to set up a crawler job to retrieve the content of a basic Sitemap where the source includes RDFa.
- From the Virtuoso Conductor User Interface, i.e. http://cname:port/conductor, log in as the "dba" user.
- Go to "Web Application Server" tab.
- Go to the "Content Imports" tab.
- Click on the "New Target" button.
- In the form displayed:
- Enter a name of choice in the "Crawl Job Name" text-box:
Basic Sitemap Crawling Example
- Enter the URL of the site to be crawled in the "Data Source Address (URL)" text-box:
http://psclife.pscdog.com/catalog/seo_sitemap/product/ 
- Enter the location in the Virtuoso WebDAV repository where the crawled content should be stored in the "Local WebDAV Identifier" text-box; for example, if the user demo is available, then:
/DAV/home/demo/basic_sitemap/
- Choose the "Local resources owner" for the collection from the list-box available, for example, user demo.
- Select the "Accept RDF" check-box.
- Click the "Create" button to create the import:
- Click the "Import Queues" button.
- For the "Robot targets" with label "Basic Sitemap Crawling Example " click the "Run" button.
- This will result in the target site being crawled, with the retrieved pages stored locally in DAV and any sponged triples stored in the RDF Quad Store (one way to verify the sponged triples is shown in the SPARQL sketch after this guide).
- Go to the "Web Application Server" -> "Content Management" tab.
- Navigate to the location of newly created DAV collection:
/DAV/home/demo/basic_sitemap/
- The retrieved content will be available in this location; the sketches below show one way to verify the results programmatically.
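
As an optional check, the following minimal Python sketch lists the new DAV collection over WebDAV using an HTTP PROPFIND request. The host, port, and credentials shown are assumptions based on this guide's placeholders; substitute the values for your own installation.

# Minimal sketch: list a Virtuoso WebDAV collection with an HTTP PROPFIND.
# Host, port, user, and password are assumptions -- substitute your own.
import requests
import xml.etree.ElementTree as ET

base = "http://cname:port"               # your Virtuoso host:port
collection = "/DAV/home/demo/basic_sitemap/"
auth = ("demo", "demo")                  # DAV credentials (assumed)

resp = requests.request(
    "PROPFIND",
    base + collection,
    headers={"Depth": "1"},              # immediate children only
    auth=auth,
)
resp.raise_for_status()

# Each DAV:href element names one resource in the collection.
ns = {"d": "DAV:"}
for href in ET.fromstring(resp.content).iterfind(".//d:href", ns):
    print(href.text)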
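
Likewise, you can confirm that sponging added triples to the RDF Quad Store by querying Virtuoso's SPARQL endpoint, served at /sparql by default. In this sketch the graph IRI is an assumption: sponged triples are commonly stored in a graph named after the crawled source URL, but your installation may differ, so adjust both values to match your setup.

# Minimal sketch: query the Virtuoso SPARQL endpoint for sponged triples.
# The endpoint URL and graph IRI are assumptions -- adjust to your setup.
import requests

endpoint = "http://cname:port/sparql"    # Virtuoso's default SPARQL endpoint
graph = "http://psclife.pscdog.com/catalog/seo_sitemap/product/"  # assumed graph IRI

query = """
SELECT ?s ?p ?o
FROM <%s>
WHERE { ?s ?p ?o }
LIMIT 10
""" % graph

resp = requests.get(
    endpoint,
    params={"query": query, "format": "application/sparql-results+json"},
)
resp.raise_for_status()

# Print the first few triples returned by the endpoint.
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])

If the crawl and sponge succeeded, the second sketch should print up to ten triples; an empty result usually means the graph IRI does not match the one Virtuoso used.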
Related
- Setting up Crawler Jobs Guide using Conductor
- Setting up a Content Crawler Job to Add RDF Data to the Quad Store
- Setting up a Content Crawler Job to Retrieve Semantic Sitemaps (a variation of the standard sitemap)
- Setting up a Content Crawler Job to Retrieve Content from Specific Directories
- Setting up a Content Crawler Job to Retrieve Content from SPARQL endpoint