Setting up a Content Crawler Job to retrieve Semantic Sitemaps
The following guide describes how to set up a crawler job that retrieves the content of a Semantic Sitemap, a variation of the standard sitemap format that adds dataset-level descriptions.
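For orientation, a Semantic Sitemap is ordinary sitemap XML extended with dataset-level entries. The fragment below is a minimal illustrative sketch only: the sc: element names and namespace follow the Semantic Sitemaps extension as commonly published, and all dataset URLs are hypothetical placeholders, not values from the sitemap used later in this guide.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema#">
  <!-- a dataset-level entry; standard <url> entries for HTML pages can sit alongside it -->
  <sc:dataset>
    <sc:datasetLabel>Example dataset</sc:datasetLabel>
    <sc:datasetURI>http://example.com/dataset</sc:datasetURI>
    <sc:linkedDataPrefix>http://example.com/resource/</sc:linkedDataPrefix>
    <sc:sparqlEndpointLocation>http://example.com/sparql</sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>http://example.com/data/dump.rdf</sc:dataDumpLocation>
  </sc:dataset>
</urlset>

With that format in mind, set up the crawler job in Conductor as follows: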
- Go to the Conductor UI, for example at http://localhost:8890/conductor .
- Enter dba credentials.
- Go to "Web Application Server".
- Go to "Content Imports".
- Click "New Target".
- In the form shown:
- Enter for "Crawl Job Name":
Semantic Web Sitemap Example
- Enter for "Data Source Address (URL)":
http://www.connexfilter.com/sitemap_en.xml
- Enter in the "Local WebDAV Identifier" text box the location in the Virtuoso WebDAV repository where the crawled content should be stored. For example, if the user demo is available:
/DAV/home/demo/semantic_sitemap/
- Choose the "Local resources owner" for the collection from the list box, for example the user demo.
- Hatch "Semantic Web Crawling":
- Note: when you select this option, you can either:
- Leave the Store Function and Extract Function empty - in this case the system Store and Extract functions will be used for the Semantic Web Crawling Process, or:
- You can select your own Store and Extract Functions. View an example of these functions.
- Hatch "Accept RDF"
- Optionally you can hatch "Store metadata *" and specify which RDF Cartridges to be included from the Sponger:
- Enter for "Crawl Job Name":
- Click the button "Create".
- Click "Import Queues".
- For "Robot target" with label "Semantic Web Sitemap Example" click "Run".
- As result should be shown the number of the pages retrieved.
- Check the retrieved RDF data from your Virtuoso instance SPARQL endpoint http://cname:port/sparql with the following query selecting all the retrieved graphs for ex:
SELECT ?g FROM <http://localhost:8890/> WHERE { graph ?g { ?s ?p ?o } . FILTER ( ?g LIKE <http://www.connexfilter.com/%> ) }
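If the graphs are listed as expected, a further optional check is to count the triples loaded per graph. The query below is an illustrative sketch following the same pattern as the query above; the FILTER pattern assumes the same source host and should be adjusted to the sitemap actually crawled.

SELECT ?g (COUNT(*) AS ?triples)
FROM <http://localhost:8890/>
WHERE
  {
    graph ?g { ?s ?p ?o } .
    FILTER ( ?g LIKE <http://www.connexfilter.com/%> )
  }
GROUP BY ?g

The query can be run from the same /sparql endpoint, or from the isql command line by prefixing it with the SPARQL keyword.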
Related
- Setting up Crawler Jobs Guide using Conductor
- Setting up a Content Crawler Job to Add RDF Data to the Quad Store
- Setting up a Content Crawler Job to Retrieve Sitemaps (where the source includes RDFa)
- Setting up a Content Crawler Job to Retrieve Content from Specific Directories
- Setting up a Content Crawler Job to Retrieve Content from SPARQL endpoint