%META:TOPICPARENT{name="VirtSetCrawlerJobsGuide"}%
---+Setting up a Content Crawler Job to Retrieve Content from Specific Directories
This guide describes how to set up a crawler job that retrieves content from specific directories using Conductor.
1 Go to the Conductor UI, for example at http://localhost:8890/conductor .
1 Log in with the dba credentials.
1 Go to "Web Application Server".
%BR%%BR%%BR%%BR%
1 Go to "Content Imports".
%BR%%BR%%BR%%BR%
1 Click "New Target".
%BR%%BR%%BR%%BR%
1 In the form that appears, set the following:
* "Crawl Job Name":
Gov.UK data
* "Data Source Address (URL)":
http://source.data.gov.uk/data/
* "Local WebDAV Identifier" for an available user, for example demo:
/DAV/home/demo/gov.uk/
* From the "Local resources owner" list, choose a user, for example demo.
%BR%%BR%%BR%%BR%
* Click the button "Create".
1 As a result, the Robot target will be created.
%BR%%BR%%BR%%BR%
1 Click "Import Queues".
%BR%%BR%%BR%%BR%
1 For the "Robot target" labeled "Gov.UK data", click "Run".
1 As a result, the status of the pages will be shown: retrieved, pending, or waiting.
%BR%%BR%%BR%%BR%
1 Click "Retrieved Sites".
1 As a result, the total number of pages retrieved should be shown.
%BR%%BR%%BR%%BR%
1 Go to "Web Application Server" -> "Content Management".
1 Enter the path:
DAV/home/demo/gov.uk
%BR%%BR%%BR%%BR%
1 Go to the path:
DAV/home/demo/gov.uk/data
1 As a result, the retrieved content will be shown.
%BR%%BR%%BR%%BR%
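The steps above suggest that the crawler stores content under the "Local WebDAV Identifier", in a folder named after the path of the source URL (here, =data=). The following Python sketch illustrates that mapping and shows one way to list the folder over WebDAV instead of browsing it in Conductor. The helper names =local_dav_path= and =list_dav_collection= are hypothetical, the path mapping is inferred from this example only, and the use of HTTP Basic authentication with the owner's credentials is an assumption about how the server is configured.

```python
import base64
import posixpath
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def local_dav_path(source_url, dav_root):
    """Guess the WebDAV folder that holds the crawled copy of source_url.

    Assumption (inferred from the example in this guide): the path of the
    source URL is appended to the "Local WebDAV Identifier".
    """
    url_path = urlsplit(source_url).path.strip("/")
    root = "/" + dav_root.strip("/")
    return posixpath.join(root, url_path) if url_path else root


def list_dav_collection(base_url, dav_path, user, password):
    """Return the hrefs reported by a WebDAV PROPFIND (Depth: 1) request.

    Assumption: the server accepts HTTP Basic authentication for this user.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        base_url.rstrip("/") + dav_path + "/",
        method="PROPFIND",
        headers={"Depth": "1", "Authorization": "Basic " + token},
    )
    with urllib.request.urlopen(request) as response:
        tree = ET.fromstring(response.read())
    # Each resource in the multistatus reply carries one DAV: href element.
    return [node.text for node in tree.iter("{DAV:}href")]


# Values taken from the form in this guide.
crawled_folder = local_dav_path(
    "http://source.data.gov.uk/data/", "/DAV/home/demo/gov.uk/"
)
```

With the values from this guide, =crawled_folder= is =/DAV/home/demo/gov.uk/data=, the same path browsed in the last steps. Calling =list_dav_collection("http://localhost:8890", crawled_folder, "demo", password)= against a running instance would then return the entries of that folder, provided the assumptions above hold.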
---++Related
* [[VirtSetCrawlerJobsGuide][Setting up Crawler Jobs Guide using Conductor]]
* [[http://docs.openlinksw.com/virtuoso/rdfinsertmethods.html#rdfinsertmethodvirtuosocrawler][Setting up a Content Crawler Job to Add RDF Data to the Quad Store]]
* [[VirtSetCrawlerJobsGuideSitemaps][Setting up a Content Crawler Job to Retrieve Sitemaps (where the source includes RDFa)]]
* [[VirtSetCrawlerJobsGuideSemanticSitemaps][Setting up a Content Crawler Job to Retrieve Semantic Sitemaps (a variation of the standard sitemap)]]
* [[VirtCrawlerSPARQLEndpoints][Setting up a Content Crawler Job to Retrieve Content from a SPARQL Endpoint]]