%META:TOPICPARENT{name="VirtSetCrawlerJobsGuide"}%
---+Setting up a Content Crawler Job to Retrieve Content from Specific Directories

The following guide describes how to set up a crawler job that retrieves directories using Conductor.

   1 Go to the Conductor UI, for example at http://localhost:8890/conductor .
   1 Enter the dba credentials.
   1 Go to "Web Application Server".
%BR%%BR%

%BR%%BR%
   1 Go to "Content Imports".
%BR%%BR%

%BR%%BR%
   1 Click "New Target".
%BR%%BR%

%BR%%BR%
   1 In the form that is shown, set the following:
      * "Crawl Job Name": Gov.UK data
      * "Data Source Address (URL)": http://source.data.gov.uk/data/
      * "Local WebDAV Identifier" for an available user, for example demo: /DAV/home/demo/gov.uk/
      * "Local resources owner": choose a user from the available list, for example demo.
%BR%%BR%

%BR%%BR%
      * Click the "Create" button.
   1 As a result, the Robot target will be created:
%BR%%BR%

%BR%%BR%
   1 Click "Import Queues".
%BR%%BR%

%BR%%BR%
   1 For the "Robot target" labeled "Gov.UK data", click "Run".
   1 As a result, the status of the pages will be shown: retrieved, pending, or waiting.
%BR%%BR%

%BR%%BR%
   1 Click "Retrieved Sites".
   1 As a result, the total number of pages retrieved should be shown.
%BR%%BR%

%BR%%BR%
   1 Go to "Web Application Server" -> "Content Management".
   1 Enter the path: DAV/home/demo/gov.uk
%BR%%BR%

%BR%%BR%
   1 Go to the path: DAV/home/demo/gov.uk/data
   1 As a result, the retrieved content will be shown.
%BR%%BR%
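
The final check above can also be done outside Conductor, because the retrieved content sits in a WebDAV collection. The following is a minimal Python sketch, not part of the original guide. It assumes the server from the first step is reachable at http://localhost:8890 and that the collection accepts a standard WebDAV PROPFIND request with Basic or Digest authentication. The helper names =collection_url=, =parse_hrefs= and =list_collection= are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urljoin

# Assumption: same host and port as the Conductor URL used in this guide.
SERVER = "http://localhost:8890"
# Path taken from the last steps of the guide.
COLLECTION = "/DAV/home/demo/gov.uk/data/"

# Minimal PROPFIND body asking only for the display name of each resource.
PROPFIND_BODY = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<D:propfind xmlns:D="DAV:"><D:prop><D:displayname/></D:prop></D:propfind>'
)


def collection_url(server: str, collection: str) -> str:
    """Join the server address and the WebDAV collection path."""
    return urljoin(server.rstrip("/") + "/", collection.lstrip("/"))


def parse_hrefs(multistatus_xml: str) -> List[str]:
    """Return every href found in a WebDAV 207 Multi-Status body."""
    root = ET.fromstring(multistatus_xml)
    return [el.text.strip() for el in root.iter("{DAV:}href") if el.text]


def list_collection(url: str, user: str, password: str) -> List[str]:
    """Send PROPFIND with Depth: 1 and return the hrefs the server reports."""
    # Assumption: the server accepts Basic or Digest authentication.
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr),
        urllib.request.HTTPDigestAuthHandler(mgr),
    )
    request = urllib.request.Request(
        url,
        data=PROPFIND_BODY.encode("utf-8"),
        headers={"Depth": "1", "Content-Type": "application/xml"},
        method="PROPFIND",
    )
    with opener.open(request) as response:
        return parse_hrefs(response.read().decode("utf-8"))
```

With a running server, calling =list_collection(collection_url(SERVER, COLLECTION), user, password)= with the credentials of the resource owner chosen in the form (demo in this guide) should return the collection itself plus one href per retrieved resource. An empty or missing collection suggests that the crawl job has not finished or that the "Local WebDAV Identifier" differs from the one used here.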

%BR%%BR%

---++Related

   * [[VirtSetCrawlerJobsGuide][Setting up Crawler Jobs Guide using Conductor]]
   * [[http://docs.openlinksw.com/virtuoso/rdfinsertmethods.html#rdfinsertmethodvirtuosocrawler][Setting up a Content Crawler Job to Add RDF Data to the Quad Store]]
   * [[VirtSetCrawlerJobsGuideSitemaps][Setting up a Content Crawler Job to Retrieve Sitemaps (where the source includes RDFa)]]
   * [[VirtSetCrawlerJobsGuideSemanticSitemaps][Setting up a Content Crawler Job to Retrieve Semantic Sitemaps (a variation of the standard sitemap)]]
   * [[VirtCrawlerSPARQLEndpoints][Setting up a Content Crawler Job to Retrieve Content from SPARQL endpoint]]