<docbook><section><title>VirtCartConfigHTML</title><title>Virtuoso HTML+Variants Cartridge Configuration</title>
<bridgehead class="http://www.w3.org/1999/xhtml:h2">Overview</bridgehead>
<para>The HTML+Variants extractor cartridge handles a range of document formats, including HTML, XHTML and pure RDF (RDF/XML, Turtle, N3).</para>
<para>RDF sources are ingested directly.</para>
<para>Sources that look like HTML are passed through HTML Tidy before further extraction can take place.</para>
<para>Many dialects for embedding microdata (collectively known as &quot;data-islands&quot;) are supported.</para>
<para>Certain well-known domain-specific standards (such as Twitter Cards) are also supported.</para>
<para> </para>
<bridgehead class="http://www.w3.org/1999/xhtml:h3">Core HTML</bridgehead>
<para>At its core, the HTML+Variants cartridge identifies the following aspects of a webpage:</para>
<itemizedlist mark="bullet" spacing="compact"><listitem>the title and other well-known data in &lt;meta&gt; tags, such as description, author, keywords and encoding </listitem>
<listitem>images referenced </listitem>
<listitem>links to alternate forms of the current document </listitem>
<listitem>links to other referenced documents </listitem>
<listitem>X.509 certificate fingerprints </listitem>
<listitem>the page content </listitem>
<listitem>any RSS or Atom feeds it finds</listitem>
</itemizedlist><para>In a second extraction, the following are also supported:</para>
<itemizedlist mark="bullet" spacing="compact"><listitem>Twitter Cards </listitem>
<listitem>OpenGraph markup (og: prefix) </listitem>
<listitem>any link@rel becomes a predicate </listitem>
<listitem>any meta@name becomes a predicate</listitem>
</itemizedlist><para>Because the mapping between an HTML document and a logical primary entity is not always clear, there is an option (loose) whereby:</para>
<itemizedlist mark="bullet" spacing="compact"><listitem>any mention of latitude and longitude in a meta tag is extracted</listitem>
</itemizedlist><bridgehead class="http://www.w3.org/1999/xhtml:h3">Data Islands</bridgehead>
<para>There are many ways in which an HTML document can contain embedded data, termed data-islands.</para>
<para>Because of its significance, HTML5 Microdata is regarded as an island in its own right, although it is processed along with other formats in the GRDDL loop. The GRDDL loop forms a second island, encompassing:</para>
<itemizedlist mark="bullet" spacing="compact"><listitem>AB Meta </listitem>
<listitem>BSBM </listitem>
<listitem>Dublin Core </listitem>
<listitem>eRDF </listitem>
<listitem>geoURL </listitem>
<listitem>Google Base </listitem>
<listitem>hAudio </listitem>
<listitem>hCalendar </listitem>
<listitem>hCard </listitem>
<listitem>hListing </listitem>
<listitem>hNews </listitem>
<listitem>hProduct </listitem>
<listitem>hRecipe </listitem>
<listitem>hResume </listitem>
<listitem>hReview </listitem>
<listitem>hReview-aggregate </listitem>
<listitem>HR-XML </listitem>
<listitem>Microsoft XML Spreadsheet 2003 </listitem>
<listitem>Ning Metadata </listitem>
<listitem>relLicense </listitem>
<listitem>Slidy </listitem>
<listitem>Word 2003 XML Document </listitem>
<listitem>XBRL </listitem>
<listitem>XDDL </listitem>
<listitem>XFN Profile </listitem>
<listitem>XFN Profile2 </listitem>
<listitem>xFolk</listitem>
</itemizedlist><para>Recent W3C standards have proposed the use of the HTML &lt;script type=&quot;text/turtle&quot;&gt;...&lt;/script&gt; tag for storing Turtle RDF triples.</para>
<para>Relatedly, Google promote the use of the same HTML &lt;script&gt; element, in the form &lt;script type=&quot;application/ld+json&quot;&gt;...&lt;/script&gt;, for storing data in JSON-LD format.</para>
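<para>As an illustration of how such script-based islands can be located, the following is a minimal Python sketch using only the standard library. It is not the cartridge&#39;s implementation: the class ScriptIslandParser, the function find_script_islands and the sample page are invented for this example. It collects the contents of script elements whose type is text/turtle or application/ld+json and ignores ordinary scripts.</para>
<programlisting><![CDATA[
```python
import json
from html.parser import HTMLParser

# Media types of the two script-based data islands described above.
ISLAND_TYPES = {"text/turtle", "application/ld+json"}


class ScriptIslandParser(HTMLParser):
    """Collects the raw text of <script> elements that carry RDF data."""

    def __init__(self):
        super().__init__()
        self.islands = []      # list of (media_type, text) pairs
        self._current = None   # media type of the island being read, if any
        self._chunks = []      # text pieces of the island being read

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            media_type = (dict(attrs).get("type") or "").strip().lower()
            if media_type in ISLAND_TYPES:
                self._current = media_type
                self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self.islands.append((self._current, "".join(self._chunks).strip()))
            self._current = None


def find_script_islands(html):
    """Return (media_type, text) for every Turtle or JSON-LD script island."""
    parser = ScriptIslandParser()
    parser.feed(html)
    parser.close()
    return parser.islands


# Made-up sample page with one island of each kind and one ordinary script.
SAMPLE = """
<html><head>
<script type="text/turtle">
@prefix dc: <http://purl.org/dc/terms/> .
<> dc:title "Example page" .
</script>
<script type="application/ld+json">
{"@context": "http://schema.org/", "@type": "WebPage", "name": "Example page"}
</script>
<script type="text/javascript">var ignored = true;</script>
</head><body></body></html>
"""

islands = find_script_islands(SAMPLE)
for media_type, text in islands:
    if media_type == "application/ld+json":
        # JSON-LD is plain JSON, so the standard json module can read it.
        print(media_type, json.loads(text)["name"])
    else:
        print(media_type, len(text.splitlines()), "lines")
```
]]></programlisting>
<para>The Turtle text is returned unparsed here; handing it to an RDF parser is outside the scope of this sketch.</para>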
<bridgehead class="http://www.w3.org/1999/xhtml:h3">Reification</bridgehead>
<para>Data in each of the above islands can be reified: for every statement extracted, an rdf:Statement entity is created summarizing the statement&#39;s subject, predicate and object/value, along with a pointer to the data-island whence it came. Reification is enabled by default for every island except the many GRDDL microformats.</para>
<para>Reification can be enabled or disabled selectively for:</para>
<itemizedlist mark="bullet" spacing="compact"><listitem>HTML core + meta extraction (together) </listitem>
<listitem>RDFa data </listitem>
<listitem>Turtle-in-script </listitem>
<listitem>JSON-LD-in-script </listitem>
<listitem>all the supported GRDDL formats </listitem>
<listitem>HTML5-Microdata</listitem>
</itemizedlist><para> </para>
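<para>The shape of a reified statement can be sketched as follows. This is an illustrative Python model, not the cartridge&#39;s code: rdf:Statement, rdf:subject, rdf:predicate and rdf:object are standard RDF terms, whereas the FROM_ISLAND predicate, the island and statement URI patterns, and the function names are assumptions made for the example, since this document does not say how the cartridge names them.</para>
<programlisting><![CDATA[
```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical predicate linking a reified statement to its data island;
# the property the cartridge actually uses is not given in this document.
FROM_ISLAND = "http://example.org/schema#fromIsland"


def reify(triple, statement_uri, island_uri):
    """Return the triples that describe `triple` as an rdf:Statement."""
    subject, predicate, obj = triple
    return [
        (statement_uri, RDF + "type", RDF + "Statement"),
        (statement_uri, RDF + "subject", subject),
        (statement_uri, RDF + "predicate", predicate),
        (statement_uri, RDF + "object", obj),
        (statement_uri, FROM_ISLAND, island_uri),
    ]


def extract(triples, island_name, page_uri, reify_enabled):
    """Emit the extracted triples, adding reification when it is enabled."""
    island_uri = page_uri + "#" + island_name  # assumed island naming
    output = list(triples)
    if reify_enabled:
        for index, triple in enumerate(triples):
            statement_uri = "%s_stmt%d" % (island_uri, index)
            output.extend(reify(triple, statement_uri, island_uri))
    return output


page = "http://example.org/page"
found = [(page, "http://purl.org/dc/terms/title", "Example page")]

# Mirrors the defaults listed under Options: RDFa reified, GRDDL not.
with_reification = extract(found, "rdfa", page, reify_enabled=True)
without_reification = extract(found, "grddl", page, reify_enabled=False)

for triple in with_reification:
    print(triple)
```
]]></programlisting>
<para>Each reified statement adds five triples to the one extracted, which suggests why leaving reification off for the numerous GRDDL formats keeps the output smaller.</para>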
<bridgehead class="http://www.w3.org/1999/xhtml:h2">Passthrough Mode</bridgehead>
<para>Previously, the HTML+Variants cartridge was regarded as a fall-back, invoked only when no other extractor cartridge could handle the document.</para>
<para>This has changed: by default, the HTML cartridge is now the first to be invoked, which requires its exit status to be set so that control can flow on to other cartridges.</para>
<para>If the HTML cartridge detects another enabled cartridge that might contribute to the data, it assigns its output to the container entity representing the HTML document; in cases where no other cartridge is likely, it makes statements using the primary entity URI as a subject instead.</para>
<para>If you wish to move the HTML cartridge back to the bottom of the list, you should also set passthrough_mode back to &quot;no&quot;.</para>
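<para>The subject selection described above can be sketched as below, in Python. How the cartridge actually detects a likely contributor is not described here, so the URL patterns, the cartridge records and the #this suffix for the primary entity are all hypothetical.</para>
<programlisting><![CDATA[
```python
import re


def other_cartridge_likely(url, cartridges):
    """True if any enabled cartridge's (hypothetical) URL pattern matches."""
    return any(
        c["enabled"] and re.search(c["pattern"], url) is not None
        for c in cartridges
    )


def choose_subject(url, cartridges):
    """Pick the subject URI for the statements the HTML cartridge makes."""
    container_uri = url                  # entity for the HTML document itself
    primary_entity_uri = url + "#this"   # assumed naming for the primary entity
    if other_cartridge_likely(url, cartridges):
        # Another cartridge may describe the primary entity, so the HTML
        # data is attached to the container document instead.
        return container_uri
    return primary_entity_uri


# Invented cartridge records, for illustration only.
CARTRIDGES = [
    {"name": "example-video", "pattern": r"^https?://video\.example\.org/", "enabled": True},
    {"name": "example-shop", "pattern": r"^https?://shop\.example\.org/", "enabled": False},
]

video_subject = choose_subject("http://video.example.org/watch/1", CARTRIDGES)
shop_subject = choose_subject("http://shop.example.org/item/2", CARTRIDGES)
blog_subject = choose_subject("http://blog.example.org/post/3", CARTRIDGES)

print(video_subject)
print(shop_subject)
print(blog_subject)
```
]]></programlisting>
<para>In this sketch a disabled cartridge never counts as a likely contributor, matching the requirement that the other cartridge be enabled.</para>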
<para> </para>
<bridgehead class="http://www.w3.org/1999/xhtml:h2">Options</bridgehead>
<para>The HTML+Variants cartridge supports the following options:</para>
<para> </para>
<itemizedlist mark="bullet" spacing="compact"><listitem>passthrough_mode=yes - whether to expect other cartridges to contribute </listitem>
<listitem>fallback-mode=no - whether to run only if no other cartridge has contributed </listitem>
<listitem>get-feeds=no - whether to include data from RSS and Atom feeds </listitem>
<listitem>preview-length=512 (deprecated) - how long a bibo:abstract to generate </listitem>
<listitem>add-html-meta=yes - add data extracted from &lt;meta&gt; tags </listitem>
<listitem>rdfa=yes - include data from RDFa </listitem>
<listitem>reify_html=1 - whether the core data should be reified </listitem>
<listitem>reify_html5md=1 - whether the HTML5-Microdata should be reified </listitem>
<listitem>reify_rdfa=1 - whether RDFa data should be reified </listitem>
<listitem>reify_turtle=1 - whether Turtle data embedded in &lt;script&gt; elements should be reified </listitem>
<listitem>reify_jsonld=1 - whether JSON-LD data embedded in &lt;script&gt; elements should be reified </listitem>
<listitem>reify_all_grddl=0 - whether all the GRDDL sources should be reified </listitem>
<listitem>loose=yes - look for latitude and longitude information in meta tags</listitem>
</itemizedlist><itemizedlist mark="bullet" spacing="compact"><listitem><figure><graphic fileref="VirtCartConfigHTML/VirtCartConfigHTML.png" /></figure> </listitem>
</itemizedlist></section></docbook>