Virtuoso HTML+Variants Cartridge Configuration

Overview

The HTML and Variants extractor cartridge handles various document formats, including HTML, XHTML and pure RDF (RDF/XML, Turtle, N3).

RDF sources are ingested directly.

Sources that look like HTML are passed through HTML Tidy before further extraction can take place.

Many dialects for embedding microdata (collectively known as "data-islands") are supported.

Certain well-known domain-specific standards (such as Twitter Cards) are also supported.

Core HTML

At its core, the HTML+Variants cartridge identifies the following aspects of a webpage:

In a second extraction, the following are also supported:

Because the mapping between an HTML document and a logical primary entity is not always clear, there is an option whereby

Data Islands

There are many ways in which an HTML document can contain embedded data, termed data-islands.

Because of its significance, HTML5 Microdata is regarded as an island in its own right, although it is processed along with other formats in the GRDDL loop, a second island encompassing:

Recent W3C standards have proposed the use of the HTML <script type="text/turtle">...</script> tag for storing Turtle RDF triples.

Relatedly, Google promotes the use of the same HTML script element, in the form <script type="application/ld+json">...</script>, for storing data in JSON-LD format.
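To illustrate what locating these two script-based islands involves, here is a minimal Python sketch using only the standard library. It is not the cartridge's own implementation: the names ISLAND_TYPES, IslandCollector and find_islands are invented for this example, and the sketch only collects the raw island text without parsing the Turtle or JSON-LD.

```python
from html.parser import HTMLParser

# MIME types treated as data-islands in this sketch (an assumption for
# illustration; the cartridge supports further island kinds).
ISLAND_TYPES = {"text/turtle", "application/ld+json"}


class IslandCollector(HTMLParser):
    """Collect the raw text of <script> elements whose type marks a data-island."""

    def __init__(self):
        super().__init__()
        self.islands = []      # list of (mime_type, text) pairs
        self._current = None   # MIME type of the island being read, if any
        self._chunks = []      # text fragments of the island being read

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            mime = dict(attrs).get("type")
            if mime in ISLAND_TYPES:
                self._current = mime
                self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._current is not None:
            self.islands.append((self._current, "".join(self._chunks).strip()))
            self._current = None


def find_islands(html):
    """Return (mime_type, text) pairs for every script-based island in html."""
    collector = IslandCollector()
    collector.feed(html)
    collector.close()
    return collector.islands
```

Calling find_islands on a page yields one pair per matching script element, in document order; scripts of any other type, such as ordinary JavaScript, are ignored.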

Reification

Data in each of the above islands can be reified (enabled by default, except for the plethora of GRDDL microformats): for every statement extracted, an rdf:Statement entity is created summarizing the statement's subject, predicate and object/value, along with a pointer to the data-island whence it came.
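The shape of the reified output can be sketched in Python as follows. The rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are standard RDF vocabulary; everything else is an assumption made for illustration, in particular the FROM_ISLAND predicate, the statement URI scheme and the reify function itself, none of which is taken from the cartridge.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical predicate linking a statement to its data-island; the
# vocabulary the cartridge actually uses is not given in this document.
FROM_ISLAND = "http://example.org/vocab#fromIsland"


def reify(triples, island_uri, base):
    """Return the triples describing one rdf:Statement per input triple.

    triples    -- iterable of (subject, predicate, object) tuples
    island_uri -- URI identifying the data-island the triples came from
    base       -- URI used to mint statement identifiers (assumed scheme)
    """
    out = []
    for n, (s, p, o) in enumerate(triples, start=1):
        stmt = f"{base}#stmt{n}"
        out.extend([
            (stmt, RDF + "type", RDF + "Statement"),
            (stmt, RDF + "subject", s),
            (stmt, RDF + "predicate", p),
            (stmt, RDF + "object", o),
            (stmt, FROM_ISLAND, island_uri),
        ])
    return out
```

Each extracted statement therefore costs five additional triples in this sketch, which gives a sense of why reification might be left off for high-volume sources.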

Enabled by default, this reification can be disabled selectively for

Passthrough Mode

Previously, the HTML+Variants cartridge was regarded as a fall-back, invoked only in cases where no other extractor cartridge could handle the document.

This has changed: by default, the HTML cartridge is now the first to be invoked, which requires its exit-status to be set to permit control to flow into other cartridges.

If the HTML cartridge detects another enabled cartridge that might contribute to the data, it assigns its output to the container entity representing the HTML document; in cases where no other cartridge is likely, it makes statements using the primary entity URI as a subject instead.
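The subject-selection rule just described can be summarized in a few lines of Python. This is only a sketch of the decision as stated in the paragraph above: the function name and the representation of peer cartridges as (enabled, might_contribute) pairs are hypothetical, and how the cartridge actually judges whether another cartridge "might contribute" is not specified here.

```python
def choose_subject(container_uri, primary_uri, other_cartridges):
    """Pick the subject URI for statements extracted from an HTML document.

    other_cartridges -- iterable of (enabled, might_contribute) boolean
    pairs, a hypothetical stand-in for however peers are inspected.
    """
    for enabled, might_contribute in other_cartridges:
        if enabled and might_contribute:
            # Another cartridge may add data: attach output to the
            # container entity representing the HTML document.
            return container_uri
    # No other cartridge is likely: describe the primary entity directly.
    return primary_uri
```

A cartridge that is enabled but irrelevant to the document, or relevant but disabled, does not change the outcome; both conditions must hold for the container entity to be used.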

If you wish to move the HTML cartridge back to the bottom of the list, you should also change the passthrough back to `no'.

Options

The HTML+Variants cartridge supports the following options: