Delta-aware bulk loading of datasets into Virtuoso

Why
What
How

Prerequisites
Basic usage
Diagnostics

Why

High-performance bulk revision of existing data, on a par with simple bulk insertion of similar data, is best achieved by computing the difference (the "delta") between the existing graph or dataset and the new graph or dataset being loaded, and then applying that "graph delta" to the quad store.

What

Given an existing dataset hosted by Virtuoso and identified by a named graph IRI, and a new dataset being loaded from N-Quad files in the filesystem, Virtuoso's bulk load process can automatically determine the differences between the two datasets and quickly apply the relevant INSERTs, UPDATEs, and DELETEs to the existing dataset.

The Virtuoso RDF Bulk Loader is told to use this "graph delta" load process through a special option called with_delete, which is passed to the ld_dir() or ld_dir_all() commands.
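
As a conceptual illustration only (this is not Virtuoso's internal implementation), a graph delta can be thought of as two set differences between the existing statements and the incoming ones. The function name `graph_delta` and the tuple representation of triples are assumptions made for this sketch; an "update" here is simply a delete paired with an insert.

```python
def graph_delta(existing, incoming):
    """Return (to_insert, to_delete) for two collections of (s, p, o) tuples.

    Hypothetical helper: it only illustrates the idea of a graph delta.
    """
    existing = set(existing)
    incoming = set(incoming)
    to_insert = incoming - existing   # statements only in the new version
    to_delete = existing - incoming   # statements only in the old version
    return to_insert, to_delete


old = {("s", "p", "o1"), ("s", "p", "o2")}
new = {("s", "p", "o2"), ("s", "p", "o3")}
inserts, deletes = graph_delta(old, new)
```

Statements present in both versions (here, the `o2` triple) are left untouched, which is why applying a delta can be much cheaper than dropping and reloading a whole graph.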

How

Prerequisites

The with_delete option is available in:
- Virtuoso Enterprise Edition Release 6.x, version 06.04.3134 or later, in cluster mode only
- Virtuoso Enterprise Edition Release 7.0 and later, in both cluster and single-server mode

Note: with_delete functionality is not available in Virtuoso Open Source Edition (VOS).

The datasets must be supplied as N-Quad files in which every graph name is specified within the dataset.
- Graphs may appear in any order, and one file may contain multiple graphs. However, all triples of a given graph must be contiguous; triples from different graphs cannot be intermingled. (In SQL terms, GROUP BY graphname; no ORDER BY is necessary.)
- All triples of a given graph must be in the same file. If a graph has triples in multiple files, the triples from the last file loaded will make up the entire updated graph.
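
The two file-layout rules above can be checked before loading. The following is a minimal sketch, not a Virtuoso tool: `graph_of` and `check_files` are hypothetical helpers, and the parsing is deliberately naive. It assumes one statement per line, four terms per statement, a terminating `.`, and no trailing comments.

```python
def graph_of(line):
    """Return the graph label of one N-Quads line, or None for blanks/comments."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    if text.endswith("."):
        text = text[:-1].rstrip()          # drop the statement terminator
    if len(text.split()) < 4:
        raise ValueError(f"fewer than four terms in: {line!r}")
    label = text.rsplit(None, 1)[-1]       # graph label is the last term
    is_iri = label.startswith("<") and label.endswith(">")
    if not is_iri and not label.startswith("_:"):
        raise ValueError(f"no graph label found in: {line!r}")
    return label


def check_files(files):
    """files maps a file name to an iterable of lines; returns a list of problems."""
    problems = []
    owner = {}                              # graph label -> first file seen in
    for name, lines in files.items():
        closed = set()                      # graphs whose run already ended here
        current = None
        for number, line in enumerate(lines, start=1):
            graph = graph_of(line)
            if graph is None or graph == current:
                continue
            if graph in closed:
                problems.append(f"{name}:{number}: graph {graph} is not contiguous")
            if current is not None:
                closed.add(current)
            current = graph
            first = owner.setdefault(graph, name)
            if first != name:
                problems.append(f"{name}:{number}: graph {graph} also appears in {first}")
    return problems


sample = {
    "a.nq": ["<s> <p> <o1> <g1> .", "<s> <p> <o2> <g2> .", "<s> <p> <o3> <g1> ."],
    "b.nq": ['<s> <p> "x y" <g2> .'],
}
problems = check_files(sample)
```

For compressed inputs, the lines could be supplied with `gzip.open(path, "rt", encoding="utf-8")` from the Python standard library. A graph that appears in more than one file is reported regardless of whether its triples differ, which is stricter than necessary but catches the problematic cases.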

Virtuoso must be allocated at least 200 bytes of RAM per quad in the dataset being loaded, so loading large graphs with this option can have a significant impact on Virtuoso's memory use.
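
The 200-bytes-per-quad minimum translates into a simple lower bound on memory. The sketch below is plain arithmetic; the helper names are hypothetical, and real memory needs may be higher than this minimum.

```python
BYTES_PER_QUAD = 200  # minimum per-quad figure stated in the prerequisites


def min_ram_bytes(quad_count):
    """Lower bound on RAM (in bytes) for a with_delete load of quad_count quads."""
    return quad_count * BYTES_PER_QUAD


def as_gib(byte_count):
    """Convert bytes to GiB (2**30 bytes)."""
    return byte_count / 2**30


needed = min_ram_bytes(1_000_000_000)   # one billion quads
needed_gib = as_gib(needed)
```

For one billion quads this gives 200,000,000,000 bytes, or roughly 186 GiB.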

The Virtuoso server must be running with a default transaction isolation level of 2 (READ COMMITTED). Ensure that the [Parameters] section of the Virtuoso configuration file (by default, virtuoso.ini) includes the following entry, then restart the Virtuoso server.
```
DefaultIsolation = 2
```
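
The setting can be checked from a script before starting a load. This is a minimal sketch using Python's standard `configparser`; the helper name `default_isolation` is hypothetical, and it assumes the configuration file parses as a conventional INI file without inline comments on the relevant line.

```python
import configparser


def default_isolation(ini_text):
    """Return DefaultIsolation from the [Parameters] section, or None if absent."""
    # strict=False tolerates repeated keys; interpolation=None leaves '%' alone.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(ini_text)
    return parser.getint("Parameters", "DefaultIsolation", fallback=None)


configured = default_isolation("[Parameters]\nDefaultIsolation = 2\n")
missing = default_isolation("[Parameters]\nServerPort = 1111\n")
```

In practice the text would be read from the server's actual configuration file, and any value other than 2 (including a missing entry) would mean the file needs editing and the server restarting.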

The following lock mode settings should be set before using the with_delete option:


cl_exec ('__dbf_set (''lock_escalation_pct'', 200)');
cl_exec ('__dbf_set (''enable_distinct_key_dup_no_lock'', 1)');

The dataset files must not contain multiple graphs that have the same name but different triples. Otherwise the resulting triple counts are unpredictable, because they depend on which dataset file is loaded on which thread, and that assignment is non-deterministic.

Basic usage

Use the ld_dir() or ld_dir_all() commands as usual, but set the target_graph argument to 'with_delete' for each dataset file specified in ll_file that is known to require an update/reload.

For example:

ld_dir ('/data8/2848260', '%.gz', 'with_delete');
ld_dir_all ('/data8/', '%.gz', 'with_delete');

Once all dataset files have been registered, run the rdf_loader_run() or cl_exec('rdf_ld_srv()') command to start the update/reload.

For fast parallel loading, you can invoke as many rdf_loader_run() or cl_exec('rdf_ld_srv()') commands as there are threads/cores available across the machines running the Virtuoso cluster, just as would typically be done for the initial bulk load of the datasets.
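
One way to start several loaders in parallel is to launch one `isql` client per loader. The sketch below only builds and runs the command lines; it assumes the `isql` client is on the PATH and accepts an `exec=` argument, and the port, user name, and password shown are placeholders, not values from this document. The helper names are hypothetical.

```python
import subprocess


def loader_commands(workers, port="1111", user="dba", password="dba", cluster=False):
    """Build one isql command line per loader; nothing is executed here."""
    statement = "cl_exec('rdf_ld_srv()');" if cluster else "rdf_loader_run();"
    return [["isql", port, user, password, f"exec={statement}"]
            for _ in range(workers)]


def run_loaders(commands):
    """Start all loaders at once, wait for them, and return their exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]


single_server = loader_commands(3)
clustered = loader_commands(1, cluster=True)
```

Calling `run_loaders(single_server)` would then start three concurrent loaders against a running server. The number of workers is best chosen from the available cores, as described above.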

All RDF loader threads can be stopped using the following command. Threads that are currently running are allowed to complete and then exit:

rdf_load_stop();

Diagnostics

A diagnostic log of the with_delete activity may be written to a file called g_log.txt on each cluster instance.

To enable this log, run the following command:


cl_exec ('__dbf_set (''enable_g_replace_log'', 1)');

To disable this log, run the following command:


cl_exec ('__dbf_set (''enable_g_replace_log'', 0)');
