High performance bulk-revision of existing data, on a par with simple bulk insertion of similar data, is best achieved by finding the difference (the "delta") between an existing graph or dataset and the new graph or dataset being loaded, and then applying that differential or "graph delta" to the quad store.
Given an existing dataset hosted by Virtuoso, identified by a named graph IRI, and one that is being loaded from N-Quad files in the filesystem, Virtuoso's bulk load process can automatically determine the differences between the two datasets and quickly apply the relevant INSERTs, UPDATEs, and DELETEs to the existing dataset.
The Virtuoso RDF Bulk Loader is told to use this "graph delta" load process by a special option called with_delete, applied in the ld_dir() or ld_dir_all() commands.
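The with_delete option is available in [...]. The with_delete functionality is not available in Virtuoso Open Source Edition (VOS).
Input data must be grouped by graph name. (This is comparable to GROUP BY graphname; no ORDER BY is necessary.)
The default transaction isolation must be READ COMMITTED.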
Ensure that the [Parameters] section of the Virtuoso configuration file (default, virtuoso.ini) includes the following entry, and then restart the Virtuoso server.
DefaultIsolation = 2
The following cluster settings are required for the with_delete option:
cl_exec ('__dbf_set (''lock_escalation_pct'', 200)');
cl_exec ('__dbf_set (''enable_distinct_key_dup_no_lock'', 1)');
Using the ld_dir() or ld_dir_all() commands as usual, set the target_graph argument to 'with_delete' for each dataset file specified in ll_file that is known to require an update/reload.
For example:
ld_dir ('/data8/2848260', '%.gz', 'with_delete');
ld_dir_all ('/data8/', '%.gz', 'with_delete');
Once all files are registered, run the rdf_loader_run() or cl_exec('rdf_ld_srv()') command to start the update/reload.
For fast parallel loading, you can invoke as many rdf_loader_run() or cl_exec('rdf_ld_srv()') commands as there are threads/cores available across the machines running the Virtuoso cluster, as would typically be done for the initial bulk load of the datasets.
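While the loaders run, progress can be checked from any isql session. The following is a minimal sketch, not part of the original procedure. It assumes the standard bulk loader bookkeeping table DB.DBA.load_list (the table that holds the ll_file column mentioned above) with its usual ll_graph, ll_state, and ll_error columns, and it assumes that with_delete loads are tracked there in the same way as ordinary bulk loads.

```sql
-- Assumption: files registered by ld_dir()/ld_dir_all() are listed in
-- DB.DBA.load_list, where ll_state is 0 (waiting), 1 (loading), or 2 (done).
-- List every file that is not finished yet or that reported an error.
SELECT ll_file, ll_graph, ll_state, ll_error
  FROM DB.DBA.load_list
 WHERE ll_state <> 2 OR ll_error IS NOT NULL;

-- Assumption: as with an ordinary bulk load, a checkpoint is issued once
-- nothing is pending, so that the loaded changes are made durable.
-- On a cluster this is typically run as cl_exec ('checkpoint');
checkpoint;
```

An empty result from the SELECT suggests that every registered file has been processed without error.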
Note that all RDF loader threads can be stopped using the following command, at which point all currently running threads will be allowed to complete and then exit:
rdf_load_stop()
A diagnostic log of the with_delete activity may be written to a file called g_log.txt on each cluster instance.
To enable the log:
cl_exec ('__dbf_set (''enable_g_replace_log'',1)')
To disable the log:
cl_exec ('__dbf_set (''enable_g_replace_log'',0)')