VOSVirtuoso6FAQ Virtuoso 6 FAQ Virtuoso 6 FAQ We have received various inquiries about high-end metadata stores. Requested features include: Scaling to trillions of triples Running on clusters of commodity servers Running in federated environments, possibly over wide-area networks Built-in inference Transactions Security Support for extra triple-level metadata, such as security attributes Questions we answer here -- What is the storage cost per triple? This depends on the index scheme. If indexed 2 ways, assuming that the graph will always be stated in queries, this is 31 bytes. With 4 indices, supporting queries where the graph can be left unspecified (i.e., triples from any graph will be considered in query evaluation), this is 39 bytes. The numbers are measured with the LUBM validation data set of 121K triples, with no full-text index on literals. With 4 indices and a full text index on all literals, the Billion Triples Challenge data set, 1115M triples, is about 120 GB of database pages. The database file size is larger due to space in reserve and other factors. 120 GB is the number to use when assessing RAM-to-disk ratio, i.e., how much RAM the system ought to have in order to provide good response. This data set is a heterogeneous collection including social network data, conversations harvested from the Web, DBpedia, Freebase, etc., with relatively numerous and long text literals. The numbers do not involve any database page stream compression such as gzip. Using such compression does not save in terms of RAM because cached pages must be kept uncompressed but will cut the disk usage to about half. What is the cost to insert a triple (for the insertion itself, as well as for updating any indices)? The more triples are inserted at a time, the faster this goes. Also, the more concurrent triple insertions are going on, the better the throughput. When loading data such as the US Census, a cluster of 2 commodity servers can insert up to 100,000 triples per second. A single 4-core machine can load 1 billion triples of LUBM data at an average rate of 36K triples per second. This is limited by disk. What is the cost to delete a triple (for the deletion itself, as well as for updating any indices)? The delete cost is similar to insert cost. What is the cost to search on a given property? If we are looking for equality matches, a single 2GHz core can do about 250,000 single triple random lookups per second as long as disk reads are not involved. If each triple requires a disk seek the number is naturally lower. Parallelism depends on the query. With a query like counting all x and y such that x knows y and y knows x, we get up to 3.4 million single-triple lookups-per-second on a cluster of 2 8-core Xeon servers. With complex nested sub-queries the parallelism may be less. Lookups involving ranges of values, such as ranges of geographical coordinates or dates use an index, since quads are indexed in a manner that collates in the natural order of the data type. What data types are supported? Virtuoso supports all RDF data types, including language-tagged and XML schema typed strings as native data types. Thus there is no overhead converting between RDF data types and types supported by the underlying DBMS. What inferencing is supported? Subclass, subproperty, identity by inverse-functional properties, and owl:sameAs are processed at run time if an inference context option is specified in the query. There is a general-purpose transitivity feature that can be used for a wide variety of graph algorithms. For example: SELECT ?friend WHERE { <alice> foaf:knows ?friend option (transitive) } - would return all the people directly or indirectly known by <alice>. Is the inferencing dynamic, or is an extra step required before inferencing can be used? The mentioned types of inferencing are enabled by a switch in the query and are done at run-time, with no step for materialization of entailed triples needed. The pattern - {?s a <type>} would iterate over all the RDFS subclasses of <type> and look for subjects with this type. The pattern {<thing> a ?class} will, if the match of ?class has superclasses, also return the superclasses even though the superclass membership is not physically stored for each superclass. Of course, one can always materialize entailed triples by running SPARQL/SPARUL statements to explicitly add any implied information. If two subjects have the same inverse functional property with the same value, they will be considered the same. For example, if two people have the same email address, they will be considered the same. If two subjects are declared to be owl:sameAs, either directly or through a chain of x owl:sameAs y, y owl:sameAs z, and so on, they will be considered the same. These features can be individually enabled and disabled. They all have some run time cost, hence they are optional. The advantage is that no preprocessing of the data itself is needed before querying, and the data does not get bigger. This is important, especially if the database is very large and queries touch only small parts of it. In such cases, materializing implied triples can be very costly. See discussion at E Pluribus Unum... Do you support full-text search? Virtuoso has an optional full-text index on RDF literals. Searching for text matches using the SPARQL regex feature is very inefficient in the best of cases. This is why Virtuoso offers a special bif:contains predicate similar to the SQL contains predicate of many relational databases. This supports a full-text query language with proximity, and/or/and-not, wildcards, etc. While the full-text index is a general-purpose SQL feature in Virtuoso, there is extra RDF-specific intelligence built into it. One can, for example, specify which properties are indexed, and within which graphs this applies. What programming interfaces are supported? Do you support standard SPARQL protocol? Virtuoso supports the standard SPARQL protocol. Virtuoso offers drivers for the Jena, Sesame, and Redland frameworks. These allow using Virtuoso's store and SPARQL implementation as the back end of Jena, Sesame, or Redland applications. Virtuoso will then do the query optimization and execution. Jena and Sesame drivers come standard; contact us about Redland. Virtuoso SPARQL can be used through any SQL call level interface (CLI) supported by Virtuoso (i.e., ODBC, JDBC, OLE-DB, ADO.NET, XMLA). All have suitable extensions for RDF specific data types such as IRIs and typed literals. In this way, one can write, for example, PHP web pages with SPARQL queries embedded, just using the SQL tools. Prefixing a SQL query with the keyword "sparql" will invoke SPARQL instead of SQL, through any SQL client API. How can data be partitioned across multiple servers? Virtuoso Cluster partitions each index of all tables containing RDF data separately. The partitioning is by hash. The result is that the data is evenly distributed over the selected number of servers. Immediately consecutive triples are generally in the same partition, since the low bits of IDs do not enter in into the partition hash. This means that key compression works well. Since RDF tables are in the end just SQL tables, SQL can be used for specifying a non-standard partitioning scheme. For example, one could dedicate one set of servers for one index, and another set for another index. Special cases might justify doing this. With very large deployments, using a degree of application-specific data structures may be advisable. See "Does Virtuoso support property tables" below. How many triples can a single server handle? With free-form data and text indexing enabled, 500M triples per 16G RAM can be a ballpark guideline. If the triples are very short and repetitive, like the LUBM test data, then 16G per one billion triples is a possibility. Much depends on the expected query load. If queries are simple lookups, then less memory per billion triples is needed. If queries will be complex (analytics, join sequences, and aggregations all over the data set), then relatively more RAM is necessary for good performance. The count of quads has little impact on performance as long as the working set fits in memory. If the working set is in memory, there may be 15-20% difference between a million and a billion triples. If the database must frequently go to disk, this degrades performance since one can easily do 2000 random accesses in memory in the time it takes to do one random access from disk. But working-set characteristics depend entirely on the application. Whether the quads in a store all belong to one graph or any number of graphs makes no difference. There are Virtuoso instances in regular online use with hundreds of millions of triples, such as DBpedia and the Neurocommons databases. What is the performance impact of going from the billion to the trillion triples? Performance dynamics change when going from a single server to a cluster. If each partition is around a billion triples in size, then the single triple lookup takes the same time, but there is cluster interconnect latency added to the mix. On the other hand, queries that touch multiple partitions or multiple triples in a partition will do this in parallel and usually with a single message per partition. Thus throughput is higher. In general terms, operations on a single triple at a time from a single thread are penalized and operations on hundreds or more triples at a time win. Multiuser throughput is generally better due to more cores and more memory, and latency is absorbed by having large numbers of concurrent requests. See Post about Virtuoso & SPARQL scalability. Do you support additional metadata for triples, such as time-stamps, security tags etc? Since quads (triple plus graph) are stored in a regular SQL table with special data types, changing the table layout to add a column is possible. This column would not however be visible to SPARQL without some extra tuning. For coarse grain provenance and security information, we recommend doing this at the graph level, where triples that belong together are tagged with the same provenance or security are in the same graph. The graph can then have the relevant metadata as its properties. If tagging at the single triple level is needed, this will most often not be needed for all triples. Hence altering the table for all triples may not be the best choice. Making a special table that has the graph, subject, predicate and object of the tagged triple as a key and the tag data as a dependent part may be more efficient. Also, this table could be more easily accessed from SPARQL. Using the RDF reification vocabulary is not recommended as a first choice but is possible without any alterations. Alterations of this nature are possible but we recommend contacting us for specifics. We can provide consultancy on the best way to do this for each application. Altering the storage layout without some extra support from us is not recommended. Should we use RDF for our large metadata store? What are the alternatives? If the application has high heterogeneity of schema and frequent need for adaptation, then RDF is recommended. The alternative is making a relational database. Making a custom non-RDF object-attribute-value representation on Virtuoso or some other RDBMS is possible but not recommended. The reason for this is that this would miss many of the optimizations made specifically for RDF, use of the SPARQL language, inference, compatibility with diverse browsers and front end tools, etc. Not to mention interoperability and joinability with the body of linked data. Even if the application is strictly private, using entity names and ontologies from the open world can still have advantages. If some customization to the quad (triple plus graph layout) is needed, we can provide consultancy on how to do this while staying within the general RDF framework and retaining all the interoperability benefits. How multithreaded is Virtuoso? All server and client components are multithreaded, using pthreads on Unix/Linux, Windows native on Windows. Multithread/multicore scalability is good; see BSBM In the case of Virtuoso Cluster, in order to have the maximum number of threads on a single query, we recommend that each server on the cluster be running one Virtuoso process per 1.2 cores. Can multiple servers run off a single shared disk database? This might be possible with some customization but this is not our preferred way. Instead, we can store selected indices in duplicate or more copies inside a clustered database. In this way, all servers can have their own disk. Each key of each index will belong to one partition but each partition will have more than one physical copy, each on a different server. The cluster query logic will perform the load balancing. On the update side, the cluster will automatically do a distributed transaction with two phase commit to keep the duplicates in sync. Can Virtuoso run on a SAN? Yes. Unlike Oracle RAC, for example, Virtuoso Cluster does not require a SAN. Each server has its own database files and is solely responsible for these. In this way, having shared disk among all servers is not required. Running on a SAN may still be desirable for administration reasons. If using a SAN, the connection to the SAN should be high performance, such as Infiniband. How does Virtuoso join across partitions? Partitioning is entirely transparent to the application. Virtuoso has a highly optimized message-flow between cluster nodes that combines operations into large batches and evaluates conditions close to the data. See Post about Virtuoso & Web Scale RDF. Does Virtuoso support federated triple stores? If there are multiple SPARQL end points, can Virtuoso be used to do queries joining between these? This is a planned extension. The logic for optimizing message flow between multiple end-points on a wide-area network is similar to the logic for message-optimization on a cluster. This will allow submitting a query with a list of end-points. The query will then consider triples from each of the end points, as if the content of all the end points were in a single store. End-point meta information, such as VoID descriptions of the graphs in the end-points, may be used to avoid sending queries to end points that are known not to have a certain type of data. How many servers can a cluster contain? There is no fixed limit. If you have a large cluster installed, you can try Virtuoso there. Having an even point-to-point latency is desirable. How do I reconfigure a cluster, adding and removing machines, etc? We are working on a system whereby servers can be added and removed from a cluster during operation and no repartitioning of the data is needed. In the first release, the number of server processes that make up the cluster is set when creating the database. These processes with their database files can then be moved between machines but this requires stopping the cluster and updating configuration files. How will Virtuoso handle regional clusters? Performance of a cluster depends on the latency and bandwidth of the interconnect. At least dual 1Gbit ethernet is recommended for each node. Thus a cluster should be on a single local or system area network. If regional copies are needed, we would replicate between clusters by asynchronous log shipping. This requires some custom engineering. When a transaction is committed at one site, it is logged and sent to the subscribing sites if they are online. If there is no connection, the subscribing sites will get the data from the log. This scheme now works between single Virtuoso servers, and needs some custom development to be adapted to clusters. If replicating all the data of one site to another site is not possible, then application logic should be involved. Also, if consolidated queries should be made against large, geographically-separated clusters, then it is best to query them separately and merge the results in the application. All depends on the application level rules on where data resides. Is there a mechanism for terminating long running queries? Virtuoso SPARQL and SQL offer an "anytime" option that will return partial results after a configurable timeout. In this way, queries will return in a predictable time and indicate whether the results are complete or not, as well as give a summary of resource utilization. This is especially useful for publishing a SPARQL endpoint where a single long running query could impact the performance of the whole system. This timeout significantly reduces the risk of denial of service. This is also more user-friendly than simply timing-out a query after a set period and returning an error. With the anytime option, the user gets a feel for what data may exist, including whether any data exists at all. This feature works with arbitrarily complex queries, including aggregation, GROUP BY, ORDER BY, transitivity, etc. Since the Virtuoso SPARQL endpoint supports open authentication (OAuth), the authentication can be used for setting timeouts, so as to give different service to different users. It is also possible to set a timeout that will simply abort a query or an update transaction if it fails to terminate in a set time. Disconnecting the client from the server will also terminate any processing on behalf of that client, regardless of timeout settings. The SQL client call-level interfaces (ODBC, JDBC, OLE-DB, ADO.NET, XMLA) each support a cancel call that can terminate a long running query from the application, without needing to disconnect. Can the user be asynchronously notified when a long running query terminates? There is no off-the-shelf API for this but making an adaptation of the SPARQL endpoint that could proceed after the client disconnected and, for example, could send results by email is trivial. Since SOAP and REST Web services can be programmed directly in Virtuoso's stored procedure language, implementing and exposing this type of application logic is easy. How many concurrent queries can Virtuoso handle? There is no set limit. As with any DBMS, response times get longer if there is severe congestion. For example, having 2 or 3 concurrent queries per core is a good performance point which will keep all parts of the system busy. Having more than this is possible but will not increase overall throughput. With a cluster, each server has both HTTP and SQL listeners, so clients can be evenly spread across all nodes. In a heavy traffic Web application, it is best to have a load balancer in front of the HTTP endpoints to divide the connections among the servers and to keep some cap on the number of concurrently running requests, enforcing a maximum request-rate per client IP address, etc. What is the relative performance of SPARQL queries vs native relational queries? This is application dependent. In Virtuoso, SPARQL and SQL share the same query execution engine, query optimizer, and cost model. If data is highly regular (i.e., a good fit for relational representation), and if queries typically access most of the row, then SQL will be more efficient. If queries are unpredictable, data is ragged, schema changes frequent, or inference is needed, then RDF will do relatively better. The recent Berlin SPARQL Benchmark shows some figures comparing Virtuoso SQL and SPARQL and SPARQL in front of relational representation. However, the test workload is heavily biased in favor of relational. See also BSBM: MySQL vs Virtuoso. With the TPC-H workload, relationally stored data, and SPARQL mapped to SQL, we find that with about half the queries there is no significant cost to SPARQL. With some queries there is additional overhead because the mapping does not produce a SQL query identical to that specified in the benchmark. Does Virtuoso Support Property Tables? For large applications, we would recommend RDF whenever there is significant variability of schema, but would still use an application-specific, relational style representation for those parts of the data that are regular in format. This is possible without loss of flexibility for the variable-schema part. However, this will introduce relational-style restrictions on the regular data; for example, a person could only have a single date-of-birth by design. In many cases, such restrictions are quite acceptable. Querying will still take place in SPARQL, and the representation will be transparent. A relational table where the primary key is the RDF subject and where columns represent single-valued properties is usually called a property table. These can be defined in a manner similar to defining RDF mappings of relational tables. What performance metrics does Virtuoso offer? There is an extensive array of performance metrics. This includes: Cluster status summary with thread counts, CPU utilization, interconnect traffic, clean and dirty cache pages, virtual memory swapping warning, etc. This is either a cluster total or a total with breakdown per cluster node. Disk access, lock contention, general concurrency, and access count per index Statistics on memory usage for disk caching index-by-index, cache replacement statistics, disk random and sequential read times Count of random, sequential index access, disk access, lock contention, cluster interconnect traffic per query/client Detailed query-execution plans are available through the explain function What support do you provide for concurrent/multithreaded operation? Is your interface thread-safe? All client interfaces and server-side processes are multithreaded. As usual, each thread of an application should use a different connection to the database. What level of ACID properties is supported? Virtuoso supports all 4 isolation levels from dirty read to serializable, for both relational and RDF data. The recommended default isolation is read-committed, which offers a clean historical read of data that has uncommitted updates. This mode is similar to the Oracle default isolation and guarantees that no uncommitted data is seen, and that no read will block waiting for a lock held by another client. There is transaction logging and roll forward recovery, with two phase commit used in Virtuoso Cluster if an update transaction modifies more than one server. For RDF workloads which typically are not transactional and have large bulk loads, we recommend running in a "row autocommit" mode without transaction logging. This virtually eliminates log contention but still guarantees consistent results of multithreaded bulk loads. Setting this up requires some consultancy and custom development but is well worthwhile for large projects. Do you provide the ability to atomically add a set of triples, where either all are added or none are added? Yes. Doing this with millions of triples per transaction may run out of rollback space. Also, there is risk of deadlock if multiple such inserts run at the same time. For good concurrency, the inserts should be of moderate size. As usual, deadlocks are resolved by aborting one of the conflicting transactions. Do you provide the ability to add a set of triples, respecting the isolation property (so concurrent accessors either see none of the triple values, or all of them)? Yes. The reading client should specify serializable isolation and the inserting client should perform the insert as a transaction, no row autocommit mode. What is the time to start a database, create/open a graph? Starting a Virtuoso server process takes a few seconds. Making a new graph takes no time beyond the time to insert the triples into it. Once the server process(es) are running, all the data is online. With high-traffic applications, reaching cruising speed may sometimes take a long time, specially if the load is random-access intensive. Filling gigabytes of RAM with cached disk pages takes a long time if done a page at a time. To alleviate this, Virtuoso pre-reads 2MB-sized extents instead of single pages if there is repeated access to the same extent within a short time. Thus cache warm-up times are shortened. What sort of security features are built into Virtuoso? For SQL, we have the standard role-based security and an Oracle-style row-level security (policy) feature. For SPARQL, users may have read or update roles at the level of the quad store. With RDF, a graph may be owned by a user. The user may specify read and write privileges on the graph. These are then enforced for SPARUL (the SPARQL update language) and SPARQL. When an RDF graph is based on relationally stored data in Virtuoso or another RDBMS through Virtuoso's SQL federation feature (i.e., if the graph is an Linked Data View of underlying SQL data), then all relational security controls apply. Further, due to the dual-nature of Virtuoso, sophisticated ontology-based security models are feasible. Such models are not currently used by default, but they are achievable with our consultancy.