
---+RDF Support in the Virtuoso DBMS

Orri Erling oerling{at}openlinksw.com <BR>
Ivan Mikhailov imikhailov{at}openlinksw.com

%TOC%

---++Abstract

This paper discusses RDF related work in the context of OpenLink Virtuoso, a general purpose relational / federated database and applications platform. We discuss adapting a relational engine for native RDF support with dedicated data types, bitmap indexing and SQL optimizer techniques. We further discuss mapping existing relational data into RDF for SPARQL access without converting the data into physical triples. We present conclusions and metrics as well as a number of use cases, from DBpedia to bioinformatics and collaborative web applications.

---++Introduction And Motivation

Virtuoso is a multi-protocol server providing ODBC/JDBC access to relational data stored either within Virtuoso itself or any combination of external relational databases. Besides catering for SQL clients, Virtuoso has a built-in HTTP server providing a DAV repository, SOAP and WS* protocol end points and dynamic web pages in a variety of scripting languages. Given this background and the present emergence of the semantic web, incorporating RDF functionality into the product is a logical next step. RDF data has been stored in relational databases since the inception of the model [1][8]. Performance considerations have however led to the development of custom RDF engines, e.g. RDF Gateway [7], Kowari [9] and others. Other vendors such as Oracle and OpenLink have opted for building a degree of native RDF support into an existing relational platform.

The RDF work on Virtuoso started by identifying problems of using an RDBMS for triple storage:
   * Data Types: RDF is typed at run time and IRIs must be distinct from other data.
   * Unknown data lengths: large objects mix with small scalar values in a manner not known at query compile time.
   * More permissive cast rules than in SQL.
   * Difficulty of computing query cost: normal SQL compiler statistics are not precise enough for optimizing a SPARQL query if all data is in a single table.
   * Efficient space utilization.
   * Need to map existing relational data into RDF and join between RDF data and relational data.

We shall discuss our response to all these challenges in the course of this paper.

---++Triple Storage

Virtuoso's initial storage solution is fairly conventional: a single table of four columns holds one quad, i.e. triple plus graph, per row. The columns are G for graph, P for predicate, S for subject and O for object. P, G and S are IRI ID's, for which we have a custom data type, distinguishable at run time from integer even though internally this is a 32 or 64 bit integer. The O column is of SQL type ANY, meaning any serializable SQL object, from scalar to array or user defined type instance. Indexing supports a lexicographic ordering of type ANY, meaning that with any two elements of compatible type, the order is that of the data type(s) in question with default collation.

Since O is a primary key part, we do not wish to have long O values repeated in the index. Hence O's of string type that are longer than 12 characters are assigned a unique ID and this ID is stored as the O of the quad table. For example, Oracle [10] has chosen to give a unique ID to all distinct O's, regardless of type. We however store short O values inline and assign ID's only to long ones.

Generally, triples should be locatable given the S or a value of O. To this effect, the table is represented as two covering indices, G, S, P, O and O, G, P, S. Since both indices contain all columns, the table is wholly represented by these two indices and no other persistent data structure needs to be associated with it. Also there is never a need for a lookup of the main row from an index leaf.
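The layout just described can be pictured with a short DDL sketch. The table and index names below are our own shorthand rather than the product's actual system schema, which the text does not spell out, so read this as an illustration only.

<verbatim>
-- Illustrative sketch only; names are assumptions, not Virtuoso's
-- internal definitions.  IRI_ID stands for the custom IRI id type,
-- O is of SQL type ANY (inlined scalar or the id of a long value).
create table RDF_QUAD (
  G IRI_ID,   -- graph
  S IRI_ID,   -- subject
  P IRI_ID,   -- predicate
  O any,      -- object
  primary key (G, S, P, O)
);
-- Second covering index so that triples are also locatable by O.
create index RDF_QUAD_OGPS on RDF_QUAD (O, G, P, S);
</verbatim>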
Using the Wikipedia data set [12] as sample data, we find that the O is on the average 9 bytes long, making for an average index entry length of 6 (overhead) + 3 * 4 (G, S, P) + 9 (O) = 27 bytes per index entry, multiplied by 2 because of having two indices.

We note however that since S is the last key part of O, G, P, S and it is an integer-like scalar, we can represent it as a bitmap, one bitmap per distinct O, G, P. With the Wikipedia data set, this causes the space consumption of the second index to drop to about a third of the first index.

We find that this index structure works well as long as the G is known. If the G is left unspecified, other representations have to be considered, as discussed below. For example, answering queries like

<verbatim>
graph <my-friends> {
    ?s sioc:knows <http://people.com/people#John> ,
                  <http://people.com/people#Mary> }
</verbatim>

the index structure allows the AND of the conditions to be calculated as a merge intersection of two sparse bitmaps.

The mapping between an IRI ID and the IRI is represented in two tables, one for the namespace prefixes and one for the local part of the name. The mapping between ID's of long O values and their full text is kept in a separate table, with the full text or its MD5 checksum as one key and the ID as primary key. This is similar to other implementations.

The type cast rules for comparison of data are different in SQL and SPARQL. SPARQL will silently fail where SQL signals an error. Virtuoso addresses this by providing a special QUIETCAST query hint. This simplifies queries and frees the developer from writing complex cast expressions in SQL, also enhancing freedom for query optimization.
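The text does not show the hint's spelling. As an assumption based on later Virtuoso documentation, it can be written as a statement-level option, roughly as below.

<verbatim>
-- Assumed spelling of the hint: with QUIETCAST, a comparison between
-- incompatible types yields no match instead of signalling a cast error.
select S from RDF_QUAD where O = 42 option (QUIETCAST);
</verbatim>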
Other special SPARQL oriented accommodations include allowing blobs as sorting or distinct keys and supporting the IN predicate as a union of exact matches. The latter is useful for example with FROM NAMED, where a G is specified as one of many.

*Compression*. We have implemented compression at two levels. First, within each database page, we store distinct values only once and eliminate common prefixes of strings. Without key compression, we get 75 bytes per triple with a billion-triple LUBM data set (LUBM scale 8000). With compression, we get 35 bytes per triple. Thus, key compression doubles the working set while sacrificing no random access performance. A single triple out of a billion can be located in less than 5 microseconds with or without key compression. We observe a doubling of the working set when using 32 bit IRI ID's. The benefits of compression are still greater when using 64 bit IRI ID's.

When applying gzip to database pages, we see a typical compression to a third, even after key compression. This is understandable since indices are by nature repetitive, even if the repeating parts are shortened by key compression. Over 99% of 8K pages filled to 90% compress to less than 3K with gzip at default compression settings. This does not improve the working set but saves disk. Detailed performance impact measurement is yet to be made.

*Alternative Index Layouts*. Most practical queries can be efficiently evaluated with the GSPO and OGPS indices. Some queries, such as ones that specify no graph, are however next to impossible to evaluate with any large data set. Thus we have experimented with a table holding G, S, P, O as a dependent part of a row id and made 4 single column bitmap indices for G, S, P and O. In this way, no combination of criteria is penalized. However, performing the bitmap AND of 4 given parts to check for the existence of a quad takes 2.5 times longer than the same check from a single 4 part index. The SQL optimizer can deal equally well with this index selection as with any other, thus this layout may prove preferable in some use cases due to having no disastrous worst case.

---++SPARQL and SQL

Virtuoso offers SPARQL inside SQL, somewhat similarly to Oracle's RDF_MATCH table function. A SPARQL subquery or derived table is accepted either as a top level SQL statement or wherever a subquery or derived table is accepted. Thus SPARQL inherits all the aggregation and grouping functions of SQL, as well as any built-in or user defined functions. Another benefit of this is that all supported CLI's work directly with SPARQL, with no modifications. For example, one may write a PHP web page querying the triple store using the PHP to ODBC bridge. The SPARQL text simply has to be prefixed with the SPARQL keyword to distinguish it from SQL. A SPARQL end point for HTTP is equally available.
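For instance, either of the following can be sent over an ordinary SQL connection; the graph IRI reuses the example host used later in this paper and the second form is a sketch of the derived-table usage described above.

<verbatim>
-- SPARQL passed through any SQL client, marked by the leading keyword;
-- the graph IRI is a made-up example.
sparql
prefix foaf: <http://xmlns.com/foaf/0.1/>
select ?s from <http://myhost/users> where { ?s a foaf:Person } limit 10;

-- The same mechanism as a derived table inside ordinary SQL, so SQL
-- grouping and aggregation apply to SPARQL results.
select count(*)
  from (sparql select ?s from <http://myhost/users> where { ?s ?p ?o }) as t;
</verbatim>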
Internally, SPARQL is translated into SQL at the time of parsing the query. If all triples are in one table, the translation is straightforward, with union becoming a SQL union and optional becoming a left outer join. Since outer joins can be nested to arbitrary depths inside derived tables in Virtuoso SQL, no special problems are encountered. The translator optimizes the data transferred between parts of the queries, so that variables needed only inside a derived table are not copied outside of it. If cardinalities are correctly predicted, the resulting execution plans are sensible. SPARQL features like construct and describe are implemented as user defined aggregates.

*SQL Cost Model and RDF Queries*. When all triples are stored in a single table, correct join order and join type decisions are difficult to make given only the table and column cardinalities for the RDF triple or quad table. Histograms for ranges of P, G, O, and S are also not useful. Our solution for this problem is to go look at the data itself when compiling the query. Since the SQL compiler is in the same process as the index hosting the data, this can be done whenever one or more leading key parts of an index are constants known at compile time. For example, in the previous example of people knowing both John and Mary, the G, P and O are known for two triples. A single lookup in log(n) time retrieves the first part of the bitmap for

<verbatim>
((G = <my-friends>) and (P = sioc:knows)
 and (O = <http://people.com/people#John>) )
</verbatim>

The entire bitmap may span multiple pages in the index tree but reading the first bits and knowing how many sibling leaves are referenced from upper levels of the tree with the same P, G, O allows calculating a ballpark cardinality for the P, G, O combination. The same estimate can be made either for the whole index, with no key part known, using a few random samples, or with any number of leading key parts given. While primarily motivated by RDF, the same technique works equally well with any relational index.

*Basic RDF Inferencing*. Much of basic T box inferencing such as subclasses and subproperties can be accomplished by query rewrite. We have integrated this capability directly in the Virtuoso SQL execution engine. With a query like

<verbatim>
select ?person where { ?person a lubm:Professor }
</verbatim>

we add an extra query graph node that will iterate over the subclasses of lubm:Professor and retrieve all persons that have any of these as rdf:type. When asking for the class of an IRI, we also return any superclasses. Thus the behavior is indistinguishable from having all the implied classes explicitly stored in the database.

For A box reasoning, Virtuoso has special support for owl:same-as. When either an O or S is compared with equality with an IRI, the IRI is expanded into the transitive closure of its same-as synonyms and each of these is tried in turn. Thus, when same-as expansion is enabled, the SQL query graph is transparently expanded to have an extra node joining each S or O to all synonyms of the given value. Thus,

<verbatim>
select ?lat where {
  <Berlin> has_latitude ?lat }
</verbatim>

will give the latitude of Berlin even if <Berlin> has no direct latitude but geo:Berlin does have a latitude and is declared to be owl:same-as <Berlin>.

The owl:same-as predicate of classes and properties can be handled in the T box through the same mechanism as subclasses and subproperties.
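In current Virtuoso releases this rewriting is switched on per query with pragmas roughly like the ones below. The paper itself does not give the syntax, so the rule set name, graph IRIs and prefix are placeholders.

<verbatim>
-- Assumed API (not given in the paper): build an inference rule set from
-- an ontology graph, then enable subclass/subproperty and same-as
-- expansion for one query.  All names below are made up.
DB.DBA.RDFS_RULE_SET ('lubm-rules', 'http://myhost/lubm-ontology');

sparql
define input:inference "lubm-rules"
define input:same-as "yes"
prefix lubm: <http://myhost/lubm-ontology#>
select ?person where { ?person a lubm:Professor };
</verbatim>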
*Data Manipulation*. Virtuoso supports the SPARUL SPARQL extension, compatible with JENA [8]. Updates can be run either transactionally or with automatic commit after each modified triple. The latter mode is good for large batch updates since rollback information does not have to be kept and locking is minimal.
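A minimal example of the extension in use; the graph and resource IRIs are invented for the illustration and the spelling follows Virtuoso's SPARUL dialect rather than wording from the paper.

<verbatim>
-- Hypothetical IRIs; runs either in a transaction or with per-triple
-- autocommit as described above.
sparql insert into graph <http://myhost/users>
  { <http://myhost/sys/users/alice>
      <http://xmlns.com/foaf/0.1/mbox> <mailto:alice@example.com> };

sparql delete from graph <http://myhost/users>
  { <http://myhost/sys/users/alice>
      <http://xmlns.com/foaf/0.1/mbox> <mailto:alice@example.com> };
</verbatim>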
*Full Text*. All or selected string valued objects can be full text indexed. Queries like

<verbatim>
select ?person from <people> where {
  ?person a person ; has_resume ?r .
  ?r bif:contains 'SQL and "semantic web"' }
</verbatim>

will use the text index for resolving the pseudo-predicate bif:contains.
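The paper does not show how the indexing is switched on for selected objects. As an assumption, current Virtuoso releases do it per graph and predicate with a rule call along these lines; the arguments are placeholders.

<verbatim>
-- Assumed administrative call with placeholder arguments: request full
-- text indexing of string objects in one graph, for all predicates.
DB.DBA.RDF_OBJ_FT_RULE_ADD ('http://myhost/people', null, 'people');
</verbatim>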
*Aggregates*. Basic SQL style aggregation is supported through queries like

<verbatim>
select ?product sum (?value) from <sales> where {
  <ACME> has_order ?o .
  ?o has_line ?ol .
  ?ol has_product ?product ;
      has_value ?value }
</verbatim>

This returns the total value of orders by ACME grouped by product. For SPARQL to compete with SQL for analytics, extensions such as returning expressions, quantified subqueries and the like are needed. The requirement for these is inevitable because financial data become available as RDF through the conversion of XBRL [13].

*RDF Sponge*. The Virtuoso SPARQL protocol end point can retrieve external resources for querying. Having retrieved an initial resource, it can automatically follow selected IRIs for retrieving additional resources. Several modes are possible: follow only selected links, such as sioc:see_also, or try dereferencing any intermediate query results, for example. Resources thus retrieved are kept in their private graphs or they can be merged into a common graph. When they are kept in private graphs, HTTP caching headers are observed for caching; the local copy of a retrieved remote graph is usually kept for some limited time. The sponge procedure is extensible so it can extract RDF data from non-RDF resources via microformat or other sorts of filters. This provides a common tool to traverse sets of interlinked documents such as personal FOAF files that refer to each other.

---++Mapping Legacy Relational Data into RDF for SPARQL Access

RDF and ontologies form the remaining piece of the enterprise data integration puzzle. Many disparate legacy systems may be projected onto a common ontology using different rules, providing instant content for the semantic web. One example of this is OpenLink's ongoing project of mapping popular Web 2.0 applications such as Wordpress, Mediawiki, phpBB and others onto SIOC through Virtuoso's RDF Views system.

The problem domain is well recognized, with work by D2RQ [2], SPASQL [5] and DBLP [3] among others. Virtuoso differs from these primarily in that it combines the mapping with native triple storage and may offer better distributed SQL query optimization through its long history as a SQL federated database.

In Virtuoso, an RDF mapping schema consists of declarations of one or more quad storages. The default quad storage declares that the system table RDF_QUAD consists of four columns (G, S, P and O) that contain fields of stored triples, using special formats that are suitable for arbitrary RDF nodes and literals. The storage can be extended as follows:

An IRI class defines that an SQL value or a tuple of SQL values can be converted into an IRI in a certain way, e.g., an IRI of a user account can be built from the user ID, a permalink of a blog post consists of host name, user name and post ID, etc. A conversion of this sort may be declared as a bijection so an IRI can be parsed back into the original SQL values. The compiler knows that a join on two IRIs calculated by the same IRI class can be replaced with a join on raw SQL values that can efficiently use native indexes of relational tables. It is also possible to declare one IRI class A as subClassOf another class B so the optimizer may simplify joins between values made by A and B if A is a bijection.

Most IRI classes are defined by format strings similar to the one used in the standard C sprintf function. Complex transformations may be specified by user-defined functions. In any case the definition may optionally provide a list of sprintf-style formats such that any IRI made by the IRI class always matches one of these formats. The SPARQL optimizer pays attention to the formats of created IRIs to eliminate joins between IRIs created by totally disjoint IRI classes. For two given sprintf format strings, the SPARQL optimizer can find a common subformat of the two or try to prove that no one IRI may match both formats.

<verbatim>
prefix : <http://www.openlinksw.com/schemas/oplsioc#>
create iri class :user-iri "http://myhost/sys/users/%s" ( in login_name varchar not null ) .
create iri class :blog-home "http://myhost/%s/home" ( in blog_home varchar not null ) .
create iri class :permalink "http://myhost/%s/%d" ( in blog_home varchar not null, in post_id integer not null ) .
make :user_iri subclass of :grantee_iri .
make :group_iri subclass of :grantee_iri .
</verbatim>

IRI classes describe how to format SQL values but do not specify the origin of those values. This part of the mapping declaration starts from a set of table aliases, somewhat similar to the FROM and WHERE clauses of an SQL SELECT statement. It lists some relational tables, assigns distinct aliases to them and provides logical conditions to join tables and to apply restrictions on table rows. When a SPARQL query should select relational data using some table aliases, the final SQL statement contains the related table names and all conditions that refer to used aliases, and does not refer to unused ones.

<verbatim>
from SYS_USERS as user
from SYS_BLOGS as blog
where (blog.OWNER_ID = user.U_ID)
</verbatim>

A quad map value describes how to compose one of the four fields of an RDF quad. It may be an RDF literal constant, an IRI constant or an IRI class with a list of columns of table aliases where SQL values come from. A special case of a value class is the identity class, which is simply marked by a table alias and a column name.

Four quad map values (for G, S, P and O) form a quad map pattern that specifies how the column values of table aliases are combined into an RDF quad. The quad map pattern can also specify restrictions on column values that can be mapped. E.g., the following pattern will map a join of SYS_USERS and SYS_BLOGS into quads with the :homepage predicate.

<verbatim>
graph <http://myhost/users>
subject :user-iri (user.U_ID)
predicate :homepage
object :blog-home (blog.HOMEPAGE)
where (not user.U_ACCOUNT_DISABLED) .
</verbatim>

Quad map patterns may be organized into trees. A quad map pattern may act as the root of a subtree if it specifies only some quad map values but not all four; other patterns of the subtree specify the rest. A typical use case is a root pattern that specifies only the graph value whereas every subordinate pattern specifies S, P and O and inherits G from the root, as below:

<verbatim>
graph <http://myhost/users> option (exclusive) {
    :user-iri (user.U_ID) rdf:type foaf:Person ;
      foaf:name user.U_FULL_NAME ;
      foaf:mbox user.U_E_MAIL ;
      foaf:homepage :blog-home (blog.HOMEPAGE) . }
</verbatim>

This grouping is not only syntactic sugar. In this example, the exclusive option of the root pattern permits the SPARQL optimizer to assume that the RDF graph contains only triples mapped by the four subordinates.
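To make the effect concrete, a query like the following (our own illustration, reusing the IRIs of the example above) is, when run against a quad storage containing this view, compiled into a SQL join over SYS_USERS and SYS_BLOGS rather than a scan of physical triples.

<verbatim>
-- Hypothetical query against the view above: answered directly from the
-- relational tables via the quad map patterns, no physical triples needed.
sparql
prefix foaf: <http://xmlns.com/foaf/0.1/>
select ?name ?home from <http://myhost/users>
where { ?u a foaf:Person ; foaf:name ?name ; foaf:homepage ?home };
</verbatim>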
A tree of a quad map pattern and all its subordinates is called an "RDF view" if the "root" pattern of the tree is not a subordinate of any other quad map pattern.

Quad map patterns can be named; these names are used to alter mapping rules without destroying and re-creating the whole mapping schema.

The top-level items of the data mapping metadata are quad storages. A quad storage is a named list of RDF views. A SPARQL query will be executed using only the quad patterns of the views of the specified quad storage. Declarations of IRI classes, value classes and quad patterns are shared between all quad storages of an RDF mapping schema, but any quad storage contains only a subset of all available quad patterns. Two quad storages are always defined: a default that is used if no storage is specified in the SPARQL query, and a storage that refers to the single table of physical quads.

The RDF mapping schema is stored as triples in a dedicated graph in the RDF_QUAD table so it can be queried via SPARQL or exported for debug/backup purposes.

---++Applications and Benchmarks

As of this writing, July 2007, the native Virtuoso triple store is available as a part of the Virtuoso open source and commercial offerings. The RDF Views system is part of the offering but access to remote relational data is limited to the commercial version.

Virtuoso has been used for hosting many of the data sets in the Linking Open Data Project [14], including DBpedia [15], MusicBrainz [16], Geonames [17], PingTheSemanticWeb [18] and others. The largest databases are in the single billions of triples.

The life sciences demonstration at WWW 2007 [19] by Science Commons was made on Virtuoso, running a 350 million triple database combining diverse biomedical data sets.

*Web 2.0 Applications*. We can presently host many popular web 2.0 applications in Virtuoso, with Virtuoso serving as the DBMS and also optionally as the PHP web server. We have presently mapped phpBB, Mediawiki and Drupal into SIOC with RDF Views.

*OpenLink Data Spaces (ODS)*. ODS is a web applications suite consisting of a blog, wiki, social network, news reader and other components. All the data managed by these applications is available for SPARQL querying as SIOC instance data. This is done through maintaining a copy of the relevant data as physical triples as well as through accessing the relational tables themselves via RDF Views.

*LUBM Benchmark*. Virtuoso has been benchmarked with loading the LUBM [4] data set. At a scale of 8000 universities, amounting to 1068 million triples, the data size without key compression is 75G all inclusive, and the load takes 23h 45m on a machine with 8G memory and two 2GHz dual core Intel Xeon processors. The loading takes advantage of SMP and parallel IO to disks. With key compression the data size drops to half. Loading speed for the LUBM data as RDF/XML is 23000 triples per second if all data fits in memory and 10000 triples per second with one disk out of 6 busy at all times. Loading speed for data in the Turtle syntax is up to 38000 triples per second.
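The paper does not name the bulk load API it used. As an assumption, the calls below show how such Turtle or RDF/XML loads are typically driven in current Virtuoso releases; the file names and target graph are placeholders.

<verbatim>
-- Assumed loader calls with placeholder file and graph names.
DB.DBA.TTLP (file_to_string_output ('lubm.ttl'), '', 'http://myhost/lubm');
DB.DBA.RDF_LOAD_RDFXML (file_to_string_output ('lubm.rdf'), '', 'http://myhost/lubm');
</verbatim>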
---++Future Directions

*Clustering*. Going from the billions into the tens and hundreds of billions of triples, the insert and query load needs to be shared among a number of machines. We are presently implementing clustering support to this effect. The present clustering scheme can work in a shared nothing setting, partitioning individual indices by hash of selected key parts. The clustering support is a generic RDBMS feature and will work equally well with all RDF index layouts. We have considered Oracle RAC [20] style cache fusion but have opted for hash partitioning in order to have a more predictable number of intra-cluster messages and for greater ease in combining messages.

The key observation is that an interprocess round-trip in a single SMP box takes 50 microseconds and finding a triple takes under five. Supposing a very fast interconnect and a cluster of two machines, the break-even point after which cluster parallelism wins over message delays is when a single message carries 10 lookups. Thus, batching operations is key to getting any benefit from a cluster, and having a fixed partition scheme makes this much more practical than a shared disk/cache fusion architecture such as Oracle RAC.

*Updating Relational Data by SPARUL Statements*. In many cases, an RDF view contains quad map patterns that map all columns of some table into triples in such a way that the sets of triples made from different columns are "obviously" pairwise disjoint and the invoked IRI classes are bijections. E.g., quad map patterns for RDF property tables usually satisfy these restrictions because different columns are for different predicates and column values are used as object literals unchanged. We are presently extending the SPARUL compiler and run-time in order to make such RDF views updatable.

The translation of a given RDF graph into SQL data manipulation statements begins with extracting all SQL values from all calculatable fields of triples and partitioning the graph into groups of triples, one group per distinct extracted primary key of some source table. Some triples may become members of more than one group, e.g., a triple may specify a relation between two table rows. After an integrity check, every group is converted into one insert or delete statement. The partitioning of N triples requires O(N ln N) operations and keeps data in memory, so it is bad for big dump/restore operations but quite efficient for transactions of limited size, like individual bookkeeping records, personal FOAF files etc.

---++Conclusion

Experience with Virtuoso has encountered most of the known issues of RDF storage and has shown that, without overwhelmingly large modifications, a relational engine can be molded to efficiently support RDF. This has also resulted in generic features which benefit the relational side of the product as well. The reason is that RDF makes relatively greater demands on a DBMS than relational applications dealing with the same data.

Applications such as www.pingthesemanticweb.com and ODS for social networks and on-line collaboration have proven to be good test-beds for the technology. Optimizing queries produced by expanding SPARQL into unions of multiple storage scenarios has proven to be complex and to need more work. Still, it remains possible to write short queries which are almost impossible to efficiently evaluate, especially if they join between data that may come from many alternate relational sources. In complex EDI scenarios, some restrictions on possible query types may have to be introduced.

The community working on RDF storage will need to work on interoperability, standard benchmarks and SPARQL end point self-description. Through the community addressing these issues, users may better assess which tool is best suited for which scale of problem, and vendors may provide support for query and storage federation in an Internet-scale multi-vendor semantic web infrastructure.

Further details on the SQL to RDF mapping and triple storage performance issues are found in separate papers on the http://virtuoso.openlinksw.com site.

---++References

   * [1] Beckett, D.: Redland RDF Application Framework. http://librdf.org/
   * [2] Bizer, C., Cyganiak, R., Garbers, J., Maresch, O.: D2RQ: Treating Non-RDF Databases as Virtual RDF Graphs. http://sites.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/
   * [3] Chen, H., Wang, Y., Wang, H. et al.: Towards a Semantic Web of Relational Databases: a Practical Semantic Toolkit and an In-Use Case from Traditional Chinese Medicine. http://iswc2006.semanticweb.org/items/Chen2006kx.pdf
   * [4] Guo, Y., Pan, Z., Heflin, J.: LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics 3(2), 2005, pp. 158-182. Available via http://www.websemanticsjournal.org/ps/pub/2005-16
   * [5] Prud'hommeaux, E.: SPASQL: SPARQL Support In MySQL. http://xtech06.usefulinc.com/schedule/paper/156
   * [6] 3store, an RDF triple store. http://sourceforge.net/projects/threestore
   * [7] Intellidimension RDF Gateway. http://www.intellidimension.com
   * [8] Jena Semantic Web Framework. http://jena.sourceforge.net/
   * [9] Northrop Grumman Corporation: Kowari Metastore. http://www.kowari.org/
   * [10] Oracle Semantic Technologies Center. http://www.oracle.com/technology/tech/semantic_technologies/index.html
   * [11] Semantically-Interlinked Online Communities. http://sioc-project.org/
   * [12] Wikipedia3: A Conversion of the English Wikipedia into RDF. http://labs.systemone.at/wikipedia3
   * [13] Extensible Business Reporting Language (XBRL) 2.1. http://www.xbrl.org/Specification/XBRL-RECOMMENDATION-2003-12-31+CorrectedErrata-2006-12-18.rtf
   * [14] Bizer, C., Heath, T., Ayers, D., Raimond, Y.: Interlinking Open Data on the Web. 4th European Semantic Web Conference. http://www.eswc2007.org/pdf/demo-pdf/LinkingOpenData.pdf
   * [15] Auer, S., Lehmann, J.: What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. 4th European Semantic Web Conference. http://www.informatik.uni-leipzig.de/~auer/publication/ExtractingSemantics.pdf
   * [16] About MusicBrainz. http://musicbrainz.org/doc/AboutMusicBrainz
   * [17] About Geonames. http://www.geonames.org/about.html
   * [18] Ping The Semantic Web. http://pingthesemanticweb.com/about.php
   * [19] Ruttenberg, A.: Harnessing the Semantic Web to Answer Scientific Questions. 16th International World Wide Web Conference. http://www.w3.org/2007/Talks/www2007-AnsweringScientificQuestions-Ruttenberg.pdf
   * [20] Oracle Real Application Clusters. http://www.oracle.com/database/rac_home.html

This document is also available in <a href="%ATTACHURLPATH%/rdfdb1.pdf" style="wikiautogen">PDF format</a>