VOSArticleBISPARQL2 Business Intelligence Extensions for SPARQLBusiness Intelligence Extensions for SPARQL Orri Erling (Program Manager, OpenLink Virtuoso) and Ivan Mikhailov (Lead Developer, OpenLink Virtuoso).OpenLink Software, 10 Burlington Mall Road Suite 265 Burlington, MA 01803 U.S.A http://www.openlinksw.com/ Abstract We believe that the possibility to use SPARQL as a front end to heterogenous data without significant cost in performance or expressive power is key to RDF taking its rightful place as the lingua franca of data-integration. To this effect, we demonstrate how RDF and SPARQL can tackle a standard relational workload. We discuss extending SPARQL for business intelligence (BI) workloads and relate experiences on running SPARQL against relational and native RDF databases. We use the well known TPC-H benchmark as our reference schema and workload. We define a mapping of the TPC-H schema to RDF and restate the queries as BI extended SPARQL. To this effect, we define aggregation and nested queries for SPARQL. We demonstrate that it is possible to perform the TPC-H workload restated in SPARQL against an existing RDBMS without loss of performance or expressivity and without changes to the RDBMS. Introduction and Motivation RDF promises to be a top-level representation for data extracted or dynamically mapped from any conceivable source. Thus, RDF's chief promise is in the field of information integration, analysis and discovery. Yet it is difficult to imagine any business reporting, let alone more complex information integration task that would not involve aggregating and grouping. As a data access and data integration vendor, OpenLink has a natural interest in seeing SPARQL succeed as a top level language for answering business questions on data mapped from any present day data warehouse or other repository. This potential role of SPARQL is however fundamentally undermined if SPARQL cannot perform any part of the database industry's baseline business intelligence benchmark, TPC-H. To this effect, we have extended SPARQL with expressions in results, aggregates and grouping and derived tables. These extensions allow a straightforward translation of arbitrary SQL queries to SPARQL. We call this extended SPARQL "SPARQL BI". We demonstrate the operation of SPARQL BI versions of TPC-H queries on relational data managed by Virtuoso and Oracle. We also demonstrate the same workload on the same data stored as RDF in Virtuoso. Test Data We use the TPC-H schema mapped to RDF in all our examples. The table names are directly converted to classes and the column names are directly converted to predicates in namespace http://www.openlinksw.com/schemas/TPC-H. The prefix tpch is used to refer to this namespace throughout the paper.SPARQL Extensions Expressions In its proposed recommendation form, SPARQL does not allow returning any value that is not retrieved through a triple pattern. Expressions are only allowed in filters but cannot be returned. We lift this restriction by allowing expressions in the result set. Consider: SELECT (?extendedprice * (1 - ?discount)) WHERE { ?line a tpch:lineitem ; tpch:lineextendedprice ?extendedprice ; tpch:linediscount ?discount } We can shorten the notation as SELECT ( ?line+>tpch:extendedprice * ( 1 - ?line+>tpch:discount ) ) WHERE { ?line a tpch:lineitem } The +> (dereference) operator allows referring to a property without naming it as a variable. This is exactly equivalent to having a triple pattern binding a variable to the mentioned property of the subject within the group pattern enclosing the reference. For a select, the enclosing group pattern is considered to be the top level pattern of the where clause or in the event of a union, the top level of each term of the union. Each distinct dereference adds exactly one triple pattern to the enclosing group pattern, thus multiple uses +> do not each add a triple pattern. Having multiple copies of an identical pattern might lead to changes in cardinality if multiple input graphs were being considered. If a line item had multiple discounts or extended prices, then we would get the cartesian product of both. If a property referenced via +> is absent, the expression does not get evaluated in the first place. The optional dereference operator |> will produce an unbound value if the property does not exist. Further, mentioning the same chain of dereferences multiple times in the same group pattern will not cause redundant triple patterns to be added or result in more joining that is necessary. We further allow expressions in the place of variables in triple patterns. To scope the above query to orders by customers in France, we could write: SELECT ( ?li+>tpch:extendedprice * ?li+>tpch:discount ) WHERE { ?li a tpch:lineitem . ?li+>tpch:l_orderkey+>tpch:o_custkey+>tpch:c_nationkey tpch:n_name "France" } The sequence of dereferences expands into triple patterns, as in: ... ?li tpch:l_orderkey ?v1 . ?v1 tpch:o_custkey ?v2 . ?v2 tpch:c_nationkey ?v3 . ?v3 tpch:n_name "France" . Aggregation We introduce the SUM, COUNT, AVG, MIN, and MAX aggregate functions from SQL. Their semantics with respect to NULL are inherited from SQL. To count result rows without regard to any value being defined, COUNT (*) is introduced. If grouping is desired, aggregate expressions can be combined with non- aggregate expressions in a selection list. The non-aggregate expressions will function as grouping columns, i.e., the aggregates are calculated for each distinct combination of the grouping columns. No special GROUP BY clause is needed. SELECT ?l+>tpch:l_linestatus COUNT(*) SUM(?l+>tpch:extendedprice) WHERE { l a tpch:lineitem } gives the count and total value of lineitems for each distinct lineitem status. User defined aggregates from Virtuoso SQL can be used in SPARQL as well, using the sql: namespace. Nesting of Queries and Named Results SQL allows nesting queries, in effect treating the evaluation of a query as a table (derived table) or as a value in an expression (scalar subquery). We allow embedding a SPARQL SELECT in the place of a triple pattern. The syntax is as in -- SELECT ?line WHERE { ?line a tpch:lineitem . { SELECT MAX ( ?l2+>tpch:extendedprice ) AS ?maxprice WHERE { ?l2 a tpch:lineitem } } . FILTER ( line+>tpch:extendedprice = ?maxprice ) } This selects all lineitems with extendedprice equal to the highest extended- price in the set of lineitems. We note that we have a SQL-style explicit comparison for joining the nested select with the outer select. The bindings that are in scope in the pattern containing the nested select are also in scope inside the nested select. In this the scope rules resemble SQL' s rules for subqueries. In Virtuoso SQL and Virtuoso/PL we allow SPARQL queries in all places where "plain" SQL SELECT could be used, e.g., SQL query can contain SPARQL subqueries. Sample Queries Due to space constraints, we chose to pick only two of the twenty-two queries of the TPC-H workload. These were selected because of their relative complexity and use of nested queries. The TPC-H definition states the business questions for Q15 and Q18 as follows: Q15, The Top Supplier Query finds the supplier who contributed the most to the overall revenue for parts shipped during a given quarter of a given year. In case of a tie, the query lists all suppliers whose contribution was equal to the maximum, presented in supplier number order. Q18, The Large Volume Customer Query finds a list of the top 100 customers who have ever placed large quantity orders. The query lists the customer name, customer key, the order key, date and total price and the quantity for the order. The Q15 SQL text used with Virtuoso is: SELECT s_suppkey, s_name, s_address, s_phone, total_revenue FROM supplier, ( SELECT l_suppkey AS supplier_no, SUM( l_extendedprice * ( 1 - l_discount ) ) AS total_revenue FROM lineitem WHERE l_shipdate >= { D '1996-01-01' } AND l_shipdate < { FN timestampadd ( SQL_TSI_MONTH, 3, {d '1996-01-01'} ) } GROUP BY l_suppkey ) AS revenue WHERE s_suppkey = supplier_no AND total_revenue = ( SELECT max(total_revenue) FROM ( SELECT l_suppkey AS supplier_no, SUM ( l_extendedprice * ( 1 - l_discount ) ) AS total_revenue FROM lineitem WHERE l_shipdate >= { D '1996-01-01' } AND l_shipdate < { FN timestampadd ( SQL_TSI_MONTH, 3, {d '1996-01-01'} ) } GROUP BY l_suppkey ) AS revenue ) ORDER BY s_suppkey ; The corresponding SPARQL BI text is: SPARQL PREFIX tpch <http://openlinksw.com/schemas/TPC-H/> SELECT ?supplier ?s_name ?s_address ?s_phone ?total_revenue WHERE { ?supplier a tpch:supplier ; tpch:s_name ?s_name ; tpch:s_address ?s_address ; tpch:s_phone ?s_phone . { SELECT ?supplier ( SUM ( l_extendedprice * ( 1 - l_discount ) ) ) AS ?total_revenue WHERE { ?lineitem a tpch:lineitem ; tpch:l_shipdate ?l_shipdate ; tpch:l_suppkey ?supplier . FILTER ( ?l_shipdate >= xsd:date ('1996-01-01') AND ?l_shipdate < bif:dateadd ('month', 3, xsd:date ('1996-01-01')) ) } } { SELECT MAX ( ?all_totals.total_revenue ) AS ?maxtotal WHERE { { SPARQL SELECT ( SUM ( l_extendedprice * ( 1 - l_discount ) ) ) AS ?total_revenue WHERE { ?lineitem a tpch:lineitem ; tpch:l_shipdate ?l_shipdate ; tpch:l_suppkey ?l_suppkey . FILTER ( ?l_shipdate >= xsd:date ('1996-01-01') AND ?l_shipdate < bif:dateadd ('month', 3, xsd:date ('1996-01-01')) ) } } AS all_totals } } FILTER ( ?total_revenue = ?maxtotal ) } ORDER BY ?supplier ; The Virtuoso text of Q18 is: SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, SUM ( l_quantity ) FROM lineitem, orders, customer WHERE o_orderkey IN ( SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING SUM( l_quantity ) > 250 ) AND c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice DESC, o_orderdate; The SPARQL BI version is: SELECT ?c_name ?customer ?order ?o_orderdate ?o_totalprice SUM ( ?l_quantity ) FROM <http://example.com/tpcd> WHERE { ?customer a tpch:customer ; foaf:name ?c_name . ?order a tpch:order ; tpch:ordertotalprice ?o_totalprice ; tpch:orderdate ?o_orderdate ; tpch:has_customer ?customer . [ a tpch:lineitem ; tpch:linequantity ?l_quantity ; tpch:has_order ?order ] . { SELECT ?sum_order SUM (?quantity) AS ?sum_q WHERE { [ a tpch:lineitem ; tpch:linequantity ?quantity ; tpch:has_order ?sum_order ] } } . FILTER ( ?sum_order = ?order AND ?sum_q > 250 ) } ORDER BY DESC (?o_totalprice) ?o_orderdate Experiments For this test, we had the test data on both an Oracle 10G server and a Virtuoso 5.0 server on the same machine. The tables from the Oracle and Virtuoso servers were attached to another Virtuoso server, which served as the SPARQL front end. For Q15, Virtuoso SQL gave the answer in 180 ms and with SPARQL the answer took 3800 ms. For Q18, Virtuoso SQL gave 340 ms and SPARQL 371 ms. The large difference with Q15 is due to the SQL compiler choosing a different plan because the reformulated text has a different structure. Some further tuning will eliminate the difference. With the Oracle backend, we obtained the correct answers but our setup did not pass the queries through to Oracle as a single SQL statement; hence the performance was less than would have been seen if the queries were natively submitted to Oracle. Some further adjustments will result in the queries passing through the pipeline as single statements, at which point we will have a negligible translation overhead. The test database was at 1 percent scale, hence the results are not about TPC-H performance per se, but are solely aimed at verifying that the correct answers are produced and queries are executed as close to the data as possible. We also dumped the data as physical triples stored in Virtuoso. Our aim is to arrange things so that the physical triples version will at no point be more than three times slower than the equivalent relational setting on Virtuoso, running with a database scaled to 100G. Reaching this requires some enhancements to our SQL implementation, specifically for dealing with queries with dozens of joined tables. We note that we get a self-join for each column referenced. This work was not completed at the submission deadline. Since the features discussed were first implemented within days of the submission deadline, no tuning or adaptation of the Virtuoso SQL was possible within the time limit, hence results are not anywhere near final and most interesting experiments had to be left out. We intend to further study the comparative performance of SPARQL going to natively stored triples and compare this with SQL performance with single instance and clustered Virtuoso databases. One line of future work is benchmarking SPARQL and SQL based vertical storage schemes. We note that the RDF model is a vertical storage scheme almost by nature. Declaring that triples with given predicates be stored apart, in a table that keeps only subject and object results in a column-oriented store. We further intend to broaden the scope of the present example around TPC-H by including more sources in the mapping. This will demonstrate that the same queries can be run without loss of performance on a number of similar but distinct relational database instances. Thus SPARQL does become a data integration tool that exceeds the capabilities of SQL views merging data from multiple sources, for example. Conclusions The work discussed here demonstrates the feasibility of querying existing relational data through extended SPARQL without loss of performance or expressivity and without any modification to the relational data store in question. A skeptic might ask what the value of an alternate syntax for SQL is, when SQL is universally known and applied. We would point out that us bringing SPARQL on par with SQL for decision support queries is not aimed at replacing SQL but at making SPARQL capable of fulfilling its role as a language for integration. Indeed, we retain all of SPARQL's and RDF's flexibility for uniquely identifying entities, for abstracting away different naming conventions, layouts and types of primary and foreign keys and so forth. In the context of mapping relational data to RDF, we could map several instances of comparable but different schemas to the common terminology and couch all our queries within this terminology. Further, we can join from this world of mapped data to native RDF data, such as the data in the Linking Open Data project. For example, we could join regional sales data to the US census data set within a single query. Once we have demonstrated that performance or expressivity barriers do not cripple SPARQL when performing traditional SQL tasks, we have removed a significant barrier from enterprise adoption of RDF and open data. Related TPC-H workload Sample Queries References W3C RDF Data Access Working Group: SPARQL Query Language for RDF: http://www.w3.org/TR/rdf-sparql-query/ Transaction Processing Performance Council: TPC-H a Decision Support Benchmark. http://www.tpc.org/tpch/ Orri Erling, Ivan Mikahilov: Adapting an ORDBMS for RDF Storage and Mapping. Proceedings of the First Conference on Social Semantic Web. Leipzig (CSSW 2007), SABRE. Volume P-113 of GI-Edition - Lecture Notes in Informatics. Bonner Kollen Verlag, ISBN 978-3-88579-207-9