L. Zou et al.
as Jena [31]), if there are updates to the properties in RDF
triples, it is necessary to re-cluster and re-build the property
tables. In SW-store [2], it is potentially expensive to insert
data, since each update requires writing to many columns. In
order to address this issue, it uses “overflow table + batch
write”, meaning that online updates are recorded to over-
flow tables that SW-store periodically scans to materialize the
updates. Obviously, this kind of maintenance method cannot
work well for applications such as online social networks
that require real time access.
More recent x-RDF-3x [22] proposes an efficient online
maintenance algorithm, but does not support wildcard or
aggregate SPARQL queries. There exist some works that
discuss the possibility of storing RDF data as a graph (e.g.,
[5,30]), but these approaches do not address scalability. Some
are based on main memory implementations [26], while oth-
ers utilize graph partitioning to reduce self-joins of triple
tables [32]. While graph partitioning is a reasonable tech-
nique to parallelize execution, updates to the graph may
require re-partitioning unless incremental partitioning meth-
ods are developed (which are not in these works).
Few SPARQL query engines consider aggregate queries,
and to the best of our knowledge, only two proposals exist
in literature [16,24]. Given an aggregate SPARQL query
Q, a straightforward method [16] is to transform Q into
a SPARQL query Q
without aggregation predicates, find
the solution to Q
by existing query engines, then partition
the solution set into one or more groups based on rows that
share the specified values, and finally, compute the aggre-
gate values for each group. Although it is easy for existing
RDF engines to implement aggregate functions this way, the
approach is problematic, since it misses opportunities for
query optimization. Furthermore, it has been pointed out [24]
that this method may produce incorrect answers.
Seid and Mehrotra [24] study the semantics of group-by
and aggregation in RDF graph and how to extend SPARQL
to express grouping and aggregation queries. They do not
address the physical implementation or query optimization
techniques.
Finally, the RDF data tend not to be very structured. For
example, each subject of the same type do not need to have
the same properties. This facilitates “pay-as-you-go” data
integration, but prohibits the application of classical rela-
tional approaches to speed up aggregate query processing.
For example, materialized views [13], which are commonly
used to optimize query execution, may not be used easily.
In relational systems, if there is a materialized view V
1
over
dimensions (A, B, C), an aggregate query over dimensions
(A, B) can be answered by only scanning view V
1
rather
than scanning the original table. However, this is not always
possible in RDF. For example, consider Q
3
that groups all
individuals by their titles, gender, and founding year of their
birth places and reports the number of individuals in each
(a)
(b)
Fig. 3 Difficulty of using materialized views a answer to query Q
3
b
answer to query Q
4
group. The answer to this query, R(Q
3
), is given in Fig. 3a
(we show how to compute this answer in Sect. 9).
Now, consider another query (say Q
4
) that groups all indi-
viduals by their titles and gender and reports the number of
individuals in each group. The answer to this query is given in
Fig. 3b. Although the group-by dimensions in Q
4
is a subset
of those in Q
3
, it is not possible to get the aggregate result set
R(Q
4
) by scanning R(Q
3
). The main reason is the nature of
RDF data and the fact that RDF data tend not be structured,
and there may be subjects of the same type that do not have
the same properties. Therefore, some subjects that exist in
a “smaller” materialized view may not occur in a “larger”
view.
3 Preliminaries
An RDF dataset is a collection of (subject, property, object)
triples s, p, o, where subject is an entity or a class, and
property denotes one attribute associated with one entity or
a class, and object is an entity, a class, or a literal value.
According to the RDF standard, an entity or a class is denoted
by a URI (Uniform Resource Identifier). In Fig. 1,“http://
en.wikipedia.org/wiki/United_States” is an entity, “http://en.
wikipedia.org/wiki/Country” is a class, and “United States”
is a literal value. In this work, we do not distinguish between
an “entity” and a “class” since we have the same operations
over them. RDF data can be modeled as an RDF graph, which
is formally defined as follows (frequently used symbols are
shown in Table 1):
Definition 1 A RDF graph is a four-tuple G =V, L
V
, E,
L
E
, where
1. V = V
c
∪V
e
∪V
l
is a collection of vertices that correspond
to all subjects and objects in RDF data, where V
c
, V
e
, and
V
l
are collections of class vertices, entity vertices, and
literal vertices, respectively.
2. L
V
is a collection of vertex labels. The label of a vertex
u ∈ V
l
is its literal value, and the label of a vertex u ∈
V
c
∪ V
e
is its corresponding URI.
123