
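The title, "gStore: a graph-based SPARQL query engine", identifies the subject of the paper: gStore, a graph-based SPARQL query engine. SPARQL is the standard language for querying and processing RDF (Resource Description Framework) data. RDF is a metadata model for describing Web resources and an important part of Semantic Web technology. gStore stores and indexes RDF data as a graph and answers SPARQL queries by graph matching.

The paper studies how to process SPARQL queries over RDF datasets efficiently. It describes how gStore handles SPARQL queries with wildcards and aggregate operators in a uniform and scalable manner. The core idea is to store the RDF data as one large graph and to represent a SPARQL query as a query graph, which turns query answering into a subgraph matching problem. For efficient and scalable query processing, gStore develops an index, effective pruning rules, and efficient search algorithms. It also proposes an effective maintenance algorithm for online updates to RDF repositories, and extensive experiments confirm the effectiveness of the solution.

The document introduces the notion of an RDF graph, in which vertices represent resources or concepts and edges represent the relationships or properties between them. gStore uses this graph structure to represent RDF data and converts each SPARQL query into a graph as well, the query graph. Query processing thereby becomes a graph matching task, a common operation in graph databases, especially for complex join queries.

The keywords are RDF, SPARQL, graph database, graph matching, and aggregate query. These are focal topics for researchers in the field and reflect recent research directions and results in RDF data management and querying.

The section also briefly covers the background of the RDF data model, which was designed as part of the development of the Semantic Web to model objects on the Web. Its use in applications is growing. An RDF dataset consists of RDF triples, each made up of a subject, a predicate, and an object. gStore is designed for efficient storage and querying of this kind of data.

To make queries efficient and scalable, gStore uses a purpose-built index, effective pruning rules, and efficient search algorithms. Indexing is one of the key factors in database performance and can significantly speed up retrieval. Pruning rules discard the parts of the graph that cannot contribute to the query result, which reduces the amount of data to examine. The search algorithm largely determines query processing speed; an efficient one finds results quickly and consumes fewer resources.

The maintenance algorithm is the key to handling online updates to an RDF database. As new data keeps arriving, keeping query performance stable and response times short is a major challenge. gStore handles online updates with a dedicated maintenance algorithm that preserves data consistency and query performance.

The article is marked "Extended version of paper", meaning that it extends a paper previously presented at the VLDB (Very Large Data Bases) conference. VLDB is one of the top venues in data management and databases, so the work carries considerable academic weight and novelty. Researchers and database developers can use the article to learn how to build and optimize a graph-based RDF query engine and to understand current best practice and open challenges in the area.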


The VLDB Journal

DOI 10.1007/s00778-013-0337-7

REGULAR PAPER

gStore: a graph-based SPARQL query engine

Lei Zou · M. Tamer Özsu · Lei Chen · Xuchuan Shen · Ruizhe Huang · Dongyan Zhao

Received: 21 December 2012 / Revised: 7 August 2013 / Accepted: 26 August 2013

Abstract We address efficient processing of SPARQL queries over RDF datasets. The proposed techniques, incorporated into the gStore system, handle, in a uniform and scalable manner, SPARQL queries with wildcards and aggregate operators over dynamic RDF datasets. Our approach is graph based. We store RDF data as a large graph and also represent a SPARQL query as a query graph. Thus, the query answering problem is converted into a subgraph matching problem. To achieve efficient and scalable query processing, we develop an index, together with effective pruning rules and efficient search algorithms. We propose techniques that use this infrastructure to answer aggregation queries. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solutions.

Keywords RDF · SPARQL · Graph database · Graph matching · Aggregate query

Extended version of paper “gStore: Answering SPARQL Queries via Subgraph Matching” that was presented at 2011 VLDB Conference.

Electronic supplementary material The online version of this article (doi:10.1007/s00778-013-0337-7) contains supplementary material, which is available to authorized users.

L. Zou · X. Shen · R. Huang · D. Zhao
Institute of Computer Science and Technology, Peking University, Beijing, China
e-mail: zoulei@pku.edu.cn

X. Shen
e-mail: shenxuchuan@pku.edu.cn

R. Huang
e-mail: huangruizhe@pku.edu.cn

D. Zhao
e-mail: zhaody@pku.edu.cn

M. T. Özsu
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
e-mail: Tamer.Ozsu@uwaterloo.ca

L. Chen
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
e-mail: leichen@cse.ust.hk

1 Introduction

The RDF (Resource Description Framework) data model was proposed for modeling Web objects as part of developing the semantic web. Its use in various applications is increasing. An RDF dataset is a collection of (subject, property, object) triples denoted as $\langle s, p, o \rangle$. A running example is given in Fig. 1a. In order to query RDF repositories, the SPARQL query language [23] has been proposed by W3C. An example query that retrieves the names of individuals who were born on February 12, 1809 and who died on April 15, 1865 can be specified by the following SPARQL query ($Q_1$):

    SELECT ?name WHERE
    {?m <hasName> ?name.
     ?m <bornOnDate> "1809-02-12".
     ?m <diedOnDate> "1865-04-15".}

Although RDF data management has been studied over the past decade, most early solutions do not scale to large RDF repositories and cannot answer complex queries efficiently. For example, early systems such as Jena [31], Yars2 [14] and Sesame 2.0 [6] do not work well over large RDF datasets. More recent works (e.g., [1,19,32]), as well as systems, such as RDF-3x [20], x-RDF-3x [22], Hexastore [30] and SW-store [2], are designed to address scalability over large datasets. However, none of these address scalability along with the following real requirements of RDF applications:

Fig. 1 RDF graph: (a) RDF triples, (b) the corresponding RDF graph

– SPARQL queries with wildcards. Similar to their SQL and XPath counterparts, wildcard SPARQL queries enable users to specify more flexible query criteria in real-life applications where users may not have full knowledge about a query object. For example, we may know that a person was born in 1976 in a city that was founded in 1718, but we may not know the exact birth date. In this case, we have to perform a query with wildcards, as shown below ($Q_2$):

    SELECT ?name WHERE
    {?m <bornIn> ?city. ?m <hasName> ?name.
     ?m <bornOnDate> ?bd.
     ?city <foundingYear> "1718".
     FILTER(regex(str(?bd), "1976"))}

– Dynamic RDF repositories. RDF repositories are not static and are updated regularly. For example, Yago and DBpedia datasets are continually expanding to include the newly extracted knowledge from Wikipedia. The RDF data in social networks, such as the FOAF project (foaf-project.org), are also frequently updated to represent the individuals’ changing relationships. In order to support queries over such dynamic RDF datasets, query engines should be able to handle frequent updates without much maintenance overhead.

– Aggregate SPARQL queries. Few existing works and SPARQL engines consider aggregate queries despite their real-life importance. A typical aggregate SPARQL query that groups all individuals by their titles, genders, and the founding year of their birth places and reports the number of individuals in each group is shown below ($Q_3$):

    SELECT ?t ?g ?y COUNT(?m) WHERE
    {?m <bornIn> ?c. ?m <title> ?t.
     ?m <gender> ?g. ?c <foundingYear> ?y.}
    GROUP BY ?t ?g ?y

In this paper, we describe gStore, which is a graph-based triple store that can answer the above discussed SPARQL queries over dynamic RDF data repositories. In this context, answering a query is transformed into a subgraph matching problem. Specifically, we model an RDF dataset (a collection of triples) as a labeled, directed multiedge graph (RDF graph), where each vertex corresponds to a subject or an object. We also represent a given SPARQL query by a query graph, Q. Subgraph matching of the query graph Q over the RDF graph G provides the answer to the query.
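The reduction from query answering to subgraph matching can be illustrated with a minimal sketch. It assumes toy triples (hypothetical, not the data of Fig. 1a), in-memory adjacency lists, constant properties in every triple pattern, and plain backtracking without any index or pruning. The helper names `build_graph`, `is_var`, and `match` are illustrative only and are not part of gStore, whose actual algorithm relies on the index and pruning rules described later in the paper.

```python
from collections import defaultdict

# Hypothetical toy triples, used only for illustration.
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "bornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "diedOnDate", "1865-04-15"),
    ("y:Charles_Darwin", "hasName", "Charles Darwin"),
    ("y:Charles_Darwin", "bornOnDate", "1809-02-12"),
    ("y:Charles_Darwin", "diedOnDate", "1882-04-19"),
]


def build_graph(triples):
    """Directed, labeled multigraph: subject -> list of (property, object) edges."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
    return adj


def is_var(term):
    """Query variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")


def match(adj, patterns, binding=None):
    """Yield every variable binding that embeds the query graph in the data graph.

    Each pattern is one query edge (subject, property, object). Properties are
    assumed to be constants; subjects and objects may be variables.
    """
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    (s, p, o), rest = patterns[0], patterns[1:]
    if is_var(s) and s not in binding:
        subjects = list(adj)              # unbound variable: try every subject vertex
    else:
        subjects = [binding.get(s, s)]    # bound variable or constant
    for subj in subjects:
        for prop, obj in adj.get(subj, []):
            if prop != p:
                continue
            new = dict(binding)
            if is_var(s):
                new[s] = subj
            if is_var(o):
                if new.get(o, obj) != obj:  # conflicts with an earlier binding
                    continue
                new[o] = obj
            elif o != obj:                  # constant object must match exactly
                continue
            yield from match(adj, rest, new)


# Query graph for the first example query: one edge per triple pattern.
Q1 = [
    ("?m", "hasName", "?name"),
    ("?m", "bornOnDate", "1809-02-12"),
    ("?m", "diedOnDate", "1865-04-15"),
]
NAMES = [b["?name"] for b in match(build_graph(TRIPLES), Q1)]
```

On this toy data, `NAMES` contains only the name of the one subject whose birth and death dates both match. The sketch enumerates candidates exhaustively, which is exactly the cost that an index and pruning rules are meant to avoid on large graphs.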

For example, Fig. 1b shows an RDF graph G corresponding to RDF triples in Fig. 1a. We formally define an RDF graph in Definition 1. Note that the numbers above the boxes in Fig. 1b are not vertex labels but vertex IDs that we introduce to simplify the description. The RDF graph does not have to be connected. A SPARQL query can also be represented as a directed labeled query graph Q (Definition 2). Figure 2 shows the query graph corresponding to the SPARQL query $Q_2$. Usually, query graph Q is a connected graph. Otherwise, we can regard each connected component of Q as a separate query and perform them one by one.

Fig. 2 Query graph of $Q_2$

We develop novel indexing and graph-matching techniques rather than using existing ones. This is because the characteristics of an RDF graph are considerably different from graphs typically considered in much of the graph database research. First, the size of an RDF graph (i.e., the number of vertices and edges) is larger than what is considered in typical graph databases by orders of magnitude. Second, the cardinality of vertex and edge labels in an RDF graph is much larger than that in traditional graph databases. For example, a typical dataset (i.e., the AIDS dataset) used in the existing graph database work [25,33] has 10,000 data graphs, each with an average number of 20 vertices and 25 edges. The total number of distinct vertex labels is 62. The total size of the dataset is about 5 MB. However, the Yago RDF graph has about 500M vertices and the total size is about 3.1 GB. Therefore, I/O cost becomes a key issue in RDF query processing. However, most existing subgraph query algorithms are memory based. Third, SPARQL queries combine several attribute-like properties of the same entity; thus, they tend to contain stars as subqueries [20]. A star query refers to the query graph in the shape of a star, formed by one central vertex and its neighbors.

Contributions of this paper are the following:

1. We adopt the graph model as the physical storage scheme for RDF data. Specifically, we store RDF data in disk-based adjacency lists.

2. We transform an RDF graph into a data signature graph by encoding each entity and class vertex. An index (VS*-tree) is developed over the data signature graph with light maintenance overhead.

3. We develop a filtering rule for subgraph query over the data signature graph, which can be seamlessly embedded into our query algorithm that answers SPARQL queries efficiently.

4. We introduce an auxiliary structure (called T-index), which is a structured organization of materialized views, to speed up aggregate SPARQL queries.

5. We demonstrate experimentally that the performance of our approach is superior to existing systems.

The rest of this paper is organized as follows. We discuss the related work and preliminaries in Sects. 2 and 3, respectively. We give an overview of our solution in Sect. 4. We discuss the storage and encoding method in Sect. 5. We then present the VS*-tree index in Sect. 6 and an algorithm for SPARQL query processing in Sect. 7. In order to support aggregate queries efficiently, we develop T-index in Sect. 8 and an aggregate query processing algorithm in Sect. 9. We discuss the maintenance of indexes (VS*-tree and T-index) as RDF data get updated in Sect. 10. We study our methods by experiments in Sect. 11. Section 12 concludes this paper. Some of the additional material supporting the main findings reported in the paper is included in an Online Supplement.

2 Related work

Three approaches have been proposed to store and query RDF data: one giant triples table, clustered property tables, and vertically partitioned tables.

One giant triples table. The systems in this category store RDF triples in a single three-column table where columns correspond to subject, property, and object (as in Fig. 1a), enabling them to manipulate all RDF triples in a uniform manner. However, this requires performing a large number of self-joins over this table to answer a SPARQL query. Some efforts have been made to address this issue, such as RDF-3x [19,20] and Hexastore [30], which build several clustered B+-trees for all permutations of s, p and o columns.
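To make the self-join cost and the permutation-index idea concrete, here is a small sketch. It assumes hypothetical toy triples and uses in-memory sorted lists with binary search as a stand-in for clustered B+-trees; it is not how RDF-3x or Hexastore are implemented. The names `build_permutation_indexes` and `prefix_scan` are made up for this illustration.

```python
from bisect import bisect_left
from itertools import permutations

# Hypothetical toy triples, used only for illustration.
TRIPLES = [
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "bornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "diedOnDate", "1865-04-15"),
    ("y:Charles_Darwin", "hasName", "Charles Darwin"),
    ("y:Charles_Darwin", "bornOnDate", "1809-02-12"),
    ("y:Charles_Darwin", "diedOnDate", "1882-04-19"),
]


def build_permutation_indexes(triples):
    """One sorted list per ordering of (s, p, o).

    For example, the 'pos' index stores (property, object, subject) tuples in
    sorted order, so all subjects for a given property and object are adjacent.
    """
    indexes = {}
    for order in permutations("spo"):
        positions = ["spo".index(c) for c in order]
        indexes["".join(order)] = sorted(
            tuple(t[i] for i in positions) for t in triples
        )
    return indexes


def prefix_scan(index, prefix):
    """Return all entries of a sorted index that start with the given prefix tuple."""
    start = bisect_left(index, prefix)  # first entry that is >= prefix
    out = []
    for entry in index[start:]:
        if entry[: len(prefix)] != prefix:
            break
        out.append(entry)
    return out


INDEXES = build_permutation_indexes(TRIPLES)

# Each triple pattern becomes one range scan; patterns sharing ?m are then
# joined on the subject, which is the self-join over the triples table.
born = {e[2] for e in prefix_scan(INDEXES["pos"], ("bornOnDate", "1809-02-12"))}
died = {e[2] for e in prefix_scan(INDEXES["pos"], ("diedOnDate", "1865-04-15"))}
BOTH = born & died
```

Every additional triple pattern in a query adds another scan and another join step, which is why queries with many patterns become expensive on a single triples table even when all six orderings are indexed.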

Property tables. There are two kinds of property tables. The first one, called a clustered property table, groups together the properties that tend to occur in the same subjects. Each property cluster is mapped to a property table. The second type is a property-class table, which clusters the subjects with the same type of property into one property table.

Vertically partitioned tables. For each property, this approach builds a single two-column (subject, object) table ordered by subject [2]. The advantage of the ordering is to perform fast merge-join during query processing. However, this approach does not scale well as the number of properties increases.
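The benefit of subject ordering can be seen in a short sketch of vertical partitioning followed by a merge-join. The data are hypothetical and the functions `vertical_partition` and `merge_join` are illustrative names, not the API of SW-store or any other system.

```python
from collections import defaultdict

# Hypothetical toy triples, used only for illustration.
TRIPLES = [
    ("y:Charles_Darwin", "hasName", "Charles Darwin"),
    ("y:Abraham_Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham_Lincoln", "bornOnDate", "1809-02-12"),
    ("y:Charles_Darwin", "bornOnDate", "1809-02-12"),
    ("y:Abraham_Lincoln", "diedOnDate", "1865-04-15"),
]


def vertical_partition(triples):
    """One two-column (subject, object) table per property, ordered by subject."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return {p: sorted(rows) for p, rows in tables.items()}


def merge_join(left, right):
    """Join two subject-ordered tables on subject in a single forward pass."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            subject = left[i][0]
            # Find the run of rows with this subject on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == subject:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == subject:
                j_end += 1
            for _, left_obj in left[i:i_end]:
                for _, right_obj in right[j:j_end]:
                    out.append((subject, left_obj, right_obj))
            i, j = i_end, j_end
    return out


TABLES = vertical_partition(TRIPLES)
NAME_AND_BIRTH = merge_join(TABLES["hasName"], TABLES["bornOnDate"])
```

Because both inputs are already ordered by subject, the join needs no sorting or hashing at query time. The sketch also hints at the scaling problem: the number of tables grows with the number of distinct properties.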

Existing RDF storage systems, such as Jena [31], Yars2 [14] and Sesame 2.0 [6], do not work well over large RDF datasets. SW-store [2], RDF-3x [20], x-RDF-3x [22], and Hexastore [30] are designed to address scalability; however, they only support exact SPARQL queries, since they replace all literals (in RDF triples) by ids using a mapping dictionary.

Furthermore, most existing methods do not efficiently handle online updates of the underlying RDF repositories. For example, in clustered property table-based methods (such as Jena [31]), if there are updates to the properties in RDF triples, it is necessary to re-cluster and re-build the property tables. In SW-store [2], it is potentially expensive to insert data, since each update requires writing to many columns. In order to address this issue, it uses “overflow table + batch write”, meaning that online updates are recorded to overflow tables that SW-store periodically scans to materialize the updates. Obviously, this kind of maintenance method cannot work well for applications such as online social networks that require real-time access.

The more recent x-RDF-3x [22] proposes an efficient online maintenance algorithm, but does not support wildcard or aggregate SPARQL queries. There exist some works that discuss the possibility of storing RDF data as a graph (e.g., [5,30]), but these approaches do not address scalability. Some are based on main memory implementations [26], while others utilize graph partitioning to reduce self-joins of triple tables [32]. While graph partitioning is a reasonable technique to parallelize execution, updates to the graph may require re-partitioning unless incremental partitioning methods are developed (which are not in these works).

Few SPARQL query engines consider aggregate queries, and to the best of our knowledge, only two proposals exist in the literature [16,24]. Given an aggregate SPARQL query $Q$, a straightforward method [16] is to transform $Q$ into a SPARQL query $Q'$ without aggregation predicates, find the solution to $Q'$ by existing query engines, then partition the solution set into one or more groups based on rows that share the specified values, and finally, compute the aggregate values for each group. Although it is easy for existing RDF engines to implement aggregate functions this way, the approach is problematic, since it misses opportunities for query optimization. Furthermore, it has been pointed out [24] that this method may produce incorrect answers.
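The straightforward three-step method can be sketched as follows for the aggregate example query from the introduction. The triples are hypothetical, the join is hand-written for this one query shape, and COUNT is taken to count solution rows. The function names are illustrative and do not come from [16] or from gStore.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples, used only for illustration.
# Subject p4 has a title and a gender but no bornIn edge.
TRIPLES = [
    ("p1", "bornIn", "c1"), ("p1", "title", "President"), ("p1", "gender", "Male"),
    ("p2", "bornIn", "c1"), ("p2", "title", "President"), ("p2", "gender", "Male"),
    ("p3", "bornIn", "c2"), ("p3", "title", "Actor"), ("p3", "gender", "Female"),
    ("p4", "title", "Actor"), ("p4", "gender", "Female"),
    ("c1", "foundingYear", "1718"), ("c2", "foundingYear", "1776"),
]


def index_by_property(triples):
    """property -> subject -> list of objects."""
    idx = defaultdict(lambda: defaultdict(list))
    for s, p, o in triples:
        idx[p][s].append(o)
    return idx


def solutions_without_aggregation(triples):
    """Step 1: evaluate the query with the aggregation stripped (Q').

    Pattern: ?m bornIn ?c . ?m title ?t . ?m gender ?g . ?c foundingYear ?y
    """
    idx = index_by_property(triples)
    rows = []
    for m, cities in idx["bornIn"].items():
        for c in cities:
            for t in idx["title"].get(m, []):
                for g in idx["gender"].get(m, []):
                    for y in idx["foundingYear"].get(c, []):
                        rows.append({"m": m, "t": t, "g": g, "y": y})
    return rows


def group_and_count(rows, group_vars):
    """Steps 2 and 3: partition rows by the group-by variables, then count each group."""
    return Counter(tuple(row[v] for v in group_vars) for row in rows)


ROWS = solutions_without_aggregation(TRIPLES)
COUNTS = group_and_count(ROWS, ("t", "g", "y"))
```

All solution rows are materialized before any grouping happens, so the aggregate cannot influence how the pattern is evaluated. That is the missed optimization opportunity the text refers to.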

Seid and Mehrotra [24] study the semantics of group-by and aggregation in RDF graphs and how to extend SPARQL to express grouping and aggregation queries. They do not address the physical implementation or query optimization techniques.

Finally, RDF data tend not to be very structured. For example, subjects of the same type need not have the same properties. This facilitates “pay-as-you-go” data integration, but prohibits the application of classical relational approaches to speed up aggregate query processing. For example, materialized views [13], which are commonly used to optimize query execution, may not be used easily. In relational systems, if there is a materialized view V over dimensions (A, B, C), an aggregate query over dimensions (A, B) can be answered by only scanning view V rather than scanning the original table. However, this is not always possible in RDF. For example, consider $Q_3$, which groups all individuals by their titles, gender, and founding year of their birth places and reports the number of individuals in each group. The answer to this query, $R(Q_3)$, is given in Fig. 3a (we show how to compute this answer in Sect. 9).

Fig. 3 Difficulty of using materialized views: (a) answer to query $Q_3$, (b) answer to query $Q_4$

Now, consider another query (say $Q_4$) that groups all individuals by their titles and gender and reports the number of individuals in each group. The answer to this query is given in Fig. 3b. Although the group-by dimensions in $Q_4$ are a subset of those in $Q_3$, it is not possible to get the aggregate result set $R(Q_4)$ by scanning $R(Q_3)$. The main reason is that RDF data tend not to be structured, and there may be subjects of the same type that do not have the same properties. Therefore, some subjects that exist in a “smaller” materialized view may not occur in a “larger” view.
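A small numeric sketch shows why the roll-up fails. It assumes hypothetical triples in which one subject (`p4`) has a title and a gender but no birth place, and it assumes that the two-dimension query matches only the title and gender patterns. The function names are illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical toy triples, used only for illustration.
TRIPLES = [
    ("p1", "bornIn", "c1"), ("p1", "title", "President"), ("p1", "gender", "Male"),
    ("p2", "bornIn", "c1"), ("p2", "title", "President"), ("p2", "gender", "Male"),
    ("p3", "bornIn", "c2"), ("p3", "title", "Actor"), ("p3", "gender", "Female"),
    ("p4", "title", "Actor"), ("p4", "gender", "Female"),  # no bornIn edge
    ("c1", "foundingYear", "1718"), ("c2", "foundingYear", "1776"),
]


def objects_of(triples, prop):
    """subject -> list of objects for one property."""
    out = defaultdict(list)
    for s, p, o in triples:
        if p == prop:
            out[s].append(o)
    return out


def count_by_title_gender_year(triples):
    """The 'larger' view: group by (title, gender, founding year of birth place)."""
    born_in = objects_of(triples, "bornIn")
    title = objects_of(triples, "title")
    gender = objects_of(triples, "gender")
    founded = objects_of(triples, "foundingYear")
    view = Counter()
    for m, cities in born_in.items():
        for c in cities:
            for t in title.get(m, []):
                for g in gender.get(m, []):
                    for y in founded.get(c, []):
                        view[(t, g, y)] += 1
    return view


def count_by_title_gender(triples):
    """The 'smaller' view: group by (title, gender), evaluated directly on the data."""
    title = objects_of(triples, "title")
    gender = objects_of(triples, "gender")
    view = Counter()
    for m, titles in title.items():
        for t in titles:
            for g in gender.get(m, []):
                view[(t, g)] += 1
    return view


def roll_up(view):
    """Relational-style roll-up: sum out the founding-year dimension."""
    rolled = Counter()
    for (t, g, _year), n in view.items():
        rolled[(t, g)] += n
    return rolled


ROLLED_UP = roll_up(count_by_title_gender_year(TRIPLES))
DIRECT = count_by_title_gender(TRIPLES)
```

On this data the rolled-up count for ("Actor", "Female") is 1, while the direct evaluation gives 2, because `p4` never enters the three-dimension view. The two results agree only for groups in which every subject has all the properties of the larger view.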

3 Preliminaries

An RDF dataset is a collection of (subject, property, object) triples $\langle s, p, o \rangle$, where subject is an entity or a class, property denotes one attribute associated with one entity or a class, and object is an entity, a class, or a literal value. According to the RDF standard, an entity or a class is denoted by a URI (Uniform Resource Identifier). In Fig. 1, “http://en.wikipedia.org/wiki/United_States” is an entity, “http://en.wikipedia.org/wiki/Country” is a class, and “United States” is a literal value. In this work, we do not distinguish between an “entity” and a “class” since we have the same operations over them. RDF data can be modeled as an RDF graph, which is formally defined as follows (frequently used symbols are shown in Table 1):

Definition 1 An RDF graph is a four-tuple $G = \langle V, L_V, E, L_E \rangle$, where

1. $V = V_c \cup V_e \cup V_l$ is a collection of vertices that correspond to all subjects and objects in RDF data, where $V_c$, $V_e$, and $V_l$ are collections of class vertices, entity vertices, and literal vertices, respectively.

2. $L_V$ is a collection of vertex labels. The label of a vertex $u \in V_l$ is its literal value, and the label of a vertex $u \in V_c \cup V_e$ is its corresponding URI.
