AMIE: Association Rule Mining under Incomplete Knowledge Bases (Code + Documentation)

901 files in total

class: 484

svn-base: 240

java: 122

Uploaded 2015-03-18 13:10:04, 2.43 MB, RAR

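In recent years, large-scale knowledge bases such as YAGO and DBpedia have advanced considerably. A knowledge base provides information about many kinds of entities, such as people, countries, rivers, cities, and universities, and it also contains a large number of relations between entities, i.e., facts. Today's knowledge bases are huge: they typically hold millions of entities and hundreds of millions of facts describing relations between them. Even at this scale, the data is still incomplete. There are two main ways to build a knowledge base. One is to extract facts from large amounts of text, which inevitably introduces a lot of noise. The other is manual extension, which is extremely time-consuming. Making a knowledge base complete requires great effort to extract facts and check them for correctness, because only correct facts are worth adding.

At the same time, a knowledge base often already contains enough information to infer new facts. For example, if a knowledge base states that child c has a mother m, we can infer that the mother's husband f is very likely the child's father. Formally:

motherOf(m, c) ∧ marriedTo(m, f) ⟹ fatherOf(f, c)

Mining such rules serves four purposes:

1. The rules can be used to infer new facts, which make the knowledge base more complete.
2. The rules can detect potential errors in the knowledge base. For example, a statement that a person unrelated to a boy is that boy's father is probably wrong.
3. Many reasoning tools rely on other tools to supply rules, so the mined rules can be used for reasoning.
4. The rules describe general regularities that help us understand and analyze the data in the knowledge base, e.g., that countries usually trade with countries speaking the same language, that marriage is a symmetric relation, or that musicians who play the same instrument often influence each other.

AMIE aims to mine logical rules like the one above from knowledge bases in RDF format. The Semantic Web contains many RDF knowledge bases, such as YAGO, Freebase, and DBpedia. These knowledge bases describe binary relations as RDF triples (S, P, O). Because a knowledge base generally contains only positive examples (S, P, O) and no negative examples (S, ¬P, O), reasoning over an RDF knowledge base can rely on positive examples only. Furthermore, RDF knowledge bases operate under the Open World Assumption (OWA): if a fact is absent from the knowledge base, we cannot say that it is false, only that it is unknown. This is fundamentally different from standard databases, which operate under the Closed World Assumption (CWA). For example, if the knowledge base does not contain marry(a, b), then under the CWA we conclude that a is not married to b, whereas under the OWA we can only say that a may be married or may be single.

The archive contains runnable AMIE source code and the accompanying documentation, available for download and reference.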


Archive contents (901 files)

FactDatabase.class 56KB

Name.class 44KB

NameML.class 36KB

Char.class 32KB

DateParser.class 27KB

Database.class 27KB

DummyDatabase$DummyResultSet.class 23KB

NonsharedParameters.class 23KB

Query.class 22KB

PlingStemmer.class 22KB

AMIE.class 21KB

MiningAssistant.class 21KB

AMIE.class 21KB

MiningAssistant.class 20KB

MiningAssistantHeadVariables.class 19KB

MiningAssistantHeadVariables.class 18KB

NumberParser.class 17KB

TypesCleaner.class 17KB

DateParser1.class 17KB

FrequencyVector.class 16KB

PredictionsSampler.class 15KB

Announce.class 14KB

D.class 14KB

SparseVector.class 12KB

Parameters.class 12KB

WordNet.class 11KB

SeedsCountMiningAssistant.class 11KB

SeedsCountMiningAssistant.class 10KB

FileLines.class 9KB

PostgresDatabase.class 9KB

PQRTree$Node.class 9KB

IntHashMap.class 9KB

IntSet.class 8KB

PCAFalseFactsSampler.class 8KB

Database$Inserter.class 8KB

PCAFalseFactsSampler.class 8KB

DoubleHashMap.class 8KB

StringModifier.class 8KB

FigureProducer.class 8KB

PeekIterator.class 7KB

DBWordNet$Table.class 7KB

RegularExpression.class 6KB

OracleDatabase.class 6KB

Language.class 6KB

DummyDatabase.class 6KB

SignatureStatisticsTable.class 6KB

MySQLDatabase.class 6KB

MatchReader.class 6KB

NameML$PersonNameML.class 6KB

NumberFormatter.class 6KB

BloomFilter.class 6KB

CompressedString.class 6KB

NounGroup.class 5KB

HTMLReader.class 5KB

SchemaUtilities.class 5KB

EntitiesRelationSampler.class 5KB


AMIE: Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases

Luis Galárraga, Christina Teflioudi, Katja Hose, Fabian M. Suchanek

Max-Planck Institute for Informatics, Saarbrücken, Germany

Aalborg University, Aalborg, Denmark

{lgalarra, chteflio, suchanek}@mpi-inf.mpg.de, {khose}@cs.aau.dk

ABSTRACT

Recent advances in information extraction have led to huge knowledge bases (KBs), which capture knowledge in a machine-readable format. Inductive Logic Programming (ILP) can be used to mine logical rules from the KB. These rules can help deduce and add missing knowledge to the KB. While ILP is a mature field, mining logical rules from KBs is different in two aspects: First, current rule mining systems are easily overwhelmed by the amount of data (state-of-the-art systems cannot even run on today’s KBs). Second, ILP usually requires counterexamples. KBs, however, implement the open world assumption (OWA), meaning that absent data cannot be used as counterexamples. In this paper, we develop a rule mining model that is explicitly tailored to support the OWA scenario. It is inspired by association rule mining and introduces a novel measure for confidence. Our extensive experiments show that our approach outperforms state-of-the-art approaches in terms of precision and coverage. Furthermore, our system, AMIE, mines rules orders of magnitude faster than state-of-the-art approaches.

Categories and Subject Descriptors

H2.8 [Information Systems]: Database Applications

General Terms

Algorithms

Keywords

Rule Mining, Inductive Logic Programming, ILP

1. INTRODUCTION

In recent years, we have experienced the rise of large knowledge bases (KBs), such as Cyc [23], YAGO [35], DBpedia [5], and Freebase (http://freebase.com). These KBs provide information about a great variety of entities, such as people, countries, rivers, cities, universities, movies, animals, etc. Moreover, KBs also contain facts relating these entities, e.g., who was born where, which actor acted in which movie, or which city is located in which country. Today’s KBs contain millions of entities and hundreds of millions of facts.

Yet, even these large KBs are not complete. Some of them are extracted from natural language resources that inevitably exhibit gaps. Others are created and extended manually. Making these KBs complete requires great effort to extract facts, check them for correctness, and add them to the KB. However, KBs themselves often already contain enough information to derive and add new facts. If, for instance, a KB contains the fact that a child has a mother, then the mother’s husband is most likely the father:

motherOf(m, c) ∧ marriedTo(m, f) ⇒ fatherOf(f, c)

As for any rule, there can be exceptions, but in the vast majority of cases, the rule will hold. Finding such rules can serve four purposes: First, by applying such rules on the data, new facts can be derived that make the KB more complete. Second, such rules can identify potential errors in the knowledge base. If, for instance, the KB contains the statement that a totally unrelated person is the father of a child, then maybe this statement is wrong. Third, the rules can be used for reasoning. Many reasoning approaches rely on other parties to provide rules (e.g., [27, 31]). Last, rules describing general regularities can help us understand the data better. We can, e.g., find out that countries often trade with countries speaking the same language, that marriage is a symmetric relationship, that musicians who influence each other often play the same instrument, and so on.

The goal of this paper is to mine such rules from KBs. We focus on RDF-style KBs in the spirit of the Semantic Web, such as YAGO [35], Freebase, and DBpedia [5]. These KBs provide binary relationships in the form of RDF triples. Since RDF has only positive inference rules, these KBs contain only positive statements and no negations. Furthermore, they operate under the Open World Assumption (OWA). Under the OWA, a statement that is not contained in the KB is not necessarily false; it is just unknown. This is a crucial difference to many standard database settings that operate under the Closed World Assumption (CWA). Consider an example KB that does not contain the information that a particular person is married. Under CWA we can conclude that the person is not married. Under OWA, however, the person could be either married or single.
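The contrast between the two assumptions can be illustrated with a minimal Python sketch. The fact encoding as `(relation, subject, object)` tuples and the function names `holds_cwa` and `holds_owa` are assumptions made for this illustration; they are not taken from the AMIE code base.

```python
from typing import Optional, Set, Tuple

# Hypothetical encoding: a fact is (relation, subject, object).
Fact = Tuple[str, str, str]


def holds_cwa(kb: Set[Fact], fact: Fact) -> bool:
    """Closed world: anything not stated in the KB is taken to be false."""
    return fact in kb


def holds_owa(kb: Set[Fact], fact: Fact) -> Optional[bool]:
    """Open world: a stated fact is true, an absent fact is unknown (None)."""
    return True if fact in kb else None


kb = {("marriedTo", "Elvis", "Priscilla")}
query = ("marriedTo", "Barack", "Michelle")

print(holds_cwa(kb, query))  # False
print(holds_owa(kb, query))  # None
```

Under this encoding an absent statement can serve as a counterexample only in the closed-world reading, which is the difficulty the following paragraphs return to.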

Footnote: http://www.w3.org/TR/rdf-primer/

Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author’s site if the Material is used in electronic media.

WWW 2013, May 13–17, 2013, Rio de Janeiro, Brazil.

ACM 978-1-4503-2035-1/13/05.

Mining rules from a given dataset is a problem that has a long history. It has been studied in the context of association rule mining and inductive logic programming (ILP). Association rule mining [3] is well-known in the context of sales databases. It can find rules such as “If a client bought beer and wine, then he also bought aspirin”. The confidence of such a rule is the ratio of cases where beer and wine were actually bought together with aspirin. Association rule mining inherently implements a closed world assumption: A rule that predicts new items that are not in the database has a low confidence. It cannot be used to (and is not intended to be used to) add new items to the database.
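As a rough illustration of the standard confidence described above, the following Python sketch computes it over a toy list of transactions. The data and the helper name `confidence` are made up for this example.

```python
from typing import FrozenSet, List


def confidence(transactions: List[FrozenSet[str]],
               body: FrozenSet[str],
               head: FrozenSet[str]) -> float:
    """Share of transactions containing the body that also contain the head."""
    body_count = sum(1 for t in transactions if body <= t)
    both_count = sum(1 for t in transactions if (body | head) <= t)
    # Assumption of this sketch: a rule whose body never occurs gets 0.0.
    return both_count / body_count if body_count else 0.0


# Toy sales data (invented for illustration).
transactions = [
    frozenset({"beer", "wine", "aspirin"}),
    frozenset({"beer", "wine", "aspirin"}),
    frozenset({"beer", "wine"}),
    frozenset({"beer", "chips"}),
]

conf = confidence(transactions, frozenset({"beer", "wine"}), frozenset({"aspirin"}))
print(round(conf, 3))  # 0.667
```

The third transaction lowers the confidence because the missing aspirin is counted as a negative case, which is the closed-world behavior the paragraph points out.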

ILP approaches deduce logical rules from ground facts. Yet, current ILP systems cannot be applied to semantic KBs for two reasons: First, they usually require negative statements as counter-examples. Semantic KBs, however, usually do not contain negative statements. The semantics of RDF are too weak to deduce negative evidence from the facts in a KB. Because of the OWA, absent statements cannot serve as counter-evidence either. Second, today’s ILP systems are slow and cannot handle the huge amount of data that KBs provide. In our experiments, we ran state-of-the-art approaches on YAGO2 for a couple of days without obtaining any results.

In this paper, we propose a rule mining system that is inherently designed to work under the OWA, and efficient enough to handle the size of today’s KBs. More precisely, our contributions are as follows:

(1) A method to simulate negative examples for positive KBs (the Partial Completeness Assumption).

(2) An algorithm for the efficient mining of rules.

(3) A system, AMIE, that mines rules on millions of facts in a few minutes without the need for parameter tuning or expert input.

The rest of this paper is structured as follows. Section 2 discusses related work and Section 3 introduces preliminaries. Sections 4 and 5 are the main part of the paper, presenting our mining model and its implementation. Section 6 presents our experiments before Section 7 concludes.

2. RELATED WORK

We aim to mine rules of the form

motherOf(m, c) ∧ marriedTo(m, f) ⇒ fatherOf(f, c)

Technically, these are Horn rules on binary predicates. Rule mining has been an area of active research for the past couple of years. Some approaches mine association rules, some mine logical rules, others mine a schema for the KB, and again others use rule mining for application purposes.

Association Rule Mining. Association rules [3] are mined on a list of transactions. A transaction is a set of items. For example, in the context of sales analysis, a transaction is the set of products bought together by a customer in a specific event. The mined rules are of the form {ElvisCD, ElvisBook} ⇒ ElvisCostume, meaning that people who bought an Elvis CD and an Elvis book usually also bought an Elvis costume. However, these are not the kind of rules that we aim to mine in this paper. We aim to mine Horn rules.

One problem for association rule mining is that for some applications the standard measurements for support and confidence do not produce good results. [36] discusses a number of alternatives to measure the interestingness of a rule in general. Our approach is inspired by this work and we also make use of a language bias [2] to reduce the search space.

Logical Rule Mining. Sherlock [32] is an unsupervised ILP method to learn first-order Horn clauses from a set of extracted facts for a given target relation. It uses probabilistic graphical models (PGMs) to infer new facts. It tackles the noise of the extracted facts by extensive filtering in a preprocessing step and by penalizing longer rules in the inference part. For mining the rules, Sherlock uses 2 heuristics: statistical significance and statistical relevance.

Footnote: RDF has only positive rules and no disjointness constraints or similar concepts.

The WARMR system [11,12] mines patterns in databases that correspond to conjunctive queries. It uses a declarative language bias to reduce the search space. An extension of the system, WARMER [13], modified the approach to support a broader range of conjunctive queries and increase efficiency of search space exploration.

ALEPH is a general purpose ILP system, which implements Muggleton’s Inverse Entailment algorithm [25] in Prolog. It employs a variety of evaluation functions for the rules, and a variety of search strategies.

These approaches are not tailored to deal with large KBs under the Open World Assumption. We compare our system, AMIE, to WARMR and ALEPH, which are the only ones available for download. Our experiments do not only show that these systems mine less sensible rules than AMIE, but also that it takes them much longer to do so.

Expert Rule Mining. Another rule mining approach over RDF data [28] was proposed to discover causal relations in RDF-based medical data. It requires a domain expert who defines targets and contexts of the mining process, so that the correct transactions are generated. Our approach, in contrast, does not rely on the user to define any context or target. It works out-of-the-box.

Generating Schemas. In this paper, we aim to generate Horn rules on a KB. Other approaches use rule mining to generate the schema or taxonomy of a KB. [7] applies clustering techniques based on context vectors and formal concept analysis to construct taxonomies. Other approaches use clustering [21] and ILP-based approaches [9]. For the friend-of-a-friend network on the Semantic Web, [14] applies clustering to identify classes of people and ILP to learn descriptions of these groups. Another example of an ILP-based approach is the DL-Learner [19], which has successfully been applied [15] to generate OWL class expressions from YAGO [35]. As an alternative to ILP techniques, [37] propose a statistical method that does not require negative examples. In contrast to our approach, these techniques aim at generating a schema for a given RDF repository, not logical rules in general.

Learning Rules From Hybrid Sources. [8] proposes to learn association rules from hybrid sources (RDBMS and Ontologies) under the OWA. For this purpose, the definition of frequency (and thus of support and confidence) is changed so that unknown statements contribute with half of the weight of the true statements. Another approach [20] makes use of an ontology and a constraint Datalog program. The goal is to learn association rules at different levels of granularity w.r.t. the type hierarchy of the ontology. While these approaches focus more on the benefits of combining hybrid sources, our approach focuses on pure RDFS KBs.

Further Applications of Rule Mining. [17] proposes an algorithm for frequent pattern mining in KBs that use DL-safe rules. Such KBs can be transformed into a disjunctive Datalog program, which allows seeing patterns as queries. This approach does not mine the Horn rules that we aim at.

Some approaches use rule mining for ontology merging and alignment [10, 24, 30]. The AROMA system [10], e.g., uses association rules on extracted terms to find subsumption relations between classes and properties of different ontologies. Again, these systems do not mine the kind of rules we are interested in.

Footnote: http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph_toc.html

In [1] association rules and frequency analysis are used to identify and classify common misusage patterns for relations in DBpedia. In contrast to our work, this approach does not mine logical rules, but association rules on the co-occurrence of values. Since RDF data can be seen as a graph, mining frequent subtrees [6, 18] is another related field of research. However, as the URIs of resources in knowledge bases are unique, these techniques are limited to mining frequent combinations of classes.

Several approaches, such as Markov Logic [31] or URDF [27] use Horn rules to perform reasoning. These approaches can be consumers of the rules we mine with AMIE.

3. PRELIMINARIES

RDF KBs. In this paper, we focus on RDF knowledge bases. An RDF KB can be considered a set of facts, where each fact is a triple of the form ⟨x, r, y⟩ with x denoting the subject, r the relation (or predicate), and y the object of the fact. There are several equivalent alternative representations of facts; in this paper we use a logical notation and represent a fact as r(x, y). For example, we write father(Elvis,Lisa). The facts of an RDF KB can usually be divided into an A-Box and a T-Box. While the A-Box contains instance data, the T-Box is the subset of facts that define classes, domains, ranges for predicates, and the class hierarchy. Although T-Box information can also be used by our mining approach, we are mainly concerned with the A-Box, i.e., the set of facts relating one particular entity to another.

In the following, we assume a given KB $K$ as input. Let $R = \pi_{relation}(K)$ denote the set of relations contained in $K$ and $E = \pi_{subject}(K) \cup \pi_{object}(K)$ the set of entities.

Functions. A function is a relation r that has at most one object for every subject, i.e., ∀x : |{y : r(x, y)}| ≤ 1. A relation is an inverse function if each of its objects has at most one subject. Since RDF KBs are usually noisy, even relations that should be functions (such as hasBirthdate) may exhibit two objects for the same subject. Therefore, we use the notion of functionality [33]. The functionality of a relation r is a value between 0 and 1, that is 1 if r is a function:

$$\mathit{fun}(r) := \frac{\#x : \exists y : r(x, y)}{\#(x, y) : r(x, y)}$$

with $\#x : X$ as an abbreviation for $|\{x : X \in K\}|$. The inverse functionality is defined accordingly as $\mathit{ifun}(r) := \mathit{fun}(r^{-1})$. Without loss of generality, we assume that $\forall r \in R : \mathit{fun}(r) \ge \mathit{ifun}(r)$ (FUN-Property). If that is not the case for a relation r, we can replace all facts r(x, y) with the inverse relation, $r^{-}(y, x)$, which entails $\mathit{fun}(r^{-}) \ge \mathit{ifun}(r^{-})$. For example, if the KB contains the inverse functional relation directed(person,movie), we can create the functional relation isDirectedBy(movie,person) and use only that one in the rule mining process. Manual inspection shows, however, that relations in semantic KBs tend to be more functional than inverse functional. Intuitively, this allows us to consider a fact r(x, y) as a fact about x.
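The functionality measure and the FUN-Property can be sketched in a few lines of Python. This is an illustrative sketch, not AMIE's implementation: the tuple encoding of facts, the `^-1` suffix for inverted relations, the helper name `orient`, and the toy entity names are all assumptions.

```python
from typing import Set, Tuple

# Hypothetical encoding: a fact is (relation, subject, object).
Fact = Tuple[str, str, str]


def fun(kb: Set[Fact], relation: str) -> float:
    """Number of distinct subjects divided by number of distinct (subject, object) pairs."""
    pairs = {(s, o) for (r, s, o) in kb if r == relation}
    if not pairs:
        return 0.0  # assumption of this sketch: undefined case mapped to 0.0
    return len({s for s, _ in pairs}) / len(pairs)


def ifun(kb: Set[Fact], relation: str) -> float:
    """Inverse functionality: functionality of the relation with its arguments swapped."""
    pairs = {(s, o) for (r, s, o) in kb if r == relation}
    if not pairs:
        return 0.0
    return len({o for _, o in pairs}) / len(pairs)


def orient(kb: Set[Fact]) -> Set[Fact]:
    """Replace every relation with fun < ifun by its inverse (FUN-Property)."""
    relations = {r for r, _, _ in kb}
    flip = {r for r in relations if fun(kb, r) < ifun(kb, r)}
    return {(r + "^-1", o, s) if r in flip else (r, s, o) for (r, s, o) in kb}


# Toy data: one director with two movies, one with a single movie.
kb = {
    ("directed", "directorA", "movie1"),
    ("directed", "directorA", "movie2"),
    ("directed", "directorB", "movie3"),
}

print(round(fun(kb, "directed"), 3))   # 0.667
print(ifun(kb, "directed"))            # 1.0
print(fun(orient(kb), "directed^-1"))  # 1.0
```

Here `directed^-1` plays the role of isDirectedBy in the text: after flipping, every subject (a movie) has exactly one object.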


Rules. An atom is a fact that can have variables at the subject and/or object position. A (Horn) rule consists of a head and a body, where the head is a single atom and the body is a set of atoms. We denote a rule with head r(x, y) and body $\{B_1, \ldots, B_n\}$ by an implication

$$B_1 \wedge B_2 \wedge \ldots \wedge B_n \Rightarrow r(x, y)$$

which we abbreviate as $\vec{B} \Rightarrow r(x, y)$. One example of such a rule is

hasChild(p, c) ∧ isCitizenOf(p, s) ⇒ isCitizenOf(c, s)

An instantiation of a rule is a copy of the rule, where all variables have been substituted by entities. A prediction of a rule is the head atom of an instantiated rule if all body atoms of the instantiated rule appear in the KB. For example, the above rule can predict isCitizenOf(Lisa,USA) if the KB knows a parent of Lisa (hasChild(Elvis,Lisa)) who is American (isCitizenOf(Elvis,USA)).
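The notion of a prediction can be made concrete with a small Python sketch that enumerates all instantiations of a rule body against a KB. The encoding (variables as strings starting with `?`) and the function names `match`, `bindings`, and `predictions` are assumptions of this illustration; it is a naive backtracking join, not AMIE's query engine.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

# Hypothetical encoding: (relation, subject, object); variables start with "?".
Atom = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def match(atom: Atom, fact: Atom, binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend the binding so that the atom equals the fact, or return None."""
    if atom[0] != fact[0]:
        return None
    new = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to a different entity
        elif term != value:
            return None  # constant mismatch
    return new


def bindings(body: List[Atom], kb: Set[Atom],
             binding: Optional[Dict[str, str]] = None) -> Iterator[Dict[str, str]]:
    """Yield every variable binding under which all body atoms are in the KB."""
    binding = {} if binding is None else binding
    if not body:
        yield binding
        return
    for fact in kb:
        extended = match(body[0], fact, binding)
        if extended is not None:
            yield from bindings(body[1:], kb, extended)


def predictions(head: Atom, body: List[Atom], kb: Set[Atom]) -> Set[Atom]:
    """Head atoms of all instantiations whose body atoms appear in the KB."""
    return {(head[0], b.get(head[1], head[1]), b.get(head[2], head[2]))
            for b in bindings(body, kb)}


kb = {("hasChild", "Elvis", "Lisa"), ("isCitizenOf", "Elvis", "USA")}
head = ("isCitizenOf", "?c", "?s")
body = [("hasChild", "?p", "?c"), ("isCitizenOf", "?p", "?s")]

print(predictions(head, body, kb))  # {('isCitizenOf', 'Lisa', 'USA')}
```

The sketch scans the whole KB for every atom, so it is only suitable for toy inputs; it also assumes that every head variable occurs in the body.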

Language Bias. Like most ILP systems, AMIE uses a language bias to restrict the search space. We say that two atoms in a rule are connected if they share a variable or an entity. A rule is connected if every atom is connected transitively to every other atom of the rule. AMIE mines only connected rules, i.e., it avoids constructing rules that contain unrelated atoms. We say that a rule is closed if every variable in the rule appears at least twice. Such rules do not predict merely the existence of a fact (e.g. diedIn(x, y) ⇒ ∃z : wasBornIn(x, z)), but also concrete arguments for it (e.g. diedIn(x, y) ⇒ wasBornIn(x, y)). AMIE mines only closed rules. We allow recursive rules that contain the head relation in the body.
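The two conditions of the language bias can be checked mechanically. The following Python sketch treats a rule as a list of atoms (head and body together); the encoding and the names `is_connected` and `is_closed` are assumptions made for this example.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical encoding: (relation, subject, object); variables start with "?".
Atom = Tuple[str, str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def is_connected(atoms: List[Atom]) -> bool:
    """True if every atom reaches every other atom via shared variables or entities."""
    if not atoms:
        return True  # assumption of this sketch: the empty rule counts as connected
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(atoms)):
            if j not in seen and set(atoms[i][1:]) & set(atoms[j][1:]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(atoms)


def is_closed(atoms: List[Atom]) -> bool:
    """True if every variable occurs at least twice in the rule."""
    counts = Counter(t for _, s, o in atoms for t in (s, o) if is_var(t))
    return all(c >= 2 for c in counts.values())


# diedIn(x, y) => wasBornIn(x, y): connected and closed.
closed_rule = [("wasBornIn", "?x", "?y"), ("diedIn", "?x", "?y")]
# diedIn(x, y) => wasBornIn(x, z): connected, but y and z occur only once.
open_rule = [("wasBornIn", "?x", "?z"), ("diedIn", "?x", "?y")]
# Two atoms without any shared term: not connected.
unrelated = [("livesIn", "?a", "?b"), ("diedIn", "?x", "?y")]

print(is_connected(closed_rule), is_closed(closed_rule))  # True True
print(is_connected(open_rule), is_closed(open_rule))      # True False
print(is_connected(unrelated))                            # False
```

A miner that only emits rules passing both checks never outputs rules that merely assert the existence of some fact.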

Parallels to Association Rule Mining. Association Rule Mining discovers correlations in shopping transactions. Thus, association rules are different in nature from the Horn rules we aim at. Still, we can show some similarities between the two approaches. Let us define one transaction for every set of n entities that are connected in the KB. For example, in Figure 1, we will define a transaction for the entities Elvis, Lisa and Priscilla, because they are connected through the facts mother(Priscilla,Lisa), father(Elvis,Lisa), marr(Elvis, Priscilla). We label the transaction with the set of these entities. Each atom $r(x_i, x_j)$ on variables indexed by $1 \le i, j \le n$ corresponds to an item. A transaction with label $\langle C_1, \ldots, C_n \rangle$ contains an item $r(x_i, x_j)$ if $r(C_i, C_j)$ is in the KB. For example, the transaction ⟨Elvis, Lisa, Priscilla⟩ contains the items $\{mother(x_3, x_2), father(x_1, x_2), marr(x_1, x_3)\}$, since the ground atoms mother(Priscilla,Lisa), father(Elvis,Lisa) and marr(Elvis, Priscilla) are in the KB. In this representation, association rules are Horn rules. In the example, we can mine the association rule

$$\{mother(x_3, x_2), marr(x_1, x_3)\} \Rightarrow \{father(x_1, x_2)\}$$

which corresponds to the Horn rule

$$mother(x_3, x_2) \wedge marr(x_1, x_3) \Rightarrow father(x_1, x_2)$$

Transaction Label          Transaction Items
⟨Elvis, Lisa, Priscilla⟩   $\{mother(x_3, x_2), father(x_1, x_2), marr(x_1, x_3)\}$
⟨Barack, Mali, Mich.⟩      $\{mother(x_3, x_2), father(x_1, x_2), marr(x_1, x_3)\}$
⟨François, Flora, Ségo⟩    $\{mother(x_3, x_2), father(x_1, x_2)\}$

Figure 1: Mining Rules with 3 Variables
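The mapping from a labeled tuple of entities to transaction items can be sketched as follows. The function name `transaction_items` and the tuple encoding are assumptions of this illustration; positions are 1-based to match the variable indices used in the text.

```python
from typing import Set, Tuple

# Hypothetical encodings: a fact is (relation, subject, object);
# an item is (relation, i, j), standing for the atom r(x_i, x_j).
Fact = Tuple[str, str, str]
Item = Tuple[str, int, int]


def transaction_items(label: Tuple[str, ...], kb: Set[Fact]) -> Set[Item]:
    """Items r(x_i, x_j) such that r(C_i, C_j) is in the KB, for label <C_1, ..., C_n>."""
    index = {entity: i + 1 for i, entity in enumerate(label)}
    return {(r, index[s], index[o])
            for (r, s, o) in kb
            if s in index and o in index}


kb = {
    ("mother", "Priscilla", "Lisa"),
    ("father", "Elvis", "Lisa"),
    ("marr", "Elvis", "Priscilla"),
}

items = transaction_items(("Elvis", "Lisa", "Priscilla"), kb)
print(sorted(items))  # [('father', 1, 2), ('marr', 1, 3), ('mother', 3, 2)]
```

With entities ordered as ⟨Elvis, Lisa, Priscilla⟩, the fact mother(Priscilla,Lisa) becomes the item mother(x_3, x_2), and so on for the other two facts.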

Constructing such a table with all possible combinations of entities is practically not very viable. Apart from that,

