
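### Multi-Relational Data Mining: Overview and Core Concepts

In today's data-driven world, data mining is a key technology for discovering hidden patterns, trends, and associations in large volumes of data, providing a sound basis for decision support, predictive analysis, and a range of intelligent applications. Traditional data mining methods, however, mostly focus on finding patterns in a single data table. Multi-relational data mining (MRDM) is a more advanced approach that handles data from multiple tables (relations) in a relational database and can therefore capture more complex and deeper associations in the data.

#### 1. Introduction to Multi-Relational Data Mining

Multi-relational data mining (MRDM), also often called relational data mining (RDM), is an advanced data mining technique that aims to discover patterns across multiple tables of a relational database. Unlike traditional data mining methods based on propositional logic, MRDM uses techniques such as inductive logic programming (ILP) to look for patterns in relationships that span multiple tables, which makes it well suited to data sets with complex structure and relationships.

#### 2. Key Components of Multi-Relational Data Mining

- **Relational data**: A relational database consists of multiple tables (relations), each containing a set of records and fields. The tables are linked to one another through shared fields (keys), forming a complex multi-dimensional data structure. In MRDM, the data comprises not only the information within individual tables but also the relationships across tables.
- **Patterns**: In multi-relational data mining, patterns can take the form of association rules, decision trees, distance-based methods, and others. These patterns reflect complex associations and regularities in the data and can be used for tasks such as prediction, classification, and clustering.
- **Algorithms**: MRDM relies on a range of advanced algorithms, such as inductive logic programming (ILP), multi-relational association rule discovery, multi-relational decision trees, and multi-relational distance-based methods. These algorithms are designed specifically for multi-table data and can establish links between tables to discover deeper patterns.

#### 3. Application Areas of Multi-Relational Data Mining

MRDM has shown distinctive advantages in several fields, with particularly notable results in bioinformatics. It can handle complex data from genomics, proteomics, and related areas, discover associations between diseases and genetic factors, assist drug development, and help optimize personalized treatment plans. MRDM is also widely applied in social network analysis, recommender systems, and market basket analysis.

#### 4. Challenges Facing Multi-Relational Data Mining

Despite its advantages, MRDM faces a number of challenges, including the complexity of data preprocessing, the demand for computational resources, and the question of how to handle noise and inconsistencies in large data sets effectively. Algorithm design and optimization also remain an ongoing research direction, requiring continued exploration of new methods and techniques to improve the efficiency and accuracy of MRDM.

#### Conclusion

As an important branch of data science, multi-relational data mining derives its strength from the ability to process and analyze complex relational data and to reveal deep associations within it. As the technology continues to advance, there is good reason to expect MRDM to play its distinctive role in more fields and to bring new opportunities and breakthroughs for scientific research and society.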


Multi-Relational Data Mining: An Introduction

Sašo Džeroski

Jožef Stefan Institute

Jamova 39, SI-1000 Ljubljana, Slovenia

saso.dzeroski@ijs.si

ABSTRACT

Data mining algorithms look for patterns in data. While most existing data mining approaches look for patterns in a single data table, multi-relational data mining (MRDM) approaches look for patterns that involve multiple tables (relations) from a relational database. In recent years, the most common types of patterns and approaches considered in data mining have been extended to the multi-relational case and MRDM now encompasses multi-relational (MR) association rule discovery, MR decision trees and MR distance-based methods, among others. MRDM approaches have been successfully applied to a number of problems in a variety of areas, most notably in the area of bioinformatics. This article provides a brief introduction to MRDM, while the remainder of this special issue treats in detail advanced research topics at the frontiers of MRDM.

Keywords

relational data mining, multi-relational data mining, inductive logic programming, relational association rules, relational decision trees, relational distance-based methods

1. IN A NUTSHELL

Data mining algorithms look for patterns in data. Most existing data mining approaches are propositional and look for patterns in a single data table. Relational data mining (RDM) approaches [16], many of which are based on inductive logic programming (ILP, [35]), look for patterns that involve multiple tables (relations) from a relational database. To emphasize this fact, RDM is often referred to as multi-relational data mining (MRDM, [21]). In this article, we will use the terms RDM and MRDM interchangeably. In this introductory section, we take a look at data, patterns, and algorithms in RDM, and mention some application areas.

1.1 Relational data

A relational database typically consists of several tables (relations) and not just one table. The example database in Table 1 has two relations: Customer and MarriedTo. Note that relations can be defined extensionally (by tables, as in our example) or intensionally through database views (as explicit logical rules). The latter typically represent relationships that can be inferred from other relationships. For example, having extensional representations of the relations mother and father, we can intensionally define the relations grandparent, grandmother, sibling, and ancestor, among others.
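As a minimal sketch of this distinction, the Python snippet below stores mother and father extensionally as sets of tuples and derives grandparent and sibling intensionally, as views computed from them. The individuals and facts are invented for illustration and do not come from the article.

```python
# Extensional relations: explicit sets of (parent, child) tuples.
# The individuals are hypothetical and serve only as an illustration.
mother = {("ann", "mary"), ("ann", "tom"), ("mary", "eve")}
father = {("bob", "mary"), ("bob", "tom"), ("tom", "ian")}


def parent():
    """Intensional relation: X is a parent of Y if X is the mother or father of Y."""
    return mother | father


def grandparent():
    """grandparent(X, Z) <- parent(X, Y), parent(Y, Z)."""
    p = parent()
    return {(x, z) for (x, y1) in p for (y2, z) in p if y1 == y2}


def sibling():
    """sibling(X, Y) <- parent(P, X), parent(P, Y), X != Y."""
    p = parent()
    return {(x, y) for (p1, x) in p for (p2, y) in p if p1 == p2 and x != y}


print(sorted(grandparent()))
print(sorted(sibling()))
```

Only mother and father are stored; the other relations are recomputed from them on demand, which is how a database view behaves.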

Intensional definitions of relations typically represent general knowledge about the domain of discourse. For example, if we have extensional relations listing the atoms that make a compound molecule and the bonds between them, functional groups of atoms can be defined intensionally. Such general knowledge is called domain knowledge or background knowledge.

Table 1: A relational database with two tables and two classification rules: a propositional and a relational.

Customer table

    ID   Gender  Age  Income  TotalSpent  BigSpender
    c1   Male    30   214000  18800       Yes
    c2   Female  19   139000  15100       Yes
    c3   Male    55   50000   12400       No
    c4   Female  48   26000   8600        No
    c5   Male    63   191000  28100       Yes
    c6   Male    63   114000  20400       Yes
    c7   Male    58   38000   11800       No
    c8   Male    22   39000   5700        No
    ...  ...     ...  ...     ...         ...

MarriedTo table

    Spouse1  Spouse2
    c1       c2
    c2       c1
    c3       c4
    c4       c3
    c5       c12
    c6       c14
    ...      ...

Propositional rule

    IF Income > 108000 THEN BigSpender = Yes

Relational rule

    big_spender(C1,Age1,Income1,TotalSpent1) ←
        married_to(C1,C2) ∧
        customer(C2,Age2,Income2,TotalSpent2,BS2) ∧
        Income2 ≥ 108000.

1.2 Relational patterns

Relational patterns involve multiple relations from a relational database. They are typically stated in a more expressive language than patterns defined on a single data table. The major types of relational patterns extend the types of propositional patterns considered in single table data mining. We can thus have relational classification rules, relational regression trees, and relational association rules, among others.

An example relational classification rule is given in Table 1, which involves the relations Customer and MarriedTo. It predicts a person to be a big spender if the person is married to somebody with high income (compare this to the rule that states a person is a big spender if he has high income, listed above the relational rule). Note that the two persons C1 and C2 are connected through the relation MarriedTo.
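To make the contrast concrete, here is a small Python sketch that applies both rules from Table 1 to the rows the table lists explicitly. The dictionary layout and function names are assumptions made for this illustration. The rows hidden behind "..." (including customers c12 and c14) are unknown, so the relational rule simply cannot fire through them here.

```python
# Rows of the Customer table that Table 1 lists explicitly.
customer = {
    "c1": {"gender": "Male", "age": 30, "income": 214000, "total_spent": 18800, "big_spender": "Yes"},
    "c2": {"gender": "Female", "age": 19, "income": 139000, "total_spent": 15100, "big_spender": "Yes"},
    "c3": {"gender": "Male", "age": 55, "income": 50000, "total_spent": 12400, "big_spender": "No"},
    "c4": {"gender": "Female", "age": 48, "income": 26000, "total_spent": 8600, "big_spender": "No"},
    "c5": {"gender": "Male", "age": 63, "income": 191000, "total_spent": 28100, "big_spender": "Yes"},
    "c6": {"gender": "Male", "age": 63, "income": 114000, "total_spent": 20400, "big_spender": "Yes"},
    "c7": {"gender": "Male", "age": 58, "income": 38000, "total_spent": 11800, "big_spender": "No"},
    "c8": {"gender": "Male", "age": 22, "income": 39000, "total_spent": 5700, "big_spender": "No"},
}

# Listed rows of the MarriedTo table; c12 and c14 are not among the listed customers.
married_to = [("c1", "c2"), ("c2", "c1"), ("c3", "c4"), ("c4", "c3"), ("c5", "c12"), ("c6", "c14")]


def propositional_rule(cid):
    """IF Income > 108000 THEN BigSpender = Yes (looks at one row of one table)."""
    return customer[cid]["income"] > 108000


def relational_rule(cid):
    """big_spender(C1, ...) <- married_to(C1, C2), customer(C2, ...), Income2 >= 108000."""
    return any(
        s1 == cid and s2 in customer and customer[s2]["income"] >= 108000
        for s1, s2 in married_to
    )


predicted_propositional = [c for c in customer if propositional_rule(c)]
predicted_relational = [c for c in customer if relational_rule(c)]
print(predicted_propositional)
print(predicted_relational)
```

The propositional rule tests a single attribute of a single row, whereas the relational rule has to follow the MarriedTo link to a second Customer row before it can test that row's income.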

Relational patterns are typically expressed in subsets of first-order logic (also called predicate or relational logic). Essentials of predicate logic include predicates (MarriedTo) and variables (C1, C2), which are not present in propositional logic. Relational patterns are thus more expressive than propositional ones.

Most commonly, the logic programming subset of first-order logic, which is strongly related to deductive databases, is used as the formalism for expressing relational patterns. E.g., the relational rule in Table 1 is a logic program clause. Note that a relation in a relational database corresponds to a predicate in first-order logic (and logic programming).

1.3 Relational to propositional

RDM tools can be applied directly to multi-relational data to find relational patterns that involve multiple relations. Most other data mining approaches assume that the data resides in a single table and require preprocessing to integrate data from multiple tables (e.g., through joins or aggregation) into a single table before they can be applied. Integrating data from multiple tables through joins or aggregation, however, can cause loss of meaning or information.

Suppose we are given the relations customer(CustID, Name, Age, SpendsALot) and purchase(CustID, ProductID, Date, Value, PaymentMode), where each customer can make multiple purchases, and we are interested in characterizing customers that spend a lot. Integrating the two relations via a natural join will give rise to a relation purchase1 where each row corresponds to a purchase and not to a customer. One possible aggregation would give rise to the relation customer1(CustID, Age, NofPurchases, TotalValue, SpendsALot). In this case, however, some information has been clearly lost during aggregation.

The following pattern can be discovered if the relations customer and purchase are considered together.

    customer(CID, Name, Age, yes) ←
        Age > 30 ∧
        purchase(CID, PID, D, Value, PM) ∧
        PM = credit_card ∧ Value > 100.

This pattern says: “a customer spends a lot if she is older than 30, has purchased a product of value more than 100 and paid for it by credit card.” It would not be possible to induce such a pattern from either of the relations purchase1 and customer1 considered on their own.
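The effect of the two integration strategies can be sketched in Python. The two customers and three purchases below are hypothetical data invented for this example. Only the relation schemas and the pattern itself come from the text.

```python
from collections import defaultdict

# Hypothetical tuples for customer(CustID, Name, Age, SpendsALot)
customer = [(1, "Ana", 45, "yes"), (2, "Bor", 25, "no")]
# Hypothetical tuples for purchase(CustID, ProductID, Date, Value, PaymentMode)
purchase = [
    (1, "p1", "2003-01-05", 150, "credit_card"),
    (1, "p2", "2003-02-10", 20, "cash"),
    (2, "p3", "2003-01-07", 300, "cash"),
]

# Natural join on CustID: one row per purchase, not per customer.
purchase1 = [c + p[1:] for c in customer for p in purchase if c[0] == p[0]]

# Aggregation: one row per customer, but individual values and payment modes are gone.
count, total = defaultdict(int), defaultdict(int)
for cust_id, _pid, _date, value, _mode in purchase:
    count[cust_id] += 1
    total[cust_id] += value
customer1 = [(cid, age, count[cid], total[cid], spends) for cid, _name, age, spends in customer]


def pattern_holds(cust):
    """Age > 30 and there exists a purchase paid by credit card with Value > 100."""
    cid, _name, age, _spends = cust
    return age > 30 and any(
        p[0] == cid and p[4] == "credit_card" and p[3] > 100 for p in purchase
    )


matches = [c[0] for c in customer if pattern_holds(c)]
print(len(purchase1), customer1, matches)
```

In this sketch customer1 keeps only the purchase count and total value, so the condition "some single purchase above 100 was paid by credit card" can no longer be tested on it. Checking the pattern requires looking at customer and purchase together.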

Besides the ability to deal with data stored in multiple tables directly, RDM systems are usually able to take into account generally valid background (domain) knowledge given as a logic program. The ability to take into account background knowledge and the expressive power of the language of discovered patterns are distinctive for RDM.

Note that data mining approaches that find patterns in a given single table are referred to as attribute-value or propositional learning approaches, as the patterns they find can be expressed in propositional logic. RDM approaches are also referred to as first-order learning approaches, or relational learning approaches, as the patterns they find are expressed in the relational formalism of first-order logic. A more detailed discussion of the single table assumption, the problems resulting from it and how a relational representation alleviates these problems is given by Wrobel [50] (Chapter 4 of [16]).

1.4 Algorithms for relational data mining

An RDM algorithm searches a language of relational patterns to find patterns valid in a given database. The search algorithms used here are very similar to those used in single table data mining: one can search exhaustively or heuristically (greedy search, best-first search, etc.). Just as for the single table case, the space of patterns considered is typically lattice-structured and exploiting this structure is essential for achieving efficiency. The lattice structure is traversed by using refinement operators [46], which are more complicated in the relational case. In the propositional case, a refinement operator may add a condition to a rule antecedent or an item to an item set. In the relational case, a new relation can be introduced as well.
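A rough Python sketch of a relational refinement step is given below. It is a deliberately simplified operator, not the one defined in [46]: it specializes a clause by adding one literal, which may reuse existing variables or bring in one new variable. The predicate high_income, the variable-naming scheme and the function name are assumptions made for this example.

```python
from itertools import product

# Hypothetical pattern language: predicate names with their arities.
PREDICATES = {"married_to": 2, "high_income": 1}


def refinements(head_vars, body, predicates=PREDICATES):
    """Return every body obtained by adding one literal to `body`.

    A literal is a (predicate_name, argument_tuple) pair. A new literal may
    reuse variables already in the clause and may introduce one new variable,
    but it must share at least one variable with the clause.
    """
    used = list(dict.fromkeys(list(head_vars) + [v for _, args in body for v in args]))
    new_var = f"V{len(used) + 1}"
    result = []
    for name, arity in predicates.items():
        for args in product(used + [new_var], repeat=arity):
            if all(a == new_var for a in args):
                continue  # literal would not be connected to the rest of the clause
            literal = (name, args)
            if literal not in body:
                result.append(body + [literal])
    return result


# Refine big_spender(C1) <- true, then refine one of the results again.
step1 = refinements(("C1",), [])
step2 = refinements(("C1",), [("married_to", ("C1", "V2"))])
print(len(step1), len(step2))
```

The first step can introduce the relation married_to together with a new variable V2. The second step can then attach a condition to V2, which has no counterpart when refining a single-table rule.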

Just as many data mining algorithms come from the field of machine learning, many RDM algorithms come from the field of inductive logic programming (ILP, [35; 30]). Situated at the intersection of machine learning and logic programming, ILP has been concerned with finding patterns expressed as logic programs. Initially, ILP focussed on automated program synthesis from examples, formulated as a binary classification task. In recent years, however, the scope of ILP has broadened to cover the whole spectrum of data mining tasks (classification, regression, clustering, association analysis). The most common types of patterns have been extended to their relational versions (relational classification rules, relational regression trees, relational association rules) and so have the major data mining algorithms (decision tree induction, distance-based clustering and prediction, etc.).

Van Laer and De Raedt [49] (Chapter 10 of [16]) present a generic approach of upgrading single table data mining algorithms (propositional learners) to relational ones (first-order learners). Note that it is not trivial to extend a single table data mining algorithm to a relational one. Extending the key notions, e.g., defining distance measures for multi-relational data, requires considerable insight and creativity. Efficiency concerns are also very important, as it is often the case that even testing a given relational pattern for validity is computationally expensive, let alone searching a space of such patterns for valid ones. An alternative approach to RDM (called propositionalization) is to create a single table from a multi-relational database in a systematic fashion [28] (Chapter 11 of [16]): this approach shares some efficiency concerns and in addition can have limited expressiveness.

A pattern language typically contains a very large number of possible patterns even in the single table case: this number is in practice limited by setting some parameters (e.g., the largest size of frequent itemsets for association rule discovery). For relational pattern languages, the number of possible patterns is even larger and it becomes necessary to limit the space of possible patterns by providing more explicit constraints. These typically specify what relations should be involved in the patterns, how the relations can be interconnected, and what other syntactic constraints the patterns have to obey. The explicit specification of the pattern language (or constraints imposed upon it) is known under the name of declarative bias [38].

1.5 Applications of relational data mining

The use of RDM has enabled applications in areas rich with structured data and domain knowledge, which would be difficult to address with single table approaches. RDM has been used in different areas, ranging from analysis of business data, through environmental and traffic engineering to web mining, but has been especially successful in bioinformatics (including drug design and functional genomics). Bioinformatics applications of RDM are discussed in the article by Page and Craven in this issue. For a comprehensive survey of RDM applications we refer the reader to Džeroski [20] (Chapter 14 of [16]).

1.6 What’s in this article

The remainder of this article first gives a brief introduction to inductive logic programming, which (from the viewpoint of MRDM) is mainly concerned with the induction of relational classification rules for two-class problems. It then proceeds to introduce the basic MRDM techniques of discovery of relational association rules, induction of relational decision trees and relational distance-based methods (that include both classification and clustering). The article concludes with an overview of the MRDM literature and Internet resources.

2. INDUCTIVE LOGIC PROGRAMMING

From a KDD perspective, we can say that inductive logic programming (ILP) is concerned with the development of techniques and tools for relational data mining. Patterns discovered by ILP systems are typically expressed as logic programs, an important subset of first-order (predicate) logic, also called relational logic. In this section, we first briefly discuss the language of logic programs, then proceed with a discussion of the major task of ILP and some approaches to solving it.

2.1 Logic programs and databases

Logic programs consist of clauses. We can think of clauses as first-order rules, where the conclusion part is termed the head and the condition part the body of the clause. The head and body of a clause consist of atoms, an atom being a predicate applied to some arguments, which are called terms. In Datalog, terms are variables and constants, while in general they may consist of function symbols applied to other terms. Ground clauses have no variables.

Consider the clause father(X, Y) ∨ mother(X, Y) ← parent(X, Y). It reads: “if X is a parent of Y then X is the father of Y or X is the mother of Y” (∨ stands for logical or). parent(X, Y) is the body of the clause and father(X, Y) ∨ mother(X, Y) is the head. parent, father and mother are predicates, X and Y are variables, and parent(X, Y), father(X, Y), mother(X, Y) are atoms. We adopt the Prolog [4] syntax and start variable names with capital letters. Variables in clauses are implicitly universally quantified. The above clause thus stands for the logical formula ∀X∀Y : father(X, Y) ∨ mother(X, Y) ∨ ¬parent(X, Y). Clauses are also viewed as sets of literals, where a literal is an atom or its negation. The above clause is then the set {father(X, Y), mother(X, Y), ¬parent(X, Y)}.

Table 2: Database and logic programming terms.

    DB terminology                   LP terminology
    relation name p                  predicate symbol p
    attribute of relation p          argument of predicate p
    tuple ⟨a1, ..., an⟩              ground fact p(a1, ..., an)
    relation p, a set of tuples      predicate p, defined extensionally by a set of ground facts
    relation q, defined as a view    predicate q, defined intensionally by a set of rules (clauses)

As opposed to full clauses, definite clauses contain exactly one atom in the head. As compared to definite clauses, program clauses can also contain negated atoms in the body. While the clause in the paragraph above is a full clause, the clause ancestor(X, Y) ← parent(Z, Y) ∧ ancestor(X, Z) is a definite clause (∧ stands for logical and). It is also a recursive clause, since it defines the relation ancestor in terms of itself and the relation parent. The clause mother(X, Y) ← parent(X, Y) ∧ not male(X) is a program clause.
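The two clauses can be evaluated over a small set of facts with a short Python sketch. Several things here are assumptions for illustration: the base clause ancestor(X, Y) ← parent(X, Y), which the text does not state but which the recursion needs as a starting point, the naive bottom-up fixpoint evaluation strategy, and the male facts. The parent facts match those used in the daughter example of Table 3.

```python
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
male = {"tom", "ian"}  # assumed facts, not given in the text


def ancestors(parent_facts):
    """Naive bottom-up evaluation of the recursive ancestor definition."""
    # Assumed base clause: ancestor(X, Y) <- parent(X, Y).
    ancestor = set(parent_facts)
    while True:
        # Recursive clause: ancestor(X, Y) <- parent(Z, Y), ancestor(X, Z).
        derived = {
            (x, y)
            for (z, y) in parent_facts
            for (x, z2) in ancestor
            if z2 == z
        }
        if derived <= ancestor:  # nothing new: fixpoint reached
            return ancestor
        ancestor |= derived


def mothers(parent_facts, male_facts):
    """Program clause: mother(X, Y) <- parent(X, Y), not male(X)."""
    # "not" is read as negation as failure: X is not listed among the male facts.
    return {(x, y) for (x, y) in parent_facts if x not in male_facts}


print(sorted(ancestors(parent)))
print(sorted(mothers(parent, male)))
```

The loop keeps applying the recursive clause until no new tuples appear, which terminates because the relations involved are finite.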

A set of clauses is called a clausal theory. Logic programs are sets of program clauses. A set of program clauses with the same predicate in the head is called a predicate definition. Most ILP approaches learn predicate definitions.

A predicate in logic programming corresponds to a relation in a relational database. An n-ary relation p is formally defined as a set of tuples [48], i.e., a subset of the Cartesian product of n domains D1 × D2 × ... × Dn, where a domain (or a type) is a set of values. It is assumed that a relation is finite unless stated otherwise. A relational database (RDB) is a set of relations.

Thus, a predicate corresponds to a relation, and the arguments of a predicate correspond to the attributes of a relation. The major difference is that the attributes of a relation are typed (i.e., a domain is associated with each attribute). For example, in the relation lives_in(X, Y), we may want to specify that X is of type person and Y is of type city. Database clauses are typed program clauses.

A deductive database (DDB) is a set of database clauses. In deductive databases, relations can be defined extensionally as sets of tuples (as in RDBs) or intensionally as sets of database clauses. Database clauses use variables and function symbols in predicate arguments and the language of DDBs is substantially more expressive than the language of RDBs [31; 48]. A deductive Datalog database consists of definite database clauses with no function symbols.

Table 2 relates basic database and logic programming terms. For a full treatment of logic programming, RDBs, and deductive databases, we refer the reader to [31] and [48].

Table 3: A simple ILP problem: learning the daughter relation. Positive examples are denoted by ⊕ and negative by ⊖.

    Training examples          Background knowledge
    daughter(mary, ann). ⊕     parent(ann, mary).   female(ann).
    daughter(eve, tom).  ⊕     parent(ann, tom).    female(mary).
    daughter(tom, ann).  ⊖     parent(tom, eve).    female(eve).
    daughter(eve, ann).  ⊖     parent(tom, ian).

2.2 The ILP task of relational rule induction

Logic programming as a subset of first-order logic is mostly concerned with deductive inference. Inductive logic programming, on the other hand, is concerned with inductive inference. It generalizes from individual instances/observations in the presence of background knowledge, finding regularities/hypotheses about yet unseen instances.

The most commonly addressed task in ILP is the task of learning logical definitions of relations [42], where tuples that belong or do not belong to the target relation are given as examples. From training examples ILP then induces a logic program (predicate definition) corresponding to a view that defines the target relation in terms of other relations that are given as background knowledge. This classical ILP task is addressed, for instance, by the seminal MIS system [46] (rightfully considered as one of the most influential ancestors of ILP) and one of the best known ILP systems FOIL [42].

Given is a set of examples, i.e., tuples that belong to the target relation p (positive examples) and tuples that do not belong to p (negative examples). Given are also background relations (or background predicates) qi that constitute the background knowledge and can be used in the learned definition of p. Finally, a hypothesis language, specifying syntactic restrictions on the definition of p, is also given (either explicitly or implicitly). The task is to find a definition of the target relation p that is consistent and complete, i.e., explains all the positive and none of the negative tuples.

Formally, given is a set of examples E = P ∪ N, where P contains positive and N negative examples, and background knowledge B. The task is to find a hypothesis H such that ∀e ∈ P : B ∧ H ⊨ e (H is complete) and ∀e ∈ N : B ∧ H ⊭ e (H is consistent), where ⊨ stands for logical implication or entailment. This setting, introduced by Muggleton [34], is thus also called learning from entailment. In an alternative setting proposed by De Raedt and Džeroski [15], the requirement that B ∧ H ⊨ e is replaced by the requirement that H be true in the minimal Herbrand model of B ∧ e: this setting is called learning from interpretations.

In the most general formulation, each e, as well as B and H can be a clausal theory. In practice, each e is most often a ground example (tuple), B is a relational database (which may or may not contain views) and H is a definite logic program. The semantic entailment (⊨) is in practice replaced with syntactic entailment (⊢) or provability, where the resolution inference rule (as implemented in Prolog) is most often used to prove examples from a hypothesis and the background knowledge. In learning from entailment, a positive fact is explained if it can be found among the answer substitutions for h produced by a query ?- b on database B, where h ← b is a clause in H. In learning from interpretations, a clause h ← b from H is true in the minimal Herbrand model of B if the query b ∧ ¬h fails on B.

As an illustration, consider the task of defining relation daughter(X, Y), which states that person X is a daughter of person Y, in terms of the background knowledge relations female and parent. These relations are given in Table 3. There are two positive and two negative examples of the target relation daughter. In the hypothesis language of definite program clauses it is possible to formulate the following definition of the target relation,

    daughter(X, Y) ← female(X), parent(Y, X).

which is consistent and complete with respect to the background knowledge and the training examples.
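The claim can be checked mechanically. The following Python sketch encodes the facts of Table 3 as sets and tests the hypothesis against every training example. The set-based encoding and the function name are choices made for this illustration.

```python
# Background knowledge from Table 3.
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}

# Training examples from Table 3.
positives = {("mary", "ann"), ("eve", "tom")}
negatives = {("tom", "ann"), ("eve", "ann")}


def daughter(x, y):
    """Hypothesis H: daughter(X, Y) <- female(X), parent(Y, X)."""
    return x in female and (y, x) in parent


# Complete: every positive example is explained by H together with B.
complete = all(daughter(x, y) for x, y in positives)
# Consistent: no negative example is explained.
consistent = not any(daughter(x, y) for x, y in negatives)
print(complete, consistent)
```

Here daughter(tom, ann) is rejected because tom is not listed as female, and daughter(eve, ann) is rejected because parent(ann, eve) is not among the facts.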

In general, depending on the background knowledge, the hypothesis language and the complexity of the target concept, the target predicate definition may consist of a set of clauses, such as

    daughter(X, Y) ← female(X), mother(Y, X).
    daughter(X, Y) ← female(X), father(Y, X).

if the relations mother and father were given in the background knowledge instead of the parent relation.

The hypothesis language is typically a subset of the language of program clauses. As the complexity of learning grows with the expressiveness of the hypothesis language, restrictions have to be imposed on hypothesized clauses. Typical restrictions are the exclusion of recursion and restrictions on variables that appear in the body of the clause but not in its head (so-called new variables).

From a data mining perspective, the task described above is a binary classification task, where one of two classes is assigned to the examples (tuples): ⊕ (positive) or ⊖ (negative). Classification is one of the most commonly addressed tasks within the data mining community and includes approaches for rule induction. Rules can be generated from decision trees [43] or induced directly [33; 7].

ILP systems dealing with the classification task typically adopt the covering approach of rule induction systems. In a main loop, a covering algorithm constructs a set of clauses. Starting from an empty set of clauses, it constructs a clause explaining some of the positive examples, adds this clause to the hypothesis, and removes the positive examples explained. These steps are repeated until all positive examples have been explained (the hypothesis is complete).

In the inner loop of the covering algorithm, individual clauses are constructed by (heuristically) searching the space of possible clauses, structured by a specialization or generalization operator. Typically, search starts with a very general rule (clause with no conditions in the body), then proceeds to add literals (conditions) to this clause until it only covers (explains) positive examples (the clause is consistent).
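The two loops can be sketched in Python on the daughter problem of Table 3. This is a toy version under several stated assumptions: candidate literals are limited to female and parent over the head variables X and Y (no new variables are introduced), search is greedy, and the score "positives covered minus negatives covered" is a simple stand-in for the heuristics used by actual ILP systems.

```python
PARENT = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
FEMALE = {"ann", "mary", "eve"}
POSITIVES = [("mary", "ann"), ("eve", "tom")]
NEGATIVES = [("tom", "ann"), ("eve", "ann")]

# Candidate body literals over the head variables of daughter(X, Y).
CANDIDATES = [
    ("female", ("X",)),
    ("female", ("Y",)),
    ("parent", ("X", "Y")),
    ("parent", ("Y", "X")),
]


def holds(literal, example):
    """Evaluate one literal with X and Y bound to the example tuple."""
    name, args = literal
    binding = dict(zip(("X", "Y"), example))
    values = tuple(binding[a] for a in args)
    return values[0] in FEMALE if name == "female" else values in PARENT


def covers(body, example):
    return all(holds(lit, example) for lit in body)


def learn_clause(positives, negatives):
    """Inner loop: start with an empty body and add literals greedily."""
    body, pos, neg = [], list(positives), list(negatives)
    while neg:  # the clause still covers some negative example
        options = [lit for lit in CANDIDATES if lit not in body]
        if not options:
            break  # language exhausted; the clause stays inconsistent
        best = max(
            options,
            key=lambda lit: sum(holds(lit, e) for e in pos) - sum(holds(lit, e) for e in neg),
        )
        body.append(best)
        pos = [e for e in pos if holds(best, e)]
        neg = [e for e in neg if holds(best, e)]
    return body


def covering(positives, negatives):
    """Main loop: add clauses until every positive example is explained."""
    hypothesis, remaining = [], list(positives)
    while remaining:
        body = learn_clause(remaining, negatives)
        covered = [e for e in remaining if covers(body, e)]
        if not covered:
            break  # no progress; stop rather than loop forever
        hypothesis.append(body)
        remaining = [e for e in remaining if e not in covered]
    return hypothesis


hypothesis = covering(POSITIVES, NEGATIVES)
print(hypothesis)
```

On this data a single clause is enough: the body female(X), parent(Y, X) covers both positive examples and neither negative one, so the main loop stops after one iteration.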

When dealing with incomplete or noisy data, which is most often the case, the criteria of consistency and completeness are relaxed. Statistical criteria are typically used
