
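### Key Points on Speeding Up Multi-Relational Data Mining

#### 1. Introduction and Background

Multi-relational data mining (MRDM) is the process of extracting useful information and knowledge from multiple interrelated data tables or relations. With the arrival of the big-data era, scientific and commercial domains have accumulated large amounts of data stored in relational databases. Such data often involves complex relationships among entities, which traditional single-table data mining methods handle poorly. The paper "Speeding Up Multi-Relational Data Mining", by Anna Atramentov and Vasant Honavar, focuses on reducing the running time of multi-relational data mining algorithms while keeping the quality of the results unchanged. The authors propose a general approach that achieves the speedup by optimizing how selection graphs are constructed. A selection graph is a data structure used to obtain, from a relational database, the information needed to build predictive models (such as decision tree classifiers).

#### 2. Overview of Multi-Relational Data Mining

The goal of multi-relational data mining is to learn predictive models from multiple relational tables; these models can then be used for classification or regression on new, unseen data. In practice, this kind of data mining is useful for complex problems such as analyzing gene interaction networks in bioinformatics or mining connections between users in social network analysis.

#### 3. The Concept and Role of Selection Graphs

The selection graph is a key data structure in multi-relational data mining algorithms. It captures the relationships between entities, and features are extracted from it to build predictive models. Using selection graphs, an algorithm can query and filter the entities related to a target entity and so provide training data for a decision tree or other predictive model.

#### 4. Speedup Strategies

To speed up multi-relational data mining algorithms, the authors propose the following improvements:

1. **Optimized selection graph construction:** optimizing the construction of selection graphs removes unnecessary computation and data retrieval operations, which significantly reduces running time.
2. **Incremental update mechanism:** when processing large data sets, an incremental update mechanism allows selection graphs to be updated quickly when the data changes, avoiding repeated computation.
3. **Parallel processing:** the parallel processing capabilities of modern hardware are used to parallelize selection graph construction and querying, further improving efficiency.

#### 5. Experimental Validation

The authors validated the proposed speedup method through a series of experiments. The results show that the method reduces running time by one to two orders of magnitude while preserving the quality of the results. Large data sets that were previously too expensive to process can therefore be analyzed with the improved algorithm.

#### 6. Application Scenarios

Multi-relational data mining has broad application prospects in many fields, for example:

1. **Bioinformatics:** analyzing protein interaction networks and predicting drug targets.
2. **Social network analysis:** studying social relationships between users and predicting social behavior.
3. **E-commerce:** analyzing customer purchasing patterns and providing personalized recommendations.
4. **Financial risk management:** detecting fraudulent transactions and assessing credit risk.

#### 7. Conclusion and Outlook

"Speeding Up Multi-Relational Data Mining" proposes effective speedup strategies that improve the execution efficiency of multi-relational data mining algorithms and broaden their range of practical applications. As the technology advances, multi-relational data mining is expected to show greater potential and value in more fields.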


Speeding Up Multi-Relational Data Mining

Anna Atramentov, Vasant Honavar

Artificial Intelligence Research Laboratory,

Computer Science Department, Iowa State University,

226 Atanasoff Hall, Ames, IA 50011-1040, USA,

{anjuta, honavar}@cs.iastate.edu

Abstract

We present a general approach to speeding up a family of multi-relational data mining algorithms that construct and use selection graphs to obtain the information needed for building predictive models (e.g., decision tree classifiers) from a relational database. Preliminary results of our experiments suggest that the proposed method can reduce the running time of such algorithms by one to two orders of magnitude without any deterioration in the quality of results. The proposed modifications extend the applicability of multi-relational data mining algorithms to significantly larger relational databases, for which they would otherwise be infeasible in practice.

1 Introduction

Recent advances in high-throughput data acquisition, digital storage, and communications technologies have made it possible to gather very large amounts of data in many scientific and commercial domains. Much of this data resides in relational databases. Even when the data repository is not a relational database, it is often convenient to view heterogeneous data sources as if they were a collection of relations [Reinoso-Castillo, 2002] for the purpose of extracting and organizing information from multiple sources. Thus, the task of learning from relational data has begun to receive significant attention in the literature [Blockeel, 1998; Knobbe et al., 1999a; Friedman et al., 1999; Koller, 1999; Krogel and Wrobel, 2001; Getoor, 2001; Kersting and De Raedt, 2000; Pfeffer, 2000; Dzeroski and Lavrac, 2001; Dehaspe and Raedt, 1997; Jaeger, 1997; Karalic and Bratko, 1997].

Recently, [Knobbe et al., 1999a] outlined a general framework for multi-relational data mining which exploits structured query language (SQL) to gather the information needed for constructing classifiers (e.g., decision trees) from multi-relational data. Based on this framework, several algorithms for multi-relational data mining have been developed. Experiments reported by [Leiva, 2002] have shown that MRDTL, a multi-relational decision tree learning algorithm, is competitive with other approaches to learning from relational data.

One common feature of all algorithms based on the multi-relational data mining framework proposed by [Knobbe et al., 1999a] is their use of selection graphs to query the relevant databases to obtain the information (e.g., statistics) needed for constructing a model. Our experiments with MRDTL revealed that the execution of queries encoded by such selection graphs was a major bottleneck in the running time of the algorithm. Hence, this paper describes an approach for significantly speeding up some of the most time-consuming components of such algorithms. Preliminary results of our experiments suggest that the proposed method can yield speedups of one to two orders of magnitude in the case of MRDTL. We expect similar speedups to be obtained with other multi-relational data mining algorithms that construct and use selection graphs.

The rest of the paper is organized as follows: Section 2 gives an overview of the multi-relational data mining framework, Section 3 describes the speedup scheme for this framework, and Section 4 presents the experimental results obtained by applying the scheme.

2 Multi-Relational Data Mining

2.1 Relational Databases

A relational database consists of a set of tables $D = \{X_1, X_2, \ldots, X_n\}$ and a set of associations between pairs of tables. In each table, a row represents the description of one record. A column represents the values of some attribute for the records in the table. An attribute A from table X is denoted by X.A.

Definition 2.1 The domain of the attribute X.A is denoted as DOM(X.A) and is defined as the set of all different values that the records from table X have in the column of attribute A.

Associations between tables are defined through primary and foreign key attributes in D.

Definition 2.2 A primary key attribute of table X, denoted as X.ID, has a unique value for each row in this table.

Definition 2.3 A foreign key attribute in table Y referencing table X, denoted as Y.X_ID, takes values from DOM(X.ID).
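Definitions 2.1 to 2.3 can be checked mechanically on small tables. The sketch below is only an illustration: it represents a table as a list of Python dictionaries, and the helper names `dom`, `is_primary_key`, and `is_foreign_key` as well as the toy tables `X` and `Y` are assumptions introduced here, not part of the paper.

```python
def dom(rows, attribute):
    """DOM(X.A): the set of distinct values found in column `attribute`."""
    return {row[attribute] for row in rows}


def is_primary_key(rows, attribute):
    """Definition 2.2: the attribute has a unique value for each row."""
    return len(dom(rows, attribute)) == len(rows)


def is_foreign_key(child_rows, fk, parent_rows, pk):
    """Definition 2.3: every value of the key is drawn from DOM(X.ID)."""
    return dom(child_rows, fk) <= dom(parent_rows, pk)


# Hypothetical toy tables X and Y (not taken from the paper).
X = [
    {"ID": 1, "color": "red"},
    {"ID": 2, "color": "red"},
    {"ID": 3, "color": "blue"},
]
Y = [
    {"ID": 10, "X_ID": 1},
    {"ID": 11, "X_ID": 1},
    {"ID": 12, "X_ID": 3},
]

print(dom(X, "color"))                      # distinct colors in X
print(is_primary_key(X, "ID"))              # X.ID is unique per row
print(is_foreign_key(Y, "X_ID", X, "ID"))   # Y.X_ID takes values from DOM(X.ID)
```

Note that `color` is not a primary key of `X` in this toy data, because two rows share the value `red`.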

An example of a relational database is shown in Figure 1. There are three tables and three associations between tables. The primary keys of the tables GENE, COMPOSITION, and INTERACTION are GENE_ID, C_ID, and I_ID, respectively. Each COMPOSITION record references some GENE record through the foreign key COMPOSITION.GENE_ID, and each INTERACTION record references two GENE records through the foreign keys INTERACTION.GENE_ID1 and INTERACTION.GENE_ID2.

GENE (GENE_ID, ESSENTIAL, CHROMOSOME, LOCALIZATION)

INTERACTION (I_ID, GENE_ID1, GENE_ID2, TYPE, EXPRESSION_CORR)

COMPOSITION (C_ID, GENE_ID, CLASS, COMPLEX, PHENOTYPE, MOTIF)

Figure 1: Example database
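As a rough illustration of the example database, the following sketch declares the three tables in an in-memory SQLite database and lists the declared associations. The table and column names come from Figure 1; the column types, the `SCHEMA` string, and the helpers `build_example_db` and `foreign_keys` are assumptions made for this sketch.

```python
import sqlite3

# Hypothetical DDL for the tables of Figure 1; column types are assumed.
SCHEMA = """
CREATE TABLE GENE (
    GENE_ID      TEXT PRIMARY KEY,
    ESSENTIAL    TEXT,
    CHROMOSOME   INTEGER,
    LOCALIZATION TEXT
);
CREATE TABLE COMPOSITION (
    C_ID      INTEGER PRIMARY KEY,
    GENE_ID   TEXT REFERENCES GENE(GENE_ID),
    CLASS     TEXT,
    COMPLEX   TEXT,
    PHENOTYPE TEXT,
    MOTIF     TEXT
);
CREATE TABLE INTERACTION (
    I_ID            INTEGER PRIMARY KEY,
    GENE_ID1        TEXT REFERENCES GENE(GENE_ID),
    GENE_ID2        TEXT REFERENCES GENE(GENE_ID),
    TYPE            TEXT,
    EXPRESSION_CORR REAL
);
"""


def build_example_db():
    """Create an empty in-memory database with the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def foreign_keys(conn, table):
    """Return sorted (column, referenced_table) pairs declared on `table`."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    # Row layout: id, seq, table, from, to, on_update, on_delete, match
    return sorted((row[3], row[2]) for row in rows)


conn = build_example_db()
for table in ("COMPOSITION", "INTERACTION"):
    print(table, foreign_keys(conn, table))
```

The three foreign keys reported by the loop correspond to the three associations mentioned in the text: one from COMPOSITION to GENE and two from INTERACTION to GENE.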

In this setting, if an attribute of interest is chosen, it is called the target attribute, and the table in which this attribute is stored is called the target table and is denoted by $T_0$. Each record in $T_0$ corresponds to a single object. Additional information about an object is stored in other tables of the database and can be looked up by following the associations between tables.

2.2 Multi-Relational Data Mining Framework

The multi-relational data mining framework is based on the search for interesting patterns in the relational database, where multi-relational patterns can be viewed as "pieces of substructure encountered in the structure of the objects of interest" [Knobbe et al., 1999a].

Definition 2.4 A multi-relational object is covered by a multi-relational pattern iff the substructure described by the multi-relational pattern, in terms of both attribute-value conditions and structural conditions, occurs at least once in the multi-relational object. ([Knobbe et al., 1999a])

Multi-relational patterns can also be viewed as subsets of the objects from the database having some property. The most interesting subsets are chosen according to some measure (e.g., information gain for a classification task), which guides the search in the space of all patterns. The search for interesting patterns usually proceeds by top-down induction. For each interesting pattern, subpatterns are obtained with the help of a refinement operator, which can be seen as a further division of the set of objects covered by the initial pattern. Top-down induction of interesting patterns proceeds by recursively applying such refinement operators to the best patterns.
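To make the role of the measure concrete, the sketch below computes information gain for one candidate refinement, treated as a split of the currently covered objects into those the subpattern covers and those it does not. This is the standard entropy-based definition and is not code from the paper; the function names and the toy labels are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(labels, covered):
    """Gain of splitting objects into covered / not covered by a subpattern.

    `labels` holds the target attribute value of each object and `covered`
    is a parallel list of booleans.
    """
    inside = [label for label, c in zip(labels, covered) if c]
    outside = [label for label, c in zip(labels, covered) if not c]
    n = len(labels)
    if n == 0:
        return 0.0
    remainder = (len(inside) / n) * entropy(inside) \
        + (len(outside) / n) * entropy(outside)
    return entropy(labels) - remainder


# Hypothetical target values for four objects and two candidate refinements.
labels = ["nucleus", "nucleus", "cytoplasm", "cytoplasm"]
perfect_split = [True, True, False, False]
useless_split = [True, False, True, False]

print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

A top-down search would evaluate every candidate refinement this way and recurse on the one with the highest gain.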

[Figure 2 node labels: GENE (Chromosome = 1); COMPOSITION (Complex = 'Cytoskeleton'); COMPOSITION (Complex = 'Cytoskeleton' and Class = 'Proteases')]

Figure 2: Selection graph corresponding to those GENE(s) that belong to chromosome number 1, that have at least one COMPOSITION record whose complex value is 'Cytoskeleton', but for which none of the COMPOSITION records have complex value 'Cytoskeleton' and class value 'Proteases' at the same time.

The multi-relational pattern language is defined in terms of selection graphs and refinements, which are described in the following sections.

2.3 Selection Graphs

Multi-relational patterns are expressed in a graphical language of selection graphs [Knobbe et al., 1999b].

Definition 2.5 A selection graph S is a directed graph S = (N, E). N represents the set of nodes in S in the form of tuples (X, C, s, f), where X is a table from D, C is the set of conditions on attributes in X (for example, X.color = 'red' or X.salary > 5,000), s is a flag with possible values open and closed, and f is a flag with possible values front and back. E represents the edges in S in the form of tuples (p, q, a, e), where p and q are nodes, a is a relation between p and q in the data model (for example, X.ID = Y.X_ID), and e is a flag with possible values present and absent. The selection graph should contain at least one node $n_0$ that corresponds to the target table $T_0$.
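Definition 2.5 maps naturally onto a small data structure. The sketch below is one possible encoding, not the authors' implementation: the class names `Node`, `Edge`, and `SelectionGraph`, the boolean encoding of the flags, and the default of `front` for the f flag are all assumptions. The instance built at the end follows the verbal description of the Figure 2 selection graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    table: str                 # X: a table from D
    conditions: tuple = ()     # C: conditions on attributes of X
    is_open: bool = True       # s flag: True = open, False = closed
    is_front: bool = True      # f flag: True = front, False = back (assumed default)


@dataclass(frozen=True)
class Edge:
    p: Node
    q: Node
    join: str                  # a: relation between p and q in the data model
    present: bool = True       # e flag: True = present, False = absent


@dataclass
class SelectionGraph:
    nodes: list
    edges: list
    target_table: str

    def __post_init__(self):
        # The graph must contain at least one node for the target table.
        if not any(n.table == self.target_table for n in self.nodes):
            raise ValueError("no node corresponds to the target table")


# Hypothetical encoding of the Figure 2 selection graph.
gene = Node("GENE", ("CHROMOSOME = 1",))
has_cyto = Node("COMPOSITION", ("COMPLEX = 'Cytoskeleton'",))
no_cyto_protease = Node(
    "COMPOSITION",
    ("COMPLEX = 'Cytoskeleton'", "CLASS = 'Proteases'"),
    is_open=False,
)
join = "GENE.GENE_ID = COMPOSITION.GENE_ID"
figure2 = SelectionGraph(
    nodes=[gene, has_cyto, no_cyto_protease],
    edges=[
        Edge(gene, has_cyto, join, present=True),
        Edge(gene, no_cyto_protease, join, present=False),
    ],
    target_table="GENE",
)
print(len(figure2.nodes), len(figure2.edges))
```

Freezing the node and edge classes keeps them hashable, which is convenient if a search procedure needs to store graphs or nodes in sets.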

An example of the selection graph for the data model from Figure 1 is shown in Figure 2. This selection graph corresponds to those GENE(s) that belong to chromosome number 1, that have at least one COMPOSITION record whose complex value is 'Cytoskeleton', but for which none of the COMPOSITION records have complex value 'Cytoskeleton' and class value 'Proteases' at the same time. In this example the target table is GENE, and within GENE the target attribute is LOCALIZATION.
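Because the framework gathers information through SQL, it can help to see the Figure 2 selection graph written as a query. The sketch below is one plausible rendering and not necessarily the SQL the authors generate: it reduces the schema to the columns the query needs, and the rows inserted (genes `g1` to `g4` and their COMPOSITION records) are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE GENE (GENE_ID TEXT PRIMARY KEY, CHROMOSOME INTEGER);
CREATE TABLE COMPOSITION (
    C_ID    INTEGER PRIMARY KEY,
    GENE_ID TEXT REFERENCES GENE(GENE_ID),
    CLASS   TEXT,
    COMPLEX TEXT
);
""")

# Hypothetical rows, chosen so that each condition excludes one gene.
conn.executemany(
    "INSERT INTO GENE VALUES (?, ?)",
    [("g1", 1), ("g2", 1), ("g3", 2), ("g4", 1)],
)
conn.executemany(
    "INSERT INTO COMPOSITION (GENE_ID, CLASS, COMPLEX) VALUES (?, ?, ?)",
    [
        ("g1", "Motorproteins", "Cytoskeleton"),
        ("g2", "Proteases", "Cytoskeleton"),      # triggers the negated branch
        ("g2", "Motorproteins", "Cytoskeleton"),
        ("g3", "Motorproteins", "Cytoskeleton"),  # wrong chromosome
        ("g4", "Proteases", "Nucleosome"),        # no Cytoskeleton record
    ],
)

# Present edge -> join; absent edge to the closed node -> NOT EXISTS.
FIGURE2_QUERY = """
SELECT DISTINCT G.GENE_ID
FROM GENE AS G
JOIN COMPOSITION AS C ON C.GENE_ID = G.GENE_ID
WHERE G.CHROMOSOME = 1
  AND C.COMPLEX = 'Cytoskeleton'
  AND NOT EXISTS (
      SELECT 1
      FROM COMPOSITION AS C2
      WHERE C2.GENE_ID = G.GENE_ID
        AND C2.COMPLEX = 'Cytoskeleton'
        AND C2.CLASS = 'Proteases'
  )
ORDER BY G.GENE_ID
"""

selected = [row[0] for row in conn.execute(FIGURE2_QUERY)]
print(selected)
```

With this toy data only `g1` satisfies all three parts of the pattern, so the query returns a single gene.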

In the graphical representation of a selection graph, the value of s is represented by the presence or absence of a cross in the node, representing the values open and closed, respectively. The value of e, in turn, is indicated by the presence (present value) or absence (absent value) of a cross on the corresponding arrow representing the edge. An edge between nodes p and q chooses the records in the database that match the join condition, a, between the tables, which is defined by the relation between the primary key in p and a foreign key
