An Efficient Data Mining Framework on Hadoop using Java Persistence API
Yang Lai
The Key Laboratory of Intelligent Information Processing,
Institute of Computing Technology, Chinese Academy of
Sciences, Beijing 100190, China.
Graduate University of Chinese Academy of Sciences,
Beijing 100039, China.
e-mail: yanglai@ics.ict.ac.cn
Shi ZhongZhi
The Key Laboratory of Intelligent Information Processing,
Institute of Computing Technology, Chinese Academy of
Sciences, Beijing 100190, China.
e-mail: shizz@ics.ict.ac.cn
Abstract—Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project using the MapReduce framework in Java, has attracted significant interest in distributed data mining. We propose a feasible distributed data indexing algorithm for data mining on Hadoop, based on ZSCORE binning and inverted indexing and on the Hadoop SequenceFile format. We also propose a data mining framework on Hadoop using the Java Persistence API (JPA) and MySQL Cluster, and elaborate it through the implementation of a decision tree algorithm. We compare the data indexing algorithm with Hadoop MapFile indexing, which performs a binary search, in a modest cloud environment; the results show that our algorithm is more efficient than naïve MapFile indexing. We also compare JDBC and JPA implementations of the data mining framework; the measured performance shows that the framework is efficient for data mining on Hadoop.
Keywords—Data Mining, Distributed applications, JPA,
ORM, Distributed file systems, Cloud computing
I. INTRODUCTION
Many approaches have been proposed for handling high-dimensional and large-scale data, in which query processing is the bottleneck. "Algorithms for knowledge discovery tasks are often based on range searches or nearest neighbor search in multidimensional feature spaces" [1].
Business intelligence systems and data warehouses can hold a terabyte or more of data, and cloud computing has emerged to meet the correspondingly growing demands of data mining. MapReduce is a programming framework and an associated implementation designed for large data sets. MapReduce hides the details of partitioning, scheduling, failure handling, and communication: users simply define map functions that create intermediate <key, value> tuples, and reduce functions that merge those tuples for further processing [2].
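To make this contract concrete, the following minimal Java sketch (our illustration, not code from the paper) defines a map function that emits one intermediate <key, value> tuple per token and a reduce function that merges the tuples sharing a key, using Hadoop's org.apache.hadoop.mapreduce API; the word-count task and the class names are assumptions chosen for brevity.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map side: emits one intermediate <word, 1> tuple per token.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            context.write(word, ONE);   // intermediate <key, value> tuple
        }
    }
}

// Reduce side: merges all tuples that share a key.
class TokenReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Everything else (splitting the input, shuffling intermediate tuples by key, and re-executing failed tasks) is handled by the framework, which is exactly the detail-hiding described above.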
A concise indexing implementation on Hadoop's MapReduce is presented in McCreadie's work [3]. Lämmel proposes a basic program skeleton underlying MapReduce computations [4]. Moretti presents an abstraction for scalable data mining in which data and computation are distributed in a computing cloud with minimal user effort [5]. Gillick uses Hadoop to implement query-based learning [6].
Most data mining algorithms are based on object-oriented programming and run in memory; Han elaborates many of these methods [7]. A link-based structure such as the X-tree [8] may be suitable for indexing high-dimensional data sets.
However, the following features of the MapReduce framework make it unsuitable for data mining. First, there is no global state: map tasks are independent of one another, as are reduce tasks, while data mining requires that all of the training data be converted into a global model, such as a decision tree or a clustering tree; each task in the MapReduce framework handles only its own partition of the data set and writes its results into the Hadoop Distributed File System (HDFS). Second, HDFS disallows random-write operations, which rules out link-based data models in Hadoop, such as linked lists, trees, and graphs. Finally, map and reduce tasks are scan-based and terminate as soon as their partition of the training dataset has been processed, whereas data mining requires a model that persists for the subsequent testing phase.
A database is an ideal persistent repository for the objects generated by data mining in Hadoop tasks. To mine high-dimensional and large-scale data on Hadoop, we employ Object-Relational Mapping (ORM), which stores objects whose size may surpass memory limits in a relational database. The Java Persistence API (JPA) provides a persistence model for ORM [9] and can store all the objects generated by data mining models in a relational database. A distributed database is a suitable way to ensure robustness in distributed processing; MySQL Cluster is designed to withstand any single point of failure [10], which is consistent with Hadoop's design.
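To illustrate how JPA can hold a link-based model that HDFS cannot store, the sketch below maps a hypothetical decision-tree node to a relational table; the entity layout, field names, and the persistence-unit name "miningPU" are our assumptions for illustration, not details taken from the paper.

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.ManyToOne;
import javax.persistence.Persistence;

// Hypothetical entity: one row per decision-tree node, so a model
// larger than a task's memory can be stored in MySQL Cluster via ORM.
@Entity
class TreeNode {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    Long id;

    int splitAttribute;      // attribute index tested at this node
    double splitThreshold;   // split point for a continuous attribute
    String leafLabel;        // class label when the node is a leaf

    @ManyToOne(fetch = FetchType.LAZY)
    TreeNode parent;         // the link-based structure HDFS cannot store
}

class TreeNodeStore {
    // "miningPU" is an assumed persistence-unit name configured for MySQL Cluster.
    static void save(TreeNode node) {
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("miningPU");
        EntityManager em = emf.createEntityManager();
        em.getTransaction().begin();
        em.persist(node);    // ORM maps the object to a row
        em.getTransaction().commit();
        em.close();
        emf.close();
    }
}

With such a mapping, each Hadoop task can persist the nodes it produces, so the global model accumulates in the database rather than in any single task's memory.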
We carried out the same workload as McCreadie's work [3] and now propose a novel Hadoop indexing implementation for continuous values. Using JPA and MySQL Cluster on Hadoop, we propose an efficient data mining framework, elaborated through a decision tree implementation, in which distributed computation in Hadoop and centralized data collection through JPA are combined organically. We also compare a naïve JDBC implementation with our JPA implementation on Hadoop, as sketched below.
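For comparison, a plain-JDBC version of the same persist step might look like the following sketch; the connection URL, credentials, and table schema are illustrative assumptions rather than the paper's actual configuration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class JdbcNodeStore {
    // Hand-written SQL replaces the mapping that JPA generates automatically.
    static void save(int splitAttribute, double splitThreshold, String leafLabel)
            throws SQLException {
        String url = "jdbc:mysql://dbnode:3306/mining";  // assumed MySQL Cluster SQL node
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO tree_node (split_attribute, split_threshold, leaf_label) "
                 + "VALUES (?, ?, ?)")) {
            ps.setInt(1, splitAttribute);
            ps.setDouble(2, splitThreshold);
            ps.setString(3, leafLabel);
            ps.executeUpdate();
        }
    }
}

The hand-written SQL here is what JPA derives from the entity mapping, which is the practical difference the JDBC-versus-JPA comparison measures.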
The rest of the paper is organized as follows. Section 2 explains the index structures, flowchart, and algorithms, and proposes the data mining framework. Section 3 describes our experimental setup and results. Section 4 offers conclusions and suggests directions for future work.