大数据并行增量频繁项目集挖掘资源-CSDN文库

44 浏览量 2021-03-15 19:20:48 上传评论收藏 2.16MB PDF 举报

资源推荐

资源详情

资源评论

Song YG, Cui HM, Feng XB. Parallel incremental frequent itemset mining for large data. JOURNAL OF COMPUTER

SCIENCE AND TECHNOLOGY 32(2): 368–385 Mar. 2017. DOI 10.1007/s11390-017-1726-y

Parallel Incremental Frequent Itemset Mining for Large Data

Yu-Geng Song

1,2

, Hui-Min Cui

1,2

, Member, CCF, and Xiao -Bing Feng

, Member, CCF, ACM, IEEE

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of

Sciences, Beijing 100190, China

University of Chinese Academy of Sciences, Beijing 100049, China

E-mail: {songyugeng, cuihm, fxb}@ict.ac.cn

Received Novemb er 18, 2015; revised November 28, 2016.

Abstract Frequent itemset mining (FIM) is a popular data mining issue adopted in many ﬁelds, such as commodity

recommendation in the retail ind ustry, log analysis in web searching, and query recommendation (or related search). A

large number of FIM algorithms have been proposed to obtain better performance, including parallelized algorithms for

processing large data volumes. Besides, incremental FIM algorithms are also proposed to deal with incremental database

updates. However, most of these in cremental algorithms have low parallelism, causing low eﬃciency on huge databases.

This paper presents two parallel in cremental FIM algorithms called IncMiningPFP and IncBuildingPFP, implemented on

the MapReduce framework. IncMiningPFP preserves the FP-tree mining results of the original pass, and utilizes them

for incremental calculations. In particular, we p ropose a metho d to generate a partial FP-tree in the incremental pass,

in order to avoid unnecessary mining work. Further, some of the incremental parallel tasks can be omitted when the

inserted transactions include fewer items. IncbuildingPFP preserves the CanTrees built in the original pass, and then adds

new transactions to them during the in cremental passes. Our experimental results show that IncMiningPFP can achieve

signiﬁcant speedup over PFP (Parallel FPGrowth) and a sequential incremental algorithm (CanTree) in most cases of

incremental input database, and in other cases IncBuildingPFP can achieve it.

Keywords incremental parallel FPGrowth, data mining, frequent itemset mining, MapReduce

1 Introduction

Frequent itemset mining (FIM), a lso known as fre-

quent pattern mining, plays an essential role in data

mining. It tries to mine some interesting patterns from

a given database, such as association r ule s, correlations,

and so on. In particular, a ssociation rules describe the

items that are frequently purchased together, which are

valuable in the r e tail industry and electronic commerce.

Therefore FIM is widely used in recommendation sys-

tems for customer s

[1]

. Besides, correlations are of great

value in web searching or log analysis

[2]

, making FIM

an important a pproach in these ﬁelds.

Researchers have pro posed a number of FIM al-

gorithms, to achieve b e tter pe rformance

[3-6]

or higher

parallelism

[7-12]

. Among them, Apriori

[4]

and FP-

Growth

[3]

are two classic algorithms. The Apriori algo-

rithm works in an iterative way. It generates the candi-

date itemsets to be counted in a pass by using only the

itemsets found large in the previous pass — without

considering the transactions in the database. There-

fore it gains better performance by reducing the size of

candidate itemsets. However, when encountering abun-

dant freq ue nt patterns, long patterns or low minimal

support values, Apriori has consider able overhea ds of

storing and handling large candidate itemsets, as well

as scanning them. FP-Growth

[3]

solves this problem

by introducing FP-tree, an extended preﬁx-tree struc-

ture for storing compressed, crucial information about

frequent patterns, and an e ﬃcie nt FP-tree-based min-

ing method, for mining the complete set of fre quent

patterns by pattern fragment growth. However, the

two algorithms have some common disa dvantages: both

Regular Paper

This work was supported by the National High Technology Research and Development 863 Program of China under Grant

Nos. 2015AA011505, 2015AA015306, and 2012AA010902, the National Natural Science Foundation of C hina under Grant Nos.

61202055, 61221062, 61521092, 61303053, 61432016, 61402445, and 61672492, and the National Key Research and Development Pro-

gram of China under Grant No. 2016YFB1000402.

Yu-Geng Song et al.: Parallel Incremental Frequent Itemset Mining for Large Data 369

of them have poor parallelism, and may be resource-

intens ive when processing huge databases; neither of

them is capable of handling the cases of incremental

database. Parallel FP-Growth (PFP)

[7]

parallelizes FP-

Growth by implementing the algorithm on the MapRe-

duce framework

[13]

, gaining excellent parallelism and

performance. The adaptiveness of FP-Gr owth to the

MapReduce framework is our main reason to choose it

as the original algorithm. On the other hand, several in-

cremental algorithms or data structures

[14-21]

have been

proposed to handle inserting, appending, or deleting

transactions . Some of these algorithms or data struc-

tures include IncSpan

[15]

and CanTree

[14,16]

. However,

none of them has high parallelism, which is essential

for handling huge databases eﬃciently. Hence an FIM

algorithm that is both incremental and highly paral-

lelized is urge ntly needed.

In this work, we propose two parallel incremental

algorithms based on the MapReduce framework, which

we call Inc MiningPFP and IncBuildingP FP. IncMin-

ingPFP adopts the same FP-tree structure and FP-

Growth mining method as PFP, and maintains the min-

ing results o f each group while processing the original

database. In the incremental passes, we ﬁrst scan the

incremental transa ctions to ﬁnd out all the items con-

tained in them, and then r un the tree building and tree

mining processes only on the groups containing these

items. An FP-tree is built upon both the original trans-

actions and the incremental ones, from which a partial

FP-tree called IncFP-tree is g e ne rated. Then we run

the mining process on IncFP-tree, and after that the

mining r esults are merged with the original o ne s. Fi-

nally the ag gregating process merges the mining results

of all groups. IncBuildingPFP prese rves the CanTrees

of each group during the original pass (also dur ing the

incremental passes if future increments are expected),

obtains the Ca nTrees during the incremental pass, and

directly ins e rts incremental transactions onto them be-

fore preserving them again. After the mining process

executed on the updated CanTrees, the mining results

of all groups are merged by the aggrega ting process.

Our experiments have shown that IncMiningPFP and

IncBuildingPFP have speedups of 1.5x and 1.25x , re-

spectively, over PFP on average of diﬀerent database

sizes and values of parameters.

The contributions of this paper include:

1) two diﬀerent methods of utilizing the interme-

diate results when applying parallel incremental FIM

algorithms;

2) two parallel incremental FIM algorithms, each

applying one of the two methods: IncMiningPFP and

IncBuildingPFP;

3) the comparison of the above two algorithms, in-

cluding the cases where each of them performs the best.

The rest of the paper is organized as follows. Section

2 discusses related work. Section 3 introduces two exist-

ing FIM algorithms: FP-Growth a nd PFP, an existing

data structure for eﬃcient incremental FIM: CanTree,

and the challenges and two basic methods of incremen-

tal FIM. Sec tion 4 and Section 5 demonstrate one of

these two methods in detail respectively, with discus-

sions of feasibility. Section 6 presents expe rimental re-

sults, and Se ction 7 concludes the paper.

What has to be mentioned is that IncMiningPFP

only handles the case that only new transactions are

inserted, with the original data not being deleted or

changed.

2 Related Work

In recent years, lots o f researches have been focusing

on the FIM or the incremental FIM problems. Some of

them have made great eﬀorts to promote the eﬃciency

of FIM.

Agrawal and Srikant proposed the Apriori

algorithm

[4]

that ge ne rates the candidate itemsets to be

counted in a pass by using only the itemsets found large

in the previous pass — without considering the transac-

tions in the database. Apriori outperforms tranditiona l

FIM a lgorithms such a s AIS

[22]

and SETM

[23]

, using

both synthetic a nd rea l-life data. However, when meet-

ing trans actions of large size or a low minimal support

threshold, the algorithm is dragged by the overheads of

examining large candidate sets.

The FP-Growth algorithm proposed by Han et al.

[3]

solves this problem by introducing FP-tree and FP-

Growth, as we mentioned in Section 3. The FP-Growth

algorithm is eﬃcient and scalable for mining both long

and short frequent items ets, and is about an order of

magnitude faster than the Apriori algorithm.

Li et al.

[7]

proposed to parallelize FP-Growth on

distributed machines, formalizing a parallel algorithm

called PFP. The details of PFP are shown in Section 3.

It is demonstrated that PFP can achieve virtually linear

speedup.

On the other hand, some of the recent researches

proposed algorithms or structures for incremental FIM.

Cheng et al.

[15]

developed an eﬃcient algorithm,

IncSpan, for incremental mining of sequential patterns.

IncSpan maintains a set of “almost fr e quent” sequences

370 J. Comput. Sci. & Technol., Mar. 2017, Vol.32, No.2

as the candidates in the updated database, which has

several nice properties and leads to eﬃcient techniques.

Two optimization techniques, reverse pattern matching

and shared projection, were designed to improve the

performance. Their experiments showed tha t IncSpan

outp erforms some previously propos ed incremental al-

gorithms as well as a non-incremental one with a wide

margin.

Leung et al.

[14,16]

proposed a novel tree struc ture,

called CanTree. By exploiting its nice properties (see

Section 3), CanTree can be easily maintained when

database transactions are inserted, deleted, and/or

modiﬁed. It is shown that CanTrees can be used for

eﬃcient inc remental mining of frequent patterns.

However, neither of these incremental FIM ap-

proaches can eﬃciently ha ndle the case of a massive

database, since they do not have high parallelism.

3 Background

3.1 FP-Growth Algorithm

FP-Growth is a well-known algorithm for frequent

itemset mining, which was proposed by Han et al.

[3]

First, we introduce some notations to describe the prob-

lem of FIM.

• Itemset. Let I be a set of items, and a s e t

S = i

, ..., i

⊆ I is called an itemset.

• Transaction. A transaction over I is deﬁned as

T = (id, S), where id is an integer fo r represe nting the

transaction ID, and S is an itemset over I.

• Transaction Database. A transaction database D

over I is a se t of transactions over I.

• Incremental Transaction Database. An incremen-

tal transaction database ∆D is the collection of the

incremental transactions inserted to D.

• Support Value. The support value of an itemset S

is the number of transactions consisting of S in D.

• Frequent Itemset. An itemset S is frequent if its

support value is no less than a given threshold t.

The problem can be formalize d as follows: given a

transaction databas e D and a suppo rt value threshold

minSupport, ﬁnding out all the frequent itemsets in D.

The FP-Growth algorithm consists of two steps :

FP-tree building and frequent itemse t mining. In the

ﬁrst step, we scan the whole database D once, obtain

the frequency of each item to form a frequency list, and

build an FP -tree acc ording to D and the list. Then in

the second step, we mine the frequent itemsets on the

FP-tree through a bottom-up approach. The details of

these steps are shown in Fig.1 a nd Fig.2.

Step 1 of Algorithm FPGrowth (FP-Tree Construction)

Input: a transaction database D and a minimum

support threshold ξ

Output: its frequent pattern tree, FP-tree

Method: the FP-tree is constructed in the following

steps

1. Scan the transaction database D once. Collect

the set of frequent items I and their supports.

Sort I in support descending order as L, the list

of frequent items .

2. Create the root of an FP -tree, R, and label it as

Ānull”. For each transaction T in D, do the

following.

Select and sort the frequent items in T

according to the order of L. Let the sorted

frequent item list in T be [p|P], where p is

the first element and P is the remaining list.

Call

insert_tree([p|P], R).

The function insert _tree([p|P], R) is performed as

follows. If R has a child N such that N.item-name

= p.item-name, then we increment N's count by 1;

else create a new node N, and let its count be 1,

its parent link be linked to R, and its node -link

be linked to the nodes with the same item-name

via the node-link structure . If P is nonempty , call

insert_tree(P,N) recursively .

Fig.1. FP-tree constructing step in FP-Growth

[3]

Step 2 of Algorithm FPGrowth (Mining Frequent

Itemsets with FP-Tree by Pattern Fragment Growth)

Input: FP-tree constructed based on step 1,

using D and a minimum support threshold ξ

Output: the complete set of frequent itemsets

Method: call FP -growth (FP-tree , null)

Procedure FP-growth (Tree, α)

{

(1) if Tree contains a single path P

(2) then for each combination (denoted as S)

of the nodes in path P do

(3) Generate itemset S 8 α with support =

minimum support of nodes in S;

(4) else for each a

in the header of Tree do {

(5) Generate itemset S = a

8α with

support = a

.support;

(6) Construct S 's conditional pattern base and

then S's conditional FP -tree Tree

;

(7) if TreeS ≠

(8) then call FP-growth (Tree

, S) }

}

Fig.2. Frequent itemset mining step in FP-Growth

[3]

3.2 PFP: Parallel FP-Growth Algorithm

Li et al.

[7]

designed PFP: a parallel FIM algo-

rithm on distributed machines, taking advantage of the

MapReduce framework. PFP partitions computation

剩余17页未读，继续阅读

评论收藏

内容反馈

weixin_38518638

粉丝: 3
资源: 932

大数据并行增量频繁项目集挖掘

hadoop 大数据集 数据并行挖掘

论文研究-基于FIUT的并行频繁项集增量更新算法.pdf

分布式并行化数据流频繁模式挖掘算法.pdf

基于Spark的并行频繁模式挖掘算法

群智能大数据并行优化系统项目可行性研究报告.doc

并行挖掘频繁项目集新算法――MREclat (2014年)

群智能大数据并行优化系统项目策划书.doc

基于粒度计算的频繁闭项目集挖掘 (2014年)

一种改进的频繁项目集挖掘算法 (2013年)

使用ITBitree频繁关闭项目集挖掘

基于频繁模式树的正负项目集挖掘 (2012年)

CUSF增长：不确定数据流中受约束的频繁项目集挖掘

FrequentItemsetsMining:并行频繁项集挖掘算法的HadoopMapReduce实现

基于MapReduce的频繁项集挖掘方法

基于权重的流数据频繁项挖掘算法的应用

工业大数据背景下频繁项集挖掘算法对比分析及研究展望.zip

采用N-list结构的混合并行频繁项集挖掘算法.docx

数据频繁挖掘案例(观察者模式)

基于Spark的转录组大数据并行处理方法.pdf

基于Spark框架的大数据局部频繁项集挖掘算法设计.zip

流数据上的频繁项挖掘算法 (2011年)

【Python实战】-Python+Opencv是实现车牌自动识别（源码+数据+字符匹配模板）

Python基于机器学习实现的股票价格预测、股票预测源码+数据集，机器学习大作业

时间序列数据集TSdatasets.rar

Fragstats V4.2 软件计算景观指数的参数文件示例

最新资源

hadoop 大数据集数据并行挖掘