没有合适的资源?快使用搜索试试~ 我知道了~
大数据并行增量频繁项目集挖掘
0 下载量 44 浏览量
2021-03-15
19:20:48
上传
评论
收藏 2.16MB PDF 举报
温馨提示
频繁项集挖掘(FIM)是许多领域采用的流行数据挖掘问题,例如零售行业的商品推荐,Web搜索中的日志分析以及查询推荐(或相关搜索)。 为了获得更好的性能,已经提出了大量的FIM算法,包括用于处理大数据量的并行算法。 此外,还提出了增量FIM算法来处理增量数据库。 但是,这些增量算法大多数都具有较低的并行度,从而在大型数据库上导致较低的效率。本文介绍了在MapReduce框架上实现的两种并行增量FIM算法,分别为IncMiningPFP和IncBuildingPFP。 IncMiningPFP保留原始遍历的FP树挖掘结果,并将其用于增量计算。 特别是,我们提出了一种在增量过程中生成部分FP树的方法,以避免不必要的挖掘工作。 此外,当插入的事务包括较少的项目时,可以省略一些增量并行任务。 IncbuildingPFP保留原始遍历中构建的CanTree,然后在增量遍历中向其添加新事务。 我们的实验结果表明,在大多数增量输入数据库的情况下,IncMiningPFP可以在PFP(并行FPGrowth)和顺序增量算法(CanTree)上实现显着的提速,而在其他情况下,IncBuildingPFP可以
资源推荐
资源详情
资源评论
Song YG, Cui HM, Feng XB. Parallel incremental frequent itemset mining for large data. JOURNAL OF COMPUTER
SCIENCE AND TECHNOLOGY 32(2): 368–385 Mar. 2017. DOI 10.1007/s11390-017-1726-y
Parallel Incremental Frequent Itemset Mining for Large Data
Yu-Geng Song
1,2
, Hui-Min Cui
1,2
, Member, CCF, and Xiao -Bing Feng
1
, Member, CCF, ACM, IEEE
1
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of
Sciences, Beijing 100190, China
2
University of Chinese Academy of Sciences, Beijing 100049, China
E-mail: {songyugeng, cuihm, fxb}@ict.ac.cn
Received Novemb er 18, 2015; revised November 28, 2016.
Abstract Frequent itemset mining (FIM) is a popular data mining issue adopted in many fields, such as commodity
recommendation in the retail ind ustry, log analysis in web searching, and query recommendation (or related search). A
large number of FIM algorithms have been proposed to obtain better performance, including parallelized algorithms for
processing large data volumes. Besides, incremental FIM algorithms are also proposed to deal with incremental database
updates. However, most of these in cremental algorithms have low parallelism, causing low efficiency on huge databases.
This paper presents two parallel in cremental FIM algorithms called IncMiningPFP and IncBuildingPFP, implemented on
the MapReduce framework. IncMiningPFP preserves the FP-tree mining results of the original pass, and utilizes them
for incremental calculations. In particular, we p ropose a metho d to generate a partial FP-tree in the incremental pass,
in order to avoid unnecessary mining work. Further, some of the incremental parallel tasks can be omitted when the
inserted transactions include fewer items. IncbuildingPFP preserves the CanTrees built in the original pass, and then adds
new transactions to them during the in cremental passes. Our experimental results show that IncMiningPFP can achieve
significant speedup over PFP (Parallel FPGrowth) and a sequential incremental algorithm (CanTree) in most cases of
incremental input database, and in other cases IncBuildingPFP can achieve it.
Keywords incremental parallel FPGrowth, data mining, frequent itemset mining, MapReduce
1 Introduction
Frequent itemset mining (FIM), a lso known as fre-
quent pattern mining, plays an essential role in data
mining. It tries to mine some interesting patterns from
a given database, such as association r ule s, correlations,
and so on. In particular, a ssociation rules describe the
items that are frequently purchased together, which are
valuable in the r e tail industry and electronic commerce.
Therefore FIM is widely used in recommendation sys-
tems for customer s
[1]
. Besides, correlations are of great
value in web searching or log analysis
[2]
, making FIM
an important a pproach in these fields.
Researchers have pro posed a number of FIM al-
gorithms, to achieve b e tter pe rformance
[3-6]
or higher
parallelism
[7-12]
. Among them, Apriori
[4]
and FP-
Growth
[3]
are two classic algorithms. The Apriori algo-
rithm works in an iterative way. It generates the candi-
date itemsets to be counted in a pass by using only the
itemsets found large in the previous pass — without
considering the transactions in the database. There-
fore it gains better performance by reducing the size of
candidate itemsets. However, when encountering abun-
dant freq ue nt patterns, long patterns or low minimal
support values, Apriori has consider able overhea ds of
storing and handling large candidate itemsets, as well
as scanning them. FP-Growth
[3]
solves this problem
by introducing FP-tree, an extended prefix-tree struc-
ture for storing compressed, crucial information about
frequent patterns, and an e fficie nt FP-tree-based min-
ing method, for mining the complete set of fre quent
patterns by pattern fragment growth. However, the
two algorithms have some common disa dvantages: both
Regular Paper
This work was supported by the National High Technology Research and Development 863 Program of China under Grant
Nos. 2015AA011505, 2015AA015306, and 2012AA010902, the National Natural Science Foundation of C hina under Grant Nos.
61202055, 61221062, 61521092, 61303053, 61432016, 61402445, and 61672492, and the National Key Research and Development Pro-
gram of China under Grant No. 2016YFB1000402.
©2017 Springer Science + Business Media, LLC & Science Press, China
Yu-Geng Song et al.: Parallel Incremental Frequent Itemset Mining for Large Data 369
of them have poor parallelism, and may be resource-
intens ive when processing huge databases; neither of
them is capable of handling the cases of incremental
database. Parallel FP-Growth (PFP)
[7]
parallelizes FP-
Growth by implementing the algorithm on the MapRe-
duce framework
[13]
, gaining excellent parallelism and
performance. The adaptiveness of FP-Gr owth to the
MapReduce framework is our main reason to choose it
as the original algorithm. On the other hand, several in-
cremental algorithms or data structures
[14-21]
have been
proposed to handle inserting, appending, or deleting
transactions . Some of these algorithms or data struc-
tures include IncSpan
[15]
and CanTree
[14,16]
. However,
none of them has high parallelism, which is essential
for handling huge databases efficiently. Hence an FIM
algorithm that is both incremental and highly paral-
lelized is urge ntly needed.
In this work, we propose two parallel incremental
algorithms based on the MapReduce framework, which
we call Inc MiningPFP and IncBuildingP FP. IncMin-
ingPFP adopts the same FP-tree structure and FP-
Growth mining method as PFP, and maintains the min-
ing results o f each group while processing the original
database. In the incremental passes, we first scan the
incremental transa ctions to find out all the items con-
tained in them, and then r un the tree building and tree
mining processes only on the groups containing these
items. An FP-tree is built upon both the original trans-
actions and the incremental ones, from which a partial
FP-tree called IncFP-tree is g e ne rated. Then we run
the mining process on IncFP-tree, and after that the
mining r esults are merged with the original o ne s. Fi-
nally the ag gregating process merges the mining results
of all groups. IncBuildingPFP prese rves the CanTrees
of each group during the original pass (also dur ing the
incremental passes if future increments are expected),
obtains the Ca nTrees during the incremental pass, and
directly ins e rts incremental transactions onto them be-
fore preserving them again. After the mining process
executed on the updated CanTrees, the mining results
of all groups are merged by the aggrega ting process.
Our experiments have shown that IncMiningPFP and
IncBuildingPFP have speedups of 1.5x and 1.25x , re-
spectively, over PFP on average of different database
sizes and values of parameters.
The contributions of this paper include:
1) two different methods of utilizing the interme-
diate results when applying parallel incremental FIM
algorithms;
2) two parallel incremental FIM algorithms, each
applying one of the two methods: IncMiningPFP and
IncBuildingPFP;
3) the comparison of the above two algorithms, in-
cluding the cases where each of them performs the best.
The rest of the paper is organized as follows. Section
2 discusses related work. Section 3 introduces two exist-
ing FIM algorithms: FP-Growth a nd PFP, an existing
data structure for efficient incremental FIM: CanTree,
and the challenges and two basic methods of incremen-
tal FIM. Sec tion 4 and Section 5 demonstrate one of
these two methods in detail respectively, with discus-
sions of feasibility. Section 6 presents expe rimental re-
sults, and Se ction 7 concludes the paper.
What has to be mentioned is that IncMiningPFP
only handles the case that only new transactions are
inserted, with the original data not being deleted or
changed.
2 Related Work
In recent years, lots o f researches have been focusing
on the FIM or the incremental FIM problems. Some of
them have made great efforts to promote the efficiency
of FIM.
Agrawal and Srikant proposed the Apriori
algorithm
[4]
that ge ne rates the candidate itemsets to be
counted in a pass by using only the itemsets found large
in the previous pass — without considering the transac-
tions in the database. Apriori outperforms tranditiona l
FIM a lgorithms such a s AIS
[22]
and SETM
[23]
, using
both synthetic a nd rea l-life data. However, when meet-
ing trans actions of large size or a low minimal support
threshold, the algorithm is dragged by the overheads of
examining large candidate sets.
The FP-Growth algorithm proposed by Han et al.
[3]
solves this problem by introducing FP-tree and FP-
Growth, as we mentioned in Section 3. The FP-Growth
algorithm is efficient and scalable for mining both long
and short frequent items ets, and is about an order of
magnitude faster than the Apriori algorithm.
Li et al.
[7]
proposed to parallelize FP-Growth on
distributed machines, formalizing a parallel algorithm
called PFP. The details of PFP are shown in Section 3.
It is demonstrated that PFP can achieve virtually linear
speedup.
On the other hand, some of the recent researches
proposed algorithms or structures for incremental FIM.
Cheng et al.
[15]
developed an efficient algorithm,
IncSpan, for incremental mining of sequential patterns.
IncSpan maintains a set of “almost fr e quent” sequences
370 J. Comput. Sci. & Technol., Mar. 2017, Vol.32, No.2
as the candidates in the updated database, which has
several nice properties and leads to efficient techniques.
Two optimization techniques, reverse pattern matching
and shared projection, were designed to improve the
performance. Their experiments showed tha t IncSpan
outp erforms some previously propos ed incremental al-
gorithms as well as a non-incremental one with a wide
margin.
Leung et al.
[14,16]
proposed a novel tree struc ture,
called CanTree. By exploiting its nice properties (see
Section 3), CanTree can be easily maintained when
database transactions are inserted, deleted, and/or
modified. It is shown that CanTrees can be used for
efficient inc remental mining of frequent patterns.
However, neither of these incremental FIM ap-
proaches can efficiently ha ndle the case of a massive
database, since they do not have high parallelism.
3 Background
3.1 FP-Growth Algorithm
FP-Growth is a well-known algorithm for frequent
itemset mining, which was proposed by Han et al.
[3]
First, we introduce some notations to describe the prob-
lem of FIM.
• Itemset. Let I be a set of items, and a s e t
S = i
1
, ..., i
k
⊆ I is called an itemset.
• Transaction. A transaction over I is defined as
T = (id, S), where id is an integer fo r represe nting the
transaction ID, and S is an itemset over I.
• Transaction Database. A transaction database D
over I is a se t of transactions over I.
• Incremental Transaction Database. An incremen-
tal transaction database ∆D is the collection of the
incremental transactions inserted to D.
• Support Value. The support value of an itemset S
is the number of transactions consisting of S in D.
• Frequent Itemset. An itemset S is frequent if its
support value is no less than a given threshold t.
The problem can be formalize d as follows: given a
transaction databas e D and a suppo rt value threshold
minSupport, finding out all the frequent itemsets in D.
The FP-Growth algorithm consists of two steps :
FP-tree building and frequent itemse t mining. In the
first step, we scan the whole database D once, obtain
the frequency of each item to form a frequency list, and
build an FP -tree acc ording to D and the list. Then in
the second step, we mine the frequent itemsets on the
FP-tree through a bottom-up approach. The details of
these steps are shown in Fig.1 a nd Fig.2.
Step 1 of Algorithm FPGrowth (FP-Tree Construction)
Input: a transaction database D and a minimum
support threshold ξ
Output: its frequent pattern tree, FP-tree
Method: the FP-tree is constructed in the following
steps
1. Scan the transaction database D once. Collect
the set of frequent items I and their supports.
Sort I in support descending order as L, the list
of frequent items .
2. Create the root of an FP -tree, R, and label it as
Ānull”. For each transaction T in D, do the
following.
Select and sort the frequent items in T
according to the order of L. Let the sorted
frequent item list in T be [p|P], where p is
the first element and P is the remaining list.
Call
insert_tree([p|P], R).
The function insert _tree([p|P], R) is performed as
follows. If R has a child N such that N.item-name
= p.item-name, then we increment N's count by 1;
else create a new node N, and let its count be 1,
its parent link be linked to R, and its node -link
be linked to the nodes with the same item-name
via the node-link structure . If P is nonempty , call
insert_tree(P,N) recursively .
Fig.1. FP-tree constructing step in FP-Growth
[3]
.
Step 2 of Algorithm FPGrowth (Mining Frequent
Itemsets with FP-Tree by Pattern Fragment Growth)
Input: FP-tree constructed based on step 1,
using D and a minimum support threshold ξ
Output: the complete set of frequent itemsets
Method: call FP -growth (FP-tree , null)
Procedure FP-growth (Tree, α)
{
(1) if Tree contains a single path P
(2) then for each combination (denoted as S)
of the nodes in path P do
(3) Generate itemset S 8 α with support =
minimum support of nodes in S;
(4) else for each a
i
in the header of Tree do {
(5) Generate itemset S = a
i
8α with
support = a
i
.support;
(6) Construct S 's conditional pattern base and
then S's conditional FP -tree Tree
S
;
(7) if TreeS ≠
(8) then call FP-growth (Tree
S
, S) }
}
Fig.2. Frequent itemset mining step in FP-Growth
[3]
.
3.2 PFP: Parallel FP-Growth Algorithm
Li et al.
[7]
designed PFP: a parallel FIM algo-
rithm on distributed machines, taking advantage of the
MapReduce framework. PFP partitions computation
剩余17页未读,继续阅读
资源评论
weixin_38518638
- 粉丝: 3
- 资源: 932
上传资源 快速赚钱
- 我的内容管理 展开
- 我的资源 快来上传第一个资源
- 我的收益 登录查看自己的收益
- 我的积分 登录查看自己的积分
- 我的C币 登录后查看C币余额
- 我的收藏
- 我的下载
- 下载帮助
最新资源
- 【岗位说明】公司行政岗位职责.doc
- 【岗位说明】公司行政副总岗位说明书.doc
- 【岗位说明】公司行政文职类岗位职责.doc
- 【岗位说明】行政部岗位职责.doc
- 【岗位说明】公司组织结构及部门职责.doc
- 【岗位说明】行政部锅炉工岗位说明书.doc
- 【岗位说明】行政部各职位说明书.doc
- 【岗位说明】行政部门岗位职责02.doc
- 【岗位说明】行政后勤岗位职责.doc
- 【岗位说明】行政经理岗位说明书.doc
- 【岗位说明】行政前台岗位职责.doc
- 【岗位说明】行政经理岗位职责.doc
- 【岗位说明】行政前台岗位职责及工作要求.doc
- 【岗位说明】行政人事部部门职责(制造业).doc
- 【岗位说明】行政人事部部门职责说明书(计算机企业).doc
- 【岗位说明】行政人事部部门职责说明书(旅游公司).doc
资源上传下载、课程学习等过程中有任何疑问或建议,欢迎提出宝贵意见哦~我们会及时处理!
点击此处反馈
安全验证
文档复制为VIP权益,开通VIP直接复制
信息提交成功