Chapter 4
Apriori
Hiroshi Motoda and Kouzou Ohara
Contents
4.1 Introduction
4.2 Algorithm Description
    4.2.1 Mining Frequent Patterns and Association Rules
        4.2.1.1 Apriori
        4.2.1.2 AprioriTid
    4.2.2 Mining Sequential Patterns
        4.2.2.1 AprioriAll
    4.2.3 Discussion
4.3 Discussion on Available Software Implementations
4.4 Two Illustrative Examples
    4.4.1 Working Examples
        4.4.1.1 Frequent Itemset and Association Rule Mining
        4.4.1.2 Sequential Pattern Mining
    4.4.2 Performance Evaluation
4.5 Advanced Topics
    4.5.1 Improvement in Apriori-Type Frequent Pattern Mining
    4.5.2 Frequent Pattern Mining Without Candidate Generation
    4.5.3 Incremental Approach
    4.5.4 Condensed Representation: Closed Patterns and Maximal Patterns
    4.5.5 Quantitative Association Rules
    4.5.6 Using Other Measure of Importance/Interestingness
    4.5.7 Class Association Rules
    4.5.8 Using Richer Expression: Sequences, Trees, and Graphs
4.6 Summary
4.7 Exercises
References
© 2009 by Taylor & Francis Group, LLC
4.1 Introduction
Many of the pattern-finding algorithms frequently used in data mining, such as those for decision tree building, classification rule induction, and data clustering, were developed in the machine learning research community. Frequent pattern and association rule mining is one of the few exceptions to this tradition. Its introduction boosted data mining research and its impact has been tremendous. The basic algorithms are simple and easy to implement. In this chapter the most fundamental algorithms of frequent pattern and association rule mining, known as Apriori and AprioriTid [3, 4], and Apriori's extension to sequential pattern mining, known as AprioriAll [6, 5], are explained based on the original papers with working examples, and a performance analysis of Apriori is shown using a freely available implementation [1] on a dataset from the UCI repository [8]. Since Apriori is so fundamental and its database format is limited to market-basket transactions, there has been much work on improving computational efficiency, finding more compact representations, and extending the types of data that can be handled. Some of the important works are also briefly described as advanced topics.
4.2 Algorithm Description
4.2.1 Mining Frequent Patterns and Association Rules
One of the most popular data mining approaches is to find frequent itemsets in a transaction dataset and derive association rules from them. The problem is formally stated as follows. Let I = {i_1, i_2, ..., i_m} be a set of items. Let D be a set of transactions, where each transaction t is a set of items such that t ⊆ I. Each transaction has a unique identifier, called its TID. A transaction t contains X, a set of some items in I, if X ⊆ t. An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in D with confidence c (0 ≤ c ≤ 1) if the fraction of the transactions containing X in D that also contain Y is c. The rule X ⇒ Y (and equivalently X ∪ Y) has support[1] s (0 ≤ s ≤ 1) in D if the fraction of transactions in D that contain X ∪ Y is s. Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence no less than the user-specified minimum support (called minsup) and minimum confidence (called minconf), respectively.
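To make the definitions concrete, here is a small hypothetical dataset of four transactions, with the support and confidence fractions computed directly as defined above:

```python
# Hypothetical toy dataset: four transactions over items a, b, c
D = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'b', 'c'}]

def support(itemset):
    """Fraction of transactions in D that contain the itemset."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(X, Y):
    """Fraction of transactions containing X that also contain Y."""
    return support(X | Y) / support(X)

print(support({'a', 'b'}))       # 2 of 4 transactions -> 0.5
print(confidence({'a'}, {'b'}))  # 2 of the 3 transactions with a -> ~0.667
```

With minsup = 0.5 and minconf = 0.6, the rule a ⇒ b would thus be reported (support 0.5, confidence 2/3).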
Finding frequent[2] itemsets (itemsets with support no less than minsup) is not trivial because of the computational complexity caused by the combinatorial explosion. Once frequent itemsets are obtained, it is straightforward to generate association rules with confidence no less than minconf. Apriori and AprioriTid, proposed by R. Agrawal and R. Srikant, are seminal algorithms designed to work on a large transaction dataset [3].

[1] An alternative support definition is the absolute count of frequency. In this chapter the latter definition is also used where appropriate.
[2] The Apriori paper [3] uses "large" to mean "frequent," but large is often associated with the number of items in the itemset. Thus, we prefer to use "frequent."
4.2.1.1 Apriori
Apriori is an algorithm to find all sets of items (itemsets) that have support no less than minsup. The support for an itemset is the ratio of the number of transactions that contain the itemset to the total number of transactions. Itemsets that satisfy the minimum support constraint are called frequent itemsets. Apriori is characterized as a level-wise complete search (breadth-first search) algorithm that uses the anti-monotonicity property of itemsets: "if an itemset is not frequent, none of its supersets can be frequent," which is also called the downward closure property. The algorithm makes multiple passes over the data. In the first pass, the support of individual items is counted and the frequent items are determined. In each subsequent pass, a seed set of the itemsets found to be frequent in the previous pass is used to generate new potentially frequent itemsets, called candidate itemsets, and their actual support is counted during the pass over the data. At the end of the pass, the candidates satisfying the minimum support constraint are collected, that is, the frequent itemsets are determined, and they become the seed for the next pass. This process is repeated until no new frequent itemsets are found.
By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The number of items in an itemset is called its size, and an itemset of size k is called a k-itemset. Let the set of frequent itemsets of size k be F_k and their candidates be C_k. Both F_k and C_k maintain a field, support count.
The Apriori algorithm is given in Algorithm 4.1. The first pass simply counts item occurrences to determine the frequent 1-itemsets. Each subsequent pass consists of two phases. First, the frequent itemsets F_{k-1} found in the (k-1)-th pass are used to generate the candidate itemsets C_k using the apriori-gen function. Next, the database is scanned and the support of the candidates in C_k is counted. The subset function is used for this counting.

The apriori-gen function takes as argument F_{k-1}, the set of all frequent (k-1)-itemsets, and returns a superset of the set of all frequent k-itemsets. First, in the join step, F_{k-1} is joined with F_{k-1}:
insert into C_k
select p.fitemset_1, p.fitemset_2, ..., p.fitemset_{k-1}, q.fitemset_{k-1}
from F_{k-1} p, F_{k-1} q
where p.fitemset_1 = q.fitemset_1, ..., p.fitemset_{k-2} = q.fitemset_{k-2},
      p.fitemset_{k-1} < q.fitemset_{k-1}
Here, F_k p means that the itemset p is a frequent k-itemset, and p.fitemset_k is the k-th item of the frequent itemset p.

Then, in the prune step, all the itemsets c ∈ C_k for which some (k-1)-subset is not in F_{k-1} are deleted.
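The join and prune steps can be sketched in Python as follows. This is a minimal illustration assuming itemsets are kept as lexicographically sorted tuples; the F3 itemsets below are an illustrative example chosen so that one candidate survives the prune and one does not:

```python
from itertools import combinations

def apriori_gen(F_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets F_prev."""
    frequent = set(F_prev)
    Ck = set()
    for p in F_prev:
        for q in F_prev:
            # Join step: p and q share their first k-2 items,
            # and p's last item precedes q's last item
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: keep c only if all its (k-1)-subsets are frequent
                if all(s in frequent for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
# The join yields (1,2,3,4) and (1,3,4,5); the prune removes (1,3,4,5)
# because its subset (1,4,5) is not in F3.
print(apriori_gen(F3, 4))  # {(1, 2, 3, 4)}
```

The condition `p[-1] < q[-1]` mirrors the p.fitemset_{k-1} < q.fitemset_{k-1} clause of the join and prevents the same candidate from being generated twice.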
Algorithm 4.1 Apriori Algorithm
F_1 = {frequent 1-itemsets};
for (k = 2; F_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(F_{k-1}); // New candidates
    foreach transaction t ∈ D do begin
        C_t = subset(C_k, t); // Candidates contained in t
        foreach candidate c ∈ C_t do
            c.count++;
    end
    F_k = {c ∈ C_k | c.count ≥ minsup};
end
Answer = ∪_k F_k;
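Algorithm 4.1 can be rendered as a compact, self-contained Python sketch. This is an illustration, not the paper's implementation: itemsets are sorted tuples, minsup is taken as an absolute count (see the footnote on support definitions), candidate generation is inlined, and a plain subset test stands in for the hash-tree:

```python
from itertools import combinations

def apriori(D, minsup):
    """Algorithm 4.1 sketch: returns {frequent itemset (sorted tuple): count}.
    D is a list of transactions; minsup is an absolute support count."""
    # First pass: count individual items to get the frequent 1-itemsets
    counts = {}
    for t in D:
        for i in t:
            counts[(i,)] = counts.get((i,), 0) + 1
    F = {c: n for c, n in counts.items() if n >= minsup}
    answer = dict(F)
    k = 2
    while F:
        # apriori-gen: join F_{k-1} with itself, then prune
        prev = sorted(F)
        prev_set = set(prev)
        Ck = set()
        for p in prev:
            for q in prev:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    if all(s in prev_set for s in combinations(c, k - 1)):
                        Ck.add(c)
        # Count candidate support with a plain subset test
        # (the paper uses a hash-tree for this step)
        counts = {c: 0 for c in Ck}
        for t in D:
            ts = set(t)
            for c in Ck:
                if ts.issuperset(c):
                    counts[c] += 1
        F = {c: n for c, n in counts.items() if n >= minsup}
        answer.update(F)
        k += 1
    return answer
```

For example, on the four transactions [['a','b','c'], ['a','b'], ['a','c'], ['b','c']] with minsup = 2, every single item and every pair is frequent, while {a,b,c} appears only once and is rejected.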
The subset function takes as arguments C_k and a transaction t, and returns all the candidate itemsets contained in the transaction t. For fast counting, Apriori adopts a hash-tree to store the candidate itemsets C_k. Itemsets are stored in the leaves. Every node is initially a leaf node, and the depth of the root node is defined to be 1. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node. An interior node at depth d points to nodes at depth d + 1. Which branch to follow is decided by applying a hash function to the d-th item of the itemset. Thus, each leaf node is ensured to contain at most a certain number of itemsets (to be precise, this is true only when the creation of an interior node takes place at a depth d smaller than k), and an itemset in a leaf node can be reached by successively hashing each item of the itemset in sequence from the root. Once the hash-tree is constructed, the subset function finds all the candidates contained in a transaction t, starting from the root node. At the root node, every item in t is hashed, and each branch determined is followed one depth down. If a leaf node is reached, the itemsets in the leaf that are contained in the transaction t are searched for, and those found are added to the answer set. If an interior node is reached by hashing the item i, the items that come after i in t are hashed recursively until a leaf node is reached. It is evident that itemsets in leaves that are never reached are not contained in t.
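The hash-tree just described can be sketched as follows. This is a simplified illustration: the leaf threshold, the bucket count, and the use of Python's built-in hash are arbitrary choices for the example, not the paper's parameters:

```python
class HashTreeNode:
    """Minimal hash-tree over candidate itemsets stored as sorted tuples."""
    THRESHOLD = 3   # max itemsets in a leaf before it becomes interior
    BUCKETS = 7     # range of the illustrative hash function

    def __init__(self, depth=1):
        self.depth = depth      # the root node has depth 1
        self.itemsets = []      # populated while this node is a leaf
        self.children = None    # bucket -> child node, once interior

    def insert(self, itemset):
        if self.children is not None:
            # Interior node: branch on the depth-th item of the itemset
            key = hash(itemset[self.depth - 1]) % self.BUCKETS
            child = self.children.setdefault(key, HashTreeNode(self.depth + 1))
            child.insert(itemset)
            return
        self.itemsets.append(itemset)
        # Convert an overfull leaf to an interior node, if an item remains to hash
        if len(self.itemsets) > self.THRESHOLD and self.depth <= len(itemset):
            stored, self.itemsets, self.children = self.itemsets, [], {}
            for s in stored:
                self.insert(s)  # redistribute into the new children

    def subset(self, suffix, t, found):
        """Add to `found` every stored candidate contained in transaction t.
        `suffix` holds the items of t still eligible for hashing at this node;
        t itself is passed as a set for the final containment check."""
        if self.children is None:
            found.update(c for c in self.itemsets if set(c) <= t)
            return
        # Hash each remaining item and follow the branch it selects,
        # recursing only on the items that come after it
        for j, item in enumerate(suffix):
            child = self.children.get(hash(item) % self.BUCKETS)
            if child is not None:
                child.subset(suffix[j + 1:], t, found)
```

For instance, inserting the five 3-itemsets (1,2,3), (1,2,4), (1,3,4), (2,3,4), (1,3,5) splits the root, and querying the transaction (1,2,3,4) reaches only the leaves whose candidates could be contained in it.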
Clearly, any subset of a frequent itemset satisfies the minimum support constraint. The join operation is equivalent to extending F_{k-1} with each item in the database and then deleting those itemsets for which the (k-1)-itemset obtained by deleting the (k-1)-th item is not in F_{k-1}. The condition p.fitemset_{k-1} < q.fitemset_{k-1} ensures that no duplicates are generated. The prune step, in which all the itemsets whose (k-1)-subsets are not in F_{k-1} are deleted from C_k, does not delete any itemset that could be in F_k. Thus, C_k ⊇ F_k, and the Apriori algorithm is correct.

The remaining task is to generate the desired association rules from the frequent itemsets. A straightforward algorithm for this task is as follows. To generate rules, all nonempty subsets of every frequent itemset f are enumerated, and for every such subset a, a rule of the form a ⇒ (f − a) is generated if the ratio of support(f) to support(a) is at least minconf. Here, note that the confidence of the rule â ⇒ (f − â) cannot be larger than the confidence of a ⇒ (f − a) for any â ⊂ a. This in turn means that for the rule (f − a) ⇒ a to hold, all rules of the form (f − â) ⇒ â must hold. Using this property, the algorithm to generate association rules is given in Algorithm 4.2.
Algorithm 4.2 Association Rule Generation Algorithm
H_1 = ∅; // Initialize
foreach frequent k-itemset f_k, k ≥ 2 do begin
    A = {(k-1)-itemsets a_{k-1} such that a_{k-1} ⊂ f_k};
    foreach a_{k-1} ∈ A do begin
        conf = support(f_k)/support(a_{k-1});
        if (conf ≥ minconf) then begin
            output the rule a_{k-1} ⇒ (f_k − a_{k-1})
                with confidence = conf and support = support(f_k);
            add (f_k − a_{k-1}) to H_1;
        end
    end
    call ap-genrules(f_k, H_1);
end

Procedure ap-genrules(f_k: frequent k-itemset, H_m: set of m-item consequents)
if (k > m + 1) then begin
    H_{m+1} = apriori-gen(H_m);
    foreach h_{m+1} ∈ H_{m+1} do begin
        conf = support(f_k)/support(f_k − h_{m+1});
        if (conf ≥ minconf) then
            output the rule (f_k − h_{m+1}) ⇒ h_{m+1}
                with confidence = conf and support = support(f_k);
        else
            delete h_{m+1} from H_{m+1};
    end
    call ap-genrules(f_k, H_{m+1});
end
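The straightforward enumeration described before Algorithm 4.2 (every nonempty proper subset of a frequent itemset tried as an antecedent, without the ap-genrules consequent pruning) can be sketched in Python. The input is assumed to be a dictionary mapping frequent itemsets, as sorted tuples, to their support counts, such as the output of a frequent itemset miner:

```python
from itertools import combinations

def gen_rules(support, minconf):
    """Enumerate rules a => (f - a) with confidence >= minconf.
    `support` maps every frequent itemset (sorted tuple) to its count,
    so support[a] is defined for every subset a of a frequent f."""
    rules = []
    for f, supp_f in support.items():
        if len(f) < 2:
            continue  # no nonempty proper subset exists
        # Try every nonempty proper subset a of f as an antecedent
        for r in range(1, len(f)):
            for a in combinations(f, r):
                conf = supp_f / support[a]
                if conf >= minconf:
                    consequent = tuple(i for i in f if i not in a)
                    rules.append((a, consequent, conf))
    return rules
```

With the hypothetical counts {a}: 3, {b}: 3, {a,b}: 2 and minconf = 0.6, both a ⇒ b and b ⇒ a qualify with confidence 2/3.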
Apriori achieves good performance by reducing the size of the candidate sets. However, in situations with very many frequent itemsets or a very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and of scanning the database repeatedly to check a large set of candidate itemsets.