Top 10 Machine Learning Algorithms: Apriori

4.2 Algorithm Description

Once all frequent itemsets are obtained, it is straightforward to generate association rules with confidence no less than minconf. Apriori and AprioriTid, proposed by R. Agrawal and R. Srikant, are seminal algorithms that are designed to work for a large transaction dataset [3].

4.2.1.1 Apriori

Apriori is an algorithm to find all sets of items (itemsets) that have support no less than minsup. The support for an itemset is the ratio of the number of transactions that contain the itemset to the total number of transactions. Itemsets that satisfy the minimum support constraint are called frequent itemsets. Apriori is characterized as a level-wise complete search (breadth-first search) algorithm that uses the anti-monotonicity property of itemsets: if an itemset is not frequent, none of its supersets is frequent. This is also called the downward closure property.

The algorithm makes multiple passes over the data. In the first pass, the support of individual items is counted and the frequent items are determined. In each subsequent pass, a seed set of itemsets found to be frequent in the previous pass is used to generate new potentially frequent itemsets, called candidate itemsets, and their actual support is counted during the pass over the data. At the end of the pass, those satisfying the minimum support constraint are collected, that is, the frequent itemsets are determined, and they become the seed for the next pass. This process is repeated until no new frequent itemsets are found.

By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The number of items in an itemset is called its size, and an itemset of size k is called a k-itemset. Let the set of frequent itemsets of size k be F_k and their candidates be C_k. Both F_k and C_k maintain a field, support count.

The Apriori algorithm is given in Algorithm 4.1. The first pass simply counts item occurrences to determine the frequent 1-itemsets. A subsequent pass consists of two phases. First, the frequent itemsets F_{k-1} found in the (k-1)-th pass are used to generate the candidate itemsets C_k using the apriori-gen function. Next, the database is scanned and the support of the candidates in C_k is counted; the subset function is used for this counting.

The apriori-gen function takes as argument F_{k-1}, the set of all frequent (k-1)-itemsets, and returns a superset of the set of all frequent k-itemsets. First, in the join step, F_{k-1} is joined with F_{k-1}:

    insert into C_k
    select p.fitemset_1, p.fitemset_2, ..., p.fitemset_{k-1}, q.fitemset_{k-1}
    from F_{k-1} p, F_{k-1} q
    where p.fitemset_1 = q.fitemset_1, ..., p.fitemset_{k-2} = q.fitemset_{k-2},
          p.fitemset_{k-1} < q.fitemset_{k-1}

Here, F_{k-1} p means that the itemset p is a frequent (k-1)-itemset, and p.fitemset_j is the j-th item of the frequent itemset p. Then, in the prune step, all the itemsets c ∈ C_k for which some (k-1)-subset is not in F_{k-1} are deleted.

Algorithm 4.1 Apriori algorithm

    F_1 = {frequent 1-itemsets};
    for (k = 2; F_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(F_{k-1});   // New candidates
        foreach transaction t ∈ D do begin
            C_t = subset(C_k, t);     // Candidates contained in t
            foreach candidate c ∈ C_t do
                c.count++;
        end
        F_k = {c ∈ C_k | c.count ≥ minsup};
    end
    Answer = ∪_k F_k;
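To make the pseudocode concrete, the following is a minimal Python sketch of apriori-gen and the main counting loop. It is a sketch under simplifying assumptions, not a reference implementation: itemsets are represented as sorted tuples, minsup is taken as an absolute transaction count, and candidates are counted by a naive containment test in place of the subset function and hash-tree described below. The names (apriori_gen, apriori) are illustrative.

    from itertools import combinations

    def apriori_gen(F_prev):
        # Join F_{k-1} with itself, then prune candidates that have an
        # infrequent (k-1)-subset (the downward closure property).
        F_prev = set(F_prev)
        candidates = set()
        for p in F_prev:
            for q in F_prev:
                # Join step: first k-2 items equal, last item of p < last of q.
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune step: every (k-1)-subset of c must be frequent.
                    if all(s in F_prev for s in combinations(c, len(c) - 1)):
                        candidates.add(c)
        return candidates

    def apriori(D, minsup):
        # D: list of transactions, each a sorted tuple of items.
        # minsup: minimum support given as an absolute transaction count.
        counts = {}
        for t in D:                      # first pass: count individual items
            for i in t:
                counts[(i,)] = counts.get((i,), 0) + 1
        F = {c for c, n in counts.items() if n >= minsup}
        answer = set(F)
        while F:
            Ck = apriori_gen(F)          # new candidates from the seed F
            counts = {c: 0 for c in Ck}
            for t in D:                  # one pass over the data
                items = set(t)
                for c in Ck:             # naive stand-in for subset(Ck, t)
                    if items.issuperset(c):
                        counts[c] += 1
            F = {c for c, n in counts.items() if n >= minsup}
            answer |= F
        return answer

For example, apriori([(1, 2, 3), (1, 2), (2, 3)], 2) returns {(1,), (2,), (3,), (1, 2), (2, 3)}: the candidate (1, 3) is generated but fails the support test, and no 3-itemset candidate survives the join.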
The subset function takes as arguments C_k and a transaction t, and returns all the candidate itemsets contained in the transaction t. For fast counting, Apriori adopts a hash-tree to store the candidate itemsets C_k. Itemsets are stored in the leaves. Every node is initially a leaf node, and the depth of the root node is defined to be 1. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node. An interior node at depth d points to nodes at depth d + 1; which branch to follow is decided by applying a hash function to the d-th item of the itemset. Thus each leaf node is ensured to contain at most a certain number of itemsets (to be precise, this is true only when the creation of an interior node takes place at a depth d smaller than k), and an itemset in a leaf node can be reached by successively hashing each item in the itemset in sequence from the root.

Once the hash-tree is constructed, the subset function finds all the candidates contained in a transaction t, starting from the root node. At the root node, every item in t is hashed, and each branch determined is followed one depth down. If a leaf node is reached, the itemsets in the leaf that are contained in the transaction t are searched for, and those found are added to the answer set. If an interior node is reached by hashing the item i, the items that come after i in t are hashed recursively until a leaf node is reached. It is evident that itemsets in the leaves that are never reached are not contained in t.

Clearly, any subset of a frequent itemset satisfies the minimum support constraint. The join operation is equivalent to extending F_{k-1} with each item in the database and then deleting those itemsets for which the (k-1)-itemset obtained by deleting the (k-1)-th item is not in F_{k-1}. The condition p.fitemset_{k-1} < q.fitemset_{k-1} ensures that no duplicate is generated. The prune step, where all the itemsets whose (k-1)-subsets are not in F_{k-1} are deleted from C_k, does not delete any itemset that could be in F_k. Thus, C_k ⊇ F_k, and the Apriori algorithm is correct.

The remaining task is to generate the desired association rules from the frequent itemsets. A straightforward algorithm for this task is as follows. To generate rules, all nonempty subsets of every frequent itemset f are enumerated, and for every such subset a, a rule of the form a ⇒ (f − a) is generated if the ratio of support(f) to support(a) is at least minconf. Here, note that the confidence of the rule a' ⇒ (f − a') cannot be larger than the confidence of a ⇒ (f − a) for any a' ⊂ a. This in turn means that for a rule (f − a) ⇒ a to hold, all rules of the form (f − a') ⇒ a' with a' ⊆ a must also hold. Using this property, the algorithm to generate association rules is given in Algorithm 4.2.

Algorithm 4.2 Association rule generation algorithm

    H_1 = ∅;   // Initialize
    foreach frequent k-itemset f_k, k ≥ 2 do begin
        A = {(k-1)-itemsets a_{k-1} such that a_{k-1} ⊂ f_k};
        foreach a_{k-1} ∈ A do begin
            conf = support(f_k) / support(a_{k-1});
            if (conf ≥ minconf) then begin
                output the rule a_{k-1} ⇒ (f_k − a_{k-1})
                    with confidence = conf and support = support(f_k);
                add (f_k − a_{k-1}) to H_1;
            end
        end
        call ap-genrules(f_k, H_1);
    end

    Procedure ap-genrules(f_k: frequent k-itemset, H_m: set of m-item consequents)
    if (k > m + 1) then begin
        H_{m+1} = apriori-gen(H_m);
        foreach h_{m+1} ∈ H_{m+1} do begin
            conf = support(f_k) / support(f_k − h_{m+1});
            if (conf ≥ minconf) then
                output the rule (f_k − h_{m+1}) ⇒ h_{m+1}
                    with confidence = conf and support = support(f_k);
            else
                delete h_{m+1} from H_{m+1};
        end
        call ap-genrules(f_k, H_{m+1});
    end

Apriori achieves good performance by reducing the size of candidate sets. However, in situations with very many frequent itemsets or a very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and of scanning the database repeatedly to check a large set of candidate itemsets.
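The following Python sketch implements the straightforward version of rule generation described above, enumerating every nonempty proper subset a of each frequent itemset f directly rather than through the ap-genrules recursion. It assumes supports is a dict mapping every frequent itemset (as a sorted tuple) to its support; since every subset of a frequent itemset is itself frequent, support(a) is always available in that dict. The names are illustrative.

    from itertools import combinations

    def generate_rules(supports, minconf):
        # supports: dict mapping each frequent itemset (sorted tuple)
        # to its support. Returns (antecedent, consequent, conf, supp).
        rules = []
        for f, supp_f in supports.items():
            if len(f) < 2:
                continue
            # Enumerate all nonempty proper subsets a of f and output
            # a => (f - a) when support(f) / support(a) >= minconf.
            for r in range(1, len(f)):
                for a in combinations(f, r):
                    conf = supp_f / supports[a]
                    if conf >= minconf:
                        consequent = tuple(i for i in f if i not in a)
                        rules.append((a, consequent, conf, supp_f))
        return rules

Algorithm 4.2 improves on this by growing consequents with apriori-gen, so that once a consequent fails the confidence test, none of its supersets is tried.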
4.2.1.2 AprioriTid

AprioriTid is a variation of Apriori. It does not reduce the number of candidates, but it does not use the database D for counting support after the first pass. Instead, it uses a new dataset C̄_k. Each member of the set C̄_k is of the form <TID, {X_k}>, where each X_k is the identifier of a potentially frequent k-itemset present in the transaction with identifier TID. The case k = 1 is special: C̄_1 corresponds to the database D, although conceptually each item i is replaced by the itemset {i}. The member of C̄_k corresponding to a transaction t is <t.TID, {c ∈ C_k | c contained in t}>.

The intuition for using C̄_k is that it will be smaller than the database D for large k, because some transactions may not contain any candidate k-itemset, in which case C̄_k has no entry for that transaction, or because very few candidates may be contained in a transaction, so that each entry may be smaller than the number of items in the corresponding transaction.

The AprioriTid algorithm is given in Algorithm 4.3. Here, c[i] represents the i-th item in the k-itemset c.

Algorithm 4.3 AprioriTid algorithm

    F_1 = {frequent 1-itemsets};
    C̄_1 = database D;
    for (k = 2; F_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(F_{k-1});   // New candidates
        C̄_k = ∅;
        foreach entry t ∈ C̄_{k-1} do begin
            // determine candidate itemsets in C_k contained
            // in the transaction with identifier t.TID
            C_t = {c ∈ C_k | (c − c[k]) ∈ t.set-of-itemsets ∧
                             (c − c[k-1]) ∈ t.set-of-itemsets};
            foreach candidate c ∈ C_t do
                c.count++;
            if (C_t ≠ ∅) then C̄_k += <t.TID, C_t>;
        end
        F_k = {c ∈ C_k | c.count ≥ minsup};
    end
    Answer = ∪_k F_k;

Each C̄_k is stored in a sequential structure. A candidate k-itemset c_k in C_k maintains two additional fields, generator and extensions, in addition to the support count field. The generator field stores the IDs of the two frequent (k-1)-itemsets whose join generated c_k. The extensions field stores the IDs of all the (k+1)-candidates that are extensions of c_k. When a candidate c_k is generated by joining f1_{k-1} and f2_{k-1}, their IDs are saved in the generator field of c_k, and the ID of c_k is added to the extensions field of f1_{k-1}. The t.set-of-itemsets field of an entry t in C̄_{k-1} gives the IDs of all (k-1)-candidates contained in the transaction t.TID. For each such candidate c_{k-1}, the extensions field gives T_k, the set of IDs of all the candidate k-itemsets that are extensions of c_{k-1}. For each c_k in T_k, the generator field gives the IDs of the two itemsets that generated c_k. If these itemsets are present in the entry's t.set-of-itemsets, it is concluded that c_k is present in the transaction t.TID, and c_k is added to C_t.

AprioriTid has the overhead of computing C̄_k, but the advantage that C̄_k can be stored in memory when k is large. It is thus expected that Apriori beats AprioriTid in the earlier passes (small k) and AprioriTid beats Apriori in the later passes (large k). Since both Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets, it is possible to use the two algorithms in combination, in sequence. AprioriHybrid uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C̄_k at the end of the pass will fit in memory.
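A minimal Python sketch of a single AprioriTid pass follows. It keeps C̄_k as a list of (TID, set of itemsets) pairs and, instead of the generator/extensions ID bookkeeping described above, tests directly whether both (k-1)-itemsets that would generate a candidate are present in the entry; this is the same containment condition as in Algorithm 4.3. Itemsets are sorted tuples and all names are illustrative.

    def aprioritid_pass(Ck, C_bar_prev):
        # Ck: candidate k-itemsets (sorted tuples) from apriori-gen.
        # C_bar_prev: list of (tid, set of (k-1)-itemsets) entries.
        counts = {c: 0 for c in Ck}
        C_bar = []
        for tid, itemsets in C_bar_prev:
            # c is contained in transaction tid iff both of its generating
            # (k-1)-itemsets -- c minus its last item and c minus its
            # second-to-last item -- were contained in that transaction.
            Ct = {c for c in Ck
                  if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
            for c in Ct:
                counts[c] += 1
            if Ct:                  # entries with no candidate are dropped
                C_bar.append((tid, Ct))
        return counts, C_bar

Here C̄_1 would be built from D by replacing each transaction with its set of 1-itemsets, matching the convention above, and each pass shrinks C̄_k as transactions stop contributing candidates.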
4.2.2 Mining Sequential Patterns

Agrawal and Srikant extended the Apriori algorithm to the problem of sequential pattern mining [6]. In Apriori there is no notion of sequence, and thus the problem of finding which items appear together can be viewed as finding intratransaction patterns. Here sequence matters, and the problem of finding sequential patterns can be viewed as finding intertransaction patterns.

Each transaction consists of a sequence-id, a transaction-time, and a set of items. The same sequence-id has no more than one transaction with the same transaction-time. A sequence is an ordered list of itemsets. Thus, a sequence consists of a list of sets of characters (items), rather than being simply a list of characters. The length of a sequence is the number of itemsets in the sequence, and a sequence of length k is called a k-sequence. Without loss of generality, the set of items is assumed to be mapped to a set of contiguous integers, and an itemset i is denoted by (i_1 i_2 ... i_m), where i_j is an item. A sequence s is denoted by <s_1 s_2 ... s_n>. A sequence <a_1 a_2 ... a_n> is contained in another sequence <b_1 b_2 ... b_m> (n ≤ m) if there exist integers i_1 < i_2 < ... < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, ..., a_n ⊆ b_{i_n}. All the transactions with the same sequence-id, sorted by transaction-time, together form a sequence (the transaction sequence). A sequence-id supports a sequence s if s is contained in its transaction sequence. The support for a sequence is defined as the fraction of the total number of sequence-ids that support this sequence. Likewise, the support for an itemset i is defined as the fraction of sequence-ids that have the items in i in any one of their transactions. Note that this definition is different from that used in Apriori. Thus the itemset i and the 1-sequence <i> have the same support.

Given a transaction database D, the problem of mining sequential patterns is to find the maximal* sequences among all sequences that satisfy a certain user-specified minimum support constraint. Each such maximal sequence represents a sequential pattern. A sequence satisfying the minimum support constraint is called a frequent sequence (not necessarily maximal), and an itemset satisfying the minimum support constraint is called a frequent itemset, or fitemset for short. Any frequent sequence must be a list of fitemsets.

* Later, R. Agrawal and R. Srikant removed this constraint in their generalized sequential patterns (GSP) [32].

The algorithm consists of five phases: (1) sort phase, (2) fitemset phase, (3) transformation phase, (4) sequence phase, and (5) maximal phase. The first three are preprocessing phases and the last one is a postprocessing phase.

In the sort phase, the database D is sorted with sequence-id as the major key and transaction-time as the minor key. In the fitemset phase, the set of all fitemsets is obtained using the Apriori algorithm with the corresponding modification of support counting, and is mapped to a set of contiguous integers. This makes it possible to compare two fitemsets for equality in constant time. Note that the set of all frequent 1-sequences is found simultaneously in this phase. In the transformation phase, each transaction is replaced by the set of all fitemsets contained in that transaction. If a transaction does not contain any fitemset, it is not retained in the transformed sequence. If a transaction sequence does not contain any fitemset, this sequence is removed from the transformed database, but it is still used in counting the total number of sequence-ids. After the transformation, a transaction sequence is represented by a list of sets of fitemsets. Each set of fitemsets is represented by {f_1, f_2, ..., f_n}, where f_j is an fitemset. This transformation is designed for efficiently testing which given frequent sequences are contained in a transaction sequence.
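The containment test just defined reduces to a greedy scan: for each element of the candidate sequence, take the earliest not-yet-used element of the transaction sequence that is a superset of it. A minimal Python sketch, with sequence elements represented as Python sets and illustrative names:

    def contains(seq_a, seq_b):
        # True if <a_1 ... a_n> is contained in <b_1 ... b_m>, i.e.,
        # there exist indices i_1 < i_2 < ... < i_n with a_j a subset
        # of b_{i_j} for every j. Elements are sets of items.
        j = 0
        for a in seq_a:
            # advance through seq_b until some element contains a
            while j < len(seq_b) and not a <= seq_b[j]:
                j += 1
            if j == len(seq_b):
                return False
            j += 1    # the next a must match a strictly later element
        return True

For example, contains([{3}, {4, 5}], [{7}, {3, 8}, {9}, {4, 5, 6}]) is True, while contains([{3}, {5}], [{3, 5}]) is False, since the two elements must come from distinct positions in the containing sequence.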
The transformed database is denoted as D_T.

The sequence phase is the main part, where the frequent sequences are to be enumerated. Two families of algorithms are proposed: count-all and count-some. They differ in the way the frequent sequences are counted. A count-all algorithm counts all the frequent sequences, including nonmaximal sequences that must be pruned later, whereas a count-some algorithm avoids counting sequences that are contained in a longer sequence, because the final goal is to obtain only maximal sequences. Agrawal and Srikant developed one count-all algorithm called AprioriAll and two count-some algorithms called AprioriSome and DynamicSome. Here, only AprioriAll is explained due to space limitations.

In the last maximal phase, maximal sequences are extracted from the set of all frequent sequences. A hash-tree (similar to the one used in the subset function in Apriori) is used to quickly find all subsequences of a given sequence.

4.2.2.1 AprioriAll

The algorithm is given in Algorithm 4.4. In each pass, the frequent sequences from the previous pass are used to generate the candidate sequences, and then their support is measured by making a pass over the database. At the end of the pass, the support of the candidates is used to determine the frequent sequences.

The apriori-gen-2 function takes as argument F_{k-1}, the set of all frequent (k-1)-sequences. First, the join operation is performed as

    insert into C_k
    select p.fitemset_1, p.fitemset_2, ..., p.fitemset_{k-1}, q.fitemset_{k-1}
    from F_{k-1} p, F_{k-1} q
    where p.fitemset_1 = q.fitemset_1, ..., p.fitemset_{k-2} = q.fitemset_{k-2}

Then, all the sequences c ∈ C_k for which some (k-1)-subsequence is not in F_{k-1} are deleted.

Algorithm 4.4 AprioriAll algorithm

    F_1 = {frequent 1-sequences};   // Result of the fitemset phase
    for (k = 2; F_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen-2(F_{k-1});   // New candidate sequences
        foreach transaction sequence t ∈ D_T do begin
            C_t = subseq(C_k, t);   // Candidate sequences contained in t
            foreach candidate c ∈ C_t do
                c.count++;
        end
        F_k = {c ∈ C_k | c.count ≥ minsup};
    end
    Answer = maximal sequences in ∪_k F_k;

The subseq function is similar to the subset function in Apriori. As in Apriori, the candidate sequences C_k are stored in a hash-tree to quickly find all candidates contained in a transaction sequence. Note that the transformed transaction sequence is a list of sets of fitemsets, that all the fitemsets in one set have the same transaction-time, and that no more than one transaction with the same transaction-time is allowed for the same sequence-id. This constraint has to be imposed in the subseq function.
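A minimal Python sketch of apriori-gen-2, representing sequences as tuples of fitemset IDs (the contiguous integers produced by the fitemset phase). Unlike apriori-gen, the join keeps both orders of the two last elements, since <x y> and <y x> are different sequences; the sketch also assumes the self-join is allowed, so that candidates with repeated elements such as <x x> are generated. The names are illustrative.

    from itertools import combinations

    def apriori_gen_2(F_prev):
        # F_prev: set of frequent (k-1)-sequences, each a tuple of
        # fitemset IDs.
        candidates = set()
        for p in F_prev:
            for q in F_prev:
                # Join step: first k-2 elements equal; both orders of the
                # last elements are kept because element order matters,
                # and p == q yields repeats such as <x x>.
                if p[:-1] == q[:-1]:
                    candidates.add(p + (q[-1],))
        # Prune step: every (k-1)-subsequence (obtained by deleting one
        # element) must itself be frequent.
        return {c for c in candidates
                if all(s in F_prev for s in combinations(c, len(c) - 1))}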
4.2.3 Discussion

Both Apriori and AprioriTid need minsup and minconf to be specified in advance. The algorithms have to be rerun each time these values are changed, throwing away everything that was obtained in previous runs. If no appropriate values for these thresholds are known in advance and we want to know how the results change with these values without rerunning the algorithms, the best we can do is to generate and count only those itemsets that appear at least once in the database, without duplication, and store them all in an efficient way. Note that Apriori generates candidates that do not exist in the database.

Apriori and AprioriTid use a hash-tree to store the candidate itemsets. Another data structure that is often used is a trie structure [35, 9]. Each node at depth k of the trie corresponds to a candidate k-itemset and stores the k-th item and the support of the itemset. As two frequent k-itemsets that share the first k-1 items are siblings below their parent node at depth k-1 in the trie, candidate generation simply joins the two siblings and extends the tree one more depth below the first frequent k-itemset, after pruning. In order to find the candidate k-itemsets that are contained in a transaction t, each item in the transaction is fed from the root node, and a branch is followed according to the succeeding item until a k-th item is reached. Many practical implementations of Apriori use this trie structure to store not only candidates but also transactions [10, 9].
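A minimal sketch of such a trie used for support counting. Each node stores its item and a support counter, and a root-to-node path at depth k spells out a candidate k-itemset (one trie per pass k); counting feeds the items of a sorted transaction from each node, as described above. This illustrates the idea only and is not a rendering of the cited implementations [35, 9].

    class TrieNode:
        def __init__(self, item=None):
            self.item = item       # k-th item of the itemset ending here
            self.support = 0       # support count of that itemset
            self.children = {}     # next item -> child node, one level down

    def insert(root, itemset):
        # Store a candidate itemset (sorted tuple) as a root-to-node path.
        node = root
        for item in itemset:
            node = node.children.setdefault(item, TrieNode(item))
        return node

    def count(node, t, depth, k):
        # Walk the trie with the (sorted) remainder of transaction t; every
        # node reached at depth k is a candidate k-itemset contained in t.
        if depth == k:
            node.support += 1
            return
        for i, item in enumerate(t):
            child = node.children.get(item)
            if child is not None:
                count(child, t[i + 1:], depth + 1, k)

Calling count(root, t, 0, k) once per transaction t increments the support of exactly the candidates contained in t, since a candidate's sorted items can be matched against a sorted transaction in only one way.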
If we go a step further, we can get rid of generating candidate itemsets altogether. Further, it is not necessary to enumerate all the frequent itemsets. These topics are discussed in Section 4.5.

Apriori and almost all other association rule miners use a two-phase strategy: first mine the frequent patterns, and then generate the association rules. This is not the sole way. Webb's MagnumOpus uses another strategy that immediately generates a large subset of all association rules [38].

There are direct extensions of the original Apriori family. The use of taxonomies and the incorporation of temporal constraints are two examples. Generalized association rules [30] employ a set of user-specified taxonomies, which makes it possible to extract frequent itemsets that are expressed by higher-level concepts even when use of the base-level concepts produces only infrequent itemsets. The basic algorithm is to add all ancestors of each item in a transaction to the transaction and then run the Apriori algorithm. Several optimizations can be added to improve efficiency, one example being that the support for an itemset X that contains both an item x and its ancestor x̂ is the same as the support of the itemset X − x̂, so X need not be counted.

Generalized sequential patterns [32] add, besides the introduction of taxonomies, time constraints that specify a minimum and/or maximum time period between adjacent elements (itemsets) in a pattern, and relax the restriction that the items in an element of a sequential pattern must come from the same transaction, by allowing the items to be present in a set of transactions of the same sequence-id whose transaction-times are within a user-specified time window. GSP also finds all frequent sequential patterns (not limited to maximal sequential patterns). The GSP algorithm runs about 20 times faster than AprioriAll, one reason being that GSP counts fewer candidates than AprioriAll.

4.3 Discussion on Available Software Implementations

There are many available implementations of Apriori, ranging from free software to commercial products. Here, we present only three well-known implementations that are freely downloadable via the Internet.

The first one is the implementation embedded in the most famous open-source machine learning and data mining toolkit, Weka, provided by the University of Waikato [40]. Apriori in Weka can be used through Weka's common graphical user interface together with the many other algorithms that are available in Weka. The implementation includes Weka's own extensions. For example, minsup is iteratively decreased from an upper bound u_minsup to a lower bound l_minsup with an interval δ_minsup. Further, in addition to confidence, the metrics lift, leverage, and conviction are available to evaluate association rules. Lift and leverage are discussed in Section 4.5. Conviction [11] is a metric that was proposed to measure the departure from independence of an association rule, taking implication into account. When using one of these metrics, its minimal value has to be given as a threshold.
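As a reference point, the following sketch computes these metrics from supports given as fractions. The formulas are the standard definitions (lift = supp(A∪B) / (supp(A) · supp(B)), leverage = supp(A∪B) − supp(A) · supp(B), conviction = (1 − supp(B)) / (1 − confidence)) rather than the chapter's own exposition, which defers lift and leverage to Section 4.5; the function name is illustrative.

    def rule_metrics(supp_a, supp_b, supp_ab):
        # Supports are fractions of the total transaction count.
        # Returns confidence, lift, leverage, and conviction of A => B.
        conf = supp_ab / supp_a
        lift = supp_ab / (supp_a * supp_b)
        leverage = supp_ab - supp_a * supp_b
        # Conviction diverges for a rule that never fails (conf = 1).
        conviction = (1 - supp_b) / (1 - conf) if conf < 1 else float("inf")
        return conf, lift, leverage, conviction

Lift and conviction equal 1, and leverage equals 0, exactly when A and B are statistically independent, so each measures a different kind of departure from independence.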
