Pruning Closed Itemset Lattices for Association Rules

Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal
{pasquier,bastide,taouil}@libd1.univ-bpclermont.fr
lakhal@ucfma.univ-bpclermont.fr
Laboratoire d'Informatique (LIMOS)
Université Blaise Pascal - Clermont-Ferrand II
Complexe Scientifique des Cézeaux
24, av. des Landais, 63177 Aubière Cedex France
Résumé

Discovering association rules is one of the main problems of knowledge discovery in databases. Many efficient algorithms have been proposed, the most notable being Apriori, Mannila's algorithm, Partition, Sampling and DIC. These are all based on the Apriori search method: pruning of the subset lattice (itemset lattice). In this article, we propose an efficient algorithm based on a new search method: pruning of the closed set lattice (closed itemset lattice). This lattice, which is a sub-order of the subset lattice, is closely related to Wille's concept lattice in formal concept analysis. We experimentally compared Close to an optimized version of Apriori, and the results obtained show the great efficiency of Close in processing dense and/or correlated data such as census data (a difficult case). We also observed that Close gives reasonable response times when processing sales databases.

Keywords: knowledge discovery; association rules; lattices; algorithms.
Abstract

Discovering association rules is one of the most important tasks in data mining, and many efficient algorithms have been proposed in the literature. The most noticeable are Apriori, Mannila's algorithm, Partition, Sampling and DIC, which are all based on the Apriori mining method: pruning of the subset lattice (itemset lattice). In this paper we propose an efficient algorithm, called Close, based on a new mining method: pruning of the closed set lattice (closed itemset lattice). This lattice, which is a sub-order of the subset lattice, is closely related to Wille's concept lattice in formal concept analysis. Experiments comparing Close to an optimized version of Apriori showed that Close is very efficient for mining dense and/or correlated data such as census data, and performs reasonably well for market basket style data.

Keywords: data mining; knowledge discovery; association rules; lattices; algorithms.
hal-00467745, version 1 - 26 Apr 2010
Author manuscript, published in "BDA'1998 international conference on Advanced Databases, Hammamet : Tunisia (1998)"
1 Introduction

One of the most important tasks in data mining is the discovery of association rules, first introduced in [1]. The aim of association rule discovery is to identify relationships between items in very large databases. For example, given a market basket database, it would be interesting for decision support to know that 80% of customers who bought cereals and sugar also bought milk. In a census database, we could discover that 60% of persons who worked last year earned less than the average income, or in a medical database, that 70% of patients who have stiffnesses and fever also have headaches.
Agrawal's statement of the problem of discovering association rules in market basket databases is the following [1, 2]. Let ℐ = {i_1, i_2, ..., i_m} be a set of m literals called items. Let the database D = {t_1, t_2, ..., t_n} be a set of n transactions, each one consisting of a set of items I from ℐ and associated with a unique identifier called its TID. I is called a k-itemset, where k is the size of I. A transaction t ∈ D is said to contain an itemset I if I ⊆ t. The support of an itemset I is the percentage of transactions in D containing I: support(I) = |{t ∈ D | I ⊆ t}| / |{t ∈ D}|. An association rule is a conditional implication among itemsets, I ⇒ I', where itemsets I, I' ⊆ ℐ and I ∩ I' = ∅. The confidence of an association rule r: I ⇒ I' is the conditional probability that a transaction contains I', given that it contains I: confidence(r) = support(I ∪ I') / support(I). The support of r is defined as: support(r) = support(I ∪ I').
The problem of mining association rules in a database D is then traditionally defined as follows. Given user-defined thresholds for the permissible minimum support and confidence, find all the association rules that hold with more than the given minsupport and minconfidence. This problem can be broken into two subproblems [1]:

1. Finding all frequent itemsets in D, i.e. itemsets with support greater than or equal to minsupport. Frequent itemsets are also called large itemsets.

2. For each frequent itemset I_1 found, generating all association rules I_2 ⇒ I_1 − I_2, where I_2 ⊂ I_1, with confidence greater than or equal to minconfidence.
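Subproblem 2 can be illustrated with a small sketch (not code from the paper; `rules_from_itemset` is an assumed helper name, and the support values are precomputed by hand): it enumerates every non-empty proper subset I_2 of a frequent itemset I_1 and keeps the rules reaching minconfidence.

```python
# Sketch of subproblem 2: derive rules I2 => (I1 - I2) from one frequent
# itemset I1, given a table of precomputed supports.
from itertools import combinations

def rules_from_itemset(i1, supports, minconfidence):
    """Return (antecedent, consequent, confidence) for qualifying I2 subset of I1."""
    rules = []
    for size in range(1, len(i1)):
        for combo in combinations(sorted(i1), size):
            i2 = frozenset(combo)
            conf = supports[frozenset(i1)] / supports[i2]
            if conf >= minconfidence:
                rules.append((set(i2), set(i1) - i2, conf))
    return rules

# Supports computed by hand from the example database D used later
# in the paper (Figure 1):
supports = {
    frozenset({"B"}): 0.8,
    frozenset({"E"}): 0.8,
    frozenset({"B", "E"}): 0.8,
}
for rule in rules_from_itemset({"B", "E"}, supports, 0.9):
    print(rule)
```

Both B ⇒ E and E ⇒ B qualify here with confidence 0.8/0.8 = 1.0.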
The second subproblem can be solved in main memory in a straightforward manner once all frequent itemsets and their support are known. Hence, the problem of mining association rules is reduced to the problem of finding frequent itemsets. Many algorithms have been proposed in the literature [2, 3, 9, 8, 11, 12, 13]. Although they are very different from each other, they are all based on the Apriori mining method [2]: pruning of the subset lattice for finding frequent itemsets. This relies on the basic properties that all subsets of a frequent itemset are frequent and that all supersets of an infrequent itemset are infrequent. Algorithms based on this approach perform very well for weakly correlated data such as market basket data. However, performances drastically decrease for correlated data such as census data.
In this paper, we propose a new efficient algorithm called Close for mining association rules in very large databases. Close is based on the pruning of the closed itemset lattice, which is a sub-order of the subset lattice, and thus much smaller. Such a structure is closely related to Wille's concept lattice in formal concept analysis [5, 14, 15]. We show that this structure can be used as a formal framework for discovering association rules, given the basic properties that all sub-closed itemsets of a frequent closed itemset are frequent, that all sup-closed itemsets of an infrequent closed itemset are infrequent, and that the set of maximal frequent itemsets is identical to the set of maximal frequent closed itemsets. Empirical evaluations comparing Close to an optimized version of Apriori showed that Close performs reasonably well for weakly correlated data and performs very well for correlated data.
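As a preview of the closure mechanism behind Close (the Galois connection operators are defined formally in Section 3), the closure of an itemset can be computed as the intersection of all transactions containing it; closed itemsets are the fixpoints of this operator. The sketch below is illustrative, not the paper's algorithm:

```python
# Closure of an itemset I: intersection of all transactions containing I.
# Uses the five-transaction example database D of Figure 1.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
     {"B", "E"}, {"A", "B", "C", "E"}]

def closure(itemset, db):
    """Intersect all transactions of db that contain itemset."""
    covering = [t for t in db if itemset <= t]
    out = set(covering[0])
    for t in covering[1:]:
        out &= t
    return frozenset(out)

print(sorted(closure({"A"}, D)))       # A occurs only together with C
print(sorted(closure({"B"}, D)))       # B occurs only together with E
print(sorted(closure({"B", "E"}, D)))  # BE is closed: closure(BE) = BE
```

Here {A} closes to {A, C} and {B} closes to {B, E}, so only the closed itemsets need to be explored, which is why the closed itemset lattice can be much smaller than the subset lattice.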
The rest of the paper is organized as follows. Section 2 reviews related work and exhibits the contribution of the paper. In Section 3, we define the semantics of association rules based on the Galois connection operators. In Section 4, we describe the Close algorithm. Section 5 gives experimental results on synthetic data(1) and census data using the PUMS file for Kansas USA(2), and Section 6 concludes the paper.

(1) http://www.almaden.ibm.com/cs/quest/syndata.html
(2) ftp://ftp2.cc.ukans.edu/pub/ippr/census/pums/pums90ks.zip
TID  Items
 1   A C D
 2   B C E
 3   A B C E
 4   B E
 5   A B C E

Figure 1: The transaction database D
2 Related Work and Contribution
In this section, we first present the subset lattice based approach for mining association rules. Then, we introduce the use of the closed itemset lattice as a formal framework in data mining, and we briefly describe the Close mining method.
2.1 A Common Approach for Mining Association Rules
Finding all frequent itemsets is a nontrivial problem because the number of possible frequent itemsets is exponential in the size of the set of items ℐ of the database. Given ‖ℐ‖ = m, there are possibly 2^m frequent itemsets, which form a lattice of subsets over ℐ with height equal to m.
Consider the example transaction database D given in Figure 1. The lattice of subsets associated with D is represented in Figure 2. This lattice contains 32 itemsets and its height is 6. However, depending on the data and the minsupport value, only a small fraction of the whole lattice space is frequent. For instance, assuming that minsupport is 2 (40%), only 15 itemsets of D are frequent. A naive approach consists of testing the support of every itemset in the lattice, which can be done in a single pass over the database. Clearly, this approach is impractical for large values of m. In the following, we describe the Apriori mining method used by all existing algorithms for finding frequent itemsets. The notation is given in Table 1.
Figure 2: Itemset lattice of D (all 32 subsets of {A, B, C, D, E}, from Ø up to ABCDE; frequent itemsets for minsupport = 2 are distinguished from infrequent ones)
C_k   Set of candidate k-itemsets (potentially frequent itemsets).
      Each element of this set has two fields: i) itemset and ii) support count.
L_k   Set of frequent k-itemsets (itemsets with minimum support).
      Each element of this set has two fields: i) itemset and ii) support count.

Table 1: Notation
Algorithm Apriori

In the Apriori algorithm, items are sorted in lexicographic order. The pseudo-code of the Apriori frequent itemset discovery is given in Algorithm 1. Frequent itemsets are computed iteratively, in the ascending order of their size. The process takes k iterations, where k is the size of the largest frequent itemsets. For each iteration i ≤ k, the database is scanned once and all frequent itemsets of size i are computed. The first iteration computes the set L_1 of frequent 1-itemsets. A subsequent iteration i consists of two phases. First, a set C_i of candidate i-itemsets is created by joining the frequent (i−1)-itemsets in L_{i−1} found in the previous iteration. This phase is realized by the Apriori-Gen function described below. Next, the database is scanned for determining the support of the candidates in C_i, and the frequent i-itemsets are extracted from the candidates. This process is repeated until no more candidates can be generated.
1)  L_1 = {Large 1-itemsets};
2)  for ( k = 2; L_{k−1} ≠ ∅; k++ ) do begin
3)      C_k = Apriori-Gen(L_{k−1});  // Generates candidate k-itemsets
4)      forall transactions t ∈ D do begin
5)          C_t = Subset(C_k, t);  // Candidates contained in t
6)          forall candidates c ∈ C_t do
7)              c.count++;
8)      end
9)      L_k = { c ∈ C_k | c.count ≥ minsupport }
10) end
11) Answer = ∪_k L_k;

Algorithm 1: Apriori frequent itemset discovery
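A compact Python transcription of this loop might look as follows (an illustrative sketch, not the authors' implementation; candidate generation is folded into a set comprehension that mimics the join-and-prune of Apriori-Gen, described next):

```python
# Sketch of Algorithm 1: iterative level-wise search for frequent itemsets.
from itertools import combinations

def apriori(db, minsupport):
    """Return {frozenset: count} of all itemsets with count >= minsupport."""
    items = sorted({i for t in db for i in t})
    # First iteration: count and filter the 1-itemsets (L1).
    counts = {frozenset([i]): sum(1 for t in db if i in t) for i in items}
    lk = {s for s, c in counts.items() if c >= minsupport}
    frequent = {s: counts[s] for s in lk}
    k = 2
    while lk:
        # Join L_{k-1} with itself, keeping unions of size k whose
        # (k-1)-subsets are all frequent (the Apriori pruning property).
        ck = {a | b for a in lk for b in lk
              if len(a | b) == k
              and all(frozenset(s) in lk for s in combinations(a | b, k - 1))}
        # One database pass to count the candidates, then extract L_k.
        counts = {c: sum(1 for t in db if c <= t) for c in ck}
        lk = {c for c, n in counts.items() if n >= minsupport}
        frequent.update({c: counts[c] for c in lk})
        k += 1
    return frequent

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
     {"B", "E"}, {"A", "B", "C", "E"}]
freq = apriori(D, 2)
print(len(freq))  # 15 frequent itemsets, as stated in Section 2.1
```

On the example database D with minsupport 2, this finds exactly the 15 frequent itemsets mentioned earlier.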
Apriori-Gen Candidate Generation

The function takes as argument the set L_{i−1} of frequent (i−1)-itemsets. It returns the set C_i of candidate i-itemsets, which is a superset of the set of all frequent i-itemsets. Two frequent itemsets of size i−1 with the same first i−2 items are joined, generating a new candidate itemset of size i:
insert into C_i
select p.item_1, p.item_2, ..., p.item_{i−1}, q.item_{i−1}
from L_{i−1} p, L_{i−1} q
where p.item_1 = q.item_1, ..., p.item_{i−2} = q.item_{i−2}, p.item_{i−1} < q.item_{i−1};
Then, the candidate set C_i produced is pruned by removing every candidate i-itemset c such that some (i−1)-subset of c is not in L_{i−1}:

forall candidate itemsets c ∈ C_i do begin
    forall (i−1)-subsets s of c do begin
        if ( s ∉ L_{i−1} ) then
            delete c from C_i;
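The join and prune steps can be transcribed directly (an illustrative sketch; the function name `apriori_gen` is assumed, and itemsets are kept as lexicographically sorted tuples as the algorithm requires):

```python
# Sketch of Apriori-Gen: join L_{i-1} with itself on the first i-2 items,
# then prune candidates having an infrequent (i-1)-subset.
from itertools import combinations

def apriori_gen(l_prev):
    """Generate candidate i-itemsets from the frequent (i-1)-itemsets."""
    i = len(next(iter(l_prev))) + 1
    # Join step: p and q agree on their first i-2 items, with
    # p.item_{i-1} < q.item_{i-1} to avoid duplicates.
    candidates = {p + (q[-1],)
                  for p in l_prev for q in l_prev
                  if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: keep c only if every (i-1)-subset of c is in L_{i-1}.
    return {c for c in candidates
            if all(s in l_prev for s in combinations(c, i - 1))}

# L2 over the example database D (minsupport = 2):
l2 = {("A", "B"), ("A", "C"), ("A", "E"),
      ("B", "C"), ("B", "E"), ("C", "E")}
print(sorted(apriori_gen(l2)))  # ABC, ABE, ACE, BCE
```

Starting from L_2 of the example database, the join produces ABC, ABE, ACE and BCE, and none of them is pruned since all their 2-subsets are frequent.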
Example. Figure 3 shows the execution of Apriori for a minimum support of 2 (40%) on the database D. This process takes four iterations, computing four sets of candidates and frequent itemsets and performing four database passes. The frequent itemsets found are outlined in the itemset lattice given in Figure 2.