Top 10 Machine Learning Algorithms: Apriori

4.2 Algorithm Description

Once all frequent itemsets are obtained, it is straightforward to generate association rules with confidence no less than minconf. Apriori and AprioriTid, proposed by R. Agrawal and R. Srikant, are seminal algorithms that are designed to work for a large transaction dataset [3].

4.2.1.1 Apriori

Apriori is an algorithm to find all sets of items (itemsets) that have support no less than minsup. The support for an itemset is the ratio of the number of transactions that contain the itemset to the total number of transactions. Itemsets that satisfy the minimum support constraint are called frequent itemsets. Apriori is characterized as a level-wise complete search (breadth-first search) algorithm that uses the anti-monotonicity property of itemsets: if an itemset is not frequent, none of its supersets is frequent. This is also called the downward closure property.

The algorithm makes multiple passes over the data. In the first pass, the support of individual items is counted and the frequent items are determined. In each subsequent pass, a seed set of itemsets found to be frequent in the previous pass is used to generate new potentially frequent itemsets, called candidate itemsets, and their actual support is counted during the pass over the data. At the end of the pass, those satisfying the minimum support constraint are collected, that is, the frequent itemsets are determined, and they become the seed for the next pass. This process is repeated until no new frequent itemsets are found.

By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The number of items in an itemset is called its size, and an itemset of size k is called a k-itemset. Let the set of frequent itemsets of size k be F_k and their candidates be C_k. Both F_k and C_k maintain a field, support count.

The Apriori algorithm is given in Algorithm 4.1. The first pass simply counts item occurrences to determine the frequent 1-itemsets. A subsequent pass consists of two phases. First, the frequent itemsets F_{k-1} found in the (k-1)-th pass are used to generate the candidate itemsets C_k using the apriori-gen function. Next, the database is scanned and the support of the candidates in C_k is counted; the subset function is used for this counting.

The apriori-gen function takes as argument F_{k-1}, the set of all frequent (k-1)-itemsets, and returns a superset of the set of all frequent k-itemsets. First, in the join step, F_{k-1} is joined with F_{k-1}:

    insert into C_k
    select p.fitemset_1, p.fitemset_2, ..., p.fitemset_{k-1}, q.fitemset_{k-1}
    from F_{k-1} p, F_{k-1} q
    where p.fitemset_1 = q.fitemset_1, ..., p.fitemset_{k-2} = q.fitemset_{k-2},
          p.fitemset_{k-1} < q.fitemset_{k-1}

Here, F_{k-1} p means that the itemset p is a frequent (k-1)-itemset, and p.fitemset_j is the j-th item of the frequent itemset p. Then, in the prune step, all the itemsets c ∈ C_k for which some (k-1)-subset is not in F_{k-1} are deleted.

Algorithm 4.1 Apriori algorithm

    F_1 = {frequent 1-itemsets};
    for (k = 2; F_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(F_{k-1});   // New candidates
        foreach transaction t ∈ D do begin
            C_t = subset(C_k, t);     // Candidates contained in t
            foreach candidate c ∈ C_t do
                c.count++;
        end
        F_k = {c ∈ C_k | c.count ≥ minsup};
    end
    Answer = ∪_k F_k;
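To make the pseudocode concrete, the following is a minimal Python sketch of apriori-gen and the main counting loop. It is a sketch under simplifying assumptions, not a reference implementation: itemsets are represented as sorted tuples, minsup is taken as an absolute transaction count, and candidates are counted by a naive containment test in place of the subset function and hash-tree described below. The names (apriori_gen, apriori) are illustrative.

    from itertools import combinations

    def apriori_gen(F_prev):
        # Join F_{k-1} with itself, then prune candidates that have an
        # infrequent (k-1)-subset (the downward closure property).
        F_prev = set(F_prev)
        candidates = set()
        for p in F_prev:
            for q in F_prev:
                # Join step: first k-2 items equal, last item of p < last of q.
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    c = p + (q[-1],)
                    # Prune step: every (k-1)-subset of c must be frequent.
                    if all(s in F_prev for s in combinations(c, len(c) - 1)):
                        candidates.add(c)
        return candidates

    def apriori(D, minsup):
        # D: list of transactions, each a sorted tuple of items.
        # minsup: minimum support given as an absolute transaction count.
        counts = {}
        for t in D:                      # first pass: count individual items
            for i in t:
                counts[(i,)] = counts.get((i,), 0) + 1
        F = {c for c, n in counts.items() if n >= minsup}
        answer = set(F)
        while F:
            Ck = apriori_gen(F)          # new candidates from the seed F
            counts = {c: 0 for c in Ck}
            for t in D:                  # one pass over the data
                items = set(t)
                for c in Ck:             # naive stand-in for subset(Ck, t)
                    if items.issuperset(c):
                        counts[c] += 1
            F = {c for c, n in counts.items() if n >= minsup}
            answer |= F
        return answer

For example, apriori([(1, 2, 3), (1, 2), (2, 3)], 2) returns {(1,), (2,), (3,), (1, 2), (2, 3)}: the candidate (1, 3) is generated but fails the support test, and no 3-itemset candidate survives the join.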
The subset function takes as arguments C_k and a transaction t, and returns all the candidate itemsets contained in the transaction t. For fast counting, Apriori adopts a hash-tree to store the candidate itemsets C_k. Itemsets are stored in the leaves. Every node is initially a leaf node, and the depth of the root node is defined to be 1. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node. An interior node at depth d points to nodes at depth d + 1; which branch to follow is decided by applying a hash function to the d-th item of the itemset. Thus each leaf node is ensured to contain at most a certain number of itemsets (to be precise, this is true only when the creation of an interior node takes place at a depth d smaller than k), and an itemset in a leaf node can be reached by successively hashing each item in the itemset in sequence from the root.

Once the hash-tree is constructed, the subset function finds all the candidates contained in a transaction t, starting from the root node. At the root node, every item in t is hashed, and each branch determined is followed one depth down. If a leaf node is reached, the itemsets in the leaf that are contained in the transaction t are searched for, and those found are added to the answer set. If an interior node is reached by hashing the item i, the items that come after i in t are hashed recursively until a leaf node is reached. It is evident that itemsets in the leaves that are never reached are not contained in t.

Clearly, any subset of a frequent itemset satisfies the minimum support constraint. The join operation is equivalent to extending F_{k-1} with each item in the database and then deleting those itemsets for which the (k-1)-itemset obtained by deleting the (k-1)-th item is not in F_{k-1}. The condition p.fitemset_{k-1} < q.fitemset_{k-1} ensures that no duplicate is generated. The prune step, where all the itemsets whose (k-1)-subsets are not in F_{k-1} are deleted from C_k, does not delete any itemset that could be in F_k. Thus, C_k ⊇ F_k, and the Apriori algorithm is correct.

The remaining task is to generate the desired association rules from the frequent itemsets. A straightforward algorithm for this task is as follows. To generate rules, all nonempty subsets of every frequent itemset f are enumerated, and for every such subset a, a rule of the form a ⇒ (f − a) is generated if the ratio of support(f) to support(a) is at least minconf. Here, note that the confidence of the rule a' ⇒ (f − a') cannot be larger than the confidence of a ⇒ (f − a) for any a' ⊂ a. This in turn means that for a rule (f − a) ⇒ a to hold, all rules of the form (f − a') ⇒ a' with a' ⊆ a must also hold. Using this property, the algorithm to generate association rules is given in Algorithm 4.2.

Algorithm 4.2 Association rule generation algorithm

    H_1 = ∅;   // Initialize
    foreach frequent k-itemset f_k, k ≥ 2 do begin
        A = {(k-1)-itemsets a_{k-1} such that a_{k-1} ⊂ f_k};
        foreach a_{k-1} ∈ A do begin
            conf = support(f_k) / support(a_{k-1});
            if (conf ≥ minconf) then begin
                output the rule a_{k-1} ⇒ (f_k − a_{k-1})
                    with confidence = conf and support = support(f_k);
                add (f_k − a_{k-1}) to H_1;
            end
        end
        call ap-genrules(f_k, H_1);
    end

    Procedure ap-genrules(f_k: frequent k-itemset, H_m: set of m-item consequents)
    if (k > m + 1) then begin
        H_{m+1} = apriori-gen(H_m);
        foreach h_{m+1} ∈ H_{m+1} do begin
            conf = support(f_k) / support(f_k − h_{m+1});
            if (conf ≥ minconf) then
                output the rule (f_k − h_{m+1}) ⇒ h_{m+1}
                    with confidence = conf and support = support(f_k);
            else
                delete h_{m+1} from H_{m+1};
        end
        call ap-genrules(f_k, H_{m+1});
    end

Apriori achieves good performance by reducing the size of candidate sets. However, in situations with very many frequent itemsets or a very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and of scanning the database repeatedly to check a large set of candidate itemsets.
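The following Python sketch implements the straightforward version of rule generation described above, enumerating every nonempty proper subset a of each frequent itemset f directly rather than through the ap-genrules recursion. It assumes supports is a dict mapping every frequent itemset (as a sorted tuple) to its support; since every subset of a frequent itemset is itself frequent, support(a) is always available in that dict. The names are illustrative.

    from itertools import combinations

    def generate_rules(supports, minconf):
        # supports: dict mapping each frequent itemset (sorted tuple)
        # to its support. Returns (antecedent, consequent, conf, supp).
        rules = []
        for f, supp_f in supports.items():
            if len(f) < 2:
                continue
            # Enumerate all nonempty proper subsets a of f and output
            # a => (f - a) when support(f) / support(a) >= minconf.
            for r in range(1, len(f)):
                for a in combinations(f, r):
                    conf = supp_f / supports[a]
                    if conf >= minconf:
                        consequent = tuple(i for i in f if i not in a)
                        rules.append((a, consequent, conf, supp_f))
        return rules

Algorithm 4.2 improves on this by growing consequents with apriori-gen, so that once a consequent fails the confidence test, none of its supersets is tried.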
4.2.1.2 AprioriTid

AprioriTid is a variation of Apriori. It does not reduce the number of candidates, but it does not use the database D for counting support after the first pass. Instead, it uses a new dataset C̄_k. Each member of the set C̄_k is of the form <TID, {X_k}>, where each X_k is the identifier of a potentially frequent k-itemset present in the transaction with identifier TID. The case k = 1 is special: C̄_1 corresponds to the database D, although conceptually each item i is replaced by the itemset {i}. The member of C̄_k corresponding to a transaction t is <t.TID, {c ∈ C_k | c contained in t}>.

The intuition for using C̄_k is that it will be smaller than the database D for large k, because some transactions may not contain any candidate k-itemset, in which case C̄_k has no entry for that transaction, or because very few candidates may be contained in a transaction, so that each entry may be smaller than the number of items in the corresponding transaction.

The AprioriTid algorithm is given in Algorithm 4.3. Here, c[i] represents the i-th item in the k-itemset c.

Algorithm 4.3 AprioriTid algorithm

    F_1 = {frequent 1-itemsets};
    C̄_1 = database D;
    for (k = 2; F_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(F_{k-1});   // New candidates
        C̄_k = ∅;
        foreach entry t ∈ C̄_{k-1} do begin
            // determine candidate itemsets in C_k contained
            // in the transaction with identifier t.TID
            C_t = {c ∈ C_k | (c − c[k]) ∈ t.set-of-itemsets ∧
                             (c − c[k-1]) ∈ t.set-of-itemsets};
            foreach candidate c ∈ C_t do
                c.count++;
            if (C_t ≠ ∅) then C̄_k += <t.TID, C_t>;
        end
        F_k = {c ∈ C_k | c.count ≥ minsup};
    end
    Answer = ∪_k F_k;

Each C̄_k is stored in a sequential structure. A candidate k-itemset c_k in C_k maintains two additional fields, generator and extensions, in addition to the support count field. The generator field stores the IDs of the two frequent (k-1)-itemsets whose join generated c_k. The extensions field stores the IDs of all the (k+1)-candidates that are extensions of c_k. When a candidate c_k is generated by joining f1_{k-1} and f2_{k-1}, their IDs are saved in the generator field of c_k, and the ID of c_k is added to the extensions field of f1_{k-1}. The t.set-of-itemsets field of an entry t in C̄_{k-1} gives the IDs of all (k-1)-candidates contained in the transaction t.TID. For each such candidate c_{k-1}, the extensions field gives T_k, the set of IDs of all the candidate k-itemsets that are extensions of c_{k-1}. For each c_k in T_k, the generator field gives the IDs of the two itemsets that generated c_k. If these itemsets are present in the entry's t.set-of-itemsets, it is concluded that c_k is present in the transaction t.TID, and c_k is added to C_t.

AprioriTid has the overhead of computing C̄_k, but the advantage that C̄_k can be stored in memory when k is large. It is thus expected that Apriori beats AprioriTid in the earlier passes (small k) and AprioriTid beats Apriori in the later passes (large k). Since both Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets, it is possible to use the two algorithms in combination, in sequence. AprioriHybrid uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C̄_k at the end of the pass will fit in memory.
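A minimal Python sketch of a single AprioriTid pass follows. It keeps C̄_k as a list of (TID, set of itemsets) pairs and, instead of the generator/extensions ID bookkeeping described above, tests directly whether both (k-1)-itemsets that would generate a candidate are present in the entry; this is the same containment condition as in Algorithm 4.3. Itemsets are sorted tuples and all names are illustrative.

    def aprioritid_pass(Ck, C_bar_prev):
        # Ck: candidate k-itemsets (sorted tuples) from apriori-gen.
        # C_bar_prev: list of (tid, set of (k-1)-itemsets) entries.
        counts = {c: 0 for c in Ck}
        C_bar = []
        for tid, itemsets in C_bar_prev:
            # c is contained in transaction tid iff both of its generating
            # (k-1)-itemsets -- c minus its last item and c minus its
            # second-to-last item -- were contained in that transaction.
            Ct = {c for c in Ck
                  if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
            for c in Ct:
                counts[c] += 1
            if Ct:                  # entries with no candidate are dropped
                C_bar.append((tid, Ct))
        return counts, C_bar

Here C̄_1 would be built from D by replacing each transaction with its set of 1-itemsets, matching the convention above, and each pass shrinks C̄_k as transactions stop contributing candidates.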
4.2.2 Mining Sequential Patterns

Agrawal and Srikant extended the Apriori algorithm to the problem of sequential pattern mining [6]. In Apriori there is no notion of sequence, and thus the problem of finding which items appear together can be viewed as finding intratransaction patterns. Here sequence matters, and the problem of finding sequential patterns can be viewed as finding intertransaction patterns.

Each transaction consists of a sequence-id, a transaction-time, and a set of items. The same sequence-id has no more than one transaction with the same transaction-time. A sequence is an ordered list of itemsets. Thus, a sequence consists of a list of sets of characters (items), rather than being simply a list of characters. The length of a sequence is the number of itemsets in the sequence, and a sequence of length k is called a k-sequence. Without loss of generality, the set of items is assumed to be mapped to a set of contiguous integers, and an itemset i is denoted by (i_1 i_2 ... i_m), where i_j is an item. A sequence s is denoted by <s_1 s_2 ... s_n>. A sequence <a_1 a_2 ... a_n> is contained in another sequence <b_1 b_2 ... b_m> (n ≤ m) if there exist integers i_1 < i_2 < ... < i_n such that a_1 ⊆ b_{i_1}, a_2 ⊆ b_{i_2}, ..., a_n ⊆ b_{i_n}. All the transactions with the same sequence-id, sorted by transaction-time, together form a sequence (the transaction sequence). A sequence-id supports a sequence s if s is contained in its transaction sequence. The support for a sequence is defined as the fraction of the total number of sequence-ids that support this sequence. Likewise, the support for an itemset i is defined as the fraction of sequence-ids that have the items in i in any one of their transactions. Note that this definition is different from that used in Apriori. Thus the itemset i and the 1-sequence <i> have the same support.

Given a transaction database D, the problem of mining sequential patterns is to find the maximal* sequences among all sequences that satisfy a certain user-specified minimum support constraint. Each such maximal sequence represents a sequential pattern. A sequence satisfying the minimum support constraint is called a frequent sequence (not necessarily maximal), and an itemset satisfying the minimum support constraint is called a frequent itemset, or fitemset for short. Any frequent sequence must be a list of fitemsets.

* Later, R. Agrawal and R. Srikant removed this constraint in their generalized sequential patterns (GSP) [32].

The algorithm consists of five phases: (1) sort phase, (2) fitemset phase, (3) transformation phase, (4) sequence phase, and (5) maximal phase. The first three are preprocessing phases and the last one is a postprocessing phase.

In the sort phase, the database D is sorted with sequence-id as the major key and transaction-time as the minor key. In the fitemset phase, the set of all fitemsets is obtained using the Apriori algorithm with the corresponding modification of support counting, and is mapped to a set of contiguous integers. This makes it possible to compare two fitemsets for equality in constant time. Note that the set of all frequent 1-sequences is found simultaneously in this phase. In the transformation phase, each transaction is replaced by the set of all fitemsets contained in that transaction. If a transaction does not contain any fitemset, it is not retained in the transformed sequence. If a transaction sequence does not contain any fitemset, this sequence is removed from the transformed database, but it is still used in counting the total number of sequence-ids. After the transformation, a transaction sequence is represented by a list of sets of fitemsets. Each set of fitemsets is represented by {f_1, f_2, ..., f_n}, where f_j is an fitemset. This transformation is designed for efficiently testing which given frequent sequences are contained in a transaction sequence.
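The containment test just defined reduces to a greedy scan: for each element of the candidate sequence, take the earliest not-yet-used element of the transaction sequence that is a superset of it. A minimal Python sketch, with sequence elements represented as Python sets and illustrative names:

    def contains(seq_a, seq_b):
        # True if <a_1 ... a_n> is contained in <b_1 ... b_m>, i.e.,
        # there exist indices i_1 < i_2 < ... < i_n with a_j a subset
        # of b_{i_j} for every j. Elements are sets of items.
        j = 0
        for a in seq_a:
            # advance through seq_b until some element contains a
            while j < len(seq_b) and not a <= seq_b[j]:
                j += 1
            if j == len(seq_b):
                return False
            j += 1    # the next a must match a strictly later element
        return True

For example, contains([{3}, {4, 5}], [{7}, {3, 8}, {9}, {4, 5, 6}]) is True, while contains([{3}, {5}], [{3, 5}]) is False, since the two elements must come from distinct positions in the containing sequence.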
The transformed database is denoted as D_T.

The sequence phase is the main part, where the frequent sequences are to be enumerated. Two families of algorithms are proposed: count-all and count-some. They differ in the way the frequent sequences are counted. A count-all algorithm counts all the frequent sequences, including nonmaximal sequences that must be pruned later, whereas a count-some algorithm avoids counting sequences that are contained in a longer sequence, because the final goal is to obtain only maximal sequences. Agrawal and Srikant developed one count-all algorithm called AprioriAll and two count-some algorithms called AprioriSome and DynamicSome. Here, only AprioriAll is explained due to space limitations.

In the last maximal phase, maximal sequences are extracted from the set of all frequent sequences. A hash-tree (similar to the one used in the subset function in Apriori) is used to quickly find all subsequences of a given sequence.

4.2.2.1 AprioriAll

The algorithm is given in Algorithm 4.4. In each pass, the frequent sequences from the previous pass are used to generate the candidate sequences, and then their support is measured by making a pass over the database. At the end of the pass, the support of the candidates is used to determine the frequent sequences.

The apriori-gen-2 function takes as argument F_{k-1}, the set of all frequent (k-1)-sequences. First, the join operation is performed as

    insert into C_k
    select p.fitemset_1, p.fitemset_2, ..., p.fitemset_{k-1}, q.fitemset_{k-1}
    from F_{k-1} p, F_{k-1} q
    where p.fitemset_1 = q.fitemset_1, ..., p.fitemset_{k-2} = q.fitemset_{k-2}

Then, all the sequences c ∈ C_k for which some (k-1)-subsequence is not in F_{k-1} are deleted.

Algorithm 4.4 AprioriAll algorithm

    F_1 = {frequent 1-sequences};   // Result of the fitemset phase
    for (k = 2; F_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen-2(F_{k-1});   // New candidate sequences
        foreach transaction sequence t ∈ D_T do begin
            C_t = subseq(C_k, t);   // Candidate sequences contained in t
            foreach candidate c ∈ C_t do
                c.count++;
        end
        F_k = {c ∈ C_k | c.count ≥ minsup};
    end
    Answer = maximal sequences in ∪_k F_k;

The subseq function is similar to the subset function in Apriori. As in Apriori, the candidate sequences C_k are stored in a hash-tree to quickly find all candidates contained in a transaction sequence. Note that the transformed transaction sequence is a list of sets of fitemsets, that all the fitemsets in one set have the same transaction-time, and that no more than one transaction with the same transaction-time is allowed for the same sequence-id. This constraint has to be imposed in the subseq function.
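A minimal Python sketch of apriori-gen-2, representing sequences as tuples of fitemset IDs (the contiguous integers produced by the fitemset phase). Unlike apriori-gen, the join keeps both orders of the two last elements, since <x y> and <y x> are different sequences; the sketch also assumes the self-join is allowed, so that candidates with repeated elements such as <x x> are generated. The names are illustrative.

    from itertools import combinations

    def apriori_gen_2(F_prev):
        # F_prev: set of frequent (k-1)-sequences, each a tuple of
        # fitemset IDs.
        candidates = set()
        for p in F_prev:
            for q in F_prev:
                # Join step: first k-2 elements equal; both orders of the
                # last elements are kept because element order matters,
                # and p == q yields repeats such as <x x>.
                if p[:-1] == q[:-1]:
                    candidates.add(p + (q[-1],))
        # Prune step: every (k-1)-subsequence (obtained by deleting one
        # element) must itself be frequent.
        return {c for c in candidates
                if all(s in F_prev for s in combinations(c, len(c) - 1))}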
4.2.3 Discussion

Both Apriori and AprioriTid need minsup and minconf to be specified in advance. The algorithms have to be rerun each time these values are changed, throwing away everything that was obtained in previous runs. If no appropriate values for these thresholds are known in advance and we want to know how the results change with these values without rerunning the algorithms, the best we can do is to generate and count only those itemsets that appear at least once in the database, without duplication, and store them all in an efficient way. Note that Apriori generates candidates that do not exist in the database.

Apriori and AprioriTid use a hash-tree to store the candidate itemsets. Another data structure that is often used is a trie structure [35, 9]. Each node at depth k of the trie corresponds to a candidate k-itemset and stores the k-th item and the support of the itemset. As two frequent k-itemsets that share the first k-1 items are siblings below their parent node at depth k-1 in the trie, candidate generation simply joins the two siblings and extends the tree one more depth below the first frequent k-itemset, after pruning. In order to find the candidate k-itemsets that are contained in a transaction t, each item in the transaction is fed from the root node, and a branch is followed according to the succeeding item until a k-th item is reached. Many practical implementations of Apriori use this trie structure to store not only candidates but also transactions [10, 9].
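A minimal sketch of such a trie used for support counting. Each node stores its item and a support counter, and a root-to-node path at depth k spells out a candidate k-itemset (one trie per pass k); counting feeds the items of a sorted transaction from each node, as described above. This illustrates the idea only and is not a rendering of the cited implementations [35, 9].

    class TrieNode:
        def __init__(self, item=None):
            self.item = item       # k-th item of the itemset ending here
            self.support = 0       # support count of that itemset
            self.children = {}     # next item -> child node, one level down

    def insert(root, itemset):
        # Store a candidate itemset (sorted tuple) as a root-to-node path.
        node = root
        for item in itemset:
            node = node.children.setdefault(item, TrieNode(item))
        return node

    def count(node, t, depth, k):
        # Walk the trie with the (sorted) remainder of transaction t; every
        # node reached at depth k is a candidate k-itemset contained in t.
        if depth == k:
            node.support += 1
            return
        for i, item in enumerate(t):
            child = node.children.get(item)
            if child is not None:
                count(child, t[i + 1:], depth + 1, k)

Calling count(root, t, 0, k) once per transaction t increments the support of exactly the candidates contained in t, since a candidate's sorted items can be matched against a sorted transaction in only one way.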
If we go a step further, we can get rid of generating candidate itemsets altogether. Further, it is not necessary to enumerate all the frequent itemsets. These topics are discussed in Section 4.5.

Apriori and almost all other association rule miners use a two-phase strategy: first mine the frequent patterns, and then generate the association rules. This is not the sole way. Webb's MagnumOpus uses another strategy that immediately generates a large subset of all association rules [38].

There are direct extensions of the original Apriori family. The use of taxonomies and the incorporation of temporal constraints are two examples. Generalized association rules [30] employ a set of user-specified taxonomies, which makes it possible to extract frequent itemsets that are expressed by higher-level concepts even when use of the base-level concepts produces only infrequent itemsets. The basic algorithm is to add all ancestors of each item in a transaction to the transaction and then run the Apriori algorithm. Several optimizations can be added to improve efficiency, one example being that the support for an itemset X that contains both an item x and its ancestor x̂ is the same as the support of the itemset X − x̂, so X need not be counted.

Generalized sequential patterns [32] add, besides the introduction of taxonomies, time constraints that specify a minimum and/or maximum time period between adjacent elements (itemsets) in a pattern, and relax the restriction that the items in an element of a sequential pattern must come from the same transaction, by allowing the items to be present in a set of transactions of the same sequence-id whose transaction-times are within a user-specified time window. GSP also finds all frequent sequential patterns (not limited to maximal sequential patterns). The GSP algorithm runs about 20 times faster than AprioriAll, one reason being that GSP counts fewer candidates than AprioriAll.

4.3 Discussion on Available Software Implementations

There are many available implementations of Apriori, ranging from free software to commercial products. Here, we present only three well-known implementations that are freely downloadable via the Internet.

The first one is the implementation embedded in the most famous open-source machine learning and data mining toolkit, Weka, provided by the University of Waikato [40]. Apriori in Weka can be used through Weka's common graphical user interface together with the many other algorithms that are available in Weka. The implementation includes Weka's own extensions. For example, minsup is iteratively decreased from an upper bound u_minsup to a lower bound l_minsup with an interval δ_minsup. Further, in addition to confidence, the metrics lift, leverage, and conviction are available to evaluate association rules. Lift and leverage are discussed in Section 4.5. Conviction [11] is a metric that was proposed to measure the departure from independence of an association rule, taking implication into account. When using one of these metrics, its minimal value has to be given as a threshold.
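As a reference point, the following sketch computes these metrics from supports given as fractions. The formulas are the standard definitions (lift = supp(A∪B) / (supp(A) · supp(B)), leverage = supp(A∪B) − supp(A) · supp(B), conviction = (1 − supp(B)) / (1 − confidence)) rather than the chapter's own exposition, which defers lift and leverage to Section 4.5; the function name is illustrative.

    def rule_metrics(supp_a, supp_b, supp_ab):
        # Supports are fractions of the total transaction count.
        # Returns confidence, lift, leverage, and conviction of A => B.
        conf = supp_ab / supp_a
        lift = supp_ab / (supp_a * supp_b)
        leverage = supp_ab - supp_a * supp_b
        # Conviction diverges for a rule that never fails (conf = 1).
        conviction = (1 - supp_b) / (1 - conf) if conf < 1 else float("inf")
        return conf, lift, leverage, conviction

Lift and conviction equal 1, and leverage equals 0, exactly when A and B are statistically independent, so each measures a different kind of departure from independence.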
