Top Ten Machine Learning Algorithms: Apriori

4.2 Algorithm Description

Once frequent itemsets are obtained, it is straightforward to generate association rules with confidence no less than minconf. Apriori and AprioriTid, proposed by R. Agrawal and R. Srikant, are seminal algorithms designed to work on a large transaction dataset [3].

4.2.1.1 Apriori

Apriori is an algorithm to find all sets of items (itemsets) that have support no less than minsup. The support for an itemset is the ratio of the number of transactions that contain the itemset to the total number of transactions. Itemsets that satisfy the minimum support constraint are called frequent itemsets. Apriori is characterized as a level-wise complete search (breadth-first search) algorithm that exploits the anti-monotonicity property of itemsets: if an itemset is not frequent, none of its supersets can be frequent. This is also called the downward closure property.

The algorithm makes multiple passes over the data. In the first pass, the support of individual items is counted and the frequent items are determined. In each subsequent pass, a seed set of itemsets found to be frequent in the previous pass is used to generate new potentially frequent itemsets, called candidate itemsets, and their actual support is counted during the pass over the data. At the end of the pass, those satisfying the minimum support constraint are collected, that is, the frequent itemsets are determined, and they become the seed for the next pass. This process is repeated until no new frequent itemsets are found.

By convention, Apriori assumes that the items within a transaction or itemset are sorted in lexicographic order. The number of items in an itemset is called its size, and an itemset of size k is called a k-itemset. Let the set of frequent itemsets of size k be F_k and the set of their candidates be C_k. Both F_k and C_k maintain a field, the support count.

The Apriori algorithm is given in Algorithm 4.1. The first pass simply counts item occurrences to determine the frequent 1-itemsets. Each subsequent pass consists of two phases.
First, the frequent itemsets F_{k-1} found in the (k-1)th pass are used to generate the candidate itemsets C_k using the apriori-gen function. Next, the database is scanned and the support of the candidates in C_k is counted; the subset function is used for this counting.

The apriori-gen function takes as argument F_{k-1}, the set of all frequent (k-1)-itemsets, and returns a superset of the set of all frequent k-itemsets. First, in the join step, F_{k-1} is joined with F_{k-1}:

  insert into C_k
  select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
  from F_{k-1} p, F_{k-1} q
  where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

Here, p in F_{k-1} means that the itemset p is a frequent (k-1)-itemset, and p.item_k is the kth item of the frequent itemset p. Then, in the prune step, all the itemsets c in C_k for which some (k-1)-subset is not in F_{k-1} are deleted.

© 2009 by Taylor & Francis Group, LLC

Algorithm 4.1 Apriori algorithm

  F_1 = {frequent 1-itemsets};
  for (k = 2; F_{k-1} != {}; k++) do begin
    C_k = apriori-gen(F_{k-1});      // new candidates
    foreach transaction t in D do begin
      C_t = subset(C_k, t);          // candidates contained in t
      foreach candidate c in C_t do
        c.count++;
    end
    F_k = {c in C_k | c.count >= minsup};
  end
  Answer = union over k of F_k;

The subset function takes as arguments C_k and a transaction t, and returns all the candidate itemsets contained in the transaction t. For fast counting, Apriori adopts a hash tree to store the candidate itemsets C_k. Itemsets are stored in the leaves. Every node is initially a leaf node, and the depth of the root node is defined to be 1. When the number of itemsets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node. An interior node at depth d points to nodes at depth d+1.
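The level-wise loop of Algorithm 4.1, together with the join and prune steps of apriori-gen, can be sketched in Python as follows. This is a minimal illustration (all function names are ours, not from the original paper), and it counts candidates with a naive subset test in place of the hash tree described below.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join F_{k-1} with itself, then prune candidates having an infrequent (k-1)-subset."""
    prev = set(prev_frequent)
    candidates = set()
    for p in prev:
        for q in prev:
            # join step: first k-2 items equal, last item of p precedes last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune step: every (k-1)-subset must itself be frequent
                if all(sub in prev for sub in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, minsup_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    # first pass: count individual items
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    answer = dict(frequent)
    k = 2
    while frequent:
        candidates = apriori_gen(frequent, k)
        counts = {c: 0 for c in candidates}
        for t in transactions:
            ts = set(t)
            for c in candidates:          # naive stand-in for subset(C_k, t)
                if ts.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        answer.update(frequent)
        k += 1
    return answer
```

With the transactions [{1,2,3}, {1,2}, {2,3}, {1,3}] and a minimum support count of 2, the sketch finds every item and every pair frequent, but no triple.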
Which branch to follow is decided by applying a hash function to the dth item of the itemset. Thus, each leaf node is ensured to contain at most a certain number of itemsets (to be precise, this is true only when the creation of an interior node takes place at a depth d smaller than k), and an itemset in a leaf node can be reached by successively hashing each item of the itemset in sequence from the root. Once the hash tree is constructed, the subset function finds all the candidates contained in a transaction t, starting from the root node. At the root node, every item in t is hashed, and each branch so determined is followed one depth down. If a leaf node is reached, the itemsets in the leaf are checked against the transaction t, and those contained in t are added to the answer set. If an interior node is reached by hashing the item i, the items that come after i in t are hashed recursively until a leaf node is reached. It is evident that the itemsets in the leaves that are never reached are not contained in t.

Clearly, any subset of a frequent itemset satisfies the minimum support constraint. The join operation is equivalent to extending F_{k-1} with each item in the database and then deleting those itemsets for which the (k-1)-itemset obtained by deleting the (k-1)th item is not in F_{k-1}. The condition p.item_{k-1} < q.item_{k-1} ensures that no duplicate is generated. The prune step, in which all the itemsets whose (k-1)-subsets are not in F_{k-1} are deleted from C_k, does not delete any itemset that could be in F_k. Thus, C_k is a superset of F_k, and the Apriori algorithm is correct.

The remaining task is to generate the desired association rules from the frequent itemsets. A straightforward algorithm for this task is as follows. To generate rules, all nonempty subsets of every frequent itemset f are enumerated, and for every such subset a, a rule of the form a => (f - a) is generated if the ratio of support(f) to support(a) is at least minconf.
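The hash tree and its subset function can be sketched as below. This is our own simplified construction: items themselves serve as hash keys, depths are 0-indexed rather than starting at 1, and the leaf-splitting threshold is a module-level constant; the book's implementation details may differ.

```python
LEAF_MAX = 2  # leaf capacity before the leaf is converted into an interior node

class HashTreeNode:
    def __init__(self):
        self.itemsets = []    # populated while the node is a leaf
        self.children = None  # dict item -> child node once the node is interior

def insert(node, itemset, depth=0):
    """Insert a sorted candidate itemset, splitting overfull leaves."""
    if node.children is not None:                  # interior: hash the item at this depth
        child = node.children.setdefault(itemset[depth], HashTreeNode())
        insert(child, itemset, depth + 1)
        return
    node.itemsets.append(itemset)
    if len(node.itemsets) > LEAF_MAX and depth < len(itemset):
        stored, node.itemsets, node.children = node.itemsets, [], {}
        for c in stored:                           # redistribute by the item at this depth
            child = node.children.setdefault(c[depth], HashTreeNode())
            insert(child, c, depth + 1)

def subset(node, t, start=0, found=None):
    """Collect all stored candidates contained in the sorted transaction tuple t."""
    if found is None:
        found = []
    if node.children is None:                      # leaf: check each itemset directly
        found.extend(c for c in node.itemsets if set(c) <= set(t))
    else:                                          # interior: hash each remaining item
        for i in range(start, len(t)):
            child = node.children.get(t[i])
            if child is not None:
                subset(child, t, i + 1, found)
    return found
```

For instance, after inserting the 2-itemset candidates (1,2), (1,3), (2,3), (1,4), (3,4), querying with the transaction (1,2,3) reaches only the leaves holding (1,2), (1,3), and (2,3).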
Here, note that the confidence of the rule a' => (f - a') cannot be larger than the confidence of a => (f - a) for any a' that is a subset of a. This in turn means that for a rule (f - a) => a to hold, all rules of the form (f - a') => a' must also hold. Using this property, the algorithm to generate association rules is given in Algorithm 4.2.

Algorithm 4.2 Association rule generation algorithm

  H_1 = {};    // initialize
  foreach frequent k-itemset f_k, k >= 2 do begin
    A = {(k-1)-itemsets a_{k-1} such that a_{k-1} is a subset of f_k};
    foreach a_{k-1} in A do begin
      conf = support(f_k) / support(a_{k-1});
      if (conf >= minconf) then begin
        output the rule a_{k-1} => (f_k - a_{k-1})
          with confidence = conf and support = support(f_k);
        add (f_k - a_{k-1}) to H_1;
      end
    end
    call ap-genrules(f_k, H_1);
  end

  Procedure ap-genrules(f_k: frequent k-itemset, H_m: set of m-item consequents)
  if (k > m + 1) then begin
    H_{m+1} = apriori-gen(H_m);
    foreach h_{m+1} in H_{m+1} do begin
      conf = support(f_k) / support(f_k - h_{m+1});
      if (conf >= minconf) then
        output the rule (f_k - h_{m+1}) => h_{m+1}
          with confidence = conf and support = support(f_k);
      else
        delete h_{m+1} from H_{m+1};
    end
    call ap-genrules(f_k, H_{m+1});
  end

Apriori achieves good performance by reducing the size of the candidate sets. However, in situations with very many frequent itemsets or a very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and of scanning the database repeatedly to check a large set of candidate itemsets.

4.2.1.2 AprioriTid

AprioriTid is a variation of Apriori. It does not reduce the number of candidates, but it does not use the database D for counting support after the first pass. Instead, it uses a new dataset C̄_k. Each member of the set C̄_k is of the form <TID, {X_k}>, where each X_k is the identifier of a potentially frequent k-itemset present in the transaction with identifier TID, except for k = 1. For k = 1, C̄_1 corresponds to the database D, although conceptually each item i is replaced by the itemset {i}.
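The straightforward rule-generation scheme described above (enumerate every nonempty proper subset a of a frequent itemset f and keep a => (f - a) when support(f)/support(a) >= minconf) can be sketched as follows; the support dictionary is assumed to come from a prior Apriori run, and the function name is ours.

```python
from itertools import combinations

def generate_rules(support, minconf):
    """Return (antecedent, consequent, confidence) triples from a
    {itemset (sorted tuple): support count} dictionary."""
    rules = []
    for f, supp_f in support.items():
        if len(f) < 2:
            continue
        # every nonempty proper subset of f is a candidate antecedent
        for r in range(1, len(f)):
            for a in combinations(f, r):
                conf = supp_f / support[a]  # subsets of a frequent itemset are frequent
                if conf >= minconf:
                    rules.append((a, tuple(sorted(set(f) - set(a))), conf))
    return rules
```

Algorithm 4.2 improves on this brute-force enumeration by growing consequents level-wise with apriori-gen, pruning a consequent as soon as its confidence falls below minconf.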
The member of C̄_k corresponding to a transaction t is <t.TID, {c in C_k | c is contained in t}>. The intuition for using C̄_k is that it will be smaller than the database D for large k, because some transactions may not contain any candidate k-itemset, in which case C̄_k has no entry for that transaction, or because very few candidates may be contained in a transaction, so that each entry may be smaller than the number of items in the corresponding transaction. The AprioriTid algorithm is given in Algorithm 4.3. Here, c[i] represents the ith item in the k-itemset c.

Algorithm 4.3 AprioriTid algorithm

  F_1 = {frequent 1-itemsets};
  C̄_1 = database D;
  for (k = 2; F_{k-1} != {}; k++) do begin
    C_k = apriori-gen(F_{k-1});    // new candidates
    C̄_k = {};
    foreach entry t in C̄_{k-1} do begin
      // determine candidate itemsets in C_k contained
      // in the transaction with identifier t.TID
      C_t = {c in C_k | (c - c[k]) in t.set-of-itemsets and (c - c[k-1]) in t.set-of-itemsets};
      foreach candidate c in C_t do
        c.count++;
      if (C_t != {}) then C̄_k += <t.TID, C_t>;
    end
    F_k = {c in C_k | c.count >= minsup};
  end
  Answer = union over k of F_k;

Each C̄_k is stored in a sequential structure. A candidate k-itemset c_k in C_k maintains two additional fields, generator and extensions, in addition to the support count field. The generator field stores the IDs of the two frequent (k-1)-itemsets whose join generated c_k. The extensions field stores the IDs of all the (k+1)-candidates that are extensions of c_k. When a candidate c_k is generated by joining f1_{k-1} and f2_{k-1}, their IDs are saved in the generator field of c_k, and the ID of c_k is added to the extensions field of f1_{k-1}. The t.set-of-itemsets field of an entry t in C̄_{k-1} gives the IDs of all the (k-1)-candidates contained in the transaction t.TID. For each such candidate c_{k-1}, the extensions field gives T_k, the set of IDs of all the candidate k-itemsets that are extensions of c_{k-1}. For each c_k in T_k, the generator field gives the IDs of the two itemsets that generated c_k. If these itemsets are present in the entry for t.set-of-itemsets, it is concluded that c_k is present in the transaction t.TID, and c_k is added to C_t.

AprioriTid has the overhead of computing C̄_k, but the advantage that C̄_k can be stored in memory when k is large. It is thus expected that Apriori beats AprioriTid in the earlier passes (small k) and AprioriTid beats Apriori in the later passes (large k). Since both Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets, it is possible to use the two algorithms in combination, in sequence. AprioriHybrid uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C̄_k at the end of the pass will fit in memory.

4.2.2 Mining Sequential Patterns

Agrawal and Srikant extended the Apriori algorithm to the problem of sequential pattern mining [6]. In Apriori there is no notion of sequence, and thus the problem of finding which items appear together can be viewed as finding intra-transaction patterns. Here, sequence matters, and the problem of finding sequential patterns can be viewed as finding inter-transaction patterns.

Each transaction consists of a sequence-id, a transaction-time, and a set of items. The same sequence-id has no more than one transaction with the same transaction-time. A sequence is an ordered list of itemsets. Thus, a sequence consists of a list of sets of characters (items), rather than being simply a list of characters. The length of a sequence is the number of itemsets in the sequence, and a sequence of length k is called a k-sequence. Without loss of generality, the set of items is assumed to be mapped to a set of contiguous integers, and an itemset i is denoted by (i_1 i_2 ... i_m), where i_j is an item. A sequence s is denoted by <s_1 s_2 ... s_n>. A sequence <a_1 a_2 ... a_n> is contained in another sequence <b_1 b_2 ... b_m> (n <= m) if there exist integers i_1 < i_2 < ... < i_n such that a_1 is a subset of b_{i_1}, a_2 is a subset of b_{i_2}, ..., a_n is a subset of b_{i_n}.
All the transactions with the same sequence-id, sorted by transaction-time, together form a sequence (a transaction sequence). A sequence-id supports a sequence s if s is contained in its transaction sequence. The support for a sequence is defined as the fraction of the total number of sequence-ids that support this sequence. Likewise, the support for an itemset i is defined as the fraction of sequence-ids that have the items in i in any one of their transactions. Note that this definition is different from the one used in Apriori. Thus, the itemset i and the 1-sequence <i> have the same support.

Given a transaction database D, the problem of mining sequential patterns is to find the maximal* sequences among all sequences that satisfy a certain user-specified minimum support constraint. Each such maximal sequence represents a sequential pattern. A sequence satisfying the minimum support constraint is called a frequent sequence (not necessarily maximal), and an itemset satisfying the minimum support constraint is called a frequent itemset, or fitemset for short. Any frequent sequence must be a list of fitemsets.

* Later, R. Agrawal and R. Srikant removed this constraint in their generalized sequential patterns (GSP) [32].

The algorithm consists of five phases: (1) sort phase, (2) fitemset phase, (3) transformation phase, (4) sequence phase, and (5) maximal phase. The first three are preprocessing phases and the last one is a postprocessing phase.

In the sort phase, the database D is sorted with sequence-id as the major key and transaction-time as the minor key. In the fitemset phase, the set of all fitemsets is obtained using the Apriori algorithm with the corresponding modification of support counting, and is mapped to a set of contiguous integers. This makes it possible to compare two fitemsets for equality in constant time. Note that the set of all frequent 1-sequences is simultaneously found in this phase.
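The containment test and the sequence support defined above can be sketched as follows (names ours; sequences are lists of item sets, and the database maps each sequence-id to its transaction sequence). A greedy left-to-right match is sufficient, because matching each element at its earliest possible position never prevents a later match.

```python
def is_contained(a, b):
    """True if sequence a = <a1 ... an> is contained in b = <b1 ... bm>,
    i.e., there exist i1 < ... < in with each aj a subset of b_ij."""
    j = 0
    for element in b:                       # greedy left-to-right matching
        if j < len(a) and a[j] <= element:  # a[j] is a set; <= is the subset test
            j += 1
    return j == len(a)

def sequence_support(database, s):
    """Fraction of sequence-ids whose transaction sequence contains s."""
    n = sum(1 for seq in database.values() if is_contained(s, seq))
    return n / len(database)
```

Note that the 2-sequence <{1} {3}> requires {1} and {3} to match two distinct transactions in order; a single transaction {1, 2, 3} does not contain it.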
In the transformation phase, each transaction is replaced by the set of all fitemsets contained in that transaction. If a transaction does not contain any fitemset, it is not retained in the transformed sequence. If a transaction sequence does not contain any fitemset, the sequence is removed from the transformed database, but it is still counted in the total number of sequence-ids. After the transformation, a transaction sequence is represented by a list of sets of fitemsets. Each set of fitemsets is represented by {f_1, f_2, ..., f_n}, where f_i is an fitemset. This transformation is designed for efficiently testing which given frequent sequences are contained in a transaction sequence. The transformed database is denoted by D_T.

The sequence phase is the main part, where the frequent sequences are enumerated. Two families of algorithms are proposed: count-all and count-some. They differ in the way the frequent sequences are counted. A count-all algorithm counts all the frequent sequences, including nonmaximal sequences that must be pruned later, whereas a count-some algorithm avoids counting sequences that are contained in a longer sequence, because the final goal is to obtain only maximal sequences. Agrawal and Srikant developed one count-all algorithm called AprioriAll and two count-some algorithms called AprioriSome and DynamicSome. Here, only AprioriAll is explained, due to space limitations.

In the last maximal phase, maximal sequences are extracted from the set of all frequent sequences. A hash tree (similar to the one used in the subset function of Apriori) is used to quickly find all subsequences of a given sequence.

4.2.2.1 AprioriAll

The algorithm is given in Algorithm 4.4. In each pass, the frequent sequences from the previous pass are used to generate the candidate sequences, and then their support is measured by making a pass over the database.
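The transformation phase can be sketched as below (our illustration: given the fitemsets mapped to integer IDs, each transaction becomes the set of IDs of the fitemsets it contains, and transactions containing no fitemset are dropped from the sequence):

```python
def transform(sequence, fitemset_ids):
    """Map a transaction sequence (list of item sets) to a list of sets of
    fitemset IDs; transactions containing no fitemset are dropped."""
    transformed = []
    for transaction in sequence:
        ids = {fid for fitemset, fid in fitemset_ids.items()
               if fitemset <= transaction}  # <= is the subset test
        if ids:                             # drop transactions with no fitemset
            transformed.append(ids)
    return transformed
```

With fitemsets {1}, {2}, {1,2} mapped to IDs 1, 2, 3, the sequence [{1,2}, {3}, {2}] becomes [{1,2,3}, {2}]: the middle transaction contains no fitemset and disappears.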
At the end of the pass, the support of the candidates is used to determine the frequent sequences.

The apriori-gen2 function takes as argument F_{k-1}, the set of all frequent (k-1)-sequences. First, the join operation is performed:

  insert into C_k
  select p.fitemset_1, p.fitemset_2, ..., p.fitemset_{k-1}, q.fitemset_{k-1}
  from F_{k-1} p, F_{k-1} q
  where p.fitemset_1 = q.fitemset_1, ..., p.fitemset_{k-2} = q.fitemset_{k-2}

Algorithm 4.4 AprioriAll algorithm

  F_1 = {frequent 1-sequences};    // result of the fitemset phase
  for (k = 2; F_{k-1} != {}; k++) do begin
    C_k = apriori-gen2(F_{k-1});   // new candidate sequences
    foreach transaction sequence t in D_T do begin
      C_t = subseq(C_k, t);        // candidate sequences contained in t
      foreach candidate c in C_t do
        c.count++;
    end
    F_k = {c in C_k | c.count >= minsup};
  end
  Answer = maximal sequences in the union over k of F_k;

Then, all the sequences c in C_k for which some (k-1)-subsequence is not in F_{k-1} are deleted. The subseq function is similar to the subset function in Apriori. As in Apriori, the candidate sequences C_k are stored in a hash tree to quickly find all the candidates contained in a transaction sequence. Note that the transformed transaction sequence is a list of sets of fitemsets, all the fitemsets in a set have the same transaction-time, and no more than one transaction with the same transaction-time is allowed for the same sequence-id. This constraint has to be imposed in the subseq function.

4.2.3 Discussion

Both Apriori and AprioriTid need minsup and minconf to be specified in advance. The algorithms have to be rerun each time these values are changed, throwing away everything that was obtained in previous runs. If no appropriate values for these thresholds are known in advance and we want to know how the results change with these values without rerunning the algorithms, the best we can do is to generate and count only those itemsets that appear at least once in the database, without duplication, and store them all in an efficient way.
Note that Apriori generates candidates that do not exist in the database.

Apriori and AprioriTid use a hash tree to store the candidate itemsets. Another data structure that is often used is a trie structure [35, 9]. Each node at depth k of the trie corresponds to a candidate k-itemset and stores the kth item and the support of the itemset. Since two frequent k-itemsets that share the first (k-1) items are siblings below their parent node at depth k-1 in the trie, candidate generation amounts to joining the two siblings and extending the tree one more depth below the first frequent k-itemset, after pruning. To find the candidate k-itemsets that are contained in a transaction t, each item in the transaction is fed from the root node, and a branch is followed according to the succeeding item until a kth item is reached. Many practical implementations of Apriori use this trie structure to store not only the candidates but also the transactions [10, 9].

If we go a step further, we can avoid generating candidate itemsets at all. Furthermore, it is not necessary to enumerate all the frequent itemsets. These topics are discussed in Section 4.5.

Apriori and almost all other association rule mining algorithms use a two-phase strategy: first mine frequent patterns, and then generate association rules. This is not the only way. Webb's MagnumOpus uses another strategy that immediately generates a large subset of all association rules [38].

There are direct extensions of the original Apriori family; the use of taxonomies and the incorporation of temporal constraints are two examples. Generalized association rules [30] employ a set of user-specified taxonomies, which makes it possible to extract frequent itemsets that are expressed by higher-level concepts even when the use of the base-level concepts produces only infrequent itemsets.
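The trie alternative described above can be sketched as follows (a minimal version with assumptions of ours: nodes are dicts keyed by item, each holding a support count and a child table, and containment counting walks sorted transactions from the root). Each node at depth d then carries the support of the d-itemset spelled by its path, exactly as the text describes.

```python
def make_node():
    return {"count": 0, "children": {}}

def trie_insert(root, itemset):
    """Store a sorted candidate itemset as a path; depth k holds the kth item."""
    node = root
    for item in itemset:
        node = node["children"].setdefault(item, make_node())
    return node

def trie_count(root, transaction):
    """Increment the count of every stored itemset contained in the sorted
    transaction, following branches according to the succeeding items."""
    def walk(node, start):
        for i in range(start, len(transaction)):
            child = node["children"].get(transaction[i])
            if child is not None:
                child["count"] += 1   # the path down to child is contained in t
                walk(child, i + 1)
    walk(root, 0)
```

Candidate generation on this structure reduces to joining sibling nodes, since siblings share their first k-1 items by construction.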
The basic algorithm is to add all ancestors of each item in a transaction to the transaction and then run the Apriori algorithm. Several optimizations can be added to improve efficiency; one example is that the support for an itemset X that contains both an item x and its ancestor x' is the same as the support of the itemset X - x', and thus X need not be counted.

Generalized sequential patterns [32] add, in addition to the introduction of taxonomies, time constraints that specify a minimum and/or maximum time period between adjacent elements (itemsets) in a pattern, and relax the restriction that the items in an element of a sequential pattern must come from the same transaction, by allowing the items to be present in a set of transactions of the same sequence-id whose transaction-times are within a user-specified time window. GSP also finds all frequent sequential patterns (not limited to maximal sequential patterns). The GSP algorithm runs about 20 times faster than AprioriAll, one reason being that GSP counts fewer candidates than AprioriAll.

4.3 Discussion on Available Software Implementations

There are many available implementations of Apriori, ranging from free software to commercial products. Here, we present only three well-known implementations that are freely downloadable via the Internet.

The first one is the implementation embedded in the most famous open-source machine learning and data mining toolkit, Weka, provided by the University of Waikato [40]. Apriori in Weka can be used through Weka's common graphical user interface, together with the many other algorithms available in Weka. The implementation includes Weka's own extensions. For example, minsup is iteratively decreased from an upper bound U_minsup to a lower bound L_minsup with an interval delta_minsup. Furthermore, in addition to confidence, the metrics lift, leverage, and conviction are available to evaluate association rules.
Lift and leverage are discussed in Section 4.5. Conviction [11] is a metric that was proposed to measure the departure from independence of an association rule, taking implication into account. When using one of these metrics, its minimal value has to be given as a threshold.
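Given relative supports, the four rule metrics mentioned above can be computed as follows. These are the standard formulas (conviction as proposed by Brin et al. [11]); the function name is ours.

```python
def rule_metrics(supp_a, supp_b, supp_ab):
    """Metrics for the rule A => B from the relative supports
    supp(A), supp(B), and supp(A union B)."""
    conf = supp_ab / supp_a
    lift = conf / supp_b                   # > 1 indicates positive correlation
    leverage = supp_ab - supp_a * supp_b   # 0 under independence
    # conviction diverges for a 100%-confidence rule
    conviction = float("inf") if conf == 1.0 else (1 - supp_b) / (1 - conf)
    return {"confidence": conf, "lift": lift,
            "leverage": leverage, "conviction": conviction}
```

For an independent rule (conf equal to supp(B)), lift is 1, leverage is 0, and conviction is 1; each metric measures a different kind of departure from that baseline.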
