A Fast Algorithm for Mining High Average-Utility Itemsets (CSDN Library resource)

Uploaded 2021-03-07 21:36:41 · 1.42 MB · PDF

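Before discussing the fast algorithm for mining high average-utility itemsets, a few key concepts are needed. High average-utility itemset (HAUI) mining is a variation of high-utility itemset (HUI) mining in transactional databases. HUI mining aims to discover itemsets with a high utility, that is, itemsets that yield a high profit or have a high importance once purchase quantities and unit profits are taken into account; such itemsets may be frequent or infrequent. HAUI mining additionally takes the length of an itemset into account by evaluating it with the average-utility measure: the total utility of the itemset divided by its length (the number of items it contains).

The paper proposes a faster algorithm for mining HAUIs. Existing algorithms often consume a large amount of memory and have long execution times, mainly because the search space is large and because they rely on loose upper bounds to estimate the average-utilities of itemsets. The proposed algorithm includes three pruning strategies that provide a tighter upper bound on the average-utility, which reduces the search space and shortens the runtime:

1. The first pruning strategy uses relationships between item pairs to reduce the search space for itemsets containing three or more items. Item pairs that cannot be part of any HAUI are identified, and larger itemsets containing them are skipped.
2. The second pruning strategy provides a tighter upper bound on the average-utilities of itemsets, so that unpromising candidates are pruned early.
3. The third strategy reduces the time needed to construct the average-utility-list structures of itemsets, which are used to calculate their upper bounds.

The authors are affiliated with the Harbin Institute of Technology Shenzhen Graduate School, the National University of Kaohsiung, National Sun Yat-sen University, Cheng Shiu University, Ho Chi Minh City University of Technology, and Sejong University. According to the abstract, experiments on real-life and synthetic datasets show that the proposed algorithm compares favorably with state-of-the-art algorithms in runtime, number of candidates, memory usage, and scalability.

HAUI mining matters to the data mining field. In business, healthcare, and other domains that extract valuable information from large volumes of data, finding valuable itemsets quickly and accurately supports better decisions. Association-rule mining has long been used to discover patterns and associations in data, but it ignores the utility or value of itemsets. Utility-driven approaches such as HAUI mining better reflect the actual value of the goods in commercial transactions.

HAUI mining is challenging in theory, because the number of possible itemsets grows exponentially with the number of items, and it is in demand in practice. Fast HAUI mining algorithms can be applied to inventory management, market basket analysis, recommender systems, and similar tasks, helping retailers understand consumer behavior, optimize inventory, and improve profit margins. By optimizing the search with three strategies and reducing its computational cost, the proposed algorithm makes mining high average-utility itemsets more efficient and more practical.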


Appl Intell

DOI 10.1007/s10489-017-0896-1

A fast algorithm for mining high average-utility itemsets

Jerry Chun-Wei Lin · Shifeng Ren · Philippe Fournier-Viger · Tzung-Pei Hong (3,4) · Ja-Hwung Su · Bay Vo (6,7)

Abstract Mining high-utility itemsets (HUIs) in transactional databases has become a very popular research topic in recent years. A popular variation of the problem of HUI mining is to discover high average-utility itemsets (HAUIs), where an alternative measure called the average-utility is used to evaluate the utility of itemsets by considering their lengths. Although HAUI mining has been studied extensively, current algorithms often consume a large amount of memory and have long execution times, due to the large search space and the usage of loose upper bounds to estimate the average-utilities of itemsets. In this paper, we present a more efficient algorithm for HAUI mining, which includes three pruning strategies to provide a tighter upper bound on the average-utilities of itemsets, and thus reduce the search space more effectively to decrease the runtime. The first pruning strategy utilizes relationships between item pairs to reduce the search space for itemsets containing three or more items. The second pruning strategy provides a tighter upper bound on the average-utilities of itemsets to prune unpromising candidates early. The third strategy reduces the time for constructing the average-utility-list structures for itemsets, which are used to calculate their upper bounds. Substantial experiments conducted on both real-life and synthetic datasets show that the proposed algorithm with three pruning strategies can efficiently and effectively reduce the search space for mining HAUIs, when compared to the state-of-the-art algorithms, in terms of runtime, number of candidates, memory usage, performance of the pruning strategies and scalability.

Keywords High average-utility itemsets · Pruning strategies · Tighter upper bound · Data mining

Jerry Chun-Wei Lin: jerrylin@ieee.org
Shifeng Ren: renshifeng@stmail.hitsz.edu.cn
Philippe Fournier-Viger: philfv@hitsz.edu.cn
Tzung-Pei Hong: tphong@nuk.edu.tw
Ja-Hwung Su: bb0820@ms22.hinet.net
Bay Vo: bayvodinh@gmail.com

1 School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
2 School of Natural Sciences and Humanities, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
3 Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
4 Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan
5 Department of Information Management, Cheng Shiu University, Kaohsiung, Taiwan
6 Faculty of Information Technology, Ho Chi Minh City University of Technology, Ho Chi Minh, Vietnam
7 College of Electronics and Information Engineering, Sejong University, Seoul, Republic of Korea

1 Introduction

The main purpose of data mining is to reveal information that is important, interesting, and novel from various types of databases. Frequent itemset mining (FIM) [1] plays an essential role in data mining since it can extract relevant relationships between purchases made by customers in transactional databases. In traditional FIM, only the occurrence frequencies of itemsets are considered. But in real life, many other important factors must be considered to determine whether a pattern is interesting, such as purchase quantities, unit profits, and item weights. To reveal more useful and meaningful information from transactional databases, high-utility itemset mining (HUIM) was proposed by Yao et al. [26]. It considers both the unit profits and purchase quantities of items to find high-utility itemsets (HUIs), that is, itemsets that yield a high profit or have a high importance. An itemset is considered a HUI if its utility is no less than a minimum utility count, which is set according to the user's preferences. Discovering HUIs can be seen as an extension of the task of FIM. It has a wide range of applications, such as sales promotion [5, 19, 28] and stock investment [17]. To address the challenge that the downward closure (DC) property of FIM is not applicable to the utility measure in HUIM, the transaction-weighted utility (TWU) model [14] was designed. It provides a transaction-weighted downward closure (TWDC) property for finding high-transaction-weighted utilization itemsets (HTWUIs). This property can be used to avoid exploring the generally very large search space of itemsets for HUIM, and thus discover HUIs efficiently. Several algorithms [7, 19–21] were proposed to mine HUIs using the TWU model.

Although HUIM can reveal information that is often more relevant than that found by FIM, it has an important limitation: it does not take the length of itemsets into account. This is a problem since the utility generally increases with the length of itemsets. For this reason, the utility measure is an unfair measurement of the importance of itemsets for real-world applications. As a solution to this issue, the task of high average-utility itemset mining (HAUIM) has been introduced. By considering the size of itemsets, it can reveal the high average-utility itemsets (HAUIs). The average-utility of an itemset is the utility of the itemset divided by its length (number of items). Hong et al. first designed a two-phase average-utility (TPAU) algorithm [11] to mine the HAUIs using an Apriori-like approach. An average-utility upper bound (auub) property was designed to ensure the completeness and correctness of the algorithm. A projection-based PAI algorithm [12], a tree-based high average-utility pattern (HAUP)-tree algorithm [15], and the HAUI-tree [22] algorithm were designed to efficiently mine HAUIs based on the TPAU algorithm. The HAUI-Miner algorithm [18] was then developed to directly mine HAUIs without generating candidates and without scanning the database multiple times, thanks to a special type of utility-list structure [20]. Nonetheless, a considerable drawback of the HAUI-Miner algorithm is that it repeatedly performs costly join operations for mining HAUIs of different lengths since the TPAU model is based on the auub upper bound, which is a loose upper bound on the average-utility of itemsets.
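The difference between utility and average-utility can be illustrated with a small Python sketch. The toy database, the profit values, and the function names below are hypothetical and are not the paper's Table 1 or Table 2. Utility is taken here as purchase quantity times unit profit, summed over the transactions that contain the whole itemset.

```python
# Hypothetical toy data (not the paper's running example).
# Each transaction maps an item to its purchase quantity.
PROFIT = {"A": 5, "B": 1, "C": 2}
DATABASE = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 3, "C": 4},
    {"A": 1, "B": 1, "C": 1},
]


def utility(itemset, database, profit):
    """Sum quantity * unit profit of the itemset's items over every
    transaction that contains all of them."""
    total = 0
    for transaction in database:
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * profit[item] for item in itemset)
    return total


def average_utility(itemset, database, profit):
    """Utility of the itemset divided by its length (number of items)."""
    return utility(itemset, database, profit) / len(itemset)


for itemset in (["A"], ["A", "B"], ["A", "C"]):
    print(itemset,
          utility(itemset, DATABASE, PROFIT),
          average_utility(itemset, DATABASE, PROFIT))
```

Dividing by the itemset length is what lets itemsets of different sizes be compared against a single minimum average-utility threshold.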

To provide a more efficient algorithm for HAUI mining, this paper presents a novel algorithm, which relies on several pruning strategies to efficiently mine HAUIs. The major contributions of this paper are summarized as follows.

1. In this paper, we design a depth-first search algorithm for mining HAUIs, which includes three pruning strategies. Moreover, a depth-first search version of the algorithm is also considered.

2. The first pruning strategy stores the auub values of 2-itemsets (item pairs) in a matrix, which is then used to eliminate unpromising candidates early. Two more pruning strategies are also designed based on a designed item processing order, to respectively reduce the search space for mining HAUIs and perform join operations more efficiently.

3. Experiments are conducted to evaluate the efficiency of the designed approach in terms of runtime, memory usage, number of candidates, and scalability. Both real-world and synthetic datasets are used. Moreover, the effectiveness of the pruning strategies is also evaluated.
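The first pruning strategy in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy data and all function names are hypothetical, a dictionary keyed by item pairs stands in for the matrix, and the auub of an itemset is assumed to be the sum, over the transactions containing it, of the largest single-item utility in each transaction (the usual definition in the HAUIM literature, which this excerpt does not state).

```python
from itertools import combinations

# Hypothetical toy data: item -> purchase quantity per transaction.
PROFIT = {"A": 5, "B": 1, "C": 2}
DATABASE = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 3, "C": 4},
    {"A": 1, "B": 1, "C": 1},
]


def transaction_max_utility(transaction, profit):
    """Largest single-item utility (quantity * profit) in a transaction."""
    return max(qty * profit[item] for item, qty in transaction.items())


def build_pair_auub(database, profit):
    """Map every co-occurring item pair to its assumed auub value."""
    pair_auub = {}
    for transaction in database:
        tmu = transaction_max_utility(transaction, profit)
        for pair in combinations(sorted(transaction), 2):
            key = frozenset(pair)
            pair_auub[key] = pair_auub.get(key, 0) + tmu
    return pair_auub


def passes_pair_check(itemset, pair_auub, min_average_utility):
    """False if any pair inside the itemset has an auub below the
    threshold; such an itemset can be skipped without further work."""
    return all(
        pair_auub.get(frozenset(pair), 0) >= min_average_utility
        for pair in combinations(sorted(itemset), 2)
    )


PAIR_AUUB = build_pair_auub(DATABASE, PROFIT)
print(PAIR_AUUB[frozenset(("A", "B"))])                 # auub of the pair {A, B}
print(passes_pair_check({"A", "B", "C"}, PAIR_AUUB, 12))
```

The check is safe because, under the assumed definition, the auub of a pair is never smaller than the auub (and hence the average-utility) of any itemset containing that pair.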

The rest of this paper is organized as follows. Related work on HUIM and HAUIM is reviewed in Section 2. Preliminaries and the problem statement are presented in Section 3. The proposed algorithm, including three pruning strategies, is described in Section 4. A detailed example showing how the designed algorithm is applied step-by-step on an example database is given in Section 5. Results from a series of experiments performed on various datasets are discussed in Section 6. Finally, a conclusion is drawn in Section 7.

2 Related work

Association-rule mining (ARM) is a fundamental research area in data mining, which has been widely studied and has many real-life applications. The Apriori algorithm [1] was first developed to mine association rules (ARs) in two phases. In the first phase, a set of frequent itemsets (FIs) is discovered based on a minimum support threshold. Then, in the second phase, the derived FIs are combined to obtain a set of ARs respecting a minimum confidence threshold. Let the size or length of an itemset be the number of items that it contains. The Apriori algorithm is said to use a level-wise approach to discover FIs since it discovers itemsets by ascending order of their size. A limitation of level-wise algorithms is that multiple database scans are typically performed to discover the desired itemsets, and that a large number of candidates may be generated. To avoid this problem, a compact tree structure called the frequent pattern (FP)-tree was developed and a corresponding algorithm named FP-growth was designed to derive the FIs from the FP-tree structure [10]. Other algorithms have also been designed to further enhance the performance of FIM, and have been applied in many real-world situations [4, 6, 9, 23–25].
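The level-wise search described above can be sketched in a few lines of Python. This is a simplified illustration of the first (frequent-itemset) phase only; the toy transactions, the absolute support count, and the function name are hypothetical, and no attempt is made to reproduce the optimizations of the published Apriori algorithm.

```python
from itertools import combinations

# Hypothetical toy transactions (item presence only, no quantities).
TRANSACTIONS = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets,
    exploring itemsets in ascending order of size."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while level:
        # One pass over the database per level to count candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # Join surviving itemsets into candidates of size k.
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Downward closure: every (k-1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))]
    return frequent


FREQUENT = apriori(TRANSACTIONS, min_support=2)
for itemset, support in sorted(FREQUENT.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

The repeated pass over `transactions` inside the loop is the "multiple database scans" cost that the text attributes to level-wise algorithms.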

FIM and ARM discover patterns by only considering their occurrence frequencies. They ignore other factors such as the weights of items and their purchase quantities in transactions. As a consequence, FIM and ARM are unsuitable for discovering important itemsets that may be frequent or infrequent, such as the most profitable itemsets. To discover more useful and meaningful itemsets in databases, high-utility itemset mining (HUIM) [7, 16, 19–21, 27] was introduced. It considers both purchase quantities and unit profits of itemsets to discover high-utility itemsets (HUIs), i.e. itemsets having a utility (e.g. yielding a profit) greater than or equal to a minimum utility count. To discover HUIs, the transaction-weighted utilization (TWU) model [14] was designed. It introduces a property called the transaction-weighted downward closure (TWDC) property of high transaction-weighted utilization itemsets (HTWUIs). This model provides an upper bound on the utilities of itemsets, which is used to reduce the search space, and thus speed up the discovery of HUIs. Li et al. [13] then designed the isolated items discarding strategy (IIDS) to further reduce the number of candidate itemsets for mining HUIs using the TWU model. The incremental high-utility pattern (IHUP) algorithm [3] was developed to incrementally and interactively mine HUIs using a tree structure based on the FP-tree. Recently, the HUI-Miner algorithm [20] was developed based on a novel utility-list structure. HUI-Miner mines HUIs without candidate generation using a depth-first exploration of the search space. Thanks to its vertical database representation and a join operation, HUI-Miner does not need to perform multiple database scans to derive the set of high-utility itemsets. To further improve the mining performance, the FHM algorithm [7] was designed. It utilizes co-occurrence information about item pairs to reduce the search space for mining HUIs and was shown to prune an enormous number of unpromising candidates.
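The TWU upper bound mentioned above can be illustrated with a short Python sketch. The excerpt does not spell out the formula, so the code assumes the usual definition from the HUIM literature: the TWU of an itemset is the sum of the total utilities of the transactions that contain it. The toy data and function names are hypothetical.

```python
# Hypothetical toy data: item -> purchase quantity per transaction.
PROFIT = {"A": 5, "B": 1, "C": 2}
DATABASE = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 3, "C": 4},
    {"A": 1, "B": 1, "C": 1},
]


def transaction_utility(transaction, profit):
    """Total utility of all items in one transaction."""
    return sum(qty * profit[item] for item, qty in transaction.items())


def twu(itemset, database, profit):
    """Assumed TWU: sum of the transaction utilities of every
    transaction that contains the whole itemset."""
    return sum(
        transaction_utility(t, profit)
        for t in database
        if all(item in t for item in itemset)
    )


# TWU never increases when an itemset is extended, which is what makes
# it usable for pruning even though utility itself is not anti-monotone.
for itemset in (["A"], ["A", "B"], ["A", "B", "C"]):
    print(itemset, twu(itemset, DATABASE, PROFIT))
```

If the TWU of an itemset is below the minimum utility count, none of its supersets can be a HUI, so the whole branch can be discarded.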

An important drawback of traditional HUIM algorithms is that they tend to be biased toward finding itemsets containing multiple items. The reason is that itemsets of large size (containing many items) tend to have a higher utility compared to smaller itemsets. To assess the utility of itemsets more fairly, high average-utility itemset mining (HAUIM) [11] was proposed. An itemset is considered a high average-utility itemset (HAUI) if its average-utility (utility divided by its size) is no less than a minimum average-utility threshold. The first algorithm for HAUIM is the two-phase TPAU algorithm [11]. It is designed based on an average-utility upper bound (auub) property, developed to estimate the average-utility of itemsets. This property ensures the completeness and the correctness of the algorithm for HAUIM. Since TPAU is a level-wise algorithm, it suffers from the problem of performing multiple database scans and generating numerous candidates. To address this problem and speed up the discovery of HAUIs, a projection-based PAI algorithm [12] was developed. The high average-utility pattern (HAUP)-tree structure and its mining algorithm called HAUP-growth were then designed to avoid performing multiple database scans. Each node in the HAUP-tree stores the purchase quantities of prefix items, thus ensuring that HAUIs can be derived without accessing the database. This approach is efficient but can consume a large amount of memory for databases containing long transactions. Lu et al. presented the HAUI-tree approach [22] to mine HAUIs based on an index table to quickly find itemsets for mining HAUIs. A novel HAUI-Miner algorithm [18] was then developed to mine HAUIs using a compact list structure. Although HAUI-Miner was shown to be more efficient than previous work, HAUIM remains a computationally expensive task. In particular, the join operation performed by HAUI-Miner to create its list structure is costly. In this paper, we address the above problems by designing an algorithm, which includes three pruning strategies, to discover HAUIs more efficiently.
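To make the role of the auub upper bound concrete, the sketch below compares it with the exact average-utility on a toy database. The excerpt names the auub property without defining it, so the formula used here is an assumption taken from the usual HAUIM definition: for each transaction containing the itemset, add the largest single-item utility in that transaction. The data and function names are hypothetical.

```python
# Hypothetical toy data: item -> purchase quantity per transaction.
PROFIT = {"A": 5, "B": 1, "C": 2}
DATABASE = [
    {"A": 1, "B": 2},
    {"A": 2, "C": 1},
    {"B": 3, "C": 4},
    {"A": 1, "B": 1, "C": 1},
]


def contains(transaction, itemset):
    """True if every item of the itemset occurs in the transaction."""
    return all(item in transaction for item in itemset)


def average_utility(itemset, database, profit):
    """Exact average-utility: utility divided by itemset length."""
    total = sum(
        t[item] * profit[item]
        for t in database if contains(t, itemset)
        for item in itemset
    )
    return total / len(itemset)


def auub(itemset, database, profit):
    """Assumed auub: sum of each containing transaction's maximum
    single-item utility."""
    return sum(
        max(qty * profit[item] for item, qty in t.items())
        for t in database if contains(t, itemset)
    )


for itemset in (["A"], ["A", "B"], ["A", "B", "C"]):
    print(itemset,
          average_utility(itemset, DATABASE, PROFIT),
          auub(itemset, DATABASE, PROFIT))
```

On this data the bound is exact for the single item but well above the true average-utility for the longer itemsets, which hints at why a tighter bound can prune more candidates.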

3 Preliminaries and problem statement

Let $I = \{i_1, \ldots, i_m\}$ be a finite set of $m$ distinct items. A quantitative database is a set of transactions $D = \{T_1, \ldots, T_n\}$, where each transaction $T_q \in D$ ($1 \le q \le n$) is a subset of $I$ and has a unique identifier $q$, called its TID. Moreover, each item $i_j$ in a transaction $T_q$ has a purchase quantity (a positive integer) denoted as $q(i_j, T_q)$. A profit table $ptable = \{pr(i_1), pr(i_2), \ldots, pr(i_m)\}$ indicates the profit value of each item $i_j$. A set of $k$ distinct items $X = \{i_1, \ldots, i_k\}$ such that $X \subseteq I$ is said to be a $k$-itemset, where $k$ is the length of the itemset. An itemset $X$ is said to be contained in a transaction $T_q$ if $X \subseteq T_q$.

An example quantitative database is shown in Table 1. It consists of 7 transactions and 6 items, denoted from (A) to (F), respectively. Table 2 shows the profit table, which defines the profit value of each item. Henceforth, Tables 1 and 2 will be used as running examples.

Definition 1 The utility of an item $i_j$ in a transaction $T_q$ is denoted as $u(i_j, T_q)$, and defined as:

$$u(i_j, T_q) = q(i_j, T_q) \times pr(i_j). \tag{1}$$
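Definition 1 translates directly into code. The sketch below assumes a transaction is stored as a mapping from item to purchase quantity and the profit table as a mapping from item to unit profit; both mappings, their values, and the function name are hypothetical, since Tables 1 and 2 are not reproduced in this excerpt.

```python
# Hypothetical profit table pr(i) and one hypothetical transaction T_q,
# stored as item -> purchase quantity q(i, T_q).
PROFIT_TABLE = {"A": 5, "B": 1, "C": 2}
TRANSACTION = {"A": 2, "C": 1}


def item_utility(item, transaction, profit_table):
    """u(i, T_q) = q(i, T_q) * pr(i), following Definition 1."""
    return transaction[item] * profit_table[item]


for item in TRANSACTION:
    print(item, item_utility(item, TRANSACTION, PROFIT_TABLE))
```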
