The most distinctive characteristic of data mining is that it deals with very large and complex
data sets (gigabytes or even terabytes). The data sets to be mined often contain millions
of objects described by tens, hundreds, or even thousands of attributes or variables of various
types (interval, ratio, binary, ordinal, nominal, etc.). This requires the data mining oper-
ations and algorithms to be scalable and capable of dealing with different types of attributes.
However, most algorithms currently used in data mining do not scale well when applied to
very large data sets because they were originally developed for applications other than data
mining that involve small data sets. In terms of clustering, we are interested in algo-
rithms which can efficiently cluster large data sets containing both numeric and categorical
values because such data sets are frequently encountered in data mining applications. Most
existing clustering algorithms either can handle both data types but are not efficient when
clustering large data sets or can handle large data sets efficiently but are limited to numeric
attributes. Few algorithms can do both well.
Using Gower’s similarity coefficient (Gower, 1971) and other dissimilarity measures
(Gowda and Diday, 1991), the standard hierarchical clustering methods can handle data
with numeric and categorical values (Anderberg, 1973; Jain and Dubes, 1988). However,
the quadratic computational cost makes them unacceptable for clustering large data sets. On
the other hand, the k-means clustering method (MacQueen, 1967; Anderberg, 1973) is effi-
cient for processing large data sets, which makes it well suited to data mining. However, the
k-means algorithm only works on numeric data, i.e., the variables are measured on a ratio
scale (Jain and Dubes, 1988), because it minimises a cost function by changing the means
of clusters. This prohibits it from being used in applications where categorical data are
involved. The traditional approach to converting categorical data into numeric values does
not necessarily produce meaningful results when the categorical domains are not ordered.
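To make the last point concrete, the cost function minimised by k-means can be written, in
illustrative notation (this formulation is standard and not quoted from the present paper), as

    P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l} \, d^{2}(X_i, Q_l),
    \qquad w_{i,l} \in \{0, 1\}, \qquad \sum_{l=1}^{k} w_{i,l} = 1,

where X_1, ..., X_n are the objects, Q_l is the mean of cluster l, d is the Euclidean distance
and W = [w_{i,l}] is the partition matrix. Minimisation alternates between assigning each
object to its nearest mean and recomputing each Q_l as the arithmetic average of the objects
assigned to cluster l; this averaging step is defined only for numeric attributes, which is why
categorical values cannot be handled directly.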
Ralambondrainy (1995) presented an approach to using the k-means algorithm to cluster
categorical data. Ralambondrainy’s approach is to convert multiple category attributes into
binary attributes (using 0 and 1 to indicate whether a category is absent or present) and to
treat the binary attributes as numeric in the k-means algorithm. If it is used in data mining,
this approach needs to handle a large number of binary attributes because data sets in data
mining often have categorical attributes with hundreds or thousands of categories. This will
inevitably increase both the computational and space costs of the k-means algorithm. The other
drawback is that the cluster means, given by real values between 0 and 1, do not indicate
the characteristics of the clusters.
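As an illustration of the encoding step, the following minimal sketch (the function name and
toy data are hypothetical and not taken from Ralambondrainy’s paper) expands each categorical
attribute into one 0/1 column per category; the resulting vectors can then be passed to any
standard k-means implementation:

    # Minimal sketch of the binary encoding described above. Each categorical
    # attribute is expanded into one 0/1 column per category in its domain,
    # and the resulting vectors are treated as numeric by k-means.
    def one_hot_encode(records, attribute_domains):
        encoded = []
        for record in records:
            row = []
            for value, domain in zip(record, attribute_domains):
                row.extend(1.0 if value == category else 0.0 for category in domain)
            encoded.append(row)
        return encoded

    # Hypothetical example: two categorical attributes with 3 and 2 categories.
    domains = [("red", "green", "blue"), ("small", "large")]
    records = [("red", "small"), ("blue", "large"), ("green", "small")]
    print(one_hot_encode(records, domains))
    # [[1.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0, 0.0]]

If attribute j has c_j categories, the encoded records have c_1 + ... + c_p binary columns in
total, which is the source of the extra computational and storage cost noted above, and the
cluster means computed over these columns are fractions between 0 and 1 rather than category
values.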
Conceptual clustering algorithms developed in machine learning cluster data with cate-
gorical values (Michalski and Stepp, 1983; Fisher, 1987; Lebowitz, 1987) and also produce
conceptual descriptions of clusters. The latter feature is important to data mining because
the conceptual descriptions provide assistance in interpreting clustering results. Unlike sta-
tistical clustering methods, these algorithms are based on a search for objects which carry the
same or similar concepts. Therefore, their efficiency relies on good search strategies. For
problems in data mining, which often involve many concepts and very large object spaces,
the concept-based search methods can become a handicap when these algorithms are applied
to extremely large data sets.
The data mining community has recently put a great deal of effort into developing fast algorithms
for clustering very large data sets. Some popular ones include CLARANS (Ng and Han,