Python Machine Learning: Top Ten Machine Learning Algorithms (English Document), K-means (CSDN Library resource)


Chapter 2

K-Means

Joydeep Ghosh and Alexander Liu

Contents

2.1 Introduction

2.2 The k-means Algorithm

2.3 Available Software

2.4 Examples

2.5 Advanced Topics

2.6 Summary

2.7 Exercises

References

2.1 Introduction

In this chapter, we describe the k-means algorithm, a straightforward and widely used clustering algorithm. Given a set of objects (records), the goal of clustering or segmentation is to divide these objects into groups or “clusters” such that objects within a group tend to be more similar to one another as compared to objects belonging to different groups. In other words, clustering algorithms place similar points in the same cluster while placing dissimilar points in different clusters. Note that, in contrast to supervised tasks such as regression or classification where there is a notion of a target value or class label, the objects that form the inputs to a clustering procedure do not come with an associated target. Therefore, clustering is often referred to as unsupervised learning. Because there is no need for labeled data, unsupervised algorithms are suitable for many applications where labeled data is difficult to obtain. Unsupervised tasks such as clustering are also often used to explore and characterize the dataset before running a supervised learning task. Since clustering makes no use of class labels, some notion of similarity must be defined based on the attributes of the objects. The definition of similarity and the method in which points are clustered differ based on the clustering algorithm being applied. Thus, different clustering algorithms are suited to different types of datasets and different purposes. The “best” clustering algorithm to use therefore depends on the application. It is not uncommon to try several different algorithms and choose depending on which is the most useful.


The k-means algorithm is a simple iterative clustering algorithm that partitions a given dataset into a user-specified number of clusters, k. The algorithm is simple to implement and run, relatively fast, easy to adapt, and common in practice. It is historically one of the most important algorithms in data mining.

Historically, k-means in its essential form has been discovered by several researchers across different disciplines, most notably by Lloyd (1957, 1982) [16], Forgy (1965) [9], Friedman and Rubin (1967) [10], and MacQueen (1967) [17]. A detailed history of k-means, along with descriptions of several variations, is given in Jain and Dubes [13]. Gray and Neuhoff [11] provide a nice historical background for k-means placed in the larger context of hill-climbing algorithms.

In the rest of this chapter, we will describe how k-means works, discuss the limitations of k-means, give some examples of k-means on artificial and real datasets, and briefly discuss some extensions to the k-means algorithm. We should note that our list of extensions to k-means is far from exhaustive, and the reader is encouraged to continue their own research on the aspect of k-means of most interest to them.

2.2 The k-means Algorithm

The k-means algorithm applies to objects that are represented by points in a d-dimensional vector space. Thus, it clusters a set of d-dimensional vectors, $D = \{x_i \mid i = 1, \ldots, N\}$, where $x_i \in \mathbb{R}^d$ denotes the ith object or “data point.” As discussed in the introduction, k-means is a clustering algorithm that partitions D into k clusters of points. That is, the k-means algorithm clusters all of the data points in D such that each point $x_i$ falls in one and only one of the k partitions. One can keep track of which point is in which cluster by assigning each point a cluster ID. Points with the same cluster ID are in the same cluster, while points with different cluster IDs are in different clusters. One can denote this with a cluster membership vector m of length N, where $m_i$ is the cluster ID of $x_i$.

The value of k is an input to the base algorithm. Typically, the value for k is based on criteria such as prior knowledge of how many clusters actually appear in D, how many clusters are desired for the current application, or the types of clusters found by exploring/experimenting with different values of k. How k is chosen is not necessary for understanding how k-means partitions the dataset D, and we will discuss how to choose k when it is not prespecified in a later section.

In k-means, each of the k clusters is represented by a single point in $\mathbb{R}^d$. Let us denote this set of cluster representatives as the set $C = \{c_j \mid j = 1, \ldots, k\}$. These k cluster representatives are also called the cluster means or cluster centroids; we will discuss the reason for this after describing the k-means objective function.

Footnote: Lloyd first described the algorithm in a 1957 Bell Labs technical report, which was finally published in 1982.


In clustering algorithms, points are grouped by some notion of “closeness” or “similarity.” In k-means, the default measure of closeness is the Euclidean distance. In particular, one can readily show that k-means attempts to minimize the following nonnegative cost function:

$$\text{Cost} = \sum_{i=1}^{N} \min_{j} \lVert x_i - c_j \rVert_2^2 \qquad (2.1)$$

In other words, k-means attempts to minimize the total squared Euclidean distance between each point $x_i$ and its closest cluster representative $c_j$. Equation 2.1 is often referred to as the k-means objective function.
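The objective in Equation 2.1 can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names `squared_distance` and `kmeans_cost` and the toy data are illustrative choices, not part of the chapter.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_cost(points, centroids):
    """Sum, over all points, of the squared distance to the closest centroid."""
    return sum(min(squared_distance(x, c) for c in centroids) for x in points)


# Toy data (assumed for illustration): two pairs of points, one centroid per pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at squared distance 1 from its nearest centroid.
print(kmeans_cost(points, centroids))  # 4.0
```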

The k-means algorithm, depicted in Algorithm 2.1, clusters D in an iterative fashion, alternating between two steps: (1) reassigning the cluster ID of all points in D and (2) updating the cluster representatives based on the data points in each cluster. The algorithm works as follows. First, the cluster representatives are initialized by picking k points in $\mathbb{R}^d$. Techniques for selecting these initial seeds include sampling at random from the dataset, setting them as the solution of clustering a small subset of the data, or perturbing the global mean of the data k times. In Algorithm 2.1, we initialize by randomly picking k points. The algorithm then iterates between two steps until convergence.

Step 1: Data assignment. Each data point is assigned to its closest centroid, with ties broken arbitrarily. This results in a partitioning of the data.

Step 2: Relocation of “means.” Each cluster representative is relocated to the center (i.e., arithmetic mean) of all data points assigned to it. The rationale of this step is based on the observation that, given a set of points, the single best representative for this set (in the sense of minimizing the sum of the squared Euclidean distances between each point and the representative) is nothing but the mean of the data points. This is also why the cluster representative is often interchangeably referred to as the cluster mean or cluster centroid, and it is where the algorithm gets its name.
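The observation behind Step 2 follows from a short calculus argument. As a brief sketch, let $x_1, \ldots, x_n$ be the points assigned to one cluster and let $c$ be a candidate representative (the symbols $n$ and $f$ are introduced here only for illustration):

```latex
\begin{aligned}
f(c) &= \sum_{i=1}^{n} \lVert x_i - c \rVert_2^2, \\
\nabla f(c) &= -2 \sum_{i=1}^{n} (x_i - c) = 0
\;\Longrightarrow\;
c = \frac{1}{n} \sum_{i=1}^{n} x_i .
\end{aligned}
```

Because $f$ is strictly convex in $c$, this stationary point is the unique minimizer, which is the arithmetic mean of the assigned points.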

The algorithm converges when the assignments (and hence the $c_j$ values) no longer change. One can show that the k-means objective function defined in Equation 2.1 will decrease whenever there is a change in the assignment or the relocation steps, and convergence is guaranteed in a finite number of iterations.
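The two steps and the stopping rule can be put together in a few lines. The sketch below is one possible pure-Python rendering, not a reference implementation: the optional `init` argument, the `max_iter` safeguard, and the choice to leave an empty cluster's centroid where it is are all assumptions made for this example.

```python
import random


def squared_distance(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def closest(x, centroids):
    """Index of the centroid nearest to x (ties go to the lowest index)."""
    return min(range(len(centroids)), key=lambda j: squared_distance(x, centroids[j]))


def mean(cluster):
    """Component-wise arithmetic mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (centroids, membership) after alternating the two steps."""
    if init is None:
        # Initialize by randomly choosing k data points.
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(init)  # hypothetical hook for supplying explicit seeds
    membership = None
    for _ in range(max_iter):
        # Step 1: data assignment.
        new_membership = [closest(x, centroids) for x in points]
        if new_membership == membership:
            break  # assignments unchanged, so the centroids are unchanged too
        membership = new_membership
        # Step 2: relocation of means.
        for j in range(k):
            cluster = [x for x, m in zip(points, membership) if m == j]
            if cluster:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean(cluster)
    return centroids, membership


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, membership = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(membership)  # [0, 0, 0, 1, 1, 1]
```

With the explicit seeds above, the first assignment already separates the two groups, so the loop stops on the second pass.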

Note that each iteration needs N × k comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on N, but as a first cut, k-means can be considered linear in the dataset size. Moreover, since the comparison operation is linear in d, the algorithm is also linear in the dimensionality of the data.

Algorithm 2.1 The k-means algorithm

Input: Dataset D, number of clusters k
Output: Set of cluster representatives C, cluster membership vector m

/* Initialize cluster representatives C */
Randomly choose k data points from D
Use these k points as initial set of cluster representatives C
repeat
    /* Data Assignment */
    Reassign points in D to closest cluster mean
    Update m such that $m_i$ is cluster ID of ith point in D
    /* Relocation of means */
    Update C such that $c_j$ is mean of points in jth cluster
until convergence of objective function $\sum_{i=1}^{N} \min_{j} \lVert x_i - c_j \rVert_2^2$

Limitations. The greedy-descent nature of k-means on a nonconvex cost implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations. In other words, initializing the set of cluster representatives C differently can lead to very different clusters, even on the same dataset D. A poor initialization can lead to very poor clusters. We will see an example of this in the next section when we look at examples of k-means applied to artificial and real data. The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids and then selecting the best result, or by doing limited local search about the converged solution. Other approaches include methods such as those described in [14] that attempt to keep k-means from converging to local minima. [8] also contains a list of different methods of initialization, as well as a discussion of other limitations of k-means.
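The multiple-restart remedy mentioned above can be sketched as follows. This is an illustrative pure-Python example under stated assumptions: the helper names, the use of consecutive integers as random seeds, and the toy data are all choices made here, and the compact `kmeans` helper is only one way to write the base algorithm.

```python
import random


def squared_distance(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_cost(points, centroids):
    return sum(min(squared_distance(x, c) for c in centroids) for x in points)


def kmeans(points, k, seed, max_iter=100):
    """Compact k-means seeded with k random data points (illustrative helper)."""
    centroids = random.Random(seed).sample(points, k)
    membership = None
    for _ in range(max_iter):
        new_membership = [
            min(range(k), key=lambda j: squared_distance(x, centroids[j]))
            for x in points
        ]
        if new_membership == membership:
            break
        membership = new_membership
        for j in range(k):
            cluster = [x for x, m in zip(points, membership) if m == j]
            if cluster:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(v) / len(cluster) for v in zip(*cluster))
    return centroids, membership


def best_of_restarts(points, k, n_restarts=10):
    """Run k-means from several seeds and keep the lowest-cost result."""
    best = None
    for seed in range(n_restarts):
        centroids, membership = kmeans(points, k, seed)
        cost = kmeans_cost(points, centroids)
        if best is None or cost < best[0]:
            best = (cost, centroids, membership)
    return best


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
cost, centroids, membership = best_of_restarts(points, k=2)
print(cost, centroids)
```

Selecting by the objective value is reasonable here because every run uses the same k; comparing costs across different values of k is a separate issue.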

As mentioned, choosing the optimal value of k may be difficult. If one has knowledge about the dataset, such as the number of partitions that naturally comprise the dataset, then that knowledge can be used to choose k. Otherwise, one must use some other criteria to choose k, thus solving the model selection problem. One naive solution is to try several different values of k and choose the clustering which minimizes the k-means objective function (Equation 2.1). Unfortunately, the value of the objective function is not as informative as one would hope in this case. For example, the cost of the optimal solution decreases with increasing k till it hits zero when the number of clusters equals the number of distinct data points. This makes it more difficult to use the objective function to (a) directly compare solutions with different numbers of clusters and (b) find the optimum value of k. Thus, if the desired k is not known in advance, one will typically run k-means with different values of k, and then use some other, more suitable criterion to select one of the results. For example, SAS uses the cube-clustering criterion, while X-means adds a complexity term (which increases with k) to the original cost function (Equation 2.1) and then identifies the k which minimizes this adjusted cost [20]. Alternatively, one can progressively increase the number of clusters, in conjunction with a suitable stopping criterion. Bisecting k-means [21] achieves this by first putting all the data into a single cluster, and then recursively splitting the least compact cluster into two clusters using 2-means. The celebrated LBG algorithm [11] used for vector quantization doubles the number of clusters till a suitable code-book size is obtained. Both these approaches thus alleviate the need to know k beforehand. Many other researchers have studied this problem, such as [18] and [12].
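A small numeric check shows why the raw cost cannot be used to pick k. The sketch below does not compute optimal clusterings; it only illustrates, on assumed one-dimensional toy data, that adding a centroid can never raise the cost (the minimum is taken over a larger set) and that the cost reaches zero once every distinct point is its own centroid.

```python
def squared_distance(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_cost(points, centroids):
    return sum(min(squared_distance(x, c) for c in centroids) for x in points)


# Five points, four of them distinct (toy data assumed for illustration).
points = [(0.0,), (1.0,), (1.0,), (5.0,), (9.0,)]
distinct = sorted(set(points))

# Use the first k distinct points as centroids, for k = 1, ..., 4.
costs = [kmeans_cost(points, distinct[:k]) for k in range(1, len(distinct) + 1)]
print(costs)  # [108.0, 80.0, 16.0, 0.0]
```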

In addition to the above limitations, k-means suffers from several other problems that can be understood by first noting that the problem of fitting data using a mixture of k Gaussians with identical, isotropic covariance matrices ($\Sigma = \sigma^2 I$), where I is the identity matrix, results in a “soft” version of k-means. More precisely, if the soft assignments of data points to the mixture components of such a model are instead hardened so that each data point is solely allocated to the most likely component [3], then one obtains the k-means algorithm. From this connection it is evident that k-means inherently assumes that the dataset is composed of a mixture of k balls or hyperspheres of data, and each of the k clusters corresponds to one of the mixture components. Because of this implicit assumption, k-means will falter whenever the data is not well described by a superposition of reasonably separated spherical Gaussian distributions. For example, k-means will have trouble if there are nonconvex-shaped clusters in the data. This problem may be alleviated by rescaling the data to “whiten” it before clustering, or by using a different distance measure that is more appropriate for the dataset. For example, information-theoretic clustering uses the KL-divergence to measure the distance between two data points representing two discrete probability distributions. It has been recently shown that if one measures distance by selecting any member of a very large class of divergences called Bregman divergences during the assignment step and makes no other changes, the essential properties of k-means, including guaranteed convergence, linear separation boundaries, and scalability, are retained [1]. This result makes k-means effective for a much larger class of datasets so long as an appropriate divergence is used.
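The link between the spherical Gaussian mixture and k-means can be made concrete. The sketch below computes soft assignments for one point and then hardens them; it assumes equal mixing weights in addition to the shared covariance, and the function names `responsibilities` and `harden` are illustrative.

```python
import math


def squared_distance(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def responsibilities(x, centroids, sigma):
    """Posterior probability of each component for x, assuming equal mixing
    weights and a shared covariance sigma**2 * I."""
    d2 = [squared_distance(x, c) for c in centroids]
    shift = min(d2)  # subtract the smallest distance for numerical stability
    weights = [math.exp(-(d - shift) / (2.0 * sigma ** 2)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]


def harden(resp):
    """Index of the most likely component."""
    return max(range(len(resp)), key=resp.__getitem__)


x = (1.0, 0.0)
centroids = [(0.0, 0.0), (4.0, 0.0)]
resp = responsibilities(x, centroids, sigma=2.0)
print(resp, harden(resp))
```

Under these assumptions the responsibility is a decreasing function of the squared distance, so the most likely component is always the nearest centroid, which is exactly the k-means assignment step.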

Another method of dealing with nonconvex clusters is by pairing k-means with another algorithm. For example, one can first cluster the data into a large number of groups using k-means. These groups are then agglomerated into larger clusters using single link hierarchical clustering, which can detect complex shapes. This approach also makes the solution less sensitive to initialization, and since the hierarchical method provides results at multiple resolutions, one does not need to worry about choosing an exact value for k either; instead, one can simply use a large value for k when creating the initial clusters.
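The agglomeration stage can be sketched on its own, taking the centroids from an earlier k-means run with a large k as given. The example below relies on the fact that cutting a single-link hierarchy at a distance threshold yields the connected components of the graph joining every pair of items within that threshold. The `threshold` parameter, the function name, and the sample centroids are assumptions for illustration.

```python
import math


def single_link_groups(centroids, threshold):
    """Label each centroid with the single-link cluster it falls in when the
    hierarchy is cut at `threshold` (hypothetical parameter)."""
    parent = list(range(len(centroids)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of centroids that lies within the threshold.
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if math.dist(centroids[i], centroids[j]) <= threshold:
                parent[find(i)] = find(j)

    # Relabel the roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(centroids))]


# Assumed output of a k-means run with k = 6: a chain of four centroids
# tracing an elongated shape, plus a separate pair.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
print(single_link_groups(centroids, threshold=1.5))  # [0, 0, 0, 0, 1, 1]
```

Varying the threshold corresponds to reading the hierarchy at a different resolution.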

The algorithm is also sensitive to the presence of outliers, since “mean” is not a robust statistic. A preprocessing step to remove outliers can be helpful. Postprocessing the results, for example, to eliminate small clusters, or to merge close clusters into a large cluster, is also desirable. Ball and Hall’s ISODATA algorithm from 1967 effectively used both pre- and postprocessing on k-means.
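A tiny numeric example shows how far a single outlier can drag a mean. The data are assumed for illustration, and the median is included only as a point of comparison with a more robust statistic; it is not part of k-means.

```python
import statistics

cluster = [1.0, 2.0, 3.0]
with_outlier = cluster + [100.0]

# One extreme value moves the mean far from the bulk of the data.
print(statistics.mean(cluster), statistics.mean(with_outlier))      # 2.0 26.5
# The median barely moves.
print(statistics.median(cluster), statistics.median(with_outlier))  # 2.0 2.5
```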

Another potential issue is the problem of “empty” clusters [4]. When running k-means, particularly with large values of k and/or when data resides in very high dimensional space, it is possible that at some point of execution, there exists a cluster representative $c_j$ such that all points $x_i$ in D are closer to some other cluster representative that is not $c_j$. When points in D are assigned to their closest cluster, the jth cluster will have zero points assigned to it. That is, cluster j is now an empty cluster.
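An empty cluster is easy to detect after the assignment step by counting how many points each representative receives. The sketch below only performs detection; the function name `cluster_sizes` and the toy data are assumptions for illustration.

```python
def squared_distance(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def cluster_sizes(points, centroids):
    """Number of points whose closest centroid is each entry of `centroids`."""
    sizes = [0] * len(centroids)
    for x in points:
        nearest = min(
            range(len(centroids)),
            key=lambda idx: squared_distance(x, centroids[idx]),
        )
        sizes[nearest] += 1
    return sizes


points = [(0.0,), (1.0,), (9.0,), (10.0,)]
# The third representative is farther from every point than the other two.
centroids = [(0.5,), (9.5,), (100.0,)]

sizes = cluster_sizes(points, centroids)
empty = [j for j, size in enumerate(sizes) if size == 0]
print(sizes, empty)  # [2, 2, 0] [2]
```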
