Chapter 2
K-Means
Joydeep Ghosh and Alexander Liu
Contents
2.1 Introduction ............................................................ 21
2.2 The k-means Algorithm ................................................. 22
2.3 Available Software ...................................................... 26
2.4 Examples ............................................................... 27
2.5 Advanced Topics ........................................................ 30
2.6 Summary ............................................................... 32
2.7 Exercises ............................................................... 33
References ................................................................... 34
2.1 Introduction
In this chapter, we describe the k-means algorithm, a straightforward and widely
used clustering algorithm. Given a set of objects (records), the goal of clustering
or segmentation is to divide these objects into groups or “clusters” such that objects
withinagrouptendtobemoresimilartooneanotherascomparedtoobjectsbelonging
to different groups. In other words, clustering algorithms place similar points in the
same cluster whileplacing dissimilar points in different clusters. Note that, in contrast
to supervised tasks such as regression or classification where there is a notion of a
target value or class label, the objects that form the inputs to a clustering procedure
do not come with an associated target. Therefore, clustering is often referred to
as unsupervised learning. Because there is no need for labeled data, unsupervised
algorithms are suitable for many applications where labeled data is difficult to obtain.
Unsupervised tasks such as clustering are also often used to explore and characterize
the dataset before running a supervised learning task. Since clustering makes no use
of class labels, some notion of similarity must be defined based on the attributes of the
objects.Thedefinitionofsimilarity and the method in whichpointsareclustereddiffer
based on the clustering algorithm being applied. Thus, different clustering algorithms
are suited to different types of datasets and different purposes. The “best” clustering
algorithm to use therefore depends on the application. It is not uncommon to try
several different algorithms and choose depending on which is the most useful.
21
© 2009 by Taylor & Francis Group, LLC