The most distinctive characteristic of data mining is that it deals with very large and complex
data sets (gigabytes or even terabytes). The data sets to be mined often contain millions
of objects described by tens, hundreds, or even thousands of attributes or variables of various
types (interval, ratio, binary, ordinal, nominal, etc.). This requires the data mining oper-
ations and algorithms to be scalable and capable of dealing with different types of attributes.
However, most algorithms currently used in data mining do not scale well when applied to
very large data sets because they were originally developed for applications other than data
mining that involve small data sets. In terms of clustering, we are interested in algo-
rithms which can efficiently cluster large data sets containing both numeric and categorical
values because such data sets are frequently encountered in data mining applications. Most
existing clustering algorithms either can handle both data types but are not efficient when
clustering large data sets or can handle large data sets efficiently but are limited to numeric
attributes. Few algorithms can do both well.
Using Gower’s similarity coefficient (Gower, 1971) and other dissimilarity measures
(Gowda and Diday, 1991), the standard hierarchical clustering methods can handle data
with numeric and categorical values (Anderberg, 1973; Jain and Dubes, 1988). However,
the quadratic computational cost makes them unacceptable for clustering large data sets. On
the other hand, the k-means clustering method (MacQueen, 1967; Anderberg, 1973) is effi-
cient for processing large data sets, which makes it well suited to data mining. However, the
k-means algorithm only works on numeric data, i.e., the variables are measured on a ratio
scale (Jain and Dubes, 1988), because it minimises a cost function by changing the means
of clusters. This prohibits it from being used in applications where categorical data are
involved. The traditional approach to converting categorical data into numeric values does
not necessarily produce meaningful results when the categorical domains are not ordered.
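To make the last point concrete, the cost function minimised by k-means can be written, in
illustrative notation (this formulation is standard and not quoted from the present paper), as

    P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l} \, d^{2}(X_i, Q_l),
    \qquad w_{i,l} \in \{0, 1\}, \qquad \sum_{l=1}^{k} w_{i,l} = 1,

where X_1, ..., X_n are the objects, Q_l is the mean of cluster l, d is the Euclidean distance
and W = [w_{i,l}] is the partition matrix. Minimisation alternates between assigning each
object to its nearest mean and recomputing each Q_l as the arithmetic average of the objects
assigned to cluster l; this averaging step is defined only for numeric attributes, which is why
categorical values cannot be handled directly.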
Ralambondrainy (1995) presented an approach to using the k-means algorithm to cluster
categorical data. Ralambondrainy’s approach is to convert multiple category attributes into
binary attributes (using 0 and 1 to indicate whether a category is absent or present) and to
treat the binary attributes as numeric in the k-means algorithm. If it is used in data mining,
this approach needs to handle a large number of binary attributes because data sets in data
mining often have categorical attributes with hundreds or thousands of categories. This will
inevitably increase both the computational and space costs of the k-means algorithm. The other
drawback is that the cluster means, given by real values between 0 and 1, do not indicate
the characteristics of the clusters.
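As an illustration of the encoding step, the following minimal sketch (the function name and
toy data are hypothetical and not taken from Ralambondrainy’s paper) expands each categorical
attribute into one 0/1 column per category; the resulting vectors can then be passed to any
standard k-means implementation:

    # Minimal sketch of the binary encoding described above. Each categorical
    # attribute is expanded into one 0/1 column per category in its domain,
    # and the resulting vectors are treated as numeric by k-means.
    def one_hot_encode(records, attribute_domains):
        encoded = []
        for record in records:
            row = []
            for value, domain in zip(record, attribute_domains):
                row.extend(1.0 if value == category else 0.0 for category in domain)
            encoded.append(row)
        return encoded

    # Hypothetical example: two categorical attributes with 3 and 2 categories.
    domains = [("red", "green", "blue"), ("small", "large")]
    records = [("red", "small"), ("blue", "large"), ("green", "small")]
    print(one_hot_encode(records, domains))
    # [[1.0, 0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0, 0.0]]

If attribute j has c_j categories, the encoded records have c_1 + ... + c_p binary columns in
total, which is the source of the extra computational and storage cost noted above, and the
cluster means computed over these columns are fractions between 0 and 1 rather than category
values.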
Conceptual clustering algorithms developed in machine learning cluster data with cate-
gorical values (Michalski and Stepp, 1983; Fisher, 1987; Lebowitz, 1987) and also produce
conceptual descriptions of clusters. The latter feature is important to data mining because
the conceptual descriptions provide assistance in interpreting clustering results. Unlike sta-
tistical clustering methods, these algorithms are based on a search for objects which carry the
same or similar concepts. Therefore, their efficiency relies on good search strategies. For
problems in data mining, which often involve many concepts and very large object spaces,
the concept-based search methods can become a handicap when these algorithms are applied
to extremely large data sets.
The data mining community has recently put a great deal of effort into developing fast algorithms
for clustering very large data sets. Some popular ones include CLARANS (Ng and Han,