Since Zadeh [8] proposed fuzzy sets, which introduced the idea of partial memberships described by membership functions, fuzzy sets have been successfully applied in clustering. Fuzzy clustering has been widely studied and applied in a variety of substantive areas for more than 45 years [9–12], since Ruspini [13] first proposed fuzzy c-partitions as a fuzzy approach to clustering in the 1970s. In fuzzy clustering, the fuzzy c-means (FCM) clustering algorithm, proposed by Dunn [14] and Bezdek [9], is the most well-known and widely used method.
There are many extensions and variants of FCM proposed in the
literature. The first important extension to FCM was proposed by
Gustafson and Kessel (GK) [15] in which the Euclidean distance
in the FCM objective function was replaced by the Mahalanobis
distance. Afterwards, many extensions to FCM were proposed, such as extensions to maximum-entropy clustering (MEC) by Karayiannis [16], Miyamoto and Umayahara [17], and Wei and Fahn [18], extensions to $L_p$ norms by Hathaway et al. [19], the extension of FCM to alpha-cut implemented fuzzy clustering algorithms by Yang et al. [20], the extension of FCM for treating very large data by Havens et al. [21], an augmented FCM for clustering spatiotemporal data by Izakian et al. [22], and so forth. However, these fuzzy clustering algorithms always need the number of clusters to be given a priori. In general, the cluster number $c$ is unknown. In this case, validity indices, which are supposed to be independent of clustering algorithms, can be used to find the cluster number $c$. Many cluster validity indices for fuzzy clustering algorithms have been proposed in the literature, such as the partition coefficient (PC) [23], partition entropy (PE) [24], normalizations of PC and PE [25–26], fuzzy hypervolume (FHV) [27], and XB (Xie and Beni [28]).
Frigui and Krishnapuram [29] proposed the robust competitive agglomeration (RCA) algorithm by adding a loss function of clusters and a weight function of data points to clusters. The RCA algorithm can be used to determine the number of clusters: starting with a large cluster number, RCA reduces the number by discarding clusters with small cardinality. Some initial parameter values are needed in RCA, such as the time constant, the discarding threshold, the tuning factor, etc. Another clustering algorithm, called C-FS, was presented by Rodriguez and Laio [30] for clustering by fast search, using a similarity matrix to find density peaks. They proposed the C-FS algorithm by assigning a cutoff distance $d_c$ and selecting a decision window so that it can automatically determine the number of clusters. In [30], the cutoff distance $d_c$ becomes another parameter on which clustering results are heavily dependent. Recently, Fazendeiro and Oliveira [31] presented a fuzzy clustering algorithm with an unknown number of clusters based on an observer position, called the focal point. With this point, the observer can select a suitable point while searching for clusters, one that is actually appropriate to the underlying data structure. After the focal point is chosen, the initialization of cluster centers must be generated randomly. The inverse of the XB index is used to compute the validity measure, and its maximal value is chosen to get the best number of clusters. Although these algorithms can find the number of clusters during their iteration procedures, they are still dependent on initializations and parameter selections.
Up to now, there has been no work in the literature that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index, and independent of a given number of clusters. We think this may be due to the difficulty of constructing such a robust FCM. In this paper, we try to construct a robust learning-based framework for fuzzy clustering, especially for the FCM algorithm. This framework can automatically find the best number of clusters without any initialization and parameter selection, and it is also free of the fuzziness index $m$. We first consider some entropy-type penalty terms for adjusting the bias, and then create a robust-learning mechanism for finding the best number of clusters. The organization of this paper is as follows. In Section 2, we construct a robust learning-based framework for fuzzy clustering. The robust-learning FCM (RL-FCM) clustering algorithm is also presented in this section. In Section 3, several experimental examples and comparisons with numerical and real data sets are provided to demonstrate the effectiveness of the proposed RL-FCM, which can automatically find the best number of clusters. Finally, conclusions are stated in Section 4.
2. Robust-learning fuzzy c-means clustering algorithm
Let $X = \{x_1, \ldots, x_n\}$ be a data set in a $d$-dimensional Euclidean space $\mathbb{R}^d$, and let $V = \{v_1, \ldots, v_c\}$ be the $c$ cluster centers, with the squared Euclidean distance given by $d_{ik}^2 = \|x_i - v_k\|^2 = \sum_{j=1}^{d} (x_{ij} - v_{kj})^2$. The fuzzy c-means (FCM) objective function [9–10] is $J_m(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{n} \mu_{ik}^{m} d_{ik}^{2}$, where $m > 1$ is the fuzziness index, $U = [\mu_{ik}]_{n \times c} \in M_{fcn}$ is a fuzzy partition matrix with $M_{fcn} = \{ U = [\mu_{ik}]_{n \times c} \mid \forall i, \forall k,\ 0 \le \mu_{ik} \le 1,\ \sum_{k=1}^{c} \mu_{ik} = 1,\ 0 < \sum_{i=1}^{n} \mu_{ik} < n \}$, and $d_{ik} = \|x_i - v_k\|$ is the Euclidean distance. The FCM algorithm iterates through the necessary conditions for minimizing $J_m(U, V)$, with the updating equations for cluster centers and memberships given by $v_k = \sum_{i=1}^{n} \mu_{ik}^{m} x_i \big/ \sum_{i=1}^{n} \mu_{ik}^{m}$ and $\mu_{ik} = d_{ik}^{-2/(m-1)} \big/ \sum_{t=1}^{c} d_{it}^{-2/(m-1)}$.
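As a concrete illustration of these two alternating updates, a minimal sketch in Python/NumPy is given below; the function name fcm, the tolerance tol, and the random fuzzy initialization are our own illustrative choices, not part of the original algorithm specification.

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Minimal FCM sketch: alternate the center and membership updates above."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random fuzzy partition satisfying sum_k mu_ik = 1 for each i
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        Um = U ** m
        # v_k = sum_i mu_ik^m x_i / sum_i mu_ik^m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared Euclidean distances d_ik^2 = ||x_i - v_k||^2
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)  # guard against division by zero
        # mu_ik = d_ik^{-2/(m-1)} / sum_t d_it^{-2/(m-1)}
        U_new = d2 ** (-1.0 / (m - 1.0))
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```

Note that the sketch makes the paper's point explicit: the cluster number c, the fuzziness index m, and the initial partition must all be supplied before the iteration can start.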
We know that the FCM algorithm is dependent on initial values, and some parameters need to be given a priori, such as the fuzziness index $m$, the cluster center initialization, and also the number of clusters. Although there exist some works in the literature that solve some of these problems in FCM, such as Dembélé and Kastner [32] and Schwämmle and Jensen [33] on estimating the fuzziness index $m$ for clustering microarray data, there is no work that makes FCM simultaneously robust to initializations and parameter selection, free of the fuzziness index $m$, and independent of a given number of clusters. Next, we construct a robust learning-based scheme for FCM to solve these problems simultaneously. Our basic idea is as follows: we first consider all data points as initial cluster centers, i.e., the number of data points is the initial number of clusters. After that, we use the mixing proportion $\alpha_k$ of cluster $k$, which acts like a cluster weight, and discard those clusters whose $\alpha_k$ is less than one over the number of data points. The proposed algorithm can iteratively obtain the best number of clusters until it converges.
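To make the discarding rule concrete, a minimal sketch is given below; here alpha, V, and U denote the current mixing proportions, centers, and memberships, and the renormalization after discarding is our assumption for illustration.

```python
import numpy as np

def discard_small_clusters(alpha, V, U):
    """Keep only clusters whose mixing proportion reaches the 1/n floor."""
    n = U.shape[0]
    keep = alpha >= 1.0 / n                    # discard clusters with alpha_k < 1/n
    alpha = alpha[keep] / alpha[keep].sum()    # renormalize so sum_k alpha_k = 1 (assumption)
    U = U[:, keep]
    U /= U.sum(axis=1, keepdims=True)          # restore sum_k mu_ik = 1 (assumption)
    return alpha, V[keep], U
```

Starting from $c = n$ clusters, repeating this step inside the iteration shrinks the cluster set until only clusters with sufficient weight survive.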
For a data set $X = \{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$ with $c$ cluster centers, to have FCM be simultaneously robust to initializations and parameter selection, free of the fuzziness index $m$, and able to automatically find the best number of clusters, we add several entropy terms to the FCM objective function. First, to construct an algorithm free of the fuzziness index $m$, we replace $m$ by adding an extra term that is a function of $\mu_{ik}$. In this sense, we consider the concept of MEC [16–18] by adding the entropy term of memberships, $\sum_{k=1}^{c} \sum_{i=1}^{n} \mu_{ik} \ln \mu_{ik}$. Moreover, we use a learning function $r$, i.e. $r \sum_{k=1}^{c} \sum_{i=1}^{n} \mu_{ik} \ln \mu_{ik}$, to learn the effects of the entropy term for adjusting the bias. We next use the mixing proportions $\alpha = (\alpha_1, \cdots, \alpha_c)$ of the clusters, where $\alpha_k$ represents the probability of a data point belonging to the $k$th cluster, with the constraint $\sum_{k=1}^{c} \alpha_k = 1$. Hence, $-\ln \alpha_k$ is the information in the occurrence of a data point belonging to the $k$th cluster. Thus, we add the entropy term $\sum_{k=1}^{c} \sum_{i=1}^{n} \mu_{ik} \ln \alpha_k$ to summarize the average information for the occurrence of a data point belonging to the corresponding cluster over the fuzzy memberships. Furthermore, we borrow the idea of Yang et al. [34] in the EM algorithm by using the entropy term $\sum_{k=1}^{c} \alpha_k \ln \alpha_k$ to represent the average information for the occurrence of each data point belonging to the corresponding cluster. In total, the entropy terms of the mixing proportions in probability and of the average occurrence in probability over fuzzy memberships are used for learning to find the best number of clusters.
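Assembling the terms described above, the augmented objective takes, schematically, a form like the following; the coefficients $r_1, r_2, r_3$ are placeholder learning rates in our notation, and the exact weighting, signs, and update rules are derived in the remainder of this section.

```latex
% Schematic assembly of the augmented FCM objective described above.
% r_1, r_2, r_3 are placeholder learning coefficients (our notation, not the paper's).
\[
J(U,\alpha,V) \;=\; \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\, d_{ik}^{2}
\;+\; r_1 \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\mu_{ik}
\;-\; r_2 \sum_{k=1}^{c}\sum_{i=1}^{n}\mu_{ik}\ln\alpha_{k}
\;-\; r_3\, n \sum_{k=1}^{c}\alpha_{k}\ln\alpha_{k}
\]
```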