194 Y. Yang et al. / Neurocomputing 257 (2017) 193–205
[6] , Euclidean distance modified by a shortest path algorithm [7] ,
and Mahalanobis distances trained using convex optimization [8] .
Constraint-based approaches modify the clustering algorithm itself so that user-provided labels or constraints can guide it toward better clustering output. This is typically done by modifying the objective function of the clustering algorithm. Constrained COBWEB [9] embeds the constraints into the incremental partitioning process by optimizing its clustering objective. Constrained K-means [10] incorporates background knowledge, in the form of instance-level constraints, into the conventional K-means clustering algorithm. Seeded-K-means [11] uses constraints to set the initial seeds for K-means instead of selecting them randomly: initial clusters are obtained from the transitive closure of the constraints, and the centers of these clusters are used as seeds; after this initialization step, the cluster structure is iteratively updated either with or without constraint information. A semi-supervised hierarchical clustering [12] incorporates triple-wise relative constraints into an agglomerative clustering process based on ultra-metric distance matrices. C-DBSCAN [13] builds clusters under constraints on data instances, and thus inherits the advantage of its base algorithm, DBSCAN [14], of being robust toward clusters of arbitrary shapes.
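The seed-initialization step of Seeded-K-means described above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the function name and union-find bookkeeping are assumptions:

```python
from collections import defaultdict

def seeds_from_must_links(points, must_links):
    """Group labeled points by the transitive closure of Must-Link
    constraints (via union-find), then return each group's centroid
    as an initial K-means seed."""
    parent = {i: i for i in points}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in must_links:
        union(i, j)

    groups = defaultdict(list)
    for i, coords in points.items():
        groups[find(i)].append(coords)

    # The centroid of each closed group serves as one seed.
    return [
        tuple(sum(c) / len(g) for c in zip(*g))
        for g in groups.values()
    ]

pts = {0: (0.0, 0.0), 1: (0.0, 2.0), 2: (4.0, 4.0), 3: (4.0, 6.0)}
print(seeds_from_must_links(pts, [(0, 1), (2, 3)]))
# -> [(0.0, 1.0), (4.0, 5.0)]
```

The transitive closure matters because Must-Link constraints are chained: if (a, b) and (b, c) are linked, all three must seed the same cluster.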
Many clustering algorithms can be adapted to semi-supervised learning. Most existing work [9–12,15–20] addresses partitioning and hierarchical clustering, but few density-based approaches have been proposed. In fact, density-based clustering is ideally suited to partitioning groups of data instances that are expected to differ in size or shape. Unlike partitioning algorithms, which strive for a globally optimal partitioning of the data space, DBSCAN [14] builds solutions that are only locally optimal. Hence, density-based semi-supervised clustering can exploit both Must-Link and Cannot-Link constraints between proximal instances. This is an inherent advantage over constraint-based partitioning algorithms, which may fail to converge in the presence of Cannot-Link constraints.
In this work, we develop a novel semi-supervised clustering algorithm based on a density-based approach. It is not only easy to implement but also improves the performance of the clustering process. The main idea is to determine a set of density-based parameters, namely a minimum number of data points MinPts and a radius Eps, using supervision information for areas of different density, in which both Cannot-Link and Must-Link constraints are enforced. Local clusters are then built by applying DBSCAN to the target dataset with the different sets of density-based parameters. Finally, the clustering result is constructed by reconciling the local clusters. Our contributions can be highlighted as follows:
• Our approach is free of user-input parameters: a set of density-based parameters is automatically determined from the intrinsic structure of the dataset.
• Our approach can identify complex cluster structures of different sizes, shapes and densities by reconciling the local clusters; in particular, it performs well in the presence of noise.
• Our approach not only satisfies all input constraints but also significantly reduces the side-effect of generating many singleton clusters, from which many semi-supervised density-based clustering approaches suffer.
The rest of this paper is organized as follows. Section 2 describes the semi-supervised learning approaches related to ours. Section 3 presents our approach in detail. Section 4 reports the simulation test results on various datasets. Section 5 discusses several issues concerning future work, and finally the conclusions are drawn in Section 6.
2. Related work
In this section, we describe several semi-supervised learning
algorithms including constrained K-means, C-DBSCAN, constrained
evidential clustering (CEVCLUS), constrained clustering via spectral
regularization (CCSR) and Semi Naïve Bayesian, which are com-
pared with our approach in the simulation.
2.1. Constrained K-means (C-Kmeans)
The constrained K-means algorithm [10] uses background information to constrain the clustering process. This background information is typically collected in the form of instance-level constraints, which indicate which instances should or should not be grouped together. Accordingly, the algorithm supports two types of constraints: Must-Link and Cannot-Link. A Must-Link constraint requires that two instances be placed in the same cluster, and a Cannot-Link constraint requires that two instances not be placed in the same cluster.
Given a data set X with a labeled set X_L, where n and l are the numbers of instances in X and X_L respectively, the Must-Link and Cannot-Link constraint information on X can be encoded into a matrix M ∈ {−1, 0, 1}^{n×n}. According to the labeled set X_L, the matrix element M(i, j) is defined as follows:
M(i, j) = \begin{cases} 1, & y_i = y_j \\ -1, & y_i \neq y_j \\ 0, & \text{otherwise} \end{cases}    (1)
where 1 ≤ i, j ≤ l, and y_i and y_j are the labels of the i-th and j-th instances in X_L. Constrained K-means is implemented by ensuring that two instances are put in the same cluster if M(i, j) = 1, and in different clusters if M(i, j) = −1.
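Constructing M from Eq. (1) and checking a cluster assignment against it can be sketched as follows; the function names are illustrative, not part of the original algorithm:

```python
import numpy as np

def constraint_matrix(labels, n):
    """Encode Eq. (1): M[i, j] = 1 for a Must-Link pair (same label),
    -1 for a Cannot-Link pair (different labels), and 0 where at
    least one instance is unlabeled. `labels` covers the first l of
    the n instances."""
    l = len(labels)
    M = np.zeros((n, n), dtype=int)
    y = np.asarray(labels)
    same = y[:, None] == y[None, :]        # pairwise label equality
    M[:l, :l] = np.where(same, 1, -1)
    return M

def violates(M, assignment, i, j):
    """True if the cluster assignment breaks the constraint on (i, j)."""
    if M[i, j] == 1:
        return assignment[i] != assignment[j]   # Must-Link broken
    if M[i, j] == -1:
        return assignment[i] == assignment[j]   # Cannot-Link broken
    return False                                # unconstrained pair
```

Constrained K-means consults such a check at every assignment step and rejects any cluster that would violate a constraint involving an already-assigned instance.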
2.2. C-DBSCAN
DBSCAN [14] iteratively constructs a set of clusters from a selected data point by absorbing all data points in its neighborhood. It is built on two main concepts: density reachability and density connectability. Both concepts depend on two input parameters: the neighborhood size Eps, defined as a radius based on Euclidean distance, and the minimum number of data points in a cluster, MinPts. Data points within the same neighborhood are termed density-reachable from the core point, and those in overlapping neighborhoods are density-connected to each other.
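These two neighborhood concepts can be made concrete with a minimal sketch; the helper names below are illustrative, not DBSCAN's published pseudocode:

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of all points within Euclidean distance eps of point i
    (point i itself included)."""
    d = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(d <= eps)

def is_core_point(X, i, eps, min_pts):
    """A core point has at least MinPts points inside its
    Eps-neighborhood."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

def directly_density_reachable(X, j, i, eps, min_pts):
    """Point j is directly density-reachable from point i if i is a
    core point and j lies in i's Eps-neighborhood."""
    return is_core_point(X, i, eps, min_pts) and j in eps_neighborhood(X, i, eps)
```

Density reachability is then the transitive extension of this relation along chains of core points, and two points are density-connected when both are density-reachable from a common core point.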
C-DBSCAN [13] extends DBSCAN in three steps. First, the data space is partitioned into denser subspaces using a KD-Tree [21], from which a set of initial local clusters is constructed; these are groups of points within the leaf nodes of the KD-Tree, split so finely that they already satisfy the Cannot-Link constraints, which encode a priori knowledge about which instances should not be grouped together. Then, density-connected local clusters are merged by enforcing the Must-Link constraints, which encode a priori knowledge about which instances should be grouped together. Finally, adjacent neighborhoods are merged in a bottom-up fashion while enforcing the Cannot-Link constraints again.
2.3. Constrained evidential clustering (CEVCLUS)
This semi-supervised clustering algorithm [22] is an extension of the Evidential Clustering (EVCLUS) algorithm [23], proposed within the theoretical framework of belief functions [24]. It is based on two