194 Y. Yang et al. / Neurocomputing 257 (2017) 193–205
[6] , Euclidean distance modified by a shortest path algorithm [7] ,
and Mahalanobis distances trained using convex optimization [8] .
Constraint-based approaches modify the clustering algorithm itself so that user-provided labels or constraints can guide it toward better clustering output. This is typically done by modifying the objective function of the clustering algorithm. Constrained COBWEB [9] embeds the constraints into the incremental partitioning process by optimizing its clustering objective. Constrained K-means [10] incorporates background knowledge, in the form of instance-level constraints, into the conventional K-means clustering algorithm. Seeded-K-means [11] uses constraints to set the initial seeds for K-means instead of selecting them randomly: initial clusters are obtained from the transitive closure of the constraints, and the centers of these clusters are used as seeds; after this initialization step, the cluster structure is iteratively updated either with or without constraint information. A semi-supervised hierarchical clustering [12] incorporates triple-wise relative constraints into an agglomerative clustering process based on ultra-metric distance matrices. C-DBSCAN [13] builds clusters under constraints on data instances, and thus inherits the advantage of its base algorithm, DBSCAN [14], of being robust toward clusters of arbitrary shapes.
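The seed-initialization step of Seeded-K-means described above can be sketched as follows; this is a minimal illustration, not the authors' implementation, and the function name and union-find bookkeeping are assumptions:

```python
from collections import defaultdict

def seeds_from_must_links(points, must_links):
    """Group labeled points by the transitive closure of Must-Link
    constraints (via union-find), then return each group's centroid
    as an initial K-means seed."""
    parent = {i: i for i in points}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, j in must_links:
        union(i, j)

    groups = defaultdict(list)
    for i, coords in points.items():
        groups[find(i)].append(coords)

    # The centroid of each closed group serves as one seed.
    return [
        tuple(sum(c) / len(g) for c in zip(*g))
        for g in groups.values()
    ]

pts = {0: (0.0, 0.0), 1: (0.0, 2.0), 2: (4.0, 4.0), 3: (4.0, 6.0)}
print(seeds_from_must_links(pts, [(0, 1), (2, 3)]))
# -> [(0.0, 1.0), (4.0, 5.0)]
```

The transitive closure matters because Must-Link constraints are chained: if (a, b) and (b, c) are linked, all three must seed the same cluster.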
Many clustering algorithms can be adapted to semi-supervised learning. Most existing work [9–12,15–20] addresses partitioning and hierarchical clustering, but few density-based approaches have been proposed. In fact, density-based clustering is ideally suited to partitioning groups of data instances that are expected to differ in size or shape. Unlike partitioning algorithms, which strive for a globally optimal partitioning of the data space, DBSCAN [14] builds solutions that are only locally optimal. Hence, density-based semi-supervised clustering can exploit both Must-Link and Cannot-Link constraints between proximal instances. This is an inherent advantage over constraint-based partitioning algorithms, which may fail to converge in the presence of Cannot-Link constraints.
In this work, we develop a novel semi-supervised clustering algorithm based on a density-based approach. It is not only easy to implement but also improves the performance of the clustering process. The main idea is to determine a set of density-based parameters, namely a minimum number of data points MinPts and a radius Eps, using supervision information for areas of different density, in which both Cannot-Link and Must-Link constraints are enforced. Local clusters are then built by applying DBSCAN to the target dataset with the different sets of density-based parameters. Finally, the clustering result is constructed by reconciling the local clusters. Our contributions can be highlighted as follows:
• Our approach is free of user-input parameters: a set of density-based parameters is automatically determined from the intrinsic structure of the dataset.
• Our approach can identify complex cluster structures of different sizes, shapes and densities by reconciling the local clusters; in particular, it performs well in the presence of noise.
• Our approach not only satisfies all input constraints but also significantly reduces the side-effect of generating many singleton clusters, from which many semi-supervised density-based clustering approaches suffer.
The rest of this paper is organized as follows. Section 2 describes the semi-supervised learning approaches related to ours. Section 3 presents our approach in detail. Section 4 reports the simulation test results on various datasets. Section 5 discusses several issues concerning future work, and finally the conclusions are drawn in Section 6.
2. Related work
In this section, we describe several semi-supervised learning
algorithms including constrained K-means, C-DBSCAN, constrained
evidential clustering (CEVCLUS), constrained clustering via spectral
regularization (CCSR) and Semi Naïve Bayesian, which are com-
pared with our approach in the simulation.
2.1. Constrained K-means (C-Kmeans)
The constrained K-means algorithm [10] uses background information to constrain the clustering process. This background information is typically collected in the form of instance-level constraints, which indicate which instances should or should not be grouped together. Accordingly, the algorithm supports two types of constraints: Must-Link and Cannot-Link. A Must-Link constraint requires that two instances be placed in the same cluster, and a Cannot-Link constraint requires that two instances not be placed in the same cluster.
Given a data set X with a labeled set X_L, where n and l are the numbers of instances in X and X_L respectively, the Must-Link and Cannot-Link constraint information on X can be encoded into a matrix M ∈ {−1, 0, 1}^{n×n}. According to the labeled set X_L, the matrix element M(i, j) is defined as follows:
M(i, j) = \begin{cases} 1, & y_i = y_j \\ -1, & y_i \neq y_j \\ 0, & \text{otherwise} \end{cases}    (1)
where 1 ≤ i, j ≤ l, and y_i and y_j are the labels of the i-th and j-th instances in X_L. Constrained K-means is implemented by ensuring that two instances are put in the same cluster if M(i, j) = 1, and in different clusters if M(i, j) = −1.
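Constructing M from Eq. (1) and checking a cluster assignment against it can be sketched as follows; the function names are illustrative, not part of the original algorithm:

```python
import numpy as np

def constraint_matrix(labels, n):
    """Encode Eq. (1): M[i, j] = 1 for a Must-Link pair (same label),
    -1 for a Cannot-Link pair (different labels), and 0 where at
    least one instance is unlabeled. `labels` covers the first l of
    the n instances."""
    l = len(labels)
    M = np.zeros((n, n), dtype=int)
    y = np.asarray(labels)
    same = y[:, None] == y[None, :]        # pairwise label equality
    M[:l, :l] = np.where(same, 1, -1)
    return M

def violates(M, assignment, i, j):
    """True if the cluster assignment breaks the constraint on (i, j)."""
    if M[i, j] == 1:
        return assignment[i] != assignment[j]   # Must-Link broken
    if M[i, j] == -1:
        return assignment[i] == assignment[j]   # Cannot-Link broken
    return False                                # unconstrained pair
```

Constrained K-means consults such a check at every assignment step and rejects any cluster that would violate a constraint involving an already-assigned instance.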
2.2. C-DBSCAN
DBSCAN [14] iteratively constructs a set of clusters from a selected data point by absorbing all data points in its neighborhood. It is built on two main concepts: density reachability and density connectability. Both concepts depend on two input parameters: the neighborhood size Eps, defined as a radius based on Euclidean distance, and the minimum number of data points in a cluster, MinPts. Data points within the same neighborhood are termed density-reachable from the core point, and those in overlapping neighborhoods are density-connected to each other.
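These two neighborhood concepts can be made concrete with a minimal sketch; the helper names below are illustrative, not DBSCAN's published pseudocode:

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of all points within Euclidean distance eps of point i
    (point i itself included)."""
    d = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(d <= eps)

def is_core_point(X, i, eps, min_pts):
    """A core point has at least MinPts points inside its
    Eps-neighborhood."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

def directly_density_reachable(X, j, i, eps, min_pts):
    """Point j is directly density-reachable from point i if i is a
    core point and j lies in i's Eps-neighborhood."""
    return is_core_point(X, i, eps, min_pts) and j in eps_neighborhood(X, i, eps)
```

Density reachability is then the transitive extension of this relation along chains of core points, and two points are density-connected when both are density-reachable from a common core point.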
C-DBSCAN [13] extends DBSCAN in three steps. First, the data space is partitioned into denser subspaces using a KD-Tree [21], from which a set of initial local clusters is constructed; these are groups of points within the leaf nodes of the KD-Tree, split so finely that they already satisfy the Cannot-Link constraints, which encode a priori knowledge about which instances should not be grouped together. Then, density-connected local clusters are merged by enforcing the Must-Link constraints, which encode a priori knowledge about which instances should be grouped together. Finally, adjacent neighborhoods are merged in a bottom-up fashion while enforcing the Cannot-Link constraints again.
2.3. Constrained evidential clustering (CEVCLUS)
This semi-supervised clustering algorithm [22] is an extension of the Evidential Clustering (EVCLUS) algorithm [23], proposed within the theoretical framework of belief functions [24]. It is based on two