SPECTRAL CLUSTERING WITH MEAN SHIFT PREPROCESSING
Umut Ozertem, Deniz Erdogmus
CSEE Department, OGI, Oregon Health & Science University, Portland, Oregon, USA
Abstract. Clustering is a fundamental problem in
machine learning with numerous important applications
in statistical signal processing, pattern recognition, and
computer vision, where unsupervised analysis of the
classification structure of data is required. The current
state-of-the-art in clustering is widely accepted to be
so-called spectral clustering. Spectral clustering, being
based on pairwise affinities of samples, imposes very
large computational requirements. In this paper, we
propose a vector quantization preprocessing stage for
spectral clustering, similar to the classical mean-shift
principle for clustering. This preprocessing reduces the
dimensionality of the matrix to which spectral
techniques are applied, resulting in significant
computational savings.
1. INTRODUCTION
Data clustering is an important fundamental problem
with a wide range of applications in different areas of
unsupervised learning: image segmentation, data
mining, speech recognition, and data compression, to
name just a few. In recent years there has been growing
interest in spectral clustering, and it is recognized as an
important tool for clustering problems. In spectral
clustering, data segmentation is obtained using the
eigendecomposition of an affinity matrix that defines the
similarities in the data. In the definition of the affinity
matrix, different similarity measures can be utilized to
characterize the affinities. The affinities do not even have
to obey the metric axioms, except for the symmetry
property. Spectral clustering dates back to Fiedler's
discovery that the second eigenvector of the Laplacian
matrix can be used to bi-partition the data [1]. Recently, a
number of related clustering methods have been suggested
that utilize the eigenvectors or generalized eigenvectors of
the affinity matrix [2-14]. Such methods are known as
spectral clustering and are considered to be the state-of-the-art
clustering methods in the literature.
The majority of spectral clustering algorithms
are variants of graph-cut and multiway-cut
methods, each using a different affinity matrix and
utilizing the resulting eigendecomposition in a different
manner. After the eigendecomposition is obtained, the
clustering is determined by thresholding the values of a
suitably selected eigenvector. One should also note that
these methods are sensitive to the definition of affinity
between data pairs, and since no theoretical criterion
for choosing the function that assigns the affinities is
known, these algorithms require the assumption that a
suitable affinity definition exists.
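The graph-cut view described above can be illustrated with a small sketch: a Gaussian affinity matrix is built, and the data are bi-partitioned by thresholding the second-smallest eigenvector (the Fiedler vector) of the graph Laplacian. The kernel size sigma and the zero threshold are illustrative assumptions, not choices prescribed by this paper.

```python
import numpy as np

def fiedler_bipartition(X, sigma=1.0):
    """Bi-partition samples by thresholding the Fiedler vector of
    the unnormalized graph Laplacian (illustrative sketch)."""
    # Pairwise Gaussian affinities K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    D = np.diag(K.sum(axis=1))
    L = D - K                        # graph Laplacian
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]             # second-smallest eigenvector
    return fiedler >= 0.0            # threshold at zero

# Two well-separated 2-D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
               rng.normal(5.0, 0.1, (10, 2))])
labels = fiedler_bipartition(X)
```

For such well-separated groups, the Fiedler vector is approximately piecewise constant with opposite signs on the two groups, so the zero threshold recovers the partition.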
A different track in spectral clustering was
initiated by Scott and Longuet-Higgins [12], who
propose a mapping that uses the eigenvectors of the
affinity matrix to transform the data from the original data
space to the kernel-induced feature space, and perform the
actual clustering on the image of the data in that space.
Normalization of the transformed data is an important
step in this approach; provided it is carried out, clustering
the image of the data in the kernel-induced feature space
has been shown to generate very successful results for a
variety of data sets. Spectral clustering can be
understood as measuring sample similarities by an inner
product in the kernel-induced feature space. For
Mercer kernels, the kernel trick provides a technique to
compute inner products in the potentially infinite-dimensional
kernel-induced feature space. Kernel-based
methods rely on the assumption that clustering in the
kernel-induced feature space is easier than the
original clustering problem. In practice, one cannot prove
that this assumption holds for all Mercer kernels; on the
other hand, one could search for a kernel that makes this
desired property true. Kernel optimization, in general, is a
daunting task and no practical solutions exist yet. We
will exploit the connection between kernel methods and kernel
density estimation in order to utilize well-known results
from that literature to select an appropriate kernel [15].
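One well-known result from the kernel density estimation literature of the kind alluded to above is Silverman's rule of thumb for choosing a Gaussian kernel size from the data. The sketch below shows that rule for one-dimensional data as an example of data-driven kernel selection; it is not necessarily the specific criterion of [15].

```python
import numpy as np

def silverman_bandwidth(x):
    """Silverman's rule-of-thumb kernel size for a 1-D Gaussian KDE:
    sigma = 1.06 * std(x) * n^(-1/5)."""
    n = len(x)
    return 1.06 * np.std(x, ddof=1) * n ** (-1.0 / 5.0)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 1000)   # unit-variance Gaussian sample
h = silverman_bandwidth(x)       # kernel size shrinks as n grows
```

The n^(-1/5) factor balances the bias and variance of the density estimate, so the kernel size shrinks slowly as more samples become available.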
The main shortcoming of spectral clustering
algorithms is their computational complexity, since these
algorithms require the computation of the eigenvectors of
the N×N affinity matrix, where N is the sample size.
The computational complexity of the full eigenvector
calculation is O(N^3), which makes spectral clustering
methods impractical for large data sets.
In this paper, we propose a spectral clustering
algorithm that uses fixed-size kernel density estimation with
a mean-shift algorithm to represent the data by a much
smaller, quantized affinity matrix. This reduces the
computational complexity to an order determined by the
number of modes present in the data probability density
function.
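The preprocessing idea can be illustrated with a minimal fixed-bandwidth mean-shift iteration: each sample is moved toward a mode of the Gaussian kernel density estimate, and samples converging to the same mode are represented by a single quantized point. This is a generic mean-shift sketch under assumed parameters (bandwidth, merge tolerance), not the exact procedure of Section 2.

```python
import numpy as np

def mean_shift_modes(X, sigma=0.5, iters=50, tol=1e-3):
    """Fixed-bandwidth Gaussian mean shift: move every point toward
    the nearest KDE mode, then merge points that converged together."""
    Y = X.copy()
    for _ in range(iters):
        # Gaussian weights of every current point w.r.t. the original data
        sq = np.sum((Y[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        W = np.exp(-sq / (2.0 * sigma ** 2))
        Y_new = (W @ X) / W.sum(axis=1, keepdims=True)  # mean-shift update
        converged = np.max(np.abs(Y_new - Y)) < tol
        Y = Y_new
        if converged:
            break
    # Merge converged points that landed on the same mode
    modes = []
    for y in Y:
        if not any(np.linalg.norm(y - m) < sigma / 2 for m in modes):
            modes.append(y)
    return np.array(modes)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (15, 2)),
               rng.normal(5.0, 0.1, (15, 2))])
modes = mean_shift_modes(X)   # 30 samples collapse to 2 modes
```

The 30 samples are replaced by as many representatives as there are density modes, which is the source of the computational savings claimed above: spectral techniques then operate on a modes-by-modes matrix rather than an N×N one.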
2. THE PROPOSED METHOD
We discuss the details of the proposed method in this
section after a brief overview of spectral clustering.
Given a set of vectors {x_1, …, x_N} and a suitable kernel
function K_σ(x_i, x_j) that measures pairwise affinity or
closeness, where σ denotes the kernel size (e.g., the
standard deviation in the case of a Gaussian kernel), the
affinity matrix K or the normalized graph Laplacian
matrix L is constructed as shown in (1) [5,7,8].
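Since equation (1) itself is not reproduced in this excerpt, the sketch below uses one common convention from the cited literature: a Gaussian affinity matrix K and the symmetrically normalized matrix L = D^(-1/2) K D^(-1/2), where D is the diagonal degree matrix. The paper's own equation (1) may define L differently.

```python
import numpy as np

def affinity_and_laplacian(X, sigma=1.0):
    """Gaussian affinity matrix K and a symmetrically normalized
    graph Laplacian-type matrix L = D^{-1/2} K D^{-1/2}
    (one common convention; the paper's equation (1) may differ)."""
    # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    d = K.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ K @ D_inv_sqrt
    return K, L

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 2))
K, L = affinity_and_laplacian(X)
```

With this normalization, L is symmetric and its largest eigenvalue equals 1, which is what makes thresholding its leading eigenvectors well behaved.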
730-7803-9518-2/05/$20.00 ©2005 IEEE