the data is obtained. The advantages of these methods
with respect to most state-of-the-art algorithms are that
they can handle noise and outliers in data, and that they
do not need to know the dimensions and, in principle, the
number of subspaces a priori.
1.2 Paper Contributions
In this paper, we propose and study an algorithm based on
sparse representation techniques, called SSC, to cluster a
collection of data points lying in a union of low-dimensional
subspaces. The underlying idea behind the algorithm is
what we call the self-expressiveness property of the data,
which states that each data point in a union of subspaces
can be efficiently represented as a linear or affine combination of other points. Such a representation is not unique in
general because there are infinitely many ways in which a
data point can be expressed as a combination of other
points. The key observation is that a sparse representation of a
data point ideally corresponds to a combination of a few
points from its own subspace. This motivates solving a
global sparse optimization program whose solution is used
in a spectral clustering framework to infer the clustering of
data. As a result, we can overcome the problems of local
spectral clustering-based algorithms, such as choosing the
right neighborhood size and dealing with points near
the intersection of subspaces, since, for a given data point,
the sparse optimization program automatically picks a few
other points that are not necessarily close to it but that
belong to the same subspace.
Since solving the sparse optimization program is in
general NP-hard, we consider its $\ell_1$ relaxation. We show
that, under mild conditions on the arrangement of subspaces
and data distribution, the proposed $\ell_1$-minimization
program recovers the desired solution, guaranteeing the
success of the algorithm. Our theoretical analysis extends
the sparse representation theory to the multisubspace
setting where the number of points in a subspace is
arbitrary, possibly much larger than its dimension. Unlike
block-sparse recovery problems [42], [43], [44], [45], [46],
[47] where the bases for the subspaces are known and
given, we do not have the bases for subspaces nor do we
know which data points belong to which subspace, making
our case more challenging. We only have the sparsifying
dictionary for the union of subspaces given by the matrix of
data points.
The proposed $\ell_1$-minimization program can be solved
efficiently using convex programming tools [48], [49], [50]
and does not require initialization. Our algorithm can
directly deal with noise, sparse outlying entries, and
missing entries in the data as well as the more general class
of affine subspaces by incorporating the data corruption or
subspace model into the sparse optimization program.
Finally, through experimental results, we show that our
algorithm outperforms state-of-the-art subspace clustering
methods on the two real-world problems of motion
segmentation (see Fig. 1) and face clustering (see Fig. 2).
Paper organization. In Section 2, we motivate and
introduce the SSC algorithm for clustering data points in a
union of linear subspaces. In Section 3, we generalize the
algorithm to deal with noise, sparse outlying entries, and
missing entries in the data as well as the more general class
of affine subspaces. In Section 4, we investigate theoretical
conditions under which the $\ell_1$-minimization program recovers
the desired sparse representations of data points. In
Section 5, we discuss the connectivity of the similarity
graph and propose a regularization term to increase the
connectivity of points in each subspace. In Section 6, we
verify our theoretical analysis through experiments on
synthetic data. In Section 7, we compare the performance
of SSC with the state of the art on the two real-world
problems of motion segmentation and face clustering.
Finally, Section 8 concludes the paper.
2 SPARSE SUBSPACE CLUSTERING
In this section, we introduce the SSC algorithm for
clustering a collection of multisubspace data using sparse
representation techniques. We motivate and formulate the
algorithm for data points that perfectly lie in a union of
linear subspaces. In the next section, we will generalize the
algorithm to deal with data nuisances such as noise, sparse
outlying entries, and missing entries, as well as the more
general class of affine subspaces.
Let $\{S_\ell\}_{\ell=1}^{n}$ be an arrangement of $n$ linear subspaces
of $\mathbb{R}^{D}$ of dimensions $\{d_\ell\}_{\ell=1}^{n}$. Consider a given collection
of $N$ noise-free data points $\{\bm{y}_i\}_{i=1}^{N}$ that lie in the union of
the $n$ subspaces. Denote the matrix containing all the data
points as
$$\bm{Y} \triangleq [\,\bm{y}_1 \;\; \dots \;\; \bm{y}_N\,] = [\,\bm{Y}_1 \;\; \dots \;\; \bm{Y}_n\,]\,\bm{\Gamma}, \qquad (1)$$
where $\bm{Y}_\ell \in \mathbb{R}^{D \times N_\ell}$ is a rank-$d_\ell$ matrix of the $N_\ell > d_\ell$ points
that lie in $S_\ell$ and $\bm{\Gamma} \in \mathbb{R}^{N \times N}$ is an unknown permutation
matrix. We assume that we do not know a priori the bases
of the subspaces nor do we know which data points belong
to which subspace. The subspace clustering problem refers to
the problem of finding the number of subspaces, their
dimensions, a basis for each subspace, and the segmentation
of the data from $\bm{Y}$.
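As a concrete illustration, the data model in (1) can be simulated as follows. This is a minimal sketch: the ambient dimension, subspace dimensions, and point counts are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 30                      # ambient dimension
dims = [3, 3, 5]            # subspace dimensions d_l
counts = [50, 50, 80]       # number of points N_l > d_l in each subspace

# Sample N_l points from each d_l-dimensional linear subspace of R^D.
blocks = []
for d, N in zip(dims, counts):
    basis, _ = np.linalg.qr(rng.standard_normal((D, d)))  # random orthonormal basis
    blocks.append(basis @ rng.standard_normal((d, N)))    # points = basis @ coefficients

# Y = [Y_1 ... Y_n] Gamma: an unknown permutation hides the segmentation.
Y_ordered = np.hstack(blocks)
Gamma = rng.permutation(np.eye(Y_ordered.shape[1]))       # random permutation matrix
Y = Y_ordered @ Gamma

print(Y.shape)                            # (30, 180)
print(np.linalg.matrix_rank(blocks[0]))   # 3, i.e., rank d_1
```

The clustering algorithm sees only the permuted matrix `Y`; the per-subspace blocks and the permutation are unknown to it.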
To address the subspace clustering problem, we propose
an algorithm that consists of two steps. In the first step, for
each data point, we find a few other points that belong to
the same subspace. To do so, we propose a global sparse
optimization program whose solution encodes information
about the memberships of data points to the underlying
subspace of each point. In the second step, we use this
information in a spectral clustering framework to infer the
clustering of the data.
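The second step can be sketched as follows, assuming the coefficient matrix from the first step is already available. The helper name `spectral_segmentation` is hypothetical, the toy matrix `C` stands in for the output of the sparse optimization program, and the unnormalized Laplacian is one common choice among several spectral clustering variants.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_segmentation(C, n_clusters, seed=0):
    """Step 2: infer the clustering from a sparse coefficient matrix C."""
    W = np.abs(C) + np.abs(C).T          # symmetric affinity graph
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    U = vecs[:, :n_clusters]             # eigenvectors of the smallest eigenvalues
    _, labels = kmeans2(U, n_clusters, minit='++', seed=seed)
    return labels

# Toy coefficient matrix: points 0-2 express one another, as do points 3-5.
C = np.zeros((6, 6))
C[0, 1] = C[1, 2] = C[2, 0] = 1.0
C[3, 4] = C[4, 5] = C[5, 3] = 1.0
labels = spectral_segmentation(C, 2)
print(labels)   # points {0,1,2} share one label, {3,4,5} the other
```

Because the toy affinity graph has two connected components, the two smallest-eigenvalue eigenvectors of the Laplacian separate the groups exactly.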
2.1 Sparse Optimization Program
Our proposed algorithm takes advantage of what we refer
to as the self-expressiveness property of the data, i.e.,
each data point in a union of subspaces can be efficiently
reconstructed by a combination of other points in the dataset.
More precisely, each data point $\bm{y}_i \in \bigcup_{\ell=1}^{n} S_\ell$ can be written as
$$\bm{y}_i = \bm{Y}\bm{c}_i, \qquad c_{ii} = 0, \qquad (2)$$
where $\bm{c}_i \triangleq [\,c_{i1} \;\; c_{i2} \;\; \dots \;\; c_{iN}\,]^{\top}$ and the constraint $c_{ii} = 0$
eliminates the trivial solution of writing a point as a linear
combination of itself. In other words, the matrix of data
points YYYY is a self-expressive dictionary in which each point
can be written as a linear combination of other points.
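A minimal sketch of this idea, using the $\ell_1$ relaxation discussed in Section 1.2: minimizing $\|\bm{c}_i\|_1$ subject to $\bm{y}_i = \bm{Y}\bm{c}_i$, $c_{ii} = 0$ can be cast as a linear program via the standard split $\bm{c} = \bm{u} - \bm{v}$ with $\bm{u}, \bm{v} \ge 0$. The function name and the synthetic two-subspace data below are illustrative choices, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def sparse_self_expression(Y, i):
    """Solve min ||c||_1  s.t.  y_i = Y c,  c_i = 0, as a linear program."""
    D, N = Y.shape
    A_eq = np.hstack([Y, -Y])            # Y u - Y v = y_i
    b_eq = Y[:, i]
    cost = np.ones(2 * N)                # minimize sum(u) + sum(v) = ||c||_1
    bounds = [(0, None)] * (2 * N)
    bounds[i] = bounds[N + i] = (0, 0)   # enforce the constraint c_i = 0
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method='highs')
    return res.x[:N] - res.x[N:]

rng = np.random.default_rng(1)
# Two random 2-dimensional subspaces of R^5, ten points each; y_0 lies in the first.
B1, _ = np.linalg.qr(rng.standard_normal((5, 2)))
B2, _ = np.linalg.qr(rng.standard_normal((5, 2)))
Y = np.hstack([B1 @ rng.standard_normal((2, 10)),
               B2 @ rng.standard_normal((2, 10))])

c = sparse_self_expression(Y, 0)
support = np.nonzero(np.abs(c) > 1e-6)[0]
print(support)   # nonzero coefficients only among columns 1..9 (same subspace)
```

Here the two random subspaces are independent (their dimensions sum to less than the ambient dimension), so the $\ell_1$ solution selects only points from the subspace containing $\bm{y}_0$, which is exactly the behavior the algorithm exploits.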
ELHAMIFAR AND VIDAL: SPARSE SUBSPACE CLUSTERING: ALGORITHM, THEORY, AND APPLICATIONS 2767