MATLAB Subspace Clustering
The title, "MATLAB subspace clustering", refers to a data clustering method based on subspace clustering algorithms. Subspace clustering algorithms discover cluster structure within subsets of the data, in particular when the data points lie in different low-dimensional subspaces of a high-dimensional space. MATLAB is a high-level programming language and interactive environment widely used for algorithm development, data visualization, data analysis, and numerical computation. By implementing subspace clustering algorithms in MATLAB, users can analyze and process data, which is especially useful for mathematical modeling problems involving complex data structures.

The description's claim that the resource "can be used for mathematical modeling" means that subspace clustering can serve as a tool in the modeling process. Mathematical modeling is a method for analyzing and solving problems: it translates a real-world problem into mathematical form and then solves it with mathematical tools. Subspace clustering applies to the analysis of many kinds of high-dimensional data sets, for example in medical imaging, video surveillance, text mining, bioinformatics, and network data analysis, where problems are naturally represented by high-dimensional data.

The description also recommends a related English-language paper: "Sparse Subspace Clustering: Algorithm, Theory, and Applications" by Ehsan Elhamifar and René Vidal, published in IEEE Transactions on Pattern Analysis and Machine Intelligence, November 2013. The paper presents the Sparse Subspace Clustering (SSC) algorithm, whose goal is to cluster high-dimensional data points that lie in low-dimensional subspaces. The key idea is to use sparse representations to select data points: such a representation connects each data point to a small number of other points from its own subspace. The algorithm solves a sparse optimization problem and applies the solution in a spectral clustering framework to infer the subspace clustering structure of the data. Because the sparse optimization problem is in general NP-hard, the paper proposes a convex relaxation and proves that, under appropriate conditions on the subspaces and the data distribution, the proposed minimization program successfully recovers the desired sparse representations. In addition, SSC can handle noise, sparse outliers, and missing values by incorporating the data model into the sparse optimization program.

The tag "subspace clustering" refers to a family of algorithms and theory concerned with discovering and exploiting latent low-dimensional subspace structure in data. These algorithms assign each data point to its underlying subspace, so that points within the same subspace are close to one another in some sense, while points in different subspaces are far apart.

The content section reproduces a passage of the paper's abstract, covering the algorithm and its experimental validation. Many real-world problems involve collections of high-dimensional data, such as images, videos, text, web documents, and DNA microarray data. Such high-dimensional data typically lie close to low-dimensional structures that correspond to the classes or categories to which the data belong. The paper proposes a sparse subspace clustering algorithm for clustering data points that lie in a union of low-dimensional subspaces. Its core idea is that, among the infinitely many possible representations of a data point, a sparse representation selects a small number of points from the same subspace; this motivates solving a sparse optimization problem whose solution is used in a spectral clustering framework to infer the clustering of the data into subspaces.

Since the sparse optimization problem is in general NP-hard, the paper considers a convex relaxation and shows that, under appropriate conditions on the subspace arrangement and the data distribution, the proposed minimization program recovers the desired sparse representations of the data. The proposed algorithm is efficient and can handle data points that lie near the intersections of subspaces. Compared with existing techniques, it can also deal directly with noise, sparse outliers, and missing values by integrating the data model into the sparse optimization program. The paper demonstrates the effectiveness of the algorithm through experiments on synthetic data and on two real-world problems: motion segmentation and face clustering.
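To make this pipeline concrete in MATLAB, here is a minimal end-to-end sketch of the two SSC steps: a sparse self-representation of each point, followed by spectral clustering of the resulting affinity. The use of lasso (a noise-tolerant stand-in for the paper's exact ℓ1 program), the use of spectralcluster (Statistics and Machine Learning Toolbox, R2019b+), and the value of lambda are our illustrative assumptions; this is not the authors' released code.

% SSC pipeline sketch for a data matrix Y (D x N) and n subspaces.
% Step 1: sparse self-representation of each point in terms of the others.
N = size(Y, 2);
C = zeros(N, N);
lambda = 0.01;                                   % sparsity weight (assumed value)
for i = 1:N
  idx = setdiff(1:N, i);                         % exclude the point itself, so c_ii = 0
  c = lasso(Y(:, idx), Y(:, i), 'Lambda', lambda);
  C(idx, i) = c;
end
% Step 2: spectral clustering on the symmetrized affinity.
W = abs(C) + abs(C');
labels = spectralcluster(W, n, 'Distance', 'precomputed');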
Sparse Subspace Clustering: Algorithm, Theory, and Applications
Ehsan Elhamifar, Student Member, IEEE, and René Vidal, Senior Member, IEEE
Abstract—Many real-world problems deal with collections of high-dimensional data, such as images, videos, text, and web
documents, DNA microarray data, and more. Often, such high-dimensional data lie close to low-dimensional structures corresponding
to several classes or categories to which the data belong. In this paper, we propose and study an algorithm, called sparse subspace
clustering, to cluster data points that lie in a union of low-dimensional subspaces. The key idea is that, among the infinitely many
possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the
same subspace. This motivates solving a sparse optimization program whose solution is used in a spectral clustering framework to
infer the clustering of the data into subspaces. Since solving the sparse optimization program is in general NP-hard, we consider a
convex relaxation and show that, under appropriate conditions on the arrangement of the subspaces and the distribution of the data,
the proposed minimization program succeeds in recovering the desired sparse representations. The proposed algorithm is efficient
and can handle data points near the intersections of subspaces. Another key advantage of the proposed algorithm with respect to the
state of the art is that it can deal directly with data nuisances, such as noise, sparse outlying entries, and missing entries, by
incorporating the model of the data into the sparse optimization program. We demonstrate the effectiveness of the proposed algorithm
through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering.
Index Terms—High-dimensional data, intrinsic low-dimensionality, subspaces, clustering, sparse representation, ℓ1-minimization, convex programming, spectral clustering, principal angles, motion segmentation, face clustering
1 INTRODUCTION

High-dimensional data are ubiquitous in many areas of machine learning, signal and image processing, computer vision, pattern recognition, bioinformatics, etc. For instance, images consist of billions of pixels, videos can have millions of frames, text and web documents are associated with hundreds of thousands of features, etc.
The high-dimensionality of the data not only increases the computational time and memory requirements of algorithms, but also adversely affects their performance due to the noise effect and insufficient number of samples with respect to the ambient space dimension, commonly referred to as the "curse of dimensionality" [1]. However, high-dimensional data often lie in low-dimensional structures instead of being uniformly distributed across the ambient space. Recovering low-dimensional structures in the data helps to not only reduce the computational cost and memory requirements of algorithms, but also reduce the effect of high-dimensional noise in the data and improve the performance of inference, learning, and recognition tasks.
In fact, in many problems, data in a class or category can be well represented by a low-dimensional subspace of the high-dimensional ambient space. For example, feature trajectories of a rigidly moving object in a video [2], face images of a subject under varying illumination [3], and multiple instances of a hand-written digit with different rotations, translations, and thicknesses [4] lie in a low-dimensional subspace of the ambient space. As a result, the collection of data from multiple classes or categories lies in a union of low-dimensional subspaces. Subspace clustering (see [5] and references therein) refers to the problem of separating data according to their underlying subspaces and finds numerous applications in image processing (e.g., image representation and compression [6]) and computer vision (e.g., image segmentation [7], motion segmentation [8], [9], and temporal video segmentation [10]), as illustrated in Figs. 1 and 2. Since data in a subspace are often distributed arbitrarily and not around a centroid, standard clustering methods [11] that take advantage of the spatial proximity of the data in each cluster are not in general applicable to subspace clustering. Therefore, there is a need for clustering algorithms that take into account the multisubspace structure of the data.
1.1 Prior Work on Subspace Clustering
Existing algorithms can be divided into four main categories: iterative, algebraic, statistical, and spectral clustering-based methods.

Iterative methods. Iterative approaches, such as K-subspaces [12], [13] and median K-flats [14], alternate between assigning points to subspaces and fitting a subspace to each cluster. The main drawbacks of such approaches are that they generally require knowing the number and dimensions of the subspaces and that they are sensitive to initialization.
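As a concrete picture of this alternation, here is a minimal K-subspaces sketch in MATLAB. It assumes the number of subspaces n and a common dimension d are known (the drawback noted above), and it is our own illustration, not the code of [12], [13].

function labels = ksubspaces(Y, n, d, maxIter)
  % Minimal K-subspaces sketch: alternate between assigning each point to
  % its nearest subspace and refitting each subspace by truncated SVD.
  N = size(Y, 2);
  labels = randi(n, 1, N);                 % random initialization (a known weakness)
  for iter = 1:maxIter
    dist = inf(n, N);
    for l = 1:n
      Yl = Y(:, labels == l);
      if isempty(Yl), continue; end        % skip clusters that lost all points
      [U, ~, ~] = svd(Yl, 'econ');
      U = U(:, 1:min(d, size(U, 2)));      % basis of the fitted d-dimensional subspace
      R = Y - U * (U' * Y);                % residuals of projecting all points onto it
      dist(l, :) = sum(R.^2, 1);
    end
    [~, labels] = min(dist, [], 1);        % reassign each point to the closest subspace
  end
end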
E. Elhamifar is with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, CA 94804. E-mail: ehsan@eecs.berkeley.edu.
R. Vidal is with the Center for Imaging Science and the Department of Biomedical Engineering, The Johns Hopkins University, 322 Clark Hall, 3400 North Charles Street, Baltimore, MD 21218. E-mail: rvidal@cis.jhu.edu.
Manuscript received 23 Feb. 2012; revised 13 Aug. 2012; accepted 30 Jan. 2013; published online 14 Mar. 2013.
Recommended for acceptance by F. de la Torre.
For information on obtaining reprints of this article, please send e-mail to tpami@computer.org, and reference IEEECS Log Number TPAMI-2012-02-0131.
Digital Object Identifier no. 10.1109/TPAMI.2013.57.
Algebraic approaches. Factorization-based algebraic approaches such as [8], [9], [15] find an initial segmentation by thresholding the entries of a similarity matrix built from the factorization of the data matrix. These methods are provably correct when the subspaces are independent, but fail when this assumption is violated. In addition, they are sensitive to noise and outliers in the data. Algebraic-geometric approaches such as generalized principal component analysis (GPCA) [10], [16] fit the data with a polynomial whose gradient at a point gives the normal vector to the subspace containing that point. While GPCA can deal with subspaces of different dimensions, it is sensitive to noise and outliers, and its complexity increases exponentially in terms of the number and dimensions of the subspaces.
Statistical methods. Iterative statistical approaches, such as mixtures of probabilistic PCA [17], multistage learning [18], or [19], assume that the distribution of the data inside each subspace is Gaussian and alternate between data clustering and subspace estimation by applying expectation maximization. The main drawbacks of these methods are that they generally need to know the number and dimensions of the subspaces, and that they are sensitive to initialization. Robust statistical approaches, such as random sample consensus (RANSAC) [20], fit a subspace of dimension d to randomly chosen subsets of d points until the number of inliers is large enough. The inliers are then removed, and the process is repeated to find a second subspace, and so on (a MATLAB sketch of this scheme follows this paragraph). RANSAC can deal with noise and outliers and does not need to know the number of subspaces. However, the dimensions of the subspaces must be known and equal. In addition, the complexity of the algorithm increases exponentially in the dimension of the subspaces. Information-theoretic statistical approaches, such as agglomerative lossy compression (ALC) [21], look for the segmentation of the data that minimizes the coding length needed to fit the points with a mixture of degenerate Gaussians up to a given distortion. As this minimization problem is NP-hard, a suboptimal solution is found by first assuming that each point forms its own group, and then iteratively merging pairs of groups to reduce the coding length. ALC can handle noise and outliers in the data. While, in principle, it does not need to know the number and dimensions of the subspaces, the number of subspaces found by the algorithm depends on the choice of a distortion parameter. In addition, there is no theoretical proof for the optimality of the agglomerative algorithm.
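To make the RANSAC scheme above concrete, the following MATLAB sketch fits one subspace at a time to random d-point samples. The threshold tau, the trial count, and all names are our illustrative assumptions, not the implementation of [20].

function labels = ransac_subspaces(Y, d, tau, nTrials)
  % RANSAC-style subspace fitting sketch: repeatedly fit a d-dimensional
  % subspace to a random sample of d points, keep the fit with the most
  % inliers, remove them, and repeat for the next subspace.
  N = size(Y, 2);
  labels = zeros(1, N);                          % 0 = unassigned (outlier)
  remaining = 1:N;
  l = 0;
  while numel(remaining) > d
    bestInliers = [];
    for t = 1:nTrials
      sample = remaining(randperm(numel(remaining), d));
      [U, ~, ~] = svd(Y(:, sample), 'econ');     % basis for the candidate subspace
      R = Y(:, remaining) - U * (U' * Y(:, remaining));
      inliers = remaining(sqrt(sum(R.^2, 1)) < tau);
      if numel(inliers) > numel(bestInliers)
        bestInliers = inliers;
      end
    end
    if numel(bestInliers) <= d, break; end       % no meaningful subspace left
    l = l + 1;
    labels(bestInliers) = l;                     % assign and remove the inliers
    remaining = setdiff(remaining, bestInliers);
  end
end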
Spectral clustering-based methods. Local spectral clustering-based approaches, such as local subspace affinity (LSA) [22], locally linear manifold clustering [23], and spectral local best-fit flats [24], [25], use local information around each point to build a similarity between pairs of points. The segmentation of the data is then obtained by applying spectral clustering [26], [27] to the similarity matrix. These methods have difficulties in dealing with points near the intersection of two subspaces because the neighborhood of a point can contain points from different subspaces. In addition, they are sensitive to the right choice of the neighborhood size to compute the local information at each point.
Global spectral clustering-based approaches try to resolve these issues by building better similarities between data points using global information. Spectral curvature clustering (SCC) [28] uses multiway similarities that capture the curvature of a collection of points within an affine subspace. SCC can deal with noisy data but requires knowing the number and dimensions of the subspaces and assumes that the subspaces have the same dimension. In addition, the complexity of building the multiway similarity grows exponentially with the dimensions of the subspaces; hence, in practice, a sampling strategy is employed to reduce the computational cost. Using advances in sparse [29], [30], [31] and low-rank [32], [33], [34] recovery algorithms, the Sparse Subspace Clustering (SSC) [35], [36], [37], Low-Rank Recovery (LRR) [38], [39], [40], and Low-Rank Subspace Clustering (LRSC) [41] algorithms pose the clustering problem as one of finding a sparse or low-rank representation of the data in the dictionary of the data itself. The solution of the corresponding global optimization algorithm is then used to build a similarity graph from which the segmentation of the data is obtained. The advantages of these methods with respect to most state-of-the-art algorithms are that they can handle noise and outliers in data, and that they do not need to know the dimensions and, in principle, the number of subspaces a priori.

Fig. 1. Motion segmentation: Given feature points on multiple rigidly moving objects tracked in multiple frames of a video (top), the goal is to separate the feature trajectories according to the moving objects (bottom).

Fig. 2. Face clustering: Given face images of multiple subjects (top), the goal is to find images that belong to the same subject (bottom).
1.2 Paper Contributions
In this paper, we propose and study an algorithm based on sparse representation techniques, called SSC, to cluster a collection of data points lying in a union of low-dimensional subspaces. The underlying idea behind the algorithm is what we call the self-expressiveness property of the data, which states that each data point in a union of subspaces can be efficiently represented as a linear or affine combination of other points. Such a representation is not unique in general because there are infinitely many ways in which a data point can be expressed as a combination of other points. The key observation is that a sparse representation of a data point ideally corresponds to a combination of a few points from its own subspace. This motivates solving a global sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of data. As a result, we can overcome the problems of local spectral clustering-based algorithms, such as choosing the right neighborhood size and dealing with points near the intersection of subspaces, since, for a given data point, the sparse optimization program automatically picks a few other points that are not necessarily close to it but that belong to the same subspace.
Since solving the sparse optimization program is in general NP-hard, we consider its ℓ1 relaxation. We show that, under mild conditions on the arrangement of subspaces and data distribution, the proposed ℓ1-minimization program recovers the desired solution, guaranteeing the success of the algorithm. Our theoretical analysis extends the sparse representation theory to the multisubspace setting where the number of points in a subspace is arbitrary, possibly much larger than its dimension. Unlike block-sparse recovery problems [42], [43], [44], [45], [46], [47], where the bases for the subspaces are known and given, we do not have the bases for the subspaces, nor do we know which data points belong to which subspace, making our case more challenging. We only have the sparsifying dictionary for the union of subspaces, given by the matrix of data points.
The proposed ℓ1-minimization program can be solved efficiently using convex programming tools [48], [49], [50] and does not require initialization. Our algorithm can directly deal with noise, sparse outlying entries, and missing entries in the data, as well as the more general class of affine subspaces, by incorporating the data corruption or subspace model into the sparse optimization program. Finally, through experimental results, we show that our algorithm outperforms state-of-the-art subspace clustering methods on the two real-world problems of motion segmentation (see Fig. 1) and face clustering (see Fig. 2).
Paper organization. In Section 2, we motivate and introduce the SSC algorithm for clustering data points in a union of linear subspaces. In Section 3, we generalize the algorithm to deal with noise, sparse outlying entries, and missing entries in the data, as well as the more general class of affine subspaces. In Section 4, we investigate theoretical conditions under which the ℓ1-minimization program recovers the desired sparse representations of data points. In Section 5, we discuss the connectivity of the similarity graph and propose a regularization term to increase the connectivity of points in each subspace. In Section 6, we verify our theoretical analysis through experiments on synthetic data. In Section 7, we compare the performance of SSC with the state of the art on the two real-world problems of motion segmentation and face clustering. Finally, Section 8 concludes the paper.
2 SPARSE SUBSPACE CLUSTERING
In this section, we introduce the SSC algorithm for
clustering a collection of multisubspace data using sparse
representation techniques. We motivate and formulate the
algorithm for data points that perfectly lie in a union of
linear subspaces. In the next section, we will generalize the
algorithm to deal with data nuisances such as noise, sparse
outlying entries, and missing entries, as well as the more
general class of affine subspaces.
Let $\{S_\ell\}_{\ell=1}^{n}$ be an arrangement of $n$ linear subspaces of $\mathbb{R}^{D}$ of dimensions $\{d_\ell\}_{\ell=1}^{n}$. Consider a given collection of $N$ noise-free data points $\{y_i\}_{i=1}^{N}$ that lie in the union of the $n$ subspaces. Denote the matrix containing all the data points as

$$Y \triangleq [\, y_1 \;\cdots\; y_N \,] = [\, Y_1 \;\cdots\; Y_n \,]\,\Gamma, \qquad (1)$$

where $Y_\ell \in \mathbb{R}^{D \times N_\ell}$ is a rank-$d_\ell$ matrix of the $N_\ell > d_\ell$ points that lie in $S_\ell$ and $\Gamma \in \mathbb{R}^{N \times N}$ is an unknown permutation matrix. We assume that we do not know a priori the bases of the subspaces, nor do we know which data points belong to which subspace. The subspace clustering problem refers to the problem of finding the number of subspaces, their dimensions, a basis for each subspace, and the segmentation of the data from $Y$.
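To experiment with this data model, one can generate a synthetic $Y$ drawn from a union of random linear subspaces, in the spirit of the synthetic experiments of Section 6. The snippet below is our own minimal sketch; all sizes are arbitrary choices.

% Generate points lying in a union of n random d-dimensional subspaces of R^D.
D = 30; n = 3; d = 4; Npts = 50;                % ambient dim, #subspaces, subspace dim, points each
Y = []; trueLabels = [];
for l = 1:n
  [U, ~] = qr(randn(D, d), 0);                  % random orthonormal basis for subspace l
  Y = [Y, U * randn(d, Npts)];                  % points with random coordinates in that basis
  trueLabels = [trueLabels, l * ones(1, Npts)];
end
Y = Y ./ vecnorm(Y);                            % normalize columns (common preprocessing)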
To address the subspace clustering problem, we propose an algorithm that consists of two steps. In the first step, for each data point, we find a few other points that belong to the same subspace. To do so, we propose a global sparse optimization program whose solution encodes information about the memberships of data points to the underlying subspace of each point. In the second step, we use this information in a spectral clustering framework to infer the clustering of the data.
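As an illustration of this second step, the sketch below applies standard spectral clustering to a sparse coefficient matrix C produced by the first step. It follows the generic normalized-Laplacian recipe of [26], [27]; the symmetrization W = |C| + |C|' matches the paper's description, while the remaining details (row normalization, kmeans replicates) are common practice rather than the authors' exact choices.

% Spectral clustering step: from sparse coefficient matrix C (N x N) to labels.
W = abs(C) + abs(C');                            % symmetric affinity between points
deg = sum(W, 2);
Lsym = eye(size(W)) - diag(deg.^(-1/2)) * W * diag(deg.^(-1/2));  % normalized Laplacian
[V, ~] = eigs(Lsym, n, 'smallestabs');           % n eigenvectors for n subspaces
V = V ./ max(vecnorm(V, 2, 2), eps);             % row-normalize the spectral embedding
labels = kmeans(V, n, 'Replicates', 10);         % cluster the embedded points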
2.1 Sparse Optimization Program
Our proposed algorithm takes advantage of what we refer to as the self-expressiveness property of the data, i.e., each data point in a union of subspaces can be efficiently reconstructed by a combination of other points in the dataset. More precisely, each data point $y_i \in \bigcup_{\ell=1}^{n} S_\ell$ can be written as

$$y_i = Y c_i, \quad c_{ii} = 0, \qquad (2)$$

where $c_i \triangleq [\, c_{i1} \; c_{i2} \; \cdots \; c_{iN} \,]^{\top}$ and the constraint $c_{ii} = 0$ eliminates the trivial solution of writing a point as a linear combination of itself. In other words, the matrix of data points $Y$ is a self-expressive dictionary in which each point can be written as a linear combination of other points.
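For a single data point, the ℓ1-relaxed version of this program (minimize the ℓ1 norm of $c_i$ subject to the constraints in (2), as discussed in Section 1.2) can be written in a few lines with the CVX toolbox (http://cvxr.com/cvx). This is a hedged sketch assuming CVX is installed, not the authors' official solver.

% Sparse self-representation of each point via the l1-relaxed program:
% minimize ||c||_1 subject to Y*c = y_i and c(i) = 0 (cf. Eq. (2)).
N = size(Y, 2);
C = zeros(N, N);
for i = 1:N
  cvx_begin quiet
    variable c(N)
    minimize( norm(c, 1) )
    subject to
      Y * c == Y(:, i);
      c(i) == 0;
  cvx_end
  C(:, i) = c;                                   % column i holds the coefficients of point i
end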