Foundations and Trends® in Machine Learning
Vol. 2, No. 4 (2009) 275–365
© 2010 C. J. C. Burges
DOI: 10.1561/2200000002
Dimension Reduction: A Guided Tour
By Christopher J. C. Burges
Contents
1 Introduction 276
2 Estimating the Dimension 280
2.1 A Cautionary Note 281
2.2 Empirical Investigation 283
3 Projective Methods 287
3.1 Independent Component Analysis 289
3.2 Principal Component Analysis (PCA) 291
3.3 Probabilistic PCA (PPCA) 298
3.4 The Kernel Trick 301
3.5 Kernel PCA 303
3.6 Canonical Correlation Analysis 307
3.7 Linear Discriminant Analysis 314
3.8 Oriented PCA and Distortion Discriminant Analysis 316
3.9 Sufficient Dimension Reduction 319
4 Manifold Modeling 330
4.1 The Nyström Method 330
4.2 Multidimensional Scaling 334
4.3 Isomap 341
4.4 Locally Linear Embedding 342
4.5 Graphical Methods 344
4.6 Pulling the Threads Together 348
5 Pointers and Conclusions 351
5.1 Pointers to Further Reading 351
5.2 Conclusions 355
A Appendix: The Nearest Positive Semidefinite Matrix 357
Acknowledgments 359
References 360
Dimension Reduction: A Guided Tour
Christopher J. C. Burges
Microsoft Research, One Microsoft Way, Redmond, WA 98052-6399, USA,
chris.burges@microsoft.com
Abstract
We give a tutorial overview of several foundational methods for dimen-
sion reduction. We divide the methods into projective methods and
methods that model the manifold on which the data lies. For projective
methods, we review projection pursuit, principal component analysis
(PCA), kernel PCA, probabilistic PCA, canonical correlation analysis
(CCA), kernel CCA, Fisher discriminant analysis, oriented PCA, and
several techniques for sufficient dimension reduction. For the manifold
methods, we review multidimensional scaling (MDS), landmark MDS,
Isomap, locally linear embedding, Laplacian eigenmaps, and spectral
clustering. Although this monograph focuses on foundations, we also
provide pointers to some more modern techniques. We also describe
the correlation dimension as one method for estimating the intrinsic
dimension, and we point out that the notion of dimension can be a
scale-dependent quantity. The Nyström method, which links several of
the manifold algorithms, is also reviewed. We use a publicly available
data set to illustrate some of the methods. The goal is to provide a
self-contained overview of key concepts underlying many of these algo-
rithms, and to give pointers for further reading.
1 Introduction
Dimension reduction¹ is the mapping of data to a lower dimensional space such that uninformative variance in the data is discarded, or such that a subspace in which the data lives is detected. Dimension reduction has a long history as a method for data visualization, and for extracting key low dimensional features (for example, the two-dimensional orientation of an object, from its high dimensional image representation). In some cases the desired low dimensional features depend on the task at hand. Apart from teaching us about the data, dimension reduction can lead us to better models for inference. The need for dimension reduction also arises for other pressing reasons. Stone [85] showed that, under certain regularity assumptions (including that the samples be IID), the optimal rate of convergence² for nonparametric regression varies as $m^{-p/(2p+d)}$, where $m$ is the sample size, the data lies in $\mathbb{R}^d$, and the regression function is assumed to be $p$ times differentiable.
¹ We follow both the lead of the statistics community and the spirit of the paper to reduce 'dimensionality reduction' and 'dimensional reduction' to 'dimension reduction'.

² The definition of 'optimal rate of convergence' is technical and for completeness we reproduce Stone's definitions here [85]. A 'rate of convergence' is defined as a sequence of numbers, indexed by sample size. Let $\theta$ be the unknown regression function, $\Theta$ the collection of functions to which $\theta$ belongs, $\hat{T}_n$ an estimator of $\theta$ using $n$ samples, and $\{b_n\}$ a sequence of positive constants. Then $\{b_n\}$ is called a lower rate of convergence if there exists $c > 0$ such that $\lim_n \inf_{\hat{T}_n} \sup_{\Theta} P(\|\hat{T}_n - \theta\| \ge c\,b_n) = 1$, and it is called an achievable rate of convergence if there is a sequence of estimators $\{\hat{T}_n\}$ and $c > 0$ such that $\lim_n \sup_{\Theta} P(\|\hat{T}_n - \theta\| \ge c\,b_n) = 0$; $\{b_n\}$ is called an optimal rate of convergence if it is both a lower rate of convergence and an achievable rate of convergence. Here the $\inf_{\hat{T}_n}$ is over all possible estimators $\hat{T}_n$.
We can get a very rough idea of the impact of sample size on the rate of convergence as follows. Consider a particular point in the sequence of values corresponding to the optimal rate of convergence: $m = 10{,}000$ samples, for $p = 2$ and $d = 10$. Suppose that $d$ is increased to 20; what number of samples in the new sequence gives the same value? The answer is approximately 10 million. If our data lies (approximately) on a low dimensional manifold $L$ that happens to be embedded in a high dimensional manifold $H$, then modeling the data directly in $L$ rather than in $H$ may turn an infeasible problem into a feasible one.
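To make this back-of-the-envelope calculation explicit, the following minimal Python sketch computes the sample size in the higher dimension that achieves the same rate. It is our own illustration, not code from the monograph, and the helper names are hypothetical; setting the two rates equal gives $m' = m^{(2p+d')/(2p+d)}$.

```python
# Stone's optimal rate for nonparametric regression: m**(-p / (2p + d)).
# Hypothetical helper names; only the rate formula comes from the text.

def rate(m: float, p: float, d: float) -> float:
    """Optimal convergence rate at sample size m, smoothness p, dimension d."""
    return m ** (-p / (2 * p + d))

def equivalent_sample_size(m: float, p: float, d: float, d_new: float) -> float:
    """Sample size in dimension d_new matching the rate of m samples in dimension d."""
    # Solve m_new ** (-p / (2p + d_new)) == m ** (-p / (2p + d)) for m_new.
    return m ** ((2 * p + d_new) / (2 * p + d))

print(rate(10_000, 2, 10))                        # ~0.268
print(equivalent_sample_size(10_000, 2, 10, 20))  # ~7.2e6
```

The exact figure is about $7 \times 10^6$, consistent with the "approximately 10 million" quoted above.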
The purpose of this monograph is to describe the mathematics and key ideas underlying the methods, and to provide some links to the literature for those interested in pursuing a topic further.³ The subject of dimension reduction is vast, so we use the following criterion to limit the discussion: we restrict our attention to the case where the inferred feature values are continuous. The observables, on the other hand, may be continuous or discrete. Thus this review does not address clustering methods, or, for example, feature selection for discrete data, such as text. This still leaves a very wide field, and so we further limit the scope by choosing not to cover probabilistic topic models (in particular, latent Dirichlet allocation, nonnegative matrix factorization, probabilistic latent semantic analysis, and Gaussian process latent variable models). Furthermore, implementation details, and theoretical matters such as consistency and rates of convergence of sample quantities to their population values, although important, are not discussed. For an alternative, excellent overview of dimension reduction methods, see Lee and Verleysen [62]. This monograph differs from that work in several ways. In particular, while it is common in the literature to see methods applied to artificial, low dimensional data sets such as the famous Swiss Roll, in this monograph we prefer to use higher dimensional data: while low dimensional toy data can be valuable to …
³ This monograph is a revised and extended version of Burges [17].
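To make the opening definition concrete (mapping data to a lower dimensional space while discarding uninformative variance), the following minimal PCA sketch projects data onto its top-$k$ principal directions. This is our own illustration, assuming NumPy is available, not code from the monograph; PCA itself is reviewed in Section 3.2.

```python
# Minimal PCA sketch: keep the k directions of largest variance,
# discard the rest as "uninformative" variance.
import numpy as np

def pca_project(X: np.ndarray, k: int) -> np.ndarray:
    """Map rows of X (m samples in R^d) to R^k via the top-k principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                       # coordinates in the top-k subspace

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated 10-d data
Y = pca_project(X, k=2)                        # 500 points mapped to R^2
print(Y.shape)                                 # (500, 2)
```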