ATutorialonPrincipalComponentAnalysis[2014]资源-CSDN文库

需积分: 23 20 浏览量 2018-11-26 21:38:56 上传评论收藏 377KB PDF 举报

资源推荐

资源详情

资源评论

A Tutorial on Principal Component Analysis

Jonathon Shlens

∗

Google Research

Mountain View, CA 94043

(Dated: April 7, 2014; Version 3.02)

Principal component analysis (PCA) is a mainstay of modern data analysis - a black box that is widely used

but (sometimes) poorly understood. The goal of this paper is to dispel the magic behind this black box. This

manuscript focuses on building a solid intuition for how and why principal component analysis works. This

manuscript crystallizes this knowledge by deriving from simple intuitions, the mathematics behind PCA. This

tutorial does not shy away from explaining the ideas informally, nor does it shy away from the mathematics. The

hope is that by addressing both aspects, readers of all levels will be able to gain a better understanding of PCA as

well as the when, the how and the why of applying this technique.

I. INTRODUCTION

Principal component analysis (PCA) is a standard tool in mod-

ern data analysis - in diverse ﬁelds from neuroscience to com-

puter graphics - because it is a simple, non-parametric method

for extracting relevant information from confusing data sets.

With minimal effort PCA provides a roadmap for how to re-

duce a complex data set to a lower dimension to reveal the

sometimes hidden, simpliﬁed structures that often underlie it.

The goal of this tutorial is to provide both an intuitive feel for

PCA, and a thorough discussion of this topic. We will begin

with a simple example and provide an intuitive explanation

of the goal of PCA. We will continue by adding mathemati-

cal rigor to place it within the framework of linear algebra to

provide an explicit solution. We will see how and why PCA

is intimately related to the mathematical technique of singular

value decomposition (SVD). This understanding will lead us

to a prescription for how to apply PCA in the real world and an

appreciation for the underlying assumptions. My hope is that

a thorough understanding of PCA provides a foundation for

approaching the ﬁelds of machine learning and dimensional

reduction.

The discussion and explanations in this paper are informal in

the spirit of a tutorial. The goal of this paper is to educate.

Occasionally, rigorous mathematical proofs are necessary al-

though relegated to the Appendix. Although not as vital to the

tutorial, the proofs are presented for the adventurous reader

who desires a more complete understanding of the math. My

only assumption is that the reader has a working knowledge

of linear algebra. My goal is to provide a thorough discussion

by largely building on ideas from linear algebra and avoiding

challenging topics in statistics and optimization theory (but

see Discussion). Please feel free to contact me with any sug-

gestions, corrections or comments.

∗

Electronic address: jonathon.shlens@gmail.com

II. MOTIVATION: A TOY EXAMPLE

Here is the perspective: we are an experimenter. We are trying

to understand some phenomenon by measuring various quan-

tities (e.g. spectra, voltages, velocities, etc.) in our system.

Unfortunately, we can not ﬁgure out what is happening be-

cause the data appears clouded, unclear and even redundant.

This is not a trivial problem, but rather a fundamental obstacle

in empirical science. Examples abound from complex sys-

tems such as neuroscience, web indexing, meteorology and

oceanography - the number of variables to measure can be

unwieldy and at times even deceptive, because the underlying

relationships can often be quite simple.

Take for example a simple toy problem from physics dia-

grammed in Figure 1. Pretend we are studying the motion

of the physicist’s ideal spring. This system consists of a ball

of mass m attached to a massless, frictionless spring. The ball

is released a small distance away from equilibrium (i.e. the

spring is stretched). Because the spring is ideal, it oscillates

indeﬁnitely along the x-axis about its equilibrium at a set fre-

quency.

This is a standard problem in physics in which the motion

along the x direction is solved by an explicit function of time.

In other words, the underlying dynamics can be expressed as

a function of a single variable x.

However, being ignorant experimenters we do not know any

of this. We do not know which, let alone how many, axes

and dimensions are important to measure. Thus, we decide to

measure the ball’s position in a three-dimensional space (since

we live in a three dimensional world). Speciﬁcally, we place

three movie cameras around our system of interest. At 120 Hz

each movie camera records an image indicating a two dimen-

sional position of the ball (a projection). Unfortunately, be-

cause of our ignorance, we do not even know what are the real

x, y and z axes, so we choose three camera positions

b and

at some arbitrary angles with respect to the system. The angles

between our measurements might not even be 90

! Now, we

record with the cameras for several minutes. The big question

remains: how do we get from this data set to a simple equation

arXiv:1404.1100v1 [cs.LG] 3 Apr 2014

camera A camera B camera C

FIG. 1 A toy example. The position of a ball attached to an oscillat-

ing spring is recorded using three cameras A, B and C. The position

of the ball tracked by each camera is depicted in each panel below.

of x?

We know a-priori that if we were smart experimenters, we

would have just measured the position along the x-axis with

one camera. But this is not what happens in the real world.

We often do not know which measurements best reﬂect the

dynamics of our system in question. Furthermore, we some-

times record more dimensions than we actually need.

Also, we have to deal with that pesky, real-world problem of

noise. In the toy example this means that we need to deal

with air, imperfect cameras or even friction in a less-than-ideal

spring. Noise contaminates our data set only serving to obfus-

cate the dynamics further. This toy example is the challenge

experimenters face everyday. Keep this example in mind as

we delve further into abstract concepts. Hopefully, by the end

of this paper we will have a good understanding of how to

systematically extract x using principal component analysis.

III. FRAMEWORK: CHANGE OF BASIS

The goal of principal component analysis is to identify the

most meaningful basis to re-express a data set. The hope is

that this new basis will ﬁlter out the noise and reveal hidden

structure. In the example of the spring, the explicit goal of

PCA is to determine: “the dynamics are along the x-axis.” In

other words, the goal of PCA is to determine that

x, i.e. the

unit basis vector along the x-axis, is the important dimension.

Determining this fact allows an experimenter to discern which

dynamics are important, redundant or noise.

A. A Naive Basis

With a more precise deﬁnition of our goal, we need a more

precise deﬁnition of our data as well. We treat every time

sample (or experimental trial) as an individual sample in our

data set. At each time sample we record a set of data consist-

ing of multiple measurements (e.g. voltage, position, etc.). In

our data set, at one point in time, camera A records a corre-

sponding ball position (x

). One sample or trial can then

be expressed as a 6 dimensional column vector

X =













where each camera contributes a 2-dimensional projection of

the ball’s position to the entire vector

X. If we record the ball’s

position for 10 minutes at 120 Hz, then we have recorded 10×

60 ×120 = 72000 of these vectors.

With this concrete example, let us recast this problem in ab-

stract terms. Each sample

X is an m-dimensional vector,

where m is the number of measurement types. Equivalently,

every sample is a vector that lies in an m-dimensional vec-

tor space spanned by some orthonormal basis. From linear

algebra we know that all measurement vectors form a linear

combination of this set of unit length basis vectors. What is

this orthonormal basis?

This question is usually a tacit assumption often overlooked.

Pretend we gathered our toy example data above, but only

looked at camera A. What is an orthonormal basis for (x

A naive choice would be {(1,0),(0,1)}, but why select this

basis over {(

√

),(

−

√

−

√

)}or any other arbitrary rota-

tion? The reason is that the naive basis reﬂects the method we

gathered the data. Pretend we record the position (2, 2). We

did not record 2

√

2 in the (

√

) direction and 0 in the per-

pendicular direction. Rather, we recorded the position (2, 2)

on our camera meaning 2 units up and 2 units to the left in our

camera window. Thus our original basis reﬂects the method

we measured our data.

How do we express this naive basis in linear algebra? In the

two dimensional case, {(1,0),(0,1)} can be recast as individ-

ual row vectors. A matrix constructed out of these row vectors

is the 2 ×2 identity matrix I. We can generalize this to the m-

dimensional case by constructing an m ×m identity matrix

B =



















1 0 ··· 0

0 1 ··· 0

0 0 ··· 1







= I

where each row is an orthornormal basis vector b

with m

components. We can consider our naive basis as the effective

starting point. All of our data has been recorded in this basis

and thus it can be trivially expressed as a linear combination

of {b

B. Change of Basis

With this rigor we may now state more precisely what PCA

asks: Is there another basis, which is a linear combination of

the original basis, that best re-expresses our data set?

A close reader might have noticed the conspicuous addition of

the word linear. Indeed, PCA makes one stringent but power-

ful assumption: linearity. Linearity vastly simpliﬁes the prob-

lem by restricting the set of potential bases. With this assump-

tion PCA is now limited to re-expressing the data as a linear

combination of its basis vectors.

Let X be the original data set, where each column is a single

sample (or moment in time) of our data set (i.e.

X). In the toy

example X is an m ×n matrix where m = 6 and n = 72000.

Let Y be another m ×n matrix related by a linear transfor-

mation P. X is the original recorded data set and Y is a new

representation of that data set.

PX = Y (1)

Also let us deﬁne the following quantities.

• p

are the rows of P

• x

are the columns of X (or individual

X).

• y

are the columns of Y.

Equation 1 represents a change of basis and thus can have

many interpretations.

1. P is a matrix that transforms X into Y.

2. Geometrically, P is a rotation and a stretch which again

transforms X into Y.

3. The rows of P,

{

,...,p

}

, are a set of new basis vec-

tors for expressing the columns of X.

The latter interpretation is not obvious but can be seen by writ-

ing out the explicit dot products of PX.

PX =















··· x



Y =







·x

··· p

·x

··· p

·x







In this section x

and y

are column vectors, but be forewarned. In all other

sections x

and y

are row vectors.

We can note the form of each column of Y.







·x







We recognize that each coefﬁcient of y

is a dot-product of

with the corresponding row in P. In other words, the j

coefﬁcient of y

is a projection on to the j

row of P. This is

in fact the very form of an equation where y

is a projection

on to the basis of

{

,...,p

}

. Therefore, the rows of P are a

new set of basis vectors for representing of columns of X.

C. Questions Remaining

By assuming linearity the problem reduces to ﬁnding the ap-

propriate change of basis. The row vectors

{

,...,p

}

this transformation will become the principal components of

X. Several questions now arise.

• What is the best way to re-express X?

• What is a good choice of basis P?

These questions must be answered by next asking ourselves

what features we would like Y to exhibit. Evidently, addi-

tional assumptions beyond linearity are required to arrive at

a reasonable result. The selection of these assumptions is the

subject of the next section.

IV. VARIANCE AND THE GOAL

Now comes the most important question: what does best ex-

press the data mean? This section will build up an intuitive

answer to this question and along the way tack on additional

assumptions.

A. Noise and Rotation

Measurement noise in any data set must be low or else, no

matter the analysis technique, no information about a signal

can be extracted. There exists no absolute scale for noise but

rather all noise is quantiﬁed relative to the signal strength. A

common measure is the signal-to-noise ratio (SNR), or a ratio

of variances σ

SNR =

signal

noise

A high SNR ( 1) indicates a high precision measurement,

while a low SNR indicates very noisy data.

剩余11页未读，继续阅读

评论收藏

内容反馈

larry_dongy

粉丝: 563
资源: 18

A Tutorial on Principal Component Analysis [2014]

最新资源

A Tutorial on Principal Component Analysis [2014]

A tutorial on Principal Components Analysis

Principal Components Analysis:

A tutorial on Principal Components Analysis.pdf

A Tutorial on Principal Component Analysis [2002]

主成分分析Principal Component Analysis

主成分分析（Principal Component Analysis，PCA）

PCA主成分分析(Principal Component Analysis)

Principal-Component-Analysis-PCA

ndependent Component Analysis A Tutorial Introduction

PCA算法的一个介绍

PCA_Tutorial_pca_

Dimension Reduction A Guided Tour

Generalized Principal Component Analysis

Bayesian Maximum Margin Principal Component Analysis

Kernel-Principal-Component-Analysis-KPCA_KPCA_KPCAmatlab_主成分分析_源

Kernel Principal Component Analysis (KPCA)matlab代码.zip

Dimension Reduction：A Guided Tour

kernal_PCA的一个MATLAB实现

Basics of Linear Algebra for Machine Learning (Python)

Basics of Linear Algebra for Machine Learning

Machine Learning in Action.pdf

Kernel-Principal-Component-Analysis-KPCA_KPCA_KPCAmatlab_主成分分析.z

Kernel-Principal-Component-Analysis-KPCA-master.zip

PCA算法.pdf(主成分分析 ( Principal Component Analysis ， PCA ))

Independent Component Analysis_ A Tutorial Introduction

广义主成因分析（Generalized Principal Component Analysis）英文版

最新资源