EM Algorithms for PCA and SPCA
Sam Roweis
[email protected]; Computation & Neural Systems, California Institute of Technology
Abstract
I present an expectation-maximization (EM) algorithm for principal
component analysis (PCA). The algorithm allows a few eigenvectors and
eigenvalues to be extracted from large collections of high dimensional
data. It is computationally very efficient in space and time. It also natu-
rally accommodates missing information. I also introduce a new variant
of PCA called sensible principal component analysis (SPCA) which de-
fines a proper density model in the data space. Learningfor SPCA is also
done with an EM algorithm. I report results on synthetic and real data
showing that these EM algorithms correctly and efficiently find the lead-
ing eigenvectors of the covariance of datasets in a few iterations using up
to hundreds of thousands of datapoints in thousands of dimensions.
1 Why EM for PCA?
Principal component analysis (PCA) is a widely used dimensionality reduction technique in
data analysis. Its popularity comes from three important properties. First, it is the optimal
(in terms of mean squared error) linear scheme for compressing a set of high dimensional
vectors into a set of lower dimensional vectors and then reconstructing. Second, the model
parameters can be computed directly from the data – for example by diagonalizing the
sample covariance. Third, compression and decompression are easy operations to perform
given the model parameters – they require only matrix multiplications.
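To make these three properties concrete, here is a minimal NumPy sketch (my own illustration, not from the paper; all names and toy sizes are assumptions): diagonalize the sample covariance to obtain the parameters, then compress and decompress with matrix multiplications.

    # A minimal sketch of conventional PCA, assuming toy Gaussian data.
    import numpy as np

    rng = np.random.default_rng(0)
    p, n, k = 10, 500, 3                   # dimensions, datapoints, components
    Y = rng.standard_normal((p, n))        # stand-in for real data

    Ym = Y - Y.mean(axis=1, keepdims=True) # center the data
    S = Ym @ Ym.T / n                      # p x p sample covariance

    # Diagonalize the sample covariance; eigh sorts eigenvalues ascending.
    evals, evecs = np.linalg.eigh(S)
    C = evecs[:, ::-1][:, :k]              # leading k eigenvectors as columns

    # Compression and decompression are just matrix multiplications.
    X = C.T @ Ym                           # k x n low dimensional codes
    Y_hat = C @ X                          # p x n reconstruction (mean removed)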
Despite these attractive features, however, PCA models have several shortcomings. One is
that naive methods for finding the principal component directions have trouble with high
dimensional data or large numbers of datapoints. Consider attempting to diagonalize the
sample covariance matrix of $n$ vectors in a space of $p$ dimensions when $n$ and $p$
are several hundred or several thousand. Difficulties can arise both in the form of
computational complexity and also data scarcity.¹
Even computing the sample covariance itself is very costly,
requiring $O(np^2)$ operations. In general it is best to avoid altogether computing the sample
covariance explicitly.
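As a preview of how this paper's approach sidesteps that cost, here is a minimal sketch (my own code; names, toy sizes, and iteration count are assumptions) of an EM-style fixed-point iteration for the leading principal subspace. It touches the data only through products with the $n$ data columns and never forms the $p \times p$ covariance, so each iteration costs only $O(knp)$ for $k$ components.

    # A minimal sketch, assuming toy data: an EM-style iteration for the
    # leading-k principal subspace that never forms the p x p covariance.
    import numpy as np

    rng = np.random.default_rng(0)
    p, n, k = 1000, 200, 3                 # dimensions, datapoints, components
    Y = rng.standard_normal((p, n))        # stand-in for real data
    Y = Y - Y.mean(axis=1, keepdims=True)  # center the data

    C = rng.standard_normal((p, k))        # random initial loading matrix
    for _ in range(100):
        # E-step: project each datapoint onto the current subspace.
        X = np.linalg.solve(C.T @ C, C.T @ Y)          # k x n states
        # M-step: re-estimate the loadings from data and inferred states.
        C = Y @ X.T @ np.linalg.inv(X @ X.T)           # p x k loadings
    # Columns of C now span the leading principal subspace (not orthonormal).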
¹On the data scarcity front, we often do not have enough data in high dimensions for the sample
covariance to be of full rank and so we must be careful to employ techniques which do not require full
rank matrices. On the complexity front, direct diagonalization of a symmetric matrix thousands of
rows in size can be extremely costly since this operation is $O(p^3)$ for $p \times p$ inputs. Fortunately, several
techniques exist for efficient matrix diagonalization when only the first few leading eigenvectors and
eigenvalues are required (for example the power method [10], which is only $O(p^2)$ per iteration).
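To illustrate the power method mentioned in the footnote, here is a minimal sketch (my own construction, on an assumed symmetric test matrix): repeatedly multiplying a vector by the matrix and renormalizing converges to the leading eigenvector, at $O(p^2)$ cost per multiplication.

    # A minimal sketch of the power method on an assumed symmetric matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    p = 100
    A = rng.standard_normal((p, p))
    S = A @ A.T / p                        # symmetric test matrix

    v = rng.standard_normal(p)
    for _ in range(200):
        v = S @ v                          # one O(p^2) multiplication
        v /= np.linalg.norm(v)             # renormalize to avoid overflow
    lam = v @ S @ v                        # Rayleigh quotient: leading eigenvalue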