Improved Distributed Principal Component Analysis
Maria-Florina Balcan
School of Computer Science
Carnegie Mellon University
ninamf@cs.cmu.edu
Vandana Kanchanapally
School of Computer Science
Georgia Institute of Technology
vvandana@gatech.edu
Yingyu Liang
Department of Computer Science
Princeton University
yingyul@cs.princeton.edu
David Woodruff
Almaden Research Center
IBM Research
dpwoodru@us.ibm.com
Abstract
We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low-dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as k-means clustering and low-rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for k-means clustering and related problems. Our empirical study on real-world data shows a speedup of orders of magnitude while preserving communication cost, with only a negligible degradation in solution quality. Some of the techniques we develop, such as a general transformation from a constant-success-probability subspace embedding to a high-success-probability subspace embedding with dimension and sparsity independent of the success probability, may be of independent interest.
1 Introduction
Since data is often partitioned across multiple servers [20, 7, 18], there is increasing interest in computing on it in the distributed model. A basic tool for distributed data analysis is Principal Component Analysis (PCA). The goal of PCA is to find an r-dimensional (affine) subspace that captures as much of the variance of the data as possible. Hence, it can reveal low-dimensional structure in very high-dimensional data. Moreover, it can serve as a preprocessing step to reduce the data dimension in various machine learning tasks, such as Non-Negative Matrix Factorization (NNMF) [15] and Latent Dirichlet Allocation (LDA) [4].
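To make the object concrete: the best-fit r-dimensional subspace is spanned by the top r right singular vectors of the mean-centered data matrix. The following is a minimal centralized sketch, assuming a NumPy environment; the function and variable names are illustrative, not from the paper.

    import numpy as np

    def pca_subspace(A, r):
        # Best-fit r-dimensional affine subspace: center the points, then
        # take the top-r right singular vectors of the centered matrix.
        A_centered = A - A.mean(axis=0)
        _, _, Vt = np.linalg.svd(A_centered, full_matrices=False)
        return Vt[:r]  # rows form an orthonormal basis, shape (r, d)

    # Example: reduce 1000 points in d = 50 dimensions to r = 3 coordinates.
    A = np.random.randn(1000, 50)
    V = pca_subspace(A, r=3)
    A_reduced = (A - A.mean(axis=0)) @ V.T  # shape (1000, 3)

Running this on the union of all servers' points is the communication-heavy baseline that distributed protocols aim to approximate.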
In the distributed model, approximate PCA was used by Feldman et al. [9] to solve a number of shape-fitting problems such as k-means clustering. There, the approximation takes the form of a coreset, which has the property that local coresets can be easily combined across servers into a global coreset, thereby providing an approximate PCA for the union of the data sets. Designing small coresets therefore leads to communication-efficient protocols. Coresets have the nice property that their size typically does not depend on the number n of points being approximated. A beautiful property of the coresets developed in [9] is that, for approximate PCA, their size also depends only linearly on the dimension d, whereas previous coresets depended quadratically on d [8]. This gives the best known communication protocols for approximate PCA and k-means clustering.
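One way to picture the combine-across-servers idea is the following sketch: each server sends a small SVD-based summary of its local points, and the coordinator runs PCA on the stacked summaries. This illustrates the general pattern only, not the exact coreset construction of [9]; the names and the summary size t are assumptions, and centering is omitted for simplicity (linear rather than affine PCA).

    import numpy as np

    def local_summary(A_i, t):
        # Server side: keep only the top-t singular values/right vectors,
        # i.e. a t x d summary instead of the full n_i x d point set.
        _, s, Vt = np.linalg.svd(A_i, full_matrices=False)
        return np.diag(s[:t]) @ Vt[:t]

    def combined_pca(summaries, r):
        # Coordinator side: stack all summaries, take the top-r directions.
        S = np.vstack(summaries)
        _, _, Vt = np.linalg.svd(S, full_matrices=False)
        return Vt[:r]

    # Example: 4 servers, 500 local points each, in d = 50 dimensions.
    parts = [np.random.randn(500, 50) for _ in range(4)]
    V = combined_pca([local_summary(P, t=10) for P in parts], r=3)

Each server communicates only t x d numbers rather than its entire point set, which is why the summary (or coreset) size, not the number of points n, drives the communication cost.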