Improved Distributed Principal Component Analysis
Maria-Florina Balcan
School of Computer Science
Carnegie Mellon University
ninamf@cs.cmu.edu
Vandana Kanchanapally
School of Computer Science
Georgia Institute of Technology
vvandana@gatech.edu
Yingyu Liang
Department of Computer Science
Princeton University
yingyul@cs.princeton.edu
David Woodruff
Almaden Research Center
IBM Research
dpwoodru@us.ibm.com
Abstract
We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. A key task in this setting is Principal Component Analysis (PCA), in which the servers would like to compute a low-dimensional subspace capturing as much of the variance of the union of their point sets as possible. Given a procedure for approximate PCA, one can use it to approximately solve problems such as k-means clustering and low-rank approximation. The essential properties of an approximate distributed PCA algorithm are its communication cost and computational efficiency for a given desired accuracy in downstream applications. We give new algorithms and analyses for distributed PCA which lead to improved communication and computational costs for k-means clustering and related problems. Our empirical study on real-world data shows a speedup of orders of magnitude while preserving communication cost, with only a negligible degradation in solution quality. Some of the techniques we develop, such as a general transformation from a constant-success-probability subspace embedding to a high-success-probability subspace embedding with dimension and sparsity independent of the success probability, may be of independent interest.
1 Introduction
Since data is often partitioned across multiple servers [20, 7, 18], there is increasing interest in computing on it in the distributed model. A basic tool for distributed data analysis is Principal Component Analysis (PCA). The goal of PCA is to find an r-dimensional (affine) subspace that captures as much of the variance of the data as possible. Hence, it can reveal low-dimensional structure in very high-dimensional data. Moreover, it can serve as a preprocessing step to reduce the data dimension in various machine learning tasks, such as Non-Negative Matrix Factorization (NNMF) [15] and Latent Dirichlet Allocation (LDA) [4].
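To make the object concrete: the best-fit r-dimensional subspace is spanned by the top r right singular vectors of the mean-centered data matrix. The following is a minimal centralized sketch, assuming a NumPy environment; the function and variable names are illustrative, not from the paper.

    import numpy as np

    def pca_subspace(A, r):
        # Best-fit r-dimensional affine subspace: center the points, then
        # take the top-r right singular vectors of the centered matrix.
        A_centered = A - A.mean(axis=0)
        _, _, Vt = np.linalg.svd(A_centered, full_matrices=False)
        return Vt[:r]  # rows form an orthonormal basis, shape (r, d)

    # Example: reduce 1000 points in d = 50 dimensions to r = 3 coordinates.
    A = np.random.randn(1000, 50)
    V = pca_subspace(A, r=3)
    A_reduced = (A - A.mean(axis=0)) @ V.T  # shape (1000, 3)

Running this on the union of all servers' points is the communication-heavy baseline that distributed protocols aim to approximate.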
In the distributed model, approximate PCA was used by Feldman et al. [9] to solve a number of shape-fitting problems such as k-means clustering. There, the approximation takes the form of a coreset, which has the property that local coresets can be easily combined across servers into a global coreset, thereby providing an approximate PCA for the union of the data sets. Designing small coresets therefore leads to communication-efficient protocols. Coresets have the nice property that their size typically does not depend on the number n of points being approximated. A beautiful property of the coresets developed in [9] is that, for approximate PCA, their size also depends only linearly on the dimension d, whereas previous coresets depended quadratically on d [8]. This gives the best known communication protocols for approximate PCA and k-means clustering.
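One way to picture the combine-across-servers idea is the following sketch: each server sends a small SVD-based summary of its local points, and the coordinator runs PCA on the stacked summaries. This illustrates the general pattern only, not the exact coreset construction of [9]; the names and the summary size t are assumptions, and centering is omitted for simplicity (linear rather than affine PCA).

    import numpy as np

    def local_summary(A_i, t):
        # Server side: keep only the top-t singular values/right vectors,
        # i.e. a t x d summary instead of the full n_i x d point set.
        _, s, Vt = np.linalg.svd(A_i, full_matrices=False)
        return np.diag(s[:t]) @ Vt[:t]

    def combined_pca(summaries, r):
        # Coordinator side: stack all summaries, take the top-r directions.
        S = np.vstack(summaries)
        _, _, Vt = np.linalg.svd(S, full_matrices=False)
        return Vt[:r]

    # Example: 4 servers, 500 local points each, in d = 50 dimensions.
    parts = [np.random.randn(500, 50) for _ in range(4)]
    V = combined_pca([local_summary(P, t=10) for P in parts], r=3)

Each server communicates only t x d numbers rather than its entire point set, which is why the summary (or coreset) size, not the number of points n, drives the communication cost.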