will converge to a good solution for L if the sequence of strata is chosen carefully. Our proof rests on stochastic approximation theory and regenerative process theory.
We then specialize SSGD to obtain a novel distributed matrix-
factorization algorithm, called DSGD (Sec. 5). Specifically, we
express the input matrix as a union of (possibly overlapping) pieces,
called “strata.” For each stratum, the stratum loss is defined as
the loss computed over only the data points in the stratum (and
appropriately scaled). The strata are carefully chosen so that each
stratum has “d-monomial” structure, which allows SGD to be run on
the stratum in a distributed manner. The DSGD algorithm repeatedly
selects a stratum according to the general SSGD procedure and
processes the stratum in a distributed fashion. Importantly, both
matrix and factors are fully distributed, so that DSGD has low
memory requirements and scales to matrices with millions of rows,
millions of columns, and billions of nonzero elements. When DSGD
is implemented in MapReduce (Sec. 6) and compared to state-of-the-
art distributed algorithms for matrix factorization, our experiments
(Sec. 7) suggest that DSGD converges significantly faster, and has
better scalability.
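To make the SSGD template concrete, the following is a minimal sketch of the outer loop just described (select a stratum, then run SGD steps on it). The function and parameter names are ours, and the uniform random stratum choice and constant step size are placeholders for the carefully chosen stratum sequence and step-size conditions that the convergence analysis requires.

```python
import random

def ssgd(strata, theta, grad, epochs=10, steps_per_stratum=100, eta=0.01):
    """Sketch of stratified SGD: repeatedly pick a stratum and run SGD on it.

    strata: list of strata, each a list of training points.
    grad(theta, point): gradient of the (scaled) stratum loss at one point.
    """
    for _ in range(epochs):
        stratum = random.choice(strata)               # placeholder stratum schedule
        for _ in range(steps_per_stratum):            # plain SGD within the stratum
            point = random.choice(stratum)
            theta = theta - eta * grad(theta, point)  # assumes array-like theta
    return theta
```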
Unlike many prior algorithms, DSGD is a generic algorithm
in that it can be used for a variety of different loss functions. In
this paper, we focus primarily on the class of factorizations that
minimize a “nonzero loss.” This class of loss functions is important
for applications in which a zero represents missing data and hence
should be ignored when computing loss. A typical motivation for
factorization in this setting is to estimate the missing values, e.g.,
the rating that a customer would likely give to a previously unseen
movie. See [10] for a treatment of other loss functions.
2. EXAMPLE AND PRIOR WORK
To gain understanding about applications of matrix factorizations,
consider the “Netflix problem” [3] of recommending movies to
customers. Netflix is a company that offers tens of thousands of
movies for rental. The company has more than 15M customers,
each of whom can provide feedback about their personal taste by
rating movies with 1 to 5 stars. The feedback can be represented in
a feedback matrix such as
              Avatar   The Matrix   Up
    Alice        ?          4        2
    Bob          3          2        ?
    Charlie      5          ?        3
Each entry may contain additional data, e.g., the date of rating or
other forms of feedback such as click history. The goal of the factor-
ization is to predict missing entries (denoted by “?”); entries with
a high predicted rating are then recommended to users for view-
ing. This matrix-factorization approach to recommender systems
has been successfully applied in practice; see [13] for an excellent
discussion of the underlying intuition.
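As a small illustration, such a feedback matrix is naturally stored sparsely. The dict-based representation below is only a toy sketch (not from the paper), with the “?” entries simply absent.

```python
# Toy sparse representation of the feedback matrix above: only observed
# ratings are stored; the "?" entries (to be predicted) are simply absent.
ratings = {
    ("Alice",   "The Matrix"): 4, ("Alice",   "Up"): 2,
    ("Bob",     "Avatar"):     3, ("Bob",     "The Matrix"): 2,
    ("Charlie", "Avatar"):     5, ("Charlie", "Up"): 3,
}

missing = [("Alice", "Avatar"), ("Bob", "Up"), ("Charlie", "The Matrix")]
print(all(pair not in ratings for pair in missing))  # True: the "?" cells
```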
The traditional matrix factorization problem can be stated as
follows. Given an m × n matrix V and a rank r, find an m × r matrix W and an r × n matrix H such that V = WH. As discussed previously, our actual goal is to obtain a low-rank approximation V ≈ WH, where the quality of the approximation is described by an application-dependent loss function L. We seek to find

    argmin_{W,H} L(V, W, H),

i.e., the choice of W and H that gives rise to the smallest loss. For
example, assuming that missing ratings are coded with the value
0, loss functions for recommender systems are often based on the
nonzero squared loss
L_NZSL = ∑_{i,j: V_ij ≠ 0} (V_ij − [WH]_ij)² and
usually incorporate regularization terms, user and movie biases, time
drifts, and implicit feedback.
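A minimal NumPy sketch of evaluating this nonzero squared loss (regularization, biases, time drift, and implicit feedback omitted); the helper name is ours, not from the paper.

```python
import numpy as np

def nonzero_squared_loss(V, W, H):
    """L_NZSL: sum of (V_ij - [WH]_ij)^2 over the nonzero entries of V.

    V is m x n with 0 encoding a missing rating; W is m x r; H is r x n.
    """
    i, j = np.nonzero(V)                    # observed entries only
    residuals = V[i, j] - (W @ H)[i, j]     # V_ij - [WH]_ij at those positions
    return float(np.sum(residuals ** 2))

# Example with the 3 x 3 feedback matrix above ("?" coded as 0) and rank r = 2:
V = np.array([[0., 4., 2.], [3., 2., 0.], [5., 0., 3.]])
rng = np.random.default_rng(0)
W, H = rng.random((3, 2)), rng.random((2, 3))
print(nonzero_squared_loss(V, W, H))
```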
In the following, we restrict attention to loss functions that, like
L_NZSL, can be decomposed into a sum of local losses over (a subset of) the entries V_ij. I.e., we require that the loss can be written as

    L = ∑_{(i,j)∈Z} l(V_ij, W_i∗, H_∗j)    (1)
for some training set
Z ⊆ {1, 2, . . . , m} × {1, 2, . . . , n} and local loss function l, where A_i∗ and A_∗j denote row i and column j of matrix A, respectively. Many loss functions used in practice, such as squared loss, generalized Kullback-Leibler divergence (GKL), and L_p regularization, can be decomposed in such a manner [19].
Note that a given loss function L can potentially be decomposed in multiple ways. In this paper, we focus primarily on the class of nonzero decompositions, in which Z = {(i, j) : V_ij ≠ 0}. As
mentioned above, such decompositions naturally arise when zeros
represent missing data. Our algorithms can handle other decomposi-
tions as well; see [10].
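A minimal sketch of evaluating a loss in the summation form (1); the training set Z and local loss l are supplied by the caller, and choosing Z as the nonzero entries of V with a squared local loss recovers L_NZSL above. The function names are ours.

```python
import numpy as np

def decomposed_loss(V, W, H, Z, local_loss):
    """L = sum over (i, j) in Z of l(V_ij, W_i*, H_*j), as in Eq. (1)."""
    return sum(local_loss(V[i, j], W[i, :], H[:, j]) for (i, j) in Z)

def squared_local_loss(v_ij, w_i, h_j):
    """Local loss giving the nonzero squared loss: the squared residual."""
    return (v_ij - w_i @ h_j) ** 2

# Nonzero decomposition: Z contains exactly the nonzero entries of V, e.g.
# Z = list(zip(*np.nonzero(V)))
```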
To compute W and H on MapReduce, all known algorithms start with some initial factors W_0 and H_0 and iteratively improve them. The m × n input matrix V is partitioned into d_1 × d_2 blocks, which
are distributed in the MapReduce cluster. Both row and column
factors are blocked conformingly:
               H^1         H^2        · · ·   H^{d_2}
    W^1        V^{11}      V^{12}     · · ·   V^{1 d_2}
    W^2        V^{21}      V^{22}     · · ·   V^{2 d_2}
     ⋮           ⋮            ⋮         ⋱         ⋮
    W^{d_1}    V^{d_1 1}   V^{d_1 2}  · · ·   V^{d_1 d_2}

where we use superscripts to refer to individual blocks. The algorithms are designed such that each block V^{ij} can be processed independently in the map phase, taking only the corresponding blocks of factors W^i and H^j as input. Some algorithms directly update the factors in the map phase (then either d_1 = m or d_2 = n to avoid overlap), whereas others aggregate the factor updates in a reduce phase.
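A minimal sketch of this conformal blocking in NumPy (a stand-in for the actual MapReduce partitioning); block V^{ij} can then be processed using only W^i and H^j. The function name is ours.

```python
import numpy as np

def block_conformingly(V, W, H, d1, d2):
    """Partition V into d1 x d2 blocks and block the factors W, H to match.

    Returns dicts keyed by block index: V_blocks[(i, j)] together with
    W_blocks[i] and H_blocks[j] is all a map task needs for block (i, j).
    """
    row_groups = np.array_split(np.arange(V.shape[0]), d1)
    col_groups = np.array_split(np.arange(V.shape[1]), d2)
    V_blocks = {(i, j): V[np.ix_(rows, cols)]
                for i, rows in enumerate(row_groups)
                for j, cols in enumerate(col_groups)}
    W_blocks = {i: W[rows, :] for i, rows in enumerate(row_groups)}
    H_blocks = {j: H[:, cols] for j, cols in enumerate(col_groups)}
    return V_blocks, W_blocks, H_blocks
```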
Existing algorithms can be classified as specialized algorithms,
which are designed for a particular loss, and generic algorithms,
which work for a wide variety of loss functions. Specialized algo-
rithms currently exist for only a small class of loss functions. For
GKL loss, Das et al. [8] provide an EM-based algorithm, and Liu et
al. [16] provide a multiplicative-update method. In [16], the latter
MULT approach is also applied to squared loss and to nonnegative
matrix factorization with an “exponential” loss function. Each of
these algorithms in essence takes an embarrassingly parallel matrix
factorization algorithm developed previously and directly distributes
it across the MapReduce cluster. Zhou et al. [20] show how to dis-
tribute the well-known alternating least squares (ALS) algorithm to
handle factorization problems with a nonzero squared loss function
and an optional weighted L_2 regularization term. Their approach requires a double-partitioning of V: once by row and once by column. Moreover, ALS requires that each of the factor matrices W and H can (alternately) fit in main memory. See [10] for details on
the foregoing algorithms.
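For comparison, a minimal non-distributed sketch of the ALS half-step that such approaches parallelize: with H fixed, each row of W solves a small regularized least-squares problem over that row's nonzero entries (the H half-step is symmetric). Weighting the L_2 term by the number of observed entries follows the weighted regularization mentioned above; this is our sketch, not the implementation of [20].

```python
import numpy as np

def als_update_W(V, W, H, lam):
    """One ALS half-step for nonzero squared loss with weighted L2 regularization.

    For each row i, minimize sum over nonzero j of (V_ij - W_i* H_*j)^2
    plus lam * n_i * ||W_i*||^2, where n_i is the number of nonzeros in row i.
    """
    r = H.shape[0]
    for i in range(V.shape[0]):
        cols = np.nonzero(V[i, :])[0]          # observed entries in row i
        if cols.size == 0:
            continue                           # nothing to fit for this row
        Hs = H[:, cols]                        # r x n_i slice of H
        A = Hs @ Hs.T + lam * cols.size * np.eye(r)
        b = Hs @ V[i, cols]
        W[i, :] = np.linalg.solve(A, b)
    return W
```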
Generic algorithms are able to handle all differentiable loss func-
tions that decompose into summation form. A simple approach is
distributed gradient descent (DGD) [9, 11, 17], which distributes
gradient computation across a compute cluster, and then performs
centralized parameter updates using, for example, quasi-Newton
methods such as L-BFGS-B [6]. Partitioned SGD approaches make
use of a similar idea: SGD is run independently and in parallel on