will converge to a good solution for L if the sequence of strata is chosen carefully. Our proof rests on stochastic approximation theory and regenerative process theory.
We then specialize SSGD to obtain a novel distributed matrix-
factorization algorithm, called DSGD (Sec. 5). Specifically, we
express the input matrix as a union of (possibly overlapping) pieces,
called “strata.” For each stratum, the stratum loss is defined as
the loss computed over only the data points in the stratum (and
appropriately scaled). The strata are carefully chosen so that each
stratum has “d-monomial” structure, which allows SGD to be run on
the stratum in a distributed manner. The DSGD algorithm repeatedly
selects a stratum according to the general SSGD procedure and
processes the stratum in a distributed fashion. Importantly, both
matrix and factors are fully distributed, so that DSGD has low
memory requirements and scales to matrices with millions of rows,
millions of columns, and billions of nonzero elements. When DSGD
is implemented in MapReduce (Sec. 6) and compared to state-of-the-
art distributed algorithms for matrix factorization, our experiments
(Sec. 7) suggest that DSGD converges significantly faster, and has
better scalability.
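To make the SSGD template concrete, the following is a minimal sketch of the outer loop just described (select a stratum, then run SGD steps on it). The function and parameter names are ours, and the uniform random stratum choice and constant step size are placeholders for the carefully chosen stratum sequence and step-size conditions that the convergence analysis requires.

```python
import random

def ssgd(strata, theta, grad, epochs=10, steps_per_stratum=100, eta=0.01):
    """Sketch of stratified SGD: repeatedly pick a stratum and run SGD on it.

    strata: list of strata, each a list of training points.
    grad(theta, point): gradient of the (scaled) stratum loss at one point.
    """
    for _ in range(epochs):
        stratum = random.choice(strata)               # placeholder stratum schedule
        for _ in range(steps_per_stratum):            # plain SGD within the stratum
            point = random.choice(stratum)
            theta = theta - eta * grad(theta, point)  # assumes array-like theta
    return theta
```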
Unlike many prior algorithms, DSGD is a generic algorithm
in that it can be used for a variety of different loss functions. In
this paper, we focus primarily on the class of factorizations that
minimize a “nonzero loss.” This class of loss functions is important
for applications in which a zero represents missing data and hence
should be ignored when computing loss. A typical motivation for
factorization in this setting is to estimate the missing values, e.g.,
the rating that a customer would likely give to a previously unseen
movie. See [10] for a treatment of other loss functions.
2. EXAMPLE AND PRIOR WORK
To gain understanding about applications of matrix factorizations,
consider the “Netflix problem” [3] of recommending movies to
customers. Netflix is a company that offers tens of thousands of
movies for rental. The company has more than 15M customers,
each of whom can provide feedback about their personal taste by
rating movies with 1 to 5 stars. The feedback can be represented in
a feedback matrix such as
              Avatar   The Matrix   Up
    Alice        ?          4        2
    Bob          3          2        ?
    Charlie      5          ?        3
Each entry may contain additional data, e.g., the date of rating or
other forms of feedback such as click history. The goal of the factor-
ization is to predict missing entries (denoted by “?”); entries with
a high predicted rating are then recommended to users for view-
ing. This matrix-factorization approach to recommender systems
has been successfully applied in practice; see [13] for an excellent
discussion of the underlying intuition.
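As a small illustration, such a feedback matrix is naturally stored sparsely. The dict-based representation below is only a toy sketch (not from the paper), with the “?” entries simply absent.

```python
# Toy sparse representation of the feedback matrix above: only observed
# ratings are stored; the "?" entries (to be predicted) are simply absent.
ratings = {
    ("Alice",   "The Matrix"): 4, ("Alice",   "Up"): 2,
    ("Bob",     "Avatar"):     3, ("Bob",     "The Matrix"): 2,
    ("Charlie", "Avatar"):     5, ("Charlie", "Up"): 3,
}

missing = [("Alice", "Avatar"), ("Bob", "Up"), ("Charlie", "The Matrix")]
print(all(pair not in ratings for pair in missing))  # True: the "?" cells
```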
The traditional matrix factorization problem can be stated as
follows. Given an m × n matrix V and a rank r, find an m × r matrix W and an r × n matrix H such that V = WH. As discussed previously, our actual goal is to obtain a low-rank approximation V ≈ WH, where the quality of the approximation is described by an application-dependent loss function L. We seek to find

    argmin_{W,H} L(V, W, H),

i.e., the choice of W and H that gives rise to the smallest loss. For
example, assuming that missing ratings are coded with the value
0, loss functions for recommender systems are often based on the
nonzero squared loss
L_NZSL = ∑_{i,j: V_ij ≠ 0} (V_ij − [WH]_ij)² and
usually incorporate regularization terms, user and movie biases, time
drifts, and implicit feedback.
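A minimal NumPy sketch of evaluating this nonzero squared loss (regularization, biases, time drift, and implicit feedback omitted); the helper name is ours, not from the paper.

```python
import numpy as np

def nonzero_squared_loss(V, W, H):
    """L_NZSL: sum of (V_ij - [WH]_ij)^2 over the nonzero entries of V.

    V is m x n with 0 encoding a missing rating; W is m x r; H is r x n.
    """
    i, j = np.nonzero(V)                    # observed entries only
    residuals = V[i, j] - (W @ H)[i, j]     # V_ij - [WH]_ij at those positions
    return float(np.sum(residuals ** 2))

# Example with the 3 x 3 feedback matrix above ("?" coded as 0) and rank r = 2:
V = np.array([[0., 4., 2.], [3., 2., 0.], [5., 0., 3.]])
rng = np.random.default_rng(0)
W, H = rng.random((3, 2)), rng.random((2, 3))
print(nonzero_squared_loss(V, W, H))
```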
In the following, we restrict attention to loss functions that, like
L_NZSL, can be decomposed into a sum of local losses over (a subset of) the entries V_ij. I.e., we require that the loss can be written as

    L = ∑_{(i,j)∈Z} l(V_ij, W_i∗, H_∗j)    (1)
for some training set
Z ⊆ {1, 2, . . . , m} × {1, 2, . . . , n} and local loss function l, where A_i∗ and A_∗j denote row i and column j of matrix A, respectively. Many loss functions used in practice, such as squared loss, generalized Kullback-Leibler divergence (GKL), and L_p regularization, can be decomposed in such a manner [19].
Note that a given loss function L can potentially be decomposed in multiple ways. In this paper, we focus primarily on the class of nonzero decompositions, in which Z = {(i, j) : V_ij ≠ 0}. As
mentioned above, such decompositions naturally arise when zeros
represent missing data. Our algorithms can handle other decomposi-
tions as well; see [10].
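A minimal sketch of evaluating a loss in the summation form (1); the training set Z and local loss l are supplied by the caller, and choosing Z as the nonzero entries of V with a squared local loss recovers L_NZSL above. The function names are ours.

```python
import numpy as np

def decomposed_loss(V, W, H, Z, local_loss):
    """L = sum over (i, j) in Z of l(V_ij, W_i*, H_*j), as in Eq. (1)."""
    return sum(local_loss(V[i, j], W[i, :], H[:, j]) for (i, j) in Z)

def squared_local_loss(v_ij, w_i, h_j):
    """Local loss giving the nonzero squared loss: the squared residual."""
    return (v_ij - w_i @ h_j) ** 2

# Nonzero decomposition: Z contains exactly the nonzero entries of V, e.g.
# Z = list(zip(*np.nonzero(V)))
```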
To compute W and H on MapReduce, all known algorithms start with some initial factors W_0 and H_0 and iteratively improve them. The m × n input matrix V is partitioned into d_1 × d_2 blocks, which
are distributed in the MapReduce cluster. Both row and column
factors are blocked conformingly:
               H^1         H^2        · · ·   H^{d_2}
    W^1        V^{11}      V^{12}     · · ·   V^{1 d_2}
    W^2        V^{21}      V^{22}     · · ·   V^{2 d_2}
     ⋮           ⋮            ⋮         ⋱         ⋮
    W^{d_1}    V^{d_1 1}   V^{d_1 2}  · · ·   V^{d_1 d_2}

where we use superscripts to refer to individual blocks. The algorithms are designed such that each block V^{ij} can be processed independently in the map phase, taking only the corresponding blocks of factors W^i and H^j as input. Some algorithms directly update the factors in the map phase (then either d_1 = m or d_2 = n to avoid overlap), whereas others aggregate the factor updates in a reduce phase.
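A minimal sketch of this conformal blocking in NumPy (a stand-in for the actual MapReduce partitioning); block V^{ij} can then be processed using only W^i and H^j. The function name is ours.

```python
import numpy as np

def block_conformingly(V, W, H, d1, d2):
    """Partition V into d1 x d2 blocks and block the factors W, H to match.

    Returns dicts keyed by block index: V_blocks[(i, j)] together with
    W_blocks[i] and H_blocks[j] is all a map task needs for block (i, j).
    """
    row_groups = np.array_split(np.arange(V.shape[0]), d1)
    col_groups = np.array_split(np.arange(V.shape[1]), d2)
    V_blocks = {(i, j): V[np.ix_(rows, cols)]
                for i, rows in enumerate(row_groups)
                for j, cols in enumerate(col_groups)}
    W_blocks = {i: W[rows, :] for i, rows in enumerate(row_groups)}
    H_blocks = {j: H[:, cols] for j, cols in enumerate(col_groups)}
    return V_blocks, W_blocks, H_blocks
```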
Existing algorithms can be classified as specialized algorithms,
which are designed for a particular loss, and generic algorithms,
which work for a wide variety of loss functions. Specialized algo-
rithms currently exist for only a small class of loss functions. For
GKL loss, Das et al. [8] provide an EM-based algorithm, and Liu et
al. [16] provide a multiplicative-update method. In [16], the latter
MULT approach is also applied to squared loss and to nonnegative
matrix factorization with an “exponential” loss function. Each of
these algorithms in essence takes an embarrassingly parallel matrix
factorization algorithm developed previously and directly distributes
it across the MapReduce cluster. Zhou et al. [20] show how to dis-
tribute the well-known alternating least squares (ALS) algorithm to
handle factorization problems with a nonzero squared loss function
and an optional weighted L_2 regularization term. Their approach requires a double-partitioning of V: once by row and once by column. Moreover, ALS requires that each of the factor matrices W and H can (alternately) fit in main memory. See [10] for details on
the foregoing algorithms.
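For comparison, a minimal non-distributed sketch of the ALS half-step that such approaches parallelize: with H fixed, each row of W solves a small regularized least-squares problem over that row's nonzero entries (the H half-step is symmetric). Weighting the L_2 term by the number of observed entries follows the weighted regularization mentioned above; this is our sketch, not the implementation of [20].

```python
import numpy as np

def als_update_W(V, W, H, lam):
    """One ALS half-step for nonzero squared loss with weighted L2 regularization.

    For each row i, minimize sum over nonzero j of (V_ij - W_i* H_*j)^2
    plus lam * n_i * ||W_i*||^2, where n_i is the number of nonzeros in row i.
    """
    r = H.shape[0]
    for i in range(V.shape[0]):
        cols = np.nonzero(V[i, :])[0]          # observed entries in row i
        if cols.size == 0:
            continue                           # nothing to fit for this row
        Hs = H[:, cols]                        # r x n_i slice of H
        A = Hs @ Hs.T + lam * cols.size * np.eye(r)
        b = Hs @ V[i, cols]
        W[i, :] = np.linalg.solve(A, b)
    return W
```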
Generic algorithms are able to handle all differentiable loss func-
tions that decompose into summation form. A simple approach is
distributed gradient descent (DGD) [9, 11, 17], which distributes
gradient computation across a compute cluster, and then performs
centralized parameter updates using, for example, quasi-Newton
methods such as L-BFGS-B [6]. Partitioned SGD approaches make
use of a similar idea: SGD is run independently and in parallel on