However, suggesting the right items is a highly nontrivial task: (1) There are many items to choose from. (2) Customers are willing to consider only a small number of recommendations (typically on the order of ten). Collaborative filtering addresses this problem by learning the suggestion function for a user from ratings provided by this and other users on items.
The known data can be thought of as a sparse $n \times m$ matrix $Y$ of rating/purchase information, where $n$ denotes the number of users and $m$ the number of items. In this context, $Y_{ij}$ indicates the rating of item $j$ given by user $i$. Typically, the rating is given on a five star scale and thus $Y \in \{0,\ldots,5\}^{n \times m}$, where the value 0 indicates that a user did not rate an item. In this sense, 0 is special since it does not indicate that a user dislikes an item but rather that data is missing.
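For concreteness, a minimal sketch of such a matrix in Python; the toy ratings below are invented purely for illustration:

import numpy as np

# Rows are users, columns are items; 0 marks a missing rating,
# not a rating of zero.
Y = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 1],
    [2, 0, 0, 5],
])
observed = Y > 0          # mask of the known (user, item) pairs
n, m = Y.shape            # here n = 3 users, m = 4 items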
Related work A common approach to collaborative filtering is to fit a factor model to the data, for example by extracting a feature vector for each user and item in the data set such that the inner product of these features minimizes an explicit or implicit loss functional (e.g. Hofmann 2004, following a probabilistic approach). The underlying idea behind these methods is that both user preferences and item properties can be modeled by a number of factors.
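In code, such a factor-model prediction is simply an inner product; the two-dimensional factors below are hypothetical:

import numpy as np

u_i = np.array([0.9, -0.2])   # latent factors of user i
m_j = np.array([1.1, 0.4])    # latent factors of item j
r_hat = float(u_i @ m_j)      # predicted affinity of user i for item j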
The basic idea of matrix factorization approaches is to fit the original matrix $Y$ with a low rank approximation $F$. More specifically, the goal is to find an approximation that minimizes the sum of the squared distances between the known entries in $Y$ and their predictions in $F$. One way of doing so is to compute a Singular Value Decomposition of $Y$ and to use only a small number of the singular vectors obtained by this procedure. In the information retrieval community this numerical operation is commonly referred to as Latent Semantic Indexing.
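A minimal sketch of this truncated SVD, continuing with the toy matrix Y defined above (note that it naively treats the missing entries as zeros, which is precisely the issue raised next):

import numpy as np

def low_rank_svd(Y, d):
    """Rank-d approximation of Y via a truncated SVD (as in LSI)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]

F = low_rank_svd(Y.astype(float), d=2)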
Note, however, that this method does not do justice to the way $Y$ was formed. An entry $Y_{ij} = 0$ indicates that we did not observe a (user, object) pair; it does not indicate that user $i$ disliked object $j$. In Srebro and Jaakkola (2003), an alternative approach is suggested which forms the basis of the method described in this paper. We aim to find two matrices $U \in \mathbb{R}^{n \times d}$ and $M \in \mathbb{R}^{d \times m}$ such that $F = UM$, with the goal of approximating the observed entries in $Y$ rather than approximating all entries at the same time.
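A sketch of this objective under a squared loss restricted to the observed entries; the function name and the random initialization of the factors are illustrative choices:

import numpy as np

def observed_squared_loss(U, M, Y, observed):
    """Sum of squared errors over the observed entries of Y only."""
    F = U @ M
    return np.sum((F[observed] - Y[observed]) ** 2)

rng = np.random.default_rng(0)
d = 2
U = rng.normal(size=(Y.shape[0], d))   # user factors, n x d
M = rng.normal(size=(d, Y.shape[1]))   # item factors, d x m
print(observed_squared_loss(U, M, Y, observed))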
In general, finding a globally optimal solution of the low rank approximation problem is unrealistic: in particular, the approach proposed by Srebro and Jaakkola (2003) for computing a weighted factorization, which is relevant in collaborative filtering settings, requires semidefinite programming, which is feasible only for hundreds, or at most thousands, of terms.
Departing from the goal of minimizing the rank, Maximum Margin Matrix Factorization (MMMF) aims at minimizing the Frobenius norms of $U$ and $M$, resulting in a set of problems that are convex when taken in isolation and thus tractable by current optimization techniques. It was shown in Srebro et al. (2005) and Srebro and Shraibman (2005) that optimizing the Frobenius norm is a good proxy for optimizing the rank in its application to model complexity control. Similar ideas based on matrix factorization have also been proposed in Salakhutdinov and Mnih (2008) and Takács et al. (2007).
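A minimal sketch of the resulting regularized objective, building on the loss above; the regularization weight lam is an assumed hyperparameter, and this squared-loss form is only one instance of the MMMF family:

def mmmf_objective(U, M, Y, observed, lam=0.1):
    """Observed squared error plus Frobenius-norm penalties on U and M."""
    reg = lam * (np.sum(U ** 2) + np.sum(M ** 2))
    return observed_squared_loss(U, M, Y, observed) + reg

Note that with $M$ held fixed this objective is convex in $U$ (and vice versa), which is what makes alternating optimization over the two factors tractable.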
Recently, Weimer et al. (2008) proposed to extend the general MMMF framework in order to minimize structured (ranking) losses instead of the sum of squared errors on the known ratings. The key observation is that collaborative filtering is often at the heart of recommender systems, for which only the ranking of unrated items in terms of the user's preferences matters. To enable effective optimization of the structured ranking loss, a novel optimization technique (Smola et al. 2008) was used to minimize the loss in terms of the Normalized Discounted Cumulative Gain (NDCG).
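For reference, a minimal sketch of NDCG at truncation level k; the exponential gain and logarithmic discount used here are a common convention and may differ in detail from the exact definition used by Weimer et al.:

import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the first k ranked items."""
    rel = np.asarray(relevances, dtype=float)[:k]
    return np.sum((2.0 ** rel - 1.0) / np.log2(np.arange(2, rel.size + 2)))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by that of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Ratings of a user's items, listed in predicted rank order:
print(ndcg_at_k([5, 3, 4, 1, 2], k=5))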