1. They use a stochastic gradient (SG) method to solve the
optimization problem. To avoid over-fitting, they train
with only one epoch.
2. FFM performs best among the six models they tried.
In this paper, we aim to concretely establish FFM as an
effective approach for CTR prediction. Our major results
are as follows.
• Though FFM is shown to be effective in [8], that work
may be the only published study of applying FFMs to
CTR prediction problems. To further demonstrate the
effectiveness of FFMs on CTR prediction, we present the
use of FFM as our major model to win two worldwide
CTR competitions hosted by Criteo and Avazu.
• We compare FFMs with two related models, Poly2 and
FMs. We first discuss conceptually why FFMs might be
better than them, and then conduct experiments to compare
their accuracy and training time.
• We present techniques for training FFMs. They include
an effective parallel optimization algorithm for FFMs and
the use of early-stopping to avoid over-fitting.
• To make FFMs available for public use, we release an
open-source software package.
This paper is organized as follows. Before we present
FFMs and their implementation in Section 3, we discuss the
two existing models Poly2 and FMs in Section 2. Experiments
comparing FFMs with other models are given in Section 4.
Finally, conclusions and future directions are in Section 5.
Code used for experiments in this paper and the package
LIBFFM are respectively available at:
http://www.csie.ntu.edu.tw/~cjlin/ffm/exps
http://www.csie.ntu.edu.tw/~cjlin/libffm
2. POLY2 AND FM
Chang et al. [4] have shown that a degree-2 polynomial
mapping can often effectively capture the information of
feature conjunctions. Further, they show that by applying a
linear model on the explicit form of degree-2 mappings,
training and testing can be much faster than using kernel
methods. This approach, referred to as Poly2, learns a
weight for each feature pair:
$$\phi_{\text{Poly2}}(\boldsymbol{w}, \boldsymbol{x}) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} w_{h(j_1, j_2)}\, x_{j_1} x_{j_2}, \qquad (2)$$
where $h(j_1, j_2)$ is a function encoding $j_1$ and $j_2$ into a natural
number. The complexity of computing (2) is $O(\bar{n}^2)$, where
$\bar{n}$ is the average number of non-zero elements per instance.
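To make the cost of (2) concrete, the following is a minimal Python sketch of evaluating the Poly2 mapping for one sparse instance. The names (weights, feats, h) and the list-of-pairs representation are illustrative only, not taken from LIBFFM or any other package.

```python
# Minimal sketch (illustrative, not from LIBFFM or VW): evaluate phi_Poly2
# in (2) for one sparse instance given as a list of (index j, value x_j).

def phi_poly2(weights, feats, h, B):
    """weights: length-B array of pair weights; h: a function mapping a
    feature pair (j1, j2) into {0, ..., B-1}.  Cost is O(n_bar^2) per
    instance, where n_bar is the number of non-zeros in `feats`."""
    total = 0.0
    for a in range(len(feats)):
        j1, x1 = feats[a]
        for b in range(a + 1, len(feats)):
            j2, x2 = feats[b]
            total += weights[h(j1, j2, B)] * x1 * x2
    return total
```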
FMs, proposed in [6], implicitly learn a latent vector for
each feature. Each latent vector contains k latent factors,
where k is a user-specified parameter. Then, the effect of
feature conjunction is modelled by the inner product of two
latent vectors:
$$\phi_{\text{FM}}(\boldsymbol{w}, \boldsymbol{x}) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (\boldsymbol{w}_{j_1} \cdot \boldsymbol{w}_{j_2})\, x_{j_1} x_{j_2}. \qquad (3)$$
The number of variables is $n \times k$, so directly computing (3)
costs $O(\bar{n}^2 k)$ time. Following [6], by re-writing (3) to
$$\phi_{\text{FM}}(\boldsymbol{w}, \boldsymbol{x}) = \frac{1}{2} \sum_{j=1}^{n} (\boldsymbol{s} - \boldsymbol{w}_{j} x_{j}) \cdot \boldsymbol{w}_{j} x_{j},$$
where
$$\boldsymbol{s} = \sum_{j'=1}^{n} \boldsymbol{w}_{j'} x_{j'},$$
the complexity is reduced to $O(\bar{n}k)$.
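This re-writing is what makes FMs practical on sparse data. As a rough illustration, a Python sketch of the $O(\bar{n}k)$ evaluation might look as follows; the names (W, feats, phi_fm) are illustrative and not the notation of [6] or of any released implementation.

```python
import numpy as np

# Sketch of evaluating phi_FM through the O(n_bar * k) re-writing above.
# W is an (n, k) array whose row j is the latent vector w_j; feats lists
# the (j, x_j) pairs of the non-zero features of one instance.

def phi_fm(W, feats):
    s = np.zeros(W.shape[1])
    for j, x in feats:                 # first pass: s = sum_j w_j * x_j
        s += W[j] * x
    total = 0.0
    for j, x in feats:                 # second pass: per-feature terms
        wx = W[j] * x
        total += np.dot(s - wx, wx)
    return 0.5 * total                 # equals the pairwise sum in (3)
```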
Rendle [6] explains why FMs can be better than Poly2
when the data set is sparse. Here we give a similar illustration
using the data set in Table 1. For example, there
is only one negative training instance for the pair (ESPN, Adidas).
For Poly2, a very negative weight $w_{\text{ESPN,Adidas}}$ might
be learned for this pair. For FMs, because the prediction of
(ESPN, Adidas) is determined by $\boldsymbol{w}_{\text{ESPN}} \cdot \boldsymbol{w}_{\text{Adidas}}$, and
because $\boldsymbol{w}_{\text{ESPN}}$ and $\boldsymbol{w}_{\text{Adidas}}$ are also learned from other pairs
(e.g., (ESPN, Nike) and (NBC, Adidas)), the prediction may be
more accurate. Another example is that there is no training
data for the pair (NBC, Gucci). For Poly2, the prediction on
this pair is trivial, but for FMs, because $\boldsymbol{w}_{\text{NBC}}$ and $\boldsymbol{w}_{\text{Gucci}}$
can be learned from other pairs, it is still possible to make a
meaningful prediction.
Note that in Poly2, a naive way to implement $h(j_1, j_2)$
is to consider every pair of features as a new feature [4].¹
This approach requires a model as large as $O(n^2)$, which is
usually impractical for CTR prediction because of the very large
$n$. Vowpal Wabbit (VW) [9], a widely used machine learning
package, solves this problem by hashing $j_1$ and $j_2$.² Our
implementation is similar to VW's approach. Specifically,
$$h(j_1, j_2) = \left( \tfrac{1}{2}(j_1 + j_2)(j_1 + j_2 + 1) + j_2 \right) \bmod B,$$
where the model size $B$ is a user-specified parameter.
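For reference, the inner expression $\tfrac{1}{2}(j_1 + j_2)(j_1 + j_2 + 1) + j_2$ is the Cantor pairing function, so distinct pairs map to distinct natural numbers and collide only through the final mod $B$. A minimal Python sketch of this hashing function:

```python
def h(j1, j2, B):
    # Cantor pairing of (j1, j2) followed by "mod B"; B is the user-specified
    # model size. (j1 + j2)(j1 + j2 + 1) is always even, so integer division
    # by 2 is exact.
    return ((j1 + j2) * (j1 + j2 + 1) // 2 + j2) % B
```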
In this paper, for simplicity of formulation, we do not
include linear or bias terms. However, in Section 4,
we include them for some experiments.
3. FFM
The idea of FFM originates from PITF [7], proposed for
recommender systems with personalized tags. PITF
assumes three available fields, User, Item, and Tag,
and factorizes (User, Item), (User, Tag), and (Item, Tag) in
separate latent spaces. In [8], PITF is generalized to more
fields (e.g., AdID, AdvertiserID, UserID, QueryID) and
effectively applied to CTR prediction. Because [7] aims at
recommender systems and is limited to three specific fields
(User, Item, and Tag), and [8] lacks detailed discussion on
FFM, in this section we provide a more comprehensive study
of FFMs on CTR prediction. For most CTR data sets like
that in Table 1, “features” can be grouped into “fields.” In
our example, the three features ESPN, Vogue, and NBC belong
to the field Publisher, and the other three features Nike,
Gucci, and Adidas belong to the field Advertiser. FFM is
a variant of FM that utilizes this information. To explain
how FFM works, we consider the following new example:
Clicked | Publisher (P) | Advertiser (A) | Gender (G)
Yes     | ESPN          | Nike           | Male
Recall that for FMs, $\phi_{\text{FM}}(\boldsymbol{w}, \boldsymbol{x})$ is
$$\boldsymbol{w}_{\text{ESPN}} \cdot \boldsymbol{w}_{\text{Nike}} + \boldsymbol{w}_{\text{ESPN}} \cdot \boldsymbol{w}_{\text{Male}} + \boldsymbol{w}_{\text{Nike}} \cdot \boldsymbol{w}_{\text{Male}}.$$
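For concreteness, a tiny self-contained Python sketch of these three inner-product terms for the example row; the one-hot indices, the choice of $k$, and the random latent vectors are illustrative only.

```python
import numpy as np

# Illustrative one-hot indices for the example row's features.
ESPN, Nike, Male = 0, 1, 2
k = 4                                   # number of latent factors (arbitrary)
W = np.random.rand(3, k)                # one latent vector per feature

phi = (W[ESPN].dot(W[Nike])             # Publisher-Advertiser interaction
       + W[ESPN].dot(W[Male])           # Publisher-Gender interaction
       + W[Nike].dot(W[Male]))          # Advertiser-Gender interaction
```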
¹ More precisely, [4] includes the original features as well,
though we do not consider such a setting until the experiments.
² See http://github.com/JohnLangford/vowpal_wabbit/
wiki/Feature-interactions for details.