neck layer as a representation used to generalize recognition
beyond the set of identities used in training. The downsides
of this approach are its indirectness and its inefficiency: one
has to hope that the bottleneck representation generalizes
well to new faces; and by using a bottleneck layer the rep-
resentation size per face is usually very large (1000s of di-
mensions). Some recent work [15] has reduced this dimen-
sionality using PCA, but this is a linear transformation that
can be easily learnt in one layer of the network.
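To make that point concrete, the sketch below shows PCA expressed as a single linear layer; the function name and dimensions are illustrative assumptions, not from the paper.

```python
import numpy as np

def pca_as_linear_layer(X, d):
    """Fit PCA to d dimensions on data X (n x D) and return it as the
    linear map y = W(x - mu), i.e., exactly what one fully connected
    layer computes. Minimal sketch for illustration only."""
    mu = X.mean(axis=0)
    # Right singular vectors of the centered data are the principal axes.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    W = Vt[:d]                       # d x D projection matrix
    return lambda x: W @ (x - mu)    # equivalent to weights W, bias -W @ mu
```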
In contrast to these approaches, FaceNet directly trains
its output to be a compact 128-D embedding using a triplet-
based loss function based on LMNN [19]. Our triplets con-
sist of two matching face thumbnails and a non-matching
face thumbnail and the loss aims to separate the positive pair
from the negative by a distance margin. The thumbnails are
tight crops of the face area; no 2D or 3D alignment, other
than scale and translation, is performed.
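As a minimal sketch of a loss of this form (the exact formulation is given in section 3.1), assuming f already maps each thumbnail to an L2-normalized 128-D embedding and using an illustrative margin value:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss: the anchor-positive squared distance
    should be smaller than the anchor-negative squared distance by a
    margin alpha. Inputs are embedding vectors; alpha is illustrative."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    # Zero loss once the negative is pushed alpha beyond the positive.
    return max(pos_dist - neg_dist + alpha, 0.0)
```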
Choosing which triplets to use turns out to be very im-
portant for achieving good performance and, inspired by
curriculum learning [1], we present a novel online nega-
tive exemplar mining strategy which ensures consistently
increasing difficulty of triplets as the network trains. To
improve clustering accuracy, we also explore hard-positive
mining techniques which encourage spherical clusters for
the embeddings of a single person.
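The selection procedure itself is detailed in section 3.2; the sketch below shows one plausible online, in-batch form of such negative mining. The semi-hard criterion and all names here are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np

def semi_hard_negative(embeddings, labels, anchor_idx, positive_idx, alpha=0.2):
    """Pick a negative for (anchor, positive) from the current mini-batch:
    farther from the anchor than the positive, but still violating the
    margin (semi-hard). Illustrative sketch only."""
    a = embeddings[anchor_idx]
    pos_dist = np.sum((a - embeddings[positive_idx]) ** 2)
    best, best_dist = None, np.inf
    for j, n in enumerate(embeddings):
        if labels[j] == labels[anchor_idx]:
            continue  # only non-matching identities can be negatives
        neg_dist = np.sum((a - n) ** 2)
        # Semi-hard: harder than the positive, yet inside the margin.
        if pos_dist < neg_dist < pos_dist + alpha and neg_dist < best_dist:
            best, best_dist = j, neg_dist
    return best  # None if no semi-hard negative exists in this batch
```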
As an illustration of the incredible variability that our
method can handle, see Figure 1. Shown are image pairs
from PIE [13] that previously were considered to be very
difficult for face verification systems.
An overview of the rest of the paper is as follows: in
section 2 we review the literature in this area; section 3.1
defines the triplet loss and section 3.2 describes our novel
triplet selection and training procedure; in section 3.3 we
describe the model architecture used. Finally, in sections 4
and 5 we present quantitative results of our embeddings and
qualitatively explore some clustering results.
2. Related Work
Similarly to other recent works which employ deep net-
works [15, 17], our approach is a purely data driven method
which learns its representation directly from the pixels of
the face. Rather than using engineered features, we use a
large dataset of labelled faces to attain the appropriate in-
variances to pose, illumination, and other variational condi-
tions.
In this paper we explore two different deep network ar-
chitectures that have recently been used with great success in
the computer vision community. Both are deep convolu-
tional networks [8, 11]. The first architecture is based on the
Zeiler & Fergus [22] model, which consists of multiple inter-
leaved layers of convolutions, non-linear activations, local
response normalizations, and max pooling layers. We addi-
tionally add several 1×1×d convolution layers inspired by
the work of [9]. The second architecture is based on the
Inception model of Szegedy et al., which was recently used
as the winning approach for ImageNet 2014 [16]. These
networks use mixed layers that run several different convo-
lutional and pooling layers in parallel and concatenate their
responses. We have found that these models can reduce the
number of parameters by up to 20 times and have the poten-
tial to reduce the number of FLOPs required for comparable
performance.
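A back-of-the-envelope calculation illustrates where such savings come from; the channel counts below are made up for illustration and are not the paper's.

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

# Direct 5x5 convolution mapping 256 channels to 128 channels.
direct = conv_params(5, 256, 128)                            # 819,200 weights
# Same output via a 1x1 bottleneck to 32 channels first,
# as in Inception-style mixed layers.
reduced = conv_params(1, 256, 32) + conv_params(5, 32, 128)  # 110,592 weights
print(direct / reduced)                                      # ~7.4x fewer weights
```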
There is a vast corpus of face verification and recognition
works. Reviewing it is beyond the scope of this paper, so we
will only briefly discuss the most relevant recent work.
The works of [15, 17, 23] all employ a complex system
of multiple stages that combines the output of a deep con-
volutional network with PCA for dimensionality reduction
and an SVM for classification.
Zhenyao et al. [23] employ a deep network to “warp”
faces into a canonical frontal view and then learn a CNN that
classifies each face as belonging to a known identity. For
face verification, PCA on the network output in conjunction
with an ensemble of SVMs is used.
Taigman et al. [17] propose a multi-stage approach that
aligns faces to a general 3D shape model. A multi-class net-
work is trained to perform the face recognition task on over
four thousand identities. The authors also experimented
with a so-called Siamese network where they directly optimize
the L1-distance between two face features. Their best
performance on LFW (97.35%) stems from an ensemble of
three networks using different alignments and color chan-
nels. The predicted distances (non-linear SVM predictions
based on the χ² kernel) of those networks are combined us-
ing a non-linear SVM.
Sun et al. [14, 15] propose a compact and therefore rel-
atively cheap-to-compute network. They use an ensemble
of 25 of these networks, each operating on a different face
patch. For their final performance on LFW (99.47% [15])
the authors combine 50 responses (regular and flipped).
Both PCA and a Joint Bayesian model [2], which effectively
correspond to a linear transform in the embedding space, are
employed. Their method does not require explicit 2D/3D
alignment. The networks are trained by using a combina-
tion of classification and verification loss. The verification
loss is similar to the triplet loss we employ [12, 19], in that it
minimizes the L2-distance between faces of the same iden-
tity and enforces a margin between the distance of faces of
different identities. The main difference is that only pairs of
images are compared, whereas the triplet loss encourages a
relative distance constraint.
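For contrast, a common contrastive (pairwise) formulation looks as follows; this is a generic sketch, not necessarily the exact loss of [14, 15], and the margin value is illustrative.

```python
import numpy as np

def pair_verification_loss(x1, x2, same_identity, margin=1.0):
    """Pairwise verification loss: pull embeddings of the same identity
    together, push different identities beyond a fixed margin. Note it
    compares only two images, with no relative (triplet) constraint."""
    dist = np.sum((x1 - x2) ** 2)
    if same_identity:
        return dist
    return max(margin - dist, 0.0)
```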
A similar loss to the one used here was explored in
Wang et al. [18] for ranking images by semantic and visual
similarity.