One Millisecond Face Alignment with an Ensemble of Regression Trees
Vahid Kazemi and Josephine Sullivan
KTH, Royal Institute of Technology
Computer Vision and Active Perception Lab
Teknikringen 14, Stockholm, Sweden
{vahidk,sullivan}@csc.kth.se
Abstract
This paper addresses the problem of Face Alignment for
a single image. We show how an ensemble of regression
trees can be used to estimate the face’s landmark positions
directly from a sparse subset of pixel intensities, achieving
super-realtime performance with high quality predictions.
We present a general framework based on gradient boosting
for learning an ensemble of regression trees that optimizes
the sum of square error loss and naturally handles missing
or partially labelled data. We show how using appropriate
priors exploiting the structure of image data helps with ef-
ficient feature selection. Different regularization strategies
and its importance to combat overfitting are also investi-
gated. In addition, we analyse the effect of the quantity of
training data on the accuracy of the predictions and explore
the effect of data augmentation using synthesized data.
1. Introduction
In this paper we present a new algorithm that performs
face alignment in milliseconds and achieves accuracy supe-
rior or comparable to state-of-the-art methods on standard
datasets. The speed gains over previous methods is a con-
sequence of identifying the essential components of prior
face alignment algorithms and then incorporating them in
a streamlined formulation into a cascade of high capacity
regression functions learnt via gradient boosting.
We show, as others have [8, 2], that face alignment can
be solved with a cascade of regression functions. In our case
each regression function in the cascade efficiently estimates
the shape from an initial estimate and the intensities of a
sparse set of pixels indexed relative to this initial estimate.
Our work builds on the large amount of research over the
last decade that has resulted in significant progress for face
alignment [9, 4, 13, 7, 15, 1, 16, 18, 3, 6, 19]. In particular,
we incorporate into our learnt regression functions two key
elements that are present in several of the successful algo-
rithms cited and we detail these elements now.
Figure 1. Selected results on the HELEN dataset. An ensemble
of randomized regression trees is used to detect 194 landmarks on
face from a single image in a millisecond.
The first revolves around the indexing of pixel intensi-
ties relative to the current estimate of the shape. The ex-
tracted features in the vector representation of a face image
can greatly vary due to both shape deformation and nui-
sance factors such as changes in illumination conditions.
This makes accurate shape estimation using these features
difficult. The dilemma is that we need reliable features to
accurately predict the shape, and on the other hand we need
an accurate estimate of the shape to extract reliable features.
Previous work [4, 9, 5, 8] as well as this work, use an it-
erative approach (the cascade) to deal with this problem.
Instead of regressing the shape parameters based on fea-
tures extracted in the global coordinate system of the image,
the image is transformed to a normalized coordinate system
based on a current estimate of the shape, and then the fea-
tures are extracted to predict an update vector for the shape
parameters. This process is usually repeated several times
until convergence.
The second considers how to combat the difficulty of the
1