Face Alignment by Explicit Shape Regression
Xudong Cao Yichen Wei Fang Wen Jian Sun
Microsoft Research Asia
{xudongca,yichenw,fangwen,jiansun}@microsoft.com
Abstract. We present a very efficient, highly accurate
“Explicit Shape Regression” approach for face alignment.
Unlike previous regression-based approaches, we directly
learn a vectorial regression function to infer the whole
facial shape (a set of facial landmarks) from the image and
explicitly minimize the alignment errors over the training
data. The inherent shape constraint is naturally encoded
into the regressor in a cascaded learning framework and
applied from coarse to fine during testing, without using a
fixed parametric shape model as in most previous methods.
To make the regression more effective and efficient, we
design a two-level boosted regression, shape-indexed
features and a correlation-based feature selection method.
This combination enables us to learn accurate models from
large training data in a short time (20 minutes for 2,000
training images), and run regression extremely fast at test
time (15 ms for an 87-landmark shape). Experiments on
challenging data show that our approach significantly
outperforms the state of the art in terms of both accuracy
and efficiency.
1. Introduction
Face alignment, or locating semantic facial landmarks
such as eyes, nose, mouth and chin, is essential for tasks
like face recognition, face tracking, face animation and 3D
face modeling. With the explosive increase in personal and
web photos nowadays, a fully automatic, highly efficient
and robust face alignment method is in demand. Such
requirements are still challenging for current approaches in
unconstrained environments, due to large variations in
facial appearance, illumination, and partial occlusions.
A face shape $S = [x_1, y_1, \ldots, x_{N_{fp}}, y_{N_{fp}}]^T$
consists of $N_{fp}$ facial landmarks. Given a face image,
the goal of face alignment is to estimate a shape $S$ that
is as close as possible to the true shape $\hat{S}$, i.e.,
minimizing
$$\|S - \hat{S}\|_2. \qquad (1)$$
The alignment error in Eq. (1) is usually used to guide
the training and evaluate the performance. However, during
testing, we cannot directly minimize it as $\hat{S}$ is
unknown. According to how $S$ is estimated, most alignment
approaches can be classified into two categories:
optimization-based and regression-based.
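The alignment error in Eq. (1) can be sketched in a few
lines. This is a minimal illustration, not the authors'
code: the function name alignment_error, the flat
[x1, y1, x2, y2, ...] list layout, and the three-landmark
coordinates are all assumptions made for the example.

```python
import math


def alignment_error(shape, true_shape):
    """L2 distance between an estimated and a true shape, as in Eq. (1).

    Both shapes are flat sequences [x1, y1, ..., xN, yN] (assumed layout).
    """
    if len(shape) != len(true_shape):
        raise ValueError("shapes must have the same number of coordinates")
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(shape, true_shape)))


# Hypothetical 3-landmark shapes; only the first landmark is misplaced,
# by 3 pixels in x and 4 pixels in y.
true_shape = [30.0, 40.0, 70.0, 40.0, 50.0, 80.0]
estimate = [33.0, 44.0, 70.0, 40.0, 50.0, 80.0]

error = alignment_error(estimate, true_shape)
print(error)  # 5.0
```

Such an error can only be computed when the true shape is
available, which is why it serves for training and
evaluation but cannot be minimized directly at test time.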
Optimization-based methods instead minimize another error
function that is correlated with Eq. (1). Such methods
depend on the goodness of the error function and whether
it can be optimized well. For example, the AAM
approach [13, 16, 17, 3] reconstructs the entire face using an
appearance model and estimates the shape by minimizing
the texture residual. Because the learned appearance
models have limited expressive power to capture complex and
subtle face image variations in pose, expression, and
illumination, the approach may not work well on unseen faces.
It is also well known that AAM is sensitive to the
initialization due to the gradient descent optimization.
Regression-based methods learn a regression function
that directly maps image appearance to the target
output. The complex variations are learnt from large
training data and testing is usually efficient. However,
previous methods of this kind [6, 19, 7, 16, 17] have certain
drawbacks in attaining the goal of minimizing Eq. (1).
Approaches in [7, 16, 17] rely on a parametric model (e.g.,
AAM) and minimize model parameter errors in the training.
This is indirect and sub-optimal because smaller parameter
errors are not necessarily equivalent to smaller alignment
errors. Approaches in [6, 19] learn regressors for individual
landmarks, effectively using Eq. (1) as their loss function.
However, because only local image patches are used in
training and appearance correlation between landmarks is not
exploited, such learned regressors are usually weak and
cannot handle large pose variation and partial occlusion.
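A toy example shows how a smaller parameter error can
coexist with a larger alignment error. It is a sketch under
stated assumptions, not a model from the literature cited
above: the linear form (mean shape plus weighted basis
vectors), the helper names l2 and synthesize, and all
numbers are hypothetical, and the basis vectors are
deliberately given unequal norms so that a unit change in
one parameter moves the landmarks further than a unit
change in the other.

```python
import math


def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def synthesize(mean, basis, params):
    """Toy linear shape model: mean + sum_k params[k] * basis[k]."""
    shape = list(mean)
    for weight, vector in zip(params, basis):
        shape = [s + weight * b for s, b in zip(shape, vector)]
    return shape


# Hypothetical 2-landmark model, shapes stored as [x1, y1, x2, y2].
mean = [0.0, 0.0, 10.0, 0.0]
basis = [
    [3.0, 0.0, 0.0, 4.0],  # mode 1, norm 5: moves landmarks a lot
    [0.0, 0.6, 0.8, 0.0],  # mode 2, norm 1: moves landmarks a little
]

p_true = [1.0, 1.0]
p_a = [1.2, 1.0]  # off by about 0.2 in mode 1
p_b = [1.0, 1.5]  # off by 0.5 in mode 2

true_shape = synthesize(mean, basis, p_true)
param_err_a, param_err_b = l2(p_a, p_true), l2(p_b, p_true)
align_err_a = l2(synthesize(mean, basis, p_a), true_shape)
align_err_b = l2(synthesize(mean, basis, p_b), true_shape)

# Estimate A wins on parameter error but loses on alignment error.
print(param_err_a < param_err_b, align_err_a > align_err_b)  # True True
```

Here estimate A has the smaller parameter error (about 0.2
versus 0.5) but the larger alignment error (about 1.0
versus 0.5), so a regressor trained to reduce parameter
error would prefer the worse shape.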
We notice that the shape constraint is essential in all
methods. Only a few salient landmarks (e.g., eye centers,
mouth corners) can be reliably characterized by their
image appearances. Many other non-salient landmarks (e.g.,
points along the face contour) need help from the shape
constraint, i.e., the correlation between landmarks. Most
previous works use a parametric shape model to enforce such a
constraint, such as the PCA model in AAM [3, 13] and ASM [4, 6].
Despite the success of parametric shape models, the
model flexibility (e.g., PCA dimension) is often heuristical-