Face Alignment by Explicit Shape Regression
Int J Comput Vis
DOI 10.1007/s11263-013-0667-3
performance. In their experiment, this method has also been used to estimate a face shape which is modeled by a simple parametric ellipse (Dollar et al. 2010).

Our method improves the cascaded pose regression framework in several important aspects and works better for the face alignment problem. We adopt a non-parametric representation and directly estimate the facial landmarks by minimizing the alignment error instead of a parameter error. Consequently, the underlying shape constraint is preserved automatically. To address the very challenging high-dimensional regression problem, we further propose several improvements: a two-level boosted regression, effective shape indexed features, a fast correlation-based feature selection method and sparse coding based model compression, so that: (1) we can quickly learn accurate models from large training data (20 min on 2,000 training samples); (2) the resulting regressor is extremely efficient in the test (15 ms for 87 facial landmarks); (3) the model size is reasonably small (a few megabytes) and applicable in many scenarios. We show superior results on several challenging datasets.

2 Face Alignment by Shape Regression

In this section, we describe our shape regression framework and discuss how it fits the face alignment task.

We cast our solution to the shape regression task into the gradient boosting regression framework (Friedman 2001; Duffy and Helmbold 2002), a representative approach of ensemble learning. In training, it sequentially learns a series of weak learners to greedily minimize the regression loss function. In testing, it simply combines the pre-learnt weak learners in an additive manner to give the final prediction.

Before specifying how we resolve the face alignment task in the gradient boosting framework, we first clarify a simple and basic term, the normalized shape. Given the predefined mean shape S̄, the normalized shape of an input shape S is obtained by the similarity transform which aligns the input shape to the mean shape¹ by minimizing their L2 distance,

    M_S = argmin_M || S̄ − M ◦ S ||²,    (2)

where M_S ◦ S is the normalized shape. Now we are ready to describe our shape regression framework.

In training, given N training samples {I_i, Ŝ_i, S_i^0}_{i=1}^N, the stage regressors (R^1, ..., R^T) are sequentially learnt to reduce the alignment errors on the training set. In each stage t, the stage regressor R^t is formally learnt as follows,

    R^t = argmin_R Σ_{i=1}^N || y_i − R(I_i, S_i^{t−1}) ||²,    (3)
    y_i = M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1}),

where S_i^{t−1} is the shape estimated in the previous stage t−1 and M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1}) is the normalized regression target, which will be discussed later. Note that herein we only apply the scale and rotation transformation to normalize the regression target.

In testing, given a facial image I and an initial shape S^0, each stage regressor computes a normalized shape increment from image features and then updates the face shape, in a cascaded manner:

    S_i^t = S_i^{t−1} + M_{S_i^{t−1}}^{−1} ◦ R^t(I_i, S_i^{t−1}),    (4)

where the stage regressor R^t updates the previous shape S^{t−1} to the new shape S^t in stage t. Note that we only scale and rotate (without translating) the output of the stage regressor, according to the angle and the scale of the previous shape.

Although our basic face alignment framework follows the gradient boosting framework, three components are specifically designed for the shape regression task. As these components are generic, regardless of the specific forms of the weak learner and the feature used for learning, it is worth clarifying and discussing them here.

Shape Indexed Feature The feature for learning the weak learner R^t depends on both the image I and the previously estimated shape S^{t−1}. It improves the performance by achieving geometric invariance. In other words, the feature is extracted relative to S^{t−1} to eliminate two kinds of variations: the variations due to scale, rotation and translation, and the variations due to identity, pose and expression (i.e. the similarity transform M_{S^{t−1}}^{−1} and the normalized shape M_{S^{t−1}} ◦ S^{t−1}). It is worth mentioning that although the shape indexed feature sounds complex, our designed feature is extremely cheap to compute and does not require image warping. The details of our feature will be described in Sect. 2.4.

Regressing the Normalized Target Instead of learning the mapping from the shape indexed feature to Ŝ_i − S_i^{t−1}, we argue that the regression task is simplified if we regress the normalized target M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1}), because the normalized target is invariant to similarity transforms. To better understand its advantage, imagine two identical facial images with the same estimated shape. One of them is changed by a similarity transform while the other is kept unchanged. Due to the transform, the regression targets (shape increments) of the transformed

¹ It is also interesting to know that the mean shape is defined as the average of the normalized training shapes. Although this sounds like a circular definition, we can still compute the mean shape in an iterative way. Readers are referred to the Active Shape Model (Cootes et al. 1995) for details.
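The test-time cascade of Eq. (4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `esr_predict` and `similarity_align` are hypothetical names, the stage regressors are passed in as callables, and the similarity transform is restricted to scale and rotation as the paper notes.

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares scale+rotation (no translation, per the paper's note)
    mapping src (L x 2) toward dst (L x 2). Returns a 2x2 matrix A such
    that src @ A.T approximates dst."""
    a = (src * dst).sum()                                  # sum of x_s*x_d + y_s*y_d
    b = (src[:, 0] * dst[:, 1] - src[:, 1] * dst[:, 0]).sum()
    d = (src ** 2).sum()
    p, q = a / d, b / d                                    # closed-form LSQ solution
    return np.array([[p, -q], [q, p]])

def esr_predict(image, S0, stage_regressors, mean_shape):
    """Cascaded update of Eq. (4): S^t = S^{t-1} + M^{-1} o R^t(I, S^{t-1})."""
    S = S0.copy()
    for R_t in stage_regressors:
        M = similarity_align(S, mean_shape)        # M_S: current shape -> mean shape
        delta_norm = R_t(image, S)                 # normalized shape increment (L x 2)
        S = S + delta_norm @ np.linalg.inv(M).T    # un-normalize the increment
    return S
```

With an "ideal" stage regressor that returns the exact normalized target M ◦ (Ŝ − S), one cascade step recovers the groundtruth shape, which is a quick sanity check of the normalize/un-normalize round trip.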
image are different from those of the unchanged one. Hence the regression task is complicated. In contrast, the regression task stays simple if we regress the normalized targets, which are still the same for both samples.

Data Augmentation and Multiple Initialization Unlike a typical regression task, the sample of our shape regression task is a triple defined by the facial image, the groundtruth shape and the initial shape, i.e. {I_i, Ŝ_i, S_i^0}. So we can augment the samples by generating multiple initial shapes for one image. In fact, it turns out such a simple operation not only effectively improves the generalization of training, but also reduces the variation of the final prediction by bagging the results obtained from multiple initializations. See Sect. 3 for experimental validations and further discussions.

To make the training and testing of our framework clear, we present them in pseudocode.

Algorithm 1 Explicit Shape Regression (ESR)
Variables: Training images and labeled shapes {I_l, Ŝ_l}_{l=1}^L; ESR model {R^t}_{t=1}^T; testing image I; predicted shape S; TrainParams{times of data augmentation N_aug, number of stages T}; TestParams{number of multiple initializations N_init}; InitSet, which contains exemplar shapes for initialization.

ESRTraining({I_l, Ŝ_l}_{l=1}^L, TrainParams, InitSet)
    // augment training data
    {I_i, Ŝ_i, S_i^0}_{i=1}^N ← Initialization({I_l, Ŝ_l}_{l=1}^L, N_aug, InitSet)
    for t from 1 to T
        Y ← {M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1})}_{i=1}^N    // compute normalized targets
        R^t ← LearnStageRegressor(Y, {I_i, S_i^{t−1}}_{i=1}^N)    // using Eq. (3)
        for i from 1 to N
            S_i^t ← S_i^{t−1} + M_{S_i^{t−1}}^{−1} ◦ R^t(I_i, S_i^{t−1})
    return {R^t}_{t=1}^T

ESRTesting(I, {R^t}_{t=1}^T, TestParams, InitSet)
    // multiple initializations
    {I_i, ∗, S_i^0}_{i=1}^{N_init} ← Initialization({I, ∗}, N_init, InitSet)
    for t from 1 to T
        for i from 1 to N_init
            S_i^t ← S_i^{t−1} + M_{S_i^{t−1}}^{−1} ◦ R^t(I_i, S_i^{t−1})
    S ← CombineMultipleResults({S_i^T}_{i=1}^{N_init})
    return S

Initialization({I_c, Ŝ_c}_{c=1}^C, D, InitSet)
    i ← 1
    for c from 1 to C
        for d from 1 to D
            S_i^o ← sample an exemplar shape from InitSet
            {I_i^o, Ŝ_i^o} ← {I_c, Ŝ_c}
            i ← i + 1
    return {I_i^o, Ŝ_i^o, S_i^o}_{i=1}^{CD}

The details about combining multiple results and initialization (including constructing InitSet and sampling from InitSet) will be discussed later in Sect. 3.

As the stage regressor plays a vital role in our shape regression framework, we will focus on it next. We will discuss what kinds of regressors are suitable for our framework and present a series of methods for effectively learning the stage regressor.

2.2 Two-Level Boosted Regression

Conventionally, the stage regressor R^t is a quite weak regressor such as a decision stump (Cristinacce and Cootes 2007) or a fern (Dollar et al. 2010). However, in our early experiments, we found that such regressors result in slow convergence in training and poor performance in testing. We conjecture this is due to two reasons: first, regressing the entire shape (as large as dozens of landmarks) is too difficult to be handled by a weak regressor in each stage; second, the shape indexed feature will be unreliable if the previous regressors are too weak to provide a fairly reliable shape estimation, and so will the regressor based on the shape indexed feature. Therefore it is crucial to learn a strong regressor that can rapidly reduce the alignment error of all training samples in each stage.

Generally speaking, any kind of regressor with strong fitting capacity is desirable. In our case, we again investigate boosted regression as the stage regressor R^t. There are therefore two levels of boosted regression in our method, i.e. the basic face alignment framework (external-level) and the stage regressor R^t (internal-level). To avoid terminology confusion, we term the weak learner of the internal-level boosted regression the primitive regressor.

It is worth mentioning that other strong regressors have also been investigated by two notable works. Sun et al. (2013) investigated a regressor based on a convolutional neural network. Xiong and De la Torre (2013) investigated linear regression with a strong hand-crafted feature, i.e. SIFT.

Although the external- and internal-level boosted regressions bear some similarities, there is a key difference: the shape indexed image features are fixed in the internal level, i.e., they are indexed only relative to the previously estimated shape S^{t−1} and no longer change while the primitive regressors are being learnt.² This is important, as each primitive regressor is rather weak; allowing feature re-indexing would frequently change the features, leading to unstable consequences. The fixed features also lead to much faster training, as will be described later. In our experiments, we found that two-level boosted regression is more accurate than one level under the same training effort; e.g., T = 10, K = 500 is better than one level with T = 5,000, as shown in Table 4.

² Otherwise this degenerates to a one-level boosted regression.
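The data augmentation step of Algorithm 1 can be sketched as follows; `initialization` and its arguments are hypothetical names for a minimal illustration, where InitSet is simply a list of exemplar shapes.

```python
import random

def initialization(images_and_shapes, D, init_set, seed=0):
    """Sketch of the Initialization routine in Algorithm 1: each (image,
    groundtruth shape) pair is replicated D times, and each copy is paired
    with an exemplar initial shape sampled from InitSet."""
    rng = random.Random(seed)
    augmented = []
    for image, gt_shape in images_and_shapes:
        for _ in range(D):
            init_shape = rng.choice(init_set)  # exemplar shape for initialization
            augmented.append((image, gt_shape, init_shape))
    return augmented
```

In training D corresponds to N_aug (20 initial shapes per image in the paper's experiments); in testing the same routine is called with a single image and D = N_init.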
2.3 Primitive Regressor

We use a fern as our primitive regressor. The fern was first introduced for classification by Ozuysal et al. (2010) and later used for regression by Dollar et al. (2010).

A fern is a composition of F (5 in our implementation) features and thresholds that divide the feature space, and all training samples, into 2^F bins. Each bin b is associated with a regression output y_b. Learning a fern involves a simple task and a hard task. The simple task is learning the outputs in the bins. The hard task is learning the structure (the features and the splitting thresholds). We handle the simple task first, and resolve the hard task later in Sect. 2.5.

Let the targets of all training samples be {ŷ_i}_{i=1}^N. The prediction/output of a bin should minimize its mean square distance from the targets of all training samples falling into the bin,

    y_b = argmin_y Σ_{i∈Ω_b} || ŷ_i − y ||²,    (5)

where the set Ω_b indicates the samples in the bth bin. It is easy to see that the optimal solution is the average over all targets in this bin,

    y_b = (Σ_{i∈Ω_b} ŷ_i) / |Ω_b|.    (6)

To overcome over-fitting in the case of insufficient training data in the bin, a shrinkage is performed (Friedman 2001; Ozuysal et al. 2010) as

    y_b = 1/(1 + β/|Ω_b|) · (Σ_{i∈Ω_b} ŷ_i) / |Ω_b|,    (7)

where β is a free shrinkage parameter. When the bin has sufficient training samples, β has little effect; otherwise, it adaptively reduces the magnitude of the estimation.

2.3.1 Non-parametric Shape Constraint

By directly regressing the entire shape and explicitly minimizing the shape alignment error in Eq. (1), the correlation between the shape coordinates is preserved. Because each shape update is additive as in Eqs. (4), (6) and (7), it can be shown that the final regressed shape S is the sum of the initial shape S^0 and a linear combination of all training shapes:

    S = S^0 + Σ_{i=1}^N w_i Ŝ_i.    (8)

also satisfies the constraint. Compared to a pre-fixed PCA shape model, the non-parametric shape constraint is adaptively determined during the learning.

To illustrate the adaptive shape constraint, we perform PCA on all the shape increments stored in the K ferns (2^F × K in total) in each stage t. As shown in Fig. 1, the intrinsic dimension (by retaining 95 % energy) of such shape spaces increases during the learning. Therefore, the shape constraint is automatically encoded in the regressors in a coarse-to-fine manner. Figure 1 also shows the first three principal components of the learnt shape increments (plus a mean shape) in the first and final stages. As shown in Fig. 1c, d, the shape updates learned by the first stage regressor are dominated by global rough shape changes such as yaw, roll and scaling. In contrast, the shape updates of the final stage regressor are dominated by subtle variations such as the face contour, and motions in the mouth, nose and eyes.

2.4 Shape Indexed (Image) Features

For efficient regression, we use simple pixel-difference features, i.e., the intensity difference of two pixels in the image. Such features are extremely cheap to compute and powerful enough given sufficient training data (Ozuysal et al. 2010; Shotton et al. 2011; Dollar et al. 2010).

To let the pixel-difference features achieve geometric invariance, we need to make the extracted raw pixels invariant to two kinds of variations: the variations due to similarity transform (scale, rotation and translation) and the variations due to the normalized shape (3D poses, expressions and identities).

In this work, we propose to index a pixel by the estimated shape. Specifically, we index a pixel by a local coordinate δ^l = (δx^l, δy^l) with respect to a landmark in the normalized face. The superscript l indicates which landmark this pixel is relative to. As Fig. 2 shows, such indexing holds invariance against the variations mentioned above and makes the algorithm robust. In addition, it also enables us to sample more useful candidate features distributed around salient landmarks (e.g., a good pixel-difference feature could be "eye center is darker than nose tip" or "two eye centers are similar").
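The fern bin outputs of Eqs. (6) and (7) can be sketched as follows; `fern_bin_outputs` is a hypothetical helper, not the authors' code, and assumes the bin assignment has already been computed from the F features and thresholds.

```python
import numpy as np

def fern_bin_outputs(bin_index, targets, n_bins, beta=1000.0):
    """Eq. (7): shrunken average of regression targets per fern bin.
    bin_index: (N,) int bin id per sample; targets: (N, D) regression targets."""
    D = targets.shape[1]
    outputs = np.zeros((n_bins, D))
    for b in range(n_bins):
        members = targets[bin_index == b]
        if len(members) == 0:
            continue                                  # an empty bin predicts zero
        shrink = 1.0 / (1.0 + beta / len(members))    # ~1 for large bins, <1 for small
        outputs[b] = shrink * members.mean(axis=0)    # Eq. (6) scaled by the shrinkage
    return outputs
```

With β = 0 the output reduces to the plain bin average of Eq. (6); a large β pulls sparsely populated bins toward zero, which is the over-fitting control described above.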
and Cootes 2011) which requires image warping, we instead transform the local coordinates back to global coordinates for extracting the raw pixels.³ The global coordinates are adaptively adjusted for different samples to ensure geometric invariance. Herein we only scale and rotate (without translating) the local coordinates, according to the angle and the scale of the shape.

For each stage regressor R^t in the external level, we randomly generate P local coordinates {δ_α^{l_α}}_{α=1}^P which define P shape indexed pixels. Each local coordinate is generated by first randomly selecting a landmark (say, the l_α-th landmark) and then drawing random x- and y-offsets from a uniform distribution. The P pixels result in P² pixel-difference features. Now, the new challenge is how to quickly select effective features from such a large pool.

… number of shape indexed pixel features P; number of facial points N_fp; the range of local coordinates κ; local coordinates {δ_α^{l_α}}_{α=1}^P;

ExtractShapeIndexedPixels({I_i, S_i}_{i=1}^N, {δ_α^{l_α}}_{α=1}^P)
    for i from 1 to N
        for α from 1 to P
            μ_α ← π_{l_α} ◦ S_i + M_{S_i}^{−1} ◦ δ_α^{l_α}
            ρ_{iα} ← I_i(μ_α)
    return ρ

2.5 Correlation-Based Feature Selection

To form a good fern regressor, F out of the P² features are selected. Usually, this is done by randomly generating a pool of ferns and selecting the one with the minimum regression error as in (5) (Ozuysal et al. 2010; Dollar et al. 2010). We denote this method as best-of-n, where n is the size of the pool. Due to the combinatorial explosion, it is infeasible to evaluate (5) for all of the compositional features. As illustrated in Table 5, the error is only slightly reduced by increasing n from 1 to 1024, but the training time is significantly longer.

To better explore the huge feature space in a short time and generate good candidate ferns, we exploit the correlation between features and the regression target. We expect that a good fern should satisfy two properties: (1) each feature in the fern should be highly correlated to the regression target; (2) the correlation between features should be low, so they are complementary when composed.

To find features satisfying these properties, we propose a correlation-based feature selection method. Let Y be the regression target matrix with N (the number of samples) rows and 2N_fp columns. Let X be the pixel-difference feature matrix with N rows and P² columns. Each column X_j of the feature matrix represents a pixel-difference feature. We want to select F columns of X which are highly correlated with Y. Since Y is a matrix, we use a projection v, a column vector drawn from a unit Gaussian, to project Y into a column vector Y_proj = Y v. The feature which maximizes its correlation (Pearson correlation) with the projected target is selected:

    j_opt = argmax_j corr(Y_proj, X_j).    (10)

By repeating this procedure F times, with different random projections, we obtain F desirable features.

The random projection serves two purposes: it preserves proximity (Bingham and Mannila 2001), such that the features correlated to the projection are also discriminative to the delta shape; and multiple projections have low correlations with high probability, so the selected features are likely to be complementary. As shown in Table 5, the proposed correlation-based method can select good features in a short time and is much better than the best-of-n method.

2.6 Fast Correlation Computation

At first glance, we need to compute the correlation of all candidate features to select one feature. The complexity is linear in the number of training samples and the number of pixel-difference features, i.e. O(N P²). As the size of the feature pool scales quadratically with the number of sampled pixels, the computa-

³ According to the aforementioned definition, the global coordinates are computed via M_S^{−1} ◦ (π_l ◦ (M_S ◦ S) + δ^l). By simplifying this formula, we get Eq. (9).
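The selection rule of Eq. (10) can be sketched as follows in its naive O(NP²) form (the fast version of Sect. 2.6 restructures this computation). `select_features_cbfs` is a hypothetical name; Y is the N × 2N_fp target matrix and X the N × P² pixel-difference feature matrix, as defined above.

```python
import numpy as np

def select_features_cbfs(Y, X, F, seed=0):
    """Correlation-based feature selection (Sect. 2.5): F times, project the
    target matrix Y with a random Gaussian vector and pick the column of X
    with the highest Pearson correlation to the projected target, Eq. (10)."""
    rng = np.random.default_rng(seed)
    selected = []
    for _ in range(F):
        v = rng.standard_normal(Y.shape[1])          # random projection vector
        y_proj = Y @ v                               # Y_proj = Y v, a column vector
        yc = y_proj - y_proj.mean()
        Xc = X - X.mean(axis=0)
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
        corr = (Xc.T @ yc) / np.maximum(denom, 1e-12)  # Pearson corr per column
        selected.append(int(np.argmax(corr)))          # j_opt = argmax_j corr(...)
    return selected
```

Note that since the candidate pool contains both ρ_m − ρ_n and ρ_n − ρ_m, maximizing the signed correlation is equivalent to maximizing its magnitude.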
tion will be very expensive, even with a moderate number of pixels (e.g. 400 pixels lead to 160,000 candidate features!).

Fortunately, the computational complexity can be reduced from O(N P²) to O(N P) by the following fact: the correlation between the regression target and a pixel-difference feature (ρ_m − ρ_n) can be represented as

    corr(Y_proj, ρ_m − ρ_n) = (cov(Y_proj, ρ_m) − cov(Y_proj, ρ_n)) / (σ(Y_proj) σ(ρ_m − ρ_n)),
    σ(ρ_m − ρ_n)² = cov(ρ_m, ρ_m) + cov(ρ_n, ρ_n) − 2 cov(ρ_m, ρ_n).    (11)

We can see that the correlation is composed of two categories of covariances: the target-pixel covariance and the pixel-pixel covariance. The target-pixel covariance refers to the covariance between the projected target and a pixel feature, e.g., cov(Y_proj, ρ_m) and cov(Y_proj, ρ_n). The pixel-pixel covariance refers to the covariance among different pixel features, e.g., cov(ρ_m, ρ_m), cov(ρ_n, ρ_n) and cov(ρ_m, ρ_n). As the shape indexed pixels are fixed in the internal-level boosted regression, the pixel-pixel covariances can be pre-computed and reused within each internal-level boosted regression. For each primitive regressor, we only need to compute all target-pixel covariances to compose the correlations, which scales linearly with the number of pixel features. Therefore the complexity is reduced from O(N P²) to O(N P).

Algorithm 3 Correlation-based feature selection

Since the stage regressor R^t is itself a boosted regressor nested in the external-level shape regression framework, we term it the internal-level boosted regression. With the ingredients prepared in the previous three sections, we are ready to describe how to learn the internal-level boosted regression.

The internal-level boosted regression consists of K primitive regressors {r_1, ..., r_K}, which are in fact ferns. In testing, we combine them in an additive manner to predict the output. In training, the primitive regressors are sequentially learnt to greedily fit the regression targets; in other words, each primitive regressor handles the residues left by the previous regressors. In each iteration, the residues are used as the new targets for learning a new primitive regressor. The learning procedure is essentially identical for all primitive regressors, and can be described as follows.

– Features Select F pixel-difference features using the correlation-based feature selection method.
– Thresholds Randomly sample F i.i.d. thresholds from a uniform distribution.⁴
– Outputs Partition all training samples into different bins using the learnt features and thresholds. Then, learn the outputs of the bins using Eq. (7).

Algorithm 4 Internal-level boosted regression
Variables: regression targets Y ∈ R^{N×2N_fp}; training images and corresponding estimated shapes {I_i, S_i}_{i=1}^N; training parameters …
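The covariance decomposition of Eq. (11) can be sketched as follows; `precompute_pixel_cov` and `best_pixel_difference` are hypothetical names. Only the P target-pixel covariances involve the N samples (O(NP) work per primitive regressor); the pixel-pixel covariances are computed once and the P² candidate pairs are then scored from cached scalars.

```python
import numpy as np

def precompute_pixel_cov(rho):
    """rho: (N, P) shape indexed pixel values, fixed within the internal level.
    Returns the P x P pixel-pixel covariance matrix, computed once and reused."""
    return np.cov(rho, rowvar=False, bias=True)

def best_pixel_difference(y_proj, rho, pix_cov):
    """Pick the pixel pair (m, n) maximizing corr(y_proj, rho_m - rho_n)
    via Eq. (11): only the target-pixel covariances are recomputed."""
    N, P = rho.shape
    yc = y_proj - y_proj.mean()
    tp_cov = (rho - rho.mean(axis=0)).T @ yc / N   # cov(Y_proj, rho_m), O(N P)
    sigma_y = yc.std()
    var = np.diag(pix_cov)
    best, best_corr = None, -np.inf
    for m in range(P):                             # P^2 pairs, but scalars only
        for n in range(P):
            if m == n:
                continue
            denom = sigma_y * np.sqrt(max(var[m] + var[n] - 2 * pix_cov[m, n], 1e-12))
            corr = (tp_cov[m] - tp_cov[n]) / denom
            if corr > best_corr:
                best, best_corr = (m, n), corr
    return best
```

As a sanity check, if the projected target is itself ρ_0 − ρ_2, the pair (0, 2) attains correlation 1 and is selected.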
when multiple landmark estimations are tightly clustered, the result is accurate, and vice versa. In the test, we run the regressor several times (5 in our implementation) and take the median result⁶ as the final estimation. Each time, the initial shape is randomly sampled from the training shapes. This further improves the accuracy.

3.4 Running Time Performance

Table 1 summarizes the computational time of training (with 2,000 training images) and testing for different numbers of landmarks, measured on an Intel Core i7 2.93 GHz CPU with a C++ implementation.

Table 1 Training and testing times of our approach
Landmarks        5     29    87
Training (min)   5     10    21
Testing (ms)     0.32  0.91  2.9

Our training is very efficient due to the fast feature selection method: it takes minutes with 40,000 training samples (20 initial shapes per image). The shape regression in the test is extremely efficient because most of the computation is pixel comparison, table lookup and vector addition. It takes only 15 ms to predict a shape with 87 landmarks (3 ms × 5 initializations).

3.5 Parameter Settings

The number of features in a fern F and the shrinkage parameter β adjust the trade-off between fitting power in training and generalization ability in testing. They are set as F = 5, β = 1,000 by cross validation.

Algorithm accuracy consistently increases as the number of stages in the two-level boosted regression (T, K) and the number of candidate features P² increase. These parameters are empirically chosen as T = 10, K = 500, P = 400 for a good tradeoff between computational cost and accuracy.

The parameter κ is used for generating the local coordinates relative to landmarks. We set κ equal to 0.3 times the distance between the two pupils on the mean shape.

4 Experiments

The experiments are performed in two parts. The first part compares our approach with previous works. The second part validates the proposed approach and presents some interesting discussions.

We briefly introduce the datasets used in the experiments. They present different challenges, due to different numbers of annotated landmarks and image variations.

BioID The dataset was proposed by Jesorsky et al. (2001) and has been widely used by previous methods. It consists of 1,521 near frontal face images captured in a lab environment, and is therefore less challenging. We report our result on it for completeness.

LFPW The dataset (Labeled Face Parts in the Wild) was created by Belhumeur et al. (2011). The images are downloaded from the internet and contain large variations in pose, illumination, expression and occlusion. It is intended to test face alignment methods in unconstrained conditions. This dataset shares only web image URLs, and some URLs are no longer valid. We could only download 812 of the 1,100 training images and 249 of the 300 test images. To acquire enough training data, we augment the training images to 2,000 in the same way as Belhumeur et al. (2011) did and use the available testing images.

LFW87 The dataset was created by Liang et al. (2008). The images mainly come from the Labeled Faces in the Wild (LFW) dataset (Huang et al. 2008), which was acquired in uncontrolled conditions and is widely used in face recognition. In addition, it has 87 annotated landmarks, many more than BioID and LFPW; therefore, the performance of an algorithm relies more on its shape constraint. We use the same setting as Liang et al. (2008): the training set contains 4,002 images mainly from LFW, and the testing set contains 1,716 images, all from LFW.

Helen The dataset was proposed by Le et al. (2012). It consists of 2,330 high resolution web images with 194 annotated landmarks. The average face size is 550 pixels; even the smallest face in the dataset is larger than 150 pixels. It serves as a new benchmark which provides richer and more detailed information for accurate face alignment.

4.1 Comparison with Previous Works

For comparisons, we use the alignment error in Eq. (1) as the evaluation metric. To make it invariant to face size, the error is not measured in pixels but normalized by the distance between the two pupils, similar to most previous works.

The following comparisons show that our approach outperforms the state-of-the-art methods in both accuracy and efficiency, especially on the challenging LFPW and LFW87 datasets. Figures 4, 5, 6 and 7 show our results on challenging examples with large variations in pose, expression, illumination and occlusion from the four datasets.

⁶ The median operation is performed on the x and y coordinates of all landmarks individually. Although this may violate the shape constraint mentioned before, the resulting median shape is mostly correct, as in most cases the multiple results are tightly clustered. We found such a simple median-based fusion comparable to more sophisticated strategies such as weighted combination of the input shapes.
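The interocular-normalized evaluation metric described above can be sketched as follows; `alignment_error` and the pupil-index arguments are hypothetical names for a minimal illustration.

```python
import numpy as np

def alignment_error(pred, gt, left_pupil_idx, right_pupil_idx):
    """Mean landmark deviation normalized by the interocular distance,
    making the error invariant to face size. pred, gt: (L, 2) arrays."""
    interocular = np.linalg.norm(gt[left_pupil_idx] - gt[right_pupil_idx])
    per_landmark = np.linalg.norm(pred - gt, axis=1)  # Euclidean error per landmark
    return per_landmark.mean() / interocular
```

On Helen, where pupils are not labeled, the same formula would instead use the distance between the centroids of the two eyes as the normalizer (Sect. 4.1.3).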
4.1.1 Comparison on LFPW

The consensus exemplar approach proposed by Belhumeur et al. (2011) is one of the state-of-the-art methods. It was the best on BioID when published, and obtained good results on LFPW.

The comparison in Fig. 8 shows that most landmarks estimated by our approach are more than 10 % more accurate⁷ than

⁷ The relative improvement is the ratio between the error reduction and the original error.
Table 2 Percentage of test images with RMSE below given thresholds
RMSE            <5 Pixels  <7.5 Pixels  <10 Pixels
CDS (%)         74.7       93.5         97.8
Our method (%)  86.1       95.2         98.2
Bold values represent the best results under certain settings

Fig. 9 Cumulative error curves on the BioID dataset (fraction of landmarks vs. landmark error), for our method and for Vukadinovic and Pantic, Cristinacce and Cootes, Milborrow and Nicolls, Valstar et al., and Belhumeur et al. For comparison with previous results, only 17 landmarks are used (Cristinacce and Cootes 2006). As our model is trained on LFPW images, for those landmarks with different definitions between the two datasets, a fixed offset is applied in the same way as in Belhumeur et al. (2011)

4.1.3 Comparison on Helen

We adopt the same training and testing protocol as well as the same error metric used by Le et al. (2012). Specifically, we divide the Helen dataset into a training set of 2,000 images and a testing set of 330 images. As the pupils are not labeled in the Helen dataset, the distance between the centroids of the two eyes is used to normalize the deviations from groundtruth.

We compare our method with STASM (Milborrow and Nicolls 2008) and the recently proposed CompASM (Le et al. 2012). As shown in Table 3, our method outperforms them by a large margin: compared with STASM and CompASM, it reduces the mean error by 50 and 40 % respectively; meanwhile, the testing speed is even faster.

Table 3 Comparison on the Helen dataset
Method       Mean   Median  Min    Max
STASM        0.111  0.094   0.037  0.411
CompASM      0.091  0.073   0.035  0.402
Our method   0.057  0.048   0.024  0.16
The error of each sample is first computed individually by averaging the errors of the 194 landmarks; the mean error is then computed across all testing samples. Bold values represent the best results under certain settings

4.1.4 Comparison to Previous Methods on BioID

Our model is trained on the augmented LFPW training set and tested on the entire BioID dataset.

Figure 9 compares our method with previous methods (Vukadinovic and Pantic 2005; Cristinacce and Cootes 2006; Milborrow and Nicolls 2008; Valstar et al. 2010; Belhumeur et al. 2011). Our result is the best, but the improvement is marginal. We believe this is because the performance on BioID is nearly saturated due to its simplicity. Note that our method is thousands of times faster than the second best method (Belhumeur et al. 2011).

4.2 Algorithm Validation and Discussions

We verify the effectiveness of the different components of the proposed approach. These experiments are performed on our augmented LFPW dataset. The dataset is split into two parts for training and testing: the training set contains 1,500 images and the testing set contains 500 images. Parameters are fixed as in Sect. 3, unless otherwise noted.

4.2.1 Two-Level Boosted Regression

As discussed in Sect. 2, the stage regressor exploits shape indexed features to obtain geometric invariance and to decompose the original difficult problem into easier sub-tasks. The shape indexed features are fixed within the internal-level boosted regression to avoid instability.

Different tradeoffs between the two levels of boosted regression are presented in Table 4, using the same total number of ferns. On one extreme, regressing the whole shape in a single stage (T = 1, K = 5000) is clearly the worst. On the other extreme, using a single fern as the stage regressor (T = 5000, K = 1) also has poor generalization ability in the test. The optimal tradeoff (T = 10, K = 500) is found in between via cross validation.

Table 4 Tradeoffs between the two levels of boosted regression
Stage regressors (T)       1     5     10   100  5000
Primitive regressors (K)   5000  1000  500  50   1
Mean error (×10⁻²)         15    6.2   3.3  4.5  5.2
Bold value represents the best result under certain settings

4.2.2 Shape Indexed Feature

We compare the global and local methods of shape indexed features. The mean error of the local index method is 0.033, which is much smaller than the mean error of the global index method, 0.059. The superior accuracy supports the proposed local index method.
4.2.3 Feature Selection

The proposed correlation-based feature selection (CBFS) method is compared with the commonly used best-of-n method (Ozuysal et al. 2010; Dollar et al. 2010) in Table 5. CBFS can select good features rapidly, and this is crucial for learning good models from large training data.

Table 5 Comparison between the correlation-based feature selection (CBFS) method and best-of-n feature selection methods
Best-of-n        n = 1  n = 32  n = 1024  CBFS
Error (×10⁻²)    5.01   4.92    4.83      3.32
Time (s)         0.1    3.0     100.3     0.12
The training time is for one primitive regressor. Bold value represents the best result under certain settings

4.2.4 Feature Range

Table 6 Model compression experiment
                      Dataset    Raw   PCA   SC
Mean error (×10⁻²)    LFW87      4.23  4.35  4.34
Model size (MB)       LFW87      118   30    8
Comp. ratio           LFW87      –     4     15
Mean error (×10⁻²)    Helen194   5.70  5.83  5.79
Model size (MB)       Helen194   240   42    12
Comp. ratio           Helen194   –     6     20
The suffix of the dataset name indicates the number of annotated landmarks

We have presented the explicit shape regression method for face alignment. By jointly regressing the entire shape and minimizing the alignment error, the shape constraint is automatically encoded. The resulting method is highly accurate, efficient, and usable in real-time applications such as face tracking. The explicit shape regression framework can also be applied to other problems like articulated object pose estimation and anatomic structure segmentation in medical images.
Duffy, N., & Helmbold, D. P. (2002). Boosting methods for regression. Machine Learning, 47(2–3), 153–200.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), 3736–3745.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
Huang, G., Mattar, M., Berg, T., & Learned-Miller, E. (2008). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition.
Jesorsky, O., Kirchberg, K. J., & Frischholz, R. W. (2001). Robust face detection using the Hausdorff distance (pp. 90–95). New York: Springer.
Jolliffe, I. (2005). Principal component analysis. Wiley Online Library.
Le, V., Brandt, J., Lin, Z., Bourdev, L., & Huang, T. (2012). Interactive facial feature localization. In European Conference on Computer Vision (ECCV).
Liang, L., Xiao, R., Wen, F., & Sun, J. (2008). Face alignment via component-based discriminative search. In European Conference on Computer Vision (ECCV).
Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of Computer Vision, 60(2), 135–164.
Milborrow, S., & Nicolls, F. (2008). Locating facial features with an extended active shape model. In European Conference on Computer Vision (ECCV).
Ozuysal, M., Calonder, M., Lepetit, V., & Fua, P. (2010). Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 448–461.
Saragih, J., & Goecke, R. (2007). A nonlinear discriminative approach to AAM fitting. In International Conference on Computer Vision (ICCV).
Sauer, P., & Cootes, T. (2011). Accurate regression procedures for active appearance models. In British Machine Vision Conference (BMVC).
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., et al. (2011). Real-time human pose recognition in parts from single depth images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Sun, Y., Wang, X., & Tang, X. (2013). Deep convolutional network cascade for facial point detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tropp, J., & Gilbert, A. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12), 4655–4666.
Valstar, M., Martinez, B., Binefa, X., & Pantic, M. (2010). Facial point detection using boosted regression and graph models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Vukadinovic, D., & Pantic, M. (2005). Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In International Conference on Systems, Man and Cybernetics, 2, 1692–1698.
Xiong, X., & De la Torre, F. (2013). Supervised descent method and its applications to face alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhou, S. K., & Comaniciu, D. (2007). Shape regression machine. In Information Processing in Medical Imaging (pp. 13–25). Heidelberg: Springer.