Face Alignment by Explicit Shape Regression
Int J Comput Vis
DOI 10.1007/s11263-013-0667-3
performance. In their experiment, this method has also been used to estimate a face shape which is modeled by a simple parametric ellipse (Dollar et al. 2010).

Our method improves the cascaded pose regression framework in several important aspects and works better for the face alignment problem. We adopt a non-parametric representation and directly estimate the facial landmarks by minimizing the alignment error instead of a parameter error. Consequently, the underlying shape constraint is preserved automatically. To address the very challenging high-dimensional regression problem, we further propose several improvements: a two-level boosted regression, effective shape indexed features, a fast correlation-based feature selection method and sparse coding based model compression, so that: (1) we can quickly learn accurate models from large training data (20 min on 2,000 training samples); (2) the resulting regressor is extremely efficient in the test (15 ms for 87 facial landmarks); (3) the model size is reasonably small (a few megabytes) and applicable in many scenarios. We show superior results on several challenging datasets.

2 Face Alignment by Shape Regression

In this section, we describe our shape regression framework and discuss how it fits the face alignment task.

We cast our solution to the shape regression task into the gradient boosting regression framework (Friedman 2001; Duffy and Helmbold 2002), a representative approach of ensemble learning. In training, it sequentially learns a series of weak learners to greedily minimize the regression loss function. In testing, it simply combines the pre-learnt weak learners in an additive manner to give the final prediction.

Before specifying how we resolve the face alignment task in the gradient boosting framework, we first clarify a simple and basic term, the normalized shape. Given the predefined mean shape S̄, the normalized shape of an input shape S is obtained by the similarity transform which aligns the input shape to the mean shape¹ by minimizing their L2 distance,

    M_S = argmin_M || S̄ − M ◦ S ||²,    (2)

where M_S ◦ S is the normalized shape. Now we are ready to describe our shape regression framework.

In training, given N training samples {I_i, Ŝ_i, S_i^0}_{i=1}^N, the stage regressors (R^1, ..., R^T) are sequentially learnt to reduce the alignment errors on the training set. In each stage t, the stage regressor R^t is formally learnt as follows,

    R^t = argmin_R Σ_{i=1}^N || y_i − R(I_i, S_i^{t−1}) ||²,    (3)
    y_i = M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1}),

where S_i^{t−1} is the shape estimated in the previous stage t−1 and M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1}) is the normalized regression target, which will be discussed later. Note that herein we only apply the scale and rotation transformation to normalize the regression target.

In testing, given a facial image I and an initial shape S^0, each stage regressor computes a normalized shape increment from image features and then updates the face shape, in a cascaded manner:

    S_i^t = S_i^{t−1} + M_{S_i^{t−1}}^{−1} ◦ R^t(I_i, S_i^{t−1}),    (4)

where the stage regressor R^t updates the previous shape S^{t−1} to the new shape S^t in stage t. Note that we only scale and rotate (without translating) the output of the stage regressor, according to the angle and the scale of the previous shape.

Although our basic face alignment framework follows the gradient boosting framework, three components are specifically designed for the shape regression task. As these components are generic, regardless of the specific forms of the weak learner and the feature used for learning, it is worth clarifying and discussing them here.

Shape Indexed Feature The feature for learning the weak learner R^t depends on both the image I and the previously estimated shape S^{t−1}. It improves the performance by achieving geometric invariance. In other words, the feature is extracted relative to S^{t−1} to eliminate two kinds of variations: the variations due to scale, rotation and translation, and the variations due to identity, pose and expression (i.e. the similarity transform M_{S^{t−1}}^{−1} and the normalized shape M_{S^{t−1}} ◦ S^{t−1}). It is worth mentioning that although the shape indexed feature sounds complex, our designed feature is extremely cheap to compute and does not require image warping. The details of our feature will be described in Sect. 2.4.

Regressing the Normalized Target Instead of learning the mapping from the shape indexed feature to Ŝ_i − S_i^{t−1}, we argue that the regression task is simplified if we regress the normalized target M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1}), because the normalized target is invariant to similarity transforms. To better understand its advantage, imagine two identical facial images with the same estimated shape. One of them is changed by a similarity transform while the other is kept unchanged. Due to the transform, the regression targets (shape increments) of the transformed

¹ It is also interesting to know that the mean shape is defined as the average of the normalized training shapes. Although this sounds like a circular definition, we can still compute the mean shape in an iterative way. Readers are referred to the Active Shape Model (Cootes et al. 1995) for details.
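The test-time cascade of Eq. (4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `esr_predict` and `similarity_align` are hypothetical names, the stage regressors are passed in as callables, and the similarity transform is restricted to scale and rotation as the paper notes.

```python
import numpy as np

def similarity_align(src, dst):
    """Least-squares scale+rotation (no translation, per the paper's note)
    mapping src (L x 2) toward dst (L x 2). Returns a 2x2 matrix A such
    that src @ A.T approximates dst."""
    a = (src * dst).sum()                                  # sum of x_s*x_d + y_s*y_d
    b = (src[:, 0] * dst[:, 1] - src[:, 1] * dst[:, 0]).sum()
    d = (src ** 2).sum()
    p, q = a / d, b / d                                    # closed-form LSQ solution
    return np.array([[p, -q], [q, p]])

def esr_predict(image, S0, stage_regressors, mean_shape):
    """Cascaded update of Eq. (4): S^t = S^{t-1} + M^{-1} o R^t(I, S^{t-1})."""
    S = S0.copy()
    for R_t in stage_regressors:
        M = similarity_align(S, mean_shape)        # M_S: current shape -> mean shape
        delta_norm = R_t(image, S)                 # normalized shape increment (L x 2)
        S = S + delta_norm @ np.linalg.inv(M).T    # un-normalize the increment
    return S
```

With an "ideal" stage regressor that returns the exact normalized target M ◦ (Ŝ − S), one cascade step recovers the groundtruth shape, which is a quick sanity check of the normalize/un-normalize round trip.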
image are different from those of the unchanged one. Hence the regression task is complicated. In contrast, the regression task stays simple if we regress the normalized targets, which are still the same for both samples.

Data Augmentation and Multiple Initialization Unlike a typical regression task, the sample of our shape regression task is a triple defined by the facial image, the groundtruth shape and the initial shape, i.e. {I_i, Ŝ_i, S_i^0}. So we can augment the samples by generating multiple initial shapes for one image. In fact, it turns out such a simple operation not only effectively improves the generalization of training, but also reduces the variation of the final prediction by bagging the results obtained from multiple initializations. See Sect. 3 for experimental validations and further discussions.

To make the training and testing of our framework clear, we present them in pseudocode.

Algorithm 1 Explicit Shape Regression (ESR)
Variables: Training images and labeled shapes {I_l, Ŝ_l}_{l=1}^L; ESR model {R^t}_{t=1}^T; testing image I; predicted shape S; TrainParams{times of data augmentation N_aug, number of stages T}; TestParams{number of multiple initializations N_init}; InitSet, which contains exemplar shapes for initialization.

ESRTraining({I_l, Ŝ_l}_{l=1}^L, TrainParams, InitSet)
    // augment training data
    {I_i, Ŝ_i, S_i^0}_{i=1}^N ← Initialization({I_l, Ŝ_l}_{l=1}^L, N_aug, InitSet)
    for t from 1 to T
        Y ← {M_{S_i^{t−1}} ◦ (Ŝ_i − S_i^{t−1})}_{i=1}^N    // compute normalized targets
        R^t ← LearnStageRegressor(Y, {I_i, S_i^{t−1}}_{i=1}^N)    // using Eq. (3)
        for i from 1 to N
            S_i^t ← S_i^{t−1} + M_{S_i^{t−1}}^{−1} ◦ R^t(I_i, S_i^{t−1})
    return {R^t}_{t=1}^T

ESRTesting(I, {R^t}_{t=1}^T, TestParams, InitSet)
    // multiple initializations
    {I_i, ∗, S_i^0}_{i=1}^{N_init} ← Initialization({I, ∗}, N_init, InitSet)
    for t from 1 to T
        for i from 1 to N_init
            S_i^t ← S_i^{t−1} + M_{S_i^{t−1}}^{−1} ◦ R^t(I_i, S_i^{t−1})
    S ← CombineMultipleResults({S_i^T}_{i=1}^{N_init})
    return S

Initialization({I_c, Ŝ_c}_{c=1}^C, D, InitSet)
    i ← 1
    for c from 1 to C
        for d from 1 to D
            S_i^o ← sample an exemplar shape from InitSet
            {I_i^o, Ŝ_i^o} ← {I_c, Ŝ_c}
            i ← i + 1
    return {I_i^o, Ŝ_i^o, S_i^o}_{i=1}^{CD}

The details about combining multiple results and initialization (including constructing InitSet and sampling from InitSet) will be discussed later in Sect. 3.

As the stage regressor plays a vital role in our shape regression framework, we will focus on it next. We will discuss what kinds of regressors are suitable for our framework and present a series of methods for effectively learning the stage regressor.

2.2 Two-Level Boosted Regression

Conventionally, the stage regressor R^t is a quite weak regressor such as a decision stump (Cristinacce and Cootes 2007) or a fern (Dollar et al. 2010). However, in our early experiments, we found that such regressors result in slow convergence in training and poor performance in testing. We conjecture this is due to two reasons: first, regressing the entire shape (as large as dozens of landmarks) is too difficult to be handled by a weak regressor in each stage; second, the shape indexed feature will be unreliable if the previous regressors are too weak to provide a fairly reliable shape estimation, and so will the regressor based on the shape indexed feature. Therefore it is crucial to learn a strong regressor that can rapidly reduce the alignment error of all training samples in each stage.

Generally speaking, any kind of regressor with strong fitting capacity is desirable. In our case, we again investigate boosted regression as the stage regressor R^t. There are therefore two levels of boosted regression in our method, i.e. the basic face alignment framework (external-level) and the stage regressor R^t (internal-level). To avoid terminology confusion, we term the weak learner of the internal-level boosted regression the primitive regressor.

It is worth mentioning that other strong regressors have also been investigated by two notable works. Sun et al. (2013) investigated a regressor based on a convolutional neural network. Xiong and De la Torre (2013) investigated linear regression with a strong hand-crafted feature, i.e. SIFT.

Although the external- and internal-level boosted regressions bear some similarities, there is a key difference: the shape indexed image features are fixed in the internal level, i.e., they are indexed only relative to the previously estimated shape S^{t−1} and no longer change while the primitive regressors are being learnt.² This is important, as each primitive regressor is rather weak; allowing feature re-indexing would frequently change the features, leading to unstable consequences. The fixed features also lead to much faster training, as will be described later. In our experiments, we found that two-level boosted regression is more accurate than one level under the same training effort; e.g., T = 10, K = 500 is better than one level with T = 5,000, as shown in Table 4.

² Otherwise this degenerates to a one-level boosted regression.
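The data augmentation step of Algorithm 1 can be sketched as follows; `initialization` and its arguments are hypothetical names for a minimal illustration, where InitSet is simply a list of exemplar shapes.

```python
import random

def initialization(images_and_shapes, D, init_set, seed=0):
    """Sketch of the Initialization routine in Algorithm 1: each (image,
    groundtruth shape) pair is replicated D times, and each copy is paired
    with an exemplar initial shape sampled from InitSet."""
    rng = random.Random(seed)
    augmented = []
    for image, gt_shape in images_and_shapes:
        for _ in range(D):
            init_shape = rng.choice(init_set)  # exemplar shape for initialization
            augmented.append((image, gt_shape, init_shape))
    return augmented
```

In training D corresponds to N_aug (20 initial shapes per image in the paper's experiments); in testing the same routine is called with a single image and D = N_init.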
2.3 Primitive Regressor

We use a fern as our primitive regressor. The fern was first introduced for classification by Ozuysal et al. (2010) and later used for regression by Dollar et al. (2010).

A fern is a composition of F (5 in our implementation) features and thresholds that divide the feature space, and all training samples, into 2^F bins. Each bin b is associated with a regression output y_b. Learning a fern involves a simple task and a hard task. The simple task is learning the outputs in the bins. The hard task is learning the structure (the features and the splitting thresholds). We handle the simple task first, and resolve the hard task later in Sect. 2.5.

Let the targets of all training samples be {ŷ_i}_{i=1}^N. The prediction/output of a bin should minimize its mean square distance from the targets of all training samples falling into the bin,

    y_b = argmin_y Σ_{i∈Ω_b} || ŷ_i − y ||²,    (5)

where the set Ω_b indicates the samples in the bth bin. It is easy to see that the optimal solution is the average over all targets in this bin,

    y_b = (Σ_{i∈Ω_b} ŷ_i) / |Ω_b|.    (6)

To overcome over-fitting in the case of insufficient training data in the bin, a shrinkage is performed (Friedman 2001; Ozuysal et al. 2010) as

    y_b = 1/(1 + β/|Ω_b|) · (Σ_{i∈Ω_b} ŷ_i) / |Ω_b|,    (7)

where β is a free shrinkage parameter. When the bin has sufficient training samples, β has little effect; otherwise, it adaptively reduces the magnitude of the estimation.

2.3.1 Non-parametric Shape Constraint

By directly regressing the entire shape and explicitly minimizing the shape alignment error in Eq. (1), the correlation between the shape coordinates is preserved. Because each shape update is additive as in Eqs. (4), (6) and (7), it can be shown that the final regressed shape S is the sum of the initial shape S^0 and a linear combination of all training shapes:

    S = S^0 + Σ_{i=1}^N w_i Ŝ_i.    (8)

also satisfies the constraint. Compared to a pre-fixed PCA shape model, the non-parametric shape constraint is adaptively determined during the learning.

To illustrate the adaptive shape constraint, we perform PCA on all the shape increments stored in the K ferns (2^F × K in total) in each stage t. As shown in Fig. 1, the intrinsic dimension (by retaining 95 % energy) of such shape spaces increases during the learning. Therefore, the shape constraint is automatically encoded in the regressors in a coarse-to-fine manner. Figure 1 also shows the first three principal components of the learnt shape increments (plus a mean shape) in the first and final stages. As shown in Fig. 1c, d, the shape updates learned by the first stage regressor are dominated by global rough shape changes such as yaw, roll and scaling. In contrast, the shape updates of the final stage regressor are dominated by subtle variations such as the face contour, and motions in the mouth, nose and eyes.

2.4 Shape Indexed (Image) Features

For efficient regression, we use simple pixel-difference features, i.e., the intensity difference of two pixels in the image. Such features are extremely cheap to compute and powerful enough given sufficient training data (Ozuysal et al. 2010; Shotton et al. 2011; Dollar et al. 2010).

To let the pixel-difference features achieve geometric invariance, we need to make the extracted raw pixels invariant to two kinds of variations: the variations due to similarity transform (scale, rotation and translation) and the variations due to the normalized shape (3D poses, expressions and identities).

In this work, we propose to index a pixel by the estimated shape. Specifically, we index a pixel by a local coordinate δ^l = (δx^l, δy^l) with respect to a landmark in the normalized face. The superscript l indicates which landmark this pixel is relative to. As Fig. 2 shows, such indexing holds invariance against the variations mentioned above and makes the algorithm robust. In addition, it also enables us to sample more useful candidate features distributed around salient landmarks (e.g., a good pixel-difference feature could be "eye center is darker than nose tip" or "two eye centers are similar").
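The fern bin outputs of Eqs. (6) and (7) can be sketched as follows; `fern_bin_outputs` is a hypothetical helper, not the authors' code, and assumes the bin assignment has already been computed from the F features and thresholds.

```python
import numpy as np

def fern_bin_outputs(bin_index, targets, n_bins, beta=1000.0):
    """Eq. (7): shrunken average of regression targets per fern bin.
    bin_index: (N,) int bin id per sample; targets: (N, D) regression targets."""
    D = targets.shape[1]
    outputs = np.zeros((n_bins, D))
    for b in range(n_bins):
        members = targets[bin_index == b]
        if len(members) == 0:
            continue                                  # an empty bin predicts zero
        shrink = 1.0 / (1.0 + beta / len(members))    # ~1 for large bins, <1 for small
        outputs[b] = shrink * members.mean(axis=0)    # Eq. (6) scaled by the shrinkage
    return outputs
```

With β = 0 the output reduces to the plain bin average of Eq. (6); a large β pulls sparsely populated bins toward zero, which is the over-fitting control described above.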
and Cootes 2011) which requires image warping, we instead transform the local coordinates back to global coordinates for extracting the raw pixels.³ The global coordinates are adaptively adjusted for different samples to ensure geometric invariance. Herein we only scale and rotate (without translating) the local coordinates, according to the angle and the scale of the shape.

For each stage regressor R^t in the external level, we randomly generate P local coordinates {δ_α^{l_α}}_{α=1}^P which define P shape indexed pixels. Each local coordinate is generated by first randomly selecting a landmark (say, the l_α-th landmark) and then drawing random x- and y-offsets from a uniform distribution. The P pixels result in P² pixel-difference features. Now, the new challenge is how to quickly select effective features from such a large pool.

… number of shape indexed pixel features P; number of facial points N_fp; the range of local coordinates κ; local coordinates {δ_α^{l_α}}_{α=1}^P;

ExtractShapeIndexedPixels({I_i, S_i}_{i=1}^N, {δ_α^{l_α}}_{α=1}^P)
    for i from 1 to N
        for α from 1 to P
            μ_α ← π_{l_α} ◦ S_i + M_{S_i}^{−1} ◦ δ_α^{l_α}
            ρ_{iα} ← I_i(μ_α)
    return ρ

2.5 Correlation-Based Feature Selection

To form a good fern regressor, F out of the P² features are selected. Usually, this is done by randomly generating a pool of ferns and selecting the one with the minimum regression error as in (5) (Ozuysal et al. 2010; Dollar et al. 2010). We denote this method as best-of-n, where n is the size of the pool. Due to the combinatorial explosion, it is infeasible to evaluate (5) for all of the compositional features. As illustrated in Table 5, the error is only slightly reduced by increasing n from 1 to 1024, but the training time is significantly longer.

To better explore the huge feature space in a short time and generate good candidate ferns, we exploit the correlation between features and the regression target. We expect that a good fern should satisfy two properties: (1) each feature in the fern should be highly correlated to the regression target; (2) the correlation between features should be low, so they are complementary when composed.

To find features satisfying these properties, we propose a correlation-based feature selection method. Let Y be the regression target matrix with N (the number of samples) rows and 2N_fp columns. Let X be the pixel-difference feature matrix with N rows and P² columns. Each column X_j of the feature matrix represents a pixel-difference feature. We want to select F columns of X which are highly correlated with Y. Since Y is a matrix, we use a projection v, a column vector drawn from a unit Gaussian, to project Y into a column vector Y_proj = Y v. The feature which maximizes its correlation (Pearson correlation) with the projected target is selected:

    j_opt = argmax_j corr(Y_proj, X_j).    (10)

By repeating this procedure F times, with different random projections, we obtain F desirable features.

The random projection serves two purposes: it preserves proximity (Bingham and Mannila 2001), such that the features correlated to the projection are also discriminative to the delta shape; and multiple projections have low correlations with high probability, so the selected features are likely to be complementary. As shown in Table 5, the proposed correlation-based method can select good features in a short time and is much better than the best-of-n method.

2.6 Fast Correlation Computation

At first glance, we need to compute the correlation of all candidate features to select one feature. The complexity is linear in the number of training samples and the number of pixel-difference features, i.e. O(N P²). As the size of the feature pool scales quadratically with the number of sampled pixels, the computa-

³ According to the aforementioned definition, the global coordinates are computed via M_S^{−1} ◦ (π_l ◦ (M_S ◦ S) + δ^l). By simplifying this formula, we get Eq. (9).
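The selection rule of Eq. (10) can be sketched as follows in its naive O(NP²) form (the fast version of Sect. 2.6 restructures this computation). `select_features_cbfs` is a hypothetical name; Y is the N × 2N_fp target matrix and X the N × P² pixel-difference feature matrix, as defined above.

```python
import numpy as np

def select_features_cbfs(Y, X, F, seed=0):
    """Correlation-based feature selection (Sect. 2.5): F times, project the
    target matrix Y with a random Gaussian vector and pick the column of X
    with the highest Pearson correlation to the projected target, Eq. (10)."""
    rng = np.random.default_rng(seed)
    selected = []
    for _ in range(F):
        v = rng.standard_normal(Y.shape[1])          # random projection vector
        y_proj = Y @ v                               # Y_proj = Y v, a column vector
        yc = y_proj - y_proj.mean()
        Xc = X - X.mean(axis=0)
        denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
        corr = (Xc.T @ yc) / np.maximum(denom, 1e-12)  # Pearson corr per column
        selected.append(int(np.argmax(corr)))          # j_opt = argmax_j corr(...)
    return selected
```

Note that since the candidate pool contains both ρ_m − ρ_n and ρ_n − ρ_m, maximizing the signed correlation is equivalent to maximizing its magnitude.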
tion will be very expensive, even with a moderate number of pixels (e.g. 400 pixels lead to 160,000 candidate features!).

Fortunately, the computational complexity can be reduced from O(N P²) to O(N P) by the following fact: the correlation between the regression target and a pixel-difference feature (ρ_m − ρ_n) can be represented as

    corr(Y_proj, ρ_m − ρ_n) = (cov(Y_proj, ρ_m) − cov(Y_proj, ρ_n)) / (σ(Y_proj) σ(ρ_m − ρ_n)),
    σ(ρ_m − ρ_n)² = cov(ρ_m, ρ_m) + cov(ρ_n, ρ_n) − 2 cov(ρ_m, ρ_n).    (11)

We can see that the correlation is composed of two categories of covariances: the target-pixel covariance and the pixel-pixel covariance. The target-pixel covariance refers to the covariance between the projected target and a pixel feature, e.g., cov(Y_proj, ρ_m) and cov(Y_proj, ρ_n). The pixel-pixel covariance refers to the covariance among different pixel features, e.g., cov(ρ_m, ρ_m), cov(ρ_n, ρ_n) and cov(ρ_m, ρ_n). As the shape indexed pixels are fixed in the internal-level boosted regression, the pixel-pixel covariances can be pre-computed and reused within each internal-level boosted regression. For each primitive regressor, we only need to compute all target-pixel covariances to compose the correlations, which scales linearly with the number of pixel features. Therefore the complexity is reduced from O(N P²) to O(N P).

Algorithm 3 Correlation-based feature selection

Since the stage regressor R^t is itself a boosted regressor nested in the external-level shape regression framework, we term it the internal-level boosted regression. With the ingredients prepared in the previous three sections, we are ready to describe how to learn the internal-level boosted regression.

The internal-level boosted regression consists of K primitive regressors {r_1, ..., r_K}, which are in fact ferns. In testing, we combine them in an additive manner to predict the output. In training, the primitive regressors are sequentially learnt to greedily fit the regression targets; in other words, each primitive regressor handles the residues left by the previous regressors. In each iteration, the residues are used as the new targets for learning a new primitive regressor. The learning procedure is essentially identical for all primitive regressors, and can be described as follows.

– Features Select F pixel-difference features using the correlation-based feature selection method.
– Thresholds Randomly sample F i.i.d. thresholds from a uniform distribution.⁴
– Outputs Partition all training samples into different bins using the learnt features and thresholds. Then, learn the outputs of the bins using Eq. (7).

Algorithm 4 Internal-level boosted regression
Variables: regression targets Y ∈ R^{N×2N_fp}; training images and corresponding estimated shapes {I_i, S_i}_{i=1}^N; training parameters …
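The covariance decomposition of Eq. (11) can be sketched as follows; `precompute_pixel_cov` and `best_pixel_difference` are hypothetical names. Only the P target-pixel covariances involve the N samples (O(NP) work per primitive regressor); the pixel-pixel covariances are computed once and the P² candidate pairs are then scored from cached scalars.

```python
import numpy as np

def precompute_pixel_cov(rho):
    """rho: (N, P) shape indexed pixel values, fixed within the internal level.
    Returns the P x P pixel-pixel covariance matrix, computed once and reused."""
    return np.cov(rho, rowvar=False, bias=True)

def best_pixel_difference(y_proj, rho, pix_cov):
    """Pick the pixel pair (m, n) maximizing corr(y_proj, rho_m - rho_n)
    via Eq. (11): only the target-pixel covariances are recomputed."""
    N, P = rho.shape
    yc = y_proj - y_proj.mean()
    tp_cov = (rho - rho.mean(axis=0)).T @ yc / N   # cov(Y_proj, rho_m), O(N P)
    sigma_y = yc.std()
    var = np.diag(pix_cov)
    best, best_corr = None, -np.inf
    for m in range(P):                             # P^2 pairs, but scalars only
        for n in range(P):
            if m == n:
                continue
            denom = sigma_y * np.sqrt(max(var[m] + var[n] - 2 * pix_cov[m, n], 1e-12))
            corr = (tp_cov[m] - tp_cov[n]) / denom
            if corr > best_corr:
                best, best_corr = (m, n), corr
    return best
```

As a sanity check, if the projected target is itself ρ_0 − ρ_2, the pair (0, 2) attains correlation 1 and is selected.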
when multiple landmark estimations are tightly clustered, the result is accurate, and vice versa. In the test, we run the regressor several times (5 in our implementation) and take the median result⁶ as the final estimation. Each time, the initial shape is randomly sampled from the training shapes. This further improves the accuracy.

3.4 Running Time Performance

Table 1 summarizes the computational time of training (with 2,000 training images) and testing for different numbers of landmarks, measured on an Intel Core i7 2.93 GHz CPU with a C++ implementation.

Table 1 Training and testing times of our approach
Landmarks        5     29    87
Training (min)   5     10    21
Testing (ms)     0.32  0.91  2.9

Our training is very efficient due to the fast feature selection method: it takes minutes with 40,000 training samples (20 initial shapes per image). The shape regression in the test is extremely efficient because most of the computation is pixel comparison, table lookup and vector addition. It takes only 15 ms to predict a shape with 87 landmarks (3 ms × 5 initializations).

3.5 Parameter Settings

The number of features in a fern F and the shrinkage parameter β adjust the trade-off between fitting power in training and generalization ability in testing. They are set as F = 5, β = 1,000 by cross validation.

Algorithm accuracy consistently increases as the number of stages in the two-level boosted regression (T, K) and the number of candidate features P² increase. These parameters are empirically chosen as T = 10, K = 500, P = 400 for a good tradeoff between computational cost and accuracy.

The parameter κ is used for generating the local coordinates relative to landmarks. We set κ equal to 0.3 times the distance between the two pupils on the mean shape.

4 Experiments

The experiments are performed in two parts. The first part compares our approach with previous works. The second part validates the proposed approach and presents some interesting discussions.

We briefly introduce the datasets used in the experiments. They present different challenges, due to different numbers of annotated landmarks and image variations.

BioID The dataset was proposed by Jesorsky et al. (2001) and has been widely used by previous methods. It consists of 1,521 near frontal face images captured in a lab environment, and is therefore less challenging. We report our result on it for completeness.

LFPW The dataset (Labeled Face Parts in the Wild) was created by Belhumeur et al. (2011). The images are downloaded from the internet and contain large variations in pose, illumination, expression and occlusion. It is intended to test face alignment methods in unconstrained conditions. This dataset shares only web image URLs, and some URLs are no longer valid. We could only download 812 of the 1,100 training images and 249 of the 300 test images. To acquire enough training data, we augment the training images to 2,000 in the same way as Belhumeur et al. (2011) did and use the available testing images.

LFW87 The dataset was created by Liang et al. (2008). The images mainly come from the Labeled Faces in the Wild (LFW) dataset (Huang et al. 2008), which was acquired in uncontrolled conditions and is widely used in face recognition. In addition, it has 87 annotated landmarks, many more than BioID and LFPW; therefore, the performance of an algorithm relies more on its shape constraint. We use the same setting as Liang et al. (2008): the training set contains 4,002 images mainly from LFW, and the testing set contains 1,716 images, all from LFW.

Helen The dataset was proposed by Le et al. (2012). It consists of 2,330 high resolution web images with 194 annotated landmarks. The average face size is 550 pixels; even the smallest face in the dataset is larger than 150 pixels. It serves as a new benchmark which provides richer and more detailed information for accurate face alignment.

4.1 Comparison with Previous Works

For comparisons, we use the alignment error in Eq. (1) as the evaluation metric. To make it invariant to face size, the error is not measured in pixels but normalized by the distance between the two pupils, similar to most previous works.

The following comparisons show that our approach outperforms the state-of-the-art methods in both accuracy and efficiency, especially on the challenging LFPW and LFW87 datasets. Figures 4, 5, 6 and 7 show our results on challenging examples with large variations in pose, expression, illumination and occlusion from the four datasets.

⁶ The median operation is performed on the x and y coordinates of all landmarks individually. Although this may violate the shape constraint mentioned before, the resulting median shape is mostly correct, as in most cases the multiple results are tightly clustered. We found such a simple median-based fusion comparable to more sophisticated strategies such as weighted combination of the input shapes.
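The interocular-normalized evaluation metric described above can be sketched as follows; `alignment_error` and the pupil-index arguments are hypothetical names for a minimal illustration.

```python
import numpy as np

def alignment_error(pred, gt, left_pupil_idx, right_pupil_idx):
    """Mean landmark deviation normalized by the interocular distance,
    making the error invariant to face size. pred, gt: (L, 2) arrays."""
    interocular = np.linalg.norm(gt[left_pupil_idx] - gt[right_pupil_idx])
    per_landmark = np.linalg.norm(pred - gt, axis=1)  # Euclidean error per landmark
    return per_landmark.mean() / interocular
```

On Helen, where pupils are not labeled, the same formula would instead use the distance between the centroids of the two eyes as the normalizer (Sect. 4.1.3).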
4.1.1 Comparison on LFPW

The consensus exemplar approach proposed by Belhumeur et al. (2011) is one of the state-of-the-art methods. It was the best on BioID when published, and obtained good results on LFPW.

The comparison in Fig. 8 shows that most landmarks estimated by our approach are more than 10 % more accurate⁷ than

⁷ The relative improvement is the ratio between the error reduction and the original error.
Table 2 Percentage of test images with RMSE below given thresholds
RMSE            <5 Pixels  <7.5 Pixels  <10 Pixels
CDS (%)         74.7       93.5         97.8
Our method (%)  86.1       95.2         98.2
Bold values represent the best results under certain settings

Fig. 9 Cumulative error curves on the BioID dataset (fraction of landmarks vs. landmark error), for our method and for Vukadinovic and Pantic, Cristinacce and Cootes, Milborrow and Nicolls, Valstar et al., and Belhumeur et al. For comparison with previous results, only 17 landmarks are used (Cristinacce and Cootes 2006). As our model is trained on LFPW images, for those landmarks with different definitions between the two datasets, a fixed offset is applied in the same way as in Belhumeur et al. (2011)

4.1.3 Comparison on Helen

We adopt the same training and testing protocol as well as the same error metric used by Le et al. (2012). Specifically, we divide the Helen dataset into a training set of 2,000 images and a testing set of 330 images. As the pupils are not labeled in the Helen dataset, the distance between the centroids of the two eyes is used to normalize the deviations from groundtruth.

We compare our method with STASM (Milborrow and Nicolls 2008) and the recently proposed CompASM (Le et al. 2012). As shown in Table 3, our method outperforms them by a large margin: compared with STASM and CompASM, it reduces the mean error by 50 and 40 % respectively; meanwhile, the testing speed is even faster.

Table 3 Comparison on the Helen dataset
Method       Mean   Median  Min    Max
STASM        0.111  0.094   0.037  0.411
CompASM      0.091  0.073   0.035  0.402
Our method   0.057  0.048   0.024  0.16
The error of each sample is first computed individually by averaging the errors of the 194 landmarks; the mean error is then computed across all testing samples. Bold values represent the best results under certain settings

4.1.4 Comparison to Previous Methods on BioID

Our model is trained on the augmented LFPW training set and tested on the entire BioID dataset.

Figure 9 compares our method with previous methods (Vukadinovic and Pantic 2005; Cristinacce and Cootes 2006; Milborrow and Nicolls 2008; Valstar et al. 2010; Belhumeur et al. 2011). Our result is the best, but the improvement is marginal. We believe this is because the performance on BioID is nearly saturated due to its simplicity. Note that our method is thousands of times faster than the second best method (Belhumeur et al. 2011).

4.2 Algorithm Validation and Discussions

We verify the effectiveness of the different components of the proposed approach. These experiments are performed on our augmented LFPW dataset. The dataset is split into two parts for training and testing: the training set contains 1,500 images and the testing set contains 500 images. Parameters are fixed as in Sect. 3, unless otherwise noted.

4.2.1 Two-Level Boosted Regression

As discussed in Sect. 2, the stage regressor exploits shape indexed features to obtain geometric invariance and to decompose the original difficult problem into easier sub-tasks. The shape indexed features are fixed within the internal-level boosted regression to avoid instability.

Different tradeoffs between the two levels of boosted regression are presented in Table 4, using the same total number of ferns. On one extreme, regressing the whole shape in a single stage (T = 1, K = 5000) is clearly the worst. On the other extreme, using a single fern as the stage regressor (T = 5000, K = 1) also has poor generalization ability in the test. The optimal tradeoff (T = 10, K = 500) is found in between via cross validation.

Table 4 Tradeoffs between the two levels of boosted regression
Stage regressors (T)       1     5     10   100  5000
Primitive regressors (K)   5000  1000  500  50   1
Mean error (×10⁻²)         15    6.2   3.3  4.5  5.2
Bold value represents the best result under certain settings

4.2.2 Shape Indexed Feature

We compare the global and local methods of shape indexed features. The mean error of the local index method is 0.033, which is much smaller than the mean error of the global index method, 0.059. The superior accuracy supports the proposed local index method.
4.2.3 Feature Selection

The proposed correlation-based feature selection (CBFS) method is compared with the commonly used best-of-n method (Ozuysal et al. 2010; Dollar et al. 2010) in Table 5. CBFS can select good features rapidly, and this is crucial for learning good models from large training data.

Table 5 Comparison between the correlation-based feature selection (CBFS) method and best-of-n feature selection methods
Best-of-n        n = 1  n = 32  n = 1024  CBFS
Error (×10⁻²)    5.01   4.92    4.83      3.32
Time (s)         0.1    3.0     100.3     0.12
The training time is for one primitive regressor. Bold value represents the best result under certain settings

4.2.4 Feature Range

Table 6 Model compression experiment
                      Dataset    Raw   PCA   SC
Mean error (×10⁻²)    LFW87      4.23  4.35  4.34
Model size (MB)       LFW87      118   30    8
Comp. ratio           LFW87      –     4     15
Mean error (×10⁻²)    Helen194   5.70  5.83  5.79
Model size (MB)       Helen194   240   42    12
Comp. ratio           Helen194   –     6     20
The suffix of the dataset name indicates the number of annotated landmarks

We have presented the explicit shape regression method for face alignment. By jointly regressing the entire shape and minimizing the alignment error, the shape constraint is automatically encoded. The resulting method is highly accurate, efficient, and usable in real-time applications such as face tracking. The explicit shape regression framework can also be applied to other problems like articulated object pose estimation and anatomic structure segmentation in medical images.
Duffy, N., & Helmbold, D. P. (2002). Boosting methods for regression. Machine Learning, 47(2–3), 153–200.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), 3736–3745.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232.
Huang, G., Mattar, M., Berg, T., & Learned-Miller, E. (2008). Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition.
Jesorsky, O., Kirchberg, K. J., & Frischholz, R. W. (2001). Robust face detection using the Hausdorff distance (pp. 90–95). New York: Springer.
Jolliffe, I. (2005). Principal component analysis. Wiley Online Library.
Le, V., Brandt, J., Lin, Z., Bourdev, L., & Huang, T. (2012). Interactive facial feature localization. In European Conference on Computer Vision (ECCV).
Liang, L., Xiao, R., Wen, F., & Sun, J. (2008). Face alignment via component-based discriminative search. In European Conference on Computer Vision (ECCV).
Matthews, I., & Baker, S. (2004). Active appearance models revisited. International Journal of Computer Vision, 60(2), 135–164.
Milborrow, S., & Nicolls, F. (2008). Locating facial features with an extended active shape model. In European Conference on Computer Vision (ECCV).
Ozuysal, M., Calonder, M., Lepetit, V., & Fua, P. (2010). Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 448–461.
Saragih, J., & Goecke, R. (2007). A nonlinear discriminative approach to AAM fitting. In International Conference on Computer Vision (ICCV).
Sauer, P., & Cootes, T. (2011). Accurate regression procedures for active appearance models. In British Machine Vision Conference (BMVC).
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., et al. (2011). Real-time human pose recognition in parts from single depth images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Sun, Y., Wang, X., & Tang, X. (2013). Deep convolutional network cascade for facial point detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Tropp, J., & Gilbert, A. (2007). Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12), 4655–4666.
Valstar, M., Martinez, B., Binefa, X., & Pantic, M. (2010). Facial point detection using boosted regression and graph models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Vukadinovic, D., & Pantic, M. (2005). Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In International Conference on Systems, Man and Cybernetics, 2, 1692–1698.
Xiong, X., & De la Torre, F. (2013). Supervised descent method and its applications to face alignment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Zhou, S. K., & Comaniciu, D. (2007). Shape regression machine. In Information Processing in Medical Imaging (pp. 13–25). Heidelberg: Springer.