A Detailed Look at CNN-based Approaches in Facial Landmark Detection

Chih-Fan Hsu12 , Chia-Ching Lin12 , Ting-Yang Hung3 , Chin-Laung Lei2 , and Kuan-Ta Chen1
1 Institute of Information Science, Academia Sinica
2 Department of Electrical Engineering, National Taiwan University
3 Halicioǧlu Data Science Institute, University of California San Diego
arXiv:2005.08649v1 [cs.CV] 8 May 2020
ABSTRACT

Facial landmark detection has been studied for decades. Numerous neural network (NN)-based approaches have been proposed for detecting landmarks, especially convolutional neural network (CNN)-based approaches. In general, CNN-based approaches can be divided into regression and heatmap approaches. However, no research has systematically studied the characteristics of the different approaches. In this paper, we investigate both CNN-based approaches, generalize their advantages and disadvantages, and introduce a variation of the heatmap approach, a pixel-wise classification (PWC) model. To the best of our knowledge, using the PWC model to detect facial landmarks has not been comprehensively studied. We further design a hybrid loss function and a discrimination network that strengthen the landmarks’ interrelationship implied in the PWC model, improving the detection accuracy without modifying the original model architecture. Six common facial landmark datasets, AFW, Helen, LFPW, 300-W, IBUG, and COFW, are adopted to train or evaluate our model. A comprehensive evaluation is conducted, and the results show that the proposed model outperforms other models on all tested datasets.

KEYWORDS

Facial landmark detection, Convolutional neural network

1 INTRODUCTION

Facial landmarks are fundamental components of various applications. For instance, the landmarks can be used to understand facial expressions [1], estimate head poses [2], recognize faces [3], or manipulate facial components [4]. Facial landmark detection aims to automatically detect facial landmarks in an image and has been studied for decades [5].

Various approaches have been proposed to detect facial landmarks, such as active appearance models (AAM) [6] and constrained local models (CLM) [7]. Deep learning approaches have gradually come to dominate this research topic. Numerous neural network (NN)-based approaches have been proposed for detecting landmarks, especially convolutional neural network (CNN)-based approaches. Generally, CNN-based approaches can be further divided into regression and heatmap approaches. Regression approaches directly infer the horizontal and vertical coordinates from a facial image; heatmap approaches detect the spatial position in a set of two-dimensional heatmaps. Many researchers have devoted themselves to developing network architectures for both approaches and have shown that both are robust and accurate for detecting landmarks. However, there is no systematic research on the characteristics of both approaches.

Instead of developing a new network architecture for detecting facial landmarks, in this paper, we (1) investigate both CNN-based approaches, (2) generalize their advantages and disadvantages, and (3) investigate a variation of the heatmap approach, the pixel-wise classification (PWC) model. Although the PWC model is widely used for object instance segmentation [8, 9] and joint detection [10], to the best of our knowledge, using the PWC model to detect facial landmarks has not been comprehensively studied. Besides, detecting facial landmarks with the PWC model might be problematic because numerous landmarks are located at positions with similar image structure. To further improve the detection accuracy of the PWC model by integrating the advantages of the regression and heatmap approaches, we design a hybrid loss function and a discrimination network to strengthen the landmarks’ interrelationship implied in the PWC model.

Six facial landmark datasets, the AFW dataset [11], the Helen dataset [12], the Labeled Face Parts in the Wild (LFPW) dataset [13], the 300 Faces In-the-Wild Challenge (300-W) dataset, the additional 135 images in difficult poses and expressions of the 300-W dataset (IBUG), and the testing set of the Caltech Occluded Faces in the Wild (COFW) dataset [14], are adopted to train or evaluate our model. We conduct a comprehensive evaluation, and the results show that the detection accuracy of the PWC model can be improved without modifying the model architecture. Besides, the proposed model outperforms other models on all datasets. To help the community reproduce our experiments and develop them further, we also release the source code of all models on GitHub (footnote 1).

1 https://github.com/chihfanhsu/fl_detection

2 CONVOLUTIONAL NEURAL NETWORK-BASED APPROACHES

Using deep learning models to detect facial landmarks is a popular research topic because CNN-based approaches have become more efficient and powerful. Generally, CNN-based approaches can be divided into regression and heatmap approaches.

2.1 Regression Approaches

The regression approaches can be further divided into direct and cascaded regression models.

Direct regression models. Detecting facial landmarks with direct regression models has been studied for many years [2, 15–21]. The model detects the landmark coordinates S, represented by a vector, from a facial image I. The dimension of the vector is twice the number of landmarks. Figure 1 shows an example of a direct regression model for detecting 68 facial landmarks. The backbone network can be any network architecture for extracting features from the input facial image. Here, we assume that the last layer of the backbone network contains 2,048 channels.

Figure 1: An example model architecture of the direct regression model for detecting 68 facial landmarks. A rectangle denotes a fully-connected layer and the number of nodes in the layer is listed beneath the rectangle.

Figure 2: Detected results of the regression approach. The face shape formed by detected landmarks is maintained even when detection fails. However, the landmarks suffer from inaccurate positions.
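As a small illustration of the output layout described for the direct regression model, the sketch below reshapes a 136-dimensional output vector into 68 coordinate pairs. The interleaved ordering [x0, y0, x1, y1, ...] and the function name are assumptions made for illustration; the text only states that the vector has twice as many elements as there are landmarks.

```python
import numpy as np

NUM_LANDMARKS = 68  # the example in Figure 1 detects 68 landmarks


def vector_to_landmarks(s):
    """Reshape a flat coordinate vector into one (x, y) row per landmark.

    Assumes an interleaved layout [x0, y0, x1, y1, ...]; this ordering is
    hypothetical and not specified in the text.
    """
    s = np.asarray(s, dtype=np.float64)
    if s.size != 2 * NUM_LANDMARKS:
        raise ValueError("expected a vector of length 2 * number of landmarks")
    return s.reshape(NUM_LANDMARKS, 2)


# A dummy network output of length 136 stands in for the model prediction.
flat = np.arange(2 * NUM_LANDMARKS, dtype=np.float64)
landmarks = vector_to_landmarks(flat)
```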
Figure 3: An example of the network architecture of the distribution model. The goal of the model is to predict a set of heatmaps as similar as possible to the corresponding ground-truth maps generated by the multivariate distribution. Here, the distribution is the multivariate Gaussian with a three-pixel standard deviation. The cube denotes a tensor and the channel size is listed beneath the cube.

Cascaded regression models. Unlike direct regression models, which directly detect landmark coordinates, cascaded regression models iteratively update a predefined or pre-detected set of landmarks $S_0$ to detect landmarks [22–31]. A sub-network is used to generate an updating vector $\Delta S_i$ that updates the landmark positions in each stage $i$. After $n$ updates, the model outputs the final landmark coordinates $S_n$. Generally, the cascaded regression models can be simply formulated as $S_i = S_{i-1} + \Delta S_{i-1}$, $i = 1, 2, \ldots, n$.

To train a regression approach, the L2 distance is adopted to evaluate the point-wise difference between the detected and the ground-truth landmarks, which can be formulated as

$$\mathrm{loss}_{reg} = \frac{1}{|L|} \sum_{l \in L} \| S_l - \hat{S}_l \|_2 , \qquad (1)$$

where $L$ is the set of landmarks, and $S_l$ and $\hat{S}_l$ represent the detected and the ground-truth coordinates of the $l$-th landmark, respectively.
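The cascaded update rule and the regression loss of Eq. (1) can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the function names are hypothetical, and each stage is a plain Python callable standing in for the sub-network that produces the update vector.

```python
import numpy as np


def regression_loss(detected, ground_truth):
    """Eq. (1): mean over landmarks of the L2 distance between the detected
    and the ground-truth points. Both inputs have shape (|L|, 2)."""
    detected = np.asarray(detected, dtype=np.float64)
    ground_truth = np.asarray(ground_truth, dtype=np.float64)
    return float(np.mean(np.linalg.norm(detected - ground_truth, axis=1)))


def cascaded_regression(initial_shape, stage_fns):
    """Apply S_i = S_{i-1} + dS_{i-1} once per stage.

    Each element of stage_fns maps the current shape to an update vector;
    it is a hypothetical stand-in for a trained sub-network.
    """
    shape = np.asarray(initial_shape, dtype=np.float64)
    for stage_fn in stage_fns:
        shape = shape + stage_fn(shape)
    return shape


# Toy example: every stage moves the shape halfway towards a known target.
target = np.array([[10.0, 20.0], [30.0, 40.0]])
initial = np.zeros_like(target)  # plays the role of a predefined shape S_0
stages = [lambda s: 0.5 * (target - s)] * 3
refined = cascaded_regression(initial, stages)
```

In this toy setting each stage halves the remaining error, which mirrors the coarse-to-fine behaviour attributed to cascaded models.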
Comparing the direct and the cascaded models, cascaded regression models are generally more effective than direct regression models because they follow the coarse-to-fine strategy [32]. However, there is no standard that defines how many stages a cascaded model should include to achieve the best detection accuracy. Also, there is no standard way to generate the predefined face shape. Hence, numerous studies obtain the predefined shape by averaging the face shapes of the training set or predict the shape with an additional model.

Generally, regression approaches benefit from the strong interrelationship between landmarks because the structural information of a face is implicitly embedded and learned in the fully connected layers. Therefore, the detected landmarks can still form a face-like shape even if some key parts of a face are occluded. On the other hand, since the interrelationship is strong, the approach suffers from slightly inaccurate landmark positions. Namely, the approach tends to maintain the shape of the landmarks as a face rather than detecting the true positions of the landmarks. Once detection fails, the model may randomly place landmarks in a face-like shape. Figure 2 shows several examples detected by regression approaches. The left two images illustrate examples in which the detected landmarks suffer from inaccurate positions. The right two images illustrate examples in which the detection fails but the landmarks still form a face-like shape.

2.2 Heatmap Approaches

Heatmap approaches detect a landmark by indicating the position of the landmark in a two-dimensional heatmap. Generally, the model structures of heatmap approaches are inspired by the fully convolutional network (FCN) proposed by [8], which contains a convolutional part to generate semantic features from the facial image and a de-convolutional part to decode the semantic features into a set of heatmaps. According to the characteristics of the heatmap and the loss function, heatmap approaches can be divided into the distribution, the heatmap regression, and the pixel-wise classification models.

Distribution models. The distribution model indicates the position of a landmark by a multivariate distribution. The center of the distribution is located at the landmark coordinates. A two-dimensional Gaussian distribution is commonly used to indicate the landmark [33–36]. Figure 3 shows an example of the de-convolutional part of a distribution model for detecting 68 facial landmarks. Note that the number of channels contained in the heatmaps is the same as the number of landmarks. A spatial softmax function is used to force the sum of the elements in each heatmap to equal one.

To train a distribution model, the Kullback-Leibler (KL) divergence is adopted to measure the distance between the predicted heatmap $h_l$ and the ground-truth Gaussian distribution $\hat{h}_l$ for each landmark $l$. Namely, the loss can be calculated by $\mathrm{loss}_{dist} = \sum_{l \in L} KL(\hat{h}_l \,\|\, h_l)$, where $KL(\cdot)$ is the KL divergence.

Figure 4: An example of the pixel-wise classification model. The vector at each pixel in the output heatmaps is a probabilistic vector that indicates which landmark class, or the background class, the pixel belongs to.

Figure 5: Several failed results detected by the heatmap approaches. The face shape suffers from serious distortion when some landmarks are not successfully detected. On the other hand, the successfully detected landmarks benefit from accurate positions.

Heatmap regression models. Wei et al. [37] proposed an alternative model to detect landmarks from heatmaps. Although the proposed model, Convolutional Pose Machines, is used to detect landmarks of the human body, the model can be extended to detect facial landmarks. Differing from the distribution model, the output of the model contains an additional heatmap to indicate the background pixels. The L2 loss is adopted to evaluate the difference between the ground-truth distribution maps and the predicted heatmaps. Namely, the loss function can be formulated as $\mathrm{loss}_{hreg} = \sum_{l \in \{L, b\}} (\hat{h}_l - h_l)^2$, where $b$ represents the image background.
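The ground-truth maps and the two heatmap losses described above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function names are hypothetical, the Gaussian is assumed to be isotropic with the three-pixel standard deviation mentioned in the caption of Figure 3, and a small epsilon is added inside the logarithms for numerical safety.

```python
import numpy as np


def gaussian_heatmap(height, width, center_xy, sigma=3.0):
    """Ground-truth map: isotropic 2D Gaussian centred on the landmark,
    normalised so that its elements sum to one."""
    ys, xs = np.mgrid[0:height, 0:width]
    cx, cy = center_xy
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()


def spatial_softmax(logits):
    """Softmax over all pixels of one heatmap, so its elements sum to one."""
    e = np.exp(logits - logits.max())
    return e / e.sum()


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two maps that each sum to one."""
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))


def distribution_loss(targets, preds):
    """loss_dist: sum of per-landmark KL divergences."""
    return sum(kl_divergence(t, p) for t, p in zip(targets, preds))


def heatmap_regression_loss(targets, preds):
    """loss_hreg: summed squared difference over all maps; the caller is
    expected to include the extra background map in both lists."""
    return float(sum(np.sum((t - p) ** 2) for t, p in zip(targets, preds)))


# One landmark at (x=20, y=30) on a 64 x 64 map.
target = gaussian_heatmap(64, 64, center_xy=(20, 30))
uniform = spatial_softmax(np.zeros((64, 64)))  # uninformative prediction
loss_same = distribution_loss([target], [target])
loss_far = distribution_loss([target], [uniform])
```

A prediction identical to the ground truth gives a zero loss, while the uniform prediction is penalised.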
Pixel-wise classification models. Using pixel-wise classification (PWC) models to detect landmarks of the human body was studied by He et al. [9]. The model can also be extended to detect facial landmarks; however, to the best of our knowledge, adopting PWC models to detect facial landmarks has not been comprehensively studied.

Although the output of the PWC model is a set of heatmaps, similar to the aforementioned heatmap models, the meaning of the heatmap is different. The heatmaps of the distribution and heatmap regression models are generated by the multivariate distribution to indicate the spatial position of the landmarks; the heatmap of the PWC model is generated by a set of probabilistic vectors that indicate the probabilities of each pixel belonging to each landmark or to the background. Specifically, each vector contains $|L| + 1$ elements and the sum of the elements is equal to one. Figure 4 shows an example of the network architecture of the PWC model.

To train a PWC model, the cross-entropy loss is adopted. Namely,

$$\mathrm{loss}_{PWC} = -\frac{1}{|I|} \sum_{p \in I} \sum_{i \in \{L, b\}} \hat{h}_i^p \log(h_i^p) , \qquad (2)$$

where $h_i^p$ and $\hat{h}_i^p$ represent the predicted probabilistic vector and the ground-truth vector at pixel $p$, respectively, and $i$ indicates the $i$-th element of the vector.

To obtain the landmark positions from the heatmaps, the coordinates of a landmark are taken as the position with the maximum probability in each heatmap, which can be calculated by $S_l = \arg\max_{(x,y)} h_l$, $l \in L$. The heatmap approaches benefit from highly accurate landmark positions but suffer from the lack of an interrelationship between landmarks, because the interrelationship gradually degrades as it passes through the de-convolutional layers. Once the detection of some landmarks fails (mostly because of occlusion), the detected position tends to shift to a nearby corner or edge. As a result, the shape of the detected landmarks is distorted. Figure 5 shows the detected results of the heatmap approaches. As we can observe, landmarks corresponding to occluded parts are commonly missing and the face shape formed from the detected landmarks is distorted.

3 A PIXEL-WISE CLASSIFICATION MODEL WITH DISCRIMINATOR

We seek a new method that keeps the advantages and avoids the disadvantages of the regression and heatmap approaches. Considering that landmark accuracy is highly important, we develop a new model based on the heatmap approach. Moreover, because the PWC model outperforms the other two heatmap models, we design the new model based on the PWC model (the accuracy will be discussed in Section 5).

A hybrid loss function. To overcome the disadvantage of heatmap approaches that the interrelationship among landmarks cannot be maintained, we introduce a hybrid loss function that combines the L2 loss and the PWC loss functions. The idea of the hybrid loss function is to augment the PWC loss with the L2 distance, penalizing the landmark shifting that occurs when detection fails and thereby strengthening the interrelationship implied in the model. Our loss function can be formulated by

$$\mathrm{loss}_{hybrid} = \alpha \times \mathrm{loss}_{PWC} + \beta \times \mathrm{loss}_{reg} , \qquad (3)$$

where the hyperparameters $\alpha$ and $\beta$ are used to balance the loss functions; we empirically set $\alpha$ to 1 and $\beta$ to 0.25 in our experiment.

A discrimination network. We also expect the detected landmarks to form a face-like shape. To achieve this goal, we add a discrimination network $D$ after the detection network to encourage the detected landmarks to keep a face-like shape. Specifically, the loss function is modified to $\mathrm{loss}_{total} = \mathrm{loss}_{hybrid} + \mathrm{loss}_{face}$, where $\mathrm{loss}_{face}$ is defined by $-E[\log(D(S))]$, reflecting the encouragement of the predicted landmarks to be classified by the discrimination network as having a face-like shape.

The detection and discrimination networks are jointly trained in the same way as the Generative Adversarial Networks proposed by Goodfellow et al. [38]. Specifically, each training step is divided into a stage that updates the detection network and a stage that updates the discrimination network. The loss $\mathrm{loss}_{total}$ is used to update the detection network, and the loss function $\mathrm{loss}_{disc} = -(E[\log(D(\hat{S}))] + E[\log(1 - D(S))])$ is used to update the discrimination network. In our experiment, in each training step, we empirically train the discrimination network once and the detection network twice.

Figure 6: The proposed model contains a detection network and a discrimination network. The detection network is a PWC model for detecting landmarks. The discrimination network is a 3-layer fully connected NN that tests whether the detected landmarks form a face-like shape.

Figure 7: The backbone network. Each cube represents a tensor with a specific size. We only show the first two tensors for figure conciseness. Beneath the cube, the numbers before and after the × mark indicate the tensor’s channel size and the number of convolutional blocks performed on the tensor, respectively. The number of convolutional blocks is omitted when only one convolutional block is performed. The mp represents the max-pooling layer with a 2 × 2 kernel size and a stride of two.

Table 1: The number of images in the training (T) and validation (V) sets.

      AFW    Helen    LFPW    300-W (In/Out)    IBUG    COFW
#T    337    2,000    811     0/0               0       0
#V    0      330      224     300/300           135     507

Figure 6 shows the network architecture of our model. As previously mentioned, the backbone network can be any network architecture. A cube in the figure represents a specific tensor size and the number of channels of the tensor is shown below the cube. At least one convolutional block is performed at each tensor size. The convolutional block comprises a convolutional layer, a batch normalization layer, and an activation layer, in that order.
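The loss terms of Eqs. (2) and (3) and the two adversarial losses can be written down compactly. The NumPy sketch below only evaluates the loss values on toy inputs and rests on several assumptions that are not taken from the text: the function names are hypothetical, the background is stored as the last class, the ground truth is given as an integer class map (equivalent to one-hot vectors), and the coordinates for the regression term are obtained with a hard argmax, which is not differentiable, so it says nothing about how gradients are propagated during training.

```python
import numpy as np

ALPHA, BETA = 1.0, 0.25  # weights reported for the hybrid loss


def pwc_loss(probs, labels, eps=1e-12):
    """Eq. (2) with one-hot ground truth.
    probs: (H, W, |L|+1) per-pixel class probabilities.
    labels: (H, W) integer class map; last class = background (assumed)."""
    h, w, _ = probs.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    picked = probs[rows, cols, labels]  # probability of the true class
    return float(-np.mean(np.log(picked + eps)))


def decode_landmarks(probs):
    """S_l = argmax over (x, y) of h_l for every landmark channel."""
    h, w, c = probs.shape
    flat = probs[..., : c - 1].reshape(h * w, c - 1)  # drop background
    ys, xs = np.unravel_index(flat.argmax(axis=0), (h, w))
    return np.stack([xs, ys], axis=1).astype(np.float64)


def regression_loss(detected, ground_truth):
    """Eq. (1): mean point-wise L2 distance."""
    return float(np.mean(np.linalg.norm(detected - ground_truth, axis=1)))


def hybrid_loss(probs, labels, ground_truth):
    """Eq. (3): alpha * loss_PWC + beta * loss_reg."""
    reg = regression_loss(decode_landmarks(probs), ground_truth)
    return ALPHA * pwc_loss(probs, labels) + BETA * reg


def face_loss(d_detected, eps=1e-12):
    """loss_face = -E[log D(S)], with D(S) given as probabilities."""
    return float(-np.mean(np.log(np.asarray(d_detected) + eps)))


def discriminator_loss(d_real, d_detected, eps=1e-12):
    """loss_disc = -(E[log D(S_hat)] + E[log(1 - D(S))])."""
    d_real, d_detected = np.asarray(d_real), np.asarray(d_detected)
    return float(-(np.mean(np.log(d_real + eps))
                   + np.mean(np.log(1.0 - d_detected + eps))))


# Toy 8 x 8 image with two landmarks; class 2 is the background.
labels = np.full((8, 8), 2)
labels[2, 3] = 0  # landmark 0 at (x=3, y=2)
labels[5, 6] = 1  # landmark 1 at (x=6, y=5)
truth = np.array([[3.0, 2.0], [6.0, 5.0]])
perfect = np.eye(3)[labels]              # one-hot prediction
uniform = np.full((8, 8, 3), 1.0 / 3.0)  # uninformative prediction
```

With the perfect prediction both terms vanish, whereas the uniform prediction is penalised by the cross-entropy term and by the coordinate shift.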
When more than one convolutional block is performed, a × mark is shown after the channel size and the number after the × mark indicates the number of convolutional blocks (Figure 7). The number above the cube denotes the kernel size of the filters in the convolutional layer. The default kernel size is 3 × 3 and is omitted in the figure. A rectangle denotes a fully connected block and the number below it is the number of nodes in the fully connected layer. A fully connected block comprises a fully connected layer, a batch normalization layer, and an activation layer, in that order.

4 TRAINING AND IMPLEMENTATION

We implement the models with Tensorflow 1.8.0 and Python 3.5.3. The input image size is set to 224 × 224 × 3. The backbone network is implemented as a VGG19-like network architecture, and two shortcut links are used to improve the feature utilization for the heatmap approaches (Figure 7). A data augmentation algorithm is adopted to generate faces with various orientations and sizes to increase the dataset diversity. Specifically, we randomly rotate the input image by an angle from −30° to 30° and rescale the image with a ratio from 0.6 to 1.0 before the image is fed to the model. The Adam optimizer is used to update the network parameters. An early stopping mechanism is adopted to prevent the model from overfitting. Specifically, we calculate the average validation loss once every 1,000 training steps and stop the training process when the loss does not decrease ten times in a row.

Several public facial landmark datasets, the AFW, Helen, LFPW, 300-W, IBUG, and COFW datasets, are used for training or testing in our experiment. We use the default settings of the datasets to separate the training and validation sets to make a fair comparison. Table 1 shows the number of images in both sets for each dataset. To train the model, we crop the face in every facial image according to the ground-truth landmarks to reduce the negative impact of unstable face detection. Specifically, we calculate the maximum horizontal and vertical distances ($d_h$ and $d_v$) between landmarks for each image. Then, the facial image is cropped by a square bounding box with a side length of $1.3 \times \max(d_h, d_v)$, and the center of the bounding box is located at the centroid of the landmarks. Finally, the cropped image is resized to 224 × 224 × 3.

5 EXPERIMENTAL RESULTS

To avoid confusion between models in the following sections, we name the proposed model hybrid+disc, where hybrid+disc denotes the model trained with $\mathrm{loss}_{hybrid}$ and supported by the discrimination network during training. Figure 8 shows the detected results of the hybrid+disc model. As we can observe, the hybrid+disc model can handle various head orientations and facial expressions. Also, the model can moderately handle partial occlusion.

We adopt the normalized mean squared error (NMSE) to quantitatively evaluate the models. The NMSE can be calculated by

$$NMSE = \frac{1}{|L|} \sum_{l=1}^{|L|} \frac{\| S_l - \hat{S}_l \|_2}{d_{iod}} ,$$

where $\hat{S}_l$ and $S_l$ represent the coordinates of the $l$-th ground-truth and detected landmarks, respectively. The notation $d_{iod}$ represents the inter-ocular distance, which is calculated as the L2 distance between the outer corners of the ground-truth eyes.

Nine models are compared in our experiment: (1) the Dlib model, an ensemble of regression trees presented by Kazemi and Sullivan [39]; (2) the TCDCN model, a multi-task NN-based model presented by Zhang et al. [16]; (3) a simple direct regression method implemented with a CNN; (4) a cascaded regression model implemented with the DAN model architecture; (5) a distribution model, whose ground-truth heatmaps are generated by the Gaussian distribution with a three-pixel standard deviation; (6) a heatmap regression model, whose ground-truth heatmaps are the same as those of the distribution model; (7) a PWC model, a CNN-based model trained with $\mathrm{loss}_{PWC}$; (8) a hybrid model, a CNN-based model trained with $\mathrm{loss}_{hybrid}$; and (9) a hybrid+disc model.
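Two of the procedures described above, the landmark-based crop from Section 4 and the NMSE metric, can be sketched as follows. The sketch rests on assumptions: the function names are hypothetical, $d_h$ and $d_v$ are read as the horizontal and vertical extents of the landmark set, the norm in the NMSE is read as an unsquared L2 distance, and the indices of the outer eye corners are passed in as parameters because they depend on the annotation scheme.

```python
import numpy as np


def crop_box(landmarks, scale=1.3):
    """Square crop with side scale * max(d_h, d_v), centred on the landmark
    centroid. Returns (left, top, side) in pixel coordinates."""
    pts = np.asarray(landmarks, dtype=np.float64)
    d_h = pts[:, 0].max() - pts[:, 0].min()  # horizontal extent
    d_v = pts[:, 1].max() - pts[:, 1].min()  # vertical extent
    side = scale * max(d_h, d_v)
    cx, cy = pts.mean(axis=0)
    return cx - side / 2.0, cy - side / 2.0, side


def nmse(detected, ground_truth, left_outer_eye, right_outer_eye):
    """Mean point-wise L2 error divided by the inter-ocular distance, which
    is measured between the ground-truth outer eye corners."""
    det = np.asarray(detected, dtype=np.float64)
    gt = np.asarray(ground_truth, dtype=np.float64)
    d_iod = np.linalg.norm(gt[left_outer_eye] - gt[right_outer_eye])
    return float(np.mean(np.linalg.norm(det - gt, axis=1)) / d_iod)


# Toy example with three points; points 0 and 1 play the outer eye corners.
gt = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
det = gt + np.array([1.0, 0.0])  # every point is off by one pixel
left, top, side = crop_box(gt)
score = nmse(det, gt, left_outer_eye=0, right_outer_eye=1)
```

In the toy example the inter-ocular distance is 10 pixels and every landmark is off by one pixel, so the normalized error is 0.1.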
To conduct a fair comparison among models, the backbone network is the same in all investigated models except the DAN model, because the cascaded model contains several sub-networks. Adopting the backbone network in all sub-networks would make the model include a massive number of parameters that exceeds the memory limitation of our training machine. Hence, we adopt the original three-stage DAN suggested by Kowalski et al. [27]. Since the DAN model requires a 112 × 112 × 3 input image, we resize the image to meet the requirement and detect the landmark coordinates at the 224 × 224 image size.

In the experiment, the direct and cascaded regression models contain 178,682,440 and 35,053,144 parameters, respectively. The distribution model contains 82,871,880 parameters. The PWC model and the heatmap regression model each contain 82,947,658 parameters, because the only difference between the two approaches is the loss function. The hybrid+disc model contains 82,982,347 parameters, of which the detection and discrimination networks contain 82,947,658 and 34,689 parameters, respectively. Overall, the direct regression model contains the largest number of parameters.

Table 2 shows the average NMSE values of the nine models. We also evaluate the standard deviation over five independent trainings. Since the standard deviations are small, we omit them to keep the table concise. The Dlib and the TCDCN models are the baselines for comparison. The lowest NMSE value in each testing dataset is highlighted. Note that the Dlib model cannot successfully detect the landmarks for all validation images. Therefore, the average NMSE values of the Dlib model are only used for reference and the detection rates are listed below the table.

Table 2: The average NMSE of the investigated models for detecting 68 facial landmarks. The colored numbers indicate the smallest NMSE in each dataset. The 300-W dataset contains the indoor (In) and the outdoor (Out) subsets and the results are listed in two horizontal rows.

Dataset       Dlib*    TCDCN    Direct   Cascaded  Dist.    Heat. reg.  PWC      PWC+disc  hybrid   hybrid+disc
Helen         0.0298   0.0474   0.0356   0.0367    0.0337   0.0353      0.0334   0.0336    0.0315   0.0312
LFPW          0.0356   0.0454   0.0359   0.0360    0.0347   0.0371      0.0356   0.0359    0.0336   0.0337
300-W (In)    0.0677   0.0781   0.0495   0.0520    0.0508   0.0597      0.0499   0.0505    0.0481   0.0486
300-W (Out)   0.0640   0.0746   0.0500   0.0516    0.0503   0.0586      0.0487   0.0492    0.0470   0.0473
IBUG          0.0594   0.0795   0.0643   0.0678    0.0700   0.0812      0.0656   0.0663    0.0637   0.0639
Total         0.0496   0.0639   0.0452   0.0469    0.0455   0.0515      0.0446   0.0450    0.0427   0.0428

*Dlib does not successfully detect landmarks for all images in the validation set. The detection rates of Dlib in each dataset are as follows: Helen (318/330), LFPW (218/224), 300-W indoor (264/300) and outdoor (252/300), and IBUG (98/135).

Generally, all investigated models outperform the baseline models except the heatmap regression model. This is because the heatmap regression model suffers from the weak interrelationship between landmarks due to the network architecture and from the small penalty for failed detection due to the loss function. Therefore, once the detection of a landmark fails, the detected position is usually located at a position that has a structure similar to that of the landmark without occlusion, and this position might be far from the ground-truth position. The largest NMSE value in the IBUG dataset and the relatively small NMSE value in the Helen dataset reveal this trend. Among the regression models, surprisingly, the DAN model performs worse than the direct regression model. We suspect that the DAN model is limited by the network architecture of the sub-network, because the sub-network only contains 11,146,312 parameters, far fewer than the number of parameters contained in the direct regression model. Among the heatmap models, the PWC model detects the landmarks more accurately than the regression models and the other heatmap models. The hybrid model outperforms other models in almost all datasets.

Figure 8: Several results detected by the hybrid+disc model.

Figure 9: The ECDF of the NMSE for the investigated models. Panels: (a) 68 facial landmarks, (b) 12 eye anchors. Horizontal axis: NMSE; vertical axis: ECDF.

Figure 9 shows the empirical cumulative distribution function (ECDF) of the NMSE values to illustrate the model effectiveness on the validation set. Because the landmarks around the eyes (also called eye anchors) are more widely used than other landmarks, for example in eye blink detection and gaze manipulation, we further evaluate the detection accuracy for the 12 eye anchors. Figure 9(a) and Figure 9(b) show the ECDF results of all 68 landmarks and of the 12 eye anchors, respectively. As previously mentioned, heatmap approaches generally achieve higher detection accuracy than the regression approaches. However, the heatmap regression model suffers from large NMSE values when the landmarks are hard to detect. The PWC model slightly outperforms the distribution model. The proposed hybrid loss function further improves the PWC model without enlarging the network architecture. Overall, the hybrid model outperforms the other investigated models.

5.1 Ablation Study

To ensure that the hybrid loss function and the discrimination network can indeed improve the detection accuracy of the PWC model, we trained the models with several combinations of model and loss function, which include (1) the PWC model, (2) the PWC model supported by the discrimination network (PWC+disc), (3) the hybrid model, and (4) the hybrid+disc model.

Table 2 shows the average NMSE of the investigated combinations. Generally, the model trained with the hybrid loss function outperforms the model trained with the PWC loss function. The result indicates that the regression loss function can help the PWC model achieve better detection accuracy without modifying the architecture of the detection network. However, the discrimination network does not significantly improve detection accuracy.
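The ECDF curves used in this evaluation can be computed directly from the per-image NMSE values. The sketch below is a minimal NumPy illustration with hypothetical function names; the numbers in the example are made up for illustration and are not results from the paper.

```python
import numpy as np


def ecdf(values):
    """Return the sorted values and, for each, the fraction of samples that
    are less than or equal to it."""
    x = np.sort(np.asarray(values, dtype=np.float64))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


def fraction_below(values, threshold):
    """Fraction of images whose NMSE does not exceed the threshold, i.e. the
    height of the ECDF curve at that threshold."""
    v = np.asarray(values, dtype=np.float64)
    return float(np.mean(v <= threshold))


# Made-up per-image NMSE values, for illustration only.
per_image_nmse = np.array([0.02, 0.05, 0.03, 0.08, 0.04])
x, y = ecdf(per_image_nmse)
share = fraction_below(per_image_nmse, 0.04)
```

A curve that lies further to the upper left means that a larger share of images is detected with a small error, which is how the figures compare models.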
0.4
0.4
Figure 10 shows the ECDF of NMSE values for every loss com- PWC PWC
0.2
0.2
PWC+disc PWC+disc
binations. As we can observe that the hybrid model supported by hybrid hybrid
hybrid+disc hybrid+disc
0.0
0.0
the discrimination network achieves slightly better results than the 0.01 0.02 0.03 0.04 0.05 0.06 0.010 0.015 0.020 0.025 0.030 0.035 0.040
models without supported by the discrimination network about NMSE NMSE
60% and 80% of validation images for detecting 68 facial landmarks (a) 68 landmarks (b) 12 eye anchors
and 12 eye anchors, respectively. However, for the PWC model, the
discrimination network does not improve the detection accuracy. Figure 10: The ECDF of the NMSE values for different loss
To further explore the benefits by adopting the hybrid loss func- function combinations.
tion and the discrimination network, we carefully observe the de-
tected landmarks among the detected results. Generally, the land-
marks detected by the PWC model prefer to locate at the key point
(the edge or the corner). The result is not surprising because the
key point is easier to be preserved in the semantic features than
the smooth area and the facial landmarks should locate at the key 'ƌŽƵŶĚͲƚƌƵƚŚ Wt WtнĚŝƐĐ ŚǇďƌŝĚ ,ǇďƌŝĚнĚŝƐĐ
point if the landmarks are not occluded. However, the landmark
positions are easy to be affected by the shape of the key point es- Figure 11: The hybrid loss function penalizes the landmark
pecially the landmarks located at the edge. Besides, once the land- shifting due to the shape of the key point or partial occlu-
marks are occluded, the detected positions tend to located at a key sion. The discrimination network encourages the detected
point near the position where the landmark should be. Therefore, landmarks to form a face-like shape.
the shape of the detected landmarks might be distorted. Figure 11
shows the illustration. In the result detected by the PWC model,
the landmark positions are affected by the shape of the edge (red 5.2 Occlusion Tolerance
and blue) and the occluded landmark located at a nearby corner
To explore the model’s ability for handling facial images with par-
(blue). Adding the L2 loss to the PWC model penalizes the posi-
tial occlusion, we mutually selected 298 images that some land-
tion shifting due to the occlusion or the negative impact from the
marks are occluded by objects or extreme light sources from the
shape of the key point, which greatly improves the detection ac-
validation datasets. Figure 12 shows several results detected by
curacy. However, the shape of the detected landmarks might still
the investigated models. As we can observe that the regression ap-
be distorted. Although adopting the discrimination network in the
proaches suffer from inaccurate landmark positions. The heatmap
training process might make landmarks slight inaccurate, the net-
approaches benefit from the accurate positions but the landmarks
work successfully improves the shape of the landmarks. Overall,
shift due to the occlusion. The hybrid+disc model eases the land-
adding the discrimination network has only a minor impact on de-
mark shifting and can moderately guess the landmark positions
tection accuracy. Besides, the discrimination network does not in-
when landmarks are occluded.
crease the inference time when testing. Training a model with a
Figure 13 shows the ECDF of the NMSE values for the images
discrimination network is worth being considered and explored
with and without partial occlusion. As we can observe that the
further.
hybrid+disc model outperforms other models for the images with-
out partial occlusion. For the images with partial occlusion, the
hybrid+disc model achieves better model accuracy for about 70%
of validation images. It is worthwhile to mention that when the
image has serious occlusion or failed detection, the regression approaches achieve lower NMSE values than the heatmap approaches because the face shape is maintained.

Figure 12: The hybrid+disc model can moderately guess the landmark positions when landmarks are occluded. (Columns: Direct reg., Cascaded reg., Dist., Heatmap reg., PWC, hybrid+disc. Rows: images #138, #939, #964.)

Figure 13: The ECDF of the NMSE values for evaluating occlusion tolerance. (Panels: (a) Without occlusion, (b) With occlusion, both for the 68 facial landmarks. Axes: NMSE from 0.01 to 0.08, ECDF from 0.0 to 1.0. Curves: Direct reg., Cascaded reg., Distribution, Heatmap reg., PWC, hybrid+disc.)

To further explore occlusion tolerance on seriously occluded facial images, we tested the investigated models on the testing set of the COFW dataset, which contains 507 images. Figure 14(a) and Figure 14(b) show the ECDF of the NMSE values for detecting the 68 facial landmarks and the 12 eye anchors, respectively. Figure 14(a) also reveals that the regression approaches perform better than the heatmap approaches on images with serious occlusion. The hybrid model slightly outperforms the other heatmap approaches on about 70% of the testing images. As in the previous experiment, the heatmap regression performs the worst among all investigated models. For the eye anchors, which are occluded less often than the other landmarks, the heatmap approaches achieve more precise landmark positions than the regression approaches. Once the landmarks become hard to detect, the regression approaches gradually achieve better accuracy than the heatmap approaches. Overall, the hybrid model and the hybrid+disc model outperform the other models on 80% of the testing images.

Figure 14: The ECDF of the NMSE values for evaluating occlusion tolerance on the seriously occluded images (the COFW dataset). (Panels: (a) 68 facial landmarks, (b) 12 eye anchors. Axes: NMSE from 0.01 to 0.08, ECDF from 0.0 to 1.0. Curves: Direct reg., Cascaded reg., Distribution, Heatmap reg., PWC, hybrid, hybrid+disc.)

6 DISCUSSION AND CONCLUSION
We have comprehensively studied the commonly used convolutional neural network-based approaches for detecting 68 facial landmarks, namely the regression and heatmap approaches, and their variations. We generalize the advantages and disadvantages of these approaches and investigate a new variation of the heatmap model, the pixel-wise classification (PWC) model. To further improve the detection accuracy of the PWC model, we propose the hybrid loss function and comprehensively verify that it improves the detection accuracy without modifying the architecture of the detection network. In addition, a discrimination network is proposed to encourage the detected landmarks to form a face-like shape. Although the discrimination network has little impact on detection accuracy, it merits further study. Our proposed model is evaluated on six common facial landmark datasets: AFW, Helen, LFPW, 300-W, IBUG, and COFW. The evaluation results reveal that the PWC model combined with the hybrid loss function achieves higher landmark accuracy than the other investigated approaches, not only for the 68 facial landmarks but also for the 12 eye anchors.
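
As a supplementary illustration of the evaluation protocol used in Section 5.2, the following minimal Python sketch shows how a per-image normalized error and its empirical cumulative distribution function (ECDF) could be computed. It is not the authors' evaluation code. The function names are hypothetical, and the error definition is an assumption: the sketch uses the mean Euclidean landmark distance divided by a caller-supplied normalizing length, whereas the exact NMSE definition and normalizer are those given in the experimental setup of the paper.

```python
from bisect import bisect_right
from math import hypot


def normalized_error(pred, truth, norm):
    """Mean landmark distance for one image, divided by a normalizing length.

    pred, truth: equal-length sequences of (x, y) landmark coordinates.
    norm: positive normalizing length (an assumption of this sketch,
          for example a face bounding-box size).
    """
    if len(pred) != len(truth) or len(pred) == 0:
        raise ValueError("pred and truth must be non-empty and of equal length")
    if norm <= 0:
        raise ValueError("norm must be positive")
    total = sum(hypot(px - tx, py - ty)
                for (px, py), (tx, ty) in zip(pred, truth))
    return total / (len(pred) * norm)


def ecdf(values):
    """Return a function F where F(t) is the fraction of values <= t."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")

    def F(t):
        # bisect_right counts how many sorted values are <= t.
        return bisect_right(ordered, t) / len(ordered)

    return F


if __name__ == "__main__":
    # Toy per-image errors for a single hypothetical model.
    errors = [0.02, 0.04, 0.04, 0.08]
    F = ecdf(errors)
    for t in (0.01, 0.04, 0.08):
        print(f"ECDF({t:.2f}) = {F(t):.2f}")
```

Read this way, a curve that lies above another at a given error threshold indicates that a larger fraction of images is detected within that error, which is how the ECDF plots in Figures 13 and 14 compare the models.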
