A Detailed Look at CNN-based Approaches in Facial Landmark Detection
Chih-Fan Hsu (1, 2), Chia-Ching Lin (1, 2), Ting-Yang Hung (3), Chin-Laung Lei (2), and Kuan-Ta Chen (1)
(1) Institute of Information Science, Academia Sinica
(2) Department of Electrical Engineering, National Taiwan University
(3) Halıcıoğlu Data Science Institute, University of California San Diego
arXiv:2005.08649v1 [cs.CV] 8 May 2020
Figure 4: An example of the pixel-wise classification model. The vector at each pixel of the output heatmaps is a probability vector that indicates which landmark class, or the background class, the pixel belongs to.
Figure 5: Several failure cases of the heatmap approaches. The face shape suffers from serious distortion when some landmarks are not successfully detected. On the other hand, the successfully detected landmarks benefit from accurate positions.

the background pixels. The L2 loss is adopted to evaluate the difference between the ground-truth distribution maps and the predicted heatmaps. Namely, the loss function can be formulated as $\mathrm{loss}_{hreg} = \sum_{l \in \{L, b\}} (\hat{h}_l - h_l)^2$, where $b$ represents the image background.
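The L2 heatmap loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the heatmaps are assumed to be nested lists indexed as [channel][row][column] with one channel per landmark plus one for the background, the square is read as summed over all pixels, and the function name `l2_heatmap_loss` is hypothetical.

```python
def l2_heatmap_loss(target, predicted):
    """Sum of squared differences over all channels and pixels.

    Both arguments are nested lists shaped [channel][row][col]; the channels
    cover every landmark plus the background. This layout is an assumption
    made for illustration only.
    """
    total = 0.0
    for t_map, p_map in zip(target, predicted):
        for t_row, p_row in zip(t_map, p_map):
            for t, p in zip(t_row, p_row):
                total += (t - p) ** 2
    return total


# Toy example: one landmark channel and one background channel on a 2x2 grid.
target = [[[1.0, 0.0], [0.0, 0.0]],
          [[0.0, 1.0], [1.0, 1.0]]]
predicted = [[[0.5, 0.0], [0.5, 0.0]],
             [[0.5, 1.0], [0.5, 1.0]]]
loss = l2_heatmap_loss(target, predicted)
```

In the toy example four entries differ by 0.5 each, so the loss is 4 × 0.25 = 1.0; identical maps give a loss of zero.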
Pixel-wise classification models. Pixel-wise classification (PWC) models for detecting landmarks of the human body were studied by He et al. [9]. The model can also be extended to detect facial landmarks; however, to the best of our knowledge, adopting PWC models to detect facial landmarks has not been comprehensively studied.

Although the output of the PWC model is a set of heatmaps, similar to the aforementioned heatmap models, the meaning of the heatmaps is different. The heatmaps of the distribution and heatmap regression models are generated by a multivariate distribution to indicate the spatial position of the landmarks; the heatmap of the PWC model is formed by a set of probability vectors, each indicating the probabilities that the pixel belongs to each landmark or to the background. Specifically, each vector contains $|L| + 1$ elements and its elements sum to one. Figure 4 shows an example of the network architecture of the PWC model.

To train a PWC model, the cross-entropy loss is adopted. Namely,

$$\mathrm{loss}_{PWC} = -\frac{1}{|I|} \sum_{p \in I} \sum_{i \in \{L, b\}} \hat{h}_i^p \log(h_i^p), \qquad (2)$$

where $h_i^p$ and $\hat{h}_i^p$ represent the predicted probability vector and the ground-truth vector at pixel $p$, respectively, and $i$ indicates the $i$-th element of the vector.

To obtain the landmark positions from the heatmaps, the coordinates of a landmark are taken as the position with the maximum probability in the corresponding heatmap, that is, $S_l = \arg\max_{(x,y)} h_l$, $l \in L$. The heatmap approaches benefit from highly accurate landmark positions but suffer from the lack of interrelationship between landmarks, because the interrelationship gradually degrades as it passes through the deconvolutional layers. Once some landmarks fail to be detected (mostly because of occlusion), the detected position tends to shift to a nearby corner or edge. As a result, the shape of the detected landmarks is distorted. Figure 5 shows results detected by the heatmap approaches. As we can observe, landmarks corresponding to occluded parts are commonly missing and the face shape formed by the detected landmarks is distorted.

3 A PIXEL-WISE CLASSIFICATION MODEL WITH DISCRIMINATOR

We seek a new method that keeps the advantages and avoids the disadvantages of the regression and heatmap approaches. Considering that landmark accuracy is highly important, we develop a new model based on the heatmap approach. Moreover, because the PWC model outperforms the other two heatmap models, we design the new model based on the PWC model (the accuracy is discussed in Section 5).

A hybrid loss function. To overcome the disadvantage of heatmap approaches that the interrelationship among landmarks cannot be maintained, we introduce a hybrid loss function that combines the L2 loss and the PWC loss. The idea of the hybrid loss function is to augment the PWC loss with the L2 distance, penalizing landmark shifting when detection fails, so as to strengthen the interrelationship implied in the model. Our loss function can be formulated as

$$\mathrm{loss}_{hybrid} = \alpha \times \mathrm{loss}_{PWC} + \beta \times \mathrm{loss}_{reg}, \qquad (3)$$

where the hyperparameters $\alpha$ and $\beta$ balance the two loss functions; we empirically set $\alpha$ to 1 and $\beta$ to 0.25 in our experiment.

A discrimination network. We also expect the detected landmarks to form a face-like shape. To achieve this goal, we add a discrimination network $D$ after the detection network to encourage the detected landmarks to retain a face-like shape. Specifically, the loss function is modified to $\mathrm{loss}_{total} = \mathrm{loss}_{hybrid} + \mathrm{loss}_{face}$, where $\mathrm{loss}_{face}$ is defined as $-E[\log(D(S))]$, which encourages the predicted landmarks to be classified by the discrimination network as having a face-like shape.

The detection and discrimination networks are jointly trained in the manner of the Generative Adversarial Networks proposed by Goodfellow et al. [38]. Specifically, each training step is divided into two stages: updating the detection network and updating the discrimination network. The $\mathrm{loss}_{total}$ is used to update the detection network and the loss function, $\mathrm{loss}_{disc} =$
$-(E[\log(D(\hat{S}))] + E[\log(1 - D(S))])$, is used to update the discrimination network. In our experiment, in each training step, we empirically train the discrimination network once and the detection network twice.

Figure 6 shows the network architecture of our model. As previously mentioned, the backbone network can be any network architecture. A cube in the figure represents a tensor of a specific size, and the number of channels of the tensor is shown below the cube. At least one convolutional block is performed at each tensor size. A convolutional block comprises a convolutional layer, a batch normalization layer, and an activation layer, in that order. When more than one convolutional block is performed, a × mark is shown after the channel size and the number after the × mark indicates the number of convolutional blocks (Figure 7). The number above the cube denotes the kernel size of the filters in the convolutional layer. The default kernel size is 3 × 3 and is omitted in the figure. A rectangle denotes a fully connected block and the number below it is the number of nodes in the fully connected layer. A fully connected block comprises a fully connected layer, a batch normalization layer, and an activation layer, in that order.

Figure 6: The proposed model contains a detection network and a discrimination network. The detection network is a PWC model for detecting landmarks. The discrimination network is a 3-layer fully connected NN that tests whether the detected landmarks form a face-like shape.

Figure 7: The backbone network. Each cube represents a tensor with a specific size. We only show the first two tensors for figure conciseness. Beneath each cube, the numbers before and after the × mark indicate the tensor's channel size and the number of convolutional blocks performed at that tensor size, respectively. The number of convolutional blocks is omitted when only one convolutional block is performed. The mp represents a max-pooling layer with 2 × 2 kernel size and stride two.

Table 1: The number of images in the training (T) and validation (V) sets.

      AFW    Helen    LFPW    300-W (In/Out)    IBUG    COFW
#T    337    2,000    811     0/0               0       0
#V    0      330      224     300/300           135     507

model, we crop the faces for every facial image according to the ground-truth landmarks to reduce negative impacts from unstable face detection. Specifically, we calculate the maximum horizontal and vertical distances ($d_h$ and $d_v$) between landmarks for each image. Then, the facial image is cropped by a square bounding box with side length $1.3 \times \max(d_h, d_v)$, centered at the centroid of the landmarks. Finally, the cropped image is resized to 224 × 224 × 3.
224 × 224 image size.

Figure 9: The ECDF of the NMSE for the investigated models: (a) 68 facial landmarks; (b) 12 eye anchors. [Plots with NMSE on the horizontal axis and ECDF on the vertical axis; curves for Direct reg., Cascaded reg., Distribution, Heatmap reg., PWC, and hybrid+disc.]

In the experiment, the direct and cascaded regression models contain 178,682,440 and 35,053,144 parameters, respectively. The distribution model contains 82,871,880 parameters. The PWC model and the heatmap regression model have the same number of parameters, 82,947,658, because the only difference between the two approaches is the loss function. The hybrid+disc model contains 82,982,347 parameters, of which the detection and discrimination networks contain 82,947,658 and 34,689 parameters, respectively. Overall, the direct regression model contains the largest number of parameters.
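The parameter counts of the hybrid+disc model can be checked with a short calculation. The sketch below is illustrative only: the layer widths 136 → 128 → 128 → 1 (136 being 68 landmarks times two coordinates) and the placement of batch normalization on the two hidden layers but not on the output layer are assumptions, not values stated in the text, which only describes a 3-layer fully connected network.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batch_norm_params(n_features):
    """Trainable scale and shift of a batch normalization layer."""
    return 2 * n_features


# Assumed layer widths: 136 inputs, two hidden layers of 128 nodes, 1 output.
widths = [136, 128, 128, 1]

disc_params = 0
for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
    disc_params += dense_params(n_in, n_out)
    is_hidden = i < len(widths) - 2
    if is_hidden:  # assumption: no batch normalization on the output layer
        disc_params += batch_norm_params(n_out)

detection_params = 82_947_658  # detection network, as reported in the text
total_params = detection_params + disc_params
```

Under these assumptions the discrimination network has 34,689 parameters and the total is 82,982,347, which agrees with the counts reported above; other layer configurations could of course give the same total.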
Table 2 shows the average NMSE values of the nine models. We also evaluated the standard deviation over five independent training runs; since the standard deviations are small, we omit them to keep the table concise. The Dlib and TCDCN models are the baselines for comparison. The lowest NMSE value in each testing dataset is highlighted. Note that the Dlib model cannot successfully detect the landmarks for all validation images. Therefore, the average NMSE values of the Dlib model are for reference only, and its detection rates are listed below the table.

Generally, all investigated models outperform the baseline models except the heatmap regression model. This is because the heatmap regression model suffers from the weak interrelationship between landmarks, due to the network architecture, and from the small penalty for failed detection, due to the loss function. Therefore, once a landmark fails to be detected, the detected position usually falls on a position whose structure is similar to that of the unoccluded landmark, and this position might be far from the ground-truth position. The largest NMSE value on the IBUG dataset and the relatively small NMSE value on the Helen dataset reveal this trend. Among the regression models, surprisingly, the DAN model performs worse than the direct regression model. We suspect that the DAN model is limited by the network architecture of its sub-network, because the sub-network contains only 11,146,312 parameters, far fewer than the direct regression model. Among the heatmap models, the PWC model detects the landmarks more accurately than the regression models and the other heatmap models. The hybrid model outperforms other models in almost all datasets.

Figure 9 shows the empirical cumulative distribution function (ECDF) of the NMSE values to illustrate the model effectiveness on the validation set. Because the landmarks around the eyes (called eye anchors) are more widely used than other landmarks, for example in eye blink detection and gaze manipulation, we further evaluated the detection accuracy for the 12 eye anchors. Figure 9(a) and Figure 9(b) show the ECDF results of all 68 landmarks and the 12 eye anchors, respectively. As previously mentioned, heatmap approaches generally achieve higher detection accuracy than the regression approaches. However, the heatmap regression model suffers from large NMSE values when the landmarks are hard to detect. The PWC model slightly outperforms the distribution model. The proposed hybrid loss function further improves the PWC model without enlarging the network architecture. Overall, the hybrid model outperforms the other investigated models.

5.1 Ablation Study

To verify that the hybrid loss function and the discrimination network indeed improve the detection accuracy of the PWC model, we trained several combinations of model and loss function: (1) the PWC model, (2) the PWC model supported by the discrimination network (PWC+disc), (3) the hybrid model, and (4) the hybrid+disc model.

Table 2 shows the average NMSE of the investigated combinations. Generally, the model trained with the hybrid loss function outperforms the model trained with the PWC loss function. The result indicates that the regression loss function can support the
Table 2: The average NMSE of the investigated models for detecting 68 facial landmarks. The colored numbers indicate the smallest NMSE in each dataset. The 300-W dataset contains the indoor (In) and the outdoor (Out) subsets and the results are listed in two horizontal rows.

Dataset      Dlib*   TCDCN   Direct  Cascaded  Dist.   Heat. reg.  PWC     PWC+disc  hybrid  hybrid+disc
Helen        0.0298  0.0474  0.0356  0.0367    0.0337  0.0353      0.0334  0.0336    0.0315  0.0312
LFPW         0.0356  0.0454  0.0359  0.0360    0.0347  0.0371      0.0356  0.0359    0.0336  0.0337
300-W (In)   0.0677  0.0781  0.0495  0.0520    0.0508  0.0597      0.0499  0.0505    0.0481  0.0486
300-W (Out)  0.0640  0.0746  0.0500  0.0516    0.0503  0.0586      0.0487  0.0492    0.0470  0.0473
IBUG         0.0594  0.0795  0.0643  0.0678    0.0700  0.0812      0.0656  0.0663    0.0637  0.0639
Total        0.0496  0.0639  0.0452  0.0469    0.0455  0.0515      0.0446  0.0450    0.0427  0.0428

*Dlib does not successfully detect landmarks for all images in the validation set. The detection rates of Dlib in each dataset are as follows: Helen (318/330), LFPW (218/224), 300-W indoor (264/300) and outdoor (252/300), and IBUG (98/135).
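The PWC and hybrid columns of Table 2 differ only in the training objective, so it may help to see the two objectives side by side. The sketch below writes out Eq. (2) and Eq. (3) in plain Python. It is an illustration under assumptions rather than the authors' code: $\mathrm{loss}_{reg}$ is not defined in this part of the text and is assumed here to be the squared L2 distance between predicted and ground-truth landmark coordinates, the small constant `eps` is added only to avoid taking the logarithm of zero, and all function names are hypothetical.

```python
import math


def pwc_loss(gt_probs, pred_probs, eps=1e-12):
    """Eq. (2): cross-entropy averaged over the pixels.

    gt_probs, pred_probs: one probability vector per pixel, each holding
    |L| + 1 entries (all landmarks plus the background).
    """
    total = 0.0
    for gt_vec, pred_vec in zip(gt_probs, pred_probs):
        for gt, pred in zip(gt_vec, pred_vec):
            total -= gt * math.log(pred + eps)  # eps guards against log(0)
    return total / len(gt_probs)


def reg_loss(gt_points, pred_points):
    """Assumed regression term: squared L2 distance between coordinates."""
    return sum((gx - px) ** 2 + (gy - py) ** 2
               for (gx, gy), (px, py) in zip(gt_points, pred_points))


def hybrid_loss(loss_pwc, loss_reg, alpha=1.0, beta=0.25):
    """Eq. (3) with the weights reported in the text."""
    return alpha * loss_pwc + beta * loss_reg


# Toy example: two pixels, one landmark class plus background.
gt_probs = [[1.0, 0.0], [0.0, 1.0]]
pred_probs = [[0.5, 0.5], [0.5, 0.5]]
gt_points = [(0.0, 0.0)]
pred_points = [(1.0, 0.0)]

loss = hybrid_loss(pwc_loss(gt_probs, pred_probs),
                   reg_loss(gt_points, pred_points))
```

For the toy inputs the PWC term is approximately ln 2 and the regression term is 1, so the hybrid loss is approximately ln 2 + 0.25. Setting `beta` to zero recovers the plain PWC objective.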
ing the architecture of the detection network. However, the discrimination network does not significantly improve detection accuracy.

Figure 10 shows the ECDF of the NMSE values for every loss combination. As we can observe, the hybrid model supported by the discrimination network achieves slightly better results for 60% and 80% of the validation images for detecting 68 facial landmarks and 12 eye anchors, respectively. However, for the PWC model, the discrimination network does not improve the detection accuracy.

To further explore the benefits of adopting the hybrid loss function and the discrimination network, we carefully examined the detected landmarks. Generally, the landmarks detected by the PWC model tend to lie at key points (edges or corners). This result is not surprising, because key points are more easily preserved in the semantic features than smooth areas, and facial landmarks should lie at key points when they are not occluded. However, the landmark positions are easily affected by the shape of the key point, especially for landmarks located on an edge. Besides, once landmarks are occluded, the detected positions tend to fall on a key point near the position where the landmark should be. Therefore, the shape of the detected landmarks might be distorted. Figure 11 illustrates this. In the result detected by the PWC model, the landmark positions are affected by the shape of the edge (red and blue) and the occluded landmark is located at a nearby corner (blue). Adding the L2 loss to the PWC model penalizes the position shifting caused by occlusion or by the shape of the key point, which greatly improves the detection accuracy. However, the shape of the detected landmarks might still be distorted. Although adopting the discrimination network in the training process might make landmarks slightly inaccurate, the network successfully improves the shape of the landmarks. Overall, adding the discrimination network has only a minor impact on detection accuracy. Besides, the discrimination network does not increase the inference time at test time. Training a model with a discrimination network is worth considering and exploring further.

Figure 10: The ECDF of the NMSE values for different loss function combinations: (a) 68 landmarks; (b) 12 eye anchors. [Plots with ECDF on the vertical axis; curves for PWC, PWC+disc, hybrid, and hybrid+disc.]

Figure 11: The hybrid loss function penalizes the landmark shifting due to the shape of the key point or partial occlusion. The discrimination network encourages the detected landmarks to form a face-like shape. [Panel labels: Ground-truth, PWC, PWC+disc, hybrid, hybrid+disc.]

5.2 Occlusion Tolerance

To explore the model's ability to handle facial images with partial occlusion, we manually selected from the validation datasets 298 images in which some landmarks are occluded by objects or extreme light sources. Figure 12 shows several results detected by the investigated models. As we can observe, the regression approaches suffer from inaccurate landmark positions. The heatmap approaches benefit from accurate positions, but the landmarks shift due to the occlusion. The hybrid+disc model eases the landmark shifting and can moderately guess the landmark positions when landmarks are occluded.

Figure 13 shows the ECDF of the NMSE values for the images with and without partial occlusion. As we can observe, the hybrid+disc model outperforms the other models for the images without partial occlusion. For the images with partial occlusion, the hybrid+disc model achieves better accuracy for about 70% of the validation images. It is worthwhile to mention that when the
image has serious occlusion or failed detection, the regression approaches achieve lower NMSE values than the heatmap approaches because the face shape is maintained.

To further explore the occlusion tolerance on seriously occluded facial images, we tested the investigated models on the testing set of the COFW dataset, which contains 507 images. Figure 14(a) and Figure 14(b) show the ECDF for detecting 68 facial landmarks and 12 eye anchors, respectively. Figure 14(a) also reveals that the regression approaches perform better than the heatmap approaches on images with serious occlusion. The hybrid model slightly outperforms the other heatmap approaches for about 70% of the testing images. The heatmap regression model performs the worst among all investigated models, as in the previous experiment. For the eye anchors, which are relatively less occluded than other landmarks, the heatmap approaches achieve more precise landmark positions than the regression approaches. Once the landmarks become hard to detect, the regression approaches gradually achieve better accuracy than the heatmap approaches. Overall, the hybrid model and the hybrid+disc model outperform the other models for 80% of the testing images.

Figure 12: The hybrid+disc model can moderately guess the landmark positions when landmarks are occluded. [Panel labels: Direct reg., Cascaded reg., Dist., Heatmap reg., PWC, hybrid+disc.]

[Figure 13: two ECDF plots, each titled "68 Facial Landmarks", with curves for Direct reg., Cascaded reg., Distribution, Heatmap reg., PWC, and hybrid+disc.]

Figure 14: The ECDF of the NMSE values for evaluating occlusion tolerance on the seriously occluded images (the COFW dataset). [Panels: 68 Facial Landmarks; 12 Eye Anchors. NMSE on the horizontal axis, ECDF on the vertical axis; curves for Direct reg., Cascaded reg., Distribution, Heatmap reg., PWC, hybrid, and hybrid+disc.]

6 DISCUSSION AND CONCLUSION

We have comprehensively studied the commonly used convolutional neural network-based approaches for detecting 68 facial landmarks, namely the regression and heatmap approaches and their variations. We summarize the advantages and disadvantages of these approaches and investigate a new variation of the heatmap model, the pixel-wise classification (PWC) model.

To further improve the detection accuracy of the PWC model, we propose the hybrid loss function and comprehensively verify that it improves the detection accuracy without modifying the architecture of the detection network. Besides, a discrimination network is proposed to encourage the detected landmarks to form a face-like shape. Although the discrimination network has little impact on detection accuracy, it can be further

REFERENCES

[1] Shan Li and Weihong Deng. Deep facial expression recognition: A survey. CoRR, 2018.
[2] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:121–135, 2019.
[3] Changxing Ding and Dacheng Tao. Robust face recognition via multimodal deep face representation. IEEE Transactions on Multimedia, 17(11):2049–2058, Nov 2015.
[4] Yaroslav Ganin, Daniil Kononenko, Diana Sungatullina, and Victor Lempitsky. Deepwarp: Photorealistic image resynthesis for gaze manipulation. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, European Conference on Computer Vision, pages 311–326, 2016.
[5] Nannan Wang, Xinbo Gao, Dacheng Tao, Heng Yang, and Xuelong Li. Facial feature point detection: A comprehensive survey. Neurocomputing, 275:50–65, 2018.
[6] Timothy F. Cootes, Gareth J. Edwards, and Christopher J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, June 2001.
[7] David Cristinacce and Timothy F. Cootes. Feature detection and tracking with constrained local models. In Proceedings of the British Machine Vision Conference, pages 929–938, 2006.
[8] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39:640–651, 2017.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[10] Adrian Bulat and Georgios Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, European Conference on Computer Vision, pages 717–732, 2016.
[11] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886, 2012.
[12] Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S. Huang. Interactive facial feature localization. In Proceedings of the 12th European Conference on Computer Vision - Volume Part III, ECCV'12, pages 679–692, Berlin, Heidelberg, 2012. Springer-Verlag.
[13] Peter N. Belhumeur, David W. Jacobs, David J. Kriegman, and Neeraj Kumar. Localizing parts of faces using a consensus of exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2930–2940, Dec 2013.
[14] Xavier P. Burgos-Artizzu, Pietro Perona, and Piotr Dollar. Robust face landmark estimation under occlusion. In The IEEE International Conference on Computer Vision (ICCV), 2013.
[15] Adrian Bulat and Georgios Tzimiropoulos. Two-stage convolutional part heatmap regression for the 1st 3d face alignment in the wild (3dfaw) challenge. In European Conference on Computer Vision, pages 616–624, 2016.
[16] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis & Machine Intelligence, 38:918–930, 2016.
[17] Jing Yang, Qingshan Liu, and Kaihua Zhang. Stacked hourglass network for robust facial landmark localisation. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2025–2033, 2017.
[18] Amir Zadeh, Yao Chong Lim, Tadas Baltrušaitis, and Louis-Philippe Morency. Convolutional experts constrained local model for 3d facial landmark detection. In IEEE International Conference on Computer Vision Workshops, pages 2519–2528, 2017.
[19] Rajeev Ranjan, Swami Sankaranarayanan, Carlos D. Castillo, and Rama Chellappa. An all-in-one convolutional neural network for face analysis. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 17–24, 2017.
[20] Wayne Wu, Qian Chen, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at boundary: A boundary-aware face alignment algorithm. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2138, 2018.
[21] Yue Wu, Tal Hassner, Kanggeon Kim, Gérard Medioni, and Prem Natarajan. Facial landmark detection with tweaked convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:3067–3074, 2018.
[22] Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3476–3483, 2013.
[23] Erjin Zhou, Haoqiang Fan, Zhimin Cao, Yuning Jiang, and Qi Yin. Extensive facial landmark localization with coarse-to-fine convolutional network cascade. In IEEE International Conference on Computer Vision Workshops, pages 386–391, 2013.
[24] Haoqiang Fan and Erjin Zhou. Approaching human level facial landmark localization by deep learning. Image and Vision Computing, 47:27–35, 2016.
[25] Jiangjing Lv, Xiaohu Shao, Junliang Xing, Cheng Cheng, and Xi Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3691–3700, 2017.
[26] Jie Zhang, Shiguang Shan, Meina Kan, and Xilin Chen. Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment. In European Conference on Computer Vision, pages 1–16, 2014.
[27] Marek Kowalski, Jacek Naruniec, and Tomasz Trzcinski. Deep alignment network: A convolutional neural network for robust face alignment. In IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2017.
[28] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23:1499–1503, 2016.
[29] Zhenliang He, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Robust fec-cnn: A high accuracy facial landmark detection system. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2044–2050, 2017.
[30] Xin Chai, Qisong Wang, Yongping Zhao, and Yongqiang Li. Robust facial landmark detection based on initializing multiple poses. International Journal of Advanced Robotic Systems, 13:1729881416662793, 2016.
[31] Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style aggregated network for facial landmark detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 379–388, 2018.
[32] Yue Wu and Qiang Ji. Facial landmark detection: a literature survey. International Journal of Computer Vision, 127:115–142, 2019.
[33] Aaron S. Jackson, Michel Valstar, and Georgios Tzimiropoulos. A cnn cascade for landmark guided semantic part segmentation. In European Conference on Computer Vision Workshops, pages 143–155, 2016.
[34] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, pages 1021–1030, 2017.
[35] Adrian Bulat and Georgios Tzimiropoulos. Super-fan: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 109–117, 2018.
[36] Joseph P. Robinson, Yuncheng Li, Ning Zhang, Yun Fu, and Sergey Tulyakov. Laplace landmark localization. In CoRR, 2019.
[37] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4724–4732, June 2016.
[38] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pages 2672–2680, 2014.
[39] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.