Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks
K.-P. Zhang, Z.-F. Li and Y. Q. are with the Multimedia Laboratory, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China. E-mail: [email protected]; [email protected]; [email protected]. Z.-P. Zhang is with the Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong. E-mail: [email protected].

Abstract—Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations, and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this paper, we propose a deep cascaded multi-task framework which exploits the inherent correlation between them to boost their performance. In particular, our framework adopts a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark locations in a coarse-to-fine manner. In addition, in the learning process, we propose a new online hard sample mining strategy that can improve the performance automatically without manual sample selection. Our method achieves superior accuracy over the state-of-the-art techniques on the challenging FDDB and WIDER FACE benchmarks for face detection and the AFLW benchmark for face alignment, while keeping real-time performance.

Index Terms—Face detection, face alignment, cascaded convolutional neural network

I. INTRODUCTION

Face detection and alignment are essential to many face applications, such as face recognition and facial expression analysis. However, the large visual variations of faces, such as occlusions, large pose variations, and extreme lightings, impose great challenges for these tasks in real-world applications.

The cascade face detector proposed by Viola and Jones [2] utilizes Haar-like features and AdaBoost to train cascaded classifiers, which achieve good performance with real-time efficiency. However, quite a few works [1, 3, 4] indicate that this detector may degrade significantly in real-world applications with larger visual variations of human faces, even with more advanced features and classifiers. Besides the cascade structure, [5, 6, 7] introduce deformable part models (DPM) for face detection and achieve remarkable performance. However, they need high computational expense and may usually require expensive annotation in the training stage. Recently, convolutional neural networks (CNNs) have achieved remarkable progress in a variety of computer vision tasks, such as image classification [9] and face recognition [10]. Inspired by the good performance of CNNs in computer vision tasks, several CNN-based face detection approaches have been proposed in recent years. Yang et al. [11] train deep convolutional neural networks for facial attribute recognition to obtain high responses in face regions, which further yield candidate windows of faces. However, due to its complex CNN structure, this approach is time-costly in practice. Li et al. [19] use cascaded CNNs for face detection, but it requires bounding box calibration from face detection with extra computational expense and ignores the inherent correlation between facial landmark localization and bounding box regression.

Face alignment has also attracted extensive interest. Regression-based methods [12, 13, 16] and template fitting approaches [14, 15, 7] are two popular categories. Recently, Zhang et al. [22] proposed using facial attribute recognition as an auxiliary task to enhance face alignment performance with a deep convolutional neural network.

However, most of the available face detection and face alignment methods ignore the inherent correlation between these two tasks. Though several works attempt to solve them jointly, these works still have limitations. For example, Chen et al. [18] jointly conduct alignment and detection with a random forest using pixel-value-difference features, but the handcrafted features limit its performance. Zhang et al. [20] use a multi-task CNN to improve the accuracy of multi-view face detection, but the detection accuracy is limited by the initial detection windows produced by a weak face detector.

On the other hand, mining hard samples during training is critical to strengthening the detector. However, traditional hard sample mining is usually performed in an offline manner, which significantly increases manual operations. It is desirable to design an online hard sample mining method for face detection and alignment that adapts to the current training process automatically.

In this paper, we propose a new framework to integrate these two tasks using unified cascaded CNNs by multi-task learning. The proposed CNNs consist of three stages. In the first stage, it produces candidate windows quickly through a shallow CNN. Then, it refines the windows by rejecting a large number of non-face windows through a more complex CNN. Finally, it uses a more powerful CNN to refine the result and output facial landmark positions. Thanks to this multi-task learning framework, the performance of the algorithm can be notably improved.
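Purely as an illustration of this coarse-to-fine flow (hypothetical Python, not the paper's implementation; the callables pnet, rnet, and onet stand in for the three networks, and details such as candidate generation and window merging are omitted):

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]      # (x1, y1, x2, y2)
Landmarks = List[Tuple[float, float]]        # five (x, y) points per face

def cascade_detect(image,
                   pnet: Callable,   # stage 1: shallow CNN, fast candidate windows
                   rnet: Callable,   # stage 2: more complex CNN, rejects non-faces
                   onet: Callable    # stage 3: most powerful CNN, final refinement
                   ) -> Tuple[List[Box], List[Landmarks]]:
    """Coarse-to-fine flow of the three-stage cascade (illustrative only)."""
    candidates = pnet(image)                    # many cheap proposals
    survivors = rnet(image, candidates)         # reject a large number of non-face windows
    boxes, landmarks = onet(image, survivors)   # refine results and output facial landmarks
    return boxes, landmarks
```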
The major contributions of this paper are summarized as follows: (1) We propose a new cascaded CNN-based framework for joint face detection and alignment, and carefully
TABLE I. Comparison of speed and validation accuracy of our CNNs and previous CNNs [19].

Fig. 2. The architectures of P-Net, R-Net, and O-Net, where "MP" means max pooling and "Conv" means convolution. The step size in convolution and pooling is 1 and 2, respectively.
regression problem and we minimize the Euclidean loss:

$L_i^{landmark} = \| \hat{y}_i^{landmark} - y_i^{landmark} \|_2^2$   (3)

where $\hat{y}_i^{landmark}$ is the facial landmark's coordinate obtained from the network and $y_i^{landmark}$ is the ground-truth coordinate. There are five facial landmarks, including left eye, right eye, nose, left mouth corner, and right mouth corner, and thus $y_i^{landmark} \in \mathbb{R}^{10}$.

4) Multi-source training: Since we employ different tasks in each CNN, there are different types of training images in the learning process, such as face, non-face, and partially aligned face. In this case, some of the loss functions (i.e., Eq. (1)-(3)) are not used. For example, for a sample from a background region, we only compute $L_i^{det}$, and the other two losses are set to 0. This can be implemented directly with a sample type indicator. Then the overall learning target can be formulated as:

$\min \sum_{i=1}^{N} \sum_{j \in \{det, box, landmark\}} \alpha_j \, \beta_i^j \, L_i^j$   (4)

where $N$ is the number of training samples and $\alpha_j$ denotes the task importance. We use $(\alpha_{det}=1, \alpha_{box}=0.5, \alpha_{landmark}=0.5)$ in P-Net and R-Net, while we use $(\alpha_{det}=1, \alpha_{box}=0.5, \alpha_{landmark}=1)$ in O-Net for more accurate facial landmark localization. $\beta_i^j \in \{0, 1\}$ is the sample type indicator. In this case, it is natural to employ stochastic gradient descent to train the CNNs.
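For illustration only (not the authors' released code; the array names, the string labels for sample types, and the mini-batch layout are assumptions), the indicator-weighted sum of Eq. (4) over one mini-batch could be computed as:

```python
import numpy as np

# Hypothetical per-sample losses for one mini-batch (each of shape [N]),
# corresponding to the detection, bounding-box, and landmark terms of
# Eqs. (1)-(3), plus an array of per-sample annotation types.
def overall_loss(det_loss, box_loss, lm_loss, sample_type, net="PNet"):
    # Task importances alpha_j; O-Net weights the landmark term more heavily.
    if net in ("PNet", "RNet"):
        alpha = {"det": 1.0, "box": 0.5, "lm": 0.5}
    else:  # "ONet"
        alpha = {"det": 1.0, "box": 0.5, "lm": 1.0}

    # Sample type indicators beta_i^j in {0, 1}: each loss is active only
    # for the annotation types that define it.
    sample_type = np.asarray(sample_type)
    beta_det = np.isin(sample_type, ["positive", "negative"]).astype(float)
    beta_box = np.isin(sample_type, ["positive", "part"]).astype(float)
    beta_lm = (sample_type == "landmark").astype(float)

    per_sample = (alpha["det"] * beta_det * det_loss
                  + alpha["box"] * beta_box * box_loss
                  + alpha["lm"] * beta_lm * lm_loss)
    return per_sample.sum()  # the quantity minimized in Eq. (4)
```

With this formulation, a background (negative) crop contributes only its detection term, matching the description above.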
5) Online hard sample mining: Unlike traditional hard sample mining, which is conducted after the original classifier has been trained, we perform online hard sample mining in the face classification task so that it is adaptive to the training process.

In particular, in each mini-batch, we sort the losses computed in the forward propagation phase over all samples and select the top 70% of them as hard samples. Then we compute gradients only from these hard samples in the backward propagation phase. That means we ignore the easy samples that are less helpful for strengthening the detector during training. Experiments show that this strategy yields better performance without manual sample selection. Its effectiveness is demonstrated in Section III.
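A minimal sketch of this per-mini-batch selection (illustrative NumPy code assuming a vector of per-sample classification losses and the 70% keep ratio stated above; not the original implementation):

```python
import numpy as np

def hard_sample_mask(per_sample_loss: np.ndarray, keep_ratio: float = 0.7) -> np.ndarray:
    """Select the top 70% highest-loss samples of a mini-batch as hard samples."""
    n_keep = max(1, int(np.ceil(keep_ratio * per_sample_loss.size)))
    hard_idx = np.argsort(per_sample_loss)[::-1][:n_keep]   # largest losses first
    mask = np.zeros(per_sample_loss.shape, dtype=bool)
    mask[hard_idx] = True
    return mask

# In training, only samples with mask == True contribute to the backward pass;
# the easy samples are ignored for that iteration.
```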
III. EXPERIMENTS

In this section, we first evaluate the effectiveness of the proposed hard sample mining strategy. Then we compare our face detector and alignment method against the state-of-the-art methods on the Face Detection Data Set and Benchmark (FDDB) [25], WIDER FACE [24], and the Annotated Facial Landmarks in the Wild (AFLW) benchmark [8]. The FDDB dataset contains annotations for 5,171 faces in a set of 2,845 images. The WIDER FACE dataset consists of 393,703 labeled face bounding boxes in 32,203 images, of which 50% are used for testing (divided into three subsets according to image difficulty), 40% for training, and the remaining for validation. AFLW contains facial landmark annotations for 24,386 faces, and we use the same test subset as [22]. Finally, we evaluate the computational efficiency of our face detector.

A. Training Data

Since we jointly perform face detection and alignment, we use four different kinds of data annotation in our training process: (i) Negatives: regions whose Intersection-over-Union (IoU) ratio is less than 0.3 to any ground-truth face; (ii) Positives: IoU above 0.65 to a ground-truth face; (iii) Part faces: IoU between 0.4 and 0.65 to a ground-truth face; and (iv) Landmark faces: faces labeled with the positions of the 5 landmarks. Negatives and positives are used for the face classification task, positives and part faces are used for bounding box regression, and landmark faces are used for facial landmark localization (see the labeling sketch below).
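For illustration (boxes assumed to be given as [x1, y1, x2, y2]; a sketch under these assumptions, not the authors' data-preparation code), the IoU computation and the resulting labeling rule read:

```python
def iou(box, gt):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box[0], gt[0]), max(box[1], gt[1])
    ix2, iy2 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def label_crop(crop_box, gt_boxes):
    """Assign a training label to a random crop from the IoU thresholds above."""
    best = max((iou(crop_box, gt) for gt in gt_boxes), default=0.0)
    if best < 0.3:
        return "negative"
    if best >= 0.65:
        return "positive"
    if best >= 0.4:
        return "part"
    return None  # crops with 0.3 <= IoU < 0.4 are not used
```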
The training data for each network are described as follows:
1) P-Net: We randomly crop several patches from WIDER FACE [24] to collect positives, negatives, and part faces. Then, we crop faces from CelebA [23] as landmark faces.
2) R-Net: We use the first stage of our framework to detect faces
Fig. 3. (a) Validation loss of O-Net with and without hard sample mining. (b) "JA" denotes joint face alignment learning, while "No JA" denotes training without it; "No JA in BBR" denotes omitting joint alignment while training the CNN for bounding box regression.
[2] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137-154, 2004.
[3] M. T. Pham, Y. Gao, V. D. D. Hoang, and T. J. Cham, "Fast polygonal integration and its application in extending haar-like features to improve object detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 942-949.
[4] Q. Zhu, M. C. Yeh, K. T. Cheng, and S. Avidan, "Fast human detection
using a cascade of histograms of oriented gradients," in IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2006, pp.
1491-1498.
[5] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection
without bells and whistles,” in European Conference on Computer Vision,
2014, pp. 720-735.
[6] J. Yan, Z. Lei, L. Wen, and S. Li, “The fastest deformable part model for
object detection,” in IEEE Conference on Computer Vision and Pattern
Recognition, 2014, pp. 2497-2504.
[7] X. Zhu, and D. Ramanan, “Face detection, pose estimation, and landmark
localization in the wild,” in IEEE Conference on Computer Vision and
Pattern Recognition, 2012, pp. 2879-2886.
[8] M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial
landmarks in the wild: A large-scale, real-world database for facial land-
mark localization,” in IEEE Conference on Computer Vision and Pattern
Recognition Workshops, 2011, pp. 2144-2151.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural infor-
mation processing systems, 2012, pp. 1097-1105.
[10] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representa-
tion by joint identification-verification,” in Advances in Neural Infor-
mation Processing Systems, 2014, pp. 1988-1996.
[11] S. Yang, P. Luo, C. C. Loy, and X. Tang, “From facial parts responses to
face detection: A deep learning approach,” in IEEE International Confer-
ence on Computer Vision, 2015, pp. 3676-3684.
[12] X. P. Burgos-Artizzu, P. Perona, and P. Dollar, “Robust face landmark
estimation under occlusion,” in IEEE International Conference on Com-
puter Vision, 2013, pp. 1513-1520.
[13] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape
regression," International Journal of Computer Vision, vol. 107, no. 2, pp.
177-190, 2012.
[14] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23,
no. 6, pp. 681-685, 2001.
[15] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas, “Pose-free facial
landmark fitting via optimized part mixtures and cascaded deformable
shape model,” in IEEE International Conference on Computer Vision,
2013, pp. 1944-1951.
[16] J. Zhang, S. Shan, M. Kan, and X. Chen, “Coarse-to-fine auto-encoder
networks (CFAN) for real-time face alignment,” in European Conference
on Computer Vision, 2014, pp. 1-16.
[17] Luxand Incorporated: Luxand face SDK, http://www.luxand.com/
[18] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, “Joint cascade face detection
and alignment,” in European Conference on Computer Vision, 2014, pp.
109-122.
[19] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural
network cascade for face detection,” in IEEE Conference on Computer
Vision and Pattern Recognition, 2015, pp. 5325-5334.
[20] C. Zhang, and Z. Zhang, “Improving multiview face detection with mul-
ti-task deep convolutional neural networks,” IEEE Winter Conference
on Applications of Computer Vision, 2014, pp. 1036-1041.
[21] X. Xiong, and F. Torre, “Supervised descent method and its applications to
face alignment,” in IEEE Conference on Computer Vision and Pattern
Recognition, 2013, pp. 532-539.
[22] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by
deep multi-task learning,” in European Conference on Computer Vision,
2014, pp. 94-108.
[23] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the
wild,” in IEEE International Conference on Computer Vision, 2015, pp.
3730-3738.
[24] S. Yang, P. Luo, C. C. Loy, and X. Tang, “WIDER FACE: A Face Detec-
tion Benchmark," arXiv preprint arXiv:1511.06523.
[25] V. Jain, and E. G. Learned-Miller, “FDDB: A benchmark for face detec-
tion in unconstrained settings,” Technical Report UMCS-2010-009, Uni-
versity of Massachusetts, Amherst, 2010.
[26] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features,” in
IEEE International Conference on Computer Vision, 2015, pp. 82-90.
[27] R. Ranjan, V. M. Patel, and R. Chellappa, "A deep pyramid deformable part model for face detection," in IEEE International Conference on Biometrics Theory, Applications and Systems, 2015, pp. 1-8.
[28] G. Ghiasi and C. C. Fowlkes, "Occlusion Coherence: Detecting and Localizing Occluded Faces," arXiv preprint arXiv:1506.08347.
[29] S. S. Farfade, M. J. Saberian, and L. J. Li, "Multi-view face detection using deep convolutional neural networks," in ACM International Conference on Multimedia Retrieval, 2015, pp. 643-650.