Figure 1: Visual results of our DSFD. Our method is robust to various variations in scale, blur, illumination, pose, occlusion, reflection, makeup, etc.
detectors have been extensively studied, detecting faces with a high degree of variability in scale, pose, occlusion, expression, appearance and illumination in real-world scenarios remains a challenge.

Previous state-of-the-art face detectors can be roughly divided into two categories. The first is mainly based on the Region Proposal Network (RPN) adopted in Faster R-CNN [19] and employs two-stage detection schemes [24, 27, 29]. RPN is trained end-to-end and generates high-quality region proposals, which are further refined by the Fast R-CNN detector. The other is the family of one-stage methods based on the Single Shot Detector (SSD) [15], which get rid of RPN and directly predict bounding boxes and confidences [3, 21, 32]. Recently, the one-stage face detection framework has attracted more attention due to its higher inference efficiency and straightforward system deployment.

Despite the progress achieved by the above face detectors, some problems remain in the following three aspects:

Feature learning The feature extraction part is essential for a face detector. Currently, the Feature Pyramid Network (FPN) [12] is widely used in state-of-the-art face detectors for rich features. However, FPN just aggregates hierarchical feature maps between high- and low-level output layers, which neither considers the current layer's own information nor the context relationship between anchors.

Loss design The conventional loss functions used in object detection include a regression loss for the face region and a classification loss for identifying whether a face is detected. To further address the class imbalance problem, Lin et al. [13] proposed Focal Loss to focus training on a sparse set of hard examples. To use all original and enhanced features, Zhang et al. proposed Hierarchical Loss to effectively learn the network [30]. However, the above loss functions do not consider the progressive learning ability of feature maps at different levels.

Anchor matching Basically, pre-set anchors for each feature map are generated by regularly tiling a collection of boxes with different scales and aspect ratios on the image. Some works [21, 32] analyze a series of reasonable anchor scales and an anchor compensation strategy to increase the number of positive anchors. However, such strategies ignore random sampling in data augmentation: continuous face scales versus a set of discrete anchor scales still lead to a huge ratio difference between negative and positive anchors.

To address the above three issues, we propose a novel network based on the SSD pipeline, named Dual Shot Face Detector (DSFD). First, combining a setting similar to the low-level FPN in PyramidBox and the Receptive Field Block (RFB) in RFBNet [14], we introduce a Feature Enhance Module (FEM) to enhance the discriminability and robustness of the features. Second, motivated by the hierarchical loss [30] and the pyramid anchor [21] in PyramidBox, we propose Progressive Anchor Loss (PAL), which computes an auxiliary supervision loss using a set of smaller anchors to effectively facilitate the original features, since smaller anchors tiled on the original feature map cells carry more semantic information for classification and higher-resolution location information for small faces. Last but not least, we propose an Improved Anchor Matching (IAM) method, which integrates an anchor partition strategy and anchor-based data augmentation techniques into our DSFD to match anchors and ground-truth faces as closely as possible and provide better initialization for the regressor. Fig. 1 shows the effectiveness of our proposed DSFD under various variations, especially on extremely small or heavily occluded faces.

In summary, the main contributions of this paper include:
• A novel Feature Enhance Module that utilizes information from different levels and thus obtains more discriminative and robust features.
• Auxiliary supervisions introduced in early layers by using a set of smaller anchors to effectively facilitate the features.
• An improved anchor matching strategy to match anchors and ground-truth faces as closely as possible and provide better initialization for the regressor.
• Comprehensive experiments conducted on the popular benchmarks FDDB and WIDER FACE to demonstrate the superiority of our proposed DSFD network over state-of-the-art methods.

2. Related work

We review the prior works from three perspectives.

Feature Learning Early works on face detection mainly rely on hand-crafted features, such as Haar-like features [23], control point sets [1], and edge orientation histograms [10]. However, the design of hand-crafted features lacks guidance. With the great progress of deep learning, hand-crafted features have been replaced by Convolutional Neural Networks (CNNs). For example, OverFeat [20], Cascade-CNN [11] and MTCNN [31] adopt a CNN as a sliding-window detector on an image pyramid to build a feature pyramid. However, using an image pyramid is slow and memory-inefficient. As a result, most two-stage detectors extract features at a single scale. R-CNN [5, 6] obtains region proposals by selective search [22], and then forwards each normalized image region through a CNN for classification. Faster R-CNN [19] and R-FCN [4] employ a Region Proposal Network (RPN) to generate initial region proposals. Besides, RoI-pooling [19] and position-sensitive RoI pooling [4] are applied to extract features from each region.
More recently, some research indicates that multi-scale features perform better for tiny objects. Specifically, SSD [15], MS-CNN [2], SSH [18] and S3FD [32] predict boxes on multiple layers of the feature hierarchy, while FCN [17], Hypercolumns [7] and ParseNet [16] fuse features from multiple layers. FPN-based methods, such as FAN [25] and PyramidBox [21], integrate high-level semantic information into all scales and achieve significant improvement on detection. However, these methods do not consider the current layer's own information. Different from the above methods, which ignore the context relationship between anchors, we propose a feature enhance module that incorporates multi-level dilated convolutional layers to enhance the semantics of the features.

Loss Design Generally, the objective loss in detection is a weighted sum of a classification loss (e.g., softmax loss) and a box regression loss (e.g., L2 loss). Girshick et al. [5] propose the smooth L1 loss to prevent exploding gradients (recalled below). Lin et al. [13] discover that class imbalance is one obstacle to better performance in one-stage detectors, hence they propose focal loss, a dynamically scaled cross-entropy loss. Besides, Wang et al. [26] design RepLoss for pedestrian detection, which improves performance in occlusion scenarios. FANet [30] creates a hierarchical feature pyramid and presents a hierarchical loss for its architecture. However, the anchors used in FANet are kept the same size across different stages. In this work, we adaptively choose different anchor sizes in different stages to facilitate the features.
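For reference, the smooth L1 loss of [5], applied element-wise to the regression residual x, is

\[
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2 & \text{if } |x| < 1,\\
|x| - 0.5 & \text{otherwise,}
\end{cases}
\]

which is quadratic near zero and linear for large residuals, so outliers do not produce the exploding gradients of a pure L2 penalty.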
Anchor Matching To make the model more robust, most detection methods [15, 28, 32] use data augmentation, such as color distortion, horizontal flipping, random cropping and multi-scale training. Zhang et al. [32] propose an anchor compensation strategy so that tiny faces match enough anchors during training. Wang et al. [28] propose random cropping to generate a large number of occluded faces for training. However, these methods ignore random sampling in data augmentation, while ours combines it with anchor assignment to provide better data initialization for anchor matching.

3. Dual Shot Face Detector

We first introduce the pipeline of our proposed framework DSFD, and then describe our feature enhance module in Sec. 3.2, progressive anchor loss in Sec. 3.3 and improved anchor matching in Sec. 3.4, respectively.

3.1. Pipeline of DSFD

The framework of DSFD is illustrated in Fig. 2. Our architecture uses the same extended VGG16 backbone as PyramidBox [21] and S3FD [32], which is truncated before the classification layers and extended with auxiliary structures. We select conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 as the first shot detection layers to generate six original feature maps named of1, of2, of3, of4, of5, of6. Then, our proposed FEM transfers these original feature maps into six enhanced feature maps named ef1, ef2, ef3, ef4, ef5, ef6, which have the same sizes as the original ones and are fed into SSD-style heads to construct the second shot detection layers. Note that the input size of a training image is 640, which means the feature map sizes range from 160 at the lowest-level layer to 5 at the highest-level layer. Different from S3FD and PyramidBox, after we utilize the receptive field enlargement in FEM and the new anchor design strategy, it is unnecessary for the three sizes of stride, anchor and receptive field to satisfy the equal-proportion interval principle. Therefore, our DSFD is more flexible and robust. Besides, the original and enhanced shots have two different losses, respectively named First Shot progressive anchor Loss (FSL) and Second Shot progressive anchor Loss (SSL). A rough sketch of this two-shot computation follows.
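As an illustration only, the dual-shot forward pass can be sketched as below; all function and variable names are our own placeholders, not the authors' code. Only the of/ef structure and the use of the second shot at inference come from the paper.

```python
# Sketch of the dual-shot forward pass (names are illustrative assumptions).
def dual_shot_forward(image, backbone, fem, first_heads, second_heads):
    ofs = backbone(image)                  # of1..of6 from the extended VGG16
    efs = fem(ofs)                         # ef1..ef6, same spatial sizes as ofs
    first = [h(o) for h, o in zip(first_heads, ofs)]    # first shot, trained with FSL
    second = [h(e) for h, e in zip(second_heads, efs)]  # second shot, trained with SSL
    return first, second                   # inference uses only the second shot
```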
Table 1: The stride, feature map size, anchor scale, anchor ratio and anchor number of the six original and enhanced features for the two shots.

Feature     Stride  Size       Scale      Ratio    Number
ef1 (of1)   4       160 × 160  16 (8)     1.5 : 1  25600
ef2 (of2)   8       80 × 80    32 (16)    1.5 : 1  6400
ef3 (of3)   16      40 × 40    64 (32)    1.5 : 1  1600
ef4 (of4)   32      20 × 20    128 (64)   1.5 : 1  400
ef5 (of5)   64      10 × 10    256 (128)  1.5 : 1  100
ef6 (of6)   128     5 × 5      512 (256)  1.5 : 1  25
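The rows of Table 1 are internally consistent: with one 1.5:1 anchor per cell on a 640 × 640 input, the enhanced-feature anchor scale is 4× the stride and the original scale is half of that. A quick sketch that reproduces the table:

```python
# Reproduce Table 1 from the anchor design: one anchor per cell, enhanced
# scale = 4 * stride, original scale = enhanced / 2, for a 640 x 640 input.
IMG = 640
for i, stride in enumerate((4, 8, 16, 32, 64, 128), start=1):
    size = IMG // stride                      # feature map is size x size
    print(f"ef{i} (of{i}): stride={stride:>3}  size={size}x{size}  "
          f"scale={4 * stride} ({2 * stride})  anchors={size * size}")
```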
3.2. Feature Enhance Module

Figure 3: Feature Enhance Module, illustrating how a cell of the current feature map interacts with its neighbors in the current feature map and in the up-sampled feature map.

The Feature Enhance Module (FEM for short) enhances the original features to make them more discriminable and robust. To enhance an original neuron cell oc(i,j,l), FEM utilizes information of different dimensions, including the upper-layer original neuron cell oc(i,j,l+1) and the current layer's non-local neuron cells nc(i−ε,j−ε,l), nc(i−ε,j,l), ..., nc(i,j+ε,l), nc(i+ε,j+ε,l). Specifically, the enhanced neuron cell ec(i,j,l) can be mathematically defined as:

\[
\begin{aligned}
ec_{(i,j,l)} &= f_{\mathrm{concat}}\big(f_{\mathrm{dilation}}(nc_{(i,j,l)})\big),\\
nc_{(i,j,l)} &= f_{\mathrm{prod}}\big(oc_{(i,j,l)},\, f_{\mathrm{up}}(oc_{(i,j,l+1)})\big),
\end{aligned} \tag{1}
\]

where c(i,j,l) is a cell located at coordinate (i, j) of the feature maps in the l-th layer, and f denotes a set of basic dilated convolution, element-wise product, up-sampling and concatenation operations. Fig. 3 illustrates the idea of FEM, which is inspired by FPN [12] and RFB [14]. Here, we first use a 1×1 convolutional kernel to normalize the feature maps. Then, we up-sample the upper feature maps and take an element-wise product with the current ones. Finally, we split the feature maps into three parts, followed by three sub-networks containing different numbers of dilated convolutional layers. A minimal sketch of this module follows.
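To make the module concrete, below is a minimal PyTorch-style sketch under our own assumptions: bilinear up-sampling, dilation rate 3 as in Fig. 3, and one, two and three dilated convolutions in the three branches; the channel widths and activations are illustrative, not the authors' implementation (the top level, which has no upper map, would skip the product).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FEM(nn.Module):
    """Sketch of a Feature Enhance Module (hyper-parameters are assumptions)."""

    def __init__(self, cur_channels, up_channels, out_channels=256):
        super().__init__()
        assert out_channels % 3 == 0, "the fused map is split into three parts"
        # 1x1 convolutions to normalize the current and upper feature maps
        self.norm_cur = nn.Conv2d(cur_channels, out_channels, kernel_size=1)
        self.norm_up = nn.Conv2d(up_channels, out_channels, kernel_size=1)
        c = out_channels // 3

        def branch(n_convs):  # n_convs dilated 3x3 convolutions, rate 3
            layers = []
            for _ in range(n_convs):
                layers += [nn.Conv2d(c, c, 3, padding=3, dilation=3),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        self.branches = nn.ModuleList(branch(n) for n in (1, 2, 3))

    def forward(self, cur, up):
        cur = self.norm_cur(cur)
        up = F.interpolate(self.norm_up(up), size=cur.shape[-2:],
                           mode="bilinear", align_corners=False)
        fused = cur * up                       # element-wise product, as in Eq. (1)
        parts = torch.chunk(fused, 3, dim=1)   # split into three channel groups
        return torch.cat([b(p) for b, p in zip(self.branches, parts)], dim=1)
```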
3.3. Progressive Anchor Loss

We adopt a multi-task loss [15, 19], since it helps to facilitate the training of the original and enhanced feature maps in the two shots. First, our Second Shot anchor-based multi-task Loss is defined as:

\[
L_{SSL}(p_i, p_i^*, t_i, g_i, a_i) = \frac{1}{N_{conf}}\Big(\sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^* L_{loc}(t_i, g_i, a_i)\Big), \tag{2}
\]

where N_conf and N_loc indicate the number of positive and negative anchors, and the number of positive anchors, respectively; L_conf is the softmax loss over two classes (face vs. background), and L_loc is the smooth L1 loss between the parameterizations of the predicted box t_i and the ground-truth box g_i using the anchor a_i. When p_i^* = 1 (p_i^* ∈ {0, 1}), the anchor a_i is positive and the localization loss is activated; β is a weight that balances the two terms.

Compared to the enhanced feature maps at the same level, the original feature maps have less semantic information for classification but more high-resolution location information for detection. Therefore, we believe that the original feature maps can detect and classify smaller faces. As a result, we propose the First Shot multi-task Loss with a set of smaller anchors:

\[
L_{FSL}(p_i, p_i^*, t_i, g_i, sa_i) = \frac{1}{N_{conf}}\sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}}\sum_i p_i^* L_{loc}(t_i, g_i, sa_i), \tag{3}
\]

where sa indicates the smaller anchors in the first shot layers. The two shots' losses are then summed, with a weight λ, into the whole Progressive Anchor Loss:

\[
L_{PAL} = L_{FSL}(sa) + \lambda L_{SSL}(a). \tag{4}
\]

Note that the anchor sizes in the first shot are half of those in the second shot, and λ is a weight factor; the detailed anchor size assignment is described in Sec. 3.4. In the prediction process, we only use the output of the second shot, which means no additional computational cost is introduced. A sketch of how these losses combine is given below.
3.4. Improved Anchor Matching

During training, we need to determine which anchors are positive or negative and which anchor corresponds to which face bounding box. The current anchor matching method is bidirectional between anchors and ground-truth faces. Therefore, our anchor design and the face sampling used during augmentation work together to match anchors and faces as closely as possible, providing better initialization for the regressor.

Table 1 shows the details of our anchor design and how each feature map cell is associated with a fixed-shape anchor. We set the anchor ratio to 1.5:1 based on face scale statistics, and the anchor size for an original feature map is half that of the corresponding enhanced feature map. Additionally, with probability 2/5 we utilize anchor-based sampling, similar to the data-anchor-sampling of PyramidBox, which randomly selects a face in an image, crops a sub-image containing the face, and sets the size ratio between the sub-image and the selected face to 640/rand(16, 32, 64, 128, 256, 512). For the remaining 3/5 probability, we adopt data augmentation similar to SSD [15]. The sampling draw is sketched below.
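A minimal sketch of the anchor-based sampling draw described above; the full crop-and-resize logic is omitted and the helper name is our own:

```python
import random

ANCHOR_SCALES = (16, 32, 64, 128, 256, 512)  # anchor scales from Table 1

def anchor_based_crop_ratio(img_dim=640):
    """Draw the sub-image-to-face size ratio 640 / rand(16, ..., 512).
    E.g. s = 32 means the selected face ends up ~32 px inside a 640 px crop."""
    s = random.choice(ANCHOR_SCALES)
    return img_dim / s
```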
Table 2: Effectiveness of the Feature Enhance Module on AP performance.

Component        Easy    Medium  Hard
FSSD+VGG16       92.6%   90.2%   79.1%
FSSD+VGG16+FEM   93.0%   91.4%   84.6%
1) We use a more robust and discriminative feature enhance module to improve the feature representation ability, especially for hard faces. 2) The auxiliary loss based on progressive anchors is used to train all 12 detection feature maps of different scales, and it improves performance on easy, medium and hard faces simultaneously. 3) Our improved anchor matching provides better initial anchors and ground-truth faces for the regressor, which achieves improvements of 0.3%, 0.1% and 0.3% on the three settings, respectively. Additionally, when we enlarge the training batch size (i.e., LargeBS), the result on the hard setting reaches 91.2% AP.

Effects of Different Backbones To better understand our DSFD, we further conducted experiments to examine how different backbones affect classification and detection performance. Specifically, keeping the same settings except for the feature extraction network, we implement SE-ResNet101, DPN-98 and SE-ResNeXt101 32×4d following the ResNet101 setting in our DSFD. From Table 5, DSFD with SE-ResNeXt101 32×4d obtains 95.7%, 94.8% and 88.9% on the easy, medium and hard settings respectively, which indicates that a more complex model with higher Top-1 ImageNet classification accuracy may not benefit face detection
AP. Therefore, in our DSFD framework, better classification performance is not necessary for better detection performance. Our DSFD enjoys high inference speed thanks to simply using the second shot detection results: for VGA-resolution inputs to a Res50-based DSFD, it runs at 22 FPS on an NVIDIA P40 GPU during inference.

4.3. Comparisons with State-of-the-Art Methods

We evaluate the proposed DSFD on two popular face detection benchmarks: WIDER FACE [28] and the Face Detection Data Set and Benchmark (FDDB) [9]. Our model is trained only on the training set of WIDER FACE and then evaluated on both benchmarks without any further fine-tuning. We also follow a similar way to [25] to build an image pyramid for multi-scale testing, and use a more powerful backbone similar to [3].

WIDER FACE Dataset It contains 393,703 annotated faces with large variations in scale, pose and occlusion in a total of 32,203 images. For each of the 60 event classes, 40%, 10% and 50% of the images are randomly selected as training, validation and testing sets, respectively. Besides, each subset is further divided into three levels of difficulty, 'Easy', 'Medium' and 'Hard', based on the detection rate of a baseline detector. As shown in Fig. 6, our DSFD achieves the best performance among all the state-of-the-art face detectors in terms of average precision (AP) across the three subsets, i.e., 96.6% (Easy), 95.7% (Medium) and 90.4% (Hard) on the validation set, and 96.0% (Easy), 95.3% (Medium) and 90.0% (Hard) on the test set. Fig. 8 shows more examples demonstrating the effects of DSFD in handling faces with various variations, where blue bounding boxes indicate detector confidence above 0.8.

Figure 7: Evaluation on FDDB: discontinuous ROC curves (left) and continuous ROC curves (right).

FDDB Dataset It contains 5,171 faces in 2,845 images taken from the Faces in the Wild data set. Since WIDER FACE has bounding box annotations while faces in FDDB are represented by ellipses, we learn a post-hoc ellipse regressor to transform the final prediction results (a possible form of such a regressor is sketched below). As shown in Fig. 7, our DSFD achieves state-of-the-art performance on both discontinuous and continuous ROC curves, i.e., 99.1% and 86.2% when the number of false positives equals 1,000. After adding additional annotations to the unlabeled faces [32], the false positives of our model can be further reduced, outperforming all other methods.
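The paper does not specify the form of this post-hoc ellipse regressor; as a hedged illustration, a simple least-squares affine map from box parameters to FDDB's five ellipse parameters could look like the following (all names and the affine form are our assumptions):

```python
import numpy as np

def fit_post_hoc_ellipse_regressor(boxes, ellipses):
    """Fit an affine map from detected boxes to FDDB ellipse annotations.
    boxes: (N, 4) arrays of (cx, cy, w, h); ellipses: (N, 5) arrays of
    (major_axis, minor_axis, angle, cx, cy). Purely a sketch: the paper
    does not describe its regressor's actual form."""
    X = np.hstack([boxes, np.ones((len(boxes), 1))])   # (N, 5) affine features
    W, *_ = np.linalg.lstsq(X, ellipses, rcond=None)   # (5, 5) weights
    return W

def boxes_to_ellipses(boxes, W):
    X = np.hstack([boxes, np.ones((len(boxes), 1))])
    return X @ W   # (N, 5) predicted ellipse parameters
```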
Figure 8: Effectiveness of our DSFD under large variations in scale, pose, occlusion, blur, makeup, illumination, modality and reflection. Blue bounding boxes indicate detector confidence above 0.8.
5. Conclusions

This paper introduces a novel face detector named Dual Shot Face Detector (DSFD). We propose a novel Feature Enhance Module that utilizes information from different levels and thus obtains more discriminative and robust features. Auxiliary supervisions, introduced in early layers through a set of smaller anchors, are adopted to effectively facilitate the features. Moreover, an improved anchor matching method is introduced to match anchors and ground-truth faces as closely as possible, providing better initialization for the regressor. Comprehensive experiments on the FDDB and WIDER FACE benchmarks demonstrate the superiority of our proposed DSFD network over state-of-the-art methods.
References

[1] Y. Abramson, B. Steux, and H. Ghorayeb. Yet even faster (YEF) real-time object detection. International Journal of Intelligent Systems Technologies and Applications, 2(2-3):102–112, 2007.
[2] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
[3] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou. Selective refinement network for high performance face detection. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2019.
[4] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016.
[5] R. Girshick. Fast R-CNN. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
[7] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[10] K. Levi and Y. Weiss. Learning object detection from a small number of examples: the importance of good features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004.
[11] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.
[14] S. Liu, D. Huang, and Y. Wang. Receptive field block net for accurate and fast object detection. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
[16] W. Liu, A. Rabinovich, and A. Berg. ParseNet: Looking wider to see better. In Proceedings of International Conference on Learning Representations Workshop, 2016.
[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[18] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis. SSH: Single stage headless face detector. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.
[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
[20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of International Conference on Learning Representations (ICLR), 2014.
[21] X. Tang, D. K. Du, Z. He, and J. Liu. PyramidBox: A context-assisted single shot face detector. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
[22] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[23] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[24] H. Wang, Z. Li, X. Ji, and Y. Wang. Face R-CNN. arXiv preprint arXiv:1706.01061, 2017.
[25] J. Wang, Y. Yuan, and G. Yu. Face attention network: An effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246, 2017.
[26] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[27] Y. Wang, X. Ji, Z. Zhou, H. Wang, and Z. Li. Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256, 2017.
[28] S. Yang, P. Luo, C.-C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[29] C. Zhang, X. Xu, and D. Tu. Face detection using improved faster RCNN. arXiv preprint arXiv:1802.02142, 2018.
[30] J. Zhang, X. Wu, J. Zhu, and S. C. Hoi. Feature agglomeration networks for single stage face detection. arXiv preprint arXiv:1712.00721, 2017.
[31] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[32] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S³FD: Single shot scale-invariant face detector. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.