
DSFD

The document presents the Dual Shot Face Detector (DSFD), a novel face detection network that enhances feature extraction through a Feature Enhance Module (FEM) and introduces Progressive Anchor Loss (PAL) for improved accuracy in detecting faces under various conditions. Extensive experiments demonstrate DSFD's superiority over existing state-of-the-art face detectors on benchmarks like WIDER FACE and FDDB. Key contributions include a robust feature enhancement strategy, auxiliary supervision for feature learning, and an improved anchor matching technique.

Uploaded by

keithtung1209
DSFD: Dual Shot Face Detector

Jian Li†∗, Yabiao Wang‡, Changan Wang‡, Ying Tai‡, Jianjun Qian†, Jian Yang†, Chengjie Wang‡, Jilin Li‡, Feiyue Huang‡
† Nanjing University of Science and Technology   ‡ Youtu Lab, Tencent

[email protected], {csjqian, csjyang}@njust.edu.cn

{casewang, changanwang, yingtai, jasoncjwang, jerolinli, garyhuang}@tencent.com
arXiv:1810.10220v2 [cs.CV] 23 Nov 2018

Figure 1: Visual results of our DSFD. Our method is robust to variations in scale, blur, illumination, pose, occlusion, reflection, makeup, etc. [Panels: Scale, Blurry, Illumination, Pose & Occlusion, Reflection, Makeup.]

Abstract

Recently, Convolutional Neural Networks (CNNs) have achieved great success in face detection. However, face detection remains a challenging problem for current methods owing to the high degree of variability in scale, pose, occlusion, expression, appearance and illumination. In this paper, we propose a novel face detection network named Dual Shot Face Detector (DSFD), which inherits the architecture of SSD and introduces a Feature Enhance Module (FEM) for transferring the original feature maps to extend the single shot detector to a dual shot detector. Specifically, a Progressive Anchor Loss (PAL) computed using two sets of anchors is adopted to effectively facilitate the features. Additionally, we propose an Improved Anchor Matching (IAM) method by integrating novel data augmentation techniques and an anchor design strategy in our DSFD to provide better initialization for the regressor. Extensive experiments on popular benchmarks, WIDER FACE (easy: 0.966, medium: 0.957, hard: 0.904) and FDDB (discontinuous: 0.991, continuous: 0.862), demonstrate the superiority of DSFD over the state-of-the-art face detectors (e.g., PyramidBox and SRN). Code will be made available upon publication.

∗ This work was done when Jian Li was an intern at Tencent Youtu Lab.

1. Introduction

Face detection is a fundamental step for various facial applications, such as face alignment, parsing, recognition, and verification. As the pioneering work for face detection, Viola-Jones [23] adopts the AdaBoost algorithm with hand-crafted features, which are now replaced by deeply learned features from convolutional neural networks (CNNs) [8] that achieve great progress. Although CNN-based face
detectors have been extensively studied, detecting faces with a high degree of variability in scale, pose, occlusion, expression, appearance and illumination in real-world scenarios remains a challenge.

Previous state-of-the-art face detectors can be roughly divided into two categories. The first is mainly based on the Region Proposal Network (RPN) adopted in Faster R-CNN [19] and employs two-stage detection schemes [24, 27, 29]. RPN is trained end-to-end and generates high-quality region proposals, which are further refined by the Fast R-CNN detector. The other consists of one-stage methods based on the Single Shot Detector (SSD) [15], which get rid of RPN and directly predict the bounding boxes and confidence [3, 21, 32]. Recently, the one-stage face detection framework has attracted more attention due to its higher inference efficiency and straightforward system deployment.

Despite the progress achieved by the above face detectors, problems remain in the following three aspects:

Feature learning. The feature extraction part is essential for a face detector. Currently, the Feature Pyramid Network (FPN) [12] is widely used in state-of-the-art face detectors for rich features. However, FPN just aggregates hierarchical feature maps between high-level and low-level output layers, which does not consider the current layer's information, and the context relationship between anchors is ignored.

Loss design. The conventional loss functions used in object detection include a regression loss for the face region and a classification loss for identifying whether a face is detected or not. To further address the class imbalance problem, Lin et al. [13] proposed Focal Loss to focus training on a sparse set of hard examples. To use all original and enhanced features, Zhang et al. proposed Hierarchical Loss to effectively learn the network [30]. However, the above loss functions do not consider the progressive learning ability of feature maps in different levels.

Anchor matching. Basically, pre-set anchors for each feature map are generated by regularly tiling a collection of boxes with different scales and aspect ratios on the image. Some works [21, 32] analyze a series of reasonable anchor scales and an anchor compensation strategy to increase the number of positive anchors. However, such a strategy ignores random sampling in data augmentation. Continuous face scales and a large number of discrete anchor scales still lead to a large imbalance between negative and positive anchors.

To address the above three issues, we propose a novel network based on the SSD pipeline named Dual Shot Face Detector (DSFD). First, combining the similar setting of low-level FPN in PyramidBox and the Receptive Field Block (RFB) in RFBNet [14], we introduce a Feature Enhance Module (FEM) to enhance the discriminability and robustness of the features. Second, motivated by the hierarchical loss [30] and pyramid anchor [21] in PyramidBox, we propose Progressive Anchor Loss (PAL), which computes an auxiliary supervision loss with a set of smaller anchors to effectively facilitate the original features, since a smaller anchor tiled to an original feature map cell may have more semantic information for classification and high-resolution location information for small faces. Last but not least, we propose an Improved Anchor Matching (IAM) method, which integrates an anchor partition strategy and anchor-based data augmentation techniques into our DSFD to match anchors and ground truth faces as far as possible to provide better initialization for the regressor. Fig. 1 shows the effectiveness of our proposed DSFD on various variations, especially on extremely small faces or heavily occluded faces.

In summary, the main contributions of this paper include:
• A novel Feature Enhance Module to utilize different level information and thus obtain more discriminative and robust features.
• Auxiliary supervisions introduced in early layers by using a set of smaller anchors to effectively facilitate the features.
• An improved anchor matching strategy to match anchors and ground truth faces as far as possible to provide better initialization for the regressor.
• Comprehensive experiments conducted on popular benchmarks FDDB and WIDER FACE to demonstrate the superiority of our proposed DSFD network compared with the state-of-the-art methods.

2. Related work

We review the prior works from three perspectives.

Feature Learning. Early works on face detection mainly rely on hand-crafted features, such as Haar-like features [23], control point sets [1] and edge orientation histograms [10]. However, the design of hand-crafted features lacks guidance. With the great progress of deep learning, hand-crafted features have been replaced by Convolutional Neural Networks (CNNs). For example, Overfeat [20], Cascade-CNN [11] and MTCNN [31] adopt a CNN as a sliding window detector on an image pyramid to build a feature pyramid. However, using an image pyramid is slow and memory inefficient. As a result, most two-stage detectors extract features on a single scale. R-CNN [5, 6] obtains region proposals by selective search [22], and then forwards each normalized image region through a CNN to classify. Faster R-CNN [19] and R-FCN [4] employ a Region Proposal Network (RPN) to generate initial region proposals. Besides, RoI pooling [19] and position-sensitive RoI pooling [4] are applied to extract features from each region.

More recently, some research indicates that multi-scale features perform better for tiny objects. Specifically, SSD [15], MS-CNN [2], SSH [18] and S3FD [32] predict boxes on multiple layers of the feature hierarchy. FCN [17], Hypercolumns [7] and Parsenet [16] fuse multiple layer features in
Figure 2: Our DSFD framework uses a Feature Enhance Module (b) on top of a feedforward VGG16 architecture to generate the enhanced features (c) from the original features (a), along with two loss layers named first shot PAL for the original features and second shot PAL for the enhanced features. [Panels: (a) Original Feature Shot, (b) Feature Enhance Module, (c) Enhanced Feature Shot. Input image 640x640; detection layers conv3_3 (160x160), conv4_3 (80x80), conv5_3 (40x40), conv_fc7 (20x20), conv6_2 (10x10), conv7_2 (5x5).]

segmentation. FPN [12], a top-down architecture, integrates high-level semantic information into all scales. FPN-based methods, such as FAN [25] and PyramidBox [21], achieve significant improvements in detection. However, these methods do not consider the current layer's information. Unlike the above methods, which ignore the context relationship between anchors, we propose a feature enhance module that incorporates multi-level dilated convolutional layers to enhance the semantics of the features.
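
To make the effect of dilated convolutions concrete, the following Python sketch computes how far a stack of stride-1 dilated convolutions can "see". It is only an illustration: the kernel size 3 and rate 3 are taken from the labels in Fig. 3, while the function names and the choice of one, two and three stacked layers are assumptions made here, not details given by the authors.

```python
def effective_kernel(kernel: int, rate: int) -> int:
    """Spatial extent covered by a single dilated convolution kernel."""
    return rate * (kernel - 1) + 1


def receptive_field(num_layers: int, kernel: int = 3, rate: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 dilated convolutions."""
    rf = 1
    for _ in range(num_layers):
        # Each stride-1 layer widens the field by (effective kernel - 1).
        rf += effective_kernel(kernel, rate) - 1
    return rf


# Hypothetical branch depths; the text only says the branches differ in depth.
branch_fields = [receptive_field(n) for n in (1, 2, 3)]
```

Under these assumptions, one layer covers 7 × 7 positions of its input feature map, two stacked layers cover 13 × 13 and three cover 19 × 19, which is how branches of different depth gather context at several ranges.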
Loss Design. Generally, the objective loss in detection is a weighted sum of a classification loss (e.g. softmax loss) and a box regression loss (e.g. L2 loss). Girshick et al. [5] propose the smooth L1 loss to prevent exploding gradients. Lin et al. [13] discover that class imbalance is one obstacle to better performance in one-stage detectors, hence they propose focal loss, a dynamically scaled cross entropy loss. Besides, Wang et al. [26] design RepLoss for pedestrian detection, which improves performance in occlusion scenarios. FANet [30] creates a hierarchical feature pyramid and presents a hierarchical loss for its architecture. However, the anchors used in FANet are kept the same size in different stages. In this work, we adaptively choose different anchor sizes in different stages to facilitate the features.

Anchor Matching. To make the model more robust, most detection methods [15, 28, 32] use data augmentation, such as color distortion, horizontal flipping, random crop and multi-scale training. Zhang et al. [32] propose an anchor compensation strategy so that tiny faces match enough anchors during training. Wang et al. [28] propose random crop to generate a large number of occluded faces for training. However, these methods ignore random sampling in data augmentation, while ours combines anchor assignment to provide better data initialization for anchor matching.

Figure 3: Feature Enhance Module illustrating the current feature map cell interacting with neighbors in the current feature maps and the up feature maps. [Diagram labels: current feature map, up feature map, 1x1 conv, upsample, product, three N/3-channel branches of dilation conv (kernel=3x3, rate=3), concat to N channels.]

3. Dual Shot Face Detector

We first introduce the pipeline of our proposed framework DSFD, and then describe in detail our feature enhance module in Sec. 3.2, progressive anchor loss in Sec. 3.3 and improved anchor matching in Sec. 3.4.

3.1. Pipeline of DSFD

The framework of DSFD is illustrated in Fig. 2. Our architecture uses the same extended VGG16 backbone as PyramidBox [21] and S3FD [32], which is truncated before the classification layers and extended with some auxiliary structures. We select conv3_3, conv4_3, conv5_3, conv_fc7, conv6_2 and conv7_2 as the first shot detection layers to generate six original feature maps named of_1, of_2, of_3, of_4, of_5, of_6. Then, our proposed FEM transfers these original feature maps into six enhanced feature maps named ef_1, ef_2, ef_3, ef_4, ef_5, ef_6, which have the same sizes as the original ones and are fed into an SSD-style head to construct the second shot detection layers. Note that the input size of the training image is 640, which means the feature map size from the lowest-level layer to the highest-level layer ranges from 160 to 5. Different from S3FD and PyramidBox, after we utilize the receptive field enlargement in FEM and the new anchor design strategy, it is unnecessary for the three sizes of stride, anchor and receptive field to satisfy the equal-proportion interval principle. Therefore, our DSFD is more flexible and robust. Besides, the original and enhanced shots have two different losses, respectively named First Shot progressive anchor Loss (FSL) and Second Shot progressive anchor Loss (SSL).

Table 1: The stride size, feature map size, anchor scale, anchor ratio and anchor number of the six original and enhanced features for the two shots. Values in parentheses refer to the original features (first shot).

Feature       Stride  Size       Scale      Ratio    Number
ef_1 (of_1)   4       160 × 160  16 (8)     1.5 : 1  25600
ef_2 (of_2)   8       80 × 80    32 (16)    1.5 : 1  6400
ef_3 (of_3)   16      40 × 40    64 (32)    1.5 : 1  1600
ef_4 (of_4)   32      20 × 20    128 (64)   1.5 : 1  400
ef_5 (of_5)   64      10 × 10    256 (128)  1.5 : 1  100
ef_6 (of_6)   128     5 × 5      512 (256)  1.5 : 1  25
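
The geometry in Table 1 can be reproduced from the input size and the strides alone. The Python sketch below is illustrative and not the authors' code: the helper name `anchor_table` and its dictionary keys are hypothetical, and the rule "second-shot scale = 4 × stride" is simply a pattern read off the table rather than a statement from the text.

```python
INPUT_SIZE = 640                      # training input size stated in Sec. 3.1
STRIDES = (4, 8, 16, 32, 64, 128)     # stride column of Table 1


def anchor_table(input_size=INPUT_SIZE, strides=STRIDES):
    """Rebuild the rows of Table 1 (hypothetical helper for illustration)."""
    rows = []
    for level, stride in enumerate(strides, start=1):
        side = input_size // stride          # feature map is side x side
        second_scale = 4 * stride            # assumption: pattern observed in Table 1
        rows.append({
            "feature": f"ef_{level} (of_{level})",
            "stride": stride,
            "size": (side, side),
            "second_shot_scale": second_scale,
            "first_shot_scale": second_scale // 2,   # first-shot anchors are half size
            "anchors": side * side,                  # one anchor per feature map cell
        })
    return rows


rows = anchor_table()
total_anchors = sum(row["anchors"] for row in rows)
```

Under these assumptions each shot tiles 34,125 anchors over a 640 × 640 input, which is the sum of the Number column, and the first-shot scales come out as 8, 16, 32, 64, 128 and 256.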
3.2. Feature Enhance Module

The Feature Enhance Module (FEM for short) enhances the original features to make them more discriminative and robust. To enhance the original neuron cell $oc_{(i,j,l)}$, FEM utilizes different dimension information, including the upper layer original neuron cell $oc_{(i,j,l+1)}$ and the current layer non-local neuron cells $nc_{(i-\varepsilon,j-\varepsilon,l)}, nc_{(i-\varepsilon,j,l)}, \ldots, nc_{(i,j+\varepsilon,l)}, nc_{(i+\varepsilon,j+\varepsilon,l)}$. Specifically, the enhanced neuron cell $ec_{(i,j,l)}$ can be mathematically defined as follows:

$$ec_{(i,j,l)} = f_{concat}\big(f_{dilation}(nc_{(i,j,l)})\big), \qquad nc_{(i,j,l)} = f_{prod}\big(oc_{(i,j,l)}, f_{up}(oc_{(i,j,l+1)})\big) \tag{1}$$

where $c_{(i,j,l)}$ is a cell located at coordinate $(i, j)$ of the feature maps in the $l$-th layer, and $f$ denotes a set of basic dilation convolution, element-wise product, up-sampling or concatenation operations. Fig. 3 illustrates the idea of FEM, which is inspired by FPN [12] and RFB [14]. Here, we first use a 1×1 convolutional kernel to normalize the feature maps. Then, we up-sample the upper feature maps and take their element-wise product with the current ones. Finally, we split the feature maps into three parts, followed by three sub-networks containing different numbers of dilation convolutional layers.

3.3. Progressive Anchor Loss

In this subsection, we adopt the multi-task loss [15, 19] since it helps to facilitate the training of the original and enhanced feature maps in the two shots. First, our Second Shot anchor-based multi-task Loss function is defined as:

$$L_{SSL}(p_i, p_i^*, t_i, g_i, a_i) = \frac{1}{N_{conf}} \Big( \sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}} \sum_i p_i^* L_{loc}(t_i, g_i, a_i) \Big) \tag{2}$$

where $N_{conf}$ and $N_{loc}$ indicate the number of positive and negative anchors, and the number of positive anchors, respectively, $L_{conf}$ is the softmax loss over two classes (face vs. background), and $L_{loc}$ is the smooth L1 loss between the parameterizations of the predicted box $t_i$ and the ground-truth box $g_i$ using the anchor $a_i$. When $p_i^* = 1$ (with $p_i^* \in \{0, 1\}$), the anchor $a_i$ is positive and the localization loss is activated. $\beta$ is a weight to balance the effects of the two terms.

Compared to the enhanced feature maps in the same level, the original feature maps have less semantic information for classification but more high-resolution location information for detection. Therefore, we believe that the original feature maps can detect and classify smaller faces. As a result, we propose the First Shot multi-task Loss with a set of smaller anchors as follows:

$$L_{FSL}(p_i, p_i^*, t_i, g_i, sa_i) = \frac{1}{N_{conf}} \sum_i L_{conf}(p_i, p_i^*) + \frac{\beta}{N_{loc}} \sum_i p_i^* L_{loc}(t_i, g_i, sa_i) \tag{3}$$

where $sa$ indicates the smaller anchors in the first shot layers. The losses of the two shots are combined by a weighted sum into the whole Progressive Anchor Loss:

$$L_{PAL} = L_{FSL}(sa) + \lambda L_{SSL}(a) \tag{4}$$

Note that the anchor size in the first shot is half of that in the second shot, and $\lambda$ is a weight factor. The detailed assignment of the anchor size is described in Sec. 3.4. In the prediction process, we only use the output of the second shot, which means no additional computational cost is introduced.

3.4. Improved Anchor Matching

During training, we need to compute positive and negative anchors and determine which anchor corresponds to which face bounding box. The current anchor matching method is bidirectional between the anchor and the ground-truth face. Therefore, anchor design and face sampling during augmentation work together to match the anchors and faces as far as possible for better initialization of the regressor.

Table 1 shows the details of our anchor design, i.e., how each feature map cell is associated with the fixed shape anchor. We set the anchor ratio to 1.5:1 based on face scale statistics. The anchor size for the original feature is one half of that for the enhanced feature. Additionally, with probability 2/5, we utilize anchor-based sampling like data-anchor-sampling in PyramidBox, which randomly selects a face in an image, crops a sub-image containing the face, and sets the size ratio between the sub-image and the selected face to 640/rand(16, 32, 64, 128, 256, 512). For the remaining 3/5 probability, we adopt data augmentation similar to SSD [15]. In
order to improve the recall rate of faces and ensure anchor classification ability simultaneously, we set the Intersection-over-Union (IoU) threshold to 0.4 to assign each anchor to its ground-truth face.

4. Experiments

4.1. Implementation Details

First, we present the details of implementing our network. The backbone networks are initialized by the VGG/ResNet models pretrained on ImageNet. The parameters of all newly added convolution layers are initialized by the 'xavier' method. We use SGD with 0.9 momentum and 0.0005 weight decay to fine-tune our DSFD model. The batch size is set to 16. The learning rate is set to 10^-3 for the first 40k steps, and we decay it to 10^-4 and 10^-5 for two further 10k steps.

During inference, the first shot's outputs are ignored and the second shot predicts the top 5k high-confidence detections. Non-maximum suppression is applied with a Jaccard overlap of 0.3 to produce the top 750 high-confidence bounding boxes per image. For the 4 bounding box coordinates, we round down the top-left coordinates and round up the width and height to expand the detection bounding box.

4.2. Analysis on DSFD

In this subsection, we conduct extensive experiments and ablation studies on the WIDER FACE dataset to evaluate the effectiveness of several contributions of our proposed framework, including the feature enhance module, progressive anchor loss, and improved anchor matching. For fair comparisons, we use the same parameter settings for all the experiments, except for the specified changes to the components. All models are trained on the WIDER FACE training set and evaluated on the validation set. To better understand DSFD, we select different baselines to ablate each component and examine how it affects the final performance.

Feature Enhance Module. First, we adopt the anchors designed in S3FD [32] and PyramidBox [21] and six original feature maps generated by VGG16 to perform classification and regression; this baseline is named Face SSD (FSSD). We then add the feature enhance module to the VGG16-based FSSD for comparison. Table 2 shows that our feature enhance module improves the VGG16-based FSSD from 92.6%, 90.2%, 79.1% to 93.0%, 91.4%, 84.6%.

Table 2: Effectiveness of Feature Enhance Module on the AP performance.

Component            Easy   Medium  Hard
FSSD+VGG16           92.6%  90.2%   79.1%
FSSD+VGG16+FEM       93.0%  91.4%   84.6%

Progressive Anchor Loss. Second, we use the Res50-based FSSD as the baseline and add the progressive anchor loss for comparison. We use the outputs of four residual blocks in ResNet to replace the outputs of conv3_3, conv4_3, conv5_3 and conv_fc7 in VGG. Except for VGG16, we do not perform layer normalization. Table 3 shows that our progressive anchor loss improves the Res50-based FSSD using FEM from 95.0%, 94.1%, 88.0% to 95.3%, 94.4%, 88.6%.

Table 3: Effectiveness of Progressive Anchor Loss on the AP performance.

Component              Easy   Medium  Hard
FSSD+RES50             93.7%  92.2%   81.8%
FSSD+RES50+FEM         95.0%  94.1%   88.0%
FSSD+RES50+FEM+PAL     95.3%  94.4%   88.6%

Improved Anchor Matching. To evaluate our improved anchor matching strategy, we use the Res101-based FSSD without anchor compensation as the baseline. Table 4 shows that our improved anchor matching improves the Res101-based FSSD using FEM from 95.8%, 95.1%, 89.7% to 96.1%, 95.2%, 90.0%. Finally, we can improve our DSFD to 96.6%, 95.7%, 90.4% with ResNet152 as the backbone.

Besides, Fig. 4 shows that our improved anchor matching strategy greatly increases the number of ground truth faces that are close to the anchors, which can reduce the contradiction between the discrete anchor scales and continuous face scales. Moreover, Fig. 5 shows the distribution of the number of matched anchors for ground truth faces, which indicates that our improved anchor matching can significantly increase the number of matched anchors, and that the average number of matched anchors over different face scales improves from 6.4 to about 6.9.

Figure 4: The number distribution of different scales of faces compared between traditional anchor matching (left) and our improved anchor matching (right).

Figure 5: Comparisons on number distribution of matched anchor for ground truth faces between traditional anchor matching (blue line) and our improved anchor matching (red line).

From the above analysis and results, some promising conclusions can be drawn: 1) Feature enhance is crucial.
We use a more robust and discriminative feature enhance module to improve the feature representation ability, especially for hard faces. 2) The auxiliary loss based on progressive anchors is used to train all 12 detection feature maps of different scales, and it improves the performance on easy, medium and hard faces simultaneously. 3) Our improved anchor matching provides better initial anchors and ground-truth faces to regress anchors from faces, which achieves improvements of 0.3%, 0.1%, 0.3% on the three settings, respectively. Additionally, when we enlarge the training batch size (i.e., LargeBS), the result in the hard setting reaches 91.2% AP.

Table 4: Effectiveness of Improved Anchor Matching on the AP performance.

Component                              Easy   Medium  Hard
FSSD+RES101                            95.1%  93.6%   83.7%
FSSD+RES101+FEM                        95.8%  95.1%   89.7%
FSSD+RES101+FEM+IAM                    96.1%  95.2%   90.0%
FSSD+RES101+FEM+IAM+PAL                96.3%  95.4%   90.1%
FSSD+RES152+FEM+IAM+PAL                96.6%  95.7%   90.4%
FSSD+RES152+FEM+IAM+PAL+LargeBS        96.4%  95.7%   91.2%

Table 5: Effectiveness of different backbones.

Component                                Params  ACC@Top-1  Easy   Medium  Hard
FSSD+RES101+FEM+IAM+PAL                  399M    77.44%     96.3%  95.4%   90.1%
FSSD+RES152+FEM+IAM+PAL                  459M    78.42%     96.6%  95.7%   90.4%
FSSD+SE-RES101+FEM+IAM+PAL               418M    78.39%     95.7%  94.7%   88.6%
FSSD+DPN98+FEM+IAM+PAL                   515M    79.22%     96.3%  95.5%   90.4%
FSSD+SE-RESNeXt101_32×4d+FEM+IAM+PAL     416M    80.19%     95.7%  94.8%   88.9%

Figure 6: Precision-recall curves on WIDER FACE validation and testing subset. [Panels: Val: easy, Val: medium, Val: hard, Test: easy, Test: medium, Test: hard.]

Effects of Different Backbones. To better understand our DSFD, we further conducted experiments to examine how different backbones affect classification and detection performance. Specifically, we use the same settings except for the feature extraction network, and we implement SE-ResNet101, DPN-98 and SE-ResNeXt101_32×4d following the ResNet101 setting in our DSFD. From Table 5, DSFD with SE-ResNeXt101_32×4d obtains 95.7%, 94.8%, 88.9% on the easy, medium and hard settings respectively, which indicates that a more complex model and higher Top-1 ImageNet classification accuracy may not benefit face detection
AP. Therefore, in our DSFD framework, better performance on classification is not necessary for better performance on detection. Our DSFD enjoys high inference speed because it simply uses the second shot detection results. For VGA resolution inputs to the Res50-based DSFD, it runs at 22 FPS on an NVIDIA P40 GPU during inference.

4.3. Comparisons with State-of-the-Art Methods

We evaluate the proposed DSFD on two popular face detection benchmarks, WIDER FACE [28] and the Face Detection Data Set and Benchmark (FDDB) [9]. Our model is trained using only the training set of WIDER FACE, and then evaluated on both benchmarks without any further fine-tuning. We also follow an approach similar to that used in [25] to build the image pyramid for multi-scale testing and use a more powerful backbone similar to [3].

WIDER FACE Dataset. It contains 393,703 annotated faces with large variations in scale, pose and occlusion in a total of 32,203 images. For each of the 60 event classes, 40%, 10% and 50% of the images are randomly selected as the training, validation and testing sets. Besides, each subset is further divided into three levels of difficulty, 'Easy', 'Medium' and 'Hard', based on the detection rate of a baseline detector. As shown in Fig. 6, our DSFD achieves the best performance among all of the state-of-the-art face detectors based on the average precision (AP) across the three subsets, i.e., 96.6% (Easy), 95.7% (Medium) and 90.4% (Hard) on the validation set, and 96.0% (Easy), 95.3% (Medium) and 90.0% (Hard) on the test set. Fig. 8 shows more examples to demonstrate the effects of DSFD on handling faces with various variations, in which the blue bounding boxes indicate that the detector confidence is above 0.8.

Figure 7: Comparisons with popular state-of-the-art methods on the FDDB dataset. The first row shows the ROC results without additional annotations, and the second row shows the ROC results with additional annotations. [Panels in each row: discontinuous ROC curves, continuous ROC curves.]

FDDB Dataset. It contains 5,171 faces in 2,845 images taken from the Faces in the Wild data set. Since WIDER FACE has bounding box annotations while faces in FDDB are represented by ellipses, we learn a post-hoc ellipse regressor to transform the final prediction results. As shown in Fig. 7, our DSFD achieves state-of-the-art performance on both the discontinuous and continuous ROC curves, i.e., 99.1% and 86.2% when the number of false positives equals 1,000. After adding additional annotations to those unlabeled faces [32], the false positives of our model can be
further reduced and our model outperforms all other methods.

Figure 8: Effectiveness of our DSFD on large variations in scale, pose, occlusion, blur, makeup, illumination, modality and reflection. Blue bounding boxes indicate that the detector confidence is above 0.8. [Panels: Scale, Pose, Occlusion, Blurry, Makeup, Illumination, Modality, Reflection.]

5. Conclusions

This paper introduces a novel face detector named Dual Shot Face Detector (DSFD). In this work, we propose a novel Feature Enhance Module that utilizes different level information and thus obtains more discriminative and robust features. Auxiliary supervisions introduced in early layers by using smaller anchors are adopted to effectively facilitate the features. Moreover, an improved anchor matching method is introduced to match anchors and ground truth faces as far as possible to provide better initialization for the regressor. Comprehensive experiments are conducted on the benchmarks FDDB and WIDER FACE to demonstrate the superiority of our proposed DSFD network compared with the state-of-the-art methods.
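
As a compact recap of the auxiliary supervision summarized above, the Python sketch below evaluates a two-shot loss in the spirit of Eqs. (3) and (4). It is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, each term is normalised separately as written in Eq. (3), box inputs are assumed to be already encoded relative to their anchors, and hard negative mining is left out.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty on a scalar residual."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_ce(logits, label):
    """Cross-entropy of a softmax over two classes (index 1 = face)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


def shot_loss(logits, labels, pred_boxes, gt_boxes, beta=1.0):
    """Multi-task loss of one shot; labels are 1 (face) or 0 (background)."""
    n_conf = len(labels)                 # positive and negative anchors
    n_loc = max(sum(labels), 1)          # positive anchors, guarded against zero
    conf = sum(softmax_ce(l, y) for l, y in zip(logits, labels))
    loc = sum(
        y * sum(smooth_l1(p - g) for p, g in zip(pb, gb))
        for y, pb, gb in zip(labels, pred_boxes, gt_boxes)
    )
    return conf / n_conf + beta * loc / n_loc


def progressive_anchor_loss(first_shot, second_shot, lam=1.0, beta=1.0):
    """First-shot loss (smaller anchors) plus lam times the second-shot loss."""
    return shot_loss(*first_shot, beta=beta) + lam * shot_loss(*second_shot, beta=beta)


# One positive anchor with uninformative logits and a 0.5 offset error.
example = ([[0.0, 0.0]], [1], [[0.5, 0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0, 0.0]])
pal = progressive_anchor_loss(example, example, lam=1.0)
```

In a real detector the two shots would receive different targets, because the first shot matches faces against anchors of half the size; the same tuple is reused here only to keep the example short.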
References

[1] Y. Abramson, B. Steux, and H. Ghorayeb. Yet even faster (yef) real-time object detection. International Journal of Intelligent Systems Technologies and Applications, 2(2-3):102–112, 2007.
[2] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
[3] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou. Selective refinement network for high performance face detection. In Proceedings of Association for the Advancement of Artificial Intelligence (AAAI), 2019.
[4] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016.
[5] R. Girshick. Fast r-cnn. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
[7] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[9] V. Jain and E. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[10] K. Levi and Y. Weiss. Learning object detection from a small number of examples: the importance of good features. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004.
[11] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.
[14] S. Liu, D. Huang, and Y. Wang. Receptive field block net for accurate and fast object detection. In Proceedings of European Conference on Computer Vision, 2018.
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
[16] W. Liu, A. Rabinovich, and A. Berg. Parsenet: Looking wider to see better. In Proceedings of International Conference on Learning Representations Workshop, 2016.
[17] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[18] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis. Ssh: Single stage headless face detector. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.
[19] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015.
[20] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of International Conference on Learning Representations (ICLR), 2014.
[21] X. Tang, D. K. Du, Z. He, and J. Liu. Pyramidbox: A context-assisted single shot face detector. In Proceedings of European Conference on Computer Vision (ECCV), 2018.
[22] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[23] P. Viola and M. J. Jones. Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154, 2004.
[24] H. Wang, Z. Li, X. Ji, and Y. Wang. Face r-cnn. arXiv preprint arXiv:1706.01061, 2017.
[25] J. Wang, Y. Yuan, and G. Yu. Face attention network: An effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246, 2017.
[26] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[27] Y. Wang, X. Ji, Z. Zhou, H. Wang, and Z. Li. Detecting faces using region-based fully convolutional networks. arXiv preprint arXiv:1709.05256, 2017.
[28] S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A face detection benchmark. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[29] C. Zhang, X. Xu, and D. Tu. Face detection using improved faster rcnn. arXiv preprint arXiv:1802.02142, 2018.
[30] J. Zhang, X. Wu, J. Zhu, and S. C. Hoi. Feature agglomeration networks for single stage face detection. arXiv preprint arXiv:1712.00721, 2017.
[31] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
[32] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3FD: Single shot scale-invariant face detector. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.
