Face Detection With The Faster R-CNN
Table 1. Comparisons of the entire pipeline of different region-based object detection methods. (Both Faceness [15] and DeepBox [10] rely on the output of EdgeBox; therefore their entire running time should include the processing time of EdgeBox.)

  Proposal stage
    time (R-CNN and Fast R-CNN):  EdgeBox: 2.73s | Faceness: 9.91s (+ 2.73s = 12.64s) | DeepBox: 0.27s (+ 2.73s = 3.00s)
    time (Faster R-CNN, RPN):     0.32s

  Refinement stage
    input to CNN:           R-CNN: cropped proposal image | Fast R-CNN: input image & proposals | Faster R-CNN: input image
    #forwards through CNN:  R-CNN: #proposals | Fast R-CNN: 1 | Faster R-CNN: 1
    time:                   R-CNN: 7.04s | Fast R-CNN: 0.21s | Faster R-CNN: 0.06s

  Total time
    R-CNN + EdgeBox: 9.77s | R-CNN + Faceness: 19.68s | R-CNN + DeepBox: 10.04s
    Fast R-CNN + EdgeBox: 2.94s | Fast R-CNN + Faceness: 12.85s | Fast R-CNN + DeepBox: 3.21s
    Faster R-CNN: 0.38s
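The totals in the last block of Table 1 are simply the sum of the proposal-stage and refinement-stage times. A few lines of Python make the arithmetic explicit; the dictionaries below just transcribe the stage times from the table, and `total_time` is our own helper name, not part of any released code:

```python
# Stage times (seconds) copied from Table 1.
proposal_time = {
    "EdgeBox": 2.73,
    "Faceness": 9.91 + 2.73,  # Faceness re-ranks EdgeBox output, so EdgeBox time is included
    "DeepBox": 0.27 + 2.73,   # DeepBox likewise re-ranks EdgeBox output
    "RPN": 0.32,              # the Faster R-CNN's built-in proposal stage
}
refinement_time = {"R-CNN": 7.04, "Fast R-CNN": 0.21, "Faster R-CNN": 0.06}

def total_time(detector: str, proposal: str) -> float:
    """Total pipeline time = proposal stage time + refinement stage time."""
    return round(proposal_time[proposal] + refinement_time[detector], 2)

print(total_time("R-CNN", "Faceness"))      # 19.68, slowest combination
print(total_time("Faster R-CNN", "RPN"))    # 0.38, the full Faster R-CNN pipeline
```

The large R-CNN refinement time (7.04s) reflects one CNN forward pass per proposal, whereas the Fast and Faster R-CNN refine all proposals with a single forward pass over the shared feature map.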
[Figure 2: detection-rate versus IoU-threshold curves (four panels) comparing EdgeBox, DeepBox, Faceness, and the RPN face proposals; detection rate from 0 to 1, IoU threshold from 0.5 to 1.]
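The metric behind these curves, the detection rate of the top-N proposals at a given IoU threshold, can be sketched in a few lines. This is a minimal illustration rather than the benchmark's official evaluation code; the box format (x1, y1, x2, y2) and both function names are our own choices:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def detection_rate(proposals, ground_truth, n, iou_threshold):
    """Fraction of ground-truth boxes covered by at least one of the top-n
    proposals at the given IoU threshold. `proposals` is assumed to be
    sorted by objectness score, highest first."""
    top_n = proposals[:n]
    covered = sum(
        1 for gt in ground_truth
        if any(iou(p, gt) >= iou_threshold for p in top_n)
    )
    return covered / len(ground_truth) if ground_truth else 0.0
```

Sweeping `iou_threshold` from 0.5 to 1.0 for a fixed `n` traces one curve: a stricter threshold accepts fewer proposals as true detections, so the detection rate decreases monotonically.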
Figure 1. Sample images in the WIDER face dataset, where green bounding boxes are ground-truth annotations.

dataset. We can see that there exist great variations in scale, pose, and the number of faces in each image, making this dataset challenging.

We train the face detection model based on a pre-trained ImageNet model, VGG16 [13]. We randomly sample one image per batch for training. In order to fit it in the GPU memory, it is resized based on the ratio 1024/max(w, h), where w and h are the width and height of the image, respectively. We run the SGD solver for 50K iterations with a base learning rate of 0.001 and run another 20K iterations after reducing the base learning rate to 0.0001.²

² We use the author-released Python implementation https://github.com/rbgirshick/py-faster-rcnn.

We test the trained face detection model on two benchmark datasets, FDDB [7] and IJB-A [8]. There are 10 splits in both FDDB and IJB-A. For testing, we resize the input image based on the ratio min(600/min(w, h), 1024/max(w, h)). For the RPN, we use only the top 300 face proposals to balance efficiency and accuracy.

For FDDB, we directly test the model trained on WIDER. For IJB-A, it is necessary to fine-tune the face detection model due to the different annotation styles between WIDER and IJB-A. In WIDER, the face annotations are specified tightly around the facial region, while annotations in IJB-A include larger areas (e.g., hair). We fine-tune the face detection model on the training images of each split of IJB-A using only 10,000 iterations. In the first 5,000 iterations, the base learning rate is 0.001, and it is reduced to 0.0001 in the last 5,000. Note that there are more than 15,000 training images in each split. We run only 10,000 iterations of fine-tuning to adapt the regression branches of the Faster R-CNN model trained on WIDER to the annotation styles of IJB-A.

There are two criteria for quantitative comparison on the FDDB benchmark. For the discrete scores, each detection is considered to be positive if its intersection-over-union (IoU) ratio with its one-to-one matched ground-truth annotation is greater than 0.5. By varying the threshold on the detection scores, we can generate a set of true positives and false positives and report the ROC curve. For the more restrictive continuous scores, the true positives are weighted by their IoU scores. On IJB-A, we use the discrete score setting and report the true positive rate based on the normalized false positive rate per image instead of the total number of false positives.

3.2. Comparison of Face Proposals

We compare the RPN with other approaches, including EdgeBox [2], Faceness [15], and DeepBox [10], on FDDB. EdgeBox evaluates the objectness score of each proposal based on the distribution of edge responses within it, in a sliding-window fashion. Both Faceness and DeepBox re-rank other object proposals, e.g., those of EdgeBox. In Faceness, five CNNs are trained based on attribute annotations of facial parts including hair, eyes, nose, mouth, and beard. The Faceness score of each proposal is then computed based on the response maps of the different networks. DeepBox, which is based on the Fast R-CNN framework, re-ranks each proposal based on the region-pooled features. We re-train a
Figure 3. Comparisons of region-based CNN object detection methods for face detection on FDDB. [ROC curves: true positive rate vs. total false positives for the R-CNN, Fast R-CNN, and Faster R-CNN.]

DeepBox model for face proposals on the WIDER training set.

We follow [2] to measure the detection rate of the top N proposals by varying the Intersection-over-Union (IoU) threshold. The larger the threshold is, the fewer the proposals that are considered to be true objects. Quantitative comparisons of proposals are displayed in Fig. 2. As can be seen, the RPN and DeepBox are significantly better than the other two. It is perhaps not surprising that the learning-based approaches perform better than the heuristic one, EdgeBox. Although Faceness is also based on deeply trained convolutional networks (fine-tuned from AlexNet), the rule for computing the faceness score of each proposal is hand-crafted, in contrast to the end-to-end learning of the RPN and DeepBox. The RPN performs slightly better than DeepBox, perhaps because it uses a deeper CNN. Due to the sharing of convolutional layers between the RPN and the Fast R-CNN detector, the processing time of the entire system is lower. Moreover, the RPN does not rely on other object proposal methods, e.g., EdgeBox.

3.3. Comparison of Region-based CNN Methods

We also compare the face detection performance of the R-CNN, the Fast R-CNN, and the Faster R-CNN on FDDB. For both the R-CNN and Fast R-CNN, we use the top 2000 proposals generated by the Faceness method [15]. For the R-CNN, we fine-tune the pre-trained VGG-M model. Different from the original R-CNN implementation [5], we train a CNN with both classification and regression branches end-to-end, following [15]. For both the Fast R-CNN and Faster R-CNN, we fine-tune the pre-trained VGG16 model. As can be observed from Fig. 3, the Faster R-CNN significantly outperforms the other two. Since the Faster R-CNN also contains the Fast R-CNN detector module, the performance boost mostly comes from the RPN module, which is based on a deeply trained CNN. Note that the Faster R-CNN also runs much faster than both the R-CNN and Fast R-CNN, as summarized in Table 1.

3.4. Comparison with State-of-the-art Methods

Finally, we compare the Faster R-CNN with 11 other top detectors on FDDB, all published since 2015. ROC curves of the different methods, obtained from the FDDB results page, are shown in Fig. 4. For discrete scores on FDDB, the Faster R-CNN performs better than all of the others when there are more than around 200 false positives for the entire test set, as shown in Fig. 4(a) and (b). With the more restrictive continuous scores, the Faster R-CNN is better than most of the other state-of-the-art methods but poorer than MultiresHPM [3]. This discrepancy can be attributed to the fact that the detection results of the Faster R-CNN are not always exactly around the annotated face regions, as can be seen in Fig. 5. At 500 false positives, the true positive rates with discrete and continuous scores are 0.952 and 0.718, respectively. One possible reason for the relatively poor performance under continuous scoring might be the difference in face annotations between WIDER and FDDB.

IJB-A is a relatively new face detection benchmark dataset, published at CVPR 2015, and thus not many results have been reported on it yet. We borrow the results of the other methods from [1, 8]. The comparison is shown in Fig. 4(d). As we can see, the Faster R-CNN performs better than all of the others by a large margin.

We further demonstrate qualitative face detection results in Fig. 5 and Fig. 6. It can be observed that the Faster R-CNN model can deal with challenging cases involving multiple overlapping faces and faces with extreme poses and scales.

4. Conclusion

In this report, we have demonstrated state-of-the-art face detection performance on two benchmark datasets using the Faster R-CNN. Experimental results suggest that its effectiveness comes from the region proposal network (RPN) module. Due to the sharing of convolutional layers between the RPN and the Fast R-CNN detector module, it is possible to use a deep CNN in the RPN without extra computational burden.

Although the Faster R-CNN is designed for generic object detection, it demonstrates impressive face detection performance when retrained on a suitable face detection training set. It may be possible to further boost its performance by considering the special patterns of human faces.

References

[1] J. Cheney, B. Klein, A. K. Jain, and B. F. Klare. Unconstrained face detection: State of the art baseline and challenges. In ICB, pages 229–236, 2015.
Figure 4. Comparisons of face detection with state-of-the-art methods on (a) ROC curves on FDDB with discrete scores, (b) ROC curves on FDDB with discrete scores using fewer false positives, (c) ROC curves on FDDB with continuous scores, and (d) results on the IJB-A dataset. [Methods compared: Faceness, headHunter, MTCNN, hyperFace, DP2MFD, CCF, NPDFace, MultiresHPM, FD3DM, DDFD, CascadeCNN, and the Faster R-CNN; true positive rate vs. total false positives.]

[2] P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. IEEE Trans. Pattern Anal. Mach. Intell., 37(8):1558–1570, 2015.
[3] G. Ghiasi and C. C. Fowlkes. Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In CVPR, pages 1899–1906, 2014.
[4] R. B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[5] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014.
[7] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[8] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus benchmark A. In CVPR, pages 1931–1939, 2015.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[10] W. Kuo, B. Hariharan, and J. Malik. DeepBox: Learning objectness with convolutional networks. In ICCV, pages 2479–2487, 2015.
[11] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[12] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
Figure 5. Sample detection results on the FDDB dataset, where green bounding boxes are ground-truth annotations and red bounding boxes are detection results of the Faster R-CNN.

Figure 6. Sample detection results on the IJB-A dataset, where green bounding boxes are ground-truth annotations and red bounding boxes are detection results of the Faster R-CNN.

[13] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[14] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[15] S. Yang, P. Luo, C. C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In ICCV, pages 3676–3684, 2015.
[16] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In CVPR, 2016.