
Face Detection with the Faster R-CNN

Huaizu Jiang                          Erik Learned-Miller
University of Massachusetts Amherst   University of Massachusetts Amherst
Amherst, MA 01003                     Amherst, MA 01003
[email protected]                [email protected]
arXiv:1606.03473v1 [cs.CV] 10 Jun 2016

Abstract

The Faster R-CNN [12] has recently demonstrated impressive results on various object detection benchmarks. By training a Faster R-CNN model on the large-scale WIDER face dataset [16], we report state-of-the-art results on two widely used face detection benchmarks, FDDB and the recently released IJB-A.

1. Introduction

Deep convolutional neural networks (CNNs) have dominated many tasks of computer vision. In object detection, region-based CNN detection methods are now the main paradigm. The area is developing so rapidly that three generations of region-based CNN detection models have been proposed in the last few years, with increasingly better performance and faster processing speed.

The latest generation, represented by the Faster R-CNN of Ren, He, Girshick, and Sun [12], demonstrates impressive results on various object detection benchmarks. It is also the foundational framework for the winning entry of the COCO detection challenge 2015 (http://mscoco.org/dataset/#detections-leaderboard). In this report, we demonstrate state-of-the-art face detection results using the Faster R-CNN on two popular face detection benchmarks: the widely used Face Detection Dataset and Benchmark (FDDB) [7] and the more recent IJB-A benchmark [8]. We also compare different generations of region-based CNN object detection models, and compare to a variety of other recent high-performing detectors.

2. Overview of the Faster R-CNN

Since the dominant success of a deeply trained convolutional network (CNN) [9] in image classification on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, it has been natural to ask whether the same success can be achieved for object detection. The short answer is yes.

2.1. Evolution of Region-based CNNs for Object Detection

Girshick et al. [5] introduced a region-based CNN (R-CNN) for object detection. The pipeline consists of two stages. In the first, a set of category-independent object proposals is generated using selective search [14]. In the second, refinement stage, the image region within each proposal is warped to a fixed size (e.g., 227 × 227 for AlexNet [9]) and then mapped to a 4096-dimensional feature vector, which is fed into a classifier and also into a regressor that refines the position of the detection.

The significance of the R-CNN is that it brings the high accuracy of CNNs on classification tasks to the problem of object detection. Its success is largely due to transferring the supervised pre-trained image representation for image classification to object detection.

The R-CNN, however, requires a forward pass through the convolutional network for each object proposal in order to extract features, leading to a heavy computational burden. To mitigate this problem, two approaches have been proposed: the SPPnet [6] and the Fast R-CNN [4]. Instead of feeding each warped proposal image region to the CNN, the SPPnet and the Fast R-CNN run the CNN exactly once over the entire input image. After projecting the proposals onto the convolutional feature maps, a fixed-length feature vector can be extracted for each proposal in a manner similar to spatial pyramid pooling. The Fast R-CNN is a special case of the SPPnet that uses a single spatial pyramid pooling layer, the region of interest (RoI) pooling layer, and thus allows end-to-end fine-tuning of a pre-trained ImageNet model. This is the key to its better performance relative to the original R-CNN. A sketch of this pooling follows.
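To make the fixed-length extraction concrete, the following is a minimal single-level RoI pooling sketch in Python/NumPy. It is our own illustration under simplifying assumptions (one pyramid level, integer grid boundaries, no gradient handling), not the code of [4] or [6]:

    import numpy as np

    def roi_pool(feat, roi, out_size=7, stride=16):
        """Max-pool one region of a conv feature map to a fixed size.
        feat: (C, H, W) features computed once for the whole image.
        roi:  (x1, y1, x2, y2) proposal in image coordinates.
        Returns a (C, out_size, out_size) array, i.e., a fixed-length
        feature vector regardless of the proposal's size."""
        # Project the proposal onto the feature map (stride 16 for VGG16).
        x1, y1, x2, y2 = [int(round(c / stride)) for c in roi]
        region = feat[:, y1:y2 + 1, x1:x2 + 1]
        C, h, w = region.shape
        ys = np.linspace(0, h, out_size + 1, dtype=int)  # pooling grid rows
        xs = np.linspace(0, w, out_size + 1, dtype=int)  # pooling grid cols
        out = np.zeros((C, out_size, out_size), dtype=feat.dtype)
        for i in range(out_size):
            for j in range(out_size):
                cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                 xs[j]:max(xs[j + 1], xs[j] + 1)]
                out[:, i, j] = cell.max(axis=(1, 2))
        return out

    feat = np.random.rand(512, 38, 63)  # e.g., VGG16 conv5 for a 608 x 1008 image
    print(roi_pool(feat, (160, 96, 400, 320)).shape)  # -> (512, 7, 7)

However the proposal is sized, the pooled output has the same shape, so the same fully connected layers can process every proposal.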
Both the R-CNN and the Fast R-CNN (and the SPPnet) rely on generic object proposals as input, which usually come from a hand-crafted model such as selective search [14] or EdgeBox [2]. There are two main issues with this approach. The first, as shown in image classification and object detection, is that (deeply) learned representations often generalize better than hand-crafted ones. The second is that the computational burden of proposal generation dominates the processing time of the entire pipeline (e.g., 2.73 seconds for EdgeBox in our experiments). Although there are now deeply trained models for proposal generation, e.g., DeepBox [10] (based on the Fast R-CNN framework), their processing time is still not negligible.

Table 1. Comparison of the entire pipelines of different region-based object detection methods. (Both Faceness [15] and DeepBox [10] rely on the output of EdgeBox, so their entire running time includes the 2.73s of EdgeBox.)

                                  R-CNN                         Fast R-CNN                     Faster R-CNN
  proposal stage
    time                          EdgeBox: 2.73s                (same as R-CNN)                0.32s
                                  Faceness: 9.91s (+ 2.73s = 12.64s)
                                  DeepBox: 0.27s (+ 2.73s = 3.00s)
  refinement stage
    input to CNN                  cropped proposal image        input image & proposals        input image
    #forward passes through CNN   #proposals                    1                              1
    time                          7.04s                         0.21s                          0.06s
  total time                      + EdgeBox: 9.77s              + EdgeBox: 2.94s               0.38s
                                  + Faceness: 19.68s            + Faceness: 12.85s
                                  + DeepBox: 10.04s             + DeepBox: 3.21s

To reduce the computational burden of proposal generation, the Faster R-CNN was proposed. It consists of two modules. The first, the Region Proposal Network (RPN), is a fully convolutional network for generating object proposals that are fed into the second module. The second module is the Fast R-CNN detector, whose purpose is to refine the proposals. The key idea is to share the convolutional layers between the RPN and the Fast R-CNN detector, up to their own fully connected layers. The image then passes through the CNN only once to produce and then refine object proposals. More importantly, thanks to the sharing of convolutional layers, it is possible to use a very deep network (e.g., VGG16 [13]) to generate high-quality object proposals.

The key differences among the R-CNN, the Fast R-CNN, and the Faster R-CNN are summarized in Table 1. The running times of the different modules are reported on the FDDB dataset [7], where the typical resolution of an image is about 350 × 450. The code was run on a server equipped with a 2.60 GHz Intel Xeon E5-2697 CPU and an NVIDIA Tesla K40c GPU with 12 GB of memory. We can clearly see that the entire running time of the Faster R-CNN is significantly lower than that of both the R-CNN and the Fast R-CNN.

2.2. The Faster R-CNN

In this section, we briefly introduce the key aspects of the Faster R-CNN. We refer readers to the original paper [12] for more technical details.

In the RPN, the convolutional layers of a pre-trained network are followed by a 3 × 3 convolutional layer. This corresponds to mapping a large spatial window or receptive field (e.g., 228 × 228 for VGG16) in the input image to a low-dimensional feature vector at a center stride (e.g., 16 for VGG16). Two 1 × 1 convolutional layers are then added for the classification and regression branches over all spatial windows.

To deal with different scales and aspect ratios of objects, anchors are introduced in the RPN. An anchor sits at each sliding location of the convolutional feature maps and thus at the center of each spatial window. Each anchor is associated with a scale and an aspect ratio. Following the default setting of [12], we use 3 scales (128², 256², and 512² pixels) and 3 aspect ratios (1:1, 1:2, and 2:1), leading to k = 9 anchors at each location. Each proposal is parameterized relative to an anchor, so for a convolutional feature map of size W × H, we have at most WHk possible proposals. We note that the same features at each sliding location are used to regress the k = 9 proposals, instead of extracting k sets of features and training a single regressor. A sketch of this anchor construction follows.
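The following snippet shows how such anchors can be generated and tiled. It is our own illustration; the generate_anchors.py shipped with py-faster-rcnn differs in details such as rounding conventions:

    import numpy as np

    def make_anchors(scales=(128, 256, 512), ratios=(1.0, 2.0, 0.5)):
        """k = 9 base anchors (x1, y1, x2, y2) centered at the origin;
        each has area scale**2 and height/width ratio r."""
        anchors = []
        for s in scales:
            for r in ratios:
                w = s / np.sqrt(r)   # w * h = s**2 and h / w = r
                h = s * np.sqrt(r)
                anchors.append([-w / 2, -h / 2, w / 2, h / 2])
        return np.array(anchors)

    def all_anchors(W, H, stride=16):
        """Tile the k base anchors over a W x H feature map, stepping by
        the network stride, giving the (at most) W*H*k candidate boxes."""
        base = make_anchors()                                      # (k, 4)
        xs, ys = np.meshgrid(np.arange(W) * stride, np.arange(H) * stride)
        shifts = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4)
        return (shifts + base[None, :, :]).reshape(-1, 4)

    print(all_anchors(W=63, H=38).shape)   # (63 * 38 * 9, 4) = (21546, 4)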
Training of the RPN can be done in an end-to-end manner using stochastic gradient descent (SGD) for both the classification and regression branches. For the entire system, we have to take care of both the RPN and the Fast R-CNN modules, since they share convolutional layers. In this paper, we adopt the approximate joint learning strategy proposed in [11]: the RPN and the Fast R-CNN are trained end-to-end as if they were independent, even though the input of the Fast R-CNN actually depends on the output of the RPN. For exact joint training, the SGD solver would also have to consider the derivatives of the RoI pooling layer in the Fast R-CNN with respect to the coordinates of the proposals predicted by the RPN which, as pointed out in [11], is not a trivial optimization problem. The toy snippet below illustrates the distinction.
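The approximation can be made concrete with a toy autograd snippet (written in PyTorch purely for illustration; the experiments below use the Caffe-based py-faster-rcnn, and all tensors here are stand-ins):

    import torch

    shared = torch.randn(1, 512, 7, 7, requires_grad=True)  # shared conv features

    # Stand-ins for the RPN regression output (proposal coordinates) and
    # the RoI-pooled features consumed by the Fast R-CNN head.
    rpn_boxes = shared.mean(dim=(2, 3))[:, :4]
    pooled = shared.mean(dim=(2, 3))

    # Approximate joint training: the Fast R-CNN branch treats the proposal
    # coordinates as fixed inputs. detach() blocks the gradient path through
    # the box coordinates; gradients still reach the shared layers through
    # the pooled features.
    loss = pooled.sum() + rpn_boxes.detach().sum()
    loss.backward()
    print(shared.grad.abs().sum() > 0)  # True: gradient flows via pooled features only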
3. Experiments

In this section, we report experiments comparing region proposals, and also comparing the end-to-end performance of top face detectors.

[Figure 2: detection rate vs. IoU threshold (0.5 to 1.0) for EdgeBox, DeepBox, Faceness, and the RPN, in four panels: (a) 100 proposals, (b) 300 proposals, (c) 500 proposals, (d) 1000 proposals.]
Figure 2. Comparisons of face proposals on FDDB using different methods.

3.1. Setup

We train a Faster R-CNN face detection model on the recently released WIDER face dataset [16]. There are 12,880 images and 159,424 faces in the training set. In Fig. 1, we show some randomly sampled images of the WIDER dataset. There are great variations in scale, pose, and the number of faces in each image, which make this dataset challenging.

[Figure 1: randomly sampled WIDER training images.]
Figure 1. Sample images in the WIDER face dataset, where green bounding boxes are ground-truth annotations.

We train the face detection model based on a pre-trained ImageNet model, VGG16 [13]. We randomly sample one image per batch for training. In order to fit it in GPU memory, the image is resized by the ratio 1024/max(w, h), where w and h are its width and height, respectively. We run the SGD solver for 50k iterations with a base learning rate of 0.001 and run another 20k iterations with the base learning rate reduced to 0.0001. (We use the authors' released Python implementation, https://github.com/rbgirshick/py-faster-rcnn.)

We test the trained face detection model on two benchmark datasets, FDDB [7] and IJB-A [8]. There are 10 splits in both FDDB and IJB-A. For testing, we resize the input image by the ratio min(600/min(w, h), 1024/max(w, h)). For the RPN, we use only the top 300 face proposals to balance efficiency and accuracy. Both resize rules are sketched below.
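As a minimal sketch of the two rules (our own paraphrase of the setup, not code from the released implementation):

    def train_scale(w, h, max_size=1024.0):
        """Training: scale so the longer image side becomes 1024 pixels."""
        return max_size / max(w, h)

    def test_scale(w, h, target=600.0, max_size=1024.0):
        """Testing: scale the shorter side toward 600 pixels, but never
        let the longer side exceed 1024 pixels."""
        return min(target / min(w, h), max_size / max(w, h))

    w, h = 450, 350              # a typical FDDB resolution
    print(train_scale(w, h))     # 1024/450, about 2.276
    print(test_scale(w, h))      # min(600/350, 1024/450), about 1.714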
For FDDB, we directly test the model trained on WIDER. For IJB-A, it is necessary to fine-tune the face detection model because of the different annotation styles of WIDER and IJB-A: in WIDER, the face annotations are specified tightly around the facial region, while annotations in IJB-A include larger areas (e.g., hair). We fine-tune the face detection model on the training images of each split of IJB-A using only 10,000 iterations. In the first 5,000 iterations, the base learning rate is 0.001, and it is reduced to 0.0001 for the last 5,000. Note that there are more than 15,000 training images in each split. We run only 10,000 iterations of fine-tuning to adapt the regression branches of the Faster R-CNN model trained on WIDER to the annotation styles of IJB-A.

There are two criteria for quantitative comparison on the FDDB benchmark. For the discrete scores, each detection is considered a true positive if its intersection-over-union (IoU) ratio with its one-to-one matched ground-truth annotation is greater than 0.5. By varying the threshold on detection scores, we can generate a set of true positives and false positives and report the ROC curve. For the more restrictive continuous scores, the true positives are weighted by the IoU scores. On IJB-A, we use the discrete score setting and report the true positive rate based on the normalized false positive rate per image instead of the total number of false positives. The two FDDB criteria are sketched below.
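As a sketch of the two criteria, assuming axis-aligned boxes (x1, y1, x2, y2) and an already-matched ground-truth annotation (the official FDDB evaluation matches detections to elliptical annotations, so this is only illustrative):

    def iou(a, b):
        """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    def score_detection(det, matched_gt):
        overlap = iou(det, matched_gt)
        discrete = 1.0 if overlap > 0.5 else 0.0   # counts as a true positive?
        continuous = overlap                       # true positive weighted by IoU
        return discrete, continuous

    print(score_detection((10, 10, 60, 60), (12, 8, 58, 64)))  # (1.0, ~0.83)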
3.2. Comparison of Face Proposals

We compare the RPN with other approaches, including EdgeBox [2], Faceness [15], and DeepBox [10], on FDDB. EdgeBox evaluates the objectness score of each proposal based on the distribution of edge responses within it, in a sliding-window fashion. Both Faceness and DeepBox re-rank other object proposals, e.g., those of EdgeBox. In Faceness, five CNNs are trained based on attribute annotations of facial parts including hair, eyes, nose, mouth, and beard; the Faceness score of each proposal is then computed from the response maps of the different networks. DeepBox, which is based on the Fast R-CNN framework, re-ranks each proposal based on region-pooled features. We re-train a DeepBox model for face proposals on the WIDER training set.

[Figure 3: ROC curves (true positive rate vs. total false positives, 0 to 2000) for the R-CNN, the Fast R-CNN, and the Faster R-CNN.]
Figure 3. Comparisons of region-based CNN object detection methods for face detection on FDDB.

We follow [2] to measure the detection rate of the top N proposals by varying the Intersection-over-Union (IoU) threshold: the larger the threshold, the fewer proposals are considered to be true objects. Quantitative comparisons of proposals are displayed in Fig. 2, and a sketch of the metric follows.
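A sketch of this metric, reusing the iou helper from the scoring sketch in Sec. 3.1 (the list layouts here are assumptions for illustration):

    def detection_rate(gt_faces, proposals, top_n, thresh):
        """gt_faces: per-image lists of ground-truth boxes; proposals:
        per-image lists of boxes sorted by objectness score. A face counts
        as detected if any of the top-N proposals overlaps it above the
        IoU threshold."""
        hits = 0
        for faces, props in zip(gt_faces, proposals):
            kept = props[:top_n]
            hits += sum(1 for f in faces if any(iou(f, p) > thresh for p in kept))
        total = sum(len(faces) for faces in gt_faces)
        return hits / float(total)

    # Sweeping thresh over [0.5, 1.0] for N in {100, 300, 500, 1000}
    # reproduces the axes and panels of Fig. 2.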
As can be seen, the RPN and DeepBox are significantly better than the other two. It is perhaps not surprising that learning-based approaches perform better than the heuristic one, EdgeBox. Although Faceness is also based on deeply trained convolutional networks (fine-tuned from AlexNet), the rule for computing the Faceness score of each proposal is hand-crafted, in contrast to the end-to-end learning of the RPN and DeepBox. The RPN performs slightly better than DeepBox, perhaps because it uses a deeper CNN. Due to the sharing of convolutional layers between the RPN and the Fast R-CNN detector, the processing time of the entire system is lower. Moreover, the RPN does not rely on other object proposal methods, e.g., EdgeBox.

3.3. Comparison of Region-based CNN Methods

We also compare the face detection performance of the R-CNN, the Fast R-CNN, and the Faster R-CNN on FDDB. For both the R-CNN and the Fast R-CNN, we use the top 2000 proposals generated by the Faceness method [15]. For the R-CNN, we fine-tune the pre-trained VGG-M model. Different from the original R-CNN implementation [5], we train a CNN with both classification and regression branches end-to-end, following [15]. For both the Fast R-CNN and the Faster R-CNN, we fine-tune the pre-trained VGG16 model. As can be observed from Fig. 3, the Faster R-CNN significantly outperforms the other two. Since the Faster R-CNN also contains the Fast R-CNN detector module, the performance boost mostly comes from the RPN module, which is based on a deeply trained CNN. Note that the Faster R-CNN also runs much faster than both the R-CNN and the Fast R-CNN, as summarized in Table 1.

3.4. Comparison with State-of-the-art Methods

Finally, we compare the Faster R-CNN with 11 other top detectors on FDDB, all published since 2015. ROC curves of the different methods, obtained from the FDDB results page, are shown in Fig. 4. For discrete scores on FDDB, the Faster R-CNN performs better than all of the others when there are more than around 200 false positives for the entire test set, as shown in Fig. 4(a) and (b). With the more restrictive continuous scores, the Faster R-CNN is better than most other state-of-the-art methods but poorer than MultiresHPM [3]. This discrepancy can be attributed to the fact that the detection results of the Faster R-CNN are not always exactly around the annotated face regions, as can be seen in Fig. 5. For 500 false positives, the true positive rates with discrete and continuous scores are 0.952 and 0.718, respectively. One possible reason for the relatively poor performance under continuous scoring might be the difference in face annotations between WIDER and FDDB.

IJB-A is a relatively new face detection benchmark dataset, published at CVPR 2015, and thus not many results have been reported on it yet. We borrow results of the other methods from [1, 8]. The comparison is shown in Fig. 4(d). As we can see, the Faster R-CNN performs better than all of the others by a large margin.

We further demonstrate qualitative face detection results in Fig. 5 and Fig. 6. It can be observed that the Faster R-CNN model can deal with challenging cases involving multiple overlapping faces and faces with extreme poses and scales.

4. Conclusion

In this report, we have demonstrated state-of-the-art face detection performance on two benchmark datasets using the Faster R-CNN. Experimental results suggest that its effectiveness comes from the region proposal network (RPN) module. Due to the sharing of convolutional layers between the RPN and the Fast R-CNN detector module, it is possible to use a deep CNN in the RPN without extra computational burden.

Although the Faster R-CNN is designed for generic object detection, it demonstrates impressive face detection performance when retrained on a suitable face detection training set. It may be possible to further boost its performance by considering the special patterns of human faces.

[Figure 4: true positive rate curves for Faceness, headHunter, MTCNN, hyperFace, DP2MFD, CCF, NPDFace, MultiresHPM, FD3DM, DDFD, CascadeCNN, and the Faster R-CNN.]
Figure 4. Comparisons of face detection with state-of-the-art methods: (a) ROC curves on FDDB with discrete scores, (b) ROC curves on FDDB with discrete scores using fewer false positives, (c) ROC curves on FDDB with continuous scores, and (d) results on the IJB-A dataset.

Figure 5. Sample detection results on the FDDB dataset, where green bounding boxes are ground-truth annotations and red bounding boxes are detection results of the Faster R-CNN.

Figure 6. Sample detection results on the IJB-A dataset, where green bounding boxes are ground-truth annotations and red bounding boxes are detection results of the Faster R-CNN.

References

[1] J. Cheney, B. Klein, A. K. Jain, and B. F. Klare. Unconstrained face detection: State of the art baseline and challenges. In ICB, pages 229–236, 2015.
[2] P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. IEEE Trans. Pattern Anal. Mach. Intell., 37(8):1558–1570, 2015.
[3] G. Ghiasi and C. C. Fowlkes. Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In CVPR, pages 1899–1906, 2014.
[4] R. B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[5] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014.
[7] V. Jain and E. Learned-Miller. FDDB: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[8] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, K. Allen, P. Grother, A. Mah, M. Burge, and A. K. Jain. Pushing the frontiers of unconstrained face detection and recognition: IARPA Janus Benchmark A. In CVPR, pages 1931–1939, 2015.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
[10] W. Kuo, B. Hariharan, and J. Malik. DeepBox: Learning objectness with convolutional networks. In ICCV, pages 2479–2487, 2015.
[11] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 2016.
[12] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[13] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[14] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
[15] S. Yang, P. Luo, C. C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In ICCV, pages 3676–3684, 2015.
[16] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In CVPR, 2016.
