CMS-RCNN: Contextual Multi-Scale Region-Based CNN For Unconstrained Face Detection
Abstract—Robust face detection in the wild is one of the ultimate components supporting various facial-related problems, i.e. unconstrained face recognition, facial periocular recognition, facial landmarking and pose estimation, facial expression recognition, 3D facial model construction, etc. Although the face detection problem has been intensely studied for decades and has various commercial applications, it still encounters problems in some real-world scenarios due to numerous challenges, e.g. heavy facial occlusions, extremely low resolutions, strong illumination, exceptional pose variations, image or video compression artifacts, etc. In this paper, we present a face detection approach named Contextual Multi-Scale Region-based Convolutional Neural Network (CMS-RCNN) to robustly solve the problems mentioned above. Similar to the region-based CNNs, our proposed network consists of a region proposal component and a region-of-interest (RoI) detection component. Unlike those networks, however, our proposed network makes two main contributions that play a significant role in achieving state-of-the-art performance in face detection. Firstly, multi-scale information is grouped both in the region proposal and in the RoI detection to deal with tiny face regions. Secondly, our proposed network allows explicit body contextual reasoning, inspired by the intuition of the human vision system. The proposed approach is benchmarked on two recent challenging face detection databases, i.e. the WIDER FACE Dataset, which contains a high degree of variability, and the Face Detection Data Set and Benchmark (FDDB). The experimental results show that our proposed approach trained on the WIDER FACE Dataset outperforms strong baselines on the WIDER FACE Dataset by a large margin, and consistently achieves competitive results on FDDB against the recent state-of-the-art face detection methods.

Index Terms—Robust Face Detection, Multi-Scale Information, Contextual Reasoning, Convolutional Neural Network, Region-based CNN
1 INTRODUCTION
Detection and analysis of human subjects using facial-feature-based biometrics for access control, surveillance systems and other security applications have gained popularity over the past few years. Several such biometric systems are deployed at security checkpoints across the globe, with more being deployed every day. In particular, face recognition has been one of the most popular biometric modalities attractive to security departments. Indeed, the uniqueness of facial features across individuals can be captured much more easily than other biometrics. Before a face recognition algorithm can be applied, however, face detection usually needs to be performed first.

The problem of face detection has been intensely studied for decades with the aim of ensuring the generalization of robust algorithms to unseen face images [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. Although the detection accuracy of recent face detection algorithms [14], [15], [16], [17], [18], [19] has been highly improved due to the advancement of deep Convolutional Neural Networks (CNNs), they are still far from achieving the same detection capabilities as a human due to a number of challenges in practice, for example off-angle faces, large occlusions, low resolutions and strong lighting conditions, as shown in Figure 1.

Fig. 1. An example of face detection results using our proposed CMS-RCNN method. The proposed method can robustly detect faces across occlusion, facial expression, pose, illumination, scale and low-resolution conditions from the WIDER FACE Dataset [1].

CyLab Biometrics Center and the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. Emails: {chenchez, yutongzh, kluu}@andrew.cmu.edu, [email protected]. * indicates equal contribution.
where they combined the problems of face detection, pose estimation, and facial landmarking into one framework. By utilizing all three aspects in one framework, they were able to outperform the state-of-the-art at the time on real-world images. Yu et al. [23] extended this work by incorporating group sparsity in learning which landmarks are the most salient for face detection, as well as incorporating 3D models of the landmarks in order to deal with pose. Chen et al. [10] combined ideas from both of these approaches by utilizing a cascade detection framework while simultaneously localizing features on the face for alignment of the detectors. Similarly, Ghiasi and Fowlkes [12] were able to use hierarchical DPMs not only to achieve good face detection in the presence of occlusion but also landmark localization. However, Mathias et al. [9] showed that both DPM models and rigid template detectors similar to the Viola-Jones detector have a lot of potential that has not been adequately explored. By retraining these models with appropriately controlled training data, they were able to create face detectors that perform similarly to other, more complex state-of-the-art face detectors.

All of these approaches to face detection were based on selecting a feature extractor beforehand. However, there has been work on using a ConvNet to learn which features are used to detect faces. Neural networks have been around for a long time but have been experiencing a resurgence in popularity due to hardware improvements and new techniques, resulting in the capability to train these networks on large amounts of training data. Li et al. [14] utilized a cascade of CNNs to perform face detection. The cascading networks allowed them to process different scales of faces at different levels of the cascade, while also allowing false positives from previous networks to be removed at later stages, in a similar approach to other cascade detectors. Yang et al. [16] approached the problem from a different perspective, more similar to a DPM approach. In their method, the face is broken into several facial parts such as hair, eyes, nose, mouth, and beard. By training a detector on each part and combining the score maps intelligently, they were able to achieve accurate face detection even under occlusion. Both of these methods require training several networks to achieve their high accuracy. Our method, on the other hand, can be trained as a single network, end-to-end, requiring less annotation of training data while maintaining highly accurate face detection.

The idea of using contextual information in object detection has been studied in several recent works with very high detection accuracy. Divvala et al. [24] reviewed the role of context in contemporary, challenging object detection in their empirical evaluation analysis. They concluded that context information not only reduces the overall detection errors, but also makes the remaining errors made by the detector more reasonable. Bell et al. [25] introduced an advanced object detector named the Inside-Outside Network (ION) to exploit information both inside and outside the region of interest. In their approach, the contextual information outside the region of interest is incorporated using spatial recurrent neural networks. Inside the network, skip pooling is used to extract information at multiple scales and levels of abstraction. Recently, Zagoruyko et al. [26] presented the MultiPath network with three modifications to the standard Fast R-CNN object detector, i.e. skip connections that give the detector access to features at multiple network layers, a foveal structure to exploit object context at multiple object resolutions, and an integral loss function and corresponding network adjustment that improve localization. The information in their proposed network can flow along multiple paths. Their MultiPath network is combined with DeepMask object proposals to solve the object detection problem.

Unlike all the previous approaches, which select a feature extractor beforehand and incorporate a linear classifier with the depth descriptor beside the RGB channels, our method solves the problem under a deep learning framework in which the global and the local context features, i.e. multi-scaling, are synchronized with the Faster Region-based Convolutional Neural Network in order to robustly achieve semantic detection.

3 BACKGROUND
Recent studies in deep ConvNets have achieved significant results in object detection, classification and modeling [27]. In this section, we review various well-known deep ConvNets. Then, we show the current limitations of the Faster R-CNN, one of the state-of-the-art deep ConvNet methods in object detection, in the defined context of face detection.

3.1 Region-based Convolutional Neural Networks
One of the most important approaches for the object detection task is the family of Region-based Convolutional Neural Networks (R-CNN).

R-CNN [28], the first generation of this family, applies a high-capacity deep ConvNet to classify given bottom-up region proposals. Due to the lack of labeled training data, it adopts a strategy of supervised pre-training for an auxiliary task followed by domain-specific fine-tuning. The ConvNet is then used as a feature extractor, and the system is further trained for object detection with Support Vector Machines (SVM). Finally, it performs bounding-box regression. The method achieves high accuracy but is very time-consuming. The system takes a long time to generate region proposals, extract features from each image, and store these features on a hard disk, which also takes up a large amount of space. At test time, the detection process takes 47s per image using the VGG-16 network [21] implemented on a GPU, due to the slowness of the feature extraction. In other words, R-CNN is slow because it processes each object proposal independently without sharing computation.

Fast R-CNN [29] solves this problem by sharing the features between proposals. The network is designed to compute the feature map only once per image in a fully convolutional style, and to use RoI-pooling to dynamically
sample features from the feature map for each object proposal. The network also adopts a multi-task loss, i.e. a classification loss and a bounding-box regression loss. Based on these two improvements, the framework is trained end-to-end. The processing time for each image is significantly reduced, to 0.3s. Fast R-CNN accelerates the detection network using the RoI-pooling layer. However, the region proposal step is designed outside of the network and hence still remains a bottleneck, resulting in a sub-optimal solution and a dependence on external region proposal methods.
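To make the shared-feature computation concrete, the following is a minimal RoI-pooling sketch in Python/PyTorch. It illustrates the idea behind Fast R-CNN's RoI-pooling rather than reproducing its exact implementation; the helper name roi_pool, the 1/16 spatial scale (matching VGG-16 'conv5') and the 7 × 7 output grid are assumptions made for the example.

    import torch
    import torch.nn.functional as F

    def roi_pool(feature_map, roi, spatial_scale=1.0 / 16, output_size=7):
        # Project an image-space box (x1, y1, x2, y2) onto the shared feature
        # map; VGG-16 'conv5' corresponds to spatial_scale = 1/16.
        x1, y1, x2, y2 = [int(round(c * spatial_scale)) for c in roi]
        x2 = max(x2, x1 + 1)   # keep at least one feature cell per dimension
        y2 = max(y2, y1 + 1)
        region = feature_map[:, :, y1:y2, x1:x2]           # (N, C, h, w) crop
        # Max-pool the crop to a fixed grid so every RoI yields the same shape.
        return F.adaptive_max_pool2d(region, output_size)  # (N, C, 7, 7)

    # Example: a 512-channel 'conv5' map and one 64 x 64 pixel proposal.
    fmap = torch.randn(1, 512, 14, 14)
    print(roi_pool(fmap, (32, 32, 96, 96)).shape)          # torch.Size([1, 512, 7, 7])

Because the expensive convolutions run once per image, pooling hundreds of proposals from the same map is cheap, which is where the per-image speed-up comes from.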
Faster R-CNN [30] addresses the remaining problem in Fast R-CNN by introducing the Region Proposal Network (RPN). An RPN is implemented in a fully convolutional style to predict the object bounding boxes and the objectness scores. In addition, the anchors are defined with different scales and ratios to achieve translation invariance. The RPN shares the full-image convolution features with the detection network. Therefore, the whole system is able to complete both proposal generation and detection computation within 0.2s using the very deep VGG-16 model [21]. With the smaller ZF model [31], it can reach the level of real-time processing.
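The anchor mechanism can be illustrated with a short sketch that enumerates the reference boxes for one sliding position of the feature map. The defaults below (base size 16, ratios 0.5/1/2 defined as height/width, scales 8/16/32) are the commonly used generic Faster R-CNN settings, stated here as assumptions rather than the exact face-specific configuration.

    import numpy as np

    def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
        # Enumerate reference boxes of several aspect ratios and scales,
        # all centered on one sliding position of the stride-16 feature map.
        cx = cy = (base_size - 1) / 2.0
        anchors = []
        for r in ratios:
            w0 = base_size / np.sqrt(r)   # keep the area fixed while the
            h0 = base_size * np.sqrt(r)   # aspect ratio h/w changes
            for s in scales:
                w, h = w0 * s, h0 * s
                anchors.append([cx - (w - 1) / 2.0, cy - (h - 1) / 2.0,
                                cx + (w - 1) / 2.0, cy + (h - 1) / 2.0])
        return np.asarray(anchors)        # (9, 4) boxes as (x1, y1, x2, y2)

    print(generate_anchors().shape)       # (9, 4): 3 ratios x 3 scales

The same set of boxes is replicated at every feature-map position, which is what gives the RPN its translation invariance.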
3.2 Limitations of Faster R-CNN
The Region-based CNN family, e.g. Faster R-CNN and its variants [29], achieves state-of-the-art performance in object detection on the PASCAL VOC dataset. These methods can detect objects such as vehicles, animals, people, chairs, etc. with very high accuracy. In general, the defined objects often occupy the majority of a given image. However, when these methods are tested on the challenging Microsoft COCO dataset [32], the performance drops considerably, since images contain more small, occluded and incomplete objects. A similar situation arises in the problem of face detection. We focus on detecting facial regions that are sometimes small, heavily occluded and of low resolution (as shown in Figure 1).

The detection network designed in Faster R-CNN is unable to robustly detect such tiny faces. The intuition is that the Regions of Interest pooling layer, i.e. the RoI-pooling layer, builds features only from the last single high-level feature map. For example, the global stride of the 'conv5' layer in the VGG-16 model is 16. Therefore, given a facial region smaller than 16 × 16 pixels in an image, the projected RoI-pooling region for that location will be less than 1 pixel in the 'conv5' layer, even if the proposed region is correct. Thus, the detector will have much difficulty predicting the object class and the bounding-box location based on information from only one pixel.

3.3 Other Face Detection Method Limitations
Other challenges in object detection in the wild include occlusion and low resolution. For face detection, it is very common for people to wear items like sunglasses, scarves and hats, which occlude the face. In such cases, methods that only extract features from faces do not work well. For example, Faceness [16] finds faces by scoring facial-part responses according to their spatial structure and arrangement, which works well on clear faces. But when facial parts are missing due to occlusion, or when the face itself is too small, facial parts become much harder to detect. This is where body context information plays its role. As an example of context-dependent objects, faces often come together with a human body. Even when a face is occluded, we can still locate it just by seeing the whole human body. A similar advantage holds for faces at low resolution, i.e. tiny faces: the deep features cannot tell much about tiny faces, since their receptive field is too small to be informative. Introducing context information can extend the area from which features are extracted and make them meaningful. On the other hand, context information also helps reduce false detections, as discussed previously, since it tells the difference between real faces with bodies and face-like patterns without bodies.

4 CONTEXTUAL MULTI-SCALE R-CNN
Our goal is to detect human faces captured under various challenging conditions such as strong illumination, heavy occlusion, extreme off-angles, and low resolution. Under these conditions, the current CNN-based detection systems suffer from two major problems, i.e. 1) tiny faces are hard to identify; and 2) only the face region is taken into consideration for classification. In this section, we show why these problems hinder the ability of a face detection system. Then, our proposed network is presented to address these problems by using the Multi-Scale Region Proposal Network (MS-RPN) and the Contextual Multi-Scale Convolutional Neural Network (CMS-CNN), as illustrated in Figure 2. Similar to Faster R-CNN, the MS-RPN outputs several region candidates and the CMS-CNN computes the confidence score and bounding box for each candidate.

4.1 Identifying Tiny Faces
Why are tiny faces hard to detect robustly by the previous region-based CNNs? The reason is that in these networks both the proposed region and the classification score are produced from one single high-level convolution feature map. This representation does not have enough information for the multiple tasks, i.e. region proposal and RoI detection. For example, Faster R-CNN generates region candidates and does RoI-pooling from the 'conv5' layer of the VGG-16 model, which has an overall stride of 16. One issue is that the receptive field in this layer is quite large. When the face size is less than 16 × 16 pixels, the corresponding output in the 'conv5' layer is less than 1 pixel, which is insufficient to encode informative features. The other issue is that as the convolution layers go deeper, each pixel in the feature map gathers more and more information outside the original input region, so that it contains a lower proportion of information for the region of interest. These two issues together make the last convolution layer less representative for tiny faces.
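A quick computation makes the first issue concrete. The helper below simply divides the box size by the global stride, following the stride-16 'conv5' example above.

    def projected_cells(box_px, stride=16):
        # Side length, in feature-map cells, of a box projected onto a layer
        # with the given global stride (16 for 'conv5' of VGG-16).
        return box_px / stride

    for face in (12, 16, 64):
        print(face, "px ->", projected_cells(face), "cells per side")
    # 12 px -> 0.75 cells: below one cell, too coarse to encode the face
    # 64 px -> 4.0 cells: a usable 4 x 4 footprint for RoI-pooling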
4.1.1 Multiple Scale Faster-RCNN
Our solution for this problem is a combination of both global and local features, i.e. multiple scales. In this architecture, the feature maps from lower-level convolution layers are incorporated with the last convolution layer for both the MS-RPN and the CMS-CNN. Features from the lower convolution layers help recover more information for tiny faces, because the stride in a lower convolution layer is not too large. Another benefit is that low-level features with localization capability and high-level features with semantic information are fused together [33], since face detection needs to localize the face as well as to identify it. In the MS-RPN, the whole lower-level feature maps are down-sampled to the size of the high-level feature map and then concatenated with it to form a unified feature map. We then reduce the dimension of the unified feature map and use it to generate region candidates. In the CMS-CNN, each region proposal is projected into the feature maps from multiple convolution layers, and RoI-pooling is performed in each layer, resulting in a fixed-size feature tensor. All feature tensors are normalized, concatenated and dimension-reduced to a single feature blob, which is forwarded to two fully connected layers to compute a representation of the region candidate.

4.1.2 L2 Normalization
In both the MS-RPN and the CMS-CNN, the concatenation of feature maps is done with an L2 normalization layer [34], shown in Fig. 2, since the feature maps from different layers generally have different properties in terms of number of channels, scale of values and norm of feature-map pixels. Generally, compared with the values in shallower layers, the values in deeper layers are usually too small, which leads to the dominance of the shallower layers. In practice, it is impossible to hand-tune the values from each layer for the best performance. Therefore, L2 normalization layers before concatenation are crucial for the robustness of the system, because they keep the values from each layer at roughly the same scale.

The normalization is performed within each pixel, and each feature map is treated independently:

$$\hat{x} = \frac{x}{\|x\|_2}, \qquad \|x\|_2 = \Big( \sum_{i=1}^{d} |x_i|^2 \Big)^{1/2}$$

where $x$ and $\hat{x}$ stand for the original pixel vector and the normalized pixel vector, respectively, and $d$ stands for the number of channels in each feature map tensor.

During training, the scaling factors $\gamma_i$ are updated to readjust the scale of the normalized features. For each channel $i$,

$$y_i = \gamma_i \hat{x}_i$$

where $y_i$ stands for the re-scaled feature value. Following back-propagation and the chain rule, the updates for the scaling factor $\gamma$ are

$$\frac{\partial l}{\partial \hat{x}} = \frac{\partial l}{\partial y} \cdot \gamma, \qquad \frac{\partial l}{\partial x} = \frac{\partial l}{\partial \hat{x}} \Big( \frac{I}{\|x\|_2} - \frac{x x^T}{\|x\|_2^3} \Big), \qquad \frac{\partial l}{\partial \gamma_i} = \sum_{y_i} \frac{\partial l}{\partial y_i} \hat{x}_i$$

where $y = [y_1, y_2, \ldots, y_d]^T$.
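A compact sketch of such a normalization layer is given below, written as a PyTorch-style module for illustration (the actual system implements it as a new Caffe layer, as described in Section 4.1.3). The gradient updates above are reproduced automatically by automatic differentiation.

    import torch
    import torch.nn as nn

    class L2NormScale(nn.Module):
        # Channel-wise L2 normalization with a learnable per-channel scale
        # gamma, in the style of [34]; a sketch, not the authors' Caffe code.
        def __init__(self, num_channels, init_scale):
            super().__init__()
            self.gamma = nn.Parameter(torch.full((num_channels,), float(init_scale)))

        def forward(self, x):                      # x: (N, C, H, W) feature map
            # x_hat = x / ||x||_2, taken per spatial position across channels
            norm = x.pow(2).sum(dim=1, keepdim=True).sqrt().clamp(min=1e-12)
            x_hat = x / norm
            # y_i = gamma_i * x_hat_i, matching the scaling defined above
            return self.gamma.view(1, -1, 1, 1) * x_hat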
4.1.3 New Layer in the Deep Learning Caffe Framework
The system integrates information from lower-layer feature maps, i.e. the third and fourth convolution layers, to extract discriminative features for tiny faces. For both parts of our system, i.e. the MS-RPN and the CMS-CNN, L2 normalization layers are inserted before the concatenation of the feature maps from the three layers. The features are re-scaled to proper values and concatenated into a single feature map. We set the initial scaling factors in a special way, following two rules. First, the average scale of each feature map should be roughly identical; second, after the following 1 × 1 convolution, the resulting tensor should have the same average scale as the conv5 layer in the original Faster R-CNN. As implied, after the following 1 × 1 convolution, the tensor should be the same as in the original Faster R-CNN architecture in terms of its size, scale of values and function for the downstream process.

4.2 Integrating Body Context
When humans search for faces, they look not only for the facial patterns, e.g. eyes, nose and mouth, but also for the human body. Sometimes a human body makes us more convinced about the existence of a face. In addition, the human body sometimes helps to reject false positives: if we only look at face regions, we may make mistakes in identifying them. For example, Figure 3 shows two cases where the body region plays a significant role in correct detection. This intuition is not only true for humans but also valid in computer vision. Previous research has shown that contextual reasoning is a critical piece of the object recognition puzzle, and that context not only reduces the overall detection errors but, more importantly, makes the remaining errors made by the detector more reasonable [24]. Based on this intuition, our network is designed to make explicit reference to the human body context information in the RoI detection; a rough sketch of deriving such a body region from a face proposal is given below.
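The sketch derives a body-context box from a face proposal by enlarging the face box sideways and downward. The enlargement ratios are illustrative assumptions only, not necessarily the exact face-to-body projection used in the network.

    def body_context_box(face_box, img_w, img_h, w_ratio=2.0, h_ratio=3.0):
        # face_box and the returned box are (x1, y1, x2, y2) in pixels;
        # w_ratio/h_ratio are assumed enlargement factors, clipped to the image.
        x1, y1, x2, y2 = face_box
        w, h = x2 - x1, y2 - y1
        cx = (x1 + x2) / 2.0
        bx1 = max(0.0, cx - w_ratio * w / 2.0)
        bx2 = min(float(img_w), cx + w_ratio * w / 2.0)
        by1 = y1                          # body region starts at the face top
        by2 = min(float(img_h), y1 + h_ratio * h)
        return (bx1, by1, bx2, by2)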
In our proposed network, the contextual body reasoning is implemented by explicitly grouping body information from the convolution feature maps, shown as the red blocks in Figure 2. Specifically, additional RoI-pooling operations are performed for each region proposal in the convolution feature maps to represent the body context features. Then, the same as the face feature tensors, these body feature tensors are normalized, concatenated and dimension-reduced to a single feature blob. After two fully connected layers, a representation of the body context is obtained.
The MS-RPN and the CMS-CNN use the same architecture as the deep VGG-16 model, and during training their parameters are initialized from the pre-trained VGG-16. For simplicity, we refer to the last convolution layers in sets 3, 4 and 5 as 'conv3', 'conv4' and 'conv5', respectively. All the following layers are connected exclusively to these three layers. In the MS-RPN, we want 'conv3', 'conv4' and 'conv5' to be synchronized to the same size so that concatenation can be applied, so 'conv3' is followed by a pooling layer to perform down-sampling. Then 'conv3', 'conv4' and 'conv5' are normalized along the channel axis and re-weighted by a learnable scale before being concatenated together. To ensure training convergence, the initial re-weighting scales need to be carefully set. Here we set the initial scales of 'conv3', 'conv4' and 'conv5' to 66.84, 94.52 and 94.52, respectively. In the CMS-CNN, the RoI-pooling layer already ensures that the pooled feature maps have the same size. Again we normalize the pooled features to make sure the downstream values are at reasonable scales when training is initialized. Specifically, features pooled from 'conv3', 'conv4' and 'conv5' are initialized with scales of 57.75, 81.67 and 81.67, respectively, for both the face and body pipelines. The MS-RPN and the CMS-CNN share the same parameters for all convolution layers so that the computation can be done once, resulting in higher efficiency. Additionally, in order to shrink the channel size of the concatenated feature map, a 1 × 1 convolution layer is then employed. Therefore, the channel size of the final feature map is the same as that of the original fifth convolution layer in Faster R-CNN.
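Putting the pieces together, a sketch of this fusion is shown below, reusing the L2NormScale module sketched in Section 4.1.2. The VGG-16 channel counts (256/512/512 for 'conv3'/'conv4'/'conv5') and the max-pooling used to synchronize 'conv4' are assumptions; the text above states explicitly only the pooling of 'conv3' and the initial scales.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleFusion(nn.Module):
        # Synchronize 'conv3'/'conv4' to the 'conv5' resolution, L2-normalize
        # each map with the initial scales quoted above, concatenate along the
        # channel axis, then shrink back to 512 channels with a 1x1 convolution
        # so the result matches 'conv5' of the original Faster R-CNN.
        def __init__(self):
            super().__init__()
            self.norm3 = L2NormScale(256, 66.84)   # VGG-16 conv3: 256 channels
            self.norm4 = L2NormScale(512, 94.52)   # conv4: 512 channels
            self.norm5 = L2NormScale(512, 94.52)   # conv5: 512 channels
            self.reduce = nn.Conv2d(256 + 512 + 512, 512, kernel_size=1)

        def forward(self, c3, c4, c5):
            c3 = F.max_pool2d(c3, kernel_size=4, stride=4)   # stride 4 -> 16
            c4 = F.max_pool2d(c4, kernel_size=2, stride=2)   # stride 8 -> 16
            fused = torch.cat([self.norm3(c3), self.norm4(c4), self.norm5(c5)], dim=1)
            return self.reduce(fused)                        # (N, 512, H/16, W/16)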
5 EXPERIMENTS
This section presents the face detection benchmarking of our proposed CMS-RCNN approach on the WIDER FACE dataset [1] and the Face Detection Data Set and Benchmark (FDDB) [20] database. The WIDER FACE dataset contains a high degree of variability. Using this database, our proposed approach robustly outperforms strong baseline methods, including Two-stage CNN [1], Multi-scale Cascade CNN [1], Faceness [16] and Aggregate Channel Features (ACF) [11], by a large margin. We also show that our model trained on the WIDER FACE dataset generalizes well to the FDDB database. The trained model consistently achieves competitive results against the recent state-of-the-art face detection methods on this database, including HyperFace [19], DP2MFD [17], CCF [18], Faceness [16], NPDFace [13], MultiresHPM [12], DDFD [15], CascadeCNN [14], ACF-multiscale [11], Pico [7], HeadHunter [9], Joint Cascade [10], Boosted Exemplar [8], and PEP-Adapt [6].

5.1 Experiments on WIDER FACE Dataset
Data description
WIDER FACE is a public face detection benchmark dataset. It contains 393,703 labeled human faces in 32,203 images collected from the internet based on 61 event classes. The database has many human faces with a high degree of pose variation, large occlusions, low resolutions and strong lighting conditions. The images in this database are organized and split into three subsets, i.e. training, validation and testing, containing 40%, 10% and 50% of the original database, respectively. The images and the ground-truth labels of the training and validation sets are available online for experiments. However, for the testing set, only the testing images (not the ground-truth labels) are available online. All detection results are sent to the database server for evaluation to receive the Precision-Recall curves.

In our experiments, the proposed CMS-RCNN is trained on the training set of the WIDER FACE dataset, containing 159,424 annotated faces collected in 12,880 images. The model trained on this database is used in the testing on all databases.

Testing and Comparison
During the testing phase, the face images in the testing set are divided into three parts based on their detection rates with EdgeBox [36]. In other words, face images are divided into three levels according to the difficulty of detection, i.e. Easy, Medium and Hard [1]. The proposed CMS-RCNN model is compared against recent strong face detection methods, i.e. Two-stage CNN [1], Multiscale Cascade CNN [1], Faceness [16], and Aggregate Channel Features (ACF) [11]. All these methods are trained on the same training set and tested on the same testing set.

The Precision-Recall curves and AP values are shown in Figure 5. Our method outperforms those strong baselines by a large margin. It achieves the best average precision at all difficulty levels, i.e. AP = 0.902 (Easy), 0.874 (Medium) and 0.643 (Hard), and outperforms the second-best baseline by 26.0% (Easy), 37.4% (Medium) and 60.8% (Hard). These results suggest that as the difficulty level goes up, CMS-RCNN detects challenging faces better, so it has the ability to handle difficult conditions and is closer to the human detection level. Figure 8 shows some examples of face detection results using the proposed CMS-RCNN on this database.

With Context vs. Without Context
As shown in Section 4.2, human vision can benefit from additional context information for better detection and recognition; in this section we show how explicit contextual reasoning in the network helps improve the model performance.

To prove this, we test our models with and without body context information on the validation set of the WIDER FACE dataset. The model without body context is implemented by removing the context pipeline and only using the representation from the face pipeline to compute the confidence score and the bounding-box regression. We compare their performance as illustrated in Figure 6. The Faster R-CNN method is set up as a baseline.
[Figure 5 panels: Precision vs. Recall on (a) Easy, (b) Medium, (c) Hard. Legend AP values — Easy: CMS-RCNN 0.902, Faceness-WIDER 0.716, Multiscale Cascade CNN 0.711, ACF-WIDER 0.695, Two-stage CNN 0.657; Medium: CMS-RCNN 0.874, Multiscale Cascade CNN 0.636, Faceness-WIDER 0.604, Two-stage CNN 0.589, ACF-WIDER 0.588; Hard: CMS-RCNN 0.643, Multiscale Cascade CNN 0.400, Faceness-WIDER 0.315, Two-stage CNN 0.304, ACF-WIDER 0.290.]
Fig. 5. Precision-Recall curves obtained by our proposed CMS-RCNN (red) and the other baselines, i.e. Two-stage CNN [1], Multi-scale Cascade CNN [1], Faceness [16], and Aggregate Channel Features (ACF) [11]. All methods are trained and tested on the same training and testing sets of the WIDER FACE dataset. (a): Easy level, (b): Medium level and (c): Hard level. Our method achieves the state-of-the-art results with the highest AP values of 0.902 (Easy), 0.874 (Medium) and 0.643 (Hard) among the methods on this database. It also outperforms the second-best baseline by 26.0% (Easy), 37.4% (Medium) and 60.8% (Hard).
Starting from 0 in recall, the two curves of our models overlap at first, which means that the two models perform equally well on some easy faces. Then the curve of the model without context starts to drop more quickly than that of the model with context, suggesting that the model with context can handle the challenging conditions better as faces become more and more difficult. Thus, the model with context eventually achieves a higher recall value. Additionally, the context model produces a longer PR curve, which means that contextual reasoning can help find more faces.

We also examine the false positives produced by our CMS-RCNN model. We are curious about what kind of object can fool our model into treating it as a face: is it due to over-fitting, data bias, or mislabeling? In order to visualize the false positives, we test the CMS-RCNN model on the WIDER FACE validation set and pick all the false positives according to the ground truth. Those positives are then sorted by confidence score in descending order. We choose the top 20 false positives, as illustrated in Figure 9. Because their confidence scores are high, they are the objects most likely to cause our model to make mistakes. It turns out that most of the false positives are actually human faces that were missed in the labeling, which is a problem of the dataset itself. For the other false positives, we find that the errors made by our model are rather reasonable: they all have the pattern of a human face as well as the shape of a human body.
Fig. 7. ROC curves of our proposed CMS-RCNN and the other published methods on the FDDB database [20]. Our method achieves the best recall rate on this database. Numbers in the legend show the average precision scores.

Fig. 8. Some examples of face detection results using our proposed CMS-RCNN method on the WIDER FACE database [1].

Fig. 9. Examples of the top 20 false positives from our CMS-RCNN model tested on the WIDER FACE validation set. In fact, these false positives include many human faces missing from the dataset annotations due to mislabeling, which means that our method is robust to this noise in the data.
[12] G. Ghiasi and C. C. Fowlkes, “Occlusion coherence: Detecting and localizing occluded faces,” arXiv preprint arXiv:1506.08347, 2015.
[13] S. Liao, A. Jain, and S. Li, “A fast and accurate unconstrained face detector,” 2014.
[14] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334.
[15] S. S. Farfade, M. J. Saberian, and L.-J. Li, “Multi-view face detection using deep convolutional neural networks,” in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015, pp. 643–650.
[16] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684.
[17] R. Ranjan, V. M. Patel, and R. Chellappa, “A deep pyramid deformable part model for face detection,” in Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on. IEEE, 2015, pp. 1–8.
[18] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 82–90.
[19] R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition,” arXiv preprint arXiv:1603.01249, 2016.
[20] V. Jain and E. Learned-Miller, “FDDB: A benchmark for face detection in unconstrained settings,” University of Massachusetts, Amherst, Tech. Rep. UM-CS-2010-009, 2010.
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[22] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. on PAMI, vol. 32, no. 9, pp. 1627–1645, Sept 2010.
[23] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas, “Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1944–1951.
[24] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert, “An empirical study of context in object detection,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1271–1278.
[25] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” arXiv preprint arXiv:1512.04143, 2015.
[26] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár, “A multipath network for object detection,” arXiv preprint arXiv:1604.02135, 2016.
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
Fig. 10. Some examples of face detection results using our proposed CMS-RCNN method on the FDDB database [20].

Fig. 11. More results of unconstrained face detection under challenging conditions using our proposed CMS-RCNN.