CMS-RCNN: Contextual Multi-Scale Region-Based CNN For Unconstrained Face Detection

Chenchen Zhu*, Student, IEEE, Yutong Zheng*, Student, IEEE,
Khoa Luu, Member, IEEE, Marios Savvides, Senior Member, IEEE

arXiv:1606.05413v1 [cs.CV] 17 Jun 2016

Abstract—Robust face detection in the wild is one of the essential components supporting various face-related problems, e.g. unconstrained face recognition, facial periocular recognition, facial landmarking and pose estimation, facial expression recognition, 3D facial model construction, etc. Although the face detection problem has been intensely studied for decades with various commercial applications, it still meets problems in some real-world scenarios due to numerous challenges, e.g. heavy facial occlusions, extremely low resolutions, strong illumination, exceptional pose variations, image or video compression artifacts, etc. In this paper, we present a face detection approach named Contextual Multi-Scale Region-based Convolution Neural Network (CMS-RCNN) to robustly solve the problems mentioned above. Similar to the region-based CNNs, our proposed network consists of the region proposal component and the region-of-interest (RoI) detection component. However, unlike those networks, our proposed network makes two main contributions that play a significant role in achieving state-of-the-art performance in face detection. Firstly, the multi-scale information is grouped both in region proposal and RoI detection to deal with tiny face regions. Secondly, our proposed network allows explicit body contextual reasoning in the network, inspired by the intuition of the human vision system. The proposed approach is benchmarked on two recent challenging face detection databases, i.e. the WIDER FACE Dataset, which contains a high degree of variability, as well as the Face Detection Dataset and Benchmark (FDDB). The experimental results show that our proposed approach trained on the WIDER FACE Dataset outperforms strong baselines on the WIDER FACE Dataset by a large margin, and consistently achieves competitive results on FDDB against the recent state-of-the-art face detection methods.

Index Terms—Robust Face Detection, Multi-Scale Information, Contextual Reasoning, Convolutional Neural Network, Region-based CNN

1 INTRODUCTION
Detection and analysis of human subjects using facial-feature-based biometrics for access control, surveillance systems and other security applications have gained popularity over the past few years. Several such biometric systems are deployed at security checkpoints across the globe, with more being deployed every day. In particular, face recognition has been one of the most popular biometric modalities attractive to security departments. Indeed, the uniqueness of facial features across individuals can be captured much more easily than other biometrics. Before a face recognition algorithm can be applied, however, face detection usually needs to be done first.
The problem of face detection has been intensely studied for decades with the aim of ensuring the generalization of robust algorithms to unseen face images [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. Although the detection accuracy in recent face detection algorithms [14], [15], [16], [17], [18], [19] has been highly improved due to the advancement of deep Convolutional Neural Networks (CNN), they are still far from achieving the same detection capabilities as a human due to a number of challenges in practice. For example, off-angle faces, large occlusions, low resolutions and strong lighting conditions, as shown in Figure 1, are always important factors that need to be considered.

Fig. 1. An example of face detection results using our proposed CMS-RCNN method. The proposed method can robustly detect faces across occlusion, facial expression, pose, illumination, scale and low resolution conditions from WIDER FACE Dataset [1].

CyLab Biometrics Center and the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA. Emails: {chenchez, yutongzh, kluu}@andrew.cmu.edu, [email protected]
* indicates equal contribution.

This paper presents an advanced CNN-based approach named Contextual Multi-Scale Region-based CNN (CMS-RCNN) to handle the problem of face detection in digital face images collected under numerous challenging conditions, e.g. heavy facial occlusion, illumination, extreme off-angle, low resolution, scale difference, etc. Our designed region-based CNN architecture allows the network to simultaneously look at multi-scale features, as well as to explicitly look outside facial regions at the potential body regions. In other words, this process tries to mimic the way humans detect faces: when we are not sure about a face, seeing the body increases our confidence. Additionally, this architecture also helps to synchronize both the global semantic features in high-level layers and the localization features in low-level layers for facial representation. Therefore, it is able to robustly deal with the challenges in the problem of unconstrained face detection.
Our CMS-RCNN method introduces the Multi-Scale Region Proposal Network (MS-RPN) to generate a set of region candidates and the Contextual Multi-Scale Convolution Neural Network (CMS-CNN) to do inference on the region candidates of facial regions. A confidence score and bounding box regression are computed for every candidate. In the end, the face detection system is able to decide the quality of the detection results by thresholding these generated confidence scores in given face images. The architecture of our proposed CMS-RCNN network for unconstrained face detection is illustrated in Figure 2.

Fig. 2. Our proposed Contextual Multi-Scale Region-based CNN model. It is based on the VGG-16 model [21], with 5 sets of convolution layers in the middle. The upper part is the Multi-Scale Region Proposal Network (MS-RPN) and the lower part is the Contextual Multi-Scale Convolution Neural Network (CMS-CNN). In the CMS-CNN, the face features labeled as blue blocks and the body context features labeled as red blocks are processed in parallel and combined in the end for final outputs, i.e. confidence score and bounding box.

Our approach is evaluated on two challenging face detection databases and compared against numerous recent face detection methods. Firstly, the proposed CMS-RCNN method is compared against four strong baselines [11], [16], [1] on the WIDER FACE Dataset [1], a large scale face detection benchmark database. This experiment shows its capability to detect face images in the wild, e.g. under occlusions, illumination, facial poses, low-resolution conditions, etc. Our method outperforms the baselines by a huge margin in all easy, medium, and hard partitions. It is also benchmarked on the Face Detection Data Set and Benchmark (FDDB) [20], a dataset of face regions designed for studying the problem of unconstrained face detection. The experimental results show that the proposed CMS-RCNN approach consistently achieves highly competitive results against the other state-of-the-art face detection methods.

The rest of this paper is organized as follows. In Section 2, we summarize prior work in face detection. Section 3 reviews a general deep learning framework, the background as well as the limitations of the Faster R-CNN in the problem of face detection. In Section 4, we introduce our proposed CMS-RCNN approach for the problem of unconstrained face detection. Section 5 presents the experimental face detection results and comparisons obtained using our proposed approach on two challenging face detection databases, i.e. the WIDER FACE Dataset and the FDDB database. Finally, our conclusions in this work are presented in Section 6.

2 RELATED WORK

Face detection has been a well studied area of computer vision. One of the first well performing approaches to the problem was the Viola-Jones face detector [2]. It was capable of performing real time face detection using a cascade of boosted simple Haar classifiers. The concepts of boosting and using simple features have been the basis for many different approaches [3] since the Viola-Jones face detector. These early detectors tended to work well on frontal face images but not very well on faces in different poses. As time has passed, many of these methods have been able to deal with off-angle face detection by utilizing multiple models for the various poses of the face. This increases the model size but does afford more practical uses of the methods. Some approaches have moved away from the idea of simple features but continued to use the boosted learning framework. Li and Zhang [5] used SURF cascades for general object detection but also showed good results on face detection.

More recent work on face detection has tended to focus on using different models such as a Deformable Parts Model (DPM) [4], [22]. Zhu and Ramanan's work was an interesting approach to the problem of face detection
where they combined the problems of face detection, pose estimation, and facial landmarking into one framework. By utilizing all three aspects in one framework, they were able to outperform the state-of-the-art at the time on real world images. Yu et al. [23] extended this work by incorporating group sparsity in learning which landmarks are the most salient for face detection as well as incorporating 3D models of the landmarks in order to deal with pose. Chen et al. [10] have combined ideas from both of these approaches by utilizing a cascade detection framework while simultaneously localizing features on the face for alignment of the detectors. Similarly, Ghiasi and Fowlkes [12] have been able to use hierarchical DPMs not only to achieve good face detection in the presence of occlusion but also landmark localization. However, Mathias et al. [9] were able to show that both DPM models and rigid template detectors similar to the Viola-Jones detector have a lot of potential that has not been adequately explored. By retraining these models with appropriately controlled training data, they were able to create face detectors that perform similarly to other, more complex state-of-the-art face detectors.

All of these approaches to face detection were based on selecting a feature extractor beforehand. However, there has been work done in using a ConvNet to learn which features are used to detect faces. Neural Networks have been around for a long time but have been experiencing a resurgence in popularity due to hardware improvements and new techniques resulting in the capability to train these networks on large amounts of training data. Li et al. [14] utilized a cascade of CNNs to perform face detection. The cascading networks allowed them to process different scales of faces at different levels of the cascade while also allowing for false positives from previous networks to be removed at later layers in a similar approach to other cascade detectors. Yang et al. [16] approached the problem from a different perspective more similar to a DPM approach. In their method, the face is broken into several facial parts such as hair, eyes, nose, mouth, and beard. By training a detector on each part and combining the score maps intelligently, they were able to achieve accurate face detection even under occlusions. Both of these methods require training several networks in order to achieve their high accuracy. Our method, on the other hand, can be trained as a single network, end-to-end, allowing for less annotation of training data needed while maintaining highly accurate face detection.

The ideas of using contextual information in object detection have been studied in several recent works with very high detection accuracy. Divvala et al. [24] reviewed the role of context in a contemporary, challenging object detection task in their empirical evaluation analysis. In their conclusions, the context information not only reduces the overall detection errors, but also the remaining errors made by the detector are more reasonable. Bell et al. [25] introduced an advanced object detector method named Inside-Outside Network (ION) to exploit information both inside and outside the region of interest. In their approach, the contextual information outside the region of interest is incorporated using spatial recurrent neural networks. Inside the network, skip pooling is used to extract information at multiple scales and levels of abstraction. Recently, Zagoruyko et al. [26] have presented the MultiPath network with three modifications to the standard Fast R-CNN object detector, i.e. skip connections that give the detector access to features at multiple network layers, a foveal structure to exploit object context at multiple object resolutions, and an integral loss function and corresponding network adjustment that improve localization. The information in their proposed network can flow along multiple paths. Their MultiPath network is combined with DeepMask object proposals to solve the object detection problem.

Unlike all the previous approaches that select a feature extractor beforehand and incorporate a linear classifier with the depth descriptor beside RGB channels, our method solves the problem under a deep learning framework where the global and the local context features, i.e. multi scaling, are synchronized to Faster Region-based Convolutional Neural Networks in order to robustly achieve semantic detection.

3 BACKGROUND

The recent studies in deep ConvNets have achieved significant results in object detection, classification and modeling [27]. In this section, we review various well-known Deep ConvNets. Then, we show the current limitations of the Faster R-CNN, one of the state-of-the-art deep ConvNet methods in object detection, in the defined context of face detection.

3.1 Region-based Convolution Neural Networks

One of the most important approaches for the object detection task is the family of Region-based Convolution Neural Networks (R-CNN).

R-CNN [28], the first generation of this family, applies the high-capacity deep ConvNet to classify given bottom-up region proposals. Due to the lack of labeled training data, it adopts a strategy of supervised pre-training for an auxiliary task followed by domain-specific fine-tuning. Then the ConvNet is used as a feature extractor and the system is further trained for object detection with Support Vector Machines (SVM). Finally, it performs bounding-box regression. The method achieves high accuracy but is very time-consuming. The system takes a long time to generate region proposals, extract features from each image, and store these features on a hard disk, which also takes up a large amount of space. At testing time, the detection process takes 47s per image using the VGG-16 network [21] implemented on a GPU, due to the slowness of feature extraction. In other words, R-CNN is slow because it processes each object proposal independently without sharing computation.

Fast R-CNN [29] solves this problem by sharing the features between proposals. The network is designed to only compute a feature map once per image in a fully convolutional style, and to use ROI-pooling to dynamically
sample features from the feature map for each object proposal. The network also adopts a multi-task loss, i.e. classification loss and bounding-box regression loss. Based on the two improvements, the framework is trained end-to-end. The processing time for each image is significantly reduced to 0.3s. Fast R-CNN accelerates the detection network using the ROI-pooling layer. However, the region proposal step is designed outside the network and hence still remains a bottleneck, which results in a sub-optimal solution and dependence on external region proposal methods.

Faster R-CNN [30] addresses the problem with Fast R-CNN by introducing the Region Proposal Network (RPN). An RPN is implemented in a fully convolutional style to predict the object bounding boxes and the objectness scores. In addition, the anchors are defined with different scales and ratios to achieve translation invariance. The RPN shares the full-image convolution features with the detection network. Therefore the whole system is able to complete both proposal generation and detection computation within 0.2s using the very deep VGG-16 model [21]. With a smaller ZF model [31], it can reach the level of real-time processing.

3.2 Limitations of Faster R-CNN

The Region-based CNN family, e.g. Faster R-CNN and its variants [29], achieves state-of-the-art performance in object detection on the PASCAL VOC dataset. These methods can detect objects such as vehicles, animals, people, chairs, etc. with very high accuracy. In general, the defined objects often occupy the majority of a given image. However, when these methods are tested on the challenging Microsoft COCO dataset [32], the performance drops considerably, since images contain more small, occluded and incomplete objects. Similar situations happen in the problem of face detection. We focus on detecting only facial regions that are sometimes small, heavily occluded and of low resolution (as shown in Figure 1).

The detection network as designed in Faster R-CNN is unable to robustly detect such tiny faces. The intuition is that the Regions of Interest pooling layer, i.e. ROI-pooling layer, builds features only from the last single high-level feature map. For example, the global stride of the 'conv5' layer in the VGG-16 model is 16. Therefore, given a facial region smaller than 16 × 16 pixels in an image, the projected ROI-pooling region for that location will be less than 1 pixel in the 'conv5' layer, even if the proposed region is correct. Thus, the detector will have much difficulty predicting the object class and the bounding box location based on information from only one pixel.

3.3 Other Face Detection Method Limitations

Other challenges in object detection in the wild include occlusion and low resolution. For face detection, it is very common for people to wear items such as sunglasses, scarves and hats, which occlude the face. In such cases, the methods that only extract features from faces do not work well. For example, Faceness [16] considers finding faces through scoring facial parts responses by their spatial structure and arrangement, which works well on clear faces. But when facial parts are missing due to occlusion or when the face itself is too small, facial parts become harder to detect. This is where the body context information plays its role. As an example of context-dependent objects, faces often come together with a human body. Even when a face is occluded, we can still locate it just by seeing the whole human body. Similar advantages hold for low-resolution, i.e. tiny, faces. The deep features cannot tell much about tiny faces since their receptive field is too small to be informative. Introducing context information can extend the area from which features are extracted and make them meaningful. On the other hand, the context information also helps reduce false detections, as discussed previously, since context information tells the difference between real faces with bodies and face-like patterns without bodies.

4 CONTEXTUAL MULTI-SCALE R-CNN

Our goal is to detect human faces captured under various challenging conditions such as strong illumination, heavy occlusion, extreme off-angles, and low resolution. Under these conditions, the current CNN-based detection systems suffer from two major problems, i.e. 1) tiny faces are hard to identify; 2) only the face region is taken into consideration for classification. In this section, we show why these problems hinder the ability of a face detection system. Then, our proposed network is presented to address these problems by using the Multi-Scale Region Proposal Network (MS-RPN) and the Contextual Multi-Scale Convolution Neural Network (CMS-CNN), as illustrated in Figure 2. Similar to Faster R-CNN, the MS-RPN outputs several region candidates and the CMS-CNN computes the confidence score and bounding box for each candidate.

4.1 Identifying Tiny Faces

Why are tiny faces hard to detect robustly with the previous region-based CNNs? The reason is that in these networks both the proposed region and the classification score are produced from one single high-level convolution feature map. This representation does not have enough information for the multiple tasks, i.e. region proposal and RoI detection. For example, Faster R-CNN generates region candidates and does RoI-pooling from the 'conv5' layer of the VGG-16 model, which has an overall stride of 16. One issue is that the receptive field in this layer is quite large. When the face size is less than 16-by-16 pixels, the corresponding output in the 'conv5' layer is less than 1 pixel, which is insufficient to encode informative features. The other issue is that as the convolution layers go deeper, each pixel in the feature map gathers more and more information from outside the original input region, so that it contains a lower proportion of information for the region of interest. These two issues together make the last convolution layer less representative for tiny faces.
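
The stride argument above can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function name `projected_size` is hypothetical, and the strides of 4 and 8 for 'conv3' and 'conv4' are the standard VGG-16 values, assumed here because the text states only the 'conv5' stride of 16.

```python
# Illustrative sketch: how large a face appears on each VGG-16 feature map.
# Only the 'conv5' stride (16) is stated in the text; the 'conv3' and
# 'conv4' strides (4 and 8) are the standard VGG-16 values, assumed here.
STRIDES = {"conv3": 4, "conv4": 8, "conv5": 16}


def projected_size(face_size_px, stride):
    """Side length, in feature-map pixels, of a square face region."""
    return face_size_px / stride


if __name__ == "__main__":
    for face in (12, 16, 64):
        sizes = {name: projected_size(face, s) for name, s in STRIDES.items()}
        print(face, sizes)
```

Under these assumptions a 12-pixel face covers less than one 'conv5' pixel but still spans a few pixels on the shallower maps, which is the kind of gap the tiny-face discussion points to.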
4.1.1 Multiple Scale Faster-RCNN

Our solution for this problem is a combination of both global and local features, i.e. multiple scales. In this architecture, the feature maps from lower-level convolution layers are incorporated with the last convolution layer for both MS-RPN and CMS-CNN. Features from lower convolution layers help get more information for the tiny faces, because the stride in a lower convolution layer is not too large. Another benefit is that both low-level features with localization capability and high-level features with semantic information are fused together [33], since face detection needs to localize the face as well as to identify the face. In the MS-RPN, the whole lower-level feature maps are down-sampled to the size of the high-level feature map and then concatenated with it to form a unified feature map. Then we reduce the dimension of the unified feature map and use it to generate region candidates. In the CMS-CNN, the region proposal is projected into feature maps from multiple convolution layers, and RoI-pooling is performed in each layer, resulting in a fixed-size feature tensor. All feature tensors are normalized, concatenated and dimension-reduced to a single feature blob, which is forwarded to two fully connected layers to compute a representation of the region candidate.

4.1.2 L2 Normalization

In both MS-RPN and CMS-CNN, concatenation of feature maps is done with an L2 normalization layer [34], shown in Fig. 2, since the feature maps from different layers generally have different properties in terms of numbers of channels, scale of values and norm of feature map pixels. Generally, compared with values in shallower layers, the values in deeper layers are usually too small, which leads to the dominance of shallower layers. In practice, it is impossible for the system to readjust and tune the values from each layer for best performance. Therefore, L2 normalization layers before concatenation are crucial for the robustness of the system because they keep the values from each layer at roughly the same scale.

The normalization is performed within each pixel, and each feature map is treated independently:

$$\hat{x} = \frac{x}{\|x\|_2}$$

$$\|x\|_2 = \left( \sum_{i=1}^{d} |x_i|^2 \right)^{\frac{1}{2}}$$

where $x$ and $\hat{x}$ stand for the original pixel vector and the normalized pixel vector respectively, and $d$ stands for the number of channels in each feature map tensor.

During training, scaling factors $\gamma_i$ will be updated to readjust the scale of the normalized features. For each channel $i$, the scaling factor follows:

$$y_i = \gamma_i \hat{x}_i$$

where $y_i$ stands for the re-scaled feature value.

Following back-propagation and the chain rule, the update for the scaling factor $\gamma$ is:

$$\frac{\partial l}{\partial \hat{x}} = \frac{\partial l}{\partial y} \cdot \gamma$$

$$\frac{\partial l}{\partial x} = \frac{\partial l}{\partial \hat{x}} \left( \frac{I}{\|x\|_2} - \frac{x x^T}{\|x\|_2^3} \right)$$

$$\frac{\partial l}{\partial \gamma_i} = \sum_{y_i} \frac{\partial l}{\partial y_i} \hat{x}_i$$

where $y = [y_1, y_2, ..., y_d]^T$.

4.1.3 New Layer in Deep Learning Caffe Framework

The system integrates information from lower-layer feature maps, i.e. the third and fourth convolution layers, to extract determinant features for tiny faces. For both parts of our system, i.e. MS-RPN and CMS-CNN, the L2 normalization layers are inserted before concatenation of feature maps from the three layers. The features are re-scaled to proper values and concatenated to a single feature map. We set the initial scaling factor in a special way, following two rules. First, the average scale for each feature map is roughly identical; second, after the following 1 × 1 convolution, the resulting tensor should have the same average scale as the conv5 layer in the work of Faster R-CNN. As implied, after the following 1 × 1 convolution, the tensor should be the same as in the original architecture of Faster R-CNN, in terms of its size, scale of values and function for the downstream process.

4.2 Integrating Body Context

When humans are searching for faces, they try to look for not only the facial patterns, e.g. eyes, nose, mouth, but also the human bodies. Sometimes a human body makes us more convinced about the existence of a face. In addition, sometimes a human body helps to reject false positives. If we only look at face regions, we may make mistakes identifying them. For example, Figure 3 shows two cases where the body region plays a significant role in correct detection. This intuition is not only true for humans but also valid in computer vision. Previous research has shown that contextual reasoning is a critical piece of the object recognition puzzle, and that context not only reduces the overall detection errors, but, more importantly, the remaining errors made by the detector are more reasonable [24]. Based on this intuition, our network is designed to make explicit reference to the human body context information in the RoI detection.

Fig. 3. Examples of body context helping face identification. The first two figures show that existence of a body can increase the confidence of finding a face. The last two figures show that what looks like a face turns out to be a mountain on the planet surface when we see more context information.

Fig. 4. The Vitruvian Man: spatial relation between the face (blue box) and the body (red box).

In our proposed network, the contextual body reasoning is implemented by explicitly grouping body information from convolution feature maps, shown as the red blocks in Figure 2. Specifically, additional RoI-pooling operations are performed for each region proposal in convolution feature maps to represent the body context features. Then, in the same way as the face feature tensors, these body feature tensors are normalized, concatenated and dimension-reduced to a single feature blob. After two fully connected layers the final body representation is concatenated with the face representation. They together contribute to the computation of confidence score and bounding box regression.
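
The "normalize, concatenate, dimension-reduce" step applied to the pooled feature tensors can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' Caffe layer: the function names, the 7 × 7 pooled size, the channel counts (256, 512, 512, the standard VGG-16 widths), the output width of 512 and the random 1 × 1 convolution weights are all illustrative. Only the per-pixel L2 normalization along the channel axis, the per-channel scaling factor γ, and the initial scales 57.75, 81.67 and 81.67 (reported in Section 4.4) come from the text.

```python
import numpy as np


def l2_normalize_scale(feat, gamma, eps=1e-12):
    """L2-normalize each pixel's channel vector, then re-scale per channel.

    feat:  array of shape (C, H, W)
    gamma: array of shape (C,), the learnable per-channel scaling factors
    """
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True))  # shape (1, H, W)
    x_hat = feat / (norm + eps)
    return gamma[:, None, None] * x_hat


def fuse(feats, init_scales, weight):
    """Normalize, concatenate along channels, then apply a 1x1 convolution.

    weight has shape (C_out, C_total); a 1x1 convolution is a per-pixel
    matrix multiply over the channel axis.
    """
    scaled = [
        l2_normalize_scale(f, np.full(f.shape[0], s))
        for f, s in zip(feats, init_scales)
    ]
    stacked = np.concatenate(scaled, axis=0)  # shape (C_total, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical RoI-pooled tensors from 'conv3', 'conv4' and 'conv5'.
    feats = [rng.standard_normal((c, 7, 7)) for c in (256, 512, 512)]
    weight = rng.standard_normal((512, 1280)) * 0.01  # illustrative weights
    blob = fuse(feats, (57.75, 81.67, 81.67), weight)
    print(blob.shape)
```

After `l2_normalize_scale`, every pixel vector has a norm equal to the chosen scale regardless of the magnitude of the layer it came from, so no single layer dominates the concatenated tensor at initialization.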
With the projected region proposal as the face region, the additional RoI-pooling region represents the body region and satisfies a pre-defined spatial relation with the face region. In order to model this spatial relation, we make a simple hypothesis that if there is a face, there must exist a body, and the spatial relation between each face and body is fixed. This assumption may not be true all the time but should cover most of the scenarios, since most people we see in the real world are either standing or sitting. Therefore, the spatial relation is roughly fixed between the face and the vertical body. Mathematically, this spatial relation can be represented by four parameters presented in Equation 1.

$$
\begin{aligned}
t_x &= (x_b - x_f)/w_f \\
t_y &= (y_b - y_f)/h_f \\
t_w &= \log(w_b/w_f) \\
t_h &= \log(h_b/h_f)
\end{aligned}
\tag{1}
$$

where $x_{(*)}$, $y_{(*)}$, $w_{(*)}$, and $h_{(*)}$ denote the two coordinates of the box center, width, and height respectively, and $b$ and $f$ stand for body and face respectively. $t_x$, $t_y$, $t_w$, and $t_h$ are the parameters. Throughout this paper, we fix the four parameters such that the two projected RoI regions of face and body satisfy a certain spatial ratio illustrated in the famous drawing in Figure 4.

4.3 Information Fusion

It is worth noticing that in our deep network architecture we have multiple face feature maps and body context feature maps for each proposed region. A critical issue is how we effectively fuse this information, i.e. what computation to apply and at which stage.

In our network, features extracted from different convolution layers need to be fused together to get a uniform representation. They cannot be naively concatenated due to the overall differences in the numbers of channels, scales of values and norms of feature map pixels among these layers. Detailed research shows that the deeper layers often contain smaller values than the shallower layers. Therefore, the larger values will dominate the smaller ones, making the system rely too much on shallower features rather than a combination of multiple scale features, causing the system to no longer be robust. We adopt the normalization layer from [34] to address this problem. The system takes the multiple scale features and applies L2 normalization along the channel axis of each feature map. Then, since the channel size is different among layers, the normalized feature map from each layer needs to be re-weighted, so that their values are at the same scale. After that, the feature maps are concatenated to one single feature map tensor. This modification helps to stabilize the system and increase the accuracy. Finally, the channel size of the concatenated feature map is shrunk to fit right in the original architecture for the downstream fully-connected layers.

Another crucial question is whether to fuse the face information and the body information at an early stage or at the very end of the network. Here we choose the late fusion strategy, in which face features and body context features are extracted in two parallel pipelines. At the very end of the network the two representations for face and body context are concatenated together to form a long feature vector. Then this feature vector is forwarded to compute confidence score and bounding box regression. The other strategy is early fusion, in which face feature maps and body context feature maps get concatenated right after RoI pooling and normalization. These two strategies both combine the information from face and body context, but we prefer late fusion. The reason is that we want the network to make decisions in a more semantic space. We care more about the existence of the face and the body. The localization information is already encoded in the predefined spatial relation mentioned in Section 4.2. Moreover, empirical experiments also show that the late fusion strategy works better.

4.4 Implementation Details

Our CMS-RCNN is implemented in the Caffe deep learning framework [35]. The first 5 sets of convolution layers have
the same architecture as the deep VGG-16 model, and during training their parameters are initialized from the pre-trained VGG-16. For simplicity we refer to the last convolution layers in sets 3, 4 and 5 as ‘conv3’, ‘conv4’, and ‘conv5’ respectively. All the following layers are connected exclusively to these three layers. In the MS-RPN, we want ‘conv3’, ‘conv4’, and ‘conv5’ to be synchronized to the same size so that concatenation can be applied. So ‘conv3’ is followed by a pooling layer to perform down-sampling. Then ‘conv3’, ‘conv4’, and ‘conv5’ are normalized along the channel axis, rescaled by a learnable re-weighting scale, and concatenated together. To ensure training convergence, the initial re-weighting scale needs to be carefully set. Here we set the initial scales of ‘conv3’, ‘conv4’, and ‘conv5’ to 66.84, 94.52, and 94.52 respectively. In the CMS-CNN, the RoI pooling layer already ensures that the pooled feature maps have the same size. Again we normalize the pooled features to make sure the downstream values are at reasonable scales when training is initialized. Specifically, features pooled from ‘conv3’, ‘conv4’, and ‘conv5’ are assigned initial scales of 57.75, 81.67, and 81.67 respectively, for both face and body pipelines. The MS-RPN and the CMS-CNN share the same parameters for all convolution layers so that computation can be done once, resulting in higher efficiency. Additionally, in order to shrink the channel size of the concatenated feature map, a 1×1 convolution layer is then employed. Therefore, the channel size of the final feature map is the same as that of the original fifth convolution layer in Faster R-CNN.

5 EXPERIMENTS
This section presents the face detection benchmarking using our proposed CMS-RCNN approach on the WIDER FACE dataset [1] and the Face Detection Data Set and Benchmark (FDDB) [20] database. The WIDER FACE dataset exhibits a high degree of variability. Using this database, our proposed approach robustly outperforms strong baseline methods, including Two-stage CNN [1], Multi-scale Cascade CNN [1], Faceness [16] and Aggregate Channel Features (ACF) [11], by a large margin. We also show that our model trained on the WIDER FACE dataset generalizes well to the FDDB database. The trained model consistently achieves competitive results against the recent state-of-the-art face detection methods on this database, including HyperFace [19], DP2MFD [17], CCF [18], Faceness [16], NPDFace [13], MultiresHPM [12], DDFD [15], CascadeCNN [14], ACF-multiscale [11], Pico [7], HeadHunter [9], Joint Cascade [10], Boosted Exemplar [8], and PEP-Adapt [6].

5.1 Experiments on WIDER FACE Dataset

Data description
WIDER FACE is a public face detection benchmark dataset. It contains 393,703 labeled human faces from 32,203 images collected based on 61 event classes from the internet. The database has many human faces with a high degree of pose variation, large occlusions, low resolutions and strong lighting conditions. The images in this database are organized and split into three subsets, i.e. training, validation and testing. They contain 40%, 10% and 50% of the original database, respectively. The images and the ground-truth labels of the training and the validation sets are available online for experiments. However, in the testing set, only the testing images (not the ground-truth labels) are available online. All detection results are sent to the database server for evaluation, and the Precision-Recall curves are returned.

In our experiments, the proposed CMS-RCNN is trained on the training set of the WIDER FACE dataset containing 159,424 annotated faces collected in 12,880 images. The model trained on this database is used for testing on all databases.

Testing and Comparison
During the testing phase, the face images in the testing set are divided into three parts based on their detection rates on EdgeBox [36]. In other words, face images are divided into three levels according to the difficulty of detection, i.e. Easy, Medium and Hard [1]. The proposed CMS-RCNN model is compared against recent strong face detection methods, i.e. Two-stage CNN [1], Multiscale Cascade CNN [1], Faceness [16], and Aggregate Channel Features (ACF) [11]. All these methods are trained on the same training set and tested on the same testing set.

The Precision-Recall curves and AP values are shown in Figure 5. Our method outperforms those strong baselines by a large margin. It achieves the best average precision at all difficulty levels, i.e. AP = 0.902 (Easy), 0.874 (Medium) and 0.643 (Hard), and outperforms the second best baseline by a relative margin of 26.0% (Easy), 37.4% (Medium) and 60.8% (Hard). These results suggest that as the difficulty level goes up, the advantage of CMS-RCNN on challenging faces grows. It is therefore able to handle difficult conditions and is closer to human-level detection. Figure 8 shows some examples of face detection results using the proposed CMS-RCNN on this database.

With Context vs. Without Context
As discussed in Section 4.2, human vision can benefit from additional context information for better detection and recognition. In this section we show how explicit contextual reasoning in the network helps improve the model performance.

To verify this, we test our models with and without body context information on the validation set of the WIDER FACE dataset. The model without body context is implemented by removing the context pipeline and using only the representation from the face pipeline to compute the confidence score and the bounding box regression. We compare their performances as illustrated in Figure 6. The Faster R-CNN method is set up as a baseline.

Starting from 0 in recall, the two curves of our models overlap at first, which means that the two models perform equally well on some easy faces. Then the curve of the model without context starts to drop more quickly than the
[Figure 5: three Precision-Recall plots, (a), (b) and (c); x-axis: Recall, y-axis: Precision. AP values from the legends:
(a) CMS-RCNN 0.902, ACF-WIDER 0.695, Faceness-WIDER 0.716, Multiscale Cascade CNN 0.711, Two-stage CNN 0.657
(b) CMS-RCNN 0.874, ACF-WIDER 0.588, Faceness-WIDER 0.604, Multiscale Cascade CNN 0.636, Two-stage CNN 0.589
(c) CMS-RCNN 0.643, ACF-WIDER 0.290, Faceness-WIDER 0.315, Multiscale Cascade CNN 0.400, Two-stage CNN 0.304]

Fig. 5. Precision-Recall curves obtained by our proposed CMS-RCNN (red) and the other baselines, i.e. Two-stage CNN [1], Multi-scale Cascade CNN [1], Faceness [16], and Aggregate Channel Features (ACF) [11]. All methods are trained and tested on the same training and testing sets of the WIDER FACE dataset. (a): Easy level, (b): Medium level and (c): Hard level. Our method achieves state-of-the-art results with the highest AP values of 0.902 (Easy), 0.874 (Medium) and 0.643 (Hard) among the methods on this database. In relative terms, it outperforms the second best baseline by 26.0% (Easy), 37.4% (Medium) and 60.8% (Hard).
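
The margins quoted for Fig. 5 can be checked with a few lines of arithmetic. The sketch below assumes they are relative improvements over the best baseline at each level, which is the reading that reproduces the reported numbers. The AP values are copied from the Fig. 5 legends; the dictionary layout and the function name `relative_gain_over_best_baseline` are illustrative and not part of the original work.

```python
# AP values read from the Fig. 5 legends, ordered as (Easy, Medium, Hard).
ap = {
    "CMS-RCNN": (0.902, 0.874, 0.643),
    "ACF-WIDER": (0.695, 0.588, 0.290),
    "Faceness-WIDER": (0.716, 0.604, 0.315),
    "Multiscale Cascade CNN": (0.711, 0.636, 0.400),
    "Two-stage CNN": (0.657, 0.589, 0.304),
}


def relative_gain_over_best_baseline(ap_table, ours="CMS-RCNN"):
    """Return the relative AP gain (in percent) over the best baseline per level."""
    gains = []
    for level in range(3):
        # Best AP among all methods except ours at this difficulty level.
        best = max(v[level] for name, v in ap_table.items() if name != ours)
        gains.append(100.0 * (ap_table[ours][level] - best) / best)
    return gains


gains = relative_gain_over_best_baseline(ap)
for level, gain in zip(("Easy", "Medium", "Hard"), gains):
    print(f"{level}: {gain:.2f}% relative gain")
```

Note that the second best method differs by level: Faceness on Easy and Multiscale Cascade CNN on Medium and Hard.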

model with context, suggesting that the model with context can handle the challenging conditions better as faces become more and more difficult. Thus the model with context eventually achieves a higher recall value. Additionally, the context model produces a longer PR curve, which means that contextual reasoning can help find more faces.

Fig. 6. Precision-Recall curves on the WIDER FACE validation set. The baseline (green curve) is generated by the Faster R-CNN [30] model trained on the WIDER FACE training set. We show that our model without context (red curve) outperforms the baseline by a large gap. With body context information, the performance gets boosted even further (blue curve). The numbers in the legend are the average precision values.

Visualization of False Positives
Since precision-recall curves drop because of false positives, we are interested in the false positives produced by our CMS-RCNN model. We are curious about what kind of object can fool our model into treating it as a face. Is it due to over-fitting, data bias, or mislabeling?

In order to visualize the false positives, we test the CMS-RCNN model on the WIDER FACE validation set and pick all the false positives according to the ground truth. Then those positives are sorted by confidence score in descending order. We choose the top 20 false positives, as illustrated in Figure 9. Because their confidence scores are high, they are the objects most likely to cause our model to make mistakes. It turns out that most of the false positives are actually human faces that were not annotated due to mislabeling, which is a problem of the dataset itself. For the other false positives, we find the errors made by our model rather reasonable. They all have the pattern of a human face as well as the shape of a human body.

5.2 Experiments on FDDB Face Database

To show that our method generalizes well to other databases, the proposed CMS-RCNN is also benchmarked on the FDDB database [20]. It is a standard database for testing and evaluation of face detection algorithms. It contains annotations for 5,171 faces in a set of 2,845 images taken from the Faces in the Wild dataset. Most of the images in the FDDB database contain fewer than 3 faces that are clear or slightly occluded. The faces generally have large sizes and high resolutions compared to WIDER FACE. We use the same model trained on the WIDER FACE training set presented in Section 5.1 to perform the evaluation on the FDDB database.

The evaluation is performed based on the discrete criterion following the same rules as in the PASCAL VOC Challenge [37], i.e. if the ratio of the intersection to the union of a detected region and an annotated face region is greater than 0.5, it is considered a true positive detection. The evaluation follows the FDDB evaluation protocol and is

Fig. 7. ROC curves of our proposed CMS-RCNN and the other published methods on FDDB database [20].
Our method achieves the best recall rate on this database. Numbers in the legend show the average precision
scores.
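
The discrete criterion used for FDDB in Section 5.2 can be illustrated with a short sketch. It assumes the PASCAL VOC convention of intersection over union (IoU) with a 0.5 threshold, and it simplifies regions to axis-aligned boxes `(x1, y1, x2, y2)`; FDDB annotations are elliptical, so this is an approximation for illustration only. The function names `iou` and `is_true_positive` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detection, annotation, threshold=0.5):
    """Discrete criterion: a detection counts if its IoU exceeds the threshold."""
    return iou(detection, annotation) > threshold


annotation = (0, 0, 10, 10)
print(is_true_positive((2, 0, 12, 10), annotation))  # overlap 80 / union 120
print(is_true_positive((5, 0, 15, 10), annotation))  # overlap 50 / union 150
```

The first detection has an IoU of 2/3 and is accepted; the second has an IoU of 1/3 and is rejected.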

compared against the published methods provided in the protocol, i.e. HyperFace [19], DP2MFD [17], CCF [18], Faceness [16], NPDFace [13], MultiresHPM [12], DDFD [15], CascadeCNN [14], ACF-multiscale [11], Pico [7], HeadHunter [9], Joint Cascade [10], Boosted Exemplar [8], and PEP-Adapt [6]. The proposed CMS-RCNN approach outperforms most of the published face detection methods and achieves a very high recall rate compared to all other methods (as shown in Figure 7). This is concrete evidence that CMS-RCNN robustly detects unconstrained faces. Figure 10 shows some examples of the face detection results using the proposed CMS-RCNN on the FDDB dataset.

6 CONCLUSION AND FUTURE WORK
This paper has presented our proposed CMS-RCNN approach to robustly detect human facial regions from images collected under various challenging conditions, e.g. heavy occlusions, low resolutions, facial expressions, illumination variations, etc. The approach is benchmarked on two challenging face detection databases, i.e. the WIDER FACE Dataset and the FDDB, and compared against other recent face detection methods. The experimental results show that our proposed approach outperforms strong baselines on the WIDER FACE and consistently achieves very competitive results against state-of-the-art methods on the FDDB.

In our implementation, the proposed CMS-RCNN consists of the MS-RPN and the CMS-CNN. During training, they are merged together in an approximate joint training style for each SGD iteration, in which the derivatives w.r.t. the proposal boxes’ coordinates are ignored. In the future we want to move to fully joint training so that the network can be trained in an end-to-end fashion.

Fig. 8. Some examples of face detection results using our proposed CMS-RCNN method on WIDER FACE database [1].

Fig. 9. Examples of the top 20 false positives from our CMS-RCNN model tested on the WIDER FACE validation set. In fact these false positives include many human faces not in the dataset due to mislabeling, which means that our method is robust to the noise in the data.

Fig. 10. Some examples of face detection results using our proposed CMS-RCNN method on FDDB database [20].

REFERENCES

[1] S. Yang, P. Luo, C. C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 1, 2, 7, 8, 10
[2] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I–511. 1, 2
[3] C. Zhang and Z. Zhang, “A survey of recent advances in face detection,” Tech. Rep. MSR-TR-2010-66, June 2010. [Online]. Available: http://research.microsoft.com/apps/pubs/default.aspx?id=132077 1, 2
[4] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2879–2886. 1, 2
[5] J. Li and Y. Zhang, “Learning surf cascade for fast and accurate object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3468–3475. 1, 2
[6] H. Li, G. Hua, Z. Lin, J. Brandt, and J. Yang, “Probabilistic elastic part model for unsupervised face detector adaptation,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 793–800. 1, 7, 9
[7] N. Markuš, M. Frljak, I. S. Pandžić, J. Ahlberg, and R. Forchheimer, “A method for object detection based on pixel intensity comparisons organized in decision trees,” arXiv preprint arXiv:1305.4537, 2013. 1, 7, 9
[8] H. Li, Z. Lin, J. Brandt, X. Shen, and G. Hua, “Efficient boosted exemplar-based face detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1843–1850. 1, 7, 9
[9] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in Computer Vision–ECCV 2014. Springer, 2014, pp. 720–735. 1, 3, 7, 9
[10] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun, “Joint cascade face detection and alignment,” in Computer Vision–ECCV 2014. Springer, 2014, pp. 109–122. 1, 3, 7, 9
[11] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Aggregate channel features for multi-view face detection,” in Biometrics (IJCB), 2014 IEEE International Joint Conference on. IEEE, 2014, pp. 1–8. 1, 2, 7, 8, 9
[12] G. Ghiasi and C. C. Fowlkes, “Occlusion coherence: Detecting and localizing occluded faces,” arXiv preprint arXiv:1506.08347, 2015. 1, 3, 7, 9
[13] S. Liao, A. Jain, and S. Li, “A fast and accurate unconstrained face detector,” 2014. 1, 7, 9
[14] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A convolutional neural network cascade for face detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5325–5334. 1, 3, 7, 9
[15] S. S. Farfade, M. J. Saberian, and L.-J. Li, “Multi-view face detection using deep convolutional neural networks,” in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015, pp. 643–650. 1, 7, 9
[16] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3676–3684. 1, 2, 3, 4, 7, 8, 9
[17] R. Ranjan, V. M. Patel, and R. Chellappa, “A deep pyramid deformable part model for face detection,” in Biometrics Theory, Applications and Systems (BTAS), 2015 IEEE 7th International Conference on. IEEE, 2015, pp. 1–8. 1, 7, 9
[18] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 82–90. 1, 7, 9
[19] R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition,” arXiv preprint arXiv:1603.01249, 2016. 1, 7, 9
[20] V. Jain and E. Learned-Miller, “Fddb: A benchmark for face detection in unconstrained settings,” University of Massachusetts, Amherst, Tech. Rep. UM-CS-2010-009, 2010. 2, 7, 8, 9, 11
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014. 2, 3, 4
[22] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. on PAMI, vol. 32, no. 9, pp. 1627–1645, Sept 2010. 2
[23] X. Yu, J. Huang, S. Zhang, W. Yan, and D. Metaxas, “Pose-free facial landmark fitting via optimized part mixtures and cascaded deformable shape model,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1944–1951. 3
[24] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and M. Hebert, “An empirical study of context in object detection,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1271–1278. 3, 5
[25] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick, “Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks,” arXiv preprint arXiv:1512.04143, 2015. 3
[26] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, S. Chintala, and P. Dollár, “A multipath network for object detection,” arXiv preprint arXiv:1604.02135, 2016. 3
[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105. 3
[28] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based
convolutional networks for accurate object detection and segmenta-
tion,” Pattern Analysis and Machine Intelligence, IEEE Transactions
on, vol. 38, no. 1, pp. 142–158, 2016. 3
[29] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE International
Conference on Computer Vision, 2015, pp. 1440–1448. 3, 4
[30] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-
time object detection with region proposal networks,” in Advances
in Neural Information Processing Systems, 2015, pp. 91–99. 4, 8
[31] M. D. Zeiler and R. Fergus, “Visualizing and understanding con-
volutional networks,” in Computer vision–ECCV 2014. Springer,
2014, pp. 818–833. 4
[32] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in
context,” in ECCV, 2014, pp. 740–755. 4
[33] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik, “Hypercolumns
for object segmentation and fine-grained localization,” in Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2015, pp. 447–456. 5
[34] W. Liu, A. Rabinovich, and A. C. Berg, “Parsenet: Looking wider
to see better,” arXiv preprint arXiv:1506.04579, 2015. 5, 6
[35] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the ACM International
Conference on Multimedia. ACM, 2014, pp. 675–678. 6
[36] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals
from edges,” in ECCV. Springer, 2014, pp. 391–405. 7
[37] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and
A. Zisserman, “The pascal visual object classes (voc) challenge,”
International journal of computer vision, vol. 88, no. 2, pp. 303–
338, 2010. 8
Fig. 11. More results of unconstrained face detection under challenging conditions using our proposed CMS-
RCNN.
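
As a supplementary illustration of the feature fusion described in Section 4.4, the sketch below normalizes three feature maps along the channel axis, rescales them with the initial MS-RPN scales (66.84, 94.52, 94.52), concatenates them, and applies a 1×1 convolution to shrink the channels. It is a NumPy approximation under stated assumptions, not the authors' Caffe implementation: the normalization is assumed to be L2, the scale is treated as a single scalar per layer, the maps are assumed to be already down-sampled to a common spatial size, the channel counts (256, 512, 512, reduced to 512) follow the standard VGG-16 layout, and all inputs and weights are random placeholders. The function names are hypothetical.

```python
import numpy as np


def l2_normalize_and_scale(feat, scale, eps=1e-12):
    """feat has shape (C, H, W). Each pixel's channel vector is scaled to L2 norm `scale`."""
    norm = np.sqrt((feat ** 2).sum(axis=0, keepdims=True)) + eps
    return scale * feat / norm


def fuse(feats, scales, weight):
    """Normalize, rescale and concatenate feature maps, then apply a 1x1 convolution.

    weight has shape (C_out, C_in_total); a 1x1 convolution is a per-pixel
    linear map over the channel axis.
    """
    normed = [l2_normalize_and_scale(f, s) for f, s in zip(feats, scales)]
    stacked = np.concatenate(normed, axis=0)  # shape (C_in_total, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)


rng = np.random.default_rng(0)
H, W = 4, 6
# Placeholder activations with deliberately different magnitudes per layer.
conv3 = rng.standard_normal((256, H, W))
conv4 = rng.standard_normal((512, H, W)) * 0.1
conv5 = rng.standard_normal((512, H, W)) * 0.01
weight = rng.standard_normal((512, 256 + 512 + 512)) * 0.01  # placeholder 1x1 kernel

fused = fuse([conv3, conv4, conv5], [66.84, 94.52, 94.52], weight)
print(fused.shape)
```

After normalization every pixel of a given layer has the same channel-vector norm regardless of the original activation magnitude, which is what lets the layers be concatenated on a comparable footing.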
