Face detection from cluttered images using a polynomial neural network
Neurocomputing 51 (2003) 197–211
www.elsevier.com/locate/neucom
Abstract
Automatic detection of human faces in cluttered images is important for face recognition
and security applications. The problem is challenging due to the multitude of variations and
the confusion between face and background regions. This paper proposes a new face detection
method using a polynomial neural network (PNN). To locate the human faces in an image, the
local regions in multiscale sliding windows are classified by the PNN into two classes, namely, face
and non-face. The PNN takes as inputs the binomials of the projection of the local image onto a
feature subspace learned by principal component analysis (PCA). We investigated the influence of
performing PCA on either the face samples alone or the pooled face and non-face samples. In addition,
we integrate the distance from the feature subspace into the PNN to improve the detection performance.
In experiments on images with complex backgrounds, the proposed method produced promising
results in terms of a high detection rate and a low false positive rate.
© 2002 Elsevier Science B.V. All rights reserved.
Keywords: Face recognition; Face detection; Pattern classification; Polynomial neural network; Feature extraction
1. Introduction
Machine recognition of human faces has wide applications in security and human–
computer interfaces [3]. A complete face recognition system consists of several modules,
of which face detection is the first.
In addition, we integrate into the PNN the distance of the image pattern from the feature
subspace, which has frequently been used in eigenface-based recognition.
Experiments on face detection in images with complex backgrounds demonstrate
the efficiency of the proposed method. In terms of detection rate and false positive
rate, the achieved results are comparable to those reported in the literature, and in
comparison with a multilayer perceptron (MLP), the performance of the PNN is superior.
The proposed method is not complicated to implement and shows potential for further
performance improvement.
The rest of this paper is organized as follows. Section 2 gives an overview of the
face detection method; Section 3 describes the PNN structure and learning algorithm.
The experimental results are presented in Section 4, and finally, Section 5 provides
concluding remarks.
2. System overview
To detect faces of variable sizes and locations, the detector needs to examine the
shifted regions of the test image at multiple scales. A statistical classifier or a neural
network is used to classify the image pattern of each local region into one of two
classes: face or non-face. Alternatively, the classifier assigns the image pattern
a likelihood measure, which should be high for a face region and low for a non-face
region. The likelihood measure is useful for resolving the competition between overlapping
regions within the same scale and across different scales.
Our strategy of image rescaling and pre-processing is similar to those of [21,26]. In
brief, the test image, with faces of unknown size and location, is rescaled to multiple
sizes in the hope that, after scaling, each face becomes nearly a standard size in
one of the rescaled images. Each rescaled image is scanned exhaustively to examine
all shifted regions of standard size (in this work, 20 × 20 pixels). The local image of
each shifted region is assigned a likelihood value by the underlying classifier. A region
with a likelihood value higher than a threshold is classified as a face region. Regions
shifted slightly from the standard face region and/or with a slightly different
scale may also be assigned high likelihood values, so they are also classified as
face regions. The overlapping face regions within a rescaled image or across different
scales compete with each other, so that only the region of the highest face likelihood
is retained.
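This scan-and-compete procedure can be summarized in a short sketch. It is our illustration rather than the authors' implementation: the scan step, the use of `scipy.ndimage.zoom` for rescaling, and the greedy overlap competition are assumptions, and `classify` stands in for the trained classifier of Section 3.

```python
from scipy.ndimage import zoom   # used here as a generic image-resize routine

WIN = 20  # standard window size in pixels

def scan_multiscale(image, classify, scales, threshold=0.5, step=2):
    """Scan all shifted WIN x WIN regions of each rescaled image.

    `classify` maps a WIN x WIN gray patch to a face likelihood in
    [0, 1].  Candidate boxes are returned in original-image
    coordinates as (row, col, size, likelihood) tuples.
    """
    candidates = []
    for s in scales:
        resized = zoom(image, s)
        h, w = resized.shape
        for r in range(0, h - WIN + 1, step):
            for c in range(0, w - WIN + 1, step):
                p = classify(resized[r:r + WIN, c:c + WIN])
                if p > threshold:                # a likely face region
                    candidates.append((int(r / s), int(c / s), int(WIN / s), p))
    # overlapping candidates, within and across scales, compete:
    # only the highest-likelihood box among overlapping ones survives
    candidates.sort(key=lambda b: -b[3])
    kept = []
    for b in candidates:
        if all(not _overlap(b, k) for k in kept):
            kept.append(b)
    return kept

def _overlap(a, b):
    """True if two square boxes (row, col, size, likelihood) intersect."""
    return (a[0] < b[0] + b[2] and b[0] < a[0] + a[2] and
            a[1] < b[1] + b[2] and b[1] < a[1] + a[2])
```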
As in previous works [21,26], we aim to detect frontal faces and define a face
region as a window enclosing the organs of a human face: two eyes, one nose, and one
mouth. To reduce the effect of inhomogeneous lighting conditions, a plane optimally
fitted in the MSE (minimum square error) sense is subtracted from the gray levels of
the local image. Then the gray levels are adjusted by histogram equalization so as to
standardize the contrast. In calculating the fitting plane and the histogram, the pixels in
the four corners of the 20 × 20 window are excluded, because they mostly do not belong
to the face organs and are subject to large variation. Finally, the gray levels of the local
image, excluding the corner pixels, are arranged in a 368-dimensional vector as the
input pattern to the classifier. Fig. 1 shows examples of local image pre-processing.
The upper row is an example of a face region, and the lower row is an example of a
non-face region. The four images, from left to right, show in turn the clipped local
image, the result of lighting correction, the result of histogram equalization,
and the image pattern excluding the corner pixels.
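A minimal sketch of this pre-processing chain follows, under stated assumptions: the text fixes the count of excluded corner pixels (32, leaving 368) but, in this excerpt, not the mask shape, so `corner_mask` below uses a hypothetical triangular mask of matching size; likewise, a rank transform stands in for gray-level histogram equalization.

```python
import numpy as np

def corner_mask(n=20, widths=(4, 3, 1)):
    """Boolean mask keeping the non-corner pixels of an n x n window.

    The paper excludes 32 corner pixels (8 per corner, leaving 368);
    the exact shape is not given in this excerpt, so this triangular
    mask (4 + 3 + 1 = 8 pixels per corner) is one guess of matching size.
    """
    keep = np.ones((n, n), dtype=bool)
    for i, w in enumerate(widths):
        keep[i, :w] = False              # top-left corner
        keep[i, n - w:] = False          # top-right corner
        keep[n - 1 - i, :w] = False      # bottom-left corner
        keep[n - 1 - i, n - w:] = False  # bottom-right corner
    return keep

def preprocess(window):
    """Lighting correction, histogram equalization, corner removal.

    `window` is a 20 x 20 gray-level array; returns the 368-vector
    used as the classifier input pattern.
    """
    keep = corner_mask(window.shape[0])
    ys, xs = np.nonzero(keep)
    # 1. subtract the MSE-optimal plane z = a*x + b*y + c, fitted by
    #    linear least squares on the non-corner pixels only
    A = np.column_stack([xs, ys, np.ones(len(xs))])
    (a, b, c), *_ = np.linalg.lstsq(A, window[ys, xs], rcond=None)
    plane = (a * np.arange(window.shape[1])[None, :]
             + b * np.arange(window.shape[0])[:, None] + c)
    corrected = window - plane
    # 2. equalize: a rank transform over the non-corner pixels, a simple
    #    stand-in for gray-level histogram equalization
    vals = corrected[ys, xs]
    ranks = np.argsort(np.argsort(vals))
    return ranks / (len(vals) - 1.0)   # 368 gray levels spread over [0, 1]
```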
The underlying classifier for image pattern classification is a PNN, which has a
single output unit indicating whether an input pattern is a face or not. The output
value of the unit gives the likelihood of the input pattern being a human face.
The output unit takes as inputs the input features as well as their binomial
expansions. When the number of input features is large, the number of polynomial
terms becomes huge. This would not only increase the computational complexity, but
also deteriorate the generalization performance when training on a small sample. To
overcome this problem, we reduce the dimensionality of the input pattern using PCA:
an eigen-subspace is learned from a set of example patterns, and the projections
of an input pattern onto the principal eigenvectors are used as the input features of the
PNN. The structure and the learning algorithm of the PNN are described in detail in the
following section.
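A compact sketch of this feature-extraction step (ours; the eigendecomposition route and all names are assumptions, not the authors' code):

```python
import numpy as np

def learn_subspace(patterns, m):
    """Learn an m-dimensional eigen-subspace by PCA.

    `patterns` is an (N, 368) array of pre-processed training patterns,
    either face samples only or the pooled face and non-face samples.
    Returns the mean vector and the m leading eigenvectors (as columns).
    """
    mu = patterns.mean(axis=0)
    cov = np.cov(patterns, rowvar=False)     # 368 x 368 sample covariance
    eigval, eigvec = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = np.argsort(eigval)[::-1][:m]       # indices of the m largest
    return mu, eigvec[:, top]

def project(x, mu, phi):
    """Features z_j: projections of the centered pattern onto the axes."""
    return phi.T @ (x - mu)
```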
The architecture of the PNN combined with feature extraction exhibits great flexibility.
Hence, the PNN shows great potential to improve the detection performance if the PCA
is replaced by, or combined with, other feature extraction techniques. Some feature
extraction techniques used in face detection [12,25] can readily be combined with
the PNN. Other feature extraction and transformation techniques available for this task
include the wavelet transform, the Gabor transform, independent component analysis
[1], and various feature subset selection methods [9].
3. Polynomial neural network

The PNN can be viewed as a generalized linear classifier which uses as inputs not
only the feature measurements of the input pattern but also polynomials of these
measurements. For face detection, the PNN has a single output unit for two-class
classification. The number of polynomial terms, i.e., the number of inputs to the
output unit, increases rapidly with the number of features. Nevertheless, the size of a
second-order (binomial) network is acceptable and the classification performance is
promising. The binomial network is also closely related to the Gaussian quadratic
classifier, since both utilize the second-order statistics of the pattern space [24,28].
However, the PNN (including the binomial network) is not constrained by the Gaussian
density assumption, and its parameters are optimized by discriminative learning so as to
separate the patterns of different classes well. Compared with other neural networks,
such as the MLP [22] and the radial basis function (RBF) network [2], the PNN is faster
in learning and less susceptible to local minima because it has a single-layer structure.
The classification power of the PNN originates from the nonlinear mapping of the
polynomial expansion.
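To make "increases rapidly" concrete (our arithmetic, not a figure from the paper): a binomial network over d features requires d linear weights, d(d + 1)/2 quadratic weights, and a bias, i.e., d + d(d + 1)/2 + 1 parameters in total. For the raw 368-dimensional pattern this amounts to 368 + 67,896 + 1 = 68,265 weights, whereas for m = 80 subspace features it is only 80 + 3,240 + 1 = 3,321, which is why dimensionality reduction precedes the polynomial expansion.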
Denote the input pattern by a feature vector x = (x_1, x_2, \ldots, x_d)^T. The output of the
PNN is computed as

    y(x) = g( \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=i}^{d} w_{ij} x_i x_j + w_0 ),    (1)

where g(\cdot) is the sigmoid function, g(a) = 1/(1 + e^{-a}). To keep the number of
polynomial terms manageable, the input pattern is first projected onto an m-dimensional
eigen-subspace:
    z_j = (x - \mu)^T \phi_j,    j = 1, 2, \ldots, m,    (2)

where z_j denotes the projection of x onto the j-th axis of the subspace, \phi_j denotes
the eigenvector of that axis, and \mu denotes the mean vector of the pattern space. The
eigenvectors are computed by PCA (the Karhunen–Loève transform) on a sample set, i.e., a
set of face samples or the pooled set of face and non-face samples. The eigenvectors
corresponding to the m largest eigenvalues are selected, such that the error of
reconstructing a pattern from the subspace is minimized.
Using the projections of the image pattern onto the subspace as the features, the output
of the PNN takes the form

    y(x) = g( \sum_{i=1}^{m} w_i z_i + \sum_{i=1}^{m} \sum_{j=i}^{m} w_{ij} z_i z_j + w_0 ).    (3)
In this form, the reconstruction error of the image pattern (the distance from the feature
subspace, DFFS) is ignored entirely. However, the DFFS is an important indicator of
the deviation of a pattern from the subspace: when the subspace is learned from
face samples, a large DFFS indicates that a pattern is dissimilar to a face. Hence,
we integrate the DFFS into the PNN in the hope of improving the detection performance:
    y(x) = g( \sum_{i=1}^{m} w_i z_i + \sum_{i=1}^{m} \sum_{j=i}^{m} w_{ij} z_i z_j + w_D D_f + w_0 ) = g( w^T \tilde{z} + w_0 ),    (4)
where w denotes the vector composed of all connecting weights and \tilde{z} is the vector
composed of all the inputs to the output unit, including the DFFS:

    D_f = \| x - \mu \|^2 - \sum_{j=1}^{m} z_j^2.    (5)
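Eqs. (2)–(5) translate directly into code. The sketch below is ours; `mu` and `phi` are the PCA mean and eigenvectors as in the earlier subspace sketch, and the layout of the weight vector is an assumption.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pnn_inputs(x, mu, phi):
    """Build the input vector z~ of Eq. (4): the subspace projections z_i,
    the binomial terms z_i z_j (j >= i), and the DFFS D_f of Eq. (5)."""
    z = phi.T @ (x - mu)                          # Eq. (2)
    iu, ju = np.triu_indices(len(z))              # index pairs with j >= i
    d_f = np.sum((x - mu) ** 2) - np.sum(z ** 2)  # Eq. (5)
    return np.concatenate([z, z[iu] * z[ju], [d_f]])

def pnn_output(x, mu, phi, w, w0):
    """Eq. (4): face likelihood y(x) = g(w^T z~ + w_0)."""
    return sigmoid(w @ pnn_inputs(x, mu, phi) + w0)
```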
The connecting weights of the PNN are trained by supervised learning on a set of face
and non-face samples, with the aim of minimizing the empirical mean square error (MSE)
loss with weight decay:

    E = \sum_{n=1}^{N_x} [ y(x^n) - t^n ]^2 + \lambda \| w \|^2 = \sum_{n=1}^{N_x} E^n,    (6)

where t^n denotes the target output for the input pattern x^n, with value 1 for a face
pattern and 0 for a non-face pattern, and \lambda is a weight-decay coefficient, which
helps to improve the generalization performance.
The connecting weights are updated by stochastic gradient descent [20]. The example
patterns are fed into the network repeatedly to update the weights until the empirical
loss reaches a local minimum. On an input pattern z^n = z(x^n), the connecting weights
are updated by gradient descent:

    w(n + 1) = w(n) - \eta(n) \, \partial E^n / \partial w,
    w_0(n + 1) = w_0(n) - \eta(n) \, \partial E^n / \partial w_0,    (7)
where \eta(n) is a learning rate, which is kept small and decreases progressively. The
partial derivatives are computed as

    \partial E^n / \partial w = [ y(x^n) - t^n ] \, y (1 - y) \, \tilde{z} + (\lambda / N_x) \, w,
    \partial E^n / \partial w_0 = [ y(x^n) - t^n ] \, y (1 - y).    (8)
Since the PNN is a single-layer network, the training process is quite fast and the
result is not influenced by the random initialization of the weights.
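A minimal training loop implementing Eqs. (6)–(8) might look as follows. This is our sketch: the epoch count, the schedule eta = eta0/(1 + epoch), and the weight-decay value are illustrative settings, not the authors'.

```python
import numpy as np

def train_pnn(Z, t, epochs=50, eta0=0.1, lam=0.01, seed=0):
    """Stochastic gradient descent for Eqs. (6)-(8).

    `Z` is an (N, D) array of pre-computed PNN input vectors z~ (one
    row per training pattern, cf. pnn_inputs above); `t` holds the
    targets, 1 for face and 0 for non-face.
    """
    rng = np.random.default_rng(seed)
    N, D = Z.shape
    w = np.zeros(D)   # single-layer network: per the text, the result does
    w0 = 0.0          # not hinge on random initialization, so zeros suffice
    for epoch in range(epochs):
        eta = eta0 / (1.0 + epoch)   # progressively decreasing learning rate
        for n in rng.permutation(N):
            y = 1.0 / (1.0 + np.exp(-(w @ Z[n] + w0)))
            delta = (y - t[n]) * y * (1.0 - y)        # common factor, Eq. (8)
            w -= eta * (delta * Z[n] + (lam / N) * w) # Eqs. (7)-(8)
            w0 -= eta * delta
    return w, w0
```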
4. Experimental results
To train the classifier and test the face detection performance, we used two sets of
images. The first set contains 3257 images downloaded from several websites, mostly
of size 384 × 384 and with one face per image. The second set has 130 images
downloaded from the CMU website, referred to as Test Set 1 in [21]. In this paper,
we call the 3257 images type 1 images and the CMU images type 2 images.
We used 2987 type 1 images (containing 2990 faces) to extract face samples. The
face boxes were located manually, and each box was then adjusted into a square.
The local image within the face box is normalized to 20 × 20 size and undergoes
lighting correction, histogram equalization, and corner elimination to give a
368-dimensional face pattern vector. The square face box is also varied in aspect ratio
and size to generate four variations, and the mirror reflection of a face image
about the vertical axis gives another variation. In combination, a face image gives 10
variations of face patterns, and as a result, 29,900 face patterns are available for
training. The remaining 270 type 1 images were used for testing.
We collected non-face samples from three subsets of images in three steps. The
images for non-face collection comprise a subset of 228 images and a subset of 273
images from which the face samples were extracted, plus 14 scene images from the
type 2 dataset. From the first subset of 228 images, the shifted local regions at 10
scales (with scaling factor 1.21, starting from 0.1) were examined to collect the
first-step non-face samples. The local patterns were compared with the center vector of
the face samples, and the patterns whose Euclidean distance falls below a threshold are
considered confusing non-face examples. Since the number of patterns satisfying
this condition is huge, we collected only the patterns of minimum distance within each
20 × 20 range and those lying on a grid of 10 × 10 pixels. As a result, we obtained
56,007 non-face samples in the first step.
In the second step, non-face samples were collected from the subset of 273 images
(10 scales starting from 0.1) using the PNN trained on the face samples and the
first-step non-face samples. This time, all local region patterns for which the output of
the PNN was higher than 0.5 were collected. Depending on the structure of the PNN, the
number of second-step non-face samples ranges from 30,000 to 40,000. The PNN is then
retrained with the face samples and the non-face samples of the two steps, in order to
collect non-face examples from the 14 scene images (10 scales starting from 0.2). The
number of third-step non-face samples is around 10,000.
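This collect-and-retrain procedure is a form of bootstrapping. One round can be sketched as follows (ours, reusing `WIN` and `preprocess` from the earlier sketches; the threshold matches the 0.5 quoted above, while the scan step is an assumption).

```python
from scipy.ndimage import zoom  # WIN and preprocess() as defined earlier

def collect_hard_nonfaces(images, scales, classify, threshold=0.5, step=2):
    """One bootstrap round: scan face-free images and keep every window
    that the current PNN wrongly scores above the threshold.

    `classify` is the PNN trained on the previous round's samples; the
    collected patterns are added to the non-face set before retraining.
    """
    hard = []
    for image in images:
        for s in scales:
            resized = zoom(image, s)
            h, w = resized.shape
            for r in range(0, h - WIN + 1, step):
                for c in range(0, w - WIN + 1, step):
                    patch = preprocess(resized[r:r + WIN, c:c + WIN])
                    if classify(patch) > threshold:
                        hard.append(patch)  # a confusing non-face example
    return hard
```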
From the type 2 (CMU) dataset, the 14 scene images were used for non-face
collection. We further excluded 7 images with extremely big or small faces (face
sizes outside the 10 scales). The remaining 109 images, containing 487 faces, were used
as our type 2 test set. The 10 image scales start at 0.2 and end at 1.11, which implies
that we can detect faces ranging from 18 × 18 pixels to 100 × 100 pixels. For the type
1 set, the 10 scales start from 0.1.
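As a check of these figures (our arithmetic): with scaling factor 1.21, the tenth scale is 0.2 × 1.21^9 ≈ 1.11; a 20 × 20 window at scale s covers a (20/s) × (20/s) square of the original image, i.e., faces from about 20/1.11 ≈ 18 pixels up to 20/0.2 = 100 pixels.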
We first tested three PNN variants with subspace dimension m = 80, which we refer
to as PNN-A80, PNN-B80, and PNN-C80. PNN-A80 and PNN-C80 use the
subspace learned from the face samples; PNN-C80 incorporates the DFFS, whereas
PNN-A80 does not. For PNN-B80, the subspace was learned from the pooled
set of the face samples and the first-step non-face samples; this paradigm aims
to represent the pooled sample in a unified subspace. The distributions of eigenvalues
(sorted in decreasing order) of PCA from the face samples and from the pooled sample
are plotted in Fig. 2.
Fig. 2. Eigenvalues of PCA from face samples and from pooled sample.
Table 1
Detection results of three PNN structures
The detection results of the three PNN structures on the two test sets are listed
in Table 1. The false positive rate is the ratio of false positives to the total number
of examined windows. On the type 1 test set, since the images have high clarity and
the face shape variation is relatively small, all three PNN structures achieve a very
high detection rate and a low false positive rate. Some images in the type 2 test set
have very low clarity and many faces are inherently ambiguous, so the detection rate is
lower than on the type 1 test set. Comparing the three PNN structures, PNN-B80 gives
fewer false positives than PNN-A80, but its detection rate is traded off. PNN-C80
outperforms both PNN-A80 and PNN-B80 in both detection rate and false positive rate,
which confirms that incorporating the DFFS into the PNN is beneficial.
We also tested the performance of the PNN with DFFS for variable subspace
dimensionality (the PNN-C series). Table 2 gives the detection results of the PNN-C
for subspace dimensionality m = 60, 80, and 100. We can see that the performance of
PNN-C60 is evidently inferior to that of PNN-C80, while PNN-C100 trades off detection
rate to decrease the false positive rate. We will see later, from the detection/false-positive
tradeoff curves, that PNN-C80 outperforms PNN-C100.

Table 2
Detection results for variable subspace dimensionality
Some examples of face detection using PNN-C80 are shown in Figs. 3–5.
Fig. 3 shows examples from the type 1 test set, while Figs. 4 and 5 show examples
from the type 2 test set. From the results on the type 2 test set, we can see that the
proposed method is quite robust against low image quality and face shape variation.
The missed faces are incomplete, ambiguous, or excessively rotated. The false positives,
on the other hand, mostly resemble the geometric shape of human faces. In gray-scale
images, resolving such ambiguous faces relies on more contextual information, e.g.,
the other parts of the human body.
To compare the performance of the PNN with other neural networks, we experimented
with the MLP, the most popular neural network for regression and classification.
We use a four-layer MLP (two hidden layers) with the number of hidden units in the
first hidden layer equal to the subspace dimensionality of the PNN (h_1 = m), and the
number of hidden units in the second hidden layer equal to half that of the first
(h_2 = h_1/2). In this configuration, the number of parameters and the computational
cost of the MLP are approximately equal to those of the PNN: the first-hidden-layer
weights correspond to the subspace eigenvectors of the PNN, and the number of weights
in the second hidden layer and the output layer approximately equals the number of
weights of the PNN. The weights of the MLP are learned using the back-propagation (BP)
algorithm [22], which minimizes the same MSE criterion as for the PNN.
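As a rough check of this equality (our arithmetic, not figures from the paper): for m = 80, the PNN output unit has 80 linear weights, 80 · 81/2 = 3,240 binomial weights, one DFFS weight, and a bias, i.e., about 3,322 trainable parameters on top of the fixed 368 × 80 eigenvector projection. The matched MLP-80 has 80 × 40 = 3,200 second-hidden-layer weights plus 40 output weights and the biases, i.e., about 3,281 parameters on top of its 368 × 80 first-layer weights, which play the role of the eigenvectors.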
The MLP was trained and tested on the same images as the PNN. After being trained
on the 2990 face samples and the 56,007 first-step non-face samples, the MLP was used to
collect the second-step non-face samples, and the third-step non-face samples were
collected after retraining the MLP. For face detection on the test images, the MLP was
trained on the face samples and all the non-face samples. To compare the detection
performance of the different networks more fairly, we give the curves of the tradeoff
between correct detection rate and false positive rate for variable decision thresholds.
On the type 2 test images, the tradeoff curves of the MLP and the PNN are plotted in
Fig. 6, where PNN-60, PNN-80, and PNN-100 denote the PNN (with DFFS incorporated)
with m = 60, 80, and 100, respectively, and MLP-60, MLP-80, and MLP-100 denote the
MLP with 60, 80, and 100 hidden units in the first hidden layer, respectively. From the
results, it is evident that the performance of the PNN is superior to that of the MLP.
Among the PNNs, PNN-80 performs best, and among the MLPs, MLP-80 performs best.
In comparison with previous methods, we would particularly like to mention the
results of [21] and [6], which were obtained on the same CMU dataset as ours. As
explained above, this dataset contains many ambiguous faces, so it is difficult to obtain
both a high detection rate and a low false positive rate. Template matching and
model-based methods usually report their results on high-quality images only (e.g.,
[15,16]). The methods of [6,21] achieved very low false positive rates at high detection
rates: at a detection rate of 86%, their false positive rates are 2.8 × 10^{-7} and
9.7 × 10^{-8}, respectively. However, their results were achieved by combining multiple
neural networks, which significantly reduces false positives. We use a single neural
network, and the detection performance (84.6% detection rate at a 3.51 × 10^{-6} false
positive rate) is fairly good.
5. Conclusion
We have proposed a new face detection method using a PNN. The PNN functions as a
classifier to evaluate the face likelihood of the image patterns of multiscale shifted
local regions. PCA is used to reduce the dimensionality of the image patterns and
extract features for the PNN. Using a single network, we have achieved a fairly high
detection rate and a low false positive rate on images with complex backgrounds. Besides
linear PCA, the PNN is flexible enough to combine with other feature extraction
techniques (e.g., wavelet analysis, ICA, nonlinear PCA, feature subset selection) and
shows potential to further improve the detection performance. For applications involving
high-quality images, the current version of the PNN already suffices in terms of
performance.
Acknowledgements
The authors would like to thank the editors and the anonymous reviewers for their
invaluable comments.
References
[23] H. Schneiderman, T. Kanade, Probabilistic modeling of local appearance and spatial relationships for object recognition, in: Proceedings of the CVPR, 1998, pp. 45–51.
[24] J. Schürmann, Pattern Classification: A Unified View of Statistical Pattern Recognition and Neural Networks, Wiley Interscience, New York, 1996.
[25] Q. Song, J. Robinson, A feature space for face image processing, in: Proceedings of the 15th ICPR, 2000, pp. 97–100.
[26] K.-K. Sung, T. Poggio, Example-based learning for view-based human face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1) (1998) 39–50.
[27] G. Yang, T.S. Huang, Human face detection in a complex background, Pattern Recognition 27 (1) (1994) 53–61.
[28] H.-C. Yau, M.T. Manry, Iterative improvement of a Gaussian classifier, Neural Networks 3 (1990) 437–443.
[29] K.C. Yow, R. Cipolla, Feature-based human face detection, Image Vision Comput. 15 (9) (1997) 713–735.
Lin-Lin Huang was born in April 1968. She received the B.S. degree from Wuhan
University and the M.E. degree from Beijing Polytechnic University, China, in 1989 and
1994, respectively. Currently she is a Ph.D. student at Tokyo University of Agriculture
and Technology. From 1994 to 1998, she worked as a lecturer at Northern Jiaotong
University, Beijing, China. Her research interests include pattern recognition, image
processing, and computer vision.
Akinobu Shimizu was born in October 1965. He received his B.E. and Ph.D. degrees
from the Graduate School of Engineering, Nagoya University, in 1989 and 1994,
respectively. He became a research associate at Nagoya University in 1994, and has
been an associate professor in the Graduate School of Bio-Applications and Systems
Engineering, Tokyo University of Agriculture and Technology, since 1998. His research
interests include image processing and analysis. He is a member of the Japanese
Society of Medical Imaging Technology, the Japanese Society for Medical and Biological
Engineering, the Japan Society of Computer Aided Diagnosis of Medical Images, and
the IEEE.
Hidefumi Kobatake was born in November 1943. He received the B.E., M.E., and
Ph.D. degrees from The University of Tokyo, Tokyo, Japan, in 1967, 1969, and 1972,
respectively. He is now a Professor at the Graduate School of Bio-Applications and
Systems Engineering, Tokyo University of Agriculture and Technology, Tokyo, Japan.
His research activities are in the areas of speech processing, image processing, and
applications of digital signal processing. He has received several awards, including a
1987 Society of Instrument and Control Engineers Best Monograph Award and a 1998
Three-Dimensional Image Conference Best Paper Award. He is a member of the IEEE,
the Society of Instrument and Control Engineers, the Acoustical Society of Japan, etc.