Facial Expression Recognition Using Constructive Feedforward Neural Networks
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B: CYBERNETICS, VOL. 34, NO. 3, JUNE 2004
I. INTRODUCTION
The computer-based recognition of facial expressions has been an
active area of research for a long time. The ultimate goal in this
research area is the realization of intelligent and transparent
communication between human beings and machines. Several facial
expression recognition methods have been proposed in the literature;
see, for example, [4], [5], [12], and [27] and the references therein.
A well-known facial action coding system was developed by Ekman
[5] for facial expression description. In the facial action coding system,
the face is divided into 44 action units, such as the nose, mouth, and eyes.
The movements of the muscles of these feature-bearing action units are
used to describe any human facial expression. This method requires
three-dimensional (3-D) measurement and may thus be too complex for
real-time processing. To remedy the drawbacks associated with the
original facial action coding system, a modified system using only 17
relevant action units was proposed in [14] for facial expression analysis
and synthesis. However, 3-D measurement is still needed. Although the
complexity of the modified facial action coding system is reduced
compared to the original system, certain information useful for facial
expression recognition may be lost. In [6], a more accurate
representation of human facial expressions (FAC+) is derived by using a
computer vision system to probabilistically characterize facial motion
and muscle activation. In recent years, facial expression recognition
based on two-dimensional (2-D) digital images has
received considerable attention from researchers. In [24], a radial basis
function neural network is proposed to recognize human facial expressions.
The 2-D discrete cosine transform is used to compress the entire face
image, and the resulting lower-frequency 2-D discrete cosine transform
coefficients are used to train a one-hidden-layer feedforward neural
network in [27]. Very promising experimental results are also reported in
[24]. A more detailed review of facial expression recognition can be
found in [4].
Neural network-based recognition methods are found to be
particularly promising [24], [27], since neural networks can easily
implement the mapping from the feature space of face images to
the facial expression space. However, determining a proper network
size has always been a frustrating and time-consuming experience
for neural network developers, and is generally dealt with through
a series of long and costly trial-and-error simulations. Motivated
by these limitations, in this paper we propose to use a constructive
feedforward neural network to remedy this problem. The constructive
feedforward neural network can systematically determine a proper
network size matched to the complexity of a given problem, while
considerably reducing the computational cost involved in network
training when compared with the standard radial basis function and
backpropagation-based training techniques. We are particularly
interested in constructive one-hidden-layer feedforward neural
networks, which are simple in structure and yield fairly good
performance in many applications such as regression, image
compression, and facial expression recognition [18]-[20].
The remainder of this paper is organized as follows. In
Section II, the main features of a constructive neural network are
presented, and a pruning technique is proposed and applied to our
constructive neural network in Section II-C. In Section III, the
application of our proposed constructive neural network to facial
expression recognition is presented. Experimental results on a database
consisting of images of 60 men, each having five facial expression
images, are also presented to demonstrate the potential capabilities of
the proposed technique. Conclusions are stated in Section IV.
II. CONSTRUCTIVE ALGORITHMS FOR FEEDFORWARD NEURAL NETWORKS
Constructive learning alters the network structure as learning proceeds,
automatically producing a network of an appropriate size. In
this approach, one starts with a small initial network and
then incrementally adds new hidden units and/or hidden layers until
some prespecified error requirement is reached or no performance
improvement can be observed. The network obtained in this way is
reasonably sized for the given problem. Generally, a
minimal or optimal network size is seldom achieved by
this strategy; however, a subminimal/suboptimal network can be
expected [16], [17]. This problem has attracted a lot of attention from
researchers, and several promising algorithms have been proposed in
the literature. Kwok and Yeung [16] survey the major constructive
algorithms in the literature. The dynamic node creation algorithm and its
variants [1], [25], activity-based structure-level adaptation [26],
cascade-correlation algorithms [8], [23], and the constructive
one-hidden-layer algorithms [15], [17] are among the most important
constructive learning algorithms developed so far.
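To make the incremental procedure concrete, the following minimal Python/NumPy sketch grows a one-hidden-layer network unit by unit until an error goal is met or improvement stalls. It illustrates the general constructive idea rather than faithfully implementing any of the cited algorithms; the candidate-selection heuristic, activation choices, and stopping thresholds are our assumptions.

```python
import numpy as np

def train_constructive(X, D, max_units=20, sse_goal=1e-2, min_gain=1e-4):
    """Grow a one-hidden-layer network one unit at a time (illustrative).
    X: (M, d) training inputs; D: (M, c) target outputs."""
    M, d = X.shape
    W1 = np.empty((0, d))             # input-side weights, one row per hidden unit
    W2 = np.empty((0, D.shape[1]))    # output-side weights
    residual, prev_sse = D.copy(), np.inf
    for n in range(max_units):
        # Candidate pool: published algorithms train candidate input-side
        # weights to maximize error reduction; here we simply keep the random
        # candidate whose activation correlates best with the current residual.
        cands = np.random.randn(16, d)
        acts = np.tanh(X @ cands.T)                          # (M, 16)
        score = np.abs((acts - acts.mean(0)).T @ residual).sum(axis=1)
        W1 = np.vstack([W1, cands[score.argmax()]])
        H = np.tanh(X @ W1.T)                                # hidden activations
        W2, *_ = np.linalg.lstsq(H, D, rcond=None)           # refit output side
        sse = np.sum((D - H @ W2) ** 2)
        if sse < sse_goal or prev_sse - sse < min_gain:
            break                      # error goal met or no further improvement
        prev_sse, residual = sse, D - H @ W2
    return W1, W2
```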
The major advantages of constructive algorithms over other
methods such as pruning algorithms [3], [21] and regularization-based
techniques [2], [13] are as follows.
1) It is easier to specify the initial network architecture in
constructive learning techniques, whereas in pruning algorithms
one usually does not know a priori how large the original
network should be.
2) Constructive algorithms tend to build small networks due to their
incremental learning nature. A network is constructed whose size
corresponds directly to the complexity of the given problem
and the specified performance requirements, whereas pruning
algorithms may expend excessive effort trimming the unnecessary
weights in the network. Thus, constructive algorithms are
generally more efficient (in terms of training time and network
complexity/structure) than pruning algorithms.
3) In pruning algorithms and regularization-based techniques, one
must specify or select several problem-dependent parameters in
order to obtain a network yielding satisfactory performance.
This could potentially reduce the applicability of these algorithms
in real-life applications. Constructive algorithms, on the other hand,
do not suffer from these limitations.
In the next section, we first give a simple formulation of the training
problem for a constructive one-hidden-layer feedforward neural
network in the context of a nonlinear optimization problem. The
advantages and disadvantages of the constructive algorithms are also
discussed.
Our motivation for applying the constructive learning algorithm developed
earlier by the authors in [19] and [20] is justified by the following
rationale.
1) The one-hidden-layer feedforward neural network is simple and
elegant in structure. The fan-in problem associated with cascade
correlation-type architectures is not present in this structure.
Furthermore, the deeper such a structure becomes, the more input-side
connections a new hidden unit requires. This may degrade the
generalization performance of the network, as some of the connections
may become irrelevant to the prediction of the output.
Formally, given training samples $\{(x^j, d^j)\}_{j=1}^{M}$, the training of an $L$-layer feedforward network may be stated as the nonlinear optimization problem

$$\min_{L,\, n_l,\, f_l,\, w_l}\; \sum_{j=1}^{M} \left\| d^j - y_L^j \right\|^2 \qquad\qquad (1)$$

subject to $y_l^j = f_l\!\left(w_l\, y_{l-1}^j\right)$, $l = 1, \ldots, L$, with $y_0^j = x^j$, where $n_l$ denotes the number of units in layer $l$, $f_l$ its activation function, and $w_l \in \Re^{n_l \times n_{l-1}}$ its weight matrix. For the one-hidden-layer network considered here ($L = 2$), with input $x^j \in \Re^{n_0}$, the constraints specialize to

$$y_1^j = f_1\!\left(w_1 x^j\right), \qquad w_1 \in \Re^{n_1 \times n_0}, \quad y_1^j \in \Re^{n_1} \qquad\qquad (2)$$

$$y_2^j = f_2\!\left(w_2 y_1^j\right), \qquad w_2 \in \Re^{n_2 \times n_1}, \quad y_2^j \in \Re^{n_2} \qquad\qquad (3)$$

and the training problem reduces to

$$\min_{n_1,\, f_1,\, f_2,\, w_1,\, w_2}\; \sum_{j=1}^{M} \left\| d^j - y_2^j \right\|^2. \qquad\qquad (4)$$
Fig. 3. Sample of face images from the database, with the image registered as sadness being ambiguous.
Fig. 5. Mean generalization SSEs versus the block size and the number of hidden units (training with pruning, 20 runs).
Fig. 6. Mean recognition rates versus the block size and the number of hidden units obtained during network training with pruning (20 runs).
connections to both the original input units and all the established
hidden units. In other words, this algorithm generates a network with a
structure similar to that of the cascade correlation-based networks,
and hence shares the same limitations as cascade correlation. In the
Fig. 7. Mean recognition rates versus the block size and the number of hidden units obtained during testing of the networks trained with pruning (20 runs).
Fig. 8. Maximum recognition rates versus the block size obtained during network training with pruning and without pruning (20 runs).
Fig. 9. Maximum recognition rates versus the block size obtained in testing for the networks trained with pruning and without pruning (20 runs).
$$J_{\max,n} = \max_{i = 1, \ldots, M} J_{\mathrm{input}}\!\left(w_{n,i} = 0\right) \qquad\qquad (5)$$

where $J_{\mathrm{input}}(w_{n,i} = 0)$ denotes the input-side objective evaluated with the $i$-th input-side weight of the $n$-th hidden unit set to zero.
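In this spirit, a minimal sketch of input-side pruning is given below: each input-side weight of a newly added hidden unit is zeroed in turn, the input-side objective is re-evaluated, and weights whose removal leaves the objective close to $J_{\max,n}$ are pruned. The relative threshold and the assumption that the input-side objective is maximized during training are ours, not necessarily the paper's exact rule.

```python
import numpy as np

def prune_input_side(w_n, J_input, rel_tol=0.05):
    """Prune input-side weights of the n-th hidden unit whose removal
    barely affects the input-side objective. J_input is a callable
    returning the (maximized) input-side objective for a weight vector;
    the relative threshold rule is an assumption."""
    J_zeroed = np.empty(w_n.size)
    for i in range(w_n.size):
        trial = w_n.copy()
        trial[i] = 0.0                 # temporarily remove connection i
        J_zeroed[i] = J_input(trial)   # objective without that weight
    J_max = J_zeroed.max()             # J_max,n as in (5)
    keep = J_zeroed < (1.0 - rel_tol) * J_max  # removal would hurt: keep
    return np.where(keep, w_n, 0.0)
```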
Fig. 14. Mean cumulative number of pruned input-side weights for the constructive one-hidden-layer feedforward neural networks with pruning (b = 12, 20 runs).
images, as the difference image still contains a large amount of data. To
facilitate the recognition process, one needs to further compress the
difference image in order to reduce the size of the data without sacrificing
the key attributes and features that play a fundamental role in recognition
success. The 2-D discrete cosine transform (DCT), frequently used in
image compression, is a viable tool for this purpose. The 2-D DCT
can reduce the size of the data significantly by transforming an image
from a spatial representation into the frequency domain where, in general,
the lower frequencies are characterized by relatively large amplitudes
while the higher frequencies have much smaller magnitudes. In
other words, the higher frequency components can be ignored without
significantly compromising the key characteristics of the original
difference image, as far as the facial expression recognition problem is
concerned. It is therefore argued that the 2-D DCT coefficients of the
lower frequency modes capture, in principle, the most dominant and
relevant information of the facial expressions.
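For illustration, the compression step can be sketched as follows, assuming the difference image is a 2-D NumPy array: the 2-D DCT is computed, and only the b x b block of lowest-frequency coefficients is kept and flattened into the b^2-dimensional input vector described next. The function names are ours.

```python
import numpy as np
from scipy.fftpack import dct

def dct2(image):
    # Separable 2-D type-II DCT (orthonormal) via two 1-D passes.
    return dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')

def dct_features(diff_image, b=12):
    """Keep the b x b block of lowest-frequency 2-D DCT coefficients
    of the difference image, flattened into a b**2 input vector."""
    coeffs = dct2(diff_image.astype(np.float64))
    return coeffs[:b, :b].ravel()
```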
A square (or block) of the lower frequency 2-D DCT coefficients is
rearranged as an input vector x of dimension b^2 and fed to a constructive
one-hidden-layer feedforward neural network. The input-side training
Fig. 15. Recognition rates versus the number of hidden units for the two constructive one-hidden-layer feedforward neural networks yielding the best recognition rates in the testing stage. These two networks are obtained in the 18th and 8th runs of network training with and without pruning, respectively (b = 12).
TABLE I
CONFUSION MATRICES OBTAINED BY THE ONE-HIDDEN-LAYER FEEDFORWARD NEURAL NETWORK WITH SIX HIDDEN UNITS FOR THE IMAGES USED DURING NETWORK TRAINING, (LEFT TABLE) WITH PRUNING AND (RIGHT TABLE) WITHOUT PRUNING (b = 12)
TABLE II
CONFUSION MATRICES OBTAINED BY THE ONE-HIDDEN-LAYER FEEDFORWARD NEURAL NETWORK WITH SIX HIDDEN UNITS FOR THE IMAGES NOT SEEN BY THE TRAINED NETWORK, (LEFT TABLE) WITH PRUNING AND (RIGHT TABLE) WITHOUT PRUNING (b = 12)
TABLE III
CONFUSION MATRICES BY VECTOR MATCHING (LEFT TABLE FOR TRAINING, RIGHT TABLE FOR TESTING)
TABLE IV
CONFUSION MATRICES BY FIXED-SIZE NEURAL NETWORK (LEFT TABLE FOR TRAINING, RIGHT TABLE FOR TESTING)
TABLE V
COMPARISON AMONG THE BEST RECOGNITION RESULTS OBTAINED BY THREE DIFFERENT RECOGNITION METHODS
2) It can be seen from Figs. 8-13 and 15 that our proposed network
training with and without pruning results in very similar training
SSEs, generalization SSEs, and recognition rates. However, by
invoking pruning, the number of input-side weights is reduced by
approximately 30%, resulting in a much smaller network.
One-hidden-layer feedforward neural networks with four to eight
hidden units are found to have sufficient computational capability to
represent the mapping from the feature space to the facial expression
space of the images.
3) Tables I and II demonstrate that the confusion matrices corresponding
to the expressions anger and sadness clearly emphasize the challenges
that the facial expression recognition system faces in distinguishing
between these two expressions.
IV. CONCLUSION
In this paper, the application of an adaptive constructive one-hiddenlayer feedforward neural network to facial expression recognition was
considered. It was shown that the proposed constructive algorithm can
produce one-hidden-layer feedforward neural networks with much
reduced number of hidden units and input-side weights in comparison
with the backpropagation-based neural network constructed in [27],
while yielding an improved recognition rate. In all the experimental
results presented, it was revealed that the input-side weight pruning
technique proposed results in smaller networks, while simultaneously
providing similar performances when compared to their fully connected
network counterparts.
ACKNOWLEDGMENT
The authors would like to thank M. Oda, Ritsumeikan University
[formerly with Advanced Telecommunications Research Institute International (ATR)], Kyoto, Japan, for providing them with the database
used in this work.
REFERENCES
[1] T. Ash, Dynamic node creation in backpropagation networks, Connection Sci., vol. 1, no. 4, pp. 365375, 1989.
[2] Y. Chauvin, A back-propagation algorithm with optimal use of hidden
units, in Advances in Neural Information Processing, D. S. Touretzky,
Ed. San Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 642649.
[3] Y. Le Gun, J. S. Denker, and S. A. Solla, Optimal brain damage, in
Advances in Neural Information Processing, D. S. Touretzky, Ed. San
Mateo, CA: Morgan Kaufmann, 1990, vol. 2, pp. 598605.