
Available online at www.sciencedirect.com

Expert Systems with Applications 36 (2009) 804–819
www.elsevier.com/locate/eswa

A local experts organization model with application to face emotion recognition

Jia-Jun Wong, Siu-Yeung Cho *

Forensics and Security Laboratory (ForSe Lab), Division of Computing Systems, School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798, Singapore

Abstract

This paper presents a novel approach for recognizing human facial emotion, with a view to the further detection of suspicious human behavior. Instead of relying on a relatively poor representation of facial features in a flat vector form, the approach represents a facial emotional state as a tree structure with Gabor feature representations. A novel local experts organization (LEO) model is proposed for processing this tree-structure representation. The motivation for the LEO model is to cope with feature vectors of inconsistent length when some features fail to be detected. The proposed LEO model is inspired by the hierarchical structure found in natural organizations, where workers (local experts) report to their supervisor (fusion classifier), who in turn reports to upper management (global fusion classifier). Moreover, an Asian emotion database was created. The database contains high-resolution images of 153 Asian subjects posing six basic pseudo-emotions (excluding the neutral expression) in three different poses, and is used to evaluate the proposed system. Empirical studies were conducted to benchmark our approach against other well-known classifiers applied to the same system, and the results show that our approach is the most robust and the least affected by noise from the feature locators in the face emotion recognition system.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Face emotion tree structure; Emotion recognition; Local experts organization model; Support vector machines; Polynomial classifier

1. Introduction

In a human society, relationships with friends, colleagues and family (Ekman, 2004) are carefully maintained through the ability to understand, interpret and react to emotions. Emotions positively affect intelligent functions such as decision making, perception and empathic understanding (Bechara, Damasio, & Damasio, 2000; Isen, 2000). Most people are able to interpret the emotions expressed by others most of the time, but some people lack this ability, such as those diagnosed along the autism spectrum (Baron-Cohen, 1995). Capgras' syndrome patients (Ellis & Lewis, 2001), who have suffered head trauma severing the link between the visual cortex and the limbic system, believe that family and friends have been replaced by imposters because they are unable to feel any emotions from these persons. This is because the covert recognition process (emotion response stimuli) is disconnected while the overt recognition process (people identity nodes) remains connected (Ellis & Lewis, 2001). Ekman has identified six basic categories of emotions (Ekman, 2004): fear, anger, sadness, surprise, disgust and joy. Fig. 1 shows a subject expressing the six basic emotions by controlling his facial muscles. Such emotions are revealed through facial expression earlier than people verbalize, or even realize, their emotional states (Tian, Kanade, & Cohn, 2001). Ekman has also shown how facial expressions can be used to identify a lie (Ekman, 1991). Cognitive interpretations of emotions are known to be innate and universal to all humans regardless of culture (Ekman, 1999; Thompson, 1941). Besides this, detecting suspicious behavior through human emotions is an active research topic in the field of security. Wrongdoers with reason to falsify information or documents, or to carry illegal hidden objects, will be under stress and detection

* Corresponding author. Tel.: +65 6790 5491; fax: +65 6792 6559.
E-mail address: [email protected] (S.-Y. Cho).

0957-4174/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2007.10.030

Fig. 1. Six basic emotions (from left to right): anger, joy, sadness, surprise, fear and disgust (images taken from NTU Asian emotion database).

apprehension. They will exhibit some form of emotional betrayal of their intentions.

Models and automated systems have been created to recognize emotional states from facial expressions. The leading method, the facial action coding system (FACS) (Ekman & Friesen, 1978) for measuring facial movements in behavioral science, was developed by Ekman and Friesen in 1978. Other methods, such as electromyography, which directly measures the electrical signals generated by the facial muscles and deduces facial behavior from them, are both obtrusive and non-comprehensive. According to the survey in Ekman and Rosenberg (1997), FACS uses 46 defined action units, each corresponding to an independent motion of the face. However, this model takes over 100 h of training for a human expert to achieve minimal competency (Donato, Bartlett, Hager, Ekman, & Sejnowski, 1999). Faster automated approaches include the measurement of facial motion through optic flow (Mase, 1991; Rosenblum, Yacoob, & Davis, 1996) and the analysis of surface textures based on principal component analysis (PCA) (Lanitis, Taylor, & Cootes, 1997). Newer techniques include Gabor wavelets (Daugman, 1988), linear discriminant analysis (Belhumeur, Hespanha, & Kriegman, 1996), local feature analysis (Penev & Atick, 1996), and independent component analysis (Bartlett & Sejnowski, 1997). However, such methodologies may fail to handle a large and complex task like recognizing human emotion, because they either take too long to train or require too much memory.

Recently, a hierarchical representation, the so-called FacE emotion tree structure (FEETS), was proposed to represent a face image with features ranging from global to local. Adaptive processing of this FEETS representation was realized by a probabilistic neural network (PNN) model reported in Wong and Cho (2006) and Wong and Cho (2007), and the results showed that about a 90% recognition rate was achieved on a subset of the CMU facial action unit database (Cohn, Zlochower, Lien, Wu, & Kanade, 1997). However, such a neural-network-based system may take quite a long time to train (Cho, 2008; Cho, Chi, Siu, & Tsoi, 2003). In this paper, a new hierarchical model, namely the local experts organization (LEO) model, is proposed to reduce the time needed for training and testing relative to the original PNN model for emotion recognition.

Ekman's FACS (Ekman & Friesen, 1978) provided the fundamental foundation, which is similar to the concept of regions of interest (ROI). In the proposed emotion recognition system, the ROI features are captured in our proposed localized Gabor feature (LGF) vector, but such a static, flat LGF vector loses any possible relationships among the facial components. Therefore, transforming the LGF vector into the FEETS representation encodes the features with hierarchical relationship information, which allows the system to "reuse" knowledge and thus generalize with less training. When the LEO model is used, the low-level nodes of the FEETS may learn first; facial features in high-level nodes then share what was previously learned in the low-level nodes. This characteristic allows the proposed LEO model to reduce training time: for example, a system may take a large amount of time and memory to learn the low-level features, but once it has done so, it is able to learn the higher-level features in less time and with less memory.

The main contributions of this paper are summarized below:

1. A FacE emotion tree structure (FEETS) representation based on local Gabor features is proposed to represent and encode hierarchical information among features extracted from facial components.
2. A novel local experts organization (LEO) model is proposed to reduce the training time required for processing the FEETS compared with the original neural network model; the LEO model is able to recognize human emotion from facial expressions.
3. An Asian facial database representing the Asian population is developed for evaluating the proposed emotion recognition system.

This paper is organized as follows: Section 2 gives an overview of facial emotion recognition systems, describing the commonly used feature extraction and recognition techniques. Section 3 describes the extraction of local and global Gabor features and how the FEETS representation encodes the hierarchical information of faces. Section 4 illustrates the proposed local experts organization (LEO) model and how it is used to process the FEETS; the architecture, learning algorithms and kernel used in this model are also discussed. Section 5 describes the creation of the Asian emotion database. Detailed experimental results and

empirical studies are described and demonstrated in Section 6. Finally, conclusions are given in Section 7.

2. An emotion recognition system

This section describes the techniques commonly used for face image pre-processing, as well as those used in our system for implementing the various processing blocks of an emotion recognition system. The objective of the system is to extract relevant features from the facial image so that the emotions can be recognized from the extracted features. A typical recognition system comprises two essential blocks, i.e., the feature extraction block and the feature recognition block. Our proposed emotion recognition system, as illustrated in Fig. 2, contains various processing modules, such as face detection, eye detection, nose detection, etc., forming the feature extraction block, and an emotion recognition block providing the recognition of the emotional states. The details of these processing blocks are described in the following subsections.

2.1. Feature extraction

A face image can be captured from any form of video or image source. First, the face detection module is used to separate the face from the background as well as the body. Several well-known techniques exist, such as PLANNING by Kelly (1970) for the automatic extraction of head and body outlines from an image and, subsequently, the locations of the eyes, nose and mouth. A real-time face tracker (Yang & Waibel, 1996) using hue detection for skin color can be used for face detection when color images are available. The face detection step provides a rectangular head boundary, which includes the whole face area. Having correctly detected the face in the input image, the face is cropped from the background and resized to normalize it against the other images encoded by the system. Image processing techniques can be used to improve the image quality; for instance, noise removal through median filtering can be applied under noisy conditions. We also use simple histogram stretching to enhance the image contrast in our system.

For feature extraction, five main techniques (Aras, Subramanian, & Zhang, 2004) can be used, i.e., dimensionality reduction transforms, the discrete cosine transform (DCT), Gabor wavelets, spectrofaces and fractal image coding. The Karhunen–Loeve transform, known as the Principal Component Analysis expansion for representation (Kirby & Sirovich, 1990; Sirovich & Kirby, 1987), is used for dimensionality reduction. Linear discriminant analysis (LDA), Fisher discriminant analysis (FDA) and independent discriminant analysis (IDA) are perhaps the most popular techniques (Aras et al., 2004) for generating a set of the most discriminant features so that different classes of training data can be separated. The discrete cosine transform is not only used in JPEG image compression; it can also be used for feature extraction, as the DCT maps images from the spatial domain to the frequency domain by means of sinusoidal basis functions. Gabor wavelets are able to capture the properties of spatial localization, orientation selectivity, spatial frequency selectivity and quadrature phase relationship, so that they appear to be a good approximation to the filter response profiles encountered experimentally in cortical neurons (Daugman, 1985). Hence, they are a good feature extraction technique for face processing because of their biological relevance and computational properties (Liu & Wechsler, 2003). Fig. 3 shows the responses of two different facial images for four of the selected Gabor filters.

In our system, we propose to extract Gabor feature vectors comprising both global and local feature representations. The global features are said to be more accurate for frontal views of the face; they also provide a holistic analysis of the image and do not depend on the accurate location

[Fig. 2 block diagram: a captured image passes through image processing (face detection, face cropping and resizing); eyes, nose and mouth detection then provide feature locations; a Gabor filter yields localized Gabor features, which are transformed into the facial emotion tree structure representation and processed by the local experts organization model in the emotion recognition block to output the recognized emotion state.]
Fig. 2. The proposed face emotion recognition system.
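The simple histogram stretching used in the pre-processing stage (Section 2.1) can be sketched in a few lines. This is a minimal NumPy illustration of ours, assuming an 8-bit grayscale input; the paper does not give an implementation:

```python
import numpy as np

def histogram_stretch(img: np.ndarray) -> np.ndarray:
    """Linearly rescale pixel intensities to span the full 0-255 range."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                      # flat image: nothing to stretch
        return np.zeros_like(img, dtype=np.uint8)
    stretched = (img - lo) / (hi - lo) * 255.0
    return stretched.astype(np.uint8)

face = np.array([[50, 100], [150, 200]], dtype=np.uint8)
print(histogram_stretch(face))   # intensities now span 0..255
```

A median filter for noise removal, also mentioned in Section 2.1, would be applied before this step in a noisy capture environment.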



Fig. 3. Examples of Gabor wavelets and corresponding convoluted images.

of the fiducial points of the face. On the other hand, local features provide robust face recognition when the faces are in different postures. For localized feature extraction, four key fiducial points located at the eyes, nose and mouth need to be detected to form the basic reference locations of the face. Brunelli and Poggio (1993) used independently matched templates to find the four key fiducial points. We used a hierarchical component-based feature recognizer following the idea of Weyrauch and Huang (2003), which detects the locations of the four key fiducial points. The left, right, top and bottom regions of these four fiducial points are used to define the extended feature components.

2.2. Emotion recognition

After the feature vectors are obtained from the face images, emotion recognition classifiers are essential to make use of the feature vectors to discriminate and identify each of the emotions. Various pattern classifiers, such as support vector machines (SVM) (Platt, 1998), K-nearest neighbors (K-NN) (Aha & Kibler, 1991), the naïve Bayes algorithm (John George & Langley, 1995) and artificial neural networks, can be used for the recognition stage of the system. SVM is a learning technique developed by Vapnik (1995) that was strongly motivated by results of statistical learning theory. SVM operates on an induction principle, known as structural risk minimization, which minimizes an upper bound on the generalization error. The K-nearest neighbors method is a non-parametric pattern recognition technique, which generates k nearest-neighbor rules for classification. The naïve Bayes algorithm is based on Bayesian decision theory, a fundamental statistical approach to the problem of pattern classification; it quantifies the tradeoffs between various classification decisions using probabilities and the costs that accompany such decisions. Most of these techniques fail to generalize the relationship information between the features. Structural pattern classification techniques, such as the adaptive processing of tree structures (Sperduti & Starita, 1997), make use of the relationship information between features to recognize tree-structured patterns more robustly. This paper discusses how to model such tree-structured patterns for the recognition of face emotions.

3. FEETS representation

3.1. Localized Gabor feature extraction

A pre-defined global filter based on the 2D Gabor wavelet g(x, y) is used, which can be defined as follows (Marcelja, 1980):

g(x, y) = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}\right) + 2\pi jWx\right],   (1)

where W = U_h, \sigma_x = 1/(2\pi\sigma_u) and \sigma_y = 1/(2\pi\sigma_v), with

\sigma_u = \frac{(a-1)U_h}{(a+1)\sqrt{2\ln 2}}, \qquad \sigma_v = \tan\left(\frac{\pi}{2K}\right)\left[U_h - 2\ln\left(\frac{\sigma_u^2}{U_h}\right)\right]\left[2\ln 2 - \frac{(2\ln 2)^2\sigma_u^2}{U_h^2}\right]^{-1/2},

where K is the total number of orientations, a = (U_h/U_l)^{1/(s-1)} and s is the number of scales in the multi-resolution decomposition. U_l and U_h denote the lower and upper center frequencies, respectively.

Four primary feature locations are found by the feature detection algorithm (Viola & Jones, 2004), which provides the coordinates of the center of the left eye, the center of the right eye, the tip of the nose and the center of the lips, as shown in Fig. 4. The left and right eye feature locations are derived from the centers of the left and right eyes, denoted by the coordinates (x_LE, y_LE) and (x_RE, y_RE), respectively. The location of the nose bridge is the midpoint of the left and right eyes on the X-axis. The nose feature locations are derived from the location of the tip of the nose, denoted by (x_NS, y_NS). The lip feature locations are derived from the center-of-lips coordinate (x_LS, y_LS).

Given an image I(x, y), the Gabor wavelet transform is defined as follows:

Fig. 4. Four primary feature locations and the entire face region. Crosses denote the centers of the fiducial points; the rectangular box denotes the region of interest.

W_{mn}(x, y) = \int\!\!\int I(x_1, y_1)\, g_{mn}^{*}(x - x_1, y - y_1)\, dx_1\, dy_1.   (2)

The subscript m denotes the size of the filter bank in terms of the number of orientations, and the subscript n denotes its size in terms of the number of scales. We used m = 7 and n = 1 for our system. The localized Gabor feature F(x_F, y_F) can be expressed as a sub-matrix of the holistic Gabor wavelet output from Eq. (2):

F_{mn}(x_F, y_F) = W_{mn}\begin{bmatrix} (x_F, y_F) & \cdots & (x_{F+S}, y_F) \\ \vdots & \ddots & \vdots \\ (x_F, y_{F+S}) & \cdots & (x_{F+S}, y_{F+S}) \end{bmatrix},   (3)

where S defines the size of the feature area. The coordinates x_F and y_F are defined respectively as

x_F = x_{RF} + c,   (4)
y_F = y_{RF} + c,   (5)

where the subscript "RF" refers to the relative center location coordinates, which can be labeled "LE", "RE", "NS" or "LS" for the different facial components. The mean and standard deviation of the convolution output are used as the representation for classification purposes:

\mu_{mn} = \int\!\!\int |F_{mn}(x_F, y_F)|\, dx\, dy,   (6)

\sigma_{mn} = \sqrt{\int\!\!\int \left(|F_{mn}(x_F, y_F)| - \mu_{mn}\right)^2 dx\, dy}.   (7)

The localized Gabor feature (LGF) vector of each image can then be formed as

\tilde{X} = [\tilde{F}^0, \tilde{F}^1, \tilde{F}^2, \ldots, \tilde{F}^{60}].   (8)

Each feature \tilde{F}^n is a vector of features extracted using Eqs. (6) and (7) from the sub-matrix of the convolution output of the image with the Gabor filter bank. The superscript n denotes the set of features derived from each of the 60 feature location regions:

\tilde{F}^n = [\mu_{00}, \sigma_{00}, \mu_{01}, \sigma_{01}, \ldots, \mu_{mn}, \sigma_{mn}].   (9)

Each of the extended features is relative to, or an extension of, the known features, as shown in Fig. 5.

3.2. Gabor features to FEETS representation

Ekman's facial action coding system (FACS) (Ekman & Friesen, 1978) provided the foundation of dominant facial components of interest for recognizing human facial expression. In our study, we used a similar concept of facial components of interest to extract the facial features representing the human face. In addition, we used the tree structure to present the relationship information between the features from coarse to fine detail, with the features themselves represented by the localized Gabor feature vectors. In this approach, the facial emotion is represented by a five-level-deep tree structure model, as shown in Fig. 6: the entire face region acts as the root node, and the localized upper, lower, left, right and center regions of the face become the second-level branch nodes. At the third level, the forehead, eyes, nose, mouth and cheek areas become the corresponding branch nodes. At the fourth level, the forehead, eyes, eyebrows, nose, cheeks and mouth act as the branching nodes of the third-level nodes. Sub-detail features from the four key fiducial points form the leaves of the tree structure. The extracted features are grouped and attached to the corresponding nodes as shown in Fig. 6; the leaf nodes have been grouped together, with about 8–9 features in most of these groups. The actual tree structure would have more connecting branches and arcs. The arcs between each

Fig. 5. Sixty extended feature regions denoted by rectangular boxes at various detail levels (from left to right). (A) Level 2: upper, lower, left, right and center regions of the face. (B) Level 3: forehead, left and right eye, eyes, nose, mouth, left and right cheek and nostril. (C) Level 4: forehead, eyebrows, details of left and right eye, left and right cheek, left and right side of nose, nostril, mouth. (D) Level 5: details of left and right eye, details of nose, details of mouth and details of nose bridge.
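Eqs. (1), (2), (6) and (7) can be sketched numerically as follows. This is a simplified NumPy rendering of ours, not the authors' code: a Gabor kernel is sampled on a small grid, the image is correlated with its conjugate over the S × S region of interest (standing in for the convolution of Eq. (2)), and the mean and standard deviation of the response magnitude give one (μ, σ) pair of the LGF vector. The parameter values are illustrative, not the paper's.

```python
import numpy as np

def gabor_kernel(W=0.4, sigma_x=2.0, sigma_y=2.0, size=9):
    """Sample the 2D Gabor wavelet of Eq. (1) on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    envelope = np.exp(-0.5 * (x**2 / sigma_x**2 + y**2 / sigma_y**2))
    carrier = np.exp(2j * np.pi * W * x)          # complex modulation 2*pi*j*W*x
    return envelope * carrier / (2 * np.pi * sigma_x * sigma_y)

def lgf_pair(image, kernel, xF, yF, S):
    """Mean and std of |W_mn| over the S x S region at (xF, yF), Eqs. (6)-(7)."""
    k = kernel.shape[0] // 2
    padded = np.pad(image, k, mode="edge")
    resp = np.empty((S, S), dtype=complex)
    for dy in range(S):                           # evaluate the response on the ROI only
        for dx in range(S):
            patch = padded[yF + dy: yF + dy + 2 * k + 1,
                           xF + dx: xF + dx + 2 * k + 1]
            resp[dy, dx] = np.sum(patch * np.conj(kernel))
    mag = np.abs(resp)
    return mag.mean(), mag.std()

img = np.random.default_rng(0).random((32, 32))
mu, sigma = lgf_pair(img, gabor_kernel(), xF=10, yF=12, S=4)
print(f"mu={mu:.4f}, sigma={sigma:.4f}")
```

Repeating this for the seven orientations of the filter bank at one feature region yields the 14-element vector of Eq. (9); doing so over all 60 regions yields the full LGF vector of Eq. (8).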

[Fig. 6 tree diagram: level L1 is the whole face (root); L2 comprises the upper, lower, left, right and center face; L3 comprises the forehead, eyes and brows, nose and cheek, mouth and jaw, left/right eye and brow, left/right cheek and mouth, and nose, nostril and nasal root; L4 comprises the individual eyes, eyebrows, mouth, forehead detail, left and right sides of the nose, cheeks, and nostril and nasal root; L5 comprises the details of the left and right eyes, nose, mouth and nasal root.]
Fig. 6. A typical FEETS representation of a human face.
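The tree of Fig. 6 can be represented by a simple recursive node type. This is our own minimal sketch (names are illustrative, not from the paper): each node carries the LGF feature vector attached to it and a list of child nodes, which is the shape the LEO model later traverses bottom-up.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FeetsNode:
    """One node of the FacE emotion tree structure (FEETS)."""
    name: str
    features: List[float] = field(default_factory=list)   # (mu, sigma) pairs, Eq. (9)
    children: List["FeetsNode"] = field(default_factory=list)

    def depth(self) -> int:
        """Levels below (and including) this node; Fig. 6 uses five."""
        return 1 + max((c.depth() for c in self.children), default=0)

# a tiny fragment of Fig. 6: the root face with two level-2 regions
face = FeetsNode("Face", children=[
    FeetsNode("Upper Face", children=[FeetsNode("Forehead")]),
    FeetsNode("Lower Face", children=[FeetsNode("Mouth & Jaw")]),
])
print(face.depth())   # → 3
```

Because each node keeps its own feature list, a node whose detector failed can simply be omitted from `children`, which is the inconsistent-length situation the LEO model in Section 4 is designed to tolerate.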

of the nodes, corresponding to the feature relationships, are used to represent the semantics of the human facial components.

4. Local experts organization (LEO) model

4.1. Overview

This section describes the proposed novel local experts organization (LEO) model for the adaptive processing of tree structures in face emotion recognition. The motivation for the LEO model is that the feature length is not always consistent in a localized face feature extraction system: a similar FEETS representation can still be formed even when the local feature detection fails to detect, say, the centers of the two eyes, so the length of the feature vector is inconsistent because some features depend on eye locations that could not be extracted properly. The proposed LEO model is inspired by the hierarchical model found in natural organizations, where workers (local experts) report to their supervisor (fusion classifier), who in turn reports to upper management (global fusion classifier). Using this model, the system should be less affected when nodes are missing, as each node is made up of a network of local experts and a fusion classifier. Some nodes can learn to recognize nose and mouth features that allow the whole face to be recognized even if the node meant to learn the eye features was missed. Because memory in the LEO model is dynamic, it is not possible to decide in advance what a node should learn.

A partial FEETS representation is used in Fig. 7 to illustrate the LEO-based structural processing of the facial image from Fig. 6. The structure of the LEO model depends on how the features of an image are represented in a tree structure, as shown in Fig. 7. For the FEETS representation of a facial image, the number of child nodes depends on the number of sub-regions created from the respective facial components, so the LEO network structure has varying numbers of branches from each node to process all the input features. Each leaf node of the tree represents an individual object and the root node represents the whole image. In general, each feature attached to a node becomes an input attribute in the LEO model, and the depth of the tree becomes the hierarchy that converts the input into a tree structure representation (Tsoi, 1998). The output from each LEO node is the local regression relationship, and the parent node uses this information as one of its inputs. Based on the children's outputs, together with its own local expert (LE) output, the parent node determines its local winner.

[Fig. 7 diagram: a root LEO node (L1) with two child LEO nodes (L2A, L2B), each connected to further child LEO nodes (L3A, ...).]

Fig. 7. LEO model structure for processing a partial FacE emotion tree structure.

Fig. 8 shows the internal architecture of a LEO node, in which connections exist between the local experts (LE) and the fusion classifier (FC). In this architecture, the children's output can be expressed as

\tilde{y} = [y_{1,1}, \ldots, y_{1,m}, y_{2,1}, \ldots, y_{n,m}],   (10)

where m is the number of output classes and n is the number of children to which the parent node is connected. At the terminal/leaf nodes (the nodes that do not have any child nodes), we set \tilde{y} to be an n \times m zero vector. The input attributes to the LEO node can be expressed as

\tilde{u} = [u_1 \cdots u_i],   (11)

where i is the number of features for a given node. Based on Eq. (9), i = 14, as we extract two statistical features for each of the seven orientations used in the Gabor filter bank for a given node. The LEO node's output can be expressed as

\tilde{Y} = [P(\omega_1 \mid (\tilde{u}, \tilde{y})), P(\omega_2 \mid (\tilde{u}, \tilde{y})), \ldots, P(\omega_m \mid (\tilde{u}, \tilde{y}))],   (12)

where \omega denotes the output class. The output from each local expert (LE) can be expressed as the posterior probability of each class:

P(\omega_m \mid \tilde{u}) = \frac{p(\tilde{u} \mid \omega_m) P(\omega_m)}{p(\tilde{u})}.   (13)

The inputs to the fusion classifier can be expressed as

\tilde{v} = [p(\omega_1 \mid \tilde{u}), \ldots, p(\omega_m \mid \tilde{u}), \tilde{y}].   (14)

[Fig. 8 diagram: inside a LEO node, the children's output ỹ and the input attributes ũ enter the input layer, are processed by the local experts in the hidden layer, and are combined by the fusion classifier in the output layer to produce the output Y.]

Fig. 8. The internal architecture of a LEO node.

4.2. Parameters optimization

First, all parameters of the LEO model are initialized at a pre-defined level. Then, the parameters of the LEs are optimized in a node-by-node manner. The error of each node in the current state is compared with that of the previous iteration. If the difference between the two errors is smaller than zero, the optimization of the LE continues until the difference becomes greater than or equal to zero. The LE parameters are then stored, and processing proceeds to the fusion classifier (FC), which is optimized in the same manner as the LE. The overall accuracy is calculated at the root node. The difference between the error of the current state and that of the previous state is used to determine whether the optimization stops: once the difference is greater than or equal to zero, the optimization stops and the FC parameters are stored. The optimization framework of the LEO classifier is given below; the details of the local experts and the fusion classifier are described in the following subsections.

Algorithm 1. The optimization of the local experts organization

Begin
  Initialize: generate parameters at a pre-defined level for LE and FC
  While (optimize is NOT true)
    For each node in the entire tree
      Minimize LE error rates
      Minimize FC error rates
    End for
    If error rate is greater than the goal
    Then
      Adjust parameters for LE and FC
    Else
      Set optimize to true
  End while
End

4.3. Local experts

In this model, support vector machines (SVMs) are employed in the role of the local experts. A support vector machine is a hyperplane that separates a set of positive examples from the negative examples with maximum margin; it was invented by Vapnik in 1979 (Vapnik, 1995). Consider the problem of separating a set of training vectors belonging to two separate classes, (x_1, y_1) \ldots (x_N, y_N), where the instances \tilde{x}_i \in R^n have labels y_i \in \{-1, +1\}. The output of a linear SVM is u = \tilde{w} \cdot \tilde{x} - b, where \tilde{w} is the normal vector to the hyperplane, \tilde{x} is the input vector and b is the threshold. The separating hyperplane is u = 0, and it is optimal when the set of vectors is separated without error and with maximal margin.

Cortes and Vapnik (1995) defined the minimization problem that is equivalent to maximizing the margin as

\min_{\tilde{w}, b, \tilde{\xi}} \; \frac{1}{2}\|\tilde{w}\|^2 + C \sum_{i=1}^{N} \xi_i,   (15)

subject to y_i u_i \geq 1 - \xi_i, \forall i, where \xi_i are slack variables that permit margin failure and N is the number of training examples. According to Platt (1998), training can be expressed as a minimization over the dual Lagrange multipliers \alpha:

\min_{\alpha} \Psi(\alpha) = \min_{\alpha} \; \frac{1}{2}\alpha^T Q \alpha - e^T \alpha, \quad \text{subject to } 0 \leq \alpha_i \leq C, \; \forall i, \text{ and } \sum_{i=1}^{N} y_i \alpha_i = 0,   (16)

where e is the vector of all ones, C is the upper bound that must be greater than zero, and Q is an N \times N positive semidefinite matrix such that Q_{ij} \equiv y_i y_j K(\tilde{x}_i, \tilde{x}_j), where K(\tilde{x}_i, \tilde{x}_j) \equiv \phi(\tilde{x}_i)^T \phi(\tilde{x}_j) is the kernel function that measures the similarity or distance between the input vector \tilde{x}_i and the stored training vector \tilde{x}_j. The function \phi maps the training vectors \tilde{x}_i into a higher-dimensional space. The decision function is defined as

\mathrm{sgn}\left(\sum_{i=1}^{N} y_i \alpha_i K(\tilde{x}_i, \tilde{x}) - b\right),   (17)

where \tilde{x} is the input vector.

Eq. (16) forms a quadratic programming (QP) problem that arises from the SVM. Osuna, Freund, and Girosi's (1997) theorem proved that a large QP problem can be broken down into a series of small QP sub-problems. As long as at least one example that violates the Karush–Kuhn–Tucker (KKT) conditions is added to the examples of the previous sub-problem, each step reduces the overall objective function and maintains a feasible point that obeys all of the constraints. Therefore, a sequence of QP sub-problems that always adds at least one violator is guaranteed to converge. Osuna et al. (1997) also suggested keeping a constant-size matrix for every QP sub-problem, which implies adding and deleting the same number of examples at every step; a constant-size matrix allows training on arbitrarily sized data sets. Based on Osuna's theorem, Platt (1998) invented the learning algorithm called Sequential Minimal Optimization (SMO), which decomposes the overall QP problem into QP sub-problems and solves them with an analytic QP step. SMO chooses to solve the smallest possible optimization problem at every step, which involves two Lagrange multipliers to optimize jointly (Platt, 1998). SMO finds the optimal values for these multipliers and updates the SVM to reflect the new optimal values. The advantage of SMO over other algorithms lies in the fact that solving for two Lagrange multipliers can be done analytically, so numerical QP optimization is avoided. In addition, as SMO does not use any matrix computations, it requires no extra matrix storage (Platt, 1998). Thus, very large SVM training problems can fit inside the memory of an ordinary personal computer.

4.4. Fusion classifier

In our proposed LEO model, we use a form of polynomial classifier as the fusion classifier. The multivariate polynomial (MP) model, being tractable for optimization, […] explosion problem, and demonstrated good performance in multimodal biometric decision fusion applications (Toh, Yau, & Jiang, 2004). The general multivariate polynomial model is defined as follows:

g(k, x) = \sum_{i}^{K} k_i x_1^{n_1} x_2^{n_2} \cdots x_l^{n_l},   (18)

where the summation is taken over all non-negative integers n_1, n_2, \ldots, n_l for which n_1 + n_2 + \cdots + n_l \leq r, with r being the order of approximation. k = [k_1, \ldots, k_K]^T is the parameter vector to be estimated and x = [x_1, \ldots, x_l]^T is the regression vector containing the l inputs. K is the total number of terms in g(k, x).

A second-order bivariate polynomial model (r = 2 and l = 2) is given by

g(k, x) = k^T p(x),   (19)

where

k = [k_1 \cdots k_6]^T   (20)

and

p(x) = [1 \;\; x_1 \;\; x_2 \;\; x_1^2 \;\; x_1 x_2 \;\; x_2^2]^T.   (21)

Given m data points with m > K (assume K = 6), the least-squares error minimization is given by

\chi(k, x) = \sum_{i=1}^{m} [y_i - g(k, x_i)]^2 = [y - Pk]^T [y - Pk].   (22)

The parameter vector k can be estimated by

k = (P^T P)^{-1} P^T y,   (23)

where P \in R^{m \times K} denotes the Jacobian matrix of p(x):

P = \begin{bmatrix} 1 & x_{1,1} & x_{2,1} & x_{1,1}^2 & x_{1,1} x_{2,1} & x_{2,1}^2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{1,m} & x_{2,m} & x_{1,m}^2 & x_{1,m} x_{2,m} & x_{2,m}^2 \end{bmatrix}   (24)

and y = [y_1, \ldots, y_m]^T is the known inference vector from the training data. The first and second subscripts of the matrix elements x_{j,k} (j = 1, 2; k = 1, \ldots, m) indicate the input and the instance, respectively. From Eq. (22), we can form

\chi(k, x) = \sum_{i=1}^{m} [y_i - g(k, x_i)]^2 + \delta \|k\|_2^2 = [y - Pk]^T [y - Pk] + \delta k^T k,   (25)

where \|\cdot\|_2 denotes the l_2-norm and \delta is the regularization
mial (MP) model being tractable for optimization, where k  k2 denotes the l2-norm and d is the regularization
sensitivity analysis, and prediction of confidence intervals, constant.
provides an effective way to describe complex non-linear The following non-linear estimation model is usually
input–output relationship. However, for high-dimensional considered to significantly reduce the huge number of
and high-order systems, multivariate polynomial regression terms in multivariate polynomials:
becomes impractical due to its prohibitive number of prod- X
r
uct terms. Toh et al. proposed a parametric reduced multi- f ðk; xÞ ¼ k0 þ ðkj1 x1 þ kj2 x2 þ    þ kjl xl Þj ; ð26Þ
variate polynomial model to circumvent this dimension j¼1
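The least-squares machinery of Eqs. (19)–(25) is compact enough to exercise directly. The following is a minimal NumPy sketch (our own illustration, not the authors' code; all names are made up): it builds the Jacobian P of Eq. (24) for the second-order bivariate model and computes the regularized estimate corresponding to Eq. (29), with δ = 0 recovering the plain estimate of Eq. (23).

```python
import numpy as np

def design_matrix(X):
    """Row-wise basis p(x) = [1, x1, x2, x1^2, x1*x2, x2^2] (Eq. (21));
    stacking the rows of X gives the Jacobian matrix P of Eq. (24)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])

def fit(X, y, delta=1e-4):
    """Regularized least-squares estimate (P^T P + delta*I)^-1 P^T y."""
    P = design_matrix(X)
    return np.linalg.solve(P.T @ P + delta * np.eye(P.shape[1]), P.T @ y)

def predict(X, lam):
    """g(lambda, x) = lambda^T p(x) for each sample (Eq. (19))."""
    return design_matrix(X) @ lam
```

With m data points and K = 6 basis terms, the fit is well posed for m > K; the δ term guards against ill-conditioning.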
Here the weight parameters \lambda_{jk} (j = 1, \ldots, r; k = 1, \ldots, l) are part of the non-linear estimation model, l denotes the number of input dimensions, and r denotes the order of the model.

Toh, Tran, and Srinivasan (2004) highlighted that there is no guarantee that solutions derived from such a non-linear estimation model are optimal; the problem was overcome by using the Mean Value Theorem. The reduced multivariate (RM) model is written in the following form:

    f(\lambda, x) = \lambda_0 + \sum_{k=1}^{r} \sum_{j=1}^{l} \lambda_{kj} x_j^k + \sum_{j=1}^{r} \lambda_{rl+j} (x_1 + x_2 + \cdots + x_l)^j + \sum_{j=2}^{r} (\lambda_j^T \cdot x)(x_1 + x_2 + \cdots + x_l)^{j-1},    (27)

where l and r \ge 2. The number of terms in this model is defined as

    K = 1 + r + l(2r - 1).    (28)

Minimizing the objective function in Eq. (27), we may get

    \lambda = (P^T P + \delta I)^{-1} P^T y,    (29)

where y \in \mathbb{R}^{m \times 1} and I is a (K \times K) identity matrix.

The above least-squares minimization may suffer from an ill-posed problem if the dimension of the vector p(x) is much larger than the number of given data points, i.e., K > m. One way to resolve this problem is to have a local discriminator reduce the dimension. In our approach, an SVM is used as the local discriminator to produce a local regression output whose dimension is reduced below the number of training patterns, which fulfills the requirement of the RM model.

We express the output for each LEO node with respect to Eq. (29) as

    y_m = [f(\lambda_1, x), \ldots, f(\lambda_n, x)] = P\lambda,    (30)

where the largest element of y_0 is the output class, m is the node number, and m = 0 denotes the root node output. The vector x is defined as

    x = [p, y_1, y_2, \ldots, y_n],    (31)

where n is the number of children that the LEO node is connected to, and p is the vector of probability estimates from the multi-class SVM. In order to obtain a probability estimate from a multi-class SVM, the SVM must work in regression. However, the traditional SVM offers only classification, predicting a class label without any probability information. Lin and Weng (2004) have extended the SVM to work in regression and produce probability estimates as results. The pair-wise class probability estimates from this multi-class SVM can be expressed as

    r_{ij} \approx \frac{1}{1 + e^{Af + B}},    (32)

where A and B are estimated by minimizing the negative log-likelihood function using known training data and their decision values f. Wu, Lin, and Weng (2004) have shown in their second approach that the values of p_i can be obtained from the values of r_{ij} by solving the following optimization problem:

    J = \min_{p} \frac{1}{2} \sum_{i=1}^{k} \sum_{j: j \ne i} (r_{ji} p_i - r_{ij} p_j)^2,    (33)

subject to \sum_{i=1}^{k} p_i = 1 and p_i \ge 0, \forall i. The objective function can be formulated as

    J = \min_{p} \frac{1}{2} p^T Q p,    (34)

where

    Q_{ij} = \begin{cases} \sum_{s: s \ne i} r_{si}^2 & \text{if } i = j, \\ -r_{ji} r_{ij} & \text{if } i \ne j. \end{cases}    (35)

For this convex problem, the optimality condition, in which there is a scalar g, becomes

    \begin{bmatrix} Q & e \\ e^T & 0 \end{bmatrix} \begin{bmatrix} p \\ g \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix},    (36)

where e is a k \times 1 vector of all ones, 0 is a k \times 1 vector of all zeros, and g is the Lagrangian multiplier of the equality constraint \sum_{i=1}^{k} p_i = 1. Hence,

    p^T Q p = p^T Q(-g Q^{-1} e) = -g\, p^T e = -g,    (37)

and the solution p satisfies the condition

    Q_{tt} p_t + \sum_{j: j \ne t} Q_{tj} p_j - p^T Q p = 0 \quad \text{for any } t.    (38)

A radial basis function (RBF) kernel, K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}, is used in the multi-class SVM (Wu et al., 2004) for our proposed model. In this LEO architecture, the parameters C and \gamma for the multi-class SVM and the r parameter for the RM classifier are determined by the optimization framework, and the \delta parameter is set to a constant of 10^{-4}.

5. Asian emotion database

To the best of our knowledge, few investigations have been conducted on analyzing face emotion behavior among the different races of the Asian population. Most publicly available emotion databases contain images that are captured from video recordings or stored at low resolution. The closest Asian emotion database is the Japanese female emotion database (Lyons, Budynek, & Akamatsu, 1999), which contains 213 images of seven facial expressions (including neutral) posed by 10 Japanese actresses. A 3D facial expression database by Yin, Wei, Shun, Wang, and Rosato (2006) contains 100 subjects in various emotions from the various races found in America. Our database was designed to capture high-resolution 2D facial images of the various races, age groups and genders found in the Asian population, in seven emotional states and in three different poses. As Singapore is in the heart of Asia and has a high mixture of different
races in Asia, we have collected our data at our University. Currently, the database is open to the public and can be freely downloaded from: http://www.ntu.edu.sg/sce/labs/forse/Asian%20Emotion%20Database.htm.

5.1. Creation of the database

We set up a photograph station, as shown in Fig. 9, at a public venue, and invited volunteers (both female and male) of all races and age groups from the public to participate in our data-collection exercises. The images were captured using a five-megapixel Ricoh R1V digital camera at ISO 64 settings with the camera's flash. Subjects sat 143 cm from the camera against a white background. We cropped and scaled the original 2560 × 1920-pixel facial images to around 900 × 1024 pixels for archival. We annotated, processed, and stored each image in 24-bit color Bitmap format as a ground truth.

Fig. 9. During the image collection process, the volunteer, sitting against a white background on the left of the picture, performs an expression. The camera is placed 143 cm from the subject. The photographer, seated on the right of this picture, gives instructions to the subject for performing the various expressions and poses. The subject seen here is in the frontal pose position. The subject is asked to turn 45° to the left and right for the two other poses.

Each subject was instructed to sit in front of the camera. They were requested to perform seven expressions, i.e., Neutral, Anger, Joy, Sadness, Surprise, Fear, and Disgust. As the digital camera is unable to capture dynamic facial expressions, we required the subject to hold the expression for a short period. Ideally, a video clip could be used for eliciting a genuine emotional state from the subjects. However, it is difficult to provide such a setup, especially for emotions such as sadness and fear (Sebe et al., 2004). As Cowie et al. (2001) note, displays of intense emotion or "pure" primary emotions rarely happen. Due to the time constraints and the limitations of the setup, we asked the subjects to perform pseudo-emotions. The subjects were also asked to position themselves 45° to the left and right, so we could capture these expressions in half-profile poses, as shown in Fig. 10.

Fig. 10. Subject in six different expressions in three different poses.

5.2. Statistics of participants

The Asian emotion database contains around 4947 images of seven different facial expressions and three different poses from 153 subjects, who participated in the data-collection exercises over a period of four days. Out of the 153 subjects, 64 have given consent for their images to be used in publications. The Asian emotion database has facial images from various races, including Chinese, Malay, Indian, Thai, Vietnamese, Indonesian, Iranian and Caucasian. About 72% of the subjects belong to the Chinese race, 7% Indian, 7% Vietnamese, 5% Caucasian, 4% Malay and 4% others. Table 1 shows the detailed distribution of our database according to race and gender. Table 2 shows that the majority of the subjects, about 72%, were in the 20–29-year-old age group, as well as the detailed breakdown of the database according to race and age group. Due to the time constraints of our participants, we were only able to
get 27 of them to pose for the two half-profile poses for all expressions.

Table 1
Distribution of subjects according to race and gender

             Female  Male  Total
Chinese        33     77    110
Malay           5      2      7
Indian          3      8     11
Vietnamese      1     10     11
Caucasian       1      7      8
Others          3      3      6
Total          46    107    153

Table 2
Distribution of subjects according to race and age group

Age (years)  <20  20–29  30–39  40–49  ≥50  Total
Chinese        4     80     20      6    0    110
Malay          0      6      0      2    0      7
Indian         1      5      2      2    1     11
Vietnamese     0     11      0      0    0     11
Caucasian      0      4      1      3    0      8
Others         3      3      0      0    0      6
Total          8    109     23     12    1    153

Table 3
Performance of the FEETS/LEO model against other classifiers

                                    LEO   PRNN  SVM   C45   RBF   NBM
Dataset A (subject dependent), %    77.3  62.5  52.9  45.8  30.5  16.1
Dataset B (subject independent), %  57.0  56.8  50.0  41.4  31.7  15.7

LEO – local experts organization, PRNN – probabilistic recursive neural network, SVM – support vector machine, C45 – decision tree, RBF – radial basis function, NBM – naïve Bayesian multinomial.

Fig. 11. Examples of feature detection errors: (A) feature location is five pixels or less off from the ideal center of features; (B) features are 6–10 pixels off from the ideal center of features; (C) 11–15 pixels off from the ideal center of features; (D) more than 16 pixels off the ideal center of features.

Fig. 12. Chart showing the performance of the classification models for detecting error in the fiducial point locations. (LEO – local experts organization, PRNN – probabilistic recursive neural network, SVM – support vector machine, KNN – K-nearest neighbor, C45 – decision tree, RBF – radial basis function, NBM – naïve Bayesian multinomial.)

Fig. 13. (A) Subject without any artifact, (B) subject wearing a veil, (C) subject wearing sunglasses.

6. Experiments and results

We evaluated our system using our created Asian emotion database. We used all 4947 images from our database, containing 153 persons in six basic emotions and one neutral emotion. For this experiment, we created two datasets, i.e., a subject-dependent dataset (dataset A) and a subject-independent dataset (dataset B). The subject-dependent dataset shows how well the system performs when it knows how each subject's pseudo-emotions look. The training set contains 3901 images of all 153 subjects in various emotions; the testing set uses the remaining 1046 images. The subject-independent dataset evaluates the performance of the system when it is used in a situation where prior knowledge of the subject is not available, as the system does not know how the evaluated subject's pseudo-emotions will look.
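The subject-independent protocol (dataset B) can be made concrete with a short sketch. This is our own NumPy illustration, not the authors' code, and the function name is made up: it partitions image indices by subject so that no subject appears on both the training and testing sides of any fold.

```python
import numpy as np

def subject_independent_folds(subject_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which the sets of subjects are
    disjoint, so the classifier never sees the evaluated subject's
    pseudo-emotions during training (the 'dataset B' protocol)."""
    subjects = np.unique(subject_ids)
    rng = np.random.default_rng(seed)
    rng.shuffle(subjects)
    for test_subjects in np.array_split(subjects, n_folds):
        test_mask = np.isin(subject_ids, test_subjects)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]
```

Running all folds in turn reproduces the cross-validated setting in which every subject is tested exactly once.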
We performed 5-fold cross-validation in this evaluation, so that all the subjects are trained and tested.

We benchmarked our proposed method against the probabilistic recursive neural network (PRNN) for processing FEETS (Wong & Cho, 2006; Wong & Cho, 2007). We also performed empirical studies using other well-known classifiers, such as the support vector machine (SVM) (Platt, 1998), the C4.5 tree (C45) (Quinlan, 1993), the Gaussian radial basis function network (RBF network), and the naïve Bayesian multinomial (NBM) classifier (Mccallum & Nigam, 1998), which were run from the Weka package (Witten & Frank, 2005). The SVM used the same polynomial kernel function as the proposed LEO model, with the complexity parameter at 1.0 and the gamma parameter at 0.01. The C4.5 decision tree model used 3-fold tree pruning with a confidence factor of 0.25. All of these tested classifiers were used with flat-vector inputs, such that some regularities inherently
Fig. 14. Likelihood histograms showing the LEO output response for the Fig. 13 scenarios: (A) subject without any artifacts, (B) subject wearing a veil, (C) subject wearing sunglasses.
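The likelihood outputs of Fig. 14 ultimately derive from the pairwise-coupled SVM probability estimates of Section 4.4. The coupling step, Eqs. (33)–(36), can be solved directly as a small linear system; the following NumPy sketch is our own illustration under that reading (names are made up), not the authors' code.

```python
import numpy as np

def couple_probabilities(r):
    """Given pairwise estimates r[i, j] ~ P(class i | class i or j) (Eq. (32)),
    recover the class probabilities p by solving the KKT system of Eq. (36):
    [[Q, e], [e^T, 0]] [p; g] = [0; 1], with Q defined as in Eq. (35)."""
    k = r.shape[0]
    Q = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i == j:
                Q[i, i] = sum(r[s, i] ** 2 for s in range(k) if s != i)
            else:
                Q[i, j] = -r[j, i] * r[i, j]
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = Q
    A[:k, k] = 1.0   # e
    A[k, :k] = 1.0   # e^T
    b = np.zeros(k + 1)
    b[k] = 1.0       # enforces sum(p) = 1
    sol = np.linalg.solve(A, b)
    return sol[:k]
```

When the pairwise estimates are mutually consistent, the recovered p reproduces them exactly; otherwise the solve returns the least-squares compromise of Eq. (33).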
[Fig. 15 comprises six bar-chart panels – Left Eye missed, Right Eye missed, Nose missed, Mouth missed, Both Eyes missed, and Nose & Mouth missed – each plotting the classification rate (0–80%) of LEO, PRNN, SVM, C45, RBF and NBM.]

Fig. 15. Performance results of the missing-features evaluation by the LEO model against other models for dataset A (subject dependent). (LEO – local experts organization, PRNN – probabilistic recursive neural network, SVM – support vector machine, C45 – decision tree, RBF – radial basis function, NBM – naïve Bayesian multinomial.)
associated with the tree structures were broken, and less significant generalization results were yielded. Some of those models might suffer from poor convergence, resulting in a relatively low classification rate. Table 3 summarizes the performance of the LEO model against the other classification models for perfect feature-location detection in the face emotion recognition system. Two datasets, subject dependent and subject independent, were used, and the results show that the LEO model yields a better performance than the other models.

An evaluation of how well our system performs when there is noise in the accuracy of the feature detectors, i.e., error in locating the center of features, was conducted. Fig. 11 shows some examples of error in locating the center of features at various degrees of error. Under normal situations, error levels of less than 15 pixels off the center of the feature locations should be considered a common error in feature detection processing. We benchmarked our system's performance for this type of noise, using dataset A, against the other classification models. The results shown in Fig. 12 demonstrate that the hierarchical learning model, i.e., the LEO model, is the most robust and the least susceptible to this type of noise. They also show that traditional classification models may suffer from the feature-detection inaccuracy existing in the facial emotion recognition system.

A further evaluation of the system against extreme conditions, where feature detection fails completely, was conducted. Feature loss can be considered a high-occurrence error for any facial or image recognition problem, as subjects might not always be in perfect view of the camera, for example, due to object occlusion or self-occlusion. Fig. 13 shows some extreme situations in which the feature detectors will fail. Fig. 13b shows a subject wearing a veil, which causes the nose and mouth detectors to fail to detect the corresponding feature locations. Similarly, wearing sunglasses, as shown in Fig. 13c, causes the eye detector to fail to detect the eye features. These scenarios evaluate the system's robustness when features are lost due to failure to detect feature locations. In this experiment, undetected features were padded with zeros in order to retain the same length for the feature vector, as well as the shape of the FEETS representation. The system was evaluated against various degrees of loss, i.e., single
[Fig. 16 comprises six bar-chart panels – Left Eye missed, Right Eye missed, Nose missed, Mouth missed, Both Eyes missed, and Nose & Mouth missed – each plotting the classification rate (0–60%) of LEO, PRNN, SVM, C45, RBF and NBM.]

Fig. 16. Performance results of the missing-features evaluation by the LEO model against other models for dataset B (subject independent). (LEO – local experts organization, PRNN – probabilistic recursive neural network, SVM – support vector machine, C45 – decision tree, RBF – radial basis function, NBM – naïve Bayesian multinomial.)
eye missing, nose missing, mouth missing, as well as multiple feature loss, i.e., both eyes missing, and nose and mouth missing. Using the perfect feature locations and full features as the training set, we evaluated the system when tested with missing features. Fig. 14 shows the corresponding responses from the LEO model for each of the scenarios in Fig. 13. The 'Joy' emotion is clearly detected by the proposed LEO model even in the situations where features are hidden by the artifacts. Figs. 15 and 16 show the performance of the LEO model against the other approaches using dataset A and dataset B, respectively. Experiments on dataset B were run under 5-fold cross-validation to ensure that all the subjects are trained and tested in turns. The experimental results show that the proposed LEO model is more robust when features are lost. We observe that the performance of the LEO model relative to the others is better when features are missing than when features are selected from a wrong region, as shown in Fig. 12. This is because, under a missing-feature condition, the LEO model can remove the affected node from the FEETS representation. A recommendation is drawn that feature locations with low confidence levels should be treated as undetected locations, as deviation in the feature location generates more noise in the system during evaluation.

Moreover, the scalability of the LEO model was investigated by training and testing the model using different numbers of persons in the training and test data. We tested the performance with 1, 25, 50, 100 and 153 persons in the training and test data, where the experiment was performed using dataset A. Fig. 17 shows that the LEO model is more scalable than the other models. We observe that most of the classification models are able to obtain a high recognition rate when the number of persons is small (perhaps fewer than 10). However, there is a huge performance difference when the number of subjects is increased to 25. This is due to the increase in the number of variations created by various people for their pseudo-emotion expressions. We observed that the LEO model remains robust as the number of subjects increases, since the feature-to-feature relationship is preserved in the FEETS representation.

We evaluated whether our proposed LEO model meets the objective of reducing the training time needed for processing the FEETS representation. The LEO model was benchmarked in terms of the time required for training against the PRNN model as well as the other classification models, and the result in Fig. 18 shows that the LEO model required about 2291 s, as compared to 57,934 s for the PRNN model.

[Fig. 18 comprises a pie chart of training time in seconds: PRNN 57,934 s (95%), LEO 2291 s (4%), SMO 340 s (1%), C45 181 s (0%).]

Fig. 18. Chart showing the amount of time taken during training for the two models. Note that the code in these models is not optimized, and runs on both the Matlab platform and the Weka package. (LEO – local experts organization, PRNN – probabilistic recursive neural network, SMO – sequential minimal optimization, C45 – decision tree.)

[Fig. 17 plots the recognition rate (30–100%) of LEO, PRNN, SVM, C45, RBF and NBM against the number of persons (1, 25, 50, 100, 153).]

Fig. 17. Scalability performance of the models using different numbers of persons in the training and testing set. (LEO – local experts organization, PRNN – probabilistic recursive neural network, SVM – support vector machine, C45 – decision tree, RBF – radial basis function, NBM – naïve Bayesian multinomial.)
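The zero-padding scheme used in the missing-feature experiments (undetected features padded with zeros so that the feature-vector length and the FEETS shape stay fixed) can be sketched as follows. This is our own illustration, with an assumed array layout and made-up names; the paper does not specify the exact data structures.

```python
import numpy as np

def pad_missing_features(features, n_nodes, feat_len):
    """Assemble a fixed-shape array from per-landmark feature vectors,
    filling undetected landmarks with zeros so the representation keeps
    a consistent length and tree shape. `features` maps a node index to
    the 1-D feature vector of a detected landmark."""
    out = np.zeros((n_nodes, feat_len))
    for node, vec in features.items():
        out[node] = vec
    return out
```

An all-zero row then marks a node whose landmark failed to be detected, which a downstream model can also use to prune the affected node from the tree.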
7. Conclusion

This paper describes a novel local experts organization (LEO) model for processing the FacE emotion tree structure (FEETS). The LEO model is introduced to solve the two latent problems of using conventional connectionist models for tree-structure-based representations, i.e., long training time and inconsistency across training cycles due to the use of random initialization parameters. Unlike conventional connectionist and statistical approaches that rely on static representations of data resulting in vectors of features, the approach proposed in this paper allows patterns to be properly represented by directed graphs or trees. These patterns are subsequently processed using specific tree-structure models. The rationale for the proposed hybrid approach is that a local expert (LE) has the capability of learning in high-dimensional spaces, while a fusion classifier (FC) is used for branch-level decision making. The two are incorporated together to process the tree-structure representation in face emotion recognition. The proposed recognition system was evaluated using an Asian emotion database that represents the Asian population. In the empirical studies, the performance of both subject-dependent and subject-independent emotion recognition was benchmarked against various well-known classifiers, covering both conventional connectionist approaches (e.g., C4.5 and RBF) and statistical approaches (e.g., NBM and SVM). The promising performance achieved by the LEO model illustrates that it can be used for structural pattern recognition across a variety of situations. Moreover, the results showed that the LEO model is superior to the PRNN model in training time for processing the FEETS representation (i.e., only 4% of the CPU time was taken by the LEO, as opposed to 95% by the PRNN). The robustness of the system was also tested with errors in fiducial points, as well as under extreme conditions where fiducial points were missing. It is also concluded that the LEO model is the most scalable in a relatively large database system for face emotion recognition. Further studies should be carried out to enhance the effectiveness of self-evolving learning, as such an enhanced model would have significant use in real-time classification and regression problems.

Acknowledgements

We would like to thank all the volunteers who participated in and contributed to the creation of the Asian emotion database. We would also like to thank Muhammad Raihan Jumat Bin Md. T.J. for processing and encoding the Asian emotion database.

References

Aha, D., & Kibler, D. (1991). Instance based learning algorithms. Machine Learning, 6, 37–66.
Aras, S., Subramanian, A. K., & Zhang, Z. (2004). Face recognition. In CSE717 lecture notes. University of Buffalo.
Baron-Cohen, S. (1995). Mindblindness: An essay on autism and theory of mind. Cambridge, MA: MIT Press.
Bartlett, M. S., & Sejnowski, T. (1997). Viewpoint invariant face recognition using independent component analysis and attractor networks. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Neural information processing systems. Cambridge, MA: MIT Press.
Bechara, A., Damasio, H., & Damasio, A. R. (2000). Emotion, decision making and the orbitofrontal cortex. Cerebral Cortex, 10(3), 295–307.
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1996). EigenFaces vs FisherFaces: Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 711–720.
Brunelli, R., & Poggio, T. (1993). Face recognition: Features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 1042–1052.
Cho, S. Y. (2008). Probabilistic based recursive model for adaptive processing of data structures. Expert Systems with Applications, 32(2), 1403–1422.
Cho, S. Y., Chi, Z., Siu, W. C., & Tsoi, A. C. (2003). An improved algorithm for learning long-term dependency problems in adaptive processing of data structures. IEEE Transactions on Neural Networks, 14(4), 781–793.
Cohn, J. F., Zlochower, A. J., Lien, J. J., Wu, Y.-T., & Kanade, T. (1997). Automated face coding: A computer-vision based method of facial expression analysis. In 7th European conference on facial expression measurement and meaning (pp. 329–333).
Cortes, C., & Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20, 273–297.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., et al. (2001). Emotion recognition in human–computer interaction. IEEE Signal Processing Magazine, 18(1), 32–80.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional cortical filters. Journal of Optical Society of America, 2(7), 1160–1167.
Daugman, J. G. (1988). Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36, 1169–1179.
Donato, G., Bartlett, M. S., Hager, J. C., Ekman, P., & Sejnowski, T. J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10), 974–989.
Ekman, P. (1991). Telling lies. New York: W.W. Norton.
Ekman, P. (1999). Facial expressions. In T. Dalgleish & M. Powers (Eds.), Handbook of cognition and emotion. New York: John Wiley and Sons.
Ekman, P. (2004). Emotions revealed. First Owl Books. New York: Henry Holt and Company LLC.
Ekman, P., & Friesen, W. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologist Press.
Ekman, P., & Rosenberg, E. L. (1997). What the face reveals: Basic and applied studies of spontaneous expression using the facial action coding system (FACS). New York: Oxford University Press.
Ellis, H. D., & Lewis, M. B. (2001). Capgras delusion: A window on face recognition. Trends in Cognitive Science, 5, 149–156.
Isen, A. M. (2000). Positive affect and decision making. In Handbook of emotions (pp. 417–435). New York: Guilford Press.
John George, H., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In The eleventh conference on uncertainty in artificial intelligence (pp. 338–345). San Mateo: Morgan Kaufmann.
Kelly, M. D. (1970). Visual identification of people by computer. Technical Report AI-130, Stanford AI Project, Stanford, CA.
Kirby, M., & Sirovich, L. (1990). Application of the Karhunen–Loeve procedure for characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 103–108.
Lanitis, A., Taylor, C., & Cootes, T. (1997). Automatic interpretation and coding of face images using flexible models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 743–756.
Lin, C.-J., & Weng, R. C. (2004). Simple probabilistic predictions for support vector regression. Technical Report, Department of Computer Science, National Taiwan University.
Liu, C., & Wechsler, H. (2003). Independent component analysis of Gabor features for face recognition. IEEE Transactions on Neural Networks, 14(4), 919–928.
Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12), 1357–1362.
Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. Journal of Optical Society of America, 70, 1297–1300.
Mase, K. (1991). Recognition of facial expression from optical flow. IEICE Transactions E, 74(10), 3474–3483.
Mccallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. In International conference on machine learning (pp. 41–48).
Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. In Proceedings of the 1997 IEEE workshop, neural networks for signal processing (pp. 276–285).
Penev, P. S., & Atick, J. J. (1996). Local feature analysis: A general statistical theory for object representation. Network: Computation in Neural Systems, 7(3), 477–500.
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods – Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Quinlan, R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann Publishers.
Rosenblum, M., Yacoob, Y., & Davis, L. (1996). Human expression recognition from motion using a radial basis function network architecture. IEEE Transactions on Neural Networks, 7(5), 1121–1138.
Sebe, N., Lew, M., Cohen, I., Sun, Y., Gevers, T., & Huang, T. (2004). Authentic facial expression analysis. In International conference on automatic face and gesture recognition (pp. 517–522), Seoul, Korea.
Sirovich, L., & Kirby, M. (1987). Low-dimensional procedure for the characterization of human face. Journal of the Optical Society of America, 4, 519–524.
Sperduti, A., & Starita, A. (1997). Supervised neural networks for
Thompson, J. (1941). Development of facial expression of emotion in blind and seeing children. Archives of Psychology, 37, 1–47.
Tian, Y.-L., Kanade, T., & Cohn, J. F. (2001). Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 1–18.
Toh, K.-A., Tran, Q.-L., & Srinivasan, D. (2004). Benchmarking a reduced multivariate polynomial pattern classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6), 740–755.
Toh, K.-A., Yau, W.-Y., & Jiang, X. (2004). A reduced multivariate polynomials model for multi-modal biometrics and classifiers fusion. IEEE Transactions on Circuits and Systems for Video Technology, 14(2), 224–233.
Tsoi, A. C. (1998). Adaptive processing of data structure: An expository overview and comments. Technical Report, Faculty of Informatics, University of Wollongong, Wollongong, Australia.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
Viola, P., & Jones, M. (2004). Robust real-time object detection. International Journal of Computer Vision, 57(2), 137–154.
Weyrauch, B., & Huang, J. (2003). Component-based face recognition with 3D morphable models. Lecture notes in computer science (Vol. 2688, pp. 27–34).
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann.
Wong, J.-J., & Cho, S.-Y. (2006). A brain-inspired framework for emotion recognition. Neural Information Processing: Letters and Reviews, 10(7), 169–179.
Wong, J.-J., & Cho, S.-Y. (2007). A brain-inspired model for recognizing human emotional states from facial expression. In R. Kozma & L. Perlovsky (Eds.), Neurodynamics of higher-level cognition and consciousness (pp. 233–254). Berlin: Springer.
Wu, T.-F., Lin, C.-J., & Weng, R. C. (2004). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5, 975–1005.
Yang, J., & Waibel, A. (1996). A real-time face tracker. In IEEE workshop on applications of computer vision (WACV'96) (p. 142), Sarasota, FL, USA.
Yin, L., Wei, X., Shun, Y., Wang, J., & Rosato, M. J. (2006). A 3D facial
classification of structures. IEEE Transactions on Neural Networks, 8, expression database for facial behavior research. In 7th International
714–735. conference on automatic face and gesture recognition.