
Received February 22, 2019, accepted March 20, 2019, date of publication March 25, 2019, date of current version April 11, 2019.


Digital Object Identifier 10.1109/ACCESS.2019.2907327

Efficient Facial Expression Recognition Algorithm Based on Hierarchical Deep Neural Network Structure

JI-HAE KIM1, BYUNG-GYU KIM1, (Senior Member, IEEE), PARTHA PRATIM ROY2, (Member, IEEE), AND DA-MI JEONG1
1 Department of IT Engineering, Sookmyung Women's University, Seoul, South Korea
2 School of Computer Science and Engineering, IIT Roorkee, Roorkee, India
Corresponding author: Byung-Gyu Kim ([email protected])
This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the
Ministry of Education under Grant NRF-2016R1D1A1B04934750.

ABSTRACT With the continued development of artificial intelligence (AI) technology, research on interaction technology has become more popular. Facial expression recognition (FER) is an important type of visual information that can be used to understand a human's emotional situation. In particular, the importance of such systems has recently increased due to advancements in research on AI robots. In this paper, we propose a new scheme for an FER system based on hierarchical deep learning. The feature extracted from the appearance feature-based network is fused with the geometric feature in a hierarchical structure. The appearance feature-based network extracts holistic features of the face using the preprocessed LBP image, whereas the geometric feature-based network learns the coordinate changes of action unit (AU) landmarks, which correspond to the facial muscles that mainly move when making facial expressions. The proposed method combines the results of the softmax function of the two features by considering the error associated with the second highest emotion (Top-2) prediction result. In addition, we propose a technique to generate facial images with neutral emotion using an autoencoder. With this technique, we can extract the dynamic facial features between the neutral and emotional images without sequence data. We compare the proposed algorithm with other recent algorithms on the CK+ and JAFFE datasets, which are widely used benchmark datasets in facial expression recognition. The ten-fold cross-validation results show an accuracy of 96.46% on the CK+ dataset and 91.27% on the JAFFE dataset. Compared with other methods, the proposed hierarchical deep network structure improves the accuracy by up to about 3% (1.3% on average) on the CK+ dataset, and by up to about 7% (about 1.5% on average) on the JAFFE dataset.

INDEX TERMS Artificial intelligence (AI), facial expression recognition (FER), emotion recognition, deep
learning, LBP feature, geometric feature, convolutional neural network (CNN).

The associate editor coordinating the review of this manuscript and approving it for publication was Peter Peer.

I. INTRODUCTION
Technologies for communication have traditionally been developed based on the senses that play a major role in human interaction [1]. In particular, artificial intelligence voice recognition technology using the sense of hearing and AI speakers has been commercialized because of improvements in artificial intelligence (AI) technology [2]. Through the use of such technologies that recognize voice and language, there are artificial intelligence robots that can interact closely with real life, in such ways as managing the daily schedules of people and playing their favorite music. However, sensory acceptance is required for interactions more precisely mirroring those of humans. Therefore, the most necessary technology is a vision sensor, as vision is a large part of human perception in most interactions. In artificial intelligence robots using interactions between a human and a machine, human faces provide important information as a
clue to understand the current state of the user. Therefore, the field of facial expression recognition has been studied extensively over the last ten years.

Recently, with the increase of relevant data and the continued development of deep learning, facial expression recognition systems which accurately recognize facial expressions in various environments have come to be actively studied. Facial expression recognition systems (FERs) are fundamentally based on an ergonomic and evolutionary approach. Based on universality, similarity, physiological, and evolutionary properties, emotions in FER studies can be classified into six categories: happiness, sadness, fear, disgust, surprise, and anger. In addition, emotions can be classified into seven categories with the addition of a neutral emotion [1], [3].

A FER system for recognizing facial expressions requires four steps. First, we need a face detection step that localizes the human face. Representative algorithms include Adaboost [4], Haar cascade [5], and the histogram of oriented gradients (HOG) [6]. The second step involves a face registration with which to obtain the main feature points in order to recognize face rotations or muscle movement. The faces after the detection step are inclined to be degraded in terms of recognition accuracy due to the potential for various illuminations and rotations. Therefore, it is necessary to improve the image by obtaining landmarks, which are the positions of the main muscle movements when one is making a facial expression. The positions that define the contraction of the facial muscles are called action units (AUs), and the main positions include the eyebrows, eyes, nose, and mouth [7]. A typical algorithm is active appearance models (AAM) [8]. Third, features that can recognize facial expressions are extracted by acquiring the motion or position information of the feature points in the feature extraction step.

To this end, the approach can be divided into appearance feature-based and geometric feature-based methods. The appearance feature-based method is a feature extraction method for the entire facial image. It involves the method of dimension reduction through a fusion with binary feature extraction, which is widely applied in the field of facial studies. Principal component analysis (PCA) and linear discriminant analysis (LDA) are typical dimension reduction methods. The local binary pattern (LBP) and local directional pattern (LDP) techniques are binary feature extraction methods [9], [10] for presenting facial expression. The geometric feature-based method extracts the geometric position of the face or the value of the change in facial muscle movement. Finally, based on the obtained features, a classification step is needed to classify the defined emotions using a support vector machine (SVM) and the hidden Markov model (HMM) [11], [12].

Despite the fact that many algorithms have been studied, some problems still remain, such as illumination changes, rotations, occlusions, and accessories [3]. These are not only classical problems involved in image processing, but also factors that cause hardship for capturing action units in facial recognition. Aside from environmental changes, there is a problem with the lack of appropriate datasets.

In this paper, we propose an efficient algorithm to improve the recognition accuracy by a hierarchical deep neural network structure which can re-classify the result (Top-2 error emotion). The feature extracted from the appearance feature-based network is fused with the geometric feature in a hierarchical structure. The proposed scheme combines two features to obtain a more accurate result by considering the error associated with the second highest emotion (Top-2) prediction result.

The rest of this paper is organized as follows: In Section II, we discuss various existing algorithms for FER. Section III presents the proposed FER algorithm using deep learning based on the appearance feature and the geometric feature. The experimental results are reported in Section IV. Finally, the concluding remarks are presented in Section V.

II. RELATED WORKS
We describe related works of facial expression recognition systems that have been studied to date. These algorithms can largely be divided into the classical feature extraction method and the deep learning-based method. The classical feature extraction methods can be roughly classified further into the appearance feature extraction method, which extracts the features of the entire facial region, and the geometric feature extraction method, which extracts geometric elements of the facial structure and motion of the facial muscles. In the following sub-sections, we will describe some of the recent algorithms involving the appearance feature extraction method, the geometric feature extraction method, and deep learning-based facial expression recognition algorithms.

A. CLASSICAL FEATURE FER APPROACHES
Liu et al. [13] and Happy and Routray [14] reported representative FER algorithms using LBP. In Liu et al. [13], the active patches were defined around 68 landmarks extracted by the active appearance model (AAM), and the features were extracted using LBP for the patches. In this way, they eliminated unnecessary parts of the face and reduced the effect of environmental change. This improved the accuracy by using the more robust features obtained from the main facial muscles with patch centering. In Happy and Routray [14], rather than using other existing algorithms that extract facial landmarks, they detected the points of the eyebrows, eyes, nose, and mouth corners by applying the Sobel edge, the Otsu algorithm, morphological operations, etc. By defining the active facial patches and extracting the LBP histogram feature of each patch, the features of 19 patches were extracted from the main facial muscles such as the forehead, nose, and mouth, which move when facial expressions are generated. The local directional ternary pattern (LDTP) feature extraction method based on the two largest directions after performing a matrix operation with the Robinson compass masks was used by Ryu et al. [15]. This algorithm formed a 17 × 17 block around 42 landmarks selected by the active pattern
through LDTP, and extracted an LDTP histogram from it. Through this process, robust features were extracted, and more accurate emotional recognition was achieved by extracting more information from the strong response regions.

When emotions are expressed, the formation of patches around the mainly changing facial muscles and the application of this information to the appearance-based feature extraction technique have recently expanded to contribute to the development of various algorithms applying deep learning.

A typical algorithm that uses geometric features extracts the temporal or dynamic changes of the landmarks of the face. Kotsia and Pitas [16] used the Candide wireframe model to predict facial emotions by extracting geometric features around face landmarks. In this algorithm, the grid was traced in the sequential dataset. The geometric and dynamic information regarding the emotion changes was extracted from the difference between the first neutral face grid and the peak emotion grid of the last frame. Finally, when classifying the final emotion, the emotion was classified using a support vector machine (SVM) by combining the values with the facial action units (FAUs), including the facial change according to the emotions.

A dynamic texture-based approach to classifying emotions using a free-form deformation (FFD) technique for tracking the motion direction of AUs in image sequences was proposed by Koelstra et al. [17]. The extracted representation based on motion history was used to derive the motion direction histogram descriptor in the spatial and temporal domains. The extracted features were finally combined with the GentleBoost algorithm and HMM in order to classify the emotions.

These geometric feature extraction methods can reduce the effect of the degradation of accuracy due to illumination or external change by tracking the movement of geometric coordinates extracted from the main AUs. Therefore, those geometric feature-based algorithms have been studied by many researchers in order to improve accuracy by the fusion of the feature extraction methods.

B. CNN-BASED FER APPROACHES
Recently, due to the development of big data and the improvement of hardware technology, many algorithms based on deep learning have been researched. Since the field of FER is being influenced by these advancements as well, more robust and efficient feature recognition has been achieved through the automatic learning of the extracted facial features. In this section, we introduce the CNN-based FER algorithms.

Lopes et al. [18] suggested a representative facial expression algorithm applying deep learning based on CNN. It takes a data augmentation process to resolve the scarcity of FER datasets and to make facial emotions robust to changes such as rotation and translation. In this algorithm, except for the parts with unnecessary elements around the face, the AUs are cropped into blocks at the center of the action unit, and the emotions are classified into six to seven emotions through CNN. In such algorithms, the lack of datasets required for deep learning algorithms is solved by augmentation methods, and FER research based on CNN is being actively studied.

The FER approaches using dual networks, which fuse both holistic features of the face and partial features focused on facial landmarks, have been studied in Jung et al. [19], Xie and Hu [20], and Yang et al. [21]. In these approaches, one CNN network extracted features using facial gray scale images while the other network extracted features using image patches or landmark changes. Finally, these features are fused by a weighted function or fully-connected learning.

A method of extracting temporal features and spatial features to combine softmax and predict emotions was used by Zhang et al. [22]. The temporal features were extracted so as to learn temporal landmarks using the part-based hierarchical bidirectional recurrent neural network (PHRNN), and the multi-signal convolutional neural network (MSCNN) involved extracting holistic features that capture the overall facial appearance. The PHRNN classifies the facial landmarks into each AU and hierarchically learns the movements of AUs, which change with time. The MSCNN learns the facial gray scale images in order to extract the entire appearance features. Using these two networks, more accurate facial expression recognition could be made possible by considering both temporal and spatial features.

Despite many studies, the recognition rate is still not high enough, due to the influence of various environmental changes such as lighting and accessories as well as the difference in the characteristics of individual people. Therefore, in this paper, we would like to propose an efficient FER scheme with robust features combining two types of features using deep learning.

III. PROPOSED ALGORITHM
The proposed algorithm is shown in Figure 1. In this study, we propose an efficient algorithm to improve the recognition accuracy by a hierarchical deep neural network structure which can re-classify the result (Top-2 error emotion), which is the most frequent error. The first network learns a convolutional neural network which focuses on AUs using the LBP feature, which is a typical feature extraction technique in the field of facial studies [9]. The second network extracts the geometric changes of the landmarks of each AU and learns all pairs of the six emotions. Based on the two features, the proposed algorithm combines them using an adaptive weighting function to give the final result.

A. OBSERVATION: ERROR RATIO OF TOP-2 SELECTION
The ratio of the correct answer was measured by using the 6-length softmax results in order to determine the cause of errors and the ratio of the facial recognition errors when only using a single network. The experiment was performed using two datasets with 150 images per emotion using the appearance feature-based CNN. This experiment was conducted by using a 10-fold cross validation method.

The entire dataset was divided into 10 sets, 9 of which were employed for training while the other one was used for
FIGURE 1. The overall procedure of the proposed FER algorithm.

TABLE 1. Top-2 error rate in the CK+ dataset.

TABLE 2. Top-2 error rate in the JAFFE dataset.

verification. The softmax results for six emotions were sorted in descending order, and the rank of the entry corresponding to the true answer was counted. The number of counts was divided by the total number of points of validation data in the fold, and the ratio was measured at each fold. The results are shown in Tables 1 and 2.
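To make this measurement concrete, the short sketch below (illustrative code, not taken from the paper) ranks the true label of each validation image within its sorted softmax scores and accumulates the per-rank ratios for one fold; the array and function names are assumptions for the example.

```python
import numpy as np

def top_k_rates(softmax_outputs, true_labels, n_classes=6):
    """Fraction of validation images whose correct label is the k-th highest
    softmax score (k = 1 means a correct Top-1 prediction)."""
    counts = np.zeros(n_classes)
    for probs, label in zip(softmax_outputs, true_labels):
        order = np.argsort(probs)[::-1]                 # classes by descending score
        rank = int(np.nonzero(order == label)[0][0])    # position of the true label
        counts[rank] += 1
    return counts / len(true_labels)

# Example for one fold: rates[0] is the Top-1 ratio, rates[1] the Top-2 ratio, etc.
# rates = top_k_rates(fold_softmax, fold_labels)
```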
As shown in Tables 1 and 2, the case of including the second highest label in CK+ was observed at 4.2% when Top-2 ∼ Top-6 were considered as errors and Top-1 as the correctly predicted result. The ratio of Top-2 came to 4.7%, which covers 90.5% of the total error. In addition, the JAFFE dataset had the largest error rate of Top-2 at 8.0%, which occupies 75.3% of the 10.6% total error.

As a result, we can see that the error is biased toward the Top-2 error: most errors occur within the Top-2 range. In all datasets, Top-2 errors occurred at a rate of more than 82% of the total error. From this viewpoint, we can consider a structure to improve the recognition accuracy if we reduce the Top-2 error rate by a refined classification.

To design an efficient scheme, robust features are extracted using hierarchical fusion of the two types of networks shown in Figure 1. The facial expression is predicted as one of six emotions: anger, disgust, fear, happy, sad, and surprise. First, the appearance feature-based network uses the LBP image in order to learn the holistic characteristics of the face in one frame. Secondly, the geometric feature-based network learns the change of eighteen x and y coordinates among the facial landmarks, which mainly move according to emotional changes. In the predicted result of the appearance feature-based network, the two highest softmax values among the emotions are weighted with the results of the geometric feature-based network, respectively. Then, robust features are generated by fusing the different types of features. Finally, we can obtain a predicted final emotion.

In the following sub-sections, each module will be explained in detail.

B. PREPROCESSING
Before the main process of facial expression recognition, it is necessary to identify a face and recognize the facial area. Therefore, the face and non-face parts must be separated through the face detection process. Only preserving the important parts for emotion recognition prevents accuracy degradation due to changes in the environment surrounding the face. In this paper, we used the face detector model which was learned through P. F. Felzenszwalb et al. [23]: it is a detector that uses the HoG algorithm to determine the face boundary coordinates. We cropped the facial area using this detector. In this algorithm, a linear SVM was used to identify the facial region by training HoG features from positive (containing an object) and negative (not containing an object) samples composed of sliding windows. The algorithm can be used as the get_frontal_face_detector defined in dlib [24] in order to identify and crop the facial area coordinates of left-top x, y and right-bottom x, y. The cropped facial images usually appear from the middle of the forehead to the chin, and from the leftmost face to the rightmost face.

After the facial region has been cropped, a blurring process is performed before creating the LBP image in order to remove noise, which is the input value of the appearance feature-based network. If the features are extracted from unfiltered facial images, it may lead to a
degraded performance. In addition, since the importance of the AUs, which play a major role in emotion change, is higher than that of other facial parts, they are converted to LBP images after preprocessing.

FIGURE 2. LBP images with applied preprocessing: (a) LBP images with bilateral blurring preprocessing, (b) LBP images without preprocessing.

The result of using only the important parts extracted based on AUs is shown in Figure 2. We used a preprocessing technique involving bilateral blurring [25]. This filtering technique is one of the image analysis methods known as edge-preserving smoothing. It performs Gaussian blurring while preserving the edge where the correlation with similar peripheral values is small.
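As a concrete illustration of this preprocessing step, the following sketch uses dlib's get_frontal_face_detector (the HoG-based detector cited above) together with OpenCV's bilateral filter. The resize target and the filter parameters (9, 75, 75) are illustrative assumptions, not values reported in the paper.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()  # HoG + linear SVM detector from dlib [24]

def preprocess_face(bgr_image, size=128):
    """Crop the facial area and apply bilateral blurring before LBP encoding.

    A minimal sketch: the resize target and the filter parameters are
    illustrative choices, not values reported in the paper.
    """
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)                 # upsample once to catch smaller faces
    if len(faces) == 0:
        return None
    r = faces[0]                              # left-top x, y and right-bottom x, y
    crop = gray[max(r.top(), 0):r.bottom(), max(r.left(), 0):r.right()]
    crop = cv2.resize(crop, (size, size))
    # Edge-preserving smoothing: Gaussian-like blur that keeps strong edges.
    return cv2.bilateralFilter(crop, 9, 75, 75)
```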
C. THE APPEARANCE FEATURE-BASED NETWORK
The appearance feature-based CNN is a process for extracting the holistic features of the faces. In the proposed algorithm, we used LBP images, which are robust in the FER system. This is a representative feature extraction technique used in the field of facial studies, because it extracts the texture of the main facial AU movements with a simple structure. It compares the 8 neighbor pixels of a 3 × 3 block with the center pixel, as shown in Figure 3.

FIGURE 3. Example of encoding a LBP pattern.

Each pixel is encoded as a 1 if it is brighter than the center pixel, and as a 0 when it is darker. Then, these values are connected to each other in the clockwise direction from the upper left corner, and converted into decimal numbers to be used as feature values of a LBP image. These binary values are referred to as Local Binary Patterns or LBP codes [26]. The formula is written as the following:

LBP_{P,R}(x_c, y_c) = Σ_{p=0}^{P−1} s(i_p − i_c) · 2^p,    (1)

where LBP_{P,R} means calculating the number of P neighboring pixels among the pixels in the radius of R from the center pixel. LBP_{8,1} is used in the proposed method. (x_c, y_c) is the center coordinate of the block for making the LBP pattern. i_p and i_c denote the gray scale values of the neighboring pixels and the center pixel, respectively. s(x), which converts the difference from the center pixel to a binary number, is defined as the following:

s(x) = 1 if x ≥ 0, and s(x) = 0 if x < 0.    (2)

The binary code extracted from the above equation is converted into a decimal number in order to form an LBP image. These images occupy less computational complexity and lower capacity than the original images, which enables learning and execution at higher speeds. These points are a great advantage, and they are used in facial recognition and facial emotion recognition because they can extract facial texture with good performance [26], [27]. For this reason, the LBP images are employed as input to the appearance feature-based network.
Each pixel is encoded as a 1 if it is brighter than the center As shown in Figure 4, the appearance feature-based net-
pixel, and as a 0 when it is darker. Then, these values are work has a 128 × 128 size input, and passes through the
connected to each other in the clockwise direction from the convolution layers and the pooling layers a total of three
upper left corner, and converted into decimal numbers to be times. The first convolutional layer performs a convolutional
used as feature values of a LBP image. These binary values operation using 5 × 5 size kernel with the number of 4. Next,
are referred to as Local Binary Patterns or LBP codes [26]. in the first pooling layer, the max-pooling, which involves
The formula is written as the following: selecting one pixel among the pixels in a 2 × 2 block, is pro-
cessed. The kernel size was experimentally determined as
P
X 5 × 5 in order to fit 128 × 128 input sizes so as to effectively
LBPP,R (xc , yc ) = s(ip − ic )2p , (1) extract the image feature map while considering the typical
p=0
application of a 3 × 3 kernel for a 64 × 64 size input. As a
where LBPP,R means calculating the number of P neighbor- result, the 64 maps of 128 × 128 size obtained in the previous
ing pixels among the pixels in the radius of R from the center step are changed to 64 maps of 64 × 64 sizes, which represent
pixel. LBP8,1 is used in the proposed method. (xc , yc ) is the a reduction in size by half.

VOLUME 7, 2019 41277


J.-H. Kim et al.: Efficient FER Algorithm Based on Hierarchical Deep Neural Network Structure

FIGURE 4. The proposed appearance feature-based CNN structure.

In a similar way, the convolutional and pooling operations are repeated three times. Finally, 256 maps of 16 × 16 size are derived as a result of the last pooling. After the convolutional and pooling layers are finished, these values are flattened and passed through the fully connected layers, which are the hidden layers. The first fully connected layer has 1024 nodes, while the second has 500 nodes.

In the proposed network, we use the dropout operation between the fully connected layers. When the network is learning, it turns off neurons at random, thus disturbing the learning. This can prevent an over-fitting that is biased toward the learning data [29]. We use the rectified linear unit (ReLU) as the activation function between the convolutional layer and the fully connected layer. This activation function is a step for converting the quantitative value of the feature map from the convolution operation into a nonlinear value. At the end of the network, six emotions are extracted as continuous values using the softmax function. The formula of the softmax result for six emotions is as the following:

s_i = e^{a_i} / Σ_{k=0}^{n−1} e^{a_k},    (4)

where n corresponds to the number of emotions requiring classification and s_i is the softmax function score of the i-th class. This value is the exponential value of the emotion score divided by the sum of the exponential values of the a_k's, which are the values of the entire category. The error is computed by the network through this process, and the error is reduced by using the cross entropy loss function. The calculation of the loss is considered as the following:

L = − Σ_{j=0}^{n−1} y_j · log(s_j),    (5)

where y_j is the j-th element of the correct answer vector. Using the cross entropy function, it is possible to flexibly respond to various probability distributions of the model by obtaining the cross entropy through a negative log-likelihood. In addition, the process of finding a gradient is relatively simple as well [30]. To classify six categories, which is n, if the first element is correct, y = [1, 0, 0, 0, 0, 0], y_1 = 1 and the others are zero. s_j is the output value of the softmax function. In addition, we use stochastic gradient descent (SGD) as an optimizer along with the calculated cross entropy loss.
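Putting the pieces above together, the following Keras sketch is one possible realization of the appearance feature-based network. The filter counts of the three convolution stages and the dropout rate are assumptions; the text only fixes the 128 × 128 input, the 5 × 5 kernels, the 2 × 2 max-pooling, the 1024- and 500-node fully connected layers, and the softmax, cross-entropy, and SGD training setup.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_appearance_cnn(n_emotions=6):
    """Sketch of the appearance feature-based CNN (filter counts assumed)."""
    model = keras.Sequential([
        keras.Input(shape=(128, 128, 1)),                # preprocessed LBP image
        layers.Conv2D(64, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                          # 128 -> 64
        layers.Conv2D(128, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                          # 64 -> 32
        layers.Conv2D(256, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(2),                          # 32 -> 16: 256 maps of 16 x 16
        layers.Flatten(),
        layers.Dense(1024, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_emotions, activation="softmax"),  # Eq. (4)
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy",       # Eq. (5)
                  metrics=["accuracy"])
    return model
```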
The results of the two emotions with the highest values among the softmax results extracted using this learned model are later used for more accurate emotion prediction by fusing with the result of the geometric feature. Thus, the label information for these two high emotions is transmitted to the geometric feature-based network.

D. THE GEOMETRIC FEATURE-BASED NETWORK
We considered both the appearance feature and the geometric feature in order to reduce recognition errors by using more robust features. In the case of using only one network, the recognition accuracy is inclined to be low because of various factors, such as rotations, illuminations, and peripheral accessories. Further, in the case of fine emotional change, it is difficult to recognize emotion only using the holistic features of the face.

In this paper, we additionally use a geometric feature-based CNN that captures the movements of the landmarks of emotion. The feature of the partial elements obtained by detecting the movement of the landmarks is added to the overall features so that more robust features can be extracted. Furthermore, we detected and demonstrated that the facial expression recognition error most frequently occurred in the emotion of the second highest probability when only using the appearance network. Based on this observation, the Top-2 values with the highest values among the results of the 6-class probabilities composed of the last softmax layer are selected. Finally, the emotion is classified through the max result of the two-network fusion by weighted sum calculation.

For extracting geometric features which contain the dynamic features of a face, a neutral facial image of the person depicted in the reference facial image is required. However, in a real system, there are not enough neutral facial images. There are also some FER datasets which do not have enough neutral image data. In this case, we need to obtain enough neutral image data to learn the dynamic feature of the facial expression. We suggest the autoencoder to generate neutral image data. This network can be used to learn the geometric feature-based network by obtaining the difference of coordinates between the generated neutral facial image and the emotional image, and to create dynamic features. The proposed autoencoder technique is presented in Figure 5.

This network is constructed from the VGG19 network structure [28]. The images of neutral faces can be generated using this structure. It can be divided into the encoding and
FIGURE 5. Autoencoder structure for generating a neutral image.

decoding processes. First, in the encoding process, the input facial image with emotion is passed to the pool3 layer of VGG19. Next, one convolution layer and one max-pooling layer are applied one more time, and then the features are compressed through two fully connected layers of 4096 nodes.

In the decoding process, the fully connected layer of the 4096 extracted nodes is reshaped, and the upsampling and convolution processes are repeated in the reverse order of the encoding step. In this process, an output value equal to the input size is derived. The error function for reducing the loss is obtained by calculating the difference between the input facial image with emotion and the neutral facial image of the same person as in the input image, unlike the usual case, in which the autoencoder narrows the difference between the input and output values. The error function can be written as:

error = MSE(G_n, X_n),    (6)

where G_n is the neutral facial image generated from the facial image with emotion, and X_n is the neutral facial image in the existing datasets. The ground truth X_n is of the same person as in the input image. The formula used for MSE to reduce this error is:

MSE = (1/m) · Σ_{i=0}^{m−1} (g_i − x_i)^2,    (7)

where m is the number of pixels in an image, and g_i and x_i are the i-th pixel of the generated neutral image and the i-th pixel of the ground truth image, respectively. In this way, the autoencoder model is constructed so as to generate neutral facial data for input facial images by reducing the loss.

FIGURE 6. Generated neutral images with the autoencoder (Top: facial images with an emotion, Bottom: neutral facial images).

The neutral facial images generated through this process are shown in Figure 6. The autoencoder was trained for about 170 epochs over a week and used an Adam optimizer with a learning rate of 0.01. The CK+ and JAFFE datasets with neutral image data were used as training data. Using the generated neutral images, dynamic features can be extracted using the geometric feature-based CNN, even though we do not have enough such image data.
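The sketch below outlines such an encoder-decoder in Keras under the description above. The filter counts of the extra encoder stage and of the decoder, the input resolution, and the output activation are assumptions; the text only fixes VGG19 up to pool3, one additional convolution and max-pooling stage, two 4096-node fully connected layers, the MSE loss of Eqs. (6)-(7), and the Adam optimizer with a learning rate of 0.01.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_neutral_face_autoencoder(input_shape=(128, 128, 3)):
    """Sketch of the neutral-face generator (several layer sizes assumed)."""
    vgg = keras.applications.VGG19(include_top=False, weights=None,
                                   input_shape=input_shape)
    pool3 = vgg.get_layer("block3_pool").output          # encode up to pool3

    x = layers.Conv2D(512, 3, padding="same", activation="relu")(pool3)
    x = layers.MaxPooling2D(2)(x)                        # one more conv + max-pool
    x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)
    code = layers.Dense(4096, activation="relu")(x)      # compressed representation

    h = input_shape[0] // 16                             # spatial size after 4 poolings
    x = layers.Dense(h * h * 128, activation="relu")(code)
    x = layers.Reshape((h, h, 128))(x)
    for filters in (256, 128, 64, 32):                   # upsample back to input size
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    neutral = layers.Conv2D(input_shape[2], 3, padding="same",
                            activation="sigmoid")(x)

    model = keras.Model(vgg.input, neutral)
    # Loss of Eqs. (6)-(7): pixel-wise MSE against the neutral image of the
    # same person, trained with Adam at a learning rate of 0.01.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01), loss="mse")
    return model
```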
The geometric feature-based CNN is a process which involves capturing dynamic changes in facial expressions and extracting geometric features of landmarks. In the proposed algorithm, 68 landmarks of a face defined by the iBUG 300-W face landmark dataset [31] were obtained using a real-time landmark extraction technique which used the ensemble of regression trees of Kazemi [32]. Among the obtained landmarks, we calculate the difference of the main AUs' coordinates, such as the eyebrows, eyes, and nose, between the neutral facial image and the emotional facial image. The coordinates are shown in Figure 7.

FIGURE 7. Landmarks using geometric feature-based CNN.

As shown in Figure 7, we selectively use the landmarks of the main AUs, such as the eyebrow edges, eye periphery, and mouth edges, among the landmarks defined as 1 to 68. First, for the eyebrows, we used 22 and 23, which are the points nearest to the middle of the forehead, as well as 18 and 27, which correspond to the ends of the eyebrows. In addition, we use 20 and 25 in the center of the landmarks of each eyebrow area. Next, for the eyes, we use 40 and 43, which are closest to the nose, and both ends of the eyes at 37 and 46.

In order to obtain each center of the eyes, the points except for both ends of the eyes are averaged over the top and bottom eyelids and used as the center coordinate of each eye. At the edges of the lips, 49 and 55 are used at the ends of the lips, and the center of the upper lip 52 and the center of the lower lip 58 are used.

As a result, a total of 36 x and y coordinates are constructed in order to extract the geometric features. The following is a vector used to calculate the difference between the coordinates of the neutral facial image and the facial
image with emotion:

V_e = [x_e^0 − x_n^0, y_e^0 − y_n^0, ..., x_e^{35} − x_n^{35}, y_e^{35} − y_n^{35}],    (8)

where V_e is the geometric vector of emotion e. Each x_e and y_e coordinate is listed in order from left to right and top to bottom, and the eye-edge and lip coordinates are constructed in clockwise order: top, right, bottom, and left. The 36-length vector extracted from the above equation is learned through a CNN constructed by borrowing from the VGG16 network [33]. The structure of the CNN is as shown in Figure 8.
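For illustration, the following sketch builds the difference vector of Eq. (8) with dlib's 68-point shape predictor (an ensemble-of-regression-trees model in the spirit of Kazemi [32]); the model file name, the eye-center computation, and the exact set of retained indices follow the description above but should be read as assumptions.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # standard dlib model (assumed)

# 1-based iBUG 300-W indices quoted in the text (dlib itself is 0-based).
BROWS = [18, 20, 22, 23, 25, 27]
EYE_CORNERS = [37, 40, 43, 46]
LIPS = [49, 52, 55, 58]

def selected_points(gray):
    """Return the selected AU coordinates of one face as an (N, 2) array."""
    rect = detector(gray, 1)[0]
    shape = predictor(gray, rect)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], float)
    idx = [i - 1 for i in BROWS + EYE_CORNERS + LIPS]       # convert to 0-based
    left_center = pts[[37, 38, 40, 41]].mean(axis=0)        # average of points 38, 39, 41, 42
    right_center = pts[[43, 44, 46, 47]].mean(axis=0)       # average of points 44, 45, 47, 48
    return np.vstack([pts[idx], [left_center, right_center]])

def geometric_vector(neutral_gray, emotion_gray):
    """Difference vector of Eq. (8): emotion coordinates minus neutral ones."""
    return (selected_points(emotion_gray) - selected_points(neutral_gray)).ravel()
```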
FIGURE 8. Geometric feature-based CNN structure.

TABLE 3. Fifteen pairs of six emotions for the geometric feature-based model (AN: Angry, DI: Disgust, FE: Fear, HA: Happy, SA: Sad, SU: Surprise).

Through the network composed as shown in Figure 8, one of the results among the fifteen pairs shown in Table 3 is weighted in the Top-2 result of the appearance feature-based CNN. The feature extraction is efficiently performed using part of the VGG16 structure, which has been verified as having a high performance in classification studies. In addition, the geometric feature can be learned by maintaining the individual geometric order as well as the shape of the components.

In the first convolution layer, Conv1_1, the feature maps are generated by performing a convolutional operation using 64 kernels of 3 × 1 size. The convolution operation is also performed in the same way in Conv1_2, and then max-pooling is applied using a kernel of 2 × 1 size. Next, Conv2_1 is operated using 128 kernels of 3 × 1 size, and this process is repeated in Conv2_2 and Conv2_3 as well. Next, the max-pooling process using the 2 × 1 kernel size is applied. Then, 256-kernel convolution operations are repeated in Conv3_1, Conv3_2, and Conv3_3 using the same kernel size as Conv2. Following the Pool3 operation, the feature is flattened and connected to a fully-connected layer of 500 nodes, and we then obtain the softmax results of the two categories.

We employ ReLU as the activation function in the same way as in the appearance feature-based network, and dropout is used to overcome the overfitting problem. The loss function for learning is calculated using the cross entropy loss function in the same way as in the appearance feature-based network. The model for each pair is stored, and the model corresponding to the Top-2 pair from the result of the softmax of the previously obtained appearance model is selected. In this way, we consider a weighting function to determine the final emotion.
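A hedged Keras sketch of this geometric feature-based CNN is shown below; treating the 36-length difference vector as a one-channel 1-D signal, as well as the dropout rate, are assumptions beyond the layer configuration given above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_geometric_cnn():
    """Sketch of the geometric feature-based CNN (VGG16-like 1-D convolutions)."""
    model = keras.Sequential([
        keras.Input(shape=(36, 1)),                                 # Eq. (8) difference vector
        layers.Conv1D(64, 3, padding="same", activation="relu"),    # Conv1_1
        layers.Conv1D(64, 3, padding="same", activation="relu"),    # Conv1_2
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 3, padding="same", activation="relu"),   # Conv2_1
        layers.Conv1D(128, 3, padding="same", activation="relu"),   # Conv2_2
        layers.Conv1D(128, 3, padding="same", activation="relu"),   # Conv2_3
        layers.MaxPooling1D(2),
        layers.Conv1D(256, 3, padding="same", activation="relu"),   # Conv3_1
        layers.Conv1D(256, 3, padding="same", activation="relu"),   # Conv3_2
        layers.Conv1D(256, 3, padding="same", activation="relu"),   # Conv3_3
        layers.MaxPooling1D(2),                                     # Pool3
        layers.Flatten(),
        layers.Dense(500, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(2, activation="softmax"),       # one model per emotion pair
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# One such model is trained for each of the fifteen emotion pairs in Table 3.
```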
E. WEIGHTING FUNCTION FOR TOP-2 EMOTIONS
The emotion recognition error is the most frequent among the two highest results. The Top-2 emotion results of the appearance feature-based network are weighted by the geometric feature-based network result in order to more accurately predict emotion by taking into consideration the holistic feature and the partial feature of a face. In the appearance feature-based CNN mentioned in Section III-C, the two classes with the Top-2 softmax values among the six classes are selected.

In the geometric feature-based CNN mentioned in Section III-D, the 2-length softmax result using the pair of the Top-2 classes from the appearance network is extracted. The extracted softmax values are fused by the following:

C_k = α·A_k + (1 − α)·G_k,    (9)

where α is a real value between 0 and 1, corresponding to a weight for the combination of the two features. In this paper, the α value is set to 0.8. The k is 0 or 1, as the class with the first highest softmax value is 0 and that with the second highest is 1. A_k is a softmax result of the appearance feature-based network corresponding to the k-th emotion class, while G_k is a softmax result of the geometric feature-based network. The softmax function for the operation is shown in Eq. (4).

The Top-2 softmax results of the appearance feature-based CNN are normalized to the same ratio as the geometric results. For example, they are operated on so that the sum of the two classes is 1, before being combined with the geometric results as the following:

A_{Top_k} = a_{Top_k} / Σ_{i=0}^{j} a_{Top_i},    (10)

where a is a softmax value for a class and Top_k is the class number with the k-th highest softmax value. The max value
of j is 1. The A_{Top_k} resulting from the formula is fused with the geometric result G_k at a rate of α, resulting in C_k of Eq. (9). The combined C_k is then re-scaled by the ratio occupied by the Top-2 in the previous appearance feature-based network in order to finally determine the facial expression. The re-scaling process can be represented as:

R_k = (a_{Top_0} + a_{Top_1}) × C_k.    (11)

The re-scaled value R_k is obtained by multiplying C_k by the sum of the two a values. It is the combination of the class with the k-th highest softmax value and the ratio (a_{Top_0} + a_{Top_1}) occupied by the Top-2 in the appearance feature-based network. Therefore, the softmax vector obtained by fusing the two networks and the predicted emotion are as follows:

F = [v_0, v_1, ..., v_5],    (12)
pred_e = arg max_i v_i.    (13)

The obtained R_k is used as the value v_i of the emotion class corresponding to the 0 to 5 softmax values. Therefore, the result of the 6-length softmax function constitutes the vector for the final prediction. The emotion i with the largest value of v in Eq. (12) is the final emotion, selected by Eq. (13). If the predicted label value i corresponds to 0, then the final emotion is Angry; if i is from 1 to 5, then the final emotion is Disgust, Fear, Happy, Sad, or Surprise, respectively.
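The following sketch summarizes the whole weighting procedure of Eqs. (9)-(13) in NumPy; keeping the appearance softmax values for the four non-Top-2 classes in the final vector is an assumption about a detail the text leaves open.

```python
import numpy as np

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise"]

def fuse_top2(appearance_softmax, geometric_softmax_for_pair, alpha=0.8):
    """Combine the two networks as in Eqs. (9)-(13).

    appearance_softmax:         length-6 softmax of the appearance network.
    geometric_softmax_for_pair: length-2 softmax of the pair model selected
                                for the Top-2 classes (same class order).
    """
    a = np.asarray(appearance_softmax, dtype=float)
    g = np.asarray(geometric_softmax_for_pair, dtype=float)
    top2 = np.argsort(a)[::-1][:2]              # classes with the two highest scores
    a_top = a[top2]
    A = a_top / a_top.sum()                     # Eq. (10): normalise Top-2 to sum to 1
    C = alpha * A + (1.0 - alpha) * g           # Eq. (9):  weighted combination
    R = a_top.sum() * C                         # Eq. (11): re-scale by the Top-2 ratio
    v = a.copy()                                # Eq. (12): 6-length final vector
    v[top2] = R                                 # replace the Top-2 entries
    return EMOTIONS[int(np.argmax(v))]          # Eq. (13): final prediction
```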
IV. EXPERIMENT AND DISCUSSION
In this section, the experimental results and discussion are presented in order to support the evidence for the suggested algorithm. We also demonstrate the superiority of the algorithm by comparing its accuracy with that of state-of-the-art algorithms. In the following sub-sections, we describe the datasets used for experimentation. The hardware and parameter settings with which the experiments were performed are described. Finally, we present the results of the experiments and analyze the performance.

A. DATASETS
1) EXTENDED COHN-KANADE (CK+)
The CK+ is a dataset which has the labeled emotion number for the frames of sequences from neutral to peak states. A total of 123 subjects participated and 593 image sequences were included, with 327 of them being labeled with seven universal emotions (anger, contempt, disgust, fear, happiness, sadness, and surprise) [34]. This dataset is employed by various algorithms related to facial expression recognition. Therefore, it is suitable for evaluation against the latest technology.

In this paper, the six emotions of anger, disgust, fear, happiness, sadness, and surprise were used as the data of the experiment. The last three frames in each sequence were used as peak emotion frames, and there were approximately 80-120 images for each emotion, so that a total of 927 images were used. The verification method for accuracy uses a 10-fold cross validation method for the purpose of comparison with other algorithms. It divides the dataset into 10 sets, nine of which are used for learning, and the last one is used for validation. The accuracy of the facial expression recognition algorithm is averaged using these results. Figure 9 shows CK+ dataset examples of frontal facial images.

FIGURE 9. Extended Cohn-Kanade (CK+) dataset.

B. JAPANESE FEMALE FACIAL EXPRESSION (JAFFE)
JAFFE is a dataset consisting of gray scale frontal facial expression images of 10 Japanese women. It contains a total of 213 images including seven facial expressions (anger, disgust, fear, happiness, sadness, surprise, and neutral) [35]. We use a total of 915 augmented points of data for learning and validation, including rotations, flips, and noise. The rotations were constructed 5 degrees clockwise and counterclockwise, respectively, and the noise was added to the original image as Gaussian noise with a variance of 0.01 and a zero mean.
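As an illustration, the augmentation described above can be sketched as follows with OpenCV and NumPy, assuming images scaled to [0, 1]; the interpolation and border handling are default choices, not values from the paper.

```python
import cv2
import numpy as np

def augment(gray):
    """Augmentations described above: +/-5 degree rotations, a horizontal
    flip, and additive Gaussian noise with zero mean and variance 0.01
    (images are assumed to be scaled to [0, 1])."""
    gray = gray.astype(np.float32)
    h, w = gray.shape
    out = [gray]
    for angle in (5, -5):
        M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle, 1.0)
        out.append(cv2.warpAffine(gray, M, (w, h)))
    out.append(cv2.flip(gray, 1))                        # horizontal flip
    noise = np.random.normal(0.0, np.sqrt(0.01), gray.shape).astype(np.float32)
    out.append(np.clip(gray + noise, 0.0, 1.0))
    return out
```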
In order to compare the proposed method with the latest algorithms, we applied the 10-fold cross validation method in the same way as for the CK+ dataset, and measure the accuracy by averaging the results. Figure 10 shows examples of the JAFFE dataset.

C. EXPERIMENTAL ENVIRONMENT
The experiments were based on a Windows OS, an i7-8700 CPU with a clock speed of 3.20 GHz, 8 GB of RAM, and a GTX 1070 GPU. The proposed method is a deep learning-based algorithm. Therefore, we model the CNN and autoencoder using TensorFlow and Keras. It is also based on the Python language, which is optimized for deep learning modeling and related libraries. The experimental results, including the accuracy, are verified for each dataset by learning and the 10-fold validation method. In the training process for each network, 30 epochs were iterated for each data set and the learning rate was 0.01. Stochastic gradient descent (SGD) was used as the optimizer.

D. PERFORMANCE ANALYSIS
1) WEIGHT DETERMINATION FOR FUSION OF FEATURES
In order to improve the error in the Top-2 range, we designed an algorithm for fusing the two networks. As shown in Eq. (3), the prediction result is calculated as the highest value by combining the results of the geometric feature-based network, which are obtained from the softmax results of the appearance
feature-based network. The value of α in Eq. (9) refers to the degree of contribution of the appearance feature-based network. It has a real value between 0 and 1.

FIGURE 10. Japanese female facial expression (JAFFE) dataset.

TABLE 4. Confusion matrix of the appearance feature-based CNN in the CK+ dataset (AN: Angry, DI: Disgust, FE: Fear, HA: Happy, SA: Sad, SU: Surprise).

TABLE 5. Confusion matrix of the proposed method in the CK+ dataset (AN: Angry, DI: Disgust, FE: Fear, HA: Happy, SA: Sad, SU: Surprise).

TABLE 6. Comparisons of our approach and the state-of-the-art FER approaches in the CK+ dataset.

FIGURE 11. Accuracy according to weight value (α).

We divide the experiment into 10 sets for each dataset. We used nine sets for training and the other set for testing. The result of the experiment is shown in Figure 11.

In this figure, the CK+ case is displayed with square points and the JAFFE case is represented with triangles. The circle points illustrate the average values. The experiments were carried out on nine weight values, ranging from 0.1 to 1.0 at intervals of 0.1. Each dataset is used in the same way as described in the previous sub-sections. A total of 2000 data samples were used. As a result, the CK+ dataset provided the highest accuracy of 89.81% when the weight α value was 0.7 or 0.8. The use of the JAFFE dataset resulted in an accuracy of 96.7%, where the weight value α was 0.8 or 0.9. The average result was 93.26% using an α value of 0.8. Therefore, the weight α of 0.8 was selected for combining the two features.
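A minimal sketch of this selection procedure is given below; eval_fold_accuracy is a placeholder for whatever routine evaluates the fused model on one fold with a given weight, and the fold count of 10 follows the cross-validation setup above.

```python
import numpy as np

def choose_alpha(eval_fold_accuracy, alphas=np.arange(0.1, 1.0, 0.1)):
    """Pick the fusion weight alpha by 10-fold evaluation.

    eval_fold_accuracy(alpha, fold) is assumed to return the validation
    accuracy of the fused model on one fold; the function averages over
    folds for each candidate weight and keeps the best one.
    """
    mean_acc = []
    for alpha in alphas:
        accs = [eval_fold_accuracy(float(alpha), fold) for fold in range(10)]
        mean_acc.append(np.mean(accs))
    best = int(np.argmax(mean_acc))
    return float(alphas[best]), float(mean_acc[best])
```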
2) QUANTITATIVE VERIFICATION
In this section, we verify whether facial expressions are correctly recognized by the proposed algorithm for each dataset. The accuracy was measured using the 10-fold cross validation method for each dataset. First, we measured the confusion matrix using only the appearance feature-based network. In addition, we measured the confusion matrix of the proposed method, which is fused with the geometric feature-based network using the weight α = 0.8; the confusion matrix is the average of all results. In addition, we compare the results of the proposed method with those using only the appearance feature-based network for each fold. We also describe the average values as the final results.

In the CK+ dataset, the accuracy was measured by dividing 927 points of data into 10 sets. The confusion matrix of the appearance feature-based network with LBP features is shown in Table 4. The horizontal axis presents the predicted class among six emotions, and the vertical axis is the actual class, which is the correct answer. In the CK+ dataset, the accuracy for angry was 86.9%, and the accuracy for fear was the highest. The angry emotion was often mistakenly predicted as the sad emotion. This is because, in both of these emotions, the facial images appear with the ends of the mouth pointing downward. Therefore, it seems that the probability of error is increased by similar appearance-based features if only the appearance feature-based network is used. The average over all emotions was 95.15% accuracy.

The results of the confusion matrix of the proposed method, combining the geometric feature-based network extracting the changes of the AU landmarks on the face and the appearance feature-based network, are shown in Table 5. The highest rate of improvement was for angry, which had the lowest accuracy of all. The ratio of increment was 4.7%, from 86.9% to 91.6%. In addition, the accuracy was maintained or increased across all emotions. The average over all emotions was 96.46%, showing an average increase of 1.3%.
TABLE 7. Confusion matrix of the appearance feature-based CNN in the JAFFE dataset (AN: Angry, DI: Disgust, FE: Fear, HA: Happy, SA: Sad, SU: Surprise).

TABLE 8. Confusion matrix of the proposed method in the JAFFE dataset (AN: Angry, DI: Disgust, FE: Fear, HA: Happy, SA: Sad, SU: Surprise).

TABLE 9. Comparisons between our approach and the state-of-the-art FER approaches in the JAFFE dataset.

In Table 6, the proposed algorithm is compared with the state-of-the-art algorithms. As mentioned previously, we used 10-fold cross validation on the CK+ dataset. Xie and Hu [20] proposed a facial expression recognition algorithm using the deep comprehensive multi-patches aggregation dual CNN (DCMA-CNNs) method. The similarity with our proposed method is that both use a dual network. This method focused on static features with LBP image patches as opposed to dynamic features. As a result, the accuracy for the CK+ dataset was 93.46%. Happy and Routray [14] also focused on the appearance of each active patch in one frame rather than the dynamic feature. The method resulted in 94.09% accuracy on the CK+ dataset.

Another state-of-the-art method is an RBM-based method [36] that combines temporal and spatial features. This method used facial-expression and non-facial-expression images to learn emotional changes by using RBM-based models. The use of facial-expression and non-facial-expression images is similar to our geometric feature-based network part, which used the neutral facial images and emotional face images. The accuracy of this method was measured at 95.66%.

Compared with these algorithms, the appearance feature-based network showed a similar accuracy of 95.15%. However, the proposed fusion network with a geometric feature-based network showed better performance than the other algorithms, with an accuracy of 96.46%.

In the JAFFE dataset, the accuracy of the 10-fold cross validation was measured as well, and the average value was calculated. This dataset contains black and white images with relatively high amounts of noise. For this reason, the result was a slightly lower accuracy than that of the CK+ dataset. In Table 7, the result of the appearance feature-based network is shown with an accuracy of 89.33%. The lowest rate was 74.6% for sad, which showed the most errors with fear and disgust. The highest value was 98.2% for happy.

The result of the confusion matrix of the proposed algorithm is shown in Table 8. The results were similar to those of the CK+ dataset. In particular, the error of the disgust emotion was the same as for the sad emotion, which showed the lowest accuracy. As a result, the accuracy of the sad emotion was increased by about 4.6%. Finally, we achieved an average accuracy of 91.27%, which was an increase of 1.7% on average.

In Table 9, the proposed algorithm was compared with the state-of-the-art algorithms on the JAFFE dataset. Lopes et al. [18] suggested a representative method that used a CNN in FER. They learned data with shuffle training in order to change the order so that the method could be learned with less data. The accuracy of the basic CNN algorithm using only the static feature was 84.48%. Another recent algorithm is that presented by Liu et al. [37], which extracted features of salient areas using LBP and HoG features with gamma correction. This method resulted in 90% accuracy. Finally, Goyani and Patel [38] used feature vectors constructed from the Haar wavelet of multiple levels over the face, eyes, and mouth.

These methods also used spatial information in a frame. Therefore, the performance was similar to that of the appearance feature-based network. The proposed algorithm was verified with a rate of 91.27% and achieved higher accuracy than the other methods.

As a result, we have illustrated the improved accuracy and confusion matrix results of the proposed network. The hierarchical network, which focuses on improving the most frequent Top-2 errors, has improved the accuracy of recognition on all datasets. We have also achieved better performance than those of the latest algorithms. The proposed method for all datasets achieved at least 91.27% and up to 98.07% recognition accuracy; an average of 94.67% was achieved. To validate the performance on a more wild dataset, we tested the proposed algorithm on the AffectNet dataset, which is a dataset of about 1M face images collected by internet search engines, as shown in Figure 12, using about 1,250 emotion-related keywords. This dataset has 11 categories (0: Neutral, 1: Happiness, 2: Sadness, 3: Surprise, 4: Fear, 5: Disgust, 6: Anger, 7: Contempt, 8: None, 9: Uncertain, 10: No-Face). In this experiment, the proposed hierarchical deep neural network scheme achieved about 88.3% accuracy.

[7] B. Martinez, M. F. Valstar, B. Jiang, and M. Pantic, ‘‘Automatic analysis


of facial actions: A survey,’’ IEEE Trans. Affect. Comput., to be published.
doi: 10.1109/TAFFC.2017.2731763.2017.
[8] T. F. Cootes, G. J. Edwards, and C. J. Taylor, ‘‘Active appearance models,’’
IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 681–685,
Jun. 2001.
[9] D. Huang, C. Shan, M. Ardabilian, Y. Wang, and L. Chen, ‘‘Local binary
patterns and its application to facial image analysis: A survey,’’ IEEE
Trans. Syst., Man, C, Appl. Rev., vol. 41, no. 6, pp. 765–781, Nov. 2011.
[10] W. Gu, C. Xiang, Y. V. Venkatesh, D. Huang, and H. Lin, ‘‘Facial expres-
sion recognition using radial encoding of local Gabor features and classi-
fier synthesis,’’ Pattern Recognit., vol. 45, no. 1, pp. 80–91, Jan. 2012.
[11] I. Kotsia and I. Pitas, ‘‘Facial expression recognition in image sequences
using geometric deformation features and support vector machines,’’ IEEE
Trans. Image Process., vol. 16, no. 1, pp. 172–187, Jan. 2007.
[12] M. Pardàs and A. Bonafonte, ‘‘Facial animation parameters
extraction and expression recognition using hidden Markov models,’’
Signal Process., Image Commun., vol. 17, no. 9, pp. 675–688,
2002.
FIGURE 12. Examples of the affectNet dataset.
[13] Y. Liu et al., ‘‘Facial expression recognition with PCA and LBP features
extracting from active facial patches,’’ in Proc. IEEE Int. Conf. Real-Time
Comput. Robot. (RCAR), Angkor Wat, Cambodia, Jun. 2016, pp. 368–373.
[14] S. L. Happy and A. Routray, ‘‘Automatic facial expression recognition
V. CONCLUSION
We have proposed an efficient facial expression recognition algorithm that combines an appearance feature and a geometric feature based on deep neural networks. The appearance feature-based network extracts the holistic feature from the preprocessed LBP image, which contains the AU information. The geometric feature-based network extracts the dynamic feature, i.e., the facial landmark change centered on the coordinate movement between the neutral face and the peak emotion. As a result, we constructed a more robust feature by combining the static appearance feature from the appearance feature-based network with the dynamic feature from the geometric feature-based network.

In the experiments, we showed that the Top-2 error occurred frequently, with an average of about 82%, when only the appearance feature-based network was used. By correcting this error with the proposed algorithm, we achieved about 96.5% accuracy on the CK+ dataset, a 1.3% improvement over the other compared algorithms. In addition, the proposed algorithm yielded an accuracy of 91.3% on the JAFFE dataset, an improvement of 1.5% over other existing methods. Across the experiments on all datasets, the accuracy of the six emotions was at least maintained or enhanced significantly.
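As an illustration of the feature-fusion structure summarized above, the following is a minimal PyTorch sketch, not the authors' implementation; the class name, layer widths, 68-landmark input, and 64x64 image size are assumptions made only for this example. It shows how a CNN feature from the LBP image and an MLP feature from landmark displacements can be concatenated and fed to a softmax classifier.

import torch
import torch.nn as nn

class HierarchicalFERSketch(nn.Module):
    # Illustrative sketch only: layer sizes are assumptions, not the paper's.
    def __init__(self, num_landmarks=68, num_classes=6):
        super().__init__()
        # Appearance branch: small CNN over the single-channel preprocessed LBP image.
        self.appearance = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # -> (N, 32)
        )
        # Geometric branch: MLP over (dx, dy) landmark offsets between the
        # neutral face and the peak expression.
        self.geometric = nn.Sequential(
            nn.Linear(num_landmarks * 2, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),                     # -> (N, 32)
        )
        # Fused feature -> class logits (softmax is applied inside the loss).
        self.classifier = nn.Linear(32 + 32, num_classes)

    def forward(self, lbp_image, landmark_offsets):
        fused = torch.cat([self.appearance(lbp_image),
                           self.geometric(landmark_offsets)], dim=1)
        return self.classifier(fused)

# Example usage with assumed shapes: four 64x64 LBP images and 68 landmark offsets.
# model = HierarchicalFERSketch()
# logits = model(torch.randn(4, 1, 64, 64), torch.randn(4, 68 * 2))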
JI-HAE KIM was born in Seoul, South Korea, in 1994. She received the B.S. and M.S. degrees from the College of Engineering, Sookmyung Women's University, Seoul, South Korea, in 2017 and 2019, respectively, where she was a Researcher with the Intelligent Vision Processing Laboratory (IVPL) from 2017 to 2019. Her research interests include the development of computer vision processing and facial expression recognition techniques using deep learning and machine learning, the fundamental study of image feature extraction, and image classification.

BYUNG-GYU KIM (M'04–SM'16) received the B.S. degree from Pusan National University, South Korea, in 1996, the M.S. degree from the Korea Advanced Institute of Science and Technology (KAIST), in 1998, and the Ph.D. degree from the Department of Electrical Engineering and Computer Science, KAIST, in 2004. In 2004, he joined the Real-Time Multimedia Research Team, Electronics and Telecommunications Research Institute (ETRI), South Korea, where he was a Senior Researcher. At ETRI, he developed many real-time video signal processing algorithms, filed patents, and received the Best Paper Award in 2007. From 2009 to 2016, he was an Associate Professor with the Division of Computer Science and Engineering, Sun Moon University, South Korea. In 2016, he joined the Department of Information Technology (IT) Engineering, Sookmyung Women's University, South Korea, where he is currently an Associate Professor. He has published over 200 international journal and conference papers and holds patents in his field. His research interests include image and video signal processing for content-based image coding, video coding techniques, 3-D video signal processing, deep/reinforcement learning algorithms, embedded multimedia systems, and intelligent information systems for image signal processing.

Dr. Kim is a Professional Member of the ACM and IEICE. He received the Special Merit Award for Outstanding Paper from the IEEE Consumer Electronics Society at IEEE ICCE 2012, the Certification Appreciation Award from SPIE Optical Engineering in 2013, and the Best Academic Award from the CIS in 2014. He also served or serves as an Organizing Committee Member of CSIP 2011, a Co-Organizer of CICCAT 2016/2017, and a Program Committee Member of many international conferences. He serves as a Professional Reviewer for many academic journals, including those of the IEEE, ACM, Elsevier, Springer, Oxford, SPIE, IET, and MDPI. In 2007, he served as an Editorial Board Member for the International Journal of Soft Computing, Recent Patents on Signal Processing, the Research Journal of Information Technology, the Journal of Convergence Information Technology, and the Journal of Engineering and Applied Sciences. Since 2018, he has been the Editor-in-Chief of The Journal of Multimedia Information System and an Associate Editor of IEEE ACCESS. He is also serving as an Associate Editor for Circuits, Systems and Signal Processing (Springer), The Journal of Supercomputing (Springer), the Journal of Real-Time Image Processing (Springer), and the International Journal of Image Processing and Visual Communication (IJIPVC).

PARTHA PRATIM ROY (M'87) received the Ph.D. degree in computer science from the Universitat Autònoma de Barcelona, Spain, in 2010. He was a Postdoctoral Research Fellow with the Computer Science Laboratory (LI, RFAI Group), France, and with the Synchromedia Laboratory, Canada. He was also a Visiting Scientist with the Indian Statistical Institute, Kolkata, India, more than six times. He gathered industrial experience while working as an Assistant System Engineer at TATA Consultancy Services, India, from 2003 to 2005, and as a Chief Engineer at Samsung, Noida, from 2013 to 2014. He is currently an Assistant Professor with the Department of Computer Science and Engineering, IIT Roorkee, Roorkee. He has participated in several national and international projects funded by the Spanish and French Governments. He has published more than 160 research papers in international journals and conference proceedings. His main research interests include pattern recognition. In 2009, he received the Best Student Paper Award from the International Conference on Document Analysis and Recognition (ICDAR).

DA-MI JEONG was born in Bucheon, South Korea, in 1995. She received the B.S. degree from the College of Engineering, Sookmyung Women's University, Seoul, South Korea, in 2018, where she is currently pursuing the M.S. degree. Since 2018, she has been conducting research with the Intelligent Vision Processing Laboratory (IVPL), Sookmyung Women's University. Her research interests include the development of computer vision processing and real-time facial expression recognition techniques using deep learning, feature extraction, and image classification.
