
The Visual Computer

https://doi.org/10.1007/s00371-019-01627-4

ORIGINAL ARTICLE

Facial expression recognition with convolutional neural networks via a new face cropping and rotation strategy
Kuan Li1 · Yi Jin1 · Muhammad Waqar Akram1 · Ruize Han2 · Jiongwei Chen1

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Abstract
With the recent development and application of human–computer interaction systems, facial expression recognition (FER) has become a popular research area. The recognition of facial expressions is a difficult problem for existing machine learning and deep learning models because the images can vary in brightness, background, pose, etc. Deep learning methods also require the support of big data and do not perform well when the database is small. Feature extraction is very important for FER: even a simple algorithm can be very effective if the extracted features are sufficiently separable. However, deep learning methods extract features automatically, so useless features can interfere with useful ones. For these reasons, FER is still a challenging problem in computer vision. In this paper, with the aim of coping with few data and extracting only useful features from images, we propose new face cropping and rotation strategies, together with a simplification of the convolutional neural network (CNN), so that the data become more abundant and only useful facial features are extracted. Experiments to evaluate the proposed method were performed on the CK+ and JAFFE databases. High average recognition accuracies of 97.38% and 97.18% were obtained for 7-class experiments on the CK+ and JAFFE databases, respectively. A study of the impact of each proposed data processing method and of the CNN simplification is also presented. The proposed method is competitive with existing methods in terms of training time, testing time, and recognition accuracy.

Keywords Face cropping · Facial expression recognition · Convolutional neural network · Computer vision

Corresponding author: Yi Jin ([email protected])

1 Department of Precision Machinery and Precision Instrumentation, University of Science and Technology of China, 96 Jinzhai Road, Baohe District, Hefei 230026, Anhui, People's Republic of China
2 School of Computer Science and Technology, Tianjin University, 135 Yaguan Road, Jinnan District, Tianjin 300350, People's Republic of China

1 Introduction

Facial expressions are one of the most important features to reflect the human emotional state because they convey useful information to the observer [6]. Facial expressions convey 55% of a communicated message, which is more than the part conveyed by the combination of voice and language [22]. Facial expressions can be divided into six basic categories [3], namely anger, disgust, fear, happiness, sadness, and surprise. With the development of human–computer interaction systems, such as social robots, visual-interactive games, and data-driven animation, facial expression recognition (FER) has become a popular field of study in recent years.

Machine learning plays an increasingly significant role in this field. Several methods have been proposed for FER in recent years, particularly using deep learning approaches [13,14,25,32]. Deep learning methods perform well in FER [13,35]. Facial expression recognition methods can be classified into two main categories: those based on an image sequence [23,26,27,34] and those based on static images [5,39]. In the methods based on an image sequence, the sequence changes from a neutral expression to a peak expression, and these two expressions from the same person form a contrast that makes it easier to extract the features of each expression. Static-image-based methods distinguish facial expressions by analysing the peak expression image without temporal information.

Facial expression extraction is an important part of FER. Facial changes caused by different facial expressions are typically extracted using appearance-based methods [2,28,29,38] or geometry-based methods [9,34]. Appearance-based features describe the texture of the face resulting from an expression, such as wrinkles.
In appearance-based FER, facial features are extracted by applying image filters, such as the Gabor wavelet filter [10], the local binary patterns (LBP) filter [24], and the histogram of oriented gradients (HOG) filter [1], to the whole face or to specific regions. Geometry-based methods extract the shape and components of the face, such as the nose and mouth. The first step in most geometry-based methods is the detection and tracking of facial points using an active appearance model (AAM) [19]. The facial shape and other information can be represented by these landmarks, which are designed in different ways.

Geometry-based and appearance-based methods have a common disadvantage, i.e. the difficulty of selecting a good feature to represent the facial expression. For geometry-based features, the feature vector is associated with landmarks, which must be selected carefully. For appearance-based features, an experienced designer is required to design a powerful filter. The convolutional neural network (CNN) [18] has been applied to FER to address these limitations. CNN performs better than other deep learning methods [33], such as the Deep Belief Network (DBN), which is one of the most widely used networks in FER. For example, a CNN can automatically learn the features of the data without manual selection, and it can combine different features neatly. Furthermore, CNN has a better effect on feature extraction than DBN, particularly for the expressions of contempt, fear, and sadness [33]. A CNN randomly initializes a specific number of filters before training and improves these filters via gradient descent. One of the main advantages of CNN is that the input to the network is an original image rather than a set of hand-coded features.

References [20,21,33] use deep convolutional neural network (DCNN) and ensemble convolutional neural network (ECNN) systems, respectively, and achieve good results. However, these systems have a few limitations. A DCNN is difficult to train compared to a CNN and has a high validation error when the network is too deep [7]. On the other hand, an ECNN requires the generation of a large number of convolutional neural networks, which requires substantial computing resources and training time.

To overcome these limitations, we propose a new face cropping and image rotation strategy to improve the accuracy and simplify the CNN structure. The proposed approach was applied to the CK+ [16] and JAFFE [17] databases and compared with other methods. The main contributions of our work are as follows:

(1) Propose a new approach of face cropping to remove the useless regions in an image.
(2) Propose an image rotation strategy to cope with data scarcity.
(3) Build a simplified CNN structure for FER to reduce training/application time and to achieve real-time FER with an ordinary computer.

The remainder of this paper is organized as follows: Sect. 2 presents the most recent related work, while Sect. 3 introduces the proposed method in detail. The experimental results and a relevant discussion are given in Sect. 4. A comparison with other research works is presented in Sect. 5, and the conclusions are given in Sect. 6.

2 Related work

Several deep learning approaches for facial expression recognition have been developed in recent decades, particularly CNN-based methods. Some recent methods focus on the construction of advanced networks and model training, the fusion of multiple structures and the selection of fusion parameters, and the optimization of classification algorithms.

Mayya et al. [20] proposed an approach to recognize facial expressions using DCNN features. They used a DCNN architecture originally designed for ImageNet [12] to extract the facial features, obtained a 9216-dimensional vector, and fed it to a support vector machine (SVM) classifier to recognize the facial expression. Their experiments were conducted on two databases, CK+ and JAFFE, and achieved accuracies of 96.02% and 98.12% for 7 classes, respectively. Despite the high accuracy, their validation method is LOSO, which gives it an advantage (discussed in Sect. 5). Moreover, their approach is not an end-to-end method, and it is difficult and time-consuming to train.

Wen et al. [33] presented an ensemble of CNNs with probability-based fusion for FER. In their work, they used random techniques to generate 100 CNNs with rich diversity (different parameters) and then selected 37 CNNs (removing the CNNs with poor performance) as the base classifiers for their final model. Finally, a fusion method, such as majority voting, weighted majority voting, or probability-based fusion, was employed for the ensemble. Their method can reduce the training time by parallel computation, but it requires a large amount of computing resources.

Zhang et al. [37] proposed a novel, well-designed CNN which can reduce same-expression variations and enlarge different-expression differences. They used a 2-way soft-max function to train their model, which requires considerable experience from the researchers. However, their method is for smile detection, and they could use 4000 images for one expression, which is far more than the data available for FER (as explained in Sect. 4.1).

In comparison with the methods above, this work: (1) presents competitive results on two public databases; (2) employs a simple yet effective CNN structure (not DCNN or ECNN), which is easy and fast to train; and (3) does not need a specially designed, tricky algorithm.
Fig. 1 Outline of the proposed system

3 Proposed method

The proposed FER system is a single classifier based on a CNN. Image pre-processing is required because the images have different colour channels and include people of various races. To cope with the lack of data, we expanded our training images using data augmentation. An outline of the proposed method is shown in Fig. 1. The aligned image was cropped to remove the useless region, and histogram equalization, Z-score normalization, and down-sampling were applied to standardize the image data. During the training phase, random rotation and horizontal flipping were performed to increase the database size. The expanded training data were used to train the CNN, and the best CNN model was saved. During the testing phase, the normalized testing images (without expansion) were sent to the CNN model from the training phase for prediction.

3.1 Face alignment

The images in the databases collected from a laboratory show various postures. These variations affect the system performance. Face alignment was performed to address this problem, as shown in Fig. 2. The algorithm used for face alignment was based on the position of the eyes. The Dlib toolkit [11] was used to obtain the face landmarks. A total of 68 sequential points, each of which can be represented by a coordinate, were identified, but only 12 face landmarks are shown in the figure for clarity. The centres of the left and right eyes were computed based on twelve points (No. 36 to 47). The first six points encircle the left eye centre, and the other six points encircle the right eye centre. The rotation angle is calculated on the basis of these twelve points using Eq. (1):

angle = \tan^{-1} \frac{\sum_{n=42}^{47} y_n - \sum_{n=36}^{41} y_n}{\sum_{n=42}^{47} x_n - \sum_{n=36}^{41} x_n}    (1)

where x_n is the x-coordinate of the nth point and y_n is the y-coordinate of the nth point. A code sketch of this alignment step is given below.

Fig. 2 Face alignment. a Before face alignment. b After face alignment
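As an illustration (not the exact implementation used in our experiments), the following Python sketch computes the rotation angle of Eq. (1) from the 68 Dlib landmarks and applies it with OpenCV. The shape-predictor file name and the use of the first detected face are assumptions.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The 68-landmark model file name is an assumption (the standard Dlib model file).
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def align_face(gray):
    """Rotate a grayscale face image so that the eye centres lie on a horizontal line (Eq. 1)."""
    face = detector(gray, 1)[0]                     # assume the first detected face is the subject
    shape = predictor(gray, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)], dtype=np.float64)
    left_eye = pts[36:42].mean(axis=0)              # points 36-41 encircle the left eye
    right_eye = pts[42:48].mean(axis=0)             # points 42-47 encircle the right eye
    dy = right_eye[1] - left_eye[1]
    dx = right_eye[0] - left_eye[0]
    angle = np.degrees(np.arctan2(dy, dx))          # Eq. (1), written with atan2 for numerical safety
    centre = (float((left_eye[0] + right_eye[0]) / 2.0),
              float((left_eye[1] + right_eye[1]) / 2.0))
    rot = cv2.getRotationMatrix2D(centre, angle, 1.0)
    return cv2.warpAffine(gray, rot, (gray.shape[1], gray.shape[0]))
```

Because the eye centres are the means of six landmarks each, the ratio of coordinate sums in Eq. (1) equals the ratio of the centre differences used above.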
3.2 Image cropping

Image cropping is an important part of the present study, as we propose a new method for face cropping. The proposed method was compared with two common methods. Figure 3a shows an image cropped using the OpenCV toolkit used by [14,36]; the cropped image retains a little background. Figure 3b shows an image cropped using another common cropping method [2,20] that removes the image background. Figure 3c shows an image cropped by the proposed method. Sixty-eight face landmarks were obtained, but only 15 are shown in the figure for clarity. The horizontal distance d between the eye centres is calculated using Eq. (2):

d = \frac{\sum_{n=42}^{47} x_n - \sum_{n=36}^{41} x_n}{6}    (2)

The forehead region was then removed in such a way that the perpendicular distance from the top side of the cropped image to the horizontal line connecting the eye centres is 0.6d (d is the distance between the eye centres), as shown in Fig. 3d. The other three sides of the cropped image are defined by the coordinates of the 1st, 9th, and 17th face landmark points.

Fig. 3 Image cropping methods. a Face with background. b Face without background. c Face without forehead (proposed). d Illustration of the proposed face cropping method

3.3 Data normalization

Brightness and contrast can differ even between images of the same person with the same expression, as shown in Fig. 4a. Histogram equalization was applied to each image to reduce this variation. Figure 4b shows the images obtained by histogram equalization; the mean values of the normalized images are closer. Z-score normalization was also applied to these images using Eq. (3) to enhance the contrast:

x' = \frac{x - \mu}{\sigma}    (3)

where x' is the value of the new pixel, x is the value of the original pixel, μ is the mean pixel value of all sample images, and σ is the standard deviation of the pixel values of all sample images. Figure 4c shows the images obtained by Z-score normalization; the contrast of the normalized images is enhanced. Finally, the image is down-sampled to 32 × 32 pixels.

Fig. 4 Data normalization. a Original images. b After histogram equalization. c After Z-score normalization

3.4 Data augmentation

After the preceding processing steps, there is a slight tilt in some images, as shown in Fig. 5a. To ensure the adaptability and abundance of our data, we adopted random horizontal flipping and random rotation. In random horizontal flipping, an image can be flipped before training to address the problem of uneven cropping. Random rotation expands the original data by rotating an image by a random angle within an interval. We selected the best rotation angle interval via a mesh (grid) search (described in Sect. 4.5). Figure 5b, c shows random horizontal flipping and random rotation, respectively (a sketch of the full pre-processing and augmentation pipeline is given below).

Fig. 5 Data augmentation. a Original images. b Horizontally flipped images. c Randomly rotated images
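The sketch below strings together the steps of Sects. 3.2–3.4: landmark-based cropping, histogram equalization, Z-score normalization (Eq. 3), down-sampling to 32 × 32, and random flipping/rotation. It is an illustrative reconstruction under our own assumptions (0-based landmark indexing, a default ±2° interval), not a verbatim copy of the code used in the experiments.

```python
import cv2
import numpy as np

def crop_face(gray, pts):
    """Crop as in Sect. 3.2: sides from landmarks 1/9/17 (0-based 0/8/16), top 0.6*d above the eye line."""
    left_eye = pts[36:42].mean(axis=0)
    right_eye = pts[42:48].mean(axis=0)
    d = right_eye[0] - left_eye[0]                 # Eq. (2): horizontal distance between eye centres
    eye_y = (left_eye[1] + right_eye[1]) / 2.0
    top = int(round(eye_y - 0.6 * d))              # remove the forehead region
    left = int(round(pts[0, 0]))                   # 1st landmark
    bottom = int(round(pts[8, 1]))                 # 9th landmark (chin)
    right = int(round(pts[16, 0]))                 # 17th landmark
    return gray[max(top, 0):bottom, max(left, 0):right]

def normalize(face_uint8, mu, sigma):
    """Histogram equalization, down-sampling to 32x32, then Z-score normalization (Eq. 3)."""
    face = cv2.equalizeHist(face_uint8)
    face = cv2.resize(face, (32, 32)).astype(np.float32)
    return (face - mu) / sigma                     # mu, sigma computed over all training samples

def augment(face, max_angle=2.0, rng=np.random):
    """Random horizontal flip and random rotation within [-max_angle, +max_angle] degrees."""
    if rng.rand() < 0.5:
        face = cv2.flip(face, 1)
    angle = rng.uniform(-max_angle, max_angle)
    rot = cv2.getRotationMatrix2D((16, 16), angle, 1.0)
    return cv2.warpAffine(face, rot, (32, 32))
```

Augmentation is applied only to training images; testing images pass through crop_face and normalize unchanged.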
3.5 CNN structure

In the present work, a CNN was applied to extract features and categorize expressions. The architecture of the simplified CNN, which has two convolution layers, two sub-sampling layers, and one output layer, is presented in Fig. 6. The first and third layers are convolution layers with 32 and 64 kernels, respectively, each of size 5 × 5. The activation function in the CNN is the rectified linear unit (ReLU) function [8]. The second and fourth layers are sub-sampling layers that reduce the image size; we employed max-pooling with a kernel size of 2 × 2 and a stride of 2. We flattened the output of the second sub-sampling layer into a 1600-dimensional vector and connected it directly to the output layer (with the soft-max activation function). In addition, the simplified CNN uses the momentum optimizer [31], the Xavier initializer [4], and the cross-entropy loss function. A code sketch of this architecture is given below.

Fig. 6 Structure of the simplified CNN
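The following tflearn sketch mirrors the description of the simplified CNN: two 5 × 5 convolution layers with 32 and 64 kernels, two 2 × 2 max-pooling layers, dropout on the flattened vector, and a soft-max output layer, trained with the momentum optimizer, Xavier initialization, and the cross-entropy loss. It is our reconstruction from the text, not the released code; in particular, 'valid' padding is an assumption chosen so that the flattened vector is 5 × 5 × 64 = 1600-dimensional.

```python
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.estimator import regression

def build_simplified_cnn(n_classes=7):
    net = input_data(shape=[None, 32, 32, 1])                        # 32x32 grey-scale faces
    net = conv_2d(net, 32, 5, padding='valid', activation='relu',
                  weights_init='xavier')                             # 28x28x32
    net = max_pool_2d(net, 2, strides=2)                             # 14x14x32
    net = conv_2d(net, 64, 5, padding='valid', activation='relu',
                  weights_init='xavier')                             # 10x10x64
    net = max_pool_2d(net, 2, strides=2)                             # 5x5x64 -> 1600-dim when flattened
    net = dropout(net, 0.5)                                          # dropout on the flattened vector
    net = fully_connected(net, n_classes, activation='softmax')      # no hidden fully connected layer
    net = regression(net, optimizer='momentum', learning_rate=0.001,
                     loss='categorical_crossentropy')
    return tflearn.DNN(net)

# Example usage (x_train: Nx32x32x1 arrays, y_train: one-hot labels):
# model = build_simplified_cnn(7)
# model.fit(x_train, y_train, n_epoch=120, batch_size=16, shuffle=True)
```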
4 Experiments and discussion

This section introduces the databases and the details of the experiments performed in this study. Experiments were performed to select the best number of neurons, the face cropping method, and the rotation angle. Following these selections, the final experiments were performed on the databases, and the results are discussed in detail.

The proposed method was implemented using OpenCV, Python, and the neural network library tflearn (CPU version). An Intel Core i5 3.2 GHz CPU was used to conduct all the experiments in an Ubuntu 16.04 environment.

4.1 Databases

Two widely used databases were used in the experiments: the Extended Cohn–Kanade (CK+) database and the Japanese Female Facial Expressions (JAFFE) database. Samples taken from these databases are shown in Fig. 7. The CK+ dataset contains 10,708 images and 327 video sequences collected from 118 participants. The last image of each video sequence is regarded as the peak expression, and these images are labelled with seven expressions: anger, contempt, disgust, fear, happiness, sadness, and surprise. All sequences begin with a neutral expression, so the dataset contains an additional 327 neutral expressions. Neutral expressions are not discussed in this paper. The numbers of anger, contempt, disgust, fear, happiness, sadness, and surprise expressions are 45, 18, 59, 25, 69, 28, and 83, respectively. JAFFE contains 213 peak expressions collected from ten women, and the expressions are labelled as anger, disgust, fear, happiness, neutral, sadness, and surprise. Each expression is shown in approximately 30 images.

Fig. 7 Sample images from the CK+ and JAFFE databases

4.2 Evaluation criteria

In the field of classification, many criteria can be used to evaluate a model. Evaluation criteria are not fixed; we need to choose an appropriate one according to the actual problem we face. A simple yet effective pair of evaluation criteria are the error rate and the accuracy. To evaluate the performance of our model f, we compare the predicted results with the real labels y. The error rate and accuracy are calculated using Eqs. (4) and (5), respectively. Precision and recall are another pair of criteria; they can provide more reliable information in certain situations (such as data imbalance). Based on the combination of the real label and the predicted result, a sample can be classified as a true positive (TP), false positive (FP), true negative (TN), or false negative (FN). The precision and recall are calculated using Eqs. (6) and (7), respectively. For the multi-class problem, we can construct a confusion matrix to describe the relations and differences between categories. The F1 score is also a commonly used criterion, as shown in Eq. (8); it can be seen as the harmonic mean of precision and recall.

err = \frac{1}{m} \sum_{i=1}^{m} g(f(x_i) \neq y_i)    (4)

acc = \frac{1}{m} \sum_{i=1}^{m} g(f(x_i) = y_i)    (5)

where g is the indicator function, m is the total number of samples, and x_i is a given sample.

pre = \frac{TP}{TP + FP}    (6)

rec = \frac{TP}{TP + FN}    (7)

F1 = \frac{2 \times pre \times rec}{pre + rec}    (8)

For facial expression recognition, the data are balanced and we want to recognize as many images correctly as possible. Therefore, accuracy is more appropriate than precision and recall. The confusion matrix of the multi-class problem is used to describe the relations and differences between categories. These criteria are illustrated numerically below.
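As a small worked illustration of Eqs. (4)–(8) (our own example data, not experimental results), the following numpy sketch builds a confusion matrix and derives accuracy, per-class precision, recall, and F1 from it.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows: true class, columns: predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    return np.trace(cm) / cm.sum()                             # Eq. (5)

def precision_recall_f1(cm, c):
    tp = cm[c, c]
    fp = cm[:, c].sum() - tp                                    # predicted as c, true class differs
    fn = cm[c, :].sum() - tp                                    # true class c, predicted otherwise
    pre = tp / (tp + fp) if tp + fp else 0.0                    # Eq. (6)
    rec = tp / (tp + fn) if tp + fn else 0.0                    # Eq. (7)
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0      # Eq. (8)
    return pre, rec, f1

# Toy example with 7 expression classes (labels 0-6):
y_true = [0, 1, 2, 3, 4, 5, 6, 0, 1]
y_pred = [0, 1, 2, 3, 4, 5, 6, 1, 1]
cm = confusion_matrix(y_true, y_pred, 7)
print("accuracy =", accuracy(cm))                               # error rate, Eq. (4), is 1 - accuracy
print("precision/recall/F1 for class 0:", precision_recall_f1(cm, 0))
```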
4.3 Selection of the neuron number

The effect of the number of neurons in the hidden fully connected layer is considered in this section. A 1600-dimensional vector was extracted after the two convolution and sub-sampling layers, and the fully connected layer acts as a classifier. We used four different numbers of neurons (0, 256, 512, and 1024) in the fully connected layer to improve the performance. Tenfold cross-validation was implemented in each experiment, and four experiments were conducted for each number of neurons to avoid contingency. Each of these experiments used the same optimizer, learning rate, face cropping method (face without background), and dataset (CK+) order for a fair comparison. The accuracies obtained with different numbers of neurons in the fully connected layer are shown in Fig. 8. Figure 8a shows the accuracy for the experiment with the hidden layer removed; the average accuracy was 89.40%. Figure 8b–d shows the accuracy for the experiments using 256, 512, and 1024 neurons; the corresponding average accuracies are 86.34%, 87.99%, and 87.48%, respectively. Figure 8e compares the accuracy obtained in these four experiments. Compared with the other three networks, our network has a slight improvement in recognition accuracy (at least 1.41%). At the same time, the proposed network has fewer parameters, so it takes less training and testing time. It was observed that the maximum accuracy in all the experiments is obtained when the epoch is set between 80 and 120, as shown by the segment between the vertical lines in Fig. 8e. The recognition accuracy starts to decrease when the epoch exceeds 120 due to the lack of data. A sketch of how the different classifier heads can be constructed is given below.

Fig. 8 Recognition accuracy for experiments performed during the selection of the number of neurons. a Without the fully connected layer. b With 256 neurons. c With 512 neurons. d With 1024 neurons. e Comparison of the accuracy for these four experiments
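One way to reproduce this comparison is to parameterize the classifier head, as sketched below: n_hidden = 0 reproduces the proposed network without a hidden fully connected layer, while 256/512/1024 insert one hidden layer of that width. The helper name and its structure are our own illustration, not the original experiment code.

```python
from tflearn.layers.core import fully_connected, dropout

def attach_head(net, n_classes=7, n_hidden=0):
    """Optionally insert a hidden fully connected layer before the soft-max output."""
    if n_hidden > 0:
        net = fully_connected(net, n_hidden, activation='relu')
        net = dropout(net, 0.5)
    return fully_connected(net, n_classes, activation='softmax')

# For n_hidden in (0, 256, 512, 1024): build the network up to the flattened 1600-dim vector,
# attach the head, then run tenfold cross-validation and average four runs, as described above.
```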
4.4 Evaluation of face cropping methods

After demonstrating the effect of the number of neurons, we performed experiments to compare the face cropping methods. Three face cropping methods are discussed in Sect. 3.2, and Fig. 3 shows the images cropped by these methods. Figure 3a is an image of a face with a background, Fig. 3b is an image of a face without a background, and Fig. 3c is an image of a face without a forehead, cropped with our proposed method. Four experiments were conducted for each type of cropped image, and a CNN without a hidden fully connected layer was adopted. Each experiment used the same optimizer, learning rate, and dataset (CK+) order for a fair comparison. The average accuracy for the experiment using the images cropped with the proposed method was 92.42%, compared with average accuracies of 86.90% and 89.40% for the images with the background and the images without the background, respectively. The accuracy curves are shown in Fig. 9.

Fig. 9 Recognition accuracy for different image cropping methods

In order to verify the universality of the proposed method, we also conducted comparative experiments on two well-known networks: LeNet5 [18] and AlexNet. LeNet5 and AlexNet have two and five convolutional layers, respectively, and their input image sizes are 32 × 32 and 224 × 224, respectively. To avoid interference from other factors (such as random image flipping), we only cropped the images in the three ways. We conducted many experiments and obtained the average recognition accuracies shown in Table 1. The proposed face cropping method is better than the other two methods on all three networks. Moreover, the proposed method is more advantageous in small networks, such as LeNet5 and the proposed network.

Table 1 Comparative accuracies of three face cropping methods on three networks

Network\method   With background (%)   Without background (%)   Proposed (%)
LeNet5           80.43                 84.10                    88.07
AlexNet          87.16                 88.99                    90.52
Proposed         86.90                 89.40                    92.42

4.5 Selection of the random rotation angle

After the previous two sets of experiments, a third experiment was conducted to determine the effect of the rotation angle. The recognition accuracy decreases when the angle is too large; therefore, it is important to select an optimal rotation angle. Four experiments were performed for each of the rotation angles (0°, 2°, 4°, 6°, 8°, and 10°), and the average accuracy was determined. Each experiment used the same optimizer, learning rate, face cropping method (the proposed method), and dataset (CK+) order for a fair comparison. Random horizontal flipping was also implemented in this experiment. The accuracies for the set of experiments with different rotation angles are shown in Fig. 10. The accuracy increases when the image is rotated by up to 2°, but further rotation decreases the accuracy; the maximum average accuracy was obtained for the 2° rotation angle. The optimal rotation angle is data-dependent: it depends on the image collection conditions and needs to be adjusted for different data. The selection loop is sketched below.

Fig. 10 Recognition accuracy for different rotation angles
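The interval selection described above amounts to a small grid ("mesh") search: for each candidate bound, train with rotation limited to ±bound degrees and keep the interval with the best mean cross-validated accuracy. The sketch below only illustrates the loop; the function names and the evaluate callable are assumptions standing in for the full training run.

```python
import numpy as np

def select_rotation_interval(evaluate, bounds=(0, 2, 4, 6, 8, 10), runs=4):
    """Grid ("mesh") search over rotation intervals [-b, +b] degrees.

    `evaluate(b)` is assumed to train the CNN with random rotation limited to +/- b degrees
    and return the tenfold cross-validated accuracy; each bound is averaged over `runs` repeats.
    """
    scores = {b: float(np.mean([evaluate(b) for _ in range(runs)])) for b in bounds}
    best = max(scores, key=scores.get)
    return best, scores

# Toy example (a stand-in for real training, peaking near 2 degrees):
best, scores = select_rotation_interval(lambda b: 0.90 + 0.03 * np.exp(-(b - 2) ** 2))
print(best, scores)
```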
Table 2 summarizes the effects of these three steps (selecting the number of neurons, the face cropping method, and the rotation angle). As can be seen in Table 2, all three proposed steps improve the recognition accuracy.

Table 2 The effect of the pre-processing steps

Step                                                                             Accuracy (%)
None                                                                             < 89.40
Remove fully connected layer                                                     89.40
Remove fully connected layer + face crop                                         92.42
Remove fully connected layer + face crop + random flipping                       95.72
Remove fully connected layer + face crop + random flipping + random rotation     96.63

4.6 Database experiments

The experiments performed to select the best rotation angle, cropping method, and neuron number have a few limitations. For example, the same dataset order was used to train the CNN. Therefore, random experiments were performed using the proposed method. In addition to random rotation and random horizontal flipping, the following random approaches were employed: (1) the training and testing datasets were randomly assigned (they were not assigned evenly); and (2) the training dataset sequence was randomly regenerated before each epoch. In contrast to [15], the best training sample order was not selected. The protocol is sketched below.
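The random protocol (random fold assignment plus a reshuffled sample order before every epoch) can be written down compactly with scikit-learn's KFold, as in this sketch; the use of KFold and the variable names are our assumptions, not a transcript of the original code.

```python
import numpy as np
from sklearn.model_selection import KFold

def tenfold_accuracy(images, labels, build_model, n_epoch=120, batch_size=16):
    """Tenfold cross-validation with random fold assignment.

    `build_model()` returns a freshly initialized tflearn.DNN (e.g. build_simplified_cnn above);
    `labels` are one-hot encoded; fit(shuffle=True) re-draws the sample order before every epoch.
    """
    accs = []
    kf = KFold(n_splits=10, shuffle=True)          # random, not subject-balanced, assignment
    for train_idx, test_idx in kf.split(images):
        model = build_model()
        model.fit(images[train_idx], labels[train_idx],
                  n_epoch=n_epoch, batch_size=batch_size, shuffle=True)
        accs.append(model.evaluate(images[test_idx], labels[test_idx])[0])
    return float(np.mean(accs))
```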
CK+ database experiment The second expression in the CK+ database is contempt, which is not present in other databases. Therefore, this database was classified in terms of both seven and six expressions. Eighteen contempt expressions are contained in the database, so there are only 309 images when six expressions are considered. Tenfold cross-validation was applied. When classifying seven expressions, 294 images were used to train the model, and the remaining images were used for testing. Similarly, 278 images were used to train the 6-class CNN. The sample order was randomly generated in each experiment before training. Five experiments were conducted for each classification. Table 3 shows the accuracies achieved by the five experiments on the CK+ database for both classifications. The average and maximum accuracy for the 7-class experiment was 97.38% and 97.55%, respectively, while the average and maximum accuracy for the 6-class experiment was 98.18% and 98.38%, respectively. The results of the five experiments for each classification were similar.

Table 3 Recognition accuracy for the seven-class and six-class experiments on the CK+ database using the proposed system

Accuracy            1       2       3       4       5
Seven classes (%)   97.27   97.27   97.55   97.55   97.27
Six classes (%)     98.38   98.05   98.38   97.73   98.38

Average of 7 classes: 97.38%   Standard deviation: 0.14%   Best of 7 classes: 97.55%
Average of 6 classes: 98.18%   Standard deviation: 0.26%   Best of 6 classes: 98.38%

A few images had a high probability of misclassification, and those images are shown in Fig. 11. Anger and sadness were the expressions most likely to be misclassified.

Fig. 11 Misclassified images. The underlined caption is the real expression, and the other caption is the expression predicted by the CNN

The recognition accuracy of each expression in both experiments is shown in Table 4. In the 6-class experiment, happiness, disgust, and sadness were recognized with 100% accuracy, and fear was recognized with an accuracy of at least 93.6%. The poor classification accuracy of fear was a result of there being only 25 images with an expression of fear, so there may be an uneven division of these images between the training and testing data. In the 7-class experiment, the recognition accuracy for sadness decreased significantly due to the addition of contempt; the proposed CNN model considered sadness and contempt to have similar features. The confusion matrices for the 7 classes and 6 classes in the CK+ database are shown in Tables 5 and 6, respectively. In both cases, more than 6.4% of the fear images were misclassified as happiness, and approximately 4% of the anger images were misclassified into other categories. Moreover, sadness was also wrongly classified as anger in the 7-class experiment.

JAFFE database experiment Similar to the CK+ experiments, five experiments were conducted on the JAFFE dataset. The difference between CK+ and JAFFE is that the JAFFE database includes neutral expressions instead of the contempt expressions contained in the CK+ database. The JAFFE database has only 213 images, but there is a similar number of images showing each expression. Tenfold cross-validation was applied, and the recognition accuracy for each expression is shown in Table 7. The average and maximum accuracy was 97.18% and 97.65%, respectively.
The expressions of anger, fear, and happiness were recognized with 100% accuracy, while the neutral expression was recognized with an accuracy of at least 91.33%. The confusion matrix for the seven classes of the JAFFE database is shown in Table 8. Neutral expressions were responsible for 80% of the misclassified images.

Table 4 Recognition accuracy obtained for each expression on the CK+ database

Classifier          An      Co      Di      Fe      Ha     Sa      Su
Seven classes (%)   95.56   94.44   98.98   93.60   100    93.57   98.07
Six classes (%)     96.00   –       100     91.20   100    100     98.07

Table 5 Confusion matrix for the seven-expression experiment on the CK+ database

      An        Co       Di        Fe        Ha        Sa        Su
An    215/225   0        0         2/225     5/225     3/225     0
Co    0         85/90    0         0         0         5/90      0
Di    0         0        292/295   0         0         3/295     0
Fe    0         0        0         117/125   8/125     0         0
Ha    0         0        0         0         345/345   0         0
Sa    9/140     0        0         0         0         131/140   0
Su    0         5/415    3/415     0         0         0         407/415

Table 6 Confusion matrix for the 6-expression experiment on the CK+ database

      An        Di        Fe        Ha        Sa        Su
An    216/225   0         0         5/225     4/225     0
Di    0         295/295   0         0         0         0
Fe    0         0         114/125   11/125    0         0
Ha    0         0         0         345/345   0         0
Sa    9/140     0         0         0         140/140   0
Su    0         2/415     6/415     0         0         407/415

Table 7 Recognition accuracy for each expression on the JAFFE database

Classifier      An    Di      Fe    Ha    Ne      Sa      Su
7 classes (%)   100   97.24   100   100   91.33   95.48   94.67

Average of 7 classes: 97.18%   Standard deviation: 0.30%   Best of 7 classes: 97.65%

Table 8 Confusion matrix for the 7-expression experiment on the JAFFE database

      An        Di        Fe        Ha        Ne        Sa        Su
An    150/150   0         0         0         0         0         0
Di    4/145     141/145   0         0         0         0         0
Fe    0         0         160/160   0         0         0         0
Ha    0         0         0         155/155   0         0         0
Ne    9/150     0         0         0         139/150   2/150     0
Sa    0         2/155     0         0         5/155     148/155   0
Su    0         0         0         0         8/155     0         142/150

Cross-database experiment In these experiments, the network was trained on one database and tested on the other database. The CK+ database does not contain neutral expressions, and the JAFFE database does not include contempt expressions; therefore, these expressions were neglected. Five experiments were conducted for each case. The recognition accuracy for these experiments is shown in Table 9.

Table 9 Cross-database experiment on the CK+ and JAFFE databases

Train   Test    Average (standard deviation)   Best (%)
CK+     JAFFE   39.01% (1.12%)                 40.98
JAFFE   CK+     62.78% (1.52%)                 64.40
The average accuracy was 39.01% when the CK+ database was used for training and the JAFFE database was used for testing. In the opposite case, the average accuracy was 62.78%.

The proposed system is a real-time system. The time consumed for image recognition is divided into two parts. The first part is the time taken before the image is sent to the CNN, which includes the time consumed by face alignment, face cropping, histogram equalization, Z-score normalization, and image down-sampling. The other part is the time taken by the CNN prediction. The time consumed by landmark generation is not considered because the corresponding files are provided in the CK+ database. A total of 1000 images were predicted by the proposed system, and the time consumed was recorded: 3.93 s and 1.58 s were consumed before and during the CNN process, respectively, i.e. the total processing time was 5.51 s. The proposed approach is summarized in Table 10 in terms of the parameters used in the experiments. A dropout [30] rate of 0.5 was applied to the second sub-sampling layer (the 1600-dimensional vector), a default learning rate of 0.001 was used, and the batch size was 16. During training, the CNN was trained for 120 epochs (each epoch was trained on the complete processed training data), and the training data order was randomly shuffled for each epoch.

Table 10 Training parameters used in our experiments

Parameter             Value           Remark
Random rotation       −2° to 2°
Dropout               0.5             The second sub-sampling layer
Optimizer             Momentum        Learning rate = 0.001
Weights initializer   Xavier
Batch size            16
Loss function         Cross-entropy
Epochs                120             Shuffle the training data

5 Comparisons

Several novel methods for facial expression recognition have been proposed in recent years. In this section, the experimental results of the proposed approach on the CK+ and JAFFE databases are compared with those of other methods. The comparison is shown in Table 11. Tenfold cross-validation was not used by all researchers; therefore, we compare our method with existing methods that use tenfold or similar cross-validation. The authors in [20] achieved 98.12% accuracy on the JAFFE database for 7 classes by combining DCNN and support vector machines (SVM), which is 0.94% higher than the accuracy achieved with our method. However, they used leave-one-subject-out (LOSO) validation, which enabled 212 images to be used for training on the JAFFE database, whereas we used only 192 images. We conducted multiple experiments on the JAFFE dataset using the LOSO validation method, and the obtained average recognition accuracy reaches 98.59%, which is 0.47% higher than that obtained in [20]. The validation method in [2] was also LOSO. The researchers in [14,15] used eightfold cross-validation; [15] used the best sample order, and [14] trained seven binary classifiers, one for each expression. By contrast, we randomly divided the training set for our experiments and trained a single seven-class classifier.

Table 11 Comparison of the proposed algorithm and other studies

Method             Validation   Database   Iteration time   Recognition accuracy (%)
DCNN + SVM [20]    LOSO         CK+        1.91 s           97.08 (six classes), 96.02 (seven classes)
                                JAFFE                       98.12 (seven classes)
CNN [15]           Eightfold    CK+        92.68 ms         96.76 (six classes), 95.75 (seven classes)
Zeng et al. [36]   Tenfold      CK+                         95.79
LBP + SVM [29]     Tenfold      CK+                         95.10 (six classes), 91.40 (seven classes)
Liu et al. [14]    Eightfold    CK+                         96.70 (binary)
HOG + SVM [2]      LOSO         CK+                         96.40
Liu et al. [13]    Tenfold      CK+                         95.78
Pu et al. [26]     Tenfold      CK+                         96.38
Proposed           Tenfold      CK+        69.16 ms         98.18 (six classes), 97.38 (seven classes)
                                JAFFE                       97.18 (seven classes)
Training a neural network is a time-consuming task, especially for DCNN and ECNN. We implemented the DCNN of [20] with a batch size of 16 and computed the time required for comparison with our work. A total of 1.91 s was required to update the weights of the DCNN [20], which is more than the time consumed by our method. We also computed the iteration time consumed by [15], whose network is similar to ours. The time taken by [15] was 92.86 ms, which is reduced to 69.16 ms in our case due to the removal of the hidden fully connected layer. The DCNN required more than 12 h to train a model using tenfold cross-validation, while our proposed method required only 26 min.

In addition to the results of the CK+ and JAFFE database experiments, our results for the cross-database experiments were also competitive. The average recognition accuracy when using the CK+ database as the training data and the JAFFE database as the testing data was 39.01%, which is 0.21% higher than that achieved by [15].

6 Conclusion

In this paper, we present an efficient FER approach that simplifies the CNN and proposes new face cropping and image rotation methods. The impact of the CNN simplification and of the proposed data processing methods was studied, and high recognition accuracies were achieved with each technique. The CNN without a hidden fully connected layer has a simpler structure and achieves improved recognition accuracy. The proposed face cropping method retains useful face information and removes useless regions, and the proposed rotation method greatly increases the amount of data. After separate evaluation of each proposed technique, final experiments were conducted on the CK+ and JAFFE databases. The results show that the proposed FER approach achieves competitive results in terms of training time, testing time, and recognition accuracy. Furthermore, the proposed method can be implemented on an ordinary computer without GPU acceleration.

Funding This research was sponsored by the National Natural Science Foundation of China (Grant No. 51605464), the National Basic Research Program of China (973 Program) (2014CB049500), and Research on the Major Scientific Instrument of National Natural Science Foundation of China (61727809).

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 886–893 (2005). https://doi.org/10.1109/CVPR.2005.177
2. De la Torre, F., Chu, W.S., Xiong, X., Vicente, F., Ding, X., Cohn, J.F.: IntraFace. In: IEEE International Conference on Automatic Face and Gesture Recognition, pp 1–8 (2015). https://doi.org/10.1109/FG.2015.7163082
3. Ekman, P., Friesen, W.V.: Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto (1978)
4. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 9, 249–256 (2010)
5. Gogić, I., Manhart, M., Pandžić, I.S., Ahlberg, J.: Fast facial expression recognition using local binary features and shallow neural networks. Vis. Comput. 1–16 (2018). https://doi.org/10.1007/s00371-018-1585-8
6. Goh, K.M., Ng, C.H., Lim, L.L., Sheikh, U.: Micro-expression recognition: an updated review of current trends, challenges and solutions. Vis. Comput. 1–24 (2018). https://doi.org/10.1007/s00371-018-1607-6
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778 (2016)
8. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: IEEE International Conference on Computer Vision, pp 2146–2153 (2009)
9. Jin, H., Wang, X., Lian, Y., Hua, J.: Emotion information visualization through learning of 3D morphable face model. Vis. Comput. 1–14 (2018). https://doi.org/10.1007/s00371-018-1482-1
10. Jones, J.P., Palmer, L.A.: An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58(6), 1233–1258 (1987). https://doi.org/10.1152/jn.1987.58.6.1233
11. King, D.E.: Dlib-ml: a machine learning toolkit. J. Mach. Learn. Res. 10, 1755–1758 (2009)
12. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, pp 1097–1105. Curran Associates, Inc., Lake Tahoe (2012)
13. Liu, M., Li, S., Shan, S., Chen, X.: AU-inspired deep networks for facial expression feature learning. Neurocomputing 159(C), 126–136 (2015). https://doi.org/10.1016/j.neucom.2015.02.011
14. Liu, P., Han, S., Meng, Z., Tong, Y.: Facial expression recognition via a boosted deep belief network. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1805–1812 (2014)
15. Lopes, A.T., Aguiar, E.D., Souza, A.F.D., Oliveira-Santos, T.: Facial expression recognition with convolutional neural networks: coping with few data and the training sample order. Pattern Recognit. 61, 610–628 (2016). https://doi.org/10.1016/j.patcog.2016.07.026
16. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J.: The Extended Cohn–Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp 94–101 (2010). https://doi.org/10.1109/CVPRW.2010.5543262
17. Lyons, M.J., Budynek, J., Akamatsu, S.: Automatic classification of single facial images. IEEE Trans. Pattern Anal. Mach. Intell. 21(12), 1357–1362 (1999). https://doi.org/10.1109/34.817413
18. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
19. Matthews, I., Baker, S.: Active appearance models revisited. Int. J. Comput. Vis. 60, 135–164 (2004)
20. Mayya, V., Pai, R.M., Pai, M.M.M.: Automatic facial expression recognition using DCNN. Procedia Comput. Sci. 93, 453–461 (2016). https://doi.org/10.1016/j.procs.2016.07.233
21. Mayya, V., Pai, R.M., Pai, M.M.M.: Combining temporal interpolation and DCNN for faster recognition of micro-expressions in video sequences. In: International Conference on Advances in Computing, Communications and Informatics, pp 699–703 (2016). https://doi.org/10.1109/ICACCI.2016.7732128
22. Mehrabian, A.: Communication without words. In: Communication Theory, pp 193–200 (2008)
23. Mohammadi, M.R., Fatemizadeh, E., Mahoor, M.H.: PCA-based dictionary building for accurate facial expression recognition via sparse representation. J. Vis. Commun. Image Represent. 25(5), 1082–1092 (2014). https://doi.org/10.1016/j.jvcir.2014.03.006
24. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 29(1), 51–59 (1996). https://doi.org/10.1016/0031-3203(95)00067-4
25. Owusu, E., Zhan, Y., Mao, Q.R.: An SVM–AdaBoost facial expression recognition system. Appl. Intell. 40(3), 536–545 (2014)
26. Pu, X., Fan, K., Chen, X., Ji, L., Zhou, Z.: Facial expression recognition from image sequences using twofold random forest classifier. Neurocomputing 168(C), 1173–1180 (2015). https://doi.org/10.1016/j.neucom.2015.05.005
27. Rashid, M., Abu-Bakar, S., Mokji, M.: Human emotion recognition from videos using spatio-temporal and audio features. Vis. Comput. 29(12), 1269–1275 (2013)
28. Rivera, A.R., Castillo, J.R., Chae, O.: Local directional number pattern for face analysis: face and expression recognition. IEEE Trans. Image Process. 22(5), 1740–1752 (2013). https://doi.org/10.1109/TIP.2012.2235848
29. Shan, C., Gong, S., McOwan, P.W.: Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis. Comput. 27(6), 803–816 (2009). https://doi.org/10.1016/j.imavis.2008.08.005
30. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
31. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp 1139–1147 (2013)
32. Uddin, M.Z., Hassan, M.M., Almogren, A., Zuair, M., Fortino, G., Torresen, J.: A facial expression recognition system using robust face features from depth videos and deep learning. Comput. Electr. Eng. 63, 114–125 (2017). https://doi.org/10.1016/j.compeleceng.2017.04.019
33. Wen, G., Hou, Z., Li, H., Li, D., Jiang, L., Xun, E.: Ensemble of deep neural networks with probability-based fusion for facial expression recognition. Cogn. Comput. 9(5), 597–610 (2017). https://doi.org/10.1007/s12559-017-9472-6
34. Yang, P., Liu, Q., Metaxas, D.N.: Boosting coded dynamic features for facial action units and facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–6 (2007). https://doi.org/10.1109/CVPR.2007.383059
35. Yu, Z., Liu, Q., Liu, G.: Deeper cascaded peak-piloted network for weak expression recognition. Vis. Comput. 34(12), 1691–1699 (2018). https://doi.org/10.1007/s00371-017-1443-0
36. Zeng, N., Zhang, H., Song, B., Liu, W., Li, Y., Dobaie, A.M.: Facial expression recognition via learning deep sparse autoencoders. Neurocomputing 273, 643–649 (2017). https://doi.org/10.1016/j.neucom.2017.08.043
37. Zhang, K., Huang, Y., Wu, H., Wang, L.: Facial smile detection based on deep learning features. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp 534–538. IEEE (2015)
38. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928 (2008). https://doi.org/10.1109/TPAMI.2007.1110
39. Zhao, J., Mao, X., Zhang, J.: Learning deep facial expression features from image and optical flow sequences using 3D CNN. Vis. Comput. 34(10), 1461–1475 (2018). https://doi.org/10.1007/s00371-018-1477-y

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Kuan Li received his B.S. degree from China University of Mining and Technology, China, in 2016. He is currently pursuing a master's degree at the University of Science and Technology of China, Hefei, China. His research interests include image processing, pattern recognition, and deep learning.

Yi Jin received his Ph.D. degree from the University of Science and Technology of China, China, in 2013. He is currently an Associate Professor at the University of Science and Technology of China. His current research interests include human–computer interaction, pattern recognition, and image processing.

Muhammad Waqar Akram received his master's degree from the University of Agriculture Faisalabad, Pakistan, in 2015. He is currently pursuing a PhD degree in precision machinery and instrumentation at the University of Science and Technology of China, Hefei, PR China. His research interests include solar energy, farm machinery, and computer vision.
Ruize Han received his B.S. degree from Hebei University of Technology, China, in 2016. He is currently pursuing a master's degree at Tianjin University, Tianjin, China. His research interests include image processing and computer vision.

Jiongwei Chen received his B.S. degree from China University of Mining and Technology, China, in 2016. He is currently pursuing a master's degree at the University of Science and Technology of China, Hefei, China. His research interests include image processing and pattern recognition.
