Facial Expression Recognition With Convolutional Neural Networks Via A New Face Cropping and Rotation Strategy
https://doi.org/10.1007/s00371-019-01627-4
ORIGINAL ARTICLE
Abstract
With the recent development and application of human–computer interaction systems, facial expression recognition (FER)
has become a popular research area. The recognition of facial expressions is a difficult problem for existing machine learning
and deep learning models because the images can vary in brightness, background, pose, etc. Deep learning methods also
require the support of big data and do not perform well when the database is small. Feature extraction is very important
for FER: even a simple algorithm can be very effective if the extracted features are sufficiently separable. However,
deep learning methods extract features automatically, so useless features can interfere with useful ones. For
these reasons, FER remains a challenging problem in computer vision. In this paper, with the aim of coping with scarce data
and extracting only useful features from images, we propose new face cropping and rotation strategies, together with a simplified
convolutional neural network (CNN), so that the data become more abundant and only useful facial features are extracted.
Experiments to evaluate the proposed method were performed on the CK+ and JAFFE databases. High average recognition
accuracies of 97.38% and 97.18% were obtained for the 7-class experiments on the CK+ and JAFFE databases, respectively. A
study of the impact of each proposed data processing step and of the CNN simplification is also presented. The proposed method
is competitive with existing methods in terms of training time, testing time, and recognition accuracy.
Keywords Face cropping · Facial expression recognition · Convolutional neural network · Computer vision
K. Li et al.
expression, such as wrinkles. In appearance-based FER, facial features are extracted by applying image filters, such as the Gabor wavelet filter [10], the local binary pattern (LBP) filter [24], and the histogram of oriented gradients (HOG) filter [1], to the whole face or to specific regions. Geometry-based methods extract the shape and components of the face, such as the nose and mouth. The first step in most geometry-based methods is the detection and tracking of facial points using an active appearance model (AAM) [19]. The facial shape and other information can be represented by these landmarks, which are designed in different ways.

Geometry-based and appearance-based methods share a common disadvantage: the difficulty of selecting a good feature to represent the facial expression. For geometry-based features, the feature vector is associated with landmarks, which must be selected carefully. For appearance-based features, an experienced designer is required to create a powerful filter. The convolutional neural network (CNN) [18] has been applied to FER to address these limitations. CNNs perform better than other deep learning methods [33], such as the deep belief network (DBN), one of the most widely used networks in FER. For example, a CNN can automatically learn features from the data without manual selection and can combine different features flexibly. Furthermore, CNNs extract features more effectively than DBNs, particularly for expressions of contempt, fear, and sadness [33]. A CNN randomly initializes a specific number of filters before training and improves these filters via gradient descent. One of the main advantages of a CNN is that the input to the network is the original image rather than a set of hand-coded features.

References [20,21,33] use deep convolutional neural network (DCNN) and ensemble convolutional neural network (ECNN) systems, respectively, and achieve good results. However, these systems have a few limitations. A DCNN is difficult to train compared with a CNN and has high validation error when the network has too many layers [7]. An ECNN, on the other hand, requires the generation of a large number of convolutional neural networks, which demands substantial computing resources and training time.

To overcome these limitations, we propose new face cropping and image rotation strategies to improve the accuracy and simplify the CNN structure. The proposed approach was applied to the CK+ [16] and JAFFE [17] databases and compared with other methods. The main contributions of our work are as follows:

(1) We propose a new face cropping approach to remove the useless regions of an image.
(2) We propose an image rotation strategy to cope with data scarcity.
(3) We build a simplified CNN structure for FER that reduces training/application time and achieves real-time FER on an ordinary computer.

The remainder of this paper is organized as follows: Sect. 2 presents the most recent related work, while Sect. 3 introduces the proposed method in detail. The experimental results and a relevant discussion are given in Sect. 4. A comparison with other research works is presented in Sect. 5, and the conclusions are given in Sect. 6.

2 Related work

Several deep learning approaches for facial expression recognition, particularly CNN-based methods, have been developed in recent years. Some recent methods focus on the construction of advanced networks and the training of models, the fusion of multiple structures and the selection of fusion parameters, and the optimization of classification algorithms.

Mayya et al. [20] proposed an approach to recognize facial expressions using DCNN features. They used a DCNN architecture designed for ImageNet [12] to extract facial features, obtained a 9216-dimensional feature vector, and passed it to a support vector machine (SVM) classifier to recognize the facial expression. Their experiments were conducted on two databases, CK+ and JAFFE, and achieved accuracies of 96.02% and 98.12% for 7 classes, respectively. Despite the high accuracy, their validation method is LOSO, which gives the model an advantage (discussed in Sect. 5). Moreover, their approach is not end-to-end, making it difficult and time-consuming to train.

Wen et al. [33] presented an ensemble of CNNs with probability-based fusion for FER. In their work, they used random techniques to generate 100 CNNs with rich diversity (different parameters) and then selected 37 CNNs (discarding those with poor performance) as the base classifiers of their final model. Finally, a fusion method, such as majority voting, weighted majority voting, or probability-based fusion, was employed for the ensemble. Their method can reduce the training time through parallel computation, but it requires a large amount of computing resources.

Zhang et al. [37] proposed a well-designed CNN that reduces same-expression variations and enlarges different-expression differences. They trained their model with a 2-way softmax function, which requires considerable experience from the researchers. However, their method targets smile detection, for which they could use 4000 images per expression, far more than the data available for FER (as explained in Sect. 4.1).

In comparison with the methods above, this work: (1) presents competitive results on two public databases; (2) employs a simple yet effective CNN structure (not a DCNN or ECNN), which is easy and fast to train; and (3) does not require the design of a tricky algorithm.
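For concreteness, the two simplest fusion rules mentioned above for ensemble methods such as [33] can be sketched as follows. This is an illustrative NumPy sketch, not the implementation of [33]; in particular, the probability fusion shown here is plain averaging, whereas their probability-based rule differs in detail.

```python
import numpy as np

def majority_vote(votes):
    """Fuse hard labels from an ensemble.

    votes has shape (n_classifiers, n_samples); the fused label of each
    sample is the most frequent label among the classifiers' votes.
    """
    n_classes = votes.max() + 1
    fused = [np.bincount(votes[:, i], minlength=n_classes).argmax()
             for i in range(votes.shape[1])]
    return np.array(fused)

def probability_fusion(probs):
    """Fuse soft outputs by averaging class probabilities.

    probs has shape (n_classifiers, n_samples, n_classes); the class
    probabilities are averaged across classifiers and the argmax taken.
    """
    return probs.mean(axis=0).argmax(axis=1)
```

Weighted majority voting follows the same pattern, with each classifier's vote scaled by a per-classifier weight before the count.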
Fig. 1 Framework of the proposed method: input images are preprocessed before training; during training, the CNN model is fitted to the training images under the guidance of the training labels, and the best model is saved; during testing, the best model is loaded to predict labels for the testing images, and the testing labels are used to evaluate the model
3 Proposed method

3.1 Face alignment

The images in the databases were collected in a laboratory and show various postures. These variations affect the system performance. Face alignment was performed to address this problem, as shown in Fig. 2. The algorithm used for face alignment was based on the position of the eyes. The Dlib toolkit [11] was used to obtain the face landmarks. A total of 68 sequential points, each of which can be represented by a coordinate, were identified, but only 12 face landmarks are shown in the figure for clarity. The centres of the left and right eyes were computed based on twelve points (No. 36 to 47): the first six points encircle the left eye centre, and the last six encircle the right eye centre. Each eye centre was computed as the mean of its six surrounding points,

x_c = (1/6) Σ x_n,   y_c = (1/6) Σ y_n

where x_n is the x-coordinate of the nth point, and y_n is the y-coordinate of the nth point.

3.2 Image cropping

Image cropping is an important part of the present study, as we propose a new method for face cropping. The proposed method was compared with two common methods. Figure 3a shows an image cropped using the OpenCV toolkit, as used by [14,36]; the cropped image retains a little background. Figure 3b shows an image cropped using another common method.
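The eye-centre computation used for face alignment can be sketched as follows. This is a NumPy sketch under the assumption that the landmarks follow dlib's 68-point convention (points 36–41 around one eye, 42–47 around the other); the function names and interface are illustrative, not the paper's code.

```python
import numpy as np

def eye_centres(landmarks):
    """Compute the two eye centres from a (68, 2) landmark array.

    Each centre is the mean of the six points encircling that eye
    (points 36-41 and 42-47 in dlib's 68-point convention).
    """
    left = landmarks[36:42].mean(axis=0)
    right = landmarks[42:48].mean(axis=0)
    return left, right

def alignment_angle(landmarks):
    """Angle (degrees) of the inter-eye line relative to horizontal.

    Rotating the image by -angle makes the line between the eye
    centres horizontal, aligning the face.
    """
    left, right = eye_centres(landmarks)
    dx, dy = right[0] - left[0], right[1] - left[1]
    return np.degrees(np.arctan2(dy, dx))
```

The returned angle can then be passed to any image rotation routine (e.g. OpenCV's `cv2.getRotationMatrix2D` followed by `cv2.warpAffine`) to perform the alignment.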
x′ = (x − μ) / σ    (3)
Fig. 4 Data normalization. a Original images. b After histogram equalization. c After Z-score normalization
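The two normalization steps shown in Fig. 4, histogram equalization followed by Z-score normalization per Eq. (3), can be sketched in NumPy as follows. This is an illustrative reimplementation assuming 8-bit grayscale input with at least two distinct grey levels, not the authors' code.

```python
import numpy as np

def hist_equalize(img):
    """Histogram-equalize an 8-bit grayscale image (uint8 array).

    Builds a lookup table from the cumulative histogram so that the
    grey-level distribution of the output is approximately uniform.
    Assumes the image contains at least two distinct grey levels.
    """
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # first non-zero cumulative count
    lut = np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255)
    return lut.astype(np.uint8)[img]

def zscore(img):
    """Z-score normalization: x' = (x - mu) / sigma, per Eq. (3)."""
    img = img.astype(np.float64)
    return (img - img.mean()) / img.std()
```

After these two steps the pixel values have zero mean and unit variance, which stabilizes CNN training across images of different brightness.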
Fig. 8 Recognition accuracy for experiments performed during the selection of the number of neurons. a Without the fully connected layer. b With 256 neurons. c With 512 neurons. d With 1024 neurons. e Comparison of the accuracy for these four experiments
Table 1 Comparative accuracies of three face cropping methods on three networks

Network \ method   With background (%)   Without background (%)   Proposed (%)
LeNet5             80.43                 84.10                    88.07
AlexNet            87.16                 88.99                    90.52
Proposed           86.90                 89.40                    92.42
the other two methods on all three networks. Moreover, the proposed method is more advantageous in small networks, such as LeNet5 and the proposed network.

4.5 Selection of the random rotation angle

After the previous two sets of experiments, a third set was conducted to determine the effect of the rotation angle. The recognition accuracy decreased when the angle was too large; therefore, it is important to select an optimal rotation angle. Experiments were performed for six rotation angles (0°, 2°, 4°, 6°, 8°, and 10°), and the average accuracy was determined. Each experiment used the same optimizer, learning rate, face cropping method (the proposed method), and dataset (CK+) order for a fair comparison. Random horizontal flipping was also implemented in this experiment. The accuracies for the set of experiments with different rotation angles are shown in Fig. 10. The accuracy increases as the image is rotated up to 2°, but further rotation decreases the accuracy. The maximum average accuracy was obtained for a 2° rotation angle. The optimal rotation angle is data-dependent: it is determined by the image collection conditions and needs to be adjusted for different data.

Table 2 summarizes the effects of these three steps (selecting the number of neurons, the face cropping method, and the rotation angle).
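The random-rotation augmentation described above can be sketched in pure NumPy with nearest-neighbour resampling. The ±2° default follows the result above; the function names and interface are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def rotate_nn(img, angle_deg):
    """Rotate a 2-D image about its centre (nearest-neighbour sampling)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = np.radians(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source location.
    sx = np.cos(a) * (xs - cx) + np.sin(a) * (ys - cy) + cx
    sy = -np.sin(a) * (xs - cx) + np.cos(a) * (ys - cy) + cy
    sx = np.clip(np.round(sx).astype(int), 0, w - 1)
    sy = np.clip(np.round(sy).astype(int), 0, h - 1)
    return img[sy, sx]

def augment(img, rng, max_angle=2.0):
    """Random rotation in [-max_angle, max_angle] degrees plus a
    random horizontal flip, as in the augmentation described above."""
    out = rotate_nn(img, rng.uniform(-max_angle, max_angle))
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out
```

In practice a library routine such as `scipy.ndimage.rotate` with interpolation would be used instead of the nearest-neighbour sketch; the point is that each training image is perturbed by a small random angle every epoch, enlarging the effective dataset.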
recognized with an accuracy of at least 91.33%. The confusion matrix for the seven classes of the JAFFE database is shown in Table 8. Neutral expressions were responsible for 80% of the misclassified images.

Cross-database experiment In these experiments, the network was trained on one database and tested on the other. The CK+ database does not contain neutral expressions, and the JAFFE database does not include contempt expressions; therefore, these expressions were neglected. Five experiments were conducted for each case. The recognition accuracy for these experiments is shown in Table 9. The

Table 9 Cross-database experiment on the CK+ and JAFFE databases

Train   Test    Average (standard deviation)   Best (%)
CK+     JAFFE   39.01% (1.12%)                 40.98
JAFFE   CK+     62.78% (1.52%)                 64.40
average accuracy was 39.01% when the CK+ database was used for training and the JAFFE database was used for testing. In the opposite case, the average accuracy was 62.78%.

The proposed system is a real-time system. The time consumed for image recognition is divided into two parts. The first part is the time taken before the image is sent to the CNN, which includes the time consumed for face alignment, face cropping, histogram equalization, Z-score normalization, and image down-sampling. The other part is the time taken during CNN prediction. The time consumed by landmark generation is not considered because the corresponding files are provided in the CK+ database. A total of 1000 images were predicted by the proposed system, and the time consumed was recorded: 3.93 s and 1.58 s were consumed before and during the CNN process, respectively, i.e. the total processing time was 5.51 s.

The proposed approach is summarized in Table 10 in terms of the parameters used in the experiments. A dropout [30] rate of 0.5 was applied to the second sub-sampling layer (1600-dimensional vector), a default learning rate of 0.001 was used, and the batch size was 16. During training, the CNN was trained for 120 epochs (each epoch covered the complete processed training data), and the training data order for each epoch was randomly shuffled.

5 Comparisons

Several novel methods for facial expression recognition have been proposed in recent years. In this section, the experimental results of the proposed approach on the CK+ and JAFFE databases are compared with those of other methods. The comparison is shown in Table 11. Tenfold cross-validation was not used by all researchers; therefore, we compare our method with existing similar or tenfold cross-validation methods. The authors in [20] achieved 98.12% accuracy on the JAFFE database for 7 classes by combining a DCNN and support vector machines (SVM), which is 0.94% higher than the accuracy achieved with our method. However, they used leave-one-subject-out (LOSO) validation, which enabled 212 images to be used for training on the JAFFE database, whereas we used only 192 images. We conducted multiple experiments on the JAFFE dataset using LOSO validation, and the obtained average recognition accuracy reaches 98.59%, which is 0.47% higher than that obtained in [20]. The validation method in [2] was also LOSO. The researchers in [14,15] used eightfold cross-validation; [15] used the best sample order, and [14] trained seven binary classifiers, one for each expression. By contrast, we randomly divided the training set for our experiments and trained a seven-class classifier.
18. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998). https://doi.org/10.1109/5.726791
19. Matthews, I., Baker, S.: Active appearance models revisited. Int. J. Comput. Vis. 60, 135–164 (2004)
20. Mayya, V., Pai, R.M., Pai, M.M.M.: Automatic facial expression recognition using DCNN. Procedia Comput. Sci. 93, 453–461 (2016). https://doi.org/10.1016/j.procs.2016.07.233
21. Mayya, V., Pai, R.M., Pai, M.M.M.: Combining temporal interpolation and DCNN for faster recognition of micro-expressions in video sequences. In: International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp 699–703 (2016). https://doi.org/10.1109/ICACCI.2016.7732128
22. Mehrabian, A.: Communication without words. In: Communication Theory, pp 193–200 (2008)
23. Mohammadi, M.R., Fatemizadeh, E., Mahoor, M.H.: PCA-based dictionary building for accurate facial expression recognition via sparse representation. J. Vis. Commun. Image Represent. 25(5), 1082–1092 (2014). https://doi.org/10.1016/j.jvcir.2014.03.006
24. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 29(1), 51–59 (1996). https://doi.org/10.1016/0031-3203(95)00067-4
25. Owusu, E., Zhan, Y., Mao, Q.R.: An SVM–AdaBoost facial expression recognition system. Appl. Intell. 40(3), 536–545 (2014)
26. Pu, X., Fan, K., Chen, X., Ji, L., Zhou, Z.: Facial expression recognition from image sequences using twofold random forest classifier. Neurocomputing 168(C), 1173–1180 (2015). https://doi.org/10.1016/j.neucom.2015.05.005
27. Rashid, M., Abu-Bakar, S., Mokji, M.: Human emotion recognition from videos using spatio-temporal and audio features. Vis. Comput. 29(12), 1269–1275 (2013)
28. Rivera, A.R., Castillo, J.R., Chae, O.: Local directional number pattern for face analysis: face and expression recognition. IEEE Trans. Image Process. 22(5), 1740–1752 (2013). https://doi.org/10.1109/TIP.2012.2235848
29. Shan, C., Gong, S., McOwan, P.W.: Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis. Comput. 27(6), 803–816 (2009). https://doi.org/10.1016/j.imavis.2008.08.005
30. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
31. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp 1139–1147 (2013)
32. Uddin, M.Z., Hassan, M.M., Almogren, A., Zuair, M., Fortino, G., Torresen, J.: A facial expression recognition system using robust face features from depth videos and deep learning. Comput. Electr. Eng. 63, 114–125 (2017). https://doi.org/10.1016/j.compeleceng.2017.04.019
33. Wen, G., Hou, Z., Li, H., Li, D., Jiang, L., Xun, E.: Ensemble of deep neural networks with probability-based fusion for facial expression recognition. Cogn. Comput. 9(5), 597–610 (2017). https://doi.org/10.1007/s12559-017-9472-6
34. Yang, P., Liu, Q., Metaxas, D.N.: Boosting coded dynamic features for facial action units and facial expression recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–6 (2007). https://doi.org/10.1109/CVPR.2007.383059
35. Yu, Z., Liu, Q., Liu, G.: Deeper cascaded peak-piloted network for weak expression recognition. Vis. Comput. 34(12), 1691–1699 (2018). https://doi.org/10.1007/s00371-017-1443-0
36. Zeng, N., Zhang, H., Song, B., Liu, W., Li, Y., Dobaie, A.M.: Facial expression recognition via learning deep sparse autoencoders. Neurocomputing 273, 643–649 (2017). https://doi.org/10.1016/j.neucom.2017.08.043
37. Zhang, K., Huang, Y., Wu, H., Wang, L.: Facial smile detection based on deep learning features. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), pp 534–538. IEEE (2015)
38. Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 29(6), 915–928 (2007). https://doi.org/10.1109/TPAMI.2007.1110
39. Zhao, J., Mao, X., Zhang, J.: Learning deep facial expression features from image and optical flow sequences using 3D CNN. Vis. Comput. 34(10), 1461–1475 (2018). https://doi.org/10.1007/s00371-018-1477-y

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Kuan Li received his B.S. degree from China University of Mining and Technology, China, in 2016. He is currently pursuing the master's degree at the University of Science and Technology of China, Hefei, China. His research interests include image processing, pattern recognition, and deep learning.

Yi Jin received his Ph.D. degree from the University of Science and Technology of China, China, in 2013. He is currently an Associate Professor at the University of Science and Technology of China. His current research interests include human–computer interaction, pattern recognition, and image processing.

Muhammad Waqar Akram received his master's degree from the University of Agriculture Faisalabad, Pakistan, in 2015. He is currently pursuing a Ph.D. degree in precision machinery and instrumentation at the University of Science and Technology of China, Hefei, PR China. His research interests include solar energy, farm machinery, and computer vision.
Ruize Han received his B.S. degree from Hebei University of Technology, China, in 2016. He is currently pursuing a master's degree at Tianjin University, Tianjin, China. His research interests include image processing and computer vision.

Jiongwei Chen received his B.S. degree from China University of Mining and Technology, China, in 2016. He is currently pursuing a master's degree at the University of Science and Technology of China, Hefei, China. His research interests include image processing and pattern recognition.