Facial Emotion Recognition Using Convolutional Neural Networks (FERC)
Received: 16 July 2019 / Accepted: 12 February 2020 / Published online: 18 February 2020
© Springer Nature Switzerland AG 2020
Abstract
Recognizing emotion from facial expressions has always been an easy task for humans, but achieving the same with a computer algorithm is quite challenging. With recent advances in computer vision and machine learning, it is possible to detect emotions from images. In this paper, we propose a novel technique called facial emotion recognition using convolutional neural networks (FERC). FERC is based on a two-part convolutional neural network (CNN): the first part removes the background from the picture, and the second part concentrates on facial feature vector extraction. In the FERC model, an expressional vector (EV) is used to find the five different types of regular facial expressions. Supervisory data were obtained from a stored database of 10,000 images (154 persons). It was possible to correctly identify the emotion with 96% accuracy, using an EV of length 24 values. The two-level CNN works in series, and the last perceptron layer adjusts the weight and exponent values with each iteration. FERC differs from the generally followed strategies that use a single-level CNN, hence improving the accuracy. Furthermore, a novel background removal procedure applied before the generation of the EV avoids dealing with multiple problems that may occur (for example, distance from the camera). FERC was extensively tested with more than 750K images using the extended Cohn–Kanade expression, Caltech faces, CMU, and NIST datasets. We expect FERC emotion detection to be useful in many applications, such as predictive learning for students, lie detectors, etc.
* Ninad Mehendale, [email protected] | 1 Ninad's Research Lab, Thane, India. 2 K. J. Somaiya College of Engineering, Mumbai, India.
Research Article | SN Applied Sciences (2020) 2:446 | https://doi.org/10.1007/s42452-020-2234-1
Fig. 1 a Block diagram of FERC. The input image is taken from a camera or extracted from video. The input image is then passed to the first-part CNN for background removal. After background removal, the facial expressional vector (EV) is generated. Another CNN (the second-part CNN) is applied with the supervisory model obtained from the ground-truth database. Finally, the emotion from the current input image is detected. b Facial vectors marked on the background-removed face. Here, nose (N), lips (P), forehead (F), and eyes (Y) are marked using edge detection and nearest-cluster mapping. The positions left, right, and center are represented using L, R, and C, respectively
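For readers who prefer pseudocode to the block diagram, the two-stage flow of Fig. 1 can be sketched roughly as below. This is a minimal, illustrative outline only: the stage functions are passed in as callables because the published paper does not expose an implementation with these names.

    import numpy as np
    from typing import Callable

    def ferc_pipeline(frame: np.ndarray,
                      remove_background: Callable[[np.ndarray], np.ndarray],
                      extract_ev: Callable[[np.ndarray], np.ndarray],
                      classify_emotion: Callable[[np.ndarray], str]) -> str:
        """Illustrative two-stage FERC flow from Fig. 1 (stages supplied as callables)."""
        face_only = remove_background(frame)   # first-part CNN: background removal
        ev = extract_ev(face_only)             # 24-value expressional vector (EV)
        return classify_emotion(ev)            # second-part CNN + perceptron decision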
Fig. 2 Convolution filter operation with a 3 × 3 kernel. Each pixel from the input image and its eight neighboring pixels are multiplied by the corresponding values in the kernel matrix, and finally, all the products are added together to obtain the final output value
a binary image and used as the feature for the first layer of the background removal CNN (also referred to as the first-part CNN in this manuscript). This skin tone detection depends on the type of input image.
Fig. 3 a Vertical and horizontal edge detector filter matrices used at layer 1 of the background removal CNN (first-part CNN). b Sample EV matrix showing all 24 values, with the pixel at top and the parameter measured at bottom. c Representation of a point in the image domain (top panel) and in the Hough transform domain (bottom panel) using the Hough transform
If the image is a color image, then a YCbCr color threshold can be used. For skin tone, the Y value should be greater than 80, Cb should range between 85 and 140, and Cr should be between 135 and 200. This set of values was chosen by a trial-and-error method and worked for almost all of the skin tones available. We found that if the input image is grayscale, the skin tone detection algorithm has very low accuracy.
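As a rough illustration of the YCbCr thresholds quoted above, a skin mask can be computed as in the sketch below. This is a minimal sketch assuming OpenCV and NumPy; the paper does not specify which library was used for the color-space conversion.

    import cv2
    import numpy as np

    def skin_mask(bgr_image: np.ndarray) -> np.ndarray:
        """Binary skin mask using Y > 80, 85 <= Cb <= 140, 135 <= Cr <= 200."""
        ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)  # OpenCV orders channels Y, Cr, Cb
        y, cr, cb = cv2.split(ycrcb)
        mask = (y > 80) & (cb >= 85) & (cb <= 140) & (cr >= 135) & (cr <= 200)
        return mask.astype(np.uint8) * 255  # 255 marks candidate skin pixels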
To improve accuracy during background removal, the CNN also uses a circles-in-circle filter. This filter operation uses Hough transform values for each circle detection. To maintain uniformity irrespective of the type of input image, the Hough transform (Fig. 3c) was always used as the second input feature to the background removal CNN. The formula used for the Hough transform is shown in Eq. 1:

H(θ, ρ) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} A(x, y) δ(ρ − x cos θ − y sin θ) dx dy    (1)
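Equation 1 is the continuous form of the accumulator: every point A(x, y) of the (edge) image votes for all (θ, ρ) pairs satisfying ρ = x cos θ + y sin θ. A minimal discrete NumPy sketch of this accumulation follows; it is our own illustration and not taken from the FERC implementation.

    import numpy as np

    def hough_accumulator(edge_image: np.ndarray, n_theta: int = 180) -> np.ndarray:
        """Discrete H(theta, rho): each edge pixel votes along rho = x*cos(theta) + y*sin(theta)."""
        h, w = edge_image.shape
        rho_max = int(np.ceil(np.hypot(h, w)))
        thetas = np.deg2rad(np.arange(n_theta))
        accumulator = np.zeros((n_theta, 2 * rho_max + 1), dtype=np.int64)
        ys, xs = np.nonzero(edge_image)  # A(x, y) is nonzero only at edge pixels
        for x, y in zip(xs, ys):
            rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
            accumulator[np.arange(n_theta), rhos + rho_max] += 1  # one vote per theta
        return accumulator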
3.3 Convolution filter

As shown in Fig. 2, for each convolution operation the entire image is divided into overlapping 3 × 3 matrices, and the corresponding 3 × 3 filter is then convolved over each 3 × 3 matrix obtained from the image. This sliding, dot-product operation is called 'convolution,' hence the name 'convolutional filter.' During the convolution, the dot product of the two 3 × 3 matrices is computed and stored at the corresponding location, e.g., (1,1) of the output, as shown in Fig. 2. Once the entire output matrix is calculated, this output is passed to the next layer of the CNN for another round of convolution. The last layer of the face-feature-extracting CNN is a simple perceptron, which tries to optimize the values of the scale factor and exponent depending upon the deviation from the ground truth.
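The operation described above and in Fig. 2 is a standard 'valid' 3 × 3 correlation. The NumPy sketch below is a minimal reference loop written for illustration only; the actual FERC layers are learned filters inside the trained network rather than this explicit implementation.

    import numpy as np

    def conv3x3(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
        """Slide a 3x3 kernel over the image; store each dot product in the output."""
        h, w = image.shape
        out = np.zeros((h - 2, w - 2), dtype=float)
        for i in range(h - 2):
            for j in range(w - 2):
                patch = image[i:i + 3, j:j + 3]            # overlapping 3x3 matrix
                out[i, j] = float(np.sum(patch * kernel))  # dot product with the filter
        return out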
3.4 Hardware and software details

All the programs were executed on a Lenovo Yoga 530 laptop with an Intel i5 8th-generation CPU, 8 GB of RAM, and a 512 GB SSD. The software used to run the experiments was Python (using the Thonny IDE), MATLAB 2018a, and ImageJ.

4 Results and discussions

To analyze the performance of the algorithm, the extended Cohn–Kanade expression dataset [31] was used initially. This dataset had only 486 sequences with 97 posers, which limited the accuracy to a maximum of about 45%.
To overcome this problem, multiple datasets were downloaded from the Internet [32, 33], and the author's own pictures at different expressions were also included. As the number of images in the dataset increased, the accuracy also increased. We kept 70% of the 10K dataset images for training and 30% for testing. In all, 25 iterations were carried out, with a different set of 70% training data each time. Finally, the error bar was computed as the standard deviation across these iterations.
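A minimal sketch of this evaluation protocol (repeated 70/30 splits with the spread reported as a standard-deviation error bar) is shown below. It assumes scikit-learn's train_test_split and a generic classifier object; these are illustrative assumptions, not the exact scripts used in the study.

    import numpy as np
    from sklearn.model_selection import train_test_split

    def evaluate(model, images, labels, n_runs: int = 25):
        """Repeat a 70/30 split n_runs times; return mean accuracy and std-dev error bar."""
        accuracies = []
        for seed in range(n_runs):
            x_tr, x_te, y_tr, y_te = train_test_split(
                images, labels, test_size=0.30, random_state=seed)
            model.fit(x_tr, y_tr)
            accuracies.append(model.score(x_te, y_te))
        return float(np.mean(accuracies)), float(np.std(accuracies))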
Figure 4a shows the optimization of the number of CNN layers. For simplicity, we kept the number of layers and the number of filters the same for the background removal CNN (first-part CNN) and the face feature extraction CNN (second-part CNN). In this study, we varied the number of layers from 1 to 8 and found that maximum accuracy was obtained at around 4. This was not very intuitive, as we had assumed that accuracy would keep increasing with the number of layers, at the cost of execution time. Because maximum accuracy was obtained with 4 layers, we selected the number of layers to be 4. The execution time increased with the number of layers, but since this did not add significant value to our study, it is not reported in the current manuscript. Figure 4b shows the optimization of the number of filters. Again, 1–8 filters were tried for each of the four-layer CNN networks. We found that four filters gave good accuracy. Hence, FERC was designed with four layers and four filters. As a future scope of this study, researchers can try varying the number of layers for both CNNs independently. A vast amount of work can also be done if each layer is fed with a different number of filters; this could be automated using servers. Due to the computational power limitations of the author, we did not carry out this study, but it will be highly appreciated if other researchers come up with a better number than 4 (layers) and 4 (filters) and increase the accuracy beyond the 96% that we could achieve.

Figure 4c, e shows regular front-facing cases with angry and surprise emotions, and the algorithm could easily detect them (Fig. 4d, f). The only challenging part in these images was skin tone detection, because of their grayscale nature. With color images, background removal with the help of skin tone detection was straightforward, but with grayscale images we observed false face detection in many cases. The image shown in Fig. 4g was challenging because of its orientation. Fortunately, with the 24-dimensional EV feature vector, we could correctly classify 30° oriented faces using FERC. We do accept that the method has some limitations, such as the high computing power required during CNN tuning, and facial hair also causes a lot of issues. But other than these problems, the accuracy of our algorithm is very high (i.e., 96%), which is comparable to most of the reported studies (Table 2).
Fig. 4 a Optimization of the number of CNN layers. Maximum accuracy was achieved for a four-layer CNN. b Optimization of the number of filters. Four filters per layer gave maximum accuracy. c, e, g Different input images from the dataset. d, f, h The output of background removal with the final predicted output of emotion
Table 1 Results obtained with different databases

Database                 Images (people)   Accuracy (%)
Caltech faces [34]       450 (27)          85
The CMU database [35]    750,000 (337)     78
NIST database [36]       3248 (1573)       96

Table 3 Comparison table of FERC with standard networks

Algorithm        Accuracy (%)   Computational complexity
Alexnet [40]     57–87          O4
VGG [41]         67–68          O9
GoogleNet [42]   83–87          O5
Resnet [41]      73.30          O16
FERC             78–96          O4
One of the major limitations of this method arises when all 24 features of the EV vector cannot be obtained, due to orientation of, or shadow on, the face. The authors are trying to overcome the shadow limitation by automated gamma correction on the images (manuscript under preparation). For orientation, we could not find any strong solution other than assuming facial symmetry. Due to facial symmetry, we generate the missing feature parameters by copying the corresponding 12 values into the missing entries of the EV matrix (e.g., the distance between the left eye and the left ear (LY–LE) is assumed to be the same as that between the right eye and the right ear (RY–RE), etc.).
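This facial-symmetry fallback amounts to a mirror fill over the 24-value EV. In the sketch below, the left/right index pairing is hypothetical (the paper names only the LY–LE/RY–RE pair explicitly), so it should be read as an illustration of the idea rather than the actual EV layout.

    import numpy as np

    # Hypothetical left-to-right index pairs within the 24-value EV.
    MIRROR_PAIRS = {0: 12, 1: 13, 2: 14}  # e.g., LY-LE distance mirrored to RY-RE

    def fill_missing_by_symmetry(ev: np.ndarray) -> np.ndarray:
        """Copy the mirrored value into any EV entry that could not be measured (NaN)."""
        ev = ev.copy()
        for left, right in MIRROR_PAIRS.items():
            if np.isnan(ev[left]) and not np.isnan(ev[right]):
                ev[left] = ev[right]
            elif np.isnan(ev[right]) and not np.isnan(ev[left]):
                ev[right] = ev[left]
        return ev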
The algorithm also failed when multiple faces were present in the same image at an equal distance from the camera. For testing data selection, the 30% of the dataset that was not used for training was used. For each pre-processing epoch, all the data were taken as a fresh sample in each of the 25 folds of training. To assess the performance of FERC on large datasets, the Caltech faces, CMU, and NIST databases were used (Table 1). It was found that accuracy goes down with an increasing number of images because of over-fitting. Accuracy also remained low when the number of training images was small. The ideal number of images was found to be in the range of 2000–10,000 for FERC to work properly.
4.1 Comparison with other methods

As shown in Table 2, the FERC method is a unique method developed with two 4-layer networks and an accuracy of 96%, whereas others have gone for a combined approach that solves background removal and facial expression detection in a single CNN network. Addressing both issues separately reduces complexity and also the tuning time. Although we considered only five moods to classify, the sixth and seventh mood cases were misclassified, adding to the error. Zhao et al. [37] achieved a maximum accuracy of up to 99.3%, but at the cost of a 22-layer neural network; training such a large network is a time-consuming job. Compared to existing methods, only FERC has a keyframe extraction method, whereas the others have only gone for the last frame. Jung et al. [38] tried to work with fixed frames, which makes the system less efficient with video input. The number of folds of training in most of the other cases was only ten, whereas we could go up to 25-fold training because of the small network size.

As shown in Table 3, FERC has a complexity similar to that of Alexnet. FERC is much faster compared to VGG, GoogleNet, and Resnet. In terms of accuracy, FERC outperforms the existing standard networks. However, in some cases we found that GoogleNet outperforms FERC, especially when the iterations of GoogleNet reach the range of 5000 and above.

Another unique contribution of FERC is the skin tone-based feature and the Hough transform for the circles-in-circle filters. The skin tone is a fast and robust method of pre-processing the input data. We expect that with these new functionalities, FERC will be the most preferred method for mood detection in the upcoming years.
Table 2 Comparison table with similar methods reported in the literature

Method              No. of moods   Key frame     N/W size   Accuracy (%)   No. of folds
FERC                5              Edge based    8          96             25
Zhao et al. [37]    6              Last frame    22         99.3           10
Jung et al. [38]    7              Fixed frame   4          91.44          10
Zhang et al. [39]   7              Last frame    7          97.78          10
5 Conclusions

FERC is a novel way of facial emotion detection that uses the advantages of CNN and supervised learning (feasible due to big data). The main advantage of the FERC algorithm is that it works with different orientations (less than 30°) thanks to the unique 24-value EV feature matrix. The background removal adds a great advantage in accurately determining the emotions. FERC could be the starting step for many emotion-based applications, such as lie detectors and mood-based learning for students.

Acknowledgements The author would like to thank Dr. Madhura Mehendale for her constant support on database generation and cross-validation of the corresponding ground truths. The author would also like to thank all the colleagues at K. J. Somaiya College of Engineering.

Compliance with ethical standards

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.
References

1. Mehrabian A (2017) Nonverbal communication. Routledge, London
2. Bartlett M, Littlewort G, Vural E, Lee K, Cetin M, Ercil A, Movellan J (2008) Data mining spontaneous facial behavior with automatic expression coding. In: Esposito A, Bourbakis NG, Avouris N, Hatzilygeroudis I (eds) Verbal and nonverbal features of human–human and human–machine interaction. Springer, Berlin, pp 1–20
3. Russell JA (1994) Is there universal recognition of emotion from facial expression? A review of the cross-cultural studies. Psychol Bull 115(1):102
4. Gizatdinova Y, Surakka V (2007) Automatic detection of facial landmarks from AU-coded expressive facial images. In: 14th International conference on image analysis and processing (ICIAP). IEEE, pp 419–424
5. Liu Y, Li Y, Ma X, Song R (2017) Facial expression recognition with fusion features extracted from salient facial areas. Sensors 17(4):712
6. Ekman R (1997) What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system (FACS). Oxford University Press, New York
7. Zafar B, Ashraf R, Ali N, Iqbal M, Sajid M, Dar S, Ratyal N (2018) A novel discriminating and relative global spatial image representation with applications in CBIR. Appl Sci 8(11):2242
8. Ali N, Zafar B, Riaz F, Dar SH, Ratyal NI, Bajwa KB, Iqbal MK, Sajid M (2018) A hybrid geometric spatial image representation for scene classification. PLoS ONE 13(9):e0203339
9. Ali N, Zafar B, Iqbal MK, Sajid M, Younis MY, Dar SH, Mahmood MT, Lee IH (2019) Modeling global geometric spatial information for rotation invariant classification of satellite images. PLoS ONE 14:7
10. Ali N, Bajwa KB, Sablatnig R, Chatzichristofis SA, Iqbal Z, Rashid M, Habib HA (2016) A novel image retrieval based on visual words integration of SIFT and SURF. PLoS ONE 11(6):e0157428
11. Ekman P, Friesen WV (1971) Constants across cultures in the face and emotion. J Personal Soc Psychol 17(2):124
12. Matsumoto D (1992) More evidence for the universality of a contempt expression. Motiv Emot 16(4):363
13. Sajid M, Iqbal Ratyal N, Ali N, Zafar B, Dar SH, Mahmood MT, Joo YB (2019) The impact of asymmetric left and asymmetric right face images on accurate age estimation. Math Probl Eng 2019:1–10
14. Ratyal NI, Taj IA, Sajid M, Ali N, Mahmood A, Razzaq S (2019) Three-dimensional face recognition using variance-based registration and subject-specific descriptors. Int J Adv Robot Syst 16(3):1729881419851716
15. Ratyal N, Taj IA, Sajid M, Mahmood A, Razzaq S, Dar SH, Ali N, Usman M, Baig MJA, Mussadiq U (2019) Deeply learned pose invariant image analysis with applications in 3D face recognition. Math Probl Eng 2019:1–21
16. Sajid M, Ali N, Dar SH, Iqbal Ratyal N, Butt AR, Zafar B, Shafique T, Baig MJA, Riaz I, Baig S (2018) Data augmentation-assisted makeup-invariant face recognition. Math Probl Eng 2018:1–10
17. Ratyal N, Taj I, Bajwa U, Sajid M (2018) Pose and expression invariant alignment based multi-view 3D face recognition. KSII Trans Internet Inf Syst 12:10
18. Xie S, Hu H (2018) Facial expression recognition using hierarchical features with deep comprehensive multipatches aggregation convolutional neural networks. IEEE Trans Multimedia 21(1):211
19. Danisman T, Bilasco M, Ihaddadene N, Djeraba C (2010) Automatic facial feature detection for facial expression recognition. In: Proceedings of the international conference on computer vision theory and applications, pp 407–412. https://doi.org/10.5220/0002838404070412
20. Mal HP, Swarnalatha P (2017) Facial expression detection using facial expression model. In: 2017 International conference on energy, communication, data analytics and soft computing (ICECDS). IEEE, pp 1259–1262
21. Parr LA, Waller BM (2006) Understanding chimpanzee facial expression: insights into the evolution of communication. Soc Cogn Affect Neurosci 1(3):221
22. Dols JMF, Russell JA (2017) The science of facial expression. Oxford University Press, Oxford
23. Kong SG, Heo J, Abidi BR, Paik J, Abidi MA (2005) Recent advances in visual and infrared face recognition—a review. Comput Vis Image Underst 97(1):103
24. Xue Yl, Mao X, Zhang F (2006) Beihang university facial expression database and multiple facial expression recognition. In: 2006 International conference on machine learning and cybernetics. IEEE, pp 3282–3287
25. Kim DH, An KH, Ryu YG, Chung MJ (2007) A facial expression imitation system for the primitive of intuitive human-robot interaction. In: Sarkar N (ed) Human robot interaction. IntechOpen, London
26. Ernst H (1934) Evolution of facial musculature and facial expression. J Nerv Ment Dis 79(1):109
27. Kumar KC (2012) Morphology based facial feature extraction and facial expression recognition for driver vigilance. Int J Comput Appl 51:2
28. Hernández-Travieso JG, Travieso CM, Pozo-Baños D, Alonso JB et al (2013) Expression detector system based on facial images. In: BIOSIGNALS 2013—proceedings of the international conference on bio-inspired systems and signal processing
29. Cowie R, Douglas-Cowie E, Tsapatsoulis N, Votsis G, Kollias S, Fellenz W, Taylor JG (2001) Emotion recognition in human–computer interaction. IEEE Signal Process Mag 18(1):32
30. Hsu RL, Abdel-Mottaleb M, Jain AK (2002) Face detection in color images. IEEE Trans Pattern Anal Mach Intell 24(5):696
31. Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I (2010) The extended Cohn–Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE computer society conference on computer vision and pattern recognition—workshops. IEEE, pp 94–101
32. Littlewort G, Whitehill J, Wu T, Fasel I, Frank M, Movellan J, Bartlett M (2011) The computer expression recognition toolbox (CERT). In: Face and gesture 2011. IEEE, pp 298–305
33. Shan C, Gong S, McOwan PW (2009) Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis Comput 27(6):803
34. Caltech Faces (2020) http://www.vision.caltech.edu/html-files/archive.html. Accessed 05 Jan 2020
35. The CMU multi-pie face database (2020) http://ww1.multipie.org/. Accessed 05 Jan 2020
36. NIST mugshot identification database (2020) https://www.nist.gov/itl/iad/image-group/resources/biometric-special-databases-and-software. Accessed 05 Jan 2020
37. Zhao X, Liang X, Liu L, Li T, Han Y, Vasconcelos N, Yan S (2016) Peak-piloted deep network for facial expression recognition. In: European conference on computer vision. Springer, pp 425–442
38. Jung H, Lee S, Yim J, Park S, Kim J (2015) Joint fine-tuning in deep neural networks for facial expression recognition. In: Proceedings of the IEEE international conference on computer vision, pp 2983–2991
39. Zhang K, Huang Y, Du Y, Wang L (2017) Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Trans Image Process 26(9):4193
40. Wu YL, Tsai HY, Huang YC, Chen BH (2018) Accurate emotion recognition for driving risk prevention in driver monitoring system. In: 2018 IEEE 7th global conference on consumer electronics (GCCE). IEEE, pp 796–797
41. Gajarla V, Gupta A (2015) Emotion detection and sentiment analysis of images. Georgia Institute of Technology, Atlanta
42. Giannopoulos P, Perikos I, Hatzilygeroudis I (2018) Deep learning approaches for facial emotion recognition: a case study on FER-2013. In: Hatzilygeroudis I, Palade V (eds) Advances in hybridization of intelligent methods. Springer, Berlin, pp 1–16

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.