Final Project Report
On
Facial Emotion Detection
Submitted in partial fulfilment of the Requirement for the award of the degree
SUBMITTED BY
Ujjawal Gupta (22SCSE1040014)
Yogendra Singh (22SCSE1040078)
Priyanshi Raj (22SCSE1040035)
Roshan Kumar (22SCSE1040054)
Human emotion, generated by the brain and captured in either video, electric signal (EEG) or image form, can be approximated. Human emotion detection is the need of the hour so that modern artificial intelligent systems can emulate and gauge reactions from the face. This can be helpful in making informed decisions, be it regarding identification of intent, promotion of offers or security-related threats. Recognizing emotions from images or video is a trivial task for the human eye, but proves to be very challenging for machines and requires many image processing techniques for feature extraction. Several machine learning algorithms are suitable for this job. Any detection or recognition by machine learning requires training an algorithm and then testing it on a suitable dataset. This paper explores a couple of machine learning algorithms as well as feature extraction techniques.
I would like to express my gratitude to my project supervisor, Mr. Rajakumar P., for his guidance and support. I also appreciate the resources provided by Galgotias University, Greater Noida, and the efforts of my colleagues and team members. Finally, thanks to my family and friends for their unwavering support.
Student Name:
Supervisor's Signature:
TABLE OF CONTENTS
1. Introduction ........................................................................................................................................ 1
2. Image Features.................................................................................................................................... 3
3.2.1 Support Vector Machines (SVM).............................................................................................. 16
4.1 OpenCV........................................................................................................................................ 20
4.2 Dlib............................................................................................................................................... 20
5. Implementation ................................................................................................................................. 23
6. Results ................................................................................................................................................ 30
References ............................................................................................................................................. 49
1. Introduction
Human emotion detection is implemented in many areas requiring additional security or
information about the person. It can be seen as a second step to face detection where we may be
required to set up a second layer of security, where along with the face, the emotion is also
detected. This can be useful to verify that the person standing in front of the camera is not just a static image of the person.
Another domain where emotion detection is important is business promotion. Most businesses thrive on customer responses to their products and offers. If an artificially intelligent system can capture and identify real-time emotions based on a user's image or video, it can decide whether the customer liked or disliked the product or offer.
We have seen that security is the main reason for identifying any person. It can be based on fingerprint matching, voice recognition, passwords, retina detection etc. Identifying the intent of the person can also be important to avert threats. This can be helpful in vulnerable areas like airports, concerts and major public gatherings which have seen many breaches in recent years.
Human emotions can be classified as: fear, contempt, disgust, anger, surprise, sad, happy, and neutral. These emotions are very subtle. Facial muscle contortions are very minimal, and detecting these differences can be very challenging, as even a small difference results in different expressions [4]. Also, expressions of different or even the same people might vary for the same emotion, as emotions are hugely context dependent [7]. While we can focus on only those areas of the face which display the most emotion, such as around the mouth and eyes [3], how we
extract these gestures and categorize them is still an important question. Neural networks and
machine learning have been used for these tasks and have obtained good results.
Machine learning algorithms have proven to be very useful in pattern recognition and
classification. The most important aspects for any machine learning algorithm are the features. In
this paper we will see how the features are extracted and modified for algorithms like Support
Vector Machines [1]. We will compare algorithms and the feature extraction techniques from
different papers. The human emotion dataset can be a very good example to study the robustness
and nature of classification algorithms and how they perform for different types of datasets.
Usually before extraction of features for emotion detection, face detection algorithms are applied
on the image or the captured frame. We can generalize the emotion detection steps as follows:
1) Dataset preprocessing
2) Face detection
3) Feature extraction
4) Emotion classification
In this work, we focus on the feature extraction technique and emotion detection based on
the extracted features. Section 2 focuses on some important features related to the face. Section 3
gives information on the related work done in this field. Related work covers many of the feature
extraction techniques used until now. It also covers some important algorithms which can be used
for emotion detection in human faces. Section 4 details the tools and libraries used in the implementation. Section 5 explains the implementation of the proposed feature extraction and emotion detection framework. Section 6 highlights the results of the experiment. Section 7 covers the conclusion and future work.
2. Image Features
We can derive different types of features from the image and normalize them in vector form. We can employ various types of techniques to identify the emotion, like calculating the ellipses formed on the face or the angles between different parts like the eyes, mouth etc. Following are some of the prominent features which can be used for emotion detection.
2.1 FACS
The Facial Action Coding System is used to assign a number to each facial movement. Each such number is called an action unit (AU). A combination of action units results in a facial expression. The micro changes in the muscles of the face can be defined by an action unit. For example, a smiling face can be defined in terms of action units as 6 + 12, which simply means that movement of the AU6 muscle and the AU12 muscle results in a happy face. Here Action Unit 6 is the cheek raiser and Action Unit 12 is the lip corner puller. A facial action coding system based on action units is a good system to determine which facial muscles are involved in which expression. Real time face models can be generated based on them.
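To make the action-unit idea concrete, the following small Python sketch (our own illustration, not part of the original coding system) maps a few emotions to AU combinations; only happy = 6 + 12 comes from the text above, the other AU sets are commonly cited examples.

    # Illustrative only: a few emotions and the action-unit (AU) combinations
    # commonly cited for them; only happy = AU6 + AU12 comes from the text above.
    EMOTION_TO_ACTION_UNITS = {
        "happy":    {6, 12},         # cheek raiser + lip corner puller
        "surprise": {1, 2, 5, 26},   # brow raisers, upper lid raiser, jaw drop
        "sadness":  {1, 4, 15},      # inner brow raiser, brow lowerer, lip corner depressor
    }

    def matches_emotion(detected_aus, emotion):
        # True if every AU listed for the emotion is present among the detected AUs.
        return EMOTION_TO_ACTION_UNITS[emotion].issubset(detected_aus)

    print(matches_emotion({6, 12, 25}, "happy"))   # True: AU6 and AU12 are both present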
2.2 Facial Landmarks
Landmarks on the face are very crucial and can be used for face detection and recognition. The same landmarks can also be used in the case of expressions. The Dlib library has a 68-point facial landmark predictor. Figure 2 shows all the 68 landmarks on the face. Using the dlib library we can extract the co-ordinates (x, y) of each of the facial points. These 68 points can be divided into specific areas like left eye, right eye, left eyebrow, right eyebrow, mouth, nose and jaw.
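A minimal sketch of how these 68 landmark co-ordinates can be read out with dlib. It assumes the pre-trained model file shape_predictor_68_face_landmarks.dat (distributed separately by dlib) and a placeholder input image face.png.

    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()                  # HoG + SVM face detector
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    image = cv2.imread("face.png")                               # placeholder file name
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    for face in detector(gray, 1):                               # 1 = number of upsampling passes
        shape = predictor(gray, face)                            # 68 landmark points for this face
        points = np.array([(p.x, p.y) for p in shape.parts()])   # shape (68, 2)
        eye = points[36:42]                                      # one eye region in the usual 68-point layout
        mouth = points[48:68]                                    # mouth region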
2.3 Feature Descriptors
Good features are those which help in identifying the object properly. Usually the images are
identified on the basis of corners and edges. For finding corners and edges in images, we have
many feature detector algorithms in the OpenCV library such as Harris corner detector.
These feature detectors take into account many more factors such as contours and convex hulls. Key-points are corner points or edges detected by the feature detector algorithm. The feature descriptor describes the area surrounding the key-point. The description can be anything, including raw pixel intensities or co-ordinates of the surrounding area. The key-point and descriptor together form a local feature. One example of a feature descriptor is the histogram of oriented gradients. ORB (based on BRIEF), SURF, SIFT etc. are some of the feature descriptor algorithms [25].
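As an illustration (not part of this work's pipeline), the following shows a typical use of the ORB detector/descriptor in OpenCV; the file name and parameter values are placeholders.

    import cv2

    image = cv2.imread("face.png", cv2.IMREAD_GRAYSCALE)         # placeholder file name

    orb = cv2.ORB_create(nfeatures=500)                          # ORB builds on FAST key-points + BRIEF descriptors
    keypoints, descriptors = orb.detectAndCompute(image, None)

    # Each key-point is a corner-like location; each descriptor is a 32-byte
    # binary vector describing the patch around it.
    print(len(keypoints), None if descriptors is None else descriptors.shape)

    annotated = cv2.drawKeypoints(image, keypoints, None)        # visualize the detected key-points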
3. Related Work
The landmark detection approach used in [1] is based on cascaded regression trees and finds the important positions on the face directly from images. Pixel intensities are used to distinguish between different parts of the face, identifying 68 facial landmarks [1]. Based on a current estimate of the shape, parameter estimation is done by transforming the image into a normalized co-ordinate system instead of a global one. Extracted features are used to re-estimate the shape parameter vectors and are recalculated until convergence [5].
The author of [1] uses only 19 of the 68 extracted features, as shown in Figure 3, focusing on the regions around the eyes, eyebrows, nose and mouth.
The ensemble of regression trees is very fast and robust, giving the 68 features in around 3 milliseconds. Once the features are in place, the displacement ratios of these 19 feature points are calculated using pixel coordinates. Displacement ratios are nothing but the change in the pixel distances between pairs of feature points relative to a reference. Instead of using these distances directly, displacement ratios are used because the raw pixel distances may vary depending on the distance between the camera and the person. The dataset used for this experiment was the iBug-300W dataset, which has more than 7000 images, along with the CK+ dataset having 593 sequences of facial expressions of 123 different subjects.
Table 2: Distances calculated to determine displacement ratios between different parts of face [1]
D1 and D2 Distance between the upper and lower eyelid of the right and left eyes
D3 Distance between the inner points of the left and right eyebrow
D4 and D5 Distance between the nose point and the inner point of the left and right eyebrow
D6 and D8 Distance between the nose point and the right and left mouth corner
D7 and D9 Distance between the nose point and the midpoint of the upper and lower lip
D11 Distance between the midpoint of the upper and lower lip
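The sketch below shows how a few distances of the kind listed in Table 2 could be computed from the 68 landmark co-ordinates and normalized into camera-distance-invariant ratios. The landmark indices and the choice of normalizing distance are our own illustrative assumptions, not the exact definitions used in paper [1].

    import numpy as np

    def euclidean(p, q):
        return float(np.linalg.norm(np.asarray(p, dtype=float) - np.asarray(q, dtype=float)))

    def example_displacement_ratios(points):
        # points: (68, 2) array of landmark co-ordinates (0-based indexing).
        d1 = euclidean(points[37], points[41])      # upper vs. lower eyelid of one eye
        d2 = euclidean(points[43], points[47])      # upper vs. lower eyelid of the other eye
        d3 = euclidean(points[21], points[22])      # inner points of the two eyebrows
        # Dividing by an intra-face reference distance (here the distance between the
        # outer eye corners) removes the dependence on how far the subject stands
        # from the camera.
        reference = euclidean(points[36], points[45])
        return np.array([d1, d2, d3]) / reference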
Subtle emotions are hard to detect. If we magnify the emotions, there is a possibility of increasing the accuracy of detection. Motion properties such as velocity and acceleration can be used for the magnification, as can amplitude and phase. Based on these properties, there are A-EMM (amplitude-based) and P-EMM (phase-based) Eulerian motion magnification variants [4].
4. Tools and Libraries Used
4.1 OpenCV
OpenCV is the library we will be using for image transformation functions such as converting the image to grayscale. It is an open source library that can be used for many image functions and has a wide variety of algorithm implementations. C++ and Python are the main languages supported by OpenCV. It is a complete package which can be used with other libraries to form a pipeline for any image extraction or detection framework. The range of functions it supports is enormous, and it is well documented.
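As a tiny illustration of the kind of transformation mentioned here (file names are placeholders), this is how an image is loaded and converted to grayscale with OpenCV:

    import cv2

    image = cv2.imread("face.png")                    # BGR image as a numpy array
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # single-channel grayscale image
    cv2.imwrite("face_gray.png", gray)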
4.2 Dlib
Dlib is another powerful image-processing library which can be used in conjunction with Python, C++ and other tools. The main functions this library provides are detecting faces, extracting features, matching features etc. It also has support for other domains such as machine learning and numerical optimization.
4.3 Python
Python is a powerful scripting language and is very useful for solving statistical problems involving machine learning algorithms. It has various utility functions which help in pre-processing. Processing is fast and it is supported on almost all platforms. Integration with C++ and other image libraries is very easy, and it has in-built functions and libraries to store and manipulate data of all types. It provides the pandas and numpy frameworks which help in manipulating data as per our need. A good feature set can be created using numpy arrays, which can hold n-dimensional data.
4.4 Scikit-learn
Scikit-learn is the machine learning library for Python. It works alongside numpy and matplotlib and provides a wide array of machine learning algorithms. The API is very easy to use and understand. It has many functions to analyze and plot the data. A good feature set can be formed using many of its feature reduction, feature importance and feature selection functions. The algorithms it provides can be used for classification and regression problems and their sub-types.
4.5 Jupyter Notebook
Jupyter Notebook is the IDE used to combine Python with all the libraries we will be using in our implementation. Plots and images are displayed instantly. It can be used as a one-stop shop for all our requirements, and most of the libraries like Dlib, OpenCV and Scikit-learn can be integrated easily.
4.6 Database
We have used the extended Cohn-Kanade database (CK+) and the Radboud Faces Database (RaFD). CK+ has 593 image sequences of 123 subjects. Only 327 files have labeled/identified emotions. It covers all the basic human emotions displayed by the face. The emotions and codes are as follows: 1 – Anger, 2 – Contempt, 3 – Disgust, 4 – Fear, 5 – Happy, 6 – Sadness, 7 – Surprise. The database is widely used for emotion detection research and analysis. There are 3 more folders along with the images: FACS contains the action units for each image, Landmark contains the AAM-tracked facial features for the 68 facial points, and Emotion contains the emotion label for each labeled sequence.
The Radboud Faces Database [19] is a standard database having an equal number of files for all emotions. It has images of 67 subjects displaying 8 emotions (neutral included). The pictures are taken in 5 different camera poses. Also, the gaze is in 3 directions. We are using only front-facing images for our experiment. We have a total of 536 files with 67 models displaying 8 different emotions.
5. Implementation
A static approach using extracted features and emotion recognition using machine learning is used in this work. The focus is on extracting features using Python and image processing libraries and using machine learning algorithms for prediction. Our implementation is divided into three parts. The first part is image pre-processing and face detection. For face detection, the inbuilt methods available in the dlib library are used. Once the face is detected, the region of interest and important facial features are extracted from it. There are various features which can be used for emotion detection. In this work, the focus is on the facial landmark points as features.
We have a multi-class classification problem and not a multi-label one. There is a subtle difference, as a set of features can belong to many labels but only one unique class. The extracted facial features along with SVM are used to detect the multi-class emotions. The papers we have studied focus on SVM as one of the most widely used and accepted algorithms for emotion classification. Our database has a total of 7 classes to classify. We have also run logistic regression and random forest on the same features to compare the results of the different algorithms.
The image files for the CK+ database are in different directories and sub-directories based on the person and session number. Also, not all the images depict an emotion; only 327 files have one of the emotions labeled from 1-7. All the files are portable network graphics files (.png). The emotion labels are in a different directory but with the same names as the image files. We wrote a small utility function in Java which used the emotion file name to pick up the correct image from the directory and copy it into our final dataset folder. We also appended the name of the emotion file to the image file name. Thus, while parsing the file in our program, we could recover the label directly from the file name: the first part identifies the subject, 001 the session number, 00000014 the image number in the session, and finally 7 represents the emotion label.
The dataset we created consisted of only frontal face images, and there was no file with no emotion (no neutral emotion). The lighting and illumination conditions for some images were different, and some images were colored. The processing pipeline for all the images was the same, in spite of the differing illumination conditions.
For the RaFD database we simply extracted the name of the emotion from the image filename, which was in .jpg format. As this database is standard, we had a balanced number of classes for each emotion. Table 5 shows the distribution of the different emotion classes for CK+.
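A hedged sketch of the label-parsing step described above. The exact file-name formats (a trailing _<code> for our re-organized CK+ files, and an underscore-separated emotion word for RaFD-style names) are assumptions for illustration.

    import os

    CK_EMOTIONS = {1: "anger", 2: "contempt", 3: "disgust", 4: "fear",
                   5: "happy", 6: "sadness", 7: "surprise"}

    def ck_label(filename):
        # e.g. "S005_001_00000011_7.png" -> 7 (assumed naming of the re-organized folder)
        stem = os.path.splitext(filename)[0]
        return int(stem.rsplit("_", 1)[1])

    def rafd_label(filename,
                   emotions=("angry", "contemptuous", "disgusted", "fearful",
                             "happy", "sad", "surprised", "neutral")):
        # Pick the emotion word out of an underscore-separated RaFD-style name.
        tokens = os.path.splitext(filename)[0].lower().split("_")
        for token in tokens:
            if token in emotions:
                return token
        return None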
Table 5: Number of images per class for the CK+ database

Emotion        Number of images
1: Anger       45
2: Contempt    18
3: Disgust     59
4: Fear        25
5: Happy       69
6: Sadness     28
7: Surprise    83
From Figure 14 we see that the number of images per class is not equal, and this might result in some classes being misclassified. For example, contempt has very few samples (18); hence, if none of these samples are present in training, it will be difficult to classify the class contempt in the testing data set. Moreover, due to the small number of training samples, the class can also be treated as an outlier. The algorithm can become biased towards the emotion surprise and classify most images as surprise. For the RaFD database there was no bias towards any single class.
Face detection was the first and most important part of the processing pipeline. Before further processing, we had to detect the face, even though our images contained only frontal facial expression data. Once the face was detected, it was easier to determine the region of interest and extract features from it.
Figure 15: Original image from the database and detected face from the image
For face detection, we tried many algorithms, such as Haar cascades from OpenCV. Finally, we settled on the face detector based on the histogram of oriented gradients (HoG) from the Dlib library. HoG
descriptors along with SVM are used to identify the face from the image. Images are converted to grayscale before being passed to the detector.
For facial feature extraction, we used the 68-landmark facial feature predictor from dlib. The detected face is passed to the feature predictor algorithm. Figure 16 shows the detected 68 landmarks for a particular face. The predictor function returns the 68 points at the eyes (left and right), mouth, eyebrows (left and right), nose and jaw. We used a numpy array to convert the 68 points to an array of 68 x and y co-ordinates representing their locations. These are the facial features we have used to predict emotion.
The landmarks are easier to access in numpy array form. Also, from Figure 16 we know the indices of each feature, hence we can focus on a particular feature instead of the entire set.
The feature points are divided as 1-17 for the jaw, 49-68 for the mouth and so on. So, for instance, if we want to ignore the jaw, we can simply set the x and y co-ordinates for the jaw to 0 while converting the features into a numpy array. We also calculated distances and polygon areas for some of the landmark groups as additional features, as sketched below.
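The following is a minimal sketch of this feature construction; the helper names, the particular mouth and eye indices, and the use of the shoelace formula for polygon area are our own illustrative choices.

    import numpy as np

    def landmarks_to_array(shape):
        # dlib full_object_detection -> (68, 2) numpy array of (x, y) co-ordinates
        return np.array([(p.x, p.y) for p in shape.parts()], dtype=float)

    def mask_jaw(points):
        # Zero out the jaw points (indices 0-16 in 0-based numbering, i.e. points 1-17).
        masked = points.copy()
        masked[0:17] = 0.0
        return masked

    def polygon_area(points):
        # Shoelace formula for the area of a closed polygon of (x, y) points.
        x, y = points[:, 0], points[:, 1]
        return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

    def extra_features(points):
        mouth_area = polygon_area(points[48:60])            # outer mouth contour
        eye_gap = np.linalg.norm(points[37] - points[41])   # opening of one eye
        return np.array([mouth_area, eye_gap])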
The dataset of 327 files was stored in a directory and each file was processed to create the feature set. As soon as a file was picked up, its name was parsed to extract the emotion label. The emotion label was appended to a list of labels which forms our multi-class target variable. The image was then processed for face detection and feature prediction. The features derived from each file were appended to a list which was later converted to a numpy array of dimension 327x68x2. We also had the target classes in the form of a numpy array. The same process was followed for the RaFD database.
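A sketch of the dataset-assembly loop just described, reusing the ck_label and landmarks_to_array helpers from the earlier sketches; the folder name is a placeholder.

    import os
    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

    features, labels = [], []
    for filename in sorted(os.listdir("ck_plus_frontal")):     # placeholder folder name
        gray = cv2.imread(os.path.join("ck_plus_frontal", filename), cv2.IMREAD_GRAYSCALE)
        if gray is None:
            continue                                           # skip unreadable files
        faces = detector(gray, 1)
        if not faces:
            continue                                           # skip images with no detected face
        shape = predictor(gray, faces[0])
        features.append(landmarks_to_array(shape))
        labels.append(ck_label(filename))

    features = np.array(features)    # shape (n_images, 68, 2); 327 x 68 x 2 for CK+
    labels = np.array(labels)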
Once we had created the feature set and the target variable, we used Support Vector Machines to predict the emotions. The scikit-learn machine learning library was used to implement the Support Vector Machine (SVM) and Logistic Regression algorithms. The multiclass strategy used was "One-vs-Rest" for all the algorithms. The logistic regression algorithm was fine-tuned for the "l1" and "l2" penalties. We also changed the linear kernel to rbf and poly to see the variation in results.
Cross-validation was used along with SVM to remove any biases in the databases. Initially the dataset was divided as 70% for training and 30% for testing. We tried other splits such as 80:20, but the 70:30 split seemed more appealing, as our assumption was that all classes would be equally represented in the test set. For the cross-validation score we initially tested with 4 splits. To improve the results we chose the values 5 and 10, which are standard values for
cross-validation. Random Forest Classifier and Decision Trees were also run on our dataset, but resulted in low accuracy compared to the other algorithms in our experiment; hence we decided not to pursue them further.
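A hedged sketch of this classification step with scikit-learn: a one-vs-rest linear SVM on the flattened landmark array, a 70:30 split, and 5-fold cross-validation. Parameter values are illustrative, and random placeholder data stands in for the arrays assembled above so the snippet runs on its own.

    import numpy as np
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    # In the real pipeline these come from the assembly loop above; random
    # placeholders are used here only so the snippet runs on its own.
    rng = np.random.default_rng(0)
    features = rng.normal(size=(327, 68, 2))
    labels = rng.integers(1, 8, size=327)

    X = features.reshape(len(features), -1)          # flatten to (n_images, 136)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.3, stratify=labels, random_state=42)

    clf = OneVsRestClassifier(SVC(kernel="linear", C=1.0))
    clf.fit(X_train, y_train)
    print("hold-out accuracy:", clf.score(X_test, y_test))

    scores = cross_val_score(OneVsRestClassifier(SVC(kernel="linear")), X, labels, cv=5)
    print("5-fold CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))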
6. Results
We applied support vector machines to our dataset and predicted the results. The results were interpreted using the confusion matrix and the accuracy metric. The train:test split was 75:25. We also did cross-validation on the dataset to remove any biases. The number of cross-validation splits was chosen as 4 because the resulting folds have the same number of images as our 25% test set. The results are as follows:
In our experiment, SVM with a linear kernel performed better than the other kernels. The rbf kernel gave us the worst performance, whereas the poly kernel was as good as the linear kernel. We tried to keep the test set percentage the same for both the split and the cross-validated data so as to have uniformity in the results. The mean cross-validation score was also approximately equal to the accuracy score achieved by the split. Figure 17 shows the heat-map of the confusion matrix from our multi-class classification results. On further analysis of the confusion matrix,
we summarized the predicted values and actual values in report format. From this, we can infer the correct number of predictions for each class.
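The per-class analysis described here can be produced with scikit-learn's confusion_matrix and classification_report; this short sketch assumes the clf, X_test and y_test objects from the training sketch above.

    from sklearn.metrics import confusion_matrix, classification_report

    y_pred = clf.predict(X_test)                  # clf / X_test / y_test from the training sketch
    print(confusion_matrix(y_test, y_pred))       # rows: actual class, columns: predicted class
    print(classification_report(y_test, y_pred))  # per-class precision, recall and f1-score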
7. Conclusion and Future Work
Our approach consisted of the following main stages:
1. Face detection
2. Feature extraction
3. Emotion classification
Feature extraction was a very important part of the experiment. The added distance and area features provided good accuracy for the CK+ database (89%). But for the cross-database experiment we observed that raw features worked best with Logistic Regression when testing on the RaFD database and the mobile images dataset; the accuracy was 66% and 36% respectively, using the CK+ dataset as the training set. The additional features (distance and area) reduced the accuracy of the experiment for SVM, as seen in Table 13 and Table 15. Logistic regression generalized the results from the training set to the testing set better than SVM and the other algorithms. The emotion detection algorithm gave average accuracy up to 86% for the RaFD database and 87% for the CK+ database for cross-validation = 5. The RaFD dataset had an equal number of images per class; hence, cross-validation did not help in improving the accuracy of the model.
Table 22 shows our performance as compared to different papers. When compared to Paper [1], which used ORB feature descriptors, our method performed better using only the 68 facial landmark points and the distance and area features [14]. Paper [1] achieved an accuracy of 69.9% without the neutral emotion, whereas we achieved an average accuracy of 89%. Paper [14] had a feature extraction technique similar to ours, and their accuracy was slightly (by 0.78) better than ours. Paper [20] used a large number of iterations to train their model, achieving an accuracy of
98.15% with 35,000 iterations. Paper [21] also used a similar concept of angles and areas and achieved accuracies of 82.2% and 86.7% for k-NN and CRF respectively.
Table 22 (excerpt): [21] – k-NN and CRF using angles and areas, all 7 emotions (neutral instead of contempt); k-NN: 82.2, CRF: 86.7.
We did not focus on face detection in this paper. Our main focus was on feature extraction and analysis of the machine learning algorithms on the dataset. But an accurate face-detection algorithm becomes very important if there are multiple people in the image. If we are determining the emotion of a particular person from a webcam, the system should be able to detect all the faces accurately.
For future work, a more robust face detection algorithm coupled with some good features can be researched to improve the results. We focused on only some distances and areas; there can be many more such interesting features on the face which can be statistically calculated and used for training the algorithm. Also, not all the features help to improve the accuracy; some may not be helpful in combination with the other features. Feature selection and reduction techniques can be applied to the created feature set to improve the accuracy of the model. We can experiment with the facial action coding system or feature descriptors as features, or a combination of both. Also, we can experiment with datasets covering different ethnicities. This will give us an idea of whether the approach works similarly for all kinds of faces or if other features should be extracted to identify the emotion. Applications such as drowsiness detection among drivers [1] can be developed using feature selection and by cascading different algorithms together. Algorithms like logistic regression, linear discriminant analysis and random forest classifier can be fine-tuned to achieve good accuracy and results. Also, metrics such as cross-validation score, recall and f1 score can be used to assess the correctness of the model, and the model can be improved based on these metric results.
References
[1] W. Swinkels, L. Claesen, F. Xiao and H. Shen, "SVM point-based real-time emotion
detection," 2017 IEEE Conference on Dependable and Secure Computing, Taipei, 2017.
[2] Neerja and E. Walia, "Face Recognition Using Improved Fast PCA Algorithm," 2008
[3] H. Ebine, Y. Shiga, M. Ikeda and O. Nakamura, "The recognition of facial expressions with automatic detection of the reference face," 2000 Canadian Conference on Electrical and Computer Engineering, 2000.
[4] A. C. Le Ngo, Y. H. Oh, R. C. W. Phan and J. See, "Eulerian emotion magnification for
[5] V. Kazemi and J. Sullivan, "One millisecond face alignment with an ensemble of regression
trees," 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH,
2014
[6] M. Dahmane and J. Meunier, "Emotion recognition using dynamic grid-based HoG
[7] K. M. Rajesh and M. Naveenkumar, "A robust method for face recognition and face emotion
[8] C. Loconsole, C. R. Miranda, G. Augusto, A. Frisoli and V. Orvalho, "Real-time emotion
recognition novel method for geometrical facial features extraction," 2014 International
Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, 2014
[9] J. M. Saragih, S. Lucey and J. F. Cohn, "Real-time avatar animation from a single
image," Face and Gesture 2011, Santa Barbara, CA, USA, 2011
[10] G. T. Kaya, "A Hybrid Model for Classification of Remote Sensing Images With Linear
SVM and Support Vector Selection and Adaptation," in IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, vol. 6, no. 4, pp. 1988-1997, Aug. 2013
[11] X. Jiang, "A facial expression recognition model based on HMM," Proceedings of 2011
[12] J. J. Lee, M. Zia Uddin and T. S. Kim, "Spatiotemporal human facial expression recognition using Fisher independent component analysis and Hidden Markov Model," 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2008.
[13] Xiaoxu Zhou, Xiangsheng Huang, Bin Xu and Yangsheng Wang, "Real-time facial
expression recognition based on boosted embedded hidden Markov model," Image and
Graphics (ICIG'04), Third International Conference on, Hong Kong, China, 2004
[14] T. Kundu and C. Saravanan, "Advancements and recent trends in emotion recognition
using facial image analysis and machine learning models," 2017 International Conference on