IEEE Conference Template
Abstract—Facial expressions serve as one of the most natural and universal methods of human communication, transcending both language and cultural differences. This project aims to advance the area of Facial Expression Recognition (FER) by implementing and assessing a range of deep learning models. Our research has two primary goals: to achieve state-of-the-art accuracy on established benchmark datasets and to adapt these findings to dynamic, real-world conditions. By combining convolutional neural networks (CNNs), data augmentation techniques, and transfer learning, we reached an accuracy of 75.8 percent on the FER2013 test set, exceeding previously published results. Beyond boosting accuracy, this research highlights the practical application of FER systems. To that end, we created a mobile web application that allows our FER models to operate directly on devices in real time. This application not only demonstrates the viability of deploying FER models on edge devices but also underscores their potential use in dynamic and resource-limited settings, such as healthcare monitoring, driver safety mechanisms, and human-computer interaction. Our comparative analysis also investigates how these models perform under different conditions, including variations in lighting, occlusions, and scenarios involving multiple people, ensuring robustness and broad applicability. Through this project, we strive to connect academic research with real-world implementations of FER systems, advancing the frontiers of accuracy, efficiency, and usability in practical settings.

Index Terms—facial expression recognition, convolutional neural networks, transfer learning, data augmentation, ensemble learning
I. INTRODUCTION

Facial expressions are vital for human communication and connection. Although identifying basic emotions in controlled environments—characterized by good lighting, frontal images, and posed expressions—has achieved nearly perfect accuracy, detecting emotions in more dynamic, real-life situations presents a greater challenge. Real-world factors such as fluctuating lighting conditions, nighttime environments, varying head poses, and occlusions complicate the task significantly. The emergence of deep learning over the last ten years has transformed Facial Expression Recognition (FER), allowing systems to classify emotions with remarkable precision, often exceeding human capabilities. This progress has paved the way for innovative applications across various domains, including socially adept robotics, tailored medical services, driver safety supervision, and human-computer interaction systems. In our project, we aim to expand the horizons of FER by targeting practical applications in particularly challenging environments like low-light, nighttime, and dynamic conditions. To boost performance, we utilize advanced techniques such as transfer learning, data augmentation, class balancing, integration of auxiliary data, and ensemble modeling. Additionally, we prioritize interpretability and thorough error analysis to refine our models. Ultimately, our objective is to create a hybrid model that can provide trustworthy emotion recognition, even in intricate and unpredictable settings.

II. RELATED WORKS

A. Advancements in FER and hybrid approaches

The FER2013 dataset, introduced by Goodfellow et al. [3] through a Kaggle competition, has played a crucial role in advancing facial emotion recognition systems. Although current studies showcase notable progress, there are still unexplored avenues for further enhancement. The leading models in the competition used convolutional neural networks (CNNs) combined with creative strategies such as image transformations and novel loss functions, exemplified by Tang [4], who reached an accuracy of 71.2 percent using the L2-SVM loss function. Our research builds on these strategies by employing a hybrid model that combines CNN and LSTM architectures to improve spatiotemporal feature representation and increase the system's resilience in challenging conditions. The survey by S. Li and W. Deng [1] offers a detailed look at the advancements in deep learning for FER, while the work of Pramerdorfer and Kampel [2] illustrates the efficacy of model ensembling in attaining top-tier results, achieving 75.2 percent accuracy on FER2013. Motivated by these results, forthcoming iterations of our hybrid model could integrate ensemble learning techniques, harnessing the strengths of CNN, LSTM, and other state-of-the-art networks. Additionally, the research of Zhang et al. [5] reveals the advantages of auxiliary features such as histograms of oriented gradients (HoG) and facial landmark registration. Future work could fuse these features into the hybrid framework to combat inaccuracies in landmark extraction on difficult datasets like FER2013. Likewise, the strategies applied by Kim et al. [6]—including face registration, data augmentation, and ensembling—could be adapted and refined for hybrid models to boost performance further. Another promising direction involves real-world application contexts where FER systems encounter challenges such as fluctuating lighting, occlusion, and varying head poses. Implementing domain adaptation strategies and training on larger, more varied datasets could greatly enhance the model's ability to generalize. Furthermore, broadening the hybrid approach to encompass multimodal FER by integrating audio and physiological signals can enrich the context of emotion detection. By tackling these issues, the proposed hybrid strategy promises to contribute to the continuous evolution of FER systems, bringing us closer to human-like performance in emotion recognition.
III. DATASETS
Facial Expression Recognition (FER) is a well-explored
field featuring a range of available datasets. In this study,
we mainly focused on the FER2013 dataset to evaluate our
model’s performance. To improve accuracy further, we added
CK+ and JAFFE as supplementary datasets. Additionally,
we assembled a custom web app dataset, which comprised
images of various expressions from real-world contexts, as
well as personal images, to refine the model and enhance its
applicability to real-life scenarios.
A. FER2013 Dataset

The FER2013 dataset is extensively studied; it was introduced for the ICML 2013 Challenges in Representation Learning competition [3] and has been used in numerous research initiatives since. It is regarded as one of the more difficult FER datasets, with human-level accuracy estimated at 65±5 percent and the highest accuracy reported in the published literature reaching 75.2 percent [2]. The dataset, accessible on Kaggle, consists of 35,887 grayscale images, each sized 48x48 pixels. It covers seven facial expression categories: Angry (4,953), Disgust (547), Fear (5,121), Happy (8,989), Sad (6,077), Surprise (4,002), and Neutral (6,198). FER2013 is therefore markedly imbalanced, with considerable discrepancies in the number of images across these expressions.

Fig. 1. Images from each emotion class in the FER2013 dataset.

B. Typical faces

To improve the performance of our hybrid CNN-LSTM model, we developed a custom dataset that includes images showcasing various facial expressions, some of them captured by us. We recorded four distinct expressions—Happy, Sad, Angry, and Neutral—under different lighting and real-world conditions. These images were used to augment existing datasets such as FER2013, CK+, and JAFFE. This strategy was designed to boost the model's capability to generalize and adapt to a wide range of situations, ensuring reliable performance in real-world applications.

Fig. 2. Sample images of typical expressions.

IV. MODELS

A. Base Line Model

To gain a better grasp of the problem, we initially addressed it by constructing a basic CNN consisting of four 3x3, 32-filter, same-padding convolutional layers with ReLU activations, interspersed with two 2x2 max-pooling layers, and finishing with a fully connected layer and a softmax output layer. We also incorporated batch normalization and 50 percent dropout layers to mitigate high variance, which helped improve our accuracy from 53.0 percent to 64.0 percent.
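For concreteness, the following is a minimal tf.keras sketch of a baseline of this shape. The hidden layer width, the optimizer, and the exact placement of batch normalization and dropout are not specified above and are our assumptions.

from tensorflow.keras import layers, models

def build_baseline(input_shape=(48, 48, 1), num_classes=7):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # First pair of 3x3x32 same-padding ReLU convolutions, then 2x2 pooling
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        # Second pair of convolutions and the second pooling stage
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),  # hidden FC width is an assumption
        layers.Dropout(0.5),                   # 50 percent dropout, as in the text
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",            # optimizer choice is an assumption
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model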
B. Five-Layer Model
One of the top accuracy studies we found was conducted by Pramerdorfer and Kampel [2], who reported an accuracy of 75.2 percent without using auxiliary training data or facial landmark registration. The authors reached these results by examining six other studies and ensembling their networks. Given its straightforward architecture, we chose to replicate one of those networks, the model of Kim et al. [6]. This model is structured as three stages of convolutional and max-pooling layers, followed by a fully connected layer of size 1024 and a softmax output layer. The convolutional layers comprise 32, 32, and 64 filters with dimensions of 5x5, 4x4, and 5x5, respectively. The max-pooling layers use 3x3 kernels with a stride of 2, and ReLU serves as the activation function. To enhance performance, we additionally applied batch normalization to every layer and a 30 percent dropout after the final fully connected layer. For fine-tuning, we trained the model for 300 epochs, optimizing the cross-entropy loss with stochastic gradient descent and a momentum of 0.9. The initial learning rate, batch size, and weight decay were set to 0.1, 128, and 0.0001, respectively. If the validation accuracy does not improve for 10 epochs, the learning rate is halved.

Fig. 3. Architecture of Five-layer model.
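A sketch of this network in tf.keras follows; the placement of batch normalization relative to the activations is our assumption, and the learning-rate schedule maps onto the ReduceLROnPlateau callback.

from tensorflow.keras import callbacks, layers, models, optimizers

def build_five_layer(input_shape=(48, 48, 1), num_classes=7):
    model = models.Sequential([layers.Input(shape=input_shape)])
    # Three conv/max-pool stages: 32@5x5, 32@4x4, 64@5x5, ReLU throughout
    for filters, ksize in [(32, 5), (32, 4), (64, 5)]:
        model.add(layers.Conv2D(filters, ksize, padding="same", activation="relu"))
        model.add(layers.BatchNormalization())
        model.add(layers.MaxPooling2D(pool_size=3, strides=2))
    model.add(layers.Flatten())
    model.add(layers.Dense(1024, activation="relu"))
    model.add(layers.BatchNormalization())
    model.add(layers.Dropout(0.3))  # 30 percent dropout after the FC layer
    model.add(layers.Dense(num_classes, activation="softmax"))
    # The 0.0001 weight decay would be added via L2 kernel regularizers.
    model.compile(optimizer=optimizers.SGD(learning_rate=0.1, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Halve the learning rate when validation accuracy stalls for 10 epochs.
lr_schedule = callbacks.ReduceLROnPlateau(monitor="val_accuracy",
                                          factor=0.5, patience=10)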
C. Transfer-Learning

Given that the FER2013 dataset is relatively small and imbalanced, we found that transfer learning considerably improved our model's accuracy. We investigated it by utilizing the Keras VGG-Face library with ResNet50, SeNet50, and VGG16 as pre-trained models. To satisfy the input requirements of these networks, which expect RGB images no smaller than 197x197 pixels, we resized and converted the 48x48 grayscale FER2013 images during the training phase.

Fig. 4. Structure of Transfer Learning.
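The input adaptation can be sketched as follows, assuming tf.keras and the keras-vggface package; replicating the grayscale channel three times is one common way to meet the RGB requirement, not necessarily the exact pipeline used here.

from tensorflow.keras import layers, models
from keras_vggface.vggface import VGGFace  # rcmalli's keras-vggface package

# Pre-trained VGG-Face backbone; "resnet50" can be swapped for "senet50"
# or "vgg16".
base = VGGFace(model="resnet50", include_top=False,
               input_shape=(197, 197, 3), pooling="avg")

inputs = layers.Input(shape=(48, 48, 1))
x = layers.Resizing(197, 197)(inputs)   # upscale to the expected input size
x = layers.Concatenate()([x, x, x])     # replicate grayscale into 3 channels
x = base(x)
outputs = layers.Dense(7, activation="softmax")(x)
model = models.Model(inputs, outputs)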
D. Fine Tuning ResNet50

The first pre-trained model we examined was ResNet50, a deep residual network consisting of 50 layers (implemented with 175 layers in Keras). Our initial step involved replicating the work conducted by Brechet et al. [10]. We substituted the original output layer with two fully connected layers of sizes 4,096 and 1,024, followed by a softmax layer for the seven emotion classes. We froze the first 170 layers of ResNet while keeping the remaining layers trainable, and optimized with SGD at a learning rate of 0.01 and a batch size of 32. After training for 122 epochs with SGD, keeping the learning rate at 0.01 and increasing the batch size to 128, we achieved a test accuracy of 73.2 percent. We also attempted to freeze the entire pre-trained network and train only the fully connected layers and the output layer, but the model struggled to fit the training set within the first 20 epochs despite multiple attempts to tweak the hyperparameters. Given our limited computational resources, we chose not to pursue this avenue further.
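A minimal sketch of this setup, again assuming the keras-vggface backbone (the input adaptation shown earlier is omitted); the ReLU activations on the new fully connected layers are our assumption.

from tensorflow.keras import layers, models, optimizers
from keras_vggface.vggface import VGGFace

resnet = VGGFace(model="resnet50", include_top=False,
                 input_shape=(197, 197, 3), pooling="avg")
for layer in resnet.layers[:170]:
    layer.trainable = False  # freeze the first 170 layers, as in the text

model = models.Sequential([
    resnet,
    layers.Dense(4096, activation="relu"),   # new FC-4096 head
    layers.Dense(1024, activation="relu"),   # new FC-1024 head
    layers.Dense(7, activation="softmax"),   # softmax over 7 emotion classes
])
model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])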
E. Fine Tuning SeNet50

SeNet50 is another pre-trained model we investigated. It is likewise a 50-layer deep residual network whose structure closely resembles that of ResNet50, so we spent less time tuning it. Applying the same parameter set used for ResNet50, we achieved a test accuracy of 72.5 percent.
F. Fine Tuning VGG16

VGG16, while considerably shallower than ResNet50 and SeNet50 at only 16 layers, has greater complexity and a higher number of parameters. We kept all pre-trained layers frozen and added two fully connected layers of sizes 4,096 and 1,024, respectively, with a 50 percent dropout rate. After training for 100 epochs with the Adam optimizer, we reached a test accuracy of 70.2 percent.
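A corresponding sketch with the backbone fully frozen; placing dropout after each new fully connected layer is our reading of the 50 percent dropout rate.

from tensorflow.keras import layers, models
from keras_vggface.vggface import VGGFace

vgg = VGGFace(model="vgg16", include_top=False,
              input_shape=(197, 197, 3), pooling="avg")
vgg.trainable = False  # keep every pre-trained layer frozen

model = models.Sequential([
    vgg,
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),                    # 50 percent dropout
    layers.Dense(1024, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])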
V. METHODS

A. Auxiliary Data and Data Preparation

While numerous FER datasets are accessible online, they differ considerably in image size, color, format, labeling, and directory structure. We addressed these discrepancies by organizing all input datasets into seven directories, one per expression class. During training, we loaded images in batches from disk to prevent memory overflow and used Keras data generators to resize and format the images automatically, as sketched below.
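With that layout in place, a Keras generator can stream and resize batches directly from disk; the directory path below is hypothetical.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# data/train/{angry,disgust,fear,happy,sad,surprise,neutral}/*.png
gen = ImageDataGenerator(rescale=1.0 / 255)
train_flow = gen.flow_from_directory(
    "data/train",              # hypothetical path to the unified layout
    target_size=(197, 197),    # resized on load, whatever the source size
    color_mode="rgb",          # grayscale sources are converted on load
    class_mode="categorical",
    batch_size=128,
)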
B. Data Augmentation

We explored techniques widely used in the FER literature and found that our best outcomes came from horizontal flipping, ±10 degree rotations, ±10 percent zooms, and ±10 percent horizontal and vertical shifts.
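These settings map directly onto the options of Keras's ImageDataGenerator:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug = ImageDataGenerator(
    horizontal_flip=True,
    rotation_range=10,        # +/-10 degree rotations
    zoom_range=0.10,          # +/-10 percent zooms
    width_shift_range=0.10,   # +/-10 percent horizontal shifts
    height_shift_range=0.10,  # +/-10 percent vertical shifts
)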
C. Class Weighting

To tackle class imbalance, we applied class weights inversely proportional to the number of samples per class. For the disgust class, this reduced the misclassification rate from 61 percent to 34 percent.
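A sketch of the weighting scheme, computing weights inversely proportional to class frequency and passing them to fit():

import numpy as np

def inverse_frequency_weights(labels, num_classes=7):
    # Weight each class by total / (num_classes * count), so rare classes
    # such as disgust receive proportionally larger weights.
    counts = np.bincount(labels, minlength=num_classes)
    weights = counts.sum() / (num_classes * counts)
    return dict(enumerate(weights))

# Usage: model.fit(x, y, class_weight=inverse_frequency_weights(y_int), ...)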
D. SMOTE

The Synthetic Minority Over-sampling Technique (SMOTE) oversamples the minority classes, optionally combined with undersampling of the majority classes, to balance the training data. Although applying SMOTE produced a perfectly balanced training set, our models quickly began to overfit the training data, leading us to decide against further experimentation.
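For reference, a minimal sketch of how SMOTE can be applied with the imbalanced-learn package; the arrays below are stand-ins for the real training set, and images are flattened for resampling and reshaped afterwards.

import numpy as np
from imblearn.over_sampling import SMOTE

# Stand-ins for the real training arrays: X (n, 48, 48, 1), y (n,)
rng = np.random.default_rng(0)
X_train = rng.random((100, 48, 48, 1), dtype=np.float32)
y_train = rng.integers(0, 7, size=100)

# SMOTE operates on flat feature vectors, so flatten and reshape around it.
X_flat = X_train.reshape(len(X_train), -1)
X_bal, y_bal = SMOTE(random_state=0, k_neighbors=3).fit_resample(X_flat, y_train)
X_bal = X_bal.reshape(-1, 48, 48, 1)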
E. Ensembling and Test-Time Augmentation (TTA)
We executed an ensemble of seven models combined by soft voting, which raised our peak test accuracy from 73.2 percent to 75.8 percent. Likewise, we applied test-time augmentation (TTA) that included horizontal flipping of the test images.
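A sketch of the combined procedure, averaging predicted class probabilities over the ensemble members and over horizontally flipped copies of each test image; the array shapes follow FER2013.

import numpy as np

def ensemble_tta_predict(members, images):
    # members: list of trained Keras models; images: (n, 48, 48, 1) array
    flipped = images[:, :, ::-1, :]   # horizontal flip along the width axis
    probs = np.zeros((len(images), 7))
    for m in members:
        probs += m.predict(images, verbose=0)
        probs += m.predict(flipped, verbose=0)
    probs /= 2 * len(members)         # average probabilities: soft voting
    return probs.argmax(axis=1)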
REFERENCES

[1] S. Li and W. Deng, "Deep facial expression recognition: A survey," arXiv preprint arXiv:1804.08348, 2018.
[2] C. Pramerdorfer and M. Kampel, "Facial expression recognition using convolutional neural networks: state of the art," arXiv preprint arXiv:1612.02903, 2016.
[3] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee et al., "Challenges in representation learning: A report on three machine learning contests," in International Conference on Neural Information Processing. Springer, 2013, pp. 117–124.
[4] Y. Tang, "Deep learning using linear support vector machines," in International Conference on Machine Learning (ICML) Workshops, 2013.
[5] Z. Zhang, P. Luo, C.-C. Loy, and X. Tang, "Learning social relation traits from face images," in Proc. IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3631–3639.
[6] B.-K. Kim, S.-Y. Dong, J. Roh, G. Kim, and S.-Y. Lee, "Fusing aligned and non-aligned face information for automatic affect recognition in the wild: A deep learning approach," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2016, pp. 48–57.
[7] M. Quinn, G. Sivesind, and G. Reis, "Real-time emotion recognition from facial expressions," 2017.