Enhanced Convolutional Neural Network For Robust Facial Expression Recognition On Fer2013 and Natural Image Datasets
represents the ith convolution kernel, and $b_i$ represents the bias of the ith convolution kernel. The expression of the ReLU function is as shown in (2).

$\mathrm{ReLU}(y) = \begin{cases} y, & y \ge 0 \\ 0, & y < 0 \end{cases}$ (2)
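As a minimal illustration of (2), a NumPy sketch (not the authors' code):

```python
import numpy as np

def relu(y):
    # Equation (2): pass y through unchanged when y >= 0, output 0 otherwise.
    return np.maximum(y, 0.0)

print(relu(np.array([-1.5, 0.0, 2.3])))  # [0.  0.  2.3]
```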
This work uses a total of 4 convolutional layers. The convolution kernel sizes are 1*1, 5*5, 3*3, and 3*3, and the numbers of convolution kernels are 32, 32, 64, and 128. Each convolution layer is followed by the output of the excitation layer. A 1*1 convolution kernel before the second layer of convolution is used to increase the nonlinear representation of the input, deepen the network structure of the model, and improve the expressive ability of the model. The input image is a 48*48 matrix. After convolution with 32 1*1 convolution kernels, 32 48*48 feature maps are output. The second layer of convolution uses a 5*5 convolution kernel to first extract features in a large receptive field; the later layers then reduce the convolution kernel size to extract features in smaller areas. Convolving the 48*48 feature maps with 32 5*5 convolution kernels yields 32 feature maps of size (48-5+1)*(48-5+1) = 44*44. Using 32 convolution kernels means that 32 different local expression features are extracted. The third and fourth convolutional layers each use 3*3 convolution kernels. The specific parameter values of each layer of the network are shown in Figure 1.
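The four convolutional layers described above can be sketched in Keras as follows. The padding choices are assumptions inferred from the feature-map sizes stated in the text, and the pooling layers shown between the later convolutions are discussed in the next subsection; this is a sketch, not the authors' published code.

```python
from tensorflow.keras import layers, models

# Sketch: kernel sizes 1*1, 5*5, 3*3, 3*3 with 32-32-64-128 kernels and ReLU
# excitation, assuming 'same' padding for the 3*3 layers so the stated
# feature-map sizes (48*48 -> 44*44 -> ... -> 5*5) work out.
model = models.Sequential([
    layers.Conv2D(32, (1, 1), activation='relu',
                  input_shape=(48, 48, 1)),                      # 32 maps, 48*48
    layers.Conv2D(32, (5, 5), activation='relu'),                # 32 maps, 44*44
    layers.MaxPooling2D((2, 2)),                                 # 22*22 (Sec. 3.2)
    layers.Conv2D(64, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),                                 # 11*11
    layers.Conv2D(128, (3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),                                 # 128 maps, 5*5
])
```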
Each layer of the convolutional neural network performs feature extraction through its convolution operation. This paper fuses the features extracted by the different convolution kernels in each layer and visually displays the extracted feature maps. A facial expression picture from the Fer2013 dataset was used for demonstration. The feature extraction result for a feature map after each convolution operation is shown in Figure 2.

Figure 2. Features after convolution (original image; first-layer convolution, kernel 1*1; second-layer convolution, kernel 5*5; third-layer convolution, kernel 3*3; fourth-layer convolution, kernel 3*3).

3.2. Pooling Layer

The pooling layer of a convolutional neural network is usually placed after the convolutional layer. The number of feature maps increases as the number of convolutional layers increases, and this growth in feature dimension can cause a dimensionality disaster, so a pooling layer is usually added after the convolutional layer for dimensionality reduction. This work uses the maximum pooling operation, which retains the most salient feature in each pooled area. The pooling layer can be expressed as shown in (3).

$S_i = \mathrm{down}(\max(y_{a,b})), \quad a, b \in p_i$ (3)

Where $S_i$ represents the maximum pooling result of the ith pooling area, $\mathrm{down}(\cdot)$ represents the down-sampling process (retaining the maximum value of the pooling area), $y_{a,b}$ represents a value in the pooling area, and $p_i$ represents the ith pooling area. The third layer of the network structure in this work is the pooling layer, and the feature maps input to this layer are 44*44. The pooling area is 2*2, so each 2*2 region of the feature map is a pooling window, and each pooling window produces one maximum pooling result. Therefore, the final pooling result of the feature map is (44/2)*(44/2) = 22*22.
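Equation (3) with a 2*2 window can be demonstrated directly in NumPy (an illustration, not the authors' implementation):

```python
import numpy as np

def max_pool_2x2(fmap):
    # Equation (3): each non-overlapping 2*2 area p_i is down-sampled to its
    # maximum value S_i, halving both spatial dimensions (44*44 -> 22*22).
    h, w = fmap.shape
    return fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(np.arange(16.0).reshape(4, 4)))  # 4*4 -> 2*2
```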
3.3. Fully Connected Layer

In the fully connected layer, the neurons are connected to all of the neurons in the previous layer, thereby converting the feature dimensions into one-dimensional data. The last pooling layer in this work is connected to the fully connected layer. The last pooling layer outputs 128 feature maps of size 5*5, which are converted into one-dimensional data of length 128×5×5 = 3200; this 1×3200 vector is then input into the fully connected layer. The fully connected layer is represented by (4).

$Full = f(w \times z + b)$ (4)

Where $Full$ represents the output result of the fully connected layer, $f(\cdot)$ is the ReLU activation function, $w$ represents the weight value of the connection, $z$ is the value input to the fully connected layer, and $b$ is the bias. In order to reduce the complexity of the network structure and prevent overfitting, random deactivation (Dropout) of neurons is used.
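A sketch of this fully connected stage in Keras follows; the dense-layer width and dropout rate are illustrative assumptions, since the text does not state them.

```python
from tensorflow.keras import layers

fc_stage = [
    layers.Flatten(),                       # 128 maps of 5*5 -> 1*3200 vector
    layers.Dense(1024, activation='relu'),  # Full = f(w*z + b), f = ReLU (assumed width)
    layers.Dropout(0.5),                    # random deactivation of neurons (assumed rate)
]
```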
3.4. SoftMax

The last layer of the network structure in this work is the SoftMax function, which classifies the 7 facial expressions. There are 7 neurons in this layer, and each neuron represents an expression category. For each input face picture, the 7 neurons in the SoftMax layer output probabilities between 0 and 1, and the neuron with the largest output probability value indicates the expression whose probability is the highest. The representation of SoftMax classification is shown in (5).

$p(y = c \mid m; w) = \dfrac{e^{w_c \times m}}{\sum_{i=1}^{k} e^{w_i \times m}}$ (5)

Where $p(y = c \mid m; w)$ represents the probability that the input picture $m$ is of expression type $c$, $w$ is the weight parameter value (to be fitted), and $k$ is the total number of categories, 7. The value of expression type $c$ is {0, 1, 2, 3, 4, 5, 6}.
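Equation (5) can be computed directly, as in this NumPy sketch (the max-subtraction is a standard numerical-stability step, not part of the paper):

```python
import numpy as np

def softmax_probs(scores):
    # Equation (5): scores[i] plays the role of w_i * m for the 7 classes.
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

p = softmax_probs(np.random.randn(7))
print(p.argmax())  # predicted expression type c in {0, 1, 2, 3, 4, 5, 6}
```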
4. Experiment

The experiments in this work are implemented in Python and are based on the Keras deep learning platform. In addition, Python was also used to reproduce the models from two earlier papers for comparison. In order to make a fair comparison of the experimental results, a unified data set was used for training the different models.

4.1. Dataset

This article uses two data sets: one is the Fer2013 [16] facial expression data set, and the other was produced as part of this paper's work. The Fer2013 expression database has 35,886 facial expression pictures, including 28,708 in the training set, 3,589 in the verification set, and 3,589 in the test set. The size of each grayscale image is 48*48. There are 7 expressions in the data set: angry, disgusted, fearful, happy, sad, surprised, and neutral.

The Fer2013 data set was collected in a laboratory environment, so it cannot well verify the model's recognition of human expressions in the natural state.
Therefore, a search for pictures of human expressions in the natural state was performed on the Internet, and the collected pictures were then analyzed. Their size, pixels, background, etc. were preprocessed, and the pictures were uniformly converted into grayscale images. Finally, a small data set was formed. The facial expressions in this self-made data set are divided into 7 types, with a total of 396 pictures. This work uses the above two data sets to jointly verify the performance of the proposed convolutional neural network model.
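A hedged sketch of this kind of preprocessing, converting a collected image to a 48*48 grayscale array (illustrative code, not the authors' script):

```python
import numpy as np
from PIL import Image

def preprocess(path):
    img = Image.open(path).convert('L')   # convert to grayscale
    img = img.resize((48, 48))            # match the Fer2013 input size
    return np.asarray(img, dtype=np.float32) / 255.0  # scale pixels to [0, 1]
```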
4.2. Model Training

In order to train a more accurate model and use the expression pictures more efficiently, the expression data library was amplified through a series of random transformations, such as those shown in Figure 3.

Figure 3. Data augmentation.
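In Keras, such random transformations can be produced with ImageDataGenerator; the specific transformations and ranges below are assumptions, since the paper does not list them.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,       # small random rotations (assumed range)
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    zoom_range=0.1,          # random zoom
    horizontal_flip=True,    # mirror faces left-right
)
# datagen.flow(x_train, y_train, batch_size=32) then yields augmented batches.
```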
The loss function used in this work is the multi-class cross-entropy loss function, as shown in (6).

$loss = -\sum_{i=1}^{n} \left( y_{i1} \log a_{i1} + \ldots + y_{i7} \log a_{i7} \right)$ (6)

Where $a$ is the actual output value of the neuron and $y$ is the expected output value.
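Equation (6) in NumPy form (a sketch; in Keras this is the built-in 'categorical_crossentropy' loss):

```python
import numpy as np

def cross_entropy(y_true, a_pred):
    # Equation (6): y_true is one-hot with shape (n, 7); a_pred holds the
    # SoftMax outputs a_i1..a_i7. The small epsilon avoids log(0).
    return -np.sum(y_true * np.log(a_pred + 1e-12))
```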
The training goal is to minimize the loss value: the backpropagation algorithm propagates the error value, and the SGD optimization algorithm updates the parameter values along the direction of gradient descent. The gradient used by SGD is as shown in (7).

$\dfrac{\partial loss}{\partial \theta_{i1}} = -\sum_{i=1}^{n} \dfrac{y_{i1}}{a_{i1}} \cdot \dfrac{\partial a_{i1}}{\partial \theta_{i1}}$ (7)

Therefore, the parameters are updated as in (8).

$\theta_j = \theta_j - a \dfrac{\partial loss}{\partial \theta_j}$ (8)

Where $\theta_j$ is the parameter to be updated, $a$ is the learning rate, and $\partial loss / \partial \theta_j$ is the value to decrease along the gradient descent direction. The learning rate in this model is 0.01. In order to allow the training to converge to the best result, this work sets the learning rate to gradually decay as the number of training iterations increases, so the learning step size gradually decreases.
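In Keras this optimizer setup might look as follows; the decay schedule and its constants are assumptions, since the paper states only that the 0.01 learning rate gradually decays.

```python
from tensorflow.keras import optimizers

# Learning rate starts at 0.01 and decays as training proceeds (schedule assumed).
lr_schedule = optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.9)
sgd = optimizers.SGD(learning_rate=lr_schedule)
# model.compile(optimizer=sgd, loss='categorical_crossentropy', metrics=['accuracy'])
```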
This work first uses the training set of the Fer2013 data set to train the model, and then uses the verification set to verify the recognition accuracy. When the accuracy on the verification set decreases and the loss value increases, the training is stopped.
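This stopping rule resembles Keras's EarlyStopping callback; a sketch under that assumption (the patience value is illustrative):

```python
from tensorflow.keras.callbacks import EarlyStopping

stopper = EarlyStopping(monitor='val_loss', patience=5,
                        restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[stopper])
```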
4.3. Experimental Results and Analysis

This article reproduces the convolutional neural network model proposed by Lu et al. [7] and the LeNet-5 model proposed by Li et al. [14] using Python. The Fer2013 data set was used for training, and the accuracy was calculated on the test set. The training results of the model in this paper, the model of Lu et al. [7], and the model of Li et al. [14] are shown in Figures 4-6.

Figure 6. Training results of the Li et al. model.

For the training results, we selected the trained models with the highest accuracy on the verification set. The accuracy of the various models on the test set is summarized in Table 2. It can be seen from this table that the accuracy of the proposed model on the test set is relatively high, at 72.92%.
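Selecting the model with the highest verification-set accuracy and then scoring it on the test set can be done with a checkpoint callback, as in this sketch (file name and tooling are assumptions, not taken from the paper):

```python
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('best_model.h5', monitor='val_accuracy',
                             save_best_only=True)
# After training with callbacks=[checkpoint] and reloading the saved weights:
# test_loss, test_acc = model.evaluate(x_test, y_test)
```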
From the confusion matrix, one can see that the recognition accuracy for happy, neutral, and surprised expressions is relatively high, but the recognition effect is relatively poor for disgust and fear expressions. Regarding disgust, different people express it very differently; therefore, when recognizing disgust expressions, the recognition results are relatively scattered and may be classified into various expressions. Fear expressions are easily recognized as surprise, mainly because of the features extracted from the eyes: both fear and surprise tend to make people's pupils dilate, so fear is often recognized as surprise. The overall accuracy rate on the self-made natural expression data set is 337/396 = 85.10%.
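A confusion matrix of this kind is commonly computed with scikit-learn (an assumption; the paper does not name its tooling):

```python
from sklearn.metrics import confusion_matrix

def expression_confusion(y_true_onehot, y_prob):
    # cm[i, j] counts images of true class i predicted as class j, so the
    # fear-to-surprise confusion described above appears as off-diagonal
    # mass in the fear row under the surprise column.
    return confusion_matrix(y_true_onehot.argmax(axis=1),
                            y_prob.argmax(axis=1), labels=list(range(7)))
```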
In the course of this work, we also analyzed several difficulties in facial expression recognition. Human beings are complex animals with a rich inner world. The expressions on human faces are sometimes intertwined with multiple emotions; for example, a facial expression may contain surprise, anger, and helplessness at the same time, which makes recognition difficult. Sometimes different expressions may express
the same emotion, and the same expression may convey different emotions for different people, which requires strict extraction of the subtle features of human faces. Finally, human facial features have their own individual characteristics and cannot be generalized. For example, in the expression recognition process, the expressions of people with big eyes are more likely to be recognized as surprise or fear.
5. Conclusion

Unlike traditional facial expression recognition methods that rely heavily on manual feature engineering, convolutional neural networks (CNNs) possess the remarkable ability to automatically and implicitly learn relevant features directly from raw image data. By utilizing only the pixel values of facial images as input, the CNN model is capable of discovering complex hierarchical patterns that represent various emotional expressions. In this study, a customized CNN architecture is developed specifically for facial expression recognition, leveraging the inherent strengths of CNNs in spatial feature extraction and image classification tasks.

The proposed model is initially trained using the widely adopted FER2013 dataset, which contains a large and diverse collection of labeled facial expression images. The experimental results obtained from this dataset highlight the superior performance of the designed network in terms of classification accuracy and robustness, affirming the effectiveness of the architectural choices made in the model's design. To further evaluate the model's capability to generalize beyond controlled environments, a supplementary dataset composed of facial expression images collected from natural, real-world settings was created and utilized. These images underwent preprocessing steps such as grayscale conversion, cropping, and pixel normalization to align them with the training data format.

The evaluation on this custom dataset revealed that the CNN maintained a high level of recognition performance, indicating good generalization capacity across varied contexts and image conditions. However, it is important to note that CNNs typically require large-scale datasets to achieve optimal learning outcomes, as their depth and parameter complexity necessitate extensive exposure to diverse training examples. Consequently, the development of more comprehensive and representative facial expression datasets, especially those depicting spontaneous expressions in real-life scenarios, would further enhance the model's applicability and reliability in practical emotion recognition systems. As such, the continued accumulation and integration of naturalistic facial expression images is essential to advance the generalization and real-world readiness of CNN-based recognition models.

References

1. Kumbhar, M., Jadhav, A., & Patil, M. (2012). Facial Expression Recognition Based on Image Feature.