
Emotion Detection with Vision Transformers and Image Features

Stephan Sharkov
Stanford University,
Department of Computer Science
[email protected]

Abstract

With computer vision on the rise and new methods such as transformers and finetuning targeting many vision problems, it is important to investigate the problems that shape our everyday experience of interacting with computers. In this paper, I attempt to improve on the problem of emotion detection by using transformers and by finetuning pre-trained models. I also attempt to use image features, particularly HOG, to investigate whether they improve performance for any of the methods. My best model achieves 58.51% accuracy on the emotion classification task, using a Transformer and the original images, improving on previously implemented models. I further discuss the qualitative performance of the transformer and how it could compare to humans. I also achieved 71.45% accuracy on the sentiment analysis problem, using finetuning and HOG features, showing the potential of image features and their importance in computer vision.

1. Introduction

The rising number of images on the web makes it possible to analyse the pictures people post on popular social media websites such as Instagram, Facebook, or Snapchat. While people tag some of those pictures or indicate certain parameters about them, one thing they cannot tag is the emotion or state of mind they had when making a post. Emotion detection, the field of recognising emotion from an image, has been an ongoing problem and is currently being improved as more methods arise in Computer Vision. This area of research could be useful in robotics, healthcare, virtual assistance, and many other fields where machines need to comprehend human emotion and act upon it.

In this project, I attempt to improve existing methods of emotion detection using Vision Transformers and finetuning, and I investigate whether image features like HOG can improve that performance.

There are two problems I analyse during the project. In the first problem, models accept training data consisting of a grayscale image, represented either as raw pixel values or as the image's HOG features, together with a number indicating the emotion in the picture. The final model is supposed to accept a new grayscale image of a human face (or its HOG features) and identify the emotion by returning the number indicating that emotion. The emotion categories targeted in this problem are: 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral.

The second problem is essentially very similar, with the sole difference that I perform sentiment analysis of the image. Rather than predicting 1 of the 7 emotions in the image, I classify the data as Negative (Anger, Disgust, Fear, Sad) vs. Neutral-Positive (Happy, Surprise, Neutral).

2. Related Work

A lot of research has been done in the field, starting back in 1972[1] with the first non-AI methods of face and emotion detection, but I would like to pay attention to a few important papers relating to this project. Early attempts used methods before CNNs and Transformers, like the work by Gajarla and Gupta [2], where the authors finetuned existing CNNs for emotion detection and sentiment analysis (positive vs. negative emotion). They first collected their own dataset using the Flickr API, an approach that is also popular in other computer vision domains, including predicting cities or urban settings from images[3]. Love and Violence were included as classes to represent recent trends of protests and civil action taking place on social media. The authors tried several methods, most of which involved fine-tuning existing models for object and scene detection. They started by training a One vs. All SVM using the object classification model VGG-ImageNet[4] as a pre-trained model: they passed the images through VGG-ImageNet[4], stored the activations of the second-to-last fully connected layer as feature vectors, and then trained an SVM for each emotion separately.
Working from the assumption that some images (for example from protests) contain many people, the authors reasoned that emotions are hard to analyse there because there are too many objects. They therefore also fine-tuned VGG-Place205[5], which uses a scene-based dataset, and obtained an accuracy improvement over the previous method. That led to the last method they used, fine-tuning ResNet-50, since that network is trained on both object and scene data, combining the two approaches above. The final results were 41% on emotion classification, which was worse than previously reported models, and 73% on the sentiment analysis task, which was better than earlier results. The authors did an impressive job considering the many factors that limit and complicate emotion classification, as well as the changing emotion classes, but using a One vs. All SVM was not the best model choice.

Later on, researchers trained their own CNNs for the problem. In the paper by Jaiswal, Raju and Deb [6], the authors trained a CNN specifically for emotion detection, using the existing datasets FER2013 and JAFFE. They used two submodels which share the same input and have the same kernel size. The outputs from feature extraction are flattened into vectors and concatenated before being passed into the final output layer for classification. Each submodel contains a convolutional layer, a local contrast normalization layer, max pooling, and a flatten layer. The reasoning behind splitting into submodels was that the authors wanted the model to look at facial features like the eyes, lips, and mouth. They compared their results to a similar model by a previous author and saw an improvement from 67% to 70% on FER2013, and from 98% to 98.5% on JAFFE. The authors introduced an interesting and working approach of splitting a CNN into submodels; however, their use of local contrast normalization, which is not widely available, prevents replicating or improving on those results.

Similar to the approach taken by Gajarla and Gupta[2], researchers like Xu et al.[7] used transfer learning, but with Convolutional Neural Networks as the finetuning backbone, to perform a visual sentiment analysis task. More specifically, they took the same pre-trained CNNs as Gajarla and Gupta[2] and then used a sentiment dataset for the final tuning. These authors also introduced a more interesting evaluation metric - the Area Under the receiver operating characteristic Curve (AUC). With that metric the authors improved on previous models, achieving 70% accuracy on the 5-scale sentiment rating prediction, compared to previous performances of 64% and 67%. AUC helps with imbalanced datasets, which could be great for most emotion datasets, as special emotions like disgust or surprise have far fewer images on the web. Because of this wide imbalance and the differing choices of which emotions to detect, researchers use a variety of metrics including Precision, Recall, and F-1 score, as in Asghar et al. [8].

The most recent approaches also use finetuning, but with transformers, the method that transformed Natural Language Processing and is now being widely explored in Computer Vision. Ma, Sun, and Li [9] gave the most recent attempt by constructing a transformer with convolution, global and local attention, as well as batch normalization. All of the training of this transformer is done on top of the pre-trained model ResNet18, which was trained on the MS-Celeb-1M face recognition dataset[10]. The authors improved on all accuracy metrics in the problem they were solving, and at the moment their approach represents the state of the art in emotion detection, especially since it generalises better than previous methods. Because their research focused more on generalisation, the authors did not explore other pre-trained models, which is the main disadvantage of the approach and something I attempt to improve on in this paper.

3. Data

3.1. Dataset

It took some time to find a suitable dataset. Many good-quality datasets are unfortunately not available to the public, so I ended up going with the FER2013 dataset [11]. The dataset contains 48x48 pixel grayscale images. It is provided as a comma-separated value (.csv) file with two columns, and every row contains two parameters describing an image: the first is a number representing the emotion in the image (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral), and the second is a string of 2304 (48*48) numbers, where each is a grayscale pixel value of the image.

Figure 1. Examples of data for each emotion

My first job was to convert the string into an array of integers and then figure out how to move between the number values and actual images, which was done using the PIL[12] library for Python. The dataset has one big disadvantage: the low resolution of the pictures was not enough to achieve high performance; however, it was enough to conduct the analysis.
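To make this preprocessing step concrete, here is a minimal sketch of decoding one row into an image. The file name fer2013.csv and the column names emotion and pixels follow the public release of the dataset and are assumptions for illustration, not the exact code used in this project.

```python
import numpy as np
import pandas as pd
from PIL import Image

# Load the FER2013 csv: one row per image, with an integer label in the
# "emotion" column and 2304 space-separated grayscale values (48*48)
# in the "pixels" column.
df = pd.read_csv("fer2013.csv")

def row_to_image(pixel_string, size=(48, 48)):
    """Convert the space-separated pixel string into a PIL grayscale image."""
    values = np.array(pixel_string.split(), dtype=np.uint8)
    return Image.fromarray(values.reshape(size), mode="L")

# Decode the first example and save it for inspection.
img = row_to_image(df.loc[0, "pixels"])
label = int(df.loc[0, "emotion"])   # 0=Angry ... 6=Neutral
img.save(f"example_emotion_{label}.png")
```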
The dataset includes 28709 images for the training set, where each emotion has around 4200 images, except disgust, which has only 430. The testing data includes 3589 images.

3.2. Train/val/test data

For my experiments I slightly modified the distribution and split the 28709 training images into 22968 (80%) images for training my models and 5741 (20%) for a validation set, which the models predict on at every epoch.

Moreover, to initially tweak hyperparameters and make sure the code works, I worked with smaller subsets of the data. I used 800 training and 200 validation images for training a transformer, and 6400 training and 1600 validation images for finetuning. After tweaking parameters, every model was run on the full dataset. Whenever I took subsets of the data, I made sure that every emotion was present in those subsets in proportions similar to those in the full dataset.
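A minimal sketch of such a proportion-preserving split is shown below; using scikit-learn's train_test_split with stratify is my assumption of how the proportions could be kept, not a description of the exact code used, and the arrays are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the decoded FER2013 training set:
# 28709 grayscale 48x48 images and their emotion labels 0-6.
X = np.zeros((28709, 48, 48), dtype=np.uint8)
y = np.random.randint(0, 7, size=28709)

# stratify=y keeps each emotion at the same proportion in both splits,
# matching the 80%/20% train/validation split described above.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(X_train.shape, X_val.shape)
```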
3.3. Image Features

As one potential way to improve on current models, I considered using different image features as data inputs, which could be used separately or concatenated together into a vector. The feature I focused on is the Histogram of Oriented Gradients (HOG), computed using scikit-image[13]. HOG represents the distribution of gradient directions: it calculates the magnitude and direction of the gradient at every pixel of the image and stores them as a feature vector of the image. This highlights edges and corners, as the biggest gradient changes happen there. Knowing that distribution allows us to understand the shape of the face and the emotion, since the positions of every face part are also recorded, which helps to see the unique parts of an emotion.

Due to the small resolution of the dataset and the fact that the images are cropped to cover just the face, HOG does not demonstrate much difference among emotions. However, looking closely we can see differences in mouth and eye shapes, which even in real life distinguish emotions the best. We can see an example below:

Figure 2. Examples of HOG data for each emotion
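The HOG descriptors illustrated in Figure 2 can be computed with scikit-image as in the sketch below; the cell and block sizes here are illustrative choices, not necessarily the parameters used in the project.

```python
import numpy as np
from skimage.feature import hog

# Placeholder 48x48 grayscale face image; in the project this would be
# one of the decoded FER2013 images.
image = np.random.randint(0, 256, size=(48, 48), dtype=np.uint8)

# Compute the HOG feature vector and a visualisation image.
features, hog_image = hog(
    image,
    orientations=8,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,
)
print(features.shape)  # 1D descriptor of gradient orientations
```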

4. Methods

4.1. Baselines

4.1.1 KNN

KNN makes the assumption that similar things are close to each other. It is an algorithm which predicts information about an image based on the k images closest to it. More specifically, it chooses the label that appears most often among those k neighbors, where the number k itself can be chosen and adjusted to achieve better accuracy. The decision of how close images are is based on a feature parameter. My KNN is constructed using Numpy[14] and Scikit-learn[15]. Initially I use pixel values as the feature parameter to compare images, but using HOG representations makes more sense, as we are then comparing the similarity of the shapes of faces and face parts, and thus looking more specifically for the main characteristics of emotions.
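A minimal sketch of this baseline with scikit-learn is shown below; the placeholder data and the set of candidate k values are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: flattened 48x48 images (or HOG vectors) and labels 0-6.
X_train = np.random.rand(1000, 48 * 48)
y_train = np.random.randint(0, 7, size=1000)
X_val = np.random.rand(200, 48 * 48)
y_val = np.random.randint(0, 7, size=200)

best_k, best_acc = None, 0.0
for k in (1, 5, 10, 20, 50, 100):               # candidate neighbourhood sizes
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_val, y_val)               # validation accuracy
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)
```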
ages to work with since it requires sequences of data for it to different data for classifying many diverse objects, whereas
function. Transformer learns how those pieces relate to each other researchers used datasets for specific problems like
other, and how those relations indicate the emotion. It helps face detection which are more related to emotion detection.
the system to learn more concretely about mouth shape, eye Among the different models I performed comparisons of ac-
shape and other face features indicating emotion. curacy after first 20 epochs and ended on ResNet101, which
Next most important step is the main part of the trans- is 101 layer network. Important to note the network per-
former is the transformer block of layers. The number of formed better than ResNet18, that was used in [9], suggest-
those blocks could vary, but in my model I set on 10, which ing that deeper networks could be more helpful for compli-
performed the best. Each implemented block consists of cated task of emotion classification.
Layer Normalization, followed by Multi-Head Attention, After that I added a multilayer perceptron(MLP) for the
another Layer Normalization and MLP layers. MLP layer finetuning part that would run on the FER2013 data. Final
consisted of 2 layers of Dense layer with ReLU activation MLP ended up consisting of 4 layers of Dense layer with
and Dropout with probability 0.5 and 0.7 for both problems ReLU activation and Dropout with probabilities dependent
respectively. Dense layer is essentially a Keras[19] method on the problem and features.
of combining a connected layer with activation function.
The transformer finishes and gives final results after an- And finally I combined all the parts together inspired by
other Layer Normalization, Flatten layer, another Dropout the code from [22]. That part essentially passes the image
and last MLP with the same structure. inputs through ResNet101 and gets the final features after
The structure of the transformer is drawn out below: all 101 layers, and those features are passed into the MLP
so we can get from 1000 classes of ResNet101 to 7 or 2
Figure 4. Structure of the transformer classes specific to our problems. After all of that I apply
Softmax activation function to get final predictions.

5. Experiments
In this section I cover my derivation of the best Trans-
former and best Finetune model I was able to achieve, and
how their performance compares to the previous work and
baselines.
For my experiments I decided to use Categorical
Crossentropy loss[19] which is a common choice for loss
functions in classification problems with many classes.
Choice of optimizer fell on Adam[19] optimizer, as I used it
in assignments for this class, and Adam always performed
better than others.
To compare results and performance of my methods and
baselines, I will be using loss and different accuracy met-
rics: training/validation/testing accuracies and differences
among those. While numbers themselves demonstrate per-
formance and success of models, differences demonstrate
which models overfitted more, and which learned more uni-
versally.

4.2.2 Finetuning on pre-trained models 5.1. Hyperparameter tuning


Besides training a transformer for the emotion detection, For my baselines I have not implemented much hyper-
I have considered to explore approaches of transfer learn- parameter tuning, as there were not many parameters to
ing, similar to the ones done with CNN, using Keras[19] change, especially with the fact that my baseline CNN came
and Pandas[21]. For the start I had to choose a pre- from already existing structures. For KNN I only chose the
trained model from which to begin finetuning. My op- best k out of 1 to 100 and reported it in the tables following.
tions were limited to pre-trained models of Keras[19] li- However, for Transformer and Finetune there were a lot of
brary, which used ImageNet dataset that includes a lot of options.
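A minimal sketch of this training configuration is shown below; the stand-in model and the 1e-4 learning rate are placeholders, since the learning rates actually explored are reported in the tables of the next subsection.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in model; in the project this would be the transformer or the
# finetuned ResNet101 defined in Section 4.
model = tf.keras.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Flatten(),
    layers.Dense(7, activation="softmax"),
])

# Categorical cross-entropy loss with the Adam optimizer, tracking accuracy,
# as described above.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy"],
)
```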
5.1. Hyperparameter tuning

For my baselines I did not implement much hyperparameter tuning, as there were not many parameters to change, especially since my baseline CNN came from already existing structures. For KNN I only chose the best k out of 1 to 100 and report it in the tables that follow. However, for the Transformer and Finetune there were a lot of options.

5.1.1 Tuning Transformers

All hyperparameter tuning for transformers was done on a small subset of the dataset with 800 training and 200 validation images, and training was done for 10 epochs. The metric of best performance was validation accuracy, and in case of a tie I also calculated top-3 accuracy, which accounts for the correct label appearing in the model's 3 highest predictions.

I decided to tune the parameters that seemed the most important and influential: the learning rate (LR), batch size (BS), number of transformer layers (TL), and dropout rate (DR) of the MLPs. I also looked at other parameters, like the number of attention heads in the MLP (default is 2) and patch parameters such as patch size (default is 6) and resized image size (default is 72); however, those did not seem to affect the performance at all, so they are omitted here.

Table 1: Hyperparameter Tuning for Transformer on original images for emotion classification

      LR      BS    TL   DR    Val Accuracy
  1   0.001   256    8   0.5   0.21
  2   0.01    256    8   0.5   0.06
  3   0.0001  256    8   0.5   0.28
  4   0.0001  128    8   0.5   0.26
  5   0.0001   64    8   0.5   0.19
  6   0.0001  256    6   0.5   0.23
  7   0.0001  256   10   0.5   0.28
  8   0.0001  256   10   0.7   0.181
  9   0.0001  256   10   0.3   0.21

As we can see, models 7 and 3 performed the best with 28% accuracy; however, model 7 had better top-3 accuracy.

The exact same process was conducted for the sentiment analysis task, but top-3 accuracy was not computed here. The results are presented below:

Table 2: Hyperparameter Tuning for Transformer on original images for sentiment analysis

      LR      BS    TL   DR    Val Accuracy
  1   0.001   256   10   0.3   0.55
  2   0.01    256   10   0.3   0.53
  3   0.0001  256   10   0.3   0.57
  4   0.0001  128   10   0.3   0.58
  5   0.0001   64   10   0.3   0.57
  6   0.0001  128    8   0.3   0.54
  7   0.0001  128    6   0.5   0.51
  8   0.0001  128   10   0.5   0.60
  9   0.0001  128   10   0.7   0.63

When I attempted a similar process for Transformers using HOG features, the model was not learning anything and the accuracy was the same for different parameters, because the model predicted the same class for all validation images. This is due to the fact that Transformers analyse images as sequential data, and the HOG images are too similar to each other, with only minor differences in the orientation of some of the gradients. Because of that it is hard for Transformers to use HOG features for training, so I did not proceed with using HOG in this part.

5.1.2 Tuning Finetuned pre-trained models

Proceeding with the ResNet101 model pretrained on ImageNet, I had to tweak the parameters of the MLP responsible for the finetuning part: the dropout rates (DR) and the number of MLP heads (NH). I also attempted to work with the batch size; however, changing it from the default (128) only reduced the performance, so it is not reported in the table.

Since this part ran relatively faster than the transformer, I was able to use bigger portions of the training (6400 images) and validation (1600 images) sets of the dataset. Each finetune was originally run for just 20 epochs to compare different parameters and choose the best ones. The metric of best performance again was validation accuracy.

Table 3: Hyperparameter Tuning for Finetune on original images for emotion classification

      DR    NH   Val Accuracy
  1   0.2    2   0.39
  2   0.2    4   0.41
  3   0.3    4   0.39
  4   0.4    4   0.38
  5   0.1    4   0.39

The same was done for the same problem but using HOG features:

Table 4: Hyperparameter Tuning for Finetune on HOG features for emotion classification

      DR    NH   Val Accuracy
  1   0.2    4   0.29
  2   0.4    4   0.31
  3   0.6    4   0.32
  4   0.8    4   0.31
  5   0.6    2   0.28

And the same process was conducted for the sentiment analysis problem:

Table 5: Hyperparameter Tuning for Finetune on original images for sentiment analysis

      DR    NH   Val Accuracy
  1   0.4    2   0.64
  2   0.4    4   0.65
  3   0.6    4   0.67
  4   0.8    4   0.66
  5   0.2    4   0.65

Table 6: Hyperparameter Tuning for Finetune on HOG features for sentiment analysis

      DR    NH   Val Accuracy
  1   0.2    2   0.59
  2   0.2    4   0.6
  3   0.8    4   0.6
  4   0.6    4   0.58
  5   0.4    4   0.62
5.2. Final results

I proceeded to full training for 100 epochs for the best models in each of the tasks and feature settings (the best-performing rows of the tables above). The final results on the test set for the best 6 models can be seen below:

Table 7: Final results for the best models

  Model    Task   LS     TA      VL     VA      FA
  TR       EC     0.76   72.23   1.16   59.40   58.51
  FT       EC     0.02   99.53   4.58   47.64   48.26
  FT-HOG   EC     0.03   99.67   5.28   38.17   38.03
  TR       SA     0.56   70.21   0.54   72.56   71.08
  FT       SA     0.01   99.89   2.6    70.64   68.86
  FT-HOG   SA     0.02   99.73   2.6    71.14   71.45

Abbreviations: LS - loss; TA - training accuracy; VL - validation loss; VA - validation accuracy; FA - test accuracy; TR - transformer; FT - finetune; EC - emotion classification; SA - sentiment analysis.
As we can see, for emotion classification the transformer performed the best, exceeding the final accuracy of the finetune significantly. However, for sentiment analysis the finetune on HOG features performed the best, coming in slightly above the transformer.

Finetuning has a clear overfitting problem, as we can see from the final loss and the 99% accuracy on the training data. This issue appeared earlier, during hyperparameter tuning, even with a low number of epochs (20). I attempted to fix it by making a less complex model and increasing regularisation, or in the case of finetuning, the dropout rate; however, that did not appear to help at all, and the model kept overfitting on the training data. My guess is that the size of the FER2013 dataset (35,000) is not significant compared to the size of ImageNet (1,000,000), leading the model to pay closer attention to the training examples, minorly tweaking parameters to fit that data but not generalising enough for validation.

Transformers, however, had relatively stable learning. If we look at the loss during emotion classification, we can see that for the first 50 epochs both losses were going down together at the same rate. After 50 epochs the decrease in validation loss slowed down and remained stable, but then slowly started increasing by the beginning of epoch 90, indicating that we could have stopped even earlier to fully avoid overfitting, though we were not affected much by it.

Figure 5. Loss for emotion classification transformer[23]

Looking at the loss for sentiment analysis we can see better behaviour:

Figure 6. Loss for sentiment analysis transformer[23]

Training and validation loss here are almost identical; however, they do not decrease by much over the course of training and converge by the end of it, indicating that 100 epochs is sufficient for this task.

5.3. Comparison to Baselines and Previous Work

5.3.1 Emotion Classification

The results for emotion classification (all 7 categories) can be seen below for all the models:
Table 8: All models for emotion classification

  Model                 Loss   Accuracy(%)
  KNN(k=60)             -      16
  KNN-HOG(k=50)         -      35.5
  CNN                   0.92   47.15
  CNN-HOG               0.76   46.93
  Finetune[2]           -      41.34
  Finetune[9]           -      45.16
  Transformer(this)     0.76   58.51
  Finetune(this)        0.02   48.26
  Finetune-HOG(this)    0.03   38.03

Both the Transformer and the Finetune that were implemented are better than any CNN or KNN methods. Finetune-HOG performed significantly worse, which was not expected.

I implemented finetuning methods similar to the ones of [2] and [9], which used VGG/ResNet50 and ResNet18 respectively as pretrained models, whereas my finetuning used ResNet101. As can be seen, my finetuning performs better, suggesting that more recent and more complex models have the potential to improve performance. The reported accuracy is different from the one in those papers because of a couple of factors. The authors used better and newer datasets that are available to professional researchers, whereas FER2013 has lower quality. There is also the fact that the authors had access to models pre-trained on more relevant data. For example, in [9] the authors used a 1M-face dataset for the pretrained model, which is closer to the data used for emotion classification than the ImageNet data used in my finetuning. This essentially means that I re-implemented the authors' approach partially, specifically focusing on the pre-trained model at the base, to demonstrate potential improvements.
5.3.2 Sentiment Analysis

Results for sentiment analysis (positive vs. negative emotion) can be seen below:

Table 9: All models for sentiment analysis

  Model                 Loss   Accuracy(%)
  KNN(k=100)            -      49
  KNN-HOG(k=10)         -      60.5
  CNN                   0.58   70.27
  CNN-HOG               0.40   69.53
  Finetune[2]           -      73.19
  CNN[6]                -      70.17
  Finetune[9]           -      73.67
  Transformer(this)     0.56   71.08
  Finetune(this)        0.01   68.86
  Finetune-HOG(this)    0.02   71.45

The implemented Transformer and Finetune-HOG performed better than any baseline models. The implemented Finetune on default images performs worse than the CNN, and also worse than Finetune-HOG.

We can see that both Finetunes performed worse than the models similar to the ones in [2] and [9], which were implemented the same way as described in the previous section. This indicates that for the simpler problem of sentiment analysis it is not necessary to use more complex pre-trained models like ResNet101. However, the fact that my Finetune-HOG performed better than my Finetune suggests that the use of image features can further improve existing finetunes.

5.4. Qualitative Analysis

For this section I chose to analyse the Transformer I implemented for emotion classification, since it performed the best among all implemented methods. Below we can see the confusion matrix constructed with scikit-learn[15] for the model's predictions:

Figure 7. Confusion matrix
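Such a confusion matrix can be produced with scikit-learn as in the sketch below; the random arrays stand in for the test labels and the transformer's predictions, which are not reproduced here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

emotions = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

# Placeholders for the 3589 test labels and the model's predicted labels.
y_true = np.random.randint(0, 7, size=3589)
y_pred = np.random.randint(0, 7, size=3589)

cm = confusion_matrix(y_true, y_pred, labels=range(7))
ConfusionMatrixDisplay(cm, display_labels=emotions).plot(xticks_rotation=45)
plt.tight_layout()
plt.savefig("confusion_matrix.png")
```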
We can see that almost half of the disgust(1) images were predicted as anger(0), indicating that the model had a hard time differentiating between the two when a disgust(1) image was given. Both emotions are quite similar, as in real life they have similar, very expressive eyebrow and mouth shapes.

A significant portion of images with the fear(2) and neutral(6) emotions were predicted as sad(4), indicating that for the fear and neutral categories, sadness is another likely emotion. Again, this is expected since the emotions are quite similar, and sometimes even in real life it is hard to differentiate between those 3 for some people.

While the happy(3) emotion had the best performance due to its prevalence in the data, among the less represented emotions, surprise(5) had the least error. This indicates that the system was very certain when predicting an image as surprise(5), and other labels seem very unlikely for that type of image. This result was not expected, as I thought happiness(3) and fear(2) would be similar to it due to the similarity of facial features, and while those 2 were the next choices, they are still not comparable to the correct surprise(5) predictions.

Here we can see an example of a correctly classified image, which represents disgust(1).

Figure 8. Correctly classified image

Personally, I think this is a very hard example to classify, as this emotion could almost pass for happiness(3) given the ambiguous shape of the mouth. That means the model was able to capture significant differences in mouth shapes - here the mouth is slightly closed, potentially differentiating it from happiness(3).

Here we can see an incorrectly classified image, which represents anger(0); however, the system predicted fear(2) for it.

Figure 9. Incorrectly classified image

This example is interesting because the image is animated rather than of a real person, which can affect performance depending on the quality of the animation. Clearly here it was able to confuse the system. The left side of the image does look like anger(0), but the right side is almost surprise(5)/fear(2), resulting in the prediction made by the system.

6. Conclusion

Transformers have been confirmed to remain the state of the art method in the context of emotion detection. The trained transformer performed the best in the emotion classification task, and had accuracy close to the best model in the sentiment analysis task.

Transformers showed a steady learning process that eventually converges to strong results, but takes longer compared to finetuned models. However, such a steady learning process resulted in a clear advantage over finetuning in terms of overfitting on the training data. As we saw, accuracy and loss were similar for both validation and training data for transformers, whereas finetuning on pre-trained models significantly overfitted. For future work I would like to investigate this further, potentially with a larger and higher-quality dataset, as finetuning has been shown to perform well in other fields.

This project has shown that image features like HOG have potential for emotion detection with finetuning on pre-trained models, particularly in the sentiment analysis task. However, the features did not prove to be useful at all for emotion classification, and especially during transformer training. For future work it is definitely worth exploring other image features and considering which ones could be useful for transformers and emotion classification.

Another important thing to explore is representation within the dataset. As we saw, the model predicted incorrectly for the animated image, but there could be other factors that influence the model, such as race, sex or age. If there is not enough representation of different groups of people or image types, then the training process could be biased and not work for underrepresented groups.

References
[1] "Emotion in the Human Face." P. Ekman, W. Friesen, P. Ellsworth. Pergamon Press, New York. 1972. https://www.elsevier.com/books/emotion-in-the-human-face/ekman/978-0-08-016643-8
[2] "Emotion Detection and Sentiment Analysis of Images." V. Gajarla, A. Gupta. Georgia Institute of Technology. pp. 1-4. 2015. https://faculty.cc.gatech.edu/hays/7476/projects/Aditi Vasavi.pdf
[3] "PlaNet - Photo Geolocation with Convolutional Neural Networks." T. Weyand, I. Kostrikov, J. Philbin. In B. Leibe, J. Matas, N. Sebe, M. Welling (eds.), Computer Vision - ECCV 2016, Lecture Notes in Computer Science, vol. 9912. Springer, Cham. https://doi.org/10.1007/978-3-319-46484-8_3
[4] "Very Deep Convolutional Networks for Large-Scale Image Recognition." K. Simonyan, A. Zisserman. arXiv. 2014. https://arxiv.org/abs/1409.1556
[5] "Learning Deep Features for Scene Recognition using Places Database." B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva. Advances in Neural Information Processing Systems 27 (NIPS). 2014. http://places.csail.mit.edu/places_NIPS14.pdf
[6] "Facial Emotion Detection Using Deep Learning." A. Jaiswal, A. Krishnama Raju, S. Deb. International Conference for Emerging Technology (INCET). 2020. pp. 1-5. https://ieeexplore.ieee.org/document/9154121
[7] "Visual Sentiment Prediction with Deep Convolutional Neural Networks." C. Xu, S. Cetintas, K. Lee, L. Li. 2014. https://arxiv.org/pdf/1411.5731.pdf
[8] "Performance Evaluation of Supervised Machine Learning Techniques for Efficient Detection of Emotions from Online Content." M. Asghar, F. Subhan, M. Imran, F. Kundi, A. Khan, S. Shamshirband, A. Mosavi, P. Csiba, A. Koczy. Computers, Materials & Continua. 2020. vol. 63, pp. 1093-1118. http://www.techscience.com/cmc/v63n3/38864
[9] "Facial Expression Recognition with Visual Transformers and Attentional Selective Fusion." F. Ma, B. Sun, S. Li. IEEE Transactions on Affective Computing. 2021. https://doi.org/10.48550/arXiv.2103.16854
[10] "MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition." Y. Guo, L. Zhang, Y. Hu, X. He, J. Gao. ECCV. 2016. https://doi.org/10.48550/arXiv.1607.08221
[11] FER2013 dataset: "Challenges in Representation Learning: A Report on Three Machine Learning Contests." I. Goodfellow, D. Erhan, P.L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, et al. arXiv. 2013. http://arxiv.org/abs/1307.0414
[12] "PIL." A. Clark and contributors. 2010-2022. https://pillow.readthedocs.io/en/stable/about.html
[13] "scikit-image: Image Processing in Python." S. van der Walt, J. Schönberger, J. Nunez-Iglesias, F. Boulogne, J. Warner, N. Yager, E. Gouillart, T. Yu and the scikit-image contributors. PeerJ 2:e453. 2014. https://doi.org/10.7717/peerj.453
[14] "Array Programming with NumPy." C. Harris, K. Millman, S. van der Walt, et al. Nature 585, pp. 357-362. 2020. https://doi.org/10.1038/s41586-020-2649-2
[15] "Scikit-learn: Machine Learning in Python." F. Pedregosa, et al. Journal of Machine Learning Research. 2011. vol. 12, pp. 2825-2830. https://scikit-learn.org/stable/index.html
[16] "PyTorch: An Imperative Style, High-Performance Deep Learning Library." A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, et al. In Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc. 2019. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[17] "CS231N Convolutional Neural Networks for Visual Recognition: Assignment 2." Stanford University. 2022. https://cs231n.github.io/assignments2022/assignment2/
[18] "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems." M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al. 2015. https://tensorflow.org
[19] "Keras." F. Chollet and others. 2015. https://keras.io/api/
[20] "Image Classification with Vision Transformer." K. Salama. keras.io. 2018. https://keras.io/examples/vision/image_classification_with_vision_transformer/
[21] "pandas-dev/pandas: Pandas." The pandas development team. Zenodo. 2020. https://doi.org/10.5281/zenodo.3509134
[22] "Transfer Learning and the Art of Using Pre-trained Models in Deep Learning." D. Gupta. Analytics Vidhya. 2017. https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/
[23] "Matplotlib: A 2D Graphics Environment." J. Hunter. Computing in Science & Engineering (IEEE). 2007. vol. 9, no. 3, pp. 90-95. https://matplotlib.org/stable/index.html
