Emotion Detection With Vision Transformers and Image Features
Stephan Sharkov
Stanford University,
Department of Computer Science
[email protected]
4.1. Baselines
4.1.1 KNN
KNN rests on the assumption that similar things are close to each other. It is an algorithm that predicts information about an image based on the k images closest to it. More specifically, it chooses the label that appears most often among those k neighbors, where the number k itself can be chosen and adjusted to achieve better accuracy. How close two images are is determined by a distance computed over a chosen feature parameter. My KNN is constructed using NumPy[14] and Scikit-learn[15].
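As a rough illustration (a minimal sketch under my own assumptions about the feature representation and k, not the project's actual code), such a baseline can be assembled with Scikit-learn's KNeighborsClassifier:

# Minimal KNN baseline sketch (assumptions: flattened pixel features, synthetic
# placeholder data, and k = 5; not the author's exact configuration).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: grayscale 48x48 images flattened into feature vectors.
X_train = np.random.rand(1000, 48 * 48)
y_train = np.random.randint(0, 7, size=1000)   # 7 emotion labels
X_test = np.random.rand(100, 48 * 48)

# k (n_neighbors) is adjustable; the predicted label is the one that appears
# most often among the k closest training images.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)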
The images are first split into patches; this is done so that the transformer has pieces of images to work with, since it requires sequences of data to function. The transformer learns how those pieces relate to each other, and how those relations indicate the emotion. This helps the system learn more concretely about mouth shape, eye shape, and other facial features that indicate emotion.

The next and most important part of the transformer is the stack of transformer blocks. The number of those blocks can vary, but in my model I settled on 10, which performed the best. Each implemented block consists of Layer Normalization, followed by Multi-Head Attention, another Layer Normalization, and MLP layers. The MLP consists of two Dense layers with ReLU activation and Dropout with probability 0.5 and 0.7 for the two problems respectively. A Dense layer is essentially the Keras[19] way of combining a fully connected layer with an activation function.

The transformer finishes and gives its final results after another Layer Normalization, a Flatten layer, another Dropout, and a last MLP with the same structure.

The structure of the transformer is drawn out below:

Figure 4. Structure of the transformer
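To make this concrete, below is a minimal Keras sketch of one such block. The projection size, the number of attention heads, and the residual connections around the attention and MLP sub-blocks (standard in Vision Transformers, though not spelled out above) are my assumptions for illustration, not the exact configuration used.

# Sketch of one transformer block as described above: LayerNormalization,
# Multi-Head Attention, another LayerNormalization, and an MLP of Dense + ReLU
# + Dropout layers. Sizes, head count, and residual connections are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def transformer_block(x, num_heads=4, proj_dim=64, dropout_rate=0.5):
    # Attention sub-block over the sequence of image patches.
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=proj_dim)(h, h)
    x = layers.Add()([x, h])                     # residual connection (assumed)

    # MLP sub-block: two Dense layers with ReLU activation and Dropout.
    h = layers.LayerNormalization(epsilon=1e-6)(x)
    h = layers.Dense(2 * proj_dim, activation="relu")(h)
    h = layers.Dropout(dropout_rate)(h)          # 0.5 or 0.7 depending on the problem
    h = layers.Dense(proj_dim, activation="relu")(h)
    h = layers.Dropout(dropout_rate)(h)
    return layers.Add()([x, h])                  # residual connection (assumed)

# The full model would stack 10 such blocks, then apply LayerNormalization,
# Flatten, Dropout, and a final MLP of the same structure before classification.
patch_embeddings = tf.keras.Input(shape=(64, 64))   # (num_patches, proj_dim), hypothetical
encoded = transformer_block(patch_embeddings)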
The pretrained networks I considered were trained on rather different data, for classifying many diverse objects, whereas other researchers used datasets for specific problems like face detection, which are more closely related to emotion detection. Among the different models I compared accuracies after the first 20 epochs and settled on ResNet101, which is a 101-layer network. It is important to note that this network performed better than ResNet18, which was used in [9], suggesting that deeper networks could be more helpful for the complicated task of emotion classification.

After that I added a multilayer perceptron (MLP) for the finetuning part that would run on the FER2013 data. The final MLP consists of 4 Dense layers with ReLU activation and Dropout, with probabilities dependent on the problem and the features.

And finally I combined all the parts together, inspired by the code from [22]. That part essentially passes the image inputs through ResNet101 and takes the final features after all 101 layers; those features are passed into the MLP so we can get from the 1000 classes of ResNet101 down to the 7 or 2 classes specific to our problems. After all of that I apply a Softmax activation function to get the final predictions.
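A minimal Keras sketch of this combined model follows (the input size, the hidden-layer widths, and keeping the backbone trainable are my assumptions; FER2013 images would need to be resized to the backbone's expected input):

# Sketch of the finetune model described above (illustrative assumptions, not the
# exact code): ImageNet-pretrained ResNet101 produces 1000 outputs, and a 4-layer
# MLP head maps them down to the 7 emotion classes or 2 sentiment classes.
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 7        # 7 for emotion classification, 2 for sentiment analysis
dropout_rate = 0.5     # hypothetical; the text tunes this per problem and features

backbone = keras.applications.ResNet101(weights="imagenet")   # expects 224x224 RGB images

inputs = keras.Input(shape=(224, 224, 3))
x = backbone(inputs)                               # features after all 101 layers (1000 values)
for units in (512, 256, 128, 64):                  # hypothetical widths for the 4 Dense layers
    x = layers.Dense(units, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)  # final Softmax predictions
finetune_model = keras.Model(inputs, outputs)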
5. Experiments
In this section I cover my derivation of the best Transformer and the best Finetune model I was able to achieve, and how their performance compares to previous work and to the baselines.
For my experiments I decided to use the Categorical Crossentropy loss[19], which is a common choice of loss function for classification problems with many classes. For the optimizer I chose Adam[19], as I used it in the assignments for this class and it consistently performed better than the other optimizers I tried.
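Reusing the finetune_model from the earlier sketch, this setup corresponds to a compile call along the following lines (a hedged illustration; the learning rate shown is only a placeholder, with the tuned values reported in the tables below):

# Illustrative training setup: categorical crossentropy loss and the Adam
# optimizer, tracking accuracy. The learning rate is a placeholder value.
import tensorflow as tf

finetune_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=tf.keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy"],
)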
To compare the results and performance of my methods and the baselines, I use loss and several accuracy metrics: training, validation, and testing accuracies, and the differences among them. While the numbers themselves demonstrate the performance and success of the models, the differences show which models overfitted more and which learned more universally.
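As a trivial illustration of that check (a hypothetical helper, not code from the paper), the comparison boils down to simple differences between the accuracies:

# Hypothetical helper: a large gap between training and validation/testing
# accuracy signals a model that overfitted rather than learned universally.
def accuracy_gap(train_acc: float, val_acc: float) -> float:
    return train_acc - val_acc

print(accuracy_gap(0.90, 0.63))   # illustrative numbers only, not reported results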
As we can see, models 7 and 3 performed the best with 28% accuracy; however, model 7 had the better top-3 accuracy.

The exact same process was conducted for the sentiment analysis task, but top-3 accuracy was not computed here. The results are presented below:

Table 2: Hyperparameter Tuning for Transformer on original images for sentiment analysis

    LR      BS   TL  DR   Val Accuracy
1   0.001   256  10  0.3  0.55
2   0.01    256  10  0.3  0.53
3   0.0001  256  10  0.3  0.57
4   0.0001  128  10  0.3  0.58
5   0.0001  64   10  0.3  0.57
6   0.0001  128   8  0.3  0.54
7   0.0001  128   6  0.5  0.51
8   0.0001  128  10  0.5  0.60
9   0.0001  128  10  0.7  0.63

When I attempted to perform a similar process on Transformers using HOG features, the model was not learning anything and the accuracy stayed the same for different parameters.

Similar tuning was done for the same problem but using HOG features:

Table 4: Hyperparameter Tuning for Finetune on HOG features for emotion classification

    DR   NH  Val Accuracy
1   0.2  4   0.29
2   0.4  4   0.31
3   0.6  4   0.32
4   0.8  4   0.31
5   0.6  2   0.28

And the same process was conducted for the sentiment analysis problem:

Table 5: Hyperparameter Tuning for Finetune on original images for sentiment analysis

    DR   NH  Val Accuracy
1   0.4  2   0.64
2   0.4  4   0.65
3   0.6  4   0.67
4   0.8  4   0.66
5   0.2  4   0.65

Finally, the same tuning was repeated on HOG features for the sentiment analysis problem:

Table 6: Hyperparameter Tuning for Finetune on HOG features for sentiment analysis

    DR   NH  Val Accuracy
1   0.2  2   0.59
2   0.2  4   0.6
3   0.8  4   0.6
4   0.6  4   0.58
5   0.4  4   0.62

Figure 5. Loss for emotion classification transformer[23]
References
[1] "Emotion in the human face." P Ekman, W Friesen, P Ellsworth. Pergamon Press, New York. 1972. https://www.elsevier.com/books/emotion-in-the-human-face/ekman/978-0-08-016643-8
[2] "Emotion Detection and Sentiment Analysis of Images." V Gajarla, A Gupta. Georgia Institute of Technology. 1-4. 2015. https://faculty.cc.gatech.edu/~hays/7476/projects/Aditi_Vasavi.pdf
[3] "PlaNet - Photo Geolocation with Convolutional Neural Networks." Weyand, T., Kostrikov, I., Philbin, J., Leibe, B., Matas, J., Sebe, N., Welling, M. Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol 9912. Springer, Cham. https://doi.org/10.1007/978-3-319-46484-8_3
[4] "Very Deep Convolutional Networks for Large-Scale Image Recognition." K Simonyan, A Zisserman. arXiv. 2014. https://arxiv.org/abs/1409.1556
[5] "Learning Deep Features for Scene Recognition using Places Database." B Zhou, A Lapedriza, J Xiao, A Torralba, and A Oliva. Advances in Neural Information Processing Systems 27 (NIPS). 2014. http://places.csail.mit.edu/places_NIPS14.pdf
[6] "Facial Emotion Detection Using Deep Learning." A Jaiswal, A Krishnama Raju, S Deb. International Conference for Emerging Technology (INCET). 2020. pp. 1-5. https://ieeexplore.ieee.org/document/9154121
[7] "Visual Sentiment Prediction with Deep Convolutional Neural Networks." C Xu, S Cetintas, K Lee, L Li. 2014. https://arxiv.org/pdf/1411.5731.pdf
[8] "Performance Evaluation of Supervised Machine Learning Techniques for Efficient Detection of Emotions from Online Content." M Asghar, F Subhan, M Imran, F Kundi, A Khan, S Shamshirband, A Mosavi, P Csiba, A Koczy. Computers, Materials & Continua. 2020. vol. 63, pp. 1093-1118. http://www.techscience.com/cmc/v63n3/38864
[9] "Facial Expression Recognition with Visual Transformers and Attentional Selective Fusion." F Ma, B Sun, S Li. IEEE Trans. Affective Comput. 2021. 1-1. https://doi.org/10.48550/arXiv.2103.16854
[10] "MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition." G Yandong, Z Lei, H Yuxiao, H X, and G Jianfeng. ECCV. 2016. https://doi.org/10.48550/arXiv.1607.08221
[11] FER2013 Dataset: "Challenges in Representation Learning: A report on three machine learning contests." I Goodfellow, D Erhan, PL Carrier, A Courville, M Mirza, B Hamner, W Cukierski, Y Tang, DH Lee, Y Zhou, C Ramaiah, F Feng, R Li, X Wang, D Athanasakis, J Shawe-Taylor, M Milakov, J Park, R Ionescu, M Popescu, C Grozea, J Bergstra, J Xie, L Romaszko, B Xu, Z Chuang, and Y Bengio. arXiv. 2013. http://arxiv.org/abs/1307.0414
[12] "PIL." A Clark and contributors. 2010-2022. https://pillow.readthedocs.io/en/stable/about.html
[13] "scikit-image: Image processing in Python." S Walt, J Schönberger, J Nunez-Iglesias, F Boulogne, J Warner, N Yager, E Gouillart, T Yu and the scikit-image contributors. PeerJ 2:e453. 2014. https://doi.org/10.7717/peerj.453
[14] "Array programming with NumPy." C Harris, K Millman, S Walt, et al. Nature 585. 2020. 357–362. https://doi.org/10.1038/s41586-020-2649-2
[15] "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research. 2011. vol. 12, pp. 2825-2830. https://scikit-learn.org/stable/index.html
[16] PyTorch: "Advances in Neural Information Processing Systems 32." Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith. Curran Associates, Inc. 2019. pp. 8024-8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[17] "CS231N Convolutional Neural Networks for Visual Recognition: Assignment 2." Stanford University. 2022. https://cs231n.github.io/assignments2022/assignment2/
[18] "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems." Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Rafal Jozefowicz, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Mike Schuster, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. tensorflow.org
[19] "Keras." F Chollet and others. 2015. https://keras.io/api/
[20] "Image classification with Vision Transformer." K Salama. Keras.io. 2018. https://keras.io/examples/vision/image_classification_with_vision_transformer/
[21] "Pandas-dev/pandas: Pandas." The pandas development team. Zenodo. 2020. https://doi.org/10.5281/zenodo.3509134
[22] "Transfer learning and the art of using Pre-trained Models in Deep Learning." D Gupta. Analytics Vidhya. 2017. https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/
[23] "Matplotlib: A 2D graphics environment." J Hunter. IEEE COMPUTER SOC. 2007. vol. 9, n. 3, pp. 90-95. https://matplotlib.org/stable/index.html