
Multi-Modal Depression Severity Detection using Deep Neural Networks and Depression Assessment Scale

Madhu Sudhan H V 1 and S Saravana Kumar 2

1 Department of Computer Science and Engineering, CMR University (CMRU), Bangalore 560043, Karnataka, India
2 Department of Computer Science and Engineering, CMR University (CMRU), Bangalore 560043, Karnataka, India
[email protected], [email protected]

Abstract. Depression is a common mental disorder worldwide that affects not only the sufferer but also the people around them. It is a mental illness marked by a persistent depressed mood or loss of interest in activities, which can impair daily life. If it is long-lasting, of moderate or severe intensity, and left untreated, depression can become a serious health condition and may even lead to suicide. It is therefore necessary to detect the disorder at an early stage and reduce its effects before it progresses to suicide. This paper presents a novel depression severity detection technique that uses multi-modal data with Deep Neural Networks and the Hamilton Depression Rating Scale. Emotions in the video, speech and text modalities are detected individually using Deep Neural Networks (DNN), and the modalities are fused with corresponding weights to calculate the total points on the Hamilton Depression Rating Scale (HDRS). Based on the calculated HDRS points, the severity of depression is classified as no depression, mild, moderate or severe. Performance is evaluated for each modality, and the results show accuracies of 66% for the video modality, 81% for the speech modality and 82% for the text modality in calculating the HDRS points and classifying severity.

Keywords: Depression Detection, Deep Neural Networks, Multi-Modal Emotion Recognition, Hamilton Depression Rating Scale.

1 Introduction

Depression is a common mental health disorder that affects people worldwide. According to the WHO [1], 264 million people worldwide are affected by depression and around 800,000 people die by suicide every year. Suicide occurs mainly among 15-29 year olds and is the second leading cause of death in this age group. Depressed people find it more difficult to concentrate on work, have problems in social communication, struggle to carry out daily activities, and are more inclined to sadness, loneliness, hopelessness, anxiety and disinterest [2]. Complex interactions between social, psychological and biological factors result in depression. Depression often appears when people undergo adverse life events such as the loss of a job or a partner, or psychological trauma.

In low- and middle-income countries, between 76% and 85% of those affected receive no treatment for their disorder [3]. Barriers to effective treatment include a lack of resources and trained health care professionals, as well as the social stigma associated with depression. Other barriers include inaccurate assessment and misdiagnosis of people of all income levels across countries, and prescribing antidepressants to those who do not actually have the disorder and vice versa.

Detection of depression plays an important role, as current methods rely on clinical review of the patient, either in person or virtually. Clinical review takes time, is expensive and is not accessible to people of all income levels. Affective Sensing methods can assist physicians in the early stages of depression and in subsequent monitoring. Through Affective Sensing techniques, people can initially assess the severity of their depression themselves before approaching health care professionals, saving time and cost and reducing the chance of misdiagnosis. When a person is affected by depression, they express subtle signs that can be captured by studying all the modalities. For example, when asked a question related to the HDRS, the person may make less eye contact and display a sad emotion for most of the interview, which can be captured on video; a low and depressed tone can be captured through speech; and the context of the words used can be analyzed from the lexical text. This paper discusses the fusion of these three modalities and the calculation of depression severity using the HDRS.

2 Related Work

Automatic depression detection has recently gained attention in the affective sensing community. A great deal of research has been done to understand depression using individual modalities and through multi-modal approaches that fuse several modalities. This section reviews work on the video, speech and text modalities.

2.1 Facial emotion analysis from video

Various methods have been proposed to detect depression from videos and images. The Active Appearance Model [4][5] was used to understand facial features, which were further used to compute parameters such as Action Units associated with depression by computing the mean duration, the ratio of onset to total duration and the ratio of offset to onset phase [6]. [7] discusses classifying the face into a set of emotions such as happiness, sadness and anger using a facial expression recognition system, and other studies such as [8] focus on facial expression recognition through individual muscle movements. The Facial Action Coding System (FACS) is proposed in [9] to perform facial emotion recognition with the help of Action Units (AU). Bayesian Networks, HMMs and Neural Networks were proposed in [10] for facial expression recognition. [11] used convolutional neural networks (CNN) for facial expression recognition on gray-scale images; the CNN model was trained with a combination of raw pixel data and Histogram of Oriented Gradients (HOG) features.

A real-time vision system that performs face detection, gender classification and emotion recognition is proposed in [12] using a CNN architecture and the guided back-propagation technique. It showed that regularization methods and visualization of previously hidden features are necessary to narrow the gap between slow, accurate models and real-time architectures. Facial feature extraction for depression detection in students is proposed in [13], which captures facial features from video, extracts them in each frame and analyzes them to detect signs of depression. OpenFace [14], an interactive open-source facial expression analysis toolkit, is used in [15] to extract features for face landmark regions, head pose and eye gaze estimation and to convert them to Facial Action Units. Features extracted from OpenFace and Bag-of-Visual-Words are used to train one model per feature, each with a single layer of 200 BLSTM hidden units followed by max-pooling, and a regressor is then learned to detect depression. The current paper uses a CNN model with the Xception [16][12] architecture to identify emotion from video, which is then used to detect depression.

2.2 Emotion analysis from speech

Emotion analysis from speech plays a significant role in identifying signs of depression. [14] proposes an emotion recognition system using Gradient Boosting, KNN and SVM to classify emotion and to identify differences based on gender. Another study [15] achieved an accuracy of 66.41% on audio data and 90% by combining audio and video. Three shared emotion recognition models for speech and song are proposed in [16], namely a single-task model, a single-task hierarchical model and a multi-task hierarchical model. [17] proposes an emotion classification method using deep neural networks (CNN); it achieved an F1 score of 0.91 on the test set, with the best performance on the "Angry" emotion at a score of 0.95. A comparison of various speech emotion recognition techniques is given in [18]. The significance of features such as the Log-Mel Spectrogram, Mel-Frequency Cepstral Coefficients (MFCCs), pitch and energy was compared by applying methods such as Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNNs), Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs). An accuracy of 68% was achieved with a two-dimensional 4-layer CNN, and the choice of audio features was observed to impact the results.
An attention-based fully convolutional network is proposed for speech emotion recognition in [19], along with transfer learning to improve accuracy given the limited data. The proposed model achieved a weighted accuracy of 70.4% and an unweighted accuracy of 63.9%. The performance of two categories of models for speech emotion recognition is studied in [20]: in the first, extracted features are used to train six traditional machine learning algorithms, and in the second, a feed-forward neural network and an LSTM-based classifier are trained on the features. The study concludes that traditional machine learning techniques can achieve performance comparable to deep learning techniques. The current paper performs emotion identification from speech using MFCC features and a CNN.

2.3 Emotion analysis from text

Emotion analysis from text has been researched for quite some time in the Natural Language Processing community, and deep learning techniques have recently been used for text classification. Personality traits and meta features such as age and gender can have a positive impact on model performance when detecting depression from social media text [21]. Social media text is used to classify depression in [22], and other studies have targeted various other mental disorders. Recurrent Neural Networks (RNN) with attention are used to detect social media posts resembling a crisis [23]. [24] showed that a CNN produces better results than an RNN for detecting depression. [25] aims to predict depression in tweets using RNN, GRU and CNN models, examining the effect of character-based versus word-based models and pretrained versus learned embeddings; the best performing models are a word-based GRU with 98% accuracy and a word-based CNN with 97% accuracy. [26] and [27] use transformer encoders to detect emotions using Bidirectional Encoder Representations from Transformers (BERT). In the current research, a BERT-based emotion detector is used to predict the emotions in the text and classify depression.

2.4 Hamilton Depression Rating Scale (HDRS)

The HDRS, also known as the Ham-D, is the most widely used depression assessment scale for rating clinical depression [26][27]. HDRS-17 contains 17 items pertaining to symptoms of depression over the past week. The scale was designed for completion after a clinical interview. Later, HDRS-21 was introduced with four additional items for sub-categorizing the depression. A limitation of the HDRS is that atypical symptoms (e.g., hypersomnia, hyperphagia) are not assessed by the scale. For HDRS-17, a score of 0-7 is accepted as normal, and a score greater than 20 is usually required for entry into a clinical trial.

3 Proposed Methodology

This section presents the architecture and methods of the depression severity detection system for all the modalities. Sections 3.1 to 3.3 discuss each modality in detail, covering the dataset, feature extraction, methodology, evaluation and performance. Section 3.4 discusses the fusion of all the modalities with the HDRS. Fig. 1 presents the technical architecture of the depression severity detection system.

Fig. 1. Technical Architecture for Depression Severity Detection System



3.1 Facial Emotion Recognition Model

Facial emotion recognition is achieved with the help of a real-time facial classification model [12]. The model architecture is divided into two parts, as shown in Fig. 2. The first part of the model eliminates the fully connected layers completely, and the second part combines this elimination of the fully connected layers with depth-wise separable convolutions and residual modules. Both parts are trained using the ADAM optimizer [28].

Fig. 2. Model Architecture for Facial Emotion Recognition

Global Average Pooling is used to remove the fully connected layers; this is achieved by having the same number of feature maps in the last convolutional layer as the number of classes and applying a softmax activation function to each feature map. The model architecture has 9 convolutional layers, ReLUs, Batch Normalization and Global Average Pooling. The dataset used to train the model is the FER-2013 dataset [29], which has 35,887 grayscale images labeled with the classes "angry", "disgust", "fear", "happy", "sad", "surprise" and "neutral". The first model achieved an accuracy of 66% on this dataset. The second part of the architecture is based on Xception [30] and combines depth-wise separable convolutions [31] and residual modules [32]. Residual modules modify the desired mapping between two subsequent layers so that the learned features become the difference between the original feature map and the desired features.
The final combined architecture consists of a fully convolutional network with 4 residual depth-wise separable convolutions, each followed by batch normalization and a ReLU activation function. The last layer applies global average pooling and a softmax activation function for the prediction. The final architecture gave an accuracy of 66% for the emotion classification task on the FER-2013 dataset. A limitation is the misclassification between sad, fear, angry and disgust, but this is not very impactful for the current study, as all these emotions are signs of depression and can be used collectively to calculate the HDRS score.
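The following is a minimal TensorFlow/Keras sketch of one such residual depth-wise separable block and a small fully convolutional classifier, in the spirit of the architecture described above. The filter counts, kernel sizes and the 48x48 grayscale input are illustrative assumptions rather than the exact configuration of [12].

# Sketch of a residual depth-wise separable convolution block and a fully
# convolutional emotion classifier; layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 7  # angry, disgust, fear, happy, sad, surprise, neutral

def residual_sep_conv_block(x, filters):
    """Two depth-wise separable convolutions with a strided residual shortcut."""
    shortcut = layers.Conv2D(filters, 1, strides=2, padding="same", use_bias=False)(x)
    shortcut = layers.BatchNormalization()(shortcut)

    x = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.SeparableConv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
    return layers.Add()([x, shortcut])

def build_emotion_cnn(input_shape=(48, 48, 1)):
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(8, 3, padding="same", use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)

    for filters in (16, 32, 64, 128):  # 4 residual depth-wise separable blocks
        x = residual_sep_conv_block(x, filters)

    # Global average pooling replaces the fully connected layers: the last
    # convolution produces one feature map per emotion class.
    x = layers.Conv2D(NUM_CLASSES, 3, padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Activation("softmax")(x)
    return models.Model(inputs, outputs)

model = build_emotion_cnn()
model.compile(optimizer="adam",                      # ADAM optimizer, as in [28]
              loss="categorical_crossentropy",       # assumes one-hot labels
              metrics=["accuracy"])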

3.2 Speech Emotion Recognition Model

The speech emotion recognition model is derived from [17], which proposes an architecture using Deep Neural Networks and Mel-Frequency Cepstral Coefficients (MFCC). MFCC is the only feature used to train the model, with CNN and dense layers. In speech recognition, MFCC is considered one of the most recognized sound formalization techniques [33]; it is mainly used because of its capability to represent the amplitude spectrum of a sound wave in a compact vectorial form. 40 MFCC features are extracted from each file and used to train the model.

Fig. 3. Model Architecture for Speech Emotion Recognition

Fig. 3 shows the model architecture for speech emotion recognition using a CNN. The input to the network is a vector of 40 features for each speech file. The architecture uses a 1D CNN with a ReLU activation function, a dropout of 20% and max-pooling; pooling lets the model focus on the principal characteristics of the data. A dropout and a flatten layer are added to make the output compatible with the following layers, and finally a dense layer with a softmax activation function is added for the prediction. The dataset used is the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [34]. The speech data covers neutral, calm, happy, sad, angry, fearful, surprise and disgust expressions and comprises 7356 files recorded by 12 female and 12 male professional actors. Fig. 4 shows the waveplots of the speech data for all the emotion types.

Fig. 4. Waveplots for Speech data



Fig. 5. Spectrograms for Speech data

Spectrograms of the speech data are shown in Fig. 5. For training, the files were split into training and test datasets in a 77:33 ratio. The training set has 3315 MFCC vectors of 40 features each. The model is trained with the sparse categorical cross-entropy loss function for 50 epochs using the rmsprop optimizer.
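A minimal sketch of this pipeline is shown below, assuming librosa for MFCC extraction and TensorFlow/Keras for the network. The filter count and kernel size are illustrative assumptions; the 40 time-averaged MFCC features per file, 20% dropout, rmsprop optimizer and sparse categorical cross-entropy loss follow the description above.

# Sketch of the speech pipeline: 40 MFCC features per file feeding a 1D CNN.
import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 8  # neutral, calm, happy, sad, angry, fearful, surprise, disgust

def extract_mfcc(path, n_mfcc=40):
    """Load an audio file and return a 40-dimensional time-averaged MFCC vector."""
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)  # shape: (40,)

def build_speech_cnn(n_features=40):
    model = models.Sequential([
        layers.Input(shape=(n_features, 1)),
        layers.Conv1D(64, kernel_size=5, activation="relu", padding="same"),
        layers.Dropout(0.2),
        layers.MaxPooling1D(pool_size=4),
        layers.Dropout(0.2),
        layers.Flatten(),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])
    model.compile(optimizer="rmsprop",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Typical usage (wav_paths and integer labels y are placeholders):
# X = np.stack([extract_mfcc(p) for p in wav_paths])[..., np.newaxis]
# model = build_speech_cnn()
# model.fit(X, y, epochs=50, validation_split=0.25)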

Fig. 6. Model Accuracy and Confusion Matrix



Fig. 6 shows the model accuracy during training. The overall accuracy is 81.14%, and the best F1 scores are obtained for the "angry" and "neutral" classes, at 87% and 86% respectively. The output prediction from the speech emotion recognition model is used to calculate the HDRS points to predict depression.

3.3 Text Emotion Recognition Model

The text emotion recognition model is based on Bidirectional Encoder Representations from Transformers (BERT) and is derived from [35]. The architecture consists of two stages, BERT fine-tuned training and Bi-LSTM classification, as shown in Fig. 7. The datasets used are ISEAR [36], DailyDialog [37] and Emotion-Stimulus [38]; they are combined and cover the classes joy, sadness, anger, fear and neutral. Data preprocessing is performed on the input text, applying Natural Language Processing (NLP) techniques such as stop-word removal, lemmatization and stemming. The BERT fine-tuning stage uses self-attention and transformers to model language with a bi-directional pre-training approach. The bert-base-uncased model is used, with 12 transformer blocks, each with a hidden size of 768 and 12 self-attention heads, giving around 110 million parameters.
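As an illustration, a minimal preprocessing step along these lines could be implemented with NLTK as sketched below; the exact preprocessing pipeline used in [35] may differ.

# Sketch of the preprocessing step: stop-word removal, lemmatization and
# (optionally) stemming, using NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "punkt_tab", "stopwords", "wordnet", "omw-1.4"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def preprocess(text: str, use_stemming: bool = False) -> str:
    """Lower-case, tokenize, drop stop words, then lemmatize or stem."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stemmer.stem(t) if use_stemming else lemmatizer.lemmatize(t)
              for t in tokens]
    return " ".join(tokens)

print(preprocess("I was feeling really hopeless and let my family down"))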

Fig. 7. BERT Text Emotion Recognition Architecture

Each transformer layer receives a list of token embeddings and produces a feature vector of the same length at its output. The vector transformations of the 12th transformer layer are used as the aggregated sequence representation for classification. For the Bi-LSTM classification stage, an input layer, a mask layer, a Bidirectional LSTM layer and a dense layer are attached to the BERT model. The input layer receives the output of the previous stage, the bidirectional layer has 100 neurons, and a dense layer with 5 units predicts the 5 emotion classes using a softmax activation function. The overall accuracy is 81.79%, and the classification report is shown in Fig. 8. The emotions "joy" and "fear" have the highest F1 scores, at 84% for both. The emotions detected from text are used for calculating depression severity.

Fig. 8. Classification Report for BERT Model
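A minimal sketch of this two-stage model, assuming the Hugging Face transformers library with TensorFlow/Keras, is shown below. The maximum sequence length and learning rate are illustrative assumptions; the bert-base-uncased encoder, the 100-unit Bidirectional LSTM and the 5-way softmax head follow the description above, and the attention mask plays the role of the mask layer.

# Sketch of the BERT + Bi-LSTM text emotion classifier.
import tensorflow as tf
from tensorflow.keras import layers, models
from transformers import BertTokenizer, TFBertModel

MAX_LEN = 64  # assumed maximum token length
EMOTIONS = ["joy", "sadness", "anger", "fear", "neutral"]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

input_ids = layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
attention_mask = layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Token-level representations from the final (12th) transformer layer.
sequence_output = bert(input_ids, attention_mask=attention_mask).last_hidden_state

x = layers.Bidirectional(layers.LSTM(100))(sequence_output)
outputs = layers.Dense(len(EMOTIONS), activation="softmax")(x)

model = models.Model([input_ids, attention_mask], outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Typical usage on a single sentence (label handling omitted):
enc = tokenizer("I feel let down and hopeless", padding="max_length",
                truncation=True, max_length=MAX_LEN, return_tensors="tf")
probs = model([enc["input_ids"], enc["attention_mask"]])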



3.4 Depression Severity Detection with HDRS

The HDRS is used to evaluate the severity of depression. This paper proposes a novel approach to calculating depression severity using the emotions detected from all the modalities together with the HDRS. The emotions detected by the models above are used to calculate the final score over all 17 items of HDRS-17. A sample rule for finding the point category for the question "Are you feeling bad about yourself, or that you are a failure, or that you have let yourself or your family down?" is shown in Table 1. Emotions are represented by characters: sad (S), anger (A) and neutral (N). A minimum emotion frequency count is defined for each question and each modality; for example, if the frequency count of the sad emotion is 2, anger is 4 and neutral is 0, this is represented as S2-A4-N0.

Table 1. Rules for calculating the score in HDRS for a sample question.

Point Category | Facial Emotion Recognition | Speech Emotion Recognition | Text Emotion Recognition | Selected Point Category
0              | S0-A0-N2                   | S0-A0-N1                   | S0-A0-N1                 | 2
1              | S1-A0-N0                   | S2-A0-N0                   | S2-A0-N0                 |
2              | S2-A1-N0                   | S2-A2-N0                   | S2-A1-N0                 |
3              | S3-A2-N0                   | S3-A3-N0                   | S3-A3-N0                 |
4              | S4-A3-N0                   | S3-A4-N0                   | S3-A4-N0                 |

The algorithm performs the minimum frequency count match from the bottom up over the defined rules: the rule that matches all the criteria for all the emotions is picked, and the corresponding point category is selected for that question. If there are multiple matches, a single rule is selected by going from bottom to top through the rules, so that higher point categories take priority over lower ones. In this way it is possible to identify high severity first and then move on to moderate, mild and no severity in depression detection. The final score is calculated by summing up the individual point categories of all the questions. Severity is classified as shown in Table 2.

Table 2. Severity Classification for Depression using HDRS Score.

Total Points | Severity of Depression
0 - 7        | No Depression
8 - 17       | Mild
18 - 24      | Moderate
>= 25        | Severe
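A minimal sketch of this scoring logic is given below, under one plausible reading of the matching rule: a rule matches when the observed emotion frequency counts meet or exceed its minimum counts for every emotion in every modality, and rules are scanned bottom-up so that higher point categories win. The rule values are those of Table 1 for the sample question, and the severity bands are those of Table 2.

# Sketch of the HDRS scoring logic in Section 3.4.
from typing import Dict, List

# Minimum counts of sad (S), anger (A) and neutral (N) per modality,
# ordered as (facial, speech, text), for one sample question (Table 1).
RULES: List[Dict] = [
    {"point": 0, "min": [{"S": 0, "A": 0, "N": 2}, {"S": 0, "A": 0, "N": 1}, {"S": 0, "A": 0, "N": 1}]},
    {"point": 1, "min": [{"S": 1, "A": 0, "N": 0}, {"S": 2, "A": 0, "N": 0}, {"S": 2, "A": 0, "N": 0}]},
    {"point": 2, "min": [{"S": 2, "A": 1, "N": 0}, {"S": 2, "A": 2, "N": 0}, {"S": 2, "A": 1, "N": 0}]},
    {"point": 3, "min": [{"S": 3, "A": 2, "N": 0}, {"S": 3, "A": 3, "N": 0}, {"S": 3, "A": 3, "N": 0}]},
    {"point": 4, "min": [{"S": 4, "A": 3, "N": 0}, {"S": 3, "A": 4, "N": 0}, {"S": 3, "A": 4, "N": 0}]},
]

def question_score(observed: List[Dict[str, int]]) -> int:
    """Return the point category for one question.

    `observed` holds the emotion frequency counts per modality, e.g.
    [{"S": 2, "A": 4, "N": 0}, ...] for facial, speech and text.
    Rules are checked bottom-up so higher point categories take priority.
    """
    for rule in reversed(RULES):
        if all(observed[m][e] >= rule["min"][m][e]
               for m in range(3) for e in ("S", "A", "N")):
            return rule["point"]
    return 0

def severity(total_points: int) -> str:
    """Map the summed HDRS-17 score to a severity band (Table 2)."""
    if total_points <= 7:
        return "No Depression"
    if total_points <= 17:
        return "Mild"
    if total_points <= 24:
        return "Moderate"
    return "Severe"

# Example: counts of S2-A4-N0 from every modality match the point-2 rule,
# so this question contributes 2 points; a total of 19 points is Moderate.
counts = [{"S": 2, "A": 4, "N": 0}] * 3
print(question_score(counts), severity(19))  # -> 2 Moderate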

4 Results

HDRS-17 has 17 items, which are used to prepare 17 questions asked in an interview through a webcam and microphone for depression counselling. The user sits in front of the webcam with the microphone enabled and responds to the 17 questions. After each question, the recording is stopped, and the video, speech and converted text are stored as files on the drive. These files are then used to calculate the point categories and finally the overall score for identifying the depression severity.

Fig. 9. Emotion Recognition from Face, Speech and Text

A screenshot of the live depression severity detection system is shown in Fig. 9. Emotions are recognized from all the modalities for each question, and the total score is calculated using the HDRS to classify the severity. A sample final report is shown in Fig. 10, which classifies the depression severity on the HDRS-17 scale.

Fig. 10. Depression Severity Detection Report



5 Conclusion

This paper proposes a novel depression severity prediction system using Deep Neural Networks and the HDRS. The architectures of the individual modalities for face, speech and text are defined and analyzed with different datasets and techniques. Facial emotion recognition gave an accuracy of 66%, speech emotion recognition up to 81.14% and text emotion recognition 81.79%. All the detected emotions were combined, and rules were developed to calculate the point categories and finally the overall HDRS score, which in turn is used to classify the severity. The proposed system is fully automated, and the end result is a report that the patient can use for self-evaluation. Limitations include the lack of sufficient data to validate the severity of depression, and the evaluation is somewhat subjective, as the points allocated in HDRS-17 may vary from therapist to therapist. Future work will involve using hand pose movements, gestures and eye gaze movements to identify depression severity.

References
1. WHO Homepage, https://www.who.int/news-room/fact-sheets/detail/depression, last accessed 2021/05/02.
2. Katon, Wayne, and Mark D. Sullivan. "Depression and chronic medical illness." J Clin Psychiatry 51, Suppl 6 (1990): 3-11.
3. Wang. "Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the WHO world mental health surveys." The Lancet, 2007; 370(9590): 841-50.
4. G. Edwards, C. Taylor, and T. Cootes, "Interpreting Face Images Using Active Appearance Models," Proceedings of the IEEE International Conference on Automatic Face and Gesture Recognition FG'98, Nara, Japan: IEEE, Apr. 1998, pp. 300-305.
5. J. Saragih and R. Goecke, "Learning AAM fitting through simulation," Pattern Recognition, vol. 42, no. 11, pp. 2628-2636, 2009.
6. J. Joshi, A. Dhall, R. Goecke and J. F. Cohn, "Relative Body Parts Movement for Automatic Depression Analysis," 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013, pp. 492-497, doi: 10.1109/ACII.2013.87.
7. A. Kleinsmith and N. Bianchi-Berthouze, "Affective body expression perception and recognition: A survey," IEEE Transactions on Affective Computing, vol. PP, no. 99, p. 1, 2012.
8. G. Littlewort, M. Bartlett, I. Fasel, J. Susskind, and J. Movellan. "Dynamics of facial expression extracted automatically from video." Image and Vision Computing, 24(6), 2006.
9. M.S. Bartlett, G. Littlewort, M.G. Frank, C. Lainscsek, I. Fasel, and J.R. Movellan. "Automatic recognition of facial actions in spontaneous expressions." Journal of Multimedia, 2006.
10. P. Ekman, W. Friesen, "Facial Action Coding System: A Technique for the Measurement of Facial Movement," Consulting Psychologists Press, 1978.
11. Shima Alizadeh and Azar Fazel, "Convolutional Neural Networks for Facial Expression Recognition." arXiv preprint arXiv:1704.06756, 2017.
12. Arriaga, O., Valdenegro-Toro, M., & Plöger, P., "Real-time Convolutional Neural Networks for emotion and gender classification." ArXiv, abs/1710.07557, 2019.
13. Venkataraman, D., and Parameswaran, N.S., "Extraction of Facial Features for Depression Detection among Students." International Journal of Pure and Applied Mathematics, 118, 2018.
14. Iqbal, A., and Barua, K., "A real-time emotion recognition from speech using gradient boosting." International Conference on Electrical, Computer and Communication Engineering (ECCE), 2019, IEEE, pp. 1-5.
15. Jannat, R., Tynes, I., Lime, L. L., Adorno, J., and Canavan, S., "Ubiquitous emotion recognition using audio and video data." Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers (2018), ACM, pp. 956-959.
16. Zhang, B., Essl, G., and Provost, E. M., "Recognizing emotion from singing and speaking using shared models." 2015 International Conference on Affective Computing and Intelligent Interaction (ACII) (2015), IEEE, pp. 139-145.
17. M. G. de Pinto, M. Polignano, P. Lops and G. Semeraro, "Emotions Understanding Model from Spoken Language using Deep Neural Networks and Mel-Frequency Cepstral Coefficients," 2020 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), 2020, pp. 1-5, doi: 10.1109/EAIS48028.2020.9122698.
18. Venkataramanan, K., & Rajamohan, H., "Emotion Recognition from Speech." ArXiv, abs/1912.10458, 2019.
19. Y. Zhang, J. Du, Z. Wang, J. Zhang and Y. Tu, "Attention Based Fully Convolutional Network for Speech Emotion Recognition," 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2018, pp. 1771-1775, doi: 10.23919/APSIPA.2018.8659587.
20. Sahu, G., "Multimodal Speech Emotion Recognition and Ambiguity Resolution." ArXiv, abs/1904.06022, 2019.
21. Daniel Preoţiuc-Pietro, Johannes Eichstaedt, Gregory Park, Maarten Sap, Laura Smith, Victoria Tobolsky, H Andrew Schwartz, and Lyle Ungar, "The Role of Personality, Age and Gender in Tweeting about Mental Illnesses." Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 21-30, 2015.
22. Philip Resnik, William Armstrong, Leonardo Claudino, and Thang Nguyen, "The University of Maryland CLPsych 2015 Shared Task System." CLPsych 2015 Shared Task System, pages 54-60, 2015.
23. Rohan Kshirsagar, Robert Morris, and Samuel Bowman, "Detecting and Explaining Crisis." Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 66-73, Vancouver, Association for Computational Linguistics, 2017.
24. Ahmed Husseini Orabi, Prasadith Buddhitha, Mahmoud Husseini Orabi, and Diana Inkpen, "Deep Learning for Depression Detection of Twitter Users." Fifth Workshop on Computational Linguistics and Clinical Psychology, pages 88-97, 2018.
25. Diveesh Singh and Aileen Wang, "Detecting Depression Through Tweets." Stanford University, CA, pp. 1-9.
26. Hamilton, M., "Development of a rating scale for primary depressive illness." Br J Soc Clin Psychol, 1967, 6(4): 278-96.
27. Williams, J.B., "A structured interview guide for the Hamilton Depression Rating Scale." Arch Gen Psychiatry, 1988, 45(8): 742-7.
28. Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980, 2014.
29. L. Zahara, P. Musa, E. Prasetyo Wibowo, I. Karim and S. Bahri Musa, "The Facial Emotion Recognition (FER-2013) Dataset for Prediction System of Micro-Expressions Face Using the Convolutional Neural Network (CNN) Algorithm based Raspberry Pi," 2020 Fifth International Conference on Informatics and Computing (ICIC), 2020, pp. 1-9, doi: 10.1109/ICIC50835.2020.9288560.
30. François Chollet, "Xception: Deep learning with depthwise separable convolutions." CoRR, abs/1610.02357, 2016.
31. Andrew G. Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications." CoRR, abs/1704.04861, 2017.
32. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
33. Muda, L., Begam, M., and Elamvazuthi, I., "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques." arXiv preprint arXiv:1003.4083, 2010.
34. Livingstone, S. R., and Russo, F. A., "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English." PLoS ONE 13, 5 (2018), e0196391.
35. Adoma, A.F., Henry, N., Chen, W., & Niyongabo, R.A., "Recognizing Emotions from Texts using a BERT-Based Approach." 17th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 62-66, 2020.
36. ISEAR Homepage, https://www.unige.ch/cisa/research/materials-and-online-research/research-material/, last accessed 2021/05/02.
37. Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu, "DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset." IJCNLP 2017.
38. Diman Ghazi, Diana Inkpen, and Stan Szpakowicz, "Detecting Emotion Stimuli in Emotion-Bearing Sentences." Proceedings of the 16th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2015), Cairo, Egypt, 2015.
