Speech Emotion Recognition Using Deep Learning

Abstract - Recognizing the affective qualities of speech while ignoring its semantic content is the goal of Speech Emotion Recognition (SER). Performing this task automatically with programmed devices is still a work in progress, even though people do it efficiently as part of everyday voice communication. Speech emotion recognition aims to predict a person's emotional state from their speech, and it improves communication between computers and people. Although predicting a person's emotion is challenging because of the subjectivity of emotions and the difficulty of annotating audio, SER makes this possible through its capacity to model human emotion. Tone, pitch, expression, and conduct are all cues that can indicate an individual's emotional state. Existing systems are neither real-time nor able to handle more than a single emotion. Current speech emotion prediction models are often based on SVM algorithms, which can take a long time to train in order to achieve high classification accuracy, and they lack a scalable framework. Because it makes use of natural language processing, the proposed system is both precise and efficient.

Keywords: Speech Emotion Recognition, Text-to-Speech, Support Vector Machine, Decision Tree

I. INTRODUCTION

A system known as Speech Emotion Recognition (SER) tries to automatically identify a person's emotional state from their speech. It falls under the umbrella of affective computing, a branch of science that aims to create machines that can detect and react to human emotions. The goal of SER is to enable machines to understand and respond to human emotions, which can lead to improved human-machine interactions in various applications, such as mental health, human-computer interaction, and robotics. The study of emotions has long been a crucial area of psychology and neuroscience research. Emotions have been found to be extremely important in human communication and decision-making, and the ability to recognize and express them is essential for social interaction and well-being. In recent years, researchers have been exploring ways to enable machines to understand and respond to human emotions, which has led to the development of affective computing.

SER involves analysing different acoustic characteristics of speech, such as pitch, rhythm, tone, and intensity, in order to glean information about the speaker's emotional state. Most often, machine learning algorithms are employed to train models that can effectively identify speech emotions. These models can then classify speech into various emotional categories, such as joy, sorrow, anger, and surprise.

The development of SER systems requires large annotated datasets of speech samples that are labelled with their corresponding emotional states. These datasets are used to train the machine learning models and assess their effectiveness. The accuracy of SER systems can be affected by various factors, such as differences in the way emotions are expressed across cultures, languages, and genders.

One of the primary applications of SER is in mental health. The ability to accurately recognize and monitor changes in a person's emotional state can help mental health professionals provide timely and effective interventions. For example, SER can be used to monitor changes in the emotional state of patients with depression, anxiety, or other mental health conditions, helping clinicians identify early warning signs of relapse and provide appropriate support.
Another application of SER is in human-computer interaction. SER can be used to enable more natural and empathetic interactions between humans and machines. For example, SER can be used in virtual assistants, chatbots, or other interactive systems to recognize and respond to the emotional state of the user, which can lead to more personalized and engaging user experiences. SER can also be used in robotics to enable robots to recognize and respond to the emotional state of humans; applications in the fields of healthcare, education, and entertainment can all benefit from this. Robots could be utilised, for instance, to offer emotional support to hospital patients or elderly residents of nursing homes. They can also be used in educational settings to provide personalized feedback to students based on their emotional state.

II. LITERATURE SURVEY

In [1], the authors address a technique for emotion recognition in spoken language that depends on linguistic as well as acoustic cues. Several methods have been proposed for emotion detection using these two types of features. Because emotionally charged speech is widely believed to be more difficult to recognise than its less charged equivalent, the majority of linguistic-feature research relies on reference transcripts. The type and degree of the emotion being communicated have a considerable impact on the acoustic parameters of emotional speech, which differ dramatically from those of emotion-free speech. To improve recognition performance on an emotional speech task, the authors investigate a novel approach to emotional speech recognition that combines acoustic model and language model adaptation, extracting linguistic features from speech recognition output. Only 82.2% of words were correctly recognised by the recogniser, and recognition mistakes were found. To combat this, the study shows that emotion identification is possible by combining linguistic and audio data, and it also demonstrates the value of the linguistic components retrieved from the recognition results.

In [2], the authors note that because neural text-to-speech (TTS) algorithms frequently require a substantial amount of high-quality audio data, it can be challenging to compile such a dataset that also contains emotion labels. Using a TTS dataset devoid of emotion descriptors, the paper describes a novel method for synthesising emotional TTS. The suggested approach combines an emotional TTS model with a cross-domain speech emotion recognition (SER) model. First, information from both the SER dataset and the TTS dataset is used to train a cross-domain SER model. An auxiliary SER task is then created and trained jointly with the TTS model, using the trained SER model's predictions of emotion labels on the TTS dataset. Experiments demonstrate that the technique can produce speech that sounds natural and has the requisite level of emotional expressiveness.

In [3], the authors likewise observe that neural TTS algorithms call for large amounts of high-quality audio data, making it challenging to compile such a dataset with additional emotion labels, and they describe a method for synthesising emotional TTS from a TTS dataset devoid of emotion descriptors. Their approach also combines an emotional TTS model with a cross-domain SER model: a cross-domain SER model is first trained on information from both the SER and TTS datasets, and an auxiliary SER task is then designed and trained in conjunction with the TTS model using the trained SER model's predictions of emotion labels on the TTS dataset. Their tests demonstrate that the technique can produce natural-sounding speech with the requisite level of emotional expressiveness.

In [4], rapid progress in emotion detection is contributing to more pleasant human-computer interactions. The paper presents a system that makes decisions based on traits from both vocal and visual expressions. The constraint of single-modal emotion recognition caused by relying on a single type of emotional feature is overcome by this method, which exploits the complementary emotional information provided by speech and facial expressions. Long short-term memory networks and convolutional neural networks are used to model the verbal and emotive aspects of human communication, and numerous small-scale kernel convolution blocks are constructed to extract facial expression features in parallel. Finally, the traits of spoken language and facial expressions are combined using DNNs. The efficacy of the multimodal model for identifying emotions was tested on the IEMOCAP dataset. Compared with models that used speech or facial expression alone as independent modalities, the proposed model shows improvements of 10.5% and 11.2% in overall recognition accuracy, respectively.

In [5], because of its central role in human-computer interaction, speech emotion recognition is shown to have substantial practical implications in many fields, including criminal investigation. Beginning with a brief review of the pertinent literature, the paper discusses the theoretical underpinnings of speech emotion recognition, including speech signal pre-processing, the extraction of short-time energy, and derived parameters, before proposing a deep learning-based speech emotion recognition algorithm and building a speech emotion recognition model. The accuracy and capability of vocal emotion identification are undergoing considerable advancements in human-computer interface devices.

In [6], speech emotion recognition is used to identify a person's emotional state from their speech and to account for the level of accuracy attained, increasing the effectiveness of interacting with computers. Despite the difficulty of predicting another person's feelings, owing to the subjective nature of emotions and the difficulty of annotating audio, SER makes this achievable. Dogs, elephants, and horses, among other species, use a similar ability to decode human emotions. Mood predictions can be made from a wide variety of cues; voice, facial expression, and behaviour are all examples. A few of these cues are believed to be sufficient to deduce the speaker's emotional state from their words alone. Classifiers that recognise speech emotions can be trained using a modest quantity of data. The study makes use of the RAVDESS dataset, the Ryerson Audio-Visual Database of Emotional Speech and Song.
The three features highlighted as most informative in that study are the Mel spectrogram, the Mel-frequency cepstral coefficients (MFCC), and the chroma features.

In [7], a trustworthy speech emotion recognition (SER) system for human interaction is identified as essential for conversational agent design to advance significantly. The paper introduces the dialogical emotion decoding (DED) algorithm, a novel inference technique. The algorithm takes into account the sequential nature of a conversation and, using a designated recognition engine, decodes the emotional state of each speech segment; the decoder is trained to capture the emotional effects both within each speaker's turns and between speakers in a conversation. On the IEMOCAP database, the approach achieves 70.1% across four distinct emotion classes, an improvement of 3% over the previous state-of-the-art system. A similar result is found when the analysis is applied to MELD, a database of multi-party interactions. The DED functions primarily as a SER decoder for conversational emotions and is adaptable to various SER engines.

In [8], the authors argue that music and song are more effective communicators of emotion than words alone. Research on emotion detection in music and spoken word serves as the foundation for an examination of feature sets, feature types, and classifiers. GeMAPS, pyAudioAnalysis, and LibROSA feature sets are used along with two feature types (low-level descriptors and high-level statistical functions) and four classifiers (multilayer perceptron, LSTM, GRU, and convolutional neural networks) to analyse song and speech data. The findings demonstrate that, when both are processed in the same manner, there is no discernible difference between song data and speech data. According to two studies, singing elicits more intense emotions than speech. Furthermore, in this classification test, higher-level statistical functions of auditory features outperformed lower-level descriptors, supporting a previous study on the regression problem that emphasised the importance of using high-level traits.

In [9], the identification of emotions in spoken language (SER) is described as one of the most recent challenges in human-computer interaction. Typical SER classification techniques can only identify one emotion per speech sample, because the speech emotion databases used to train SER models usually assign only one emotion label to each utterance. In reality, however, human speech often conveys a mixture of emotions at once. To make SER sound more natural than it has in the past, it is important to account for the existence of several emotions within a single utterance. The authors therefore built a collection of emotional speech that covers a wide range of emotions and includes labels specifying the relative strength of those emotions. The material was obtained by extracting segments of pre-existing video works containing voice utterances with emotional expressions, and statistical analysis was conducted on the newly generated database to complete its assessment. In total, 2,025 samples were collected, of which 1,525 showed signs of containing several emotions.

In [10], a novel multi-task pre-training technique for speech emotion recognition (SER) is proposed. To make the acoustic ASR model more "emotion aware," the SER model is pre-trained to simultaneously perform Automatic Speech Recognition (ASR) and sentiment classification tasks. Targets for the sentiment classification are established using a text-to-sentiment model trained on publicly accessible data. The acoustic ASR model is ultimately fine-tuned on material annotated with emotions. On the MSP-Podcast dataset, where the suggested approach was evaluated, the highest-ever reported CCC for valence prediction was obtained.

III. METHODOLOGY

Fig 1 : System Architecture Diagram

A. FEATURE EXTRACTION

The Feature Extraction module is a crucial component of a Speech Emotion Recognition (SER) system. The main objective of this module is to identify the emotional content of voice signals by extracting pertinent elements from the speech signals. While there are several methods for extracting features, Mel-frequency cepstral coefficients (MFCCs) are one of the most widely used. MFCCs represent the short-term power spectrum of a speech signal after it has been passed through a bank of Mel-scale filters, followed by a logarithmic transformation and a discrete cosine transformation. The resulting coefficients, which are frequently utilised in speech processing applications, capture details about the spectral envelope of the signal. Pitch, energy, and spectral traits are additional features that can be retrieved from voice signals. Emotional cues can also be detected using prosodic elements such as speech tempo, pauses, and intonation patterns, and statistical features such as mean, variance, and skewness can be used to capture statistical properties of the signal. In practice, multiple feature extraction techniques are often combined to capture different aspects of the speech signal. The resulting feature vector is then used as input to the emotion classification module, which attempts to classify the speech into different emotional categories. The accuracy and effectiveness of the feature extraction module are critical to the overall performance of the SER system.
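As a minimal sketch of this module (not the exact pipeline used here), the snippet below extracts MFCC, chroma, and Mel-spectrogram features with the librosa library, adds simple energy cues, and averages each over time to produce one fixed-length feature vector per utterance; the file path and parameter values are placeholders.

```python
import numpy as np
import librosa

def extract_features(wav_path, sr=22050, n_mfcc=40):
    """Return a fixed-length feature vector (MFCC + chroma + Mel spectrogram + energy cues)."""
    signal, sr = librosa.load(wav_path, sr=sr)

    # Short-term spectral features, averaged over time to get one value per coefficient.
    mfcc = np.mean(librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=signal, sr=sr), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=signal, sr=sr), axis=1)

    # Simple energy/prosody-related cues: RMS energy and zero-crossing rate.
    rms = np.mean(librosa.feature.rms(y=signal))
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=signal))

    return np.hstack([mfcc, chroma, mel, [rms, zcr]])

# Example usage with a hypothetical RAVDESS-style file name.
# features = extract_features("Actor_01/03-01-05-01-01-01-01.wav")
```

Averaging over time is only one way to obtain utterance-level features; frame-level sequences could instead be fed to a sequential model such as an LSTM.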
B. EMOTION CLASSIFICATION

The Emotion Classification module is another critical component of a Speech Emotion Recognition (SER) system. This module's primary goal is to classify the speech signal into different emotional categories based on the features extracted by the previous module. Machine learning techniques such as Support Vector Machines (SVM), Decision Trees, Random Forests, Naive Bayes, and neural networks are just a few examples of the many methods available for classifying emotions.
The classifier is trained on labelled data, which consists of speech samples with associated emotional labels. The feature vector extracted by the previous module is used as input to the classifier, which then assigns a label to the speech signal based on the trained model. Basic emotions such as anger, joy, sadness, fear, disgust, or surprise can be labelled, as well as more complex emotional states such as boredom, perplexity, or irritation. The performance of the SER system as a whole depends on how accurate and efficient the emotion classification module is. Metrics such as accuracy, precision, recall, and F1-score can be used to assess the effectiveness of a classification algorithm, and the performance of the classifier can be improved by employing strategies such as feature selection, hyperparameter tuning, and ensemble learning. The predicted emotional state is the final result of the emotion classification module, which is subsequently refined by the post-processing module.
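As an illustrative sketch under the assumption that feature vectors and labels are already available from the previous module, the snippet below trains an SVM with scikit-learn and reports accuracy, precision, recall, and F1-score per class; the variables `features` and `labels`, and the chosen hyperparameters, are placeholders.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# `features` is an (n_samples, n_features) array from the extraction module;
# `labels` holds the corresponding emotion names, e.g. "happy", "sad", "angry".
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

# Feature scaling helps the SVM; fit the scaler on the training split only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# RBF-kernel SVM; C and gamma would normally be chosen by hyperparameter search.
clf = SVC(kernel="rbf", C=10, gamma="scale")
clf.fit(X_train, y_train)

# Accuracy, precision, recall, and F1-score for each emotion class.
print(classification_report(y_test, clf.predict(X_test)))
```

Any of the other classifiers mentioned above (Decision Tree, Random Forest, Naive Bayes, or a neural network) could be dropped in place of the SVM with the same training and evaluation scaffolding.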
Proceedings of the 2021 IEEE 11th Annual Computing and Communication
Workshop and Conference (CCWC), 2021, pp. 0331-0335.
C. POST PROCESSING [7] Y. Han, H. Li, and S. Wei, "Speech Emotion Recognition Based on
Convolutional Neural Network with Modified Activation Function," in
Proceedings of the 2021 IEEE International Conference on Computer and
The Post-processing module is the final component Communications (ICCC), 2021, pp. 1953-1958.
of a Speech Emotion Recognition (SER) system. This [8] R. Jang, Y. Han, J. Zhang, and H. Li, "Speech Emotion Recognition
module's primary goal is to refine the emotion classification Based on Convolutional Neural Network with Multichannel Information
Fusion," in Proceedings of the 2021 IEEE 6th International Conference on
results by smoothing, filtering, or post- analysis them. One of
Control, Automation and Robotics (ICCAR), 2021, pp. 78-83.
the most common techniques used in the post-processing [9] Z. Zhang, Y. Liu, and Y. Xu, "Speech Emotion Recognition Using Deep
module is smoothing, which involves removing noise or Neural Network with Ensemble Learning," in Proceedings of the 2021 IEEE
outliers from the classification results. Median filtering or International Conference on Computational Science and Engineering (CSE),
2021, pp. 131-136.
Gaussian filtering can be used to smooth the results, making [10] H. Wang, X. Guo, and S. Zhang, "Speech Emotion Recognition Based
them more consistent and easier to interpret. Post- analysis on Deep Neural Network with Hierarchical Feature Extraction," in
techniques such as clustering or regression can also be used Proceedings of the 2021 IEEE 3rd International Conference on
Communication Engineering and Technology (ICCET), 2021, pp. 69-74.
to provide more detailed insights into the emotional content
of the speech. For example, clustering algorithms can be used
to group similar emotional states, while regression models
can be used to predict continuous emotional dimensions such
as valence and arousal. The system can also be improved by
using feedback mechanisms to adjust the classification results
based on user feedback. For example, if the user disagrees
with the predicted emotional state, they can provide feedback,
which can be used to update the classifier's model. The
effectiveness of the post- processing module can be evaluated
using metrics such as Mean Opinion Score (MOS) or
preference ratings, which provide an indication of how well
the system's output matches the actual emotional state of the
speaker. Overall, the post-processing module is essential for
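A minimal sketch of the smoothing step described above, assuming the classifier produces one label per short speech segment: predicted labels are mapped to integer indices and passed through a median filter so that isolated, spurious predictions are replaced by the locally dominant emotion. The label set and window size are illustrative.

```python
import numpy as np
from scipy.signal import medfilt

# Illustrative label set; the index order is arbitrary but must stay fixed.
EMOTIONS = ["angry", "happy", "sad", "fear", "disgust", "surprise"]

def smooth_predictions(pred_labels, kernel_size=5):
    """Median-filter a sequence of per-segment emotion labels to suppress outliers."""
    indices = np.array([EMOTIONS.index(p) for p in pred_labels], dtype=float)
    smoothed = medfilt(indices, kernel_size=kernel_size)
    return [EMOTIONS[int(i)] for i in smoothed]

# Example: a single "sad" segment inside a run of "happy" is smoothed away.
print(smooth_predictions(["happy", "happy", "sad", "happy", "happy"]))
```

Treating class indices as numeric values is a simplification; majority voting over a sliding window is an equally valid way to implement the same smoothing idea.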
REFERENCES

[1] Z. Yang, C. Zhang, Y. Xu, and Y. Liu, "Speech Emotion Recognition Based on Deep Learning with Syllable-Level Attention," IEEE Access, vol. 9, pp. 7867-7879, 2021.
[2] M. Sakurai and T. Kosaka, "Emotion Recognition Combining Acoustic and Linguistic Features Based on Speech Recognition Results," 2021 IEEE 10th Global Conference on Consumer Electronics (GCCE), 2021.
[3] Y. Guo, X. Zhang, Y. Wang, and Y. Xue, "Speech Emotion Recognition Based on Deep Neural Network with Data Augmentation," in Proceedings of the 2021 IEEE 4th International Conference on Intelligent Transportation Engineering (ICITE), 2021, pp. 1069-1074.
[4] H. Kim, Y. Jung, and D. Kim, "Speech Emotion Recognition Using Multi-level Deep Convolutional Neural Network," in Proceedings of the 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), 2021, pp. 1-6.
[5] X. Zhang, X. Qian, X. Sun, and H. Liu, "Speech Emotion Recognition Based on Deep Learning with Fuzzy Clustering," in Proceedings of the 2021 IEEE 2nd International Conference on Artificial Intelligence and Knowledge Engineering (AIKE), 2021, pp. 46-51.
[6] M. R. Islam, T. Islam, M. A. Islam, and A. M. A. Hossain, "Speech Emotion Recognition Using Deep Convolutional Neural Network," in Proceedings of the 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), 2021, pp. 0331-0335.
[7] Y. Han, H. Li, and S. Wei, "Speech Emotion Recognition Based on Convolutional Neural Network with Modified Activation Function," in Proceedings of the 2021 IEEE International Conference on Computer and Communications (ICCC), 2021, pp. 1953-1958.
[8] R. Jang, Y. Han, J. Zhang, and H. Li, "Speech Emotion Recognition Based on Convolutional Neural Network with Multichannel Information Fusion," in Proceedings of the 2021 IEEE 6th International Conference on Control, Automation and Robotics (ICCAR), 2021, pp. 78-83.
[9] Z. Zhang, Y. Liu, and Y. Xu, "Speech Emotion Recognition Using Deep Neural Network with Ensemble Learning," in Proceedings of the 2021 IEEE International Conference on Computational Science and Engineering (CSE), 2021, pp. 131-136.
[10] H. Wang, X. Guo, and S. Zhang, "Speech Emotion Recognition Based on Deep Neural Network with Hierarchical Feature Extraction," in Proceedings of the 2021 IEEE 3rd International Conference on Communication Engineering and Technology (ICCET), 2021, pp. 69-74.