
Journal of Emerging Technologies and Innovative Research ISSN: 2349-5162

EMOTION DETECTION BASED ON SPEECH USING MACHINE LEARNING

G. Saraswathi*1, M. Lakshmi Sai*2, D. V. S. Praneeth*3, G. Yamini*4, M. Uday Kumar*5

Department of Information Technology, Sasi Institute of Technology & Engineering

[email protected]*1, [email protected]*2, [email protected]*3, [email protected]*4, [email protected]*5

Assistant Professor*1, Student*(2,3,4,5)

Abstract: Human-system interaction is used frequently these days, and speech is a common medium for interacting with people. It offers a channel to convey one's thoughts or emotional condition to others. However, a major shortcoming in interactions between people and technology is the failure of machines to recognize emotions in speech. Speech Emotion Recognition (SER) analyses the speaker's emotions from the speech signal. Any machine with a small amount of processing capacity can, when necessary, identify common feelings including joy, sorrow, rage, fear, disbelief, and neutrality. The proposed system takes input from users and presents the emotional state, using datasets such as RAVDESS, TESS, and EMO-DB for training as well as for testing the efficiency of the system.

Keywords: Speech Emotion Recognition, Data Set, Machine Learning, Neural Network.

I. Introduction:

Speech is the primary medium of interaction in the modern period. When two or more people communicate with each other, they can quickly tell how the other is feeling by looking at each other's faces or listening to their words. Human-machine interaction is widely studied in research nowadays, and the inability to interpret human emotions from speech is the biggest barrier to human-machine communication.

The main objective of emotion recognition is to determine how individuals perceive or respond to given conditions. It is also helpful for understanding the speaker's mood. Emotion recognition has a huge number of applications, particularly in call centre applications, robotics

engineering, and medical science.

Hence, it is very important to create an algorithm that is capable of precisely recognizing the emotions expressed in speech. Various parameters for detecting a speaker's emotional state through speech have been discovered in this field, and happy, sad, angry, disgusted, fearful, and neutral emotions can all be distinguished from one another.

Over the years, similar software has been developed, each using a different set of features. Artificial neural networks are frequently applied to voice signals. According to our research, using Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) provides an edge during the recognition phase. The ability of machines to identify people's emotional state is becoming increasingly relevant and enhances interactions between people and machines.

II. Related work:

Kumar et al. [1] explore the use of audio communication to determine emotions in an intelligent assistant system. The system was designed to regulate electrical devices for alert actions and uses a multilayer neural network for voice emotion identification. It aims to help people in a variety of contexts, including homes, hospitals, and remote areas. Seven emotions are detectable by the suggested system: anxiety, surprise, neutral, sorrow, happiness, rage, and love. The system's development, training and testing on comparison datasets, and evaluation according to metrics like time, accuracy, and error rate are all covered in this work. When compared to current technology, the suggested approach exhibits positive results.

Płaza et al. [2] offer an innovative approach for call centre and contact centre emotion recognition. By accurately identifying the emotional states of clients as well as agents during talks, the technique aims to improve the efficacy of virtual assistants. The suggested strategy can identify emotions in speech and text channels, providing opportunities for developing behavioural profiles that enhance client satisfaction and agent productivity. The research explores the application of automated transcription of recordings to assess voice-channel emotions. The suggested approach is appropriate for real-world use in contact centre and call centre systems, as the experimental findings show that emotional states can be successfully classified.

Yan et al. [3] suggest applying the AA-CBGRU network model for identifying

feelings in audio. The model utilizes a bidirectional gated recurrent unit (BGRU) network with an attention layer to gather deep time-series information, spectrogram features, and spatial data using a convolutional neural network with residual blocks. Using the IEMOCAP sentiment corpus, the model's accuracy rises.

Zhang et al. [4] explain the increasing interest in multi-modal emotion detection and the important role of recognizing feelings in human interaction. The authors recommend an approach to increase the accuracy of emotion identification utilizing text, video, and audio modalities. After preprocessing, they extract deep emotional features from the data and integrate them at the feature level. The model's findings on the IEMOCAP dataset are discussed, exhibiting increased accuracy over speech emotion identification alone.

Han et al. [5] recommend using a deep residual shrinkage network with a bidirectional gated recurrent unit (DRSN-Bi-GRU) to identify speech emotions. The approach makes use of the Mel-spectrogram, a speech attribute that carries information in both the time and frequency domains. A convolution network, residual shrinkage network, bi-directional recurrent unit, and fully-connected network are all included in the DRSN-Bi-GRU model. To improve feature learning and screen out distracting information, a self-attention mechanism is used. The approach beats existing models with accuracy rates of 86.03%, 86.07%, and 70.57% on three emotional datasets (CASIA, IEMOCAP, and MELD).

Wani et al. [6] give an in-depth examination of systems for Speech Emotion Recognition (SER). The design components and methodologies of SER systems, including databases, preprocessing, feature extraction, and classification techniques, are addressed. Along with highlighting the research gaps in the subject, the study additionally tackles the difficulties encountered in SER.

Yang et al. [7] explain the evolution of research on spoken emotion recognition based on the discrete emotion model. The paper provides an overview of speech emotion feature parameters and frequently utilized emotion databases. It offers a description of the emotion recognition and feature extraction methods employed in current Chinese research, and also discusses the challenges in recognizing emotions in speech and the directions in which future research and growth might proceed.

Barhoumi et al. [8] demonstrate a real-time voice emotion identification system developed via data augmentation

and deep learning methods. The goal is to identify emotions from the tone of the voice alone. The system makes use of three separate datasets and a variety of feature extraction techniques, including chroma, Root Mean Square (RMS) value, Mel spectrograms, Zero Crossing Rate (ZCR), and Mel Frequency Cepstral Coefficients (MFCC). Emotion recognition is accomplished by three distinct deep learning models: a Convolutional Neural Network (CNN), a Multi-Layer Perceptron (MLP), and a hybrid model incorporating a CNN with Bidirectional Long Short-Term Memory (Bi-LSTM). In evaluations of the suggested system's efficacy in real-time scenarios, the CNN + Bi-LSTM model appears to be the strongest.

Uthayashangar [9] draws attention to speech emotion recognition (SER) and its potential applications in a number of fields. The study uses Mel Frequency Cepstral Coefficients (MFCCs) to extract attributes from voice data and convolutional neural networks (CNNs) to classify emotions. Preprocessing speech data, feature selection, and background noise reduction are all part of the recommended approach. Using data augmentation techniques increases the model's dependability. The CNN algorithm is utilized for classification because of its adaptability and history of success with classification problems. When compared to earlier methods, the findings demonstrate that the suggested method achieves great precision in speech emotion identification.
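Both [8] and [9] report that data augmentation makes their models more dependable. As a minimal sketch, and not the code of either paper, the snippet below shows common waveform-level augmentations with librosa and NumPy; the file name and parameter values are illustrative assumptions.

```python
# Common audio augmentations (illustrative parameters, hypothetical file path).
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    # Inject white Gaussian noise into the waveform.
    return y + noise_factor * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    # Raise the pitch by n_steps semitones without changing duration.
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

def stretch_time(y, rate=1.1):
    # Speed up (rate > 1) or slow down (rate < 1) without changing pitch.
    return librosa.effects.time_stretch(y=y, rate=rate)

y, sr = librosa.load("speech_sample.wav", sr=22050)  # hypothetical file
augmented = [y, add_noise(y), shift_pitch(y, sr), stretch_time(y)]
```

Each augmented copy keeps the original emotion label, so the training set grows without any new recordings.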
Olatinwo et al. [10] propose using the Internet of Things to develop a WBAN (Wireless Body Area Network) system that is emotion-aware and capable of grasping patients' expressed emotions. The technology uses a combination of machine learning algorithms and IoT sensors to assess and forecast patients' moods based on their speech. The writers look at several feature extraction methods, normalization techniques, and deep learning and machine learning algorithms. Additionally, they create a regularized CNN model and a hybrid deep learning model to lower computational complexity and boost prediction accuracy. The accuracy of the suggested models is around 98% when compared to an existing model.

Iliev [11] investigates the application of deep learning techniques in artificial intelligence to determine emotions through speech. The chapter discusses how essential emotions are in human communication and how difficult it may be to separate emotions clearly from the signals used. It examines and compares the performance of several deep learning and machine learning classifiers used in emotion detection. The limitations of these approaches are also


covered, as well as how important emotions are for interactions, decision-making, and overall well-being.

Pucci et al. [12] present a chatbot virtual assistant system that makes use of machine learning neural networks to identify emotions, with the aim of simplifying contact tracing during the COVID-19 pandemic. Utilizing a transfer learning strategy, the system was trained on a labelled Italian-language dataset and acquired a 92% testing accuracy. The importance of recognizing emotions in contact tracing conversations is highlighted, since it may be used to identify stress, psychiatric disorders, and possible frauds. The study of emotions in contact tracing conversations, the new dataset offered by Blu Pantheon, and the use of transfer learning are among the novelties presented in this work.

Saini et al. [13] investigate the application of machine learning methods to voice emotion recognition. The authors examine two separate datasets containing samples of text and speech data to evaluate the efficiency of three different machine learning techniques: multinomial Naive Bayes (MNB), logistic regression (LR), and linear support vector machine (LSVM). The results indicate that LSVM performs more effectively than the other two methods. The study highlights how important emotion detection is to improving decision-making across a number of industries.

Koppula et al. [14] describe a unique hybrid firefly-based recurrent neural network approach to speech emotion recognition (SER). Preprocessing and feature analysis algorithms have been incorporated into the system to allow the classification of human emotions from speech input. When compared to other methods currently in use, the model demonstrated efficacy, resilience, and high accuracy. The suggested approach could find applications in a number of fields, including security and medicine.

Tambat et al. [15] focus on building a speech-based emotion prediction system using CNN classifiers. The researchers use Mel-frequency cepstral coefficients (MFCC) as the extracted spectral characteristic and evaluate the effectiveness of their method using the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The results show that using CNN classifiers yields good performance in recognizing emotions.

Jayanthi et al. [16] suggest an extensive framework that combines speech and still photographs of faces to identify emotions. Compared with individual recognition methodologies, the framework


demonstrates outstanding precision because of its use of deep classifier fusion. The purpose of this tool is to identify a person's mental condition and offer auto-suggestions for enhancing their mental health.

Cai et al. [17] develop a multimodal emotion detection model that increases the performance of the emotion recognition system through the integration of audio and text data. The model acquires textual and audio information using long short-term memory (LSTM) networks and convolutional neural networks (CNN), respectively. The fusion attributes are subsequently generated and categorized by a deep neural network. The suggested approach performs more effectively than single-modal models, with greater precision in text and speech emotion identification, according to evaluations undertaken on the IEMOCAP database.

Abbaschian et al. [18] examine and contrast deep learning methods for speech emotion recognition (SER) with traditional machine learning methods. The goal of the study is to offer a general description of the problem of discrete speech emotion identification by looking at neural network techniques, datasets, and current methodologies. The importance of SER in human-computer interaction is addressed along with its numerous applications in online courses, online therapy sessions, smart speakers, virtual assistants, and automobile safety systems. The different kinds of training datasets used for SER and the traditional methods employed before the development of deep learning are also addressed, and the paper closes with a summary of possible future SER research directions.

Liu et al. [19] suggest an algorithm for recognizing speech emotions in small-sample circumstances. The model tackles irrelevant features and unstable data in emotion recognition. To minimize the impact of sample imbalance, it provides the Selective Interpolation Synthetic Minority Over-Sampling Technique (SISMOTE), a data imbalance processing strategy. Additionally, redundant features are eliminated using a feature selection technique based on variance analysis and the gradient boosting decision tree (GBDT). The suggested technique improves on state-of-the-art approaches in terms of recognition accuracy, based on experimental findings on three databases.
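SISMOTE is specific to [19]; as a hedged stand-in for the oversampling idea it refines, the sketch below applies plain SMOTE from the imbalanced-learn package to a toy imbalanced feature matrix.

```python
# Oversampling a minority emotion class with SMOTE (toy data, not SISMOTE).
from collections import Counter
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
# Toy features: 100 "neutral" samples versus only 15 "angry" samples.
X = np.vstack([rng.normal(0.0, 1.0, (100, 20)),
               rng.normal(1.0, 1.0, (15, 20))])
y = np.array([0] * 100 + [1] * 15)

# SMOTE interpolates new minority-class points between existing neighbours,
# balancing the classes before a classifier is trained.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # {0: 100, 1: 15} -> {0: 100, 1: 100}
```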
Issa et al. [20] offer an innovative framework for the identification of emotions in speech. Spectral contrast features, chromagram, Mel-scale spectrogram, Tonnetz representation, and Mel-frequency cepstral coefficients are just


a few of the features that the authors extract from sound files. To identify emotions, a one-dimensional Convolutional Neural Network (CNN) uses these properties as inputs. The suggested approach exceeds current frameworks and achieves excellent classification accuracy, providing a new standard for emotion identification.

III. Methodology and approaches:

Methodology:

The suggested system relies on emotion detection and uses a specified dataset for system training. Following training, several preprocessing methods are applied, and feature extraction is then carried out. The proposed method uses this dataset to classify emotions into different categories. CNN and Support Vector Machine (SVM) are the two classification methods used by the system, and the training data is utilized for classification.

1. Data Collection: To train the emotion recognition system, the researchers gather audio recordings from 24 people in the RAVDESS speech dataset.

2. Feature Extraction: Mel-frequency cepstral coefficients (MFCC), chromagram, Mel-scaled spectrogram, spectral contrast, and tonal centroid (Tonnetz) are some of the acoustic characteristics obtained from the speech data. These features capture multiple speech signal characteristics that are crucial for emotion recognition.

3. Deep Neural Network Model: A classification model for emotion recognition; the system also uses other models such as SVM and a Convolutional Neural Network (CNN).

4. Training and Evaluation: The researchers trained the speech DNN using the resulting features and the collected audio recordings. A minimal sketch of steps 1 and 2 appears after this list.
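The sketch below is not the authors' code. It assumes the standard RAVDESS file-naming convention, in which the third dash-separated field of a file name encodes the emotion, and it extracts the five acoustic features listed above with librosa, averaging each over time; the helper names are illustrative.

```python
# Steps 1-2: label parsing and feature extraction (illustrative helpers).
import os
import numpy as np
import librosa

# RAVDESS encodes the emotion in the third dash-separated filename field.
EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def label_from_filename(path):
    return EMOTIONS[os.path.basename(path).split("-")[2]]

def extract_features(path):
    """Return one vector: MFCC, chroma, Mel spectrogram, contrast, Tonnetz."""
    y, sr = librosa.load(path, sr=None)
    stft = np.abs(librosa.stft(y))
    feats = [
        np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1),
        np.mean(librosa.feature.chroma_stft(S=stft, sr=sr), axis=1),
        np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1),
        np.mean(librosa.feature.spectral_contrast(S=stft, sr=sr), axis=1),
        np.mean(librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
                axis=1),
    ]
    return np.concatenate(feats)  # 40 + 12 + 128 + 7 + 6 = 193 values
```

Averaging over time yields a fixed-length 193-value vector per recording, which suits classifiers such as SVM that expect fixed-size inputs.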

Approaches:

For each surveyed reference [1]-[20], the methods, the parameters or resources involved, and the reported challenges are summarized below.

1. Methods: MFCC, LPCC, DELTA, FFT, PLP. Parameters: Accuracy, error rate, time. Challenges: none listed.

2. Methods: Data balancing techniques, vectorization methods, word embedding techniques, dedicated transcription method, speech signal descriptors. Parameters: none listed. Challenges: Lack of a dedicated emotion recognition method; unavailability of methods considering the audio signal parameters.

3. Methods: AA-CBGRU network model; data input, spatial feature collection, time-series feature collection, and classification. Parameters: none listed. Challenges: Gradient disappearance; poor learning ability for time-series information.

4. Methods: Deep learning techniques for multi-modal emotion recognition. Parameters: IEMOCAP database, which contains audiovisual data from 10 actors. Challenges: Complexity and diversity of human emotion expressions; achieving robust and accurate emotion detection.

5. Methods: Mel-spectrogram feature extraction, deep residual shrinkage network (DRSN), bidirectional gated recurrent unit (Bi-GRU). Parameters: Accuracy. Challenges: Finding robust and universal emotional features of speech; constructing models with high recognition accuracy.

6. Methods: Interdisciplinary methods; knowledge from fields such as speech emotion recognition, applied psychology, and human-computer interfaces. Parameters: Significance of acoustic features; potential integration of linguistic, facial, and speech information in emotion recognition. Challenges: Availability of labelled data for training supervised learning systems; need to improve the accuracy of feature extraction techniques.

7. Methods: Selecting the feature subset from existing features; using neural networks to extract new features. Parameters: Acoustic emotional features, semantic emotional features. Challenges: Lack of acknowledged speech emotion features; difficulty of converting qualitative emotional states into quantitative spatial co-ordinates.

8. Methods: Data augmentation techniques; feature extraction algorithms such as MFCC, ZCR, Mel spectrograms, RMS, and chroma. Parameters: none listed. Challenges: Highlights the complexity of emotion recognition; need for effective feature extraction and classification methods.

9. Methods: Voice data preprocessing techniques, data augmentation methods, CNN algorithm as the classification approach. Parameters: Kaggle open-source RAVDESS dataset for training and testing. Challenges: Lack of data and poor model accuracy in SER research; limited availability of labelled whisper voice data.

10. Methods: Machine learning and deep learning algorithms; different optimization strategies, regularization techniques, and normalization techniques are explored. Parameters: Accuracy, precision, recall, F1 score, confusion matrix. Challenges: Low prediction accuracy; high computational complexity; delay in real-time prediction.

11. Methods: Image preprocessing, face detection, facial landmark detection, feature vector creation. Parameters: Accuracy, scores. Challenges: Human error in identifying emotions solely through speech signals.

12. Methods: Neural networks. Parameters: Labelled Italian-language dataset (EMOVO corpus). Challenges: Unavailability of labelled emotions in the dataset provided by Blu Pantheon.

13. Methods: Multinomial Naive Bayes (MNB), logistic regression (LR), linear support vector machine (LSVM), artificial neural networks (ANN), Gaussian mixture models, k-nearest neighbour, hidden Markov models (HMM). Parameters: Accuracy, precision, recall, F1-score. Challenges: Analysis of expressions in long-distance communication; identifying the most effective method for speech emotion recognition.

14. Methods: Hybrid firefly-based recurrent neural network (FbRNSR). Parameters: Features extracted from the speech signal; firefly fitness for optimization. Challenges: Filtering noise content; extracting emotional features; complexity and cost associated with incorporating digital filters.

15. Methods: MFCC as a spectral characteristic for emotion identification; feature selection. Parameters: Librosa module in Python; samples from the RAVDESS database. Challenges: Difficulty in differentiating between various emotions in spoken words; limitations of the system in handling multiple speakers simultaneously.

16. Methods: Convolutional neural network (CNN). Parameters: Spectral features, pitch features, energy features, intensity, rate of spoken words. Challenges: Difficulty in annotating audio recordings with associated emotions; collecting unbiased audio data.

17. Methods: Combination of CNN and LSTM; fusion of features; CNN-Bi-LSTM-Attention (CBLA) model; L2 regularization. Parameters: none listed. Challenges: Limited emotion information in a single mode; traditional feature extraction methods; modelling acoustic and textual features.

18. Methods: Autoencoders, convolutional neural networks (CNNs), generative adversarial networks (GANs), long short-term memory (LSTM) networks. Parameters: none listed. Challenges: Complexity of implementing a deterministic system based on pleasure, arousal, and dominance measures, and the variability of prosody, voice quality, and spectral features across different emotions and speakers.

19. Methods: Selective Interpolation Synthetic Minority Over-Sampling Technique (SISMOTE), variance analysis, gradient boosting decision tree (GBDT). Parameters: none listed. Challenges: Data imbalance is a common problem in emotional corpora; excessive high-level emotional feature sets often have redundant features.

20. Methods: Convolutional neural networks. Parameters: none listed. Challenges: Uncertainty in choosing the right features; presence of background noise in audio recordings.

IV. Input and Output:

Input: The system takes the required datasets for training, applies techniques for data preprocessing and feature extraction, and also takes an audio sample as input from the user.

Output: The system takes the user's input and classifies it into the category of the identified emotion.
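As a hedged illustration of this input/output behaviour, the sketch below trains one of the two classifiers named in the methodology (an SVM) on features produced as in the earlier sketch, then classifies one user-supplied recording. The dataset path, hyperparameters, and the reuse of the hypothetical extract_features() and label_from_filename() helpers are assumptions, not the authors' exact setup.

```python
# End-to-end flow: dataset in, emotion label out (paths are hypothetical).
import glob
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Input side: build the training set from a dataset directory.
paths = glob.glob("ravdess/**/*.wav", recursive=True)
X = np.array([extract_features(p) for p in paths])
y = np.array([label_from_filename(p) for p in paths])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

clf = SVC(kernel="rbf", C=10.0)  # SVM, one of the two classifiers named above
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Output side: classify a user-supplied audio sample into an emotion category.
print("predicted emotion:", clf.predict([extract_features("user_input.wav")])[0])
```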

V. Findings and Trends:

Speech is an essential component for recognizing emotions. Emotion Recognition, Speech Emotion Recognition, Dataset and Accuracy, Deep Neural Networks, and Feature Comparison are some of the trends in emotion recognition.

Emotion Recognition: The primary objective of the study is to recognize emotions in speech. In recent years there has been a lot of interest in the field of emotion detection from speech signals. It is seen as crucial for a large number of applications, including human-machine interaction, psychological health, decision-making, medical science, robotics engineering, and contact centre applications.


Speech Emotion Recognition: Speech is a crucial element in understanding emotions. Some of the trends in emotion detection are emotion detection, speech emotion recognition, dataset and accuracy, deep neural networks, literature survey, and feature comparison.

Dataset and Accuracy: The RAVDESS speech dataset, which consists of 1440 audio recordings from 24 different people, was used by the researchers to train the voice DNN. When compared to other algorithms like KNN, LDA, and SMO, the accuracy rate for emotion identification using the DNN was observed to be 96%.

Deep Neural Networks: It was found that deep neural networks, when used for voice emotion detection alone, had a significant advantage in accurately identifying and classifying emotions from speech data, demonstrating their ability to learn intricate patterns and representations in speech signals.
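As a minimal Keras sketch, not the authors' exact network, a dense DNN over the 193-value feature vectors from the earlier sketch could look as follows; the layer sizes and hyperparameters are illustrative assumptions.

```python
# A small dense network for 8-way emotion classification (illustrative sizes).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(193,)),             # averaged acoustic features
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(8, activation="softmax"),  # 8 RAVDESS emotions
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train_ids, epochs=50, batch_size=32,
#           validation_data=(X_test, y_test_ids))  # labels as integer ids
```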
Feature Comparison: The comparison of MFCC, linear prediction cepstral coefficients (LPCC), and short-time log frequency power coefficients (LFPC), three spectral features utilized for emotion identification, is briefly discussed in the text. According to the findings, LFPC was regarded as a better feature for emotion categorization than traditional features.

VI. Future Scope:

These are some potential paths for further study and advancement in the field of speech processing for emotion recognition. Future improvements in technology, increasing data accessibility, and changing user needs are expected to influence the field's future direction.

Advanced Machine Learning Methods: To enhance the accuracy and efficacy of emotion recognition from speech signals, future research might focus on developing more advanced machine learning methods, such as deep learning architectures (e.g., convolutional neural networks, recurrent neural networks) and reinforcement learning.

Multimodal Emotion Detection: By integrating speech processing with other modalities, such as facial expressions, gestures, and physiological indications, emotion detection techniques can become more precise and robust. Future research may examine the combination of many modalities to more completely record and examine emotional data.

Emotion Recognition in Real-Time and Online: Systems that can process


and analyse speech signals in real time are growing more and more important. Future research may concentrate on creating effective architectures and algorithms that can offer real-time emotion identification capabilities, enabling applications in fields like affective computing, virtual assistants, and human-computer interaction.

Cross-Cultural and Multilingual Emotion Recognition: Various cultures and languages have different ways of expressing emotions. Future studies might work on creating emotion recognition algorithms that are accurate in identifying and interpreting emotions in cross-cultural and multilingual settings.

VII. Conclusion:

Deep learning algorithms can produce fruitful outcomes. We successfully described a model for emotion recognition, and it scored 96% in testing. It should be noted that perceiving emotions is subjective, and different listeners may assign different emotional values to the same recording. The algorithm occasionally generates inconsistent results when trained on human-rated emotions for the same reason. The system was trained using datasets such as RAVDESS, in which, above all, speaker accent may lead to unexpected results. Nevertheless, it seeks to convey the speaker's emotional state through speech as accurately as possible.

VIII. References:

[1] Kumar, Sandeep, Mohd Anul Haq, Arpit Jain, C. Andy Jason, Nageswara Rao Moparthi, Nitin Mittal, and Zamil S. Alzamil. "Multilayer Neural Network Based Speech Emotion Recognition for Smart Assistance." Computers, Materials & Continua 75, no. 1 (2023).

[2] Płaza, Mirosław, Robert Kazała, Zbigniew Koruba, Marcin Kozłowski, Małgorzata Lucińska, Kamil Sitek, and Jarosław Spyrka. "Emotion Recognition Method for Call/Contact Centre Systems." Applied Sciences 12, no. 21 (2022): 10951.

[3] Yan, Yu, and Xizhong Shen. "Research on Speech Emotion Recognition Based on AA-CBGRU Network." Electronics 11, no. 9 (2022): 1409.

[4] Zhang, Xue, Ming-Jiang Wang, and Xing-Da Guo. "Multi-Modal Emotion Recognition Based on Deep Learning in Speech, Video and Text." In 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP), pp. 328-333. IEEE, 2020.


[5] Han, Tian, Zhu Zhang, Mingyuan Ren, Changchun Dong, Xiaolin Jiang, and Quansheng Zhuang. "Speech Emotion Recognition Based on Deep Residual Shrinkage Network." Electronics 12, no. 11 (2023): 2512.

[6] Wani, Taiba Majid, Teddy Surya Gunawan, Syed Asif Ahmad Qadri, Mira Kartiwi, and Eliathamby Ambikairajah. "A Comprehensive Review of Speech Emotion Recognition Systems." IEEE Access 9 (2021): 47795-47814.

[7] Yang, Chunfeng, Jiajia Lu, Qiang Wu, and Huiyu Chen. "Research Progress of Speech Emotion Recognition Based on Discrete Emotion Model." In Journal of Physics: Conference Series, vol. 2010, no. 1, p. 012110. IOP Publishing, 2021.

[8] Barhoumi, Chawki, and Yassine Ben Ayed. "Real-Time Speech Emotion Recognition Using Deep Learning and Data Augmentation." (2023).

[9] Uthayashangar, S. "Speech Emotion Recognition Using Machine Learning." Journal of Coastal Life Medicine 11 (2023): 1564-1570.

[10] Olatinwo, Damilola D., Adnan Abu-Mahfouz, Gerhard Hancke, and Hermanus Myburgh. "IoT-Enabled WBAN and Machine Learning for Speech Emotion Recognition in Patients." Sensors 23, no. 6 (2023): 2948.

[11] Iliev, Alexander I. "Perspective Chapter: Emotion Detection Using Speech Analysis and Deep Learning." (2023).

[12] Pucci, Francesco, Pasquale Fedele, and Giovanna Maria Dimitri. "Speech Emotion Recognition with Artificial Intelligence for Contact Tracing in the COVID-19 Pandemic." Cognitive Computation and Systems 5, no. 1 (2023): 71-85.

[13] Saini, Anu, Amit Ramesh Khaparde, Sunita Kumari, Salim Shamsher, Jeevanandam Joteeswaran, and Seifedine Kadry. "An Investigation of Machine Learning Techniques in Speech Emotion Recognition." Indonesian Journal of Electrical Engineering and Computer Science 29, no. 2 (2023): 875-882.

[14] Koppula, Neeraja, Koppula Srinivas Rao, Shaik Abdul Nabi, and Allam Balaram. "A Novel Optimized Recurrent Network-Based Automatic System for Speech Emotion Identification." Wireless Personal Communications 128, no. 3 (2023): 2217-2243.

[15] Tambat, Aditi Manoj, Ramkumar Solanki, and Pawan R. Bhaladhare. "Sentiment Analysis-Emotion Recognition." Int. J. of Aquatic Science 14, no. 1 (2023): 381-390.

[16] Jayanthi, K., and S. Mohan. "An Integrated Framework for Emotion Recognition Using Speech and Static Images with Deep Classifier Fusion Approach." International Journal of Information Technology 14, no. 7 (2022): 3401-3411.


[17] Cai, Linqin, Yaxin Hu, Jiangong Dong, and Sitong Zhou. "Audio-Textual Emotion Recognition Based on Improved Neural Networks." Mathematical Problems in Engineering 2019 (2019): 1-9.

[18] Abbaschian, Babak Joze, Daniel Sierra-Sosa, and Adel Elmaghraby. "Deep Learning Techniques for Speech Emotion Recognition, from Databases to Models." Sensors 21, no. 4 (2021): 1249.

[19] Liu, Zhen-Tao, Bao-Han Wu, Dan-Yun Li, Peng Xiao, and Jun-Wei Mao. "Speech Emotion Recognition Based on Selective Interpolation Synthetic Minority Over-Sampling Technique in Small Sample Environment." Sensors 20, no. 8 (2020): 2297.

[20] Issa, Dias, M. Fatih Demirci, and Adnan Yazici. "Speech Emotion Recognition with Deep Convolutional Neural Networks." Biomedical Signal Processing and Control 59 (2020): 101894.
