Exploring The Effectiveness of Advanced Machine Learning Models in Speech Emotion Recognition
Abstract— The importance of recognizing emotion from voice stems from the basic human need to understand and communicate emotional states, which is vital in enhancing security, health care, and related fields. This study compares several advanced machine learning models to assess their effectiveness in recognizing emotions from speech, using the widely accepted RAVDESS, i.e. the Ryerson Audio-Visual Database of Emotional Speech and Song. Our research focuses on deep models, Convolutional Neural Networks (CNNs) and Long Short-Term Memory networks (LSTMs), versus conventional machine learning algorithms such as Support Vector Machines (SVMs), Random Forests (RFs), and Gradient Boosting Machines (GBMs), with careful preprocessing and feature extraction using Mel-Frequency Cepstral Coefficients (MFCCs). The research concludes that LSTM performs best, at 91%, among the implemented models. In future, voice-based emotion recognition can support the diagnosis and ongoing monitoring of mental health conditions such as depression, anxiety, and stress by detecting emotional distress or mood changes.

Keywords— Machine Learning, emotion detection, voice detection, CNN, LSTM, SVM, GBM, RF, MFCC.

I. INTRODUCTION

Human speech includes numerous features that the listener examines to understand the complicated information supplied by the speaker. Inadvertently, the speaker conveys tone, intensity, tempo, and other auditory properties, which help to capture both the subtext or meaning and the precise words. Emotion detection has many applications in medical treatment, security, forensic sciences, and other fields. Models such as LSTM perform computations over a sequence of timesteps: numeric features are fed into the neural network, which outputs a logit vector. An LSTM decoder can be built as an attention-based machine trained on an encoder's learnt representation to produce an output probability for the following character sequence. When MFCCs are treated as time-series information, LSTMs or their more sophisticated variants are used to address speech emotion recognition as a classification problem. CNNs either work with MFCCs in a single dimension or learn to recognize Mel spectrograms using 2D filters.
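As a rough illustration of this front end (not taken from the paper; the parameter values and file name below are assumptions), the following sketch extracts MFCCs with librosa and shapes them either as a frame sequence for an LSTM or as a single-channel 2D array for a CNN:

# Minimal sketch (assumed parameters): an MFCC front end shared by LSTM- and CNN-style models.
import librosa
import numpy as np

def mfcc_features(path, sr=22050, n_mfcc=40):
    """Load an audio file and return MFCCs shaped for sequence and image models."""
    y, sr = librosa.load(path, sr=sr)                         # mono waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, n_frames)
    seq = mfcc.T                                              # (n_frames, n_mfcc) for an LSTM
    img = mfcc[np.newaxis, :, :, np.newaxis]                  # (1, n_mfcc, n_frames, 1) for a 2D CNN
    return seq, img

# Example with a hypothetical file name:
# seq, img = mfcc_features("03-01-05-01-01-01-01.wav")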
SER, speech emotion recognition, has two steps: extracting features and classifying them. Speech-processing researchers have developed several kinds of features, such as source-based excitation features, prosodic characteristics, vocal tract factors, and mixed features. In the second step, linear and nonlinear algorithms are used to sort the features into groups. Bayesian networks (BN), the Maximum Likelihood Principle (MLP), and Support Vector Machines (SVM) are the linear models most often used to recognize emotions. Speech is not usually thought of as being stationary; since this is the case, nonlinear models should also do well in SER, and several different nonlinear classification methods can be applied. These are frequently employed to put data into groups based on basic-level traits.
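As a hedged sketch of this second, classification step (not the paper's exact setup; the feature matrix, labels, and split ratio below are placeholder assumptions), linear and nonlinear classifiers of the families named above can be compared with scikit-learn as follows:

# Illustrative comparison of linear and nonlinear classifiers on utterance-level features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Placeholder data standing in for (n_samples, n_features) speech features and emotion labels.
rng = np.random.default_rng(0)
X = rng.random((200, 40))
y = rng.integers(0, 8, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "SVM (linear kernel)": SVC(kernel="linear"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)                                 # second step: classify the features
    print(name, "accuracy:", model.score(X_te, y_te))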
Much of the time, energy-based features such as Perceptual Linear Prediction cepstrum coefficients (PLP), Mel-Frequency Cepstrum Coefficients (MFCC), Linear Predictor Coefficients (LPC), and Mel Energy-spectrum Dynamic Coefficients (MEDC) are used to identify emotions in speech accurately. Deep learning techniques for SER have multiple advantages over traditional methods: they can find complex structures without requiring manual feature extraction and tuning, they tend to extract low-level features directly from raw data, and they can work with data that has yet to be labelled. Deep Neural Networks (DNNs) with convolutional layers (CNNs) are good at handling images and videos, while speech-based classification tasks such as natural language processing (NLP) and speech emotion recognition (SER) benefit from recurrent designs, such as recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM). In short, this study focuses on how efficiently machine learning and deep learning algorithms detect emotions.

II. LITERATURE WORK

Speech emotion recognition enhances human-machine interaction through emotional classification [1]. Fusing spatial and temporal feature representations for speech emotion recognition [2] achieves higher accuracy on the RAVDESS and IEMOCAP datasets and outperforms state-of-the-art models. Emotion recognition based on speech and audio features has been built with MFCC and CNN+LSTM algorithms [3]; anger and neutral emotions performed best, yielding an accuracy of 61.07%. Deep learning techniques are critical solutions for SER, a method for extracting emotions from human speech [4]. One analysis used the RAVDESS dataset and achieved an accuracy of 80.64% with a CNN-LSTM [5]. Hybrid MFCCT features with a CNN performed better than MFCC and time-domain features used separately [6].

An attention-based deep learning model has been proposed for speech emotion recognition; on the optimized dataset it obtained an experimental accuracy of 90% [7]. A bilingual Arabic-English speech emotion recognition system achieved high performance with low computing cost: speech emotion recognition using audio features reached an accuracy of 85%, and sarcasm detection scored 75% [8].
"Human speech emotion recognition using CNN[9]. The shown in Figure 1. The model involves various steps for the
model outperformed the other models and achieved an detection of 6 classes of emotions.
accuracy of 94.38%. A variety of audio and machine learning
algorithms are used for emotion recognition. A proposed
method for speech perception detection using masked sliding
windows[10]. A deep neural network-based classifier achieves
high accuracy with sentiment data sets. Emotion recognition
uses speech signals in the intelligence system[11]. Deep
learning techniques for feature extraction and model building.
This paper describes a set of sound structures means built on
Match Frequency Cepstral Coefficient (MFCC)[12], Wavelet
Packet Transformation (WPT), Linear Predictive Cepstral
Coefficient (LPCC), Zero Crossing Rate (ZCR), Spectrum
Center, Spectral Rolloff. Spectral Kurtosis[13], Root surface
square (RMS), pitch, jitter, and shimmer to improve a
particular feature[14]. This paper explains acoustic text
features in hidden space are used to select a perceptual class
with minimum generalized reconstruction error as an SER
result, which can be used as an indicator to decide whether the
class is neutral or not and thus can be applied to it other classes
of perception[15]. Voice is a powerful emotional state;
loudness and tone often betray underlying emotional states.
Advances in SER systems have been characterized by the
inherently language-driven nature of consumer engagement to
enhance user experience through responsive and sensitive
technology[16].
Early approaches to SER, as described in the literature,
included the development of unique classifiers based on
extraction methods from speech signals. These classifiers Fig. 1. Flow Chart of Proposed Work
were trained on tone, pitch, and strength to distinguish
between emotional states[17]. One study stated that linear A. Dataset Used
discriminant analysis (LDA) and support vector machine We used the RAVDESS dataset for this study because it is
(SVM) were used to detect four primary emotions: happy, sad, an open-source collection that scientists can use to find out
angry, and neutral. Deep learning, especially 2D how people feel when they talk. Research the Ryerson Audio-
Convolutional Neural Networks (CNNs), shows essential Visual Database of Emotional Speech and Songs, also known
progress in the field. CNNs have shown promise in classifying as RAVDESS, has 7356 recordings showing emotions. These
emotions, with a reported accuracy of around 70% when files have three types: full AV, video-only, and audio-only.
analyzing data sets[18]. Including CNNs highlights the shift There are also two voice channels, one for spoken text and one
towards architectures that can extract and learn the most for song. One character in each file plays one of the eight
suitable features for SER tasks with little domain knowledge. feelings below: neutral, happy, sad, angry, scared, shocked, or
A notable case in this area is the RAVDESS, which contains sickened..
linguistic content with various sensory properties. This data
set has contributed to developing SER systems that are more B. Data Visualization
nuanced and capable of understanding complex human Figure 2 shows the count of emotions in the dataset; it
emotions[19]. Research suggests that gender-based training describes the colour of each bar indicates a specific emotion,
can help develop more accurate SER models, emphasizing the while its height indicates the frequency of that emotion.
importance of individualized programs. One of the recent
studies proposed a less complex SER algorithm that showed
good performance using only Mel-frequency cepstral
coefficients (MFCC)[20-21]. Thus, the survey describes that
extensive research in advanced machine learning techniques
and deep learning techniques are used for speech emotion
detection (SER) [22-23]. However, real-world applications
often have environments with variable noise and sound, which
can reduce the performance of SER models that are not
explicitly designed to handle such situations; our proposed
model performs augmentation by noise pitching to check the
efficiency of the mode. Thus, the research is also efficient in
giving excellent analysis by comparing the outcomes of
multiple machine learning and deep learning algorithms.
III. PROPOSED WORK

The proposed model describes an efficient method of detecting emotions using machine learning algorithms, as shown in Figure 1. The model involves various steps for the detection of six classes of emotions.

Fig. 1. Flow Chart of Proposed Work

A. Dataset Used

We used the RAVDESS dataset for this study because it is an open-source collection that researchers can use to study how people express emotion when they talk. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) has 7356 recordings portraying emotions. The files come in three types: full audio-video, video-only, and audio-only. There are also two vocal channels, one for speech and one for song. Each file contains one actor expressing one of eight emotions: neutral, calm, happy, sad, angry, fearful, surprised, or disgusted.
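RAVDESS encodes the emotion in the third dash-separated field of each file name; a small sketch of how the audio files might be mapped to labels is shown below (the directory path and helper name are illustrative assumptions):

# Sketch: derive emotion labels from RAVDESS file names (third dash-separated field).
import os

# Emotion codes defined by the RAVDESS naming convention.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def load_labels(root="RAVDESS"):                       # assumed local directory
    """Return (file_path, emotion) pairs for every .wav file under root."""
    samples = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".wav"):
                code = name.split("-")[2]              # e.g. "03-01-05-..." -> "05" (angry)
                samples.append((os.path.join(dirpath, name), EMOTIONS[code]))
    return samples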
B. Data Visualization

Figure 2 shows the count of emotions in the dataset; the colour of each bar indicates a specific emotion, while its height indicates the frequency of that emotion.

Fig. 2. Count of Emotions in RAVDESS Dataset

The research analysed the emotions with the highest counts, such as anger, sadness, fear, happiness, and disgust; each visualization shows a waveform and a spectrogram with the fundamental frequency. Waveform: the figure on the left shows the visual magnitude of the acoustic signal; the x-axis indicates time, and the y-axis represents amplitude. Peaks in the waveform indicate where the sound is loudest (highest amplitude) and troughs where it is quietest (lowest amplitude). Spectrogram with fundamental frequency: the graph on the right is a spectrogram, a representation of the spectrum of frequencies in the signal as it varies with time. Here again, time is on the x-axis; the y-axis represents frequency (in Hz), and colour indicates the intensity of the signal at each frequency at any given time: the brighter the colour, the more energy there is. The cyan line traced through the centre of the spectrogram tracks how the dominant frequency of the signal evolves, the lowest frequency of the sound being perceived as the pitch of the tone. Figure 3 shows anger, Figure 4 disgust, Figure 5 fear, Figure 6 happiness, and Figure 7 sadness.
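A minimal sketch of how such waveform and spectrogram panels can be produced with librosa is given below; the file name and figure sizing are illustrative assumptions rather than the paper's plotting code:

# Sketch: plot a waveform and a spectrogram for one utterance.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("angry_sample.wav", sr=None)          # hypothetical file name

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

librosa.display.waveshow(y, sr=sr, ax=ax1)                 # amplitude against time
ax1.set(title="Waveform", xlabel="Time (s)", ylabel="Amplitude")

S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
img = librosa.display.specshow(S, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
ax2.set(title="Spectrogram")
fig.colorbar(img, ax=ax2, format="%+2.0f dB")              # brighter colour = more energy
plt.tight_layout()
plt.show()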
C. Pre-processing

In this research, we implemented augmentation as the pre-processing step for voice-based emotion recognition. Data augmentation is the process of artificially expanding the amount of data by manipulating existing data. Figure 8 shows the original voice, to which we applied some standard data augmentation methods used in voice emotion recognition.

Fig. 8. Original Voice

Noise Injection: Adding background noise to clean audio samples helps the model remain stable in real-world situations with background noise; the sample voice after noise injection is shown in Figure 9.
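A hedged sketch of the augmentations referred to here, noise injection and pitch shifting, is given below; the noise level and shift amount are illustrative values, not the paper's exact settings:

# Sketch: simple waveform-level augmentations for speech emotion data.
import librosa
import numpy as np

def add_noise(y, noise_factor=0.005):
    """Inject Gaussian noise scaled by noise_factor (assumed value)."""
    return y + noise_factor * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones without changing the duration."""
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

# y, sr = librosa.load("original_voice.wav", sr=None)      # hypothetical file name
# y_noisy = add_noise(y)
# y_pitched = shift_pitch(y, sr)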
[Comparative analysis bar chart of performance metrics for the implemented models, with scores ranging from 0.84 to 0.91.]

Fig. 13. (a) Angry, (b) Disgust, (c) Fear, (d) Happiness, (e) Sad
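The model implementations themselves are not reproduced in this extract. Purely as an assumed illustration of the kind of sequence model compared above, a compact Keras LSTM classifier over MFCC frame sequences could be defined as follows (layer sizes and training settings are placeholders, not the paper's configuration):

# Sketch (assumed architecture): a compact LSTM classifier over MFCC frame sequences.
from tensorflow.keras import layers, models

NUM_CLASSES = 8        # RAVDESS emotion classes
N_MFCC = 40            # MFCC coefficients per frame (assumed)

model = models.Sequential([
    layers.Input(shape=(None, N_MFCC)),          # variable-length MFCC sequence
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(64),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=50)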
The analysis of the proposed models is further illustrated by the ROC curve of each model. AUC values close to 1.00 indicate good classification. The analysis of each model is explained below:

LSTM (Long Short-Term Memory): The ROC curve of the LSTM model bends almost into the upper-left corner, indicating a high area under the curve (AUC) of 0.99. This means a high true positive rate and a low false positive rate across the range of decision thresholds.

CNN (Convolutional Neural Network): CNN has an AUC of 1.00, indicating that it discriminates well between classes at all thresholds.

SVM (Support Vector Machine): Like CNN, SVM also shows an AUC of 1.00, meaning that positive and negative classifications are separated almost perfectly.

GBM (Gradient Boosting Machine): The ROC curve for GBM hugs the upper-left corner, with an AUC of 1.00, indicating good performance.

RF (Random Forest): RF has a ROC curve with an AUC of 0.99, close to that of the other models, indicating that it also discriminates very well between classes.

The diagonal dashed line represents AUC = 0.5, i.e. random classification. A classifier appears strong if its ROC curve lies above this line and moves towards the upper-left corner; the closer the curve stays to the left and upper edges of the ROC space, the more accurate the model. Thus, we conclude that LSTM performed better on the RAVDESS dataset for emotion detection than the other models.
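A minimal sketch of how such one-vs-rest ROC curves and AUC values can be computed for a multi-class emotion classifier with scikit-learn is given below; the predicted-probability matrix and class names are assumed inputs:

# Sketch: one-vs-rest ROC curves and AUC values for a multi-class emotion classifier.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc

def plot_roc(y_true, y_score, class_names):
    """y_true: integer labels; y_score: (n_samples, n_classes) predicted probabilities."""
    classes = np.arange(len(class_names))
    y_bin = label_binarize(y_true, classes=classes)
    for c in classes:
        fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
        plt.plot(fpr, tpr, label=f"{class_names[c]} (AUC = {auc(fpr, tpr):.2f})")
    plt.plot([0, 1], [0, 1], "k--", label="chance (AUC = 0.5)")   # diagonal reference line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()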
The confusion matrices describe each model's per-class performance. The CNN model distinguishes visually distinct emotions such as 'happiness' and 'sadness' but struggles with more subtle distinctions, such as 'fear' and 'surprise', as shown in Figure 16-a. The SVM model, shown in Figure 16-b, is known to perform well in high-dimensional feature spaces but has clearly defined limitations. The LSTM model, shown in Figure 16-c, is adept at processing sequences and benefits from this temporal sensitivity. The Random Forest ensemble yields robust overall performance with fewer apparent weaknesses: in Figure 16-d, the diagonal cells show where the predicted values match the actual values, and darker cells indicate a higher number of correct predictions. The GBM model, shown in Figure 16-e, captures complex patterns but is at risk of overfitting and can lose generalization. Dark cells along the diagonal of each confusion matrix indicate correct predictions, whereas any pronounced off-diagonal pattern reflects systematic misclassification, such as 'calm' confused with 'neutral' or 'fearful' conflated with 'surprised'. A balanced model maintains accuracy across all emotions, keeps true positives high, and reduces false positives and false negatives, making it the most effective for this emotion recognition task. By this measure, the LSTM shows better results in its confusion matrix than the other machine learning and deep learning models.

Fig. 16. (a) CNN, (b) SVM, (c) LSTM, (d) RF, (e) GBM
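How confusion matrices of this kind can be computed and displayed is sketched below with scikit-learn; the predictions and class names are assumed inputs, and the colour convention (darker diagonal cells for more correct predictions) matches the description above:

# Sketch: build and display a confusion matrix for predicted emotion labels.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def show_confusion(y_true, y_pred, class_names):
    """Darker diagonal cells correspond to more correct predictions for that emotion."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(len(class_names))))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
    disp.plot(cmap="Blues", xticks_rotation=45)
    plt.show()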
V. CONCLUSION

In summary, the experimental analysis compares machine learning and deep learning models on the RAVDESS dataset. The CNN achieved an accuracy of 0.91 and an F1-score of 0.90, while the LSTM reached 92.3% accuracy and an F1-score of 0.91 on the audio data, highlighting the strength of these models in capturing the spatial and temporal aspects of the content. AUC values close to 1.00 indicate that these models discriminate well between emotion classes. However, such high AUC values should be interpreted cautiously to ensure that they reflect the models' genuine generalizability rather than overfitting. The ROC curves further support the power of the CNN and LSTM models, with the LSTM showing a high AUC of 0.99. Finally, the performance of the LSTM model on the RAVDESS dataset is outstanding, indicating that it is more suitable for emotion recognition tasks than the other models considered; future work can extend the proposed model to multiple datasets and aim for better accuracy than reported here.

REFERENCES

[1] S. Shreya, P. Likitha, G. Saicharan, and S. B. Choubey, “Speech Emotion Detection Through Live Calls,” International Journal for Research in Applied Science & Engineering Technology (IJRASET), vol. 11, no. 5, May 2023.
[2] R. Ullah et al., “Speech Emotion Recognition Using Convolution Neural Networks and Multi-Head Convolutional Transformer,” Sensors, vol. 23, no. 13, p. 6212, 2023.
[3] Q. Ouyang, “Speech emotion detection based on MFCC and CNN-LSTM architecture,” in Proceedings of the 3rd International Conference on Signal Processing and Machine Learning, Sichuan, China, 2023.
[4] G. Liu, S. Cai, and C. Wang, “Speech emotion recognition based on emotion perception,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 1, no. 1, pp. 1-10, 2023.
[5] M. C. Pentu Saheb, P. Sai Srujana, P. Lalitha Rani, and M. Siva Jyothi, “Speech Emotion Recognition,” International Journal of Food and Nutritional Sciences (IJFANS), vol. 11, no. 12, pp. 1920-1927, Dec. 2022.
[6] A. S. Alluhaidan, O. Saidani, R. Jahangir, and O. S. Neffati, “Speech Emotion Recognition through Hybrid Features and Convolutional Neural Network,” Applied Sciences, vol. 13, no. 8, p. 4750, 2023.
[7] J. Singh, L. B. Saheer, and O. Faust, “Speech Emotion Recognition Using Attention Model,” International Journal of Environmental Research and Public Health, vol. 20, no. 6, p. 5140, 2023.
[8] M. E. Seknedy and S. Fawzi, “Arabic English Speech Emotion Recognition System,” in Proceedings of the 20th Learning and Technology Conference (L&T), Jeddah, Saudi Arabia, pp. 167-170, 2023.
[9] Q. Q. Oh, C. K. Seow, M. Yusuff, S. Pranata, and Q. Cao, “The Impact of Face Mask and Emotion on Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER),” in Proceedings of the 8th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 2023.
[10] M. D. A. I. Majumder et al., “Human Speech Emotion Recognition Using CNN,” in Proceedings of the 25th International Conference on Computer and Information Technology (ICCIT), Cox's Bazar, Bangladesh, pp. 25-30, 2022.
[11] A. Sayar et al., “Emotion Recognition From Speech via the Use of Different Audio Features, Machine Learning and Deep Learning Algorithms,” Artificial Intelligence and Social Computing, vol. 72, no. 1, pp. 111-120, 2023.
[12] N. T. Pham, S. D. Nguyen, V. S. T. Nguyen, B. N. H. Pham, and D. N. M. Dang, “Speech emotion recognition using overlapping sliding window and Shapley additive explainable deep neural network,” Journal of Information and Telecommunication, vol. 7, no. 3, pp. 317-335, 2023.
[13] S. Harsha Vardhan, M. P. Rahul, P. Kavyasri, and A. Sraavani, “Emotion Recognition using Speech Signals,” International Journal of Advanced Research in Science, Communication and Technology (IJARSCT), vol. 2, no. 3, pp. 126-131, Nov. 2022.
[14] K. Bhangale and M. Kothandaraman, “Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network,” Electronics, vol. 12, no. 4, p. 839, 2023.
[15] J. Santoso, R. Sekiguchi, T. Yamada, K. Ishizuka, T. Hashimoto, and S. Makino, “Speech emotion recognition based on the reconstruction of acoustic and text features in latent space,” in Proceedings of the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, pp. 1678-1683, 2022.
[16] S. M. B. R., S. B., S. L., and K. K., “Speech Based Emotion Recognition System,” International Journal of Engineering Technology and Management Sciences, vol. 7, no. 1, pp. 332-337, 2023.
[17] J. Indra, R. K. Shankar, and R. D. Priya, “Speech Emotion Recognition Using Support Vector Machine and Linear Discriminant Analysis,” in Intelligent Systems Design and Applications (ISDA 2022), A. Abraham, S. Pllana, G. Casalino, K. Ma, and A. Bajaj, Eds., vol. 715, 2023.
[18] R. Aswani, A. Gawale, B. Dhawale, A. Shivade, N. Donde, and U. Tambe, “Speech Emotion Recognition,” International Journal of Creative Research Thoughts (IJCRT), vol. 9, no. 5, May 2021.
[19] S. M. M. Naidu, V. Shinde, V. Kulkarni, A. Wadekar, and Y. A. Chavan, “Speech-based Emotion Recognition Methodologies,” The Ciencia & Engenharia - Science & Engineering Journal, vol. 11, no. 1, pp. 798-807, 2023.
[20] R. Mittal, S. Vart, P. Shokeen, and M. Kumar, “Speech Emotion Recognition,” in 2022 2nd International Conference on Intelligent Technologies (CONIT), Hubli, India, pp. 1-6, 2022.
[21] S. Kumar, M. A. Haq, A. Jain, C. A. Jason, N. R. Moparthi, N. Mittal, and Z. S. Alzamil, “Multilayer Neural Network Based Speech Emotion Recognition for Smart Assistance,” Computers, Materials & Continua, vol. 75, no. 1, 2023.
[22] S. Kumar, S. Mathew, N. Anumula, and K. S. Chandra, “Portable camera-based assistive device for real-time text recognition on various products and speech using android for blind people,” in Innovations in Electronics and Communication Engineering: Proceedings of the 8th ICIECE 2019, Springer Singapore, pp. 437-448, 2020.
[23] R. Srilakshmi, V. Kamma, S. Choudhary, S. Kumar, and M. Kumar, “Building an Emotion Detection System in Python Using Multi-Layer Perceptrons for Speech Analysis,” in 2023 3rd International Conference on Technological Advancements in Computational Sciences (ICTACS), IEEE, pp. 139-143, 2023.