
Volume 7-Issue 1, January 2024

Paper: 48

Machine Learning and Deep Learning Techniques for Emotion Recognition from Human Speech using Acoustic Analysis

Anirban Sen, M. Tech, Department of CSE, MANIT, Bhopal, India ([email protected])
Dr. Meenu Chawla, Professor, Department of CSE, MANIT, Bhopal, India ([email protected])
Dr. Namita Tiwari, Assistant Professor, Department of CSE, MANIT, Bhopal, India ([email protected])

Abstract—This work investigates the use of machine learning and deep learning techniques for recognizing emotional tone in audio data, with the aim of identifying emotions in human speech. Emotion recognition has numerous practical applications, such as in healthcare, human-computer interaction, and customer service. The paper discusses various approaches to emotion recognition, including acoustic analysis and machine learning algorithms, and proposes the best classifier for the task. The RAVDESS and TESS datasets are used in this research; they consist of speech recordings labelled with seven emotions: anger, disgust, fear, happiness, sadness, pleasant surprise, and neutral. The feature engineering techniques are also discussed thoroughly. The models experimented with are SVM, Decision Tree, and Random Forest, which obtained accuracies of 84.1%, 88.9%, and 93.24%, respectively; the experimental results indicate the success of the proposed system. An MLP has also been used, which gave an accuracy of 93.34%. The paper concludes with a discussion of the potential applications of this technology and future research directions in the field of emotion recognition from audio.
Keywords—Emotion Recognition, MFCC, Mel Spectrogram, Machine Learning, Deep Learning

I. INTRODUCTION
Emotion recognition is a fundamental aspect of human communication, allowing us to understand the intentions and feelings of others and respond appropriately. The emotions exhibited by a person provide insights into their psychological state. The ability to recognize emotions in speech has numerous practical applications in fields such as healthcare for mental health diagnosis, human-computer interaction (for example, Alexa and Siri), customer service to gain insight about feedback, and security, to identify suspicious or abnormal behaviour based on changes in emotional states. Some research has been done on automatically detecting emotions from audio signals, but a few obstacles remain:

1) A benchmark dataset covering all types of emotions with proper labels needs to be chosen.
2) Multilingual datasets are scarce.
3) Feature selection is challenging.
4) A dependable classifier and machine learning algorithm suitable for the task at hand are necessary.
The features extracted from the audio have a significant impact on how well an emotion identification system performs. However, there is no single sound feature that performs optimally in all sound signal processing tasks, and features must be customized to meet the specific demands of the problem being addressed. For this reason, researchers increasingly use deep learning models, which extract features of their own [1]. In this paper, the Mel-spectrogram and MFCCs have been used as features because they provide a good representation of the spectral content of speech, which can be useful in identifying emotional states [4]. Two baseline databases, namely RAVDESS [13] and TESS [5], have been used. Experiments with three machine learning models were conducted, and Random Forest performed the best among them. Experiments with a Multi-Layer Perceptron (MLP) have also been conducted, and a slightly better result is observed. The paper is organized as follows: in Section 2, the pertinent studies on speech emotion are briefly summarized; in Section 3, the datasets that were used and an exploratory data analysis (EDA) are presented. The next section discusses the feature engineering aspects. In Section 5, the model used, the methodology,


the hyperparameters, and the number of iterations are discussed, followed by the results and a comparative analysis. The final section outlines the conclusion and potential areas for future research.
II. LITERATURE REVIEW
Over the last few decades, many machine learning and deep learning techniques have been used to detect emotions from audio. Ye et al. introduced a temporal emotional modelling approach called TIM-Net [14], which achieved an accuracy of 0.92 on the RAVDESS dataset. Chamishka et al. [2] present a new method for extracting features from conversational audio data called Bag-of-Audio-Words (BoAW) based feature embeddings. Their approach also includes a state-of-the-art emotion detection model that uses a Recurrent Neural Network (RNN) to capture both the context of the conversation and the emotional state of the individual parties involved. The model is designed to make real-time predictions of categorical emotions based on the extracted features; it achieved an accuracy of 60% on the IEMOCAP dataset.
Anbalagan et al. proposed an SVM algorithm with MFCC feature selection and achieved an accuracy of 89%. Jain et al. [4] applied PCA on the TESS dataset and achieved an accuracy of 97.86%. Hason Rudd et al. [11] proposed a feature extractor network based on CNNs and a classifier using a multi-layer perceptron (MLP), with the Mel-spectrogram, Tonnetz and spectrogram for acoustic feature extraction, and achieved their best accuracy of 92.79% on the Emo-DB dataset. Popova et al. [10] used the CNN VGG-16 as a classifier and achieved an accuracy of 71%.

III. DATASETS
In this paper, the RAVDESS and TESS datasets have been used, and both are in the English language.
• RAVDESS: It stands for the Ryerson Audio-Visual Database of Emotional Speech and Song and holds a collection of 7,356 files with a total size of 1.26 GB. It features recordings from 24 professional actors (12 female and 12 male) who spoke two sets of related phrases in a neutral North American accent. The labels of the dataset are neutral, happiness, sadness, anger, fear, surprise, and disgust. All of these recordings are available in three different modalities: audio-only (16-bit samples at 48 kHz), audio-video (720p video with 48 kHz audio), and video-only without sound. [8]
• TESS: It stands for the Toronto Emotional Speech Set, an open-source dataset created by researchers at the University of Toronto. It contains 2,800 audio files, each approximately 2-3 seconds long, with the labels neutral, happiness, sadness, anger, fear, surprise, and disgust. [9]
In the final model, these two datasets have been merged, yielding 7,688 records containing recordings of both male and female speakers. To ensure diversity in terms of speakers and context, two datasets were utilized, since a single dataset was insufficient for predicting real-time audio data accurately. The exploratory data analysis is shown in fig. 1, from which it can be seen that the merged dataset is almost label balanced.

Fig. 1. Exploratory Data Analysis


IV. FEATURE SELECTION


The Librosa package has been used to load the data. Librosa converts the audio to a single (mono) channel and resamples it to a uniform sampling rate of 22.05 kHz. It also normalizes all the discrete sample values between -1 and +1. For these reasons, the calculations become simpler and the computations faster. It was also noticed that there is an increase in accuracy when using the Librosa package instead of the SoundFile package.
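As an illustrative sketch of this loading step (the file path below is hypothetical), Librosa can be used as follows:

```python
import librosa

# Hypothetical path to one recording from the merged RAVDESS+TESS corpus.
audio_path = "data/ravdess/Actor_01/03-01-05-01-01-01-01.wav"

# librosa.load converts to mono and resamples to 22,050 Hz by default;
# the returned samples are floats normalized to the range [-1, +1].
y, sr = librosa.load(audio_path, sr=22050, mono=True)
print(y.shape, sr, y.min(), y.max())
```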
There are three representations of audio features, namely the time-domain representation, the frequency-domain representation, and the time-frequency representation. Among these, the time-frequency representation is used most often for machine learning purposes; examples include the spectrogram, MFCCs, and the Mel-spectrogram. In this research, MFCCs and the Mel spectrogram have been used because they provide a good representation of the spectral content of speech. MFCCs capture information about the spectral envelope of the speech signal, while Mel spectrograms provide a visual representation of the spectral content of speech. Both can be used to extract relevant features for emotion recognition, and machine learning algorithms can be trained to identify patterns indicative of particular emotional states.

• MFCC: It stands for Mel-Frequency Cepstral Coefficients [6] and is widely used in audio signal processing for feature extraction. Through this technique, knowledge about the spectral envelope is attained [6]. First, the time-domain waveform is converted to a log-amplitude spectrum by applying the DFT. Then Mel scaling is performed with the help of Mel filter banks, and finally the discrete cosine transform is applied to obtain the MFCCs. After extensive empirical analysis, the number of MFCCs was set to 40. The formula for computing the cepstrum is given in equation 1.

C(x(t)) = F⁻¹[log(F(x(t)))]    (1)


The above equation shows that applying the inverse Fourier transform to the log spectrum yields the cepstrum, which is the building block of the MFCCs.

x(t) = e(t) ∗ h(t) (2)


X(t) = E(t) · H(t) (3)
log(X(t)) = log(E(t)) + log(H(t)) (4)
Equation 2 shows that speech is generated through the interaction of the glottal pulse and the resonant frequencies of the vocal tract, resulting in the production of sound. In equation 3 the Fourier transform has been applied, and the logarithm is applied to obtain the final coefficients from equation 4. In fig. 2, there are 40 MFCCs; the lower bands are the glottal-pulse coefficients and contain more information about the glottal pulse, whereas the higher bands are the vocal-tract coefficients and contain more information about the vocal tract frequencies. The lowest band in fig. 2 contains the most information about the glottal pulse.
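A small sketch of this extraction with Librosa is given below; the file path is hypothetical, and averaging the per-frame coefficients over time is one plausible way to obtain the fixed 40-value descriptor used later:

```python
import librosa
import numpy as np

# Hypothetical path to one labelled recording.
y, sr = librosa.load("data/tess/OAF_back_angry.wav", sr=22050, mono=True)

# 40 Mel-Frequency Cepstral Coefficients per frame, shape (40, n_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

# Averaging over time gives one 40-dimensional descriptor per clip.
mfcc_mean = np.mean(mfcc, axis=1)
print(mfcc.shape, mfcc_mean.shape)
```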

Fig. 2. Result of the MFCCs

Machine Learning and Deep Learning Techniques for Emotion Recognition from Human Speech using Acoustic Analysis Page 3
Volume 7-Issue 1, January 2024
Paper: 48

• Mel-spectrogram: A normal spectrogram uses a linear frequency scale, but the human ear perceives sound logarithmically, and that is where the Mel-spectrogram is beneficial. The term Mel comes from melody, and the Mel scale is a frequency scale that is logarithmic in nature, much like how the human ear perceives sound. To create a Mel-spectrogram, the first step is to transform the frequency scale of the spectrogram from linear to the Mel scale with the help of Mel filter banks, which are evenly spaced triangular filters [12]. In fig. 3, it can be seen that the differences between the Mel frequencies are the same, and the higher the Mel frequency, the wider the triangle, which shows that higher Mel bands cover more frequencies in Hz, giving an essence of the logarithmic nature.

Fig. 3. Mel filter banks

The energy of a particular frequency range is captured by each filter-bank output, and these outputs are merged to generate the Mel-spectrogram. Further, to compress the dynamic range, the logarithm is applied to the Mel-spectrogram to obtain a log-Mel-spectrogram. The formula below has been used to convert a frequency f (in Hz) to the Mel scale.
m = 2595 · log10(1 + f/700)    (5)

Forty Mel filter banks have been used in this research work. Fig. 4 shows the Mel spectrogram of an angry voice, which gives a clear picture of the time-frequency components. The Mel spectrograms of different emotions differ from each other and are therefore an excellent feature for training a model to detect emotions.

Fig. 4. Mel-Spectrogram
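A minimal sketch of this computation with Librosa follows; the file path is hypothetical, and htk=True selects the 2595 · log10(1 + f/700) Mel scale of equation 5:

```python
import librosa
import numpy as np

y, sr = librosa.load("data/ravdess/Actor_02/03-01-05-02-01-01-02.wav", sr=22050, mono=True)

# Mel-spectrogram with 40 triangular Mel filter banks (htk-style Mel scale).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, htk=True)

# Logarithmic (dB) compression gives the log-Mel-spectrogram discussed above.
log_mel = librosa.power_to_db(mel)

# One fixed-length 40-value descriptor per clip, averaged over time.
mel_mean = np.mean(log_mel, axis=1)
print(log_mel.shape, mel_mean.shape)
```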
Forty MFCCs and 40 Mel bands are used, resulting in 80 features in total. For encoding the emotion labels, a label encoder has been used; experiments showed that using a label encoder in place of one-hot encoding increases accuracy by 5% to 6% for most of the algorithms.
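Putting the two descriptors together, a plausible sketch of the 80-dimensional feature vector and the integer label encoding is shown below; the helper function, file paths and label strings are assumptions that mirror the earlier snippets:

```python
import librosa
import numpy as np
from sklearn.preprocessing import LabelEncoder

def extract_features(path: str) -> np.ndarray:
    """Return an 80-dimensional vector: 40 mean MFCCs + 40 mean log-Mel bands."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    log_mel = np.mean(librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40, htk=True)), axis=1)
    return np.concatenate([mfcc, log_mel])

# Hypothetical (path, label) pairs from the merged RAVDESS+TESS corpus.
samples = [("data/tess/OAF_back_angry.wav", "anger"),
           ("data/tess/OAF_back_happy.wav", "happiness")]

X = np.stack([extract_features(p) for p, _ in samples])            # shape (n, 80)
y = LabelEncoder().fit_transform([label for _, label in samples])  # integer labels
print(X.shape, y)
```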

V. PROPOSED METHODOLOGY
In the first step, the labelled datasets are acquired from Kaggle and the two datasets (RAVDESS and TESS) are merged. These are benchmark datasets and contain preprocessed data, therefore no further preprocessing is needed. It has been seen that after


merging, the final dataset becomes almost label balanced. The Mel spectrogram and MFCCs have been used as audio features: 40 features from the Mel spectrogram and 40 from the MFCCs. 80% of the data has been used for training, while 20% has been allocated for testing.
After numerous experiments, it was found that Random Forest performs best on this problem, and an accuracy of 93.24% was achieved. Random Forest, being an ensemble learning technique, uses bagging under the hood, which mitigated overfitting and handled the categorical labels well. The number of decision trees in the Random Forest has been kept at 80, which is equal to the number of features. Gini impurity has been used as the splitting criterion, and since Random Forest performs random sampling and its base estimators are weak learners, pruning was not done. The max_features hyperparameter was set to "sqrt"; it specifies the maximum number of features that can be considered at each split. A flow chart of the proposed methodology is shown in fig. 5.

Fig. 5. Proposed Methodology


Each decision tree in the Random Forest is trained on a bootstrapped subset of the data, and at each split a random subset of features is examined to determine the best feature to split on. The final prediction is then made by majority voting over the trees.
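A minimal scikit-learn sketch of this setup is given below, assuming X and y are the 80-dimensional feature matrix and encoded labels from Section IV; the random_state and stratification are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 80 trees, Gini impurity, sqrt(n_features) candidates per split, no pruning.
rf = RandomForestClassifier(n_estimators=80, criterion="gini",
                            max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```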
Testing was also performed with an artificial neural network (a multi-layer perceptron) with 80 inputs, 2 hidden layers, 3 dropout layers and an output layer of 7 neurons; the total number of trainable parameters was 11,647. To build the neural network model, the Rectified Linear Unit (ReLU) activation function has been used in all layers except for the output layer, which was activated with the Softmax function. ReLU is described in equation 6.
f(x) = max(0, x) (6)

It has been used for its simplicity and effectiveness, and because it mitigates the vanishing-gradient problem [14]. The softmax function is used to obtain a probability distribution and is therefore best suited for categorical outputs [14]. It is described in equation 7.
softmax(z_i) = e^(z_i) / Σ_{j=1}^{K} e^(z_j)    (7)


where z_i is the input to the i-th output neuron, and K is the total number of output neurons. The model summary is shown in fig. 6. A total of 100 epochs were run to test the result.

Fig. 6. Model summary of the MLP

The "Early stopping" mechanism has been used for faster convergence.
VI. RESULTS AND ANALYSIS
In this experiment, the effectiveness of traditional machine learning models such as Random Forest, Decision Tree, and SVM has been analysed. The GridSearchCV package has been utilized to perform hyperparameter tuning and to find the most appropriate model with optimal hyperparameters. Before testing these algorithms on the main merged dataset, the analysis was also done on each dataset individually. Since the datasets were appended one after the other, k-fold cross-validation was used with Random Forest and SVM to reduce bias, but it was found that the reported accuracy decreases because of the averaging across folds.
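A sketch of this tuning step is shown below, assuming X_train and y_train from earlier; most candidate grid values are assumptions, chosen only to bracket the reported best settings (n_estimators=80 for Random Forest, C=60 with a polynomial kernel for SVM):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Hyperparameter grids; most candidate values here are illustrative assumptions.
rf_grid = GridSearchCV(RandomForestClassifier(random_state=42),
                       param_grid={"n_estimators": [40, 80, 120],
                                   "criterion": ["gini", "entropy"],
                                   "max_features": ["sqrt", "log2"]},
                       cv=5, scoring="accuracy")
svm_grid = GridSearchCV(SVC(),
                        param_grid={"C": [1, 10, 60], "kernel": ["rbf", "poly"]},
                        cv=5, scoring="accuracy")

rf_grid.fit(X_train, y_train)
svm_grid.fit(X_train, y_train)
print(rf_grid.best_params_, svm_grid.best_params_)
```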
In table I, it is evident that the performance of all the models deteriorates when tested on the RAVDESS dataset alone. This can be attributed to label imbalance in that dataset. The decision tree, in particular, exhibits high variance and low bias, which indicates overfitting. In an attempt to address this issue, post-pruning and k-fold cross-validation were employed, but they did not yield a significant improvement in accuracy. The best results, obtained after several iterations of hyperparameter tuning, are shown in the table. The SVM has the lowest accuracy due to its constrained optimization problem, and the Decision Tree suffered from high variance. Random Forest is an ensemble learning technique built from weak learners and therefore reduces variance, so it performs best in all the scenarios. The confusion matrix is shown in fig. 7, and a comparison of the three classifiers across the datasets is given in Table I.
TABLE I: RESULTS OF THE EXPERIMENTS

Model           Hyperparameters                                  Dataset         Accuracy (%)
Random Forest   n_estimators=80, criterion="gini"                RAVDESS         87.75
                                                                 TESS            99.73
                                                                 RAVDESS+TESS    93.24
Decision Tree   n_estimators=80, criterion="gini", pruning=None  RAVDESS         78.7
                                                                 TESS            89.64
                                                                 RAVDESS+TESS    88.9
SVM             C=60, kernel="poly"                              RAVDESS         83.2
                                                                 TESS            99.64
                                                                 RAVDESS+TESS    84.1


Fig. 7. Confusion Matrix of the Random Forest
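For reference, a confusion matrix such as the one in fig. 7 can be computed from the trained classifier with scikit-learn, assuming rf, X_test and y_test from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix, classification_report

y_pred = rf.predict(X_test)

# Rows are true emotions, columns are predicted emotions.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```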

Fig. 7 reveals that a majority of the labels were accurately classified. However, the emotions with the "disgust" and "neutral" labels exhibited a higher number of misclassifications. Specifically, out of all the misclassified audio samples, nine with the "disgust" label were incorrectly classified as "pleasant surprise," while eight with the "neutral" label were erroneously identified as "sad." Fig. 8 shows the accuracy graph of the MLP model and fig. 9 shows its loss graph.
Fig. 8 is an accuracy graph and it is a chart that displays a model’s performance over time or as a function of a specific
parameter. The x-axis represents the independent variable, such as time or hyperparameter values, while the y-axis displays the accuracy score.

Fig. 8. Accuracy Graph of the MLP

This graph is valuable for understanding how a model's performance changes with different settings, and it can
help identify trends and patterns in accuracy. Researchers and practitioners can use the accuracy graph to compare models and
variations of the same model, evaluate and optimize machine learning models, and make informed decisions about model
selection and parameter tuning. An accuracy of 93.34% was achieved. The "Early Stopping" mechanism was used for faster convergence, and in fig. 8 it can be seen that the algorithm converges at the 52nd iteration. The accuracy could have been better if a larger dataset
had been used while training the Multi-layer perceptron. Fig. 9 shows a loss graph and it is a visual representation of a
mathematical function called the loss function, which measures the difference between the predicted output and the actual


output of a machine learning model. The goal of training a model is to minimize this function by adjusting the model’s weights
and biases through an optimization process. The loss graph shows the value of the loss function over time or as a function of a
specific parameter during the training process, with the y-axis representing the value of the loss function and the x-axis
representing the number of iterations or epochs. As the training progresses, the values of the loss function should decrease,
indicating improved predictions. A loss graph is a valuable tool for evaluating and optimizing machine learning models,
monitoring the training process, and making informed decisions about model selection and parameter tuning. From fig. 9 it can
be seen that after the 20th epoch, the value of the loss function gradually decreases and converges to a very small value. In the
field of machine learning, the progress of a model’s performance during training, spanning multiple epochs, is often visualized
through an accuracy vs. epoch graph. An epoch is defined as a complete iteration through the entirety of the training dataset.
Accuracy is a performance measure, expressed as a percentage, that quantifies the model’s ability to accurately predict the
target labels for the input data. It is a commonly used evaluation metric for classification tasks, where the objective is to classify
input data into predefined categories. The accuracy vs. epoch graph serves as a graphical representation of the model’s
performance over the course of training. It provides insights into how the accuracy of the model evolves over time as it learns
from the training data. Initially, the accuracy may be low as the model’s parameters are being adjusted. However, with
continued training, the accuracy may improve as the model’s parameters are fine-tuned and it becomes more proficient at
making accurate predictions. Monitoring the accuracy vs. epoch graph during training is a standard practice among researchers
and practitioners to assess the effectiveness of the model’s learning process and identify potential areas for improvement, such
as adjusting hyperparameters, refining model architecture, or optimizing training strategies. This graphical representation serves
as a valuable tool in analyzing and interpreting the performance of machine learning models and is commonly included in
research papers as a means of presenting experimental results.

Fig. 9. Loss graph of the MLP

In the context of machine learning research, a loss graph is a graphical representation of the change in the value of the loss
function during model training, typically with the value of the loss on the y-axis and the number of epochs on the x-axis. The
loss function quantifies the discrepancy between the model’s predicted outputs and the actual target outputs for the training
data. The loss graph provides a visual depiction of how the loss evolves over the course of model training. A comparative
analysis is shown in Table II. Since all the cited studies in this research have utilized the same dataset and employed identical
audio features for emotion recognition from human speech, a comparative analysis can be conducted to evaluate the
performance of various machine learning and deep learning techniques in this context. A bar graph for better visualization has
been shown in fig. 10.


TABLE II: COMPARATIVE ANALYSIS

Author             Model                                                        Dataset         Accuracy (%)
Ye et al. [14]     TIM-Net (Temporal-aware bI-direction Multi-scale Network)    RAVDESS+TESS    92.08
Jain et al. [4]    SVM with PCA                                                 TESS            97.86
Luna et al. [7]    Pre-trained xlsr-Wav2Vec2.0 transformer                      RAVDESS         87
                                                                                TESS            94
                                                                                RAVDESS+TESS    91
Dolka et al. [3]   ANN                                                          RAVDESS+TESS    88.72
Proposed work      MLP                                                          RAVDESS         87.75
                                                                                TESS            99.73
                                                                                RAVDESS+TESS    93.34

Fig. 10. Comparative Analysis

VII. CONCLUSION AND FUTURE WORK


A method to automatically detect emotion from audio has been developed in this research; an accuracy of 93.24% has been achieved using the Random Forest classifier and 93.34% with a three-layer neural network, which is the best accuracy among the previous works considered. A robust and accurate emotion recognizer can be a boon to society and has the potential to revolutionize various industries by providing insights into the emotional states of individuals and enhancing the quality of human-computer interaction, healthcare, education, marketing, entertainment and security. Future work can extend this study to languages other than English, and researchers can also explore other audio features that may help to improve accuracy. Larger datasets are needed to deploy deep neural network techniques. This work can also be combined with text-based sentiment analysis, so that both the tone and the textual context of the speaker are used to analyse the speaker's overall sentiment.

REFERENCES
[1] Sudipta Bhattacharya, Samarjeet Borah, Brojo Kishore Mishra, and Atreyee Mondal. Emotion detection from multilingual
audio using deep analysis. Multimedia Tools and Applications, 81(28):41309–41338, 2022.
[2] Sadil Chamishka, Ishara Madhavi, Rashmika Nawaratne, Damminda Alahakoon, Daswin De Silva, Naveen Chilamkurti,
and Vishaka Nanayakkara. A voice-based real-time emotion detection technique using recurrent neural network
empowered feature modelling. Multimedia Tools and Applications, 81(24):35173–35194, 2022.


[3] Harshit Dolka, Arul Xavier VM, and Sujitha Juliet. Speech emotion recognition using ann on mfcc features. In 2021 3rd
international conference on signal processing and communication (ICPSC), pages 431–435. IEEE, 2021.
[4] Kabir Jain, Anjali Chaturvedi, Jahnvi Dua, and Ramesh K Bhukya. Investigation using mlp-svm-pca classifiers on speech
emotion recognition. In 2022 IEEE 9th Uttar Pradesh Section International Conference on Electrical, Electronics and
Computer Engineering (UPCON), pages 1–6. IEEE, 2022.
[5] Kabir Jain, Anjali Chaturvedi, Jahnvi Dua, and Ramesh K Bhukya. Investigation using mlp-svm-pca classifiers on speech
emotion recognition. In 2022 IEEE 9th Uttar Pradesh Section International Conference on Electrical, Electronics and
Computer Engineering (UPCON), pages 1–6. IEEE, 2022.
[6] MS Likitha, Sri Raksha R Gupta, K Hasitha, and A Upendra Raju. Speech based human emotion recognition using
mfcc. In 2017 international conference on wireless communications, signal processing and networking (WiSPNET),
pages 2257–2260. IEEE, 2017.
[7] Cristina Luna-Jiménez, Ricardo Kleinlein, David Griol, Zoraida Callejas, Juan M Montero, and Fernando
Fernández-Martínez. A proposal for multimodal emotion recognition using aural transformers and action units on
ravdess dataset. Applied Sciences, 12(1):327, 2021.
[8] Sean MacAvaney, Hao-Ren Yao, Eugene Yang, Katina Russell, Nazli Goharian, and Ophir Frieder. Hate speech
detection: Challenges and solutions. PloS one, 14(8):e0221152, 2019.
[9] Ameya Ajit Mande, Sukrut Dani, Shruti Telang, and Zongru Shao. Emotion detection using audio data samples.
International Journal of Advanced Research in Computer Science, 10(6), 2019.
[10] Anastasiya S Popova, Alexandr G Rassadin, and Alexander A Ponomarenko. Emotion recognition in sound. In
Advances in Neural Computation, Machine Learning, and Cognitive Research: Selected Papers from the XIX
International Conference on Neuroinformatics, October 2-6, 2017, Moscow, Russia 19, pages 117–124. Springer,
2018.
[11] David Hason Rudd, Huan Huo, and Guandong Xu. Leveraged mel spectrograms using harmonic and percussive
components in speech emotion recognition. In Advances in Knowledge Discovery and Data Mining: 26th Pacific-
Asia Conference, PAKDD 2022, Chengdu, China, May 16–19, 2022, Proceedings, Part II, pages 392–404. Springer,
2022.
[12] Kannan Venkataramanan and Haresh Rengaraj Rajamohan. Emotion recognition from speech. arXiv preprint
arXiv:1912.10458, 2019.
[13] Jiaxin Ye, Xincheng Wen, Yujie Wei, Yong Xu, Kunhong Liu, and Hongming Shan. Temporal modeling matters: A
novel temporal emotional modeling approach for speech emotion recognition. arXiv preprint arXiv:2211.08233,
2022.
[14] Jiaxin Ye, Xincheng Wen, Yujie Wei, Yong Xu, Kunhong Liu, and Hongming Shan. Temporal modeling matters: A
novel temporal emotional modeling approach for speech emotion recognition. arXiv preprint arXiv:2211.08233,
2022.
