
Volume 8, Issue 5, May 2023 | International Journal of Innovative Science and Research Technology
ISSN No: 2456-2165

Speech Emotion Recognition for Enhanced User Experience: A Comparative Analysis of Classification Methods

Samjhana Pokharel
Department of Computer Science and Engineering
Kathmandu University
Dhulikhel, Nepal

Ujwal Basnet
Department of Computer Science and Engineering
Kathmandu University
Dhulikhel, Nepal

Abstract:- Speech recognition has gained significant importance in facilitating user interactions with various technologies. Recognizing human emotions and affective states from speech, known as Speech Emotion Recognition (SER), has emerged as a rapidly growing research subject. Unlike humans, machines lack the innate ability to perceive and express emotions. Therefore, leveraging speech signals for emotion detection has become an adaptable and accessible approach. This paper presents a project aimed at classifying emotional states in speech for applications such as call centers, measuring emotional attachment in phone calls, and real-time emotion recognition in online learning. The classification methods employed in this study include Support Vector Machines (SVM), Logistic Regression (LR), and Multi-Layer Perceptron (MLP). The project utilizes features such as Mel-frequency cepstrum coefficients (MFCC), chroma, and mel to extract relevant information from speech signals and train the classifiers. Through a comparative analysis of these classification methods, this research aims to enhance the understanding of speech emotion recognition and contribute to the development of more effective and accurate emotion recognition systems.

Keywords:- Speech Emotion Recognition (SER), Emotion Classification, Support Vector Machines (SVM), Logistic Regression (LR), Multi-Layer Perceptron (MLP), Mel-frequency Cepstrum Coefficients (MFCC), Chroma, Mel Features.

I. INTRODUCTION

Speech recognition has become increasingly important in recent years as a means of assisting users with ease of use. Several well-known technology companies, including Google, Samsung, and Apple, have used speech recognition to convert human speech into text so that their customers can quickly navigate their products.

Speech Emotion Recognition (SER) is the task of recognizing human emotion and the associated affective states from speech. It exploits the fact that tone and pitch in the voice often indicate the underlying emotion. In recent years, emotion recognition has become a rapidly growing research subject. Machines, unlike humans, lack the innate ability to perceive and express emotions. Speech, physiological signals, facial expressions, and other modalities can all be used to detect emotions; among these, speech signals are far more adaptable and simple to acquire. Mel-frequency cepstrum coefficients (MFCC), chroma, and mel features are extracted from the speech signals and used to train the classifiers.

Our project aims to classify the emotional state of speech, which can be used in a number of applications such as call centers, measuring the degree of emotional attachment in phone calls, and real-time emotion recognition in online learning. Three classification methods are used in this project for analyzing emotions (calm, happy, fearful, angry, disgust, surprised): SVM, Logistic Regression (LR), and Multi-Layer Perceptron (MLP).

 Motivations for Doing the Project
In today's world, identifying the emotion exhibited in a spoken utterance has various applications. Human-Computer Interaction (HCI) is a branch of study that looks into how humans and computers interact with each other. A computer system that understands more than simply words is required for an efficient HCI application. Voice-based inputs are used by several real-world IoT applications, including Amazon Alexa, Google Home, and Mycroft; in IoT applications, voice plays a critical role. According to a recent survey, about 12% of all IoT applications will be completely functional by 2022. Self-driving automobiles are one example of an emerging field that uses voice commands to operate several of its tasks. In emergency scenarios where the user may be unable to offer a clear spoken command, the emotion communicated through the user's tone of voice can be used to activate specific car emergency functions.

 Objectives
The primary objective of speech emotion recognition is to improve the human-machine interaction interface by detecting the emotional state of a person from speech.

II. RELATED WORKS

There are a number of studies on speech emotion recognition, and different companies are doing research and work related to speech emotion recognition, either directly or as an application within other parts of their work.

audEERING, an audio analysis company based in Germany, specialises in emotional artificial intelligence. Their team consists of experts in voice emotion analytics, machine learning, and signal processing.

Alexa, a virtual assistant AI technology developed by Amazon and first used in the Amazon Echo smart speaker and in the Echo Dot, Echo Studio, and Amazon Tap speakers developed by Amazon Lab, is working on detecting emotions like sadness, happiness, and anger in order to understand the mental state of a speaker from the sound of their voice.

III. DATASETS

In this project we have used the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset. It contains 7356 files rated by 246 persons 10 times on emotional validity. The full dataset is 24.8 GB and covers 24 different actors; because it is so large, we used the lower-sample-rate version, which is around 171 MB. The dataset includes the following emotions: neutral, calm, happy, sad, angry, fearful, disgust, and surprised.

 File Naming Convention
Each of the 1440 files has a unique filename. The filename consists of a 7-part numerical identifier (e.g., 03-01-06-01-02-01-12.wav). These identifiers define the stimulus characteristics:

 Filename Identifiers

 Modality (01 = full-AV, 02 = video-only, 03 = audio-only).
 Vocal channel (01 = speech, 02 = song).
 Emotion (01 = neutral, 02 = calm, 03 = happy, 04 = sad, 05 = angry, 06 = fearful, 07 = disgust, 08 = surprised).
 Emotional intensity (01 = normal, 02 = strong). NOTE: There is no strong intensity for the 'neutral' emotion.
 Statement (01 = "Kids are talking by the door", 02 = "Dogs are sitting by the door").
 Repetition (01 = 1st repetition, 02 = 2nd repetition).
 Actor (01 to 24. Odd-numbered actors are male, even-numbered actors are female).

 Filename Example: 03-01-06-01-02-01-12.wav

 Audio-only (03)
 Speech (01)
 Fearful (06)
 Normal intensity (01)
 Statement "dogs" (02)
 1st repetition (01)
 12th actor (12): female, as the actor ID number is even.


IV. METHODS AND ALGORITHMS USED

Fig 1 Methodology

The figure above shows the general flow-chart of our project, which consists of 5 different phases. The data is stored in files in the project directory. The files are loaded using different Python libraries, and unnecessary files are removed. We then extract different features of the sound files, such as MFCC, mel, and chroma, which are used as the input features for the classifier function. The dataset is then divided into two sets: a training set and a testing set. We build different classifier models and train each model using the training set. Finally, we use the testing set for evaluation and accuracy calculation of each model. The whole process is summarized in the diagram above.

 Phase 1: Data Collection
The RAVDESS dataset is used in the project. The dataset is downloaded into our system.

Audio files in the directory are loaded using libraries such as os, glob, and soundfile.

We use the glob module, which finds all the path names matching a specified pattern, because the dataset consists of audio files named in a specific pattern that encodes the emotion label in the file name itself. The os module is used to get the base name of each file. Then, using the soundfile library, we read each sound file along with the sample rate of the audio.
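As a concrete illustration, here is a minimal sketch of this loading step. The directory layout passed to the hypothetical load_data helper is an assumption; the emotion codes follow the RAVDESS naming convention described above.

```python
import glob
import os

import soundfile  # reads audio data together with its sample rate

# Emotion codes from the third field of the RAVDESS filename convention.
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def load_data(data_dir="ravdess"):  # hypothetical directory layout
    """Yield (audio, sample_rate, emotion) for every .wav file found."""
    for path in glob.glob(os.path.join(data_dir, "Actor_*", "*.wav")):
        # e.g. 03-01-06-01-02-01-12.wav -> the emotion code is field 3
        emotion_code = os.path.basename(path).split("-")[2]
        audio, sample_rate = soundfile.read(path)
        yield audio, sample_rate, EMOTIONS[emotion_code]
```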

 Phase 2: Extracting Features
Different features of the sound are extracted: Mel Frequency Cepstral Coefficients (MFCCs), Chroma, and Mel-Spectrogram.

 MFCCs
A frequency-domain feature of the sound that captures the timbral (textural) and phonetically crucial characteristics of speech. It is widely used in speech, music genre, and musical instrument classification.

 Chroma
Chroma captures the harmonic and melodic characteristics of music while being robust to changes in timbre and instrumentation. It is also referred to as pitch class profiles.

 Mel-Spectrogram
A spectrogram where the frequencies are converted to the mel scale. Samples of the sound file are taken over time to represent the audio signal; the signal is mapped from the time domain into the frequency domain using the fast Fourier transform, and the resulting frequencies and amplitudes form the spectrogram.

The Librosa library is used to extract features from the audio files. Librosa is a Python library for music and audio analysis; it provides the building blocks necessary to create music information retrieval systems.
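The paper does not list the extraction code itself, so the following is a hedged sketch of how the three features could be computed and stacked with librosa; the choice of 40 MFCCs and the mean-over-time pooling are assumptions for illustration, not values reported above.

```python
import librosa
import numpy as np

def extract_features(audio, sample_rate, n_mfcc=40):  # n_mfcc is assumed
    """Return one 1-D feature vector: MFCC + chroma + mel-spectrogram."""
    stft = np.abs(librosa.stft(audio))  # chroma is computed from the STFT
    mfccs = np.mean(librosa.feature.mfcc(y=audio, sr=sample_rate,
                                         n_mfcc=n_mfcc).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(S=stft,
                                                 sr=sample_rate).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=audio,
                                                 sr=sample_rate).T, axis=0)
    return np.hstack([mfccs, chroma, mel])  # concatenate into one vector
```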

 Phase 3: Classification

In this phase different algorithm models are used for classifying the emotions:

 MLP Classifier
 Logistic Regression
 SVM

The dataset was divided into two sets:

 Training set (80%)
 Testing set (20%)

 MLP Classifier
Multi-Layer Perceptron (MLP) is a feedforward artificial neural network consisting of an input layer, multiple hidden layers, and an output layer, all of which are connected. By adjusting the model's parameters, biases, and weights, the network approximates the target function. The activation function used during the experiments was ReLU, which makes the model easier to train and often achieves better performance.

Fig 2 MLP Classifier

The MLP (Multi-Layer Perceptron) Classifier is used to categorize the given data into respective groups. It is capable of approximating boolean and nonlinear functions and is frequently used in supervised learning problems. The network works on real values, so categorical values must be converted into a real-valued representation.

The following parameter values were used in our model:

 alpha=0.01,
 batch_size=256,
 epsilon=1e-08,
 hidden_layer_sizes=(300,),
 learning_rate='adaptive',
 max_iter=500

 Alpha: the parameter for the regularization (penalty) term, which combats overfitting by constraining the size of the weights.
 Batch Size: the number of samples that will be propagated through the network.
 Epsilon: a value for numerical stability.
 Hidden Layer Sizes: 1 hidden layer with 300 hidden units.
 Learning Rate: the schedule for weight updates. 'adaptive' keeps the learning rate constant at 'learning_rate_init' as long as the training loss keeps decreasing. Each time two consecutive epochs fail to decrease the training loss by at least tol, or fail to increase the validation score by at least tol if 'early_stopping' is on, the current learning rate is divided by 5.
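Combined with the 80/20 split from Phase 3, the configuration above corresponds to roughly the following scikit-learn code. This is a reconstruction from the listed parameters rather than the authors' verbatim script; X and y are assumed to hold the stacked feature vectors and emotion labels from the Phase 2 sketch, and random_state is an arbitrary choice for reproducibility.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 80% training set / 20% testing set, as described in Phase 3.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # random_state is assumed

mlp_model = MLPClassifier(alpha=0.01,
                          batch_size=256,
                          epsilon=1e-08,
                          hidden_layer_sizes=(300,),
                          learning_rate='adaptive',
                          max_iter=500)
mlp_model.fit(X_train, y_train)
print("MLP test accuracy:", mlp_model.score(X_test, y_test))
```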

 Logistic Regression

The logistic model is used to model the probability of a certain class or event existing, such as pass/fail, win/lose, alive/dead, or healthy/sick. This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one. This model is preferable for categorical dependent-variable data, since the data used have a small set of output classes (e.g., happy and sad).

This linear relationship can be written in the following mathematical form (where ℓ is the log-odds, b is the base of the logarithm, and βi are parameters of the model):

ℓ = log_b(p / (1 − p)) = β0 + β1x1 + β2x2 + ... + βnxn

Following are the parameters used in building the model:

 multi_class='multinomial',
 solver='lbfgs'

 Multi_class: 'multinomial', an extension of logistic regression that adds support for multi-class classification problems.
 Solver: the algorithm used for the optimization problem. In our case, lbfgs is used: it approximates second-derivative (Hessian) updates with gradient evaluations and stores only the last few updates, so it saves memory, although it is not especially fast on large datasets.
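A hedged sketch of the corresponding scikit-learn call, reusing the split from the MLP sketch above; raising max_iter beyond the default is an added assumption so that lbfgs can converge on this feature set.

```python
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(multi_class='multinomial',
                              solver='lbfgs',
                              max_iter=1000)  # max_iter=1000 is assumed
lr_model.fit(X_train, y_train)
print("LR test accuracy:", lr_model.score(X_test, y_test))
```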

 SVM
SVMs (Support Vector Machines) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. It is a supervised machine learning model that linearly separates binary sets. The goal of this model is to compute a hyperplane that correctly classifies all training vectors. After creating a hyperplane, the next step is to maximize the margin between the data points and the hyperplane; the data points closest to the hyperplane are called the support vectors.


Fig 3 SVM

Following are the parameters used in building the model:

 kernel="linear",
 C=1

 kernel="linear": specifies the kernel type of the algorithm.
 C=1: the regularization parameter. The strength of the regularization is inversely proportional to C, and it must be strictly positive.
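These parameters map onto scikit-learn's SVC as sketched below (Experiment I); the split variables are reused from the earlier sketches.

```python
from sklearn.svm import SVC

svm_model = SVC(kernel="linear", C=1)  # linear kernel, C=1 as listed
svm_model.fit(X_train, y_train)
print("SVM test accuracy:", svm_model.score(X_test, y_test))
```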

 Phase 4: Evaluation

Evaluation of the experiments involves comparing the classification report and accuracy of each model. This includes comparison of accuracy between the multiple experiments of each algorithm and between the different algorithms.
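A minimal sketch of this evaluation step, assuming scikit-learn's metrics utilities and any of the fitted models from the sketches above:

```python
from sklearn.metrics import accuracy_score, classification_report

# Per-class precision, recall, and F1 score, plus overall accuracy.
y_pred = mlp_model.predict(X_test)
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
```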

V. EXPERIMENTS AND EVALUATIONS

In these experiments, the three models (MLP Classifier, Logistic Regression, and SVM) are used for classifying the emotions.

The dataset consisted of the following data:

Fig 4 Dataset Composition

 MLP Classifier

Table 1 Experiments Conducted with Respective Evaluations

SN | Experiments | Accuracy | F1 Score | Precision | Recall
I | alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(300,), learning_rate='adaptive', max_iter=500 | 0.63 | 0.63 | 0.67 | 0.64
II | hidden_layer_sizes=(300,150,) | 0.61 | 0.59 | 0.63 | 0.60
III | hidden_layer_sizes=(600,) | 0.70 | 0.70 | 0.71 | 0.71

Fig 5 MLP, Experiment I

Fig 6 MLP, Experiment II


Fig 7 MLP, Experiment III


 Logistic Regression

Table 2 Experiments Conducted with Respective Evaluations

SN | Experiments | Accuracy | F1 Score | Precision | Recall
I | solver='lbfgs' | 0.53 | 0.52 | 0.53 | 0.53
II | solver='saga' | 0.50 | 0.50 | 0.53 | 0.50
III | solver='newton-cg' | 0.58 | 0.58 | 0.58 | 0.58

Fig 8 LR, Experiment I


Fig 9 LR, Experiment II

Fig 10 LR, Experiment III

 SVM

Table 3 Experiments Conducted with Respective Evaluations

SN | Experiments | Accuracy | F1 Score | Precision | Recall
I | kernel="linear", C=1 | 0.43 | 0.56 | 0.56 | 0.56
II | kernel="poly", C=1 | 0.33 | 0.26 | 0.26 | 0.32
III | kernel="linear", C=2 | 0.42 | 0.57 | 0.57 | 0.57


Fig 11 SVM, Experiment I

Fig 12 SVM, Experiment II

Fig 13 SVM, Experiment III

VI. DISCUSSION ON RESULTS

Evaluation of the experiments includes comparison of accuracy:

 between the multiple experiments of each algorithm, and
 between the different algorithms.

The table below compares the different algorithms; it contains the best result of each algorithm after performing multiple experiments:

Model | Best experiment condition | Accuracy | F1 Score | Precision | Recall
MLP | alpha=0.01, batch_size=256, epsilon=1e-08, hidden_layer_sizes=(600,), learning_rate='adaptive', max_iter=500 | 0.70 | 0.70 | 0.71 | 0.71
SVM | kernel="linear", C=1 | 0.43 | 0.56 | 0.56 | 0.56
LR | solver='newton-cg' | 0.58 | 0.58 | 0.58 | 0.58

MLP achieves the best accuracy, F1 score, precision, and recall when it has a single hidden layer with 600 hidden units. SVM performs best with a linear kernel and a regularization parameter of 1, and LR performs best with newton-cg as the optimization solver.

Comparing the models, we get the best results with the MLP Classifier, while the other two, SVM and LR, have similar results.

VII. CONTRIBUTIONS OF EACH GROUP MEMBER

1. Samjhana Pokharel
 Research on Speech Emotion Recognition
 Study of sound files and their features for speech emotion recognition
 Visualization of speech features
 Visualization of datasets
 Study of ML models for speech emotion recognition
 Multiple experiments with Support Vector Machine and Logistic Regression
 Performance and accuracy measurement of SVM and LR
 Comparison of each experiment and model
 Drawing conclusions

2. Ujwal Basnet
 Research on Speech Emotion Recognition
 Study of sound files and their features for speech emotion recognition
 Visualization of speech features
 Visualization of datasets
 Study of ML models for speech emotion recognition
 Multiple experiments with Multi-Layer Perceptron classifiers
 Performance and accuracy measurement of MLP
 Comparison of each experiment and model
 Drawing conclusions

VIII. CODE

Code snippets for the implementation of the different methods have been presented and discussed above.

The complete code can be accessed via the public GitHub repository:

https://github.com/SamjhanaP/speechemotionrecognition

IX. CONCLUSION AND FUTURE EXTENSIONS TO THE PROJECT

The new era of automation has begun as a result of the increasing growth and development in the fields of AI and machine learning. The majority of these automated gadgets are controlled by the user's vocal commands. Many advantages can be gained over present systems if, in addition to identifying words, the machines can interpret the speaker's emotion.

The processes for creating a voice emotion recognition system were covered in detail in this project, and several experiments were conducted to determine the influence of each step. Three different learning models were used: MLP, LR, and SVM. First, speech features like MFCC, chroma, and mel were extracted from the audio files. Then each model was trained in multiple experiments with variations in the parameters, and using the test dataset, the accuracy of each model and each experiment was studied. At the end, we conclude that the MLP Classifier performs better when the number of hidden units in a hidden layer is increased.

We therefore conclude that the following are the advantages of using the MLP classifier in Speech Emotion Recognition:

 It allows you to work with nonlinear values with ease.
 It gives higher performance compared to the other models.
 Missing values can be handled.
 Complicated relationships can be modelled.
 Many inputs can be supported.

For future enhancements, the proposed project can be further improved in terms of efficiency, accuracy, and usability. The model may be extended to recognize more emotional states and sensations, such as sarcasm. A number of interactive systems can be developed using the trained models in the underlying system, providing a system where users can interact with, or command, the machine using voice. Also, the communication can be made bi-directional instead of uni-directional.

