
1st IEEE International Conference on Innovations in High-Speed Communication and Signal Processing (IEEE-IHCSP), 4-5 March 2023

Deep Learning Based Emotion Classification Using Mel Frequency Magnitude Coefficient

Siba Prasad Mishra, Pankaj Warule, Suman Deb
Department of Electronics Engineering, SVNIT, Surat, India
[email protected], [email protected], [email protected]

979-8-3503-4595-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/IHCSP56702.2023.10127148

Abstract—The popularity of emotion recognition using speech signals keeps increasing because of its vast number of applications in the practical field. Emotion recognition using the speech signal is a very complicated and challenging task. It plays an essential role in enhancing human-computer interaction (HCI). Many authors have used different methods to improve the accuracy of speech emotion recognition (SER). Proper selection of features and a suitable machine or deep learning model design can improve the recognition rate. In this work, we used a modified version of the mel frequency cepstral coefficient (MFCC) feature, named the mel frequency magnitude coefficient (MFMC), with convolutional neural network (CNN) and deep neural network (DNN) classifiers to enhance SER. We used the MFMC and MFCC features as input to the CNN and DNN classifiers and evaluated the accuracy of SER. We made two observations from our experiments. First, the performance of the MFMC feature in SER is better than that of the MFCC feature for both classifiers. Second, the proposed DNN classifier achieved better accuracy than the CNN classifier for both features (MFMC and MFCC). The MFMC feature with the DNN classifier achieved an accuracy of 76.72%, 84.72%, 77.88%, and 100% for the RAVDESS, EMODB, SAVEE, and TESS datasets, respectively. Similarly, the CNN classifier with the MFMC feature achieved an accuracy of 72.9%, 82.41%, 74.35%, and 100% for the same datasets. Our proposed work was compared with state-of-the-art models, and we found that our model performed better than the others.

Index Terms—Mel frequency cepstral coefficient, Mel frequency magnitude coefficient, Deep neural network, Convolutional neural network, Speech emotion recognition.

I. INTRODUCTION

Speech serves as a reliable interface for communicating with computers and is essential to the communication of information between individuals. Emotion is the human response to a particular situation that may be communicated via words or actions. For the same circumstance, every individual communicates their feelings in a unique way. For this reason, automatic SER is a useful research area, particularly in emotionally challenging human-machine interface systems. Different inputs, including audio, images, video, text, and multi-modal systems, may be used to identify emotion. The characteristics of speech signals are mostly unaffected by linguistic variety and the movement of the speaker. Therefore, numerous scholars have been interested in the problem of emotion identification using the speech signal. SER has a broad range of applications: it is used to improve the human-computer interface (HCI), internet marketing [1] based on the customer's emotional state, e-tutoring, customer care service enhancements, and psychological treatments [2]. Even though research into this area has been going on for the last three decades, the results are not yet accurate enough to be applied to real-world scenarios with any degree of accuracy.

Fig. 1: Basic block diagram for SER (input speech signal → pre-processing → feature extraction → classification → predicted emotion)

The basic block diagram of SER is shown in Fig. 1. The two important stages of emotion identification using speech signals are feature extraction and classification. Before feature extraction, the speech signal is passed through the pre-processing stage. The primary purpose of pre-processing is to find the salient segments, which contain information about the speech signal. The pre-processing stage includes preemphasis, normalization, segmentation, and windowing techniques. Features carry information about the speech signal and its emotions; hence, they play a vital role in emotion classification. The rate of emotion detection may increase with proper feature selection, while it may fall with improper selection. Features may be acoustic or non-acoustic. The acoustic and non-acoustic features are extracted from a speech signal's voiced and non-voiced components. Prosody, spectral, non-linear wavelet, and voice quality attributes are examples of acoustic characteristics. Language, visual, facial, and gesture elements are included in the non-acoustic category. Researchers have used different combinations of acoustic and non-acoustic features to improve emotion recognition. Different machine learning and deep learning classifiers are also used to improve emotion recognition. Some traditional machine learning models used for emotion recognition are the hidden Markov model (HMM), support vector machines (SVM) [3], [4], decision trees, random forests, k-nearest neighbor (KNN) classifiers, and combinations of more than one model. Similarly, popular deep learning models used for SER are the DNN [5], CNN [6], [7], long short-term memory (LSTM), radial basis function (RBF) networks, and combinations of more than one classifier. In our work, we used two models, DNN and CNN, for emotion recognition.


The MFCC is the acoustic characteristic most frequently used in automated speech emotion classification systems. The MFCC feature can be extracted by passing the speech signal through a discrete Fourier transform (DFT), a logarithmic Mel-scaled filter bank, and a discrete cosine transform (DCT) block. The short-term energy used to extract MFCC is not satisfactory for significant signals because it contains a square function. Similarly, the DCT combines all frequency bands, so if any band is affected by noise, the noise is reflected in all MFCC coefficients. To overcome this, we used a modified version of MFCC named the Mel frequency magnitude coefficient (MFMC), proposed by Ancilin et al. [3]. To test the effectiveness of our models (CNN and DNN), we compared the results with state-of-the-art techniques using four datasets: RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) [8], TESS (Toronto Emotional Speech Set), EMO-DB (Emotional Speech Database), and SAVEE (Surrey Audio-Visual Expressed Emotion).
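The paper states only that Python was used; as a hedged point of reference, the conventional MFCC pipeline described above (DFT, log Mel-scaled filter bank, DCT) might be implemented roughly as follows. The librosa library, the 40-coefficient setting, and the per-utterance averaging are assumptions carried over from the experimental section, not the authors' code.

```python
# Hypothetical baseline MFCC extraction; librosa is an assumed dependency.
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    y, sr = librosa.load(path, sr=None)       # keep the file's native sampling rate
    # librosa applies DFT -> Mel filter bank -> log -> DCT internally
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)              # one n_mfcc-dimensional vector per utterance
```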
This paper is structured as follows: literature related to emotion recognition using speech signals is discussed in Section 2. The datasets used for our work are described in Section 3. Section 4 explains the experimental setup, feature extraction, and the models used for our research work. Section 5 includes the results and discussion. The conclusion and future work are discussed in Section 6.

II. RELATED WORKS

Emotion recognition using speech is a complicated and challenging task. SER has many practical benefits and applications in the real world. Hence, it has been a topic of interest for many researchers over the last three decades. Researchers have used different combinations of features and classifiers to improve emotion recognition. Five spectral features (MFCC, Mel-scaled spectrogram, chromagram, tonnetz, and spectral contrast) were used by Issa et al. [6] to improve emotion recognition. They achieved 71.61%, 86.1%, and 64.3% accuracy for the RAVDESS, EMO-DB, and IEMOCAP databases, respectively. Lukose et al. [9] recognized emotions using MFCC characteristics and used both GMM and SVM models as classifiers. With the SVM and GMM classifiers, they achieved accuracies of 81.57% and 76.31%, respectively, for the EMO-DB database. To recognize emotions, Ancilin et al. [3] modified the MFCC feature and suggested a new feature called MFMC. The MFMC feature and an SVM classifier were used by the authors for SER, and they concluded that MFMC outperformed MFCC. Deb et al. [10] used an SVM classifier and a special residual sinusoidal peak amplitude (RSPA) feature to recognize emotions. They found that RSPA outperformed MFCC, LPCC, and the Teager-energy-operator based critical band autocorrelation envelope attributes (TEO-CB-Auto-Env). Özseven et al. [11] used MFCC, pitch, formant, and bandwidth as features and SVM classifiers for emotion recognition, achieving accuracies of 82.8% and 74.3% for the EMO-DB and SAVEE datasets. Zeng et al. [12] used spectrogram features extracted from the RAVDESS dataset and deep learning classifiers for emotion recognition. Venkataramanan et al. [13] used a two-dimensional CNN classifier and log-Mel spectrogram features for the RAVDESS dataset to classify emotions. They attained a SER accuracy of 68%. Deb et al. [14] used the harmonic-to-peak energy ratio (HPER) feature and an SVM model to classify emotions and achieved better SER accuracy compared to MFCC, LPC, and TEO-CB-Auto-Env.

III. DATABASE

In our work, we used four datasets: RAVDESS, EMO-DB, SAVEE, and TESS. The descriptions of the datasets are given below.

A. RAVDESS

The RAVDESS dataset includes 1440 speech files. Twenty-four actors (12 male and 12 female) participated in recording the dataset. The dataset includes both audio and video files for research purposes. Eight emotions are present in the database: sad, happy, angry, calm, fearful, surprised, neutral, and disgust. In our experimental work, we utilized only the audio from the database.

B. EMO-DB

This database was produced by the Institute of Communication Science at the Technical University of Berlin in Germany. Five professional men and five professional women participated in the database's recording. The database includes a total of 535 utterances. The EMO-DB dataset contains seven emotions: anger, fear, boredom, neutral, happiness, disgust, and sadness. While being recorded, the data was sampled at a rate of 48 kHz and then downsampled to 16 kHz.

C. TESS

There are 2800 recordings of female speakers only in the TESS database. It is an excellent database that may be used to train models without overfitting. Two women, aged 26 and 64, participated in recording the database. Anger, disgust, fear, joy, pleasant surprise, sadness, and neutral are the seven emotions included in the database.

D. SAVEE

This database includes a total of 480 speech files. Four English-speaking postgraduate students at the University of Surrey, aged 27 to 31, recorded this dataset. The dataset is freely available to all researchers for experimental research work. The recorded emotions present in this dataset are anger, disgust, fear, happiness, sadness, surprise, and neutral.

IV. METHODOLOGY

The proposed SER methodology is described in this section. It includes feature extraction, the DNN and CNN architectures, and the experimental setup.


A. Feature Extraction

The selection of well-specified characteristics determines the performance accuracy of SER. To distinguish the emotions precisely and correctly, we must choose features that convey the most emotional information. In our study we used the MFMC feature, a modified version of the MFCC feature. The block diagram used for MFMC feature extraction is shown in Fig. 2.

Fig. 2: Block diagram of MFMC feature extraction (speech signal → pre-emphasis → framing and windowing → DFT → Mel filter bank → log(.) → MFMC)

The speech spectrum has a high amplitude at lower frequencies and a low amplitude at high-frequency components. Hence, the signal is first passed through a preemphasis filter to balance the speech spectrum. The speech signal is non-stationary but is assumed to be stationary over a short duration. Hence, we segmented the speech signal into small frames of 20 ms to 50 ms; in our work, we used a frame size of 23 ms. After segmentation, we applied a Hamming window to smooth each frame and improve the spectral resolution. Then, we evaluated the DFT magnitude of each frame and passed it as input to the Mel filter bank, which converts the magnitude spectrum to the Mel spectrum. The Mel scale approximates the non-linear frequency range of human auditory perception from the linear frequency scale. A frequency can be converted to a Mel-scaled frequency using Equation (1):

Mel(f) = 2595 × log10(1 + f / 700)    (1)

where Mel(f) is the frequency that corresponds to the Mel scale and f is the frequency in Hz. Finally, the MFMC coefficients are extracted by passing the Mel filter bank output through a logarithmic scale. The method used to extract the MFMC feature is the same as for MFCC except for two changes: first, the magnitude of the DFT is used instead of the magnitude squared, and second, the discrete cosine transform is excluded in the MFMC feature extraction. In our experiment we used 40 MFMC coefficients for emotion classification. Our experimental results show that the MFMC feature performed better compared to MFCC.
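To make the two differences from MFCC concrete, the following is a minimal sketch of the Fig. 2 pipeline (pre-emphasis, 23 ms Hamming-windowed frames, DFT magnitude, Mel filter bank, log compression, no DCT). The paper does not name an implementation; librosa/NumPy, the pre-emphasis coefficient, and the utterance-level averaging are assumptions.

```python
import numpy as np
import librosa

def extract_mfmc(path, n_coeff=40, frame_ms=23, preemph=0.97):
    """Sketch of MFMC extraction: like MFCC, but the DFT magnitude is used
    (not the magnitude squared) and the final DCT step is omitted."""
    y, sr = librosa.load(path, sr=None)
    y = np.append(y[0], y[1:] - preemph * y[:-1])      # pre-emphasis filter (coefficient assumed)

    frame_len = int(sr * frame_ms / 1000)              # ~23 ms frames
    stft = librosa.stft(y, n_fft=frame_len, hop_length=frame_len // 2, window="hamming")
    mag = np.abs(stft)                                 # DFT magnitude |X(k)|

    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_coeff)
    mel_mag = mel_fb @ mag                             # Mel-scaled magnitude spectrum
    mfmc = np.log(mel_mag + 1e-10)                     # log compression; no DCT applied

    return np.mean(mfmc, axis=1)                       # 40-dimensional utterance-level vector
```

Swapping the magnitude for the squared magnitude and applying a DCT to the log output would recover the conventional MFCC chain, which is exactly what MFMC avoids.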
B. Proposed Models

In our study, we used two models, a DNN and a CNN. The descriptions of the models are given in the subsections below.

1) Deep Neural Network: A DNN is an artificial neural network with a number of hidden layers between the input and output layers. Each layer has several nodes, known as "neurons", connected to the nodes of the next and previous layers with suitable weights. Neurons are responsible for transferring information from one layer to another. The output of each neuron depends on the weights and the cost function, which measures the difference between the actual and predicted output. The weights are updated with respect to the cost function until the desired output is achieved. DNNs are used for several applications such as speaker recognition, image classification, and emotion recognition. In the proposed DNN model, the mean feature vector is passed through a flattened layer to maintain the input shape. We used a DNN model with four hidden layers in our work. The first, second, third, and fourth hidden layers consist of 920, 810, 770, and 650 nodes, followed by batch normalization and dropout layers with rates of 0.1, 0.2, 0.2, and 0.3, respectively. The exponential linear unit (ELU) is used as the activation function throughout all layers. The number of nodes in the final output layer is the same as the number of emotions in the dataset, and a softmax activation function is used at the output to predict the emotion class.
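A sketch of this architecture in Keras is given below; the layer widths, dropout rates, ELU activations, and softmax output follow the description above, while the framework choice and all remaining defaults are assumptions rather than the authors' code.

```python
from tensorflow.keras import layers, models

def build_dnn(n_features=40, n_classes=8):
    """Four hidden layers (920/810/770/650 nodes), each followed by batch
    normalization and dropout (0.1/0.2/0.2/0.3), ELU activations, softmax output."""
    model = models.Sequential([layers.Input(shape=(n_features,)), layers.Flatten()])
    for units, rate in [(920, 0.1), (810, 0.2), (770, 0.2), (650, 0.3)]:
        model.add(layers.Dense(units, activation="elu"))
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(rate))
    model.add(layers.Dense(n_classes, activation="softmax"))  # n_classes = emotions in the dataset
    return model
```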
2) Convolutional Neural Network: A CNN has three kinds of layers: an input layer, an output layer, and convolutional layers between the input and output layers. Each convolutional layer has several filters that perform the convolution operation with the input. The resulting output becomes the input to the next convolutional layer after passing through a max-pooling layer, an activation function, and a batch normalization layer. Max-pooling layers are used to reduce the feature dimension of each convolutional layer's output. The output of the final convolutional layer becomes the input to the flattened layer, which converts the data to a one-dimensional vector and feeds the output layer. In our work, we used a CNN with three convolutional layers. The first, second, and third layers have 512, 256, and 512 filters, respectively, with kernel sizes equal to five. Each convolutional layer's output passes through a max-pooling layer of kernel size five, a batch normalization layer, and a dropout layer of 0.2. The ELU activation function is used throughout all convolutional layers. The output of the third convolutional layer becomes the input to the flattened layer, followed by one dense layer and the output layer. The dense layer has 512 neurons and a dropout of 0.3. The output layer has as many neurons as there are emotion classes in the speech dataset, and the softmax activation function is used to classify the emotions.
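For comparison, a corresponding Keras sketch of this CNN is shown below. The filter counts, kernel size, pooling size, dropout rates, and the 512-unit dense layer follow the text; treating the 40-coefficient feature vector as a one-channel 1-D sequence and using 'same' padding (so the short input survives three pooling stages) are assumptions.

```python
from tensorflow.keras import layers, models

def build_cnn(n_features=40, n_classes=8):
    """Three Conv1D blocks (512/256/512 filters, kernel size 5) with max pooling
    (size 5), batch normalization and 0.2 dropout, then Flatten, a 512-unit
    dense layer with 0.3 dropout, and a softmax output layer."""
    model = models.Sequential([layers.Input(shape=(n_features, 1))])
    for n_filters in (512, 256, 512):
        model.add(layers.Conv1D(n_filters, kernel_size=5, padding="same", activation="elu"))
        model.add(layers.MaxPooling1D(pool_size=5, padding="same"))  # 'same' padding assumed
        model.add(layers.BatchNormalization())
        model.add(layers.Dropout(0.2))
    model.add(layers.Flatten())
    model.add(layers.Dense(512, activation="elu"))
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model
```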
C. Experimental setup

We extracted the MFMC and MFCC features from the RAVDESS, EMODB, SAVEE, and TESS datasets for our research using Python. We used 40 coefficients of both the MFCC and MFMC features to train and test the models. We randomly divided each dataset into five equal parts. Each time, four folds were used to train the model and the remaining fold was used to test it. This process was repeated five times, and we averaged the outputs of the five experiments to obtain the SER accuracy. We ran each experiment for 700 epochs. The model parameters are the same for all the databases. We used the Adam optimizer with a learning rate of 0.0001 and the sparse categorical cross-entropy loss function for our research.
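A minimal sketch of this protocol (5-fold cross-validation, 700 epochs, Adam with learning rate 0.0001, sparse categorical cross-entropy), assuming scikit-learn's KFold and the model builders sketched above; X is an array of 40-dimensional feature vectors, y holds integer emotion labels, and the shuffling seed is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.optimizers import Adam

def cross_validate(build_model, X, y, n_classes, epochs=700):
    """Five-fold cross-validation: train on four folds, test on the fifth,
    and report the mean test accuracy over the five runs."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = build_model(n_features=X.shape[1], n_classes=n_classes)
        model.compile(optimizer=Adam(learning_rate=1e-4),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
        scores.append(acc)
    return float(np.mean(scores))

# e.g. mean_acc = cross_validate(build_dnn, X, y, n_classes=8)
# (for build_cnn, add a channel axis first: X[..., np.newaxis])
```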


V. RESULTS AND DISCUSSION

In this section, we compare the emotion recognition accuracy obtained using the MFMC and MFCC features with the CNN and DNN classifiers. We also compare the SER accuracy of our proposed models with state-of-the-art methods. Table I shows the comparison of classification accuracy for the MFMC and MFCC features using the CNN and DNN classifiers for all the datasets (RAVDESS, EMODB, SAVEE, and TESS). We observed the following points from the table: (1) the MFMC feature performed better than MFCC for all the datasets and classifiers; (2) the proposed DNN model performed better in SER than the CNN for all the datasets; (3) the TESS dataset achieved an accuracy of 100% using MFMC as the feature for both models (CNN and DNN).

From Table I, it is observed that the MFMC feature with the DNN classifier achieved an accuracy of 76.72%, 84.72%, 77.88%, and 100% for the RAVDESS, EMODB, SAVEE, and TESS datasets, respectively. The MFCC feature with the DNN classifier achieved an accuracy of 73.26%, 82.24%, 76.87%, and 99.28% for the same datasets. Similarly, the CNN classifier using the MFMC feature attained an accuracy of 72.9%, 82.41%, 74.35%, and 100%, and the MFCC feature with the CNN classifier achieved an accuracy of 68.54%, 81.31%, 73.13%, and 97.65% for the datasets mentioned earlier.

TABLE I: ACCURACY COMPARISON OF MFMC AND MFCC FEATURES USING CNN AND DNN CLASSIFIERS FOR ALL THE DATABASES

Database   Classifier   MFMC feature (%)   MFCC feature (%)
RAVDESS    DNN          76.72              73.26
RAVDESS    CNN          72.9               68.54
EMODB      DNN          84.72              82.24
EMODB      CNN          82.41              81.31
SAVEE      DNN          77.88              76.87
SAVEE      CNN          74.35              73.13
TESS       DNN          100                99.28
TESS       CNN          100                97.65

A. Result analysis using DNN model

Tables II, III, IV, and V show the confusion matrices for the RAVDESS, EMODB, SAVEE, and TESS datasets, respectively, using the MFMC feature with the DNN classifier. The RAVDESS, EMODB, SAVEE, and TESS datasets achieved SER accuracies of 76.72%, 84.72%, 77.88%, and 100%, respectively.
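The paper does not show its evaluation code; as a hedged illustration, row-normalized confusion matrices and average accuracies of the kind reported in Tables II-IX can be derived from the pooled fold predictions with scikit-learn, for example:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def summarize(y_true, y_pred):
    """Return a row-normalized confusion matrix (in percent, rows = true emotions)
    and the overall accuracy, mirroring the layout of the tables below."""
    cm = confusion_matrix(y_true, y_pred).astype(float)
    cm_percent = 100.0 * cm / cm.sum(axis=1, keepdims=True)
    return np.round(cm_percent, 2), 100.0 * accuracy_score(y_true, y_pred)
```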


TABLE II: CONFUSION MATRIX FOR RAVDESS DATASET USING MFMC FEATURE AND DNN CLASSIFIER

           Ang.   Cal.   Dis.   Fea.   Hap.   Neu.   Sad.   Sur.
Angry      89.7   2.6    0      0      7.7    0      0      0
Calm       2.6    72.8   2.6    2.6    2.6    9.1    7.7    0
Disgust    0      5.1    74.4   2.6    0      0      5.1    12.8
Fear       0      0      2.6    81.6   2.6    0      10.6   2.6
Happy      2.6    0      0      2.6    74.7   2.6    5.3    12.2
Neutral    5.03   7.5    5.03   0      5.03   64.88  5.03   7.5
Sad        0      6.8    5.3    5.3    2.6    2.6    72.1   5.3
Surprised  5.3    0      0.5    0      0      5.3    5.3    83.6
Average accuracy = 76.72%

TABLE III: CONFUSION MATRIX FOR EMODB DATASET USING MFMC FEATURE AND DNN CLASSIFIER

          Ang.   Bor.   Dis.   Fea.   Hap.   Neu.   Sad
Angry     90.02  0      3.86   0      6.12   0      0
Boredom   0      77.4   1.18   0      0      16.44  4.98
Disgust   3      2      82.9   6.66   1.22   4.22   0
Fear      6.09   0      1.54   80.89  5.8    1.42   4.26
Happy     10.26  0      2.76   5.5    80.06  1.42   0
Neutral   1.24   7.5    0      1.52   1.24   84.76  3.74
Sad       0      1      0      1      0      1      97
Average accuracy = 84.72%

TABLE IV: CONFUSION MATRIX FOR SAVEE DATASET USING MFMC FEATURE AND DNN CLASSIFIER

           Ang.   Dis.   Fea.   Hap.   Neu.   Sad.   Sur.
Angry      80.34  6.66   1.66   6.36   0      1.66   3.32
Disgust    0      82.66  5.0    1.66   7.42   3.32   0
Fear       5.0    3.32   62     3.32   1.66   3.32   21.38
Happiness  6.64   1.66   1.66   73.68  1.66   0      14.7
Neutral    0      2.5    0      0      89.18  8.32   0
Sadness    0      5      0      0      14.68  80.32  0
Surprised  0      5      8      8.34   1.66   0      77
Average accuracy = 77.88%

TABLE V: CONFUSION MATRIX FOR TESS DATASET USING MFMC FEATURE AND DNN CLASSIFIER

           Ang.   Cal.   Dis.   Fea.   Hap.   Neu.   Sad
Angry      100    0      0      0      0      0      0
Calm       0      100    0      0      0      0      0
Disgust    0      0      100    0      0      0      0
Fear       0      0      0      100    0      0      0
Happiness  0      0      0      0      100    0      0
Neutral    0      0      0      0      0      100    0
Sadness    0      0      0      0      0      0      100
Average accuracy = 100%

For the RAVDESS dataset, the emotion angry has the highest recognition rate, and neutral has the lowest compared to the others. In the case of the EMO-DB dataset, the recognition rate of angry is the highest, as with RAVDESS, and boredom has the lowest recognition rate; boredom is mostly misclassified as neutral. Similarly, for the SAVEE dataset, recognition of the neutral emotion is highest and recognition of fear is lowest, with fear most often misclassified as surprise.

B. Result analysis using CNN model

Tables VI, VII, VIII, and IX show the confusion matrices for the RAVDESS, EMODB, SAVEE, and TESS datasets, respectively, using the MFMC feature with the CNN classifier. The RAVDESS, EMODB, SAVEE, and TESS datasets achieved SER accuracies of 72.9%, 82.41%, 74.35%, and 100%, respectively. For the RAVDESS dataset, the emotion calm has the highest recognition rate and neutral has the lowest compared to the others; neutral is misclassified as calm and sadness, and sadness is misclassified as calm. For the EMO-DB dataset, the recognition rates of anger and sadness are the highest and happiness has the lowest; boredom is misclassified as neutral, and happiness as anger. Similarly, in the SAVEE dataset, recognition of the neutral emotion is highest and fear is lowest, with fear, happiness, and anger misclassified as surprise.

TABLE VI: CONFUSION MATRIX FOR RAVDESS DATASET USING MFMC FEATURE AND CNN CLASSIFIER

           Ang.   Cal.   Dis.   Fea.   Hap.   Neu.   Sad.   Sur.
Angry      84.2   5.3    2.6    0      2.6    0      0      5.3
Calm       0      89.5   2.6    0      0      5.3    2.6    0
Disgust    7.9    2.6    76.3   2.6    2.6    2.6    2.6    2.6
Fear       2.6    0      2.6    71.8   7.7    0      12.8   2.6
Happy      7.9    0      2.6    10.5   63.2   5.3    5.3    5.3
Neutral    0      25     0      0      0      55.0   15     5
Sad        0      23.1   0      5.1    2.6    5.1    59.0   5.1
Surprised  2.6    0      2.6    0      10.5   0      0      84.2
Average accuracy = 72.9%

TABLE VII: CONFUSION MATRIX FOR EMODB DATASET USING MFMC FEATURE AND CNN CLASSIFIER

          Ang.   Bor.   Dis.   Fea.   Hap.   Neu.   Sad
Angry     91.42  0      0.76   3.12   4.7    0      0
Boredom   0      71.48  1.18   1.24   0      23.62  2.48
Disgust   0      4.24   83.12  6.44   4.22   0      2
Fear      1.42   1.58   0      88.36  2.96   1.42   4.26
Happy     15.42  1.42   4.08   4.08   73.5   1.5    0
Neutral   1.24   18.94  0      1.24   0      77.34  1.24
Sad       0      5.02   1.66   0      0      1.66   91.66
Average accuracy = 82.41%

TABLE VIII: CONFUSION MATRIX FOR SAVEE DATASET USING MFMC FEATURE AND CNN CLASSIFIER

           Ang.   Dis.   Fea.   Hap.   Neu.   Sad.   Sur.
Angry      69.32  3.32   3.32   8.32   1.66   0      14.98
Disgust    1.66   73.28  3.34   3.32   13.32  3.34   5.0
Fear       6.66   1.66   65.34  6.66   0      4.98   16.66
Happiness  6.66   3.32   3.32   68.02  1.66   0      20.0
Neutral    0      2.52   0      0.84   85.82  9.98   0.84
Sad        0      0      1.66   0      15.0   83.34  0
Surprised  3.32   6.66   9.98   4.98   0      1.66   75.32
Average accuracy = 74.35%

TABLE IX: CONFUSION MATRIX FOR TESS DATASET USING MFMC FEATURE AND CNN CLASSIFIER

           Ang.   Cal.   Dis.   Fea.   Hap.   Neu.   Sad
Angry      100    0      0      0      0      0      0
Calm       0      100    0      0      0      0      0
Disgust    0      0      100    0      0      0      0
Fear       0      0      0      100    0      0      0
Happiness  0      0      0      0      100    0      0
Neutral    0      0      0      0      0      100    0
Sadness    0      0      0      0      0      0      100
Average accuracy = 100%

C. Comparison with the state-of-the-art models

In this section, we compare our results with state-of-the-art methods. Tables X, XI, XII, and XIII show the performance comparison with state-of-the-art methods for the RAVDESS, EMODB, SAVEE, and TESS datasets, respectively.

Table X shows that our proposed DNN model performed better than the other methods in SER for the RAVDESS dataset. Ancilin et al. [3] used the MFMC feature and an SVM classifier for emotion recognition and achieved an accuracy of 64.31% for the RAVDESS dataset. Jalal et al. [15] used a combination of acoustic features and achieved an accuracy of 69.40%. Issa et al. [6] used a variety of acoustic attributes and a CNN classifier for emotion recognition and achieved an accuracy of 71.6%. Andayani et al. [16] used the MFCC feature and a hybrid LSTM model for SER and achieved an accuracy of 75.62%. In contrast, we achieved an accuracy of 76.72% with the MFMC feature and the DNN classifier.

TABLE X: PERFORMANCE COMPARISON OF SER WITH STATE-OF-THE-ART METHODS ON THE RAVDESS DATASET

Author                 Feature                                                     Model                     Accuracy
Ancilin et al. [3]     MFMC                                                        SVM                       64.31%
Jalal et al. [15]      Fundamental frequency, MFCC, log energy                     BLSTM + capsule routing   69.40%
Issa et al. [6]        MFCC, spectrogram, chromagram, spectral contrast, tonnetz   CNN                       71.6%
Andayani et al. [16]   MFCC                                                        Hybrid LSTM               75.62%
Proposed               MFMC                                                        DNN                       76.72%


Table XI shows that our proposed method achieved better accuracy than the others for the EMODB dataset. Lukose et al. [9] used the MFCC feature and a GMM classifier for SER and achieved an accuracy of 76.31%. Liu et al. [17] used formants as features and a combination of SVM and RBF as the classifier, achieving an accuracy of 78.66%. The authors of [18], [3], and [11] used spectral features and combinations of attributes with an SVM classifier and achieved accuracies of 79.6%, 81.5%, and 82.8%, respectively. In contrast, our proposed work attained an accuracy of 84.72%.

TABLE XI: PERFORMANCE COMPARISON OF SER WITH STATE-OF-THE-ART METHODS ON THE EMODB DATASET

Author                Feature                           Model      Accuracy
Lukose et al. [9]     MFCC                              GMM        76.31%
Liu et al. [17]       Formants                          SVM+RBF    78.66%
Wu et al. [18]        Modulation spectral features      SVM        79.6%
Ancilin et al. [3]    MFMC                              SVM        81.5%
Özseven et al. [11]   MFCC, pitch, formant, bandwidth   SVM        82.8%
Proposed              MFMC                              DNN        84.72%

Table XII compares our method with the state-of-the-art models for the SAVEE dataset. The authors of [19] and [7] used spectral features (MFCC and spectrograms) and a CNN classifier for emotion recognition and achieved accuracies of 65.83% and 72.39%, respectively. Özseven et al. [11] used a combination of spectral features and an SVM classifier for SER and achieved an accuracy of 74.3%. In contrast, our proposed DNN model with the MFMC feature attained an emotion recognition accuracy of 77.88%, which is better than the others.

TABLE XII: PERFORMANCE COMPARISON OF SER WITH STATE-OF-THE-ART METHODS ON THE SAVEE DATASET

Author                       Feature                           Model   Accuracy
Mekruksavanich et al. [19]   MFCC                              CNN     65.83%
Sun et al. [7]               Spectrograms                      CNN     72.39%
Özseven et al. [11]          MFCC, pitch, formant, bandwidth   SVM     74.3%
Ancilin et al. [3]           MFMC                              SVM     75.63%
Proposed                     MFMC                              DNN     77.88%

Table XIII shows that our proposed DNN model performed better than the others for the TESS dataset. The authors of [20], [21], and [22] used different combinations of spectral features for SER and attained accuracies of 99%, 99.46%, and 99.52%, respectively. In contrast, we achieved an accuracy of 100% using either the CNN or the DNN classifier with the MFMC feature.

TABLE XIII: PERFORMANCE COMPARISON OF SER WITH STATE-OF-THE-ART METHODS ON THE TESS DATASET

Author                   Feature                                               Model        Accuracy
Gokilavani et al. [20]   ZCR, MFCC, chroma STFT, RMS, Mel-scaled spectrogram   CNN          99%
Bansal et al. [21]       MFCC, Mel spectrogram, chroma                         Max voting   99.46%
Dolka et al. [22]        MFCC                                                  ANN          99.52%
Proposed                 MFMC                                                  DNN          100%

VI. CONCLUSION AND FUTURE WORK

Recognition of emotion from speech is a challenging task. Many authors have tried different methods to achieve better SER accuracy. We used a modified version of the MFCC feature, named MFMC, for SER, with two classifiers, CNN and DNN. From our experiments, we found that the MFMC feature performed better than MFCC for both classifiers (CNN and DNN) and that the performance of the DNN model is better than that of the CNN for both features (MFCC and MFMC). We compared our proposed model with state-of-the-art methods and found that our model performed better than the others. In the future, we will try combinations of acoustic and non-acoustic features and combinations of more than one deep learning model to improve emotion recognition accuracy.

REFERENCES

[1] C. M. Lee and S. S. Narayanan, "Toward detecting emotions in spoken dialogs," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 293–303, 2005.
[2] K. E. B. Ooi, L.-S. A. Low, M. Lech, and N. Allen, "Early prediction of major depression in adolescents using glottal wave characteristics and Teager energy parameters," in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 4613–4616.
[3] J. Ancilin and A. Milton, "Improved speech emotion recognition with mel frequency magnitude coefficient," Applied Acoustics, vol. 179, p. 108046, 2021.
[4] S. Deb and S. Dandapat, "Emotion classification using segmentation of vowel-like and non-vowel-like regions," IEEE Transactions on Affective Computing, vol. 10, no. 3, pp. 360–373, 2017.
[5] P. Warule, S. P. Mishra, and S. Deb, "Classification of cold and non-cold speech using vowel-like region segments," in 2022 IEEE International Conference on Signal Processing and Communications (SPCOM). IEEE, 2022, pp. 1–5.
[6] D. Issa, M. F. Demirci, and A. Yazici, "Speech emotion recognition with deep convolutional neural networks," Biomedical Signal Processing and Control, vol. 59, p. 101894, 2020.
[7] L. Sun, J. Chen, K. Xie, and T. Gu, "Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition," International Journal of Speech Technology, vol. 21, no. 4, pp. 931–940, 2018.
[8] S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLoS ONE, vol. 13, no. 5, p. e0196391, 2018.
[9] S. Lukose and S. S. Upadhya, "Music player based on emotion recognition of voice signals," in 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT). IEEE, 2017, pp. 1751–1754.
[10] S. Deb and S. Dandapat, "Emotion classification using residual sinusoidal peak amplitude," in 2016 International Conference on Signal Processing and Communications (SPCOM). IEEE, 2016, pp. 1–5.
[11] T. Özseven, "Investigation of the effect of spectrogram images and different texture analysis methods on speech emotion recognition," Applied Acoustics, vol. 142, pp. 70–77, 2018.
[12] Y. Zeng, H. Mao, D. Peng, and Z. Yi, "Spectrogram based multi-task audio classification," Multimedia Tools and Applications, vol. 78, no. 3, pp. 3705–3722, 2019.
[13] K. Venkataramanan and H. R. Rajamohan, "Emotion recognition from speech," arXiv preprint arXiv:1912.10458, 2019.
[14] S. Deb and S. Dandapat, "Classification of speech under stress using harmonic peak to energy ratio," Computers & Electrical Engineering, vol. 55, pp. 12–23, 2016.
[15] M. A. Jalal, E. Loweimi, R. K. Moore, and T. Hain, "Learning temporal clusters using capsule routing for speech emotion recognition," in Proceedings of Interspeech 2019. ISCA, 2019, pp. 1701–1705.
[16] F. Andayani, L. B. Theng, M. T. Tsun, and C. Chua, "Hybrid LSTM-transformer model for emotion recognition from speech audio files," IEEE Access, vol. 10, pp. 36018–36027, 2022.
[17] Z.-T. Liu, A. Rehman, M. Wu, W.-H. Cao, and M. Hao, "Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence," Information Sciences, vol. 563, pp. 309–325, 2021.
[18] S. Wu, T. H. Falk, and W.-Y. Chan, "Automatic speech emotion recognition using modulation spectral features," Speech Communication, vol. 53, no. 5, pp. 768–785, 2011.
[19] S. Mekruksavanich, A. Jitpattanakul, and N. Hnoohom, "Negative emotion recognition using deep learning for Thai language," in 2020 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON). IEEE, 2020, pp. 71–74.
[20] M. Gokilavani, H. Katakam, S. A. Basheer, and P. Srinivas, "Ravdness, crema-d, tess based algorithm for emotion recognition using speech," in 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT). IEEE, 2022, pp. 1625–1631.
[21] M. Bansal, S. Yadav, and D. K. Vishwakarma, "A language-independent speech sentiment analysis using prosodic features," in 2021 5th International Conference on Computing Methodologies and Communication (ICCMC). IEEE, 2021, pp. 1210–1216.
[22] H. Dolka, A. X. VM, and S. Juliet, "Speech emotion recognition using ANN on MFCC features," in 2021 3rd International Conference on Signal Processing and Communication (ICPSC). IEEE, 2021, pp. 431–43.
