0% found this document useful (0 votes)

2 views16 pages

Emotion Classification From Speech Signal Based On

This paper investigates an emotion recognition system from speech signals, aiming to classify seven emotional states using empirical mode decomposition and non-linear features. The method employs entropy measures derived from intrinsic mode functions of speech signals, achieving a peak balanced accuracy of 93.3% with the Linear Discriminant Analysis classifier. The study highlights the effectiveness of using advanced signal processing techniques for emotion detection in human-computer interaction applications.

Uploaded by

Anithadevi N

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

2 views16 pages

Emotion Classification From Speech Signal Based On

Uploaded by

Anithadevi N

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Complex & Intelligent Systems

https://doi.org/10.1007/s40747-021-00295-z

ORIGINAL ARTICLE

Emotion classification from speech signal based on empirical mode

decomposition and non-linear features

Speech emotion recognition

Palani Thanaraj Krishnan1 · Joseph Raj Alex Noel2 · Vijayarajan Rajangam3

Received: 11 October 2020 / Accepted: 2 February 2021

Abstract
Emotion recognition system from speech signal is a widely researched topic in the design of the Human–Computer Interface
(HCI) models, since it provides insights into the mental states of human beings. Often, it is required to identify the emotional
condition of the humans as cognitive feedback in the HCI. In this paper, an attempt to recognize seven emotional states from
speech signals, known as sad, angry, disgust, happy, surprise, pleasant, and neutral sentiment, is investigated. The proposed
method employs a non-linear signal quantifying method based on randomness measure, known as the entropy feature, for
the detection of emotions. Initially, the speech signals are decomposed into Intrinsic Mode Function (IMF), where the IMF
signals are divided into dominant frequency bands such as the high frequency, mid-frequency , and base frequency. The
entropy measures are computed directly from the high-frequency band in the IMF domain. However, for the mid- and base-
band frequencies, the IMFs are averaged and their entropy measures are computed. A feature vector is formed from the
computed entropy measures incorporating the randomness feature for all the emotional signals. Then, the feature vector is
used to train a few state-of-the-art classifiers, such as Linear Discriminant Analysis (LDA), Naïve Bayes, K-Nearest Neighbor,
Support Vector Machine, Random Forest, and Gradient Boosting Machine. A tenfold cross-validation, performed on a publicly
available Toronto Emotional Speech dataset, illustrates that the LDA classifier presents a peak balanced accuracy of 93.3%,
F1 score of 87.9%, and an area under the curve value of 0.995 in the recognition of emotions from speech signals of native
English speakers.

Keywords Speech signal · Emotion perception · Entropy measures · Linear discriminant analysis · Empirical mode
decomposition

Introduction

Speech signals have a huge impact on the current modes

of communication, such as emails and text messages. Even
B Joseph Raj Alex Noel
in written messages, emotion representations like emojis
[email protected]
are inserted to reveal our emotional states. Speech com-
Palani Thanaraj Krishnan
[email protected]
munication is more prominent and effective when the text
communication fails to reveal the emotional states.
Vijayarajan Rajangam
[email protected]
The speech signal is of research interest over the decades
for various applications such as emotion perception, HCI, bio
1 Department of Electronics and Instrumentation Engineering, metrics, and so on [1]. Also, emotion analysis from the audi-
St. Joseph’s College of Engineering, Chennai, India tory signal of humans has become more prominent research
2 Department of Electronic Engineering, Shantou University, due to (a) the availability of fast computing systems, (b) the
Shantou, China effectiveness of various signal processing algorithms, and
3 Division of Healthcare Advancement, Innovation and (c) the acoustic differences in speech signals that are natu-
Research, Vellore Institute of Technology, Chennai, India

123
Complex & Intelligent Systems

rally embedded in various emotional situations. An in-depth frequency cepstral coefficient, are extracted from each frame
analysis of speech signals in different domains is helpful of the speech signal [9]. A high-dimensional feature vector
in recognizing the emotions from the auditory signals of is structured from the first- and second-order derivatives of
the people who are unable to communicate through proper the above-said feature vector. The dimension reduction of
speech signals. Furthermore, the speech signal analysis is the feature vector is carried out by quantum behaved particle
also used to study the heart rate of the speaker [2]. The swarm optimization. The reduced feature vector is classified
broader research perspective of Speech Emotion Classifi- by a Gaussian elliptical basis function neural network clas-
cation (SEC) finds its applications in crime investigation, sifier. Palo et al. proposed an SER system in wavelet domain
psychiatric diagnosis, human–computer interaction, fatigue based on Mel-frequency coefficients [10]. Both static and
detection, auxiliary disease diagnosis, bio metrics, and many dynamic elements of the coefficients are combined for an
more. SER system. The above-said feature coefficients are reduced
The basic emotions are categorized into sadness, fear, hap- in dimension using Principal Component Analysis (PCA)
piness, disgust, surprise, and anger [3]. The combination of and linear discriminant analysis [11]. Jing et al. suggested an
basic emotions leads to other emotions such as love, affec- SER system using prominence features and traditional acous-
tion, amusement, contempt, excitement, embarrassment, and tic features [12]. The combined feature vector is reduced in
so on. Over the decades, various studies have been conducted dimension using PCA and non-parametric discriminant anal-
in the field of SEC where the general pipeline includes feature ysis. The features are classified using four types of supervised
extraction, dimensionality reduction, and emotion classifica- learning classifiers. Wavelet-based features, extracted from
tion. The broad literature for emotion analysis suggests two the speech signals, are used for SEC in [13]. In [14], spectral
preferable features, known as statistical and temporal fea- features with Naïve Bayes(NB) classifier is employed.
tures. [4,5]. A set of methods on speech emotion classification is based
Speech Emotion Recognition (SER) system can be struc- on hidden Markov model [15], Gaussian Mixture Model
tured by analyzing well-crafted features that effectively (GMM) [16], Self-Organizing Map (SOM) [17], and neural
expose each emotion in the speech signals [6]. The vary- network [18]. Singular Value Decomposition (SVD) clas-
ing length and continuous nature of speech signals require sifier is used in [19], whereas, in [20], ensemble software
local and global features for emotion recognition. The regression model is proposed for emotion classification. A
local features represent temporal dynamics, Whereas the deep belief network based on high- and low-level features is
global features expose the statistical aspects like standard also proposed for SEC [21]. Pao et al. proposed a method
deviation, mean, and minimum and maximum values. The based on Support Vector Machine (SVM) and neural net-
features of SER system are categorized into prosodic fea- works to classify five emotions such as anger, surprise,
tures, spectral features, voice quality features, and Teager neutral, happiness, and sadness [22]. Xiao et al. suggested
energy operator-based features. Prosodic features, such as a classifier that uses several sub classifiers for the classifica-
rhythm and intonation, are the features based on human’s tion of seven types of emotions [23]. Lin and Wei presented
perception. These features are based on energy, duration, a method that was experimented on gender-dependent and
and fundamental frequency. Spectral features are extracted gender-independent experiments [24]. More recently, Xie
in frequency domain using transforms and have received et al. developed a frame-level emotion recognition system
wide attention due to their ability of representing vocal card based on attention model in recurrent neural networks. They
characteristics [5]. Short-term power spectrum is presented validated their system for English and non-English speech
by Mel-frequency cepstral coefficients, whereas vocal tract signals [25]. Demircan and Kahramanli proposed spectral
characteristics are presented by linear prediction coefficients. features based on Mel Cepstral coefficients and linear pre-
Logarithmic filtering of auditory system is characterized by diction coefficients for speech emotion detection. Later, they
log-frequency power coefficients using Fourier transform [7]. used Fuzzy c-means for feature dimension reduction which
Voice quality measurements, such as jitter, harmonics-to- was further given as input to machine learning classifiers.
noise ratio, and shimmer, exploit the relation between vocal They used German speech emotion dataset for their work
tract characteristics and emotion content. Teager features [26].
detect stresses happening to the vocal tract muscles in the Our contributions are motivated by (a) the non-stationary
form of energy operator [8]. A few spectral and temporal nature of speech signals and classical signal processing meth-
feature-based SER systems are discussed below. ods such as Fourier and wavelet analysis use predefined basis
Fatemeh Daneshfar et al. proposed a hybrid SER system functions failing to extract relevant information regarding
comprising of feature extraction, dimensionality reduction, emotions and (b) the above transformation techniques are
and classification stages. In the feature extraction stage, three block-based methods, wherein a group of samples surround-
features, such as perceptual minimum variance distortion less ing the centre element are projected on to the respective
response, perceptual linear prediction coefficient, and Mel- basis function. Selection of an optimum window size is a

123
Complex & Intelligent Systems

additional requirement for improving the detection accuracy Toronto, known as, Toronto Emotional Speech Set (TESS)
and elimination of artifacts for slow time-varying emotion [28]. The dataset consists of speech signals recorded from
like sadness. Therefore, there is a need to investigate the two native English participants of age 26 and 64 respectively
classification accuracy of human emotions through data- speaking about 200 target words which completes the phrase
driven signal processing methods such as Empirical Mode ”Say the word—-”. These phrases are captured with seven
Decomposition (EMD) and non-linear features. This paper different emotions of the speakers, namely anger, disgust,
investigates an SEC approach where emotions are recog- fear, happiness, pleasant surprise, sadness, and neutral. The
nized from speech signals by decomposing them into intrinsic duration of the records vary between 2 and 3 s and is sampled
mode functions. Later, five unique randomness measures at 22 KHz. Figure 2 illustrates the speech signals for different
are computed through entropy measures and state-of-the-art emotions from the TESS dataset. For analysis, 200 record-
machine learning classifiers are trained on the entropy fea- ings of each emotion class were taken for the development
tures. Finally, the performance of the model is validated using of the speech recognition system. It should be noted that the
standard quantitative metrics on a publicly available emotion original recordings are of high quality (recorded in a noise
classification dataset. less environment) and, therefore, do not require additional
The rest of the paper is organized as follows. “Materi- pre-processing steps.
als and methods” elaborates on the proposed methodology,
the extraction of IMFs through EMD, the computation of Empirical mode decomposition
the randomness through entropy features, and the detailed
analysis of the need for different entropy measures. Results In this section, the EMD of a signal is analyzed. Suppose
and discussion are presented in “Results” and “Discussion” x(t) be a time-series speech signal that delivers the IMF sig-
followed by the conclusion in “Conclusion”. nals c(t) and the residue function r (t) when decomposed by
the EMD method. Equation (1) illustrates the decomposition
process: [27]:
Materials and methods

d
Speech signal is a time-varying signal and requires proper x(t) = c(t) + r (t), (1)
selection of a signal processing method to extract the rel- i=1
evant features for emotion recognition. In this paper, the
speech signals are analyzed in IMF domain using EMD. where ‘d’ is the number of IMFs generated for the input sig-
Unlike conventional signal processing methods using prede- nal x(t).
fined basis function such as Fourier transform and Wavelet
transform, EMD relies on the extraction of inherent patterns For the experiments, d is preset to a value of 10 com-
in the data for decomposing a signal into intrinsic signals ponents. Preliminary analysis illustrates that setting a lower
[27]. Figure 1 shows the block diagram of the proposed value to ‘d’ leads to less number of decomposed IMFs result-
speech recognition system. The speech signals of duration ∼ ing in the loss of information. On contrary, a large value for
2 s are initially decomposed into dominant, mid-, and base- ‘d’ leads to higher levels of decomposition but at a consider-
band IMF frequencies. Here, windowing techniques are not able computational cost. Hence, an optimal value of 10 was
involved, and hence, inherent features corresponding to the chosen based on the ad hoc analysis at different levels of
emotions are extracted with a higher confidence. Non-linear decomposition. Figure 3 shows the decomposed speech sig-
features based on entropy are extracted from the decomposed nal using EMD. The decomposed signal captures different
IMF signals. A feature vector is constructed from the entropy oscillatory features of the speech signal in both temporal and
features and used to train a set of classifiers such as LDA, frequency domain.
Naïve Bayes (NB), K-Nearest Neighbor (K-NN), Support
Vector Machine (SVM), Random Forest (RF), and Gradi- Principal frequency modes
ent Boosting machine (GB). Finally the performance of the
classifiers is evaluated through balanced accuracy, F1 score, EMD decomposes a time-series signal into IMFs that are
recall, area under the curve, specificity, and precision. localized in time and frequency domains. Since different
emotions are captured in distinct frequency components of
Emotion dataset the IMF signal, the information content in each IMF sig-
nal is not uniform and varies depending on the input speech
To present a realistic comparison, the proposed emotion signal. Speech signals pertaining to happy and pleasant sur-
recognition system from speech signals was trained and prise are positive emotions. Meanwhile, negative emotions
tested on publicly available dataset provided by University of such as angry, fear, disgust, and sad are captured in different

123
Complex & Intelligent Systems

Fig. 1 Proposed method for

speech signal recognition
system based on EMD and
non-linear features: a simplified
representation. b Detailed
illustration of different
processing steps involved in
proposed SER system

frequency scales [29]. Hence, predefined selection of any dle the loss of information. The categories are represented
IMF component or frequency scale will lead to a loss of as follows: (a) the lower order IMFs starting from IMF-1 to
information. The IMF signals are decomposed into three IMF-6 represent the high-frequency modes Hfd1-6 , (b) IMF-7
frequency groups, namely, the High-Frequency (HF), the and IMF-8 correspond to mid-frequency modes Mfd7-8 , and
Mid-Frequency (MF), and the Low-Frequency (LF) modes (c) the higher order IMFs, namely, IMF-9 and IMF-10, corre-
based on the frequency content, as shown in Fig. 3, to han- spond to low-frequency modes L fd9-10 . The last component rt

123
Complex & Intelligent Systems

Fig. 2 Speech Signals for different emotions: a angry, b disgust, c fear, d happy, e neutral, f pleasant surprise, and g sad

1
8
is the residue mode which corresponds to the baseline activ-
ity of the signal. To enhance the discrimination ability of the Mfd = ci (t) (3)
N
i=7
speech signals, especially for negative emotions such as sad,
fear, and disgust, with reduced number of trainable features, 1 10

an averaging scheme is proposed for mid-frequency, low- L fd = ci (t) + r (t). (4)

N
i=9
frequency, and residual modes based on the power spectral
density distribution.
Equations (2), (3), and (4) mathematically represent the The proposed frequency-based categorization technique
different modes of IMFs: Here Hfd1-6 , Mfd , and L fd denote can be validated by observing the Power Spectral Density
the high-frequency mode, mid-frequency, and low-frequency (PSD) plot of IMFs, as shown in Fig. 4, of speech emo-
modes respectively: tions like happy, angry, and sad. From the figure, it could be
discerned that the IMF modes from IMF-1 to IMF-6 show
unique power spectral density patterns compared to other
higher order IMFs. Therefore, IMF-1–IMF-6 are considered
Hfd1-6 = ci (t), i = 1, 2, 3, 4, 5, 6 (2) separately and directly used for feature extraction. Mean-

123
Complex & Intelligent Systems

Fig. 3 Decomposed IMFs of speech signals: a angry, b happy, and c sad. Here, x axis represents the sample number and the y axis denotes the IMF
amplitude. We could observe that different emotions show distinct IMF signals

while, mid-frequency, low-frequency modes, and the residue lems. They have been reported with good classification
functions show similar PSD patterns. Therefore, these modes performance for many biomedical applications that involves
are averaged as explained in Eqs. (3) and (4), respectively. ECG and EEG signals. Though the speech signals are inher-
Hence, eight unique IMFs from the input speech signals for ently different from biological signals, the oscillations within
feature extraction process are selected based on the PSD dis- the signals define each emotions. Here, it is attempted to
tribution. quantify the randomness measure by computing entropy
functions. Thus, five entropy measures such as Approximate
entropy (ApEnt), Sample entropy (SamEnt), singular value
Feature extraction
decomposition entropy (SVDEnt), Permutation entropy (Per-
mEnt), and Spectral entropy (SpecEnt) are used for extracting
Non-linear features based on randomness measure and chaos
randomness features from the speech signals. For each IMF,
theory have been widely used in signal classification prob-

123
Complex & Intelligent Systems

Fig. 4 Power spectral density of some of the speech signals with emotions: a–c angry, d–f, and happy g–i sad for the three frequency groups,
namely, HF (IMF-1–IMF-6), MF (IMF-7–IMF-8), LF (IMF-9, IMF-10), and residue modes

a total of 5 different entropy measures are computed resulting similarity value, respectively. ApEnt be computed through
to 40 different trainable features: Eq. (6) [30]:

Hfd1-6 ∈ R1×30 , Mfd ∈ R1×5 , L fd ∈ R1×5 . (5) ApEntx(t) = φrm − φrm+1 . (6)

The proceeding sections present (a) the different entropy Here, x(t) is the speech signal and φr is the correlation
measures, (b) a detailed investigation on the need for different integral function for the phase space vectors (embedded sig-
entropy measures, and (c) a brief analysis of the different nal). For experiments, the chosen values are m = 3 and
types of classifiers used in this study. r = 0.2std(x(t)) based on the work of [31]. Here, ‘std’ refers
to the standard deviation of the input signal.
Approximate entropy
Sample entropy
Approximate entropy is a complexity measure widely used
in the regularity analysis of time-series signal. It quantifies Sample entropy is a modified version of approximate entropy
the amount of randomness based on signal fluctuations. A where the limitations such as self-similar pattern bias are
lower value of ApEnt suggests that the time-series signal overcome [32]. Here, the similarity measure is computed
is regular and a higher value demonstrates the randomness. based on various embedded time-series samples and avoids
The parameters ‘m’ and ‘r’ define the delay parameter and computing the self-similarity measure between the samples.

123
Complex & Intelligent Systems

It reduces the bias which is inherent in approximate entropy. Table 1 Entropy measure and its attributes in speech analysis
The representation of sample entropy is provided in Eq. 7 Randomness measure Implication
[32]:
Approximate entropy Computes irregularity in the
z(m + 1, r ) speech signal, however considers
SamEntx(t) = − log . (7) self-similar patterns in the input
z(m, r ) signal
Sample entropy Modification of Approximate
Here, ‘z’ is the measure of similarity of a embedded time- entropy with no bias of
series for ‘m’ and ‘m + 1’ and r is the tolerance level. The self-similar patterns
values of ‘m’ and ‘r’ are identical to the values used for SVD entropy Computes the randomness measure
approximate entropy. based on decomposition of
high-dimensional data using
singular values
SVD entropy
Spectral entropy computes the randomness in the
power spectral density function
Singular value decomposition represents the dimensionality of the speech signal
of the data. It decomposes the high-dimensional data into Permutation entropy Uses ordinal patterns in the speech
orthogonal matrices based on the singular values ‘σ ’. The signal to detect the emotions
time-series signal is converted to a embedded matrix based on
different time delayed template vector taken from the speech
signal. The embedded matrix is decomposed into various denotes the size of the embedded matrix. Usually, τ and D are
orthogonal matrices. However, the SVD entropy measure is set to 1 and 3, respectively. The different ordinal patterns are
computed only on the diagonal matrix which contain the sin- tabulated and verified with the column vector of the embed-
gular values. Equation (8) represents the SVD computation ded matrix. The number of occurrence of the ordinal patterns
[33]: in the matrix is counted and the probability of occurrence,
‘ψi ’, of each pattern is tabulated. The permutation entropy

L
is computed as given below [35] :
SVDEntx(t) = − σi log2 (σi ). (8)
i=1
PermEntx(t) = − p(ψi ) log2 p(ψi ). (10)
Here, ‘L’ represents the number of singular values of the i
embedded matrix and ‘σi ’ denote the singular values.
The following section briefly explains the contribution of
Spectral entropy each entropy toward the calculation of randomness attribute
which plays a significant role in the analysis of the speech
Spectral entropy measures the randomness by employing emotions. Table 1 provides a summary of each entropy mea-
Fourier transform to the time-series signal. For the compu- sure and its implications in speech analysis.
tation of this entropy measure, the power spectral density,
S( f ), of the speech signal is obtained. The spectral entropy Entropy and their relation with IMFs
is calculated using the formulation of Shannon entropy mea-
sure as given below [34]: As explained in the previous section, the entropy features
are extracted from the principal mode IMFs through decom-

fn position of the original speech signal. Figure 5 illustrates
SpecEntx(t) = − S( f ) log2 [S( f )]. (9) the box-plot of different emotions captured by non-linear
f =0 entropy features. From the figure, It is observed that the
median values of all the entropy features differ for differ-
Here, ‘ f n ’ is the sampling frequency of the signal. ent class of emotions, and therefore, the computed entropy
features could be readily used as discriminators for classifi-
Permutation entropy cation of emotions. We analyzed the emotion classification
accuracy of different entropy features by varying the num-
Permutation entropy computes the randomness of the time- ber of extracted IMFs. Figure 6 illustrates the variation of
series signal based on ordinal patterns of the signal. It is a classification accuracy of each of the entropy measure for
non-parametric approach and provides a robust estimation of different IMF lengths. It could be observed that no single
irregularity information of the signal. The approach involves entropy measure provides good classification accuracy for
creating a embedded time delayed matrix based on τ and D all the speech signal emotions and the classification accuracy

123
Complex & Intelligent Systems

Fig. 5 Distribution of
randomness value based on
entropy of different speech
emotion signals: a approximate
entropy, b sample entropy, c
SVD entropy, d permutation
entropy, and e spectral entropy

depends on the choice of number IMFs. For example, permu- entropy provides good discrimination of sad, angry, neutral,
tation entropy provides good accuracy for emotions such as and pleasant surprise and less accuracy for other emotions.
pleasant surprise, angry, and sad speech signals decomposed Likewise, entropies such as approximate, SVD, and spectral
upto IMF-3, however, for the same decomposition, its accu- provide higher discrimination abilities from IMF-3 to IMF-
racy is less for fear, disgust, neural, and happy. Similarly, 6, respectively. Therefore, the experimental analysis suggests
when the speech signal is decomposed upto IMF-4, sample that entropy feature extracted from the EMD of speech sig-

123
Complex & Intelligent Systems

1.00 1.00

Emotion Type Emotion Type

ACC_Approximate_Entropy

ACC_Permutation_Entropy
0.75 0.75
angry angry
disgust disgust
fear fear
0.50 0.50
happy happy
neutral neutral
pleasant pleasant
0.25 0.25
sad sad

0.00 0.00

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Number of IMF Modes Number of IMF Modes

(a) (b)
1.00 1.00

Emotion Type Emotion Type

0.75 0.75

ACC_Sample_Entropy
ACC_SVD_Entropy

angry angry
disgust disgust
fear fear
0.50 0.50
happy happy
neutral neutral
pleasant pleasant
0.25 0.25
sad sad

0.00 0.00

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Number of IMF Modes Number of IMF Modes

Emotion Type
ACC_Spectral_Entropy

0.75 Entropy Type

angry
0.6
disgust Approximate Ent
Accuracy

fear Permutation Ent

0.50
happy Sample Ent
neutral Spectral Ent
pleasant SVD Ent
0.25 0.4
sad

0.00

1 2 3 4 5 6 7 8
Number of IMF Modes 1 2 3 4 5 6 7 8
Number of IMF Modes

(e) (f)
Fig. 6 Performance of different entropy measures of speech emotion signals for different IMF modes: a approximate entropy, b permutation
entropy, c SVD entropy, d sample entropy, e spectral entropy, and f average accuracy of each entropy measure for all emotions

nals present complimentary information at different IMFs. employing them individually for emotion classification
Thus, it is prudent to include all the IMFs to improve the presents a peak accuracy of only ∼ 79% for the 7 different
classification accuracy. emotions considered in this study. Hence, the entropy fea-
Subsequently, Fig.6f illustrates the average emotion clas- tures computed from different IMFs are combined as a feature
sification accuracy for different entropies considered for vector (considering 40 features) and presented to the classi-
IMFs 1–8 (only 8 features). the entropies present com- fier for improved signal classification which is explained in
plementary information at different decomposition levels, the proceeding section.

123
Complex & Intelligent Systems

Feature classification Support vector machine

State-of-the-Art (SoA) classifiers such as LDA, NB, K-NN, Support vector machine is a machine learning model used for
SVM, RF, and GB are used for classification of emotions classification and regression challenges [38]. Each data point
from speech signals. The following section briefly describes is plotted as a point in a n-dimensional space with feature
the classifiers. value expressed as the coordinate value. Hyperplane- based
decision boundaries are found separating the two classes.
Linear discriminant analysis While finding the hyperplane, many possibilities are consid-
ered and the plane that has maximum margin separating the
Linear discriminant analysis is a well-known machine learn- two classes is selected. The separation plane classifies the
ing algorithm for classification and prediction task. The future test point with utmost confidence [39].
method is simple, and therefore, prediction process is triv-
ial compared to some of the other classification algorithms. Random forest
LDA is a dimensionality reduction technique with focus on
projecting higher dimension space to lower dimension. The Random forest is a supervised machine learning technique
following steps are involved in LDA: initially, the separabil- where multiple decision trees are built and combined together
ity between the different classes is calculated by finding a to present a stable prediction. It can be used for both clas-
distance between mean and the elements of each class which sification and regression challenges. Generally, the more the
is referred as the intra class variance [36]. Finally, a lower number of decision trees, better the model’s accuracy [37].
dimension space is created using the Fisher’s criterion by
reducing the intraclass variance and increasing the distance Gradient boosting machine
between the interclasses.
It is one of the powerful models designed for predictions.
The technique involves three parts. (a) Differentiable loss
Naïve Bayes function; (b) a decision tree to boost the weak learners; (c) a
additive model along with the decision trees for selection of
It is a classification method which is based on Bayes theorem. the best decision tree model.
This classifier works on the assumption that the presence of The nodes in each decision tree take a different subset of
a particular feature of a class is not related to the presence features for selecting the best split. In this technique, all the
of other features and, therefore, represents a probabilistic trees are unique and they are able to capture different signal
machine learning model. It is easier to build and very useful from the data points. Also each new tree is based on the errors
for very large dataset. Using Bayes theorem, probability of an of the previous tree and all these operations are executed in
happening event ‘A’ can be found out, given another event ‘B’ a sequential order [40].
that has already occurred. Since the presence of one particular
feature does not effect others, it is referred as Naïve [14] .
Results
K-nearest neighbor
From the preliminary analysis discussed in the methodol-
K-NN is a supervised machine learning algorithm which is ogy section, it is understood that detection of emotions from
widely used for classification as well as prediction problems. speech signals requires to use the complementary informa-
It is considered as a lazy learning technique, since there is no tion provided by all the entropy measures. Hence, the entropy
specialized training data. Generally, the entire data are used features are combined to form a composite feature vector.
for the training purpose. It is also a non-parametric method, Each emotion class consists of 200 speech signals that are
because there are no assumptions involved where the simi- used for forming the feature space of dimension 1400 × 40.
larity between features is used for prediction of a new data The feature matrix is used to train a collection of state-of-the-
point. The value of ‘K’ is the number of neighbors selected art machine learning classifiers to detect the seven emotions
initially, which can be any integer, based on number of classes from the speech signal. A tenfold cross-validation technique
in the dataset [37] is used to obtain the performance measures of the classi-
The distance between the training and test data is cal- fiers in emotion classification. To evaluate the classifiers,
culated using the distance measures such as Euclidean, the performance metrics such as balanced accuracy, F1 score,
Hamming, etc. The computed distance is sorted in the ascend- recall, AUC , specificity, and precision are used [41]. Figure 7
ing order and the queried data are presented with the class illustrates the box-plot of performance measures for different
label having the least distance. classifiers used in the work. Table 2 provides the classifica-

123
Complex & Intelligent Systems

Fig. 7 Performance metrics of the state-of-the-art classifiers for speech emotion classification based on entropy measure: a balanced accuracy, b
F1 score, c AUC, d sensitivity, e specificity, and f precision

tion performance metrics of the SoA models with respect to lower score of 0.6 and LDA gives the best score of 0.84.
balanced accuracy. Table 3 compares the area under the curve When considering AUC, which provides an overall measure
value of the Receiver-Operating Characteristic (ROC) curve, of classifiers performance across all possible classification
and finally, Table 4 tabulates the F1 score for all the classi- thresholds, Naïve Bayes classifier records the lowest value
fiers considered for emotion classification. Table 5 shows the of 0.89, while LDA scores the highest value of 0.98. Consid-
specificity of the proposed system. ering recall metric which is a measure of sensitivity, SVM
Here, the Mean Balanced Accuracy (MBA) metric is used provides the lowest performance score of 0.58, while LDA
to analyze the goodness of the binary classifier and the delivers the best score of 0.85. The model ability to predict
MBA for SVM classifier is 0.74, while LDA gives the best the true negatives is recorded by the specificity metric and
accuracy of 0.899. In regards to F1 score, K-NN gives the LDA gives best score of 0.97. For precision, a measure of rel-

123
Complex & Intelligent Systems

Table 2 Balanced accuracy of some of the state-of-the-art (SoA) clas- Discussion

sifiers trained on entropy features for emotion recognition
Model Min Mean Max The performance of an SER system highly depends on the
type of emotion signal databases. There are three types of
LDA 0.8500000 0.8991203 0.9330049
speech signal databases on which classification algorithms
K-NN 0.7204258 0.7766033 0.8333333
are experimented. Simulated databases or acted databases
SVM 0.7000000 0.7566092 0.8166667 consist of speech signals generated by professional actors
RF 0.7666667 0.8619875 0.9333333 for various emotions. Induced databases, otherwise known
NB 0.7013957 0.7662931 0.8333333 as elicited databases, are collected from fresh induced emo-
GB 0.7956897 0.8586084 0.9000000 tions by artificial situations [42]. In this case, emotion signals
are recorded without informing the speakers, and the emo-
tions are induced by playing audios, videos, digital games,
Table 3 AUC of ROC plot of some of the SoA classifiers trained on and images. The third type of database is known as natu-
entropy features for emotion recognition ral database in which the emotions are recorded naturally
Model Min Mean Max like talk shows, natural conversations, and so on. Among the
databases, the first one is prominent for emotion analysis,
LDA 0.9533333 0.9760250 0.9950739 whereas natural databases are often influenced by surround-
K-NN 0.8290476 0.8908201 0.9366667 ing noises and artifacts [43].
SVM 0.8890394 0.9099204 0.9571429 In this section, the related works in speech emotion recog-
RF 0.9185714 0.9590877 0.9819048 nition are compared based on TESS and other publicly
NB 0.8219048 0.8912003 0.9371429 available speech emotion datasets for English and other spo-
GB 0.9302135 0.9604885 0.9753695 ken languages, as shown in Table 6. Verma et al. studied the
impact of age in recognition of emotions and trained a set
of support vector machine classifier based on Mel-frequency
Table 4 F1 score of some of the SoA classifiers trained on entropy coefficients [44]. They showed an overall accuracy of 96%.
features for emotion recognition However, their method considered only five emotions and
Model Min Mean Max required to categorize the dataset based on age before
applying the classifier. Sundarprasad used Mel-Frequency
LDA 0.7662338 0.8327253 0.8797314 Cepstrum Coefficients and PCA for dimensionality reduc-
K-NN 0.5025308 0.6105503 0.7332112 tion followed by SVM classifier to classify the emotions.
SVM 0.5700371 0.6324655 0.6673469 They have used the TESS dataset and reported an accuracy
RF 0.5838384 0.7542363 0.8838384 of 90% [45]. Gao et al. used a combination of features based
NB 0.5506081 0.6236622 0.6925170 on Mel frequency and time domain features, and reported
GB 0.6582973 0.7561204 0.8249433 a highest overall classification accuracy of 81%. The lower
classification accuracy of neutral emotions reduced the over-
all accuracy of their method [46]. Xie et al. showed that
Table 5 Specificity of some of the SoA classifiers trained on entropy frame-level features based on attention model of long-short
features for emotion recognition time memory (LSTM) could be more discriminant in emotion
Model Min Mean Max
detection [25]. They reported an unweighted average recall
(UAR) measure of 89.6% based on LSTM for a English emo-
LDA 0.9571429 0.9710978 0.9802956 tion database (Enterface). Demircan et al. proposed speech
K-NN 0.9194229 0.9360638 0.9523810 recognition system based on Mel-frequency cepstral coef-
SVM 0.9142857 0.9303612 0.9476190 ficients (MFCC) and linear prediction coefficients (LPC)
RF 0.9333333 0.9604035 0.9809524 and used fuzzy c-means for dimensionality reduction. They
NB 0.9170772 0.9333005 0.9523810 reported a peak classification accuracy of 92.86% for seven
GB 0.9413793 0.9593596 0.9714286 emotions for German language dataset, EmoDB using SVM
classifier [26].
Venkataramanan et al. used convolution neural network
based on log-Mel spectrogram and reported an accuracy of
evancy, LDA scores the best with a high value of 0.89. Based 70% based on TESS dataset [47]. Praseetha et al. proposed
on the performance scores of all the classifiers, it could be a model based on MFCC features, Deep Neural Network
stated that the proposed model delivers the best prediction (DNN), and Gated Recurrent Unit (GRU) [48]. The two mod-
accuracy using LDA classifier. els were tested on TESS dataset for five emotions: anger,

123
Complex & Intelligent Systems

89.96% (DNN) 95.82% (GRU)

happiness, fear, sadness, and neutral. The DNN model with
MFCC delivered a classification accuracy of 89.96%; on the

Classification accuracy
other hand, GRU model gave an accuracy of 95.82%. Though
they have reported with high accuracy for the TESS dataset,
they have considered only five emotions. Moreover, DNN
and GRU are computationally intensive methods that could
not be easily implemented for real-time emotion recognition.

89.6%

93.3%
Kerkeni et al. proposed an automated SER system based

90%

70%
96%

81%
on combination of features obtained in EMD domain. They
have used modulation spectral features and modulation fre-
quency features based on the IMF signal and combined them

7 (angry, sad, happy, surprise, disgust, neutral, fear)

with cepstral features [49]. Their methodology was initially
evaluated on the Spanish emotional database using RNN clas-

6 (angry, sad, happy, surprise, disgust, fear)

sifier and reported an accuracy of 91.16%. For the Berlin
database, they have used SVM classifier and reported an
accuracy of 86.22%. However, their method is not tested for

5 (angry, sad, happy, neutral, fear)

5 (anger, happy, fear, sad, neutral)

the English language.

4 (angry, sad, happy, neutral)

In comparison, the proposed approach used non-linear
features such as the entropy measure and selection of princi-

# of Emotions (types)
pal IMF modes for feature extraction. The proposed system
obtained a highest balanced accuracy of 93.3% with a mean
of 89.9%, F1 score of maximum 87.9%, and a mean of 83.3%
using LDA classifier. The method obtained a peak AUC
value of 0.995 and a mean value of 0.976 for LDA classi-
fier in recognizing the seven speech emotions. Other related
works which reported high accuracy involved only five emo-
tions.Other methods which considered all seven emotions
Table 6 Comparison of some of the notable works in the speech emotion recognition (English)

from TESS dataset reported a lower accuracy since neutral

eNTERFACE
emotions were considered. The proposed method based on
Dataset

entropy features from principal modes performs well in rec-

TESS

TESS
TESS

ognizing human emotions from speech signals. In this work,

the impact of age and gender in recognition of emotions from
speech is not considered, though it is known to influence
the emotions in the speech signal. More investigations are
Mel-frequency cepstral coefficients

Convolution neural network based

Mel-frequency cepstrum features,

MFCC and time domain features

required in advancing the proposed method to account for

Frame level features and LSTM

IMF modes and SoA machine

Entropy features from principal
(MFCC) and SVM classifier

MFCC features, Deep Neural

Network (DNN) and Gated

the age and gender bias in the speech emotion recognition.

on log-Mel spectrogram

Moreover, it is also planned to use deep learning methods

Recurrent Unit (GRU)

for time-series and sequence analysis such as recurrent neu-

learning classifier
(attention-based)

ral networks for speech emotion recognition as an extension

PCA and SVM

of this work.
Method

Conclusion

In this paper, the application of non-linear features such as

the entropy measure in recognition of human emotions from
Venkataramanan et al. [47]

speech signals is demonstrated. This work investigated the

entropy feature extraction using EMD for recognizing seven
Praseetha et al. [48]
Sundarprasad [45]

Proposed method
Verma et al. [44]

emotions of native English speakers. As the positive and

Gao et al. [46]
Xie et al. [25]

negative emotions are captured in different frequency scales

Author(s)

of the speech signal, the IMFs are categorized into various

frequency groups and selected principal modes for entropy
feature extraction. The reported metrics are a peak balanced

123
Complex & Intelligent Systems

accuracy of 93.3%, a peak F1 score of 87.9%, and a peak AUC and signal processing and 4th Pacific-Rim conference on multime-
value of 0.995 using LDA classifier. This proves that the pro- dia, institute of electrical and electronics engineers Inc., vol 3, pp
1619–1623. https://doi.org/10.1109/ICICS.2003.1292741
posed method of dividing the frequency components in the 8. Teager HM, Teager SM (1990) Evidence for nonlinear sound pro-
speech signal into three frequency groups such as the high- duction mechanisms in the vocal tract. In: Speech production and
frequency, mid-frequency, and low-frequency modes could speech modelling. Springer Netherlands, pp 241–261. https://doi.
recognize different emotions existing in different frequency org/10.1007/978-94-009-2037-8_10
9. Daneshfar F, Kabudian SJ, Neekabadi A (2020) Speech emotion
scales of a speech signal. recognition using hybrid spectral-prosodic features of speech sig-
nal/glottal waveform, metaheuristic-based dimensionality reduc-
Acknowledgements This research was financially supported by the tion, and Gaussian elliptical basis function network classi-
Scientific Research Grant of Shantou University, China, Grant no: fier. Appl Acoust 166:107360. https://doi.org/10.1016/j.apacoust.
NTF17016. 2020.107360
10. Palo HK, Behera D, Rout BC (2020) Comparison of classifiers for
speech emotion recognition (SER) with discriminative spectral fea-
Compliance with ethical standards tures, pp 78–85. https://doi.org/10.1007/978-981-15-2774-6_10
11. Nazid Mohd H, Muthusamy H, Vijean V, Yaacob S (2018)
Conflict of interest The authors declare that they have no conflict of Improved speaker-independent emotion recognition from speech
interest. using two-stage feature reduction—UUM Repository. J Inf Com-
mun Technol 14:57–76. http://repo.uum.edu.my/24081/
Open Access This article is licensed under a Creative Commons 12. Jing S, Mao X, Chen L (2018) Prominence features: effective emo-
Attribution 4.0 International License, which permits use, sharing, adap- tional features for speech emotion recognition. Digit Signal Proc
tation, distribution and reproduction in any medium or format, as 72:216–231. https://doi.org/10.1016/j.dsp.2017.10.016
long as you give appropriate credit to the original author(s) and the 13. Roy T, Marwala T, Chakraverty S (2020) Speech emotion recogni-
source, provide a link to the Creative Commons licence, and indi- tion using neural network and wavelet features, pp 427–438. https://
cate if changes were made. The images or other third party material doi.org/10.1007/978-981-15-0287-3_30
in this article are included in the article’s Creative Commons licence, 14. Khan A, Roy UK (2018) Emotion recognition using prosodie and
unless indicated otherwise in a credit line to the material. If material spectral features of speech and Naïve Bayes Classifier. Institute
is not included in the article’s Creative Commons licence and your of Electrical and Electronics Engineers (IEEE), pp 1017–1021.
intended use is not permitted by statutory regulation or exceeds the https://doi.org/10.1109/wispnet.2017.8299916
permitted use, you will need to obtain permission directly from the copy- 15. Song P, Jin Y, Zhao L, Xin M (2014) Speech emotion recognition
right holder. To view a copy of this licence, visit http://creativecomm using transfer learning. IEICE Trans Inf Syst E97D(9):2530–2532.
ons.org/licenses/by/4.0/. https://doi.org/10.1587/transinf.2014EDL8038
16. Partila P, Tovarek J, Voznak M (2016) Self-organizing map classi-
fier for stressed speech recognition, p 98500A. https://doi.org/10.
1117/12.2224253
References 17. Lanjewar RB, Mathurkar S, Patel N (2015) Implementation and
comparison of speech emotion recognition system using gaussian
1. Huang W, Wu Q, Dey N, Ashour A, Fong SJ, González-Crespo R mixture model (GMM) and K-nearest neighbor (K-NN) tech-
(2020) Adjectives grouping in a dimensionality affective clustering niques. Procedia Comput Sci 49:50–57. https://doi.org/10.1016/
model for fuzzy perceptual evaluation. Int J Interact Multimedia j.procs.2015.04.226
Artif Intell 6(2):10. https://doi.org/10.9781/ijimai.2020.05.002 18. Patel P, Chaudhari AA, Pund MA, Deshmukh DH (2017) Speech
2. Anttonen J, Surakka V (2005) Emotions and heart emotion recognition system using gaussian mixture model and
rate while sitting on a chair. In: Proceedings of the improvement proposed via boosted gmm. IRA Int J Technol Eng
SIGCHI conference on Human factors in computing (ISSN 2455-4480) 7(2 (S)):56–64
systems—CHI ’05, ACM Press, New York, New York, 19. Yang N, Yuan J, Zhou Y, Demirkol I, Duan Z, Heinzelman W,
USA, p 491. https://doi.org/10.1145/1054972.1055040, Sturge-Apple M (2017) Enhanced multiclass SVM with threshold-
http://portal.acm.org/citation.cfm?doid=1054972.1055040 ing fusion for speech-based emotion classification. Int J Speech
3. Akçay MB, Oğuz K (2020) Speech emotion recognition: emotional Technol 20(1):27–41. https://doi.org/10.1007/s10772-016-9364-
models, databases, features, preprocessing methods, supporting 2
modalities, and classifiers. Speech Commun 116:56–76. https:// 20. Sinith MS, Aswathi E, Deepa TM, Shameema CP, Rajan S (2016)
doi.org/10.1016/j.specom.2019.12.001 Emotion recognition from audio signals using Support Vector
4. Sailunaz K, Dhaliwal M, Rokne J, Alhajj R (2018) Emotion detec- Machine. In: 2015 IEEE recent advances in intelligent computa-
tion from text and speech: a survey. Soc Netw Anal Min 8(1):28. tional systems, RAICS 2015, Institute of Electrical and Electronics
https://doi.org/10.1007/s13278-018-0505-2 Engineers Inc., pp 139–144. https://doi.org/10.1109/RAICS.2015.
5. Koolagudi SG, Rao KS (2012) Emotion recognition from speech: 7488403
a review. Int J Speech Technol 15(2):99–117. https://doi.org/10. 21. Wen G, Li H, Huang J, Li D, Xun E (2017) Random deep belief
1007/s10772-011-9125-1 networks for recognizing emotions from speech signals. Comput
6. Yang N, Dey N, Sherratt RS, Shi F (2020) Recognize basic Intell Neurosci 2017:1–9. https://doi.org/10.1155/2017/1945630
emotional statesin speech by machine learning techniques using 22. Tsang-Long Pao YC, Jun-Heng Yeh PL (2006) Mandarin emo-
mel-frequency cepstral coefficient features. J Intell Fuzzy Syst. tional speech recognition based on SVM and NN. In: 18th
https://doi.org/10.3233/jifs-179963 International conference on pattern recognition (ICPR’06), IEEE,
7. Nwe TL, Foo SW, De Silva LC (2003) Detection of stress and emo- pp 1096–1100. https://doi.org/10.1109/ICPR.2006.780
tion in speech using traditional and FFT based log energy features. 23. Xiao Z, Dellandrea E, Dou W, Chen L (2010) Multi-stage classi-
In: ICICS-PCM 2003—Proceedings of the 2003 joint conference of fication of emotional speech motivated by a dimensional emotion
the 4th international conference on information, communications

123
Complex & Intelligent Systems

model. Multimedia Tools Appl 46(1):119–145. https://doi.org/10. 38. Bellamkonda S, Np G (2020) An enhanced facial expression recog-
1007/s11042-009-0319-3 nition model using local feature fusion of gabor wavelets and local
24. Lin YL, Wei G (2005) Speech emotion recognition based on HMM directionality patterns. Int J Ambient Comput Intell 11(1):48–70.
and SVM. In: 2005 International conference on machine learning https://doi.org/10.4018/ijaci.2020010103
and cybernetics, IEEE, vol 8, pp 4898–4901. https://doi.org/10. 39. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for
1109/ICMLC.2005.1527805 cancer classification using support vector machines. Mach Learn
25. Xie Y, Liang R, Liang Z, Huang C, Zou C, Schuller B (2019) Speech 46(1–3):389–422. https://doi.org/10.1023/A:1012487302797
emotion classification using attention-based lstm. IEEE/ACM 40. Friedman JH (2001) Greedy function approximation: a gra-
Trans Audio Speech Lang Proc 27(11):1675–1685. https://doi.org/ dient boosting machine. Ann Stat. https://doi.org/10.1214/aos/
10.1109/TASLP.2019.2925934 1013203451
26. Demircan S, Kahramanli H (2018) Application of fuzzy c-means 41. Angadi S, Nandyal S (2020) Human identification system based
clustering algorithm to spectral features for emotion classification on spatial and temporal features in the video surveillance system.
from speech. Neural Comput Appl 29(8):59–66. https://doi.org/10. Int J Ambient Comput Intell 11(3):1–21. https://doi.org/10.4018/
1007/s00521-016-2712-y ijaci.2020070101
27. Huang NE, Shen Z, Long SR, Wu MC, Shih HH, Zheng Q, Yen 42. Sapinski, Tomasz; Kaminska D, Pelikant A, Ozcinar C, Avots E,
NC, Tung CC, Liu HH (1998) The empirical mode decomposi- Anbarjafari G (2018) Multimodal database of emotional speech,
tion and the Hilbert spectrum for nonlinear and non-stationary video and gestures
time series analysis. Proc R Soc Lond Ser A Math Phys Eng Sci 43. Saratxaga I, Navas E, Hernáez I, Aholab I (2006) Designing and
454(1971):903–995. https://doi.org/10.1098/rspa.1998.0193 recording an emotional speech database for corpus based synthesis
28. Dupuis K, Kathleen Pichora-Fuller M (2010) in Basque. In: Proceedings of the Fifth International Conference on
Toronto emotional speech set (TESS) | TSpace Language Resources and Evaluation (LREC’06), European Lan-
Repository. https://doi.org/10.5683/SP2/E8H2MF, guage Resources Association (ELRA), Genoa, Italy, http://www.
https://tspace.library.utoronto.ca/handle/1807/24487 lrec-conf.org/proceedings/lrec2006/pdf/19_pdf.pdf
29. Hassouneh A, Mutawa AM, Murugappan M (2020) Development 44. Verma D, Mukhopadhyay D (2017) Age driven automatic speech
of a real-time emotion recognition system using facial expressions emotion recognition system. In: Proceeding—IEEE international
and EEG based on machine learning and deep neural network conference on computing, communication and automation, ICCCA
methods. Inform Med Unlock 20:100372. https://doi.org/10.1016/ 2016, Institute of Electrical and Electronics Engineers Inc., pp
j.imu.2020.100372 1005–1010. https://doi.org/10.1109/CCAA.2016.7813862
30. Pincus SM (1991) Approximate entropy as a measure of system 45. Sundarprasad N (2018) Speech emotion detection using machine
complexity. Proc Nat Acad Sci 88(6):2297–2301. https://doi.org/ learning techniques. Master’s thesis, San Jose State University, San
10.1073/pnas.88.6.2297 Jose, CA, USA. https://doi.org/10.31979/etd.a5c2-v7e2, https://
31. Delgado-Bonal A, Marshak A (2019) Approximate entropy and scholarworks.sjsu.edu/etd_projects/628
sample entropy: a comprehensive tutorial. Entropy 21(6):541. 46. Gao Y (2019) Speech-Based Emotion Recognition. Master’s thesis,
https://doi.org/10.3390/e21060541 https://libraetd.lib.virginia.edu/downloads/2f75r8498?filename=
32. Richman JS, Lake DE, Moorman J (2004) Sample entropy. In: 1_Gao_Ye_2019_MS.pdf
Methods in enzymology, pp 172–184. https://doi.org/10.1016/ 47. Venkataramanan K, Rajamohan HR (2019) Emotion recognition
S0076-6879(04)84011-4 from speech. arXiv: 1912.10458
33. Gu R, Shao Y (2016) How long the singular value decomposed 48. Praseetha V, Vadivel S (2018) Deep learning models for speech
entropy predicts the stock market—evidence from the dow jones emotion recognition. J Comput Sci 14(11):1577–1587. https://doi.
industrial average index. Phys A 453:150–161 org/10.3844/jcssp.2018.1577.1587
34. Tian Y, Zhang H, Xu W, Zhang H, Yang L, Zheng S, Shi Y 49. Kerkeni L, Serrestou Y, Raoof K, Mbarki M, Mahjoub MA, Cleder
(2017) Spectral entropy can predict changes of working memory C (2019) Automatic speech emotion recognition using an optimal
performance reduced by short-time training in the delayed-match- combination of features based on EMD-TKEO. Speech Commun
to-sample task. Front Hum Neurosci 11:437. https://doi.org/10. 114:22–35. https://doi.org/10.1016/j.specom.2019.09.002
3389/fnhum.2017.00437
35. Yang Y, Zhou M, Niu Y, Li C, Cao R, Wang B, Yan P, Ma Y,
Xiang J (2018) Epileptic seizure prediction based on permutation
Publisher’s Note Springer Nature remains neutral with regard to juris-
entropy. Front Comput Neurosci. https://doi.org/10.3389/fncom.
dictional claims in published maps and institutional affiliations.
2018.00055
36. Izenman AJ (2013) Linear discriminant analysis. Springer, New
York, pp 237–280. https://doi.org/10.1007/978-0-387-78189-1_8
37. Pohjalainen J, Räsänen O, Kadioglu S (2015) Feature selection
methods and their combinations in high-dimensional classification
of speaker likability, intelligibility and personality traits. Comput
Speech Lang 29(1):145–171. https://doi.org/10.1016/j.csl.2013.
11.004

123

Speech Emotion Recognition
No ratings yet
Speech Emotion Recognition
55 pages
Project-PPT-Speech Emotion Recognition
85% (13)
Project-PPT-Speech Emotion Recognition
10 pages
Speech Databases Speech Features and Classifiers in Speech Emotion Recognition A Review
No ratings yet
Speech Databases Speech Features and Classifiers in Speech Emotion Recognition A Review
31 pages
Emotion Recognition Based On Speech Signals by Combining Empirical Mode Decomposition and Deep Neural Network
No ratings yet
Emotion Recognition Based On Speech Signals by Combining Empirical Mode Decomposition and Deep Neural Network
10 pages
Speech Emotion Recognition Using Machine Learning
No ratings yet
Speech Emotion Recognition Using Machine Learning
14 pages
Pre Processing
No ratings yet
Pre Processing
54 pages
EpochSER MTA
No ratings yet
EpochSER MTA
35 pages
MS Thesis Final
No ratings yet
MS Thesis Final
47 pages
Deep Learning Approaches For Speech Emotion Recognition: State of The Art and Research Challenges
No ratings yet
Deep Learning Approaches For Speech Emotion Recognition: State of The Art and Research Challenges
68 pages
Comparison Between SVM Other Classifiers For Ser IJERTV2IS1457
No ratings yet
Comparison Between SVM Other Classifiers For Ser IJERTV2IS1457
6 pages
Real-Time Speech Emotion Recognition Using Deep Le
No ratings yet
Real-Time Speech Emotion Recognition Using Deep Le
40 pages
Information Sciences: Zhen-Tao Liu, Abdul Rehman, Min Wu, Wei-Hua Cao, Man Hao
No ratings yet
Information Sciences: Zhen-Tao Liu, Abdul Rehman, Min Wu, Wei-Hua Cao, Man Hao
17 pages
1 PB
No ratings yet
1 PB
12 pages
Speech Emotion Recognition Using Machine Learning Techniques
No ratings yet
Speech Emotion Recognition Using Machine Learning Techniques
8 pages
Modeling and Simulation of Bacterial Foraging Variants - Acoustic Feature Selection and Classification
No ratings yet
Modeling and Simulation of Bacterial Foraging Variants - Acoustic Feature Selection and Classification
7 pages
Yan 2020
No ratings yet
Yan 2020
5 pages
Emotion Recognition From Speech Via The Use of Dif
No ratings yet
Emotion Recognition From Speech Via The Use of Dif
11 pages
XEmoAccent Embracing Diversity in Cross-Accent Emo
No ratings yet
XEmoAccent Embracing Diversity in Cross-Accent Emo
19 pages
DL Emotion MFCC
No ratings yet
DL Emotion MFCC
6 pages
SER Final
No ratings yet
SER Final
10 pages
Emotional Expression Detection in Spoken Language Employing Machine Learning Algorithms
No ratings yet
Emotional Expression Detection in Spoken Language Employing Machine Learning Algorithms
15 pages
Speech Emotion Recognition Using Deep Learning Techniques: A Review
No ratings yet
Speech Emotion Recognition Using Deep Learning Techniques: A Review
19 pages
Enhanced Speech Emotion Detection Using Deep Neural Networks
No ratings yet
Enhanced Speech Emotion Detection Using Deep Neural Networks
14 pages
Project Report
No ratings yet
Project Report
106 pages
Electronics 12 00839 v2
No ratings yet
Electronics 12 00839 v2
17 pages
Speech Emotion Recognition: Ashish B. Ingale, D. S. Chaudhari
No ratings yet
Speech Emotion Recognition: Ashish B. Ingale, D. S. Chaudhari
4 pages
Emotion Recognition From Speech Signals by Mena M.E.
No ratings yet
Emotion Recognition From Speech Signals by Mena M.E.
116 pages
Multimodal Speech Emotion Recognition and Ambiguity Resolution
No ratings yet
Multimodal Speech Emotion Recognition and Ambiguity Resolution
9 pages
Comparative Analysis of Neural Networks For Speech Emotion Recognition
No ratings yet
Comparative Analysis of Neural Networks For Speech Emotion Recognition
5 pages
Recognition of Emotions in Speech Using Deep CNN A
No ratings yet
Recognition of Emotions in Speech Using Deep CNN A
18 pages
4.sandeep - A Critical Analysis of Emotion Detection From Speech Signals - 19-25
No ratings yet
4.sandeep - A Critical Analysis of Emotion Detection From Speech Signals - 19-25
7 pages
Speech Emotion Recognition Using Deep Learning
No ratings yet
Speech Emotion Recognition Using Deep Learning
6 pages
2019-11-14 992718 - RFP 251 19 157 Social Emotional Learning Screener Submitted by EDUMETRISIS LLC
100% (1)
2019-11-14 992718 - RFP 251 19 157 Social Emotional Learning Screener Submitted by EDUMETRISIS LLC
120 pages
Machine Learning and Deep Learning Techniques For Emotion Recognition From Human Speech Using Acoustic Analysis
No ratings yet
Machine Learning and Deep Learning Techniques For Emotion Recognition From Human Speech Using Acoustic Analysis
10 pages
Speech Emotion Recognition Based On SVM Using MATLAB: March 2016
No ratings yet
Speech Emotion Recognition Based On SVM Using MATLAB: March 2016
7 pages
Speaker Emotion Recognition: Leveraging Self-Supervised Models For Feature Extraction Using Wav2Vec2 and Hubert
No ratings yet
Speaker Emotion Recognition: Leveraging Self-Supervised Models For Feature Extraction Using Wav2Vec2 and Hubert
9 pages
Sharma 2020
No ratings yet
Sharma 2020
5 pages
1 s2.0 S0003682X23002906 Main
No ratings yet
1 s2.0 S0003682X23002906 Main
11 pages
Speech Emotion Recognition Using Machine Learning
No ratings yet
Speech Emotion Recognition Using Machine Learning
8 pages
Emotion Detection Final Paper
No ratings yet
Emotion Detection Final Paper
15 pages
Speech Emotion Recognition Using Deep Learning Hybrid Models
No ratings yet
Speech Emotion Recognition Using Deep Learning Hybrid Models
5 pages
Speech Emotion Recognition: Ashish B. Ingale, D. S. Chaudhari
No ratings yet
Speech Emotion Recognition: Ashish B. Ingale, D. S. Chaudhari
4 pages
Speech Emotion Recognization
No ratings yet
Speech Emotion Recognization
65 pages
Set Conference Draft Paper - 223585
No ratings yet
Set Conference Draft Paper - 223585
6 pages
Winter Semester 2021-22 CSE4020-Machine Learning Digital Assignment-1
No ratings yet
Winter Semester 2021-22 CSE4020-Machine Learning Digital Assignment-1
20 pages
Economic and Cultural Growth
No ratings yet
Economic and Cultural Growth
3 pages
Sat - 82.Pdf - Election Prediction With Automated Speech Emotion Recognition
No ratings yet
Sat - 82.Pdf - Election Prediction With Automated Speech Emotion Recognition
11 pages
Applsci 12 09188 v2
No ratings yet
Applsci 12 09188 v2
17 pages
Exploring The Effectiveness of Advanced Machine Learning Models in Speech Emotion Recognition
No ratings yet
Exploring The Effectiveness of Advanced Machine Learning Models in Speech Emotion Recognition
6 pages
GROUP7 Researchpaper
No ratings yet
GROUP7 Researchpaper
9 pages
Emotion Recognition in Speech Signal Using Emotion-Extracting Binary Decision Trees
No ratings yet
Emotion Recognition in Speech Signal Using Emotion-Extracting Binary Decision Trees
8 pages
Emotion Recognition From Speech: Abstract. Emotions Play An Extremely Vital Role in Human Lives and Human
No ratings yet
Emotion Recognition From Speech: Abstract. Emotions Play An Extremely Vital Role in Human Lives and Human
13 pages
SER (Research Paper)
No ratings yet
SER (Research Paper)
5 pages
Speech Emotion Recognition Based On SVM Using Matlab PDF
No ratings yet
Speech Emotion Recognition Based On SVM Using Matlab PDF
6 pages
Speech Emotion Recognition For Enhanced User Experience: A Comparative Analysis of Classification Methods
No ratings yet
Speech Emotion Recognition For Enhanced User Experience: A Comparative Analysis of Classification Methods
12 pages
Physical Features Based Speech Emotion Recognition Using Predictive Classification
No ratings yet
Physical Features Based Speech Emotion Recognition Using Predictive Classification
12 pages
Domain Adaptation and Compensation For Emotion Detection
No ratings yet
Domain Adaptation and Compensation For Emotion Detection
4 pages
Speech-Emotion-Recognition Using SVM, Decision Tree and LDA Report
No ratings yet
Speech-Emotion-Recognition Using SVM, Decision Tree and LDA Report
7 pages
Report File (VJ)
No ratings yet
Report File (VJ)
56 pages
Data Science Interview Questions and Answers For 2020 PDF
No ratings yet
Data Science Interview Questions and Answers For 2020 PDF
20 pages
Minimental Quiroga
No ratings yet
Minimental Quiroga
12 pages
Mod 7 Smote ML
No ratings yet
Mod 7 Smote ML
40 pages
Fake Image Detection Research Paper
No ratings yet
Fake Image Detection Research Paper
10 pages
A Review On Speech Emotion Classification Using Linear Predictive Coding and Neural Networks
No ratings yet
A Review On Speech Emotion Classification Using Linear Predictive Coding and Neural Networks
5 pages
Artificial Intelligence Within Medical Diagnostics
No ratings yet
Artificial Intelligence Within Medical Diagnostics
19 pages
The Prediction of Default With Outliers - Robust Logistic Regression
No ratings yet
The Prediction of Default With Outliers - Robust Logistic Regression
21 pages
Credit Card Fraud Detection Using A Deep Learning Multistage Model
No ratings yet
Credit Card Fraud Detection Using A Deep Learning Multistage Model
26 pages
Blockchain Introduction Presentation
No ratings yet
Blockchain Introduction Presentation
15 pages
Package Dismo': R Topics Documented
No ratings yet
Package Dismo': R Topics Documented
68 pages
BreastCancer CNNs
No ratings yet
BreastCancer CNNs
17 pages
02 Image Characteristics and Quality
No ratings yet
02 Image Characteristics and Quality
8 pages
Enhancing IoT Security With CNN and LSTM-Based Int
No ratings yet
Enhancing IoT Security With CNN and LSTM-Based Int
7 pages
Strocke
No ratings yet
Strocke
17 pages
Polverino Et Al 2023 Machine Learning For Prognostics and Health Management of Industrial Mechanical Systems and
No ratings yet
Polverino Et Al 2023 Machine Learning For Prognostics and Health Management of Industrial Mechanical Systems and
20 pages
Penman Et Al-2010-Diversity and Distributions
No ratings yet
Penman Et Al-2010-Diversity and Distributions
10 pages
Applied Geography and Geoinformatics For Sustainable Development Proceedings of Icggs 2022 Wuttichai Boonpook Instant Download
No ratings yet
Applied Geography and Geoinformatics For Sustainable Development Proceedings of Icggs 2022 Wuttichai Boonpook Instant Download
66 pages
Dermoscopy As A Tool in Differentiating Cutaneous Squamous Cell Carcinoma From Its Variants
No ratings yet
Dermoscopy As A Tool in Differentiating Cutaneous Squamous Cell Carcinoma From Its Variants
16 pages
Jaburek 2021
No ratings yet
Jaburek 2021
20 pages
Previewpdf
No ratings yet
Previewpdf
45 pages
Classifier 3
No ratings yet
Classifier 3
19 pages
Interview Questions
No ratings yet
Interview Questions
26 pages
Normas TADL-Q (Muñoz-Neira, 2012)
No ratings yet
Normas TADL-Q (Muñoz-Neira, 2012)
11 pages
Untitled
No ratings yet
Untitled
5 pages
Abscess Volume in Orbital Cellulitis
No ratings yet
Abscess Volume in Orbital Cellulitis
7 pages
Fda A3 13642032 PDF
No ratings yet
Fda A3 13642032 PDF
19 pages
Application of Semi-Supervised Learning in Image Classification Research On Fusion of Labeled and Unlabeled Data
No ratings yet
Application of Semi-Supervised Learning in Image Classification Research On Fusion of Labeled and Unlabeled Data
13 pages
A Multimodal Transfer Learning Approach Using PubMedCLIP For Medical Image Classification
No ratings yet
A Multimodal Transfer Learning Approach Using PubMedCLIP For Medical Image Classification
12 pages
Quinn Et Al. - 2018 - Discriminative Ability and Clinical Utility of The
No ratings yet
Quinn Et Al. - 2018 - Discriminative Ability and Clinical Utility of The
10 pages
Final 13 March Review Paper
No ratings yet
Final 13 March Review Paper
8 pages
Alpha Forna Et Al 2020
No ratings yet
Alpha Forna Et Al 2020
8 pages
Machine Learning Methods For Surgery Cancellation
No ratings yet
Machine Learning Methods For Surgery Cancellation
4 pages
What Will Be The Output of The Following C Code
No ratings yet
What Will Be The Output of The Following C Code
3 pages
Dsa II Midsem 2 13.6.2022
No ratings yet
Dsa II Midsem 2 13.6.2022
2 pages
Kismet: Fundamentals and Applications
From Everand
Kismet: Fundamentals and Applications
Fouad Sabry
No ratings yet
Affective Computing: Fundamentals and Applications
From Everand
Affective Computing: Fundamentals and Applications
Fouad Sabry
No ratings yet

Emotion Classification From Speech Signal Based On

Uploaded by

Emotion Classification From Speech Signal Based On

Uploaded by

Complex & Intelligent Systems

Emotion classification from speech signal based on empirical mode

Speech emotion recognition

Palani Thanaraj Krishnan1 · Joseph Raj Alex Noel2 · Vijayarajan Rajangam3

Received: 11 October 2020 / Accepted: 2 February 2021

Speech signals have a huge impact on the current modes

Fig. 1 Proposed method for

an averaging scheme is proposed for mid-frequency, low- L fd = ci (t) + r (t). (4)

Emotion Type Emotion Type

Emotion Type Emotion Type

0.75 Entropy Type

fear Permutation Ent

Feature classification Support vector machine

Table 2 Balanced accuracy of some of the state-of-the-art (SoA) clas- Discussion

89.96% (DNN) 95.82% (GRU)

7 (angry, sad, happy, surprise, disgust, neutral, fear)

7 (angry, sad, happy, surprise, disgust, neutral, fear)

7 (angry, sad, happy, surprise, disgust, neutral, fear)

6 (angry, sad, happy, surprise, disgust, fear)

5 (angry, sad, happy, neutral, fear)

5 (anger, happy, fear, sad, neutral)

4 (angry, sad, happy, neutral)

from TESS dataset reported a lower accuracy since neutral

entropy features from principal modes performs well in rec-

ognizing human emotions from speech signals. In this work,

Convolution neural network based

MFCC and time domain features

required in advancing the proposed method to account for

IMF modes and SoA machine

MFCC features, Deep Neural

the age and gender bias in the speech emotion recognition.

Moreover, it is also planned to use deep learning methods

for time-series and sequence analysis such as recurrent neu-

ral networks for speech emotion recognition as an extension

In this paper, the application of non-linear features such as

speech signals is demonstrated. This work investigated the

emotions of native English speakers. As the positive and

negative emotions are captured in different frequency scales

of the speech signal, the IMFs are categorized into various

You might also like