Emotion Classification From Speech Signal Based On
https://doi.org/10.1007/s40747-021-00295-z
ORIGINAL ARTICLE
Abstract
Emotion recognition from speech signals is a widely researched topic in the design of Human–Computer Interface (HCI) models, since it provides insights into the mental states of human beings. Often, it is required to identify the emotional condition of humans as cognitive feedback in the HCI. In this paper, an attempt to recognize seven emotional states from speech signals, namely sadness, anger, disgust, fear, happiness, pleasant surprise, and neutral, is investigated. The proposed method employs a non-linear signal quantifying approach based on a randomness measure, known as the entropy feature, for the detection of emotions. Initially, the speech signals are decomposed into Intrinsic Mode Functions (IMFs), and the IMF signals are divided into dominant frequency bands, namely the high-frequency, mid-frequency, and base-frequency bands. The entropy measures are computed directly from the high-frequency band in the IMF domain. However, for the mid- and base-band frequencies, the IMFs are averaged and their entropy measures are computed. A feature vector is formed from the computed entropy measures, incorporating the randomness feature for all the emotional signals. Then, the feature vector is used to train a set of state-of-the-art classifiers, such as Linear Discriminant Analysis (LDA), Naïve Bayes, K-Nearest Neighbor, Support Vector Machine, Random Forest, and Gradient Boosting Machine. A tenfold cross-validation, performed on the publicly available Toronto Emotional Speech Set, illustrates that the LDA classifier presents a peak balanced accuracy of 93.3%, an F1 score of 87.9%, and an area under the curve value of 0.995 in the recognition of emotions from speech signals of native English speakers.
Keywords Speech signal · Emotion perception · Entropy measures · Linear discriminant analysis · Empirical mode
decomposition
Introduction
rally embedded in various emotional situations. An in-depth analysis of speech signals in different domains is helpful in recognizing the emotions from the auditory signals of people who are unable to communicate through proper speech signals. Furthermore, speech signal analysis is also used to study the heart rate of the speaker [2]. The broader research perspective of Speech Emotion Classification (SEC) finds its applications in crime investigation, psychiatric diagnosis, human–computer interaction, fatigue detection, auxiliary disease diagnosis, biometrics, and many more.

The basic emotions are categorized into sadness, fear, happiness, disgust, surprise, and anger [3]. The combination of basic emotions leads to other emotions such as love, affection, amusement, contempt, excitement, embarrassment, and so on. Over the decades, various studies have been conducted in the field of SEC, where the general pipeline includes feature extraction, dimensionality reduction, and emotion classification. The broad literature on emotion analysis suggests two preferable groups of features, known as statistical and temporal features [4,5].

A Speech Emotion Recognition (SER) system can be structured by analyzing well-crafted features that effectively expose each emotion in the speech signals [6]. The varying length and continuous nature of speech signals require local and global features for emotion recognition. The local features represent temporal dynamics, whereas the global features expose statistical aspects such as the standard deviation, mean, and minimum and maximum values. The features of an SER system are categorized into prosodic features, spectral features, voice quality features, and Teager energy operator-based features. Prosodic features, such as rhythm and intonation, are features based on human perception; they are derived from energy, duration, and the fundamental frequency. Spectral features are extracted in the frequency domain using transforms and have received wide attention due to their ability to represent vocal cord characteristics [5]. The short-term power spectrum is represented by Mel-frequency cepstral coefficients, whereas vocal tract characteristics are represented by linear prediction coefficients. Logarithmic filtering of the auditory system is characterized by log-frequency power coefficients using the Fourier transform [7]. Voice quality measurements, such as jitter, harmonics-to-noise ratio, and shimmer, exploit the relation between vocal tract characteristics and emotion content. Teager features detect stresses happening to the vocal tract muscles in the form of an energy operator [8]. A few spectral and temporal feature-based SER systems are discussed below.

Daneshfar et al. proposed a hybrid SER system comprising feature extraction, dimensionality reduction, and classification stages. In the feature extraction stage, three features, namely the perceptual minimum variance distortionless response, perceptual linear prediction coefficients, and Mel-frequency cepstral coefficients, are extracted from each frame of the speech signal [9]. A high-dimensional feature vector is structured from the first- and second-order derivatives of the above-said feature vector. The dimension reduction of the feature vector is carried out by quantum-behaved particle swarm optimization, and the reduced feature vector is classified by a Gaussian elliptical basis function neural network classifier. Palo et al. proposed an SER system in the wavelet domain based on Mel-frequency coefficients [10]. Both static and dynamic elements of the coefficients are combined for an SER system. The above-said feature coefficients are reduced in dimension using Principal Component Analysis (PCA) and linear discriminant analysis [11]. Jing et al. suggested an SER system using prominence features and traditional acoustic features [12]. The combined feature vector is reduced in dimension using PCA and non-parametric discriminant analysis, and the features are classified using four types of supervised learning classifiers. Wavelet-based features, extracted from the speech signals, are used for SEC in [13]. In [14], spectral features with a Naïve Bayes (NB) classifier are employed.

A set of methods for speech emotion classification is based on the hidden Markov model [15], Gaussian Mixture Model (GMM) [16], Self-Organizing Map (SOM) [17], and neural networks [18]. A Singular Value Decomposition (SVD) classifier is used in [19], whereas, in [20], an ensemble software regression model is proposed for emotion classification. A deep belief network based on high- and low-level features is also proposed for SEC [21]. Pao et al. proposed a method based on the Support Vector Machine (SVM) and neural networks to classify five emotions, namely anger, surprise, neutral, happiness, and sadness [22]. Xiao et al. suggested a classifier that uses several sub-classifiers for the classification of seven types of emotions [23]. Lin and Wei presented a method that was evaluated in gender-dependent and gender-independent experiments [24]. More recently, Xie et al. developed a frame-level emotion recognition system based on an attention model in recurrent neural networks and validated their system for English and non-English speech signals [25]. Demircan and Kahramanli proposed spectral features based on Mel cepstral coefficients and linear prediction coefficients for speech emotion detection; they then used fuzzy c-means for feature dimension reduction, the output of which was given as input to machine learning classifiers. They used a German speech emotion dataset for their work [26].

Our contributions are motivated by two observations: (a) speech signals are non-stationary, and classical signal processing methods such as Fourier and wavelet analysis use predefined basis functions that fail to extract relevant information regarding emotions; and (b) the above transformation techniques are block-based methods, wherein a group of samples surrounding the centre element is projected on to the respective basis function. Selection of an optimum window size is an additional requirement for improving the detection accuracy and the elimination of artifacts for slow time-varying emotions like sadness.
Therefore, there is a need to investigate the classification accuracy of human emotions through data-driven signal processing methods such as Empirical Mode Decomposition (EMD) and non-linear features. This paper investigates an SEC approach where emotions are recognized from speech signals by decomposing them into intrinsic mode functions. Later, five unique randomness measures are computed through entropy measures, and state-of-the-art machine learning classifiers are trained on the entropy features. Finally, the performance of the model is validated using standard quantitative metrics on a publicly available emotion classification dataset.

The rest of the paper is organized as follows. "Materials and methods" elaborates on the proposed methodology, the extraction of IMFs through EMD, the computation of the randomness through entropy features, and the detailed analysis of the need for different entropy measures. Results and discussion are presented in "Results" and "Discussion", followed by the conclusion in "Conclusion".

Materials and methods

A speech signal is a time-varying signal and requires proper selection of a signal processing method to extract the relevant features for emotion recognition. In this paper, the speech signals are analyzed in the IMF domain using EMD. Unlike conventional signal processing methods that use predefined basis functions, such as the Fourier transform and the wavelet transform, EMD relies on the extraction of inherent patterns in the data for decomposing a signal into intrinsic signals [27]. Figure 1 shows the block diagram of the proposed speech emotion recognition system. The speech signals of duration ∼2 s are initially decomposed into dominant, mid-, and base-band IMF frequencies. Here, windowing techniques are not involved, and hence, inherent features corresponding to the emotions are extracted with a higher confidence. Non-linear features based on entropy are extracted from the decomposed IMF signals. A feature vector is constructed from the entropy features and used to train a set of classifiers such as LDA, Naïve Bayes (NB), K-Nearest Neighbor (K-NN), Support Vector Machine (SVM), Random Forest (RF), and Gradient Boosting machine (GB). Finally, the performance of the classifiers is evaluated through balanced accuracy, F1 score, recall, area under the curve, specificity, and precision.

Emotion dataset

To present a realistic comparison, the proposed emotion recognition system was trained and tested on the publicly available dataset provided by the University of Toronto, known as the Toronto Emotional Speech Set (TESS) [28]. The dataset consists of speech signals recorded from two native English participants of age 26 and 64, respectively, speaking 200 target words that complete the phrase "Say the word ----". These phrases are captured with seven different emotions of the speakers, namely anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral. The duration of the recordings varies between 2 and 3 s, and they are sampled at 22 kHz. Figure 2 illustrates the speech signals for the different emotions from the TESS dataset. For the analysis, 200 recordings of each emotion class were taken for the development of the emotion recognition system. It should be noted that the original recordings are of high quality (recorded in a noiseless environment) and, therefore, do not require additional pre-processing steps.

Empirical mode decomposition

In this section, the EMD of a signal is analyzed. Suppose x(t) is a time-series speech signal that delivers the IMF signals c_i(t) and the residue function r(t) when decomposed by the EMD method. Equation (1) illustrates the decomposition process [27]:

x(t) = \sum_{i=1}^{d} c_i(t) + r(t),   (1)

where d is the number of IMFs generated for the input signal x(t).

For the experiments, d is preset to a value of 10 components. Preliminary analysis illustrates that setting a lower value of d leads to a smaller number of decomposed IMFs, resulting in the loss of information. On the contrary, a large value of d leads to higher levels of decomposition, but at a considerable computational cost. Hence, an optimal value of 10 was chosen based on the ad hoc analysis at different levels of decomposition. Figure 3 shows the decomposed speech signal using EMD. The decomposed signal captures different oscillatory features of the speech signal in both the temporal and frequency domains.
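To make the decomposition step concrete, the following Python sketch extracts up to d = 10 IMFs from a single recording. It assumes the open-source PyEMD package and SciPy for WAV reading; the file name and the normalisation step are illustrative assumptions, not taken from the paper.

```python
import numpy as np
from scipy.io import wavfile
from PyEMD import EMD

# Hypothetical TESS-style recording; any mono WAV file works for illustration.
rate, x = wavfile.read("OAF_back_angry.wav")
x = x.astype(np.float64)
x /= np.max(np.abs(x)) + 1e-12                 # amplitude normalisation (an assumed step)

emd = EMD()
imfs = emd.emd(x, max_imf=10)                  # cap the decomposition at d = 10 IMFs
residue = x - imfs.sum(axis=0)                 # r(t): whatever the returned IMFs leave over

print(f"{imfs.shape[0]} IMFs of length {imfs.shape[1]} extracted at {rate} Hz")
```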
Principal frequency modes

EMD decomposes a time-series signal into IMFs that are localized in the time and frequency domains. Since different emotions are captured in distinct frequency components of the IMF signal, the information content in each IMF signal is not uniform and varies depending on the input speech signal. Speech signals pertaining to happy and pleasant surprise are positive emotions, whereas negative emotions such as angry, fear, disgust, and sad are captured in different frequency scales [29]. Hence, a predefined selection of any IMF component or frequency scale will lead to a loss of information. The IMF signals are therefore divided into three frequency groups, namely the High-Frequency (HF), the Mid-Frequency (MF), and the Low-Frequency (LF) modes, based on the frequency content, as shown in Fig. 3, to handle the loss of information. The categories are represented as follows: (a) the lower order IMFs, from IMF-1 to IMF-6, represent the high-frequency modes Hf_{d1-6}; (b) IMF-7 and IMF-8 correspond to the mid-frequency modes Mf_{d7-8}; and (c) the higher order IMFs, namely IMF-9 and IMF-10, correspond to the low-frequency modes Lf_{d9-10}.
Fig. 2 Speech Signals for different emotions: a angry, b disgust, c fear, d happy, e neutral, f pleasant surprise, and g sad
The last component r_t is the residue mode, which corresponds to the baseline activity of the signal. To enhance the discrimination ability of the speech signals, especially for negative emotions such as sad, fear, and disgust, with a reduced number of trainable features, the mid- and low-frequency IMFs are averaged as given in Eqs. (3) and (4):

Mf_d = \frac{1}{N} \sum_{i=7}^{8} c_i(t)   (3)

Lf_d = \frac{1}{N} \sum_{i=9}^{10} c_i(t)   (4)

where N is the number of IMFs in the corresponding group.
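A minimal sketch of this grouping is given below, assuming a (10, n_samples) IMF array like the one returned by the EMD step above; the function and variable names are illustrative.

```python
import numpy as np

def group_modes(imfs: np.ndarray):
    """imfs: array of shape (10, n_samples) produced by the EMD step."""
    hf = imfs[0:6]                  # Hf_d1-6: six high-frequency modes, kept individually
    mf = imfs[6:8].mean(axis=0)     # Mf_d: average of IMF-7 and IMF-8, as in Eq. (3)
    lf = imfs[8:10].mean(axis=0)    # Lf_d: average of IMF-9 and IMF-10, as in Eq. (4)
    return hf, mf, lf               # 6 + 1 + 1 = 8 mode signals per recording
```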
Fig. 3 Decomposed IMFs of speech signals: a angry, b happy, and c sad. Here, x axis represents the sample number and the y axis denotes the IMF
amplitude. We could observe that different emotions show distinct IMF signals
As shown in Fig. 4, the high-frequency modes exhibit distinct PSD patterns across emotions, while the mid-frequency and low-frequency modes and the residue functions show similar PSD patterns. Therefore, these modes are averaged as explained in Eqs. (3) and (4), respectively. Hence, eight unique IMF-derived mode signals from the input speech signal are selected for the feature extraction process based on the PSD distribution.

Feature extraction

Non-linear features based on randomness measures and chaos theory have been widely used in signal classification problems. They have been reported to give good classification performance for many biomedical applications that involve ECG and EEG signals. Though speech signals are inherently different from biological signals, the oscillations within the signals define each emotion. Here, it is attempted to quantify the randomness measure by computing entropy functions. Thus, five entropy measures, namely Approximate entropy (ApEnt), Sample entropy (SamEnt), Singular Value Decomposition entropy (SVDEnt), Permutation entropy (PermEnt), and Spectral entropy (SpecEnt), are used for extracting randomness features from the speech signals.
Fig. 4 Power spectral density of some of the speech signals with emotions: a–c angry, d–f happy, and g–i sad, for the three frequency groups, namely HF (IMF-1–IMF-6), MF (IMF-7–IMF-8), LF (IMF-9, IMF-10), and the residue modes
For each of the eight mode signals, a total of 5 different entropy measures are computed, resulting in 40 different trainable features:

Hf_{d1-6} \in \mathbb{R}^{1 \times 30}, \quad Mf_d \in \mathbb{R}^{1 \times 5}, \quad Lf_d \in \mathbb{R}^{1 \times 5}.   (5)

The proceeding sections present (a) the different entropy measures, (b) a detailed investigation of the need for different entropy measures, and (c) a brief analysis of the different types of classifiers used in this study.

Approximate entropy

Approximate entropy is a complexity measure widely used in the regularity analysis of time-series signals. It quantifies the amount of randomness based on signal fluctuations. A lower value of ApEnt suggests that the time-series signal is regular, and a higher value demonstrates randomness. The parameters m and r define the embedding dimension and the similarity value, respectively. ApEnt can be computed through Eq. (6) [30]:

ApEnt_{x(t)} = \phi_r^{m} - \phi_r^{m+1}.   (6)

Here, x(t) is the speech signal and \phi_r is the correlation integral function for the phase-space vectors (embedded signal). For the experiments, the chosen values are m = 3 and r = 0.2 std(x(t)), based on the work of [31]. Here, 'std' refers to the standard deviation of the input signal.
Sample entropy

Sample entropy is a modified version of approximate entropy in which limitations such as the self-similar pattern bias are overcome [32]. Here, the similarity measure is computed based on various embedded time-series samples, and the self-similarity measure between the samples is not computed. It thus reduces the bias that is inherent in approximate entropy. The representation of sample entropy is provided in Eq. (7) [32]:

SamEnt_{x(t)} = -\log \frac{z(m+1, r)}{z(m, r)}.   (7)

Here, z is the measure of similarity of an embedded time series for m and m + 1, and r is the tolerance level. The values of m and r are identical to the values used for approximate entropy.
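A matching NumPy sketch of sample entropy is shown below; the only substantive difference from the approximate-entropy sketch is that self-matches are excluded from the counts (a zero match count would need guarding in practice).

```python
import numpy as np

def sample_entropy(x, m=3, r_factor=0.2):
    x = np.asarray(x, dtype=np.float64)
    r = r_factor * x.std()

    def match_count(dim):
        n = len(x) - m                                    # same number of templates for m and m+1
        emb = np.array([x[i:i + dim] for i in range(n)])
        dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        return ((dist <= r).sum() - n) / 2                # matching pairs, self-matches removed

    return -np.log(match_count(m + 1) / match_count(m))   # Eq. (7)
```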
SVD entropy

Singular value decomposition represents the dimensionality of the data. It decomposes high-dimensional data into orthogonal matrices based on the singular values σ. The time-series signal is converted to an embedded matrix based on different time-delayed template vectors taken from the speech signal. The embedded matrix is decomposed into orthogonal matrices; however, the SVD entropy measure is computed only on the diagonal matrix, which contains the singular values. Equation (8) represents the SVD entropy computation [33]:

SVDEnt_{x(t)} = -\sum_{i=1}^{L} \sigma_i \log_2(\sigma_i).   (8)

Here, L represents the number of singular values of the embedded matrix and σ_i denotes the singular values.

Spectral entropy

Spectral entropy measures the randomness by applying the Fourier transform to the time-series signal. For the computation of this entropy measure, the power spectral density S(f) of the speech signal is obtained. The spectral entropy is calculated using the formulation of the Shannon entropy measure as given below [34]:

SpecEnt_{x(t)} = -\sum_{f=0}^{f_n} S(f) \log_2 [S(f)].   (9)

Here, f_n is the sampling frequency of the signal.

Permutation entropy

Permutation entropy computes the randomness of the time-series signal based on ordinal patterns of the signal. It is a non-parametric approach and provides a robust estimation of the irregularity information of the signal. The approach involves creating an embedded time-delayed matrix based on τ and D, where D denotes the size of the embedded matrix. Usually, τ and D are set to 1 and 3, respectively. The different ordinal patterns are tabulated and verified with the column vectors of the embedded matrix. The number of occurrences of each ordinal pattern in the matrix is counted, and the probability of occurrence ψ_i of each pattern is tabulated. The permutation entropy is computed as given below [35]:

PermEnt_{x(t)} = -\sum_{i} p(\psi_i) \log_2 p(\psi_i).   (10)
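The remaining three measures can be sketched as follows. Normalising the singular values, the PSD, and the ordinal-pattern counts to probabilities before the Shannon sum is a common convention assumed here, and the parameter names follow the text (τ = 1, D = 3); this is an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.signal import periodogram

def svd_entropy(x, D=3, tau=1):
    n = len(x) - (D - 1) * tau
    emb = np.array([x[i:i + D * tau:tau] for i in range(n)])   # time-delay embedded matrix
    s = np.linalg.svd(emb, compute_uv=False)
    s = s / s.sum()                                            # singular values as probabilities
    s = s[s > 0]
    return -np.sum(s * np.log2(s))                             # Eq. (8)

def spectral_entropy(x, fs):
    _, psd = periodogram(x, fs)
    psd = psd / psd.sum()                                      # PSD as a probability mass
    psd = psd[psd > 0]
    return -np.sum(psd * np.log2(psd))                         # Eq. (9)

def permutation_entropy(x, D=3, tau=1):
    n = len(x) - (D - 1) * tau
    counts = {}
    for i in range(n):
        pattern = tuple(np.argsort(x[i:i + D * tau:tau]))      # ordinal pattern of the window
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float) / n
    return -np.sum(p * np.log2(p))                             # Eq. (10)
```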
The following explains the contribution of each entropy measure toward the calculation of the randomness attribute, which plays a significant role in the analysis of speech emotions. Table 1 provides a summary of each entropy measure and its implications in speech analysis.

Table 1 Entropy measures and their attributes in speech analysis
Approximate entropy: computes irregularity in the speech signal; however, it considers self-similar patterns in the input signal
Sample entropy: modification of approximate entropy with no bias from self-similar patterns
SVD entropy: computes the randomness measure based on the decomposition of high-dimensional data using singular values
Spectral entropy: computes the randomness in the power spectral density function of the speech signal
Permutation entropy: uses ordinal patterns in the speech signal to detect the emotions

Entropy measures and their relation with IMFs

As explained in the previous section, the entropy features are extracted from the principal mode IMFs obtained through decomposition of the original speech signal. Figure 5 illustrates the box plots of the different emotions captured by the non-linear entropy features. From the figure, it is observed that the median values of all the entropy features differ for different classes of emotions, and therefore, the computed entropy features could be readily used as discriminators for the classification of emotions. We analyzed the emotion classification accuracy of the different entropy features by varying the number of extracted IMFs. Figure 6 illustrates the variation of the classification accuracy of each entropy measure for different IMF lengths.
Fig. 5 Distribution of randomness values based on entropy for different speech emotion signals: a approximate entropy, b sample entropy, c SVD entropy, d permutation entropy, and e spectral entropy
It could be observed that no single entropy measure provides good classification accuracy for all the speech emotions, and the classification accuracy depends on the choice of the number of IMFs. For example, permutation entropy provides good accuracy for emotions such as pleasant surprise, angry, and sad for speech signals decomposed up to IMF-3; however, for the same decomposition, its accuracy is lower for fear, disgust, neutral, and happy. Similarly, when the speech signal is decomposed up to IMF-4, sample entropy provides good discrimination of sad, angry, neutral, and pleasant surprise and lower accuracy for the other emotions. Likewise, the approximate, SVD, and spectral entropies provide higher discrimination abilities from IMF-3 to IMF-6, respectively.
Fig. 6 Performance of different entropy measures of speech emotion signals for different IMF modes: a approximate entropy, b permutation
entropy, c SVD entropy, d sample entropy, e spectral entropy, and f average accuracy of each entropy measure for all emotions
Therefore, the experimental analysis suggests that the entropy features extracted from the EMD of speech signals present complementary information at different IMFs. Thus, it is prudent to include all the IMFs to improve the classification accuracy.

Subsequently, Fig. 6f illustrates the average emotion classification accuracy for the different entropies considered for IMFs 1–8 (only 8 features). Although the entropies present complementary information at different decomposition levels, employing them individually for emotion classification presents a peak accuracy of only ∼79% for the 7 different emotions considered in this study. Hence, the entropy features computed from the different IMFs are combined as a feature vector (considering 40 features) and presented to the classifier for improved signal classification, which is explained in the proceeding section.
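Putting the pieces together, a sketch of the 40-dimensional feature vector of Eq. (5) could look as follows; it reuses the illustrative helpers sketched in the previous sections and is again an assumption, not the authors' implementation.

```python
import numpy as np

def entropy_features(imfs, fs=22050.0):
    """imfs: (10, n_samples) IMF array; fs: sampling rate (TESS uses about 22 kHz)."""
    hf, mf, lf = group_modes(imfs)            # 6 HF modes plus the averaged MF and LF modes
    features = []
    for mode in list(hf) + [mf, lf]:          # 8 mode signals in total
        features += [
            approximate_entropy(mode),
            sample_entropy(mode),
            svd_entropy(mode),
            spectral_entropy(mode, fs),
            permutation_entropy(mode),
        ]
    return np.asarray(features)               # 8 modes x 5 entropies = 40 features, Eq. (5)
```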
State-of-the-Art (SoA) classifiers such as LDA, NB, K-NN, SVM, RF, and GB are used for the classification of emotions from speech signals. The following sections briefly describe the classifiers.

Linear discriminant analysis

Linear discriminant analysis is a well-known machine learning algorithm for classification and prediction tasks. The method is simple, and therefore, the prediction process is trivial compared to some of the other classification algorithms. LDA is a dimensionality reduction technique focused on projecting a higher dimensional space onto a lower dimensional one. The following steps are involved in LDA: initially, the separability between the different classes is calculated by finding the distance between the mean and the elements of each class, which is referred to as the intra-class variance [36]. Finally, a lower dimensional space is created using Fisher's criterion by reducing the intra-class variance and increasing the distance between the classes.

Naïve Bayes

Naïve Bayes is a classification method based on Bayes' theorem. This classifier works on the assumption that the presence of a particular feature of a class is not related to the presence of other features and, therefore, represents a probabilistic machine learning model. It is easy to build and very useful for very large datasets. Using Bayes' theorem, the probability of an event 'A' occurring can be found, given another event 'B' that has already occurred. Since the presence of one particular feature does not affect the others, the method is referred to as naïve [14].

K-nearest neighbor

K-NN is a supervised machine learning algorithm which is widely used for classification as well as prediction problems. It is considered a lazy learning technique, since there is no specialized training phase; generally, the entire data are used for the training purpose. It is also a non-parametric method, because there are no assumptions involved; the similarity between features is used for the prediction of a new data point. The value of K is the number of neighbors selected initially, which can be any integer, based on the number of classes in the dataset [37].

The distance between the training and test data is calculated using distance measures such as the Euclidean or Hamming distance. The computed distances are sorted in ascending order, and the queried data point is assigned the class label having the least distance.

Support vector machine

The support vector machine is a machine learning model used for classification and regression challenges [38]. Each data point is plotted as a point in an n-dimensional space with each feature value expressed as a coordinate value. Hyperplane-based decision boundaries are found separating the two classes. While finding the hyperplane, many possibilities are considered, and the plane that has the maximum margin separating the two classes is selected. The separation plane classifies future test points with utmost confidence [39].

Random forest

Random forest is a supervised machine learning technique where multiple decision trees are built and combined together to present a stable prediction. It can be used for both classification and regression challenges. Generally, the larger the number of decision trees, the better the model's accuracy [37].

Gradient boosting machine

The gradient boosting machine is one of the powerful models designed for prediction. The technique involves three parts: (a) a differentiable loss function; (b) a decision tree to boost the weak learners; and (c) an additive model along with the decision trees for the selection of the best decision tree model.

The nodes in each decision tree take a different subset of features for selecting the best split. In this technique, all the trees are unique and they are able to capture different signals from the data points. Also, each new tree is based on the errors of the previous tree, and all these operations are executed in a sequential order [40].

Results

From the preliminary analysis discussed in the methodology section, it is understood that the detection of emotions from speech signals requires the use of the complementary information provided by all the entropy measures. Hence, the entropy features are combined to form a composite feature vector. Each emotion class consists of 200 speech signals, which are used for forming a feature space of dimension 1400 × 40. The feature matrix is used to train a collection of state-of-the-art machine learning classifiers to detect the seven emotions from the speech signal. A tenfold cross-validation technique is used to obtain the performance measures of the classifiers in emotion classification. To evaluate the classifiers, performance metrics such as balanced accuracy, F1 score, recall, AUC, specificity, and precision are used [41]. Figure 7 illustrates the box plots of the performance measures for the different classifiers used in the work.
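A minimal scikit-learn sketch of this evaluation protocol is given below; X and y stand for the assembled 1400 × 40 feature matrix and the emotion labels, and the feature standardisation step is a common preprocessing choice assumed here rather than something the paper specifies.

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# X: (1400, 40) entropy-feature matrix, y: emotion labels (assumed to be prepared beforehand).
classifiers = {
    "LDA": LinearDiscriminantAnalysis(),
    "NB": GaussianNB(),
    "K-NN": KNeighborsClassifier(),
    "SVM": SVC(),
    "RF": RandomForestClassifier(),
    "GB": GradientBoostingClassifier(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # tenfold cross-validation
for name, clf in classifiers.items():
    model = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")
    print(f"{name}: mean balanced accuracy = {scores.mean():.3f}")
```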
Fig. 7 Performance metrics of the state-of-the-art classifiers for speech emotion classification based on entropy measure: a balanced accuracy, b
F1 score, c AUC, d sensitivity, e specificity, and f precision
Table 2 provides the classification performance metrics of the SoA models with respect to balanced accuracy. Table 3 compares the area under the curve value of the Receiver-Operating Characteristic (ROC) curve, and finally, Table 4 tabulates the F1 score for all the classifiers considered for emotion classification. Table 5 shows the specificity of the proposed system.

Here, the Mean Balanced Accuracy (MBA) metric is used to analyze the goodness of the classifier; the MBA for the SVM classifier is 0.74, while LDA gives the best accuracy of 0.899. In regard to the F1 score, K-NN gives the lowest score of 0.6 and LDA gives the best score of 0.84. When considering the AUC, which provides an overall measure of a classifier's performance across all possible classification thresholds, the Naïve Bayes classifier records the lowest value of 0.89, while LDA scores the highest value of 0.98. Considering the recall metric, which is a measure of sensitivity, SVM provides the lowest performance score of 0.58, while LDA delivers the best score of 0.85. The model's ability to predict the true negatives is recorded by the specificity metric, and LDA gives the best score of 0.97. For precision, a measure of rel-
Discussion

On the other hand, the GRU model gave an accuracy of 95.82%. Though they have reported a high accuracy for the TESS dataset, they have considered only five emotions. Moreover, DNN and GRU are computationally intensive methods that could not be easily implemented for real-time emotion recognition.

Kerkeni et al. [49] proposed an automated SER system based on a combination of features obtained in the EMD domain. They have used modulation spectral features and modulation frequency features based on the IMF signals and combined them for classification. The proposed method, in contrast, computes entropy features from the principal IMF modes for feature extraction. The proposed system obtained a highest balanced accuracy of 93.3% with a mean of 89.9%, and an F1 score of maximum 87.9% with a mean of 83.3%, using the LDA classifier. The method obtained a peak AUC value of 0.995 and a mean value of 0.976 for the LDA classifier in recognizing the seven speech emotions. Other related works which reported high accuracy involved only five emotions. Other methods which considered all seven emotions reported lower accuracies than those of this work.

Table 6 Comparison of some of the notable works in speech emotion recognition (English) on the TESS dataset

Conclusion
The proposed entropy-based emotion recognition system achieved a peak balanced accuracy of 93.3%, a peak F1 score of 87.9%, and a peak AUC value of 0.995 using the LDA classifier. This proves that the proposed method of dividing the frequency components in the speech signal into three frequency groups, namely the high-frequency, mid-frequency, and low-frequency modes, could recognize different emotions existing in different frequency scales of a speech signal.

Acknowledgements This research was financially supported by the Scientific Research Grant of Shantou University, China, Grant no: NTF17016.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Huang W, Wu Q, Dey N, Ashour A, Fong SJ, González-Crespo R (2020) Adjectives grouping in a dimensionality affective clustering model for fuzzy perceptual evaluation. Int J Interact Multimedia Artif Intell 6(2):10. https://doi.org/10.9781/ijimai.2020.05.002
2. Anttonen J, Surakka V (2005) Emotions and heart rate while sitting on a chair. In: Proceedings of the SIGCHI conference on human factors in computing systems—CHI '05, ACM Press, New York, NY, USA, p 491. https://doi.org/10.1145/1054972.1055040, http://portal.acm.org/citation.cfm?doid=1054972.1055040
3. Akçay MB, Oğuz K (2020) Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun 116:56–76. https://doi.org/10.1016/j.specom.2019.12.001
4. Sailunaz K, Dhaliwal M, Rokne J, Alhajj R (2018) Emotion detection from text and speech: a survey. Soc Netw Anal Min 8(1):28. https://doi.org/10.1007/s13278-018-0505-2
5. Koolagudi SG, Rao KS (2012) Emotion recognition from speech: a review. Int J Speech Technol 15(2):99–117. https://doi.org/10.1007/s10772-011-9125-1
6. Yang N, Dey N, Sherratt RS, Shi F (2020) Recognize basic emotional states in speech by machine learning techniques using mel-frequency cepstral coefficient features. J Intell Fuzzy Syst. https://doi.org/10.3233/jifs-179963
7. Nwe TL, Foo SW, De Silva LC (2003) Detection of stress and emotion in speech using traditional and FFT based log energy features. In: ICICS-PCM 2003—Proceedings of the 2003 joint conference of the 4th international conference on information, communications and signal processing and 4th Pacific-Rim conference on multimedia, Institute of Electrical and Electronics Engineers Inc., vol 3, pp 1619–1623. https://doi.org/10.1109/ICICS.2003.1292741
8. Teager HM, Teager SM (1990) Evidence for nonlinear sound production mechanisms in the vocal tract. In: Speech production and speech modelling. Springer Netherlands, pp 241–261. https://doi.org/10.1007/978-94-009-2037-8_10
9. Daneshfar F, Kabudian SJ, Neekabadi A (2020) Speech emotion recognition using hybrid spectral-prosodic features of speech signal/glottal waveform, metaheuristic-based dimensionality reduction, and Gaussian elliptical basis function network classifier. Appl Acoust 166:107360. https://doi.org/10.1016/j.apacoust.2020.107360
10. Palo HK, Behera D, Rout BC (2020) Comparison of classifiers for speech emotion recognition (SER) with discriminative spectral features, pp 78–85. https://doi.org/10.1007/978-981-15-2774-6_10
11. Nazid Mohd H, Muthusamy H, Vijean V, Yaacob S (2018) Improved speaker-independent emotion recognition from speech using two-stage feature reduction. J Inf Commun Technol 14:57–76. http://repo.uum.edu.my/24081/
12. Jing S, Mao X, Chen L (2018) Prominence features: effective emotional features for speech emotion recognition. Digit Signal Proc 72:216–231. https://doi.org/10.1016/j.dsp.2017.10.016
13. Roy T, Marwala T, Chakraverty S (2020) Speech emotion recognition using neural network and wavelet features, pp 427–438. https://doi.org/10.1007/978-981-15-0287-3_30
14. Khan A, Roy UK (2018) Emotion recognition using prosodic and spectral features of speech and Naïve Bayes classifier. Institute of Electrical and Electronics Engineers (IEEE), pp 1017–1021. https://doi.org/10.1109/wispnet.2017.8299916
15. Song P, Jin Y, Zhao L, Xin M (2014) Speech emotion recognition using transfer learning. IEICE Trans Inf Syst E97D(9):2530–2532. https://doi.org/10.1587/transinf.2014EDL8038
16. Partila P, Tovarek J, Voznak M (2016) Self-organizing map classifier for stressed speech recognition, p 98500A. https://doi.org/10.1117/12.2224253
17. Lanjewar RB, Mathurkar S, Patel N (2015) Implementation and comparison of speech emotion recognition system using Gaussian mixture model (GMM) and K-nearest neighbor (K-NN) techniques. Procedia Comput Sci 49:50–57. https://doi.org/10.1016/j.procs.2015.04.226
18. Patel P, Chaudhari AA, Pund MA, Deshmukh DH (2017) Speech emotion recognition system using Gaussian mixture model and improvement proposed via boosted GMM. IRA Int J Technol Eng (ISSN 2455-4480) 7(2 (S)):56–64
19. Yang N, Yuan J, Zhou Y, Demirkol I, Duan Z, Heinzelman W, Sturge-Apple M (2017) Enhanced multiclass SVM with thresholding fusion for speech-based emotion classification. Int J Speech Technol 20(1):27–41. https://doi.org/10.1007/s10772-016-9364-2
20. Sinith MS, Aswathi E, Deepa TM, Shameema CP, Rajan S (2016) Emotion recognition from audio signals using Support Vector Machine. In: 2015 IEEE recent advances in intelligent computational systems, RAICS 2015, Institute of Electrical and Electronics Engineers Inc., pp 139–144. https://doi.org/10.1109/RAICS.2015.7488403
21. Wen G, Li H, Huang J, Li D, Xun E (2017) Random deep belief networks for recognizing emotions from speech signals. Comput Intell Neurosci 2017:1–9. https://doi.org/10.1155/2017/1945630
22. Tsang-Long Pao YC, Jun-Heng Yeh PL (2006) Mandarin emotional speech recognition based on SVM and NN. In: 18th International conference on pattern recognition (ICPR'06), IEEE, pp 1096–1100. https://doi.org/10.1109/ICPR.2006.780
23. Xiao Z, Dellandrea E, Dou W, Chen L (2010) Multi-stage classification of emotional speech motivated by a dimensional emotion model. Multimedia Tools Appl 46(1):119–145. https://doi.org/10.1007/s11042-009-0319-3
24. Lin YL, Wei G (2005) Speech emotion recognition based on HMM and SVM. In: 2005 International conference on machine learning and cybernetics, IEEE, vol 8, pp 4898–4901. https://doi.org/10.1109/ICMLC.2005.1527805
25. Xie Y, Liang R, Liang Z, Huang C, Zou C, Schuller B (2019) Speech emotion classification using attention-based LSTM. IEEE/ACM Trans Audio Speech Lang Proc 27(11):1675–1685. https://doi.org/10.1109/TASLP.2019.2925934
26. Demircan S, Kahramanli H (2018) Application of fuzzy c-means clustering algorithm to spectral features for emotion classification from speech. Neural Comput Appl 29(8):59–66. https://doi.org/10.1007/s00521-016-2712-y
27. Huang NE, Shen Z, Long SR, Wu MC, Shih HH, Zheng Q, Yen NC, Tung CC, Liu HH (1998) The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc R Soc Lond Ser A Math Phys Eng Sci 454(1971):903–995. https://doi.org/10.1098/rspa.1998.0193
28. Dupuis K, Kathleen Pichora-Fuller M (2010) Toronto emotional speech set (TESS). TSpace Repository. https://doi.org/10.5683/SP2/E8H2MF, https://tspace.library.utoronto.ca/handle/1807/24487
29. Hassouneh A, Mutawa AM, Murugappan M (2020) Development of a real-time emotion recognition system using facial expressions and EEG based on machine learning and deep neural network methods. Inform Med Unlock 20:100372. https://doi.org/10.1016/j.imu.2020.100372
30. Pincus SM (1991) Approximate entropy as a measure of system complexity. Proc Nat Acad Sci 88(6):2297–2301. https://doi.org/10.1073/pnas.88.6.2297
31. Delgado-Bonal A, Marshak A (2019) Approximate entropy and sample entropy: a comprehensive tutorial. Entropy 21(6):541. https://doi.org/10.3390/e21060541
32. Richman JS, Lake DE, Moorman J (2004) Sample entropy. In: Methods in enzymology, pp 172–184. https://doi.org/10.1016/S0076-6879(04)84011-4
33. Gu R, Shao Y (2016) How long the singular value decomposed entropy predicts the stock market—evidence from the Dow Jones industrial average index. Phys A 453:150–161
34. Tian Y, Zhang H, Xu W, Zhang H, Yang L, Zheng S, Shi Y (2017) Spectral entropy can predict changes of working memory performance reduced by short-time training in the delayed-match-to-sample task. Front Hum Neurosci 11:437. https://doi.org/10.3389/fnhum.2017.00437
35. Yang Y, Zhou M, Niu Y, Li C, Cao R, Wang B, Yan P, Ma Y, Xiang J (2018) Epileptic seizure prediction based on permutation entropy. Front Comput Neurosci. https://doi.org/10.3389/fncom.2018.00055
36. Izenman AJ (2013) Linear discriminant analysis. Springer, New York, pp 237–280. https://doi.org/10.1007/978-0-387-78189-1_8
37. Pohjalainen J, Räsänen O, Kadioglu S (2015) Feature selection methods and their combinations in high-dimensional classification of speaker likability, intelligibility and personality traits. Comput Speech Lang 29(1):145–171. https://doi.org/10.1016/j.csl.2013.11.004
38. Bellamkonda S, Np G (2020) An enhanced facial expression recognition model using local feature fusion of Gabor wavelets and local directionality patterns. Int J Ambient Comput Intell 11(1):48–70. https://doi.org/10.4018/ijaci.2020010103
39. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422. https://doi.org/10.1023/A:1012487302797
40. Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat. https://doi.org/10.1214/aos/1013203451
41. Angadi S, Nandyal S (2020) Human identification system based on spatial and temporal features in the video surveillance system. Int J Ambient Comput Intell 11(3):1–21. https://doi.org/10.4018/ijaci.2020070101
42. Sapinski T, Kaminska D, Pelikant A, Ozcinar C, Avots E, Anbarjafari G (2018) Multimodal database of emotional speech, video and gestures
43. Saratxaga I, Navas E, Hernáez I, Aholab I (2006) Designing and recording an emotional speech database for corpus based synthesis in Basque. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), European Language Resources Association (ELRA), Genoa, Italy. http://www.lrec-conf.org/proceedings/lrec2006/pdf/19_pdf.pdf
44. Verma D, Mukhopadhyay D (2017) Age driven automatic speech emotion recognition system. In: Proceeding—IEEE international conference on computing, communication and automation, ICCCA 2016, Institute of Electrical and Electronics Engineers Inc., pp 1005–1010. https://doi.org/10.1109/CCAA.2016.7813862
45. Sundarprasad N (2018) Speech emotion detection using machine learning techniques. Master's thesis, San Jose State University, San Jose, CA, USA. https://doi.org/10.31979/etd.a5c2-v7e2, https://scholarworks.sjsu.edu/etd_projects/628
46. Gao Y (2019) Speech-based emotion recognition. Master's thesis. https://libraetd.lib.virginia.edu/downloads/2f75r8498?filename=1_Gao_Ye_2019_MS.pdf
47. Venkataramanan K, Rajamohan HR (2019) Emotion recognition from speech. arXiv:1912.10458
48. Praseetha V, Vadivel S (2018) Deep learning models for speech emotion recognition. J Comput Sci 14(11):1577–1587. https://doi.org/10.3844/jcssp.2018.1577.1587
49. Kerkeni L, Serrestou Y, Raoof K, Mbarki M, Mahjoub MA, Cleder C (2019) Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO. Speech Commun 114:22–35. https://doi.org/10.1016/j.specom.2019.09.002

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.