
Arabic English Speech Emotion Recognition System

Mai El Seknedy, Sahar Fawzi
Biomedical Systems Group, Center of Informatics Science
Nile University
Giza, Egypt
[email protected], [email protected]

Abstract— The Speech Emotion Recognition (SER) system is an approach to identify individuals' emotions. This is important for human-machine interface applications and for the emerging Metaverse. This paper presents a bilingual Arabic-English speech emotion recognition system using the EYASE and RAVDESS datasets. A novel feature set was composed of spectral and prosodic parameters to obtain high performance at a low computational cost. Different machine learning classifiers were applied, including Multi-Layer Perceptron, Support Vector Machine, Random Forest, Logistic Regression, and Ensemble learning. The execution time of the proposed feature set was compared to the benchmarked "INTERSPEECH 2009" feature set. Promising results were obtained using the proposed feature set. SVM resulted in the best emotion recognition rate and execution performance. The best accuracies achieved were 85% on RAVDESS and 64% on EYASE. Ensemble learning detected the valence emotion with 90% on RAVDESS and 87.6% on EYASE.

Keywords: Bilingual speech emotion recognition, Cross corpus, Mel frequency cepstral coefficients, prosodic features

I. INTRODUCTION

Speech Emotion Recognition (SER) has a wide range of applications in human-interacting systems to enhance the interactive experience. Useful applications include human-computer interaction [1], the emerging technology of the Metaverse [2], call centers [3], medical applications [4], autonomous vehicles [5], e-learning engagement evaluation [6], and commercial applications [7]. Most of these applications allow the user to choose the language to use. In Egypt and most Arab countries, applications provide Arabic and English language choices.

An Arabic-English SER can be integrated with online customer support services to predict clients' satisfaction [8]. This will improve the quality of services by analyzing the client's psychological attitude and taking the needed actions on the spot. Furthermore, e-learning may benefit from tracking the emotional status of students/attendants, which will improve the instructors' communication skills [9].

This paper introduces an SER model based on a novel feature set to identify the emotional status of the speaker. The model performance is validated using the EYASE dataset for Arabic [8] and the RAVDESS dataset for English [10]. The performance of the proposed feature set was compared with that of the benchmarked IS09 feature set (based on the INTERSPEECH 2009 Emotion Challenge) [11]. Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Simple Logistic Regression (SLR), and Random Forest (RF) machine learning classification models were used [12]. The performance of SER was analyzed using 10 folds to ensure model generalization and stability. Three evaluation metrics were used: accuracy, recall, and precision.

This paper is organized as follows: a literature review of the evolution of SER is presented in Section 2, the methodology applied with the proposed feature set is presented in Section 3, experiments and results are presented in Section 4, and finally, Section 5 presents the conclusion and the proposed future work.

II. LITERATURE REVIEW

A significant interest in SER research has evolved over the past two decades. Several acted, elicited, and non-acted datasets in different languages are now available for use in SER systems [7]. Features representing different domains have been used in SER systems. Prosodic features, which describe the speech intonation, rhythm, and pitch trajectories, were the main components of the proposed feature sets [13]. Spectral features such as contrast, bandwidth, centroid, signal energy (RMS), and Mel spectrogram features were also used extensively. The most widely used features in the SER domain are the Mel-frequency cepstral coefficients (MFCC), as they represent the natural speech perception of humans [14-15].

MFCC features and spectrogram images of the audio signals were used to train Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) deep neural network systems [16]. Other important features include linear prediction coefficients (LPC) and voice quality features such as jitter and shimmer [17,18].

Different classification algorithms have been applied in the SER domain, such as Hidden Markov Models (HMM) [7], Gaussian Mixture Models (GMM), tree-based models (Random Forest) [19], Support Vector Machine (SVM) [20,21], K-Nearest Neighbor (KNN) [22], Logistic Regression [23], and Artificial Neural Networks (ANN) [24].



Artificial neural networks such as CNNs, LSTMs, auto-encoders, RNNs, and attention-based models are currently the dominant stream in SER [25]. Transfer learning through a pre-trained CNN model was applied using spectrogram images of the speech [26]. Multimodal systems were also implemented: the integration of speech and text for emotion classification was introduced in [27], and speech and visual images were combined for emotion recognition, as presented in [28].

III. METHODOLOGY
Two datasets were used to evaluate the efficiency of the proposed feature set against the benchmarked INTERSPEECH 2009 feature set.

A. Datasets

RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) is a dynamic dataset of lexically matched statements in an American accent. Twenty-four actors (12 male and 12 female) acted eight emotions: angry, happy, neutral, sad, calm, fearful, surprise, and disgust. Each expression is recorded at two emotional intensity levels, in addition to neutral. It consists of 1440 utterances in .wav format with a sampling rate of 48 kHz [10].

EYASE (Egyptian Arabic Semi-Natural Emotion) is a speech dataset that includes 579 statements representing four basic emotions (angry, happy, neutral, and sad), pronounced by 3 male and 3 female professional actors. The statements were extracted from an Egyptian drama series. The files are in .wav format with a sampling rate of 44.1 kHz [8].
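For illustration, a minimal loading sketch (not part of the paper) is shown below. It assumes the standard RAVDESS filename convention, in which the third dash-separated field encodes the emotion, and it loads the audio at the native 48 kHz rate; EYASE would need its own label-parsing logic, which depends on how that corpus is organized.

```python
import glob
import os
import librosa

# Standard RAVDESS filename convention (assumption): the third dash-separated
# field encodes the emotion, e.g. 03-01-05-01-02-01-12.wav -> emotion code "05".
RAVDESS_EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
                    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprise"}

def load_ravdess(root_dir, sr=48000):
    """Yield (waveform, emotion) pairs for every RAVDESS .wav file under root_dir."""
    for path in glob.glob(os.path.join(root_dir, "**", "*.wav"), recursive=True):
        emotion_code = os.path.basename(path).split("-")[2]
        signal, _ = librosa.load(path, sr=sr)   # RAVDESS is recorded at 48 kHz
        yield signal, RAVDESS_EMOTIONS[emotion_code]
```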
B. Features extraction

As mentioned in the literature review, prosodic and spectral features are commonly used for emotion detection [8,18]. The proposed feature set consists of prosodic, spectral, and statistical parameters and was developed using Librosa [29] and pYAAPT (a pitch tracker tool for Python users) [30]. The Chi-square test was applied to select the most significant features, i.e. those with the best Chi-square scores [31]. Pitch features, the Mel spectrogram, and MFCCs showed high Chi-square scores, which indicates a high impact on the results. Fig. 1 shows the top Chi-square scores for the selected features.

Fig. 1. Chi-square scores for the top-ranked proposed features

Further details about the proposed feature set and the benchmarked INTERSPEECH 2009 Emotion Challenge feature set (IS09) [11], obtained using openSMILE [32], are displayed in Table I.

TABLE I. FEATURES' SET DESCRIPTION

IS09 feature set (384 features):
- Tool used: openSMILE
- Components: RMS, 12 MFCC, ZCR, voicing probability, and fundamental frequency (F0)
- Statistical functions: min, max, mean, range, standard deviation, maxPos, minPos, linregc1, linregc2, linregerrQ, skewness, and kurtosis

Proposed feature set (122 features):
- Tool used: Librosa + pYAAPT
- Components: RMS, 14 MFCC, 8 Mel-spectrogram, ZCR, 12 Chroma, Tonnetz, 8 Contrast, fundamental frequency (F0), pitch contour, and signal's low-frequency band mean energy (SLFME)
- Statistical functions: min, max, standard deviation, mean, range, and percentiles (25, 50, 75, 90)
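The sketch below illustrates this kind of pipeline under stated assumptions; it is not the authors' code. The helper extract_features is hypothetical, it computes only a subset of the Table I descriptors (the pYAAPT pitch features are omitted for brevity), and the Chi-square selection uses scikit-learn's SelectKBest, which requires non-negative inputs, hence the min-max scaling of the feature matrix before scoring.

```python
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

def extract_features(signal, sr):
    """Hypothetical per-utterance feature vector: statistics of a few of the
    spectral descriptors listed in Table I (not the full 122-dimensional set)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=14)
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=8)
    zcr = librosa.feature.zero_crossing_rate(y=signal)
    rms = librosa.feature.rms(y=signal)
    feats = np.concatenate([mfcc, mel, zcr, rms], axis=0)      # (n_descriptors, n_frames)
    stats = [feats.min(1), feats.max(1), feats.mean(1), feats.std(1),
             np.percentile(feats, 75, axis=1) - np.percentile(feats, 25, axis=1)]
    return np.concatenate(stats)                               # one vector per utterance

def select_top_features(X, y, k=40):
    """Rank features with the Chi-square test; X is utterances x features, y the labels."""
    X_pos = MinMaxScaler().fit_transform(X)       # chi2 requires non-negative values
    selector = SelectKBest(chi2, k=k).fit(X_pos, y)
    return selector.get_support(indices=True), selector.scores_
```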
C. Feature Scaling

Different normalization techniques have been used in the literature, such as the Standard Scaler and the Minimum-Maximum Scaler (MMS) [23],[26]. The Minimum-Maximum Scaler (MMS) method was adopted, as given by Eq. (1):

X_scaled = (X - X_min) / (X_max - X_min)                  (1)

where X is an input feature and X_min and X_max are the minimum and maximum values of that feature.
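Eq. (1) corresponds to scikit-learn's MinMaxScaler; a minimal sketch follows (X_train and X_test are assumed to be the feature matrices produced earlier). The scaler should be fit on the training folds only and then applied to the test fold to avoid information leakage.

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                          # implements X' = (X - X_min) / (X_max - X_min)
X_train_scaled = scaler.fit_transform(X_train)   # learn per-feature min/max on training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training min/max on the test fold
```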



D. Machine Learning Models

Four classification techniques were considered. The Support Vector Machine (SVM) was selected for its high performance on higher-dimensional data such as audio data. A Random Forest tree-based ensemble classifier of 500 decision trees with a maximum tree depth of 20 was implemented. The Logistic Regression algorithm was also used to analyze the linear model's performance. Finally, a 3-layered feedforward neural network, the Multi-Layer Perceptron (MLP), was applied.

For hyperparameter tuning, the GridSearchCV method was used to fine-tune the classifiers' parameters. For SVM, the kernel function used is the radial basis function (rbf), the decision function shape is set to one-vs-rest (ovr) with shape (n_samples, n_classes), and the regularization parameter C is set to 10 (the regularization strength is inversely proportional to C). For the Random Forest, n_estimators was set to 500 with a maximum tree depth of 20 and the "entropy" criterion (the function that measures the quality of a split), whereas the "lbfgs" solver with the l2 penalty was used for Logistic Regression, with the maximum number of solver iterations set to 1000 (max_iter). For MLP, the number of neurons in the hidden layer was 400, the solver used was 'adam', the activation function was set to the default 'relu', the mini-batch size for the stochastic optimizer was 5 (batch_size), and the learning rate schedule was 'constant'.
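As an illustration of this configuration, the sketch below sets up the four scikit-learn estimators with the parameters listed above and runs GridSearchCV for the SVM only; the search grid itself is merely an example, since the paper does not list the full search space, and X_train_scaled/y_train are assumed to have been prepared beforehand.

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

models = {
    "svm": SVC(kernel="rbf", C=10, decision_function_shape="ovr"),
    "rf": RandomForestClassifier(n_estimators=500, max_depth=20, criterion="entropy"),
    "lr": LogisticRegression(solver="lbfgs", penalty="l2", max_iter=1000),
    "mlp": MLPClassifier(hidden_layer_sizes=(400,),   # one hidden layer of 400 neurons
                         solver="adam", activation="relu",
                         batch_size=5, learning_rate="constant"),
}

# Illustrative grid for the SVM; the other classifiers can be tuned the same way.
param_grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf", decision_function_shape="ovr"),
                      param_grid, cv=10, scoring="accuracy")
search.fit(X_train_scaled, y_train)        # X_train_scaled, y_train prepared earlier
print(search.best_params_, search.best_score_)
```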
E. Evaluation Metrics

The 10-fold cross-validation was applied to ensure statistical stability and generalization of the model. In 10-fold cross-validation, the database is randomly partitioned into 10 equal-size subsamples. Of the 10 subsamples, one subsample (10% of the database) is used as the testing data to validate the classification model, and the remaining 9 subsamples are used as training data. The reported accuracy is the average over the 10 folds.

We used 4 evaluation metrics during our experiments.

Accuracy: gives an overall measure of the percentage of correctly classified instances.

Accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn)                (2)

where
Tp: true positives (positive examples predicted positive)
Tn: true negatives (negative examples predicted negative)
Fp: false positives (negative examples predicted positive)
Fn: false negatives (negative examples predicted negative)

Precision: measures the true positive cases relative to all positively predicted emotional classes.

Precision = Tp / (Tp + Fp)                                (3)

Recall: shows how many of the actual positive emotional classes were correctly predicted.

Recall = Tp / (Tp + Fn)                                   (4)

Confusion Matrix: a representation used to analyze the model's performance by comparing the actual and predicted labels.
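A short sketch of how these metrics can be collected with 10-fold cross-validation in scikit-learn is given below. Macro averaging over the emotion classes is an assumption (the paper does not state the averaging mode), and X_scaled/y are assumed to be the scaled feature matrix and labels from the previous steps.

```python
from sklearn.svm import SVC
from sklearn.model_selection import cross_validate, cross_val_predict
from sklearn.metrics import confusion_matrix

clf = SVC(kernel="rbf", C=10)
scoring = {"accuracy": "accuracy",
           "precision": "precision_macro",   # assumption: macro-averaged over the emotion classes
           "recall": "recall_macro"}

cv_results = cross_validate(clf, X_scaled, y, cv=10, scoring=scoring)
print({name: cv_results["test_" + name].mean() for name in scoring})   # fold-averaged scores

# Confusion matrix built from out-of-fold predictions
y_pred = cross_val_predict(clf, X_scaled, y, cv=10)
print(confusion_matrix(y, y_pred))
```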
IV. RESULTS AND DISCUSSION

This section elaborates on the classification models' results for emotion recognition and compares this work with previous related research. The recognition performance was analyzed for the classifiers MLP, SVM, Random Forest, and Logistic Regression on two datasets in two different languages (English and Arabic): RAVDESS and EYASE, respectively. 10-fold cross-validation was used for evaluation to ensure the model's generalization and stability.

A. Single Corpus Multi Emotions Classification

The models are trained and tested with the same language. Close results were obtained by applying the different classifiers on the two feature sets, as shown in Table II.

TABLE II. SINGLE CORPUS MULTI-EMOTION CLASSIFICATION USING 10 FOLDS (ANGRY/HAPPY/NEUTRAL/SAD)

Dataset | Feature set | Metric    | MLP  | SVM  | Random Forest | Logistic Regression | Ensemble Learning
RAVDESS | Proposed    | Accuracy  | 78.3 | 85.4 | 76.2 | 70.8 | 79.4
RAVDESS | Proposed    | Precision | 77.5 | 83.2 | 76.5 | 70.7 | 78.5
RAVDESS | Proposed    | Recall    | 76.7 | 84.8 | 74.6 | 68.1 | 77.7
RAVDESS | IS09        | Accuracy  | 81.2 | 84.7 | 74.6 | 80.8 | 82.7
RAVDESS | IS09        | Precision | 81   | 84.4 | 76.8 | 80.7 | 83.8
RAVDESS | IS09        | Recall    | 82.1 | 83.2 | 71.4 | 81.2 | 82.2
EYASE   | Proposed    | Accuracy  | 64.6 | 64.1 | 62.5 | 61.3 | 64
EYASE   | Proposed    | Precision | 64.7 | 64.7 | 60   | 60.3 | 63.7
EYASE   | Proposed    | Recall    | 63.9 | 63   | 61   | 60.4 | 62.6
EYASE   | IS09        | Accuracy  | 61   | 64.6 | 61.5 | 63   | 64.2
EYASE   | IS09        | Precision | 60   | 64.5 | 60.6 | 62   | 64
EYASE   | IS09        | Recall    | 60   | 63.7 | 60.7 | 62   | 62.7

B. Cross Corpus Multi Emotion Classification

The models are trained with both languages and tested with one language at a time.
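A possible realization of this protocol is sketched below (an illustration, not the authors' exact procedure, which reports 10-fold results); it assumes the English and Arabic feature matrices and labels have already been extracted and share the same four emotion classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def cross_corpus_eval(X_en, y_en, X_ar, y_ar, test_size=0.2, seed=0):
    """Train on a mix of both corpora, then test on the held-out part of each corpus separately."""
    Xe_tr, Xe_te, ye_tr, ye_te = train_test_split(X_en, y_en, test_size=test_size,
                                                  stratify=y_en, random_state=seed)
    Xa_tr, Xa_te, ya_tr, ya_te = train_test_split(X_ar, y_ar, test_size=test_size,
                                                  stratify=y_ar, random_state=seed)
    clf = SVC(kernel="rbf", C=10)
    clf.fit(np.vstack([Xe_tr, Xa_tr]), np.concatenate([ye_tr, ya_tr]))  # bilingual training set
    return clf.score(Xe_te, ye_te), clf.score(Xa_te, ya_te)             # per-language accuracy
```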



TABLE III. CROSS CORPUS MULTI-EMOTION CLASSIFICATION USING PROPOSED FEATURE SET (ANGRY/HAPPY/NEUTRAL/SAD)

Dataset | Feature set | Metric    | MLP  | SVM  | Random Forest | Logistic Regression | Ensemble Learning
RAVDESS | Proposed    | Accuracy  | 65.6 | 66.3 | 63.8 | 63.6 | 66.2
RAVDESS | Proposed    | Precision | 63   | 64   | 62.7 | 63.5 | 64
RAVDESS | Proposed    | Recall    | 63.4 | 65.8 | 63.4 | 63.2 | 65.2
RAVDESS | IS09        | Accuracy  | 64.6 | 64.3 | 64.8 | 63.8 | 65.7
RAVDESS | IS09        | Precision | 62.8 | 64.6 | 62.3 | 64   | 64.6
RAVDESS | IS09        | Recall    | 63   | 65.4 | 65   | 63.8 | 65.7
EYASE   | Proposed    | Accuracy  | 61.6 | 62.1 | 60.5 | 59.3 | 62
EYASE   | Proposed    | Precision | 61.3 | 61   | 60.2 | 58.3 | 62.7
EYASE   | Proposed    | Recall    | 60.9 | 61.6 | 61   | 59.4 | 62.2
EYASE   | IS09        | Accuracy  | 61.7 | 62.6 | 59.5 | 59.6 | 61.2
EYASE   | IS09        | Precision | 62.6 | 61.5 | 58.6 | 57.3 | 60.7
EYASE   | IS09        | Recall    | 60.8 | 61.7 | 58.7 | 58.5 | 61.7

Since the main objective of this bilingual proposed system is to detect the speaker's emotion through the duration of the talk, the valence-arousal emotion classification was considered, as shown in Table IV. Valence describes the emotion as positive/satisfying or negative/dissatisfying, while arousal describes the strength of the emotion [33].
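For illustration only, binary valence labels could be derived from the categorical emotions with a mapping like the one below; the paper does not spell out its mapping, so the assignment of each emotion (neutral in particular) to a valence class is an assumption.

```python
# Hypothetical emotion-to-valence mapping (assumption; not stated explicitly in the paper)
VALENCE = {"happy": "positive",
           "neutral": "positive",   # neutral might instead be excluded or treated separately
           "angry": "negative",
           "sad": "negative"}

valence_labels = [VALENCE[e] for e in emotion_labels]   # emotion_labels built during data loading
```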
TABLE IV. VALENCE EMOTION CLASSIFICATION

Dataset | Metric    | MLP  | SVM  | Random Forest | Logistic Regression | Ensemble Learning
RAVDESS | Accuracy  | 88.8 | 89.8 | 86.8 | 84.2 | 90
RAVDESS | Precision | 88.3 | 88.2 | 87   | 84.1 | 89.5
RAVDESS | Recall    | 88.2 | 87.3 | 87.2 | 84.4 | 89.7
EYASE   | Accuracy  | 85   | 86.5 | 85.8 | 82.2 | 87.6
EYASE   | Precision | 85   | 86.2 | 85.9 | 82   | 88
EYASE   | Recall    | 85.2 | 86.3 | 85.8 | 82.2 | 86.5
V. CONCLUSION

In this paper, a novel speech feature set was used to train different ML models. The impact of each feature was studied, and it was found that MFCC is one of the most dominant features across the used classifiers. SVM proved to be the optimum SER classifier for its accuracy and efficiency. MLP is a very promising classifier, but it needs a long training time. Promising results were reached: the recognition rates of the single-corpus multi-emotion classification system were 85% for RAVDESS and 64.6% for EYASE using SVM and MLP respectively, and the recognition rates of the cross-corpus multi-emotion classification system were 66% for RAVDESS and 62% for EYASE using SVM.

REFERENCES

[1] A. J. and R. A. L. Matsane, "The use of Automatic Speech Recognition in education for identifying attitudes of the Speakers," in IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), 2020.
[2] S.-M. Park and Y.-G. Kim, "A Metaverse: Taxonomy, Components, Applications, and Open Challenges," IEEE Access, vol. 10, pp. 4209-4251, 2022, doi: 10.1109/ACCESS.2021.3140175.
[3] E. Blumentals and A. Salimbajevs, "Emotion Recognition in Real-World Support Call Center Data for Latvian Language," CEUR Workshop Proceedings, vol. 3124, 2022.
[4] M. A. Rashidan et al., "Technology-Assisted Emotion Recognition for Autism Spectrum Disorder (ASD) Children: A Systematic Literature Review," IEEE Access, vol. 9, pp. 33638-33653, 2021.
[5] L. Tan et al., "Speech Emotion Recognition Enhanced Traffic Efficiency Solution for Autonomous Vehicles in a 5G-Enabled Space-Air-Ground Integrated Intelligent Transportation System," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 3, pp. 2830-2842, March 2022, doi: 10.1109/TITS.2021.3119921.
[6] Y. Du, R. G. Crespo, and O. S. Martínez, "Human emotion recognition for enhanced performance evaluation in e-learning," Progress in Artificial Intelligence, pp. 1-13, 2022, doi: 10.1007/S13748-022-00278-2.
[7] M. B. Akçay and K. Oğuz, "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers," Speech Commun., vol. 116, pp. 56-76, 2020.
[8] L. Abdel-Hamid, "Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features," Speech Commun., vol. 122, pp. 19-30, 2020.
[9] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2227-2231.
[10] S. Livingstone and F. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)," vol. 13, 2018.
[11] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," INTERSPEECH, 2010.
[12] S. G. Koolagudi, Y. V. S. Murthy, and S. P. Bhaskar, "Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition," Int. J. Speech Technol., vol. 21, no. 1, pp. 167-183, 2018.
[13] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognit., vol. 44, no. 3, pp. 572-587, 2011.
[14] S. Lalitha, D. Geyasruti, R. Narayanan, and M. Shravani, "Emotion Detection Using MFCC and Cepstrum Features," Procedia Comput. Sci., vol. 70, pp. 29-35, 2015.
[15] K. A. Araño, P. Gloor, C. Orsenigo, and C. Vercellis, "When Old Meets New: Emotion Recognition from Speech Signals," Cognit. Comput., 2021.
[16] J. Ancilin and A. Milton, "Improved speech emotion recognition with Mel frequency magnitude coefficient," Applied Acoustics, vol. 179, p. 108046, 2021.
[17] Mustaqeem and S. Kwon, "A CNN-assisted enhanced audio signal processing for speech emotion recognition," Sensors (Switzerland), vol. 20, no. 1, 2020.
[18] A. Koduru, H. B. Valiveti, and A. K. Budati, "Feature extraction algorithms to improve the speech emotion recognition rate," Int. J. Speech Technol., vol. 23, no. 1, pp. 45-55, 2020.
[19] N. Vryzas, L. Vrysis, M. Matsiola, R. Kotsakis, C. Dimoulas, and G. Kalliris, "Continuous Speech Emotion Recognition with Convolutional Neural Networks," J. Audio Eng. Soc., vol. 68, no. 1/2, pp. 14-24, 2020.
[20] A. Bhavan, P. Chauhan, Hitkul, and R. R. Shah, "Bagged support vector machines for emotion recognition from speech," Knowledge-Based Syst., vol. 184, p. 104886, 2019.
[21] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. A. Berger, and Hofer, "Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition," Interspeech, 2019.
[22] W. Zehra, A. R. Javed, Z. Jalil, H. U. Khan, and T. R. Gadekallu, "Cross corpus multi-lingual speech emotion recognition using ensemble learning," Complex Intell. Syst., vol. 7, pp. 1845-1854, 2021.
[23] S. Goel and H. Beigi, "Cross-Lingual Cross-Corpus Speech Emotion Recognition," arXiv, 2020.
[24] Z. Peng, Y. Lu, S. Pan, and Y. Liu, "Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention," in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3020-3024.
[25] N.-C. Ristea, L. C. Duţu, and A. Radoi, "Emotion Recognition System from Speech and Visual Information based on Convolutional Neural Networks," in 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), 2019, pp. 1-6.
[26] Z. T. Liu, A. Rehman, M. Wu, W. H. Cao, and M. Hao, "Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence," Inf. Sci. (Ny)., vol.
[27] M. Caschera, P. Grifoni, and F. Ferri, "Emotion Classification from Speech and Text in Videos Using a Multimodal Approach," Multimodal Technologies and Interaction, 6(4):28, April 2022, doi: 10.3390/mti6040028.
[28] Y. Li, Q. He, Y. Zhao, and H. Yao, "Multi-modal Emotion Recognition Based on Speech and Image," Advances in Multimedia Information Processing, May 2018, doi: 10.1007/978-3-319-77380-3_81.
[29] https://librosa.org/
[30] http://bjbschmitt.github.io/AMFM_decompy/pYAAPT.html
[31] https://www.analyticsvidhya.com/blog/2020/10/feature-selection-techniques-in-machine-learning/
[32] https://www.audeering.com/research/opensmile/
[33] https://cxl.com/blog/valence-arousal-and-how-to-kindle-an-emotional-fire/, last updated: Aug 25, 2022.

979-8-3503-0030-7/23/$31.00 ©2023 IEEE
