Arabic English Speech Emotion Recognition System
Abstract— The Speech Emotion Recognition (SER) system is an approach to identifying individuals' emotions. This is important for human-machine interface applications and for the emerging Metaverse. This paper presents a bilingual Arabic-English speech emotion recognition system using the EYASE and RAVDESS datasets. A novel feature set was composed of spectral and prosodic parameters to obtain high performance at a low computational cost. Different machine learning classifiers were applied, including Multi-Layer Perceptron, Support Vector Machine, Random Forest, Logistic Regression, and Ensemble Learning. The execution time of the proposed feature set was compared to that of the benchmarked "INTERSPEECH 2009" feature set. Promising results were obtained using the proposed feature set. SVM gave the best emotion recognition rate and execution performance. The best accuracies achieved were 85% on RAVDESS and 64% on EYASE. Ensemble Learning detected the valence emotion with 90% accuracy on RAVDESS and 87.6% on EYASE.

Keywords: Bilingual speech emotion recognition, cross corpus, Mel frequency cepstral coefficients, prosodic features

I. INTRODUCTION

The proposed feature-set performance was compared with the performance of the benchmarked IS09 feature set (based on the INTERSPEECH 2009 Emotion Challenge) [11]. Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), Simple Logistic Regression (SLR), and Random Forest (RF) machine learning classification models were used [12]. SER performance was analyzed using 10-fold cross-validation to ensure model generalization and stability. Three evaluation metrics were used: accuracy, recall, and precision.

This paper is organized as follows: a literature review of the evolution of SER is presented in Section 2, the methodology applied with the proposed feature set is presented in Section 3, experiments and results are presented in Section 4, and finally, Section 5 presents the conclusion and proposed future work.

II. LITERATURE REVIEW

A significant interest in SER research has evolved over the past two decades. Several acted, elicited, and non-acted datasets are now available for use in SER systems.
III. METHODOLOGY

Two datasets were used to evaluate the efficiency of the proposed feature set against the benchmarked INTERSPEECH 2009 (IS09) feature set.
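As an illustration of how a compact spectral-prosodic feature vector can be computed, consider the minimal Python sketch below. It is not the paper's exact recipe, which is not specified in this excerpt; the parameter choices (13 MFCCs, a 50-400 Hz pitch range) and the use of librosa are assumptions.

# Illustrative sketch only: MFCC means/stds (spectral) plus pitch and
# energy statistics (prosodic) are assumed as a representative feature set.
import numpy as np
import librosa

def extract_features(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)

    # Spectral parameters: MFCCs summarized over time.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    spectral = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Prosodic parameters: fundamental frequency (pitch) and energy statistics.
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    prosodic = np.array([f0.mean(), f0.std(), rms.mean(), rms.std()])

    return np.concatenate([spectral, prosodic])

Summarizing frame-level parameters by their statistics keeps the feature vector small, which is consistent with the paper's goal of high performance at a low computational cost.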
A. Datasets
Two datasets were used: the Arabic EYASE dataset [8] and the English RAVDESS dataset [10].
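For RAVDESS, the emotion label can be read directly from the filename, whose third dash-separated field is a standard emotion code. The helper below is a hypothetical illustration, not code from the paper.

# Hypothetical helper: RAVDESS filenames encode metadata as seven
# dash-separated fields, the third being the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_emotion(filename):
    """E.g. '03-01-06-01-02-01-12.wav' -> 'fearful'."""
    code = filename.split("-")[2]
    return RAVDESS_EMOTIONS[code]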
E. Evaluation Metrics
10-fold cross-validation was applied to ensure the statistical stability and generalization of the models. In 10-fold cross-validation, the database is randomly partitioned into 10 equal-size subsamples. Of the 10 subsamples, a single subsample, i.e., 10% of the database, is held out as testing data to validate the classification model, and the remaining nine subsamples are used as training data. The reported accuracy is the average over the 10 folds.
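A minimal Python sketch of this protocol is given below. The classifier settings are assumptions for illustration (the text notes a 'constant' learning rate, which is assumed here to refer to the MLP), and scikit-learn is used for concreteness.

# Sketch of the 10-fold evaluation protocol; hyperparameters are assumed.
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

classifiers = {
    # The source notes the learning rate was 'constant'.
    "MLP": MLPClassifier(learning_rate="constant", max_iter=500),
    "SVM": SVC(kernel="rbf"),
    "RF": RandomForestClassifier(n_estimators=100),
    # 'Simple Logistic Regression' is approximated by LogisticRegression here.
    "SLR": LogisticRegression(max_iter=1000),
}
# Ensemble Learning: assumed here to be majority voting over the base models.
classifiers["Ensemble"] = VotingClassifier(
    estimators=[(name, model) for name, model in classifiers.items()],
    voting="hard",
)

def evaluate(X, y):
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, model in classifiers.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: mean accuracy = {scores.mean():.3f}")

StratifiedKFold preserves the per-emotion class proportions in each fold, matching the 10% held-out description above.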
Three evaluation metrics were computed for each classifier: accuracy, precision, and recall.
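The definitions below are the standard ones in terms of true/false positives and negatives (TP, FP, TN, FN); whether the per-class values were macro- or micro-averaged for the multi-class setting is not stated in this excerpt.

% Standard definitions of the three reported metrics.
\text{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall}    = \frac{TP}{TP + FN}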
IV. EXPERIMENTS AND RESULTS

[Table: emotion recognition accuracy, precision, and recall (%) of the proposed feature-set and the benchmarked IS09 feature-set on the RAVDESS and EYASE datasets, using the MLP, SVM, Random Forest, Logistic Regression, and Ensemble Learning classifiers.]
[Table: a further comparison of the proposed feature-set and the IS09 feature-set on the RAVDESS and EYASE datasets (accuracy, precision, and recall, %) across the same five classifiers.]

Since the main objective of this proposed bilingual system is to detect the speaker's emotion over the duration of the talk, the valence-arousal emotion classification was considered, as shown in Table IV. Valence describes the emotion as positive/satisfying or negative/dissatisfying, while arousal describes the strength of the emotion [33].
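As an illustration, the binary valence grouping can be expressed as a simple lookup. The assignment below (in particular for 'neutral' and 'calm') is an assumption, since the paper's exact label-to-valence mapping is not shown in this excerpt.

# Illustrative valence grouping; the mapping itself is assumed.
VALENCE = {
    # positive / satisfying
    "happy": "positive", "calm": "positive", "neutral": "positive",
    # negative / dissatisfying
    "angry": "negative", "sad": "negative",
    "fearful": "negative", "disgust": "negative",
}

def to_valence(emotion_label):
    return VALENCE[emotion_label]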
TABLE IV. VALENCE EMOTION CLASSIFICATION

[Table IV: valence recognition results for the MLP, SVM, Random Forest, Logistic Regression, and Ensemble Learning classifiers; per the abstract, Ensemble Learning performed best, with 90% on RAVDESS and 87.6% on EYASE.]

REFERENCES

[1] … Conference on Computer Science and Data Engineering (CSDE), 2020.
[2] S.-M. Park and Y.-G. Kim, "A Metaverse: Taxonomy, Components, Applications, and Open Challenges," IEEE Access, vol. 10, pp. 4209–4251, 2022, doi: 10.1109/ACCESS.2021.3140175.
[3] E. Blumentals and A. Salimbajevs, "Emotion Recognition in Real-World Support Call Center Data for Latvian Language," CEUR Workshop Proceedings.
[4] M. A. Rashidan et al., "Technology-Assisted Emotion Recognition for Autism Spectrum Disorder (ASD) Children: A Systematic Literature Review," IEEE Access, vol. 9, pp. 33638–33653, 2021.
[5] L. Tan et al., "Speech Emotion Recognition Enhanced Traffic Efficiency Solution for Autonomous Vehicles in a 5G-Enabled Space–Air–Ground Integrated Intelligent Transportation System," IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 3, pp. 2830–2842, March 2022, doi: 10.1109/TITS.2021.3119921.
[6] Y. Du, R. G. Crespo, and O. S. Martínez, "Human emotion recognition for enhanced performance evaluation in e-learning," Progress in Artificial Intelligence, pp. 1–13, 2022, doi: 10.1007/s13748-022-00278-2.
[7] M. B. Akçay and K. Oğuz, "Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers," Speech Commun., vol. 116, pp. 56–76, 2020.
[8] L. Abdel-Hamid, "Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features," Speech Commun., vol. 122, pp. 19–30, 2020.
[9] S. Mirsamadi, E. Barsoum, and C. Zhang, "Automatic speech emotion recognition using recurrent neural networks with local attention," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2227–2231.
[10] S. R. Livingstone and F. A. Russo, "The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)," PLoS ONE, vol. 13, no. 5, 2018.
[11] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 Emotion Challenge," in Proc. INTERSPEECH, 2009.
[12] S. G. Koolagudi, Y. V. S. Murthy, and S. P. Bhaskar, "Choice of a classifier, based on properties of a dataset: case study-speech emotion recognition," Int. J. Speech Technol., vol. 21, no. 1, pp. 167–183, 2018.