
2024 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET)

Bengali Speech Sentiment Analysis using Machine Learning Models: A Comparative Study

Mohammad Tanveer Shams 1, Md. Akib Hasan 2, Animesh Das Chowdhury 3, Tabassum Jahan Lamia 4,
Md. Reasad Zaman Chowdhury 5,*, Mohammad Marufur Rahman 6

1,2,3,4,5 Department of Computer Science and Engineering (CSE)
Ahsanullah University of Science and Technology (AUST)
Dhaka, Bangladesh
Email: [email protected] 1, [email protected] 2, [email protected] 3, [email protected] 4, [email protected] 5,*, [email protected] 6

* Corresponding Author: Md. Reasad Zaman Chowdhury

DOI: 10.1109/IICAIET62352.2024.10730340

Abstract—Though scholars find speech sentiment analysis based on audio data to be a very intriguing study topic, not enough work has been done for the fifth most spoken language in the world, Bangla. The purpose of this study is to close this research gap. A combined dataset of SUBESCO, BanglaSER, and KBES was utilized in this study's evaluation of all the models. Evaluations have been done on CNN models, ML models, and sequence models like LSTM and Bi-LSTM. Mel-frequency spectrograms were exploited by the CNN models, while the machine learning and sequence models were applied to numerical features. DenseNet201 achieved the finest accuracy of all the individual models, at 94%. Finally, utilizing both hard and soft voting, all of the models based on numerical features and spectrogram features were ensembled, yielding accuracy rates of 94.99% for numerical features and 95% for spectrogram features.

Keywords—SSA, SUBESCO, BanglaSER, KBES, CNN, ML, Sequence model, Spectrogram, MFCC, Ensemble

I. Introduction

Opinion mining, or sentiment analysis, is the computational process of determining and extracting sentiments, opinions, or emotional tones expressed in various types of data, including textual, audio, or audio-visual data. Sentiment analysis is an essential function in a variety of systems, including product review analysis, consumer feedback analysis, education engagement analysis, danger detection systems, etc. Throughout recent years, researchers have widely explored speech sentiment analysis using text-based information. As social media has grown in popularity and user base, it has become a vast textual corpus. However, sentiment analysis based on audio data is yet to be explored thoroughly. Because human speech can convey a wide range of emotions through the fusion of acoustic features, linguistic elements, and prosody, it is crucial to explore sentiment analysis based on audio data.

This trend continues in low-resource languages such as Bangla. Most of the research done on Bengali speech sentiment analysis works on textual data. Considering the number of Bengali speakers, this is a crucial research gap that requires investigation, and this research works toward reducing the corresponding flaw by evaluating the performance of ML, DNN, and RNN models on features extracted from a Bengali speech dataset. This dataset has been created by merging three datasets: SUBESCO, BanglaSER, and KBES. Different features such as the Mel-Frequency Spectrogram, Chroma Shift, Zero Crossing Rate, and Mel-Frequency Cepstral Coefficients have been extracted from the datasets. More emphasis has been put on Mel-Frequency Spectrograms since, from their introduction by S. Davis and P. Mermelstein [1], they have become a staple in sentiment analysis.

II. Related Works

Analyzing speech sentiment is a comparatively new and demanding research sector, as people use various types of voice and audio in day-to-day scenarios. Though there are many studies on finding and optimizing the efficiency of SSA in various languages, most of the studies are done in English. Speech Sentiment Analysis in the Bangla language carries a vital significance, as it is the fifth most spoken language in the world [2]. Even though Bengali is one of the most spoken languages, there has been comparatively less research done on this language than on English.

Among them, [3] thoroughly examined different models for SSA using Bengali datasets like SUBESCO and BanglaSER. There, the datasets were experimented with CNN architectures like VGG-19, Inception v4, and ResNet-50, as well as RNNs like LSTM and Bi-LSTM models. They also implemented ML models like Random Forest, AdaBoost and K-Nearest Neighbour. The results of the study showed that the RF and KNN models had the highest accuracy rates (90%), followed by the LSTM and Bi-LSTM models (80%), while the AdaBoost and CNN models performed worse (60%). However, on the custom noisy dataset, the accuracy drastically decreased, indicating how different the models' performance was in different scenarios.

For SSA, there are many papers where the SUBESCO and RAVDESS datasets have been used for research. Among them, [4] concentrated on choosing significant features to increase accuracy. According to the study, enhanced feature selection can produce results with accuracy comparable to deep learning models but at a lower computational cost. In their paper, different ML models, including RF, SVM and XGBoost, were implemented. Their model's optimized feature set achieved an accuracy of 82.9% on SUBESCO using SVM, demonstrating the efficacy of their method in
the recognition of emotions. On the other hand, [5] concentrated on Bengali audio speech emotion recognition where, to capture both local and sequential speech characteristics, the research created a novel model that merges a deep convolutional neural network (DCNN) with a bidirectional long short-term memory (BLSTM) network, enhanced with a time-distributed flatten layer. Known as DCTFB, this model outperformed other CNN-based models with high accuracy rates of 86.9% on SUBESCO and 82.7% on RAVDESS. A thorough cross-lingual analysis using transfer learning was also included in the study, demonstrating the model's wider linguistic applicability. But most importantly, [6] outperformed these previous works by designing a hybrid model and achieving 98.97% accuracy on SUBESCO and 98.30% accuracy on the RAVDESS dataset. For the hybrid model, they integrated a one-dimensional CNN with an LSTM architecture into a fully connected network. The paper strongly highlights with evidence why their hybrid model is effective in emotion detection and classification on voice inputs.

While the SUBESCO and RAVDESS datasets were popular, [7] worked with a self-collected audio recording dataset, the Abeg dataset. In the research paper, they portrayed a model for identifying emotions in Bengali speech with a focus on neutrality, happiness, and rage. The study compared several ML models, for example, Logistic Regression and SVM, using MFCC as well as LPC features. Significant accuracy was attained by the model; on the Abeg dataset, Logistic Regression performed the best, with an accuracy of 92%. The study also demonstrated how well MFCC and LPC features combined can predict emotions in Bengali speech. In the research, the RAVDESS dataset was utilized to benchmark the effectiveness of the system. When the RAVDESS and Abeg datasets were combined, the XGBoost classifier achieved 86% accuracy. Another research [8] focused on identifying Bengali speech's four emotional states: happy, sad, angry, and neutral. The study trained a k-Nearest Neighbor classifier using feature vectors that included pitch and Mel-Frequency Cepstral Coefficients (MFCC). There were four hundred emotive sentences from fifty people in the dataset. With individual emotion detection accuracies of 80% for happy, 75% for sad, 85% for angry, and 75% for neutral emotions, the method yielded an average accuracy of 87.50%.

While all the other Bangla SER datasets did not include an intensity measure, [9] created a dataset named KBES that includes an intensity measure and is comparatively more realistic than other datasets. The study developed a cascaded model which includes CNN with TDF, LSTM, Bi-LSTM and dense layers. The proposed model was trained using 3D-transformed speech data created by blending Mel-Frequency Cepstral Coefficient, Short-Time Fourier Transform (STFT), and Chroma STFT signal transformation techniques. This was used to classify both the audio sentiment and its intensity. The model reached accuracies of 88.30% and 71.67% on RAVDESS and KBES respectively.

Upon reviewing the existing research, some notable limitations remain. Bangla being a low-resource language, few datasets are available for classifying speech sentiments. Although some datasets have previously been combined to achieve a larger trainable dataset, some of the most recent datasets still remained unemployed in the combination process. Different approaches have been employed with different datasets in the existing research works; however, no particular benchmark exists by which the approaches can be generalized to determine the best approach on a generalized dataset. Most existing datasets also fail to represent real-world scenarios, which vary in audio quality and include speech with different dialects and local Bengali language variants.

III. Methodology

A. Dataset

Three pre-existing audio datasets named KBES [10], SUBESCO [11], and BanglaSER [12] were collected. The overall proposed approach is represented in Fig. 2.

The KBES dataset consists of 900 audio files in the following categories: Angry (Low), Disgust (Low), Happy (Low), Sad (Low), Neutral, Angry (High), Disgust (High), Happy (High) and Sad (High). Each file of these nine types is 3 seconds long. Additionally, SUBESCO is a purely audio emotive speech corpus including seven thousand sentence-level Bangla statements. The classes of the audio files are happiness, surprise, anger, fear, neutral, sadness and disgust. In this audio dataset, each file is 3 seconds long. On the other hand, BanglaSER, which contains 1467 Bangla speech-audio recordings, has five human sentiment classes: happy, surprise, neutral, sad, and angry. Each audio file of this dataset is also 3 seconds long.

With a total of seven distinct classes, SUBESCO has the highest number of sentiment data. Since all five classes of BanglaSER align with those of SUBESCO, the datasets were merged according to their respective classes. KBES has five different sentiments with high-low divisions for each sentiment except Neutral. These high-low divisions were first unified, and each class was then combined with the matching classes from SUBESCO. The classwise frequency distribution of the different datasets, along with the merged dataset, is outlined in Table I.

The merged dataset contains 9367 audio files. The highest amount of data is found in Happy, Angry and Sad, as these three are common classes across the three datasets. The lowest amount of data is found in the Fear class, as it only exists in the SUBESCO dataset.

TABLE I. Frequency Distribution of Different Datasets (No. of samples)

Sentiment   SUBESCO   BanglaSER   KBES   Merged Dataset
Happy       1000      306         200    1506
Surprise    1000      306         -      1306
Angry       1000      306         200    1506
Sad         1000      306         200    1506
Fear        1000      -           -      1000
Disgust     1000      -           200    1200
Neutral     1000      243         100    1343
Total       7000      1467        900    9367
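As a small illustration of this unification and merging step, the sketch below collapses KBES's intensity-split labels and concatenates the three metadata tables. The DataFrame layouts and file names are hypothetical stand-ins; the paper does not describe how the file lists are actually stored.

```python
import pandas as pd

def unify_kbes_label(label: str) -> str:
    """Collapse KBES intensity variants, e.g. 'Angry (High)' / 'Angry (Low)' -> 'Angry'."""
    return label.split(" (")[0]

# Hypothetical per-dataset metadata frames with columns ["path", "label"].
kbes = pd.DataFrame({"path": ["k0001.wav"], "label": ["Angry (High)"]})
subesco = pd.DataFrame({"path": ["s0001.wav"], "label": ["Angry"]})
banglaser = pd.DataFrame({"path": ["b0001.wav"], "label": ["Angry"]})

kbes["label"] = kbes["label"].map(unify_kbes_label)
merged = pd.concat([subesco, banglaser, kbes], ignore_index=True)  # 9367 files in the real merge
```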
B. Feature Extraction

Various feature extraction algorithms were implemented to recover features from the augmented data after it had been enhanced with noise, pitch shifting, and time stretching. Numerous numerical features such as MFCCs, ZCR and Chroma Shift, along with the Log Mel-Spectrograms of the audio
data, were extracted from the dataset using Python's Librosa library.

From the audio dataset, 40 unique MFCC values were recovered using the Librosa library [13]. A signal's rate of sign changes from positive to negative and vice versa, creating intermediate frequencies, is known as the zero-crossing rate, or ZCR. We used Librosa to extract the zero-crossing rate for every frame. A different extraction method called Chroma Shift (CS) was applied for assessing the tonal variations of compressed sound data. Also, the chromagram feature from Librosa is utilized to help identify an audio clip's chords and determine harmonic similarity. Root Mean Square (RMS) energy was also extracted from the audio files using the same library. Finally, a CSV file was created based on the extracted numerical features of all the merged augmented and raw audio data.

The Librosa library was also used in generating Log Mel-Spectrograms from the merged audio data, both augmented and raw. Mel-Spectrograms are visual portrayals of frequency spectra that vary across time, mapped onto the Mel scale. To generate the logarithmic Mel-Spectrograms, a Mel scale is applied to the frequency axis and a logarithmic scale is applied to the amplitude axis. The Log Mel-Spectrograms were generated using 128 Mel bins, a Hann window of size 2048 samples, and a hop length of 512 samples, as represented in Fig. 1. The generated Log Mel-Spectrograms of both raw and augmented audio data were stored together in a directory, with each spectrogram being in .png format.

Fig. 1. Log Mel-Spectrogram for Different Speech Sentiment Data: (a) Angry, (b) Disgust, (c) Fear, (d) Happy, (e) Neutral, (f) Sad, (g) Surprise
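A minimal sketch of this extraction pipeline with Librosa follows. Pooling each per-frame feature by its mean and using Librosa's default 22050 Hz sample rate are assumptions, since the paper does not state how frame-level values were aggregated.

```python
import numpy as np
import librosa

def extract_numerical_features(path, sr=22050):
    """Per-file numerical features described above: 40 MFCCs, ZCR, 12 chroma bins, RMS (54 values)."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).mean(axis=1)   # 40 MFCCs
    zcr = librosa.feature.zero_crossing_rate(y).mean()                # zero-crossing rate
    chroma = librosa.feature.chroma_stft(y=y, sr=sr).mean(axis=1)     # 12 chroma bins
    rms = librosa.feature.rms(y=y).mean()                             # root mean square energy
    return np.hstack([mfcc, zcr, chroma, rms])

def log_mel_spectrogram(path, sr=22050):
    """Log Mel-Spectrogram with the stated parameters: 128 Mel bins, 2048-sample Hann window, hop 512."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, n_fft=2048,
                                         hop_length=512, window="hann")
    return librosa.power_to_db(mel, ref=np.max)  # logarithmic (dB) amplitude scaling
```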
C. Data Augmentation

All three datasets used in this research consist of audio samples recorded in a controlled environment free from significant noise and disruptions. This does not represent a real-world environment, where a plethora of noise may be present. Thus, it was decided to augment the audio files of the dataset by injecting noise, shifting pitch, and stretching time.

The merged dataset was augmented using Librosa, a Python package for music and audio analysis developed by McFee et al. [13]. Three factors were present in the augmentation of our data.

Pitch Shifting: The frequencies of the audio files were shifted by a factor of approximately 0.7. A random variance of -0.1 to 0.1 was added to the pitch factor to randomize each augmentation.

Time Stretching: The time stretching factor was set to 0.8. Also, a random variance of -0.1 to 0.1 was added to randomize each augmentation.

Noise Injection: The noise factor was set to 0 to 0.5% of the maximum amplitude (loudest sound) of each audio clip. A Gaussian distribution was used for the noise injection, and NumPy's random functions were used to randomize the factor.
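The following sketch shows one way to implement these three augmentations with Librosa and NumPy. The random ranges follow the text; interpreting the pitch factor of 0.7 as semitone steps is an assumption, since the paper does not state the unit.

```python
import numpy as np
import librosa

rng = np.random.default_rng()

def augment(y, sr):
    """Apply the three augmentations described above to one audio signal."""
    # Pitch shifting: base factor of about 0.7 with a random variance of +/-0.1
    # (interpreted here as semitone steps).
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=0.7 + rng.uniform(-0.1, 0.1))
    # Time stretching: base rate 0.8 with a random variance of +/-0.1.
    y = librosa.effects.time_stretch(y, rate=0.8 + rng.uniform(-0.1, 0.1))
    # Noise injection: Gaussian noise scaled to 0-0.5% of the loudest sample.
    noise_level = rng.uniform(0, 0.005) * np.max(np.abs(y))
    return y + noise_level * rng.standard_normal(len(y))
```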
D. Resampling and Normalization

Since the dataset being used is a combination of three distinct datasets, certain classes of data are absent in some datasets. For example, the Fear class is absent in BanglaSER and KBES, resulting in a noticeable dataset imbalance. The numerical features of the dataset have been balanced using the Synthetic Minority Oversampling Technique (SMOTE), which balances the dataset by synthesizing new data for the minority classes. Additionally, a standardization technique has been used to normalize the dataset, ensuring uniform scaling among the features.
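A minimal sketch of this balancing and scaling step is given below, assuming the numerical features were saved to a CSV as described in Section III-B; the file and column names are illustrative.

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler

# Hypothetical CSV produced by the feature-extraction step.
df = pd.read_csv("merged_features.csv")
X, y = df.drop(columns=["sentiment"]).to_numpy(), df["sentiment"].to_numpy()

# SMOTE synthesizes new minority-class samples so every sentiment class is equally represented.
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y)

# Standardization rescales each feature to zero mean and unit variance.
X_bal = StandardScaler().fit_transform(X_bal)
```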
E. Sentiment Analysis Based On Numerical Data using ML Classifiers

Different machine learning models were trained on the numerical features extracted from the audio data for the sentiment analysis task. One such algorithm is Random Forest, a machine-learning ensemble model that employs bootstrap aggregation and decision trees as foundational learners. Decision trees are simple models that aid in decision-making and serve as foundations for more sophisticated ensemble techniques. KNN is a powerful supervised algorithm used for regression and classification, but it can be computationally intensive. Gradient Boost is an iterative method for developing weak learners into strong prediction models, whereas AdaBoost enhances weak learners' performance by assigning weights to training instances. In terms of sentiment analysis from audio speech, these ML models effectively classify emotions by analyzing audio features like MFCCs and ZCR. These models capture complex emotional patterns in speech, allowing for precise sentiment prediction using subtle sound signals. All the parameters used in these models have been fine-tuned for the highest metrics.
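A sketch of this training loop with scikit-learn follows. Default hyperparameters are shown because the paper reports fine-tuning without the final values, and for brevity the split here is taken from the balanced matrix of the previous sketch, whereas the paper holds out 10% of raw files before augmentation (Section III-H).

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X_bal, y_bal: balanced, standardized features from the previous sketch.
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal, test_size=0.1, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(),
    "K-NN": KNeighborsClassifier(),
    "Gradient Boost": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))  # accuracy, precision, recall, F1
```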
F. Sentiment Analysis Based On Log-Mel Spectrograms using Convolutional Neural Networks

In another approach, the effectiveness of several popular Convolutional Neural Network (CNN) models trained on Log-Mel Spectrograms for speech sentiment classification was evaluated. ResNet architectures, such as ResNet-50, ResNet-101, and ResNet-152, are suited for capturing subtle features in a variety of domains because they process complex data efficiently by utilizing deep layers with shortcut connections.
The VGG-16 and VGG-19 models emphasize simplicity and depth in feature extraction by using deep stacks of convolutional layers with small filters. With the introduction of GoogleNet's inception modules, extensive data analysis at various scales is made possible by parallel processing using filters of various sizes. With its distinctive densely connected architecture, DenseNet-201 improves network-wide information flow and feature reuse. These architectures make use of effective data processing techniques and deep learning principles to provide strong frameworks for handling challenging tasks. In sentiment analysis with audio speech, these models use Log Mel-Spectrograms to analyze complex sound patterns and identify emotional signals in speech. This method allows for a more detailed understanding of sentiments in spoken speech, which improves the accuracy of emotion detection. In this research, several pre-trained CNN models, namely ResNet, DenseNet, VGG, and GoogleNet variants, have been used. To better fine-tune the models for this specific task, the top layer of each of the models has been replaced with a custom layer containing five custom-defined residual blocks and a linear layer as the output layer. For the VGG models, adaptive average pooling was used, and the output was flattened before being passed to the residual blocks.

Fig. 2. Flow Diagram of The Proposed Approach
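The paper does not publish this head-replacement code; below is a hedged PyTorch sketch of the idea for DenseNet201, where the internals of the five residual blocks (here, simple linear blocks with skip connections) and the framework choice are assumptions.

```python
import torch.nn as nn
from torchvision import models

class ResidualBlock(nn.Module):
    """Hypothetical residual block; the paper does not specify its internal layout."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.fc2(self.act(self.fc1(x))))  # skip connection

# Replace the pre-trained backbone's top layer with five residual blocks
# and a linear output over the seven sentiment classes.
backbone = models.densenet201(weights="DEFAULT")
dim = backbone.classifier.in_features  # 1920 for DenseNet201
backbone.classifier = nn.Sequential(
    *[ResidualBlock(dim) for _ in range(5)],
    nn.Linear(dim, 7),
)
# For the VGG models, the paper adds adaptive average pooling and a flatten
# before the residual blocks; torchvision's VGG already pools in .avgpool.
```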
G. Sentiment Analysis Based On Numerical Data using Sequence Models

In yet another approach, sentiment analysis based on numerical data was explored by employing different sequence models. Sequence models such as LSTM, a recurrent neural network, enhance the learning and retention of long-term dependencies in sequential data, which makes them appropriate for tasks such as natural language processing and time series prediction. Bi-LSTM, an extension of LSTM, incorporates data from both past and future contexts. It uses two LSTM layers to process the sequence forward and backward, allowing the model to capture dependencies in both directions. Bi-LSTM is used in natural language processing and speech recognition to make accurate predictions and classifications. In sentiment analysis from audio speech, LSTM and Bi-LSTM models efficiently detect emotional variations using extracted numerical features such as MFCCs and CS. These models analyze speech's temporal dynamics, allowing precise emotion detection based on the speech. In this research, a three-layer LSTM network was employed. The layers comprised 100, 50, and 20 cells sequentially.
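A sketch of the three-layer LSTM described above, written here in PyTorch since the paper does not name its framework; the input feature size (54, matching the earlier feature-extraction sketch) and the 7-way output head are assumptions.

```python
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Stacked LSTM with 100, 50 and 20 cells, followed by a classifier.
    Setting bidirectional=True (and doubling downstream sizes) gives the Bi-LSTM variant."""
    def __init__(self, n_features: int = 54, n_classes: int = 7):
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, 100, batch_first=True)
        self.lstm2 = nn.LSTM(100, 50, batch_first=True)
        self.lstm3 = nn.LSTM(50, 20, batch_first=True)
        self.head = nn.Linear(20, n_classes)

    def forward(self, x):           # x: (batch, timesteps, n_features)
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        x, _ = self.lstm3(x)
        return self.head(x[:, -1])  # class logits from the last timestep
```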
H. Preparing Dataset for Training

From the combined dataset, 10% of the data has been separated as the test dataset. The remaining data has been augmented to create the augmented train set used in training. Table II shows the class-wise distribution of audio data for each of the subsets.

TABLE II. Class Distribution of Training and Testing Dataset

Sentiment   Merged Dataset   Test Set   Train Dataset   Train Dataset After Augmentation
Happy       1506             151        1355            2710
Surprise    1306             131        1175            2350
Angry       1506             151        1355            2710
Sad         1506             151        1355            2710
Fear        1000             100        900             1800
Disgust     1200             120        1080            2160
Neutral     1343             110        1233            2466
IV. Result Analysis

A. Result Analysis of ML and CNN Models

To explore the challenging task of sentiment analysis, different types of features, ranging from numerical features such as ZCR and RMS to spectrogram features such as the Log-Mel Frequency Spectrogram, have been extracted. Based on these features, DNNs, ML models and sequence models have been applied. The results obtained from the ML and DL models for the different features have been compiled in Table III. Popular machine learning models such as the Random Forest Classifier, AdaBoost Classifier, KNN, Decision Tree, and Gradient Boost Classifier have been trained on the various features extracted from the different datasets. Moreover, sequence models such as LSTM and Bi-LSTM have been evaluated with the same audio features. Several widely recognized metrics, including Precision, Accuracy and F1-Score, have been assessed for all
models. Additionally, the performance of the different audio features on the different ML and DL models has been assessed. Also, a combination of Zero Crossing Rate, Root Mean Square and Chroma Shift has been evaluated.

TABLE III. Evaluation Scores of Different ML and Sequence Models for Different Features
(columns: K-NN | Random Forest | Gradient Boost | AdaBoost | Decision Tree | Bi-LSTM | LSTM)

MFCCs
  Accuracy    55.27%   80.83%   73.69%   24.49%    73.16%    77.95%    79.76%
  Precision   55.07%   80.77%   73.46%   10.24%    73.90%    79.11%    80.406%
  Recall      55.27%   80.83%   73.69%   24.49%    73.16%    77.95%    79.765%
  F1-Score    55.05%   80.59%   73.39%   11.90%    73.17%    75.80%    77.55%

Zero Crossing Rate
  Accuracy    14.32%   14.44%   16.71%   13.41%    16.08%    17.57%    17.25%
  Precision   14.58%   14.69%   14.33%   5.8%      16.52%    12.34%    13.65%
  Recall      14.37%   14.48%   14.48%   13.41%    16.08%    17.67%    17.25%
  F1-Score    14.45%   14.56%   14.46%   5.9%      16.20%    10.72%    10.46%

Root Mean Square
  Accuracy    18.21%   17.57%   20.44%   17.41%    17.145%   20.23%    21.19%
  Precision   18.68%   18.29%   22.44%   10.02%    17.78%    17.41%    18.21%
  Recall      18.21%   17.57%   20.44%   17.14%    17.14%    20.23%    21.19%
  F1-Score    18.26%   17.69%   17.01%   7.8%      16.96%    12.22%    13.72%

Chroma Shift
  Accuracy    14.90%   14.90%   20.76%   15.01%    15.76%    20.55%    21.61%
  Precision   15.33%   15.27%   18.40%   4.9%      16.9%     12.58%    10.377%
  Recall      14.90%   14.90%   20.76%   15.01%    15.76%    20.55%    21.61%
  F1-Score    15.04%   15.02%   15.80%   6.9%      15.98%    20.553%   13.79%

Combined Features
  Accuracy    71.13%   87%      90.84%   71.139%   72.94%    80.29%    79.44%
  Precision   57.29%   88.25%   91.16%   57.86%    73.49%    81.11%    79.59%
  Recall      71.13%   87%      90.84%   71.13%    72.5%     80.29%    79.44%
  F1-Score    61.78%   87.36%   90.78%   62.04%    72.86%    78.01%    77.12%

ZCR+RMS+CS
  Accuracy    28.22%   33.65%   32.58%   17.25%    35.67%    31.73%    33.01%
  Precision   28.34%   33.66%   32.97%   6.08%     36.95%    30.67%    27.7%
  Recall      28.22%   33.65%   32.58%   17.25%    35.67%    31.73%    33.01%
  F1-Score    28.23%   33.52%   30.91%   7.7%      35.89%    28.39%    29.47%

The combination of all the features yielded the best results, with an accuracy of 90.84% with Gradient Boost and 80.29% with Bi-LSTM. Also, as an individual feature, MFCCs performed the best across the ML and sequence models, with an accuracy of 80.83% on Random Forest and 77.95% on Bi-LSTM. In addition, it is clearly observed from the evaluation results that individual features such as Chroma Shift, Root Mean Square and Zero Crossing Rate do not perform well on any of the models.

The CNN models have been trained on the Log-Mel Spectrograms extracted from the audio dataset. Popular CNN models such as ResNet50, ResNet101, ResNet152, VGG16, VGG19, DenseNet201 and GoogleNet were chosen for the experiment. While the ResNet family showed almost similar results, GoogleNet performed comparatively worst, with an accuracy of 86%. On the other hand, the best-performing model is DenseNet201, with an accuracy of 94%. Table IV depicts the performance scores of the CNN models for classifying speech sentiment.

TABLE IV. Evaluation Score of CNN Models using Log-Mel Spectrogram

Model         Precision   Recall   F1-Score   Accuracy
ResNet50      90%         90%      89%        90%
ResNet101     90%         90%      90%        90%
ResNet152     92%         92%      92%        92%
VGG16         90%         90%      90%        90%
VGG19         91%         91%      91%        91%
DenseNet201   94%         94%      94%        94%
GoogleNet     86%         86%      86%        86%

B. Result Analysis of the Ensemble Approach

For further analysis, the different models were combined using ensemble methods to evaluate their performance on Bengali speech sentiment analysis. Table V illustrates the performance scores of the ensemble technique applied to the different models. At first, all the CNN models were ensembled using the voting technique. Both variants of the voting ensemble, hard voting and soft voting, were evaluated, and both gave the same result, with an accuracy of 95%. Similarly, the ML models and sequence models were ensembled together on the same set of numerical features. Two experiments were done: ensembling on all the features, and ensembling based on MFCC. Ensembling on all the features provides the best accuracy, 94.99%.

TABLE V. Evaluation Score for the Ensemble Technique on Different Models

Model                  Feature               Metric      Hard Voting   Soft Voting
ML + Sequence Models   All Features          Accuracy    94.99%        91.58%
                                             Precision   95.52%        91.84%
                                             Recall      94.99%        91.58%
                                             F1-Score    94.93%        91.36%
ML + Sequence Models   MFCC                  Accuracy    83.91%        82.42%
                                             Precision   84.28%        82.95%
                                             Recall      83.91%        82.42%
                                             F1-Score    83.71%        81.69%
CNN Models             Log-Mel Spectrogram   Accuracy    95.00%        95.00%
                                             Precision   95.00%        95.00%
                                             Recall      95.00%        95.00%
                                             F1-Score    95.00%        95.00%
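The paper's exact ensembling code is not given; the scikit-learn sketch below illustrates the two voting schemes over the numerical-feature ML models, mirroring Table V. For the CNN and sequence models the same idea applies to their saved outputs: hard voting takes the majority class vote, while soft voting averages per-class probabilities before the argmax.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# X_train/y_train and X_test/y_test come from the split described in Section III-H.
estimators = [
    ("rf", RandomForestClassifier()),
    ("knn", KNeighborsClassifier()),
    ("gb", GradientBoostingClassifier()),
    ("ab", AdaBoostClassifier()),
    ("dt", DecisionTreeClassifier()),
]
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_train, y_train)
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_train, y_train)
print("hard:", hard.score(X_test, y_test), "soft:", soft.score(X_test, y_test))
```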
V. Conclusion and Future Works

After the analysis, it can be concluded that among the CNN, ML, and sequence models for Speech Sentiment Analysis, the Convolutional Neural Networks outperform the others. Ensemble approaches were also explored, which increased the accuracy of both the numerical and spectrogram-based approaches; both converged to 95% accuracy. This provides evidence that an ensemble-based approach is necessary to increase accuracy on this topic. The research contributes to the field of artificial intelligence in engineering and technology by addressing an important gap in speech sentiment analysis for one of the world's most spoken yet low-resource languages. This work provides a benchmark for determining effective approaches for speech sentiment analysis by comparing multiple ML and deep learning models, and it showcases the usefulness of hybrid approaches by utilizing voting ensemble methods. It provides valuable insights for researchers working on speech sentiment analysis, especially for low-resource languages.

There has been almost no research conducted by combining all three available Bangla SER datasets up until now. In this research, all three existing datasets have been combined to form a generalized, enlarged custom dataset on which all the comparative models have been trained and evaluated. However, no custom dataset including real-world noise was involved, so the data is not fully depictive of real-world scenarios. There is much more room for work in this regard, as more CNN models, including self-developed ones, might be evaluated. Robust models using noisy real-world data with varied speaking styles may be developed based on the groundwork laid by this research. The effectiveness of other appropriate features in the compared approaches may also be explored for a better understanding of which method best suits such classification. In short, this work can serve as a foothold for a plethora of further research.
References

[1] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, Aug. 1980. doi: 10.1109/TASSP.1980.1163420.
[2] Statistics and Data, "The most spoken languages 2023," 2023. [Online]. Available: https://statisticsanddata.org/data/the-most-spoken-languages-2023/.
[3] A. C. Shruti, R. H. Rifat, M. Kamal, and M. G. R. Alam, "A comparative study on Bengali speech sentiment analysis based on audio data," in 2023 IEEE International Conference on Big Data and Smart Computing (BigComp), 2023, pp. 219–226. doi: 10.1109/BigComp57234.2023.00043.
[4] S. Sultana and M. S. Rahman, "Acoustic feature analysis and optimization for Bangla speech emotion recognition," Acoustical Science and Technology, vol. 44, no. 3, pp. 157–166, 2023. doi: 10.1250/ast.44.157.
[5] S. Sultana, M. Z. Iqbal, M. R. Selim, M. M. Rashid, and M. S. Rahman, "Bangla speech emotion recognition and cross-lingual study using deep CNN and BLSTM networks," IEEE Access, vol. 10, pp. 564–578, 2022. doi: 10.1109/ACCESS.2021.3136251.
[6] M. M. Hassan, M. Raihan, M. M. Hassan, and A. K. Bairagi, "BSER: A learning framework for Bangla speech emotion recognition," in 2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT), IEEE, 2024, pp. 410–415.
[7] P. Dhar and S. Guha, "A system to predict emotion from Bengali speech," International Journal of Mathematical Sciences and Computing (IJMSC), vol. 7, no. 1, pp. 26–35, 2021.
[8] J. Devnath, S. Hossain, M. Rahman, H. Saha, M. A. Habib, and N. Sultan, "Emotion recognition from isolated Bengali speech," 2020.
[9] M. M. Billah, L. Sarker, M. Akhand, M. A. S. Kamal, et al., "Emotion recognition with intensity level from Bangla speech using feature transformation and cascaded deep learning model," International Journal of Advanced Computer Science & Applications, vol. 15, no. 4, 2024.
[10] M. M. Billah, M. L. Sarker, and M. Akhand, "KBES: A dataset for realistic Bangla speech emotion recognition with intensity level," Data in Brief, vol. 51, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2352340923008107.
[11] S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal, "SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla," PLOS ONE, 2021. [Online]. Available: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0250173.
[12] R. K. Das, N. Islam, M. R. Ahmed, S. Islam, S. Shatabda, and A. M. Islam, "BanglaSER: A speech emotion recognition dataset for the Bangla language," Data in Brief, vol. 42, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S235234092200302X.
[13] B. McFee, C. Raffel, D. Liang, et al., "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, vol. 8, 2015, pp. 18–25.
