Speech Based Emotion Recognition
https://doi.org/10.22214/ijraset.2022.44583
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue VI June 2022- Available at www.ijraset.com
Abstract: This paper presents our final year project on Speech Emotion Recognition. Speech-based emotion detection is currently an active research topic, with the goal of improving human-machine interaction. Most research in this field relies on extracting discriminative features to classify emotions into several categories, and the majority of existing work focuses on word-level utterances used in language-dependent lexical analysis. This study classifies emotions into five categories: anger, calm, anxiety, happiness, and sorrow, using a Convolutional Neural Network.
I. INTRODUCTION
Speech emotion recognition is the process of extracting emotion-related characteristics from speech signals, comparing them, and analysing how the parameter values correspond to changes in emotion. Recognising emotions from audio requires feature extraction and classifier training. The feature vector is built from audio signal components that characterise the speaker (such as pitch and energy) and is used to train a classifier model to detect particular emotional states.
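As an illustration of this kind of feature extraction, the following is a minimal sketch using the librosa package; the file name, the pitch range, and the choice of 40 MFCCs are illustrative assumptions rather than the exact configuration used in this work.

import numpy as np
import librosa

def extract_features(path):
    # Load the audio at its native sampling rate
    signal, sr = librosa.load(path, sr=None)
    # Frame-wise pitch estimate (Hz) and energy (RMS)
    f0 = librosa.yin(signal, fmin=50, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=signal)[0]
    # 40 Mel Frequency Cepstral Coefficients per frame
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
    # Summarise the frame-wise values into one fixed-length feature vector
    return np.hstack([f0.mean(), rms.mean(), mfcc.mean(axis=1)])

features = extract_features("sample.wav")   # vector of length 42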
On social media and the Web, there is a tremendous amount of opinionated data in the form of tweets, message boards, Facebook posts, blogs, and user forums.
Opinions shared on the internet by a varied collection of thought leaders and ordinary individuals influence people's decision-making. Text-based reviews are one way for people to communicate their feelings and opinions about products or societal concerns. Another common way to convey one's thoughts is through audio or video. Millions of videos on product and movie reviews, product unboxing, politics, and social issues, along with opinions on those issues, can be found on YouTube. There are several audio venues on the Internet where individuals can express themselves. In many cases, audio is more appealing than text because it conveys more information about the speaker's opinions.
This vast resource is largely underutilised, and collecting societal sentiment on specific products, as well as mass opinion on social or political problems, would be extremely useful for data analysis. Sentiment analysis in audio is still a relatively new area, and speech-based emotion extraction is a young and challenging one. This study describes robust methods for extracting sentiment or opinion from natural audio sources.
Convolutional neural networks are built from stacked layers of artificial neurons. Like biological neurons, artificial neurons compute a weighted sum of a large number of inputs and output an activation value. Each layer of a CNN produces a set of activations that are passed on to the next layer.
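As a simple illustration of the weighted-sum-and-activation computation described above (a small sketch, not code from this work):

import numpy as np

def relu(z):
    return np.maximum(0, z)

# One artificial neuron: weighted sum of its inputs plus a bias,
# passed through an activation function
inputs = np.array([0.2, -1.3, 0.7])
weights = np.array([0.5, 0.1, -0.4])
bias = 0.05
activation = relu(np.dot(weights, inputs) + bias)   # value passed to the next layer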
1) A sample audio file is provided as input.
2) The Mel Frequency Cepstral Coefficients (MFCCs) are extracted using the LIBROSA Python package.
3) The data is shuffled, split into training and test sets, and the CNN model and its layers are trained on the training set (a sketch of these steps follows this list).
4) The trained model is used to predict the sentiment of a human voice.
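The following is a minimal sketch of steps 1 to 3, assuming a labelled dataset described by the illustrative names file_paths and labels, and using the librosa and scikit-learn packages:

import numpy as np
import librosa
from sklearn.model_selection import train_test_split

def mfcc_features(path, n_mfcc=40):
    # Step 2: extract MFCCs and average them into a fixed-length vector
    signal, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# file_paths and labels are assumed lists describing the dataset (step 1)
X = np.array([mfcc_features(p) for p in file_paths])
y = np.array(labels)

# Step 3: shuffle the data and split it into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)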
Of the classifiers evaluated, the CNN model performed best on the classification task and achieved the highest validation accuracy. The trained network has a total of 18 layers.
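The exact 18-layer architecture is not reproduced here; the following is only a hedged sketch of what a 1-D CNN classifier over averaged MFCC features might look like in Keras, with layer sizes chosen purely for illustration:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(40, 1)),                  # 40 MFCC coefficients per sample
    layers.Conv1D(64, 5, activation="relu", padding="same"),
    layers.MaxPooling1D(2),
    layers.Dropout(0.2),
    layers.Conv1D(128, 5, activation="relu", padding="same"),
    layers.MaxPooling1D(2),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(5, activation="softmax"),        # five emotion categories
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training (feeds step 4): integer-encoded labels are assumed
# model.fit(X_train[..., None], y_train_encoded, validation_split=0.1, epochs=50)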
V. CONFUSION MATRIX
A confusion matrix is a table used to locate where a classification model's errors come from. The rows represent the actual categories the results should have fallen into, while the columns show the model's predictions. With this table, it is easy to see which predictions were incorrect. After importing the metrics module, the confusion_matrix function can be applied to our actual and predicted values.
from sklearn import metrics

# actual and predicted are the true and predicted class labels
confusion_matrix = metrics.confusion_matrix(actual, predicted)

To present the table more clearly, we convert it into a confusion matrix display:

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix,
    display_labels=[False, True])   # for this task the labels would be the five emotion categories
cm_display.plot()
VI. RESULTS
VII. CONCLUSION
This speech-based emotion recognition system can be used to interpret opinions and thoughts by feeding audio into the model, for example the sentiment a speaker conveys about a product or a political viewpoint. The approach could be integrated with music applications to recommend songs based on the listener's emotional state, and it could also improve product recommendations for customers of online shopping platforms such as Amazon.