Speech Emotion Recognition and Classification Using Deep Learning
ABSTRACT
Recognizing human emotion has always been a fascinating task for data
scientists. I am working on an experimental Speech Emotion Recognition
(SER) project to explore its potential.
Introduction
This report collects the information gathered during the making of this project. The project "Speech Emotion Recognition and Classification Using Deep Learning" is explained in detail here: how it was prepared and built, and how the problems that arose while creating it were handled.
Software Requirements
Python
2.1 Overview:
Python is a high-level, interpreted, general-purpose programming language with a readable syntax and a large ecosystem of scientific libraries.
2.2 Features:
1. Easy to Code:
Python's simple, readable syntax makes programs quick to write and easy to maintain.
2. Free and Open Source:
Python is freely available on its official website; click the Download Python link there to obtain it. Since it is open source, the source code is also available to the public, so you can download it, use it, and share it.
3. Object-Oriented Language:
Python supports object-oriented programming with concepts such as classes, objects, inheritance, and encapsulation.
4. High-Level Language:
When writing Python programs, programmers do not need to remember the system architecture or manage memory manually.
5. Extensible Feature:
Python is an extensible language: parts of a program can be written in C or C++, compiled, and then called from Python code.
6. Interpreted Language:
Python code is executed line by line by the interpreter, so there is no separate compilation step.
Python has a large standard library that provides a rich set of modules and functions, so you do not have to write your own code for every single task. There are libraries for regular expressions, unit testing, web browsers, and much more.
This section will discuss the various python modules and libraries used in
this project.
1. OS:
The os module provides a portable way of interacting with the operating system, for example reading directory contents and building file paths.
2. Matplotlib:
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
3. Librosa:
LibROSA is a Python package for music and audio analysis. It provides the
building blocks necessary to create music information retrieval systems.
4. Wave:
The wave module provides a convenient interface to the WAV sound format, allowing audio files to be read and written.
5. Numpy:
Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data-types can be
defined. This allows NumPy to seamlessly and speedily integrate with a
wide variety of databases.
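As a small sketch of NumPy as a generic data container (a structured array with made-up fields, not project code):

```python
import numpy as np

# A structured dtype: each record holds a filename, a label, and a duration.
dt = np.dtype([("filename", "U32"), ("label", "U16"), ("duration", "f4")])

records = np.array(
    [("a01.wav", "happy", 2.5),
     ("a02.wav", "sad", 3.1)],
    dtype=dt,
)

# Fields are accessed by name, like columns in a table.
print(records["label"])            # ['happy' 'sad']
print(records["duration"].mean())  # ~2.8
```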
6. Pandas:
Pandas is a fast, powerful, and flexible data-analysis library built around its DataFrame structure for labeled tabular data.
8. Scikit-Learn:
Scikit-learn is a machine-learning library that provides simple and efficient tools for classification, regression, clustering, and model evaluation.
9. JSON:
The json module exposes an API familiar to users of the standard library marshal and pickle modules, for encoding Python objects as JSON text and decoding JSON text back into Python objects.
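A short illustrative round trip (the field names here are made up for this example):

```python
import json

# Encode a dict describing one audio sample as a JSON string, then decode it.
sample = {"file": "a01.wav", "emotion": "happy", "mfcc_mean": [1.2, 0.4]}

encoded = json.dumps(sample)   # dict -> JSON string
decoded = json.loads(encoded)  # JSON string -> dict

assert decoded == sample
```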
10. Tensorflow:
TensorFlow is an end-to-end open-source platform for machine learning, widely used for building and training deep neural networks.
11. Glob:
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell, although results are returned
in arbitrary order. No tilde expansion is done, but *, ?, and character ranges
expressed with [] will be correctly matched. This is done by using the
os.scandir() and fnmatch.fnmatch() functions in concert, and not by actually
invoking a subshell. Note that unlike fnmatch.fnmatch(), glob treats
filenames beginning with a dot (.) as special cases.
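As a quick illustration of the pattern matching (a throwaway example, not project code):

```python
import glob
import os
import tempfile

# Create a small disposable directory to demonstrate pattern matching.
root = tempfile.mkdtemp()
for name in ("a1.wav", "a2.wav", "notes.txt"):
    open(os.path.join(root, name), "w").close()

# '*' matches any run of characters, so only the .wav files are returned.
wav_files = sorted(glob.glob(os.path.join(root, "*.wav")))
print([os.path.basename(p) for p in wav_files])  # ['a1.wav', 'a2.wav']
```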
12. Pickle:
The pickle module implements binary protocols for serializing and de-
serializing a Python object structure. “Pickling” is the process whereby a
Python object hierarchy is converted into a byte stream, and “unpickling” is
the inverse operation, whereby a byte stream (from a binary file or bytes-
like object) is converted back into an object hierarchy. Pickling (and
unpickling) is alternatively known as “serialization”, “marshalling,” or
“flattening”; however, to avoid confusion, the terms used here are “pickling”
and “unpickling”.
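A minimal round-trip sketch (the dictionary contents here are illustrative):

```python
import pickle

# Any picklable Python object survives a round trip through a byte stream.
model_state = {"classes": ["neutral", "sad", "happy"], "accuracy": 0.62}

blob = pickle.dumps(model_state)  # pickling: object -> byte stream
restored = pickle.loads(blob)     # unpickling: byte stream -> object

assert restored == model_state
```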
13. Seaborn:
Seaborn is a statistical data-visualization library built on top of Matplotlib, providing a high-level interface for drawing attractive and informative graphics.
Platform
The project was developed on the Jupyter Notebook platform, which builds on a number of open-source components:
● IPython
● ØMQ
● Tornado (web server)
● jQuery
● Bootstrap (front-end framework)
● MathJax
Machine Learning
4.1 Overview:
Machine learning algorithms are commonly grouped into four categories: supervised, unsupervised, semi-supervised, and reinforcement learning.
● Supervised learning algorithms learn a mapping from inputs to outputs using labeled training examples.
● Unsupervised learning algorithms look for structure, such as clusters, in unlabeled data.
● Semi-supervised learning falls between the two. Systems that use this method are able to considerably improve learning accuracy. It is usually chosen when labeling the acquired data requires skilled and relevant resources, whereas acquiring unlabeled data generally doesn't require additional resources.
● Reinforcement learning is a method in which an algorithm interacts with its environment by producing actions and discovering errors or rewards.
Deep Learning
5.1 Overview:
Most modern deep learning models are based on artificial neural networks,
specifically convolutional neural networks (CNNs), although they can also
include propositional formulas or latent variables organized layer-wise in
deep generative models such as the nodes in deep belief networks and
deep Boltzmann machines.
In deep learning, each level learns to transform its input data into a slightly
more abstract and composite representation. In an image recognition
application, the raw input may be a matrix of pixels; the first
representational layer may abstract the pixels and encode edges; the
second layer may compose and encode arrangements of edges; the third
layer may encode a nose and eyes; and the fourth layer may recognize that
the image contains a face. Importantly, a deep learning process can learn
which features to optimally place in which level on its own. (Of course, this
does not completely eliminate the need for hand-tuning; for example,
varying numbers of layers and layer sizes can provide different degrees of
abstraction).
The word "deep" in "deep learning" refers to the number of layers through
which the data is transformed. More precisely, deep learning systems have
a substantial credit assignment path (CAP) depth. The CAP is the chain of
transformations from input to output. CAPs describe potentially causal
connections between input and output. For a feedforward neural network,
the depth of the CAPs is that of the network and is the number of hidden
layers plus one (as the output layer is also parameterized). For recurrent
neural networks, in which a signal may propagate through a layer more
than once, the CAP depth is potentially unlimited. No universally agreed
upon threshold of depth divides shallow learning from deep learning, but
most researchers agree that deep learning involves CAP depth higher than
2. A CAP of depth 2 has been shown to be a universal approximator in the
sense that it can emulate any function.
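To make the CAP counting concrete, here is a toy NumPy sketch (not the project's model): a feedforward network with two hidden layers plus a parameterized output layer has CAP depth 3, i.e. CAP > 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

x = rng.normal(size=(1, 8))                        # input: 8 features

W1 = rng.normal(size=(8, 16)); b1 = np.zeros(16)   # hidden layer 1
W2 = rng.normal(size=(16, 16)); b2 = np.zeros(16)  # hidden layer 2
W3 = rng.normal(size=(16, 5)); b3 = np.zeros(5)    # output layer (5 classes)

h1 = relu(x @ W1 + b1)   # transformation 1
h2 = relu(h1 @ W2 + b2)  # transformation 2
out = h2 @ W3 + b3       # transformation 3 (the output layer is parameterized)

# CAP depth = number of hidden layers + 1 = 3, so this counts as deep (CAP > 2).
print(out.shape)  # (1, 5)
```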
Beyond that, more layers do not add to the function approximator ability of
the network. Deep models (CAP > 2) are able to extract better features
than shallow models and hence, extra layers help in learning the features
effectively.
One of the shortcomings of basic CNNs is their inability to model long-distance dependencies, which is important for various NLP tasks. To
address this problem, CNNs have been coupled with time-delayed neural
networks (TDNN) which enable larger contextual range at once during
training. Other useful types of CNN that have shown success in different
NLP tasks, such as sentiment prediction and question type classification,
are known as dynamic convolutional neural networks (DCNNs). A DCNN
uses a dynamic k-max pooling strategy where filters can dynamically span
variable ranges while performing the sentence modeling.
CNNs have also been used for more complex tasks where varying lengths
of texts are used such as in aspect detection, sentiment analysis, short text
categorization, and sarcasm detection.
Project Implementation
Data Description:
The dataset consists of 8192 audio files distributed across 5 classes. Those classes are:
a. Neutral
b. Sad
c. Disgust
d. Happy
e. Fear
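A minimal sketch of how such a dataset might be indexed. The one-subdirectory-per-class layout and the helper name index_dataset are assumptions for illustration, not the actual dataset's layout:

```python
import os

CLASSES = ["neutral", "sad", "disgust", "happy", "fear"]

def index_dataset(root):
    """Collect (path, label) pairs, assuming one subdirectory per class."""
    samples = []
    for label in CLASSES:
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue
        for name in sorted(os.listdir(class_dir)):
            if name.endswith(".wav"):
                samples.append((os.path.join(class_dir, name), label))
    return samples
```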
Results and Conclusion
● We find that the predicted values consist of only two classes.
● The main reason for this behaviour is the number of files in each class. The dataset consists of 8192 files.
● The number of files for "neutral" alone is 5117.
● That is more than 50% of a dataset consisting of 5 classes.
● The second most frequent class is "happy", with 1790 files, which is still far fewer than neutral.
● Because of this bias in the data, the model ends up predicting neutral almost all of the time.
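The imbalance described above can be verified with a couple of lines, using the counts reported here:

```python
# Class counts taken from the report.
counts = {"neutral": 5117, "happy": 1790}
total = 8192

neutral_share = counts["neutral"] / total
print(f"{neutral_share:.1%}")  # 62.5% -- far above the 20% a balanced class would have
```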
References
● https://towardsdatascience.com/speech-emotion-recognition-with-convolution-neural-network-1e6bb7130ce3
● https://skymind.ai/wiki/neural-network
● https://machinelearningmastery.com/what-is-deep-learning/
● https://towardsdatascience.com/
● https://github.com/topics/speech-emotion-recognition
● https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d
● http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
● https://towardsdatascience.com/how-to-build-a-gated-convolutional-neural-network-gcnn-for-natural-language-processing-nlp-5ba3ee730bfb