
Speech Emotion Recognition and Classification Using Deep Learning

ABSTRACT

Recognizing human emotion has always been a fascinating task for data
scientists. I am working on an experimental Speech Emotion Recognition
(SER) project to explore its potential.

Speech recognition technology is used in various computer programs to identify spoken words and phrases and convert them into a machine-readable format. It works through acoustic and language modelling, using various algorithms. It is a complex process, as it includes phases such as feature extraction, audio sampling, and speech recognition in order to work through various sounds and convert the language into text. These speech recognition systems can be used on mobile phones as well as laptops for setting up reminders, sending emails, playing games, checking weather reports, and so on.

Natural Language Processing (NLP) refers to a branch of artificial intelligence that deals with the interaction between humans and computers through natural language. It works on machine learning algorithms and enhances the ability of a computer program to understand human spoken language. It helps computers understand and manipulate human language and perform tasks such as question answering and language translation.
Through the use of natural language processing, a single model can now perform multiple tasks, such as learning word meanings and performing language tasks, simultaneously. These advanced technologies have enabled industries to cut unnecessary costs, miscommunication, and resources spent on traditional speech recognition processes, and to improve business efficiency.
NLP has become one of the most important technologies for the speech recognition process, as it has made the process easier and less time-consuming.

Introduction

This report contains all the information collected during the making of this project. The project “Speech Emotion Recognition and Classification Using Deep Learning” is explained in detail here.

The essence of Natural Language Processing lies in making computers understand natural language. That is not an easy task, though. Computers can understand structured data such as spreadsheets and tables in a database, but human language, text, and voice form an unstructured category of data. It is difficult for a computer to understand such data, and that is where the need for Natural Language Processing arises.

In this project report, we discuss the making and preparation of this project and how we handled the different problems that arose while creating it.

Software Requirements

Python

2.1 Overview:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.

Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.

Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.

2.2 Features:

1. Easy to code:

Python is a high-level programming language. It is much easier to learn than languages such as C, C#, JavaScript, or Java; anybody can learn the basics of Python in a few hours or days. It is also a developer-friendly language.

2. Free and Open Source:

Python is freely available on the official website and can be downloaded from there.

Since it is open source, its source code is also available to the public, so you can download it, use it, and share it.

3. Object-Oriented Language:

One of the key features of Python is object-oriented programming. Python supports object-oriented concepts such as classes, objects, and encapsulation.

4. GUI Programming Support:

Graphical user interfaces can be built in Python using modules such as PyQt5, PyQt4, wxPython, or Tk.


5. High-Level Language:

Python is a high-level language. When we write programs in Python, we do not need to remember the system architecture, nor do we need to manage the memory.

6. Extensible feature:
Python is an extensible language. We can write parts of our program in C or C++, compile that code, and use it from Python.

7. Python is Portable language:

Python is also a portable language. For example, if we have Python code written for Windows and want to run it on other platforms such as Linux, Unix, or macOS, we do not need to change it; the same code runs on any platform.

8. Python is Integrated language:

Python is also an integrated language, because it can easily be integrated with other languages such as C and C++.

9. Interpreted Language:

Python is an interpreted language: Python code is executed line by line. Unlike languages such as C, C++, and Java, there is no need to compile Python code first, which makes it easier to debug. Python source code is converted into an intermediate form called bytecode.


10. Large Standard Library

Python has a large standard library that provides a rich set of modules and functions, so you do not have to write your own code for every single thing. There are modules for tasks such as regular expressions, unit testing, and web browsing.

11. Dynamically Typed Language:

Python is a dynamically typed language. That means the type of a variable (for example int, float, or str) is decided at run time, not in advance. Because of this feature, we do not need to declare the type of a variable.
2.3 Libraries:

This section discusses the various Python modules and libraries used in this project.

1. OS:

The os module in Python provides functions for interacting with the operating system. It comes under Python's standard utility modules and provides a portable way of using operating-system-dependent functionality. The os and os.path modules include many functions to interact with the file system.
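
For illustration, a minimal sketch of the kind of file-system calls used later to walk the audio dataset (the directory path here is an assumption, not the project's actual layout):

    import os

    # List the entries of a directory and build full paths to them.
    for name in os.listdir("."):
        path = os.path.join(os.getcwd(), name)
        print(path, "(directory)" if os.path.isdir(path) else "(file)")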

2. Matplotlib:

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter notebook, web application servers, and four graphical user interface toolkits.
Matplotlib tries to make easy things easy and hard things possible. You can
generate plots, histograms, power spectra, bar charts, error charts,
scatterplots, etc. with just a few lines of code. For examples, see the
sample plots and thumbnail gallery.

For simple plotting the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
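
As a small illustration, the sketch below plots a synthetic sine wave as a stand-in for an audio signal (the signal itself is made up, not project data):

    import numpy as np
    import matplotlib.pyplot as plt

    # One second of a 440 Hz sine wave sampled at 16 kHz.
    t = np.linspace(0, 1, 16000)
    plt.plot(t, np.sin(2 * np.pi * 440 * t))
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.title("Example waveform")
    plt.show()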

3. Librosa:
LibROSA is a Python package for music and audio analysis. It provides the building blocks necessary to create music information retrieval systems.
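
For example, a minimal sketch of loading a clip and extracting MFCC features, a common representation for speech emotion recognition (the file path and the choice of 40 coefficients are assumptions):

    import librosa

    # Load an audio file and compute 40 MFCC coefficients per frame.
    signal, sr = librosa.load("audio/sample.wav", sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
    print(mfcc.shape)  # (40, number_of_frames)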

4. Wave:

The wave module provides a convenient interface to the WAV sound format. It does not support compression/decompression, but it does support mono/stereo.
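
A minimal sketch of reading the basic parameters of a WAV file (the path is illustrative):

    import wave

    # Open a WAV file and print its channel count, sample rate and length.
    with wave.open("audio/sample.wav", "rb") as wav_file:
        print("channels:", wav_file.getnchannels())
        print("sample rate:", wav_file.getframerate())
        print("frames:", wav_file.getnframes())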

5. Numpy:

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

● a powerful N-dimensional array object


● sophisticated (broadcasting) functions

● useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient
multi-dimensional container of generic data. Arbitrary data-types can be
defined. This allows NumPy to seamlessly and speedily integrate with a
wide variety of databases.
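
A small sketch of the array object and broadcasting (the shapes are illustrative, e.g. 40 MFCC coefficients over 100 frames):

    import numpy as np

    # Subtract the per-row mean from every column in one broadcast operation.
    features = np.random.rand(40, 100)
    normalized = features - features.mean(axis=1, keepdims=True)
    print(normalized.shape)  # (40, 100)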

6. Pandas:

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.
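
A minimal sketch of the kind of metadata table a SER project might keep (the file names and labels here are made up):

    import pandas as pd

    # A small table of audio file names and their emotion labels.
    df = pd.DataFrame({
        "file": ["a01.wav", "a02.wav", "a03.wav"],
        "emotion": ["neutral", "happy", "sad"],
    })
    print(df["emotion"].value_counts())
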
7. Keras:

Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.


Use Keras if you need a deep learning library that:

● Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility).
● Supports both convolutional networks and recurrent networks, as well
as combinations of the two.
● Runs seamlessly on CPU and GPU.

Keras is an open-source neural-network library written in Python. It is capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML. Designed to enable fast experimentation with deep neural networks, it focuses on being user-friendly, modular, and extensible. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author and maintainer is François Chollet, a Google engineer. Chollet is also the author of the Xception deep neural network model.
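
To make this concrete, a minimal sketch of a small Keras model of the kind used later in this project (the layer sizes and the MFCC input shape are illustrative assumptions, not the report's exact architecture):

    from keras.models import Sequential
    from keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

    # A small 1D CNN over 40 MFCC coefficients, ending in 5 emotion classes.
    model = Sequential([
        Conv1D(64, kernel_size=5, activation="relu", input_shape=(40, 1)),
        MaxPooling1D(pool_size=2),
        Flatten(),
        Dense(32, activation="relu"),
        Dense(5, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()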

8. Scikit-Learn:

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

Scikit-learn is largely written in Python, and uses NumPy extensively for high-performance linear algebra and array operations. Furthermore, some core algorithms are written in Cython to improve performance.


Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR. In such cases, extending these methods with Python may not be possible.

Scikit-learn integrates well with many other Python libraries, such as matplotlib and plotly for plotting, NumPy for array vectorization, pandas dataframes, SciPy, and many more.
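
As an illustration, a minimal classification sketch (the features and labels are random stand-ins, not project data):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    # Train a support vector machine and report accuracy on a held-out split.
    X = np.random.rand(200, 40)             # 200 clips x 40 features
    y = np.random.randint(0, 5, size=200)   # 5 emotion classes
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = SVC(kernel="rbf").fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))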

9. JSON:

JSON (JavaScript Object Notation), specified by RFC 7159 (which obsoletes RFC 4627) and by ECMA-404, is a lightweight data interchange format inspired by JavaScript object literal syntax (although it is not a strict subset of JavaScript).

json exposes an API familiar to users of the standard library marshal and
pickle modules.
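
For example, a tiny sketch of serializing a label mapping (the mapping itself is illustrative):

    import json

    # Encode a label mapping as a JSON string and decode it again.
    labels = {"neutral": 0, "sad": 1, "disgust": 2, "happy": 3, "fear": 4}
    encoded = json.dumps(labels)
    print(json.loads(encoded)["happy"])  # 3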

10. Tensorflow:

TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google.
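
A minimal sketch of the "differentiable programming" idea (this assumes the TensorFlow 2 eager API and is not taken from the project code):

    import tensorflow as tf

    # Compute the derivative dy/dx of y = x**2 at x = 3.0.
    x = tf.Variable(3.0)
    with tf.GradientTape() as tape:
        y = x ** 2
    print(tape.gradient(y, x).numpy())  # 6.0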

11. Glob:

The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell, although results are returned
in arbitrary order. No tilde expansion is done, but *, ?, and character ranges
expressed with [] will be correctly matched. This is done by using the
os.scandir() and fnmatch.fnmatch() functions in concert, and not by actually
invoking a subshell. Note that unlike fnmatch.fnmatch(), glob treats
filenames beginning with a dot (.) as special cases.
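
For example, a small sketch of collecting the dataset's audio files (the directory layout is an assumption):

    import glob

    # Find every .wav file under the dataset directory; sorting gives a
    # deterministic order, since glob itself returns results in arbitrary order.
    wav_files = sorted(glob.glob("dataset/**/*.wav", recursive=True))
    print(len(wav_files), "audio files found")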

12. Pickle:

The pickle module implements binary protocols for serializing and de-
serializing a Python object structure. “Pickling” is the process whereby a
Python object hierarchy is converted into a byte stream, and “unpickling” is
the inverse operation, whereby a byte stream (from a binary file or bytes-
like object) is converted back into an object hierarchy. Pickling (and
unpickling) is alternatively known as “serialization”, “marshalling,” or
“flattening”; however, to avoid confusion, the terms used here are “pickling”
and “unpickling”.
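
A minimal sketch of pickling extracted features to disk (the file name and contents are illustrative):

    import pickle

    # Save a small dictionary of features and load it back.
    features = {"a01.wav": [0.1, 0.2, 0.3]}
    with open("features.pkl", "wb") as f:
        pickle.dump(features, f)
    with open("features.pkl", "rb") as f:
        print(pickle.load(f))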

13. Seaborn:

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
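
For instance, a small sketch of plotting a confusion matrix as a heatmap (the matrix values are random stand-ins):

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Draw a 5x5 confusion matrix for the five emotion classes.
    labels = ["neutral", "sad", "disgust", "happy", "fear"]
    matrix = np.random.randint(0, 50, size=(5, 5))
    sns.heatmap(matrix, annot=True, xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.show()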

Platform

3.1 Jupyter Notebook:

Project Jupyter is a nonprofit organization created to "develop open-source software, open-standards, and services for interactive computing across dozens of programming languages". Spun off from IPython in 2014 by Fernando Pérez, Project Jupyter supports execution environments in several dozen languages. Project Jupyter's name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R, and also a homage to Galileo's notebooks recording the discovery of the moons of Jupiter. Project Jupyter has developed and supported the interactive computing products Jupyter Notebook, JupyterHub, and JupyterLab, the next-generation version of Jupyter Notebook.

Jupyter Notebook (formerly IPython Notebooks) is a web-based interactive computational environment for creating Jupyter notebook documents. The "notebook" term can colloquially refer to many different entities, mainly the Jupyter web application, Jupyter Python web server, or Jupyter document format, depending on context. A Jupyter Notebook document is a JSON document, following a versioned schema, and containing an ordered list of input/output cells which can contain code, text (using Markdown), mathematics, plots and rich media, usually ending with the ".ipynb" extension.

A Jupyter Notebook can be converted to a number of open standard output formats (HTML, presentation slides, LaTeX, PDF, ReStructuredText, Markdown, Python) through "Download As" in the web interface, via the nbconvert library, or via the "jupyter nbconvert" command-line interface in a shell.


To simplify visualisation of Jupyter notebook documents on the web, the nbconvert library is provided as a service through NbViewer, which can take a URL to any publicly available notebook document, convert it to HTML on the fly and display it to the user.

Jupyter Notebook interface:

Jupyter Notebook provides a browser-based REPL built upon a number of popular open-source libraries:

● IPython
● ØMQ
● Tornado (web server)
● jQuery
● Bootstrap (front-end framework)
● MathJax

Jupyter Notebook can connect to many kernels to allow programming in many languages. By default Jupyter Notebook ships with the IPython kernel. As of the 2.3 release (October 2014), there are 49 Jupyter-compatible kernels for many programming languages, including Python, R, Julia and Haskell.

The Notebook interface was added to IPython in the 0.12 release (December 2011) and renamed to Jupyter Notebook in 2015 (IPython 4.0 – Jupyter 1.0). Jupyter Notebook is similar to the notebook interface of other programs such as Maple, Mathematica, and SageMath, a computational interface style that originated with Mathematica in the 1980s. According to The Atlantic, Jupyter interest overtook the popularity of the Mathematica notebook interface in early 2018.


3.2 Jupyter kernels:

A Jupyter kernel is a program responsible for handling various types of requests (code execution, code completion, inspection) and providing a reply. Kernels talk to the other components of Jupyter using ZeroMQ over the network, and thus can be on the same or remote machines. Unlike many other Notebook-like interfaces, in Jupyter, kernels are not aware that they are attached to a specific document, and can be connected to many clients at once. Usually kernels allow execution of only a single language, but there are a couple of exceptions.

By default Jupyter ships with IPython as its default kernel, along with a reference implementation via the ipykernel wrapper. Kernels are available for many languages, with varying quality and features.


Machine Learning

4.1 Overview:

Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.
Machine learning is an application of artificial intelligence (AI) that provides
systems the ability to automatically learn and improve from experience
without being explicitly programmed. Machine learning focuses on the
development of computer programs that can access data and use it to
learn for themselves.

4.2 What is Machine Learning?

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.

Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.

The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically without human intervention or assistance and to adjust actions accordingly.

Machine learning algorithms are often categorized as supervised or unsupervised.

● Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict
future events. Starting from the analysis of a known training dataset,
the learning algorithm produces an inferred function to make
predictions about the output values. The system is able to provide
targets for any new input after sufficient training. The learning
algorithm can also compare its output with the correct, intended
output and find errors in order to modify the model accordingly.
● In contrast, unsupervised machine learning algorithms are used when
the information used to train is neither classified nor labeled.
Unsupervised learning studies how systems can infer a function to
describe a hidden structure from unlabeled data. The system doesn’t
figure out the right output, but it explores the data and can draw
inferences from datasets to describe hidden structures from
unlabeled data.
● Semi-supervised machine learning algorithms fall somewhere in
between supervised and unsupervised learning, since they use both
labeled and unlabeled data for training – typically a small amount of
labeled data and a large amount of unlabeled data.

● The systems that use this method are able to considerably improve
learning accuracy. Usually, semi-supervised learning is chosen when
the acquired labeled data requires skilled and relevant resources in
order to train it / learn from it. Otherwise, acquiring unlabeled data
generally doesn’t require additional resources.
● Reinforcement machine learning algorithms interact with their environment by producing actions and discovering errors or rewards.
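
To make the supervised/unsupervised distinction above concrete, a minimal scikit-learn sketch (the data is random and purely illustrative):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X = np.random.rand(100, 4)             # 100 samples, 4 features
    y = np.random.randint(0, 2, size=100)  # labels, used only by the supervised model

    # Supervised: learn a mapping from features to known labels.
    supervised = LogisticRegression().fit(X, y)

    # Unsupervised: group the same data into clusters without using any labels.
    unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)
    print(supervised.predict(X[:3]), unsupervised.labels_[:3])
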
Deep Learning

5.1 Overview:

Deep learning (also known as deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks. Learning can be supervised, semi-supervised or unsupervised.

Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.

Artificial Neural Networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog.

Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify concepts relevant to a human such as digits or letters or faces.


Most modern deep learning models are based on artificial neural networks, specifically Convolutional Neural Networks (CNNs), although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.

In deep learning, each level learns to transform its input data into a slightly
more abstract and composite representation. In an image recognition
application, the raw input may be a matrix of pixels; the first
representational layer may abstract the pixels and encode edges; the
second layer may compose and encode arrangements of edges; the third
layer may encode a nose and eyes; and the fourth layer may recognize that
the image contains a face. Importantly, a deep learning process can learn
which features to optimally place in which level on its own. (Of course, this
does not completely eliminate the need for hand-tuning; for example,
varying numbers of layers and layer sizes can provide different degrees of
abstraction).

The word "deep" in "deep learning" refers to the number of layers through
which the data is transformed. More precisely, deep learning systems have
a substantial credit assignment path (CAP) depth. The CAP is the chain of
transformations from input to output. CAPs describe potentially causal
connections between input and output. For a feedforward neural network,
the depth of the CAPs is that of the network and is the number of hidden
layers plus one (as the output layer is also parameterized). For recurrent
neural networks, in which a signal may propagate through a layer more
than once, the CAP depth is potentially unlimited. No universally agreed
upon threshold of depth divides shallow learning from deep learning, but
most researchers agree that deep learning involves CAP depth higher than
2. CAP of depth 2 has been shown to be a universal approximator in the
sense that it can emulate any function.

Beyond that, more layers do not add to the function approximator ability of
the network. Deep models (CAP > 2) are able to extract better features
than shallow models and hence, extra layers help in learning the features
effectively.

Deep learning architectures can be constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features improve performance.

For supervised learning tasks, deep learning methods eliminate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation.

Deep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than labeled data. Examples of deep structures that can be trained in an unsupervised manner are the neural history compressor and deep belief networks.

5.2 Convolutional Neural Networks:

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep neural networks, most commonly applied to analyzing visual imagery. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics. They have applications in image and video recognition, recommender systems, image classification, medical image analysis, and natural language processing.

CNNs are regularized versions of multilayer perceptrons. Multilayer perceptrons usually mean fully connected networks, that is, each neuron in one layer is connected to all neurons in the next layer. The "fully-connectedness" of these networks makes them prone to overfitting data. Typical ways of regularization include adding some form of magnitude measurement of weights to the loss function. However, CNNs take a different approach towards regularization: they take advantage of the hierarchical pattern in data and assemble more complex patterns using smaller and simpler patterns. Therefore, on the scale of connectedness and complexity, CNNs are on the lower extreme.

Convolutional networks were inspired by biological processes, in that the connectivity pattern between neurons resembles the organization of the animal visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field. The receptive fields of different neurons partially overlap such that they cover the entire visual field.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.
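
To illustrate the shared-weights, local-receptive-field idea in plain code, a small NumPy sketch (the signal and filter values here are made up):

    import numpy as np

    # Slide one 3-tap filter over a 1D signal: every output value is produced by
    # the same three shared weights looking at a small local window of the input.
    signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
    kernel = np.array([-1.0, 0.0, 1.0])  # a simple edge-like filter
    feature_map = np.array([
        np.dot(signal[i:i + 3], kernel) for i in range(len(signal) - 2)
    ])
    print(feature_map)  # [ 2.  2.  0. -2. -2.]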

5.3 CNN for Natural Language Processing:

By increasing the complexity of the aforementioned basic CNN and adapting it to perform word-based predictions, other NLP tasks such as NER, aspect detection, and POS tagging can be studied. This requires a window-based approach, where for each word a fixed-size window of neighboring words (a sub-sentence) is considered.

Then a standalone CNN is applied to the sub-sentence and the training objective is to predict the word in the center of the window, also referred to as word-level classification.

One of the shortcomings of basic CNNs is their inability to model long-distance dependencies, which is important for various NLP tasks. To
address this problem, CNNs have been coupled with time-delayed neural
networks (TDNN) which enable larger contextual range at once during
training. Other useful types of CNN that have shown success in different
NLP tasks, such as sentiment prediction and question type classification,
are known as dynamic convolutional neural network (DCNN). A DCNN
uses a dynamic k-max pooling strategy where filters can dynamically span
variable ranges while performing the sentence modeling.
CNNs have also been used for more complex tasks where varying lengths
of texts are used such as in aspect detection, sentiment analysis, short text
categorization, and sarcasm detection.


Project Implementation

6.1 Introduction of the implementation:

Recognizing human emotion has always been a fascinating task for data
scientists. Lately, I am working on an experimental Speech Emotion
Recognition (SER) project to explore its potential.

Data Description:
8192 different audio files distributed into 5 classes. Those classes are:
a. Neutral
b. Sad
c. Disgust
d. Happy
e. Fear

These audio files are in the “.wav” file format.
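
A small sketch of how the class distribution of such a dataset could be inspected (the folder-per-class layout and the path are assumptions, not taken from the report):

    import os
    import glob

    # Count how many .wav files each emotion class contains.
    for class_dir in sorted(glob.glob("dataset/*")):
        n_files = len(glob.glob(os.path.join(class_dir, "*.wav")))
        print(os.path.basename(class_dir), n_files)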


6.2 Source Code with explanation:
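
Below is a minimal sketch of the kind of pipeline this report describes: MFCC features extracted with librosa and a small Keras 1D CNN over the five emotion classes. The folder layout, layer sizes and training settings are illustrative assumptions, not the report's actual code.

    import glob
    import numpy as np
    import librosa
    from keras.models import Sequential
    from keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense
    from keras.utils import to_categorical
    from sklearn.model_selection import train_test_split

    EMOTIONS = ["neutral", "sad", "disgust", "happy", "fear"]

    def extract_features(path):
        # Load a clip and average its MFCCs over time into a fixed-size vector.
        signal, sr = librosa.load(path, sr=22050)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
        return mfcc.mean(axis=1)

    # Assumes each class is stored in its own folder, e.g. dataset/happy/clip.wav.
    features, labels = [], []
    for idx, emotion in enumerate(EMOTIONS):
        for path in glob.glob("dataset/" + emotion + "/*.wav"):
            features.append(extract_features(path))
            labels.append(idx)

    X = np.array(features)[..., np.newaxis]   # shape: (n_clips, 40, 1)
    y = to_categorical(labels, num_classes=len(EMOTIONS))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # A small 1D CNN over the 40 averaged MFCC coefficients.
    model = Sequential([
        Conv1D(64, kernel_size=5, activation="relu", input_shape=(40, 1)),
        MaxPooling1D(pool_size=2),
        Dropout(0.3),
        Flatten(),
        Dense(64, activation="relu"),
        Dense(len(EMOTIONS), activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=30, batch_size=32,
              validation_data=(X_test, y_test))
    print(model.evaluate(X_test, y_test))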


Results and Conclusion
● We find that the predicted values consist of only two classes.
● The only reason for this behaviour is the number of files in each class. The dataset consists of 8192 files.
● The number of files for "neutral" alone is 5117.
● That is more than 50% of a dataset consisting of 5 classes.
● The second most frequent class is "happy", with 1790 files, which is still far fewer than neutral.
● Due to this bias in the data, we can assume that the results are skewed towards predicting neutral almost all of the time.

References
● https://towardsdatascience.com/speech-emotion-recognition-with-convolution-neural-network-1e6bb7130ce3
● https://skymind.ai/wiki/neural-network
● https://machinelearningmastery.com/what-is-deep-learning/
● https://towardsdatascience.com/
● https://github.com/topics/speech-emotion-recognition
● https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d
● http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
● https://towardsdatascience.com/how-to-build-a-gated-convolutional-neural-network-gcnn-for-natural-language-processing-nlp-5ba3ee730bfb
