Final Presentation

The document discusses the development of a Speech Emotion Recognition (SER) system using machine learning techniques to classify emotions from audio samples. It outlines the objectives, datasets, workflow, and various machine learning models employed, including CNN, SVM, and Decision Trees, along with their evaluation metrics. The study emphasizes the importance of audio features and data augmentation in improving emotion detection accuracy, achieving over 80% in some models.


SPEECH EMOTION RECOGNITION USING ML
DATA 606: CAPSTONE IN DATA SCIENCE

Guided By: Prof. Ozgur Ozturk

Our team

Sai Saran ([email protected])
Shah Bansari ([email protected])
Sashidhar Guthi ([email protected])
OVERVIEW
• Speech Emotion Recognition (SER), as the name indicates, is a tool for determining
the emotions present in various audio samples.
• In a variety of audio recordings, including job interviews, caller-agent conversations,
streaming movies, and music, speech emotion recognition is utilized to determine
the emotional spectrum or sentimental value.
• Even music classification or recommendation systems can group songs according to
their mood and provide tailored playlists to the consumer.
• It is reasonable to infer that an SER component aids music suggestions in the
intricate recommendation algorithms of Spotify and YouTube.
Table of contents

01 OBJECTIVE
02 DATASET DESCRIPTION
03 WORKFLOW
04 RESULT ANALYSIS & EVALUATION
OBJECTIVE
• To build a machine learning model that recognizes emotion from speech using the librosa
package and sklearn libraries, along with four datasets.
• To present classification models for predicting emotions elicited by speech, based on
CNN, MLPC, SVM, Logistic Regression, and Decision Tree classification using acoustic
features.
• The machine learning model has been trained to classify eight different emotions.
How Does Business Incorporate SER?
• Fosters customization with clients by utilizing the SER algorithm to identify their
emotions.
• Calls are routed based on the caller's feelings: an unhappy caller can have their
call forwarded to the retention team.
• Automotive is another domain where emotion recognition is in high demand,
due to its role in providing riders with a satisfying driving experience.
• Delivers business-level analytics that make recommendations, based on a
customer's emotions, for choosing between different products.
Business Applications of SER

• Market research: customer reaction to new products
• Recruitment: screening of candidates
• Healthcare: caregiver robots, emotional counseling
• IoT (Internet of Things)/smart devices: devices responding to emotions
DATASET DESCRIPTION
Audio Format: .wav | Audio Length: 1-3 secs | Data Size: 1.6 GB | Training: 70%, Testing: 30%

• Crowd-sourced Emotional Multimodal Actors Dataset (Crema)
• Ryerson Audio-Visual Database of Emotional Speech and Song (Ravdess)
• Toronto emotional speech set (Tess)
• Surrey Audio-Visual Expressed Emotion (Savee)
WORKFLOW
DATA AUGMENTATION
To cope with real-world scenarios, adding noise to the audio files and applying pitch
shifting, stretching, and time shifting helps the model identify emotions more reliably.
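The four augmentations above can be sketched in plain NumPy. In practice the stretch and pitch steps would typically use librosa.effects.time_stretch and librosa.effects.pitch_shift; the resampling-based versions below are simplified stand-ins, and the parameter values are illustrative, not the study's actual settings.

```python
import numpy as np

def add_noise(y, noise_factor=0.005):
    """Mix white Gaussian noise into the signal."""
    return y + noise_factor * np.random.randn(len(y))

def shift(y, shift_max=1600):
    """Circularly shift the waveform by a random sample offset."""
    return np.roll(y, np.random.randint(-shift_max, shift_max))

def stretch(y, rate=0.8):
    """Naive time stretch via linear resampling (rate < 1 lengthens the clip).
    librosa.effects.time_stretch is the phase-vocoder equivalent."""
    idx = np.arange(0, len(y), rate)
    return np.interp(idx, np.arange(len(y)), y)

def pitch(y, sr, n_steps=2):
    """Crude pitch shift: resample by 2^(n_steps/12), then stretch back
    to the original length so duration is preserved."""
    rate = 2.0 ** (n_steps / 12.0)
    resampled = np.interp(np.arange(0, len(y), rate), np.arange(len(y)), y)
    idx = np.linspace(0, len(resampled) - 1, num=len(y))
    return np.interp(idx, np.arange(len(resampled)), resampled)
```

Noise and shift preserve the array length; the naive stretch changes it, which is why augmented clips are usually re-trimmed or re-padded before feature extraction.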

WAVEPLOT FOR ORIGINAL AUDIO


DATA AUGMENTATION CONT..
WAVEPLOT FOR ADDING NOISE TO ORIGINAL AUDIO

WAVEPLOT FOR ADDING STRETCH TO ORIGINAL AUDIO


DATA AUGMENTATION CONT..
WAVEPLOT FOR ADDING SHIFTING TO ORIGINAL AUDIO

WAVEPLOT FOR ADDING PITCH TO ORIGINAL AUDIO


DATA PREPROCESSING
• Combining data from all four datasets.
• Extracting emotions from each data file.
• Passing the data files further for feature extraction.
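As a concrete example of the emotion-extraction step: Ravdess encodes the emotion in the third hyphen-separated field of each filename (e.g. "03-01-05-01-02-01-12.wav", where "05" means angry). A minimal parser under that naming convention; the other three datasets use different conventions and would each need their own parser.

```python
# Ravdess filename fields:
# modality-channel-emotion-intensity-statement-repetition-actor
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fear", "07": "disgust", "08": "surprise",
}

def ravdess_emotion(filename):
    """Map a Ravdess .wav filename to its emotion label."""
    code = filename.split(".")[0].split("-")[2]  # third field is the emotion
    return RAVDESS_EMOTIONS[code]
```

For example, ravdess_emotion("03-01-05-01-02-01-12.wav") returns "angry".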
Tools Used for Audio Processing

LIBROSA PACKAGE
The librosa library can be used in Python to process and extract features from
audio files. Librosa is a Python package for music and audio analysis; it provides
the building blocks necessary to create music information retrieval systems.

FEATURE EXTRACTION TECHNIQUES
• MFCC
• ROOT-MEAN-SQUARE (RMS)
• ZERO CROSSING RATE (ZCR)
• TONNETZ
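In the project these features would come from librosa (librosa.feature.mfcc, rms, zero_crossing_rate, and tonnetz). To make clear what two of them actually compute, here are framewise RMS and zero-crossing rate written out in plain NumPy; the frame/hop sizes match librosa's defaults, but this is a sketch, not librosa's implementation.

```python
import numpy as np

def frame_signal(y, frame_length=2048, hop_length=512):
    """Slice a waveform into overlapping frames, zero-padding the tail."""
    n_frames = 1 + int(np.ceil(max(len(y) - frame_length, 0) / hop_length))
    pad = (n_frames - 1) * hop_length + frame_length - len(y)
    y = np.pad(y, (0, max(pad, 0)))
    return np.stack([y[i * hop_length: i * hop_length + frame_length]
                     for i in range(n_frames)])

def rms(y, **kw):
    """Root-mean-square energy per frame (loudness contour)."""
    frames = frame_signal(y, **kw)
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zcr(y, **kw):
    """Fraction of sign changes per frame (zero-crossing rate)."""
    frames = frame_signal(y, **kw)
    return np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
```

MFCC and Tonnetz involve mel filter banks and chroma projections respectively, so for those the librosa implementations are the practical choice.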
EDA – PART 1: EMOTION COUNTS FOR EACH DATASET
EDA – PART 2
EDA – PART 3: WAVEPLOT AND SPECTROGRAM FOR EACH EMOTION
EDA – PART 4: WAVEPLOT AND SPECTROGRAM FOR EACH EMOTION
EDA – PART 5: WAVEPLOT AND SPECTROGRAM FOR EACH EMOTION
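The spectrograms in these EDA slides are usually drawn with librosa.display.specshow; the underlying magnitude spectrogram is just the absolute value of the short-time Fourier transform, which can be sketched directly in NumPy:

```python
import numpy as np

def magnitude_spectrogram(y, n_fft=2048, hop_length=512):
    """|STFT|: Hann-windowed FFT magnitudes, one column per frame.

    Returns an array of shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + max(len(y) - n_fft, 0) // hop_length
    cols = []
    for i in range(n_frames):
        frame = y[i * hop_length: i * hop_length + n_fft] * window
        cols.append(np.abs(np.fft.rfft(frame)))
    return np.stack(cols, axis=1)
```

Plotting 20 * np.log10(S + 1e-10) with matplotlib's imshow (origin="lower", aspect="auto") gives the familiar dB-scaled spectrogram image.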
MAPPING SER WITH ML
Models: the models below are used in both base and parameter-tuned conditions.

• SVC: from sklearn.svm import SVC
• MLP: from sklearn.neural_network import MLPClassifier
• KNeighbors: from sklearn.neighbors import KNeighborsClassifier
• Decision Tree: from sklearn.tree import DecisionTreeClassifier
• Logistic Regression: from sklearn.linear_model import LogisticRegression
• CNN (TensorFlow)
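The base-vs-tuned setup for the sklearn models can be sketched with SVC and GridSearchCV as one example. The synthetic data stands in for the extracted audio features, and the grid values are illustrative assumptions, not the study's actual hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the extracted audio features (MFCC, RMS, ZCR, Tonnetz).
X, y = make_classification(n_samples=400, n_features=28, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Base condition: default hyperparameters.
base = make_pipeline(StandardScaler(), SVC()).fit(X_tr, y_tr)

# Tuned condition: small illustrative grid over C and kernel.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    {"svc__C": [0.1, 1, 10], "svc__kernel": ["rbf", "linear"]},
    cv=3,
).fit(X_tr, y_tr)

print("base:", base.score(X_te, y_te), "tuned:", grid.score(X_te, y_te))
```

The same pattern (default estimator vs. GridSearchCV over a parameter grid) applies to MLPClassifier, KNeighborsClassifier, DecisionTreeClassifier, and LogisticRegression.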
MAPPING SER WITH ML

SVC MODEL
MAPPING SER WITH ML

MLPC MODEL
MAPPING SER WITH ML

KNN MODEL
MAPPING SER WITH ML

DECISION TREE MODEL


MAPPING SER WITH ML

LOGISTIC REGRESSION MODEL


MAPPING SER WITH ML

CNN MODEL (TENSORFLOW)
MAPPING SER WITH ML

CNN MODEL (TENSORFLOW)
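The slides do not show the CNN architecture itself, so the following is only a plausible minimal sketch: a 1-D convolutional network in Keras over a per-clip feature vector, with eight softmax outputs for the eight emotions. The 162-value input length is an assumption for illustration, not taken from the study.

```python
import numpy as np
import tensorflow as tf

# Input: 162 feature values per clip (hypothetical layout), treated as a
# 1-D sequence with a single channel.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(162, 1)),
    tf.keras.layers.Conv1D(64, 5, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Conv1D(128, 5, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling1D(2),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(8, activation="softmax"),  # eight emotion classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

A "tuned" variant of such a model would vary filter counts, kernel sizes, dropout rate, and learning rate, which matches the base/tuned split used throughout the deck.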
EVALUATION METRICS

CONFUSION MATRIX: compares the actual and predicted classes of the data.
CLASSIFICATION REPORT: provides a comparison of the metrics evaluated across
the different emotions in the data.

PRECISION: the ratio of correctly classified positive samples to the total
number of samples classified as positive.
RECALL: the ability of a classifier to find all positive instances.
F1 SCORE: a weighted harmonic mean of precision and recall.
ACCURACY: the number of correct predictions divided by the total number of
predictions.
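All four metrics and both report types come directly from sklearn.metrics; a toy example with made-up emotion labels standing in for the models' predictions:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# Toy actual vs. predicted emotion labels (illustrative only).
y_true = ["happy", "sad", "angry", "happy", "sad", "angry"]
y_pred = ["happy", "sad", "happy", "happy", "sad", "angry"]

# Rows = actual class, columns = predicted class.
print(confusion_matrix(y_true, y_pred, labels=["angry", "happy", "sad"]))
# Per-emotion precision, recall, and F1, plus overall accuracy.
print(classification_report(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
```

This is the same pattern used to generate the per-model classification reports and confusion matrices on the following slides.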
SVC MODEL CLASSIFICATION REPORT
SVC BASE MODEL | SVC TUNED MODEL

MLPC MODEL CLASSIFICATION REPORT
MLPC BASE MODEL | MLPC TUNED MODEL

KNN MODEL CLASSIFICATION REPORT
KNN BASE MODEL | KNN TUNED MODEL

DECISION TREE MODEL CLASSIFICATION REPORT
DECISION TREE BASE MODEL | DECISION TREE TUNED MODEL

LOGISTIC REGRESSION CLASSIFICATION REPORT
LOGISTIC REGRESSION BASE MODEL | LOGISTIC REGRESSION TUNED MODEL

CNN MODEL CLASSIFICATION REPORT
CNN BASE MODEL | CNN TUNED MODEL

CONFUSION MATRIX FOR SVC MODEL
SVC BASE MODEL | SVC TUNED MODEL

CONFUSION MATRIX FOR MLPC MODEL
MLPC BASE MODEL | MLPC TUNED MODEL

CONFUSION MATRIX FOR KNN MODEL
KNN BASE MODEL | KNN TUNED MODEL

CONFUSION MATRIX FOR DECISION TREE MODEL
DECISION TREE BASE MODEL | DECISION TREE TUNED MODEL

CONFUSION MATRIX FOR LOGISTIC REGRESSION MODEL
LOGISTIC REGRESSION BASE MODEL | LOGISTIC REGRESSION TUNED MODEL

CONFUSION MATRIX FOR CNN MODEL
CNN BASE MODEL | CNN TUNED MODEL
ACCURACY SCORE METRICS
SVC MODEL | MLPC MODEL

ACCURACY SCORE METRICS
KNN MODEL | DECISION TREE MODEL

METRIC SCORES
LOGISTIC REGRESSION MODEL | CNN MODEL

ACCURACY & LOSS AT EACH EPOCH FOR CNN
CNN BASE MODEL TRAINING AND VALIDATION ACCURACY | CNN BASE MODEL TRAINING AND VALIDATION LOSS

ACCURACY & LOSS AT EACH EPOCH FOR CNN
CNN TUNED MODEL TRAINING AND VALIDATION ACCURACY | CNN TUNED MODEL TRAINING AND VALIDATION LOSS
CONCLUSIONS
• Detecting emotion from the audio signal, rather than from the spoken words alone,
is quite useful when the same verbal content is conveyed with a variety of emotions.
• Tone, rhythm, pitch, frequency, loudness, speed of sound, etc. are the major
factors that determine an auditory feature.
• With the aid of advanced audio features like MFCC, RMS, ZCR, and Tonnetz, machine
learning models can exploit the audio file to predict emotions.
• For model simulation, a variety of audio manipulation techniques can be employed to
improve the audio's clarity, volume, and time characteristics, such as shifting and
stretching.
• This study employs version-one base models and version-two parameter-tuned models.
In certain instances the base models performed better than the tuned models, although
tuning gave a clear advantage for the neural network models.
• Model performance analysis suggests accuracy scores above 80% for real-world
forecasting scenarios.
BLOG LINK: https://medium.com/@s157/speech-emotion-recognition-74f8878649a7
GITHUB LINK: https://github.com/saran987/Speech-Emotion-Recognition-with-Audio
REFERENCES
• Speech Emotion Recognition (akaike.ai)
• Speech Emotion Recognition (SER) through Machine Learning (analyticsinsight.net)
• Wei, S., et al. (2020). J. Phys.: Conf. Ser., 1453, 012085. doi:10.1088/1742-6596/1453/1/012085
• Parekh, R. (2012). Recognition of Isolated Words using Features based on LPC, MFCC, ZCR and STE,
with Neural Network Classifiers. Int. Journal of Modern Engineering Research (IJMER).
• Rawlinson, H., Segal, N., & Fiala, J. (2015, January). Meyda: an audio feature extraction library for
the Web Audio API. In The 1st Web Audio Conference (WAC). Paris, France.
• Dashtipour, K., Gogate, M., Adeel, A., Larijani, H., & Hussain, A. (2021). Sentiment analysis of
Persian movie reviews using deep learning. Entropy, 23(5), 596.
THANKS!

ANY QUESTIONS?
