
Voice Emotion Recognition

PRESENTED BY:
 513322106043-SARAVANA KUMAR A
 513322106045-SHARAN BHARATH M
 513322106049-BALAJI R
 513322106701-SUGITHA V.T

DATE:
30.04.2025

DATASET OVERVIEW:
Dataset Description:

RAVDESS is one of the most widely used and reliable datasets for emotion recognition
from speech. It contains audio and video recordings of professional actors vocalizing different
emotions.

Emotions:

 Anger
 Disgust
 Fear
 Happiness
 Sadness
 Surprise
 Neutral

Features Used:

 MFCC (Mel-Frequency Cepstral Coefficients)

 Chroma Frequencies

 Spectral Centroid

 Zero-Crossing Rate

 Spectral Rolloff

 Root Mean Square Energy

Datasets Used:

1. Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)

2. Toronto Emotional Speech Set (TESS)

3. Surrey Audio-Visual Expressed Emotion (SAVEE)

Data Preprocessing:
 Audio Loading

 Noise Reduction

 Silence Removal (Optional)

 Normalization

 Feature Extraction

 Label Encoding

 Data Splitting

 Padding or Truncating

1. Data preprocessing is a critical step in a Voice Emotion Recognition project: it cleans, standardizes, and transforms raw audio data into a format suitable for machine learning models, for example by standardizing the sample rate and converting recordings to mono with libraries such as librosa (a short loading sketch follows these notes).

2. Together, these preprocessing steps transform raw, inconsistent audio data into clean, structured, and informative inputs that allow machine learning models to accurately detect and classify emotions in speech.
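
As a minimal illustration of the loading and standardization step above, the sketch below uses librosa to resample each clip, downmix to mono, and peak-normalize the waveform. The target sample rate and the helper name are illustrative choices, not values specified in this report.

import librosa
import numpy as np

def load_and_standardize(path, target_sr=22050):
    # librosa.load resamples to target_sr and downmixes to mono
    y, sr = librosa.load(path, sr=target_sr, mono=True)
    # peak-normalize so loudness differences between recordings do not bias training
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    return y, sr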

PROBLEM STATEMENT:
Brief Overview:

Voice Emotion Recognition (VER) is a machine learning-based system designed to identify human emotions from speech. By analyzing vocal characteristics such as pitch, tone, rhythm, and intensity, the system classifies emotions like happiness, sadness, anger, and fear.

This project aims to build an efficient emotion recognition model using audio datasets,
feature extraction techniques (e.g., MFCCs), and machine learning classifiers such as Support
Vector Machines and deep neural networks.

The goal is to enhance human-computer interaction, making systems more responsive and emotionally intelligent in real-world applications such as virtual assistants, call centers, and mental health monitoring.

Key Objectives:

1. To detect and classify emotional states from spoken audio using machine learning or deep
learning models.

2. To extract meaningful features (e.g., MFCCs, pitch, energy) that reflect emotional variations in
speech.

3. To build and train classification models such as SVM, Random Forest, CNN, or LSTM for
accurate emotion prediction.

4. To evaluate model performance using metrics like accuracy, precision, recall, and F1-score.

5. To improve real-time human-computer interaction by enabling systems to respond emotionally and contextually.

METHODOLOGY:
Approach:
1. Problem Definition

 Objective: The primary goal is to classify emotional states such as happiness, sadness, anger,
fear, surprise, etc., from human speech.
 The system needs to identify features from voice signals that are indicative of different emotions
to categorize them accurately.

2. Data Collection
 Dataset Selection: Select or create an emotion-labeled dataset containing various speech
recordings with different emotional tones. Common datasets used are:
o RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
o TESS (Toronto Emotional Speech Set)
o SAVEE (Surrey Audio-Visual Expressed Emotion)
 Content: These datasets typically contain several speakers expressing emotions in different
tones and intensities. They should include diverse emotions such as happy, sad, angry, and
neutral, which will serve as labels for classification.

3. Data Preprocessing

 Audio Loading: Load the audio files using libraries such as librosa. Standardize the sample
rate and convert the files to mono to simplify further processing.
 Noise Reduction: Clean the audio data by reducing background noise and unwanted sounds
using spectral filtering or specialized libraries (e.g., noisereduce).
 Silence Removal: Detect and remove long silent intervals in the audio signals using Voice
Activity Detection (VAD) or librosa.effects.trim.
 Normalization: Normalize the audio signals to ensure uniform loudness, preventing volume-
related biases during model training.
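
A brief sketch of the noise-reduction and silence-removal steps above, assuming the noisereduce package mentioned earlier and a waveform already loaded with librosa; the 25 dB trim threshold is an illustrative default, not a value prescribed in this report.

import librosa
import noisereduce as nr

def clean_audio(y, sr):
    # spectral-gating noise reduction via the noisereduce library
    y_denoised = nr.reduce_noise(y=y, sr=sr)
    # trim leading and trailing silence quieter than 25 dB below peak
    y_trimmed, _ = librosa.effects.trim(y_denoised, top_db=25)
    return y_trimmed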

4. Feature Extraction

 Key Features: Extract features from the audio that are most indicative of emotional tone.
Commonly used features include:
o MFCC (Mel-Frequency Cepstral Coefficients): Captures the spectral properties of the
speech signal.
o Chroma Features: Capture harmonic and melodic features that indicate pitch variations.
o Spectral Centroid: Measures the "brightness" of the sound.
o Zero-Crossing Rate: Counts how frequently the signal crosses zero, related to speech
dynamics.
o Mel Spectrogram: A time-frequency representation of the audio signal.
 Feature Scaling: Normalize or standardize extracted features to ensure that no feature
dominates others due to different scales.
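
The sketch below shows one way to turn several of the features discussed in this report (MFCC, chroma, spectral centroid, zero-crossing rate, and RMS energy) into a single fixed-length vector per clip by averaging each feature over time with librosa. The choice of 40 MFCCs and time-averaging is an assumption for illustration, not a design fixed by the report.

import librosa
import numpy as np

def extract_features(y, sr, n_mfcc=40):
    # each feature matrix is averaged across frames to give one value per coefficient
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr), axis=1)
    zcr = np.mean(librosa.feature.zero_crossing_rate(y), axis=1)
    rms = np.mean(librosa.feature.rms(y=y), axis=1)
    # concatenate into one feature vector for the classifier
    return np.concatenate([mfcc, chroma, centroid, zcr, rms])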

5. Model Building

 Model Selection: Choose a suitable machine learning or deep learning model for the
classification task. Common approaches include:
o Traditional Models:
 Support Vector Machines (SVM): Effective for high-dimensional feature spaces.
 Random Forests: Ensemble models that can handle varied and non-linear data.
 K-Nearest Neighbors (KNN): A simple yet effective approach for smaller
datasets.
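
To compare the traditional models listed above, a quick cross-validation sweep like the following can be used; X and y are assumed to be the scaled feature matrix and encoded labels from the preceding steps, and the hyperparameter values are illustrative rather than tuned.

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# X: feature matrix, y: encoded emotion labels (built in the preceding steps)
models = {
    "SVM": SVC(kernel="rbf", C=10),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
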
Algorithm for Voice Emotion Recognition using Machine Learning:
1. Data Collection: Use a labeled dataset of speech recordings with emotion labels.
2. Audio Preprocessing:
o Convert audio to mono format and resample.
o Remove noise and silence.
3. Feature Extraction:
o Extract MFCC features from the audio files.
4. Data Preprocessing:
o Scale the features.
o Encode emotion labels.
5. Model Building:
o Train a Random Forest Classifier.
6. Model Evaluation:
o Evaluate using metrics like accuracy, precision, recall, and F1-score.
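
A condensed sketch of steps 4-6 of this algorithm with scikit-learn, not the project's exact code: the variables features and labels are assumed to hold the per-clip feature vectors and emotion names produced by steps 1-3, and the split ratio and tree count are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X = StandardScaler().fit_transform(np.array(features))   # step 4: scale features
y = LabelEncoder().fit_transform(labels)                  # step 4: encode emotion labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
clf = RandomForestClassifier(n_estimators=300, random_state=42)   # step 5: model
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))         # step 6: evaluation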

RESULTS:
DISCUSSION:
1. Noisy Audio Data

 Problem: Real-world audio often contains background noise, silence, or overlapping speech.
 Impact: Reduces model accuracy by introducing irrelevant or misleading features.
 Solution: Apply noise reduction, silence removal, and audio enhancement techniques during
preprocessing.

2. Emotion Overlap & Ambiguity

 Problem: Some emotions (e.g., sad vs. tired, angry vs. fearful) have overlapping acoustic features.
 Impact: Causes confusion during classification and decreases precision.
 Solution: Use high-quality datasets, and consider combining audio features with facial or textual
data (multimodal).

3. Imbalanced Datasets

 Problem: Some emotions are underrepresented in datasets (e.g., disgust or fear).
 Impact: Leads to biased models favoring dominant classes.
 Solution: Apply data augmentation, resampling techniques (SMOTE), or collect more balanced
data.

4. Feature Selection & Dimensionality

 Problem: Extracted features (e.g., MFCCs, chroma, spectral contrast) may be high-dimensional
or redundant.
 Impact: Can cause overfitting and slower training.
 Solution: Use feature selection methods like PCA or LDA to reduce dimensionality and retain
meaningful data.

5. Overfitting

 Problem: The model performs well on training data but poorly on unseen data.
 Impact: Poor generalization and real-world performance.
 Solution: Use cross-validation, regularization techniques, and simpler models when necessary.

6. Limited Data for Deep Learning

 Problem: Deep learning models require large datasets to learn effectively.
 Impact: Underfitting or poor performance on rare emotion classes.
 Solution: Start with classical ML (Random Forest, SVM), use pre-trained models, or apply
transfer learning.

7. Language & Cultural Variations

 Problem: Emotional expression varies across languages and cultures.
 Impact: A model trained on one language may not generalize to another.
 Solution: Train models on multilingual datasets or adapt them for target languages.

8. Real-Time Processing Constraints

 Problem: Processing speed matters for live applications (e.g., virtual assistants).
 Impact: Complex models may lag or fail to respond in time.
 Solution: Optimize the pipeline and use lightweight models for real-time inference.

SOLUTION IMPACT:
In building a Voice Emotion Recognition model, several challenges can arise, each requiring
targeted solutions to ensure robust performance. One common issue is noisy or low-quality audio,
which can degrade model accuracy by introducing irrelevant information. This can be mitigated through
audio preprocessing techniques such as noise reduction, silence trimming, and normalization—readily
implemented using libraries like librosa (e.g., librosa.effects.trim()) and tools such as noisereduce. Another
challenge is the overlap or ambiguity between emotions (e.g., anger vs. fear), which can be addressed by
extracting and combining multiple audio features beyond MFCC, such as chroma, spectral centroid, and
pitch, all of which are accessible through librosa.feature.

Dataset imbalance, where some emotions are underrepresented, can lead to biased predictions.
To solve this, data augmentation methods like pitch shifting or time stretching can be applied to
increase the variety of samples, and oversampling techniques such as SMOTE (imblearn.over_sampling.SMOTE) can
help balance class distribution. Additionally, high-dimensional feature sets may lead to overfitting or
slow training. Dimensionality reduction techniques such as Principal Component Analysis (PCA) can be
used to retain the most informative features while improving computational efficiency.
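
A minimal sketch of these three remedies (waveform augmentation, SMOTE oversampling, and PCA), assuming librosa, imbalanced-learn, and scikit-learn are installed and that X and y are the extracted feature matrix and encoded labels; the pitch-shift, stretch, and variance values are illustrative choices.

import librosa
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA

def augment(y_audio, sr):
    # create extra waveform samples for under-represented emotions
    pitched = librosa.effects.pitch_shift(y_audio, sr=sr, n_steps=2)   # shift up 2 semitones
    stretched = librosa.effects.time_stretch(y_audio, rate=0.9)        # slow down by 10%
    return [pitched, stretched]

# balance the class distribution of the extracted features, then
# keep the principal components explaining 95% of the variance
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X, y)
X_reduced = PCA(n_components=0.95).fit_transform(X_balanced)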

Overfitting is another frequent issue, especially when the model performs well on training data
but poorly on unseen inputs. This can be tackled using cross-validation (e.g., StratifiedKFold from scikit-
learn), regularization techniques, and limiting complexity parameters like max_depth in Random Forests
or applying dropout layers in neural networks. For projects with limited data, starting with classical
machine learning models like Random Forests is advisable, or alternatively, transfer learning from pre-
trained deep models can be used to improve performance without large datasets. Lastly, deploying
emotion recognition models in real-time systems introduces latency concerns. To address this, the
model pipeline should be optimized for speed by using lightweight models, real-time MFCC extraction,
and serialization methods like joblib to ensure fast predictions.
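
A short sketch of these overfitting and deployment remedies, again assuming X and y as above; the depth limit, fold count, and file name are illustrative rather than settings taken from this project.

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# limit tree depth to curb overfitting, then check stability with stratified cross-validation
clf = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(clf, X, y, cv=cv).mean())

# fit on the full training data and serialize for fast loading in a real-time pipeline
clf.fit(X, y)
joblib.dump(clf, "ver_random_forest.joblib")
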
CONCLUSION:
 Objective Achieved:
The main goal of building a Voice Emotion Recognition (VER) system using machine
learning was successfully achieved.

 Approach Used:
MFCC (Mel-Frequency Cepstral Coefficients) features were extracted from voice
samples, and a Random Forest Classifier was trained to recognize emotions.

 Performance:
The model showed good classification accuracy on common emotions like happy, sad,
angry, and neutral, confirming the effectiveness of classical ML techniques for audio-
based emotion recognition.

 Model Strengths:
The Random Forest model offered fast training, good interpretability, and worked well
even with limited data.

 Practical Value:
This work demonstrates the potential for real-world applications in customer support,
healthcare monitoring, education, and smart assistants.

FUTURE WORK:
 Use of Deep Learning:
Implement advanced models like CNNs, RNNs, or LSTMs to improve performance by
learning more complex patterns in voice data.

 Multimodal Emotion Recognition:
Combine voice data with facial expressions or text sentiment to create a more robust,
multi-source emotion recognition system.

 Larger and Multilingual Datasets:
Train the model on larger, more diverse datasets including various languages, dialects,
and emotional intensities for better generalization.

 Cross-Cultural Adaptation:
Investigate cultural and linguistic differences in emotional expression to make the system
effective across different regions.

 Personalization:
Develop adaptive systems that learn an individual user’s speech and emotional patterns for
improved accuracy over time.

OVERVIEW:
Software Libraries & Tools
1. Python
o Main language used for implementation.
2. Librosa
o Audio processing and MFCC feature extraction.
o Link: https://librosa.org/
3. scikit-learn
o For model training (Random Forest, SVM), scaling, and evaluation.
o Link: https://scikit-learn.org/
4. NumPy / Pandas
o For data manipulation and feature arrays.
5. Matplotlib / Seaborn
o For visualizing results, confusion matrices, and feature distributions.
6. imblearn (SMOTE)
o Handling imbalanced datasets.
o Link: https://imbalanced-learn.org/

Datasets Used:
1. RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
2. TESS (Toronto Emotional Speech Set)
3. CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
4. Emo-DB (Berlin Database of Emotional Speech)

