Voice Emotion Recognition
PRESENTED BY:
513322106043-SARAVANA KUMAR A
513322106045-SHARAN BHARATH M
513322106049-BALAJI R
513322106701-SUGITHA V.T
DATE:
30.04.2025
DATASET OVERVIEW:
Dataset Description:
RAVDESS is one of the most widely used and reliable datasets for emotion recognition
from speech. It contains audio and video recordings of professional actors vocalizing different
emotions.
Emotions:
Anger
Disgust
Fear
Happiness
Sadness
Surprise
Neutral
Features Used:
Chroma Frequencies
Spectral Centroid
Zero-Crossing Rate
Spectral Rolloff
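As a rough illustration (not part of the original feature description), the four features above can be computed per frame with librosa and averaged into one fixed-length vector per clip; the file name below is a placeholder.

import librosa
import numpy as np

# Placeholder file name; any mono speech clip works
y, sr = librosa.load("speech.wav", sr=22050, mono=True)

chroma = librosa.feature.chroma_stft(y=y, sr=sr)           # pitch-class (chroma) energies
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # "brightness" of the sound
zcr = librosa.feature.zero_crossing_rate(y)                # how often the signal crosses zero
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)     # frequency below which most energy lies

# Average each frame-level feature so every clip yields one fixed-length vector
feature_vector = np.hstack([
    chroma.mean(axis=1),
    centroid.mean(axis=1),
    zcr.mean(axis=1),
    rolloff.mean(axis=1),
])
print(feature_vector.shape)  # (15,): 12 chroma bins + 3 scalar features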
Dataset Used: RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
Data Preprocessing:
Audio Loading
Noise Reduction
Normalization
Feature Extraction
Label Encoding
Data Splitting
Padding or Truncating
Together, these preprocessing steps transform raw, inconsistent audio data into clean,
structured, and informative inputs that allow machine learning models to accurately detect and classify
emotions in speech.
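A minimal sketch of the preprocessing steps listed above, assuming the standard RAVDESS file-name convention in which the third hyphen-separated field encodes the emotion; the folder path, the 3-second clip length, and the 40 MFCC coefficients are illustrative choices, not values fixed by this report.

import glob
import os
import librosa
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}
MAX_LEN = 3 * 22050  # pad or truncate every clip to 3 seconds at 22.05 kHz

features, labels = [], []
for path in glob.glob("ravdess/**/*.wav", recursive=True):    # placeholder dataset folder
    y, sr = librosa.load(path, sr=22050, mono=True)           # audio loading
    y = librosa.util.normalize(y)                             # normalization
    y = librosa.util.fix_length(y, size=MAX_LEN)              # padding or truncating
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)        # feature extraction (MFCCs as one example)
    features.append(mfcc.mean(axis=1))
    code = os.path.basename(path).split("-")[2]               # e.g. "03-01-05-..." -> "05"
    labels.append(EMOTIONS[code])

X = np.array(features)
y_labels = LabelEncoder().fit_transform(labels)               # label encoding
X_train, X_test, y_train, y_test = train_test_split(          # data splitting
    X, y_labels, test_size=0.2, stratify=y_labels, random_state=42)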
PROBLEM STATEMENT:
Brief Overview:
This project aims to build an efficient emotion recognition model using audio datasets,
feature extraction techniques (e.g., MFCCs), and machine learning classifiers such as Support
Vector Machines and deep neural networks.
Key Objectives:
1. To detect and classify emotional states from spoken audio using machine learning or deep
learning models.
2. To extract meaningful features (e.g., MFCCs, pitch, energy) that reflect emotional variations in
speech.
3. To build and train classification models such as SVM, Random Forest, CNN, or LSTM for
accurate emotion prediction.
4. To evaluate model performance using metrics like accuracy, precision, recall, and F1-score.
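For objective 4, the evaluation metrics can be computed directly with scikit-learn; the tiny label lists below are placeholders standing in for real test labels and model predictions.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder true and predicted emotion labels
y_true = ["happy", "sad", "angry", "sad", "neutral"]
y_pred = ["happy", "sad", "happy", "sad", "neutral"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")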
METHODOLOGY:
Approach:
1. Problem Definition
Objective: The primary goal is to classify emotional states such as happiness, sadness, anger,
fear, surprise, etc., from human speech.
The system needs to identify features from voice signals that are indicative of different emotions
to categorize them accurately.
2. Data Collection
Dataset Selection: Select or create an emotion-labeled dataset containing various speech
recordings with different emotional tones. Common datasets used are:
o RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
o TESS (Toronto Emotional Speech Set)
o SAVEE (Surrey Audio-Visual Expressed Emotion)
Content: These datasets typically contain several speakers expressing emotions in different
tones and intensities. They should include diverse emotions such as happy, sad, angry, and
neutral, which will serve as labels for classification.
3. Data Preprocessing
Audio Loading: Load the audio files using libraries such as librosa. Standardize the sample
rate and convert the files to mono to simplify further processing.
Noise Reduction: Clean the audio data by reducing background noise and unwanted sounds
using spectral filtering or specialized libraries (e.g., noisereduce).
Silence Removal: Detect and remove long silent intervals in the audio signals using Voice
Activity Detection (VAD) or librosa.effects.trim.
Normalization: Normalize the audio signals to ensure uniform loudness, preventing volume-
related biases during model training.
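A hedged sketch of these cleaning steps with librosa and noisereduce; the reduce_noise(y=..., sr=...) call assumes noisereduce 2.x, and the 25 dB trim threshold is an illustrative value.

import librosa
import noisereduce as nr

# Placeholder file; load at a standard rate, converted to mono
y, sr = librosa.load("clip.wav", sr=16000, mono=True)

y_trimmed, _ = librosa.effects.trim(y, top_db=25)   # silence removal (simple alternative to full VAD)
y_denoised = nr.reduce_noise(y=y_trimmed, sr=sr)    # spectral-gating noise reduction
y_clean = librosa.util.normalize(y_denoised)        # uniform peak loudness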
4. Feature Extraction
Key Features: Extract features from the audio that are most indicative of emotional tone.
Commonly used features include:
o MFCC (Mel-Frequency Cepstral Coefficients): Captures the spectral properties of the
speech signal.
o Chroma Features: Capture harmonic and melodic features that indicate pitch variations.
o Spectral Centroid: Measures the "brightness" of the sound.
o Zero-Crossing Rate: Counts how frequently the signal crosses zero, related to speech
dynamics.
o Mel Spectrogram: A time-frequency representation of the audio signal.
Feature Scaling: Normalize or standardize extracted features to ensure that no feature
dominates others due to different scales.
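An illustrative sketch of MFCC and mel-spectrogram extraction followed by feature scaling; the 40 MFCC coefficients and 128 mel bands are assumed values, and in practice the scaler is fit on one row per audio file rather than a single clip.

import librosa
import numpy as np
from sklearn.preprocessing import StandardScaler

y, sr = librosa.load("clip.wav", sr=22050, mono=True)            # placeholder file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)               # spectral envelope of the speech
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)     # time-frequency representation
mel_db = librosa.power_to_db(mel, ref=np.max)                    # log scale for numerical stability

clip_vector = np.hstack([mfcc.mean(axis=1), mel_db.mean(axis=1)])
X = np.vstack([clip_vector])                  # in a real run: one row per clip
X_scaled = StandardScaler().fit_transform(X)  # standardize so no feature dominates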
5. Model Building
Model Selection: Choose a suitable machine learning or deep learning model for the
classification task. Common approaches include:
o Traditional Models:
Support Vector Machines (SVM): Effective for high-dimensional feature spaces.
Random Forests: Ensemble models that can handle varied and non-linear data.
K-Nearest Neighbors (KNN): A simple yet effective approach for smaller
datasets.
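A sketch comparing the three traditional models above with 5-fold cross-validation; the synthetic feature matrix is only a stand-in for the real scaled features and encoded labels.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Stand-in for the real (scaled) feature matrix and encoded emotion labels
X, y = make_classification(n_samples=300, n_features=40, n_classes=4,
                           n_informative=10, random_state=42)

models = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")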
Algorithm for Voice Emotion Recognition using Machine Learning:
1. Data Collection: Use a labeled dataset of speech recordings with emotion labels.
2. Audio Preprocessing:
o Convert audio to mono format and resample.
o Remove noise and silence.
3. Feature Extraction:
o Extract MFCC features from the audio files.
4. Data Preprocessing:
o Scale the features.
o Encode emotion labels.
5. Model Building:
o Train a Random Forest Classifier.
6. Model Evaluation:
o Evaluate using metrics like accuracy, precision, recall, and F1-score.
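The six steps above fit into a few lines of scikit-learn; this sketch uses random stand-in MFCC vectors and labels in place of the real extracted features.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Stand-in data: 200 clips x 40 MFCC means, with random emotion labels
rng = np.random.default_rng(42)
features = rng.random((200, 40))
labels = rng.choice(["happy", "sad", "angry", "neutral"], size=200)

X = StandardScaler().fit_transform(features)      # step 4: scale the features
y = LabelEncoder().fit_transform(labels)          # step 4: encode emotion labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=300, random_state=42)   # step 5: model building
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))         # step 6: evaluation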
Results:
DISCUSSION:
1. Noisy Audio Data
Problem: Real-world audio often contains background noise, silence, or overlapping speech.
Impact: Reduces model accuracy by introducing irrelevant or misleading features.
Solution: Apply noise reduction, silence removal, and audio enhancement techniques during
preprocessing.
2. Overlapping Emotions
Problem: Some emotions (e.g., sad vs. tired, angry vs. fear) have overlapping acoustic features.
Impact: Causes confusion during classification and decreases precision.
Solution: Use high-quality datasets, and consider combining audio features with facial or textual
data (multimodal).
3. Imbalanced Datasets
Problem: Some emotions are underrepresented in the dataset.
Impact: The model becomes biased toward the majority classes and predicts rare emotions poorly.
Solution: Apply data augmentation (e.g., pitch shifting, time stretching) or oversampling techniques such as SMOTE.
4. High-Dimensional Features
Problem: Extracted features (e.g., MFCCs, chroma, spectral contrast) may be high-dimensional
or redundant.
Impact: Can cause overfitting and slower training.
Solution: Use feature selection methods like PCA or LDA to reduce dimensionality and retain
meaningful data.
5. Overfitting
Problem: The model performs well on training data but poorly on unseen data.
Impact: Poor generalization and real-world performance.
Solution: Use cross-validation, regularization techniques, and simpler models when necessary.
6. Real-Time Performance
Problem: Processing speed matters for live applications (e.g., virtual assistants).
Impact: Complex models may lag or fail to respond in time.
Solution: Optimize the pipeline and use lightweight models for real-time inference.
SOLUTION IMPACT:
In building a Voice Emotion Recognition model, several challenges can arise, each requiring
targeted solutions to ensure robust performance. One common issue is noisy or low-quality audio,
which can degrade model accuracy by introducing irrelevant information. This can be mitigated through
audio preprocessing techniques such as noise reduction, silence trimming, and normalization—readily
implemented using libraries like librosa (e.g., librosa.effects.trim()) and tools such as noisereduce. Another
challenge is the overlap or ambiguity between emotions (e.g., anger vs. fear), which can be addressed by
extracting and combining multiple audio features beyond MFCC, such as chroma, spectral centroid, and
pitch, all of which are accessible through librosa.feature.
Dataset imbalance, where some emotions are underrepresented, can lead to biased predictions.
To solve this, data augmentation methods like pitch shifting or time stretching can be applied to
increase the variety of samples, and oversampling techniques such as SMOTE (imblearn.over_sampling.SMOTE) can
help balance class distribution. Additionally, high-dimensional feature sets may lead to overfitting or
slow training. Dimensionality reduction techniques such as Principal Component Analysis (PCA) can be
used to retain the most informative features while improving computational efficiency.
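A hedged sketch of the augmentation, oversampling, and dimensionality-reduction ideas above; the keyword-style pitch_shift/time_stretch calls assume librosa 0.10 or newer, and the feature matrix here is a synthetic stand-in with deliberately skewed class counts.

import librosa
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA

# Augmenting one clip (placeholder file name)
y, sr = librosa.load("clip.wav", sr=22050)
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # raise pitch by two semitones
y_stretched = librosa.effects.time_stretch(y, rate=0.9)        # slow the clip down by ~10%

# Balancing and compressing a feature matrix (synthetic stand-in)
rng = np.random.default_rng(0)
X = rng.random((300, 60))
y_labels = rng.choice([0, 1, 2, 3], size=300, p=[0.5, 0.3, 0.1, 0.1])

X_bal, y_bal = SMOTE(random_state=42).fit_resample(X, y_labels)   # oversample minority emotions
X_reduced = PCA(n_components=0.95).fit_transform(X_bal)           # keep 95% of the variance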
Overfitting is another frequent issue, especially when the model performs well on training data
but poorly on unseen inputs. This can be tackled using cross-validation (e.g., StratifiedKFold from scikit-
learn), regularization techniques, and limiting complexity parameters like max_depth in Random Forests
or applying dropout layers in neural networks. For projects with limited data, starting with classical
machine learning models like Random Forests is advisable, or alternatively, transfer learning from pre-
trained deep models can be used to improve performance without large datasets. Lastly, deploying
emotion recognition models in real-time systems introduces latency concerns. To address this, the
model pipeline should be optimized for speed by using lightweight models, real-time MFCC extraction,
and serialization methods like joblib to ensure fast predictions.
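A short sketch tying together the overfitting and deployment points above: stratified cross-validation of a depth-limited Random Forest, then joblib serialization so the trained model loads quickly at inference time; the data is synthetic stand-in material.

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(7)
X = rng.random((240, 40))                      # stand-in feature matrix
y = rng.choice([0, 1, 2, 3], size=240)         # stand-in encoded emotion labels

clf = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
print("Mean CV accuracy:", cross_val_score(clf, X, y, cv=cv).mean())

clf.fit(X, y)
joblib.dump(clf, "emotion_rf.joblib")          # serialize once after training
model = joblib.load("emotion_rf.joblib")       # reload quickly for real-time prediction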
CONCLUSION:
Objective Achieved:
The main goal of building a Voice Emotion Recognition (VER) system using machine
learning was successfully achieved.
Approach Used:
MFCC (Mel-Frequency Cepstral Coefficients) features were extracted from voice
samples, and a Random Forest Classifier was trained to recognize emotions.
Performance:
The model showed good classification accuracy on common emotions like happy, sad,
angry, and neutral, confirming the effectiveness of classical ML techniques for audio-
based emotion recognition.
Model Strengths:
The Random Forest model offered fast training, good interpretability, and worked well
even with limited data.
Practical Value:
This work demonstrates the potential for real-world applications in customer support,
healthcare monitoring, education, and smart assistants.
FUTURE WORK:
Use of Deep Learning:
Implement advanced models like CNNs, RNNs, or LSTMs to improve performance by
learning more complex patterns in voice data.
Cross-Cultural Adaptation:
Investigate cultural and linguistic differences in emotional expression to make the system
effective across different regions.
Personalization:
Develop adaptive systems that learn an individual user's speech and emotional patterns for
improved accuracy over time.
OVERVIEW:
Software Libraries & Tools
1. Python
o Main language used for implementation.
2. Librosa
o Audio processing and MFCC feature extraction.
o Link: https://librosa.org/
3. scikit-learn
o For model training (Random Forest, SVM), scaling, and evaluation.
o Link: https://scikit-learn.org/
4. NumPy / Pandas
o For data manipulation and feature arrays.
5. Matplotlib / Seaborn
o For visualizing results, confusion matrices, and feature distributions.
6. imblearn (SMOTE)
o Handling imbalanced datasets.
o Link: https://imbalanced-learn.org/
Datasets Used:
1. RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
2. TESS (Toronto Emotional Speech Set)
3. CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
4. Emo-DB (Berlin Database of Emotional Speech)
REFERENCE: