INTRODUCTION
INTRODUCTION
1.1 Artificial Intelligence (AI)
Artificial Intelligence (AI) plays a vital role in Speech Emotion Recognition (SER),
enabling machines to understand human emotions through vocal expressions. A
commonly used approach involves extracting features from speech using Mel-Frequency
Cepstral Coefficients (MFCC), and then processing these features through deep learning
models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks
(RNN), and Long Short-Term Memory (LSTM) networks.
MFCC is a powerful feature extraction technique that
captures the short-term power spectrum of sound, emulating the way humans perceive
audio. It transforms raw audio signals into a compact and informative representation,
typically a 2D array, which serves as the input to AI models. Once MFCC features are
extracted, CNNs are often used to detect spatial patterns in the data. Treating the MFCC
output like an image, CNN layers identify local acoustic patterns that correlate with
different emotions.
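To make this concrete, the short sketch below (the file path is hypothetical) shows how such an MFCC matrix can be extracted with librosa, the library used later in this report:

import librosa
import numpy as np

# Hypothetical path to one utterance from an emotional speech dataset
audio_path = "speech_sample.wav"

# Load the waveform at 16 kHz mono and compute 40 MFCC coefficients per frame
y, sr = librosa.load(audio_path, sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

print(mfcc.shape)  # (40, n_frames): the 2D array that is fed to the CNN/LSTM models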
Following this, the data is passed through RNN or LSTM layers to learn the
temporal dynamics of the speech. RNNs are designed to handle sequential data, making
them suitable for capturing the time-dependent nature of speech. However, traditional
RNNs can struggle with long sequences, which is why LSTMs are preferred—they can
retain important information over longer periods, enabling the model to understand
emotional cues that unfold over time.
The combination of MFCC for feature extraction, CNN for spatial learning, and LSTM for
sequential modeling creates a robust system capable of accurately classifying emotions like
happiness, sadness, anger, fear, and neutrality from speech. This approach is widely used in
applications such as virtual assistants, customer service bots, mental health monitoring, and
adaptive learning systems, where understanding human emotion enhances user experience
and system responsiveness
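As an illustration only, a minimal PyTorch sketch of such a CNN-LSTM hybrid, assuming 40 MFCC coefficients per frame, might look like this (layer sizes are illustrative and not the exact architecture used in this project):

import torch
import torch.nn as nn

class CNNLSTMEmotionNet(nn.Module):
    """Illustrative CNN + LSTM classifier over MFCC frames shaped (batch, 1, 40, time)."""
    def __init__(self, n_emotions=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halves both the MFCC and time axes
        )
        self.lstm = nn.LSTM(input_size=16 * 20, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_emotions)

    def forward(self, x):
        x = self.cnn(x)                           # (batch, 16, 20, time/2)
        x = x.permute(0, 3, 1, 2).flatten(2)      # (batch, time/2, 16*20)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                # classify from the last time step

model = CNNLSTMEmotionNet()
logits = model(torch.randn(8, 1, 40, 200))        # 8 clips, 40 MFCCs, 200 frames
print(logits.shape)                               # torch.Size([8, 5])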
Fig 1.1 Components of AI
AI-based predictive models for speech emotion recognition primarily rely on machine learning
and deep learning algorithms, including the following (a brief scikit-learn sketch follows this list):
SVM (Support Vector Machine) – Works well with high-dimensional data like MFCCs and is
effective for binary and multi-class classification.
Random Forest– Random Forest is an ensemble learning algorithm based on decision trees. In
Speech Emotion Recognition, it is commonly used for classifying speech into emotions based on
extracted features.
KNN (K-Nearest Neighbors)– K-Nearest Neighbors (KNN) is a simple, instance-based
learning algorithm that classifies an emotion based on the majority vote of its K nearest
neighbors in the feature space.
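A minimal scikit-learn sketch of these classical baselines, assuming X holds one averaged MFCC vector per clip and y the emotion labels (random placeholder data is used here), could be:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X: one averaged 40-dim MFCC vector per clip, y: integer emotion labels (placeholder data)
X, y = np.random.randn(200, 40), np.random.randint(0, 5, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("Random Forest", RandomForestClassifier(n_estimators=100)),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))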
1.1.1 Components of AI
1. Data Collection & Preprocessing
Data Sources:
TESS (Toronto Emotional Speech Set).
CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset).
Data Preprocessing:
Removes background noise and silent parts to focus only on emotional speech content.
Resampling and amplitude normalization so that all clips share a consistent format.
Noise reduction using statistical and signal-processing techniques (a brief sketch follows).
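A brief librosa sketch of these preprocessing steps, using a hypothetical clip path, might look like:

import librosa
import numpy as np

# Illustrative preprocessing of one TESS/CREMA-D clip (the path is hypothetical)
y, sr = librosa.load("sample_clip.wav", sr=16000)         # resample to 16 kHz mono
y_trimmed, _ = librosa.effects.trim(y, top_db=25)         # drop leading/trailing silence
y_norm = y_trimmed / (np.max(np.abs(y_trimmed)) + 1e-9)   # peak-normalize the amplitude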
2. Machine Learning (ML)
Traditional ML Models:
• Decision Trees – Easy to interpret.
• Random Forests – Often combined with Hidden Markov Models (HMMs) to model temporal dynamics.
• Logistic Regression
• Naive Bayes – Simple and fast but sometimes too simplistic for complex speech features.
Typical applications include: detecting customer frustration, anger, or satisfaction in real time;
routing calls to human agents when customers are upset; analyzing customer service quality via
emotional trends; responding empathetically when the user sounds stressed or upset to improve
engagement and user experience; detecting signs of depression, anxiety, or mood disorders through
speech; passive monitoring of at-risk individuals (e.g., elderly people, PTSD patients); and
supporting digital therapeutics and telehealth platforms.
In supervised learning, the algorithm learns from labeled data, where each
example consists of input features and corresponding labels.
Unsupervised Learning
Semi-Supervised Learning
Fig 1.2.1 Types of Machine Learning
1.3 Generative AI
1.3.1 Applications of Generative AI
CHAPTER II
FEASIBILITY STUDY
FEASIBILITY STUDY
The existing systems for Speech Emotion Recognition (SER) primarily
rely on machine learning and deep learning models trained on labeled speech datasets.
These systems typically use acoustic features such as Mel-Frequency Cepstral
Coefficients (MFCCs), pitch, intensity, and spectral features to identify emotional
states. Traditional systems often employ classifiers like Support Vector Machines
(SVM), K-Nearest Neighbors (KNN), and Hidden Markov Models (HMM). More
recent advancements incorporate deep learning techniques using Convolutional Neural
Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term
Memory (LSTM) networks to improve accuracy and handle complex temporal patterns
in speech. Despite significant progress, existing systems still face challenges in
recognizing emotions accurately in noisy environments, across different languages, and
with diverse speaking styles. Most current SER applications are integrated into call
centers, virtual assistants, and emotion-aware healthcare tools, but their performance
often depends on the quality and diversity of the training data.
Limitations
• Speaker Dependency: SER systems often perform well only on known speakers, i.e., they
perform better when the speaker's voice characteristics are already part of the training data.
• Language and Accent Variability: Speech emotion patterns can vary significantly
across different languages and accents.
• Background Noise: Real-world environments often contain background noise, such
as traffic, people talking, or mechanical sounds.
• Emotion Ambiguity: Emotions are often not clearly expressed or are blended, making
them difficult to categorize.
• Computational Complexity: Advanced AI-based models (e.g., deep learning, LSTMs)
require high computational resources, making them expensive to deploy and maintain.
• Lack of Generalization: Models trained on speech from one language, accent, or recording
environment may not perform well in another.
• Integration Challenges: Emotion recognition models must integrate with telephony, IoT,
and application platforms, but compatibility issues often arise due to differences in protocols
and data formats.
2.2 Proposed system
The proposed system for Speech Emotion Recognition (SER) leverages a
hybrid deep learning architecture combining LSTM, CNN, RNN, and the advanced Wav2Vec
2.0 Transformer model to achieve high accuracy in emotion detection from speech. This
integrated approach is designed to overcome the limitations of traditional models by capturing
both spatial and temporal features of audio signals in a robust and efficient manner. Initially,
raw audio input is preprocessed and features such as spectrograms or MFCCs are extracted.
These features are first passed through a Convolutional Neural Network (CNN) to learn
spatial hierarchies in the acoustic data, effectively capturing local patterns and energy
distribution across time and frequency. The output from the CNN is then fed into a stack of
Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) layers, which are
adept at modeling temporal dependencies and sequence dynamics, crucial for understanding
the progression of emotions in speech over time. To further enhance the performance and
contextual understanding, the system integrates Wav2Vec 2.0, a self-supervised Transformer-
based model pretrained on a large corpus of unlabeled speech data. Wav2Vec 2.0 learns
contextualized audio representations directly from the waveform, significantly reducing the
reliance on handcrafted features and boosting the system’s ability to generalize across
different speakers and acoustic conditions. By combining these powerful architectures, the
proposed system ensures superior performance in real-time emotion recognition tasks and
demonstrates strong generalization across various datasets and languages. This architecture is
highly suitable for deployment in interactive voice systems, healthcare diagnostics, virtual
assistants, and customer service applications, providing emotionally intelligent responses and
improved human-computer interaction.
Limitations
• Variability in Speech: Differences in accent, pitch, tone, and speaking style can
degrade the accuracy of emotion recognition
• Noise Sensitivity: Background noise or poor audio quality can interfere with feature
extraction, especially in real-world environments.
• Subjectivity of Emotions: Emotions can be interpreted differently across individuals and
cultures, making accurate classification difficult.
In conclusion, the integration of advanced deep learning models such as LSTM,
CNN, RNN, and Wav2Vec 2.0 has significantly enhanced the accuracy and effectiveness of
Speech Emotion Recognition systems. These models enable the system to capture complex
acoustic and temporal patterns, leading to more nuanced and context-aware emotion
classification. While the proposed approach demonstrates strong potential for real-world
applications in healthcare, customer service, and human-computer interaction, it is important
to acknowledge and address existing challenges such as variability in speech, data
limitations, computational demands, and ethical considerations. Continued research and
innovation in model optimization, dataset diversity, and privacy-preserving techniques will
be crucial to building more robust, inclusive, and responsible SER systems in the future.
2.2.1 HARDWARE AND SOFTWARE REQUIREMENTS
1. Hardware Requirements
• Processor: Intel Core i7/i9, AMD Ryzen 7/9, or higher (for training ML models)
• GPU (Optional but Recommended): NVIDIA RTX 3060/RTX 4090 or
equivalent for deep learning acceleration
2. Software Requirements
• Operating System: Windows 10/11, Linux (Ubuntu), or macOS
CHAPTER III
LITERATURE SURVEY
LITERATURE SURVEY
3.1 P. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-End Speech Emotion Recognition
Using Deep Neural Networks,” IEEE Journal of Selected Topics in Signal Processing,
2018, Volume 11, Issue 8, Pages 1301–1309, ISSN: 1932-4553
Abstract
Speech Emotion Recognition (SER) is a rapidly evolving field that aims to detect
human emotions from speech signals, enabling more natural and intelligent human-computer
interaction. This project proposes a hybrid deep learning approach that combines Convolutional
Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory
(LSTM), and the Wav2Vec 2.0 Transformer model to improve the accuracy and robustness of
emotion classification. The system leverages CNN for spatial feature extraction, LSTM and
RNN for modeling temporal dependencies, and Wav2Vec 2.0 for learning contextualized
speech
Description
Preprocessing: All audio files are resampled to a consistent sampling rate (commonly 16 kHz or
44.1 kHz) to ensure uniformity across datasets.
Model Training & Optimization: Training the Deep LSTM model using a regression-
based dataset with polynomial extrapolation.
• Performance Evaluation: Comparing the Root Mean Square Error (RMSE) of different
models, including LSTM, Bi-LSTM, and Deep LSTM. Results indicate that Deep LSTM
achieves better accuracy compared to traditional methods. The study suggests that
incorporating hybrid approaches like CNN-LSTM could further improve performance.
Limitations
• Dependence on Large Datasets: The Deep LSTM model requires extensive labeled
speech data for effective training, which may not be available for every language or domain.
• Speaker Dependency: Models trained on specific speakers often perform poorly
when tested on new, unseen speakers.
• Emotion Overlap and Ambiguity: Emotional states often overlap (e.g., anger
and frustration, or fear and surprise), making classification difficult.
• Language and Cultural Bias: Most datasets are in English or a few major
languages, limiting cross-lingual and multicultural applicability.
This study investigates the application of deep learning architectures, particularly
Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), in the field of
Speech Emotion Recognition (SER). The research uses the RAVDESS dataset, which includes
multiple emotional classes like happy, sad, angry, and neutral. Audio files are preprocessed using
MFCC feature extraction, and models are trained to classify emotional states. Performance
evaluation indicates that CNN-RNN hybrid models outperform traditional classifiers such as SVM
and Random Forest, with an accuracy improvement of up to 15%.
Description
The paper presents a hybrid CNN-RNN model for SER that processes MFCC
features derived from the RAVDESS dataset. The CNN layers extract spatial features from
MFCC inputs, while RNN layers capture temporal dependencies, improving the model’s
ability to detect complex emotional patterns in speech. Hyperparameter tuning and dropout
layers are applied to prevent overfitting. The model achieves 82.4% accuracy in multi-class
emotion recognition, showing robust performance on both male and female speech samples.
The study emphasizes deep learning’s potential in enhancing human-computer interaction.
Limitations:
Dataset Bias: The RAVDESS dataset contains acted emotions, which may not generalize well to
real-world scenarios with spontaneous speech.
Speaker Dependency: Performance drops significantly when tested on speakers not present in the
training set.
Limited Language Support: The study only focuses on English, restricting cross-lingual
applications.
High Computational Load: The hybrid CNN-RNN architecture requires high processing power
and memory.
No Real-time Evaluation: The system has not been tested in real-time applications or embedded
environments.
3.3 Ekhlas Al Nassr, Mohd Fadzil Hassan, and Yousef Alhwaiti, “Improving Speech Emotion
Recognition Using Data Augmentation and Feature Fusion,” Computers, Volume 10, Issue 6,
Article 79, 2021, ISSN: 2073-431X
Abstract
This research focuses on enhancing speech emotion recognition accuracy through
data augmentation techniques and the fusion of multiple audio features. The study leverages
both traditional (MFCC, Chroma) and advanced features (spectral centroid, roll-off) and
combines them using a deep neural network (DNN) for classification. Evaluation is conducted
using the EMO-DB dataset, and data augmentation is applied through pitch shifting and time
stretching. Results indicate that feature fusion and augmentation significantly improve
classification performance.
Description
The methodology involves preprocessing the EMO-DB dataset with multiple
augmentation methods to create a diverse training set. Feature extraction includes MFCCs,
chroma features, zero-crossing rate, and spectral bandwidth. These are concatenated to form a rich
feature vector, which is input into a DNN with batch normalization and ReLU activation. The
model achieves a classification accuracy of 89.3% with augmentation, compared to 76% without
it. The fusion of complementary features provides robustness against noise and speaker variation.
Limitations:
Overfitting Risk: Extensive feature sets and augmentation increase the risk of overfitting,
especially on small datasets.
Artificial Emotions: Use of EMO-DB limits realism due to acted emotional expressions.
Limited Multilingual Support: The model was only tested on German speech data.
Augmentation Artifacts: Pitch and speed manipulation can introduce audio artifacts that
affect emotion clarity.
Lack of Real-time Testing: The system was only evaluated offline, not under real-time
conditions.
CHAPTER IV
SYSTEM DESIGN
SYSTEM DESIGN
4.1 System Architecture
Description
The Speech Emotion Recognition (SER) system leverages machine learning techniques
to identify and classify human emotions from spoken audio data. The project initiates with the
collection of emotional speech datasets, such as RAVDESS (Ryerson Audio-Visual Database
of Emotional Speech and Song), TESS (Toronto Emotional Speech Set), or SAVEE. These
datasets contain audio recordings labeled with various emotions including neutral, happy, sad,
angry, fearful, and disgusted.
The raw audio files are then preprocessed using the Librosa library in Python. Key
preprocessing steps include noise reduction, trimming silences, and sampling normalization.
Feature extraction is performed using Mel-Frequency Cepstral Coefficients (MFCC), Chroma
features, Zero-Crossing Rate, Spectral Centroid, and Spectral Roll-off to capture both temporal
and spectral properties of the audio signal.
Following preprocessing and feature extraction, the dataset is split into training and
testing sets. A variety of machine learning and deep learning models are experimented with,
including Support Vector Machines (SVM), Convolutional Neural Networks (CNN), Long
Short-Term Memory (LSTM) networks, and hybrid CNN-LSTM models. These models are
trained to classify the speech into emotional categories. Model performance is evaluated based
on metrics such as accuracy, precision, recall, and F1-score, and the best-performing model is
saved using Python's joblib or pickle in .pkl format for later use.
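A condensed sketch of this train-evaluate-save loop is given below; the feature matrix X and label vector y are placeholders standing in for the features built as described above:

import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Placeholder feature matrix and labels; in the real pipeline these come from feature extraction
X, y = np.random.randn(300, 57), np.random.randint(0, 5, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = SVC(kernel="rbf", probability=True)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))   # precision, recall, F1 per emotion

joblib.dump(model, "ser_model.pkl")   # persist the best model in .pkl format, as described above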
1. Voice Signal (Input)
It captures speech signals that carry both linguistic content and emotional cues.
2. Feature Extraction
This step involves extracting meaningful characteristics from the audio signal to feed into
machine learning models.
These include basic statistical measures from the waveform, such as:
o Max: Maximum amplitude
It also involves converting the audio to a spectrogram using techniques like FFT or MFCC.
3. Model Training
This is the machine learning or deep learning phase, where extracted features are used to train
models to recognize emotions.
a. Traditional ML Algorithms:
Naïve Bayes
Decision Tree
These are typically fast but may struggle with complex features.
b. Neural Networks:
Advanced deep learning architectures (like CNN, RNN, LSTM) that can automatically learn
complex patterns directly from the extracted features.
The trained model then predicts the emotion category from the audio.
Common emotions identified include:
o Angry
o Happy
o Sad
o Neutral
o Fear
The classification results—which represent the identified emotions—are then stored and displayed for
further use or feedback.
4.3 Modules
1. Data Pre-processing
2. Exploratory Data Analysis and Visualization
3. Data Validation/Cleaning/Preparation Process
4. Web Application & Deployment Module
Once the audio is cleaned, feature extraction is carried out using libraries such as
Librosa or PyAudioAnalysis. The goal is to extract low-level descriptors and statistical
features that represent the speech signal's characteristics. Key features include the following
(a short extraction sketch follows this list):
MFCC (Mel-Frequency Cepstral Coefficients): Captures the timbral texture and pitch
variations of the voice, making it essential for emotion detection.
Chroma Features: Capture pitch class profiles, providing tonal information which may differ
across emotions.
o Zero Crossing Rate (ZCR): Measures the frequency of signal sign changes, which
can be correlated with speech intensity.
o Spectral Centroid and Spectral Bandwidth: Reflect the “brightness” and
frequency distribution of the audio.
o Root Mean Square Energy (RMSE): Represents signal strength and can vary with
emotional intensity.
o Tempo, Harmonic-to-Noise Ratio, and Spectrograms (for CNN input) are also
optionally extracted depending on the model architecture.
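The sketch below illustrates how these descriptors might be computed and pooled into one feature vector per clip with librosa; the pooling choice (time averaging) and coefficient counts are illustrative, not the project's fixed configuration:

import librosa
import numpy as np

def extract_features(path, sr=16000):
    """Illustrative per-clip feature vector combining the descriptors listed above."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    rmse = librosa.feature.rms(y=y)
    # Average each descriptor over time and concatenate into one row per audio sample
    return np.hstack([feat.mean(axis=1) for feat in (mfcc, chroma, zcr, centroid, rmse)])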
After feature extraction, these features are compiled into tabular datasets—often in
NumPy arrays or Pandas DataFrames—with each row representing an audio sample and
each column representing a feature. This is followed by feature normalization or
standardization, which ensures that all features are on a similar scale, thus preventing
model bias toward features with higher numeric ranges.
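For example, standardization can be done with scikit-learn's StandardScaler; here X is a random placeholder for the feature table built in the previous step:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 57)          # placeholder for the real feature table (rows = clips)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)    # zero mean, unit variance per feature column
# Reuse the same fitted scaler (scaler.transform) on the validation and test splits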
Finally, the entire dataset is split into training, validation, and test sets, commonly
in a 70:15:15 or 80:10:10 ratio. This ensures that the model can learn effectively, be tuned
using validation data, and be evaluated fairly on unseen samples. Data augmentation
techniques such as pitch shifting, time stretching, and background noise injection may also
be applied during preprocessing to enhance model robustness and handle overfitting,
especially when working with limited datasets.
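A small sketch of such augmentations with librosa (the parameter values are illustrative) is shown below:

import librosa
import numpy as np

def augment(y, sr):
    """Illustrative augmentations: pitch shift, time stretch, and noise injection."""
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # shift up two semitones
    stretched = librosa.effects.time_stretch(y, rate=0.9)        # slow down by about 10%
    noisy = y + 0.005 * np.random.randn(len(y))                  # add mild Gaussian noise
    return pitched, stretched, noisy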
Overall, the preprocessing phase plays a foundational role in the SER pipeline,
ensuring high-quality, well-structured data that allows machine learning and deep learning
models to accurately learn and classify human emotions from speech signals.
Class Diagram
This diagram represents a modular architecture for a Speech Emotion Recognition system. It begins with
the AudioInput module, which is responsible for capturing speech through methods like recordSpeech()
and uploadAudio(). The captured audio is then sent to the Preprocessing module, which performs tasks
such as noiseReduction() and featureExtraction() to prepare the audio for analysis by reducing noise and
extracting relevant features. These features are further processed by the FeatureExtraction module,
which applies methods such as extractMFCC() (Mel-Frequency Cepstral Coefficients) and
extractSpectrogram(), essential for emotion classification. In parallel, the
Admin module manages datasets and training processes via manageDataset() and retrainModel(),
ensuring the model stays updated and accurate. Both the features and admin commands feed into the
Model module, which contains core functionalities like trainModel() for learning from data and
classifyEmotion() to predict emotions from the extracted features. The EmotionClassification module
then applies specific detection methods such as detectHappiness(), detectSadness(), detectAnger(), and
detectNeutrality() to categorize the emotional state from the model’s output. Finally, the classified
emotion is displayed to the User through functions like provideAudio() and viewResults(), closing the
loop by presenting the emotion recognition results back to the end user. This structured flow ensures
clear separation of concerns, from audio capture to emotion detection and user feedback. This diagram
represents a use case or component flow for an audio-based emotion recognition system. It shows the
sequence of components and how they interact to process audio, extract features, train a model, and
classify emotions
Sequence Diagram:
The sequence diagram illustrates the flow of operations in a Speech Emotion Recognition (SER) system
using deep learning models such as CNN, RNN, and LSTM. The process begins with the User initiating
a voice input, which is captured by the Audio Recorder and passed as raw audio to the Preprocessor.
The Preprocessor cleans the data by removing noise and unnecessary segments, then forwards the
extracted features to the Feature Extractor, which generates meaningful representations like MFCCs or
spectrograms. These features are then sent to the Model Selector, which determines the appropriate
model (CNN, RNN, or LSTM) based on configuration or performance needs. The selected model
processes the features and sends the output to the Emotion Classifier, which interprets the model’s
output and identifies the emotional state (e.g., happy, sad, angry). Finally, the Display System presents the
predicted emotion back to the user. This structured interaction ensures a streamlined and automated workflow
for recognizing emotions from speech using deep learning architectures.
These features are passed to a model selection mechanism, which dynamically chooses the most
appropriate deep learning model—either CNN, RNN, or LSTM—based on predefined criteria or
configurations. The chosen model processes the features and produces a prediction, which is then
interpreted by the Emotion Classifier to determine the emotional state. Finally, the Display System
communicates the detected emotion to the user. This diagram not only demonstrates the modular nature
of the system but also emphasizes the logical and time-ordered communication required to achieve
accurate emotion detection.
Audio Recorder component captures this vocal signal as raw audio data in waveform format. This data
often contains background noise, silences, and other artifacts, so it is passed to a Preprocessor, which
applies signal enhancement techniques such as noise reduction, silence trimming, amplitude normalization,
and voice activity detection (VAD). These preprocessing steps help to clean and standardize the input,
making it suitable for feature extraction. The cleaned audio is then fed into the Feature Extractor, where
essential acoustic features are computed. These features may include low-level descriptors such as Mel-
Frequency Cepstral Coefficients (MFCCs), chroma features, zero-crossing rate, spectral centroid,
pitch, formants, and prosodic elements like speech rate or energy dynamics. These features capture the
temporal, spectral, and emotional characteristics of the speech signal. Once extracted, these features are
input into an Emotion Classifier, typically a machine learning or deep learning model such as a
Convolutional Neural Network (CNN) for spatial feature learning, a Recurrent Neural Network (RNN)
or Long Short-Term Memory (LSTM) network for modeling temporal dependencies, or a hybrid
architecture that combines both. The classifier processes the features based on its learned parameters and
predicts an emotion label, such as “happy,” “sad,” “angry,” “fearful,” or “neutral.”
The predicted emotion is then sent to the Display System, which outputs the recognized emotion to the user.
Collaboration Diagram
The process begins with the user, who provides raw audio data containing
speech recordings that exhibit various emotional expressions. These recordings could come
from datasets like TESS, RAVDESS, or custom sources. The user uploads or provides this
raw audio data into the system, initiating the pipeline.
Once received, the audio files are collected in the Raw Data Source module.
This component acts as a central repository that stores unprocessed audio clips along with
their associated metadata, such as file paths and emotion labels (if available). At this stage,
the data remains in its original, unfiltered format.
The raw audio is then passed to the Data Preprocessing Module, where
essential transformations take place. This step includes cleaning noise from the audio,
normalizing the signals, segmenting audio clips, extracting meaningful features such as Mel-
frequency cepstral coefficients (MFCCs), chroma, zero-crossing rate, and pitch, and
converting labels into numerical form if they are categorical. This preprocessing is crucial to
prepare the audio for effective model training.
4.5.1 Exploratory Data Analysis and Visualization
Exploratory Data Analysis (EDA) and Visualization for Speech Emotion
Recognition (SER) play a crucial role in understanding the structure, quality, and patterns
within audio datasets before applying machine learning models. The first step typically
involves analyzing the distribution of emotion classes to identify imbalances, such as
overrepresentation of “neutral” and underrepresentation of “fear” or “disgust.” Visualization
tools like bar plots or pie charts help highlight these class distributions clearly. Next, audio
feature distributions such as MFCCs, chroma features, pitch, and energy are examined
using box plots, histograms, and violin plots to understand their variance across different
emotions. Correlation heatmaps are also used to detect highly correlated features, which
may be candidates for dimensionality reduction. Additionally, waveform plots and
spectrograms give insight into how different emotions modulate the speech signal visually
—revealing patterns in energy, frequency content, and rhythm. t-SNE or PCA plots are
often used for dimensionality reduction and visualization, helping to identify whether the
extracted features form distinguishable clusters for different emotional states. This phase
helps identify noise, missing values, and outliers, and provides a foundation for feature
engineering, class balancing, and model selection. Overall, EDA not only enhances
understanding of the dataset but also guides the design of more accurate and interpretable
SER systems. In addition to basic class distribution analysis, EDA for SER often begins
with examining audio duration and silence ratios across recordings, as emotions like
sadness may correlate with slower speech and longer pauses. Plotting the duration
distribution of audio files helps in identifying inconsistencies, such as truncated or overly
lengthy recordings, which can introduce noise into model training. A common practice is to
visualize waveforms and mel-spectrograms using libraries like librosa, which allows for
time-frequency analysis that can reveal emotion-linked acoustic patterns—such as higher
frequency content in angry speech or lower energy in sad speech.
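A brief sketch of these EDA plots is given below; the DataFrame df with a 'labels' column and the example clip path are assumptions for illustration:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import seaborn as sns

# Class distribution (df is assumed to hold one row per clip with a 'labels' column)
sns.countplot(x='labels', data=df)
plt.title('Emotion class distribution')
plt.show()

# Waveform and mel-spectrogram of a single clip (hypothetical path)
y, sr = librosa.load('angry_example.wav', sr=16000)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(y, sr=sr, ax=ax1)
ax1.set_title('Waveform')
S_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel', ax=ax2)
ax2.set_title('Mel-spectrogram')
plt.tight_layout()
plt.show()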
Class Diagram
4.6.1 Data Validation/Cleaning/Preparation Process
The data validation, cleaning, and preparation process for Speech Emotion
Recognition (SER) is a critical stage that directly influences the performance and reliability
of the system. This process begins with the collection of raw audio data from established
emotional speech datasets or real-time recordings. At this stage, data validation ensures that
the recordings are complete, correctly labeled with emotional categories, and free from
major technical issues such as missing files or corrupt formats. Once validated, the data
undergoes a cleaning phase where background noise is reduced or removed using signal
processing techniques such as noise filtering, normalization, and silence trimming to
enhance the clarity of emotional cues. Inconsistent samples, such as those with too much
distortion or incorrect labels, are either corrected or removed. Following cleaning, the
preparation phase begins with feature extraction, where relevant acoustic features like
MFCCs, pitch, energy, and spectral characteristics are computed from the cleaned audio.
These features are then standardized and, if necessary, subjected to feature selection
methods to retain only the most informative ones. The dataset is subsequently split into
training, validation, and test sets to ensure that model evaluation is unbiased and
performance can be accurately assessed. Throughout this process, consistency in
preprocessing steps is maintained to avoid data leakage and ensure that the system
generalizes well to new inputs. This structured pipeline ensures high-quality input to the
machine learning model, forming a robust foundation for the accurate detection and
classification of emotions from speech. Finally, the dataset is split into training, validation, and
testing subsets to facilitate effective model training and evaluation. This end-to-end process ensures
that the input data is robust, reliable, and optimally structured for accurate speech emotion
recognition using advanced AI techniques.
Class Diagram
This diagram illustrates the complete process involved in data validation, cleaning, and
preparation specifically tailored for Speech Emotion Recognition (SER) systems. It begins
with AudioCollection, where raw speech audio samples are gathered, often from diverse
sources such as databases, real-time recordings, or call centers. The next step,
FeatureExtraction, involves extracting meaningful acoustic characteristics from the audio
signals, such as pitch, energy, MFCCs (Mel-Frequency Cepstral Coefficients), and spectral
features, which are vital for recognizing emotions.
Following extraction, FeatureSelection is performed to identify and retain only the most
relevant features that significantly contribute to emotion classification, reducing
dimensionality and improving model efficiency. The process then moves to DataSplitting,
where the cleaned and processed data is divided into training and test sets, ensuring that the
model can learn patterns from one part of the data and be evaluated on unseen examples.
ModelTraining involves applying machine learning algorithms (e.g., SVM, Random Forest,
CNN, RNN) to the training data to develop a predictive model. Once trained, the model
proceeds to the EmotionRecognition phase, where it classifies or predicts emotions (such as
happy, sad, angry, etc.) from new or unseen speech data. Finally, ModelEvaluation assesses
the performance of the model using metrics like accuracy, precision, recall, F1-score, and
confusion matrix, helping validate its reliability and generalization capability.
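As an illustration, these evaluation metrics can be computed with scikit-learn; the y_test and y_pred arrays below are random placeholders standing in for the classifier's real outputs:

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# Placeholders; in practice these come from the trained emotion classifier
y_test = np.random.randint(0, 5, 100)
y_pred = np.random.randint(0, 5, 100)

acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted', zero_division=0)
cm = confusion_matrix(y_test, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")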
This flow ensures a structured and systematic approach for building effective and accurate
speech emotion recognition systems. This is the foundational step where raw speech data is
gathered. It involves collecting voice recordings from various sources such as emotional
speech datasets (like RAVDESS, EMO-DB, or CREMA-D), real-world audio (e.g., call
centers, therapy sessions), or through direct microphone input. The key requirement here is
ensuring that the collected audio is of good quality, appropriately labeled with the
corresponding emotion, and representative of different speakers, accents, and speaking
styles. Not all extracted features contribute equally to identifying emotions. This step
involves selecting the most relevant ones, which improves the model’s accuracy and
efficiency. Feature selection techniques like PCA (Principal Component Analysis), recursive
feature elimination, or mutual information-based methods are commonly used to reduce
noise and computational load. The process begins with AudioCollection, where speech
audio samples are gathered from various sources such as emotional speech databases, real-
world recordings, or direct microphone inputs. These audio recordings must be of high
quality and labeled accurately with the corresponding emotions to ensure the effectiveness of
the downstream processes. Once collected, the next step is FeatureExtraction, where the
audio data is converted into structured numerical representations. Acoustic features such as
Mel-Frequency Cepstral Coefficients (MFCCs), pitch, energy, and spectral features are
extracted to capture the emotional characteristics embedded in the speech. These features then
feed the selection, splitting, and training stages described above.
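A short sketch of such feature selection is given below; X and y are random placeholders for the prepared feature matrix and labels:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

X, y = np.random.randn(300, 57), np.random.randint(0, 5, 300)   # placeholders for real features/labels

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Alternatively, rank individual features by mutual information with the emotion labels
mi_scores = mutual_info_classif(X, y)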
Sequence Diagram
Fig 4.8 Sequence Diagram for the Data Preparation Process
The diagram illustrates the comprehensive workflow for data validation, cleaning,
and preparation in a speech emotion recognition system. It is structured as a sequence
diagram featuring four main roles: the User, Data Engineer, Feature Extraction, and ML
Model. The process begins with a request from the User to validate and prepare the speech
data, which is received by the Data Engineer. The Data Engineer then initiates a series of
operations starting with cleaning and validating the data. This step is crucial as it ensures the
quality and integrity of the dataset before any analysis. Following this, noise and silence are
removed from the audio, which helps in focusing only on the relevant parts of the speech that
carry emotional content. The audio is then segmented into meaningful utterances, allowing
for a more fine-grained analysis. After segmentation, the system proceeds to extract acoustic
features such as pitch, tone, and energy—key indicators of emotional state. These features are
then normalized, encoded, and aggregated to ensure they are in a suitable format for machine
learning. Once the data is prepared, it is sent to the ML Model module, which acknowledges
receipt and proceeds to train the emotion classifier using the processed features. Finally, the
trained model and its results are returned to the Data Engineer, who then provides them to the
User. This end-to-end process demonstrates a robust pipeline for preparing raw speech data
into actionable inputs for emotion recognition models
Collaboration Diagram
The Data Engineer then performs crucial tasks to ensure the data is usable. These
tasks include cleaning the data by removing background noise and silences, segmenting it
into utterances, and extracting meaningful acoustic features such as pitch, energy, and
spectral characteristics. Once this preprocessing is complete, the data is normalized,
encoded, and structured appropriately for machine learning. The cleaned and feature-rich
data is then sent to the ML model component.
The ML model receives both the raw and prepared data for training. Using this input,
the model is trained to classify emotions expressed in the speech, such as happiness, anger,
sadness, or neutrality. Once the model has been successfully trained, it is stored back in the
Storage component and also returned to the Data Engineer, who provides the final results
and trained model output to the User. This collaboration ensures that the speech data is
carefully processed and accurately modeled to identify emotions, enabling effective
deployment in applications like call center analytics, human-computer interaction, or mental
health monitoring.
4.8.1 DATABASE SCHEMA
A database schema is the skeleton structure that represents the logical view of the entire database.
It defines how the data is organized and how the relations among them are associated. It formulates all the
constraints that are to be applied on the data.
Module I
This line creates a mapping dictionary (label_map) that assigns each unique emotion label in the
df['labels'] column a unique integer. For example, if the labels were ['happy', 'sad', 'angry'], it might
generate {'happy': 0, 'sad': 1, 'angry': 2}.
The DataFrame output below the code has two columns:
audio_paths: Contains the full file paths to the audio files from the TESS (Toronto Emotional
Speech Set) dataset.
labels: Now contains integer values instead of strings. Here, the first two entries are labeled 0,
meaning both audio files correspond to the same emotion category (e.g., both might represent
"happy").
This step is crucial in the data preparation phase of a Speech Emotion Recognition pipeline.
Machine learning models cannot process string labels directly, so converting them into numerical
form allows for classification tasks like predicting emotions based on audio features.
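Since the code itself is not reproduced in this listing, a hedged reconstruction of the mapping step (column names follow the description above) might be:

import pandas as pd

# df is assumed to contain 'audio_paths' and string 'labels' columns built from TESS
label_map = {label: idx for idx, label in enumerate(df['labels'].unique())}
df['labels'] = df['labels'].map(label_map)                            # e.g. {'happy': 0, 'sad': 1, ...}
inverse_label_map = {idx: label for label, idx in label_map.items()}  # used later at inference time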
CHAPTER V
SYSTEM IMPLEMENTATION
SYSTEM IMPLEMENTATION
5.1 Sample Code View.py
import streamlit as st
import librosa
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForSequenceClassification
def set_background():
st.markdown(
f"""
<style>
.stApp {{
background: url("https://github.com/Sapir52/Speech_emotion_recognition/blob/main/static/
img/speech_background.jpg?raw=true");
background-size: cover;
}}
</style>
""",
unsafe_allow_html=True
)
set_background()
# Streamlit app
st.title("Speech Emotion Recognition🎤")
st.write("""
Speech Emotion Recognition (SER) is a cutting-edge technology that aims to identify and classify
emotions from human speech.
By analyzing audio signals, SER systems can detect emotions such as happiness, sadness, anger, and
more.
This application leverages advanced machine learning models to process audio inputs and predict the
underlying emotional state,
providing valuable insights for various domains like customer service, healthcare, and entertainment.
""")
# File uploader
uploaded_file = st.file_uploader("Upload an audio file", type=["wav", "mp3"])
st.markdown(
"""
<style>
.balloon-effect {
position: absolute;
bottom: 0;
left: 50%;
transform: translateX(-50%);
width:50px;
height:50px;
background-color: #ffcc00;
border-radius: 50%;
animation: rise 3s ease-out forwards;
}
@keyframes rise {
0% { bottom: 0; opacity: 1; }
100% { bottom: 100%; opacity: 0; }
}
</style>
""",
unsafe_allow_html=True
)
if uploaded_file is not None:
    # Load the audio file
    data, sampling_rate = librosa.load(uploaded_file, sr=16000)
    # Convert the waveform into model inputs (processor and model are assumed to be loaded
    # earlier via Wav2Vec2Processor / Wav2Vec2ForSequenceClassification.from_pretrained)
    input_values = processor(data, sampling_rate=sampling_rate, return_tensors="pt").input_values
    # Perform inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
    predicted_class = logits.argmax(dim=-1).item()
    predicted_emotion = inverse_label_map.get(predicted_class, "Unknown")
    # Visualize the uploaded audio as a log-frequency spectrogram
    # (matplotlib.pyplot as plt and librosa.display are assumed to be imported at the top)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(data)), ref=np.max)
    fig, ax = plt.subplots()
    img = librosa.display.specshow(D, sr=sampling_rate, x_axis='time', y_axis='log', ax=ax,
                                   cmap='viridis')
    fig.colorbar(img, ax=ax, format="%+2.0f dB")
    ax.set_title("Spectrogram (Log Frequency)")
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Frequency (Hz)")
    st.pyplot(fig)
# Additional CSS for hover effects and a loading spinner
st.markdown(
    """
    <style>
.button:hover {
background-color: #0050b3;
color: white;
transition: background-color 0.3s ease;
}
.label:hover {
transform: scale(1.05);
transition: transform 0.3s ease;
}
.loading-animation {
animation: spin 1s linear infinite;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
</style>
""",
unsafe_allow_html=True
)
# accuracy and loss are assumed to be computed earlier in the script (e.g. during model evaluation)
fig, ax = plt.subplots()
ax.pie([accuracy, loss], labels=['Accuracy', 'Loss'], autopct='%1.1f%%', startangle=90,
colors=['#4CAF50', '#FF5722'])
ax.set_title("Model Performance")
col1, col2, col3 = st.columns([1, 2, 1,]) # Create three columns for centering
with col2:
st.pyplot(fig) # Display the pie chart in the center column with a smaller size
# metrics and values (e.g. accuracy, precision, recall, F1-score) are assumed to be defined earlier
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(metrics, values, color=['#FF5722', '#4CAF50', '#2196F3', '#FFC107'])
ax.set_ylim(0, 100) # Set y-axis limit to 100 for percentage values
ax.set_title("Model Performance Metrics")
ax.set_ylabel("Percentage / Score")
for i, v in enumerate(values):
ax.text(i, v + 2, f"{v:.2f}", ha='center', fontsize=10) # Add value labels above bars
st.pyplot(fig)
import os
from os.path import isdir, join
from pathlib import Path
import pandas as pd
# Math
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
import librosa
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
import librosa.display
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import pandas as pd
%matplotlib inline
# %% [markdown]
#
# # 1. Visualization
# <a id="visualization"></a>
#
# There are two theories of human hearing - place (
https://en.wikipedia.org/wiki/Place_theory_(hearing) (frequency-based) and temporal
(https://en.wikipedia.org/wiki/Temporal_theory_(hearing) )
# In speech recognition, I see two main tendencies - to input
[spectrogram](https://en.wikipedia.org/wiki/Spectrogram) (frequencies), and more sophisticated
features MFCC - Mel-Frequency Cepstral Coefficients, PLP. You rarely work with raw, temporal data.
#
# Let's visualize some recordings!
#
# ## 1.1. Wave and spectrogram:
# <a id="waveandspectrogram"></a>
#
# Choose and read some file:
# %%
train_audio_path = '../input/train/audio/'
filename = '/yes/0a7c2a8d_nohash_0.wav'
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
# %% [markdown]
# Define a function that calculates spectrogram.
#
# Note, that we are taking logarithm of spectrogram values. It will make our plot much more clear,
moreover, it is strictly connected to the way people hear.
# We need to assure that there are no 0 values as input to logarithm.
# %%
def log_specgram(audio, sample_rate, window_size=20,
step_size=10, eps=1e-10):
nperseg = int(round(window_size * sample_rate / 1e3))
noverlap = int(round(step_size * sample_rate / 1e3))
freqs, times, spec = signal.spectrogram(audio,
fs=sample_rate,
window='hann',
nperseg=nperseg,
noverlap=noverlap,
detrend=False)
return freqs, times, np.log(spec.T.astype(np.float32) + eps)
# %% [markdown]
# Frequencies are in range (0, 8000) according to [Nyquist
theorem](https://en.wikipedia.org/wiki/Nyquist_rate).
#
# Let's plot it:
# %%
freqs, times, spectrogram = log_specgram(samples, sample_rate)
# Create the figure (missing in the listing) before adding the spectrogram subplot
fig = plt.figure(figsize=(14, 8))
ax2 = fig.add_subplot(212)
ax2.imshow(spectrogram.T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.set_title('Spectrogram of ' + filename)
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
# %% [markdown]
# If we use spectrogram as an input features for NN, we have to remember to normalize features. (We
need to normalize over all the dataset, here's example just for one, which doesn't give good *mean*
and *std*!)
# %%
mean = np.mean(spectrogram, axis=0)
std = np.std(spectrogram, axis=0)
spectrogram = (spectrogram - mean) / std
# %% [markdown]
# There is an interesting fact to point out. We have ~160 features for each frame, frequencies are
between 0 and 8000. It means, that one feature corresponds to 50 Hz. However, [frequency resolution
of the ear is 3.6 Hz within the octave of 1000 – 2000
Hz](https://en.wikipedia.org/wiki/Psychoacoustics) It means, that people are far more precise and can
hear much smaller details than those represented by spectrograms like above.
# %% [markdown]
# ## 1.2. MFCC
# <a id="mfcc"></a>
#
# If you want to get to know some details about *MFCC* take a look at this great tutorial. [MFCC
explained](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-
cepstral-coefficients-mfccs/) You can see, that it is well prepared to imitate human hearing properties.
#
# You can calculate *Mel power spectrogram* and *MFCC* using for example *librosa* python
package.
#
# %%
# From this tutorial
# https://github.com/librosa/librosa/blob/master/examples/LibROSA%20demo.ipynb
S = librosa.feature.melspectrogram(samples, sr=sample_rate, n_mels=128)
# Convert to log scale (dB). We'll use the peak power (max) as reference.
log_S = librosa.power_to_db(S, ref=np.max)
plt.figure(figsize=(12, 4))
librosa.display.specshow(log_S, sr=sample_rate, x_axis='time', y_axis='mel')
plt.title('Mel power spectrogram ')
plt.colorbar(format='%+02.0f dB')
plt.tight_layout()
# %%
mfcc = librosa.feature.mfcc(S=log_S, n_mfcc=13)
# Second-order MFCC deltas (this line was missing in the listing but is used below)
delta2_mfcc = librosa.feature.delta(mfcc, order=2)
plt.figure(figsize=(12, 4))
librosa.display.specshow(delta2_mfcc)
plt.ylabel('MFCC coeffs')
plt.xlabel('Time')
plt.title('MFCC')
plt.colorbar()
plt.tight_layout()
# %% [markdown]
# In classical, but still state-of-the-art systems, *MFCC* or similar features are taken as the input to
the system instead of spectrograms.
#
# However, in end-to-end (often neural-network based) systems, the most common input features are
probably raw spectrograms, or mel power spectrograms. For example *MFCC* decorrelates features,
but NNs deal with correlated features well. Also, if you understand mel filters, you may consider
their usage sensible.
#
# It is your decision which to choose!
# %% [markdown]
# ## 1.3. Spectrogram in 3d
# <a id="3d"></a>
#
# By the way, times change, and the tools change. Have you ever seen spectrogram in 3d?
# %%
data = [go.Surface(z=spectrogram.T)]
layout = go.Layout(
title='Spectrogram of "yes" in 3d',
scene = dict(
yaxis = dict(title='Frequencies', range=freqs),
xaxis = dict(title='Time', range=times),
zaxis = dict(title='Log amplitude'),
),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
# %% [markdown]
# (Don't know how to set axis ranges to proper values yet. I'd also like it to be stretched like a classic
spectrogram above..)
# %% [markdown]
# ## 1.4. Silence removal
# <a id="silenceremoval"></a>
#
# Let's listen to that file
# %%
ipd.Audio(samples, rate=sample_rate)
# %% [markdown]
# I consider that some *VAD* (Voice Activity Detection) will be really useful here. Although the
words are short, there is a lot of silence in them. A decent *VAD* can reduce training size a lot,
accelerating training speed significantly.
# Let's cut a bit of the file from the beginning and from the end. and listen to it again (based on a plot
above, we take from 4000 to 13000):
# %%
samples_cut = samples[4000:13000]
ipd.Audio(samples_cut, rate=sample_rate)
# %% [markdown]
# We can agree that the entire word can be heard. It is impossible to cut all the files manually and do
this basing on the simple plot. But you can use for example *webrtcvad* package to have a good
*VAD*.
#
# Let's plot it again, together with guessed alignment of* 'y' 'e' 's'* graphems
# %%
freqs, times, spectrogram_cut = log_specgram(samples_cut, sample_rate)
# Create the figure (missing in the listing) before adding the spectrogram subplot
fig = plt.figure(figsize=(14, 8))
ax2 = fig.add_subplot(212)
ax2.set_title('Spectrogram of ' + filename)
ax2.set_ylabel('Frequencies * 0.1')
ax2.set_xlabel('Samples')
ax2.imshow(spectrogram_cut.T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.text(0.06, 1000, 'Y', fontsize=18)
ax2.text(0.17, 1000, 'E', fontsize=18)
ax2.text(0.36, 1000, 'S', fontsize=18)
# %% [markdown]
# ## 1.5. Resampling - dimensionality reduction
# <a id="resampl"></a>
#
# Another way to reduce the dimensionality of our data is to resample recordings.
#
# You can hear that the recording don't sound very natural, because they are sampled with 16k
frequency, and we usually hear much more. However, [the most speech related frequencies are
presented in smaller band](https://en.wikipedia.org/wiki/Voice_frequency). That's why you can still
understand another person talking to the telephone, where GSM signal is sampled to 8000 Hz.
#
# Summarizing, we could resample our dataset to 8k. We will discard some information that shouldn't
be important, and we'll reduce size of the data.
#
# We have to remember that it can be risky, because this is a competition, and sometimes very small
difference in performance wins, so we don't want to lose anything. On the other hand, first experiments
can be done much faster with smaller training size.
#
# We'll need to calculate FFT (Fast Fourier Transform). Definition:
#
# %%
def custom_fft(y, fs):
T = 1.0 / fs
N = y.shape[0]
yf = fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
vals = 2.0/N * np.abs(yf[0:N//2]) # FFT is symmetrical, so we take just the first half
# FFT values are also complex, so we take the magnitude (abs)
return xf, vals
# %% [markdown]
# Let's read some recording, resample it, and listen. We can also compare FFT, Notice, that there is
almost no information above 4000 Hz in original signal.
# %%
filename = '/happy/0b09edd3_nohash_0.wav'
new_sample_rate = 8000
# Read and resample the recording (these two lines were missing from the listing)
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
# %%
ipd.Audio(samples, rate=sample_rate)
# %%
ipd.Audio(resampled, rate=new_sample_rate)
# %% [markdown]
# Almost no difference!
# %%
xf, vals = custom_fft(samples, sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(sample_rate) + ' Hz')
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
# %%
xf, vals = custom_fft(resampled, new_sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(new_sample_rate) + ' Hz')
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
# %% [markdown]
# This is how we reduced dataset size twice!
# %% [markdown]
# ## 1.6. Features extraction steps
# <a id="featuresextractionsteps"></a>
#
# I would propose the feature extraction algorithm like that:
# 1. Resampling
# 2. *VAD*
# 3. Maybe padding with 0 to make signals be equal length
# 4. Log spectrogram (or *MFCC*, or *PLP*)
# 5. Features normalization with *mean* and *std*
# 6. Stacking of a given number of frames to get temporal information
#
# It's a pity it can't be done in notebook. It has not much sense to write things from zero, and
everything is ready to take, but in packages, that can not be imported in Kernels.
# %% [markdown]
#
# # 2. Dataset investigation
# <a id="investigations"></a>
#
# Some usual investigation of the dataset.
#
# ## 2.1. Number of records
# <a id="numberoffiles"></a>
#
# %%
dirs = [f for f in os.listdir(train_audio_path) if isdir(join(train_audio_path, f))]
dirs.sort()
print('Number of labels: ' + str(len(dirs)))
# %%
# Calculate
number_of_recordings = []
for direct in dirs:
waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
number_of_recordings.append(len(waves))
# Plot
data = [go.Histogram(x=dirs, y=number_of_recordings)]
trace = go.Bar(
x=dirs,
y=number_of_recordings,
marker=dict(color = number_of_recordings, colorscale='Viridis', showscale=True
),
)
layout = go.Layout(
title='Number of recordings in given label',
xaxis = dict(title='Words'),
yaxis = dict(title='Number of recordings')
)
py.iplot(go.Figure(data=[trace], layout=layout))
# %% [markdown]
# Dataset is balanced except for background_noise, but that's a different thing.
# %% [markdown]
# ## 2.2. Deeper into recordings
# <a id="deeper"></a>
# %% [markdown]
# There's a very important fact. Recordings come from very different sources. As far as I can tell, some
of them can come from mobile GSM channel.
#
# Nevertheless,** it is extremely important to split the dataset in a way that one speaker doesn't occur
in both train and test sets.**
# Just take a look and listen to these two examples:
# %%
filenames = ['on/004ae714_nohash_0.wav', 'on/0137b3f4_nohash_0.wav']
for filename in filenames:
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
xf, vals = custom_fft(samples, sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of speaker ' + filename[4:11])
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
# %% [markdown]
# Even better to listen:
# %%
print('Speaker ' + filenames[0][4:11])
ipd.Audio(join(train_audio_path, filenames[0]))
# %%
print('Speaker ' + filenames[1][4:11])
ipd.Audio(join(train_audio_path, filenames[1]))
# %% [markdown]
# There are also recordings with some weird silence (some compression?):
#
# %%
filename = '/yes/01bb6a2a_nohash_1.wav'
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
freqs, times, spectrogram = log_specgram(samples, sample_rate)
plt.figure(figsize=(10, 7))
plt.title('Spectrogram of ' + filename)
plt.ylabel('Freqs')
plt.xlabel('Time')
plt.imshow(spectrogram.T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
plt.yticks(freqs[::16])
plt.xticks(times[::16])
plt.show()
# %% [markdown]
# It means, that we have to prevent overfitting to the very specific acoustical environments.
#
# %% [markdown]
# ## 2.3. Recordings length
# <a id="len"></a>
#
# Find if all the files have 1 second duration:
# %%
num_of_shorter = 0
for direct in dirs:
waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
for wav in waves:
sample_rate, samples = wavfile.read(train_audio_path + direct + '/' + wav)
if samples.shape[0] < sample_rate:
num_of_shorter += 1
print('Number of recordings shorter than 1 second: ' + str(num_of_shorter))
# %% [markdown]
# That's surprising, and there are a lot of them. We can pad them with zeros.
# %% [markdown]
# ## 2.4. Mean spectrograms and FFT
# <a id="meanspectrogramsandfft"></a>
# %% [markdown]
# Let's plot mean FFT for every word
# %%
to_keep = 'yes no up down left right on off stop go'.split()
dirs = [d for d in dirs if d in to_keep]
print(dirs)
# vals_all and spec_all are assumed to be accumulated per word in a preceding loop (omitted here)
plt.figure(figsize=(14, 4))
plt.subplot(121)
plt.title('Mean fft of ' + direct)
plt.plot(np.mean(np.array(vals_all), axis=0))
plt.grid()
plt.subplot(122)
plt.title('Mean specgram of ' + direct)
plt.imshow(np.mean(np.array(spec_all), axis=0).T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
plt.yticks(freqs[::16])
plt.xticks(times[::16])
plt.show()
# %% [markdown]
# ## 2.5. Gaussian Mixtures modeling
# <a id="gmms"></a>
#
# We can see that mean FFT looks different for every word. We could model each FFT with a mixture
of Gaussian distributions. Some of them however, look almost identical on FFT, like *stop* and *up*...
But wait, they are still distinguishable when we look at spectrograms! High frequencies are earlier than
low at the beginning of *stop* (probably *s*).
#
# That's why temporal component is also necessary. There is a [Kaldi](http://kaldi-asr.org/) library, that
can model words (or smaller parts of words) with GMMs and model temporal dependencies with
[Hidden Markov Models](https://github.com/danijel3/ASRDemos/blob/master/notebooks/
HMM_FST.ipynb).
#
# We could use simple GMMs for words to check what can we model and how hard it is to distinguish
the words. We can use [Scikit-learn](http://scikit-learn.org/) for that, however it is not straightforward
and lasts very long here, so I abandon this idea for now.
# %% [markdown]
# ## 2.6. Frequency components across the words
# <a id="components"></a>
#
# %%
def violinplot_frequency(dirs, freq_ind):
""" Plot violinplots for given words (waves in dirs) and frequency freq_ind
from all frequencies freqs."""
plt.figure(figsize=(13,7))
plt.title('Frequency ' + str(freqs[freq_ind]) + ' Hz')
plt.ylabel('Amount of frequency in a word')
plt.xlabel('Words')
sns.violinplot(data=pd.DataFrame(spec_all.T, columns=dirs))
plt.show()
# %%
violinplot_frequency(dirs, 20)
# %%
violinplot_frequency(dirs, 50)
# %%
violinplot_frequency(dirs, 120)
# %% [markdown]
# ## 2.7. Anomaly detection
# <a id="anomaly"></a>
#
# We should check if there are any recordings that somehow stand out from the rest. We can lower the dimensionality of the dataset and interactively check for any anomaly.
# We'll use PCA for dimensionality reduction:
# %%
fft_all = []
names = []
for direct in dirs:
    waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
    for wav in waves:
        sample_rate, samples = wavfile.read(train_audio_path + direct + '/' + wav)
        if samples.shape[0] != sample_rate:
            samples = np.append(samples, np.zeros((sample_rate - samples.shape[0], )))
        x, val = custom_fft(samples, sample_rate)
        fft_all.append(val)
        names.append(direct + '/' + wav)
fft_all = np.array(fft_all)

# Normalization
fft_all = (fft_all - np.mean(fft_all, axis=0)) / np.std(fft_all, axis=0)

# Dim reduction
pca = PCA(n_components=3)
fft_all = pca.fit_transform(fft_all)

interactive_3d_plot(fft_all, names)
# %% [markdown]
# Notice that there are *yes/e4b02540_nohash_0.wav*, *go/0487ba9b_nohash_0.wav* and more points that lie far away from the rest. Let's listen to them.
# %%
print('Recording go/0487ba9b_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'go/0487ba9b_nohash_0.wav'))
# %%
print('Recording yes/e4b02540_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'yes/e4b02540_nohash_0.wav'))
# %% [markdown]
# If you will look for anomalies for individual words, you can find for example this file for *seven*:
# %%
print('Recording seven/b1114e4f_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'seven/b1114e4f_nohash_0.wav'))
# %%
import pandas as pd
import numpy as np
import os
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
# %% [markdown]
# # Data Import
# %%
Ravdess = r"C:\Users\kisho\Downloads\ravdess\audio_speech_actors_01-24"
Crema = r"C:\Users\kisho\Downloads\crema"
Tess = r"C:\Users\kisho\Downloads\archive\TESS Toronto emotional speech set data"
Savee = r"C:\Users\kisho\Downloads\save\ALL"
# %%
# RAVDESS Dataset
ravdess_directory_list = os.listdir(Ravdess)
file_emotion = []
file_path = []
for i in ravdess_directory_list:
    # as there are 24 different actor directories, we need to extract the files for each actor
    actor = os.listdir(os.path.join(Ravdess, i))
    for f in actor:
        part = f.split('.')[0].split('-')
        # the third part of each file name encodes the emotion associated with that file
        file_emotion.append(int(part[2]))
        file_path.append(os.path.join(Ravdess, i, f))
print(actor[0])
print(part[0])
print(file_path[0])
print(int(part[2]))
print(f)
# %%
# dataframe for the emotions of the files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
# %%
# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Ravdess_df = pd.concat([emotion_df, path_df], axis=1)
# changing integers to actual emotions.
Ravdess_df.Emotions.replace({1: 'neutral', 2: 'neutral', 3: 'happy', 4: 'sad', 5: 'angry',
                             6: 'fear', 7: 'disgust', 8: 'surprise'}, inplace=True)
print(Ravdess_df.head())
print("______________________________________________")
print(Ravdess_df.tail())
print("_______________________________________________")
print(Ravdess_df.Emotions.value_counts())
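# %% [markdown]
# The CREMA-D parsing cell appears to be missing from this listing, although `Crema_df` is used
# when the datasets are combined below. A minimal sketch, assuming the usual CREMA-D file
# naming (e.g. '1001_DFA_ANG_XX.wav', with the emotion code in the third field):
# %%
crema_emotion_map = {'SAD': 'sad', 'ANG': 'angry', 'DIS': 'disgust',
                     'FEA': 'fear', 'HAP': 'happy', 'NEU': 'neutral'}
file_emotion = []
file_path = []
for file in os.listdir(Crema):
    if not file.lower().endswith('.wav'):
        continue
    code = file.split('_')[2]
    file_emotion.append(crema_emotion_map.get(code, 'unknown'))
    file_path.append(os.path.join(Crema, file))
Crema_df = pd.concat([pd.DataFrame(file_emotion, columns=['Emotions']),
                      pd.DataFrame(file_path, columns=['Path'])], axis=1)
print(Crema_df.Emotions.value_counts())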
# %%
# TESS Dataset
# (reconstructed loop: the outer iteration over the TESS sub-directories was missing; the emotion
# label is assumed to be the last part of each directory name, e.g. 'OAF_angry')
file_emotion = []
file_path = []
for directory in os.listdir(Tess):
    dir_path = os.path.join(Tess, directory)
    if not os.path.isdir(dir_path):
        continue
    emotion = directory.split('_')[-1].lower()
    if emotion == 'ps':
        emotion = 'surprise'  # in case 'pleasant surprise' is abbreviated as 'ps' in some copies
    for file in os.listdir(dir_path):
        if not file.lower().endswith('.wav'):
            continue  # Skip non-audio files
        file_emotion.append(emotion)
        file_path.append(os.path.join(dir_path, file))
# Create DataFrame
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
path_df = pd.DataFrame(file_path, columns=['Path'])
Tess_df = pd.concat([emotion_df, path_df], axis=1)
# Display results
print(Tess_df.head())
print(Tess_df.Emotions.value_counts())
# %%
# SAVEE Dataset
savee_directory_list = os.listdir(Savee)
file_emotion = []
file_path = []
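# The rest of this cell appears to be missing, although `Savee_df` is used when the datasets are
# combined below. A minimal sketch follows, assuming the usual SAVEE naming in the ALL folder
# (e.g. 'DC_a01.wav', where the letters before the digits encode the emotion).
savee_emotion_map = {'a': 'angry', 'd': 'disgust', 'f': 'fear', 'h': 'happy',
                     'n': 'neutral', 'sa': 'sad', 'su': 'surprise'}
for file in savee_directory_list:
    if not file.lower().endswith('.wav'):
        continue
    code = ''.join(ch for ch in file.split('_')[-1].split('.')[0] if ch.isalpha())
    file_emotion.append(savee_emotion_map.get(code, 'unknown'))
    file_path.append(os.path.join(Savee, file))
Savee_df = pd.concat([pd.DataFrame(file_emotion, columns=['Emotions']),
                      pd.DataFrame(file_path, columns=['Path'])], axis=1)
print(Savee_df.Emotions.value_counts())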
# %%
# creating a single dataframe from the 4 dataframes we created so far.
data_path = pd.concat([Ravdess_df, Crema_df, Tess_df, Savee_df], axis = 0)
data_path.head()
# %%
len(data_path)
# %%
print(data_path.Emotions.value_counts())
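# %% [markdown]
# A quick look at the class balance across the combined datasets (a small illustrative sketch;
# it imports seaborn and matplotlib directly so the cell is self-contained):
# %%
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
sns.countplot(x='Emotions', data=data_path)
plt.title('Number of samples per emotion')
plt.show()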
# %%
file_path = data_path['Path'].iloc[0]
# loading the audio file using librosa
import librosa
import librosa.display
import matplotlib.pyplot as plt
data, sr = librosa.load(file_path)  # file_path is already a single path string
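# %% [markdown]
# As a quick sanity check we can visualise the loaded clip and the shape of its MFCC features
# (a minimal sketch; `librosa.display.waveshow` requires librosa >= 0.9, older versions use `waveplot`):
# %%
plt.figure(figsize=(12, 3))
librosa.display.waveshow(data, sr=sr)
plt.title('Waveform of the first sample (' + data_path['Emotions'].iloc[0] + ')')
plt.show()
mfcc = librosa.feature.mfcc(y=data, sr=sr, n_mfcc=40)
print('MFCC feature shape:', mfcc.shape)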
CHAPTER VI
SYSTEM TESTING
6.1 Introduction
The software that has been developed has to be tested for its validity. Testing is often considered the least creative phase of the whole system design cycle, but in reality it is the phase that brings out and validates the creativity of the other phases. Testing is one of the most important activities in software development. In the software development life cycle, the main aim of the testing process is quality: the developed software is tested to confirm that it attains the required functionality and performance. During the testing process the software is exercised with particular test cases, and the outputs of those test cases are analysed to determine whether the software behaves according to expectations. The success of the testing process in detecting errors depends largely on the test case criteria; to test any software we need a description of the expected behaviour of the system and a method of determining whether the observed behaviour conforms to the expected behaviour.
The test process is initiated by developing a comprehensive plan to test the general functionality and special features on a variety of platform combinations. Strict quality control procedures are used. The process verifies that the application meets the requirements specified in the system requirements document and is bug free. The following considerations were used to develop the framework for the testing methodology.
6.4 Types of Testing
Since errors in the software can be introduced at any stage, we have to carry out the
testing process at different levels during development. The basic levels of testing are:
1. Unit Testing.
2. Functional testing.
3. System Testing.
4. Performance testing.
5. Integration Testing.
6. Acceptance Testing.
6.4.1 Unit Testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flows should be validated. It is the testing of individual software units of the application, and it is done after the completion of an individual unit before integration. This is structural testing: it relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
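For example, a unit test in this project might check that a feature-extraction helper always returns a vector of the expected length. The sketch below is illustrative only; the extract_mfcc helper shown here is a hypothetical name rather than part of the actual codebase.

import numpy as np
import librosa

def extract_mfcc(samples, sr, n_mfcc=40):
    """Return the mean MFCC vector for one audio clip."""
    return np.mean(librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc), axis=1)

def test_extract_mfcc_shape():
    # one second of silence should still yield a fixed-length feature vector
    samples = np.zeros(22050, dtype=np.float32)
    assert extract_mfcc(samples, sr=22050).shape == (40,)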
6.4.2 Functional Testing
Functional tests provide systematic demonstrations that the functions tested are available as specified by the business and technical requirements, system documentation, and user manuals. Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
6.4.3 System Testing
System testing ensures that the entire integrated software system meets requirements. It tests a configuration to ensure known and predictable results. An example of system testing is the configuration-oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-driven process links and integration points.
6.4.4 Performance Testing
The performance test ensures that the output is produced within the required time limits, and measures the time taken by the system for compiling, responding to users, and retrieving the results of requests sent to the system.
6.4.5 Integration Testing
Software integration testing is the incremental integration testing of two or more integrated software components on a single platform, intended to expose failures caused by interface defects. The task of the integration test is to check that components or software applications, for example components in a software system or, one step up, software applications at the company level, interact without error. The following are the types of integration testing:
1. Top-down integration
2. Bottom-up integration
Bottom-Up Integration
Bottom-Up Integration testing is a methodology in which the modules are tested from the bottom of the control flow upwards; it is the opposite of top-down integration testing. In this approach the modules at the lower levels are tested first and then the higher-level modules are tested. Because the lower-level components, which include the more complex modules, are exercised first, any errors in the complex modules are found and resolved at an early stage.
6.4.6 Acceptance Testing
Once the application is ready to be released, the crucial step is user acceptance testing. In this step a group representing a cross-section of end users tests the application. The user acceptance testing is done using real-world scenarios and perceptions relevant to the end users. User acceptance testing is often the final step before rolling out the application. Usually, the end users who will be using the application test it before accepting it. This type of testing gives the end users confidence that the application being delivered to them meets their requirements.
Any project can be divided into units on which detailed processing can be performed, and a testing strategy for each of these units is then carried out. Testing helps to identify possible bugs in the individual components, so that the components containing bugs can be identified and rectified.
Quality assurance defines the objectives of a project and reviews the overall activities, so that errors are corrected early in the development process.
Testing
In system testing the common view is to eliminate program errors. This is extremely difficult and time consuming, since designers cannot prove 100% accuracy. A successful test, then, is one that finds errors.
Validation
Software validation checks that the software product satisfies or fits the intended use, that is, that the software meets the user requirements, not merely as specification artefacts or as the needs of those who will operate the software, but as the needs of all the stakeholders. There are two ways to perform software validation, internal and external. During internal software validation it is assumed that the goals of the stakeholders were correctly understood and that they were expressed in the requirement artefacts precisely and comprehensively. If the software meets the requirement specification, it has been internally validated. External validation happens when it is performed by asking the stakeholders whether the software meets their needs. Different software development methodologies call for different levels of user and stakeholder involvement and feedback, so external validation occurs when all the stakeholders accept the software product and express that it satisfies their needs. Such final external validation requires the use of an acceptance test, which is a dynamic test.
Risk is an expectation of loss, a potential problem that may or may not occur in the future. It is generally caused by a lack of information, control, or time. The possibility of suffering a loss in the software development process is called a software risk, and the aim is to reduce the probability or likelihood of such risks.
Risk monitoring
Software quality assurance consists of a variety of tasks associated with seven major activities:
Conduct of formal technical reviews
Software testing
Enforcement of standards
Control of change
Measurement
Record keeping and reporting
CHAPTER VII
CONCLUSION
AND
FUTURE
ENHANCEMENT
Conclusion
The Speech Emotion Recognition (SER) project leverages machine learning
algorithms to accurately detect and classify human emotions from speech signals. By
utilizing diverse datasets containing audio features such as pitch, energy, MFCCs, and
prosodic elements, machine learning techniques like support vector machines, random
forests, and deep neural networks are employed to interpret emotional states under
various speaking conditions. These predictive models can enhance human-computer
interaction, improve user experiences in virtual assistants, support mental health
monitoring, and optimize services in sectors like education, healthcare, and customer
support. Ultimately, this project has the potential to advance emotionally intelligent
systems, promote empathetic technology, and support real-time, emotion-aware
applications by providing more accurate and dynamic emotion recognition from speech.
Future Enhancement
o Multimodal Emotion Recognition: Future research can explore the fusion of speech with other modalities such as facial expressions, body language, physiological signals (e.g., heart rate), and textual sentiment to improve recognition accuracy and contextual understanding.
o Hybrid Model Architectures: Combining CNNs for feature extraction with LSTM or GRU models for temporal analysis could yield better performance, especially in noisy environments.
o Larger and More Diverse Datasets: Incorporating larger, real-world datasets with a wider variety of speakers, accents, and environments can improve the generalizability of the models.
o Transfer Learning: Transfer learning techniques may reduce training time and improve accuracy in emotion recognition, for example by applying models trained in one domain (e.g., movies or call recordings) to another.
o Cross-Lingual and Cross-Cultural Recognition: Developing systems that can accurately recognize emotions across different languages, accents, and cultures will enhance global applicability and robustness.
o Robustness to Background Noise: Improve the ability of SER systems to function effectively in real-world conditions with background noise, such as public spaces, phone calls, or crowded environments.
o Richer Modelling with Less Labeled Data: Explore mechanisms that capture complex patterns in emotional speech and reduce dependency on labeled data.
o Personalized Emotion Models: Design adaptive systems that learn from individual users' emotional patterns over time.
o Real-Time and Edge Deployment: Optimize systems for real-time analysis and low-latency deployment on edge devices.
o Emotion Intensity and Context: Enhance systems to not only classify emotions but also detect the intensity and context of the expressed emotion, moving beyond basic emotion categories (happy, sad, angry, etc.) towards more complex emotional states.
CHAPTER
VIII
APPENDICES
APPENDIX I
A1.1 Screenshot
Fig A 1.3 Selecting Audio
Fig A 1.5 Accuracy and training loss shown in a bar chart and pie chart
8.1 Tech Stack
Windows / Linux development environment
CHAPTER IX
REFERENCES
AND
BIBLIOGRAPHY
A2.1 References
[1] Khalil, Ruhul Amin, Edward Jones, Mohammad Inayatullah Babar, Tariqullah Jan,
Mohammad Haseeb Zafar, and Thamer Alhussain. "Speech emotion recognition using deep learning
techniques: A review." IEEE access 7 (2019): 117327-117345.
[2] Abbaschian, Babak Joze, Daniel Sierra-Sosa, and Adel Elmaghraby. "Deep learning
techniques for speech emotion recognition, from databases to models." Sensors 21, no.
4 (2021): 1249.
[3] Pandey, Sandeep Kumar, Hanumant Singh Shekhawat, and SR Mahadeva Prasanna. "Deep
learning techniques for speech emotion recognition: A review." In 2019 29th international
conference RADIOELEKTRONIKA (RADIOELEKTRONIKA), pp. 1-
6. IEEE, 2019.
[4] Han, Kun, Dong Yu, and Ivan Tashev. "Speech emotion recognition using deep neural network and
extreme learning machine." In Interspeech 2014. 2014.
[5] Tzirakis, Panagiotis, Jiehao Zhang, and Bjorn W. Schuller. "End-to-end speech emotion
recognition using deep neural networks." In 2018 IEEE international conference on acoustics,
speech and signal processing (ICASSP), pp. 5089-5093. IEEE, 2018.
[6] Jahangir, Rashid, Ying Wah Teh, Faiqa Hanif, and Ghulam Mujtaba. "Deep learning approaches
for speech emotion recognition: State of the art and research challenges." Multimedia Tools and
Applications 80, no. 16 (2021): 23745-23812.
[7] Lim, Wootaek, Daeyoung Jang, and Taejin Lee. "Speech emotion recognition using convolutional and
recurrent neural networks." In 2016 Asia-Pacific signal and information processing association
annual summit and conference (APSIPA), pp. 1-4. IEEE, 2016.
[8] Aggarwal, Apeksha, Akshat Srivastava, Ajay Agarwal, Nidhi Chahal, Dilbag Singh, Abeer Ali
Alnuaim, Aseel Alhadlaq, and Heung-No Lee. "Two-way feature extraction for speech emotion
recognition using deep learning." Sensors 22, no. 6 (2022): 2378.
[9] Lalitha, S., Shikha Tripathi, and Deepa Gupta. "Enhanced speech emotion detection using deep
neural networks." International Journal of Speech Technology 22 (2019): 497-510.
[10] Tarunika, K., Pradeeba, R.B. and Aruna, P., 2018, July. Applying machine learning techniques for
speech emotion recognition. In 2018 9th international conference on computing, communication and
networking technologies (ICCCNT) (pp. 1-5). IEEE.
[11] Zhou, Xi, Junqi Guo, and Rongfang Bie. "Deep learning based affective model for speech
emotion recognition." In 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing,
Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data
Computing, Internet of People, and Smart World Congress
(UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), pp. 841-846. IEEE, 2016.
[12] Suganya, S. and Charles, E.Y.A., 2019, September. Speech emotion recognition using deep learning
on audio recordings. In 2019 19th International Conference on Advances in ICT for Emerging
Regions (ICTer) (Vol. 250, pp. 1-6). IEEE.
[13] Asiya, U. A., & Kiran, V. K. (2021, November). Speech emotion recognition-a deep learning
approach. In 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and
Cloud) (I-SMAC) (pp. 867-871). IEEE.
[14] Yadav, Satya Prakash, Subiya Zaidi, Annu Mishra, and Vibhash Yadav. "Survey on machine
learning in speech emotion recognition and vision systems using a recurrent neural network (RNN)."
Archives of
A2.2 Bibliography
[1] https://youtu.be/z_dbnYHAQYg?feature=shared
[2] https://medium.com/analytics-vidhya/haar-cascades-explained-38210e57970d
[3] https://flask.palletsprojects.com/
[4] https://www.sqlite.org/
[5] https://youtu.be/-VQL8ynOdVg?si=YWs1xvp9vMURJPa3