INTRODUCTION
INTRODUCTION
1.1 Artificial Intelligence (AI)
Artificial Intelligence (AI) plays a vital role in Speech Emotion Recognition (SER),
enabling machines to understand human emotions through vocal expressions. A
commonly used approach involves extracting features from speech using Mel-Frequency
Cepstral Coefficients (MFCC), and then processing these features through deep learning
models such as Convolutional Neural Networks (CNN), Recurrent Neural Networks
(RNN), and Long Short-Term Memory (LSTM) networks.
MFCC is a powerful feature extraction technique that
captures the short-term power spectrum of sound, emulating the way humans perceive
audio. It transforms raw audio signals into a compact and informative representation,
typically a 2D array, which serves as the input to AI models. Once MFCC features are
extracted, CNNs are often used to detect spatial patterns in the data. Treating the MFCC
output like an image, CNN layers identify local acoustic patterns that correlate with
different emotions.
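To make this concrete, the short sketch below (the file path is hypothetical) shows how such an MFCC matrix can be extracted with librosa, the library used later in this report:

import librosa
import numpy as np

# Hypothetical path to one utterance from an emotional speech dataset
audio_path = "speech_sample.wav"

# Load the waveform at 16 kHz mono and compute 40 MFCC coefficients per frame
y, sr = librosa.load(audio_path, sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

print(mfcc.shape)  # (40, n_frames): the 2D array that is fed to the CNN/LSTM models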
Following this, the data is passed through RNN or LSTM layers to learn the
temporal dynamics of the speech. RNNs are designed to handle sequential data, making
them suitable for capturing the time-dependent nature of speech. However, traditional
RNNs can struggle with long sequences, which is why LSTMs are preferred—they can
retain important information over longer periods, enabling the model to understand
emotional cues that unfold over time.
The combination of MFCC for feature extraction, CNN for spatial learning, and LSTM for
sequential modeling creates a robust system capable of accurately classifying emotions like
happiness, sadness, anger, fear, and neutrality from speech. This approach is widely used in
applications such as virtual assistants, customer service bots, mental health monitoring, and
adaptive learning systems, where understanding human emotion enhances user experience
and system responsiveness
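As an illustration only, a minimal PyTorch sketch of such a CNN-LSTM hybrid, assuming 40 MFCC coefficients per frame, might look like this (layer sizes are illustrative and not the exact architecture used in this project):

import torch
import torch.nn as nn

class CNNLSTMEmotionNet(nn.Module):
    """Illustrative CNN + LSTM classifier over MFCC frames shaped (batch, 1, 40, time)."""
    def __init__(self, n_emotions=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # halves both the MFCC and time axes
        )
        self.lstm = nn.LSTM(input_size=16 * 20, hidden_size=64, batch_first=True)
        self.fc = nn.Linear(64, n_emotions)

    def forward(self, x):
        x = self.cnn(x)                           # (batch, 16, 20, time/2)
        x = x.permute(0, 3, 1, 2).flatten(2)      # (batch, time/2, 16*20)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                # classify from the last time step

model = CNNLSTMEmotionNet()
logits = model(torch.randn(8, 1, 40, 200))        # 8 clips, 40 MFCCs, 200 frames
print(logits.shape)                               # torch.Size([8, 5])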
Fig 1.1 Components of AI
AI-based predictive models for speech emotion recognition primarily rely on machine learning
and deep learning algorithms, including the following (a brief scikit-learn sketch follows this list):
SVM (Support Vector Machine) – Works well with high-dimensional data like MFCCs and is
effective for binary and multi-class classification.
Random Forest– Random Forest is an ensemble learning algorithm based on decision trees. In
Speech Emotion Recognition, it is commonly used for classifying speech into emotions based on
extracted features.
KNN (K-Nearest Neighbors)– K-Nearest Neighbors (KNN) is a simple, instance-based
learning algorithm that classifies an emotion based on the majority vote of its K nearest
neighbors in the feature space.
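A minimal scikit-learn sketch of these classical baselines, assuming X holds one averaged MFCC vector per clip and y the emotion labels (random placeholder data is used here), could be:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# X: one averaged 40-dim MFCC vector per clip, y: integer emotion labels (placeholder data)
X, y = np.random.randn(200, 40), np.random.randint(0, 5, 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for name, clf in [("SVM", SVC(kernel="rbf")),
                  ("Random Forest", RandomForestClassifier(n_estimators=100)),
                  ("KNN", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))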
1.1.1 Components of AI
1. Data Collection & Preprocessing
Data Sources:
TESS (Toronto Emotional Speech Set).
CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset).
Data Preprocessing:
Removes background noise and silent parts to focus only on emotional speech content.
Resampling and amplitude normalization so that all clips share a consistent format.
Noise reduction using statistical and signal-processing techniques (a brief sketch follows).
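A brief librosa sketch of these preprocessing steps, using a hypothetical clip path, might look like:

import librosa
import numpy as np

# Illustrative preprocessing of one TESS/CREMA-D clip (the path is hypothetical)
y, sr = librosa.load("sample_clip.wav", sr=16000)         # resample to 16 kHz mono
y_trimmed, _ = librosa.effects.trim(y, top_db=25)         # drop leading/trailing silence
y_norm = y_trimmed / (np.max(np.abs(y_trimmed)) + 1e-9)   # peak-normalize the amplitude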
2. Machine Learning (ML)
Traditional ML Models:
• Decision Trees – Easy to interpret.
• Random Forests – Often combined with Hidden Markov Models (HMMs) to model temporal dynamics.
• Logistic Regression
• Naive Bayes – Simple and fast but sometimes too simplistic for complex speech features.
Typical applications include: detecting customer frustration, anger, or satisfaction in real time;
routing calls to human agents when customers are upset; analyzing customer service quality via
emotional trends; responding empathetically when the user sounds stressed or upset to improve
engagement and user experience; detecting signs of depression, anxiety, or mood disorders through
speech; passive monitoring of at-risk individuals (e.g., elderly people, PTSD patients); and
supporting digital therapeutics and telehealth platforms.
In supervised learning, the algorithm learns from labeled data, where each
example consists of input features and corresponding labels.
Unsupervised Learning
Semi-Supervised Learning
Fig 1.2.1 Types of Machine Learning
1.3 Generative AI
1.3.1 Applications of Generative AI
CHAPTER II
FEASIBILITY STUDY
FEASIBILITY STUDY
The existing systems for Speech Emotion Recognition (SER) primarily
rely on machine learning and deep learning models trained on labeled speech datasets.
These systems typically use acoustic features such as Mel-Frequency Cepstral
Coefficients (MFCCs), pitch, intensity, and spectral features to identify emotional
states. Traditional systems often employ classifiers like Support Vector Machines
(SVM), K-Nearest Neighbors (KNN), and Hidden Markov Models (HMM). More
recent advancements incorporate deep learning techniques using Convolutional Neural
Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term
Memory (LSTM) networks to improve accuracy and handle complex temporal patterns
in speech. Despite significant progress, existing systems still face challenges in
recognizing emotions accurately in noisy environments, across different languages, and
with diverse speaking styles. Most current SER applications are integrated into call
centers, virtual assistants, and emotion-aware healthcare tools, but their performance
often depends on the quality and diversity of the training data.
Limitations
• Speaker Dependency: SER systems often perform well only on known speakers, i.e., they
perform better when the speaker's voice characteristics are already part of the training data.
• Language and Accent Variability: Speech emotion patterns can vary significantly
across different languages and accents.
• Background Noise: Real-world environments often contain background noise, such
as traffic, people talking, or mechanical sounds.
• Emotion Ambiguity: Emotions are often not clearly expressed or are blended, making
them difficult to categorize.
• Computational Complexity: Advanced AI-based models (e.g., deep learning, LSTMs)
require high computational resources, making them expensive to deploy and maintain.
• Lack of Generalization: Models trained on speech from one language, accent, or recording
environment may not perform well in another.
• Integration Challenges: Emotion recognition models must integrate with telephony, IoT,
and application platforms, but compatibility issues often arise due to differences in protocols
and data formats.
2.2 Proposed system
The proposed system for Speech Emotion Recognition (SER) leverages a
hybrid deep learning architecture combining LSTM, CNN, RNN, and the advanced Wav2Vec
2.0 Transformer model to achieve high accuracy in emotion detection from speech. This
integrated approach is designed to overcome the limitations of traditional models by capturing
both spatial and temporal features of audio signals in a robust and efficient manner. Initially,
raw audio input is preprocessed and features such as spectrograms or MFCCs are extracted.
These features are first passed through a Convolutional Neural Network (CNN) to learn
spatial hierarchies in the acoustic data, effectively capturing local patterns and energy
distribution across time and frequency. The output from the CNN is then fed into a stack of
Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) layers, which are
adept at modeling temporal dependencies and sequence dynamics, crucial for understanding
the progression of emotions in speech over time. To further enhance the performance and
contextual understanding, the system integrates Wav2Vec 2.0, a self-supervised Transformer-
based model pretrained on a large corpus of unlabeled speech data. Wav2Vec 2.0 learns
contextualized audio representations directly from the waveform, significantly reducing the
reliance on handcrafted features and boosting the system’s ability to generalize across
different speakers and acoustic conditions. By combining these powerful architectures, the
proposed system ensures superior performance in real-time emotion recognition tasks and
demonstrates strong generalization across various datasets and languages. This architecture is
highly suitable for deployment in interactive voice systems, healthcare diagnostics, virtual
assistants, and customer service applications, providing emotionally intelligent responses and
improved human-computer interaction.
Limitations
• Variability in Speech: Differences in accent, pitch, tone, and speaking style can
degrade the accuracy of emotion recognition
• Noise Sensitivity: Background noise or poor audio quality can interfere with feature
extraction, especially in real-world environments.
• Subjectivity of Emotions: Emotions can be interpreted differently across individuals and
cultures, making accurate classification difficult.
In conclusion, the integration of advanced deep learning models such as LSTM,
CNN, RNN, and Wav2Vec 2.0 has significantly enhanced the accuracy and effectiveness of
Speech Emotion Recognition systems. These models enable the system to capture complex
acoustic and temporal patterns, leading to more nuanced and context-aware emotion
classification. While the proposed approach demonstrates strong potential for real-world
applications in healthcare, customer service, and human-computer interaction, it is important
to acknowledge and address existing challenges such as variability in speech, data
limitations, computational demands, and ethical considerations. Continued research and
innovation in model optimization, dataset diversity, and privacy-preserving techniques will
be crucial to building more robust, inclusive, and responsible SER systems in the future.
2.2.1 HARDWARE AND SOFTWARE REQUIREMENTS
1. Hardware Requirements
• Processor: Intel Core i7/i9, AMD Ryzen 7/9, or higher (for training ML models)
• GPU (Optional but Recommended): NVIDIA RTX 3060/RTX 4090 or
equivalent for deep learning acceleration
2. Software Requirements
• Operating System: Windows 10/11, Linux (Ubuntu), or macOS
CHAPTER III
LITERATURE SURVEY
LITERATURE SURVEY
3.1 P. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-End Speech Emotion Recognition
Using Deep Neural Networks,” IEEE Journal of Selected Topics in Signal Processing,
2018, Volume 11, Issue 8, Pages 1301–1309, ISSN: 1932-4553
Abstract
Speech Emotion Recognition (SER) is a rapidly evolving field that aims to detect
human emotions from speech signals, enabling more natural and intelligent human-computer
interaction. This project proposes a hybrid deep learning approach that combines Convolutional
Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory
(LSTM), and the Wav2Vec 2.0 Transformer model to improve the accuracy and robustness of
emotion classification. The system leverages CNN for spatial feature extraction, LSTM and
RNN for modeling temporal dependencies, and Wav2Vec 2.0 for learning contextualized
speech
Description
Preprocessing: All audio files are resampled to a consistent sampling rate (commonly 16 kHz or
44.1 kHz) to ensure uniformity across datasets.
Model Training & Optimization: Training the Deep LSTM model using a regression-
based dataset with polynomial extrapolation.
• Performance Evaluation: Comparing the Root Mean Square Error (RMSE) of different
models, including LSTM, Bi-LSTM, and Deep LSTM. Results indicate that Deep LSTM
achieves better accuracy compared to traditional methods. The study suggests that
incorporating hybrid approaches like CNN-LSTM could further improve performance.
Limitations
• Dependence on Large Datasets: The Deep LSTM model requires extensive labeled
speech data for effective training, which may not be available for every language or domain.
• Speaker Dependency: Models trained on specific speakers often perform poorly
when tested on new, unseen speakers.
• Emotion Overlap and Ambiguity: Emotional states often overlap (e.g., anger
and frustration, or fear and surprise), making classification difficult.
• Language and Cultural Bias: Most datasets are in English or a few major
languages, limiting cross-lingual and multicultural applicability.
This study investigates the application of deep learning architectures, particularly
Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), in the field of
Speech Emotion Recognition (SER). The research uses the RAVDESS dataset, which includes
multiple emotional classes like happy, sad, angry, and neutral. Audio files are preprocessed using
MFCC feature extraction, and models are trained to classify emotional states. Performance
evaluation indicates that CNN-RNN hybrid models outperform traditional classifiers such as SVM
and Random Forest, with an accuracy improvement of up to 15%.
Description
The paper presents a hybrid CNN-RNN model for SER that processes MFCC
features derived from the RAVDESS dataset. The CNN layers extract spatial features from
MFCC inputs, while RNN layers capture temporal dependencies, improving the model’s
ability to detect complex emotional patterns in speech. Hyperparameter tuning and dropout
layers are applied to prevent overfitting. The model achieves 82.4% accuracy in multi-class
emotion recognition, showing robust performance on both male and female speech samples.
The study emphasizes deep learning’s potential in enhancing human-computer interaction.
Limitations:
Dataset Bias: The RAVDESS dataset contains acted emotions, which may not generalize well to
real-world scenarios with spontaneous speech.
Speaker Dependency: Performance drops significantly when tested on speakers not present in the
training set.
Limited Language Support: The study only focuses on English, restricting cross-lingual
applications.
High Computational Load: The hybrid CNN-RNN architecture requires high processing power
and memory.
No Real-time Evaluation: The system has not been tested in real-time applications or embedded
environments.
3.3 Ekhlas Al Nassr, Mohd Fadzil Hassan, and Yousef Alhwaiti, “Improving Speech Emotion
Recognition Using Data Augmentation and Feature Fusion,” Computers, Volume 10, Issue 6,
Article 79, 2021, ISSN: 2073-431X
Abstract
This research focuses on enhancing speech emotion recognition accuracy through
data augmentation techniques and the fusion of multiple audio features. The study leverages
both traditional (MFCC, Chroma) and advanced features (spectral centroid, roll-off) and
combines them using a deep neural network (DNN) for classification. Evaluation is conducted
using the EMO-DB dataset, and data augmentation is applied through pitch shifting and time
stretching. Results indicate that feature fusion and augmentation significantly improve
classification performance.
Description
The methodology involves preprocessing the EMO-DB dataset with multiple
augmentation methods to create a diverse training set. Feature extraction includes MFCCs,
chroma features, zero-crossing rate, and spectral bandwidth. These are concatenated to form a rich
feature vector, which is input into a DNN with batch normalization and ReLU activation. The
model achieves a classification accuracy of 89.3% with augmentation, compared to 76% without
it. The fusion of complementary features provides robustness against noise and speaker variation.
Limitations:
Overfitting Risk: Extensive feature sets and augmentation increase the risk of overfitting,
especially on small datasets.
Artificial Emotions: Use of EMO-DB limits realism due to acted emotional expressions.
Limited Multilingual Support: The model was only tested on German speech data.
Augmentation Artifacts: Pitch and speed manipulation can introduce audio artifacts that
affect emotion clarity.
Lack of Real-time Testing: The system was only evaluated offline, not under real-time
conditions.
CHAPTER IV
SYSTEM DESIGN
SYSTEM DESIGN
4.1 System Architecture
Description
The Speech Emotion Recognition (SER) system leverages machine learning techniques
to identify and classify human emotions from spoken audio data. The project initiates with the
collection of emotional speech datasets, such as RAVDESS (Ryerson Audio-Visual Database
of Emotional Speech and Song), TESS (Toronto Emotional Speech Set), or SAVEE. These
datasets contain audio recordings labeled with various emotions including neutral, happy, sad,
angry, fearful, and disgusted.
The raw audio files are then preprocessed using the Librosa library in Python. Key
preprocessing steps include noise reduction, trimming silences, and sampling normalization.
Feature extraction is performed using Mel-Frequency Cepstral Coefficients (MFCC), Chroma
features, Zero-Crossing Rate, Spectral Centroid, and Spectral Roll-off to capture both temporal
and spectral properties of the audio signal.
Following preprocessing and feature extraction, the dataset is split into training and
testing sets. A variety of machine learning and deep learning models are experimented with,
including Support Vector Machines (SVM), Convolutional Neural Networks (CNN), Long
Short-Term Memory (LSTM) networks, and hybrid CNN-LSTM models. These models are
trained to classify the speech into emotional categories. Model performance is evaluated based
on metrics such as accuracy, precision, recall, and F1-score, and the best-performing model is
saved using Python's joblib or pickle in .pkl format for later use.
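A condensed sketch of this train-evaluate-save loop is given below; the feature matrix X and label vector y are placeholders standing in for the features built as described above:

import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Placeholder feature matrix and labels; in the real pipeline these come from feature extraction
X, y = np.random.randn(300, 57), np.random.randint(0, 5, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model = SVC(kernel="rbf", probability=True)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))   # precision, recall, F1 per emotion

joblib.dump(model, "ser_model.pkl")   # persist the best model in .pkl format, as described above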
1. Voice Signal (Input)
It captures speech signals that carry both linguistic content and emotional cues.
2. Feature Extraction
This step involves extracting meaningful characteristics from the audio signal to feed into
machine learning models.
These include basic statistical measures from the waveform, such as:
o Max: Maximum amplitude
It also involves converting the audio to a spectrogram using techniques like FFT or MFCC.
3. Model Training
This is the machine learning or deep learning phase, where extracted features are used to train
models to recognize emotions.
a. Traditional ML Algorithms:
Naïve Bayes
Decision Tree
These are typically fast but may struggle with complex features.
b. Neural Networks:
Advanced deep learning architectures (like CNN, RNN, LSTM) that can automatically learn
complex patterns directly from the extracted features.
The trained model then predicts the emotion category from the audio.
Common emotions identified include:
o Angry
o Happy
o Sad
o Neutral
o Fear
The classification results—which represent the identified emotions—are then stored and displayed for
further use or feedback.
4.3 Modules
1. Data Pre-processing
2. Exploratory Data Analysis and Visualization
3. Data Validation/Cleaning/Preparation Process
4. Web Application & Deployment Module
Once the audio is cleaned, feature extraction is carried out using libraries such as
Librosa or PyAudioAnalysis. The goal is to extract low-level descriptors and statistical
features that represent the speech signal's characteristics. Key features include the following
(a short extraction sketch follows this list):
MFCC (Mel-Frequency Cepstral Coefficients): Captures the timbral texture and pitch
variations of the voice, making it essential for emotion detection.
Chroma Features: Capture pitch class profiles, providing tonal information which may differ
across emotions.
o Zero Crossing Rate (ZCR): Measures the frequency of signal sign changes, which
can be correlated with speech intensity.
o Spectral Centroid and Spectral Bandwidth: Reflect the “brightness” and
frequency distribution of the audio.
o Root Mean Square Energy (RMSE): Represents signal strength and can vary with
emotional intensity.
o Tempo, Harmonic-to-Noise Ratio, and Spectrograms (for CNN input) are also
optionally extracted depending on the model architecture.
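The sketch below illustrates how these descriptors might be computed and pooled into one feature vector per clip with librosa; the pooling choice (time averaging) and coefficient counts are illustrative, not the project's fixed configuration:

import librosa
import numpy as np

def extract_features(path, sr=16000):
    """Illustrative per-clip feature vector combining the descriptors listed above."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    rmse = librosa.feature.rms(y=y)
    # Average each descriptor over time and concatenate into one row per audio sample
    return np.hstack([feat.mean(axis=1) for feat in (mfcc, chroma, zcr, centroid, rmse)])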
After feature extraction, these features are compiled into tabular datasets—often in
NumPy arrays or Pandas DataFrames—with each row representing an audio sample and
each column representing a feature. This is followed by feature normalization or
standardization, which ensures that all features are on a similar scale, thus preventing
model bias toward features with higher numeric ranges.
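For example, standardization can be done with scikit-learn's StandardScaler; here X is a random placeholder for the feature table built in the previous step:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.randn(100, 57)          # placeholder for the real feature table (rows = clips)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)    # zero mean, unit variance per feature column
# Reuse the same fitted scaler (scaler.transform) on the validation and test splits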
Finally, the entire dataset is split into training, validation, and test sets, commonly
in a 70:15:15 or 80:10:10 ratio. This ensures that the model can learn effectively, be tuned
using validation data, and be evaluated fairly on unseen samples. Data augmentation
techniques such as pitch shifting, time stretching, and background noise injection may also
be applied during preprocessing to enhance model robustness and handle overfitting,
especially when working with limited datasets.
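A small sketch of such augmentations with librosa (the parameter values are illustrative) is shown below:

import librosa
import numpy as np

def augment(y, sr):
    """Illustrative augmentations: pitch shift, time stretch, and noise injection."""
    pitched = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # shift up two semitones
    stretched = librosa.effects.time_stretch(y, rate=0.9)        # slow down by about 10%
    noisy = y + 0.005 * np.random.randn(len(y))                  # add mild Gaussian noise
    return pitched, stretched, noisy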
Overall, the preprocessing phase plays a foundational role in the SER pipeline,
ensuring high-quality, well-structured data that allows machine learning and deep learning
models to accurately learn and classify human emotions from speech signals.
Class Diagram
This diagram represents a modular architecture for a Speech Emotion Recognition system. It begins with
the AudioInput module, which is responsible for capturing speech through methods like recordSpeech()
and uploadAudio(). The captured audio is then sent to the Preprocessing module, which performs tasks
such as noiseReduction() and featureExtraction() to prepare the audio for analysis by reducing noise and
extracting relevant features. These features are further processed by the FeatureExtraction module,
which applies methods such as extractMFCC() (Mel-Frequency Cepstral Coefficients) and
extractSpectrogram(), essential for emotion classification. In parallel, the
Admin module manages datasets and training processes via manageDataset() and retrainModel(),
ensuring the model stays updated and accurate. Both the features and admin commands feed into the
Model module, which contains core functionalities like trainModel() for learning from data and
classifyEmotion() to predict emotions from the extracted features. The EmotionClassification module
then applies specific detection methods such as detectHappiness(), detectSadness(), detectAnger(), and
detectNeutrality() to categorize the emotional state from the model’s output. Finally, the classified
emotion is displayed to the User through functions like provideAudio() and viewResults(), closing the
loop by presenting the emotion recognition results back to the end user. This structured flow ensures
clear separation of concerns, from audio capture to emotion detection and user feedback. This diagram
represents a use case or component flow for an audio-based emotion recognition system. It shows the
sequence of components and how they interact to process audio, extract features, train a model, and
classify emotions
Sequence Diagram:
The sequence diagram illustrates the flow of operations in a Speech Emotion Recognition (SER) system
using deep learning models such as CNN, RNN, and LSTM. The process begins with the User initiating
a voice input, which is captured by the Audio Recorder and passed as raw audio to the Preprocessor.
The Preprocessor cleans the data by removing noise and unnecessary segments, then forwards the
extracted features to the Feature Extractor, which generates meaningful representations like MFCCs or
spectrograms. These features are then sent to the Model Selector, which determines the appropriate
model (CNN, RNN, or LSTM) based on configuration or performance needs. The selected model
processes the features and sends the output to the Emotion Classifier, which interprets the model’s
output and identifies the emotional state (e.g., happy, sad, angry). Finally, the Display System presents the
predicted emotion back to the user. This structured interaction ensures a streamlined and automated workflow
for recognizing emotions from speech using deep learning architectures.
These features are passed to a model selection mechanism, which dynamically chooses the most
appropriate deep learning model—either CNN, RNN, or LSTM—based on predefined criteria or
configurations. The chosen model processes the features and produces a prediction, which is then
interpreted by the Emotion Classifier to determine the emotional state. Finally, the Display System
communicates the detected emotion to the user. This diagram not only demonstrates the modular nature
of the system but also emphasizes the logical and time-ordered communication required to achieve
accurate emotion detection.
Audio Recorder component captures this vocal signal as raw audio data in waveform format. This data
often contains background noise, silences, and other artifacts, so it is passed to a Preprocessor, which
applies signal enhancement techniques such as noise reduction, silence trimming, amplitude normalization,
and voice activity detection (VAD). These preprocessing steps help to clean and standardize the input,
making it suitable for feature extraction. The cleaned audio is then fed into the Feature Extractor, where
essential acoustic features are computed. These features may include low-level descriptors such as Mel-
Frequency Cepstral Coefficients (MFCCs), chroma features, zero-crossing rate, spectral centroid,
pitch, formants, and prosodic elements like speech rate or energy dynamics. These features capture the
temporal, spectral, and emotional characteristics of the speech signal. Once extracted, these features are
input into an Emotion Classifier, typically a machine learning or deep learning model such as a
Convolutional Neural Network (CNN) for spatial feature learning, a Recurrent Neural Network (RNN)
or Long Short-Term Memory (LSTM) network for modeling temporal dependencies, or a hybrid
architecture that combines both. The classifier processes the features based on its learned parameters and
predicts an emotion label, such as “happy,” “sad,” “angry,” “fearful,” or “neutral.”
The predicted emotion is then sent to the Display System, which outputs the recognized emotion to the user.
Collaboration Diagram
The process begins with the user, who provides raw audio data containing
speech recordings that exhibit various emotional expressions. These recordings could come
from datasets like TESS, RAVDESS, or custom sources. The user uploads or provides this
raw audio data into the system, initiating the pipeline.
Once received, the audio files are collected in the Raw Data Source module.
This component acts as a central repository that stores unprocessed audio clips along with
their associated metadata, such as file paths and emotion labels (if available). At this stage,
the data remains in its original, unfiltered format.
The raw audio is then passed to the Data Preprocessing Module, where
essential transformations take place. This step includes cleaning noise from the audio,
normalizing the signals, segmenting audio clips, extracting meaningful features such as Mel-
frequency cepstral coefficients (MFCCs), chroma, zero-crossing rate, and pitch, and
converting labels into numerical form if they are categorical. This preprocessing is crucial to
prepare the audio for effective model training.
4.5.1 Exploratory Data Analysis and Visualization
Exploratory Data Analysis (EDA) and Visualization for Speech Emotion
Recognition (SER) play a crucial role in understanding the structure, quality, and patterns
within audio datasets before applying machine learning models. The first step typically
involves analyzing the distribution of emotion classes to identify imbalances, such as
overrepresentation of “neutral” and underrepresentation of “fear” or “disgust.” Visualization
tools like bar plots or pie charts help highlight these class distributions clearly. Next, audio
feature distributions such as MFCCs, chroma features, pitch, and energy are examined
using box plots, histograms, and violin plots to understand their variance across different
emotions. Correlation heatmaps are also used to detect highly correlated features, which
may be candidates for dimensionality reduction. Additionally, waveform plots and
spectrograms give insight into how different emotions modulate the speech signal visually
—revealing patterns in energy, frequency content, and rhythm. t-SNE or PCA plots are
often used for dimensionality reduction and visualization, helping to identify whether the
extracted features form distinguishable clusters for different emotional states. This phase
helps identify noise, missing values, and outliers, and provides a foundation for feature
engineering, class balancing, and model selection. Overall, EDA not only enhances
understanding of the dataset but also guides the design of more accurate and interpretable
SER systems. In addition to basic class distribution analysis, EDA for SER often begins
with examining audio duration and silence ratios across recordings, as emotions like
sadness may correlate with slower speech and longer pauses. Plotting the duration
distribution of audio files helps in identifying inconsistencies, such as truncated or overly
lengthy recordings, which can introduce noise into model training. A common practice is to
visualize waveforms and mel-spectrograms using libraries like librosa, which allows for
time-frequency analysis that can reveal emotion-linked acoustic patterns—such as higher
frequency content in angry speech or lower energy in sad speech.
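A brief sketch of these EDA plots is given below; the DataFrame df with a 'labels' column and the example clip path are assumptions for illustration:

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
import seaborn as sns

# Class distribution (df is assumed to hold one row per clip with a 'labels' column)
sns.countplot(x='labels', data=df)
plt.title('Emotion class distribution')
plt.show()

# Waveform and mel-spectrogram of a single clip (hypothetical path)
y, sr = librosa.load('angry_example.wav', sr=16000)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(y, sr=sr, ax=ax1)
ax1.set_title('Waveform')
S_db = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='mel', ax=ax2)
ax2.set_title('Mel-spectrogram')
plt.tight_layout()
plt.show()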
Class Diagram
4.6.1 Data Validation/Cleaning/Preparation Process
The data validation, cleaning, and preparation process for Speech Emotion
Recognition (SER) is a critical stage that directly influences the performance and reliability
of the system. This process begins with the collection of raw audio data from established
emotional speech datasets or real-time recordings. At this stage, data validation ensures that
the recordings are complete, correctly labeled with emotional categories, and free from
major technical issues such as missing files or corrupt formats. Once validated, the data
undergoes a cleaning phase where background noise is reduced or removed using signal
processing techniques such as noise filtering, normalization, and silence trimming to
enhance the clarity of emotional cues. Inconsistent samples, such as those with too much
distortion or incorrect labels, are either corrected or removed. Following cleaning, the
preparation phase begins with feature extraction, where relevant acoustic features like
MFCCs, pitch, energy, and spectral characteristics are computed from the cleaned audio.
These features are then standardized and, if necessary, subjected to feature selection
methods to retain only the most informative ones. The dataset is subsequently split into
training, validation, and test sets to ensure that model evaluation is unbiased and
performance can be accurately assessed. Throughout this process, consistency in
preprocessing steps is maintained to avoid data leakage and ensure that the system
generalizes well to new inputs. This structured pipeline ensures high-quality input to the
machine learning model, forming a robust foundation for the accurate detection and
classification of emotions from speech. Finally, the dataset is split into training, validation, and
testing subsets to facilitate effective model training and evaluation. This end-to-end process ensures
that the input data is robust, reliable, and optimally structured for accurate speech emotion
recognition using advanced AI techniques.
Class Diagram
This diagram illustrates the complete process involved in data validation, cleaning, and
preparation specifically tailored for Speech Emotion Recognition (SER) systems. It begins
with AudioCollection, where raw speech audio samples are gathered, often from diverse
sources such as databases, real-time recordings, or call centers. The next step,
FeatureExtraction, involves extracting meaningful acoustic characteristics from the audio
signals, such as pitch, energy, MFCCs (Mel-Frequency Cepstral Coefficients), and spectral
features, which are vital for recognizing emotions.
Following extraction, FeatureSelection is performed to identify and retain only the most
relevant features that significantly contribute to emotion classification, reducing
dimensionality and improving model efficiency. The process then moves to DataSplitting,
where the cleaned and processed data is divided into training and test sets, ensuring that the
model can learn patterns from one part of the data and be evaluated on unseen examples.
ModelTraining involves applying machine learning algorithms (e.g., SVM, Random Forest,
CNN, RNN) to the training data to develop a predictive model. Once trained, the model
proceeds to the EmotionRecognition phase, where it classifies or predicts emotions (such as
happy, sad, angry, etc.) from new or unseen speech data. Finally, ModelEvaluation assesses
the performance of the model using metrics like accuracy, precision, recall, F1-score, and
confusion matrix, helping validate its reliability and generalization capability.
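As an illustration, these evaluation metrics can be computed with scikit-learn; the y_test and y_pred arrays below are random placeholders standing in for the classifier's real outputs:

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# Placeholders; in practice these come from the trained emotion classifier
y_test = np.random.randint(0, 5, 100)
y_pred = np.random.randint(0, 5, 100)

acc = accuracy_score(y_test, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted', zero_division=0)
cm = confusion_matrix(y_test, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")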
This flow ensures a structured and systematic approach for building effective and accurate
speech emotion recognition systems. This is the foundational step where raw speech data is
gathered. It involves collecting voice recordings from various sources such as emotional
speech datasets (like RAVDESS, EMO-DB, or CREMA-D), real-world audio (e.g., call
centers, therapy sessions), or through direct microphone input. The key requirement here is
ensuring that the collected audio is of good quality, appropriately labeled with the
corresponding emotion, and representative of different speakers, accents, and speaking
styles. Not all extracted features contribute equally to identifying emotions. This step
involves selecting the most relevant ones, which improves the model’s accuracy and
efficiency. Feature selection techniques like PCA (Principal Component Analysis), recursive
feature elimination, or mutual information-based methods are commonly used to reduce
noise and computational load. The process begins with AudioCollection, where speech
audio samples are gathered from various sources such as emotional speech databases, real-
world recordings, or direct microphone inputs. These audio recordings must be of high
quality and labeled accurately with the corresponding emotions to ensure the effectiveness of
the downstream processes. Once collected, the next step is FeatureExtraction, where the
audio data is converted into structured numerical representations. Acoustic features such as
Mel-Frequency Cepstral Coefficients (MFCCs), pitch, energy, and spectral features are
extracted to capture the emotional characteristics embedded in the speech. These features then
feed the selection, splitting, and training stages described above.
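A short sketch of such feature selection is given below; X and y are random placeholders for the prepared feature matrix and labels:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

X, y = np.random.randn(300, 57), np.random.randint(0, 5, 300)   # placeholders for real features/labels

# Keep enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Alternatively, rank individual features by mutual information with the emotion labels
mi_scores = mutual_info_classif(X, y)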
Sequence Diagram
Fig 4.8 Sequence Diagram for the Data Preparation Process
The diagram illustrates the comprehensive workflow for data validation, cleaning,
and preparation in a speech emotion recognition system. It is structured as a sequence
diagram featuring four main roles: the User, Data Engineer, Feature Extraction, and ML
Model. The process begins with a request from the User to validate and prepare the speech
data, which is received by the Data Engineer. The Data Engineer then initiates a series of
operations starting with cleaning and validating the data. This step is crucial as it ensures the
quality and integrity of the dataset before any analysis. Following this, noise and silence are
removed from the audio, which helps in focusing only on the relevant parts of the speech that
carry emotional content. The audio is then segmented into meaningful utterances, allowing
for a more fine-grained analysis. After segmentation, the system proceeds to extract acoustic
features such as pitch, tone, and energy—key indicators of emotional state. These features are
then normalized, encoded, and aggregated to ensure they are in a suitable format for machine
learning. Once the data is prepared, it is sent to the ML Model module, which acknowledges
receipt and proceeds to train the emotion classifier using the processed features. Finally, the
trained model and its results are returned to the Data Engineer, who then provides them to the
User. This end-to-end process demonstrates a robust pipeline for preparing raw speech data
into actionable inputs for emotion recognition models
Collaboration Diagram
The Data Engineer then performs crucial tasks to ensure the data is usable. These
tasks include cleaning the data by removing background noise and silences, segmenting it
into utterances, and extracting meaningful acoustic features such as pitch, energy, and
spectral characteristics. Once this preprocessing is complete, the data is normalized,
encoded, and structured appropriately for machine learning. The cleaned and feature-rich
data is then sent to the ML model component.
The ML model receives both the raw and prepared data for training. Using this input,
the model is trained to classify emotions expressed in the speech, such as happiness, anger,
sadness, or neutrality. Once the model has been successfully trained, it is stored back in the
Storage component and also returned to the Data Engineer, who provides the final results
and trained model output to the User. This collaboration ensures that the speech data is
carefully processed and accurately modeled to identify emotions, enabling effective
deployment in applications like call center analytics, human-computer interaction, or mental
health monitoring.
4.8.1 DATABASE SCHEMA
A database schema is the skeleton structure that represents the logical view of the entire database.
It defines how the data is organized and how the relations among them are associated. It formulates all the
constraints that are to be applied on the data.
Module I
This line creates a mapping dictionary (label_map) that assigns each unique emotion label in the
df['labels'] column a unique integer. For example, if the labels were ['happy', 'sad', 'angry'], it might
generate {'happy': 0, 'sad': 1, 'angry': 2}.
The DataFrame output below the code has two columns:
audio_paths: Contains the full file paths to the audio files from the TESS (Toronto Emotional
Speech Set) dataset.
labels: Now contains integer values instead of strings. Here, the first two entries are labeled 0,
meaning both audio files correspond to the same emotion category (e.g., both might represent
"happy").
This step is crucial in the data preparation phase of a Speech Emotion Recognition pipeline.
Machine learning models cannot process string labels directly, so converting them into numerical
form allows for classification tasks like predicting emotions based on audio features.
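Since the code itself is not reproduced in this listing, a hedged reconstruction of the mapping step (column names follow the description above) might be:

import pandas as pd

# df is assumed to contain 'audio_paths' and string 'labels' columns built from TESS
label_map = {label: idx for idx, label in enumerate(df['labels'].unique())}
df['labels'] = df['labels'].map(label_map)                            # e.g. {'happy': 0, 'sad': 1, ...}
inverse_label_map = {idx: label for label, idx in label_map.items()}  # used later at inference time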
CHAPTER V
SYSTEM IMPLEMENTATION
SYSTEM IMPLEMENTATION
5.1 Sample Code View.py
import streamlit as st
import librosa
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForSequenceClassification
def set_background():
st.markdown(
f"""
<style>
.stApp {{
background: url("https://github.com/Sapir52/Speech_emotion_recognition/blob/main/static/
img/speech_background.jpg?raw=true");
background-size: cover;
}}
</style>
""",
unsafe_allow_html=True
)
set_background()
# Streamlit app
st.title("Speech Emotion Recognition🎤")
st.write("""
Speech Emotion Recognition (SER) is a cutting-edge technology that aims to identify and classify
emotions from human speech.
By analyzing audio signals, SER systems can detect emotions such as happiness, sadness, anger, and
more.
This application leverages advanced machine learning models to process audio inputs and predict the
underlying emotional state,
providing valuable insights for various domains like customer service, healthcare, and entertainment.
""")
# File uploader
uploaded_file = st.file_uploader("Upload an audio file", type=["wav", "mp3"])
st.markdown(
"""
<style>
.balloon-effect {
position: absolute;
bottom: 0;
left: 50%;
transform: translateX(-50%);
width:50px;
height:50px;
background-color: #ffcc00;
border-radius: 50%;
animation: rise 3s ease-out forwards;
}
@keyframes rise {
0% { bottom: 0; opacity: 1; }
100% { bottom: 100%; opacity: 0; }
}
</style>
""",
unsafe_allow_html=True
)
if uploaded_file is not None:
    # Load the audio file
    data, sampling_rate = librosa.load(uploaded_file, sr=16000)
    # Convert the waveform into model inputs (processor and model are assumed to be loaded
    # earlier via Wav2Vec2Processor / Wav2Vec2ForSequenceClassification.from_pretrained)
    input_values = processor(data, sampling_rate=sampling_rate, return_tensors="pt").input_values
    # Perform inference
    with torch.no_grad():
        outputs = model(input_values)
        logits = outputs.logits
    predicted_class = logits.argmax(dim=-1).item()
    predicted_emotion = inverse_label_map.get(predicted_class, "Unknown")
    # Visualize the uploaded audio as a log-frequency spectrogram
    # (matplotlib.pyplot as plt and librosa.display are assumed to be imported at the top)
    D = librosa.amplitude_to_db(np.abs(librosa.stft(data)), ref=np.max)
    fig, ax = plt.subplots()
    img = librosa.display.specshow(D, sr=sampling_rate, x_axis='time', y_axis='log', ax=ax,
                                   cmap='viridis')
    fig.colorbar(img, ax=ax, format="%+2.0f dB")
    ax.set_title("Spectrogram (Log Frequency)")
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Frequency (Hz)")
    st.pyplot(fig)
# Additional CSS for hover effects and a loading spinner
st.markdown(
    """
    <style>
.button:hover {
background-color: #0050b3;
color: white;
transition: background-color 0.3s ease;
}
.label:hover {
transform: scale(1.05);
transition: transform 0.3s ease;
}
.loading-animation {
animation: spin 1s linear infinite;
}
@keyframes spin {
0% { transform: rotate(0deg); }
100% { transform: rotate(360deg); }
}
</style>
""",
unsafe_allow_html=True
)
# accuracy and loss are assumed to be computed earlier in the script (e.g. during model evaluation)
fig, ax = plt.subplots()
ax.pie([accuracy, loss], labels=['Accuracy', 'Loss'], autopct='%1.1f%%', startangle=90,
colors=['#4CAF50', '#FF5722'])
ax.set_title("Model Performance")
col1, col2, col3 = st.columns([1, 2, 1,]) # Create three columns for centering
with col2:
st.pyplot(fig) # Display the pie chart in the center column with a smaller size
# metrics and values (e.g. accuracy, precision, recall, F1-score) are assumed to be defined earlier
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(metrics, values, color=['#FF5722', '#4CAF50', '#2196F3', '#FFC107'])
ax.set_ylim(0, 100) # Set y-axis limit to 100 for percentage values
ax.set_title("Model Performance Metrics")
ax.set_ylabel("Percentage / Score")
for i, v in enumerate(values):
ax.text(i, v + 2, f"{v:.2f}", ha='center', fontsize=10) # Add value labels above bars
st.pyplot(fig)
import os
from os.path import isdir, join
from pathlib import Path
import pandas as pd
# Math
import numpy as np
from scipy.fftpack import fft
from scipy import signal
from scipy.io import wavfile
import librosa
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
import librosa.display
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import pandas as pd
%matplotlib inline
# %% [markdown]
#
# # 1. Visualization
# <a id="visualization"></a>
#
# There are two theories of human hearing - place (
https://en.wikipedia.org/wiki/Place_theory_(hearing) (frequency-based) and temporal
(https://en.wikipedia.org/wiki/Temporal_theory_(hearing) )
# In speech recognition, I see two main tendencies - to input
[spectrogram](https://en.wikipedia.org/wiki/Spectrogram) (frequencies), and more sophisticated
features MFCC - Mel-Frequency Cepstral Coefficients, PLP. You rarely work with raw, temporal data.
#
# Let's visualize some recordings!
#
# ## 1.1. Wave and spectrogram:
# <a id="waveandspectrogram"></a>
#
# Choose and read some file:
# %%
train_audio_path = '../input/train/audio/'
filename = '/yes/0a7c2a8d_nohash_0.wav'
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
# %% [markdown]
# Define a function that calculates spectrogram.
#
# Note, that we are taking logarithm of spectrogram values. It will make our plot much more clear,
moreover, it is strictly connected to the way people hear.
# We need to assure that there are no 0 values as input to logarithm.
# %%
def log_specgram(audio, sample_rate, window_size=20,
step_size=10, eps=1e-10):
nperseg = int(round(window_size * sample_rate / 1e3))
noverlap = int(round(step_size * sample_rate / 1e3))
freqs, times, spec = signal.spectrogram(audio,
fs=sample_rate,
window='hann',
nperseg=nperseg,
noverlap=noverlap,
detrend=False)
return freqs, times, np.log(spec.T.astype(np.float32) + eps)
# %% [markdown]
# Frequencies are in range (0, 8000) according to [Nyquist
theorem](https://en.wikipedia.org/wiki/Nyquist_rate).
#
# Let's plot it:
# %%
freqs, times, spectrogram = log_specgram(samples, sample_rate)
# Create the figure (missing in the listing) before adding the spectrogram subplot
fig = plt.figure(figsize=(14, 8))
ax2 = fig.add_subplot(212)
ax2.imshow(spectrogram.T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.set_title('Spectrogram of ' + filename)
ax2.set_ylabel('Freqs in Hz')
ax2.set_xlabel('Seconds')
# %% [markdown]
# If we use spectrogram as an input features for NN, we have to remember to normalize features. (We
need to normalize over all the dataset, here's example just for one, which doesn't give good *mean*
and *std*!)
# %%
mean = np.mean(spectrogram, axis=0)
std = np.std(spectrogram, axis=0)
spectrogram = (spectrogram - mean) / std
# %% [markdown]
# There is an interesting fact to point out. We have ~160 features for each frame, frequencies are
between 0 and 8000. It means, that one feature corresponds to 50 Hz. However, [frequency resolution
of the ear is 3.6 Hz within the octave of 1000 – 2000
Hz](https://en.wikipedia.org/wiki/Psychoacoustics) It means, that people are far more precise and can
hear much smaller details than those represented by spectrograms like above.
# %% [markdown]
# ## 1.2. MFCC
# <a id="mfcc"></a>
#
# If you want to get to know some details about *MFCC* take a look at this great tutorial. [MFCC
explained](http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-
cepstral-coefficients-mfccs/) You can see, that it is well prepared to imitate human hearing properties.
#
# You can calculate *Mel power spectrogram* and *MFCC* using for example *librosa* python
package.
#
# %%
# From this tutorial
# https://github.com/librosa/librosa/blob/master/examples/LibROSA%20demo.ipynb
S = librosa.feature.melspectrogram(samples, sr=sample_rate, n_mels=128)
# Convert to log scale (dB). We'll use the peak power (max) as reference.
log_S = librosa.power_to_db(S, ref=np.max)
plt.figure(figsize=(12, 4))
librosa.display.specshow(log_S, sr=sample_rate, x_axis='time', y_axis='mel')
plt.title('Mel power spectrogram ')
plt.colorbar(format='%+02.0f dB')
plt.tight_layout()
# %%
mfcc = librosa.feature.mfcc(S=log_S, n_mfcc=13)
# Second-order MFCC deltas (this line was missing in the listing but is used below)
delta2_mfcc = librosa.feature.delta(mfcc, order=2)
plt.figure(figsize=(12, 4))
librosa.display.specshow(delta2_mfcc)
plt.ylabel('MFCC coeffs')
plt.xlabel('Time')
plt.title('MFCC')
plt.colorbar()
plt.tight_layout()
# %% [markdown]
# In classical, but still state-of-the-art systems, *MFCC* or similar features are taken as the input to
the system instead of spectrograms.
#
# However, in end-to-end (often neural-network based) systems, the most common input features are
probably raw spectrograms, or mel power spectrograms. For example *MFCC* decorrelates features,
but NNs deal with correlated features well. Also, if you understand mel filters, you may consider
their usage sensible.
#
# It is your decision which to choose!
# %% [markdown]
# ## 1.3. Spectrogram in 3d
# <a id="3d"></a>
#
# By the way, times change, and the tools change. Have you ever seen spectrogram in 3d?
# %%
data = [go.Surface(z=spectrogram.T)]
layout = go.Layout(
title='Spectrogram of "yes" in 3d',
scene = dict(
yaxis = dict(title='Frequencies', range=freqs),
xaxis = dict(title='Time', range=times),
zaxis = dict(title='Log amplitude'),
),
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)
# %% [markdown]
# (Don't know how to set axis ranges to proper values yet. I'd also like it to be stretched like a classic
spectrogram above..)
# %% [markdown]
# ## 1.4. Silence removal
# <a id="silenceremoval"></a>
#
# Let's listen to that file
# %%
ipd.Audio(samples, rate=sample_rate)
# %% [markdown]
# I consider that some *VAD* (Voice Activity Detection) will be really useful here. Although the
words are short, there is a lot of silence in them. A decent *VAD* can reduce training size a lot,
accelerating training speed significantly.
# Let's cut a bit of the file from the beginning and from the end. and listen to it again (based on a plot
above, we take from 4000 to 13000):
# %%
samples_cut = samples[4000:13000]
ipd.Audio(samples_cut, rate=sample_rate)
# %% [markdown]
# We can agree that the entire word can be heard. It is impossible to cut all the files manually and do
this basing on the simple plot. But you can use for example *webrtcvad* package to have a good
*VAD*.
#
# Let's plot it again, together with guessed alignment of* 'y' 'e' 's'* graphems
# %%
freqs, times, spectrogram_cut = log_specgram(samples_cut, sample_rate)
# Create the figure (missing in the listing) before adding the spectrogram subplot
fig = plt.figure(figsize=(14, 8))
ax2 = fig.add_subplot(212)
ax2.set_title('Spectrogram of ' + filename)
ax2.set_ylabel('Frequencies * 0.1')
ax2.set_xlabel('Samples')
ax2.imshow(spectrogram_cut.T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
ax2.set_yticks(freqs[::16])
ax2.set_xticks(times[::16])
ax2.text(0.06, 1000, 'Y', fontsize=18)
ax2.text(0.17, 1000, 'E', fontsize=18)
ax2.text(0.36, 1000, 'S', fontsize=18)
# %% [markdown]
# ## 1.5. Resampling - dimensionality reduction
# <a id="resampl"></a>
#
# Another way to reduce the dimensionality of our data is to resample recordings.
#
# You can hear that the recording don't sound very natural, because they are sampled with 16k
frequency, and we usually hear much more. However, [the most speech related frequencies are
presented in smaller band](https://en.wikipedia.org/wiki/Voice_frequency). That's why you can still
understand another person talking to the telephone, where GSM signal is sampled to 8000 Hz.
#
# Summarizing, we could resample our dataset to 8k. We will discard some information that shouldn't
be important, and we'll reduce size of the data.
#
# We have to remember that it can be risky, because this is a competition, and sometimes very small
difference in performance wins, so we don't want to lose anything. On the other hand, first experiments
can be done much faster with smaller training size.
#
# We'll need to calculate FFT (Fast Fourier Transform). Definition:
#
# %%
def custom_fft(y, fs):
T = 1.0 / fs
N = y.shape[0]
yf = fft(y)
xf = np.linspace(0.0, 1.0/(2.0*T), N//2)
vals = 2.0/N * np.abs(yf[0:N//2]) # FFT is symmetrical, so we take just the first half
# FFT values are also complex, so we take the magnitude (abs)
return xf, vals
# %% [markdown]
# Let's read some recording, resample it, and listen. We can also compare FFT, Notice, that there is
almost no information above 4000 Hz in original signal.
# %%
filename = '/happy/0b09edd3_nohash_0.wav'
new_sample_rate = 8000
# Read and resample the recording (these two lines were missing from the listing)
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
resampled = signal.resample(samples, int(new_sample_rate / sample_rate * samples.shape[0]))
# %%
ipd.Audio(samples, rate=sample_rate)
# %%
ipd.Audio(resampled, rate=new_sample_rate)
# %% [markdown]
# Almost no difference!
# %%
xf, vals = custom_fft(samples, sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(sample_rate) + ' Hz')
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
# %%
xf, vals = custom_fft(resampled, new_sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of recording sampled with ' + str(new_sample_rate) + ' Hz')
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
# %% [markdown]
# This is how we reduced dataset size twice!
# %% [markdown]
# ## 1.6. Features extraction steps
# <a id="featuresextractionsteps"></a>
#
# I would propose the feature extraction algorithm like that:
# 1. Resampling
# 2. *VAD*
# 3. Maybe padding with 0 to make signals be equal length
# 4. Log spectrogram (or *MFCC*, or *PLP*)
# 5. Features normalization with *mean* and *std*
# 6. Stacking of a given number of frames to get temporal information
#
# It's a pity it can't be done in notebook. It has not much sense to write things from zero, and
everything is ready to take, but in packages, that can not be imported in Kernels.
# %% [markdown]
#
# # 2. Dataset investigation
# <a id="investigations"></a>
#
# Some usual investigation of the dataset.
#
# ## 2.1. Number of records
# <a id="numberoffiles"></a>
#
# %%
dirs = [f for f in os.listdir(train_audio_path) if isdir(join(train_audio_path, f))]
dirs.sort()
print('Number of labels: ' + str(len(dirs)))
# %%
# Calculate
number_of_recordings = []
for direct in dirs:
waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
number_of_recordings.append(len(waves))
# Plot
data = [go.Histogram(x=dirs, y=number_of_recordings)]
trace = go.Bar(
x=dirs,
y=number_of_recordings,
marker=dict(color = number_of_recordings, colorscale='Viridis', showscale=True
),
)
layout = go.Layout(
title='Number of recordings in given label',
xaxis = dict(title='Words'),
yaxis = dict(title='Number of recordings')
)
py.iplot(go.Figure(data=[trace], layout=layout))
# %% [markdown]
# Dataset is balanced except for background_noise, but that's a different thing.
# %% [markdown]
# ## 2.2. Deeper into recordings
# <a id="deeper"></a>
# %% [markdown]
# There's a very important fact. Recordings come from very different sources. As far as I can tell, some
of them can come from mobile GSM channel.
#
# Nevertheless,** it is extremely important to split the dataset in a way that one speaker doesn't occur
in both train and test sets.**
# Just take a look and listen to these two examples:
# %%
filenames = ['on/004ae714_nohash_0.wav', 'on/0137b3f4_nohash_0.wav']
for filename in filenames:
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
xf, vals = custom_fft(samples, sample_rate)
plt.figure(figsize=(12, 4))
plt.title('FFT of speaker ' + filename[4:11])
plt.plot(xf, vals)
plt.xlabel('Frequency')
plt.grid()
plt.show()
# %% [markdown]
# Even better to listen:
# %%
print('Speaker ' + filenames[0][4:11])
ipd.Audio(join(train_audio_path, filenames[0]))
# %%
print('Speaker ' + filenames[1][4:11])
ipd.Audio(join(train_audio_path, filenames[1]))
# %% [markdown]
# There are also recordings with some weird silence (some compression?):
#
# %%
filename = '/yes/01bb6a2a_nohash_1.wav'
sample_rate, samples = wavfile.read(str(train_audio_path) + filename)
freqs, times, spectrogram = log_specgram(samples, sample_rate)
plt.figure(figsize=(10, 7))
plt.title('Spectrogram of ' + filename)
plt.ylabel('Freqs')
plt.xlabel('Time')
plt.imshow(spectrogram.T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
plt.yticks(freqs[::16])
plt.xticks(times[::16])
plt.show()
# %% [markdown]
# It means, that we have to prevent overfitting to the very specific acoustical environments.
#
# %% [markdown]
# ## 2.3. Recordings length
# <a id="len"></a>
#
# Find if all the files have 1 second duration:
# %%
num_of_shorter = 0
for direct in dirs:
waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
for wav in waves:
sample_rate, samples = wavfile.read(train_audio_path + direct + '/' + wav)
if samples.shape[0] < sample_rate:
num_of_shorter += 1
print('Number of recordings shorter than 1 second: ' + str(num_of_shorter))
# %% [markdown]
# That's surprising, and there are a lot of them. We can pad them with zeros.
# %% [markdown]
# ## 2.4. Mean spectrograms and FFT
# <a id="meanspectrogramsandfft"></a>
# %% [markdown]
# Let's plot mean FFT for every word
# %%
to_keep = 'yes no up down left right on off stop go'.split()
dirs = [d for d in dirs if d in to_keep]
print(dirs)
# vals_all and spec_all are assumed to be accumulated per word in a preceding loop (omitted here)
plt.figure(figsize=(14, 4))
plt.subplot(121)
plt.title('Mean fft of ' + direct)
plt.plot(np.mean(np.array(vals_all), axis=0))
plt.grid()
plt.subplot(122)
plt.title('Mean specgram of ' + direct)
plt.imshow(np.mean(np.array(spec_all), axis=0).T, aspect='auto', origin='lower',
extent=[times.min(), times.max(), freqs.min(), freqs.max()])
plt.yticks(freqs[::16])
plt.xticks(times[::16])
plt.show()
# %% [markdown]
# ## 2.5. Gaussian Mixtures modeling
# <a id="gmms"></a>
#
# We can see that mean FFT looks different for every word. We could model each FFT with a mixture
of Gaussian distributions. Some of them however, look almost identical on FFT, like *stop* and *up*...
But wait, they are still distinguishable when we look at spectrograms! High frequencies are earlier than
low at the beginning of *stop* (probably *s*).
#
# That's why temporal component is also necessary. There is a [Kaldi](http://kaldi-asr.org/) library, that
can model words (or smaller parts of words) with GMMs and model temporal dependencies with
[Hidden Markov Models](https://github.com/danijel3/ASRDemos/blob/master/notebooks/
HMM_FST.ipynb).
#
# We could use simple GMMs for words to check what can we model and how hard it is to distinguish
the words. We can use [Scikit-learn](http://scikit-learn.org/) for that, however it is not straightforward
and lasts very long here, so I abandon this idea for now.
# %% [markdown]
# ## 2.6. Frequency components across the words
# <a id="components"></a>
#
# %%
def violinplot_frequency(dirs, freq_ind):
""" Plot violinplots for given words (waves in dirs) and frequency freq_ind
from all frequencies freqs."""
plt.figure(figsize=(13,7))
plt.title('Frequency ' + str(freqs[freq_ind]) + ' Hz')
plt.ylabel('Amount of frequency in a word')
plt.xlabel('Words')
sns.violinplot(data=pd.DataFrame(spec_all.T, columns=dirs))
plt.show()
# %%
violinplot_frequency(dirs, 20)
# %%
violinplot_frequency(dirs, 50)
# %%
violinplot_frequency(dirs, 120)
# %% [markdown]
# ## 2.7. Anomaly detection
# <a id="anomaly"></a>
#
# We should check if there are any recordings that somehow stand out from the rest. We can lower the dimensionality of the dataset and interactively check for any anomaly.
# We'll use PCA for dimensionality reduction:
# %%
fft_all = []
names = []
for direct in dirs:
    waves = [f for f in os.listdir(join(train_audio_path, direct)) if f.endswith('.wav')]
    for wav in waves:
        sample_rate, samples = wavfile.read(train_audio_path + direct + '/' + wav)
        if samples.shape[0] != sample_rate:
            samples = np.append(samples, np.zeros((sample_rate - samples.shape[0], )))
        x, val = custom_fft(samples, sample_rate)
        fft_all.append(val)
        names.append(direct + '/' + wav)
fft_all = np.array(fft_all)

# Normalization
fft_all = (fft_all - np.mean(fft_all, axis=0)) / np.std(fft_all, axis=0)

# Dim reduction
pca = PCA(n_components=3)
fft_all = pca.fit_transform(fft_all)

interactive_3d_plot(fft_all, names)
# %% [markdown]
# Notice that there are *yes/e4b02540_nohash_0.wav*, *go/0487ba9b_nohash_0.wav* and more points that lie far away from the rest. Let's listen to them.
# %%
print('Recording go/0487ba9b_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'go/0487ba9b_nohash_0.wav'))
# %%
print('Recording yes/e4b02540_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'yes/e4b02540_nohash_0.wav'))
# %% [markdown]
# If you will look for anomalies for individual words, you can find for example this file for *seven*:
# %%
print('Recording seven/b1114e4f_nohash_0.wav')
ipd.Audio(join(train_audio_path, 'seven/b1114e4f_nohash_0.wav'))
# %%
import pandas as pd
import numpy as np
import os
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    warnings.filterwarnings("ignore", category=DeprecationWarning)
# %% [markdown]
# # Data Import
# %%
Ravdess = r"C:\Users\kisho\Downloads\ravdess\audio_speech_actors_01-24"
Crema = r"C:\Users\kisho\Downloads\crema"
Tess = r"C:\Users\kisho\Downloads\archive\TESS Toronto emotional speech set data"
Savee = r"C:\Users\kisho\Downloads\save\ALL"
# %%
# RAVDESS Dataset
ravdess_directory_list = os.listdir(Ravdess)
file_emotion = []
file_path = []
for i in ravdess_directory_list:
    # as there are 24 different actor directories, we need to extract the files for each actor
    actor = os.listdir(os.path.join(Ravdess, i))
    for f in actor:
        part = f.split('.')[0].split('-')
        # the third part of each file name encodes the emotion associated with that file
        file_emotion.append(int(part[2]))
        file_path.append(os.path.join(Ravdess, i, f))
print(actor[0])
print(part[0])
print(file_path[0])
print(int(part[2]))
print(f)
# %%
# dataframe for the emotions of the files
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
# %%
# dataframe for path of files.
path_df = pd.DataFrame(file_path, columns=['Path'])
Ravdess_df = pd.concat([emotion_df, path_df], axis=1)
# changing integers to actual emotions.
Ravdess_df.Emotions.replace({1: 'neutral', 2: 'neutral', 3: 'happy', 4: 'sad', 5: 'angry',
                             6: 'fear', 7: 'disgust', 8: 'surprise'}, inplace=True)
print(Ravdess_df.head())
print("______________________________________________")
print(Ravdess_df.tail())
print("_______________________________________________")
print(Ravdess_df.Emotions.value_counts())
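# %% [markdown]
# The CREMA-D parsing cell appears to be missing from this listing, although `Crema_df` is used
# when the datasets are combined below. A minimal sketch, assuming the usual CREMA-D file
# naming (e.g. '1001_DFA_ANG_XX.wav', with the emotion code in the third field):
# %%
crema_emotion_map = {'SAD': 'sad', 'ANG': 'angry', 'DIS': 'disgust',
                     'FEA': 'fear', 'HAP': 'happy', 'NEU': 'neutral'}
file_emotion = []
file_path = []
for file in os.listdir(Crema):
    if not file.lower().endswith('.wav'):
        continue
    code = file.split('_')[2]
    file_emotion.append(crema_emotion_map.get(code, 'unknown'))
    file_path.append(os.path.join(Crema, file))
Crema_df = pd.concat([pd.DataFrame(file_emotion, columns=['Emotions']),
                      pd.DataFrame(file_path, columns=['Path'])], axis=1)
print(Crema_df.Emotions.value_counts())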
# %%
# TESS Dataset
# (reconstructed loop: the outer iteration over the TESS sub-directories was missing; the emotion
# label is assumed to be the last part of each directory name, e.g. 'OAF_angry')
file_emotion = []
file_path = []
for directory in os.listdir(Tess):
    dir_path = os.path.join(Tess, directory)
    if not os.path.isdir(dir_path):
        continue
    emotion = directory.split('_')[-1].lower()
    if emotion == 'ps':
        emotion = 'surprise'  # in case 'pleasant surprise' is abbreviated as 'ps' in some copies
    for file in os.listdir(dir_path):
        if not file.lower().endswith('.wav'):
            continue  # Skip non-audio files
        file_emotion.append(emotion)
        file_path.append(os.path.join(dir_path, file))
# Create DataFrame
emotion_df = pd.DataFrame(file_emotion, columns=['Emotions'])
path_df = pd.DataFrame(file_path, columns=['Path'])
Tess_df = pd.concat([emotion_df, path_df], axis=1)
# Display results
print(Tess_df.head())
print(Tess_df.Emotions.value_counts())
# %%
# SAVEE Dataset
savee_directory_list = os.listdir(Savee)
file_emotion = []
file_path = []
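# The rest of this cell appears to be missing, although `Savee_df` is used when the datasets are
# combined below. A minimal sketch follows, assuming the usual SAVEE naming in the ALL folder
# (e.g. 'DC_a01.wav', where the letters before the digits encode the emotion).
savee_emotion_map = {'a': 'angry', 'd': 'disgust', 'f': 'fear', 'h': 'happy',
                     'n': 'neutral', 'sa': 'sad', 'su': 'surprise'}
for file in savee_directory_list:
    if not file.lower().endswith('.wav'):
        continue
    code = ''.join(ch for ch in file.split('_')[-1].split('.')[0] if ch.isalpha())
    file_emotion.append(savee_emotion_map.get(code, 'unknown'))
    file_path.append(os.path.join(Savee, file))
Savee_df = pd.concat([pd.DataFrame(file_emotion, columns=['Emotions']),
                      pd.DataFrame(file_path, columns=['Path'])], axis=1)
print(Savee_df.Emotions.value_counts())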
# %%
# creating a single dataframe from the 4 dataframes we created so far.
data_path = pd.concat([Ravdess_df, Crema_df, Tess_df, Savee_df], axis = 0)
data_path.head()
# %%
len(data_path)
# %%
print(data_path.Emotions.value_counts())
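# %% [markdown]
# A quick look at the class balance across the combined datasets (a small illustrative sketch;
# it imports seaborn and matplotlib directly so the cell is self-contained):
# %%
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
sns.countplot(x='Emotions', data=data_path)
plt.title('Number of samples per emotion')
plt.show()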
# %%
file_path = data_path['Path'].iloc[0]
# loading the audio file using librosa
import librosa
import librosa.display
import matplotlib.pyplot as plt
data, sr = librosa.load(file_path)  # file_path is already a single path string
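# %% [markdown]
# As a quick sanity check we can visualise the loaded clip and the shape of its MFCC features
# (a minimal sketch; `librosa.display.waveshow` requires librosa >= 0.9, older versions use `waveplot`):
# %%
plt.figure(figsize=(12, 3))
librosa.display.waveshow(data, sr=sr)
plt.title('Waveform of the first sample (' + data_path['Emotions'].iloc[0] + ')')
plt.show()
mfcc = librosa.feature.mfcc(y=data, sr=sr, n_mfcc=40)
print('MFCC feature shape:', mfcc.shape)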
CHAPTER VI
SYSTEM TESTING
6.1 Introduction
The software that has been developed has to be tested for its validity. Testing is often considered the least creative phase of the whole system design cycle, but in reality it is the phase that brings out and validates the creativity of the other phases. Testing is one of the most important activities in software development. In the software development life cycle, the main aim of the testing process is quality: the developed software is tested to confirm that it attains the required functionality and performance. During the testing process the software is exercised with particular test cases, and the outputs of those test cases are analysed to determine whether the software behaves according to expectations. The success of the testing process in detecting errors depends largely on the test case criteria; to test any software we need a description of the expected behaviour of the system and a method of determining whether the observed behaviour conforms to the expected behaviour.
The test process is initiated by developing a comprehensive plan to test the general functionality and special features on a variety of platform combinations. Strict quality control procedures are used. The process verifies that the application meets the requirements specified in the system requirements document and is bug free. The following considerations were used to develop the framework for the testing methodology.
6.4 Types of Testing
Since errors in the software can be introduced at any stage, we have to carry out the
testing process at different levels during development. The basic levels of testing are:
1. Unit Testing.
2. Functional testing.
3. System Testing.
4. Performance testing.
5. Integration Testing.
6. Acceptance Testing.
6.4.1 Unit Testing
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flows should be validated. It is the testing of individual software units of the application, and it is done after the completion of an individual unit before integration. This is structural testing: it relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
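For example, a unit test in this project might check that a feature-extraction helper always returns a vector of the expected length. The sketch below is illustrative only; the extract_mfcc helper shown here is a hypothetical name rather than part of the actual codebase.

import numpy as np
import librosa

def extract_mfcc(samples, sr, n_mfcc=40):
    """Return the mean MFCC vector for one audio clip."""
    return np.mean(librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=n_mfcc), axis=1)

def test_extract_mfcc_shape():
    # one second of silence should still yield a fixed-length feature vector
    samples = np.zeros(22050, dtype=np.float32)
    assert extract_mfcc(samples, sr=22050).shape == (40,)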
6.4.2 Functional Testing
Functional tests provide systematic demonstrations that the functions tested are available as specified by the business and technical requirements, system documentation, and user manuals. Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
6.4.3 System Testing
System testing ensures that the entire integrated software system meets requirements. It tests a configuration to ensure known and predictable results. An example of system testing is the configuration-oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-driven process links and integration points.
6.4.4 Performance Testing
The performance test ensures that the output is produced within the required time limits, and measures the time taken by the system for compiling, responding to users, and retrieving the results of requests sent to the system.
6.4.5 Integration Testing
Software integration testing is the incremental integration testing of two or more integrated software components on a single platform, intended to expose failures caused by interface defects. The task of the integration test is to check that components or software applications, for example components in a software system or, one step up, software applications at the company level, interact without error. The following are the types of integration testing:
1. Top-down integration
2. Bottom-up integration
Bottom-Up Integration
Bottom-Up Integration testing is a methodology in which the modules are tested from the bottom of the control flow upwards; it is the opposite of top-down integration testing. In this approach the modules at the lower levels are tested first and then the higher-level modules are tested. Because the lower-level components, which include the more complex modules, are exercised first, any errors in the complex modules are found and resolved at an early stage.
6.4.6 Acceptance Testing
Once the application is ready to be released, the crucial step is user acceptance testing. In this step a group representing a cross-section of end users tests the application. The user acceptance testing is done using real-world scenarios and perceptions relevant to the end users. User acceptance testing is often the final step before rolling out the application. Usually, the end users who will be using the application test it before accepting it. This type of testing gives the end users confidence that the application being delivered to them meets their requirements.
Any project can be divided into units on which detailed processing can be performed, and a testing strategy for each of these units is then carried out. Testing helps to identify possible bugs in the individual components, so that the components containing bugs can be identified and rectified.
Quality assurance defines the objectives of a project and reviews the overall activities, so that errors are corrected early in the development process.
Testing
In system testing the common view is to eliminate program errors. This is extremely difficult and time consuming, since designers cannot prove 100% accuracy. A successful test, then, is one that finds errors.
Validation
Software validation checks that the software product satisfies or fits the intended use, that is, that the software meets the user requirements, not merely as specification artefacts or as the needs of those who will operate the software, but as the needs of all the stakeholders. There are two ways to perform software validation, internal and external. During internal software validation it is assumed that the goals of the stakeholders were correctly understood and that they were expressed in the requirement artefacts precisely and comprehensively. If the software meets the requirement specification, it has been internally validated. External validation happens when it is performed by asking the stakeholders whether the software meets their needs. Different software development methodologies call for different levels of user and stakeholder involvement and feedback, so external validation occurs when all the stakeholders accept the software product and express that it satisfies their needs. Such final external validation requires the use of an acceptance test, which is a dynamic test.
Risk is an expectation of loss, a potential problem that may or may not occur in the future. It is generally caused by a lack of information, control, or time. The possibility of suffering a loss in the software development process is called a software risk, and the aim is to reduce the probability or likelihood of such risks.
Risk monitoring
Software quality assurance consists of a variety of tasks associated with seven major activities:
Conduct of formal technical reviews
Software testing
Enforcement of standards
Control of change
Measurement
Record keeping and reporting
CHAPTER VII
CONCLUSION
AND
FUTURE
ENHANCEMENT
Conclusion
The Speech Emotion Recognition (SER) project leverages machine learning
algorithms to accurately detect and classify human emotions from speech signals. By
utilizing diverse datasets containing audio features such as pitch, energy, MFCCs, and
prosodic elements, machine learning techniques like support vector machines, random
forests, and deep neural networks are employed to interpret emotional states under
various speaking conditions. These predictive models can enhance human-computer
interaction, improve user experiences in virtual assistants, support mental health
monitoring, and optimize services in sectors like education, healthcare, and customer
support. Ultimately, this project has the potential to advance emotionally intelligent
systems, promote empathetic technology, and support real-time, emotion-aware
applications by providing more accurate and dynamic emotion recognition from speech.
Future Enhancement
o Multimodal Emotion Recognition: Future research can explore the fusion of speech with other modalities such as facial expressions, body language, physiological signals (e.g., heart rate), and textual sentiment to improve recognition accuracy and contextual understanding.
o Hybrid Model Architectures: Combining CNNs for feature extraction with LSTM or GRU models for temporal analysis could yield better performance, especially in noisy environments.
o Larger and More Diverse Datasets: Incorporating larger, real-world datasets with a wider variety of speakers, accents, and environments can improve the generalizability of the models.
o Transfer Learning: Transfer learning techniques may reduce training time and improve accuracy in emotion recognition, for example by applying models trained in one domain (e.g., movies or call recordings) to another.
o Cross-Lingual and Cross-Cultural Recognition: Developing systems that can accurately recognize emotions across different languages, accents, and cultures will enhance global applicability and robustness.
o Robustness to Background Noise: Improve the ability of SER systems to function effectively in real-world conditions with background noise, such as public spaces, phone calls, or crowded environments.
o Richer Modelling with Less Labeled Data: Explore mechanisms that capture complex patterns in emotional speech and reduce dependency on labeled data.
o Personalized Emotion Models: Design adaptive systems that learn from individual users' emotional patterns over time.
o Real-Time and Edge Deployment: Optimize systems for real-time analysis and low-latency deployment on edge devices.
o Emotion Intensity and Context: Enhance systems to not only classify emotions but also detect the intensity and context of the expressed emotion, moving beyond basic emotion categories (happy, sad, angry, etc.) towards more complex emotional states.
CHAPTER
VIII
APPENDICES
APPENDIX I
A1.1 Screenshot
Fig A 1.3 Selecting Audio
Fig A 1.5 Accuracy and training loss shown in a bar chart and pie chart
8.1 Tech Stack
Windows / Linux development environment
CHAPTER IX
REFERENCES
AND
BIBLIOGRAPHY
A2.1 References
[1] Khalil, Ruhul Amin, Edward Jones, Mohammad Inayatullah Babar, Tariqullah Jan,
Mohammad Haseeb Zafar, and Thamer Alhussain. "Speech emotion recognition using deep learning
techniques: A review." IEEE access 7 (2019): 117327-117345.
[2] Abbaschian, Babak Joze, Daniel Sierra-Sosa, and Adel Elmaghraby. "Deep learning
techniques for speech emotion recognition, from databases to models." Sensors 21, no.
4 (2021): 1249.
[3] Pandey, Sandeep Kumar, Hanumant Singh Shekhawat, and SR Mahadeva Prasanna. "Deep
learning techniques for speech emotion recognition: A review." In 2019 29th international
conference RADIOELEKTRONIKA (RADIOELEKTRONIKA), pp. 1-
6. IEEE, 2019.
[4] Han, Kun, Dong Yu, and Ivan Tashev. "Speech emotion recognition using deep neural network and
extreme learning machine." In Interspeech 2014. 2014.
[5] Tzirakis, Panagiotis, Jiehao Zhang, and Bjorn W. Schuller. "End-to-end speech emotion
recognition using deep neural networks." In 2018 IEEE international conference on acoustics,
speech and signal processing (ICASSP), pp. 5089-5093. IEEE, 2018.
[6] Jahangir, Rashid, Ying Wah Teh, Faiqa Hanif, and Ghulam Mujtaba. "Deep learning approaches
for speech emotion recognition: State of the art and research challenges." Multimedia Tools and
Applications 80, no. 16 (2021): 23745-23812.
[7] Lim, Wootaek, Daeyoung Jang, and Taejin Lee. "Speech emotion recognition using convolutional and
recurrent neural networks." In 2016 Asia-Pacific signal and information processing association
annual summit and conference (APSIPA), pp. 1-4. IEEE, 2016.
[8] Aggarwal, Apeksha, Akshat Srivastava, Ajay Agarwal, Nidhi Chahal, Dilbag Singh, Abeer Ali
Alnuaim, Aseel Alhadlaq, and Heung-No Lee. "Two-way feature extraction for speech emotion
recognition using deep learning." Sensors 22, no. 6 (2022): 2378.
[9] Lalitha, S., Shikha Tripathi, and Deepa Gupta. "Enhanced speech emotion detection using deep
neural networks." International Journal of Speech Technology 22 (2019): 497-510.
[10] Tarunika, K., Pradeeba, R.B. and Aruna, P., 2018, July. Applying machine learning techniques for
speech emotion recognition. In 2018 9th international conference on computing, communication and
networking technologies (ICCCNT) (pp. 1-5). IEEE.
[11] Zhou, Xi, Junqi Guo, and Rongfang Bie. "Deep learning based affective model for speech
emotion recognition." In 2016 Intl IEEE Conferences on Ubiquitous Intelligence & Computing,
Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data
Computing, Internet of People, and Smart World Congress
(UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), pp. 841-846. IEEE, 2016.
[12] Suganya, S. and Charles, E.Y.A., 2019, September. Speech emotion recognition using deep learning
on audio recordings. In 2019 19th International Conference on Advances in ICT for Emerging
Regions (ICTer) (Vol. 250, pp. 1-6). IEEE.
[13] Asiya, U. A., & Kiran, V. K. (2021, November). Speech emotion recognition-a deep learning
approach. In 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and
Cloud) (I-SMAC) (pp. 867-871). IEEE.
[14] Yadav, Satya Prakash, Subiya Zaidi, Annu Mishra, and Vibhash Yadav. "Survey on machine
learning in speech emotion recognition and vision systems using a recurrent neural network (RNN)."
Archives of
A2.2 Bibliography
[1] https://youtu.be/z_dbnYHAQYg?feature=shared
[2] https://medium.com/analytics-vidhya/haar-cascades-explained-38210e57970d
[3] https://flask.palletsprojects.com/
[4] https://www.sqlite.org/
[5] https://youtu.be/-VQL8ynOdVg?si=YWs1xvp9vMURJPa3