ANN Report

The project report details the implementation of an Automatic Music Emotion Classification system using a CNN+LSTM architecture to classify music into four emotional categories: Happy/Energetic, Calm/Peaceful, Angry/Tense, and Sad/Melancholic. It highlights the significance of Music Emotion Recognition (MER) in applications such as music streaming services and mental health therapy, while also addressing challenges like subjective emotional perception and computational complexity. The model achieved approximately 85% accuracy, demonstrating its effectiveness in analyzing music patterns and predicting emotions.


K J Somaiya Institute of Technology

An Autonomous Institute Permanently Affiliated to University of Mumbai.

A Project Based Learning

Report On

“Implement Automatic Music Emotion Classification Using NN”
SUBMITTED BY

Gargi Bendale--(16)

Riya Birnale--(18)

Krishna Jogi--(35)

Guide

Prof. Milind Nemade


Department of Artificial Intelligence and Data Science
2024-25
K J Somaiya Institute of Technology
UNIVERSITY OF MUMBAI

CERTIFICATE

This is to certify that the project titled “Implement Automatic Music Emotion Classification Using NN” is completed under my supervision and guidance in partial fulfilment of the requirements of the course ANN, by the following students:

Gargi Bendale--(16)

Riya Birnale--(18)

Krishna Jogi--(35)
The course is a part of semester VI of the Department of Artificial Intelligence and Data Science
during the academic year 2024-2025. The said work has been assessed and is found to be
satisfactory.

(Internal guide name and sign.)

College seal
Table of Contents
Sr.No Content Page No.
1 Introduction 2
2 Literature Survey 4
3 Problem Definition 5
4 Working of Project 6
5 Analysis of Different Methods 7
6 Performance parameters 8
7 Flowchart of Program 9
8 Result 11
9 Conclusion 12
10 References 13

1. Introduction

Background and Motivation


Music has always played a significant role in shaping human emotions, evoking feelings of happiness,
sadness, excitement, or calmness. The ability to recognize and categorize these emotions is crucial
for various applications such as music recommendation systems, emotional therapy, and interactive
AI-driven music platforms. The field of Music Emotion Recognition (MER) seeks to automatically
classify music into different emotional categories based on its acoustic and structural properties.

Importance of Music Emotion Recognition


 Music Streaming Services: Personalized playlists based on mood (e.g., Spotify and Apple
Music).
 Music Therapy: Assisting in emotional well-being and mental health treatments.
 AI and Human-Computer Interaction: Enhancing the experience in gaming, virtual reality,
and assistive technologies.

Traditional methods of music classification were based on genre or manually annotated metadata.
However, these approaches fail to capture the complex emotional impact of music. Recent
advancements in machine learning and deep learning, particularly Convolutional Neural Networks
(CNNs) and Long Short-Term Memory (LSTM) networks, have shown promising results in
extracting meaningful features from music and understanding its emotional content.

This report focuses on the development of an MER system using deep learning techniques,
specifically CNN+LSTM architecture, to analyze spectrograms of music and classify them into
predefined emotional categories.

Objectives
 Develop an MER system using a CNN+LSTM architecture.
 Extract Mel Spectrograms from songs to analyze spatial and temporal features.
 Classify music into four emotion categories: Happy/Energetic, Calm/Peaceful,
Angry/Tense, and Sad/Melancholic.
 Enable real-time emotion prediction for new songs.
 Contribute to AI-driven music analytics, personalized recommendations, and affective
computing.

2. Literature Survey

Traditional Feature-Based Approaches


Early studies in MER relied on handcrafted features such as MFCCs (Mel-Frequency Cepstral
Coefficients), Chroma Features, and Rhythm Patterns. Research by Laurier et al. (2009) used
Support Vector Machines (SVMs) and k-Nearest Neighbors (k-NN) for emotion classification based
on such features. While these methods achieved moderate accuracy, they struggled with complex
emotional dynamics.

Arousal-Valence Models in MER


Russell’s Circumplex Model of Affect (1980) has been widely adopted for emotion
representation, mapping emotions on arousal (energy level) and valence (positivity or negativity)
dimensions. Studies like Yang et al. (2012) developed regression-based models for continuous
emotion prediction using arousal and valence values at different time intervals.

Deep Learning in MER


 CNNs for Feature Extraction: CNNs extract spatial and frequency-based features from
spectrograms, identifying pitch, tone, and intensity.
 LSTMs for Capturing Time-Series Dependencies: LSTMs model emotional progression in
music by analyzing sequential patterns over time.
 Comparison with Other Architectures:
o CNN+LSTM vs. GRUs: GRUs are computationally efficient but similar in
performance to LSTMs.
o CNN+LSTM vs. Transformers: Transformers handle long-term dependencies better
but require more computational resources.
o Why CNN+LSTM? CNN+LSTM balances efficiency and accuracy, making it a
reliable choice for MER tasks.

3. Problem Definition

Music classification based on genre is common, but recognizing emotions from music remains
challenging due to:
 The subjective nature of emotions.
 Variations in tempo, pitch, and harmony affecting perceived emotions.
 The need for real-time classification in applications like music streaming and therapy.
This project aims to develop a neural-network-based MER system that effectively classifies emotions
from music, leveraging a CNN+LSTM architecture.

4. Working of Project

Dataset and Feature Extraction

 Dataset: The dataset consists of 1,802 MP3 songs, each associated with arousal and valence
values recorded at 500ms intervals. These values represent the song’s emotional intensity
and positivity.
 Labeling Process: Since explicit emotion labels were unavailable, we manually labeled the
dataset by mapping arousal and valence values to four emotional categories:
Happy/Energetic, Calm/Peaceful, Angry/Tense, and Sad/Melancholic.
 Feature Extraction: Mel Spectrograms are extracted from the audio files, capturing frequency and time information (a short extraction sketch follows this list).
 Data Preprocessing: Normalization and augmentation techniques are applied to improve model generalization.
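The following is a minimal sketch of the extraction and normalization steps referenced above, assuming the librosa library; the sampling rate, 30-second analysis window, and number of Mel bands are illustrative choices, not values taken from the report.

```python
import librosa
import numpy as np

def extract_mel_spectrogram(mp3_path, sr=22050, duration=30.0, n_mels=128):
    """Load an audio clip and return a normalized log-Mel spectrogram of shape (n_mels, time_frames)."""
    y, sr = librosa.load(mp3_path, sr=sr, duration=duration)         # decode and resample the MP3
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)  # power Mel spectrogram
    log_mel = librosa.power_to_db(mel, ref=np.max)                   # convert to a log (dB) scale
    # Min-max normalization so every song lies in [0, 1], as in the preprocessing bullet above.
    return (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
```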

Model Architecture
1. CNN Layer: Extracts spatial features from the Mel spectrograms.
2. LSTM Layer: Processes the extracted features over time to detect emotion transitions.
3. Fully Connected Layer: Maps the learned features to the four emotion categories.
4. Softmax Layer: Outputs the final probability for each emotion category (a minimal architecture sketch follows this list).
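The sketch below shows one way to realize this architecture in Keras. It is an illustration under stated assumptions, not the report's exact network: the layer sizes, the 128×1292 input shape (128 Mel bands over roughly a 30-second clip), and the optimizer are assumptions.

```python
from tensorflow.keras import layers, models

N_MELS, N_FRAMES, N_CLASSES = 128, 1292, 4   # assumed spectrogram size and the four emotion classes

def build_cnn_lstm(n_mels=N_MELS, n_frames=N_FRAMES, n_classes=N_CLASSES):
    inputs = layers.Input(shape=(n_mels, n_frames, 1))                  # Mel spectrogram as a 1-channel image
    x = layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D((2, 2))(x)                                  # downsample mel and time axes
    x = layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Permute((2, 1, 3))(x)                                    # put the (reduced) time axis first
    x = layers.Reshape((n_frames // 4, (n_mels // 4) * 64))(x)          # one feature vector per time step
    x = layers.LSTM(64)(x)                                              # model emotional progression over time
    outputs = layers.Dense(n_classes, activation="softmax")(x)          # probability for each emotion category
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The Permute/Reshape pair is what hands the CNN feature maps to the LSTM as a sequence along the time axis, which is the essence of the CNN+LSTM design described above.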

Training and Evaluation


 Training: The model is trained using a supervised learning approach with labels derived from
arousal-valence thresholds.
 Hyperparameter Tuning: Batch size, learning rate, and model depth are tuned; illustrative values appear in the training sketch after this list.
 Evaluation Metrics: Performance is measured using accuracy, precision, recall, F1-score,
and confusion matrix analysis.
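A minimal training-and-evaluation sketch under the same assumptions, reusing build_cnn_lstm from the architecture sketch above; the random stand-in data, epoch count, and batch size are purely illustrative, not the report's actual settings.

```python
import numpy as np

# Stand-in data; in the real pipeline X comes from the Mel-spectrogram step
# and y from the arousal-valence labelling step (integer classes 0-3).
X = np.random.rand(64, 128, 1292, 1).astype("float32")
y = np.random.randint(0, 4, size=64)
X_train, X_test = X[:48], X[48:]
y_train, y_test = y[:48], y[48:]

model = build_cnn_lstm()                        # defined in the architecture sketch above
history = model.fit(X_train, y_train,
                    validation_split=0.2,       # hold out part of the training set for tuning
                    epochs=30, batch_size=16)   # illustrative hyperparameter values
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc:.3f}")
```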

5. Analysis of Different Methods

Comparison of Approaches
 Supervised Learning (CNN+LSTM): Offers high accuracy and learns hierarchical audio
patterns but requires labelled data.
 Unsupervised Learning (Clustering): Useful for exploratory analysis but lacks accuracy in
classification.
 Traditional Feature Extraction (MFCCs, Chroma): Simple and interpretable but requires
manual feature selection.
 Transformers: Captures long-term dependencies but demands large datasets and high
computational resources.

Experimental Analysis
 CNN+LSTM achieved the best balance between accuracy and computational efficiency.
 Clustering-based methods were less reliable for real-time applications.
 Transformers showed potential but were resource-intensive.

6. Performance Parameters

Key Metrics for Evaluation

 Accuracy: Measures the percentage of correctly classified emotions.


Accuracy = (TP + TN) / (TP + TN + FP + FN)
 Precision: Of the songs predicted as a given emotion, the fraction that truly belong to it.
Precision = TP / (TP + FP)
 Recall (Sensitivity): Of the songs that truly belong to a given emotion, the fraction the model correctly detects.
Recall = TP / (TP + FN)
 F1-Score: Balances precision and recall.
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
 Confusion Matrix: Analyzes classification errors and misclassifications.
A matrix representation of True Positives (TP), False Positives (FP), True Negatives (TN),
and False Negatives (FN) to evaluate model performance.
 Loss Function (Cross-Entropy Loss): Measures the difference between the predicted and actual probability distributions.
Cross-Entropy Loss = -Σ (y * log(y_pred))
Where y is the actual (one-hot) class label and y_pred is the predicted probability for that class. A short sketch computing these metrics follows this list.
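The sketch referenced above computes these metrics with scikit-learn; the tiny hand-written labels and random probabilities are stand-ins for the model's real test-set outputs, and macro averaging over the four classes is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, log_loss)

# Stand-ins for the test labels and the model's softmax outputs over the 4 classes.
y_true = np.array([0, 1, 2, 3, 0, 2, 1, 3])
y_prob = np.random.dirichlet(np.ones(4), size=len(y_true))
y_pred = y_prob.argmax(axis=1)

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("F1-score :", f1_score(y_true, y_pred, average="macro", zero_division=0))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print("Cross-entropy loss:", log_loss(y_true, y_prob, labels=[0, 1, 2, 3]))
```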

7. Flowchart of Program

1. Start
The system begins by initializing all necessary components, including importing libraries and setting
up the environment for data processing.

2. Load Dataset (Arousal and Valence CSV, MP3 Files)


 The dataset consists of 1,802 MP3 files and their corresponding arousal and valence values,
recorded at 500ms intervals.
 These CSV files contain song IDs as keys, with numerical values representing emotional intensity (arousal) and positivity (valence); a loading sketch follows this list.
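A minimal loading sketch, assuming pandas and a layout with one row per song, a song_id column, and one column per 500 ms annotation step; the file names and column name are assumptions about the dataset, not details confirmed by the report.

```python
import pandas as pd

# Assumed layout: one row per song, indexed by song_id, one column per 500 ms time step.
arousal = pd.read_csv("arousal.csv", index_col="song_id")
valence = pd.read_csv("valence.csv", index_col="song_id")

# Collapse the per-500ms annotations into a single mean value per song,
# which is what the labelling step below works from.
song_arousal = arousal.mean(axis=1)
song_valence = valence.mean(axis=1)
```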

3. Preprocess Data: Normalize Values, Generate Emotion Labels


 Normalization: Arousal and valence values are scaled to a uniform range (e.g., between 0
and 1) to improve model efficiency.
 Label Generation: Since explicit labels are unavailable, emotions are derived from arousal and valence values (a mapping sketch follows this list):
o High Arousal + High Valence → Happy/Energetic
o Low Arousal + High Valence → Calm/Peaceful
o High Arousal + Low Valence → Angry/Tense
o Low Arousal + Low Valence → Sad/Melancholic
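The quadrant mapping above can be expressed as a small helper; this is a sketch, and the decision threshold (0 here, assuming annotations centred around zero; use 0.5 for values scaled to [0, 1]) is an assumption rather than the report's exact cut-off.

```python
def label_emotion(arousal, valence, threshold=0.0):
    """Map a song's mean arousal/valence pair to one of the four emotion quadrants."""
    if arousal >= threshold and valence >= threshold:
        return "Happy/Energetic"      # high arousal, high valence
    if arousal < threshold and valence >= threshold:
        return "Calm/Peaceful"        # low arousal, high valence
    if arousal >= threshold and valence < threshold:
        return "Angry/Tense"          # high arousal, low valence
    return "Sad/Melancholic"          # low arousal, low valence

print(label_emotion(0.4, 0.6))    # Happy/Energetic
print(label_emotion(-0.3, -0.5))  # Sad/Melancholic
```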

4. Extract Mel Spectrograms from Audio


 Mel Spectrograms are generated from each MP3 file to capture frequency and time-based
information.
 CNN is used to extract spatial features from these spectrograms.

5. Reshape Data for CNN + LSTM


 The extracted spectrograms are reshaped into a format suitable for deep learning models (a short reshape sketch follows).
 CNN layers handle feature extraction, while LSTM layers process sequential dependencies over time.
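A short sketch of the reshape step, assuming the per-song spectrograms from Step 4 all share the same shape; the dummy arrays only illustrate the target tensor layout.

```python
import numpy as np

# Stand-ins for per-song log-Mel spectrograms of identical shape (n_mels, time_frames).
spectrograms = [np.random.rand(128, 1292).astype("float32") for _ in range(4)]

X = np.stack(spectrograms)   # (num_songs, n_mels, time_frames)
X = X[..., np.newaxis]       # add a channel axis for the Conv2D layers
print(X.shape)               # (4, 128, 1292, 1) -- the input format used in the architecture sketch
```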

6. Train CNN + LSTM Model


 The CNN extracts spatial features from spectrograms.
 The LSTM models temporal dependencies to understand emotional progressions within a
song.
 The model is trained using supervised learning, with labels derived from the arousal-valence
mapping.

7. Evaluate Performance (Accuracy, Loss, Confusion Matrix)


 The model’s performance is tested using metrics like accuracy, precision, recall, F1-score,
and a confusion matrix.
 Loss functions (such as cross-entropy loss) measure how closely the predicted probabilities match the true labels on held-out data.

8. Predict Emotion for New Song


 When a new song is provided, it undergoes the same preprocessing and feature-extraction steps as the training data (see the inference sketch below).
 The CNN+LSTM model then classifies the song into one of the four emotion categories.
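A minimal end-to-end prediction sketch, reusing the extract_mel_spectrogram and build_cnn_lstm sketches from earlier sections; the class ordering in EMOTIONS, the hypothetical file path, and the requirement that the clip length match the training input shape are assumptions.

```python
import numpy as np

EMOTIONS = ["Happy/Energetic", "Calm/Peaceful", "Angry/Tense", "Sad/Melancholic"]  # assumed label order

def predict_emotion(mp3_path, model):
    """Apply the training-time preprocessing to a new song and return its predicted emotion."""
    mel = extract_mel_spectrogram(mp3_path)      # same extraction sketch as in Section 4
    x = mel[np.newaxis, ..., np.newaxis]         # add batch and channel dimensions
    probs = model.predict(x)[0]                  # softmax probabilities over the 4 classes
    return EMOTIONS[int(np.argmax(probs))], probs

# Example usage with a trained model and a hypothetical file path:
# emotion, probs = predict_emotion("new_song.mp3", trained_model)
```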

9. End
The system outputs predicted emotions and can be used in applications like music
recommendation systems, therapy, gaming, and media analysis.

8. Result

Model Performance
 Accuracy: Achieved an accuracy of approximately 85% on test data.
 Emotion Trends:
o High valence + high arousal → Happy
o Low valence + high arousal → Angry
o Low valence + low arousal → Sad

Visualization & Interpretability


 Spectrogram representations for different emotional categories.
 Accuracy and loss graphs over training epochs.
 Confusion matrix to analyze misclassifications.
 Comparison of ground truth vs. predicted emotions.

Spectrogram of a Happy Music Clip

9. Conclusion

Music Emotion Recognition (MER) using deep learning is a significant advancement in the
field of artificial intelligence, allowing for a more profound understanding of how music
influences human emotions. This project successfully implemented a CNN+LSTM-based
model that classifies music into four primary emotional categories: Happy/Energetic,
Calm/Peaceful, Angry/Tense, and Sad/Melancholic. By leveraging arousal and valence
values, the system effectively analyzes music patterns and assigns corresponding emotions
with high accuracy. The manually labeled dataset, created using arousal-valence mapping,
played a crucial role in training the model to recognize emotional variations accurately.

The impact of MER extends across multiple domains. Music streaming services can
integrate MER to provide mood-based recommendations, enhancing user engagement.
Mental health applications can benefit from emotion-driven playlists that assist in therapy
and relaxation techniques. In interactive AI systems, MER can enhance gaming experiences
by adapting in-game soundtracks to match player emotions, creating a more immersive and
dynamic environment. Moreover, the application of MER in film scoring can help automate
the selection of background music that aligns with the mood of a scene.

Despite its promising capabilities, MER faces several challenges. Subjectivity in emotional
perception makes classification difficult, as different listeners may interpret the same song
differently. Additionally, the overlap of emotional categories complicates classification,
requiring more nuanced models for better accuracy. Computational complexity is another
hurdle, as training deep learning models on large-scale audio data demands significant
processing power. The trade-off between real-time processing and model accuracy must
also be addressed for MER to be effectively deployed in commercial applications.

10. References

 IEEE Paper: Music Emotion Recognition Using Deep Learning.


 YouTube Tutorials:
o Music Emotion Recognition | Deep Learning Project.
o Spectrograms and Feature Extraction.
 Hizlisoy, S., Yildirim, S., & Tufekci, Z. (2020). Music emotion recognition using convolutional
long short-term memory deep neural networks. Neural Computing and Applications, 32(10),
6629-6641.
 Zhao, S., Li, Y., Yao, X., Nie, W., Xu, P., Yang, J., & Keutzer, K. (2020). Emotion-based
end-to-end matching between image and music in valence-arousal space. arXiv preprint
arXiv:2009.05103.
 https://ieeexplore.ieee.org/document/10729602
