
Spotify Genre Prediction

Nandini Chaganti, Manya Mallikarjun, Keerthi Reddy Sure
Luddy School of Informatics, Computing and Engineering
Indiana University Bloomington
Bloomington, Indiana
[email protected], [email protected], [email protected]

Keywords: Genre prediction, One-Hot Encoding, MinMax Scaler, LGBM Classifier, Decision Tree Classifier
I. ABSTRACT

Genres vary from song to song based on different properties. Proper classification of these features helps predict the genre, which can be quite useful for listeners interested in hearing songs from a particular genre. A dataset consisting of these features was collected to perform the classification. Before that, preprocessing steps were employed to clean the data, including techniques to remove null values, negative values, and unidentified characters ("?"). Different modeling techniques[1] were then applied to predict the genre, and the accuracy of each model was computed. The results indicate that the LGBM classifier performs better than other classifiers such as KNN and Ridge.

II. INTRODUCTION

Music is a very subjective topic, and nowadays everyone includes music in their daily lives. There are various kinds of songs, and music tastes vary from person to person: the type of music a person prefers depends on their mood as well as their interests or preferences[6]. If a lot of people enjoy a song, it is regarded as a hit because it is popular and frequently played. Based on the preferences of listeners, music can be grouped using genre prediction. Spotify can categorize songs from all genres with ease and can even suggest musical subgenres to its customers.

This project aims at predicting the genre of songs, where the genre depends on various factors like danceability, energy, liveness, tempo, acousticness, etc.

III. DATASET

The dataset contains a varied set of columns; Fig A describes these columns. The dataset has 50,000 rows, and the columns refer to the different properties that distinguish one song from another.

Fig A. Analyzing dataset

The collected dataset contains 10 genres: Electronic, Anime, Jazz, Alternative, Country, Rap, Blues, Rock, Classical, and Hip-Hop. All of these are considered popular genres around the world.

Fig B. Types of Music Genre

The dataset has an equal composition of all genres: each genre makes up 10% of the data, together constituting the entire dataset.

Fig C. Ratio of data for every genre in the dataset
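As a quick sanity check on the balanced composition described above, a minimal sketch using pandas. The report does not name the actual CSV file or its genre column, so a tiny synthetic stand-in frame (with a hypothetical "music_genre" column) is built in memory instead:

```python
import pandas as pd

GENRES = ["Electronic", "Anime", "Jazz", "Alternative", "Country",
          "Rap", "Blues", "Rock", "Classical", "Hip-Hop"]

# Synthetic stand-in for the 50,000-row dataset (5,000 rows per genre);
# "music_genre" is a hypothetical column name, not confirmed by the report.
df = pd.DataFrame({"music_genre": GENRES * 5000})

# Each genre should account for exactly 10% of the rows.
share = df["music_genre"].value_counts(normalize=True)
print(len(df), share.min(), share.max())  # 50000 0.1 0.1
```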

IV. DATA PREPROCESSING

The data is cleaned using the different preprocessing techniques explained below. These steps helped to obtain better accuracy.

1. Removing columns with zero correlation:
The track_name and instance_id columns are dropped, as they contain only unique values and have no correlation with the other columns.

2. Removing NaN and duplicate values:
Duplicates and NaN values are dropped using drop_duplicates and dropna.

3. One-Hot Encoding:
It is necessary to transform categorical data into numerical form and to map model predictions back to categorical form. One-Hot Encoding is used to convert the categorical data into numeric data. With this method, 16 new columns based on key, mode, and obtained_data are generated and filled with 0's and 1's; the original key, mode, and obtained_data columns are then dropped.

4. Replacing -1 and "?" values with the mean value:
The duration_ms and tempo columns contain -1 and "?" values, which are replaced with the mean values of the same columns.

5. MinMax Scaler:
It is a form of normalization that maps all the available data into the range between the min and max values, which in most cases are the defaults of 0 and 1. MinMax scaling is performed on the duration_ms column, as its values have a very high range compared to the other columns, so normalization is used on this column. Fig D shows the skewed distribution of the duration_ms column.

Fig D. Normalization graph for duration_ms

6. Feature Selection:

Fig D. Heat Map for columns in the data set

After observing the heat map, columns with a correlation above the 0.80 threshold were removed: the energy and 4-Apr columns were dropped, as they exceeded this threshold when compared with the loudness and 3-Apr columns respectively. The Python packages used for this step are SequentialFeatureSelector from mlxtend.feature_selection and LinearRegression from sklearn.linear_model; linear regression is used as the estimator for feature selection. After a successful run, the code selected the best 13 features in order to reduce dimensionality, ignoring the extra columns. These 13 features strike a good balance between the accuracy and efficiency of the model. The selected features are: popularity, acousticness, danceability, duration_ms, instrumentalness, liveness, loudness, speechiness, tempo, valence, F, Minor, 3-Apr.

V. MODELS

After preprocessing the music genre data using the multiple techniques[4] discussed above,
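The cleaning steps above (dropping columns, removing duplicates and NaNs, replacing -1/"?" placeholders with column means, one-hot encoding, MinMax scaling) can be sketched as follows. This is an illustrative reconstruction on a tiny synthetic frame, not the authors' actual code; column names are taken from the report, the values are invented, and pd.get_dummies stands in for the one-hot encoding step:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny synthetic frame using column names from the report (values invented).
df = pd.DataFrame({
    "instance_id": [1, 2, 2, 3],
    "track_name": ["a", "b", "b", "c"],
    "key": ["A", "B", "A", "C"],
    "mode": ["Major", "Minor", "Major", "Minor"],
    "duration_ms": [200000.0, -1.0, 200000.0, 400000.0],
    "tempo": ["120.0", "?", "120.0", "90.0"],
})

# Steps 1-2: drop zero-correlation columns, then duplicates and NaN rows.
df = df.drop(columns=["instance_id", "track_name"])
df = df.drop_duplicates().dropna()

# Step 4: replace "?" and -1 placeholders with the column mean.
df["tempo"] = pd.to_numeric(df["tempo"].replace("?", np.nan))
df["duration_ms"] = df["duration_ms"].replace(-1.0, np.nan)
for col in ["duration_ms", "tempo"]:
    df[col] = df[col].fillna(df[col].mean())

# Step 3: one-hot encode the categorical columns and drop the originals.
df = pd.get_dummies(df, columns=["key", "mode"])

# Step 5: scale duration_ms into [0, 1] with MinMaxScaler.
df[["duration_ms"]] = MinMaxScaler().fit_transform(df[["duration_ms"]])
print(df["duration_ms"].min(), df["duration_ms"].max())  # 0.0 1.0
```

For step 6, the report's feature selection uses mlxtend's SequentialFeatureSelector with a LinearRegression estimator to keep the best 13 features; the same idea is also available as sklearn.feature_selection.SequentialFeatureSelector.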
the models are trained on 80% of the data and predictions are made on the remaining 20%. All the genres in the data have equal composition. Different classification methods are applied to this data, as listed below, along with a basic definition of each classifier used to obtain the accuracy.

A. LGBM Classifier
It is a gradient boosting[2] framework that uses tree-based learning algorithms and is regarded as one of the most powerful computation-based algorithms, known for its fast processing speed. Using this classifier, the obtained accuracy is comparatively higher than that of the other classifiers.

B. KNN
The k-nearest neighbors algorithm, often known as KNN or k-NN, is a supervised learning classifier that makes predictions about how a single data point should be grouped. Although it can be used to solve classification or regression problems, it is most frequently used for classification, as it is predicated on the notion that similar points can be found near one another.

C. Ridge Classifier
The Ridge Classifier, based on the Ridge regression methodology, transforms the label data into the range [-1, 1]. Multiple-output regression is used for multiclass data, and the predicted class is the one with the greatest prediction value.

D. Random Forest
Random forest is a widely used machine learning technique that combines the output of multiple decision trees to reach a single conclusion. Its adaptability has boosted its popularity, since it can solve both classification and regression problems.[3]

E. Decision Tree
Decision tree classifiers are supervised machine learning models: an algorithm is trained to make predictions using pre-labeled data. Regression problems can also be solved with decision trees.

F. Support Vector Classifier
The training examples are plotted in space such that there is an apparent gap between the classes. A hyperplane splitting the two classes is predicted; maximizing the distance from the hyperplane to the closest data point of either class is the main goal while drawing the hyperplane. Such a hyperplane is referred to as a maximum-margin hyperplane.

G. Logistic Regression
Logistic regression becomes a classification approach only when a decision threshold is included; the threshold value, a crucial component of logistic regression, is determined by the classification problem itself. Based on a given dataset of independent factors, logistic regression calculates the likelihood that an event will occur. Since the result is a probability, the dependent variable ranges from 0 to 1.

H. Naive Bayes
The Naive Bayes classifier is a simple model frequently employed in classification problems. Its core principles are easy to understand, and the arithmetic that supports them is also quite approachable. Nevertheless, this model performs surprisingly well in many situations, and it and its variants are used to solve many problems.

VI. RESULTS

The accuracy of each of the defined classifier models is obtained after running them on the given dataset. The classifiers have to predict the correct genre from the set of ten genres, and the results vary considerably between classifiers. The one that stands out, with the highest accuracy of 62.8%, is the LGBM Classifier; KNN takes the next place with 51.98% accuracy. Eight different classifiers are used in order to compare and obtain the best accuracy possible. These accuracies are tabulated in Table A.

Classifiers          | ROC AUC Score | Accuracy (%)
---------------------|---------------|-------------
LGBM Classifier      | 0.9822        | 62.80
Logistic Regression  | 0.7640        | 29.14
KNN                  | 0.9647        | 51.98
Support Vector       | 0.561         | 17.99
Decision Tree        | 0.999         | 43.81
Random Forest        | 0.884         | 47.12
Naive Bayes          | 0.853         | 41.81
Ridge                | 0.465         | 46.84

Table A. Accuracy results of genre prediction
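The 80/20 training and evaluation loop described above can be sketched as follows. This uses a synthetic stand-in for the preprocessed 13-feature matrix and a subset of the scikit-learn classifiers; the report's headline LGBMClassifier (from the third-party lightgbm package) would be added to the dict in the same way. Per-model ROC AUC, as in Table A, can be computed with roc_auc_score(..., multi_class="ovr") on predict_proba for models that expose it:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed data: 13 features, 10 genre classes.
X, y = make_classification(n_samples=2000, n_features=13, n_informative=10,
                           n_classes=10, random_state=42)

# 80/20 train/test split, as described in the report.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    # "LGBM": lightgbm.LGBMClassifier(),  # report's best model (3rd-party pkg)
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "Ridge": RidgeClassifier(),
}

# Fit each classifier and record its test-set accuracy.
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {results[name]:.2%}")
```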
Fig E represents the confusion matrix constructed from the predicted and actual labels when the LGBM classifier is used. The confusion matrix clearly shows the predictions for all ten genres.

Fig E. Confusion Matrix for LGBM Classifier

VII. DISCUSSION

From the results, it is clear that the LGBM classifier provides better accuracy than the other models: Logistic Regression, KNN, Decision Tree, Support Vector, Random Forest, Naive Bayes, and Ridge. It provides an accuracy of 62.8%, while KNN and Random Forest provide accuracies of 51.98% and 47.12% respectively. For this particular dataset, using the LGBM classifier, the music genre 'Anime' is the best predicted[5], followed by 'Rock', and the least well predicted is 'Country'.

VIII. REFERENCES

[1] Luo, Kehan. "Machine Learning Approach for Genre Prediction On Spotify Top Ranking Songs." (2018).

[2] Pham, Bang-Dang, Minh-Triet Tran, and Hoang-Long Pham. "Hit Song Prediction Based on Gradient Boosting Decision Tree." 2020 7th NAFOSTED Conference on Information and Computer Science (NICS), IEEE, 2020, pp. 356-361.

[3] Breiman, Leo. "Random Forests." Machine Learning 45.1 (2001): 5-32.

[4] Adragna, Robert, and Yuan Hong Bill Sun. "Music Genre Classification." MIE324 Project Report (2019).

[5] Huang, Derek A., Arianna A. Serafini, and Eli J. Pugh. "Music Genre Classification." CS229 Stanford (2018).

[6] Dawson Jr, Christopher E., et al. "Spotify: You have a Hit!." SMU Data Science Review 5.3 (2021): 9.
