A Model For Predicting Music Popularity On Streaming Platforms
RESEARCH ARTICLE
Abstract: The global music market moves billions of dollars every year, most of which comes from streaming
platforms. In this paper, we present a model for predicting whether or not a song will appear in Spotify’s Top 50,
a ranking of the 50 most popular songs on Spotify, which is one of today’s biggest streaming services. To make
this prediction, we trained different classifiers with audio features of songs that appeared in
this ranking between November 2018 and January 2019. When tested on data from June and July 2019, an
SVM classifier with RBF kernel obtained accuracy, precision, and AUC above 80%.
Keywords: music — hit song science — machine learning — Spotify
1 Institute of Computing, Federal University of Amazonas, Manaus, Brazil
*Corresponding author: [email protected]
DOI: http://dx.doi.org/10.22456/2175-2745.107021 • Received: 30/08/2020 • Accepted: 23/10/2020
CC BY-NC-ND 4.0 - This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
and record labels⁵,⁶. We consider a song to be popular if it has been featured in Spotify’s Top 50 Global daily ranking, which contains the 50 songs with the most listeners on the day before each edition. To make predictions, the model employs audio features provided by the platform’s API. These features indicate whether the songs are danceable, energetic, acoustic, instrumental, among other possibilities.

A previous version of this work was presented at the 2019 edition of the Brazilian Symposium on Computer Music (SBCM) [3]. During this study, some models were developed using different approaches, such as ranking positions [4] and acoustic characteristics of songs [5]. In the SBCM paper, we presented the results of a model that predicts whether a song on Spotify’s Viral 50 Global ranking will appear on the Top 50 Global ranking and vice-versa. The main scope of this paper differs from the previous work in that, here, the idea is to predict whether a song will be popular even before its release.

The remainder of this paper is organized as follows: In Section 2 we present related work, while in Section 3 we describe our methodology. In Section 4 we show the results we obtained, and we discuss them in Section 5. Finally, in Section 6 we present our final remarks and point out future directions.
2. Related Work

Hit Song Science (HSS) models are generally based on supervised machine learning techniques, so training data is needed to build them. Different sources of data have been used in the literature. In our studies, we identified three types of data sources as the most common: songs’ acoustic features, social network information, and concert and festival data. In this section, we present research that used these data sources to make predictions.

Regarding the use of concert information to make predictions, Arakelyan et al. [6] collected data from the SongKick website⁷. The data contained the location, the list of participating artists, the event name, and a value indicating the event’s popularity given by the website. The authors considered an artist to be popular if they had a contract with one of the following record companies: Sony BMG, Universal Music Group, or Warner. The labels affiliated with these companies were also considered markers of success. The authors applied logistic regression to predict whether an artist would succeed or not. The maximum accuracy obtained was 39%.

Another work that also used data from concerts and festivals was conducted by Steininger and Gatzemeier [7]. For each event, the authors obtained some 20 parameters identified with Amazon Mechanical Turk⁸ contributors. From this data, they sought to predict whether or not the songs of the artists who participated in these events would appear on a list of Germany’s 500 most popular songs in 2011. There is no information on where this list was published. The authors were able to show that there was a correlation in the data at a 95% level of statistical significance. However, the maximum accuracy obtained was 43.5%, using the PLS-SEM approach.

Regarding the use of social network data, Kim, Suh, and Lee [8] collected messages on Twitter associated with the tags #nowplaying, its abbreviated version (#np), and #itunes (a digital music selling platform). With this data, they sought to predict whether a song would be successful. For the authors, success is achieved when the song appears up to a certain position on the Billboard Hot 100⁹ (this position was varied in the experiments). The authors calculated different correlation coefficients between the number of messages collected and the success of each song. The maximum value was 0.41, which may indicate that there is no correlation between them. Even with such an obstacle, the authors applied a random forest classifier, obtaining 90% accuracy in the model where a song is only considered successful if it is in the top ten.

In a different approach, Herremans, Martens, and Sörensen [9] created a model for predicting the popularity of Dance songs using acoustic features. For a song to be considered popular in this research, it should be up to a certain position in the Official Charts Company Top 40 Dance Music¹⁰ (just as in the previous work, this position was also varied in the experiments). The authors collected metadata and information on acoustic features of the tracks that appeared in this ranking between 2009 and 2013 using The Echo Nest¹¹. Three distinct experiments were performed, in which different parameters for a song to be considered popular were tested. The best results were obtained with the Naive Bayes classifier. In that experiment, a song had to be in the top ten to be considered popular and between positions 31 and 40 to be considered unpopular. Songs in positions 11 to 30 were discarded. Given all these assumptions, the authors obtained accuracy and AUC of 65%.

In addition to this work, Karydis et al. [10] retrieved data associated with 9,193 songs that were featured in at least one popularity ranking from the following sources between April 28th, 2013 and December 28th, 2014: Billboard, Last.fm, and Spotify. Additionally, they retrieved data from 14,192 songs of the albums in which these popular tracks were released. They retrieved this data from three different sources, namely iTunes¹², Spotify, and 7digital¹³. Plus, using four different tools, they extracted the songs’ acoustic features from 30-second samples. Their goal was to predict which song would be the most successful from an unseen album. The authors employed two temporal-data models: a nonlinear auto-regressive network classifier (NAR) and its variation with exogenous inputs (NARX). They reported precision of 46% and accuracy of 52%.

9 The Billboard Hot 100 is a weekly ranking containing the 100 most popular songs in the United States. ⟨https://www.billboard.com/charts/hot-100⟩
10 The Official Charts Company publishes rankings of the most popular songs in the United Kingdom.
Since Pons and Serra [11] showed that neural networks could be a valuable option in HSS research, Martín-Gutiérrez et al. [12], in a more recent work, used them along with information collected from Spotify and Genius¹⁴ for more than 100 thousand tracks. This data comprises audio features and characteristics, plus knowledge extracted from song lyrics and audio files. The authors created a model based on neural networks to predict a song’s popularity value on Spotify, a number on a scale from 0 to 100: the higher this number, the higher the song’s popularity on the platform. The authors obtained accuracy and recall of 83.46% in the best case, using a neural network with 3 layers and the Adam optimizer.

The research that most closely resembles ours is that of Reiman and Örnell [13]. In this work, the authors collected data from 287 songs that appeared in the Billboard Hot 100 between 2016 and 2018. They also collected data from 322 other songs that never appeared in this ranking, randomly chosen from 13 different music genres. The information was collected using the Spotify API and relates to the same audio features we used in our research. As previously stated, these features indicate whether the songs are happy, danceable, instrumental, etc. For a song to be considered popular in this research, it should be present in the Hot 100.

Reiman and Örnell [13] used four different algorithms to make their predictions, namely K-Nearest Neighbors, Support Vector Machines, Gaussian Naive Bayes, and Logistic Regression. The experimental evaluation was based on holdout validation (80% for training and 20% for testing), with a maximum accuracy of 60.17%, obtained by Gaussian Naive Bayes. The authors’ conclusion is that their experiments did not show that it is possible to predict whether or not a song will be a success.

3. Methodology

In this section we present the methodology used in this research, beginning with the way the data was collected and prepared (cf. Subsection 3.1). We then present the experiments we carried out (cf. Subsection 3.2) and how we evaluated the results obtained (cf. Subsection 3.3). A graphical representation of the methodology is given in Figure 1.

[Figure 1. Overview of the methodology: feature extraction (Extração de Características) feeds a database (Base de Dados), which is used for machine learning (Aprendizado de Máquina) in the model creation step (Criação do Modelo), producing the final model (O Modelo).]

14 ⟨https://genius.com/⟩
3.1 Data Collection and Preparation

The data collection was performed using the Spotify Web API¹⁵. From November 2018 to July 2019, we collected daily information from the Top 50 and Viral 50 public playlists. These playlists act as platform rankings: the first contains the 50 most-listened-to songs of the previous day, while the second features the 50 songs that had the biggest increase in number of plays the day before¹⁶.

In this work, we consider the songs in the Top 50 to be popular, an approach already used in other HSS works [9, 13]. Because we require data to be collected from the Spotify API, non-popular songs must still be featured on the platform. Therefore, we consider non-popular songs to be those that featured in the Viral 50 but did not appear in the Top 50 during the collection period, thus avoiding a song being simultaneously popular and unpopular.

The data from these rankings was collected using the API’s “Get a Playlist’s Tracks” function. We retrieved the names of the artists and their tracks, the songs’ IDs within the platform, and the Explicit flag, which indicates whether the song contains explicit lyrics. We also collected audio features for each song; to do this, we used the songs’ IDs as input to the API’s “Get Audio Features for Several Tracks” function.

15 ⟨https://spoti.fi/37vPA2l⟩
16 According to Kevin Goldsmith, Spotify’s former vice-president of engineering, whose explanation may be found at ⟨http://bit.ly/33fXg67⟩ (requires login to the platform). Accessed on 2020-08-13.
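As an illustration only, the sketch below shows how this collection step could be reproduced with the Spotipy client library (mentioned later in this section in connection with Reiman and Örnell’s work); the playlist ID and all names are placeholders, not the exact code used in this research.

    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    # Credentials are read from the SPOTIPY_CLIENT_ID /
    # SPOTIPY_CLIENT_SECRET environment variables.
    sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

    TOP50_ID = "..."  # placeholder: the Top 50 Global playlist ID

    def collect_playlist(playlist_id):
        """Retrieve artist, track name, ID, the Explicit flag, and the
        audio features of every track in a playlist."""
        items = sp.playlist_items(playlist_id)["items"]
        rows = []
        for item in items:
            track = item["track"]
            rows.append({
                "artist": track["artists"][0]["name"],
                "name": track["name"],
                "id": track["id"],
                "explicit": track["explicit"],
            })
        # One request for all tracks ("Get Audio Features for Several Tracks").
        features = sp.audio_features([r["id"] for r in rows])
        for row, feats in zip(rows, features):
            row.update({k: feats[k] for k in (
                "danceability", "energy", "speechiness", "acousticness",
                "instrumentalness", "liveness", "valence")})
        return rows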
The features used in our experiment are listed below. All features range in [0, 1], with values closer to 1 expressing the concept of the feature more strongly:

• Danceability: describes how suitable a track is for dancing, taking into account several factors such as tempo, rhythm, and overall regularity;

• Energy: represents a “perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy”¹⁷. It is obtained from features such as dynamic range, perceived loudness, timbre, onset rate, and general entropy;

• Speechiness: whether the track contains spoken words. According to the official documentation¹⁸, if this measure is above 0.66, the track is probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech (e.g., rap music). Values below 0.33 most likely represent music and other non-speech-like tracks;

• Acousticness: gives a confidence level on how acoustic the track is, in terms of relying more on acoustic instruments than on electronic ones;

• Instrumentalness: how prevalent the sound of instruments is compared to vocals. Non-verbal sounds such as “ooh” and “aah” are considered instrumental. According to the documentation, values above 0.5 represent songs that are mostly instrumental;

• Liveness: detects the presence of an audience in the recording. According to the documentation, a value above 0.8 indicates with a high degree of probability that the song was recorded live;

• Valence: a measure of “positiveness”. The higher the valence, the more the song relates to positive feelings, such as happiness and euphoria, whereas low valence resembles negative feelings, such as sadness and anger.

All of these audio features are float fields, and the documentation does not say how they are calculated. Therefore, we cannot compute these values for songs that are not on the platform, which makes it difficult to make predictions for songs not yet on the platform. To make such predictions viable, we decided to binarize these fields. In the binarization of the collected data, a field was considered positive if its value was greater than 0.5. The exceptions were speechiness and liveness, for which we used the values 0.33 and 0.8 as thresholds, respectively, due to the description of these fields in the documentation.

17 ⟨https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/⟩
18 See footnote 17.
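A minimal sketch of this binarization step, assuming the features are held in a pandas DataFrame with one column per feature (column names are illustrative):

    import pandas as pd

    # Per-feature thresholds: 0.5 by default; speechiness and liveness use
    # the cut-offs suggested by the Spotify API documentation.
    THRESHOLDS = {
        "danceability": 0.5,
        "energy": 0.5,
        "speechiness": 0.33,
        "acousticness": 0.5,
        "instrumentalness": 0.5,
        "liveness": 0.8,
        "valence": 0.5,
    }

    def binarize_features(df: pd.DataFrame) -> pd.DataFrame:
        """Turn float audio features in [0, 1] into 0/1 fields."""
        out = df.copy()
        for feature, threshold in THRESHOLDS.items():
            out[feature] = (out[feature] > threshold).astype(int)
        return out

    # Usage: songs = binarize_features(pd.read_csv("top50_features.csv"))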
In order to make predictions for a song not yet released, even if we do not know the exact values of its audio features, the artists themselves can indicate whether or not the song is happy, live, danceable, etc. Thus, it is possible to represent unreleased songs as instances of our base, allowing us to predict their success.

For our experiments, we set up two databases. In the first one, each entry represented one song on a given day, so there might be multiple entries for the same song if it appeared more than once in the ranking. In the second, the entries with the same song name and artist were combined into one. In this case, a song was only considered popular if it appeared more than a certain number of times in the Top 50 during the collection period. After this process, we discarded the name fields and the IDs from both databases.

During the Christmas season, it is common for themed songs to appear in the Top 50 from December 23 to 26. To prevent these songs from being taken as popular in the second experiment, we established that, for a song to be considered popular, it should have appeared more than four times in the Top 50.

For comparison, we set up a model based on the methodology used by Reiman and Örnell [13]. We will use the acronym ROM (Reiman and Örnell Model) when dealing with this model from now on. In ROM, we used all audio features available in the API except the Explicit field. Therefore, besides the previously presented features, we also use:

• Duration ms: the duration of the track in milliseconds;

• Key: the key the track is in. Integers map to pitches using standard Pitch Class notation, e.g., 0 = C, 1 = C♯/D♭, 2 = D, and so on;

• Mode: indicates the modality (major or minor) of a track, i.e., the type of scale from which its melodic content is derived. Major is represented by 1 and minor by 0;

• Tempo: the overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration;

• Time signature: an estimated overall time signature of a track. The time signature (meter) is a notational convention that specifies how many beats are in each bar (or measure).

For the remainder of this paper, we will adopt the acronym PM when dealing with our Proposed Model. In PM, we did not use all the audio features available in the API: we only selected those for which the binarization process was possible. In the specific case of the Mode field, which is already binary, we do not use it because what it represents is directly associated with the musical key, which is represented by the Key field, itself discarded since it cannot be binarized.

We remark that ROM is not an exact reproduction of the methodology used by Reiman and Örnell [13], but a model created based on that text, so modifications were made to fit our experiments. The first difference is in the source of popular and non-popular data: in this work, we used the Top 50 and the Viral 50 as sources of popular and non-popular songs, respectively, whereas Reiman and Örnell used Billboard’s Hot 100 as their source of popular works and randomly collected music of different genres from Spotify as non-popular. Also, in that work the audio features were not extracted directly from the Spotify API, as we did; they used the Spotipy¹⁹ Python library. Therefore, there may be differences in the way audio features are calculated in the two cases.

In ROM, as in the base text, we do not perform the binarization process, nor do we normalize the data. In that paper, it is stated that only the instances where the audio features were in the same range were used.

19 ⟨https://spotipy.readthedocs.io/en/latest/⟩
However, there is no information on which interval was used, so in our experiments the entire dataset was employed. We also set up two databases for ROM in order to compare against the results obtained with our methodology. The instances of these two databases represent the same entries as the PM ones.
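To make the construction of the second, per-song database concrete, the following pandas sketch shows the aggregation step; the DataFrame daily, its column names, and the feature list are illustrative stand-ins, not the exact code used in this research.

    import pandas as pd

    FEATURE_COLUMNS = ["danceability", "energy", "speechiness",
                       "acousticness", "instrumentalness", "liveness",
                       "valence", "explicit"]

    # `daily` is the per-day table: one row per (song, day), with an
    # in_top50 flag marking Top 50 appearances (hypothetical columns).
    per_song = (
        daily.groupby(["artist", "name"], as_index=False)
             .agg(appearances=("in_top50", "sum"),
                  **{f: (f, "first") for f in FEATURE_COLUMNS})
    )
    # A song is popular only if it entered the Top 50 more than four times.
    per_song["popular"] = (per_song["appearances"] > 4).astype(int)
    # Discard the identification fields, as described above.
    per_song = per_song.drop(columns=["artist", "name", "appearances"])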
3.2 Experimentation

We used different machine learning algorithms in our experiments. Therefore, it was necessary to divide our databases into training and testing groups. In the first experiment, we used data from November and December 2018 for training. In the second one, the data from January 2019 was also used. Testing was always performed on the June and July 2019 data. Thus, there is a difference of at least five months between the training and test data dates.

For PM, before training, all the data was standardized by removing the mean and scaling to unit variance: the standard score of a sample x is calculated as z = (x − u)/s, where u is the mean of the training samples and s is their standard deviation. This step was not applied to ROM’s input, as it was not done in the base text.
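This standardization corresponds to z-score scaling; a minimal sketch using scikit-learn’s StandardScaler, assuming X_train and X_test hold PM’s feature matrices, with the scaler fit on the training months only:

    from sklearn.preprocessing import StandardScaler

    # z = (x - u) / s, with u and s estimated from the training portion
    # only, so no information from the test months leaks into training.
    scaler = StandardScaler()
    X_train_std = scaler.fit_transform(X_train)
    X_test_std = scaler.transform(X_test)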
To make the results more comparable, we restricted the algorithms used in our experiments to those that were also used by Reiman and Örnell [13]. Thus, the algorithms used were Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Support Vector Machine (SVM) with RBF kernel. The way these algorithms work is discussed next.

We used the scikit-learn [14] library in our experiments. This library contains implementations of all the algorithms used, and it is one of the most widely used machine learning libraries in academia and industry. We used the default values for all parameters. Our data was stored in .csv files and accessed using the Pandas library [15]. Pandas is one of the most used libraries for data manipulation and analysis in the Python programming language [16]. A sketch of this setup is given below.
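In the sketch, the .csv file names and the "popular" column are placeholders and, as in our experiments, all classifiers keep scikit-learn’s default parameters:

    import pandas as pd
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    # Each .csv holds the prepared features plus a "popular" class column.
    train = pd.read_csv("train_nov_dec_2018.csv")
    test = pd.read_csv("test_jun_jul_2019.csv")

    X_train, y_train = train.drop(columns="popular"), train["popular"]
    X_test, y_test = test.drop(columns="popular"), test["popular"]

    classifiers = {
        "GNB": GaussianNB(),
        "KNN": KNeighborsClassifier(),  # default K = 5
        "LR": LogisticRegression(),
        "SVM": SVC(kernel="rbf"),       # RBF is also scikit-learn's default
    }
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        print(name, clf.score(X_test, y_test))  # accuracy on held-out months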
Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable. The different naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of the attributes. In the Gaussian Naive Bayes algorithm, the likelihood of the features is assumed to be Gaussian [17].

Neighbors-based classification is a type of instance-based or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the class which has the most representatives within the nearest neighbors of the point. Scikit-learn’s KNeighborsClassifier implements learning based on the K nearest neighbors of each query point; the default value of K is 5²⁰.

Logistic regression, despite its name, is a linear model for classification rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function²¹.

SVM is an instance-based classifier that projects the training samples into a space of higher dimensionality, where it is assumed to have a representative layout of the original space. In this projection, SVM attempts to find the hyperplane that best separates the classes, effectively dividing the decision space into two subspaces. When classifying a new sample, SVM projects its features into the same high-dimensional space, verifies in which subspace the projected instance “falls”, and then assigns it the class label associated with that subspace [18]. The kernel function defines the inner product in the transformed space, so different kernels imply different ways of calculating inner products [19].

3.3 Evaluation of Results

To evaluate the results obtained, we use the following metrics, where we denote the numbers of true positives, true negatives, false positives, and false negatives as tp, tn, fp, and fn, respectively:

1. Accuracy = (tp + tn) / (tp + tn + fp + fn), the percentage of correctly predicted instances;

2. Precision = tp / (tp + fp), the percentage of correctly predicted positive instances;

3. Negative Predictive Value (NPV) = tn / (fn + tn), the percentage of correctly predicted negative instances;

4. Recall = tp / (tp + fn), the percentage of true positives;

5. Specificity = tn / (tn + fp), the percentage of true negatives;

6. F1 Score = 2 · (Precision · Recall) / (Precision + Recall), the harmonic mean of precision and recall;

7. Area Under the Receiver Operating Characteristic Curve (AUC) = ∫_{−∞}^{∞} Recall(T) Specificity′(T) dT, the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance when using normalized units [20];

8. Matthews Correlation Coefficient (MCC) = (tp · tn − fp · fn) / √((tp + fp)(tp + fn)(tn + fp)(tn + fn)), a measure of the quality of binary classifications. It takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes [21].

Except when noted, these metrics were defined according to Olson and Delen [22].

20 ⟨http://bit.ly/2Qh3vTY⟩
21 ⟨http://bit.ly/2Qgtnj2⟩
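As an illustration, the sketch below computes these metrics with scikit-learn from true labels, predicted labels, and positive-class scores; NPV and specificity, which have no dedicated scikit-learn functions, are derived directly from the confusion matrix:

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, roc_auc_score,
                                 matthews_corrcoef, confusion_matrix)

    def evaluate(y_true, y_pred, y_score):
        # y_score: decision scores or probabilities for the positive
        # class, required by roc_auc_score.
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        return {
            "accuracy":    accuracy_score(y_true, y_pred),
            "precision":   precision_score(y_true, y_pred),
            "npv":         tn / (fn + tn),
            "recall":      recall_score(y_true, y_pred),
            "specificity": tn / (tn + fp),
            "f1":          f1_score(y_true, y_pred),
            "auc":         roc_auc_score(y_true, y_score),
            "mcc":         matthews_corrcoef(y_true, y_pred),
        }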
4. Results

The confusion matrices obtained in the experiment where the predictions were made on a per-day basis are shown in Tables 1, 2, 3, and 4. A more graphical version of these matrices is also available in the Appendix. Table 5 shows the values achieved in the evaluation metrics in this experiment. The best results obtained in each of the metrics are shown in red.

Regarding PM, the best result in terms of accuracy in this experiment was obtained using the SVM classifier. This case has the smallest number of false positives in the experiment, with a value 122% lower than the case using KNN, which has the second-lowest number of these incorrectly predicted instances. However, SVM also produced the highest number of false negatives, with a value 34.5% higher than the case using GNB, which presented the lowest number of these instances.

Due to these factors, the SVM classifier did not obtain the best results in all the metrics used to evaluate the models. However, it has the highest value in MCC, which evaluates the quality of a binary classification. This indicates that the result obtained by this classifier was the best overall.

The confusion matrices obtained in the experiment where the predictions were made on a per-song basis are in Tables 6, 7, 8, and 9. As before, more graphical versions of these matrices are also available in the Appendix. Table 10 presents the values achieved in the evaluation metrics in this experiment. The best results obtained in each of the metrics are shown in red.

Regarding PM, in this experiment SVM again obtained the highest values in MCC and accuracy, which indicates that it obtained the best results in general. In this case, the numbers of false positives and false negatives were the second lowest in comparison with the other models. Thus, unlike in the first experiment, the highest F1 Score was also obtained by SVM.

One concern was that the results obtained by ROM would not necessarily represent the results that the original model might obtain. However, the maximum difference, in percentage points, between the results obtained by ROM in our first experiment and the results presented in the base text was only 6.64, in accuracy, when using the GNB classifier.

In that work, the authors stated that it is not possible to make predictions in the music market using audio features. By creating a model based on the methodology proposed in that paper, we could not make good predictions either: the MCC in those experiments did not exceed 0.14 in either case, indicating an unsatisfactory binary classification. However, our proposed methodology allowed predictions with MCC greater than 0.7. This result indicates that it is possible to predict whether songs will be popular, even before their release, using audio features.

5. Discussion

In this research, we developed a machine-learning-based model to predict whether or not an unreleased song will become popular. Specifically, we predict whether or not a song will appear in Spotify’s Top 50 ranking. However, we remark that our methodology could be reproduced on any streaming platform, provided the audio features and the binary acoustic features are available. We performed two experiments. In the first, each instance represented one song on a specific day of the rankings collected (Spotify’s Top 50 and Viral 50), so there were several identical instances that represented the same songs. In the second, the instances that represented the same song were merged into one entry. Thus, in the first experiment the algorithms were trained with 5389 instances, and in the second with 405.

In the first experiment, the predictions were made for individual ranking editions. That is, a song was considered popular on a specific day if it appeared in the Top 50 of that day. In contrast, in the second experiment the predictions were made for a set of rankings. In this case, for a song to be considered popular, it should appear a certain number of times in the Top 50. We decided to set this value at four appearances, as this prevents songs that stood out only from December 23 to 26 from being considered popular.

Despite the discrepancy in the number of training instances in the two experiments, PM obtained similar results in both of them, showing that it can achieve good learning even with a small amount of data. The SVM classifier with RBF kernel obtained the highest values in MCC, AUC, and accuracy in our experiments. Comparing the results in the two cases, the difference in accuracy was 5.7 percentage points, while it was 0.23 in AUC and 5.17 in MCC. The second experiment obtained the highest values in these metrics.

The results obtained by PM differ from those obtained by ROM. PM presented, in the best case of both models, 56.65% higher accuracy in the first experiment and an MCC 921.02% higher in the second. In Table 11 we show the percentages by which PM outperformed ROM in the two experiments we performed. For this calculation we used the values of the best model for each test: SVM for PM and KNN for ROM.

One possible explanation for the poor results obtained by ROM is that its methodology includes no data preparation. The authors did not normalize the attributes used in their research, thus hindering the learning of their models, because the models are exposed to atypical values with great variation. In our model, on the other hand, in addition to not using the full set of information available through the Spotify API, we transform the attributes into binary fields. This process removes the need for normalization, facilitates learning, and even allows us to make predictions for unreleased songs.
Table 1. Confusion matrices of the experiment where the predictions were made on a per-day basis using the SVM classifier.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0        2129        85        1280       934
True label 1         697      2342        1519      1520
Table 2. Confusion matrices of the experiment where the predictions were made on a per-day basis using the Gaussian Naive Bayes classifier.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0        1828       386         744      1470
True label 1         412      2627         971      2068
Table 3. Confusion matrices of the experiment where the predictions were made on a per-day basis using Logistic Regression.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0        1841       373         761      1453
True label 1         466      2573         997      2042
Table 4. Confusion matrices of the experiment where the predictions were made on a per-day basis using KNN.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0        1921       293        1297       917
True label 1         550      2489        1482      1557
Table 5. Performance of the models for the experiment where the predictions were made on a per-day basis.
                   SVM               GNB                LR                KNN
                PM     ROM        PM     ROM        PM     ROM        PM     ROM
Accuracy     0.8511  0.5330    0.8481  0.5353    0.8403  0.5336    0.8395  0.5433
Precision    0.9650  0.6194    0.8719  0.5845    0.8734  0.5843    0.8947  0.6293
NPV          0.7534  0.4573    0.8161  0.4338    0.7980  0.4329    0.7774  0.4667
Recall       0.7706  0.5002    0.8644  0.6805    0.8467  0.6719    0.8190  0.5123
Specificity  0.9616  0.5781    0.8257  0.3360    0.8315  0.3437    0.8677  0.5858
F1 Score     0.8569  0.5534    0.8681  0.6289    0.8598  0.6250    0.8552  0.5648
AUC          0.8661  0.5391    0.8450  0.5083    0.8391  0.5078    0.8433  0.5491
MCC          0.7253  0.0775    0.6890  0.0174    0.6748  0.0164    0.6793  0.0971
Table 6. Confusion matrices of the experiment where the predictions were made on a per-song basis using the SVM classifier.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0         184         6         117        73
True label 1          19        63          39        43
One study [23] has already shown that popular songs tend to sound similar. That study analysed 500,000 albums from 15 different genres. The authors evaluated the complexity of each song, calculated from the tracks’ acoustic features, and compared these values to the sales of the albums. By applying Pearson’s correlation coefficient to the data, the authors obtained a value of -0.69 with a p-value of 0.001, which demonstrates the statistical significance of this result. Thus, they demonstrated that there is a negative linear correlation in the data: the more complex a song is, the lower its sales tend to be. This situation may explain how our proposed model achieved such good results, as it learns characteristics associated with popular songs.
Table 7. Confusion matrices of the experiment where the predictions were made on a per-song basis using the Gaussian Naive Bayes classifier.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0         168        22         156        34
True label 1          20        62          66        16
Table 8. Confusion matrices of the experiment where the predictions were made on a per-song basis using Logistic Regression.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0         156        34         182         8
True label 1          14        68          78         4
Table 9. Confusion matrices of the experiment where the predictions were made on a per-song basis using KNN.
                        PM                    ROM
                 Pred. 0   Pred. 1     Pred. 0   Pred. 1
True label 0         186         4         150        40
True label 1          31        51          59        23
Table 10. Performance of the models for the experiment where the predictions were made on a per-song basis.
                   SVM               GNB                LR                KNN
                PM     ROM        PM     ROM        PM     ROM        PM     ROM
Accuracy     0.9081  0.5882    0.8456  0.6324    0.8235  0.6838    0.8713  0.6360
Precision    0.9130  0.3707    0.7381  0.3200    0.6667  0.3333    0.9273  0.3651
NPV          0.9064  0.7500    0.8936  0.7027    0.9176  0.7000    0.8571  0.7177
Recall       0.7683  0.5244    0.7561  0.1951    0.8293  0.0488    0.6220  0.2805
Specificity  0.9684  0.6158    0.8842  0.8211    0.8211  0.9579    0.9789  0.7895
F1 Score     0.8344  0.4343    0.7470  0.2424    0.7391  0.0851    0.7445  0.3172
AUC          0.8684  0.5701    0.8603  0.5081    0.8560  0.5033    0.8004  0.5350
MCC          0.7770  0.1301    0.6360  0.0192    0.6164  0.0149    0.6866  0.0761
Table 11. Higher performance percentages achieved by PM over ROM.
              Experiment 1    Experiment 2
Accuracy            56.65%          42.78%
Precision           53.34%         150.07%
NPV                 61.43%          26.29%
Recall              50.42%         173.90%
Specificity         64.15%          22.66%
F1 Score            51.72%         163.05%
AUC                 57.73%          63.32%
MCC                646.96%         921.02%

6. Conclusion and Future Work

In this paper, we presented a model for predicting whether a particular song will be popular on Spotify, one of today’s largest music streaming platforms. For a song to be considered popular in this research, it must appear in the Top 50 Global ranking, which features Spotify’s 50 most popular songs.

To create our model, we set up a database containing songs that had already appeared in the Top 50 and others that were never there. Using the platform’s own API, we extracted information about the songs in the database. The information collected indicates whether the songs are danceable, acoustic, instrumental, etc. This data is provided as float numbers. To allow the inclusion of songs not yet released, we decided to binarize these attributes. This way, an artist or label can determine whether or not their music has these characteristics without having to use the API.

Alongside our proposed model, we also developed another one, based on the methodology used by Reiman and Örnell [13], in order to compare the results obtained.

We performed two experiments. In the first one, predictions were made on a per-day basis; that is, we sought to predict which songs would be popular on a specific day, so a song that appeared in the Top 50 only once was considered popular on that particular day. Therefore, this experiment relied on repeated instances with possibly distinct classes.
On the other hand, in the second experiment, the predictions were made on a per-song basis, so each instance represented a distinct song, and it was only considered popular if it had appeared at least four times in the Top 50. Despite this, the model obtained similar results in both experiments, with a maximum difference of 5.7 percentage points in accuracy.

The proposed model obtained accuracy, precision, and AUC above 80% in all cases. In the best case, using the SVM classifier with RBF kernel, the result was more than 920% higher, according to the Matthews Correlation Coefficient, than the model based on Reiman and Örnell [13].

However, improvements can still be made, since the number of false negatives obtained by the proposed model is still high: about 23% of positive instances were predicted erroneously. We believe that a possible way to reverse this situation is to add information from social networks to our model. This belief comes from research [24] showing that there is a linear correlation between the popularity of an album on Spotify and the amount of positively polarized messages about the artist of that album on Twitter.

Additionally, we did not consider in this work the impact of the artists’ previous popularity, or of the marketing investments made by them or their record labels to boost their songs’ popularity. Our idea is to also use this information in a future model to reach better results.

We also intend to partner with record labels and artists to apply the proposed model to songs before they are released. In our experiments, because we did not have access to such songs, the tests were made considering already-released songs as if they were unreleased.

Finally, we highlight that although this work was developed focusing on Spotify, its methodology can easily be replicated on other platforms that contain music rankings. Moreover, it would easily be possible to experiment with other success parameters, which may be necessary because artists at different levels of fame may have different success parameters.

Acknowledgements

The results presented here were achieved during the master’s studies of Carlos V. S. Araujo, which were funded by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES).

Author contributions

The work was developed by MSc. Carlos V. S. Araujo under the supervision of Prof. Rafael Giusti and the co-supervision of Prof. Marco A. P. Cristo.

References

[1] PACHET, F.; SONY, C. Hit song science. In: Music Data Mining. Boca Raton, FL: Chapman & Hall/CRC Press, 2011. p. 305–326.

[2] LI, T.; OGIHARA, M.; TZANETAKIS, G. Music Data Mining. [S.l.]: CRC Press, 2011.

[3] ARAUJO, C.; CRISTO, M.; GIUSTI, R. Predicting music popularity on streaming platforms. In: Anais do XVII Simpósio Brasileiro de Computação Musical. Porto Alegre, RS, Brasil: SBC, 2019. p. 141–148. Available at: ⟨https://sol.sbc.org.br/index.php/sbcm/article/view/10436⟩.

[4] ARAUJO, C.; CRISTO, M.; GIUSTI, R. Will I remain popular? A study case on Spotify. In: Anais do XVI Encontro Nacional de Inteligência Artificial e Computacional. Porto Alegre, RS, Brasil: SBC, 2019. p. 599–610. Available at: ⟨https://sol.sbc.org.br/index.php/eniac/article/view/9318⟩.

[5] ARAUJO, C. V. S.; CRISTO, M. A. P. de; GIUSTI, R. Predicting music popularity using music charts. In: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA). [S.l.: s.n.], 2019. p. 859–864.

[6] ARAKELYAN, S. et al. Mining and forecasting career trajectories of music artists. CoRR, abs/1805.03324, 2018. Available at: ⟨http://arxiv.org/abs/1805.03324⟩.

[7] STEININGER, D. M.; GATZEMEIER, S. Using the wisdom of the crowd to predict popular music chart success. In: ECIS. [S.l.: s.n.], 2013. p. 215.

[8] KIM, Y.; SUH, B.; LEE, K. #Nowplaying the future Billboard: Mining music listening behaviors of Twitter users for hit song prediction. In: Proceedings of the First International Workshop on Social Media Retrieval and Analysis (SoMeRA ’14). New York, NY, USA: ACM, 2014. p. 51–56. Available at: ⟨http://doi.acm.org/10.1145/2632188.2632206⟩.

[9] HERREMANS, D.; MARTENS, D.; SÖRENSEN, K. Dance hit song prediction. Journal of New Music Research, Routledge, v. 43, n. 3, p. 291–302, 2014. Available at: ⟨https://doi.org/10.1080/09298215.2014.881888⟩.

[10] KARYDIS, I. et al. Musical track popularity mining dataset: Extension & experimentation. Neurocomputing, v. 280, p. 76–85, 2018. Available at: ⟨http://www.sciencedirect.com/science/article/pii/S0925231217317666⟩.

[11] PONS, J.; SERRA, X. Randomly weighted CNNs for (music) audio classification. In: ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). [S.l.: s.n.], 2019. p. 336–340.

[12] MARTÍN-GUTIÉRREZ, D. et al. A multimodal end-to-end deep learning architecture for music popularity prediction. IEEE Access, v. 8, p. 39361–39374, 2020.

[13] REIMAN, M.; ÖRNELL, P. Predicting hit songs with machine learning. [S.l.: s.n.], 2018. (TRITA-EECS-EX, 2018:202).

[14] PEDREGOSA, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, v. 12, p. 2825–2830, 2011.

[15] REBACK, J. et al. pandas-dev/pandas: Pandas 1.1.3. Zenodo, 2020. Available at: ⟨https://doi.org/10.5281/zenodo.4067057⟩.

[16] ROSSUM, G. V.; JR, F. L. D. Python tutorial. [S.l.]: Centrum voor Wiskunde en Informatica Amsterdam, 1995. v. 620.

[17] ZHANG, H. The optimality of naive Bayes. AA, v. 1, n. 2, p. 3, 2004.

[18] SCHÖLKOPF, B. et al. Estimating the support of a high-dimensional distribution. Neural Computation, v. 13, n. 7, p. 1443–1471, 2001. Available at: ⟨https://doi.org/10.1162/089976601750264965⟩.

[19] HERBRICH, R. Learning Kernel Classifiers: Theory and Algorithms. Massachusetts: MIT Press, 2001.

[20] FAWCETT, T. An introduction to ROC analysis. Pattern Recognition Letters, v. 27, n. 8, p. 861–874, 2006. Available at: ⟨http://www.sciencedirect.com/science/article/pii/S016786550500303X⟩.

[21] BOUGHORBEL, S.; JARRAY, F.; EL-ANBARI, M. Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLOS ONE, Public Library of Science, v. 12, n. 6, p. 1–17, 2017. Available at: ⟨https://doi.org/10.1371/journal.pone.0177678⟩.

[22] OLSON, D. L.; DELEN, D. Advanced Data Mining Techniques. [S.l.]: Springer Science & Business Media, 2008.

[23] PERCINO, G.; KLIMEK, P.; THURNER, S. Instrumentational complexity of music genres and why simplicity sells. PLOS ONE, Public Library of Science, v. 9, n. 12, p. 1–16, 2014. Available at: ⟨https://doi.org/10.1371/journal.pone.0115255⟩.

[24] ARAUJO, C. V. et al. Predicting music success based on users’ comments on online social networks. In: Proceedings of the 23rd Brazilian Symposium on Multimedia and the Web (WebMedia ’17). New York, NY, USA: ACM, 2017. p. 149–156. Available at: ⟨http://doi.acm.org/10.1145/3126858.3126885⟩.