Music Data Analysis in R
Bruna Wundervald & Julio Trecenti
GitHub: https://github.com/brunaw/SER2019
2 / 75
Who we are
Bruna Wundervald
3 / 75
Who we are
Julio Trecenti
4 / 75
Goals
Learn how to use the vagalumeR, Rspotify, and chorrrds packages, with help from the tidyverse.
5 / 75
Prerequisites & resources
Beginner/intermediate R
tidyverse
%>% (pipe) is essential!
R-Music Blog
6 / 75
Don't get lost!
If you ever get stuck in any part of this course, don't hesitate to ask us.
https://www.rstudio.com/resources/cheatsheets/
https://curso-r.com/material/
7 / 75
Loading packages
Main packages:
library(vagalumeR)
library(Rspotify)
library(chorrrds)
library(tidyverse)
8 / 75
Data extraction
vagalumeR: music lyrics
Rspotify: Spotify variables
chorrrds: music chords
9 / 75
Data extraction
For each package, there are a few steps to follow:
1. obtain the IDs of the objects we want to extract information about
(artists, albums, songs); and
2. use those IDs inside specific functions.
10 / 75
Connecting to the APIs
vagalumeR
Steps:
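The steps appear on the original slide as links; roughly: create a free account at vagalume.com.br, generate an API key, and store it in R. A minimal sketch (the object name is just a placeholder):
# Key obtained from the Vagalume developer page; placeholder value
key_vagalume <- "your-api-key-here"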
11 / 75
Connecting to the APIs
Rspotify
Steps:
library(Rspotify)
key_spotify <- spotifyOAuth("app_id","client_id","client_secret")
The keys will be used later to create the connection between R and the
data extraction functions.
12 / 75
vagalumeR
# 1. Defining the artists
artist <- "chico-buarque"
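The following steps are not shown on this slide; a hedged sketch using vagalumeR's songNames() and lyrics() (argument and column names below are assumptions to check against the package docs):
# 2. Getting the names and IDs of the artist's songs
songs <- vagalumeR::songNames(artist)
# 3. Mapping the lyrics() function over the song IDs
lyrics <- songs %>%
  dplyr::pull(song.id) %>%   # column name assumed
  purrr::map_dfr(vagalumeR::lyrics, type = "id", key = key_vagalume)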
13 / 75
Rspotify - variables
“danceability” = describes how suitable a track is for dancing based on a
combination of musical elements including tempo, rhythm stability, beat
strength, and overall regularity.
“energy” = a measure from 0.0 to 1.0 and represents a perceptual measure of
intensity and activity.
“key” = estimated overall key of the track. Integers map to pitches using
standard Pitch Class notation, e.g. 0 = C, 1 = C#/Db, 2 = D, and so on.
“loudness” = overall loudness of a track in decibels (dB).
“mode” = indicates the modality (major or minor) of a track, the type of scale
from which its melodic content is derived.
“speechiness” = detects the presence of spoken words in a track.
“acousticness” = a measure from 0.0 to 1.0 of whether the track is acoustic.
“instrumentalness” = whether a track contains no vocals.
“liveness” = detects the presence of an audience in the recording.
“valence” = measure from 0.0 to 1.0 describing the musical positiveness
conveyed by a track.
“tempo” = overall estimated tempo of a track in beats per minute (BPM).
“duration_ms” = duration of the track in milliseconds.
“time_signature” = an estimated overall time signature of the track, i.e. how many beats are in each bar.
“popularity” = the popularity of the song, from 0 to 100.
14 / 75
Rspotify
# 1. Search the artist using the API
find_artist <- searchArtist("chico buarque", token = key_spotify)
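The remaining extraction steps are not shown here; a hedged sketch with Rspotify's getAlbums(), getAlbum(), and getFeatures() (treat the exact argument and column names as assumptions):
# 2. Getting the albums of the artist found above
albums <- Rspotify::getAlbums(find_artist$id[1], token = key_spotify)
# 3. Getting the songs of each album and their audio features
songs_spot <- albums %>%
  dplyr::pull(id) %>%        # column name assumed
  purrr::map_dfr(Rspotify::getAlbum, token = key_spotify)
features <- songs_spot %>%
  dplyr::pull(id) %>%
  purrr::map_dfr(Rspotify::getFeatures, token = key_spotify)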
15 / 75
Until now, the package does not offer a simple way to extract the
popularity of the songs. How do we solve this issue? One possible workaround is sketched below.
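Since the Spotify Web API track endpoint returns a popularity field, we can call it directly with httr; a sketch (not the slide's original code):
library(httr)
# GET /v1/tracks/{id} returns `popularity` (0-100);
# `track_id` and `access_token` are assumed to exist already
resp <- GET(paste0("https://api.spotify.com/v1/tracks/", track_id),
            add_headers(Authorization = paste("Bearer", access_token)))
popularity <- content(resp)$popularity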
16 / 75
Details about the APIs
APIs can be very unstable: sometimes, even without reaching the access
limit, requests will fail.
17 / 75
chorrrds
# 1. Searching the songs
songs <- "chico-buarque" %>%
chorrrds::get_songs()
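The chord extraction itself is the next step; a sketch mapping chorrrds::get_chords() over the song URLs returned above (the url column name is an assumption):
# 2. Extracting the chords of each song
chords <- songs %>%
  dplyr::pull(url) %>%
  purrr::map(chorrrds::get_chords) %>%
  purrr::map_dfr(dplyr::mutate_if, is.factor, as.character)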
18 / 75
Combining different datasets
# Standardise the name of the key column and use it inside the joins
chords <- chords %>%
dplyr::mutate(song = stringr::str_remove(music, "chico buarque ")) %>%
dplyr::select(-music)
19 / 75
What if there are a lot of mismatches?
nrow(chords) - nrow(all_data)
## [1] 8973
When the titles are different, we should take the most similar ones and merge them.
Usually, there's no clear cut-off point for the similarity, so we define it arbitrarily.
20 / 75
Fixing the titles
# Let's find the string distances between the titles and use this
# information to fix them in the dataset
# 1. Which ones are in the chords data but not in the lyrics one?
anti_chords_lyrics <- chords %>%
dplyr::anti_join(lyrics, by = "song")
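Steps 2-3 are not shown on the slide; a sketch assuming RecordLinkage::levenshteinSim(), which returns a similarity in [0, 1] and therefore matches the 0.70 cut used below:
# 2.-3. Computing the similarity between each unmatched chords title
# and every title in the lyrics data
library(RecordLinkage)
dists <- anti_chords_lyrics$song %>%
  purrr::map(levenshteinSim, str2 = lyrics$song)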
21 / 75
# 4. Retrieving the most similar titles in the two datasets
ordered_dists <- dists %>% purrr::map_dbl(max)       # highest similarity per title
max_dists <- dists %>% purrr::map_dbl(which.max)     # position of the closest title
# 5. Filtering the ones that have similarity > 0.70
indexes_min_dist <- which(ordered_dists > 0.70)
songs_min_dist <- lyrics$song[indexes_min_dist]
index_lyrics <- max_dists[which(ordered_dists > 0.70)]
...
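The elided step presumably pairs each chords title with its closest lyrics title; a hypothetical reconstruction (object and column names inferred from the next slide):
# 6. (sketch) pairing the similar titles found above
results_dist_lyrics <- data.frame(
  from_chords = anti_chords_lyrics$song[indexes_min_dist],
  from_lyrics = lyrics$song[index_lyrics])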
22 / 75
Fixing manually
chords <- chords %>%
dplyr::mutate(
song =
dplyr::case_when(
song == 'a bela a fera' ~ 'a bela e a fera',
song == 'a historia de lily braun' ~ 'a história de lily braun',
song == 'a moca do sonho' ~ 'a moça do sonho',
song == 'a ostra o vento' ~ 'a ostra e o vento',
song == 'a televisao' ~ 'a televisão',
song == 'a valsa dos clows' ~ 'a valsa dos clowns',
song == 'a voz do dono o dono da voz' ~ 'a voz do dono e o dono da voz',
song == 'agora falando serio' ~ 'agora falando sério',
TRUE ~ song))
# Helper that prints ready-made case_when lines like the ones above
cat(
  paste0("song == ", "'", results_dist_lyrics$from_chords, "' ~ '",
         results_dist_lyrics$from_lyrics, "', "), sep = "")
23 / 75
Redoing the joins
all_data <- chords %>%
dplyr::inner_join(lyrics, by = "song") %>%
dplyr::inner_join(features, by = "song")
24 / 75
Exploratory Analysis
25 / 75
Part 1: lyrics
Extra packages: tidytext, wordcloud, tm, lexiconPT
26 / 75
n-grams
n-grams: words and their "past"
# (assumed example string, reconstructed from the outputs below)
nome1 <- "geni e o zepelim"
tokenizers::tokenize_ngrams(nome1, n = 1)
## [[1]]
## [1] "geni" "e" "o" "zepelim"
tokenizers::tokenize_ngrams(nome1, n = 2)
## [[1]]
## [1] "geni e" "e o" "o zepelim"
tokenizers::tokenize_ngrams(nome1, n = 3)
## [[1]]
## [1] "geni e o" "e o zepelim"
27 / 75
n-grams
The unnest_tokens() function separates the n-grams of each lyric.
library(tidytext)
library(wordcloud)
# List of portuguese stopwords:
stopwords_pt <- data.frame(word = tm::stopwords("portuguese"))
stopwords: very frequent words in a language that might not be essential to
the overall meaning of a sentence
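The unnested object used on the next slides is not built here; a minimal sketch, assuming the lyrics sit in the text column (as in the 2-grams code later):
# Breaking the lyrics into one word per row and removing stopwords
unnested <- all_data %>%
  unnest_tokens(word, text) %>%
  dplyr::anti_join(stopwords_pt, by = "word")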
28 / 75
Part 1: lyrics
Counting each word that appeared in the songs:
unnested %>%
dplyr::count(word) %>%
arrange(desc(n)) %>%
slice(1:10)
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 é 40064
## 2 pra 24276
## 3 iá 23832
## 4 amor 18460
## 5 diz 16505
## 6 chocalho 15888
## 7 vai 14603
## 8 morena 11148
## 9 esperando 10987
## 10 dia 10642
29 / 75
1-grams
unnested %>%
dplyr::count(word) %>%
# removing the handful of words with extremely high counts
dplyr::filter(n < quantile(n, 0.999)) %>%
dplyr::top_n(n = 30) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
coord_flip() +
labs(x = 'Top 30 most common words', y = 'Count') +
theme_bw(14)
30 / 75
31 / 75
In a wordcloud format
unnested %>%
count(word) %>%
with(wordcloud(word, n, family = "serif",
random.order = FALSE, max.words = 30,
colors = c("darksalmon", "dodgerblue4")))
32 / 75
2-grams
all_data %>%
select(text) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stopwords_pt$word,
!is.na(word1), !is.na(word2),
!word2 %in% stopwords_pt$word) %>%
count(word1, word2, sort = TRUE) %>%
mutate(word = paste(word1, word2)) %>%
filter(n < quantile(n, 0.999)) %>%
arrange(desc(n)) %>%
slice(1:30) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
coord_flip() +
labs(x = 'Top 30 most common 2-grams', y = 'Count') +
theme_bw(18)
33 / 75
34 / 75
Sentiment analysis
# devtools::install_github("sillasgonzaga/lexiconPT")
# Retrieving the sentiments of Portuguese words from the lexiconPT package
sentiments_pt <- lexiconPT::oplexicon_v2.1 %>%
mutate(word = term) %>%
select(word, polarity)
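The add_sentiments object used below is not created on the slide; a minimal sketch:
# Attaching a polarity to every word that appears in the lexicon
add_sentiments <- unnested %>%
  dplyr::inner_join(sentiments_pt, by = "word")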
35 / 75
add_sentiments %>%
group_by(polarity) %>%
count(word) %>%
filter(n < quantile(n, 0.999)) %>%
top_n(n = 15) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
facet_wrap(~polarity, scales = "free") +
coord_flip() +
labs(x = 'Top 15 most common words', y = 'Counts', title = 'Sentiments') +
theme_bw(14)
36 / 75
37 / 75
Which are the most positive
and most negative songs?
summ <- add_sentiments %>%
group_by(song) %>%
summarise(mean_pol = mean(polarity))
# 15 most positive and most negative songs
summ %>%
arrange(desc(mean_pol)) %>%
slice(c(1:15, 121:135)) %>%
mutate(situation = rep(c('+positive', '+negative'), each = 15)) %>%
ggplot(aes(reorder(song, mean_pol), mean_pol)) +
geom_linerange(aes(ymin = min(mean_pol), ymax = mean_pol,
x = reorder(song, mean_pol)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
facet_wrap(~situation, scales = "free") +
coord_flip() +
labs(x = 'Songs', y = 'Polarities') +
theme_bw(14)
38 / 75
39 / 75
What do we know so far?
The most common words and bi-grams
There are more positive than negative words in the lyrics
Which songs carry the most positive or negative feelings
40 / 75
Part 2. Chords
Extra packages: ggridges, chorddiag
# Removing enharmonies (mapping each accidental to a single spelling)
chords <- all_data %>%
select(chord, song) %>%
dplyr::mutate(chord = case_when(
chord == "Gb" ~ "F#",
chord == "C#" ~ "Db",
chord == "G#" ~ "Ab",
chord == "A#" ~ "Bb",
chord == "D#" ~ "Eb",
chord == "E#" ~ "F",
chord == "B#" ~ "C",
TRUE ~ chord))
41 / 75
Part 2. Chords
# Top 20 songs with the most distinct chords
chords %>%
dplyr::group_by(song, chord) %>%
dplyr::summarise(distintos = n_distinct(chord)) %>%
dplyr::summarise(cont = n()) %>%
dplyr::mutate(song = fct_reorder(song, cont)) %>%
top_n(n = 20) %>%
ggplot(aes(y = cont, x = song)) +
geom_bar(colour = 'dodgerblue4', fill = 'darksalmon',
size = 0.5, alpha = 0.6, stat = "identity") +
labs(x = 'Songs', y = 'Counts') +
coord_flip() +
theme_bw(14)
42 / 75
43 / 75
Extracting variables
The chords data are, in fact, just pieces of text, and text in its raw state
is not very informative. From each chord, we can extract indicators such as:
minor
diminished
augmented
sus
chords with the 7th
chords with the major 7th
chords with the 6th
chords with the 4th
chords with the augmented 5th
chords with the diminished 5th
chords with the 9th
chords with varying bass
44 / 75
Extracting variables
feat_chords <- all_data %>%
select(chord, song) %>%
chorrrds::feature_extraction() %>%
select(-chord) %>%
group_by(song) %>%
summarise_all(mean)
45 / 75
Extracting variables
dplyr::glimpse(feat_chords)
## Observations: 135
## Variables: 13
## $ song <chr> "a banda", "a bela e a fera", "a cidade ideal", "a gal…
## $ minor <dbl> 0.28282828, 0.43939394, 0.15294118, 0.07317073, 0.0000…
## $ dimi <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.0000…
## $ augm <dbl> 0.00000000, 0.00000000, 0.02352941, 0.00000000, 0.0000…
## $ sus <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seventh <dbl> 0.7878788, 0.9090909, 0.5294118, 0.4390244, 1.0000000,…
## $ seventh_M <dbl> 0.04040404, 0.00000000, 0.02352941, 0.00000000, 0.0000…
## $ sixth <dbl> 0.17171717, 0.12121212, 0.00000000, 0.00000000, 0.2673…
## $ fourth <dbl> 0.00000000, 0.34848485, 0.00000000, 0.00000000, 0.1386…
## $ fifth_aug <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fifth_dim <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.0000…
## $ ninth <dbl> 0.31313131, 0.50000000, 0.00000000, 0.00000000, 0.7871…
## $ bass <dbl> 0.10101010, 0.03030303, 0.07058824, 0.00000000, 0.1386…
46 / 75
Visualizing it
library(ggridges)
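# `dt` is not defined on this slide; a minimal sketch, dropping `song` and
# `sus` (the relabelled levels below cover the remaining eleven features)
dt <- feat_chords %>%
  dplyr::select(-song, -sus) %>%
  tidyr::gather(group, vars) %>%
  dplyr::mutate(group = factor(group))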
# Renaming current levels
dt$group <- forcats::lvls_revalue(
dt$group,
c("Augmented", "Bass", "Diminished",
"Augm. Fifth", "Dimi. Fifth",
"Fourth", "Minor", "Ninth", "Seventh",
"Major Seventh", "Sixth"))
# Plotting densities of the extracted features
dt %>%
ggplot(aes(vars, group, fill = group)) +
geom_density_ridges(alpha = 0.6) +
scale_fill_cyclical(values = c("dodgerblue4", "darksalmon")) +
guides(fill = FALSE) +
xlim(0, 1) +
labs(x = "Densities", y = "extracted features") +
theme_bw(14)
47 / 75
48 / 75
Chord diagrams using the chords
Chord transitions are an important element of the harmonic structure of
songs. Let's check how those transitions happen in this case.
# devtools::install_github("mattflor/chorddiag")
# Counting the transitions between the chords
comp <- chords %>%
dplyr::mutate(
# Cleaning the chords to the base form
chord_clean = stringr::str_extract(chord, pattern = "^([A-G]#?b?)"),
seq = lead(chord_clean)) %>%
dplyr::filter(chord_clean != seq) %>%
dplyr::group_by(chord_clean, seq) %>%
dplyr::summarise(n = n())
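The diagram on the next slide can be drawn with chorddiag; a sketch that first reshapes the transition counts into a matrix (the matrix may need manual squaring if some chords only ever appear as origin or destination):
# Turning the counts into a chord-by-chord matrix and plotting it
mat <- comp %>%
  dplyr::ungroup() %>%
  tidyr::spread(key = seq, value = n, fill = 0)
mm <- as.matrix(mat[, -1])
rownames(mm) <- mat$chord_clean
chorddiag::chorddiag(mm)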
50 / 75
Chord diagram
(Interactive chord diagram: transition counts between the base chords; labels include A, A#, Ab, Bb, C, C#, D#, Db, E, Eb, F#, G, G#, Gb)
51 / 75
The circle of fifths
Allows us to understand the most probable harmonic fields
52 / 75
What do we know so far?
Some songs are harmonically more "complex" than others:
number of distinct chords
extracted variables
The most common and rarest chord transitions
53 / 75
Part 3. Spotify Variables
Exploring the variables
spot <- all_data %>%
group_by(song) %>%
slice(1) %>%
ungroup()
# Density of the popularity of the songs
spot %>%
ggplot(aes(popul)) +
geom_density(colour = 'dodgerblue4', fill = "darksalmon",
alpha = 0.8) +
labs(y = 'Density', x = 'Popularity') +
theme_bw(14)
54 / 75
It varies a lot!
55 / 75
Most popular and least
popular songs
spot %>%
arrange(desc(popul)) %>%
slice(c(1:15, 121:135)) %>%
mutate(situation = rep(c('+popul', '-popul'), each = 15)) %>%
select(popul, situation, song) %>%
ggplot(aes(reorder(song, popul), popul, group = 1)) +
geom_bar(colour = 'dodgerblue4', fill = "darksalmon",
size = 0.3, alpha = 0.6,
stat = "identity") +
facet_wrap(~situation, scales = "free") +
coord_flip() +
labs(x = 'Songs', y = 'Popularity') +
theme_bw(14)
56 / 75
57 / 75
Danceability x variables
# Keeping danceability so it can be plotted against each of the other variables
dt <- spot %>%
  select(danceability, energy,
         loudness, speechiness, liveness, duration_ms,
         acousticness) %>%
  tidyr::gather(group, vars, -danceability)
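The plot on the next slide is not shown in code; a sketch of danceability against each gathered variable (this relies on danceability being kept in dt above):
dt %>%
  ggplot(aes(x = vars, y = danceability)) +
  geom_point(colour = "darksalmon", alpha = 0.7) +
  geom_smooth(method = "loess", colour = "dodgerblue4") +
  facet_wrap(~group, scales = "free_x") +
  labs(x = "Variable values", y = "Danceability") +
  theme_bw(14)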
58 / 75
59 / 75
What do we know so far?
How popularity varies in this dataset
Which are the least and most popular songs
How danceability relates to the other variables
60 / 75
Modeling
61 / 75
Modeling
Let's now consider that we have a special interest in the popularity of the songs.
Which variables are most associated with higher or lower levels of
popularity?
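The pop_class variable used below is never built on these slides; a hypothetical construction (the cut points are pure assumptions):
# Discretising popularity into three classes; cut points are assumptions
spot <- spot %>%
  dplyr::mutate(pop_class = dplyr::case_when(
    popul <= quantile(popul, 0.25) ~ "unpopular",
    popul >= quantile(popul, 0.55) ~ "popular",
    TRUE ~ "neutral") %>% as.factor())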
library(randomForest)
spot %>%
janitor::tabyl(pop_class)
## pop_class n percent
## neutral 38 0.2814815
## popular 63 0.4666667
## unpopular 34 0.2518519
62 / 75
Wrangling the data to make it ready for modeling
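The model_data object is not constructed on the slide; a sketch that joins the Spotify variables with the chords features and flags each row as train or test (column names and the seed are assumptions):
set.seed(2019)  # hypothetical seed
model_data <- spot %>%
  dplyr::inner_join(feat_chords, by = "song") %>%
  dplyr::select(-song, -text, -chord) %>%   # dropping raw text columns; names assumed
  dplyr::mutate(part = ifelse(runif(dplyr::n()) < 0.75, "train", "test"))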
model_data %>%
janitor::tabyl(part)
## part n percent
## test 30 0.2222222
## train 105 0.7777778
63 / 75
Separating into a training set (~75%) and a test set (~25%):
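A sketch of the corresponding split, using the part flag created above:
train <- model_data %>% dplyr::filter(part == "train") %>% dplyr::select(-part)
test  <- model_data %>% dplyr::filter(part == "test") %>% dplyr::select(-part)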
64 / 75
m0 <- randomForest(pop_class ~ ., data = train,
ntree = 1000)
m0
##
## Call:
## randomForest(formula = pop_class ~ ., data = train, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 33.33%
## Confusion matrix:
## neutral popular unpopular class.error
## neutral 21 12 1 0.3823529
## popular 7 41 0 0.1458333
## unpopular 7 8 8 0.6521739
65 / 75
Visualizing the variable importance:
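The imp0 object is used below but not created on the slide; a minimal sketch from the fitted forest:
# Extracting the Gini importances into a tidy data frame
imp0 <- randomForest::importance(m0) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("var") %>%
  dplyr::rename(value = MeanDecreaseGini)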
imp0 %>%
arrange(var, value) %>%
mutate(var = fct_reorder(factor(var), value, min)) %>%
ggplot(aes(var, value)) +
geom_point(size = 3.5, colour = "darksalmon") +
coord_flip() +
labs(x = "Variables", y = "Decrease in Gini criteria") +
theme_bw(14)
66 / 75
Visualizing the variable importance:
67 / 75
corrplot::corrplot(cor(train %>% select_if(is.numeric),
method = "spearman"))
68 / 75
Redoing the model with the best variables
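The refit itself is not shown; a sketch consistent with the printed call below (the predictor list is a placeholder and should come from the importance plot and the correlations):
# Refitting with a reduced set of variables; predictors here are placeholders
form <- pop_class ~ danceability + energy + loudness + duration_ms + valence
m0 <- randomForest(form, data = train, ntree = 1000, mtry = 5)
m0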
##
## Call:
## randomForest(formula = form, data = train, ntree = 1000, mtry = 5)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 25.71%
## Confusion matrix:
## neutral popular unpopular class.error
## neutral 26 7 1 0.2352941
## popular 7 41 0 0.1458333
## unpopular 5 7 11 0.5217391
69 / 75
Measuring the accuracy in
the test set
pred <- predict(m0, test)
sum(pred == test$pop_class)/nrow(test)
## [1] 0.5333333
mean(m0$err.rate[,1])
## [1] 0.3409731
70 / 75
How could we improve this model?
More data!
Better evaluate the correlations between the variables
Remove noisy predictors
Engineer new features
71 / 75
Citation
@misc{musicdatainR,
author = {Wundervald, Bruna and Trecenti, Julio},
title = {Music Data Analysis in R},
url = {https://github.com/brunaw/SER2019},
year = {2019}
}
72 / 75
Acknowledgments
This work was supported by a Science Foundation Ireland Career Development
Award, grant number 17/CDA/4695.
73 / 75
Some references
Feinerer, I, K. Hornik, and D. Meyer (2008). “Text Mining Infrastructure in R”. In:
Journal of Statistical Software 25.5, pp. 1–54. URL:
http://www.jstatsoft.org/v25/i05/.
Silge, J., D. Robinson, and J. Hester (2016). tidytext: Text mining using dplyr,
ggplot2, and other tidy tools. DOI: 10.5281/zenodo.56714. URL:
http://dx.doi.org/10.5281/zenodo.56714.
Thank you! 75 / 75