Music Data Analysis in R
Bruna Wundervald & Julio Trecenti
GitHub: https://github.com/brunaw/SER2019
2 / 75
Who we are
Bruna Wundervald
3 / 75
Who we are
Julio Trecenti
4 / 75
Goals
Learn how to use the vagalumeR, Rspotify, and chorrrds packages, with help from the tidyverse.
5 / 75
Prerequisites & resources
Beginner/intermediate R
tidyverse
%>% (pipe) is essential!
R-Music Blog
6 / 75
Don't get lost!
If you ever get stuck in any part of this course, don't hesitate to ask us.
https://www.rstudio.com/resources/cheatsheets/
https://curso-r.com/material/
7 / 75
Loading packages
Main packages:
library(vagalumeR)
library(Rspotify)
library(chorrrds)
library(tidyverse)
8 / 75
Data extraction
vagalumeR: music lyrics
Rspotify: Spotify variables
chorrrds: music chords
9 / 75
Data extraction
For each package, there are a few steps to follow:
1. obtain the IDs of the objects we want to extract information about
(artists, albums, songs); and
2. use those IDs inside specific functions.
10 / 75
Connecting to the APIs
vagalumeR
Steps:
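The steps appear on the original slide as links; roughly: create a free account at vagalume.com.br, generate an API key, and store it in R. A minimal sketch (the object name is just a placeholder):
# Key obtained from the Vagalume developer page; placeholder value
key_vagalume <- "your-api-key-here"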
11 / 75
Connecting to the APIs
Rspotify
Steps:
library(Rspotify)
key_spotify <- spotifyOAuth("app_id","client_id","client_secret")
The keys will be used later to create the connection between R and the
data extraction functions.
12 / 75
vagalumeR
# 1. Defining the artists
artist <- "chico-buarque"
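The following steps are not shown on this slide; a hedged sketch using vagalumeR's songNames() and lyrics() (argument and column names below are assumptions to check against the package docs):
# 2. Getting the names and IDs of the artist's songs
songs <- vagalumeR::songNames(artist)
# 3. Mapping the lyrics() function over the song IDs
lyrics <- songs %>%
  dplyr::pull(song.id) %>%   # column name assumed
  purrr::map_dfr(vagalumeR::lyrics, type = "id", key = key_vagalume)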
13 / 75
Rspotify - variables
“danceability” = describes how suitable a track is for dancing based on a
combination of musical elements including tempo, rhythm stability, beat
strength, and overall regularity.
“energy” = a measure from 0.0 to 1.0 and represents a perceptual measure of
intensity and activity.
“key” = estimated overall key of the track. Integers map to pitches using
standard Pitch Class notation, e.g. 0 = C, 1 = C#/Db, 2 = D, and so on.
“loudness” = overall loudness of a track in decibels (dB).
“mode” = indicates the modality (major or minor) of a track, the type of scale
from which its melodic content is derived.
“speechiness” = detects the presence of spoken words in a track.
“acousticness” = a measure from 0.0 to 1.0 of whether the track is acoustic.
“instrumentalness” = whether a track contains no vocals.
“liveness” = detects the presence of an audience in the recording.
“valence” = measure from 0.0 to 1.0 describing the musical positiveness
conveyed by a track.
“tempo” = overall estimated tempo of a track in beats per minute (BPM).
“duration_ms” = duration of the track in milliseconds.
“time_signature” = an estimated overall time signature of the track, i.e. how many beats are in each bar.
“popularity” = the popularity of the song, from 0 to 100.
14 / 75
Rspotify
# 1. Search the artist using the API
find_artist <- searchArtist("chico buarque", token = key_spotify)
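The remaining extraction steps are not shown here; a hedged sketch with Rspotify's getAlbums(), getAlbum(), and getFeatures() (treat the exact argument and column names as assumptions):
# 2. Getting the albums of the artist found above
albums <- Rspotify::getAlbums(find_artist$id[1], token = key_spotify)
# 3. Getting the songs of each album and their audio features
songs_spot <- albums %>%
  dplyr::pull(id) %>%        # column name assumed
  purrr::map_dfr(Rspotify::getAlbum, token = key_spotify)
features <- songs_spot %>%
  dplyr::pull(id) %>%
  purrr::map_dfr(Rspotify::getFeatures, token = key_spotify)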
15 / 75
Until now, the package does not offer a simple way to extract the
popularity of the songs. How do we solve this issue? One possible workaround is sketched below.
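Since the Spotify Web API track endpoint returns a popularity field, we can call it directly with httr; a sketch (not the slide's original code):
library(httr)
# GET /v1/tracks/{id} returns `popularity` (0-100);
# `track_id` and `access_token` are assumed to exist already
resp <- GET(paste0("https://api.spotify.com/v1/tracks/", track_id),
            add_headers(Authorization = paste("Bearer", access_token)))
popularity <- content(resp)$popularity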
16 / 75
Details about the APIs
APIs can be very unstable: sometimes, even without reaching the access
limit, requests will fail.
17 / 75
chorrrds
# 1. Searching the songs
songs <- "chico-buarque" %>%
chorrrds::get_songs()
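The chord extraction itself is the next step; a sketch mapping chorrrds::get_chords() over the song URLs returned above (the url column name is an assumption):
# 2. Extracting the chords of each song
chords <- songs %>%
  dplyr::pull(url) %>%
  purrr::map(chorrrds::get_chords) %>%
  purrr::map_dfr(dplyr::mutate_if, is.factor, as.character)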
18 / 75
Combining different datasets
# Standardise the name of the key column and use it inside the joins
chords <- chords %>%
dplyr::mutate(song = stringr::str_remove(music, "chico buarque ")) %>%
dplyr::select(-music)
19 / 75
What if there are a lot of mismatches?
nrow(chords) - nrow(all_data)
## [1] 8973
When the titles are different, we should take the most similar ones and merge them.
Usually, there's no clear cut-off point for the similarity, so we define it arbitrarily.
20 / 75
Fixing the titles
# Let's find the string distances between the titles and use this
# information to fix them in the dataset
# 1. Which ones are in the chords data but not in the lyrics one?
anti_chords_lyrics <- chords %>%
dplyr::anti_join(lyrics, by = "song")
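Steps 2-3 are not shown on the slide; a sketch assuming RecordLinkage::levenshteinSim(), which returns a similarity in [0, 1] and therefore matches the 0.70 cut used below:
# 2.-3. Computing the similarity between each unmatched chords title
# and every title in the lyrics data
library(RecordLinkage)
dists <- anti_chords_lyrics$song %>%
  purrr::map(levenshteinSim, str2 = lyrics$song)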
21 / 75
# 4. Retrieving the most similar titles in the two datasets
ordered_dists <- dists %>% purrr::map_dbl(max)       # highest similarity per title
max_dists <- dists %>% purrr::map_dbl(which.max)     # position of the closest title
# 5. Filtering the ones that have similarity > 0.70
indexes_min_dist <- which(ordered_dists > 0.70)
songs_min_dist <- lyrics$song[indexes_min_dist]
index_lyrics <- max_dists[which(ordered_dists > 0.70)]
...
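The elided step presumably pairs each chords title with its closest lyrics title; a hypothetical reconstruction (object and column names inferred from the next slide):
# 6. (sketch) pairing the similar titles found above
results_dist_lyrics <- data.frame(
  from_chords = anti_chords_lyrics$song[indexes_min_dist],
  from_lyrics = lyrics$song[index_lyrics])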
22 / 75
Fixing manually
chords <- chords %>%
dplyr::mutate(
song =
dplyr::case_when(
song == 'a bela a fera' ~ 'a bela e a fera',
song == 'a historia de lily braun' ~ 'a história de lily braun',
song == 'a moca do sonho' ~ 'a moça do sonho',
song == 'a ostra o vento' ~ 'a ostra e o vento',
song == 'a televisao' ~ 'a televisão',
song == 'a valsa dos clows' ~ 'a valsa dos clowns',
song == 'a voz do dono o dono da voz' ~ 'a voz do dono e o dono da voz',
song == 'agora falando serio' ~ 'agora falando sério',
TRUE ~ song))
# Helper that prints ready-made case_when lines like the ones above
cat(
  paste0("song == ", "'", results_dist_lyrics$from_chords, "' ~ '",
         results_dist_lyrics$from_lyrics, "', "), sep = "")
23 / 75
Redoing the joins
all_data <- chords %>%
dplyr::inner_join(lyrics, by = "song") %>%
dplyr::inner_join(features, by = "song")
24 / 75
Exploratory Analysis
25 / 75
Part 1: lyrics
Extra packages: tidytext, wordcloud, tm, lexiconPT
26 / 75
n-grams
n-grams: words and their "past"
# (assumed example string, reconstructed from the outputs below)
nome1 <- "geni e o zepelim"
tokenizers::tokenize_ngrams(nome1, n = 1)
## [[1]]
## [1] "geni" "e" "o" "zepelim"
tokenizers::tokenize_ngrams(nome1, n = 2)
## [[1]]
## [1] "geni e" "e o" "o zepelim"
tokenizers::tokenize_ngrams(nome1, n = 3)
## [[1]]
## [1] "geni e o" "e o zepelim"
27 / 75
n-grams
The unnest_tokens() function separates the n-grams of each lyric.
library(tidytext)
library(wordcloud)
# List of portuguese stopwords:
stopwords_pt <- data.frame(word = tm::stopwords("portuguese"))
stopwords: very frequent words in a language that might not be essential to
the overall meaning of a sentence
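The unnested object used on the next slides is not built here; a minimal sketch, assuming the lyrics sit in the text column (as in the 2-grams code later):
# Breaking the lyrics into one word per row and removing stopwords
unnested <- all_data %>%
  unnest_tokens(word, text) %>%
  dplyr::anti_join(stopwords_pt, by = "word")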
28 / 75
Part 1: lyrics
Counting each word that appeared in the songs:
unnested %>%
dplyr::count(word) %>%
arrange(desc(n)) %>%
slice(1:10)
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 é 40064
## 2 pra 24276
## 3 iá 23832
## 4 amor 18460
## 5 diz 16505
## 6 chocalho 15888
## 7 vai 14603
## 8 morena 11148
## 9 esperando 10987
## 10 dia 10642
29 / 75
1-grams
unnested %>%
dplyr::count(word) %>%
# removing the handful of words with extremely high counts
dplyr::filter(n < quantile(n, 0.999)) %>%
dplyr::top_n(n = 30) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
coord_flip() +
labs(x = 'Top 30 most common words', y = 'Count') +
theme_bw(14)
30 / 75
31 / 75
In a wordcloud format
unnested %>%
count(word) %>%
with(wordcloud(word, n, family = "serif",
random.order = FALSE, max.words = 30,
colors = c("darksalmon", "dodgerblue4")))
32 / 75
2-grams
all_data %>%
select(text) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stopwords_pt$word,
!is.na(word1), !is.na(word2),
!word2 %in% stopwords_pt$word) %>%
count(word1, word2, sort = TRUE) %>%
mutate(word = paste(word1, word2)) %>%
filter(n < quantile(n, 0.999)) %>%
arrange(desc(n)) %>%
slice(1:30) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
coord_flip() +
labs(x = 'Top 30 most common 2-grams', y = 'Count') +
theme_bw(18)
33 / 75
34 / 75
Sentiment analysis
# devtools::install_github("sillasgonzaga/lexiconPT")
# Retrieving the sentiments of Portuguese words from the lexiconPT package
sentiments_pt <- lexiconPT::oplexicon_v2.1 %>%
mutate(word = term) %>%
select(word, polarity)
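The add_sentiments object used below is not created on the slide; a minimal sketch:
# Attaching a polarity to every word that appears in the lexicon
add_sentiments <- unnested %>%
  dplyr::inner_join(sentiments_pt, by = "word")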
35 / 75
add_sentiments %>%
group_by(polarity) %>%
count(word) %>%
filter(n < quantile(n, 0.999)) %>%
top_n(n = 15) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
facet_wrap(~polarity, scales = "free") +
coord_flip() +
labs(x = 'Top 15 most common words', y = 'Counts', title = 'Sentiments') +
theme_bw(14)
36 / 75
37 / 75
Which are the most positive
and most negative songs?
summ <- add_sentiments %>%
group_by(song) %>%
summarise(mean_pol = mean(polarity))
# 15 most positive and most negative songs
summ %>%
arrange(desc(mean_pol)) %>%
slice(c(1:15, 121:135)) %>%
mutate(situation = rep(c('+positive', '+negative'), each = 15)) %>%
ggplot(aes(reorder(song, mean_pol), mean_pol)) +
geom_linerange(aes(ymin = min(mean_pol), ymax = mean_pol,
x = reorder(song, mean_pol)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
facet_wrap(~situation, scales = "free") +
coord_flip() +
labs(x = 'Songs', y = 'Polarities') +
theme_bw(14)
38 / 75
39 / 75
What do we know so far?
The most common words and bi-grams
There are more positive than negative words in the lyrics
Which songs carry the most positive or negative feelings
40 / 75
Part 2. Chords
Extra packages: ggridges, chorddiag
# Removing enharmonies (mapping each accidental to a single spelling)
chords <- all_data %>%
select(chord, song) %>%
dplyr::mutate(chord = case_when(
chord == "Gb" ~ "F#",
chord == "C#" ~ "Db",
chord == "G#" ~ "Ab",
chord == "A#" ~ "Bb",
chord == "D#" ~ "Eb",
chord == "E#" ~ "F",
chord == "B#" ~ "C",
TRUE ~ chord))
41 / 75
Part 2. Chords
# Top 20 songs with the most distinct chords
chords %>%
dplyr::group_by(song, chord) %>%
dplyr::summarise(distintos = n_distinct(chord)) %>%
dplyr::summarise(cont = n()) %>%
dplyr::mutate(song = fct_reorder(song, cont)) %>%
top_n(n = 20) %>%
ggplot(aes(y = cont, x = song)) +
geom_bar(colour = 'dodgerblue4', fill = 'darksalmon',
size = 0.5, alpha = 0.6, stat = "identity") +
labs(x = 'Songs', y = 'Counts') +
coord_flip() +
theme_bw(14)
42 / 75
43 / 75
Extracting variables
The chords data are, in fact, just pieces of text, and text in its raw state
is not very informative. From each chord, we can extract indicators such as:
minor
diminished
augmented
sus
chords with the 7th
chords with the major 7th
chords with the 6th
chords with the 4th
chords with the augmented 5th
chords with the diminished 5th
chords with the 9th
chords with varying bass
44 / 75
Extracting variables
feat_chords <- all_data %>%
select(chord, song) %>%
chorrrds::feature_extraction() %>%
select(-chord) %>%
group_by(song) %>%
summarise_all(mean)
45 / 75
Extracting variables
dplyr::glimpse(feat_chords)
## Observations: 135
## Variables: 13
## $ song <chr> "a banda", "a bela e a fera", "a cidade ideal", "a gal…
## $ minor <dbl> 0.28282828, 0.43939394, 0.15294118, 0.07317073, 0.0000…
## $ dimi <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.0000…
## $ augm <dbl> 0.00000000, 0.00000000, 0.02352941, 0.00000000, 0.0000…
## $ sus <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seventh <dbl> 0.7878788, 0.9090909, 0.5294118, 0.4390244, 1.0000000,…
## $ seventh_M <dbl> 0.04040404, 0.00000000, 0.02352941, 0.00000000, 0.0000…
## $ sixth <dbl> 0.17171717, 0.12121212, 0.00000000, 0.00000000, 0.2673…
## $ fourth <dbl> 0.00000000, 0.34848485, 0.00000000, 0.00000000, 0.1386…
## $ fifth_aug <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fifth_dim <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.0000…
## $ ninth <dbl> 0.31313131, 0.50000000, 0.00000000, 0.00000000, 0.7871…
## $ bass <dbl> 0.10101010, 0.03030303, 0.07058824, 0.00000000, 0.1386…
46 / 75
Visualizing it
library(ggridges)
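# `dt` is not defined on this slide; a minimal sketch, dropping `song` and
# `sus` (the relabelled levels below cover the remaining eleven features)
dt <- feat_chords %>%
  dplyr::select(-song, -sus) %>%
  tidyr::gather(group, vars) %>%
  dplyr::mutate(group = factor(group))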
# Renaming current levels
dt$group <- forcats::lvls_revalue(
dt$group,
c("Augmented", "Bass", "Diminished",
"Augm. Fifth", "Dimi. Fifth",
"Fourth", "Minor", "Ninth", "Seventh",
"Major Seventh", "Sixth"))
# Plotting densities of the extracted features
dt %>%
ggplot(aes(vars, group, fill = group)) +
geom_density_ridges(alpha = 0.6) +
scale_fill_cyclical(values = c("dodgerblue4", "darksalmon")) +
guides(fill = FALSE) +
xlim(0, 1) +
labs(x = "Densities", y = "extracted features") +
theme_bw(14)
47 / 75
48 / 75
Chord diagrams using the chords
Chord transitions are an important element of the harmonic structure of
songs. Let's check how those transitions happen in this case.
# devtools::install_github("mattflor/chorddiag")
# Counting the transitions between the chords
comp <- chords %>%
dplyr::mutate(
# Cleaning the chords to the base form
chord_clean = stringr::str_extract(chord, pattern = "^([A-G]#?b?)"),
seq = lead(chord_clean)) %>%
dplyr::filter(chord_clean != seq) %>%
dplyr::group_by(chord_clean, seq) %>%
dplyr::summarise(n = n())
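The diagram on the next slide can be drawn with chorddiag; a sketch that first reshapes the transition counts into a matrix (the matrix may need manual squaring if some chords only ever appear as origin or destination):
# Turning the counts into a chord-by-chord matrix and plotting it
mat <- comp %>%
  dplyr::ungroup() %>%
  tidyr::spread(key = seq, value = n, fill = 0)
mm <- as.matrix(mat[, -1])
rownames(mm) <- mat$chord_clean
chorddiag::chorddiag(mm)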
50 / 75
Chord diagram
(Interactive chord diagram: transition counts between the base chords; labels include A, A#, Ab, Bb, C, C#, D#, Db, E, Eb, F#, G, G#, Gb)
51 / 75
The circle of fifths
Allows us to understand the most probable harmonic fields
52 / 75
What do we know so far?
Some songs are harmonically more "complex" than others:
number of distinct chords
extracted variables
The most common and rarest chord transitions
53 / 75
Part 3. Spotify Variables
Exploring the variables
spot <- all_data %>%
group_by(song) %>%
slice(1) %>%
ungroup()
# Density of the popularity of the songs
spot %>%
ggplot(aes(popul)) +
geom_density(colour = 'dodgerblue4', fill = "darksalmon",
alpha = 0.8) +
labs(y = 'Density', x = 'Popularity') +
theme_bw(14)
54 / 75
It varies a lot!
55 / 75
Most popular and least
popular songs
spot %>%
arrange(desc(popul)) %>%
slice(c(1:15, 121:135)) %>%
mutate(situation = rep(c('+popul', '-popul'), each = 15)) %>%
select(popul, situation, song) %>%
ggplot(aes(reorder(song, popul), popul, group = 1)) +
geom_bar(colour = 'dodgerblue4', fill = "darksalmon",
size = 0.3, alpha = 0.6,
stat = "identity") +
facet_wrap(~situation, scales = "free") +
coord_flip() +
labs(x = 'Songs', y = 'Popularity') +
theme_bw(14)
56 / 75
57 / 75
Danceability x variables
# Keeping danceability so it can be plotted against each of the other variables
dt <- spot %>%
  select(danceability, energy,
         loudness, speechiness, liveness, duration_ms,
         acousticness) %>%
  tidyr::gather(group, vars, -danceability)
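The plot on the next slide is not shown in code; a sketch of danceability against each gathered variable (this relies on danceability being kept in dt above):
dt %>%
  ggplot(aes(x = vars, y = danceability)) +
  geom_point(colour = "darksalmon", alpha = 0.7) +
  geom_smooth(method = "loess", colour = "dodgerblue4") +
  facet_wrap(~group, scales = "free_x") +
  labs(x = "Variable values", y = "Danceability") +
  theme_bw(14)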
58 / 75
59 / 75
What do we know so far?
How popularity varies in this dataset
Which are the least and most popular songs
How danceability relates to the other variables
60 / 75
Modeling
61 / 75
Modeling
Let's now consider that we have a special interest in the popularity of the songs.
Which variables are most associated with higher or lower levels of
popularity?
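The pop_class variable used below is never built on these slides; a hypothetical construction (the cut points are pure assumptions):
# Discretising popularity into three classes; cut points are assumptions
spot <- spot %>%
  dplyr::mutate(pop_class = dplyr::case_when(
    popul <= quantile(popul, 0.25) ~ "unpopular",
    popul >= quantile(popul, 0.55) ~ "popular",
    TRUE ~ "neutral") %>% as.factor())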
library(randomForest)
spot %>%
janitor::tabyl(pop_class)
## pop_class n percent
## neutral 38 0.2814815
## popular 63 0.4666667
## unpopular 34 0.2518519
62 / 75
Wrangling the data to make it ready for modeling
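The model_data object is not constructed on the slide; a sketch that joins the Spotify variables with the chords features and flags each row as train or test (column names and the seed are assumptions):
set.seed(2019)  # hypothetical seed
model_data <- spot %>%
  dplyr::inner_join(feat_chords, by = "song") %>%
  dplyr::select(-song, -text, -chord) %>%   # dropping raw text columns; names assumed
  dplyr::mutate(part = ifelse(runif(dplyr::n()) < 0.75, "train", "test"))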
model_data %>%
janitor::tabyl(part)
## part n percent
## test 30 0.2222222
## train 105 0.7777778
63 / 75
Separating into a training set (~75%) and a test set (~25%):
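A sketch of the corresponding split, using the part flag created above:
train <- model_data %>% dplyr::filter(part == "train") %>% dplyr::select(-part)
test  <- model_data %>% dplyr::filter(part == "test") %>% dplyr::select(-part)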
64 / 75
m0 <- randomForest(pop_class ~ ., data = train,
ntree = 1000)
m0
##
## Call:
## randomForest(formula = pop_class ~ ., data = train, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 33.33%
## Confusion matrix:
## neutral popular unpopular class.error
## neutral 21 12 1 0.3823529
## popular 7 41 0 0.1458333
## unpopular 7 8 8 0.6521739
65 / 75
Visualizing the variable importance:
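The imp0 object is used below but not created on the slide; a minimal sketch from the fitted forest:
# Extracting the Gini importances into a tidy data frame
imp0 <- randomForest::importance(m0) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("var") %>%
  dplyr::rename(value = MeanDecreaseGini)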
imp0 %>%
arrange(var, value) %>%
mutate(var = fct_reorder(factor(var), value, min)) %>%
ggplot(aes(var, value)) +
geom_point(size = 3.5, colour = "darksalmon") +
coord_flip() +
labs(x = "Variables", y = "Decrease in Gini criteria") +
theme_bw(14)
66 / 75
Visualizing the variable importance:
67 / 75
corrplot::corrplot(cor(train %>% select_if(is.numeric),
method = "spearman"))
68 / 75
Redoing the model with the best variables
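The refit itself is not shown; a sketch consistent with the printed call below (the predictor list is a placeholder and should come from the importance plot and the correlations):
# Refitting with a reduced set of variables; predictors here are placeholders
form <- pop_class ~ danceability + energy + loudness + duration_ms + valence
m0 <- randomForest(form, data = train, ntree = 1000, mtry = 5)
m0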
##
## Call:
## randomForest(formula = form, data = train, ntree = 1000, mtry = 5)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 25.71%
## Confusion matrix:
## neutral popular unpopular class.error
## neutral 26 7 1 0.2352941
## popular 7 41 0 0.1458333
## unpopular 5 7 11 0.5217391
69 / 75
Measuring the accuracy in
the test set
pred <- predict(m0, test)
sum(pred == test$pop_class)/nrow(test)
## [1] 0.5333333
mean(m0$err.rate[,1])
## [1] 0.3409731
70 / 75
How could we improve this model?
More data!
Better evaluate the correlations between the variables
Remove noisy predictors
Engineer new features
71 / 75
Citation
@misc{musicdatainR,
author = {Wundervald, Bruna and Trecenti, Julio},
title = {Music Data Analysis in R},
url = {https://github.com/brunaw/SER2019},
year = {2019}
}
72 / 75
Acknowledgments
This work was supported by a Science Foundation Ireland Career Development
Award, grant number 17/CDA/4695.
73 / 75
Some references
Feinerer, I, K. Hornik, and D. Meyer (2008). “Text Mining Infrastructure in R”. In:
Journal of Statistical Software 25.5, pp. 1–54. URL:
http://www.jstatsoft.org/v25/i05/.
Silge, J., D. Robinson, and J. Hester (2016). tidytext: Text mining using dplyr,
ggplot2, and other tidy tools. DOI: 10.5281/zenodo.56714. URL:
http://dx.doi.org/10.5281/zenodo.56714.
Thank you! 75 / 75