

Music Data Analysis in R
IV International Seminar on Statistics
with R
Bruna Wundervald & Julio Trecenti
May, 2019
This presentation can be found at: http://brunaw.com/shortcourses/IXSER/en/pres-en.html

GitHub: https://github.com/brunaw/SER2019

2 / 75
Who are we
Bruna Wundervald

Ph.D. Candidate in Statistics at Maynooth University.
Twitter: @bwundervald
GitHub: @brunaw

3 / 75
Who are we
Julio Trecenti

Ph.D. Candidate in Statistics at IME-USP
Partner at Curso-R
Twitter: @jtrecenti
GitHub: @jtrecenti

4 / 75
Goals
Learn how to use the packages:

vagalumeR: lyrics extraction
chorrrds: chords extraction
Rspotify: extract variables from the Spotify API

Understand how APIs work in general;

Combine data from different sources;


Understand and summarise data in various formats:
Text,
Continuous variables,
Sequences
Create a prediction model with the final data.

Not included: audio analysis.

5 / 75
Prerequisites & resources
R beginner/intermediate
tidyverse
%>% (pipe) is essential!
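
A one-line refresher (a sketch, assuming the tidyverse is loaded): the pipe passes the result on the left as the first argument of the function on the right.

c(1, 4, 9) %>% sqrt() %>% sum()
#> [1] 6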

R-Music Blog

R for music data extraction & analysis

6 / 75
Don't get lost!
If you are ever stuck in any part of this course, don't hesitate to ask us!

Keep the RStudio Cheatsheets at hand at all times:

https://www.rstudio.com/resources/cheatsheets/

If you need material in Portuguese, check the Curso-R website:

https://curso-r.com/material/

7 / 75
Loading packages
Main packages:

library(vagalumeR)
library(Rspotify)
library(chorrrds)
library(tidyverse)

8 / 75
Data extraction
vagalumeR: music lyrics
Rspotify: Spotify variables
chorrrds: music chords
9 / 75
Data extraction
For each package, there are a few steps to be followed.

The steps involve, basically:

1. obtaining the IDs of the objects for which we want to extract information
(artists, albums, songs), and
2. using those IDs inside specific functions, as sketched below.
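
In code, the pattern looks roughly like this (a sketch; get_ids() and get_info() are hypothetical placeholders, not functions from these packages):

ids <- get_ids("artist-name")             # step 1: find the IDs
info <- purrr::map_dfr(ids, get_info)     # step 2: map an extraction function over them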

10 / 75
Connecting to the APIs
vagalumeR
Steps:

1. Go to https://auth.vagalume.com.br/ and log in,


2. Go to https://auth.vagalume.com.br/settings/api/ and create a new app,
3. Go to https://auth.vagalume.com.br/settings/api/ again and copy the
app's credential,
4. Save that credential in an object, like:

key_vagalume <- "my-credential"

11 / 75
Connecting to the APIs
Rspotify
Steps:

1. Go to https://developer.spotify.com/ and log in,
2. Go to https://developer.spotify.com/dashboard/ and create a new app,
3. Save the client ID and the client Secret generated,
4. Define the redirect URL as http://localhost:1410/,
5. Use spotifyOAuth() to authenticate:

library(Rspotify)
key_spotify <- spotifyOAuth("app_id","client_id","client_secret")

The keys will be used later to create the connection between R and the
data extraction functions.

12 / 75
vagalumeR
# 1. Defining the artists
artist <- "chico-buarque"

# 2. Look for the names and IDs of the songs


songs <- artist %>%
purrr::map_dfr(songNames)

# 3. Map the lyrics functions in the IDs found


lyrics <- songs %>%
dplyr::pull(song.id) %>%
purrr::map(lyrics,
artist = artist,
type = "id",
key = key_vagalume) %>%
purrr::map_dfr(data.frame) %>%
dplyr::select(-song) %>%
dplyr::right_join(songs %>%
dplyr::select(song, song.id), by = "song.id")

13 / 75
Rspotify - variables
“danceability” = describes how suitable a track is for dancing based on a
combination of musical elements including tempo, rhythm stability, beat
strength, and overall regularity.
“energy” = a measure from 0.0 to 1.0 and represents a perceptual measure of
intensity and activity.
“key” = estimated overall key of the track. Integers map to pitches using
standard Pitch Class notation, e.g. 0 = C, 1 = C#/Db, 2 = D, and so on.
“loudness” = overall loudness of a track in decibels (dB).
“mode” = indicates the modality (major or minor) of a track, the type of scale
from which its melodic content is derived.
“speechiness” = detects the presence of spoken words in a track.
“acousticness” = a measure from 0.0 to 1.0 of whether the track is acoustic.
“instrumentalness” = whether a track contains no vocals.
“liveness” = detects the presence of an audience in the recording.
“valence” = measure from 0.0 to 1.0 describing the musical positiveness
conveyed by a track.
“tempo” = overall estimated tempo of a track in beats per minute (BPM).
“duration_ms” = duration of the track in milliseconds.
“time_signature” = an estimated overall time signature of the track (how many
beats are in each bar).
“popularity” = the popularity of the song, from 0 to 100.
14 / 75
Rspotify
# 1. Search the artist using the API
find_artist <- searchArtist("chico buarque", token = key_spotify)

# 2. Use the ID to search for album information


albums <- getAlbums(find_artist$id[1], token = key_spotify)
# 3. Obtain the songs of each album
albums_res <- albums %>%
dplyr::pull(id) %>%
purrr::map_df(
~{
getAlbum(.x, token = key_spotify) %>%
dplyr::select(id, name)
}) %>%
tidyr::unnest()

ids <- albums_res %>%
  dplyr::pull(id)

# 4. Obtain the variables for each song


features <- ids %>%
purrr::map_dfr(~getFeatures(.x, token = key_spotify)) %>%
dplyr::left_join(albums_res, by = "id")

15 / 75
So far, the package does not offer a simple way to extract the popularity of
the songs. How do we solve this issue?

# 5. Create a simple function to get the popularity


getPop <- function(id, token){
u <- paste0("https://api.spotify.com/v1/tracks/", id)
req <- httr::GET(u, httr::config(token = token))
json1 <- httr::content(req)
res <- data.frame(song = json1$name,
popul = json1$popularity,
id = json1$id)
return(res)
}

# 6. Map this function in the IDs found


popul <- features %>%
dplyr::pull(id) %>%
purrr::map_dfr(~getPop(.x, token = key_spotify))

# 7. Join the popularity with the other variables


features <- features %>%
dplyr::right_join(
popul %>% dplyr::select(-song),
by = c("id" = "id"))

16 / 75
Details about the APIs
APIs can be very unstable. This means that, sometimes, even without
reaching the access limit, they will fail.

How do we solve that?

Dividing the whole process into smaller batches

Using small time intervals between each access to the API, with Sys.sleep(),
for example (see the sketch below)
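
A minimal sketch combining both ideas, reusing getFeatures() and the ids from the previous slides (the batch size, the pause length, and the use of purrr::possibly() to skip failures are our own choices here):

# Wrap the extraction function so a failed request returns NULL
# instead of stopping the whole run
safe_features <- purrr::possibly(getFeatures, otherwise = NULL)

features <- ids %>%
  split(ceiling(seq_along(.) / 50)) %>%    # batches of 50 IDs
  purrr::map_dfr(~{
    Sys.sleep(1)                           # small pause between batches
    purrr::map_dfr(.x, safe_features, token = key_spotify)
  })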

17 / 75
chorrrds
# 1. Searching the songs
songs <- "chico-buarque" %>%
chorrrds::get_songs()

# 2. Mapping the chord extraction in the songs found


chords <- songs %>%
dplyr::pull(url) %>%
purrr::map(chorrrds::get_chords) %>%
purrr::map_dfr(dplyr::mutate_if, is.factor, as.character) %>%
chorrrds::clean(message = FALSE)

18 / 75
Combining different datasets
# Standardise the name of the key column and use it inside the joins
chords <- chords %>%
dplyr::mutate(song = stringr::str_remove(music, "chico buarque ")) %>%
dplyr::select(-music)

lyrics <- lyrics %>%
  dplyr::mutate(song = stringr::str_to_lower(song))
features <- features %>%
dplyr::mutate(song = stringr::str_to_lower(name)) %>%
dplyr::select(-name)

all_data <- chords %>%
dplyr::inner_join(lyrics, by = "song") %>%
dplyr::inner_join(features, by = "song")

19 / 75
What if there are a lot of unmatched songs?

nrow(chords) - nrow(all_data)
#> 8973

Solving those cases manually can take a lot of time.


One simple way is to use string similarity to find similar titles between the
songs.
We can do that by calculating the distances between the titles and verifying
how many "letters of difference" they have. For example:

nome1 <- "Geni e o Zepelim"


nome2 <- "Geni e o Zepelin"
# Finding the distance
RecordLinkage::levenshteinDist(nome1, nome2)

## [1] 1

# Finding the similarity = 1 - dist / str_length(longest string)


RecordLinkage::levenshteinSim(nome1, nome2)

## [1] 0.9375

When the titles are different, we should take the most similar ones and merge
them. Usually, there's no clear cutoff point for the similarity, so we define it
arbitrarily.

20 / 75
Fixing the titles
# Let's find the string distances between the titles and use this
# information to fix them in the dataset
# 1. Which ones are in the chords data but not in the lyrics one?
anti_chords_lyrics <- chords %>%
dplyr::anti_join(lyrics, by = "song")

# 2. Saving the titles to fix


names_to_fix <- anti_chords_lyrics %>%
dplyr::distinct(song) %>%
dplyr::pull(song)
# 3. Calculating the distances between the titles of the
# lyrics dataset and the unmatched ones from the chords dataset
dists <- lyrics$song %>%
purrr::map(RecordLinkage::levenshteinSim, str1 = names_to_fix)

21 / 75
# 4. Retrieving the most similar titles in the two datasets
ordered_dists <- dists %>% purrr::map_dbl(max)
max_dists <- dists %>% purrr::map_dbl(which.max)
# 5. Filtering the ones that have similarity > 0.70
indexes_min_dist <- which(ordered_dists > 0.70)
songs_min_dist <- lyrics$song[indexes_min_dist]
index_lyrics <- max_dists[which(ordered_dists > 0.70)]

# 6. Saving the similar ones in a data.frame


results_dist_lyrics <- data.frame(from_chords = names_to_fix[index_lyrics],
from_lyrics = songs_min_dist)

Examples of similar cases found:

a bela a fera and a bela e a fera,

logo eu and logo eu?,

não fala de maria and não fala de maria ,

...

Now we have fewer problems! Let's fix them manually.

22 / 75
Fixing manually
chords <- chords %>%
dplyr::mutate(
song =
dplyr::case_when(
song == 'a bela a fera' ~ 'a bela e a fera',
song == 'a historia de lily braun' ~ 'a história de lily braun',
song == 'a moca do sonho' ~ 'a moça do sonho',
song == 'a ostra o vento' ~ 'a ostra e o vento',
song == 'a televisao' ~ 'a televisão',
song == 'a valsa dos clows' ~ 'a valsa dos clowns',
song == 'a voz do dono o dono da voz' ~ 'a voz do dono e o dono da voz',
song == 'agora falando serio' ~ 'agora falando sério',
TRUE ~ song))

# This cat() call generates the case_when() lines above:
cat(
  paste0("song == ", "'", results_dist_lyrics$from_chords, "' ~ '",
         results_dist_lyrics$from_lyrics, "', "), sep = "\n")

Link for the data:


https://github.com/brunaw/SER2019/tree/master/shortcourse/data/all_data.txt

23 / 75
Redoing the joins
all_data <- chords %>%
dplyr::inner_join(lyrics, by = "song") %>%
dplyr::inner_join(features, by = "song")

# Finally saving the complete data!


write.table(all_data, "all_data.txt")

24 / 75
Exploratory Analysis
25 / 75
Part 1: lyrics
Extra packages:

tm: text analysis in general
tidytext: tidy text analysis
lexiconPT: sentiment dictionary for Portuguese

26 / 75
n-grams
n-grams: the words and their "past"

Useful to analyze more complex expressions or sequences of words

nome1 <- "Geni e o Zepelim"


tokenizers::tokenize_ngrams(nome1, n = 1)

## [[1]]
## [1] "geni" "e" "o" "zepelim"

tokenizers::tokenize_ngrams(nome1, n = 2)

## [[1]]
## [1] "geni e" "e o" "o zepelim"

tokenizers::tokenize_ngrams(nome1, n = 3)

## [[1]]
## [1] "geni e o" "e o zepelim"

27 / 75
n-grams
The unnest_tokens() function splits each lyric into n-grams.

library(tidytext)
library(wordcloud)
# List of portuguese stopwords:
stopwords_pt <- data.frame(word = tm::stopwords("portuguese"))

# Breaking the phrases into single words with 1-gram


unnested <- all_data %>%
select(text) %>%
unnest_tokens(word, text, token = "ngrams", n = 1) %>%
# Removing stopwords
dplyr::anti_join(stopwords_pt, by = c("word" = "word"))

stopwords: very frequent words of a language, which might not be essential for
the overall meaning of a sentence
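
For instance, a quick look at the first entries of that list (the exact entries depend on the tm version):

head(tm::stopwords("portuguese"))
#> e.g. "de" "a" "o" "que" "e" "do"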

28 / 75
Part 1: lyrics
Counting each word that appeared in the songs:

unnested %>%
dplyr::count(word) %>%
arrange(desc(n)) %>%
slice(1:10)

## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 é 40064
## 2 pra 24276
## 3 iá 23832
## 4 amor 18460
## 5 diz 16505
## 6 chocalho 15888
## 7 vai 14603
## 8 morena 11148
## 9 esperando 10987
## 10 dia 10642

29 / 75
1-grams
unnested %>%
dplyr::count(word) %>%
  # removing the few words with extremely high counts
dplyr::filter(n < quantile(n, 0.999)) %>%
dplyr::top_n(n = 30) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
coord_flip() +
labs(x = 'Top 30 most common words', y = 'Count') +
theme_bw(14)

30 / 75
31 / 75
In a wordcloud format
unnested %>%
count(word) %>%
with(wordcloud(word, n, family = "serif",
random.order = FALSE, max.words = 30,
colors = c("darksalmon", "dodgerblue4")))

32 / 75
2-grams
all_data %>%
select(text) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stopwords_pt$word,
!is.na(word1), !is.na(word2),
!word2 %in% stopwords_pt$word) %>%
count(word1, word2, sort = TRUE) %>%
mutate(word = paste(word1, word2)) %>%
filter(n < quantile(n, 0.999)) %>%
arrange(desc(n)) %>%
slice(1:30) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
coord_flip() +
labs(x = 'Top 30 most common 2-grams', y = 'Count') +
theme_bw(18)

33 / 75
34 / 75
Sentiment analysis
# devtools::install_github("sillasgonzaga/lexiconPT")
# Retrieving the sentiments of Portuguese words from the lexiconPT package
sentiments_pt <- lexiconPT::oplexicon_v2.1 %>%
mutate(word = term) %>%
select(word, polarity)

# Joining the sentiments with the words from the songs


add_sentiments <- all_data %>%
select(text, song) %>%
group_by_all() %>%
slice(1) %>%
ungroup() %>%
unnest_tokens(word, text) %>%
dplyr::anti_join(stopwords_pt, by = c("word" = "word")) %>%
dplyr::inner_join(sentiments_pt, by = c("word" = "word"))
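
The polarity column scores each word (negative values for negative words, positive values for positive ones). A quick sketch to see how the matched words are distributed:

# Counting how many words fall at each polarity level
add_sentiments %>%
  dplyr::count(polarity)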

35 / 75
add_sentiments %>%
group_by(polarity) %>%
count(word) %>%
filter(n < quantile(n, 0.999)) %>%
top_n(n = 15) %>%
ggplot(aes(reorder(word, n), n)) +
geom_linerange(aes(ymin = min(n), ymax = n, x = reorder(word, n)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
facet_wrap(~polarity, scales = "free") +
coord_flip() +
labs(x = 'Top 15 most common words', y = 'Counts', title = 'Sentiments') +
theme_bw(14)

36 / 75
37 / 75
Which are the most positive
and most negative songs?
summ <- add_sentiments %>%
group_by(song) %>%
summarise(mean_pol = mean(polarity))
# 15 most positive and most negative songs
summ %>%
arrange(desc(mean_pol)) %>%
slice(c(1:15, 121:135)) %>%
mutate(situation = rep(c('+positive', '+negative'), each = 15)) %>%
ggplot(aes(reorder(song, mean_pol), mean_pol)) +
geom_linerange(aes(ymin = min(mean_pol), ymax = mean_pol,
x = reorder(song, mean_pol)),
position = position_dodge(width = 0.2), size = 1,
colour = 'darksalmon') +
geom_point(colour = 'dodgerblue4', size = 3, alpha = 0.9) +
facet_wrap(~situation, scales = "free") +
coord_flip() +
labs(x = 'Songs', y = 'Polarities') +
theme_bw(14)

38 / 75
39 / 75
What do we know so far?
The most common words and bi-grams
There are more positive than negative words in the lyrics
Which songs carry the most positive or negative feelings

40 / 75
Part 2. Chords
Extra packages:

ggridges: density plots
chorddiag: chord diagrams

# Merging enharmonic equivalents (same pitch, different chord spelling)
chords <- all_data %>%
select(chord, song) %>%
dplyr::mutate(chord = case_when(
chord == "Gb" ~ "F#",
chord == "C#" ~ "Db",
chord == "G#" ~ "Ab",
chord == "A#" ~ "Bb",
chord == "D#" ~ "Eb",
chord == "E#" ~ "F",
chord == "B#" ~ "C",
TRUE ~ chord))

41 / 75
Part 2. Chords
# Top 20 songs with the most distinct chords
chords %>%
dplyr::group_by(song, chord) %>%
dplyr::summarise(distintos = n_distinct(chord)) %>%
dplyr::summarise(cont = n()) %>%
dplyr::mutate(song = fct_reorder(song, cont)) %>%
top_n(n = 20) %>%
ggplot(aes(y = cont, x = song)) +
geom_bar(colour = 'dodgerblue4', fill = 'darksalmon',
size = 0.5, alpha = 0.6, stat = "identity") +
labs(x = 'Songs', y = 'Counts') +
coord_flip() +
theme_bw(14)

42 / 75
43 / 75
Extracting variables
The chord data are, in fact, just pieces of text.
Text in its raw state is not very informative.

Let's use the feature_extraction() function to extract covariates related to
the chords that have a clear interpretation:

minor
diminished
augmented
sus
chords with the 7th
chords with the major 7th
chords with the 6th
chords with the 4th
chords with the augmented 5th
chords with the diminished 5th
chords with the 9th
chords with varying bass

44 / 75
Extracting variables
feat_chords <- all_data %>%
select(chord, song) %>%
chorrrds::feature_extraction() %>%
select(-chord) %>%
group_by(song) %>%
summarise_all(mean)

dt <- feat_chords %>%
  tidyr::gather(group, vars, minor, seventh,
                seventh_M, sixth, fifth_dim, fifth_aug,
                fourth, ninth, bass, dimi, augm)

45 / 75
Extracting variables
dplyr::glimpse(feat_chords)

## Observations: 135
## Variables: 13
## $ song <chr> "a banda", "a bela e a fera", "a cidade ideal", "a gal…
## $ minor <dbl> 0.28282828, 0.43939394, 0.15294118, 0.07317073, 0.0000…
## $ dimi <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.0000…
## $ augm <dbl> 0.00000000, 0.00000000, 0.02352941, 0.00000000, 0.0000…
## $ sus <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seventh <dbl> 0.7878788, 0.9090909, 0.5294118, 0.4390244, 1.0000000,…
## $ seventh_M <dbl> 0.04040404, 0.00000000, 0.02352941, 0.00000000, 0.0000…
## $ sixth <dbl> 0.17171717, 0.12121212, 0.00000000, 0.00000000, 0.2673…
## $ fourth <dbl> 0.00000000, 0.34848485, 0.00000000, 0.00000000, 0.1386…
## $ fifth_aug <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fifth_dim <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.0000…
## $ ninth <dbl> 0.31313131, 0.50000000, 0.00000000, 0.00000000, 0.7871…
## $ bass <dbl> 0.10101010, 0.03030303, 0.07058824, 0.00000000, 0.1386…

46 / 75
Visualizing it
library(ggridges)
# Renaming current levels
dt$group <- forcats::lvls_revalue(
dt$group,
c("Augmented", "Bass", "Diminished",
"Augm. Fifth", "Dimi. Fifth",
"Fourth", "Minor", "Ninth", "Seventh",
"Major Seventh", "Sixth"))
# Plotting densities of the extracted features
dt %>%
ggplot(aes(vars, group, fill = group)) +
geom_density_ridges(alpha = 0.6) +
scale_fill_cyclical(values = c("dodgerblue4", "darksalmon")) +
guides(fill = FALSE) +
xlim(0, 1) +
labs(x = "Densities", y = "extracted features") +
theme_bw(14)

47 / 75
48 / 75
Chord diagrams using the
chords
Chord transitions are an important element of the harmonic structure of
songs. Let's check how those transitions happen in this case.

# devtools::install_github("mattflor/chorddiag")
# Counting the transitions between the chords
comp <- chords %>%
dplyr::mutate(
# Cleaning the chords to the base form
chord_clean = stringr::str_extract(chord, pattern = "^([A-G]#?b?)"),
seq = lead(chord_clean)) %>%
dplyr::filter(chord_clean != seq) %>%
dplyr::group_by(chord_clean, seq) %>%
dplyr::summarise(n = n())

mat <- tidyr::spread(comp, key = chord_clean, value = n, fill = 0)


mm <- as.matrix(mat[, -1])

# Building the chord diagram


chorddiag::chorddiag(mm, showTicks = FALSE,
palette = "Blues")
49 / 75
Regular expressions (regex)
A mini-language used to describe patterns in text
If you're working with text, you need to know regex
In R, regex can be used with the stringr package (see the example below)
To know more about regex:
Slides
Online material
Cheat Sheet

50 / 75
Chord diagram

[Interactive chord diagram of the chord transitions, with one sector per root
note: A, A#, Ab, Bb, C, C#, D#, Db, E, Eb, F#, G, G#, Gb]
51 / 75
The circle of fifths
Allows us to understand the most probable harmonic fields

52 / 75
What do we know so far?
Some songs are harmonically more "complex" than others:
number of distinct chords
extracted variables
The most common and rarest chord transitions

53 / 75
Part 3. Spotify Variables
Exploring the variables
spot <- all_data %>%
group_by(song) %>%
slice(1) %>%
ungroup()
# Density of the popularity of the songs
spot %>%
ggplot(aes(popul)) +
geom_density(colour = 'dodgerblue4', fill = "darksalmon",
alpha = 0.8) +
labs(y = 'Density', x = 'Popularity') +
theme_bw(14)

54 / 75
It varies a lot!

55 / 75
Most popular and least
popular songs
spot %>%
arrange(desc(popul)) %>%
slice(c(1:15, 121:135)) %>%
mutate(situation = rep(c('+popul', '-popul'), each = 15)) %>%
select(popul, situation, song) %>%
ggplot(aes(reorder(song, popul), popul, group = 1)) +
geom_bar(colour = 'dodgerblue4', fill = "darksalmon",
size = 0.3, alpha = 0.6,
stat = "identity") +
facet_wrap(~situation, scales = "free") +
coord_flip() +
labs(x = 'Songs', y = 'Popularity') +
theme_bw(14)

56 / 75
57 / 75
Danceability x variables
dt <- spot %>%
select(energy,
loudness, speechiness, liveness, duration_ms,
acousticness) %>%
tidyr::gather(group, vars)

dt$danceability <- spot$danceability


dt %>%
ggplot(aes(danceability, vars)) +
geom_point(colour = "darksalmon") +
geom_smooth(method = "lm", colour = "dodgerblue4") +
labs(x = "Danceability", y = "Variables") +
facet_wrap(~group, scales = "free") +
theme_bw(14)

58 / 75
59 / 75
What do we know so far?
How the popularity varies in this dataset
Which are the least and most popular songs
How the danceability relates to the other variables

60 / 75
Modeling
61 / 75
Modeling
Let's now consider that we have a special interest in the popularity of the
songs. Which variables would be most associated with higher or lower levels of
popularity?

To start with, let's transform the popularity into a class variable:

library(randomForest)

spot <- spot %>%
  mutate(pop_class = ifelse(
    popul < quantile(popul, 0.25), "unpopular",
    ifelse(popul < quantile(popul, 0.55), "neutral", "popular")))

spot %>%
janitor::tabyl(pop_class)

## pop_class n percent
## neutral 38 0.2814815
## popular 63 0.4666667
## unpopular 34 0.2518519

62 / 75
Wrangling the data to make it ready for modeling

# Combining the previous datasets and wrangling


set.seed(1)
model_data <- feat_chords %>%
right_join(spot, by = c("song" = "song")) %>%
right_join(summ, by = c("song" = "song")) %>%
select(-analysis_url, -uri, -id.x, -id.y, -song,
-name, -text, -lang, -chord, -long_str,
-key.x, -song.id, -sus,
-popul) %>%
mutate(pop_class = as.factor(pop_class)) %>%
# Separating into train and test set
mutate(part = ifelse(runif(n()) > 0.25, "train", "test"))

model_data %>%
janitor::tabyl(part)

## part n percent
## test 30 0.2222222
## train 105 0.7777778

63 / 75
Separating into the train set (75%) and the test set (25%):

train <- model_data %>%
  filter(part == "train") %>%
  select(-part)

test <- model_data %>%
  filter(part == "test") %>%
  select(-part)

The model will be like:

pop_class ~ minor + dimi + augm + seventh + seventh_M + sixth + fourth +
  fifth_aug + fifth_dim + ninth + bass + danceability + energy + key.y +
  loudness + mode + speechiness + acousticness + instrumentalness +
  liveness + valence + tempo + duration_ms + time_signature + mean_pol

64 / 75
m0 <- randomForest(pop_class ~ ., data = train,
ntree = 1000)
m0

##
## Call:
## randomForest(formula = pop_class ~ ., data = train, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 33.33%
## Confusion matrix:
## neutral popular unpopular class.error
## neutral 21 12 1 0.3823529
## popular 7 41 0 0.1458333
## unpopular 7 8 8 0.6521739

65 / 75
Visualizing the variable importance:

imp0 <- randomForest::importance(m0)


imp0 <- data.frame(var = dimnames(imp0)[[1]],
value = c(imp0))

imp0 %>%
arrange(var, value) %>%
mutate(var = fct_reorder(factor(var), value, min)) %>%
ggplot(aes(var, value)) +
geom_point(size = 3.5, colour = "darksalmon") +
coord_flip() +
labs(x = "Variables", y = "Decrease in Gini criteria") +
theme_bw(14)

66 / 75
Visualizing the variable importance:

67 / 75
corrplot::corrplot(cor(train %>% select_if(is.numeric),
method = "spearman"))

68 / 75
Redoing the model with the best variables

vars <- imp0 %>%
  arrange(desc(value)) %>%
  slice(1:10) %>%
  pull(var)
form <- paste0("pop_class ~ ", paste0(vars, collapse = '+')) %>%
as.formula()

m1 <- randomForest(form, data = train,
                   ntree = 1000, mtry = 5)
m1

##
## Call:
## randomForest(formula = form, data = train, ntree = 1000, mtry = 5)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 25.71%
## Confusion matrix:
## neutral popular unpopular class.error
## neutral 26 7 1 0.2352941
## popular 7 41 0 0.1458333
## unpopular 5 7 11 0.5217391

69 / 75
Measuring the accuracy in
the test set
pred <- predict(m0, test)

sum(pred == test$pop_class)/nrow(test)

## [1] 0.5333333

mean(m0$err.rate[,1])

## [1] 0.3409731
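
A confusion matrix on the test set complements the raw accuracy (a sketch using base table()):

# Cross-tabulating predicted vs. observed classes
table(predicted = pred, observed = test$pop_class)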

70 / 75
How could we improve this model?

More data!
Better evaluate the correlation between the variables
Remove noisy predictors
Engineer new features

71 / 75
Citation
@misc{musicdatainR,
author = {Wundervald, Bruna and Trecenti, Julio},
title = {Music Data Analysis in R},
url = {https://github.com/brunaw/SER2019},
year = {2019}
}

72 / 75
Acknowledgments
This work was supported by a Science Foundation Ireland Career Development
Award grant number: 17/CDA/4695

73 / 75
Some references
Feinerer, I., K. Hornik, and D. Meyer (2008). “Text Mining Infrastructure in R”.
In: Journal of Statistical Software 25.5, pp. 1–54. URL:
http://www.jstatsoft.org/v25/i05/.

Silge, J., D. Robinson, and J. Hester (2016). tidytext: Text mining using dplyr,
ggplot2, and other tidy tools. DOI: 10.5281/zenodo.56714. URL:
http://dx.doi.org/10.5281/zenodo.56714.

Wundervald, B. (2018). R-Music: Introduction to the vagalumeR package. URL:
https://r-music.rbind.io/posts/2018-11-22-introduction-to-the-vagalumer-package/.

Wundervald, B. and T. M. Dantas (2018). R-Music: Rspotify. URL:
https://r-music.rbind.io/posts/2018-10-01-rspotify/.

Wundervald, B. and J. Trecenti (2018). R-Music: Introduction to the chorrrds
package. URL:
https://r-music.rbind.io/posts/2018-08-19-chords-analysis-with-the-chorrrds-package/.

74 / 75

Thank you!

75 / 75
