One Deep Music Representation To Rule Them All? A Comparative Analysis of Different Representation Learning Strategies
https://doi.org/10.1007/s00521-019-04076-1
Received: 7 December 2017 / Accepted: 12 February 2019 / Published online: 4 March 2019
© The Author(s) 2019
Abstract
Inspired by the success of deploying deep learning in the fields of Computer Vision and Natural Language Processing, this learning paradigm has also found its way into the field of Music Information Retrieval. In order to benefit from deep learning in an effective, but also efficient manner, deep transfer learning has become a common approach. In this approach, it is possible to reuse the output of a pre-trained neural network as the basis for a new learning task. The underlying hypothesis is that if the initial and new learning tasks show commonalities and are applied to the same type of input data (e.g., music audio), the generated deep representation of the data is also informative for the new task. Since, however, most of the networks used to generate deep representations are trained using a single initial learning source, their representation is unlikely to be informative for all possible future tasks. In this paper, we present the results of our investigation into the most important factors for generating deep representations for data and learning tasks in the music domain. We conducted this investigation via an extensive empirical study that involves multiple learning sources, as well as multiple deep learning architectures with varying levels of information sharing between sources, in order to learn music representations. We then validate these representations on multiple target datasets for evaluation. The results of our experiments yield several insights into how to approach the design of methods for learning widely deployable deep data representations in the music domain.
the truly 'relevant' information is in the music data used for the tasks, and how this can properly be translated into numeric representations that should be used for prediction. While research into such proper translations can be conducted per individual task, it is likely that informative factors in music data will be shared across tasks. As a consequence, when seeking to identify informative factors that are not explicitly restricted to a single task, multitask learning (MTL) is a promising strategy. In MTL, a single learning framework hosts multiple tasks at once, allowing models to perform better by sharing commonalities between the involved tasks [2]. MTL has been successfully used in a range of applied ML works [3–10], also including the music domain [11, 12].

Following successes in the fields of Computer Vision (CV) and Natural Language Processing (NLP), deep learning approaches have recently also gained increasing interest in the MIR field, in which case deep representations of music audio data are directly learned from the data, rather than being hand-crafted. Many works employing such approaches reported considerable performance improvements in various music analysis, indexing and classification tasks [13–20].

In many deep learning applications, rather than training a complete network from scratch, pre-trained networks are commonly used to generate deep representations, which can be either directly adopted or further adapted for the current task at hand. In CV and NLP, (parts of) certain pre-trained networks [21–24] have now been adopted and adapted in a very large number of works. These 'standard' deep representations have typically been obtained by training a network for a single learning task, such as visual object recognition, employing large amounts of training data. The hypothesis on why these representations are effective in a broader spectrum of tasks than they originally were trained for is that deep transfer learning (DTL) is happening: information initially picked up by the network is beneficial also for new learning tasks performed on the same type of raw input data. Clearly, the validity of this hypothesis is linked to the extent to which the new task can rely on similar data characteristics as the task on which the pre-trained network was originally trained.

Although a number of works deployed DTL for various learning tasks in the music domain [25–28], to our knowledge, transfer learning and the employment of pre-trained networks are not as standard in the MIR domain as in the CV domain. Again, this may be due to the broad and partially subjective range and nature of possible music descriptions. Following the considerations above, it may then be useful to combine deep transfer learning with multitask learning.

Indeed, in order to increase robustness to a larger scope of new learning tasks and datasets, the concept of MTL also has been applied in training deep networks for representation learning, both in the music domain [11, 12] and in general [3, p. 2]. As the model learns several tasks and datasets in parallel, it may pick up commonalities among them. As a consequence, the expectation is that a network learned with MTL will yield robust performance across different tasks, by transferring shared knowledge [2, 3]. A simple illustration of the conceptual difference between traditional DTL and deep transfer learning based on MTL (further referred to as multitask based deep transfer learning (MTDTL)) is shown in Fig. 1.

The mission of this paper is to investigate the effect of conditions around the setup of MTDTL which are important to yield effective deep music representations. Here, we understand an 'effective' representation to be a representation that is suitable for a wide range of new tasks and datasets. Ultimately, we aim to provide a methodological framework to systematically obtain and evaluate such transferable representations. We pursue this mission by exploring the effectiveness of MTDTL and traditional DTL, as well as concatenations of multiple deep representations obtained by networks that were independently trained on separate single learning tasks. We consider these representations for multiple choices of learning tasks and multiple target datasets.

Our work will address the following research questions:

• RQ1: Given a set of learning sources that can be used to train a network, what is the influence of the number and type of the sources on the effectiveness of the learned deep representation?
• RQ2: How do various degrees of information sharing in the deep architecture affect the effectiveness of a learned deep representation?

By answering RQ1, we arrive at an understanding of important factors regarding the composition of a set of learning tasks and datasets (which in the remainder of this work will be denoted as learning sources) to achieve an effective deep music representation, specifically on the number and nature of learning sources. The answer to RQ2 provides insight into how to choose the optimal multitask network architecture in the MTDTL context. For example, in MTL, multiple sources are considered under a joint learning scheme that partially shares inferences obtained from different learning sources in the learning pipeline. In MTL applications using deep neural networks, this means that certain layers will be shared between all sources, while at other stages, the architecture will 'branch' out into source-specific layers [2, 5–8, 12, 29]. However, an investigation is still needed on where in the layered architecture branching should ideally happen—if a branching strategy would turn out beneficial in the first place.
Fig. 1 Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single learning task (above) and multitask based deep transfer learning (MTDTL) (below). The same color used for a learning and a target task indicates that the tasks have commonalities, which implies that the learned representation is likely to be informative for the target task. At the same time, this representation may not be that informative to another future task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more learning tasks increases robustness of the learned representation and its usability for a broader set of target tasks (color figure online)
To reach the aforementioned answers, it is necessary to conduct a systematic assessment to examine relevant factors. For RQ1, we investigate different numbers and combinations of learning sources. For RQ2, we study different architectural strategies. However, we wish to ultimately investigate the effectiveness of the representation with respect to new, target learning tasks and datasets (which in the remainder of this paper will be denoted by target datasets). While this may cause a combinatorial explosion with respect to possible experimental configurations, we will make strategic choices in the design and evaluation procedure of the various representation learning strategies.

The scientific contribution of this work can be summarized as follows:

• We provide insight into the effectiveness of various deep representation learning strategies under the multitask learning context.
• We offer in-depth insight into ways to evaluate desired properties of a deep representation learning procedure.
• We propose and release several pre-trained music representation networks, based on different learning strategies for multiple semantic learning sources.

The rest of this work is presented as follows: a formalization of this problem, as well as the global outline of how learning will be performed based on different learning tasks from different sources, will be presented in Sect. 2. Detailed specifications of the deep architectures we considered for the learning procedure will be discussed in Sect. 3. Our strategy to evaluate the effectiveness of different representation network variants by employing various target datasets will be the focus of Sect. 4. Experimental results will be discussed in Sect. 5, after which general conclusions will be presented in Sect. 6.

2 Framework for deep representation learning

In this section, we formally define the deep representation learning problem. As Fig. 2 illustrates, any domain-specific MTDTL problem can be abstracted into a formal task, which is instantiated by a specific dataset with specific observations and labels. Multiple tasks and datasets are involved to emphasize different aspects of the input data, such that the learned representation is more adaptable to different future tasks. The learning part of this scheme can be understood as the MTL phase, which is introduced in Sect. 2.1. Subsequently, in Sect. 2.2, we discuss the learning sources involved in this work, which consist of various tasks and datasets, to allow investigating their effects on the transfer learning. Further, we introduce the label preprocessing procedure applied in this work in Sect. 2.3, ensuring that the learning sources are more regularized, such that their comparative analysis is clearer.
2.1 Problem definition

A machine learning problem, focused on solving a specific task $t$, can be formulated as a minimization problem, in which a model function $f_t$ must be learned that minimizes a loss function $\mathcal{L}$ for a given dataset $D_t = \{(x_t^{(i)}, y_t^{(i)}) \mid i \in \{1, \ldots, I\}\}$, comparing the model's predictions given by the input $x_t$ and actual task-specific learning labels $y_t$. This can be formulated using the following expression:

$$\hat{\theta} = \arg\min_{\theta} \mathbb{E}_{D_t}\, \mathcal{L}(y_t, f_t(x_t; \theta)) \tag{1}$$

where $x_t \in \mathbb{R}^d$ is, traditionally, a hand-crafted $d$-dimensional feature vector and $\theta$ is a set of model parameters of $f$.

When deep learning is employed, the model function $f$ denotes a learnable network. Typically, the network model $f$ is learned in an end-to-end fashion, from raw data at the input to the learning label. In the speech and music field, however, using true end-to-end learning is still not a common practice. Instead, raw data is typically transformed first, before serving as network input. More specifically, in the music domain, common input to function $f$ would be $X \in \mathbb{R}^{c \times n \times b}$, replacing the originally hand-crafted feature vector $x \in \mathbb{R}^d$ from (1) by a time-frequency representation of the observed music data, usually obtained through the short-time Fourier transform (STFT), with potential additional filter bank applications (e.g., mel-filter bank). The dimensions $c$, $n$, $b$ indicate channels of the audio signal, time steps, and frequency bins, respectively.
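A minimal sketch of constructing such a time-frequency input $X$ of shape (channels, time steps, frequency bins) is given below. It is not the paper's exact preprocessing (that is specified in Sect. 3.1.1); the synthetic signal and parameter values are purely illustrative.

```python
import numpy as np
import librosa

sr = 22050
y = np.random.randn(sr * 3)                               # 3 s of synthetic mono audio
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # (frequency bins, frames)
X = S.T[np.newaxis, ...]                                  # c = 1 channel
print(X.shape)                                            # (1, n, b)
```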
If such a network still is trained for a specific single machine learning task $t$, we can now reformulate (1) as follows:

$$\hat{\theta} = \arg\min_{\theta} \mathbb{E}_{D_t}\, \mathcal{L}(y_t, f_t(X_t; \theta)). \tag{2}$$

In MTL, in the process of learning the network model $f$, different tasks will need to be solved in parallel. In the case of deep neural networks, this is usually realized by having a network in which lower layers are shared for all tasks, but upper layers are task-specific. Given $m$ different tasks $t$, each having the learning label $y_t$, we can formulate the learning objective of the neural network in the MTL scenario as follows:

$$\hat{\theta}^{s}, \hat{\theta} = \arg\min_{\theta^{s}, \theta} \mathbb{E}_{t \in \mathcal{T}}\, \mathbb{E}_{D_t}\, \mathcal{L}(y_t, f_t(X_t; \theta^{s}, \theta^{t})) \tag{3}$$

Here, $\mathcal{T} = \{t_1, t_2, \ldots, t_m\}$ is a given set of tasks to be learned and $\theta = \{\theta^{1}, \theta^{2}, \ldots, \theta^{m}\}$ indicates a set of model parameters $\theta^{t}$ with respect to each task. Since the deep architecture initially shares lower layers and branches out to task-specific upper layers, the parameters of shared layers and task-specific layers are referred to separately as $\theta^{s}$ and $\theta^{t}$, respectively. Updates for all parameters can be achieved through standard back-propagation. Further specifics on network architectures and training configurations will be given in Sect. 3.

Given the formalizations above, the first step in our framework is to select a suitable set $\mathcal{T}$ of learning tasks. These tasks can be seen as multiple concurrent descriptions or transformations of the same input fragment of musical audio: each will reflect certain semantic aspects of the music. However, unlike the approach in a typical MTL scheme, solving multiple specific learning tasks is actually not our main goal; instead, we wish to learn an effective representation that captures as many semantically important factors in the low-level music representation as possible. Thus, rather than using learning labels $y_t$, our representation learning process will employ reduced learning labels $z_t$, which capture a reduced set of semantic factors from $y_t$. We then can reformulate (3) as follows:

$$\hat{\theta}^{s}, \hat{\theta} = \arg\min_{\theta^{s}, \theta} \mathbb{E}_{t \in \mathcal{T}}\, \mathbb{E}_{D_t}\, \mathcal{L}(z_t, f_t(X_t; \theta^{s}, \theta^{t})) \tag{4}$$

where $z_t \in \mathbb{R}^k$ is a $k$-dimensional vector that represents the reduced set of semantic factors extracted from $y_t$.

The learning sources we consider can be globally categorized as Algorithm or Annotation. As for the Algorithm category, by employing traditional feature extraction or representation transformation algorithms, we will be able to automatically extract semantically interesting aspects from the input data. As for the Annotation category, these include different types of label annotations of the input data by humans.

The dataset used as a resource for our learning experiments is the Million Song Dataset (MSD) [30]. In its original form, it contains metadata and precomputed features for a million songs, with several associated data resources, e.g., considering Last.fm social tags and listening profiles from the Echo Nest. While the MSD does not distribute audio due to copyright reasons, 30-s audio previews can be obtained for the songs in the dataset through the API of the 7digital service. These 30-s previews will form the source for our raw audio input.

Using the MSD data, we consider several subcategories of learning sources within the Algorithm and Annotation categories; below, we give an overview of these, and specify what information we considered exactly for the learning labels in our work.

2.2.1 Algorithm

• Self. The music track is the learning source itself; in other words, intrinsic information in the input music track should be captured through a learning procedure, without employing further data. Various unsupervised or auto-regressive learning strategies can be employed under this category, with variants of autoencoders, including the Stacked Autoencoder [31, 32], Restricted Boltzmann Machines (RBM) [33], Deep Belief Networks (DBN) [34] and Generative Adversarial Networks (GAN) [35]. As another example within this category, variants of the Siamese networks for similarity learning can be considered [36–38].

In our case, we will employ the Siamese architecture to learn a metric that measures whether two input music clips belong to the same track or two different tracks. This can be formulated as follows:

$$\hat{\theta}^{\mathrm{self}}, \hat{\theta}^{s} = \arg\min \mathbb{E}_{X_l, X_r \sim D_{\mathrm{self}}}\, \mathcal{L}(y_{\mathrm{self}}, f_{\mathrm{self}}(X_l, X_r; \theta^{\mathrm{self}}, \theta^{s}))$$
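To make the hard-parameter-sharing objective of (3)–(4) concrete, the following is a minimal sketch in PyTorch (the framework used for the released models). It is not the exact architecture of Table 3: the trunk, layer sizes, source names, and input shape are illustrative placeholders, and the loss is the KL divergence against preprocessed label distributions $z_t$ as described in Sect. 2.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourceNet(nn.Module):
    """Shared trunk (theta^s) with one small head (theta^t) per learning source."""
    def __init__(self, n_factors=50, sources=("tag", "bpm")):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 256), nn.ReLU(),            # stand-in for the fc-feature layer
        )
        self.heads = nn.ModuleDict({s: nn.Linear(256, n_factors) for s in sources})

    def forward(self, x, source):
        return self.heads[source](self.shared(x))     # logits for source t

def mtl_step(model, optimizer, batches):
    """One optimization step over a dict {source: (X, z)} of mini-batches,
    minimizing the KL divergence between z_t and softmax(f_t(X_t))."""
    optimizer.zero_grad()
    total = 0.0
    for source, (X, z) in batches.items():
        log_q = F.log_softmax(model(X, source), dim=-1)
        total = total + F.kl_div(log_q, z, reduction="batchmean")
    total.backward()
    optimizer.step()
    return float(total)

# Toy usage: random spectrogram patches and 50-dimensional label distributions.
model = MultiSourceNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
X = torch.randn(4, 1, 216, 128)                       # (batch, channel, time, mel bins)
z = torch.softmax(torch.randn(4, 50), dim=-1)          # preprocessed labels z_t
print(mtl_step(model, opt, {"tag": (X, z), "bpm": (X, z)}))
```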
learning sources, it is not trivial to compare the effect of individual learning sources. We, therefore, choose to work with a subset of the dataset, in which equal numbers of samples across learning sources can be used. As a consequence, we managed to collect 46,490 clips of tracks with corresponding learning source labels. A 41,841/4,649 split was made for training and validation for all sources from both MSD and CDR. Since we mainly focus on transfer learning, we used the validation set mostly for monitoring the training, to keep the network from overfitting.

2.3 Latent factor preprocessing

Most learning sources are noisy. For instance, social tags include tags for personal playlist management, long sentences, or simply typos, which do not actually show relevant nuances in describing the music signal. The algorithmically extracted BPM information also is imperfect, and likely contains octave errors, in which BPM is under- or overestimated by a factor of 2. To deal with this noise, several previous works using the MSD [16, 26] applied a frequency-based filtering strategy along with top-down domain knowledge. However, this shrinks the available sample size. As an alternative way to handle noisiness, several other previous works [11, 17, 27, 40–42] apply latent factor extraction using various low-rank approximation models to preprocess the label information. We also choose to do this in our experiments.

A full overview of chosen learning sources, their category, origin dataset, dimensionality, and preprocessing strategies is shown in Table 1. In most cases, we apply probabilistic latent semantic analysis (pLSA), which extracts latent factors as a multinomial distribution of latent topics [43]. Table 2 illustrates several examples of strong social tags within extracted latent topics.

For situations in which learning labels are a scalar, non-binary value (BPM and release year), we applied a Gaussian mixture model (GMM) to transform each value into a categorical distribution of Gaussian components. In the case of the Self category, as it basically is a binary membership test, no factor extraction was needed. After preprocessing, learning source labels $y_t$ are expressed in the form of probabilistic distributions $z_t$. The learning of a deep representation can then take place by minimizing the Kullback–Leibler (KL) divergence between model inferences $f_t(X)$ and label factor distributions $z_t$.

Along with the noise reduction, another benefit of such preprocessing is the regularization of the scale of the objective function between the different tasks involved in the learning, when the resulting factors have the same size. This regularity between the objective functions is particularly helpful for comparing different tasks and datasets. For this purpose, we used a fixed single value $k = 50$ for the number of factors (pLSA) and the number of Gaussians (GMM). In the remainder of this paper, the datasets and tasks processed in the above manner will be denoted by learning sources for coherent presentation and usage of the terminology.
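As a small illustration of the GMM-based label preprocessing just described, the sketch below (using scikit-learn, with synthetic BPM values and a reduced number of components purely for speed) turns each scalar label into a categorical distribution over Gaussian components; the paper's actual setting uses 50 components.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the algorithmically extracted BPM labels.
rng = np.random.default_rng(0)
bpm = rng.uniform(60, 180, size=(5000, 1))

# Fit a GMM on the scalar labels and replace each label by its posterior over
# the components, i.e., the categorical distribution z_t used as the learning
# target (trained against with a KL-divergence loss).
gmm = GaussianMixture(n_components=8, random_state=0).fit(bpm)
z = gmm.predict_proba(bpm)

print(z.shape)                       # (5000, 8)
print(z[0].round(3), z[0].sum())     # each row is a distribution summing to 1
```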
3 Representation network architectures

In this section, we present the detailed specification of the deep representation neural network architecture we exploited in this work. We will discuss the base architecture of the network, and further discuss the shared architecture with respect to the different fusion strategies that one can take in the MTDTL context. Also, we introduce details on the preprocessing applied to the input data served into the networks.

3.1 Base architecture

As the deep base architecture for feature representation learning, we choose a convolutional neural network (CNN) architecture inspired by [21], as described in Fig. 4 and Table 3. The CNN is one of the most popular architectures in many music-related machine learning tasks [16, 17, 20, 25, 44–55]. Many of these works adopt an architecture having cascading blocks of 2-dimensional filters and max-pooling, derived from well-known works in image recognition [21, 56]. Although variants of the CNN using 1-dimensional filters were also suggested by [12, 57–59] to learn features directly from a raw audio signal in an end-to-end manner, not many works managed to use them successfully on music classification tasks [60].

The main difference between the base architecture and [21] is the use of global average pooling (GAP) and Batch Normalization (BN) layers. BN is applied to accelerate the training and stabilize the internal covariate shift for every convolution layer and the fc-feature layer [61]. Also, global spatial pooling is adopted as the last pooling layer of the cascading convolution blocks, which is known to effectively summarize the spatial dimensions both in the image [22] and music domain [20]. We also applied this approach to ensure that the fc-feature layer does not have a huge number of parameters.

We applied the rectified linear unit (ReLU) [62] to all convolution layers and the fc-feature layer. For the fc-output layer, softmax activation is used. For each convolution layer, we applied zero-padding such that the input and the output have the same spatial shape. As for the regularization, we choose to apply dropout [63] on the fc-feature layer.
[Fig. 4: the base representation network — a sampling and preprocessing stage followed by cascading convolution blocks Conv1(16) through Conv6(256) with max-pooling, global average pooling (GAP), the FC(256) fc-feature layer, and an FCSoftmax(50) output layer]

Mel spectrograms have generally been a popular input representation choice for CNNs applied in music-related tasks [16, 17, 20, 26, 41, 64]; besides, it also was reported recently that their frequency-domain summarization, based on psycho-acoustics, is efficient and not easily learnable through data-driven approaches [65, 66]. We choose a 1024-sample window size and a 256-sample hop size, translating to about 46 ms and 11.6 ms, respectively, for a sampling rate of 22 kHz. We also applied standardization to each frequency band of the mel spectrum, making use of the mean and variance of all individual mel spectra in the training set.
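A minimal sketch of this preprocessing, using librosa, is given below. The window, hop, and sampling-rate values follow the text above; the number of mel bands and the dB scaling are assumptions made only for illustration (the exact input specification is given in Table 3).

```python
import numpy as np
import librosa

def mel_spectrogram(y, sr=22050, n_fft=1024, hop_length=256, n_mels=128):
    """dB-scaled mel spectrogram of a mono signal, shape (n_mels, n_frames)."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S)

def standardize(train_specs):
    """Per-band standardization using the mean/variance of the training set."""
    stacked = np.concatenate(train_specs, axis=1)          # all frames of all clips
    mean = stacked.mean(axis=1, keepdims=True)
    std = stacked.std(axis=1, keepdims=True) + 1e-8
    return [(s - mean) / std for s in train_specs]

# Toy usage with a synthetic 3-second signal in place of a 30-s preview.
y = np.random.randn(22050 * 3)
specs = standardize([mel_spectrogram(y)])
print(specs[0].shape)   # (128, n_frames)
```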
3.1.2 Sampling
Fig. 5 The various model architectures considered in the current work. Beyond single-source architectures, multi-source architectures with various degrees of shared information are studied. For simplification, multi-source cases are illustrated here for two sources. The fc-feature layer from which representations will be extracted is the FC(256) layer in the illustrations (see Table 3). (a) SS-R: base setup. (b) MSS-CR: concatenation of multiple independent SS-R networks. (c) MS-CR@2: network branches to source-specific layers from the 2nd convolution layer. (d) MS-CR@4: network branches to source-specific layers from the 4th convolution layer. (e) MS-CR@6: network branches to source-specific layers from the 6th convolution layer. (f) MS-SR@FC: heavily shared network, source-specific branching only at the final FC layer

representations learned for different learning sources in neural network architectures. For example, for different tasks, representations can be extracted from different intermediate hidden layers, benefiting from the hierarchical feature encoding capability of the deep network [26]. However, considering that learned representations are usually taken from a specific fixed layer of the shared architecture, we focus on the strategies as we outlined above.
Fig. 6 Overall system framework. The first row of the figure illustrates the learning scheme, where the representation learning is happening by minimizing the KL divergence between the network inference $f_t(X)$ and the preprocessed learning label $z_t$. The preprocessing is conducted by the blue blocks, which transform the original noisy labels $y_t$ to $z_t$, reducing noise and summarizing the high-dimensional label space into a smaller latent space. The second row describes the entire evaluation scenario. The representation is first extracted from the representation network, which is transferred from the upper row. The sequence of representation vectors is aggregated as the concatenation of their means and standard deviations. The purple block indicates a machine learning model employed to evaluate the representation's effectiveness (color figure online)
due to the lack of a confirmed and exhaustive track listing of the GTZAN dataset. We choose to use a fault-filtered data split for the training and evaluation, which is suggested in [73]. The split originally includes a training, validation and evaluation split; in our case, we also included the validation split as training data.

Among the various packages provided by the FMA, we chose the top-genre classification task of FMA-Medium [71]. This is a classification dataset with an unbalanced genre distribution. We used the data split provided by the dataset for our experiment, where the training and validation sets are combined as the training set.

Considering another type of genre classification, we selected the Extended Ballroom dataset [74, 75]. Because the classes in this dataset are highly separable with regard to their BPM [80], we specifically included this 'purposefully biased' dataset as an example of how a learned representation may effectively capture temporal dynamics properties present in a target dataset, as long as the learning sources also reflected these properties. Since no pre-defined split is provided or suggested in other literature, we used stratified random sampling based on the genre label.

The last dataset we considered for classification is the training set of the IRMAS dataset [76], which consists of short music clips annotated with the predominant instruments present in the clip. Compared to the genre classification task, instrument classification is generally considered as less subjective, requiring features that separate timbral characteristics of the music signal as opposed to high-level semantics like the genre. We split the dataset to make sure that observations from the same music track are not split into training and test sets.

As a performance metric for all these classification tasks, we used classification accuracy.

• Regression. As exemplars of regression tasks, we evaluate our proposed deep representations on the dataset used in the MediaEval Music Emotion prediction task [77]. It contains frame-level and song-level labels of a two-dimensional representation of emotion, with valence and arousal as dimensions [81]. Valence is related to the positivity or negativity of the emotion, and arousal is related to its intensity [77]. The song-level annotation of the V-A coordinates was used as the learning label. In similar fashion to the approach taken in [26], we trained separate models for the two emotional dimensions. As for the dataset split, we used the split provided by the dataset, which is a random split stratified by the genre distribution.
As an evaluation metric, we measured the coefficient of determination $R^2$ of each model.

• Recommendation. Finally, we employed the 'Last.fm - 1K users' dataset [78] to evaluate our representations in the context of a content-aware music recommendation task (which will be denoted as Lastfm in the remainder of the paper). This dataset contains 19 million records of listening events across 961,416 unique tracks collected from 992 unique users. In our experiments, we mimicked a cold-start recommendation problem, in which items not seen before should be recommended to the right users. For efficiency, we filtered out users who listened to less than 5 tracks and tracks known to less than 5 users.

As for the audio content of each track, we obtained the mapping between the MusicBrainz Identifier (MBID) and the Spotify identifier (SpotifyID) using the MusicBrainz API.³ After cross-matching, we collected 30-s previews of all tracks using the Spotify API.⁴ We found that there is a substantial amount of missing mapping information between the SpotifyID and MBID in the MusicBrainz database, where only approximately 30% of mappings are available. Also, because of the substantial amount of inactive users and unpopular tracks in the dataset, we ultimately acquired a dataset of 985 unique users and 27,093 unique tracks with audio content.

Similar to [28], we considered the outer matrix performance for un-introduced songs; in other words, the model's recommendation accuracy on the items newly introduced to the system [28]. This was done by holding out certain tracks when learning user models and then predicting user preference scores based on all tracks, including those that were held out, resulting in a ranked track list per user. As an evaluation metric, we consider Normalized Discounted Cumulative Gain (nDCG@500), only treating held-out tracks that were indeed liked by a user as relevant items. Further details on how hold-out tracks were chosen are given in Sect. 4.4.

³ https://musicbrainz.org/.
⁴ https://developer.spotify.com/documentation/web-api/.

A summary of all evaluation datasets, their origins, and properties can be found in Table 5.

4.2 Baselines

We examined three baselines to compare with our proposed representations:

• Mel-Frequency Cepstral Coefficients (MFCC). These are some of the most popular audio representations in MIR research. In this work, we extract and aggregate MFCCs following the strategy in [26]. In particular, we extracted 20 coefficients and also used their first- and second-order derivatives. After obtaining the sequence of MFCCs and its derivatives, we performed aggregation by taking the average and standard deviation over the time dimension, resulting in a 120-dimensional vector representation.
• Random Network Feature (Rand). We extracted the representation at the fc-feature layer without any representation network training. With random initialization, this representation therefore gives a random baseline for a given CNN architecture. We refer to this baseline as Rand.
• Latent Representation from Music Auto-Tagger (Choi). The work in [26] focused on a music auto-tagging task and can be considered as yielding a state-of-the-art deep music representation for MIR. While the model's focus on learning a representation for music auto-tagging can be considered as our SS-R case, there are a number of issues that complicate direct comparisons between this work and ours. First, the network in [26] is trained with about 4 times more data samples than in our experiments. Second, it employed a much smaller network than our architecture. Further, intermediate representations were extracted, which is out of the scope of our work, as we only consider representations at the fc-feature layer. Nevertheless, despite these caveats, the work still is very much in line with ours, making it a clear candidate for comparison. Throughout the evaluation, we could not fully reproduce the performance reported in the original paper [26]. When reporting our results, we therefore will report the performance we obtained with the published model, referring to this as Choi.

4.3 Experimental design

In order to investigate our research questions, we carried out an experiment to study the effect of the number and type of learning sources on the effectiveness of deep representations, as well as the effect of the various architectural learning strategies described in Sect. 3.2. For the experimental design, we consider the following factors:

• Representation strategy, with 6 levels: SS-R, MS-SR@FC, MS-CR@6, MS-CR@4, MS-CR@2, and MSS-CR.
• 8 2-level factors indicating the presence or not of each of the 8 learning sources: self, year, bpm, taste, tag, lyrics, cdr_tag, and artist.
• Number of learning sources present in the learning process (1 to 8). Note that this is actually calculated as the sum of the eight factors above.
• Target dataset, with 7 levels: Ballroom, FMA, GTZAN, IRMAS, Lastfm, Arousal, and Valence.

Given a learned representation, fitting dataset-specific models is much more efficient than learning the representation, so we decided to evaluate each representation on all 7 target datasets. The experimental design is thus restricted to combinations of representation and learning sources, and for each such combination we will produce 7 observations. However, given the constraint of SS-R relying on a single learning source, the fact that there is only one possible combination for n = 8 sources, as well as the high unbalance in the number of sources,⁵ we proceeded in three phases:

1. We first trained the SS-R representations for each of the 8 sources and repeated 6 times each. This resulted in 48 experimental runs.
2. We then proceeded to train all five multi-source strategies with all sources, that is, n = 8. We repeated this 5 times, leading to 25 additional experimental runs.
3. Finally, we ran all five multi-source strategies with n = 2, ..., 7. The full design matrix would contain 5 representations and 8 sources, for a total of 1230 possible runs. Such an experiment was unfortunately infeasible to run exhaustively given available resources, so we decided to follow a fractional design. However, rather than using a pre-specified optimal design with a fixed amount of runs [83], we decided to run sequentially for as long as time would permit us, generating at each step a new experimental run on demand in a way that would maximize desired properties of the design up to that point, such as balance and orthogonality.⁶

We did this with the greedy Algorithm 2. From the set of still remaining runs A, a subset O is selected such that the expected unbalance in the augmented design B ∪ {o} is minimal. In this case, the unbalance of a design is defined as the maximum unbalance found between the levels of any factor, except for those already exhausted.⁷ From O, a second subset P is selected such that the expected aliasing in the augmented design is minimal, here defined as the maximum absolute aliasing between main effects.⁸ Finally, a run p is selected at random from P, the corresponding representation is learned, and the algorithm iterates again after updating A and B.

Following this on-demand methodology, we managed to run another 352 experimental runs from all the 1230 possible.

⁵ For instance, from the 255 possible combinations of up to 8 sources, there are 70 combinations of n = 4 sources, but 28 with n = 2, or only 8 for n = 7. Simple random sampling from the 255 possible combinations would lead to a very unbalanced design, that is, a highly non-uniform distribution of observation counts across the levels of the factor (n in this case). A balanced design is desired to prevent aliasing and maximize statistical power. See section 15.2 in [82] for details on unbalanced designs.
⁶ An experimental design is orthogonal if the effects of any factor balance out across the effects of the other factors. In a non-orthogonal design, effects may be aliased, meaning that the estimate of one effect is partially biased with the effect of another, the extent of which ranges from 0 (no aliasing) to 1 (full aliasing). Aliasing is sometimes referred to as confounding. See sections 8.5 and 9.5 in [82] for details on aliasing.
⁷ For instance, let a design have 20 runs for SS-R, 16 for MS-SR@FC, and 18 for all other representations. The unbalance in the representation factor is thus 20 − 16 = 4. The total unbalance of the design is defined as the maximum unbalance found across all factors.
⁸ See section 2.3.7 in [83] for details on how to compute an alias matrix.
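To make the balance criterion used in this sequential procedure concrete, the following is a small, hypothetical helper. It only covers the unbalance computation of footnote 7; the exception for already exhausted factors and the aliasing criterion are omitted.

```python
from collections import Counter

def unbalance(design, factors):
    """Max over factors of (most frequent level count - least frequent level count)."""
    worst = 0
    for f in factors:
        counts = Counter(run[f] for run in design)
        worst = max(worst, max(counts.values()) - min(counts.values()))
    return worst

# Reproduces the worked example of footnote 7 for the 'strategy' factor:
# 20 SS-R runs vs 16 MS-SR@FC runs (18 elsewhere) -> unbalance of 4.
runs = ([{"strategy": "SS-R"}] * 20
        + [{"strategy": "MS-SR@FC"}] * 16
        + [{"strategy": "MS-CR@2"}] * 18)
print(unbalance(runs, ["strategy"]))   # -> 4
```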
Algorithm 2: Sequential generation of experimental runs.
1 Initialize A with all possible 1,230 runs to execute;
2 Initialize B ← ∅ for the set of already executed runs;
3 while time allows do
4   Select O ⊆ A s.t. ∀o ∈ O, the unbalance in B ∪ {o} is minimal;
5   Select P ⊆ O s.t. ∀p ∈ P, the aliasing in B ∪ {p} is minimal;
6   Select p ∈ P at random;
7   Update A ← A − {p};
8   Update B ← B ∪ {p};
9   Learn the representation coded by p;

After going through the three phases above, the final experiment contained 48 + 25 + 352 = 425 experimental runs, each producing a different deep music representation. We further evaluated each representation on all 7 target datasets, leading to a grand total of 425 × 7 = 2975 data points. Figure 7 plots the alias matrix of the final experimental design, showing that the aliasing among main factors is indeed minimal. The final experimental design matrix can be downloaded along with the rest of the supplemental material.

Each considered representation network was trained using the CNN representation network model from Sect. 3, based on the specific combination of learning sources and deep architecture as indicated by the experimental run. In order to reduce variance, we fixed the number of training epochs to N = 200 across all runs and applied the same base architecture, except for the branching point. This entire training procedure took approximately 5 weeks with the given computational hardware resources introduced in Sect. 3.4.

4.4 Implementation details

In order to assess how our learned deep music representations perform on the various target datasets, transfer learning will now be applied, to consider our representations in the context of these new target datasets.

As a consequence, new machine learning pipelines are set up, focused on each of the target datasets. In all cases, we applied the pre-defined split if it was available. Otherwise, we randomly split the dataset into an 80% training set and a 20% test set. For every dataset, we repeated the training and evaluation 5 times, using different train/test splits. In most of our evaluation cases, validation will take place on the test set; in case of the recommendation problem, the test set represents a set of tracks to be held out from each user during model training, and re-inserted for validation. In all cases, we will extract representations from evaluation dataset audio as detailed in Sect. 4.4.1, and then learn relatively simple models based on them, as detailed in Sect. 4.4.2. Employing the metrics as mentioned in the previous section, we will then take average performance scores over the 5 different train/test splits for final performance reporting.

4.4.1 Feature extraction and preprocessing

Taking raw audio from the evaluation datasets as input, we take non-overlapping slices out of this audio with a fixed length of 2.5 s. Based on this, we apply the same preprocessing transformations as discussed in Sect. 3.1.1. Then, we extract a deep representation from this preprocessed audio, employing the architecture as specified by the given experimental run. As in the case of Sect. 3.2, representations are extracted from the fc-feature layer of each trained CNN model. Depending on the choice of architecture, the final representation may consist of concatenations of representations obtained by separate representation networks.
Fig. 7 Aliasing among main effects in the final experimental design

The slice-level representations of each clip are aggregated by taking their average and standard deviation values. As a result, we get a representation with averages per learned feature dimension and another representation with standard deviations per feature dimension. These will be concatenated, as illustrated in Fig. 6.
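The sketch below illustrates this slicing and mean/standard-deviation aggregation. The frame count per 2.5-s slice and the stand-in 'network' are illustrative assumptions; in the actual pipeline each slice passes through the trained representation network's fc-feature layer.

```python
import numpy as np

def clip_representation(mel, network, win_frames=216):
    """Slice a (n_mels, n_frames) spectrogram into non-overlapping ~2.5-s windows,
    map each window to a feature vector, and concatenate the mean and standard
    deviation of those vectors over the clip."""
    n_windows = mel.shape[1] // win_frames
    feats = np.stack([network(mel[:, j * win_frames:(j + 1) * win_frames])
                      for j in range(n_windows)])          # (n_windows, dim)
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

# Toy usage: a random projection stands in for the trained CNN.
rng = np.random.default_rng(0)
proj = rng.normal(size=(128, 256))
toy_network = lambda window: window.mean(axis=1) @ proj     # -> (256,)
rep = clip_representation(rng.normal(size=(128, 2500)), toy_network)
print(rep.shape)   # (512,)
```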
4.4.2 Target dataset-specific models

As our goal is not to over-optimize dataset-specific performance, but rather to perform a comparative analysis between different representations (resulting from different learning strategies), we keep the models simple and use fixed hyper-parameter values for each model across the entire experiment.

To evaluate the trained representations, we used different models according to the target dataset. For classification and regression tasks, we used the multilayer
perceptron (MLP) model [84]. More specifically, the MLP model has two hidden layers, whose dimensionality is 256. As for the nonlinearity, we choose ReLU [62] for all nodes, and the model is trained with the ADAM optimization technique [67] for 200 iterations. In the evaluation, we used the Scikit-Learn implementation for ease of distributed computing on multiple CPU computation nodes.

For the recommendation task, we choose a similar model as suggested in [28, 85], in which the learning objective function $\mathcal{L}$ is defined as

$$\hat{U}, \hat{V}, \hat{W} = \arg\min_{U, V, W} \; \|P - UV^{T}\|_{C} + \frac{\lambda_V}{2}\|V - XW\| + \frac{\lambda_U}{2}\|U\| + \frac{\lambda_W}{2}\|W\| \tag{7}$$

where $P \in \mathbb{R}^{u \times i}$ is a binary matrix indicating whether there is an interaction between users $u$ and items $i$, and $U \in \mathbb{R}^{u \times r}$ and $V \in \mathbb{R}^{i \times r}$ are $r$-dimensional user factors and item factors for the low-rank approximation of $P$. $P$ is derived from the original interaction matrix $R \in \mathbb{R}^{u \times i}$, which contains the number of interactions from users $u$ to items $i$, as follows:

$$P_{u,i} = \begin{cases} 1, & \text{if } R_{u,i} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{8}$$

$W \in \mathbb{R}^{d \times r}$ is a free parameter for the projection from the $d$-dimensional feature space to the factor space. $X \in \mathbb{R}^{i \times d}$ is the feature matrix where each row corresponds to a track. Finally, $\|\cdot\|_{C}$ is the Frobenius norm weighted by the confidence matrix $C \in \mathbb{R}^{u \times i}$, which controls the credibility of the model on the given interaction data, given as follows:

$$C = 1 + \alpha R \tag{9}$$

where $\alpha$ controls credibility. As for hyper-parameters, we set $\alpha = 0.1$, $\lambda_V = 0.00001$, $\lambda_U = 0.00001$, and $\lambda_W = 0.1$, respectively. For the number of factors, we choose $r = 50$ to focus only on the relative impact of the representation over the different conditions. We implemented an update rule with the alternating least squares (ALS) algorithm similar to [28], and updated parameters during 15 iterations.

5 Results and discussion

In this section, we present results and discussion related to the proposed deep music representations. In Sect. 5.1, we will first compare the performance across the SS-Rs, to show how different individual learning sources work for each target dataset. Then, we will present general experimental results related to the performance of the multi-source representations. In Sect. 5.2, we discuss the effect of the number of learning sources exploited in the representation learning, in terms of their general performance, reliability, and model compactness. In Sect. 5.3, we discuss the effectiveness of different representations in MIR. Finally, we present some initial evidence for multi-faceted semantic explainability of the proposed MTDTL in Sect. 5.5.⁹

⁹ For reproducibility, we release all relevant materials, including code, models and extracted features, at https://github.com/eldrin/MTLMusicRepresentation-PyTorch.

5.1 Single-source and multi-source representation

Figure 8 presents the performance of SS-R representations on each of the 7 target datasets. We can see that all sources tend to outperform the Rand baseline on all datasets, except for a handful of cases involving the sources self and bpm. Looking at the top performing sources, we find that tag, cdr_tag, and artist perform better than or on par with the most sophisticated baseline, Choi, except for the IRMAS dataset. The other sources are found somewhere between these two baselines, except for the datasets Lastfm and Arousal, where they perform better than Choi as well. Finally, the MFCC baseline is generally outperformed in all cases, with the notable exception of the IRMAS dataset, where only Choi performs better.

Zooming in to dataset-specific observed trends, the bpm learning source shows a highly skewed performance across target datasets: it clearly outperforms all other learning sources in the Ballroom dataset, but it achieves the worst or second-worst performance in the other datasets. As shown in [80], this confirms that the Ballroom dataset is well-separable based on BPM information alone. Indeed, representations trained on the bpm learning source seem to contain a latent representation close to the BPM of an input music signal. In contrast, we can see that the bpm representation achieves the worst results in the Arousal dataset, where both temporal dynamics and BPM are considered as important factors determining the intensity of emotion.

On the IRMAS dataset, we see that all the SS-Rs perform worse than the MFCC and Choi baselines. Given that these both take into account low-level features, either by design or by exploiting low-level layers of the neural network, this suggests that predominant instrument sounds are harder to distinguish based solely on semantic features, which is the case for the representations studied here.

Also, we find that there is small variability for each SS-R run within the training setup we applied. Specifically, in 50% of the cases, we have within-SS-R variability of less than 15% of the within-dataset variability. 90% of the cases are within 30% of the within-dataset variability.
Fig. 8 Performance of single-source representations. Each point indicates the performance of a representation learned from a single source. Solid points indicate the average performance per source. The baselines are illustrated as horizontal lines
We now consider how the various representations based on multiple learning sources perform, in comparison to those based on single learning sources. The boxplots in Fig. 9 show the distributions of performance scores for each architectural strategy and per target dataset. For comparison, the gray boxes summarize the distributions depicted in Fig. 8, based on the SS-R strategy. In general, we can see that these SS-Rs obtain the lowest scores, followed by MS-SR@FC, except for the IRMAS dataset. Given that these representations have the same dimensionality, these results suggest that adding a single source-specific layer on top of a heavily shared model may help to improve the adaptability of the neural network models, especially when there is no prior knowledge regarding the well-matching learning sources for the target datasets. The MS-CR and MSS-CR representations obtain the best results in general, which is somewhat expected because of their larger dimensionality.

5.2 Effect of number of learning sources and fusion strategy

While the plots in Fig. 9 suggest that MSS-CR and MS-CR are the best strategies, the high observed variability makes this statement still rather unclear. In order to gain a better insight into the effects of the dataset, the architecture strategies, and the number and type of learning sources, we further analyzed the results using a hierarchical or multilevel linear model on all observed scores [86]. The advantage of such a model is essentially that it accounts for the structure in our experiment, where observations nested within datasets are not independent.

From Fig. 9, we can anticipate a very large dataset effect because of the inherently different levels of difficulty, as well as a high level of heteroskedasticity. We, therefore, analyzed standardized performance scores rather than raw scores. In particular, the $i$-th performance score $y_i$ is standardized with the within-dataset mean and standard deviation scores, that is, $y_i = (y_i - \bar{y}_{d[i]}) / s_{d[i]}$, where $d[i]$ denotes the dataset of the $i$-th observation. This way, the dataset effect is effectively 0 and the variance is homogeneous. In addition, this will allow us to compare the relative differences across strategies and number of sources using the same scale in all datasets.
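A one-line pandas rendering of this within-dataset standardization (with made-up numbers) is:

```python
import pandas as pd

# One row per evaluated representation/dataset pair (values are made up).
scores = pd.DataFrame({
    "dataset": ["GTZAN", "GTZAN", "Ballroom", "Ballroom"],
    "score":   [0.62, 0.55, 0.91, 0.84],
})
# standardized score = (score - within-dataset mean) / within-dataset std
scores["score_std"] = (scores.groupby("dataset")["score"]
                             .transform(lambda s: (s - s.mean()) / s.std()))
print(scores)
```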
We also transformed the variable $n$ that refers to the number of sources to $n^*$, which is set to $n^* = 0$ for SS-Rs and to $n^* = n - 2$ for the other strategies. This way, the intercepts of the linear model will represent the average
Fig. 9 Performance by representation strategy. Solid points represent the mean per representation. The baselines are illustrated as horizontal lines
performance of each representation strategy in its simplest case, that is, SS-R ($n = 1$) or non-SS-R with $n = 2$. We fitted a first analysis model as follows:

$$y_i = \beta_{0r[i]d[i]} + \beta_{1r[i]d[i]}\, n_i^* + \epsilon_i \qquad \epsilon_i \sim N(0, \sigma_e^2) \tag{10}$$
$$\beta_{0rd} = \beta_{0r} + u_{0rd} \qquad u_{0rd} \sim N(0, \sigma_{0r}^2) \tag{11}$$
$$\beta_{1rd} = \beta_{1r} + u_{1rd} \qquad u_{1rd} \sim N(0, \sigma_{1r}^2), \tag{12}$$

where $\beta_{0r[i]d[i]}$ is the intercept of the corresponding representation strategy within the corresponding dataset. Each of these coefficients is defined as the sum of a global fixed effect $\beta_{0r}$ of the representation, and a random effect $u_{0rd}$, which allows for random within-dataset variation.¹⁰ This way, we separate the effects of interest (i.e., each $\beta_{0r}$) from the dataset-specific variations (i.e., each $u_{0rd}$). The effect of the number of sources is similarly defined as the sum of a fixed representation-specific coefficient $\beta_{1r}$ and a random dataset-specific coefficient $u_{1rd}$. Because the slope depends on the representation, we are thus implicitly modeling the interaction between strategy and number of sources, which can be appreciated in Fig. 10, especially with MS-SR@FC.

¹⁰ We note that hierarchical models do not fit each of the individual $u_{0rd}$ coefficients (a total of 42 in this model), but the amount of variability they produce, that is, $\sigma_{0r}^2$ (6 in total).
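The following sketch shows how a model of this flavor can be fit with statsmodels on synthetic stand-in data. It is a simplified approximation (a random intercept and a random slope for the number of sources per dataset, with strategy as a fixed effect), not the exact specification of (10)–(12); the real design matrix is distributed with the paper's supplemental material.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in scores for a few strategies, source counts, and datasets.
rng = np.random.default_rng(0)
strategies = ["SS-R", "MS-SR@FC", "MS-CR@2", "MSS-CR"]
rows = []
for d in range(7):
    for r, strat in enumerate(strategies):
        for n_star in range(7):
            rows.append({"dataset": d, "strategy": strat, "n_star": n_star,
                         "y": 0.3 * r + 0.15 * n_star + rng.normal(0, 1)})
data = pd.DataFrame(rows)

# Fixed effects: strategy and its interaction with n_star.
# Random effects: intercept and n_star slope, varying by dataset.
model = smf.mixedlm("y ~ strategy * n_star", data,
                    groups=data["dataset"], re_formula="~n_star")
print(model.fit().summary())
```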
Figure 11 shows the estimated effects and bootstrap 95% confidence intervals. The left plot confirms the observations in Fig. 9. In particular, they confirm that SS-R performs significantly worse than MS-SR@FC, which is similarly statistically worse than the others. When carrying out pairwise comparisons, MSS-CR outperforms all other strategies except MS-CR@2 ($p = 0.32$), which outperforms all others except MS-CR@6 ($p = 0.09$). The right plot confirms the qualitative observation from Fig. 10 by showing a significantly positive effect of the number of sources except for MS-SR@FC, where it is not statistically different from 0. The intervals suggest a very similar effect in the best representations, with average increments of about 0.16 per additional source—recall that scores are standardized.

To gain better insight into differences across representation strategies, we used a second hierarchical model where the representation strategy was modeled as an ordinal variable $r^*$ instead of the nominal variable $r$ used in the first model. In particular, $r^*$ represents the size of the network, so we coded SS-R as 0, MS-SR@FC as 0.2, MS-CR@6 as 0.4, MS-CR@4 as 0.6, MS-CR@2 as 0.8, and MSS-CR as 1 (see Fig. 5). In detail, this second model is as follows:
Fig. 10 (Standardized) performance by the number of learning sources. Solid points represent the mean per architecture and number of sources. The black horizontal line marks the mean performance of the SS-R representations. The colored lines show linear fits (color figure online)
Fig. 11 Fixed effects and bootstrap 95% confidence intervals estimated for the first analysis model. The left plot depicts the effects of the representation strategy ($\beta_{0r}$ intercepts), and the right plot shows the effects of the number of sources ($\beta_{1r}$ slopes)
y_i = \beta_0 + \beta_{1d[i]} r^*_i + \beta_{2d[i]} n_i + \beta_{3d[i]} r^*_i n_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2_\epsilon) \qquad (13)
\beta_{1d} = \beta_{10} + u_{1d}, \qquad u_{1d} \sim N(0, \sigma^2_1) \qquad (14)
\beta_{2d} = \beta_{20} + u_{2d}, \qquad u_{2d} \sim N(0, \sigma^2_2) \qquad (15)
\beta_{3d} = \beta_{30} + u_{3d}, \qquad u_{3d} \sim N(0, \sigma^2_3) \qquad (16)

In contrast to the first model, there is no representation-specific fixed intercept but an overall intercept β0. The effect of the network size is similarly modeled as the sum of an overall fixed slope β10 and a random dataset-specific effect u1d. Likewise, this model includes the main effect of the number of sources (fixed effect β20), as well as its interaction with the network size (fixed effect β30). Figure 12 shows the fitted coefficients, confirming the statistically positive effect of the size of the networks and, to a smaller degree but still significant, of the number of sources. The interaction term is not statistically significant, probably because of the unclear benefit of the number of sources in MS-SR@FC.
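To make the two analysis models more concrete, the sketch below shows how hierarchical regressions of this kind could be fit in Python with statsmodels on a long-format table of standardized scores. The synthetic data, the column names, and the simplified random-effects structure (a dataset-level random intercept and slope) are our own illustrative assumptions rather than the authors' actual fitting procedure; only the ordinal coding of the network size is taken from the text.

```python
# Illustrative sketch only (not the authors' code): hierarchical models in the
# spirit of Eqs. 12-16, fit with statsmodels on synthetic stand-in data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
strategies = ["SS-R", "MS-SR@FC", "MS-CR@6", "MS-CR@4", "MS-CR@2", "MSS-CR"]
datasets = ["Ballroom", "FMA", "GTZAN", "IRMAS", "Lastfm", "Arousal", "Valence"]

# One row per trained representation and target dataset; `score` stands in for
# the standardized performance y*.
rows = [{"dataset": d, "strategy": s, "n_sources": n,
         "score": rng.normal(0.15 * n, 1.0)}
        for d in datasets for s in strategies for n in range(1, 9)
        for _ in range(5)]
df = pd.DataFrame(rows)

# First model (cf. Eq. 12): strategy-specific intercepts and slopes for the
# number of sources, with a dataset-level random intercept and slope.
m1 = smf.mixedlm("score ~ C(strategy) * n_sources", df,
                 groups=df["dataset"], re_formula="~n_sources").fit()

# Second model (cf. Eqs. 13-16): strategy recoded as the ordinal network size
# r*, using the coding given in the text.
size_code = {"SS-R": 0.0, "MS-SR@FC": 0.2, "MS-CR@6": 0.4,
             "MS-CR@4": 0.6, "MS-CR@2": 0.8, "MSS-CR": 1.0}
df["size"] = df["strategy"].map(size_code)
m2 = smf.mixedlm("score ~ size * n_sources", df,
                 groups=df["dataset"], re_formula="~n_sources").fit()
print(m1.summary(), m2.summary(), sep="\n\n")
```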
Overall, these analyses confirm that all multi-source strategies outperform the single-source representations, with a direct relation to the number of parameters in the network. In addition, there is a clearly positive effect of the number of sources, with a minor interaction between both factors.

[Figure 12 is a forest plot of the four fixed effects of the second model (the intercept, the network-size slope, the number-of-sources slope, and their interaction) against the estimated effect, roughly between −1.0 and 1.5.]
Fig. 12 Fixed effects and bootstrap 95% confidence intervals estimated for the second analysis model, depicting the overall intercept (β0), the slope of the network size (β10), the slope of the number of sources (β20), and their interaction (β30)

Figure 10 also suggests that the variability of performance scores decreases with the number of learning sources used. This implies that if there are more learning sources available, one can expect less variability across instantiations of the network. Most importantly, the variability obtained for a single learning source (n = 1) is always larger than the variability with 2 or more sources. The Ballroom dataset shows much smaller variability when bpm is included in the combination. For this specific dataset, this indicates that once bpm is used to learn the representation, the expected performance is stable and does not vary much, even if we keep including more sources. Section 5.3 provides more insight in this regard.

5.3 Single source versus multi-source

The evidence so far tells us that, on average, learning from multiple sources leads to better performance than learning from a single source. However, it could be possible that the SS-R representation with the best learning source for the given target dataset still performs better than a multi-source alternative. In fact, in Fig. 10 there are many cases where the best SS-R representations (black circles at n = 1) already perform quite well compared to the more sophisticated alternatives.
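The comparison shown in Fig. 13 below hinges on a simple bookkeeping step: finding, per target dataset, the learning source whose SS-R representation scores best, and flagging every multi-source representation by whether it includes that source. The sketch below illustrates this step with pandas; the DataFrame layout and column names are hypothetical and not taken from the authors' code.

```python
# Sketch with hypothetical column names; `results` holds one row per trained
# representation and target dataset, with a frozenset of learning sources.
import pandas as pd

def split_by_best_source(results: pd.DataFrame) -> pd.DataFrame:
    """Flag each multi-source representation by whether it contains the source
    that performs best on its own (SS-R) for that target dataset, then average
    scores per dataset, flag, and number of sources (the two point/line groups
    of Fig. 13)."""
    ss_r = results[results["strategy"] == "SS-R"]
    best = (ss_r.sort_values("score", ascending=False)
                .groupby("dataset")["sources"].first()
                .apply(lambda s: next(iter(s))))        # one-element set -> name
    multi = results[results["strategy"] != "SS-R"].copy()
    multi["has_best"] = [best[d] in srcs
                         for d, srcs in zip(multi["dataset"], multi["sources"])]
    multi["n_sources"] = multi["sources"].apply(len)
    return (multi.groupby(["dataset", "has_best", "n_sources"],
                          as_index=False)["score"].mean())
```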
[Figure 13 uses the same panel layout as Fig. 10 (Accuracy for four datasets, nDCG for one, R² for two), plotting standardized performance against the number of learning sources (1–8).]
Fig. 13 (Standardized) performance by number of learning sources. Solid points mark representations including the source performing best with SS-R in the dataset; empty points mark representations without it. Solid and dashed lines represent linear fits, respectively; dashed areas represent 95% confidence intervals (color figure online)
Figure 13 presents similar plots, distinguishing representations that include the single best source (filled circles, solid lines) from those that do not (empty circles, dashed lines). The results suggest that even if the strongest learning source for the specific dataset is not used, the others largely compensate for it in the multi-source representations, catching up with and even surpassing the best SS-R representations. The exception to this rule is again bpm in the Ballroom dataset, where it definitely makes a difference. As the plots show, the variability for low numbers of learning sources is larger when the strongest source is not used, but as more sources are added, this variability reduces.

To further investigate this issue, for each target dataset we also computed the variance component due to each of the learning sources, excluding SS-R representations [87]. A large variance due to one of the sources means that, on average and for that specific dataset, there is a large difference in performance between having that source or not. Table 6 shows all variance components, highlighting the per-dataset largest. Apart from bpm in the Ballroom dataset, there is no clear evidence that one single source is especially good in all datasets, which suggests that in general there is not a single source that one would use by default. Notably though, the sources artist, tag and self tend to have large variance components.

In addition, we observe that the sources with the largest variance are not necessarily the sources that obtain the best results by themselves in an SS-R representation (see Fig. 8). We examined this relationship further by calculating the correlation between variance components and the (standardized) performance of the corresponding SS-Rs. The Pearson correlation is 0.38, meaning that there is a mild association. Figure 14 further shows this with a scatterplot, with a clear distinction between poorly performing sources (year, taste and lyrics at the bottom) and well-performing sources (tag, cdr_tag, and artist at the right).

[Figure 14 is a scatterplot of the within-dataset variance component (%) against standardized SS-R performance (y*), with points identified by learning source (self, year, bpm, taste, tag, lyrics, cdr_tag, artist) and by target dataset (Ballroom, FMA, GTZAN, IRMAS, Lastfm, Arousal, Valence).]
Fig. 14 Correlation between (standardized) SS-R performance and variance component (color figure online)
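As a rough illustration of the analysis behind Table 6 and Fig. 14, the sketch below approximates a per-source variance component as the share of score variance explained by whether that source is included in a representation. The paper relies on proper variance-component estimation [87], so this is only a simplified stand-in, and the DataFrame layout is assumed.

```python
# Simplified stand-in for the per-source variance components (cf. Table 6);
# `multi` is assumed to hold multi-source results with a `sources` set column.
import pandas as pd

def source_variance_components(multi: pd.DataFrame, source_names) -> pd.DataFrame:
    """For each target dataset and learning source, the share of score variance
    across multi-source representations explained by whether that source is
    included (a crude analogue of a variance component [87])."""
    records = []
    for dataset, grp in multi.groupby("dataset"):
        total = grp["score"].var(ddof=0)
        overall = grp["score"].mean()
        for src in source_names:
            present = grp["sources"].apply(lambda s: src in s)
            means = grp["score"].groupby(present).mean()
            weights = present.value_counts(normalize=True)
            between = ((means - overall) ** 2 * weights).sum()
            records.append({"dataset": dataset, "source": src,
                            "pct_variance": 100.0 * between / total})
    return pd.DataFrame(records)

# The text reports a Pearson correlation of 0.38 between a source's SS-R score
# and its variance component; with a merged table (hypothetical) one could run:
#   from scipy.stats import pearsonr
#   r, p = pearsonr(merged["ss_r_score"], merged["pct_variance"])
```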
This result implies that even if some SS-R is particularly strong for a given dataset, when considering more complex fusion architectures the presence of that one source is not necessarily required, because the other sources make up for its absence. This is especially important in practical terms, because different tasks generally have different best sources, and practitioners rarely have sufficient domain knowledge to select them up front. Also, and unlike the Ballroom dataset, many real-world problems are not easily solved with a single feature. Therefore, choosing a more general representation based on multiple sources is a much simpler way to proceed, which still yields comparable or better results.

In other words, if "a single deep representation to rule them all" is pre-trained, it is advisable to base this representation on multiple learning sources. At the same time, MSS-CR representations also generally show strong performance (albeit at the cost of high dimensionality), and they come for free as soon as SS-R networks are trained. We could therefore imagine an ecosystem in which the community pre-trains and releases many SS-R networks for different individual sources in a distributed way, and practitioners then collect these into MSS-CR representations without the need for retraining.
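A minimal sketch of this ecosystem idea follows, assuming each released SS-R network is available as a callable that maps a batch of audio inputs to an embedding matrix; the function and argument names are illustrative and not part of the authors' implementation.

```python
# Minimal sketch, assuming each released SS-R network is a callable returning
# an (n_samples, dim) embedding array; names and shapes are illustrative.
import numpy as np

def mss_cr_features(inputs, pretrained_ss_r_models):
    """Assemble an MSS-CR-style representation by concatenating the embeddings
    of independently pre-trained single-source (SS-R) networks; no retraining
    is involved, only feature concatenation."""
    embeddings = [np.asarray(model(inputs)) for model in pretrained_ss_r_models]
    return np.concatenate(embeddings, axis=1)  # dimensionality adds up per source
```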
Fig. 16 Potential semantic explainability of MTDTL music representations. Here, we provide a visualization using t-SNE [88], plotting 2-dimensional coordinates of each sample from the GTZAN dataset, as resulting from an MS-CR representation trained on 5 sources. In the zoomed-in panes, we overlay the strongest topic model terms in z_t, for various types of learning sources. The specific model used in the visualization is the 232nd model from the experimental design we introduce in Sect. 4.3, which performs better than 95% of the other models on the GTZAN target dataset
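As a rough sketch of how such a view can be produced, the snippet below projects a matrix of deep representations to two dimensions with scikit-learn's t-SNE [88] and colors the points by genre label; the array and function names are our own assumptions, and this is not the authors' visualization code.

```python
# Rough sketch (not the authors' code): t-SNE projection of deep representations.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, genre_labels: np.ndarray) -> None:
    """Project an (n_samples, dim) matrix of representations to 2-D with t-SNE
    and scatter the points colored by genre label (e.g., GTZAN genres)."""
    coords = TSNE(n_components=2, perplexity=30,
                  random_state=0).fit_transform(embeddings)
    for genre in np.unique(genre_labels):
        mask = genre_labels == genre
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(genre))
    plt.legend(fontsize=7)
    plt.tight_layout()
    plt.show()
```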
representation, although representations based on a single learning source will already be effective in specialized cases (e.g., BPM and the Ballroom dataset).
• RQ2 In terms of architecture, the amount of shared information has a negative effect on performance: larger models with less shared information (e.g., MS-CR@2, MSS-CR) tend to outperform models where sharing is higher (e.g., MS-CR@6, MS-SR@FC), all of which outperform the base model (SS-R).

Our findings give various pointers to useful future work. First of all, 'generality' is difficult to define in the music domain, maybe more so than in CV or NLP, in which
on acoustics, speech and signal processing, ICASSP. IEEE, Florence, Italy, pp 6979–6983. https://doi.org/10.1109/ICASSP.2014.6854953
16. Choi K, Fazekas G, Sandler MB (2016) Automatic tagging using deep convolutional neural networks. In: Proceedings of the 17th international society for music information retrieval conference, ISMIR. New York City, USA, pp 805–811
17. van den Oord A, Dieleman S, Schrauwen B (2013) Deep content-based music recommendation. In: Advances in neural information processing systems 26, NIPS. Lake Tahoe, NV, USA, pp 2643–2651
18. Chandna P, Miron M, Janer J, Gómez E (2017) Monoaural audio source separation using deep convolutional neural networks. In: Latent variable analysis and signal separation, 13th international conference, LVA/ICA, Proceedings. Grenoble, France, pp 258–266. https://doi.org/10.1007/978-3-319-53547-0_25. ISBN: 978-3-319-53547-0
19. Jeong I-Y, Lee K (2016) Learning temporal features using a deep neural network and its application to music genre classification. In: Proceedings of the 17th international society for music information retrieval conference, ISMIR. New York City, USA, pp 434–440
20. Han Y, Kim J-H, Lee K (2017) Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Trans Audio Speech Lang Process 25(1):208–221. https://doi.org/10.1109/TASLP.2016.2632307. ISSN: 2329-9290
21. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: 3rd international conference on learning representations, ICLR, San Diego, CA, USA
22. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, CVPR. IEEE Computer Society, Las Vegas, NV, USA, pp 770–778. https://doi.org/10.1109/CVPR.2016.90
23. Szegedy C, Liu W, Jia Y, Sermanet P, Reed SE, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: IEEE conference on computer vision and pattern recognition, CVPR. IEEE Computer Society, Boston, MA, USA, pp 1–9. https://doi.org/10.1109/CVPR.2015.7298594
24. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems 26, NIPS. Lake Tahoe, NV, USA, pp 3111–3119
25. Dieleman S, Brakel P, Schrauwen B (2011) Audio-based music classification with a pretrained convolutional network. In: Proceedings of the 12th international society for music information retrieval conference, ISMIR. University of Miami, Miami, FL, USA, pp 669–674. ISBN: 9780615548654
26. Choi K, Fazekas G, Sandler MB, Cho K (2017) Transfer learning for music classification and regression tasks. In: Proceedings of the 18th international society for music information retrieval conference, ISMIR. Suzhou, China, pp 141–149
27. van den Oord A, Dieleman S, Schrauwen B (2014) Transfer learning by supervised pre-training for audio-based music classification. In: Proceedings of the 15th international society for music information retrieval conference, ISMIR. Taipei, Taiwan, pp 29–34
28. Liang D, Zhan M, Ellis DPW (2015) Content-aware collaborative music recommendation using pre-trained neural networks. In: Proceedings of the 16th international society for music information retrieval conference, ISMIR. Málaga, Spain, pp 295–301
29. Misra I, Shrivastava A, Gupta A, Hebert M (2016) Cross-stitch networks for multi-task learning. In: IEEE conference on computer vision and pattern recognition, CVPR. IEEE Computer Society, Las Vegas, NV, USA, pp 3994–4003
30. Bertin-Mahieux T, Ellis DPW, Whitman B, Lamere P (2011) The million song dataset. In: Proceedings of the 12th international society for music information retrieval conference, ISMIR. University of Miami, Miami, FL, USA, pp 591–596
31. Bengio Y, Lamblin P, Popovici D, Larochelle H (2006) Greedy layer-wise training of deep networks. In: Advances in neural information processing systems 19, NIPS. MIT Press, Vancouver, BC, Canada, pp 153–160
32. Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008) Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on machine learning, ICML. ACM, Helsinki, Finland, pp 1096–1103. https://doi.org/10.1145/1390156.1390294
33. Smolensky P (1986) Information processing in dynamical systems: foundations of harmony theory. Technical report, University of Colorado, Boulder, Department of Computer Science
34. Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algorithm for deep belief nets. Neural Comput 18(7):1527–1554. https://doi.org/10.1162/neco.2006.18.7.1527
35. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems 27, NIPS. Curran Associates Inc., Montreal, QC, Canada, pp 2672–2680
36. Han X, Leung T, Jia Y, Sukthankar R, Berg AC (2015) Matchnet: unifying feature and metric learning for patch-based matching. In: IEEE conference on computer vision and pattern recognition, CVPR. IEEE Computer Society, Boston, MA, USA, pp 3279–3286. https://doi.org/10.1109/CVPR.2015.7298948
37. Arandjelovic R, Zisserman A (2017) Look, listen and learn. In: IEEE international conference on computer vision, ICCV. IEEE Computer Society, Venice, Italy, pp 609–617. https://doi.org/10.1109/ICCV.2017.73
38. Huang Y-S, Chou S-Y, Yang Y-H (2018) Generating music medleys via playing music puzzle games. In: Proceedings of the thirty-second conference on artificial intelligence, AAAI. AAAI Press, New Orleans, LA, USA, pp 2281–2288
39. Salton G, McGill M (1984) Introduction to modern information retrieval. McGraw-Hill Book Company, New York City. ISBN: 0-07-054484-0
40. Lamere P (2008) Social tagging and music information retrieval. J New Music Res 37(2):101–114. https://doi.org/10.1080/09298210802479284. ISSN: 0929-8215
41. Hamel P, Davies MEP, Yoshii K, Goto M (2013) Transfer learning in MIR: sharing learned latent representations for music audio classification and similarity. In: Proceedings of the 14th international society for music information retrieval conference, ISMIR. Curitiba, Brazil, pp 9–14
42. Law E, Settles B, Mitchell TM (2010) Learning to tag from open vocabulary labels. In: Machine learning and knowledge discovery in databases, European conference, ECML PKDD, Proceedings, Part II. Springer, Barcelona, Spain, pp 211–226
43. Hofmann T (1999) Probabilistic latent semantic analysis. In: UAI: proceedings of the fifteenth conference on uncertainty in artificial intelligence. Morgan Kaufmann, Stockholm, Sweden, pp 289–296
44. Schlüter J (2016) Learning to pinpoint singing voice from weakly labeled examples. In: Proceedings of the 17th international society for music information retrieval conference, ISMIR. New York City, USA, pp 44–50
45. Hershey S, Chaudhuri S, Ellis DPW, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B, Slaney M, Weiss RJ, Wilson KW (2017) CNN architectures for large-scale audio classification. In: IEEE international conference on acoustics, speech and signal processing, ICASSP. IEEE, New Orleans, LA, USA, pp 131–135. https://doi.org/10.1109/ICASSP.2017.7952132
46. Lee H, Pham PT, Largman Y, Ng AY (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Advances in neural information processing systems 22, NIPS. Curran Associates Inc, Vancouver, BC, Canada, pp 1096–1104
47. Humphrey EJ, Bello JP (2012) Rethinking automatic chord recognition with convolutional neural networks. In: 11th international conference on machine learning and applications, ICMLA. IEEE, Boca Raton, FL, USA, pp 357–362. https://doi.org/10.1109/ICMLA.2012.220
48. Nakashika T, Garcia C, Takiguchi T (2012) Local-feature-map integration using convolutional neural networks for music genre classification. In: INTERSPEECH, 13th annual conference of the international speech communication association. ISCA, Portland, OR, USA, pp 1752–1755
49. Ullrich K, Schlüter J, Grill T (2015) Boundary detection in music structure analysis using convolutional neural networks. In: Proceedings of the 16th international society for music information retrieval conference, ISMIR. Málaga, Spain, pp 417–422
50. Piczak KJ (2015) Environmental sound classification with convolutional neural networks. In: 25th IEEE international workshop on machine learning for signal processing, MLSP. IEEE, Boston, MA, USA, pp 1–6. https://doi.org/10.1109/MLSP.2015.7324337
51. Simpson AJR, Roma G, Plumbley MD (2015) Deep karaoke: extracting vocals from musical mixtures using a convolutional deep neural network. In: Latent variable analysis and signal separation, 12th international conference, LVA/ICA, Proceedings. Springer, Liberec, Czech Republic, pp 429–436. https://doi.org/10.1007/978-3-319-22482-4_50. ISBN: 978-3-319-22482-4
52. Phan H, Hertel L, Maaß M, Mertins A (2016) Robust audio event recognition with 1-max pooling convolutional neural networks. In: INTERSPEECH, 17th annual conference of the international speech communication association. ISCA, San Francisco, CA, USA, pp 3653–3657. https://doi.org/10.21437/Interspeech.2016-123
53. Pons J, Lidy T, Serra X (2016) Experimenting with musically motivated convolutional neural networks. In: 14th international workshop on content-based multimedia indexing, CBMI. IEEE, Bucharest, Romania, pp 1–6. https://doi.org/10.1109/CBMI.2016.7500246
54. Stasiak B, Monko J (2016) Analysis of time-frequency representations for musical onset detection with convolutional neural network. In: Proceedings of the federated conference on computer science and information systems, FedCSIS. Gdańsk, Poland, pp 147–152. https://doi.org/10.15439/2016F558
55. Su H, Zhang H, Zhang X, Gao G (2016) Convolutional neural network for robust pitch determination. In: IEEE international conference on acoustics, speech and signal processing, ICASSP. IEEE, Shanghai, China, pp 579–583. https://doi.org/10.1109/ICASSP.2016.7471741
56. Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90. https://doi.org/10.1145/3065386
57. Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: IEEE international conference on acoustics, speech and signal processing, ICASSP. IEEE, Florence, Italy, pp 6964–6968. https://doi.org/10.1109/ICASSP.2014.6854950
58. van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior AW, Kavukcuoglu K (2016) Wavenet: a generative model for raw audio. In: The 9th ISCA speech synthesis workshop, SSW. ISCA, Sunnyvale, CA, USA, p 125
59. Jaitly N, Hinton GE (2011) Learning a better representation of speech soundwaves using restricted boltzmann machines. In: IEEE international conference on acoustics, speech, and signal processing, ICASSP. IEEE, Prague, Czech Republic, pp 5884–5887. https://doi.org/10.1109/ICASSP.2011.5947700
60. Lee J, Park J, Kim KL, Nam J (2017) Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In: 14th sound and music computing conference, SMC, Espoo, Finland
61. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd international conference on machine learning, ICML. JMLR, Inc, Lille, France, pp 448–456
62. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th international conference on machine learning, ICML. Omnipress, Haifa, Israel, pp 807–814
63. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
64. Nam J, Herrera J, Slaney M, Smith JO (2012) Learning sparse feature representations for music annotation and retrieval. In: Proceedings of the 13th international society for music information retrieval conference, ISMIR. FEUP Edições, Porto, Portugal, pp 565–570
65. Choi K, Fazekas G, Sandler MB, Cho K (2018) A comparison of audio signal preprocessing methods for deep neural networks on music tagging. In: 26th European signal processing conference, EUSIPCO. IEEE, Roma, Italy, pp 1870–1874
66. Dörfler M, Grill T, Bammer R, Flexer A (2018) Basic filters for convolutional neural networks applied to music: training or design? Neural Comput Appl. https://doi.org/10.1007/s00521-018-3704-x. ISSN: 1433-3058
67. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd international conference on learning representations, ICLR, San Diego, CA, USA
68. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch. In: NIPS-W
69. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay D (2012) Scikit-learn: machine learning in python. J Mach Learn Res 12:2825–2830. https://doi.org/10.1007/s13398-014-0173-7.2. ISSN: 15324435
70. McFee B, Raffel C, Liang D, Ellis DPW, McVicar M, Battenberg E, Nieto O (2015) librosa: audio and music signal analysis in python. In: Huff K, Bergstra J (eds) Proceedings of the 14th python in science conference, SciPy. Austin, TX, USA, pp 18–24. https://doi.org/10.25080/Majora-7b98e3ed-003
71. Defferrard M, Benzi K, Vandergheynst P, Bresson X (2017) FMA: a dataset for music analysis. In: Proceedings of the 18th international society for music information retrieval conference, ISMIR. Suzhou, China, pp 316–323
72. Tzanetakis G, Cook PR (2002) Musical genre classification of audio signals. IEEE Trans Speech Audio Process 10(5):293–302. https://doi.org/10.1109/TSA.2002.800560. ISSN: 1063-6676
73. Kereliuk C, Sturm BL, Larsen J (2015) Deep learning and music adversaries. IEEE Trans Multimed 17(11):2059–2071. https://doi.org/10.1109/TMM.2015.2478068. ISSN: 1520-9210
74. Gouyon F, Klapuri A, Dixon S, Alonso M, Tzanetakis G, Uhle C, Cano P (2006) An experimental comparison of audio tempo induction algorithms. IEEE Trans Audio Speech Lang Process 14(5):1832–1844. https://doi.org/10.1109/TSA.2005.858509. ISSN: 1558-7916
75. Marchand U, Peeters G (2016) Scale and shift invariant time/frequency representation using auditory statistics: application to rhythm description. In: 26th IEEE international workshop on machine learning for signal processing, MLSP. IEEE, Salerno, Italy, pp 1–6. https://doi.org/10.1109/MLSP.2016.7738904
76. Bosch JJ, Janer J, Fuhrmann F, Herrera P (2012) A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In: Proceedings of the 13th international society for music information retrieval conference, ISMIR. FEUP Edições, Porto, Portugal, pp 559–564
77. Soleymani M, Caro MN, Schmidt EM, Sha C-Y, Yang Y-H (2013) 1000 songs for emotional analysis of music. In: Proceedings of the 2nd ACM international workshop on crowdsourcing for multimedia, CrowdMM@ACM multimedia. ACM, Barcelona, Spain, pp 1–6. https://doi.org/10.1145/2506364.2506365. ISBN: 978-1-4503-2396-3
78. Celma Ò (2010) Music recommendation and discovery: the long tail, long fail, and long play in the digital music space. Springer, Berlin. https://doi.org/10.1007/978-3-642-13287-2. ISBN: 978-3-642-13286-5
79. Sturm BL (2014) The state of the art ten years after a state of the art: future research in music information retrieval. J New Music Res 43(2):147–172. https://doi.org/10.1080/09298215.2014.894533
80. Sturm BL (2016) The "Horse" inside: seeking causes behind the behaviors of music content analysis systems. Comput Entertain 14(2):3:1–3:32. https://doi.org/10.1145/2967507
81. Posner J, Russell JA, Peterson BS (2005) The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Dev Psychopathol 17(3):715–734. https://doi.org/10.1017/S0954579405050340. ISSN: 1469-2198
82. Montgomery DC (2012) Design and analysis of experiments, 8th edn. Wiley, Hoboken
83. Goos P, Jones B (2011) Optimal design of experiments: a case study approach, 1st edn. Wiley, Hoboken
84. Hinton GE (1989) Connectionist learning procedures. Artif Intell 40(1):185–234. https://doi.org/10.1016/0004-3702(89)90049-0. ISSN: 0004-3702
85. Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In: Proceedings of the 8th IEEE international conference on data mining (ICDM). IEEE Computer Society, Pisa, Italy, pp 263–272. https://doi.org/10.1109/ICDM.2008.22
86. Gelman A, Hill J (2006) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, Cambridge
87. Searle SR, Casella G, McCulloch CE (2006) Variance components. Wiley, Hoboken
88. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(November):2579–2605

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.