
Neural Computing and Applications (2020) 32:1067–1093
https://doi.org/10.1007/s00521-019-04076-1

DEEP LEARNING FOR MUSIC AND AUDIO

One deep music representation to rule them all? A comparative analysis of different representation learning strategies
Jaehun Kim1 • Julián Urbano1 • Cynthia C. S. Liem1 • Alan Hanjalic1

Received: 7 December 2017 / Accepted: 12 February 2019 / Published online: 4 March 2019
© The Author(s) 2019

Abstract
Inspired by the success of deploying deep learning in the fields of Computer Vision and Natural Language Processing, this learning paradigm has also found its way into the field of Music Information Retrieval. In order to benefit from deep learning in an effective, but also efficient manner, deep transfer learning has become a common approach. In this approach, it is possible to reuse the output of a pre-trained neural network as the basis for a new learning task. The underlying hypothesis is that if the initial and new learning tasks show commonalities and are applied to the same type of input data (e.g., music audio), the generated deep representation of the data is also informative for the new task. Since, however, most of the networks used to generate deep representations are trained using a single initial learning source, their representation is unlikely to be informative for all possible future tasks. In this paper, we present the results of our investigation of what are the most important factors to generate deep representations for the data and learning tasks in the music domain. We conducted this investigation via an extensive empirical study that involves multiple learning sources, as well as multiple deep learning architectures with varying levels of information sharing between sources, in order to learn music representations. We then validate these representations considering multiple target datasets for evaluation. The results of our experiments yield several insights into how to approach the design of methods for learning widely deployable deep data representations in the music domain.

Keywords: Representation learning · Music Information Retrieval · Multitask learning

Correspondence: Jaehun Kim, [email protected]

1 Multimedia Computing Group, Department of Intelligent Systems, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, Netherlands

1 Introduction

In the Music Information Retrieval (MIR) field, many research problems of interest involve the automatic description of properties of musical signals, employing concepts that are understood by humans. For this, tasks are derived that can be solved by automated systems. In such cases, algorithmic processes are employed to map raw music audio information to humanly understood descriptors (e.g., genre labels or descriptive tags). To achieve this, historically, the raw audio would first be transformed into a representation based on hand-crafted features, which are engineered by humans to reflect dedicated semantic signal properties. The feature representation would then serve as input to various statistical or machine learning (ML) approaches [1].

The framing as described above can generally be applied to many applied ML problems: complex real-world problems are abstracted into a relatively simpler form, by establishing tasks that can be computationally addressed by automatic systems. In many cases, the task involves making a prediction based on a certain observation. For this, modern ML methodologies can be employed that can automatically infer the logic for the prediction directly from (a numeric representation of) the given data, by optimizing an objective function defined for the given task.

However, music is a multimodal phenomenon that can be described in many parallel ways, ranging from objective descriptors to subjective preference. As a consequence, in many cases, while music-related tasks are well understood by humans, it often is hard to pinpoint and describe where

the truly 'relevant' information is in the music data used for the tasks, and how this properly can be translated into numeric representations that should be used for prediction. While research into such proper translations can be conducted per individual task, it is likely that informative factors in music data will be shared across tasks. As a consequence, when seeking to identify informative factors that are not explicitly restricted to a single task, multitask learning (MTL) is a promising strategy. In MTL, a single learning framework hosts multiple tasks at once, allowing models to perform better by sharing commonalities between involved tasks [2]. MTL has been successfully used in a range of applied ML works [3–10], also including the music domain [11, 12].

Following successes in the fields of Computer Vision (CV) and Natural Language Processing (NLP), deep learning approaches have recently also gained increasing interest in the MIR field, in which case deep representations of music audio data are directly learned from the data, rather than being hand-crafted. Many works employing such approaches reported considerable performance improvements in various music analysis, indexing and classification tasks [13–20].

In many deep learning applications, rather than training a complete network from scratch, pre-trained networks are commonly used to generate deep representations, which can be either directly adopted or further adapted for the current task at hand. In CV and NLP, (parts of) certain pre-trained networks [21–24] have now been adopted and adapted in a very large number of works. These 'standard' deep representations have typically been obtained by training a network for a single learning task, such as visual object recognition, employing large amounts of training data. The hypothesis on why these representations are effective in a broader spectrum of tasks than they originally were trained for, is that deep transfer learning (DTL) is happening: information initially picked up by the network is beneficial also for new learning tasks performed on the same type of raw input data. Clearly, the validity of this hypothesis is linked to the extent to which the new task can rely on similar data characteristics as the task on which the pre-trained network was originally trained.

Although a number of works deployed DTL for various learning tasks in the music domain [25–28], to our knowledge, transfer learning and the employment of pre-trained networks are not as standard in the MIR domain as in the CV domain. Again, this may be due to the broad and partially subjective range and nature of possible music descriptions. Following the considerations above, it may then be useful to combine deep transfer learning with multitask learning.

Indeed, in order to increase robustness to a larger scope of new learning tasks and datasets, the concept of MTL also has been applied in training deep networks for representation learning, both in the music domain [11, 12] and in general [3, p. 2]. As the model learns several tasks and datasets in parallel, it may pick up commonalities among them. As a consequence, the expectation is that a network learned with MTL will yield robust performance across different tasks, by transferring shared knowledge [2, 3]. A simple illustration of the conceptual difference between traditional DTL and deep transfer learning based on MTL (further referred to as multitask-based deep transfer learning (MTDTL)) is shown in Fig. 1.

The mission of this paper is to investigate the effect of conditions around the setup of MTDTL which are important to yield effective deep music representations. Here, we understand an 'effective' representation to be a representation that is suitable for a wide range of new tasks and datasets. Ultimately, we aim to provide a methodological framework to systematically obtain and evaluate such transferable representations. We pursue this mission by exploring the effectiveness of MTDTL and traditional DTL, as well as concatenations of multiple deep representations, obtained by networks that were independently trained on separate single learning tasks. We consider these representations for multiple choices of learning tasks and multiple target datasets.

Our work will address the following research questions:

• RQ1: Given a set of learning sources that can be used to train a network, what is the influence of the number and type of the sources on the effectiveness of the learned deep representation?
• RQ2: How do various degrees of information sharing in the deep architecture affect the effectiveness of a learned deep representation?

By answering RQ1, we arrive at an understanding of important factors regarding the composition of a set of learning tasks and datasets (which in the remainder of this work will be denoted as learning sources) to achieve an effective deep music representation, specifically on the number and nature of learning sources. The answer to RQ2 provides insight into how to choose the optimal multitask network architecture in the MTDTL context. For example, in MTL, multiple sources are considered under a joint learning scheme that partially shares inferences obtained from different learning sources in the learning pipeline. In MTL applications using deep neural networks, this means that certain layers will be shared between all sources, while at other stages, the architecture will 'branch' out into source-specific layers [2, 5–8, 12, 29]. However, an investigation is still needed on where in the layered architecture branching should ideally happen, if a branching strategy would turn out beneficial in the first place.


Fig. 1 (diagram omitted) Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single learning task (above) and multitask-based deep transfer learning (MTDTL) (below). The same color used for a learning and a target task indicates that the tasks have commonalities, which implies that the learned representation is likely to be informative for the target task. At the same time, this representation may not be that informative to another future task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more learning tasks increases robustness of the learned representation and its usability for a broader set of target tasks (color figure online)

To reach the aforementioned answers, it is necessary to conduct a systematic assessment to examine relevant factors. For RQ1, we investigate different numbers and combinations of learning sources. For RQ2, we study different architectural strategies. However, we wish to ultimately investigate the effectiveness of the representation with respect to new, target learning tasks and datasets (which in the remainder of this paper will be denoted by target datasets). While this may cause a combinatorial explosion with respect to possible experimental configurations, we will make strategic choices in the design and evaluation procedure of the various representation learning strategies.

The scientific contribution of this work can be summarized as follows:

• We provide insight into the effectiveness of various deep representation learning strategies under the multitask learning context.
• We offer in-depth insight into ways to evaluate desired properties of a deep representation learning procedure.
• We propose and release several pre-trained music representation networks, based on different learning strategies for multiple semantic learning sources.

The rest of this work is presented as follows: a formalization of this problem, as well as the global outline of how learning will be performed based on different learning tasks from different sources, will be presented in Sect. 2. Detailed specifications of the deep architectures we considered for the learning procedure will be discussed in Sect. 3. Our strategy to evaluate the effectiveness of different representation network variants by employing various target datasets will be the focus of Sect. 4. Experimental results will be discussed in Sect. 5, after which general conclusions will be presented in Sect. 6.

2 Framework for deep representation learning

In this section, we formally define the deep representation learning problem. As Fig. 2 illustrates, any domain-specific MTDTL problem can be abstracted into a formal task, which is instantiated by a specific dataset with specific observations and labels. Multiple tasks and datasets are involved to emphasize different aspects of the input data, such that the learned representation is more adaptable to different future tasks. The learning part of this scheme can be understood as the MTL phase, which is introduced in Sect. 2.1. Subsequently, in Sect. 2.2, we discuss the learning sources involved in this work, which consist of various tasks and datasets, allowing us to investigate their effects on the transfer learning. Further, in Sect. 2.3, we introduce the label preprocessing procedure applied in this work, which ensures that the learning sources are more regularized, such that their comparative analysis is clearer.


Fig. 2 (diagram omitted) Schematic overview of what this work investigates. The upper scheme illustrates a general problem-solving framework in which multitask transfer learning is employed. The tasks t ∈ {t_0, t_1, ..., t_M} are derived from a certain problem domain, which is instantiated by datasets, that often are represented as sample pairs of observations and corresponding labels (X_t, y_t). Sometimes, the original dataset is processed further into simpler representation forms (X_t, z_t), to filter out undesirable information and noise. Once a model or system f_t(X_t) has learned the necessary mappings within the learning sources, this knowledge can be transferred to another set of target datasets, leveraging commonalities already obtained by the pre-training. Below the general framework, we show a concrete example, in which the broad MIR problem domain is abstracted into various sub-problems with corresponding tasks and datasets. (a) Multitask transfer learning in a general problem domain; (b) multitask transfer learning in the Music Information Retrieval domain

2.1 Problem definition

A machine learning problem, focused on solving a specific task t, can be formulated as a minimization problem, in which a model function f_t must be learned that minimizes a loss function L for given dataset D_t = {(x_t^(i), y_t^(i)) | i ∈ {1, ..., I}}, comparing the model's predictions given by the input x_t and actual task-specific learning labels y_t. This can be formulated using the following expression:

$$\hat{\theta} = \arg\min_{\theta}\; \mathbb{E}_{D_t}\, L\big(y_t, f_t(x_t; \theta)\big) \tag{1}$$

where x_t ∈ R^d is, traditionally, a hand-crafted d-dimensional feature vector and θ is a set of model parameters of f.

When deep learning is employed, the model function f denotes a learnable network. Typically, the network model f is learned in an end-to-end fashion, from raw data at the input to the learning label. In the speech and music field, however, using true end-to-end learning is still not a common practice. Instead, raw data is typically transformed first, before serving as network input. More specifically, in the music domain, common input to function f would be X ∈ R^{c×n×b}, replacing the originally hand-crafted feature vector x ∈ R^d from (1) by a time-frequency representation of the observed music data, usually obtained through the short-time Fourier transform (STFT), with potential additional filter bank applications (e.g., mel-filter bank). The dimensions c, n, b indicate channels of the audio signal, time steps, and frequency bins, respectively.


If such a network still is trained for a specific single machine learning task t, we can now reformulate (1) as follows:

$$\hat{\theta} = \arg\min_{\theta}\; \mathbb{E}_{D_t}\, L\big(y_t, f_t(X_t; \theta)\big) \tag{2}$$

In MTL, in the process of learning the network model f, different tasks will need to be solved in parallel. In the case of deep neural networks, this is usually realized by having a network in which lower layers are shared for all tasks, but upper layers are task-specific. Given m different tasks t, each having the learning label y_t, we can formulate the learning objective of the neural network in the MTL scenario as follows:

$$\hat{\theta}^{s}, \hat{\theta}^{\bullet} = \arg\min_{\theta}\; \mathbb{E}_{t \in T}\, \mathbb{E}_{D_t}\, L\big(y_t, f_t(X_t; \theta^{s}, \theta^{t})\big) \tag{3}$$

Here, T = {t_1, t_2, ..., t_m} is a given set of tasks to be learned and θ^• = {θ^1, θ^2, ..., θ^m} indicates a set of model parameters θ^t with respect to each task. Since the deep architecture initially shares lower layers and branches out to task-specific upper layers, the parameters of shared layers and task-specific layers are referred to separately as θ^s and θ^t, respectively. Updates for all parameters can be achieved through standard back-propagation. Further specifics on network architectures and training configurations will be given in Sect. 3.

Given the formalizations above, the first step in our framework is to select a suitable set T of learning tasks. These tasks can be seen as multiple concurrent descriptions or transformations of the same input fragment of musical audio: each will reflect certain semantic aspects of the music. However, unlike the approach in a typical MTL scheme, solving multiple specific learning tasks is actually not our main goal; instead, we wish to learn an effective representation that captures as many semantically important factors in the low-level music representation as possible. Thus, rather than using learning labels y_t, our representation learning process will employ reduced learning labels z_t, which capture a reduced set of semantic factors from y_t. We then can reformulate (3) as follows:

$$\hat{\theta}^{s}, \hat{\theta}^{\bullet} = \arg\min_{\theta}\; \mathbb{E}_{t \in T}\, \mathbb{E}_{D_t}\, L\big(z_t, f_t(X_t; \theta^{s}, \theta^{t})\big) \tag{4}$$

where z_t ∈ R^k is a k-dimensional vector that represents a reduced learning label for a specific task t. Each z_t will be obtained through task-specific factor extraction methods, as described in Sect. 2.3.
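As an illustration of the objective in (4), the following PyTorch sketch wires a shared trunk θ^s to task-specific heads θ^t and evaluates one task's loss. The module and layer sizes are hypothetical placeholders rather than the architecture used in this paper (which is specified in Sect. 3), and the KL-divergence loss anticipates the label preprocessing motivated in Sect. 2.3:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Sketch of Eq. (3)/(4): shared parameters theta^s plus one
    task-specific head theta^t per learning task."""
    def __init__(self, n_tasks, k=50):
        super().__init__()
        # theta^s: layers shared by all tasks (placeholder sizes)
        self.shared = nn.Sequential(nn.Linear(128, 256), nn.ReLU())
        # theta^t: one head per task, predicting the k-dim factor z_t
        self.heads = nn.ModuleList(nn.Linear(256, k) for _ in range(n_tasks))

    def forward(self, x, t):
        return self.heads[t](self.shared(x))   # f_t(X; theta^s, theta^t)

net = MultiTaskNet(n_tasks=8)
kl = nn.KLDivLoss(reduction='batchmean')      # L, as introduced in Sect. 2.3
x = torch.randn(4, 128)                       # dummy batch of inputs
z_t = torch.softmax(torch.randn(4, 50), -1)   # dummy reduced labels z_t
loss = kl(torch.log_softmax(net(x, t=2), -1), z_t)
```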
other words, intrinsic information in the input music
framework is to select a suitable set T of learning tasks.
track should be captured through a learning procedure,
These tasks can be seen as multiple concurrent descriptions
without employing further data. Various unsupervised
or transformations of the same input fragment of musical
or auto-regressive learning strategies can be employed
audio: each will reflect certain semantic aspects of the
under this category, with variants of autoencoders,
music. However, unlike the approach in a typical MTL
including the Stacked Autoencoder [31, 32], Restricted
scheme, solving multiple specific learning tasks is actually
Boltzmann Machines (RBM) [33], Deep Belief Net-
not our main goal; instead, we wish to learn an effective
works (DBN) [34] and Generative Adversarial Net-
representation that captures as many semantically impor-
works (GAN) [35]. As another example within this
tant factors in the low-level music representation as pos-
category, variants of the Siamese networks for simi-
sible. Thus, rather than using learning labels yt , our
larity learning can be considered [36–38].
representation learning process will employ reduced
In our case, we will employ the Siamese architecture
learning labels zt , which capture a reduced set of semantic
to learn a metric that measures whether two input music
factors from yt . We then can reformulate (3) as follows:
s 
clips belong to the same track or two different tracks.
h^ ; h^ ¼ arg min Et2T EDt Lðzt ; ft ðXt ; hs ; ht ÞÞ ð4Þ This can be formulated as follows:
self s
where zt 2 Rk is a k-dimensional vector that represents h^ ; h^ ¼ arg min EX ;X  D Lðyself ; fself ðXl ; Xr ; hself ; hs ÞÞ
l r self

a reduced learning label for a specific task t. Each zt will be ð5Þ


obtained through task-specific factor extraction methods, as 
described in Sect. 2.3. 1; if Xl and Xr sampled from same track
yself ¼
0 otherwise
2.2 Learning sources ð6Þ
where Xl and Xr are a pair of randomly sampled short
In MTDTL context, a training dataset can be seen as the
music snippets (taken from the 30-s MSD audio pre-
‘source’ to learn the representation, which will be further
views) and fself is a network for learning a metric
transferred to the future ‘target’ dataset. Different learning
between given input representations in terms of the
sources of different nature can be imagined that can be
criteria imposed by yself . It is composed of one or more


fully connected layers and one output layer with softmax activation. A global outline illustration of our chosen architecture is given in Fig. 3; a small sketch of how such training pairs can be drawn follows at the end of this subsection. Further specifications of the representation network and sampling strategies will be given in Sect. 3.

Fig. 3 (diagram omitted) Siamese architecture adopted for the self learning task. For further details of the representation network, see Sect. 3.1 and Fig. 4

• Feature. Many algorithms exist already for extracting features out of musical audio, or for transforming musical audio representations. By running such algorithms on musical audio, learning labels are automatically computed, without the need for soliciting human annotations. Algorithmically computed outcomes will likely not be perfect and include noise or errors. At the same time, we consider them as a relatively efficient way to extract semantically relevant and more structured information out of a raw input signal.

In our case, under this category, we use beat per minute (BPM) information, released as part of the MSD's precomputed features. The BPM values were computed by an estimation algorithm, as part of the Echo Nest API.
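As a minimal sketch of the pair construction behind (5) and (6), assuming 30-s preview waveforms as tensors and an even split between positive and negative pairs (our reading of the equal-membership batches described in Sect. 3.1.2):

```python
import random
import torch

def crop(track, n=int(2.5 * 22050)):
    """Random short snippet from a 30-s preview waveform (see Sect. 3.1.2)."""
    i = random.randint(0, track.shape[-1] - n)
    return track[..., i:i + n]

def sample_pair(tracks):
    """Draw (X_l, X_r, y_self) following Eq. (6): label 1 iff both
    snippets come from the same track."""
    if random.random() < 0.5:                    # positive pair
        t = random.choice(tracks)
        x_l, x_r, y = crop(t), crop(t), 1
    else:                                        # negative pair
        t_a, t_b = random.sample(tracks, 2)
        x_l, x_r, y = crop(t_a), crop(t_b), 0
    return x_l, x_r, torch.tensor(y)

# usage: tracks as a list of (2, 661500) stereo preview tensors
tracks = [torch.randn(2, 30 * 22050) for _ in range(4)]
x_l, x_r, y = sample_pair(tracks)
```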
2.2.2 Annotation

• Metadata. Typically, metadata will come 'for free' with music audio, specifying side information, such as a release year, the song title, the name of the artist, the corresponding album name, and the corresponding album cover image. Considering that this information describes categorization facets of the musical audio, metadata can be a useful information source to learn a music representation. In our experiments, we use release year information, which is readily provided as metadata with each song in the MSD.

• Crowd. Through interaction with music streaming or scrobbling services, large numbers of users, also designated as the crowd, left explicit or implicit information regarding their perspectives on musical content. For example, they may have created social tags, ratings, or social media mentionings of songs. With many services offering API access to these types of descriptors, crowd data, therefore, offers scalable, spontaneous and diverse (albeit noisy) human perspectives on music signals.

In our experiments, we use social tags from Last.fm¹ and user listening profiles from the Echo Nest.

• Professional. As mentioned in [1], annotation of music tracks is a complicated and time-consuming process: annotation criteria frequently are subjective, and considerable domain knowledge and annotation experience may be required before accurate and consistent annotations can be made. Professional experts in categorization have this experience, and thus are capable of indicating clean and systematic information about musical content. It is not trivial to get such professional annotations at scale; however, these types of annotations may be available in existing professional libraries.

In our case, we use professional annotations from the Centrale Discotheek Rotterdam (CDR), the largest music library in The Netherlands, holding all music ever released in the country in physical and digital form in its collection. The CDR collection can be digitally accessed through the online Muziekweb² platform. For each musical album in the CDR collection, genre annotations were made by a professional annotator, according to a fixed vocabulary of 367 hierarchical music genres.

As another professional-level 'description,' we adopted lyrics information per track, which is provided in Bag-of-Words format with the MSD. To filter out trivial terms such as stop-words, we applied TF-IDF [39].

• Combination. Finally, learning labels can be derived from combinations of the above categories. In our experiment, we used a combination of artist information and social tags, by making a bag of tags at the artist level as a learning label.

Not all songs in the MSD actually include learning labels from all the sources mentioned above. Clearly, it is another advantage of using MTL that one can use such unbalanced datasets in a single learning procedure, to maximize the coverage of the dataset. However, on the other hand, if one uses an unbalanced number of samples across different

¹ https://labrosa.ee.columbia.edu/millionsong/lastfm
² https://www.muziekweb.nl/


learning sources, it is not trivial to compare the effect of individual learning sources. We, therefore, choose to work with a subset of the dataset, in which equal numbers of samples across learning sources can be used. As a consequence, we managed to collect 46,490 clips of tracks with corresponding learning source labels. A 41,841/4,649 split was made for training and validation for all sources from both MSD and CDR. Since we mainly focus on transfer learning, we used the validation set mostly for monitoring the training, to keep the network from overfitting.

2.3 Latent factor preprocessing

Most learning sources are noisy. For instance, social tags include tags for personal playlist management, long sentences, or simply typos, which do not actually show relevant nuances in describing the music signal. The algorithmically extracted BPM information also is imperfect, and likely contains octave errors, in which BPM is under- or overestimated by a factor of 2. To deal with this noise, several previous works using the MSD [16, 26] applied a frequency-based filtering strategy along with top-down domain knowledge. However, this shrinks the available sample size. As an alternative way to handle noisiness, several other previous works [11, 17, 27, 40–42] apply latent factor extraction using various low-rank approximation models to preprocess the label information. We also choose to do this in our experiments.

A full overview of chosen learning sources, their category, origin dataset, dimensionality, and preprocessing strategies is shown in Table 1. In most cases, we apply probabilistic latent semantic analysis (pLSA), which extracts latent factors as a multinomial distribution of latent topics [43]. Table 2 illustrates several examples of strong social tags within extracted latent topics.
For situations in which learning labels are a scalar, non-binary value (BPM and release year), we applied a Gaussian mixture model (GMM) to transform each value into a categorical distribution over Gaussian components. In the case of the Self category, as it basically is a binary membership test, no factor extraction was needed. After preprocessing, learning source labels y_t are expressed in the form of probabilistic distributions z_t. Then, the learning of a deep representation can take place by minimizing the Kullback–Leibler (KL) divergence between model inferences f_t(X) and label factor distributions z_t.
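A sketch of this GMM step with scikit-learn (the BPM values below are randomly generated placeholders; k = 50 components, as stated at the end of this section):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# placeholder scalar labels, e.g., BPM values of the 46,490 training tracks
bpm = np.random.default_rng(0).uniform(60, 180, size=(46490, 1))

gmm = GaussianMixture(n_components=50, random_state=0).fit(bpm)
z_bpm = gmm.predict_proba(bpm)   # (46490, 50): categorical distribution z_t
assert np.allclose(z_bpm.sum(axis=1), 1.0)
```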
Along with the noise reduction, another benefit of such preprocessing is the regularization of the scale of the objective function between different tasks involved in the learning, when the resulting factors have the same size. This regularity between the objective functions is particularly helpful for comparing different tasks and datasets. For this purpose, we used a fixed single value k = 50 for the number of factors (pLSA) and the number of Gaussians (GMM). In the remainder of this paper, the datasets and tasks processed in the above manner will be denoted by learning sources for coherent presentation and usage of the terminology.

3 Representation network architectures

In this section, we present the detailed specification of the deep representation neural network architecture we exploited in this work. We will discuss the base architecture of the network, and further discuss the shared architecture with respect to different fusion strategies that one can take in the MTDTL context. Also, we introduce details on the preprocessing related to the input data served into the networks.

3.1 Base architecture

As the deep base architecture for feature representation learning, we choose a convolutional neural network (CNN) architecture inspired by [21], as described in Fig. 4 and Table 3.

CNN is one of the most popular architectures in many music-related machine learning tasks [16, 17, 20, 25, 44–55]. Many of these works adopt an architecture having cascading blocks of 2-dimensional filters and max-pooling, derived from well-known works in image recognition [21, 56]. Although variants of CNN using 1-dimensional filters also were suggested by [12, 57–59] to learn features directly from a raw audio signal in an end-to-end manner, not many works managed to use them successfully on music classification tasks [60].

The main difference between the base architecture and [21] is the use of global average pooling (GAP) and Batch Normalization (BN) layers. BN is applied to accelerate the training and stabilize the internal covariate shift for every convolution layer and the fc-feature layer [61]. Also, global spatial pooling is adopted as the last pooling layer of the cascading convolution blocks, which is known to effectively summarize the spatial dimensions both in the image [22] and music domain [20]. We also applied this approach to ensure that the fc-feature layer does not have a huge number of parameters.

We applied the rectified linear unit (ReLU) [62] to all convolution layers and the fc-feature layer. For the fc-output layer, softmax activation is used. For each convolution layer, we applied zero-padding such that the input and the output have the same spatial shape. As for the regularization, we choose to apply dropout [63] on the fc-feature layer. We added L2 regularization across all the parameters with the same weight λ = 10⁻⁶.


Table 1 Properties of learning sources

Identifier | Category                  | Data                | Dimensionality | Preprocessing
self       | Algorithm, Self           | MSD—Track           | 1              | —
bpm        | Algorithm, Feature        | MSD—BPM             | 1              | GMM
year       | Annotation, Metadata      | MSD—Year            | 1              | GMM
tag        | Annotation, Crowd         | MSD—Tag             | 174,156        | pLSA
taste      | Annotation, Crowd         | MSD—Taste           | 949,813        | pLSA
cdr_tag    | Annotation, Professional  | CDR—Tag             | 367            | pLSA
lyrics     | Annotation, Professional  | MSD—Lyrics          | 5,000          | pLSA, TF-IDF
artist     | Annotation, Combination   | MSD—Artist and Tag  | 522,366        | pLSA

Table 2 Examples of latent topics extracted with pLSA from MSD social tags

Topic | Strongest social tags
tag1  | indie rock, indie, british, Scottish
tag2  | pop, pop rock, dance, male vocalists
tag3  | soul, rnb, funk, Neo-Soul
tag4  | Melodic Death Metal, black metal, doom metal, Gothic Metal
tag5  | fun, catchy, happy, Favorite

Fig. 4 (diagram omitted) Default CNN architecture for supervised single-source representation learning. Details of the representation network are presented at the left of the global architecture diagram. The numbers inside the parentheses indicate either the number of filters or the number of units with respect to the type of layer

3.1.1 Audio preprocessing

We aim to learn a music representation from as-raw-as-possible input data to fully leverage the capability of the neural network. For this purpose, we use the dB-scale mel-scale magnitude spectrum of an input audio fragment, extracted by applying 128-band mel-filter banks on the short-time Fourier transform (STFT). Mel-spectrograms have generally been a popular input representation choice for CNNs applied in music-related tasks [16, 17, 20, 26, 41, 64]; besides, it also was reported recently that their frequency-domain summarization, based on psycho-acoustics, is efficient and not easily learnable through data-driven approaches [65, 66]. We choose a 1024-sample window size and 256-sample hop size, translating to about 46 ms and 11.6 ms, respectively, for a sampling rate of 22 kHz. We also applied standardization to each frequency band of the mel spectrum, making use of the mean and variance of all individual mel spectra in the training set.
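With librosa (which Sect. 3.4 reports was used for audio processing), this preprocessing can be sketched as follows; the file name is a placeholder, and the per-band statistics would in practice come from the whole training set rather than a single clip:

```python
import librosa
import numpy as np

y, sr = librosa.load('preview.mp3', sr=22050, mono=False)  # keep stereo: (2, n)

def log_mel(channel):
    m = librosa.feature.melspectrogram(
        y=channel, sr=sr, n_fft=1024, hop_length=256, n_mels=128)
    return librosa.power_to_db(m)        # dB-scale mel magnitude spectrum

x = np.stack([log_mel(ch) for ch in y])                # (2, 128, n_frames)
mean = x.mean(axis=(0, 2), keepdims=True)              # per-mel-band statistics;
std = x.std(axis=(0, 2), keepdims=True)                # training-set-wide in practice
x = (x - mean) / std
```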
3.1.2 Sampling

During the learning process, in each iteration, a random batch of songs is selected. Audio corresponding to these songs originally is 30 s in length; for computational efficiency, we randomly crop 2.5 s out of each song each time. Keeping stereo channels of the audio, the size of a single input tensor X* we used for the experiment ended up being 2 × 216 × 128, where the first dimension indicates the number of channels, and the following dimensions mean time steps and mel-bins, respectively. Along with the computational efficiency, a number of previous works in the MIR field reported that using a small chunk of the input not only inflates the dataset but also shows good performance on high-level tasks such as music auto-tagging [20, 57, 60]. For the self case, we generate batches with equal numbers of songs for both membership categories in y_self.

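A sketch of this sub-sampling step: with the 256-sample hop size above, 216 frames correspond to roughly 2.5 s, and a transpose yields the (channel, time, mel) layout of Table 3 (the function name is ours):

```python
import numpy as np

def random_chunk(x, n_frames=216, rng=np.random.default_rng()):
    """Crop a random ~2.5-s chunk from a (2, 128, total_frames) log-mel
    spectrogram and return it as a (2, 216, 128) tensor (channel, time, mel)."""
    start = rng.integers(0, x.shape[-1] - n_frames + 1)
    chunk = x[:, :, start:start + n_frames]      # (2, 128, 216)
    return chunk.transpose(0, 2, 1)              # (2, 216, 128)
```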

Table 3 Configuration of the base CNN

Layer      | Input shape    | Weight shape             | Sub-sampling | Activation
conv1      | 2 × 216 × 128  | 2 × 16 × 5 × 5           | 2 × 1        | ReLU
max-pool1  | 16 × 108 × 128 |                          | 2 × 2        |
conv2      | 16 × 54 × 64   | 16 × 32 × 3 × 3          |              | ReLU
max-pool2  | 32 × 54 × 64   |                          | 2 × 2        |
conv3      | 32 × 27 × 32   | 32 × 64 × 3 × 3          |              | ReLU
max-pool3  | 64 × 27 × 32   |                          | 2 × 2        |
conv4      | 64 × 13 × 16   | 64 × 64 × 3 × 3          |              | ReLU
max-pool4  | 64 × 13 × 16   |                          | 2 × 2        |
conv5      | 64 × 6 × 8     | 64 × 128 × 3 × 3         |              | ReLU
max-pool5  | 128 × 6 × 8    |                          | 2 × 2        |
conv6_1    | 128 × 3 × 4    | 128 × 256 × 3 × 3        |              | ReLU
conv6_2    | 256 × 3 × 4    | 256 × 256 × 1 × 1        |              | ReLU
gap        | 256            |                          |              |
fc-feature | 256            | 256 × 256                |              | ReLU
dropout    | 256            |                          |              |
fc-output  | 256            | learning source specific |              | Softmax

conv and max-pool indicate a 2-dimensional convolution and max-pooling layer, respectively. We set the stride size to 2 on the time dimension of conv1, to compress dimensionality at the early stage. Otherwise, all strides are set to 1 across all the convolution layers. gap corresponds to the global average pooling used in [22], which averages out all the spatial dimensions of the filter responses. fc is an abbreviation of a fully connected layer. We use dropout with p = 0.5 only for the fc-feature layer, where the intermediate latent representation is extracted and evaluated. For simplicity, we omit the batch-size dimension of the input shapes

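Read as PyTorch code, Table 3 translates roughly into the module below. This is a sketch under our reading of the table and the surrounding text (BN after every convolution and the fc-feature layer, zero-padding for equal spatial shapes, a 2 × 1 stride on conv1); it is not the authors' released implementation:

```python
import torch
import torch.nn as nn

def block(c_in, c_out, k=3, stride=1, pool=True):
    """One conv block: zero-padded conv + BN + ReLU (+ optional 2x2 max-pool)."""
    layers = [nn.Conv2d(c_in, c_out, k, stride=stride, padding=k // 2),
              nn.BatchNorm2d(c_out), nn.ReLU()]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return layers

class BaseCNN(nn.Module):
    """Sketch of the Table 3 network. Input: (batch, 2, 216, 128)."""
    def __init__(self, n_out=50):                # n_out: k=50 factors per source
        super().__init__()
        self.features = nn.Sequential(
            *block(2, 16, k=5, stride=(2, 1)),   # conv1: stride 2 on time axis
            *block(16, 32), *block(32, 64),      # conv2, conv3
            *block(64, 64), *block(64, 128),     # conv4, conv5
            *block(128, 256, pool=False),        # conv6_1
            *block(256, 256, k=1, pool=False),   # conv6_2 (1x1)
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # gap -> 256-d
        self.fc_feature = nn.Sequential(
            nn.Linear(256, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.5))
        self.fc_output = nn.Linear(256, n_out)   # softmax is applied in the loss

    def forward(self, x):
        h = self.fc_feature(self.features(x))    # the 256-d representation
        return self.fc_output(h)

assert BaseCNN()(torch.zeros(8, 2, 216, 128)).shape == (8, 50)
```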
3.2 Multi-source architectures with various degrees of shared information

When learning a music representation based on various available learning sources, different strategies can be taken regarding the choice of architecture. We will investigate the following setups:

• As a base case, a Single-Source Representation (SS-R) can be learned for a single source only. As mentioned earlier, this would be the typical strategy leading to pre-trained networks that later would be used in transfer learning. In our case, our base architecture from Sect. 3.1 and Fig. 4 will be used, for which the layers in the representation network also are illustrated in Fig. 5a. Out of the fc-feature layer, a d-dimensional representation is obtained.

• If multiple perspectives on the same content, as reflected by the multiple learning labels, should also be reflected in the learned representation, one can learn SS-R representations for each learning source and simply concatenate them afterward. With d dimensions per source and m sources, this leads to a d × m Multiple Single-Source Concatenated Representation (MSS-CR). In this case, independent networks are trained for each of the sources, and no shared knowledge will be transferred between sources. A layer setup of the corresponding representation network is illustrated in Fig. 5b.

• When applying MTL learning strategies, the deep architecture should involve shared knowledge layers, before branching out to various individual learning sources, whose learned representations will be concatenated in the final d × m-dimensional representation. We call these Multi-Source Concatenated Representations (MS-CR). As the branching point can be chosen at different stages, we will investigate the effect of various prototypical branching point choices: at the second convolution layer (MS-CR@2, Fig. 5c), the fourth convolution layer (MS-CR@4, Fig. 5d), and the sixth convolution layer (MS-CR@6, Fig. 5e). The later the branching point occurs, the more shared knowledge the network will employ.

• In the most extreme case, branching would only occur at the very last fully connected layer, and a Multi-Source Shared Representation (MS-SR) (or, more specifically, MS-SR@FC) is learned, as illustrated in Fig. 5f. As the representation is obtained from the fc-feature layer, no concatenation takes place here, and a d-dimensional representation is obtained.

A summary of these different representation learning architectures is given in Table 4.
123
1076 Neural Computing and Applications (2020) 32:1067–1093

Fig. 5 (diagrams omitted) The various model architectures considered in the current work. Beyond single-source architectures, multi-source architectures with various degrees of shared information are studied. For simplification, multi-source cases are illustrated here for two sources. The fc-feature layer from which representations will be extracted is the FC(256) layer in the illustrations (see Table 3). (a) SS-R: base setup. (b) MSS-CR: concatenation of multiple independent SS-R networks. (c) MS-CR@2: network branches to source-specific layers from the 2nd convolution layer. (d) MS-CR@4: network branches to source-specific layers from the 4th convolution layer. (e) MS-CR@6: network branches to source-specific layers from the 6th convolution layer. (f) MS-SR@FC: heavily shared network, source-specific branching only at the final FC layer.

Beyond the strategies we chose, further approaches can be thought of to connect representations learned for different learning sources in neural network architectures. For example, for different tasks, representations can be extracted from different intermediate hidden layers, benefiting from the hierarchical feature encoding capability of the deep network [26]. However, considering that learned representations are usually taken from a specific fixed layer of the shared architecture, we focus on the strategies as outlined above.

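One way to realize the MS-CR family in PyTorch is sketched below, reusing the hypothetical block() helper from the BaseCNN sketch after Table 3; branch_at counts how many convolution blocks are shared before the per-source towers begin (2, 4, or 6 for MS-CR@2/@4/@6), and the final representation would concatenate the towers' fc-feature outputs into d · m dimensions:

```python
import torch.nn as nn

def conv_blocks():
    """The seven convolution blocks of the Table 3 sketch, in order
    (block() as defined in the BaseCNN sketch above)."""
    return [nn.Sequential(*block(2, 16, k=5, stride=(2, 1))),
            nn.Sequential(*block(16, 32)), nn.Sequential(*block(32, 64)),
            nn.Sequential(*block(64, 64)), nn.Sequential(*block(64, 128)),
            nn.Sequential(*block(128, 256, pool=False)),
            nn.Sequential(*block(256, 256, k=1, pool=False))]

def ms_cr(n_sources, branch_at):
    """Shared stem (theta^s) plus freshly initialized per-source towers
    (theta^t); each tower would end in its own GAP and fc-feature layer."""
    shared = nn.Sequential(*conv_blocks()[:branch_at])
    towers = nn.ModuleList(nn.Sequential(*conv_blocks()[branch_at:])
                           for _ in range(n_sources))
    return shared, towers
```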

Table 4 Properties of the various categories of representation learning architectures

Architecture | Multi-source | Shared network | Concatenation | Dimensionality
SS-R         | No           | No             | No            | d
MSS-CR       | Yes          | No             | Yes           | d × m
MS-CR        | Yes          | Partial        | Yes           | d × m
MS-SR        | Yes          | Yes            | No            | d

3.3 MTL training procedure

Algorithm 1: Training a Multi-Source CNN
1  Initialize Θ = {θ^t, θ^s} randomly;
2  for epoch in 1...N do
3      for iteration in 1...L do
4          Pick a learning source t randomly;
5          Pick a batch of samples from learning source t: (X_l, X_r) for self; X otherwise;
6          Derive learning label z_t;
7          Sub-sample chunk X* from track X;
8          Forward-pass: L(y_self, Θ, X*_l, X*_r) (Eq. 5) for self; L(z_t, Θ, X*) (Eq. 2) otherwise;
9          Backward-pass: ∇(Θ);
10         Update model: Θ ← Θ − ∇(Θ);

Similar to [4, 11], we choose to train the MTL models with a stochastic update scheme as described in Algorithm 1. At every iteration, a learning source is selected randomly. After the learning source is chosen, a batch of observation-label pairs (X, z_t) is drawn. For the audio previews belonging to the songs within this batch, an input representation X* is cropped randomly from its super-sample X. The updates of the parameters Θ are conducted through back-propagation using the Adam algorithm [67]. For each neural network we train, we set L = l · m, where l is the number of iterations needed to visit all the training samples with fixed batch size b = 128, and m is the number of learning sources used in the training. Across the training, we used a fixed learning rate ε = 0.00025. After a fixed number of epochs N is reached, we stop the training.
which reflect three conventional types of MIR tasks,
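Algorithm 1 can be sketched in PyTorch as below. The network and KL loss follow the earlier sketches; the data server, batch contents, and iteration counts are illustrative placeholders (with b = 128, l ≈ 41,841/128 ≈ 327 for our training split):

```python
import random
import torch

m, N, l = 8, 200, 327                       # sources, epochs, iterations (illustrative)
net = MultiTaskNet(n_tasks=m)               # shared trunk + heads (Sect. 2.1 sketch)
opt = torch.optim.Adam(net.parameters(), lr=0.00025)
kl = torch.nn.KLDivLoss(reduction='batchmean')

def next_batch(t):
    """Placeholder for steps 5-7: batch sampling and chunk cropping."""
    x = torch.randn(128, 128)               # b=128 inputs (dummy features)
    z_t = torch.softmax(torch.randn(128, 50), -1)
    return x, z_t

for epoch in range(N):
    for _ in range(l * m):                  # L = l * m iterations per epoch
        t = random.randrange(m)             # step 4: pick a source at random
        x, z_t = next_batch(t)
        loss = kl(torch.log_softmax(net(x, t), -1), z_t)   # step 8
        opt.zero_grad()
        loss.backward()                     # step 9: backward pass
        opt.step()                          # step 10: parameter update
```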
3.4 Implementation details

We used PyTorch [68] to implement the CNN models and parallel data serving. For the evaluation of models and cross-validation, we made extensive use of functionality in Scikit-Learn [69]. Furthermore, Librosa [70] was used to process audio files and extract raw features including mel-spectrograms. The training is conducted with 8 Graphical Processing Unit (GPU) computation nodes, composed of 2 NVIDIA GRID K2 GPUs and 6 NVIDIA GTX 1080Ti GPUs.

4 Evaluation

So far, we discussed the details regarding the learning phase of this work, which corresponds to the upper row of Fig. 6. This included various choices of sources for the representation learning, and various choices of architecture and fusion strategies. In this section, we present the evaluation methodology we followed, as illustrated in the second row of Fig. 6. First, we will discuss the chosen target tasks and datasets in Sect. 4.1, followed in Sect. 4.2 by the baselines against which our representations will be compared. Section 4.3 explains our experimental design, and finally, we discuss the implementation of our evaluation experiments in Sect. 4.4.

4.1 Target datasets

In order to gain insight into the effectiveness of learned representations with respect to multiple potential future tasks, we consider a range of target datasets. In this work, our target datasets are chosen to reflect various semantic properties of music, purposefully chosen semantic biases, or popularity in the MIR literature. Furthermore, the representation network should not be configured or learned to explicitly solve the chosen target datasets.

While for the learning sources, we could provide categorizations on where and how the learning labels were derived, and also consider algorithmic outcomes as labels, the existing popular research datasets mostly fall into the Professional or Crowd categories. In our work, we choose 7 evaluation datasets commonly used in MIR research, which reflect three conventional types of MIR tasks, namely classification, regression, and recommendation:

• Classification. Different types of classification tasks exist in MIR. In our experiments, we consider several datasets used for genre classification and instrument classification.

For genre classification, we chose the GTZAN [72] and FMA [71] datasets as main exemplars. Even though GTZAN is known for its caveats [79], we deliberately used it, because its popularity can be beneficial when compared with previous and future work.


Fig. 6 (diagram omitted) Overall system framework. The first row of the figure illustrates the learning scheme, where the representation learning is happening by minimizing the KL divergence between the network inference f_t(X) and the preprocessed learning label z_t. The preprocessing is conducted by the blue blocks, which transform the original noisy labels y_t to z_t, reducing noise and summarizing the high-dimensional label space into a smaller latent space. The second row describes the entire evaluation scenario. The representation is first extracted from the representation network, which is transferred from the upper row. The sequence of representation vectors is aggregated as the concatenation of their means and standard deviations. The purple block indicates a machine learning model employed to evaluate the representation's effectiveness (color figure online)

We note though that there may be some overlap between the tracks of GTZAN and the subset of the MSD we use in our experiments; the extent of this overlap is unknown, due to the lack of a confirmed and exhaustive track listing of the GTZAN dataset. We choose to use a fault-filtered data split for the training and evaluation, which is suggested in [73]. The split originally includes a training, validation and evaluation split; in our case, we also included the validation split as training data.

Among the various packages provided by the FMA, we chose the top-genre classification task of FMA-Medium [71]. This is a classification dataset with an unbalanced genre distribution. We used the data split provided by the dataset for our experiment, where the training and validation sets are combined as the training set.

Considering another type of genre classification, we selected the Extended Ballroom dataset [74, 75]. Because the classes in this dataset are highly separable with regard to their BPM [80], we specifically included this 'purposefully biased' dataset as an example of how a learned representation may effectively capture temporal dynamics properties present in a target dataset, as long as learning sources also reflected these properties. Since no pre-defined split is provided or suggested in other literature, we used stratified random sampling based on the genre label.

The last dataset we considered for classification is the training set of the IRMAS dataset [76], which consists of short music clips annotated with the predominant instruments present in the clip. Compared to the genre classification task, instrument classification is generally considered as less subjective, requiring features to separate timbral characteristics of the music signal as opposed to high-level semantics like genre. We split the dataset to make sure that observations from the same music track are not split into training and test sets.

As a performance metric for all these classification tasks, we used classification accuracy.

• Regression. As exemplars of regression tasks, we evaluate our proposed deep representations on the dataset used in the MediaEval Music Emotion prediction task [77]. It contains frame-level and song-level labels of a two-dimensional representation of emotion, with valence and arousal as dimensions [81]. Valence is related to the positivity or negativity of the emotion, and arousal is related to its intensity [77]. The song-level annotation of the V-A coordinates was used as the learning label. In a similar fashion to the approach taken in [26], we trained separate models for the two emotional dimensions. As for the dataset split, we used the split provided by the dataset, which is a random split stratified by the genre distribution.


As an evaluation metric, we measured the coefficient of determination R² of each model.

• Recommendation. Finally, we employed the 'Last.fm - 1K users' dataset [78] to evaluate our representations in the context of a content-aware music recommendation task (which will be denoted as Lastfm in the remainder of the paper). This dataset contains 19 million records of listening events across 961,416 unique tracks collected from 992 unique users. In our experiments, we mimicked a cold-start recommendation problem, in which items not seen before should be recommended to the right users. For efficiency, we filtered out users who listened to less than 5 tracks and tracks known to less than 5 users.

As for the audio content of each track, we obtained the mapping between the MusicBrainz Identifier (MBID) and the Spotify identifier (SpotifyID) using the MusicBrainz API.³ After cross-matching, we collected 30-s previews of all tracks using the Spotify API.⁴ We found that there is a substantial amount of missing mapping information between the SpotifyID and MBID in the MusicBrainz database, where only approximately 30% of mappings are available. Also, because of the substantial amount of inactive users and unpopular tracks in the dataset, we ultimately acquired a dataset of 985 unique users and 27,093 unique tracks with audio content.

Similar to [28], we considered the outer matrix performance for un-introduced songs; in other words, the model's recommendation accuracy on the items newly introduced to the system [28]. This was done by holding out certain tracks when learning user models and then predicting user preference scores based on all tracks, including those that were held out, resulting in a ranked track list per user. As an evaluation metric, we consider Normalized Discounted Cumulative Gain (nDCG@500), only treating held-out tracks that were indeed liked by a user as relevant items. Further details on how hold-out tracks were chosen are given in Sect. 4.4.

A summary of all evaluation datasets, their origins, and properties can be found in Table 5.

4.2 Baselines

We examined three baselines to compare with our proposed representations (a sketch of the first baseline's computation follows at the end of this list):

• Mel-Frequency Cepstral Coefficients (MFCC). These are some of the most popular audio representations in MIR research. In this work, we extract and aggregate MFCC following the strategy in [26]. In particular, we extracted 20 coefficients and also used their first- and second-order derivatives. After obtaining the sequence of MFCCs and its derivatives, we performed aggregation by taking the average and standard deviation over the time dimension, resulting in a 120-dimensional vector representation.

• Random Network Feature (Rand). We extracted the representation at the fc-feature layer without any representation network training. With random initialization, this representation, therefore, gives a random baseline for a given CNN architecture. We refer to this baseline as Rand.

• Latent Representation from Music Auto-Tagger (Choi). The work in [26] focused on a music auto-tagging task and can be considered as yielding a state-of-the-art deep music representation for MIR. While the model's focus on learning a representation for music auto-tagging can be considered as our SS-R case, there are a number of issues that complicate direct comparisons between this work and ours. First, the network in [26] is trained with about 4 times more data samples than in our experiments. Second, it employed a much smaller network than our architecture. Further, intermediate representations were extracted, which is out of the scope of our work, as we only consider representations at the fc-feature layer. Nevertheless, despite these caveats, the work still is very much in line with ours, making it a clear candidate for comparison. Throughout the evaluation, we could not fully reproduce the performance reported in the original paper [26]. When reporting our results, we, therefore, will report the performance we obtained with the published model, referring to this as Choi.

³ https://musicbrainz.org/
⁴ https://developer.spotify.com/documentation/web-api/
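The MFCC baseline's extraction and aggregation can be sketched with librosa as follows (the file name is a placeholder):

```python
import librosa
import numpy as np

y, sr = librosa.load('clip.mp3', sr=22050)                 # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)         # (20, n_frames)
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),            # 1st derivative
                   librosa.feature.delta(mfcc, order=2)])  # 2nd derivative
vec = np.concatenate([feats.mean(axis=1), feats.std(axis=1)])  # 120-dim
```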
4.3 Experimental design

In order to investigate our research questions, we carried out an experiment to study the effect of the number and type of learning sources on the effectiveness of deep representations, as well as the effect of the various architectural learning strategies described in Sect. 3.2. For the experimental design, we consider the following factors:

• Representation strategy, with 6 levels: SS-R, MS-SR@FC, MS-CR@6, MS-CR@4, MS-CR@2, and MSS-CR.
• 8 2-level factors indicating the presence or not of each of the 8 learning sources: self, year, bpm, taste, tag, lyrics, cdr_tag, and artist.

Table 5 Properties of target datasets used in our experiments

Task           | Dataset                 | Label           | #Tracks          | #Classes | Split method
Classification | FMA [71]                | Genre           | 25,000           | 16       | Artist filtered [71]
Classification | GTZAN [72]              | Genre           | 1,000            | 10       | Artist filtered [73]
Classification | Ext. Ballroom [74, 75]  | Genre           | 3,390            | 13       | N/A
Classification | IRMAS [76]              | Instrument      | 6,705            | 11       | Song filtered
Regression     | Music Emotion [77]      | Arousal         | 744              | —        | Genre stratified [77]
Regression     | Music Emotion [77]      | Valence         | 744              | —        | Genre stratified [77]
Recommendation | Lastfm* [78]            | Listening count | 27,093 (961,416) | —        | N/A

*Because of time constraints, we sampled the Lastfm dataset as described in Sect. 4.1; the original size appears between parentheses. In cases where particular data splits are defined by an original author or follow-up study, we apply the same split, including the reference in which the split is introduced. Otherwise, we applied either a random split stratified by the label (Ballroom), or simple filtering based on reported faulty entries (IRMAS)

• Number of learning sources present in the learning However, rather than using a pre-specified optimal
process (1 to 8). Note that this is actually calculated as design with a fixed amount of runs [83], we decided to
the sum of the eight factors above. run sequentially for as long as time would permit us,
• Target dataset, with 7 levels: Ballroom, FMA, GTZAN, generating at each step a new experimental run on
IRMAS, Lastfm, Arousal, and Valence. demand in a way that would maximize desired
Given a learned representation, fitting dataset-specific properties of the design up to that point, such as
models is much more efficient than learning the represen- balance and orthogonality.6
tation, so we decided to evaluate each representation on all We did this with the greedy Algorithm 2. From the
7 target datasets. The experimental design is thus restricted set of still remaining runs A, a subset O is selected
to combinations of representation and learning sources, and such that the expected unbalance in the augmented
for each such combination we will produce 7 observations. design B [ fog is minimal. In this case, the unbalance
However, given the constraint of SS-R relying on a single of design is defined as the maximum unbalance found
learning source, that there is only one possible combination between the levels of any factor, except for those
for n = 8 sources, as well as the high unbalance in the already exhausted.7 From O, a second subset P is
number of sources,5 we proceeded in three phases: selected such that the expected aliasing in the aug-
mented design is minimal, here defined as the maxi-
1. We first trained the SS-R representations for each of the mum absolute aliasing between main effects.8 Finally,
8 sources and repeated 6 times each. This resulted in 48 a run p is selected at random from P, the corresponding
experimental runs. representation is learned, and the algorithm iterates
2. We then proceeded to train all five multi-source again after updating A and B.
strategies with all sources, that is, n ¼ 8. We repeated Following this on-demand methodology, we man-
this 5 times, leading to 25 additional experimental runs. aged to run another 352 experimental runs from all the
3. Finally, we ran all five multi-source strategies with 1230 possible.
n ¼ 2; . . .; 7. The full design matrix would contain 5
representations and 8 sources, for a total of 1230
possible runs. Such an experiment was unfortunately 6
An experimental design is orthogonal if the effects of any factor
infeasible to run exhaustively given available balance out across the effects of the other factors. In a non-orthogonal
resources, so we decided to follow a fractional design. design, effects may be aliased, meaning that the estimate of one effect
is partially biased with the effect of another, the extent of which
ranges from 0 (no aliasing) to 1 (full aliasing). Aliasing is sometimes
5 referred to as confounding. See sections 8.5 and 9.5 in [82] for details
For instance, from the 255 possible combinations of up to 8 sources,
on aliasing.
there are 70 combinations of n ¼ 4 sources, but 28 with n ¼ 2, or only 7
8 for n ¼ 7. Simple random sampling from the 255 possible For instance, let a design have 20 runs for SS-R, 16 for MS-SR@FC,
combinations would lead to a very unbalanced design, that is, a and 18 for all other representations. The unbalance in the represen-
highly non-uniform distribution of observation counts across the tation factor is thus 20  16 ¼ 4. The total unbalance of the design is
levels of the factor (n in this case). A balanced design is desired to defined as the maximum unbalance found across all factors.
8
prevent aliasing and maximize statistical power. See section 15.2 in See section 2.3.7 in [83] for details on how to compute an alias
[82] for details on unbalanced designs. matrix.

123
Neural Computing and Applications (2020) 32:1067–1093 1081

Algorithm 2: Sequential generation of experimental runs. learning will now be applied, to consider our representa-
1 Initialize A with all possible 1,230 runs to execute; tions in the context of these new target datasets.
2 Initialize B ← ∅ for the set of already executed runs;
3 while time allows do
As a consequence, new machine learning pipelines are
4 Select O ⊆ A s.t. ∀o ∈ O, the unbalance in B ∪ {o} is minimal; set up, focused on each of the target datasets. In all cases,
5 Select P ⊆ O s.t. ∀p ∈ P, the aliasing in B ∪ {p} is minimal;
6 Select p ∈ P at random; we applied the pre-defined split if it is feasible. Otherwise,
7 Update A ← A − {p}; we randomly split the dataset into an 80% training set and
8 Update B ← B ∪ {p};
9 Learn the representation coded by p; 20% test set. For every dataset, we repeated the training
and evaluation for 5 times, using different train/test splits.
After going through the three phases above, the final In most of our evaluation cases, validation will take place
experiment contained 48 þ 25 þ 352 ¼ 425 experimental on the test set; in case of the recommendation problem, the
runs, each producing a different deep music representation. test set represents a set of tracks to be held out from each
We further evaluated each representation on all 7 target user during model training, and re-inserted for validation.
datasets, leading to a grand total of 42  7 ¼ 2975 data In all cases, we will extract representations from evaluation
points. Figure 7 plots the alias matrix of the final experi- dataset audio as detailed in Sect. 4.4.1, and then learn
mental design, showing that the aliasing among main fac- relatively simple models based on them, as detailed in
tors is indeed minimal. The final experimental design Sect. 4.4.2. Employing the metrics as mentioned in the
matrix can be downloaded along with the rest of the sup- previous section, we will then take average performance
plemental material. scores over the 5 different train/test splits for final perfor-
Each considered representation network was trained mance reporting.
using the CNN representation network model from Sect. 3,
based on the specific combination of learning sources and 4.4.1 Feature extraction and preprocessing
deep architecture as indicated by the experimental run. In
order to reduce variance, we fixed the number of training Taking raw audio from the evaluation datasets as input, we
epochs to N ¼ 200 across all runs and applied the same take non-overlapping slices out of this audio with a fixed
base architecture, except for the branching point. This length of 2.5 s. Based on this, we apply the same prepro-
entire training procedure took approximately 5 weeks with cessing transformations as discussed in Sect. 3.1.1. Then,
given computational hardware resources introduced in we extract a deep representation from this preprocessed
Sect. 3.4. audio, employing the architecture as specified by the given
experimental run. As in the case of Sect. 3.2, representa-
4.4 Implementation details tions are extracted from the fc-feature layer of each
trained CNN model. Depending on the choice of archi-
In order to assess how our learned deep music represen- tecture, the final representation may consist of concatena-
tations perform on the various target datasets, transfer tions of representations obtained by separate representation
networks.
MS−SR@FC

Input audio may originally be (much) longer than 2.5 s;


MS−CR@6
MS−CR@4
MS−CR@2
MSS−CR

therefore, we aggregate information in feature vectors over


Valence
Arousal
GTZAN
cdr_tag

IRMAS
Lastfm
lyrics

artist
taste

FMA

multiple time slices by taking their mean and standard


bpm
year
self

tag

self
1 deviation values. As a result, we get representation with
year
0.9
averages per learned feature dimension and another rep-
bpm
resentation with standard deviations per feature dimension.
taste 0.8
tag These will be concatenated, as illustrated in Fig. 6.
lyrics 0.7
cdr_tag
artist 0.6
4.4.2 Target dataset-specific models
MS−SR@FC
MS−CR@6 0.5
MS−CR@4
As our goal is not to over-optimize dataset-specific per-
MS−CR@2 0.4 formance, but rather perform a comparative analysis
MSS−CR between different representations (resulting from different
FMA 0.3
GTZAN learning strategies), we keep the model simple and use
0.2
IRMAS fixed hyper-parameter values for each model across the
Lastfm
Arousal 0.1 entire experiment.
Valence To evaluate the trained representations, we used differ-
0
ent models according to the target dataset. For classifica-
Fig. 7 Aliasing among main effects in the final experimental design tion and regression tasks, we used the multilayer

123
1082 Neural Computing and Applications (2020) 32:1067–1093

perceptron (MLP) model [84]. More specifically, the MLP representation learning, in terms of their general perfor-
model has two hidden layers, whose dimensionality is 256. mance, reliability, and model compactness. In Sect. 5.3, we
As for the nonlinearity, we choose ReLU [62] for all nodes, discuss the effectiveness of different representations in
and the model is trained with ADAM optimization tech- MIR. Finally, we present some initial evidence for multi-
nique [67] for 200 iterations. In the evaluation, we used the faceted semantic explainability of the proposed MTDTL in
Scikit-Learn’s implementation for ease of distributed Sect. 5.5.9
computing on multiple CPU computation nodes.
For the recommendation task, we choose a similar 5.1 Single-source and multi-source
model as suggested in [28, 85], in which the learning representation
objective function L is defined as
V
^ V;
U; ^ ¼ arg min jjP  UV T jjC þ k jjV  XWjj
^ W Figure 8 presents the performance of SS-R representa-
2 ð7Þ tions on each of the 7 target datasets. We can see that all
kU kW
þ jjUjj þ jjWjj sources tend to outperform the Rand baseline on all data-
2 2 sets, except for a handful cases involving sources self and
where P 2 Rui is a binary matrix indicating whether there bpm. Looking at the top performing sources, we find that
is interaction between users u and items i, U 2 Rur and tag, cdr_tag, and artist perform better or on-par with the
V 2 Rir are r dimensional user factors and item factors for most sophisticated baseline, Choi, except for the IRMAS
the low-rank approximation of P. P is derived from the dataset. The other sources are found somewhere between
original interaction matrix R 2 Rui , which contains the these two baselines, except for datasets Lastfm and Arou-
number of interaction from users u to items i, as follows: sal, where they perform better than Choi as well. Finally,
 the MFCC is generally outperformed in all cases, with the
1; if Ru;i [ 0 notable exception of the IRMAS dataset, where only Choi
Pu;i ¼ ð8Þ
0 otherwise performs better.
Zooming in to dataset-specific observed trends, the bpm
W 2 Rdr is a free parameter for the projection from d-
learning source shows a highly skewed performance across
dimensional feature space to the factor space. X 2 Rid is
target datasets: it clearly outperforms all other learning
the feature matrix where each row corresponds to a track.
sources in the Ballroom dataset, but it achieves the worst or
Finally, jj  jjC is the Frobenious norm weighted by the
second-worst performance in the other datasets. As shown
confidence matrix C 2 Rui , which controls the credibility in [80], this confirms that the Ballroom dataset is well-
of the model on the given interaction data, given as separable based on BPM information alone. Indeed, rep-
follows: resentations trained on the bpm learning source seem to
C ¼ 1 þ aR ð9Þ contain a latent representation close to the BPM of an input
music signal. In contrast, we can see that the bpm repre-
where a controls credibility. As for hyper-parameters, we set
sentation achieves the worst results in the Arousal dataset,
a ¼ 0:1, kV ¼ 0:00001, kU ¼ 0:00001, and kW ¼ 0:1, where both temporal dynamics and BPM are considered as
respectively. For the number of factors we choose r ¼ 50 to important factors determining the intensity of emotion.
focus only on the relative impact of the representation over On the IRMAS dataset, we see that all the SS-Rs per-
the different conditions. We implemented an update rule form worse than the MFCC and Choi baselines. Given that
with the alternating least squares (ALS) algorithm similar to they both take into account low-level features, either by
[28], and updated parameters during 15 iterations. design or by exploiting low-level layers of the neural net-
work, this suggests that predominant instrument sounds are
harder to distinguish based solely on semantic features,
5 Results and discussion which is the case of the representations studied here.
Also, we find that there is small variability for each SS-R
In this section, we present results and discussion related to run within the training setup we applied. Specifically, in
the proposed deep music representations. In Sect. 5.1, we 50% of cases, we have within-SS-R variability less than
will first compare the performance across the SS-Rs, to 15% of the within-dataset variability. 90% of the cases are
show how different individual learning sources work for within 30% of the within-dataset variability.
each target dataset. Then, we will present general experi-
mental results related to the performance of the multi-
9
source representations. In Sect. 5.2, we discuss the effect of For the reproducibility, we release all relevant materials including
code, models and extracted features at https://github.com/eldrin/
the number of learning sources exploited in the MTLMusicRepresentation-PyTorch.

123
Neural Computing and Applications (2020) 32:1067–1093 1083

Ballroom FMA GTZAN IRMAS

0.62

0.80

0.60
0.9

0.55
0.60

0.75
● ●

0.50
0.8

● ●

0.58

0.70
Accuracy

Accuracy

Accuracy
● ●
● ●
Accuracy

0.45
0.56

0.65

0.7

0.40
● ●
● ● ●

0.54

0.60
● ● ●

0.35
● ●
0.6

● ● ●

0.52

0.55

0.30




0.50

0.50

0.25
0.5
self

year

bpm

taste

tag

lyrics

cdr_tag

artist

self

year

bpm

taste

tag

lyrics

cdr_tag

artist

self

year

bpm

taste

tag

lyrics

cdr_tag

artist

self

year

bpm

taste

tag

lyrics

cdr_tag

artist
Lastfm Arousal Valence
Choi
0.7

0.5
0.06

MFCC
● Rand
0.6

0.4
● ● ●


0.05



0.5
nDCG

0.3
● ●
2

R2
● ●
0.4


R


0.04

● ●

0.2

0.3



0.1

0.03

0.2



0.1

0.0
self

year

bpm

taste

tag

lyrics

cdr_tag

artist

self

year

bpm

taste

tag

lyrics

cdr_tag

artist

self

year

bpm

taste

tag

lyrics

cdr_tag

artist
Fig. 8 Performance of single-source representations. Each point indicates the performance of a representation learned from a single source. Solid
points indicate the average performance per source. The baselines are illustrated as horizontal lines

We now consider how the various representations based makes this statement still rather unclear. In order to gain
on multiple learning sources perform, in comparison to those a better insight of the effects of the dataset, architecture
based on single learning sources. The boxplots in Fig. 9 strategies and number and type of learning sources, we
show the distributions of performance scores for each further analyzed the results using a hierarchical or multi-
architectural strategy and per target dataset. For comparison, level linear model on all observed scores [86]. The
the gray boxes summarize the distributions depicted in advantage of such a model is essentially that it accounts for
Fig. 8, based on the SS-R strategy. In general, we can see that the structure in our experiment, where observations nested
these SS-R obtain the lowest scores, followed by MS- within datasets are not independent.
SR@FC, except for the IRMAS dataset. Given that these By Fig. 9, we can anticipate a very large dataset effect
representations have the same dimensionality, these results because of the inherently different levels of difficulty, as
suggest that adding a single-source-specific layer on top of a well as a high level of heteroskedasticity. We, therefore,
heavily shared model may help to improve the adaptability of analyzed standardized performance scores rather than raw
the neural network models, especially when there is no prior scores. In particular, the i-th performance score yi is stan-
knowledge regarding the well-matching learning sources for dardized with the within-dataset mean and standard devi-
the target datasets. The MS-CR and MSS-CR representations ation scores, that is, yi ¼ ðyi  yd½i Þ=sd½i , where
obtain the best results in general, which is somewhat d[i] denotes the dataset of the i-th observation. This way,
expected because of their larger dimensionality. the dataset effect is effectively 0 and the variance is
homogeneous. In addition, this will allow us to compare
5.2 Effect of number of learning sources the relative differences across strategies and number of
and fusion strategy sources using the same scale in all datasets.
We also transformed the variable n that refers to the
number of sources to n , which is set to n ¼ 0 for SS-Rs
While the plots in Fig. 9 suggest that MSS-CR and MS- and to n ¼ n  2 for the other strategies. This way, the
CR are the best strategies, the high observed variability intercepts of the linear model will represent the average

123
1084 Neural Computing and Applications (2020) 32:1067–1093

Ballroom FMA GTZAN IRMAS

0.80
0.62
0.9

0.55
0.60

0.75
0.8

Accuracy
0.58

0.70
Accuracy
Accuracy

Accuracy
0.45
0.56

0.65
0.7

0.54

0.60

0.35
0.6

0.52

0.55
0.50

0.50

0.25
0.5

SS−R

MS−SR@FC

MS−CR@6

MS−CR@4

MS−CR@2

MSS−CR

SS−R

MS−SR@FC

MS−CR@6

MS−CR@4

MS−CR@2

MSS−CR

SS−R

MS−SR@FC

MS−CR@6

MS−CR@4

MS−CR@2

MSS−CR

SS−R

MS−SR@FC

MS−CR@6

MS−CR@4

MS−CR@2

MSS−CR
Lastfm Arousal Valence
0.7

0.5
Choi
0.06

MFCC
0.6

Rand

0.4
0.05

0.5
nDCG

0.3
R2

R2
0.4
0.04

0.2
0.3

0.1
0.03

0.2
0.1

0.0
SS−R

MS−SR@FC

MS−CR@6

MS−CR@4

MS−CR@2

MSS−CR

SS−R

MS−SR@FC

MS−CR@6

MS−CR@4

MS−CR@2

MSS−CR

SS−R

MS−SR@FC

MS−CR@6

MS−CR@4

MS−CR@2

MSS−CR
Fig. 9 Performance by representation strategy. Solid points represent the mean per representation. The baselines are illustrated as horizontal lines

performance of each representation strategy in its simplest Figure 11 shows the estimated effects and bootstrap
case, that is, SS-R (n ¼ 1) or non-SS-R with n ¼ 2. We 95% confidence intervals. The left plot confirms the
fitted a first analysis model as follows: observations in Fig. 9. In particular, they confirm that SS-R
performs significantly worse than MS-SR@FC, which is
yi ¼ b0r½id½i þ b1r½id½i  ni þ ei ei  Nð0; r2e Þ ð10Þ
similarly statistically worse than the others. When carrying
b0rd ¼ b0r þ u0rd u0rd  Nð0; r20r Þ ð11Þ out pairwise comparisons, MSS-CR outperforms all other
strategies except MS-CR@2 (p ¼ 0:32), which outperforms
b1rd ¼ b1r þ u1rd u1rd  Nð0; r21r Þ; ð12Þ
all others except MS-CR@6 (p ¼ 0:09). The right plot
where b0r½id½i is the intercept of the corresponding repre- confirms the qualitative observation from Fig. 10 by
sentation strategy within the corresponding dataset. Each of showing a significantly positive effect of the number of
these coefficients is defined as the sum of a global fixed sources except for MS-SR@FC, where it is not statistically
effect b0r of the representation, and a random effect u0rd different from 0. The intervals suggest a very similar effect
which allows for random within-dataset variation.10 This in the best representations, with average increments of
way, we separate the effects of interest (i.e., each b0r ) from about 0.16 per additional source—recall that scores are
the dataset-specific variations (i.e., each u0rd ). The effect of standardized.
the number of sources is similarly defined as the sum of a To gain better insight into differences across represen-
fixed representation-specific coefficient b1r and a random tation strategies, we used a second hierarchical model
dataset-specific coefficient u1rd . Because the slope depends where the representation strategy was modeled as an
on the representation, we are thus implicitly modeling the ordinal variable r  instead of the nominal variable r used in
interaction between strategy and number of sources, which the first model. In particular, r  represents the size of the
can be appreciated in Fig. 10, especially with MS-SR@FC. network, so we coded SS-R as 0, MS-SR@FC as 0.2, MS-
CR@6 as 0.4, MS-CR@4 as 0.6, MS-CR@2 as 0.8, and
10
We note that hierarchical models do not fit each of the individual MSS-CR as 1 (see Fig. 5). In detail, this second model is as
u0rd coefficients (a total of 42 in this model), but the amount of follows:
variability they produce, that is, r20r (6 in total).

123
Neural Computing and Applications (2020) 32:1067–1093 1085

Ballroom FMA GTZAN IRMAS


2

2
1

1
Accuracy

Accuracy

Accuracy

Accuracy
0

0
−1

−1

−1

−1
−2

−2

−2

−2
−3

−3

−3

−3
−4

−4

−4

−4
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Number of learning sources Number of learning sources Number of learning sources Number of learning sources

Lastfm Arousal Valence


2

2
SS−R
MS−SR@FC
1

1
MS−CR@6
MS−CR@4
MS−CR@2
0

0
nDCG

MSS−CR
R2

R2
−1

−1

−1
−2

−2

−2
−3

−3

−3
−4

−4

−4

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Number of learning sources Number of learning sources Number of learning sources

Fig. 10 (Standardized) performance by the number of learning sources. Solid points represent the mean per architecture and number of sources.
The black horizontal line marks the mean performance of the SS-R representations. The colored lines show linear fits (color figure online)

Fig. 11 Fixed effects and Fixed intercepts: β0r Fixed slopes: β1r
bootstrap 95% confidence
MSS−CR MSS−CR
intervals estimated for the first
analysis model. The left plot MS−CR@2 MS−CR@2
depicts the effects of the
representation strategy (b0r MS−CR@4 MS−CR@4
intercepts), and the right plot
shows the effects of the number MS−CR@6 MS−CR@6
of sources (b1r slopes)
MS−SR@FC MS−SR@FC

SS−R ● SS−R
−1.0 −0.5 0.0 −0.05 0.00 0.05 0.10 0.15 0.20

Effect Effect

yi ¼ b0 þ b1d½i  ri þ b2d½i  ni þ b3d½i  ri  ni þ ei effect u1d . Likewise, this model includes the main effect of
ð13Þ the number of sources (fixed effect b20 ), as well as its
ei  Nð0; r2e Þ
interaction with the network size (fixed effect b30 ). Fig-
b1d ¼ b10 þ u1d u1d  Nð0; r21 Þ ð14Þ ure 12 shows the fitted coefficients, confirming the statis-
tically positive effect of the size of the networks and, to a
b2d ¼ b20 þ u2d u2d  Nð0; r22 Þ ð15Þ
smaller degree but still significant, of the number of
b3d ¼ b30 þ u3d u3d  Nð0; r23 Þ: ð16Þ sources. The interaction term is not statistically significant,
probably because of the unclear benefit of the number of
In contrast to the first model, there is no representation-
sources in MS-SR@FC.
specific fixed intercept but an overall intercept b0 . The
Overall, these analyses confirm that all multi-source
effect of the network size is similarly modeled as the sum
strategies outperform the single-source representations,
of an overall fixed slope b10 and a random dataset-specific

123
1086 Neural Computing and Applications (2020) 32:1067–1093

Fixed effects sources available, one can expect less variability across
β30 (r*n*) instantiations of the network. Most importantly, variability
obtained for a single learning source (n ¼ 1) is always
β20 (n*) larger than the variability with 2 or more sources. The
Ballroom dataset shows much smaller variability when
BPM is included in the combination. For this specific
β10 (r*)
dataset, this indicates that once bpm is used to learn the
representation, the expected performance is stable and does
β0 (intercept) ●
not vary much, even if we keep including more sources.
−1.0 −0.5 0.0 0.5 1.0 1.5
Section 5.3 provides more insight in this regard.
Effect

Fig. 12 Fixed effects and bootstrap 95% confidence intervals 5.3 Single source versus multi-source
estimated for the second analysis model, depicting the overall
intercept (b0 ), the slope of the network size (b10 ), the slope of the
number of sources (b20 ), and their interaction (b30 )
The evidence so far tells us that, on average, learning
from multiple sources leads to better performance than
with a direct relation to the number of parameters in the
learning from a single source. However, it could be pos-
network. In addition, there is a clearly positive effect of the
sible that the SS-R representation with the best learning
number of sources, with a minor interaction between both
source for the given target dataset still performs better than
factors.
a multi-source alternative. In fact, in Fig. 10 there are
Figure 10 also suggests that the variability of perfor-
many cases where the best SS-R representation (black cir-
mance scores decreases with the number of learning
cles at n ¼ 1) already perform quite well compared to the
sources used. This implies that if there are more learning
more sophisticated alternatives. Figure 13 presents similar

Ballroom FMA GTZAN IRMAS


2

2
1

1
0

0
Accuracy

Accuracy

Accuracy
Accuracy
−1

−1

−1

−1
−2

−2

−2

−2
−3

−3

−3

−3
−4

−4

−4

−4

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

Number of learning sources Number of learning sources Number of learning sources Number of learning sources

Lastfm Arousal Valence


SS−R w/ best source
2

SS−R w/o best source


Non−SS−R w/ best source
1

Non−SS−R w/o best source


0

0
nDCG

R2

R2
−1

−1

−1
−2

−2

−2
−3

−3

−3
−4

−4

−4

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
Number of learning sources Number of learning sources Number of learning sources

Fig. 13 (Standardized) performance by number of learning sources. without it. Solid and dashed lines represent linear fits, respectively;
Solid points mark representations including the source performing dashed areas represent 95% confidence intervals (color figure online)
best with SS-R in the dataset; empty points mark representations

123
Neural Computing and Applications (2020) 32:1067–1093 1087

Table 6 Variance components


Ballroom FMA GTZAN IRMAS Lastfm Arousal Valence
(as percent of total) of the
learning sources, within each of self 2 32 39 18 29 6 10
the target datasets, and for non-
SS-R representations year \1 6 \1 1 2 2 \1
bpm 96 3 \1 8 16 \1 42
taste \1 \1 \1 \1 \1 \1 6
tag 1 17 21 16 20 33 14
lyrics \1 \1 \1 3 \1 11 \1
cdr_tag \1 9 12 16 2 16 14
artist 1 32 28 37 32 31 15
Largest per-dataset in bold face

scatter plots, but now explicitly differentiating between

100
Within−dataset variance component (%)
representations using the single best source (filled circles, ●
self
year
solid lines) and not using it (empty circles, dashed lines). bpm

50
taste
The results suggest that even if the strongest learning tag
lyrics
source for the specific dataset is not used, the others largely cdr_tag

20
artist
compensate for it in the multi-source representations,
catching up and even surpassing the best SS-R represen-

10
tations. The exception to this rule is again bpm in the
Ballroom dataset, where it definitely makes a difference. 5 Ballroom

FMA
As the plots shows, the variability for low numbers of ● GTZAN
2

IRMAS
Lastfm
learning sources is larger when not using the strongest
1


Arousal

Valence
source, but as more sources are added, this variability ●



0

reduces.
To further investigate this issue, for each target dataset, −4 −3 −2 −1 0 1
Standardized performance (y*)
we also computed the variance component due to each of
the learning sources, excluding SS-R representations [87]. Fig. 14 Correlation between (standardized) SS-R performance and
A large variance due to one of the sources means that, on variance component (color figure online)
average and for that specific dataset, there is a large dif-
ference in performance between having that source or not. necessarily required because the other sources make up for
Table 6 shows all variance components, highlighting the its absence. This is especially important in practical terms,
per-dataset largest. Apart from bpm in the Ballroom data- because different tasks generally have different best sour-
set, there is no clear evidence that one single source is ces, and practitioners rarely have sufficient domain
specially good in all datasets, which suggests that in gen- knowledge to select them up front. Also, and unlike the
eral there is not a single source that one would use by Ballroom dataset, many real-world problems are not easily
default. Notably though, sources artist, tag and self tend to solved with a single feature. Therefore, choosing a more
have large variance components. general representation based on multiple sources is a much
In addition, we observe that the sources with the largest simpler way to proceed, which still yields comparable or
variance are not necessarily the sources that obtain the best better results.
results by themselves in an SS-R representation (see Fig. 8). In other words, if ‘‘a single deep representation to rule
We examined this relationship further by calculating the them all’’ is pre-trained, it is advisable to base this repre-
correlation between variance components and (standard- sentation on multiple learning sources. At the same time,
ized) performance of the corresponding SS-Rs. The Pearson given that MSS-CR representations also generally show
correlation is 0.38, meaning that there is a mild association. strong performance (albeit that they will bring high
Figure 14 further shows this with a scatterplot, with a clear dimensionality), and that they will come ‘for free’ as soon
distinction between poorly-performing sources (year, taste as SS-R networks are trained, alternatively, we could
and lyrics at the bottom) and well-performing sources (tag, imagine an ecosystem in which the community could pre-
cdr_tag, and artist at the right). train and release many SS-R networks for different indi-
This result implies that even if some SS-R is particularly vidual sources in a distributed way, and practitioners can
strong for a given dataset, when considering more complex then collect these into MSS-CR representations, without the
fusion architectures, the presence of that one source is not need for retraining.

123
1088 Neural Computing and Applications (2020) 32:1067–1093

Fig. 15 Number of network


parameters by number of
learning sources

5.4 Compactness source-specific fc-out layers, we can predict a factor


distribution zt for each of the learning sources. Then, from
Under an MTDTL setup with branching (the MS-CR archi- the predicted zt , one can either map this back on the
tectures), as more learning sources are used, not only the original learning labels yt , or simply consider the strongest
representation will grow larger, but so will the necessary predicted topics (which we visualized in Fig. 16), to relate
deep network to learn it: see Fig. 15 for an overview of the representation to human-understandable facets or
necessary model parameters for the different architectures. descriptions.11
When using all the learning sources, MS-CR@6, which for a
considerable part encompasses a shared network architec-
ture and branches out relatively late, has an around 6.3 times 6 Conclusion
larger network size compared to the network size needed for
SS-R. In contrast, MS-SR@FC, which is the most heavily In this paper, we have investigated the effect of different
shared MTDTL case, uses a network that is only 1.2 times strategies to learn music representations with deep net-
larger than the network needed for SS-R. works, considering multiple learning sources and different
Also, while the representations resulting from the MSS- network architectures with varying degrees of shared
CR and various MS-CR architectures linearly depend on the information. Our main research questions are how the
chosen number of learning sources m (see Table 4), for number and combination of learning sources (RQ1), and
MS-SR@FC, which has a fixed dimensionality of d inde- different configurations of the shared architecture (RQ2)
pendent of m, we do notice increasing performance as more affect the effectiveness of the learned deep music repre-
learning sources are used, except IRMAS dataset. This sentation. As a consequence, we conducted an experiment
implies that under MTDTL setups, the network does learn training 425 neural network models with different combi-
as much as possible from the multiple sources, even in case nations of learning sources and architectures.
of fixed network capacity. After an extensive empirical analysis, we can summa-
rize our findings as follows:
5.5 Multiple explanatory factors
• RQ1 The number of learning sources positively affects
the effectiveness of a learned deep music
By training representation models on multiple learning
11
sources in the way we did, our hope is that the represen- Note that as soon as a pre-trained representation network model
will be adapted to an new dataset through transfer learning, the fc-
tation will reflect latent semantic facets that will ultimately out layer cannot be used to obtain such explanations from the
allow for semantic explainability. In Fig. 16, we show a learning sources used in the representation learning, since the layers
visualization that suggests this indeed may be possible. will then be fine-tuned to another dataset. However, we hypothesize it
More specifically, we consider one of our MS-CR models may be possible that the semantic explainability can still be
preserved, if fine-tuning is jointly conducted with the original
trained on 5 learning sources. For each learning source- learning sources used during the pre-training time in the multi-
specific block of the representation, using the learning objective strategy.

123
Neural Computing and Applications (2020) 32:1067–1093 1089

Fig. 16 Potential semantic explainability of DTMTL music repre- various types of learning sources. The specific model used in the
sentations. Here, we provide a visualization using t-SNE [88], plotting visualization is the 232th model from the experimental design we
2-dimensional coordinates of each sample from the GTZAN dataset, introduce in Sect. 4.3, which is performing better than 95% of other
as resulting from an MS-CR representation trained on 5 sources. In the models on GTZAN target dataset
zoomed-in panes, we overlay the strongest topic model terms in zt , for

representation, although representations based on a CR@2, MSS-CR) tend to outperform models where
single learning source will already be effective in sharing is higher (e.g., MS-CR@6, MS-SR@FC), all of
specialized cases (e.g., BPM and the Ballroom dataset). which outperform the base model (SS-R).
• RQ2 In terms of architecture, the amount of shared Our findings give various pointers to useful future work.
information has a negative effect on performance: First of all, ‘generality’ is difficult to define in the music
larger models with less shared information (e.g., MS- domain, maybe more so than in CV or NLP, in which

123
1090 Neural Computing and Applications (2020) 32:1067–1093

lower-level information atoms may be less multifaceted in References


nature (e.g., lower-level representations of visual objects
naturally extend to many vision tasks, while an equivalent 1. Casey MA, Veltkamp RC, Goto M, Leman M, Rhodes C, Slaney
in music is harder to pinpoint). In case of clear task-specific M (2008) Content-based music information retrieval: current
directions and future challenges. Proc IEEE 96(4):668–696.
data skews, practitioners should be pragmatic about this. https://doi.org/10.1109/JPROC.2008.916370
Also, we only investigated one special case of transfer 2. Caruana R (1997) Multitask learning. Mach Learn 28(1):41–75.
learning, which might not be generalized well if one con- https://doi.org/10.1023/A:1007379606734. ISSN: 1573-0565
siders the adaptation of the pre-trained network for further 3. Bengio Y, Courville AC, Vincent P (2013) Representation
learning: a review and new perspectives. IEEE Trans Pattern
fine-tuning with respect to their target dataset. Since there Anal Mach Intell 35(8):1798–1828. https://doi.org/10.1109/
are various choices to make, which will bring a substantial TPAMI.2013.50. ISSN: 0162-8828
amount of variability, we decided to leave the aspects for 4. Liu W, Mei T, Zhang Y, Che C, Luo J (2015) Multi-task deep
further future works. We believe open-sourcing the models visual-semantic embedding for video thumbnail selection. In:
IEEE conference on computer vision and pattern recognition
we trained throughout this work will be helpful for such CVPR, Boston, MA, USA, pp 3707–3715. https://doi.org/10.
follow-up works. Another limitation of current work is the 1109/CVPR.2015.7298994
selective set of label types in the learning sources. For 5. Bingel J, Søgaard A (2017) Identifying beneficial task relations
instance, there are also a number of MIR-related tasks that for multi-task learning in deep neural networks. In: Proceedings
of the 15th conference of the European chapter of the association
are using time-variant labels such as automatic music for computational linguistics, vol 2. Association for Computa-
transcription, segmentation, beat tracking and chord esti- tional Linguistics, Valencia, Spain, pp 164–169
mation. We believe that such tasks should be investigated 6. Li S, Liu Z-Q, Chan AB (2015) Heterogeneous multi-task
as well in the future to build a more complete overview of learning for human pose estimation with deep convolutional
neural network. Int J Comput Vis 113(1):19–36. https://doi.org/
MTDTL problem. 10.1007/s11263-014-0767-8. ISSN: 1573-1405
Finally, in our current work, we still largely considered 7. Zhang W, Li R, Zeng T, Sun Q, Kumar S, Ye J, Ji S (2015) Deep
MTDTL as a ‘black box’ operation, trying to learn how model based transfer and multi-task learning for biological image
MTDTL can be effective. However, the original reason for analysis. In: Proceedings of the 21th ACM SIGKDD international
conference on knowledge discovery and data mining KDD,
starting this work was not only to yield an effective gen- Sydney. ACM, NSW, Australia, pp 1475–1484. https://doi.org/
eral-purpose representation, but one that also would be 10.1145/2783258.2783304. ISBN: 978-1-4503-3664-2
semantically interpretable according to different semantic 8. Zhang Z, Luo Z, Loy CC, Tang X (2014) Facial landmark
facets. We showed some early evidence our representation detection by deep multi-task learning. In: Computer vision—
ECCV 13th European conference, proceedings, part VI. Springer,
networks may be capable of picking up such facets; how- Zurich, Switzerland, pp 94–108. https://doi.org/10.1007/978-3-
ever, considerable future work will be needed into more in- 319-10599-4_7
depth analysis techniques of what the deep representations 9. Kaiser L, Gomez AN, Shazeer N, Vaswani A, Parmar N, Jones L,
actually learned. Uszkoreit J (2017) One model to learn them all. arXiv:abs/1706.
05137
10. Rick Chang J-H, Li C-L, Póczos B, Vijaya Kumar BVK (2017)
Acknowledgements This work was carried out on the Dutch national One network to solve them all—solving linear inverse problems
e-infrastructure with the support of SURF Cooperative. We further using deep projection models. In: IEEE international conference
thank the CDR for having provided their album-level genre annota- on computer vision, ICCV. IEEE Computer Society, Venice,
tions for our experiments. We thank Keunwoo Choi for the discussion Italy, pp 5889–5898. https://doi.org/10.1109/ICCV.2017.627
and all the help regarding the implementation of his work. We also 11. Weston J, Bengio S, Hamel P (2011) Multi-tasking with joint
thank David Tax for the valuable inputs and discussion. Finally, we semantic spaces for large-scale music annotation and retrieval.
thank editors and reviewers for their effort and constructive help to J New Music Res 40(4):337–348. https://doi.org/10.1080/
improve this work. 09298215.2011.603834
12. Aytar Y, Vondrick C, Torralba A (2016) Soundnet: Learning
Compliance with ethical standards sound representations from unlabeled video. In: Advances in
neural information processing systems 29: annual conference on
Conflict of interest The authors declare that they have no conflict of neural information processing systems. Barcelona, Spain,
interest. pp 892–900
13. Hamel P, Eck D (2010) Learning features from music audio with
Open Access This article is distributed under the terms of the Creative deep belief networks. In: Proceedings of the 11th international
Commons Attribution 4.0 International License (http://creative society for music information retrieval conference, ISMIR.
commons.org/licenses/by/4.0/), which permits unrestricted use, dis- Utrecht, Netherlands, pp 339–344
tribution, and reproduction in any medium, provided you give 14. Boulanger-Lewandowski N, Bengio Y, Vincent P (2012)
appropriate credit to the original author(s) and the source, provide a Modeling temporal dependencies in high-dimensional sequences:
link to the Creative Commons license, and indicate if changes were application to polyphonic music generation and transcription. In:
made. Proceedings of the 29th international conference on machine
learning, ICML. Omnipress, Edinburgh, Scotland, UK
15. Schlüter J, Böck S (2014) Improved musical onset detection with
convolutional neural networks. In: IEEE international conference

123
Neural Computing and Applications (2020) 32:1067–1093 1091

on acoustics, speech and signal processing, ICASSP. IEEE, 30. Bertin-Mahieux T, Ellis DPW, Whitman B, Lamere P (2011) The
Florence, Italy, pp 6979–6983. https://doi.org/10.1109/ICASSP. million song dataset. In: Proceedings of the 12th international
2014.6854953 society for music information retrieval conference, ISMIR.
16. Choi K, Fazekas G, Sandler MB (2016) Automatic tagging using University of Miami, Miami, FL, USA. pp 591–596
deep convolutional neural networks. In: Proceedings of the 17th 31. Bengio Y, Lamblin P, Popovici D, Larochelle H (2006) Greedy
international society for music information retrieval conference, layer-wise training of deep networks. In: Advances in neural
ISMIR. New York City, USA, pp 805–811 information processing systems 19. NIPS. MIT Press, Vancouver,
17. van den Oord A, Dieleman S, Schrauwen B (2013) Deep content- BC, Canada, pp 153–160
based music recommendation. In: Advances in neural informa- 32. Vincent P, Larochelle H, Bengio Y, Manzagol P-A (2008)
tion processing systems 26 NIPS. Lake Tahoe, NV, USA, Extracting and composing robust features with denoising
pp 2643–2651 autoencoders. In: Proceedings of the 25th international confer-
18. Chandna P, Miron M, Janer J, Gómez E (2017) Monoaural audio ence on machine learning ICML. ACM, Helsinki, Finland,
source separation using deep convolutional neural networks. In: pp 1096–1103. https://doi.org/10.1145/1390156.1390294
Latent variable analysis and signal separation—13th international 33. Smolensky P (1986) Information processing in dynamical sys-
conference, LVA/ICA, Proceedings. Grenoble, France, tems: Foundations of harmony theory. Technical report,
pp 258–266. https://doi.org/10.1007/978-3-319-53547-0_25. University of Colorado, Boulder, Department of Computer
ISBN: 978-3-319-53547-0 Science
19. Jeong I-Y, Lee K (2016) Learning temporal features using a deep 34. Hinton GE, Osindero S, Teh Y-W (2006) A fast learning algo-
neural network and its application to music genre classification. rithm for deep belief nets. Neural Comput 18(7):1527–1554.
In: Proceedings of the 17th international society for music https://doi.org/10.1162/neco.2006.18.7.1527
information retrieval conference, ISMIR. New York City, USA, 35. Goodfellow I, Pouget-Abadie J, Mirza M, Bing X, Warde-Farley
pp 434–440 D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial
20. Han Y, Kim J-H, Lee K (2017) Deep convolutional neural net- nets. In: Advances in neural information processing systems 27.
works for predominant instrument recognition in polyphonic NIPS. Curran Associates Inc., Montreal, QC, Canada,
music. IEEE/ACM Trans Audio Speech Lang Process pp 2672–2680
25(1):208–221. https://doi.org/10.1109/TASLP.2016.2632307. 36. Han X, Leung T, Jia Y, Sukthankar R, Berg AC (2015) Matchnet:
ISSN: 2329-9290 unifying feature and metric learning for patch-based matching.
21. Simonyan K, Zisserman A (2015) Very deep convolutional net- In: IEEE conference on computer vision and pattern recognition,
works for large-scale image recognition. In: 3th international CVPR. IEEE Computer Society, Boston, MA, USA,
conference on learning representations, ICLR, San Diego, CA, pp 3279–3286. https://doi.org/10.1109/CVPR.2015.7298948
USA 37. Arandjelovic R, Zisserman A (2017) Look, listen and learn. In:
22. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for IEEE international conference on computer vision, ICCV. IEEE
image recognition. In: IEEE conference on computer vision and Computer Society, Venice, Italy, pp 609–617. https://doi.org/10.
pattern recognition, CVPR. IEEE Computer Society, Las Vegas, 1109/ICCV.2017.73
NV, USA, pp 770–778. https://doi.org/10.1109/CVPR.2016.90 38. Huang Y-S, Chou S-Y, Yang Y-H (2018) Generating music
23. Szegedy C, Liu W, Jia Y, Sermanet P, Reed SE, Anguelov D, medleys via playing music puzzle games. In: Proceedings of the
Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with thirty-second conference on artificial intelligence, AAAI. AAAI
convolutions. In: IEEE conference on computer vision and pat- Press, New Orleans, LA, USA, pp 2281–2288
tern recognition, CVPR. IEEE Computer Society, Boston, MA, 39. Salton G, McGill M (1984) Introduction to modern information
USA, pp 1–9. https://doi.org/10.1109/CVPR.2015.7298594 retrieval. McGraw-Hill Book Company, New York City. ISBN:
24. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) 0-07-054484-0
Distributed representations of words and phrases and their com- 40. Lamere P (2008) Social tagging and music information retrieval.
positionality. In: Advances in neural information processing J New Music Res 37(2):101–114. https://doi.org/10.1080/
systems 26 NIPS. Lake Tahoe, NV, USA, pp 3111–3119 09298210802479284. ISSN: 0929-8215
25. Dieleman S, Brakel P, Schrauwen B (2011) Audio-based music 41. Hamel P, Davies MEP, Yoshii K, Goto M (2013) Transfer
classification with a pretrained convolutional network. In: Pro- learning in MIR: sharing learned latent representations for music
ceedings of the 12th international society for music information audio classification and similarity. In: Proceedings of the 14th
retrieval conference, ISMIR. University of Miami, Miami, FL, international society for music information retrieval conference,
USA. pp 669–674. ISBN: 9780615548654 ISMIR. Curitiba, Brazil, pp 9–14
26. Choi K, Fazekas G, Sandler MB, Cho K (2017) Transfer learning 42. Law E, Settles B, Mitchell TM (2010) Learning to tag from open
for music classification and regression tasks. In: Proceedings of vocabulary labels. In: Machine learning and knowledge discovery
the 18th international society for music information retrieval in databases, European conference, ECML PKDD, Proceedings.
conference, ISMIR. Suzhou, China, pp 141–149 Part II. Springer, Barcelona, Spain, pp 211–226
27. van den Oord A, Dieleman S, Schrauwen B (2014) Transfer 43. Hofmann T (1999) Probabilistic latent semantic analysis. In:
learning by supervised pre-training for audio-based music clas- UAI: proceedings of the fifteenth conference on uncertainty in
sification. In: Proceedings of the 15th international society for artificial intelligence. Morgan Kaufmann, Stockholm, Sweden,
music information retrieval conference, ISMIR. Taipei, Taiwan, pp 289–296
pp 29–34 44. Schlüter J (2016) Learning to pinpoint singing voice from weakly
28. Liang D, Zhan M, Ellis DPW (2015) Content-aware collaborative labeled examples. In: Proceedings of the 17th international
music recommendation using pre-trained neural networks. In: society for music information retrieval conference, ISMIR. New
Proceedings of the 16th international society for music infor- York City, USA, pp 44–50
mation retrieval conference, ISMIR. Málaga, Spain, pp 295–301 45. Hershey S, Chaudhuri S, Ellis DPW, Gemmeke JF, Jansen A,
29. Misra I, Shrivastava A, Gupta A, Hebert M (2016) Cross-stitch Moore RC, Plakal M, Platt D, Saurous RA, Seybold B, Slaney M,
networks for multi-task learning. In: IEEE conference on com- Weiss RJ, Wilson KW (2017) CNN architectures for large-scale
puter vision and pattern recognition. CVPR. IEEE Computer audio classification. In: IEEE international conference on
Society, Las Vegas, NV, USA, pp 3994–4003 acoustics, speech and signal processing, ICASSP. IEEE, New

123
1092 Neural Computing and Applications (2020) 32:1067–1093

Orleans, LA, USA, pp 131–135. https://doi.org/10.1109/ICASSP. IEEE international conference on acoustics, speech, and signal
2017.7952132 processing, ICASSP. IEEE, Prague, Czech Republic,
46. Lee H, Pham PT, Largman Y, Ng AY (2009) Unsupervised pp 5884–5887. https://doi.org/10.1109/ICASSP.2011.5947700
feature learning for audio classification using convolutional deep 60. Lee J, Park J, Kim KL, Nam J (2017) Sample-level deep con-
belief networks. In: Advances in neural information processing volutional neural networks for music auto-tagging using raw
systems 22. NIPS. Curran Associates Inc, Vancouver, BC, waveforms. In: 14th sound and music computing conference,
Canada, pp 1096–1104 SMC, Espoo, Finland
47. Humphrey EJ, Bello JP (2012) Rethinking automatic chord 61. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep
recognition with convolutional neural networks. In: 11th inter- network training by reducing internal covariate shift. In: Pro-
national conference on machine learning and applications, ceedings of the 32nd international conference on machine
ICMLA. IEEE, Boca Raton, FL, USA, pp 357–362. https://doi. learning, ICML. JMLR, Inc, Lille, France, pp 448–456
org/10.1109/ICMLA.2012.220 62. Nair V, Hinton GE (2010) Rectified linear units improve
48. Nakashika T, Garcia C, Takiguchi T (2012) Local-feature-map restricted boltzmann machines. In: Proceedings of the 27th
integration using convolutional neural networks for music genre international conference on machine learning ICML. Omnipress,
classification. In: INTERSPEECH, 13th annual conference of the Haifa, Israel, pp 807–814
international speech communication association. ISCA, Portland, 63. Srivastava N, Hinton GE, Krizhevsky A, Sutskever I, Salakhut-
OR, USA, pp 1752–1755 dinov R (2014) Dropout: a simple way to prevent neural networks
49. Ullrich K, Schlüter J, Grill T (2015) Boundary detection in music from overfitting. J Mach Learn Res 15(1):1929–1958
structure analysis using convolutional neural networks. In: Pro- 64. Nam J, Herrera J, Slaney M, Smith JO (2012) Learning sparse
ceedings of the 16th international society for music information feature representations for music annotation and retrieval. In:
retrieval conference, ISMIR. Málaga, Spain, pp 417–422 Proceedings of the 13th international society for music infor-
50. Piczak KJ (2015) Environmental sound classification with con- mation retrieval conference, ISMIR. FEUP Edições, Porto, Por-
volutional neural networks. In: 25th IEEE international workshop tugal, pp 565–570
on machine learning for signal processing, MLSP. IEEE, Boston, 65. Choi K, Fazekas G, Sandler MB, Cho K (2018) A comparison of
MA, USA, pp 1–6. https://doi.org/10.1109/MLSP.2015.7324337 audio signal preprocessing methods for deep neural networks on
51. Simpson AJR, Roma G, Plumbley MD (2015) Deep karaoke: music tagging. In: 26th European signal processing conference.
extracting vocals from musical mixtures using a convolutional EUSIPCO. IEEE, Roma, Italy, pp 1870–1874
deep neural network. In: Latent variable analysis and signal 66. Dörfler M, Grill T, Bammer R, Flexer A (2018) Basic filters for
separation—12th international conference, LVA/ICA, Proceed- convolutional neural networks applied to music: training or
ings. Springer, Liberec, Czech Republic, pp 429–436. https://doi. design? Neural Comput Appl https://doi.org/10.1007/s00521-
org/10.1007/978-3-319-22482-4_50. ISBN: 978-3-319-22482-4 018-3704-x. ISSN: 1433-3058
52. Phan H, Hertel L, Maaß M, Mertins A (2016) Robust audio event recognition with 1-max pooling convolutional neural networks. In: INTERSPEECH, 17th annual conference of the international speech communication association. ISCA, San Francisco, CA, USA, pp 3653–3657. https://doi.org/10.21437/Interspeech.2016-123
53. Pons J, Lidy T, Serra X (2016) Experimenting with musically motivated convolutional neural networks. In: 14th international workshop on content-based multimedia indexing, CBMI. IEEE, Bucharest, Romania, pp 1–6. https://doi.org/10.1109/CBMI.2016.7500246
54. Stasiak B, Monko J (2016) Analysis of time-frequency representations for musical onset detection with convolutional neural network. In: Proceedings of the federated conference on computer science and information systems, FedCSIS. Gdańsk, Poland, pp 147–152. https://doi.org/10.15439/2016F558
55. Su H, Zhang H, Zhang X, Gao G (2016) Convolutional neural network for robust pitch determination. In: IEEE international conference on acoustics, speech and signal processing, ICASSP. IEEE, Shanghai, China, pp 579–583. https://doi.org/10.1109/ICASSP.2016.7471741
56. Krizhevsky A, Sutskever I, Hinton GE (2017) ImageNet classification with deep convolutional neural networks. Commun ACM 60(6):84–90. https://doi.org/10.1145/3065386
57. Dieleman S, Schrauwen B (2014) End-to-end learning for music audio. In: IEEE international conference on acoustics, speech and signal processing, ICASSP. IEEE, Florence, Italy, pp 6964–6968. https://doi.org/10.1109/ICASSP.2014.6854950
58. van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior AW, Kavukcuoglu K (2016) WaveNet: a generative model for raw audio. In: The 9th ISCA speech synthesis workshop, SSW. ISCA, Sunnyvale, CA, USA, p 125
59. Jaitly N, Hinton GE (2011) Learning a better representation of speech soundwaves using restricted Boltzmann machines. In: IEEE international conference on acoustics, speech and signal processing, ICASSP. IEEE, Prague, Czech Republic
67. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: 3rd international conference on learning representations, ICLR, San Diego, CA, USA
68. Paszke A, Gross S, Chintala S, Chanan G, Yang E, DeVito Z, Lin Z, Desmaison A, Antiga L, Lerer A (2017) Automatic differentiation in PyTorch. In: NIPS-W
69. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay É (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830. ISSN: 1532-4435
70. McFee B, Raffel C, Liang D, Ellis DPW, McVicar M, Battenberg E, Nieto O (2015) librosa: audio and music signal analysis in Python. In: Huff K, Bergstra J (eds) Proceedings of the 14th Python in science conference, SciPy. Austin, TX, USA, pp 18–24. https://doi.org/10.25080/Majora-7b98e3ed-003
71. Defferrard M, Benzi K, Vandergheynst P, Bresson X (2017) FMA: a dataset for music analysis. In: Proceedings of the 18th international society for music information retrieval conference, ISMIR. Suzhou, China, pp 316–323
72. Tzanetakis G, Cook PR (2002) Musical genre classification of audio signals. IEEE Trans Speech Audio Process 10(5):293–302. https://doi.org/10.1109/TSA.2002.800560. ISSN: 1063-6676
73. Kereliuk C, Sturm BL, Larsen J (2015) Deep learning and music adversaries. IEEE Trans Multimed 17(11):2059–2071. https://doi.org/10.1109/TMM.2015.2478068. ISSN: 1520-9210
74. Gouyon F, Klapuri A, Dixon S, Alonso M, Tzanetakis G, Uhle C, Cano P (2006) An experimental comparison of audio tempo induction algorithms. IEEE Trans Audio Speech Lang Process 14(5):1832–1844. https://doi.org/10.1109/TSA.2005.858509. ISSN: 1558-7916
75. Marchand U, Peeters G (2016) Scale and shift invariant time/frequency representation using auditory statistics: application to rhythm description. In: 26th IEEE international workshop on machine learning for signal processing, MLSP. IEEE, Salerno, Italy, pp 1–6. https://doi.org/10.1109/MLSP.2016.7738904
76. Bosch JJ, Janer J, Fuhrmann F, Herrera P (2012) A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals. In: Proceedings of the 13th international society for music information retrieval conference, ISMIR. FEUP Edições, Porto, Portugal, pp 559–564
77. Soleymani M, Caro MN, Schmidt EM, Sha C-Y, Yang Y-H (2013) 1000 songs for emotional analysis of music. In: Proceedings of the 2nd ACM international workshop on crowdsourcing for multimedia, CrowdMM@ACM Multimedia. ACM, Barcelona, Spain, pp 1–6. https://doi.org/10.1145/2506364.2506365. ISBN: 978-1-4503-2396-3
78. Celma Ò (2010) Music recommendation and discovery: the long tail, long fail, and long play in the digital music space. Springer, Berlin. https://doi.org/10.1007/978-3-642-13287-2. ISBN: 978-3-642-13286-5
79. Sturm BL (2014) The state of the art ten years after a state of the art: future research in music information retrieval. J New Music Res 43(2):147–172. https://doi.org/10.1080/09298215.2014.894533
80. Sturm BL (2016) The "Horse" inside: seeking causes behind the behaviors of music content analysis systems. Comput Entertain 14(2):3:1–3:32. https://doi.org/10.1145/2967507
81. Posner J, Russell JA, Peterson BS (2005) The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Dev Psychopathol 17(3):715–734. https://doi.org/10.1017/S0954579405050340. ISSN: 1469-2198
82. Montgomery DC (2012) Design and analysis of experiments, 8th edn. Wiley, Hoboken
83. Goos P, Jones B (2011) Optimal design of experiments: a case study approach, 1st edn. Wiley, Hoboken
84. Hinton GE (1989) Connectionist learning procedures. Artif Intell 40(1):185–234. https://doi.org/10.1016/0004-3702(89)90049-0. ISSN: 0004-3702
85. Hu Y, Koren Y, Volinsky C (2008) Collaborative filtering for implicit feedback datasets. In: Proceedings of the 8th IEEE international conference on data mining, ICDM. IEEE Computer Society, Pisa, Italy, pp 263–272. https://doi.org/10.1109/ICDM.2008.22
86. Gelman A, Hill J (2006) Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, Cambridge
87. Searle SR, Casella G, McCulloch CE (2006) Variance components. Wiley, Hoboken
88. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.