
2012 11th International Conference on Machine Learning and Applications

Rethinking Automatic Chord Recognition with Convolutional Neural Networks

Eric J. Humphrey and Juan P. Bello


Music and Audio Research Lab (MARL)
New York University
New York, NY USA
{ejhumphrey, jpbello}@nyu.edu

Abstract—Despite early success in automatic chord recognition, recent efforts are yielding diminishing returns while basically iterating over the same fundamental approach. Here, we abandon typical conventions and adopt a different perspective of the problem, where several seconds of pitch spectra are classified directly by a convolutional neural network. Using labeled data to train the system in a supervised manner, we achieve state of the art performance through this initial effort in an otherwise unexplored area. Subsequent error analysis provides insight into potential areas of improvement, and this approach to chord recognition shows promise for future harmonic analysis systems.

Keywords—chord recognition; automatic music transcription; convolutional neural nets

[Figure 1. One interpretation of a simple melody in F. The implied chord of the phrase is an F Major triad, despite the presence of two nonchord tones falling on unaccented beats: a B♭ passing tone (4th scale degree) and an E neighboring tone (7th scale degree). Intervals between notes are also indicated, showing the relative relationships between nearby pitches.]
I. INTRODUCTION

Even from the earliest efforts in music informatics research, automatic music transcription stands apart as one of the Holy Grails of the field. It has proven substantially more difficult than once thought, however, and has since fractured into a variety of smaller subtopics. Automatic chord recognition is one such task, receiving healthy attention for more than a decade, and is an established benchmark at the annual MIReX challenge¹. Given the prerequisite skill necessary to produce these transcriptions manually, there is strong motivation to develop automated systems capable of reliably performing this task. Applications of a computational chord recognition system are myriad, ranging from straightforward annotation to the development of compositional tools and musically meaningful search and retrieval systems.

¹http://www.music-ir.org/mirex/wiki/MIREX_HOME

The identification of chords is also intriguing from a musical perspective, being a high-level cognitive process that is often open to multiple interpretations between knowledgeable experts. One common definition of a chord is the "simultaneous sounding of two or more notes" [7], but music is seldom so simple. Though the explicit use of chords can be straightforward – strumming a root-position C major on guitar, for instance – real music is typically characterized by complex tonal scenes that only imply a certain chord or harmony, often in the presence of nonchord tones and with no guarantee of simultaneity; one such example is the simple monophonic melody given in Figure 1, which clearly suggests F Major. Skilled human listeners are quite robust in their ability to assign chord labels to musical scenes despite an occasionally opaque decision-making process, and it is an exciting challenge to produce a computational system with a similar capacity.

Historically speaking, automatic chord recognition research is mostly summarized by a few seminal works. Arguably, the two most influential systems are those of Fujishima, who proposed the use of chroma features [3], and Sheh and Ellis, who introduced the use of Hidden Markov Models (HMMs) to stabilize chord classification [9]. The former, also known as pitch class profiles, are a short-time estimate of octave-equivalent pitch calculated several times a second. Each chroma vector is classified as a chord independently, and HMMs greatly improve performance by smoothing spurious behavior and effectively "stitching" together feature vectors that produce large classification likelihoods. Based on this early success, many current approaches adopt this interpretation of the problem and incorporate the same three-stage process: first, frame-level features are calculated over short-time observations of an audio signal; instantaneous feature vectors are then assigned to chord classes; and post-filtering is performed to identify the best chord classification path.
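For concreteness, the following is a minimal sketch of the chroma (pitch class profile) computation at the heart of this conventional pipeline. The 12-bin octave folding is standard, but the normalization choice and the assumption of a precomputed constant-Q magnitude spectrum are illustrative, not drawn from any one of the cited systems.

```python
import numpy as np

def chroma(cqt_mag, bins_per_octave=12):
    """Fold a constant-Q magnitude spectrum into 12 octave-equivalent
    pitch classes, yielding one chroma vector per frame.

    cqt_mag: (n_bins, n_frames) array of constant-Q magnitudes.
    """
    n_bins, n_frames = cqt_mag.shape
    pc = np.zeros((12, n_frames))
    for b in range(n_bins):
        # Map each constant-Q bin to its pitch class, modulo the octave.
        pc[(b * 12 // bins_per_octave) % 12] += cqt_mag[b]
    # Normalize each frame so classification sees relative energy only.
    return pc / (pc.max(axis=0, keepdims=True) + 1e-9)
```

In the systems described above, each column of this matrix would then be scored against chord templates or a Gaussian model, with an HMM smoothing the frame-wise decisions into a chord path.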
Though the implementation details have continued to evolve over the last decade, the brunt of chord recognition research has concentrated not on the fundamental system per se, but rather on the tuning of its components. In particular, much time and energy has been invested in developing not just better features, but specifically better chroma features [8]. Acknowledging the challenges inherent to designing good features, Pachet et al. pioneered work in automatic feature optimization [11], and more recently deep learning methods have been employed to produce robust Tonnetz features [4]. Alternatively, some work leverages the repetitive structure of music to smooth chroma features prior to classification [1]. Various classification strategies have been investigated to a lesser extent [10], but Gaussian Mixture Models (GMMs) are conventionally preferred for their probabilistic interpretation. The choice of post-filtering methods has been shown to significantly impact classification accuracy, and much research has focused on properly tuning HMMs [2], in addition to exploring other post-filtering methods such as Dynamic Bayesian Networks (DBNs) [6].

[Figure 2. The CNN Chord Recognition Architecture: A time-pitch tile (a) is input to a CNN (b), yielding a probability surface (c).]

Despite steady incremental improvements, there are two important observations to draw from this research tradition: performance increases are undeniably diminishing, and the now de facto standard interpretation of the problem exhibits a few crucial shortcomings. After a period of quick progress, improvements in automatic chord recognition have stalled well below what one would deem a solved problem. Even if an optimal transform existed, chroma is conceptually limited in the vocabulary of chords it can uniquely represent. Some previous work attempts to compensate for this deficiency by supplementing chroma features with other hand-crafted statistics, e.g. bass note features [7], but a little more than a decade of hand-crafted feature design in music informatics has shown the process to be an arduous and time-consuming one. Moreover, frame-level classification assumes both simultaneity in pitch space and a predominantly Gaussian distribution of the data, neither of which is a particularly realistic assumption.

In this work, we adopt a different view of chord recognition and propose a trainable, data-driven approach that automatically learns relevant hierarchical features and its classifier simultaneously. Rather than attempting to classify short-time features and relying on post-filtering to smooth the results into a musically plausible chord path, we instead use a convolutional neural network to classify five-second tiles of pitch spectra, producing a jointly optimized chord recognition system. A significant advantage of this approach is that minimal assumptions are made regarding what statistics might be informative for the task at hand, instead relying on data to objectively tease out these features. We additionally show that optimizing to labeled data provides insight into not only the task at hand, but also the ground truth data itself. The remainder of this paper is outlined as follows: Section II addresses the motivation and concepts behind the proposed system; Section III describes our experimental methodology; Section IV presents and discusses the experimental findings; and finally, Section V offers conclusions and directions for future work.

II. THE HIERARCHY OF HARMONY

Like most facets of music, chords are characterized by the hierarchical composition of more atomic elements: individual pitches combine in time to form intervals, and similarly into chords, harmonic progressions, and ultimately songs. Noting that these musical building blocks take shape along the dimensions of both pitch and time, the task of chord recognition can be conceptually reformulated as a natural hierarchy of spectro-temporal events; we again refer to Figure 1, which illustrates the intervallic relationships of nearby notes over both pitch and time.

Framed in the context of relative intervals, it is possible to explain the full range of harmony – monophonic melodies, inverted chords, or nonchord tone embellishments – through the combination of simpler parts. Chord recognition can then be reduced to two separate challenges: how should a system be architected to encode musical harmony as a hierarchy of parts, and how can we determine what these parts are? Deep convolutional architectures provide a potential answer to the first question, and automatic feature learning can be used to discover optimal feature representations.

A. Convolutional Neural Networks

Initially inspired by research on the cat's visual cortex by Hubel & Wiesel in the 1960s, convolutional neural networks (CNNs) are trainable, hierarchical nonlinear functions; we refer to [5] for a modern review. CNNs can be seen as a special instance of classic Artificial Neural Networks (ANNs) where weights are shared over an input vector, acting like local receptive fields that "move" over the input. This movement manifests as translation invariance, allowing the same feature to be characterized at different absolute positions. For the purposes of chord recognition, a convolutional architecture learns relative intervals separate from absolute pitch height or position in time; this architecture is diagrammed in Figure 2.

Defining explicitly, a CNN is a multistage architecture composed of both convolutional and fully-connected layers that operates on an input $X_{in}$, produces an output $Z_{Prob}$, and is described by a set of weight parameters $W_l$, where $l$ indexes the layer of the machine. Layers are stacked consecutively, such that the output $Z_l$ is taken as the input $X_{l+1}$ of the following layer. Each layer consists of a linear projection $\pi(X, W)$, a hyperbolic tangent activation, and an optional down-sampling, or pooling, operation, expressed generally by

$$Z_l = \mathrm{pool}\big(\tanh(\pi(X_l, W_l) + W_{l0})\big) \qquad (1)$$

The function $\pi$ of a convolutional layer is a 3-dimensional operation, where an input tensor $X_l$, known as a collection of $N$ feature maps, is convolved with another tensor of weights $W_l$, referred to as kernels, as defined in (2); the input $X_{in}$ can be considered a special instance of a feature map where $N = 1$:

$$\pi(X, W)[m, n] = \sum_{i=0}^{N} \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} X[i, j, k]\, W_i[m - j, n - k] \qquad (2)$$

The function $\pi$ of a fully-connected layer is the product of a flattened input vector $X_l$ and a matrix of weights $W_l$, given in (3):

$$\pi(X, W) = X \times W \qquad (3)$$

Finally, the output $Z_{Prob}$ is defined as a softmax activation over the $C$ chord classes, as in (4), such that the $i$th output is bounded on the interval $[0, 1]$ and the output vector sums to one:

$$p(Z_i) = \frac{\exp(Z_i)}{\sum_{c=1}^{C} \exp(Z_c)} \qquad (4)$$
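To make the layer definitions concrete, here is a minimal numpy sketch of equations (1)–(4). The 'valid'-mode correlation and the non-overlapping max-pooling are illustrative assumptions; the text above does not pin down these implementation details.

```python
import numpy as np
from scipy.signal import correlate

def conv_layer(X, W, bias, pool=(1, 1)):
    """Eqs. (1)-(2): convolve the input feature maps with a kernel tensor,
    apply tanh, then (optionally) max-pool.

    X: (N_in, time, freq) input feature maps.
    W: (N_out, N_in, k_time, k_freq) kernel tensor, as in Table I.
    """
    # 'valid'-mode cross-correlation; equivalent to (2) up to a kernel
    # flip, which is immaterial for learned weights.
    maps = np.array([
        sum(correlate(X[i], W[o, i], mode="valid") for i in range(X.shape[0]))
        + bias[o]
        for o in range(W.shape[0])])
    Z = np.tanh(maps)
    pt, pf = pool  # non-overlapping max-pooling, an assumed pooling choice
    n, t, f = Z.shape[0], Z.shape[1] // pt, Z.shape[2] // pf
    return Z[:, :t * pt, :f * pf].reshape(n, t, pt, f, pf).max(axis=(2, 4))

def full_layer(X, W, bias):
    """Eq. (3): flatten the input, project, and squash."""
    return np.tanh(X.ravel().dot(W) + bias)

def softmax(z):
    """Eq. (4): posterior over the C = 25 chord classes."""
    e = np.exp(z - z.max())
    return e / e.sum()
```

Chaining three `conv_layer` calls and two `full_layer` calls with the shapes listed in Table I, and replacing the final tanh with the softmax, sketches the forward pass of the architectures evaluated below.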

B. Feature Learning

Given a trainable architecture and a differentiable objective function, the true power of data-driven methods is realized in the numerical optimization of its parameters. Despite early concerns to the contrary, stochastic gradient methods have been shown to converge to "good" solutions, and labeled data, when plentiful, can be used to fit the function directly. This process is not always trivial, however, and two facets in particular – labeled ground truth data and the training strategy – require due consideration.

1) Ground Truth Data: Originating from a variety of different sources, we conduct our work on a set of 475 music recordings, consisting of 181 songs from Christopher Harte's Beatles dataset², 100 songs from the RWC Pop dataset, and 194 songs from the US Pop dataset³. Based on the provided annotations, there are some 800 unique chord labels for just over 50k distinct chord instances. Importantly, chord classes exhibit a power-law distribution, where a small number of classes – mostly major and minor chords – live in the short head, while more obscure chords may occur only a handful of times. Traditionally, due to both the sparse representation of chord classes in the long tail and a general simplification of the task, all chord labels are resolved to twenty-five classes: 12 major, 12 minor, and a waste-basket "no-chord" class. While this mapping reduces the space of possible classes, it is worth noting that it also introduces additional intra-class variance.

²http://isophonics.net/content/reference-annotations-beatles
³https://github.com/tmc323/Chord-Annotations
2) Training Strategy: Being mindful of the nuances of the dataset, special attention must be paid to how training is conducted. The size of the dataset requires k-fold cross-validation, but it is poor practice to stratify the data arbitrarily. As we intend on classifying observations on the order of seconds, musical context – the information around a given chord – is actually a very strong prior, and it is advisable to split the data such that all folds are, ideally, identically distributed with respect to chord transitions. Doing so should result in roughly equivalent performance across various splits of the data, and in some scenarios even eliminate the need to train and test over all k folds.

To create similarly distributed folds of the data, we use a genetic algorithm (GA) to minimize the variance of chord transition distributions. First, chord transition histograms are tallied separately for each track, yielding a (25 × 25 × 475) tensor H. A 1-of-k binary assignment matrix A, with a shape of (475 × k), can then be used to determine fold ownership. The fitness of a matrix Ai can be computed as the variance of its matrix product with H. We run the GA by randomly initializing a "population" of assignment matrices, evaluating the fitness of each, assigning probabilities accordingly, and randomly selecting "parents" to merge. A new population is created by randomly keeping rows from each pair of parents, yielding a new set of assignment matrices. Despite concerns about GAs being computationally expensive and having no convergence guarantees, it proved to be quite fast for this application and actually converged to a consistent local minimum.
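The fold-assignment procedure can be sketched as follows. The random placeholder for H, the population size, and the parent-selection weighting are assumptions where the description above leaves details open.

```python
import numpy as np

rng = np.random.default_rng(0)
N_TRACKS, K, POP = 475, 5, 50
# Placeholder for H: per-track chord-transition histograms, one (25, 25)
# histogram per track, flattened to rows of length 625.
H = rng.random((N_TRACKS, 25 * 25))

def fitness(A):
    """Lower is better: spread of the per-fold transition distributions.

    A: (N_TRACKS, K) 1-of-k assignment matrix; A.T @ H aggregates each
    fold's transition counts, normalized before measuring variance.
    """
    fold_hists = A.T @ H
    fold_hists = fold_hists / fold_hists.sum(axis=1, keepdims=True)
    return fold_hists.var(axis=0).sum()

def random_assignment():
    A = np.zeros((N_TRACKS, K))
    A[np.arange(N_TRACKS), rng.integers(0, K, N_TRACKS)] = 1.0
    return A

population = [random_assignment() for _ in range(POP)]
for _ in range(200):
    scores = np.array([fitness(A) for A in population])
    # Fitter (lower-variance) assignments reproduce more often.
    probs = scores.max() - scores + 1e-9
    probs /= probs.sum()
    parents = rng.choice(POP, size=(POP, 2), p=probs)
    # Children keep each row (track) wholly from one parent or the other,
    # preserving the 1-of-k structure.
    population = [np.where(rng.random((N_TRACKS, 1)) < 0.5,
                           population[i], population[j])
                  for i, j in parents]
best = min(population, key=fitness)
```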
For training, we adopt a mini-batch variant of stochastic gradient descent, where the loss function, defined as the negative log-likelihood, is evaluated over a set of training data at each update step. To prevent any class bias, mini-batches are formed as an integer multiple of 25 and a uniform class distribution is imposed; in this work, we use a batch size of 150, or 6 instances of each class per mini-batch. We split the five folds 3–1–1 for training, validation, and test, respectively, and introduce an early-stopping criterion by computing the classification error over the entire validation set every 50 iterations. Every training run is conducted for 6000 iterations, and the weights that produce the best validation score are stored for subsequent evaluation.
III. METHODOLOGY

A. Input Representation

Though it is theoretically conceivable that a convolutional architecture could be applied to raw, time-domain audio, the system is simplified by using musical knowledge to produce perceptually motivated pitch spectra via the constant-Q transform. Transforming audio signals to time-frequency representations provides the dual benefits of an input that is lower-dimensional than raw audio and linear in pitch, allowing kernels to translate over the input. Importantly, this filterbank front-end can be interpreted as hard-coding the first layer of a larger convolutional architecture. We implement the constant-Q transform as a time-domain filterbank of complex-valued Gabor filters, spaced at 36 filters per octave and tuned such that the duration of each filter corresponds to 40 cycles of its center frequency, yielding 252 filters spanning 27.5–1760 Hz. Audio signals are first downsampled to 7040 Hz, and the filterbank is applied to centered windows at a frame rate of 40 Hz. This over-sampled pitch spectra is then reduced to a frame rate of 4 Hz by mean-filtering each subband with a 15-point window and decimating in time by a factor of 10.

There are two additional data manipulations worth mentioning here, which we refer to collectively as extended training data (ETD). First, the linearity of pitch in a constant-Q representation affords the ability to "transpose" an observation, as if it were a true data point in a different pitch class, by literally shifting the pitch tile and changing the label accordingly. In other words, every data point can count toward each chord class of the same mode (Major or minor), effectively increasing the amount of training data by a factor of 12. Another preliminary processing stage, popularized in computer vision, is the application of subtractive-divisive contrast normalization [5], which serves as a frequency-varying gain control.
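A minimal sketch of the transposition half of ETD follows, assuming a (time × frequency) tile with 36 bins per octave and the 25-class label scheme; the class-index convention used here (0–11 major, 12–23 minor, 24 no-chord) is an assumption for illustration.

```python
import numpy as np

BINS_PER_SEMITONE = 3  # 36 bins per octave / 12 semitones

def transpose_tile(tile, label, semitones):
    """Shift a pitch tile along frequency and relabel it, per the ETD scheme.

    tile: (time, 252) constant-Q magnitudes; label: int in [0, 24].
    Vacated bins at the edge are zero-padded rather than wrapped.
    """
    shift = semitones * BINS_PER_SEMITONE
    shifted = np.roll(tile, shift, axis=1)
    if shift > 0:
        shifted[:, :shift] = 0.0
    elif shift < 0:
        shifted[:, shift:] = 0.0
    if label == 24:                    # the no-chord class is unaffected
        return shifted, label
    mode_offset = 12 * (label // 12)   # 0 for major, 12 for minor
    return shifted, mode_offset + (label - mode_offset + semitones) % 12

# Every training tile can thus stand in for all 12 roots of its mode:
# variants = [transpose_tile(tile, label, s) for s in range(-5, 7)]
```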
B. Experimental Objectives

One reality faced in the development of any moderately novel system is that there is minimal precedent to guide architectural design decisions. That being said, the classic consideration in neural network research is that of model complexity, which manifests here in two main ways: the number and the dimensionality of the kernels. Detailed in Table I, we define three kernel quantities and two kernel shapes, for a total of six architectures. Convolutional layers are described by a kernel tensor K of shape (Nin, Nout, time, frequency) and an optional pooling operation P of shape (time, frequency); fully-connected layers are described by a single weight matrix W of shape (Nin, Nout).

Table I. CNN ARCHITECTURES (layer shapes for the "–1" kernel-shape variants)

Arch | Layer shapes
3 | K:(1, 16, 6, 25), P:(1, 3); K:(16, 20, 6, 27); K:(20, 24, 6, 27); W:(1440, 200); W:(200, 25)
2 | K:(1, 6, 6, 25), P:(1, 3); K:(6, 9, 6, 27); K:(9, 12, 6, 27); W:(720, 125); W:(125, 25)
1 | K:(1, 4, 6, 25), P:(1, 3); K:(4, 6, 6, 27); K:(6, 8, 6, 27); W:(480, 50); W:(50, 25)

Model complexity is of particular interest, given the size of the dataset – which is modest compared to other fields and applications – and only a limited sense of how complicated the underlying task might be. In addition to simply determining the best performing configuration, there are a few questions a data-driven approach is particularly suited to address: If and how does the system over-fit the training data, and are there cases when it cannot? How important is the choice of architecture? And what are the effects of using ETD?

IV. RESULTS & DISCUSSION

As an initial step toward assessing the outlined experimental objectives, we wish to determine both the performance ceiling of this approach and the variation between folds of the data. Informal tests indicated that large-kernel architectures with ETD showed the most promise, so Arch:3–1 and Arch:1–1 were trained and evaluated across all five folds. The results of this experiment, shown in Table II, provide two important insights: CNN chord recognition performs competitively with the state of the art, and performance discrepancies between folds fall within a 2% margin. Previously published numbers on this dataset fall in the upper 70% range [1], but it is encouraging that a CNN could reach this benchmark in a first attempt.

Table II. 5-FOLD RECOGNITION ACCURACY (%), ARCH:3–1

Fold | Train | Valid | Test
1 | 83.2 | 77.6 | 77.8
2 | 83.6 | 78.2 | 76.9
3 | 82.0 | 78.1 | 78.3
4 | 83.6 | 78.6 | 76.8
5 | 81.7 | 76.5 | 77.7
Total | 82.81 | 77.80 | 77.48

Additionally, we found that median filtering the output probability surface with a window of length 5 (approximately 1 second) before taking the frame-wise argmax() slightly boosted performance across the board, but never by more than 1%; even so, all results presented here include this operation.
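This decoding step amounts to the following sketch, assuming a (classes × frames) probability surface at 4 Hz; scipy's medfilt stands in here as an assumed but equivalent median filter.

```python
import numpy as np
from scipy.signal import medfilt

def decode(prob_surface):
    """Median-filter the (25, n_frames) probability surface over time with
    a 5-frame window (~1 s at 4 Hz), then take the frame-wise argmax."""
    smoothed = medfilt(prob_surface, kernel_size=[1, 5])
    return smoothed.argmax(axis=0)
```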
Given the performance consistency across folds and our interest in general trends rather than statistical significance, we conduct a parameter sweep using a leave-one-out (LoO) strategy in the spirit of computational efficiency. Noting that training and validation performance are lowest for the fifth fold, we select this stratification scenario to train each architecture in Table I, under both True and False ETD conditions, on the premise that this split of the data is likely the most difficult. Each architecture is trained as before, and the best performing weight parameters over the validation set are used for final evaluation; overall results are given in Table III.

Table III. PARAMETER SWEEP RESULTS (%), ETD:FALSE

Arch | Train | Valid | Test
1–1 | 84.7 | 74.9 | 75.6
1–2 | 87.0 | 73.1 | 74.5
2–1 | 85.5 | 75.0 | 75.5
2–2 | 91.2 | 73.9 | 74.0
3–1 | 92.0 | 75.2 | 75.5
3–2 | 91.7 | 73.6 | 73.8

The most noticeable result of this parameter sweep is the accuracy differential on the training set between ETD conditions. Transposing the training data improves generalization, in addition to reducing the extent to which the network can over-fit the training data. This is a particularly interesting outcome, because transposing the input pitch spectra should have negligible effects on the learned kernels. It is therefore reasonable to assume that over-fitting occurs in the fully-connected layers of the network, and not necessarily the convolutional ones. This also raises an interesting question: why can't these models over-fit the ETD condition?

Focusing on Arch:3–1, one potential cause of over-fitting is an under-representation of chord classes in the dataset. Figure 3 plots the accuracy differential between ETD conditions for both training and test as a function of chord class, ranked by occurrence in the dataset, and indicates that this is not the root cause. ETD reduces over-fitting, but it does so uniformly, and there is only a weak positive correlation between the train-test discrepancy and ranked chord type. In other words, all chord classes benefit equally from ETD, which is more characteristic of widespread intra-class variance than of some classes being inadequately represented. If this is indeed the case, there are two main sources of intra-class variance: mapping all chord classes to Major-minor, or simple labeling error.

[Figure 3. Accuracy differential between training and test as a function of chord class, ordered along the x-axis from most to least common in the dataset, for ETD:False (blue) and ETD:True (green) conditions.]

To assess the former, Figure 4 plots the accuracy for Major-minor (Mm) versus all other (O) chords – those not strictly annotated as root-position Major-minors – in the train (Tr) and test (Te) conditions, with and without ETD. This figure illustrates that a good deal of over-fitting in the training set is due to Other chord types mapped to Major-minor, which ETD reduces, while generalization performance for Other chord types is almost equal in both ETD conditions. Therefore, it is logical to conclude that ETD distributes the added variance of Other chord types evenly across all classes, and the machine learns to ignore much of it as noise, while also learning a better strict Major-minor model in the process.

[Figure 4. Effects of transposition on recognition accuracy for explicitly labeled Major-minor chords (dark bars) versus other chord types mapped to the nearest Major-minor (lighter bars), for training (blue) and test (green), under ETD:False (left) and ETD:True (right).]

There is also the concern that over-fitting may be a function of specific tracks. Figure 5 shows the track-wise histograms of the accuracy differential before and after ETD for the train, validation, and test datasets. Though the accuracy over most tracks is unaffected by the ETD condition, as evidenced by the near-zero mode of the distributions, certain tracks in the training set do much worse when the data is transposed. We can take this to mean that some tracks are particularly problematic, and this should be explored further in future work. This is intuitively satisfying, because the repetitive nature of music would likely cause single tracks to contain multiple instances of rare chords and outliers in the dataset, and to suffer the most when these few data points are not over-fit by the machine.

[Figure 5. Histograms of track-wise accuracy differential between ETD:False and ETD:True conditions, for training (blue), validation (red), and test (green) datasets.]

V. CONCLUSIONS & FUTURE WORK

In this work, we have presented a new approach to tackling the well-worn task of automatic chord recognition. Viewing harmony as a hierarchy of intervallic relationships, we train a convolutional neural network to classify five-second tiles of pitch spectra, yielding state of the art performance across a variety of configurations. We find that simpler architectures (Arch:1–1) perform nearly as well as over-complete ones (Arch:3–1) without any explicit regularization or weight decay penalties. When examining performance across a variety of conditions, over-fitting is used to gain insight into what, and how, these machines actually learn. In particular, over-fitting is due largely to the mapping of various chord types to Major-minor and to a few outlier tracks; ETD techniques reduce these effects by distributing additional intra-class variance across all chord classes, which the machine learns to mostly ignore. The inability to over-fit all data, however, suggests that a non-negligible amount may be mislabeled or mapped to Major-minor chords incorrectly.
With an eye toward future work, unsupervised pre-training, either through convolutional Deep Belief Nets or stacked autoencoders, may have a substantial impact on improving performance for sparsely represented chords. To this point, there is also the possibility of incorporating other work in transfer, or one-shot, learning, where new classes can be discriminated with only a few training examples. Going forward, the greatest potential of this approach will likely be realized in extending it to larger, more realistic chord vocabularies.

ACKNOWLEDGMENT

This material is based upon work supported by the National Science Foundation under grant IIS-0844654.

REFERENCES

[1] T. Cho and J. P. Bello, "A Feature Smoothing Method for Chord Recognition Using Recurrence Plots," in Proc. ISMIR, 2011.
[2] T. Cho, R. J. Weiss, and J. P. Bello, "Exploring Common Variations in State of the Art Chord Recognition Systems," in Proc. SMC, 2010.
[3] T. Fujishima, "Realtime Chord Recognition of Musical Sound: A System Using Common Lisp Music," in Proc. ICMC, 1999.
[4] E. J. Humphrey, T. Cho, and J. P. Bello, "Learning a Robust Tonnetz-space Representation for Chord Recognition," in Proc. ICASSP, 2011.
[5] Y. LeCun, K. Kavukcuoglu, and C. Farabet, "Convolutional Networks and Applications in Vision," in Proc. ISCAS, 2010.
[6] M. Mauch and S. Dixon, "Approximate Note Transcription for the Improved Identification of Difficult Chords," in Proc. ISMIR, 2010.
[7] M. Mauch and S. Dixon, "Simultaneous Estimation of Chords and Musical Context from Audio," IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 18, no. 6, pp. 1280–1289, 2010.
[8] M. Müller and S. Ewert, "Towards Timbre-invariant Audio Features for Harmony-based Music," IEEE Transactions on Audio, Speech, and Language Processing (TASLP), vol. 18, no. 3, pp. 649–662, 2010.
[9] A. Sheh and D. P. W. Ellis, "Chord Segmentation and Recognition Using EM-trained Hidden Markov Models," in Proc. ISMIR, 2003.
[10] A. Weller, D. P. W. Ellis, and T. Jebara, "Structured Prediction Models for Chord Transcription of Music Audio," in Proc. ICMLA, 2009.
[11] A. Zils and F. Pachet, "Automatic Extraction of Music Descriptors from Acoustic Signals Using EDS," in Proc. AES, 2004.