Rethinking Automatic Chord Recognition With Convolutional Neural Networks
Rethinking Automatic Chord Recognition With Convolutional Neural Networks
358
35
Defining explicitly, a CNN is a multistage architecture Importantly chord classes context – the information
composed of both convolutional and fully-connected exhibit a power-law around a given chord – is
layers distribution where a small actually a very strong
that operates on an input output vector sums to one. number of classes – prior, and it is advisable to
Xin, produces an output mostly major and minor split the data such that all
Z Prob , and is described by ex chords – live in the short folds are, ideally,
a set of weight parameters p(
Zl head, while more obscure identically distributed with
Wl, where l indexes the chords may occur only a respect to chord
layer of the machine. handful of times. transitions. Doing so
Layers are stacked Traditionally, due to both should result in roughly
consecutively, such that a sparse representation of equivalent performance
the output Zl is taken as the chord classes in the long across various splits of the
input Xl+1 of the following tail and a general data, and in some scenarios
layer. Each layer consists simplification of the task, even eliminate the need to
of a linear projection tt(X, all chord labels are train and test over all k
W ), a hyperbolic tangent resolved to twenty-five folds.
activation and an optional classes: 12 major, 12
down-sampling, or To create similarly
minor, and a waste-basket,
pooling, operation, distributed folds of the
“no-chord” class. While
expressed generally by this mapping reduces the data, we use the genetic
space of possible classes, algorithm ×
(GA) to
Zl = it is worth noting that this minimize the variance of
pool(tanh(tt(Xl , Wl ) + process × also introduces chord transition
Wl0 )). additional intra-class distributions. First chord
variance. transition histograms are
The function tt of a tallied separately for each
convolutional layer is a 3- 2) Training Strategy: track, yielding a (25 25
dimensional operation, 475) tensor H. A 1-of-k
Being mindful of nuances
where an input tensor Xl, binary assignment matrix
of the dataset, special
known as a collection of N A, with a shape of (475 k),
attention must be paid to
feature maps, is convolved can then be used to
how training is conducted.
with another tensor of determine fold ownership.
The size of the dataset
weights Wl, referred to as∞ The fitness of a matrix Ai
ΣbyΣ ∞ Σ requires k-fold cross val-
kernels, defined (2); the can be computed as the
input Xin can be considered idation, but it is poor
practice to stratify the data variance of its matrix
as a special instance of a
arbitrarily. As we intend product with H. We run
feature map where N = 1.
on classifying the GA by randomly
observations on the order initializing a “population”
N
of seconds, musical of assignment matrices,
tt(X, W ) = X[i, j, followed
k]Wi[m−j, n−k] (2) Σ s C ( by evaluating the fitness of
o ie
4
i=0 j=−∞ f =x
) each, assigning
k=−∞ t
m 1
p probabilities accordingly,
The function tt of a a and randomly selecting
x Z
fully-connected layer is the ( “parents” to merge. A new
product of a flattened input Z
l population is created by
vector Xl and a matrix of ) randomly keeping rows
=
weights Wl, given in from
tt(X, W ) = (X × W ). B. Feature
Learning
Finally, the output ZProb
Given a trainable optimization of its
is defined as a softmax
architecture and a parameters. Despite early
activation over C chord
differentiable ob- jective concerns to the contrary,
classes, as in (4), such that
function, the true power of stochastic gradient methods
the ith output is bounded on
data-driven methods is have been shown to
the interval [0, 1] and the
realized in the numerical converge to “good”
358
35
solutions and labeled data, each pair of parents, raw audio and is linear in divisive contrast
when plentiful, can be used yielding a new set of pitch, allowing kernels to normalization [5], which
to fit the function directly. assignment ma- trices. translate over an input. serves as a frequency-
This process is not always Despite concerns about Importantly, this filterbank varying gain control.
trivial, however, and these GAs being front-end can be interpreted
two facets – labeled computationally expensive as hard-coding the first B. Experimental Objectives
ground truth data and and having no layer of a larger One reality faced in the
training strategy – require convergence guarantees, it convolutional architecture. development of any moder-
due consideration. proved to be quite fast for We implement the ately novel system is that
1) Ground Truth Data: this application and constant-Q transform as a there is minimal precedent
Originating from a variety actually converged to a time-domain filterbank of to guide architectural design
of different sources, we consistent local minima. complex-valued Gabor decisions. That being said,
conduct our work on a set filters, spaced at 36 filters the classic consideration in
of 475 music recordings, For training, we adopt a
per octave and tuned such neural network research is
consisting of 181 songs mini-batch variant of
that the duration of each that of model complexity,
from Christopher Harte’s stochastic gradient
filter corresponds to 40 which manifests here in two
descent, where the loss
Beatles dataset2, 100 songs cycles of the center main ways: the number and
function, defined as the
from the RWC Pop dataset frequency, yielding 252 the dimensionality of the
Neg- ative log-likelihood,
and 194 songs from the US filters spanning 27.5– kernels. Detailed in Table I,
is evaluated over a set of
Pop dataset3. Based on the 1760Hz. Audio signals are we define three kernel
training data at each
provided annotations, there first downsampled to quantities and two kernel
update step. To prevent shapes, for a total of six
are some 800 unique chord 7040Hz and the filterbank
any class bias, mini- architectures. Convolutional
labels provided for just is applied to centered
batches are formed as an layers are described by a
over 50k distinct chord windows at a frame rate of
integer multiple of 25 and kernel tensor K of shape
instances. 40Hz. This over-sampled
a uniform class
pitch spectra is then (Nin, Nout, time,
2
http://isophonics. distribution is imposed; in
reduced to a frame rate of frequency), and an
net/content/referen this work, we use a batch
ce-annotations- 4Hz by mean-filtering each optional pooling operation
size of 150, or 6 instances
beatles subband with a 15-point P of shape (time,
3 of each class per mini-
https://github.co window and decimating in frequency). Fully-
m/tmc323/Chord- batch. We split the five
Annotations
time by a factor of 10. connected layers are
folds into 3–1–1 for
There are two additional described by a single
training, validation, and
data manipulations worth weight matrix W of (Nin,
test, respectively, and
men- tioning here that we Nout).
introduce an early-
refer to collectively as Model complexity is of
stopping criterion by
extended training data particular interest, given
computing the
(ETD). First, the linearity the size of the dataset –
classification error over
of pitch in a constant-Q which is modest compared
the entire validation set
representation affords the to other fields and
every 50 iterations. Every
ability to “transpose” an applications – and only a
training run is conducted
obser- vation as if it were a limited sense of how
for 6000 iterations, and the
true data point in a different
weights that produce the
pitch class by literally
best validation score (best)
shifting the pitch tile and
are stored for subsequent
changing the label
evaluation.
accordingly. In other
III. M E time-domain audio, the words, every data point can
TH system is simplified by count toward each chord
OD
OL using musical knowledge class of the same mode
OG to pro- duce perceptually (Major or minor),
Y motivated pitch spectra via effectively increasing the
A. Input Representation the constant- Q transform. amount of training data by
Transforming audio signals a factor of 12. Another
Though theoretically
to time-frequency preliminary processing
conceivable that a
representations provides stage, popularized in
convolutional architecture
the dual benefits of a lower computer vision, is the
could be applied to raw,
dimen- sional input than application of subtractive-
358
36
T R
A
Previously published range [1], but it is
a
b
C numbers on this dataset encouraging that a CNN
Y
l fall in the upper 70% could reach
e
Arch 3–1 this benchmark in a first This is a particularly
I Fold Train Valid Test attempt. Additionally, we interesting outcome because
C 1 83.2 77.6 77.8 found that median filtering transposing the input pitch
N 2 83.6 78.2 76.9 the output probability spectra should have
N
3 82.0 78.1 78.3 surface with a window of negligible effects of the
A 4 83.6 78.6 76.8 length 5 (approximately 1 learned kernels. It is
R 5 81.7 76.5 77.7 second) before tak- ing the therefore reasonable to
C
H Total 82.81 77.80 77.48
frame-wise argmax() assume that over-fitting
I slightly boosted occurs in the fully
T
E performance across the connected layers of the
C
T board, but never more than network and not necessarily
complicated the
U 1%; even so, all results in the convolutional ones.
R underlying task might be.
E presented here include this This also raises an
S In addition to simply
operation. interesting question in the
determining the best
–1
Given the performance process: why can’t these
performing configuration,
consistency across folds models over-fit the ETD
3 K:(1, 16, 6, 25), P:(1, 3) there are a few questions a
and our interest in general condition?
K:(16, 20, 6, 27) data-driven approach is
trends rather than statistical Focusing our on Arch:3–
K:(20, 24, 6, 27) particularly suited to
significance, we conduct a 1, one potential cause of
W:(1440, 200) address: If and how does
parameter sweep using a over- fitting is due to an
W:(200, 25) the system over-fit the
leave-one-out (LoO) under-representation of
2 K:(1, 6, 6, 25), P:(1, 3) training data, and are there
strategy in the spirit of chord classes in the dataset.
K:(6, 9, 6, 27) cases when it cannot? How
computational efficiency. Figure 3 plots the accuracy
K:(9, 12, 6, 27) important is the choice of
Noting that training and differential between ETD
architecture? And what are
W:(720, 125) validation performance are conditions for both training
the effects of using ETD?
W:(125, 25) lowest for the fifth fold, we and test as a function of
1 K:(1, 4, 6, 25), P:(1, 3)
IV. RESULT select this stratification chord class, ranked by
K:(4, 6, 6, 27) S&
scenario to train each occurrence in the dataset,
K:(6, 8, 6, 27) D ISCUS architecture in Table I and indi- cates that this is
W:(480, 50) SION under both True and False not the root cause. ETD
W:(50, 25) ETD conditions based on reduces over-fitting, but it
As an initial step toward
the premise that this split of does so uniformly and
assessing the outlined
T the data is likely the most there is only a weak
experimen- tal objectives,
a difficult. Each architecture positive correlation between
b we wish to determine both
is trained similarly as train-test discrepancy and
l the performance ceiling of
e before, and the best ranked chord type. In other
this approach and the
performing weight words, all chord classes
variation between folds of
I parameters over the benefit equally from ETD,
I the data. Informal tests
validation set are used for which is more characteristic
5 indicated large-kernel
- final evaluation; overall of widespread intra-class
architectures with ETD
F results are given in Table variance than some classes
O showed the most promise,
L III. being inadequately
D so Arch:3–1 and Arch:1–1
The most noticeable represented. If this is
were trained and evaluated
result of this parameter indeed the case, there are
R across all five folds. The
E sweep is the accuracy two main sources of intra-
C
results of this experiment,
differential in performance class variance: mapping all
O shown in Table II, provide
G on the training set chord classes to Major-
N two important insights:
I between ETD conditions. minor, or simple labeling
T CNN chord recognition
Transposing the training error. To assess the former,
I performs competitively
O data improves Figure 4 plots the accuracy
N with the state of the art
generalization, in addition for Major-minor (Mm)
and performance dis-
A to reducing the extent to versus all other
crepancies between folds
C which the network can (O) chords that are not
C fall within a 2% margin.
U over-fit the training data.
358
36
strictly annotated as root- T after ETD for the train,
position Major-minors in a validation, and test
the train (Tr) and test (Te) b datasets. Though the
l
conditions, with and e accuracy over most tracks
without ETD. This figure is unaffected by the ETD
illustrates that a good deal I condition, as evidenced by
of over-fitting in the I the near-zero mode of the
I
training set is due to Other distributions, certain tracks
chord types mapped to P
in the training set do much
Major-minor, which ETD A worse when the data is
R
reduces, while A transposed. We can take
generalization performance M
E
this to mean that some
for Other chord types is T tracks are particularly
E
almost equal in both ETD R problematic, and this
conditions. Therefore, it is should be explored further
logical to conclude that S in future work. This is
W
ETD distributes the added E intuitively satisfying
variance of Other chord E because the repetitive
P
types across all classes nature of music would
evenly, and the machine – likely cause single tracks
learns to contain multiple
R instances of rare chords
E
S
and outliers in the dataset,
U and suffer the most when
L
T these few data points are
S
not over-fit by the
machine.
ETD:False
Arch Train Valid Test V. C ONCLUSION
1-1 84.7 74.9 75.6 S & F UTURE
1-2 87.0 73.1 74.5 W ORK
2-1 85.5 75.0 75.5 In this work, we have
2-2 91.2 73.9 74.0presented a new approach
3-1 92.0 75.2 75.5 to tackling the well-worn
3-2 91.7 73.6 73.8 task of automatic chord
recognition. Viewing
harmony as a hierarchy of
to ignore much of it as
noise, while also learning
a better strict Major-minor
model in the process.
There is also the intervallic relationships,
concern that over-fitting we train a convolutional
may be a function of neural network to classify
specific tracks. Figure 5 five- second tiles of pitch
shows the track-wise spectra, yielding state of
histograms of accuracy the art perfor-
differential before and
358
36
training, either through convolutional Deep Belief Nets or
stacked Autoencoders, may have a substantial impact on
improving performance for sparsely represented chords. To
this point, there is also the possibility of incorporating other
work in transfer, or one-shot, learning, where new classes
can be discriminated with only a few training examples.
Going forward, the greatest potential of this approach will
likely be realized in extending to larger, more realistic
chord vocabularies.
ACKNOWLEDGMENT
This material is based upon work supported by the Na-
tional Science Foundation under grant IIS-0844654.
REFERENCES
[1] T. Cho and J. P. Bello, “A Feature Smoothing Method For
Figure 4. Effects of transposition on recognition accuracy as a function Chord Recognition Using Recurrence Plots.” In Proc. ISMIR,
explicitly labeled Major-Minor chords (dark bars), versus mapping other 2011.
chord types (lighter bars) to the nearest Major-Minor for training (blue) and
test (green), for ETD:False (left) and ETD:True (right). [2] T. Cho, R. J. Weiss and J. P. Bello, “Exploring Common
Variations in State of the Art Chord Recognition Systems.”
In Proc. SMC, 2010.