ABSTRACT

In this paper we present the INESC Key Detection (IKD) system, which incorporates a novel method for dynamically biasing key mode estimation using the spatial displacement of beat-synchronous Tonal Interval Vectors (TIVs). We evaluate the performance of the IKD system at finding the global key on three annotated audio datasets and using three key-defining profiles. Results demonstrate the effectiveness of the mode bias in favoring either the major or minor mode, thus allowing users to fine-tune this variable to improve correct key estimates on style-specific music datasets or to balance predictions across key modes on unknown input sources.

Index Terms: Audio key estimation, tonal pitch representation, music signal processing, music information retrieval.

Project TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020, financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF). GB is also supported by FCT, the Portuguese Foundation for Science and Technology, under the post-doctoral grant SFRH/BPD/109457/2015.

1. INTRODUCTION

Key or tonality is a prominent concept in Western music. It is defined by a pitch class (tonic) and a mode (major or minor), whose combination establishes a system of relations between pitches in both the vertical and horizontal dimensions of musical structure [1]. Key estimation from musical audio has been extensively researched within the music information retrieval community [2, 3, 4], as it provides important annotations for enhanced navigation and retrieval in large music collections, as well as contributing to music-creative tasks such as harmonic DJ mixing [5].

However, the majority of existing key estimation systems rely on the same fundamental principle: key profiles expressing pitch class distributions are compared against a similar representation obtained from analyzing a musical piece or excerpt, in order to estimate the most probable key. Research on key finding devotes great effort to the creation and evaluation of the different key profiles proposed in the literature [6, 7, 2]. Yet, these findings must be understood in the context of the datasets on which they are evaluated, as it has been shown that different key profiles explicitly favor either major or minor key modes [8]. Given that the most widely used datasets in the evaluation of audio key estimation systems have pronounced divergences in key mode distribution (e.g., a strong bias towards the major mode in the Beatles collection [9] and the minor mode in the GiantSteps dataset [3]), we believe this has led to an intrinsic bias in current key detection systems, which are adapted to convey better estimations in either minor or major modes, but not both. In turn, this limits the generality of a particular method in finding the key of unknown musical inputs.

In light of these findings, we introduce a strategy to explore key mode estimates of the IKD system [10, 11], a key detection method based on the Tonal Interval Space [11], without the need to hand-tune key profiles. The geometric properties of the Tonal Interval Space allow us to easily adapt key mode estimation by introducing spatial displacements to the input, a non-trivial task in the metric spaces commonly used in the related literature. This not only enables users to bias the system towards correct major or minor mode estimates, which has been shown to be an important feature for style-specific key detection [4], but can also balance the correct number of estimates across modes for enhanced results on unknown musical inputs. We demonstrate the efficacy of our approach by explicitly manipulating its accuracy on existing annotated datasets comprised of excerpts in predominantly major and minor modes.

The remainder of this paper is structured as follows. Section 2 provides an overview of the Tonal Interval Space, as well as the distance metrics computed in the space relevant to the IKD system. Section 3 starts by presenting the architecture of our system, followed by a detailed description of each of the component modules, with particular emphasis on the novelty of our approach, i.e., the use of a mode bias in the key detection method. Sections 4 and 5 present an objective evaluation of the IKD system and, finally, in Section 6 we draw conclusions and state areas for future work.

2. OVERVIEW OF THE TONAL INTERVAL SPACE

The system reported in this paper is based on the Tonal Interval Space [12], an extended tonal pitch space in the context of the Tonnetz [13]. The most salient pitch levels of tonal Western music (pitches, chords, and keys) can be represented as unique locations in the space, as Tonal Interval Vectors (TIVs), from music encoded as either symbolic or audio data. A predominant feature of the Tonal Interval Space is the ability to compute theoretical and perceptual aspects of Western tonal music, such as indicators of multi-level tonal pitch relatedness and consonance, as distances.

In this paper, we focus on audio signal representations in the Tonal Interval Space, due to their relevance in the key estimation system under discussion. To represent an audio signal in the Tonal Interval Space, we first aggregate the energy of each pitch class in a 12-dimensional chroma vector, c(n), and compute a Tonal Interval Vector, T(k), as its L1-normalized Discrete Fourier Transform (DFT), such that:
    T(k) = w(k) \sum_{n=0}^{N-1} \bar{c}(n) \, e^{-j 2\pi k n / N}, \quad k \in \mathbb{Z}    (1)
where N = 12 is the dimension of the chroma vector and w(k) = {2, 11, 17, 16, 19, 7} are weights derived from empirical consonance ratings of dyads, used to adjust the contribution of each dimension k of the space (or interpreted musical interval), making it a perceptually relevant space in comparison to its non-weighted version [12]. We set 1 ≤ k ≤ 6 for T(k) since the remaining coefficients are symmetric. T(k) uses \bar{c}(n), which is c(n) normalized by the DC component T(0) = \sum_{n=0}^{N-1} c(n), to allow the representation and comparison of different hierarchical levels of tonal pitch [12].
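To make Eq. (1) concrete, the following minimal numpy sketch maps a 12-bin chroma vector to its weighted 6-coefficient TIV. The function name, the silent-frame guard, and the use of numpy are our own illustrative choices; only the weights and the DFT convention come from the text above.

```python
import numpy as np

# Consonance-derived weights for coefficients k = 1..6 (Section 2).
WEIGHTS = np.array([2, 11, 17, 16, 19, 7], dtype=float)

def tonal_interval_vector(chroma):
    """Compute the TIV of a 12-bin chroma vector as in Eq. (1)."""
    c = np.asarray(chroma, dtype=float)
    dc = c.sum()                       # T(0), the DC component
    if dc == 0:                        # silent frame: place it at the center
        return np.zeros(6, dtype=complex)
    c_bar = c / dc                     # \bar{c}(n): chroma normalized by T(0)
    spectrum = np.fft.fft(c_bar)       # numpy follows the e^{-j2πkn/N} convention of Eq. (1)
    return WEIGHTS * spectrum[1:7]     # keep k = 1..6; the remaining coefficients are symmetric
```

For example, a C major triad corresponds to the chroma vector [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0].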
The resulting spatial location of tonal pitch in the Tonal Interval Space ensures that configurations understood as perceptually related within the Western tonal music context correspond to small Euclidean distances [12]. Relevant here are the distances among the 24 major and minor key TIVs, which are sparsely and (per mode) equidistantly represented in the space. Neighboring key TIVs adhere to theoretical and perceptual relations (e.g., in the neighborhood of each key TIV, we find its dominant, subdominant, and relative keys) [12]. Furthermore, the set of diatonic pitch classes and chords of each key are at smaller distances than non-diatonic pitch configurations, allowing us to infer the key TIV from a collection of pitch and chord configurations. A final property of the space relevant to our study is the constant vector norm of transposition-invariant configurations. This property indicates that, for example, all major keys are at the same distance from the center (the same applies to all minor keys). Additionally, due to the difference in intervallic relations between major and minor keys, a consistent vector norm difference exists between these two sets of configurations; thus the ideal binarised TIVs (containing only the notes of each scale) for harmonic minor keys are closer to the centre of the Tonal Interval Space than those for major keys.
3. THE IKD SYSTEM

Fig. 1 shows the architecture of the IKD system. The first module is responsible for performing a beat segmentation on a musical audio input, whose onset times are then used to compute beat-synchronous TIVs. Given that harmonic changes typically occur on beats [14], we adopt beat segments as the temporal resolution for representing the harmonic content, in order to maximize the efficiency of the system while minimizing the likelihood of two chords being temporally merged. The second module introduces a spatial displacement to the input beat-synchronous TIVs to bias or balance the inference of key mode, based on the vector norm difference between major and minor keys in the Tonal Interval Space. Finally, the third module computes the distance between the displaced input TIVs and 12 major and 12 minor TIV key-defining profiles, and finds the most probable key as that with the smallest distance.

3.1. Beat-Synchronous TIVs

Chroma vectors are extracted with Sonic Annotator [16] using default parameters, including both tuning correction and spectral whitening. Each chroma vector is calculated over a 46 ms frame. Next, we extract beat locations from the same input audio signal, using the QM-VAMP bar and beat tracking plugin [17], also within Sonic Annotator. To compute the beat-synchronous chroma vectors, we then take the median value per chroma bin across all frames within each beat, b. Finally, we apply Eq. 1 to compute the beat-synchronous TIVs.
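As a hypothetical sketch of this aggregation step (the array names and the frame-index representation of beats are assumptions, since the text specifies Sonic Annotator outputs rather than a particular data layout):

```python
import numpy as np

def beat_sync_chroma(chroma_frames, beat_frames):
    """Median-aggregate frame-wise chroma into one vector per beat.

    chroma_frames: (n_frames, 12) array of 46 ms chroma frames.
    beat_frames: sorted frame indices of the detected beat onsets.
    """
    return np.array([
        np.median(chroma_frames[start:end], axis=0)   # per-bin median within beat b
        for start, end in zip(beat_frames[:-1], beat_frames[1:])
    ])
```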
3.2. Mode Bias

The principal novelty of our key estimation method, in comparison to related template-based key estimation methods, is the introduction of a key mode bias, δ, which exploits the vector norm difference between major and minor keys in the Tonal Interval Space. This variable adjusts the location of the input beat-synchronous TIVs to favor key estimates in either the major or the minor mode. This can be better understood in the 2-dimensional illustration of the key level in the Tonal Interval Space shown in Fig. 2. When δ < 1, we pull input vectors towards the center of the space (i.e., decrease their norm), thus favoring minor key estimates. On the other hand, when δ > 1, we push them towards the edge of the space (i.e., increase their norm), thus favoring major key estimates.

Fig. 2. Illustrative example of the key level in the Tonal Interval Space mapped into 2 dimensions. Upper and lower case letters represent major and minor keys, respectively, along with their corresponding vector norm. By altering the norm of an input TIV (represented as a circle) using different values of δ, we show the impact of the mode bias on key estimates, which oscillate between key modes, notably the relative C major (for values of δ > 1, represented as a square) and A minor keys (for values of δ < 1, represented as a triangle).
3.3. Key Profiles

The 24 key TIVs are obtained by transposition, which in the Tonal Interval Space corresponds to rotating each coefficient T(k) by (2πkr)/N radians, where r = [0, 11] semitones. Further details on the transposition of pitch configurations in the Tonal Interval Space by means of TIV rotation can be found in [10].
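Under the DFT convention of Eq. (1), this rotation follows from the shift theorem: circularly shifting a chroma vector by r bins multiplies coefficient k by e^{-j2πkr/N}. A small sketch (the function name is ours):

```python
import numpy as np

def transpose_tiv(tiv, r, n_bins=12):
    """Transpose a TIV by r semitones via coefficient-wise rotation.

    Circularly shifting the underlying chroma vector by r bins rotates
    coefficient k by (2*pi*k*r)/N radians, i.e., multiplies it by
    e^{-j*2*pi*k*r/N} under the Eq. (1) convention.
    """
    k = np.arange(1, 7)                  # the TIV keeps coefficients k = 1..6
    return np.asarray(tiv, dtype=complex) * np.exp(-2j * np.pi * k * r / n_bins)
```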
3.4. Key Estimates as Minimal Cumulative Distances

Based on the assumption that a key-indicating element is the use of its diatonic pitch set and chords, we define a method for estimating the global key of a musical example, R_min, in the Tonal Interval Space by finding the minimum of a function which accumulates over time the distances of the query beat-synchronous TIVs, T_b(k), from the 12 major and 12 minor key TIVs, such that:

    R_{min} = \operatorname*{argmin}_{r} \sum_{b=1}^{B} \sqrt{\sum_{k=1}^{6} \left| T_b(k) - T_r^p(k) \right|^2}    (2)
where T_r^p are the 24 major and minor key TIVs, derived from the collection of three different key profiles, p. When r ≤ 11, we adopt the major profile and when r ≥ 12, the minor profile. To limit the influence of silent or noisy (inharmonic) beats, b, we only retain those for which T(0) > 0.1, where B is the total number of retained beat-synchronous TIVs. By default, the mode bias δ = 1 (i.e., no spatial displacement of the input beat-synchronous TIVs is introduced) and it can be adjusted to favor either the major or the minor mode, as detailed in Section 3.2. The system output is a number, R_min, ranging between 0-11 for major keys and 12-23 for minor keys, where 0 corresponds to C major, 1 to C# major, and so on through to 23 being B minor.
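Putting Eq. (2) together with the beat filtering and the mode bias described above, a sketch of the full decision rule might look as follows; it reuses tonal_interval_vector from the Section 2 sketch, and the precomputed key_tivs array is an assumed input:

```python
import numpy as np

def estimate_key(beat_chromas, key_tivs, delta=1.0):
    """Estimate the global key R_min via Eq. (2).

    beat_chromas: iterable of 12-bin beat-synchronous chroma vectors.
    key_tivs: (24, 6) complex array of key TIVs T_r^p; rows 0-11 are the
        major keys (C, C#, ...) and rows 12-23 the minor keys.
    Returns R_min in 0..23 (0 = C major, ..., 23 = B minor).
    """
    cumulative = np.zeros(24)
    for c in beat_chromas:
        c = np.asarray(c, dtype=float)
        if c.sum() <= 0.1:                        # discard beats with T(0) <= 0.1
            continue
        tiv = delta * tonal_interval_vector(c)    # mode bias as in Section 3.2
        cumulative += np.sqrt(np.sum(np.abs(tiv - key_tivs) ** 2, axis=1))
    return int(np.argmin(cumulative))
```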
4. EVALUATION

We undertake an objective assessment of the IKD system in estimating the global key from musical audio, focusing on the implications of the mode biasing strategy on three different datasets and for three key-defining templates. By adopting different values of δ (both greater than and less than 1) in Eq. 2, we aim to show that: i) our mode bias can improve performance on either major or minor modes by increasing and decreasing δ, respectively; ii) overall results on correct key estimates can be improved by adopting a balanced δ value; and iii) key-defining profiles have a tendency to privilege either the major or the minor mode.

We use three audio datasets with key annotations made by experts in our evaluation. Combined, this collection provides a total of 879 musical examples with heterogeneous genre and timbre qualities. The first dataset consists of the initial 30 seconds of 96 classical musical examples evenly distributed across modes and tonics (4 musical examples per key), used in the MIREX Audio Key Estimation task [18]. The second dataset includes the first 30 seconds of 179 Beatles songs [9], with 89.4% of examples in the major mode. The third dataset is the GiantSteps collection [3], which consists of the initial 2 minutes of 604 EDM examples across 23 sub-genres, with 84.8% of the data in the minor mode. As discussed in the introduction, the use of datasets with an even mode distribution is an important design decision in the evaluation of systems for key estimation on unknown input. While the MIREX training set fulfills this criterion, the two remaining datasets favor different modes, which we use as a strategy to understand the behavior of our mode bias algorithm. To this end, we expect to improve the baseline results (i.e., when δ = 1) on the Beatles and GiantSteps datasets by increasing and decreasing δ, respectively.

5. RESULTS

Fig. 4 shows the performance of our IKD system on the three datasets under evaluation, for which we provide a score for δ = [0.05, 20] across each profile, p. To allow a fair comparison with previous studies, we use the MIREX evaluation procedure [19], which is widely applied in key estimation studies, where correct and neighboring key estimates are weighted and averaged into a final score according to the following point assignment: correct (1), dominant/subdominant (.5), relative (.3), parallel (.2), and others (0).
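A sketch of this weighted scoring under the key encoding of Section 3.4 (0-11 major, 12-23 minor); the neighbor tests below reflect our reading of the MIREX categories, with dominant/subdominant taken as same-mode keys a perfect fifth apart:

```python
def mirex_score(estimates, truths):
    """Weighted MIREX key score: 1 / .5 / .3 / .2 / 0 points per estimate."""
    def points(est, ref):
        if est == ref:
            return 1.0
        same_mode = (est < 12) == (ref < 12)
        if same_mode and (est - ref) % 12 in (5, 7):    # tonics a fifth apart
            return 0.5                                  # dominant/subdominant
        if not same_mode:
            maj, mnr = (est, ref) if est < 12 else (ref, est)
            if (mnr - 12 - maj) % 12 == 9:              # e.g., C major vs. A minor
                return 0.3                              # relative key
            if mnr - 12 == maj:                         # e.g., C major vs. C minor
                return 0.2                              # parallel key
        return 0.0
    return sum(points(e, t) for e, t in zip(estimates, truths)) / len(truths)
```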
The most immediate observation we can draw from our results in Fig. 4 is the effectiveness of the mode bias in regulating the tendency of mode prediction, confirming the expected tendencies in the evolution of correct estimates on the Beatles and GiantSteps datasets shown in Fig. 4 (b) and (c). As the vast majority of musical examples in the Beatles dataset are in the major mode, the ascending accuracy curve shows the expected improvements for the three key-defining templates as δ increases. Equally, the GiantSteps dataset reinforces the effectiveness of the mode bias by showing the contrary tendency, i.e., smaller values of δ result in better predictions. On the other hand, the results for the evenly distributed MIREX training set generate a less asymmetric curve over the same range of δ values. The inflection point of each key profile curve on the MIREX training set results, shown in Fig. 4 (a), can be considered the optimal value of δ, providing the best and most balanced results across key modes for this dataset.
Fig. 4. Performance of the IKD system for the three datasets under evaluation: a. MIREX training set, b. Beatles, and c. GiantSteps. Each dataset was evaluated using three key profiles (Aarden (T^a) [7], Sha'ath (T^s) [2], and Temperley (T^t) [6]), over a range of values for the mode bias δ = [0.05, 20].