ABSTRACT

In this paper we present the INESC Key Detection (IKD) system, which incorporates a novel method for dynamically biasing key mode estimation using the spatial displacement of beat-synchronous Tonal Interval Vectors (TIVs). We evaluate the performance of the IKD system at finding the global key on three annotated audio datasets and using three key-defining profiles. Results demonstrate the effectiveness of the mode bias in favoring either the major or minor mode, thus allowing users to fine-tune this variable to improve correct key estimates on style-specific music datasets or to balance predictions across key modes on unknown input sources.

Index Terms: Audio key estimation, tonal pitch representation, music signal processing, music information retrieval.

Project TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact/NORTE-01-0145-FEDER-000020, financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF). GB is also supported by FCT, the Portuguese Foundation for Science and Technology, under the post-doctoral grant SFRH/BPD/109457/2015.

1. INTRODUCTION

Key or tonality is a prominent concept in Western music. It is defined by a pitch class (tonic) and a mode (major or minor), whose combination establishes a system of relations between pitches in both the vertical and horizontal dimensions of musical structure [1]. Key estimation from musical audio has been extensively researched within the music information retrieval community [2, 3, 4], as it provides important annotations for enhanced navigation and retrieval in large music collections, as well as contributing to music-creative tasks such as harmonic DJ mixing [5].

However, the majority of existing key estimation systems rely on the same fundamental principle: key profiles expressing pitch class distributions are compared against a similar representation obtained from analyzing a musical piece or excerpt, in order to estimate the most probable key. Research on key finding devotes great effort to the creation and evaluation of the different key profiles proposed in the literature [6, 7, 2]. Yet, these findings must be understood in the context of the datasets on which they are evaluated, as it has been shown that different key profiles explicitly favor either major or minor key modes [8]. Given that the most widely used datasets in the evaluation of audio key estimation systems have pronounced divergences in key mode distribution (e.g., a strong bias towards the major mode in the Beatles collection [9] and the minor mode in the GiantSteps dataset [3]), we believe this has led to an intrinsic bias in current key detection systems, which are adapted to convey better estimations in either minor or major modes, but not both. In turn, this limits the generality of a particular method in finding the key of unknown musical inputs.

In light of these findings, we introduce a strategy to explore key mode estimates of the IKD system [10, 11], a key detection method based on the Tonal Interval Space [11], without the need to hand-tune key profiles. The geometric properties of the Tonal Interval Space allow us to easily adapt key mode estimation by introducing spatial displacements to the input, a non-trivial task in the metric spaces commonly used in the related literature. This not only enables users to bias the system towards correct major or minor mode estimates, which has been shown to be an important feature for style-specific key detection [4], but can also balance the correct number of estimates across modes for enhanced results on unknown musical inputs. We demonstrate the efficacy of our approach by explicitly manipulating its accuracy on existing annotated datasets comprised of excerpts in predominantly major and minor modes.

The remainder of this paper is structured as follows. Section 2 provides an overview of the Tonal Interval Space, as well as the distance metrics computed in the space relevant to the IKD system. Section 3 starts by presenting the architecture of our system, followed by a detailed description of each of the component modules, with particular emphasis on the novelty of our approach, i.e., the use of a mode bias in the key detection method. Sections 4 and 5 present an objective evaluation of the IKD system and, finally, in Section 6 we draw conclusions and state areas for future work.

2. OVERVIEW OF THE TONAL INTERVAL SPACE

The system reported in this paper is based on the Tonal Interval Space [12], an extended tonal pitch space in the context of the Tonnetz [13]. The most salient pitch levels of tonal Western music (pitches, chords, and keys) can be represented as unique locations in the space, as Tonal Interval Vectors (TIVs), from music encoded as either symbolic or audio data. A predominant feature of the Tonal Interval Space is the ability to compute theoretical and perceptual aspects of Western tonal music, such as indicators of multi-level tonal pitch relatedness and consonance, as distances.

In this paper, we focus on audio signal representations in the Tonal Interval Space, due to their relevance in the key estimation system under discussion. To represent an audio signal in the Tonal Interval Space, we first aggregate the energy of each pitch class in a 12-dimensional chroma vector, c(n), and compute a Tonal Interval Vector, T(k), as its L1-normalized Discrete Fourier Transform (DFT), such that:
    T(k) = w(k) \sum_{n=0}^{N-1} \bar{c}(n) \, e^{-j 2\pi k n / N}, \quad k \in \mathbb{Z}    (1)
where N = 12 is the dimension of the chroma vector and w(k) = {2, 11, 17, 16, 19, 7} are weights derived from empirical consonance ratings of dyads, used to adjust the contribution of each dimension k of the space (or interpreted musical interval), making it a perceptually relevant space in comparison to its non-weighted version [12]. We set 1 ≤ k ≤ 6 for T(k) since the remaining coefficients are symmetric. T(k) uses \bar{c}(n), which is c(n) normalized by the DC component T(0) = \sum_{n=0}^{N-1} c(n), to allow the representation and comparison of different hierarchical levels of tonal pitch [12].
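To make Eq. (1) concrete, the following minimal numpy sketch maps a 12-bin chroma vector to its weighted 6-coefficient TIV. The function name, the silent-frame guard, and the use of numpy are our own illustrative choices; only the weights and the DFT convention come from the text above.

```python
import numpy as np

# Consonance-derived weights for coefficients k = 1..6 (Section 2).
WEIGHTS = np.array([2, 11, 17, 16, 19, 7], dtype=float)

def tonal_interval_vector(chroma):
    """Compute the TIV of a 12-bin chroma vector as in Eq. (1)."""
    c = np.asarray(chroma, dtype=float)
    dc = c.sum()                       # T(0), the DC component
    if dc == 0:                        # silent frame: place it at the center
        return np.zeros(6, dtype=complex)
    c_bar = c / dc                     # \bar{c}(n): chroma normalized by T(0)
    spectrum = np.fft.fft(c_bar)       # numpy follows the e^{-j2πkn/N} convention of Eq. (1)
    return WEIGHTS * spectrum[1:7]     # keep k = 1..6; the remaining coefficients are symmetric
```

For example, a C major triad corresponds to the chroma vector [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0].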
The resulting spatial location of tonal pitch in the Tonal Interval Space ensures that configurations understood as perceptually related within the Western tonal music context correspond to small Euclidean distances [12]. Relevant here are the distances among the 24 major and minor key TIVs, which are sparsely and (per mode) equidistantly represented in the space. Neighboring key TIVs adhere to theoretical and perceptual relations (e.g., in the neighborhood of each key TIV, we find its dominant, subdominant, and relative keys) [12]. Furthermore, the set of diatonic pitch classes and chords of each key are at smaller distances than non-diatonic pitch configurations, allowing us to infer the key TIV from a collection of pitch and chord configurations. A final property of the space relevant to our study is the constant vector norm of transposition-invariant configurations. This property indicates that, for example, all major keys are at the same distance from the center (the same applies to all minor keys). Additionally, due to the difference in intervallic relations between major and minor keys, a consistent vector norm difference exists between these two sets of configurations; thus the ideal binarised TIVs (containing only the notes of each scale) for harmonic minor keys are closer to the centre of the Tonal Interval Space than those for major keys.
3. THE IKD SYSTEM

Fig. 1 shows the architecture of the IKD system. The first module is responsible for performing a beat segmentation on a musical audio input, whose onset times are then used to compute beat-synchronous TIVs. Given that harmonic changes typically occur on beats [14], we adopt beat segments as the temporal resolution for representing the harmonic content, in order to maximize the efficiency of the system while minimizing the likelihood of two chords being temporally merged. The second module introduces a spatial displacement to the input beat-synchronous TIVs to bias or balance the inference of key mode, based on the vector norm difference between major and minor keys in the Tonal Interval Space. Finally, the third module computes the distance between the displaced input TIVs and 12 major and 12 minor TIV key-defining profiles, and finds the most probable key as that with the smallest distance.

3.1. Beat-Synchronous TIVs

Chroma vectors are extracted with Sonic Annotator [16] using default parameters, including both tuning correction and spectral whitening. Each chroma vector is calculated over a 46 ms frame. Next, we extract beat locations from the same input audio signal, using the QM-VAMP bar and beat tracking plugin [17], also within Sonic Annotator. To compute the beat-synchronous chroma vectors, we then take the median value per chroma bin across all frames within each beat, b. Finally, we apply Eq. 1 to compute the beat-synchronous TIVs.
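As a hypothetical sketch of this aggregation step (the array names and the frame-index representation of beats are assumptions, since the text specifies Sonic Annotator outputs rather than a particular data layout):

```python
import numpy as np

def beat_sync_chroma(chroma_frames, beat_frames):
    """Median-aggregate frame-wise chroma into one vector per beat.

    chroma_frames: (n_frames, 12) array of 46 ms chroma frames.
    beat_frames: sorted frame indices of the detected beat onsets.
    """
    return np.array([
        np.median(chroma_frames[start:end], axis=0)   # per-bin median within beat b
        for start, end in zip(beat_frames[:-1], beat_frames[1:])
    ])
```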
3.2. Mode Bias

The principal novelty of our key estimation method, in comparison to related template-based key estimation methods, is the introduction of a key mode bias, δ, which exploits the vector norm difference between major and minor keys in the Tonal Interval Space. This variable adjusts the location of the input beat-synchronous TIVs to favor key estimates in either the major or the minor mode. This can be better understood in the 2-dimensional illustration of the key level in the Tonal Interval Space shown in Fig. 2. When δ < 1, we pull input vectors towards the center of the space (i.e., decrease their norm), thus favoring minor key estimates. On the other hand, when δ > 1, we push them towards the edge of the space (i.e., increase their norm), thus favoring major key estimates.

Fig. 2. Illustrative example of the key level in the Tonal Interval Space mapped into 2 dimensions. Upper and lower case letters represent major and minor keys, respectively, along with their corresponding vector norm. By altering the norm of an input TIV (represented as a circle) using different values of δ, we show the impact of the mode bias on key estimates, which oscillate between key modes, notably the relative C major (for values of δ > 1, represented as a square) and A minor keys (for values of δ < 1, represented as a triangle).
3.3. Key Profiles

The 24 key TIVs are obtained by transposition, which in the Tonal Interval Space corresponds to rotating each coefficient T(k) by (2πkr)/N radians, where r = [0, 11] semitones. Further details on the transposition of pitch configurations in the Tonal Interval Space by means of TIV rotation can be found in [10].
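Under the DFT convention of Eq. (1), this rotation follows from the shift theorem: circularly shifting a chroma vector by r bins multiplies coefficient k by e^{-j2πkr/N}. A small sketch (the function name is ours):

```python
import numpy as np

def transpose_tiv(tiv, r, n_bins=12):
    """Transpose a TIV by r semitones via coefficient-wise rotation.

    Circularly shifting the underlying chroma vector by r bins rotates
    coefficient k by (2*pi*k*r)/N radians, i.e., multiplies it by
    e^{-j*2*pi*k*r/N} under the Eq. (1) convention.
    """
    k = np.arange(1, 7)                  # the TIV keeps coefficients k = 1..6
    return np.asarray(tiv, dtype=complex) * np.exp(-2j * np.pi * k * r / n_bins)
```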
3.4. Key Estimates as Minimal Cumulative Distances

Based on the assumption that a key-indicating element is the use of its diatonic pitch set and chords, we define a method for estimating the global key of a musical example, R_min, in the Tonal Interval Space by finding the minimum of a function which accumulates over time the distances of the query beat-synchronous TIVs, T_b(k), from the 12 major and 12 minor key TIVs, such that:

    R_{min} = \operatorname*{argmin}_{r} \sum_{b=1}^{B} \sqrt{\sum_{k=1}^{6} \left| T_b(k) - T_r^p(k) \right|^2}    (2)
where T_r^p are the 24 major and minor key TIVs, derived from the collection of three different key profiles, p. When r ≤ 11, we adopt the major profile and when r ≥ 12, the minor profile. To limit the influence of silent or noisy (inharmonic) beats, b, we only retain those for which T(0) > 0.1, where B is the total number of retained beat-synchronous TIVs. By default, the mode bias δ = 1 (i.e., no spatial displacement of the input beat-synchronous TIVs is introduced) and it can be adjusted to favor either the major or the minor mode, as detailed in Section 3.2. The system output is a number, R_min, ranging between 0-11 for major keys and 12-23 for minor keys, where 0 corresponds to C major, 1 to C# major, and so on through to 23 being B minor.
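Putting Eq. (2) together with the beat filtering and the mode bias described above, a sketch of the full decision rule might look as follows; it reuses tonal_interval_vector from the Section 2 sketch, and the precomputed key_tivs array is an assumed input:

```python
import numpy as np

def estimate_key(beat_chromas, key_tivs, delta=1.0):
    """Estimate the global key R_min via Eq. (2).

    beat_chromas: iterable of 12-bin beat-synchronous chroma vectors.
    key_tivs: (24, 6) complex array of key TIVs T_r^p; rows 0-11 are the
        major keys (C, C#, ...) and rows 12-23 the minor keys.
    Returns R_min in 0..23 (0 = C major, ..., 23 = B minor).
    """
    cumulative = np.zeros(24)
    for c in beat_chromas:
        c = np.asarray(c, dtype=float)
        if c.sum() <= 0.1:                        # discard beats with T(0) <= 0.1
            continue
        tiv = delta * tonal_interval_vector(c)    # mode bias as in Section 3.2
        cumulative += np.sqrt(np.sum(np.abs(tiv - key_tivs) ** 2, axis=1))
    return int(np.argmin(cumulative))
```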
4. EVALUATION

We undertake an objective assessment of the IKD system in estimating the global key from musical audio, focusing on the implications of the mode biasing strategy on three different datasets and for three key-defining templates. By adopting different values of δ (both greater than and less than 1) in Eq. 2, we aim to show that: i) our mode bias can improve performance on either major or minor modes by increasing and decreasing δ, respectively; ii) overall results on correct key estimates can be improved by adopting a balanced δ value; and iii) key-defining profiles have a tendency to privilege either the major or the minor mode.

We use three audio datasets with key annotations made by experts in our evaluation. Combined, this collection provides a total of 879 musical examples with heterogeneous genre and timbre qualities. The first dataset consists of the initial 30 seconds of 96 classical musical examples evenly distributed across modes and tonics (4 musical examples per key), used in the MIREX Audio Key Estimation task [18]. The second dataset includes the first 30 seconds of 179 Beatles songs [9], with 89.4% of examples in the major mode. The third dataset is the GiantSteps collection [3], which consists of the initial 2 minutes of 604 EDM examples across 23 sub-genres, with 84.8% of the data in the minor mode. As discussed in the introduction, the use of datasets with an even mode distribution is an important design decision in the evaluation of systems for key estimation on unknown input. While the MIREX training set fulfills this criterion, the two remaining datasets favor different modes, which we use as a strategy to understand the behavior of our mode bias algorithm. To this end, we expect to improve the baseline results (i.e., when δ = 1) on the Beatles and GiantSteps datasets by increasing and decreasing δ, respectively.

5. RESULTS

Fig. 4 shows the performance of our IKD system on the three datasets under evaluation, for which we provide a score for δ = [0.05, 20] across each profile, p. To allow a fair comparison with previous studies, we use the MIREX evaluation procedure [19], which is widely applied in key estimation studies, where correct and neighboring key estimates are weighted and averaged into a final score according to the following point assignment: correct (1), dominant/subdominant (.5), relative (.3), parallel (.2), and others (0).
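A sketch of this weighted scoring under the key encoding of Section 3.4 (0-11 major, 12-23 minor); the neighbor tests below reflect our reading of the MIREX categories, with dominant/subdominant taken as same-mode keys a perfect fifth apart:

```python
def mirex_score(estimates, truths):
    """Weighted MIREX key score: 1 / .5 / .3 / .2 / 0 points per estimate."""
    def points(est, ref):
        if est == ref:
            return 1.0
        same_mode = (est < 12) == (ref < 12)
        if same_mode and (est - ref) % 12 in (5, 7):    # tonics a fifth apart
            return 0.5                                  # dominant/subdominant
        if not same_mode:
            maj, mnr = (est, ref) if est < 12 else (ref, est)
            if (mnr - 12 - maj) % 12 == 9:              # e.g., C major vs. A minor
                return 0.3                              # relative key
            if mnr - 12 == maj:                         # e.g., C major vs. C minor
                return 0.2                              # parallel key
        return 0.0
    return sum(points(e, t) for e, t in zip(estimates, truths)) / len(truths)
```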
The most immediate observation we can draw from our results in Fig. 4 is the effectiveness of the mode bias in regulating the tendency of mode prediction, confirming the expected tendencies in the evolution of correct estimates on the Beatles and GiantSteps datasets shown in Fig. 4 (b) and (c). As the vast majority of musical examples in the Beatles dataset are in the major mode, the ascending accuracy curve shows the expected improvements for the three key-defining templates as δ increases. Equally, the GiantSteps dataset reinforces the effectiveness of the mode bias by showing the contrary tendency, i.e., smaller values of δ result in better predictions. On the other hand, the results for the evenly distributed MIREX training set generate a less asymmetric curve over the same range of δ values. The inflection point of each key profile curve on the MIREX training set results, shown in Fig. 4 (a), can be considered the optimal value of δ, providing the best and most balanced results across key modes for this dataset.
Fig. 4. Performance of the IKD system for the three datasets under evaluation: a. MIREX training set, b. Beatles, and c. GiantSteps. Each dataset was evaluated using three key profiles (Aarden (T^a) [7], Sha'ath (T^s) [2], and Temperley (T^t) [6]), over a range of values for the mode bias δ = [0.05, 20].