Chapman & Hall/CRC
Computer Science and Data Analysis Series
The interface between the computer and statistical sciences is increasing, as each
discipline seeks to harness the power and resources of the other. This series aims to
foster the integration between the computer sciences and statistical, numerical, and
probabilistic methods by publishing a broad range of reference works, textbooks, and
handbooks.
SERIES EDITORS
David Blei, Princeton University
David Madigan, Rutgers University
Marina Meila, University of Washington
Fionn Murtagh, Royal Holloway, University of London
Proposals for the series should be sent directly to one of the series editors above, or submitted to:
Published Titles
Computational Statistics Handbook with MATLAB®, Third Edition
Wendy L. Martinez and Angel R. Martinez
R Graphics
Paul Murrell
MUSIC DATA
ANALYSIS
Foundations and
Applications
edited by
Claus Weihs
Technical University of Dortmund, Germany
Dietmar Jannach
Technical University of Dortmund, Germany
Igor Vatolkin
Technical University of Dortmund, Germany
Günter Rudolph
Technical University of Dortmund, Germany
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable
efforts have been made to publish reliable data and information, but the author and publisher cannot
assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication
and apologize to copyright holders if permission to publish in this form has not been obtained. If any
copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information stor-
age or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222
Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that pro-
vides licenses and registration for a variety of users. For organizations that have been granted a photo-
copy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are
used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
1 Introduction 1
1.1 Background and Motivation 1
1.2 Content, Target Audience, Prerequisites, Exercises, and Complementary Material 2
1.3 Book Overview 3
1.4 Chapter Summaries 3
1.5 Course Examples 8
1.6 Authors and Editors 9
Bibliography 11
II Methods 217
9 Statistical Methods 219
9.1 Introduction 219
9.2 Probability 219
9.2.1 Theory 219
9.2.2 Empirical Analogues 222
9.3 Random Variables 223
9.3.1 Theory 223
10 Optimization 263
10.1 Introduction 263
10.2 Basic Concepts 264
10.3 Single-Objective Problems 266
10.3.1 Binary Feasible Sets 266
10.3.2 Continuous Feasible Sets 271
10.3.3 Compound Feasible Sets 276
10.4 Multi-Objective Problems 276
10.5 Further Reading 281
Bibliography 281
13 Evaluation 329
13.1 Introduction 329
13.2 Resampling 332
13.2.1 Resampling Methods 334
13.2.2 Hold-Out 334
13.2.3 Cross-Validation 335
13.2.4 Bootstrap 336
13.2.5 Subsampling 338
13.2.6 Properties and Recommendations 338
13.3 Evaluation Measures 339
13.3.1 Loss-Based Performance 339
13.3.2 Confusion Matrix 340
13.3.3 Common Performance Measures Based on the Confusion Matrix 341
13.3.4 Measures for Imbalanced Sets 343
13.3.5 Evaluation of Aggregated Predictions 345
13.3.6 Measures beyond Classification Performance 347
13.4 Hyperparameter Tuning: Nested Resampling 352
13.5 Tests for Comparing Classifiers 354
13.5.1 McNemar Test 354
13.5.2 Pairwise t-Test Based on B Independent Test Data Sets 356
13.5.3 Comparison of Many Classifiers 357
13.6 Multi-Objective Evaluation 359
13.7 Further Reading 360
Bibliography 361
17 Transcription 433
17.1 Introduction 433
17.2 Data 434
17.3 Musical Challenges: Partials, Vibrato, and Noise 434
17.4 Statistical Challenge: Piecewise Local Stationarity 435
17.5 Transcription Scheme 436
17.5.1 Separation of the Relevant Part of Music 436
17.5.2 Estimation of Fundamental Frequency 436
17.5.3 Classification of Notes, Silence, and Noise 440
17.5.4 Estimation of Relative Length of Notes and Meter 442
17.5.5 Estimation of the Key 443
17.5.6 Final Transcription into Sheet Music 443
17.6 Software 443
17.7 Concluding Remarks 444
17.8 Further Reading 445
Bibliography 446
21 Emotions 511
21.1 Introduction 511
21.1.1 What Are Emotions? 511
21.1.2 Difference between Basic Emotions, Moods, and Emotional Episodes 512
21.1.3 Personality Differences and Emotion Perception 512
21.2 Theories of Emotions and Models 513
21.2.1 Hevner Clusters of Affective Terms 513
IV Implementation 607
25 Implementation Architectures 609
25.1 Introduction 609
25.2 Architecture Variants and Their Evaluation 610
25.2.1 Personal Player Device Processing 612
25.2.2 Network Server-Based Processing 613
25.2.3 Distributed Architectures 614
25.3 Applications 615
25.3.1 Music Recommendation 615
25.3.2 Music Recognition 616
Notation 665
Index 667
Chapter 1
Introduction
when used in a lecture. In contrast to these works, our book aims to be more com-
prehensive in that it also covers the foundations of music and signal analysis and
introduces the required basics in the fields of statistics and data mining. Further-
more, examples based on music data are provided for all basic chapters of this book.
Nonetheless, the above-mentioned books can serve as valuable additional readings
for advanced topics in music data analysis.
number of additional sources for music data that can be used alone or in combination
with signal-level features for music data analysis applications.
course was held within one week at the TU Dortmund, Germany, with about 8 hours
of classroom activities (lectures and exercises) per day.
An alternative course design particularly for musicologists might include the ba-
sic Chapters 2, 3, 5, 7, and 8 as well as all chapters on applications (Chapters 16–24).
The signal analytical and statistical methods that are needed to understand the appli-
cation chapters might be briefly explained in passing.
A course design especially for engineers might include the basic Chapters 2–5
and 7, the Chapters 11–15 on methodology, the application Chapters 16–20 as well
as Chapter 27 on hardware.
A course design targeted to computer scientists might include the basic Chapters
2–5 and 7, the application Chapters 17–19 and 22–24, and the hardware Chapters 26
and 27. The material of Chapters 9–15 on methodology should be interspersed when
needed.
Table 1.2: Author list
statistical process control, statistical design of experiments, time series analysis, and,
since 1999, statistics in music data analysis. He has co-authored more than 30 papers
on the topic of the book.
Dietmar Jannach is a full professor of Computer Science at TU Dortmund, Ger-
many. Before joining TU Dortmund he was an associate professor at University
Klagenfurt, Austria. Dietmar Jannach’s main research interest lies in the application
of intelligent systems technology to practical problems, e.g., in the form of recom-
mendation and product configuration systems.
Igor Vatolkin is postdoctoral researcher at the Department of Computer Science,
TU Dortmund, where he received a diploma degree in Computer Science and Mu-
sic as a secondary subject and a Ph.D. degree. His main research interests cover
the optimization of music classification tasks with the help of computational intel-
ligence techniques, in particular evolutionary multi-objective algorithms. He has
co-authored more than 25 peer-reviewed papers on the topic of the book.
Günter Rudolph is professor of Computational Intelligence at the Department of
Computer Science at TU Dortmund, Germany. Before joining TU Dortmund, he
was with Informatics Center Dortmund (ICD), the Collaborative Research Center
on Computational Intelligence (SFB 531), and Parsytec AG (Aachen). His research
interests include music informatics, digital entertainment technologies, and the de-
velopment and theoretical analysis of bio-inspired methods applied to difficult opti-
mization problems encountered in engineering sciences, logistics, and economics.
Bibliography
[1] A. Klapuri and M. Davy, eds. Signal Processing Methods for Music Transcrip-
tion. Springer, 2006.
[2] T. Li, M. Ogihara, and G. Tzanetakis, eds. Music Data Mining. Chapman &
Hall/CRC Data Mining and Knowledge Discovery, 2011.
[3] M. Müller. Information Retrieval for Music and Motion. Springer, 2007.
[4] M. Müller. Fundamentals of Music Processing: Audio, Analysis, Algorithms,
Applications. Springer, 2015.
[5] Z. W. Ras and A. Wieczorkowska, eds. Advances in Music Information Retrieval.
Springer, 2010.
[6] J. Shen, J. Shepherd, B. Cui, and L. Liu, eds. Intelligent Music Information
Systems: Tools and Methodologies. IGI Global, 2007.
Part I
Chapter 2
The Musical Signal: Physically and Psychologically
Sebastian Knoche
Department of Physics, TU Dortmund, Germany
Martin Ebeling
Institute of Music and Musicology, TU Dortmund, Germany
2.1 Introduction
In Ancient Greece, the Muses were the goddesses of the inspiration of literature, science, and arts, and were considered the source of knowledge. The Latin word musica
and eventually the word music derive from the Greek mousike, which means the art
of the Muses. Although music is ubiquitous in all cultures, a commonly accepted
definition of music has not yet been given. Instead, through the centuries a vari-
ety of definitions and classifications of music from different perspectives have been
proposed [25]. All definitions of music agree about its medium: Music constitutes
communication by artificially organized sound. For the purpose of communication,
the structures of music must be comprehensible in connection with auditory percep-
tion and cognition.
The constitutive musical signals are tones. From a phenomenological point of
view, a tone is a sensation. Different essential sensational moments of a tone can be
discriminated such as pitch, loudness, duration, and timbre. Tones are distinguished
by their tone names, which refer to the pitch. Two tones with different loudness
values or different timbres but the same pitch are identified as the same tone merely
played with different intensities and on different instruments. Thus, according to Carl
Stumpf (1848–1934) the tonal quality of pitch is the crucial sensational moment of a
tone, whereas the tonal intensity is its loudness [30].
According to the specific properties of the tonal sensation, music is organized in
relation to the sensational moments pitch (tone name), intensity (dynamics), rhythm
(onset and offset), and timbre (instrumentation). In what follows we will consider
each single sensational moment of a tone by looking at the musical signal as a sound
and as a sensation. This includes sound generation and sound propagation as well as
the psychoacoustic preconditions of music perception and cognition.
2.2 The Tonal Quality: Pitch — the First Moment
that involves temporal and spatial derivatives of a function u(z,t), denoted with a dot and a prime, respectively, so that ü = ∂²u/∂t² and u″ = ∂²u/∂z². The constant c that appears in the wave equation is called the phase velocity.
For a string under tension, the phase velocity depends on the tensional force Ft and the line density ρl of the string (measured in kg/m) by $c = \sqrt{F_t/\rho_l}$. A solution u(z,t) depends not only on the wave equation, but also on the initial conditions (i.e., the initial configuration u(z,t0) at the starting time t0 and the initial velocity u̇(z,t0)) and boundary conditions, which must be specified.
For infinitely long strings there is a very general class of solutions, the
D’Alembert solutions.
Theorem 2.1 (D’Alembert Solution of the Wave equation). The functions
u+ (z,t) = f (z + ct) and u− (z,t) = f (z − ct) with any twice differentiable func-
tion f satisfy the wave Equation (2.1). They are called the D’Alembert solu-
tions of the wave equation.
1 The Physics Interludes in this chapter go deeper into the physical mechanisms underlying music, and
may be skipped by readers who are satisfied with a purely phenomenological description of the musical
signal.
Figure 2.1: D'Alembert solution u−(z,t) = f(z − t) (phase velocity c = 1), plotted at three different times t = 0, 1, 2. For t = 0, we just see the plot of f(z). A point on the curve is identified by its function value f(z0). At a later time t1, this function value is found at a different position z1 = z0 + t1 since u−(z1,t1) = f(z0 + t1 − t1) = f(z0).
\[ Z(z)\ddot{T}(t) = c^2 Z''(z)T(t) \quad\Longleftrightarrow\quad \frac{1}{c^2}\,\frac{\ddot{T}(t)}{T(t)} = \frac{Z''(z)}{Z(z)}. \tag{2.2} \]
Note that the left-hand side of the last form only depends on the independent
variable t, and the right-hand side only on z, but the equation has to be satis-
fied for all z and t. This is only possible when both sides of the equation are
actually independent of z and t, i.e. equal to some constant −k2 (this form of
the constant is chosen so that the results can be written in a simple form).
Figure 2.2: First three eigenmodes of a vibrating string, with wavelengths λ1 = 2L, λ2 = L, λ3 = 2L/3 and frequencies f1 = c/λ1, f2 = 2f1, f3 = 3f1. The spatial structures oscillate up and down with frequencies fn. The first mode oscillates with the fundamental frequency f1 (= 1 · f0); higher modes oscillate with multiples of the fundamental frequency fn = n f1 (= n f0).
So the spatial differential equation reads Z″(z) = −k²Z(z), which is the familiar differential equation of a harmonic oscillator and has sine and cosine functions as solutions. For our boundary conditions Z(0) = Z(L) = 0, only sine functions with a root at z = L are appropriate, Z(z) ∼ sin(kn z) with kn = nπ/L and n = 1, 2, 3, . . . Thus the constant k is quantized, i.e. it may only assume discrete values kn.
The time-dependent differential equation T̈(t) = −c²kn²T(t) is of the same structure as the spatial differential equation, and its general solution is T(t) ∼ sin(ωn t + ϕn) with ωn = c kn and an integration constant ϕn representing the initial phase at time t = 0.
To obtain the final solution, the results of the spatial and temporal parts must be inserted back into the ansatz. In summary, there are infinitely many solutions, indexed by n = 1, 2, 3, . . . , which are called the eigenmodes of a clamped string:
Definition 2.2 (Eigenmodes of a Clamped String). The eigenmodes of a clamped string are the solutions
\[ u_n(z,t) = A_n \sin(k_n z)\,\sin(\omega_n t + \varphi_n) \quad\text{with}\quad k_n = n\pi/L \;\text{ and }\; \omega_n = c\,k_n \tag{2.3} \]
of the wave equation, with coefficients An and ϕn which must be determined from the initial conditions.
The wave number kn and angular frequency ωn describe the spatial and
temporal dilatation of the sine functions. The wave number is related to the
wave length λ , which measures the spatial distance between two oscillation
maxima, by λ = 2π/k. Other typical parameters of the temporal structure are
the frequency f , which measures the number of oscillations per second, and
the period T , which measures the duration of one oscillation. They are related
to the angular frequency by f = ω/2π and T = 1/ f .
The relation ω = ck with which the angular frequency was introduced is
called the dispersion relation and can be equivalently expressed as f = c/λ .
Although it looks as simple as the previous relations, it has much more physical
meaning. Basically, λ and k describe the same thing – the spatial stretching of
the sine wave – and are therefore obviously related, and ω, f and T all describe
how fast the oscillations are. The dispersion relation, on the other hand, relates
the spatial to the temporal characteristics, and describes how fast a wave with
given wavelength will oscillate.
In Figure 2.2, the spatial structures of the lowest three eigenmodes of a vibrating string, called the first harmonic or fundamental, the second harmonic, and the third harmonic, are sketched. They oscillate up and down with proceeding time and thus represent standing waves. The boundary conditions enforce the wavelength λn = 2π/kn = 2L/n to be quantized, so that two nodes of the sine fall exactly on the string boundaries. The frequencies of the harmonics are also indicated in Figure 2.2. They are integer multiples of the fundamental frequency (concerning the notation of the fundamental frequency, see the remark at the beginning of this section)
\[ (1\cdot f_0) = f_1 = \frac{c}{\lambda_1} = \frac{1}{2L}\sqrt{\frac{F_t}{\rho_l}}, \tag{2.4} \]
which depends on the force Ft acting on the string, its line density ρl and its
length L. The fundamental frequency can be tuned by adjusting these char-
acteristics of the system, corresponding to different notes being played on the
string.
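A minimal numerical sketch of Equation (2.4) in Python; the string parameters below are made-up example values, not taken from this chapter:

```python
import math

def fundamental_frequency(tension_n, line_density_kg_per_m, length_m):
    """f1 = (1 / 2L) * sqrt(F_t / rho_l), cf. Equation (2.4)."""
    phase_velocity = math.sqrt(tension_n / line_density_kg_per_m)  # c = sqrt(F_t / rho_l)
    return phase_velocity / (2.0 * length_m)

# Hypothetical guitar-like string: 70 N tension, 5 g/m line density, 65 cm length
f1 = fundamental_frequency(70.0, 0.005, 0.65)
print(round(f1, 1))                                  # fundamental, ~91 Hz
print([round(n * f1, 1) for n in range(1, 5)])       # harmonics f_n = n * f1
```

Tightening the string (larger Ft) or shortening it raises f1, which is exactly how string instruments are tuned and played.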
The general solution for the clamped string is a linear superposition of the
eigenmodes:
Theorem 2.2 (General Solution for the Clamped String). The linear superpo-
sition of the eigenmodes with arbitrary coefficients An and ϕn ,
\[ u(z,t) = \sum_{n=1}^{\infty} A_n \sin(k_n z)\,\sin(\omega_n t + \varphi_n), \tag{2.5} \]
is the general solution of the wave equation with boundary conditions u(0,t) =
u(L,t) = 0.
The coefficients of the superposition can be deduced from the initial con-
ditions with a Fourier analysis (see Section 2.2.6). Again, a uniquely defined
solution requires the initial shape u(z, 0) and initial velocity u̇(z, 0) to be given,
see [1] for a worked out example.
End of the Physics Interlude
The vibrating string generates sound waves that propagate through the air, and
eventually arrive at the listener’s ear. We will discuss the physics behind sound prop-
agation below in Section 2.3; for now we are satisfied with the result that a time
signal x(t) arrives at the ear and is perceived. This signal has the same time depen-
dence as the deflection Equation (2.5) of the vibrating string. Based on this signal, we
define two types of tones, pure and complex ones. The tones of music are typically complex tones.
Definition 2.3 (Pure and Complex Tones, Partials, Harmonics, Overtone series). A
pure tone or sine tone is a signal that consists of only one frequency f ,
pitch. This indicates that a periodicity detection mechanism in the auditory system
is responsible for pitch perception [16].
∆1 = τ2 − τ1 , ∆2 = τ3 − τ2 . (2.8)
If a melody moves from τ1 to τ2 and then to τ3 , both intervals are joined together
and the melody has moved over the interval ∆3 = τ3 − τ1. Perceptually, these intervals, that is, tonal distances, are added. Normally, we are still aware of the first tone and can hear that the melody has moved over a distance of the interval ∆3:
∆1 + ∆2 = τ2 − τ1 + τ3 − τ2 = τ3 − τ1 = ∆3 . (2.9)
To give a simple example from music perception, let τ1 be tone c’ and let τ2 be
tone g’. These tones form the interval of a pure fifth: ∆1 = pure fifth, and we may
note pure fifth = g’ − c’. If τ3 is the tone c’’, the tones τ3 and τ2 (g’) have a distance
of a pure fourth: ∆2 = pure fourth, and we note pure fourth = c’’ − g’. The tones c”
and c’ are a pure octave apart, so that ∆3 = pure octave and we note pure octave =
c’’ − c’. It is well known that a pure fifth and a pure fourth joined together lead to a
pure octave, which can easily be verified by hearing or singing or on any instrument, for example on a monochord. According to Equation (2.9) we calculate
\[ \text{pure fifth} + \text{pure fourth} = (g' - c') + (c'' - g') = c'' - c' = \text{pure octave}. \]
use pure thirds, pure fifths, and pure octaves to get purely tuned triads. However,
as a result only some chords sound more or less pure, whereas some others become
unbearably mistuned. Equidistant tonal systems divide the octave into equal inter-
vals. For mathematical reasons, except for the octave, the intervals of an equidistant
tonal system cannot be pure intervals. Here we present the currently established
tuning system of Western music, the equal tempered system, which is based on an
equidistant division of the octave into 12 semitones.
To develop the mathematics behind the musical pitches, we start with some basic
definitions that allow us to calculate with tones. A tone τ is, in principle, a sensation
– and not a number to calculate with. The variable τ associated with a tone could
be identified, for example, with the tone name. The frequency f (τ) of a pure tone
or the fundamental of a complex tone ( f0 = f (τ)), on the other hand, is a numerical
quantity, given in the unit Hz. Based on the frequency, we define our numerical scale
for the musical tone height by:
Definition 2.5 (Tone Height as a Numerical Quantity). A tone τ with frequency f(τ) has a tone height
\[ h(\tau) = 12\,\log_2\!\left(\frac{f(\tau)}{F}\right) \]
with respect to an arbitrary but fixed reference frequency F.
Figure 2.3: Plot of the sensational tone height H(hn) of the overtone series hn of the tone C. On the abscissa, the ratio of the frequency to the fundamental frequency f0 is shown. The numbers on the abscissa are the numbers n of the partials, which are equal to fn/f0. The dots mark the tone heights of the partials. Note that the n-th overtone is the (n+1)-th partial. At the bottom, the musical notation of the overtones is shown, where arrows indicate overtones that are a bit lower than the corresponding tones of the equal temperament.
Definition 2.6 (Intervals). Intervals are differences in tone height between two tones. The interval between the tones τ1 and τ2 is assigned the numerical value
\[ I(\tau_1,\tau_2) = h(\tau_2) - h(\tau_1) = 12\,\log_2\!\big(s(\tau_1,\tau_2)\big), \tag{2.14} \]
where s(τ1, τ2) = f(τ2)/f(τ1) is the vibration ratio of the interval. A finer scale to measure intervals is the cent scale, often used in electronic tuners, on which the hundred-fold value of the above definition is given, i.e. I(τ1, τ2) = 1200 log2(s(τ1, τ2)) cent.
Note that a semitone step comprises 100 cents; an octave corresponds to 1200
cents. In the preceding definition of intervals, the logarithmic identity log x − log y =
log(x/y) was used. The reference frequency F cancels out, so that the interval be-
tween two tones only depends on the vibration ratio, that is, the ratio of their fre-
quencies. An octave, for example, is assigned to the numerical value IO = 12 in our
system, because it has a vibration ratio sO = 2 and because log2 (2) = 1.
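As a small numerical illustration of this interval measure (the function names below are my own, not from the book), the following sketch evaluates Definition 2.6 and the cent scale for a few vibration ratios:

```python
import math

def interval_semitones(f1_hz, f2_hz):
    """I(tau1, tau2) = 12 * log2(f2 / f1): interval size in equal-tempered semitones."""
    return 12.0 * math.log2(f2_hz / f1_hz)

def interval_cents(f1_hz, f2_hz):
    """Cent scale: the hundred-fold value, 1200 * log2(f2 / f1)."""
    return 1200.0 * math.log2(f2_hz / f1_hz)

print(interval_semitones(440.0, 880.0))            # octave (ratio 2:1) -> 12.0
print(round(interval_cents(440.0, 660.0), 1))      # pure fifth (ratio 3:2) -> ~702 cents
```

The second value, about 702 cents, already hints at the small deviation between the pure fifth and the equal-tempered fifth of 7 semitones (700 cents), a point taken up again below.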
In our aim to divide the octave into twelve equidistant intervals, the logarithmic
relationship between frequencies and tone heights must be taken into account. On
the tone height scale, the octave IO = 12 shall be divided into twelve equal intervals
\[ f(\tau) = 2^{k/12}\cdot 440\ \text{Hz}, \tag{2.15} \]
where k is the number of semitone steps between the tone τ and the reference pitch a’; k > 0 corresponds to a tone that is higher than a’, and k < 0 to a tone that is lower than a’.
2 This definition holds for the equal temperament; other tuning systems do not divide the octave
equidistantly.
Figure 2.4: Musical notation and naming of the diatonic notes c’, d’, e’, f’, g’, a’, b’, c’’ (black). Brackets indicate the whole tone and semitone steps between the diatonic notes. The chromatic notes between the diatonic notes are printed in gray.
revoked by the natural sign (symbol: ♮). Each of the diatonic notes can be raised by
a sharp or lowered by a flat. Note that the same pitch can be represented by various
notes. The note between c and d for example can be noted as c-sharp or alternatively
as d-flat. Both c-sharp and d-flat have the same pitch. The notes e and f are separated
by only one semitone step, so that there is no note in between them: e-sharp has the
same pitch as f, and f-flat the same as e. The same holds for b and c. In the strict
sense, these so-called enharmonic changes without pitch changes are only possible
in the equal temperament. In other tuning systems, an enharmonic change slightly
changes the pitch, e.g., the note f-sharp would be slightly higher than g-flat in pure
tuning.
Finally, the octave indication completes the tone name. The octave presented
in Figure 2.4 is marked with a prime – it is the one-line octave. Higher octaves are
denoted with an increasing number of primes; as indicated in the figure, c” introduces
the two-line octave, and so on. Lower octaves are marked as follows. Directly below
the one-line octave, there is the small octave, indicated with lower-case letters for the
tone names without any prime, for example c. Even below, there is the great octave,
indicated by capital letters, like C. The contra octave is indicated by underlining
a capital letter (like C) or by a comma following a capital letter (like C,) and this
notation is continued by adding more underlines or commas, respectively, for even
deeper octaves.
The English notation, also called the scientific notation of pitch, uses the same note names but numbers the octaves from the bottom up, starting from C with 32.703 Hz. Thus the tones from C to B are the first octave, denoted by C1, D1, E1, etc.; the tones from C with 65.406 Hz to B are the second octave, denoted C2, D2, E2, etc., and so on. The one-line octave is the fourth octave, and C4 corresponds to c’ and has a frequency of 261.626 Hz. The concert pitch a’ with 440 Hz is the tone A4 in the scientific system.
The MIDI standard assigns numbers to all notes from the bottom up, starting with the inaudibly low note C-1 with 8.176 Hz as MIDI number 0. The notes are numbered chromatically all the way through, so that the tone c’ (C4) with 261.626 Hz has the MIDI number 60, and the concert pitch of 440 Hz has the MIDI number 69.
As we are now able to count the number n of semitone steps between any given
note and the reference pitch a’, we can calculate the frequencies of all tones of the
equal temperament according to Equation (2.15). Table 2.1 summarizes all tones of
27
the one-line octave. The tones in all other octaves can be obtained by doubling or halving these frequencies to ascend or descend by one octave.

Table 2.1: Notes of the One-Line Octave and the Corresponding Frequencies According to the Equal Temperament
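The frequencies listed in Table 2.1 can be recomputed from Equation (2.15); the note labels and the choice of sharps in this sketch are my own:

```python
# Frequencies of the one-line octave in equal temperament, relative to a' = 440 Hz.
NOTE_NAMES = ["c'", "c#'", "d'", "d#'", "e'", "f'", "f#'", "g'", "g#'", "a'", "a#'", "b'"]
A_PRIME_INDEX = 9   # a' is the tenth note of the chromatic one-line octave

for i, name in enumerate(NOTE_NAMES):
    k = i - A_PRIME_INDEX                   # semitone distance from the reference pitch a'
    freq = 440.0 * 2.0 ** (k / 12.0)        # Equation (2.15)
    print(f"{name:4s} {freq:7.2f} Hz")      # c' 261.63 Hz ... a' 440.00 Hz ... b' 493.88 Hz
```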
Intervals are also referred to by names rather than numerical values as defined
in Equation (2.14). Originally, the tonal distances were determined by counting the
notes of the diatonic scale. Thus the intervals are named after ordinal numbers:
prime, second, third, fourth, and so on. The distances between all chromatic tones
can be determined by counting the contained whole tone and semitone steps on the
basis of the diatonic scale.
Because of their strong tonal fusion and for historical reasons, the intervals prime, fourth, fifth, and octave are designated as pure intervals. If these intervals are augmented by a semitone, they are called augmented intervals, and if they are diminished by a semitone, they are called diminished intervals. The other intervals exist in two
forms: as minor and major intervals differing by a semitone. Diminishing a minor
interval by a semitone results in a diminished interval. Augmenting a major interval
by a semitone leads to an augmented interval.
Table 2.2: Intervals of the Chromatic Scale and their Numerical Values I and Vibration Ratios s According to Equation (2.14) in the Equal Temperament

The intervals of the chromatic scale starting from c’ are summarized in Table 2.2. Along with the numerical values I of Definition 2.6, the vibration ratios s are given, which can be obtained as s = 2^(I/12). Except for the prime and octave, the vibration ratios are not rational numbers, which is an artefact of the equal temperament. Other attempts to construct tonal systems, such as pure intonation, try to ascribe vibration ratios of simple rational numbers to the consonant intervals of fifth, fourth, and major third. From the overtone series (see Figure 2.3), one obtains the vibration ratios 3:2 = 1.5 for the pure fifth, 4:3 = 1.3̄ for the pure fourth, and 5:4 = 1.25 for the major third, which are not perfectly matched by the equal temperament. It was an assumption of speculative music theory that simple vibration ratios are responsible for the consonance of two tones being played simultaneously. The smaller the numbers of the ratio, the more consonant is the interval. An underlying reason
might be that the period of the resulting superposition signal is quite short, because
the periods of the individual tones have a relatively small common multiple. How-
ever, as a consequence of the irrational vibration ratios, the superposition of two
equally tempered tone signals is not periodic at all, unless the two tones are in octave
distance. Nevertheless, the consonant intervals of fifth, fourth and major third have
the same character in the equal temperament; see Section 3.5.1 below. It takes a musically trained ear and favorable circumstances to distinguish the irrational vibration ratios of the equal temperament from the ideal rational ones, and rather high demands to be bothered by these differences. These flaws of the equal temperament have therefore been accepted, and they are outweighed by the benefit of being able to modulate through all scales. On the other hand, ensembles of early music favor historical tuning systems, which noticeably change the sound and character of early music. Furthermore, these tuning systems are more suitable for historical instruments, e.g., old organs and keyboard instruments.
for frequencies that are not too high; roughly speaking, not higher than about 2000
Hz [16].
In contrast to the physical octave with a vibration ratio of 2 : 1, listeners prefer a
slightly greater vibration ratio for the octave. Due to refractory effects, the interspike-
intervals of the neural coding of intervals in the auditory system deviate from the
integer ratios of the stimuli [18].
The perceptional scales of melodic octaves are equal to the scale of harmonic
octaves up to a frequency of 500 Hz. But for higher frequencies the perception of
melodic octaves deviates considerably from the vibration ratio of 2 : 1 as described
in what follows.
If a listener is presented a pure tone a’ with f (a’) = 440 Hz and is asked to adjust
the frequency of a second tone so that it has half the pitch of a’, he will find a tone
with about 220 Hz, at least when he is musically trained [34]. At higher frequencies,
however, the frequency ratio of two tones must be much larger to elicit the sensation
of half pitch (or double pitch). If the half pitch of a tone with a frequency of 8 kHz
is to be determined, subjects on average find a tone around 1.8 kHz, and not around
4 kHz.
Systematic pitch halving and pitch doubling experiments with pure tones were used to define the psychoacoustic quantity ratio-pitch with its unit mel, so that a tone with double the pitch of another has double the ratio-pitch, i.e. double the mel value. From different experimental procedures, several mel scales with different reference frequencies have been determined. The mel scale of Stevens and Volkman [29] refers to the frequency of 1000 Hz corresponding to 1000 mel. The mel scale defined by Zwicker et al. refers to the bark scale, which is a scale of the width of auditory filters [34]. Its reference frequency is 125 Hz, which corresponds to 125 mel. Here we adopt the approximation of O'Shaughnessy [22] given in the following definition.
Definition 2.9 (Ratio-Pitch and Mel Scale). The psychoacoustic quantity ratio-pitch fmel with its unit mel is defined for a pure tone with frequency f by
\[ f_{\mathrm{mel}}(f) = 2595\,\log_{10}\!\left(1 + \frac{f}{700\ \text{Hz}}\right) \text{mel}. \tag{2.16} \]
From the given formulas, we can calculate the frequencies of tones that have half
the pitch of a given tone. For example, a tone with f = 8 kHz has a ratio-pitch of
fmel = 2840 mel. A tone that is perceived to have half the pitch must therefore have a
ratio pitch of fmel /2 = 1420 mel, which corresponds to a frequency of f (1420 mel) =
1768 Hz.
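The worked example can be checked numerically. The sketch below implements Equation (2.16) and its inverse; the inverse formula is my own algebraic rearrangement, not quoted from the chapter:

```python
import math

def hz_to_mel(f_hz):
    """Ratio-pitch in mel, Equation (2.16)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse of Equation (2.16), solved for the frequency."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

m = hz_to_mel(8000.0)                        # ~2840 mel
print(round(m), round(mel_to_hz(m / 2.0)))   # half the ratio-pitch -> ~1768 Hz, not 4 kHz
```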
Figure 2.5 shows a plot of the ratio pitch as a function of the frequency, fmel ( f ).
A frequency of 1000 Hz corresponds to a ratio pitch of 1000 mel, which can be seen
in the figure where the function fmel ( f ) crosses the identity. For small frequencies,
Figure 2.5: Double logarithmic plot of the mel scale fmel(f) as defined by Equation (2.16). The continuous line is a plot of the ratio pitch fmel as a function of the frequency, and the dashed line is a plot of the identity function, i.e. a plot of the frequency as a function of itself.
the ratio pitch is approximately proportional to the frequency. This can also be seen
directly from Equation (2.16) by expanding the logarithm in a Taylor series around
f = 0, where we obtain fmel ≈ 1.61 f for f ≪ 700 Hz. For larger frequencies, the
slope of the function fmel ( f ) gets smaller and smaller, indicating that larger and
larger vibration ratios are needed to produce tones with half (or double) pitch.
The mel scale is mainly used in psychoacoustics and finds applications in speech
recognition, and recently also in music data analysis, see [33] for an overview. In
the theory of Western music as developed in Sections 2.2.3 and 2.2.4, however, the
mel scale is of little importance. On the other hand, the mel scale corresponds to
physiological data: There is a linear relationship between the mel scale and the num-
ber of abutting haircells of the basilar membrane: 10 mel correspond to 15 abutting
haircells, and the total of 2400 mel corresponds to the total of 3600 haircells. Fur-
thermore, the mel scale reflects the width of auditory filters, which are described by
the bark scale (see [34, p. 162]).
Note that in the complex notation, the summation runs from −N to N; to recover the
original Equation (2.18), which runs from 1 to N, the summands n and −n must be
collected.
It can be shown that if the series of coefficients an and bn converge absolutely, i.e. if $\sum_{n=1}^{\infty} |a_n| < \infty$ and $\sum_{n=1}^{\infty} |b_n| < \infty$, the trigonometric series converges uniformly,
Equivalently, when the complex form Equation (2.21) of the trigonometric series is used, the Fourier coefficients are defined as
\[ c_n = \frac{1}{2T}\int_{-T/2}^{T/2} f(t)\, e^{-i n \omega_0 t}\, dt. \tag{2.24} \]
As F(ω) is complex, it is the sum of a real and an imaginary component and has a polar coordinate representation
\[ F(\omega) = A(\omega)\, e^{i\phi(\omega)}. \]
Definition 2.13 (Fourier Spectrum, Energy Spectrum, Phase Angle). The function
A(ω) = |F(ω)| is the Fourier spectrum of f (t) and its square A2 (ω) is the energy
spectrum of f (t). The function φ (ω) = arg F(ω) is the phase angle.
The Fourier transform F(ω) measures the magnitude and phase with which an
oscillation of the angular frequency ω is contained in the function f (t); very similar
to the coefficients cn of the complex Fourier series. The superposition of all these
oscillations then reproduces the original function f (t). As we are dealing with a
continuum of oscillations, this superposition is written as an integral,
\[ f(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F(\omega)\, e^{i\omega t}\, d\omega. \tag{2.27} \]
This equation is the analogue of Equation (2.21), where the Fourier transform F(ω)
plays the part of the Fourier coefficients cn . For Equation (2.27) to hold, the function
f (t) must satisfy the Dirichlet conditions on any finite interval, and it must be ab-
solutely integrable over the whole range t ∈ (−∞, ∞). In comparison with Equation
(2.25), we see that the inverse transform, Equation (2.27), has the same structure as
the Fourier transform, but with a factor of 1/2π and a sign change in the exponential.
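In practice, the Fourier spectrum A(ω) and phase angle φ(ω) of a sampled signal are usually approximated with the discrete Fourier transform. A minimal sketch, assuming NumPy (which is not part of this chapter):

```python
import numpy as np

fs = 8000                                    # sampling rate in Hz (arbitrary choice)
t = np.arange(0, 1.0, 1.0 / fs)              # one second of samples
x = 0.8 * np.sin(2 * np.pi * 440.0 * t)      # a 440-Hz sine as test signal

X = np.fft.rfft(x)                           # DFT of the real-valued signal
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)  # frequency axis in Hz
A = np.abs(X)                                # magnitude (Fourier) spectrum
phi = np.angle(X)                            # phase angle

print(freqs[np.argmax(A)])                   # strongest component: 440.0 Hz
```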
autocorrelation eliminates phase shifts. This is in line with pitch perception: spatial changes of the sound source result in differences in the travel time of the sound wave, which are mathematically represented by phase shifts. But the pitch percept is not at all altered by spatial changes of the sound source.
The nomenclature to classify signals f (t) is adopted from physics. One defines
that | f (t)|2 has the meaning of the (instantaneous) power of the signal, and analogous
to mechanics, power is defined as energy per time. This motivates the following
definitions.
Definition 2.14 (Average Power and Total Energy). The total energy Ef and average power Pf of a signal f(t) are defined by the integrals
\[ P_f = \lim_{D\to\infty} \frac{1}{2D}\int_{-D}^{D} |f(t)|^2\, dt \quad\text{and}\quad E_f = \int_{-\infty}^{\infty} |f(t)|^2\, dt. \tag{2.28} \]
When analyzing a signal f (t), one must distinguish between two types of signals.
Energy signals are signals with finite energy E f , that is, f (t) is square integrable. On
the other hand, power signals are signals whose total energy E f is infinite, but whose
average power Pf is finite. The distinction between these two signals is necessary to
avoid infinities (or to avoid that everything is zero), which is achieved by choosing
suitable normalizations in the definitions of the autocorrelation functions.
Definition 2.15 (Autocorrelation Functions). The autocorrelation functions for energy signals and power signals are defined as
\[ a(\tau) = \int_{-\infty}^{\infty} f^*(t)\, f(t+\tau)\, dt \quad\text{and}\quad a(\tau) = \lim_{D\to\infty} \frac{1}{2D}\int_{-D}^{D} f^*(t)\, f(t+\tau)\, dt, \tag{2.29} \]
respectively.
Thus, the autocorrelation itself is periodic, and has relative maxima equal to the average power Pf for all delays equal to the periods τ = nT. One hint for calculating the average power or the autocorrelation of a periodic function in practice: it is sufficient to evaluate the averages $\frac{1}{2D}\int_{-D}^{D}\ldots$ over one period, i.e. with D = T/2, instead of forming the limit D → ∞. Due to the periodicity, the average over one period is the same as the average over the whole time axis.
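The following sketch illustrates how the autocorrelation exposes the period of a periodic signal, which is the basis of the periodicity detection mentioned above for pitch perception; the discrete-time estimate and all parameter choices are my own:

```python
import numpy as np

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
x = np.sin(2 * np.pi * 220.0 * t) + 0.5 * np.sin(2 * np.pi * 440.0 * t)  # periodic signal

# Discrete estimate of the autocorrelation a(tau) for non-negative lags
acf = np.correlate(x, x, mode="full")[len(x) - 1:] / len(x)

# The first relative maximum after lag 0 sits near the period T = 1/220 s
lag = np.argmax(acf[20:200]) + 20        # skip the small lags around tau = 0
print(round(fs / lag, 1))                # ~220 Hz fundamental (period of the signal)
```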
Figure 2.6: (a) Signal and (b) spectrum of a frequency modulated sine with carrier frequency ωc = 20 s⁻¹, a modulation frequency ωm = 4 s⁻¹, and a modulation index of β = 2.
The amplitudes Jk (β ) occurring here are the Bessel functions of the first kind [1, 12].
From the representation in Equation (2.34) of the signal, its frequency content can
be immediately read off: There is a carrier with frequency ωc and an infinite number
of sidebands with frequencies ωc ± kωm .
Figure 2.6 shows the signal and its spectrum. The signal looks like a sine function
that is periodically stretched and compressed, as a result of the periodically modu-
lated angular frequency, Equation (2.31). In the spectrum, the absolute values of the
amplitudes are plotted; the factor (−1)k occurring in Equation (2.34) turns some of
the left sidebands negative, which is equal to a phase shift of half a period.
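As a numerical check of the sideband statement, the following sketch synthesizes a frequency modulated sine of the form x(t) = sin(ωc t + β sin(ωm t)) with the parameters of Figure 2.6 and compares the spectral lines at ωc + kωm with the Bessel functions Jk(β); the explicit signal form and the use of NumPy/SciPy are assumptions of this sketch:

```python
import numpy as np
from scipy.special import jv                 # Bessel functions of the first kind

wc, wm, beta = 20.0, 4.0, 2.0                # parameters as in Figure 2.6
fs = 200.0                                   # samples per second, well above the band
t = np.arange(0, 40 * 2 * np.pi / wm, 1 / fs)        # many modulation periods
x = np.sin(wc * t + beta * np.sin(wm * t))   # frequency modulated sine

X = np.abs(np.fft.rfft(x)) / (len(x) / 2)    # normalized so a unit sine has amplitude 1
w = 2 * np.pi * np.fft.rfftfreq(len(x), d=1 / fs)    # angular frequency axis

for k in range(4):                           # carrier (k = 0) and first upper sidebands
    line = X[np.argmin(np.abs(w - (wc + k * wm)))]
    print(k, round(line, 3), round(abs(jv(k, beta)), 3))   # spectral line vs. |J_k(beta)|
```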
Figure 2.7: (a) Spectrum and (b) waveform x(t) of a signal that is the sum of two sines with equal amplitude and angular frequencies of ω1 = 20 s⁻¹ and ω2 = 22 s⁻¹. The envelope E(t), plotted in gray, fluctuates with the beating frequency ωb = 1 s⁻¹.
The trigonometric identity sin α + sin β = 2 sin([α + β]/2) cos([α − β]/2), see [3], can be used to rewrite this equation as
\[ x(t) = 2\,\sin(\omega_m t)\,\cos(\omega_s t) \quad\text{with}\quad \omega_m = \frac{\omega_1+\omega_2}{2}, \;\; \omega_s = \frac{\omega_1-\omega_2}{2}. \tag{2.36} \]
The first factor fluctuates rapidly with the mean frequency ωm, whereas the second factor oscillates very slowly with a low frequency ωs, which is half of the frequency difference. The slow oscillation thus constitutes an envelope E(t) = 2 cos(ωs t), which is perceived as a periodic loudness fluctuation with the slow frequency ωs. In this respect, beats remind us of an amplitude modulated sine. Note that due to the different kinds of envelopes, the beating signal of two sines has two spectral components, whereas an amplitude modulated sine has three spectral components, as discussed below in Section 2.3.4. Figure 2.7 shows the spectrum and waveform of the beating sinusoids, and the beating envelope.
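A short sketch reproducing the beating signal of Figure 2.7 and its envelope (the parameter values are taken from the figure caption; plotting is omitted):

```python
import numpy as np

w1, w2 = 20.0, 22.0                       # angular frequencies of the two sines, in 1/s
wm, ws = (w1 + w2) / 2, (w1 - w2) / 2     # mean and half-difference, Equation (2.36)

t = np.arange(0, 8, 0.001)
x = np.sin(w1 * t) + np.sin(w2 * t)       # sum of the two sines
x_product = 2 * np.sin(wm * t) * np.cos(ws * t)   # product form of Equation (2.36)
envelope = 2 * np.cos(ws * t)             # E(t), the slowly fluctuating envelope

print(np.allclose(x, x_product))          # True: both forms describe the same signal
```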
The phenomenon of roughness, extensively investigated by Helmholtz [13], is
closely related to beats. If the frequencies of both tones are very different, two si-
multaneous pitches and fast beats are perceived. If the frequency difference is small,
only one pitch is heard, which corresponds to the algebraic mean of the original
frequencies. In between is a region of frequency differences where roughness oc-
curs: The perceived beats sound ugly and rough and the pitch percept is unclear.
Helmholtz thought that a frequency difference of about ∆ f = 33 Hz, corresponding
to ωs = 2π∆ f /2 ≈ 104 s−1 , elicits the highest degree of roughness. He developed a
consonance theory based on roughness, which in essence is still en vogue in Anglo-
American music theory, although it cannot sufficiently explain the phenomenon of
consonance and dissonance, which is based on tonal fusion (see Section 3.5.1).
The discussion of beats was based on the assumption that the signals of two tones
can be simply added to give the resulting signal. Now we will discuss what happens
to simultaneous tones when they are processed by a non-linear transmission system,
and we will find that those systems produce additional tones called combination tones
as a result of a signal distortion. Non-linear transmission systems may be the body
38
2.2. The Tonal Quality: Pitch — the First Moment 39
39
40 Chapter 2. The Musical Signal: Physically and Psychologically
corresponds to a delta pulse, which Fourier transforms to the constant 1 in the fre-
quency domain. Thus, a click contains all frequencies and no pitch can be sensed.
All drums are tuned to a certain frequency region, which gives them an individual
timbre, but their inharmonic spectra do not elicit a fundamental pitch. The vibra-
tions and spectra of church bells are highly complex, and often several pitches can
be heard next to the strike tone. Some percussion sounds are noise-like. A vague
pitch sensation can be evoked in case of sounds similar to band-limited noises. In the
following, we discuss what kind of signals these sounds are.
In the auditory system, pitch is detected by a neuronal periodicity analysis [16].
All stimuli with a somehow periodic time course can elicit a pitch sensation. Fourier
analysis shows that these stimuli are also somehow centered in frequency. Wave-
forms may vary from completely aperiodic through quasi-periodic to completely periodic sounds. Sounds may have a continuous frequency spectrum or they can be
centered more or less in frequency. As a consequence there are sounds without any
pitch or sounds with only a vague pitch percept. Pitch ambiguities may also occur.
Complex tones are preferred as musical tones because of their clear and unambiguous
pitch. As we will see later, the timbre of a sound is mainly determined by its spectral
content. Thus, even sounds without any pitch may have a distinct tonal color.
As an example of sounds without pitch percept, we discuss noise. Roughly
speaking, a noise is a randomly fluctuating signal. A precise definition in contin-
uous time is mathematically quite elaborate, and we will keep the discussion on an
abstract level.
Noises can be classified according to their power spectral density. The power
spectral density Nx (ω) of a (power) signal x(t) is the Fourier transform of its au-
tocorrelation function (cf. Theorem of Wiener–Khintchine). Its meaning is that
Nx (2π f )d f is the amount of power contributed by the signal components with fre-
quencies between f and f + d f to the average power of the signal. Integrating this
density over all frequencies thus gives the average power,
\[ P_x = \int_{-\infty}^{\infty} N_x(2\pi f)\, df. \tag{2.41} \]
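A numerical illustration of this relation: the sketch below estimates the power spectral density of a white noise signal and checks that integrating it over frequency recovers the average power. The use of scipy.signal.welch is an assumption of this sketch, not a method described in the chapter:

```python
import numpy as np
from scipy.signal import welch

rng = np.random.default_rng(0)
fs = 1000.0
x = rng.normal(0.0, 1.0, size=100_000)         # white noise, average power ~ variance = 1

f, Nxx = welch(x, fs=fs, nperseg=4096)         # one-sided estimate of the PSD
power_from_psd = np.sum(Nxx) * (f[1] - f[0])   # integral over frequency, cf. Eq. (2.41)
power_from_signal = np.mean(x ** 2)            # time-domain average power

print(round(power_from_psd, 2), round(power_from_signal, 2))   # both close to 1.0
```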
2.3 Volume — the Second Moment
Table 2.3: Some Properties of Dry Air for Usual Ambient Conditions (20 °C, Standard Pressure)
In the following Physics Interlude, the three-dimensional wave equation for sound
waves is discussed and the speed of sound in air is determined. A nice calculation
shows that the displacement amplitude of the air particles, which is caused by a sound
at the hearing threshold, is only about 10⁻¹¹ m, a tenth of an atom diameter.
A Physics Interlude
When the deviations p∗ , ρ ∗ and v are small, their time evolution is governed
by the following wave equation [15]. It is formulated in terms of an abstract
quantity, the velocity potential, from which p∗ , ρ ∗ and v can be calculated.
Definition 2.16 (Three-Dimensional Wave Equation, Velocity Potential). The
equation
\[ \frac{\partial^2 \phi}{\partial t^2} - c^2\left(\frac{\partial^2 \phi}{\partial x^2} + \frac{\partial^2 \phi}{\partial y^2} + \frac{\partial^2 \phi}{\partial z^2}\right) = 0 \tag{2.42} \]
is called a three-dimensional wave equation for the velocity potential φ, from which the fields of interest can be derived by [15]
\[ \mathbf{v} = \left(\frac{\partial \phi}{\partial x},\, \frac{\partial \phi}{\partial y},\, \frac{\partial \phi}{\partial z}\right)^{\!T}, \qquad p^* = -\rho_0\,\frac{\partial \phi}{\partial t}, \qquad \rho^* = \frac{1}{c^2}\, p^*. \tag{2.43} \]
In the wave equation, c is called the phase velocity, and it depends on the adiabatic bulk modulus K of the air (which is a constant and will be calculated below) and the air density ρ0 by $c = \sqrt{K/\rho_0}$.
Together with appropriate boundary and initial conditions, Equations (2.42)
and (2.43) entirely describe the dynamics of sound waves in air. The use of
the velocity potential as the fundamental quantity has the advantage that all
physical quantities can be calculated straightforwardly from it. However, the
velocity potential has no particularly evident meaning; it is rather an auxiliary
quantity. Sometimes we may be interested only in one of the physical fields v ,
p∗ or ρ ∗ . For such cases, we should note that wave equations for these fields
can be derived from Equation (2.42), which have exactly the same form as
Equation (2.42) but with the respective field in place of φ . Thus, all quantities
v , p∗ and ρ ∗ satisfy the same wave equation. The solutions, however, will be
different because of deviating boundary and initial conditions for the different
fields.
The wave equation is quite general and holds for any compressible fluid or gas with negligible viscosity. The material properties particularly influence the speed of sound $c = \sqrt{K/\rho_0}$. For air, the speed of sound can be estimated quite accurately by basic thermodynamics of gases. To determine the bulk modulus K we consider an air volume V with mass m, which is compressed while its mass is conserved. By definition, $K = \rho\, dp/d\rho = -V\, dp/dV$, where the differential was transformed according to the relation ρ = m/V. To evaluate the derivative dp/dV, we must specify the appropriate thermodynamic process. The simplest suggestion might be an isothermal process; however, sound waves typically oscillate so fast that the heat exchange between compressed and decompressed regions is too slow to balance temperature differences. It is much more reasonable to assume no heat exchange at all between compressed and decompressed regions, i.e. an adiabatic compression. For an adiabatic process, thermodynamics teaches $pV^{\gamma} = \text{const.}$, where γ is the adiabatic index. For air, whose main constituents are diatomic gases, γ = 7/5. Thus, $dp/dV = -\gamma p/V$ for adiabatic processes, and the adiabatic bulk modulus at equilibrium is K = γp0. For the speed of sound we obtain $c = \sqrt{\gamma p_0/\rho_0}$. In this formulation, however, ρ0 itself depends on the pressure p0. It is a good idea to resolve this dependency by defining the density ρ0 = M/Vm via the molar mass M and molar volume Vm. Furthermore, the temperature and pressure dependence of the molar volume is given by the ideal gas equation for one mole, p0Vm = R T0, with the universal gas constant R = 8.31446 J/(mol K). Thus, our final result for the speed of sound in air reads
\[ c = \sqrt{\frac{\gamma R T_0}{M}} \approx 343\ \text{m/s} \tag{2.44} \]
(see Table 2.3 for the numerical values). This result corresponds quite well to measurements. Note that the final result shows that the speed of sound is independent of the ambient pressure, but proportional to the square root of the absolute temperature. Hence, sound propagates faster in warm air than in cold air. The dependence on the average molar mass M of the air also has practical significance: moist air has a lower average molar mass than dry air, because
the weight of a molecule of water is lower than that of nitrogen or oxygen (the
two main constituents of air). Therefore, sound propagates faster in moist air
than in dry air. Both effects play an important part in intonation problems of
various instruments.
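The temperature dependence of Equation (2.44) is easy to evaluate. The sketch below uses the constants quoted above (γ = 7/5, R = 8.31446 J/(mol K)) together with a standard value for the molar mass of dry air, about 0.029 kg/mol, which is an assumption of this sketch:

```python
import math

GAMMA = 7.0 / 5.0        # adiabatic index of diatomic gases
R = 8.31446              # universal gas constant in J/(mol K)
M_DRY_AIR = 0.02896      # molar mass of dry air in kg/mol (assumed standard value)

def speed_of_sound(temperature_celsius, molar_mass=M_DRY_AIR):
    """c = sqrt(gamma * R * T0 / M), Equation (2.44), with T0 in kelvin."""
    t0_kelvin = temperature_celsius + 273.15
    return math.sqrt(GAMMA * R * t0_kelvin / molar_mass)

print(round(speed_of_sound(20.0), 1))    # ~343 m/s in room-temperature air
print(round(speed_of_sound(0.0), 1))     # ~331 m/s in cold air: sound travels slower
```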
The d’Alembert solutions which we encountered in Section 2.2.2 can be
easily transferred to three dimensions, and they are useful to describe prop-
agating sound waves. Here we consider the special case of sinusoidal plane
waves given by the velocity potential
\[ \phi(\mathbf{r},t) = \hat{\phi}\,\sin(\mathbf{k}\cdot\mathbf{r} - \omega t) \tag{2.45} \]
with a certain amplitude φ̂ , wave vector k and angular frequency ω. The wave
vector is the three-dimensional generalization of the wave number, and can be
Figure 2.8: Section of a plane wave. With proceeding time, the spatial structure will be shifted in direction e with velocity c, but keep its shape. The phase velocity c, with which the whole structure moves, has to be well distinguished from the velocity field v(r,t) (see indicated vectors), which describes the local velocities of air particles.
written as k = k e with a unit vector e and norm k = |k|. It is easily seen that Equation (2.45) satisfies the wave Equation (2.42) if ω = ck, so we obtain the same dispersion relation as for the one-dimensional wave equation.
The name “plane wave” comes from the observation that the spatial variations only occur along the direction e; in the directions perpendicular to e, the velocity potential is constant, thus defining planar wave fronts of constant φ (see Figure 2.8). That becomes clear by writing r in the orthonormal basis e, i, j via $\mathbf{r} = r_e \mathbf{e} + r_i \mathbf{i} + r_j \mathbf{j}$. Then the velocity potential is independent of the side components $r_i$ and $r_j$ since it reads $\phi = \hat{\phi}\sin(k r_e - \omega t)$.
According to Equations (2.43), the physically relevant fields are given by
\[ \mathbf{v}(\mathbf{r},t) = \hat{\mathbf{v}}\,\cos(\mathbf{k}\cdot\mathbf{r} - \omega t), \qquad p^*(\mathbf{r},t) = \hat{p}\,\cos(\mathbf{k}\cdot\mathbf{r} - \omega t), \qquad \rho^*(\mathbf{r},t) = \hat{\rho}\,\cos(\mathbf{k}\cdot\mathbf{r} - \omega t) \tag{2.46} \]
with amplitudes $\hat{\mathbf{v}} = \mathbf{k}\,\hat{\phi}$, $\hat{p} = \rho_0\,\omega\,\hat{\phi}$ and $\hat{\rho} = \rho_0\,\omega\,\hat{\phi}/c^2$. They all have the same
spatial and temporal characteristics and differ only in amplitude. An example
is sketched in Figure 2.8. Because the velocity v of air particles is oriented
along the propagation direction e (which is parallel to k ), sound waves belong
to the class of longitudinal waves, in contrast to the transverse waves on a
vibrating string.
A very nice calculation shows with which orders of magnitude we are deal-
ing. The auditory threshold for a pure 1-kHz tone is p̂ = 28 µPa. The corre-
sponding particle velocity is $\hat{v} = k\hat{\phi} = k\hat{p}/\rho_0\omega = \hat{p}/\rho_0 c = 6.8\cdot 10^{-8}$ m/s, so it is actually very slow compared to the propagation velocity c. The displacement of the air particles out of their equilibrium position is $\xi(t) = \int v(t)\,dt = -(\hat{v}/\omega)\sin(\omega t)$. Thus, the displacement amplitude is only
\[ \hat{\xi} = \frac{\hat{v}}{\omega} = \frac{\hat{p}}{2\pi f \rho_0 c} = 1\cdot 10^{-11}\ \text{m}, \tag{2.47} \]
a tiny distance of approximately a tenth of an atom diameter which our ear can
just detect!
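The order-of-magnitude estimate is easy to reproduce. The air density used below, about 1.2 kg/m³ for the ambient conditions of Table 2.3, is an assumed standard value:

```python
import math

RHO_0 = 1.204     # density of dry air at 20 deg C in kg/m^3 (assumed standard value)
C = 343.0         # speed of sound in m/s, cf. Equation (2.44)

p_hat = 28e-6     # pressure amplitude at the 1-kHz hearing threshold, in Pa
f = 1000.0

v_hat = p_hat / (RHO_0 * C)               # particle velocity amplitude
xi_hat = v_hat / (2 * math.pi * f)        # displacement amplitude, Equation (2.47)

print(f"{v_hat:.1e} m/s")                 # ~6.8e-08 m/s
print(f"{xi_hat:.1e} m")                  # ~1.1e-11 m, about a tenth of an atom diameter
```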
suggests itself. Within the range of perceivable sounds, this value varies over many
orders of magnitude, therefore it is advantageous to use a logarithmic scale. This
leads us to the following definition.
Definition 2.17 (Sound Pressure Level (SPL)). The sound pressure level, abbreviated SPL, is defined as
\[ L_p = 10\,\log\frac{p_{\mathrm{RMS}}^2}{p_{\mathrm{ref}}^2}\ \text{dB} \tag{2.49} \]
with a reference value pref = 20 µPa.
The reference value pref was believed to be the auditory threshold (for a pure
sound of 1 kHz) at the time the sound pressure level was first defined, and the defini-
tion assures that a tone with pRMS = pref corresponds to a SPL of L p = 0. Actually,
the value is slightly wrong, as more recent psychoacoustic experiments showed and
we will discuss later. Since the acoustic pressure is technically quite easy to measure
using a microphone, it is perfectly suitable for defining a usable scale. In addition to
this practical advantage, the acoustic pressure is the relevant quantity on the stimu-
lus level for the hearing sensation of loudness. The relation between sound pressure
level and loudness sensation is described below.
For pure tones, where p(t) = p̂ sin(ωt), the root mean square value is simply obtained as $p_{\mathrm{RMS}} = \hat{p}/\sqrt{2}$. Note that the integration time to calculate the RMS must be much longer than a single period. In the case of an incoherent superposition of pure tones, the squares of the individual root mean square pressures add up to the total:
\[ p_{\mathrm{RMS}}^2 = \frac{1}{D}\int_0^D \Big(\sum_i \hat{p}_i \sin(\omega_i t)\Big)^{\!2} dt = \sum_i \frac{1}{D}\int_0^D \hat{p}_i^2 \sin^2(\omega_i t)\, dt \tag{2.50} \]
\[ = \sum_i p_{\mathrm{RMS},\,i}^2. \tag{2.51} \]
In the second step we use that mixed integrals $(1/D)\int \sin(\omega_i t)\sin(\omega_j t)\, dt$ are approximately zero for integration times much larger than the period, since $\omega_i \neq \omega_j$ in incoherent superpositions.
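A small sketch of Definition 2.17 together with the incoherent-superposition rule of Equation (2.51); the helper names are my own:

```python
import math

P_REF = 20e-6    # reference pressure in Pa

def spl_db(p_rms):
    """Sound pressure level L_p = 10 log10(p_RMS^2 / p_ref^2) dB, Equation (2.49)."""
    return 10.0 * math.log10(p_rms ** 2 / P_REF ** 2)

# Two incoherent pure tones with pressure amplitudes 0.02 Pa and 0.01 Pa
p_rms_1 = 0.02 / math.sqrt(2.0)
p_rms_2 = 0.01 / math.sqrt(2.0)

# The squares of the RMS pressures add up, Equation (2.51)
p_rms_total = math.sqrt(p_rms_1 ** 2 + p_rms_2 ** 2)

print(round(spl_db(p_rms_1), 1))         # ~57.0 dB for the louder tone alone
print(round(spl_db(p_rms_total), 1))     # ~58.0 dB: the quieter tone adds about 1 dB
```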
There is another important measure of the magnitude of sound, according to the
following definition.
Definition 2.18 (Sound Intensity and Sound Intensity Level (SIL)). The sound intensity I of a sound wave is defined as the product of the sound pressure and the particle velocity,
\[ I = p^*\, v, \tag{2.52} \]
and measures the acoustic energy per area and time (unit W/m²) transported by the
wave. Its associated sound intensity level LI , abbreviated SIL, is defined via its root
mean square value,
\[ L_I = 10\,\log\frac{I_{\mathrm{RMS}}}{I_{\mathrm{ref}}}\ \text{dB} \qquad\text{with}\quad I_{\mathrm{ref}} = 10^{-12}\ \text{W/m}^2. \tag{2.53} \]
Again, the reference value Iref was believed to be the auditory threshold. For propagating plane waves, the sound intensity level is identical to the sound pressure level: from Equation (2.46) we see that v(t) = p*(t)/ρ0c, and consequently $I = p^{*2}/\rho_0 c$. The calculated reference value $I_{\mathrm{ref}} = p_{\mathrm{ref}}^2/\rho_0 c = 0.97\cdot 10^{-12}\ \text{W/m}^2 \approx 10^{-12}\ \text{W/m}^2$ coincides, apart from rounding errors, with the definition. So we have
\[ L_I = 10\,\log\frac{p_{\mathrm{RMS}}^2/\rho_0 c}{p_{\mathrm{ref}}^2/\rho_0 c} = L_p, \tag{2.54} \]
but only in the case of propagating plane waves. Yet it is a good approximation when
the sound comes from one direction and the listener is sufficiently far away from the
sound sources. In other cases, LI can only be determined by separately measuring
p∗ (t) and v(t), which is technically more complex.
two loudness sensations Ψ1(I1) and Ψ2(I2) elicited by the intensities I1 and I2 is logarithmically related to the ratio of the intensities:
\[ \Delta L = \Psi(I_2) - \Psi(I_1) = k\cdot\log_{10}\frac{I_2}{I_1}, \qquad k = \text{const.} \tag{2.55} \]
However, Fechner made some assumptions which contradict experimental obser-
vations. Since Stevens, a power law is applied for a more realistic description of the
relation between loudness and sound intensity (see [12, p. 64]).
Listeners are able to ascribe an equal loudness to sine tones of different frequen-
cies. Successively presenting sine tones covering the whole audible frequency range
from about 20 Hz up to almost 20,000 Hz and adjusting the sound pressure level of
each tone to produce the same loudness sensation leads to curves of equal loudness
as a function of frequency (see Figure 2.9). To quantify this common loudness, the
unit phon is established in the following way.
Definition 2.19 (Loudness Level). For sine tones of f = 1000 Hz, the loudness level
LΦ , measured in phon, is equal to the sound pressure level L p in dB,
L_\Phi = \frac{L_p}{1\ \mathrm{dB}} \cdot 1\ \mathrm{phon} \qquad \text{for 1000-Hz sine tones.} \qquad (2.56)
Hence, a 1000-Hz sine tone of L p = 20 dB sound pressure level has a loudness level
of LΦ = 20 phon (note that when inserting the given SPL into Equation (2.56), the
unit dB cancels out and the unit phon is left). For sine tones of some other frequency,
the corresponding loudness level is defined indirectly as the value of the SPL (in
dB) of the 1000-Hz tone, which is perceived as equally loud. In [6], the SPLs of
equally loud tones of different frequencies are defined by a diagram (and table of
values) similar to Figure 2.9, which displays curves of constant loudness level in the
f-L_p-plane.
Though isophonic sine tones, i.e. tones with equal loudness level, are perceived as
equally loud, they may have quite different SPLs, depending on their frequencies. It
turns out that sine tones of very low or very high frequencies must have much higher
intensities than sine tones in the middle of the audible frequency range to produce the
same loudness level. The ear is most sensitive to frequencies between 2000 Hz and
5000 Hz.
For example, the reference sine tone of f = 1000 Hz and LΦ = 30 phon has an
SPL of L p = 30 dB, whereas a sine tone of 40 Hz and 30 phon must have an SPL of
65 dB to achieve the same loudness, and a sine tone with 5000 Hz and 30 phon has
an SPL of only 23 dB as can be seen from the figure.
An isophone of particular importance is the threshold in quiet or hearing threshold,
where the limit of loudness sensation is reached. It corresponds to the equal-loudness
contour of 3 phon (not 0 phon, because the reference value of the SPL scale had been
defined slightly inaccurately). People with average hearing capabilities fail to hear
sounds below this threshold.
The definition of the loudness level LΦ measured in phon lets us identify tones
of different frequencies that all appear to be equally loud to the human ear. But it
Figure 2.9: Curves of constant loudness level L_Φ = 3, 10, ..., 100 phon, so-called isophones, in the f-L_p-plane, after [6].
for tones that are louder than the reference tone, and approximately
\Psi \approx \left[ (L_\Phi / 40\ \mathrm{phon})^{1/0.35} - 0.0005 \right] \mathrm{sone} \qquad \text{if } L_\Phi < 40\ \mathrm{phon} \qquad (2.58)
Figure 2.10: (a) Amplitude modulated signal with carrier angular frequency ωc =
20 s−1 , envelope angular frequency ωm = 4 s−1 , modulation depth m = 1/2 and initial
phase shift φ = 0. The envelope function E(t) is plotted in gray. (b) Spectrum of the
modulated signal.
An amplitude modulated signal is the product of a fast oscillating carrier c(t) and an envelope E(t) = 1 + s(t) containing the signal s(t) (which, in radio engineering, is to be broadcast).
If the carrier is a sine with the carrier angular frequency ωc and the modulation
is periodic with the angular frequency ωm , the amplitude modulated signal becomes
x(t) = [1 + m cos(ωmt + φ )] sin(ωct). This signal is called a sinusoidally amplitude
modulated sinusoid or SAM. The factor m describes the modulation depth. It is
often expressed as a percentage, and φ specifies the initial phase of the signal. For an
example, see Figure 2.10 (a).
Applying trigonometric identities shows that x(t) is the sum of three sine func-
tions with frequencies ωc , ωc − ωm , and ωc + ωm ,
x(t) = \sin(\omega_c t) + \frac{m}{2}\sin([\omega_c - \omega_m]t - \phi) + \frac{m}{2}\sin([\omega_c + \omega_m]t + \phi). \qquad (2.60)
The components with the angular frequencies ωc ± ωm are called sidebands and have
an amplitude of m/2. Figure 2.10 (b) shows the spectrum of the amplitude modu-
lated sine of Equation (2.60). Thus, an amplitude modulated sine is a complex tone
consisting of the three equidistant partials and is very likely to elicit a residue pitch
corresponding to the modulation frequency which is the frequency of the fundamen-
tal.
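The following base-R sketch generates a SAM signal with the parameters of Figure 2.10, checks the identity of Equation (2.60) numerically, and plots the magnitude spectrum with its carrier and sideband peaks. The sampling rate and duration are arbitrary choices.

```r
# minimal sketch: sinusoidally amplitude modulated sinusoid (SAM) and its spectrum
fs      <- 100                       # samples per second (assumed)
t       <- seq(0, 8, by = 1/fs)
omega_c <- 20; omega_m <- 4; m <- 0.5; phi <- 0   # parameters of Figure 2.10

x  <- (1 + m * cos(omega_m * t + phi)) * sin(omega_c * t)     # SAM signal
x2 <- sin(omega_c * t) +
      m/2 * sin((omega_c - omega_m) * t - phi) +
      m/2 * sin((omega_c + omega_m) * t + phi)                # Equation (2.60)
max(abs(x - x2))                     # numerically zero: both forms agree

X     <- abs(fft(x)) / length(x)                              # magnitude spectrum
omega <- 2 * pi * (seq_along(x) - 1) * fs / length(x)         # angular frequency axis
plot(omega[omega < 40], X[omega < 40], type = "h",
     xlab = "angular frequency [1/s]", ylab = "|X|")          # peaks near 16, 20, 24 s^-1
```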
2.4 Timbre — the Third Moment
Time-frequency transformations are similar to the Fourier transform, but they integrate the musical signal only
over a short time interval, thus calculating the spectral content within this interval.
However, the local precision of a time-frequency atom is limited by the uncertainty
principle [17]. For the calculated time-dependent spectrum this means that we either
have a high temporal resolution together with a low frequency resolution, or a
low temporal resolution together with precisely determined frequencies; both
together, a high temporal resolution and a high frequency resolution, are impossible.
Note that in the following theory description, we will always work with angular
frequencies ω because this is much more convenient than using frequencies f and
spares a lot of factors 2π in the equations. As usual in the signal processing
literature, the word "frequency" is often used synonymously with "angular frequency."
Therefore, care must be taken: the real physical frequencies can be obtained from
the angular frequencies via f = ω/2π. Whenever the unit Hz is used, a real (and not
an angular) frequency is indicated; angular frequencies are given in s⁻¹. Although
both Hz and s⁻¹ have the same dimension, this distinction proves useful to remove
the confusion introduced by the theorists.
respectively. Similarly,
\xi = \frac{1}{2\pi}\int_{-\infty}^{\infty} \omega\,|F(\omega)|^2\, d\omega \qquad \text{and} \qquad \sigma_\omega^2 = \frac{1}{2\pi}\int_{-\infty}^{\infty} (\omega - \xi)^2\, |F(\omega)|^2\, d\omega \qquad (2.62)
where the function g_{u,ξ}(t) = g(t − u)e^{iξt} is called the time-frequency atom. The
Gabor transform characterizes the strength and phase with which the frequency ξ
is contained in the signal at about time u. Its energy density, i.e. the square of its
absolute value, is called the spectrogram.
Name        g(t)
Rectangle   1 for |t| < 0.5, else 0
Hamming     0.54 + 0.46 cos(2πt)
Gaussian    exp(−18t²)
Hanning     cos²(πt)
Blackman    0.42 + 0.5 cos(2πt) + 0.08 cos(4πt)
The time-frequency atom gu,ξ has a time location u and frequency location of
approximately ξ , and its Fourier transform reads Gu,ξ (ω) = G(ω − ξ )e−iu(ω−ξ ) ,
where G is the Fourier transform of the window function g. The duration and fre-
quency spread of this time-frequency atom do not depend on the instantaneous time
u or frequency ξ but only on the window function g, as
\sigma_t^2 = \int_{-\infty}^{\infty} (t-u)^2\, |g(t-u)|^2\, dt = \int_{-\infty}^{\infty} t^2\, |g(t)|^2\, dt \qquad (2.67)
\sigma_\omega^2 = \frac{1}{2\pi}\int_{-\infty}^{\infty} (\omega-\xi)^2\, |G_{u,\xi}(\omega)|^2\, d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} \omega^2\, |G(\omega)|^2\, d\omega. \qquad (2.68)
Therefore, all time-frequency atoms gu,ξ have the same duration and frequency
spreads and thus all their Heisenberg Boxes, which are centered at points (u, ξ ) in the
time-frequency plane, are of the same aspect ratio and of the same area σt σω > 1/2;
see Figure 2.19. Loosely speaking, a Heisenberg Box at the point (u, ξ) indicates the
region of the time-frequency plane which influences the value of the spectrogram
P_wx(u, ξ). Two values P_wx(u₁, ξ₁) and P_wx(u₂, ξ₂) are independent of each other only
if the Heisenberg Boxes around the two sampling points (u₁, ξ₁) and (u₂, ξ₂) do not
overlap. Hence, the time-frequency resolution of the Gabor transform is limited by
the size of the Heisenberg Boxes, which in turn is governed by the uncertainty principle,
Equation (2.63).
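The following base-R sketch illustrates the windowed (Gabor) analysis described above: each frame of the signal is multiplied by a shifted Hamming window (taken from the table of window functions, rescaled to the frame length) and Fourier transformed, and the squared magnitude serves as the spectrogram value. The function name, window length, hop size, and test signal are illustrative assumptions.

```r
# minimal sketch of a discrete Gabor transform / spectrogram
gabor_spectrogram <- function(x, fs, win_len = 512, hop = 128) {
  n <- (0:(win_len - 1)) / win_len - 0.5            # window argument t in [-0.5, 0.5)
  g <- 0.54 + 0.46 * cos(2 * pi * n)                # Hamming window g(t)
  starts <- seq(1, length(x) - win_len + 1, by = hop)
  P <- sapply(starts, function(s) {
    frame <- x[s:(s + win_len - 1)] * g             # x(t) g(t - u)
    Mod(fft(frame))[1:(win_len / 2)]^2              # energy density of the frame
  })
  list(time = (starts - 1 + win_len / 2) / fs,      # centre u of each window, in s
       freq = (0:(win_len / 2 - 1)) * fs / win_len, # frequency axis in Hz
       P    = P)                                    # rows: frequency, columns: time
}

# example: a tone whose frequency jumps from 440 Hz to 880 Hz after one second
fs <- 8000
x  <- c(sin(2 * pi * 440 * (0:(fs - 1)) / fs),
        sin(2 * pi * 880 * (0:(fs - 1)) / fs))
S  <- gabor_spectrogram(x, fs)
image(S$time, S$freq, t(S$P), xlab = "time u [s]", ylab = "frequency f [Hz]")
```

Increasing win_len sharpens the frequency resolution and blurs the time resolution, and vice versa, in line with the fixed aspect ratio of the Heisenberg Boxes discussed above.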
Figure 2.12: Spectrograms of a short tune (see inset) played on a piano (left) and on
a trumpet (right).
The analysis was performed with the software package Praat [2]. Energy concentrations, i.e. large
values of P_wx, are represented by dark gray colors. The spectrograms reveal that
the piano tones have strong partials up to higher frequencies than the trumpet tones. The onsets of the
piano tones are quicker and sharper than the onsets of the trumpet tones, but they are
also very noisy. Recall that the Fourier transform
of a click contains all frequencies. Correspondingly, the long vertical lines at the
beginning of each tone in the piano spectrogram represent the stroke of the piano
hammer. In contrast, the initial white stripes of the trumpet spectrogram
demonstrate that the trumpet player took care to produce soft onsets.
The frequency regions of different tonal color are the so-called registers of the instrument.
In 1929 the German physicist Erich Schumann (1898–1985) demonstrated the
stability of formants under changing parameters of a sound such as
pitch and volume [19]. Some examples of formant center frequencies are given in
Table 2.5.
The formants of string instruments depend not only on the stroke and position
of the bow, but vary widely from instrument to instrument and depend on the form of
the corpus, the volume of air in the corpus, the wood, and the coating. Nevertheless,
averaged values of the formant frequencies can be given (see Table 2.6).
If an instrument player changes the register, e.g., by over-blowing his wind in-
strument, or if a violinist changes the sound of his violin by altering the stroke of
the bowing, the formants also change. Some instruments have very stable formants,
Table 2.5: Center Frequencies of the Formants of Woodwind Instruments (left) and
Brass Instruments (right) [4]
e.g., brass instruments, whereas other instruments show a great flexibility of their
formants, e.g., string instruments or the human voice.
Speech recognition is based on the human ability to form and to discern different
vowels. Each vowel has characteristic formants of its own. Generally, four vowel
formants are distinguished. The first and second formants f1 and f2 are important
for vowel recognition (see Table 2.7) whereas the higher formants are individually
different and determine the timbre of the speaker’s voice. In the table, the frequencies
of the main formants are bold-faced. On the right, there is a spectrogram of the five
vowels.
A formant of special interest is the singer's formant. This designation
describes a special energy concentration in the human voice which is a result of the
highly artificial old Italian belcanto singing technique. The voice of a trained
opera singer produces a strong formant in a frequency region between about 2000 Hz and
3000 Hz, which gives the voice its sonority and noble timbre. In contrast, an opera
orchestra produces much less sound energy in this frequency region, as demonstrated
in Figure 2.14. Recall that the human ear is highly sensitive to those frequencies,
as can be read from the isophones shown in Figure 2.9. As a result, a voice with a
well-developed singer's formant can be heard over a whole orchestra although the
singer is not at all able to produce as much sound energy as the orchestra [31].
2.4.5 Transients
Two successive notes in music or two vowels in speech are joined by transitory sound
components called transients. In contrast to tones or vowels, transients are not stationary
and thus do not have a clear pitch. Instead, transients are characterized by a
quickly changing frequency content and noisy sound components. The consonants
of speech are transients with great importance for recognition. In music, the onsets
of tones are transient sounds decisive for instrument recognition. In addition to the spectrum
of the stationary tones, the transients determine the timbre of the instruments.
The short transient oscillation from tone onset to the stationary sound shows a quite
individual evolution of the spectral components. As an example, the first 50 milliseconds
Table 2.7: Center Frequencies of the First and Second Formants of the Five Vowels
[4]
Figure 2.14: The singer’s formant. Between 2000 Hz and 3000 Hz, a professional
opera singer produces a sound energy concentration much higher than that of an
orchestra [31].
Figure 2.15: Transient oscillations (50 milliseconds) of the first five partials of a
saxophone [28].
from onset to the stationary sound of a saxophone are shown in Figure
2.15 [28]. Other instruments have a different evolution of their spectral components
in the onset phase, which may last even longer than 100 ms.
For most instruments, the onset is more intense than the following stationary part
of the sound. In sound synthesis applications, the intensity evolution of a tone onset is
simulated by an ADSR envelope. The acronym ADSR stands for the four phases of
the envelope (see Figure 2.16): An intense attack is followed by a quick decay to the
designated sustain level of the sound. The long sustain phase is ended by the
final release phase, during which the sound intensity is turned off. Depending on
the different time evolutions of onsets, a number of variations of the ADSR envelope
are used.
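A minimal base-R sketch of such an ADSR envelope follows. The phase durations, the sustain level, and the use of simple linear segments are arbitrary modeling choices for illustration.

```r
# minimal sketch: ADSR envelope applied to a sine tone
adsr <- function(n, fs, attack = 0.05, decay = 0.1, sustain_level = 0.6, release = 0.2) {
  na <- round(attack * fs); nd <- round(decay * fs); nr <- round(release * fs)
  ns <- n - na - nd - nr                              # remaining samples: sustain phase
  c(seq(0, 1, length.out = na),                       # attack: rise to full intensity
    seq(1, sustain_level, length.out = nd),           # decay: fall to the sustain level
    rep(sustain_level, ns),                           # sustain
    seq(sustain_level, 0, length.out = nr))           # release: intensity is turned off
}

fs  <- 44100
dur <- 1.0
t   <- (0:(fs * dur - 1)) / fs
x   <- sin(2 * pi * 440 * t) * adsr(length(t), fs)    # enveloped 440-Hz tone
plot(t, x, type = "l", xlab = "time [s]", ylab = "amplitude")
```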
Figure 2.16: The ADSR envelope (amplitude over time) with its phases attack, decay, sustain, and release.
In Section 2.2.2 we saw that a clamped string can vibrate in different modes –
the fundamental and the overtones – which gives rise to the generation of complex
tones on that string. A similar concept can be applied to wind instruments,
where the air column inside the instrument itself vibrates in different modes.
Two kinds of boundary conditions for the air column inside the instrument
can be specified:
• Closed boundary (rigid wall): Since particles cannot move into a rigid wall,
the normal velocity must vanish, e.g., v · e_z = 0 if the wall lies in the x-y-plane.
This implies for the velocity potential (see Definition 2.16) that ∂φ/∂z = 0 on
this boundary for all times.
• Open boundary (the "outlet" of the instrument): For the idealized case that
the instrument does not radiate any sound, i.e. the waves are entirely reflected
back into the instrument, the acoustic pressure vanishes in the ambient
atmosphere in front of the boundary, p* = 0. In terms of the velocity
potential, this condition reads ∂φ/∂t = 0 on the open boundary.
Obviously, an instrument without any sound radiation at its outlet is entirely
useless; but this is the only simplification which gives a manageable boundary
condition in the framework of eigenmodes.
Let us discuss two special instrument geometries that are relevant for wind
instruments. One of the simplest geometries in use is the cylinder, found for
example in flutes, clarinets, and almost anything with a "pipe" in its name,
like organ pipes, bagpipes, or panpipes. The full analysis of the eigenmodes of
a cylinder requires a coordinate transformation of the wave equation and the
use of Bessel functions. However, we will be content with finding the most
obvious, and yet most relevant, modes.
We denote the direction of the cylinder axis as z, so that the circular cross
section lies in the x-y-plane (see Figure 2.17, left). Due to the boundary conditions,
the velocity field must be parallel to the z-axis at the cylinder walls. In
the simplest case, it is parallel to the z-axis and constant throughout the whole
circular cross section. In this plane wave assumption, we only need to solve the
wave equation in the z direction for the velocity field. As boundary conditions,
we take a closed boundary at z = 0, because the opening is blocked by a reed
or the player's lips, and an open boundary at z = L, so our task is to solve the
one-dimensional wave equation for the velocity potential φ(z,t).
Figure 2.17: Coordinates of a cylinder with plane wavefronts (left) and a cone with
spherical wavefronts (right).
Inserting the separation ansatz φ(z,t) = Z(z) T(t) gives
\frac{\ddot{T}}{T} = -c^2 k^2 \qquad \text{and} \qquad \frac{Z''}{Z} = -k^2 \qquad (2.70)
\Rightarrow \quad T(t) \sim \sin(\omega t + \varphi) \qquad \text{and} \qquad Z(z) \sim \sin(kz + \psi). \qquad (2.71)
For the cone (see Figure 2.17, right), assuming spherically symmetric waves, the wave equation reads
\frac{\partial^2}{\partial t^2}\phi(r,t) - c^2\, \frac{1}{r^2}\frac{\partial}{\partial r}\!\left(r^2 \frac{\partial}{\partial r}\phi(r,t)\right) = 0. \qquad (2.74)
Figure 2.18: Eigenmodes in a cylinder (left) and cone (right) with one end open.
For the cylinder, plane waves are assumed, and an appropriate scaled cosine function
with a maximum at z = 0 and root at z = L describes the spatial pressure variation. In
the case of the cone, spherical waves are assumed. The pressure variation is described
by an appropriately scaled cardinal sine, again with a maximum at r = 0 and root at
r = L. Despite their similarity (besides the decay in amplitude for the cone), these
different modes produce very different harmonics – odd ones for the cylinder, and
the full spectrum for the cone.
Inserting the separation ansatz φ(r,t) = (R(r)/r) T(t) gives
\frac{\ddot{T}}{T} = -c^2 k^2 \qquad \text{and} \qquad \frac{R''}{R} = -k^2 \qquad (2.75)
\Rightarrow \quad T \sim \sin(\omega t + \alpha) \qquad \text{and} \qquad R \sim \sin(kr + \beta). \qquad (2.76)
We have a closed boundary condition at the tip, φ (0,t) = 0 and an open bound-
ary at the other end, φ̇ (L,t) = 0. The first one implies β = 0, which is also the
correct choice to obtain a finite value for φ (0,t) since sin(k r)/r converges to
1 when r approaches zero. The second boundary condition quantizes the wave
number, and thus the wavelength and frequency, as
k_n = n\pi/L, \qquad \lambda_n = \frac{2L}{n}, \qquad \text{and} \qquad f_n = n\,\frac{c}{2L} \qquad (2.77)
with n = 1, 2, 3, . . . To summarize, an eigenmode of the cone is given by
\phi_n(r,t) = A_n\, \frac{\sin(k_n r)}{r}\, \sin(\omega_n t + \alpha_n). \qquad (2.78)
For an illustration, see Figure 2.18. Note that the ansatz used in this calcu-
lation has a very special form. There are other possible solutions involving
spherical Bessel functions (and spherical harmonics for modes which are not
homogeneous in the angles).
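The following base-R sketch contrasts the two mode families of Figure 2.18. The cone frequencies follow Equation (2.77); for the cylinder closed at one end, the standard closed-open result f_n = (2n−1)c/(4L), which produces odd harmonics only, is assumed here since its derivation is not reproduced above. The length and the speed of sound are example values.

```r
# minimal sketch: eigenfrequencies of a closed-open cylinder vs. a cone (cf. Figure 2.18)
c_air <- 343          # speed of sound in m/s (approximate)
L     <- 0.66         # length of the air column in m (arbitrary example)
n     <- 1:5

f_cylinder <- (2 * n - 1) * c_air / (4 * L)   # odd multiples of c/(4L) -- assumed standard result
f_cone     <- n * c_air / (2 * L)             # full harmonic series, Equation (2.77)

rbind(cylinder = f_cylinder, cone = f_cone)
```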
However, the modes calculated above are musically very relevant, since
2.5 Duration — the Fourth Moment
A wavelet ψ(t) is a function with zero average and unit norm that is centered in the
neighborhood of t = 0.
Let Ψ(ω) be its Fourier transform. Its frequency location is given by
\eta = \frac{1}{2\pi}\int_{-\infty}^{\infty} \omega\, |\Psi(\omega)|^2\, d\omega. \qquad (2.80)
The wavelet ψ(t) can be scaled by a real factor 1/s to alter its width,
\psi_s(t) = \frac{1}{\sqrt{s}}\, \psi(t/s), \qquad (2.81)
where the prefactor 1/√s ensures that the scaled version ψ_s is still normalized. From
Fourier theory it follows that the Fourier transform Ψ_s(ω) of ψ_s(t) is also rescaled,
Ψ_s(ω) = √s Ψ(sω). Applying this relation, it becomes clear that the frequency location
η_s of Ψ_s(ω) is equal to the scaled frequency location of Ψ(ω), that is, η_s = η/s.
In addition to this rescaling, which achieves a shift of the frequency location, the
wavelet can be time-shifted to any time location u,
\psi_{u,s}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left(\frac{t-u}{s}\right). \qquad (2.82)
From the time-shift theorem it follows for the Fourier transform of ψu,s (t) that
Ψu,s (ω) = Ψs (ω) e−iuω . Thus, the frequency location of Ψu,s (ω) is equal to the
frequency location of Ψs (ω), which is η/s as shown above; the time shift does not
influence the frequency properties of the wavelet.
The duration σt,s and frequency spread σω,s of the shifted and scaled wavelet
ψu,s (t) are also only affected by the scaling parameter s. A quick calculation shows
that they are related to the duration σt and frequency spread σω of the original
wavelet ψ(t) by σt,s = sσt and σω,s = σω /s. When the scale s decreases, the width
(or duration) of ψu,s (t) is reduced; and at the same time, its frequency location ηs and
frequency spread σω,s increase. Thus, the Heisenberg box of the wavelet is shifted
upwards to higher frequencies in the time-frequency plane, and changes its aspect
ratio to be more slender; see Figure 2.19 (b).
Definition 2.25 (Wavelet Transform and Scalogram). The wavelet transform of a
function x(t) is the complex amplitude W X ∈ C
W X(u, s) = \int_{-\infty}^{\infty} x(t)\, \psi_{u,s}^*(t)\, dt = \int_{-\infty}^{\infty} x(t)\, \frac{1}{\sqrt{s}}\, \psi^*\!\left(\frac{t-u}{s}\right) dt. \qquad (2.83)
Analogous to the spectrogram of the Gabor transform, the energy density of the
wavelet transform, |W X(u, s)|², is called the scalogram.
According to the Heisenberg Boxes of the shifted and scaled wavelets, the high-
frequency components of a musical signal are analyzed with a higher time resolution
than the low-frequency components with a wavelet transform. The smaller the value
of the scaling factor s, the more the energy of the wavelet ψu,s (t) is confined to a
smaller time interval σt,s , but the higher the frequency location ηs is and the greater
the frequency spread σω,s . This is in contrast to the Gabor transform, whose Heisen-
berg Boxes have the same aspect ratio everywhere on the time-frequency plane; see
Figure 2.19 for a comparison.
When plotting a scalogram, a common choice for the scale parameter is s = 2^{−k/M} with
k ∈ {0, 1, . . . , I · M}. The positive integer I is called the number of octaves and the
integer M is the number of voices per octave. Choosing M = 12 would correspond to
the twelve notes of the chromatic scale [17].
In the following example a wavelet transform is applied to a series of clicks. The
number of octaves is I = 8 and the window function is a Mexican hat
\psi(t) = \frac{2}{\pi^{1/4}\sqrt{3\sigma}}\left(1 - \frac{t^2}{\sigma^2}\right)\exp\!\left(-\frac{t^2}{2\sigma^2}\right) \qquad (2.85)
as wavelet, which is the normalized second derivative of a Gaussian function. This
function is real and symmetric; thus its Fourier transform Ψ(ω) is also real and symmetric.
That means that the frequency location vanishes, η = 0, so that this wavelet
transform cannot detect the frequency of a signal. However, it can detect irregularities
like jumps and kinks in a signal [17], and is therefore suitable for the analysis
of click series or rhythmic patterns. Figure 2.20 (a) shows an equidistant click
sequence with a frequency of about 4.8 Hz and its wavelet transform. Bright regions of
Figure 2.19: Heisenberg Boxes in the time-frequency plane for (a) the Gabor trans-
form and (b) the wavelet transform. The small functions on the axes symbolize the
window or wavelet functions. In case of the Gabor transform, the time and fre-
quency resolution is constant throughout the time-frequency plane, whereas in case
of the wavelet transform, the time resolution increases for higher frequencies.
Figure 2.20: (a) A signal of seven equidistant clicks undergoes a wavelet transform.
The top line shows the signal, the bottom figure the scalogram. (b) A percussion
clip of the song “Buenos Aires” is analyzed by a wavelet transform. The rhythmic
patterns on the different frequency levels are analyzed and can be read from the
scalogram. For the analyses, the software FAWAVE, written by the mathematician
James S. Walker, was applied.
the scalogram indicate high energy concentration. As discussed before on the level
of Heisenberg Boxes, the scalogram clearly shows a more precise time resolution for
higher factors 1/s. The limits of time resolution can be observed directly: only at
large enough values of 1/s are the brighter regions clearly separated.
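A base-R sketch of this analysis follows: the Mexican hat of Equation (2.85) is scaled over the grid s = 2^{−k/M}, and WX(u,s) is approximated by a discrete convolution with a click sequence of about 4.8 Hz as in Figure 2.20 (a). The sampling rate, σ, I, M, and the trimming of the convolution output are illustrative assumptions, not the settings of FAWAVE.

```r
# minimal sketch: Mexican hat scalogram of an equidistant click sequence
mexican_hat <- function(t, sigma = 1) {                       # Equation (2.85)
  2 / (pi^0.25 * sqrt(3 * sigma)) * (1 - t^2 / sigma^2) * exp(-t^2 / (2 * sigma^2))
}

fs <- 200
t  <- seq(0, 2, by = 1/fs)
x  <- numeric(length(t))
x[seq(20, length(t), by = 42)] <- 1             # equidistant clicks, about 4.8 Hz

I <- 6; M <- 4                                  # octaves and voices per octave (assumed)
scales <- 2^(-(0:(I * M)) / M)                  # s = 2^(-k/M), so 1/s runs from 1 to 2^I

scalogram <- sapply(scales, function(s) {
  tau <- seq(-4 * s, 4 * s, by = 1/fs)          # support of the scaled wavelet
  psi <- mexican_hat(tau / s) / sqrt(s)         # psi_{u,s} for u = 0 (real, symmetric wavelet)
  w   <- convolve(x, psi, type = "open") / fs   # WX(u, s) for all shifts u
  w[(length(psi) %/% 2) + seq_along(x)]^2       # energy density, trimmed to the signal length
})
image(t, log2(1 / scales), scalogram,
      xlab = "time u [s]", ylab = "octaves of 1/s")
```

As in Figure 2.20 (a), the bright ridges become clearly separated in time only at sufficiently large values of 1/s.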
Detailed insights into the rhythmic structure of a piece of music can be obtained
from a scalogram. Figure 2.20 (b) shows an analysis of the highly complex structure
of a percussion clip from the song “Buenos Aires” [32]. On the level of lower values
of 1/s, superordinate rhythmic patterns are detected; on the level of higher values of
1/s, the rhythmic fine structure is reflected by the scalogram.
2.7 Exercises
Exercises, both theoretical ones and practical ones based on the software packages R [23] and
MATLAB®, will be provided at the book's web site
http://sig-ma.de/music-data-analysis-book,
which also includes example data sets partly needed for the exercises.
Bibliography
[1] M. L. Boas. Mathematical Methods in the Physical Sciences. Wiley, 2006.
[2] P. Boersma and D. Weenink. Praat: Doing phonetics by computer. www.praat.
org.
[3] I. N. Bronshtein, K. A. Semendyayev, G. Musiol, and H. Mühlig. Handbook of
Mathematics. Springer, 2007.
[4] M. Dickreiter. Der Klang der Musikinstrumente. TR-Verlagsunion, 1977.
[5] DIN 45630. Physical and Subjective Magnitudes of Sound. Beuth, 1971.
[6] DIN ISO 226. Acoustics: Normal Equal-Loudness-Level Contours. Beuth,
2003.
[7] M. Dörfler. Gabor Analysis for a Class of Signals Called Music. Diss. Univer-
sity of Vienna, 2002.
[8] M. Ebeling. Die Ordnungsstrukturen der Töne. In W. G. Schmidt, ed., Faszi-
nosum Klang. Anthropologie – Medialität – kulturelle Praxis, pp. 11–27. De
Gruyter, 2014.
[9] N. Fletcher and T. Rossing. The Physics of Musical Instruments. Kluwer Aca-
demic Publishers, 1998.
[10] D. E. Hall. Musical Acoustics. Brooks/Cole, 2001.
[11] D. E. Hall. Musikalische Akustik. Schott/Mainz, 2008.
[12] W. M. Hartmann. Signals, Sound and Sensation. Springer, 2000.
[13] H. v. Helmholtz. Die Lehre von den Tonempfindungen als physiologische
Grundlage der Theorie der Musik. Olm, 1862 / 1983.
[14] W. Köhler. Akustische Untersuchungen. Zeitschrift für Psychologie. Barth,
1909.
[15] L. Landau and E. Lifšic. Fluid Dynamics. Course of Theoretical Physics.
Butterworth-Heinemann, 1995.
[16] G. Langner. Die zeitliche Verarbeitung periodischer Signale im Hörsystem:
Neuronale Repräsentation von Tonhöhe, Klang und Harmonizität. Zeitschrift
für Audiologie, 2007.
[17] S. Mallat. A Wavelet Tour of Signal Processing. Elsevier, 2009.
[18] M. McKinney and B. Delgutte. A possible neurophysiological basis of the
octave enlargement effect. J Acoust Soc Am., 106(5):2679–2692, 1999.
[19] P.-H. Mertens. Die Schumannschen Klangfarbengesetze und ihre Bedeutung
für die Übertragung von Sprache und Musik. Bochinsky, 1975.
[20] B. Moore and K. Ohgushi. Audibility of partials in inharmonic complex tones.
J. Acoust. Soc. Am. 93, 1993.
[21] D. Muzzulini. Genealogie der Klangfarbe. Peter Lang, 2006.
[22] D. O’Shaughnessy. Speech Communications: Human and Machine. Addison-
Wesley, 1987.
[23] R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2014.
[24] C. Reuter. Die auditive Diskrimination von Orchesterinstrumenten. Peter Lang,
1996.
[25] A. Riethmüller and H. Hüschen. Musik. In L. Finscher, ed., Die Musik in
Geschichte und Gegenwart, volume 6. Bärenreiter, 2006.
[26] J. F. Schouten, R. J. Ritsma, and B. Lopes Cardozo. Pitch of the residue. J.
Acoust. Soc. Am. 34, 1962.
[27] L. M. Smith and H. Honing. Time-frequency representation of musical rhythm
by continuous wavelets. Journal of Mathematics and Music, 2(2), 2008.
[28] W. Stauder. Einführung in die Akustik. Florian Noetzel Verlag, 1990.
[29] S. Stevens and J. Volkman. The relation of pitch to frequency. J. Acoust. Soc.
Am. 34, 1940.
[30] C. Stumpf. Tonpsychologie. S. Hirzel, 1883 / 1890.
[31] J. Sundberg. The Science of the Singing Voice. Northern Illinois University
Press, 1987.
[32] J. S. Walker. A Primer on Wavelets and Their Scientific Application. Chapman
& Hall/CRC, 2008.
[33] C. Weihs, U. Ligges, F. Mörchen, and D. Müllersiefen. Classification in music
research. Advances in Data Analysis and Classification, 1(3):255–291, 2007.
[34] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer, 1999.
Chapter 3
Musical Structures and Their Perception
Martin Ebeling
Institute of Music and Musicology, TU Dortmund, Germany
3.1 Introduction
The sensation of tone and the auditory perception of time patterns are the foundations
of all music. In the previous chapter we started from the sensation of tone and inves-
tigated the relation of the most prominent moments of the tonal sensation, which are
pitch, loudness, duration, and timbre, to the properties of the tonal stimulus. In the
following we demonstrate how the elementary components of music, that is to say
tones and time patterns, form musical structures and discuss the psychological foun-
dations of their perception. We use the notion of Gestalt coined by v. Ehrenfels and
the rules of Gestalt perception of Max Wertheimer to reflect the emergence of musi-
cal meaning. For this purpose, we consider some essential elements of Western music
theory in the light of music perception and cognition. It must be pointed out that
grasping musical structures is foremost learned implicitly by mere exposure to
music and does not require knowledge of the musical system and of music theory.
The perception of Gestalt in time (“Zeitgestalten”) is still an open question. Musical
structures are objects of musical thinking, which is essentially different from rational
thinking. But to grasp musical structures is likewise a powerful source of emotions.
Musical thinking and feeling are intertwined to evoke the aesthetic effects of music,
or as Ludwig van Beethoven put it: Music is a higher revelation than all wisdom and
philosophy ("Musik ist höhere Offenbarung als alle Weisheit und Philosophie").
3.2 Scales and Keys
The c-clef marks the one-lined c' (scientific: C4) with 261.626 Hz, and the g-clef or treble clef marks the
one-lined g' (scientific: G4) with 391.995 Hz (concerning the note names, refer to
Section 2.2.4). Note that the clefs are a pure fifth apart. Theoretically, every clef can
be positioned on each of the five lines of a staff. But in practice, only a few of them
have been used. These are shown in Figure 3.1.
Figure 3.1: The clefs used in Western music notation.
The treble clef and the bass clef are the commonly used clefs. The alto or viola
clef is regularly used for notation of viola music. The tenor clef is often used in in-
strumental music, e.g., for the notation of higher parts of the violoncello or bassoon.
All c-clefs were in use up to the end of the 19th century, especially in scores of choir
music. All clefs can be found in older music scores.
Figure 3.2: The notes of the diatonic C major scale. The number 1 beneath the staff
marks a whole tone step whereas 1/2 indicates a semitone step. Each bracket spans a
tetrachord.
Table 3.1: The Church Modes and Their Keynotes
keynote   scale
d         Dorian mode
e         Phrygian mode
f         Lydian mode
g         Mixolydian mode
The chromatic notes are deduced from the diatonic notes by alterations. A flat
sign (German: B) (symbol: ♭) or a sharp sign (German: Kreuz) (symbol: ♯) as an
accidental before one of the seven notes of the diatonic scale changes its pitch: a
flat lowers the original pitch by a semitone and a sharp raises it by a semitone. A
double flat sign (German: Doppel-B) (symbol: ♭♭) lowers the original pitch by a
whole tone and a double sharp (German: Doppelkreuz) (symbol: x) raises it by a
whole tone. The alteration of a note is revoked by the natural sign (symbol: ♮).
Each of the seven notes of the diatonic scale can be raised by a sharp or lowered
by a flat. As a consequence, there are two ways to notate a chromatic note. For
example, the note C sharp has the same pitch as the note D flat, the note D sharp has
the same pitch as the note E flat, etc. In the strict sense, this so-called enharmonic
change without pitch change is only possible in the equal temperament. In other tun-
ing systems, an enharmonic change slightly changes the pitch, e.g., in pure intonation
the note F sharp is slightly higher than the note G flat (for a detailed discussion of
tuning systems see [19]).
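The following base-R sketch illustrates the enharmonic remark above: in equal temperament, f sharp and g flat share one frequency, whereas a chain of acoustically pure fifths (ratio 3/2), which is one possible reading of "pure intonation," puts f sharp slightly above g flat. The reference pitch and the fifth-stacking construction are assumptions for illustration only.

```r
# minimal sketch: enharmonic change in equal temperament vs. stacked pure fifths
a4 <- 440                              # reference pitch in Hz (assumed)
c5 <- a4 * 2^(3/12)                    # the c above a4 in equal temperament

# equal temperament: f sharp = g flat, six semitones above c
fis_et <- c5 * 2^(6/12)
ges_et <- c5 * 2^(6/12)

# stacked pure fifths from c, folded back into the octave above c:
# f sharp from six fifths upward, g flat from six fifths downward
fis_pure <- c5 * (3/2)^6 / 2^3         # ratio 729/512 relative to c
ges_pure <- c5 * 2^4 / (3/2)^6         # ratio 1024/729 relative to c

c(equal = fis_et - ges_et,             # 0: the two notes coincide
  pure_fifths = fis_pure - ges_pure)   # > 0: f sharp is slightly higher than g flat
```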
The diatonic major scale of Figure 3.2 has the keynote c as starting point. The
natural minor scale has the same notes as the diatonic major scale but its keynote is
the tone a, which is a minor third below the keynote c of the major scale. Medieval
music and the music of the Renaissance and early Baroque commonly used the ec-
clesiastical modes or church modes, which are also based on the diatonic scale but
differ in their keynotes.
Only the degrees II, III, IV, and V of the diatonic scale serve as keynotes in the
medieval system of church modes (see Table 3.1). In order to construct a closed
system of church scales in which every degree of the diatonic scale can be a keynote
of a scale, later theorists in the time of the Renaissance invented the Aeolian scale,
which is just the natural minor scale with keynote a (degree VI of the diatonic scale).
They further constructed the Ionian scale with keynote c (degree I of the diatonic
scale), which is equal to the major diatonic scale of modern music theory. They
also invented the Locrian mode with keynote b (degree VII of the diatonic scale),
which however has no significance in music. Church modes are still important in
modern music theory: in jazz harmony, every chord is identified with a certain church
mode.
A scale (like any melody) can be transposed: it is moved upward or downward
in pitch so that the keynote changes. Of course, each of the twelve tones of the
chromatic scale can be the key of a diatonic scale. To preserve the diatonic structure
of the scale under transposition, key signatures are added to the staff. If the diatonic
scale is moved upwards by the interval of a pure fifth, a sharp must be added to the
seventh degree of the transposed scale to make it a leading note. If the diatonic scale
is moved downwards by the interval of a pure fifth, a flat must be added to the fourth
step of the transposed scale to preserve the interval of a pure fourth between the new
keynote and the fourth degree of the transposed scale.
Starting from the key c and successively ascending by the interval of a pure fifth
leads to the ascending circle of fifths with the keys c, g, d, a, e, b, f sharp, and c sharp
(see Figure 3.3). Each upward transposition by a pure fifth makes it necessary to add
a further sharp (♯). Successive downward transposition by the interval of a pure fifth
yields the descending circle of fifths with the keys c, f, b flat, e flat, a flat, d flat,
g flat, and c flat (see Figure 3.4). A further flat (♭) must be added to the new key
on proceeding to it by a downward transposition of a pure fifth. To each major key
corresponds a minor key with the same number of flats or sharps. Its keynote is equal
to the sixth degree (VI) of the corresponding major scale. The seventh degrees of the
minor scales are commonly altered by a semitone upwards to turn them into leading
tones.
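A small base-R sketch of this fifth-wise construction, expressed on pitch classes (0 = c, 1 = c sharp, ..., 11 = b), follows; the helper name is illustrative. Each upward step adds one sharp to the key signature and each downward step adds one flat, as described above.

```r
# minimal sketch: keynotes reached by stacking pure fifths (an upward fifth = 7 semitones)
keynote_after_fifths <- function(n_fifths, start = 0) (start + 7 * n_fifths) %% 12

keynote_after_fifths(1:7)     # 7 2 9 4 11 6 1: g, d, a, e, b, f sharp, c sharp (Figure 3.3)
keynote_after_fifths(-(1:7))  # 5 10 3 8 1 6 11: f, b flat, e flat, a flat, d flat, g flat, c flat
                              # (Figure 3.4; pitch classes coincide enharmonically)
```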
Figure 3.3: The circle of fifths: upward direction with seven major and correspond-
ing minor keys starting from the diatonic C major scale.
The Javanese Slendro, for example, divides the octave into five roughly equally
spaced tones. It is obvious that none of these tones can be equal to any of the twelve
tones of the equidistant chromatic scale. The Javanese Pelog is another system using
five tones from a set of seven tones within an octave that are not equally spaced. None
of these seven tones is equal to the tones of the Western chromatic scale. The blue
notes of blues are another example of tones that are not contained in the chromatic
scale. The blues third lies between the minor and the major third. The other blue
notes are the blues fifth and the blues seventh. On a keyboard, instead of the blue notes,
the minor third, the minor seventh, and the diminished fifth are played. Different
blues scales are used and the most prominent are shown in Figure 3.6. Note the minor
seventh, which represents the blues seventh. The bottom scale has a minor and a
major third. This ambiguity reflects the blues third.
Figure 3.4: The circle of fifths: downward direction with seven major and corre-
sponding minor keys starting from the diatonic C major scale.
Figure 3.5: Pentatonic scales: (a) the tones of the chord of five pure fifths have
the same note names as the tones of the anhemitonic pentatonic scale (b); classical
Japanese music uses the hemitonic pentatonic scale shown in (c), which is different
from the hemitonic pentatonic scale as used in Indonesian music.
Komplexes of the major triad. Symmetry operations preserve relations and thus parts
of a Gestalt, and have always been a topic of aesthetics. The elementary symmetry
operations are translation, reflection, rotation, and dilatation. The notion of Gestalt
was coined by Christian von Ehrenfels (1859–1932), who referred to the example of
a melody and pointed out that its Gestalt is more than the sum of its parts (notes)
and that it is transposable [28]. In music, a transposition is a translation of pitch:
the same melody can start from another tone and is played in another key. A canon
consists of a melody and of one or more time-shifted versions of the same melody
in other parts. This is a translation in time. A rhythm can be played with augmented
or diminished note values. These are dilatations in time. Reflections and rotations
are also applied in music, e.g., in counterpoint or in twelve-tone technique (retrograde,
inversion), but may not always be auditorily evident to the listener.
We cannot but conceptualize our perceptions as Gestalten, which become the
content of higher cognitive functions and which may be one source of meaning.
Gestalt emerges from the form-generating capacities of our sensory systems. Max
Wertheimer (1880–1943) postulated Gestalt principles that apply not only to vision
but also to music, although vision and hearing are quite different in many other re-
spects [29]. These principles may, for example, explain why and under which condi-
tions a series of tones is perceived as connected, thereby forming a melody, or they
may explain why certain simultaneous tones are heard as a musically meaningful
chord perceived as an entity.
Gestalt Principles
1. Figure-ground articulation – two components are perceived: a figure and a ground.
Example: A soloist accompanied by an orchestra.
2. Proximity principle – elements tend to be perceived as aggregated into a group if
they are close to each other.
Example: A good melody prefers small tone steps and avoids great jumps, which
would destroy the continuous flow of the music.
3. Common fate principle – elements tend to be perceived as grouped together if they
move together.
Example: If all parts of a piece of music move with the same rhythm, a succession
of harmonies is heard instead of single independent voices.
4. Similarity principle – elements tend to be grouped if they are similar to each other.
Example: Tones played by one instrument are heard as connected because they
have a quite similar timbre.
5. Continuity principle – oriented units or groups tend to be integrated into percep-
tual wholes if they are aligned with each other.
Example: The oriented notes of a scale are perceived as a single upward-moving
figure.
6. Closure principle – elements tend to be grouped together if they are part of a
closed figure.
Example: The notes of an arpeggio are heard as a harmony, which is the closed
figure in this case, e.g., a triad or seventh chord.
7. Good Gestalt principle – elements tend to be grouped together if they are part of a
pattern that is a good Gestalt, which means that it is as simple, orderly, balanced,
coherent, etc., as possible.
Example: A two-part piece of music; each part is perceived as a closed and good
Gestalt, so that the tones of two voices are perceptually segregated.
8. Past experience principle – elements tend to be grouped together if they appeared
quite often together in the past experience of the subject.
Example: Certain chord successions such as cadences have been heard so often
that the chords are perceived as an entity. Any deviation from the chord scheme
surprises and irritates the listener, as in the case of an interrupted cadence.
Integration and Segregation and Auditory Scene Analysis The reign of the Gestalt
principles in the sense of sight, i.e. in the visual perception of static pictures, is
obvious and was the main topic of the Gestalt psychologists. On the other hand, the
perception and cognition of temporal forms (German: Zeitgestalten) are not yet completely
clear. Different levels and durations of memory, short-term memory as well
3.4 Musical Textures from Monophony to Polyphony
In polyphonic music, the different parts are conceived as independent from one another.
Nevertheless, perceptually, all parts should fit together well. To this aim, a collection
of rules should be observed which forms the basis of the craft of counterpoint.
Rules of counterpoint were collected in textbooks, e.g., in the seminal compendium
Gradus ad Parnassum, written in 1725 by the Austrian composer Johann Joseph Fux
(1660–1741), a textbook that has been studied by generations of composers [10].
We briefly discuss the elementary rules of a two-part counterpoint as they provide
insight into the psychological preconditions of music perception and cognition. A
thorough introduction to the craft of counterpoint is given by Lemacher and Schroeder
[16]. The following discussion of the rules of counterpoint is largely based on this
book. De la Motte [4] describes the development of counterpoint in Western music.
Many textbooks on harmony cover the theory
of chords, chord progressions, cadences, and modulations. The word counterpoint
stems from the Latin punctus contra punctum, which describes the technique of how
to put a suitable note (punctus) against the notes of a given tune called cantus firmus.
It describes the simplest form of a two-part counterpoint. The concept of harmony is
a (logical and historical) consequence of the rules that form the craft of counterpoint.
The theory of harmony describes different types of chords and their significance in
music and gives rules for the chord progression.
Figure 3.7: Intervals. The pure or perfect consonances are prime (unison), octave,
fifth, and fourth (left); the imperfect consonances are thirds and sixths (middle); the
dissonant intervals are seconds and sevenths (right).
By alterations, which are chromatic changes of one or both interval tones, all
intervals can be augmented or diminished. Disregarding their perceptual qualities,
all augmented and diminished intervals are counted as dissonances in counterpoint
and must be resolved.
The dichotomy of musical intervals simplifies music theory, but perceptually,
each interval has its specific degree of consonance. Since ancient times the phenomenon
of consonance has been debated and especially the quest for the cause of
consonance has led to different explanations.
The octave phenomenon is of special importance in psychoacoustics and in mu-
sic theory. Remarkably, in most tonal systems tones an octave apart are regarded as
the same note. This is due to the intense sensation of tonal fusion of the two tones of
a simultaneous octave. The term tonal fusion describes the sensation of a unity when
listening to two simultaneous tones an octave apart. The phenomenon of tonal fusion
has already been discussed by the ancient Greek philosophers. Other consonant in-
tervals also show a more or less pronounced tonal fusion. The degree of tonal fusion
is directly correlated to the degree of consonance of the interval. The German psy-
chologist and philosopher Carl Stumpf (1848–1936) was the first to investigate this
phenomenon systematically in extensive hearing experiments [25]. He concluded
that there must be a physiological cause of tonal fusion in the brain. Indeed, tonal
fusion and the sensation of consonance have neurophysiological reasons. Licklider
[18] had already proposed a neuronal autocorrelation mechanism for pitch detection.
This idea is based on the theorem of Wiener–Khintchine which states that the Fourier
transform of the power spectrum of a signal is equal to the autocorrelation function
of the signal [11]. Thus, a spectral analysis by Fourier transform is equivalent to
an autocorrelation analysis (see Section 2.2.7). Unfortunately, Licklider’s model is
physiologically infeasible. But a neuronal periodicity detection mechanism for pitch
and timbre perception in the auditory system has been found in neural nodes of the
brain stem (nucleus cochlearis) and the mid-brain (inferior colliculus) [14, 15]. Es-
sentially, it performs an autocorrelation analysis and is based on a bank of neuronal
circuits. Each circuit adds a specific delay to the signal. The neural codes of the
original signal and the delayed signal are projected onto a coincidence neuron of the
circuit. If the specific delay is equal to a period of the signal, the coincidence neuron
fires a pulse, thereby signaling this period. Physiological data suggest a time window
of ε = 0.8 ms for coinci-
dence detection. In the auditory system, a tone is represented by a periodic train of
neuronal pulses. Its period is the same as the period of the tone [30]. In the case
of musical intervals, the periodicities of both interval tones are well preserved in the
auditory nerve [27]. Thus, in the model, an interval is represented by two pulse trains
x1 (t) and x2 (t) with periods p1 and p2 which are related by the vibration ratio s of
the interval tones: p2 = s · p1 . The width of the pulses Iε (t) is adjusted to the width
ε = 0.8 ms of the time window for coincidence detection.
x_1(t) := \sum_n I_\varepsilon(t - n \cdot p_1) \qquad (3.1)
x_2(t) := \sum_n I_\varepsilon(t - n \cdot p_2) = \sum_n I_\varepsilon(t - n \cdot s \cdot p_1) \qquad (3.2)
Now, the interval with the vibration ratio s is represented by the sum of both
pulse trains representing the interval tones. The autocorrelation function of this sum
is calculated to simulate the neuronal periodicity detection mechanism applied to
neuronal representation of the interval. For arbitrary vibration ratios s, an autocor-
relation function a(τ, s) of the corresponding pulse trains can be calculated over the
range of all audible periods from about 0 ms, corresponding to 20,000 Hz, to D = 50
ms corresponding to 20 Hz.
a(\tau, s) := \int_0^D \big(x_1(t) + x_2(t)\big)\big(x_1(t+\tau) + x_2(t+\tau)\big)\, dt \qquad (3.3)
The General Coincidence Function Γ(s) as defined by Ebeling [6, 7] integrates
over the squared autocorrelation function for arbitrary vibration ratios s, thus calcu-
lating the power of the autocorrelation function for every possible vibration ratio s
and the corresponding interval.
\Gamma(s) := \int_0^D a(\tau, s)^2\, d\tau \qquad (3.4)
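The following base-R sketch evaluates Equations (3.1)–(3.4) numerically with rectangular pulses of width ε = 0.8 ms. The time step, the 440-Hz lower tone, the chosen set of vibration ratios, and the crude rectangle-rule discretization are assumptions for illustration and not Ebeling's original implementation.

```r
# minimal sketch: pulse trains and the generalized coincidence function Gamma(s)
dt  <- 0.0001                               # time step: 0.1 ms (assumed)
D   <- 0.050                                # upper limit, 50 ms (about 20 Hz)
eps <- 0.0008                               # coincidence window of 0.8 ms
tt  <- seq(0, 2 * D, by = dt)               # simulate 2D so that all shifts up to D exist

pulse_train <- function(p) as.numeric((tt %% p) < eps)   # Equations (3.1)/(3.2)

gamma_s <- function(s, p1 = 1/440) {        # lower tone fixed at 440 Hz (arbitrary)
  x   <- pulse_train(p1) + pulse_train(s * p1)
  nD  <- round(D / dt)                      # number of samples in [0, D]
  lag <- seq(0, D, by = dt)
  a   <- sapply(lag, function(tau) {        # a(tau, s), Equation (3.3)
    k <- round(tau / dt)
    sum(x[1:nD] * x[(1 + k):(nD + k)]) * dt
  })
  sum(a^2) * dt                             # Gamma(s), Equation (3.4)
}

ratios <- c(1, 6/5, 5/4, 4/3, 45/32, 3/2, 8/5, 5/3, 16/9, 15/8, 2)
vals   <- sapply(ratios, gamma_s)
round(vals / max(vals), 2)                  # simple ratios such as 2/1 and 3/2 tend to
                                            # score highest (cf. Figure 3.8)
```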
Figure 3.8: The Generalized Coincidence Function [6, 7]; the vibration ratio of the
two interval tones for arbitrary intervals within the range of an octave is shown on
the abscissa.
The graph of the General Coincidence Function predicts high firing rates for consonant
intervals and low firing rates for dissonant intervals. It shows the same qualitative
course as the "Curve der Verschmelzungsstufen" (curve of degrees of fusion) which
Carl Stumpf determined from extensive hearing experiments [25]. The predictions can
be confirmed experimentally [1].
Figure 3.10: Resolution of introduced dissonances in the two-part counterpoint.
Passing and changing notes are incidental deviations from the overall consonant
structure, quickly passing by. Examples of changing notes are given in Figure 3.11.
Dissonant changing notes are only allowed in stepwise motion.
Figure 3.14: Dorian cantus firmus (c.f.) by Fux in the lower part and a 1:1 counter-
point in the upper part.
5. In every part, successive skips in the same direction should be avoided. The per-
ceptual reason is that the integration of the tones into a melodic line would be
disturbed by two successive skips.
6. The interval of the tenth should not be exceeded between both parts. The percep-
tual reason is that a separation of both parts by intervals greater than a tenth would
disturb their integration.
The second and third species follow the easy rule that on sustained notes only
consonant intervals are allowed, whereas consonant intervals and stepwise-reached
dissonant intervals (passing and changing notes) may be used on all unsustained notes,
as demonstrated in Figure 3.15. Again, the perceptual reason is that consonant intervals
support the integration of the parts, whereas dissonant intervals lead to their segregation.
Figure 3.15: Dorian cantus firmus (c.f.) by Fux in the upper part and a 1:2 counter-
point in the lower part. The dissonant intervals are indicated by numbers.
Figure 3.16: Dorian cantus firmus (c.f.) by Fux in the lower part and a 1/2:1 coun-
terpoint (bindings) in the upper part.
The fifth species is no longer bound to strict rhythmical relations but still observes
the rules of the first four species. Arbitrary rhythms are used to achieve a
more lively expression. In vocal polyphony, the rhythms of the parts should follow
and support the rhythm of the text. As an example of a florid two-part counterpoint,
Figure 3.17 presents an original composition of the German composer and music
theorist Michael Prätorius (1571–1621). The cantus firmus in the upper part is the
Protestant chorale “Jesus Christus unser Heiland” against which Prätorius composed
a counterpoint in the lower part. Observe the independent movement of the two parts
concerning their texts as well as the music.
Figure 3.17: Michael Prätorius, “Jesus Christus unser Heiland” from “Musae Sio-
niae,” 1610.
3.5.4 Chords
Generally, chords are two or more simultaneous pitches. A chord of two pitches is
synonymous with an interval; a triad is a chord of three pitches. The chords of the
Western music system consist of thirds stacked upon the root, which is the lowest
and fundamental tone of the chord. The name of the root denotes the chord. Chord
Figure 3.18: C major and a minor triads and their inversions. In strict counterpoint
the six-four chord is regarded as a dissonant chord.
inversions contain the same notes as the original chord but their order is inverted so
that a note other than the root is in the lowest part.
Triads There are four kinds of triads. Starting from the root,
• the major triad is composed of a major third followed by a minor third (Figure
3.19 (1));
• the minor triad is composed of a minor third followed by a major third (Figure
3.19 (2));
• the diminished triad is composed of two minor thirds (Figure 3.19 (3)) and
• the augmented triad is composed of two major thirds (Figure 3.19 (4)).
A triad can be built up on every degree of the diatonic scale. Consider the diatonic
major scale (see Figure 3.20 top):
• major triads are on degrees I, IV, and V;
• minor triads are on degrees II, III, and VI and
• a diminished triad is on degree VII.
Thus, three different kinds of triads occur in the major diatonic scale, in contrast
to four kinds of triads in the diatonic scale of the harmonic minor (note the altered
leading note; see Figure 3.20 bottom):
• minor triads are on degrees I and IV;
• major triads are on degrees V and VI;
• diminished triads are on degrees II and VII and
• an augmented triad is on degree III.
Note that only the major triad and the minor triad, but neither the diminished nor
the augmented triad, appear on degree I. This coincides with the rule of music theory
that only a major or minor triad can finish a piece of music. The diminished fifth of
the diminished triad is a dissonant interval.
Seventh Chords On each of the four triads, another minor or major third can be
stacked up to get a seventh chord. Seven types of seventh chords are used in Western
music theory (see Figure 3.21):
• Two seventh chords are derived from the major triad: (1) the major seventh chord
and (2) the dominant seventh chord.
• Two seventh chords are derived from the minor triad: (3) the minor major seventh
chord and (4) the minor seventh chord.
• Two seventh chords are derived from the diminished triad: (5) the half diminished
seventh chord and (6) the diminished seventh chord.
• One seventh chord is derived from the augmented triad: (7) the augmented major
seventh chord.
Consider the seventh chords with the notes of the major diatonic scale as roots
(see Figure 3.22 top):
• major seventh chords are on degrees I and IV;
• a dominant seventh chord is on degree V;
• minor seventh chords are on degrees II, III, and VI and
• a half diminished seventh chord is on degree VII.
Only four of the seven possible seventh chords can be built up in the major mode,
but all seven seventh chords occur in the harmonic minor mode (note the altered
leading note; compare Figure 3.22 bottom). The minor mode has a greater harmonic
variety than the major mode:
• a major seventh chord is on degree VI;
• a dominant seventh chord is on degree V;
• a minor major seventh chord is on degree I;
• a minor seventh chord is on degree IV;
• a half diminished seventh chord is on degree II;
• a diminished seventh chord is on degree VII and
• an augmented major seventh chord is on degree III.
Figure 3.22: Diatonic major and minor scales with seven chords.
Each seventh chord has three inversions. The first inversion of a seventh chord is
a six-five chord (6/5-chord, third in the lowest part), the second inversion is a
four-three chord (4/3-chord, fifth in the lowest part), and the third inversion is a second chord
(2-chord, seventh in the lowest part).
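The degree-wise chord qualities listed above can be computed mechanically. The following base-R sketch stacks thirds on every degree of a scale given as pitch classes and classifies the resulting triads; the pitch-class encoding (0 = c, ..., 11 = b) and the helper names are illustrative assumptions. Seventh chords can be classified analogously by adding a third stacked third.

```r
# minimal sketch: triad qualities on the degrees of a scale
major_scale    <- c(0, 2, 4, 5, 7, 9, 11)         # c major
harmonic_minor <- c(9, 11, 0, 2, 4, 5, 8)         # a minor with raised leading note g sharp

triad_quality <- function(lower, upper) {         # thirds in semitones: 3 = minor, 4 = major
  if (lower == 4 && upper == 3) "major"
  else if (lower == 3 && upper == 4) "minor"
  else if (lower == 3 && upper == 3) "diminished"
  else "augmented"
}

degree_triads <- function(scale) {
  sapply(1:7, function(i) {
    idx <- ((i - 1) + c(0, 2, 4)) %% 7 + 1        # scale degrees of root, third, and fifth
    n   <- scale[idx]
    triad_quality((n[2] - n[1]) %% 12, (n[3] - n[2]) %% 12)
  })
}

degree_triads(major_scale)
# "major" "minor" "minor" "major" "major" "minor" "diminished"      (degrees I-VII)
degree_triads(harmonic_minor)
# "minor" "diminished" "augmented" "minor" "major" "major" "diminished"
```

The output reproduces the listings above: major triads on degrees I, IV, and V of the major scale, and four different triad qualities in the harmonic minor.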
Note that all tones of a chord, triads as well as seventh chords, can be altered.
Altered tones are additional leading notes and must be resolved following the direction
of alteration.
Further Chords By stacking up more than three thirds, further chords can be formed.
A ninth chord or even eleventh and thirteenth chords are upward extensions of sev-
enth chords that encompass the intervals of a ninth (=octave plus second), an eleventh
(=octave plus fourth), or even a thirteenth (=octave plus sixth). In modern music, all
kinds of chords with arbitrary interval structures are conceivable. In most atonal
music, the dichotomy of consonant and dissonant intervals is abolished. As a conse-
quence, the requirement to resolve dissonant intervals is dropped. Unresolved dissonant
intervals of a chord are perceived not so much as disturbances of the sound but
as an individual timbre. A succession of chords with unresolved dissonant intervals
evokes the effect of a timbre melody. The timbral richness of impressionistic music,
e.g., of Claude Debussy, is mostly evoked by successions of unresolved dissonant
chords (see [3]). In jazz and some kinds of popular music, seventh chords without
resolution are ubiquitous and are one source of the characteristic jazz sound. Ob-
viously, the seventh is treated as a consonant interval. Note that the seventh is a weakly fusing interval (see Figure 3.8).
Chord Notation in Jazz and Popular Music In jazz and pop music, a shorthand
notation of tones and harmonies has been developed to facilitate notation and to
organize group improvisation. A letter indicates the tone on which the chord is built up. The notes of the diatonic scale are C-D-E-F-G-A-B (German: H). Alterations are indicated by sharps and flats. For example, C♯ denotes a c sharp and an e flat is written as E♭. Though one always has to be aware of individual variations, some rules of chord notation are generally observed.
• Capitals label major triads. For example, D denotes the d major triad and B♭ indicates the b flat major triad. Nothing is said about the inversion of the actual triad.
• A minus sign or the small letter m added to the capital indicates a minor triad. For example, A− or Am are the symbols for the a minor triad; F♯− or F♯m denote an f sharp minor triad.
• Added tones are indicated by index numbers corresponding to the interval between the fundamental note and the added tone. The Berklee system uses unambiguous prefixes to indicate whether this interval is pure (no prefix), minor (−), diminished (♭), major (M), or augmented (♯). Other usual prefixes and their meanings in the Berklee system are listed in Table 3.2 (see [13, p. 11]).
• The index number 7 without any prefix always denotes a minor seventh. Thus G7 is the seventh chord on the note g with a minor seventh above the root (the dominant seventh chord). To denote a major seventh chord, a variety of prefixes are common: MAJ7, Maj7, maj7, M7, j7, ∆7, etc. A small code sketch after this list illustrates these conventions.
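The following minimal sketch (illustrative only; the chosen symbol set, note names, and helper functions are assumptions, not the book's) maps a few chord symbols to semitone sets using just the rules stated above:

# Illustrative sketch of the shorthand chord notation described above.
# Chords are represented as pitch classes (semitones modulo 12).
CHORD_QUALITIES = {
    "":     (0, 4, 7),       # capital alone: major triad, e.g. D, Bb
    "m":    (0, 3, 7),       # m or minus sign: minor triad, e.g. Am, A-
    "7":    (0, 4, 7, 10),   # index 7 without prefix: minor seventh added
    "maj7": (0, 4, 7, 11),   # major seventh added (also MAJ7, M7, j7, ...)
    "m7":   (0, 3, 7, 10),   # minor triad with minor seventh
}

NOTE_NAMES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def chord_tones(symbol):
    """Very small parser: root letter, optional #/b alteration, then a quality suffix."""
    root = NOTE_NAMES[symbol[0]]
    rest = symbol[1:]
    if rest.startswith("#"):
        root, rest = root + 1, rest[1:]
    elif rest.startswith("b"):
        root, rest = root - 1, rest[1:]
    return [(root + offset) % 12 for offset in CHORD_QUALITIES[rest.replace("-", "m")]]

print(chord_tones("G7"))    # [7, 11, 2, 5]
print(chord_tones("F#m"))   # [6, 9, 1]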
Tonal Functions In a musical context, the degrees of a scale are ascribed certain musical functions. The keynote of a scale is the tonic, and the fifth degree of the scale is the dominant. One step below the dominant is the fourth degree, which is the subdominant (see Figure 3.23 (a)).
Rearranging the scale, from the fourth degree an octave lower up to the fifth degree, shows a constellation in which the dominant lies a fifth above the tonic and the subdominant a fifth beneath the tonic. Note that the fifth is the most consonant interval besides the prime and the octave. The tonic is, so to speak, framed by the subdominant and the dominant, which are harmonically closely related to the tonic by the strong consonance of the fifth (see Figure 3.23 (b)).
Table 3.2: The Berklee System of Chord Notation
Figure 3.23: (a) Diatonic scale with the functions tonic T, dominant D, and subdom-
inant S. (b) In the rearranged diatonic scale from degree IV to degree V an octave
above, the subdominant and dominant are both a fifth apart from the tonic in the
center.
Jumping a fifth up or down, a voice can change between the three functions. If this voice is the bass part, the degrees of these functions can be the roots of triads. As the fifth is harmonically stable, a twice-falling fifth, first from the tonic down to the subdominant and then from the dominant down to the tonic, stabilizes the tonic, which is the key (see Figure 3.24 (a)). By replacing the jump of a ninth by a second, the classical formula of the bass part of a complete cadence with a twice-falling fifth is obtained (see Figure 3.24 (b)). Some theorists hold that these falling fifths are reflected in the name cadence: cadere is Latin for "to fall".
Figure 3.24: The bass formula of the classical cadence.
Figure 3.25: (a) Plagal cadence IV-I or S-T, (b) authentic cadence V-I or D-T.
The formula of Figure 3.24 (b) in the bass with the degrees I-IV-V-I represents
the harmonic succession tonic-subdominant-dominant-tonic. It is the basic harmonic
pattern of Western music. Figure 3.26 shows the cadence in the fifth position (left:
the fifth of the bass note is in the upper part of the first chord), the octave position
(middle: the octave of the bass note is in the upper part of the first chord), and the
third position (right: the third of the bass note is in the upper part of the first chord):
Figure 3.26: Cadence, left: fifth position, middle: octave position, right: third posi-
tion.
The triads of the tonic, the dominant, and the subdominant can be substituted
by other triads that have two tones in common with the original triad. These substi-
tuting triads are called mediant chords. Instead of triads, seventh chords may also
be used. Note that the finalizing tonic chord must be a consonant triad. The seventh chord on the fifth degree, a major triad with an added minor seventh, is the dominant seventh chord. It is the most prominent seventh chord and strongly demands resolution to the tonic. The triad of the subdominant in particular has many substitutions that enrich the harmonic repertoire of music.
The pattern of the jazz cadence differs from the classical cadence. Its bass voice
consists of the following sequence of degrees: I-VI-II-V-I, so that this chord progression is also called "sixteen twenty-five" (1-6-2-5). Except for the first chord, it can be regarded
as a succession of three falling pure fifths. As seventh chords are the standard chords
of jazz, a basic jazz cadence may be:
Figure 3.27: Jazz cadence.
In the harmony theory of jazz, every chord is regarded as part of a church mode. Further notes from this church scale may optionally be played together with the chord; these additional tones are called options. For example, consider a minor seventh chord on the root c. It may be regarded as part of a Dorian scale if it is the chord of the II degree in b-flat major; as part of a Phrygian scale if it is the chord of the III degree in a-flat major; or as part of an Aeolian scale if it is the chord of the VI degree in e-flat major. In Figure 3.28 the optional notes are indicated: in the case of the Dorian scale (Figure 3.28 (a)), the options 9 and 11 (d or f) are possible and the 13 (a) may be used instead of the seventh; in the case of the Phrygian scale (Figure 3.28 (b)) the option 11 (f) may be used; in the case of the Aeolian scale (Figure 3.28 (c)) the options 9 and 11 (d and f) may be added. Some notes are crossed out as they are inappropriate as options although they belong to the particular modal scale (for the reasoning see [13, p. 19]).
As each chord of the harmonic repertoire of jazz is identified with a certain modal scale when used in a piece of music, the number of options that may be added to the chord is quite limited. When improvising in a group, the musicians are aware of the chord successions. The elaborate system of chord and modal-scale identifications ensures that no inappropriate and disturbing tones are added during improvisation.
Certain chord successions are characteristic of a musical style. For example, the
blues is a musical style of Afro-American music that had a great influence on other
musical styles such as jazz (blues jazz, boogie woogie), pop music, and rock (blues
rock). The original blues scheme consists of an easy chord succession within twelve
bars. Let Roman numerals represent the degrees of the diatonic scale. The twelve bars
Figure 3.28: The minor seventh chord as part of three different modal scales. The
identification of a chord with a certain modal scale depends on the musical context.
The grey notes are possible options; theoretically possible but inappropriate options
are crossed out.
of the blues scheme are now given by a pattern of three times four bars as shown in
Figure 3.29 (a). In case of a quick change, the second bar has a chord on degree IV
instead of the chord of degree I. Figure 3.29 (b) shows the chords of this pattern with
a quick change applied to C major.
Figure 3.29: The figure shows the original blues scheme (a) and its realization with
a quick change in C major (b).
This chord pattern can be found in the American folk song “Blackwater Blues”.
Note that all chords are based on the blues scale as presented in Section 3.2.3. Thus
the minor seventh, which reflects the original blues seventh, can be used on every
degree. In bar nine, a clash of the minor third of the melody and the major third of
the chord reminds us of the blues third. Of course, besides the original blues scheme
there exist a number of variations of this chord pattern.
Figure 3.30: The American folk song “Blackwater Blues” is based on a blues scheme
with a quick change in bar two.
3.5.5 Modulations
The term modulation refers to the process of changing from one key to another. In
case of a melodic modulation a single and possibly unaccompanied melodic line
audibly changes to a new key. In a harmonic modulation a certain chord mediating
between both keys functions as a means of modulation. Different kinds of harmonic
modulation are distinguished depending on the means of modulation.
In a diatonic or common-chord modulation, the means of modulation is a chord
shared by both keys. For example, a modulation from C-major to B-flat-major can
be mediated by an F-major triad, which is on the fourth degree of C-major (subdom-
inant) and on the fifth degree (dominant) of B-flat major (see Figure 3.31).
Figure 3.31: Diatonic modulation from C major to B flat major. The F major six-
chord of the first bar is the modulation means. The F major triad is on the fourth
degree of C major and on the fifth degree of B flat major.
In an enharmonic modulation, the means of modulation is a chord in which one tone is reinterpreted enharmonically; for example, the dominant seventh chord of the original key can be reread as the augmented six-fifth chord of the minor key a semitone under the original key (see Figure 3.33).
Figure 3.33: Enharmonic modulation from F major to e minor. The b flat of the 6/5-chord of the second bar is read as a sharp in the following augmented 6/5-chord of e minor.
3.6 Time Structures of Music
Figure 3.34: Relations of note values.
Analogously, the durations of the rests must be measurable. The different signs
of the rests and the corresponding note values are shown in Figure 3.35.
A dot behind a note or rest prolongs its duration by half of its original value. Let
N be the note value, then N· = (1 + 1/2)N. Figure 3.36 shows dotted note values.
Two dots behind a note or rest prolong its duration by half and a quarter of its
original value. Let N be the note value, then N · · = (1 + 1/2 + 1/4)N.
Instead of halving a note value, irregular divisions are also applied. A division by
three results in a triplet, a division by five leads to a quintuplet, a division by seven
creates a septuplet (see Figure 3.37).
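As a small illustration (not from the book), the dot and tuplet rules can be written down directly; durations are expressed as fractions of a whole note:

from fractions import Fraction

QUARTER = Fraction(1, 4)   # note values as fractions of a whole note

def dotted(value, dots=1):
    """A dot adds half of the original value, a second dot adds a quarter, and so on."""
    return value * (2 - Fraction(1, 2) ** dots)

def tuplet_note(value, n):
    """One note of an irregular division of 'value' into n equal parts
    (n = 3: triplet, n = 5: quintuplet, n = 7: septuplet)."""
    return value / n

print(dotted(QUARTER))          # 3/8  -> dotted quarter = (1 + 1/2) * 1/4
print(dotted(QUARTER, 2))       # 7/16 -> double-dotted quarter = (1 + 1/2 + 1/4) * 1/4
print(tuplet_note(QUARTER, 3))  # 1/12 -> one note of a quarter-note triplet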
3.6.2 Measure
The term measure originally refers to the metrical feet of ancient Greek poetry, which describe patterns of long and short syllables. Thus, the measure is an ordering principle. Applied to music, a metrical foot refers either to patterns of tone durations or to patterns of the accentuation of notes (ordering of heavy/stressed and light/unstressed notes). Basic metrical feet with relevance in music theory are listed in Table 3.3. They are binary (trochee, iambus, spondee) or ternary (dactyl, anapaest, tribrach). Figure 3.38 shows these metric feet as patterns of tone durations.
Each of these time patterns encompasses a time frame of the psychological present, which is a short time interval conceived by the individual as the present moment [21]; it may last less than a second but not more than three or four seconds. The repetition of these metric patterns evokes the sensation of coherence and supports integration over time. As an example of a continued trochee, the famous "Marcia funebre" from Beethoven's piano sonata Pathétique demonstrates that the ancient metric feet are still an appropriate resource for describing the elementary time structure of music (see Figure 3.39).
3.6.3 Meter
In music, the continuous flow of time is perceptually discretized by regular (equidis-
tant) beats. The meter is a schematic ordering of stressed and unstressed (heavy and
light) beats. Mostly, the meter is indicated by the numbers of a fraction behind the
clef. The denominator represents the note value of the beats to be counted and the
numerator specifies the number of beats in one bar of the meter. Simple meters are
distinguished from compound meters. Simple meters have only one stress on the first
Figure 3.39: Trochee: Beethoven, Marcia funebre from the piano sonata Pathétique,
op. 13.
beat, the other beats are unstressed. Compound meters have their main stress on the first beat and secondary stresses on other beats. Simple meters are all meters with two or three beats per bar, for example: two-four time (2/4), two-eight time (2/8), three-four time (3/4), and three-eight time (3/8). Compound meters either have equal parts (2+2 or 3+3) or unequal parts (2+3 or 3+2). Next to the main stress on the first beat, the first beat of the other part (or parts) has a secondary and somewhat lighter stress. Examples of compound meters are four-four time (4/4 = 2/4 + 2/4), six-four time (6/4 = 3/4 + 3/4), five-four time (5/4 = 2/4 + 3/4 or 5/4 = 3/4 + 2/4), six-eight time (6/8 = 3/8 + 3/8), seven-eight time (7/8 = 3/8 + 4/8 or 7/8 = 4/8 + 3/8), nine-eight time (9/8 = 3/8 + 3/8 + 3/8), and twelve-eight time (12/8 = 3/8 + 3/8 + 3/8 + 3/8). The term alla breve describes meters in which half notes are counted as beats: one-two meter (1/2), two-two meter (2/2), three-two meter (3/2), and four-two meter (4/2).
Generally, in Western music the first beat of a bar has the strongest stress. Thus,
the regular bar lines at the beginning of each bar indicate this scheme of stressed and
unstressed beats. Note that there are exceptions to this rule: for example, a four-four
time in jazz music is often stressed on the second and fourth beats instead of the first
and third beats.
For simple meters, binary and ternary patterns of beats can be distinguished ac-
cording to whether the beats or pulses are organized in two or three. That means
either a stressed beat is followed by one unstressed beat or a stressed beat is followed
by two unstressed beats. In music theory the quarter note is the standard value. Thus
there are two basic simple meters: the two-four and the three-four.
Figure 3.40: The two-four meter (left) and the three-four meter (right) are the basic
simple meters.
Each beat may be subdivided by twos, threes, or fours. Occasionally, even uni-
tary patterns with subdivisions occur, especially if the music has a fast pulse. For
example, a Viennese waltz is a unitary pattern subdivided by three.
In medieval times, ternary patterns of beats were associated with the Trinity of God and thus named tempus perfectum. Especially in sacred music, the tempus perfectum symbolized heavenly spheres and the perfection of God. As this divine perfection was graphically symbolized by a circle, the tempus perfectum was also indicated by a circle. On the other hand, the sinfulness of human life on earth and the imperfection of mankind were characterized by binary patterns of beats, called tempus imperfectum and symbolized by an imperfect circle (a semicircle or three-quarter circle like the letter C). This example demonstrates how religious ideas or world views can be associated with elementary musical structures, imposing meaning on the music. Even today, an imperfect circle is used to indicate the four-four time. An example is given in Figure 3.39.
The regular four-four time is a compound meter with two parts. But there is also
an alla-breve four-four meter which has four beats but only a single stress on the
first beat whereas all other beats are unstressed. Thus the alla-breve four-four time
is a simple and unitary meter with four subdivisions. As only every fourth beat is
stressed, the alla-breve meter has a floating expression, especially in combination with a slow tempo. A prominent example is Mozart's "Ave verum", which is marked Adagio alla breve and shows the alla breve sign, a struck-through C: it equals the C sign of the four-four time but with a vertical stroke, like a Roman numeral I, to symbolize unity and to indicate that the meter is simple (see Figure 3.41).
3.6.4 Rhythm
Rhythm stems from Greek and means "the flowing". Originally it described the constant alternation of tension and relaxation. Rhythm freely combines metric units and is a quantifying ordering principle of tone durations and accentuations. It is superordinate to measure and meter. Rhythm can be independent of any scheme, so that tension and relaxation alternate freely. Thus, measure and meter are not presuppositions of rhythm.
Figure 3.41: Mozart: Ave verum K.V. 618, the alla breve sign indicates that only the
first beat has to be stressed.
This becomes obvious in chants in which the music follows the text, as in the Gregorian chants of the medieval Roman church.
If music has an underlying meter, rhythm is integrated into the metric scheme
so that the stresses of the bars and the stresses of the rhythm pattern coincide. But
the independence of a rhythm may result in a segregation of the bar scheme and the
stresses of the rhythm, which leads to syncopation. A syncope originally means beating together; it results from slurring a stressed note to the preceding unstressed one, which shifts the stress onto the actually unstressed beat (see Figure 3.42). A syncope induces deviations from the accentuation scheme of the established meter. Its effect may be surprising; it provides variety and arouses interest. Syncopes support stream segregation, especially if they occur in only one part.
Figure 3.42: Syncope: (a) regular four-four meter, (b) the slur shifts the stress of the
third beat to the second beat, (c) notation of resulting syncope with the same sound
as in (b).
Certain syncopation schemes are distinctive of certain music styles and associ-
ated dance styles. Figure 3.43 shows the rhythmic patterns of two Latin-American
dances, a rumba and a bossa nova. Note the syncopations of the bass voices and the
alternating rhythms of the chords, which are characteristics of these dances.
Figure 3.43: Syncopes in dance music: the vividness of syncopated rhythms is a characteristic of Latin-American dances; rhythmic patterns (a) of a rumba and (b) of a bossa nova.
3.7 Elementary Theory of Form
The theory of musical form analyzes how the ordered elements of rhythm, melody, and harmony constitute a
musical Gestalt which is always experienced as a whole or entity with distinguishable
parts. Elementary Gestalten such as motifs form musical Gestalten of higher order, like musical themes or melodies, which themselves are parts of Gestalten at an even higher level, such as a whole composition or a movement of a sonata. Thus, the scope of the theory of musical form is the structural principles of music that evoke the experience of musical meaning and educe extra-musical associations, both of which belong to the content of a composition.
Motif The motif is the smallest musical entity. Sometimes, a motif is the musical
nucleus from which the whole musical structure of a piece of music evolves. It is
characterized by a succession of certain pitches and bears an individual rhythmic
content of one metric unit. A motif encompasses a time frame of a psychological
present and represents an elementary musical Gestalt. Translating its pitches trans-
poses the motif. A progression is a repeated translation of the same interval. A
dilatation of the note values of a motif is called augmentation if all note values are
proportionally elongated. The term diminution describes the proportional shorten-
ing of all note values of a motif. Figure 3.44 (a) shows a simple motif. Possible variations are demonstrated in Figure 3.44 and sketched in the code example after the following list.
• A motif can be transposed, which is a translation operation on pitch. A series of
transpositions is called a sequence. Figure 3.44 (b) shows a diatonic sequence.
• A motif can be augmented, which is a dilatation (elongation) of note values (Fig-
ure 3.44 (c)).
• A motif can be diminished, which is a shortening of note values (Figure 3.44 (d)).
• A motif can be augmented or diminished in its interval structure (Figures 3.44 (e)
and (f)).
101
102 Chapter 3. Musical Structures and Their Perception
• A motif can be augmented and/or diminished with respect to its note values as
well as in its interval structure (Figures 3.44 (g)–(j)).
• A motif and its variations can be inverted (Figures 3.44 (k)–(t)).
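A minimal sketch (not from the book) of these operations on a toy motif, represented as (pitch, duration) pairs; for simplicity, transposition is chromatic and the interval scaling uses an integer factor:

# Illustrative sketch: elementary motif transformations.
# A motif is a list of (pitch, duration) pairs; pitch in semitones, duration in beats.
motif = [(0, 1.0), (4, 0.5), (7, 0.5), (12, 2.0)]

def transpose(m, interval):
    """Translate all pitches by a fixed interval (transposition)."""
    return [(p + interval, d) for p, d in m]

def augment(m, factor=2):
    """Proportionally elongate all note values (rhythmic augmentation)."""
    return [(p, d * factor) for p, d in m]

def diminish(m, factor=2):
    """Proportionally shorten all note values (rhythmic diminution)."""
    return [(p, d / factor) for p, d in m]

def scale_intervals(m, factor):
    """Augment or diminish the interval structure relative to the first pitch."""
    root = m[0][0]
    return [(root + (p - root) * factor, d) for p, d in m]

def invert(m):
    """Mirror the pitches around the first pitch (inversion)."""
    root = m[0][0]
    return [(root - (p - root), d) for p, d in m]

print(transpose(motif, 2))        # one step of a (chromatic) sequence
print(augment(motif))             # all durations doubled
print(scale_intervals(motif, 2))  # intervals widened
print(invert(motif))              # intervals mirrored downwards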
Figure 3.44: Sequence and rhythmic augmentation and diminution of the motif (a).
Figure 3.45 gives an example of this type of theme from Die Meistersinger von Nürnberg by R. Wagner.
Figure 3.45: R. Wagner, Vorspiel of the opera Die Meistersinger von Nürnberg: a sequence of phrases forms the theme of the pseudo-fugue.
Closed Forms and Sequential Forms Closed forms comprise a sequence of musical
phrases that are normally two or four measures long. Their separations are marked
by cadences. The ending cadence is a full cadence on the tonic (or first degree). The
prototype of a closed musical form is the period, which is a binary form. Normally
it encompasses eight bars and consists of two phrases, each four measures long. The
antecedent phrase ends on a weak imperfect cadence, mostly on the fifth degree, whereas the second phrase ends on an authentic cadence on the first degree, which is also the tonic, determined by the keynote.
Gestalt psychologists claim that symmetries are a means to improve Gestalt per-
ception and thus symmetries are of aesthetic importance. As music is time depen-
dent, a repetition (which is a translation in time) is the symmetry which is most
effective for Gestalt perception in hearing as the German musicologist Hugo Rie-
mann (1849–1919) pointed out [22]. Repetitions and varied repetitions are the core
of elementary as well as higher-level structures in music. Compare Figure 3.44 for
examples of variations of a motif.
Closed periods can be composed to form simple higher-level structures. The
single periods of this form may be repetitions or variations of one another or they may
be completely independent. A simple example of a form composed of independent
periods is a chain of periods. Those chains already emerge from improvised singing
or playing of a group of musicians. With capitals denoting the single periods, this
form reads as A-B-C-D-....
The simplest binary song form consists of only one Period A that is repeated with
only slight variations or an altered final cadence: A-A or A-A’.
Of course, the second period (now denoted B) can also be contrasting and com-
pletely different: A-B. Repeating part A results in a bar form, which was widespread
in the medieval art of minnesong: A-A-B (see Figure 3.46).
Another elementary example is the ternary song form, which consists of two
different phrases A and B and a repetition of A or its variations A’, A”. The resulting
forms are A-A'-A'', A-B-A, or A-B-A'. The German Volkslied "Alle Vögel sind schon da" has the form A-B-A (see Figure 3.47).
Figure 3.47: The German spring song “Alle Vögel sind schon da” clearly has a
ternary A-B-A form.
Figure 3.48: The Beatles song “Yellow Submarine” has the simple sequential form
A-A1-A2-A3-B-B.
Figure 3.49: The opening theme of the overture of Mozart's Le nozze di Figaro has a completely asymmetric structure.
The third part is a repetition of the first one, so that part A frames the contrasting part B.
Parts A and B are also ternary with parts a, b, and parts c, d respectively, so that A has
the structure a-b-a and B has the subdivision c-d-c. The whole formal organization
is as follows:
A B A
a-b-a c-d-c a-b-a
The different kinds of the rondo-form also illustrate the principles of sequential
forms. Originally the rondo (Latin: rondellus, French: rondeau) was a chain of
independent songs or a series of successive dances with the formal structure: A-B-
C-D-... . Repeating the first part as a so-called ritornello leads to the formal structure
A-B-A-C-A-D- ... . The classical rondo combines the sequential organization with a
balance of the harmonic structure. Denoting the key of the tonic by the letter T and
the key of the dominant as D, the structure of the classical rondo can be written as
follows:
A B A C A B A
T D T parallel minor key T T T
or parallel major key
Developmental Forms Whereas sequential forms with their interplay of repetition,
variation, and contrast evoke a variety of musical ideas, developmental forms are
restricted to a musical base material which may be a single theme or even a single
motif. A whole piece of music can be developed by variations and transformations
of this base material. The most important developmental forms are the fugue and
the sonata. The fugue is regarded as the culmination of polyphonic composition and
counterpoint. Fugues of the baroque era are unsurpassed masterpieces of this composition technique, which is based on the recurrent imitation of a
theme in all parts of the composition and on different pitches. The composition starts
with one voice to introduce the theme. The other voices, one after the other, imitate
the theme on other tonal degrees. While one voice imitates the theme, the other
voices play independent melodic lines composed against the subject according to the
craft of counterpoint. After the theme has been stated in all parts, one exposition of the fugue is complete. Three or more expositions form a fugue composition. Between the expositions, modulating interludes with sequences are inserted. Figure
3.50 shows the first exposition of a three-part “Fugue in C major” by J.S. Bach.
First of all, the theme determines the character of the composition. The counter-
point may support or contrast the effect of the theme. The technique of fugue can be
realized in different higher-order forms as demonstrated in J.S. Bach’s Die Kunst der
Fuge.
The baroque suite is a succession of historical dances of the Renaissance and
Baroque. At the end of the 18th century the classical sonata as the most prominent
musical form of the 19th century evolved from the suite. Regarded from the standpoint of musical form, a symphony is a classical sonata composed for an orchestra. It consists of four movements: the first movement, an Allegro, is contrasted by a slow second movement. As a relic of the suite, the third movement is a minuet or, later in the historical development, a fast scherzo. The last, fast movement is often a rondo. Joseph Haydn is renowned as the inventor of the sonata. The term sonata form refers to the first movement of the whole sonata. The first part is called the exposition, which presents and develops the musical material: a robust or virile first theme is followed by a second theme in a singing style. In case of a major key, the second theme is in
the key of the fifth degree (dominant key); in case of a minor key, the second theme is
in the parallel major key. The second part is the development: modulation, splitting
of the themes, variations of their motifs, and other compositional means are applied
for dramatic effects and to produce musical tension. The third part is a recapitulation
of the exposition and returns to the original key for both themes. Often a final part
called the coda finishes the movement.
3.8 Further Reading
tion Theory of Harmony, which are both available in several reprints. Those who want to study counterpoint from the ground up are advised to read the textbook by Lemacher / Schroeder [16], which partly follows Fux's Gradus ad Parnassum. A different approach, which draws on music from the baroque era to the 19th century, is the textbook by Walter Piston [20]. A theory of form with a wealth of examples is provided by Lemacher / Schroeder [17].
Bibliography
[1] G. M. Bidelman and A. Krishnan. Neural correlates of consonance, dissonance,
and the hierarchy of musical pitch in the human brainstem. The Journal of
Neuroscience, 29(42):13165–13171, 2009.
[2] A. S. Bregman. Auditory Scene Analysis. MIT Press, 1990.
[3] D. de la Motte. Harmonielehre. dtv / Bärenreiter Kassel, 1976.
[4] D. de la Motte. Kontrapunkt. Bärenreiter, 2010.
[5] D. Deutsch, ed. The Psychology of Music. Academic Press, New York, 1982.
[6] M. Ebeling. Verschmelzung und neuronale Autokorrelation als Grundlage einer
Konsonanztheorie. Peter Lang Verlag, 2007.
[7] M. Ebeling. Neuronal periodicity detection as a basis for the perception of con-
sonance: A mathematical model of tonal fusion. The Journal of the Acoustical
Society of America, 124(4):2320–2329, 2008.
[8] L. Finscher, ed. Die Musik in Geschichte und Gegenwart (MGG). Bärenreiter,
2nd, revised edition, 2003.
[9] C. W. Fox. Modern counterpoint: A phenomenological approach. Notes: Quar-
terly Journal of the Music Library Association, 6(1):46–57, 1948.
[10] J. Fux. Gradus ad Parnassum oder Anführung zur regelmäßigen musicalischen
Composition. (Nachdr. d. Ausg. Leipzig 1742). Olms, 1742 / 2004.
[11] W. M. Hartmann. Signal, Sounds, and Sensation. Springer, 2000.
[12] W. M. Hartmann and D. Johnson. Stream segregation and peripheral channel-
ing. Music Perception, 9(2):155–183, 1991.
[13] A. Jungblut. Jazz Harmonielehre. Schott, 1981.
[14] G. Langner. Evidence for neuronal periodicity detection in the auditory system
of the guinea fowl: Implications for pitch analysis in the time domain. Experi-
mental Brain Research, 52(3):333–355, 1983.
[15] G. Langner. Die zeitliche Verarbeitung periodischer Signale im Hörsystem:
Neuronale Repräsentation von Tonhöhe, Klang und Harmonizität. Zeitschrift
für Audiologie, 46(1):8–21, 2007.
[16] H. Lemacher and H. Schroeder. Kontrapunkt. Schott, 1978.
[17] H. Lemacher and H. Schroeder. Formenlehre der Musik. Gerig, 1979.
[18] J. C. R. Licklider. A duplex theory of pitch perception. Experientia, VII/4, 1951.
Chapter 4
Digital Filters and Spectral Analysis
4.1 Introduction
In this chapter we will review fundamental concepts and methods of digital signal
processing with emphasis on aspects which are important for music signal analy-
sis. We thus lay the foundations for chapters dealing with feature extraction, feature
selection, and feature processing (Chapters 5, 15, 14). The chapter starts with the
definition of continuous, discrete, and digital signals and a brief review of linear
time-invariant systems. We explain the design and implementation of digital filters
with finite impulse response (FIR) and infinite impulse response (IIR). These filters
are frequently used in filter banks for spectral analysis of audio signals. Besides filter
banks, we present transformations for spectral analysis such as the discrete Fourier
transformation (DFT), the constant-Q transformation (CQT), and the cepstrum. The
chapter concludes with a brief introduction to fundamental frequency estimation.
Figure 4.1: Analog, discrete time, and digital signals. t, k, and Ts denote, respec-
tively, the continuous time variable, the discrete time index, and the sampling period.
A system is time-invariant if any temporal shift k0 of the input signal x[k] results in the corresponding shift of the output signal y[k] = T[x[k]], i.e., T[x[k − k0]] = y[k − k0].
An LTI system may be characterized by its impulse response h[k] and its initial
conditions. Frequently, we assume that initial conditions are zero, i.e., the system
is at rest before a signal is applied. For zero initial conditions the impulse response
is obtained by submitting a unit impulse1 δ [k] to the system input; see Figure 4.2.
Using linearity and time-invariance we obtain for any input signal x[k]

y[k] = T[x[k]] = T[ ∑_{l=−∞}^{∞} x[l] δ[k − l] ] = ∑_{l=−∞}^{∞} x[l] T[δ[k − l]]        (4.4)
     = ∑_{l=−∞}^{∞} x[l] h[k − l] = x[k] ∗ h[k] ,
¹ The unit impulse is defined as a sequence δ[k] with δ[0] = 1 and δ[k] = 0 for all k ≠ 0.
where the above summation is known as the discrete convolution and ∗ denotes the convolution operator. The convolution operation is commutative,

y[k] = ∑_{l=−∞}^{∞} x[l] h[k − l] = ∑_{l=−∞}^{∞} h[l] x[k − l] ,        (4.5)
which leads to the interesting interpretation that the input signal and the impulse
response may be interchanged without changing the output signal. Given the impulse
response h[k] we may compute the output signal y[k] for any input signal x[k]. Note that the convolution of two finite signals with N and M successive non-zero samples has at most N + M − 1 non-zero samples.
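As a small illustration (not from the book), the discrete convolution and its length property can be checked directly with NumPy:

import numpy as np

# Two short finite signals with N = 4 and M = 3 non-zero samples.
x = np.array([1.0, 2.0, 3.0, 4.0])   # input signal x[k]
h = np.array([0.5, 0.25, 0.25])      # impulse response h[k]

# Discrete convolution y[k] = sum_l x[l] h[k-l], cf. Equation (4.4).
y = np.convolve(x, h)

print(y)             # output signal
print(len(y))        # 6 = N + M - 1 non-zero samples at most
# Commutativity (Equation (4.5)): convolving in either order gives the same result.
print(np.allclose(y, np.convolve(h, x)))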
Definition 4.6 (Frequency Response). The frequency response H(eiΩ ) of an LTI sys-
tem is given by the discrete-time Fourier transform (DTFT) of its impulse response,
and, vice versa, the impulse response may be computed via an inverse DTFT of the
frequency response. Thus we have²

H(e^{iΩ}) = ∑_{k=−∞}^{∞} h[k] e^{−iΩk}   and   h[k] = (1/2π) ∫_{−π}^{π} H(e^{iΩ}) e^{iΩk} dΩ .        (4.6)
Computing the DTFT of Equation (4.4) we find that Y (eiΩ ) = H(eiΩ )X(eiΩ ), where
X(eiΩ ) and Y (eiΩ ) denote the DTFT of the input and the output signals. Therefore,
the convolution of the input signal with the impulse response of an LTI system corresponds to a multiplication of their respective transforms in the frequency domain.
The frequency response is a complex-valued quantity which is often displayed in terms of its magnitude response |H(e^{iΩ})| and its phase response φ(Ω) such that H(e^{iΩ}) = |H(e^{iΩ})| e^{iφ(Ω)}. Since the magnitude response often covers a large dynamic range, it is common to express it in decibels (dB) as A(Ω) = 20 log₁₀ |H(e^{iΩ})|. Furthermore, it is common to display the frequency response on a frequency scale normalized to the sampling frequency, i.e., Ω = 2π f / fs. Then, the Nyquist frequency corresponds to Ω = π.
Definition 4.7 (Group Delay of an LTI System). The group delay τg (Ω) of an LTI
² To remain consistent with the widely used z-transform notation, we write the frequency response as a function of e^{iΩ}.
system characterizes the delay of the output signal w.r.t. the input signal. In general,
this delay depends on frequency and is defined as
τg(Ω) = − dφ(Ω)/dΩ .        (4.7)
An important special case concerns systems that have a linear phase response.
A system with a linear phase response has a constant group delay. It will delay the
input signal uniformly across frequency and thus will not introduce dispersion. For
music signal processing such behavior is beneficial, for instance, for the reproduction
of sharp onsets.
Example 4.1 (Sampling rate conversion). To reduce the computational effort of music analysis tasks we may lower the sampling rate from fs = 48 kHz to fs′ = 16 kHz. Prior to this decimation step the bandwidth of the audio signal must be limited to frequencies below fs′/2 = 8 kHz. The impulse, amplitude, and phase responses of a suitable low-pass filter are shown in Figure 4.3. This filter has an impulse response of 181 non-zero coefficients and a linear phase response. The attenuation of frequency components outside the desired frequency band (stopband attenuation) is about 80 dB. Its design is discussed in Section 4.3.2.
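A minimal sketch of such a sampling-rate conversion (not the book's 181-tap design; SciPy's polyphase resampler with its default Kaiser-windowed low-pass filter is used instead):

import numpy as np
from scipy.signal import resample_poly

fs = 48000          # original sampling rate in Hz
fs_new = 16000      # target sampling rate in Hz

# One second of a test signal: a 1 kHz tone plus a little noise.
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t) + 0.01 * np.random.randn(fs)

# Band-limit below fs_new/2 = 8 kHz and keep every third sample (48/16 = 3);
# resample_poly applies an anti-aliasing FIR low-pass filter internally.
y = resample_poly(x, up=1, down=3)

print(len(x), len(y))   # 48000 16000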
Figure 4.3: Impulse response h[k], magnitude response A(Ω), and phase response φ(Ω) of the low-pass filter used for the sampling rate conversion in Example 4.1.
This difference equation, y[k] = ∑_{ν=0}^{N} bν x[k − ν] + ∑_{µ=1}^{M} aµ y[k − µ] (Equation (4.8)), is visualized in the block diagram of Figure 4.4.

Figure 4.4: Block diagram of a causal recursive system with feed-forward coefficients b0, ..., bN, feedback coefficients a1, ..., aM, and unit delays T.

In the case of zero initial conditions, the frequency response of this recursive system is

H(e^{iΩ}) = Y(e^{iΩ})/X(e^{iΩ}) = (b0 + b1 e^{−iΩ} + · · · + bN e^{−iNΩ}) / (1 − a1 e^{−iΩ} − a2 e^{−i2Ω} − · · · − aM e^{−iMΩ}) .        (4.9)
Thus, the frequency response and its inverse DTFT, the impulse response h[k], de-
pend on the coefficients aµ and bν only.
Example 4.2 (First-Order Recursive System). First-order recursive systems are wide-
ly used in (music) signal processing for the purpose of smoothing fluctuating signals
and for the generation of stochastic autoregressive processes. While the statistical
view on models of stochastic time series is treated in depth in Section 9.8.2, we here
explain the first-order recursion in terms of a digital filter. Using Equation (4.8) and
the stability condition |a1 | < 1 we find for the difference equation and the frequency
response of a first-order recursive system

y[k] = b0 x[k] + a1 y[k − 1]   ⇔   H(e^{iΩ}) = b0 / (1 − a1 e^{−iΩ}) ,        (4.10)

and for its magnitude response

|H(e^{iΩ})| = |b0| / sqrt(1 + a1² − 2 a1 cos(Ω)) .        (4.11)
Obviously, b0 controls the overall gain of the filter. In order to normalize the maximum of the magnitude response to one, it may be set to b0 = 1 − |a1|. The frequency character-
istic of the filter is determined by the coefficient a1 : When this coefficient is positive
we achieve a low-pass filter which attenuates high-frequency components and thus
smooths the input signal. When this coefficient is negative a high-pass filter results,
which leads to a relative emphasis of high frequencies up to the Nyquist frequency
Ω = π. An example for a1 = ±0.8 is shown in Figure 4.5.
Figure 4.5: Magnitude responses |H(e^{iΩ})| of the first-order recursive filter for a1 = 0.8 (low-pass) and a1 = −0.8 (high-pass).
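A small sketch of this first-order recursion (illustrative; the gain normalization b0 = 1 − |a1| from above is assumed), using SciPy's standard IIR filtering routine:

import numpy as np
from scipy.signal import lfilter, freqz

def first_order(a1):
    """Coefficient vectors of the recursion y[k] = b0 x[k] + a1 y[k-1]."""
    b0 = 1.0 - abs(a1)            # normalizes the maximum gain to one
    return [b0], [1.0, -a1]       # scipy convention: a[0] y[k] + a[1] y[k-1] = b0 x[k]

# Smooth a noisy signal with the low-pass variant (a1 > 0).
x = np.random.randn(1000)
b, a = first_order(0.8)
y_smooth = lfilter(b, a, x)

# Compare the magnitude responses of the low-pass and high-pass variants.
for a1 in (0.8, -0.8):
    b, a = first_order(a1)
    w, H = freqz(b, a, worN=512)
    print(f"a1 = {a1:+.1f}: |H| near Omega=0 is {abs(H[0]):.2f}, near Omega=pi is {abs(H[-1]):.2f}")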
When we set aµ = 0 for all µ we eliminate the feedback path and obtain the block
diagram in Figure 4.6. Then, the frequency response simplifies to
H(e^{iΩ}) = Y(e^{iΩ}) / X(e^{iΩ}) = b0 + b1 e^{−iΩ} + · · · + bN e^{−iNΩ} = ∑_{ν=0}^{N} bν e^{−iνΩ} .        (4.12)
Since there is no feedback, the impulse response of this system has a finite number
of non-zero coefficients.
Definition 4.9 (Finite-Impulse Response (FIR) System). A system with a finite num-
ber of non-zero coefficients in its impulse response is called a finite-impulse response
(FIR) system.
Figure 4.6: A block diagram of a causal and linear parametric system with finite
impulse response (FIR).
The filter discussed in Example 4.1 is an FIR filter and could be implemented using the block diagram in Figure 4.6. Based on this implementation, the impulse response is given by h[k] = bk for k = 0, 1, . . . , N, where N is the order of the filter. FIR filters may also be used to compute a moving average of the input signal and are thus useful for the online computation of audio features.
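As a small illustration (not from the book), a moving-average FIR filter is obtained by setting all N + 1 coefficients bν to 1/(N + 1):

import numpy as np
from scipy.signal import lfilter

N = 9                                  # filter order -> N + 1 = 10 taps
b = np.full(N + 1, 1.0 / (N + 1))      # b_nu = 1/(N+1): moving-average FIR filter
a = [1.0]                              # no feedback coefficients (non-recursive)

# Smooth a fluctuating feature trajectory (here: a synthetic example).
feature = np.abs(np.random.randn(200)) + np.linspace(0.0, 1.0, 200)
smoothed = lfilter(b, a, feature)
print(smoothed[:5])

# The impulse response equals the coefficients: h[k] = b_k for k = 0, ..., N.
impulse = np.zeros(50)
impulse[0] = 1.0
print(np.allclose(lfilter(b, a, impulse)[:N + 1], b))   # True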
Figure 4.7: Ideal magnitude responses of a low-pass filter (A), a high-pass filter (B),
and a band-pass filter (C). Ω1 and Ω2 denote cut-off frequencies.
Figure 4.8: Tolerance specification of a discrete low-pass filter: passband, transition band, and stopband with the ripple bounds δp and δs.
The desired frequency response HT(e^{iΩ}) includes the frequency ranges where the input
signal should be passed or attenuated by the filter, also known as the passband(s) and
stopband(s), respectively. As the transitions between these bands cannot be abrupt,
the width of corresponding transition band(s) is also part of the specification. Fur-
thermore, tolerance intervals which, for instance, specify the maximum permissible
ripple in the passband(s) and the stopband(s) of the filter, are necessary. Figure 4.8
plots a tolerance specification for a discrete low-pass filter which entails a passband,
a transition band, and a stopband with their respective parameters. In most applica-
tions it will be desirable to make the maximum passband distortion δ p , the width of
the transition band ∆Ω = Ωs − Ω p , and the stopband ripple δs as small as possible
while satisfying a constraint on the filter order and hence on the group delay and the
computational complexity.
Digital filters may be designed as FIR or IIR filters. FIR filters are often pre-
ferred as they are non-recursive and therefore always stable. Furthermore, they can
be designed to have a linear phase response avoiding the detrimental effects of phase
distortions and non-uniform group delays.
Popular design methods for FIR filters are based on the modified Fourier series
approximation or the Chebychev approximation. In both cases we strive to approxi-
mate an ideal frequency response HT (eiΩ ) by the FIR filter response given in Equa-
tion (4.12).
Thus, the coefficients bν are the Fourier series representation of the desired fre-
quency response HT (eiΩ ). For a given filter order N this constitutes the best approx-
imation in the mean-square sense.
When the filter coefficients are (even or odd) symmetric around the center bin,
the filter has a linear phase response. Then, for a filter of order N we obtain a constant
group delay of τg (Ω) = N/2.
Example 4.3 (Fourier Approximation). Figure 4.9 depicts a linear-phase approxi-
mation of the ideal low-pass filter for two values of N. Here the filter coefficients
are arranged in a non-causal fashion, i.e., symmetric around the time index k = 0.
Clearly, the filter of higher order achieves a smaller transition interval between the
passband and the stopband. Note, however, that the maximum ripple in the passband
and the stopband is not improved when the filter order is increased.
This design may be modified by multiplying the impulse responses in Figure
4.9 by a tapered window function w[k], thus achieving a higher stopband attenuation
and less passband ripple. For a Hamming window (as defined in Equation (4.24))
the resulting filter coefficients and the corresponding frequency response are shown
in Figure 4.10. We now observe significantly less ripple in the passband and the
stopband but a wider transition band.
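A minimal sketch of this window-method design (assumptions: cut-off at Ω_c = 0.5π, order N = 40, NumPy's Hamming window), i.e. the Fourier approximation of an ideal low-pass filter tapered by a Hamming window:

import numpy as np

N = 40                      # filter order -> N + 1 coefficients
omega_c = 0.5 * np.pi       # cut-off frequency of the ideal low-pass filter

# Fourier approximation: truncated ideal low-pass impulse response,
# centered around k = N/2 (linear phase, group delay N/2).
k = np.arange(N + 1) - N / 2
h_ideal = (omega_c / np.pi) * np.sinc(omega_c * k / np.pi)  # np.sinc(x) = sin(pi x)/(pi x)

# Modified Fourier approximation: taper with a Hamming window to reduce the ripple.
h = h_ideal * np.hamming(N + 1)

# Inspect the magnitude response on a dense frequency grid.
Omega = np.linspace(0, np.pi, 512)
H = np.exp(-1j * np.outer(Omega, np.arange(N + 1))) @ h
A = 20 * np.log10(np.maximum(np.abs(H), 1e-12))
print(f"gain at Omega=0: {A[0]:.2f} dB; worst stopband gain above 0.75*pi: {A[Omega > 0.75 * np.pi].max():.1f} dB")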
The filter length and the choice of the window clearly depend on the desired stopband attenuation and the width of the transition region. For the widely used Kaiser window (see Equation (4.26)) the following design rule for the filter order has been established [12, 19]:

N ≈ (As − 7.95) / (2.2855 ∆Ω) ,        (4.15)

where As = −20 log₁₀(δs) specifies the desired stopband attenuation and ∆Ω = 2π ∆f / fs is the normalized transition width. The shape parameter α of the Kaiser window controls its bandwidth and is found by

α = 0                                            for As < 21 ,
α = 0.5842 (As − 21)^0.4 + 0.07886 (As − 21)     for 21 ≤ As ≤ 50 ,        (4.16)
α = 0.1102 (As − 8.7)                            for As > 50 .
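A small sketch implementing this design rule directly (the numerical example values are illustrative):

import numpy as np

def kaiser_design(A_s, delta_f, fs):
    """Filter order N and Kaiser shape parameter alpha for a desired
    stopband attenuation A_s (dB) and transition width delta_f (Hz)."""
    delta_Omega = 2 * np.pi * delta_f / fs                     # normalized transition width
    N = int(np.ceil((A_s - 7.95) / (2.2855 * delta_Omega)))    # Equation (4.15)
    if A_s < 21:                                               # Equation (4.16)
        alpha = 0.0
    elif A_s <= 50:
        alpha = 0.5842 * (A_s - 21) ** 0.4 + 0.07886 * (A_s - 21)
    else:
        alpha = 0.1102 * (A_s - 8.7)
    return N, alpha

# Example: 80 dB stopband attenuation, 500 Hz transition band at fs = 48 kHz.
N, alpha = kaiser_design(80, 500, 48000)
print(N, round(alpha, 3))   # an order of a few hundred taps and alpha around 7.9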
Figure 4.9: Fourier approximation of the ideal low-pass filter response. Left: impulse
response h[k], right: magnitude response |H(eiΩ )|. Top: N = 20, Bottom: N = 40.
Figure 4.10: Modified Fourier approximation of an ideal low-pass filter using a Ham-
ming window. Left: impulse response h[k], right: magnitude response |H(eiΩ )|. Top:
N = 20, Bottom: N = 40.
Figure 4.11: Filter design based on the Chebychev approximation and the Parks-
McClellan (Remez) algorithm. Left: impulse response h[k], right: magnitude re-
sponse A(Ω) = 20 log10 |H(eiΩ )| . The filter order is 40 and the width of the transi-
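In the spirit of Figure 4.11, an equiripple (Chebychev) design can be obtained with SciPy's Parks-McClellan routine; the band edges below are illustrative assumptions, not the book's exact specification:

import numpy as np
from scipy.signal import remez

numtaps = 41                      # filter order 40 -> 41 coefficients
bands = [0.0, 0.20, 0.28, 0.5]    # edges as fractions of fs (0.5 = Nyquist)
desired = [1.0, 0.0]              # desired gain in passband and stopband

h = remez(numtaps, bands, desired)

# Evaluate the resulting magnitude response in dB.
Omega = np.linspace(0, np.pi, 1024)
H = np.exp(-1j * np.outer(Omega, np.arange(numtaps))) @ h
A = 20 * np.log10(np.maximum(np.abs(H), 1e-12))
stop = Omega >= 0.56 * np.pi      # stopband edge 0.28*fs corresponds to 0.56*pi
print(f"worst-case stopband gain: {A[stop].max():.1f} dB")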
4.4 Spectral Analysis Using the Discrete Fourier Transform
The corresponding inverse DFT recovers the time-domain sequence from the DFT coefficients X[µ]:

x[k] = (1/M) ∑_{µ=0}^{M−1} X[µ] e^{i 2πµk/M} ,   k = 0, . . . , M − 1 .        (4.19)
where X̃[µ] and x̃[k] are the periodically continued sequences.
Theorem 4.3 (Symmetry). When the sequence x[k], k = 0, . . . , M − 1, is real-valued,
the sequence of DFT coefficients is conjugate-symmetric, i.e. when M is even we
have X[µ] = X ∗ [M − µ], µ = 1, . . . , M/2 − 1.
Theorem 4.4 (Cyclic convolution of two sequences). The cyclic convolution

x[k] ⊛_M y[k] := ∑_{ℓ=0}^{M−1} x[ℓ] y[(k − ℓ)_{mod M}]

of two time-domain sequences x[k] and y[k] corresponds to a multiplication X[µ] Y[µ] of the respective DFT sequences X[µ] and Y[µ]. (k)_{mod M} denotes the modulo operator.
Theorem 4.5 (Multiplication of two sequences). The multiplication x[k] y[k] of two sequences x[k] and y[k] corresponds to the cyclic convolution

(1/M) X[µ] ⊛_M Y[µ] = (1/M) ∑_{ℓ=0}^{M−1} X[ℓ] Y[(µ − ℓ)_{mod M}]

of the respective DFT sequences.
Whenever we select a length-M sequence of the signal x[k] prior to computing the DFT, we may describe this in terms of applying a window function w[k] of length M to the original, longer signal. Thus, the DFT coefficients X[µ] are equal to the DTFT Xw(e^{iΩ}) of the windowed sequence w[k] x[k] at the discrete frequencies Ωµ = 2πµ/M.
The implementation of the DFT makes use of fast algorithms known as the fast Fourier transform (FFT). The FFT algorithm achieves its efficiency by segmenting the input sequence (or the output sequence) into shorter sequences in several steps. Then,
DFTs of the shortest resulting sequences are computed and recombined in several
stages to form the overall result. The most popular versions of the FFT algorithms
use a DFT length M being equal to a power of two. This allows a repeated split into
shorter sequences until, in the last stage, only DFTs of length two are required. The
FFT algorithm then recombines these length-2 DFTs in log2 (M) − 1 stages following
a regular pattern.
Example 4.5 (Window functions). In Figure 4.12 we illustrate the effect of applying
a rectangular window w[k] to a sinusoidal signal x[k] = sin(Ωk) of infinite length
prior to computing the DFT. The plots on the left side show the sinusoidal signal
for two different signal frequencies Ω while the plots on the right side show the cor-
responding DTFT (dashed line) and DFT magnitude spectra. We find that the DFT
coefficients result from a sampled version of the spectrum Xw (eiΩ ) = X(eiΩ )∗W (eiΩ ),
where W (eiΩ ) is the DTFT of the window function w[k]. The effect of convolving the
spectrum of an infinite sinusoidal signal with the spectrum of the window function is
clearly visible in the DTFT. For the plots in the upper graphs, the length of the DFT
is equal to an integer multiple of the period of the sinusoidal signal, a condition that
is not met in the plots of the lower row. Hence, the DFT spectra are quite different.
While the sampling of the DTFT in the upper graph results in two distinct peaks in
the DFT (as it would be expected for a sinusoidal signal), the lower graph shows the
typical spectral leakage as the DTFT is now sampled on its side lobes. This spectral
leakage will obfuscate the signal spectrum and should be minimized.
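A small sketch reproducing this experiment (assuming the two frequencies are meant as Ω = 2π·3/16 and Ω = 2π·10/48, i.e. an integer and a non-integer number of periods within M = 16 samples):

import numpy as np

M = 16
k = np.arange(M)

for cycles in (3.0, 16 * 10 / 48):            # 3 periods vs. 10/3 periods in M samples
    x = np.sin(2 * np.pi * cycles * k / M)    # rectangular window = plain truncation
    X = np.fft.fft(x)
    print(f"{cycles:.2f} periods:", np.round(np.abs(X), 2))

For three full periods the energy concentrates in two DFT bins; for 10/3 periods it leaks into all bins.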
The spectral leakage may be reduced by using a tapered window function, however, at the cost of a reduced spectral resolution. Well-known window functions are the Hamming, the Hann, the Blackman, and the Kaiser window. Some widely used windows (the rectangular, the Hann, and the Hamming window) can be written in a common form governed by a single design parameter a; see Equation (4.24).
Figure 4.12: DFT analysis of a sinusoidal signal multiplied with a rectangular window. The DFT length is M = 16. Upper plots: signal and magnitude spectrum for Ω = 2π · 3/16. Lower plots: signal and magnitude spectrum for Ω = 2π · 10/48.
Figure 4.13: The rectangular (boxcar), the Hamming, and the Blackman windows and their respective magnitude responses A(Ω) = 20 log₁₀ |W(e^{iΩ})|.
wKaiser[k] = I₀( α sqrt(1 − ((k − (M−1)/2) / ((M−1)/2))²) ) / I₀(α)   for 0 ≤ k ≤ M − 1,   and 0 otherwise,        (4.26)
Figure 4.14: Kaiser windows with shape parameters α = 2 and α = 6 (left) and their magnitude responses A(Ω) (right).
where I0 (·) denotes the zero-order modified Bessel function of the first kind and α is
a shape parameter which controls the bandwidth of its frequency response.
The effects of a tapered window are shown in Figure 4.15, where the same sinu-
soidal signals as in Example 4.5 and a Hamming window w[k] are used. Clearly, the
amplitudes of the spectral side lobes are now significantly reduced and the amount
of leakage in the DFT spectra depends much less on the signal frequency. The main
lobe, however, is wider indicating a loss of spectral resolution.
Figure 4.15: DFT analysis of a sinusoidal signal multiplied with a Hamming window. The DFT length is M = 16. Upper plots: signal and magnitude spectrum for Ω = 2π · 3/16. Lower plots: signal and magnitude spectrum for Ω = 2π · 10/48.
The 3-dB bandwidth is defined as the frequency interval for which the magnitude
response of a band-pass filter is not more than 3 dB below its maximum value. Typi-
cally, the maximum response is achieved for the center frequency of a frequency bin.
For the windows specified in Equation (4.24) the 3-dB bandwidth ∆Ω3dB relative to
4π/M is shown in Figure 4.16 as a function of parameter a.
Figure 4.16: 3-dB bandwidth of the window function from Equation (4.24) relative
to 4π/M as a function of the window design parameter a.
For successive, overlapping signal segments the windowed DFT is computed as X[λ, µ] = ∑_{k=0}^{M−1} x[k + λR] w[k] e^{−i 2πµk/M}, where R is the shift between successive segments and λ is the index of these segments. This constitutes a sliding-window short-term Fourier analysis.
Figure 4.17: Narrowband (top) and wideband (bottom) spectrogram of a music signal
(pop song). In the top plot the harmonics of the singing voice are clearly resolved
while in the lower plot the formants are more prominently displayed.
The log-magnitude spectra 20 log10 (|X[λ , µ]|) of successive signal segments are
then plotted on a dB scale to obtain the spectrogram of an audio signal.
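A minimal sketch of such a sliding-window analysis (the window length M = 1024 and shift R = 256 are illustrative choices):

import numpy as np

def spectrogram(x, M=1024, R=256):
    """Log-magnitude spectra 20*log10(|X[lambda, mu]|) of successive windowed segments."""
    w = np.hamming(M)
    n_seg = (len(x) - M) // R + 1
    frames = np.stack([x[lam * R : lam * R + M] * w for lam in range(n_seg)])
    X = np.fft.rfft(frames, axis=1)            # one-sided spectra of the real signal
    return 20 * np.log10(np.abs(X) + 1e-12)    # dB scale; small offset avoids log(0)

# Test signal: a tone plus a slowly rising component, sampled at fs = 16 kHz.
fs = 16000
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * (1000 + 200 * t) * t)

S = spectrogram(x)
print(S.shape)   # (number of segments, M/2 + 1 frequency bins)

A longer window yields a narrowband spectrogram that resolves the harmonics, a shorter one a wideband spectrogram that emphasizes the formants, cf. Figure 4.17.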
To derive the CQT we consider the Fourier transform of a windowed signal segment x[k + λR] starting at k = λR, where R ∈ ℤ denotes the advance between successive signal segments and N is the window length,

Xstft[λ, µ] = ∑_{k=0}^{N−1} x[k + λR] w[k] exp(−i 2π fµ k / fs) ,        (4.30)

and replace the equispaced subband center frequencies fµ = fs µ/N by the geometrically spaced frequencies fµ = fmin · 2^{µ/(12b)}, where fmin and b denote the minimal analysis frequency and the number of bins per semi-tone, respectively. Further, the uniform frequency resolution ∆f = fs/N of the DFT is replaced by a non-uniform resolution ∆fµ = fs/Nµ by applying frequency-dependent window lengths Nµ such that a constant quality factor Q = fµ/∆fµ = fµ/(fµ+1 − fµ) = 1/(2^{1/(12b)} − 1) is achieved. For b = 1 the quality factor is adjusted to a scale of 12 semi-tones and for b > 1 we obtain more than 12 frequency bins per octave and thus a higher frequency resolution.

Definition 4.11 (Constant-Q Transform). For a frequency grid fµ = fmin · 2^{µ/(12b)}, µ ∈ ℕ, and a quality factor Q = fµ/(fµ+1 − fµ) = 1/(2^{1/(12b)} − 1), the CQT is defined as

Xcqt[λ, µ] = (1/Nµ) ∑_{k=0}^{Nµ−1} x[k + λR] wµ[k] exp(−i 2πQk / Nµ) ,        (4.31)
where b denotes the number of frequency bins per semi-tone. The length of the window functions wµ[k] depends on the frequency band index µ and is given by Nµ = Q fs / fµ = fs / (fµ+1 − fµ).
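A naive sketch of Equation (4.31) for a single frame (the choice of fmin, the Hamming window, and the direct loop over bins are assumptions made for this example; efficient implementations precompute sparse kernels):

import numpy as np

def cqt_frame(x, fs, f_min=55.0, bins_per_semitone=1, n_octaves=4):
    """Constant-Q spectrum of one signal frame x (direct evaluation of Eq. (4.31))."""
    b = bins_per_semitone
    Q = 1.0 / (2 ** (1.0 / (12 * b)) - 1)           # constant quality factor
    n_bins = 12 * b * n_octaves
    X = np.zeros(n_bins, dtype=complex)
    for mu in range(n_bins):
        f_mu = f_min * 2 ** (mu / (12.0 * b))       # geometrically spaced center frequency
        N_mu = int(round(Q * fs / f_mu))            # frequency-dependent window length
        k = np.arange(N_mu)
        w = np.hamming(N_mu)                        # window choice is an assumption
        X[mu] = np.sum(x[:N_mu] * w * np.exp(-2j * np.pi * Q * k / N_mu)) / N_mu
    return X

fs = 16000
t = np.arange(fs) / fs                              # one second of signal
x = np.sin(2 * np.pi * 220 * t)                     # the tone a (220 Hz)
X = cqt_frame(x, fs)
print(int(np.argmax(np.abs(X))))                    # 24 -> two octaves above f_min = 55 Hz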
Example 4.7 (Comparison of DFT and CQT). Figure 4.18 shows an example of a
sliding window DFT (left) vs. a sliding window CQT (right) of a sustained multi-tone
mixture with a spacing of four semitones sampled at fs = 16 kHz and with b = 2.
Clearly, the resolution of the DFT is not sufficient at low frequencies.
Example 4.8 (Short-time spectral analysis using the CQT). Figure 4.19 depicts an
example where a music signal was analyzed using the CQT. The base line and the
singing voice are clearly resolved.
Figure 4.18: Sliding-window DFT (left) and sliding-window CQT (right) of a sustained multi-tone mixture (cf. Example 4.7).
Figure 4.19: CQT spectrum of a pop song (left) and narrowband spectrogram (right).
The harmonics of the singing voice as well as the bass line are clearly resolved in the
CQT spectrum.
4.6 Filter Banks for Short-Time Spectral Analysis
Figure 4.20: Filter bank for spectral analysis composed of a low-pass (LP) filter,
several band-pass filters (BPi ), and decimators for sampling rate reduction by Ri .
Figure 4.21: Magnitude response of subband filters with uniform (top) and non-
uniform (bottom) filter bandwidths.
The prototype filter impulse response h[k] = hBP₀[k] may now be designed for a desired number of subbands and for a desired bandwidth and stopband attenuation. Furthermore, we would like to achieve an overall perfect response (unity response)

hA[k] = ∑_{µ=0}^{M−1} hBPµ[k] = δ[k − k0] ,        (4.34)
where k0 is the overall group delay (latency) of the filter bank. Thus, for a uniform frequency spacing Ωµ = 2πµ/M we have

hA[k] = ∑_{µ=0}^{M−1} hBPµ[k] = ∑_{µ=0}^{M−1} h[k] e^{i 2πµk/M} = h[k] M p^{(M)}[k] ,        (4.35)

where p^{(M)}[k] = (1/M) ∑_{µ=0}^{M−1} e^{i 2πµk/M} is different from zero only for k = λM with λ ∈ ℤ.
All other samples of h[k] may be used to optimize the channel bandwidth and the
stopband attenuation. Note that the discrete tapered sinc function3
where w[k] denotes an arbitrary window function, satisfies the above constraint.
Thus, the frequency response of the individual band-pass filters can be controlled
via the shape and the length of the window function w[k]; see Section 4.4.1.
Example 4.9 (Uniform Filter Bank). Figure 4.22 depicts an example for M = 8 chan-
nels of which 5 channels are located between 0 and the Nyquist frequency. For the
design of the prototype low-pass filter, a Hann window of length 81 was used.
When the length of the prototype impulse response h[k] equals the DFT length M, the above complex-modulated filter bank with uniform frequency spacing corresponds to a sliding window DFT analysis system. With Ω_µ = 2πµ/M, the substitution u = ℓ − k, and
\[
h[-u] = \begin{cases} w[u], & u = 0, \ldots, M-1 \\ 0, & \text{otherwise} \end{cases} \tag{4.38}
\]
we rewrite Equation (4.33) to yield
\[
X[k,\mu] = x_\mu[k] = e^{-\mathrm{i} \frac{2\pi}{M} \mu k} \sum_{u=0}^{M-1} x[k+u]\, w[u]\, e^{-\mathrm{i} \frac{2\pi}{M} \mu u}. \tag{4.39}
\]
³ The cardinal sine function is defined here as sinc[x] = sin[x]/x for x ≠ 0 and sinc[x] = 1 for x = 0.
Figure 4.22: Impulse response of prototype low-pass filter (top) and magnitude re-
sponse of a complex-modulated uniform filter bank (bottom). The center frequencies
are spaced by π/4. For clarity, the line style toggles between dashed and solid lines.
These frequency responses add to a constant value of one.
A non-uniform filter bank motivated by the human auditory system is the gammatone filter bank. Its band-pass filters are parameterized by a gain factor a, the filter order n, the center frequency f_c, the bandwidth b, and the phase φ of the cosine signal. The bandwidth of a filter at center frequency f_c
is given by [8]
\[
b = 24.7 \cdot (4.37 \cdot f_c / 1000 + 1) \cdot \mathrm{BWC}, \tag{4.41}
\]
which is the equivalent rectangular bandwidth (ERB) of a human auditory filter cen-
tered at the frequency fc . This ERB is multiplied by a bandwidth correction factor
BWC = 1.019.
Note that the bandwidth is bounded by 24.7BWC as the center frequency ap-
proaches zero. For a sampling rate fs , the discrete-time impulse response of one of
these band-pass filters is found as
\[
g[k] = \frac{a}{f_s}\, (k/f_s)^{n-1}\, e^{-2\pi b k / f_s} \cos(2\pi f_c k / f_s + \phi) \quad \forall k \in \mathbb{N} \tag{4.42}
\]
and the phase term φ = −(n − 1) fc /b aligns the temporal fine structure of the filter
impulse responses. Often, the gammatone filters are normalized to provide a maxi-
mum amplitude response of 0 dB. An example is shown in Figure 4.23 for 21 bands
which are separated by about one ERB. The summation of these filters results in
an almost flat overall response which is also shown in this figure. Several authors
have developed code for the efficient implementation of the gammatone filter bank
in terms of parametric recursive LTI systems which approximate the above impulse
response [26]. The gammatone filter bank has no exact inverse, but approximations
are available [11].
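As an illustration, the following Python sketch builds gammatone impulse responses from Equations (4.41) and (4.42) and applies them as FIR filters by convolution. The filter order, length, and center frequencies are assumptions chosen for the example; it is not the efficient recursive implementation of [26].

import numpy as np

def gammatone_ir(fc, fs, n=4, length=2048, a=1.0):
    """Discrete gammatone impulse response following Eqs. (4.41)-(4.42) (sketch)."""
    bwc = 1.019
    b = 24.7 * (4.37 * fc / 1000.0 + 1.0) * bwc       # ERB times correction factor, Eq. (4.41)
    phi = -(n - 1) * fc / b                           # phase term aligning the fine structure
    k = np.arange(length)
    g = (a / fs) * (k / fs) ** (n - 1) * np.exp(-2 * np.pi * b * k / fs) \
        * np.cos(2 * np.pi * fc * k / fs + phi)
    return g / np.max(np.abs(np.fft.fft(g, 4 * length)))   # normalize to a 0 dB peak response

# a small filter bank: one FIR filter per center frequency (filtering by convolution)
fs = 32000
center_freqs = [442.0, 886.0, 1770.0, 3540.0]             # illustrative values only
bank = [gammatone_ir(fc, fs) for fc in center_freqs]
x = np.random.randn(fs)                                    # one second of test noise
subbands = [np.convolve(x, g)[:len(x)] for g in bank]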
Figure 4.23: Magnitude response of a gammatone filter bank with center frequencies
in the range of 442–5544 Hz. The center frequencies are spaced by the corresponding
ERB. The sampling frequency is 32 kHz. For clarity, the line style toggles between
dashed and solid lines. The bold line indicates the sum of all subband responses.
4.7 The Cepstrum

The cepstrum is the basis for audio features such as mel
frequency cepstral coefficients (MFCC), see Chapter 5, and has been used in auto-
matic instrument recognition tasks [5]. The cepstrum was introduced in [1], in which
its name and related vocabulary were also coined: the word cepstrum is derived by
reversing the order of the first four letters of the word spectrum. The cepstrum can
be defined as a complex-valued or real-valued quantity. In what follows, we only
consider the real cepstrum.
Definition 4.13 (Cepstrum). The real cepstrum of a discrete-time signal x[k] is computed using the inverse DTFT as
\[
c_x[q] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log |X(e^{\mathrm{j}\Omega})|\, e^{\mathrm{j}\Omega q}\, \mathrm{d}\Omega,
\]
or, using the DFT of length M, as
\[
c_x[q] = \frac{1}{M} \sum_{\mu=0}^{M-1} \log |X_\mu|\, e^{\mathrm{j} \frac{2\pi}{M} \mu q}.
\]
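A minimal Python sketch of the DFT-based real cepstrum of Definition 4.13 could look as follows; the small constant added before the logarithm is an implementation detail not present in the definition.

import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one signal frame via the DFT (Definition 4.13, sketch)."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small constant avoids log(0)
    # inverse DFT of the log magnitude spectrum; the result is real for real input
    return np.real(np.fft.ifft(log_mag))

# usage: c = real_cepstrum(x[0:256]); a peak at quefrency q near fs/f0 hints at the pitch period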
Figure 4.24: Sampled waveform, DFT power spectrum, and cepstrum of a short
voiced sound. For clarity all signals are displayed as solid lines. The sampling rate
is fs = 8 kHz.
4.8 Fundamental Frequency Estimation

The accurate estimation of the fundamental frequency entails the segmentation of the music signal into quasi-periodic segments, the extraction of the fundamental frequency from these segments, and a smoothing procedure to avoid sudden and non-plausible variations of the f0-estimate due to estimation errors.
A more versatile framework [4] exploits the shift invariance of periodic signal
segments and minimizes the mean-squared difference
\[
\begin{aligned}
d[k,\tau] &= \sum_{\ell=k-N+1}^{k} (x[\ell] - x[\ell+\tau])^2 && (4.44)\\
&= \sum_{\ell=k-N+1}^{k} x[\ell]^2 + \sum_{\ell=k-N+1}^{k} x[\ell+\tau]^2 - 2 \sum_{\ell=k-N+1}^{k} x[\ell]\, x[\ell+\tau] && (4.45)\\
&= \varphi_{xx}[k,0] + \varphi_{xx}[k+\tau,0] - 2\,\varphi_{xx}[k,\tau] && (4.46)
\end{aligned}
\]
with respect to τ. Along with a normalization on the average of d[k, τ], this modi-
fication reduces errors which arise from variations of the signal power [4]. Another
source of error is the search grid for τ0 imposed by the sampling rate. To increase the
resolution, an interpolation of the signal x[`] or of the optimization objective d[k, τ]
is often used. Other widely used methods employ the average magnitude difference
function (AMDF) [23]
\[
d_{\mathrm{amd}}[k,\tau] = \sum_{\ell=k-N+1}^{k} |x[\ell] - x[\ell+\tau]|, \tag{4.47}
\]
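The following Python sketch computes the squared difference function of Equation (4.44) and the AMDF of Equation (4.47) and derives a crude period estimate from the minimum of d[k, τ]. It omits the normalization and interpolation steps of the YIN algorithm [4]; all parameter values are assumptions for the example.

import numpy as np

def diff_function(x, k, N, tau_max):
    """Squared difference d[k, tau] (Eq. 4.44) and AMDF (Eq. 4.47) for tau = 1..tau_max."""
    seg = x[k - N + 1 : k + 1]                       # x[l], l = k-N+1, ..., k
    d = np.zeros(tau_max + 1)
    d_amd = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        shifted = x[k - N + 1 + tau : k + 1 + tau]   # x[l + tau]
        d[tau] = np.sum((seg - shifted) ** 2)
        d_amd[tau] = np.sum(np.abs(seg - shifted))
    return d, d_amd

# crude f0 estimate from the minimum of d over a plausible lag range
fs = 16000
t = np.arange(2 * fs) / fs
x = np.sin(2 * np.pi * 220.0 * t)                    # synthetic 220 Hz tone
d, d_amd = diff_function(x, k=4000, N=1024, tau_max=400)
tau0 = np.argmin(d[20:]) + 20                        # skip very small lags
print(fs / tau0)                                     # approximately 220 Hz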
Example 4.11 ( f0 -tracking). An example is shown in Figure 4.25 where the YIN al-
gorithm [4] was used to compute a fundamental frequency estimate on a piece of
monophonic music. In the spectrogram in Figure 4.25 the fundamental frequency
corresponds to the frequency of the lowest of the equispaced harmonics.
4.9 Further Reading
Non-uniform filter banks may also be designed using frequency warping tech-
niques. In this context the bilinear transformation has been used to approximate an
auditory filter bank [27]. Furthermore, in many applications the filter bank for spec-
tral analysis is followed by a signal modification step and a synthesis filter bank.
Then, the design method has to take the overall response into account [29] and per-
fect signal reconstruction becomes a desirable design constraint. Prominent methods
are explained in, e.g., [29] for uniform filter banks and, e.g., in [11] and [17], the lat-
ter two offering near perfect reconstruction for the gammatone filter bank and for the
sliding-window CQT.
Bibliography
[1] B. Bogert, M. Healy, and J. Tukey. The quefrency alanysis of time series for
echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking.
In Proc. of the Symposium on Time Series Analysis, pp. 209–243, 1963.
[2] J. Brown. Calculation of a constant Q spectral transform. J. Acoust. Soc. of America, 89:425–434, 1991.
[3] J. Brown and M. Puckette. An efficient algorithm for the calculation of a constant Q transform. J. Acoust. Soc. of America, 92(5):2698–2701, 1992.
[4] A. de Cheveigné and H. Kawahara. YIN, a fundamental frequency estimator for speech and music. J. Acoust. Soc. of America, 111(4):1917–1930, 2002.
[5] A. Eronen and A. Klapuri. Musical instrument recognition using cepstral co-
efficients and temporal features. In Proceedings of IEEE International Confer-
ence on Acoustics, Speech, and Signal Processing (ICASSP ’00), volume 2, pp.
II753–II756, 2000.
[6] N. Fliege. Multirate Digital Signal Processing: Multirate Systems—Filter
Banks—Wavelets. Wiley, 1999.
[7] D. Gerhard. Pitch extraction and fundamental frequency: History and current
techniques. Technical Report TR-CS 2003-06, Department of Computer Sci-
ence, University of Regina, 2003.
[8] B. Glasberg and B. Moore. Derivation of auditory filter shapes from notched-
noise data. Hearing Research, 47:103–108, 1990.
[9] H. Göckler and A. Groth. Multiratensysteme: Abtastratenumsetzung und digi-
tale Filterbänke. J. Schlembach, 2004. (in German).
[10] W. Hess. Pitch Determination of Speech Signals. Springer, Berlin, 1983.
Chapter 5
Signal-Level Features
5.1 Introduction
A musical signal carries a substantial amount of information that corresponds to its
timbre, melody, or rhythm properties and that may be used to classify music, e.g., in
terms of instrumentation, chord progression, or musical genre. However, apart from
these high-level musical features it also contains a lot of additional information which
is irrelevant for an analysis or a classification task, or even degrades the performance
of the task. It is therefore necessary to extract relevant and discriminative features
from the raw audio signal, which can be used either to identify properties of a music
piece or to assign music to predefined classes.
In this chapter some of the most commonly used features are reviewed. These
features are often referenced in the literature and have proven to be well suited for
music-related classification tasks. A feature value is obtained by following a de-
fined calculation rule which can be defined in the time, spectral, or cepstral domain
depending on the musical property to be modeled. Often it is computed for short,
possibly overlapping signal segments which cover approximately 20–30 ms, thereby
resulting in a feature series which may then describe the temporal evolution of a
specific aspect.
As a starting point, a raw audio signal x[κ] is given, where κ denotes the discrete
time index. The time interval between successive time indices is defined by the
inverse sampling frequency 1/ fs . This signal is segmented into L frames x[λ , k] of
length K
x[λ , k] = x[λ R + k], k ∈ {0, 1, . . . , K − 1}, (5.1)
where λ and R denote the frame index and frame shift, respectively. If necessary, a
spectral transform such as the short-time Fourier transform (STFT) or the constant-Q
transform (CQT), as introduced in Chapter 4, can be applied to the frames yielding a
complex-valued spectral coefficient X[λ , µ] = |X[λ , µ]| eiφ[λ ,µ] , where µ denotes the
discrete frequency index and φ[λ , µ] is the phase. Often X[λ , µ] is computed using
the discrete Fourier transform (DFT) of length K. Note that each frame is assumed
to contain a quasi-stationary portion of the signal. Therefore, a meaningful analysis
can be performed which eventually yields a set of short-time features. These features
are supposed to highlight the most important signal characteristics with respect to a
certain task and therefore are a compact representation of the signal itself.
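As a sketch of this processing chain, the following Python code segments a signal according to Equation (5.1) and computes windowed DFT spectra per frame; the Hann window and the parameter values are assumptions for the example.

import numpy as np

def frame_signal(x, K, R):
    """Split x[kappa] into L frames x[lam, k] = x[lam*R + k] of length K (Eq. 5.1)."""
    L = 1 + (len(x) - K) // R
    return np.stack([x[lam * R : lam * R + K] for lam in range(L)])

def stft(frames):
    """Complex spectral coefficients X[lam, mu] of each (windowed) frame."""
    window = np.hanning(frames.shape[1])          # window choice is an assumption
    return np.fft.rfft(frames * window, axis=1)   # mu = 0, ..., K/2 for real input

fs = 16000
x = np.random.randn(fs)                           # one second of test signal
frames = frame_signal(x, K=512, R=256)            # roughly 32 ms frames, 50% overlap
X = stft(frames)                                  # shape: (L, K/2 + 1)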
The remainder of this chapter is organized as follows. In Section 5.2 timbre-
related features are introduced, which are commonly used in applications such as
instrument recognition. Section 5.3 presents features which describe harmony prop-
erties and characteristics of partial tones in music signals. Features used for the
extraction of note onsets and the description of rhythmic properties are discussed in
Section 5.4. The chapter is concluded with a reference to related literature in Section
5.5.
5.2 Timbre Features

Definition 5.1 (Zero-Crossing Rate). The zero-crossing rate counts the sign changes of the time-domain signal within a frame; in its defining equation the signum function sgn(·) yields 1 for positive arguments and 0 for negative arguments. The zero-crossing rate is a rough measure of the noisiness and the high-frequency content of the signal.
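A common formulation of the zero-crossing rate, sketched below in Python, counts the relative number of sign changes per frame using the sgn(·) convention above; the exact normalization is an assumption here.

import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs with a sign change (common formulation, sketch)."""
    s = (frame > 0).astype(int)                 # sgn: 1 for positive, 0 otherwise
    return np.mean(np.abs(np.diff(s)))

# a noisy frame yields a much higher value than a low-frequency sinusoid
fs = 16000
t = np.arange(512) / fs
print(zero_crossing_rate(np.sin(2 * np.pi * 200 * t)))   # low
print(zero_crossing_rate(np.random.randn(512)))          # high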
Definition 5.2 (Low-Energy). Another measure based on the time domain repre-
sentation of the signal is the low-energy feature. Unlike many other features the
low-energy feature is calculated using the whole signal x[κ]. The root mean square
(RMS) energy of each frame λ is evaluated and normalized on the RMS energy of
x[κ]
\[
e(\lambda) = \frac{\sqrt{\frac{1}{K} \sum_{k=0}^{K-1} x^2[\lambda,k]}}{\sqrt{\frac{1}{K_T} \sum_{\kappa=0}^{K_T-1} x^2[\kappa]}}, \tag{5.3}
\]
where K_T is the total number of samples. This ratio is less than 1 if the RMS energy of frame λ is lower than the total RMS energy; otherwise the ratio is greater than or equal to 1. The actual low-energy feature is defined as the relative frequency of frames whose RMS energy is lower than the RMS energy of the signal x[κ].
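The following Python sketch evaluates the frame-wise ratio of Equation (5.3) and returns the relative frequency of low-energy frames; frame length and shift are assumptions for the example.

import numpy as np

def low_energy(x, K=512, R=256):
    """Share of frames whose RMS energy is below the RMS energy of the whole signal."""
    total_rms = np.sqrt(np.mean(x ** 2))
    L = 1 + (len(x) - K) // R
    ratios = []
    for lam in range(L):
        frame = x[lam * R : lam * R + K]
        ratios.append(np.sqrt(np.mean(frame ** 2)) / total_rms)   # e(lam), Eq. (5.3)
    return np.mean(np.array(ratios) < 1.0)   # relative frequency of low-energy frames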
Definition 5.3 (Spectral Centroid). The spectral centroid t_cent[λ] is the center of gravity of the magnitude spectrum, i.e., the magnitude-weighted mean of the frequency indices µ. For symmetry reasons, the summations range from 0 to M/2 only. Lower values correspond to dull sounds, whereas higher values denote brighter sounds.
Example 5.1 (Spectral Centroid). We consider the note c’ played on a piano and an
oboe, respectively. Their spectrograms are computed using a DFT of length M = 512 at the sampling frequency f_s = 16,000 Hz. The frame shift is set to R = 256. Then, we obtain the frame-based spectral centroid, Equation (5.5). Both the spectrograms and the temporal evolution of the spectral centroid are shown in Figure 5.1. The
spectrogram of the piano note (left) shows that the energy of the harmonics steadily
decreases with increasing time. This effect is also revealed in the slightly negative
trend of the spectral centroid shown by the continuous black line. At the same time
the harmonics of the sustained oboe tone (right) behave relatively stably in time
which is also demonstrated by the corresponding evolution of the spectral centroid.
It is also worth noting that on average the spectral centroid attains higher values for
the oboe tone (1552.5 ± 68.0 Hz) than for the piano tone (740.0 ± 111.8 Hz) which
points towards a higher energy concentration in the harmonics of the oboe sound.
The following three features are based on the estimation of spectral moments;
see also Definition 9.7.
Definition 5.4 (Spectral Spread). A measure which characterizes the frequency range
Figure 5.1: Temporal evolution of the spectral centroid (black line) of the note c’
played on a piano (left) and an oboe (right) with their respective spectrograms.
of a sound around the spectral centroid is given by the spectral spread. It is defined
as the normalized, second centered moment of the spectrum, i.e. the spectral vari-
ance. In order to make the spectral spread comparable in units with the magnitude
spectrum, it is advisable to take the square root of the spectral variance which yields
the standard deviation of the magnitude spectrum in a given frame
\[
t_{\mathrm{spread}}[\lambda] = \sqrt{\frac{\sum_{\mu=0}^{M/2} (\mu - t_{\mathrm{cent}}[\lambda])^2\, |X[\lambda,\mu]|}{\sum_{\mu=0}^{M/2} |X[\lambda,\mu]|}}. \tag{5.6}
\]
The spectral spread accounts for the sensation of timbral fullness or richness of a
sound.
Definition 5.5 (Spectral Skewness). The ratio between the third centered moment
and the spectral spread raised to the power of three is defined as the skewness. It can
be obtained by
\[
t_{\mathrm{skew}}[\lambda] = \frac{\sum_{\mu=0}^{M/2} (\mu - t_{\mathrm{cent}}[\lambda])^3\, |X[\lambda,\mu]|}{t_{\mathrm{spread}}^3[\lambda]\, \sum_{\mu=0}^{M/2} |X[\lambda,\mu]|} \tag{5.7}
\]
and describes the symmetry property of the spectral distribution in a frame. For neg-
ative values the distribution of spectral energy drops faster if frequencies exceed the
spectral centroid while it develops a wider tail towards lower frequencies. For posi-
tive values the opposite behavior is observed. A value of zero indicates a symmetric
distribution of spectral energy around the spectral centroid.
Figure 5.2: Scatter plot of spectral skewness values vs. spectral spread values for
exemplary recordings of different instruments.
Example 5.2 (Spectral Spread and Skewness). We consider excerpts of a piano piece
(F. Chopin, “Waltz No. 9 a-flat Major Op. 69 No. 1”), a classical guitar piece (F.
Tarrega, “Prelude in D minor, Oremus (lento)”), and an organ piece (J.S. Bach,
“Toccata and Fugue in D minor, BWV 565”) and compute their spectral spread and
skewness values in a frame-wise fashion. The spectral analysis is performed using a DFT of length M = 512 at the sampling frequency f_s = 16,000 Hz. The frame shift is set
to R = 256. In Figure 5.2 the spectral spread and skewness values are plotted against
each other for each instrument. The scatter plot shows that piano and guitar sounds
exhibit a lower spectral spread on average than an organ sound. However, we can
also observe that in particular the piano sounds have a higher variation in terms of
the spectral spread than the other two instruments. Further, it is worth noting that
for all instruments the spectral skewness only attains positive values, which is a sign
of a negative spectral tilt towards increasing frequencies. Obviously, in this feature
space, organ sounds are well separated from piano and guitar sounds, respectively.
Such an observation can be utilized for training supervised classifiers on instrument
recognition (see Chapters 12, 18). A complete separation of piano sounds and guitar
sounds, however, is not possible using only these features.
Definition 5.6 (Spectral Kurtosis). Furthermore, the spectrum can be characterized
in terms of its peakiness. This property can be expressed by means of the spectral
kurtosis
\[
t_{\mathrm{kurt}}[\lambda] = \frac{\sum_{\mu=0}^{M/2} (\mu - t_{\mathrm{cent}}[\lambda])^4\, |X[\lambda,\mu]|}{t_{\mathrm{spread}}^4[\lambda]\, \sum_{\mu=0}^{M/2} |X[\lambda,\mu]|} - 3. \tag{5.8}
\]
In particular, the kurtosis describes to what extent the spectral shape resembles
or differs from the shape of a Gaussian bell curve. For values below zero the spectral
shape is subgaussian, which implies that the spectral energy tends towards a uniform
distribution. Such a behavior typically occurs for wide-band sounds. A value of zero
points towards an exact bell-curved spectral shape. Values larger than zero char-
acterize a peaked spectral shape which is strongly concentrated around the spectral
centroid. Such a spectral shape is typically obtained for narrow-band sounds.
Definition 5.7 (Spectral Flatness). Similarly, the peakiness can be described by means
of the spectral flatness measure, which is defined as the ratio between the geometric
mean and the arithmetic mean of the magnitude spectrum
\[
t_{\mathrm{flat}}[\lambda] = \frac{\sqrt[M/2+1]{\prod_{\mu=0}^{M/2} |X[\lambda,\mu]|}}{\frac{1}{M/2+1} \sum_{\mu=0}^{M/2} |X[\lambda,\mu]|}. \tag{5.9}
\]
A higher spectral flatness value points towards a more uniform spectral distribu-
tion, whereas a lower value implies a peaked and sparse spectrum.
Definition 5.8 (Spectral Rolloff). The distribution of spectral energy to low and high
frequencies can be characterized by the spectral rolloff feature. It is defined as the
frequency index µsr below which 85% of the cumulated spectral magnitudes are con-
centrated. µsr fulfills the equation
\[
\sum_{\mu=0}^{\mu_{\mathrm{sr}}} |X[\lambda,\mu]| = 0.85 \sum_{\mu=0}^{M/2} |X[\lambda,\mu]|. \tag{5.10}
\]
The lower the value of µsr , the more spectral energy is concentrated in low-
frequency regions.
Definition 5.9 (Spectral Brightness). Instead of keeping the energy ratio fixed, it is
also possible to choose a fixed cut-off frequency, e.g. fc = 1500 Hz, above which
the percentage share of cumulated spectral magnitudes is measured. Thereby, the
amount of high-frequency energy can be quantified. The spectral brightness feature
is hence defined as
\[
t_{\mathrm{bright}}[\lambda] = \frac{\sum_{\mu=\mu_c}^{M/2} |X[\lambda,\mu]|}{\sum_{\mu=0}^{M/2} |X[\lambda,\mu]|}, \tag{5.11}
\]
where µc is the discrete frequency index corresponding to the cut-off frequency.
Definition 5.10 (Spectral Flux). The amount of spectral change between consecutive
signal frames can be measured by means of the spectral flux, which is defined as the
sum of the squared difference between the (normalized) magnitudes of successive
short-time spectra over all frequency bins µ
\[
t_{\mathrm{flux}}[\lambda] = \sum_{\mu=0}^{M/2} \big(|X[\lambda,\mu]| - |X[\lambda-1,\mu]|\big)^2. \tag{5.12}
\]
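Several of these spectral shape features can be computed jointly from the magnitude spectra of all frames. The Python sketch below evaluates the centroid (as the magnitude-weighted mean of the bin indices), the spread of Equation (5.6), the rolloff of Equation (5.10), and the flux of Equation (5.12); centroid and spread are returned in bin units, the magnitude normalization of the flux is omitted, and the small guard constant is an implementation detail.

import numpy as np

def spectral_shape_features(mag):
    """Frame-wise centroid, spread (Eq. 5.6), rolloff (Eq. 5.10), flux (Eq. 5.12).

    mag: magnitude spectra |X[lam, mu]| with mu = 0..M/2, shape (L, M/2 + 1)."""
    mu = np.arange(mag.shape[1])
    norm = np.sum(mag, axis=1) + 1e-12                        # avoid division by zero
    centroid = np.sum(mu * mag, axis=1) / norm
    spread = np.sqrt(np.sum((mu[None, :] - centroid[:, None]) ** 2 * mag, axis=1) / norm)
    cum = np.cumsum(mag, axis=1)
    rolloff = np.argmax(cum >= 0.85 * cum[:, -1:], axis=1)    # first bin reaching 85 percent
    flux = np.zeros(mag.shape[0])
    flux[1:] = np.sum(np.diff(mag, axis=0) ** 2, axis=1)      # squared change between frames
    return centroid, spread, rolloff, flux

# usage with the STFT magnitudes of Section 5.1: feats = spectral_shape_features(np.abs(X))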
Figure 5.3: Weighting functions of the first 25 mel frequency band-pass filters.
In total, this results in a so-called mel filter bank. In Figure 5.3 the weighting func-
tions are depicted for the first 25 filters.
For discrete frequency indices µ = M f / fs the weighting function of the q-th
filter shall be denoted as ∆q [µ]. Then, the short-time power spectrum for frame λ is
weighted with ∆q [µ] for q ∈ {1, 2, . . . , Q}. Subsequently, each filter output is summed
up across all frequency indices to obtain the mel spectrum
\[
\tilde{X}[\lambda,q] = \sum_{\mu=0}^{M/2} |X[\lambda,\mu]|^2\, \Delta_q[\mu]. \tag{5.16}
\]
To account for the non-linear relationship between the sound pressure level and the perceived loudness, the logarithm of Equation (5.16) is evaluated. As a last step the mel spectrum X̃[λ, q] is decorrelated using the discrete cosine transform (DCT), yielding the MFCCs
\[
\tilde{x}[\lambda,\xi] = v_\xi \sum_{q=0}^{Q-1} \log\!\big(\tilde{X}[\lambda,q]\big) \cos\!\left(\frac{\pi q \xi}{Q}\right), \tag{5.17}
\]
for ξ ∈ {0, 1, . . . , Q − 1}, with v_0 = 1/√M and v_ξ = √(2/M) for ξ ∈ {1, . . . , Q − 1}.
This step decomposes the mel spectrum into cepstral coefficients which describe the
spectral envelope and the spectral fine structure, respectively.
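The following Python sketch follows Equations (5.16) and (5.17), assuming that the triangular weighting functions ∆q[µ] are already available as a matrix; the DCT is written out directly, and the normalization constants are taken as given in the text.

import numpy as np

def mfcc(power_spec, mel_weights, M):
    """MFCCs per Eqs. (5.16)-(5.17) (sketch).

    power_spec:  |X[lam, mu]|^2, shape (L, M/2 + 1)
    mel_weights: Delta_q[mu],    shape (Q, M/2 + 1)
    M:           DFT length used for the spectra
    """
    Q = mel_weights.shape[0]
    mel_spec = power_spec @ mel_weights.T                  # Eq. (5.16), shape (L, Q)
    log_mel = np.log(mel_spec + 1e-12)
    q = np.arange(Q)
    xi = np.arange(Q)
    dct_basis = np.cos(np.pi * np.outer(xi, q) / Q)        # cos(pi*q*xi/Q) as in Eq. (5.17)
    v = np.full(Q, np.sqrt(2.0 / M))                       # normalization constants v_xi
    v[0] = 1.0 / np.sqrt(M)
    return (log_mel @ dct_basis.T) * v                     # shape (L, Q)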
Example 5.3 (MFCCs). We consider the note a’ played by two different instruments,
e.g. a cello and a piano, which differ considerably in terms of their timbral charac-
teristics. The temporal evolution of the MFCCs for ξ ∈ {2, 3, ..., 13} are shown in
Figure 5.4. We observe that the piano exhibits higher values for ξ ∈ {2, 4, 5}. Fur-
thermore, the MFCCs corresponding to the tone played by the cello have a more
Figure 5.4: Temporal evolution of MFCCs x̃[λ , ξ ], ξ ∈ {2, 3, ..., 13}, for the note a’
played by a cello (left) and a piano (right).
rapidly fluctuating behavior which reveals a tremolo effect. Therefore, for describ-
ing timbre properties of an instrument or a music piece, it is useful to take temporal
changes of short-term features into account.
5.3 Harmony Features

The chroma vector aggregates the CQT magnitude spectrum over all octaves, such that each of its twelve entries corresponds to one pitch class, i.e., to a certain musical note. Hence, the p-th value of the chroma vector, with p ∈ {0, 1, ..., 11}, is computed as the summation over O octaves
\[
t_{\mathrm{chroma}}[\lambda,p] = \sum_{o=1}^{O} |X_{\mathrm{CQT}}[\lambda, p + 12o]|, \tag{5.18}
\]
where the first tone of the lowest octave o = 1 corresponds to the minimal anal-
ysis frequency fmin , and we assume B = 12 CQT bands per octave. The chroma
representation is typically utilized in applications such as chord transcription or key
estimation.
Normalizing the chroma vector with respect to its L1 norm yields
\[
t_{\mathrm{chroma,L1}}[\lambda,p] = \frac{t_{\mathrm{chroma}}[\lambda,p]}{\sum_{p'=0}^{11} t_{\mathrm{chroma}}[\lambda,p']}. \tag{5.19}
\]
In order to make them more robust against variations in tempo or articulation, the normalized chroma values can be quantized. To this end,
in [11] the intuitive quantization function
\[
t_{\mathrm{chroma,Q}}[\lambda,p] = \begin{cases}
0, & 0 \le t_{\mathrm{chroma,L1}}[\lambda,p] < 0.05,\\
1, & 0.05 \le t_{\mathrm{chroma,L1}}[\lambda,p] < 0.1,\\
2, & 0.1 \le t_{\mathrm{chroma,L1}}[\lambda,p] < 0.2,\\
3, & 0.2 \le t_{\mathrm{chroma,L1}}[\lambda,p] < 0.4,\\
4, & 0.4 \le t_{\mathrm{chroma,L1}}[\lambda,p] < 1,
\end{cases} \tag{5.20}
\]
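A Python sketch of Equations (5.18)–(5.20) for CQT magnitudes with twelve bins per octave could look as follows; the small constant in the normalization is an implementation detail.

import numpy as np

def chroma_features(cqt_mag, n_octaves):
    """Chroma (Eq. 5.18), L1 normalization (Eq. 5.19), quantization (Eq. 5.20) (sketch).

    cqt_mag: |X_CQT[lam, mu]| with 12 bins per octave, shape (L, 12*n_octaves + 12)."""
    L = cqt_mag.shape[0]
    chroma = np.zeros((L, 12))
    for p in range(12):
        for o in range(1, n_octaves + 1):                 # sum over octaves, Eq. (5.18)
            chroma[:, p] += cqt_mag[:, p + 12 * o]
    chroma_l1 = chroma / (np.sum(chroma, axis=1, keepdims=True) + 1e-12)   # Eq. (5.19)
    thresholds = [0.05, 0.1, 0.2, 0.4]                    # quantization steps of Eq. (5.20)
    chroma_q = np.zeros_like(chroma_l1)
    for th in thresholds:
        chroma_q += (chroma_l1 >= th)                     # integer levels 0..4
    return chroma, chroma_l1, chroma_q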
Figure 5.5: Chromagram (left) for an excerpt of a classical piano piece and the CENS
representation of the same excerpt (right).
which corresponds to the note c’. The frame shift is set to R = 32 samples at the
sampling frequency fs = 16 kHz. This yields a feature rate of 500 Hz. For the tem-
porally smoothed CENS features a 500-point Hann window and a decimation factor
of 50 were used. This reduces the feature rate to 10 Hz, i.e. 10 features per second.
Note that in this example the dominant A# is prominent.
Given the amplitudes A[λ, µ̂] of the µ̂-th partial tone in frame λ, the spectral irregularity is defined as
\[
t_{\mathrm{irreg}}[\lambda] = \frac{\sum_{\hat{\mu}=1}^{M} \big(A[\lambda,\hat{\mu}] - A[\lambda,\hat{\mu}+1]\big)^2}{\sum_{\hat{\mu}=1}^{M} \big(A[\lambda,\hat{\mu}]\big)^2}, \tag{5.21}
\]
which is obtained by accounting for the squared difference between amplitude values
of adjacent partials.
Definition 5.12 (Inharmonicity). Another meaningful property of partial tones is
their degree of harmonicity, which is perceived as the amount of consonance or disso-
nance in a music piece. Harmonic tones consist of a fundamental tone and overtones
whose frequencies are integer multiples of the fundamental frequency f0 . Note that
generally these frequency components are referred to as harmonics, where the first
harmonic equals the fundamental tone, i.e. f1 = f0 (cf. Chapter 2.2.2). Given a set
of extracted partial tones, a measure of inharmonicity can be obtained by computing
the energy-weighted absolute deviation of the estimated partial tone frequencies f µ̂
and the idealized harmonic frequencies µ̂ f0
\[
t_{\mathrm{inharm}}[\lambda] = \frac{2}{f_0} \cdot \frac{\sum_{\hat{\mu}=1}^{M} |f_{\hat{\mu}} - \hat{\mu} f_0|\, \big(A[\lambda,\hat{\mu}]\big)^2}{\sum_{\hat{\mu}=1}^{M} \big(A[\lambda,\hat{\mu}]\big)^2}, \tag{5.22}
\]
Definition 5.13 (Tristimulus). The tristimulus values relate the energy of individual partials, or of groups of partials, to the overall energy of all partials. The first tristimulus value is given by
\[
t_{\mathrm{trist1}}[\lambda] = \frac{\big(A[\lambda,1]\big)^2}{\sum_{\hat{\mu}=1}^{M} \big(A[\lambda,\hat{\mu}]\big)^2}. \tag{5.23}
\]
Definition 5.14 (Even-harm and Odd-harm). Similarly, the energy ratios of partials can be analyzed in terms of even-numbered and odd-numbered partial indices. Corresponding even-harmonic and odd-harmonic energy ratios can be defined as
\[
t_{\mathrm{even\text{-}harm}}[\lambda] = \sqrt{\frac{\sum_{\hat{\mu}=1}^{\lfloor M/2 \rfloor} \big(A[\lambda,2\hat{\mu}]\big)^2}{\sum_{\hat{\mu}=1}^{M} \big(A[\lambda,\hat{\mu}]\big)^2}} \tag{5.26}
\]
and
\[
t_{\mathrm{odd\text{-}harm}}[\lambda] = \sqrt{\frac{\sum_{\hat{\mu}=1}^{\lfloor M/2+1 \rfloor} \big(A[\lambda,2\hat{\mu}-1]\big)^2}{\sum_{\hat{\mu}=1}^{M} \big(A[\lambda,\hat{\mu}]\big)^2}}, \tag{5.27}
\]
respectively.
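Given the amplitudes and frequencies of extracted partial tones (the extraction itself is not shown), the inharmonicity of Equation (5.22) and the even- and odd-harmonic ratios of Equations (5.26) and (5.27) can be sketched in Python as follows.

import numpy as np

def partial_tone_features(amps, freqs, f0):
    """Inharmonicity (Eq. 5.22) and even/odd harmonic ratios (Eqs. 5.26, 5.27) (sketch).

    amps, freqs: amplitudes A[lam, mu_hat] and frequencies f_mu_hat of M extracted
                 partials for one frame (partial mu_hat maps to array index mu_hat - 1)."""
    M = len(amps)
    energy = np.sum(amps ** 2)
    mu_hat = np.arange(1, M + 1)
    inharm = 2.0 / f0 * np.sum(np.abs(freqs - mu_hat * f0) * amps ** 2) / energy
    even = np.sqrt(np.sum(amps[1::2] ** 2) / energy)      # partials 2, 4, 6, ...
    odd = np.sqrt(np.sum(amps[0::2] ** 2) / energy)       # partials 1, 3, 5, ...
    return inharm, even, odd

# a perfectly harmonic spectrum has zero inharmonicity
amps = np.array([1.0, 0.5, 0.25, 0.125])
freqs = 220.0 * np.arange(1, 5)
print(partial_tone_features(amps, freqs, f0=220.0))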
5.4 Rhythmic Features

An onset detection feature based solely on the magnitude spectrum works well for detecting percussive onsets, but has weaknesses for other types of onsets [1].
Definition 5.16 (Phase Deviation). Besides defining features solely based on the
magnitude spectrum, features related to the phase spectrum can also be taken into
account. The phase spectrum contains additional details about the temporal struc-
ture of the signal. Considering the difference between two successive frames of the
phase spectrum for a particular frequency bin yields an estimate of the instantaneous
frequency
\[
\phi'[\lambda,\mu] = \phi[\lambda,\mu] - \phi[\lambda-1,\mu]. \tag{5.29}
\]
The difference of the instantaneous frequency between two successive frames,
\[
\phi''[\lambda,\mu] = \phi'[\lambda,\mu] - \phi'[\lambda-1,\mu], \tag{5.30}
\]
is the basis of the phase deviation feature. Since this feature is not robust against low-energy noise stemming from frequency bins which do not carry partials of musical sounds, a normalized weighted phase
deviation was proposed [3]
\[
t_{\mathrm{nwpd}}[\lambda] = \frac{\sum_{\mu=0}^{M/2} \big|X[\lambda,\mu]\, \phi''[\lambda,\mu]\big|}{\sum_{\mu=0}^{M/2} |X[\lambda,\mu]|}, \tag{5.32}
\]
which takes into account the strength of partial tones and consequently reduces the
impact of irrelevant frequency bins. This feature shows a better performance for
pitched non-percussive onsets than features based on the spectral amplitude [1].
Definition 5.17 (Complex Domain Features). Instead of a separate treatment of the
magnitude spectrum and phase spectrum, a joint consideration of magnitude and
phase is also possible. Assuming a constant amplitude and constant rate of phase
change, a spectral estimate of the current frame λ based on the two previous frames
can be obtained by
\[
X_T[\lambda,\mu] = |X[\lambda-1,\mu]|\, e^{\mathrm{i}(\phi[\lambda-1,\mu] + \phi'[\lambda-1,\mu])}. \tag{5.33}
\]
By computing the sum of absolute differences between the spectrum and the target
function XT [λ , µ] we arrive at a complex domain onset detection function which
measures the deviation from a stationary signal behavior
\[
t_{\mathrm{cd}}[\lambda] = \sum_{\mu=0}^{M/2} \big|X[\lambda,\mu] - X_T[\lambda,\mu]\big|. \tag{5.34}
\]
This feature treats increases and decreases in energy equally and therefore cannot
distinguish between onsets and offsets. As we are only interested in onsets, the
rectified sum of absolute deviations from the target function can be used instead of
tcd [λ ]. This results in the rectified complex domain feature [3]
\[
t_{\mathrm{rcd}}[\lambda] = \sum_{\mu=0}^{M/2} t_{\mathrm{rcd,2}}[\lambda,\mu] \tag{5.35}
\]
with
\[
t_{\mathrm{rcd,2}}[\lambda,\mu] = \begin{cases} \big|X[\lambda,\mu] - X_T[\lambda,\mu]\big|, & \text{if } |X[\lambda,\mu]| \ge |X[\lambda-1,\mu]| \\ 0, & \text{otherwise.} \end{cases} \tag{5.36}
\]
The rectification ensures that only increases in spectral power which correspond to
note onsets are considered while note offsets are discarded. This feature is well suited
for detecting non-pitched percussive as well as pitched non-percussive onsets [1].
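The following Python sketch derives the rectified complex domain detection function of Equations (5.33)–(5.36) from a complex STFT; phase unwrapping along the frame axis is an implementation choice not prescribed by the text.

import numpy as np

def rectified_complex_domain(X):
    """Rectified complex domain onset detection function (Eqs. 5.33-5.36, sketch).

    X: complex STFT coefficients X[lam, mu], shape (L, M/2 + 1)."""
    mag = np.abs(X)
    phase = np.unwrap(np.angle(X), axis=0)
    L = X.shape[0]
    t_rcd = np.zeros(L)
    for lam in range(2, L):
        phi_prime = phase[lam - 1] - phase[lam - 2]          # Eq. (5.29) at frame lam-1
        target = mag[lam - 1] * np.exp(1j * (phase[lam - 1] + phi_prime))   # Eq. (5.33)
        dev = np.abs(X[lam] - target)
        increase = mag[lam] >= mag[lam - 1]                  # rectification, Eq. (5.36)
        t_rcd[lam] = np.sum(dev[increase])                   # Eq. (5.35)
    return t_rcd

# peaks of t_rcd indicate candidate note onsets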
The second feature, the average angle between the differences of successive
phase domain vectors, is defined as:
\[
t_{\mathrm{AVG\text{-}A}} = \frac{1}{L-3} \sum_{k=1}^{L-3} \alpha\big(\mathbf{p}_{k+1} - \mathbf{p}_k,\ \mathbf{p}_{k+2} - \mathbf{p}_{k+1}\big), \tag{5.39}
\]
where
\[
\alpha\big(\mathbf{p}_{k+1} - \mathbf{p}_k,\ \mathbf{p}_{k+2} - \mathbf{p}_{k+1}\big) = \frac{\big\langle \mathbf{p}_{k+1} - \mathbf{p}_k,\ \mathbf{p}_{k+2} - \mathbf{p}_{k+1} \big\rangle}{\big\| \mathbf{p}_{k+1} - \mathbf{p}_k \big\| \cdot \big\| \mathbf{p}_{k+2} - \mathbf{p}_{k+1} \big\|}. \tag{5.40}
\]
Note that in the general case, where an m-dimensional phase vector is used with
a spacing step d, “L − 2” should be replaced with “L − (m − 1)d − 1” in Equation
(5.38) and “L − 3” with “L − (m − 1)d − 2” in Equation (5.39).
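A Python sketch of the average-angle feature of Equations (5.39) and (5.40) is given below. The phase vectors are built by delay embedding of signal samples, as suggested by Example 5.5; this embedding stands in for the definition of the phase vectors and is therefore an assumption of the sketch.

import numpy as np

def average_angle_feature(x, m=2, d=1):
    """Average cosine between successive phase-domain difference vectors (Eqs. 5.39-5.40).

    The phase vectors p_k = (x[k], x[k+d], ..., x[k+(m-1)d]) are built by delay
    embedding, as suggested by Example 5.5 (this construction is an assumption)."""
    x = np.asarray(x)
    n_vec = len(x) - (m - 1) * d                # number of phase vectors
    P = np.stack([x[k : k + (m - 1) * d + 1 : d] for k in range(n_vec)])
    diffs = np.diff(P, axis=0)                  # p_{k+1} - p_k
    a, b = diffs[:-1], diffs[1:]
    cos_angles = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12)   # Eq. (5.40)
    return np.mean(cos_angles)                  # Eq. (5.39)

# a periodic signal yields a markedly different value than white noise
t = np.arange(0, 7.5, 0.5)                      # 15 samples of sin(x), as in Example 5.5
print(average_angle_feature(np.sin(t)))
print(average_angle_feature(np.random.uniform(-1, 1, 15)))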
Example 5.5 (Difference Vectors in Phase Domain). Figure 5.6 illustrates the dif-
ferences between successive phase vectors using d = 1 and m = 2. The horizon-
tal axis corresponds to the first dimension (pk+1,1 − pk,1 ), and the vertical to the
second one (pk+1,2 − pk,2 ). In Figure 5.6(a), 13 difference vectors are plotted for
15 original values of the function sin(x) evaluated at integer multiples of 0.5 ra-
dian, yielding (0, 0.4794, 0.8415, . . .). Hence, we obtain p 1 = (0, 0.4794)T , p 2 =
(0.4794, 0.8415)T , etc. The periodic structure of the original series is visualized
by an elliptic progression of the differences between phase vectors. Figure 5.6(b),
shows 13 differences for a vector with 15 uniformly drawn random numbers between
−1 and 1 (−0.7162, −0.1565, 0.8315, . . .). Figure 5.6(c) plots the difference vectors
for one second of a rock song (AC/DC, “Back in Black”), and (d) for a classical
piano piece (Chopin, “Mazurka in e-minor Op. 41 No. 2”). In both cases, the phase
transform was applied to discrete-time signals sampled at 44.1 kHz. We can observe
a significantly broader distribution for the rock song, similar to examples in [10].
On a linear scale the hearing threshold is expressed by L_hs^lin = 10^(L_hs/20).
where f_µ = µ f_s/M is the frequency corresponding to the µ-th STFT bin. Complying
with the concept of critical bands, we then sum up the spectral power within each
Bark band
\[
X_{\mathrm{Bark}}[\lambda,i] = \sum_{\mu=\mu_{l,i}}^{\mu_{u,i}} X_{\mathrm{comp}}[\lambda,\mu], \tag{5.43}
\]
where i denotes the Bark band index and µl,i and µu,i are the STFT bins correspond-
ing to the lower and upper frequency limits of the i-th Bark band as listed in Table
5.1. This results in a spectro-temporal representation with I = 24 Bark bands and L
frames. In order to account for spectral masking, a spreading function
\[
\frac{B(i)}{\mathrm{dB}} = 15.81 + 7.5\,(i + 0.474) - 17.5\,\big(1 + (i + 0.474)^2\big)^{1/2} \tag{5.44}
\]
proposed in [16] is convolved with the Bark spectrum, yielding the spread Bark spectrum X̂_Bark[λ, i].
To analyze the rhythmic periodicity, a DFT is computed across all L frames for each
Bark band, respectively, resulting in
\[
X_{\mathrm{emod}}[\nu,i] = \sum_{\lambda=0}^{L-1} \hat{X}_{\mathrm{Bark}}[\lambda,i]\, e^{-\mathrm{i}\frac{2\pi\lambda\nu}{L}}. \tag{5.46}
\]
Table 5.1: Lower Frequencies fl,i and Upper Frequencies fu,i of i-th Bark Band
i 1 2 3 4 5 6 7 8
fl,i /Hz 0 100 200 300 400 510 630 770
fu,i /Hz 100 200 300 400 510 630 770 920
i 9 10 11 12 13 14 15 16
fl,i /Hz 920 1080 1270 1480 1720 2000 2320 2700
fu,i /Hz 1080 1270 1480 1720 2000 2320 2700 3150
i 17 18 19 20 21 22 23 24
fl,i /Hz 3150 3700 4400 5300 6400 7700 9500 12000
fu,i /Hz 3700 4400 5300 6400 7700 9500 12000 15500
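A Python sketch of the Bark band summation of Equation (5.43) and the modulation DFT of Equation (5.46) is shown below; it groups plain STFT power values with the band limits of Table 5.1 and omits the hearing-threshold compression and the spectral spreading of Equations (5.44)–(5.45).

import numpy as np

BARK_EDGES = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
              1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700,
              9500, 12000, 15500]                       # band limits from Table 5.1

def bark_modulation_spectrum(power_spec, fs, M):
    """Bark band energies (Eq. 5.43) and their modulation spectrum (Eq. 5.46) (sketch).

    power_spec: STFT power values per frame and bin, shape (L, M/2 + 1).
    The compression and spreading steps of Eqs. (5.44)-(5.45) are omitted here."""
    freqs = np.arange(power_spec.shape[1]) * fs / M     # f_mu = mu * fs / M
    bands = []
    for i in range(24):
        idx = (freqs >= BARK_EDGES[i]) & (freqs < BARK_EDGES[i + 1])
        bands.append(np.sum(power_spec[:, idx], axis=1))   # Eq. (5.43)
    x_bark = np.stack(bands, axis=1)                    # shape (L, 24)
    x_emod = np.fft.fft(x_bark, axis=0)                 # DFT across frames, Eq. (5.46)
    return x_bark, x_emod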
Bibliography
[1] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler.
A tutorial on onset detection in music signals. IEEE Trans. Speech and Audio
Processing, 13(5):1035–1047, 2005.
[2] S. B. Davis and P. Mermelstein. Comparison of parametric representations for
monosyllabic word recognition in continuously spoken sentences. IEEE Trans.
Acoustics, Speech, and Signal Processing, 28(4):357–366, August 1980.
162
5.5. Further Reading 163
[3] S. Dixon. Onset detection revisited. In Proc. Intern. Conf. Digital Audio Effects
(DAFx), pp. 133–137. McGill University Montreal, 2006.
[4] H.-G. Kim, N. Moreau, and T. Sikora. MPEG-7 Audio and Beyond: Audio
Content Indexing and Retrieval. John Wiley & Sons, 2006.
[5] O. Lartillot and P. Toiviainen. A MATLAB toolbox for musical feature ex-
traction from audio. In Proc. Intern. Conf. Digital Audio Effects (DAFx), pp.
237–244. Université Bordeaux, 2007.
[6] T. Li, M. Ogihara, and G. Tzanetakis, eds. Music Data Mining. CRC Press,
2011.
[7] B. Logan. Mel frequency cepstral coefficients for music modeling. In Proc.
Intern. Soc. Music Information Retrieval Conf. (ISMIR), 2000.
[8] M. F. McKinney and J. Breebaart. Features for audio and music classification.
In Proc. Intern. Soc. Music Information Retrieval Conf. (ISMIR), 2003.
[9] A. Meng, P. Ahrendt, J. Larsen, and L. K. Hansen. Temporal feature integra-
tion for music genre classification. IEEE Trans. Audio, Speech, and Language
Processing, 15(5):1654–1664, July 2007.
[10] I. Mierswa and K. Morik. Automatic feature extraction for classifying audio
data. Machine Learning Journal, 58(2-3):127–149, 2005.
[11] M. Müller. Information Retrieval for Music and Motion. Springer, 2007.
[12] M. Müller and S. Ewert. Towards timbre-invariant audio features for harmony-
based music. IEEE Transactions on Audio, Speech, and Language Processing,
18(3):649–662, 2010.
[13] M. Müller and S. Ewert. Chroma toolbox: MATLAB implementations for
extracting variants of chroma-based audio features. In Proc. Intern. Soc. Music
Information Retrieval Conf. (ISMIR), 2011.
[14] A. Nagathil, P. Göttel, and R. Martin. Hierarchical audio classification using
cepstral modulation ratio regressions based on Legendre polynomials. In Proc.
IEEE Intern. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp.
2216–2219. IEEE Press, 2011.
[15] E. Pampalk, A. Rauber, and D. Merkl. Content-based organization and visu-
alization of music archives. In Proc. ACM Intern. Conf. on Multimedia, pp.
570–579. ACM Press, 2002.
[16] M. Schroeder, B. Atal, and J. Hill. Optimizing digital speech coders by ex-
ploiting masking properties of the human ear. J. Acoust. Soc. Am. (JASA),
66(6):1647–1652, 1979.
[17] M. Slaney. Auditory toolbox: A MATLAB toolbox for auditory modeling.
Technical Report 45, Apple Computer, 1994.
[18] S. Stevens, J. Volkmann, and E. B. Newman. A Scale for the Measurement
of the Psychological Magnitude Pitch. J. Acoust. Soc. Am. (JASA), 8:185–190,
January 1937.
[19] E. Terhardt. Calculating virtual pitch. Hearing Research, 1(2):155–182, 1979.
Chapter 6
Auditory Models
6.1 Introduction
Auditory models are computer-based simulation models which emulate the human
auditory process by mathematical formulas which transform acoustic signals into
neural activity. Analyzing this activity instead of just the original signal might im-
prove several tasks of music data analysis, especially tasks where human perception
still outperforms computer-based estimations, e.g., several transcription tasks like
onset detection and pitch estimation. For applying auditory models as a front-end for
such tasks, some basic knowledge about the different stages of the auditory process
is advisable in order to decide which ones are most important and which stages can be ignored. Naturally, this depends on the actual application, and simplifying the auditory model accordingly may also help to reduce the computation times.
The hearing process of humans consists of several stages located in the ear and
the brain. In recent decades several computational models have been developed
which can help to prove assumptions about the different stages of the hearing process
by comparing psycho-physical experiments and animal observations to simulation
outputs on the same acoustical data. As our understanding of these processes improves and the results of the models become more realistic, their application to automatic speech recognition and to several tasks in music research increases.
While some later stages of the hearing process, taking place in several parts of the brain, are more difficult to observe, the beginning of the process, located directly in the ear, has been investigated far more thoroughly. This stage is called the auditory periphery; it covers the transformation from acoustical pressure waves in the air to release events of the auditory nerve fibers, as introduced in Section 6.2. Though several models of
the auditory periphery have been proposed, in Section 6.3 we describe the popular
Meddis model ([17]). In Section 6.4 technical models for pitch detection are com-
pared to auditory models of the midbrain which try to simulate the pitch extraction
process of humans. In Section 6.5 we give an overview of further reading. Addition-
ally, music classification systems are described which process the output of auditory
models. While the applications are explained in detail in other chapters, here, the
main focus is a brief overview of the feature generating process.
ral activity occurs phase-locked with the stimulus. Phase-locking means that neural
activity is periodic with peaks and tails where the period corresponds to periodici-
ties of the stimulus. In later stages in the brain, this effect can be utilized to encode
the frequency content of a stimulus by additionally analyzing the temporal structure
besides the neural intensity of different fibers. The human auditory system consists
of roughly 30,000 auditory nerve fibers, each responsible for the recognition of an
individual but overlapping frequency range. In auditory models this is simplified and
simulated by a much smaller quantity of filters.
From a signal processing view, a simplified model of the auditory periphery can
be seen as a bank of linear filters (see Chapter 4) with a succeeding half-wave rectifier
(only positive values pass). In this context linear means that a higher stimulus results
in higher output levels of all simulated auditory nerve fibers independent of addi-
tional simultaneous noise. However, modern models of the auditory periphery are
far more complex, modeling nonlinear and asynchrony properties which can more
precisely explain psychoacoustic phenomena. One example is two-tone suppression:
From two simultaneous tones, the louder one can mask the softer one even if they
consist of entirely different frequencies. This means a drastic reduction of auditory
nerve activity corresponding to the softer tone in contrast to the case where this tone
is presented alone. This masking phenomenon violates the assumption of a linear
model.
events (spike emissions) by applying a stochastic process. Similar to the human sys-
tem, each channel has an individual best frequency that defines which frequencies
are stimulated the most. The best frequency is between 100 Hz for the first and 6000
Hz for the 41st channel.
Figure 6.3 shows an example where the acoustic stimulus is a harmonic tone
which can be seen in the first plot. In the lowest plot of Figure 6.3 the probabilistic
output of the model is represented. While the 41 channels are located on the ver-
tical axis and the time response on the horizontal axis, the gray-scale indicates the
probability of neural activity (firing rates), where white means high probability. In
the following subsections the several intermediate steps from the input stimulus to
the neuronal output are described in detail. Where needed, additional details about
human perception are explained.
Figure 6.3: Exemplary output of the Meddis auditory model.
filled again. Transmitter release is only indirectly controlled by the electrical voltage
V (t). Actually, it controls the calcium stream into the cell and the calcium promotes
the release of transmitter. Other auditory models make the simplifying assumption
that the content of the transmitter reservoir into the synaptic cleft is proportional to
the stimulus intensity. In the Meddis models, transmitter release is modeled more
realistically using a cascade of transmitter reservoirs.
where ∆t is the sampling interval and τ(l) is a time constant (10 ms) which defines
the length of the time period over which regularities are assessed.
The running SACF is defined by
\[
s(t,l) = \frac{1}{N} \sum_{k=1}^{N} h(t,l,k), \tag{6.2}
\]
where N is the number of channels used and h(t, l, k) is the ACF at time t and lag
l in channel k. The peaks of the SACF are indicators for the perceived pitch and a
natural variant of monophonic pitch detection simply identifies the maximum peak
for every time point t. The lag l achieving the maximum peak at a given time point t
corresponds to the dominant periodicity, and hence its reciprocal is the estimated fun-
damental frequency. The model is successfully compared to several psychophysical
phenomena like pitch detection with missing fundamental frequency.
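As an illustration, the Python sketch below forms a summary autocorrelation in the spirit of Equation (6.2) and picks the maximum peak as a pitch estimate; a plain frame autocorrelation replaces the exponentially decaying running ACF h(t, l, k) of the model, so this is a simplification rather than the model itself.

import numpy as np

def sacf_pitch(channels, fs, lag_min=32, lag_max=400):
    """Summary autocorrelation (in the spirit of Eq. 6.2) and a naive pitch estimate.

    channels: simulated firing probabilities per channel, shape (N, samples)."""
    N, n = channels.shape
    acfs = np.zeros((N, lag_max + 1))
    for k in range(N):
        c = channels[k] - np.mean(channels[k])
        full = np.correlate(c, c, mode='full')[n - 1:]      # lags 0..n-1
        acfs[k] = full[:lag_max + 1]
    sacf = np.mean(acfs, axis=0)                            # Eq. (6.2): average over channels
    lag0 = lag_min + np.argmax(sacf[lag_min:lag_max + 1])   # maximum peak
    return fs / lag0                                        # reciprocal lag = estimated f0

# usage (illustrative): f0 = sacf_pitch(model_output, fs=16000)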
After some recent psychophysical studies which indicate that the autocorrelation
approach is inconsistent with human perception for some special stimuli, in [3] the
autocorrelation approach is improved by a low-pass filter, resulting in the new func-
tion LP-SACF. Additionally, the time constant τ(l) is linked to the given lag and is
set to 2l since it is assumed that for some specific stimuli the pitch characteristic can
only be assessed over a longer period. The low-pass filter of LP-SACF is recursively
defined as an exponentially decaying average,
\[
P(t,l) = s(t,l) + P(t - \Delta t, l) \cdot e^{-\Delta t / \lambda}, \tag{6.3}
\]
where λ is the time constant (120 ms) of the filter. Results indicate that the modified
LP-SACF can overcome the limitations of the SACF.
cess is iterated until a specified number of pitches is achieved or until the maximum
strength is not above a specified threshold.
In [12], the IPEM toolbox is proposed, a MATLAB® toolbox which constructs
perception-based features for different applications. However, it relies on a some-
what outdated auditory model. Features based on an auditory model have been ap-
plied for several tasks of music classification which are separately described in the
following paragraphs.
Onset Detection (see also Chapter 16) In [4], onset detection is applied to the
output of the Meddis model of the auditory periphery. The output of 40 simulated
nerve fibers are first transformed into 40 individual onset detection functions and
afterwards combined by using a specific quantile. Certainly, this is a simple approach
leaving space for improvements. However, the results show that even by using this
method the auditory model approach performs as well as the original approach on
the acoustic waveform.
Transcription and Melody Detection (see also Chapter 17) Mathematically, a melo-
dy is a sequence of notes where each note has a pitch, an onset, and an offset. In [26],
the detection of the singing melody is performed using features based on an auditory
model with a consecutive autocorrelation analysis. Here, again, the problem arises
of how to choose the correct peak of the autocorrelation function. The naive variant
consists of picking the maximum peak for each frame and merging successive frames
with the same pitch to one tone. Better approaches additionally take the temporal de-
velopment into account. For each frame the pitch candidates and their strengths with
respect to the autocorrelation function are taken as features. As a second feature, for
each pitch candidate the ratio of its strength to its strength in the predecessor frame is
considered enabling the separation of consecutive tones with the same pitch. In [16],
the ratio of each peak strength to the neighborhood average strength is computed as
an additional feature. Furthermore, the zero-lag correlation of each channel is calcu-
lated to obtain a running estimate of the energy in each channel where local maxima
might indicate onsets. Other examples for melody detection using an auditory model
are described in [6], [15], and [23].
Instrument Recognition (see also Chapter 18) In [30], instrument recognition is
applied to features based on the output of the Meddis model. Under most circum-
stances these features lead to better results than the features based on the original
waveform. Other promising approaches using an auditory model as the front end for
instrument recognition are described in [22], and [29].
Genre Classification Approaches for genre classification with an auditory model as
the front end are described in [14] and [24].
Bibliography
[1] A. Bahmer and G. Langner. Oscillating neurons in the cochlear nucleus: I. Ex-
perimental basis of a simulation paradigm. Biological Cybernetics, 95(4):371–
379, 2006.
[2] A. Bahmer and G. Langner. Oscillating neurons in the cochlear nucleus: II.
Simulation results. Biological Cybernetics, 95(4):381–392, 2006.
[3] E. Balaguer-Ballester, S. L. Denham, and R. Meddis. A cascade autocorrelation
model of pitch perception. The Journal of the Acoustical Society of America,
124(4):2186–2195, 2008.
[4] N. Bauer, K. Friedrichs, D. Kirchhoff, J. Schiffner, and C. Weihs. Tone onset
detection using an auditory model. In M. Spiliopoulou, L. Schmidt-Thieme,
and R. Janning, eds., Data Analysis, Machine Learning and Knowledge Dis-
covery, pp. 315–324. Springer International Publishing, 2014.
[5] M. Borst, G. Langner, and G. Palm. A biologically motivated neural network
for phase extraction from complex sounds. Biological Cybernetics, 90(2):98–
104, 2004.
[6] L. P. Clarisse, J.-P. Martens, M. Lesaffre, B. De Baets, H. De Meyer, and M. Le-
man. An auditory model based transcriber of singing sequences. In Proceedings
of the 3rd International Conference on Music Information Retrieval (ISMIR),
pp. 116–123. IRCAM, 2002.
[7] M. Ebeling. Neuronal periodicity detection as a basis for the perception of con-
sonance: A mathematical model of tonal fusion. The Journal of the Acoustical
Society of America, 124(4):2320–2329, 2008.
[8] M. Karjalainen and T. Tolonen. Multi-pitch and periodicity analysis model
for sound separation and auditory scene analysis. In IEEE International Con-
ference on Acoustics, Speech, and Signal Processing, volume 2, pp. 929–932.
IEEE, 1999.
[9] A. Klapuri. Multipitch analysis of polyphonic music and speech signals us-
ing an auditory model. IEEE Transactions on Audio, Speech, and Language
Processing, 16(2):255–266, 2008.
[10] G. Langner. Neuronal mechanisms for pitch analysis in the time domain. Ex-
perimental Brain Research, 44(4):450–454, 1981.
[11] G. Langner and C. Benson. The Neural Code of Pitch and Harmony. Cam-
bridge University Press, 2015.
[12] M. Leman, M. Lesaffre, and K. Tanghe. Introduction to the IPEM toolbox for
perception-based music analysis. Mikropolyphonie – The Online Contemporary
Music Journal, 7, 2001.
[13] J. C. R. Licklider. A duplex theory of pitch perception. Experientia, 7, pp.
128–134, 1951.
[14] S. Lippens, J.-P. Martens, and T. De Mulder. A comparison of human and
automatic musical genre classification. In IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP’04), volume 4, pp. 233–
236. IEEE, 2004.
[15] M. Marolt. A connectionist approach to automatic transcription of polyphonic
piano music. IEEE Transactions on Multimedia, 6(3):439–449, 2004.
Chapter 7
Digital Representation of Music
GÜNTER RUDOLPH
Department of Computer Science, TU Dortmund, Germany
7.1 Introduction
The computer-aided generation, manipulation, and analysis of music requires a dig-
ital representation of music. Historically, music could be passed on only by singing
or playing instruments from memory. This form of propagation inevitably leads to
alterations in the original melodies over the years. The development of musical nota-
tions finally provided a method to record music in written form not only preventing
uncontrolled changes in existing music but also enabling the reliable storage, repro-
duction, and dissemination of music over time and space.
Graphical notations were already used for Gregorian chants (about 800 AD) in the form of graphical elements called neumes, which represented the melodic shape, initially without staff lines and later in a four-line staff notation. A five-line staff notation was created by Guido von Arezzo (about 1000 AD), which developed further into a standard notation basically valid in Western music since 1600 AD [3]. With the advent of
electronic music it became necessary to develop new graphical notations; many of
them may be considered as art work themselves [11]. In this chapter, however, we
stick to the standard graphical notation of Western music as introduced in Chapter 3.
The processing of music with a computer requires mappings from the analog
to the digital world and vice versa at different levels. Figure 7.1 sketches some
conversion paths between different notational representations and file formats. Writ-
ten sheet music may be mapped to a standardized file format by an appropriately
educated person or a computer system that scans the sheet music from the paper,
recognizes the music notation, and maps the information to a digital representation
(OMR: optical music recognition) which can be stored in and re-read from files with
some specified file format. Examples of such file formats (e.g. MIDI or abc) and
the principles of OMR are presented in Section 7.2. Another path from the analog
to the digital world is the sounding music performed by artists where the analog
signals are recorded, mapped to digital signals by A/D-converters (cf. Chapter 4.2),
and stored digitally in a format called PCM encoding. This field of business is discussed in Section 7.3.
Figure 7.1: Conversion paths between musical notations and file formats.
The transcription of digital audio data to digital sheet music
(cf. Chapter 17) is a complex task which is hardly possible without errors and loss in
precision. Nevertheless, there exist (mainly proprietary) software systems like widi
[15] or melodyne [8] that claim to accomplish such a conversion. A compilation
of those systems is given in Chapter 17 whereas Section 7.4 presents some tools for
rendering digital sheet music to written sheet music. Finally, Section 7.5 is devoted
to transforming digital music representation into analog sound. Digital sheet music
can be translated into digital audio data with the help of sound libraries and synthe-
sizers, whereas music available as digital audio signals can be rendered to sound by
converting them to analog signals using D/A-converters (cf. Chapter 4.2).
7.2 From Sheet to File

In the abc notation, a tune is stored in a plain text file that consists of a header and a body. The semantics of each header entry is indicated by a capital letter at the begin-
ning of the line followed by a colon. For example, the letter T indicates that the title
of the tune is written after the colon, M specifies the meter by two integers separated
by a slash, and L the default note length where 1/2 denotes a half, 1/4 a quarter and
1/8 a eighth note (and so forth).
Typically, the default note length is the note length appearing most frequently in
the tune. For longer notes one has to put an integer factor after the note (e.g., C2 for
double default length of note C), for shorter notes one puts a slash followed by the
integer divisor after the note (e.g., C/2 for half the length). C/3 would be used for
triplets.
Rests are represented by the lower-case letter z, and its length is defined by the
default note length. Analogous to the notes, longer and shorter rests are expressed
by a subsequent integer factor or a subsequent slash and an integer divisor.
The note pitches are specified by letters. The notation differs from usual octave
designation systems (ODS) as summarized in Table 7.1. Accidentals are indicated
by the symbols _ and ^. For example, _G means “G flat” and ^G means “G sharp”. A
tie is indicated by a hyphen after the first note.
Table 7.1: Notational Differences between the Helmholtz and abc Octave Designa-
tion Systems
Figure 7.2: Conversion of abc file of Example 7.1 with the tool abcm2ps.
The optional mark C: opens the field for the composer(s) and the entry beginning
with Q: specifies how many notes with the default length are to be played per minute.
Note lengths deviating from the default length may be used in the specification. For
example, the expression Q:1/4=36 means that 36 quarter notes should be played per
minute. Finally, the entry K: declares the key by using, for example, a “G” for G
major and “Gm” for g minor. The body of the file follows directly after the header
without any further marker. Bar lines are indicated by a vertical line.
Example 7.1 (Smoke on the Water).
A version of the first four bars of the intro of Deep Purple’s Smoke on the Water from
1972 written in abc notation might look as follows:
X:1
T:Smoke on the Water
M:4/4
L:1/8
C:Ritchie Blackmore, Ian Gillan, Roger Glover, Jon Lord, Ian Paice
Q:1/4=120
K:Bb
G2 B2 c3 G | z B z _d c4 | G2 B2 c3 B | z G-G6 ||
The entry X:1 indicates that the following data describe the first tune in the file. The
title (T) of the tune is Smoke on the Water. The tune in four-four time (M) is specified
with the eighth note as the default note length (L), composers are listed after the C,
the speed (Q) is set to 120 quarter notes per minute (moderate rock tempo), and the
key is B-flat major (K). The first four bars of the tune follow directly after the header.
The score music shown in Figure 7.2 was generated with the tool abcm2ps directly
from the abc listing of this example.
A standard MIDI file typically has the extension .mid or .smf. It consists of a
header and a body, which in turn may consist of several tracks. The header always
has a length of 14 bytes whose structure is displayed in Table 7.2. The first four bytes
indicate that this file is actually an SMF. The next four bytes contain the length of the
header without the first 8 bytes; therefore the value is always 6. Three different track
formats are possible: single track (0), multiple track (1), or multiple song (2). The
code ∈ {0, 1, 2} of the track format is stored in two bytes starting at offset 8. In the
multiple track format, individual parts are saved on different tracks in the sequence
whereas everything is merged onto a single track in single track format. The rarely
used multiple song format stands for a series of tracks of type 0. The next two bytes
contain the number of tracks. In the case of the single track format, the value is
always 1. The last two bytes specify the meaning of the time stamps. If bit 15 is
zero, then bits 0 to 14 encode the number of ticks per quarter note. If bit 15 is one,
bits 14 through 8 correspond to an LTC time code [9] whereas bits 0 to 7 encode the
resolution within a frame.
Table 7.2: Structure of the Header of a Standard MIDI File
A track consists of a track header and a sequence of track events (= the track
body). The track header always has a size of 8 bytes. The first four bytes indicate the
beginning of a track. The second four bytes contain the number of bytes in the track
body. Table 7.3 provides a summary.
A track event within the track body consists of delta time (i.e., the time elapsed
since the previous event) and either a MIDI event or a meta event or a system ex-
clusive (sysex) event. If two events occur simultaneously, the delta time must be
zero.
The encoding of numbers like the delta time deserves special consideration. Ac-
tually, it is a variable-length encoding that can lead to some data compression. Only
the lowest 7 bits of a byte are used to encode a number. The 8th bit serves as the
variable length encoding: If some number requires n bits in standard binary encoding, then we need ⌈n/7⌉ bytes to store the bit pattern. The 8th bit of each of these bytes is set to 1, except for the least significant byte, whose 8th bit is set to 0.
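The following Python sketch encodes and decodes such variable-length quantities; the delta time 479 used later in Example 7.2 serves as a test value.

def encode_vlq(value):
    """Encode a non-negative integer as a MIDI variable-length quantity."""
    groups = [value & 0x7F]                    # least significant 7 bits
    value >>= 7
    while value > 0:
        groups.append((value & 0x7F) | 0x80)   # set bit 8 on all but the last byte
        value >>= 7
    return bytes(reversed(groups))

def decode_vlq(data):
    """Decode a variable-length quantity; returns (value, number of bytes consumed)."""
    value, i = 0, 0
    while True:
        byte = data[i]
        value = (value << 7) | (byte & 0x7F)
        i += 1
        if byte & 0x80 == 0:                   # cleared bit 8 marks the last byte
            return value, i

print(encode_vlq(479).hex())                   # '835f', as in the delta times of Example 7.2
print(decode_vlq(bytes([0x83, 0x5F])))         # (479, 2)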
A MIDI event is any type of MIDI channel event preceded by a delta time. A
channel event consists of a status byte and one or two data bytes. The status byte
specifies the function. Since standard MIDI has 16 channels, each particular function
has 16 different status bytes: for example, the function note on exists with values 90
to 9F. Table 7.4 shows only functions for channel 0. Bit 8 is set for status bytes and
cleared for data bytes. The note number 0 stands for note C at 8.176 Hz, number
1 for C# at 8.662 Hz, number 2 for D at 9.177 Hz up to note number 127 for G at
12,543.854 Hz. The velocity typically means the volume of a note (higher velocity
= louder), but in case of note-off events it can also describe how quickly or slowly a
note should end.
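These frequencies follow from the equal-tempered mapping that anchors note number 69 at 440 Hz; the short Python sketch below reproduces the values quoted above. The anchoring at note 69 is standard MIDI practice and is stated here as an assumption, since the text only lists example values.

def midi_note_to_frequency(note, a4=440.0):
    """Equal-tempered frequency of a MIDI note number (note 69 = 440 Hz)."""
    return a4 * 2.0 ** ((note - 69) / 12.0)

for n in (0, 1, 2, 127):
    print(n, round(midi_note_to_frequency(n), 3))
# 0 8.176, 1 8.662, 2 9.177, 127 12543.854 -- matching the values in the text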
A meta event is tagged by the initial byte with value FF. The next byte specifies
the type of meta event before the number of bytes of the subsequent metadata is given
in variable-length encoding. It is not required for every program to support each meta
event.
Table 7.5: Code and Meaning of Some Typical MIDI Meta Events
The data of a time signature event consists of 4 bytes. The first byte is the numerator, the second byte is the denominator given as the exponent of a power of two, the third byte expresses the number of MIDI ticks in a metronome click, and the fourth byte contains the number of 32nd notes in 24 MIDI ticks.
The data of a key signature event consists of 2 bytes. The first byte specifies the
number of flats or sharps from the base key C and the second byte indicates major
(0) or minor (1) key. Flats are represented by negative numbers, sharps by positive
numbers. For example, C minor would be expressed as FD 01, E major as 04 00, A
flat major as FC 00.
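A small sketch of how these two data bytes could be translated into a key name is given below. The key tables simply walk along the circle of fifths; they are an assumption for illustration, not taken from the SMF specification.

def decode_key_signature(sf, mi):
    # sf: number of sharps (>0) or flats (<0); mi: 0 = major, 1 = minor.
    if sf > 127:               # bytes are unsigned in the file; interpret as two's complement
        sf -= 256
    circle_major = ["Cb", "Gb", "Db", "Ab", "Eb", "Bb", "F",
                    "C", "G", "D", "A", "E", "B", "F#", "C#"]
    circle_minor = ["Ab", "Eb", "Bb", "F", "C", "G", "D",
                    "A", "E", "B", "F#", "C#", "G#", "D#", "A#"]
    name = (circle_minor if mi else circle_major)[sf + 7]
    return name + (" minor" if mi else " major")

print(decode_key_signature(0xFD, 0x01))  # C minor (3 flats)
print(decode_key_signature(0x04, 0x00))  # E major (4 sharps)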
A sysex event is tagged by the initial byte with value F0 or F7 before the length of
the subsequent data is given in variable length encoding. The end of the data should
be tagged with an F7 byte. Since these events are tailored to specific MIDI hardware
or software, a more detailed description is omitted here.
Example 7.2 (Smoke on the Water).
SMF files are stored in binary format. Here an annotated hexadecimal dump of the
binary MIDI file is given for the same piece of music considered in Example 7.1. The
abbreviation dt in the annotation stands for “delta time”.
4D 54 68 64       "MThd"
00 00 00 06       length of header - 8
00 00             track format 0
00 01             # tracks = 1
01 E0             480 ticks per quarter note
01 90 43 69       dt=1, note 43 = G3 on, volume 69
83 5F 80 43 00    dt=479, note 43 off
01 90 46 50       dt=1, note 46 = A#3 on, volume 50
83 5F 80 46 00    dt=479, note 46 off
01 90 48 5F       dt=1, note 48 = C4 on, volume 5F
85 4F 80 48 00    dt=719, note 48 off
01 90 43 50       dt=1, note 43 = G3 on, volume 50
81 6F 80 43 00    dt=239, note 43 off
81 71 90 46 50    dt=241, note 46 = A#3 on, volume 50
81 6F 80 46 00    dt=239, note 46 off
81 71 90 49 50    dt=241, note 49 = C#4 on, volume 50
81 6F 80 49 00    dt=239, note 49 off
01 90 48 5F       dt=1, note 48 = C4 on, volume 5F
87 3F 80 48 00    dt=959, note 48 off
01 90 43 69       dt=1, note 43 = G3 on, volume 69
83 5F 80 43 00    dt=479, note 43 off
01 90 46 50       dt=1, note 46 = A#3 on, volume 50
83 5F 80 46 00    dt=479, note 46 off
01 90 48 5F       dt=1, note 48 = C4 on, volume 5F
85 4F 80 48 00    dt=719, note 48 off
01 90 46 50       dt=1, note 46 = A#3 on, volume 50
81 6F 80 46 00    dt=239, note 46 off
81 71 90 43 50    dt=241, note 43 = G3 on, volume 50
8D 0F 80 43 00    dt=1679, note 43 off
1A FF 2F 00       dt=26, meta: end of track
Typically, SMF files are generated by playing a MIDI instrument or using spe-
cific software that records the tune in its own format. Finally, the specific format is
converted to SMF by some software tool. In Example 7.2 the tool ABC Converter
was applied to convert the abc file to binary SMF. The binary file was then converted into the hexadecimal representation with a hex editor.
  <pitch><step>B</step><alter>-1</alter><octave>4</octave></pitch>
  <duration>120</duration><voice>1</voice><type>quarter</type>
</note>
<note>
  <pitch><step>C</step><octave>5</octave></pitch>
  <duration>180</duration><voice>1</voice><type>quarter</type><dot/>
</note>
<note>
  <pitch><step>B</step><alter>-1</alter><octave>4</octave></pitch>
  <duration>60</duration><voice>1</voice><type>eighth</type>
</note>
</measure>
<measure number="4">
  <note>
    <rest/><duration>60</duration><voice>1</voice><type>eighth</type>
  </note>
  <note>
    <pitch><step>G</step><octave>4</octave></pitch>
    <duration>60</duration><voice>1</voice><type>eighth</type>
    <notations><tied type="start"/></notations>
  </note>
  <note>
    <pitch><step>G</step><octave>4</octave></pitch>
    <duration>360</duration><voice>1</voice><type>half</type><dot/>
    <notations><tied type="stop"/></notations>
  </note>
  <barline location="right">
    <bar-style>light-light</bar-style>
  </barline>
</measure>
</part>
</score-partwise>
in the range t ∈ [0, 1.6) (see Figure 7.3). If we sample the signal every dt = 0.05 time units, we obtain 1.6/dt = 32 values a_i = a(t_i) with t_i = 0.05 · i for i = 0, 1, . . . , 31.
Using n = 4 bits to encode each sample value leads to a division of the range of possible values into L = 2⁴ = 16 intervals of equal width. The range is defined symmetrically to zero as [−v, v] with v = max{|a_i| : i = 0, . . . , 31} = 5.73297. Table 7.6 contains the 16 interval centers of the quantization levels in its first column. The second and third columns show the assignment of binary code words to each quantization level using two's complement and offset binary encoding.
Using offset binary encoding and a hexadecimal representation of the 32 code
words of our exemplary sequence leads to the raw audio data
7 a d f f f e b 9 6 4 3 3 3 5 6 8 9 9 9 8 7 6 5 5 6 7 9 a b c b
that can be stored to a file. Figure 7.3 illustrates the conversion from the analog
waveform to raw digital audio data.
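The quantization step can be mimicked with a few lines of Python. This is a sketch under the assumptions of the example (symmetric range, offset binary code words); the sample values passed in below are hypothetical placeholders, not the signal of Figure 7.3.

def quantize_offset_binary(samples, n_bits=4):
    # Quantize samples into 2**n_bits equally wide intervals over [-v, v]
    # (v = largest absolute sample value) and return offset binary code words.
    levels = 2 ** n_bits
    v = max(abs(a) for a in samples)
    width = 2 * v / levels
    codes = []
    for a in samples:
        k = int((a + v) / width)   # interval index 0 .. levels-1
        k = min(k, levels - 1)     # the maximum value falls into the top interval
        codes.append(k)
    return codes

# hypothetical samples, just for illustration; codes are printed as hexadecimal digits
print("".join(format(c, "x") for c in quantize_offset_binary([0.3, 1.9, 3.2, 5.0, -5.0])))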
Table 7.6: Quantization Levels and Possible Code Tables for a 4-Bit LPCM
interval center signed binary code word offset binary code word
−5.37466 1000 0000
−4.65804 1001 0001
−3.94142 1010 0010
−3.22480 1011 0011
−2.50818 1100 0100
−1.79155 1101 0101
−1.07493 1110 0110
−0.35831 1111 0111
0.35831 0000 1000
1.07493 0001 1001
1.79155 0010 1010
2.50818 0011 1011
3.22480 0100 1100
3.94142 0101 1101
4.65804 0110 1110
5.37466 0111 1111
Figure 7.3: Exemplary conversion of an analog waveform to raw digital audio format. Black dots represent the sampled values a_i at steps t_i, whereas the dashed horizontal lines indicate the borders of the quantization intervals.
As evident from the preceding example, raw audio data can be interpreted correctly only if additional information like the number of bits per sample or the type of binary encoding is supplied. In the case of a closed system, this information might be specified implicitly, but if the audio file is to be transferred to arbitrary destinations or stored in databases and later retrieved by arbitrary customers, the information must be provided in additional files. Alternatively, the audio file format can be modified to store the format specification and the data in a single file (e.g., the WAVE file format).
Some entries need elucidation: At offset 20, the audio format stored in the body
can be specified by different identifiers (IDs). The ID 0x0001 stands for LPCM,
the ID 0x0055 for MP3, and many other formats are possible. This shows that the WAVE format can be considered a container format: it encapsulates different audio formats and merely serves to transport the audio data. In practice, the LPCM raw audio format is used most frequently. In this book we follow this understanding and use the terms WAVE format and LPCM format interchangeably.
At offset 32, the number of bytes needed to store a single sample is required; samples are stored as whole bytes, i.e. in multiples of 8 bits. In the case of the LPCM raw audio format, this value can be calculated from the data provided in the header: add 7 to the number of bits per sample (at offset 34), divide by 8, take the integer part, and multiply the result by the number of channels (at offset 22).
At offset 34, the number of bits per sample is also an indicator of the type of
binary encoding of LPCM audio data (starting at offset 44). Offset binary encoding
is used up to 8 bits per sample, otherwise two’s complement.
Example 7.5 (WAVE format). Suppose we want to store the LPCM audio data of
Example 7.4 in a file with WAVE format. Since the WAVE format insists on multiples of 8 bits per audio sample, we must pad our 4-bit code words with 4 leading zeros.
Therefore, the length of the audio data block (offset 40) is 32 bytes and the total size
of the file sums up to 76 bytes, which in turn leads to the value 68 at offset 4. Since
we store a mono recording (single channel), the value 1 is set at offset 22. Although
we have 4 bits per sample (offset 34), we need 8 bits of memory for storing the data
(offset 32). If we assume that a time unit represents 1 millisecond, we have drawn
a sample every 50 microseconds or 20,000 samples per second (offset 24). Since each sample is stored with 8 bits, we have 20,000 bytes per second (offset 28). Now
we are in the position to compile the WAVE file where each byte is represented in
hexadecimal form:
52 49 46 46 44 00 00 00 57 41 56 45 66 6d 74 20
10 00 00 00 01 00 01 00 20 4e 00 00 20 4e 00 00
01 00 04 00 64 61 74 61 20 00 00 00 07 0a 0d 0f
0f 0f 0e 0b 09 06 04 03 03 03 05 06 08 09 09 09
08 07 06 05 05 06 07 09 0a 0b 0c 0b
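For illustration, the following Python sketch writes such a minimal LPCM WAVE file. The function name and its simplifications (mono by default, no extension chunks) are assumptions of the sketch, not taken from the book or the WAVE specification text.

import struct

def write_wave(path, samples, sample_rate, bits_per_sample=8, channels=1):
    # Write raw LPCM samples (already packed as one byte each) into a minimal WAVE file.
    bytes_per_sample = (bits_per_sample + 7) // 8 * channels   # block align (offset 32)
    byte_rate = sample_rate * bytes_per_sample                 # offset 28
    data = bytes(samples)
    with open(path, "wb") as f:
        f.write(b"RIFF")
        f.write(struct.pack("<I", 36 + len(data)))             # total file size - 8 (offset 4)
        f.write(b"WAVE")
        f.write(b"fmt ")
        f.write(struct.pack("<IHHIIHH", 16, 1, channels,       # 1 = LPCM (offset 20)
                            sample_rate, byte_rate,
                            bytes_per_sample, bits_per_sample))
        f.write(b"data")
        f.write(struct.pack("<I", len(data)))                  # length of audio data (offset 40)
        f.write(data)

# the 32 padded 4-bit code words of Example 7.5, stored with 8 bits each
codes = bytes.fromhex("07 0a 0d 0f 0f 0f 0e 0b 09 06 04 03 03 03 05 06"
                      "08 09 09 09 08 07 06 05 05 06 07 09 0a 0b 0c 0b")
write_wave("example.wav", codes, sample_rate=20000, bits_per_sample=4)  # 76-byte file as above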
How can this be achieved? The key to the high compression rate of MP3 is the
joint application of two different compression techniques: First, it sorts out negligible
signal data based on a psychoacoustic model of the human ear. Second, redundancies
in the frequency domain of the reduced data set are exploited by means of Huffman
encoding.
The psychoacoustic technique exploits the imperfection of the human acoustic
system [5, p. 24f.]. Typically, humans cannot hear frequencies below 20 Hz and
above 20 kHz. Therefore signal data below or above these thresholds need not be
stored anyway. Moreover, humans do not perceive all frequencies in that range
equally well. Most people are less sensitive to low and high frequencies whereas they
are most sensitive to frequencies between 2 and 4 kHz. This threshold of audibility
can be expressed by a function of sound pressure (aka volume) versus frequency.
This function is not static; rather, its characteristic may be affected by so-called au-
ditory and temporal masking. Auditory masking describes the effect that a certain
signal cannot be distinguished from a stronger signal (say, 10 dB louder) if the pitches are only slightly different (say, 100 Hz apart); here, the stronger signal masks the weaker signal, so the latter need not be stored. In the case of temporal masking, a strong, abruptly ending tone provokes a short pause of a few milliseconds in which the human ear is unable to perceive very quiet signals, which accordingly need not be stored. Temporal masking also appears in the opposite direction: the last
few milliseconds of a quiet tone are wiped out by a sudden strong subsequent signal.
The identification of such masking effects and the adjustment of the audibility
curve is a complex task and causes some computational effort which is fortunately
only necessary in the encoding and not in the decoding phase. The psychoacoustic
analysis is made in the frequency domain after a fast Fourier transform and it finally
provides an adjusted audibility curve which is later used by the second compression
technique to decide which signals can be omitted.
At the beginning of MP3 compression, the data stream is separated into 32 spec-
tral bands whose bandwidths are not equally wide but are adapted to the human
acoustic system (in the range 1 to 4 kHz the bands are denser than below and above).
Each spectral band is divided into frames containing either 384 or 1152 samples.
Long frames are for sub-bands with low frequencies whereas the other sub-bands
use short frames. Next, a modified discrete cosine transformation (MDCT) finally
leads to a division into 576 frequency bands. Now the adjusted audibility curve is
used to decide which signals must be stored and which can be neglected. The re-
maining frequency data are then quantized with fixed or variable bit depth. Next,
the frequency samples are Huffman encoded: short code words are assigned to fre-
quently occurring frequencies whereas rarely occurring frequencies get longer code
words. Since this assignment is specific to each frame, the code tables of all frames
must be included in the final MP3 data stream.
The end user does not need to know the structure of an MP3 file, but we are
curious about the technical realization. An MP3 file is simply a sequence of many
encoded frames. Each frame contains a header of 32 bit (see Table 7.8) and a body
with the compressed signal data.
At the beginning of the frame header, there is a synchronization (sync) pattern of
Table 7.8: The Structure of the Frame Header of an MP3 File
Table 7.9: Lookup Table of the Bitrate Index (kbit/sec) for MPEG Encodings
11 bits, which is important for broadcast data streams. When such a data stream is entered at an arbitrary time, the MP3 decoder first seeks the sync pattern within the stream. If such a bit pattern is found, it may be the sync pattern, but it may also be data from the body part of a frame. Thus, the decoder must check its hypothesis of a detected sync pattern by verifying that the sync pattern also appears at the position in the stream where the next frame should start. The more often this hypothesis cannot be rejected, the more likely it is that a true sync pattern has actually been found.
Table 7.10: Lookup Table of the Sampling Rate Index (in Hz) for MPEG Encodings
The MP3 encoding is only a special case of audio file formats from the MPEG
family. In case of MP3, the MPEG version is no. 1 and the MPEG layer is no. 3. Thus,
the next four bits are set to 1101. If protection is activated, then a 16-bit checksum is
placed directly after the header. Bit and sampling rates are given by indices to fixed
lookup tables. In MPEG encodings, it may happen that a frame requires one byte
fewer than the “standard size”. In this case the padding bit is set. The private bit
may trigger some application-specific behavior. The next bits indicate whether the tune is a mono or stereo recording and which kind of stereo mode is used. If the copyright bit is set, then it is officially illegal to copy the track, and the next bit indicates whether this file is the original or a copy. The last bits of the header are now obsolete.
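As a sketch, the fields of the 32-bit frame header can be extracted by bit masking; the field positions follow the layout summarized in Table 7.8, and bitrate and sampling rate are returned only as indices into the lookup tables of Tables 7.9 and 7.10. The function name and the example header bytes are assumptions of the sketch.

def parse_mp3_frame_header(b):
    # Split a 4-byte MPEG audio frame header into its bit fields (raw values only).
    h = int.from_bytes(b, "big")
    if h >> 21 != 0x7FF:
        raise ValueError("no sync pattern (11 set bits) at this position")
    return {
        "version_id":     (h >> 19) & 0x3,   # 3 = MPEG version 1
        "layer":          (h >> 17) & 0x3,   # 1 = Layer III
        "protection_off": (h >> 16) & 0x1,
        "bitrate_index":  (h >> 12) & 0xF,   # index into Table 7.9
        "sampling_index": (h >> 10) & 0x3,   # index into Table 7.10
        "padding":        (h >> 9) & 0x1,
        "private":        (h >> 8) & 0x1,
        "channel_mode":   (h >> 6) & 0x3,
        "mode_extension": (h >> 4) & 0x3,
        "copyright":      (h >> 3) & 0x1,
        "original":       (h >> 2) & 0x1,
        "emphasis":       h & 0x3,
    }

print(parse_mp3_frame_header(bytes([0xFF, 0xFB, 0x90, 0x00])))  # a typical MPEG-1 Layer III header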
In most cases, the MP3 file is preceded by an ID3 tag that contains meta information like the name of the artist and the title of the tune, which may be displayed by an MP3
player. The ID3 tag has a header (10 bytes) and a body. The structure of the header
is presented in Table 7.11.
The first three bytes indicate an ID3 tag by the ASCII codes of the string “ID3”
and the next two bytes are the ID3v2 revision number in little-endian format. The
following byte contains 8 flags whose meaning will not be discussed here. The next
four bytes contain the size of the ID3 tag’s body in bytes but in a 7-bit encoding, i.e.,
for each byte the most significant bit is always set to zero and it is ignored so that 4 ×
7 = 28 bits are available for representing the body size. If the size is stored in bytes
b6 to b9 , then the size (in bytes) is given by b9 + 128 × (b8 + 128 × (b7 + 128 × b6 )).
The MP3 data follow directly after the ID3 tag. Further information about ID3 tags
can be found online (see http://id3.org/).
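The size computation above can be written down directly; the example header bytes in the sketch below are hypothetical.

def id3v2_tag_body_size(header):
    # Compute the size of an ID3v2 tag body from its 10-byte header,
    # using the 7-bit ("synchsafe") size bytes b6..b9 described above.
    if header[:3] != b"ID3":
        raise ValueError("not an ID3v2 header")
    b6, b7, b8, b9 = header[6:10]
    return b9 + 128 * (b8 + 128 * (b7 + 128 * b6))

# e.g. size bytes 00 00 02 01 encode 2 * 128 + 1 = 257 bytes
print(id3v2_tag_body_size(b"ID3\x03\x00\x00\x00\x00\x02\x01"))  # 257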
a typical example, where digital sheet music in abc format was converted to vector
graphics by the tool abc2ps.
Nevertheless, there are software systems that are exclusively devoted to typesetting digital sheet music. Here we briefly discuss MusicTeX, which is an extension of the widely known typesetting system LaTeX.
thetic rendering. Based on these packages, the typesetting system LilyPond offers a simpler command language.
Example 7.6 (Smoke on the Water).
This example presents how the few notes in Figure 7.2 can be specified with Mu-
sicTEX.
\begin{music}
\def\nbinstruments{1}             % single instrument
\def\nbporteesi{1}                % instrument has single staff
\generalmeter{\meterfrac{4}{4}}   % four-four time
\generalsignature{-2}             % key signature: 2 flats
\debutmorceau                     % start
\normal                           % normal spacing
\notes\Uptext{\metron{\qu}{120}}\enotes
\notes\qu g\ql i\qlp j\cu g\enotes%
\barre
\notes\ds\cl i\ds\qsk\cl{_k}\hl j\enotes%
% \qsk skips a virtual quarter note to the right
\barre
\notes\qu g\ql i\qlp j\cl i\enotes%
\barre
\notes\ds\itenl 0 g\cu g\qsk\tten 0\hup g\enotes%
% \itenl and \tten initiate and terminate a tie
\hfil\finmorceau
\end{music}
Bibliography
[1] D. Bainbridge and T. Bell. The challenge of optical music recognition. Com-
puters and the Humanities, 35(2):95–121, 2001.
[2] A. H. Bullen. Bringing sheet music to life: My experiences with OMR. The
code{4}lib Journal, Issue 3, 2008-06-23, 2008.
[3] R. B. Dannenberg. Music representation issues, techniques, and systems. Com-
puter Music Journal, 17(3):20–30, 1993.
[4] M. D. Good. MusicXML: The first decade. In J. Steyn, ed., Structuring Mu-
sic through Markup Language: Designs and Architectures, pp. 187–192. IGI
Global, Hershey (PA), 2013.
[5] S. Hacker. MP3: The Definitive Guide. O’Reilly, Sebastopol (CA), 2000.
[6] D. M. Huber. MIDI. In G. Ballou, ed., Handbook of Sound Engineers, pp.
1099–1130. Focal Press, Burlington (MA), 4th edition, 2008.
[7] N. S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice Hall, Engle-
wood Cliffs (NJ), 1984.
[8] Melodyne. http://www.celemony.com/en/melodyne/what-is-
melodyne, accessed 10-Mar-2016.
[9] J. Ratcliff. Timecode: A User’s Guide. Focal Press, Burlington (MA), 3rd
edition, 1999.
[10] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S.
Cardoso. Optical music recognition: State-of-the-art and open issues. Interna-
tional Journal on Multimedia Information Retrieval, 1(2):173–190, 2012.
[11] T. Sauer. Notations 21. Mark Batty Publisher, New York, 2009.
[12] E. Selfridge-Field, ed. Beyond MIDI: The Handbook of Musical Codes. MIT
Press, Cambridge (MA), 1997.
[13] A. Skonnard and M. Gudgin, eds. Essential XML. Pearson Education, Indi-
anapolis (IN), 2002.
[14] D. Taupin. MusicTEX: Using TEX to write polyphonic or instrumental music
(version 5.17). https://www.ctan.org/pkg/musictex, 2010, accessed 26-
May-2015.
[15] Widisoft. http://www.widisoft.com/, accessed 10-Mar-2016.
Chapter 8
Music Data: Beyond the Signal Level
GEOFFRAY BONNIN
LORIA, Université de Lorraine, Nancy, France
8.1 Introduction
In Chapter 5 we discussed a number of features that can be extracted from the audio
signal, including rhythmic, timbral, or harmonic characteristics. These features can
be used for a variety of applications of Music Information Retrieval (MIR), includ-
ing automatic genre classification, instrument and harmony recognition, or music
recommendation.
Besides these signal-level features, however, a number of other sources of infor-
mation exist that explicitly or indirectly describe musical characteristics or metadata
of a given track. In recent years, for example, more and more information can be
obtained from Social Web sites, on which users can, for instance, tag musical tracks
with genre or mood-related descriptions. At the same time, various music databases
exist which can be accessed online and which contain metadata for millions of songs.
Finally, some approaches exist to derive “high-level”, interpretable musical features
from the low-level signal in order to build more intuitive and more usable MIR
applications.
This chapter gives an overview of the various types of additional information
sources that can be used for the development of MIR applications. Section 8.2
presents a general approach to predict meaningful semantic features from audio sig-
nal. Section 8.3 deals with features that can be obtained from digital symbolic rep-
resentations of music and Section 8.4 provides a short introduction to the analysis of
music scores. In Section 8.5, methods to extract music-related data from the Social
Web are discussed. The properties of typical music databases are outlined in Section
8.6. Finally, Section 8.7 introduces lyrics as another possible information source to
determine musical features for MIR applications.
The boundaries between low-level and semantic features are often blurred. Consider, for example, the chroma as described in Equation (5.18). The idea of mapping all related frequencies to a semitone bin is very close to the signal level, but the progression over time of the chroma component with the largest value may describe the melody line, which we would consider a semantic feature grounded in music theory.
Generally, there exists no agreement on which features should be described as
“high-level”. In [5], the features are categorized by their “closeness” to a listener.
Signal-level features therefore include descriptors like timbre, energy, or pitch. In
contrast, rhythm, dynamics, and harmony are musical characteristics considered to
be more meaningful and closer to a user. The highest-level features according to [5]
are referred to as “human knowledge” and relate to the personal music perception
(emotions, opinions, personal identity, etc.). These features are particularly hard to
assess. In yet another categorization scheme, [39] describes seven “aspects” of mu-
sical expression: temporal, melodic, orchestrational, tonality and texture, dynamic,
acoustical, electromusical and mechanical. Rötter et al. [35] finally list 61 high-level
binary descriptors suitable for the prediction of personal music categories.
improve or simplify the recognition of instruments. Example 8.1 shows several pos-
sible sequences of SFS. The final step in each sequence obviously comprises the
prediction of the aspect that we are actually interested in.
Example 8.1 (Levels of Sliding Feature Selection).
• low-level features ↦ instruments ↦ moods ↦ genres
• low-level features ↦ instrument groups (keys, strings, wind) ↦ individual instruments ↦ styles ↦ genres ↦ personal preferences
• low-level features ↦ harmonic properties ↦ moods ↦ styles
• low-level features ↦ harmonic properties ↦ rhythmic patterns ↦ moods ↦ styles ↦ genres
The individual levels of the SFS chain (cf. Figure 8.1) can be combined with fea-
ture construction techniques; see Section 14.5. In the described generic approach,
new features can be constructed through the application of mathematical operators
(like sum, product, or logarithm) on their input(s). As an example, consider a chroma
vector which should help us identify harmonic properties. In the first step, new char-
acteristics can be constructed by summing up the strengths of the chroma amplitudes
for each pair of chroma semitones. The “joint strength of C and G” – as a sum of the
amplitudes of C and G – would, for example, measure the strength of the consonant
fifths C-G and fourths G-C. The overall number of these new features based on 12 semitone strengths is equal to 12 · 11 / 2 = 66. In the next step, the “strength of sad
mood” could be predicted after applying the SFS chain using a supervised classifi-
cation model which is trained on the subset of these 66 descriptors that leads to the
best accuracy.
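A minimal sketch of this first construction step, assuming a 12-dimensional chroma vector with hypothetical amplitudes, could look as follows; the feature names are purely illustrative.

from itertools import combinations

def chroma_pair_sums(chroma):
    # Construct the 66 pairwise sum features from a 12-dimensional chroma vector,
    # e.g. the joint strength of C and G as a candidate descriptor of fifths/fourths.
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return {f"{names[i]}+{names[j]}": chroma[i] + chroma[j]
            for i, j in combinations(range(12), 2)}

# hypothetical chroma amplitudes, just for illustration
features = chroma_pair_sums([0.9, 0.1, 0.2, 0.1, 0.5, 0.2, 0.1, 0.8, 0.1, 0.3, 0.2, 0.1])
print(len(features), features["C+G"])  # 66 1.7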
8.2.3 Discussion
The extraction of robust semantic characteristics from audio signals of several music
sources is generally challenging, both when various signal transforms (see Chapters 16–21) are applied and when supervised classification is used as introduced
above. Nevertheless, the selection of the most relevant semantic features can be
helpful to support a further analysis by music scientists or help music listeners to
understand the relevant properties of their favorite music.
The usage of higher-level descriptors derived with the help of SFS may not neces-
sarily improve the classification quality when compared to approaches that only use
low-level features, as both methods start with the same signal data. Another chal-
lenge is the proper selection of training data in each classification step. For example,
too many training tracks from the same album typically lead to a so-called album ef-
fect, where the characteristics of albums are learned instead of genres. In the context
of the genre recognition problem, however, SFS was proven to be sufficiently robust
in [44] and the experiments also showed that models that were trained on a subset
of the features performed significantly better than models that relied on all available
features.
2 The data files used in the experiments were in two special symbolic formats. Using MIDI-encoded files
3 Those are features that are considered to be “musical abstractions that are meaningful to musically
trained individuals.”
Table 8.2: Examples of Features from jSymbolic, Alphabetically Sorted
staccato quarter. Therefore, the authors of [11], for example, propose to combine
MIDI features with other features, including lyrics, popularity information, or chord
annotations that can be obtained from different sources.
can be highly interconnected (see Figure 8.2) and that they can vary in shape and size
even within the same score [33].
An OMR process usually consists of several phases [32]. First, image processing
is done, which involves techniques such as image enhancement, binarization, noise
removal or blurring. The second step is symbol recognition, which typically in-
cludes tasks like staff line detection and removal, segmentation of primitive symbols
and symbol recognition, where the last step is often done with the help of machine
learning classifiers which are trained on labeled examples. In the following steps,
the identified primitive symbols are combined to build the more complex musical
symbols. At that stage, graphical and syntactical rules can be applied to validate the
plausibility of the recognition process and correct possible errors. In the final phase,
the musical meaning is analyzed and the symbolic output is produced, e.g. in terms
of a MIDI file.
Over the years, a variety of techniques have been proposed to address the chal-
lenges in the individual phases, but a number of limitations remain in particular with
respect to hand-written scores. At the same time, from a research and methodological
perspective, better means are required to be able to compare and benchmark different
OMR systems [32].
From a practical perspective, a number of OMR tools exist today, including both commercial ones like SmartScore5 and open-source solutions like
Audiveris.6 According to [32], these tools produce good results for printed sheets,
but have limitations when it comes to hand-written scores.
set of playlists, see Example 8.3. To measure the “degree of co-occurrence” of two
artists, the concepts of support and confidence from the field of association rule min-
ing can be used.
Vatolkin et al. in [43] define the normalized support Supp(ai , a j ) of two artists
ai and a j in a collection of playlists P as the number of playlists in which both
ai and a j appeared divided by the number of playlists |P|. The confidence value
Con f (ai , a j ) relates the support to the frequency of an artist ai , which helps to reduce
the overemphasis on popular artists that comes with the support metric.
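A small sketch of these two statistics is given below. The toy playlists are hypothetical, and the confidence formula (support of the pair divided by the support of a_i) is one common definition, assumed here since [43] is not quoted verbatim.

def artist_support_confidence(playlists, a_i, a_j):
    # Normalized support and a confidence variant for two artists in a
    # collection of playlists (each playlist is a set of artist names).
    n = len(playlists)
    both = sum(1 for p in playlists if a_i in p and a_j in p)
    only_i = sum(1 for p in playlists if a_i in p)
    supp = both / n
    conf = both / only_i if only_i else 0.0   # relates support to the frequency of a_i
    return supp, conf

# hypothetical toy playlists, just for illustration
playlists = [{"Metallica", "Soulfly"}, {"Metallica", "Slayer"}, {"Miles Davis"}]
print(artist_support_confidence(playlists, "Soulfly", "Metallica"))  # (0.333..., 1.0)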
Let us now assume that our problem setting is a binary classification task with the
goal to predict the genre of an unknown artist (or, analogously, the genre of a track
where we know the artist). We assume that each artist is related to one predominant
genre. Our training data can therefore be seen to contain Tp annotated “positive”
examples of artists for each genre (ap1 , ..., apTp ) and Tn artists who do not belong to
a given genre (“negative” artists an1 , ..., anTn ). To learn the classification model we
now look at our playlists and determine for each “positive” artist api those artists that
appeared most often together with api . Similarly, we look for co-occurrences for the
negative examples for a given genre.
When we are now given a track of some artist ax to classify, we can determine
with which other artists ax co-occurred in the playlists. Obviously, the higher the
co-occurrence of ax with artists that co-occurred also with some api , the higher the
probability that ax has the same predominant genre as api . Otherwise, if ax often
co-occurs with artists that do not belong to the genre in question, we see this as an
indication that ax does not have this predominant genre either. Technically, the co-
occurrence statistics are collected in the training phase and used as features to learn a
supervised classification model (see Chapter 12). In [43], experiments with different
classification techniques were conducted and the result showed that the approach
based on playlist statistics outperformed an approach based on audio features for 10
out of 14 tested genres. The results also showed that using confidence is favorable in
estimating the strength of a co-occurrence pattern in most cases.
Example 8.3 (Extraction of Artist Co-Occurrences in Playlists). Table 8.3 shows
those five artists (provided in the table header) that most frequently co-occur with
four “positive” artists for the genres Classical, Jazz, Heavy Metal, and Progressive
Rock based on Last.fm playlist data.
Even if the top co-occurring artists are very popular, this method can be helpful
to classify less popular artists. For example, after the comparison of support values
for Soulfly, the most probable assignment would be the genre Heavy Metal given
the statistics of the data set used in [43]: Supp(Soulfly, Beethoven) = 5.841E-5,
Supp(Soulfly, Miles Davis) = 2.767E-5, Supp(Soulfly, Metallica) = 516.539E-5,
Supp(Soulfly, Pink Floyd) = 66.810E-5.
One underlying assumption of the approach is that playlists are generally ho-
mogeneous in terms of their genre. Also, this method does not take into account
that artists can be related to different genres over their career. Finally, a practical
challenge when using public playlists is that artists are often spelled differently or
even wrongly, consider, e.g., “Ludwig van Beethoven”, “Beethoven”, “Beethoven,
Table 8.3: Top 5 Co-Occurrences for Artists with the Predominant Genres Classical,
Jazz, Heavy Metal, and Progressive Rock
Ludwig van”, “L.v.Beethoven”, etc. A string distance measure can help to identify
identical artists, when the distance is below some threshold. In [43], for example, the
Smith–Waterman algorithm [36] is applied to compare artist names.
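For illustration, a minimal Smith–Waterman local alignment score (simple match/mismatch/gap scoring, no traceback) can be computed as follows; the scoring parameters and the lower-casing of characters are assumptions of this sketch, not the exact setup used in [43].

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    # Best local alignment score between two strings (Smith-Waterman dynamic program).
    rows, cols = len(a) + 1, len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, rows):
        cur = [0] * cols
        for j in range(1, cols):
            s = match if a[i - 1].lower() == b[j - 1].lower() else mismatch
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

# spelling variants of the same composer score much higher against each other
print(smith_waterman_score("Ludwig van Beethoven", "Beethoven, Ludwig van"))
print(smith_waterman_score("Ludwig van Beethoven", "Miles Davis"))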
Generally, shared playlists can be used for music-related tasks other than genre
classification. They can, for instance, provide a basis for automated playlist gener-
ation and next-track music recommendation; see for example [1], [16] and Chapter
23.
Apart from these public music databases and services, the database used by Pan-
dora,14 probably the most popular Internet radio station in the United States at the moment, is worth mentioning. The Internet radio station is based on the data created in the
Music Genome Project. In contrast to databases which derive features from audio
signals, each musical track in the Pandora database is annotated by hand by musical
experts in up to 400 different dimensions (“genes”).15 The available genes depend
on the musical style and can be very specific like “level of distortion on the electric
guitar” or “gender of the lead vocalist”.16 The annotation of one track is said to take
20 to 30 minutes; correspondingly, the size of the database – approximately 400,000
tracks – is limited when compared to other platforms.
8.7 Lyrics
Many music tracks, particularly in the area of popular music, are “songs”, i.e., they
are compositions for voice, performed by one or more singers. Correspondingly,
these tracks have accompanying lyrics, which in turn can be an interesting resource
to be analyzed and used for music-related applications. For example, instead of
trying to derive the general mood of a track based only on the key or tempo, one
intuitive approach could be to additionally look at the lyrics and analyze the key
terms appearing in the text with respect to their sentiment.
In the literature, a number of approaches exist that try to exploit lyric information
for different MIR-related tasks. In [14], for example, the authors combine acoustic
and lyric features for the problem of “hit song” prediction. Interestingly, at least in
their initial approach, the lyrics-based prediction model that used Latent Semantic
Analysis (LSA) [13] for topic detection was even slightly better than the acoustics-
based one; the general feasibility of hit song prediction is, however, not undisputed
[30].
The work of [21] is also based on applying an LSA technique to a set of lyrics.
In their work, however, the goal was to estimate artist similarity based on the lyrics.
While the authors could show that their approach is better than random, the results
were worse than those achieved with a similarity method that was based on acous-
tics, at least on the chosen dataset. Since both methods made a number of wrong
classifications, a combination of both techniques is advocated by the authors.
Instead of finding similar artists, the problem of the Audio Music Similarity and
Retrieval task in the annual Music Information Retrieval eXchange (MIREX) is to
retrieve a set of suitable tracks, i.e., a short playlist, for a given seed song. In [20], the
authors performed a user study in which the participants had to subjectively evaluate
the quality of playlists generated by different algorithms. Several participants of the
study stated that they themselves build playlists based on the lyrics of the tracks or
liked certain playlists because of the similarity of the content of their lyrics. This
indicates that lyrics can be another input that can be used for automated playlist
generation. As lyrics alone are, however, not sufficient and other factors like track
14 http://www.pandora.com.Accessed 03 January 2016
15 http://www.pandora.com/about/mgp. Accessed 03 January 2016
16 http://en.wikipedia.org/wiki/Music\_Genome\_Project. Accessed 03 January 2016
The calculation of the TF-IDF vectors for a collection of text documents d typ-
ically begins with a pre-processing step. In our case, each document contains
17 Mathematically, different ways to compute the weights are possible. For an example, see [10].
18 Compared to Latent Semantic Analysis techniques mentioned above, TF-IDF-based approaches can-
the lyrics of one track. In this phase, irrelevant so-called “stop-words” like
articles are removed. Furthermore, stemming can be applied, a process which
replaces the terms in the document with their word stem.
We then compute a normalized term-frequency value TF(i, j), which represents how often the term i appears in document j. Normalization should be applied to prevent longer text documents from leading to higher absolute term-frequency values. Different normalization schemes are possible. For instance,
we can compute the normalized frequency value of a term by dividing it by
the highest frequency of any other term appearing in the same document.
Let maxFrequencyOtherTerms(i, j) be the maximum frequency of terms other
than i appearing in document j. If freq(i, j) represents the unnormalized frequency count, then
TF(i, j) = freq(i, j) / maxFrequencyOtherTerms(i, j).    (8.1)
The IDF component of the TF-IDF encoding reduces the weight of a term
proportional to its appearance in documents across the entire collection. Let
N be the number of documents in d and n(i) be the number of documents in
which term i appears. We can calculate the Inverse Document Frequency as
IDF(i) = log(N / n(i))    (8.2)
and the final TF-IDF score as TF-IDF(i, j) = TF(i, j) · IDF(i).
The resulting term vectors can be very long and sparse as every word ap-
pearing in the documents corresponds to a dimension of the vector. Therefore,
additional pruning techniques can be applied, e.g., by not considering words
that appear too seldom or too often in the collection.
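The following Python sketch implements Equations (8.1) and (8.2) for a toy collection of already tokenized documents; the example “lyrics” are hypothetical and stand in for pre-processed tracks.

import math
from collections import Counter

def tf_idf(documents):
    # TF-IDF weights following Equations (8.1) and (8.2);
    # documents are lists of already pre-processed terms.
    n = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vec = {}
        for term, freq in counts.items():
            others = [c for t, c in counts.items() if t != term]
            tf = freq / max(others) if others else 1.0   # Equation (8.1)
            idf = math.log(n / doc_freq[term])           # Equation (8.2)
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors

# hypothetical toy "lyrics", already tokenized, stop words removed
docs = [["love", "night", "love"], ["night", "rain"], ["love", "rain", "rain"]]
print(tf_idf(docs)[0])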
Bibliography
[1] G. Bonnin and D. Jannach. Automated generation of music playlists: Survey
and experiments. ACM Computing Surveys, 47:1–35, 2014.
[2] K. Bosteels, E. Pampalk, and E. E. Kerre. Evaluating and analysing dynamic
playlist generation heuristics using radio logs and fuzzy set theory. In Proc. of
the International Society for Music Information Retrieval Conference (ISMIR),
pp. 351–356. International Society for Music Information Retrieval, 2009.
[3] N. Carter, R. Bacon, and T. Messenger. The acquisition, representation and
reconstruction of printed music by computer: A review. Computers and the
Humanities, 22(2):117–136, 1988.
[4] Z. Cataltepe, Y. Yaslan, and A. Sonmez. Music genre classification using MIDI and audio features. EURASIP Journal of Applied Signal Processing,
2007(1):150–150, 2007.
[5] Ò. Celma and X. Serra. FOAFing the music: Bridging the semantic gap in mu-
sic recommendation. Journal of Web Semantics: Science, Services and Agents
on the World Wide Web, 6(4):250–256, 2008.
[6] Ò. Celma. Music Recommendation and Discovery: The Long Tail, Long Fail,
and Long Play in the Digital Music Space. Springer, 2010.
[7] Ò. Celma and P. Lamere. Music recommendation tutorial. International Society
for Music Information Retrieval Conference (ISMIR), September 2007.
[8] W. Chai and B. Vercoe. Folk music classification using hidden Markov models.
In Proc. of the International Conference on Artificial Intelligence (ICAI), Las
Vegas, 2001.
[9] R. Cilibrasi, P. Vitányi, and R. De Wolf. Algorithmic clustering of music based
on string compression. Computer Music Journal, 28(4):49–67, 2004.
Part II
Methods
Chapter 9
Statistical Methods
CLAUS WEIHS
Department of Statistics, TU Dortmund, Germany
9.1 Introduction
Statistical models and methods are the basis for many parts of music data analysis.
In this chapter we will lay the statistical foundations for the methods described in the
next chapters. An overview is given over the most important notions and theorems
in statistics, needed in this book. The notion of probability is introduced as well as
random variables. We will define, characterize and represent stochastic distributions
in general and give examples relevant for music data analysis. We will show how
to estimate unknown parameters and how to test hypotheses on the distribution of
variables. Typical statistical models for the relationship between different random
variables will be introduced, and the estimation of their unknown parameters and the
properties of predictions from such models will be discussed. We will introduce the
most important statistical models for signal analysis, namely time series models and,
finally, a first impression of dimension reduction methods will be given.
9.2 Probability
9.2.1 Theory
The notion of probability is fundamental to statistics, but it is often used only intuitively. In what follows, we will give an exact definition. Notice that probabilities are defined for fairly general sets of observations of variables. These observations need not be quantitative, i.e. usable in calculations. Instead, we will define
probability on subsets of a set of possible observations (sample space). These subsets
have to have some formal properties (formalized as so-called σ -algebras) in order to
guarantee general and consistent usability of probabilities.
Let us start with a motivating example. In this example, the notion of probability is still used intuitively.
Example 9.1 (Semitones). Consider the set of the 12 semitones ignoring octaves:
Ω = {C, C#, . . . , B}. In 12-tone music, the idea is that all 12 semitones are equiprobable, e.g., P(D) = 1/12. In different historical periods and different keys, these probabilities might be different.
Here, it is implicitly assumed that all relevant probabilities exist and can be easily
calculated from basic probabilities. For example, one might want to calculate the
probability of groups of tones by adding the individual probabilities of the tones in
the group. Can this be done in general? In order to do this, we have to be able
to calculate the probability of unions of sets for which the probabilities are already
known. This is formalized in the following definition.
Definition 9.1 (Probability). A sample space Ω is the set of all possible observations
ω. The action of randomly drawing elements from a sample space is called sampling.
The outcome of sampling is a sample.
A random event A is a subset of Ω. Ω − A, consisting of all elements of Ω
which are not in A, is called the complementary event of A in Ω. In order to be able to derive probabilities for all sets from the probabilities of elementary sets, we restrict ourselves to sets A of subsets of Ω (called a σ-algebra) with Ω ∈ A, (Ω − A) ∈ A for all A ∈ A, and ⋃_{i=1}^{K} A_i ∈ A for all K and all sets A1, A2, . . . ∈ A.
Then, a probability function P is defined as any real-valued function on A with values in the interval [0, 1], i.e. P : A → R with A ↦ P(A) ∈ [0, 1], iff P(A) ≥ 0 for all A ∈ A, P(Ω) = 1, and for all sets of pairwise disjoint events A1, A2, . . . (A_i ∩ A_j = ∅, i ≠ j) it holds that P(⋃_{i=1}^{K} A_i) = ∑_{i=1}^{K} P(A_i) for all K. The values P(A) of a probability function are called probabilities.
two previous examples the progression of elements (tones or chords) is ignored for
the moment.
Let us now extend our ability to work with probabilities. Often, we are interested
in the probability of an event if another event has already happened. In the example
below we would like to know the probability of a certain tone height when it is
already known that it is in the literature of a fixed voice type. This leads to what is
called conditional probability. The following definition also contains related terms.
Definition 9.2 (Conditional Probability and Independence). Let A, B be two events
in A. Then, the conditional probability of A conditional on the event B is defined by: P_B(A) = P(A | B) := P(A ∩ B)/P(B) if P(B) > 0.
A and B are called stochastically independent events iff P(A ∩ B) = P(A)P(B),
i.e. the probability of the intersection of two events is the product of the probabilities
of these events. Equivalently, the conditional probability P(A | B) does not depend
on B, i.e. P(A | B) := P(A ∩ B)/P(B) = P(A).
The K events A_i, i = 1, 2, . . . , K, with P(A_i) > 0 build a partition of Ω iff A_i ∩ A_j = ∅ for i ≠ j and ⋃_{i=1}^{K} A_i = Ω, i.e. iff the events are pairwise disjoint and cover the whole of Ω.
Note that the division by P(B) in the definition of conditional probability guar-
antees that P(B | B) = 1.
Example 9.4 (Semitones (cont.)). Consider again Example 9.1, i.e. the set of semi-
tones Ω = {C, C#, . . . , B}. In this case, obviously there is a partition with K = 12.
Let us now distinguish different octaves. As notation we use C (65.4 Hz), c (130.8
Hz), c’ (261.6 Hz), c” (523.2 Hz) etc. (Helmholtz notation, cp. Table 7.1). For
singing, obviously, the probability of a semitone in the singing voice depends on the
type of voice. For example, with very few exceptions soprano voices stay between c’
and c’’’ (261.6–1046 Hz). The “Queen of Night” in “The Magic Flute” of Mozart
is one exception going up to f’’’, 1395 Hz. Also, bass voices stay between F and
f’ (87.2–348 Hz), again with some exceptions like the “Don-Cossack” choir going
down to F1 = 43.6 Hz. This means that the probability of a tone outside such ranges
is (nearly) zero conditional on the voice type.
Formally, this can be modeled by so-called combined events A = {(tone1 ,type1 ),
(tone2 ,type2 ), . . . , (tone p ,type p )} over the classical singers’ literature, say, where,
e.g., tonei ∈ {F1 , . . . , C, D, . . . , f’’’} and typei ∈ {soprano, mezzo, alto, tenor, bari-
tone, bass}. Then, we may be interested in conditional probabilities, e.g.,
P((tone,type) | type = soprano) and P((tone,type) | type = bass). For example, for
A = {(c”,type)} and B = {(tone, soprano)}, P(A | B) = P((c”,type) | type = soprano)
= P(c”, soprano) /P(soprano). Obviously, there are tones sung by different voices
and tones sung only by one voice. For example, only tones ∈ {c’, c#’, d’, d#’, e’,
f’} are sung by both, soprano and bass. Therefore, the events A = {(tone, soprano)}
and B = {(tone, bass)} are independent for any fixed tone ∉ {c’, c#’, d’, d#’, e’, f’} since P(A ∩ B) = P(A)P(B) = 0 because A ∩ B = ∅ and either P(A) or P(B) is zero.
Also, the subsets A1 = {(tone, soprano) for all tones}, A2 = {(tone, mezzo) for
all tones}, ..., A6 = {(tone, bass) for all tones} build another example for a partition.
Note that conditional probabilities are not symmetric, i.e. P(A|B) ≠ P(B|A). Of-
ten, it is reasonably easy to get one of these types of probabilities, say P(B|A), but one
is interested in the other type P(A|B). For example, consider the case that we want to
calculate the conditional probability P(Ai | B) of the type of voice Ai given a certain
semitone B. However, the only probabilities we have from literature are P(B | Ai ).
Such problems can be solved by means of the following fundamental properties of
probabilities which can be derived from the notions of conditional probability, inde-
pendence, and partition.
Theorem 9.1 (Total Probability). Let Ai , i = 1, 2, . . . , K, be a partition of Ω with
P(A_i) > 0. Then, for every B ∈ A: P(B) = ∑_{i=1}^{K} P(B | A_i) P(A_i).
The probability of an event B can be, thus, calculated by means of the conditional
probabilities of B conditional to a partition of the sample space Ω.
Theorem 9.2 (Bayes Theorem). Let Ai , i = 1, 2, . . . , K, be a partition of Ω with P(Ai ) >
0. Then, for every event B ∈ A with P(B) > 0 it is valid that:
P(A_i | B) = P(B | A_i) P(A_i) / ∑_{j=1}^{K} P(B | A_j) P(A_j).
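For illustration, the theorem can be applied directly to the voice-type example; the prior shares and the likelihoods P(c’’ | voice type) used below are hypothetical numbers, chosen only so that the bass likelihood is zero, as argued in Example 9.4.

def posterior(priors, likelihoods):
    # Bayes theorem: P(A_i | B) from the priors P(A_i) and the
    # likelihoods P(B | A_i) for a partition A_1, ..., A_K.
    total = sum(likelihoods[a] * priors[a] for a in priors)   # total probability P(B)
    return {a: likelihoods[a] * priors[a] / total for a in priors}

# hypothetical numbers: share of soprano/bass parts and how often each sings c''
priors = {"soprano": 0.5, "bass": 0.5}
likelihoods = {"soprano": 0.04, "bass": 0.0}    # P(c'' | voice type)
print(posterior(priors, likelihoods))           # {'soprano': 1.0, 'bass': 0.0}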
iff the distribution function can be represented as follows by means of a so-called density function f_X(x): F_X(x) = ∫_{−∞}^{x} f_X(z) dz.
Obviously, for continuous random variables X, the distribution function F_X(x) is not only continuous but even differentiable, since f_X(x) = F′_X(x). To illustrate that densities represent the probabilities of points or intervals, first note that f_X(x_i) := P(X = x_i) for discrete densities if x_i is a value taken by the random variable X. For continuous distributions, the probability of individual points is always zero. Thus, a density value at a certain point x in the image of X is not equal to the probability of x. However, densities represent the probability of intervals in that P(a ≤ X ≤ b) = F_X(b) − F_X(a) = ∫_a^b f_X(x) dx.
Example 9.7 (Discrete Distributions). Let us come back to the motivating examples
for random variables. A discrete random variable Xmode would have the distribution
function Fmode (x) = 0 for x < −1, Fmode (x) = 0.5 for −1 ≤ x < 1, and Fmode (x) = 1
for 1 ≤ x. Obviously, this distribution function is not continuous. Also, the cor-
responding density takes the values fmode (−1) = fmode (1) = 0.5 and fmode (x) = 0
otherwise. Analogous arguments are true for the distribution of the semitones.
Examples for continuous distributions will be given later.
You might have noticed that the definition of distribution functions needs proba-
bilities. The most important property of random variables is, though, that distribution
functions and their densities can even be characterized without any recourse to prob-
abilities. This way, random variables and their distributions in a way replace the
much more abstract concept of probabilities in practice. The following properties of
random variables characterize their distributions in general.
Theorem 9.3 (Representation of Distribution and Density Functions). Let FX be the
distribution function of a random variable X. Then,
1. FX (−∞) := limx→−∞ FX (x) = 0,
2. FX (∞) := limx→∞ FX (x) = 1,
3. FX is monotonically increasing: FX (a) ≤ FX (b) for a < b,
4. FX is continuous from the right: lim0<h→0 FX (x + h) = FX (x).
Every function F from R into the interval [0, 1] with the above properties 1–4
defines a distribution function.
Every function f from R into the interval [0, 1] defines a discrete density function,
iff for an at most countable set {x_1, x_2, x_3, . . .}:
1. f(x_i) > 0 for i = 1, 2, 3, . . .,
2. f(x) = 0 for x ≠ x_i, i = 1, 2, 3, . . .,
3. ∑_i f(x_i) = 1.
Every function f : R → [0, ∞) defines a density function of a continuous distribution iff f(x) ≥ 0 for all x and ∫_{−∞}^{∞} f(x) dx = 1.
That such properties are sufficient to characterize distributions can be most easily
seen for discrete density functions because of their direct relationship to probabilities.
So, any function of the types in the above definition represents a distribution function
or density and therefore implicitly a system of probabilities. Statistics concentrates,
therefore, most often on functions of this type. Distributions of random variables are
in the core of statistics.
In order to easily study more examples, let us first introduce the empirical ana-
logues of random variable, distribution, and density.
Figure 9.1: Comparison of two distributions; left: discrete case, theoretical distri-
bution in black, realized distribution in grey; right: continuous case, non-windowed
MFCC 3 grey and MFCC 1 lightgrey.
data set often used in this section. The data is composed of 13 MFCC variables
(non-windowed and windowed) (see Section 5.2.3) and 14 chroma variables. All
variables are available for 5654 guitar and piano tones.
Let us now briefly indicate how these variables are calculated. This paragraph is
not necessary to understand the density example here, but illustrates the relationship
to signal analysis (see Chapters 4, 5). Each single analyzed tone has a length of
1.2 seconds and is given as a WAVE signal (see Section 7.3.2) with sampling rate
44,100 Hz and samples x_i, i ∈ {1, . . . , 52,920}. The non-windowed MFCCs are cal-
culated over the whole tone, i.e. we have one value of each MFCC variable per tone.
For the other features, the signal is framed by half overlapping windows containing
4096 samples each. This results in 25 different windows, the last window not being
complete. We aggregate the windows to so-called blocks of 5 overlapping windows
each. This way, one block is composed of 12,288 observations and corresponds to
around 0.25 seconds. In the names of the variables the individual blocks are noted,
e.g., ‘MFCC 1 block 1’ means the 1st MFCC calculated for block 1. As chroma fea-
tures we rely on the so-called Pitchless Periodogram describing the distribution of
the fundamental frequency and of 13 overtones of a tone. The periodogram is called
pitchless because the value of the pitch of the tone, i.e. of its fundamental frequency,
is ignored in the representation, only the periodogram heights pi (see Section 9.8.2,
Definition 9.47) of the fundamental frequency and its overtones are presented on an
equidistant scale, i ∈ {0, . . . , 13}. This way, the overtone structure is represented on
the same scale for all fundamental frequencies (cp. [3]). The windowed MFCCs and
the chroma variables are calculated for each of the 5 blocks of each tone.
Two continuous distributions can be compared in one density plot. At the right,
Figure 9.1 shows two densities found for the non-windowed MFCCs 1 (light grey)
and 3 (grey). The histograms are approximated by best fitting normal densities (see
Section 9.4.3). Obviously, the normal densities fit the histograms quite well.
Negative values of the skewness indicate distributions with more weight on high
values (steepness at the right), positive values stand for steepness at the left. Sym-
metric distributions like the normal or Student’s t-distribution have skewness 0 (see
Definition 9.15).
The “minus 3” at the end of the excess kurtosis formula is often explained as a correc-
tion to make the kurtosis of the normal distribution equal to zero (cp. Section 9.4.3).
The “classical” interpretation of the kurtosis, which applies only to symmetric and
unimodal distributions (those whose skewness is 0), is that kurtosis measures both
the “peakedness” of the distribution and the heaviness of its tail. A distribution with
positive kurtosis has a more acute peak around the mean and fatter tails. An exam-
ple of such distributions is the Student’s t-distribution. A distribution with negative
kurtosis has a lower, wider peak around the mean and thinner tails. Examples of
such distributions are the continuous or discrete uniform distributions (see Defini-
tion 9.15).
Example 9.10 (Problem with Expected Value). Notice that for all the character-
istics of a distribution it is assumed that their values make sense in the context
of the application. This might well not be the case, though. For example, recon-
sider Example 9.6, where P = {1/3, 0, 0, 0, 1/3, 0, 0, 1/3, 0, 0, 0, 0}. Then, the ex-
pected value of the corresponding random variable X with values {1, 2, . . . , 12} is
E[X] := ∑_i x_i f_X(x_i) = (1 + 5 + 8)/3 = 14/3 = 4 2/3. Obviously, this value is not
interpretable in the context of the application since it corresponds to a note between
D# and E. This is typically a problem of the expected value and the standard de-
viation. The (empirical) quantiles can be defined so that they always take realized
values (see Section 9.4.2).
The other often-used characteristic of a distribution is dividing the distribution
into four, in a sense, equally sized parts.
Definition 9.8 (Quantiles, Quartiles, and Median). Let X be a random variable with
distribution function FX . The q-quantile ξq of X is defined as the smallest number ξ ∈
R with FX (ξ ) ≥ q. The median medX , med(X) or ξ0.5 of X is the 0.5-quantile. The
lower and the upper quartile of X are defined as q₄(X) := ξ_0.25 and q⁴(X) := ξ_0.75, correspondingly. Every value for which the density f_X takes a (local) maximum is
called a modal value or mode of X denoted by modusX or mod(X).
An often-used characteristic is the so-called
5-summaries characteristic = (minimum, q₄(X), med_X, q⁴(X), maximum),
dividing the distribution into four parts that are equally sized in the sense that 25% of the distribution lies between any two neighboring characteristics. For example, the lowest 25% of the distribution lies between the minimum and q₄(X).
Example 9.11 (Uniform Distribution (see Section 9.4.3)). Reconsider the 12-tone
music case with {C, C#, . . . , B} having all the same probability 1/12. These notes
are mapped to {1, 2, . . . , 12}, leading to expected value EX = (1 + . . . + 12)/12 = 6.5, standard deviation σ_X = √(((1 − 6.5)² + . . . + (12 − 6.5)²)/12) = √((5.5² + . . . + 0.5²)/6) ≈ 3.5, median med_X = 6, and quartiles q₄(X) = 3, q⁴(X) = 9.
The 5-summaries characteristic might be much more illustrative than the above 2-
summaries characteristic. The latter, however, is much better suited for the derivation
of theoretical properties, e.g., of transformations of random variables. In particular,
standardization is very important since standardized variables always have ‘standard
form’ with expected value 0 and variance 1.
Theorem 9.4 (Linear Transformation and Standardization). Let X be a random vari-
able. Then:
• E[a + bX] = a + bµX and var[a + bX] = b² var[X]. Therefore, the standardized
variable Z := (X − µX)/σX has expected value E[Z] = 0 and variance var[Z] = 1.
Table 9.1: Applicable Location and Dispersion Measures
The factor 2 is chosen so that the extreme value of Φmin is equal to the extreme value of Φmax,
since Φmin = 2(1 − 1/K) for hk = 1/K, k = 1, . . . , K, and Φmax = 1 − 1/K + (K − 1) · 1/K =
2(1 − 1/K) for hj = 1 and hk = 0 for all k ≠ j.
Table 9.1 gives an overview of the applicability of the introduced measures for
nominal, ordinal, and cardinal features. Note that the mode most of the time does not
make much sense for cardinal features since the values will often not be repeated.
Also, the Φ-dispersion measure only makes sense for cardinal features if their val-
ues repeat, e.g., after aggregation to predefined classes. Moreover, in order to be
meaningful for ordinal features, quantiles should only take observed values, like in
the above definition of the median and in our definition of the p-quantile q_p. Note
that computer programs sometimes use other definitions of quantiles which might
not make sense for ordinal features.
Example 9.12 (Uniform Distribution). Reconsider the 12-tone music Example 9.11.
Assume that we have observed the following notes in a short piece of music:
{12, 8, 5, 8, 5, 5, 2, 1}. The (not very sensible) mean is then
x̄ = (12 + 8 + 8 + 5 + 5 + 5 + 2 + 1)/8 = 5.75 and the empirical standard deviation is
s_x = √(((12 − 5.75)² + 2(8 − 5.75)² + 3(5 − 5.75)² + (2 − 5.75)² + (1 − 5.75)²)/7)
    = √((6.25² + 2 · 2.25² + 3 · 0.75² + 3.75² + 4.75²)/7) ≈ 3.5,
which is reasonably close to EX = 6.5 and nearly equal to σX ≈ 3.5, respectively. The
median is med_x = 5 and the quartiles are q0.25 = 2 and q0.75 = 8, all close to the
theoretical values. Notice, however, that empirical and theoretical values are expected
to be similar only for a large number of observations N. In our example, though, the
observations are spread well over their range, so that the empirical and theoretical
values are similar even for our small number of observations.
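As an added illustration (not from the original text), the following Python snippet recomputes these empirical characteristics with NumPy; the quantiles are implemented directly according to the definition used above, i.e. as the smallest observed value whose empirical distribution function is at least q.

import numpy as np

notes = np.array([12, 8, 5, 8, 5, 5, 2, 1])

def quantile_observed(x, q):
    """Smallest observed value v with empirical CDF F(v) >= q."""
    xs = np.sort(x)
    ecdf = np.arange(1, len(xs) + 1) / len(xs)
    return xs[np.argmax(ecdf >= q)]

mean = notes.mean()                        # 5.75
sd = notes.std(ddof=1)                     # ~3.54, the empirical standard deviation s_x
median = quantile_observed(notes, 0.5)     # 5
q_lower = quantile_observed(notes, 0.25)   # 2
q_upper = quantile_observed(notes, 0.75)   # 8
print(mean, round(sd, 2), median, q_lower, q_upper)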
Based on the above empirical measures, the 5-summaries characteristic is often
used for illustration.
Definition 9.13 (Boxplot). A box- (and whisker-) plot is defined to be a box with
(vertical) borderlines at the lower and upper quartile q0.25 and q0.75, the median med as
an inner (vertical) line, and (horizontal) lines (whiskers) from the quartiles to the most
extreme value inside so-called fences, i.e. ≥ q0.25 − 1.5 · qd and ≤ q0.75 + 1.5 · qd, qd
being the above-defined quartile difference. All points outside the whiskers are called
outliers, marked by o (see Figure 9.2).
Obviously, the center 50% of the observations are located inside the box. For so-
[Figure 9.2: Schematic of a boxplot: the box ranges from q0.25 to q0.75 with the median as an inner line; the whiskers extend at most 1.5 · qd beyond the quartiles.]
called “skewed” distributions, the parts left and right of the median are of different
size. The choice of 3 · qd as the maximal length of the two whiskers together leads to
only 0.7% outliers in the case of a normal distribution (cp. Section 9.4.3). Note that
boxplots may well be drawn vertically and side by side for different features making
them easily comparable.
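A boxplot as just defined can be drawn, for instance, with Matplotlib; the sketch below is an added illustration (not from the original text) that uses random stand-in data in place of the MFCC features and relies on Matplotlib's default whisker length of 1.5 times the quartile difference.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# stand-ins for four features, e.g. MFCCs 1-4; real feature data would be used instead
data = [rng.normal(loc=m, scale=s, size=500) for m, s in [(4, 1), (0, 0.5), (0, 2), (1, 0.7)]]

fig, ax = plt.subplots()
ax.boxplot(data, whis=1.5)                 # whiskers reach at most 1.5 * qd beyond the quartiles
ax.set_xticklabels(["MFCC 1", "MFCC 2", "MFCC 3", "MFCC 4"])
ax.set_ylabel("feature value")
plt.show()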
Example 9.13 (Characteristics of Discrete and Continuous Distributions in Music:
Chords). Consider again Example 9.3. Let us compare the variation of chords in
the standard 12-bar blues (I, I, I, I, IV, IV, I, I, V, V, I, I) and in the “standard” jazz
version (I7, IV7 IVdim, I7, Vm7 I7, IV7, IVdim, I7, III7 VI7, IIm7, V7, III7 VI7,
II7 V7). We assume that only full schemes are observed in corresponding pieces
of music. Obviously, in the first blues scheme, 3 different chords I, IV, V are in-
volved in contrast to 9 different chords I7, IV7, IVdim, Vm7, III7, VI7, IIm7, V7,
II7 in the jazz version. Let us now calculate the Φ-dispersion measure for both
cases. For the 1st scheme, {h1 = 2/3, h2 = 1/6, h3 = 1/6} are the relative fre-
quencies of the different observed values {a1 = I, a2 = IV, a3 = V } and amod = I
with relative frequency h(amod ) = 2/3. Therefore, Φmin := 2(1 − h(amod ) ) = 2/3 and
Φmax = |2/3 − 1/3| + 2 · |1/6 − 1/3| = 2/3. Then, Φ = Φmin/(Φmin + Φmax) = 0.5. For the 2nd
scheme, h = {7/24, 3/24, 3/24, 1/24, 2/24, 2/24, 2/24, 3/24, 1/24}, h(amod) = 7/24, Φmin :=
2(1 − 7/24) = 1.42, Φmax = (7/24 − 1/9) + 3(3/24 − 1/9) + 3(1/9 − 2/24) + 2(1/9 − 1/24)
= 0.44, and Φ = 0.76. This shows that the 2nd scheme varies more, as expected!
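The calculation in this example can be reproduced with the small Python function below. It is an added sketch that assumes the normalization Φ = Φmin/(Φmin + Φmax) with Φmin = 2(1 − h(amod)) and Φmax = ∑_k |h_k − 1/K|, as used above.

def phi_dispersion(h):
    """h: list of relative frequencies of the K observed values (summing to 1)."""
    K = len(h)
    phi_min = 2 * (1 - max(h))                  # 2 * (1 - h(a_mod))
    phi_max = sum(abs(hk - 1 / K) for hk in h)  # total deviation from the uniform distribution
    return phi_min / (phi_min + phi_max)

blues = [8/12, 2/12, 2/12]                      # chords I, IV, V in the 12-bar blues
jazz = [7/24, 3/24, 3/24, 1/24, 2/24, 2/24, 2/24, 3/24, 1/24]
print(round(phi_dispersion(blues), 2))          # 0.5
print(round(phi_dispersion(jazz), 2))           # 0.76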
Example 9.14 (Characteristics of Discrete and Continuous Distributions in Music:
Boxplots of distributions). Looking again at the data in Example 9.9, we compare
the non-windowed MFCCs 1–4 by means of boxplots (see Figure 9.3). Obviously,
MFCC 1 has the highest values, MFCC 3 is heavy tailed, and at least MFCCs 2 and
4 are not symmetric.
[Figure 9.3: Comparison of different distributions (non-windowed MFCCs 1–4) by means of boxplots.]
development of the signal over time, i.e. from one window to the next. For this, we
model this development as a sequence of certain musical states x1 , x2 , . . . , xT , one
for each window. In order to keep the model simple, we just distinguish the states
attack (atck) and sustain (sust) (for alternative models cp. the end of this example).
The corresponding state graph is shown in Figure 9.4. It models the development in
time as a so-called hidden Markov chain of states, where each state only depends on
the preceding state. In our graph, music is modeled as a sequence of sub-graphs,
one for each solo note, which are arranged so that the process enters the start of
the (n + 1)-st note as it leaves the n-th note. From the figure, one can see that each
note begins with a short sequence of states meant to capture the attack portion of the
note (atck). This is followed by another sequence of states with “self-loops” meant
to capture the main body of the note (sust, sustain), and to account for the variation
in note duration we may observe.
If we chain together m states that are changed with probability p, i.e. remain unchanged
with probability q = 1 − p, then the total number of states visited, T, i.e. the number
of audio frames spent in the sequence of m states, has a so-called negative binomial
distribution P(T = t) = (t−1 choose m−1) p^m q^(t−m) for t = m, m + 1, . . ., indicating the probability
of m “successes” in T runs, where a “success” means a state change. The expected
value of T is given by E[T] = m/p and the variance by var[T] = mq/p². Unfortunately,
the parameters m and p are unknown, in general. In order to “estimate” these
parameters having seen several performances of the music piece in question, we
could choose them individually for each note so that the empirical mean and variance
of T agree with the true mean and variance as given in the above formulas: x̄_T = m/p
and s²_T = m(1 − p)/p². This is the so-called method of moments.
In reality, one has to use a wider variety of note models than depicted in the
figure, with variants for short notes, notes ending with optional rests, notes that are
rests, etc., though all are following the same essential idea.
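Solving the two moment equations x̄_T = m/p and s²_T = m(1 − p)/p² gives p = x̄_T/(x̄_T + s²_T) and m = p · x̄_T. The following Python sketch is an added illustration; the function name and the frame counts are purely hypothetical.

import statistics

def fit_note_length_model(frame_counts):
    """Method-of-moments estimates of (m, p) for the negative binomial note-length model."""
    xbar = statistics.mean(frame_counts)      # empirical mean of T
    s2 = statistics.variance(frame_counts)    # empirical variance of T
    p = xbar / (xbar + s2)                    # from xbar = m/p and s2 = m(1-p)/p**2
    m = p * xbar                              # chained states; round to an integer in practice
    return m, p

# hypothetical numbers of audio frames spent on one note in several performances
frames = [38, 45, 41, 52, 36, 47]
m_hat, p_hat = fit_note_length_model(frames)
print(round(m_hat, 1), round(p_hat, 3))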
Let us now continue with important continuous distributions.
Definition 9.15 (Typical Continuous Distributions). A continuous density function
of the type f(x) = 1/(b − a) for x ∈ [a, b], and f(x) = 0 else, where a, b ∈ R, defines a density
of a continuous uniform distribution or rectangular distribution on the interval [a, b].
A random variable with such a density is called continuous uniformly distributed.
The expected value of a continuous uniform distribution is E[X] = (a + b)/2, and the
variance var[X] = (b − a)²/12.
A continuous density function of the type f(x) = (1/(√(2π) σ)) e^(−(1/2)((x−µ)/σ)²), where σ > 0
and µ ∈ R, defines a density of a normal distribution with the parameters µ, σ². A
random variable X with such a density is called normally distributed. µ is the ex-
pected value and σ² the variance of the normal distribution.
Let X_i, i = 1, . . . , N, be independent identically N(µ, σ²) distributed (for in-
dependence of random variables see Definition 9.18). Then, the random variable
t_{N−1} := (X̄ − µ)/(s_X/√N) is called t-distributed with (N − 1) degrees of freedom, where
s_X := the empirical standard deviation of the observations x_1, . . . , x_N of the X_i, i = 1, . . . , N,
estimating the standard deviation σ_X, X ∼ N(µ, σ²) (see Section 9.4.2). One can show
that E[t_{N−1}] = 0 if N > 2, and var[t_{N−1}] = (N − 1)/(N − 3) if N > 3.
[Figure 9.4: State graph of the note model: each note consists of a short sequence of attack (atck) states followed by sustain (sust) states with self-loops; a state is left with probability p and retained with probability q.]
The sum of squares of n independent standard normal distributions is called χ²-
distribution (chi-squared distribution) with n degrees of freedom, χ²_n.
The ratio of two independent scaled χ²-distributions with n and m degrees of
freedom,
F_{n,m} = (χ²_n / n) / (χ²_m / m),
is called F-distribution with n, m degrees of freedom.
Please notice that t would be N(0, 1)-distributed if the true standard devia-
tion σ_X were used instead of s_X. The variance of the t-distribution is somewhat
greater than the variance of the N(0, 1)-distribution. For N → ∞ the t-distribution
converges towards the N(0, 1)-distribution. Examples for χ²- and F-distributions
can be found in Section 13.5.
Example 9.17 (Continuous Distributions: Automatic Composition). In automatic
composition, Xenakis [6, pp. 246-249] experimented with amplitude and/or duration
values of notes obtained directly from a probability distribution (e.g., uniform or
normal). Also many other distributions, not introduced here, were tried.
Example 9.18 (Continuous Distributions: MFCCs). Let us come back to the data in-
troduced in Example 9.9. We look at the quantitative variable non-windowed MFCC
1. From Figure 9.5 it should be clear that this variable can be very well approximated
by a normal distribution with expected value = empirical mean and variance
= empirical variance.
[Figure 9.5: Distribution of the non-windowed MFCC 1 with a fitted normal density (dashed line).]
The density of a (non-singular) multivariate normal distribution of a random vector
X = (X_1 . . . X_m)^T has the form
f_X(x_1, . . . , x_m) = (1/√((2π)^m |Σ_X|)) e^(−(1/2)(x − µ_X)^T Σ_X^(−1) (x − µ_X)),
where Σ_X is the positive definite (and thus invertible) covariance matrix of the ran-
dom vector X, |Σ_X| is the determinant of Σ_X, and µ_X is the vector of expected
values of the elements of X. The covariance matrix will be defined below.
Note that normal distributions are also defined in the case of singular Σ_X. This
case will, however, not be discussed here.
Example 9.19 (Multivariate Normality in Classes). Imagine you want to distinguish
classes like genres or instruments by means of the values of a vector of influential
variables X . This is called a supervised classification problem (cp. Chapter 12). A
typical assumption in such problems is that the influential variables X follow indi-
vidual (multivariate) normal distributions for each class. In the case of m influential
variables X = (X_1 . . . X_m)^T this leads to a density for class c of the following kind:
f_X(x_1, . . . , x_m) = (1/√((2π)^m |Σ_X(c)|)) e^(−(1/2)(x − µ_X(c))^T Σ_X(c)^(−1) (x − µ_X(c))).
[Figure 9.6: Contour lines of a bivariate normal density.]
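As an added illustration (not part of the original text), the class-conditional density above can be evaluated with NumPy as follows; the mean vector and covariance matrix are made-up placeholders for one class c.

import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of a (non-singular) multivariate normal distribution at x."""
    m = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** m * np.linalg.det(Sigma))
    quad = diff @ np.linalg.inv(Sigma) @ diff        # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quad)

mu_c = np.array([4.0, 0.0])                          # hypothetical class mean
Sigma_c = np.array([[1.0, 0.3],
                    [0.3, 0.5]])                     # hypothetical class covariance
print(mvn_density(np.array([3.5, 0.2]), mu_c, Sigma_c))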
Let us discuss a special case of random vectors, where the entries are so-called
independent (note the analogue to independent subsets in Definition 9.2).
Definition 9.18 (Independence of Random Variables). Random variables
X1 , . . . , XN with densities f (X1 ), . . . , f (XN ) are called independent iff
f (X1 , . . . , XN ) = f (X1 ) · . . . · f (XN ).
237
238 Chapter 9. Statistical Methods
In this case, covariances are zero and expected values and variances of (functions
of) random vectors can be easily calculated.
Theorem 9.5 (Expected Values and Independence). For independent random vari-
ables X1 , . . . , XN it is true that:
• E[X1 · . . . · XN ] = E[X1 ] · . . . · E[XN ],
• cov(Xi, Xj) = 0 for i ≠ j, i.e. Xi and Xj are uncorrelated,
• var[∑Ni=1 Xi ] = ∑Ni=1 var[Xi ].
Example 9.20 (Independence). In Example 9.16, the music data model might be
composed of three variables bt , et , and st assumed to be (conditionally) independent
given the state xt : P(bt , et , st |xt ) = P(bt |xt )P(et |xt )P(st |xt ). The first variable, bt ,
measures the local “burstiness” of the signal, particularly useful in distinguishing
between note attacks and steady-state behavior (sustain) distinguished in Figure 9.4.
The 2nd variable, et , measures the local energy, useful in distinguishing between
rests and notes. And the vector-valued variable st represents the magnitude of dif-
ferent frequency components given the state xt . For each of the three components a
distribution may be fixed independently.
The above Bayes Theorem 9.2 can also be formulated for densities. For this, we
first have to define the generalization of conditional probabilities for densities:
Definition 9.19 (Conditional Density). Let X, Y be random variables on the same
sample space with a bivariate density f(x, y). Then, f(x | y) := f(x, y)/f(y) and
f(y | x) := f(x, y)/f(x) are called conditional densities of X given Y and vice versa, where
f(y) := ∫_{−∞}^{∞} f(x, y) dx and f(x) := ∫_{−∞}^{∞} f(x, y) dy are the so-called marginal densities of
Y, X corresponding to the joint density f(x, y).
With this definition, the Bayes theorem for densities can be formulated:
Theorem 9.6 (Bayes Theorem for Densities). Let X, Y be random variables on the
same sample space. Then, f(x | y) = f(y | x) f(x) / f(y) = f(y | x) f(x) / ∫_{−∞}^{∞} f(y | x) f(x) dx.
The Bayes theorem can be well generalized to the multivariate case, as demon-
strated in the following example.
Example 9.21 (Application of Bayes Theorem for Densities). This version of the
Bayes theorem will be very important in Chapter 12, i.e. in supervised classification
of classes like genres or instruments. A typical assumption in classification is that
an individual (multivariate) normal distribution f(x | c) of the influential variables
X is valid for each class c (see Example 9.19). With the Bayes theorem, it is then
possible to calculate the discrete density of class c, i.e. its probability, given an
observation x by means of the density of the observation given the class:
f(c | x) = f(x | c) f(c) / f(x) = f(x | c) f(c) / ∑_{g=1}^{G} f(x | g) f(g),
if G classes are distinguished.
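The following added sketch turns this formula into code, using normal class densities as in Example 9.19 via scipy.stats.multivariate_normal; the class means, covariances, and priors are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical parameters for G = 2 classes (e.g., two instruments)
means = [np.array([4.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
priors = np.array([0.5, 0.5])                        # f(c), the prior class probabilities

def posterior(x):
    """f(c | x) = f(x | c) f(c) / sum_g f(x | g) f(g)."""
    likelihoods = np.array([multivariate_normal.pdf(x, mean=m, cov=S)
                            for m, S in zip(means, covs)])
    joint = likelihoods * priors
    return joint / joint.sum()

print(posterior(np.array([3.0, 0.5])))               # posterior probabilities of the two classes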
Figure 9.7: Scatterplot of two MFCC variables each, left: high correlation, right:
low correlation.
slight relationship if any (see right part of Figure 9.7). Note that here the empirical
correlation is −0.11.
Obviously, the above covariances and correlations are only defined for cardinal
features. For ordinal features, ranks are used instead of the original observations for
the calculation of covariances and correlations.
Definition 9.22 (Ranking). Ranking refers to the data transformation in which nu-
merical or ordinal values are replaced by their ranks when the data are sorted. Typi-
cally, the ranks are assigned to values in ascending order. Identical values (so-called
rank ties) are assigned a rank equal to the average of their positions in the ascending
order of the values.
For example, if the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of
these data items would be 2, 3, 1 and 4, respectively. As another example, the ordinal
data high, low, and middle pitch would be replaced by 3, 1, 2.
Using ranks instead of the original observations for the calculation of correlations
leads to the following definition.
Definition 9.23 (Spearman’s Rank Correlation). The Spearman (rank) correlation
coefficient of two features is defined as the Pearson correlation coefficient between
the corresponding ranked features. For a sample of size N, the N raw observations
xi , yi are converted to ranks pi , qi , and the correlation coefficient is computed from
these:
r_XY := ∑_{i=1}^{N} (p_i − p̄)(q_i − q̄) / √( ∑_{i=1}^{N} (p_i − p̄)² · ∑_{i=1}^{N} (q_i − q̄)² ).
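As a hedged illustration (added here), Spearman's coefficient can be computed exactly as defined, i.e. as the Pearson correlation of the ranks; scipy.stats.rankdata assigns average ranks to ties, as required by Definition 9.22.

import numpy as np
from scipy.stats import rankdata

def spearman(x, y):
    """Pearson correlation of the (average-tie) ranks of x and y."""
    p = rankdata(x)        # ranks p_i
    q = rankdata(y)        # ranks q_i
    return np.corrcoef(p, q)[0, 1]

x = [3.4, 5.1, 2.6, 7.3]           # ranks 2, 3, 1, 4 as in the example above
y = [10.0, 12.5, 9.1, 30.0]        # same ordering, so the rank correlation is 1
print(spearman(x, y))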
Example 9.23 (Spearman Rank Correlation). Looking again at the data in Exam-
ple 9.22, the Spearman rank correlation of the non-windowed MFCC and MFCC 1
Table 9.2: Contingency Table for Binary Features (left) and for General Nominal
Features (right); (a • Index Indicates Summing up over This Index)

Binary features:
         y=1    y=0    total
x=1      H11    H10    H1•
x=0      H01    H00    H0•
total    H•1    H•0    N

General nominal features:
         y1     ...    ym     total
x1       H11    ...    H1m    H1•
...      ...           ...    ...
xn       Hn1    ...    Hnm    Hn•
total    H•1    ...    H•m    N
in block 1 shows, with 0.91, a similarly high value as the Pearson correlation (0.93).
Also between the two non-windowed MFCCs 1 and 2 there is nearly the same
low rank correlation (−0.12) as the Pearson correlation (−0.11). This shows the
close connection of the two concepts of correlation.
For nominal features so-called contingency coefficients are in use instead of cor-
relation coefficients. A typical example is the so-called φ coefficient.
Definition 9.24 (φ coefficient). The φ coefficient (also referred to as the “mean
square contingency coefficient”) is a measure of association for, e.g., two binary
features. For the definition of the φ coefficient, consider the so-called contingency
table. The contingency table for binary features x and y is defined as in Table 9.2
(left), where H11, H10, H01, H00 are the absolute frequencies (“cell counts”) that sum
to N, the total number of observations.
Then, the φ coefficient is defined by
φ := (H11 H00 − H10 H01) / √(H1• H0• H•0 H•1).
In case of two general nominal features with n and m levels the contingency table
looks as in Table 9.2 (right). Then, the φ coefficient is defined by
φ := √( (1/N) ∑_{i=1}^{n} ∑_{j=1}^{m} (H_ij − H̃_ij)² / H̃_ij ) = √( ∑_{i=1}^{n} ∑_{j=1}^{m} (H_ij − H_i• H_•j/N)² / (H_i• H_•j) ),
where H_ij is the observed frequency in the (i, j)-th cell of the contingency table and
H̃_ij = H_i• H_•j / N is the so-called expected absolute frequency in the cell for stochas-
tically independent variables. The above formula for binary features is a special case
of this more general formula.
This measure is similar to the Pearson correlation coefficient in its interpretation.
In fact, the Pearson correlation coefficient calculated for two binary variables will
result in the above φ coefficient.
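The general formula can be implemented directly from a contingency table of absolute frequencies; the following Python sketch is an added illustration, applied to made-up binary counts, and agrees with the closed-form expression for the binary case.

import numpy as np

def phi_coefficient(H):
    """phi coefficient for a contingency table H of absolute frequencies (n x m array)."""
    H = np.asarray(H, dtype=float)
    N = H.sum()
    expected = np.outer(H.sum(axis=1), H.sum(axis=0)) / N   # H_i. * H_.j / N
    return np.sqrt(((H - expected) ** 2 / expected).sum() / N)

# binary example with made-up counts: rows x = 1, 0; columns y = 1, 0
H_binary = [[30, 10],
            [5, 55]]
print(round(phi_coefficient(H_binary), 2))   # ~0.68, same as (H11 H00 - H10 H01)/sqrt(H1. H0. H.0 H.1)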
Example 9.24 (Melody Generation1 ). Automatic melody generation is often carried
out by means of a Markov chain on note or pitch values. In a Markov chain, the
1 cp. http://en.wikipedia.org/wiki/Pop_music_automation. Accessed 13 March 2016.
Table 9.3: Transition Matrix (left) and Contingency Table (right)

Transition matrix:
note    A       C#      Eb
A       0.1     0.6     0.3
C#      0.25    0.05    0.7
Eb      0.7     0.3     0

Contingency table:
note    A       C#      Eb      total
A       44      207     98      349
C#      79      22      226     327
Eb      225     98      0       323
total   348     327     324     999
value at time point t only depends on the preceding value at time point t − 1. The
transitions from one value to the next are controlled by so-called transition probabil-
ities gathered in a transition probability matrix. This matrix is constructed row-wise,
note by note, by vectors containing the probabilities to switch from one specific note
to any other note (row sums = 1, see Table 9.3(left)). Note values are generated by
an algorithm based on the transition matrix probabilities. From the resulting Markov
chain we can generate a contingency table with the numbers of the realized transi-
tions (x = starting tone, y = next tone). Based on the contingency Table 9.3(right),
the φ coefficient is calculated as 0.75. That there is dependence is expected because
of the transition probabilities. This dependency should decrease, though, when taking
y = the “overnext realized tone” (i.e., the tone two steps ahead), and indeed then the
φ coefficient is calculated as 0.41.
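The following added sketch uses the transition matrix of Table 9.3 to generate a note sequence from the Markov chain and to tabulate the realized transitions; the resulting contingency table can then be analyzed, e.g., with a φ coefficient function as sketched above. The sequence length and random seed are arbitrary choices.

import numpy as np

notes = ["A", "C#", "Eb"]
P = np.array([[0.10, 0.60, 0.30],      # transition probabilities from A
              [0.25, 0.05, 0.70],      # from C#
              [0.70, 0.30, 0.00]])     # from Eb

rng = np.random.default_rng(0)

def generate(length, start=0):
    """Generate a sequence of note indices from the Markov chain."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(rng.choice(3, p=P[seq[-1]]))
    return seq

seq = generate(1000)
# contingency table of realized transitions (x = starting tone, y = next tone)
H = np.zeros((3, 3), dtype=int)
for a, b in zip(seq[:-1], seq[1:]):
    H[a, b] += 1
print(H)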
[Figure 9.8: Estimates and confidence intervals when the sample size is increasing: non-windowed MFCC 1, guitar (x-axis: sample size).]
Note that τ(θ) may be equal to θ itself. The above formulation thus allows for
a transformation of θ to be estimated. As an example, let us now apply this general
definition to the estimation of the expected value and the variance.
Theorem 9.7 (Estimation of Expected Value and Variance). Let X_1, . . . , X_N be inde-
pendent random variables with identical expected values µ and variances σ². Let the
mean be the random variable X̄ := (1/N) ∑_{i=1}^{N} X_i. Then,
• E[X̄] = µ and var[X̄] = σ²/N.
• Let x_i be a realization of X_i. Then, µ̂ = x̄ = (1/N) ∑_{i=1}^{N} x_i is an estimator for the
expected value of the X_i with shrinking variance for N → ∞.
• An analogue estimator for the variance is σ̂² = (1/N) ∑_{i=1}^{N} (x_i − x̄)².
• An unbiased estimator for the variance is s² = (1/(N−1)) ∑_{i=1}^{N} (x_i − x̄)².
• A (1 − α)·100% confidence interval for µ with unknown σ and independent iden-
tically N(µ, σ²)-distributed random variables X_i is given by:
[ x̄ − t_{N−1;1−α/2} · s/√N , x̄ + t_{N−1;1−α/2} · s/√N ],
where s is the above estimator for the standard deviation of the X_i, t_{N−1;1−α/2} the
(1 − α/2) quantile of a t-distribution with N − 1 degrees of freedom, and α is
typically 0.05 or 0.01.
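An added Python sketch of the last item, using the t quantile from SciPy; the simulated data are placeholders for, e.g., the non-windowed MFCC 1 values.

import numpy as np
from scipy.stats import t

def mean_confidence_interval(x, alpha=0.05):
    """(1 - alpha)100% confidence interval for the expected value with unknown sigma."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    xbar = x.mean()
    s = x.std(ddof=1)                            # square root of the unbiased variance estimator
    t_crit = t.ppf(1 - alpha / 2, df=N - 1)      # (1 - alpha/2) quantile of t_{N-1}
    half = t_crit * s / np.sqrt(N)
    return xbar - half, xbar + half

sample = np.random.default_rng(2).normal(loc=3.0, scale=1.0, size=50)  # placeholder data
print(mean_confidence_interval(sample))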
Example 9.25 (Effect of Increasing Sample Size). Let us look again at the data in
Example 9.9 and study the effect of increasing the sample size. We study the estimates
of the expected value and standard deviation as well as the corresponding confidence
intervals of the non-windowed MFCC 1, guitar only, in dependence of the sample
size (see Figure 9.8). We see that the mean is nearly stable from sample size 1200
on, whereas the confidence interval is continuously shrinking.
t = (X̄ − µ_0) / √(s²/N) ∼ t_{N−1},
where s is the unbiased estimator of the standard deviation σ , and the test statistic t
is t-distributed with N − 1 degrees of freedom.
A typical corresponding pair of statistical hypotheses is:
H0 : µ = µ0 vs. H1 : µ 6= µ0 .
Two sample t-test: If all Xi are independently N (µX , σX2 )-distributed with un-
known variance, i = 1, . . . , N, and all Yi are independently N (µY , σY2 )-distributed
with unknown variance, i = 1, . . . , M, then analogous to the one sample case the test
statistic
t = ((X̄ − Ȳ) − δ_0) / √(s²_X/N + s²_Y/M)
can be used for the comparison of two expected values with unknown variances,
where sX and sY are the unbiased estimators of the standard deviations and N and M
are the corresponding sample sizes.
Here, we obviously do not test on equal expected values, but on a difference δ0 ,
and for H0 : µX − µY = δ0 the test statistic t is t-distributed with k degrees of freedom,
where
k = (s²_X/N + s²_Y/M)² / ( (1/(N−1)) (s²_X/N)² + (1/(M−1)) (s²_Y/M)² ).
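In practice one rarely computes k by hand; for example, scipy.stats.ttest_ind with equal_var=False performs this two-sample test (for δ_0 = 0) including the above degrees-of-freedom approximation. A hedged sketch with made-up data:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
x = rng.normal(loc=4.0, scale=1.0, size=40)    # e.g., a feature measured for guitar tones
y = rng.normal(loc=3.5, scale=1.5, size=55)    # the same feature measured for piano tones

# unknown and possibly unequal variances, H0: mu_X - mu_Y = 0
stat, p_value = ttest_ind(x, y, equal_var=False)
print(round(stat, 3), round(p_value, 3))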
H0 : H01 , . . . , H0k are all valid vs. H1 : H0i is not valid for at least one i
on the “global” significance level α. In such cases, the significance levels of the k
individual tests have to be adapted. A conservative possibility to do this adequately
is the usage of the significance level αk = α/k for each individual test (Bonferroni
correction). Such corrections are essential because of the following argument.
If k independent tests are each carried out on the significance level α and all k null
hypotheses are true, then for each individual test the probability to incorrectly reject
its hypothesis is α, i.e. the probability to (correctly) retain it is 1 − α. Since the tests
are independent, the probability to retain all k hypotheses is the product of the individual
probabilities, namely (1 − α)^k. Therefore, the probability to reject at least one of the
hypotheses incorrectly is 1 − (1 − α)^k. With an increasing number of tests, this error
probability is increasing. For example, for α = 0.05 and k = 100 independent tests
it takes the value 1 − (1 − 0.05)^100 = 0.994. In other words, testing 100 independent
correct hypotheses leads almost surely to at least one wrong significant result. This
makes significance level corrections like the one above necessary. Note that 1 − (1 −
0.05/100)^100 ≈ 0.04878 < 0.05 for the Bonferroni correction.
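The two error probabilities just mentioned can be verified in a few lines of Python (an added illustration):

alpha, k = 0.05, 100
familywise_uncorrected = 1 - (1 - alpha) ** k         # ~0.994: almost surely a false rejection
familywise_bonferroni = 1 - (1 - alpha / k) ** k      # ~0.04878 < alpha
print(round(familywise_uncorrected, 3), round(familywise_bonferroni, 5))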
Y = f (X1 , . . . , XK ; β1 , . . . , βL ) + ε
9.8.1 Regression
As a motivation of what will follow, consider the following example.
Example 9.29 (Fit Plot and Residual Plot). Let us come back to Example 9.22.
There, we observed a high correlation between the non-windowed MFCC 1 and
MFCC 1 in block 1. This indicates a linear relationship of the non-windowed MFCC
1 and MFCC 1 in block 1. We are interested in the linear model between the two
variables in order to be able to predict the non-windowed MFCC 1 by MFCC 1 in
block 1, i.e. by MFCC 1 in the beginning of the tone.
Let us, therefore, start with the simplest regression model for a linear relationship
between two variables.
Definition 9.31 (Linear Regression Model for One Influential Variable). The simple
so-called 2-variables regression model is of the form:
yi = β0 + β1 xi + εi , i = 1, . . . , N,
Let us now generalize this result for more than one influencing variable:
Definition 9.33 (Multiple Linear Regression Model). The multiple linear regression
model has the form y = X β + ε , where y = vector of the response variable with N
observations, X = matrix with entries xik for observation i of influential variable k, β
= vector of K + 1 unknown regression coefficients, and ε = error vector of length N.
Notice that typically
        ( 1   x_11   . . .   x_1K )
        ( 1   x_21   . . .   x_2K )
  X  =  ( .    .              .   )
        ( 1   x_N1   . . .   x_NK )
so that the first influential “variable” is assumed constant, i.e. a constant term is
included in the model. The following assumptions are assumed to be valid:
(A.1) X is non-stochastic with rank(X) = K + 1, i.e. all columns are linearly inde-
pendent,
Let us continue with important properties of LS-estimates. We will see that under
reasonable assumptions the LS-estimator is best unbiased, i.e. it is unbiased (see Sec-
tion 9.6) and has minimum variance. Then, we will derive confidence intervals for
the true model coefficients. Note that minimum variance of the LS-estimator guaran-
tees minimum length of confidence intervals. Last but not least, we will show how
the LS-estimator simplifies in the case of uncorrelated influential variables which can
be guaranteed in some time series models below.
Theorem 9.9 (Properties of LS-Estimates). Under the assumptions (A.1) – (A.3) the
LS-estimator is unbiased with minimum variance among the linear estimators of the
unknown coefficient vector β .
Under assumption (A.4) the (1 − α) · 100%-confidence interval for the unknown
coefficient β_i has the form:
[ β̂_i − t_crit √(var̂(β̂_i)) , β̂_i + t_crit √(var̂(β̂_i)) ],
where t_crit := t_{N−K;1−α/2} is the (1 − α/2)-quantile of the t-distribution with N − K
degrees of freedom and α is typically 0.05 or 0.01.
If the columns of X are uncorrelated, then β̂ = (X^T X)^(−1) X^T y = D X^T y, where
D is a diagonal matrix. In this case, the estimate of the coefficient for the influential
variable x_k is independent of the observations of the other variables.
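An added sketch of least-squares estimation in this model with NumPy; the data are simulated from a known coefficient vector so that the recovered estimates can be inspected, and the goodness-of-fit measure R² = 1 − var̂(ε)/var̂(y) is computed along the way.

import numpy as np

rng = np.random.default_rng(4)
N, K = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, K))])   # first column: constant term
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)            # y = X beta + eps

# LS estimator beta_hat = (X^T X)^{-1} X^T y, computed via a numerically stable solver
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat
r_squared = 1 - residuals.var() / y.var()
print(np.round(beta_hat, 3), round(r_squared, 3))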
R² = 1 − var̂(ε) / var̂(y).
Figure 9.9: Fit plot (left) and residual plot (right) of simple regression of
“mfcc unwin 1” on “mfcc.block1 1”.
Obviously, time is giving the data a “natural” structure and the time dependence
is decisive for the interpretation.
Example 9.31 (Music Observations as Time Series). Music is nothing but vibrations
generated by music instruments played over time. Therefore, musical signals can be
represented as time series in a natural way. Indeed, in Example 9.16 we already saw
a model for musical tone progression over time. However, this model considered a
sequence of discrete states instead of observations of quantitative signals. In this
section we will introduce time series models for quantitative vibrating signals like
the waveform of an audio signal or the MFCCs and the chroma features introduced
in Example 9.9.
There are lots of different models for time series data. Here, however, we will
concentrate on autoregressive and periodical models which are most often used for
modeling musical time series.
Definition 9.40 (Time Series Models). The model y[t] = β1 +β2 y[t −1]+ε[t], |β2 | <
1, where ε ∼ i.i.N (0, σ 2 ), is called a (stationary) 1st-order autoregressive model
(AR(1)-model). In such a model, the value of Y in time period t linearly depends on
its value in time period t − 1, i.e. the value of Y with time lag 1.
The model y[t] = β_1 + β_2 y[t−1] + . . . + β_{p+1} y[t−p] + ε[t], where ε ∼ i.i.N(0, σ²),
is called p-th-order autoregressive model (AR(p)-model). If all roots of the so-called
characteristic polynomial have absolute value greater than 1, meaning that |z| > 1
for all z with 1 − β_2 z − β_3 z² − . . . − β_{p+1} z^p = 0, this model is stationary. Obviously,
p is the maximal involved time lag.
A model is called stationary if its predictions have expected values, variances,
and covariances that are invariant against shifts on the time axis, i.e. that do not
depend on time. This means that E[Ŷ [t]] and var[Ŷ [t]] are constant for all t and the
cov(Ŷ [t], Ŷ [s]) only depend on the difference t − s.
Please notice that the conditions |β_2| < 1 and “all roots of the characteristic poly-
nomial 1 − β_2 z − β_3 z² − . . . − β_{p+1} z^p have absolute value greater than 1” both guar-
antee the stationarity of the autoregressive model. Moreover, the two conditions are
equivalent for AR(1)-models, since for 1 − β_2 z = 0 obviously |z| > 1 is equivalent
to |β_2| < 1.
Indeed, an AR(1)-model only represents a damped waveform if |β2 | < 1. The
notion autoregression is based on the fact that a variable is regressed on itself in a
previous time period.
Autoregressive models relate to autocorrelation already introduced in Sections 4.8
and 2.2.7.
Definition 9.41 (Autocorrelation). The autocorrelation coefficient of order p is de-
fined as the correlation coefficient of y[t] with its lag y[t − p]. One can show that
in a stationary AR(1)-model the coefficient of y[t − 1] is equal to the 1st-order auto-
correlation coefficient. In case of stationarity, the empirical 1st-order autocorrelation
coefficient looks as follows:
r_{y[t],y[t−1]} := ∑_{k=2}^{T} (y[k] − ȳ)(y[k − 1] − ȳ) / s²_y.
Note that in the autocorrelation function in Section 4.8 it is implicitly assumed
that y[t] is centered at 0 and the variance normalization is ignored (see also the dis-
cussion in Section 2.2.7).
Example 9.32 (Stationary AR(1)-Models). Figure 9.10 illustrates that an AR(1)-
model with a positive coefficient β_2 (positive autocorrelation) causes a slow oscilla-
tion after a short “attack” phase and a negative coefficient β_2 causes a “nervous”
oscillation (negative autocorrelation), both around β_1/(1 − β_2). Independent of the starting
value of the oscillation, in the long run the model will “converge” to this value in that
the expected value of the model prediction will be constant β_1/(1 − β_2) for t big enough.
Note that positive autocorrelation relates to low-pass filtering and negative auto-
correlation to high-pass filtering (cp. Example 4.2).
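The behavior described in this example can be reproduced by simulating the AR(1) recursion directly; the sketch below is an added illustration and the parameter values β_1 = 2 and β_2 = ±0.9 are arbitrary choices.

import numpy as np

def simulate_ar1(beta1, beta2, T=1000, sigma=0.5, seed=0):
    """Simulate y[t] = beta1 + beta2 * y[t-1] + eps[t] with Gaussian noise."""
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    y[0] = beta1 / (1 - beta2)            # start at the long-run level beta1/(1 - beta2)
    for t in range(1, T):
        y[t] = beta1 + beta2 * y[t - 1] + rng.normal(scale=sigma)
    return y

y_pos = simulate_ar1(beta1=2.0, beta2=0.9)     # slow oscillation around 2/(1 - 0.9) = 20
y_neg = simulate_ar1(beta1=2.0, beta2=-0.9)    # "nervous" oscillation around 2/1.9 ~ 1.05
# empirical first-order autocorrelation of the positively correlated series
r1 = np.corrcoef(y_pos[1:], y_pos[:-1])[0, 1]
print(round(r1, 2))                            # close to beta2 = 0.9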
Unbiasedness can be proven for the estimates in “static” linear models which are
independent of time (cp. Section 9.6). In contrast, time dependency as in time series
models, often also called dynamics, typically leads to situations where unbiasedness
cannot be expected. This is then replaced by so-called asymptotic properties, i.e.
properties which are only valid for T → ∞. Typical such properties are consistency
and asymptotic normality, which are valid for least-squares estimates of the coeffi-
cients of stationary AR(p)-models.
Definition 9.42 (Consistency and Asymptotic Normality). Let θ be an unknown
parameter vector of a statistical distribution. An estimator t_T of g(θ) ∈ R^q based on
T repetitions of the corresponding random variable is called consistent iff for all η >
0: P(‖t_T − g(θ)‖ > η) → 0 for T → ∞, which is often written as plim(t_T) = g(θ),
where plim stands for probability limit.
[Figure 9.10: Simulated time series of an AR(1) model with positive autocorrelation (left) and with negative autocorrelation (right).]
function g is the smallest period P. Notice that each integer multiple of the base
period is again a period of the function g. The frequency f of g(t) is the inverse of
the base period P, i.e. f = 1/P. Typically, frequencies are represented in the unit Hertz
(Hz), i.e. in number of oscillations per second.
Examples for periodic functions are the so-called harmonic oscillations.
Definition 9.44 (Harmonic Models). A simple (harmonic) oscillation is defined by
the model
y[t] = β_1 + β_2 cos(2π (f/f_s) t) + β_3 sin(2π (f/f_s) t) + ε[t],
where ε[t] ∼ i.i.N(0, σ²), f is the frequency of the oscillation, f_s := sampling rate
:= number of observations in a desired time unit, and β_2, β_3 are the amplitudes of
cosine and sine, respectively. If the time unit is a second, f_s is also measured in Hz.
Frequently, oscillations with different frequencies are superimposed. This leads to a
model of the form:
y[t] = β_1 + ∑_{k=1}^{K} ( β_{2k} cos(2π (f_k/f_s) t) + β_{2k+1} sin(2π (f_k/f_s) t) ) + ε[t].
Note that such harmonic oscillations are not damped, i.e. go on in the same way
forever. Decisive for the adequacy of the model is the correct choice of the fre-
quencies f_k. Harmonic oscillations have the favorable property that the influence
of so-called Fourier frequencies f_µ = µ f_s/T can be determined independently of each
other, when µ = 1, . . . , T/2. Then 0 ≤ f_µ ≤ f_s/2. f_s/2 is also called Nyquist frequency.
Note that f_µ/f_s = µ/T. These oscillations do not influence each other, they are uncorre-
lated. Therefore, important frequencies of this type can be determined independently
of each other, e.g., one can individually check those frequencies which make sense
by substantive arguments. In the following, we will always denote the Fourier fre-
quencies by f_µ = µ f_s/T. Note, however, that important frequencies will usually not
be Fourier frequencies. Therefore, in general we cannot make use of the regres-
sion simplification for uncorrelated influential variables mentioned in Theorem 9.9,
since the estimate of the amplitude of a non-Fourier frequency depends on the other
frequencies.
Example 9.34 (Polyphonic Sound (see [2])). In general, the oscillations of the air
generated by a music instrument are harmonic, i.e. they are composed of a fun-
damental frequency and so-called overtone frequencies, which are multiples of the
fundamental frequency. The tones related to the fundamental frequency and to the
overtone frequencies are called partial tones. In a polyphonic sound the partial
tones of all involved tones are superimposed. For the identification of the individ-
ually played tones, a general model is proposed for J simultaneously played tones
with varying numbers M j of partial tones. This model will be introduced here in the
special case of constant volume over the whole sound length:
y[t] = ∑_{j=1}^{J} ∑_{m=1}^{M_j} ( a_{j,m} cos(2π (m + δ_{j,m}) (f_{0j}/f_s) t) + b_{j,m} sin(2π (m + δ_{j,m}) (f_{0j}/f_s) t) ) + ε[t],
y[t] = ȳ + (2/f_s) ∑_{µ=1}^{M̃} C_{f_µ} cos(2π (f_µ/f_s) t) + (2/f_s) ∑_{µ=1}^{M̃} S_{f_µ} sin(2π (f_µ/f_s) t)   if f_s = 2M̃ + 1,
y[t] = ȳ + (2/f_s) ∑_{µ=1}^{M̃−1} C_{f_µ} cos(2π (f_µ/f_s) t) + (2/f_s) ∑_{µ=1}^{M̃−1} S_{f_µ} sin(2π (f_µ/f_s) t) + (1/f_s) C_{f_s/2} cos(πt)   if f_s = 2M̃.
Figure 9.11: Normed periodogram for model y[t] = 2 sin(2π f · t/44100) + sin(4π f ·
t/44100); dashed vertical line = true frequency.
Note that the inverse of the discrete Fourier transform indicates that for the time
representation of a time series we only need a finite number of (Fourier) frequencies.
Also note that 0 < f_µ/f_s < 0.5 in the time representation so that Definition 9.45 of
Fourier transforms is sufficient. Therefore, the coefficients of the Fourier frequencies
defined by the Fourier transform fully characterize the frequency behavior of the time
series. In order to have real and not complex numbers, typically the squared absolute
values corresponding to the Fourier frequencies are considered.
Definition 9.47 (Periodogram). The squared absolute value of the discrete Fourier
transform is called periodogram:
I_y(f_µ) := | ∑_{t=1}^{T} y[t] cos(2π (f_µ/f_s) t) − i ∑_{t=1}^{T} y[t] sin(2π (f_µ/f_s) t) |² = C²_{f_µ} + S²_{f_µ}.
1, . . . , N, k = 1, . . . , K, of the PCs are called scores. The vector z_k of the scores of the
k-th PC Z_k has the form z_k = X g_k so that z_{ik} = (x_{i1} − x̄_1) g_{1k} + . . . + (x_{iK} − x̄_K) g_{Kk}.
Notice that orthogonality of score vectors is equivalent to empirical uncorrelation
of score vectors. Moreover, notice that the length restriction of the loading vectors
is necessary since the empirical variance of the score vectors increases quadratically
with the length of the loading vector. PCs are often interpreted as implicit (latent)
features, since they are not observed but derived from the original features.
The loadings of the PCs are constructed by means of a property of covariance
matrices.
Theorem 9.10 (Construction of Principal Components). The empirical covariance
matrix S := X^T X / (N − 1) of the mean-centered features in X can be transformed by means
of the so-called spectral decomposition into a diagonal matrix², where a matrix G is
constructed so that G^T S G = Λ, where G^T G = I and Λ is a diagonal matrix whose
elements are 0 except on the main diagonal: λ_11 ≥ . . . ≥ λ_KK ≥ 0.
This matrix G := [g_1 . . . g_K] satisfies the properties of a loading matrix since
Λ = G^T S G = G^T X^T X G / (N − 1) = Z^T Z / (N − 1). Thus, the columns of Z := [Z_1 . . . Z_K], i.e. the score
vectors of the PCs, are uncorrelated and var̂ Z_1 = λ_11 ≥ . . . ≥ λ_KK = var̂ Z_K ≥ 0.
All K PCs together span the same K-dimensional space as the original K features.
A PCA can, however, be used for dimension reduction. For this, the number of
dimensions relevant to “explain” most of the variance in the data are determined by
a dimension reduction criterion. A typical simple criterion is the share r p of the first
p PCs on the “total variation in the data”, i.e. the ratio of the variance of the first p
PCs and the total variance. Since the PCs are empirically uncorrelated, the empirical
variances just add up to the total variation so that
r_p = (var̂ Z_1 + var̂ Z_2 + . . . + var̂ Z_p) / (var̂ Z_1 + var̂ Z_2 + . . . + var̂ Z_K).
Since the first PCs represent the biggest share of the total variance, a criterion like
r p ≥ 0.95 often leads to a drastic dimension reduction so that an adequate graphi-
cal representation is possible. The “latent observations” zik , i.e. the scores, of the
first few PCs are then often plotted in order to identify structures or groups in the
observations. Notice that the absolute distances between the score values are not
interpretable.
Also notice that PCA is not scale invariant so that the result may change also in
interpretation if the units of the features change, e.g., from MHz, db, and seconds
to Hz, Joule, and milliseconds. Typical kinds of PCA are PCA on the basis of co-
variances, where “natural units” are assumed, and PCA on the basis of correlations,
where all features are standardized to variance 1.
A further disadvantage of the PCA is the fact that weighted sums are most of the
time not interpretable. In such cases, results might even be useless for a user.
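A PCA along the lines of Theorem 9.10 can be carried out with a spectral decomposition of the empirical covariance matrix. The following added sketch computes loadings, scores, and the variance shares r_p on simulated toy data; the 0.95 threshold follows the criterion mentioned above.

import numpy as np

rng = np.random.default_rng(5)
X_raw = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated toy features
X = X_raw - X_raw.mean(axis=0)                                # mean-centered feature matrix

S = X.T @ X / (X.shape[0] - 1)                                # empirical covariance matrix
eigval, G = np.linalg.eigh(S)                                 # spectral decomposition (ascending)
order = np.argsort(eigval)[::-1]
eigval, G = eigval[order], G[:, order]                        # lambda_11 >= ... >= lambda_KK, loadings G

Z = X @ G                                                     # scores of the principal components
r = np.cumsum(eigval) / eigval.sum()                          # r_p for p = 1, ..., K
p = int(np.argmax(r >= 0.95)) + 1                             # smallest p with r_p >= 0.95
print(p, np.round(r, 3))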
² The spectral decomposition can be seen as a singular value decomposition (SVD) for square, symmetric matrices.
for this variable. Biplots are often used to interpret structure in score plots by identi-
fying original variables related to the structure (see next example).
Example 9.37 (Scores Plots and Bi-Plots (Example 9.36 cont.)). Let us once more
consider the data in Example 9.9, this time the 14 chroma variables of block 1. A
principal component analysis of these data leads to a share of 93% of the variance
of the first two principal components on the basis of covariances. A scatterplot of
the first two principal components can be seen in Figure 9.12. Obviously, there is
structure in the plot, namely, there are (nearly) strict bounds for the realizations.
Figure 9.12: First 2 principal components of 14 chroma vectors - left: scores plot;
right: biplot.
[Figure 9.13: Original features: scatterplot of the first two chroma elements (chroma 1 and chroma 2 in block 1).]
The biplot shows that the triangle legs nearly correspond to the directions of the first
two chroma elements (for the fundamental and the first overtone). The other chroma
elements all have very small contributions. This leads to the idea of plotting the
first two original chroma elements against each other (see Figure 9.13), leading to
a plot very similar to the score plot of the first two principal components but
somewhat rotated. This plot is easily explained by the fact that the sum of the
two components never exceeds 1.
9.9 Further Reading
A very good introduction to the theory of statistics, but without music examples, can
be found in [4].
Bibliography
[1] Y. Cho and L. Saul. Learning dictionaries of stable autoregressive models for
audio scene analysis. In L. Bottou and M. Littman, eds., Proceedings of the 26th
International Conference on Machine Learning, pp. 169–176, Montreal, 2009.
Omnipress.
[2] M. Davy and S. Godsill. Bayesian harmonic models for musical pitch estimation
and analysis. Technical Report 431, Cambridge University, Engineering Depart-
ment, Cambridge, 2002. Published in http://www-labs.iro.umontreal.ca/~pift6080/H08/documents/papers/davy_bayes_extraction.pdf.
[3] M. Eichhoff, I. Vatolkin, and C. Weihs. Piano and guitar tone distinction based
on extended feature analysis. In A. Giusti, G. Ritter, and M. Vichi, eds., Classi-
fication and Data Mining, pp. 215–224. Springer, 2013.
[4] A. Mood, F. Graybill, and D. Boes. Introduction to the Theory of Statistics.
McGraw-Hill, Singapore, 1974.
[5] C. Raphael. Music plus one and machine learning. In J. Fürnkranz and
T. Joachims, eds., Proceedings of the 27th International Conference on Machine
Learning (ICML-10), pp. 21–28, Haifa, 2010. Omnipress.
[6] I. Xenakis. Formalized Music. Pendragon Press, New York, 1992.
Chapter 10
Optimization
G ÜNTER RUDOLPH
Department of Computer Science, TU Dortmund, Germany
10.1 Introduction
In music data analysis there are numerous occasions to apply optimization techniques
to achieve a better performance, e.g. in the classification of tunes into genres, in
recognizing instruments in tunes, or in the generation of playlists. Many musical
software products and applications are based on optimized methods in the field of
music data analysis which will be detailed in later chapters. This chapter concentrates
on the introduction of terminology in the field of optimization and the description of
commonly applied optimization techniques. Small examples illustrate how these
methods can be deployed in music data analysis tasks.
Before introducing the basic concepts in a formal manner, we provide a brief ab-
stract overview of the topics covered in this chapter. Let the map f : X → Y describe
the input/output behavior of some system from elements x ∈ X to elements y ∈ Y .
Here, it is assumed that the map is deterministic and time-invariant (i.e., every spe-
cific input always yields the same specific output). The task of optimization is to find
one or several inputs x∗ ∈ X such that the output f (x∗ ) ∈ Y exhibits some extremal
property in the set Y . In most cases Y is a subset of the set of real numbers R and the
extremal property requires that f (x∗ ) is either the maximum or the minimum value
in Y ⊂ R, where the map f (x) is termed a real-valued objective function. The task to
find such an element x∗ is known under the term single-objective optimization. The
situation changes if the map is vector-valued with elements from Rd where d ≥ 2. In
this case a different extremal property has to be postulated, which in turn requires dif-
ferent optimization methods. This problem is known under the term multi-objective
optimization.
The main difference between single- and multi-objective optimization rests on
the fact that two distinct elements from Y are not guaranteed to be comparable in the
latter case since Y is only partially ordered. To understand the problem to its full extent
it is important to keep in mind that the values f1 (x), f2 (x), . . . , fd (x) of the d ≥ 2
real-valued objective functions represent incommensurable quantities that cannot be
Notice that it is sufficient to consider only minimization problems since every max-
imization problem can be equivalently reformulated and solved as a minimization
problem and vice versa: max{ f(x) : x ∈ X } = −min{ −f(x) : x ∈ X }.
A decision vector dominates another decision vector if their images do. Two distinct
objective vectors f(x_1) and f(x_2) are called comparable if either f(x_1) ≺ f(x_2) or
f(x_2) ≺ f(x_1). Otherwise they are termed incomparable, denoted f(x_1) ∥ f(x_2).
For example, in case of two objectives (to be minimized simultaneously) we have
(3, 5)^T ≺ (4, 6)^T and (3, 5)^T ≺ (3, 6)^T, but (3, 5)^T ∥ (2, 6)^T.
large as min(n, d − 1). In most practical cases we have n ≫ d so that the solution
sets are typically (d − 1)-dimensional manifolds. Thus, one typically obtains (possi-
bly disconnected) curves for bi-objective problems (d = 2), surfaces for tri-objective
problems (d = 3), 3D objects for quad-objective problems (d = 4), and so forth. The
visualization of the Pareto front does not pose any problem up to three objectives. In
case of higher-dimensional objective spaces, some pointers to the literature are given
in Section 10.5.
bit position of x and x̃. Therefore ρ(·, ·) coincides with the Hamming distance (cp.
Section 11.2).
The insertion of this kind of neighborhood in Definition 10.2 for X = Bn and k = 1
specifies the notion of local and global optimality in pseudo-Boolean optimization.
Example 10.2 (Feature Selection). The classification error of some classifier during
the training phase depends on the training data and the features calculated for each
element of the training data. Suppose there are n different features available and let
xi = 1 indicate that feature i ∈ {1, . . . , n} is used during classification whereas xi = 0
indicates its non-consideration. Thus, vector x ∈ Bn represents which features are
applied for classification and if f : Bn → R+ measures the error depending on the
used features, the task of feature selection consists of finding that set of features for
which the classifier exhibits least error. Formally, we like to find x∗ = argmin{ f (x) :
x ∈ Bn }.
Since Bn has finite cardinality 2n the exact optimum can be found by a com-
plete enumeration of all possible input/output pairs, but this approach is not efficient
for large dimensionality n and therefore prohibitive for almost all practical cases.
Nevertheless, there are efficient exact optimization methods if the objective function
exhibits special properties.
Example 10.3 (Feature Selection (cont’d)). Suppose we know that the error rate f (x)
in Example 10.2 is an additively decomposable objective function so that Equation
(10.1) reduces to a linear pseudo-Boolean function
f(x_1, . . . , x_n) = ∑_{i=1}^{n} c_i · x_i
all solutions within the finite neighborhood in an arbitrary order. In case of the first
improvement heuristic (Algorithm 10.1), the new solution is accepted as soon as
a better solution than the current one is found. In contrast, the best improvement
heuristic (Algorithm 10.2) first evaluates all solutions in the neighborhood and ac-
cepts that one with the best improvement.
growing exponentially fast for increasing k. Therefore, local search methods typi-
cally work with small neighborhoods, i.e., with small k.
Example 10.4 (Feature Selection (cont’d)). Suppose the error rate of Example 10.2
depends on n = 3 potential features and is given by the (unknown) objective function
f (x) = x1 + x2 + x3 − 4 x1 x2 x3 + 1, (10.2)
which is not additively decomposable. Let us employ a 1-bit neighborhood for mini-
mizing the objective function from Equation (10.2) with the first improvement heuris-
tic where N1 (x) is enumerated by inverting the bits of x from left to right. Unless
we start already at x∗ = 111 with the global minimum f(111) = 0, only the starting point
x = 011 leads to the global minimum, whereas all other starting points lead to the
local optimum f(000) = 1.
The situation changes when using the best improvement heuristic that determines
the objective function values for all elements in the neighborhood before choosing
that element with maximum improvement (using some tie-breaking mechanism if nec-
essary).
When we apply this method to the objective function from Equation (10.2) with
a 1-bit neighborhood, then all starting points in N1 (x∗ ) with |N1 (x∗ )| = 4 lead to
the global solution x∗ = 111, whereas the remaining starting points B³ \ N_1(x∗) with
cardinality 2³ − 4 = 4 end up in the local solution. A 50:50 chance for a uniformly
distributed starting point seems acceptable. But if the error rate were given by
the generalized version
f(x) = ∑_{i=1}^{n} x_i − (n + 1) ∏_{i=1}^{n} x_i + 1                    (10.3)
of the objective function from Equation (10.2), we would have n + 1 starting points
leading to the global solution and 2^n − (n + 1) starting points leading to the local
solution. Larger neighborhoods improve the relation between good and bad starting
points, but they also require substantially more objective function evaluations per
iteration.
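A small Python sketch of the first improvement heuristic with a 1-bit neighborhood, applied to the objective function from Equation (10.2); it is an added illustration and reproduces the behavior described above (only the starting point 011 reaches the global minimum 111).

def f(x):
    # objective function from Equation (10.2)
    return x[0] + x[1] + x[2] - 4 * x[0] * x[1] * x[2] + 1

def first_improvement(x):
    """1-bit-neighborhood local search; bits are inverted from left to right."""
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            y = list(x)
            y[i] = 1 - y[i]              # invert bit i
            if f(y) < f(x):              # accept the first improving neighbor
                x = y
                improved = True
                break
    return tuple(x), f(x)

for start in [(0, 1, 1), (1, 0, 1), (0, 0, 1)]:
    print(start, "->", first_improvement(list(start)))
# (0,1,1) ends in the global minimum (1,1,1); the other starts end in the local optimum (0,0,0)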
These basic local search methods may be extended in various ways: Suppose
that the neighborhood radius k is not fixed but depends on the iteration counter t ≥ 0,
i.e., kt ∈ {1, . . . , n} for t ∈ N0 . In this case the two local search methods above are
variants of variable neighborhood search.
If the local search method is restarted with a new starting point randomly chosen
after a certain number of iterations or depending on some other restart strategy, these
methods are called local multistart or local restart strategies.
If the new starting point of a local restart strategy is not chosen uniformly at
random from Bn but in a certain distance from the current local solution, the resulting
restart approach is termed iterated local search.
Apart from these local methods there are also sophisticated global methods pro-
vided by solvers like CPLEX or COIN based on branch-and-bound or related ap-
proaches. If such solvers are at your disposal and the problem dimension is not too
high it would be unjustifiable to ignore these methods.
If such solvers are not at your disposal or these solvers need too much time and
global solutions are not required then we might resort to metaheuristics like evo-
lutionary algorithms (EAs), whose algorithmic design is inspired by principles of
biological evolution.
A feasible solution x ∈ Bn of an EA is considered as a chromosome of an indi-
vidual that may be perturbed at random by a process called mutation. The objective
function f (x) is seen as a fitness function to be optimized. Typically, an individ-
ual x ∈ Bn is mutated by inverting each of the n bits independently with mutation
probability p ∈ (0, 1) ⊂ R. This mutation operation can be expressed in terms of an
Typical stopping criteria used in EAs are the exceedance of a specific maximum
number of iterations or the observation that there was no improvement in the objec-
tive function value within a prescribed number of iterations.
In most cases the mutation probability is set to p = 1/n, resulting in a single bit
mutation on average. Nevertheless, provided that 0 < p < 1 any number of mutations
from 0 to n is possible with nonzero probability. This observation reveals that the
(1+1)-EA may be seen as a randomized version of the variable neighborhood search
instantiated with the first improvement heuristic.
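A minimal sketch of such a (1+1)-EA in R, again for the objective from Equation (10.2); the stopping criterion is simply a maximum number of iterations, and the parameter defaults are illustrative.

f <- function(x) sum(x) - 4 * prod(x) + 1

one_plus_one_ea <- function(f, n, max_iter = 1000, p = 1 / n) {
  x <- sample(0:1, n, replace = TRUE)                 # random initial chromosome
  fx <- f(x)
  for (t in seq_len(max_iter)) {
    flip <- runif(n) < p                              # flip each bit independently with prob. p
    y <- ifelse(flip, 1 - x, x)                       # mutated offspring
    fy <- f(y)
    if (fy <= fx) { x <- y; fx <- fy }                # keep the better of parent and offspring
  }
  list(x = x, f = fx)
}

set.seed(1)
one_plus_one_ea(f, n = 3)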
But the crucial ingredient that distinguishes evolutionary algorithms from other
optimization methods is the deployment of a population of individuals in each itera-
tion of the algorithm.
There are many degrees of freedom for an instantiation of the algorithmic skele-
ton above. Typically, the variation of the parents is done by recombination of two
parents (also called crossover) with subsequent mutation. Whereas mutation can be
realized as in the (1 + 1)-EA, the crossover operation requires some inspiration from
biology. A simple version (called 1-point crossover) chooses two distinct parents at
random, draws an integer k uniformly at random between 1 and n − 1, and compiles
a preliminary offspring by taking the first k entries from the first and the last n − k
entries from the second parent. This may be generalized in an obvious manner to
multiple crossover points. An extreme case is termed uniform crossover where each
entry is chosen independently from the first or second parent with the same probabil-
ity. Of course, these recombination operations are not limited to a binary encoding;
rather, they may be applied accordingly for any kind of encoding based on Cartesian
products.
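The two recombination operators can be sketched in a few lines of R; the parent chromosomes below are illustrative.

one_point_crossover <- function(p1, p2) {
  k <- sample(seq_len(length(p1) - 1), 1)             # crossover point between 1 and n - 1
  c(p1[1:k], p2[(k + 1):length(p2)])                  # first k entries from p1, rest from p2
}

uniform_crossover <- function(p1, p2) {
  take_first <- runif(length(p1)) < 0.5               # each entry from p1 or p2 with prob. 1/2
  ifelse(take_first, p1, p2)
}

set.seed(2)
p1 <- c(1, 1, 1, 1, 1, 1); p2 <- c(0, 0, 0, 0, 0, 0)
one_point_crossover(p1, p2)
uniform_crossover(p1, p2)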
The selection operations are independent from the encoding as they are typically
solely based on the fitness values. Suppose the population consists of µ individu-
als. An individual is selected by binary tournament selection if we draw two parents
at random from the population and keep the one with the better fitness value. This
process may be iterated as often as necessary. But notice that a finite number of iter-
ations does not guarantee that the current best individual gets selected. If it got lost,
then the worst of the selected individuals is replaced by the current best individual.
This kind of ‘repair mechanism’ is termed 1-elitism. In general, if some selection
method guarantees that the current best individual will survive the selection process,
then it is called an elitist selection procedure. Elitism is guaranteed if we generate
λ offspring from µ parents and select the µ best (based on fitness) from parents and
offspring; this method is called (µ + λ )-selection. If the µ new parents are selected
only from the λ offspring (where λ > µ), elitism is not guaranteed and this method
is termed (µ, λ )-selection or truncation selection.
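The selection schemes can be sketched as follows (fitness values to be minimized; the numbers are illustrative).

binary_tournament <- function(fitness) {
  candidates <- sample(seq_along(fitness), 2, replace = TRUE)
  candidates[which.min(fitness[candidates])]          # index of the fitter of the two
}

mu_plus_lambda <- function(parent_fitness, offspring_fitness, mu) {
  pool <- c(parent_fitness, offspring_fitness)        # parents compete with their offspring
  order(pool)[seq_len(mu)]                            # indices of the mu best in the joint pool
}

set.seed(3)
fitness <- c(3.1, 0.7, 2.5, 1.2)                      # fitness of mu = 4 parents
binary_tournament(fitness)                            # one selected index
mu_plus_lambda(fitness, c(0.9, 2.0, 0.1, 4.2), mu = 4)

For (µ, λ)-selection one would rank the offspring fitness values alone.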
In such a scenario, this approach is not applicable and one has to resort to iterative optimization
methods.
The algorithmic pattern of descent methods is x(k+1) = x(k) + s(k) d (k) for k ≥ 0
and some starting point x(0) ∈ X. First, a descent method determines a descent direc-
tion for the current position and moves in that direction until no further improvement
can be made. Then it determines a new direction of descent and the process repeats.
The wide variety of descent methods differs in the individual choices of the step sizes s^(k) ∈ R_+ and the descent directions d^(k) ∈ R^n. If the objective function is differentiable, it is easily verified whether a chosen direction is a direction of descent.
Theorem 10.1. If f : X → R is differentiable, then d ∈ R^n is a direction of descent if and only if d^T ∇f(x) < 0 for all x ∈ S.
Popular representatives of the derivative-based approach are gradient methods
which use the negative gradient as direction of descent and differ by their step size
rules.
Gradient Method   According to Theorem 10.1 the negative gradient is a direction of descent since d^T ∇f(x) = −∇f(x)^T ∇f(x) = −‖∇f(x)‖² < 0 for all x ∈ S if d = −∇f(x) ≠ 0. Therefore, the gradient method instantiates the algorithmic pattern of descent methods to x_{t+1} = x_t − s_t ∇f(x_t), where the step size s_t = α^k with α, γ ∈ (0, 1) and

k = \min\{\, i \in \mathbb{N}_0 : f(x_t + \alpha^i \cdot d) \le f(x_t) + \gamma \cdot \alpha^i \cdot d^T \nabla f(x_t) \,\}
is chosen according to the so-called Armijo rule. Alternative step size rules are, for
example, the Goldstein or Wolfe–Powell rules. Regardless of the step size rule the
gradient method can only locate local minima.
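The following R sketch instantiates the gradient method with the Armijo rule; the quadratic test function and the parameter values α = 0.5 and γ = 0.1 are our own illustrative choices.

f  <- function(x) sum((x - c(1, 2))^2)                # simple smooth test function
gr <- function(x) 2 * (x - c(1, 2))                   # its gradient

gradient_method <- function(f, gr, x, alpha = 0.5, gamma = 0.1, tol = 1e-8) {
  repeat {
    g <- gr(x)
    if (sqrt(sum(g^2)) < tol) return(x)               # (near-)stationary point reached
    d <- -g                                           # negative gradient as descent direction
    i <- 0                                            # Armijo rule: smallest i in N_0 with
    while (f(x + alpha^i * d) > f(x) + gamma * alpha^i * sum(d * g)) i <- i + 1
    x <- x + alpha^i * d                              # step size s_t = alpha^i
  }
}

gradient_method(f, gr, x = c(-3, 5))                  # converges to the minimizer c(1, 2)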
If the objective function is representable as a sum of N sub-functions, i.e.,
f(x) = \sum_{i=1}^{N} f_i(x) \quad \text{for } f_i : \mathbb{R}^n \to \mathbb{R} \text{ and } i = 1, \ldots, N,
a specific and typically randomized variant of the gradient method may come into
operation. In principle, the stochastic gradient method (Algorithm 10.5) may be
considered as an inexact gradient method that uses approximations ∇ f (x) + e(x) of
the gradient with some unknown additive error function e(x). Inexact gradients may
accelerate the convergence velocity if the objective function is ill-conditioned. But
if the objective function is sufficiently well conditioned, inexact gradients may slow
down the approach to the optimum considerably.
In contrast to the plain gradient method, which uses the sum of all sub-functions' gradients in every iteration, the stochastic gradient method may use any partial sum of the sub-functions' gradients for updating the parametrization w. The update can be made after the
classification of each tune or after a certain number of tunes has been classified. The
order of the tunes should be randomly shuffled to provide the chance to escape from
local optima.
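A sketch of the stochastic gradient method for an objective that decomposes into one sub-function per observation; here the sub-functions are squared errors of a small synthetic linear model, standing in for the classification losses of single tunes.

set.seed(4)
N <- 200
X <- cbind(1, rnorm(N))                               # design matrix with intercept
y <- X %*% c(2, -1) + rnorm(N, sd = 0.1)              # synthetic targets

grad_i <- function(w, i) 2 * X[i, ] * drop(X[i, ] %*% w - y[i])   # gradient of sub-function f_i

w <- c(0, 0)                                          # initial parametrization
step <- 0.01
for (epoch in 1:20) {
  for (i in sample(N)) {                              # random shuffling of the observations
    w <- w - step * grad_i(w, i)                      # update after each single observation
  }
}
w                                                     # close to the true parameters c(2, -1)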
Newton’s Method If the second partial derivatives, gathered in the so-called Hes-
sian matrix ∇2 f (x) of the objective function, are also available, then Newton’s method
may be applied. Its advantage is its rapid convergence to the optimum under certain
conditions. In any case, Newton’s method should only be used if the Hessian matrix
is positive definite, otherwise its sequence of iterates is not guaranteed to converge.
Direct Search Methods that base all decisions on where to place the next step only
on information gained from objective function evaluations without attempting to ap-
proximate partial derivatives are termed direct search methods. Many variants have
been proposed since at least the 1950s. The simplest version unfolding the main idea
is called compass search.
The compass search defines the set D of potential descent directions by the coordinate axes, which results in 2n directions. It chooses a direction from D and tests if a move in this direction with the current step size leads to an improvement. If so, the move is made. Otherwise, another direction of D is probed. If none of the directions in D leads to an improvement, the step size is made smaller and the process repeats with the same set D of potential descent directions. The algorithm stops if the step size gets smaller than a chosen limit ε > 0.
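A compact R sketch of compass search; the two-dimensional test function, the initial step size, and the limit ε are illustrative.

f <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 0.5)^2

compass_search <- function(f, x, step = 1, eps = 1e-6) {
  D <- rbind(diag(length(x)), -diag(length(x)))       # the 2n coordinate directions
  while (step >= eps) {
    improved <- FALSE
    for (k in seq_len(nrow(D))) {
      y <- x + step * D[k, ]                          # probe one direction
      if (f(y) < f(x)) { x <- y; improved <- TRUE; break }
    }
    if (!improved) step <- step / 2                   # no direction helps: shrink the step size
  }
  x
}

compass_search(f, x = c(5, 5))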
10.4 Multi-Objective Problems
finite representative subset of the Pareto front has been found. For the approxima-
tion of the entire Pareto front, population-based evolutionary algorithms like NSGA-2
(Algorithm 10.8) and SMS-EMOA (Algorithm 10.9) are commonly used and widely
accepted. Both EAs use their population as an approximation of the Pareto front.
The variation operators need not be changed for the multi-objective case, but the
selection methods must cope appropriately with unavoidable incomparableness of
solutions. The ranking of the individuals is achieved in two stages. In the first stage
the population P is partitioned into h disjoint nondominated sets R_1, \ldots, R_h via

R_1 = \mathrm{ND}_f(P) \quad\text{and}\quad R_k = \mathrm{ND}_f\!\Big(P \setminus \bigcup_{i=1}^{k-1} R_i\Big) \;\text{ for } k = 2, \ldots, h \text{ if } h \ge 2
[Figure 10.1: Partition of a population into the nondominated sets R1, . . . , R4 in the objective space spanned by f1(x) and f2(x).]
In case of NSGA-2 the crowding distance is used in the second stage. The crowd-
ing distance of an individual measures the proximity of other solutions in objective
space. The individual with smallest value has close neighbors which are considered
sufficient to approximate this part of the Pareto front so that the individual with least
crowding distance can be deleted. This process is iterated (with or without recalcu-
lation of distances after deletion) as often as necessary.
Another qualifier for deleting individuals from crowded regions in the objective
space is based on a commonly accepted measure [14] for assessing the quality of an
approximation of the Pareto front since it simultaneously measures the closeness to
the Pareto front and the spread along the Pareto front with a single scalar value:
Definition 10.7. Let v(1) , v(2) , . . . v(µ) ∈ Rd be a nondominated set and r ∈ Rd such
that v(i) ≺ r for all i = 1, . . . , µ. The value
H(v^{(1)}, \ldots, v^{(\mu)}; r) = \Lambda_d\!\left( \bigcup_{i=1}^{\mu} [v^{(i)}, r] \right)
is termed the dominated hypervolume with respect to reference point r, where Λd (·)
measures the volume of a set in Rd . The hypervolume contribution of some element
x ∈ Rk is the difference H(Rk ; r) − H(Rk \ {x}; r) between the dominated hypervol-
ume of set Rk and the dominated hypervolume of set Rk without element x.
Thus, the hypervolume contribution of an individual is that amount of domi-
nated hypervolume that would get lost if this individual is deleted. Therefore the
SMS-EMOA deletes individuals with least hypervolume contribution from crowded
regions. Figure 10.2 illustrates how to determine the dominated hypervolume of a
given population and the hypervolume contribution of each nondominated individual
of the population.
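For two objectives (both to be minimized) the dominated hypervolume and the individual hypervolume contributions can be computed directly; the following R sketch uses an illustrative nondominated set and reference point.

hypervolume2d <- function(V, r) {                     # V: matrix of nondominated points
  V <- V[order(V[, 1]), , drop = FALSE]               # sort by the first objective
  upper <- c(r[2], V[-nrow(V), 2])                    # each point adds a disjoint rectangle
  sum((r[1] - V[, 1]) * (upper - V[, 2]))
}

contributions <- function(V, r) {
  total <- hypervolume2d(V, r)
  sapply(seq_len(nrow(V)),
         function(i) total - hypervolume2d(V[-i, , drop = FALSE], r))
}

V <- rbind(c(1, 5), c(2, 3), c(4, 2), c(6, 1))        # a nondominated set in objective space
r <- c(7, 6)                                          # reference point dominated by all of V
hypervolume2d(V, r)                                   # dominated hypervolume of the set
contributions(V, r)                                   # exclusive contribution of each point

An SMS-EMOA-style selection would discard the point with the smallest contribution.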
Figure 10.2: Left: The nondominated individuals (set R1 ) of the population given
in Figure 10.1 are labeled with v(1) , . . . , v(6) and a reference point r is chosen that
is dominated by each of the nondominated individuals. The union of the rectangles
[v(i) , r] for i = 1, . . . , 6 is the (shaded) area which is dominated by the population up to
the limiting reference point r. The volume of the area is the dominated hypervolume.
Right: The darker shaded rectangles characterize the amount of dominated hypervol-
ume that is exclusively contributed by each particular nondominated individual.
Once this has been achieved, a closer inspection of the approximation set yields in-
sight about the tradeoff between the conflicting objectives.
As already mentioned, the approximation of the Pareto front by a multi-objective
evolutionary algorithm is only a preparatory step in the decision process. The final
choice of the solution from the approximation that is to be realized is made by the
user (i.e., the decision maker).
But this kind of preparatory activity can already be applied prior to running the
optimization itself, namely in the phase of building the optimization model. As
demonstrated in Example 10.8, one might use the a posteriori approach to assess
the reasonableness of the objectives specified in the optimization model.
Example 10.8. An (unfortunately costly) procedure is to estimate the tradeoff be-
tween measures after several multi-objective optimization experiments as applied to
music classification in [13]. Figure 10.3(a) shows an example of the non-dominated
front of solutions, where two objectives have to be minimized. The ideal solution
is marked with an asterisk, and the reference point with a diamond.

Figure 10.3: Two theoretical examples for non-dominated fronts (a,b) after [13] and a practical example minimizing the balanced classification error and the size of the training set (c). For further details see the text.

To measure
the tradeoff between both evaluation measures, we may estimate the shaded area between the ideal solution v^ID and the front built with the solutions v^1, . . . , v^K. This share of the area exclusively dominated by v^ID can, in general, be estimated as (cf. Definition 10.7):
\varepsilon_{ID} = \frac{H(v^{ID}; r) - H(v^1, \ldots, v^K; r)}{H(v^{ID}; r)} \cdot 100\%, \qquad (10.4)
where r is the reference point. A larger ε_ID corresponds to a more broadly distributed non-dominated front and means that the optimization of both criteria is reasonable, in contrast to a smaller ε_ID; an example of the latter case is illustrated in Figure 10.3(b).
Here, the optimization of one of both measures may be sufficient.
Another possibility to check if two measures should be optimized simultaneously
is to first calculate the maximum hypervolume exclusively dominated by an indi-
vidual solution from the non-dominated front. Then, the share of the hypervolume
exclusively dominated by other solutions (marked as the area with diagonal lines in Figures 10.3(a) and (b)) in relation to the hypervolume of the front can be measured:
\varepsilon_{MAX} = \frac{H(v^1, \ldots, v^K) - \max\{H(v^k) : k = 1, \ldots, K\}}{H(v^1, \ldots, v^K)} \cdot 100\%. \qquad (10.5)
Small εMAX means that there exists one solution in the non-dominated front whose
exclusive contribution to the overall hypervolume of the front is significantly larger
than for other solutions. Because this solution can be found using proper weights
for a single-objective weighted sum approach (see Section 10.4), the multi-objective
optimization is not necessary.
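Given a two-dimensional front, the two shares can be computed along the lines of the sketch below; the front, the reference point, and the choice of the ideal solution as the componentwise minimum are illustrative assumptions, and the reference point is included in all hypervolume evaluations.

hv2d <- function(V, r) {                              # dominated hypervolume in 2D (minimization)
  V <- V[order(V[, 1]), , drop = FALSE]
  upper <- c(r[2], V[-nrow(V), 2])
  sum((r[1] - V[, 1]) * (upper - V[, 2]))
}

front <- rbind(c(0.30, 0.11), c(0.32, 0.08), c(0.36, 0.05))   # hypothetical solutions
r     <- c(0.50, 0.50)                                # hypothetical reference point
v_id  <- apply(front, 2, min)                         # ideal solution (componentwise minimum)

eps_id  <- (prod(r - v_id) - hv2d(front, r)) / prod(r - v_id) * 100          # Equation (10.4)
eps_max <- (hv2d(front, r) - max(apply(front, 1, function(v) prod(r - v)))) /
           hv2d(front, r) * 100                                              # Equation (10.5)
c(eps_id = eps_id, eps_max = eps_max)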
Figure 10.3(c) shows an example after 10 statistical repetitions for the simul-
taneous minimization of the training set size and mBRE using the model (d) from
Figure 13.1 and SMS-EMOA as optimization algorithm (see Algorithm 10.9). As we
can observe, approximately 11% of all training classification windows are enough to
produce the smallest error mBRE = 0.30. A further reduction of the training set size
leads to higher errors up to mBRE = 0.36.
Bibliography
[1] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty. Nonlinear Programming:
Theory and Algorithms. Wiley, Hoboken (NJ), 3rd edition, 2006.
[2] N. Beume, B. Naujoks, and M. Emmerich. SMS-EMOA: Multiobjective se-
lection based on dominated hypervolume. European Journal of Operational
Research, 181(3):1653–1669, 2007.
[3] E. Boros and P. L. Hammer. Pseudo-Boolean optimization. Discrete Applied
Mathematics, 123(1-3):155–225, 2002.
[4] C. Buchheim and G. Rinaldi. Efficient reduction of polynomial zero-one op-
timization to the quadratic case. SIAM Journal on Optimization, 18(4):1398–
1413, 2007.
[5] K. Deb. Multi-Objective Optimization Using Evolutionary Algorithms. Wiley,
2001.
Chapter 11
Unsupervised Learning
CLAUS WEIHS
Department of Statistics, TU Dortmund, Germany
11.1 Introduction
In this chapter we will introduce two kinds of methods for unsupervised learning,
namely unsupervised classification and independent component analysis.
Unsupervised classification (also called cluster analysis or clustering) is the task
of grouping a set of objects in such a way that objects in the same group (called
cluster) are more similar (in some sense or another) to each other than to those in
other clusters. Cluster analysis typically includes the definition of a distance measure
and a threshold for cluster distinction. In this section we will discuss agglomerative
hierarchical clustering methods like single linkage, complete linkage, and average
linkage as well as the Ward method. Also, we introduce partitioning methods like the
k-means method and self-organizing maps (SOMs). Moreover, we discuss different
distance measures for different data scales and for features as well as the relation of
clustering and outlier detection.
For independent component analysis (ICA) the aim is to “separate” the underly-
ing independent components having produced the observations. One typical musical
application is transcription where the relevant part of music to be transcribed (e.g.
human voice) has to be separated from other sounds (e.g. piano accompaniment). In
this case, ICA ideally generates two “independent” parts, namely the human voice
and the accompaniment.
Clusters can be useful for various applications. Sometimes, it is particularly
important that the found classes are well separated (when further used for supervised
classification). Often, we would like to replace a big number of observations or
variables by representatives (data reduction), which could be, e.g., cluster centers.
Also, groups of missing values (cp. Section 14.2.3) should often be replaced by good
representatives.
Definition 11.1 (Clustering). Given a set of observations or variables, a clustering is
a partition of such objects into groups, so-called clusters, so that the distance of the
objects inside a group is distinctly smaller than the distance of objects of different
groups. We speak of homogeneity inside clusters and heterogeneity between clusters.
The separation quality of a clustering is defined as the average heterogeneity of pairs
of clusters.
The basis of any cluster analysis is the definition of the distance between objects.
Typically, by means of a cluster analysis we look for a partition of the objects into classes in order to reach one of the following two targets: data reduction and a better overview of the data, or the discovery of unknown groups of objects that clarify the issue under study.
The methodical approach can be summarized as follows:
• Observations are assigned to all objects.
• The distances between the objects are calculated based on the matrix of observa-
tions.
• The clustering criteria are applied to the distances finally leading to a clustering.
Obviously, the definition of the distance between observations is decisive for the "success" of a clustering. Please note that not only observations but also features may be clustered.
11.2 Distance Measures and Cluster Distinction
• On the one hand, one should notice, however, that the Euclidean distance is well-
known for being outlier sensitive. This might motivate switching to another dis-
tance measure like, e.g., the Manhattan distance or City-Block distance ([6])
d_C(x_1, x_2) := \sum_{j=1}^{k} |x_{1j} - x_{2j}|.
• On the other hand, one might want to discard correlations between the features
and to restrict the influence of single features. This might lead to transformations
by means of covariance or correlation matrices S, i.e. to Mahalanobis distances
([6])

d_M(x_1, x_2) := \sqrt{(x_1 - x_2)^T S^{-1} (x_1 - x_2)}.
• For ordinal features, often the Euclidean distance of the ranks of the k features
is used. This means that the observations are first ordered, feature by feature,
and then the ranks of the observations are used as if they were the observations
themselves.
• The values of nominal features are often first coded by real values, either in a "natural" way or "optimally" according to the application. For example, the coding −1 and 1 is often used for the two "extreme" values. Optimal coding is often achieved by special methods like multidimensional scaling ([1], p. 249). The coded features are then used as if they were cardinal.
• For dichotomous/binary features, often the so-called Hamming distance d_H(x_1, x_2) := no. of non-matching entries in x_1 and x_2 is used. Note that d_H could also be directly used for vectors of general nominal features.
• Other distance measures in use for binary features are the Jaccard index defined
by
d_J(x_1, x_2) := \frac{\text{no.(non-matching entries in } x_1 \text{ and } x_2)}{\text{no.(double-positives + non-matching)}}
and the simple matching index defined by
d_S(x_1, x_2) := \frac{\text{no. of non-matching entries}}{\text{total no. of entries}}.
Note that the Jaccard index assumes an asymmetric situation in that only the dou-
ble positive results (both entries = 1) determine similarity! In contrast, in the
simple matching index, both the double positives and the double negatives count
for similarity.
• In each of the above cases the matrix D of the distances of pairs of observations
1, . . . , n is called the distance matrix:
D := \begin{pmatrix} 0 & d_{12} & \cdots & d_{1n} \\ d_{21} & 0 & \cdots & d_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ d_{n1} & d_{n2} & \cdots & 0 \end{pmatrix}.
• Similarity of features is often measured by means of the correlation coefficient.
One possibility to define the distance between two features is the so-called (linear) indetermination, i.e.

1 − coefficient of determination = 1 − correlation²,
which is interpreted as that part of one feature which is not determined by the
other feature.
• The basis for this distance measure is the correlation matrix (contingency matrix):
Cor := \begin{pmatrix} 1 & r_{12} & \cdots & r_{1h} \\ r_{21} & 1 & \cdots & r_{2h} \\ \vdots & \vdots & \ddots & \vdots \\ r_{h1} & r_{h2} & \cdots & 1 \end{pmatrix},
where ri j is the correlation (contingency) between features i and j.
• This way, the distance matrix (matrix of indetermination) of the features 1, . . . , k is defined by

D := ((1 - r_{ij}^2)) = \begin{pmatrix} 0 & 1 - r_{12}^2 & \cdots & 1 - r_{1k}^2 \\ 1 - r_{21}^2 & 0 & \cdots & 1 - r_{2k}^2 \\ \vdots & \vdots & \ddots & \vdots \\ 1 - r_{k1}^2 & 1 - r_{k2}^2 & \cdots & 0 \end{pmatrix}.

A short code sketch after this list illustrates how such distance matrices can be computed.
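As announced above, a short R sketch of these distance computations; the small data matrix and the binary vectors are illustrative.

set.seed(5)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)           # rows = observations, columns = features

d_euclid    <- dist(X, method = "euclidean")          # Euclidean distance matrix
d_manhattan <- dist(X, method = "manhattan")          # City-Block (Manhattan) distances

S <- cov(X)                                           # Mahalanobis distance of observations 1 and 2
d_mahal_12 <- sqrt(drop(t(X[1, ] - X[2, ]) %*% solve(S) %*% (X[1, ] - X[2, ])))

x1 <- c(1, 0, 1, 1); x2 <- c(1, 1, 0, 1)              # two binary feature vectors
d_hamming <- sum(x1 != x2)
d_jaccard <- sum(x1 != x2) / (sum(x1 == 1 & x2 == 1) + sum(x1 != x2))
d_simple  <- mean(x1 != x2)                           # simple matching index

D_features <- 1 - cor(X)^2                            # feature distances 1 - r_ij^2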
Example 11.1 (Distances). An example for metric observations will be given in the
next section. We will concentrate here on examples for distances between qualitative
observations and between features:
• Scale (“major” vs. “minor”) as well as rhythm (“three-four time” vs. “four-four
time”) are both binary nominal features. One might be interested in the distance
between classical composers according to scale or rhythm. For this, we might
rank movements of their compositions by the number of their performances, e.g.
over the radio, and compare the pieces on corresponding ranks 1–10, say. In such
cases, the simple matching index might be adequate to group together composers
well known for pieces with similar emotions.
• Distances between two features correspond to that part of one feature which is not
determined by the other. This way, features which can explain each other well, i.e.
are highly correlated, are clustered together. See Section 11.5 for examples.
Let us now discuss typical transformations, e.g. before using the Euclidean dis-
tance. In musical applications, e.g., the Euclidean distance is used for clustering the
“log-odds ratio” of the probabilities of notes for various compositions ([1], p. 238)
leading to a clear separation of “early music” from the rest. Note the transforma-
tion of the frequencies p j , j = 0, 1, . . . , 11, of the notes (modulo 12) by means of the
log-odds ratio, i.e. to ξ j = log(p j /(1 − p j )).
Another transformation used in musical applications is the entropy of melodic shapes and spectral entropies ([1], pp. 93–96). This leads to a clear separation of Bach's Cello Suites from "Das Wohltemperierte Klavier" ([1], p. 242).
[1] also proposes a specific smoothing for tempo curves (HISMOOTH) ([1], pp.
141–144). This leads to a similar clustering for different group distances ([1], p.
243).
In all these applications, so-called “complete linkage” and “single linkage” are
used for defining distances between groups of observations. Let us now look at such
distances systematically.
11.3 Agglomerative Hierarchical Clustering

K_i^t \cap K_j^t = \emptyset, \quad i \neq j.
Typically, that partition of the hierarchy which has the best quality measure (see below) is identified for further use.
Agglomerative hierarchical methods have the following structure:
Stage 0: (Initialization)
1. Distance of the two most dissimilar elements in the classes: \nu(K_{i1}, K_{i2}) := \max\{ d_{jk} : x_j \in K_{i1}, x_k \in K_{i2} \}, where d_{jk} is one of the distances above of the two elements x_j ∈ K_{i1} and x_k ∈ K_{i2}. Cluster methods with this heterogeneity measure are called complete linkage methods or farthest neighbor methods.
Problem: the heterogeneity tends to be overestimated.
2. Distance of the two most similar elements in the classes: \nu(K_{i1}, K_{i2}) := \min\{ d_{jk} : x_j \in K_{i1}, x_k \in K_{i2} \}. Cluster methods with this heterogeneity measure are called single linkage methods or nearest neighbor methods.
Problem: the heterogeneity tends to be underestimated.
3. Average distance of all pairs of elements of the two classes: \nu(K_{i1}, K_{i2}) := \frac{1}{|K_{i1}|\,|K_{i2}|} \sum_{x_j \in K_{i1}} \sum_{x_k \in K_{i2}} d_{jk}. Cluster methods with this heterogeneity measure are called average linkage methods.
Based on these heterogeneity measures, the quality of a partition can be evalu-
ated.
Definition 11.5 (Quality measures of partitions). As a quality measure of a partition,
often the inverse mean heterogeneity of the classes in the partition is taken. Let K be
a partition, |K|:= no. of clusters in partition K, and n = no. of observations. Then,
gv := \frac{\sum_{K_{i1} \in K} \; \sum_{K_{i2} \in K,\, i_2 < i_1} \nu(K_{i1}, K_{i2})}{|K|(|K|-1)/2}
is called mean class heterogeneity, where ν is a heterogeneity measure.
Another quality measure of a partition is the so-called Calinski measure which
relates the between-cluster variation SSB to the within-cluster-variation SSW :
Ca := \frac{SSB/(|K|-1)}{SSW/(n-|K|)} = \frac{\sum_{i=1}^{|K|} n_i (\bar{x}_i - \bar{\bar{x}})^2 / (|K|-1)}{\sum_{i=1}^{|K|} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 / (n-|K|)},
where ni is the no. of elements in cluster i = 1, . . . , |K|, x̄i is the empirical mean in
cluster i, and x̄¯ is the overall mean of all data.
Obviously, gv evaluates only the distances between clusters, whereas Ca judges the between-cluster variation relative to the within-cluster variation. Both quality measures gv
and Ca should be maximized.
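The Calinski measure is straightforward to compute for a given partition; the sketch below sums the squared deviations over all features, and the data and cluster labels are illustrative.

calinski <- function(X, cluster) {
  K  <- length(unique(cluster)); n <- nrow(X)
  gm <- colMeans(X)                                   # overall mean
  ssb <- sum(sapply(unique(cluster), function(k) {
    Xk <- X[cluster == k, , drop = FALSE]
    nrow(Xk) * sum((colMeans(Xk) - gm)^2)             # between-cluster variation
  }))
  ssw <- sum(sapply(unique(cluster), function(k) {
    Xk <- X[cluster == k, , drop = FALSE]
    sum(sweep(Xk, 2, colMeans(Xk))^2)                 # within-cluster variation
  }))
  (ssb / (K - 1)) / (ssw / (n - K))
}

set.seed(6)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2), matrix(rnorm(40, mean = 4), ncol = 2))
calinski(X, cluster = rep(1:2, each = 20))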
11.3.3 Visualization
As a visual representation of a hierarchy of partitions, a so-called dendrogram is
used.
Definition 11.6 (Dendrogram). In a dendrogram the individual data points are ar-
ranged along the bottom of the dendrogram and referred to as leaf nodes. Data
clusters are formed by any hierarchical cluster method leading to the combination
of individual observations or existing data clusters with the combination point re-
ferred to as a node. A dendrogram shows the leaf nodes and the combination nodes.
The combinations are indicated by lines. At each dendrogram node we have a right
and left sub-branch of clustered data. The vertical axis of a dendrogram is labeled
distance and refers to a distance measure between observations or data clusters. The
height difference between a node and its sub-branch nodes can be thought of as the
distance value between a node and its right or left sub-branch nodes, respectively.
Data clusters can refer to a single observation or a group of data. As we move up
the dendrogram, the data clusters get bigger, but the distance between data clusters
may vary. One way to identify a “natural” clustering (partition) is to cut the dendro-
gram in its longest branch, this means at a place where sub-branch clusters have the
biggest distance to the nodes above.
Example 11.2. Let us now introduce an example data set often used in this section.
The data is composed of MFCC variables (non-windowed and windowed), chroma
variables, and envelopes in time and spectral space. All variables are available for
4309 guitar and 1345 piano tones. Blocks are composed of 12,288 observations
each for a signal sampled with 44,100 Hz. This means that one block corresponds to
around 0.25 seconds. We have studied the tones 4530–4640, including 55 guitar and
56 piano tones, and the features MFCC 1 in the first block (“mfcc block1 1”) and in
the last (fifth) block (“mfcc block5 1”) of the tones. The idea is that these blocks are
able to distinguish between guitar and piano tones since the beginning and the end
of the tones are important for distinction. Figure 11.1 shows the partitions of dif-
ferent hierarchical clustering methods based on Euclidean distances. Note that the
different symbols represent the different clusters and the filling state (empty or filled)
distinguish guitar and piano. Obviously, complete linkage and Ward reproduce the
instruments much better than average and single linkage. Note that the number of
clusters was fixed by means of cutting the dendrogram at a certain height so that
5 clusters were produced each. No automatic rule was followed here. The dendro-
grams are somewhat confusing because of the high number of observations and the
labels of the tones. We will thus restrict ourselves to a dendrogram representing only
tones 4575–4595 analyzed by complete linkage clustering (Figure 11.2). Notice that
the classes are perfectly split into different clusters, since the guitar tones (4575–
4584) are in different clusters than the piano tones (4585–4595). Cutting at distance
= 3 would lead to perfect clusters.
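The clusterings of Example 11.2 can be produced with base R; since the tone data set is not reproduced here, the sketch below uses illustrative two-dimensional data.

set.seed(7)
X <- rbind(matrix(rnorm(60, mean = 2), ncol = 2),     # one group of observations
           matrix(rnorm(60, mean = 5), ncol = 2))     # a second, well-separated group

d  <- dist(X)                                         # Euclidean distances
hc <- hclust(d, method = "complete")                  # complete linkage; alternatives:
                                                      # "single", "average", "ward.D2"
plot(hc)                                              # dendrogram
clusters <- cutree(hc, k = 5)                         # cut the dendrogram into 5 clusters
table(clusters)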
Note that the illustration of clustering by scatterplots, as in Figure 11.1, is only
possible since the original data space was 2D. Dendrograms, however, can also be used in higher dimensions.

[Figure 11.1: Scatterplots of mfcc_block1_1 vs. mfcc_block5_1 showing the partitions produced by the different hierarchical clustering methods; see text.]
Figure 11.2: Left: Dendrogram for complete linkage, tones 4575–4595 (agglomerative coefficient 0.92); right: corresponding scatterplot.

11.4 Partition Methods
The initial cluster centers are vectors of the same dimension as the observations,
which are central in the observations. In practical applications it might be sensible
to utilize prior knowledge for the setting of initial centers by hand. Another possi-
bility is the drawing of k elements from a uniform distribution on the indices of the
observations. For the choice of the initial cluster centers there exist also different
pre-optimization methods aiming at the improvement of the convergence speed of
the iteration.
The second step of the algorithm leads to a partition of the objects into k classes.
In iteration t the sets C1t−1 , . . . ,Ckt−1 contain for each class the indices of objects
assigned to it. Formally, these sets are determined as follows:
C_h^{t-1} = \bigcup_{i=1}^{n} i \cdot T_h(x_i), \quad h = 1, \ldots, k, \quad\text{where}\quad T_h(x) = \begin{cases} 1, & \|x - z_h^{t-1}\| = \min_{1 \le j \le k} \|x - z_j^{t-1}\| \\ \emptyset, & \text{else.} \end{cases}
These sets form a partition of the index set {1, . . . , n}, i.e.

C_i^{t-1} \cap C_j^{t-1} = \emptyset, \; i \neq j, \quad\text{and}\quad \bigcup_{j=1}^{k} C_j^{t-1} = \{1, \ldots, n\}.
j=1
In the third step, the centers are replaced by location measures of the temporary
clusters, i.e. of the assigned observations. Which location measure is chosen depends on the preset L_p-criterion to be minimized. Each center z_h^t is defined as that x^L_{(h)} for which the following expression is minimal:

\left( \sum_{i \in C_h^{t-1}} \sum_{j=1}^{m} |x_{ij} - x^L_{(h)j}|^p \right)^{1/p}.
For some parameters p we get the location measure x^L from Table 11.1.

Table 11.1: Location measures x^L for selected values of p.
p      x^L
1      median
2      mean
∞      (max + min)/2
The algorithm might stop if the maximal relative change in the cluster centers is small enough, as defined by a preset threshold, if the clusters did not change in the latest iteration, or if a preset maximum number of iterations is reached (early stopping).
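In R, the stats function kmeans implements this procedure (with the L2 criterion); the nstart argument mirrors the repeated random starts used in Example 11.3 below. The data are illustrative.

set.seed(8)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2), matrix(rnorm(100, mean = 4), ncol = 2))

km <- kmeans(X, centers = 5, nstart = 50, iter.max = 100)
km$centers                                            # final cluster centers
table(km$cluster)                                     # cluster sizes
km$tot.withinss                                       # total within-cluster sum of squares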
The network must be fed a large number of example vectors that represent, as closely as possible, the expected kinds of vectors. The examples are usually presented
several times in iterations.
When a training example xt is fed to the network, its Euclidean distance to all
weight vectors is computed. The neuron whose weight vector is most similar to the
input is called the best matching unit (BMU). The weights of the BMU and neurons
close to it in the SOM lattice are adjusted towards the input vector. The magnitude of
the change decreases with time and with distance (within the lattice) from the BMU.
The update formula for a neuron with weight vector m_t is of the form m_{t+1} = m_t + θ_t · α_t · (x_t − m_t), where α_t is a learning rate that decreases over time and θ_t is a neighborhood weight that decreases with the lattice distance of the neuron from the BMU.
Let NN(v) denote the set of immediate neighbors of a node v on the map; then

U_v := \sum_{u \in NN(v)} \|m_v - m_u\|
is the height of the U-matrix in node v. The U-matrix is a display of the U-heights
on top of the grid positions of the neurons on the map.
The U-matrix delivers a “landscape” of the distance relationships of the input
data in the data space. Properties of the U-matrix are:
• the position of the projections of the input data points reflects the topology of the
input space, according to the underlying SOM algorithm;
• weight vectors of neurons with large U-heights are very distant from other nodes;
• weight vectors of neurons with small U-heights are surrounded by other nodes;
• outliers in the input space are found in “funnels”;
• “mountain ranges” on a U-Matrix point to cluster boundaries; and
• “valleys” on a U-Matrix point to cluster centers.
The U-matrix realizes the so-called emergence of structure of features corresponding to the distances within the data space. Outliers, as well as possible cluster structures, can be recognized for high-dimensional data spaces. The proper setting
and functioning of the SOM algorithm on the input data can also be visually checked.
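A SOM with the 8 × 6 grid used below can be trained, for example, with the kohonen package (an assumption on our side; the plot titles in the figures suggest this implementation, but any SOM implementation will do). The data are illustrative.

library(kohonen)

set.seed(9)
X <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                 matrix(rnorm(100, mean = 4), ncol = 2)))

map <- som(X, grid = somgrid(xdim = 8, ydim = 6, topo = "rectangular"), rlen = 200)

plot(map, type = "dist.neighbours")                   # U-matrix-like neighbour distance plot
plot(map, type = "mapping")                           # observations mapped to their BMUs

codes <- map$codes                                    # node weight vectors
if (is.list(codes)) codes <- codes[[1]]               # kohonen >= 3.0 stores one matrix per layer
clusters <- cutree(hclust(dist(codes), method = "complete"), k = 6)   # cluster the node weights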
Example 11.3. Let us now look at the results of k-means and SOM clustering for the
above example. Figure 11.3 shows on the left the partition of the k-means method
with 5 clusters, the same number of clusters as for the above hierarchical methods.
The plot should be interpreted in the same way as Figure 11.1. If we associate a
cluster with the more frequent class, k-means delivers just slightly better results (7
elements with wrong class in clusters) than Ward (8 errors).
Let us now evaluate the result of k-means clustering by means of the Calinski
quality measure Ca without knowledge of the "true" classes. To this end, we have repeated the k-means algorithm 50 times for k = 2, . . . , 8 with random starting vectors, and took the partition with maximum Ca-index for each k. This led to the results in Table 11.2. Obviously, the 5-means method is not optimal, since its Ca is not maximal.

Figure 11.3: Left: k-means with 5 clusters, right: SOM with 6 clusters indicated at the weights of the nodes.

Table 11.2: Maximum Ca-index over 50 random starts of k-means for k = 2, . . . , 8.
k    2    3    4    5    6    7    8
Ca   87   124  121  139  139  145  148
For SOM clustering, let us first consider a similar scatterplot as for the other
clustering methods (Figure 11.3, right). Notice, however, that the symbols of the 6
clusters are not marking the original observations but the (nearby) weight vectors
for the SOM-nodes. The original observations are included by black and grey dots
indicating the true classes. In order to understand the structure of the SOM-map,
we look at the U-matrix and at a map with the assigned original observations sep-
arately. The U-matrix is given on the left of Figure 11.4 together with the proposed
boundaries of 6 clusters. Note the “funnel” in the lower left corner representing
the singleton corresponding to the upper right individual in Figure 11.3. Also note
that higher weights correspond to less-intensive greys. On the right of Figure 11.4
the location of the original observations is indicated by their class numbers in the
nodes of the SOM. Obviously, the reproduction of the original classes was similarly
successful as k-means. Moreover, note that some of the nodes do not contain any
original observation. Notice that the 6 clusters are generated by means of complete
linkage clustering applied to the weights of the SOM-map. The resulting dendrogram
can be seen in Figure 11.5. Note the nodes are numbered consecutively row by row
starting in the lower left corner (node 1) and ending in the upper right corner (node 48).

Figure 11.4: Left: U-matrix with boundaries of 6 clusters, right: original classes in SOM.
[Figure 11.5: Dendrogram of the 48 SOM node weights obtained by complete linkage clustering; the leaves are the node numbers. A further complete linkage dendrogram of the block-1 MFCC and chroma features illustrates the clustering of features (Section 11.5).]
11.6 Independent Component Analysis

As a typical introductory example, suppose that two instruments are playing simultaneously and are recorded by two microphones. Each recorded signal x_{jt}, j = 1, 2, is then a linear mixture of the two instrument signals s_{1t} and s_{2t} with mixing coefficients a_{ij}, where the a_{ij} are some parameters that depend on the distances of the microphones from the instruments.
the instruments. The aim is now to estimate the two original instrument signals
s1t and s2t using only the recorded signals x1t and x2t . This is a music equivalent of
the so-called “cocktail-party” problem for speech.
More generally, let us assume that we observe N linear mixtures x1 , . . . , xN of N
independent components
x_j = a_{1j} s_1 + \ldots + a_{Nj} s_N, \quad j = 1, \ldots, N.
We have now dropped the time index t in the ICA model, and instead assume that
each mixture x j as well as each independent component si is a random variable,
instead of a proper time signal. The observed values x jt , e.g. the microphone signals
in the instruments identification problem above, are then a sample of the random
variable x j .
In matrix formulation, the data matrix X is considered to be a linear combination
of non-Gaussian independent components, i.e.
X = SA,
where the columns of S contain the independent components and A is a linear mixing
matrix. In short, ICA attempts to “un-mix” the data by estimating an un-mixing
matrix W where X W = S .
In this formulation the two most important assumptions in ICA are already stated,
i.e. independence and non-Gaussianity. One approach to solving X = S A is to use
some information on the statistical properties of the signals si to estimate the ai j .
Actually, and perhaps surprisingly, it turns out that it is enough to assume that the
si are statistically independent. This is not an unrealistic assumption in many cases,
and it need not be exactly true in practice.
The other key to estimating the ICA model is non-Gaussianity. Actually, without
non-Gaussianity the estimation is not possible at all. Indeed, in the case of Gaussian
1 This section is composed from [4] and [5].
The entropy of a discrete random variable Y is defined as H(Y) = −∑_i P(Y = a_i) log_2 P(Y = a_i), where the a_i are the possible values of Y. This very well-known definition can be generalized for continuous-valued random variables and vectors, in which case it is often called differential entropy. The differential entropy H of a random vector y with density f(y) is defined as

H(y) = -\int f(y) \log_2 f(y) \, dy.
The neg-entropy J of y is then defined as J(y) = H(y_gauss) − H(y), where y_gauss is a Gaussian random variable with the same covariance matrix as y.
Due to the above-mentioned properties, neg-entropy is always non-negative, and it is
zero iff y has a Gaussian distribution.
Because of the complex calculation of neg-entropy, in FastICA simple approxi-
mations to neg-entropy are used which will not be discussed here. The maximization
of J obviously produces a kind of maximum non-Gaussianity. Before maximization,
first the data are pretransformed in the following way:
1. The data are centered by subtracting the mean of each column of the data matrix
X and
2. the data matrix is then “whitened” by projecting the data onto its principal com-
ponent directions, i.e. X → X G , where G is a loading matrix (see Section 9.8.3).
The number of components can be specified by the user. This way, we already
have uncorrelated components.
The ICA algorithm then estimates another matrix W so that
X GW = S .
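The FastICA algorithm is available in the fastICA package [5]; it performs the centering and whitening steps described above internally. The two mixed signals below are synthetic.

library(fastICA)

set.seed(10)
S_true <- cbind(sin(seq(0, 20, length.out = 1000)),   # two independent "source" signals
                sign(sin(seq(0, 37, length.out = 1000))))
A_true <- matrix(c(0.6, 0.4, 0.3, 0.7), nrow = 2)
X <- S_true %*% A_true                                # observed mixtures, X = S A

ica <- fastICA(X, n.comp = 2)                         # centering, whitening, and rotation
head(ica$S)                                           # estimated independent components
ica$A                                                 # estimated mixing matrix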
In an illustrative example (Figure 11.7), four data components x_1, . . . , x_4 of the above data set (x_1 = mfcc_block1_1, x_2 = mfcc_block1_2) were mixed into the signals y_1 = x_1 + x_2, y_2 = x_1 − x_2, y_3 = x_1 + x_2 + x_4, and y_4 = x_1 + x_2 + x_3, and FastICA was applied to these mixtures.
From Figure 11.7, it becomes clear that we found a correspondence of the 1st data
component ”mfcc block1 1” with the 2nd ICA component, and with a correspon-
dence of the 2nd data component ”mfcc block1 2” with the 4th ICA component.
Note the sign change in the 2nd ICA component. Also note that the ICA components
are centered in contrast to the original components.
For a possibly more relevant example see Chapter 17.
[Figure 11.7: Scatterplots of the four estimated ICA components against the original data components mfcc_block1_1 and mfcc_block1_2.]
Bibliography
[1] J. Beran. Statistics in Musicology. Chapman&Hall/CRC, 2004.
[2] M. Dash and H. Liu. Feature selection for classification. Intelligent Data Anal-
ysis, 1:131–156, 1997.
[3] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wi-
ley, 2001.
[4] A. Hyvärinen and E. Oja. Independent component analysis. Neural Networks,
13:411–430, 2000.
[5] J. L. Marchini and C. Heaton. Package fastICA. http://cran.stat.sfu.ca/
web/packages/fastICA/fastICA.pdf, 2014.
[6] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-
Wesley, 2005.
[7] C. Weihs, U. Ligges, F. Mörchen, and D. Müllensiefen. Classification in Music
Research. Advances in Data Analysis and Classification (ADAC), 1:255–291,
2007.
Chapter 12
Supervised Classification
CLAUS WEIHS
Department of Statistics, TU Dortmund, Germany
TOBIAS GLASMACHERS
Institute for Neuroinformatics, Ruhr-Universität Bochum, Germany
12.1 Introduction
Classification of entities into categories is omnipresent in everyday life as well as in engineering and science, in particular in music data analysis. As a problem per se, it is analyzed with mathematical rigor within the disciplines of statistics and ma-
chine learning. Classification simply means to assign one (or more) of finitely many
possible class labels to each entity.
Formally, a classifier is a map f : X → Y , where X is the input space containing
characteristics of the entities to classify, also called instances, and Y is the set of
categories or classes. For example, X may consist of characteristics of all possible
pieces of music, while Y may consist of genres. Thus, a genre classifier is a function
that assigns each piece of music to a genre. For other classification examples, see
Section 12.3.
The problem of constructing f can be approached by fixing a statistical model
class fθ . The process of estimating the a priori unknown parameter vector θ from
data is called training or learning. Parameters in θ are, e.g., mean and variance
in the case of a normal distribution. Learning a classifier model from data has ad-
vantages and disadvantages over manual (possibly algorithmic) construction of the
function f : constructing a model programmatically allows for a rather direct incor-
poration of expert knowledge, however, it is cumbersome and time consuming. In
many classification problems data of problem characteristics are easier to obtain and
to encode in machine readable form than expert knowledge, and one can hope to save
considerable effort by letting a learning machine figure out a good model by itself
based on a corpus of data, e.g., characterizing a large collection of music. This pro-
ceeding has the added benefit of increased flexibility: the model can be improved as
more data becomes available without the need for further expensive, time-consuming
engineering.
The questions of which model class to use for which problem and how to esti-
mate or learn its parameters based on data are core topics of statistics and machine
learning.
The quality of a classifier is measured by a loss function L(y, f(x)), which can be understood as the cost of misclassification. For example, we may define the cost of
mistaking a piece of music from the classic period for one from the romantic period
as less than mistaking it for modern electronic music or heavy metal. A standard loss
function for classification is the 0/1-loss that assigns a cost of one uniformly to all
types of mistakes. Other types of losses can be found in Section 12.4.4.
Now we can formally state the goal of learning, which is to minimize the ex-
pected loss, also called the risk
R(f) = \mathbb{E}\big[ L(y, f(x)) \big] = \int_{X \times Y} L\big(f(x), y\big) \, dP(x, y).
For our examples this is the average (severity of) error of the classifier over all possi-
ble pieces of music, weighted by their probability of being encountered. Of course,
we would like to pick a classifier f with an as small as possible risk R( f ).
The minimizer f ∗ of the risk functional over all possible (measurable) classi-
fiers f is known as the Bayes classifier, and the corresponding risk R ∗ = R( f ∗ ) is
the Bayes risk. Note that in general even the best possible model has a non-zero
risk. This is plausible in the context of genre classification since the assignment of
a piece of music to a genre may be subjective, and some pieces combine aspects of
different genres. Thus, even the assignment rule that works best on average cannot
make 100% correct predictions.
The goal of finding this minimizer is in general not achievable. This is because
the risk cannot even be computed since the data generating distribution P(xx, y) is not
known. The available information about this distribution is limited to a finite sample,
the training set. The learning task now is to find a classifier f for which we know
with reasonable certainty that it comes as close as possible to f ∗ , given restricted
knowledge about P. How to estimate the 0/1-loss will be discussed in greater detail
in Chapter 13. In this chapter we will estimate the loss by the error rate on a so-called
test set, drawn independently from the same population as, but disjoint from, the training examples.
Music classification tasks differ in their number of classes: one may, e.g., distinguish only "Classic" from "Non-Classic" music, or several genres like "Pop," "Hard-Rock" and others. This way, genre classification is either of type
“binary” or of type “multi-class.” Even the type “multi-label” might be adequate for
genre classification, since in some cases the genre is by no means clear. For example,
some “Hard-Rock” pieces might be also “Pop.” The class type might influence the
choice of the classification method, the content possibly not that much. An excep-
tion might be caused by the ease of interpretation of the results of some methods,
like decision trees (see Section 12.4.3).
Note that the input features are all assumed to be metric for convenience, in order
to be able to apply all classification methods introduced below. However, the type of
input features may define a third dimension in order to distinguish (at least) signal-
level from high-level data. Signal-level features are introduced in Chapter 5, high-
level features in Chapter 8. Typical signal-level or low-level features are chroma,
timbre, and rhythmic features. Typical high-level features are MIDI-features, music
scores, social web features, and even lyrics. Note that high-level features might
have to be transformed into metric features before becoming usable in the following
classification methods. Also this third dimension can be deliberately combined with
content. For example, one might want to identify genre from low-level audio features
or from music scores.
In this chapter we will only consider genre classification as an example applica-
tion based on the data set described below.
Example 12.1 (Music Genre Classification). We will consider a two-class example.
We will try to distinguish Classic from Non-Classic music comprising examples from
Pop, Rock, Jazz and other genres. Our training set consists of 26 MFCC features
corresponding to 10 Classic and 10 Non-Classic songs. The first 13 MFCC features
are aggregated for 4s classification windows with 2s overlap (mean and standard
deviation, named MFCC..m and MFCC..s, respectively, where the dots stand for the
number of the MFCC). Note that this obviously leads to non-i.i.d. data, at least because observations from overlapping consecutive windows are dependent. This leads to 2361 training observations over all 20 music pieces. Our test set consists of the corresponding features for 15 Classic and 105 Non-Classic songs. There is no overlap between the artists of the training and test sets. Below, we will compare the per-
formance of 8 classification methods, as delivered by the software R ([10]). Also we
will discuss examples of class separation by the different methods by means of plots.
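Three of the classifiers compared below can be fitted with a few lines of R. The MFCC data set itself is not reproduced here, so the sketch uses a synthetic stand-in with two features named after the two most important ones in the example (MFCC01m, MFCC02m); the packages and functions (MASS::lda, e1071::naiveBayes, class::knn) are standard, whereas the data-generating choices are ours.

library(MASS)    # lda()
library(e1071)   # naiveBayes()
library(class)   # knn()

set.seed(11)
make_data <- function(n) data.frame(
  MFCC01m = c(rnorm(n, 3.0, 0.6), rnorm(n, 4.5, 0.6)),
  MFCC02m = c(rnorm(n, 0.45, 0.05), rnorm(n, 0.55, 0.05)),
  genre   = factor(rep(c("Classic", "Non_Classic"), each = n)))
train <- make_data(200)
test  <- make_data(100)

pred_lda <- predict(lda(genre ~ ., data = train), newdata = test)$class
pred_nb  <- predict(naiveBayes(genre ~ ., data = train), newdata = test)
pred_knn <- knn(train[, 1:2], test[, 1:2], cl = train$genre, k = 31)

mean(pred_lda != test$genre)                          # error rates on the test set
mean(pred_nb  != test$genre)
mean(pred_knn != test$genre)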
12.4 Selected Classification Methods
[Figure 12.1: Idealized example with two classes (indicated by symbols "o" and "+") that can be separated by a line; the orthogonal projection space is indicated by line p (horizontal axis X1).]
i.e. the borders (µ_j − µ_i)^T Σ^{-1} x = const are hyperplanes in R^p. Note that hyper-
planes are lines in two dimensions, planes in three dimensions, etc. In the case of
two classes in two dimensions, we are looking for that line separating the two classes
“best.” An idealized example can be found in Figure 12.1 where the two classes can
be completely separated by a line. Please also notice the line indicating the projec-
tion direction. In reality the two classes will often overlap, though, so that an ideal
separation with zero errors is not possible.
Obviously, this rule is not restricted to two classes. A class prior may be es-
timated by assuming equiprobable classes, i.e. P̂(y) = 1/ (no. of classes), or by
calculating an estimate for the class probability from a training set by
P̂(y) = (no. of samples in class y)/(total no. of samples).
The conditional probabilities of the individual features xi given the class y have to
be estimated also from the data, e.g. by discretizing the data into groups for all in-
volved quantitative attributes. Since the true density of a quantitative feature is usu-
ally unknown for real-world data, unsafe assumptions and, thus, unsafe probability
estimations unfortunately often occur.
Discretization can circumvent this problem. With discretization, a qualitative at-
tribute Xi∗ is formed for each Xi . Each value xi∗ of Xi∗ corresponds to an interval
(ai , bi ] of Xi . Any original quantitative value xi ∈ (ai , bi ] is replaced by xi∗ . All rele-
vant probabilities are estimated with respect to xi∗ . Since probabilities of Xi∗ can be
properly estimated from corresponding frequencies as long as there are enough train-
ing instances, there is no need to assume the probability density function anymore.
However, discretization might suffer from information loss. See also Section 14.2.1
for discretization methods.
The different implementations of the Naive Bayes classifier typically differ in
different discretizations. Obviously, discretization can be effective only to the degree
that P(y|xx∗ ) is an accurate estimate of P(y|xx).
Example 12.2 (Music Genre Classification (cont.)). Let us come back to Exam-
ple 12.1 on music genre classification. We have applied LDA, IR, and NB to the
above data leading to the error rates 16.2% for LDA, 16.3% for IR, and 12.3%
for NB. So, simplifying the covariance matrix to a diagonal does not lead to an improvement when a normal distribution is assumed (IR), but it does when a discretization is used (NB). Let us try to visualize the class separation of the different
methods by means of projections using not all 26 features for classification, but only
the two most important, namely the means of the first two MFCCs, MFCC01m and
MFCC02m. With these features we get the error rates 18.3% for LDA, 18.4% for
IR, and 18.6% for NB. Thus, NB suffers the most from feature selection. Looking at
Figure 12.2 we see that the separation is similar for LDA and NB, except that NB
leads to a somewhat nonlinear separation. Note that in the plots, the background
colors indicate the posterior class probabilities through color alpha blending. IR
leads to nearly exactly the same separation as LDA.
Figure 12.2: Class separation by LDA (left) and NB (right); true classes: Classic =
circles, Non-Classic = triangles; estimated classes: Classic = darker region, Non-
Classic = lighter region; errors indicated by bigger symbols.
(In this introductory presentation, ties can be broken with any rule, e.g., at random.)
There are many elaborate variants of the nearest neighbor prediction scheme. An
alternative to the choice of a fixed number of neighbors is to define the neighborhood
directly based on the metric d, e.g., by thresholding distance to the query point by a
fixed radius. Please notice the relation to the Naive Bayes idea in the previous section
where continuous features where discretized using balls of nearby values.
It is also possible to modify the majority voting scheme. For example nearby
points may be given more impact, either based on distance or on their distance rank
relative to other neighbors. Elaborate tie-breaking mechanisms work with shared
and averaged (non-integer) ranks that can be taken into account in a subsequent
voting scheme. Finally, and maybe most importantly, the ad hoc choice of the Euclidean metric can be replaced, e.g., with the Mahalanobis distance d_M(x, x_i) := \sqrt{(x - x_i)^T S^{-1} (x - x_i)}, where S is the sample covariance or correlation matrix (see Section 11.2).
Nearest neighbor predictors have the advantage that they essentially do not have a
training step. On the downside, they require the storage of all training points for pre-
diction making, and worse, at least in a naive implementation all distances between
test and training points need to be computed. This makes predictions computation-
ally slow.2
An important statistical property of the k-NN rule is that in the limit of infi-
nite data it approaches the Bayes-optimal classifier for all problems. This property
is known as universal consistency. It holds under quite mild assumptions, namely
that the number of neighbors kn as a function of data set size grows arbitrarily
large (limn→∞ kn = ∞) and the relative size of the neighborhood shrinks to zero
(limn→∞ kn /n = 0). This means that on the one hand the prediction rule is flexi-
ble enough to model the optimal decision boundary of any classification task. On the
other hand, a relatively simple technical condition on the sequence kn ensures that
overfitting is successfully avoided in the limit of infinite data.
To summarize, the k-NN rule is an extremely simple yet powerful prediction
mechanism for supervised classification. It does not have a training step, but it re-
quires storing all training examples for prediction making. It is based on a metric
measuring distances of inputs. Nearest neighbor models are most frequently applied
to continuous features.
Example 12.3 (Music Genre Classification (cont.)). We continue with the music
genre classification example above, and look at different numbers of neighbors used
for classification. With k = 1, 11, 31 we get the error rates 20.9%, 18.4%, 17.1%
using all 26 features. When only using the most important 2 features MFCC01m,
MFCC02m, we get 26.4%, 22.9%, 20.5%, and the separation is shown in Figure 12.3.
Obviously, the boundary between the two classes gets smoother when k increases. Notice that the boundaries are much more flexible than for the methods LDA and NB. Keeping in mind that the training set includes 2361 observations, k = 31 is not very large.
2 In low-dimensional input spaces it is possible to reduce prediction time considerably by means of
binary space-partitioning tree data structures (e.g. KD-trees) from “brute force” search complexity of O(n)
to only O(log(n)) operations (see [6]).
[Plot region of Figure 12.3: panels classif.knn: k=1 (Train: mmce=0; CV: mmce.test.mean=0.114) and classif.knn: k=31 (Train: mmce=0.0966; CV: mmce.test.mean=0.0991); scatter plots of the training data with MFCC02m on the vertical axis and estimated regions for Classic and Non_Classic.]
Figure 12.3: Class separation by 1-NN (left) and 31-NN (right); true classes: Classic
= circles, Non-Classic = triangles; estimated classes: Classic = darker region, Non-
Classic = lighter region; errors indicated by bigger symbols.
employing the decision rule under consideration. Let us consider the two cases of
quantitative and qualitative split features.
In the quantitative case, the CART (Classification And Regression Tree) method
uses the following split technique. The observations are experimentally split into two
child nodes by means of each realized value of the first feature and splits of the kind
(value ≤ constant). The “yes-”cases are put into the left node, the “no-”cases into the
right one. Let s be such a split in node t. CART then evaluates the reduction of the
so-called Gini-impurity i(t) := 1 - S with S := \sum_{j=1}^{G} P^2(j|t), where S is the so-called purity function (which is maximal if all probability mass is concentrated on a single class) and P(j|t) is the probability of class j in node t. This reduction is calculated by means of the formula \Delta i(s,t) = i(t) - p_L \cdot i(t_L) - p_R \cdot i(t_R), where p_L is the share of the cases in node t which are put into the left child node t_L, and p_R analogously.
CART chooses that split as the best for the chosen feature which produces the biggest
impurity reduction. These steps are repeated for each feature. CART then orders
the features corresponding to their ability to reduce the impurity and realizes the
split with that feature and the corresponding split point with the biggest impurity
reduction. This procedure is recursively repeated for each current terminal node.
This way, CART constructs a very big tree with very many terminal nodes which are
either pure or contain only a very small number of different classes.
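As a small illustration of this split technique, the following base-R sketch computes the Gini impurity i(t), the impurity reduction Δi(s,t) of a split (value ≤ constant), and the best cutpoint for one quantitative feature. The function names are ours, and rpart's internal implementation is more elaborate.

  # Gini impurity of a node containing the labels y
  gini <- function(y) 1 - sum((table(y) / length(y))^2)

  # impurity reduction Delta i(s, t) for the split (x <= cutpoint)
  gini_reduction <- function(x, y, cutpoint) {
    left  <- y[x <= cutpoint]
    right <- y[x >  cutpoint]
    pL <- length(left)  / length(y)
    pR <- length(right) / length(y)
    gini(y) - pL * gini(left) - pR * gini(right)
  }

  # best split for one feature: try every realized value as a cutpoint
  best_split <- function(x, y) {
    cuts <- sort(unique(x))
    red  <- sapply(cuts, function(cp) gini_reduction(x, y, cp))
    c(cutpoint = cuts[which.max(red)], reduction = max(red))
  }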
In the case of qualitative split features, we could possibly try each possible feature
value for splits of the type (value = constant). However, let us also consider the
classification of the so-called C4.5 method. There, the node information content of a
subtree below a node for a sample of size n is measured by its entropy H(X):

H(X) = -\sum_{c=1}^{G} \frac{|y_c|_X}{n} \cdot \log_2 \frac{|y_c|_X}{n},   (12.3)

where G is the number of classes and |y_c|_X is the number of observations from X which belong to class c ∈ {1, ..., G}. The efficiency of candidate nodes can be measured by the so-called information gain, aiming to reduce the information content carried by a node using a split s:

gain(X, s) = H(X) - \sum_{j=1}^{k} \frac{|X_{js}|}{n} \cdot H(X_{js}),   (12.4)

where X_{js} are the observations of X with the j-th value of the k outcomes of the split feature in split s. Note that an arbitrary number k > 1 of split branches is allowed in
C4.5, not only 2 as in CART.
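A minimal base-R sketch of Equations (12.3) and (12.4), assuming the split feature is given as a factor with k outcomes; the helper names are ours.

  # entropy H(X) of a sample with class labels y, Equation (12.3)
  entropy <- function(y) {
    p <- table(y) / length(y)
    p <- p[p > 0]                      # avoid log2(0)
    -sum(p * log2(p))
  }

  # information gain of a qualitative split feature, Equation (12.4)
  info_gain <- function(split_feature, y) {
    n     <- length(y)
    parts <- split(y, split_feature)   # one part X_js per outcome j of the split feature
    entropy(y) - sum(sapply(parts, function(yj) length(yj) / n * entropy(yj)))
  }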
Several further enhancements led to the development of the decision tree algo-
rithms CART and C4.5 (for details see [3] and [9]), in particular handling of missing
feature values (cp. Section 14.2.3) and tree pruning. Especially the latter technique is
very important, since too large trees increase the danger of overfitting: the model describes the data from which it has been trained perfectly, but is no longer suitable for a reasonable classification of other instances (cp. Chapter 13). Moreover, understanding and interpretation of trees with many terminal nodes might be complicated.
Pruning Big decision trees are complex trees, where the tree complexity is measured by means of the number of terminal nodes. The trade-off between goodness of fit on the training set and not too high complexity is measured by the so-called cost-complexity := (error rate on the training set) + β · (number of terminal nodes), where β is the "penalty" per additional terminal node, often also called the complexity parameter.
The search for the tree of the “right size” starts with the pruning of the branches of
the maximal tree (Tmax ) from the terminal nodes (“bottom up”) as long as the training
error rate stays constant (T1 ). Then, we look for the so-called weakest link, i.e. for
that node for which the pruning of the corresponding subtree below this node leads
to the smallest increase of the training error. This is equivalent to looking for that
node for which the increase of the “penalty parameter” is smallest for maintaining the
cost-complexity at the same level. The subtree with the weakest link is pruned. This
procedure is repeated until only the tree root is left. From the corresponding sequence
of trees the tree with the lowest cross-validated error rate (see Section 13.2.3) is
chosen to be the final tree. This leads to the tree with the smallest prediction error.
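The cost-complexity criterion itself is only a penalized training error; the following tiny base-R sketch (with purely illustrative numbers, not taken from the example data) shows how two candidate trees would be compared for a given complexity parameter β.

  cost_complexity <- function(train_error, n_terminal, beta) {
    train_error + beta * n_terminal
  }
  # illustrative comparison for beta = 0.01:
  cost_complexity(0.05, 40, 0.01)   # 0.45 for a large tree
  cost_complexity(0.09,  8, 0.01)   # 0.17 for a pruned tree, which would be preferred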
Example 12.4 (Music Genre Classification (cont.)). Let us now look at the above
music genre classification example by means of CART decision trees based on Gini-
impurity (function rpart in the software R). We have tried unpruned and pruned trees.
Unpruned trees with all 26 MFCC features lead to 21.5% error rate, pruned trees to
17.4%, both with default parameter values. Figure 12.4 starts with the correspond-
ing pruned tree. Then, it shows the separation based on MFCC01m and MFCC02m
for the unpruned case. Note that the separation is along the coordinate axes. Also
notice the light area between the areas definitely assigned to one of the classes. In
this area assignment is most uncertain. The unpruned and the pruned tree based on
MFCC01m and MFCC02m lead to error rates 18.5% and 19.2%.
3 We ignore the case f(x) = 0. Any prediction may be made in this case, e.g., at random.
[Plot region of Figure 12.4: panel classif.rpart: xval=0 (Train: mmce=0.0936; CV: mmce.test.mean=0.0961); the pruned tree's terminal nodes show the error counts 107/1100 (Non_Classic), 25/921 (Classic), 42/219 (Classic), and 27/121 (Non_Classic); the separation plot has MFCC02m on the vertical axis.]
Figure 12.4: The tree of the pruned rpart model based on 26 MFCCs with node error
rates (left) and class separation of an unpruned rpart model based on MFCC01m and
MFCC02m (right); true classes: Classic = circles, Non-Classic = triangles; estimated
classes: Classic = darker region, Non-Classic = lighter region; errors indicated by
bigger symbols.
f so that all points are correctly classified and the safety margin between the classes
is maximized; see Figure 12.5 for an illustration. The margins of some points coin-
cide with the margin of the hyperplane, these points are called support vectors. The
training problem is equivalent to requiring function values of at least +1 for positive
class points and at most −1 for negative class points while moving the correspond-
ing level sets { f = ±1} as far away from the decision boundary { f = 0} as possible.
This amounts to minimization of the (squared) norm of w ∈ R^p:

\min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i \cdot (w^T x_i + b) \geq 1 \;\; \forall i.
Figure 12.5: SVM class separation. Classes are indicated by light triangles and dark
circles. Left: The hard-margin support vector machine separates classes with the
maximum-margin hyperplane (solid line). Point-wise safety margins are indicated
by dotted lines. The dashed lines parallel to the separating hyperplane indicate the
safety margin. The five points located exactly on these margin hyperplanes are the
support vectors. Note that the space enclosed by the margin hyperplanes does not
contain any data points. Right: For a linearly non-separable problem, the support
vector machine separates classes with a large margin, while it allows for margin
violations, indicated by dotted lines.
The solution (w^*, b^*) of this problem is defined as the (standard) linear SVM. It has
a single parameter, C > 0, trading maximization of the margin against minimization
of margin violations. Although exact maximization of the margin is meaningless in
the non-separable case, the SVM still achieves the largest possible margin given the
constraints. It is therefore often referred to as a large margin classifier.
The SVM problem can be rewritten in unconstrained form as

\min_{w \in \mathbb{R}^p} \; \frac{1}{2} \|w\|^2 + C \cdot \sum_{i=1}^{n} L(f(x_i), y_i),   (12.5)
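As a small illustration of (12.5) for a linear decision function f(x) = w^T x + b, the following base-R sketch evaluates the objective, assuming that L is the usual hinge loss max(0, 1 − y · f(x)) with labels y ∈ {−1, +1}; since the explicit form of L is not reproduced in this extraction, this choice is an assumption, and the names are ours.

  # regularized hinge-loss objective of a linear SVM (evaluation only, no optimization)
  svm_objective <- function(w, b, X, y, C) {
    f     <- as.vector(X %*% w) + b      # decision values f(x_i)
    hinge <- pmax(0, 1 - y * f)          # margin violations, y in {-1, +1}
    0.5 * sum(w^2) + C * sum(hinge)
  }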
⟨·, ·⟩ given by the kernel function: k(x, x') = ⟨φ(x), φ(x')⟩. This procedure remains implicit in the sense that the feature map φ does not need to be constructed, since we only need the scalar products as specified by the kernel Gram matrix. Indeed, a
linear method such as the SVM (just like linear regression and many others) can be
formulated in terms of vector space operations (addition of vectors, multiplication of
vectors with scalars) and inner products. Now the kernel trick amounts to replacing
all inner products with a kernel function. This way of handling nonlinear transfor-
mations is in contrast to other transformation-based methods with an explicit (often
manual) construction of a feature map and the application of the algorithm to the re-
sulting feature vectors. The kernel approach is particularly appealing since (a) it can
implicitly handle extremely high-dimensional and even infinite-dimensional feature
spaces H , and (b) for many feature maps of interest, the computation of the kernel
is significantly faster than the calculation of the corresponding feature vectors.
The most important examples of kernel functions on X = R p are as follows:
The Gaussian kernel (also often called radial basis function (RBF) kernel) corre-
sponds to an infinite dimensional feature space H .
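Because the book's explicit kernel formulas are not reproduced in this extraction, the following base-R sketch uses the common parameterization k(x, x') = exp(−γ ‖x − x'‖²) of the Gaussian (RBF) kernel and builds the corresponding Gram matrix; the function names and the width parameter γ are our own choices.

  rbf_kernel <- function(x, xp, gamma = 1) exp(-gamma * sum((x - xp)^2))

  gram_matrix <- function(X, gamma = 1) {
    n <- nrow(X)
    K <- matrix(0, n, n)
    for (i in 1:n)
      for (j in 1:n)
        K[i, j] <- rbf_kernel(X[i, ], X[j, ], gamma)
    K                                  # n x n matrix of pairwise kernel values
  }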
Non-linear SVM Given a kernel k and a regularization parameter C > 0 the non-
linear SVM classifier is defined as the solution w ∈ H , b ∈ R of the problem
\min_{w \in \mathcal{H},\, b \in \mathbb{R}} \; \frac{1}{2} \|w\|^2 + C \cdot \sum_{i=1}^{n} L(f(x_i), y_i),

where the squared norm ‖w‖² = ⟨w, w⟩ and the scalar product in the decision function are operations of the feature space H, and the input x is replaced with its feature vector φ(x), i.e. f(x) = ⟨w, φ(x)⟩ + b, where the scalar product ⟨·, ·⟩ is given by the kernel function k. Fortunately, the so-called representer theorem guarantees that the optimum w* is located in the span of the training data, i.e. w* = ∑_{i=1}^{n} α_i φ(x_i), leading to the following decision function:

f(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i).   (12.6)
solve learning problems that require even highly non-linear decision functions while
avoiding overfitting at the same time. This requires problem-specific tuning of the
regularization trade-off parameter C.
Note that the sum in Equation (12.6) is usually sparse, i.e., many of the coeffi-
cients αi are zero. This means that often only a small subset of the data needs to be
stored in the model, namely the x i corresponding to non-zero coefficients αi . These
points are the support vectors defined above.
Multiple Classes Returning to our initial example of genre classification, it is ap-
parent that the SVM’s ability to separate two classes is insufficient. Many practical
problems involve three or more classes. Therefore, the large margin principle has
been extended to multiple classes.
The simplest and maybe most widespread scheme is the one-versus-all (OVA)
approach. It is not at all specific to SVMs; instead it may be applied to turn any binary
classifier based on thresholding of a real-valued decision function into a multi-class
classifier. For this purpose, the G-class problem is converted into G = |Y | binary
problems. In the c-th binary problem, class c acts as the positive class (label +1),
while the union of all other classes becomes the negative class (label −1). An SVM
decision function fc is trained on each of these G binary problems, thus the c-th
decision function tends to be positive only for data points of class c. Now for a point
x the OVA scheme produces G different predictions. If only one of the resulting
values fc (xx) is positive, then the machines produce a consistent result: all machines
agree that the example is of class c, and hence the prediction ŷ = c is made. However,
either none or more than one function value may be positive. Then the G binary
predictions are inconsistent. The above prediction rule is extended to this case by
picking the function with largest value as the prediction:
\hat{y} = f(x) = \arg\max_{c \in \mathcal{Y}} \left\{ f_c(x) \right\}.   (12.7)
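A minimal base-R sketch of the prediction rule (12.7); the G per-class decision functions are assumed to be given (for example, trained binary SVM decision functions), and all names are ours.

  # decision_fns: named list of functions f_c, one per class; x: a new observation
  ova_predict <- function(decision_fns, x) {
    values <- sapply(decision_fns, function(fc) fc(x))
    names(decision_fns)[which.max(values)]   # class with the largest decision value
  }
  # usage with hypothetical per-class decision functions:
  # fns <- list(Classic = f_classic, Jazz = f_jazz, Pop_Rock = f_pop_rock)
  # ova_predict(fns, x_new)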
There are other extensions of the large margin framework to multiple classes, e.g.
replacing the hinge loss with a corresponding loss function for G > 2 classes. See,
e.g., [16, 5, 8], for examples.
Example 12.5 (Music Genre Classification (cont.)). Let us now look at the above
music genre classification example by means of support vector machines. We have
tried linear SVMs (lSVM) and SVMs with a Gaussian kernel with a width γ optimized
on a grid (kSVM). In both cases, the cost parameter C was also optimized on a grid,
taking that parameter value of a prefixed grid which minimizes the error estimate.
For lSVM and kSVM we get 13.6%, 19.6% error rates based on all features. Using
only the best 2 features we get 17.9%, 25.5% error rates. Figure 12.6 shows the
separations for the two SVMs based on the best 2 features. Note the similarity of the plot for lSVM to the result of LDA, and the flexibility of the separation by kSVM, similar to that of k-NN.
[Plot region of Figure 12.6: two classif.svm.tuned panels (Train: mmce=0.109; CV: mmce.test.mean=0.111 and Train: mmce=0.0826; CV: mmce.test.mean=0.0919); scatter plots of the training data with MFCC02m on the vertical axis and estimated regions for Classic and Non_Classic.]
Figure 12.6: Class separation by lSVM (left) and kSVM (right); true classes: Classic
= circles, Non-Classic = triangles; estimated classes: Classic = darker region, Non-
Classic = lighter region; errors indicated by bigger symbols.
[Figure 12.7: class separation by bagged decision trees (panel classif.rpart.bagged: bw.iters=1000; bw.feats=0.5; Train: mmce=0.109; CV: mmce.test.mean=0.163), plotted over MFCC01m and MFCC02m; estimated classes: Classic = darker region, Non-Classic = lighter region; errors indicated by bigger symbols.]
Table 12.1: Test Error Rates (in %) for the Different Classifiers
are not drawn. A disadvantage is that interpretation is much more complicated than
for classification trees.
Example 12.6 (Music Genre Classification (cont.)). For Bagged Decision Trees (BDT)
(1000 replications with 50% randomly chosen features each) the estimated error rate
is 16.9% based on all MFCCs and 21.4% based on the 2 best features. Figure 12.7
shows the separation based on the 2 best features only. Notice that the rectangles
on the main diagonal are purely assigned to one class, and realizations of the other
class are always marked as an error. In contrast, in the rectangles on the secondary
diagonal the assignment is changing, meaning that the voting of the 1000 trees is
ambiguous.
Let us now compare the error rates of all introduced classifiers based on all 26
MFCCs and the two best. From Table 12.1 it is clear that the Naive Bayes method
reproduces the true classes best (error rate 12%) based on all MFCCs. Based on
only the two best MFCCs, 6 methods approximately produce the best error rates
(near 18%). Obviously, the feature selection, which here takes only the two most important features into account, leaves room for improvement. See Chapter 15 for a systematic
introduction into feature selection methods.
[Diagram for Figure 12.8: input nodes X1, ..., XL feed the hidden nodes h1, ..., hd through weights β; the hidden outputs are combined with weights α1, α2 and biases α01, α02 into f1 and f2, passed through the softmax transformations s1, s2, and yield the outputs Y1, Y2 with errors ε1, ε2; g() denotes the activation function.]
Figure 12.8: Model of a multi-layer neural network for classification. There are L
input neurons and d hidden neurons in one hidden layer.
problems because of space restrictions. ANNs can be used to model regression problems with measured response variables (cp. Section 9.8.1) and classification problems with class responses. We will concentrate here on the classification problem in
the 2-class case. Let us start, however, with a very general definition of ANNs and
specialize afterwards.
Definition 12.1 (Artificial Neural Network (ANN)). An Artificial Neural Network
(ANN) consists of a set of processing units, the so-called nodes simulating neurons,
which are linked analogous to the synaptic connections in the nervous system. The
nodes represent very simple calculation components based on the observation that a
neuron behaves like a switch: if sufficient neurotransmitters have been accumulated
in the cell body, an action potential is generated. This potential is mathematically
modeled as a weighted sum of all signals reaching the node, and is compared to a
given threshold. Only if this limit is exceeded, the node “fires.” Structurally, an
ANN is obviously comparable to a natural (biological) neural network like, e.g., the
human brain.
Let us now consider the special ANN exclusively studied in this chapter.
Definition 12.2 (Multi-Layer Networks). The most well-known neural network is
the so-called Multi-Layer Neural Network or Multi-Layer Perceptron. This network
is organized into layers of neurons, namely the input layer, any number of hidden
layers, and the output layer. In a feed-forward network, signals are only propagated
in one direction, namely from the input nodes towards the output nodes. As in any
neural network, in a multi-layer network, a weight is assigned to every connection
between two nodes. These weights represent the influence of the input node on the
successor node.
For simplicity we consider a network with a single hidden layer (see Figure 12.8).
In such an artificial neural network (ANN), linear combinations of the input signals
[Figure 12.9: plot of the logistic activation function g(x) for x from −6 to 6, with values between 0 and 1.]
1, X1 , . . . , XL with individual weights βl are used as input for each node of the hidden
layer. Note that the input 1 corresponds to a constant. Each node then transforms
this input signal using an activation function g to derive the output signal. In a 2-
class classification problem, for each class 1 and 2, these output signals are then
again linearly combined with weights α_{ig}, g ∈ {1, 2}, to determine the value f_g of a node representing class g. In addition to the transformation g of the input signals X = (1, X_1, . . . , X_L)^T, a constant term α_{0g}, the so-called bias, is added to the output, analogous to the intercept term of the linear model. Finally, in order to be able to interpret the outputs f_g as (pseudo-)probabilities, the real values of f_g are transformed by the so-called softmax transformation s_g, which should be as near as possible to the value of the dummy variable Y_g, being 1 if the correct class is g and 0 otherwise. Since both Y_g are modeled jointly, one of the probabilities should be near 1 if and only if the other is near 0. The model errors are denoted by ε_g, g ∈ {1, 2}. In what
follows, we will introduce and discuss all these terms.
Definition 12.3 (Activation Function). The activation function is generally not cho-
sen as a jump function “firing” only beyond a fixed activation potential, as originally
proposed, but as a symmetrical sigmoid function with the properties:
A popular choice for the activation function is the logistic function (see Figure 12.9):
g(x) = \frac{1}{1 + e^{-x}}.
Another obvious choice for the activation function is the cumulative distribution
function of any symmetrical distribution.
s_g(X, \Theta) = \frac{e^{f_g(X, \Theta)}}{e^{f_1(X, \Theta)} + e^{f_2(X, \Theta)}}, \quad g \in \{1, 2\}.
Note that normalization leads to cross-dependencies of s1 on f2 and of s2 on f1 so that
the (pseudo-) probability of one class is dependent on the corresponding (pseudo-
)probability of the other class, which obviously makes sense.
Overall this leads to the following model for neural networks:
Definition 12.5 (Model for Neural Networks). The model corresponding to the multi-
layer network with one hidden layer and two classes has the form:
Y_1 = s_1\Big(\alpha_{01} + \sum_{i=1}^{d} \alpha_{i1}\, g(\beta_i^T X + \beta_{i0})\Big) + \varepsilon_1 =: s_1(f_1(X, \Theta)),

Y_2 = s_2\Big(\alpha_{02} + \sum_{i=1}^{d} \alpha_{i2}\, g(\beta_i^T X + \beta_{i0})\Big) + \varepsilon_2 =: s_2(f_2(X, \Theta)),
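As an illustration of Definition 12.5, the following base-R sketch computes the forward pass s_g(f_g(x, Θ)) for a single input x with the logistic activation g and the softmax transformation from above; the particular parameter layout (matrices beta and alpha, bias vectors beta0 and alpha0) is our own choice of representation and not prescribed by the text.

  g <- function(x) 1 / (1 + exp(-x))                   # logistic activation

  # x: input vector of length L; beta: d x L, beta0: length d;
  # alpha: d x 2, alpha0: length 2. Returns (s_1, s_2).
  mlp_forward <- function(x, beta, beta0, alpha, alpha0) {
    h <- g(as.vector(beta %*% x) + beta0)              # hidden-layer outputs h_1, ..., h_d
    f <- as.vector(t(alpha) %*% h) + alpha0            # f_1(x, Theta), f_2(x, Theta)
    exp(f) / sum(exp(f))                               # softmax outputs s_1, s_2
  }

The predicted class is then the g with the larger softmax output, as stated in the prediction rule below.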
For an input x with unknown class y, predict the class by ŷ = \arg\max_g s_g(x, \hat{\Theta}), where \hat{\Theta} is the least-squares estimate of the unknown parameter vector Θ.
Using many hidden layers leads to deep learning (see, e.g., [1]). Convolutional
neural networks (CNNs) are a particularly successful class of deep networks. Instead
of using fully connected layers (each node in layer k feeds into each node in layer
k + 1), a CNN has a spatially constrained connectivity: each hidden node “sees”
only a small patch of the input or previous hidden layer, its so-called receptive field.
For music data, this means that each node in the first hidden layer processes a short
time window. Each time window is processed by a number of neurons in parallel, so
that different nodes can specialize on the extraction of different features. A second
special property of CNNs is that neurons processing different time windows share
the same weights in their input connections. In effect, the same features are ex-
tracted from each time window. The overall operation of such a processing layer is
described compactly as a set of convolutions, see Equation (4.4), where the convolu-
tion kernels are encoded in the weights of the network and hence learned from data.
Information from neighboring time windows is merged in subsequent layers. In such
a deep learning architecture, low layers extract simple, low-level features, which are
aggregated into complex, high-level information in higher layers. The last layer(s)
of a CNN are fully connected. They compute a class prediction from the high-level
features.
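The weight sharing described above can be illustrated with a one-dimensional convolution in base R; this is a deliberately simplified sketch (real CNN layers add multiple channels, nonlinearities, and pooling), and the names are ours.

  # one convolutional "neuron": the same kernel (shared weights) is applied
  # to every time window of the signal, yielding one feature value per window
  conv1d <- function(signal, kernel) {
    n_out <- length(signal) - length(kernel) + 1
    sapply(1:n_out, function(t) sum(signal[t:(t + length(kernel) - 1)] * kernel))
  }
  # several kernels in parallel give several feature maps
  # (hypothetical kernels k1, k2, k3 and signal x):
  # feature_maps <- lapply(list(k1, k2, k3), function(k) conv1d(x, k))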
Obviously, this kind of modeling can be easily extended to G > 2 classes. For
more information on neural networks see, e.g., [17].
12.6 Further Reading
… classification methods. See Chapter 13 for a systematic introduction into various evaluation
measures.
An ensemble method not discussed in this book because of space restrictions is
boosting. Similar to bagging, with boosting an ensemble of classifiers is applied and
aggregated. However, the different elements of the ensemble are also weighted by
their quality in the overall decision (see, e.g., [7]).
The present chapter only deals with classification problems with unambiguous labels. Problems with multiple labels, as mentioned in the beginning of this chapter, can be treated, e.g., as described in [13]. Also, this chapter only deals with individual classification models. In music data analysis, though, sometimes so-called
hierarchical models are used, meaning that the input of one classification model is
determined by another classification model (see, e.g., Chapter 18).
Bibliography
[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in
Machine Learning, 2(1):1–71, 2009.
[2] C. M. Bishop. Neural networks and their applications. Review of Scientific
Instruments, 65(6):1803–1832, 1994.
[3] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[4] L. Breiman. Random forests. Machine Learning Journal, 45(1):5–32, 2001.
[5] K. Crammer and Y. Singer. On the learnability and design of output codes for
multiclass problems. Machine Learning, 47(2):201–233, 2002.
[6] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best
matches in logarithmic expected time. ACM Transactions on Mathematical
Software, 3(3):209–226, 1977.
[7] T. Hastie, R. Tibshirani, and J. H. Friedman. The elements of statistical learn-
ing: data mining, inference, and prediction. New York: Springer-Verlag, 2001.
[8] Y. Lee, Y. Lin, and G. Wahba. Multicategory support vector machines: Theory
and application to the classification of microarray data and satellite radiance
data. Journal of the American Statistical Association, 99(465):67–82, 2004.
[9] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.
[10] R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2014.
[11] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Ma-
chines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[12] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[13] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. Inter-
national Journal of Data Warehousing & Mining, 3(3):1–13, 2007.
[14] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.
Chapter 13
Evaluation
IGOR VATOLKIN
Department of Computer Science, TU Dortmund, Germany
CLAUS WEIHS
Department of Statistics, TU Dortmund, Germany
13.1 Introduction
Most models are not an exact image of reality. Often, models only give rough ideas
of real relationships. Therefore, models have to be evaluated as to whether their image of reality is acceptable. This is true for both regression and classification models (cp.
Section 9.8.1 and Chapter 12). What are, however, the properties a model should
have to be acceptable? Most of the time, it is more important to identify models
with good predictive power than to identify factors which significantly influence the
response on the sample used for modeling (goodness of fit). This means that the
predictions of a model should be acceptable, i.e. a model should be able to well
approximate (unknown) responses for values of the influential factors which were
not used for model building. For example, it is not acceptable that the goodness
of fit of a regression model for valence prediction dropped from 0.58 to 0.30 and
to 0.06 when the model was trained on film music and validated on classical and popular pieces, respectively (cp. Section 21.5.6). Also, a classification model for instrument recognition determined on some instances of music pieces has to be able
to predict the playing instrument of a new piece of music from the audio signal
(cp. Section 18.4). Such arguments lead to corresponding ideas for model selection,
which will be discussed in this section.
If you wish to obtain an impression of the predictive power of a model without re-
lying on too many assumptions, i.e. in the non-parametric case, you should apply the model to samples from the underlying distribution which were not used for model estimation. If the same sample were used both for estimation and for checking predictive
power, we might observe so-called overfitting, since the model is optimally fitted to
this sample and might be less adequate for other samples. Unfortunately in practice,
most of the time only one sample is available. So we have to look for other solutions.
New relevant data can only be generated by means of new experiments, which are
often impossible to conduct in due time. So what should we do? As a solution to this
dilemma, resampling methods have been developed since the late 1960s. The idea is
to sample repeatedly from the only original sample we have available. These repeti-
tions are then used to generate predictions for cases not used for model estimation.
This way, we can at least be sure that the values in the sample can be realized by
practical sampling.
In this chapter, we will briefly introduce such methods and refer to three kinds
of evaluation tasks: model selection, feature selection, and hyperparameter tuning.
Model selection is the main task needing evaluation; feature selection and hyperparameter tuning can be thought of as subtasks of finding optimal models.
Model Selection In many cases, several models or model classes are candidates for
fitting the data. Resampling methods and the related predictive power assessment
efficiently support the selection process of the most appropriate and reliable model.
Feature Selection Often an important decision for the selection of the best model
of a given model type is the decision about the features to be included in the model
(e.g., in a linear model). Feature selection is discussed in Chapter 14.
Hyperparameter Tuning Most modeling strategies require the setting of so-called
hyperparameters (e.g., the penalty parameter C in the optimality criterion of soft-
margin linear support vector machines, see Section 12.4.4). Thus, tuning these hy-
perparameters is desirable to determine a model of high quality. This can be realized
by some kind of (so-called) nested resampling introduced below in Section 13.4.
In the following, model quality is solely reflected by predictive power, which in
our view is the most relevant aspect, although other aspects of model quality are also
discussed in some settings. For example, interpretable models are preferable, which
is obviously related to feature selection. It should also be noted that it is usually
advisable to choose a less complex model achieving good results for small sample
sizes, since more-complex models usually require larger data sets to have sufficient
predictive power.
We will deal here with the two most important statistical modeling cases, i.e.
supervised classification (see Chapter 12) and regression (see Chapter 9): In classifi-
cation problems an integer-valued response y ∈ Z with finitely many possible values
y_1, . . . , y_G has to be predicted by a so-called classification rule based on n observations of z_i = [x_i^T \; y_i]^T, i = 1, . . . , n, where the vector x summarizes the influential factors.
The remainder of this chapter is organized as follows. In the next section, sev-
eral established resampling methods as general frameworks for model evaluation are
introduced. Later on (Section 13.3), we will focus on evaluation measures, mainly
different performance aspects (Sections 13.3.1-13.3.5), but also provide a discussion
of several groups of measures beyond classification performance (Section 13.3.6).
Section 13.4 provides a practical example of how resampling can be applied for hyperparameter tuning. The application of statistical tests for the comparison of classifiers
is discussed in Section 13.5. We conclude with some remarks on multi-objective
evaluation (Section 13.6) and refer to further related works (Section 13.7).
Before we go into detail concerning resampling methods and performance mea-
sures let us introduce the two data sets and four classification models evaluated later
as examples for the evaluation measures. On these training samples the below clas-
sification models are trained and the corresponding models are evaluated by the fol-
lowing performance measures on a larger test sample.
Example 13.1 (Training Samples). Table 13.1 lists ten tracks of four different genres
for the classification task of identifying pop/rock songs among music pieces of other
genres. Two training samples are used: a smaller sample with only four tracks (upper
part of the table), and a larger one using all tracks.
Example 13.2 (Classification Models). Using the two training samples from Exam-
ple 13.1, four decision tree models (cf. Section 12.4.3) are trained using a set of
13 MFCCs (see Section 5.2.3) for which the mean values are estimated for classifi-
cation windows of 4 s length and 2 s overlap. Figure 13.1 shows the trees created
from the smaller set (left subfigures (a), (b)) and the larger one (right subfigures (c),
(d)), where the depth of the tree is limited either to two levels (upper subfigures (a),
(c)) or four levels (lower subfigures (b), (d)). The numbers of positive and negative
instances (classification windows) are given in brackets, e.g., the tree in the subfigure
(a) classifies 7 windows of pop/rock tracks as not belonging to this class.
So, which of these four models is best? This will be evaluated in what follows by
means of a larger test sample of songs of the same genres as in the training samples.
13.2 Resampling
The situation in the above example is typical for practical evaluation: there is only one data set, from which training samples are taken. With resampling, such training samples are drawn randomly. This will be systematically discussed
below.
The values of the quality measures pk of the models ak , k = 1, . . . , K, depend on
the underlying data distribution F. Let the learning sample L consist of n independent
observations z j , j = 1, . . . , n, distributed according to some unknown distribution F.
This is denoted by L ∼ Fn . The most frequent situation in practice is that only one
learning sample L ∼ Fn is available and there is no possibility to easily generate new
samples. In this situation F is imitated by the empirical distribution function on the
learning sample F̂n . Resampling is sampling of B independent learning samples from
F̂n . In order to distinguish the resampled new learning samples from the original
learning sample, the new learning samples will be called training samples in the
following.
L_1, . . . , L_B ∼ F̂_n.
The quality of a model a_k might be assessed on the basis of the training samples L_i (i = 1, . . . , B), i.e. by calculating
p̂_ki = p(a_k, L_i), i = 1, . . . , B,
or, preferably, on the basis of independent test samples T_i:
p̂_ki = p̂(a_k, T_i).
In practice, the test samples are normally taken as the residue of L_i in L, i.e. T_i = L̄_i = L \ L_i.
Note that in the above example we have another situation, where an extra, not too
small, test sample is available apart from the learning sample. This means, though,
that we deliberately restrict the learning sample to be small by artificially “holding
out” the observations of the test sample. This procedure in a way mimics the practical
situation that we can train ourselves typically only on small data sets.
13.2.2 Hold-Out
Let us start with the special case of the above generic procedure with B = 1, i.e.
where only one split into training and test sample is realized.
Definition 13.3 (Hold-Out or Train-and-Test Method). For large n the so-called
hold-out or train-and-test method could be used, where the learning sample is di-
vided into one training sample L0 of smaller size and one test sample T : L = L0 ∪ T .
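A minimal base-R sketch of such a hold-out split over the observation indices 1, ..., n of the learning sample; the 2/3 training fraction and the function name are our own choices.

  holdout_split <- function(n, train_fraction = 2/3, seed = 1) {
    set.seed(seed)                                        # for reproducibility
    train <- sample(n, size = floor(train_fraction * n))  # indices of the training sample
    list(train = train, test = setdiff(1:n, train))       # the rest forms the test sample
  }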
If n is so small that such an approach is infeasible, at first sight the following method appears to be most natural:
Definition 13.4 (Resubstitution Method). The resubstitution method uses for each
model the original training sample also as the test sample.
Unfortunately, such an approach often leads to so-called overfitting, since the
model was optimally fitted to the learning sample, and thus the error rate on this
same sample will likely be better than on other, unseen, samples. With the help of
the resampling methods described in the next subsections, such overfitting can be
avoided.
13.2.3 Cross-Validation
Cross-Validation (CV) [26, 13] is probably one of the oldest resampling techniques.
Like all other methods presented in this subsection, it uses the generic resampling
strategy as described in Algorithm 13.1. The B subsets (line 1 of Algorithm 13.1)
are generated according to Algorithm 13.2. Note that the instruction SHUFFLE(L)
stands for a random permutation of the sample L. The idea is to divide the data set
into B equally sized blocks and then use B − 1 blocks to fit the model and validate
it on the remaining block. This is done for all possible combinations of B − 1 of
the B blocks. The B blocks are usually called folds in the cross-validation literature.
So a cross-validation with B = 10 would be called a 10-fold cross-validation. Usual
choices for B are 5, 10, and n.
Table 13.2: Variants of Cross-Validation: Number of Cases and Repetitions

                   Leave-One-Out    B-Fold CV
  training cases   n − 1            n − n/B
  test cases       1                n/B
  repetitions      n                B
just one split of the original learning sample into a smaller new learning sample and
a test sample, often produce a satisfying accuracy of the quality criterion.
Also in B-fold cross-validation with B < n, the cases are randomly partitioned in
B mutually exclusive groups of (at least nearly) the same size. Each group is used
exactly once as the test sample and the remaining groups as the new learning sample,
i.e. as the training sample. In classification, the mean of the error rates in the B
test samples is called cross-validated error rate. Table 13.2 gives an overview of the
variants of cross-validation.
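A minimal base-R sketch of this fold generation, corresponding to the shuffle-and-block idea of Algorithm 13.2; the names are ours.

  # assign the n observations randomly to B (nearly) equally sized folds;
  # fold b serves once as the test sample, the remaining folds as the training sample
  cv_folds <- function(n, B = 10, seed = 1) {
    set.seed(seed)
    fold_id <- sample(rep(1:B, length.out = n))
    lapply(1:B, function(b) list(train = which(fold_id != b),
                                 test  = which(fold_id == b)))
  }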
13.2.4 Bootstrap
The most important alternative resampling method to cross-validation is the boot-
strap. We will only discuss the most classical variant here, called the e0 bootstrap.
The development of the bootstrap resampling strategy [8] is ten years younger
than the idea of cross-validation. Again, Algorithm 13.1 is the basis of the method,
but the B subsets are generated using Algorithm 13.3. Note that the instruction
RANDOMELEMENT(L) stands for drawing a random element from the sample L
by means of uniformly distributed random number ∈ {1, . . . , n}.
The subset generation is based on the idea that instead of sampling from L with-
out replacement, as in the CV case, we sample with replacement. This basic form
of the bootstrap is often called the e0 bootstrap. One of the advantages of this ap-
proach is that the size of the training sample, in the bootstrap literature often also
called the in-bag observations, is equal to the actual data set size. On the other hand,
Table 13.3: Bootstrap Method

                   Bootstrap
  training cases   n (j different)
  test cases       n − j
  repetitions      ≥ 200
this means that some observations can and likely will be present multiple times in
the training sample Li . In fact, asymptotically only about 63.2% of the data points
in the original learning sample L will be present in the training sample, since the
probability not to be chosen n times is (1 − 1/n)n , so the probability to be chosen is
1 − (1 − 1/n)n ≈ 1 − e−1 ≈ 0.632. The remaining 36.8% of observations are called
out-of-bag and form the test sample as in CV.
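A minimal base-R sketch of one e0 bootstrap repetition; the helper name is ours, and the commented lines illustrate numerically that the in-bag share is indeed close to 1 − e^(−1) ≈ 0.632.

  bootstrap_split <- function(n, seed = 1) {
    set.seed(seed)
    in_bag <- sample(n, size = n, replace = TRUE)              # training sample, drawn with replacement
    list(train = in_bag, test = setdiff(1:n, unique(in_bag)))  # out-of-bag test sample
  }
  # share of distinct in-bag observations over 200 repetitions of size n = 1000:
  # shares <- sapply(1:200, function(b) length(unique(sample(1000, replace = TRUE))) / 1000)
  # mean(shares)   # close to 1 - exp(-1), i.e. about 0.632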
Here the number of repetitions B is usually chosen much larger than in the CV
case. Values of B = 100 up to B = 1000 are not uncommon. Do note, however,
that there are nn different bootstrap samples. So for very small n there are limits to
the number of bootstrap samples you can generate. In general, B ≥ 200 is consid-
ered to be necessary for good bootstrap estimation (cp. Table 13.3). This number of
repetitions may be motivated by the fact that in many applications not only the boot-
strap quality criterion is of interest, but the whole distribution, especially the 95%
confidence interval for the true value of the criterion. For this, first the empirical
distribution of the B quality measure values on the test samples is determined, and
then the empirical 2.5% and 97.5% quantiles. With 200 repetitions, the 5th and the
195th element of the ordered list of the quality measures can be taken as limits for the
95% confidence interval, i.e. there are enough repetitions for an easy determination
of even extreme quantiles. Note, however, that the bootstrap is much more expensive
than LOOCV, at least for small learning samples.
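The percentile interval described above can be read off directly from the sorted list of the B resampled values of the quality measure; a base-R sketch (the helper name is ours, and the simple index rule used here is only one of several common percentile conventions).

  percentile_ci <- function(values) {
    s <- sort(values)
    B <- length(values)
    c(lower = s[ceiling(0.025 * B)], upper = s[floor(0.975 * B)])
  }
  # with B = 200 this picks the 5th and the 195th element, as in the text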
The fact that with the bootstrap some observations are present multiple times
in the training sample can be problematic for some modeling techniques. Several
approaches have been proposed for dealing with this. Most add a small amount of
random noise to the observations [8].
Please note that e0 can be approximated by repeated 2-fold cross-validation, i.e.
by repeated 50/50 partition of the learning sample, or by repeated 2:1 train-and-test
splitting, because the e0 generates roughly 63.2% in-bag (train) observations and
36.8% out-of-bag (test) observations.
Another problem with adding some observations multiple times to the training
sample is that we overemphasize their importance. This is called oversampling. This
leads to an estimation bias for our quality measure. Instead of discussing variants
of bootstrap which try to counter this, we will introduce another resampling method
called subsampling which does not suffer from multiple observations.
13.2.5 Subsampling
Subsampling is very similar to the classical bootstrap. The only difference is that
observations are drawn from L without replacement (see Algorithm 13.4). Therefore,
the training sample has to be smaller than L or no observations would remain for the
test sample. Usual choices for the subsampling rate |Li |/|L| are 4/5 or 9/10. This
corresponds to the usual number of folds in cross-validation (5-fold or 10-fold). Like
in bootstrapping, B has to be selected a priori by the user. Choices for B are also
similar to bootstrapping, e.g., in the range of 200 to 1000.
13.3 Evaluation Measures
… generally chosen as the mean value. This is also used in the classification case, i.e. for the 0-1 loss, leading to so-called error rates, i.e. (number of errors)/(number of observations). The quality measure is then called empirical risk. This risk is assumed to be evaluated on a test sample T_i independent of the training sample L_i.
Definition 13.6 (Empirical Risk). The empirical risk of the model ak is defined as
\hat{p}_{ki} = \frac{1}{h} \sum_{j=1}^{h} \big(y_j - a_k(x_j \,|\, L_i)\big)^2,   (13.1)

where z_1, . . . , z_h, with z_i = [x_i^T \; y_i]^T, are the elements of a sample T_i independent of L_i.
In the regression case, p̂ki is also called the mean squared error on T i . In the
classification case, p̂ki is equal to the misclassification (error) rate on T i :
Definition 13.7 (Misclassification Error Rate). The misclassification error rate of
the model ak is defined as
\hat{p}_{ki} = \frac{1}{h} \sum_{j=1}^{h} I\big(y_j \neq a_k(x_j \,|\, L_i)\big),   (13.2)
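Both empirical risks are one-liners in base R; a minimal sketch, assuming y holds the true responses of the test sample and y_hat the corresponding model predictions.

  mse_risk   <- function(y, y_hat) mean((y - y_hat)^2)   # Equation (13.1), regression
  error_rate <- function(y, y_hat) mean(y != y_hat)      # Equation (13.2), classification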
Table 13.4: Two Confusion Matrices Using Training Sample with 4 Tracks (Upper
Part) and 10 Tracks (Lower Part)
The number of false negatives is the number of observations which belong to the
positive class, but are recognized as not belonging to it:
m_{FN} = \sum_{w=1}^{W} y_w \cdot (1 - \hat{y}_w).   (13.6)
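A minimal base-R sketch that computes (13.6) together with the three analogous counts from 0/1 vectors of true labels y and predictions y_hat; the helper name is ours.

  confusion_counts <- function(y, y_hat) {
    c(TP = sum(y * y_hat),                # true positives
      FN = sum(y * (1 - y_hat)),          # false negatives, Equation (13.6)
      FP = sum((1 - y) * y_hat),          # false positives
      TN = sum((1 - y) * (1 - y_hat)))    # true negatives
  }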
Example 13.4 (Evaluation of Classification Models). Let us now compare the mod-
els from Example 13.2 with respect to the evaluation measures discussed above. Re-
calling the example, models (a,b) are created from the smaller training sample of 4
songs, and models (c,d) are based on 10 training songs. The decision tree depth was
limited to 2 levels for models (a,c) and 4 levels for (b,d). The measures are listed in
Table 13.5.
If mACC is used as the only evaluation criterion, the first impression is that the
larger training sample leads to a better classification performance (0.64/0.68 against
0.56/0.56). Larger trees do not significantly change mACC for the smaller training
sample, but help to increase mACC from 0.64 to 0.68 when the models are trained on
a larger training sample. The correct classification of 68% test tracks is quite appre-
ciated, because in this example only MFCCs are used as features and the number of
training tracks is very small.
However, if other measures are taken into account, the advantage of the larger
training sample is no longer completely obvious. Lines (c,d) contain smaller $m_{TP}$ values
and $m_{REC}$ decreases from 0.56/0.58 to 0.36/0.29. As discussed above in Expression
13.13, a high precision and a low recall mean that the number of songs wrongly recognized
as belonging to the Pop/Rock genre is lower than the number of true Pop/Rock
songs recognized as not belonging to this genre.
Because the smaller training sample consists of only Pop/Rock and Classical
pieces (cf. Table 13.1), it does not contain enough information to learn some other
non-classical genres different from Pop/Rock. 12 of 15 Electronic pieces are recognized
as Pop/Rock (see the upper confusion matrix in Table 13.4). The larger
training sample with more Jazz and Electronic examples helps to increase the recog-
nition of non-Pop/Rock songs, but the number of correctly predicted Pop/Rock songs
decreases.
Depending on the characteristics of the training and test data, the correlation between
evaluation measures may vary to a greater or lesser extent, as investigated in [28]. An as-
sessment of the classification performance with regard to several measures leads to
multi-objective evaluation and optimization as discussed later in Section 13.6.
If the share of positive observations in the test sample $(m_{TP} + m_{FN})/W \ll 1$, a model which simply classifies all observations
as negatives would achieve a very high $m_{ACC} = 1 - (m_{TP} + m_{FN})/W$ and a very low
$m_{RE}$.
For a credible evaluation and tuning of models intended for application to
imbalanced sets, there exist several possibilities to measure aggregated performance
for observations of both classes.
Definition 13.10 (Measures for Imbalanced Sets).
The balanced relative error is the average of the relative errors for positive and negative
observations and should be minimized:
$$m_{BRE} = \frac{1}{2}\left(\frac{m_{FN}}{m_{TP} + m_{FN}} + \frac{m_{FP}}{m_{TN} + m_{FP}}\right). \qquad (13.14)$$
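Equation (13.14) can be computed directly from the four confusion counts; a minimal sketch (the counts could, e.g., come from the illustrative confusion_counts helper above):

def balanced_relative_error(m_tp, m_fp, m_fn, m_tn):
    """Balanced relative error m_BRE, Equation (13.14): mean of the
    relative errors on the positive and on the negative class."""
    return 0.5 * (m_fn / (m_tp + m_fn) + m_fp / (m_tn + m_fp))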
The performance of a classification model can also be compared against the performance
of a random classifier by means of the Kappa statistic [33, p. 163], which
should be maximized. It measures the difference between the number of correct
predictions and the expected number of correct predictions of a random classifier R, in relation
to the difference between the number of observations and the expected number of correct
predictions of the random classifier:
$$m_{KA} = \frac{(m_{TP} + m_{TN}) - E[m_{TP}(R) + m_{TN}(R)]}{W - E[m_{TP}(R) + m_{TN}(R)]}. \qquad (13.17)$$
For binary classification with a random unbiased classifier, the expected numbers of correct
predictions depend on the numbers of positive and negative observations: $E[m_{TP}(R)] = (m_{TP} + m_{FN})/2$ and $E[m_{TN}(R)] = (m_{TN} + m_{FP})/2$, so that $E[m_{TP}(R) + m_{TN}(R)] = W/2$. The substitution of this term into Equation (13.17) leads to
$$m_{KA} = \frac{(m_{TP} + m_{TN}) - W/2}{W - W/2}.$$
Example 13.5 (Evaluation of Classification Models for Imbalanced Sets). Table 13.6
lists mBRE , mF , mGEO , and mKA for the models of Example 13.2. Models (c) and (d)
trained with the larger set perform worse w.r.t. $m_F$ and $m_{GEO}$ but are better when validated
with $m_{BRE}$ and $m_{KA}$. This discrepancy illustrates how complex the proper
choice of an appropriate model, and also of the training data, can be. Model (a) correctly
classifies 56% of the Pop/Rock songs and 56% of the other tracks (cf. the corresponding $m_{REC}$ and
$m_{SPEC}$ values in Table 13.5). Model (d) correctly classifies 29% of the Pop/Rock songs
and 92% of the other tracks. Here, a decision can be made according to the desired
preference: the higher mean performance on tracks of both classes of model (d), or the
lower variance and higher minimum across the per-class performances of
model (a). Generally, it is crucial to adapt the evaluation scheme to the requirements
of the concrete application.
Table 13.6: Evaluation of Balanced Performance for Classification Models from Ex-
ample 13.2
Until now, most related studies have used only a few evaluation measures. An interesting
statistic is provided in [27]: of 467 analyzed works on genre recognition,
accuracy is the most popular measure and is reported in 82% of the studies. Recall
is used for evaluation in 25% of the studies, precision in 10%, and the F-measure in
4%. This means that many models are tuned to the best performance with regard to a
single or a few evaluation measures, and may be of poor quality when other evaluation
aspects play a role.
$$\hat{y}(\mathbf{x}_1, \ldots, \mathbf{x}_{W'}) = \left\lceil \frac{\sum_{w=1}^{W'} \hat{y}_w}{W'} - \frac{1}{2} \right\rceil. \qquad (13.19)$$
The advantage of this method is that it reduces the impact of outlier windows in a
song: e.g., if a quiet intro and an intermediate part with string quartet in a rock song
are recognized as belonging to the classical genre, the aggregated prediction may still
be correct. A further discussion about reasonable sizes of classification windows is
provided in Section 14.1.
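Equation (13.19) amounts to a majority vote over the W' window-level predictions of a track; a minimal sketch:

import math

def aggregate_windows(window_predictions):
    """Majority vote over binary window predictions, Equation (13.19):
    the track is labeled 1 if more than half of its windows are predicted as 1."""
    w = len(window_predictions)
    return math.ceil(sum(window_predictions) / w - 0.5)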
Example 13.6 (Aggregation of Predictions for Genre Classification). In the exam-
ples from the previous sections we observed that a larger training sample with more
negative observations leads to a better identification of tracks which do not belong to
the Pop/Rock genre. Figure 13.2 plots predictions for individual classification win-
dows of 4 s with 2 s overlap. For each of the 5 genres, 15 tracks of the test sample are
taken into account. A horizontal dash corresponds to a window which was wrongly
predicted as belonging to Pop/Rock. The lengths of all songs were normalized, so
that a broader dash may correspond to a single window for a shorter song. The up-
per subfigure plots classification results for the training sample with 4 tracks and the
bottom subfigure for the training sample with 10 tracks.
Figure 13.2: Recognition of the genre Pop/Rock, classification results for individual
windows of 4 s with 2 s overlap for 75 tracks of 5 other genres. Top: small training
sample; bottom: large training sample.
The balance between wrongly and correctly predicted windows can be easily
recognized. For instance, classical music pieces were correctly predicted as not
belonging to Pop/Rock for both training samples (cf. also Table 13.4), but there
exist some individual classification windows recognized as Pop/Rock, in particular
for the 14th piece (the Adagio from Vivaldi’s “The Four Seasons”). The number of
misclassified windows is strongly reduced when the larger training sample is used
(bottom subfigure). This holds for all genres (but 2 Electronic, 3 Rap, and 1 R’n’B
track remain classified as Pop/Rock).
Table 13.7: Evaluation Measures and the Impact of Their Optimization on Three
Evaluation Focuses. +: Strong Impact; (+): Some Impact; -: No or almost No
Impact. ↓: Measures to Minimize; ↑: Measures to Maximize
13.3.6.1 Runtime
Even if a classification model has a small error and achieves the best classification performance
compared to other models, it may have a serious drawback: it may be very slow, for
several possible reasons. The classification method may require transformations into
higher dimensions, the search for optimal parameters may be costly, or the data must be
intensively preprocessed before classification.
Runtime can be measured for different methods. For example, most music data
analysis tasks rely on previously extracted features. The runtime of feature extraction
depends on the source of the feature. Audio features often require several complex
steps like the Fourier transform. It is possible to give priority to the extraction of
features which can be computed faster, on the other hand increasing the risk that
classification models trained with these features perform worse. For instance,
three audio features (autocorrelation, fundamental frequency, and power spectrum)
required more than 65% of the overall extraction time of 25 features in [4].
For models themselves, one should distinguish between the training runtime and
classification runtime. If new data instances are classified over and over again, for
instance when new tracks are added to a music collection or an online music shop, the
classification runtime becomes more important. In this situation, too high costs of the
optimization runtime during the training stage, e.g., for the tuning of hyperparameters
(see Section 13.4), may be less problematic.
It is also possible to shift the costs between individual steps to a certain extent:
for example, the implementation and the extraction of complex high-level charac-
teristics (e.g., recognition of instruments) may help to reduce runtime costs during
classification so that rather simple approaches like k-NN and Naive Bayes would
have similar or almost similar performance when compared with complex methods
like SVMs. If the extraction of the features is done only once per music instance,
the user is less affected by a long “offline” runtime.
The biggest challenge of runtime-based evaluation is that it is very hard to achieve
a reliable comparison between models. Implementations of algorithms may differ
(cf. the difference between discrete and fast Fourier transforms, Section 4.4.1), and
13.3.6.3 Stability
Stability measures whether an output of a classification system has a low variation over
$B \gg 1$ experiments (i.e. training samples). For the measurement of stability w.r.t. an
evaluation measure m, the standard deviation is estimated as:
$$s_m = \sqrt{\frac{1}{B-1} \sum_{i=1}^{B} (m_i - \bar{m})^2}, \qquad (13.20)$$
where m̄ is the mean value of m. Note that in principle the whole distribution of
the evaluation measure is of interest. However, the stability is particularly important
together with the mean performance. Also note that m does not necessarily need to
be a classification performance measure. Stability can be estimated for most mea-
sures discussed in this chapter, such as runtime. There exist several possibilities to
introduce variation for B experiments. We discuss here three cases: stability under
test data variation, stochastic repetitions, and parameter variation.
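A minimal sketch of Equation (13.20) for a list of B measured values of some evaluation measure (e.g., error rates or runtimes from B experiments):

import numpy as np

def stability(measurements):
    """Standard deviation s_m of an evaluation measure over B experiments, Eq. (13.20)."""
    m = np.asarray(measurements, dtype=float)
    return np.sqrt(np.sum((m - m.mean()) ** 2) / (len(m) - 1))
    # equivalently: np.std(m, ddof=1)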
To measure the stability under test data variation, the model is applied on B dif-
ferent and preferably non-overlapping test samples. Higher values of $s_m$ mean that
the model is more sensitive to the data and it is harder to see in advance whether this
model would be successful when applied to a new unlabeled data sample. If a test
sample can be built from all available labeled data, B subsets can be built with the
help of the resampling methods discussed in Section 13.2. It is also conceivable, however, to create
subsets from data instances which share a particular property, e.g., a genre (see the example below), or which belong to the same cluster estimated by means of unsupervised
learning (see Chapter 11).
Example 13.7 (Evaluation of Stability of Classification Performance for Negative
Examples). Let us measure the variance of classification performance across tracks
of different non-Pop/Rock genres for models (b), (d) from the confusion matrices in
Example 13.3. The mean and the standard deviation of classification performance
measures for negative examples – estimated separately for the 5 non-Pop/Rock gen-
res Classical, Electronic, Jazz, Rap, and R’n’B tracks – are listed in Table 13.8. For
each of these genres the corresponding 15 tracks were used as a test sample. Because
$m_{TN} + m_{FP} = 15$ for both models, $s_{m_{TN}} = s_{m_{FP}}$.
As already discussed above, the extension of the training sample with Jazz and
Electronic tracks significantly increases the performance for these genres, and the
standard deviation of specificity decreases from 0.36 to 0.09. Even if it is not possible
to reliably forecast the prediction quality of a model for tracks of genres
currently completely absent from our database (say, African drum music), we can
at least measure how the performance varies for genres absent from the given
training data. Such a measure gives us a very rough estimate of the complexity of
the classification task; in the example above, whether the identification of Pop/Rock
songs is a rather simple task (because these songs have some unique properties compared
to all possible other music genres) or a rather hard task (because there exist other
genres with very similar properties to Pop/Rock).
Stability under stochastic repetitions can be measured when some methods of
a classification system provide output based on random decisions and results vary
after repetitions for the same data. Because Random Forest selects random features
for training of many trees during the training process (see Section 12.4.5), for such a
classifier we may simply repeat the training B times and estimate the variation of per-
formance of the created models as a measure of stability. Another candidate for the
measurement of stability under stochastic repetitions is evolutionary feature selection
(see Section 15.6.3). Even if this method is applied for a fixed training sample with a
deterministic classifier (such as Naive Bayes), random selection of features may find
several suboptimal feature sets after B repetitions of the evolutionary algorithm.
Finally, classification models may be sensitive to (hyper)parameters (cf. Defini-
tion 13.11), so that stability under parameter variation can be measured. Note that
only a limited range of reasonable parameter values should be of interest. For in-
stance, too small numbers of trees in Random Forest would lead to models with poor
performance, and an increase of the number of trees above some threshold would
only slow down the classification process. The search for a reasonable range of pa-
rameter values may be complex in practice and is beyond the scope of this chapter. To
name a few possibilities, for a particular classifier one can use settings recommended
after theoretical investigations, select values from an interval of
optimal parameter values found after application to several classification tasks,
or apply factorial or other statistical designs of experiments [1].
semantic features, the depth of the tree can be minimized (too large trees are not
interpretable anymore). In some studies, fuzzy classification was applied to music
classification [9, 29]. Then, opposing goals can be the minimization of the num-
ber of fuzzy rules and the maximization of classification performance. It is worth
mentioning that if interpretability is important, each step in the algorithm chain
should be verified: consider the application of Principal Component Analysis (see
Definition 9.48) to high-level features, which makes their interpretation very hard,
at least when the number of components is high.
Particularly for music recommendation systems there are many user-related mea-
sures which may be completely uncorrelated with classification performance mea-
sures. For example, a “perfect” classifier in terms of accuracy would not take novelty
and surprise into account, which may be important for some listeners. Then, the
recommendation of music too similar to current user preferences would exclude the
chance to discover new music or even change personal preferences. Another problematic
issue is relying only on performance measures when the order of the output is important.
In a playlist, the order of mood and genre changes may be relevant, and it may
be undesirable to place tracks of the same artist too close to each other. Therefore,
two playlists constructed with the same songs may have completely different impacts
on listener satisfaction. More evaluation approaches for generated playlists are dis-
cussed in [5]. For a further discussion of measures related to music recommendation,
we refer to Section 23.4.
Figure 13.3: Nested resampling for (hyper)parameter tuning with two nested CVs.
Figure 13.3 shows this process schematically for two nested cross-validations, namely a 5-fold CV in both the
inner and the outer resampling.
Example 13.8 (Hyperparameter Tuning). In order to demonstrate nested resampling
for hyperparameter tuning, we show results for the SVM with radial basis kernel in
the piano-guitar distinction Example 11.2. For this SVM we vary the kernel width w
and the error weight C. We use not only MFCCs (as in Example 11.2) as tone
characterization but also four types of audio features which represent musically relevant
properties of sound. Overall, this leads to 407 numeric non-constant features
without any missing values (cp. Section 14.2.3).
Experiment for Hyperparameter Tuning in Piano-Guitar Classification
1. Select the SVM with radial basis kernel as a classifier.
2. Subsampling: Randomly select 600 observations from the full data set as a train-
ing sample for model selection. Retain the rest as a test sample.
3. Apply grid search on all powers of 2 in $[2^{-20}, 2^{20}]$ for both the kernel width w and
the error weight C on these 600 observations with SVM. Performance is measured
by 5-fold CV and MisClassification Error rate (MCE) (also denoted as relative
error mRE , cp. Equation (13.12)). Note: the data partitioning of the CV is held
fixed, i.e. is the same for all grid points to reduce variance in comparisons.
4. Store the hyperparameters w and C with optimum MCE.
5. Train classifier with selected hyperparameters on all 600 instances of the training
sample.
6. Predict classes in the test sample and store the test error.
7. Repeat steps (2)–(6) 50 times.
The results show that the chosen SVMs with tuned hyperparameters but without fea-
ture selection realize an empirical distribution of the MCEs with a range between
2% and 5% (see Figure 13.4).
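The listed steps can be reproduced, for example, with scikit-learn. The following minimal sketch assumes a feature matrix X and labels y; it uses a much coarser grid than the full $[2^{-20}, 2^{20}]$ range to keep the example short, and the SVM parameter gamma plays the role of the (inverse) kernel width w:

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.svm import SVC

def nested_tuning(X, y, repetitions=50, n_train=600, seed=0):
    """Outer subsampling + inner 5-fold CV grid search for an RBF-SVM."""
    grid = {"C": [2.0 ** k for k in range(-20, 21, 5)],
            "gamma": [2.0 ** k for k in range(-20, 21, 5)]}
    test_errors = []
    for r in range(repetitions):
        # step 2: random training sample of 600 observations, rest as test sample
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=n_train, random_state=seed + r)
        # steps 3-5: grid search; the same CV partitioning is reused for all grid points
        inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
        search = GridSearchCV(SVC(kernel="rbf"), grid, cv=inner_cv, scoring="accuracy")
        search.fit(X_tr, y_tr)           # refits the best model on all 600 instances
        # step 6: test error (MCE = 1 - accuracy)
        test_errors.append(np.mean(search.predict(X_te) != y_te))
    return np.array(test_errors)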
Figure 13.4: Histogram (frequency) of the test misclassification error rates over the 50 repetitions.
Table 13.9: Contingency Table for McNemar Test
$H_{FF}$ = no. of instances misclassified by both classifiers
$H_{FT}$ = no. of instances misclassified by classifier 1 but not by classifier 2
$H_{TF}$ = no. of instances misclassified by classifier 2 but not by classifier 1
$H_{TT}$ = no. of instances correctly classified by both classifiers
For the McNemar test, contingency tables with 4 cells for dependent samples (fourfold tables, cp. Definition 9.24) are used. Consider
Table 13.9 for the comparison of two classifiers on the basis of a single test data set.
If the two classifiers were equally good, then the numbers of successes (correct
classifications) should be as similar as possible. This leads to the equality $P_{TF} + P_{TT} = P_{FT} + P_{TT}$, i.e. $P_{TF} = P_{FT}$, where P stands for probability.
Therefore, the null hypothesis to be tested has the form
$H_0$: The probability of a success is equal for the two classifiers, i.e. $P_{FT} = P_{TF}$.
For fixed $q = H_{FT} + H_{TF}$, under $H_0$ the frequency $H_{FT}$ should be binomially
distributed with success probability 0.5,
since there should be as many instances successfully classified by method
1 but not by method 2 as vice versa. Thus, the test statistic
$$t = \frac{H_{FT} - q/2}{\sqrt{q/4}}$$
is approximately N(0,1) distributed, at least if q is big enough (then the binomial distribution is approximately normally distributed with $E(H_{FT}) = q/2$ and $\mathrm{var}(H_{FT}) = q/4$). Therefore, $t^2$ is approximately $\chi^2$-distributed with 1 degree of freedom, and with a continuity correction one obtains
$$\chi^2_{corr} = \frac{(|H_{FT} - H_{TF}| - 1)^2}{H_{FT} + H_{TF}}.$$
The McNemar Test rejects the equality of the goodness of the performance of
the two classifiers, e.g., if the 95% quantile of the $\chi^2$-distribution with 1 degree of
freedom is exceeded, i.e.
$$\chi^2_{corr} > \chi^2_{0.95,1} = 3.84.$$
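A minimal sketch of the continuity-corrected statistic, computed from two prediction vectors on the same test sample:

import numpy as np

def mcnemar_chi2(y_true, pred1, pred2):
    """Continuity-corrected McNemar statistic; reject H0 at the 5% level if it exceeds 3.84."""
    correct1 = np.asarray(pred1) == np.asarray(y_true)
    correct2 = np.asarray(pred2) == np.asarray(y_true)
    h_ft = np.sum(~correct1 & correct2)   # misclassified by classifier 1 only
    h_tf = np.sum(correct1 & ~correct2)   # misclassified by classifier 2 only
    return (abs(h_ft - h_tf) - 1) ** 2 / (h_ft + h_tf)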
Example 13.9 (McNemar Test). We will reconsider the models in the above Example 13.2 and test the equality of the performance of model pairs. The only values we
have to calculate are $H_{FT}$ and $H_{TF}$ on the test sample. Then, $\chi^2_{corr}$ can be derived.
Table 13.10 presents the results. Note that if we reject $H_0$, we assume $H_1$.
First, we compare models (a) against (b) and (c) against (d). For these pairs the
size of the training sample is fixed, and the tree size varies. As we can observe, there
is no significant difference between the models, as $\chi^2_{corr} < 3.84$ in both cases.
Second, we compare models (a) against (c) and (b) against (d). Here, the tree
size is fixed and the training sample varies. For smaller trees with depth 2 there is no
significant difference between the two models, but for larger trees with depth 4 H0 is
rejected.
We generate B = 100 test samples by 4/5-subsampling from the whole test sample (each time, 4/5 of the tracks with all corresponding classification windows
are randomly selected for validation). The t-statistic is compared to the 97.5%
quantile $t_{0.975,99} = 1.98$.
The left part of Table 13.11 lists mean values of the relative classification error
mRE for the two tested models, the t-statistic, and the “assumed” hypothesis. The
right part of the table contains the corresponding values for the balanced relative
error mBRE . We can observe that in all cases except for comparison of models (a)
and (b), H0 is rejected.
For the comparison of K classifiers on B test data sets, a two-way analysis of variance model for the error rates $E_{ij}$ of classifier j on data set i can be used:
$$E_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij},$$
where the $\varepsilon_{ij}$ are independently identically distributed and $\alpha_1 + \ldots + \alpha_B = 0$, $\beta_1 + \ldots + \beta_K = 0$, with B = no. of test data sets and K = no. of classifiers.
Side conditions result from the fact that the overall mean error rate should be µ.
By means of the data set effect αi , the overall mean is only adapted to the actual data
set. Interactions between data set and classifier are excluded here, i.e. the effect of
the classifier does not depend on the data set. Such interactions could, though, be
included in the model without problems.
Testing for significant differences in the classifier effects, i.e. checking the validity
of the null hypothesis
$$H_0: \mu_1 = \mu_2 = \ldots = \mu_K = \mu, \quad \text{or equivalently} \quad \beta_1 = \beta_2 = \ldots = \beta_K = 0,$$
can be carried out by means of an F-test.
First, the classifier effects are estimated as
$$b_j = \bar{e}_{\cdot j} - m = \frac{1}{B}\sum_{i=1}^{B} e_{ij} - \frac{1}{BK}\sum_{i=1}^{B}\sum_{j=1}^{K} e_{ij}, \qquad j = 1, \ldots, K.$$
Second, the mean contribution of all classifiers to the dependent variable E is estimated
as the so-called Mean Squared Classifier effect, and the residuals are estimated as
$$me_{ij} = e_{ij} - m - a_i - b_j.$$
Example 13.11 (Two-Way Analysis of Variance). We will again reconsider the models
in the above Example 13.2 and test for significant classifier effects. Again, we generate
B = 100 test samples with 4/5-subsampling from the whole test sample. Then,
we have B = 100 replicates and K = 4 classifiers. The F-statistic is compared to the
95% quantile $F_{0.95,3,297} = 2.64$. For $m_{RE}$, F = 402.12, and for $m_{BRE}$, F = 128.97,
indicating significant differences between the 4 models.
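For illustration, the F-statistic for the classifier effect can be computed from the B × K matrix of error rates $e_{ij}$ as in the following minimal sketch (two-way model without interactions, so the residual degrees of freedom are (B−1)(K−1)):

import numpy as np

def classifier_effect_f(e):
    """F-statistic for the classifier effect in a two-way ANOVA without interaction.
    e: array of shape (B, K) with error rates e_ij for B test samples and K classifiers."""
    B, K = e.shape
    m = e.mean()                                   # overall mean
    a = e.mean(axis=1) - m                         # data set effects a_i
    b = e.mean(axis=0) - m                         # classifier effects b_j
    resid = e - m - a[:, None] - b[None, :]        # residuals me_ij
    ms_classifier = B * np.sum(b ** 2) / (K - 1)
    ms_resid = np.sum(resid ** 2) / ((B - 1) * (K - 1))
    return ms_classifier / ms_resid                # compare to the F quantile with (K-1, (B-1)(K-1)) df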
First, system properties for the evaluation should be checked. It is often necessary
to think about details or less-obvious applications. For example, a music recommen-
dation system aiming at a perfect recognition of current listener properties would fail
if the discovery of new music and the evolution of the personal taste play a role.
Example 13.12 (Properties of genre recognition systems). An extensive analysis of
tasks related to genre recognition systems was carried out in [27]. Here, ten experimental
designs are listed, sorted by how frequently they are applied: classify (“how well does
the system predict genres?”), features (“at what is the system looking to identify gen-
res?”), generalize (“how well does the system identify genre in varied data sets?”),
robust (“to what extent is the system invariant to aspects inconsequential for iden-
tifying genre?”), eyeball (“how well do the parameters make sense with respect to
identifying genre?”), cluster (“how well does the system group together music using
the same genres?”), scale (“how well does the system identify the music genre with
varying numbers of genres?”), retrieve (“how well does the system identify music
using the same genres used by the query?”), rules (“what are the decisions the sys-
tem is making to identify genres?”), compose (“what are the internal genre models
of the system?”).
In the next step, human factors should be identified which influence the eval-
uation. Reference [22] distinguishes between four factors which influence human
music perception: music content (properties of sound, e.g., timbre or rhythm), mu-
sic context (metadata like lyrics or details of composition), user properties (musical
experience, age, etc.), and user context (properties of listening situation like cur-
rent activity or current mood, cp. Section 21.4.2). It is not easy to achieve a perfect
match with regard to all these factors: for example, a recommender system may have
to learn that a listener of classical music does not like organ (music content), does
not prefer to listen to operas with lyrics based on fairy tales (music context), does
not like Wagner because her/his parents listened to Wagner’s operas too often during
her/his childhood (user properties), and does not like to listen to complex polyrhyth-
mic pieces while driving because it may disturb her/his attention (user context).
Finally, evaluation measures should be selected which are relevant according to
desired system properties and human factors. Another requirement is that these measures
should not be strongly correlated: if a data sample is well balanced, the maximization
of accuracy may already lead to the minimization of the balanced classification error and
it is not necessary to evaluate the system using both measures or selecting both as
criteria for multi-objective optimization. The measurement of correlation between
measures is not always straightforward because there may be some dependencies
which hold only for some regions of the search space. For example, increasing the
number of features may lead to an increase of classification performance at first,
but later lead to a decrease of the performance because too many irrelevant features
would be identified as relevant. Also, the counter-strategy to decrease the number of
features would not necessarily lead to a higher performance.
for music recommendation, and some general aspects of user-related evaluation are
discussed in [17].
Bibliography
[1] T. Bartz-Beielstein. Experimental Research in Evolutionary Computation.
Springer, 2006.
[2] Y. Bengio and Y. Grandvalet. No unbiased estimator of the variance of k-fold
cross-validation. Journal of Machine Learning Research, 5:1089–1105, 2004.
[3] H. Binder and M. Schumacher. Adapting prediction error estimates for biased
complexity selection in high-dimensional bootstrap samples. Statistical Appli-
cations in Genetics and Molecular Biology, 7(1):12, 2008.
[4] H. Blume, M. Haller, M. Botteck, and W. Theimer. Perceptual feature based
music classification: A DSP perspective for a new type of application. In W. A.
Najjar and H. Blume, eds., Proc. of the 8th International Conference on Sys-
tems, Architectures, Modeling and Simulation (IC-SAMOS), pp. 92–99. IEEE,
2008.
[5] G. Bonnin and D. Jannach. Automated generation of music playlists: Survey
and experiments. ACM Computing Surveys, 47(2):1–35, 2014.
[6] Ò. Celma. Music Recommendation and Discovery: The Long Tail, Long Fail,
and Long Play in the Digital Music Space. Springer, 2010.
[7] J. S. Downie. Music information retrieval. Annual Review of Information Sci-
ence and Technology, 37(1):295–340, 2003.
[8] B. Efron. Bootstrap methods: Another look at the jackknife. The Annals of
Statistics, 7(1):1–26, 1979.
[9] F. Fernández and F. Chavez. Fuzzy rule based system ensemble for music genre
classification. In Proc. of the 1st International Conference on Evolutionary and
Biologically Inspired Music, Sound, Art and Design (EvoMUSART), pp. 84–95.
Springer, 2012.
[10] X. Hu and N. Kando. User-centered measures vs. system effectiveness in find-
ing similar songs. In Proc. of the 13th International Society for Music Informa-
tion Retrieval Conference (ISMIR), pp. 331–336. FEUP Edições, 2012.
[11] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation
and model selection. In Proc. of the International Joint Conference on Artificial
Intelligence (IJCAI), pp. 1137–1143, 1995.
[12] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets:
One-sided selection. In Proc. of the 14th International Conference on Machine
Learning (ICML), pp. 179–186. Morgan Kaufmann, 1997.
[13] P. A. Lachenbruch and M. R. Mickey. Estimation of error rates in discriminant
analysis. Technometrics, 10(1):1–11, 1968.
[14] T. Lange, M. L. Braun, V. Roth, and J. M. Buhmann. Stability-based model se-
lection. In S. B. et al., ed., Advances in Neural Information Processing Systems
et al., ed., Proc. of the 34th Annual Conference of the German Classification
Society (GfKl), 2010, pp. 401–410. Springer, Berlin Heidelberg, 2012.
[29] I. Vatolkin and G. Rudolph. Interpretable music categorisation based on fuzzy
rules and high-level audio features. In B. Lausen, S. Krolak-Schwerdt, and
M. Böhmer, eds., Data Science, Learning by Latent Structures, and Knowledge
Discovery, pp. 423–432. Springer, Berlin Heidelberg, 2015.
[30] I. Vatolkin, G. Rudolph, and C. Weihs. Interpretability of music classification
as a criterion for evolutionary multi-objective feature selection. In Proc. of the
4th International Conference on Evolutionary and Biologically Inspired Music,
Sound, Art and Design (EvoMUSART), pp. 236–248. Springer, 2015.
[31] D. Weigl and C. Guastavino. User studies in the music information retrieval
literature. In Proc. of the 12th International Society for Music Information
Retrieval Conference (ISMIR), pp. 335–340. University of Miami, 2011.
[32] S. Weiss and C. Kulikowski. Computer Systems that Learn. Morgan Kaufmann,
San Francisco, 1991.
[33] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools
and Techniques. Elsevier, San Francisco, 2005.
Chapter 14
Feature Processing
IGOR VATOLKIN
Department of Computer Science, TU Dortmund, Germany
14.1 Introduction
After the extraction of features which represent music data, several further steps may either
be necessary for subsequent classification (e.g., the substitution of missing values)
or may improve the classification performance (for example, the removal of irrelevant
features). The initial input to feature processing algorithms consists of previously extracted
characteristics; the output consists of data instances for the training of classification or
regression models.
Preprocessing consists of basic steps for the preparation of feature vectors. For a
music piece, one of the goals is to create a matrix of F features for W time frames.
Methods for preprocessing are discussed in Section 14.2.
After the feature matrix has been built from available observations, it may contain
a very large number of entries because many features are estimated from short time
frames: e.g., a time series of the spectral centroid extracted from 23-ms frames for a
4-min song contains more than 10,000 values. Primary tasks after the preprocessing
are the reduction of the amount of data and the increase of quality of a subsequent
classification or regression.
One group of methods operates on the feature dimension, keeping the number of
matrix columns (time windows) unchanged. The number of rows (processed features)
may remain the same, be reduced (e.g., after feature selection), or be increased
(after feature construction). Examples of methods for the processing of the feature di-
mension are presented in Section 14.3. Another option is to focus on the time dimen-
sion (usually for individual features), for example by means of time series analysis
or the aggregation of feature values around relevant musical events. Methods for the
processing of the time dimension are introduced in Section 14.4.
Figure 14.1 illustrates how various feature processing methods influence the di-
mensionality of the feature matrix. Parts of the feature matrix to be processed are
indicated by bordered transparent rectangles. Dashed areas mark parts of feature
matrix with changed values after feature processing. Some of the methods do not
change the dimensionality (Figure 14.1 (a)) and often belong to preprocessing tech-
niques, e.g., the normalization of feature values. If new features are constructed
from existing ones, the feature dimensionality increases (Figure 14.1 (b)). Generic
frameworks for the creation of new features from existing ones are discussed in Sec-
tion 14.5. As the number of time windows cannot be increased any more after the
extraction of features, we do not consider methods which may increase the time di-
mensionality of the matrix (empty space above Figure 14.1 (b)). The reduction of
dimensions can be achieved in two ways. Figure 14.1 (c,d) shows the selection of
time intervals (e.g., verse of a song) or the selection of features (most relevant for
classification). Another option is to apply transforms to certain time intervals (es-
timation of mean feature values around beat events) or certain features (principal
component analysis, cp. Definition 9.48) (Figure 14.1 (e,f)).
Each feature processing step incurs its own computing costs, and an improper
application may even decrease the classification quality. The evaluation of feature
processing is briefly discussed in Section 14.6.
If a feature matrix is optimized, e.g. for later classification of music data, data
instances to classify may be constructed from the whole matrix or its parts. For ex-
ample, when the feature matrix represents a single tone for the identification of an
instrument, each complete row of the matrix (i.e. the full time series of one feature) might be summarized by one statistic
(see Section 14.4.2). For other tasks, like classification into musical genres or listener
preferences, the calculation of these statistics should be done separately for a set of
classification windows: too many different musical segments with varying properties
(harmonic, instrumental, rhythmic, etc.) would be mixed together if aggregated for
the whole music piece. A constant length of several seconds (longer than a single
note but shorter than a phrase or a segment) may be considered. The optimal length
can, however, depend on the classification task, as investigated in our study on the
recognition of personal preferences [33] where the length of classification window
was optimized. For two complex classes, the optimal length was between 1 s and 5 s,
and for a task very similar to the classification of classical against popular music the
optimal estimated length was approximately 24 s. In another study, best results were
reported for classification frames between 2 s and 5 s [4]. Another option is to aggre-
gate feature statistics for classification windows with lengths adapted to time events
of the musical structure (see Section 14.4.3). In [34], onset-based segmentation (cp.
Section 16.2) outperformed other methods with windows of constant length and the
aggregation of features for complete music tracks.
14.2 Preprocessing
In this section several groups of preprocessing methods are discussed: transforms
of feature domains, normalization, handling of missing values, and harmonization
of the feature matrix. Not all these methods are always necessary: some classifica-
tion methods handle missing values themselves and/or do not require normalization.
Moreover, an improper application of feature processing may even harm the subsequent
classification. Therefore, for each classification scenario a careful choice of
(pre)processing algorithms should be made. The impact of these methods can also
be measured experimentally as discussed later in Section 14.6.
In the literature, there exists no clearly defined boundary between processing
and preprocessing. For instance, [9] lists cleaning, integration, transformation, and
reduction as preprocessing methods, hence counting data reduction among prepro-
cessing.
Now consider that for a large collection of classical music pieces the occurrences
of chord degrees were counted. Suppose that tonic chords amount to 25% of all
chords, dominant chords to 12%, subdominant to 8%, mediant to 4%, and supertonic to 3%.
Now we can order degrees by their relevance:
• 1 - tonic, 2 - dominant, 3 - subdominant, 4 - mediant, 5 - supertonic.
For the classification of music styles or the identification of the composing period, the
model may benefit from such ordering. A simple linear model could identify pieces
for which generally less relevant degrees play a more important role than expected.
Based on the exemplary measurements above, one can also consider the mapping of
categories to real numbers:
• 0.25 - tonic, 0.12 - dominant, 0.08 - subdominant, 0.04 - mediant, 0.03 - super-
tonic.
Continuous features can also be transformed to categorical ones, e.g., if required
for a classifier. Another example is the discretization of continuous values, e.g.,
to save indexing space (see Definition 9.4 for the difference between discrete and
continuous variables and Section 12.4.1 for another discussion of discretization). A
simple option is to limit the number of positions after the decimal place. Another
common approach is to use histograms (an example was provided in Figure 9.1).
Because some features may contain larger intervals with a very sparse number of
values, the histograms can be constructed based on equal frequency (the number of
observed feature values is equal in each histogram bin) and not based on equal width
(each bin has the same length). Feature values may also be grouped by clustering (for
related algorithms see Chapter 11), and the original continuous values mapped to the
number of the corresponding cluster. A supervised classification scheme can also be
integrated, for example, if feature intervals are divided using an entropy criterion as
applied in decision trees; see Section 12.4.3.
Discretization may not only save space but also help to construct more mean-
ingful, high-level interpretations of features. For example, the spectral centroid of an
audio signal can be measured in Hz. Alternatively, it is possible to map the frequency
range of each octave to one bin; then, some harmonic and instrumental properties
can be identified more easily.
14.2.2 Normalization
Original ranges of feature values may be very different. In particular, distance-based
classifiers, such as the k-nearest neighbors method (cp. Section 12.4.2), may overem-
phasize the impact of features with larger ranges. Many neural networks (cp. Sec-
tion 12.4.6) expect input values between zero and one. But also for other classifica-
tion methods the mapping of original feature values to the same interval may help to
improve the performance. The task of normalization is to map original values to an
interval of given range [Nmin , Nmax ].
The original values x Tu of feature u = 1, ..., F can be normalized with respect to
the difference between the maximum and the minimum values of the feature (min-
max normalization):
$$X'_{u,w} = \frac{X_{u,w} - \min(\mathbf{x}_u^T)}{\max(\mathbf{x}_u^T) - \min(\mathbf{x}_u^T)} \cdot (N_{max} - N_{min}) + N_{min}. \qquad (14.1)$$
$X_{u,w}$ denotes here the scalar value of the u-th feature $\mathbf{x}_u^T$ from the w-th extraction window,
as an element of the feature matrix X constructed from F features (rows) and W extraction
windows (columns). Often, the target interval is [0, 1] (zero-one normalization).
Another option is to normalize to mean 0 and standard deviation 1 (zero-mean or
z-score normalization) independent of maximum and minimum values:
$$X'_{u,w} = \frac{X_{u,w} - \bar{x}_u}{s_{x_u}}, \qquad (14.2)$$
where $\bar{x}_u$ is the mean and $s_{x_u}$ the (empirical) standard deviation of $\mathbf{x}_u^T$. The target
range here is not equal to [0, 1] anymore, and this method may not work with all
classifiers.
The problem of Equation (14.1) is that the maximum and the minimum estimated
from some set of instances are not necessarily the same for features extracted from
other music data. In that case normalization may lead to values below 0 or above
1 (out-of-range problem). Zero-mean normalization also does not guarantee that all
values will be normalized to the interval [0, 1].
A procedure with several advantages is the softmax normalization [25], which is
defined as follows:
$$X'_{u,w} = \frac{X_{u,w} - \bar{x}_u}{\lambda_S \cdot (s_{x_u} / 2\pi)}, \qquad (14.3)$$
and is plugged into the logistic function
$$X''_{u,w} = \frac{1}{1 + e^{-X'_{u,w}}}, \qquad (14.4)$$
where λS is the control parameter for the linear response in standard deviations and
should be set w.r.t. a desired level of confidence (cf. Definition 9.25). Feature val-
ues from the confidence interval are approximately linearly normalized. The default
value of λS = 2 corresponds to the level of confidence ≈ 95.5%.
Softmax normalization has several advantages. For example, values from a mid-
dle region of the original range are normalized almost in a linear way and the values
are always between zero and one. Although outliers are mapped to a short interval
(with respect to their distance from most expected values), even for very large or
small outliers with different original values the corresponding normalized values are
not the same and the order is kept, in contrast to methods which map outliers to a
single value.
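A minimal sketch of the three normalizations of Equations (14.1)–(14.4) for a single feature series (a one-dimensional array with the values of one feature), with the default $\lambda_S = 2$:

import numpy as np

def minmax_normalize(x, n_min=0.0, n_max=1.0):
    """Min-max normalization, Equation (14.1)."""
    return (x - x.min()) / (x.max() - x.min()) * (n_max - n_min) + n_min

def zscore_normalize(x):
    """Zero-mean / unit-variance normalization, Equation (14.2)."""
    return (x - x.mean()) / x.std(ddof=1)

def softmax_normalize(x, lambda_s=2.0):
    """Softmax normalization, Equations (14.3) and (14.4)."""
    x_lin = (x - x.mean()) / (lambda_s * (x.std(ddof=1) / (2 * np.pi)))
    return 1.0 / (1.0 + np.exp(-x_lin))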
If normalization is applied each time before the classification with a particular
model, the normalization function should be the same independent of current data to
classify. This means that parameters such as maximum and minimum in Equation
(14.1), and mean and standard deviation in Equations (14.2) and (14.3) have to be
Figure 14.2: Normalization of three features with softmax. Upper row: estimation
of mean and variance from classical and normalization for popular music pieces.
Bottom row: estimation of mean and variance from popular and normalization for
classical music pieces.
estimated only once. If new music pieces will have completely different distributions
of features, the normalized values may be outside of the required range.
Figure 14.2 plots the normalizations of three features with softmax. The upper
row contains instances where the estimation of mean and variance was done using a
set of 15 classical music pieces, and normalization was applied to 15 popular songs.
For the bottom row, mean and variance were estimated from popular songs and nor-
malization was applied to classical music.
Original values are normalized differently in both rows. The average distance in
phase domain (subfigures on the right) was a relevant feature for the distinction between
classical and popular music in [19], cf. Equation (5.38). After the estimation of mean and
variance for classical music, a large share of popular music has normalized values
very close to one, and the training of classification models would not sufficiently cap-
ture differences in popular music. Therefore, representative data samples should be
analyzed not only for the training of classification models, but also for the definition
of a normalization function.
An application of the same normalization method to all available features is not
always necessary and may even lead to undesired effects. Some features do not re-
quire normalization because they are already scaled between zero and one (pitch class
profiles, linear prediction coefficients). Sometimes it is known which values may be
theoretically achieved as minimum or maximum: with a sampling rate of 44,100 Hz
it is possible to analyze the frequencies up to the maximum limit of 22,050 Hz. So
for frequency-related features (spectral centroid, spread, etc.) it is not necessary to
normalize using softmax. Tempo in beats per minute or the duration of a music piece in
seconds have no theoretical limits to their ranges, but practical ones. Ratios (e.g., between
the amplitudes of the first and the second periodicity peaks) can achieve very high
values – up to infinity – if the second peak does not exist. Such extreme values can
be replaced with the help of methods discussed in the next section.
undesired bias. For example, removal of music tracks with missing metadata may
overemphasize the impact of popular songs.
If data instances with missing values are kept, these values can be replaced by a
location measure (cf. Definition 9.10), for example, the mean or the median of the
corresponding feature series. Because features with the same mean or median may
have very different distributions, the missing values can also be replaced in such a way
that the standard deviation remains the same. If it is expected that data instances with
missing values are rather uncommon (e.g., very noisy frames where the fundamental
frequency cannot be estimated) or the reason for a missing value may carry some
important information to learn (low popularity of songs with missing metadata), the
value can be set to an outlying number, e.g., to zero for a feature with a positive
definition space. However, a drawback might be that dissimilar music pieces would
be characterized with the same values of corresponding features.
The replacement of all missing values in a series by the same single value may
introduce even more problems. In particular, for time series of features based on
audio signals, the relation between values at neighboring positions may then become
weaker or stronger than it actually is; even different features often correlate with each other to a smaller or
larger extent. Alternatively, a linear regression (Definition 9.32) can be applied to
approximate missing values from other positions of the same feature or a multiple
regression (Definition 9.34) might be used based on other features. The application
of regression is limited, though, to single or several consecutive positions and is less
reasonable for longer blocks of missing values in feature series.
These and more approaches to handle missing data are introduced in [13].
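For illustration, two simple replacement strategies for a feature series with missing values encoded as NaN (the interpolation variant is a simple stand-in for the regression-based approaches mentioned above):

import numpy as np

def impute_with_median(x):
    """Replace NaNs in a feature series by the median of the observed values."""
    x = np.asarray(x, dtype=float).copy()
    x[np.isnan(x)] = np.nanmedian(x)
    return x

def impute_by_interpolation(x):
    """Replace single or short runs of NaNs by linear interpolation between neighbors."""
    x = np.asarray(x, dtype=float).copy()
    missing = np.isnan(x)
    x[missing] = np.interp(np.flatnonzero(missing), np.flatnonzero(~missing), x[~missing])
    return x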
Figure 14.3: Harmonization of feature vectors with different dimensions for the es-
timation of the feature matrix.
u is the index of the current feature. $\left\lfloor \frac{l_{min}(w-1)}{l} \right\rfloor + 1$ and $\left\lfloor \frac{l_{min}(w-1)}{l} \right\rfloor + 2$ are the
indices of the larger original frames which overlap with the w-th shorter frame.
If only one longer frame covers the new shorter frame, its value is taken (cf.
feature F4 in Figure 14.3).
If no longer original frame with an overlap to the w-th shorter frame exists, the
new value is set to “not a number” (cf. last value of feature F3 in Figure 14.3).
There exist other options to construct a harmonized feature matrix, for instance,
using interpolation between the values of the original frames. In that case, however,
it should be guaranteed that the interpolated values are feasible and make sense with
respect to the definition of the feature. Consider a C major cadence where the domi-
nant triad (G) changes to a tonic (C), and the normalized strength of C in the chroma
vector switches from a value near zero to a value near one. Then, an interpolation
would blur the clear identification of this occurrence and make the recognition of a
chord or a cadence more difficult.
In Principal Component Analysis (PCA), the original features are first linearly transformed with the
goal to identify decorrelated dimensions with the highest variance. Second, the components
are sorted w.r.t. their variance, and the ones with the least variance are finally
removed. In another approach, Independent Component Analysis (ICA), it is assumed
that the analyzed observations are linear combinations of some independent sources.
In Section 11.6, the application of ICA to sound source separation is described. How-
ever, this method can also be used for feature transformation. For example, in a study
on instrument recognition [7], an improvement of accuracy up to 9% is reported af-
ter the transformation of original features to more independent dimensions. Another
option to select the most relevant dimensions after transformation is to apply Linear
Discriminant Analysis (LDA), cf. Section 12.4.1.
Although these transformations can be applied efficiently, may reduce the com-
plexity of classification models, and help to increase the classification quality, these
algorithms also have disadvantages. If interpretable music features (moods, instru-
ments, vocal characteristics, etc.) are used to train classification models, their mean-
ing is lost after transformation. Later theoretical analysis of music categories be-
comes hard or impossible. Furthermore, the extraction of all original features is still
required for new data instances even if a classification model is trained only on a few
components.
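A minimal sketch of such a transformation-based reduction with PCA (scikit-learn assumed); note that, as stated above, the fitted transformation has to be reused for new data, and all original features still have to be extracted for new instances:

from sklearn.decomposition import PCA

def reduce_feature_dimension(X_train, X_new, n_components=20):
    """Fit PCA on training feature vectors (rows = observations) and
    project both training and new data onto the leading components."""
    pca = PCA(n_components=n_components)
    X_train_red = pca.fit_transform(X_train)
    X_new_red = pca.transform(X_new)       # the same transformation must be reused
    return X_train_red, X_new_red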
In many studies, only a 30-s excerpt of each music piece is analyzed [28, 18]. The optimal length – depending on the application scenario – may
be somewhere between hundreds of milliseconds (capturing a note or note sequence)
up to several minutes for longer pieces with varying characteristics. Obviously, there
is no statistical evidence to restrain the interval length to 30 s. One of the arguments
for this particular number was based on legal issues: in some countries audio excerpts
of 30-s length could be freely distributed.
Not only the length but also the starting point of the interval has an impact on the
later analysis. In our previous study on the recognition of music genres and styles,
the selection of 30 s from the middle of a music track performed better than 30 s taken
from the beginning, and even better performance was achieved with the selection of
30 s after the first minute [32]. The latter method increases the probability of skipping the
song intro and capturing vocal parts (verse or refrain) which are usually representative
of popular music.
An interesting fact was observed in two studies on genre, artist, and style recognition,
where very short time intervals of 250 ms and 400 ms were sufficient for
human listeners to recognize a class [8, 10]. However, this does not mean that such
intervals are optimal for automatic analysis of music pieces. The human brain may
very quickly recognize previously learned patterns. Furthermore, the recognition of
more complex music classes with differing structural segments may fail if based on
too short segments because relevant “aspects of music such as rhythm, melody and
melodic effects such as tremolo are found on larger time scales” [1, p. 31].
A commonly applied method for the aggregation of features is the estimation of
a few statistics of each feature, including location measures (mean and median, Defi-
nition 9.10) and dispersion measures (standard deviation, quantiles, mode, Definition
9.11). To reduce the impact of outliers, the so-called trimmed mean can be estimated
by sorting, removing a fixed percentage of extreme values, and then taking the mean.
In [9], 2% and in [20] 2.5% are recommended. Skewness and kurtosis – also referred
to as 3rd and 4th moments (see Definition 9.7) – describe the asymmetry of feature
series and the flatness around its mean.
More information about the distribution of feature values can be saved using
boundaries of confidence intervals (Definition 9.25) and histograms (Definition 9.6).
For example, beat histograms are estimated for tempo prediction in [28] but may
be useful to capture different levels of periodicity as well. As mentioned above in
Section 14.2.1, histogram bins can be constructed either with an equal length or with
an equal number of values. For both cases the optimal number of bins may not be
known in advance.
Properties of features which are not normally distributed can be stored as param-
eters of a mixture of multiple (Gaussian) distributions [17].
In contrast to the order-independent statistics discussed in the previous section, here the temporal development of the feature series is taken into account,
and the estimated statistics depend on the order of the extraction frames.
One of the methods to derive relevant properties of the original series without
data reduction is to estimate first- and second-order derivatives, or to smooth the
original series, e.g., with a running average. Also, order-independent statistics such
as parameters of a Gaussian mixture model can be estimated for a sequence of larger
texture windows. The transformation of a feature series to the phase domain with a
subsequent estimation of characteristics such as average distance as done in [19, 20]
can also be applied (the relevance of this feature to distinguish between classical and
popular music is illustrated in Figures 14.2 and 15.2).
Another option is to save only a few characteristics of the series such as the pa-
rameters of an autoregressive model (Definition 9.40). Here, a feature value is pre-
dicted from P preceding values of the same feature. Such an application is introduced
in [18] by means of P-th-order diagonal autoregressive (DAR) model:
$$X'_{u,w} = \sum_{p=1}^{P} A_{u,p} \cdot X_{u,w-p} + \varepsilon_u, \qquad (14.5)$$
so that each feature u = 1, ..., F is independent of the other features but only dependent
on its own past. $\varepsilon_u$ is considered as white noise for each dimension.
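A minimal sketch of fitting the diagonal model of Equation (14.5) by least squares, separately for one feature (one row of the feature matrix); the estimated coefficients $A_{u,1}, \ldots, A_{u,P}$ can then serve as a compact descriptor of the temporal behavior:

import numpy as np

def diagonal_ar_coefficients(feature_series, order=3):
    """Least-squares fit of the P-th-order diagonal AR model for one feature series."""
    x = np.asarray(feature_series, dtype=float)
    # design matrix of lagged values: columns x_{w-1}, ..., x_{w-order}
    rows = [x[p:len(x) - order + p] for p in range(order)]
    lagged = np.column_stack(rows[::-1])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    return coeffs                                 # A_{u,1}, ..., A_{u,order}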
Because of often existing correlations between different features, a more general
multivariate autoregressive (MAR) model predicts the whole feature vector x w based
on the whole feature vector in past time periods:
$$\mathbf{x}_w = \sum_{p=1}^{P} \mathbf{A}_p \cdot \mathbf{x}_{w-p} + \boldsymbol{\varepsilon}, \qquad (14.6)$$
where $\nu \in \{0, 1, \ldots, T_A/2 - 1\}$ denotes the modulation frequency bin and i the imag-
inary unit. High energy at low or high modulation frequencies then corresponds
to slow or fast changes of the feature values, respectively. Thus, by appropriately
summarizing the squared magnitudes of the modulation spectrum the energies of the
summarized bands serve as descriptors which model the temporal structure of the
short-time features. McKinney and Breebaart [16] propose to subsume the modula-
tion frequency bins to four bands which correspond to the frequency ranges of 0 Hz
(average across observations), 1–2 Hz (musical beat rates), 3–15 Hz (speech syllabic
rates), and 20–43 Hz (perceptual roughness).
Structural complexity is a method proposed in [14] to capture relevant structural
changes of feature values in a larger analysis window. First, sets of $F_z$ features are
selected to represent the z-th of Z (if desired, interpretable) properties. In the original
contribution, chroma is represented by a 12-dimensional feature vector, and also
rhythm and timbre are characterized. Later, [29] extended these properties to instrumentation,
chords, harmony, and tempo/rhythm. For each short extraction frame $w_a$
in the analysis window, a number $T_A$ of frames before and after $w_a$ (including the $w_a$-th
frame) are taken into account. The difference between the vectors representing the summary
of the $T_A$ preceding frames $\mathbf{w}_p$ and the $T_A$ succeeding frames $\mathbf{w}_s$ is measured by the
Jensen–Shannon divergence:
$$d_{JS}(\mathbf{w}_p, \mathbf{w}_s) = \frac{d_{KL}\left(\mathbf{w}_p, \frac{\mathbf{w}_p + \mathbf{w}_s}{2}\right) + d_{KL}\left(\mathbf{w}_s, \frac{\mathbf{w}_p + \mathbf{w}_s}{2}\right)}{2}, \qquad (14.8)$$
where $d_{KL}(\mathbf{w}_p, \mathbf{w}_s)$ is the Kullback–Leibler divergence:
$$d_{KL}(\mathbf{w}_p, \mathbf{w}_s) = \sum_{k=1}^{F_z} w_{pk} \cdot \log_2 \frac{w_{pk}}{w_{sk}}, \qquad (14.9)$$
$$w_{pk} = \frac{1}{T_A} \sum_{w=w_a - T_A}^{w_a - 1} X_{k',w}, \qquad w_{sk} = \frac{1}{T_A} \sum_{w=w_a}^{w_a + T_A - 1} X_{k',w}, \qquad (14.10)$$
where $k'$ is the index in the complete feature matrix X corresponding to the k-th
feature representing property z.
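A minimal sketch of the computation of Equations (14.8)–(14.10) for one analysis frame; the feature values are assumed to be non-negative (e.g., chroma strengths) and the summary vectors are normalized to sum to one so that the Kullback–Leibler divergence is well defined:

import numpy as np

def structural_complexity_at(X_z, w_a, t_a, eps=1e-12):
    """Jensen-Shannon divergence between the mean feature vectors of the
    T_A frames before and after frame w_a (Equations (14.8)-(14.10)).
    X_z: (F_z, W) block of the feature matrix for one property z."""
    w_p = X_z[:, w_a - t_a:w_a].mean(axis=1)          # summary of preceding frames, Eq. (14.10)
    w_s = X_z[:, w_a:w_a + t_a].mean(axis=1)          # summary of succeeding frames
    w_p = w_p / (w_p.sum() + eps)                     # normalize to distributions
    w_s = w_s / (w_s.sum() + eps)
    mix = 0.5 * (w_p + w_s)
    def d_kl(a, b):                                   # Kullback-Leibler divergence, Eq. (14.9)
        return np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * (d_kl(w_p, mix) + d_kl(w_s, mix))    # Eq. (14.8)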
most granular level is represented with notes of the score and onset events for au-
dio. Another source of information at this level is the Attack-Decay-Sustain-Release
(ADSR) envelope sketched in Figure 2.16. Each of the four intervals has its own specific
characteristics (inharmonic components in the attack phase, stable energy
in the sustain phase, decreasing energy in the release phase, etc.). Therefore, an improper ag-
gregation of features from different intervals may complicate further analysis. Often,
a simplified model of the envelope is calculated, the Attack-Onset-Release (AOR)
envelope, where an onset corresponds to the time point with a maximum energy after
the attack phase, and all remaining components of the sound are assigned to the release
phase. See also Section 16.3 for ADSR analysis.
Figure 14.4: Analysis of music structure. Left subfigures (from top to down): the
score of the first bars of Beethoven’s “Für Elise,” first 30 bins of the magnitude spec-
trum, the root mean square of the signal, the time events extracted from audio (beats
and tatums after [6], onsets with MIR Toolbox [11]). Right subfigures: self-similarity
matrix of the complete music piece, its variant enhanced by means of thresholding
(both matrices are estimated with SM Toolbox [21]).
further processing. Examples of such frames are marked with small filled circles
in the left bottom subfigure: (a): onset frame, (b): interonset frame, (c): middle of
attack interval, (d): middle of release interval. The choice of the method depends
on the application. For automatic analysis of harmony, frames between onsets with
a stable sound may be preferred. For the identification of instruments, more relevant features may be extracted from the middle of the attack interval, which contains instrument-specific inharmonic components such as the stroke of a piano key or the noise of a violin bow.
A general structure of a music piece can be estimated from Self-Similarity Ma-
trices (SSM) which measure distances between feature vectors as introduced in Def-
inition 16.1. The right top subfigure plots an SSM based on chroma features, and the
right bottom subfigure an enhanced SSM variant. Segments with high similarity are
visualized in the matrix as dark diagonal stripes. The information about the structure
of a music piece can be used in different ways for data reduction. One option is to remove features from segments which are already contained in the feature matrix (or are very similar to features already contained). Two examples of such segments are marked with rectangles
on the plot of the enhanced SSM (bottom right subfigure). Another procedure is to
select only a limited number of short time frames from each segment: here it is as-
sumed that the relevant characteristics of the segment do not have a strong variation
and may be captured by a sample, e.g., from the middle of the segment. Compared to
the “blind” method of selecting 30 s from the middle of the music piece (see discus-
sion in Section 14.4.1), here the properties from different (representative) parts of a
music piece are maintained despite strong data reduction. Because structures may have several and not always coincident layers (consider a segment with the same sequence of harmonic events but different instrumentation), the analysis of SSMs based on different features can be necessary.
Data reduction and processing based on temporal structure are helpful for clas-
sification and music analysis. On the other hand, complex and time-consuming algorithms are required to extract the structural information if it is not already available.
Because of many simultaneously playing sources, varying properties of instruments
(such as progress of the envelope), and also applied digital effects, the accuracy
of these methods has some limitations and an accurate resolution of musical struc-
ture cannot be guaranteed by state-of-the-art algorithms. Some of the challenges
are discussed in Part III of the book. Identification of onset events is addressed in
Section 16.2, tempo recognition in Chapter 20, and structure segmentation in Sec-
tion 16.4.
Figure 14.5: Examples of better separability between classes with new constructed
features. (a,b): artificial example; (c–f): distinction of cello and flute tones.
we may integrate the knowledge of time events for the construction of new feature
dimensions (see discussion in Section 14.4.3). In Figure 14.5 (d), two new features
are generated from the original ones: it is distinguished between frames from the
attack interval and the release interval, and only the middle of these intervals is taken
into account. This leads to a better separation ability of the new feature dimensions, which produce more distinctive distributions for the two classes, as shown in Figures 14.5 (e,f).
Pachet and Roy [23] introduce a general framework for the construction of so-
called analytical features. The extraction of an individual feature is described by a se-
quence of operators allowing exactly one operation in each step. Mierswa and Morik
[19] distinguish between several categories of operations and allow the construction
of method trees capable of extracting multiple features. The following categories of
operations are proposed (examples are provided in parentheses):
• transforms change the space of input series (DFT, phase space transform, auto-
correlation),
• filters do not change the space and estimate some function for each value of input
series (logarithmic scaling, band-pass filtering, moving average),
• markups assign elements of input series to some properties (audio segmentation,
clustering),
• generalized windowing splits input series into windows of given size and overlap,
and
• functions save single values for a complete series (characteristics of the strongest
peak, location measures).
The implementation as a tree helps to save computing time if the same transform
is used as input several times (see Example 14.4). On the other hand, in [23, p. 5] it
is argued that “the uniform view on windowing, signal operators, aggregation opera-
tors, and operator parameters” improves the flexibility and simplifies the generation
process.
Example 14.4 (Feature Construction). The following example of a feature genera-
tion chain after [23] estimates the maximum amplitude of the spectrum after FFT for
frames of 1024 samples. Then, the minimum across all frames is saved.
• Min(Max(Sqrt(FFT(Split(x,1024))))).
The following example of a feature generation tree after [19] in Figure 14.6 saves
several features after the estimation of the spectrum (characteristics of three strongest
spectral peaks, strongest chroma bin) as well as several time domain–based charac-
teristics (zero-crossings, periods of two strongest peaks after autocorrelation).
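A rough sketch of the first operator chain as a plain functional composition (frame size follows the example above; everything else, including the interpretation of Sqrt as the spectral magnitude, is an illustrative assumption and not the framework of [23]):

```python
import numpy as np

def split(x, frame_size=1024):
    """Split the signal into non-overlapping frames of the given size."""
    n_frames = len(x) // frame_size
    return np.reshape(x[:n_frames * frame_size], (n_frames, frame_size))

def feature_chain(x):
    """Min(Max(Sqrt(FFT(Split(x,1024))))): minimum over frames of the
    maximum spectral magnitude per frame."""
    frames = split(x, 1024)
    spectra = np.fft.rfft(frames, axis=1)   # FFT per frame
    magnitudes = np.abs(spectra)            # Sqrt of squared real/imaginary parts
    return np.min(np.max(magnitudes, axis=1))

# Example with one second of a 440 Hz tone at 44.1 kHz
t = np.arange(44100) / 44100.0
print(feature_chain(np.sin(2 * np.pi * 440 * t)))
```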
Two general problems have to be resolved for a successful automatic feature gen-
eration. First, a set of operations and transforms should be designed. For instance,
[23] lists more than 70 operators from basic mathematical constructs (maximum,
minimum, absolute value, etc.) to complex transforms (FFT, estimation of mel spec-
trum, various filters). The next challenge is to find a strategy for the exploration of
a huge search space: k^l different chains of length l are possible using k operators. As discussed in Chapter 10, stochastic methods such as evolutionary algorithms are particularly suited for such complex optimization problems. Genetic
programming is applied in [19, 23], and particle swarm optimization in [12]. During
the evolutionary process, several operations for the variation of feature candidates
can be applied:
Example 14.5 (Genetic Operators for Feature Construction). Substitution of an op-
erator with another, e.g.:
• Centroid(Low-Pass Filter(FFT(Split(x,1024)))) 7→
Centroid(High-Pass Filter(FFT(Split(x,1024))))
Removal of an operator:
• Centroid(Low-Pass Filter(FFT(Split(x,1024)))) 7→
Centroid(FFT(Split(x,1024)))
Addition of a new operator:
• Centroid(Low-Pass Filter(FFT(Split(x,1024)))) 7→
Centroid(Low-Pass Filter(Log(FFT(Split(x,1024)))))
Crossover between two chains:
• { Centroid(Low-Pass Filter(FFT(Split(x,1024)))) ,
Max(Peak Positions(Autocorrelation(Differentiation(Split(x,1024))))) } 7→
Max(Peak Positions(Low-Pass Filter(FFT(Split(x,1024)))))
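A minimal sketch of such variation operators, representing a chain as a list of operator names read from the outside in (the operator vocabulary and the random choices are illustrative assumptions, not the implementation of [19, 23]):

```python
import random

OPERATORS = ["Low-Pass Filter", "High-Pass Filter", "Log", "Centroid",
             "Autocorrelation", "Differentiation"]

def substitute(chain):
    """Replace a randomly chosen operator with another one."""
    i = random.randrange(len(chain))
    return chain[:i] + [random.choice(OPERATORS)] + chain[i + 1:]

def remove(chain):
    """Remove a randomly chosen operator (keep at least one)."""
    if len(chain) <= 1:
        return list(chain)
    i = random.randrange(len(chain))
    return chain[:i] + chain[i + 1:]

def add(chain):
    """Insert a new operator at a random position."""
    i = random.randrange(len(chain) + 1)
    return chain[:i] + [random.choice(OPERATORS)] + chain[i:]

def crossover(chain_a, chain_b):
    """Exchange the tails of two chains at random cut points."""
    i = random.randrange(1, len(chain_a))
    j = random.randrange(1, len(chain_b))
    return chain_a[:i] + chain_b[j:], chain_b[:j] + chain_a[i:]

# Example: Centroid(Low-Pass Filter(FFT(Split(x,1024)))) read outside-in
chain = ["Centroid", "Low-Pass Filter", "FFT", "Split(x,1024)"]
print(substitute(chain))
```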
difference between the amount of data before and after feature processing. Consid-
ering feature processing in general as an intermediate step between the extraction of
raw features and classification or regression (training), we can estimate the reduction
rate of feature processing mFPRR as the share of values in the feature matrix after
feature processing in relation to the original number of values:
$$m_{FPRR} = \frac{F \cdot W}{\sum_{u=1}^{F^*} \left( W^{**}(u) \cdot F^{**}(u) \right)}. \qquad (14.11)$$
F and W are the dimensions of the feature matrix X after processing (the number of features and the number of classification windows, respectively). F* is the number of original features, where the number of dimensions of feature x_u is denoted by F**(u) and the number of corresponding extraction frames by W**(u).
For each method that operates on only the feature or time dimension (cf. Sec-
tions 14.3,14.4), its specific data reduction performance can be measured. For the
reduction of features the feature reduction rate mFRR is:
$$m_{FRR} = \frac{F}{F^*} \qquad (14.12)$$
(F is the number of features after and F ∗ before processing). Similarly, the time
reduction rate mT RR is defined as:
$$m_{TRR} = \frac{W}{W^*}, \qquad (14.13)$$
where W is the number of time windows (number of values in feature series) after
and W ∗ before the processing. Instead of measuring the number of time windows,
the sum of the lengths of all extraction time intervals can be estimated and divided
by the length of the complete music piece.
The interpretation of measures given above is not always straightforward; con-
sider the two following examples. Even if the number of features is strongly reduced
after PCA (e.g., only the two strongest components are selected), all the original fea-
tures are still necessary for the determination of the components. Also, successful
data reduction based on music structure may require algorithms for the estimation of
boundaries between structural segments, which themselves require the extraction of
underlying features from complete music pieces for the building of the self-similarity
matrix (see Figure 14.4).
The outcome of a successful data reduction is, however, not only the reduced
amount of data, but also an increase of generalization performance of a classification
or regression model. Models built with fewer features and trained with fewer outly-
ing instances may tend to be (but not necessarily are) more stable and may not overfit
towards training sets as well. The stability of classification models after feature pro-
cessing can be evaluated as discussed in Section 13.3.6.3.
The three groups of evaluation criteria (quality, resources, and data reduction
rates) are often in conflict: algorithms which best improve classification quality or
very efficiently reduce the number of features keeping high relevance of remaining
ones often have to pay the price of high computing costs. Also, classification quality
alone suffers from too strong data reduction. For an unbiased evaluation of methods,
multi-objective optimization can be applied (see Section 10.4). In Section 15.7 we
will discuss how the simultaneous optimization of two evaluation criteria can be
integrated into feature selection.
Bibliography
[1] P. Ahrendt. Music Genre Classification Systems - A Computational Approach.
PhD thesis, Informatics and Mathematical Modelling, Technical University of
Denmark, 2006.
[2] E. Alpaydin. Introduction to Machine Learning. The MIT Press, Cambridge
London, 2010.
[3] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and
M. Ouimet. Spectral dimensionality reduction. In I. Guyon, M. Nikravesh,
S. Gunn, and L. A. Zadeh, eds., Feature Extraction. Foundations and Applica-
tions, volume 207 of Studies in Fuzziness and Soft Computing, pp. 137–165.
Springer, Berlin Heidelberg, 2006.
[4] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl. Aggregate features
and ADABOOST for music classification. Machine Learning, 65(2-3):473–
484, 2006.
[5] J. R. Cano, F. Herrera, and M. Lozano. Using evolutionary algorithms as
instance selection for data reduction in KDD: An experimental study. IEEE
Transactions on Evolutionary Computation, 7(6):561–575, 2003.
[6] A. J. Eronen, V. T. Peltonen, J. T. Tuomi, A. P. Klapuri, S. Fagerlund, T. Sorsa,
G. Lorho, and J. Huopaniemi. Audio-based context recognition. IEEE Trans-
actions on Audio, Speech, and Language Processing, 14(1):321–329, 2006.
[7] A. J. Eronen. Musical instrument recognition using ICA-based transform of
features and discriminatively trained HMMs. In Proc. of the 7th International
Symposium on Signal Processing and Its Applications (ISSPA), pp. 133–136.
IEEE, 2003.
[8] R. O. Gjerdingen and D. Perrott. Scanning the dial: The rapid recognition of
music genres. Journal of New Music Research, 37(2):93–100, 2008.
[9] J. Han, M. Kamber, and J. Pei. Data preprocessing. In Data Mining: Con-
cepts and Techniques, Series in Data Management Systems, pp. 93–124. Mor-
gan Kaufmann Publishers, 2012.
[10] C. L. Krumhansl. Plink: “Thin slices” of music. Music Perception: An Inter-
disciplinary Journal, 27(5):337–354, 2010.
[11] O. Lartillot and P. Toiviainen. MIR in MATLAB (II): A toolbox for musical
feature extraction from audio. In S. Dixon, D. Bainbridge, and R. Typke, eds.,
Proc. of the 8th International Conference on Music Information Retrieval (IS-
MIR), pp. 127–130. Austrian Computer Society, 2007.
[12] T. Mäkinen, S. Kiranyaz, J. Raitoharju, and M. Gabbouj. An evolutionary fea-
ture synthesis approach for content-based audio retrieval. EURASIP Journal on
Audio, Speech, and Music Processing, 2012(23), 2012. doi:10.1186/1687-
4722-2012-23.
[13] B. M. Marlin. Missing Data Problems in Machine Learning. PhD thesis, De-
partment of Computer Science, University of Toronto, 2008.
[14] M. Mauch and M. Levy. Structural change on multiple time scales as a correlate
of musical complexity. In Proc. of the 12th International Society for Music
Information Retrieval Conference (ISMIR), pp. 489–494. University of Miami,
2011.
Chapter 15
Feature Selection
I GOR VATOLKIN
Department of Computer Science, TU Dortmund, Germany
15.1 Introduction
As we discussed in Chapters 11 and 12, the task of classification methods is to orga-
nize and categorize data based on features and their statistical characteristics. There
exist many music classification scenarios, from classification into genres and styles
to identification of instruments and cover songs, recognition of mood and tempo, and
so on. Even similar tasks within the same application scenario may require differ-
ent features. For example, for the genre classification of classical against pop, the relative
share of percussion may be a relevant feature. If popular music contains mostly rap
songs, more relevant features may describe vocal characteristics, and for progressive
rock, the important features may rather describe harmonic properties.
A manual solution is to select the best-suited features for each task with the help
of a music expert, who would carefully analyze the data provided for each class.
This approach requires high manual effort with no guarantee of the optimality of the
selected features. Another method is to create a large feature set only once for dif-
ferent classification tasks. Then, some of these features would surely be relevant for
a particular classification task. As we will discuss later, too many irrelevant features
(which would be an unavoidable part of this large set) often lead to overfitting: the
performance of classification models suffers because some of the irrelevant features
are then identified as relevant.
In this chapter we address the search for relevant features by means of automatic
feature selection, an approach to identify the most relevant features and to remove redundant and irrelevant ones before the training of classification models. As the terms
“relevance,” “redundancy,” and “irrelevance” are central for feature selection, we
shall start with their definitions in Section 15.2, before the general scope of feature
selection will be discussed in Section 15.3. Section 15.4 outlines the design steps
for a feature selection algorithm. Several functions for the measurement of feature
relevance are listed in Section 15.5, followed by three selected algorithms in Section
15.6. Multi-objective feature selection is introduced in Section 15.7.
15.2 Definitions
Definition 15.1 (Relevant Feature). In the context of classification, a feature x is
called relevant if its removal from the full feature set F will lead to a decreased
classification performance:
• P(y = ŷ | F) > P(y = ŷ | F \ {x}) and
• P(y ≠ ŷ | F) < P(y ≠ ŷ | F \ {x}),
where y describes the labeled (correct) class, ŷ the predicted class, and P(y = ŷ|F )
is the probability that the predicted label ŷ is the correct label y when all features F
are used.
The first basic goal of feature selection is to keep relevant features. The second
one is to remove non-relevant features which may be categorized as redundant or
irrelevant. The following definitions describe the difference between the two latter
kinds of features.
Definition 15.2 (Redundant Feature). If a feature subset S exists which does not
contain x so that after the removal of S the non-relevant feature x would become
relevant, this feature is called redundant:
• P(y = ŷ | F) = P(y = ŷ | F \ {x}) and
• ∃ S ⊆ F, {x} ∩ S = ∅ : P(y = ŷ | F \ S) ≠ P(y = ŷ | F \ (S ∪ {x})).
This means that a redundant feature may be removed without the reduction of
performance of a classifier, but may contain by itself some relevant information about
the target class (e.g., have significantly different distributions for data of different
classes). Often, strongly correlated features are redundant. However, this is not
always true (see Example 15.1). Note that correlation is measured by the empirical
correlation coefficient rXY as introduced in Definition 9.20.
Example 15.1 (Redundancy and Correlation). Figure 15.1 (a) plots the temporal
evolution of two features which are aggregated in classification windows of 4 s and
have a high negative correlation: rXY = −0.94 for Schubert, “Andante con moto”
(left subfigure) and rXY = −0.86 for the Becker Brothers, “Scrunch” (middle sub-
figure). The distribution of these features is visualized in the right subfigure. The
two features have a high grade of redundancy: linear separation of classes with
either a vertical or a horizontal dashed line with individual projections of features
leads to only few errors. Figure 15.1 (b) also shows two highly anti-correlated fea-
tures: rXY = −0.98 for Schubert, “Andante con moto” and rXY = −0.82 for the
Becker Brothers, “Scrunch.” However, the combination of the two features leads to
an almost perfect linear separation of the two classes in contrast to the individual
projections of these features.
Definition 15.3 (Irrelevant Feature).
In contrast to redundant features, the removal of an irrelevant feature x does not
affect the performance of a classifier:
• P(y = ŷ|F ) = P(y = ŷ|F \ {x}) and
Figure 15.1: Examples of correlated features which have (a) high individual redun-
dancy and (b) are relevant in their combination. The original values were normalized,
and the mean value was estimated for each classification window of 4 s.
Figure 15.3: Examples of pairs of features which are individually irrelevant but rel-
evant in their combination. Squares and circles represent instances of two different
classes. (a): Theoretical example constructed after [8, p. 10] (features are uniformly
distributed between 0 and 1). (b): Distribution of two audio signal features for two
classical music pieces (squares, Jean Sibelius, “Symphony No. 5 Es-major op. 82”
and Georg Friedrich Haendel, “Opus 4 No. 1”) and two pop music pieces (circles,
Herbert Grönemeyer, “Mambo” and Sonny Rollins, “No Moe”).
Let m be a measure which describes the success of feature selection (see Section
13.3 for lists of measures for and beyond classification performance).
Definition 15.4 (Feature Selection).
The task of feature selection is to find the optimal set of features represented by q*:
$$\mathbf{q}^* = \arg\min_{\mathbf{q}} \; m(\mathbf{y}, \hat{\mathbf{y}}, \Phi(\mathcal{F}, \mathbf{q})),$$
where F is the full feature set, Φ(F, q) is the set of selected features, and q is a binary vector of length F which indicates whether a feature u is selected (q_u = 1) or
not (q_u = 0). Labeled class indicators y and predicted classes ŷ are required as inputs
for the estimation of classification performance on selected data instances (but may
be unnecessary for other evaluation measures).
Evaluation measures which have to be maximized, such as accuracy, can be easily
adapted to Definition 15.4 as follows:
max {m(x)} = − min {−m(x)} ⇒ arg max {m(x)} = arg min {−m(x)} . (15.1)
properties of the related class. This advantage is lost after feature processing meth-
ods which do not keep the interpretability, such as Principal Component Analysis
(see Definition 9.48): even if original features have some musical meaning, these
properties may be hidden in a new feature space after the transform of dimensions.
Finally, it is worth highlighting the role of feature selection when features are au-
tomatically constructed for a concrete classification task (see Section 14.5). Feature
selection then helps to restrict the number of features in a typically very large search
domain because of many available operators and transforms.
$$c(\mathcal{S}, \mathbf{y}) = \frac{k \cdot \rho(\mathcal{S}, \mathbf{y})}{\sqrt{k + (k-1) \cdot \rho(\mathcal{S})}}. \qquad (15.2)$$
Here, ρ(S , y ) denotes the mean correlation between all features in S and the label
vector y , and ρ(S ) the mean inter-correlation of all features in S .
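A small sketch of this subset merit (known from correlation-based feature selection); here k is assumed to be the number of features in S, and mean absolute Pearson correlations are an illustrative choice:

```python
import numpy as np

def cfs_merit(X_S, y):
    """Merit of a feature subset after Eq. (15.2).
    X_S: array of shape (k, W) with the k selected feature rows,
    y:   numeric label vector of length W."""
    k = X_S.shape[0]
    # mean absolute correlation between each selected feature and the labels
    rho_fy = np.mean([abs(np.corrcoef(X_S[u], y)[0, 1]) for u in range(k)])
    # mean absolute inter-correlation between all pairs of selected features
    pairs = [(u, v) for u in range(k) for v in range(u + 1, k)]
    rho_ff = np.mean([abs(np.corrcoef(X_S[u], X_S[v])[0, 1])
                      for u, v in pairs]) if pairs else 0.0
    return k * rho_fy / np.sqrt(k + (k - 1) * rho_ff)

# Example with two noisy copies of the labels used as features
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 50)
X_S = np.vstack([y + 0.3 * rng.standard_normal(100) for _ in range(2)])
print(cfs_merit(X_S, y))
```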
Another approach proposed in [10] is to calculate the distance between instances
which are close to each other in the feature space but belong to different classes
(the corresponding Relief algorithm is described below in Section 15.6.1). For a
fixed number of iterations t = 1, ..., I, the weight of the u-th feature W_u is updated according to:
$$W_u(t) = W_u(t-1) - \left( X_{uw} - X_{(\mathrm{nearest\text{-}hit})\,w} \right)^2 + \left( X_{uw} - X_{(\mathrm{nearest\text{-}miss})\,w} \right)^2, \qquad (15.3)$$
where P(yc ) is the probability of class yc . Let Ic denote the union of all intervals for
which ŷ(xu ) = c. For the example in Figure 15.2, right subfigure, I1 ≈ (−0.029, 0.04)
for Chopin (c = 1) and I2 ≈ [−0.1, −0.029] ∪ [0.04, 0.5] for AC/DC (c = 2). Then,
the success of the classifier can be measured as:
$$R_{NB} = \sum_{c=1}^{G} P(x_u \in I_c,\, y_c) = \sum_{c=1}^{G} P(x_u \in I_c \mid y_c) \cdot P(y_c). \qquad (15.7)$$
P(yc ) can be simply estimated as the share of the instances belonging to yc in the
training set, and P(xu ∈ Ic |yc ) corresponds to the sum of areas (integrals) below the
distribution of xu for the union of intervals Ic .
If a feature xu and a class yc are statistically independent of each other (i.e., the
feature is irrelevant), P(xu , yc ) = P(xu ) · P(yc ) (cf. Definition 9.2), and the difference
between the joint and the independent distributions of a feature and a class is equal
to zero. The Kolmogorov distance sums up these differences:
$$R_{KOL}(x_u) = \sum_{x_u} \sum_{c=1}^{G} \left| P(x_u, y_c) - P(x_u) \cdot P(y_c) \right| = \sum_{x_u} \sum_{c=1}^{G} \left| P(x_u \mid y_c) \cdot P(y_c) - P(x_u) \cdot P(y_c) \right| \qquad (15.8)$$
so that RKOL = 0 for a completely irrelevant feature.
H(y) is referred to as entropy, which measures the average amount of information required for the identification of a class for W instances.
For the feature x_u, a relevance function called information gain is measured as the difference between the general entropy and the entropy after the estimation of the distribution of this feature:
$$IG(\mathbf{y}, \mathbf{x}_u) = H(\mathbf{y}) - H(\mathbf{y} \mid \mathbf{x}_u),$$
where
$$H(\mathbf{y} \mid \mathbf{x}_u) := - \sum_{x_u} P(x_u) \cdot \sum_{c=1}^{G} P(c \mid x_u) \cdot \log_2 P(c \mid x_u). \qquad (15.12)$$
The decision tree classifier C4.5 uses the information gain ratio as a relevance func-
tion for the decision to select a feature in the split node [15] (see also Section 12.4.3):
$$IGR(\mathbf{y}, \mathbf{x}_u) = \frac{H(\mathbf{y}) - H(\mathbf{y} \mid \mathbf{x}_u)}{H(\mathbf{x}_u)}. \qquad (15.13)$$
During the construction of a decision tree, in each node the information gain is max-
imized for both subtrees, using some threshold value of feature x u for the optimal
splitting.
The symmetrical uncertainty is defined as:
$$SU(\mathbf{y}, \mathbf{x}_u) = 2 \cdot \frac{IG(\mathbf{y}, \mathbf{x}_u)}{H(\mathbf{y}) + H(\mathbf{x}_u)}. \qquad (15.14)$$
Symmetrical uncertainty was applied in [9] for the measurement of correlation be-
tween features as well as for the correlation between a feature and a class.
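A minimal sketch of these entropy-based measures for a discrete (or discretized) feature; the quantile-based discretization in the example is an illustrative assumption:

```python
import numpy as np

def entropy(values):
    """Empirical entropy H of a discrete vector (base-2 logarithm)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, x_discrete):
    """IG(y, x_u) = H(y) - H(y | x_u) for a discretized feature."""
    h_y_given_x = 0.0
    for value in np.unique(x_discrete):
        mask = x_discrete == value
        h_y_given_x += mask.mean() * entropy(y[mask])
    return entropy(y) - h_y_given_x

def symmetrical_uncertainty(y, x_discrete):
    """SU(y, x_u) = 2 * IG(y, x_u) / (H(y) + H(x_u)), Eq. (15.14)."""
    return 2.0 * information_gain(y, x_discrete) / (entropy(y) + entropy(x_discrete))

# Example: discretize a continuous feature into 4 bins before the estimation
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 100)
x = y + 0.5 * rng.standard_normal(200)
x_discrete = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))
print(information_gain(y, x_discrete), symmetrical_uncertainty(y, x_discrete))
```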
A general framework for simultaneous reduction of redundancy between selected
features and maximization of relevance to a class is discussed in [5]. The minimal
redundancy condition WH and the maximal relevance condition VH are defined as:
$$W_H = \frac{1}{|\Phi(\mathcal{F}, \mathbf{q})|^2} \sum_{k=1}^{F} q_k \cdot \sum_{u=1}^{F} q_u \cdot H(\mathbf{x}_u, \mathbf{x}_k), \qquad (15.15)$$
$$V_H = \frac{1}{|\Phi(\mathcal{F}, \mathbf{q})|} \sum_{u=1}^{F} q_u \cdot H(\mathbf{x}_u, \mathbf{y}), \qquad (15.16)$$
where q is a binary vector indicating the selected features which build the set Φ(F , q )
according to Definition 15.4. In [5] it is proposed to maximize VH −WH or VH /WH .
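A small sketch of a greedy selection that maximizes V_H − W_H; the pairwise relevance and redundancy values H(·,·) are assumed to be precomputed (e.g., as mutual information estimates), and the greedy strategy is an illustrative choice rather than the procedure of [5]:

```python
import numpy as np

def greedy_max_relevance_min_redundancy(relevance, redundancy, n_select):
    """Greedily add the feature that maximizes V_H - W_H of the current set.
    relevance:  vector of length F with H(x_u, y) values,
    redundancy: F x F matrix with H(x_u, x_k) values."""
    F = len(relevance)
    selected = [int(np.argmax(relevance))]          # start with the most relevant feature
    while len(selected) < n_select:
        best_u, best_score = None, -np.inf
        for u in range(F):
            if u in selected:
                continue
            candidate = selected + [u]
            v_h = relevance[candidate].mean()                            # relevance condition
            w_h = redundancy[np.ix_(candidate, candidate)].mean()        # redundancy condition
            if v_h - w_h > best_score:
                best_u, best_score = u, v_h - w_h
        selected.append(best_u)
    return selected

# Example with random symmetric redundancy values for 6 features
rng = np.random.default_rng(2)
relevance = rng.random(6)
redundancy = rng.random((6, 6)); redundancy = (redundancy + redundancy.T) / 2
print(greedy_max_relevance_min_redundancy(relevance, redundancy, 3))
```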
15.6.1 Relief
In Algorithm 15.1, the pseudocode of the Relief algorithm [10] is sketched. The
original procedure is extended for K neighbors using Equation (15.5). The following
inputs and parameters of the algorithm are required: $X \in \mathbb{R}^{F \times W}$ is the matrix of F feature dimensions and W classification frames, $\mathbf{y} \in \mathbb{R}^W$ are the binary class relationships, I the number of iterations of Relief, and τ the threshold value to decide if
a feature is relevant. The binary vector q, which indicates the features to select, is
reported as output.
The algorithm distinguishes between “nearest-hits” (classification instances which
are closest to the selected instance and belong to the same class) and “nearest-misses”
(instances which are closest to the selected instance but belong to another class).
First, the weights of nearest-hits and nearest-misses are initialized to zero (lines
1–5). Then, for I iterations, a random instance r is selected (line 7), and nearest-
hits and nearest-misses are estimated (lines 8–16). The numerator and denominator
of Equation (15.5) are calculated in lines 19–26 (the sum of distances between the
selected instance and K neighbors for F feature dimensions). Finally, the weights
of each feature w are estimated (line 31), and the decision if the feature should be
selected is made based on the threshold τ (lines 32–36).
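A rough sketch of the basic Relief idea described above with a single neighbor (the Euclidean distance, the random sampling, and the averaging of the weights are assumptions for illustration; the full Algorithm 15.1 additionally handles K neighbors via Equation (15.5)):

```python
import numpy as np

def relief(X, y, n_iterations=100, threshold=0.0, rng=None):
    """Rough Relief sketch: X has shape (F, W), y holds binary labels of
    length W. Returns a binary selection vector q of length F."""
    rng = rng or np.random.default_rng(0)
    F, W = X.shape
    weights = np.zeros(F)
    for _ in range(n_iterations):
        r = int(rng.integers(W))                                 # random instance
        dists = np.linalg.norm(X - X[:, r][:, np.newaxis], axis=0)
        dists[r] = np.inf                                        # exclude the instance itself
        same = np.where(y == y[r])[0]
        other = np.where(y != y[r])[0]
        nearest_hit = same[np.argmin(dists[same])]
        nearest_miss = other[np.argmin(dists[other])]
        # weight update per feature dimension, following Eq. (15.3)
        weights += -(X[:, r] - X[:, nearest_hit]) ** 2 + (X[:, r] - X[:, nearest_miss]) ** 2
    return (weights / n_iterations > threshold).astype(int)

# Example: feature 0 separates the classes, feature 1 is pure noise
rng = np.random.default_rng(3)
y = np.repeat([0, 1], 50)
X = np.vstack([y + 0.1 * rng.standard_normal(100), rng.standard_normal(100)])
print(relief(X, y, n_iterations=200))
```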
In multi-objective feature selection (MO-FS), K evaluation measures are optimized simultaneously:
$$\mathbf{q}^* = \arg\min_{\mathbf{q}} \left[ m_1(\mathbf{y}, \hat{\mathbf{y}}, \Phi(\mathcal{F}, \mathbf{q})),\; \ldots,\; m_K(\mathbf{y}, \hat{\mathbf{y}}, \Phi(\mathcal{F}, \mathbf{q})) \right].$$
Example 15.2 (Multi-Objective Feature Selection). Figure 15.4 illustrates the opti-
mization of four pairs of criteria; see Section 13.3 for definitions of the correspond-
ing measures. The task was to identify the genre Electronic after [23]. Each point is
associated with a feature set where the classification was done with a random forest
or a linear support vector machine (see Chapter 12).
Figure 15.4: Four pairs of simultaneously optimized objectives. The feature selec-
tion was applied to the recognition of the genre Electronic. (a): maximization of
both measures; (b): minimization of both measures; (c): minimization of error and
maximization of accuracy; (d): maximization of both measures.
For the two left subfigures, the multi-objective feature selection leads to larger
sets of compromise solutions. In Figure 15.4 (a), the goal was to maximize re-
call mREC and specificity mSPEC . The compromise non-dominated solutions are
distributed between the points [mREC = 0.732; mSPEC = 0.834] and [mREC = 0.927;
mSPEC = 0.557]. In Figure 15.4 (b), the goal was to minimize the feature reduction
rate mFRR, Equation (14.12), and to minimize the balanced relative error mBRE. The
best compromise solutions are distributed between sets of 68 features (mFRR = 0.107)
and 21 features (mFRR = 0.033), where mBRE increases from 0.19 to 0.273. For mo-
bile devices with limited resources, the reduction of requirements on storage space
and runtime may be relevant (models with fewer features are trained faster and, more importantly, classify new music faster) so that an increase of the classification error
would be acceptable to a certain level.
In Figures 15.4 (c,d), the multi-objective optimization makes less sense because
of stronger anti-correlation between balanced relative error and accuracy and the
correlation between F-measure and geometric mean: the optimization of one of the
two criteria may be sufficient.
The challenge is to identify those evaluation criteria which play the essential role
in a concrete classification scenario. Such a decision may be supported by means of
empirical validation using Equations (10.4) and (10.5) in Section 10.4.
The evaluation functions for MO-FS can be chosen from groups of measures pre-
sented in Chapter 13, e.g., for the simultaneous minimization of resource demands
and maximization of classification quality and listener satisfaction. Even closely re-
lated measures may be relevant and only weakly correlated, such as classification
performance on positive and negative examples. The estimation of a single com-
bined measure such as balanced error rate or F-measure may not be sufficient here
because of the fixed balance between the original measures. For some music-related
classification and recommendation scenarios, the desired balance between surprise and safety cannot always be identified in advance. Surprise means that the rate of
false positives is accepted below some threshold (the listener appreciates the identi-
fication of some negative music pieces as belonging to a class). Safety means that
for a very low rate of false positives, a higher rate of false negatives is accepted to a
certain level (it is desired to keep the number of recommended songs which do not
belong to a class as small as possible).
Evolutionary algorithms are very well suited for MO-FS: the optimization of a
population of solutions helps to search for not only one but for a set of compro-
mise solutions. Feature selection w.r.t. several objectives becomes a very complex
problem for large feature sets, and stochastic components (mutation, self-adaptation)
are particularly valuable to overcome local optima. The first application of EA for
MO-FS was introduced in [6] and for music classification in [26].
An evolutionary loop for FS as presented in Algorithm 15.2 can be simply ex-
tended to multi-objective FS through the estimation of several fitness functions for
the selection of individuals. Then, a metric such as hypervolume (Definition 10.7)
may measure the quality and the diversity of solutions in the search space. The com-
parison of single solutions (sets of features) can be done based on Pareto dominance
(Definition 10.3) and algorithms with a fast non-dominated sorting, such as SMS-
EMOA (Algorithm 10.9), may be useful for the efficient search for trade-off feature
sets.
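A minimal sketch of the Pareto-dominance comparison used to keep the non-dominated (trade-off) feature sets, assuming all objectives are to be minimized; this illustrates only the selection criterion, not a full SMS-EMOA loop:

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(solutions):
    """Filter (feature_set, objective_vector) pairs down to the Pareto front."""
    front = []
    for cand_set, cand_obj in solutions:
        if not any(dominates(other_obj, cand_obj) for _, other_obj in solutions):
            front.append((cand_set, cand_obj))
    return front

# Example: feature sets evaluated by (classification error, feature reduction rate m_FRR)
solutions = [({"mfcc", "chroma"}, (0.19, 0.107)),
             ({"chroma"},         (0.27, 0.033)),
             ({"mfcc"},           (0.30, 0.110))]   # dominated by the first set
print(non_dominated(solutions))
```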
subsequent extraction of music audio features on several levels and the optimization
with multi-objective feature selection (“sliding feature selection”) was proposed in
[23] for a more interpretable music classification into genres and styles. This proce-
dure is briefly discussed in Section 8.2.2.
Last but not least, it is important to mention the importance of a proper evaluation
of feature sets. Reference [16] pointed out the danger of overfitting when cross-
validation is applied to evaluate feature subsets. The re-evaluation of a previous
study on music classification with an independent test set is described in [7]. A
possible way to reduce the danger of over-optimization is to distinguish between
inner and outer validation loops [2]. For an introduction to resampling methods
and evaluation measures, see Chapter 13.
Bibliography
[1] B. Bischl, M. Eichhoff, and C. Weihs. Selecting groups of audio features by sta-
tistical tests and the group lasso. In Proc. of the 9. ITG Fachtagung Sprachkom-
munikation. VDE Verlag, Berlin, Offenbach, 2010.
[2] B. Bischl, O. Mersmann, H. Trautmann, and C. Weihs. Resampling methods
for meta-model validation with recommendations for evolutionary computa-
tion. Evolutionary Computation, 20(2):249–275, 2012.
[3] B. Bischl, I. Vatolkin, and M. Preuß. Selecting small audio feature sets in
music classification by means of asymmetric mutation. In R. Schaefer et al.,
eds., Proc. of the 11th International Conference on Parallel Problem Solving
From Nature (PPSN), pp. 314–323. Springer, 2010.
[4] H. Blume, M. Haller, M. Botteck, and W. Theimer. Perceptual feature based
music classification - a DSP perspective for a new type of application. In W. A.
Najjar and H. Blume, eds., Proc. of the 8th International Conference on Sys-
tems, Architectures, Modeling and Simulation (IC-SAMOS), pp. 92–99. IEEE,
2008.
[5] C. H. Q. Ding and H. Peng. Minimum redundancy feature selection from mi-
croarray gene expression data. Journal of Bioinformatics and Computational
Biology, 3(2):185–205, 2005.
[6] C. Emmanouilidis, A. Hunter, and J. MacIntyre. A multiobjective evolutionary
setting for feature selection and a commonality-based crossover operator. In
Proc. of the IEEE Congress on Evolutionary Computation (CEC), volume 1,
pp. 309–316. IEEE, 2000.
[7] R. Fiebrink and I. Fujinaga. Feature selection pitfalls and music classification.
In Proc. of the 7th International Conference on Music Information Retrieval
(ISMIR), pp. 340–341. University of Victoria, 2006.
[8] I. Guyon, M. Nikravesh, S. Gunn, and L. A. Zadeh, eds. Feature Extraction.
Foundations and Applications, volume 207 of Studies in Fuzziness and Soft
Computing. Springer, Berlin Heidelberg, 2006.
Part III
Applications
Chapter 16
Segmentation
I GOR VATOLKIN
Department of Computer Science, TU Dortmund, Germany
16.1 Introduction
Segmentation is a task necessary for a variety of applications in music analysis. In
Chapter 17 we will find it useful to segment a piece of music into small single parts
corresponding to notes in sheet music to allow, e.g., for later transcription. This par-
ticular kind of segmentation is typically called onset detection and may be based,
e.g., on time-domain or frequency-domain features indicating the fundamental fre-
quency f0 of a certain part of the tone (see Section 16.2).
An even finer segmentation splits a tone into parts such as attack, sustain, decay,
and possibly noise (see Section 2.4.5). This is useful to allow for instrument clas-
sification from a piece of sound, for example. After finding relevant features for this
low-level task, a clustering method (see Chapter 11), e.g., the k-means method, can
be applied to yield a reduced number of features for subsequent classification (see
Section 16.3).
We can, however, also aim at a segmentation that corresponds to larger parts of
a piece of music like refrains, for example. In musicology, a typical first step of the
analysis of compositions is to structure the piece into different phases. This is a time-
intensive task, which is usually done by experts. For an analysis of large collections
of music this is not feasible. Therefore, an automatic structuring method is desirable.
As no ground truth is available, unsupervised learning methods like clustering are a
sensible approach to solving this task (see Section 16.4).
Overall, the size of segments highly depends on the final application. Therefore,
the methods used to achieve the goal also depend on the application. This chapter in-
troduces basic concepts for constructing methods and hence algorithms that allow for
segmenting music into smaller parts that are desirable for subsequent applications.
16.2.1 Definition
For the tone onset definition, the concept of so-called transients (see Section 2.4.5)
is essential. Transient signals are located in the attack phase of music tones (Fig-
ure 2.16). Transients are non-periodic and characterized by a quick change of fre-
quency. They usually occur by interaction between the player and the musical instru-
ment, which is necessary to produce a new tone. Reference [2] defines a tone onset
to be located – in most cases – as the start of the transient phase.
The work of [24] summarizes three definitions of a tone onset: physical onset
(first rising from zero), perceptual onset (time where an onset can first be perceived
by a human listener), and perceptual attack onset (time where the rhythmic emphasis of a tone can first be perceived by a human listener). Reference [38, p. 334] conducted a study which found that the perceptual onset “lies between about 6 and 15 dB below the
maximum level of the tone.” Reference [7] criticizes that the study did not consider
complex musical signals.
Whether the physical or the perceptual tone onset definition is used very much
depends on the data format. The MIDI file format (Section 7.2.3) contains all in-
formation about music notes, including onsets. Hence we can imagine these to be
physical onsets. The perceptual definition is more suitable for the WAVE format
(Section 7.3.2) that is typically used if real music pieces have to be annotated by
human listeners.
There are two kinds of onset detectors: offline and online ones. For offline de-
tectors, information of the whole music recording can be used for analysis. This case
is well studied and there exist many algorithms. Many applications like hearing aids
require, however, online (or real-time) approaches. Here, tone onsets should be de-
tected in time or with minimal delay, also called latency time. The latency should
not exceed a few tens of milliseconds, as human beings perceive – depending on
the tempo of music pieces – two tone onsets separated by less than 20 to 30 ms as
simultaneous [34].
Musical instruments differ in the way their tones are produced. There exist many differ-
ent instrument types (see also Section 18.3): percussion instruments (like bass drum
or timpani), string or bowed instruments (like guitar or violin), keyboard instruments
(like piano or accordion), or wind instruments (like flute or trumpet).
Example 16.1 (Temporal visualization). The basic onset detection procedure is il-
lustrated in this chapter by means of two music pieces: monophonic recordings of
the first strophe of the Hallelujah song1 played by piano and flute, respectively. Fig-
1 Also known as the German song “Ihr seid das Volk, das der Herr sich ausersehn,” http://www.
[Figures 16.1 and 16.2: amplitude envelopes of the piano and flute recordings (amplitude vs. time in seconds).]
Figure 16.3: Sheet music for the first strophe of the Hallelujah song.
ures 16.1 and 16.2 present the amplitude envelope of the recordings, where the grey
vertical lines mark the true onset times. The associated sheet music is presented in
Figure 16.3. While for percussion or string instruments new tone onsets are neces-
sarily marked with a major or minor amplitude increase, this is not always the case
for wind instruments (especially for legato playing). This indicates the challenge of
finding a universal approach for onset detection which suits for all kinds of music
instruments.
Figure 16.4: Auditory image of monophonic piano recording.
Figure 16.5: Auditory image of monophonic trumpet recording.
only the first 2 seconds are displayed in order to better demonstrate the delayed nerve
activities for higher frequencies.
Reference [8] also divided the ongoing signal into several frequency bands for
a hybrid approach. While in the upper bands, energy-based detector functions are
applied in order to detect strong transients, frequency-based detectors are used for
lower bands for exploring the soft onsets.
Another pre-processing approach is adaptive whitening [34]. The main idea is
to re-weight the STFT in a data-dependent manner. STFT does not provide a good
resolution in the low-frequency domain but contains excessive information for high
frequencies. Adaptive whitening aims to bring “the magnitude of each frequency
band into a similar dynamic range” [34, p. 315]. Define
$$q[\lambda,\mu] = \begin{cases} \max\left(|X_{\mathrm{stft}}[\lambda,\mu]|,\; r,\; m \cdot q[\lambda-1,\mu]\right) & \text{if } \lambda > 0 \\ \max\left(|X_{\mathrm{stft}}[\lambda,\mu]|,\; r\right) & \text{otherwise} \end{cases} \qquad (16.1)$$
$$X_{\mathrm{stft}}^{\mathrm{aw}}[\lambda,\mu] \leftarrow \frac{X_{\mathrm{stft}}[\lambda,\mu]}{q[\lambda,\mu]}.$$
The $X_{\mathrm{stft}}[\lambda,\mu]$ denote the Fourier coefficients (complex numbers) for the µ-th frequency bin of the λ-th window (cp. Section 4.5, Equation (4.30)).
$$|X_{\mathrm{stft}}[\lambda,\mu]| = \sqrt{\mathrm{Re}(X_{\mathrm{stft}}[\lambda,\mu])^2 + \mathrm{Im}(X_{\mathrm{stft}}[\lambda,\mu])^2}$$
is the magnitude of these coefficients. The memory parameter m lies in the interval
[0, 1] while an appropriate interval for the floor parameter r depends on the magni-
tude distribution. A value of $r > \max_{\lambda,\mu} |X_{\mathrm{stft}}[\lambda,\mu]|$ eliminates the effect of adaptive
whitening, while r = 0 and m = 0 cause absolute whitening (all magnitudes will be
equal to 1). This simple and efficient approach shows a noticeable improvement for
many online onset detectors.
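A small sketch of Equation (16.1) applied to a complex STFT matrix; the example STFT is computed with a plain framed FFT, and the window size, hop, and parameter values are illustrative assumptions:

```python
import numpy as np

def adaptive_whitening(X_stft, floor=1e-3, memory=0.997):
    """Adaptive whitening after Eq. (16.1): divide each STFT frame by a slowly
    decaying running peak of its magnitude per frequency bin.
    X_stft: complex array of shape (n_frames, n_bins)."""
    X_aw = np.array(X_stft, dtype=complex)
    q_prev = None
    for lam in range(X_aw.shape[0]):
        mag = np.abs(X_aw[lam])
        if q_prev is None:
            q = np.maximum(mag, floor)
        else:
            q = np.maximum.reduce([mag, np.full_like(mag, floor), memory * q_prev])
        X_aw[lam] = X_aw[lam] / q
        q_prev = q
    return X_aw

# Example: STFT of white noise via framed FFT (frame size 1024, hop 512)
rng = np.random.default_rng(4)
signal = rng.standard_normal(44100)
frames = np.lib.stride_tricks.sliding_window_view(signal, 1024)[::512]
X_stft = np.fft.rfft(frames * np.hanning(1024), axis=1)
print(adaptive_whitening(X_stft).shape)
```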
each window. The described algorithm performed well in the MIREX onset detection
competitions in recent years.2 The drawback is the time-consuming model training.
A common post-processing step is to smooth the onset detection function vector odf, e.g., by exponential smoothing:
$$s(\mathrm{odf})_1 = \mathrm{odf}_1, \qquad s(\mathrm{odf})_i = \alpha \cdot \mathrm{odf}_i + (1-\alpha) \cdot s(\mathrm{odf})_{i-1}, \quad i = 2, \ldots, W. \qquad (16.2)$$
The smoothing parameter α, 0 < α < 1, determines the influence of the past observations on the actual value: the smaller α is, the greater the smoothing effect.
Regarding re-scaling, many normalization methods have been proposed (see Sec-
tion 14.2.2). Standardization, for example, is motivated by statistics:
$$n(\mathrm{odf}) = \frac{s(\mathrm{odf}) - \overline{s(\mathrm{odf})}}{\sigma_{s(\mathrm{odf})}}. \qquad (16.3)$$
The standardized vector n(odf) will then have mean 0 and standard deviation 1.
However, min(n(odf)) and max(n(odf)) are unknown. A further method guarantees
min(n(odf)) = 0 and max(n(odf)) = 1. Here
$$n(\mathrm{odf}) = \frac{s(\mathrm{odf}) - \min(s(\mathrm{odf}))}{\max(s(\mathrm{odf})) - \min(s(\mathrm{odf}))}. \qquad (16.4)$$
Note that the described kind of normalization just works offline. Reference [3] pro-
posed and compared many online modifications of the normalization step.
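A minimal sketch of Equations (16.2)–(16.4) applied offline to a detection function vector:

```python
import numpy as np

def exponential_smoothing(odf, alpha=0.6):
    """Eq. (16.2): the smaller alpha, the stronger the smoothing."""
    smoothed = np.empty(len(odf), dtype=float)
    smoothed[0] = odf[0]
    for i in range(1, len(odf)):
        smoothed[i] = alpha * odf[i] + (1.0 - alpha) * smoothed[i - 1]
    return smoothed

def standardize(s_odf):
    """Eq. (16.3): zero mean and unit standard deviation."""
    return (s_odf - np.mean(s_odf)) / np.std(s_odf)

def rescale(s_odf):
    """Eq. (16.4): re-scaling to the interval [0, 1]."""
    return (s_odf - np.min(s_odf)) / (np.max(s_odf) - np.min(s_odf))

# Example with a noisy detection function containing two peaks
odf = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.2, 0.1, 0.8, 0.4, 0.1])
print(rescale(exponential_smoothing(odf, alpha=0.6)))
```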
Example 16.3 (Onset detection functions). Figure 16.6 presents the features Spec-
tral Flux (SF) and Energy Envelop (EE) as well as their normalized variants (n.SF
and n.EE with α = 0.6 and re-scaling to [0, 1]) for a piano and a flute interpretation
of the Hallelujah piece (Figures 16.1 and 16.2). While the first feature is computed
in the spectral domain, for the second feature, just the amplitude envelope (time do-
main) is considered.
Consider the first and the fourth piano tone: they last longer than the other tones
and show a relevant change in the spectral domain as well as a minor amplitude increase toward their ends. This can be explained by the change of temporal and spectral characteristics of the signal during the development of long tones (possibly caused by overtones), which obviously results in at least one false detection. Interestingly, such
patterns are not usual for synthetically produced tones. Unfortunately, differences
2 http://www.music-ir.org/mirex/wiki/MIREX_HOME. Accessed 20 May 2015.
between real and synthetic music, with regard to onset detection, have not been ex-
tensively investigated yet.
As we see in Figure 16.6, the second tone onset of the flute is not reflected in
the spectral flux feature. An appropriate explanation would be that the first and
the second tone have the same pitch (cp. Figure 16.3) and the transient phase of
the second tone was very short. The fourth flute tone contains relevant changes of
spectral content so that many false detections would be expected here. This behavior
is caused by the applied vibrato technique for this tone.
To summarize, SF and EE features appear to be suitable for the detection of piano
tone onsets while EE does not appear to be a meaningful detection function for flute.
Smoothing of the detection function can be seen as advisable in general.
where ThreshFunction is either the median or the mean function and lT and rT are
the numbers of windows left and right of the current frame to be considered. Note
that frequently lT = rT is chosen for offline detection, but for online detectors this
distinction is of importance as rT has to be small or even equal to zero. δ and β are
additive and multiplicative threshold parameters, respectively. See Example 16.4 for
a comparison of two choices for these parameters.
o is the onset vector. lO and rO are additional parameters – number of windows left
and right of the current window, respectively, for calculating the local maximum.
lO = rO = 0 implies that every value with n(odf )i > ti is a tone onset. For online
applications, rO should be chosen small or equal to zero.
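The referenced threshold equation is not reproduced above; a common form consistent with the description, and assumed in the sketch below, is $t_i = \delta + \beta \cdot \mathrm{ThreshFunction}\left(n(\mathrm{odf})_{i-l_T}, \ldots, n(\mathrm{odf})_{i+r_T}\right)$, with an onset reported where $n(\mathrm{odf})_i > t_i$ and $n(\mathrm{odf})_i$ is a local maximum within $[i - l_O, i + r_O]$:

```python
import numpy as np

def detect_onsets(n_odf, delta=0.1, beta=0.9, l_t=5, r_t=5, l_o=2, r_o=2,
                  thresh_function=np.mean):
    """Dynamic thresholding and local-maximum picking on a normalized onset
    detection function. Returns a binary onset vector o."""
    n_odf = np.asarray(n_odf, dtype=float)
    W = len(n_odf)
    onsets = np.zeros(W, dtype=int)
    for i in range(W):
        # assumed dynamic threshold: additive delta, multiplicative beta
        window = n_odf[max(0, i - l_t):min(W, i + r_t + 1)]
        t_i = delta + beta * thresh_function(window)
        # local maximum over l_o frames to the left and r_o frames to the right
        neighborhood = n_odf[max(0, i - l_o):min(W, i + r_o + 1)]
        if n_odf[i] > t_i and n_odf[i] >= neighborhood.max():
            onsets[i] = 1
    return onsets

# Example with parameters similar to setting SET1 from Example 16.4 below
n_odf = np.array([0.0, 0.1, 0.9, 0.2, 0.1, 0.0, 0.1, 0.8, 0.3, 0.1, 0.0])
print(detect_onsets(n_odf, delta=0.1, beta=0.9, l_t=5, r_t=5))
```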
Figure 16.6: Spectral flux (SF), normalized SF (n.SF), energy envelop (EE), and
normalized EE (n.EE) features for piano (left) and flute (right) interpretation of the
Hallelujah piece.
$$m_F = \frac{2 \cdot (O_{true} - m_{FN})}{2 \cdot (O_{true} - m_{FN}) + m_{FP} + m_{FN}}, \qquad m_F \in [0,1],$$
where $O_{true}$ denotes the number of true onsets, $m_{FP}$ the number of false positive, and $m_{FN}$ the number of false negative detections.
This relationship can be used to derive the dependence of the number of misclassifi-
cations on the F-value for three scenarios:
$$m_{FP} = 0 \;\Longrightarrow\; m_{FN} = \left(1 - \frac{m_F}{2 - m_F}\right) \cdot O_{true}$$
$$m_{FN} = 0 \;\Longrightarrow\; m_{FP} = \left(\frac{2}{m_F} - 2\right) \cdot O_{true}$$
$$m_{FP} = m_{FN} \;\Longrightarrow\; m_{FP} = m_{FN} = (1 - m_F) \cdot O_{true}$$
For example, if the number of false detections of onsets mFP = 0, then mF = 0.8 means that the number of undetected onsets mFN = (1/3) · Otrue. mFN = 0 corresponds to the case that all true onsets are detected, whereas mFP = mFN corresponds to the case that the number of errors is the same for onsets and non-onsets.
Alternatively, the F-value can be defined using Recall (mREC ) and Precision
(mPREC ) measures (cp. Definitions 13.9, 13.10):
$$m_F = \frac{2\, m_{PREC} \cdot m_{REC}}{m_{REC} + m_{PREC}}, \quad \text{where} \quad m_{PREC} = \frac{m_{TP}}{m_{TP} + m_{FP}} \quad \text{and} \quad m_{REC} = \frac{m_{TP}}{m_{TP} + m_{FN}}.$$
Note that there is a tradeoff between recall and precision, so that onset detection
could be optimized in a multi-objective fashion (see Section 10.4).
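A small sketch of this evaluation for lists of detected and annotated onset times; matching within a ±50 ms tolerance window is a common convention and an assumption here, not taken from the text:

```python
def evaluate_onsets(detected, true_onsets, tolerance=0.05):
    """Count true/false positives and false negatives by greedily matching each
    true onset to at most one detection within the tolerance (in seconds),
    then return precision, recall, and F-value."""
    detected = sorted(detected)
    used = [False] * len(detected)
    tp = 0
    for t in sorted(true_onsets):
        for j, d in enumerate(detected):
            if not used[j] and abs(d - t) <= tolerance:
                used[j] = True
                tp += 1
                break
    fp = len(detected) - tp
    fn = len(true_onsets) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_value = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_value

# Example: one onset missed, one spurious detection
print(evaluate_onsets(detected=[0.51, 1.02, 1.80], true_onsets=[0.50, 1.00, 2.50]))
```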
In order to achieve good detection quality, a sophisticated optimization of algo-
rithm parameters is essential. Of course, the tuned algorithm will work particularly
well on music pieces close to the training set. This illustrates the importance of elab-
orating a training data set which considers many musical aspects. Reference [18]
Figure 16.7: Thresholding for normalized SF (n.SF) and normalized EE (n.EE) for
piano (left) and flute (right) interpretation of the Hallelujah piece. The dotted line
corresponds to setting SET1 and the dashed line represents SET2. The associated
F-values are given in the legend.
proposed, for example, a way for building a representative corpus of classical music.
Further research in this field is not only important for onset detection but also for
many other music applications.
Example 16.4. Figure 16.7 illustrates the thresholding and onset localizing proce-
dure. We consider the above-mentioned normalized spectral flux (n.SF) and energy
envelop (n.EE) features for piano and flute (as in Figure 16.6). Two possible parameter settings were compared as examples. The dotted line corresponds to the setting
SET1: δ = 0.1, β = 0.9, th.fun=mean and lT = rT = 5 while the dashed line repre-
sents SET2: δ = 0.4, β = 0.7, th.fun=median and lT = rT = 10. The figure shows
how important the correct choice of the thresholding parameters is. The dynamic
threshold is especially sensitive to even small variations of δ and β. The smaller lT
and rT are, the more likely it is to detect smaller peaks of the detection function
(dotted line). Please note the associated F-values in the legend.
Figure 16.7 reflects some well-known facts: the simplest onset detection problem
is given for monophonic string, keyboard, or percussive instruments. Furthermore,
the spectral flux feature achieves remarkable results for almost all musical instru-
ments.
Table 16.1: Overview of the Number of Observations and Features
feature vectors for all frames. This greatly reduces complexity (see, e.g., Table 16.1
related to the example below), but still allows us to make use of the change in the
instruments’ sound. As the silence and noise cluster includes no useful information
for the following classification task, this cluster center is completely dropped for a
further complexity reduction. As results of the clustering process, both the clustered frames and the cluster centers may be considered useful for further classification.
Example 16.5 (Tone Phases of Different Instruments). For clustering we use the k-
means method (see Section 11.4.1) with 25 random starting points, which yields promising results. In Figure 16.8, the almost perfect clustering of a piano note is shown as an example. Labels have been given to the different clusters according to their first occurrence in the sound. The time frames containing only silence and a bit of noise at the beginning and the end of the recording are grouped into a single cluster.
The actual sound has a first phase of high energy and additional overtones of the
hammer hitting the strings, followed by a phase where these additional overtones
have subsided before the sound fades away. Strings or even wind instruments often
also result in excellent clusterings, as can be seen in Figures 16.9 and 16.10. For the
bowed viola sound in Figure 16.9, the cluster labeled as attack consists of two parts,
where the bow accelerates (in the beginning) and decelerates after the sustain phase
before the sound finally decays.
The clustering of a contrabassoon (cp. Figure 16.11) shows crisp clusters only
for the silence/noise part in the beginning and the end. A smaller number of clusters
seems to be more sensible and the labeling of the phases attack, sustain, and decay
does not make much sense in this case. Therefore, we tried to apply an automated
selection of the right number of clusters. Relative criteria to validate the number of
clusters using the Dunn, Davies–Bouldin, or SD indices as described in [12] suggest
a minimum of 2 and a maximum of 5 clusters to be tried for all the instruments.
Although usage of k-means with k = 4 clusters is not always optimal, we will use
k = 4 later in our classification example.
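A small sketch of this phase clustering with scikit-learn's KMeans and 25 random starting points as in the example above; the frame features (log magnitude spectra of a framed FFT) and the synthetic test tone are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_tone_phases(signal, frame_size=1024, hop=512, n_clusters=4):
    """Cluster short-time frames of a single tone into phases (e.g., silence/noise,
    attack, sustain, decay). One cluster label per frame, in temporal order."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_size)[::hop]
    features = np.log1p(np.abs(np.fft.rfft(frames * np.hanning(frame_size), axis=1)))
    return KMeans(n_clusters=n_clusters, n_init=25, random_state=0).fit_predict(features)

# Example: a synthetic "tone" with silence, an attack-like noise burst, and a decay
sr = 22050
t = np.arange(sr) / sr
tone = np.concatenate([np.zeros(sr // 4),
                       np.random.default_rng(5).standard_normal(sr // 8) * 0.5,
                       np.sin(2 * np.pi * 440 * t[: sr // 2]) * np.exp(-3 * t[: sr // 2])])
print(cluster_tone_phases(tone))
```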
homogeneous segments are repeated in a music piece, the corresponding SSM would
contain blocks with high similarity values which may appear in different parts of the
matrix and not only around the main diagonal; for a more general definition of a
block, see [20].
For the search of similar segments in S , the path of length L is defined as an
ordered set of cells $P = \{S_{m_1,n_1}, \ldots, S_{m_L,n_L}\}$ with $\varepsilon_1 \le m_{l+1} - m_l \le \varepsilon_2$ and $\varepsilon_1 \le n_{l+1} - n_l \le \varepsilon_2$ ($\varepsilon_1$ and $\varepsilon_2$ define the permitted step sizes; the path is parallel to the main diagonal if $\varepsilon_1 = \varepsilon_2 = 1$). The score of a path is defined as:
$$s(P) = \sum_{l=1}^{L} S_{m_l, n_l}. \qquad (16.10)$$
Because of possible tempo changes across similar segments, a path with a high
score does not have to be necessarily parallel to the main diagonal of S . Examples of
two paths with high scores are visible as dark stripes enclosed in marked rectangles
in Figure 14.4, right bottom subfigure.
Visualization of Features For easier judgement of the quality of a clustering result,
a visual representation of the similarity between the feature vectors of different time
frames is necessary. The most common method is a heatmap of the pairwise distances
between feature vectors of different time frames. The darker the color is, the more
similar the time frames are. As a similarity measure, the cosine similarity
xmT xn
1
s(xxm , x n ) = 1+
2 ||xxm ||||xxn ||
is usually best suited for visualizing the distances between features of frames m and
Tx
n [22]. d(xxm , x n ) = 12 1 − ||xxxmm||||xn
xn || is the corresponding cosine distance. In the
heatmap, clusters are represented by white dots with the same vertical location (pre-
sented at the last time frame in the cluster). Hence, a direct comparison to the similar-
ity of the time frames is possible, since areas with very high similarity are represented
by dark colored squares.
Let us now give an example for the segmentation of a musical piece by using
order-constrained solutions in k-means clustering (see [17]).
Example 16.7 (Structure Segmentation by Clustering). For this task we use longer
time frames than in the instrument recognition setting in Example 16.6. Non-over-
lapping time frames of 3 seconds duration give sufficient temporal resolution. For a
visual representation of the results, the plots described above will be used.
Let us now consider our results for two different recordings of popular music.
One is Depeche Mode’s song “Stripped,” which has a very clear structure. The
other one is Queen’s “Bohemian Rhapsody,” a longer piece of music with a very
diversified composition.
In Figure 16.12 the structure of Depeche Mode’s “Stripped” is shown. The dark
squares are easily visible and the clusters represent this structure very well. The
number of clusters is estimated as 10.
[Figure 16.12: heatmap of pairwise frame similarities for Depeche Mode's “Stripped” (both axes: time in seconds) with the estimated clusters annotated, from the beginning to the end of the song: synth. intro, verse, chorus, instr./verse, chorus, ticking.]
For Queen’s “Bohemian Rhapsody” the picture is a bit more difficult, see Fig-
ure 16.13. The squares are less dark but it is still possible to see the structure of the
song. Even in this situation the cluster results correspond very well with the squares
in the plot. This song does not follow the simple structure of most popular music.
Hence a short annotation of the clusters is not possible. Listening to the music al-
lows us to verify the result. The clusters end when the dominant instruments or voices
change. Because of the more complex structure, more clusters are necessary.
[Figure 16.13: heatmap of pairwise frame similarities for Queen's “Bohemian Rhapsody” (both axes: time in seconds) with the estimated cluster segments.]
like Queen's “Bohemian Rhapsody” as well as for simpler songs with a very easily recognizable structure like Depeche Mode's “Stripped.” The resulting cluster centers
can be used for further tasks like music genre classification, where each part of a song
is labeled separately instead of labeling the whole song in order to improve accuracy.
archically related ([4],[26],[14]), e.g. if smaller groups of certain note sequences are
parts of a rougher division into verses and bridges.
Bibliography
[1] N. Bauer, K. Friedrichs, D. Kirchhoff, J. Schiffner, and C. Weihs. Tone onset
detection using an auditory model. In M. Spiliopoulou, L. Schmidt-Thieme,
and R. Janning, eds., Data Analysis, Machine Learning and Knowledge Dis-
covery, volume Part VI, pp. 315–324. Springer International Publishing, 2014.
[2] J. P. Bello, L. Daudet, S. A. Abdallah, C. Duxbury, M. E. Davies, and M. B.
Sandler. A tutorial on onset detection in music signals. IEEE Transactions on
Speech and Audio Processing, 13(5):1035–1047, 2005.
[3] S. Böck, F. Krebs, and M. Schedl. Evaluating the online capabilities of onset
detection methods. In Proc. of the 13th International Society for Music Infor-
mation Retrieval Conference (ISMIR), pp. 49–54. FEUP Edições, 2012.
[4] W. Chai. Automated Analysis of Musical Structure. PhD thesis, School of
Architecture and Planning, Massachusetts Institute of Technology, 2005.
[5] N. Collins. Using a pitch detector for onset detection. In MIREX Online Pro-
ceedings (ISMIR 2005), pp. 100–106, 2005.
[6] S. B. Davis and P. Mermelstein. Comparison of Parametric Representations
for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE
Transactions on Acoustics, Speech and Signal Processing, ASSP-28(4):357–
366, August 1980.
[7] S. Dixon. Onset detection revisited. In Proc. of the 9th International Confer-
ence on Digital Audio Effects (DAFx), pp. 133–137. McGill University, 2006.
[8] C. Duxbury, M. Sandler, and M. Davies. A hybrid approach to musical note
onset detection. In Proc. of the 5th International Conference on Digital Audio
Effects, pp. 33–38, 2002.
[9] F. Eyben, S. Böck, B. Schuller, and A. Graves. Universal onset detection with
bidirectional long short-term memory neural networks. In J. S. Downie and
R. C. Veltkamp, eds., Proc. of the 11th International Society for Music Infor-
mation Retrieval Conference (ISMIR), pp. 589–594. International Society for
Music Information Retrieval, 2010.
[10] J. Foote. Visualizing music and audio using self-similarity. In Proc. ACM
Multimedia, pp. 77–80. ACM, 1999.
[11] H. Grohganz, M. Clausen, N. Jiang, and M. Müller. Converting path structures
into block structures using eigenvalue decompositions of self-similarity matri-
ces. In Proc. of the 14th International Society for Music Information Retrieval
Conference (ISMIR), pp. 209–215. International Society for Music Information
Retrieval, 2013.
[12] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation tech-
niques. Journal of Intelligent Information Systems, 17(2-3):107–145, Decem-
ber 2001.
[13] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal
of Acoustical Society of America, 87(4):1738–1752, April 1990.
[14] K. Jensen. Multiple scale music segmentation using rhythm, timbre, and har-
mony. EURASIP Journal on Advances in Signal Processing, 2007. doi:
10.1155/2007/73205.
[15] F. Kaiser and G. Peeters. A simple fusion method of state and sequence seg-
mentation for music structure discovery. In Proc. of the 14th International
Society for Music Information Retrieval Conference (ISMIR), pp. 257–262. In-
ternational Society for Music Information Retrieval, 2013.
[16] A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab - An S4 package
for kernel methods in R. Journal of Statistical Software, 11(9):1–20, 2004.
[17] S. Krey, U. Ligges, and F. Leisch. Music and timbre segmentation by recur-
sive constrained K-means clustering. Computational Statistics, 29(1–2):37–50,
2014.
[18] J. London. Building a representative corpus of classical music. Music Percep-
tion, 31(1):68–90, 2013.
[19] R. Meddis. Auditory-nerve first-spike latency and auditory absolute threshold:
A computer model. Journal of the Acoustical Society of America, 119(1):406–
417, 2006.
[20] M. Müller. Fundamentals of Music Processing - Audio, Analysis, Algorithms,
Applications. Springer International Publishing, 2015.
[21] F. Opolko and J. Wapnick. McGill University master samples (CDs), 1987.
[22] J. Paulus, M. Müller, and A. Klapuri. Audio-based music structure analysis.
In Proc. of the 11th International Society on Music Information Retrieval Con-
ference (ISMIR), pp. 625–636. International Society for Music Information Re-
trieval, 2010.
[23] J. Pauwels, F. Kaiser, and G. Peeters. Combining harmony-based and novelty-
based approaches for structural segmentation. In Proc. of the 14th International
Society for Music Information Retrieval Conference (ISMIR), pp. 601–606. In-
ternational Society for Music Information Retrieval, 2013.
[24] R. Polfreman. Comparing onset detection and perceptual attack time. In Proc.
of the 14th International Society for Music Information Retrieval Conference
(ISMIR), pp. 523–528. International Society for Music Information Retrieval,
2013.
[25] R Core Team. R: A Language and Environment for Statistical Computing. R
Foundation for Statistical Computing, Vienna, Austria, 2014.
[26] C. Rhodes and M. Casey. Algorithms for determining and labelling approxi-
mate hierarchical self-similarity. In Proc. of the 8th International Conference
on Music Information Retrieval (ISMIR), pp. 41–46. Austrian Computer Soci-
ety, 2007.
Chapter 17
Transcription
17.1 Introduction
In this chapter we describe methods for automatic transcription based on audio fea-
tures. Transcription is transforming audio signals into sheet music, and it is in some
sense the opposite of playing music from sheet music. The statistical core of tran-
scription is classification of notes into classes of pitch (e.g. c, d, ...) and lengths (e.g.
dotted eighth note, quarter note, ...). A typical transcription algorithm includes at least
some of the following steps:
1. Separation of the relevant part of music to be transcribed (e.g. human voice) from
other sounds (e.g. piano accompaniment)
2. Estimation of fundamental frequencies
3. Classification of notes, silence and noise
4. Estimation of the relative length of notes and meter
5. Estimation of the key
6. Final transcription into sheet music
Note that step 1 is related to a pre-processing of the time series of the original mu-
sical audio signal. In step 2, time series modeling is used to estimate fundamental
frequencies (cp. also Sections 4.8 and 6.4) which are to be classified into notes in
step 3. In steps 4 and 5 these notes are fitted into meter and key. Finally, sheet music
is produced in step 6.
Section 17.5 below will be organized along this list of steps and will present
more details. In Sections 17.2, 17.3, and 17.4 we will comment on the analyzed
audio data and describe the musical and statistical challenges of the transcription
task. Transcription software is discussed in Section 17.6. For more information on
transcription methods see, e.g., [20, 53].
17.2 Data
Most existing transcription systems have been developed for the transcription of MIDI
data (see Section 7.2), where both onset times and pitch are already exactly encoded in
the data, or for instruments such as piano and other plucked string or percussion
instruments.
The transcription of MIDI data is not very difficult, because information related
to pitch as well as the beginning and end of tones is already explicitly available within
the data in digital form. Therefore, this information does not have to be estimated from the
sound signal for MIDI data. Transcription of WAVE data (see Section 7.3.2) or other
types of audio data is harder. For WAVE data, transcribing plucked and struck in-
struments (piano, guitar, etc.) is still simpler than, e.g., the transcription of melodies
sung by a highly flexible human voice. Moreover, some properties of the data may
have to be differently interpreted for different instruments. For example, sudden in-
creases of the signal’s amplitude may indicate new tones for some instruments like
piano, but this may not be the case for other types of instruments like flute, violin, or
the human voice.
For the following part of this chapter, the sound that has to be transcribed is given
in the form of a WAVE file, typically in CD quality with a sampling rate of 44,100 Hz and in
16-bit format (i.e. $2^{16}$ possible values).
Example 17.1 (Transcription). For this chapter, as an example, we use the German
Christmas song “Tochter Zion” (G.F. Händel) performed by a professional soprano
singer. The singer is recorded in one channel and the piano accompaniment in the
other channel of the stereo WAVE file.
[Figure 17.1: spectrogram of the example recording, frequency [Hz] (0–2000) over time [s] (0–60).]
Hertz. Models and detection methods for vibrato have been described, for example,
by [39] and [31].
Example 17.2 (Transcription cont.). The strong vibrato of the professional soprano
singer (see Example 17.1) is shown by the nervously changing line of fundamental
frequencies (the lower dark curve) in the spectrum given in Figure 17.1.
A third challenge is the presence of noise in the signal. Noise might be caused
by the environment of the music, but also by other instruments in a polyphonic per-
formance if only one (say) instrument is of interest (predominant instrument recog-
nition). For a more detailed discussion, see Section 17.5.2.
kind of tone generation (e.g. for a violin) within the same tone might lead to change
points as well, which might prevent correct identification of, e.g., onsets by means of
change points.
Most algorithms used in transcription apply Short Time Fourier Transformation
(STFT), i.e. calculate periodograms of very small pieces (e.g. 23–46 ms, see Section
6.4) corresponding to windows (mostly overlapping by 50%) of the time series in
order to detect the change points and estimate fundamental frequencies.
Figure 17.2: Two channels of the wave before unmixing via ICA.
using a sine wave around the “average audible” frequencies and their partials. The
aim is to model well-known physical characteristics of the sound in order to estimate
f0 independently of other relevant factors that might influence estimation. Proposed
methods for estimating the model include non-linear optimization of an error criterion such
as the Mean Squared Error (MSE) between the real signal and the signal generated
from the model after a transformation of the signals to the frequency domain.
The fundamental frequencies can, however, be estimated much faster than by
the above modeling when using a heuristic approach as proposed in, e.g., [52]. In
this approach several thresholds are applied to values of the periodogram $I_x[f_\mu] = |F_x[f_\mu]|^2$
(cp. Definition 9.47), derived from the complex DFT with coefficients $F_x[f_\mu]$
for Fourier frequencies $f_\mu$ on a window of size $T$ from the original musical time
series $x[t]$, in order to identify the peak representing the fundamental frequency. This
is done using the following steps:
1. Restrict the frequencies f to a sensible region R defined by:
Figure 17.3: Two channels of the wave after unmixing via ICA.
Possible values are threshold_noise = 0.1 (ignore noise), lowerbound = 80 Hz, upperbound = 5000 Hz
(sensible frequency region), threshold_overtone < 2 (keep below overtone 1),
$l_2 = 1.3$, $u_2 = 1.7$ (search for overtone 2).
Unfortunately, just choosing the relevant peak is not sufficiently accurate given
the resolution of the Fourier frequencies. Therefore, we have to estimate the funda-
mental frequency $f_0$ more precisely, e.g. by weighting the frequencies $f^*$ and $f^{**}$ of
the two strongest Fourier frequencies' values $I_x[f^*]$ (strongest, see Figure 17.4) and
$I_x[f^{**}]$ (second strongest) of that peak:
$$ \hat{f}_0 := f^* + \frac{f^{**} - f^*}{2} \cdot \sqrt{\frac{I_x[f^{**}]}{I_x[f^*]}}. \qquad (17.1) $$
An alternative estimator, due to Quinn [37], is
$$ \hat{f}_{0,\mathrm{Quinn}} := (\mu^* + \delta)\, f_1, \qquad (17.2) $$
[Plot: normalized periodogram around the relevant peak, with bins $\mu^*-1 = 45$, $\mu^* = 46$, $\mu^*+1 = 47$ and frequencies $f^* = 990.5$ Hz, $f^{**} = 1012.1$ Hz.]
Figure 17.4: Part around the relevant peak in a periodogram showing which frequen-
cies are used for Equations (17.1) and (17.2).
where $f_1$ is the first Fourier frequency. Note that $f_\mu = \mu \cdot f_1$. Hence Quinn proposes
to shift away from $f_{\mu^*}$ by $\delta$ Fourier frequencies with $|\delta| < 1$.
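To illustrate the heuristic peak picking and the interpolation of Equation (17.1), the following base-R sketch restricts the periodogram of one window to a plausible frequency region, picks the strongest bin, and refines the estimate with its strongest neighbor; the function name and the default bounds (taken from the example values above) are ours, and the overtone checks of [52] are omitted.

# Base-R sketch: fundamental frequency estimation on one window x[t] via
# the periodogram and the interpolation of Equation (17.1).
estimate_f0 <- function(x, sr = 44100, lower = 80, upper = 5000) {
  T_len <- length(x)
  spec  <- fft(x)[1:(T_len %/% 2)]              # complex DFT coefficients F_x[f_mu]
  I     <- Mod(spec)^2                          # periodogram I_x[f_mu] = |F_x[f_mu]|^2
  freqs <- (0:(T_len %/% 2 - 1)) * sr / T_len   # Fourier frequencies f_mu
  region  <- which(freqs >= lower & freqs <= upper)   # sensible frequency region
  mu_star <- region[which.max(I[region])]       # strongest bin of the peak
  # second strongest value of that peak: the larger of the two neighbours
  neighbour <- if (I[mu_star + 1] >= I[mu_star - 1]) mu_star + 1 else mu_star - 1
  f_star  <- freqs[mu_star]
  f_star2 <- freqs[neighbour]
  # Equation (17.1): shift f* towards f** according to the energy ratio
  f_star + (f_star2 - f_star) / 2 * sqrt(I[neighbour] / I[mu_star])
}

# Example: a 440 Hz sinusoid in a window of T = 2048 samples
t <- 1:2048
estimate_f0(sin(2 * pi * 440 * t / 44100))   # close to 440 Hz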
Example 17.4 (Simulation of Frequency Estimation Methods). Following [3] we
generated time series $x_f(t) = \sin(2\pi f \cdot t / 44100 + \phi) + \varepsilon_t$, $t = 1, \ldots, T$, where we used
frequencies $f \in \{80, 81, \ldots, 1000\}$ Hz, while the noise variance $\sigma^2$ was varied from
0 to 1, and the phase $\phi$ was selected randomly from $[0, 2\pi]$ for the resulting sinusoids.
Every signal was sampled T = 2048 (as a typical size of a window) times. Figure
17.5 shows the error distributions for the two estimators from Equations (17.1) and
(17.2). It is clearly visible that the simple interpolation after Equation (17.1) results
in the worst accuracy. It exhibits a much larger variance than the method of Quinn.
Moreover, the main mass of the error distribution of the estimator in Equation (17.1) is bimodal around zero.
Example 17.5 (Peak Picking). In some cases it turns out that finding the right peak
representing the fundamental frequency is difficult. In such cases the estimation al-
gorithms fail to estimate the correct fundamental frequency if the overtone sequence
is not taken into account. An example of a periodogram showing a series of extremely
strong overtones compared to the strength of the fundamental frequency is given in
Figure 17.6. Here we see that the strongest overtone is the sixth one and 20 overtones
are visible. The underlying signal was produced by a professional bass singer. The
method based on Equation (17.1) estimates the fundamental frequency of the very
first relevant peak of the tone, namely 141.35 Hz.
[Figure 17.5: density plots of the estimation errors (in Hz) for the two estimators from Equations (17.1) and (17.2).]
[Figure 17.6: normalized periodogram over frequency [Hz] for a tone with extremely strong overtones.]
[Figure: ideal vs. estimated note classification over time (e' to f''), with an energy/silence track below.]
17.4 the minimum frequency difference (realized for the lowest tone of 80 Hz) that
corresponds to a difference of 50 cents (half a semitone) is 2.38 Hz. Figure 17.5 shows
that both estimators produce deviations mainly lower than this threshold.
As alternative methods, in Section 17.4 we already mentioned the SLEX [30]
procedure and a segmentation algorithm for speech in [1]. Also, the segmentation of
[Figure: true vs. estimated notes (f' to c''') over bars 1–8, with an energy/silence track below.]
sound or notes has been discussed in Chapter 16. Segmentation of sound related to
transcription has also been examined by [40].
has been described by [10] and was later extended by [9] in order to take care of
dynamic changes of the tempo over time. Alternatively, [8] proposes some Monte
Carlo methods for tempo tracking and [55] uses Bayesian models of temporal struc-
tures. Reference [11] tries to adapt the quantization to dynamic tempo changes. The
“perceptual smoothness” of tempo in expressively performed music is analyzed by
[14]. For more general findings on extracting tempo and other semantic features from
audio data with signal processing techniques, see Chapter 5.
After a successful quantization, the meter has to be estimated. This is a rather
difficult task, because even humans cannot always distinguish between, for example,
2/4, 4/4, and 4/8 meters. Most of the time it is thus assumed that the meter is externally
given by the user of the algorithm. A detailed discussion of tempo and meter
(metrical level) estimation is given in Chapter 20 and in Section 20.2.3 in particular.
A rough distinction between 4/4 and 3/4 meters was proposed, e.g., in [51] by means of
the number of quarters between so-called accentuation events.
17.6 Software
The freely available R [38] package tuneR [26] is a framework for statistical analysis
and transcription of music time series which provides many tools (e.g. for reading
WAVE files, estimating fundamental frequencies, etc.) in the form of R functions.
Therefore, it is highly flexible, extendable and allows experimenting and playing
around with various methods and algorithms for the different steps of the transcrip-
tion procedure. A drawback is that knowledge of the statistical programming lan-
guage R is required, because it does not provide transcription on a single key press
nor any graphical user interface – as opposed to commercial products.
A free and powerful software for music notation is LilyPond [29], which uses
LaTeX [24], the well-known enhancement of TeX [23]. Besides sheet music, LilyPond
is also capable of generating MIDI files. Therefore it is possible to examine the
results of transcription both visually and acoustically. The R package tuneR contains
a function which implements an interface from the statistical programming language
to LilyPond.
Finally, we discuss commercial software products. We have reviewed more than
50 software products, of which only 7 provide the basic capabilities of transcrip-
tion we ask for, which means taking a WAVE file and converting it to some MIDI-
or sheet-music-like representation. Those we found are AKoff Music
Composer,1 AmazingMIDI,2 AudioScore,3 Intelliscore,4 Melodyne,5 Tartini,6 and
the WIDI Recognition System.7
From our point of view, the well known Melodyne is currently the best commer-
cial transcription software we tried out for the singer’s transcription. It performs all
the steps required by a full featured transcription software, including key and tempo
estimation. Its recognition performance is quite good (see Example 17.9), even with
default settings on sound that has been produced by human voices. Some parameters
can be tuned in order to improve recognition performance.
Example 17.9 (Transcription cont.). The outcome of the example that has been con-
tinued throughout this chapter is the transcription of 8 bars of “Tochter Zion” given in
Figure 17.10 for tuneR. Figure 17.11 shows the transcription for Melodyne. For com-
parison, the original notes of that part of “Tochter Zion” are shown in Figure 17.9.
For both Melodyne and tuneR we have optimized the quantization by specifying
the number of bars and the speed. The quality of the final transcriptions is com-
parable. The software tuneR produces more “nervous” results. At some places,
additional notes have been inserted where the singer slides smoothly from one note
to another. The first note is estimated one octave too high due to an immensely strong
second partial in the almost complete absence of any other partials. Melodyne omits some notes.
Here we guess that Melodyne smooths the results too much and detects smooth tran-
sitions of the singer even if the singer intended to sing a separate note.
4 http://www.intelliscore.net/. Accessed 18 December 2015.
5 http://www.celemony.com/. Accessed 18 December 2015.
6 http://miracle.otago.ac.nz/tartini/. Accessed 18 December 2015.
7 http://www.widisoft.com/. Accessed 18 December 2015.
classification into notes, fitting into meter and key, and sheet music production. This
chapter gave an overview of some methods for all these steps. Note that in most
steps noise and uncertainties are involved and we have to make rather strong
assumptions in order to get results, which are still much worse than the original sheet
music.
lection for spectrum estimation. In the MAMI project (Musical Audio-Mining, see
[25]), software for the fundamental frequency estimation has been developed.
Reference [32] proposes “Algorithms for Nonnegative Independent Component
Analysis” (N-ICA) in order to extract features of polyphonic sound, but applies it
only to sound generated by MIDI instruments. Moreover, [33] suggests optimization
using Fourier expansion for N-ICA and expresses his hope to extend the method to
perform well for regular ICA. In another work, [34] proposes to use dictionaries of
sounds, i.e., databases that contain many tones of different instruments played in
different pitches. Using such dictionaries might overcome the problem that different
tones containing a lot of partials may not be identifiable for polyphonic problems.
Under some circumstances the frequency of partials is slightly shifted from the
expected value. This is a problem for the polyphonic case, if a partial’s frequency
cannot be assigned to a corresponding fundamental frequency. Hence this phe-
nomenon has to be modeled as done in some recent work by [17].
Reference [35] modeled phenomena like pink noise (noise decreasing with fre-
quency; also known as 1/ f noise) using wavelet techniques in order to get a more
appropriate model and hence better estimates. Later on, [36] also modeled other spe-
cial kinds of unwanted noise or the sound of consonants that do not sound with a
well-defined fundamental frequency. A more general article about wavelet analysis
of music time series can be found in [15].
A general overview of music transcription methods can be found, e.g., in [20].
Bibliography
[1] S. Adak. Time-dependent spectral analysis of nonstationary time series. Jour-
nal of the American Statistical Association, 93:1488–1501, 1998.
[2] J. Beran. Statistics in Musicology. Chapman & Hall/CRC, Boca Raton, 2004.
[3] B. Bischl, U. Ligges, and C. Weihs. Frequency estimation by
DFT interpolation: A comparison of methods. Technical Report
06/09, SFB 475, Department of Statistics, TU Dortmund, Germany,
2009. http://www.statistik.tu-dortmund.de/fileadmin/user_
upload/Lehrstuehle/MSind/SFB_475/2009/tr06-09.pdf.
[4] P. Bloomfield. Fourier Analyis of Time Series: An Introduction. John Wiley
and Sons, 2nd edition, 2000.
[5] K. Brandenburg and H. Popp. An introduction to MPEG Layer 3. EBU Tech-
nical Review, 2000.
[6] D. Brillinger. Time Series: Data Analysis and Theory. Holt, Rinehart & Win-
ston Inc., NY, 1975.
[7] H. Brown, D. Butler, and M. Jones. Musical and temporal influences on key
discovery. Music Perception, 11(4):371–407, 1994.
[8] A. Cemgil and B. Kappen. Monte Carlo methods for tempo tracking and rhythm
quantization. Journal of Artificial Intelligence Research, 18:45–81, 2003.
lecting and annotating vocal queries for music information retrieval. In Pro-
ceedings of the International Conference on Music Information Retrieval, 2003.
[26] U. Ligges. Transkription monophoner Gesangszeitreihen. Dissertation, Fach-
bereich Statistik, Universität Dortmund, Dortmund, Germany, 2006.
[27] J. Meyer. Akustik und musikalische Aufführungspraxis. Bochinsky, Frankfurt
am Main, 1995.
[28] D. Müllensiefen and K. Frieler. Optimizing measures of melodic similarity for
the exploration of a large folk song database. In 5th International Conference
on Music Information Retrieval, Barcelona, Spain, 2004.
[29] H.-W. Nienhuys, J. Nieuwenhuizen, et al. GNU LilyPond: The Music Typeset-
ter. Free Software Foundation, 2005. Version 2.6.5.
[30] H. Ombao, J. Raz, R. von Sachs, and B. Malow. Automatic statistical analysis
of bivariate nonstationary time series. JASA, 96(454):543–560, 2001.
[31] H. Pang and D. Yoon. Automatic detection of vibrato in monophonic music.
Pattern Recognition, 38(7):1135–1138, 2005.
[32] M. Plumbley. Algorithms for nonnegative independent component analysis.
IEEE Transactions on Neural Networks, 14(3):534–543, 2003.
[33] M. Plumbley. Optimization using Fourier expansion over a geodesic for non-
negative ICA. In Proceedings of the International Conference on Indepen-
dent Component Analysis and Blind Signal Separation (ICA 2004), pp. 49–56,
Granada, Spain, 2004.
[34] M. Plumbley, S. Abdallah, T. Blumensath, M. Jafari, A. Nesbit, E. Vincent,
and B. Wang. Musical audio analysis using sparse representations. In COMP-
STAT 2006 — Proceedings in Computational Statistics, pp. 104–117, Heidel-
berg, 2006. Physica Verlag.
[35] P. Polotti and G. Evangelista. Harmonic-band wavelet coefficient modeling for
pseudo-periodic sound processing. In Proceedings of the COST G-6 Conference
on Digital Audio Effects (DAFX-00), Verona, Italy, December 7–9 2000.
[36] P. Polotti and G. Evangelista. Multiresolution sinusoidal/stochastic model for
voiced-sounds. In Proceedings of the COST G-6 Conference on Digital Audio
Effects (DAFX-01), Limerick, Ireland, December 6–8 2001.
[37] B. G. Quinn. Estimating frequency by interpolation using Fourier coefficients.
IEEE Transactions on Signal Processing, 42(5):1264–1268, 1994.
[38] R Core Team. R: A language and environment for statistical computing. R
Foundation for Statistical Computing, Vienna, Austria, 2015.
[39] S. Rossignol, P. Depalle, J. Soumagne, X. Rodet, and J.-L. Collette. Vibrato:
Detection, estimation, extraction, modification. In COST-G6 Workshop on Dig-
ital Audio Effects, 1999.
[40] S. Rossignol, X. Rodet, J. Soumagne, J.-L. Collette, and P. Depalle. Automatic
characterisation of musical signals: Feature extraction and temporal segmenta-
tion. Journal of New Music Research, 28(4):281–295, 1999.
Chapter 18
Instrument Recognition
18.1 Introduction
The goal of instrument recognition is the automatic distinction of the sounds of mu-
sical instruments playing in a given piece of music. Under most circumstances it
is a difficult task since different musical instruments have different compositions of
partial tones (cp. Definition 2.3), e.g., in the sound of a clarinet only odd partials
occur. This composition of partials is, however, also dependent on other factors like
the pitch, the played instrument, the room acoustics, and the performer [14]. Ad-
ditionally, there are temporal changes within one tone like vibrato. Also, different
non-harmonic properties, e.g. noise in the attack phase of a tone (cp. Section 2.4.7),
are typical for many families of instruments. For a plucked string, e.g., the attack
is the very short period between the initial contact of the plectrum or finger and the
scraping of the string. For a hammered string, the attack is the period between the
initial contact and the rebounding of the hammer (or mallet). Both the plectrum
and the hammer produce typical noise. Hence, expert knowledge for distinguishing
the instruments is very specific and complex, and instead of expert rules, supervised
classification (see Chapter 12) is usually applied.
The typical processing flow of instrument recognition is illustrated in Figure 18.1.
It starts with an appropriate data set of labeled observations, labels corresponding
to instruments or families of instruments. Depending on the concrete application,
the kind of data can be very different. For example, observations can be derived
from single tones, but also from complete pieces of music. There are at least four
dimensions which define the complexity of a specific instrument recognition task.
These dimensions are described in detail in the next section.
The next processing step is the taxonomy applied to the data. The obvious one
is a flat taxonomy, where each observation is directly assigned to an instrument la-
bel. However, due to the different degree of similarity between different pairs of
instruments a hierarchical taxonomy makes also sense, which will be discussed in
Section 18.3.
much data and so in many studies the considered music pieces are only based on,
say, three different representatives of each instrument class at most.
Aspect 5: Databases There exist three databases commonly used for (monophonic)
single tone classification: the McGill University Master Samples (MUMS) database
[13], the University of Iowa Musical Instrument Samples [8] and the Real World
Computing (RWC) Database [9]. However, the various other types of instrument
recognition discussed in this section lack clear reference data sets. Hence, it
is often difficult to compare the results of different studies, and the evaluation of accuracy
results should take this point into account.
Figure 18.3: Hierarchical taxonomy.
Figure 18.4: Hornbostel–Sachs system (extract).
not all instruments fit well into one (and only one) group. For example, the pi-
ano could be classified as both a string and a percussion instrument [14]. Another popular taxonomy
is the Hornbostel–Sachs system which considers the sound production source of
the instruments [17]. It consists of over 300 categories, ordered on several levels.
On the first level, instruments are classified into five main categories: idiophones,
membranophones, chordophones, aerophones, and electrophones. Idiophones are
instruments where the instrument body is the sound source itself without requir-
ing stretched membranes or strings. This includes all percussion instruments except
drums. Membranophones are all instruments where the sound is produced by tightly
stretched membranes which includes most types of drums. Chordophones are all
instruments where one or more strings are stretched between fixed points, which in-
cludes all string instruments and piano. The sound of aerophones is produced by
vibrating air like in most brass and woodwind instruments. Electrophones are all in-
struments where electricity is involved in sound production, such as synthesizers or
theremins. A small extract of the Hornbostel–Sachs system is shown in Figure 18.4.
In [6] an automatic taxonomy is built by agglomerative hierarchical clustering
putting classes together automatically with respect to an appropriate closeness crite-
rion. The authors argue that the Euclidean distance is not appropriate and instead,
two probabilistic distance measures are tested. Their classification results, using a
“support vector machine” (SVM) with a Gaussian kernel, yield a slight superiority
of the hierarchical approaches over the flat approach (64% vs. 61% accuracy). On the
other hand, following [5] there is no evidence for the superiority of hierarchical clas-
sification of single instruments in comparison to flat classification. However, in both
studies, pizzicati and sustained instruments are quite well distinguished, whereas the
classification of individual sustained instruments appeared to be much more error-
prone.
to one of the five instruments, resulting in 80 phrases (comprising 850 tones) with 5
different class labels. The accompanying instruments may be a piano or strings and
are not changed. The ISP toolbox1 in MATLAB® is applied to convert the phrases
into WAVE files with a sampling rate of 44,100 Hz.
The features derived from these data will be introduced in Section 18.4.3. The
class variable (label) is an instrument indicator specifying the instrument of the main
voice.
1 http://kom.aau.dk/project/isound/
Table 18.1: Error Rates Using All Instruments and All Features
respectively. There are two natural grouping mechanisms since the features can be
categorized by two dimensions: the channel number and the feature name. The first
approach is to combine the same features over all channels into one group and the
second approach is to combine all features generated within the same channel into
one group. This results in 21 feature groups for the first approach and 40 groups
for the second approach. Channel-based grouping has the additional advantage that
entire channels can be neglected, which reduces not only the computing time for feature
generation but also the computing time for the auditory modeling process.
Some of the classification methods have free parameters. These parameters are
tuned by means of a grid search, i.e. we test all parameter combinations on a grid
and take that combination with the lowest classification error rate. For SVML we
tune the “cost” parameter C, for SVMR C and the kernel width γ, and for SVMP
C, γ, the increment q, and the polynomial degree d (see Section 12.4.4). For k-NN
the parameters k and λ in the Minkowski distance are tuned (see Section 12.4.2).
For “One-vs-All”-classifications and for feature selection, the default parameters are
used due to runtime restrictions.
18.4.5 Evaluation
For error estimation, generally the mean of 10 random repetitions of 10-fold cross-
validation is taken. In tuning, only 3-fold cross-validation is carried out.
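As a generic illustration of this tuning and evaluation scheme (not the exact setup of this study), the following R sketch runs a grid search with 3-fold cross-validation over the parameter k of a k-NN classifier (from the recommended package class) and then estimates the error as the mean of 10 repetitions of 10-fold cross-validation, here on built-in toy data.

# R sketch: parameter tuning by grid search with cross-validation, and
# final error estimation by repeated 10-fold cross-validation.
library(class)   # provides knn()

cv_error <- function(X, y, k_nn, folds = 10) {
  fold_id <- sample(rep(1:folds, length.out = nrow(X)))
  errs <- sapply(1:folds, function(f) {
    test <- fold_id == f
    pred <- knn(X[!test, ], X[test, ], y[!test], k = k_nn)
    mean(pred != y[test])                       # misclassification rate per fold
  })
  mean(errs)
}

set.seed(123)
X <- as.matrix(iris[, 1:4]); y <- iris$Species  # toy data, not audio features
grid      <- c(1, 3, 5, 7, 9, 15)               # candidate values of k
cv_errors <- sapply(grid, function(k) cv_error(X, y, k_nn = k, folds = 3))
best_k    <- grid[which.min(cv_errors)]         # lowest tuning error wins
# final estimate: mean of 10 repetitions of 10-fold cross-validation
mean(replicate(10, cv_error(X, y, k_nn = best_k, folds = 10)))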
Table 18.2: Error Rates One vs. Rest; Single Class Instrument Specified
Table 18.3: Error Rates for Feature Selection in the Multi-Class Case (All Instru-
ments)
Table 18.4: Error Rates for the 2 Nodes of the Classical Variant of Hierarchical
Classification
with class-dependent costs. For example, it could be argued that it is worse to classify
an observation that is labeled as oboe as violin than to classify it as clarinet.
In [4], intervals and chords played by instruments of four families (strings, wind,
piano, plucked strings) are used to build classification rules for the recognition of the
musical instruments on the basis of the same groups of mid-level features, again by
means of feature selection.
In [1], again the same groups of mid-level features and common statistical classi-
fication algorithms are used to evaluate by statistical tests whether the discriminating
power of certain subsets of feature groups dominates other group subsets. The au-
thors examine if it is possible to directly select a useful set of groups by applying
logistic regression regularized by a group lasso penalty structure. Specifically, the
methods are applied to a data set of single piano and guitar tones.
In [16], multi-objective feature selection is applied on data sets which are based
on intervals and chords. The first objective is the classification error and the second
one is the number of features. The authors argue that a smaller number of features
yield better classification models since the danger of overfitting is reduced. Ad-
ditionally, smaller feature sets also need less storage and computing time. Their
experimental results show decreased error rates by applying feature selection. Fur-
thermore, it is shown that the best set of features might be very diverse for different
kinds of instruments. In [15], this study is extended by comparing the results of the
best specific feature sets for concrete instruments to a generic feature set, which is
the best compromise for classifying several instruments. By applying their experi-
ments to four different classification tasks, they conclude that it is possible to get a
generic feature set which is almost as good as the specific ones.
In [2], solo instruments accompanied by a keyboard or an orchestra are distin-
guished. Instead of classification on a note-by-note level, they classify on entire
sound files. The authors argue that most of the features used for monophonic instru-
ment recognition do not work well in the context of predominant instrument recog-
nition and only use features based on partials. First, they estimate the most dominant
fundamental frequencies for all frames. Afterwards, 6 features are generated on each
of the lowest 15 partials, which yields 90 features altogether. One drawback of this
approach is that it depends strongly on the goodness of predominant F0 estimation,
a problem which itself is not yet solved. Using a Gaussian classifier they get an
accuracy of 86% for 5 instruments.
In [14], several strategies for multi-label classification of polyphonic music are
explained and compared. Additionally, specific characteristics for multi-label feature
selection are discussed. In [7], hierarchical classification is applied to multi-label
classification. This means, e.g., first classify the dominant instrument, then the next
one, etc.
Bibliography
[1] B. Bischl, M. Eichhoff, and C. Weihs. Selecting groups of audio features by sta-
tistical tests and the group lasso. In 9. ITG Fachtagung Sprachkommunikation,
Berlin, Offenbach, 2010. VDE Verlag.
[2] J. Eggink and G. Brown. Instrument recognition in accompanied sonatas and
concertos. In Proceedings of the IEEE International Conference on Acoustics,
Chapter 19
Chord Recognition
GEOFFROY PEETERS
Sound Analysis and Synthesis Team, IRCAM, France
JOHAN PAUWELS
School of Electronic Engineering and Computer Science, Queen Mary University of
London, England
19.1 Introduction
Chords are abstract representations of a set of musical pitches (notes) played (almost)
simultaneously (see also Section 3.5.4). Chords, along with the main melody, are
often predominant characteristics of a music track. Well-known examples of chord
reductions are the “chord sheets” where the background harmony of a music track is
reduced to a succession of symbols over time (C major, C7, . . .) to be played on a
guitar or a piano.
In this chapter, we describe how we can automatically estimate such chord suc-
cessions from the analysis of the audio signal. The general scheme of a chord recog-
nition system is represented in Figure 19.1. It is made of the following blocks that
will be described in the next sections:
1. A block that defines a set of chords that will form the dictionary over which the
music will be projected (see Section 19.2),
2. A block that extracts meaningful observations from the audio signal: chroma or
Pitch Class Profile (PCP) features extracted at each time frame (see Section 19.3),
3. A block that creates a representation (knowledge-driven see Section 19.4.1) or a
model (data-driven see Section 19.4.2) of the chords that will be used to map the
chords to the audio observation,
4. The mapping of the extracted audio observations to the models that represent the
various chords. This can be achieved on a frame basis (see Section 19.5) but leads
to a strongly fragmented chord sequence. We show that simple temporal smooth-
ing methods can improve the recognition. In Section 19.6 we show how chord
[Figure 19.1 (block diagram): Audio, Chroma/PCP Extraction (block 2), Frame-based Mapping or Hidden Markov Model (block 4), Estimated Chords; the mapping uses a knowledge-driven or data-driven Chord Representation (block 3) built over the Chord Dictionary (block 1).]
Figure 19.1: General scheme of an automatic system for chord recognition from
audio.
Table 19.1: Dictionary for the Root Notes and Three Possible Dictionaries for the
Type of Chords
their octave positions, it is not possible to distinguish whether the chord has been
inverted or not. For this last reason, chords are often estimated jointly with the local
key. The root of a chord is then expressed as a specific degree in a specific key. The
choice of C-M6 will be favored in a C-Major key while A-m7 will be favored in an
A-minor key.
When estimating chords from the audio signal, we will also rely on enharmonic
equivalence, i.e. we consider the note c# to be equivalent to d[, and also consider the
chord F#-M to be equivalent to G[-M.
1 It should be noted that the two-component theory of pitch was originally proposed by Hornbostel
[33].
possible octaves of the a: a0, a1, a2, a3 . . . . The representation is computed at each
time frame $\lambda$. In the following we denote by $t_{\mathrm{chroma}}[p, \lambda]$ the set of chroma vectors
over time, also known as a chromagram.
Unlike multiple-pitch estimation, chroma/PCP is a mapping and not an estima-
tion. Therefore it is not prone to errors.
[Figure: the twelve pitch classes C to B repeated over successive octaves, marked at the frequencies 131 Hz, 261 Hz, 523 Hz, and 1046 Hz.]
Use of the CQT for Chroma/PCP Extraction When applied to music analysis, the
frequencies $f_\mu$ are chosen to correspond to the pitches of the musical scale: $f_\mu = f_{\min}\, 2^{\mu/(12b)}$,
where $\mu \in \mathbb{N}$ and $b$ is the number of frequency bins per semitone (if $b = 1$,
we obtain the semitone pitch-scale, if $b = 2$, the quarter-tone pitch-scale). In
this case, $Q = \frac{1}{2^{1/(12b)} - 1}$. In practice, $Q$ is chosen to separate the lowest considered
pitches. For example, in order to be able to separate c3 from c#3, $Q$ is chosen such
that $Q = \frac{f_\mu}{f_{\mu+1} - f_\mu} = \frac{130.8}{138.6 - 130.8} = 16.7$. If the frequencies $f_\mu$ of the Constant-Q are
chosen to be exactly the frequencies of the pitches (if $b = 1$), the computation of the
chroma/PCP is straightforward since it just consists of adding the values for which
$\mathrm{rem}(m_\mu, 12) + 1 = p$:
$$ t_{\mathrm{chroma}}[\lambda, p] = \sum_{o=1}^{O} X_{\mathrm{cqt}}[\lambda, p + 12o] \qquad (19.4) $$
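As a small illustration of Equation (19.4), the following base-R sketch folds a constant-Q magnitude matrix (one row per time frame $\lambda$, one column per semitone pitch, $b = 1$, with the first column taken to correspond to chroma class $p = 1$) into a chromagram by summing over all octaves present; the variable names and the random input are ours.

# Base-R sketch of Equation (19.4): fold a constant-Q representation
# X_cqt (frames x semitone pitches, b = 1) into 12 chroma values per frame.
fold_to_chroma <- function(X_cqt) {
  n_pitch      <- ncol(X_cqt)
  chroma_class <- ((seq_len(n_pitch) - 1) %% 12) + 1   # rem(m_mu, 12) + 1 = p
  t(apply(X_cqt, 1, function(frame) tapply(frame, chroma_class, sum)))
}

# Example: 4 frames, 5 octaves (60 semitone bins) of toy magnitudes
set.seed(7)
X_cqt    <- matrix(abs(rnorm(4 * 60)), nrow = 4)
t_chroma <- fold_to_chroma(X_cqt)
dim(t_chroma)   # 4 x 12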
Figure 19.3: On each figure, the top part represents the spectrogram $X_{\mathrm{stft}}[\lambda, \mu]$, the
bottom the corresponding chromagram $t_{\mathrm{chroma}}[\lambda, p]$. The influence of the higher har-
monics of the notes is clearly visible in the form of extra values in the chromagram.
Table 19.2: Harmonic Series of the Pitch c3, Corresponding Frequencies f µ , Con-
version into MIDI Scale mµ , Conversion to Chroma/PCP p
trum [15] or using a pitch salience spectrum that takes the energy of higher har-
monics into consideration too [28].
3. Keep the chroma/PCP vector as it is and consider the existence of the higher
harmonics in the chords representation (see Section 19.4).
It is also possible to combine the first approach with one of the two others.
Figure 19.4: Chord templates $T_c[p]$ corresponding to the chords C-M, C-m and C-
dim.
Figure 19.5: Example of frame-based chord recognition on the track “I Saw Her
Standing There” by The Beatles.
Euclidean distance
$$ d(x, y) = \sqrt{\sum_{p=1}^{12} (x_p - y_p)^2}, \qquad (19.5) $$
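To sketch the frame-based mapping, the following base-R code compares each chroma frame to binary chord templates $T_c[p]$ (here only the 24 major and minor triads) using the Euclidean distance of Equation (19.5) and picks the closest template; the template construction, the normalization, and the names are our own simplification of such a system.

# Base-R sketch of frame-based chord recognition: nearest binary chord
# template (24 major/minor triads) per chroma frame, Euclidean distance.
pitch_classes <- c("C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B")

chord_templates <- function() {
  make_template <- function(root, intervals) {
    v <- numeric(12); v[((root - 1 + intervals) %% 12) + 1] <- 1; v
  }
  temps <- rbind(
    t(sapply(1:12, make_template, intervals = c(0, 4, 7))),   # major triads
    t(sapply(1:12, make_template, intervals = c(0, 3, 7))))   # minor triads
  rownames(temps) <- c(paste0(pitch_classes, "-M"), paste0(pitch_classes, "-m"))
  temps
}

recognize_frames <- function(t_chroma) {          # t_chroma: frames x 12
  temps  <- chord_templates()
  t_norm <- t_chroma / pmax(rowSums(t_chroma), 1e-12)   # loudness-invariant frames
  apply(t_norm, 1, function(x) {
    d <- sqrt(colSums((t(temps) / 3 - x)^2))      # Equation (19.5) per template
    rownames(temps)[which.min(d)]
  })
}

# Example: one frame dominated by the chromas c, e and g
frame <- c(1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)
recognize_frames(rbind(frame))   # "C-M"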
[Figure: hidden Markov model over chord states (C-M, C#-M, D-M, ...), with chord emission probabilities $P(t_{\mathrm{chroma}}[p, \lambda] \mid c_i)$ and chord transition probabilities $P_{\mathrm{trans}}(c_j \mid c_i)$.]
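Decoding the most likely chord sequence from such a model is typically done with the Viterbi algorithm (see the HMM tutorial [26]). The following base-R sketch is a generic log-domain Viterbi decoder over chord states; it assumes a (frames x chords) matrix of emission probabilities and a (chords x chords) transition matrix are already available, and it is a textbook implementation rather than the specific system discussed in this chapter.

# Base-R sketch: log-domain Viterbi decoding for an HMM over chord states.
# emission:   frames x chords matrix of P(chroma frame | chord)
# transition: chords x chords matrix, row i = P(next chord | current chord i)
viterbi_chords <- function(emission, transition, initial = NULL) {
  n_frames <- nrow(emission); n_chords <- ncol(emission)
  if (is.null(initial)) initial <- rep(1 / n_chords, n_chords)
  log_e <- log(emission); log_t <- log(transition)
  delta <- matrix(-Inf, n_frames, n_chords)   # best log score ending in chord j
  psi   <- matrix(1L, n_frames, n_chords)     # back-pointers
  delta[1, ] <- log(initial) + log_e[1, ]
  if (n_frames > 1) for (lambda in 2:n_frames) {
    for (j in 1:n_chords) {
      scores <- delta[lambda - 1, ] + log_t[, j]
      psi[lambda, j]   <- which.max(scores)
      delta[lambda, j] <- max(scores) + log_e[lambda, j]
    }
  }
  path <- integer(n_frames)                   # backtracking
  path[n_frames] <- which.max(delta[n_frames, ])
  if (n_frames > 1) for (lambda in (n_frames - 1):1)
    path[lambda] <- psi[lambda + 1, path[lambda + 1]]
  path
}

# Tiny example with 2 chord states and 3 frames
em <- rbind(c(0.9, 0.1), c(0.6, 0.4), c(0.1, 0.9))
tr <- matrix(c(0.8, 0.2, 0.2, 0.8), nrow = 2, byrow = TRUE)
viterbi_chords(em, tr)   # 1 1 2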
(a) Doubly nested circle of fifths for major and minor chords (neighbors a perfect fifth, i.e. 7 semitones, apart).
(b) Resulting transition matrix between all 24 major and minor chords.
Figure 19.7: Deriving a transition matrix from a theoretic model of chord distance.
Figure 19.8: Example of HMM-based chord recognition on the track “I Saw Her
Standing There” by The Beatles.
2 It should be noted that, in this approach, chords that do not belong to the scale of the key cannot be
determined.
of musical context, ranging from frame-based approaches [4], over lattice rescoring
of HMM output [11], to a full search over all key and chord trigram sequences [23].
[Figure 19.9]
(a) Example chord sequences.
(b) Corresponding list of evaluation segments:
1 (Bdim, Dmin, d1)
2 (Dmin, Dmin, d2)
3 (Dmin, Bmin, d3)
4 (G7, Bmin, d4)
5 (Cmaj, Bmin, d5)
6 (Cmaj, Cmaj, d6)
example, it makes sense to drop segments annotated with diminished chords from
the evaluation when the algorithm can only output major and minor chords, such as
segment 1 in Figure 19.9. Another possible use case is to limit the evaluation only to
certain categories of labels, for instance, to compare the performance on triads with
the one on tetrads (segment 4 versus segments 2,3,4,6). In all cases, this decision of
inclusion should be based on the reference label only.
Harmonic Content Correspondence Finally, the retained label pairs are compared
to each other and a score is assigned to the evaluation segment, which then gets
weighted by the segment’s duration. All segment scores of the whole data set are
summed together and divided by the total duration of retained segments to arrive at
the final result. The pairwise score itself can be calculated according to a number of
methods.
Obviously, when both labels are the same (taking into account enharmonic vari-
ants and the transformation to the previously defined chord and key vocabularies),
the score is 1. The remaining question is if, and how, a difference is made between
erroneous estimations that are close and those that are completely off.
The case of related keys is reasonably well defined. In the Music Information
Retrieval Evaluation Exchange (MIREX)3 audio key detection contest, a part of the
score is assigned to keys that are a perfect fifth away from, relative or parallel to the
reference key. For C major, these are F and G major (perfect fifth away), A minor
(relative minor), and C minor (parallel minor). They get 0.5, 0.3, and 0.2 points,
respectively. The best score obtained in MIREX-2014 was 0.8683.
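Written out as code, this scoring rule is a small function; the helper below is our own and assumes a key is encoded as a tonic pitch class (0 to 11) plus a major/minor flag.

# Base-R sketch of the MIREX-style key score described above;
# a key is list(tonic = 0..11, major = TRUE/FALSE), e.g. C major = (0, TRUE).
key_score <- function(est, ref) {
  same_tonic <- est$tonic == ref$tonic
  same_mode  <- est$major == ref$major
  fifth      <- same_mode && ((est$tonic - ref$tonic) %% 12) %in% c(5, 7)
  relative   <- !same_mode &&
    ((ref$major  && est$tonic == (ref$tonic + 9) %% 12) ||  # C major -> A minor
     (!ref$major && est$tonic == (ref$tonic + 3) %% 12))    # A minor -> C major
  parallel   <- !same_mode && same_tonic                    # C major -> C minor
  if (same_tonic && same_mode) 1 else if (fifth) 0.5 else
    if (relative) 0.3 else if (parallel) 0.2 else 0
}

# Reference C major, estimate G major: a perfect fifth away, score 0.5
key_score(list(tonic = 7, major = TRUE), list(tonic = 0, major = TRUE))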
For chords, there is less consensus about how to account for related chords, be-
cause the chord dictionary is typically more complex, so many times, almost-correct
estimations do not contribute at all. One option is to consider chords as sets of chro-
mas and to take the precision and the recall of these sets (see Section 13.3.3). This is
useful to detect over- or under-estimation of the chord cardinality when mixing triads
and tetrads, but cannot measure if the root has been estimated correctly. Therefore it
can be complemented by a score that only looks at a match between the roots.
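As a small illustration, the sketch below treats chords as sets of pitch classes and computes the precision and recall of the estimated set against the reference set, plus a root match; the encoding (root plus set of chromas) is our own, not a prescribed format.

# Base-R sketch: compare an estimated chord to a reference chord, both
# encoded as a root pitch class (0..11) and a set of chroma numbers.
chord_scores <- function(est, ref) {
  common <- length(intersect(est$chromas, ref$chromas))
  c(precision  = common / length(est$chromas),
    recall     = common / length(ref$chromas),
    root_match = as.numeric(est$root == ref$root))
}

# Example: estimated C major triad {0,4,7} against reference C7 {0,4,7,10}
est <- list(root = 0, chromas = c(0, 4, 7))
ref <- list(root = 0, chromas = c(0, 4, 7, 10))
chord_scores(est, ref)   # precision 1, recall 0.75, root_match 1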
Just like for designing a recognition algorithm, the evaluation procedure requires
a dictionary on which the music will be projected, and the transformation rules to
achieve this. This is used to bring the reference and tested sequence into the same
3 http://www.music-ir.org/mirex/wiki/MIREX_HOME. Accessed 22 June 2016.
space. So far we have assumed that this evaluation dictionary is the same as the
algorithmic dictionary, which is the easiest and most recommended option, but this
is not always possible. A notable example is when we want to compare multiple
algorithms with different vocabularies to the same ground truth, as is done for the
MIREX audio chord estimation task.
To handle these differences, extra rules need to be formulated about which seg-
ments should be included in the evaluation and which evaluation dictionary should
be used. To this end, a framework for the rigorous definition of evaluation mea-
sures is explained in [24]. The accompanying software, as used in MIREX too, is
freely available online.4 Naturally, it can also be used for the evaluation of a single
algorithm on its own.
When multiple algorithms are compared to each other, it is also important to
know to what degree their differences are statistically significant. The method used
for MIREX is described in [2], and is also freely available as an R package.5
4 https://github.com/jpauwels/MusOOEvaluator
5 https://bitbucket.org/jaburgoyne/mirexace
Bibliography
[1] J. P. Bello and J. Pickens. A robust mid-level representation for harmonic con-
tent in music signals. In Proceedings of the 6th International Conference on
Music Information Retrieval (ISMIR), pp. 304–311, 2005.
[2] J. A. Burgoyne, W. B. de Haas, and J. Pauwels. On comparative statistics for
labelling tasks: What can we learn from Mirex Ace 2013? In Proceedings of the
15th Conference of the International Society for Music Information Retrieval
(ISMIR), pp. 525–530, 2014.
[3] J. A. Burgoyne, L. Pugin, C. Kereliuk, and I. Fujinaga. A cross-validated study
of modelling strategies for automatic chord recognition in audio. In Proceed-
ings of the 8th International Conference on Music Information Retrieval (IS-
MIR), pp. 251–254, 2007.
[4] H.-T. Cheng, Y.-H. Yang, Y.-C. Lin, I.-B. Liao, and H. H. Chen. Automatic
chord recognition for music classification and retrieval. In Proceedings of the
IEEE International Conference on Multimedia and Expo (ICME), pp. 1505–
1508. IEEE Press, 2008.
[5] D. P. Ellis and A. V. Weller. The 2010 Labrosa chord recognition system. In
MIREX 2010 Extended Abstract (14th Conference of the International Society
for Music Information Retrieval), 2010.
[6] T. Fujishima. Realtime chord recognition of musical sound: A system using
Common Lisp music. In Proceedings of the International Computer Music
Conference (ICMC), pp. 464–467. International Computer Music Association,
1999.
[7] E. Gómez. Tonal description of polyphonic audio for music content processing.
INFORMS Journal on Computing, Special Cluster on Computation in Music,
18(3):294–304, 2006.
[8] C. Harte, M. Sandler, and M. Gasser. Detecting harmonic change in musical au-
dio. In Proceedings of the 1st ACM Workshop on Audio and Music Computing
Multimedia, pp. 21–26, New York, NY, USA, 2006. ACM.
[9] E. J. Humphrey and J. P. Bello. Rethinking automatic chord recognition with
Convolutional Neural Networks. In Proceedings of the IEEE International
Conference on Machine Learning and Applications (ICMLA), pp. 357–362.
IEEE Press, 2012.
[10] O. Izmirli. Template based key finding from audio. In Proceedings of the
International Computer Music Conference (ICMC), pp. 211–214. International
Computer Music Association, 2005.
[11] M. Khadkevich and M. Omologo. Use of hidden Markov models and factored
language models for automatic chord recognition. In Proceedings of the 10th
International Conference on Music Information Retrieval (ISMIR), pp. 561–
566, 2009.
[12] C. L. Krumhansl and E. J. Kessler. Tracing the dynamic changes in perceived
tonal organization in a spatial representation of musical keys. Psychological
Review, 89(4):334–368, July 1982.
[13] K. Lee and M. Slaney. Acoustic chord transcription and key extraction from
audio using key-dependent HMMs trained on synthesized audio. IEEE Trans-
actions on Audio, Speech and Language Processing, 16(2):291–301, February
2008.
[14] M. Mauch and S. Dixon. Simultaneous estimation of chords and musical con-
text from audio. IEEE Transactions on Audio, Speech and Language Process-
ing, 18(6):1280–1289, August 2010.
[15] J. Morman and L. Rabiner. A system for the automatic segmentation and classi-
fication of chord sequences. In Proceedings of the 1st ACM Workshop on Audio
and Music Computing Multimedia, pp. 1–10. ACM, 27 October 2006.
[16] S. H. Nawab, S. Abu Ayyash, and R. Wotiz. Identification of musical chords
using constant-Q spectra. In Proceedings of the IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), volume 5, pp. 3373–
3376. IEEE Press, 2001.
[17] L. Oudre, Y. Grenier, and C. Févotte. Chord recognition using measures of fit,
chord templates and filtering methods. In Proceedings of the IEEE Workshop
on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE
Press, 2009.
[18] H. Papadopoulos. Joint Estimation of Musical Content Information. PhD thesis,
Université Paris VI, 2010.
[19] H. Papadopoulos and G. Peeters. Large-scale study of chord estimation al-
gorithms based on chroma representation. In Proceedings of the IEEE Fifth
International Workshop on Content-Based Multimedia Indexing (CBMI). IEEE
Press, 2007.
[20] H. Papadopoulos and G. Peeters. Joint estimation of chords and downbeats
from an audio signal. IEEE Transactions on Audio, Speech and Language Pro-
cessing, 19(1):138–152, January 2011.
[21] J. Pauwels, F. Kaiser, and G. Peeters. Combining harmony-based and novelty-
based approaches for structural segmentation. In Proceedings of the 14th Con-
ference of the International Society for Music Information Retrieval (ISMIR),
pp. 138–143, 2013.
[22] J. Pauwels and J.-P. Martens. Combining musicological knowledge about
chords and keys in a simultaneous chord and local key estimation system. Jour-
nal of New Music Research, 43(3):318–330, 2014.
[23] J. Pauwels, J.-P. Martens, and M. Leman. Modeling musicological information
as trigrams in a system for simultaneous chord and local key extraction. In Pro-
ceedings of the IEEE International Workshop on Machine Learning for Signal
Processing (MLSP). IEEE Press, 2011.
[24] J. Pauwels and G. Peeters. Evaluating automatically estimated chord sequences.
In Proceedings of the IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE Press, 2013.
[25] G. Peeters. Musical key estimation of audio signal based on hidden Markov
modelling of chroma vectors. In Proceedings of the International Conference
on Digital Audio Effects (DAFx), pp. 127–131. McGill University Montreal,
September 18–20 2006.
[26] L. Rabiner. A tutorial on hidden Markov model and selected applications in
speech. Proceedings of the IEEE, 77(2):257–285, 1989.
[27] T. Rocher, M. Robine, P. Hanna, and M. Desainte-Catherine. A survey of chord
distances with comparison for chord analysis. In Proceedings of the Interna-
tional Computer Music Conference (ICMC), pp. 187–190. International Com-
puter Music Association, 2010.
[28] M. P. Ryynänen and A. P. Klapuri. Automatic transcription of melody, bass
line, and chords in polyphonic music. Computer Music Journal, 32(3):72–86,
Fall 2008.
[29] R. Scholz, E. Vincent, and F. Bimbot. Robust modeling of musical chord se-
quences using probabilistic n-grams. In Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 53–56.
IEEE Press, 2009.
[30] R. N. Shepard. Circularity in judgements of relative pitch. Journal of the
Acoustical Society of America, 36:2346–2353, 1964.
[31] K. Sumi, K. Itoyama, K. Yoshii, K. Komatani, T. Ogata, and H. G. Okuno. Au-
tomatic chord recognition based on probabilistic integration of chord transition
and bass pitch estimation. In Proceedings of the 9th International Conference
on Music Information Retrieval (ISMIR), pp. 39–44, 2008.
[32] D. Temperley. What’s key for key? The Krumhansl-Schmuckler key-finding
algorithm reconsidered. Music Perception, 17(1):65–100, 1999.
[33] E. M. von Hornbostel. Psychologie der Gehörerscheinungen. In A. Bethe et al.,
eds., Handbuch der normalen und pathologischen Physiologie, chapter 24, pp.
701–730. J.F. Bergmann-Verlag, 1926.
[34] G. H. Wakefield. Mathematical representation of joint time-chroma distribu-
tions. In Proc. of SPIE conference on Advanced Signal Processing Algorithms,
Architectures and Implementations, volume 3807, pp. 637–645. SPIE, 1999.
[35] A. Weller, D. Ellis, and T. Jebara. Structured prediction models for chord tran-
scription of music audio. In Proceedings of the IEEE International Conference
on Machine Learning and Applications (ICMLA), pp. 590–595. IEEE Press,
2009.
[36] Y. Zhu and M. S. Kankanhalli. Precise pitch profile feature extraction from
musical audio for key detection. IEEE Transactions on Multimedia, 8(3):575–
584, June 2006.
Chapter 20
Tempo Estimation
JOSÉ R. ZAPATA
Department of TIC, Universidad Pontificia Bolivariana, Colombia
20.1 Introduction
Rhythm, along with harmony, melody and timbre, is one of the most fundamental
aspects of music; sound by its very nature is temporal. Rhythm in its most generic
sense is used to refer to all of the temporal aspects of music, whether it is represented
in a score, measured from a performance, or existing only in the perception of the
listener. In order to build a computer system capable of intelligently processing
music, it is essential to design representation formats and processing algorithms for
the estimation of the rhythmic content of music [24]. Tempo estimation (also referred to as tempo induction or tempo detection) is the computational task of estimating the rate of the perceived musical pulses, normally referred to as the foot-tapping rate.
Content analysis of musical audio signals has received increasing attention from
the research community, specifically in the field of music information retrieval (MIR)
[42]. MIR aims to retrieve musical pieces by processing not only text information,
such as artist name, song title or music genre, but also by processing musical con-
tent directly in order to retrieve a piece based on its rhythm or melody [54]. Since
the earliest automatic audio rhythm estimation systems proposed in [21, 50, 13] in
the mid to late 1990s, there has been a steady growth in the variety of approaches
developed and the applications to which these automatic systems have been applied.
Automatic rhythm estimation has become a standard tool for solving other MIR problems by enabling “beat-synchronous” analysis of music, e.g. structural segmentation [38], chord detection [40], music similarity [31], cover song detection [47], automatic remixing [28], and interactive music systems [48].
While many different tempo estimation and beat tracking techniques have been proposed, e.g. [18, 9, 43, 15, 5, 11, 56] for beat tracking and [20, 19, 44] for tempo estimation, recent comparative studies of rhythm estimation systems [57] suggest that there has been little improvement in the state of the art in recent years [41], and the method by Klapuri [33] is still widely considered to represent the state of the art for both tasks.
Current approaches for tempo estimation focus on the analysis of mainstream popular music with a clear and stable rhythm and percussion instruments, which facilitates the task. These approaches mainly consider the periodicity of intensity descriptors (principally onset detection functions) to locate the beats and then to estimate the tempo. Nevertheless, they usually fail when analyzing other music genres such as classical music, because this type of music often exhibits tempo variations or does not include clear percussive and repetitive events. The same problem appears with a cappella or choral music, acoustic music, different jazz styles, and pop music [24].
The goal of this chapter is to provide basic knowledge about automatic music
tempo estimation. The remainder of the chapter is structured as follows: Section 20.2
gives an overview of the principal definitions and relations of the rhythm elements.
The system steps to estimate the tempo are presented in Section 20.3, emphasizing
the computation of the onset detection functions for tempo estimation. Section 20.4
presents the evaluation methods for tempo estimation approaches. Then, in Section
20.5, a simple implementation of a tempo estimation system is provided, and finally
in Section 20.6, some applications of automatic rhythm estimation are described.
20.2 Definitions
Musical rhythm refers to the temporal aspects of a musical work and to the pulse (a regular sequence of events); its components (beat, tempo, meter, timing, and grouping) are presented in Figure 20.1. For the sake of understanding the computational approaches of automatic rhythm estimation in Western music, we assume that the musical pulse which can be felt by a human listener is related to the beat. This chapter focuses on estimating the time regularity of musical beats in audio signals (tempo estimation), expressed as the number of beats per minute of a song.
20.2.1 Beat
A beat is any of the events or accents in the music; it is characterized by [27, p. 391] as what listeners typically entrain to as they tap their foot or dance along with a piece of music.
Beat perception is an active area of research in music cognition, in which
there has long been an interest in the cues listeners use to extract a beat. Refer-
ence [53] lists six factors that most researchers agree are important in beat finding
(i.e. in inferring the beat from a piece of music). These factors can be expressed as
preferences:
1. for beats to coincide with note onsets,
2. for beats to coincide with longer notes,
3. for regularity of beats,
4. for beats to align with the beginning of musical phrases,
20.2.2 Tempo
Tempo is defined as the number of beats in a time unit (usually the minute). There
is usually a preferred regularity, which corresponds to the rate at which most people
would tap or clap in time with the music. However, the perception of tempo exhibits
a degree of variability. Differences in human perception of tempo depend on age,
musical training, music preferences and the general listening context [34]. They are,
nevertheless, far from random and most often correspond to a focus on a different
metrical level and are quantifiable as simple ratios (e.g. 2, 3, 1/2 or 1/3 ) [45]. The
computation of tempo is based on the periodicities extracted from the music signal
and the time differences between the beats. Therefore, the absolute beat positions
are not necessarily required for estimating the tempo. The principal computational problems in tempo estimation are related to tempo variations within the same musical piece, the correct choice of the metrical level (see Section 20.2.3), and, above all, the fact that tempo is a perceptual value. In this chapter, automatic tempo estimation refers to the detection of the number of beats per minute (BPM).
Figure 20.2: Metrical structure for the first seconds of the song Kuku-cha Ku-cha by
“Charanga 76”. (a) Audio signal. (b) Note onset locations. (c) Lowest metrical level:
the Tatum. (d) Tactus (Beat locations). (e) Bar boundaries (Measure).
Most automatic tempo estimation systems share the same functional aspects (feature list creation and tempo induction; see Figure 20.3), as pointed out by [24].
Following the recent results in [51, ch. 4] and [56], which show that rhythm estimation benefits more from the choice of the input features (feature list creation) than from the tempo induction models, this chapter emphasizes the understanding of onset detection functions for tempo estimation.
The performance of the energy flux ODF is higher if the music audio signal contains percussive instruments or clear note onsets, but it drops for other kinds of music.
• Spectral Flux. This onset detection function, proposed in [39] and presented in Definition 5.10 and Equation (20.2), describes the temporal evolution of the magnitude spectrogram, which is calculated by computing the short time Fourier transform (STFT) X[λ, µ] over successive frames. From these frames, each spectral flux sample SFX(λ) is calculated as the sum of the positive differences in magnitude between each frequency bin of the current STFT frame and its predecessor (a code sketch of an ODF-based tempo estimate is given after this list):
SFX(λ) = ∑_{µ=0}^{M/2} H(|X[λ, µ]| − |X[λ − 1, µ]|),   (20.2)

where H(x) = (x + |x|)/2 denotes half-wave rectification. The logarithmically filtered magnitude spectrogram used below is given by

X_λ^{logfilt}(µ) = log(γ · (|X_λ(µ)| · F(λ, µ)) + 1),   (20.3)
where the frequencies are aligned according to the frequencies of the semitones of
the Western music scale over the frequency range from 27.5 Hz to 16 kHz, using
a fixed window length for the STFT. The resulting filter bank, F(λ , µ), has B = 82
frequency bins with λ denoting the bin number of the filter and µ the bin number
of the linear spectrogram. The filters have not been normalized, resulting in an
emphasis of the higher frequencies, similar to the high frequency content (HFC)
method presented in Definition 5.15. From these frames, in Equation (20.4) each
input feature sample is calculated as the sum of the positive differences in log-
arithmic magnitude (using γ as a compression parameter, e.g. γ = 20) between
each frequency bin of the current STFT frame and its predecessor:
SFLF(λ) = ∑_{µ=1}^{B=82} H(X_λ^{logfilt}(µ) − X_{λ−1}^{logfilt}(µ)).   (20.4)
• Beat Emphasis Function [10]. This ODF emphasizes the periodic structure in
musical excerpts with a steady tempo. The beat emphasis function is defined
as a weighted combination of the sub-band complex spectral difference func-
tions, Equation (20.5), which emphasize the periodic structure of the signal by
• Harmonic Feature [26]. This is a method for harmonic change detection and is
calculated in Equation (20.7) by computing a short time Fourier transform. HF
uses a modified Kullback–Leibler distance measure, see Equation (14.9), to detect
spectral changes between frequency ranges of consecutive frames X[λ , µ]. The
modified measure is thus tailored to accentuate positive energy change.
HF(λ) = ∑_{µ=1}^{B} log₂(|X[λ, µ]| / |X[λ − 1, µ]|).   (20.7)
• Mel Auditory Feature [18]. This feature is calculated from a short time Fourier
transform magnitude spectrogram and is based on the MFCC (Section 5.2.3). In
Equation (20.8) each frame is then converted to an approximate “auditory” repre-
sentation in 40 bands on the mel frequency scale and converted to dB, Xmel (µ).
Then the first-order difference in time is taken and the result is half-wave rec-
tified. The result is summed across frequency bands before some smoothing is
performed to create the final feature.
MAF(λ) = ∑_{µ=1}^{B=40} H(|X_mel[λ, µ]| − |X_mel[λ − 1, µ]|).   (20.8)
The auditory frequency scale is used to balance the periodicities in each perceptual
frequency band.
• Phase Slope Function [30]. This feature is based on the group delay, which is
used to determine instants of significant excitation in audio signals and is com-
puted as the derivative of phase over frequency τ(λ ) (presented before in Section
5.16), as can be seen in Equation (20.9). Reference [30] uses this concept as an
onset detection function: using a large overlap, an analysis window is shifted over
the signal and for each window position the average group delay is computed.
The obtained sequence of average group delays is referred to as the phase slope
function (PSF). To avoid the problems of unwrapping, the phase spectrum of the
signal for the computation of group delay can be computed as

τ(λ) = Re{X*(λ) · Y(λ)} / |X(λ)|²,   (20.9)
where X(λ ) and Y (λ ) are the Fourier transforms of x[k] and kx[k], respectively.
The phase slope function is then computed as the negative of the average of the
group delay function. The performance of the phase slope function is higher in
musical signals with simple rhythmic structure and little or no percussive content.
• Bandwise Accent Signals [33]. This feature estimates the degree of musical accent
as a function of time at four different frequency ranges. This ODF is calculated
from a short time Fourier transform and used to calculate power envelopes at 36
sub-bands on a critical-band scale. Each sub-band is up-sampled by a factor of
two, smoothed using a low-pass filter with a 10-Hz cutoff frequency, and half-
wave rectified. A weighted average of each band and its first-order differential is taken, Eµ(λ). In [33] each group of 9 adjacent bands (i.e. bands 1–9, 10–18, 19–27 and 28–36) is summed up to create a four-channel input feature.
BAS(λ) = ∑_{µ=1}^{36} E_µ(λ).   (20.10)
• Comb Filterbank. The comb filterbank uses a bank of resonator filters, each tuned
to a possible periodicity, where the output of the resonator indicates the strength
of that particular periodicity. This method also “implicitly encodes aspects of the
rhythmic hierarchy” [49, p. 594]. More information about filter banks can be seen
in Section 4.6. Examples of systems that use comb filterbanks are [33, 49].
• Time Interval Histogram. The information of the onset events can be extracted
from the onset detection function and the time interval between the onsets is used
to calculate a histogram, whose maximum peak gives us the tempo value. To
estimate the tempo, the time difference between the onsets is more important than
the specific time position of each one. This method is called the IOI (inter-onset
interval) histogram; an example of a system that uses this method is [14].
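As a rough illustration of the feature list creation and periodicity analysis described above, the following MATLAB sketch computes a spectral flux ODF in the spirit of Equation (20.2) and derives a single tempo value from the strongest peak of its autocorrelation. The file name, the STFT parameters, and the 30–240 BPM search range are assumptions, and the sketch omits refinements such as adaptive thresholding or metrical-level disambiguation.

% Sketch: spectral-flux ODF and a simple autocorrelation-based tempo estimate
[x, fs] = audioread('example.wav');                 % hypothetical input file
x = mean(x, 2);                                     % mix down to mono
nfft = 2048; hopsize = 512;                         % assumed STFT parameters
S = abs(spectrogram(x, hann(nfft), nfft - hopsize, nfft, fs));   % magnitude STFT (bins x frames)
D = diff(S, 1, 2);                                  % frame-to-frame magnitude differences
odf = sum(max(D, 0), 1);                            % half-wave rectification and sum over bins
odf = odf - mean(odf);                              % remove the mean before autocorrelation
[ac, lags] = xcorr(odf, 'coeff');                   % autocorrelation of the onset detection function
ac = ac(lags > 0); lags = lags(lags > 0);           % keep positive lags only
bpm = 60 ./ (lags * hopsize / fs);                  % convert lag (in frames) to BPM
valid = bpm >= 30 & bpm <= 240;                     % plausible tempo range
bpmValid = bpm(valid); acValid = ac(valid);
[~, idx] = max(acValid);
tempoEstimate = bpmValid(idx)                       % most prominent periodicity in BPM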
20.4 Evaluation of Tempo Estimation

Tempo estimation systems are commonly evaluated through the use of annotated test databases; in particular, the MIR community has made a considerable effort to standardize evaluations of MIR systems.
The evaluation databases are made up of songs with the annotated ground truth
consisting of a single tempo value (e.g., from the score). Accordingly, the output of
the tempo estimation systems is the overall BPM value of a piece of music.
The evaluation measures that are mainly used for tempo estimation algorithms are two separate metrics:
• Metric 1: The tempo estimation value is within 4% (the precision window) of the ground-truth tempo. This measure evaluates the accuracy of the algorithm in detecting the overall BPM of the song.
• Metric 2: The tempo estimation value is within 4% (the precision window) of 1, 2, 1/2, 3, or 1/3 times the ground-truth tempo. This measure takes into account problems of double or triple deviation of the tempo estimate.
The algorithm with the best average score of Metric 1 and Metric 2 will achieve the
highest rank.
Complementary to this evaluation method, there is a specific task in the Music Information Retrieval Evaluation eXchange (MIREX) [16] initiative to evaluate the estimation of the perceptual tempo, given by two tempo values, because a piece of music can be perceived as faster or slower than its notated tempo. Perceptual tempo estimation algorithms estimate two tempo values in BPM (T1 and T2, where T1 is the slower of the two tempo values). For a given algorithm, the performance P for each audio excerpt is given by the following equation:
P = ST1 · TT1 + (1 − ST1) · TT2,   (20.12)
where ST1 is the relative perceptual strength of T1 (given by ground truth data, varies
from 0 to 1.0). TT1 is the ability of the algorithm to identify T1 using Metric 1 to
within 8% (the precision window), and TT2 is the ability of the algorithm to identify
T2 using Metric 1 to within 8%. The algorithm with the best average P-score will
achieve the highest rank in the task.
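The following MATLAB sketch illustrates how Metric 1, Metric 2, and the P-score of Equation (20.12) could be computed for a single excerpt; all ground-truth and estimated values are arbitrary placeholders, and the code does not reproduce the official MIREX evaluation scripts.

% Sketch: accuracy metrics with a 4% precision window
gtTempo  = 120;                                  % hypothetical ground-truth tempo (BPM)
estTempo = 61;                                   % hypothetical estimated tempo (BPM)
metric1 = abs(estTempo - gtTempo) / gtTempo <= 0.04;
factors = [1 2 1/2 3 1/3];                       % metrical levels allowed by Metric 2
metric2 = any(abs(estTempo - factors * gtTempo) ./ (factors * gtTempo) <= 0.04);

% Sketch: perceptual tempo P-score, Equation (20.12), with an 8% precision window
T1 = 60; T2 = 120; ST1 = 0.7;                    % hypothetical ground truth (T1 is the slower tempo)
estT1 = 62; estT2 = 118;                         % hypothetical algorithm output
TT1 = abs(estT1 - T1) / T1 <= 0.08;              % ability to identify T1 (Metric 1 with 8% window)
TT2 = abs(estT2 - T2) / T2 <= 0.08;              % ability to identify T2
P = ST1 * TT1 + (1 - ST1) * TT2                  % per-excerpt performance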
Many approaches to tempo estimation have been proposed in the literature, and
some efforts have been devoted to their quantitative comparison. The first public
evaluation of tempo extraction methods was carried out in 2004 by [25] evaluating
the accuracy of 11 methods at the ISMIR audio estimation contest; an updated tempo
evaluation comparison is presented in [57]. In 2005, 2006, and from 2010 to the present, the MIREX initiative has continued the comparative evaluation of tempo estimation systems.
20.5 A Simple Tempo Estimation System

This section illustrates the steps described above with a simple tempo estimation MATLAB® algorithm computing the energy correlation for 8 different frequency bands.
1. Load the audio file and compute the STFT using an FFT of 2048 points and a hop size of 512.
[Figure: energy of the 8 frequency bands over the first two seconds of the signal (x-axis: Time in sec; y-axis: Energy).]
% Band weighting: scale each band's autocorrelation by its weight kk(o)
weight = sum(kk);
kk = kk / weight;               % normalize the weights so that they sum to 1
for o = 1:nb                    % nb = number of frequency bands (here 8)
    corr_matrix(o,:) = corr_matrix(o,:) .* kk(o);
end
4. Sum all the correlation signals, normalize the amplitude, and plot the resulting signal with the horizontal axis in BPM values in a range between 30 BPM and 240 BPM (cf. Figure 20.5, lower part).
sum_corr_matrix = sum(corr_matrix, 1)/nb;     % average of the weighted band autocorrelations
tt = [0:hopsize:(nfr_corr-1)*hopsize]/fs;     % lag times in seconds for each autocorrelation bin
bpm = 60./tt;                                 % convert lag times to BPM (lag 0 maps to Inf)
5. Finally, locate the maximum values as the most prominent tempo candidates; if a ground truth is available, verify that one of the values corresponds to the annotated tempo (a code sketch covering steps 1 and 5 is given after this list).
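Since no code is printed above for steps 1 and 5, the following sketch indicates how they could be implemented; the file name is a placeholder, the peak-picking threshold of half the maximum is an arbitrary assumption, and bpm and sum_corr_matrix are the variables computed in step 4.

% Step 1 (sketch): load the audio file and compute the STFT (FFT of 2048 points, hop size 512)
[x, fs] = audioread('example.wav');              % placeholder file name
x = mean(x, 2);                                  % mono mixdown
nfft = 2048; hopsize = 512;
X = spectrogram(x, hann(nfft), nfft - hopsize, nfft, fs);   % complex STFT (bins x frames), used in steps 2 and 3

% Step 5 (sketch): locate the most prominent peaks of the summed correlation signal
inRange = bpm >= 30 & bpm <= 240;                % restrict to the plotted BPM range
bpmRange = bpm(inRange);
[pks, locs] = findpeaks(sum_corr_matrix(inRange), ...
                        'MinPeakHeight', 0.5 * max(sum_corr_matrix(inRange)));
tempoCandidates = bpmRange(locs)                 % candidate tempo values in BPM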
[Figure 20.5: autocorrelation of the 8 band energy signals plotted over BPM (upper part) and the sum of the bands' correlations with the detected peaks (lower part).]
Bibliography
[1] C. Barabasa, M. Jafari, and M. D. Plumbley. A robust method for S1/S2
heart sounds detection without ECG reference based on music beat tracking.
In 10th International Symposium on Electronics and Telecommunications, pp.
307–310. IEEE, 2012.
[2] J. P. Bello. Towards the Automated Analysis of Simple Polyphonic Music: A
Knowledge-Based Approach. PhD thesis, University of London, London, 2003.
[3] J. Bilmes. Timing Is of the Essence: Perceptual and Computational Techniques
for Representing, Learning, and Reproducing Expressive Timing in Percussive
Rhythm. PhD thesis, Massachusetts Institute of Technology, 1993.
[4] S. Böck, F. Krebs, and M. Schedl. Evaluating the online capabilities of onset
detection methods. In Proceedings of the 13th International Society for Music
Information Retrieval Conference (ISMIR), pp. 49–54, Porto, 2012.
[5] S. Böck and M. Schedl. Enhanced beat tracking with context-aware neural
networks. In Proceedings of the 14th International Conference on Digital Audio
Effects (DAFx-11), pp. 135–139, 2011.
[6] E. Clarke. Rhythm and timing in music. In The Psychology of Music, pp.
473–500. Academic Press, San Diego, 2nd edition, 1999.
[7] N. Collins. Towards a style-specific basis for computational beat tracking. In
Proceedings of the 9th International Conference on Music Perception and Cog-
nition, pp. 461–467, 2006.
[8] M. E. P. Davies. Towards Automatic Rhythmic Accompaniment. PhD thesis,
Queen Mary University of London, 2007.
[9] M. E. P. Davies and M. D. Plumbley. Context-dependent beat tracking of mu-
sical audio. IEEE Transactions on Audio, Speech, and Language Processing,
15(3):1009–1020, 2007.
[10] M. E. P. Davies, M. M. D. Plumbley, and D. Eck. Towards a musical beat
emphasis function. In IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics (WASPAA), pp. 61–64, New Paltz, NY, 2009. IEEE.
[11] N. Degara, E. Rua, A. Pena, S. Torres-Guijarro, M. E. P. Davies, M. D. Plumb-
ley, and E. Argones. Reliability-informed beat tracking of musical signals.
IEEE Transactions on Audio, Speech and Language Processing, 20(1):290–
301, 2012.
[12] P. Desain and H. Honing. Computational models of beat induction: The rule-
based approach. Journal of New Music Research, 28(1):29–42, 1999.
[13] S. Dixon. Beat induction and rhythm recognition. In The Australian Joint
Conference on Artificial Intelligence, pp. 311–320, 1997.
[14] S. Dixon. Automatic extraction of tempo and beat from expressive perfor-
mances. Journal of New Music Research, 30(1):39–58, 2001.
[15] S. Dixon. Evaluation of the audio beat tracking system BeatRoot. Journal of
New Music Research, 36(1):39–50, 2007.
[16] J. S. Downie. The music information retrieval evaluation exchange (2005–
2007): A window into music information retrieval research. Acoustical Science
and Technology, 29(4):247–255, 2008.
[17] C. Duxbury, J. Bello, M. E. P. Davies, and M. D. Sandler. Complex domain
onset detection for musical signals. In Proceedings of the 6th Conference on
Digital Audio Effects (DAFx), volume 1, London, UK, 2003.
[18] D. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60, 2007.
[19] M. Gainza and E. Coyle. Tempo detection using a hybrid multiband approach.
IEEE Transactions on Audio, Speech, and Language Processing, 19(1):57–68,
2011.
[20] A. Gkiokas, V. Katsouros, and G. Carayannis. Tempo induction using filterbank
analysis and tonal features. In Proceedings of the 11th International Society on
Music Information Retrieval Conference (ISMIR), pp. 555–558, 2010.
[21] M. Goto and Y. Muraoka. A beat tracking system for acoustic signals of music.
In the Second ACM Intl. Conf. on Multimedia, pp. 365–372, 1994.
[37] F. Lerdahl and R. Jackendoff. A Generative Theory of Tonal Music. MIT Press,
Cambridge, MA, 1983.
[38] M. Levy and M. B. Sandler. Structural segmentation of musical audio by con-
strained clustering. IEEE Transactions on Audio, Speech and Language Pro-
cessing, 16(2):318–326, 2008.
[39] P. Masri. Computer Modelling of Sound for Transformation and Synthesis of
Musical Signal. PhD thesis, University of Bristol, Bristol, UK, 1996.
[40] M. Mauch, K. Noland, and S. Dixon. Using musical structure to enhance auto-
matic chord transcription. In Proceedings of the 10th International Society for
Music Information Retrieval Conference (ISMIR), pp. 231–236, 2009.
[41] M. F. McKinney, D. Moelants, M. E. P. Davies, and A. Klapuri. Evaluation
of audio beat tracking and music tempo extraction algorithms. Journal of New
Music Research, 36(1):1–16, 2007.
[42] E. Pampalk. Computational Models of Music Similarity and Their Application
to Music Information Retrieval. PhD thesis, Vienna University of Technology,
2006.
[43] G. Peeters. Beat-tracking using a probabilistic framework and linear discrimi-
nant analysis. In Proceedings of the 12th International Conference on Digital
Audio Effect, (DAFx), pp. 313–320, 2009.
[44] G. Peeters. Template-based estimation of tempo: Using unsupervised or su-
pervised learning to create better spectral templates. In Proceedings of the
International Conference on Digital Audio Effect (DAFx), pp. 6–9, 2010.
[45] P. Polotti and D. Rocchesso, eds. Sound to Sense—Sense to Sound: A State of
the Art in Sound and Music Computing. Logos Verlag Berlin GmbH, 2008.
[46] Z. Rafii and B. Pardo. REpeating Pattern Extraction Technique (REPET):
A simple method for music/voice separation. IEEE Transactions on Audio,
Speech, and Language Processing, 21(1):71–82, 2013.
[47] S. Ravuri and D. P. W. Ellis. Cover song detection: From high scores to gen-
eral classification. In IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP), pp. 65–68. IEEE, 2010.
[48] A. Robertson and M. D. Plumbley. B-Keeper: A beat-tracker for live perfor-
mance. In International Conference on New Interfaces for musical expression
(NIME), pp. 234–237, New York, 2007.
[49] E. Scheirer. Tempo and beat analysis of acoustic musical signals. The Journal
of the Acoustical Society of America, 103(1):588–601, 1998.
[50] E. D. Scheirer. Pulse tracking with a pitch tracker. In IEEE Workshop on
Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, 1997.
[51] A. M. Stark. Musicians and Machines: Bridging the Semantic Gap in Live
Performance. PhD thesis, Queen Mary, U. of London, 2011.
[52] A. M. Stark, M. D. Plumbley, and M. E. P. Davies. Real-time beat-synchronous
audio effects. In International Conference on New Interfaces for Musical Ex-
Chapter 21
Emotions
GÜNTHER RÖTTER
Institute for Music and Music Science, TU Dortmund, Germany
IGOR VATOLKIN
Department of Computer Science, TU Dortmund, Germany
21.1 Introduction
Six basic emotions are known: fear, joy, sadness, disgust, anger, and surprise [17].
These basic emotions are objects of emotion theories. Each emotion has three com-
ponents: the personal experience of an emotion, activation (which is a higher activ-
ity of the sympathetic nervous system), and action. When talking about emotions in
music, action means the anticipation of a structure in music. There is a connection
between the emotional expression in language and in music. The melodic contour,
the range, the change rate of tones and the tempo of a spoken sentence in a specific
emotion are very similar to music expressing the same emotional state. Still, music
does not cause real emotions; it rather works like a pointer that elicits stored emo-
tional experiences. Music not only deals with basic emotions, but also with moods
and emotional episodes.
groups from all over the world. This means that emotional behavior is not the result
of a learning process, but is an anthropological constant. In the following, various
emotion theories shall be explained but first it has to be remarked that these theories
mostly deal with basic emotions and not with “aesthetic” ones.
In the following, we will start with an overview of emotional theories and models
(Section 21.2). Section 21.3 briefly introduces the relationship of speech and emo-
tion. Emotions in music are discussed in Section 21.4. Groups of features helpful for
automatic emotion prediction are introduced in Section 21.5. Section 21.5.7 shows
how individual feature relevance can be measured for a categorical and a dimensional
music emotion recognition system. In Section 21.6, the history and targets of auto-
matic music emotion recognition systems are outlined, with a list of databases with
freely available annotations and a discussion of classification and regression methods
applied in recent studies.
21.2 Theories of Emotions and Models
Figure 21.1: Mandler’s emotion theory and its adaption to music.
original and arranged music (variation of condition) and musical elements (intervals, rhythms)2 are evaluated. Overall, numerous investigations were conducted between the late 19th century and the 1970s, of which only a selection can be presented here.
Hevner tried to determine the emotional expressions of musical parameters by
changing pieces of music [24]. She either altered the melody or changed the com-
positions’ mode from major to minor. The musical examples had to be assigned to different adjectives that were circularly arranged according to their relations. In
these adjective circles, bars were drawn with their length depending on how much
the adjective fit the given example. Simple harmonies appeared happy, graceful and
detached, whereas dissonant and more complex melodies appeared more powerful,
exciting and desolate. As expected, major seemed more happy and graceful, minor
instead dignified, desolate and dreamy. A moving rhythm appeared happy and graceful; in contrast, a firm rhythm appeared powerful and dignified. Still, it was not
always possible to make explicit judgments and it was not always simple to reason-
ably change the original composition.
A recent meta-study was conducted by Gabrielsson and Lindström, who used the Hevner adjective circle [22]. They examined every study on musical expression since the 1930s and found that 18 musical parameters had already been investigated with respect to their emotional expression.3 A rising melody, for example, can either
be connected with dignity, seriousness and tension or with fear, surprise, anger, and
power. Falling melodies can on the one hand be regarded as graceful, powerful and
detached or on the other hand be linked to boredom and gusto. Another methodical
problem is that only one musical parameter can be investigated at a time.
De la Motte-Haber states:
Figure 21.2: Circumplex word mapping by Russell after [47].
Figure 21.3: Tellegen and Watson diagram after [56].
engagement, and from low negative affect to high negative affect. On these axes,
several emotions are listed, one below the other.
Table 21.2: Emotional Expression in Speech and Music [50, p. 300]
The confines of the models are reached with the hearers’ relation between un-
derstood musical expression and observed emotions. Processes as induction,
empathy and infection in the course of communication [35, p. 556].
This becomes even more complicated, taking into account that a biographical
5 Kreutz in this context mentions an as yet untested model of six components by Huron. In this model, emotional meaning is analyzed through six different systems (reflexive, denotative, connotative, associative, empathetic, critical). The model ranges from a reflexive, unconscious reaction to a critical examination of the authenticity of emotional expressions.
connection and the current (emotional) situation influence the hearers’ listening ex-
perience. In this context it is necessary to mention two studies by Knobloch et al.,
who found that men listen to sad music only in the case of lovesickness, whereas women also listen to sad music when they are in love [32, 31].
Some emotions can be found in music. Despite their distinct acoustic structure,
rock and popular music both imitate 18th- and 19th-century musical expression be-
cause they convey the same basic emotions (especially the positive ones). Disgust,
prudence, and interest are usually not expressed in music, in contrast to joy, anger, fear, and sadness.
In the following, some examples are given:
Example 21.1 (Joy). (1) “Maniac” by Michael Sembello, from the film Flashdance. Similar to the speaking melody, joy is expressed in “Maniac” by high frequencies and large intervals. (2) “All My Loving” by The Beatles. (3) “Happy” by Pharrell
Williams. (4) “Don’t Worry Be Happy” by Bobby McFerrin. (5) “Lollipop” by
The Chordettes. When expressing joy in music, large intervals and irregularity of
instrumentation are used to convey surprise, which is a part of joy.
Example 21.2 (Fear). (1) “Bring Me to Life” by Evanescence. (2) “Gott” from
Beethoven’s “Fidelio.” (3) “This Is Halloween” by Marilyn Manson. (4) “Psycho
Theme” by Bernard Herrmann. One can find strangeness and intransparency cues as
well as sudden exclamations in fearful music. According to de la Motte-Haber [10],
fear has a dual nature: “stiffening in horror” on the one hand and “simultaneous
desire to run away” on the other.
Example 21.3 (Sadness). (1) “Marche Funèbre” by Frédéric Chopin. (2) “Sonata in G Minor” by Albinoni. (3) “Goodbye My Lover” by James Blunt. (4) “Der Weg”
by Herbert Grönemeyer. In music, sadness can be characterized by small intervals, a
falling melody, a low sound level (small activation), and a slow rhythm. Sometimes,
sad songs are ambivalent because they might include encouraging and soothing ele-
ments.
Example 21.4 (Anger). (1) “Enter Sandman” by Metallica. (2) “Lose Yourself” by
Eminem. (3) “Line and Sinker” by Billy Talent. (4) “Königin der Nacht” by Mozart.
The expression of anger is often highly similar in music and spoken language. Au-
dio frequency, range, variability, and tempo are high, which are indicative of an
increased pulse rate. Due to their related expression of joy and anger, some pieces
might be “emotionally confused.” An example is the first movement of Johann Sebas-
tian Bach’s “Brandenburg Concerto no. 3” (BWV 1048). One can either interpret
the musical expression as joyful or angry (measures 116–130).
Mathematically, there would be over a thousand shadings of emotions if there
were only six grades of intensity of an emotion and four different emotions ex-
pressed. Sometimes, however, fewer basic emotions occur within a song. As a
result, the musical expression of the basic emotions is intensified through language
and thus, easier to understand.
We believe that progress in this area will be difficult as long as researchers re-
main committed to the assumption that real intense emotions must be tradi-
tional basic emotions, such as fear or anger, for which one can identify relatively
straightforward action tendencies such as fight or flight. Progress is more likely
to occur if we are prepared to identify emotion episodes where all of the compo-
nents shown in the table are in fact synchronized without there being a concrete
action tendency or a traditional, readily accessible verbal label. In order to study
these phenomena, we need to free ourselves from the tendency of wanting to as-
sign traditional categorical labels to emotion processes [51, p. 384].
6 Konecni called them “aesthetic mini-episodes” that are embedded in everyday life, cf. [34].
Table 21.3: Design Feature Delimitation of Different Affective States after [51]. VL:
Very Low; L: Low; M: Medium; H: High; VH: Very High
The seven design features (columns of the table) are, following [51]: Intensity, Duration, Synchronization, Event focus, Appraisal elicitation, Rapidity of change, and Behavioural impact. Each affective state below is followed by its values in this column order.

Preferences (evaluative judgements of stimuli in the sense of liking or disliking, or preferring or not over another stimulus; like, dislike, positive, negative): L, M, VL, VH, H, VL, M

Emotions (relatively brief episodes of synchronized response of all or most organismic subsystems in response to the evaluation of an external or internal event as being of major significance; angry, sad, joyful, fearful, ashamed, proud, elated, desperate): H, L, VH, VH, VH, VH, VH

Mood (diffuse affect states, most pronounced as change in subjective feeling, of low intensity but relatively long duration, often without apparent cause; cheerful, gloomy, irritable, listless, depressed, buoyant): M, H, L, L, L, H, H

Interpersonal stances (affective stance taken toward another person in a specific interaction, colouring the interpersonal exchange in that situation; distant, cold, warm, supportive, contemptuous): M, M, L, H, L, VH, H

Attitudes (relatively enduring, affectively coloured beliefs and predispositions towards objects or persons; liking, loving, hating, valuing, desiring): M, H, VL, VL, L, L, L

Personality traits (emotionally laden, stable personality dispositions and behavior tendencies, typical for a person; nervous, anxious, reckless, morose, hostile, envious, jealous): L, VH, VL, VL, VL, VL, L
It is perhaps the ultimate humanistic moment, and it may well include an elitist
element: of feeling privileged to regard Mozart as a brother, of sensing the larger
truth hidden in the pinnacles of human achievement, and yet realizing, with
some resignation, their minuscule role in the universe [33, p. 339].
Due to the lack of appropriate terminology, researchers have mostly avoided the
phenomenon of affection. Additionally, it is possible that some reactions might be
provoked through music that aren’t emotional at all and thus are not definable with
the help of the table. This might be an additional field of study for further research.
The previous explanations have illustrated the complexity of the relation between
music and emotions. Recognition of emotions is one of the most challenging tasks
in music data handling for a computer scientist. Automatic prediction of emotions
from music data is often treated as a classification or a regression problem [71].
Methods from Chapters 5, 8, 12, 14, and 15 may be integrated into a music emotion
recognition (MER) system. The goal of classification-based MER systems is to pre-
dict emotions from categorical (discrete) emotion models and regression-based MER
systems to estimate numerical characteristics from dimensional emotion models (for
examples of models, see Section 21.2). In the following we will discuss groups of
music characteristics which are helpful in recognizing emotions.
21.5 Factors of Influence and Features
In polyphonic music, chords based upon consonant and dissonant intervals have an
unquestioned emotional impact.
The temporal progression of harmonies is also relevant. For a satisfying percep-
tion of enclosed music sequences, dissonant intervals should be resolved. A typical
example is a cadence, consider Figures 3.10, 3.24–3.27. A temporary change to
a secondary dominant or a modulation (cf. Section 3.5.5) may evoke a surprising
stimulus, whereas multiple repetitions of the same cadence may be associated with
boredom or frustration. Properties of harmonic sequences may be characterized by
probabilities of transitions between chords by means of Markov models (cf. Example
9.16) or generalized coincidence function (see Figure 3.8).
The first step in the extraction of harmonic features from the audio signal is the estimation of the semitone spectrum (Definition 2.7) or chroma (Section 5.3.1). Simple
statistics may help to recognize emotions, e.g., a high average pitch tends to correlate
with a higher arousal and a high deviation of pitch with a higher valence [71, p. 51].
Several harmonic characteristics are available in the MIR Toolbox [36]: Harmonic
Change Detection Function (HCDF), key and its clarity, alignment between major
and minor, strengths of major/minor keys, tonal novelty, and tonal centroid. Based
on chroma vector, individual strengths of consonant and dissonant intervals can be
measured [60, p. 31]. Statistics of chord progression include numbers of different
chords, their changes, and most frequent chords [60, p. 32], or longest common
chord subsequence as well as chord histograms [71, p. 191]. For methods to extract
chords from audio, refer to Chapter 19.
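As a minimal illustration of such chord-progression statistics, the following MATLAB sketch computes the share of minor chords, the chord-change rate, and a chord histogram from a hypothetical sequence of chord labels (one label per analysis frame, e.g. obtained with the methods of Chapter 19); the label format is an assumption.

% Sketch: simple statistics over a hypothetical chord label sequence (one label per frame)
chords = {'C','C','Am','Am','F','G','G','C','Am','Em'};            % assumed input labels
isMinor = ~cellfun(@isempty, regexp(chords, 'm$'));                % here, labels ending in 'm' are minor triads
shareMinor = mean(isMinor);                                        % share of minor chords
numChanges = sum(~strcmp(chords(1:end-1), chords(2:end)));         % number of chord changes
changeRate = numChanges / numel(chords);                           % chord changes per frame
[uniqueChords, ~, idx] = unique(chords);                           % chord histogram
counts = accumarray(idx, 1);
[~, most] = max(counts);
mostFrequentChord = uniqueChords{most}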
Harmony-related features are often integrated in MER systems [4]. Including
these features increased the performance of a system for the prediction of induced
emotions in [1]. The share of minor chords was relevant for the recognition of “sad-
ness,” the share of tritones for “tension,” and the share of seconds for “joyful activa-
tion.” In another study [14], the MIR Toolbox features key clarity and tonal novelty
were among the most relevant features for regression-based modeling of anger and
tenderness. HCDF and tonal centroid were among the best features selected by Relief
(cf. Section 15.6.1) for the recognition of emotion clusters in [45].
Example 21.5 (Relevant Harmony and Pitch Features). Figure 21.4 shows distribu-
tions of best individually relevant harmony and pitch features for the recognition of
emotions after Examples 21.1–21.4. A set of audio descriptors was extracted for 17
music pieces with AMUSE [62]. Feature values from extraction frames between esti-
mated onset events were selected, normalized and aggregated for windows of 4 s with
2 s overlap as mean and standard deviation. The most individually relevant features
were identified by means of the Wilcoxon test (see Definition 9.29). For visualization,
probability density was calculated using Gaussian kernel (function KSDENSITY in
MATLAB®). The prediction of anger and sadness seems to be easier than that of fear and joy, but the error is high in all cases when only the distribution of the most relevant feature is taken into account. For the list of best features across all tested
groups (harmony and pitch, timbre, dynamics, tempo and rhythm), see Example 21.9
in Section 21.5.7.
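A sketch of this kind of feature screening is given below; it assumes a matrix of aggregated feature values and binary class labels (one emotion class versus the rest) and uses the Statistics Toolbox functions ranksum (a Wilcoxon rank sum test) and ksdensity in place of the exact implementation used for the example.

% Sketch: rank features by Wilcoxon p-value and plot the density of the best one
X = randn(200, 20);                          % assumed: 200 aggregated windows x 20 features
y = [true(100, 1); false(100, 1)];           % assumed binary labels (emotion class vs. rest)
p = zeros(1, size(X, 2));
for j = 1:size(X, 2)
    p(j) = ranksum(X(y, j), X(~y, j));       % Wilcoxon rank sum test for feature j
end
[~, best] = min(p);                          % most individually relevant feature
[f1, xi1] = ksdensity(X(y, best));           % kernel density estimate, positive class
[f0, xi0] = ksdensity(X(~y, best));          % kernel density estimate, negative class
plot(xi1, f1, xi0, f0);
legend('emotion class', 'rest'); xlabel('feature value'); ylabel('probability density');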
Figure 21.4: Probability densities for the most individually relevant harmony and
pitch features for the recognition of emotions from Examples 21.1–21.4.
21.5.2 Melody
In music with vocals, melody has a large impact on listeners’ attention: a general
high melodiousness may correlate with invoked tenderness and low melodiousness
with tension [1]. As discussed in the previous section, the grade of consonance or
dissonance may be relevant. In contrast to harmonic properties, for the measurement
of consonance in a melody interval strengths should be extracted between succeeding
tones of a melody. Reference [21, p. 132] lists other descriptors: “[...] large intervals
to represent joy, small intervals to represent sadness, ascending motion for pride but
descending motion for humility, or disordered sequences of notes for despair.”
The shape of the melodic contour can be extracted after pitch detection. For in-
stance, [49] distinguishes between initial (I), final (F), lowest (L), and highest (H) pitches
in a melody. Then, contours can be grouped into several categories, e.g., I-L-H-F for
four-stage contours with descending and then ascending movements, or (I,L)-H-F for
three-stage contours whose initial pitch is also the lowest one. Further characteris-
tics comprised statistics over pitch distribution in a melody (pitch deviation, highest
pitch, etc.) and characteristics of vibrato. The integration of melodic audio fea-
tures together with other groups of audio, MIDI, and lyric characteristics contributed
to best achieved results for classification into emotions [45] (vibrato characteristics
belonged to the most relevant melodic features selected by Relief) as well as for
regression predicting arousal and valence [46].
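As an illustration of the contour categories described above, the following MATLAB sketch derives a label from the order of occurrence of the initial, lowest, highest, and final pitch of a hypothetical melody given as a vector of pitch values; merged categories such as (I,L) and all other melodic descriptors are left out.

% Sketch: categorize a melodic contour by the order of initial, lowest, highest, and final pitch
pitch = [60 58 57 62 65 64 63];              % assumed melody (e.g. MIDI note numbers)
[~, iLow]  = min(pitch);                     % position of the lowest pitch
[~, iHigh] = max(pitch);                     % position of the highest pitch
events = [1, iLow, iHigh, numel(pitch)];     % positions of I, L, H, and F
labels = {'I', 'L', 'H', 'F'};
[~, order] = sort(events);                   % order of occurrence (ties such as (I,L) not merged here)
contour = strjoin(labels(order), '-')        % yields 'I-L-H-F' for this descending-then-ascending melody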
A robust extraction of melody characteristics from audio is a challenging task and
its success depends on the extraction of semitone spectrum. Further requirements
are the robust onset detection and source separation for the identification of the main
voice. Approaches to solve those tasks are presented in Sections 16.2 and 11.6.
21.5.4 Dynamics
Indicators of loudness in the score may be helpful for the recognition of arousal or
emotions related to the arousal dimension, as in the Russell model (see Figure 21.2). We
may expect that pieces with annotations pp (pianissimo) and p (piano) may have a
high positive correlation with emotions like calmness, sleepiness, and boredom. In
contrast, ff (fortissimo) and f (forte) may indicate anger, fear, or excitement.
For audio recordings, an often applied estimator of volume is the root mean
square (RMS) of the time signal, Equation (2.48). Also the zero-crossing rate (Defi-
nition 5.1), which roughly measures the noisiness of the signal, can be helpful. The
deviation of zero-crossing rate may correlate with valence [71, p. 42].
For the measurement of perceived loudness the spectrum can be transformed
to scales adapted to subjective perception of volume like a sone scale, cf. Section
2.3.3. The temporal change of volume can be captured by statistics of energy over
longer time windows. Examples of these features are lowenergy (Definition 5.4) and
characteristics of RMS peaks such as overall number of peaks and number of peaks
above half of the maximum peak [60]. Dynamics features contributed to the most
efficient feature sets for the recognition of four emotions (anger, happiness, sadness,
tenderness) in [48].
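The following MATLAB sketch computes some of the dynamics descriptors mentioned above (an RMS envelope, the zero-crossing rate, and the two RMS peak statistics) directly from the time signal; the file name, the 50-ms window length, and the use of findpeaks are assumptions and differ from the original implementation in [60].

% Sketch: simple dynamics descriptors from the time signal
[x, fs] = audioread('example.wav');              % placeholder file name
x = mean(x, 2);                                  % mono mixdown
win = round(0.05 * fs);                          % assumed 50-ms analysis windows
nWin = floor(numel(x) / win);
frames = reshape(x(1:nWin * win), win, nWin);    % one window per column
rmsEnv = sqrt(mean(frames .^ 2, 1));             % RMS envelope (one value per window)
zcr = mean(abs(diff(sign(frames), 1, 1)) > 0, 1);% zero-crossing rate per window
pks = findpeaks(rmsEnv);                         % RMS peaks
numPeaks = numel(pks);                           % overall number of RMS peaks
numHighPeaks = sum(pks > 0.5 * max(pks));        % peaks above half of the maximum peak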
Example 21.7 (Relevant Dynamics Features). Probability densities of the most indi-
vidually relevant dynamics features for Examples 21.1–21.4 are provided in Figure
21.5. The number of energy peaks is higher for joy and lower for sadness as may
be expected from music theory. The deviation of the 2nd sub-band energy ratio is
higher for angry pieces. Fear seems to be particularly hard to predict using the most
relevant dynamics feature, and distributions of the two classes are similar.
Figure 21.5: Probability densities for the most individually relevant dynamics fea-
tures for the recognition of emotions from Examples 21.1–21.4.
evoked emotions. Pieces which evoked joyful activation and amazement were characterized by fast tempo (Spearman’s coefficient rXY = 0.76 and 0.50, respectively); calmness, tenderness, and sadness by slow tempo (rXY = −0.64, −0.48, −0.45). Algorithms
for the extraction of tempo from audio are discussed in Chapter 20. A simple statis-
tic of temporal musical progress is the event density (number of onsets per second),
used for instance in [48]. It is also possible to save the density of beats and tatums.
Certain rhythmical patterns may be representative of a genre and emotions in-
duced by corresponding music pieces; consider two syncopated rhythms typical for
Latin American dances in Figure 3.43. The rhythmic periodicity of audio signal can
be captured by the estimation of fluctuation patterns (Section 5.4.3) which had the
largest regression β weights for the recognition of anger in [14]. Rhythmic charac-
teristics can be calculated from onset curves (for onset detection see Section 16.2).
Reference [39] defines the rhythm strength as the average onset strength, and rhythm
regularity as the strength of peaks after autocorrelation of the onset curve. It is ar-
gued that the strength of a rhythm is higher for emotions with a high arousal and a
low valence compared to emotions with a low arousal and a high valence. Rhythm
regularity may have a high correlation with arousal [71, p. 41]. Both features along
with tempo and event density improved the recognition of three out of four emotion
regions with either low or high arousal and valence. The pulse clarity was the best
individual predictor for arousal and the second best for valence in [11].
Other rhythm features showed, in [53], the largest individual correlation with
arousal and valence, compared to descriptors of spectrum, chords, metadata, and
lyrics. Here, signal energies of candidate tempi between 60 and 180 BPM were
estimated after the application of comb filters. In the next step, the meter of music
pieces was calculated. Rhythm features together with dynamic characteristics were
suggested as the best descriptors to recognize 4 emotions in [48].
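A sketch of the onset-curve descriptors discussed above is given below; onsetCurve stands for an onset detection function sampled at the frame rate frameRate (e.g. one of the ODFs of Chapter 20), here filled with placeholder values, and the peak-picking settings are assumptions.

% Sketch: rhythm strength, rhythm regularity, and event density from an onset detection function
frameRate = 44100 / 512;                                  % assumed ODF frame rate in Hz
onsetCurve = max(0, randn(1, 2000) + 0.2);                % placeholder ODF; replace with a real onset curve
rhythmStrength = mean(onsetCurve);                        % average onset strength [39]
ac = xcorr(onsetCurve - mean(onsetCurve), 'coeff');       % autocorrelation of the onset curve
ac = ac(ceil(numel(ac) / 2) + 1 : end);                   % keep positive lags
rhythmRegularity = mean(findpeaks(ac));                   % strength of the autocorrelation peaks [39]
odfPks = findpeaks(onsetCurve, 'MinPeakHeight', 0.5 * max(onsetCurve));
eventDensity = numel(odfPks) / (numel(onsetCurve) / frameRate);   % rough onsets per second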
Example 21.8 (Relevant Tempo and Rhythm Features). In the last example of fea-
ture comparison, best individually relevant tempo and rhythm features are estimated.
The fluctuation patterns belong to the most relevant characteristics for three of four
classes: the 1st dimension for anger, the 4th for fear, and the 5th for joy. Sad pieces
tend to have a lower rhythmic clarity, which is the most relevant feature for this class.
expert in a music web database or by listeners, may have a strong correlation with
emotions. Asking listeners to rate emotions for classical, jazz, pop/rock, Latin Amer-
ican, and techno pieces revealed significant differences of emotion distributions over
tested genres both for felt and perceived emotions [73]. In another study, the corre-
lation between genres and emotions was measured as significant by χ 2 -statistic for
three large music data sets with 12 genres and 184 emotions [71, p. 199]. Reference
[11] provides interesting insights on the impact of genre for the evaluation of MER
systems. For example, the performance of the regression model (R2) for valence prediction dropped from 0.58 to 0.30 and further to 0.06 when models trained on film music were validated on classical versus popular pieces. The same test
scenario for arousal led to a decrease of R2 from 0.59 to 0.54 and 0.37.
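The χ²-statistic between genre and emotion annotations mentioned above can, for instance, be computed with the MATLAB function crosstab; the genre and emotion labels below are arbitrary placeholders, and far more annotated tracks than in this toy example are required for reliable results.

% Sketch: chi-square test of independence between genre and annotated emotion
genre   = {'pop';'pop';'classical';'jazz';'classical';'pop';'jazz';'techno';'techno';'classical'};
emotion = {'joy';'joy';'sadness';'joy';'tenderness';'anger';'joy';'anger';'joy';'sadness'};
[tbl, chi2, pValue] = crosstab(genre, emotion);   % contingency table, chi-square statistic, p-value
chi2, pValue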
User-generated content in the social web can help to improve the performance of
MER systems. Extension of audio features with last.FM listener tags sorted by their
frequency improved the performance of mood clustering into four regions with high
or low arousal/valence as well as into four MIREX8 mood clusters [7]. To avoid an
excessive number of tag dimensions, they can be mapped to sentiment words after
the estimation of co-occurrences. Also, reduction techniques like clustering can be
applied [37]; see Chapter 11. Another possibility to mine social data is to measure co-
occurrences of artists and songs in the history of listening behaviour or user-crafted
playlists from web radios [61].
brightness. Note that the number of analyzed music pieces was rather low, so that
outcomes of this example may be of less general significance.
Table 21.4: Comparison of Audio Features for the Recognition of Four Emotions.
Number after Feature Name: Dimension; (m): Mean; (s): Standard Deviation. Table
Entries Are p-Values from the Wilcoxon Test. Bold Values in Brackets Outline Top
1st, 2nd, and 3rd Features for the Corresponding Classification Task
Example 21.10 (Feature Relevance for a Dimensional Approach). For the evalua-
tion of individual feature relevance in a dimensional approach, we randomly sam-
pled 100 music pieces from the database 1000 Songs (further details of this and other
databases are provided in Section 21.6 and Table 21.6). The same features were ex-
tracted as in Example 21.9 and aggregated for complete songs leading to 100 labeled
data instances.
Table 21.5 lists the five most relevant features for each task w.r.t. R2 for linear
regression. Both energy-related characteristics of RMS peaks can at best individu-
ally explain arousal, but the number of RMS peaks also has the highest individual
contribution for valence prediction. Mean distance in the phase domain (Equation
(5.38)) is the 3rd individually most relevant feature for arousal prediction. Phase
domain characteristics were particularly successful for the classification of percus-
sive versus “non-percussive” music in [42]. For valence prediction, three of the five
most relevant features belong to rhythm descriptors: characteristics of fluctuation
patterns and rhythmic clarity. As can be expected from music theory and previous
studies (see, e.g., [11, p. 351]), arousal – which can also be described as a level of
activation – is easier to predict than valence.
Original feature implementations are mostly from the MIR Toolbox, except for
RMS peak characteristics (implementation from [60] based on the MIR Toolbox
peak function) and distances in phase domain/spectral skewness (Yale implemen-
tation [42]).
In Figure 21.6, the values of the two most relevant features for the prediction of
arousal and valence are plotted together with corresponding regression lines.
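The individual R² values reported in Table 21.5 correspond in principle to simple linear regressions on one feature at a time; the following sketch computes them with fitlm on placeholder data (the feature matrix and annotations are assumptions).

% Sketch: individual R^2 of each feature for predicting an annotated target
X = randn(100, 10);                              % assumed: 100 songs x 10 aggregated features
arousal = 0.8 * X(:, 3) + 0.3 * randn(100, 1);   % placeholder annotations
r2 = zeros(1, size(X, 2));
for j = 1:size(X, 2)
    mdl = fitlm(X(:, j), arousal);               % simple linear regression on feature j
    r2(j) = mdl.Rsquared.Ordinary;               % coefficient of determination R^2
end
[bestR2, bestFeature] = max(r2)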
Table 21.5: Comparison of Audio Features for the Recognition of Arousal and Va-
lence. Number after Feature Name: Dimension; (m): Mean. Table Entries Are R2
after Linear Regression. Bold Values in Brackets Outline Top 1st–5th Features for
the Corresponding Regression Task
21.6 Computationally Based Emotion Recognition
A favorite tune “became associated with grief and tears” after the listener was informed about the death of an uncle while it was playing. As another example, aggressive hard
rock may induce calmness and relaxation for younger people who are fond of this
genre. Most systems are designed to recognize perceived emotions.
For the evaluation and optimization of a MER system, annotated music data is
necessary. Table 21.6 lists publicly available databases with annotated perceived
emotions, sorted by publication year. The first half contains categorical annotations,
the second half dimensional ones. Note that dimensional annotations can also be
used for the classification into 4 regions with low or high arousal/valence.
Unfortunately, all data sets have individual drawbacks. One of the most accurate
annotations, the Soundtrack database [15], was created with the help of music ex-
perts who carefully selected music according to “extremes of the three-dimensional
model.” However, this database only contains film music. Databases with annota-
tions for many tracks often do not contain audio for copyright reasons. 1000 Songs
[54] is a good alternative with a large freely distributed set of tracks and features, but its tracks, taken from the Free Music Archive, are probably less popular.
After the merging of processed features with corresponding annotations, classi-
fication or regression models can be created and evaluated.
For categorical MER systems, supervised classification methods from Chapter 12
can be applied. Earlier studies often adopted a single classifier, like neural networks
in [19] and SVMs in [38]. For better performance, it makes sense to compare several
methods, as done in [45] for Naive Bayes (NB, introduced in Section 12.4.1), de-
cision tree C4.5 (Section 12.4.3), k-Nearest Neighbors (k-NN, Section 12.4.2), and
SVMs (Section 12.4.4). SVMs performed best achieving F-measure mF = 0.64, cf.
Equation (13.15), for the recognition of five emotional clusters. The best model in
[6] across 3 classifiers (NB, k-NN, SVMs), 7 feature sets, and 3 feature selection
methods was also built with SVMs (the accuracy mACC = 0.65, cf. Equation (13.11),
for the recognition of four regions in arousal/valence space).
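A comparison of several classifiers along these lines could look as follows in MATLAB; fitcnb, fitcknn, and fitcecoc from the Statistics and Machine Learning Toolbox stand in for the methods of Chapter 12, the data are placeholders, and accuracy is estimated with 10-fold cross-validation.

% Sketch: compare Naive Bayes, k-NN, and a multi-class SVM with 10-fold cross-validation
X = randn(200, 15);                              % assumed: 200 songs x 15 features
y = categorical(randi(4, 200, 1));               % assumed: 4 emotion classes
learners = {@() fitcnb(X, y), ...                % Naive Bayes
            @() fitcknn(X, y, 'NumNeighbors', 5), ...  % k-NN with k = 5
            @() fitcecoc(X, y)};                 % multi-class SVM via error-correcting output codes
names = {'Naive Bayes', 'k-NN', 'SVM (ECOC)'};
for m = 1:numel(learners)
    cvmdl = crossval(learners{m}(), 'KFold', 10);
    fprintf('%s: accuracy = %.3f\n', names{m}, 1 - kfoldLoss(cvmdl));
end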
The choice of the “best” classifier strongly depends on data and a classification
task. In [6], NB was the best method for three of seven feature sets, and in [45] the
comparative performance of C4.5, k-NN, and NB varied for different feature sets.
The choice of a classification method is usually harder when performance measures beyond the classification quality are taken into account (see Section 13.3.6), and such measures have so far very seldom been integrated into MER systems.
For dimensional MER systems, various regression methods can be considered.
Linear regression (see Section 9.8.1) was inferior to k-Nearest Neighbors (k-NN) re-
gression and Support Vector Regression (SVR) in [46] for the prediction of arousal
and valence. The best performance w.r.t. R2 was achieved using SVR. Similar re-
sults were reported in [26]: linear regression was the worst method and SVR the
best one according to R2 for the prediction of arousal and valence. In [14], best R2
values for the prediction of valence, activity, and tension were achieved by means of
Partial Least Squares Regression [66], compared to Multiple Linear Regression and
Principal Component Regression (PCR).
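Analogously, regression methods can be compared; the sketch below contrasts multiple linear regression (fitlm), support vector regression (fitrsvm), and partial least squares regression (plsregress) on placeholder data and reports R² on a held-out split. The data, the split, and the number of PLS components are assumptions.

% Sketch: compare linear regression, SVR, and PLS regression for valence prediction
X = randn(150, 20);                                      % assumed: 150 songs x 20 features
valence = 0.2 * X * randn(20, 1) + randn(150, 1);        % placeholder annotations
trainIdx = 1:100; testIdx = 101:150;                     % assumed split
ssTot = sum((valence(testIdx) - mean(valence(testIdx))) .^ 2);

mdlLin  = fitlm(X(trainIdx, :), valence(trainIdx));      % multiple linear regression
predLin = predict(mdlLin, X(testIdx, :));
mdlSvr  = fitrsvm(X(trainIdx, :), valence(trainIdx), 'KernelFunction', 'gaussian', 'Standardize', true);
predSvr = predict(mdlSvr, X(testIdx, :));
[~, ~, ~, ~, beta] = plsregress(X(trainIdx, :), valence(trainIdx), 5);   % PLS with 5 components
predPls = [ones(numel(testIdx), 1), X(testIdx, :)] * beta;

r2 = @(pred) 1 - sum((valence(testIdx) - pred) .^ 2) / ssTot;
fprintf('Linear: %.2f   SVR: %.2f   PLS: %.2f\n', r2(predLin), r2(predSvr), r2(predPls));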
Similar to the choice of a classifier for a categorical MER system, it is hard
Table 21.6: Databases with Annotated Emotions. No.: Number of Tracks; Goal: Prediction Goal; Genres: Genres of Database Tracks; Audio: Availability for Download; Features: Availability for Download; Ref.: Reference

CATEGORICAL ANNOTATIONS

CAL500, http://jimi.ithaca.edu/~dturnbull/data, annotations by at least three students for each track
  No.: 500; Goal: 18 emotions; Genres: 36 genres; Audio: low quality; Features: MFCCs; Ref.: [59]

Emotions, http://mlkd.csd.auth.gr/multilabel.htm, 3 annotations per track by experts
  No.: 593; Goal: 6 emotions; Genres: 7 genres; Audio: no; Features: 72 rhythmic and timbre features; Ref.: [57]

MIREX-like Mood, http://mir.dei.uc.pt/resources/MIREX-like_mood.zip, annotations from AllMusicGuide
  No.: 193; Goal: 5 clusters; Genres: mostly pop/rock; Audio: 30s excerpts; Features: no (lyrics and MIDIs included); Ref.: [45]
  No.: 764; Features: no (lyrics included)
  No.: 903; Features: no

Soundtracks, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/emotion/soundtracks, annotations by 6 experts
  No.: 110; Goal: 5 emotions; Genres: soundtracks; Audio: 10-30s excerpts; Features: no; Ref.: [15]

DIMENSIONAL ANNOTATIONS

Soundtracks, https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/emotion/soundtracks, annotations by 6 experts
  No.: 110; Goal: valence, energy, tension; Genres: soundtracks; Audio: 10-30s excerpts; Features: no; Ref.: [15]

NTUMIR-60, http://mac.citi.sinica.edu.tw/~yang/MER/NTUMIR-60, annotations by 40 students for each track
  No.: 60; Goal: arousal, valence; Genres: mostly pop/rock; Audio: no; Features: 252 features (timbre, pitch, rhythm, etc.); Ref.: [71, p. 92]

NTUMIR-1240, http://mac.citi.sinica.edu.tw/~yang/MER/NTUMIR-1240, annotations by 4.3 subjects for each track (online test)
  No.: 1240; Goal: arousal, valence; Genres: Chinese pop; Audio: no; Features: 213 features (timbre, pitch, rhythm, etc.); Ref.: [71, p. 92]

1000 Tracks, http://cvml.unige.ch/databases/emoMusic, annotations by 10 crowdworkers for each track
  No.: 744; Goal: arousal, valence; Genres: 8 genres; Audio: 45s excerpts; Features: 6669 features (spectrum, MFCCs, etc.); Ref.: [54]

AMG1608, http://mpac.ee.ntu.edu.tw/dataset/AMG1608, annotations by 665 subjects (students and crowdworkers); each track annotated by 15 crowdworkers
  No.: 1608; Goal: arousal, valence; Genres: mostly pop/rock; Audio: no; Features: 72 features (MFCCs, tonal, spectral, temporal); Ref.: [8]
lullaby, joy in a love song, fear in a thriller movie, etc. Automatic recognition of
emotions in music may help to create more interpretable classification models and
enable personal recommendations.
In this chapter, we introduced several emotional theories and models applicable
to music, followed by lists of characteristics which can be used for the automatic
prediction of emotions and moods in music. We further discussed details of the
implementation of music emotion recognition systems, such as the choice of classi-
fication and regression approaches, feature processing methods, and databases with
categorical and dimensional annotations.
Bibliography
[1] A. Aljanaki, F. Wiering, and R. C. Veltkamp. Computational modeling of induced
emotion using GEMS. In Proc. of the 15th International Society for Music
Information Retrieval Conference (ISMIR), pp. 373–378. International Society
for Music Information Retrieval, 2014.
[2] L. Balkwill and W. F. Thompson. A cross-cultural investigation of the percep-
tion of emotion in music: Psychophysical and cultural cues. Music Perception, 17(1):43–
64, 1999.
[3] M. Barthet, G. Fazekas, and M. Sandler. Multidisciplinary perspectives on mu-
sic emotion recognition: Implications for content and context-based models. In
Proc. of the 9th International Symposium on Computer Music Modeling and Retrieval
(CMMR), pp. 492–507. Springer, 2012.
[4] M. Barthet, G. Fazekas, and M. Sandler. Music emotion recognition: From
content- to context-based models. In Proc. of the 9th International Symposium
on Computer Music Modeling and Retrieval (CMMR), pp. 228–252. Springer, 2012.
Chapter 22
Similarity-Based Organization of Music Collections
SEBASTIAN STOBER
Machine Learning in Cognitive Sciences, University of Potsdam, Germany
22.1 Introduction
Since the introduction of digital music formats such as MP3 (see Section 7.3.3) in
the late 1990s, personal music collections have grown considerably. But not much
has changed concerning the way we structure and organize them. Popular music
players that also function as collection management tools like iTunes or AmaroK
still organize music collections in tables and lists based on simple metadata, such
as the artist and album tags. Even their integrated music recommendation functions,
like iTunes Genius, often only rely on usage information, which is typically extracted
from playlists and ratings – i.e., they largely ignore the actual music content in the
audio signal.
Equipped with sophisticated content analysis techniques as described in earlier
chapters, we are now ready to pursue new ways of organizing music collections
based on music similarity. We can compare music tracks on a content-based level
and group similar tracks together. Furthermore, we can generate maps that visualize
the structure of a similarity space, i.e., the space defined by a set of objects and their
pairwise similarities. Such maps make it easy to identify regions or neighborhoods
of similar tracks. Moreover, they provide the foundation for new ways of interacting
with music collections. For instance, users can find relevant music by starting from a
general overview map, identifying interesting regions, and then exploring these in more de-
tail until they find what they are looking for. This way, they do not have to formulate
a query explicitly. Instead, they can systematically and incrementally narrow down
their region of interest and define their search goal implicitly during the exploration
process. Note that this also works in scenarios where users just want to explore and
find something new. Here, narrowing down the region of interest more and more
characterizes that “something.”
However, such new approaches do not come without their own challenges. First,
music similarity is not a simple concept to start with. In fact, various frameworks
exist within the fields of musicology, psychology, and cognitive science and there
are many ways of comparing music tracks. How we compare music may depend
on our musical background, our personal preferences or our specific retrieval task.
Hence, we have to accept the fact that there is no such thing as the music similarity.
Instead, there are many ways of describing music similarity and we have to find out
which one works best in each specific context – or better, we would like to have the
computer figure this out for us automatically. Section 22.2 will present a way of
achieving this.
Second, visualizing a similarity space as a two-dimensional map – often also
called a projection – necessarily requires some sort of dimensionality reduction. Ex-
cept for very trivial cases, it is generally not possible to perfectly capture the structure
of the similarity space in two dimensions. Hence, there will be unavoidable projec-
tion errors that can have a negative impact on the users’ experience. Specifically,
some neighbors in the visualization may in fact not be similar whilst some similar
music tracks may be positioned in very distant regions of the map. Section 22.3 will
discuss ways to address this issue and utilize it for the task of exploring a collection.
As a third major challenge, music collections usually are not static. They change
over time as we add new music or sometimes remove some tracks. With every change
of the collection, we will also have to change the corresponding map visualization.
But we want to change it as little as possible to maximize continuity in the visual-
ization and not confuse the user. In Section 22.4, solutions for several projection
algorithms are compared.
may however be questioned from a psychological point of view. Such a discussion is beyond the scope of
this chapter. An overview can, for instance, be found on Scholarpedia [1].
less interesting than the lyrics or the harmonic structure. Similarly, musicians might
pay particular attention to structure, tonality, or instrumentation and possibly,
consciously or unconsciously, to their own instruments.
In order to build an application that can accommodate such diverse views on
music similarity, we need three important ingredients. First, we need an adaptable
model of music similarity. Each parameter setting of this model represents one pos-
sible view. We would like to have the computer find optimal parameter values for the
desired outcome. Second, we require a way to express and capture preferences for
choosing the right model parameters. Finally, we need an algorithm – the adaptation
logic – that derives the model parameters from the preferences. We will address these
points in the following subsections.
chords with non-zero frequency and compare them using the Jaccard index (cp. Sec-
tion 11.2). Finally, if we only want to compare the number of different chords used,
we can compute the Manhattan distance (cp. Section 11.2) of the size of the chord
sets.
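To make these facet distances concrete, here is a minimal Python sketch (not taken from the book; the function and track names are hypothetical) that compares the chord sets of two tracks with the Jaccard index and the Manhattan distance of the set sizes:

```python
def jaccard_distance(chords_a, chords_b):
    """Jaccard distance of two chord sets: 1 - |intersection| / |union| (cp. Section 11.2)."""
    union = chords_a | chords_b
    if not union:
        return 0.0
    return 1.0 - len(chords_a & chords_b) / len(union)

def chord_count_distance(chords_a, chords_b):
    """Manhattan distance of the sizes of the two chord sets."""
    return abs(len(chords_a) - len(chords_b))

# Hypothetical chord vocabularies of two tracks.
track_a = {"C", "G", "Am", "F"}
track_b = {"C", "G", "D", "Em", "Bm"}
print(jaccard_distance(track_a, track_b))      # ~0.71
print(chord_count_distance(track_a, track_b))  # 1
```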
In order to avoid a bias when aggregating several facet distance measures, the
values should be normalized. The following normalization can be applied for all
distance values δ_f(a, b) of a facet f:

    δ'_f(a, b) = δ_f(a, b) / µ_f                                         (22.1)

where µ_f is the mean facet distance w.r.t. f:

    µ_f = (1 / |T²|) · ∑_{(a,b) ∈ T²} δ_f(a, b).                         (22.2)
As a result, all facet distances have a mean value of 1.0. Special care has to be taken,
if extremely high facet distance values are present that express “infinite dissimilar-
ity” or “no similarity at all.” Such values introduce a strong bias for the mean of the
facet distance and thus should be ignored during its computation. Further normal-
ization methods for features that can also be applied to normalize distance values are
introduced in Section 14.2.2.
The actual distance between objects a, b ∈ T w.r.t. the facets f1 , . . . , fl is com-
puted as the weighted sum of the individual facet distances δ f1 (a, b), . . . , δ fl (a, b):
    d(a, b) = ∑_{i=1}^{l} w_i · δ_{f_i}(a, b).                           (22.3)
This way, we introduce the facet weights w1 , . . . , wl ∈ R which allow us to adapt the
importance of each facet according to subjective user preferences or for a specific
retrieval task. Note that the linear combination assumes the independence of the
individual facets, which might be a limiting factor of this model in some settings.
The weights obviously have to be non-negative and should correspond to proportions,
i.e., add up to 1, thus:
    w_i ≥ 0   ∀ 1 ≤ i ≤ l                                                (22.4)

    ∑_{i=1}^{l} w_i = 1.                                                 (22.5)
The resulting adaptable model of music similarity has a linear number of param-
eters – one weight per distance facet. The weights are intuitively comprehensible and
can also be easily represented as sliders in a graphical user interface.
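As an illustration of Equations (22.1) to (22.5), the following Python fragment normalizes two made-up facet distances by their means and aggregates them with fixed weights; all track data, facet functions, and names are assumptions for this sketch, not material from the book:

```python
from itertools import product

def mean_facet_distance(tracks, dist):
    """Mean facet distance over all pairs in T x T, cf. Equation (22.2)."""
    pairs = list(product(tracks, repeat=2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def make_normalized(dist, mu):
    """Rescale a facet distance so that its mean becomes 1.0, cf. Equation (22.1)."""
    def normalized(a, b):
        return dist(a, b) / mu if mu > 0 else 0.0
    return normalized

def weighted_distance(a, b, norm_dists, weights):
    """Weighted sum of normalized facet distances, cf. Equation (22.3)."""
    return sum(w * d(a, b) for w, d in zip(weights, norm_dists))

# Toy collection: each track is described by a tempo value and a set of chords.
tracks = [
    {"tempo": 120, "chords": {"C", "G", "Am", "F"}},
    {"tempo": 95,  "chords": {"C", "G", "D"}},
    {"tempo": 128, "chords": {"Em", "Bm", "D"}},
]
facets = [
    lambda a, b: abs(a["tempo"] - b["tempo"]),                   # tempo facet
    lambda a, b: 1 - len(a["chords"] & b["chords"])
                     / max(len(a["chords"] | b["chords"]), 1),   # harmony facet
]
norm_facets = [make_normalized(d, mean_facet_distance(tracks, d)) for d in facets]
weights = [0.3, 0.7]  # non-negative and summing to 1, cf. Equations (22.4), (22.5)
print(weighted_distance(tracks[0], tracks[1], norm_facets, weights))
```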
The algorithm will then eventually identify the values for the weight parameters that
best reflect our preferences.
Distance or similarity preferences can be expressed in two ways, either through
absolute (quantitative) statements or through relative (qualitative) statements. The
former statements can be binary like “x and y are / are not similar,” which requires
a hard decision criterion, or quantitative like “the similarity / distance of x and y is
0.5,” which requires a well-defined scale. Relative preference statements, on the
contrary, do not compare objects directly but their pair-wise distances in the form
“x and y are more similar / less distant than u and v.” Usually, this is done relative
to a seed object reducing the statements to the form “x is more similar / less distant
to y than z.” Such statements will be considered in the remainder of this chapter
as they are much easier to express and thus more stable than absolute statements,
i.e., when asked again, users are more likely to confirm the earlier expressed relative
preference than stating the same absolute value for the distance / similarity again.
We will concentrate on the simpler version that refers to a seed track and is easier to
comprehend. However, all of the following can easily be modified to accommodate
general relative distance constraints defined on two pairs of tracks without a seed.
Definition 22.2 (Relative Distance Constraint). A relative distance constraint
(s, a, b) demands that object a is closer to the seed object s than object b, i.e.:

    d(s, a) < d(s, b),                                                   (22.6)

which, using the weighted facet distance of Equation (22.3), can be rewritten as

    ∑_{i=1}^{l} w_i (δ_{f_i}(s, b) − δ_{f_i}(s, a)) > 0,                 (22.7)

or, in vector notation, w^T x > 0 after substituting x_i := δ_{f_i}(s, b) − δ_{f_i}(s, a). As we will see later, such basic constraints
can directly be used to guide an optimization algorithm that aims to identify weights
that violate as few constraints as possible. But there is also an alternative perspec-
tive on the weight-learning problem as pointed out by Cheng and Hüllermeier [3]
and illustrated in Figure 22.1. We can transform the optimization problem into a bi-
nary classification problem with positive training examples (x, +1) that correspond
to satisfied constraints and negative examples (−x, −1) corresponding to constraint
violations, respectively. In this case, the weights (w_1, . . . , w_l) = w^T describe the model
(separating hyperplane) to be learned by the classifier.2
The focus on relative distance constraints may seem like a strong limitation, but
as we will see next, complex expressions of distance preferences can be broken down
into “atomic” relative distance constraints. Let us, for instance, consider two very
common user activities related to collection structuring: grouping and (re-)ranking.
Grouping can be realized by assigning tags such that tracks in the same group share
the same tag. It could also be done visually in a graphical user interface where tracks
can be moved into folders or similar cluster containers by drag-and-drop operations.
2 For an explanation of (binary) classification problems and the concept of the separating hyperplane, see Chapter 12.
For ranking, a set of tracks is arranged as a list according to the similarity with a seed
track. Groupings and ranking lists do not have to be created from scratch but can
be pre-computed using default facet weights and then modified by the user. Once
the user has modified the ranking or grouping, constraints can be derived. For a set
G of tracks grouped together by similarity, every pair of tracks x, y within G should
be more similar to each other than to any track o outside of G. This results in two
relative distance constraints per triplet x, y, o:

    ∀ x, y ∈ G, o ∉ G:   d(x, y) < d(x, o)  ∧  d(y, x) < d(y, o).        (22.8)
For a ranked list of similar tracks t_1, . . . , t_n w.r.t. a seed track s, the track ranked first
should be the one most similar to the seed, and in general, the track at rank i should
be more similar than the tracks at ranks j > i. This results in the following set of
relative distance constraints:

    {(s, t_i, t_j) | 1 ≤ i < j ≤ n}.                                     (22.9)
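The derivation of such atomic constraints from a grouping (Equation (22.8)) and from a ranked list could, for instance, look as in the following Python sketch; the helper names and toy track identifiers are assumptions for illustration and not from the chapter:

```python
def constraints_from_grouping(group, others):
    """Constraints (x, y, o) from a group of similar tracks, cf. Equation (22.8):
    each member x must be closer to every other member y than to any outside track o."""
    constraints = []
    for x in group:
        for y in group:
            if x == y:
                continue
            for o in others:
                constraints.append((x, y, o))  # demands d(x, y) < d(x, o)
    return constraints

def constraints_from_ranking(seed, ranked):
    """Constraints from a list ranked by similarity to a seed track, cf. Equation (22.9):
    the track at rank i must be closer to the seed than any track at a later rank j."""
    return [(seed, ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]

# Toy track identifiers.
print(constraints_from_grouping(["t1", "t2"], ["t5"]))
print(constraints_from_ranking("s", ["t1", "t2", "t3"]))
```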
These two examples aim to demonstrate how atomic relative distance constraints
can be inferred from a user’s interaction with a system. Of course, more ways of
interacting with a user interface are possible depending on the complexity of the
application. Each will require a slightly different approach to model the expressed
similarity preference. As long as an interaction relates to the (perceived) relative
can either be resolved by removing all m + n edges or by just removing min(m, n) edges in each direction
and leaving |m − n| edges in the direction of stronger preference.
For each violated constraint (s, a, b), a gradient step can be taken on the objective
obj(s, a, b) = ∑_{i=1}^{l} w_i (δ_{f_i}(s, b) − δ_{f_i}(s, a)), which can be directly derived
from Equation (22.7). This leads to the following update rule for the individual weights:

    w_i = w_i + η · ∆w_i                                                 (22.11)

with

    ∆w_i = ∂ obj(s, a, b) / ∂ w_i = δ_{f_i}(s, b) − δ_{f_i}(s, a),       (22.12)
where the learning rate η defines the step width of each iteration and can optionally
be decreased progressively for better convergence. To enforce the bounds on wi given
by Equations (22.4) and (22.5), an additional step is necessary after the update, in
which all negative weights are set to 0 and the weights are normalized such that we
obtain a constant weight sum of 1. The algorithm can stop as soon as no constraints
are violated anymore. Furthermore, a maximum number of iterations or a time limit
can be specified as another stopping criterion.
This algorithm can compute a weighting, even if not all constraints can be satis-
fied. However, it is not guaranteed to find a globally optimal solution and no max-
imum margin is enforced for extra stability. Using the previous weight settings as
initial values in combination with a small learning rate allows for gradual change in
scenarios where constraints are added incrementally, but there may still be solutions
with less change required.
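A possible Python rendering of this iterative scheme, i.e., the update of Equations (22.11) and (22.12) followed by the re-projection onto non-negative weights summing to 1 (Equations (22.4) and (22.5)), is sketched below; the helper names, toy facet functions, and feature tuples are hypothetical:

```python
def facet_difference(constraint, facet_dists):
    """x_i = delta_fi(s, b) - delta_fi(s, a) for a constraint (s, a, b)."""
    s, a, b = constraint
    return [d(s, b) - d(s, a) for d in facet_dists]

def learn_weights(constraints, facet_dists, weights, eta=0.1, max_iter=100):
    """Update the weights for each violated constraint (Equations (22.11), (22.12))
    and re-project them onto the feasible region (Equations (22.4), (22.5))."""
    for _ in range(max_iter):
        any_violated = False
        for c in constraints:
            x = facet_difference(c, facet_dists)
            if sum(w * xi for w, xi in zip(weights, x)) <= 0:  # constraint violated
                any_violated = True
                weights = [w + eta * xi for w, xi in zip(weights, x)]
                weights = [max(w, 0.0) for w in weights]       # clip negative weights
                total = sum(weights) or 1.0
                weights = [w / total for w in weights]         # renormalize to sum 1
        if not any_violated:
            break
    return weights

# Toy example: two facets on (tempo, energy) tuples; the constraint prefers track a.
facets = [lambda u, v: abs(u[0] - v[0]), lambda u, v: abs(u[1] - v[1])]
s, a, b = (120, 0.5), (118, 0.6), (121, 0.1)
print(learn_weights([(s, a, b)], facets, [0.5, 0.5]))  # shifts weight to the second facet
```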
Figure 22.2: Scheme for modeling a weight-learning problem with soft distance con-
straints through the equality and inequality constraints of a quadratic programming
problem:

    C_e = [ 1 ⋯ 1   0 ⋯ 0 ]  (l ones, k zeros),        b_e = [ 1 ],

    C_i = [ I_l         0   ]                           b_i = [ 0 ⋯ 0   ε ⋯ ε ]^T  (l zeros, k entries ε),
          [ (c_{i,j})   I_k ]

where I_l and I_k denote identity matrices and (c_{i,j}) is the k×l matrix of facet
distance differences of the k relative distance constraints.
The first l dimensions of the variable vector correspond to the facet weights we would like to optimize. We have to satisfy the weight constraints
given in Equations (22.4) and (22.5). The single equality constraint corresponds
to Equation (22.5) whereas Equation (22.4) is represented by the first l inequality
constraints – one for each weight. Each of the remaining k inequality constraints
models a relative distance constraint as formulated in Equation (22.7). For the i-th
distance constraint (s, a, b), the value ci, j refers to the facet distance difference for
the j-th facet, i.e., δ f j (s, b) − δ f j (s, a). The constant ε in Figure 22.2 refers to a small
value close to machine precision which is used to enforce inequality.
In order to allow each of the distance constraints to be violated, individual slack
variables ξ ≥ 0 are introduced such that:
    ∑_{i=1}^{l} w_i (δ_{f_i}(s, b) − δ_{f_i}(s, a)) + ξ > 0.             (22.16)
A slack value greater than zero means that the respective constraint is violated. For
the k constraints, we require k slack variables ξ1 , . . . , ξk that form the remaining k
dimensions of the variable vector x = (w1 , . . . , wl , ξ1 , . . . , ξk ). The objective is then
to minimize the sum of the squared slack variables. This can be accomplished by
setting the last k values on the diagonal of the matrix G to 1 and all other values of
G and a to zero.
This approach will always find a globally optimal solution that minimizes the
slack. However, it cannot be used directly in incremental scenarios. In order to
support incremental learning, the objective function has to be modeled differently.
In particular, we have to add a term for minimizing the weight change and balance
it with the slack minimization objective. This, however, is beyond the scope of this
chapter.
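For illustration, the constraint matrices of Figure 22.2 could be assembled as in the following Python sketch; the resulting arrays would then be handed to an off-the-shelf quadratic programming solver (not shown), and all function and variable names are assumptions made for this example:

```python
def build_qp_matrices(constraint_rows, l, eps=1e-9):
    """Assemble the QP constraints of Figure 22.2.  constraint_rows holds, per
    relative distance constraint, the l values c_ij = delta_fj(s,b) - delta_fj(s,a)."""
    k = len(constraint_rows)
    # Equality constraint: the l facet weights sum to 1 (slack variables excluded).
    C_e = [[1.0] * l + [0.0] * k]
    b_e = [1.0]
    # First l inequality rows: w_i >= 0.
    C_i = [[1.0 if j == i else 0.0 for j in range(l + k)] for i in range(l)]
    b_i = [0.0] * l
    # Remaining k rows: sum_j w_j * c_ij + xi_i >= eps (soft distance constraints).
    for i, row in enumerate(constraint_rows):
        C_i.append(list(row) + [1.0 if j == i else 0.0 for j in range(k)])
        b_i.append(eps)
    # Quadratic objective: minimize the sum of squared slack variables, so G has
    # ones on its last k diagonal entries and the linear term a is zero.
    G = [[1.0 if (r == c and r >= l) else 0.0 for c in range(l + k)]
         for r in range(l + k)]
    a = [0.0] * (l + k)
    return C_e, b_e, C_i, b_i, G, a

# Two facets (l = 2) and two relative distance constraints (k = 2), made-up values.
C_e, b_e, C_i, b_i, G, a = build_qp_matrices([[0.4, -0.2], [-0.1, 0.3]], l=2)
```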
tical or neighboring cells of a two-dimensional grid. In the field of MIR, they have
been used in a large number of applications for structuring music collections like
the Islands of Music [23, 21], MusicMiner [19] (Figure 22.4), nepTune [9] (Figures
22.4 and 26.3), or the Map of Mozart (Figure 26.7). An overview on SOM-related
publications in the field of MIR is given in [30]. As a major drawback, SOMs gen-
erally require that the objects they process are represented as vectors, i.e., elements
of a vector space.4 If the feature representation does not adhere to this condition,
we need to vectorize it first. For instance, as proposed in [28], we can use MDS
to compute an embedding of the tracks into a high-dimensional Euclidean space for
vectorization. This vectorization step is exactly like using MDS directly for projec-
tion, but here the output space can have as many dimensions as needed to not lose
any information about the distances between the tracks. Afterwards, we can use a
regular SOM to project the vectorized data for visualization. Alternatively, there are
also several special versions of SOMs that do not require vector input such as kernel
SOMs and dissimilarity SOMs (also called median SOMs) [10].
4 SOM cells are usually represented by prototype feature vectors and the SOM learning algorithm
relies on vector operations to update these prototypes based on the assigned objects.
5 In order to correctly display all pair-wise distances of N tracks, a space with at most N dimensions is
needed. The intrinsic dimensionality of the collection refers to the smallest number of dimensions 1 ≤ m ≤
N where this is still possible. This value very much depends on the complexity of the similarity measure
and the distribution of feature values in the collection. In all but trivial cases, it will be significantly higher
than 2.
Definition 22.3 (Trustworthiness). The measure of trustworthiness considers the k nearest
neighbors in the visualization and captures how well they reflect the true neighborhoods in the
original space:

    M_trustworthiness = 1 − C(k) ∑_{i=1}^{N} ∑_{j ∈ U_k(i)} (r_{ij} − k),          (22.17)

where r_{ij} is the rank of j in the ordering of the distance from i in the original space,
U_k(i) is the set of i's false neighbors in the display, and C(k) = 2/(N·k·(2N − 3k − 1))
is a constant for obtaining values in [0, 1].
Definition 22.4 (Continuity). The measure of continuity considers the k nearest
neighbors in the original space and captures how well they are preserved in the visu-
alization:
    M_continuity = 1 − C(k) ∑_{i=1}^{N} ∑_{j ∈ V_k(i)} (r̂_{ij} − k),               (22.18)
where r̂_{ij} is the rank of j in the ordering of the distance from i in the visualization
and V_k(i) is the set of i's true neighbors missing in the visualized neighborhood.
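Both quality measures can be computed directly from the pairwise distances in the original space and in the visualization. The following Python sketch is one possible literal implementation of Equations (22.17) and (22.18); the function names are not from the chapter:

```python
def rank_orders(dists):
    """rank[i][j]: rank of j among the neighbors of i (1 = nearest), given a
    matrix of pairwise distances."""
    n = len(dists)
    ranks = [[0] * n for _ in range(n)]
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dists[i][j])
        for r, j in enumerate(order, start=1):
            ranks[i][j] = r
    return ranks

def trustworthiness(orig_dists, vis_dists, k):
    """Penalize false neighbors: tracks among the k nearest neighbors in the
    visualization but not in the original space, cf. Equation (22.17)."""
    n = len(orig_dists)
    r_orig, r_vis = rank_orders(orig_dists), rank_orders(vis_dists)
    c = 2.0 / (n * k * (2 * n - 3 * k - 1))
    penalty = sum(r_orig[i][j] - k
                  for i in range(n) for j in range(n)
                  if j != i and r_vis[i][j] <= k and r_orig[i][j] > k)
    return 1.0 - c * penalty

def continuity(orig_dists, vis_dists, k):
    """Penalize true neighbors that are missing from the visualized
    k-neighborhood, cf. Equation (22.18)."""
    n = len(orig_dists)
    r_orig, r_vis = rank_orders(orig_dists), rank_orders(vis_dists)
    c = 2.0 / (n * k * (2 * n - 3 * k - 1))
    penalty = sum(r_vis[i][j] - k
                  for i in range(n) for j in range(n)
                  if j != i and r_orig[i][j] <= k and r_vis[i][j] > k)
    return 1.0 - c * penalty
```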
Type I projection errors increase the number of dissimilar (i.e., irrelevant) tracks
displayed in a local region of interest (low trustworthiness). While this might be-
come annoying, it is much less problematic than type II projection errors. Type
II projection errors are like “wormholes” connecting possibly distant regions in the
two-dimensional display space through the underlying high-dimensional similarity
space.6 They result in similar (i.e., relevant) music tracks to be displayed far away
from the region of interest – the neighborhood they actually belong to (low continu-
ity). In the worst case, misplaced neighbors could even be off-screen if the display is
limited to the currently explored region. This way, users could miss tracks they are
actually looking for.
6 Imagine crumpling a sheet of paper (the screen projection) to approximate the three-dimensional volume of a box (the actual similar-
ity space). Each coordinate in the volume is mapped to the closest point on the crumpled paper. When the
paper is flattened (visualization), not all mapped volume coordinates will be next to their actual neighbors.
In reality, the similarity space usually has many more dimensions, which amplifies the problem.
Figure 22.4: Screenshots of approaches that use mountain ranges to separate dissim-
ilar regions (left: MusicMiner [19], middle: SoniXplorer [15]) or to visualize regions
with a high density of similar music tracks (right: nepTune [9], a variant of Islands
of Music [23, 21]).
displayed close to each other. SoniXplorer [14, 15] uses the same geographical
metaphor but in a 3D virtual environment that users can navigate with a game pad.
The Islands of Music [23, 21] and related approaches [9, 20, 6] use the third dimen-
sion the other way around. Here, islands or mountains refer to regions of similar
tracks (with high density) separated by water (with low density of similar tracks).
All these approaches visualize local properties of the projection, i.e., neighborhoods
of either dissimilar or similar music tracks.
Figure 22.5: In the SoundBite user interface [13], a selected seed track and its actual
nearest neighbors are connected by lines.
Figure 22.6: Left: MusicGalaxy (inverted color scheme for print). Tracks are vi-
sualized as stars with brightness corresponding to listening frequency. For a well-
distributed selection of popular tracks, the album cover is shown for better orienta-
tion. Same covers indicate different tracks from the same album. Hovering over a
track displays the title. For tracks in focus, the album covers are shown with the
size increased by the lens scale factor. Top right: corresponding SpringLens distor-
tion resulting from (user-controlled) primary focus (large) and (adaptive) secondary
lenses (small). Bottom right: facet weights for the projection and distortion distance
measures (cp. Section 22.3.5).
interface [29] shown in Figure 22.6, which exploits the wormhole metaphor for nav-
igation.7 Instead of trying to globally repair errors in the projection (implemented
through MDS), the general idea is to temporarily fix and highlight the neighborhood
in focus through distortion. To this end, an adaptive mesh-based distortion tech-
nique called SpringLens is applied that is guided by the user’s focus of interest. The
SpringLens consists of a complex overlay of multiple fish-eye lenses divided into a
primary and secondary focus (Figure 22.6, top right). The primary focus is a single
large fish-eye lens used to zoom into regions of interest. At the same time, it com-
pacts the surrounding space but does not hide it from the user to preserve overview.
While the user can control the position and size of the primary focus, the secondary
focus is automatically adapted. It consists of a varying number of smaller fish-eye
lenses. When the primary focus is moved by the user, a neighbor index is queried
with the track closest to the new center of focus. If nearest neighbors are returned
that are not in the primary focus region, secondary lenses are added at the respective
positions. As a result, the overall distortion of the visualization temporarily brings
the distant nearest neighbors back closer to the focused region of interest. This way,
distorted distances introduced by the projection can, to some extent, be compensated
whilst the distant nearest neighbors are highlighted. By clicking on a secondary
focus region, users “travel through a wormhole” and the primary focus is changed
respectively. This is like navigating an invisible neighborhood graph.
Ideally, a map should be altered as little as possible and only as much as necessary to
reflect the changes of the underlying collection. Too abrupt changes in the topology
might confuse the user who over time will get used to the location of specific regions
in the map.
Figure 22.8: Aligned MDS projections computed after adding the first four Beatles
albums (Please Please Me, With The Beatles, A Hard Day's Night, Beatles for Sale) to the collection.
Further work by McFee et al. [16] focuses on adapting content-based song sim-
ilarity by learning from a sample of collaborative filtering data. Here, they use the
metric learning to rank (MLR) technique [18] – an extension of the Structural SVM
approach – to adapt a Mahalanobis distance according to a ranking loss measure.
This approach is also applied by Wolff et al. [33] whose similarity adaptation exper-
iments are based on the MagnaTagATune dataset derived from the TagATune game
[12]. Further experiments described in [32] compare approaches using the more
complex Mahalanobis distance to the weighted facet distance approach described
in this chapter.
Bibliography
[1] F. G. Ashby and D. M. Ennis. Similarity measures. Scholarpedia, 2(12):4116,
2007. revision #142770.
[2] G. Cauwenberghs and T. Poggio. Incremental and decremental support vec-
tor machine learning. In Advances in Neural Information Processing Systems
(NIPS’00), pp. 409–415, Cambridge, MA, USA, 2000. MIT Press.
[3] W. Cheng and E. Hüllermeier. Learning similarity functions from qualitative
feedback. In Proceedings of the 9th European Conference on Advances in Case-
Based Reasoning (ECCBR’08), pp. 120–134, Trier, Germany, 2008. Springer-
Verlag.
[4] V. de Silva and J. B. Tenenbaum. Global versus local methods in nonlinear di-
mensionality reduction. In Advances in Neural Information Processing Systems
(NIPS’02), pp. 705–712, Cambridge, MA, USA, 2002. MIT Press.
[5] M. Dopler, M. Schedl, T. Pohle, and P. Knees. Accessing music collections via
representative cluster prototypes in a hierarchical organization scheme. In Pro-
ceedings of the 9th International Conference on Music Information Retrieval
(ISMIR’08), pp. 179–184, 2008.
[6] M. Gasser and A. Flexer. FM4 Soundpark: Audio-based music recommenda-
tion in everyday use. In Proceedings of the 6th Sound and Music Computing
Conference (SMC’09), pp. 161–166, 2009.
[7] J. C. Gower and G. B. Dijksterhuis. Procrustes Problems. Oxford University
Press, 2004.
[8] S. Kaski, J. Nikkilä, M. Oja, J. Venna, P. Törönen, and E. Castrén. Trustwor-
thiness and metrics in visualizing similarity of gene expression. BMC Bioinfor-
matics, 4(1):48, 2003.
[9] P. Knees, T. Pohle, M. Schedl, and G. Widmer. Exploring music collections in
virtual landscapes. IEEE MultiMedia, 14(3):46–54, 2007.
[10] T. Kohonen and P. Somervuo. How to make large self-organizing maps for non-
vectorial data. Neural Networks, 15(8–9):945–952, October–November 2002.
[11] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage, 1986.
[12] E. Law and L. von Ahn. Input-agreement: A new mechanism for collecting data using human computation games. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI'09), 2009. ACM.
Chapter 23
Music Recommendation
DIETMAR JANNACH
Department of Computer Science, TU Dortmund, Germany
GEOFFRAY BONNIN
LORIA, Université de Lorraine, Nancy, France
23.1 Introduction
Until recently, music discovery was a difficult task. We had to listen to the radio
hoping that an interesting track would be played, actively browse the repertoire of a given artist, or
randomly try some new artists from time to time. With the emergence of personalized
recommendation systems, we can now discover music just by letting music platforms
play tracks for us. In another scenario, when we wanted to prepare some music
for a particular event, we had to carefully browse our music collection and spend
significant amounts of time selecting the right tracks. Today, it has become possible
to simply specify some desired criteria like the genre or mood and an automated
system will propose a set of suitable tracks.
Music recommendation is, however, a very challenging task, and the quality of the
current recommendations is still not always satisfactory. First, the pool of
tracks from which to make the recommendations can be huge. For instance,
Spotify,1 Groove,2 Tidal,3 and Qobuz,4 four of the currently most successful web
music platforms, all contain more than 30 million tracks.5 Moreover, most of the
tracks on these platforms typically have a low popularity6 and hence little informa-
tion is available about them, which makes them even harder to process for the task of
automated recommendation. Another difficulty is that the recommended tracks are
immediately consumed, which means the recommendations must be made very fast,
and must, at the same time, fit the current context.
Music recommendation was one early application domain for recommendation
techniques, starting with the Ringo system presented in 1995 [35]. Since then how-
ever, most of the research literature on recommender systems (RS) has dealt with the
recommendation of movies and commercial products [17]. Although the correspond-
ing core strategies can be applied to music, music has a set of specificities which can
make these strategies insufficient.
In this chapter, we will discuss today’s most common methods and techniques
for item recommendation which were developed mostly for movies and in the e-
commerce domain, and talk about particular aspects of the recommendation of mu-
sic. We will then show how we can measure the quality of recommendations and
finally give examples of real-world music recommender systems. Parts of our dis-
cussion will be based on [5], [8], and [22], which represent recent overviews on
music recommendation and playlist generation.
7 In Section 23.4 we will discuss in more detail what relevance or interestingness could mean for the
user.
8 Taking the general popularity of items into account in the ranking process is, however, also common
in practical settings because of the risk that only niche items are recommended.
Table 23.1: A Simple Rating Database, Adapted from [20]. When Recommendation
Is Considered as a Rating Prediction Problem, the Goal is to Estimate the Missing
Values in the Rating “Matrix” (Marked with ’?’)
23.2.1.1 CF Algorithms
One of the earliest and still relatively accurate schemes to predict Alice’s missing
ratings is to base the prediction on the opinions of other users who have liked similar
items to those Alice liked in the past, i.e., who have the same taste. The users of this group are
usually called “neighbors” or “peers”. When using such a scheme, the question is (a)
how to measure the similarity between users and (b) how to aggregate the opinions
of the neighbors. In one of the early papers on RS [32], the following approach
was proposed, which is still used as a baseline for comparative evaluation today. To
determine the similarity, the use of Pearson’s correlation coefficient (Definition 9.20
in Chapter 9) was advocated. The similarity of users u1 and u2 can thus be calculated
via

    sim(u1, u2) = ∑_{i ∈ I_b} (r_{u1,i} − r̄_{u1}) · (r_{u2,i} − r̄_{u2})
                  / ( √( ∑_{i ∈ I_b} (r_{u1,i} − r̄_{u1})² ) · √( ∑_{i ∈ I_b} (r_{u2,i} − r̄_{u2})² ) ),    (23.1)
where I_b denotes the set of products that have been rated by both user u1 and user u2,
r̄_{u1} is u1's average rating, and r_{u1,i} denotes u1's rating for item i.
Besides using Pearson’s correlation, other metrics such as cosine similarity have
been proposed.9 One of the advantages of Pearson’s correlation is that it takes into
account the tendencies of individual users to give mostly low or high ratings.
Once the similarity of users is determined, the remaining problem is to predict
Alice's missing ratings. Given a user u1 and an unseen item i, we could for example
compute the prediction based on u1's average rating and the opinions of the set N of
closest neighbors as follows:

    pred(u1, i) = r̄_{u1} + ∑_{u2 ∈ N} sim(u1, u2) · (r_{u2,i} − r̄_{u2}) / ∑_{u2 ∈ N} sim(u1, u2).    (23.2)
When we apply these calculations to the example in Table 23.1, we can identify
User2 and User3 as the closest neighbors to Alice (with similarities of 0.85 and 0.7,
respectively). Both have rated Song5 above their average, which leads to a predicted
above-average rating between 4 and 5 (exactly 4.87) for Alice and means that we should
include the song in a recommendation list.
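A compact Python sketch of this neighborhood-based scheme (Equations (23.1) and (23.2)) is given below. The small rating dictionary is made up for illustration and is not the data of Table 23.1:

```python
from math import sqrt

def pearson(ratings, u1, u2):
    """Pearson correlation over the items rated by both users, cf. Equation (23.1)."""
    common = set(ratings[u1]) & set(ratings[u2])
    if not common:
        return 0.0
    mean1 = sum(ratings[u1].values()) / len(ratings[u1])
    mean2 = sum(ratings[u2].values()) / len(ratings[u2])
    num = sum((ratings[u1][i] - mean1) * (ratings[u2][i] - mean2) for i in common)
    den1 = sqrt(sum((ratings[u1][i] - mean1) ** 2 for i in common))
    den2 = sqrt(sum((ratings[u2][i] - mean2) ** 2 for i in common))
    return num / (den1 * den2) if den1 and den2 else 0.0

def predict(ratings, user, item, n=2):
    """Neighborhood-based rating prediction, cf. Equation (23.2)."""
    mean_u = sum(ratings[user].values()) / len(ratings[user])
    neighbors = sorted((v for v in ratings if v != user and item in ratings[v]),
                       key=lambda v: pearson(ratings, user, v), reverse=True)[:n]
    num = sum(pearson(ratings, user, v)
              * (ratings[v][item] - sum(ratings[v].values()) / len(ratings[v]))
              for v in neighbors)
    den = sum(abs(pearson(ratings, user, v)) for v in neighbors)
    if not den:
        return mean_u
    return mean_u + num / den

# Made-up rating database (user -> {song: rating}).
ratings = {
    "Alice": {"Song1": 5, "Song2": 3, "Song3": 4},
    "User2": {"Song1": 4, "Song2": 3, "Song3": 5, "Song5": 5},
    "User3": {"Song1": 5, "Song2": 2, "Song3": 4, "Song5": 4},
}
print(round(predict(ratings, "Alice", "Song5"), 2))
```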
While the presented scheme is quite accurate – we will see, later on, how to mea-
sure accuracy – and simple to implement, it has the disadvantage of being basically
not applicable for real-world problems due to its limited scalability, since there are
millions of songs and millions of users for which we would have to calculate the
similarity values.
Therefore a large variety of alternative methods have been proposed over the last
decades to predict the missing ratings. Nearly all of these more recent methods are
based on offline data preprocessing and on what is called “model-building”. In such
approaches, the system learns a usually comparably compact model in an offline
and sometimes computationally intensive training phase. At runtime, the individual
predictions for a user can be calculated very quickly. Depending on the application
domain and the frequency of newly arriving data, the model is then re-trained peri-
odically. Among the applied methods we find various data mining techniques such
as association rule mining or clustering (see Chapter 11), support vector machines,
regression methods and a variety of probabilistic approaches (see Chapter 12). In
recent years, several methods were designed which are based on matrix factorization
(MF) as well as ensemble methods which combine the results of different learning
methods [23].
In general, the ratings that the users assigned to items can be represented as a
matrix, and this matrix can be factorized, i.e., it is possible to write this matrix R as
the product of two other matrices Q and P:
R = Q^T · P.
Matrix Factorization techniques determine approximations of Q and P using dif-
ferent optimization procedures (see Chapter 10). Implicitly, these methods thereby
map users and items to a shared factor space of a given size (dimensionality) and use
the inner product of the resulting matrices to estimate the relationship between users
and items [23]. Using such factorizations makes the computation times much shorter
and at the same time implicitly reveals some latent factors. A latent aspect of a song
could be the artist or the musical genre the song belongs to; in general, however, the
semantic meanings of the factors are unknown. After the factorization process with
f latent factors (for example f = 100), we are given a vector q_i ∈ R^f for each item
i and a vector p_u ∈ R^f for each user. For the user vectors, each value of the vector
corresponds to the interest of a user in a certain factor; for item vectors, each element
indicates the degree of “fit” of the item to the factor. Given a user u and an item i,
we can finally estimate the “match” between the user and the item by using the dot
product q_i^T p_u.
Different heuristic strategies exist for determining the values for the latent factor
vectors p_u and q_i. The most common ones in RS, which also scale to large
rating databases, are stochastic gradient descent optimization and Alternating Least
Squares [23]; see also Chapter 10.
In order to estimate a rating r̂_{u,i} for user u and item i, we can use the following
general equation, where µ is the global rating average, b_i is the item bias, and b_u is
the user bias:

    r̂_{u,i} = µ + b_i + b_u + q_i^T p_u                                  (23.3)
The reason for modeling user and item biases is that there are items which are
generally more liked or disliked than others, and there are, on the other hand, users
who generally give higher or lower ratings than others. For instance, Equation (23.3)
with f = 2 corresponds to the assumption that only two factors are sufficient to accu-
rately estimate the ratings of users. These factors may be, for instance, the genre and
the tempo of tracks, or any other factors, which are inferred during the factorization
step.
The learning phase of such an algorithm consists of estimating the unknown pa-
rameters based on the data. This can be achieved by searching for parameters which
minimize the squared prediction error (see Chapter 10), given the set of known rat-
ings K:
    min_{q*, p*, b*} ∑_{(u,i) ∈ K} ( r_{u,i} − (µ + b_u + b_i + q_i^T p_u) )² + λ ( ‖q_i‖² + ‖p_u‖² + b_u² + b_i² ).    (23.4)
The last term in the function is used for regularization and to “penalize” large
parameter values.
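As an illustration, a plain stochastic gradient descent loop for the biased model of Equation (23.3), minimizing the regularized error of Equation (23.4), might look as follows in Python; the hyperparameter values and the toy rating triples are arbitrary assumptions:

```python
import random

def sgd_mf(known_ratings, n_users, n_items, f=2, lr=0.01, reg=0.05, epochs=50):
    """Stochastic gradient descent for the biased matrix factorization model of
    Equation (23.3), minimizing the regularized squared error of Equation (23.4)."""
    random.seed(0)
    p = [[random.gauss(0, 0.1) for _ in range(f)] for _ in range(n_users)]
    q = [[random.gauss(0, 0.1) for _ in range(f)] for _ in range(n_items)]
    bu, bi = [0.0] * n_users, [0.0] * n_items
    mu = sum(r for _, _, r in known_ratings) / len(known_ratings)
    for _ in range(epochs):
        for u, i, r in known_ratings:
            pred = mu + bu[u] + bi[i] + sum(q[i][k] * p[u][k] for k in range(f))
            e = r - pred
            bu[u] += lr * (e - reg * bu[u])
            bi[i] += lr * (e - reg * bi[i])
            for k in range(f):
                p_uk, q_ik = p[u][k], q[i][k]
                p[u][k] += lr * (e * q_ik - reg * p_uk)
                q[i][k] += lr * (e * p_uk - reg * q_ik)
    return mu, bu, bi, p, q

# Toy data: (user index, item index, rating) triples with made-up values.
data = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 2), (2, 1, 4), (2, 2, 5)]
mu, bu, bi, p, q = sgd_mf(data, n_users=3, n_items=3)
pred_0_2 = mu + bu[0] + bi[2] + sum(q[2][k] * p[0][k] for k in range(2))
print(round(pred_0_2, 2))  # estimated rating of user 0 for item 2
```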
Overall, in the past years much research in the field of recommender systems was
devoted to such rating prediction algorithms. It however becomes more and more ev-
ident that rating prediction is very seldom the goal in practical applications. Finding
a good ranking of the tracks based on observed user behavior is more relevant, which
led to an increased application of “learning to rank” methods for this task or to the
development of techniques that optimize the order of the recommendations accord-
ing to music-related criteria such as track transitions or the coherence of the playlists
[19].
any sort of low-level data) has to be acquired and maintained. At the same time,
CF methods are well understood and have been successfully applied in a variety of
domains, including those where massive amounts of data have to be processed and a
large number of parallel users have to be served. The inherent characteristic of CF-
based algorithms can in addition lead to recommendations that are surprising and
novel for the user, which can be a key feature for a music recommender in particular
when the user is interested in discovering new artists or musical sub-genres.
On the down side, CF methods require the existence of a comparably large user
community to provide useful recommendations. Related to that is the typical issue of
data sparsity. In many domains, a large number of items in the catalog have very few
(or even no) ratings, which can lead to the effect that they are never recommended
to users. At the same time, some users only rate very few items, which makes it
hard for CF methods to develop a precise enough user profile. Situations in which
there are no or only a few ratings available for an item or a user are usually termed
“cold start” situations. A number of algorithms have been proposed to deal with this
problem in the literature. Many of them rely for example on hybridization strategies,
where different algorithms or knowledge sources are used as long as the available
ratings are not sufficient. Finally, as also discussed in [8], some CF algorithms have
a tendency to boost the popularity of already popular items so that, based on the
chosen algorithm, some niche items have a low chance of ever being recommended.
CF-based music recommendation has some aspects which are quite specific for
the domain. Besides the fact that in the case of song recommendation it is plausible
to recommend the same item multiple times to a user, it is often difficult to acquire
good and discriminative rating information from the user. Analyses have shown that
for example on YouTube users tend to give ratings only to items they like so that the
number of “dislike” statements is very small. While this bias towards liked items can
also be observed in other domains, it appears to be particularly strong for multime-
dia content as provided on YouTube, which “degrades” the user feedback basically to
unary ratings (“like” statements). With respect to data sparsity as mentioned above, a
common strategy in CF recommender systems is to rely on implicit item ratings, that
is, one interprets actions performed by users on items as positive or negative feed-
back. In the music domain, such implicit feedback is often collected by monitoring
the user’s listening behavior, and in particular, listening times are used to estimate to
which extent a user liked a song.
One of the most popular online music services that uses – among other techniques
– collaborative filtering is Spotify. Spotify provides several types of radios such as
genre radios, artist radios, and playlist radios. Once the user has chosen a radio,
the system automatically plays one recommended song after the other. The user
can give some feedback (like, dislike, or skip) on the tracks and this feedback is
taken into account and used to adapt the selection of the next recommendations. All
these radios use collaborative filtering, and more precisely matrix factorization [4].
Another interesting feature of Spotify is the Discover weekly playlist, a playlist that
is automatically generated each week and that the user can play to discover music
he may like. This feature also uses collaborative filtering to select the tracks which
are “around” the favorite tracks of the users in the similar users’ listening logs [37].
Table 23.2 shows an example for a content-enhanced music catalog, where the
items marked with a tick (X) correspond to those which the user has liked. A basic
10 In contrast to CF methods, the user profile in CB approaches is not based on the behavior of the larger user community, but only on the preferences of the individual user.
strategy to derive a user profile from the liked items would be to simply collect all
the values in each dimension (artist, genre, etc.) of all liked items. The relevance of
unseen items for the user can then be based, for example, on the overlap of keywords.
In the example, recommending the Willie Nelson song appears to be a reasonable
choice due to the user’s preference for country music.
In the area of document retrieval, more elaborate methods are usually employed
for determining the similarity between a user profile and an item which, for example,
take into account how discriminative a certain keyword is for the whole item collec-
tion. Most commonly, the TF-IDF (term frequency - inverse document frequency)
metric is used to measure the importance of a term in a document in IR scenarios.
The main idea is to represent the recommendable item as a weight vector, where
each vector element corresponds to a keyword appearing in the document. The TF-
IDF metric then calculates a weight that measures the importance of the keyword
or aspect, that is, how well it characterizes the document. The calculation of the
weight value depends both on the number of occurrences of the word in the docu-
ment (normalized by the document length) and on how often the term appears in
all documents, thus giving less weight to words that appear in most docu-
ments.
The user profile is represented in exactly the same way, that is, as a weight vector.
The values of the vector, which represent the user’s interest in a certain aspect can,
for example, be calculated by taking the average vector of all songs that the user has
liked.
In order to determine the degree of match between the user profile u and a not-
yet-seen item i, we can calculate the cosine similarity as shown in Equation (23.5)
and rank the items based on their similarity.
    sim(u, i) = (u · i) / (|u| |i|).                                     (23.5)
The cosine similarity between two vectors is based on the angle between them and
uses the dot product (·) and the magnitudes (|u| and |i|) of the weight vectors. The
resulting values lie between 0 and 1.
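For illustration, the following Python sketch builds TF-IDF weight vectors for keyword-annotated tracks, forms a user profile from a liked track, and ranks by cosine similarity (Equation (23.5)); the tag vocabulary and all names are made up:

```python
from collections import Counter
from math import log, sqrt

def tf_idf_vectors(docs):
    """Build TF-IDF weight vectors for a list of keyword lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse weight vectors, cf. Equation (23.5)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical keyword annotations of three tracks (e.g., genre and mood tags).
tracks = [["country", "guitar", "calm"],
          ["country", "vocal", "sad"],
          ["electronic", "dance", "energetic"]]
vecs = tf_idf_vectors(tracks)
profile = vecs[0]  # user profile: here simply the vector of the single liked track
print([round(cosine(profile, v), 2) for v in vecs])
```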
Generally, the recommendation could be viewed as a standard IR ranked retrieval
problem with the difference that we use the user profile as an input instead of a par-
ticular query. Thus, in principle, modern IR methods based, e.g., on Latent Semantic
Analysis or classical ones based on Rocchio’s relevance feedback can be employed;
see [20]. Viewed from yet a different perspective, the recommendation problem can
also be seen as a classification task, where the goal is to assess whether or not a user
will like a certain item. For such classification tasks, a number of other approaches
have been developed in the field of Information Retrieval, based, e.g., on probabilis-
tic methods, Support Vector Machines, regression techniques, and so on; see Chapter
12.
of keywords. This is not possible for music (except maybe for recommending songs
based on the lyrics), and other types of content features have to be acquired. In
principle, all the various pieces of information that can be automatically extracted
through automated music analysis, such as timbre, instruments, emotions, speed, or
audio features, can be integrated into the recommendation procedure.
In general, the content features in CB systems have to be acquired and maintained
either manually or automatically. In each case, however, the resulting annotations can
be imprecise, inconsistent or wrong. When songs are, for example, labeled manually
with a corresponding genre, the problem exists that there is not even a “gold stan-
dard” and that annotations from different sources may contradict each other.
In recent years, additional sources of information have become available with
the emergence of Semantic Web technologies, see [10, 5], and in particular with the
Social Web. In this frame, users can actively provide meta-information about items,
for instance by attaching tags to items, thereby creating so-called folksonomies. This
is referred to as Social Tagging, and it is becoming an increasingly valuable source
of additional information. Since the manual annotation process of songs does not
scale well, “crowdsourcing” the labeling and classification task is promising despite
the problems of labeling inconsistencies and noisy tags. An important aspect here is
that the tags applied to a resource not only tell us something about the resource, e.g.,
the song itself, but also about the interests and preferences of the person that tags the
item.
Besides expert-based annotation and social tagging, further approaches to anno-
tating music include Web Mining, e.g., from music blogs or by analyzing the lyrics
of songs; automated genre classification; or similarity analysis [9].
Content-based recommendation methods have their pros and cons. In contrast to
CF methods, for example, no large user community is required to generate recom-
mendations. In principle, a content-based system can start making recommenda-
tions based on one single positive implicit or explicit user feedback action or based
on a sample song or user query. More precise and more personalized recommenda-
tions can of course be made, if more information is available. The obvious disadvan-
tage of content-based methods when compared with CF methods is that the content
information has to be acquired and maintained. In that context, the additional prob-
lem arises that the available content information might not be sufficiently detailed or
discriminative to make good recommendations.
From the perspective of the user-perceived quality of the recommendations, meth-
ods based on content features by design recommend items similar to those the user
has liked in the past. Thus, recommendation lists can exhibit low diversity and may
contain items that are too similar to each other. In addition, such lists might only in
rare cases contain elements which are surprising for the user. Being able to make
such “serendipitous” and surprising recommendations is however considered as an
important quality factor of an RS. On the other hand, recommending at least a few
familiar items – as content-based systems will do – can help the user to develop trust
in the system’s capability of truly understanding the user’s preferences and tastes.
With respect to real-world systems, Pandora Music is most often cited as a
context may refer to user-independent aspects such as the time of the day or year
but also to user-specific ones such as the current geographic location or activity.
In particular, the second type of information, for instance whether the user is alone
or part of a group, becomes more and more available thanks
to GPS-enabled smartphones and corresponding Social Web applications.
Since the different basic recommendation techniques (e.g., collaborative filtering
or content-based filtering) have their advantages and disadvantages, it is a common
strategy to overcome limitations of the individual approaches by combining them in
a hybrid approach. When, for example, a new user has only rated a small number of
items so far, applying a neighborhood-based approach might not work well, because
not enough neighbors can be identified who have rated the same items. In such
a situation, one could therefore first adopt a content-based approach in which one
single item rating is enough to start and switch to a CF method later on, when the
user has rated a certain number of items. In [6], Burke identifies seven different ways
that recommenders can be combined. Jannach et al. in [20] later on organize them in
the following three more general categories:
• Monolithic designs: In such approaches, the hybrid system consists of one rec-
ommendation component which pre-processes and combines different knowledge
sources; hybridization is achieved by internally combining different techniques
that operate on the different sources (Figure 23.1).
• Parallelized designs: Here, the system consists of several components whose out-
put is aggregated to produce the final recommendation lists. An example is a
weighted design where the recommendation lists of two algorithms are combined
based on some ranking or confidence score. The above-mentioned “switching”
behavior can be seen as an extreme case of weighting (Figure 23.2); a minimal weighted combination is sketched below.
• Pipelined designs: In such systems, the recommendation process consists of mul-
tiple stages. A possible configuration could be that a first algorithm pre-filters the
available items which are then ranked by another technique in a subsequent step
(Figure 23.3).
Figure 23.1: Monolithic hybridization design: inputs are processed by a single hybrid recommender that internally combines Algorithms 1 to n to produce the recommendations.
Figure 23.2: Parallelized hybridization design: Algorithms 1 to n process the inputs in parallel and their outputs are merged in a combination step to produce the recommendations.
Figure 23.3: Pipelined hybridization design: the output of one recommendation stage serves as input to the next.
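As a minimal illustration of a parallelized (weighted) design, the following Python sketch combines the normalized scores of two hypothetical component recommenders into a single ranking; all names and score values are made up:

```python
def weighted_hybrid(score_fns, weights, user, items):
    """Parallelized (weighted) hybrid: rank items by the weighted sum of the
    normalized scores returned by several component recommenders."""
    combined = {item: sum(w * fn(user, item) for fn, w in zip(score_fns, weights))
                for item in items}
    return sorted(items, key=lambda i: combined[i], reverse=True)

# Hypothetical component recommenders returning scores in [0, 1].
cf_score = lambda user, item: {"t1": 0.9, "t2": 0.4, "t3": 0.1}[item]
cb_score = lambda user, item: {"t1": 0.2, "t2": 0.8, "t3": 0.6}[item]
print(weighted_hybrid([cf_score, cb_score], [0.6, 0.4], "alice", ["t1", "t2", "t3"]))
```

A switching hybrid would correspond to setting one of the weights to zero, for example depending on how many ratings are available for the current user.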
rarely watched again and again. Familiarity is a very specific and important fea-
ture of music. Users usually like to discover some new tracks, but at the same
time like to listen to the tracks with which they are familiar. A very specific com-
promise thus exists between familiarity and discovery for user satisfaction.
Songs are often consumed in sequence. It is important that successive songs
form a smooth transition regarding the mood, tempo or style. A good playlist thus
not only balances possible quality features like coherence, familiarity, discovery,
diversity and serendipity, but also has to provide smooth transitions.
Finally, in contrast to many other recommendation domains, tracks can be con-
sumed when doing other things. One can listen to music while working, studying,
dancing, etc., and each type of activity fits best with a different musical style.
• Feedback mechanisms: With respect to user feedback and user profiling, the con-
sumption times (listening durations), track skipping actions, and volume adjust-
ments can be used as implicit user feedback in the music recommendation domain.
On some music websites, users can furthermore actively “ban” tracks in order to
avoid listening to tracks which they actually do not like. At the same time, the
consumption frequency can be used as another feedback signal. This repeated
“consumption” of items seems to be particularly relevant for music, because it is
intuitive to assume that the tracks that the users play most frequently are the tracks
that the users like the most. This information about repeated consumption can also
be used in other domains like web browsing recommendation, but it seems to be
less relevant for these types of applications [13, 21].
Another typical feature of many music websites is that their users can create
and share playlists. Many users create such playlists,13 and the tracks in these
playlists usually correspond to the tracks the users like. Playlists can therefore
represent another valuable source for an RS to improve the user profiles.
• Data-related aspects: Music recommendation deals with very large item spaces.
Music websites usually contain tens of millions of tracks. Moreover, other kinds
of musical resources can also be recommended, like for instance concerts. Some
of these resources should, however, not appear in recommendation lists, for ex-
ample karaoke versions, tribute bands, cover versions, etc.
Furthermore, the available music metadata is in many cases noisy and hard to
process. Users often misspell or type inappropriate metadata, as for instance “!!!”
as an artist’s name. At the same time, dozens of bands, artists, albums, and tracks
can have identical names, which not only makes the interpretation of a user query
challenging, but can also lead to problems when organizing and retrieving tracks
based on the metadata.
• Psychological questions and the cost of wrong recommendations: From a psycho-
logical perspective, music represents a popular means of self-expression. With
respect to today’s Social Web sites, the question arises if one can trust that all the
positive feedback statements on such platforms are true expressions of what users
think and what they really like. Additionally, in the music domain, there exists a
Variations of this scheme, e.g., concerning the logarithmic base, are also common
in the literature. Usually, the DCG is also normalized and divided by the score of the
“optimal” ranking so that finally the values of the normalized DCG lie between zero
and one.
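As a small, self-contained illustration of the normalized DCG idea, the following Python sketch uses the common variant with log2 discounting and divides by the score of the ideal ordering; the relevance values are made up, and the exact scheme defined in the chapter may differ in detail:

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain with log2 discounting (one common variant)."""
    return sum(rel / log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance of the recommended tracks at ranks 1..5 (hypothetical, binary here).
print(round(ndcg([1, 0, 1, 1, 0]), 3))
```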
Note that other domain-specific or problem-specific schemes are possible. In the
2011 KDD Cup,15 the task was to separate highly rated music items from non-rated
items given a test set consisting of six tracks, out of which 3 were highly rated and 3
were not rated by the user.
sales or download numbers, or, when the goal is to measure the diversity of the
recommendations, the number of appearances of a song in a recommendation list.
The term long tail refers to the fact that in many domains – and in particular the
music domain – some very few popular items account for a large amount of the sales
volume. In [8], Celma cites numbers from a 2007 report on the state of the industry
in music consumption, according to which 1% of all available tracks were reported to
be responsible for about 80% of the sales and nearly 80% of about 570,000 tracks
were purchased fewer than 100 times.
Given this skewed distribution toward popular songs and mainstream artists, it
could therefore be – according to marketing theory – a goal to increase sales of items
in the long tail. Recommender systems are one possible method to achieve such a
goal and studies such as [38] or [12] have analyzed how recommenders impact the
buying behavior of customers and the overall sales diversity. On the one hand, one can observe that in some domains a recommender helps the user to better explore the item space and find new items he or she was not aware of. On the other hand, there is a danger that, depending on the underlying strategy and algorithm, a recommender system further boosts already very popular items, since recommending blockbusters to everyone is a comparably safe strategy.
can become outdated very quickly as well. Another perspective on adaptivity is the rate at which the system takes changes and additions to the user profile into account. When users rate tracks, they might expect that their preferences are reflected immediately, which might not be the case if the underlying algorithm requires a computationally expensive training phase and models are only updated, e.g., once a day.
Scalability is a characteristic of recommender systems which is particularly relevant for situations where we have to deal with millions of items and several million users, as is the case in music recommendation. Techniques such as nearest-neighbor algorithms do not scale even to problems of modest size. Therefore, only techniques which rely on offline pre-computation and model-building work in practice.
Robustness typically refers to the resistance of a system against attacks by malev-
olent users who want to push or “nuke” certain artists. Recent works such as [28]
have shown that, for example, nearest-neighbor algorithms can be vulnerable to var-
ious types of attacks, whereas model-based approaches are often more robust in that
respect.
methods are, for instance, not always capable of taking the cultural background of
a track into account, except for cases in which cultural information can be extracted
from tags or metadata annotations.
Kaminskas and Ricci classify the possible contextual factors in three major groups
[22]:
• environment-related context, e.g., location, time or weather;
• user-related context, e.g., the current activity, demographic information (even though this can be considered part of the user profile and provides information about the environment), and the emotional state; and
• multimedia context, which relates to the idea of combining music with other cor-
responding resources such as images or stories.
In the same work, the authors review a set of context-aware prototypical music
recommenders and experimental studies in this area. They conclude that research
so far is “data-driven” and that researchers often tend to fuse given contextual in-
formation into their machine learning techniques. Instead, the authors advocate a
“knowledge-based” approach, where expertise, e.g., from the field of psychology,
about the relationship between individual contextual factors and musical perception
is integrated into the recommendation systems.
small-scale data collection within a defined user group and vocabulary, harvesting
social tags from online music sites, and using tagging games or different strate-
gies to automatically mine tags from other web sources. A complementary ap-
proach to harmonizing the vocabulary on tagging-enabled platforms is the use of
“tag-recommenders”. Such recommenders can already be found on today’s resource
sharing platforms such as delicious.com and make tagging suggestions to the users
based on RS technology.
As discussed in [24], user-provided tags may carry different types of valuable
information about tracks such as genre, mood or instrumentation that can help to
address some challenging tasks in Music Information Retrieval such as similarity
calculation, clustering, (faceted) search, music discovery and, of course, music rec-
ommendation. The major research challenges include, however, the detection and removal of noise and, as usual, cold-start problems and the lack of data for niche items.
The authors of [15] proposed a frequent pattern mining approach where patterns of latent topics are ex-
tracted from user playlists and then used to compute recommendations. The authors
of [14] exploited long-term preferences including artists liked on Facebook and us-
age data on Spotify to build playlists which consist of the most popular tracks of the
artists who are the most similar to the artists the user likes. A statistical approach
was presented in [27], where random walks on a hypergraph are used to iteratively
select similar tracks. In the same spirit, [11] proposed a sophisticated Markov model
in which tracks are represented as points in the Euclidean space and transition prob-
abilities are derived from the corresponding Euclidean distances. The coordinates of
the tracks are learned using a likelihood maximization heuristic. The authors of [19]
went further by taking into account the characteristics of the tracks that are already
in the playlist, and proposed a heuristic which tries to mimic these characteristics in
the generation process.
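The following sketch illustrates only the core idea of deriving transition probabilities from Euclidean distances between track coordinates; it is not the actual model of [11], its likelihood-maximization training, or any particular system (coordinates and the softmax weighting are our own choices):

```python
import numpy as np

def transition_probs(coords, current):
    """P(next | current), decreasing with the squared Euclidean distance (softmax)."""
    d2 = np.sum((coords - coords[current]) ** 2, axis=1)
    w = np.exp(-d2)
    w[current] = 0.0                      # do not repeat the current track
    return w / w.sum()

# Toy example: five tracks embedded in a 2-D space (coordinates are made up)
coords = np.array([[0.0, 0.0], [0.1, 0.2], [1.5, 1.4], [1.6, 1.5], [0.2, 0.1]])
rng = np.random.default_rng(1)
playlist = [0]
for _ in range(4):                        # sample a short playlist continuation
    p = transition_probs(coords, playlist[-1])
    playlist.append(int(rng.choice(len(coords), p=p)))
print(playlist)
```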
Bibliography
[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender
systems: A survey of the state-of-the-art and possible extensions. IEEE Trans-
actions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2] G. Adomavicius and A. Tuzhilin. Context-aware recommender systems. In
F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, eds., Recommender Systems
Handbook, pp. 217–253. Springer, 2011.
[3] D. Baur, S. Boring, and A. Butz. Rush: Repeated Recommendations on Mobile
Devices. In Proc. IUI, pp. 91–100, New York, NY, USA, 2010. ACM.
[4] E. Bernhardsson. Systems and methods of selecting content items using latent
vectors, August 18 2015. US Patent 9,110,955.
[5] G. Bonnin and D. Jannach. Automated generation of music playlists: Survey
and experiments. ACM Computing Surveys, 47(2):1–35, 2014.
[6] R. Burke. Hybrid recommender systems: Survey and experiments. User Mod-
eling and User-Adapted Interaction, 12(4):331–370, 2002.
[7] M. Burns. The Pandora One subscription service to cost $5 a
month, http://techcrunch.com/2014/03/18/the-pandora-one-
subscription-service-to-cost-5-a-month/, 2014, accessed February
2016.
[8] Ò. Celma. Music Recommendation and Discovery - The Long Tail, Long Fail,
and Long Play in the Digital Music Space. Springer, 2010.
[9] Ò. Celma and P. B. Lamere. Music recommendation tutorial. ISMIR’07, http://ocelma.net/MusicRecommendationTutorial-ISMIR2007/slides/music-rec-ismir2007-low.pdf, September 2007, accessed February 2016.
[10] Ò. Celma and X. Serra. FOAFing the music: Bridging the semantic gap in
music recommendation. Journal of Web Semantics, 6(4):250–256, 2008.
[11] S. Chen, J. Moore, D. Turnbull, and T. Joachims. Playlist prediction via metric
embedding. In Proc. KDD, pp. 714–722, New York, NY, USA, 2012. ACM.
[12] D. M. Fleder and K. Hosanagar. Recommender systems and their impact on
sales diversity. In Proceedings of the 8th ACM Conference on Electronic Com-
merce (EC’07), pp. 192–199, New York, NY, USA, 2007. ACM.
[13] S. Fox, K. Karnawat, M. Mydland, S. Dumais, and T. White. Evaluating im-
plicit measures to improve web search. ACM Transactions On Information
Systems (TOIS), 23(2):147–168, 2005.
[14] A. Germain and J. Chakareski. Spotify me: Facebook-assisted automatic
playlist generation. In Proc. MMSP, pp. 25–28, Piscataway, NJ, USA, 2013.
IEEE.
[15] N. Hariri, B. Mobasher, and R. Burke. Context-aware music recommendation
based on latent topic sequential patterns. In Proc. RecSys, pp. 131–138, New
York, NY, USA, 2012. ACM.
perspective: Survey of the state of the art. User Model. User-Adapt. Interact.,
22(4-5):317–355, 2012.
[32] P. Resnick, N. Iacovou, M. Suchak, P. Bergstorm, and J. Riedl. Grouplens: An
open architecture for collaborative filtering of netnews. In Proceedings of the
1994 ACM Conference on Computer Supported Cooperative Work (CSCW’94),
pp. 175–186, New York, NY, USA, 1994. ACM.
[33] F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, eds. Recommender Systems
Handbook. Springer, 2011.
[34] G. Shani and A. Gunawardana. Evaluating recommendation systems. In Rec-
ommender Systems Handbook, pp. 257–297. Springer, 2011.
[35] U. Shardanand and P. Maes. Social information filtering: Algorithms for au-
tomating word of mouth. In Proceedings of the SIGCHI Conference on Human
Factors in Computing Systems, CHI ’95, pp. 210–217, New York, NY, USA,
1995. ACM.
[36] D. Turnbull, L. Barrington, and G. R. G. Lanckriet. Five approaches to collect-
ing tags for music. In Proc. ISMIR 2008, pp. 225–230, Philadelphia, 2008.
[37] M. Vacher. Introducing Discover Weekly: Your ultimate personalised
playlist, https://press.spotify.com/it/2015/07/20/introducing-
discover-weekly-your-ultimate-personalised-playlist, accessed
February 2016.
[38] M. Zanker, M. Bricman, S. Gordea, D. Jannach, and M. Jessenitschnig. Per-
suasive online-selling in quality & taste domains. In Proceedings of the 7th
International Conference on Electronic Commerce and Web Technologies (EC-
Web’06), pp. 51–60, Heidelberg / Berlin, 2006. Springer Verlag.
Chapter 24
Automatic Composition
24.1 Introduction
While most chapters of this book deal with analyzing given music, this one gives
an introduction to synthesis tasks called automatic (or, in a broader sense, algorith-
mic) composition. We start with a discussion of what a composer does and what
a composition actually is (or may be). There are some suggestions as to why the
act of composing could (or should) be automatic. After a short outline of historical
examples, we are going to show some of the basic principles composing computers
work with, giving a broad rather than an in-depth overview. Particular software applications are numerous and subject to constant change, and are therefore beyond the scope of this chapter.
24.2 Composition
24.2.1 What Composers Do
In order to understand how computers compose music, one should first try to find
out how humans compose music. As early as 1959, in their book on composing with
computers, Hiller and Isaacson state that “the act of composing can be thought of as
the extraction of order out of a chaotic multitude of available possibilities” [8, p. 1].
These possibilities include all kinds of sounds, at least within the range of human
audibility. The choice of sonic material and its required features is therefore one of
the basic compositional activities.
If a composer decides to work mainly with pitched sounds, the next decision
affects the number of pitches to be used. The continuous pitch range of the sound
spectrum can be reduced to a finite set of pitches or pitch classes (cf. Section 3.5).
These are commonly known as scales. Similarly, the time continuum may be subdi-
vided into units to form a grid of sound durations, which can then be combined to
rhythms or meters (cf. Section 3.6). These grids are necessary to write down music
in notes. The musical notes commonly used today are a symbolic language. They
do not store the music itself, “but rather instructions for musicians who learn which
actions to perform from these symbols in order to play the music” [19, p. 4]. Notes
can only describe a few features of the sound; others (e.g. the amplitudes of its overtones) cannot be represented, but they can be notated as additional indications alongside the notes. The invention of MIDI (cf. Section 7.2.3) made it possible for computers to process music notation, and the notes could then be interpreted by musicians as well as by computers. As it is obviously easier to program a computer to compose with a finite set of, say, twelve pitches (e.g. the chromatic scale) than with an infinite number of possible pitches, most composing software works with notes rather than with sounds, although this is not essential (cf. [2]).
they can control the production of any requested sound features (e.g. duration, pitch,
overtone spectrum) on a microscopic level, which leads to possibilities that even the
symphonic orchestra with its variety of instruments cannot provide. “But the larger
the inventory the greater the chances of producing pieces beyond the human thresh-
old of comprehension and enjoyment. For instance, it is perfectly within our hearing
capacity to use scales of 24 notes within an octave but more than this would only
create difficulties for both the composer and listener” [19, p. 13]. Moreover, “one of
the most difficult issues in composition is to find the right balance between repetition
and diversity” [19, p. 13].
On the whole, composing could be regarded as a chain of decisions and in this
way be compared to the structure of computer programs, but this analogy raises some
more questions. Certainly a human composer does not make every single decision
fully consciously. Many of them may be made unconsciously, following traditions and models absorbed in the past. Original ideas and new combinations, or clichés
and stereotypes may arise – intended or not. Other decisions are made as a result of
previously chosen rules and styles or with regard to the intended listeners or markets.
If you want to compose a twelve-tone piece or decide to write a pop song, you do not
have to care about quarter tones or white noise. Finally, the order in which decisions are taken is obviously not always the same. The ways of composing may go bottom-up, from a detailed idea to a large-scale work, or top-down, from an overall plan to the single sound. In most cases, there will be a constant interplay between these directions.
applies traditional or self-made rules. As long as the music simply does what the
composer wants it to do, there is little surprise in the resulting sound. But what if the
composer tries to stand aside and “let the music do what it wants to do” or even “to
let the music compose itself” [12, p. 2]? When Steve Reich lets microphones swing
over loudspeakers, when Paul Panhuysen stretches long wires across a lake or Peter
Ablinger plants a row of trees in the open landscape, the composer seems to vanish
behind his work, which can hardly be called a piece in the usual sense, and the music, as a result, is found rather than composed. “In all these cases the ‘composers’ are [. . . ]
simply letting music arise out of circumstances that they can not personally control”
[12, p. 2]. Such music is not intended to be performed on traditional musical instruments, so there are no conventional notes; it has mainly to do with sound. Composers like Tom Johnson, on the other hand, try to find
music in mathematical objects like Pascal’s triangle. These objects deliver a set of
numbers which the composer can transform into music, mapping them to concrete
pitches and rests to make their inherent structure audible. This kind of music can
be performed on traditional musical instruments, and it has mainly to do with notes
(cf. [13]). Likewise, algorithmic composition is divided into two domains, generating
scores or generating sounds. The generated scores may be written out in notes so that
musicians can perform them on classical instruments while the generated sounds can
only be heard through loudspeakers (cf. [24]).
Computers can be used at any step during the process of composing. They can
perform precise calculations just to spare the composer time and effort. They can
help to learn rules typical of a certain time period, musical style, or a composer’s personal style. Or they can assist in creating an automatic accompaniment or even
completely new music based on a mixture of rules from different genres or periods.
Moreover, computer programs could be set up to compose automatically or even
autonomously. In doing so, computers can serve as supporting devices, or they can produce compositions or meta-compositions (cf. [1]).
based) algorithmic music is the output of a stand-alone program, without user con-
trols, with musical content determined by the seeding of the random number genera-
tor [. . . ]. Most systems allow for some sort of control, however, through inputs to the
algorithm, or live controls to a running process. Interactive music systems take ad-
vantage of algorithmic routines to produce output influenced by their environments,
while live coders burrow around inside running algorithms, modifying them from
within” [1, p. 300]. One application could be live composition in real time, interac-
tive or on demand, for computer games or web sites, to influence the user’s current
mood or to create a new experience (cf. [26]).
Algorithmic composition makes use of mathematical statements and production
rules, which are used within (computer) programs (cf. [28]). Algorithms may shape the music from its micro to its macro structures. They can cover the whole range from the generation of sounds or notes, through aspects like timbre or expression, to the musical form (cf. [28]). If an algorithm models traditional composition rules, the re-
sult will also sound traditional. Inventing new algorithmic rules or taking over those
from outside the realm of music, on the other hand, can lead to completely new mu-
sical forms and structures (cf. [24]). Adapting algorithms from the natural sciences
may simply arise from the curiosity to find new means of musical expression. But
in a Platonic sense, it is also about investigating and explaining the world around us
in musical terms, and thus is linked with historical concepts like the music of the
spheres. “In any case, the modeling of natural processes requires the use of comput-
ers owing to the large number of mathematical calculations needed. The computer is
not only used as a compositional facilitator – it becomes a necessity” [24, p. 53].
ab
bab
abbab
bababbab and so forth.
These structures cannot be derived from traditional compositional rules (cf. [24]
and [14]).
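The two rewriting rules a → b and b → ab reproduce the strings listed above (this is our reconstruction from the listed sequence; the original rules may have been stated differently). A minimal implementation:

```python
RULES = {"a": "b", "b": "ab"}   # assumed rules; they reproduce the sequence above

def l_system(axiom, rules, iterations):
    """Apply the rewriting rules simultaneously to every symbol, n times."""
    word = axiom
    for _ in range(iterations):
        word = "".join(rules.get(symbol, symbol) for symbol in word)
    return word

for n in range(4):
    print(l_system("ab", RULES, n))   # ab, bab, abbab, bababbab, ...
```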
Bear in mind that one may map these variables not only to notes, but to any musi-
cal structures. Think for instance of the first two bars of a famous German children’s
song, as shown in Figure 24.1. This is just to keep things simple and continues a
tradition of using folk songs as starting material for manageable experiments.1
[Figure 24.1: the first two bars of the song, with the two motifs labeled a and b.]
Applying the above L-system to the motifs in Figure 24.1 yields the rather dull, yet quite long, melody2 shown in Figure 24.2.
Of course there are also infinite automata, which go beyond such finite sets. If you apply them, for instance, to pitches, the music will at some point no longer be playable or even audible, so you may want to stop the algorithm somewhere.
1 Music pedagogue Fritz Jöde remarked in the 1920s: “Proceeding beyond this kind of folklore should
only happen when the child has developed so far that its perceiving organs have grown far enough for
further designs which exceed the simple folk-like structure that it could really perceive the substance”
[11, translation by the authors, p. 108].
2 If you want to work in this direction, compare Helmut Lachenmann’s “Ein Kinderspiel” [17], which
These automata could, e.g., be fractal processes which have been adapted from chaos
theory (cf. [20]). The term fractal was coined by Benoit Mandelbrot to describe self-
similar patterns. Common examples are the growth of trees or snowflakes.
Cellular automata are another example of algorithms that have been adapted by
composers from the natural sciences. Cellular automata have been designed to de-
scribe the behavior of particles in liquids. “Individual particles influence each other
by knocking against one another or changing places, etc., thereby defining the move-
ment of the fluid as a whole. The easily comprehensible, elementary effects within
a cell’s immediate area can have an unexpected effect on the system as a whole,
because each cell can simultaneously influence and be influenced by several of its
neighbors” [24, p. 52].
Figure 24.4: Five notes with equal probability in the rhythm of the original song.
the values in Table 24.1. Letting this feature of the old song be the distribution of
notes in a new song, we could get a melody like the one in Figure 24.6.
The next step in building a model for melody generation is to observe not the
frequencies of single notes, but those of two-note combinations instead. We count
Figure 24.6: Melody with a distribution of notes as observed in the original melody.
the frequencies of pairs of notes and, after normalization, get the second-level feature shown in
Table 24.2. The columns represent the first note, and the rows the second note of a
two-note pair.
Table 24.2: Relative Frequencies of Note Pairs Observed in the Original Song
C D E F G
C 0 0.042 0.042 0 0
D 0.063 0.146 0.042 0 0
E 0 0 0.146 0.125 0.042
F 0 0.063 0.020 0 0.042
G 0.042 0 0.063 0 0.125
In this way, we could model a series of events that are linked with probability
Figure 24.7: Melody with a distribution of note pairs as observed in the original
song.
values using Markov chains (cf. Example 9.24). To emulate a human composer with
the computer, we have to take the balance of probabilities into consideration. Since
the pitches within a diatonic scale are not uniformly distributed, we could not use
standard uniform random numbers to pick them. A perfect die gives equal chances for every number, and no event has any influence on any of the following events. But
within a diatonic scale, the probabilities differ for each note, and the selection of a
note has an influence on the following ones. The tonic, for instance, will be found
more often than the dominant. In a C major scale, we may find the note G quite often
since it belongs to both the tonic and the dominant chord [1]. In order to go beyond
independent draws we could create a discrete state space where each state has an
associated probability. Since each outcome must represent one state, the sum of all
probabilities is always 1. If the current state depends on no prior states, we have a zero-order Markov chain, the result being pure chance. In Markov chains of higher
order, the calculations may require very large tables.
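Using the values of Table 24.2 as first-order transition weights, a melody in the spirit of Figure 24.7 could be sampled roughly as follows; this is a sketch, not the exact procedure used for the figure, and the starting note and length are arbitrary choices (each column, i.e. each first note, is normalized before sampling, since the table gives only relative weights):

```python
import numpy as np

NOTES = ["C", "D", "E", "F", "G"]
# Table 24.2: rows = second note, columns = first note
PAIRS = np.array([
    [0.000, 0.042, 0.042, 0.000, 0.000],   # second note C
    [0.063, 0.146, 0.042, 0.000, 0.000],   # second note D
    [0.000, 0.000, 0.146, 0.125, 0.042],   # second note E
    [0.000, 0.063, 0.020, 0.000, 0.042],   # second note F
    [0.042, 0.000, 0.063, 0.000, 0.125],   # second note G
])

def generate(start="G", length=16, seed=0):
    """First-order Markov chain: sample each note given the previous one."""
    rng = np.random.default_rng(seed)
    melody = [start]
    for _ in range(length - 1):
        column = PAIRS[:, NOTES.index(melody[-1])]   # weights for the next note
        melody.append(rng.choice(NOTES, p=column / column.sum()))
    return melody

print(" ".join(generate()))
```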
If we want to generate a melody to be performed on a certain instrument, we have
to take care that the pitches remain within the pitch range of this instrument. From
the lowest pitch, we can only move upwards, and from the highest pitch we can only
move downwards. We could define that most of the notes should be in the middle of
the pitch range by assigning them higher probability values than the notes close to
the boundaries, or we could advise the program to stop once a boundary is met (cf.
[28]).
Stochastic processes can even be used for sound generation in Granular Synthe-
sis. Here, sounds are made from grains – sound particles too short to be perceived
by ear separately. Probability functions control the density of grains in clouds, and
hereby the spectra of resulting sounds (cf. [22]). Xenakis’s GENDY3 forms another
approach to stochastic synthesis. Here he directly calculates segments of waveforms
in order to get new, unexpected sounds (cf. [23]).
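A minimal illustration of the grain-cloud idea (not Xenakis’s GENDY3; the densities, grain durations, and frequency range are arbitrary choices): grains are short windowed sinusoids whose onsets are drawn from a probability distribution controlling the density of the cloud.

```python
import numpy as np

SR = 44100                      # sample rate in Hz

def grain(freq, dur=0.02):
    """A single grain: a short sinusoid under a Hann window."""
    t = np.arange(int(dur * SR)) / SR
    return np.sin(2 * np.pi * freq * t) * np.hanning(t.size)

def cloud(length=2.0, density=200.0, seed=0):
    """Scatter grains with exponentially distributed inter-onset times."""
    rng = np.random.default_rng(seed)
    out = np.zeros(int(length * SR))
    onset = 0.0
    while onset < length:
        g = grain(rng.uniform(200.0, 2000.0))        # random grain frequency
        start = int(onset * SR)
        end = min(start + g.size, out.size)
        out[start:end] += g[: end - start]
        onset += rng.exponential(1.0 / density)      # density = grains per second
    return out / np.max(np.abs(out))                 # normalize

signal = cloud()   # write to a WAV file or play back with an audio library
```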
[Figure: fragment of a melody-generation flowchart. The interval between consecutive notes N_(n-1) and N_n is computed and tested against the condition “Interval > Fourth?”, with yes/no branches and an end state.]
the dominant on G, while A’ ends with the tonic C. Basically, there are four motifs. Two of them (a and d) are transformed by transposition upwards or downwards.
[Figure: formal structure of the song, with sections A, B (from measure 6), and A’ (from measure 11) built from the motifs a, a’, b, c, d, and d’.]
Let’s pick some more motifs from the children’s songs “Frère Jacques” and
“Twinkle, Twinkle, Little Star”3 (cf. Figure 24.11).
[Figure 24.11: four additional motifs, labeled e, f, g, and h.]
Now we could recombine all these following the structure of “Hänschen klein”,
choosing motifs randomly out of the following lists:
a → { a, e, f }
b → { b, g }
c → { c, h }
d → { b, c, d, g, h }
One example of how these rules can be applied to musical notes is shown in Figure
24.12.
Of course, not only transposition but all basic transformation methods can be applied. As you can see from our example, formal grammars can be described in terms of finite automata, which helps to get them into computer code (cf. [19]).
3 If you want to use different melodies as a database, you may have to transpose them into the same
key.
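A sketch of such a recombination in code, using the substitution lists given above; the template sequence of abstract motifs is purely illustrative and is not the exact motif sequence of the song, and transposed variants such as a’ are simply treated as following their base motif’s list while keeping the prime:

```python
import random

# Substitution lists from the text; a primed motif (e.g. a') follows the list
# of its base motif and keeps the prime (i.e. stays transposed).
SUBSTITUTIONS = {
    "a": ["a", "e", "f"],
    "b": ["b", "g"],
    "c": ["c", "h"],
    "d": ["b", "c", "d", "g", "h"],
}

def recombine(template, seed=None):
    """Replace every motif symbol in the template by a random alternative."""
    rng = random.Random(seed)
    new = []
    for symbol in template:
        base, prime = symbol.rstrip("'"), "'" * symbol.count("'")
        new.append(rng.choice(SUBSTITUTIONS.get(base, [base])) + prime)
    return new

# Illustrative template only; not the exact motif sequence of the song.
template = ["a", "a'", "b", "a'", "c", "d"]
print(recombine(template, seed=3))
```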
Figure 24.12: New melody from old motifs using a simple formal grammar (sections A, B, and A’, now built from the motifs f, f’, b, c, c’, and h).
[Figure: a parent and three offspring (Offspring 1, 2, and 3).]
processes. David Cope, for instance, uses AI to extract databases of a musical style,
on which processes of formal grammar work to (re)compose a new work (cf. [4]).
Bibliography
[1] N. Collins. Introduction to Computer Music. Wiley, Chichester, 2010.
Part IV
Implementation
Chapter 25
Implementation Architectures
MARTIN BOTTECK
Fachhochschule Südwestfalen, Meschede, Germany
25.1 Introduction
This chapter’s author started to investigate music signal processing whilst being a
member of a mobile phone manufacturer’s corporate research unit. During the early
2000s, mobile phone devices were beginning to be equipped with music players and
large enough flash storage in order to accommodate music collections that were hard
to sort and navigate. Our idea was to utilize music signal processing techniques to help present the contents of a music collection on a mobile device in a way that takes account of the listening experience connected to each track, i.e. to provide the user with automatically generated, personalized track lists. This will be the topic of this chapter.
Since the computational effort required to execute the algorithms exceeded a mo-
bile device’s capabilities during the early 2000s by far, Linux computing grids were
used instead.1 Despite much effort in research on processing concepts demanding
less computation power, this concept still has not found its way into today’s products.
A concise description of the considerable achievements can be found in Chapter 27.
The research work was complemented by activities designed to separate a major part
of the calculations and perform them on dedicated servers outside the mobile device.
So far, these attempts seem to have faltered. Sorting your private music collection via a dedicated service on the network is still not a very popular application, despite the fact that several attempts at commercialization have indeed been made
(cf. references given in Section 25.3). There exist, however, a few services on the
Internet that offer small applications based on music signal processing techniques.
Some of them may be used in combination with each other, thereby offering more
complex functionality. This concept deserves a closer look.
1 The experiments were carried out as part of the Music Information Retrieval Evaluation eXchange
(MIREX) coordinated by the Graduate School of Library Information Science, University of Illinois at
Urbana-Champaign, www.music-ir.org. Accessed 22 June 2016.
2 “Efficiency” seems to not be uniquely defined though; in some cases it seems reasonable to relate to
specific “effort”-determining measures which may include chip size, number of electronic components,
amount of data exchange, power consumption, computation time, or any combination of these.
the desired association of tracks. The classes themselves may be defined in several
ways in order to create track lists according to the desired listening experience.
Please refer to Section 12.3 for further explanation of this process and the mean-
ing of key terms.
Figure 25.1: Processing chain for music classification [4].
3 Many music features rely on the spectrum of the musical signal. Then, simply the number of calcula-
tions required for Fourier transform (DFT/FFT) already constitutes a substantial computational challenge.
ease of use, scalability, decreasing marginal cost, plus a concept to establish a critical
minimum market presence to begin with ([8], [6]).
For the time being, the influence of economic considerations on implementation
architectures can be summarized by a list of questions to be addressed in evaluating
implementation proposals:
• How much effort is to be spent for implementation? Who will contribute to this?
The “cheapest” solution will not necessarily win here. Very costly concepts, however, impose higher risks for investors, and it might be more difficult to raise the respective funds.
• How will revenue be created and from whom? As of today, end customers are not accustomed to paying directly for information in digital format. Commercially successful Internet companies instead monetize information about these customers’ behavior. Hence, revenue models will be considerably more complicated than they used to be in the past.
• Which legal or commercial regulations are required? Copyright laws were estab-
lished in order to provide revenue opportunities for creative artists. They are not a
law of nature but rather a mere agreement between people in our society. In view
of technological advances, these copy protection rules might seem not to be tech-
nically enforceable any longer. Consequently, the law might change and different
regulations might be established in order to retain some commercial incentive for
creating art.
• What happens in case one of these crucial regulations ceases to exist?
Specifically with respect to the latter aspect, architectures with ample potential for
ad hoc change, rapid update, and further development provide clear benefits.
Three different approaches for implementing a music processing solution as pro-
posed in previous chapters can be identified and will be discussed.
A mobile device’s computational capabilities will remain limited due to limits in power consumption, even more so when compared to dedicated computing servers that may be located somewhere in the IP network.
Considerations on suitable device architectures for this approach will be dis-
cussed in more detail in a subsequent chapter (Chapter 27), thereby addressing a
wide range of alternative processor types and integrated circuit solutions.
4 Although it is indeed possible to limit the requirements on maximum delay and delay variations by
implementing playback buffers or progressive download techniques, it should be noted here that these
compensation techniques will degrade user experience to some extent: at least when changing tracks the
user cannot avoid rather long reaction times of the user interface.
25.3 Applications
The following sections present architectural concepts for a selected set of applica-
tions.
5 A music player for PCs released by mufin in 2011 was discontinued. The company since then
provides products for music recognition instead: www.mufin.com. Accessed 22 June 2016.
More popular systems understand “interesting” in a much wider way: they take
into account other users’ music choice. iTunes Genius6 and last.fm7 provide rec-
ommendations beyond the scope of music stored in the end customer’s local database. Suitable implementation architectures rely on a network server-based processing concept, thus making use of information stored remotely as well as abundant processing power available on centralized servers.
rather tight coupling between the application on the end customer device and the
service provider’s music data base. Specifically, iTunes Genius will not recommend
tracks outside the iTunes Store. Furthermore, these systems as of today seem to
mostly rely on metadata information (artist, title, album) rather than on musical con-
tent or listening experience. Users with very individual listening habits often remain unsatisfied, since their preferred music might not yet be labeled, or might hardly have been noticed, by other users [5].
A system that makes use of the recommendation concepts already provided and extends them with classification techniques based on musical properties leads to a distributed architecture. One such example is the Shazam8 application: based on a
snippet of audio, it presents a list of recognized original material (the most likely
metadata) including hyperlinks to the iTunes Store. Coupling of this information is
possible across an interface agreed on between the application providers, in this case
a solution proprietary to Shazam and Apple.
6 http://www.apple.com/legal/internet-services/itunes/de/genius.html. Accessed
22 June 2016.
7 www.last.fm. Accessed 22 June 2016.
8 www.shazam.com. Accessed 22 June 2016.
wider range of music tracks. Further developments of applications for music recog-
nition consequently focused on mobile devices. Music recognition is seen as a wel-
come extension of the limited user interface of such devices. Specifically, the Shazam
app has gained wide acceptance on the market. A popular use case is to let the app
recognize music played in commercials in order to subsequently purchase the track
from the iTunes Store by just one further click.9 In this case, the system uses a dis-
tributed architecture thus developing a new use case and revenue opportunity. It shall
be noted here that Shazam and Apple do not share their revenues; each application
provider collects fees for its service separately.
Only the distributed architecture approach has survived on the market in this
case. The service originally presented by Vodafone was discontinued only months after its introduction due to lack of customer interest.10 Seemingly, the music database provided by these services did not offer enough to tempt people to start browsing through it. Connecting the well-established iTunes Store to
another application (music recognition by Shazam) instead raised enough interest to
exceed a critical minimum market acceptance.
9 There are examples of several tracks or artists that became hits or stars through just this scenario:
found only 200,000 German customers in 2013 [9]. The service was relaunched in 2014 after Ampya’s
acquisition by Deezer.
11 This “yet another remote location” might be present in a completely different network segment.
Imagine a moderately powerful server in your home network rather than something out on the public
Internet.
are given. A much more elaborate concept was presented by McKay in [7]. Its complexity, however, tends to produce rather large data sets. McKay intends to directly specify algorithmic behavior or list the computation results (feature values). Behavioral description in languages intended for execution (like e.g. Java, Python, etc.) promises to be much more compact than this approach (a small sketch of this idea follows the list below). Anyhow, with the availabil-
ity of universally agreed feature references, an abundant range of novel applications
and scenarios would become possible:
• Existing music recommendation services could then be extended to provide track lists based on perceptual features, not only on metadata (similar to the Shazam–iTunes case).
• People could trade “Musical Taste”. For example, why shouldn’t a number of
customers be interested in celebrities’ musical preferences?
• Individual users’ “Musical Taste” could provide them with recommendations from virtually any sort of music collection. The amount of music published in some form per month grows continuously, so there is a lot of good music out there for everyone; discovering it will be the problem of the future.
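As a toy illustration of the “behavioral description” idea mentioned before the list, everything below is hypothetical: the feature name, the fields of the descriptor, and the computation itself are our own choices and do not correspond to any existing format or standardized feature reference.

```python
import numpy as np

# Hypothetical self-describing feature: metadata plus executable behavior.
FEATURE = {
    "name": "rms_energy",          # made-up identifier, not a standardized reference
    "version": "0.1",
    "frame_size": 1024,
    "compute": lambda frame: float(np.sqrt(np.mean(np.square(frame)))),
}

def extract(signal, feature):
    """Apply the feature's compute function to consecutive, non-overlapping frames."""
    n = feature["frame_size"]
    frames = [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
    return [feature["compute"](f) for f in frames]

values = extract(np.random.default_rng(0).normal(size=44100), FEATURE)
```

Such a descriptor could be exchanged between computing nodes and executed wherever the audio data happens to reside.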
The next chapter on user interaction (Chapter 26), amongst other things, covers op-
portunities to utilize context information. Depending on the end user’s listening
situation, recommendations for music or maybe the behavior of the user interface
as such might change: why not propose a different type of music when commuting
back from work than during physical exercise at the gym? Maybe even choose tracks with a beat matching your current heartbeat or workout rhythm?
Mobile devices typically provide several types of sensors (GPS position, ac-
celerometer, compass, . . . ) which deliver basic signals. To determine the “listening
context” from these signals constitutes a classification task in itself, which shall not
be discussed here. However, “listening context” needs to be available in a machine-
readable form in order to adapt classification parameters or recommendation settings.
In its most straightforward realization, a specific listening context could be described
by a reference to a specific “Musical Taste”. Again, we face the challenge of main-
taining link references here (as we already have seen in the context of MP3 files (see
Section 7.3.3) and metadata information). However, if a formal description became available, possibly as part of the generic computing platform or script engine described above, a further set of use cases might be developed, providing an even more intriguing listening experience.
the beginning. With the advent of formal notations to exchange algorithms and/or
processing results between computing nodes, a rapid development of an abundant
range of novel applications for music analysis can be foreseen. Its pace will by far
outrun the development of concepts that require all processing to reside on a single
office computer or mobile device.
Bibliography
[1] IEEE Architecture Working Group. IEEE Recommended Practice for Architectural Description of Software-Intensive Systems. IEEE, New York, 2000.
[2] L. Bass and R. Kazman. Software Architecture in Practice. Addison Wesley
Pearson, New York, 2012.
[3] R. B’Far. Mobile Computing Principles. Cambridge University Press, 2004.
[4] H. Blume, M. Botteck, M. Haller, and W. Theimer. Perceptual Feature based
Music Classification: A DSP Perspective for a New Type of Application. In
Proceedings of the 8th International Workshop SAMOS VIII; Embedded Com-
puter Systems: Architectures, Modeling, and Simulation. IEEE, 2008.
[5] O. Celma. Music Recommendation and Discovery, pp. 5–6. Springer Science
and Business Media, 2010.
[6] R. G. Cooper. Winning at New Products. Perseus HarperCollins, New York,
2001.
[7] C. McKay and I. Fujinaga. Expressing musical features, class labels, ontologies, and metadata using ACE XML 2.0. In J. Stein, ed., Structuring Music through Markup Language: Designs and Architectures, pp. 48–79. IGI Global, Hershey, 2013.
[8] G. A. Moore. Crossing the Chasm. Perennial HarperCollins, New York, 1999.
[9] J. Stüber. Was Berliner Musikstreamingdienste besser als Spotify können (in
German). Berliner Morgenpost, 12(2), 2013.
[10] A. S. Tanenbaum and D. J. Wetherall. Computer Networks, chapter The
Medium Access Control Sublayer, pp. 257–354. Prentice Hall Pearson, Boston,
2011.
[11] I. Vatolkin, W. M. Theimer, and M. Botteck. AMUSE (Advanced MUSic Ex-
plorer): A multitool framework for music data analysis. In Proc. Int. Conf.
Music Information Retrieval (ISMIR), pp. 33–38. International Society for Mu-
sic Information Retrieval, 2010.
Chapter 26
User Interaction
WOLFGANG THEIMER
Volkswagen Infotainment GmbH, Bochum, Germany
26.1 Introduction
Humans use their senses such as sight, hearing, touch, taste, and smell to perceive
the environment. These modalities also serve to respond to input signals. A technical
system can interact with its environment with the same modalities, but is not limited
to them. Think for example about other (electronic) data channels or new sensors and
actuators which extend the classical “senses”. Therefore, user interfaces are often
categorized according to which input and output modalities are used in the system.
In the context of performing music, the tactile and the audio channels dominate the
artist’s interaction with the instruments. Technical music processing systems extend
the interaction to all input and output modalities and typically rely on visual user
feedback.
Music is to a large extent a social activity. Musicians, for example, perform music together in an orchestra, and music listeners gather in concert halls. Thus, it is important to take into account the interaction and communication
among musicians, listeners and novel technical systems for music processing. In or-
der to generalize the following argument, all objects with which the user interacts,
be it a physical instrument, electronics, or a software implementation, are defined as
music processing systems. Figure 26.1 gives a top-level overview of how the user
interfaces of technical music processing systems can be characterized from an archi-
tectural point of view. The figure uses a system engineering approach to describe an
entity, which in our case is a system to process musical signals: A complete system is
decomposed into self-contained subsystems which are characterized as blocks with a
set of input signals, processing of the input inside the subsystem, and a set of output
signals. The functional block Music processing system - User X can represent a mu-
sical instrument or a technical system which processes an input signal in the context
of music under the control of a user. The input can be a direct user input, for example
to play a musical instrument, or it could be a musical signal which is manipulated by
the user for music editing.
The user input to a music processing system in essence has four purposes:
1. Generation of music (via musical instruments or vocals)
2. Modification of existing music (music editing)
3. Definition of a music query (i.e. specifying the search parameters)
4. Navigation through a music collection
The functions of a music processing system are mainly generating, modifying,
finding and exploring music. Thus the output of a musical system covers four differ-
ent areas as well:
• Music performance (playback of music and related multimedia)
• Presentation of music query results
• Representation of a music collection
• Feedback for the user to confirm the input and state of the system
[Figure 26.1: several “Music processing system – User X” blocks (Users B, C, ..., N) linked by a communication channel.]
Figure 26.2: Early synthesizer (Studio 66 system) on the left compared to a modern
version (Korg Kronos X88) on the right [pictures under CC license].
Figure 26.3: User interface screenshot of nepTune for an exemplary music collection
(Department of Computational Perception, Johannes Kepler University, Linz).2
a passage with their instruments and replace an existing time interval in a recording.
Modern sound recording solutions allow multi-channel recordings (one or several
separate voices for each instrument) and enable channel-specific editing and synchro-
nization operations before the complete opus is assembled again from the individual
channels. Music editing can be supported by intelligent systems which are able to
align different recordings in terms of timing and absolute pitch.
an image into musical properties for music generation purposes is given in [17]. Im-
age properties such as contours, colors, and textures are mapped to musical features
like pitch, duration, and key. Instead of using those musical properties for music
generation, they could specify music query parameters as well and create a query for
music which is compatible in its mood with the imagery.
Visual and Sensor Input for Music Navigation
Visual and other sensor signals can be used directly for music navigation. This is
done by sensing the body, arm, and hand postures and translating them into nav-
igation input similar to a haptic input. An example of this approach, using a Wii
Remote, is outlined in [13].
But sensors can also be used as indirect input for music navigation: A video
analysis can be used to synchronize the music to the user or adapt the tempo of the
music (for example in music games). A typical simplification of the analysis can
be done when the user carries markers or sends out signals such as the positions of
infrared light sources as is, for example, used in game consoles. Image processing
can concentrate on the user, for example, by identifying the user mood from the facial
expression. This information is used for a suitable selection of music or as general
user feedback. Alternatively, a complete visual scene analysis can be made. This
leads to future applications which perform a music audience analysis and use the
results for adapting the music playback.
Visual and Sensor Input for Music Editing
The difference between body sensors and other sensor devices is their “wearable” nature, i.e. they are integrated into the clothing, are worn (for example like a watch) or are implanted due to a medical indication. Interesting concepts emerge which make the body signals audible and give real-time feedback to the user. The “motion sonification” project [2] is an approach to optimize motion coordination in rehabilitation
or sports training by picking up body limb orientation and acceleration to create an
audio feedback in relation to how far the motion pattern deviates from the optimal
coordination. In the future, more music-like (pleasant) feedback signals might help
to achieve a long-term user acceptance of this effective method.
graph structure shows the artists most similar to an initially provided artist based
on user tag data retrieved from LastFM.6 These artists can be used as starting
points for a new similarity search. A similar service is offered by LivePlasma7
which provides audio streaming for playlists of the selected artists. An exemplary
visualization of LivePlasma is shown in Figure 26.6.
• Map of Mozart8 is another type of visualization of a musical landscape (see Fig-
ure 26.7). Rhythm patterns are extracted from each piece of music. A self-
organizing map [12] groups acoustically similar pieces of music close to each
other (cp. Section 11.4.2). A user can listen to a certain type of music by selecting
a region on the map.
Figure 26.6: Example of LivePlasma output for The Beatles (screenshot from Live-
Plasma web site, October 3rd, 2014).
type of day (for example work day vs. weekend), reflecting different attention
levels of a music listener.
• Mood: The user’s mood is a more subtle parameter influencing situation-specific
music preferences, but it certainly impacts the more emotional music genres.
• Activity pattern: Physical activity influences music listening habits, but conversely,
music can also support an activity. Motivational music in sports is a typical ex-
ample of the latter.
• Social interaction: The social interaction with friends and other persons in a cer-
tain situation also shapes the behavior of a user. For example, it can directly
influence the music preferences during a party or other events.
• Environment: The atmosphere created by the surroundings, such as climate, au-
diovisual and haptic stimuli, in short everything that a user can perceive via his
or her senses, is an external input for the user’s mood. Indirectly the environment
has an impact on the music preferences in a certain situation.
These context parameters are additional indirect input signals. They can be es-
timated by a technical system by evaluating information from a positioning system,
a clock, user input patterns, and sensors. These measurements help to characterize
the physical environment as well as how the user reacts to it. Thus, it is possible to
implement a context engine which tries to infer the user’s music preferences. The
system can suggest suitable music for certain contexts or confine itself to an existing
selection, see also Section 23.2.3.
Another emerging trend in today’s user interfaces is the use of implicit information. In contrast to explicit user input (for example pressing a button with a defined
meaning), implicit inputs have to be interpreted by a technical system since they are
extracted from a complex user behavior or the user’s context.
• The user context can be estimated through a variety of sensors and influences the
user habits; therefore, the user interface output is often adapted to those chang-
ing needs. Examples are a recommender system which proposes different music
based on the user’s mood or adapted representations of the user interface for mo-
bile and stationary devices.
• Implicit information can also be generated directly by the user during the interac-
tion with the technical system in two ways: One option is that the user is certain
about his input (for example when issuing a speech command or performing a
gesture). However, for the machine it is implicit input which has to be interpreted
via feature extraction, processing, and classification. Alternatively, the user input
is explicit, but only the sequence of explicit inputs provides new information and
has to be interpreted. An example is a music recommendation system where users
are often reluctant to provide explicit ratings. However, their listening behavior,
i.e. how long they listen to a song, skip forward or revisit a piece of music, gives
a good indication of their music preferences; see Section 23.3.
In both cases machine learning algorithms are typically used to create a relationship
between implicit user interface input and user interface output (for example recom-
mended music).
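A minimal sketch of how such implicit listening behavior could be turned into a rough preference score; the weights used here are entirely our own choice and would normally be learned from data, as described above:

```python
def implicit_score(events):
    """Aggregate listening events for one track into a rough preference score.

    Each event is (fraction_listened, skipped), with fraction_listened in [0, 1].
    """
    if not events:
        return 0.0
    score = 0.0
    for fraction, skipped in events:
        score += fraction             # long listens count positively
        if skipped:
            score -= 0.5              # early skips count negatively (assumed weight)
    score += 0.2 * (len(events) - 1)  # revisits add a small bonus (assumed weight)
    return score / len(events)

# Example: the track was played three times and skipped early once
print(implicit_score([(1.0, False), (0.2, True), (0.9, False)]))
```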
Bibliography
[1] I. Aldshina and E. Davidenkova. The history of electro-musical instruments in
Russia in the first half of the twentieth century. In Proceedings of the Second
Vienna Talk, University of Music and Performing Arts Vienna, Austria, pp. 51–
54, 2010.
[2] H. Brückner, W. Theimer, and H. Blume. Real-time low latency movement
sonification in stroke rehabilitation based on a mobile platform. In International
Conference on Consumer Electronics (ICCE), pp. 262–263, Las Vegas, January
2014. IEEE.
[3] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A review of audio fingerprinting.
J. VLSI Signal Processing, 41:271–284, 2005.
[4] S. Deterding, R. Khaled, L. Nacke, and D. Dixon. Gamification: Toward a
definition. In CHI 2011 Workshop Gamification: Using Game Design Elements
in Non-Game Contexts, Vancouver, May 2011. ACM.
[5] G. Essl and M. Rohs. Interactivity for mobile music-making. Organised Sound,
14(2):197–207, 2009.
[6] A. Ghias, J. Logan, V. Chamberlain, and B. Smith. Query by humming: Mu-
sical information retrieval in an audio database. Proc. ACM Multimedia, pp.
231–236, 1995.
[7] A. Huq, J. Bello, A. Sarroff, J. Berger, and R. Rowe. Sourcetone: An automated
music emotion recognition system. In Proceedings; International Society for
Music Information Retrieval (ISMIR 2009); Late breaking papers / demo ses-
sion, Kobe, Japan, 2009. International Society for Music Information Retrieval.
[8] P. Knees, M. Schedl, T. Pohle, and G. Widmer. Exploring music collections in
virtual landscapes. IEEE Multimedia, 14(3):46–54, 2007.
[9] T. Nakra. Synthesizing expressive music through the language of conducting.
Journal of New Music Research, 31(1):11–26, 2001.
[10] E. Pampalk and M. Goto. MusicRainbow: A new user interface to discover
artists using audio-based similarity and web-based labeling. In Proceedings
International Society for Music Information Retrieval, pp. 367–370, 2006.
[11] QNX Software Systems Limited. QNX Neutrino RTOS System Architecture,
2014. http://www.qnx.com.
[12] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-
Organizing Maps: An Introduction. Addison-Wesley, Boston, MA, USA, 1992.
[13] R. Stewart, M. Levy, and M. Sandler. 3D Interactive environment for music
collection navigation. In Proc. of the 11th Int. Conference on Digital Audio
Effects (DAFx-08), pp. 1–5, Espoo, Finland, September 2008.
[14] A. Tanenbaum and H. Bos. Modern Operating Systems. Prentice Hall Press,
Upper Saddle River, NJ, USA, 4th edition, 2014.
[15] W. Theimer, U. Görtz, and A. Salomäki. Method for acoustically controlling
an electronic device, in particular a mobile station in a mobile radio network,
Chapter 27
Hardware Architectures for Music Classification
27.1 Introduction
Music classification is a very data-intensive application requiring high computational effort. Depending on the hardware architecture utilized for signal processing, this may result in long computation times, high energy consumption, and even high production costs. Thus, the utilized hardware architecture determines the attractiveness of a music classification-enabled device for the user. The multitude of requirements and restrictions a hardware architecture should meet cannot all be satisfied simultaneously. This is why a hardware designer must know the quantitative properties of all suitable architectures with regard to music classification in order to develop a successful media device.
In this chapter, we discuss the challenges a system designer is confronted with when creating a hardware system for music classification. Requirements such as short computation times, low production costs, low power consumption, and programmability cannot all be met by a single hardware architecture. Instead, each hardware architecture has its advantages and disadvantages.
We will present several hardware architectures and their corresponding perfor-
mance regarding computation times and efficiency when utilized for music clas-
sification. In detail, General Purpose Processors (GPP), Digital Signal Processors
(DSP), an Application Specific Instruction Processor (ASIP), a Graphics Processing
Unit (GPU), a Field Programmable Gate Array (FPGA), and an Application Spe-
cific Integrated Circuit (ASIC) are examined. Special attention is given to the extraction of short-term features, which can be accelerated in different ways and represents the most time-consuming step in music classification besides decoding. Finally, a prac-
tical example of energy cost-limited end-consumer devices demonstrates that this
design space exploration of hardware architectures for music classification can sup-
port the design phase of stationary and mobile end-consumer devices.
Media devices like smartphones and stationary home entertainment systems with
built-in music classification techniques provide several benefits compared to Internet-
based solutions. First, no Internet connection is required to transfer data between a
media device and a server offering music classification services. This also reduces
energy costs since power-consuming wireless connections can remain switched off.
This is a very important aspect considering the operating time of mobile devices.
Finally, new music files, even unpublished music, can be processed offline. Hence,
the approach of media devices with built-in music classification support promises
high flexibility.
This chapter is structured as follows: First, different metrics that are suitable for evaluating hardware architectures are presented. Then, feature-extraction-specific approaches to utilizing the available hardware resources are explained. Hardware architecture basics are provided afterwards. Finally, a comparison of the architectures regarding their efficiency in performing music classification as well as the expected costs is presented.
27.2 Evaluation Metrics for Hardware Architectures

Computation time per result (T) (resp. throughput rate η/T): This application-specific cost factor requires the definition of a task to be performed or a result to be computed. In terms of music classification, a task can be the classification of one music file, including computation steps like feature extraction, feature processing, and classification.
Energy consumption per result (E_result): The energy consumption per result is related
to the power consumption and the computation time per result for a given architec-
ture. The advantage of this cost metric is that information about the expected battery
lifetime can be directly estimated if the energy capacity is known.
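As a minimal illustration of this relation (assuming, as the definition implies, E_result = P · T for an architecture with power consumption P and computation time per result T), the following C sketch estimates how many results fit into a given battery capacity and how long the battery would last; all values and names are illustrative assumptions, not measurements from this chapter.

    /* Hedged sketch: battery-lifetime estimate from the energy-per-result metric.
       Assumes E_result = P * T; all values are illustrative placeholders. */
    #include <stdio.h>

    int main(void) {
        double P = 0.5;             /* assumed power consumption in watts       */
        double T = 0.02;            /* assumed computation time per result in s */
        double E_result = P * T;    /* energy per result in joules              */
        double capacity_J = 20000;  /* assumed battery capacity in joules       */

        printf("results per charge: %.0f\n", capacity_J / E_result);
        printf("battery lifetime under continuous load: %.1f h\n",
               capacity_J / P / 3600.0);
        return 0;
    }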
System quality: This cost factor has a limited dependency on the utilized hardware
architecture. For music classification systems, the system quality results from the
interaction between hardware and software. A suitable system quality metric is the
classification rate.
Area efficiency

    E_Area = (η/T) / A

and energy efficiency

    E_Energy = (η/T) / P
represent two typical combined cost metrics that are useful for music classification
systems. These cost metrics relate the application-specific throughput rate of a hard-
ware architecture to its silicon area or its power consumption, respectively.
The throughput rate in the context of music classification depends on the amount
of music content per file. An evaluation metric, which does not depend on con-
tent per file, is required. This metric can incorporate an alternative definition of the
throughput rate, which is based purely on the time taken to extract a series of feature
sets within a prescribed time window. This approach can be used since the feature
extraction step is by far the most time-consuming step of the signal processing chain.
Hence, the efficiency metrics are related to the feature extraction and are used to esti-
mate the efficiency of performing music classification. In this case, the computation
time per result is related to the time for extracting a feature set from a frame.
Finally, the ATE product

    C_ATE = A · T · E_result
considers the silicon area, the computation time, and the energy consumption of a
hardware architecture at the same time. This combined cost metric can also be used
to estimate a hardware architecture’s efficiency. In addition, it can be used together
with the achievable classification rate in order to explore most of the discussed cost
factors in a two-dimensional design space.
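To make these combined metrics concrete, the following C sketch computes E_Area, E_Energy, E_result, and the ATE product C_ATE from a silicon area A, a power consumption P, and a computation time per result T; the numerical values are placeholder assumptions, not measurements from this chapter.

    /* Hedged sketch: combined cost metrics of a hardware architecture.
       A in mm^2, P in W, T in seconds per result (throughput = 1/T). */
    #include <stdio.h>

    int main(void) {
        double A = 5.0;        /* assumed silicon area in mm^2            */
        double P = 0.8;        /* assumed power consumption in W          */
        double T = 1.0e-3;     /* assumed computation time per result (s) */

        double throughput = 1.0 / T;          /* results per second      */
        double e_area     = throughput / A;   /* E_Area   in 1/(s*mm^2)  */
        double e_energy   = throughput / P;   /* E_Energy in 1/J         */
        double e_result   = P * T;            /* energy per result in J  */
        double c_ate      = A * T * e_result; /* ATE product in mm^2*s*J */

        printf("E_Area   = %g 1/(s*mm^2)\n", e_area);
        printf("E_Energy = %g 1/J\n", e_energy);
        printf("E_result = %g J\n", e_result);
        printf("C_ATE    = %g mm^2*s*J\n", c_ate);
        return 0;
    }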
Figure 27.1: Fundamental structure of a typical GPP datapath (simplified five-stage RISC pipeline without control path): program counter (PC) and instruction memory (IF: instruction fetch), register file with source registers rs, rt and destination register rd (ID: instruction decode/register file read), ALU (EX: execute/address calculation), data memory (MEM: memory access), and write-back path (WB: write back).
Different design philosophies are followed to achieve high performance and low power demands. However, the baseline of GPP architecture design remains the same and is based on two main components: the datapath and the control unit. The datapath performs arithmetic operations. The control unit tells the datapath and the other elements of a processor and system what to do, according to the instructions to be executed. GPPs normally include internal memory such as caches that reduce the data access times caused by larger but slower external memories. A recent example of such a general purpose processor is the Intel Core i7-2640M.
For data-intensive processing steps like the extraction of short-term features, the
datapath limits the runtime performance more than the control path and is therefore
a subject for further consideration. The fundamental structure of a typical GPP da-
tapath is illustrated in Figure 27.1. For simplification, an architecture of a reduced
instruction set computer (RISC) is shown without the control path and without addi-
tional datapath elements required for program branches.
Typically, datapaths are subdivided into five pipeline stages. Within the instruction fetch (IF) stage, instructions are read from an appropriate memory. Instructions are encoded in 32 bits, and therefore a program counter (PC) pointing to the current memory address is incremented by four bytes. In more detailed datapath illustrations, instructions are also able to modify the PC value. During the instruction decode (ID) stage, the two independent source register addresses (rs, rt) and the destination register address (rd) of an instruction are identified. These addresses refer to an array of registers called a register file, a set of registers to which the datapath has direct access. The actual operation of an instruction is performed within the execute (EX) stage. The operation is executed by an arithmetic logic unit (ALU) which can process up to two 32-bit operands to compute one result. Moreover, the ALU is used to calculate a memory address from which data is read or to which data is written; this is done within the subsequent memory access (MEM) stage. Finally, data read from memory or results computed by the ALU are written into the datapath's register file within the write-back (WB) stage.
C code for extracting the spectral centroid (spc) feature from the magnitude of the spectrum (mag):

    float get_spc(float *mag, int size){
        float weighted = 0;
        float total = 0;
        float spc;

        for(int k = 0; k < size; k+=1) {
            weighted += mag[k] * k;
            total += mag[k];
        }

        spc = weighted/total;
        return spc;
    }

MIPS assembler code for the for-loop body of the spc feature:

    lw  $r2, 0($r1)     # load mag[k]
    mul $r3, $r2, $r1   # $r3 = mag[k]*k
    add $r4, $r4, $r3   # weighted += $r3
    add $r5, $r5, $r2   # total += mag[k]
    add $r1, $r1, 1     # k += 1

Machine code:

    0x00 : 10001100 01000001 00000000 00000000
    0x04 : 01000110 00000001 00010000 11000010
    0x08 : 01000110 00000011 00100001 00000000
    0x0c : 01000110 00000010 00101001 01000000
    0x10 : 01000110 00000110 00001000 01000000

Figure 27.2: Translation flow from high-level C language to machine code; the example program code is based on the spectral centroid feature.
In the following, the program code for the extraction of the spectral centroid
(spc) feature is explained (cp. Definition 5.3). A C-language-based function for its
extraction is depicted in Figure 27.2. It is assumed that this feature is extracted from
the magnitude of the spectrum (declared as mag), which has already been computed from an audio frame. The number of spectral components to be considered
is specified by the function parameter size.
An essential element of such a feature extraction function is the for-loop in which the spectral components or audio samples, respectively, are processed sequentially. This procedure is typical for most short-term features and generally corresponds to the most time-consuming part of a feature extraction function. To execute this function, the C code must be translated into machine code. To this end, an intermediate step is performed that translates the architecture-independent high-level
program code into the so-called assembler code. On the right-hand side of Figure
27.2, the assembler code of the for-loop body is shown. These are all instructions
executed during one for-loop turn. This assembler syntax is applicable for “Micro-
processor without Interlocked Pipeline Stages” (MIPS) [10]. At this level, source
and destination registers within the register file are directly addressed (denoted by
$r). Furthermore, the C-code is split into instructions that can be executed by the
ALU. At the end, the assembler code can be translated into machine code that is
readable by the processor. An example of machine code is also presented at the
bottom of Figure 27.2.
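As a hedged illustration of what such a machine-code encoding contains, the following C fragment sketches the standard field layout of a 32-bit MIPS R-type instruction such as add $r4, $r4, $r3 (opcode, two source registers rs and rt, destination register rd, shift amount, and function code). The struct is a didactic model only; the bit ordering of C bitfields is implementation-defined, and the sketch is not meant to reproduce the exact encodings shown in Figure 27.2.

    /* Hedged sketch: field layout of a 32-bit MIPS R-type instruction.
       Purely illustrative; not an exact model of the machine code above. */
    typedef struct {
        unsigned funct  : 6;  /* selects the concrete ALU operation, e.g. add */
        unsigned shamt  : 5;  /* shift amount (unused for add)                */
        unsigned rd     : 5;  /* destination register                        */
        unsigned rt     : 5;  /* second source register                      */
        unsigned rs     : 5;  /* first source register                       */
        unsigned opcode : 6;  /* instruction class (0 for R-type)            */
    } mips_rtype_t;           /* 6+5+5+5+5+6 = 32 bits                       */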
    Instruction                Cycle:  1    2    3    4    5    6    7    8    9
    lw  $r2, 0($r1)                    IF   ID   EX   MEM  WB
    mul $r3, $r2, $r1                       IF   ID   EX   MEM  WB
    add $r4, $r4, $r3                            IF   ID   EX   MEM  WB
    add $r5, $r5, $r2                                 IF   ID   EX   MEM  WB
    add $r1, $r1, 1                                        IF   ID   EX   MEM  WB

Figure 27.3: Pipelined execution of the five MIPS instructions of the spc for-loop body (program execution order in instructions versus clock cycles); in every cycle a new instruction enters the pipeline.
C code for extracting the spectral centroid (spc) feature from the magnitude of the spectrum (mag) by using SIMD instructions:

    float get_spc_simd(float *mag, int size){
        float weighted1 = 0; float weighted2 = 0;
        float total1 = 0; float total2 = 0;
        float spc;
        ...
    }

Figure 27.4: Modified program code of the spectral centroid feature (spc) for SIMD execution.
The use of SIMD instructions is illustrated by the spc feature code in Figure 27.4. In this example, a GPP is assumed to execute two single instructions concurrently using SIMD instructions, so that the number of for-loop iterations can be reduced by a factor of two. The for-loop body is expanded in order to execute twice the number of instructions per turn. This partitioning requires a certain degree of data independence. Moreover, a final step must be performed after the for-loop structure to merge the partitioned results and to extract the spc feature. As a result, the overall instruction count needed to extract the spc feature is reduced.
The corresponding assembler code is similar to the single instruction-based pro-
gram code. One obvious change is the identifier appended to the instruction names
that indicates the respective array size of each operand. This is also reflected in the
selection of register addresses. In detail, only the first register address of an array
is specified while the next upper address is indirectly used. That is why only even
addresses are used within the assembler code.
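Since Figure 27.4 only shows the declarations of the partitioned accumulators, the following C sketch illustrates the complete idea with plain scalar operations standing in for real SIMD instructions; it assumes an even number of spectral components and is an illustration, not the book's reference implementation.

    /* Hedged sketch of the two-way partitioned spc computation of Figure 27.4.
       Plain C stands in for SIMD: a vectorizing compiler or SIMD intrinsics
       would execute the two updates of each pair concurrently.
       Assumes an even value of size. */
    float get_spc_simd_sketch(const float *mag, int size) {
        float weighted1 = 0, weighted2 = 0;
        float total1 = 0, total2 = 0;

        for (int k = 0; k < size; k += 2) {     /* half the loop iterations */
            weighted1 += mag[k]     * k;        /* lane 1                   */
            weighted2 += mag[k + 1] * (k + 1);  /* lane 2                   */
            total1    += mag[k];
            total2    += mag[k + 1];
        }
        /* final step: merge the partitioned results */
        return (weighted1 + weighted2) / (total1 + total2);
    }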
Figure 27.5: Regular structure of a graphics processing unit (GPU): (a) top-level GPU structure with host interface, 768 KB L2 cache, DRAM interfaces, and an array of Streaming Multiprocessors (SM); (b) Streaming Multiprocessor (SM) structure with warp schedulers, CUDA cores, and load/store units operating on thread blocks.
Each SM contains several CUDA (Compute Unified Device Architecture) cores, each comprising an ALU and a register file. The computation flow of a CUDA core is similar to that of a GPP. However, all CUDA cores of an SM execute the same instructions concurrently on different data, which constitutes SIMD-like processing. In contrast to real SIMD computation, CUDA cores are also able to execute unconditional and conditional branches. Since all cores must execute the same instructions, the output of each core can be masked. For example, if an if-condition has to be executed, all cores for which the condition is not true are disabled by masking. Afterwards, an optional else branch is executed by enabling only the remaining cores of the SM. This procedure shows that such programming structures should be avoided in order to increase the hardware utilization and efficiency of a GPU. In addition, each SM provides its own shared memory to which all CUDA cores of the same SM have access, allowing a certain degree of data dependency. The same does not hold for data dependencies between different SMs, since SMs cannot be synchronized with each other. The use of shared memory is recommended because it provides higher data rates than the external memory of the GPU card, which must hold all data to be processed and offers a higher storage capacity than the available shared memory.
The GPU concept can be applied to feature extraction in order to extract one fea-
ture from several frames simultaneously. In detail, the extraction of a feature from
one frame is performed by exactly one SM and the extraction itself is additionally
accelerated by the available CUDA cores. NVIDIA offers a special programming model for its GPUs that is based on the C/C++ language. Since a GPU cannot be used on its own, it must be integrated into a system, also called the host, that includes a CPU. The CPU executes additional program code in order to invoke the GPU-specific functions. The general programming concept and workflow are illustrated by the example of the spc feature in Figure 27.6.
    sum_reduction<256>(weighted, tid);   /* parallel reduction */
    sum_reduction<256>(total, tid);      /* parallel reduction */

Figure 27.6: GPU-specific program code to extract the spc feature.
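The sum_reduction calls in Figure 27.6 stand for a CUDA reduction routine whose full implementation is not reproduced here. The following plain-C sketch only illustrates the tree-shaped reduction pattern that such a routine typically performs; the inner loop over thread indices is executed sequentially, whereas on a GPU each index would be handled by a separate CUDA core of the thread block, with a barrier between the passes.

    /* Hedged sketch of a tree-based parallel sum reduction over 256 partial
       results; the "threads" are emulated sequentially in plain C. */
    #define N 256

    static float sum_reduction_sketch(float *data) {
        for (int stride = N / 2; stride > 0; stride /= 2) {
            for (int tid = 0; tid < stride; tid++) {  /* one thread per element */
                data[tid] += data[tid + stride];      /* pairwise partial sums  */
            }
            /* on a GPU, a thread barrier would be required here */
        }
        return data[0];                               /* total sum              */
    }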
Figure 27.7: Digital signal processor with Very Long Instruction Word architecture for up to eight instructions per instruction word; the processor core comprises instruction fetch and instruction decode stages, a data cache, and two datapaths (A and B), each with its own 32-bit register file and functional units (.L, .S, .M, .D).
Sequential MIPS code:              Reorganized VLIW code (|| = parallel):

    lw  $r2, 0($r1)    (RAW)       lw  $r2, 0($r1)   || add $r1, $r1, 1   || mv $r6, $r1
    mul $r3, $r2, $r1              mul $r3, $r2, $r6 || add $r5, $r5, $r2 || add $r6, $r6, 1
    add $r4, $r4, $r3              add $r4, $r4, $r3
    add $r5, $r5, $r2
    add $r1, $r1, 1    (WAR)

    Note: the code optimization is also suitable for a Software Pipelined LOOP (SPLOOP).

Figure 27.8: Read after write (RAW) and write after read (WAR) conflict identification and instruction sequence optimization for a reduced VLIW instruction count; the symbol || separates parallel 32-bit instructions.
Although the two datapaths operate largely independently from each other, data access between the datapaths is possible via special cross-paths that impact the computation performance. Finally, the DSP utilizes pipelining as introduced in the GPP-related Section 27.4.1.
The degree of instruction-level parallelism is not only dependent on the hardware
architecture but also on data dependencies between instructions. Considering the se-
quence of MIPS instructions of the spc’s for-loop body as shown in Figure 27.8, two
different data dependencies exist that limit concurrent instruction execution. The first one is a Read after Write (RAW) conflict, which occurs if a register value is read after it has been updated one or more cycles earlier. In this case, the reading instruction must not be executed before, or at the same time as, the instruction that updates the register. The second dependency is a Write after Read (WAR) conflict: an instruction updates the content of a register that an earlier instruction still has to read. Both conflict types must be respected if the program code is to be optimized for instruction-parallel execution. A reasonable rescheduling of the instruction sequence can reduce the VLIW instruction count, as depicted in Figure 27.8.
The first and the last instruction of the sequential program code (lw and add) are
merged together in order to be executed at once. Although a WAR conflict between
these two instructions exists, a parallel execution is possible because of the regis-
ter file implementation: read and write accesses are performed in two phases, beginning with the read access. In this way, a register's content is read before it can be updated, which avoids WAR conflicts that could otherwise occur between instructions of the same cycle.
It has to be noted that an additional instruction is inserted and can be found
within the second line of the optimized DSP code. This instruction is duplicated from
the last line of the original program code in order to further optimize DSP-specific
program execution. The reason for this optimization is explained by examining the
sequence of parallel instructions of successive for-loop runs, which is illustrated in
Figure 27.9.
The instructions of the for-loop body are executed concurrently as far as possible
while the instructions of different for-loop runs are executed sequentially. However,
a further increase in computation performance can be achieved if the instructions
of different for-loops could be overlapped, too. This requires that RAW and WAR
conflicts between for-loop runs are considered. From a software point of view, compiler methods that implement overlapped for-loop runs for VLIW architectures are called Software Pipelined LOOPs (SPLOOPs). Such methods may utilize several code optimization approaches like instruction reordering, the insertion of No OPeration (NOP) instructions, or additional instructions as shown in Figure 27.8 in order to reduce the overall runtime.

    cycle   turn 1         turn 2         turn 3   turn 4   turn 5
    1       lw, add
    2       mul, add, add
    3       add
    4                      lw, add
    5                      mul, add, add
    6                      add
    ...
    13                                                      lw, add
    14                                                      mul, add, add
    15                                                      add

Figure 27.9: VLIW instruction sequence of for-loop turns without SPLOOP optimization.
The instruction sequence resulting from SPLOOP utilization is shown in Figure 27.10. The optimized program code can be subdivided into three parts: prolog, kernel, and epilog. The prolog code corresponds to the initial phase of the for-loop execution, during which the maximum number of overlapping for-loop turns is not yet reached. The kernel part of the program code consists of a repeating sequence of parallel instructions in which the maximum number of for-loop turns is overlapped. In the case of the spc feature, this means that all instructions are executed concurrently in each cycle. At the end of the for-loop execution, the number of overlapped turns decreases, which requires another sequence of parallel instructions named the epilog.
    cycle   turn 1         turn 2         turn 3         turn 4         turn 5         phase
    1       lw, add                                                                    prolog
    2       mul, add, add  lw, add                                                     prolog
    3       add            mul, add, add  lw, add                                      kernel
    4                      add            mul, add, add  lw, add                       kernel
    5                                     add            mul, add, add  lw, add        kernel
    6                                                    add            mul, add, add  epilog
    7                                                                   add            epilog

Figure 27.10: VLIW instruction sequence of for-loop turns with SPLOOP optimization.
The requirements of these three parts illustrate that the SPLOOP concept increases the program size. However, DSPs may provide special hardware support for SPLOOPs that automatically generates the prolog, kernel, and epilog program code from SPLOOP-prepared program code. Thus, the additional DSP hardware allows the program size increase to be limited.
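The effect of software pipelining can also be mimicked at the C level. The following sketch is not the output of a real DSP compiler; it merely restructures the spc loop so that the load for iteration k+1 overlaps with the arithmetic of iteration k, making the prolog, kernel, and epilog phases visible.

    /* Hedged sketch: software-pipelined structure of the spc for-loop. */
    float get_spc_splooped(const float *mag, int size) {
        float weighted = 0.0f, total = 0.0f;
        if (size <= 0)
            return 0.0f;

        float cur = mag[0];                    /* prolog: first load only        */
        for (int k = 0; k < size - 1; k++) {   /* kernel                         */
            float next = mag[k + 1];           /* load for the next iteration... */
            weighted += cur * k;               /* ...overlapped with arithmetic  */
            total    += cur;
            cur = next;
        }
        weighted += cur * (size - 1);          /* epilog: last element           */
        total    += cur;
        return weighted / total;
    }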
Figure 27.12: Simplified dedicated hardware implementation for extracting the spc feature from a continuous data stream (registers, multiplexers, adders, a multiplier, shift elements, and a comparator implementing the iterative division).
Figure 27.12 shows an example of a dedicated hardware design that is able to extract the spc feature from a continuous data stream.
In order to provide a better overview, only the datapath is presented while most of
the control elements and signals are omitted. Only relevant REGisters (REG), MUl-
tipleXers (MUX) for signal selection, and a signal comparator are illustrated as they
are needed to describe the algorithm. Moreover, arithmetic elements (depicted as cir-
cles) are assumed to require one clock cycle for signal processing. The input signal
(mag) is shown on the left side of the figure and provides a continuous data stream.
Thus, it must be processed permanently, and the total and weighted values are computed concurrently by separate hardware elements. The division operation is performed by a long-division-like algorithm that computes the quotient (spc) over several iterations. It utilizes two elements that correspond to binary shift operations. In practice, such elements do not require any hardware resources because they can be realized through modified wiring. Although the complete division hardware takes several cycles to compute the result, it can operate concurrently with the computation of the total and weighted values. Thus, all hardware elements can be utilized at the same time, which further increases hardware utilization and efficiency (a simple behavioural sketch of this datapath is given below). The hardware design presented can be physically realized with various technologies. Two popular approaches, which additionally affect computation performance, efficiency, and flexibility, are presented in the following subsections.
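As a behavioural model only (not a description of the actual register-transfer implementation of Figure 27.12), the following C sketch shows what the accumulator part of the circuit computes in every clock cycle; the iterative divider is simplified to a single division at the end of the frame.

    /* Hedged behavioural sketch of the streaming spc datapath of Figure 27.12.
       One magnitude value arrives per clock cycle; both accumulators are
       updated concurrently in hardware. */
    typedef struct {
        float weighted;   /* running sum of mag[k] * k */
        float total;      /* running sum of mag[k]     */
        int   k;          /* sample index in the frame */
    } spc_stream_t;

    static void spc_clock_tick(spc_stream_t *s, float mag_in) {
        s->weighted += mag_in * (float)s->k;   /* multiplier + adder branch */
        s->total    += mag_in;                 /* adder branch              */
        s->k        += 1;
    }

    static float spc_frame_result(const spc_stream_t *s) {
        return s->weighted / s->total;         /* iterative divider in hardware */
    }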
27.4.5.1 FPGA
Field Programmable Gate Arrays (FPGAs) are complex logic devices that can in-
clude millions of programmable logic elements. These logic elements can implement
simple logic functions. By connecting several of these logic elements together, even
complex hardware designs can be mapped onto the regular structure of FPGAs. An
example “island style” FPGA structure is depicted in Figure 27.13.
The FPGA consists of various “islands” that each include a Logic Element (LE), Connection Boxes (CB), and a Routing Switch (RS). The connection boxes are used to link adjacent logic elements to a global connection network. Horizontally and vertically routed connections of the global network are managed by the available routing switches, each containing several connection points to flexibly establish signal connections between signal lines. More advanced FPGA structures may provide further logic elements per “island” and additional elements like dedicated hardware multipliers, for example, that can be used to accelerate signal processing or to utilize the available hardware resources more effectively.
Dedicated hardware elements can be described by Boolean functions, which
means they can be specified by truth value expressions. Thus, logic elements nor-
mally include a Look-Up Table (LUT) with up to six input ports in order to imple-
ment such Boolean functions. Moreover, a one-bit storage element called a flip-flop is included that may be utilized to implement clock-synchronous signals. In the fol-
lowing, the implementation of a hardware adder on an FPGA is demonstrated as it
is also required to extract the spc feature by its dedicated hardware design. Because
of the complexity of floating point hardware, a 4-bit adder design for signed integer
operands is examined instead to present the essential implementation steps. These
steps are shown in Figure 27.14.
The binary addition of two signed operands can be segmented into computation elements called Full Adders (FA). A full adder is designed to compute the sum of three input bits: one bit of each input operand with the same significance (a_i and b_i) and an additional bit (c_{i-1}) from the prior full adder. The decimal result of the three input bits ranges between zero and three. Hence, two bits are required to represent the result. The lower significant result bit (s_i) is used directly as an output signal, while the upper significant result bit, also called the carry bit (c_i), is routed to the successive full adder. Because each full adder depends on the prior full adder elements, this adder design is called a ripple-carry adder.
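The behaviour of the described full adders can be captured in a few lines of C. The following sketch models the 4-bit ripple-carry adder at the bit level; the function names and the small test are illustrative only.

    /* Hedged bit-level sketch of the 4-bit ripple-carry adder described above. */
    #include <stdio.h>

    static void full_adder(int a, int b, int cin, int *s, int *cout) {
        *s    = a ^ b ^ cin;                      /* sum bit s_i (two XORs)      */
        *cout = (a & b) | (a & cin) | (b & cin);  /* carry bit c_i to next stage */
    }

    static unsigned ripple_add4(unsigned a, unsigned b) {
        unsigned sum = 0;
        int carry = 0;
        for (int i = 0; i < 4; i++) {             /* carry ripples from bit 0 to 3 */
            int s;
            full_adder((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
            sum |= (unsigned)s << i;
        }
        return sum & 0xFu;                        /* 4-bit result                */
    }

    int main(void) {
        printf("5 + 6 = %u (mod 16)\n", ripple_add4(5, 6));  /* prints 11 */
        return 0;
    }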
Considering the presented FPGA design, two logic elements are required to implement a full adder circuit. One logic element is used to compute s_i and the other one to generate c_i. Therefore, truth tables of the output signals are created that include the resulting output signals for each possible input signal combination. These tables can easily be stored within the look-up tables of the corresponding logic elements. Through this workflow, the complete ripple-carry adder and even more complex hardware designs can be mapped onto FPGAs.

Figure 27.15: Realization of the sum bit s_1 of a full adder from the inputs a_1, b_1, and c_0: truth table of the full adder outputs, implementation of s_1 with two XOR gates, truth table of the XOR operation (x y → z: 00→0, 01→1, 10→1, 11→0), and transistor-level implementation of an XOR gate (nmos transistors, VCC).
27.4.5.2 ASIC
With Application Specific Integrated Circuits (ASICs), dedicated hardware is implemented at the transistor level. This approach is even more complex than FPGA-based design because additional implementation steps, such as accounting for physical restrictions and effects, have to be performed in order to get the final hardware design into production. Therefore, a convenient approach to reduce the development effort is to use standard cells provided by chip manufacturers like TSMC. Standard cells define the transistor placement and dimensions of logic gates, which correspond to hardware implementations of Boolean operators (e.g., AND, OR, NOT) or basic arithmetic functions (ADD, SUB, etc.). Thus, dedicated hardware must be described by Boolean (i.e., logic) functions before it can be implemented on the basis of such logic gates. For example, the result signal of a full adder (s_i) is investigated again in Figure 27.15.
The output signal s_i can be described by two XOR operators (exclusive or) that combine a_i, b_i, and c_{i-1}. The truth table of an XOR shows that the combination of two variables is one when exactly one of the two variables is one; otherwise the result is zero. An XOR gate corresponds to the hardware implementation of such an XOR operator and is typically included in standard cell libraries. The presented XOR gate implementation requires twelve transistors. In total, 24 transistors are required to implement the sum computation of a full adder element, which is less than the number of transistors required to implement a logic element of an FPGA. Thus, an ASIC-based hardware design is typically more efficient than an FPGA-based one, but it is also less flexible because the hardware design is fixed after production.
27.5 Design Space Exploration
Table 27.1: Investigated Feature Sets and Classification Rates Achieved by Music Classification Experiments on the GTZAN Music Database, Which Includes 10 Different Genres and 100 Music Files per Genre. Extraction Times per Frame Are Measured on an ARM Cortex-A8 RISC Processor with a Clock Frequency of 1 GHz
Figure 27.16: Energy and area efficiency of the investigated hardware architectures, determined by extracting one of the four respective feature sets (energy efficiency [1/J] ranging from 10^3 to 10^9 on the abscissa, area efficiency [1/(s·mm²)] ranging from 10^2 to 10^8 on the ordinate).
processor. Its power consumption, however, is 2.7 times higher compared to the RISC processor, which limits the increase in energy efficiency. Because of its very low power consumption and silicon area, the ASIP offers area and energy efficiencies that are about twice as high as those of the Intel GPP, while the achievable extraction rate is of the same order of magnitude as that of the ARM GPP. The ASIP is therefore a very attractive architecture approach for low-power devices. It has to be mentioned that the high energy efficiency is the result of intensive architecture optimization that requires long development times.
The highest efficiency results are achieved by the dedicated hardware architecture approach, which extracts different features from a frame at the same time. The FPGA-based solutions offer lower area efficiency than the ASIC implementation because the available FPGA resources are not completely utilized for all implemented feature sets. Finally, the ASIC offers the highest efficiency values and, as expected, at the same time the lowest silicon area and power consumption. However, these results come along with the very low flexibility of the ASIC concept. The ASIC is therefore a suitable approach for low-power and high-performance devices that require a fixed set of acoustic features.
A practical example that helps to interpret these results is given by a Samsung Galaxy S2 smartphone. This mobile device utilizes a GPP that is comparable to the ARM Cortex-A8. Based on its default battery, 1.5% of the overall battery capacity is consumed if a database of 1000 music files, each with 3 minutes of music content, is classified. This corresponds to a reduction of about 22 minutes in operating time if the overall operating time is assumed to be 1 day (0.015 × 24 h ≈ 22 min). The ATE costs of flexibly programmable processors as well as suitable combinations of hardware architectures are presented in Table 27.3.
27.6 Concluding Remarks
Table 27.3: ATE Costs of Programmable Processors and Heterogeneous Architecture Approaches (Physical Unit: mm²·s·J)
The ATE costs refer to the classification of a music file with a length of 30 seconds as reference. The Intel processor can reasonably be combined with a GPU, as the ATE costs decrease compared to the single-processor solution. In contrast, a GPU is not a suitable extension for the examined ARM Cortex-A8 GPP and the Synopsys ARC600 ASIP because of the resulting increase in ATE costs. However, an ASIC-based coprocessor applied to the feature extraction step can significantly reduce the ATE costs for both processors. This demonstrates which combinations of architecture approaches are suitable for classifying music efficiently. Finally, the computation time of the heterogeneous architectures for analyzing a database of 1000 music files, each with a length of 3 minutes, is shown in Table 27.4. By utilizing the measured ATE costs and the related computation time results during the early design phase of hardware systems, cost-efficient hardware systems can be designed that are very suitable for realizing the applications presented in this book.
Table 27.4: Computation Time in Seconds to Classify a Complete Music Database
with 1000 Music Files Each with Three Minutes of Music Content
Bibliography
[1] H. Blume. Modellbasierte Exploration des Entwurfsraumes für heterogene Ar-
chitekturen zur digitalen Videosignalverarbeitung. Habilitation thesis, RWTH
Aachen University, 2008.
[2] M. Gries and K. Keutzer. Building ASIPs: The Mescal Methodology. Springer
US, 2006.
[3] M. Harris et al. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology, 2(4):1–39, 2007.
[4] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative
Approach. Elsevier, 2012.
[5] H. Kou, W. Shang, I. Lane, and J. Chong. Efficient MFCC feature extraction
on graphics processing units. IET Conference Proceedings, 2013.
[6] C.-H. Lee, J.-L. Shih, K.-M. Yu, and H.-S. Lin. Automatic music genre classi-
fication based on modulation spectral analysis of spectral and cepstral features.
IEEE Transactions on Multimedia, 11(4):670–682, 2009.
[7] MARSYAS. Music analysis, retrieval and synthesis for audio signals: Data
sets. http://marsyas.info/downloads/datasets.html. [accessed 09-
Jan-2016].
[8] J. Nickolls and W. J. Dally. The GPU computing era. IEEE Micro, 30(2):56–69, 2010.
[9] Y. Patt and S. Patel. Introduction to Computing Systems: From Bits & Gates to
C & Beyond. Computer Engineering Series. McGraw-Hill Education, 2003.
[10] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The
Hardware/Software Interface. Newnes, 2013.
[11] E. M. Schmidt, K. West, and Y. E. Kim. Efficient acoustic feature extraction for
music information retrieval using programmable gate arrays. In Proceedings of
the 10th International Society for Music Information Retrieval Conference, pp.
273–278, Kobe, Japan, October 26-30 2009.
[12] G. Schuller, M. Gruhne, and T. Friedrich. Fast audio feature extraction from
compressed audio data. IEEE Journal of Selected Topics in Signal Processing,
5(6):1262–1271, 2011.
[13] H. J. Veendrick. Nanometer CMOS ICs: From Basics to ASICs. Springer, 2010.
Notation
Unless otherwise noted, we use the following symbols and notations throughout the
book.
Abbreviations
Abbreviation Meaning
iff if and only if
Basic symbols
Symbol Meaning
:= equal by definition, defined by
R set of real numbers
C set of complex numbers
Z set of integers
N set of natural numbers
[x_i] vector with elements x_i
[x_ij] matrix with elements x_ij
x vectors are represented using bold lower case letters
X matrices are represented by upper case bold letters
Mathematical functions
Symbol Meaning
z∗ complex conjugate of a complex number z = x + iy
cov covariance
f_0 fundamental frequency
f_s sampling frequency
f_µ Fourier frequency, center frequency of the DFT bins
log natural logarithm (base e)
log_10 base-10 logarithm
log_2 base-2 logarithm
X^T, x^T transpose of X, x; transposing a vector results in a row vector
x̄ (arithmetical) mean of observations x
med median
mod mode
P(·) probability (function)
Index
visual, 629
Viterbi algorithm, 480
wave
longitudinal, 44
number, 19
plane, 43
sound, 41
transverse, 18
wave length, 19
wave equation
D’Alembert solutions, 17
one-dimensional, 17
standing wave solutions, 20
three-dimensional, 42
WAVE file, 189
wavelet transform, 63, 64
Mexican Hat wavelet, 64
wearable, 630, 637
Wilcoxon test, 246
window function, 126
rectangular, 126
zero-crossings, 146