
Automatic Music Transcription: An Overview


Emmanouil Benetos Member, IEEE, Simon Dixon, Zhiyao Duan Member, IEEE, and Sebastian
Ewert Member, IEEE

Authors in alphabetical order. EB and SD are with the Centre for Digital Music, Queen Mary University of London, UK (e-mail: {emmanouil.benetos,s.e.dixon}@qmul.ac.uk). ZD is with the Department of Electrical and Computer Engineering, University of Rochester, NY, USA (e-mail: [email protected]). SE is with Spotify Ltd, UK (e-mail: [email protected]). EB is supported by a UK RAEng Research Fellowship (RF/128).

I. INTRODUCTION

The capability of transcribing music audio into music notation is a fascinating example of human intelligence. It involves perception (analyzing complex auditory scenes), cognition (recognizing musical objects), knowledge representation (forming musical structures) and inference (testing alternative hypotheses). Automatic Music Transcription (AMT), i.e., the design of computational algorithms to convert acoustic music signals into some form of music notation, is a challenging task in signal processing and artificial intelligence. It comprises several subtasks, including (multi-)pitch estimation, onset and offset detection, instrument recognition, beat and rhythm tracking, interpretation of expressive timing and dynamics, and score typesetting. Given the number of subtasks it comprises and its wide application range, it is considered a fundamental problem in the fields of music signal processing and music information retrieval (MIR) [1], [2]. Due to the very nature of music signals, which often contain several sound sources (e.g., musical instruments, voice) that produce one or more concurrent sound events (e.g., notes, percussive sounds) that are meant to be highly correlated over both time and frequency, AMT is still considered a challenging and open problem in the literature, particularly for music containing multiple simultaneous notes (called polyphonic music in the music signal processing literature) and multiple instruments [2].

The typical data representations used in an AMT system are illustrated in Fig. 1. Usually an AMT system takes an audio waveform as input (Fig. 1a), computes a time-frequency representation (Fig. 1b), and outputs a representation of pitches over time (also called a piano-roll representation, Fig. 1c) or a typeset music score (Fig. 1d).
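The time-frequency representation of Fig. 1b is typically a magnitude spectrogram or a log-frequency (e.g., constant-Q) transform. As a minimal, purely illustrative sketch of this first processing step, the snippet below computes both with the librosa library; the file name and parameter values are placeholders rather than the settings used for Fig. 1.

```python
import numpy as np
import librosa

# Load a music recording (file name is a placeholder).
audio, sr = librosa.load("recording.wav", sr=22050, mono=True)

# Linear-frequency magnitude spectrogram (as in Fig. 1b).
stft_mag = np.abs(librosa.stft(audio, n_fft=2048, hop_length=512))

# Alternatively, a log-frequency (constant-Q) representation, a common input
# for NMF- and NN-based AMT systems: one bin per semitone across the
# 88 piano keys, starting at A0 (MIDI note 21).
cqt_mag = np.abs(librosa.cqt(audio, sr=sr, hop_length=512,
                             fmin=librosa.midi_to_hz(21),
                             n_bins=88, bins_per_octave=12))
print(stft_mag.shape, cqt_mag.shape)  # (frequency bins, time frames)
```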
In this paper, we provide a high-level overview of Automatic Music Transcription, emphasizing the intellectual merits and broader impacts of this topic, and linking AMT to other problems found in the wider field of digital signal processing. We give an overview of approaches to AMT, detailing the methodology used in the two main families of methods, based respectively on deep learning and non-negative matrix factorization. Finally, we provide an extensive discussion of open challenges for AMT. Regarding the scope of the paper, we emphasize approaches for transcribing polyphonic music produced by pitched instruments and voice. Outside the scope of the paper are methods for transcribing non-pitched sounds such as drums, for which a brief overview is given in Section IV-F, as well as methods for transcribing specific sources within a polyphonic mixture, such as melody and bass line.

A. Applications & Impact

A successful AMT system would enable a broad range of interactions between people and music, including music education (e.g., through systems for automatic instrument tutoring), music creation (e.g., dictating improvised musical ideas and automatic music accompaniment), music production (e.g., music content visualization and intelligent content-based editing), music search (e.g., indexing and recommendation of music by melody, bass, rhythm or chord progression), and musicology (e.g., analyzing jazz improvisations and other non-notated music). As such, AMT is an enabling technology with clear potential for both economic and societal impact.

AMT is closely related to other music signal processing tasks [3] such as audio source separation, which also involves estimation and inference of source signals from mixture observations. It is also useful for many high-level tasks in MIR [4] such as structural segmentation, cover-song detection and assessment of music similarity, since these tasks are much easier to address once the musical notes are known. Thus, AMT provides the main link between the fields of music signal processing and symbolic music processing (i.e., processing of music notation and music language modeling). The integration of the two aforementioned fields through AMT will be discussed in Section IV.

Given the potential impact of AMT, the problem has also attracted commercial interest in addition to academic research. While it is outside the scope of the paper to provide a comprehensive list of commercial AMT software, commonly used software includes Melodyne (http://www.celemony.com/en/melodyne), AudioScore (http://www.sibelius.com/products/audioscore/), ScoreCloud (http://scorecloud.com/), AnthemScore (https://www.lunaverus.com/), and Transcribe! (https://www.seventhstring.com/xscribe/). It is worth noting that AMT papers in the literature have refrained from making explicit comparisons with commercially available music transcription software, possibly due to different scopes and target applications between commercial and academic tools.
Figure 1. Data represented in an AMT system. (a) Input waveform, (b) Internal time-frequency representation, (c) Output piano-roll representation, (d)
Output music score, with notes A and D marked in gray circles. The example corresponds to the first 6 seconds of W. A. Mozart’s Piano Sonata No. 13, 3rd
movement (taken from the MAPS database).

B. Analogies to Other Fields

AMT has close relations with other signal processing problems. With respect to the field of speech processing, AMT is widely considered to be the musical equivalent of Automatic Speech Recognition (ASR), in the sense that both tasks involve converting acoustic signals to symbolic sequences. Like the cocktail party problem in speech, music usually involves multiple simultaneous voices, but unlike speech, these voices are highly correlated in time and in frequency (see Challenges 2 and 3 in Section I-C). In addition, both AMT and ASR systems benefit from language modeling components that are combined with acoustic components in order to produce plausible results. Thus, there are also clear links between AMT and the wider field of natural language processing (NLP), with music having its own grammatical rules or statistical regularities, in a similar way to natural language [5]. The use of language models for AMT is detailed in Section IV.

Within the emerging field of sound scene analysis, there is a direct analogy between AMT and Sound Event Detection (SED) [6], in particular with polyphonic SED, which involves detecting and classifying multiple overlapping events from audio. While everyday and natural sounds do not exhibit the same degree of temporal regularity and inter-source frequency dependence as found in music signals, there are close interactions between the two problems in terms of the methodologies used, as observed in the literature [6].

Further, AMT is related to image processing and computer vision, as musical objects such as notes can be recognized as two-dimensional patterns in time-frequency representations. Compared with image processing and computer vision, where occlusion is a common issue, AMT systems are often affected by musical objects occupying the same time-frequency regions (this is detailed in Section I-C).

C. Key Challenges

Compared to other problems in the music signal processing field or the wider signal processing discipline, there are several factors that make AMT particularly challenging:

1) Polyphonic music contains a mixture of multiple simultaneous sources (e.g., instruments, vocals) with different pitch, loudness and timbre (sound quality), with each source producing one or more musical voices. Inferring musical attributes (e.g., pitch) from the mixture signal is an extremely under-determined problem.
2) Overlapping sound events often exhibit harmonic relations with each other; for any consonant musical interval, the fundamental frequencies form small integer ratios, so that their harmonics overlap in frequency, making the separation of the voices even more difficult. Taking a C major chord as an example, the fundamental frequency ratio of its three notes C:E:G is 4:5:6, and the percentage of harmonic positions that are overlapped by the other notes is 46.7%, 33.3% and 60% for C, E and G, respectively (these figures are verified in the short sketch following this list).

3) The timing of musical voices is governed by the regular metrical structure of the music. In particular, musicians pay close attention to the synchronization of onsets and offsets between different voices, which violates the common assumption of statistical independence between sources which otherwise facilitates separation.

4) The annotation of ground-truth transcriptions for polyphonic music is very time consuming and requires high expertise. The lack of such annotations has limited the use of powerful supervised learning techniques to specific AMT sub-problems such as piano transcription, where the annotation can be automated due to certain piano models that can automatically capture performance data. An approach to circumvent this problem was proposed in [7]; however, it requires professional music performers and thorough score pre- and post-processing. We note that sheet music does not generally provide good ground-truth annotations for AMT; it is not time-aligned to the audio signal, nor does it usually provide an accurate representation of a performance. Even when accurate transcriptions exist, it is not trivial to identify corresponding pairs of audio files and musical scores, because of the multitude of versions of any given musical work that are available from music distributors. At best, musical scores can be viewed as weak labels.
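As a quick check of the overlap figures quoted in Challenge 2, the following sketch counts, for each note of a C major chord with fundamental frequency ratio 4:5:6, how many of its first few thousand harmonics coincide exactly with a harmonic of one of the other two notes. This is an idealized computation that ignores inharmonicity and finite frequency resolution.

```python
# Fundamental frequencies of C, E, G in a C major chord, expressed as
# the integer ratio 4:5:6 (just intonation).
fundamentals = {"C": 4, "E": 5, "G": 6}

def overlap_fraction(note, others, num_harmonics=10000):
    """Fraction of the note's harmonics that coincide with a harmonic of
    any of the other notes (exact integer coincidence)."""
    f0 = fundamentals[note]
    overlapped = 0
    for k in range(1, num_harmonics + 1):
        freq = k * f0
        if any(freq % fundamentals[o] == 0 for o in others):
            overlapped += 1
    return overlapped / num_harmonics

for note in fundamentals:
    others = [o for o in fundamentals if o != note]
    print(note, round(100 * overlap_fraction(note, others), 1), "%")
# Prints approximately: C 46.7 %, E 33.3 %, G 60.0 %
```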
The above key challenges are often not fully addressed in current AMT systems, leading to common issues in the AMT outputs, such as octave errors, semitone errors, missed notes (in particular in the presence of dense chords), extra notes (often manifested as harmonic errors in the presence of unseen timbres), merged or fragmented notes, incorrect onsets/offsets, or mis-assigned streams [1], [2]. The remainder of the paper will focus on ways to address the above challenges, as well as discussion of additional open problems for the creation of robust AMT systems.

Figure 2. Examples of frame-level, note-level and stream-level transcriptions, produced by running methods proposed in [8], [9] and [10], respectively, of the first phrase of J. S. Bach's chorale "Ach Gott und Herr" from the Bach10 dataset. All three levels are parametric descriptions of the music performance.

II. AN OVERVIEW OF AMT METHODS

In the past four decades, many approaches have been developed for AMT for polyphonic music. While the end goal of AMT is to convert an acoustic music recording to some form of music notation, most approaches were designed to achieve a certain intermediate goal. Depending on the level of abstraction and the structures that need to be modeled for achieving such goals, AMT approaches can be generally organized into four categories: frame-level, note-level, stream-level and notation-level.

Frame-level transcription, or Multi-Pitch Estimation (MPE), is the estimation of the number and pitch of notes that are simultaneously present in each time frame (on the order of 10 ms). This is usually performed in each frame independently, although contextual information is sometimes considered through filtering frame-level pitch estimates in a post-processing stage. Fig. 2 (top) shows an example of a frame-level transcription, where each black dot is a pitch estimate. Methods in this category do not form the concept of musical notes and rarely model any high-level musical structures. A large portion of existing AMT approaches operate at this level. Recent approaches include traditional signal processing methods [11], [12], probabilistic modeling [8], Bayesian approaches [13], non-negative matrix factorization (NMF) [14], [15], [16], [17], and neural networks [18], [19]. All of these methods have pros and cons, and the research has not converged to a single approach. For example, traditional signal processing methods are simple and fast and generalize better to different instruments, while deep neural network methods generally achieve higher accuracy on specific instruments (e.g., piano). Bayesian approaches provide a comprehensive modeling of the sound generation process; however, models can be very complex and slow. Readers interested in a comparison of the performance of different approaches are referred to the Multiple Fundamental Frequency Estimation & Tracking task of the annual Music Information Retrieval Evaluation eXchange (MIREX, http://www.music-ir.org/mirex).
However, readers are reminded that evaluation results may be biased by the limitations of datasets and evaluation metrics (see Sections I-C and IV-G).

Note-level transcription, or note tracking, is one level higher than MPE in terms of the richness of structures of the estimates. It not only estimates the pitches in each time frame, but also connects pitch estimates over time into notes. In the AMT literature, a musical note is often characterized by three elements: pitch, onset time, and offset time [1]. As note offsets can be ambiguous, they are sometimes neglected in the evaluation of note tracking approaches, and as such, some note tracking approaches only estimate pitch and onset times of notes. Fig. 2 (middle) shows an example of a note-level transcription, where each note is shown as a red circle (onset) followed by a black line (pitch contour). Many note tracking approaches form notes by post-processing MPE outputs (i.e., pitch estimates in individual frames). Techniques that have been used in this context include median filtering [12], Hidden Markov Models (HMMs) [20], and neural networks [5]; a minimal example of such post-processing is sketched after this paragraph. This post-processing is often performed for each MIDI pitch independently, without considering the interactions among simultaneous notes. This often leads to spurious or missing notes that share harmonics with correctly estimated notes. Some approaches have been proposed to consider note interactions through a spectral likelihood model [9] or a music language model [5], [18] (see Section IV-A). Another subset of approaches estimate notes directly from the audio signal instead of building upon MPE outputs. Some approaches first detect onsets and then estimate pitches within each inter-onset interval [21], while others estimate pitch, onset and sometimes offset in the same framework [22], [23], [24].
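To make the note-tracking step concrete, the following sketch converts frame-wise pitch activations (e.g., posteriors from an MPE front-end) into (pitch, onset, offset) notes using median filtering, thresholding and a minimum-duration rule. The threshold and duration values are illustrative assumptions, not settings from any of the cited systems.

```python
import numpy as np
from scipy.ndimage import median_filter

def activations_to_notes(activations, hop_time=0.01, threshold=0.5,
                         min_duration=0.05, filter_frames=5):
    """Convert an (88, num_frames) array of per-pitch activations in [0, 1]
    into a list of (midi_pitch, onset_sec, offset_sec) notes.
    Row 0 is assumed to correspond to A0 (MIDI note 21)."""
    notes = []
    for pitch_idx, row in enumerate(activations):
        # Median-filter each pitch track to remove isolated spikes/dropouts.
        smoothed = median_filter(row, size=filter_frames)
        active = smoothed >= threshold
        # Find contiguous runs of active frames.
        padded = np.concatenate(([False], active, [False]))
        onsets = np.flatnonzero(~padded[:-1] & padded[1:])
        offsets = np.flatnonzero(padded[:-1] & ~padded[1:])
        for on, off in zip(onsets, offsets):
            if (off - on) * hop_time >= min_duration:
                notes.append((21 + pitch_idx, on * hop_time, off * hop_time))
    return sorted(notes, key=lambda n: n[1])
```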
Stream-level transcription, also called Multi-Pitch Streaming (MPS), targets grouping estimated pitches or notes into streams, where each stream typically corresponds to one instrument or musical voice, and is closely related to instrument source separation. Fig. 2 (bottom) shows an example of a stream-level transcription, where pitch streams of different instruments have different colors. Compared to note-level transcription, the pitch contour of each stream is much longer than a single note and contains multiple discontinuities that are caused by silence, non-pitched sounds and abrupt frequency changes. Therefore, techniques that are often used in note-level transcription are generally not sufficient to group pitches into a long and discontinuous contour. One important cue for MPS that is not explored in MPE and note tracking is timbre: notes of the same stream (source) generally show similar timbral characteristics compared to those in different streams. Therefore, stream-level transcription is also called timbre tracking or instrument tracking in the literature. Existing works at this level are few, with [16], [10], [25] as examples.

From frame-level to note-level to stream-level, the transcription task becomes more complex as more musical structures and cues need to be modeled. However, the transcription outputs at these three levels are all parametric transcriptions, which are parametric descriptions of the audio content. The MIDI piano roll shown in Fig. 1(c) is a good example of such a transcription. It is indeed an abstraction of music audio; however, it has not yet reached the level of abstraction of music notation: time is still measured in units of seconds instead of beats; pitch is measured in MIDI numbers instead of spelled note names that are compatible with the key (e.g., C♯ vs D♭); and the concepts of beat, bar, meter, key, harmony, and stream are lacking.

Notation-level transcription aims to transcribe the music audio into a human-readable musical score, such as the staff notation widely used in Western classical music. Transcription at this level requires deeper understanding of musical structures, including harmonic, rhythmic and stream structures. Harmonic structures such as keys and chords influence the note spelling of each MIDI pitch; rhythmic structures such as beats and bars help to quantize the lengths of notes; and stream structures aid the assignment of notes to different staffs. There has been some work on the estimation of musical structures from audio or MIDI representations of a performance. For example, methods for pitch spelling [26], timing quantization [27], and voice separation [28] from performed MIDI files have been proposed. However, little work has been done on integrating these structures into a complete music notation transcription, especially for polyphonic music. Several software packages, including Finale, GarageBand and MuseScore, provide the functionality of converting a MIDI file into music notation; however, the results are often not satisfying and it is not clear what musical structures have been estimated and integrated during the transcription process. Cogliati et al. [29] proposed a method to convert a MIDI performance into music notation, with a systematic comparison of the transcription performance with the above-mentioned software. In terms of audio-to-notation transcription, a proof-of-concept work using end-to-end neural networks was proposed by Carvalho and Smaragdis [30] to directly map music audio into music notation without explicitly modeling musical structures.

III. STATE-OF-THE-ART

While there is a wide range of applicable methods, automatic music transcription has been dominated during the last decade by two algorithmic families: Non-Negative Matrix Factorization (NMF) and Neural Networks (NNs). Both families have been used for a variety of tasks, from speech and image processing to recommender systems and natural language processing. Despite this wide applicability, both approaches offer a range of properties that make them particularly suitable for modeling music recordings at the note level.

A. Non-negative Matrix Factorization for AMT

The basic idea behind NMF and its variants is to represent a given non-negative time-frequency representation V ∈ R_{≥0}^{M×N}, e.g., a magnitude spectrogram, as a product of two non-negative matrices: a dictionary D ∈ R_{≥0}^{M×K} and an activation matrix A ∈ R_{≥0}^{K×N}; see Fig. 3. Computationally, the goal is to minimize a distance (or divergence) between V and DA with respect to D and A. As a straightforward approach to solving this minimization problem, multiplicative update rules have been central to the success of NMF.
Figure 3. NMF example, using the same audio recording as Fig. 1. (a) Input spectrogram V, (b) Approximated spectrogram DA, (c) Dictionary D
(pre-extracted), (d) Activation matrix A.

For example, the generalized Kullback-Leibler divergence between V and DA is non-increasing under the following updates, which also guarantee the non-negativity of both D and A as long as both are initialized with positive real values [31]:

    A ← A ⊙ ( D^T (V / (DA)) ) / ( D^T J )   and   D ← D ⊙ ( (V / (DA)) A^T ) / ( J A^T ),

where the operator ⊙ denotes point-wise multiplication, J ∈ R^{M×N} denotes the matrix of ones, and all divisions are point-wise. Intuitively, the update rules can be derived by choosing a specific step-size in a gradient (or rather coordinate) descent based minimization of the divergence [31].

In an AMT context, both unknown matrices have an intuitive interpretation: the n-th column of V, i.e. the spectrum at time point n, is modeled in NMF as a linear combination of the K columns of D, and the corresponding K coefficients are given by the n-th column of A. Given this point of view, each column of D is often referred to as a (spectral) template and usually represents the expected spectral energy distribution associated with a specific note played on a specific instrument. For each template, the corresponding row in A is referred to as the associated activation and encodes when and how intensely that note is played over time. Given the non-negativity constraints, NMF yields a purely constructive representation in the sense that spectral energy modeled by one template cannot be cancelled by another – this property is often seen as instrumental in identifying a parts-based and interpretable representation of the input [31].
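For concreteness, a minimal NumPy sketch of these multiplicative KL updates is given below. The dictionary size and iteration count are arbitrary illustrative choices, and a small constant is added to denominators to avoid division by zero (a common practical safeguard, not part of the formulation above).

```python
import numpy as np

def nmf_kl(V, K=88, num_iters=200, eps=1e-9, seed=0):
    """Factorize a non-negative spectrogram V (M x N) as V ~= D @ A using
    multiplicative updates for the generalized Kullback-Leibler divergence."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    D = rng.random((M, K)) + eps   # dictionary of spectral templates
    A = rng.random((K, N)) + eps   # activations of each template over time
    J = np.ones_like(V)            # matrix of ones, same shape as V
    for _ in range(num_iters):
        R = V / (D @ A + eps)               # point-wise ratio V / (DA)
        A *= (D.T @ R) / (D.T @ J + eps)    # update activations
        R = V / (D @ A + eps)
        D *= (R @ A.T) / (J @ A.T + eps)    # update templates
    return D, A

# Usage (e.g., on the magnitude spectrogram from Fig. 1b):
# D, A = nmf_kl(stft_mag)
```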
In Fig. 3, an NMF-based decomposition is illustrated. The magnitude spectrogram V shown in Fig. 3(a) is modeled as a product of the dictionary D and activation matrix A shown in Fig. 3(c) and (d), respectively. The product DA is given in Fig. 3(b). In this case, the templates correspond to individual pitches, with clearly visible fundamental frequencies and harmonics. Additionally, comparing A with the piano-roll representation shown in Fig. 1(c) indicates the correlation between NMF activations and the underlying musical score.

While Fig. 3 illustrates the principles behind NMF, it also indicates why AMT is difficult – indeed, a regular NMF decomposition would rarely look as clean as in Fig. 3. Compared to speech analysis, sound objects in music are highly correlated. For example, even in a simple piece as shown in Fig. 1, most pairs of simultaneous notes are separated by musically consonant intervals, which acoustically means that many of their partials overlap (e.g., the A and D notes around 4 seconds, marked with gray circles in Fig. 1(d), share a high number of partials). In this case, it can be difficult to disentangle how much energy belongs to which note. The task is further complicated by the fact that the spectro-temporal properties of notes vary considerably between different pitches, playing styles, dynamics and recording conditions.
Further, stiffness properties of strings affect the travel speed of transverse waves based on their frequency – as a result, the partials of instruments such as the piano are not found at perfect integer multiples of the fundamental frequency. Due to this property, called inharmonicity, the positions of partials differ between individual pianos (see Fig. 4).

Figure 4. Inharmonicity: Spectrum of a C♯1 note played on a piano. The stiffness of strings causes partials to be shifted from perfect integer multiples of the fundamental frequency (shown as vertical dotted lines); here the 23rd partial is at the position where the 24th harmonic would be expected. Note that the fundamental frequency of 34.65 Hz is missing, as piano soundboards typically do not resonate for modes with a frequency smaller than ≈50 Hz.

To address these challenges, the basic NMF model has been extended by encouraging additional structure in the dictionary and the activations. For example, an important principle is to enforce sparsity in A to obtain a solution dominated by few but substantial activations – the success of sparsity paved the way for a whole range of sparse coding approaches, in which the dictionary size K can exceed the input dimension M considerably [32]. Other extensions focus on the dictionary design. In the case of supervised NMF, the dictionary is pre-computed and fixed using additionally available training material. For example, given K recordings each containing only a single note, the dictionary shown in Fig. 3(c) was constructed by extracting one template from each recording – this way, the templates are guaranteed to be free of interference from other notes and also have a clear interpretation. As another example, Fig. 5 illustrates an extension in which each NMF template is represented as a linear combination of fixed narrow-band sub-templates [15], which enforces a harmonic structure for all NMF templates – this way, a dictionary can be adapted to the recording to be transcribed, while maintaining its clean, interpretable structure.

Figure 5. Harmonic NMF [15]: Each NMF template (right-hand side) is represented as a linear combination of fixed narrow-band sub-templates (illustrated with weights c1 = 0.8, c2 = 0.6, c3 = 0.3, c4 = 0.2). The resulting template is constrained to represent harmonic sounds by construction.
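A minimal sketch of the construction illustrated in Fig. 5: a harmonic template is built as a weighted sum of fixed narrow-band sub-templates, each covering one partial. The Gaussian shape of the sub-templates and the specific weights below are illustrative assumptions, not the parameterization used in [15].

```python
import numpy as np

def harmonic_template(f0_hz, weights, freqs_hz, bandwidth_hz=20.0):
    """Build one spectral template as a weighted sum of narrow-band
    sub-templates centered at integer multiples of the fundamental f0."""
    template = np.zeros_like(freqs_hz)
    for h, c in enumerate(weights, start=1):
        center = h * f0_hz
        sub = np.exp(-0.5 * ((freqs_hz - center) / bandwidth_hz) ** 2)
        template += c * sub   # each sub-template models one partial
    return template / (template.max() + 1e-12)

freqs = np.linspace(0, 2000, 1024)                    # frequency axis in Hz
template_c3 = harmonic_template(130.8, [0.8, 0.6, 0.3, 0.2], freqs)
```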
In shift-invariant dictionaries, a single template can be used to represent a range of different fundamental frequencies. In particular, using a logarithmic frequency axis, the distances between individual partials of a harmonic sound are fixed and thus shifting a template in frequency allows modeling sounds of varying pitch. Sharing parameters between different pitches in this way has turned out to be effective towards increasing model capacity (see e.g., [16], [17]). Further, spectro-temporal dictionaries alleviate a specific weakness of NMF models: in NMF it is difficult to express that notes often have a specific temporal evolution – e.g., the beginning of a note (or attack phase) might have entirely different spectral properties than the central part (decay phase). Such relationships are modeled in spectro-temporal dictionaries using a Markov process which governs the sequencing of templates across frames, so that different subsets of templates can be used for the attack and the decay parts, respectively [16], [23].

B. Neural Networks for AMT

As for many tasks relating to pattern recognition, neural networks (NNs) have had a considerable impact in recent years on the problem of music transcription and on music signal processing in general. NNs are able to learn a non-linear function (or a composition of functions) from input to output via an optimization algorithm such as stochastic gradient descent [33]. Compared to other fields including image processing, progress on NNs for music transcription has been slower, and we will discuss a few of the underlying reasons below.

One of the earliest approaches based on neural networks was Marolt's Sonic system [21]. A central component in this approach was the use of time-delay (TD) networks, which resemble convolutional networks in the time direction [33], and were employed to analyse the output of adaptive oscillators, in order to track and group partials in the output of a gammatone filterbank. Although it was initially published in 2001, the approach remains competitive and still appears in comparisons in more recent publications [23].

In the context of the more recent revival of neural networks, a first successful system was presented by Böck and Schedl [34]. One of the core ideas was to use two spectrograms as input to enable the network to exploit both a high time accuracy (when estimating the note onset position) and a high frequency resolution (when disentangling notes in the lower frequency range). This input is processed using one (or more) Long Short-Term Memory (LSTM) layers [33]. The potential benefit of using LSTM layers is two-fold. First, the spectral properties of a note evolve across input frames, and LSTM networks have the capability to compactly model such sequences. Second, medium and long range dependencies between notes can potentially be captured: for example, based on a popular chord sequence, after hearing C and G major chords followed by A minor, a likely successor is an F major chord. An investigation of whether such long-range dependencies are indeed modeled, however, was not in scope.

Sigtia et al. [18] focus on long-range dependencies in music by combining an acoustic front-end with a symbolic-level module resembling a language model as used in speech processing. Using information obtained from MIDI files, a recurrent network is trained to predict the active notes in the next time frame given the past. This approach needs to learn and represent a very large joint probability distribution, i.e., a probability for every possible combination of active and inactive notes across time – note that even in a single frame there are 2^88 possible combinations of notes on a piano. To render the problem of modeling such an enormous probability space tractable, the approach employs a specific neural network architecture (NADE), which represents a large joint distribution as a long product of conditional probabilities – an approach quite similar to the idea popularized recently by the well-known WaveNet architecture. Despite the use of a dedicated music language model, which was trained on relatively large MIDI-based datasets, only modest improvements over an HMM baseline could be observed, and thus the question remains open to which degree long-range dependencies are indeed captured.
To further disentangle the influence of the acoustic front-end from the language model on potential improvements in performance, Kelz et al. [19] focus on the acoustic modeling, report on the results of a larger-scale hyperparameter search, and describe the influence of individual system components. Trained using this careful and extensive procedure, the resulting model outperforms existing models by a reasonable margin. In other words, while in speech processing, language models have led to a drastic improvement in performance, the same effect is still to be demonstrated in an AMT system – a challenge we will discuss in more detail below.

The development of neural network based AMT approaches continues: the current state-of-the-art method for general-purpose piano transcription was proposed by Google Brain [24]. Combining and extending ideas from existing methods, this approach combines two networks (Fig. 6): one network is used to detect note onsets and its output is used to inform a second network, which focuses on detecting note lengths. This can be interpreted from a probabilistic point of view: note onsets are rare events compared to frame-wise note activity detections – the split into two network branches can thus be interpreted as splitting the representation of a relatively complex joint probability distribution over onsets and frame activity into a probability over onsets and a probability over frame activities, conditioned on the onset distribution. Since the temporal dynamics of onsets and frame activities are quite different, this can lead to improved learning behavior for the entire network when trained jointly.

Figure 6. Google Brain's Onsets and Frames network: The input (a log-mel spectrogram) is processed by a first network detecting note onsets. The result is used as side information for a second network focused on estimating note lengths (adapted from [24]). Bi LSTM refers to bi-directional LSTM layers; FC Sigmoid refers to a fully connected sigmoid layer; Conv Stack refers to a series of convolutional layers.
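A highly simplified PyTorch sketch of this dual-branch idea is given below. Layer types and sizes are illustrative stand-ins (e.g., a single linear layer replaces the convolutional stack), not the architecture or hyperparameters of [24].

```python
import torch
import torch.nn as nn

class OnsetsAndFramesSketch(nn.Module):
    """Simplified two-branch onset/frame model in the spirit of [24]: the
    onset branch's predictions are fed to the frame branch as side
    information. All sizes are illustrative assumptions."""

    def __init__(self, num_mel_bins=229, num_pitches=88, hidden=128):
        super().__init__()
        self.onset_net = nn.Sequential(          # stand-in for "Conv Stack"
            nn.Linear(num_mel_bins, hidden), nn.ReLU())
        self.onset_rnn = nn.LSTM(hidden, hidden, batch_first=True,
                                 bidirectional=True)
        self.onset_out = nn.Linear(2 * hidden, num_pitches)

        self.frame_net = nn.Sequential(
            nn.Linear(num_mel_bins, hidden), nn.ReLU())
        # The frame branch sees its own features plus the onset predictions.
        self.frame_rnn = nn.LSTM(hidden + num_pitches, hidden,
                                 batch_first=True, bidirectional=True)
        self.frame_out = nn.Linear(2 * hidden, num_pitches)

    def forward(self, mel):                       # mel: (batch, time, mel bins)
        onset_h, _ = self.onset_rnn(self.onset_net(mel))
        onsets = torch.sigmoid(self.onset_out(onset_h))
        frame_in = torch.cat([self.frame_net(mel), onsets.detach()], dim=-1)
        frame_h, _ = self.frame_rnn(frame_in)
        frames = torch.sigmoid(self.frame_out(frame_h))
        return onsets, frames                     # per-frame onset/pitch probs

# model = OnsetsAndFramesSketch()
# onsets, frames = model(torch.rand(1, 100, 229))
```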
C. A Comparison of NMF and Neural Network Models

Given the popularity of NMF and neural network based methods for automatic music transcription, it is interesting to discuss their differences. In particular, neglecting the non-negativity constraints, NMF is a linear, generative model. Given that NMF-based methods are increasingly replaced by NN-based ones, the question arises in which way linearity could be a limitation for an AMT model.

To look into this, assume we are given an NMF dictionary with two spectral templates for each musical pitch. To represent an observed spectrum of a single pitch C4, we can linearly combine the two templates associated with C4. The set (or manifold) of valid spectra for C4 notes, however, is complex, and thus in most cases our linear interpolation will not correspond to a real-world recording of a C4. We could increase the number of templates such that their interpolation could potentially get closer to a real C4 – however, the number of invalid spectra we can represent increases much more quickly compared to the number of valid spectra. Deep networks have shown considerable potential in recent years to (implicitly) represent such complex manifolds in a robust and comparatively efficient way [33]. An additional benefit over generative models such as NMF is that neural networks can be trained in an end-to-end fashion, i.e., note detections can be a direct output of a network without the need for additional post-processing of model parameters (such as NMF activations).

Yet, despite these quite principled limitations, NMF-based methods remain competitive or even exceed results achieved using neural networks. Currently, there are two main challenges for neural network-based approaches. First, there are only few, relatively small annotated datasets available, and these are often subject to severe biases [7]. The largest publicly available dataset [11] contains several hours of piano music – however, all recorded on only seven different (synthesizer-based and real) pianos. While typical data augmentation strategies such as pitch shifting or simulating different room acoustics might mitigate some of the effects, there is still a considerable risk that a network overfits the acoustic properties of these specific instruments. For many types of instruments, even small datasets are not available. Other biases include musical style as well as the distribution over central musical concepts, such as key, harmony, tempo and rhythm.

A second considerable challenge is the adaptability to new acoustic conditions. Providing just a few examples of isolated notes of the instrument to be transcribed, considerable improvements are observed in the performance of NMF-based models. There is currently no corresponding equally effective mechanism to re-train or adapt a neural network based AMT system on a few seconds of audio – thus the error rate for non-adapted networks can be an order of magnitude higher than that of an adapted NMF system [23], [24]. Overall, as both of these challenges cannot easily be overcome, NMF-based methods are likely to remain relevant in specific use cases.

In Fig. 7, we qualitatively illustrate some differences in the behavior of systems based on supervised NMF and neural networks. Both systems were specifically trained for transcribing piano recordings and we expose the approaches to a recording of an organ. Like the piano, the organ is played with a keyboard, but its acoustic properties are quite different: the harmonics of the organ are rich in energy and cover the entire spectrum, the energy of notes does not decay over time, and onsets are less pronounced. With this experiment, we want to find out how gracefully the systems fail when they encounter a sound that is outside the piano-sound manifold but still musically valid. Comparing the NMF output in Fig. 7(a) and the NN output in Fig. 7(b) with the ground truth, we find that both methods detect additional notes (shown in red), mostly at octaves above and below the correct fundamental.
Figure 7. Piano-roll representations of the first 6 seconds of a recording of a Bach piece (BWV 582) for organ. Black color corresponds to correctly detected pitches, red to false positives, and blue to false negatives. (a) Output of the NMF model trained on piano templates. (b) Output of the piano-music-trained neural network model of [24].

Given the rich energy distribution, this behavior is expected. While we use a simple baseline model for NMF and thus some errors could be attributed to that choice, the neural network fails more gracefully: fewer octave errors and fewer spurious short note detections are observed (yet in terms of recall the NMF-based approach identifies additional correct notes). It is difficult to argue why the acoustic model within the network should be better prepared for such a situation. However, the results suggest that the network learned something additional: the LSTM layers as used in the network (compare Fig. 6) seem to have learned how typical piano notes evolve in time and thus most note lengths look reasonable and less spurious. Similarly, the bandwidth in which octave errors occur is narrower for the neural network, which could potentially indicate that the network models the likelihood of co-occurring notes or, in other words, a simple music language model, which leads us to our discussion of important remaining challenges in AMT.

IV. FURTHER EXTENSIONS AND FUTURE WORK

A. Music Language Models

As outlined in Section I-B, AMT is closely related to automatic speech recognition (ASR). In the same way that a typical ASR system consists of an acoustic component and a language component, an AMT system can model both the acoustic sequences and also the underlying sequence of notes and other music cues over time. AMT systems have thus incorporated music language models (MLMs) for modeling sequences of notes in a polyphonic context, with the aim of improving transcription performance. The capabilities of deep learning methods towards modeling high-dimensional sequences have recently made polyphonic music sequence prediction possible. Boulanger-Lewandowski et al. [5] combined a restricted Boltzmann machine (RBM) with an RNN for polyphonic music prediction, which was used to post-process the acoustic output of an AMT system. Sigtia et al. [18] also used the aforementioned RNN-RBM as an MLM, and combined the acoustic and language predictions using a probabilistic graphical model. While these initial works showed promising results, there are several directions for future research in MLMs; these include creating unified acoustic and language models (as opposed to using MLMs as post-processing steps) and modeling other musical cues, such as chords, key and meter (as opposed to simply modeling note sequences).

B. Score-Informed Transcription

If a known piece is performed, the musical score provides a strong prior for the transcription. In many cases, there are discrepancies between the score and a given music performance, which may be due to a specific interpretation by a performer, or due to performance mistakes. For applications such as music education, it is useful to identify such discrepancies by incorporating the musical score as additional prior information to simplify the transcription process (score-informed music transcription [35]). Typically, systems for score-informed music transcription use a score-to-audio alignment method as a pre-processing step, in order to align the music score with the input music audio prior to performing transcription, e.g. [35]. While specific instances of score-informed transcription systems have been developed for certain instruments (piano, violin), the problem is still relatively unexplored, as is the related and more challenging problem of lead-sheet-informed transcription and the eventual integration of these methods towards the development of automatic music tutoring systems.
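As an illustration of the score-to-audio alignment pre-processing step mentioned above, the sketch below aligns chroma features of a performance recording with chroma features of a synthesized rendition of the score (e.g., rendered from MIDI) using dynamic time warping. librosa is assumed for feature extraction and DTW, and the file names are placeholders.

```python
import numpy as np
import librosa

# Performance audio and a synthesized rendition of the score (placeholders).
perf, sr = librosa.load("performance.wav", sr=22050)
score_audio, _ = librosa.load("score_synthesized.wav", sr=22050)

hop = 512
chroma_perf = librosa.feature.chroma_cqt(y=perf, sr=sr, hop_length=hop)
chroma_score = librosa.feature.chroma_cqt(y=score_audio, sr=sr, hop_length=hop)

# Dynamic time warping between the two chroma sequences; wp is the warping
# path as pairs of (score frame, performance frame) indices.
D, wp = librosa.sequence.dtw(X=chroma_score, Y=chroma_perf, metric="cosine")
wp_seconds = np.asarray(wp) * hop / sr  # frame indices converted to seconds
```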
C. Context-Specific Transcription
results suggest that the network learned something additional:
the LSTM layers as used in the network (compare Fig. 6) seem While the creation of a “blind” multi-instrument AMT
to have learned how typical piano notes evolve in time and thus system without specific knowledge of the music style, in-
most note lengths look reasonable and less spurious. Similarly, struments and recording conditions is yet to be achieved,
the bandwidth in which octave errors occur is narrower for considerable progress has been reported on the problem of
the neural network, which could potentially indicate that the context-specific transcription, where prior knowledge of the
network models the likelihood of co-occurring notes or, in sound of the specific instrument model or manufacturer and the
other words, a simple music language model, which leads us recording environment is available. For context-specific piano
to our discussion of important remaining challenges in AMT. transcription, multi-pitch detection accuracy can exceed 90%
[23], [22], making such systems appropriate for user-facing
IV. F URTHER E XTENSIONS AND F UTURE W ORK applications. Open work in this topic includes the creation of
A. Music Language Models context-specific AMT systems for multiple instruments.
As outlined in Section I-B, AMT is closely related to
automatic speech recognition (ASR). In the same way that D. Non-Western Music
a typical ASR system consists of an acoustic component and As might be evident by surveying the AMT literature, the
a language component, an AMT system can model both the vast majority of approaches target only Western (or Euroge-
acoustic sequences and also the underlying sequence of notes netic) music. This allows several assumptions, regarding both
and other music cues over time. AMT systems have thus the instruments used and also the way that music is represented
incorporated music language models (MLMs) for modeling and produced (typical assumptions include: octaves containing
sequences of notes in a polyphonic context, with the aim 12 equally-spaced pitches; two modes, major and minor; a
of improving transcription performance. The capabilities of standard tuning frequency of A4 = 440 Hz). However, these
deep learning methods towards modeling high-dimensional assumptions do not hold true for other music styles from
sequences have recently made polyphonic music sequence pre- around the world, where for instance an octave is often
diction possible. Boulanger-Lewandowski et al. [5] combined divided into microtones (e.g., Arabic music theory is based
a restricted Bolzmann machine (RBM) with an RNN for poly- on quartertones), or on the existence of modes that are not
phonic music prediction, which was used to post-process the used in Western music (e.g., classical Indian music recognizes
However, these assumptions do not hold true for other music styles from around the world, where for instance an octave is often divided into microtones (e.g., Arabic music theory is based on quartertones), or where modes are used that do not exist in Western music (e.g., classical Indian music recognizes hundreds of modes, called ragas). Therefore, automatically transcribing non-Western music still remains an open problem with several challenges, including the design of appropriate signal and music notation representations while avoiding a so-called Western bias [36]. Another major issue is the lack of annotated datasets for non-Western music, rendering the application of data-intensive machine learning methods difficult.

E. Expressive Pitch and Timing

Western notation conceptualizes music as sequences of unchanging pitches being maintained for regular durations, and has little scope for representing expressive use of microtonality and microtiming, nor for detailed recording of timbre and dynamics. Research on automatic transcription has followed this narrow view, describing notes in terms of discrete pitches plus onset and offset times. For example, no suitable notation exists for performed singing, the most universal form of music-making. Likewise for other instruments without fixed pitch or with other expressive techniques, better representations are required. These richer representations can then be reduced to Western score notation, if required, by modeling musical knowledge and stylistic conventions.

F. Percussion and Unpitched Sounds

An active problem in the music signal processing literature is that of detecting and classifying non-pitched sounds in music signals [1, Ch. 5]. In most cases this is expressed as the problem of drum transcription, since the vast majority of contemporary music contains mixtures of pitched sounds and unpitched sounds produced by a drum kit. Drum kit components typically include the bass drum, snare drum, hi-hat, cymbals and toms. The problem in this case is to detect and classify percussive sounds into one of the aforementioned sound classes. Elements of the drum transcription problem that make it particularly challenging are the concurrent presence of several harmonic, inharmonic and non-harmonic sounds in the music signal, as well as the requirement of an increased temporal resolution for drum transcription systems compared to typical multi-pitch detection systems. Approaches for pitched instrument transcription and drum transcription have largely been developed independently, and the creation of a robust music transcription system that supports both pitched and unpitched sounds still remains an open problem.

G. Evaluation Metrics

Most AMT approaches are evaluated using the set of metrics proposed for the MIREX Multiple-F0 Estimation and Note Tracking public evaluation tasks (http://www.music-ir.org/mirex/). Three types of metrics are included: frame-based, note-based and stream-based, mirroring the frame-level, note-level and stream-level transcription categories presented in Section II. While the above sets of metrics all have their merits, it could be argued that they do not correspond with human perception of music transcription accuracy, where e.g. an extra note might be considered a more severe error than a missed note, or where out-of-key note errors might be penalized more than in-key ones. Therefore, the creation of perceptually relevant evaluation metrics for AMT, as well as the creation of evaluation metrics for notation-level transcription, remain open problems.
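For concreteness, the frame-based metrics reduce to simple counts over binarized piano rolls. The minimal sketch below follows the standard precision/recall/F-measure definitions rather than any particular MIREX implementation.

```python
import numpy as np

def frame_metrics(reference, estimate):
    """Frame-level precision, recall and F-measure between two binary
    piano rolls of shape (num_pitches, num_frames)."""
    ref = reference.astype(bool)
    est = estimate.astype(bool)
    tp = np.logical_and(ref, est).sum()    # correctly detected pitch-frames
    fp = np.logical_and(~ref, est).sum()   # extra (spurious) detections
    fn = np.logical_and(ref, ~est).sum()   # missed pitch-frames
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```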
V. CONCLUSIONS

Automatic music transcription has remained an active area of research in the fields of music signal processing and music information retrieval for several decades, with several potential benefits in other areas and fields extending beyond the remit of music. As outlined in this paper, there remain several challenges to be addressed in order to fully solve this problem: these include the key challenges described in Section I-C on modeling music signals and on the availability of data, challenges with respect to the limitations of state-of-the-art methodologies as described in Section III-C, and finally extensions beyond the current remit of existing tasks as presented in Section IV. We believe that addressing these challenges will lead towards the creation of a "complete" music transcription system and towards unlocking the full potential of music signal processing technologies. Supplementary audio material related to this paper can be found on the companion website (http://c4dm.eecs.qmul.ac.uk/spm-amt-overview/).

REFERENCES

[1] A. Klapuri and M. Davy, Eds., Signal Processing Methods for Music Transcription. New York: Springer, 2006.
[2] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, "Automatic music transcription: challenges and future directions," Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, Dec. 2013.
[3] M. Müller, D. P. Ellis, A. Klapuri, and G. Richard, "Signal processing for music analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1088–1110, Oct. 2011.
[4] M. Schedl, E. Gómez, and J. Urbano, "Music information retrieval: Recent developments and applications," Foundations and Trends in Information Retrieval, vol. 8, pp. 127–261, 2014.
[5] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," in Proc. International Conference on Machine Learning (ICML), 2012.
[6] T. Virtanen, M. D. Plumbley, and D. P. W. Ellis, Eds., Computational Analysis of Sound Scenes and Events. Springer, 2018.
[7] L. Su and Y.-H. Yang, "Escaping from the abyss of manual annotation: New methodology of building polyphonic datasets for automatic music transcription," in Proc. International Symposium on Computer Music Multidisciplinary Research (CMMR), 2015, pp. 309–321.
[8] Z. Duan, B. Pardo, and C. Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121–2133, 2010.
[9] Z. Duan and D. Temperley, "Note-level music transcription by maximum likelihood sampling," in ISMIR, 2014, pp. 181–186.
[10] Z. Duan, J. Han, and B. Pardo, "Multi-pitch streaming of harmonic sound mixtures," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 138–150, Jan. 2014.
[11] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.
[12] L. Su and Y.-H. Yang, "Combining spectral and temporal representations for multipitch estimation of polyphonic music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 10, pp. 1600–1612, Oct. 2015.
[13] P. H. Peeling, A. T. Cemgil, and S. J. Godsill, "Generative spectrogram factorization models for polyphonic piano transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 519–527, Mar. 2010.
[14] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003, pp. 177–180.
[15] E. Vincent, N. Bertin, and R. Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.
[16] E. Benetos and S. Dixon, "Multiple-instrument polyphonic music transcription using a temporally-constrained shift-invariant model," Journal of the Acoustical Society of America, vol. 133, no. 3, pp. 1727–1741, Mar. 2013.
[17] B. Fuentes, R. Badeau, and G. Richard, "Harmonic adaptive latent component analysis of audio and application to music transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1854–1866, Sept. 2013.
[18] S. Sigtia, E. Benetos, and S. Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, May 2016.
[19] R. Kelz, M. Dorfer, F. Korzeniowski, S. Böck, A. Arzt, and G. Widmer, "On the potential of simple framewise approaches to piano transcription," in Proc. International Society for Music Information Retrieval Conference, 2016, pp. 475–481.
[20] J. Nam, J. Ngiam, H. Lee, and M. Slaney, "A classification-based polyphonic piano transcription approach using learned feature representations," in ISMIR, 2011, pp. 175–180.
[21] M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
[22] A. Cogliati, Z. Duan, and B. Wohlberg, "Context-dependent piano music transcription with convolutional sparse coding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2218–2230, Dec. 2016.
[23] S. Ewert and M. B. Sandler, "Piano transcription in the studio using an extensible alternating directions framework," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 1983–1997, Nov. 2016.
[24] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, "Onsets and frames: Dual-objective piano transcription," in Proc. International Society for Music Information Retrieval Conference, 2018.
[25] V. Arora and L. Behera, "Multiple F0 estimation and source clustering of polyphonic music audio using PLCA and HMRFs," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 278–287, 2015.
[26] E. Cambouropoulos, "Pitch spelling: A computational model," Music Perception, vol. 20, no. 4, pp. 411–429, 2003.
[27] H. Grohganz, M. Clausen, and M. Müller, "Estimating musical time information from performed MIDI files," in Proc. International Society for Music Information Retrieval Conference, 2014.
[28] I. Karydis, A. Nanopoulos, A. Papadopoulos, E. Cambouropoulos, and Y. Manolopoulos, "Horizontal and vertical integration/segregation in auditory streaming: a voice separation algorithm for symbolic musical data," in Proc. Sound and Music Computing Conference (SMC), 2007.
[29] A. Cogliati, D. Temperley, and Z. Duan, "Transcribing human piano performances into music notation," in Proc. International Society for Music Information Retrieval Conference, 2016, pp. 758–764.
[30] R. G. C. Carvalho and P. Smaragdis, "Towards end-to-end polyphonic music transcription: Transforming music audio directly to a score," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2017, pp. 151–155.
[31] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 556–562.
[32] S. A. Abdallah and M. D. Plumbley, "Unsupervised analysis of polyphonic music by sparse coding," IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 179–196, 2006.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[34] S. Böck and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 121–124.
[35] S. Wang, S. Ewert, and S. Dixon, "Identifying missing and extra notes in piano recordings using score-informed dictionary learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1877–1889, Oct. 2017.
[36] X. Serra, "A multicultural approach in music information research," in Proc. 12th International Society for Music Information Retrieval Conference, 2011, pp. 151–156.

Emmanouil Benetos (S'09, M'12) is Lecturer and Royal Academy of Engineering Research Fellow with the Centre for Digital Music, Queen Mary University of London, and Turing Fellow with the Alan Turing Institute. He received the Ph.D. degree in Electronic Engineering from Queen Mary University of London, U.K., in 2012. From 2013 to 2015, he was University Research Fellow with the Department of Computer Science, City, University of London. He has published over 80 peer-reviewed papers spanning several topics in audio and music signal processing. His research focuses on signal processing and machine learning for music and audio analysis, as well as applications to music information retrieval, acoustic scene analysis, and computational musicology.

Simon Dixon is Professor and Deputy Director of the Centre for Digital Music at Queen Mary University of London. He has a Ph.D. in Computer Science (Sydney) and L.Mus.A. diploma in Classical Guitar. His research is in music informatics, including high-level music signal analysis, computational modeling of musical knowledge, and the study of musical performance. Particular areas of focus include automatic music transcription, beat tracking, audio alignment and analysis of intonation and temperament. He was President (2014-15) of the International Society for Music Information Retrieval (ISMIR), is founding Editor of the Transactions of ISMIR, and has published over 160 refereed papers in the area of music informatics.

Zhiyao Duan (S'09, M'13) is an assistant professor in the Electrical and Computer Engineering Department at the University of Rochester. He received his B.S. in Automation and M.S. in Control Science and Engineering from Tsinghua University, China, in 2004 and 2008, respectively, and received his Ph.D. in Computer Science from Northwestern University in 2013. His research interest is in the broad area of computer audition, i.e., designing computational systems that are capable of understanding sounds, including music, speech, and environmental sounds. He co-presented a tutorial on Automatic Music Transcription at ISMIR 2015. He received a best paper award at the 2017 Sound and Music Computing (SMC) conference and a best paper nomination at ISMIR 2017.

Sebastian Ewert is a Senior Research Scientist at Spotify. He received the M.Sc./Diplom and Ph.D. degrees (summa cum laude) in computer science from the University of Bonn (svd. Max-Planck-Institute for Informatics), Germany, in 2007 and 2012, respectively. In 2012, he was awarded a GAES fellowship and joined the Centre for Digital Music, Queen Mary University of London (United Kingdom). At the Centre, he became Lecturer for Signal Processing in 2015 and was one of the founding members of the Machine Listening Lab, which focuses on the development of machine learning and signal processing methods for audio and music applications.