Figure 1. Data represented in an AMT system. (a) Input waveform, (b) internal time-frequency representation, (c) output piano-roll representation, (d) output music score, with notes A and D marked with gray circles. The example corresponds to the first 6 seconds of W. A. Mozart's Piano Sonata No. 13, 3rd movement (taken from the MAPS database).
cocktail party problem in speech, music usually involves multiple simultaneous voices, but unlike speech, these voices are highly correlated in time and in frequency (see Challenges 2 and 3 in Section I-C). In addition, both AMT and ASR systems benefit from language modeling components that are combined with acoustic components in order to produce plausible results. Thus, there are also clear links between AMT and the wider field of natural language processing (NLP), with music having its own grammatical rules or statistical regularities, in a similar way to natural language [5]. The use of language models for AMT is detailed in Section IV.

Within the emerging field of sound scene analysis, there is a direct analogy between AMT and Sound Event Detection (SED) [6], in particular with polyphonic SED, which involves detecting and classifying multiple overlapping events from audio. While everyday and natural sounds do not exhibit the same degree of temporal regularity and inter-source frequency dependence as found in music signals, there are close interactions between the two problems in terms of the methodologies used, as observed in the literature [6].

Further, AMT is related to image processing and computer vision, as musical objects such as notes can be recognized as two-dimensional patterns in time-frequency representations. Compared with image processing and computer vision, where occlusion is a common issue, AMT systems are often affected by musical objects occupying the same time-frequency regions (this is detailed in Section I-C).

C. Key Challenges

Compared to other problems in the music signal processing field or the wider signal processing discipline, there are several factors that make AMT particularly challenging:
1) Polyphonic music contains a mixture of multiple simultaneous sources (e.g., instruments, vocals) with different pitch, loudness and timbre (sound quality), with each source producing one or more musical voices. Inferring musical attributes (e.g., pitch) from the mixture signal is an extremely under-determined problem.
2) Overlapping sound events often exhibit harmonic relations with each other; for any consonant musical interval, the fundamental frequencies form small integer ratios, so that their harmonics overlap in frequency,
results may be biased by the limitations of datasets and evaluation metrics (see Sections I-C and IV-G).

Note-level transcription, or note tracking, is one level higher than MPE in terms of the richness of structures of the estimates. It not only estimates the pitches in each time frame, but also connects pitch estimates over time into notes. In the AMT literature, a musical note is often characterized by three elements: pitch, onset time, and offset time [1]. As note offsets can be ambiguous, they are sometimes neglected in the evaluation of note tracking approaches, and as such, some note tracking approaches only estimate the pitch and onset times of notes. Fig. 2 (middle) shows an example of a note-level transcription, where each note is shown as a red circle (onset) followed by a black line (pitch contour). Many note tracking approaches form notes by post-processing MPE outputs (i.e., pitch estimates in individual frames). Techniques that have been used in this context include median filtering [12], Hidden Markov Models (HMMs) [20], and neural networks [5]. This post-processing is often performed for each MIDI pitch independently without considering the interactions among simultaneous notes. This often leads to spurious or missing notes that share harmonics with correctly estimated notes. Some approaches have been proposed to consider note interactions through a spectral likelihood model [9] or a music language model [5], [18] (see Section IV-A). Another subset of approaches estimates notes directly from the audio signal instead of building upon MPE outputs. Some approaches first detect onsets and then estimate pitches within each inter-onset interval [21], while others estimate pitch, onset and sometimes offset in the same framework [22], [23], [24].
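As a rough illustration of this kind of frame-to-note post-processing, the sketch below applies a per-pitch median filter followed by simple hysteresis thresholding to a pitch-activation matrix. The function name, thresholds and minimum duration are illustrative choices only and do not reproduce any of the cited systems, which typically rely on trained HMMs or neural networks instead.

```python
import numpy as np
from scipy.ndimage import median_filter

def activations_to_notes(act, fps=100, thr_on=0.5, thr_off=0.3, min_dur=0.05):
    """Turn a (pitches x frames) activation matrix into (pitch, onset, offset) notes.

    A per-pitch median filter removes isolated spurious frames; hysteresis
    thresholding then segments each pitch track into individual notes.
    """
    act = median_filter(act, size=(1, 5))            # smooth along time only
    notes = []
    for p in range(act.shape[0]):                    # each row = one (MIDI) pitch
        onset = None
        for t in range(act.shape[1]):
            if onset is None and act[p, t] >= thr_on:
                onset = t                            # note switches on
            elif onset is not None and act[p, t] < thr_off:
                if (t - onset) / fps >= min_dur:     # drop very short blips
                    notes.append((p, onset / fps, t / fps))
                onset = None
        if onset is not None:                        # note still active at the end
            notes.append((p, onset / fps, act.shape[1] / fps))
    return notes
```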
Stream-level transcription, also called Multi-Pitch Streaming (MPS), targets grouping estimated pitches or notes into streams, where each stream typically corresponds to one instrument or musical voice, and is closely related to instrument source separation. Fig. 2 (bottom) shows an example of a stream-level transcription, where pitch streams of different instruments have different colors. Compared to note-level transcription, the pitch contour of each stream is much longer than a single note and contains multiple discontinuities that are caused by silence, non-pitched sounds and abrupt frequency changes. Therefore, techniques that are often used in note-level transcription are generally not sufficient to group pitches into a long and discontinuous contour. One important cue for MPS that is not explored in MPE and note tracking is timbre: notes of the same stream (source) generally show similar timbral characteristics compared to those in different streams. Therefore, stream-level transcription is also called timbre tracking or instrument tracking in the literature. Existing works at this level are few, with [16], [10], [25] as examples.
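To make the role of timbre concrete, the toy sketch below describes each note by the mean MFCCs of its audio segment and clusters notes into streams with k-means. The feature choice and function names are this article's illustration only; real MPS methods such as [10] additionally enforce constraints (e.g., that simultaneous notes cannot share a stream) which are omitted here.

```python
import numpy as np
import librosa
from sklearn.cluster import KMeans

def stream_notes_by_timbre(y, sr, notes, n_streams=2):
    """Assign each (pitch, onset, offset) note to a stream by clustering timbre.

    Each note is described by the mean MFCC vector of its audio segment;
    k-means then groups notes with similar timbre into streams.
    """
    feats = []
    for pitch, onset, offset in notes:
        start = int(onset * sr)
        end = max(int(offset * sr), start + 2048)    # at least one analysis frame
        mfcc = librosa.feature.mfcc(y=y[start:end], sr=sr, n_mfcc=13)
        feats.append(mfcc.mean(axis=1))
    labels = KMeans(n_clusters=n_streams, n_init=10).fit_predict(np.array(feats))
    return [(p, on, off, int(lab)) for (p, on, off), lab in zip(notes, labels)]
```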
From frame-level to note-level to stream-level, the transcription task becomes more complex as more musical structures and cues need to be modeled. However, the transcription outputs at these three levels are all parametric transcriptions, which are parametric descriptions of the audio content. The MIDI piano roll shown in Fig. 1(c) is a good example of such a transcription. It is indeed an abstraction of the music audio; however, it has not yet reached the level of abstraction of music notation: time is still measured in seconds instead of beats; pitch is measured in MIDI numbers instead of spelled note names that are compatible with the key (e.g., C♯ vs. D♭); and the concepts of beat, bar, meter, key, harmony, and stream are lacking.

Notation-level transcription aims to transcribe the music audio into a human-readable musical score, such as the staff notation widely used in Western classical music. Transcription at this level requires a deeper understanding of musical structures, including harmonic, rhythmic and stream structures. Harmonic structures such as keys and chords influence the note spelling of each MIDI pitch; rhythmic structures such as beats and bars help to quantize the lengths of notes; and stream structures aid the assignment of notes to different staffs. There has been some work on the estimation of musical structures from audio or MIDI representations of a performance. For example, methods for pitch spelling [26], timing quantization [27], and voice separation [28] from performed MIDI files have been proposed. However, little work has been done on integrating these structures into a complete music notation transcription, especially for polyphonic music. Several software packages, including Finale, GarageBand and MuseScore, provide the functionality of converting a MIDI file into music notation; however, the results are often not satisfying, and it is not clear which musical structures have been estimated and integrated during the transcription process. Cogliati et al. [29] proposed a method to convert a MIDI performance into music notation, with a systematic comparison of the transcription performance against the above-mentioned software. In terms of audio-to-notation transcription, a proof-of-concept work using end-to-end neural networks was proposed by Carvalho and Smaragdis [30] to directly map music audio into music notation without explicitly modeling musical structures.

III. State-of-the-Art

While there is a wide range of applicable methods, automatic music transcription has been dominated during the last decade by two algorithmic families: Non-Negative Matrix Factorization (NMF) and Neural Networks (NNs). Both families have been used for a variety of tasks, from speech and image processing to recommender systems and natural language processing. Despite this wide applicability, both approaches offer a range of properties that make them particularly suitable for modeling music recordings at the note level.

A. Non-negative Matrix Factorization for AMT

The basic idea behind NMF and its variants is to represent a given non-negative time-frequency representation $\mathbf{V} \in \mathbb{R}_{\geq 0}^{M \times N}$, e.g., a magnitude spectrogram, as a product of two non-negative matrices: a dictionary $\mathbf{D} \in \mathbb{R}_{\geq 0}^{M \times K}$ and an activation matrix $\mathbf{A} \in \mathbb{R}_{\geq 0}^{K \times N}$; see Fig. 3. Computationally, the goal is to minimize a distance (or divergence) between V and DA with respect to D and A. As a straightforward approach to solving this minimization problem, multiplicative update rules have been central to the success of NMF.
Figure 3. NMF example, using the same audio recording as Fig. 1. (a) Input spectrogram V, (b) approximated spectrogram DA, (c) dictionary D (pre-extracted), (d) activation matrix A.
For example, the generalized Kullback-Leibler divergence between V and DA is non-increasing under the following updates:

$$
\mathbf{A} \leftarrow \mathbf{A} \odot \frac{\mathbf{D}^{\top}\!\left(\frac{\mathbf{V}}{\mathbf{DA}}\right)}{\mathbf{D}^{\top}\mathbf{J}}
\quad\text{and}\quad
\mathbf{D} \leftarrow \mathbf{D} \odot \frac{\left(\frac{\mathbf{V}}{\mathbf{DA}}\right)\mathbf{A}^{\top}}{\mathbf{J}\mathbf{A}^{\top}},
$$

where the operator $\odot$ denotes point-wise multiplication, $\mathbf{J} \in \mathbb{R}^{M \times N}$ denotes the matrix of ones, and the division is point-wise. Intuitively, the update rules can be derived by choosing a specific step size in a gradient (or rather coordinate) descent based minimization of the divergence [31].
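The update rules translate almost line by line into code. The following minimal NumPy sketch factorizes a magnitude spectrogram with the KL-based multiplicative updates; the number of templates, iteration count and the small constant added for numerical stability are arbitrary choices rather than recommended settings.

```python
import numpy as np

def nmf_kl(V, K=24, n_iter=200, eps=1e-10, seed=0):
    """Factorize a non-negative spectrogram V (M x N) as V ~ D @ A.

    D (M x K) holds spectral templates and A (K x N) their activations;
    both are refined with the multiplicative updates that decrease the
    generalized Kullback-Leibler divergence between V and D @ A.
    """
    rng = np.random.default_rng(seed)
    M, N = V.shape
    D = rng.random((M, K)) + eps
    A = rng.random((K, N)) + eps
    J = np.ones_like(V)                       # matrix of ones, as in the text
    for _ in range(n_iter):
        R = V / (D @ A + eps)                 # point-wise ratio V / (DA)
        A *= (D.T @ R) / (D.T @ J + eps)      # activation update
        R = V / (D @ A + eps)
        D *= (R @ A.T) / (J @ A.T + eps)      # template update
    return D, A
```

Thresholding the rows of the returned A frame by frame then gives a rudimentary multi-pitch estimate in the spirit of Fig. 3(d).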
In an AMT context, both unknown matrices have an intuitive interpretation: the n-th column of V, i.e., the spectrum at time point n, is modeled in NMF as a linear combination of the K columns of D, and the corresponding K coefficients are given by the n-th column of A. Given this point of view, each column of D is often referred to as a (spectral) template and usually represents the expected spectral energy distribution associated with a specific note played on a specific instrument. For each template, the corresponding row in A is referred to as the associated activation and encodes when and how intensely that note is played over time. Given the non-negativity constraints, NMF yields a purely constructive representation in the sense that spectral energy modeled by one template cannot be cancelled by another – this property is often seen as instrumental in identifying a parts-based and interpretable representation of the input [31].

In Fig. 3, an NMF-based decomposition is illustrated. The magnitude spectrogram V shown in Fig. 3(a) is modeled as a product of the dictionary D and activation matrix A shown in Fig. 3(c) and (d), respectively. The product DA is given in Fig. 3(b). In this case, the templates correspond to individual pitches, with clearly visible fundamental frequencies and harmonics. Additionally, comparing A with the piano-roll representation shown in Fig. 1(c) indicates the correlation between NMF activations and the underlying musical score.

Figure 4. Inharmonicity: Spectrum of a C♯1 note played on a piano. The stiffness of strings causes partials to be shifted from perfect integer multiples of the fundamental frequency (shown as vertical dotted lines); here the 23rd partial is at the position where the 24th harmonic would be expected. Note that the fundamental frequency of 34.65 Hz is missing, as piano soundboards typically do not resonate for modes with a frequency smaller than ≈50 Hz.

While Fig. 3 illustrates the principles behind NMF, it also indicates why AMT is difficult – indeed, a regular NMF decomposition would rarely look as clean as in Fig. 3. Compared to speech analysis, sound objects in music are highly correlated. For example, even in a simple piece as shown in Fig. 1, most pairs of simultaneous notes are separated by musically consonant intervals, which acoustically means that many of their partials overlap (e.g., the A and D notes around 4 seconds, marked with gray circles in Fig. 1(d), share a high number of partials). In this case, it can be difficult to disentangle how much energy belongs to which note. The task is further complicated by the fact that the spectro-temporal properties of notes vary considerably between different pitches, playing styles, dynamics and recording conditions. Further, stiffness properties of strings affect the travel speed of transverse waves based on their frequency – as a result, the partials of instruments such as the piano are not found at perfect integer multiples of the fundamental frequency. Due to this property, called inharmonicity, the positions of partials differ between individual pianos (see Fig. 4).
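A standard stiff-string model places the k-th partial at f_k = k·f0·sqrt(1 + B·k²), with B an instrument- and string-dependent inharmonicity coefficient. The snippet below uses this model with an assumed, purely illustrative value of B (not measured from the recording in Fig. 4) to reproduce the effect described above, where the 23rd partial lands near the position of the 24th harmonic.

```python
import numpy as np

def partial_frequencies(f0, n_partials, B):
    """Stiff-string model: f_k = k * f0 * sqrt(1 + B * k**2)."""
    k = np.arange(1, n_partials + 1)
    return k * f0 * np.sqrt(1.0 + B * k**2)

# f0 as in Fig. 4; B is an assumed, illustrative coefficient (not measured).
f0, B = 34.65, 1.7e-4
partials = partial_frequencies(f0, 24, B)
print(partials[22], 23 * f0, 24 * f0)
# The 23rd partial (~832 Hz) sits near the 24th harmonic (~832 Hz)
# rather than the 23rd (~797 Hz).
```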
Figure 5. Harmonic NMF [15]: Each NMF template (right-hand side) is represented as a linear combination of fixed narrow-band sub-templates. The resulting template is constrained to represent harmonic sounds by construction.

To address these challenges, the basic NMF model has been extended by encouraging additional structure in the dictionary and the activations. For example, an important principle is to enforce sparsity in A to obtain a solution dominated by few but substantial activations – the success of sparsity paved the way for a whole range of sparse coding approaches, in which the dictionary size K can exceed the input dimension M considerably [32]. Other extensions focus on the dictionary design. In the case of supervised NMF, the dictionary is pre-computed and fixed using additionally available training material. For example, given K recordings each containing only a single note, the dictionary shown in Fig. 3(c) was constructed by extracting one template from each recording – this way, the templates are guaranteed to be free of interference from other notes and also have a clear interpretation. As another example, Fig. 5 illustrates an extension in which each NMF template is represented as a linear combination of fixed narrow-band sub-templates [15], which enforces a harmonic structure for all NMF templates – this way, a dictionary can be adapted to the recording to be transcribed, while maintaining its clean, interpretable structure.

In shift-invariant dictionaries, a single template can be used to represent a range of different fundamental frequencies. In particular, using a logarithmic frequency axis, the distances between individual partials of a harmonic sound are fixed, and thus shifting a template in frequency allows modeling sounds of varying pitch. Sharing parameters between different pitches in this way has turned out to be effective towards increasing model capacity (see, e.g., [16], [17]). Further, spectro-temporal dictionaries alleviate a specific weakness of NMF models: in NMF it is difficult to express that notes often have a specific temporal evolution – e.g., the beginning of a note (or attack phase) might have entirely different spectral properties than the central part (decay phase). Such relationships are modeled in spectro-temporal dictionaries using a Markov process which governs the sequencing of templates across frames, so that different subsets of templates can be used for the attack and the decay parts, respectively [16], [23].
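In the supervised setting mentioned above, only the activations are updated while the dictionary stays fixed. The sketch below, which reuses the KL updates from the earlier nmf_kl example, extracts one template per isolated-note recording as its average spectrum; both function names and the template-extraction rule are illustrative simplifications, not the procedure of any cited work.

```python
import numpy as np

def extract_template(V_note, eps=1e-10):
    """One template per isolated-note recording: its normalized average spectrum."""
    t = V_note.mean(axis=1)
    return t / (t.sum() + eps)

def supervised_activations(V, D, n_iter=200, eps=1e-10, seed=0):
    """KL-NMF with a fixed, pre-computed dictionary D; only A is updated."""
    rng = np.random.default_rng(seed)
    A = rng.random((D.shape[1], V.shape[1])) + eps
    J = np.ones_like(V)
    for _ in range(n_iter):
        R = V / (D @ A + eps)
        A *= (D.T @ R) / (D.T @ J + eps)
    return A
```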
B. Neural Networks for AMT

As for many tasks relating to pattern recognition, neural networks (NNs) have had a considerable impact in recent years, learning a non-linear function (or a composition of functions) from input to output via an optimization algorithm such as stochastic gradient descent [33]. Compared to other fields including image processing, progress on NNs for music transcription has been slower, and we will discuss a few of the underlying reasons below.

One of the earliest approaches based on neural networks was Marolt's Sonic system [21]. A central component in this approach was the use of time-delay (TD) networks, which resemble convolutional networks in the time direction [33] and were employed to analyse the output of adaptive oscillators, in order to track and group partials in the output of a gammatone filterbank. Although it was initially published in 2001, the approach remains competitive and still appears in comparisons in more recent publications [23].

In the context of the more recent revival of neural networks, a first successful system was presented by Böck and Schedl [34]. One of the core ideas was to use two spectrograms as input, to enable the network to exploit both a high time accuracy (when estimating the note onset position) and a high frequency resolution (when disentangling notes in the lower frequency range). This input is processed using one (or more) Long Short-Term Memory (LSTM) layers [33]. The potential benefit of using LSTM layers is two-fold. First, the spectral properties of a note evolve across input frames, and LSTM networks have the capability to compactly model such sequences. Second, medium- and long-range dependencies between notes can potentially be captured: for example, based on a popular chord sequence, after hearing C and G major chords followed by A minor, a likely successor is an F major chord. An investigation of whether such long-range dependencies are indeed modeled, however, was not in scope.
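The systems discussed here share a common recipe: a time-frequency input is mapped, frame by frame, to independent note probabilities. The PyTorch sketch below follows that recipe with a single bidirectional LSTM; the input size, hidden size and use of bidirectionality are arbitrary choices, and the model is not a reimplementation of [34], [18] or [19].

```python
import torch
import torch.nn as nn

class FramewiseTranscriber(nn.Module):
    """Map spectrogram frames to per-frame note probabilities (88 piano keys)."""

    def __init__(self, n_bins=229, hidden=256, n_notes=88):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_notes)

    def forward(self, spec):                  # spec: (batch, frames, n_bins)
        h, _ = self.lstm(spec)                # h: (batch, frames, 2 * hidden)
        return torch.sigmoid(self.head(h))    # (batch, frames, n_notes) in [0, 1]

# Training minimizes binary cross-entropy against the target piano roll:
#   loss = nn.functional.binary_cross_entropy(model(spec), roll)
```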
Sigtia et al. [18] focus on long-range dependencies in music by combining an acoustic front-end with a symbolic-level module resembling a language model as used in speech processing. Using information obtained from MIDI files, a recurrent network is trained to predict the active notes in the next time frame given the past. This approach needs to learn and represent a very large joint probability distribution, i.e., a probability for every possible combination of active and inactive notes across time – note that even in a single frame there are $2^{88}$ possible combinations of notes on a piano. To render the problem of modeling such an enormous probability space tractable, the approach employs a specific neural network architecture (NADE), which represents a large joint distribution as a long product of conditional probabilities – an approach quite similar to the idea popularized recently by the well-known WaveNet architecture. Despite the use of a dedicated music language model, which was trained on relatively large MIDI-based datasets, only modest improvements over an HMM baseline could be observed, and thus the question remains open to which degree long-range dependencies are indeed captured. To further disentangle the influence of the acoustic front-end from the language model on potential improvements in performance, Kelz et al. [19] focus on the acoustic modeling and report on the results of a larger-scale hyperparameter search.
acoustic output of an AMT system. Sigtia et al. [18] also used the aforementioned RNN-RBM as an MLM, and combined the acoustic and language predictions using a probabilistic graphical model. While these initial works showed promising results, there are several directions for future research in MLMs; these include creating unified acoustic and language models (as opposed to using MLMs as post-processing steps) and modeling other musical cues, such as chords, key and meter (as opposed to simply modeling note sequences).
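To make the idea of an MLM concrete, the sketch below is a generic recurrent model that predicts the note activity of the next frame from the past. Unlike the RNN-RBM or NADE models used in the cited work, it treats the 88 notes of a frame as conditionally independent given the recurrent state, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class NoteLanguageModel(nn.Module):
    """Predict note activity at frame t+1 from the piano roll up to frame t."""

    def __init__(self, n_notes=88, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_notes, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_notes)

    def forward(self, roll):                  # roll: (batch, frames, n_notes), 0/1
        h, _ = self.rnn(roll)
        return torch.sigmoid(self.out(h))     # prediction for the following frame

# Trained with binary cross-entropy between model(roll[:, :-1]) and roll[:, 1:],
# its predictions can then be combined with the acoustic model's posteriors.
```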
B. Score-Informed Transcription
hundreds of modes, called ragas). Therefore, automatically transcribing non-Western music still remains an open problem with several challenges, including the design of appropriate signal and music notation representations while avoiding a so-called Western bias [36]. Another major issue is the lack of annotated datasets for non-Western music, rendering the application of data-intensive machine learning methods difficult.

E. Expressive Pitch and Timing

Western notation conceptualizes music as sequences of unchanging pitches being maintained for regular durations, and has little scope for representing expressive use of microtonality and microtiming, nor for detailed recording of timbre and dynamics. Research on automatic transcription has followed this narrow view, describing notes in terms of discrete pitches plus onset and offset times. For example, no suitable notation exists for performed singing, the most universal form of music-making. Likewise, for other instruments without fixed pitch or with other expressive techniques, better representations are required. These richer representations can then be reduced to Western score notation, if required, by modeling musical knowledge and stylistic conventions.
F. Percussion and Unpitched Sounds

An active problem in the music signal processing literature is that of detecting and classifying non-pitched sounds in music signals [1, Ch. 5]. In most cases this is expressed as the problem of drum transcription, since the vast majority of contemporary music contains mixtures of pitched sounds and unpitched sounds produced by a drum kit. Drum kit components typically include the bass drum, snare drum, hi-hat, cymbals and toms, and the problem in this case is to detect and classify percussive sounds into one of these sound classes. Elements of the drum transcription problem that make it particularly challenging are the concurrent presence of several harmonic, inharmonic and non-harmonic sounds in the music signal, as well as the requirement of an increased temporal resolution for drum transcription systems compared to typical multi-pitch detection systems. Approaches for pitched-instrument transcription and drum transcription have largely been developed independently, and the creation of a robust music transcription system that supports both pitched and unpitched sounds still remains an open problem.

G. Evaluation Metrics

Most AMT approaches are evaluated using the set of metrics proposed for the MIREX Multiple-F0 Estimation and Note Tracking public evaluation tasks (http://www.music-ir.org/mirex/). Three types of metrics are included: frame-based, note-based and stream-based, mirroring the frame-level, note-level, and stream-level transcription categories presented in Sec. II. While the above sets of metrics all have their merits, it could be argued that they do not correspond with human perception of music transcription accuracy, where, e.g., an extra note might be considered a more severe error than a missed note, or where out-of-key note errors might be penalized more than in-key ones. Therefore, the creation of perceptually relevant evaluation metrics for AMT, as well as the creation of evaluation metrics for notation-level transcription, remain open problems.
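For reference, the frame-based measures reduce to precision, recall and F-measure computed over binarized piano rolls, as in the minimal sketch below (fuller reference implementations of MIREX-style metrics are provided by libraries such as mir_eval).

```python
import numpy as np

def frame_metrics(ref_roll, est_roll):
    """Frame-based precision, recall and F-measure for boolean piano rolls.

    Both inputs are boolean arrays of shape (pitches, frames).
    """
    tp = np.logical_and(ref_roll, est_roll).sum()
    fp = np.logical_and(~ref_roll, est_roll).sum()
    fn = np.logical_and(ref_roll, ~est_roll).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```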
V. Conclusions

Automatic music transcription has remained an active area of research in the fields of music signal processing and music information retrieval for several decades, with several potential benefits in other areas and fields extending beyond the remit of music. As outlined in this paper, several challenges remain to be addressed in order to fully solve this problem: these include the key challenges described in Section I-C on modeling music signals and on the availability of data, challenges with respect to the limitations of state-of-the-art methodologies as described in Section III-C, and, finally, extensions beyond the current remit of existing tasks as presented in Section IV. We believe that addressing these challenges will lead towards the creation of a "complete" music transcription system and towards unlocking the full potential of music signal processing technologies. Supplementary audio material related to this paper can be found in the companion website (http://c4dm.eecs.qmul.ac.uk/spm-amt-overview/).
References

[1] A. Klapuri and M. Davy, Eds., Signal Processing Methods for Music Transcription. New York: Springer, 2006.
[2] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, "Automatic music transcription: challenges and future directions," Journal of Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, Dec. 2013.
[3] M. Müller, D. P. Ellis, A. Klapuri, and G. Richard, "Signal processing for music analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1088–1110, Oct. 2011.
[4] M. Schedl, E. Gómez, and J. Urbano, "Music information retrieval: Recent developments and applications," Foundations and Trends in Information Retrieval, vol. 8, pp. 127–261, 2014.
[5] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription," in Proc. International Conference on Machine Learning (ICML), 2012.
[6] T. Virtanen, M. D. Plumbley, and D. P. W. Ellis, Eds., Computational Analysis of Sound Scenes and Events. Springer, 2018.
[7] L. Su and Y.-H. Yang, "Escaping from the abyss of manual annotation: New methodology of building polyphonic datasets for automatic music transcription," in Proc. International Symposium on Computer Music Multidisciplinary Research (CMMR), 2015, pp. 309–321.
[8] Z. Duan, B. Pardo, and C. Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121–2133, 2010.
[9] Z. Duan and D. Temperley, "Note-level music transcription by maximum likelihood sampling," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 181–186.
[10] Z. Duan, J. Han, and B. Pardo, "Multi-pitch streaming of harmonic sound mixtures," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 138–150, Jan. 2014.
[11] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.
[12] L. Su and Y.-H. Yang, "Combining spectral and temporal representations for multipitch estimation of polyphonic music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 10, pp. 1600–1612, Oct. 2015.
[13] P. H. Peeling, A. T. Cemgil, and S. J. Godsill, "Generative spectrogram factorization models for polyphonic piano transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 519–527, Mar. 2010.
[14] P. Smaragdis and J. C. Brown, "Non-negative matrix factorization for polyphonic music transcription," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003, pp. 177–180.
[15] E. Vincent, N. Bertin, and R. Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.
[16] E. Benetos and S. Dixon, "Multiple-instrument polyphonic music transcription using a temporally-constrained shift-invariant model," Journal of the Acoustical Society of America, vol. 133, no. 3, pp. 1727–1741, Mar. 2013.
[17] B. Fuentes, R. Badeau, and G. Richard, "Harmonic adaptive latent component analysis of audio and application to music transcription," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1854–1866, Sep. 2013.
[18] S. Sigtia, E. Benetos, and S. Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, May 2016.
[19] R. Kelz, M. Dorfer, F. Korzeniowski, S. Böck, A. Arzt, and G. Widmer, "On the potential of simple framewise approaches to piano transcription," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 475–481.
[20] J. Nam, J. Ngiam, H. Lee, and M. Slaney, "A classification-based polyphonic piano transcription approach using learned feature representations," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2011, pp. 175–180.
[21] M. Marolt, "A connectionist approach to automatic transcription of polyphonic piano music," IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
[22] A. Cogliati, Z. Duan, and B. Wohlberg, "Context-dependent piano music transcription with convolutional sparse coding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2218–2230, Dec. 2016.
[23] S. Ewert and M. B. Sandler, "Piano transcription in the studio using an extensible alternating directions framework," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 1983–1997, Nov. 2016.
[24] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, "Onsets and frames: Dual-objective piano transcription," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2018.
[25] V. Arora and L. Behera, "Multiple F0 estimation and source clustering of polyphonic music audio using PLCA and HMRFs," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 278–287, 2015.
[26] E. Cambouropoulos, "Pitch spelling: A computational model," Music Perception, vol. 20, no. 4, pp. 411–429, 2003.
[27] H. Grohganz, M. Clausen, and M. Müller, "Estimating musical time information from performed MIDI files," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2014.
[28] I. Karydis, A. Nanopoulos, A. Papadopoulos, E. Cambouropoulos, and Y. Manolopoulos, "Horizontal and vertical integration/segregation in auditory streaming: a voice separation algorithm for symbolic musical data," in Proc. Sound and Music Computing Conference (SMC), 2007.
[29] A. Cogliati, D. Temperley, and Z. Duan, "Transcribing human piano performances into music notation," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 758–764.
[30] R. G. C. Carvalho and P. Smaragdis, "Towards end-to-end polyphonic music transcription: Transforming music audio directly to a score," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2017, pp. 151–155.
[31] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 556–562.
[32] S. A. Abdallah and M. D. Plumbley, "Unsupervised analysis of polyphonic music by sparse coding," IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 179–196, 2006.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[34] S. Böck and M. Schedl, "Polyphonic piano note transcription with recurrent neural networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 121–124.
[35] S. Wang, S. Ewert, and S. Dixon, "Identifying missing and extra notes in piano recordings using score-informed dictionary learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1877–1889, Oct. 2017.
[36] X. Serra, "A multicultural approach in music information research," in Proc. 12th International Society for Music Information Retrieval Conference (ISMIR), 2011, pp. 151–156.

Emmanouil Benetos (S'09, M'12) is Lecturer and Royal Academy of Engineering Research Fellow with the Centre for Digital Music, Queen Mary University of London, and Turing Fellow with the Alan Turing Institute. He received the Ph.D. degree in Electronic Engineering from Queen Mary University of London, U.K., in 2012. From 2013 to 2015, he was University Research Fellow with the Department of Computer Science, City, University of London. He has published over 80 peer-reviewed papers spanning several topics in audio and music signal processing. His research focuses on signal processing and machine learning for music and audio analysis, as well as applications to music information retrieval, acoustic scene analysis, and computational musicology.

Simon Dixon is Professor and Deputy Director of the Centre for Digital Music at Queen Mary University of London. He has a Ph.D. in Computer Science (Sydney) and an L.Mus.A. diploma in Classical Guitar. His research is in music informatics, including high-level music signal analysis, computational modeling of musical knowledge, and the study of musical performance. Particular areas of focus include automatic music transcription, beat tracking, audio alignment and analysis of intonation and temperament. He was President (2014–15) of the International Society for Music Information Retrieval (ISMIR), is founding Editor of the Transactions of ISMIR, and has published over 160 refereed papers in the area of music informatics.

Zhiyao Duan (S'09, M'13) is an assistant professor in the Electrical and Computer Engineering Department at the University of Rochester. He received his B.S. in Automation and M.S. in Control Science and Engineering from Tsinghua University, China, in 2004 and 2008, respectively, and received his Ph.D. in Computer Science from Northwestern University in 2013. His research interest is in the broad area of computer audition, i.e., designing computational systems that are capable of understanding sounds, including music, speech, and environmental sounds. He co-presented a tutorial on Automatic Music Transcription at ISMIR 2015. He received a best paper award at the 2017 Sound and Music Computing (SMC) conference and a best paper nomination at ISMIR 2017.

Sebastian Ewert is a Senior Research Scientist at Spotify. He received the M.Sc./Diplom and Ph.D. degrees (summa cum laude) in computer science from the University of Bonn (svd. Max-Planck-Institute for Informatics), Germany, in 2007 and 2012, respectively. In 2012, he was awarded a GAES fellowship and joined the Centre for Digital Music, Queen Mary University of London (United Kingdom). At the Centre, he became Lecturer for Signal Processing in 2015 and was one of the founding members of the Machine Listening Lab, which focuses on the development of machine learning and signal processing methods for audio and music applications.