Automatic Music Transcription: An Overview
Figure 1. Data represented in an AMT system. (a) Input waveform, (b) Internal time-frequency representation, (c) Output piano-roll representation, (d)
Output music score, with notes A and D marked in gray circles. The example corresponds to the first 6 seconds of W. A. Mozart’s Piano Sonata No. 13, 3rd
movement (taken from the MAPS database).
As in the cocktail party problem in speech, music usually involves multiple simultaneous voices; but unlike speech, these voices are highly correlated in time and in frequency (see Challenges 2 and 3 in Section I-C). In addition, both AMT and ASR systems benefit from language modeling components that are combined with acoustic components in order to produce plausible results. Thus, there are also clear links between AMT and the wider field of natural language processing (NLP), with music having its own grammatical rules or statistical regularities, in a similar way to natural language [5]. The use of language models for AMT is detailed in Section IV.

Within the emerging field of sound scene analysis, there is a direct analogy between AMT and Sound Event Detection (SED) [6], in particular with polyphonic SED, which involves detecting and classifying multiple overlapping events from audio. While everyday and natural sounds do not exhibit the same degree of temporal regularity and inter-source frequency dependence as found in music signals, there are close interactions between the two problems in terms of the methodologies used, as observed in the literature [6].

Further, AMT is related to image processing and computer vision, as musical objects such as notes can be recognized as two-dimensional patterns in time-frequency representations. Compared with image processing and computer vision, where occlusion is a common issue, AMT systems are often affected by musical objects occupying the same time-frequency regions (this is detailed in Section I-C).

C. Key Challenges

Compared to other problems in the music signal processing field or the wider signal processing discipline, there are several factors that make AMT particularly challenging:
1) Polyphonic music contains a mixture of multiple simultaneous sources (e.g., instruments, vocals) with different pitch, loudness and timbre (sound quality), with each source producing one or more musical voices. Inferring musical attributes (e.g., pitch) from the mixture signal is an extremely under-determined problem.
2) Overlapping sound events often exhibit harmonic relations with each other; for any consonant musical interval, the fundamental frequencies form small integer ratios, so that their harmonics overlap in frequency (a small numerical illustration is given after this list).
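As a quick, hedged illustration of Challenge 2, the snippet below counts the coinciding (ideal, inharmonicity-free) harmonics of two hypothetical notes a perfect fifth apart; the specific pitches (A3 at 220 Hz, E4 at 330 Hz) are chosen for the example and are not taken from the article.

```python
def harmonics(f0, n=20):
    """First n ideal harmonic frequencies (in Hz) of a tone with fundamental f0."""
    return {round(f0 * k, 6) for k in range(1, n + 1)}

# A3 (220 Hz) and E4 (330 Hz) form a perfect fifth, i.e. a 2:3 frequency ratio,
# so every 3rd harmonic of A3 coincides with every 2nd harmonic of E4.
lower, upper = harmonics(220.0), harmonics(330.0)
shared = sorted(lower & upper)
print(len(shared), "coinciding partials, e.g.:", shared[:4])   # 660, 1320, 1980, 2640 Hz, ...
```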
… evaluation metrics (see Sections I-C and IV-G).

Note-level transcription, or note tracking, is one level higher than MPE, in terms of the richness of structures of the estimates. It not only estimates the pitches in each time frame, but also connects pitch estimates over time into notes. In the AMT literature, a musical note is often characterized by three elements: pitch, onset time, and offset time [1]. As note offsets can be ambiguous, they are sometimes neglected in the evaluation of note tracking approaches, and as such, some note tracking approaches only estimate pitch and onset times of notes. Fig. 2 (middle) shows an example of a note-level transcription, where each note is shown as a red circle (onset) followed by a black line (pitch contour). Many note tracking approaches form notes by post-processing MPE outputs (i.e., pitch estimates in individual frames). Techniques that have been used in this context include median filtering [12], Hidden Markov Models (HMMs) [20], and neural networks [5]. This post-processing is often performed for each MIDI pitch independently, without considering the interactions among simultaneous notes. This often leads to spurious or missing notes that share harmonics with correctly estimated notes. Some approaches have been proposed to consider note interactions through a spectral likelihood model [9] or a music language model [5], [18] (see Section IV-A). Another subset of approaches estimates notes directly from the audio signal instead of building upon MPE outputs: some first detect onsets and then estimate pitches within each inter-onset interval [21], while others estimate pitch, onset and sometimes offset in the same framework [22], [23], [24].
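To make the "post-processing of MPE outputs" idea above concrete, here is a minimal, hedged sketch (not any specific published system): it binarizes a frame-level pitch-activation matrix, median-filters each pitch track, and reads off (pitch, onset, offset) notes. The threshold, filter length, frame rate and lowest MIDI pitch are illustrative assumptions.

```python
import numpy as np
from scipy.signal import medfilt

def activations_to_notes(P, threshold=0.5, kernel=5, frame_rate=100, min_pitch=21):
    """Turn a (pitches x frames) activation matrix P into (midi_pitch, onset_s, offset_s) notes.

    A minimal post-processing scheme: binarize each pitch track, smooth it with a
    median filter to remove isolated spurious frames, then read off note segments.
    """
    notes = []
    for p in range(P.shape[0]):
        track = (P[p] >= threshold).astype(float)
        track = medfilt(track, kernel_size=kernel) > 0.5        # remove very short blips/gaps
        padded = np.concatenate(([0.0], track, [0.0]))          # pad so edges are detected
        onsets = np.flatnonzero(np.diff(padded) > 0)            # 0 -> 1 transitions
        offsets = np.flatnonzero(np.diff(padded) < 0)           # 1 -> 0 transitions
        for on, off in zip(onsets, offsets):
            notes.append((min_pitch + p, on / frame_rate, off / frame_rate))
    return sorted(notes, key=lambda n: n[1])
```

An HMM with on/off states per pitch, as used in [20], would play the same smoothing role in a more principled way.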
Stream-level transcription, also called Multi-Pitch Streaming (MPS), targets grouping estimated pitches or notes into streams, where each stream typically corresponds to one instrument or musical voice, and is closely related to instrument source separation. Fig. 2 (bottom) shows an example of a stream-level transcription, where pitch streams of different instruments have different colors. Compared to note-level transcription, the pitch contour of each stream is much longer than a single note and contains multiple discontinuities that are caused by silence, non-pitched sounds and abrupt frequency changes. Therefore, techniques that are often used in note-level transcription are generally not sufficient to group pitches into a long and discontinuous contour. One important cue for MPS that is not explored in MPE and note tracking is timbre: notes of the same stream (source) generally show similar timbral characteristics compared to those in different streams. Therefore, stream-level transcription is also called timbre tracking or instrument tracking in the literature. Existing works at this level are few, with [16], [10], [25] as examples.
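As a toy illustration of the timbre cue mentioned above (and only that: actual MPS systems such as [10], [25] rely on considerably richer probabilistic models), one could cluster per-note timbre descriptors, assumed to be precomputed, into streams:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_notes_by_timbre(note_features, n_streams=2):
    """Group notes into streams by timbral similarity (a toy stand-in for MPS).

    note_features: array of shape (num_notes, num_features), e.g. mean MFCCs or
    spectral-envelope descriptors computed over each note's duration (assumed given).
    Returns one stream label per note.
    """
    features = (note_features - note_features.mean(axis=0)) / (note_features.std(axis=0) + 1e-9)
    return KMeans(n_clusters=n_streams, n_init=10, random_state=0).fit_predict(features)
```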
From frame-level to note-level to stream-level, the transcription task becomes more complex as more musical structures and cues need to be modeled. However, the transcription outputs at these three levels are all parametric transcriptions, which are parametric descriptions of the audio content. The MIDI piano roll shown in Fig. 1(c) is a good example of such a transcription. It is indeed an abstraction of the music audio; however, it has not yet reached the level of abstraction of music notation: time is still measured in seconds instead of beats; pitch is measured in MIDI numbers instead of spelled note names that are compatible with the key (e.g., C♯ vs D♭); and the concepts of beat, bar, meter, key, harmony, and stream are lacking.

Notation-level transcription aims to transcribe the music audio into a human-readable musical score, such as the staff notation widely used in Western classical music. Transcription at this level requires a deeper understanding of musical structures, including harmonic, rhythmic and stream structures. Harmonic structures such as keys and chords influence the note spelling of each MIDI pitch; rhythmic structures such as beats and bars help to quantize the lengths of notes; and stream structures aid the assignment of notes to different staffs. There has been some work on the estimation of musical structures from audio or MIDI representations of a performance. For example, methods for pitch spelling [26], timing quantization [27], and voice separation [28] from performed MIDI files have been proposed. However, little work has been done on integrating these structures into a complete music notation transcription, especially for polyphonic music. Several software packages, including Finale, GarageBand and MuseScore, provide the functionality of converting a MIDI file into music notation; however, the results are often not satisfying, and it is not clear what musical structures have been estimated and integrated during the transcription process. Cogliati et al. [29] proposed a method to convert a MIDI performance into music notation, with a systematic comparison of the transcription performance with the above-mentioned software. In terms of audio-to-notation transcription, a proof-of-concept work using end-to-end neural networks was proposed by Carvalho and Smaragdis [30] to directly map music audio into music notation without explicitly modeling musical structures.

III. STATE-OF-THE-ART

While there is a wide range of applicable methods, automatic music transcription has been dominated during the last decade by two algorithmic families: Non-Negative Matrix Factorization (NMF) and Neural Networks (NNs). Both families have been used for a variety of tasks, from speech and image processing to recommender systems and natural language processing. Despite this wide applicability, both approaches offer a range of properties that make them particularly suitable for modeling music recordings at the note level.

A. Non-negative Matrix Factorization for AMT

The basic idea behind NMF and its variants is to represent a given non-negative time-frequency representation V ∈ R^{M×N}_{≥0}, e.g., a magnitude spectrogram, as a product of two non-negative matrices: a dictionary D ∈ R^{M×K}_{≥0} and an activation matrix A ∈ R^{K×N}_{≥0}, see Fig. 3. Computationally, the goal is to minimize a distance (or divergence) between V and DA with respect to D and A. As a straightforward approach to solving this minimization problem, multiplicative update rules have been central to the success of NMF. For example, the generalized Kullback-Leibler divergence between V and DA is non-increasing under the following updates:
$$
\mathbf{A} \leftarrow \mathbf{A} \odot \frac{\mathbf{D}^{\top}\!\left(\frac{\mathbf{V}}{\mathbf{D}\mathbf{A}}\right)}{\mathbf{D}^{\top}\mathbf{J}}
\qquad \text{and} \qquad
\mathbf{D} \leftarrow \mathbf{D} \odot \frac{\left(\frac{\mathbf{V}}{\mathbf{D}\mathbf{A}}\right)\mathbf{A}^{\top}}{\mathbf{J}\mathbf{A}^{\top}},
$$

where the operator ⊙ denotes point-wise multiplication, J ∈ R^{M×N} denotes the matrix of ones, and the division is point-wise. Intuitively, the update rules can be derived by choosing a specific step-size in a gradient (or rather coordinate) descent based minimization of the divergence [31].

Figure 3. NMF example, using the same audio recording as Fig. 1. (a) Input spectrogram V, (b) Approximated spectrogram DA, (c) Dictionary D (pre-extracted), (d) Activation matrix A.

Figure 4. Inharmonicity: Spectrum of a C♯1 note played on a piano. The stiffness of strings causes partials to be shifted from perfect integer multiples of the fundamental frequency (shown as vertical dotted lines); here the 23rd partial is at the position where the 24th harmonic would be expected. Note that the fundamental frequency of 34.65 Hz is missing, as piano soundboards typically do not resonate for modes with a frequency smaller than ≈50 Hz.
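As a compact, hedged sketch of these multiplicative updates (a bare-bones implementation, not one of the tuned systems discussed later), the following NumPy code factorizes a magnitude spectrogram; the small eps terms are not part of the update rules above and only guard against division by zero.

```python
import numpy as np

def nmf_kl(V, K, n_iter=200, eps=1e-9, seed=0):
    """Minimal NMF with multiplicative updates for the generalized KL divergence.

    V: non-negative magnitude spectrogram of shape (M, N).
    Returns a dictionary D of shape (M, K) and an activation matrix A of shape (K, N).
    """
    rng = np.random.default_rng(seed)
    M, N = V.shape
    D = rng.random((M, K)) + eps
    A = rng.random((K, N)) + eps
    J = np.ones_like(V)                      # the matrix of ones from the update rules
    for _ in range(n_iter):
        R = V / (D @ A + eps)                # point-wise quotient V / (DA)
        A *= (D.T @ R) / (D.T @ J + eps)     # activation update
        R = V / (D @ A + eps)
        D *= (R @ A.T) / (J @ A.T + eps)     # dictionary update
    return D, A
```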
In an AMT context, both unknown matrices have an intuitive interpretation: the n-th column of V, i.e. the spectrum at time point n, is modeled in NMF as a linear combination of the K columns of D, and the corresponding K coefficients are given by the n-th column of A. Given this point of view, each column of D is often referred to as a (spectral) template and usually represents the expected spectral energy distribution associated with a specific note played on a specific instrument. For each template, the corresponding row in A is referred to as the associated activation and encodes when and how intensely that note is played over time. Given the non-negativity constraints, NMF yields a purely constructive representation in the sense that spectral energy modeled by one template cannot be cancelled by another – this property is often seen as instrumental in identifying a parts-based and interpretable representation of the input [31].

In Fig. 3, an NMF-based decomposition is illustrated. The magnitude spectrogram V shown in Fig. 3(a) is modeled as a product of the dictionary D and activation matrix A shown in Fig. 3(c) and (d), respectively. The product DA is given in Fig. 3(b). In this case, the templates correspond to individual pitches, with clearly visible fundamental frequencies and harmonics. Additionally, comparing A with the piano-roll representation shown in Fig. 1(c) indicates the correlation between NMF activations and the underlying musical score.

While Fig. 3 illustrates the principles behind NMF, it also indicates why AMT is difficult – indeed, a regular NMF decomposition would rarely look as clean as in Fig. 3. Compared to speech analysis, sound objects in music are highly correlated. For example, even in a simple piece as shown in Fig. 1, most pairs of simultaneous notes are separated by musically consonant intervals, which acoustically means that many of their partials overlap (e.g., the A and D notes around 4 seconds, marked with gray circles in Fig. 1(d), share a high number of partials). In this case, it can be difficult to disentangle how much energy belongs to which note. The task is further complicated by the fact that the spectro-temporal properties of notes vary considerably between different pitches, playing styles, dynamics and recording conditions. Further, stiffness properties of strings affect the travel speed of transverse waves based on their frequency – as a result,
the partials of instruments such as the piano are not found at perfect integer multiples of the fundamental frequency. Due to this property, called inharmonicity, the positions of partials differ between individual pianos (see Fig. 4).

Figure 5. Harmonic NMF [15]: Each NMF template (right hand side) is represented as a linear combination of fixed narrow-band sub-templates. The resulting template is constrained to represent harmonic sounds by construction.

To address these challenges, the basic NMF model has been extended by encouraging additional structure in the dictionary and the activations. For example, an important principle is to enforce sparsity in A to obtain a solution dominated by few but substantial activations – the success of sparsity paved the way for a whole range of sparse coding approaches, in which the dictionary size K can exceed the input dimension M considerably [32]. Other extensions focus on the dictionary design. In the case of supervised NMF, the dictionary is pre-computed and fixed using additionally available training material. For example, given K recordings each containing only a single note, the dictionary shown in Fig. 3(c) was constructed by extracting one template from each recording – this way, the templates are guaranteed to be free of interference from other notes and also have a clear interpretation. As another example, Fig. 5 illustrates an extension in which each NMF template is represented as a linear combination of fixed narrow-band sub-templates [15], which enforces a harmonic structure for all NMF templates – this way, a dictionary can be adapted to the recording to be transcribed, while maintaining its clean, interpretable structure.

In shift-invariant dictionaries, a single template can be used to represent a range of different fundamental frequencies. In particular, using a logarithmic frequency axis, the distances between individual partials of a harmonic sound are fixed, and thus shifting a template in frequency allows modeling sounds of varying pitch. Sharing parameters between different pitches in this way has turned out to be effective towards increasing model capacity (see e.g., [16], [17]). Further, spectro-temporal dictionaries alleviate a specific weakness of NMF models: in NMF it is difficult to express that notes often have a specific temporal evolution – e.g., the beginning of a note (or attack phase) might have entirely different spectral properties than the central part (decay phase). Such relationships are modeled in spectro-temporal dictionaries using a Markov process which governs the sequencing of templates across frames, so that different subsets of templates can be used for the attack and the decay parts, respectively [16], [23].
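A hedged sketch of the supervised variant described above: with a pre-extracted dictionary D held fixed (one template per pitch), only the activations are updated, and a crude threshold turns them into a frame-level piano roll. The threshold and iteration count are illustrative choices, not values from the literature.

```python
import numpy as np

def transcribe_with_fixed_dictionary(V, D, n_iter=100, eps=1e-9, rel_threshold=0.1):
    """Supervised NMF: keep the pre-extracted dictionary D (one template per pitch)
    fixed and update only the activations A; then binarize A for a rough piano roll."""
    K, N = D.shape[1], V.shape[1]
    A = np.full((K, N), 1.0 / K)
    J = np.ones_like(V)
    for _ in range(n_iter):
        R = V / (D @ A + eps)
        A *= (D.T @ R) / (D.T @ J + eps)      # only the activation update is applied
    piano_roll = A > rel_threshold * A.max()  # crude global threshold, for illustration only
    return A, piano_roll
```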
B. Neural Networks for AMT

As for many tasks relating to pattern recognition, neural networks (NNs) have had a considerable impact in recent years; the general idea is to learn a non-linear function (or a composition of functions) from input to output via an optimization algorithm such as stochastic gradient descent [33]. Compared to other fields including image processing, progress on NNs for music transcription has been slower, and we will discuss a few of the underlying reasons below.

One of the earliest approaches based on neural networks was Marolt's Sonic system [21]. A central component in this approach was the use of time-delay (TD) networks, which resemble convolutional networks in the time direction [33], and were employed to analyse the output of adaptive oscillators, in order to track and group partials in the output of a gammatone filterbank. Although it was initially published in 2001, the approach remains competitive and still appears in comparisons in more recent publications [23].

In the context of the more recent revival of neural networks, a first successful system was presented by Böck and Schedl [34]. One of the core ideas was to use two spectrograms as input to enable the network to exploit both a high time accuracy (when estimating the note onset position) and a high frequency resolution (when disentangling notes in the lower frequency range). This input is processed using one (or more) Long Short-Term Memory (LSTM) layers [33]. The potential benefit of using LSTM layers is two-fold. First, the spectral properties of a note evolve across input frames, and LSTM networks have the capability to compactly model such sequences. Second, medium- and long-range dependencies between notes can potentially be captured: for example, based on a popular chord sequence, after hearing C and G major chords followed by A minor, a likely successor is an F major chord. An investigation of whether such long-range dependencies are indeed modeled, however, was not in scope.
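A minimal PyTorch sketch of a generic framewise model in the spirit of the systems above (not a reimplementation of [34], [18] or [19]): spectrogram frames go in, per-frame probabilities for the 88 piano pitches come out. The input size, hidden size and use of a bidirectional LSTM are placeholder assumptions.

```python
import torch
import torch.nn as nn

class FramewiseTranscriber(nn.Module):
    """Generic framewise transcription network: spectrogram frames in,
    per-frame note probabilities (88 piano pitches) out."""

    def __init__(self, n_bins=229, hidden=256, n_pitches=88):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_pitches)

    def forward(self, spec):                  # spec: (batch, frames, n_bins)
        h, _ = self.lstm(spec)
        return torch.sigmoid(self.out(h))     # (batch, frames, n_pitches), values in [0, 1]

# Training would minimize binary cross-entropy against a ground-truth piano roll:
# loss = nn.BCELoss()(model(spec), target_roll)
```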
Sigtia et al. [18] focus on long-range dependencies in music by combining an acoustic front-end with a symbolic-level module resembling a language model as used in speech processing. Using information obtained from MIDI files, a recurrent network is trained to predict the active notes in the next time frame given the past. This approach needs to learn and represent a very large joint probability distribution, i.e., a probability for every possible combination of active and inactive notes across time – note that even in a single frame there are 2^88 possible combinations of notes on a piano. To render the problem of modeling such an enormous probability space tractable, the approach employs a specific neural network architecture (NADE), which represents a large joint distribution as a long product of conditional probabilities – an approach quite similar to the idea popularized recently by the well-known WaveNet architecture. Despite the use of a dedicated music language model, which was trained on relatively large MIDI-based datasets, only modest improvements over an HMM baseline could be observed, and thus the question remains open to which degree long-range dependencies are indeed captured.

To further disentangle the influence of the acoustic front-end from the language model on potential improvements in performance, Kelz et al. [19] focus on the acoustic modeling and report on the results of a larger-scale hyperparameter search.
Figure 7. Piano-roll representations of the first 6 seconds of a recording of a Bach piece (BWV 582) for organ. Black color corresponds to correctly detected pitches, red to false positives, and blue to false negatives. (a) Output of the NMF model trained on piano templates. (b) Output of the piano-music-trained neural network model of [24].

Given the rich energy distribution, this behavior is expected. While we use a simple baseline model for NMF and thus some errors could be attributed to that choice, the neural network fails more gracefully: fewer octave errors and fewer spurious short note detections are observed (yet in terms of recall the NMF-based approach identifies additional correct notes). It is difficult to argue why the acoustic model within the network should be better prepared for such a situation. However, the results suggest that the network learned something additional: the LSTM layers as used in the network (compare Fig. 6) seem to have learned how typical piano notes evolve in time, and thus most note lengths look reasonable, with fewer spurious fragments. Similarly, the bandwidth in which octave errors occur is narrower for the neural network, which could potentially indicate that the network models the likelihood of co-occurring notes or, in other words, a simple music language model, which leads us to our discussion of important remaining challenges in AMT.

IV. FURTHER EXTENSIONS AND FUTURE WORK

A. Music Language Models

As outlined in Section I-B, AMT is closely related to automatic speech recognition (ASR). In the same way that a typical ASR system consists of an acoustic component and a language component, an AMT system can model both the acoustic sequences and also the underlying sequence of notes and other music cues over time. AMT systems have thus incorporated music language models (MLMs) for modeling sequences of notes in a polyphonic context, with the aim of improving transcription performance. The capabilities of deep learning methods towards modeling high-dimensional sequences have recently made polyphonic music sequence prediction possible. Boulanger-Lewandowski et al. [5] combined a restricted Boltzmann machine (RBM) with an RNN for polyphonic music prediction, which was used to post-process the acoustic output of an AMT system. Sigtia et al. [18] also used the aforementioned RNN-RBM as an MLM, and combined the acoustic and language predictions using a probabilistic graphical model. While these initial works showed promising results, there are several directions for future research in MLMs; these include creating unified acoustic and language models (as opposed to using MLMs as post-processing steps) and modeling other musical cues, such as chords, key and meter (as opposed to simply modeling note sequences).
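As a hedged sketch of the MLM idea (next-frame note prediction, as in the RNN-based models above), the toy model below uses a GRU with independent per-note sigmoid outputs; unlike the RNN-RBM or NADE approaches discussed in this article, it therefore side-steps rather than solves the joint-distribution problem, and all layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class NextFrameMLM(nn.Module):
    """Toy music language model: given past piano-roll frames, predict which of the
    88 notes are active in the next frame."""

    def __init__(self, n_pitches=88, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_pitches, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_pitches)

    def forward(self, roll):                  # roll: (batch, frames, n_pitches), binary
        h, _ = self.rnn(roll)
        return torch.sigmoid(self.out(h))     # per-note probability for the *next* frame

# Trained with binary cross-entropy between the prediction at frame t and the roll at
# frame t+1; at transcription time such predictions can be combined with the acoustic
# model's posteriors, e.g. as a post-processing prior.
```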
B. Score-Informed Transcription

If a known piece is performed, the musical score provides a strong prior for the transcription. In many cases, there are discrepancies between the score and a given music performance, which may be due to a specific interpretation by a performer, or due to performance mistakes. For applications such as music education, it is useful to identify such discrepancies, by incorporating the musical score as additional prior information to simplify the transcription process (score-informed music transcription [35]). Typically, systems for score-informed music transcription use a score-to-audio alignment method as a pre-processing step, in order to align the music score with the input music audio prior to performing transcription, e.g. [35]. While specific instances of score-informed transcription systems have been developed for certain instruments (piano, violin), the problem is still relatively unexplored, as is the related and more challenging problem of lead-sheet-informed transcription and the eventual integration of these methods towards the development of automatic music tutoring systems.
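A hedged sketch of the alignment pre-processing step mentioned above (a generic chroma-plus-DTW recipe, not the method of [35]); it assumes the score is available as a synthesized audio rendition, and all parameter values are illustrative.

```python
import numpy as np
import librosa

def align_score_to_audio(performance_path, score_synth_path, hop_length=512):
    """Rough score-to-audio alignment via chroma features and dynamic time warping.
    Returns pairs of (score_time_s, performance_time_s) along the warping path."""
    y_perf, sr = librosa.load(performance_path, sr=None)
    y_score, _ = librosa.load(score_synth_path, sr=sr)
    chroma_perf = librosa.feature.chroma_cqt(y=y_perf, sr=sr, hop_length=hop_length)
    chroma_score = librosa.feature.chroma_cqt(y=y_score, sr=sr, hop_length=hop_length)
    _, wp = librosa.sequence.dtw(X=chroma_score, Y=chroma_perf, metric='cosine')
    return np.asarray(wp)[::-1] * hop_length / sr   # warping path, frame indices -> seconds
```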
C. Context-Specific Transcription

While the creation of a “blind” multi-instrument AMT system without specific knowledge of the music style, instruments and recording conditions is yet to be achieved, considerable progress has been reported on the problem of context-specific transcription, where prior knowledge of the sound of the specific instrument model or manufacturer and of the recording environment is available. For context-specific piano transcription, multi-pitch detection accuracy can exceed 90% [23], [22], making such systems appropriate for user-facing applications. Open work in this topic includes the creation of context-specific AMT systems for multiple instruments.

D. Non-Western Music

As might be evident from surveying the AMT literature, the vast majority of approaches target only Western (or Eurogenetic) music. This allows several assumptions, regarding both the instruments used and the way that music is represented and produced (typical assumptions include: octaves containing 12 equally-spaced pitches; two modes, major and minor; a standard tuning frequency of A4 = 440 Hz). However, these assumptions do not hold true for other music styles from around the world, where, for instance, an octave is often divided into microtones (e.g., Arabic music theory is based on quarter-tones), or where modes are used that do not exist in Western music (e.g., classical Indian music recognizes hundreds of modes, called ragas). Therefore, automatically
transcribing non-Western music still remains an open problem with several challenges, including the design of appropriate signal and music notation representations while avoiding a so-called Western bias [36]. Another major issue is the lack of annotated datasets for non-Western music, rendering the application of data-intensive machine learning methods difficult.
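As a small, hedged illustration of how deeply the 12-pitches-per-octave assumption is baked into transcription pipelines, even a trivial pitch-quantization step encodes it; the sketch below makes the scale resolution and tuning reference explicit parameters (the values shown are examples, not recommendations).

```python
import math

def nearest_scale_step(freq_hz, steps_per_octave=24, ref_hz=440.0):
    """Quantize a frequency to the nearest step of an equal-tempered scale.

    steps_per_octave=12 corresponds to Western semitones; 24 approximates the
    quarter-tone resolution used in much Arabic music theory. ref_hz is the
    (style-dependent) tuning reference."""
    steps = steps_per_octave * math.log2(freq_hz / ref_hz)
    return round(steps)   # signed number of scale steps above/below the reference

# 452 Hz is roughly half a semitone above A4: the nearest quarter-tone step is +1,
# while 12-step quantization simply collapses it back onto A4 (step 0).
print(nearest_scale_step(452.0), nearest_scale_step(452.0, steps_per_octave=12))
```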
E. Expressive Pitch and Timing

Western notation conceptualizes music as sequences of unchanging pitches being maintained for regular durations, and has little scope for representing expressive use of microtonality and microtiming, nor for detailed recording of timbre and dynamics. Research on automatic transcription has followed this narrow view, describing notes in terms of discrete pitches plus onset and offset times. For example, no suitable notation exists for performed singing, the most universal form of music-making. Likewise, for other instruments without fixed pitch or with other expressive techniques, better representations are required. These richer representations can then be reduced to Western score notation, if required, by modeling musical knowledge and stylistic conventions.

Therefore, the creation of perceptually relevant evaluation metrics for AMT, as well as the creation of evaluation metrics for notation-level transcription, remain open problems.

V. CONCLUSIONS

Automatic music transcription has remained an active area of research in the fields of music signal processing and music information retrieval for several decades, with several potential benefits in other areas and fields extending beyond the remit of music. As outlined in this paper, several challenges remain to be addressed in order to fully solve this problem: these include the key challenges described in Section I-C on modeling music signals and on the availability of data, challenges with respect to the limitations of state-of-the-art methodologies as described in Section III-C, and finally extensions beyond the current remit of existing tasks as presented in Section IV. We believe that addressing these challenges will lead towards the creation of a “complete” music transcription system and towards unlocking the full potential of music signal processing technologies. Supplementary audio material related to this paper can be found on the companion website (see footnote 9).
8. http://www.music-ir.org/mirex/
9. http://c4dm.eecs.qmul.ac.uk/spm-amt-overview/
[14] P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003, pp. 177–180.
[15] E. Vincent, N. Bertin, and R. Badeau, “Adaptive harmonic spectral decomposition for multiple pitch estimation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 3, pp. 528–537, 2010.
[16] E. Benetos and S. Dixon, “Multiple-instrument polyphonic music transcription using a temporally-constrained shift-invariant model,” Journal of the Acoustical Society of America, vol. 133, no. 3, pp. 1727–1741, March 2013.
[17] B. Fuentes, R. Badeau, and G. Richard, “Harmonic adaptive latent component analysis of audio and application to music transcription,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1854–1866, Sept 2013.
[18] S. Sigtia, E. Benetos, and S. Dixon, “An end-to-end neural network for polyphonic piano music transcription,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, pp. 927–939, May 2016.
[19] R. Kelz, M. Dorfer, F. Korzeniowski, S. Böck, A. Arzt, and G. Widmer, “On the potential of simple framewise approaches to piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference, 2016, pp. 475–481.
[20] J. Nam, J. Ngiam, H. Lee, and M. Slaney, “A classification-based polyphonic piano transcription approach using learned feature representations,” in Proceedings of the International Society for Music Information Retrieval Conference, 2011, pp. 175–180.
[21] M. Marolt, “A connectionist approach to automatic transcription of polyphonic piano music,” IEEE Transactions on Multimedia, vol. 6, no. 3, pp. 439–449, 2004.
[22] A. Cogliati, Z. Duan, and B. Wohlberg, “Context-dependent piano music transcription with convolutional sparse coding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2218–2230, Dec 2016.
[23] S. Ewert and M. B. Sandler, “Piano transcription in the studio using an extensible alternating directions framework,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 1983–1997, Nov 2016.
[24] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference, 2018.
[25] V. Arora and L. Behera, “Multiple F0 estimation and source clustering of polyphonic music audio using PLCA and HMRFs,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 2, pp. 278–287, 2015.
[26] E. Cambouropoulos, “Pitch spelling: A computational model,” Music Perception, vol. 20, no. 4, pp. 411–429, 2003.
[27] H. Grohganz, M. Clausen, and M. Mueller, “Estimating musical time information from performed MIDI files,” in Proceedings of the International Society for Music Information Retrieval Conference, 2014.
[28] I. Karydis, A. Nanopoulos, A. Papadopoulos, E. Cambouropoulos, and Y. Manolopoulos, “Horizontal and vertical integration/segregation in auditory streaming: A voice separation algorithm for symbolic musical data,” in Proceedings of the Sound and Music Computing Conference (SMC), 2007.
[29] A. Cogliati, D. Temperley, and Z. Duan, “Transcribing human piano performances into music notation,” in Proceedings of the International Society for Music Information Retrieval Conference, 2016, pp. 758–764.
[30] R. G. C. Carvalho and P. Smaragdis, “Towards end-to-end polyphonic music transcription: Transforming music audio directly to a score,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct 2017, pp. 151–155.
[31] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing Systems (NIPS), 2001, pp. 556–562.
[32] S. A. Abdallah and M. D. Plumbley, “Unsupervised analysis of polyphonic music by sparse coding,” IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 179–196, 2006.
[33] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016.
[34] S. Böck and M. Schedl, “Polyphonic piano note transcription with recurrent neural networks,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 121–124.
[35] S. Wang, S. Ewert, and S. Dixon, “Identifying missing and extra notes in piano recordings using score-informed dictionary learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1877–1889, Oct 2017.
[36] X. Serra, “A multicultural approach in music information research,” in Proceedings of the 12th International Society for Music Information Retrieval Conference, 2011, pp. 151–156.

Emmanouil Benetos (S’09, M’12) is Lecturer and Royal Academy of Engineering Research Fellow with the Centre for Digital Music, Queen Mary University of London, and Turing Fellow with the Alan Turing Institute. He received the Ph.D. degree in Electronic Engineering from Queen Mary University of London, U.K., in 2012. From 2013 to 2015, he was University Research Fellow with the Department of Computer Science, City, University of London. He has published over 80 peer-reviewed papers spanning several topics in audio and music signal processing. His research focuses on signal processing and machine learning for music and audio analysis, as well as applications to music information retrieval, acoustic scene analysis, and computational musicology.

Simon Dixon is Professor and Deputy Director of the Centre for Digital Music at Queen Mary University of London. He has a Ph.D. in Computer Science (Sydney) and an L.Mus.A. diploma in Classical Guitar. His research is in music informatics, including high-level music signal analysis, computational modeling of musical knowledge, and the study of musical performance. Particular areas of focus include automatic music transcription, beat tracking, audio alignment and analysis of intonation and temperament. He was President (2014–15) of the International Society for Music Information Retrieval (ISMIR), is founding Editor of the Transactions of ISMIR, and has published over 160 refereed papers in the area of music informatics.

Zhiyao Duan (S’09, M’13) is an assistant professor in the Electrical and Computer Engineering Department at the University of Rochester. He received his B.S. in Automation and M.S. in Control Science and Engineering from Tsinghua University, China, in 2004 and 2008, respectively, and received his Ph.D. in Computer Science from Northwestern University in 2013. His research interest is in the broad area of computer audition, i.e., designing computational systems that are capable of understanding sounds, including music, speech, and environmental sounds. He co-presented a tutorial on Automatic Music Transcription at ISMIR 2015. He received a best paper award at the 2017 Sound and Music Computing (SMC) conference and a best paper nomination at the 2017 International Society for Music Information Retrieval (ISMIR) conference.

Sebastian Ewert is a Senior Research Scientist at Spotify. He received the M.Sc./Diplom and Ph.D. degrees (summa cum laude) in computer science from the University of Bonn (svd. Max-Planck-Institute for Informatics), Germany, in 2007 and 2012, respectively. In 2012, he was awarded a GAES fellowship and joined the Centre for Digital Music, Queen Mary University of London (United Kingdom). At the Centre, he became Lecturer for Signal Processing in 2015 and was one of the founding members of the Machine Listening Lab, which focuses on the development of machine learning and signal processing methods for audio and music applications.