Conference Paper · February 2000 · DOI: 10.1109/MELCON.2000.879982 · Source: IEEE Xplore


SONIC: Transcription of Polyphonic
Piano Music with Neural Networks
Matija Marolt
Faculty of Computer and Information Science, University of Ljubljana
[email protected], http://lgm.fri.uni-lj.si/~matic

Abstract

This paper presents a connectionist approach to the transcription of polyphonic piano music. We propose a new partial tracking technique based on a combination of an auditory model and adaptive oscillator networks. We show how the synchronization of adaptive oscillators can be exploited to track partials in a musical signal. Our system uses time-delay neural networks to recognize notes from the outputs of the partial tracking model. We provide a short overview of other parts of our transcription system and present its performance on transcriptions of several synthesized and real piano recordings. The results show that our approach is a viable alternative to existing transcription systems.

1 Introduction

Music transcription could be defined as the act of listening to a piece of music and writing down music notation for the piece. Transcription of polyphonic music (polyphonic pitch recognition) converts an acoustical waveform into a parametric representation, where notes, their pitches, starting times and durations are extracted from the signal. Transcription is a difficult cognitive task and is not inherent in human perception of music, although it can be learned. It is also a very difficult problem for current computer systems. Separating notes from a mixture of other sounds, which may include notes played by the same or different instruments or simply background noise, requires robust algorithms whose performance should degrade gracefully as noise increases.

In recent years, several transcription systems have been developed. Some are targeted at transcription of music played on specific instruments [3-5], while others are general transcription systems [1-2]. All of them share several common characteristics. First, they calculate a time-frequency representation of the musical signal. Next, this representation is refined by locating partials in the signal. To track partials, most systems use ad hoc algorithms such as peak picking and peak connecting. Partial tracks are then grouped into notes by algorithms relying on cues such as common onset time and harmonicity. Some authors use templates of instrument tones in this process [1,3], as well as higher-level knowledge of music, such as probabilities of chord transitions [2].

Recognizing notes in a signal is a typical pattern recognition task, and we were surprised that few current systems use machine learning algorithms in the transcription process. Our motivation was therefore to develop a transcription system based on neural networks, which have proved useful in a variety of pattern recognition tasks. We tried to avoid explicit symbolic algorithms and instead used connectionist approaches in different parts of our system.

2 SONIC

The name of our transcription system is SONIC. Transcription is a difficult task, so we put one major constraint on the system: it only transcribes piano music, so piano should be the only instrument in the analyzed musical signal. We made no other assumptions about the signal, such as maximal polyphony, minimal note length, style of the transcribed music or the type of piano used. The system takes an acoustical waveform of a piano recording (44.1 kHz sampling rate, 16-bit resolution) as its input. Stereo recordings are converted to mono. The output of the system is a MIDI file containing the transcription. Notes, their starting times, durations and loudness are extracted from the signal.

Most current transcription systems have a similar structure. First, a time-frequency representation of the input signal is calculated. Then, a peak picking algorithm is used to determine possible partials in each frequency image. Partials are formed by connecting the discovered peaks over time. In the end, the found partials are associated with notes. This last part of the chain is where systems differ the most; some authors use statistical methods to group partials into notes [1,2,4], while others use ad hoc algorithms [3].

SONIC has an analogous structure. The main distinction is that neural networks are used in the partial tracking and note recognition stages. We present these two stages in sections 3 and 4. SONIC also includes an onset detection algorithm, an
algorithm for detection of repeated notes, a tuning algorithm and simple procedures for calculating the length and loudness of each note. These parts are briefly described in section 5. Section 6 presents the performance of SONIC on several synthesized and real piano recordings.

3 Partial Tracking

Tones of melodic music instruments can be roughly described as a sum of frequency components (sinusoids) with time-varying amplitudes and almost constant frequencies. These frequency components are called partials and can be recognized as prominent horizontal structures in a time-frequency representation of a musical signal. By finding partials, one can obtain a clearer and more compact representation of the signal, and partial tracking is therefore used in all current transcription systems.

Although a partial tracking algorithm plays an important role in a transcription system, because it provides the data for the note recognition engine, little attention has been paid to the development of these algorithms. Most systems use a procedure similar to that of a tracking phase vocoder [6]. After calculating the time-frequency representation, peaks are computed in each frequency image. Only peaks with amplitude larger than a chosen (possibly adaptive) threshold are kept as candidate partials. Detected peaks are then linked over time according to intuitive criteria such as proximity in frequency and amplitude, and partial tracks are formed in the process. Such an approach is quite susceptible to errors in the peak picking procedure, where missed or spurious peaks can lead to fragmented or spurious partial tracks. Some systems therefore use additional heuristics for merging fragmented partial tracks. The second main shortcoming of the "peak picking-peak connecting" approach is the detection of frequency modulated partials; the peak connecting algorithm can fail if it is not designed to tolerate frequency modulation. An innovative approach to partial tracking has been proposed by Sterian [4], who still uses a peak picking procedure in the first phase of his system, but later uses Kalman filters, trained on examples of instrument tones, to link peaks into partial tracks. His system still suffers from errors in the peak picking stage, but its main drawback is that partials have to be at least 150 ms long to be discovered. For our system, this is a very serious limitation, because tones in piano music are frequently shorter than 100 ms.

We therefore decided to develop a new partial tracking algorithm based on a different paradigm. Our algorithm is built on an auditory model, which emulates the functionality of the human ear, and on adaptive oscillators that extract partial tracks from the outputs of the auditory model. Additionally, we extended the algorithm for tracking individual partials to an algorithm for tracking groups of harmonically related partials by joining oscillators into networks.

3.1 Auditory Model

The first stage of our partial tracking algorithm transforms the acoustical waveform into time-frequency space with an auditory model, emulating the functionality of the human ear. The auditory model consists of two parts. A filterbank is first used to split the signal into several frequency channels, modeling the movement of the basilar membrane in the inner ear. The filterbank consists of an array of bandpass IIR filters, called gammatone filters [7-8]. We are using 200 filters with center frequencies logarithmically spaced between 70 and 6000 Hz.

Subsequently, the output of each gammatone filter is processed by Meddis' model of hair cell transduction [9]. The hair cell model converts each gammatone filter output into a probabilistic representation of firing activity in the auditory nerve. Its operation is based on a biological model of the hair cell, and it simulates several of its characteristics, most notably half-wave rectification, saturation and adaptation. Saturation and adaptation are very important to our model, as they reduce the dynamic range of the signal and in turn enable our partial tracking system to track partials with low amplitude. These characteristics can be observed in Fig. 1, displaying the outputs of three gammatone filters and the hair cell model on the 1st, 2nd and 4th partial of piano tone F3 (pitch 174 Hz).

Figure 1: Analysis of three partials of piano tone F3 with the auditory model.

3.2 Partial Tracking with Adaptive Oscillators

The auditory model outputs a set of frequency channels containing quasi-periodic firing activities of inner hair cells (see Fig. 1). Temporal models of pitch perception are based on the assumption that periodicity detection in these channels forms the basis of human pitch perception. Periodicity is usually calculated with autocorrelation. This produces a three-dimensional time-frequency representation of the signal (autocorrelogram), with time, channel center frequency and autocorrelation lag represented on orthogonal axes. A summary autocorrelogram
(summed across frequency channels) can be computed to give a total estimate of periodicity in the signal at a given time. Meddis and Hewitt [10] have demonstrated that the summary autocorrelogram explains the perception of pitch in a wide variety of stimuli.

We decided to use a different approach for calculating periodicity in frequency channels. It is based on adaptive oscillators that try to synchronize to signals in frequency channels of the auditory model. A synchronized oscillator indicates that the signal in a channel is periodic, which in turn indicates that a partial with frequency equal to that of the oscillator is present in the input signal.

An oscillator is a system with periodic behavior. It oscillates in time according to its two internal parameters: phase and frequency. An adaptive oscillator adapts its phase and frequency in response to its input (driving) signal. When a periodic signal is presented to an oscillator, it synchronizes to the signal by adjusting its phase and frequency to match those of the driving signal. By observing the frequency of a synchronized oscillator, we can make an accurate estimate of the frequency of its driving signal.

Various models of adaptive oscillators have been introduced; some have also found use in computer music research for modeling rhythm perception [11-12] and for simulating various psychoacoustic phenomena [13]. After reviewing several models, we decided to use a modified version of the Large-Kolen adaptive oscillator [11] in our system.

The Large-Kolen oscillator oscillates in time according to its period (frequency) and phase. The input of the oscillator consists of a series of discrete impulses, representing events. After each oscillation cycle, the oscillator adjusts its phase and period, trying to match its oscillations to events in the input signal. If input events occur at regular intervals (are periodic), the final effect of synchronization is the alignment of oscillations with input events. The phase and period of the Large-Kolen oscillator are updated according to a modified gradient descent rule, minimizing an error function that describes the difference between input events and the beginnings of oscillation cycles. The speed of synchronization can be controlled by two oscillator parameters.

Our partial tracking model uses adaptive oscillators to detect periodicity in the output channels of the auditory model. Each output channel is routed to the input of one adaptive oscillator. The initial frequency of the oscillator is equal to the center frequency of its input channel. When an oscillator synchronizes to its input, this indicates that the input signal is periodic and consequently that a partial with frequency similar to that of the oscillator is present in the input signal. A synchronized oscillator therefore represents (tracks) a partial in the input signal. To improve partial tracking, we made a few minor changes to the Large-Kolen oscillator model. Most notably, we added a new measure of the successfulness of synchronization that is used as the oscillator's output value. The measure is related to the amount of phase correction made in the synchronization process; fewer phase corrections signify better synchronization.

Figure 2: Partial tracking with adaptive oscillators.

The presented oscillator model can successfully track partials with diverse characteristics. Three examples are given in Fig. 2. Example A shows a simple case of tracking a 440 Hz sinusoid. The oscillator (initial frequency 440 Hz) synchronizes successfully, as can be seen from its output, and after an initial 1 Hz rise, the frequency of the oscillator settles at 440 Hz. Example B shows how two oscillators with initial frequencies set to 440 and 445 Hz synchronize to a sum of 440 and 445 Hz sinusoids (5 Hz beating). Both oscillators synchronize at 442.5 Hz, as can be seen from their outputs and frequencies. This behavior is consistent with human perception of the signal. Example C shows the tracking of a frequency modulated 440 Hz sinusoid. The oscillator synchronizes successfully; its frequency follows that of the sinusoid.

3.3 Tracking Groups of Partials with Adaptive Oscillator Networks

In the previous section we demonstrated how adaptive oscillators can be used to track partials in a musical signal. We extended the model of tracking individual partials to a model of tracking groups of harmonically related partials by joining adaptive oscillators into networks.

Networks consist of up to ten interconnected oscillators. Their initial frequencies are set to integer multiples of the frequency of the first oscillator (see Fig. 3). Network oscillators therefore track a group of up to ten harmonically related partials, which may belong to one tone.

Connecting oscillators into networks has several advantages for our transcription system. The output of a network represents the strength of a group of harmonically related partials, tracked by oscillators in the network. Such an output provides a better indication of the presence of a tone in the input signal than do
the outputs of individual partials. Noise doesn't usually appear in the form of harmonically related frequency components, so networks of oscillators are more resistant to noise and provide a clearer time-frequency representation of the signal. Within the network, each oscillator is connected to all other oscillators with excitatory connections. Connections are used by synchronized oscillators to speed up the synchronization of non-synchronized oscillators, leading to a faster network response and faster discovery of a group of partials.

Figure 3: A network of adaptive oscillators.

We use 88 oscillator networks in our system. The initial frequency of the first oscillator in each network is set to the pitch of one of the 88 piano notes (A0-C8). Initial frequencies of the other oscillators are integer multiples of the first oscillator's frequency (see Fig. 3). Networks consist of up to ten oscillators. This number decreases as frequency increases, because the highest tracked partial lies at 6000 Hz.

Each oscillator is connected to all other oscillators with excitatory connections. These connections can be used to adjust the frequencies and outputs of non-synchronized oscillators in the network with the goal of speeding up their synchronization. Only a synchronized oscillator can influence other oscillators in the network. It changes their frequencies (periods) and output values according to the following rules:

(1)

Here, d is the index of the destination (non-synchronized) oscillator, while s represents the source (synchronized) oscillator. The period of the destination oscillator p_d and its output value c_d change according to two ratios, r_p and r_c. These are two Gaussians, representing the ratio of the periods of the two oscillators (p_d, period of the destination oscillator; p_s, period of the source oscillator) and the ratio of the outputs of the two oscillators (c_d, output of the destination oscillator; c_s, output of the source oscillator). The ratio r_p is a Gaussian with maximum value when the periods of both oscillators are in a perfect harmonic relationship (d*p_d / (s*p_s) = 1). Its value falls as the periods drift away from this perfect ratio, and approaches zero when the deviation is larger than a semitone. r_c has its largest value when a synchronized oscillator influences the behavior of a non-synchronized oscillator (c_s is large, c_d is small) and falls as c_d increases. Connection weights w_sd are calculated according to the amplitudes of partials in piano tones; the first few partials are considered more important, so the influence of lower-numbered oscillators in the network is stronger than the influence of higher-numbered oscillators (w_1n > w_n1).

Connections contribute to faster synchronization of oscillators in the network. The output of an entire network represents the strength of the group of partials tracked by oscillators in the network and is calculated as a weighted sum of the outputs of individual oscillators in the network.

Fig. 4 displays slices taken from three time-frequency representations of the piano chord C3E3B4, 100 ms after the onset: a representation with uncoupled oscillators, a representation with networks of adaptive oscillators, and a short-time Fourier transform. The representation with uncoupled oscillators was calculated with 88 oscillators tuned to the fundamental frequencies of piano tones A0-C8. Oscillator outputs (independent of partial amplitudes) are presented (A). We also show the outputs of the 88 oscillator networks (B) and the combination of these outputs with the amplitudes of partials (C). The Fourier transform was calculated with a 100 ms Hamming window (440 frequency bins are shown in D).

Individual oscillators have no difficulty in finding the first few partials of each tone. Not all of the higher partials are found, because they are spaced too close together (we use only one oscillator per semitone).
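As a rough illustration of the coupling described above, the sketch below implements Gaussian period-ratio factors and a weighted-sum network output. The exact update rule of equation (1), the Gaussian width, the output-ratio formula and the 1/k partial weighting are all assumptions for illustration, not the published constants of SONIC.

```python
import math

def gaussian(x, mu, sigma):
    """Unnormalized Gaussian bump, peaking at 1 when x == mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def period_ratio_factor(d, p_d, s, p_s, sigma=0.03):
    """r_p: peaks when the periods are in a perfect harmonic
    relationship, i.e. d*p_d / (s*p_s) == 1. The width sigma
    (roughly half a semitone) is an assumed value."""
    return gaussian((d * p_d) / (s * p_s), 1.0, sigma)

def output_ratio_factor(c_d, c_s):
    """r_c: large when the source is synchronized (c_s high) and the
    destination is not (c_d low); falls as c_d rises. The linear form
    here is an assumption."""
    return max(0.0, c_s - c_d)

def network_output(outputs, weights):
    """Network output: a weighted sum of individual oscillator outputs,
    with lower-numbered (stronger) partials weighted more heavily."""
    return sum(w * c for w, c in zip(weights, outputs))

# Ten harmonically related oscillators on a hypothetical 220 Hz fundamental.
periods = [1.0 / (220.0 * (k + 1)) for k in range(10)]
weights = [1.0 / (k + 1) for k in range(10)]   # assumed 1/k partial weighting
outputs = [0.9] + [0.1] * 9                    # only the fundamental is synchronized

# A perfectly harmonic pair of periods gives the maximal ratio factor.
assert abs(period_ratio_factor(2, periods[1], 1, periods[0]) - 1.0) < 1e-9
print(round(network_output(outputs, weights), 3))
```

With these assumed weights, a network whose fundamental oscillator is synchronized already dominates the summed output, which is the property the text exploits: harmonically related evidence reinforces a single network, while uncorrelated noise does not.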

Figure 4: Representations of piano chord C3E3B4.
Oscillator networks produce a clearer representation of the signal. High outputs are produced by the networks coinciding with the three tones, because all partials in these networks are found. Networks coinciding with the second partials of some tones also have high outputs, for the same reason. Some of the networks at high frequencies have high output values too, because they only have one or two oscillators and these all synchronize with higher partials of the three tones. The combination of network outputs and partial amplitudes produces the clearest representation; only the first partials of all three tones (C3, E3, B4) and the second partial of tone E3 stand out (Fig. 4C).

The example shows that oscillator networks produce a compact and clear representation of partial groups in a musical signal. The main problem of this representation lies in the occasional slow synchronization of oscillators in networks, which can lead to delayed discovery of partial groups. This is especially true at lower frequencies, where delays of 40-50 ms are quite common, because synchronization only occurs once per cycle; an oscillator at 100 Hz synchronizes with the signal every 10 ms, so several tens of milliseconds are needed for synchronization. Closely spaced partials may also slow down synchronization, although it is quite rare for a group of partials not to be found at all.

4 Note Recognition

In SONIC, we use a set of neural networks to recognize notes from the outputs of the partial tracking model. Each network is trained to recognize a single piano note in its input; i.e., one network is trained to recognize note A4, another recognizes note G4, and so on. We use 76 networks to recognize notes from A1 to C8. This represents the entire range of piano notes except for the lowest octave, from A0 to Ab1, which we decided to ignore because of poor recognition results. These notes are quite rare in piano pieces, so their exclusion doesn't have a big impact on overall results. Because each neural network recognizes only one note (we call it the target note) in its input, it has only one output; a high output value indicates the presence of the target note in the input signal, while a low value indicates that the note is not present. We tested several neural network models for note recognition (MLP, RBF, time-delay, Elman and fuzzy ARTMAP networks) and obtained the best results with time-delay neural networks [14].

Networks were trained on a database of piano pieces and chords. To obtain a set of segmented piano pieces that could be used for network training, we collected a set of 120 MIDI piano pieces, which we rendered with 16 different piano samples obtained from commercially available piano sample CD-ROMs. We chose songs of various styles, including classical from several periods, ragtime, jazz, blues and pop. Because very low and high notes were not frequent enough in the chosen pieces, we complemented the song set with a set of chords (polyphony from one to six).

Each network was trained to recognize the presence of its target note in vectors taken from the database. The training set for each network included approximately 30000 examples, with one third of them containing the target note. The inputs of the networks consist of outputs of oscillator networks, amplitudes of signals in frequency channels of the auditory filterbank, and a combination of amplitudes and network outputs. Amplitudes in frequency channels were calculated from the outputs of the auditory filterbank with full-wave rectification and smoothing.

We tested the combination of the described partial tracking and note recognition models on a set of synthesized piano pieces (different from those used for training the neural networks). We compared the performance of this combination to the performance of a system without a partial tracking model, where neural networks recognized notes directly from a time-frequency representation of the signal (we used a transform similar to the constant-Q transform). The tested systems didn't include an onset detection algorithm or an algorithm for detecting repeated notes; both problems were ignored, as note onsets and lengths were taken from the MIDI files of the synthesized piano pieces. Table 1 lists average performance statistics on seven piano pieces of different complexities and styles. Percentages of correctly found notes, spurious notes and octave errors are given for both systems.

            correct   spurious   oct. err.
  No PT      92.79     27.93      39.46
  With PT    94.39     11.15      77.90

Table 1: Performance of systems with and without partial tracking.

The percentage of correctly found notes is similar in both systems; partial tracking improves accuracy by approximately 1.5%. Partial tracking significantly reduces the number of spurious notes, more than halving it. Just as important is the change in the structure of errors. Almost 80% of all errors in the system with partial tracking are octave errors, which occur when the system misses a note (or finds a spurious note) because of a note an octave, an octave and a half, or two octaves apart from the missed/spurious note. These are very hard to remove, but because the missed or spurious notes are consonant with other notes in the music, octave errors aren't very apparent if we listen to the resynthesized transcription. They are therefore not as critical as some other types of errors (e.g., halftone errors), which make listening to the resynthesized transcription unpleasant. We therefore consider the higher percentage of octave errors in the system with partial tracking to be a significant improvement.
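The time-delay idea used for note recognition can be sketched as follows: the input layer sees the current frame of partial-tracking features together with several past frames, so the hidden layer can exploit the temporal evolution of partial strengths. The layer sizes, the number of delays and the random weights below are illustrative assumptions, not the authors' trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class TimeDelayNote:
    """Minimal time-delay network for one target note (a sketch,
    with assumed sizes; SONIC's actual inputs combine oscillator
    network outputs and filterbank amplitudes)."""

    def __init__(self, n_features=88, delays=3, hidden=20):
        self.delays = delays
        d_in = n_features * (delays + 1)        # delay line flattened
        self.w1 = rng.normal(0.0, 0.1, (hidden, d_in))
        self.w2 = rng.normal(0.0, 0.1, hidden)

    def forward(self, frames):
        """frames: (time, n_features) array of per-frame features.
        Returns one activation in (0, 1) per frame once the delay
        line is filled; a high value would signal the target note."""
        outs = []
        for t in range(self.delays, len(frames)):
            window = frames[t - self.delays : t + 1].ravel()
            h = np.tanh(self.w1 @ window)                      # hidden layer
            outs.append(1.0 / (1.0 + np.exp(-(self.w2 @ h))))  # sigmoid output
        return np.array(outs)

net = TimeDelayNote()
activations = net.forward(rng.random((10, 88)))
print(activations.shape)  # one output per frame after 3 past frames are available
```

In a full system, one such network per note (76 in SONIC) would run over the frame sequence, and a thresholded activation trace would mark where the note is present.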
To sum up, partial tracking represents an important part of our system, as it significantly decreases the number of errors.

5 What's Missing?

5.1 Onset Detection

We added an onset detector to SONIC to improve the accuracy of the onset times of notes found by the system. We based our onset detection algorithm on a model proposed by Smith [15] for the segmentation of speech signals. First, the algorithm splits the signal into several frequency bands with a bank of gammatone filters. We are using the same set of filters as in our partial tracking system. The signal is split into 22 overlapping frequency bands, each covering half an octave. Channels are full-wave rectified and then processed with the following filter:

(2)

s(x) represents the signal in each frequency channel, fs the sample rate, and ts and tl are two time constants. The filter calculates the difference between two amplitude envelopes: one calculated with a smoothing filter with a short time constant ts (6-20 ms), and the other by smoothing the signal with a longer time constant tl (20-40 ms). The output of the filter has positive values when the signal rises and negative values otherwise.

The outputs of all 22 filters are fed into a fully connected network of integrate-and-fire neurons. Each neuron in the network accumulates its input over time and, if its internal activation exceeds a certain threshold, the neuron fires (emits an output impulse). Firings of neurons provide indications of amplitude growth in frequency channels. After firing, the activity of the neuron is reset and it is not allowed to respond to its input for a period of time (50 ms in our model). Each neuron is connected to all other neurons in the network with excitatory connections. The firing of a neuron raises the activations of all other neurons in the network and accelerates their firing, if imminent. Such a mechanism clusters neuron firings, which may otherwise be dispersed in time, and improves the discovery of weak onsets.

The network of integrate-and-fire neurons outputs a series of impulses indicating the presence of onsets in the signal. Not all impulses are onsets, since various noises and beating can cause amplitude oscillations in the signal. We use an MLP neural network to decide which impulses represent onsets. We trained the MLP on a set of piano pieces (the same set we used for training the note recognition networks). We tested the algorithm on a mixture of synthesized and real piano recordings. It correctly found over 98.5% of all onsets and produced around 2% spurious onsets. Most of the missed notes were notes played in very fast passages, or notes in ornamentations such as trills; spurious notes mostly occurred because of beating or other noises in the signal.

5.2 Repeated Notes

Detecting repeated notes in a musical signal can be a difficult problem, even if the played instrument has pronounced onsets (as the piano does). An illustration of the problem is given in Fig. 6.

Figure 6: Different interpretations of networks' outputs.

The upper part of Fig. 6 shows the outputs of the onset detection network and five note recognition networks on an unknown piece of music. Four onsets and five notes were found; note C4 lasts through the entire duration of the piece, while the other notes appear for shorter periods of time. Three transcription examples show three possible interpretations of these outputs. The interpretations differ in the way note C4 is handled; we could transcribe it as one whole note, four quarter notes, and so on. Altogether eight combinations are possible, and all of them are consistent with the networks' outputs.

It becomes apparent that the system needs an algorithm for the detection of repeated notes. We first used the most obvious solution, which is to track the amplitude of the first harmonic of a possible repeated note and produce a repetition if the amplitude rises enough. Because of shared partials between notes, this approach fails when a note that shares partials with the repeated note occurs in the signal. We therefore decided to entrust the decision on repeated notes to an MLP neural network, trained on a set of piano pieces. The inputs of the MLP consist of amplitude changes, as well as several other parameters. This solution improves transcription accuracy by approximately 2.5% over the fundamental frequency approach.

5.3 Tuning, Note Length and Loudness

Before transcription actually starts, a simple tuning procedure is used to calculate the tuning of the entire piano and initialize the frequencies of the adaptive oscillators accordingly. The procedure uses adaptive oscillators to find partials in the piano piece and then compares partial frequencies to the frequencies of an ideally tuned piano. The tuning of the piano is calculated as a weighted average of the deviations of partial frequencies from ideal tuning. Stretching of piano tuning is also taken into consideration in the process. The tuning procedure guarantees unchanged transcription
accuracy, when the piano is tuned differently then the The average number of correctly found notes in
standard A4=440 Hz. The procedure only calculates the tuning of the entire piano, not of individual piano tones.
The system also calculates the length and loudness of each note. Both are needed to produce the MIDI file containing the transcription. The length of a note is calculated by observing activations of the note recognition network; a note is terminated when the network activation falls below the training threshold. Loudness is calculated from the amplitude envelope of the note's first harmonic.

5 Performance analysis

In this section, we present the performance of our system on transcriptions of three synthesized and three real recordings of piano music. Originals and transcriptions of all presented pieces (and more) can be heard at http://lgm.fri.uni-lj.si/SONIC. Table 2 lists the percentages of correctly found and spurious notes in the transcriptions, as well as the distribution of errors into octave, repeated-note and other errors. Separate error distributions are given for missed and spurious notes. An error can fall into several categories, so the sum of error percentages may be greater than 100. The total number of notes and the maximal and average polyphony of each piece are also shown.
The transcribed synthesized recordings are: (1) J.S. Bach, Partita no. 4, BWV 828, Fazioli piano; (2) A. Dvorak, Humoresque no. 7, op. 101, Steinway D piano; (3) S. Joplin, The Entertainer, Bösendorfer piano. The real recordings are: (4) J.S. Bach, English Suite no. 5, BWV 810, 1st movement, performed by Murray Perahia, Sony Classical SK 60277; (5) F. Chopin, Nocturne no. 2, op. 9/2, performed by Artur Rubinstein, RCA 60822; (6) S. Joplin, The Entertainer, performer unknown, MCA 11836.
The percentage of correctly found notes in the synthesized recordings is around 90%, and the average number of spurious notes is 9%. Reasons for missed notes fall into several categories. Octave errors and misjudged repeated notes are a major factor in all pieces. Notes are also missed in very fast passages such as arpeggios or trills (most missed notes in the Partita), when they are masked by louder notes (many notes in the Humoresque), or due to other factors such as missed onsets and high polyphony. The vast majority of spurious notes are octave errors, often combined with misjudged repeated notes. These are especially common in pedaled music (Humoresque) or in loud chords (The Entertainer). Other reasons for spurious notes include missed and spurious onsets and errors due to high polyphony.
Some common errors can be seen in a transcription example taken from the Humoresque (table 2(2), Fig. 7A). Missed notes are marked with a - sign, spurious notes with a + sign. All spurious notes are octave errors. Note E3 is a missed repeated note, while note A5 wasn't found because it is masked by the louder E3-C4 chord.
Results on real recordings are not as good as those on synthesized recordings. The poorer transcription accuracy is a consequence of several factors. The recordings contain reverberation and more noise, and the sound of real pianos includes beating and sympathetic resonance. Furthermore, performances of piano pieces are much more expressive: they contain increased dynamics, more arpeggios and more pedaling. All of these factors make transcription more difficult.
The analysis of SONIC's performance on the real recording of Bach's English Suite (table 2(4), Fig. 7B) showed that, besides octave and repeated-note errors, most of the missed notes are either quiet low-pitched notes (measure 2 in Fig. 7B) or notes in arpeggios and trills.
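The note length and loudness rules described at the top of this section (a note ends when the recognition network's activation drops below the training threshold; loudness comes from the first harmonic's amplitude envelope) can be sketched as follows. This is a hedged illustration, not SONIC's actual code: the per-frame sampling of activations, the threshold value, and the mapping of envelope peak to MIDI velocity are all assumptions.

```python
def note_length_frames(activations, threshold):
    """A note lasts until the recognition network's activation first
    falls below the training threshold. `activations` is assumed to be
    one value per analysis frame; returns the note length in frames."""
    for frame, act in enumerate(activations):
        if act < threshold:
            return frame
    return len(activations)  # active for the whole observed span

def note_velocity(first_harmonic_envelope):
    """Map the peak of the first harmonic's amplitude envelope (assumed
    normalized to [0, 1]) to a MIDI velocity in the range 1..127."""
    peak = max(first_harmonic_envelope)
    return min(127, max(1, round(127 * peak)))
```

With a threshold of 0.5, activations [0.9, 0.8, 0.3, 0.1] would yield a note two frames long; multiplying the frame count by the analysis hop size gives the duration written to the MIDI file.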
Figure 7: Transcription examples: A: Humoresque (2), B: BWV810 (4), C: The Entertainer (6)
    corr.  spur.  ----- missed notes -----  ---- spurious notes ----  num.   avg.  max.
    notes  notes  octave  repeat.  other    octave  repeat.  other    notes  poly  poly
 1   98.1    7     31.4    23.6    56.4      84.4    22.3     7.9     6680    2.7    6
 2   92.3   10.6   53.2    39.2    29.4      95.3    29.9     0       1008    4.1   12
 3   86      9.5   80.8    25.6     9        96       8.2     5.1     1564    3.4    9
 4   88.5   15.5   35.1    18.2    52.2      80.5    17.6    13.9     1351    2.6    6
 5   68.3   13.6   30.3     2.1    75.3      79       6.4    20.7      457    4.4   11
 6   85.9   15.2   70.3    10.8    27        87.4     7.1    12.3     1564    3.4    9

Table 2: Transcription results on three real and three synthesized recordings
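As a rough illustration of the statistics reported in Table 2, the sketch below matches transcribed notes against a reference and classifies errors. The matching rule (same pitch, onsets within a tolerance) and the octave test are assumptions for this sketch; the paper does not spell out its exact scoring procedure.

```python
def transcription_stats(reference, transcribed, tol=0.05):
    """reference, transcribed: lists of (onset_seconds, midi_pitch).
    A transcribed note counts as correct if an unmatched reference note
    of the same pitch starts within `tol` seconds; the rest are spurious,
    and unmatched reference notes are missed."""
    unmatched = list(reference)
    correct, spurious = 0, []
    for onset, pitch in transcribed:
        hit = next((r for r in unmatched
                    if r[1] == pitch and abs(r[0] - onset) <= tol), None)
        if hit is not None:
            unmatched.remove(hit)
            correct += 1
        else:
            spurious.append((onset, pitch))
    return {"correct_pct": 100.0 * correct / len(reference),
            "spurious_pct": 100.0 * len(spurious) / len(reference),
            "missed": unmatched, "spurious": spurious}

def is_octave_error(pitch, concurrent_reference_pitches):
    """A spurious (or missed) note is counted as an octave error here if
    a note one or more octaves away sounds at the same time in the
    reference (the paper's exact criterion is an assumption)."""
    return any(p != pitch and abs(p - pitch) % 12 == 0
               for p in concurrent_reference_pitches)
```

As in Table 2, both error percentages are expressed relative to the total number of reference notes, so a piece can have more than 100% combined error mass when error categories overlap.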
Our system showed the worst performance on Chopin's Nocturne (table 2(5)). The recording is a good example of very expressive playing, where a distinctive melody is accompanied by very quiet, sometimes barely audible left-hand chords. The system misses many notes, but even so the resynthesized transcription sounds very similar to the original (listen to the examples at the aforementioned URL).
Transcriptions of the real and synthesized versions of The Entertainer (table 2(3) and 2(6), Fig. 7C) are very similar. The transcription of the real recording contains more spurious notes, mostly occurring because of pedaling (not used in the synthesized version). Most spurious notes are octave errors. The number of correctly found notes is almost the same in both pieces.

7 Conclusion

In this paper, we presented a system for transcription of polyphonic piano music. Our main goal was to evaluate connectionist approaches in all parts of the system. We presented a new partial tracking method based on networks of adaptive oscillators. By using a connectionist approach to partial tracking, we avoided problems that occur with standard "peak picking-peak connecting" algorithms. An additional advantage of our approach is that, in addition to tracking individual partials, it is able to track groups of partials. We presented an overview of our entire transcription system and showed performance statistics for transcriptions of several synthesized and real piano recordings. Overall, we are very satisfied with the results. They show that neural networks are a good alternative for building transcription systems and should be studied further. Further research will include the addition of feedback mechanisms to the currently strictly feed-forward approach, with the intention of reducing some common errors. Additionally, an extension of the system to the transcription of other instruments may be considered.

References

[1] A. Klapuri, Automatic Transcription of Music. M.Sc. Thesis, Tampere University of Technology, Finland, 1997.
[2] K. Kashino, K. Nakadai, T. Kinoshita, H. Tanaka, "Application of Bayesian probability network to music scene analysis," in Proceedings of the International Joint Conference on AI, Workshop on Computational Auditory Scene Analysis, Montreal, Canada, 1995.
[3] L. Rossi, Identification de Sons Polyphoniques de Piano. Ph.D. Thesis, Université de Corse, France, 1998.
[4] A.D. Sterian, Model-based Segmentation of Time-Frequency Images for Musical Transcription. Ph.D. Thesis, University of Michigan, 1999.
[5] S. Dixon, "On the computer recognition of solo piano music," in Proceedings of the Australasian Computer Music Conference, Brisbane, Australia, 2000.
[6] C. Roads, The Computer Music Tutorial. Cambridge, MA: MIT Press, 1996.
[7] R.D. Patterson, J. Holdsworth, "A functional model of neural activity patterns and auditory images," in Advances in Speech, Hearing and Auditory Images, W.A. Ainsworth (ed.), London: JAI Press, 1990.
[8] M. Slaney, "An efficient implementation of the Patterson-Holdsworth auditory filterbank," Apple Computer Technical Report #35, 1993.
[9] R. Meddis, "Simulations of mechanical to neural transduction in the auditory receptor," Journal of the Acoustical Society of America, vol. 79, no. 3, pp. 702-711, 1986.
[10] R. Meddis, M.J. Hewitt, "Virtual pitch and phase sensitivity of a computer model of the auditory periphery I: pitch identification," Journal of the Acoustical Society of America, vol. 89, no. 6, 1991.
[11] E.W. Large, J.F. Kolen, "Resonance and the perception of musical meter," Connection Science, vol. 6, no. 2, 1994.
[12] J.D. McAuley, Perception of Time as Phase: Toward an Adaptive-Oscillator Model of Rhythmic Pattern Processing. Ph.D. Thesis, Indiana University, 1995.
[13] D. Wang, "Primitive auditory segregation based on oscillatory correlation," Cognitive Science, vol. 20, 1996.
[14] A.T. Waibel, T. Hanazawa, G. Hinton, K. Shikano, K.J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, 1989.
[15] L.S. Smith, "Onset-based sound segmentation," in Advances in Neural Information Processing Systems 8, Touretzky, Mozer and Hasselmo (eds.), Cambridge, MA: MIT Press, 1996.