ponents, the average envelope of which could be interpolated from the surrounding spectrum. A possible explanation is that these distortions arise from frequency and amplitude modulations of a sinusoidal component within the time frame, or alternatively, that the modelling of instrument harmonics as slowly time-varying single sinusoids may not always be very accurate. The filter of unit amplitude removes the majority of the energy attributed to the instrument harmonic, without assuming that the harmonic conforms to a precise shape in the spectral domain. Signal-to-residual ratios of more than 30 dB have been achieved [6] when separating mixes of two simultaneous notes, and this provides some validation for using this filtering approach.

2. PRE-PROCESSING

All instrumental note samples used were in .wav format, mono, sampled at 44.1 kHz with 16-bit resolution, and all except the piano samples were recorded in an anechoic chamber. Mixed samples of 5-20 seconds in length, consisting of multiple interweaving melodies, were produced within a software sequencer such that audio and MIDI tracks for each instrument were recorded in parallel. MIDI note messages were used to trigger real audio samples, and it was possible for each instrument to be playing more than one note concurrently. The original mix was then split into overlapping time frames of length $N_{win} = 8192$ samples (186 ms), with an overlap of 87.5%. In each frame, after time weighting the signal with a Hamming window, an FFT was used to transform to the spectral domain.

In each time frame, the number of simultaneously sounding notes in the original mix was found from the MIDI data. As the transcribed pitches of these notes in the MIDI data were restricted to the notes of a keyboard, and considerable pitch variations over the duration of a note are not uncommon, a pitch-refinement process was used to accurately estimate all pitches present in each frame. Each refined pitch estimate was taken to be the mean of $\{f_j^p / j \,;\, j = 1 \ldots J\}$, where $f_j^p$ is the frequency of the $j$th harmonic of pitch $p$. The harmonic frequencies $\{f_j^p\}$ were found using an iterative process starting with the identification of the fundamental frequency spectral component and then searching for spectral peaks at successively higher harmonics.

An effective method for detecting prominent spectral peaks was necessary both during pitch refinement and later in the filter design. The aim was to detect all local peaks in the amplitude spectrum significantly higher than the noise floor. A frequency-dependent threshold is usually necessary to detect all harmonics, whilst keeping the number of spurious spectral peaks or noise components above the threshold to a minimum. This frequency-dependent thresholding was implemented by dividing the spectrum by $\mathrm{Env}(f)^c$, where $c$ was chosen to be between 0 and 1, and $\mathrm{Env}(f)$ is the convolution of the amplitude spectrum with a Hamming window of length $1 + N_{win}/64$. Local peaks were found above the threshold using a neighbourhood search. Harmonics right up to the Nyquist frequency were detected effectively using this method.

Finally, the barycentric interpolator [7] was used to interpolate the spectral peak centre frequencies to sub-bin frequency resolution. This interpolator was compared with others such as Grandke's, Quinn's and the parabolic interpolator, and found to be quite effective for Hamming windowed data.

3. FILTER DESIGN

The basic idea in this spectral-filtering approach to separation is that if the pitches of all notes present during a particular time frame are known, and the number of notes is not too large, then it is possible to identify most of the prominent spectral peaks uniquely with single harmonics, and to construct filter notches of unit amplitude across the width of each peak to remove the corresponding harmonic from the spectrum. A separate filter is designed for each pitch, whose effect is to remove all the harmonics of this pitch from the spectrum, and the width of each notch is taken to be the interval between the troughs in the amplitude spectrum on either side of the peak maximum. A difficulty arises when harmonics of more than one pitch overlap in the spectrum. This problem was resolved in [8] for combinations of two overlapping partials in a stereo mix. In our case, the sum of the filter amplitudes for all pitches is set to unity across the width of this peak, and the shape of each filter notch is designed so that a suitable division of the energy in the spectral peak is achieved. This will be discussed in more detail below.

To begin with, it was necessary to ascertain whether each prominent spectral peak was attributable to a single harmonic or multiple harmonics. In the former case, we will refer to the spectral peak as a single-component peak and in the latter, a multi-component peak. A peak was matched to the $j$th harmonic of note $p$ if its centre frequency $f_k$ was within a fixed range $\delta$ of the predicted harmonic frequency $f_j^p$, where $f_j^p \approx j \cdot f_0^p$, and $f_0^p$ is the pitch of note $p$. The values of the $f_j^p$ were allowed to deviate from exact harmonicity ($f_j^p = j \cdot f_0^p$), such that if a single-component peak at $f_k$ was found to be very close to a predicted harmonic $f_j^p$, then $f_j^p$ would be set equal to $f_k$. In either case, the next predicted harmonic would be at $f_{j+1}^p = f_j^p + f_0^p$. This modification improved separation performance, probably because instruments whose harmonics are slightly de-tuned are treated more appropriately, and also because any slight pitch errors would not necessarily be compounded when multiplying by $j$ to find the $j$th harmonic.

When a spectral peak was matched to more than one harmonic from separate notes, then corresponding to each note $p$ contributing to that peak, a filter notch was designed that depended on the predicted frequency and predicted amplitude of its harmonic within the peak: $f_j^p$, $A_j^p$, where it is implicit that $j \equiv j(p)$. The prediction of harmonic frequencies was discussed previously, and the predicted harmonic amplitudes were obtained by linear interpolation between the amplitudes of the nearest harmonics of this pitch, above and below $f_j^p$, that were matched to single-component peaks. Two similar filter notch designs were tested, both achieving comparable performance after fine-tuning their parameters. The filter notches $H^p(f)$ were defined for frequencies $f$ between the troughs on either side of the peak: $f_{kl}$ and $f_{kr}$. For the first design, the filters obeyed equation (1a):

$$\hat{H}^p(f) = A_j^p \cdot \exp\left(-\frac{|f - f_j^p|}{\sigma}\right), \quad \forall p \in Q \qquad (1a)$$

followed by a normalisation:

$$H^p(f) = \frac{\hat{H}^p(f)}{\sum_{q \in Q} \hat{H}^q(f)} \qquad (1b)$$

where

$$Q = \{\, p \;;\; \exists\, j(p) \ \text{s.t.}\ |f_k - f_j^p| < \delta,\ p = 1 \ldots P \,\} \qquad (1c)$$
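As a rough illustration of equations (1a) and (1b), the following Python sketch builds normalised notches for one multi-component peak. It is only a sketch under stated assumptions: the function name `notch_filters`, the 1 Hz frequency grid, and the example harmonic frequencies and amplitudes are hypothetical and are not values from the experiments. The exponent is implemented exactly as printed in (1a), and σ is set from the trough spacing using the 0.02 · (f_kr − f_kl)² rule of thumb quoted in the text.

```python
import numpy as np


def notch_filters(f, f_pred, a_pred, sigma):
    """Return one normalised notch per contributing note.

    f      : frequencies between the troughs of the peak (Hz)
    f_pred : predicted harmonic frequency f_j^p for each note p in Q (Hz)
    a_pred : predicted harmonic amplitude A_j^p for each note p in Q
    sigma  : width parameter of equation (1a)
    Output shape is (number of notes, len(f)).
    """
    f = np.asarray(f, dtype=float)
    f_pred = np.asarray(f_pred, dtype=float)[:, None]
    a_pred = np.asarray(a_pred, dtype=float)[:, None]
    # Equation (1a) as printed: unnormalised notch for each note p in Q.
    h_hat = a_pred * np.exp(-np.abs(f[None, :] - f_pred) / sigma)
    # Equation (1b): divide by the sum over q in Q so the notches sum to one.
    return h_hat / h_hat.sum(axis=0, keepdims=True)


# Hypothetical peak: troughs at 2060 Hz and 2140 Hz, two overlapping harmonics.
f_kl, f_kr = 2060.0, 2140.0
f = np.linspace(f_kl, f_kr, 81)          # 1 Hz grid, chosen for illustration
sigma = 0.02 * (f_kr - f_kl) ** 2        # rule of thumb quoted in the text
H = notch_filters(f, f_pred=[2090.0, 2110.0], a_pred=[3.0, 1.5], sigma=sigma)
```

By construction the rows of `H` sum to one at every frequency, so multiplying the mixed spectrum by each row divides the energy of the peak among the contributing notes.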
A suitable value for $\sigma$ in equation (1a) was found to be about $0.02 \cdot (f_{kr} - f_{kl})^2$.

For the second filter notch design, if $F_{win}(f)$ is the DFT of the window function truncated to frequencies between zero and the Nyquist limit, then the filter notches were designed according to:

$$\hat{H}^p(f) = A_j^p \cdot \left| F_{win}\left( \;\cdot\; |f - f_j^p| \right) \right| \qquad (2)$$

where the factor scaling $|f - f_j^p|$ lies between 0.5 and 1, and again normalised using equation (1b) to obtain $H^p(f)$.

The shape of the filters designed using equation (2) is illustrated in Figure 1a for a peak composed of two overlapping harmonics, and the two resulting filtered peaks are compared with the original spectra of the individual harmonics in Figure 1b.

Figure 1: Filtering of a spectral peak arising from two overlapping harmonics: (a) Construction of the filters using equation (2) is determined by the predicted harmonic frequencies $f_1$ and $f_2$, and predicted harmonic amplitudes $A_1$ and $A_2$; (b) Comparison of the filtered spectra and original spectra of the individual notes. (Horizontal axis: Frequency (kHz).)

4. RESULTS

The signal-to-residual ratio (SRR) has been used as a quantifiable measure of separation performance. The residual in this case is the difference between the original $x$ and separated $x'$ waveforms of each instrumental part. Explicitly,

$$\mathrm{SRR}_x(x') \ [\mathrm{dB}] = 10 \log_{10} \frac{\sum_n x_n^2}{\sum_n (x_n - x'_n)^2} \qquad (3)$$

Table 1: Mean signal-to-residual ratios (SRR) and $\pi/M$ for sample mixes of 2-4 instrumental parts.

polyphony                  2      3      4
mean SRR_{x_i}(x'_i)      23.2   11.4   10.4
π(x_i, x'_i, y)/M         23.2   14.4   15.3

Figure 2: Original, separated and residual waveforms for a mix of two flute melodies. (Columns: Flute 1, Flute 2; horizontal axis: Time (s).)

where $y = \sum_m x_m$ is the mixed original signal, and larger values of $\pi/M$ correspond to better separation performance.

The average SRRs and $\pi/M$ are presented in Table 1 for some sample mixes of two to four instrumental parts. The waveforms of the original mixes were between 5 and 20 seconds in length. The dual polyphony sample was a mix of two harmonising flute melodies, the polyphony of three corresponded to a few upbeat bars in a major key played by a mix of flute, clarinet and French horn, and the example with a polyphony of four was a rough rendition of a few bars of Barber's 'Adagio For Strings' played on flute, French horn and two soprano saxophones. The audio files corresponding to these test cases have been put on the internet [1] for comparison.

A visual representation of the original, separated and residual time waveforms of each instrumental part in the mix, for the sample consisting of a mix of two flute melodies in Table 1, is given in Figure 2. For the same sample mix, the spectrograms of the original mixed sound and the separated flute parts after filtering are shown in Figure 3. One can see from this last figure a clear separation of the set of harmonics belonging to each instrument, and also note that the noise level in the separated spectrograms is of lower amplitude than that of the original mix, i.e. the noise components of the original mix have mostly gone into the residual waveform.
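The signal-to-residual ratio of equation (3) can be computed in a few lines. The Python sketch below is illustrative only: the function name `srr_db` and the synthetic test tone are assumptions and do not correspond to the samples used in the experiments.

```python
import numpy as np


def srr_db(x, x_sep):
    """Signal-to-residual ratio of equation (3), in dB.

    x     : original waveform of one instrumental part
    x_sep : separated waveform of the same part (same length as x)
    """
    x = np.asarray(x, dtype=float)
    x_sep = np.asarray(x_sep, dtype=float)
    residual = x - x_sep  # the residual is the original minus the separated part
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(residual ** 2))


# Hypothetical check: one second of a 440 Hz tone at 44.1 kHz, with a
# "separated" version that recovers 90% of the amplitude.
t = np.arange(44100) / 44100.0
x = np.sin(2.0 * np.pi * 440.0 * t)
x_sep = 0.9 * x
print(srr_db(x, x_sep))  # about 20 dB, since the residual is 0.1 * x
```

Averaging `srr_db` over the instrumental parts of a mix would give a mean SRR of the kind reported in Table 1.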
[Figure 3: spectrograms, with panels titled "Spectrogram of a mix of two flute melodies" and "Spectrogram of separated flute A"; vertical axis: Frequency (Hz).]

section. Also, the accuracy of the note timing information is an important factor in separation performance. If, for example, a note actually starts sounding slightly later than the note onset time provided in the MIDI data, then it is possible that the filter corresponding to this instrument will be filtering content from the mixed spectrum in the few time frames preceding the first time frame in which the note is actually present.

Lastly, we have found that the separation algorithms tend to produce interesting sounding residuals that seem to preserve the
200 — DAFx'04 Proceedings — 200