Proc. of the 7th Int. Conference on Digital Audio Effects (DAFx'04), Naples, Italy, October 5-8, 2004
A SPECTRAL-FILTERING APPROACH TO MUSIC SIGNAL SEPARATION

Mark R. Every, John E. Szymanski

Media Engineering Group, Department of Electronics


University of York, York, U.K.
[email protected], [email protected]

ABSTRACT

The task of separating a mix of several inter-weaving melodies from a mono recording into multiple tracks is attempted by filtering in the spectral domain. The transcribed score is provided in MIDI format a priori. In each time frame a filter is constructed for each instrument in the mix, whose effect is to filter out all harmonics of that instrument from the DFT spectrum. The complication of overlapping harmonics arising from separate notes is discussed, and two filter shapes that were found to be fairly successful at separating overlapping harmonics are presented. In comparing the separated audio tracks to the original instrumental parts, signal-to-residual ratios (SRRs) in excess of 20 dB have been achieved. Audio demonstrations are on the internet [1].

1. INTRODUCTION

Music separation, or more specifically, separating a number of instruments playing inter-weaving melodic lines from a mono recording, is nearly impossible to perform perfectly, as mixing audio signals almost always results in a loss of information. A more achievable aim is to obtain separation of adequate quality to be useful in a number of applications. These include: audio restoration; de-mixing of old mono recordings before cleaning up the separated instruments individually and re-mixing; re-mixing mono recordings in stereo/surround sound; structured audio coding; and some creative applications, for example an effects processor that applies an effect to a structured component of a sound rather than the whole.

The task is considered here to be a two-stage process: transcribing the mix into separate instrumental parts, for which the pitch and timing of each note are found, and then performing the separation. It is conceivable that the separated results could conversely aid the transcription process, but this is not part of this implementation. As the first stage, automatic music transcription (AMT), is a demanding task in itself, and the reader is referred to [2] for an account of some approaches to AMT, this research instead focuses on achieving good separation performance given the score in advance. The score is provided in MIDI format, such that a transcription of each instrumental part is available on a separate MIDI track.

To begin with, the mixed waveform is split into overlapping time frames and the DFT of the signal is computed in each frame. The pitch of a note can vary considerably over its duration, whereas a transcription of a note will most likely assign the note to a constant and discrete pitch. It was also observed that high-fidelity separation was only achieved when the variation in pitch over the duration of each note was estimated accurately. Thus, for every time frame, a refinement is made of the MIDI pitches of all notes sounding in this frame. Following this, a filter is designed in the frequency domain for each instrument, whose purpose is to remove the harmonics assigned to that instrument from the spectrum. It is possible that an instrument could be playing more than one note concurrently, and in this case a filter is designed that filters from the spectrum the harmonics of each note played by this instrument. Re-synthesized separated waveforms are produced by calculating the DFT$^{-1}$ of each filtered spectrum, and interpolating between time frames using an overlap-add technique.

Another approach to separating musical instruments [3] also notes the need for an accurate time-varying pitch estimate of each note, but instead takes an additive approach to re-synthesis, whereby the harmonics of each note are synthesized using oscillators whose time-varying frequency, amplitude and phase have been previously estimated in a least-squares sense. Similarly, additive synthesis has been used for the separation of harmonic sounds in [4]. Whilst these approaches may be able to produce fairly realistic synthesized sounds, some difficulty was encountered in the preliminary stages of our research in obtaining a realistic-sounding residual using this method. The residual in this respect is the original mix minus the sum of separated sounds produced by additive synthesis. During this time-domain subtraction, harmonics are liable to bleed into the residual unless highly accurate phase matching is achieved between the sinusoidal components of the additive synthesis model and the corresponding components in the original mix.

Alternatively, one could consider the spectrum of a harmonic sound in a single time frame to consist of a sum of scaled and translated Fourier transforms of the window function centred at the harmonic frequencies, plus a residual component. This type of model is discussed for example in [5]. The separation of harmonic sources could then be achieved by removing the harmonics of each source from the spectrum, by subtracting from the mixed spectrum a sum of ideal window shapes whose amplitude, phase and centre frequencies had all been optimally calculated. In contrast, in the approach described here, assuming for the moment that a spectral peak we are investigating contains a single harmonic, the harmonic is separated from the mixed spectrum by constructing a filter of unit amplitude across the main lobe of the spectral peak, between the troughs in the amplitude spectrum on either side of the peak. Thus, if the shape of the spectral peak were indeed the Fourier transform of the window function, this method would not remove the window's side-lobes; but, for example in the case of the Hamming window, the largest side-lobes are 43 dB lower than the main lobe, so they may be sufficiently small for this not to be a concern. It is fairly common to observe spectral peaks significantly higher than the noise level that do not closely resemble the shape of the DFT of the window function, even if one takes into account the distortion of their shape due to noise or residual components, the average envelope of which could be interpolated from the surrounding spectrum. A possible explanation is that these distortions arise from frequency and amplitude modulations of a sinusoidal component within the time frame, or alternatively, that the modelling of instrument harmonics as slowly time-varying single sinusoids may not always be very accurate. The filter of unit amplitude removes the majority of the energy attributed to the instrument harmonic, without assuming that the harmonic conforms to a precise shape in the spectral domain. Signal-to-residual ratios of more than 30 dB have been achieved [6] when separating mixes of two simultaneous notes, and this provides some validation for using this filtering approach.

2. PRE-PROCESSING

All instrumental note samples used were in .wav format, mono, sampled at 44.1 kHz with 16-bit resolution, and all except the piano samples were recorded in an anechoic chamber. Mixed samples of 5-20 seconds in length, consisting of multiple inter-weaving melodies, were produced within a software sequencer such that audio and MIDI tracks for each instrument were recorded in parallel. MIDI note messages were used to trigger real audio samples, and it was possible for each instrument to be playing more than one note concurrently. The original mix was then split into overlapping time frames of length $N_{win} = 8192$ samples (186 ms), with an overlap of 87.5%. In each frame, after time-weighting the signal with a Hamming window, an FFT was used to transform to the spectral domain.

In each time frame, the number of simultaneously sounding notes in the original mix was found from the MIDI data. As the transcribed pitches of these notes in the MIDI data were restricted to the notes of a keyboard, and considerable pitch variations over the duration of a note are not uncommon, a pitch-refinement process was used to accurately estimate all pitches present in each frame. Each refined pitch estimate was taken to be the mean of $\{f_j^p / j \;;\; j = 1 \ldots J\}$, where $f_j^p$ is the frequency of the $j$th harmonic of pitch $p$. The harmonic frequencies $\{f_j^p\}$ were found using an iterative process, starting with the identification of the fundamental-frequency spectral component and then searching for spectral peaks at successively higher harmonics.

An effective method for detecting prominent spectral peaks was necessary, both during pitch refinement and later in the filter design. The aim was to detect all local peaks in the amplitude spectrum significantly higher than the noise floor. A frequency-dependent threshold is usually necessary to detect all harmonics, whilst keeping the number of spurious spectral peaks or noise components above the threshold to a minimum. This frequency-dependent thresholding was implemented by dividing the spectrum by $Env(f)^c$, where $c$ was chosen to be between 0 and 1, and $Env(f)$ is the convolution of the amplitude spectrum with a Hamming window of length $1 + N_{win}/64$. Local peaks were found above the threshold using a neighbourhood search. Harmonics right up to the Nyquist frequency were detected effectively using this method.

Finally, the barycentric interpolator [7] was used to interpolate the spectral peak centre frequencies to sub-bin frequency resolution. This interpolator was compared with others such as Grandke's, Quinn's and the parabolic interpolator, and found to be quite effective for Hamming-windowed data.

3. FILTER DESIGN

The basic idea in this spectral-filtering approach to separation is that if the pitches of all notes present during a particular time frame are known, and the number of notes is not too large, then it is possible to identify most of the prominent spectral peaks uniquely with single harmonics, and to construct filter notches of unit amplitude across the width of each peak to remove the corresponding harmonic from the spectrum. A separate filter is designed for each pitch, whose effect is to remove all the harmonics of this pitch from the spectrum, and the width of each notch is taken to be between the troughs in the amplitude spectrum on either side of the peak maximum. A difficulty arises when harmonics of more than one pitch overlap in the spectrum. This problem was resolved in [8] for combinations of two overlapping partials in a stereo mix. In our case, the sum of the filter amplitudes for all pitches is set to unity across the width of such a peak, and the shape of each filter notch is designed so that a suitable division of the energy in the spectral peak is achieved. This will be discussed in more detail below.

To begin with, it was necessary to ascertain whether each prominent spectral peak was attributable to a single harmonic or to multiple harmonics. In the former case, we will refer to the spectral peak as a single-component peak and, in the latter, a multi-component peak. A peak was matched to the $j$th harmonic of note $p$ if its centre frequency $f_k$ was within a fixed range $\delta$ of the predicted harmonic frequency $f_j^p$, where $f_j^p \approx j \cdot f_0^p$ and $f_0^p$ is the pitch of note $p$. The values of the $f_j^p$ were allowed to deviate from exact harmonicity ($f_j^p = j \cdot f_0^p$), such that if a single-component peak at $f_k$ was found to be very close to a predicted harmonic $f_j^p$, then $f_j^p$ would be set equal to $f_k$. In either case, the next predicted harmonic would be at $f_{j+1}^p = f_j^p + f_0^p$. This modification improved separation performance, probably because instruments whose harmonics are slightly de-tuned are treated more appropriately, and also because any slight pitch errors would not necessarily be compounded when multiplying by $j$ to find the $j$th harmonic.

When a spectral peak was matched to more than one harmonic from separate notes, then corresponding to each note $p$ contributing to that peak, a filter notch was designed that depended on the predicted frequency and predicted amplitude of its harmonic within the peak: $f_j^p$, $A_j^p$, where it is implicit that $j \equiv j(p)$. The prediction of harmonic frequencies was discussed previously, and the predicted harmonic amplitudes were obtained by linear interpolation between the amplitudes of the nearest harmonics of this pitch, above and below $f_j^p$, that were matched to single-component peaks. Two similar filter notch designs were tested, both achieving comparable performance after fine-tuning their parameters. The filter notches $H^p(f)$ were defined for frequencies $f$ between the troughs on either side of the peak, $f_k^l$ and $f_k^r$. For the first design, the filters obeyed equation (1a):

$$\hat{H}^p(f) = A_j^p \cdot \exp\left( -\frac{|f - f_j^p|}{\sigma} \right), \quad \forall\, p \in Q \quad (1a)$$

followed by a normalisation:

$$H^p(f) = \frac{\hat{H}^p(f)}{\sum_{q \in Q} \hat{H}^q(f)} \quad (1b)$$

where

$$Q = \{\, p \;;\; \exists\, j(p) \text{ s.t. } |f_k - f_j^p| < \delta,\; p = 1 \ldots P \,\} \quad (1c)$$


Figure 1: Filtering of a spectral peak arising from two overlapping harmonics: (a) construction of the filters using equation (2) is determined by the predicted harmonic frequencies f1 and f2, and predicted harmonic amplitudes A1 and A2; (b) comparison of the filtered spectra and original spectra of the individual notes. (Axes: amplitude vs. frequency, 2.06-2.14 kHz.)

Figure 2: Original, separated and residual waveforms for a mix of two flute melodies. (Panels: Flute 1 and Flute 2; rows: original, separated, residual; amplitude vs. time, 0-5 s.)

Table 1: Mean signal-to-residual ratios (SRR) and π/M, for sample mixes of 2-4 instrumental parts

    polyphony                 2       3       4
    mean SRR_{x_i}(x'_i)      23.2    11.4    10.4
    π(x_i, x'_i, y)/M         23.2    14.4    15.3
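The SRR and π/M values reported in Table 1 follow equations (3) and (4) of the text; the following numpy sketch computes both (function names are our own):

```python
import numpy as np

def srr_db(x, x_sep):
    """Signal-to-residual ratio of eq. (3), in dB: energy of the
    original part over the energy of (original - separated)."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_sep) ** 2))

def mean_srr_gain(originals, separated):
    """pi/M of eq. (4): average SRR improvement of the M separated
    parts over simply taking the whole mix y = sum_m x_m."""
    y = np.sum(originals, axis=0)  # the mixed original signal
    return np.mean([srr_db(x, xs) - srr_db(x, y)
                    for x, xs in zip(originals, separated)])
```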
and a suitable value for $\sigma$ was found to be about $0.02 \cdot (f_k^r - f_k^l)^2$.

For the second filter notch design, if $F_{win}(f)$ is the DFT of the window function truncated to frequencies between zero and the Nyquist limit, then the filter notches were designed according to:

$$\hat{H}^p(f) = A_j^p \cdot |F_{win}(\epsilon \cdot |f - f_j^p|)| \quad (2)$$

where $0.5 < \epsilon < 1$, and again normalised using equation (1b) to obtain $H^p(f)$.

The shape of the filters designed using equation (2) is illustrated in Figure 1a for a peak composed of two overlapping harmonics, and the two resulting filtered peaks are compared with the original spectra of the individual harmonics in Figure 1b.

4. RESULTS

The signal-to-residual ratio (SRR) has been used as a quantifiable measure of separation performance. The residual in this case is the difference between the original $x$ and separated $x'$ waveforms of each instrumental part. Explicitly,

$$\mathrm{SRR}_x(x')\,[\mathrm{dB}] = 10 \log \frac{\sum_n x_n^2}{\sum_n (x_n - x'_n)^2} \quad (3)$$

Another measure of separation performance is the average increase in the sum of SRRs for the $M$ instrumental parts:

$$\frac{\pi(x_i, x'_i, y)}{M} = \frac{1}{M} \sum_{m=1}^{M} \left( \mathrm{SRR}_{x_m}(x'_m) - \mathrm{SRR}_{x_m}(y) \right) \quad (4)$$

where $y = \sum_m x_m$ is the mixed original signal, and larger values of $\pi/M$ correspond to better separation performance.

The average SRRs and $\pi/M$ are presented in Table 1 for some sample mixes of two to four instrumental parts. The waveforms of the original mixes were between 5 and 20 seconds in length. The dual-polyphony sample was a mix of two harmonising flute melodies; the polyphony of three corresponded to a few upbeat bars in a major key played by a mix of flute, clarinet and French horn; and the example with a polyphony of four was a rough rendition of a few bars of Barber's 'Adagio For Strings' played on flute, French horn and two soprano saxophones. The audio files corresponding to these test cases have been put on the internet [1] for comparison.

A visual representation of the original, separated and residual time waveforms of each instrumental part in the mix, for the sample consisting of a mix of two flute melodies in Table 1, is given in Figure 2. For the same sample mix, the spectrograms of the original mixed sound and the separated flute parts after filtering are shown in Figure 3. One can see from this last figure a clear separation of the set of harmonics belonging to each instrument, and also note that the noise level in the separated spectrograms is of lower amplitude than that of the original mix, i.e. the noise components of the original mix have mostly gone into the residual waveform.

5. DISCUSSION

Although the results describe only a small selection of test cases, both the quantitative results given in Table 1 and direct comparison by listening to the original and separated audio files show that this


fairly straightforward approach to music signal separation is quite successful. Mean SRRs of between 10.4 and 23.2 dB were obtained in Table 1, which represents a factor of about 11 to 210 times more energy in the original un-mixed sounds $x_m$ than in the residuals $x_m - x'_m$. These can be compared with the mean SRRs achieved for separating mixtures of single notes in [6]. In [6], mean SRRs of 26.0 and 18.8 dB were obtained for polyphonies of 2 and 4 respectively, as an average over many almost random sample mixes. These sample mixes were chosen by randomly selecting an instrument out of a group of 10 orchestral instrument types and then randomly choosing a pitch out of each instrument's pitch range. In this paper, the sound examples studied consisted of instrumental parts that harmonised with each other, i.e. note intervals such as octaves, fifths and thirds were common, making separation considerably more difficult than in random note mixtures, due to the fact that many more harmonics would be overlapping in the spectral domain. This is believed to be one of the main reasons that higher SRRs were achieved in [6]. Hence, the issue of how to separate overlapping harmonics is relevant to separating typical musical signals. It is also worth considering that notes usually contain a significant noise or inharmonic component, and given that these algorithms only attempt to remove prominent spectral peaks from the mixed spectrum, even if the harmonics of each note were perfectly subtracted from the spectrum, the maximum SRRs achievable using this approach would be limited by the amount of inharmonic content produced by each instrument.

During listening, as expected, the most noticeable differences between the original and separated sounds occur at note onsets. This is partly due to the fact that there is usually a larger inharmonic component of a note at its onset than during the sustained section. Also, the accuracy of the note timing information is an important factor in separation performance. If, for example, a note actually starts sounding slightly later than the note onset time provided in the MIDI data, then it is possible that the filter corresponding to this instrument will be filtering content from the mixed spectrum in the few time frames preceding the first time frame in which the note is actually present.

Lastly, we have found that the separation algorithms tend to produce interesting-sounding residuals that seem to preserve the inharmonic characteristics of each instrument, for example the 'breathiness' of a flute or the percussiveness of a piano note. There is potential for further research in finding ways of separating the mixed residual into instrumental parts and recombining these with the separated harmonic components in such a way as to produce more natural-sounding results. Furthermore, these residuals may be useful in creative applications such as adding natural-sounding, inharmonic instrument characteristics to synthesized sounds.

Figure 3: Spectrograms of a sample mix of two flute melodies, and the separated individual flute parts after filtering. (The gray-scale is equal for each spectrogram and the spectrogram specifics are: sample rate fs = 44.1 kHz, FFT length 4096 samples, 87.5% overlap, Hamming windowed. Panels: mix of two flute melodies, separated flute A, separated flute B; frequency 0-5000 Hz vs. time 0-5 s.)

6. ACKNOWLEDGEMENTS

We would like to show our appreciation to the University of Iowa Electronic Music Studios for providing all the instrumental samples that were used in this research.

7. REFERENCES

[1] M. R. Every and J. E. Szymanski, "Melody Separation Demonstrations," July 2004. [Online]. Available: http://www-users.york.ac.uk/~jes1/Separation2.html

[2] A. Klapuri, "Automatic transcription of music," Proc. Stockholm Music Acoustics Conf., SMAC-03, Stockholm, Sweden, Aug. 6-9, 2003.

[3] T. Virtanen and A. Klapuri, "Separation of harmonic sounds using multipitch analysis and iterative parameter estimation," Proc. IEEE Workshop on Appl. of Signal Proc. to Audio and Acoustics, WASPAA-01, New Paltz, New York, 2001.

[4] T. Virtanen and A. Klapuri, "Separation of harmonic sound sources using sinusoidal modeling," Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Proc., ICASSP-00, Istanbul, Turkey, 2000.

[5] X. Rodet, "Musical sound signal analysis/synthesis: Sinusoidal+residual and elementary waveform models," Proc. IEEE Conf. on Time-Frequency and Time-Scale Analysis, TFTS-97, 1997.

[6] M. R. Every and J. E. Szymanski, "Separation of synchronous pitched notes in the spectral domain," submitted to IEEE Trans. on Speech and Audio Proc.

[7] M. Donadio, "How to Interpolate Frequency Peaks," dspGuru, Iowegian Intern. Corp., May 1999. [Online]. Available: http://www.dspguru.com/howto/tech/peakfft2.htm

[8] H. Viste and G. Evangelista, "An extension for source separation techniques avoiding beats," Proc. 5th Int. Conf. on Digital Audio Effects, DAFx-02, Hamburg, Germany, 2002.
