Perceptual Wavetable Matching

for Synthesis of Musical Instrument Tones


Cheuk-Wai Wun, Andrew Horner and Lydia Ayers
Department of Computer Science
Hong Kong University of Science and Technology
Clearwater Bay, Kowloon, Hong Kong.
Email: [email protected] and [email protected]

Abstract

Recent parameter matching methods for multiple wavetable synthesis have used a simple relative spectral error formula to measure how accurately the synthetic spectrum matches an original spectrum [1]. It is supposed that the smaller the spectral error, the better the match, but this is not always true. This paper describes a modified error formula, which takes into account the masking characteristics of our auditory system, as an improved measure of the perceived quality of the matched spectrum. Selected instrument tones have been matched using both error formulae, and resynthesized. Listening test results show that wavetable matching using the perceptual error formula slightly outperforms ordinary matching, especially for instrument tones that have several masked partials.

0 Introduction

Multiple wavetable synthesis [1] is an efficient synthesis technique based on the addition of a number of fixed waveforms with time-varying weights. Matching synthesis starts with a time-varying spectral analysis of the original sound. Next, the synthesis parameters that produce the “best” match of the original spectrum are determined. Finally, the sound is resynthesized using the matched parameters.

Ideally, the “best” match should be determined by the listener’s perception of the quality of the match. However, it is impractical to have a listener judge every set of candidate synthesis parameters. Unfortunately, there is no known objective error metric that perfectly matches human judgment on the similarity between two time-varying spectra. A practical first-order approximation that has been used with good success is the Relative Spectral Error. Can we do better? Is there a more effective way to measure the perceptual similarity between two spectra?

This paper describes an improved parameter matching method for multiple wavetable synthesis that takes into account the masking characteristics of our auditory system. The rest of this paper is divided into four parts. Section 1 gives an overview of multiple wavetable synthesis. Section 2 describes the perceptual parameter matching method. Section 3 then measures masking in a wide variety of musical instrument tones, and compares the results of ordinary and perceptual wavetable matching. Section 4 concludes. (Refer to [2] for full results and more.)

1 Background
1.1 Multiple Wavetable Synthesis

Multiple wavetable synthesis is an efficient synthesis technique based on adding a number of fixed waveforms with time-varying weights. The basic assumption is that the original sound is nearly harmonic, and can therefore be approximated as

$$ y(t) = \sum_{k=1}^{NHAR} b_k(t) \sin\left[ 2\pi k \int f(t)\,dt \right] \qquad (1) $$

where NHAR is the number of partials in the tone;
$b_k(t)$ is the time-varying amplitude of the kth harmonic; and
$f(t)$ is the time-varying fundamental frequency.

In multiple wavetable synthesis, before synthesizing the sound, one period of each fixed waveform is pre-computed and stored in a wavetable. Each waveform is a weighted sum of harmonic sinusoids, whose spectrum is known as the wavetable’s basis spectrum. The wavetable entries are computed as follows:

$$ table_{i,j} = \sum_{k=1}^{NHAR} a_{k,j} \sin\left( \frac{2\pi k i}{L} \right) \quad \text{for } 0 \le i < L,\ 1 \le j \le NTAB \qquad (2) $$

where $table_{i,j}$ is the ith entry of the jth wavetable;
L is the length of each wavetable;
NHAR is the number of partials in each basis spectrum;
$a_{k,j}$ is the amplitude of the kth harmonic in the jth basis spectrum; and
NTAB is the number of wavetables or basis spectra.

Then the time-varying weights, or amplitude envelopes, of the basis spectra can be determined by solving the following system of linear equations:

$$
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,NTAB} \\
a_{2,1} & \cdots & \cdots & \vdots \\
\vdots & & \ddots & \vdots \\
a_{NHAR,1} & \cdots & \cdots & a_{NHAR,NTAB}
\end{bmatrix}
\begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,NFRM} \\
w_{2,1} & \cdots & \cdots & \vdots \\
\vdots & & \ddots & \vdots \\
w_{NTAB,1} & \cdots & \cdots & w_{NTAB,NFRM}
\end{bmatrix}
\approx
\begin{bmatrix}
b_{1,1} & b_{1,2} & \cdots & b_{1,NFRM} \\
b_{2,1} & \cdots & \cdots & \vdots \\
\vdots & & \ddots & \vdots \\
b_{NHAR,1} & \cdots & \cdots & b_{NHAR,NFRM}
\end{bmatrix}
$$

or

$$ A \cdot W \approx B \qquad (3) $$
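To make Eq. 2 and the matrix form of Eq. 3 concrete, the following minimal Python sketch fills wavetables from a small, made-up set of basis spectra. It then checks that mixing the tables with a set of weights gives the same waveform as a single table built from the harmonic amplitudes A·w, which is why matching amplitudes in Eq. 3 is enough to match the waveform. The function names, the list-of-lists layout, and the example numbers are illustrative assumptions and do not come from the paper.

```python
import math

def build_wavetables(A, L):
    """Fill NTAB wavetables of length L from basis spectra A (Eq. 2).

    A[k][j] is the amplitude of harmonic k+1 in basis spectrum j
    (0-based lists here; the text uses 1-based k and j).
    Returns tables[i][j], the ith entry of the jth wavetable.
    """
    nhar, ntab = len(A), len(A[0])
    return [
        [sum(A[k][j] * math.sin(2 * math.pi * (k + 1) * i / L)
             for k in range(nhar))
         for j in range(ntab)]
        for i in range(L)
    ]

def mix(tables, w):
    """Weighted sum of the wavetables for one set of weights w[j]."""
    return [sum(row[j] * w[j] for j in range(len(w))) for row in tables]

# Hypothetical example: 3 harmonics, 2 basis spectra, table length 64.
A = [[1.0, 0.2],
     [0.0, 0.5],
     [0.3, 0.1]]
L = 64
tables = build_wavetables(A, L)

w = [0.7, 0.4]                      # weights at one time point (one column of W)
mixed = mix(tables, w)

# Harmonic amplitudes implied by the weights: one column of A.W
b_star = [sum(A[k][j] * w[j] for j in range(len(w))) for k in range(len(A))]
direct = build_wavetables([[b] for b in b_star], L)
```

Because the tables are linear in the basis amplitudes, `mixed` and the single column of `direct` agree to rounding error.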
In Eq. 3, NHAR is the number of partials to match; NTAB is the number of wavetables or basis spectra; and NFRM is the number of frames of the representative target spectrum. As before, $a_{k,j}$ is the amplitude of the kth harmonic in the jth basis spectrum, $w_{j,n}$ is the weight of the jth wavetable at the nth selected time point and $b_{k,n}$ is the amplitude of the kth harmonic in the nth frame of the representative target spectrum.

Instead of using all frames of the original spectrum (which usually number from 500 to 5000), only a limited number of representative frames (NFRM) are selected for matching. There are two reasons for doing this. First, the computational cost is reduced. Second, this prevents the long sustain from dominating the short, but perceptually more significant, attack. In practice, we use NFRM = 30 with half selected from the attack (defined as the part before the peak r.m.s. amplitude is reached), and the other half from the remainder of the tone at evenly spaced time points. These representative frames form the target spectrum B.

If, in Eq. 3, NHAR equals NTAB and the basis spectra are linearly independent, there will be a perfect match at every time point and thus a trivial solution. However, a reduced number of wavetables is usually desired, so that NTAB is much smaller than NHAR, and the best solution in the least-squares sense is sought instead. The matched spectrum is given by B* = A⋅W, where

$$ b^*_{k,n} = \sum_{j=1}^{NTAB} a_{k,j}\, w_{j,n} $$

is the amplitude of the kth harmonic in the nth frame. Then the task is to find unique values of $w_{j,n}$ that minimize the squared error

$$ \sum_{k=1}^{NHAR} \left( b_{k,n} - b^*_{k,n} \right)^2 $$

at each selected time point for 1 ≤ n ≤ NFRM. Efficient algorithms exist to find the least-squares solution, for instance, by the use of the normal equations [3].

How are the basis spectra determined? After the user specifies the number of wavetables, an optimization procedure such as the genetic algorithm (GA) determines the best frames for basis spectra by selecting several from the original spectrum. The fitness function that guides the search measures the quality of the matched spectrum, and is defined as the following Relative Spectral Error [1]:

$$ \varepsilon = \frac{1}{NFRM} \sum_{n=1}^{NFRM} \frac{\sum_{k=1}^{NHAR} \left( b_{k,n} - b^*_{k,n} \right)^2}{\sum_{k=1}^{NHAR} b_{k,n}^2} \qquad (4) $$

The frames in Eq. 4 are the same as those in Eq. 3. A relative spectral error of 0 is a perfect match, while ε = 0.1 is a 10% relative spectral error.

2 Perceptual Wavetable Matching Synthesis

Recent wavetable matching methods have used the Relative Spectral Error formula (Eq. 4) to measure the quality of the matched spectrum. It is supposed that the smaller the relative error, the better the match. However, this is generally, but not always, true. Some spectra have lower relative errors, yet may not sound as similar to the original as others. This means that the relative error does not exactly reflect the perceptual quality of the matched spectrum.

In fact, not all partials in the matched (and original) spectrum are perceived, as some of them are masked by others. In this case, the part of the relative error contributed by the masked partials probably accounts for the anomalies. A Perceptual Relative Error formula, which takes into account the effects of the masked partials, would be a better measure of the perceptual quality of the matched spectrum.

Before the Perceptual Relative Error can be computed, we must first determine which partials are masked. The following algorithm tests if a partial is masked:
1. Remove the candidate partial from the spectrum.
2. Find the excitation level at the frequency of the candidate partial by calculating the output of the auditory filter centered at the candidate frequency as the simple sum of the outputs due to each of the remaining partials.
3. Compute the masked threshold at the candidate frequency as the sum of the excitation level and the masking index.
4. The candidate is considered masked if its intensity is below the masked threshold.

Every partial is tested. Therefore, in addition to the amplitude $b_{k,n}$, an extra flag $m_{k,n}$ is associated with each partial to indicate whether it is masked or not.

The Perceptual Relative Error is then defined as follows:

$$ \varepsilon_p = \frac{1}{NFRM} \sum_{n=1}^{NFRM} \frac{\sum_{k=1}^{NHAR} \delta_{k,n}^2}{\sum_{k=1}^{NHAR} b_{k,n}^2} \qquad (5) $$

where

$$ \delta_{k,n} = \begin{cases} b_{k,n} - b^*_{k,n} & \text{if } m_{k,n} \text{ is not set} \\ 0 & \text{otherwise} \end{cases} $$

The above definition does not include any error terms introduced by the masked partials. We assume that if a partial is masked in the original spectrum, it will be masked in the matched spectrum too. This assumption holds when the matched spectrum is reasonably close to the original spectrum, which is true in practice when three or more wavetables are used.

It is further assumed that all of the unmasked partials are equally important perceptually. A possible alternative would be to group the partials by critical band, and take their average. This would imply that the overall spectral power within each critical band is the only thing that matters in our perception of musical instrument tones. Discarding the spectral variation within each critical band probably goes too far, so we prefer to weight all the partials evenly.

3 Results
3.1 Measuring Masking in Musical Instrument Tones

This section describes the effects of masking on a variety of musical instrument tones. In addition to those discussed below, we have measured masking in the oboe, bassoon, trumpet, trombone, Chinese zheng, piano, violin, cello, and Chinese erhu.

The clarinet illustrates several aspects of masking in musical instrument tones. Three clarinet tones of different pitches (Eb3, Eb4 and G5) were analyzed, and their spectra are shown in Fig-1, Fig-2 and Fig-3. The upper part of each figure shows the spectral evolution of the tone in a three-dimensional amplitude versus harmonic versus time plot, while the lower part is a snapshot of the spectrum taken at the overall peak r.m.s. amplitude point. There is an asterisk under each masked harmonic.

The odd harmonics of the Eb3 and Eb4 clarinet tones are prominent, especially the 1st, 3rd, 5th and 7th harmonics. The weaker even harmonics are often masked. For example, the Eb3 tone’s 6th, 10th, 14th and 16th harmonics are masked at the time of the snapshot, and at most other time points as well.

Although the 2nd and 4th harmonics are very weak, they are not masked because their adjacent stronger odd harmonics do not fall within the same critical band. This becomes clear when linear frequency is translated to critical band rate. Fig-4 is essentially the same plot as Fig-1b, except that its frequency axis is labeled in Bark instead of harmonic number. The lower harmonics are farther apart than the higher harmonics, since the bandwidth of the auditory filter is smaller at lower center frequencies. As a result, the masking effect of the odd harmonics on the 2nd and 4th harmonics is negligible. On the other hand, many partials above the 20th harmonic are masked.

Moreover, the 11th and 13th harmonics of the Eb4 tone are masked by the spectral peak at the 12th harmonic.

In the G5 clarinet tone, no noticeable masking is observed. With a fundamental frequency of 784 Hz, its neighboring harmonics are so widely spaced that they can hardly fall within one critical band and have any masking effect on each other.

The following is a summary of masking in clarinet tones, which applies to many other instrument tones as well:
1. The higher harmonics are more easily masked than the lower harmonics. The lower the center frequency of the auditory filter, the narrower its bandwidth. Therefore, the excitation level (and masked threshold) in the lower frequency range is usually not high enough to allow any masking, since its calculation involves only a few harmonics. On the other hand, the dense higher harmonics cause significant masking on one another. (We listened to only the partials above the 10th harmonic of the Eb3 clarinet tone with and without the masked ones, and they sounded almost the same. This confirms that the masking of the higher harmonics is not due to the strong lower harmonics.)
2. Weak harmonics around spectral peaks are usually masked.
3. The masking effect is mainly observed in low notes. The separation of neighboring harmonics increases with the fundamental frequency, so in high notes usually no more than one harmonic falls within a single critical band. Other instruments usually have a similar situation for their high notes, hence we will focus on notes in the lower registers.

The Eb2 tuba tone has a rich spectrum, but none of the lower harmonics is masked. This is because the amplitude changes gradually from partial to partial, and no harmonic is remarkably weaker than its neighbors.

Fig-5 shows a 192 Hz tenor voice spectrum with two well-defined formants at 800 Hz and 2700 Hz. There are no masked harmonics near the lower formant because harmonics in such a low frequency range are not readily masked. The second formant is located around the 13th and 14th harmonics, and masks the 11th, 12th and 15th-18th harmonics.

3.2 Matching Results

This section compares the results of wavetable matching for three instruments: the clarinet (Eb4 and G5), tuba (Eb2) and tenor (G3). They were matched using both the ordinary and perceptual relative error formulae (Eq. 4 and Eq. 5, respectively), and resynthesized with the configuration listed below.
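As a rough illustration of the two error measures compared in this section, the sketch below solves the per-frame least-squares problem through the normal equations and then evaluates Eq. 4 and Eq. 5 as written above. It is a minimal pure-Python sketch under stated assumptions: the helper names, the tiny basis matrix, the target frames and the masking flags are all made up for the example. The masking flags are supplied by hand, and the auditory-filter test of Section 2 is not implemented here.

```python
def solve(M, y):
    """Solve M x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [list(M[r]) + [y[r]] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for q in range(c, n + 1):
                aug[r][q] -= f * aug[c][q]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][q] * x[q] for q in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def match_frame(A, b):
    """Least-squares weights for one frame via the normal equations
    (A^T A) w = A^T b, plus the matched amplitudes b* = A w."""
    nhar, ntab = len(A), len(A[0])
    AtA = [[sum(A[k][i] * A[k][j] for k in range(nhar)) for j in range(ntab)]
           for i in range(ntab)]
    Atb = [sum(A[k][i] * b[k] for k in range(nhar)) for i in range(ntab)]
    w = solve(AtA, Atb)
    b_star = [sum(A[k][j] * w[j] for j in range(ntab)) for k in range(nhar)]
    return w, b_star

def relative_error(frames, matched, masked=None):
    """Eq. 4 when masked is None; Eq. 5 when masked[n][k] flags are given."""
    total = 0.0
    for n, (b, bs) in enumerate(zip(frames, matched)):
        num = sum((b[k] - bs[k]) ** 2 for k in range(len(b))
                  if masked is None or not masked[n][k])
        total += num / sum(x * x for x in b)
    return total / len(frames)

# Hypothetical data: 3 harmonics, 2 basis spectra, 2 target frames.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]
frames = [[1.0, 0.5, 0.9],
          [0.2, 0.8, 0.4]]
matched = [match_frame(A, b)[1] for b in frames]
masked = [[False, False, True],     # assume harmonic 3 is masked in frame 1
          [False, False, False]]

e = relative_error(frames, matched)            # ordinary error (Eq. 4)
ep = relative_error(frames, matched, masked)   # perceptual error (Eq. 5)

# A frame that lies exactly in the span of the basis spectra matches perfectly.
w_exact, b_exact = match_frame(A, [0.3, 0.6, 0.45])
```

Since Eq. 5 only drops non-negative terms from the numerator of Eq. 4, `ep` can never exceed `e` for the same matched spectrum.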
Number of partials to match, NHAR = 30*
Number of wavetables or basis spectra, NTAB = 1…5
Number of representative spectra, NFRM = 30
* For the G5 clarinet tone, only the first 14 harmonics below the Nyquist frequency were matched.

Fig-6 shows a spectral snapshot from an original Eb4 clarinet tone’s sustain and that of the resynthesized tone (using 5 wavetables) taken at the same time. They indicate a common set of masked harmonics in both the original and matched spectra.

A listening test of indistinguishability between the original and matched tones was carried out to evaluate the quality of the synthetic tones. Five subjects with a good musical background took the test. There were four instrument tones, with three types per tone (original acoustic, synthesized using ordinary wavetable matching and synthesized using perceptual wavetable matching), and five repetitions of each tone type, making a total of 60 sound samples that were played in a random order during the test. After each sound sample was played, listeners answered whether they thought it was an acoustic or a synthetic tone.

The perceived quality of a synthetic tone is measured by how often it can be distinguished from its acoustic counterpart. This discrimination factor (d) is defined as follows, with the percentages expressed as fractions between 0 and 1:

$$ d = \frac{\%\,\text{correctly identified synthetics} - \%\,\text{falsely identified synthetics} + 1}{2} \qquad (6) $$

The percentage of falsely identified synthetics (acoustic samples misidentified as synthetic) is subtracted from the percentage of correctly identified synthetic samples to penalize the listener for guessing, then the difference is normalized to give a value in [0, 1]. The perceived quality of a matched tone increases with decreasing d, and when d falls below about 0.75, the matched tone is considered nearly indistinguishable from the original. This is reasonable if we consider the extreme case in which the listener thinks all the samples are acoustic: d will then be 0.5.

Two variables are introduced for comparing the matching results. First, to measure the relative amount of masking in the matched spectrum, a masking factor α is defined as:

$$ \alpha = \frac{e - e_p}{e} \qquad (7) $$

where e is the relative error of the matched spectrum; and
$e_p$ is the perceptual relative error of the matched spectrum.

α is not a direct measure (which may be defined as power of masked harmonics divided by total spectral power), but it indirectly reflects the effect of masking by computing the percentage of the relative error accounted for by the masked harmonics. Second, we use the following factor to assess the improvement of perceptual wavetable matching compared to ordinary wavetable matching:

$$ \beta = \frac{e_p - e'_p}{e_p} \qquad (8) $$

where $e_p$ is the perceptual relative error of the spectrum matched by ordinary wavetable matching; and
$e'_p$ is the perceptual relative error of the spectrum matched by perceptual wavetable matching.

The results of ordinary and perceptual wavetable matching of the four instrument tones using one to five wavetables are shown in Table-1 to Table-4. Table-5 shows the results of the listening test in which all of the synthetic tones are synthesized with five wavetables.

The amount of masking as measured by α agrees with our previous analysis of masking in musical instrument tones (Section 3.1). The tenor voice, with two well-defined formants, experiences the largest amount of masking and has α > 10%. A certain amount of masking also occurs in the Eb4 clarinet tone, giving an α value on the order of 5%. No obvious masking is observed in either the G5 clarinet or tuba tones, which have α below 1 or 2%.

Surprisingly, the listening test results reveal that the value of β does not directly relate to the improvement of perceptual wavetable matching over ordinary wavetable matching. The results show that perceptual wavetable matching outperforms ordinary wavetable matching in general, especially for instrument tones that have several masked partials. For the tenor voice, which has the largest amount of masking, both discrimination factors d and dp are smaller than 0.5, which means that the synthetic tones are just too good for comparison. The Eb4 clarinet tone has d = 0.68 > 0.5 = dp, indicating that the tone synthesized by ordinary wavetable matching is more easily distinguished than that synthesized by perceptual wavetable matching. On the other hand, only slight improvement is observed for the synthetic G5 clarinet tones, which have few masked harmonics. With the least amount of masking, the tuba happens to have a better tone synthesized by ordinary wavetable matching.

3.3 Related Results

So perceptual wavetable matching shows an improvement, though a relatively small one. It is suspected that the squared terms in the error formulae already reduce the significance of weak partials that are likely masked. This suggests trying the following perceptual relative absolute error:

$$ \varepsilon_p = \frac{1}{NFRM} \sum_{n=1}^{NFRM} \frac{\sum_{k=1}^{NHAR} \delta_{k,n}}{\sum_{k=1}^{NHAR} b_{k,n}} \qquad (9) $$

where

$$ \delta_{k,n} = \begin{cases} \left| b_{k,n} - b^*_{k,n} \right| & \text{if } m_{k,n} \text{ is not set} \\ 0 & \text{otherwise} \end{cases} $$

Using the above formula, the Eb4 and G5 clarinet tones were matched. The results show more or less the same trend as the previous ones. In particular, the values of α are on about the same order, indicating that masked partials count equally in the original perceptual relative error and the perceptual relative absolute error.

4 Conclusion

This paper has described a perceptual relative error formula which takes into account the effects of masked partials. The perceptual relative error does not include any error terms introduced by the masked partials because they are not perceived anyway. To the best of our current knowledge of psychoacoustics, a complete perceptual representation of a sound is not obvious, if not impossible. This paper presents a small effort to predict, and apply in sound synthesis, the perceptual similarity between an original acoustic musical instrument tone and its synthetic counterpart.

A comparative study on masking in different instruments has been carried out. We conclude from the analysis results that (i) higher harmonics are more easily masked than lower harmonics; (ii) weak harmonics around spectral peaks are often masked; and (iii) the masking effect is mainly observed in low notes.

Selected instrument tones have been matched and resynthesized using both the ordinary and perceptual error formulae, and a listening test has evaluated the quality of the synthetic tones. The results show that perceptual wavetable matching slightly outperforms ordinary wavetable matching, especially for instrument tones that have several masked partials.

The improvement is relatively small; for example, adding an extra wavetable always improves the results more than perceptual matching. However, perceptual wavetable matching requires extra computation only during the parameter matching stage to mark the masked partials and calculate the perceptual relative error, while resynthesis is as efficient as with the ordinary approach. Therefore, perceptual parameter matching is worth the small extra effort it takes.

5 Acknowledgements

This work was supported in part by the Hong Kong Research Grant Council’s projects HKUST6136/98E and HKUST6087/99E. We used a PC version of James Beauchamp’s excellent sound analysis and spectral display software Sndan, and a variation on his listening test program SameDiff in our work. Thanks to the anonymous reviewers for their excellent comments.

References

1. Horner, A., Beauchamp, J. and Haken, L. (1993). “Methods for multiple wavetable synthesis of musical instrument tones.” J. Audio Eng. Soc. 41(5), 336-356.
2. Wun, C. W. and Horner, A. (2001). “Perceptual wavetable matching for synthesis of musical instrument tones.” J. Audio Eng. Soc. 49(4), 250-262.
3. Press, W. H. (1989). “Numerical recipes: the art of scientific computing.” Cambridge: Cambridge University Press.
4. Fletcher, H. (1940). “Auditory patterns.” Rev. Mod. Phys. 12, 47-65.
5. Zwicker, E. and Fastl, H. (1990). “Psychoacoustics: Facts and Models,” 133-155. Berlin: Springer-Verlag.
6. Patterson, R. D. (1976). “Auditory filter shapes derived with noise stimuli.” J. Acoust. Soc. Am. 59, 640-654.
7. Patterson, R. D. and Moore, B. C. J. (1986). “Auditory filters and excitation patterns as representations of frequency resolution.” In B. C. J. Moore (Ed.), Frequency Selectivity in Hearing (123-178). London: Academic Press.
8. Moore, B. C. J. and Glasberg, B. R. (1987). “Formulae describing frequency selectivity as a function of frequency and level, and their use in calculating excitation patterns.” Hearing Res. 28, 209-225.
9. Moore, B. C. J. and Glasberg, B. R. (1983). “Suggested formulae for calculating auditory filter bandwidths and excitation patterns.” J. Acoust. Soc. Am. 73, 1249-1259.
10. Wegel, R. C. and Lane, C. F. (1924). “The auditory masking of one sound by another and its probable relation to the dynamics of the inner ear.” Phys. Rev. 23, 266-285.
11. Egan, J. P. and Hake, H. W. (1950). “On the masking pattern of a simple auditory stimulus.” J. Acoust. Soc. Am. 22, 622-630.
12. Greenwood, D. D. (1961). “Auditory masking and the critical band.” J. Acoust. Soc. Am. 33, 484-501.
NTAB  Ordinary wavetable matching                 Perceptual matching
      Relative    Perceptual        Amount of     Perceptual        Improvement on
      error (e)   relative error    masking (α)   relative error    ordinary
                  (ep)                            (e’p)             match (β)
1     0.228809    0.225868          1.29%         0.225713          0.07%
2     0.086673    0.082546          4.76%         0.082546          0.00%
3     0.069248    0.065275          5.74%         0.061091          6.41%
4     0.051063    0.049152          3.74%         0.049152          0.00%
5     0.039142    0.036078          7.83%         0.035747          0.92%
Table-1 Matching results of an Eb4 clarinet tone.
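The α and β columns can be recomputed from the three error columns. The short Python sketch below does this for Table-1, using Eq. 7 for α and taking β as the relative reduction in perceptual error, (ep − e’p) / ep, which is the form that reproduces the positive percentages tabulated here. The function and variable names are illustrative; only the numbers come from the table.

```python
def masking_factor(e, ep):
    """Eq. 7: share of the relative error that comes from masked partials."""
    return (e - ep) / e

def improvement(ep, ep_perceptual):
    """Relative reduction in perceptual error, (ep - e'p) / ep."""
    return (ep - ep_perceptual) / ep

# (NTAB, e, ep, e'p) copied from Table-1 (Eb4 clarinet).
table1 = [
    (1, 0.228809, 0.225868, 0.225713),
    (2, 0.086673, 0.082546, 0.082546),
    (3, 0.069248, 0.065275, 0.061091),
    (4, 0.051063, 0.049152, 0.049152),
    (5, 0.039142, 0.036078, 0.035747),
]

alphas = [100 * masking_factor(e, ep) for _, e, ep, _ in table1]
betas = [100 * improvement(ep, epp) for _, _, ep, epp in table1]

for row, a, b in zip(table1, alphas, betas):
    print(f"NTAB={row[0]}: alpha={a:.2f}%  beta={b:.2f}%")
```

Rounded to two decimals, the printed values agree with the α and β columns of Table-1.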

NTAB  Ordinary wavetable matching                 Perceptual matching
      Relative    Perceptual        Amount of     Perceptual        Improvement on
      error (e)   relative error    masking (α)   relative error    ordinary
                  (ep)                            (e’p)             match (β)
1     0.253116    0.251948          0.46%         0.251948          0.00%
2     0.165065    0.164025          0.63%         0.163872          0.09%
3     0.076026    0.075172          1.12%         0.074898          0.36%
4     0.050496    0.049533          1.91%         0.049106          0.86%
5     0.030991    0.030792          0.64%         0.029371          4.61%
Table-2 Matching results of a G5 clarinet tone.

NTAB  Ordinary wavetable matching                 Perceptual matching
      Relative    Perceptual        Amount of     Perceptual        Improvement on
      error (e)   relative error    masking (α)   relative error    ordinary
                  (ep)                            (e’p)             match (β)
1     0.136312    0.136219          0.07%         0.136081          0.10%
2     0.082707    0.082637          0.08%         0.080805          2.22%
3     0.054648    0.054556          0.17%         0.054072          0.89%
4     0.03455     0.03444           0.32%         0.034222          0.63%
5     0.025134    0.024948          0.74%         0.02157           13.54%
Table-3 Matching results of a tuba tone.

NTAB  Ordinary wavetable matching                 Perceptual matching
      Relative    Perceptual        Amount of     Perceptual        Improvement on
      error (e)   relative error    masking (α)   relative error    ordinary
                  (ep)                            (e’p)             match (β)
1     0.211353    0.191788          9.26%         0.19074           0.55%
2     0.057635    0.052005          9.77%         0.052005          0.00%
3     0.043594    0.03778           13.34%        0.03773           0.13%
4     0.033756    0.030136          10.72%        0.028821          4.36%
5     0.025163    0.021721          13.68%        0.020634          5.00%
Table-4 Matching results of a tenor voice.

Instrument       Discrimination factor
tones            Ordinary wavetable    Perceptual wavetable
                 matching (d)          matching (dp)
Clarinet (Eb4)   0.68                  0.5
Clarinet (G5)    0.62                  0.56
Tuba             0.6                   0.62
Tenor Voice      0.48                  0.48
Table-5 Results of the listening test in which all of the synthetic tones are synthesized with five wavetables.
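For reference, the discrimination factor reported in Table-5 (Eq. 6) is simple to compute once the two identification rates are known. The sketch below is a minimal Python illustration; the function name and the example rates are made up, and the rates are given as fractions in [0, 1] rather than as percentages.

```python
def discrimination_factor(correct_synth, false_synth):
    """Eq. 6, with both rates given as fractions in [0, 1].

    correct_synth: fraction of synthetic samples identified as synthetic.
    false_synth:   fraction of acoustic samples misidentified as synthetic.
    """
    return (correct_synth - false_synth + 1.0) / 2.0

# Listener calls every sample "acoustic": nothing is ever flagged as synthetic.
d_all_acoustic = discrimination_factor(0.0, 0.0)

# Listener guesses at random: both rates are about one half.
d_guessing = discrimination_factor(0.5, 0.5)

# Listener always tells the two apart.
d_perfect = discrimination_factor(1.0, 0.0)
```

The first two cases both give d = 0.5, which is why values near 0.5 in Table-5 indicate tones that listeners could not reliably tell from the originals.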
Fig-1 (a) Spectrum of an Eb3 clarinet tone. (b) Spectral snapshot at overall peak r.m.s. amplitude point. An asterisk is placed under each masked harmonic.

Fig-2 (a) Spectrum of an Eb4 clarinet tone. (b) Spectral snapshot at overall peak r.m.s. amplitude point.

Fig-3 (a) Spectrum of a G5 clarinet tone. (b) Spectral snapshot at overall peak r.m.s. amplitude point.

Fig-4 Spectral snapshot of the Eb3 clarinet tone with frequency expressed in Bark (see also Fig-1).

Fig-5 (a) Spectrum of a tenor voice. (b) Spectral snapshot at overall peak r.m.s. amplitude point.

Fig-6 (a) A spectral snapshot from an original Eb4 clarinet tone’s sustain. (b) Spectral snapshot of a resynthesized Eb4 clarinet tone (using 5 wavetables) taken at the same time.
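Fig-4 plots the Eb3 clarinet harmonics on a Bark axis. As a rough numerical illustration of why the low harmonics sit farther apart on that axis, the Python sketch below converts the first 30 harmonic frequencies to critical-band rate. It relies on two assumptions that are not taken from the paper: the Zwicker-Terhardt closed-form approximation for the Hz-to-Bark mapping (the paper's own analysis is based on auditory-filter models), and an equal-tempered Eb3 of about 155.56 Hz with A4 = 440 Hz.

```python
import math

def hz_to_bark(f):
    """Zwicker-Terhardt approximation of critical-band rate in Bark
    (an assumed mapping, used here only for illustration)."""
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

F0_EB3 = 155.56  # Hz; assumed equal-tempered Eb3 with A4 = 440 Hz

# Critical-band rate of harmonics 1..30 and the gap between neighbors.
bark = [hz_to_bark(k * F0_EB3) for k in range(1, 31)]
spacing = [hi - lo for lo, hi in zip(bark, bark[1:])]

print(f"harmonics 1-2:   {spacing[0]:.2f} Bark apart")
print(f"harmonics 29-30: {spacing[-1]:.2f} Bark apart")
```

Under these assumptions the first two harmonics are more than one Bark apart, while the highest pairs are separated by a small fraction of a Bark, in line with the observation that dense upper harmonics mask one another.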
