
Goldberg, R. G.

"Auditory Information Processing"


A Practical Handbook of Speech Coders
Ed. Randy Goldberg
Boca Raton: CRC Press LLC, 2000

© 2000 CRC Press LLC


Chapter 6
Auditory Information Processing

A grasp of both the theory of speech production and the theory of human
audition is essential to understand the fundamentals of speech coding.
The fact that speech is generated through a human vocal tract allows a
more compact signal representation for analysis/synthesis as opposed to
a generic acoustic signal. Because decoded speech is synthesized for the
human ear, further reductions in the signal representation are possible by
disregarding signal information that cannot be perceived. Various com-
ponents of the signal interact and interfere to determine the “perceived
sound.” These facts have been applied to high-fidelity coding of audio
[82, 83, 84, 85] for the consumer electronic market, for Internet audio
compression, and within standards such as MPEG-2 and MPEG-4 (see
Appendix A). This processing can be incorporated into speech coders to
further reduce the information needed to regenerate high-quality speech.
This chapter begins with a description of how the ear performs fre-
quency analysis and continues with the concept of critical bands. The
minimum detectable sound level is presented for quiet and noisy acoustic
environments. This leads into masking, in both frequency and time. The
information in this chapter provides the basis for perceptual speech cod-
ing. Chapter 12 describes how masking can be used to improve speech
coding efficiency.

6.1 The Basilar Membrane: A Spectrum Analyzer


The basilar membrane is a key component of the inner ear. Oversim-
plified, sound vibrations cause movement of the basilar membrane by



transduction through the middle ear. Movement of the basilar mem-
brane stimulates hair cells, which in turn produce impulses in the audi-
tory nerve fibers.
Ohm and Von Helmholtz [166] were the first to present the notion that
the basilar membrane acts as a spectrum analyzer. Von Békésy
expounded upon this theory and demonstrated that the basilar mem-
brane vibrates locally, and the point of vibration is related monotoni-
cally to the frequency of the acoustic stimulus [164, 165]. Von Békésy
showed that the basilar membrane performs this spectral analysis not as
an array of tuned resonators, but as a nonuniform (almost logarithmically
scaled) transmission line with limited but distinct spectral resolution. Further
experimentation showed that this limited spectral resolution was char-
acterized by critical bands [10].

6.2 Critical Bands


In loose terms, a critical band can be thought of as a frequency span,
or frequency “bin,” into which sounds are lumped perceptually. Although
critical bands can be defined experimentally, the following definitions
are most useful for the purposes of this discussion:
“The threshold [of audibility] [see Section 6.3] of a narrow band of
noise lying between two masking tones remains constant as the frequency
separation between the tones increases until the critical band is reached;
then the threshold of audibility of the noise drops precipitously.” [141]
“The loudness of a band of noise at a constant sound pressure remains
constant as the bandwidth increases up to the critical band; then the
loudness begins to increase.” [141]
It is adequate to say that the critical band is a frequency range, de-
fined by its band edges (specific frequencies), outside of which subjective
responses change abruptly.
The fact that a critical band can be saturated is important to speech
coding. In this case, saturated refers to the critical band being “filled”
with sound, in that additional lower level sounds added to that frequency
range cannot be perceived. The fact that certain acoustic stimuli cannot
be sensed by the human ear is crucial because these stimuli need not be
preserved for accurate coding. This savings allows coding resources to
be allocated to frequency ranges where the sound will be perceived.
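The perceptual "lumping" of spectral components into critical bands can be sketched directly from the band edges in Table 6.1. The edges and band numbers below come from the table; the function names and the per-band power summation are illustrative assumptions, not a prescribed coder design:

```python
# Band edges in Hz from Table 6.1: 23 edges delimit the 22 critical bands.
BAND_EDGES_HZ = [20, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700,
                 4400, 5300, 6400, 7200, 9500]

def band_index(freq_hz):
    """Return the 1-based critical band (Bark) number for a frequency,
    or None if it lies outside the tabulated 20-9500 Hz range."""
    for i in range(len(BAND_EDGES_HZ) - 1):
        if BAND_EDGES_HZ[i] <= freq_hz < BAND_EDGES_HZ[i + 1]:
            return i + 1
    return None

def band_powers(components):
    """components: list of (freq_hz, power) pairs.
    Lump components into critical bands, summing power per band."""
    totals = {}
    for f, p in components:
        b = band_index(f)
        if b is not None:
            totals[b] = totals.get(b, 0.0) + p
    return totals
```

For instance, tones at 950 Hz and 1000 Hz both fall in band 9 (920-1080 Hz) and are lumped together, which is the sense in which a saturated band cannot accept additional perceivable low-level sound.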



FIGURE 6.1
Frequency width of critical bands as a function of the band
center frequency.



Critical Band No. (Barks)   Frequency (Hz)      Mels
          1                    20-100          0-150
          2                   100-200        150-300
          3                   200-300        300-400
          4                   300-400        400-500
          5                   400-510        500-600
          6                   510-630        600-700
          7                   630-770        700-800
          8                   770-920        800-950
          9                   920-1080       950-1050
         10                  1080-1270      1050-1150
         11                  1270-1480      1150-1300
         12                  1480-1720      1300-1400
         13                  1720-2000      1400-1550
         14                  2000-2320      1550-1700
         15                  2320-2700      1700-1850
         16                  2700-3150      1850-2000
         17                  3150-3700      2000-2150
         18                  3700-4400      2150-2300
         19                  4400-5300      2300-2500
         20                  5300-6400      2500-2700
         21                  6400-7200      2700-2850
         22                  7200-9500      2850-3050

Table 6.1 The relationship between the frequency units: Barks,
Hertz, and Mels.

Extensive experimental research has been performed to quantify critical
bandwidth as a function of the frequency at the center of the band.
Figure 6.1 [10] shows the results of these experiments for single ear lis-
tening. As can be seen from this figure, at center frequencies greater
than 500 Hz, critical bandwidth increases approximately linearly as cen-
ter frequency increases logarithmically.
Figure 6.1 is the basis for the Bark domain and the Mel domain. Both
the Bark and the Mel domains were created to have a constant number
of each unit (Barks or Mels) in each critical band. The Bark domain
was normalized to have 1 Bark per critical band. Barks and Mels are
perceptually based frequency units that increase, almost logarithmically,
with frequency.
Table 6.1 illustrates the relationship between the frequency units of
Barks, Mels, and Hertz. The table shows that each critical band contains
a logarithmically increasing frequency bandwidth in the linear scale of
Hertz. Approximately 150-200 Mels span each critical band. By definition,
there is 1 Bark per critical band.

FIGURE 6.2
Comparing the experimentally derived frequency scales of
Barks versus Mels.
Figure 6.2 shows a graph of Barks versus Mels. Although the relationship
is nearly linear, it is not exactly so, because all of the information
known about critical bands comes from experimental tests, which are far
from exact. The fact that the two units are so close to linearly related,
even though they are derived from separate experimental tests, supports
the validity of these frequency scalings.
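Both scales are commonly approximated by closed-form expressions rather than table lookup. The formulas below are the standard analytic fits found in the psychoacoustics literature, not derivations from this chapter; they are offered only as a convenient sketch of the Hz-to-Bark and Hz-to-Mel mappings:

```python
import math

def hz_to_bark(f):
    # Zwicker's analytic approximation to the Bark scale.
    return 13.0 * math.atan(0.00076 * f) + 3.5 * math.atan((f / 7500.0) ** 2)

def hz_to_mel(f):
    # Widely used logarithmic fit to the Mel scale.
    return 2595.0 * math.log10(1.0 + f / 700.0)
```

As a sanity check against Table 6.1, 1000 Hz maps to roughly 8.5 Bark, consistent with band 9 spanning 920-1080 Hz, and both functions grow almost logarithmically with frequency, as the text describes.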

6.3 Thresholds of Audibility and Detectability


The threshold of audibility for a specified acoustic signal is the min-
imum effective sound pressure that is capable of evoking an auditory
sensation in the absence of noise in a specified fraction of the trials [10].
It is often expressed in decibels relative to 0.0002 microbar, which is
considered the absolute threshold of audibility in terms of pressure.

FIGURE 6.3
Threshold of audibility for a pure tone in silence.


The American standard threshold of audibility for monaural hearing
of pure tones, for a listener with normal hearing seated in an anechoic
(echo-free) chamber wearing earphones, is shown on the curve of Figure
6.3. The sound pressure is measured at the entrance to the ear canal.
In other words, a person with “normal” hearing cannot hear tones below
the curve (softer) but can hear tones above the curve (louder). The
term normal hearing is used because some people have better than nor-
mal hearing (and can hear some tones below the curve) and some have
subnormal hearing (and conversely cannot hear some tones above the
curve).
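Curves like Figure 6.3 are often approximated analytically in psychoacoustic models so that the threshold can be evaluated at any frequency. The formula below is Terhardt's well-known fit to the threshold in quiet (in dB SPL); it is a standard approximation from the coding literature, not a formula given in this chapter:

```python
import math

def threshold_in_quiet_db(f_hz):
    """Terhardt's approximation to the absolute threshold of audibility
    (dB SPL) for a pure tone at frequency f_hz."""
    f = f_hz / 1000.0  # work in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)
```

The fit reproduces the qualitative shape of the standard curve: high thresholds at low frequencies, a minimum (slightly below 0 dB SPL) near 3-4 kHz, and a rise again at high frequencies.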
The threshold of detectability for a specified acoustic signal is the min-
imum effective sound pressure that is capable of evoking an auditory
sensation in a specific acoustic environment. Therefore, the threshold
of detectability is identical to the threshold of audibility when the specific
acoustic environment is silence, whereas the threshold of detectability
is highly elevated in the specific acoustic environment of a crowded,
noisy restaurant.



6.4 Monaural Masking
It is much more difficult to hear a specific sound in noisy surroundings
than to hear that same sound in a quiet environment. One needs to shout
to make oneself heard in a crowded restaurant, but in the silence of a
library, a gentle whisper can often disturb others. Psychophysicists have
learned a great deal about how the ear analyzes sounds by examining
the way certain sounds drown out, or mask, other sounds [29].
One of the most valuable, and exploitable, properties of hearing is
that of monaural masking [46]. Masking is defined as “the process by
which the detectability of one sound (the maskee) is impaired by the
presence of another sound (the masker).” [28]

6.4.1 Simultaneous Masking in Frequency

Simultaneous masking is masking in which both sounds (the maskee and
the masker) occur at the same instant in time. In-depth studies have
been done on simultaneous masking of one pure tone by another [46,
43, 81, 139, 45, 140, 177, 89, 71]. If a tone is sounded in the presence
of a strong tone close in frequency (particularly if it is in the same
critical band, but this is not essential), its threshold of detectability is
substantially elevated, as shown in Figure 6.4 [81].
The figure shows the tones and levels that a normal listener can hear
in the presence of a 1200 Hz, 80 dB primary tone. All weaker signals
below the curve cannot be heard by a normal listener. Notice that the
masking effect is much more prevalent when the secondary tone is at a
frequency greater than the primary tone. Also note that the masking
effect is strongest when the secondary tone is very close in frequency to
the primary tone.
Similar results are observed when one or both of the sounds are bands
of noise [155]. Therefore, in a complex spectrum of sound, some weak
components in the presence of stronger ones are not detectable at all.
Spectral analysis and examination of simultaneous masking in frequency,
carried out moment by moment, forms the basis of current algorithms
for efficient coding of wideband audio and can be utilized for efficient
coding of speech.
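The moment-by-moment pruning described above can be sketched in a few lines. The flat masking margin and the function names below are illustrative assumptions only; real coders replace the fixed margin with frequency-dependent spreading functions like the curve in Figure 6.4:

```python
def prune_masked(components, bark_of, margin_db=20.0):
    """Drop spectral components simultaneously masked by a stronger
    component in the same critical band.

    components: list of (freq_hz, level_db) pairs.
    bark_of:    function mapping Hz to a critical band number.
    margin_db:  hypothetical flat masking margin; a real coder would
                use a frequency-dependent spreading function instead.
    """
    kept = []
    for f, lvl in components:
        # A component never masks itself: its level difference is 0 dB.
        masked = any(bark_of(f2) == bark_of(f) and lvl2 - lvl > margin_db
                     for f2, lvl2 in components)
        if not masked:
            kept.append((f, lvl))
    return kept
```

With a strong 80 dB tone at 1200 Hz, a 40 dB tone at 1210 Hz in the same band falls below the margin and is dropped, while a 60 dB tone in a distant band survives; only the surviving components would need to be coded.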



FIGURE 6.4
Simultaneous masking in frequency of one tone on another tone
(data adapted from [81]).



6.4.2 Temporal Masking

Masking can occur between signals that are separated in time and
nonoverlapping. A loud sound followed closely in time by a weaker one
can elevate the threshold of detectability of the weaker one and render
it undetectable. Surprisingly, the masking effect also occurs when the weaker
sound is presented prior to the stronger sound, but to a much lesser
extent. The fact that the masker can occur later or earlier than the
maskee gives rise to the terminology forward and backward temporal
masking [34]. A great deal of experimentation has also been done to
characterize the temporal qualities of masking [176, 72, 31, 28, 134, 34,
81, 155, 140, 89, 141, 139].
Figure 6.5 illustrates both forward and backward masking. If a pri-
mary signal occurs at time t0 and a secondary signal of the same fre-
quency occurs at time t0 + ∆, then the secondary signal cannot be heard
if the amplitude difference of the two tones is less than the threshold in-
dicated in the curve. For example, if a 1200 Hz, 80 dB sound pressure
level (SPL) primary tone is present at time t0 = 0 and a 1200 Hz, 30
dB SPL secondary tone is present at time t = 30 ms, then the secondary
tone is completely masked because 30 dB < (80 dB - 38 dB). Similar
calculations can be performed using the curve for backward masking,
but the secondary tone occurs at a time ∆ before the primary tone.
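The worked example above reduces to a single inequality, which can be checked mechanically. The function name is illustrative, and the threshold-drop value is the one read off the temporal-masking curve in the text (38 dB at a 30 ms delay); it is not a general constant:

```python
def forward_masked(masker_db, maskee_db, threshold_drop_db):
    """Chapter's forward-masking condition: the later, weaker tone at
    the same frequency is inaudible if its level lies below the masker
    level minus the threshold drop read from the curve of Figure 6.5."""
    return maskee_db < masker_db - threshold_drop_db

# Worked example from the text: an 80 dB SPL masker at t0, a 30 dB SPL
# tone 30 ms later, and a 38 dB threshold drop at that delay.
# forward_masked(80, 30, 38) -> True (30 dB < 42 dB, tone is masked)
```

The same inequality applies to backward masking, with the threshold drop read from the backward branch of the curve for a secondary tone occurring a time delta before the primary tone.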
For maximum coding bit savings, both simultaneous frequency and
temporal masking are considered together. Chapter 12 describes the
application of masking to perceptual coding.



FIGURE 6.5
Illustration of the effect of temporal masking.
