A Practical Handbook of Speech Coders
A grasp of both the theory of speech production and the theory of human
audition is essential to understanding the fundamentals of speech coding.
Because speech is generated by a human vocal tract, it admits a more
compact representation for analysis/synthesis than a generic acoustic
signal does. Because decoded speech is synthesized for the human ear,
the representation can be reduced further by discarding signal
information that cannot be perceived. Various components of the signal
interact and interfere to determine the “perceived sound.” These facts
have been applied to high-fidelity coding of audio [82, 83, 84, 85] for
the consumer electronics market, for Internet audio compression, and
within standards such as MPEG-2 and MPEG-4 (see Appendix A). The same
processing can be incorporated into speech coders to further reduce the
information needed to regenerate high-quality speech.
This chapter begins with a description of how the ear performs
frequency analysis and continues with the concept of critical bands. The
minimum detectable sound level is presented for quiet and noisy acoustic
environments. This leads into masking, in both frequency and time. The
information in this chapter provides the basis for perceptual speech
coding. Chapter 12 describes how masking can be used to improve speech
coding efficiency.
Masking can occur between signals that are separated in time and
nonoverlapping. A loud sound followed closely in time by a weaker one
can elevate the threshold of detectability of the weaker one and render
it undetectable. Surprisingly, the masking effect also works when the
weaker sound is presented before the stronger sound, but to a much lesser
extent. The fact that the masker can occur later or earlier than the
maskee gives rise to the terminology forward and backward temporal
masking [34]. A great deal of experimentation has also been done to
characterize the temporal qualities of masking [176, 72, 31, 28, 134, 34,
81, 155, 140, 89, 141, 139].
Figure 6.5 illustrates both forward and backward masking. If a primary
signal occurs at time t0 and a secondary signal of the same frequency
occurs at time t0 + ∆, then the secondary signal cannot be heard if its
level is lower than the primary level minus the threshold indicated by
the curve at that ∆. For example, if a 1200 Hz, 80 dB sound pressure
level (SPL) primary tone is present at time t0 = 0 and a 1200 Hz, 30
dB SPL secondary tone is present at time t = 30 ms, then the secondary
tone is completely masked because 30 dB < (80 dB - 38 dB) = 42 dB, where
38 dB is the curve value at ∆ = 30 ms. Similar calculations can be
performed using the curve for backward masking, except that the
secondary tone occurs at a time ∆ before the primary tone.
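The masking check in the example above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the function
name and its parameters are hypothetical, and the curve value (38 dB at a
30 ms delay in the example) is passed in by the caller rather than computed,
because the shape of the curve in Figure 6.5 is not modeled here.

```python
def is_temporally_masked(primary_db, secondary_db, curve_offset_db):
    """Return True if the secondary tone lies below the masked threshold.

    primary_db      -- level of the primary (masking) tone, in dB SPL
    secondary_db    -- level of the secondary (masked) tone, in dB SPL
    curve_offset_db -- value read from the masking curve at the delay
                       between the two tones (assumed to be supplied by
                       the caller; the curve itself is not modeled)
    """
    # Threshold below which the secondary tone cannot be heard.
    masked_threshold_db = primary_db - curve_offset_db
    return secondary_db < masked_threshold_db


# Worked example from the text: an 80 dB SPL primary tone, a 30 dB SPL
# secondary tone, and 38 dB read from the forward-masking curve at 30 ms.
print(is_temporally_masked(80.0, 30.0, 38.0))  # True, since 30 < 42
```

The same function would apply to backward masking, provided the offset is
read from the backward-masking curve instead.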
For maximum bit savings, simultaneous (frequency) masking and temporal
masking are considered together. Chapter 12 describes the
application of masking to perceptual coding.