MPEG Digital Audio Coding

The document discusses the development of the MPEG audio standards MPEG-1 and MPEG-2, which provide high-quality digital audio compression techniques for various applications. It highlights the importance of bit-rate reduction, the need for low bit-rate audio coding, and the role of perceptual coding technologies in achieving efficient audio compression. Additionally, it outlines the key features and advances in audio coding, including multichannel audio and the upcoming MPEG-4 standard.

The Moving Pictures Expert Group (MPEG) within the International Organization for Standardization (ISO) has developed a series of audio-visual standards known as MPEG-1 and MPEG-2. These audio-coding standards are the first international standards in the field of high-quality digital audio compression. MPEG-1 covers coding of stereophonic audio signals at high sampling rates aiming at transparent quality, whereas MPEG-2 also offers stereophonic audio coding at lower sampling rates. In addition, MPEG-2 introduces multichannel coding with and without backwards compatibility to MPEG-1 to provide an improved acoustical image for audio-only applications and for enhanced television and video-conferencing systems. MPEG-2 audio coding without backwards compatibility, called MPEG-2 Advanced Audio Coding (AAC), offers the highest compression rates.

Typical application areas for MPEG-based digital audio are in the fields of audio production, program distribution and exchange, digital sound broadcasting, digital storage, and various multimedia applications. In this article we describe in some detail the key technologies and main features of MPEG-1 and MPEG-2 audio coders. We also present a short section on the upcoming MPEG-4 standard, and we discuss some of the typical applications for MPEG audio compression.

Dealing with Bit Rates

PCM Bit Rates

Typical audio signal classes are telephone speech, wideband speech, and wideband audio, all of which differ in bandwidth, dynamic range, and in listener expectation of offered quality. The quality of telephone-bandwidth speech is acceptable for telephony and for some videotelephony services. Higher bandwidths (7 kHz for wideband speech) may be necessary to improve the

SEPTEMBER 1997 IEEE SIGNAL PROCESSING MAGAZINE 59


1053-5888/97/$10.00 © 1997 IEEE
intelligibility and naturalness of speech. Wideband (high-fidelity) audio representation including multichannel audio needs a bandwidth of at least 20 kHz. The conventional digital format for these signals is pulse code modulation (PCM), with typical sampling rates and amplitude resolutions (PCM bits per sample) as given in Table 1.

The compact disc (CD) is today's de facto standard of digital audio representation. With its 44.1 kHz sampling rate, the resulting stereo net bit rate on a CD is 2 × 44.1 × 16 × 1000 ≈ 1.41 Mb/s (see Table 2). However, the CD needs a significant overhead for a runlength-limited line code, which maps 8 information bits into 14 bits, for synchronization and for error correction, resulting in a 49-bit representation of each 16-bit audio sample. Hence, the total stereo bit rate is 1.41 × 49/16 = 4.32 Mb/s. Table 2 compares bit rates of the CD and the digital audio tape (DAT).

For archiving and processing of audio signals, sampling rates twice as large as those mentioned and amplitude resolutions of up to 24 b/sample are being discussed. Furthermore, lossless coding is an important topic in order not to compromise audio quality in any way [1]. The digital versatile (or video) disk (DVD), with its capacity of 4.7 GB (single layer) or 8.5 GB (double layer), is the appropriate storage medium for such applications.
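The CD figures above reduce to a few lines of arithmetic (a sketch; the 8-to-14 line code, synchronization, and error correction are lumped into the 49-channel-bits-per-sample figure quoted above):

```python
# Net PCM bit rate of a CD: 2 channels x 44.1 kHz x 16 b/sample
net_rate = 2 * 44_100 * 16        # 1_411_200 b/s, i.e., about 1.41 Mb/s

# Channel coding expands each 16-bit audio sample to 49 channel bits
gross_rate = net_rate * 49 // 16  # 4_321_800 b/s, i.e., about 4.32 Mb/s
```

The ratio 49/16 ≈ 3.06 shows that more than two thirds of the bits on a CD carry overhead rather than audio.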
Bit Rate Reduction

Although high bit-rate channels and networks have become more easily accessible, low bit-rate coding of audio signals has retained its importance. The main motivations for low bit-rate coding are the need to minimize transmission costs or to provide cost-efficient storage, the demand to transmit over channels of limited capacity such as mobile radio channels, and the need to support variable-rate coding in packet-oriented networks.

Basic requirements in the design of low bit-rate audio coders are, firstly, to retain a high quality of the reconstructed signal with robustness to variations in spectra and levels. In the case of stereophonic and multichannel signals, spatial integrity is an additional dimension of quality. Secondly, robustness against random and bursty channel bit errors and packet losses is required. Thirdly, low complexity and power consumption of the codecs are of high relevance. For example, in broadcast and playback applications, the complexity and power consumption of the audio decoders used must be low, whereas constraints on encoder complexity are more relaxed. Additional network-related requirements are low encoder/decoder delays, robustness against errors introduced by cascading codecs, and a graceful degradation of quality with increasing bit error rates in mobile radio and broadcast applications. Finally, in professional applications, the coded bit streams must allow editing, fading, mixing, and dynamic range compression.

We have seen rapid progress in bit-rate compression techniques for speech and audio signals [2-7]. Linear prediction, subband coding, transform coding, as well as various forms of vector quantization and entropy coding techniques have been used to design efficient coding algorithms that can achieve substantially more compression than was thought possible only a few years ago. Recent results in speech and audio coding indicate that an excellent coding quality can be obtained with bit rates of 0.5 to 1 b/sample for speech and wideband speech, and 1 to 2 b/sample for audio. In storage and packet-oriented transmission systems, additional savings are possible by employing variable-rate coding with its potential to offer a time-independent, constant-quality performance.

Compressed digital audio representations can be made less sensitive to channel impairments than analog ones if source and channel coding are implemented appropriately. Bandwidth expansion has often been mentioned as



a disadvantage of digital coding and transmission, but with today's data compression and multilevel signaling techniques, channel bandwidths can be made smaller than those of analog systems. In broadcast systems, the reduced bandwidth requirements, together with the error robustness of the coding algorithms, will allow an efficient use of available radio and TV channels as well as "taboo" channels currently left vacant because of interference problems.

"In MPEG coding the encoder is not standardized, thus leaving room for improvements in the coding process."

MPEG Standardization Activities

Of particular importance for digital audio is the standardization work within the ISO/IEC that is intended to provide international standards for a wide range of communications-based and storage-based applications. This group is called MPEG, an acronym for Moving Pictures Experts Group. MPEG's initial effort was the MPEG Phase 1 (MPEG-1) coding standard IS 11172, which supports bit rates of around 1.2 Mb/s for video (with video quality comparable to that of today's analog video cassette recorders) and 256 kb/s for two-channel audio (with audio quality comparable to that of today's CDs) [8].

The more recent MPEG-2 standard IS 13818 provides, in its video part, standards for high-quality video (including high-definition TV (HDTV)) in bit-rate ranges from 3 to 15 Mb/s and above. In its audio part, multichannel audio coding with two to five full-bandwidth audio channels has been standardized. In addition, for stereophonic audio, the range of sampling rates was extended to lower sampling frequencies for bit rates at, or below, 64 kb/s [9]. Part IS 13818-7 of that standard will offer a collection of very flexible tools for advanced audio coding (MPEG-2 AAC) for applications where compatibility with MPEG-1 is not relevant.

Finally, the current MPEG-4 work addresses standardization of audiovisual coding for applications ranging from mobile-access, low-complexity multimedia terminals to high-quality multichannel sound systems. The standard will allow for interactivity and universal accessibility, and it will provide a high degree of flexibility and extensibility [10].

Key Technologies in Audio Coding

Proposals to reduce wideband audio-coding rates have followed those for speech coding. Differences between audio and speech signals are manifold, however: audio coding implies higher sampling rates, better amplitude resolution, higher dynamic range, larger variations in power density spectra, stereophonic and multichannel audio signal representations, and, finally, higher quality expectations. Indeed, the high quality of the CD with its 16-b/sample PCM format has made digital audio popular.

Speech and audio coding are similar in that, in both cases, quality is based on the properties of human auditory perception. On the other hand, speech can be coded very efficiently because a speech production model is available, whereas nothing similar exists for audio signals.

Modest reductions in audio bit rates have been obtained by instantaneous companding (e.g., a conversion of uniform 14-bit PCM into an 11-bit nonuniform PCM representation) or by forward-adaptive PCM (block companding) as employed in various forms of near-instantaneously companded audio multiplex (NICAM) coding (ITU-R Rec. 660). For example, the British Broadcasting Corporation (BBC) has used the NICAM 728 coding format for digital transmission of sound in several European broadcast television networks; it employs 32 kHz sampling with 14-bit initial quantization followed by a compression to a 10-bit format on the basis of 1 ms blocks, resulting in a total stereo bit rate of 728 kb/s [11]. Such adaptive PCM schemes can solve the problem of providing a sufficient dynamic range for audio coding, but they are not efficient compression schemes, since they do not exploit statistical dependencies between samples and do not sufficiently remove signal irrelevancies.

In recent audio-coding algorithms, four key technologies play an important role: perceptual coding, frequency-domain coding, window switching, and dynamic bit allocation. These technologies will be covered in the following sections.
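The block-companding idea behind the NICAM scheme mentioned above can be sketched in a few lines (a simplified illustration only: real NICAM 728 also protects the scale factor via parity bits and uses a specific frame format, none of which is shown):

```python
def compand_block(block, in_bits=14, out_bits=10):
    """Near-instantaneous companding: drop the LSBs a whole block can spare."""
    peak = max((abs(s) for s in block), default=0)
    needed = peak.bit_length() + 1               # magnitude bits + sign bit
    shift = max(0, min(in_bits - out_bits, needed - out_bits))
    return shift, [s >> shift for s in block]    # scale factor + coarse samples

def expand_block(shift, coded):
    """Receiver side: restore the original scale (the dropped LSBs are lost)."""
    return [c << shift for c in coded]
```

A loud block loses up to 4 LSBs (error below 2^shift), while a quiet block is passed through at full 14-bit resolution, which is exactly how block companding preserves dynamic range without coding every sample at 14 bits.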

Auditory Masking and Perceptual Coding

Auditory Masking

Fig. 1. Threshold in quiet and masking threshold (acoustical events in the gray areas will not be audible).

The inner ear performs short-term critical-band analyses, where frequency-to-place transformations occur along



the basilar membrane. The power spectra are not represented on a linear frequency scale but on limited frequency bands called critical bands. The auditory system can roughly be described as a bandpass filterbank, consisting of strongly overlapping bandpass filters with bandwidths in the order of 50 to 100 Hz for signals below 500 Hz and up to 5000 Hz for signals at high frequencies. Twenty-six critical bands covering frequencies of up to 24 kHz have to be taken into account.

Fig. 2. Masking threshold and signal-to-mask ratio (SMR) (acoustical events in the gray areas will not be audible).

Simultaneous masking is a frequency-domain phenomenon in which a low-level signal (the maskee) can be made inaudible (masked) by a simultaneously occurring stronger signal (the masker), as long as masker and maskee are close enough to each other in frequency [12]. Such masking is largest in the critical band in which the masker is located, and it is effective to a lesser degree in neighboring bands. A masking threshold can be measured, and low-level signals below this threshold will not be audible. This masked signal can consist of low-level signal contributions, of quantization noise, of aliasing distortion, or of transmission errors. The masking threshold, in the context of source coding also known as the threshold of just-noticeable distortion (JND) [13], varies with time. It depends on the sound pressure level (SPL), the frequency of the masker, and on characteristics of masker and maskee. Take the example of the masking threshold for the SPL = 60 dB narrowband masker in Fig. 1: around 1 kHz, the four maskees will be masked as long as their individual sound pressure levels are below the masking threshold. The slope of the masking threshold is steeper toward lower frequencies; i.e., higher frequencies are more easily masked. It should be noted that the distance between masker and masking threshold is smaller in noise-masking-tone experiments than in tone-masking-noise experiments, i.e., noise is a better masker than a tone. In MPEG coders, both thresholds play a role in computing the masking threshold. Without a masker, a signal is inaudible if its sound pres-

Fig. 3. Temporal masking (acoustic events in the gray areas will not be audible).



sure level is below the threshold in quiet, which depends on frequency and covers a dynamic range of more than 60 dB, as shown in the lower curve of Fig. 1.

The qualitative sketch of Fig. 2 gives a few more details about the masking threshold: within a critical band, tones below this threshold (darker area) are masked. The distance between the level of the masker and the masking threshold is called the signal-to-mask ratio (SMR). Its maximum value is at the left border of the critical band (point A in Fig. 2), and its minimum value occurs in the frequency range of the masker and is around 6 dB in noise-masking-tone experiments. Assuming an m-bit quantization of an audio signal, within a critical band the quantization noise will not be audible as long as its signal-to-noise ratio (SNR) is higher than its SMR. Noise and signal contributions outside the particular critical band will also be masked, although to a lesser degree, if their SPL is below the masking threshold.

Fig. 4. Block diagram of perception-based coders.

Defining SNR(m) as the SNR resulting from an m-bit quantization, the perceivable distortion in a given subband is measured by the noise-to-mask ratio (NMR):

NMR(m) = SMR − SNR(m)   (in dB).

NMR(m) describes the difference in dB between the SMR and the SNR to be expected from an m-bit quantization. The NMR value is also the difference (in dB) between the level of quantization noise and the level where a distortion may just become audible in a given subband. Within a critical band, coding noise will not be audible as long as NMR(m) is negative.
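As a small numerical sketch of this bookkeeping (the 6.02m + 1.76 dB term is the familiar rule of thumb for the SNR of an m-bit uniform quantizer, not a figure taken from the standard):

```python
def snr_db(m):
    # rule-of-thumb SNR of an m-bit uniform quantizer
    return 6.02 * m + 1.76

def nmr_db(smr_db, m):
    # NMR(m) = SMR - SNR(m); negative means the noise stays below the mask
    return smr_db - snr_db(m)

def bits_for_transparency(smr_db):
    # smallest m with NMR(m) <= 0, i.e., inaudible quantization noise
    m = 0
    while nmr_db(smr_db, m) > 0:
        m += 1
    return m
```

For a band with SMR = 20 dB, three bits still leave a slightly positive NMR, while four bits push the quantization noise below the masking threshold.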
We have just described masking by only one masker. If the source signal consists of many simultaneous maskers, each has its own masking threshold, and a global masking threshold can be computed that describes the threshold of just-noticeable distortions as a function of frequency (see also the "ISO/MPEG-1 Audio Coding" section).

In addition to simultaneous masking, the time-domain phenomenon of temporal masking plays an important role in human auditory perception. It may occur when two sounds appear within a small interval of time. Depending on the individual SPLs, the stronger sound may mask the weaker one, even if the maskee precedes the masker (Fig. 3).

Temporal masking can help to mask pre-echoes caused by the spreading of a sudden large quantization error over the actual coding block (see the "Window Switching" section). The duration within which premasking applies is significantly less than one tenth of that of the postmasking, which is in the order of 50 to 200 ms. Both pre- and postmasking are being exploited in MPEG audio-coding algorithms.

Fig. 5. Window switching: (a) source signal; (b) reconstructed signal with block size N = 1024; (c) reconstructed signal with block size N = 256. (Source: Iwadare et al. [25].)

Perceptual Coding

Digital coding at high bit rates is dominantly waveform-preserving; i.e., the amplitude-versus-time waveform of the decoded signal approximates that of the input signal. The difference signal between input and output waveforms is then the basic error criterion of coder design. Waveform coding principles have been covered in detail in [2]. At lower bit rates, facts about the production and perception of audio signals have to be included in coder design, and the error criterion has to favor an output signal that is useful to the human receiver rather than one that follows and preserves the input waveform. Basically, an efficient source-coding algorithm will (i) remove redundant components of the source signal by exploiting correlations between its samples and (ii) remove components that are perceptually irrelevant to the ear. Irrelevancy manifests itself as unnecessary amplitude or frequency resolution; portions



of the source signal that are masked do not need to be transmitted.

The dependence of human auditory perception on frequency, and the accompanying perceptual tolerance of errors, can (and should) directly influence encoder designs; noise-shaping techniques can emphasize coding noise in frequency bands where that noise is not important for perception. To this end, the noise shaping must be dynamically adapted to the actual short-term input spectrum in accordance with the SMR, which can be done in different ways. However, frequency weightings based on linear filtering, as is typical in speech coding, cannot make full use of results from psychoacoustics. Therefore, in wideband audio coding, noise-shaping parameters are dynamically controlled in a more efficient way to exploit simultaneous masking and temporal masking.

Figure 4 depicts the structure of a perception-based coder that exploits auditory masking. The encoding process is controlled by the SMR vs. frequency curve from which the needed amplitude resolution (and hence the bit allocation and rate) in each frequency band is derived. The SMR is typically determined from a high-resolution (say, 1024-point) FFT-based spectral analysis of the audio block to be coded. In general, any coding scheme may be used that can be dynamically controlled by such perceptual information. Frequency-domain coders (see next section) are of particular interest, since they offer a direct method for noise shaping. If the frequency resolution of these coders is high enough, the SMR can be derived directly from the subband samples or transform coefficients without running an FFT-based spectral analysis in parallel [14, 15].

If the necessary bit rate for a complete masking of distortion is available, the coding scheme will be perceptually transparent, i.e., the decoded signal is then subjectively indistinguishable from the source signal. In practical designs, we cannot go to the limits of just-noticeable distortion, since postprocessing of the acoustic signal by the end user and multiple encoding/decoding processes in transmission links have to be considered. Moreover, our current knowledge about auditory masking is very limited. Generalizations of masking results, derived for simple and stationary maskers and for limited bandwidths, may be appropriate for most source signals but may fail for others. Therefore, as an additional requirement, we need a sufficient safety margin in practical designs of such perception-based coders. It should be noted that the MPEG/Audio coding standard is open for better encoder-located psychoacoustic models, since such models are not normative elements of the standard (see the "ISO/MPEG-1 Audio Coding" section).

Fig. 6. Conventional adaptive transform coding (ATC).

Fig. 7. Hierarchy of Layers I, II, and III of MPEG-1/Audio.

Frequency-Domain Coding

As one example of dynamic noise shaping, quantization noise feedback can be used in predictive schemes [16, 17]. However, frequency-domain coders with dynamic allocations of bits (and hence of quantization noise contributions) to subbands or transform coefficients offer an easier and more accurate way to control the quantization noise [2, 14] (see also the "Dynamic Bit Allocation" section).

In all frequency-domain coders, redundancy (the nonflat short-term spectral characteristics of the source signal) and irrelevancy (signals below the psychoacoustical thresholds) are exploited to reduce the transmitted data rate with respect to PCM. This is achieved by splitting the source spectrum into frequency bands to generate nearly



Fig. 8. Structure of MPEG-1 audio encoder and decoder (Layers I and II).

uncorrelated spectral components and by quantizing these components separately. Two coding categories exist: transform coding (TC) and subband coding (SBC). The differentiation between these two categories is mainly due to historical reasons. Both use an analysis filterbank in the encoder to decompose the input signal into subsampled spectral components. The spectral components are called subband samples if the filterbank has low frequency resolution; otherwise they are called spectral lines or transform coefficients. These spectral components are recombined in the decoder via synthesis filterbanks.

In SBC, the source signal is fed into an analysis filterbank consisting of M bandpass filters that are contiguous in frequency, so that the set of subband signals can be recombined additively to produce the original signal or a close version thereof. Each filter output is critically decimated (i.e., sampled at twice the nominal bandwidth) by a factor equal to M, the number of bandpass filters. This decimation results in an aggregate number of subband samples that equals that in the source signal. In the receiver, the sampling rate of each subband is increased to that of the source signal by filling in the appropriate number of zero samples. Interpolated subband signals appear at the bandpass outputs of the synthesis filterbank. The sampling processes may introduce aliasing distortion due to the overlapping nature of the subbands. If perfect filters, such as two-band quadrature mirror filters or polyphase filters, are applied, aliasing terms will cancel, and the sum of the bandpass outputs equals the source signal in the absence of quantization [18-21]. With quantization, aliasing components will not cancel ideally; nevertheless, the errors will be inaudible in MPEG/Audio coding if a sufficient number of bits is used. However, these errors may reduce the original dynamic range of 20 bits to around 18 bits [15].
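A minimal two-band example, the Haar quadrature mirror filter pair in polyphase form (my choice for brevity; MPEG itself uses a 32-band polyphase filterbank), shows both properties discussed above: critical decimation keeps the aggregate sample count unchanged, and without quantization the synthesis bank reconstructs the input exactly:

```python
import math

SQRT2 = math.sqrt(2.0)

def qmf_analysis(x):
    """Split x into low and high bands, each decimated by 2 (critical decimation)."""
    low = [(x[2 * i] + x[2 * i + 1]) / SQRT2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / SQRT2 for i in range(len(x) // 2)]
    return low, high

def qmf_synthesis(low, high):
    """Recombine the two subbands; the aliasing introduced by decimation cancels."""
    x = []
    for lo, hi in zip(low, high):
        x.append((lo + hi) / SQRT2)
        x.append((lo - hi) / SQRT2)
    return x
```

Quantizing `low` and `high` before synthesis breaks the exact alias cancellation, which is precisely the residual-error effect described in the paragraph above.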
In TC, a block of input samples is linearly transformed via a discrete transform into a set of near-uncorrelated



transform coefficients. These coefficients are then quantized and transmitted in digital form to the decoder. In the decoder, an inverse transform maps the signal back into the time domain. In the absence of quantization errors, the synthesis yields exact reconstruction. Typical transforms are the discrete Fourier transform and the discrete cosine transform (DCT), calculated via an FFT, and modified versions thereof. We have already mentioned that the decoder-based inverse transform can be viewed as the synthesis filterbank; the impulse responses of its bandpass filters equal the basis sequences of the transform. The impulse responses of the analysis filterbank are just the time-reversed versions thereof. The finite lengths of these impulse responses may cause so-called block boundary effects. State-of-the-art transform coders employ a modified DCT (MDCT) filterbank, as proposed by Princen and Bradley [20]. The MDCT is typically based on a 50% overlap between successive analysis blocks. Without quantization, MDCTs are free from block boundary effects, have a higher transform coding gain than the DCT, and their basis sequences correspond to better bandpass responses. In the presence of quantization, block boundary effects are de-emphasized due to the doubling of the filter impulse responses resulting from the overlap.

Fig. 9. Block companding in MPEG-1 audio codecs.

Hybrid filterbanks, i.e., combinations of discrete transform and filterbank implementations, have frequently been used in speech and audio coding [22, 23]. One of their advantages is that different frequency resolutions can be provided at different frequencies in a flexible way and with low complexity. For example, a high spectral resolution can be obtained in an efficient way by using a cascade of a filterbank (with its short delays) and a linear MDCT transform that splits each subband sequence further in frequency content to achieve a high frequency resolution. MPEG audio coders use a subband approach in Layers I and II, and a hybrid filterbank in Layer III.
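A toy MDCT analysis/synthesis chain illustrates the 50% overlap and the alias cancellation (a sketch with a sine window, which satisfies the Princen-Bradley condition; the block lengths and window shapes in the actual coders differ):

```python
import math

def sine_window(N):
    # satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1
    return [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]

def mdct(frame):
    # 2N windowed input samples -> N coefficients
    N = len(frame) // 2
    n0 = 0.5 + N / 2
    return [sum(frame[n] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                for n in range(2 * N)) for k in range(N)]

def imdct(coeffs):
    # N coefficients -> 2N time-aliased samples; aliasing cancels on overlap-add
    N = len(coeffs)
    n0 = 0.5 + N / 2
    return [(2.0 / N) * sum(coeffs[k] * math.cos(math.pi / N * (n + n0) * (k + 0.5))
                            for k in range(N)) for n in range(2 * N)]

def mdct_chain(x, N):
    # analysis and synthesis with 50% overlap (hop N, blocks of 2N)
    w = sine_window(N)
    y = [0.0] * len(x)
    for start in range(0, len(x) - 2 * N + 1, N):
        block = [w[n] * x[start + n] for n in range(2 * N)]
        rec = imdct(mdct(block))
        for n in range(2 * N):
            y[start + n] += w[n] * rec[n]
    return y  # equals x wherever two blocks overlap
```

Each block produces only N coefficients from 2N input samples, so despite the overlap the transform remains critically sampled; the time aliasing in each inverse block cancels against that of its neighbor.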
Window Switching

A crucial issue in frequency-domain coding of audio signals is the appearance of pre-echoes, which are similar to copying effects on analog tapes. Consider the case in which a silent period is followed by a percussive sound, such as from castanets or triangles, within the same coding block. Such an onset ("attack") will cause comparably large instantaneous quantization errors. In TC, the inverse transform in the decoding process will distribute such errors over the block; similarly, in SBC, the decoder bandpass filters will spread such errors. In both mappings, pre-echoes can become distinctly audible, especially at low bit rates with comparably high error contributions. Pre-echoes can be masked by the time-domain effect of premasking if the time spread is of short length (in the order of a few milliseconds). Therefore, they can be reduced or avoided by using blocks of short lengths. However, a larger percentage of the total bit rate is typically required for the transmission of side information if the blocks are shorter. A solution to this problem is to switch between block sizes of different lengths, as proposed by Edler (window switching) [24]; typical block sizes are between N = 64 and N = 1024. The small blocks are only used to control pre-echo artifacts during nonstationary periods of the signal; otherwise, the coder switches back to long blocks. It is clear that block-size selection has to be based on an analysis of the characteristics of the actual audio-coding block. Figure 5 demonstrates the effect in transform coding: if the block size is N = 1024 (Fig. 5b), pre-echoes are clearly (visible and) audible, whereas a block size of 256 will reduce these effects, because they are limited to the block where the signal attack and the corresponding quantization errors occur (Fig. 5c). In addition, premasking can become effective.

Fig. 10. 1152-sample block of an audio signal.
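Block-size selection can be driven by a simple transient detector. The following sketch uses an illustrative energy-ratio criterion of my own choosing (not the standard's psychoacoustic-model decision) to switch to short blocks when the signal energy jumps within a long block:

```python
def choose_block_size(block, short_n=256, ratio=8.0, floor=1e-9):
    """Return the short block size for transient blocks, else the long size."""
    long_n = len(block)
    # energy of each short sub-block inside the long block
    energies = [sum(s * s for s in block[i:i + short_n])
                for i in range(0, long_n, short_n)]
    # a sharp energy jump between consecutive sub-blocks signals an attack
    transient = any(b > ratio * a + floor for a, b in zip(energies, energies[1:]))
    return short_n if transient else long_n
```

A stationary block keeps the long size (good coding gain and low side-information overhead), while a castanet-style attack forces short blocks so that quantization errors stay within the premasking interval.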



Dynamic Bit Allocation

Frequency-domain coding gains significantly in performance if the number of bits assigned to each of the quantizers of the transform coefficients is adapted to the short-term spectrum of the audio-coding block on a block-by-block basis. In the mid-1970s, Zelinski and Noll introduced dynamic bit allocation and demonstrated significant SNR-based and subjective improvements with their adaptive transform coding (ATC) (see Fig. 6) [14, 26]. They proposed a DCT mapping and a dynamic bit-allocation algorithm that used the DCT transform coefficients to compute a DCT-based short-term spectral envelope. Parameters of this spectrum were coded and transmitted, from which the short-term spectrum was estimated using linear interpolation in the log-domain. This estimate was then used to calculate the optimum number of bits for each transform coefficient, both in the encoder and the decoder.

ATC had a number of shortcomings, such as block boundary effects, pre-echoes, marginal exploitation of masking, and low subjective quality at low bit rates. Despite these shortcomings, we find many of the features of the conventional ATC in more recent frequency-domain coders. Examples of the very sophisticated bit-allocation strategies that MPEG audio-coding algorithms use will be described in detail in the "Layers I and II" section.
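A greedy version of such a bit-allocation loop can be sketched as follows (one bit is repeatedly granted to the band whose noise is currently most audible, using the roughly 6 dB-per-bit rule; the actual MPEG allocation also handles scalefactors, sample grouping, and codeword tables):

```python
def allocate_bits(smr_db, total_bits):
    """Give bits one at a time to the band with the largest noise-to-mask ratio."""
    bits = [0] * len(smr_db)
    nmr = lambda i: smr_db[i] - 6.02 * bits[i]  # ~6 dB SNR gain per added bit
    for _ in range(total_bits):
        bits[max(range(len(bits)), key=nmr)] += 1
    return bits
```

Bands whose SMR is already negative (fully masked) receive no bits at all, which is exactly where the compression gain of perceptual coding comes from.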

ISO/MPEG-1 Audio Coding

The MPEG audio-coding standard [8, 27-29] has already become a universal standard in diverse fields such as consumer electronics, professional audio processing, telecommunications, and broadcasting [30]. The standard combines features of the MUSICAM and ASPEC coding algorithms [31, 32]. Main steps of the development toward the MPEG-1 audio standard have been described in [29, 33]. MPEG-1 audio coding offers a subjective reproduction quality that is equivalent to CD quality (16-bit PCM) at the stereo rates given in Table 3 for many types of music. Because of its high dynamic range, MPEG-1 audio has the potential to exceed the quality of a CD [30, 34].

The Basics

Structure

Fig. 11. Frequency distributions of various important MPEG parameters taken from the audio block of Fig. 10. MPEG-1 Layer II coding with an overall bit rate of 128 kb/s: (a) sound-pressure level (SPL) of input frame vs. index of subbands (each subband is 750 Hz wide); (b) global masking threshold vs. frequency; (c) signal-to-mask ratio vs. frequency; (d) bit allocation vs. frequency; (e) SPL of reconstruction error vs. frequency.

The basic structure of MPEG-1 audio coders follows that of perception-based coders (see Fig. 4). In the first step, the audio signal is converted into spectral components via an analysis filterbank; Layers I and II make use of a subband filterbank and Layer III employs a hybrid filterbank. Each spectral component is quantized and coded with the goal of keeping the quantization noise below the masking



threshold. The number of bits for each subband and a scalefactor are determined on a block-by-block basis: each block has 12 (Layer I) or 36 (Layers II and III) subband samples (see the "Layers I and II" section). The number of quantizer bits is obtained from a dynamic bit-allocation algorithm (Layers I and II) that is controlled by a psychoacoustic model (see below). The subband codewords, the scalefactor, and the bit-allocation information are multiplexed into one bitstream, together with a header and optional ancillary data. In the decoder, the synthesis filterbank reconstructs a block of 32 audio output samples from the demultiplexed bitstream.

MPEG-1/Audio supports sampling rates of 32, 44.1, and 48 kHz and bit rates between 32 kb/s (mono) and 448 kb/s, 384 kb/s, and 320 kb/s (stereo, for Layers I, II, and III, respectively).
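The per-frame numbers implied by this structure are easy to derive (a sketch: 32 subbands times 12 or 36 subband samples per block, with the Layer II figure of 1152 samples matching the block shown in Fig. 10):

```python
SUBBANDS = 32

def samples_per_frame(layer):
    # 12 subband samples per subband in Layer I, 36 in Layers II and III
    return SUBBANDS * (12 if layer == 1 else 36)

def bits_per_frame(layer, bit_rate, sample_rate):
    # bit budget per frame = frame duration x target bit rate
    return samples_per_frame(layer) * bit_rate / sample_rate
```

At 48 kHz and 128 kb/s, for example, a Layer II frame spans 24 ms and the allocation algorithm has 3072 bits to distribute over codewords, scalefactors, allocation information, and the header.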

Layers and Operating Modes

The standard consists of three layers (I, II, and III) of increasing complexity, delay, and subjective performance. From a hardware and software point of view, the higher layers incorporate the main building blocks of the lower layers (Fig. 7). A standard full MPEG-1 audio decoder is able to decode bit streams of all three layers. More typical are MPEG-1/Audio Layer X decoders (X = I, II, or III).

Stereo Redundancy Coding

MPEG-1/Audio supports four modes: mono, stereo, dual with two separate channels (useful for bilingual programs), and joint stereo. In the optional joint-stereo mode interchannel dependencies are exploited to reduce the overall bit rate by using an irrelevancy-reducing technique called intensity stereo. It is known that, above 2 kHz and within each critical band, the human auditory system bases its perception of stereo imaging more on the temporal envelope of the audio signal than on its temporal fine structure. Therefore, the MPEG audio-compression algorithm supports a stereo redundancy coding mode called intensity stereo coding, which reduces the total bit rate without violating the spatial integrity of the stereophonic signal.

In this mode the encoder codes some upper-frequency subband outputs with a single sum signal L + R (or some linear combination thereof) instead of sending independent left (L) and right (R) subband signals. The decoder reconstructs the left and right channels based only on the single L + R signal and on independent left- and right-channel scalefactors. Hence, the spectral shape of the left and right outputs is the same within each intensity-coded subband, but the magnitudes are different [35]. The optional joint stereo mode will only be effective if the required bit rate exceeds the available bit rate, and it will only be applied to subbands corresponding to frequencies of around 2 kHz and above.

Layer III has an additional option: in the mono/stereo (M/S) mode the left- and right-channel signals are encoded as middle (L + R) and side (L - R) channels. This latter mode can be combined with the joint stereo mode.

▲ 12. Bit allocations of three allocation rules taken from the audio block of Fig. 10. MPEG-1 Layer II coding with an overall bit rate of 128 kb/s: (a) bit allocation (model 1); (b) bit allocation (model 2); (c) bit allocation for unweighted minimum mean-squared error.

▲ 13. Bit allocations for MPEG-1 Layer II coding with overall bit rates of 128 kb/s and 64 kb/s.

Psychoacoustic Models

We have already mentioned that the adaptive bit-allocation algorithm is controlled by a psychoacoustic model. This model computes SMRs and takes into account the short-term spectrum of the audio block to be coded and knowledge about noise masking. The model is only needed in the encoder, which makes the decoder less complex; this asymmetry is a desirable feature for audio playback and audio broadcasting applications.

The normative part of the standard describes the decoder and the meaning of the encoded bitstream, but the encoder is not standardized, thus leaving room for an evolutionary improvement of the encoder. In particular, different psychoacoustic models can be used that range from very simple (or none at all) to very complex, based on quality and implementability requirements. Information about the short-term spectrum can be derived in various ways: for example, as an accurate estimate from an FFT-based spectral analysis of the audio input samples or, less accurately, directly from the spectral components as in the conventional ATC [14] (see also Fig. 6). Encoders can also be optimized for a certain application. All these encoders can be used with complete compatibility with all existing MPEG-1 audio decoders.

The informative part of the standard gives two examples of FFT-based models (see also [8, 29, 36]). Both models identify, in different ways, tonal and nontonal spectral components and use the corresponding results of tone-masks-noise and noise-masks-tone experiments in


▲ 14. Structure of MPEG-1 audio encoder and decoder (Layer III).

the calculation of the global masking thresholds. Details are given in the standard; experimental results for both psychoacoustic models are described in [36]. In the informative part of the standard a 512-point FFT is proposed for Layer I, and a 1024-point FFT for Layers II and III. In both models the audio input samples are Hann-weighted. Model 1, which may be used for Layers I and II, computes for each masker its individual masking threshold, taking into account its frequency position, power, and tonality information. The global masking threshold is obtained as the sum of all individual masking thresholds and the absolute masking threshold. The SMR is then the ratio of the maximum signal level within a given subband and the minimum value of the global masking threshold in that given subband (see Fig. 2). Model 2, which may be used for all layers, is more complex: tonality is assumed when a simple prediction indicates a high prediction gain; the masking thresholds are calculated in the cochlea domain, i.e., properties of the inner ear are taken into account in more detail; and, finally, in case of potential pre-echoes the global masking threshold is adjusted appropriately.

Layers I and II

MPEG Layer I and II coders have very similar structures. The Layer II coder achieves a better performance, mainly because the overall scalefactor side information is reduced by exploiting redundancies between the scalefactors. Additionally, a slightly finer quantization is provided.

The coded bit streams must allow editing, fading, mixing, and dynamic range compression.

Filterbank

Layer I and Layer II coders map the digital audio input into 32 subbands via equally spaced bandpass filters (Figs. 8 and 9). A polyphase filter structure is used for the frequency mapping; its filters have 512 coefficients. Polyphase structures are computationally very efficient, since a DCT can be used in the filtering process, and they are of moderate complexity and low delay. On the negative side, the filters are equally spaced, and therefore the frequency bands do not correspond well to the critical-band partition (see the earlier section on auditory masking). At a 48 kHz sampling rate each band has a width of 24000/32 = 750 Hz; hence, at low frequencies, a single subband covers a number of adjacent critical bands. The subband signals are resampled (critically decimated) at a rate of 1500 Hz. The impulse response of subband k, h_sub(k)(n), is obtained by multiplication of the impulse response of a single prototype lowpass filter, h(n), by a modulating function that shifts the lowpass response to the appropriate subband frequency range:

h_sub(k)(n) = h(n) · cos[(2k − 1)πn/(2M) + φ(k)];
M = 32; k = 1, 2, ..., 32; n = 1, 2, ..., 512.

The prototype lowpass filter h(n) has a 3 dB bandwidth of 750/2 = 375 Hz, and the center frequencies are at odd multiples thereof (all values at a 48 kHz sampling rate). Therefore, the subsampled filter outputs exhibit a significant overlap. However, the design of the prototype filter and the inclusion of appropriate phase shifts in the cosine terms result in an aliasing cancellation at the output of the decoder synthesis filterbank. Details about the coefficients of the prototype filter and the phase shifts φ(k) are given in the ISO/MPEG standard. Details about an efficient implementation of the filterbank can be found in [15] and [36] and, again, in the standardization documents.
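The cosine modulation above can be sketched in a few lines. The prototype below is only a placeholder (a Hann-windowed sinc); the real 512 coefficients h(n) and the phase shifts φ(k) are tabulated in the standard, so this shows the modulation step only:

```python
import numpy as np

M, N = 32, 512                      # 32 subbands, 512-tap prototype

# Placeholder prototype lowpass filter; the actual h(n) is standard-defined.
n = np.arange(N)
h = np.sinc((n - N / 2 + 0.5) / (2 * M)) * np.hanning(N) / (2 * M)

# Modulate the prototype to the 32 subband center frequencies
# (odd multiples of fs/(4M)); phi(k) is standard-defined, set to 0 here.
phi = np.zeros((M, 1))
k = np.arange(1, M + 1).reshape(-1, 1)            # k = 1..32
h_sub = h * np.cos((2 * k - 1) * np.pi * n / (2 * M) + phi)

print(h_sub.shape)   # one 512-tap bandpass impulse response per subband
```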
Quantization

The number of quantizer levels for each spectral component is obtained from a dynamic bit-allocation rule that is controlled by a psychoacoustic model. The bit-allocation algorithm selects one uniform midtread quantizer out of a set of available quantizers such that both the bit-rate requirement and the masking requirement are met. The iterative procedure minimizes the NMR in each subband. It starts with the number of bits for the samples and scalefactors set to zero. In each iteration step the quantizer SNR, SNR(m), is increased for the one m-bit subband quantizer producing the largest value of the NMR at the quantizer output. (The increase is obtained by allocating one more bit.) For that purpose the NMR, NMR(m) = SMR - SNR(m), is calculated as the difference (in dB) between the actual quantization noise level and the minimum global masking threshold. The standard provides tables with estimates of the quantizer SNR, SNR(m), for a given m.

Block companding is used in the quantization process; i.e., blocks of decimated samples are formed and divided by a scalefactor such that the sample of largest magnitude is unity. In Layer I, blocks of 12 decimated and scaled samples are formed in each subband (and for the left and right channel) and there is one bit allocation for each block. At a 48 kHz sampling rate, 12 subband samples correspond to 8 ms of audio. There are 32 blocks, each with 12 decimated samples, representing 32 × 12 = 384 audio samples.

In Layer II, in each subband a 36-sample superblock is formed of three consecutive blocks of 12 decimated samples, corresponding to 24 ms of audio at a 48 kHz sampling rate. There is one bit allocation for each 36-sample superblock. All 32 superblocks, each with 36 decimated samples, represent, altogether, 32 × 36 = 1152 audio samples. As in Layer I, a scalefactor is computed for each 12-sample block. A redundancy reduction technique is used for the transmission of the scalefactors: depending on the significance of the changes between the three consecutive scalefactors, one, two, or all three scalefactors are transmitted, together with a 2-bit scalefactor select information. Compared with Layer I, the bit rate for the scalefactors is reduced by around 50% [29]. Figure 9 indicates the block companding structure.

The scaled and quantized spectral subband components are transmitted to the receiver together with scalefactor, scalefactor select (Layer II), and bit-allocation information. Quantization with block companding provides a very large dynamic range of more than 120 dB. For example, in Layer II uniform midtread quantizers are available with 3, 5, 7, 9, 15, 31, ..., 65535 levels for subbands of low index (low frequencies). In the mid- and high-frequency region the quantizers have a reduced number of levels. For example, subbands of index 24 to 27 have only quantizers with 3, 5, and 65535 (!) levels. The 16-bit quantizers prevent overload effects. Subbands of index 28 to 32 are not transmitted at all. In order to reduce the bit rate, the codewords of three successive subband samples resulting from quantizing with 3-, 5-, and 9-step quantizers are assigned one common codeword. The savings in bit rate is about 40% [29].

Coding Examples

The following figures demonstrate the way MPEG-1 Layer II encodes audio signals. Figure 10 shows an individual 1152-sample audio block to be coded. Figure 11 shows the frequency dependencies of various important MPEG parameters; the frequency axes are divided in accordance with the subband separations. The sampling rate is 48 kHz; hence each subband index represents a subband of bandwidth 750 Hz. We have chosen an overall bit rate of 128 kb/s.

▲ 15. Typical sequence of windows in adaptive window switching.

Figure 11(a) shows the frequency distribution of the sound-pressure level of the audio block. From this distribution and from the threshold in quiet a global masking threshold can be derived (Fig. 11(b)). For each subband, the SMR (in dB) is the difference between the level of the masker and the minimum value of the global masking threshold (Fig. 11(c)). Note that, for subbands of index 23 and higher, the signal power is significantly below that of the global masking threshold. Accordingly, the corresponding subband signals need not be transmitted. In the next step, the number of bits per subband quantizer is chosen such that its quantization noise is kept sufficiently below the global masking threshold (Fig. 11(d)). Therefore, the bit allocation for those subbands that have to be transmitted roughly follows the SMR. The spectrum of the reconstruction error is shown in Fig. 11(e) (please take into account that the dB values of Fig. 11(e) cover only the range 0 to 35 dB). If we compare it with the spectrum of the global masking threshold, we note that the power of the reconstruction error is below the threshold,



consequently, it is masked. Note also that the spectrum of the reconstruction error is identical to that of the input spectrum for subbands 23 and above, because the corresponding subband signals are not transmitted.

In Fig. 12, three bit allocations for the same audio block are compared, employing (i) psychoacoustic model 1 of the standard, (ii) psychoacoustic model 2 of the standard, and (iii) an unweighted bit allocation. In this example, there are clear differences between the two models suggested in the MPEG standard. Note that the calculation of the unweighted bit allocation does not depend on masking thresholds. Nevertheless, the bit allocation resembles that of model 1, except that the unweighted bit allocation spends 3 bits on subbands of indices 23 and above, where the signal power is well below that of the masking threshold.

Finally, in Fig. 13, we compare the model 1-based bit allocations for bit rates of 128 kb/s and 64 kb/s, again for the same audio block. Note that at the lower bit rate a lowpass version of the audio signal is reconstructed.

Decoding

The decoding is straightforward: the subband sequences are reconstructed on the basis of blocks of 12 subband samples, taking into account the decoded scalefactor and bit-allocation information. If a subband has no bits allocated to it, the samples in that subband are set to zero. Each time the subband samples of all 32 subbands have been calculated, they are applied to the synthesis filterbank, and 32 consecutive 16-bit, PCM-format audio samples are calculated. If available, as in bidirectional communications or in recorder systems, the encoder (analysis) filterbank can be used in a reverse mode in the decoding process.

Layer III

Layer III of the MPEG-1/Audio coding standard introduces many new features (see Fig. 14), in particular a switched hybrid filterbank. In addition, it employs an analysis-by-synthesis approach, an advanced pre-echo control, and nonuniform quantization with entropy coding. A buffer technique, called bit reservoir, leads to further savings in bit rate. Layer III is the only layer that provides mandatory decoder support for variable bit-rate coding [37].

Switched Hybrid Filterbank

In order to achieve a higher frequency resolution closer to critical-band partitions, the 32 subband signals are subdivided further in frequency content by applying, to each of the subbands, a 6-point or 18-point modified DCT block transform, with 50% overlap; hence, the windows contain, respectively, 12 or 36 subband samples. The maximum number of frequency components is 32 × 18 = 576, each representing a bandwidth of only 24000/576 = 41.67 Hz. The 18-point block transform is normally applied because it provides better frequency resolution, whereas the 6-point block transform provides better time resolution and is applied in case of expected pre-echoes (see the earlier section on window switching). In principle, a pre-echo is assumed when an instantaneous demand for a high number of bits occurs. Depending on the nature of potential pre-echoes, all or a smaller number of transforms are switched. Two special MDCT windows, a start window and a stop window, are needed in case of transitions between short and long blocks and vice versa to maintain the time-domain alias cancellation feature of the MDCT [21, 24, 36]. Figure 15 shows a typical sequence of windows.

▲ 16. MPEG-1 frame structure and packetization. Layer I: 384 subband samples; Layer II: 1152 subband samples. Packets P: 4-byte header; 184-byte payload field.

Quantization and Coding

The MDCT output samples are nonuniformly quantized, thus providing both smaller mean-squared errors and better masking, because larger errors can be tolerated if the samples to be quantized are large. Huffman coding, based on 32 code tables, and additional run-length coding are applied to represent the quantizer indices in an efficient way. The encoder maps the variable-wordlength codewords of the Huffman code tables into a constant bit rate by monitoring the state of a bit reservoir. The bit reservoir ensures that the decoder buffer neither underflows nor overflows when the bitstream is presented to the decoder at a constant rate.

▲ 17. MPEG packet delivery.

In order to keep the quantization noise in all critical bands below the global masking threshold (noise allocation), an iterative analysis-by-synthesis method is employed whereby the process of scaling, quantization, and coding of spectral data is carried out within two nested iteration loops. The decoding follows that of the encoding process.
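The core noise-allocation idea described in the Quantization section for Layers I and II (repeatedly granting one more bit to the subband quantizer with the largest noise-to-mask ratio) can be sketched as follows. The SMR values and the per-bit SNR table are illustrative placeholders, not the tabulated values from the standard:

```python
# Greedy noise-allocation sketch: NMR = SMR - SNR(m) (all in dB);
# in each step, give the next bit to the subband whose NMR is largest.
def allocate_bits(smr_db, snr_table_db, total_bits):
    """smr_db: per-subband signal-to-mask ratios (dB);
    snr_table_db[m]: quantizer SNR achieved with m bits (snr_table_db[0] = 0)."""
    bits = [0] * len(smr_db)
    max_m = len(snr_table_db) - 1
    for _ in range(total_bits):
        # Current NMR per subband; fully allocated subbands are frozen.
        nmr = [smr - snr_table_db[b] if b < max_m else float("-inf")
               for smr, b in zip(smr_db, bits)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] == float("-inf"):
            break                      # every subband is at its maximum wordlength
        bits[worst] += 1
    return bits

# Illustrative numbers: 4 subbands, roughly 6 dB of SNR gained per bit.
snr = [0, 7, 13, 19, 25, 31]
print(allocate_bits([20.0, 12.0, 3.0, -5.0], snr, 8))   # [4, 3, 1, 0]
```

Subbands whose signal power lies well below the masking threshold (negative SMR) end up with zero bits, matching the behavior seen in Fig. 11(d).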

Frame and Multiplex Structure

Frame Structure

Figure 16 shows the frame structure of MPEG-1 audio-coded signals for both Layer I and Layer II. Each frame has a header; its first part contains 12 synchronization bits, 20-bit system information, and an optional 16-bit cyclic redundancy check code. Its second part contains side information about the bit allocation and the scalefactors (and, in Layer II, scalefactor select information). As main information a frame carries a total of 32 × 12 subband samples (corresponding to 384 PCM audio input samples, equivalent to 8 ms at a sampling rate of 48 kHz) in Layer I, and a total of 32 × 36 subband samples in Layer II (corresponding to 1152 PCM audio input samples, equivalent to 24 ms at a sampling rate of 48 kHz). Please note that the frames are autonomous: each frame contains all information necessary for decoding. Therefore, each frame can be decoded independently of previous frames; it defines an entry point for audio storage and audio editing applications. Please note also that the lengths of the frames are not fixed, due to (i) the length of the main information field, which depends on bit rate and sampling frequency; (ii) the side information field, which varies in Layer II; and (iii) the ancillary data field, the length of which is not specified.

Multiplex Structure

The systems part of the MPEG-1 coding standard ISO/IEC 11172 defines a packet structure for multiplexing audio, video, and ancillary data bitstreams into one stream. The variable-length MPEG frames are broken down into packets. The packet structure uses 188-byte packets consisting of a 4-byte header followed by 184 bytes of payload (see Fig. 17). The header includes a sync byte, a 13-bit field called the packet identifier to inform the decoder about the type of data, and additional information. For example, a 1-bit payload unit start indicator indicates if the payload starts with a frame header. No predetermined mix of audio, video, and ancillary data bitstreams is required; the mix may change dynamically, and services can be provided in a very flexible way. If additional header information is required, such as for periodic synchronization of audio and video timing, a variable-length adaptation header can be used as part of the 184-byte payload field.

Although the lengths of the frames are not fixed, the interval between frame headers is constant (within a byte) through the use of padding bytes. The MPEG/Systems specification describes how MPEG-compressed



audio and video data streams are to be multiplexed together to form a single data stream. The terminology and the fundamental principles of the systems layer are described in [38].

Subjective Quality (MPEG-1; Stereophonic Audio Signals)

The standardization process included extensive subjective tests and objective evaluations of parameters such as complexity and overall delay. The MPEG (and equivalent ITU-R) listening tests were carried out under very similar and carefully defined conditions with around 60 experienced listeners; approximately 10 test sequences were used, and the sessions were performed in stereo with both loudspeakers and headphones. In order to detect even small impairments, the 5-point ITU-R impairment scale was used in all experiments. Details are given in [39] and [40]. Critical test items were chosen in the tests to evaluate the coders by their worst-case (not average) performance. The subjective evaluations, which had been based on triple stimulus/hidden reference/double blind tests, have shown very similar and stable evaluation results. In these tests the subject is offered three signals, A, B, and C (triple stimulus). A is always the unprocessed source signal (the reference). B and C, or C and B, are the reference and the system under test (hidden reference). The selection is known neither to the subjects nor to the conductor(s) of the test (double blind test). The subjects have to decide if B or C is the reference and have to grade the remaining one.

▲ 18. 3/2 multichannel loudspeaker configuration (© Deutsche Telekom AG, Highlights aus der Forschung, 1996, with permission).

The MPEG-1 audio-coding standard has shown excellent performance for all layers at the rates given in Table 3. It should be mentioned again that the standard leaves room for encoder-based improvements by using better psychoacoustic models. Indeed, many improvements have been achieved since the first subjective tests were carried out in 1991.

MPEG Audio Coding with Lower Sampling Frequencies

We have mentioned above that MPEG-1/Audio supports sampling frequencies of 32, 44.1, and 48 kHz. For applications with limited bandwidths (mediumband), lower sampling frequencies (16, 22.05, and 24 kHz) have been defined in MPEG-2 to bring bit rates down to 64 kb/s per channel and less [9]. The corresponding maximum audio bandwidths are 7.5, 10.3, and 11.25 kHz. The syntax, semantics, and coding techniques of MPEG-1 are maintained except for a small number of parameters (two tables in the decoder). Therefore, coding can again be based on Layers I, II, or III. The extension to lower sampling frequencies leads to higher frequency resolutions and hence to higher coding gains, partly because of better adaptation to the masking thresholds, and partly because side information becomes a smaller part of the overall bit rate. As in the case of coding wideband audio signals, the best audio quality is obtained with Layer III. Finally, we note that some applications make use of sampling frequencies of 8, 11.025, and 12 kHz, which are outside the MPEG-2 standard.

▲ 19. Triangle sound representation in five channels (from top: RS, LS, C, R, and L) (© Deutsche Telekom AG, Highlights aus der Forschung, 1996, with permission).

MPEG Multichannel Audio Coding

Multichannel Audio Representations

A logical further step in digital audio is the definition of multichannel audio representation systems to create a realistic surround-sound field both for audio-only applications and for audiovisual systems, including video


conferencing, videophony, multimedia services, and electronic cinema. Multichannel systems can also provide multilingual channels or additional channels for the visually impaired (a verbal description of the visual scene) and for the hearing impaired (dialogue with enhanced intelligibility). ITU-R and other international groups have recommended a five-channel loudspeaker configuration, referred to as 3/2-stereo, with a left and a right channel (L and R), an additional center channel (C), and two side/rear surround channels (LS and RS) augmenting the L and R channels (see Fig. 18) (ITU-R Rec. 775). Such a configuration offers a surround-sound field with a stable frontal sound image and a large listening area. Figure 19 shows four blocks of a five-channel triangle audio signal (which is difficult to code).

Multichannel digital audio systems support p/q presentations with p front and q back channels, and also provide the possibility of transmitting two independent stereophonic programs and/or a number of commentary or multilingual channels. Typical combinations of channels are given in Table 4.

ITU-R Recommendation 775 provides a set of downwards mixing equations if the number of loudspeakers is to be reduced (downwards compatibility). An additional low-frequency enhancement (LFE or subwoofer) channel, particularly useful for HDTV applications, can optionally be added to any of the configurations. The LFE channel extends the low-frequency content between 15 Hz and 120 Hz in terms of both frequency and level. One or more loudspeakers can be positioned freely in the listening room to reproduce this LFE signal. The film industry uses a similar system for its digital sound systems. A 3/2-configuration with five high-quality full-range channels plus a subwoofer channel is often called a 5.1 system.

In order to reduce the overall bit rate of multichannel audio-coding systems, redundancies and irrelevancy, such as interchannel dependencies and interchannel masking effects, respectively, may be exploited. In addition, components of the multichannel signal that are irrelevant with respect to the spatial perception of the stereophonic presentation, i.e., those that do not contribute to the localization of sound sources, may be identified and reproduced in a monophonic format to further reduce bit rates. State-of-the-art multichannel coding algorithms make use of such effects. However, a careful design is needed; otherwise such joint coding may produce artifacts.

▲ 20. Compatibility of MPEG-2 multichannel audio bit streams.

MPEG-2/Audio Multichannel Coding

The second phase of MPEG, labeled MPEG-2, includes in its audio part two multichannel audio-coding standards, one of which is forward- and backwards compatible with MPEG-1/Audio [8, 41-44]. Forward compatibility means that an MPEG-2 multichannel decoder is able to properly decode MPEG-1 mono- or stereophonic signals; backwards compatibility means that existing MPEG-1 stereo decoders, which only handle two-channel audio, are able to reproduce a meaningful basic 2/0 stereo signal from an MPEG-2 multichannel bit stream, so as to serve the needs of users with simple mono or stereo equipment. Nonbackwards-compatible multichannel coders will not be able to feed a meaningful bit stream into an MPEG-1 stereo decoder. On the other hand, nonbackwards-compatible codecs have more freedom in producing a high-quality reproduction of audio signals.

With backwards compatibility it is possible to introduce multichannel audio at any time in a smooth way without making existing two-channel stereo decoders obsolete. An important example is the European Digital Audio Broadcast system, which will require MPEG-1 stereo decoders in the first generation but may offer multichannel audio at a later point.

Backwards-Compatible MPEG-2 Audio Coding

Backwards compatibility implies the use of compatibility matrices. A down-mix of the five channels ("matrixing") delivers a correct basic 2/0 stereo signal, consisting of a left and a right channel, LO and RO, respectively. A typical set of equations is:

LO = α(L + β·C + δ·LS)
RO = α(R + β·C + δ·RS)
α = 1/(1 + √2); β = δ = 1/√2.

Other choices are possible, including LO = L and RO = R. The factors α, β, and δ attenuate the signals to avoid overload when calculating the compatible stereo signal



in that channel, this noise may become audible. Note that
the masking signal will still be present in the multichannel
representation but it will appear on a different loud­
speaker. Measures against "unmasking" effects hive been
described in [46] . As an additional measure, MPEG-2's
optional variable bit-rate mode can be evoked to encode
difficult audio content at a momentarily higher bit rate.
MPEG-1 decoders have a bit-rate limitation (384 kbls
in Layer II) . In order to overcome this limitation, the
MPEG-2 standard allows for a second bit stream, the ex­
tension part, to provide compatible multichannel audio at
higher rates. Figure 23 shows the structure of the bit
stream with extension.

MPEG-2 Advanced Audio Coding


A second standard within MPEG-2 supports applications
that do not request compatibility with the existing
MPEG-1 stereo format. Therefore, matrixing and dema­
A 2 1. Data format of MPEC audio bit streams: (a) MPEC- l audio trixing are not necessary and the corresponding potential
frame; (b) MPEC-2 audio frame, compatible with MPEC-l format artifacts disappear (see Fig. 24) .
The last two years have seen extensive activities in the
optimization and standardization of a nonbackwards­
(LO,RO) . LO and RO are transmitted in MPEG-1 format compatible MPEG-2 multichannel audio coding algo­
in transmission channels T1 and T2. Channels T3, T4, rithm. Many companies around the world contributed
and T5 together form the multichannel extension signal advanced audio-coding algorithms in an collaborative ef­
(Fig. 20) . They have to be chosen such that the decoder fort to come up with a flexible high-quality coding stan­
can recompute the complete 3/2-stereo multichannel sig­ dard [43] .
nal. Interchannel redundancies and masking effects are
taken into account to find the best choice. A simple example is T3 = C, T4 = LS, and T5 = RS. In MPEG-2 the matrixing can be done in a very flexible and even time-dependent way. Note, however, that the audio content of the extension signal is already delivered in the MPEG-1 audio stream (signals LO and RO); this redundancy reduces the achievable compression rate.

Backwards compatibility is achieved by transmitting the channels LO and RO in the subband-sample section of the MPEG-1 audio frame and all multichannel extension signals (T3, T4, and T5) in the first part of the MPEG-1 audio frame reserved for ancillary data. This ancillary data field is ignored by MPEG-1 decoders (see Fig. 21). The length of the ancillary data field is not specified in the standard. If the decoder is of type MPEG-1, it uses the 2/0-format front left and right down-mix signals, LO' and RO', directly (see Fig. 22). If the decoder is of type MPEG-2, it recomputes the complete 3/2-stereo multichannel signal with its components L', R', C', LS', and RS' via "dematrixing" of LO', RO', T3', T4', and T5' (see Fig. 20).

Matrixing is obviously necessary to provide backwards compatibility; however, if used in connection with perceptual coding, "unmasking" of quantization noise may appear [45]. It may be caused in the dematrixing process when sum and difference signals are formed. In certain situations such a masking sum or difference signal component can disappear in a specific channel. Since this component was supposed to mask the quantization noise, that noise may then become audible.

Tools. The MPEG-2 AAC standard employs high-resolution filter banks, prediction techniques, and noiseless coding. It is based on recent evaluations and definitions of tools (or modules), each having been selected from a number of proposals. The self-contained tools include an optional preprocessing, a filterbank, a perceptual model, temporal noise shaping, intensity multichannel coding, prediction, M/S stereo coding, quantization, noiseless coding, and a bit-stream multiplexer (see Fig. 25). The filterbank is a 1024-line modified discrete cosine transform and the perceptual model is taken from MPEG-1 (model 2). The temporal noise shaping tool controls the time dependence of the quantization noise, intensity and M/S coding exploit interchannel redundancies and irrelevancies, and the second-order backward-adaptive predictor improves coding efficiency.

▲ 22. MPEG-1 stereo decoding of MPEG-2 multichannel bit stream.
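The matrixing and dematrixing of the five-channel signal can be sketched as follows. This is only an illustration: the 1/√2 weights are a common textbook choice, whereas the standard permits flexible, even time-dependent matrices.

```python
import numpy as np

# Illustrative attenuation weight; MPEG-2 allows other (time-dependent) choices.
A = 1.0 / np.sqrt(2.0)

def matrix(L, R, C, LS, RS):
    """Build the backwards-compatible down-mix LO/RO plus the extension
    signals (here T3 = C, T4 = LS, T5 = RS, as in the simple example above)."""
    LO = L + A * C + A * LS
    RO = R + A * C + A * RS
    return LO, RO, C, LS, RS

def dematrix(LO, RO, T3, T4, T5):
    """Recover the five channels from the transmitted signals
    (exact inverse of the matrixing above, absent any coding noise)."""
    C, LS, RS = T3, T4, T5
    L = LO - A * C - A * LS
    R = RO - A * C - A * RS
    return L, R, C, LS, RS

# Round trip on random channel signals
rng = np.random.default_rng(0)
L, R, C, LS, RS = rng.standard_normal((5, 1024))
rec = dematrix(*matrix(L, R, C, LS, RS))
print(np.allclose(rec, (L, R, C, LS, RS)))  # True
```

In a real coder the transmitted signals are quantized between `matrix` and `dematrix`; the subtraction in `dematrix` is exactly where a masking component can cancel while its quantization noise remains, causing the unmasking effect described above.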

76 IEEE SIGNAL PROCESSING MAGAZINE SEPTEMBER 1997


▲ 23. Data format of MPEG-2 audio bit stream with extension part for multichannel data.

The predictor reduces the bit rate for coding subsequent subband samples in a given subband, and it bases its prediction on the quantized spectrum of the previous block, which is also available in the decoder (in the absence of channel errors). Finally, for quantization and noiseless coding, an iterative method is employed so as to keep the quantization noise in all critical bands below the global masking threshold.

Profiles. In order to serve different needs, the standard provides three profiles: (i) the main profile offers highest quality, (ii) the low-complexity profile works without prediction, and (iii) the sampling-rate-scaleable profile offers the lowest complexity. For example, in its main profile the filterbank is a 1024-line MDCT with 50% overlap (block length of 2048 samples). The filterbank is switchable to eight 128-line MDCTs (block lengths of 256 samples). Hence, it allows for a frequency resolution of 23.43 Hz and a time resolution of 2.6 ms (both at a sampling rate of 48 kHz). In the case of the long block length, the window shape can vary dynamically as a function of the signal.

The low-complexity profile does not employ temporal noise shaping and time-domain prediction (the prediction adds significantly to the complexity), whereas in the sampling-rate-scaleable profile a hybrid filterbank is used. MPEG-2 AAC supports up to 46 channels for various multichannel loudspeaker configurations and other applications; the default loudspeaker configurations are the monophonic channel, the stereophonic channel, and the 5.1 system (five channels plus LFE channel).

▲ 24. MPEG-2 advanced audio coding (multichannel configuration).

The above-listed selected modules define the MPEG-2 audio AAC standard that became an International Standard in April 1997 as an extension to MPEG-2 (ISO/MPEG 13818-7). A more detailed description of the MPEG-2 AAC multichannel standard can be found in the literature [43]. The standard offers high quality at the lowest possible bit rates between 320 and 384 kb/s for five channels; it will find many applications in both consumer and professional use.

Backwards Compatibility via Simulcast Transmission

If bit rates are not of high concern, a simulcast transmission may be employed where a full MPEG-1 bitstream is multiplexed with a full nonbackwards-compatible multichannel bitstream in order to support backwards compatibility without matrixing techniques (Fig. 26).

Subjective Tests (MPEG-2, Multichannel Audio Signals)

The first subjective tests, independently run at German Telekom and BBC (UK) under the umbrella of the MPEG-2 standardization process, had shown a satisfactory average performance of nonbackwards-compatible and of backwards-compatible coders. The tests had been carried out with experienced listeners and critical test items at low bit rates (320 and 384 kb/s). However, all codecs showed significant deviations from transparency for some of the test items [47, 48]. Recently, extensive formal subjective tests have been carried out to compare MPEG-2 AAC versions, operating, respectively, at 256 and 320 kb/s, and a backward-compatible MPEG-2 Layer II coder, operating at 640 kb/s [49] (a 1995 version of this latter coder was used, therefore its test results do not reflect any subsequent enhancements). All coders performed very well, with a slight advantage to the nonbackwards-compatible 320 kb/s MPEG-2 AAC coder compared with the backwards-compatible 640 kb/s MPEG-2 Layer II coder. The performances of those coders are indistinguishable from the original in the sense of the EBU definition of indistinguishable quality [50].
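The iterative distortion-control idea of AAC quantization, shrinking the quantizer step size in each critical band until the quantization noise drops below that band's masking threshold, can be sketched as follows. Band contents, thresholds, and the halving schedule are illustrative; the real AAC loop couples this distortion loop with a rate loop and noiseless (Huffman) coding.

```python
import numpy as np

def quantize_band(x, step):
    """Uniform quantization of one band's spectral values."""
    return np.round(x / step) * step

def distortion_loop(bands, thresholds, start_step=1.0):
    """For each band, halve the step size until the quantization-noise
    energy falls below the band's masking threshold."""
    steps = []
    for x, thr in zip(bands, thresholds):
        step = start_step
        while np.sum((x - quantize_band(x, step)) ** 2) > thr:
            step *= 0.5  # finer quantization -> less noise, more bits
        steps.append(step)
    return steps

rng = np.random.default_rng(1)
bands = [rng.standard_normal(16) for _ in range(4)]
thresholds = [0.5, 0.1, 0.02, 0.004]  # hypothetical masking thresholds
steps = distortion_loop(bands, thresholds)
print(steps)
```

The loop always terminates because the per-band noise energy is bounded by `n * (step/2)**2` and thus goes to zero as the step size is halved; in a real encoder, the rate loop would push back in the opposite direction whenever the resulting bit count exceeds the budget.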



▲ 25. Structure of MPEG-2 advanced audio coder (AAC).
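As an illustration of the prediction tool in this chain, the sketch below runs a second-order backward-adaptive predictor on a single spectral line: both sides predict from previously *quantized* values, so no predictor coefficients need to be transmitted. The step size and the NLMS adaptation rule are illustrative assumptions; the AAC predictor itself is specified differently (as a backward-adaptive lattice).

```python
import numpy as np

def predictive_coder(x, step=0.05, mu=0.05):
    """Second-order backward-adaptive prediction with quantized residuals.
    Returns the reconstruction the decoder would produce."""
    a = np.zeros(2)            # predictor coefficients, adapted backward
    hist = np.zeros(2)         # last two reconstructed (quantized) values
    recon = np.zeros_like(x)
    for n in range(len(x)):
        pred = a @ hist
        resid = x[n] - pred
        q_resid = np.round(resid / step) * step   # only this is "transmitted"
        recon[n] = pred + q_resid                 # decoder-side reconstruction
        # NLMS update from decodable quantities only (backward adaptation)
        a += mu * q_resid * hist / (hist @ hist + 1e-9)
        hist = np.array([recon[n], hist[0]])
    return recon

# A strongly correlated (sinusoidal) spectral-line trajectory
n = np.arange(200)
x = np.sin(0.07 * n)
recon = predictive_coder(x)
print(np.max(np.abs(recon - x)))  # bounded by step/2 = 0.025
```

Because the reconstruction error equals the residual quantization error, it stays within half a quantizer step regardless of prediction quality; the prediction gain shows up as smaller residuals, which a subsequent entropy coder can exploit.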

From these subjective tests, it has become clear that the concept of backwards compatibility implies the need for higher bit rates.

MPEG-4 Audio Coding

Activities within MPEG-4 aim at proposals for a broad field of applications including multimedia. (We note in passing that MPEG has started a new work item called "Multimedia content description interface," in short "MPEG-7." MPEG-7 does not cover coding; its goal is rather to specify a standardized description of various types of multimedia information. A typical application will be the search for video, graphics, or audio material in the sense of today's text-based search engines in the World Wide Web.) MPEG-4 will offer higher compression rates, and it will merge the whole range of audio from high-fidelity audio coding and speech coding down to synthetic speech and synthetic audio, supporting applications from high-fidelity audio systems down to mobile-access multimedia terminals. In order to represent, integrate, and exchange pieces of audio-visual information, MPEG-4 offers standard tools that can be combined to satisfy specific user requirements [51]. A number of such configurations may be standardized. A syntactic description will be used to convey to a decoder the choice of tools made by the encoder. This description can also be used to describe new algorithms and download their configuration to the decoding processor for execution.

The current toolset supports audio and speech compression at monophonic bit rates ranging from 2 to 64 kb/s. Three core coders are used:
▲ a parametric coding scheme for low bit-rate speech coding (2 to 10 kb/s)
▲ an analysis-by-synthesis coding scheme for medium bit rates (6 to 16 kb/s)
▲ a subband/transform-based coding scheme for bit rates below 64 kb/s.

The three core coders have been integrated into a so-called verification model that describes the operations of encoders and decoders and that is used to carry out simulations and optimizations. In the end, the verification model will be the embodiment of the standard [51].

Let us also note that MPEG-4 will offer new functionalities such as time-scale changes, pitch control, editability, database access, and scalability, which allows one to extract from the transmitted bit stream a subset sufficient to generate audio signals with lower bandwidth and/or lower quality depending on channel capacity or decoder complexity. MPEG-4 will become an international standard in November 1998.
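The scalability functionality, extracting a decodable subset from a layered bit stream, can be illustrated with a toy layered stream; the layer names, rates, and payloads below are made up for the sketch and do not reflect the actual MPEG-4 syntax.

```python
# Hypothetical layered stream: a base layer plus enhancement layers.
stream = [
    {"layer": "base",  "rate_kbps": 6,  "payload": b"\x01" * 60},
    {"layer": "enh-1", "rate_kbps": 10, "payload": b"\x02" * 100},
    {"layer": "enh-2", "rate_kbps": 16, "payload": b"\x03" * 160},
]

def extract_subset(stream, capacity_kbps):
    """Keep layers, in order, while the cumulative rate fits the channel.
    A decoder or network node can do this without re-encoding."""
    subset, total = [], 0
    for layer in stream:
        if total + layer["rate_kbps"] > capacity_kbps:
            break
        subset.append(layer)
        total += layer["rate_kbps"]
    return subset, total

subset, rate = extract_subset(stream, capacity_kbps=20)
print([l["layer"] for l in subset], rate)  # ['base', 'enh-1'] 16
```

Dropping `enh-2` yields a lower-bandwidth, lower-quality but still decodable signal, which is exactly the trade-off against channel capacity or decoder complexity described above.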



▲ 26. Backwards-compatible MPEG-2 multichannel audio coding (simulcast mode).

Applications

MPEG/Audio compression technologies will play an important role in consumer electronics, professional audio, telecommunications, broadcasting, and multimedia. Here we describe a few typical application fields.

Main applications will be based on delivering digital audio signals over terrestrial and satellite-based digital broadcast and transmission systems such as subscriber lines, program exchange links, cellular mobile radio networks, cable-TV networks, and local area networks [52]. For example, in narrowband integrated services digital networks (ISDN) customers have physical access to one or two 64-kb/s B channels and one 16-kb/s D channel (which supports signaling but can also carry user information). Other configurations are possible, including p × 64 kb/s (p = 1, 2, 3, ...) services. ISDN rates offer useful channels for a practical distribution of stereophonic and multichannel audio signals. Because ISDN is a bidirectional service, it also provides upstream paths for future on-demand and interactive audiovisual just-in-time audio services. The backbone of digital telecommunication networks will be broadband (B-)ISDN with its cell-oriented structure. Cell delays and cell losses are sources of distortions to be taken into account in designs of digital audio systems [53]. A related application is Internet broadcasting, which will need significant compression rates as long as home computers are connected to the backbone networks via modems with typical bit rates between 14.4 kb/s and 33 kb/s.

In the field of digital storage on digital audio tape and (re-writeable) disks, a number of MPEG-based consumer products have recently reached the audio market. Of these products, Philips' Digital Compact Cassette (DCC) essentially makes use of Layer I of the MPEG-1 audio coder by employing its 384 kb/s stereo rate; its audio-coding algorithm is called PASC (precision audio subband coding) [15]. The upcoming DVD with its capacity of 4.7 GB relieves the pressure for extreme compression factors. It will open the possibilities of storing audio channels that have been coded in a lossless mode, and it will provide the necessary capacity for various forms of multichannel coding. The DVD will support stereophonic and (at least) 5.1-multichannel audio. In connection with video, the PAL version of the DVD (625/50 TV system) will use MPEG audio coding with Dolby's AC-3 transform coding technique as an option [54-56], whereas the NTSC version (525/60 TV system) will be based on AC-3 with MPEG as an option. The overall audio bit rate is 384 kb/s.

A number of decisions concerning the introduction of digital audio broadcast (DAB) and digital video broadcast (DVB) services have been made recently. In Europe, a project group named Eureka 147 has worked out a DAB system that is able to cope with the problems of digital broadcasting [57]. ITU-R has recommended the MPEG-1 audio-coding standard after it had made extensive subjective tests. Layer II of this standard is used for program emission, and the Layer III version is recommended for commentary links at low rates. The sampling rate is 48 kHz in all cases, and the ancillary data field is used for program-associated data (PAD information) and other data. The DAB system has a significant bit-rate overhead for error correction based on punctured convolutional codes in order to support source-adapted channel coding, i.e., an unequal error protection that is in accordance with the sensitivity of individual bits or a group of bits to channel errors [58]. Additionally, error-concealment techniques are applied to provide a graceful degradation in case of severe errors. In the United States a standard has not yet been defined. Simulcasting analog and digital versions of the same audio program in the FM terrestrial band (88-108 MHz) is an important issue (whereas the European solution is based on new channels) [59].

The Hughes DirecTV satellite subscription television system and ADR (Astra Digital Radio) are examples of satellite-based digital broadcasting that make use of MPEG-1 Layer II. As a further example, the Eutelsat SaRa system will be based on Layer III coding.

Advanced digital TV systems provide HDTV delivery to the public by terrestrial broadcasting and a variety of alternate media and offer full-motion, high-resolution video and high-quality multichannel surround audio. The overall bit rate may be transmitted within the bandwidth of an analog UHF television channel. The United States Grand Alliance HDTV system and the European DVB system both make use of the MPEG-2 video-compression system and of the MPEG-2 system transport layer, which uses a flexible ATM-like packet protocol with headers/descriptors for multiplexing audio and video bit streams in one stream with the necessary information to keep the streams synchronized when decoding (see Fig. 17). The systems differ in the way the audio signal is compressed: the Grand Alliance system will use Dolby's AC-3 algorithm, whereas the European system will use MPEG-2/Audio.

Conclusions

Low bit-rate digital audio is applied in many different fields, such as consumer electronics, professional audio processing, telecommunications, and broadcasting. Perceptual coding in the frequency domain has paved the way to high compression rates in audio coding. MPEG-1 audio coding with its three layers has been widely accepted as an international standard. Software encoders, single-DSP-chip implementations, and computer extensions are available from a number of suppliers.

In the area of broadcasting and mobile radio systems, services are moving to portable and handheld devices, and new, third-generation mobile communication networks are evolving. Coders for these networks must not only operate at low bit rates but must be stable in burst-error and packet- (cell-) loss environments. Error-concealment techniques play a significant role. Due to the lack of available bandwidth, traditional channel coding techniques may not be able to sufficiently improve the reliability of the channel.

MPEG audio coders are controlled by psychoacoustic models that may be improved, thus leaving room for an evolutionary improvement of codecs. In the future, we will see new solutions for encoding. A better understanding of binaural perception and of stereo representation will lead to new proposals.

Digital multichannel audio improves stereophonic images and will be of importance both for audio-only and multimedia applications. MPEG-2/Audio offers both backwards-compatible and nonbackwards-compatible coding schemes to serve different needs. Ongoing research will result in enhanced multichannel representations by making better use of interchannel correlations and interchannel masking effects to bring the bit rates further down. We can also expect solutions for special presentations for people with impairments of hearing or vision which can make use of the multichannel configurations in various ways.

Current activities of the ISO/MPEG expert group aim at proposals for audio coding that will offer higher compression rates, and which will merge the whole range of audio, from high-fidelity audio coding and speech coding down to synthetic speech and synthetic audio (ISO/IEC MPEG-4). MPEG-4 will be the future multimedia standard. Because the basic audio quality will be more important than compatibility with existing standards, this activity has opened the door for completely new solutions.

Acknowledgments

The MPEG standards were created through the long-lasting efforts of a great many people coming from companies and research institutions from all over the world. Many of the contributors participated in the MPEG standards meetings. Their outstanding qualification, their enthusiasm, and countless collaborative efforts made these standards happen.

Peter Noll is a Professor of Telecommunications at the Technische Universität Berlin in Berlin, Germany.

References

1. A.A.M.L. Bruekers et al., "Lossless Coding for DVD Audio," 101st Audio Engineering Society Convention, Los Angeles, 1996, preprint 4358.

2. N.S. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice Hall, 1984.

3. A.S. Spanias, "Speech Coding: A Tutorial Review," Proc. of the IEEE, Vol. 82, No. 10, pp. 1541-1582, Oct. 1994.

4. N.S. Jayant, J.D. Johnston, and Y. Shoham, "Coding of Wideband Speech," Speech Communication, Vol. 11, pp. 127-138, 1992.

5. A. Gersho, "Advances in Speech and Audio Compression," Proc. of the IEEE, Vol. 82, No. 6, pp. 900-918, 1994.

6. P. Noll, "Wideband Speech and Audio Coding," IEEE Commun. Mag., Vol. 31, No. 11, pp. 34-44, 1993.

7. P. Noll, "Digital Audio Coding for Visual Communications," Proc. of the IEEE, Vol. 83, No. 6, June 1995.

8. ISO/IEC JTC1/SC29, "Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbit/s - IS 11172 (Part 3, Audio)," 1992.

9. ISO/IEC JTC1/SC29, "Information Technology - Generic Coding of Moving Pictures and Associated Audio Information - IS 13818 (Part 3, Audio)," 1994.

10. ISO/MPEG, Doc. N0821, Proposal Package Description - Revision 1.0, Nov. 1994.

11. G.T. Hathaway, "A NICAM Digital Stereophonic Encoder," in Audiovisual Telecommunications (Editor: N.D. Nightingale), Chapman & Hall, 1992, pp. 71-84.

12. E. Zwicker and R. Feldtkeller, Das Ohr als Nachrichtenempfänger. Stuttgart: S. Hirzel Verlag, 1967.

13. N.S. Jayant, J.D. Johnston, and R. Safranek, "Signal compression based on models of human perception," Proc. of the IEEE, Vol. 81, No. 10, pp. 1385-1422, 1993.

14. R. Zelinski and P. Noll, "Adaptive Transform Coding of Speech Signals," IEEE Trans. on Acoustics, Speech and Signal Proc., Vol. ASSP-25, pp. 299-309, August 1977.

15. A. Hoogendoorn, "Digital compact cassette," Proc. of the IEEE, Vol. 82, No. 10, pp. 1479-1489, Oct. 1994.

16. P. Noll, "On predictive quantizing schemes," Bell System Technical Journal, Vol. 57, pp. 1499-1532, 1978.

17. J. Makhoul and M. Berouti, "Adaptive noise spectral shaping and entropy coding in predictive coding of speech," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 27, No. 1, pp. 63-73, Feb. 1979.

18. D. Esteban and C. Galand, "Application of Quadrature Mirror Filters to Split Band Voice Coding Schemes," Proc. ICASSP, pp. 191-195, 1977.

19. J.H. Rothweiler, "Polyphase Quadrature Filters, a New Subband Coding Technique," Proc. International Conference ICASSP'83, pp. 1280-1283, 1983.



20. J. Princen and A. Bradley, "Analysis/Synthesis Filterbank Design Based on Time Domain Aliasing Cancellation," IEEE Trans. on Acoust., Speech, and Signal Process., Vol. ASSP-34, pp. 1153-1161, 1986.

21. H.S. Malvar, Signal Processing with Lapped Transforms, Artech House Inc., 1992.

22. F.S. Yeoh and C.S. Xydeas, "Split-band coding of speech signals using a transform technique," Proc. ICC, 1984, Vol. 3, pp. 1183-1187.

23. W. Granzow, P. Noll, and C. Volmary, "Frequency-domain coding of speech signals," (in German), NTG-Fachbericht No. 94, VDE-Verlag, Berlin, 1986, pp. 150-155.

24. B. Edler, "Coding of Audio Signals with Overlapping Block Transform and Adaptive Window Functions," (in German), Frequenz, Vol. 43, pp. 252-256, 1989.

25. M. Iwadare, A. Sugiyama, F. Hazu, A. Hirano, and T. Nishitani, "A 128 kb/s Hi-Fi Audio CODEC Based on Adaptive Transform Coding with Adaptive Block Size," IEEE J. on Sel. Areas in Commun., Vol. 10, No. 1, pp. 138-144, January 1992.

26. R. Zelinski and P. Noll, "Adaptive Blockquantisierung von Sprachsignalen," Technical Report No. 181, Heinrich-Hertz-Institut für Nachrichtentechnik, Berlin, 1975.

27. R.G. van der Waal, K. Brandenburg, and G. Stoll, "Current and future standardization of high-quality digital audio coding in MPEG," Proc. IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y., 1993.

28. P. Noll and D. Pan, "ISO/MPEG Audio Coding," International Journal of High Speed Electronics and Systems, Vol. 8, No. 1, pp. 69-118, 1997.

29. K. Brandenburg and G. Stoll, "The ISO/MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio," Journal of the Audio Engineering Society (AES), Vol. 42, No. 10, pp. 780-792, Oct. 1994.

30. L.M. van de Kerkhof and A.G. Cugnini, "The ISO/MPEG audio coding standard," Widescreen Review, 1994.

31. Y.F. Dehery, G. Stoll, and L. v.d. Kerkhof, "MUSICAM Source Coding for Digital Sound," 17th International Television Symposium, Montreux, Record pp. 612-617, June 1991.

32. K. Brandenburg, J. Herre, J.D. Johnston, Y. Mahieux, and E.F. Schroeder, "ASPEC: Adaptive Spectral Perceptual Entropy Coding of High Quality Music Signals," 90th Audio Engineering Society Convention, Paris, preprint 3011, 1991.

33. H.G. Musmann, "The ISO Audio Coding Standard," Proc. IEEE Globecom, Dec. 1990.

34. R.G. van der Waal, A.W.J. Oomen, and F.A. Griffiths, "Performance comparison of CD, noise-shaped CD and DCC," 96th Audio Engineering Society Convention, Amsterdam, preprint 3845, 1994.

35. J. Herre, K. Brandenburg, and D. Lederer, "Intensity stereo coding," 96th Audio Engineering Society Convention, Amsterdam, preprint 3799, 1994.

36. D. Pan, "A Tutorial on MPEG/Audio Compression," IEEE Multimedia, Vol. 2, No. 2, pp. 60-74, 1995.

37. K. Brandenburg et al., "Variable data-rate recording on a PC using MPEG-Audio Layer III," 95th Audio Engineering Society Convention, New York, 1993.

38. P.A. Sarginson, "MPEG-2: Overview of the system layer," BBC Research and Development Report, BBC RD 1996/2, 1996.

39. T. Ryden, C. Grewin, and S. Bergman, "The SR Report on the MPEG Audio subjective listening tests in Stockholm April/May 1991," ISO/IEC JTC1/SC29/WG11, Doc. No. MPEG 91/010, May 1991.

40. H. Fuchs, "Report on the MPEG/Audio subjective listening tests in Hannover," ISO/IEC JTC1/SC29/WG11, Doc. No. MPEG 91/331, November 1991.

41. G. Stoll et al., "Extension of ISO/MPEG-Audio Layer II to multi-channel coding: The future standard for broadcasting, telecommunication, and multimedia application," 94th Audio Engineering Society Convention, Berlin, preprint 3550, 1993.

42. B. Grill et al., "Improved MPEG-2 audio multi-channel encoding," 96th Audio Engineering Society Convention, Amsterdam, preprint 3865, 1994.

43. M. Bosi et al., "ISO/IEC MPEG-2 advanced audio coding," 101st Audio Engineering Society Convention, Los Angeles, preprint 4382, 1996.

44. J.D. Johnston et al., "NBC-Audio - Stereo and multichannel coding methods," 101st Audio Engineering Society Convention, Los Angeles, preprint 4383, 1996.

45. W.R.Th. ten Kate et al., "Matrixing of bit rate reduced audio signals," Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '92), Vol. 2, pp. II-205 - II-208, 1992.

46. W.R.Th. ten Kate, "Compatibility matrixing of multi-channel bit-rate-reduced audio signals," 96th Audio Engineering Society Convention, Amsterdam, preprint 3792, 1994.

47. F. Feige and D. Kirby, "Report on the MPEG/Audio Multichannel Formal Subjective Listening Tests," ISO/IEC JTC1/SC29/WG11, Doc. N0685, March 1994.

48. D. Meares and D. Kirby, "Brief Subjective Listening Tests on MPEG-2 Backwards Compatible Multichannel Audio Codecs," ISO/IEC JTC1/SC29/WG11, August 1994.

49. ISO/IEC JTC1/SC29, "Report on the formal subjective listening tests of MPEG-2 NBC multichannel audio coding," Document N1371, Oct. 1996.

50. ITU-R Document TG 10-2/3, Oct. 1991.

51. ISO/IEC JTC1/SC29, "Description of MPEG-4," Document N1410, Oct. 1996.

52. D.S. Burpee and P.W. Shumate, "Emerging residential broadband telecommunications," Proc. IEEE, Vol. 82, No. 4, pp. 604-614, 1994.

53. N.S. Jayant, "High Quality Networking of Audio-visual Information," IEEE Commun. Mag., pp. 84-95, 1993.

54. C. Todd et al., "AC-3: Flexible perceptual coding for audio transmission and storage," 96th Audio Engineering Society Convention, Amsterdam, preprint 3796, 1994.

55. R. Hopkins, "Choosing an American digital HDTV terrestrial broadcasting system," Proc. of the IEEE, Vol. 82, No. 4, pp. 554-563, 1994.

56. "The Grand Alliance," IEEE Spectrum, pp. 36-45, April 1995.

57. ETSI, European Telecommunication Standard, Draft prETS 300 401, Jan. 1994.

58. Ch. Weck, "The error protection of DAB," Audio Engineering Society Conference "DAB - The Future of Radio," London, May 1995.

59. R.D. Jurgen, "Broadcasting with Digital Audio," IEEE Spectrum, pp. 52-59, March 1996.

