MP3 Standard Tutorial
KRISTER LAGERSTRÖM
Master's Thesis
Computer Science and Engineering Program
CHALMERS UNIVERSITY OF TECHNOLOGY
Department of Computer Engineering
Gothenburg, Sweden 2001
May 2001
Abstract
Digital compression of audio data is important due to the bandwidth and storage limitations inherent in networks and computers. Algorithms based on perceptual coding are effective and have become feasible with faster computers. The ISO standard 11172-3 MPEG-1 layer III (a.k.a. MP3) is a perceptual codec that is presently very common for compression of CD quality music. An MP3 decoder has a complex structure and is computationally demanding. The purpose of this master's thesis is to present a tutorial on the standard. We have analysed several algorithms suitable for implementing an MP3 decoder, and their advantages and disadvantages with respect to speed, memory demands and implementation complexity. We have also designed and implemented a portable reference MP3 decoder in C.
Preface
This thesis is part of the requirements for the Master of Science degree at Chalmers University of Technology in Gothenburg, Sweden. The work was done at UniData HB by Krister Lagerström. I wish to thank Professor Per Stenström at Chalmers University for his guidance and support throughout the work.
Table of contents

1 Introduction
  1.1 Background
  1.2 Problem Statement
  1.3 Methods Used
  1.4 Results
  1.5 Thesis Organization
References
Glossary
1 Introduction
1.1 Background
Digital compression of audio has become increasingly important with the advent of fast and inexpensive microprocessors. It is used in many applications such as transmission of speech in the GSM mobile phone system, storing music in the DCC digital cassette format, and for DAB digital broadcast radio.

Normally no information loss is acceptable when compressing digital data such as programs, source code, and text documents. Entropy coding is the method most commonly used for lossless compression. It exploits the fact that not all bit combinations are equally likely to appear in the data, which is used in coding algorithms such as Huffman. This approach works for the data types mentioned above; however, audio signals such as music and speech cannot be efficiently compressed with entropy coding.

When compressing speech and music signals it is not crucial to retain the input signal exactly. It is sufficient that the output signal appears to sound identical to a human listener. This is the method used in perceptual audio coders. A perceptual audio coder uses a psychoacoustic effect called 'auditory masking', where the parts of a signal that are not audible due to the function of the human auditory system are reduced in accuracy or removed completely.

The international standard ISO 11172-3 ([2]) defines three different methods of increasing complexity and compression efficiency for perceptual coding of generic audio such as speech and music signals. This thesis deals exclusively with the third method, also known as MP3. It has become very popular for compressing CD quality music with almost no audible degradation, from 1.4 Mbit/s down to 128 kbit/s. This means that an ISDN connection can be used for real-time transmission and that a full-length song can be stored in 3-4 Mbytes.

An MP3 decoder is relatively complex and CPU intensive. A commercial implementation must therefore be carefully designed in order to be cost-effective.
This thesis is intended to serve both as a tutorial on the standard and as a reference model of an implementation. The target audience of this document is mainly design engineers who need to write an MP3 decoder, e.g. for an embedded DSP.
One goal has been to compile an introduction to the subject of MP3 encoding and decoding as well as psychoacoustics. There exist a number of studies of various parts of the decoder, but complete treatments on a technical level are not as common. We have used material from papers, journals, and conference proceedings that best describe the various parts.

Another goal has been to search for algorithms that can be used to implement the most demanding components of an MP3 decoder:
- Huffman decoding of samples
- Requantization of samples
- Inverse Modified Discrete Cosine Transform (IMDCT)
- Polyphase filterbank

A third goal is to evaluate their performance with regard to speed, memory requirements, and complexity. These properties were chosen because they have the greatest impact on the implementation effort and the computation demands for MP3 decoding. The Huffman decoding of samples deals with variable length decoding of samples from the bitstream. The other three parts all deal with various mathematical transforms of the samples that are specified by the standard.

A final goal has been to design and implement an MP3 decoder. This should be done in C for Unix. The source code should be easy to understand so that it can serve as a reference on the standard for designers who need to implement a decoder.
1.4 Results
We have compiled an introduction to the subject of MP3 decoders from existing sources. The introduction relates the decoder to the encoder, as well as giving a background on perceptual audio coding. The decoder is also described in some depth.

The search for efficient algorithms has been successful and we give several examples of algorithms that are significantly better than those described by the standard. The Huffman decoder was found to be best implemented using a combination of binary trees and table lookups. For the requantization part we did not find any really fast algorithms for the calculations; instead a table lookup method was found to be best in the general case. An algorithmic approach is also described.
Design and Implementation of an MP3 Decoder May 2001 9
Both the IMDCT and the polyphase filterbank have been shown to be best computed using fast DCT algorithms. We also implemented the reference decoder in C as planned. We hope that it will be useful as a definitive guide to the unclear parts of the standard.
The lossy compression scheme described here achieves coding gain by exploiting both perceptual irrelevancies and statistical redundancies. Most perceptual audio coders follow the general outline of figure 1 below.
FIGURE 1. General structure of a perceptual audio coder: the input s(n) passes through time/frequency analysis; quantization and encoding is steered by the psychoacoustic analysis (masking thresholds); the result goes through entropy (lossless) coding and multiplexing.
The coder segments the input s(n) into quasistationary frames ranging from 2 to 50 ms in duration. Then a time-frequency analysis block estimates the temporal and spectral components of each frame. These components are mapped to the analysis properties of the human auditory system and the time-frequency parameters suitable for quantization and encoding are extracted. The psychoacoustic block allows the quantization and encoding block to exploit perceptual irrelevancies in the time-frequency parameter set. The remaining redundancies are typically removed through lossless entropy coding techniques.

2.2.2 Psychoacoustic Principles
The field of psychoacoustics deals with characterizing human auditory perception, in particular the time-frequency analysis capabilities of the inner ear. Most current audio coders achieve compression by exploiting the fact that irrelevant signal information is not detectable even by a sensitive listener. The inner ear performs short-term critical band analyses where frequency-to-place transformations occur along the basilar membrane [1]. The power spectra are not represented on a linear frequency scale but on limited frequency bands called critical bands. The auditory system can roughly be described as a bandpass filterbank, consisting of strongly overlapping bandpass filters with bandwidths on the order of 50 to 100 Hz for signals below 500 Hz and up to 5000 Hz for signals at high frequencies. Twenty-six critical bands covering frequencies of up to 24 kHz have to be taken into account.

Simultaneous masking is a frequency domain phenomenon where a low-level signal (the maskee) can be made inaudible (masked) by a simultaneously occurring stronger signal (the masker), as long as masker and maskee are close enough to each other in frequency. Such masking is largest in the critical band in which the masker is located, and it is effective to a lesser degree in neighboring bands.
In addition to simultaneous masking, the time-domain phenomenon of temporal masking plays an important role in human auditory perception. It may occur when two sounds appear within a small interval of time. Depending on the signal levels, the stronger sound may mask the weaker one, even if the maskee precedes the masker. The duration within which premasking applies is significantly less than that of the postmasking, which is on the order of 50 to 200 ms.

A masking threshold can be measured, and low-level signals below this threshold will not be audible, as illustrated by figure 2 below. The masked signal can consist of low-level signal contributions, of quantization noise, aliasing distortion, or of transmission errors. The threshold will vary with time and depend on the sound pressure level, the frequency of the masker, and on characteristics of masker and maskee (e.g. noise is a better masker than a tone). Without a masker, a signal is inaudible if its sound pressure level is below the threshold in quiet. This depends on the frequency and covers a dynamic range of more than 60 dB, as shown by the lower curve of figure 2 below.
FIGURE 2. Threshold in quiet and masking threshold (acoustical events under the masking thresholds will not be audible). (Source: [1]).
(Figure: encoder overview. The PCM input is fed both to the analysis filterbank, which produces subband samples, and to an FFT used to compute the masking thresholds; the subband samples are then scaled and quantized.)
The input to the encoder is normally PCM coded data that is split into frames of 1152 samples. The frames are further divided into two granules of 576 samples each. The frames are sent to both the Fast Fourier Transform (FFT) block and the analysis filterbank.

2.3.2 FFT Analysis
The FFT block transforms granules of 576 samples to the frequency domain using a Fourier transform.

2.3.3 Masking Thresholds
The frequency information from the FFT block is used together with a psychoacoustic model to determine the masking thresholds for all frequencies. The masking thresholds are applied by the quantizer to determine how many bits are needed to encode each sample. They are also used to determine if window switching is needed in the MDCT block.

2.3.4 Analysis Filterbank
The analysis filterbank consists of 32 bandpass filters of equal width. The output of the filters is critically sampled: for each granule of 576 samples there are 18 samples output from each of the 32 bandpass filters, which gives a total of 576 subband samples.

2.3.5 MDCT with Dynamic Windowing
The subband samples are transformed to the frequency domain through a modified discrete cosine transform (MDCT). The MDCT is performed on blocks that are windowed and overlapped 50%.
The MDCT is normally performed on 18 samples at a time (long blocks) to achieve good frequency resolution. It can also be performed on 6 samples at a time (short blocks) to achieve better time resolution and to minimize pre-echoes. There are special window types for the transition between long and short blocks.

2.3.6 Scaling and Quantization
The masking thresholds are used to iteratively determine how many bits are needed in each critical band to code the samples so that the quantization noise is not audible. The encoder usually also has to meet a fixed bitrate requirement. The Huffman coding is part of the iteration, since it is not otherwise possible to determine the number of bits needed for the encoding.

2.3.7 Huffman Coding and Bitstream Generation
The quantized samples are Huffman coded and stored in the bitstream along with the scale factors and side information.

2.3.8 Side Information
The side information contains parameters that control the operation of the decoder, such as Huffman table selection, window switching and gain control.
(Figure: decoder overview. The bitstream is parsed into Huffman information and scalefactor information; after Huffman and scalefactor decoding, the samples are requantized and processed through the IMDCT, frequency inversion and the synthesis filterbank to produce the left and right output channels.)
The different parts of the decoder are described in more detail below.
2.4.2 Frame Format
The frame is a central concept when decoding MP3 bitstreams. It consists of 1152 mono or stereo frequency-domain samples, divided into two granules of 576 samples each. Each granule is further divided into 32 subband blocks of 18 frequency lines apiece:
FIGURE 5. Format of MP3 frame, granules, subband blocks and frequency lines.
The frequency spectrum ranges from 0 to FS/2 Hz. The subbands divide the spectrum into 32 equal parts. The subbands each contain 18 samples that have been transformed to the frequency domain by a modified discrete cosine transform (MDCT). The 576 frequency lines in a granule are also divided into 21 scalefactor bands that have been designed to match the critical band frequencies as closely as possible. The scalefactor bands are used primarily for the requantization of the samples. The frame consists of four parts: header, side information, main data, and ancillary data:
(Figure: a frame consists of the header, side information, main data and ancillary data sections, in that order.)
The length of a frame is constant for a fixed bitrate, with the possible deviation of one byte to maintain an exact bitrate. There is also a variable bitrate format where the frame lengths can vary according to the momentary demands of the encoder. The main data (scalefactors and Huffman coded data) are not necessarily located adjacent to the side information, as shown in figure 7 below.
2.4.2.1 Header
The header is always 4 bytes long and contains information about the layer, bitrate, sampling frequency and stereo mode. It also contains a 12-bit syncword that is used to find the start of a frame in a bitstream, e.g. for broadcasting applications.

2.4.2.2 Side Information
The side information section contains the information necessary to decode the main data, such as Huffman table selection, scale factors, requantization parameters and window selection. This section is 17 bytes long in single channel mode and 32 bytes in dual channel mode.

2.4.2.3 Main Data
The main data section contains the coded scale factor values and the Huffman coded frequency lines (main data). The length depends on the bitrate and the length of the ancillary data. The length of the scale factor part depends on whether scale factors are reused, and also on the window length (short or long). The scalefactors are used in the requantization of the samples; see section 2.4.4 for details.

The demand for Huffman code bits varies with time during the coding process. The variable bitrate format can be used to handle this, but a fixed bitrate is often a requirement for an application (e.g. for broadcasting). Therefore a bit reservoir technique is also defined that allows unused main data storage in one frame to be used by up to two consecutive frames:
FIGURE 7. Bit reservoir example: each frame's header (with syncword) and side information are at fixed positions, while the main data sections (main data 1..main data 4) may begin in the main data areas of earlier frames.
In this example frame 1 uses bits from frames 0 and 1. Frame 2 uses bits from frame 1. Frame 3, which has a high demand, uses bits from frames 1, 2 and 3. Finally, frame 4 uses bits only from frame 4.
The main_data_begin parameter in the side information indicates whether bits from previous frames are needed. All the main data for one frame is stored in that and previous frames. The maximum size of the bit reservoir is 511 bytes.

2.4.2.4 Ancillary Data
This section is intended for user-defined data and is not specified further in the standard. It is not needed to decode the audio data.

2.4.3 Huffman Decoding
The Huffman data section contains the variable-length encoded samples. The Huffman coding scheme assumes that large values occur at the low spectral frequencies while mainly low values and zeroes occur at the high spectral frequencies. Therefore, the 576 spectral lines of each granule are partitioned into five regions as illustrated in this figure:
(Figure: the 576 frequency lines of a granule are partitioned, from low to high frequency, into the big_value regions region0, region1 and region2 (values in [-8206..8206]), the count1 region (values in [-1..1]) and the rzero region (zeros only).)
The rzero region contains only zero values, while the count1 region contains small values ranging from -1 to 1 and the big_value region contains values from -8206 to 8206. Different Huffman code tables are used depending on the maximum quantized value and the local statistics of the signal. There are a total of 32 possible tables given in the standard. Each of the four regions in big_value and count1 can use a different Huffman table for decoding.

The count1 parameter, which indicates the number of frequency lines in the count1 region, is not explicitly coded in the bitstream. The end of the count1 region is known only when all bits for the granule (as specified by part2_3_length) have been exhausted, and the value of count1 is thus known implicitly only after decoding the count1 region. Chapter 3.2 contains a survey of different ways to implement the Huffman decoder.
2.4.4 Requantization
The sample requantization block uses the scale factors to convert the Huffman decoded values is_i back to their spectral values xr_i using the following formula:

xr_i = is_i^(4/3) * 2^(0.25*C)

Requantization of samples. (EQ 1)
The factor C in the equation consists of global and scalefactor band dependent gain factors from the side information and the scale factors. Chapter 3.3 contains a survey of different ways to implement the sample requantization.

2.4.5 Reordering
The requantized samples must be reordered for the scalefactor bands that use short windows. In this example there are a total of 18 samples in a band that contains 3 windows of 6 samples each:
Dequantized samples (interleaved by frequency, low to high):
a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4 a5 b5 c5 a6 b6 c6

Reordered samples (one window at a time):
a1 a2 a3 a4 a5 a6
b1 b2 b3 b4 b5 b6
c1 c2 c3 c4 c5 c6
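Assuming the interleaved layout shown above, the reordering of one such band might be sketched as follows (the function name and the window-width parameter w, which is 6 in the figure, are illustrative):

```c
/* Reordering of one scalefactor band that uses short windows:
 * input interleaved by frequency (a1 b1 c1 a2 b2 c2 ...),
 * output window by window (a1..aw b1..bw c1..cw). */
static void reorder_band(const double *in, double *out, int w)
{
    for (int f = 0; f < w; f++)            /* frequency index in window */
        for (int win = 0; win < 3; win++)  /* windows a, b, c */
            out[win * w + f] = in[f * 3 + win];
}
```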
The short windows are reordered in the encoder to make the Huffman coding more efficient, since samples close in frequency (low or high) are more likely to have similar values.

2.4.6 Stereo Decoding
The compressed bitstream can support one or two audio channels in one of four possible modes [5]:
1. a monophonic mode for a single audio channel,
2. a dual-monophonic mode for two independent audio channels (functionally identical to the stereo mode),
3. a stereo mode for stereo channels that share bits but do not use joint-stereo coding, and
4. a joint-stereo mode that takes advantage of either the correlations between the stereo channels (MS stereo) or the irrelevancy of the phase difference between channels (intensity stereo), or both.

The stereo processing is controlled by the mode and mode_extension fields in the frame header.

2.4.6.1 MS Stereo Decoding
In the MS stereo mode the left and right channels are transmitted as the sum (M) and difference (S) of the two channels, respectively. This mode is suitable when the two channels are highly correlated, which means that the sum signal will contain much more information than the difference signal. The stereo signal can therefore be compressed more efficiently compared to transmitting the two stereo channels independently of each other. In the decoder the left and right channels can be reconstructed using the following equation, where i is the frequency line index:

L_i = (M_i + S_i) / sqrt(2)   and   R_i = (M_i - S_i) / sqrt(2)

MS Stereo Decoding. (EQ 2)
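A direct transcription of the MS reconstruction, L = (M + S)/sqrt(2) and R = (M - S)/sqrt(2), might look like this (the in-place buffer layout is an assumption: M arrives in the left channel buffer, S in the right):

```c
#include <math.h>

/* MS stereo reconstruction, done in place over n frequency lines. */
static void ms_decode(double *left, double *right, int n)
{
    const double inv_sqrt2 = 1.0 / sqrt(2.0);
    for (int i = 0; i < n; i++) {
        double m = left[i];
        double s = right[i];
        left[i]  = (m + s) * inv_sqrt2;   /* L = (M + S)/sqrt(2) */
        right[i] = (m - s) * inv_sqrt2;   /* R = (M - S)/sqrt(2) */
    }
}
```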
The MS stereo processing is lossless.

2.4.6.2 Intensity Stereo Decoding
In intensity stereo mode the encoder codes some upper-frequency subband outputs with a single sum signal L+R instead of sending independent left (L) and right (R) subband signals. The balance between left and right is transmitted instead of scalefactors. The decoder reconstructs the left and right channels based only on the single L+R (= L'_i) signal, which is transmitted in the left channel, and the balance, which is transmitted instead of scalefactors (is_pos_sfb) for the right channel:

is_ratio_sfb = tan( is_pos_sfb * pi/12 )

L_i = L'_i * is_ratio_sfb / (1 + is_ratio_sfb)   and   R_i = L'_i * 1 / (1 + is_ratio_sfb)

Intensity Stereo Decoding. (EQ 3)
The is_pos_sfb parameter is limited to values between 0 and 6, so the tan() function is easily replaced by a small lookup table.
2.4.7 Alias Reduction
The alias reduction is required to negate the aliasing effects of the polyphase filterbank in the encoder. It is not applied to granules that use short blocks. The alias reduction consists of eight butterfly calculations for each subband, as illustrated by the figure below.
FIGURE 10. Alias reduction butterflies (Source: [4]). The cs_i and ca_i constants are tabulated in [2].
2.4.8 IMDCT
The IMDCT (Inverse Modified Discrete Cosine Transform) transforms the frequency lines (X_k) to polyphase filter subband samples (S_i). The analytical expression of the IMDCT is shown below, where n is 12 for short blocks and 36 for long blocks:

x_i = sum_{k=0}^{n/2-1} X_k * cos( (pi/(2n)) * (2i + 1 + n/2) * (2k + 1) ),   for i = 0..n-1

IMDCT Transform. (EQ 4)
In the case of long blocks the IMDCT generates an output of 36 values for every 18 input values. The output is windowed depending on the block type (start, normal, stop) and the first half is overlapped with the second half of the previously saved block. In the case of short blocks three transforms are performed, producing 12 output values each. The three vectors are windowed and overlapped with each other. Concatenating 6 zeros on both ends of the resulting vector gives a vector of length 36, which is processed like the output of a long transform. The overlapped addition operation is illustrated by the following figure:
(Figure: overlap-add. The windowed IMDCT output (36 samples, 0..35) is split: the first half (0..17) is added to save_{n-1}, the saved second half of the previous block, giving result_n; the second half (18..35) is stored as save_n for the next block.)
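The overlap-add for one subband in the long-block case can be sketched as follows (buffer names are illustrative; save[] persists between granules and must be cleared when seeking):

```c
/* Overlap-add for one subband, long blocks: the 36 windowed IMDCT
 * output samples are combined with the half saved from the previous
 * granule. save[] holds 18 values per subband across calls. */
static void overlap_add(const double win_out[36], double save[18],
                        double result[18])
{
    for (int i = 0; i < 18; i++) {
        result[i] = win_out[i] + save[i];  /* first half + saved half   */
        save[i]   = win_out[18 + i];       /* keep second half for next */
    }
}
```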
The output from the IMDCT operation is 18 time-domain samples for each of the 32 subband blocks. Chapter 3.4 contains a survey of different ways to implement the IMDCT transform.

N.B.: It is important to clear the overlapped addition buffers when performing random seeking in the decoder in order to avoid noise in the output.

2.4.9 Frequency Inversion
In order to compensate for frequency inversions in the synthesis polyphase filterbank, every odd time sample of every odd subband is multiplied by -1. The subbands are numbered [0..31] and the time samples in each subband [0..17].
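This step is a simple sign flip. A sketch, assuming (hypothetically) that the granule is held as a 32x18 array indexed as s[subband][time]:

```c
/* Frequency inversion: negate every odd time sample of every odd
 * subband, compensating for inversions in the synthesis filterbank. */
static void frequency_inversion(double s[32][18])
{
    for (int sb = 1; sb < 32; sb += 2)    /* odd subbands      */
        for (int t = 1; t < 18; t += 2)   /* odd time samples  */
            s[sb][t] = -s[sb][t];
}
```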
2.4.10 Synthesis Polyphase Filterbank
The synthesis polyphase filterbank transforms the 32 subband blocks of 18 time-domain samples in each granule to 18 blocks of 32 PCM samples. The filterbank operates on 32 samples at a time, one from each subband block, as illustrated by the following figure:
(Figure: the 32 subband samples are matrixed into the 64-value V vector; the 512-value U vector is assembled from the V-vector FIFO, multiplied by the 512-value D window to form the W vector, and finally collapsed into 32 output samples.)
In the synthesis operation, the 32 subband values are transformed to the 64-value V vector using a variant of the IMDCT (matrixing). The V vector is pushed into a FIFO which stores the last 16 V vectors. A U vector is created from the alternate 32-component blocks in the FIFO as illustrated, and a window function D is applied to U to produce the W vector. The reconstructed samples are obtained from the W vector by decomposing it into 16 vectors, each 32 values in size, and summing these vectors. Chapter 3.5 contains a survey of different ways to implement the filterbank.
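The U/W steps described above might be sketched as follows. The assumptions here are that the 16 most recent V vectors are kept newest-first in a FIFO array, that U takes the first 32 values of each even-indexed vector and the last 32 of each odd-indexed one, and that the 512-value D window table from the standard is supplied in d[]:

```c
/* Windowing steps of the synthesis filterbank: build U from the
 * V-vector FIFO, apply the D window, and sum the 16 sub-vectors of
 * 32 values into one block of 32 output samples. */
static void synth_window(double fifo[16][64], const double *d,
                         double pcm[32])
{
    double u[512];
    /* U from alternating 32-value blocks of the FIFO. */
    for (int i = 0; i < 8; i++)
        for (int j = 0; j < 32; j++) {
            u[i * 64 + j]      = fifo[2 * i][j];
            u[i * 64 + 32 + j] = fifo[2 * i + 1][32 + j];
        }
    /* W = U * D, and the 16-way summation, fused into one pass. */
    for (int j = 0; j < 32; j++) {
        double s = 0.0;
        for (int i = 0; i < 16; i++)
            s += u[j + 32 * i] * d[j + 32 * i];
        pcm[j] = s;
    }
}
```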
N.B.: The vector V has to be cleared at the start of each song, or when performing random seeking in the bitstream.
The trees are traversed according to the bits in the bitstream, where a 0 might mean go left and a 1 go right. An entire code-word is fully decoded when a leaf is encountered. The leaves contain the values for the spectral lines.

3.2.4 Direct Table Lookup
For the direct table lookup method the decoder uses large tables. The length of each table is 2^b, where b is the maximum number of bits in the longest code-word for that table. To decode a code-word, the decoder reads b bits. The bits are used as a direct index into the table, where each entry contains the spectral line values and information about the real length of the code-word. The surplus bits must then be re-used for the next code-word.

3.2.5 Clustered Decoding
The clustered decoding method combines the binary tree and direct table methods. A fixed number of bits (e.g. 4) is read from the bitstream and used as a lookup index into a table. Each table element contains a hit/miss bit that indicates whether the code-word has been fully decoded yet. If a hit is detected, the symbol is read from the table element as well as the number of bits used for the code-word. If it is a miss, the decoding continues by using the information from the table element to determine how many more bits to read from the bitstream for the next index, as well as the starting address of the next table to use.
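As an illustration of the direct table lookup method, here is a sketch using a hypothetical 3-codeword table, not one of the 32 tables from the standard. The bit buffer is one bit per byte for clarity; a real decoder packs bits and pads the buffer so the b-bit peek never runs past the end:

```c
/* Toy Huffman table: codewords "0" -> 0, "10" -> 1, "11" -> 2,
 * so b = 2 and the table has 2^b = 4 entries. Each entry stores the
 * decoded value and the true codeword length. */
typedef struct { int value; int len; } huff_entry;

static const huff_entry huff_table[4] = {
    {0, 1}, {0, 1},   /* "0?" : codeword "0", true length 1 */
    {1, 2},           /* "10" */
    {2, 2},           /* "11" */
};

/* Peek b bits, then advance by the true codeword length so the
 * surplus bits are re-used by the next lookup. */
static int huff_decode(const unsigned char *bits, int *pos)
{
    int idx = (bits[*pos] << 1) | bits[*pos + 1];
    *pos += huff_table[idx].len;
    return huff_table[idx].value;
}
```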
3.3 Requantizer
3.3.1 Definition
The requantization formula describes the processing to rescale the Huffman coded data. Global gain and subblock gain affect all values within one time window. Scalefactors and preflag further adjust the gain within each scalefactor band. The following is the requantization equation for short windows. The Huffman decoded value at buffer index i is called is_i, and the input to the reordering block at index i is called xr_i:

xr_i = sign(is_i) * |is_i|^(4/3) * 2^(A/4) * 2^(-B)

A = global_gain[gr] - 210 - 8 * subblock_gain[window][gr]
B = scalefac_multiplier * scalefac_s[gr][ch][sfb][window]

Requantization of samples, short blocks. (EQ 5)
xr_i = sign(is_i) * |is_i|^(4/3) * 2^(A/4) * 2^(-B)

A = global_gain[gr] - 210
B = scalefac_multiplier * ( scalefac_l[gr][ch][sfb] + preflag[gr] * pretab[sfb] )

Requantization of samples, long blocks. (EQ 6)
Pretab[sfb] is tabulated in the standard. It is used to amplify high-frequency scalefactor bands. The value 210 is a system constant that is needed to scale the samples. It is important to note that the maximum value of is_i is 8206, not 8191 as stated by the standard [2]. The reason for this is that the Huffman decoder adds 15 to the value of linbits. Linbits can be 13 bits long, which gives a maximum value of 2^13 - 1 = 8191 for the linbits part alone.

3.3.2 Implementation Issues
Both the is_i^(4/3) and the 2^(A/4) * 2^(-B) power functions are computationally expensive to implement using the standard math library function pow(). This is true even if it is calculated using an FPU or DSP.

The is_i^(4/3) function in the requantizer can assume 8207 different values. A lookup table is fast, but would require approximately 256 kbits of memory. We therefore also look at an algorithmic approach in section 3.3.4 below.

The function 2^(0.25*A) * 2^(-B) does not assume more than 384 different values. That means that a lookup table is probably the best choice even for memory-constrained implementations. It can be noted that the table can be made even smaller (196 values) by rounding small values (< 2^-35) of the function down to zero, since that will not affect the end result.

3.3.3 Table-based Approach for y = x^(4/3)
A lookup table for the y = x^(4/3) function is easy to implement. The table could be included as part of the initialized data section, or it might be generated at run-time if the pow() function is available. If there is enough memory the lookup table could include the negative values of is_i as well. This will speed up the decoding.
3.3.4 Newton's Method for y = x^(4/3)
The y = x^(4/3) function can be rewritten as y^3 - x^4 = 0. This form is suitable for Newton's method of root-finding, which will yield a value of y that approximates x^(4/3). The function result is calculated through repeated iterations that successively reduce the residual error | y - x^(4/3) |:

y_{n+1} = y_n - (y_n^3 - x^4) / (3*y_n^2) = (2*y_n^3 + x^4) / (3*y_n^2)   (EQ 7)

The formula is rewritten into the second form to avoid floating-point cancellation. The starting value y_0 for the iteration formula affects the number of iterations needed to achieve the desired accuracy. For this application an accuracy of more than 16 bits is sufficient. A good starting value y_0 is calculated by the polynomial fit function y_0 = a_0 + a_1*x + a_2*x^2. This function is designed to resemble y = x^(4/3) as closely as possible for 0 < x < 8207. The starting value will yield the desired accuracy in 3 iterations.
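A sketch of the iteration in C. The polynomial starting-value coefficients a_0..a_2 are not reproduced here, so this version starts from the cruder guess y_0 = x and iterates to convergence; with the tuned polynomial start, 3 iterations suffice as stated above:

```c
#include <math.h>

/* Newton iteration for y = x^(4/3), solving y^3 - x^4 = 0 (EQ 7). */
static double x43_newton(double x)
{
    if (x == 0.0)
        return 0.0;
    double x4 = x * x * x * x;
    double y = x;                  /* crude starting value */
    for (int i = 0; i < 60; i++) {
        double y2 = y * y;
        double next = (2.0 * y * y2 + x4) / (3.0 * y2);
        double diff = fabs(next - y);
        y = next;
        if (diff < 1e-12 * y)      /* converged to ~double precision */
            break;
    }
    return y;
}
```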
3.4 IMDCT

3.4.1 Definition
The IMDCT is defined by the standard as:

x_i = sum_{k=0}^{n/2-1} X_k * cos( (pi/(2n)) * (2i + 1 + n/2) * (2k + 1) ),   for i = 0..n-1

IMDCT Transform. (EQ 8)
The value of n in the expression can be either 12 for short blocks or 36 for long blocks. The output from the IMDCT must be windowed with a fixed function and overlapped with the data from the previous block.

3.4.2 Implementation Issues
The IMDCT operation is very computationally expensive to implement as it is defined by the standard. A lookup table can be used to replace the cos() function, but the inner loop of the equation will still require substantial processing. We therefore investigate faster algorithms for the IMDCT below.
The window functions can be replaced by a 4 kbit lookup table.

3.4.3 Direct Calculation
A direct calculation of the IMDCT operation is easy to implement, since it only consists of two simple nested for-loops. A lookup table can be used to replace the cos() function call in the inner loop.

3.4.4 Fast IMDCT Implementation
Marovich has shown in [17] that Konstantinides' method ([16]) of accelerating the polyphase filterbank matrixing operation can also be applied to the 12- and 36-point IMDCT operations:
(Figure: the N/2 subband samples are transformed by an N/2-point IDCT; the N-point IMDCT output is then assembled from copied and negated blocks (B, -B, -A, -A) of the IDCT result.)
The N-point result is identical to the N-point IMDCT as dened by the standard. This means that only 6 and 18 points need to be computed, respectively. These points can be computed from a modied version of the IDCT using a Lee-style ([18]) method for decomposing the 6- and 18-point transforms into 3-, 4-, and 5-point IDCT kernels.
The short block 6-point transform is decomposed into two 3-point transforms that can be evaluated directly:
(Diagram: the even inputs X(0), X(2), X(4) feed one 3-point IDCT and the odd inputs X(1), X(3), X(5) the other; the outputs are combined through butterflies with multipliers of the form 1/(2*cos(k*pi/12)) to produce x(0)..x(5).)
FIGURE 14. Decomposition of the 6-point IDCT into two 3-point kernels.
The long block 18-point transform is decomposed in a similar fashion into two 9-point parts. These 9-point parts are then decomposed further into a 4- and a 5-point part which are directly evaluated.
Steps 2 and 3 above are straightforward to implement, especially on a DSP that has special addressing capabilities. Step 1 is also straightforward to implement as it is defined by the standard:

for i = 0 to 63:   V_i = sum_{k=0}^{31} N_ik * S_k,   where N_ik = cos( (16 + i) * (2k + 1) * pi/64 )

(EQ 9)
3.5.2 Implementation Issues
We have examined steps 2 and 3 for possible enhancements, but there are no obvious ways to improve upon them. A literature search for improvements did not yield any results either. Two possible implementations for the matrixing operation (step 1) are described below.

3.5.3 Direct Calculation
A direct calculation of the matrixing operation is easy to implement, since it only consists of two nested for-loops.
3.5.4 32-point Fast DCT Implementation
Konstantinides has shown in [16] that the matrixing operation in step 1 can be substantially improved by the use of a 32-point fast DCT transformation and some data copy operations:
[Figure: the 32 subband samples S_k are transformed by a 32-point DCT; the 32-point result is then expanded to the 64 values V_i by copy and negate operations (sections A, B, -B, -A).]
The problem is then reduced to finding a good implementation of the 32-point DCT:

    V_i = sum_{k=0}^{31} S_k * cos((2k + 1) i pi / 64),   for i = 0 to 31    (EQ 10)
One of the common fast DCT algorithms for 2^m points is described by Lee in [18]. It has a simple recursive structure where the transform is decomposed into even and odd parts:
    X(n) = sum_{k=0}^{N-1} x(k) * cos((2k + 1) n pi / (2N)),   for n = 0 to N-1

    X(2n) = G(n),   X(2n + 1) = H(n) + H(n + 1)   (with H(N/2) = 0)

    G(n) = sum_{k=0}^{N/2-1} g(k) * cos((2k + 1) n pi / N),   g(k) = x(k) + x(N-1-k)

    H(n) = sum_{k=0}^{N/2-1} h(k) * cos((2k + 1) n pi / N),   h(k) = (x(k) - x(N-1-k)) / (2 cos((2k + 1) pi / (2N)))

    for n = 0 to N/2-1    (EQ 11)
The even and odd parts can themselves be decomposed in the same way until the parts are small enough to be computed through direct evaluation, e.g. when N=2.
The processing performance is clearly the worst of the three methods, since every bit of the code-word has to be handled individually. The memory requirements are moderate since the tables are efficiently stored.

4.2.3 Direct Table Lookup
This method is easy to implement, with the possible exception of generating the necessary tables. The processing performance is clearly the best of the three methods, since every decoding operation completes in a short, fixed time. The drawback is the very large tables, which will be on the order of several megabits.

4.2.4 Clustered Decoding
This method is moderately simple to implement. The difficult part could be generating the tables. Salomonsen et al. have studied this method in conjunction with MP3 decoding [4]. They have shown that individual tables should be at most 16 elements long when decoding MP3. It is further shown that the processing requirement is approximately 1 MIPS for a RISC-based architecture and the memory requirement is 56 kbits for the lookup tables.
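As an illustration of the direct lookup idea, the sketch below uses a small hypothetical prefix code, not one of the actual MP3 Huffman tables. The table is indexed by the next MAXBITS bits of the stream, and each entry stores the decoded value together with the true codeword length, so each symbol costs one lookup plus one bit-pointer advance:

```c
#include <stdint.h>

#define MAXBITS 4   /* longest codeword in this toy code */

struct hentry { uint8_t value; uint8_t length; };

/* Toy prefix code: 0 -> "0", 1 -> "10", 2 -> "110", 3 -> "111".
 * Every table index whose leading bits match a codeword maps to
 * that symbol, so the trailing "don't care" bits are absorbed. */
static const struct hentry table[1 << MAXBITS] = {
    {0,1},{0,1},{0,1},{0,1},{0,1},{0,1},{0,1},{0,1},  /* 0xxx */
    {1,2},{1,2},{1,2},{1,2},                          /* 10xx */
    {2,3},{2,3},                                      /* 110x */
    {3,3},{3,3}                                       /* 111x */
};

/* Decode one symbol; bits[] holds one bit per byte for clarity,
 * and *bitpos is advanced by the decoded codeword's length. */
static unsigned decode_one(const uint8_t *bits, unsigned *bitpos)
{
    unsigned idx = 0;
    for (int i = 0; i < MAXBITS; i++)      /* peek MAXBITS bits */
        idx = (idx << 1) | bits[*bitpos + i];
    *bitpos += table[idx].length;
    return table[idx].value;
}
```

The real MP3 tables contain much longer codewords, which is why a single flat table grows to several megabits; clustered decoding splits it into the small sub-tables discussed above.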
4.3 Requantizer
4.3.1 General
The requantization step must be performed once for each sample in the bitstream.

4.3.2 y = x^(4/3), Table-based Implementation
This implementation is simple to realize. It requires only one table lookup operation for every sample, which translates to a maximum of 100,000 operations per second. The drawback is that it requires a 256 kbit table (8207 floating point values), which could be too large for some applications.

4.3.3 y = x^(4/3), Newton's Method
This method could be better than the table-based approach when it is too costly to add the memory needed for the table. The drawback is that it requires 7 flop to calculate y_0 and x^4, and a further 5 flop times 3 iterations = 15 flop to calculate the final value of y. That results in a total of 22 flop per sample, for a maximum of 2.1 Mflops. This method is not as straightforward as the table-based implementation, but is still relatively easy to realize.
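A sketch of the Newton variant: the iteration solves y^3 = x^4, i.e. y_{n+1} = (2 y_n + x^4 / y_n^2) / 3, which costs the 5 flop per iteration quoted above. The seed used here is a deliberately crude one (an assumption, not the thesis's 7-flop seed), so the loop runs more than 3 iterations to guarantee convergence:

```c
#include <math.h>

/* Requantization power y = x^(4/3) by Newton iteration on
 * f(y) = y^3 - x^4, for quantized values x in [0, 8206]. */
static float pow_4_3(float x)
{
    if (x < 1.0f)       /* quantized values are integers >= 0,  */
        return x;       /* and 0^(4/3) = 0, 1^(4/3) = 1          */

    float x2 = x * x;
    float x4 = x2 * x2;
    float y = x2;       /* crude seed; a better seed needs only 3 iterations */
    for (int i = 0; i < 25; i++)
        y = (2.0f * y + x4 / (y * y)) / 3.0f;   /* 5 flop per iteration */
    return y;
}
```

For positive seeds this cube-root-style iteration always converges; the thesis's flop budget assumes a seed close enough to the root that 3 iterations suffice.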
Design and Implementation of an MP3 Decoder
The fast IMDCT is clearly superior to the direct method in terms of processing requirements.
7 References
[1] P. Noll, "MPEG Digital Audio Coding", IEEE Signal Processing Magazine, pp. 59-81, Sep. 1997.
[2] ISO/IEC 11172-3, "Information technology - Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s - Part 3: Audio", first edition, Aug. 1993.
[3] T. Painter and A. Spanias, "Perceptual Coding of Digital Audio", Proceedings of the IEEE, vol. 88, no. 4, pp. 451-513, April 2000.
[4] K. Salomonsen et al., "Design and Implementation of an MPEG/Audio Layer III Bitstream Processor", Master's thesis, Aalborg University, Denmark, 1997.
[5] D. Pan, "A Tutorial on MPEG/Audio Compression", IEEE Multimedia, vol. 2, issue 2, pp. 60-74, Summer 1995.
[6] S. Shlien, "Guide to MPEG-1 Audio Standard", IEEE Transactions on Broadcasting, vol. 40, no. 4, Dec. 1994.
[7] R. Schäfer, "MPEG-4: a multimedia compression standard for interactive applications and services", Electronics and Communication Engineering Journal, pp. 253-262, Dec. 1998.
[8] R. Koenen, "MPEG-4: Multimedia for our time", IEEE Spectrum, pp. 26-33, Feb. 1999.
[9] D. Pan et al., IIS MP3 Decoder Source Code, http://www.mp3-tech.org, April 1995.
[10] W. Jung, SPLAY MP3 Decoder Source Code, http://splay.sourceforge.net, April 2001.
[11] M. Hipp et al., MPG123 MP3 Decoder Source Code, http://www.mpg123.de, April 2001.
[12] K. Lagerström, MP3 Reference Decoder Source Code, http://www.dtek.chalmers.se/~d2ksla, April 2001.
[13] M. Dietz et al., MPEG-1 Audio Layer III test bitstream package, http://www.iis.fhg.de, May 1994.
[14] Analog Devices Inc., ADSP-21061 SHARC DSP, http://www.analog.com, April 2001.
[15] Free Software Foundation, GNU Compiler Collection, http://www.fsf.org, April 2001.
[16] K. Konstantinides, "Fast Subband Filtering in MPEG Audio Coding", IEEE Signal Processing Letters, vol. 1, no. 2, Feb. 1994.
[17] S. Marovich, "Faster MPEG-1 Layer III Audio Decoding", HP Laboratories Palo Alto, June 2000.
[18] B.G. Lee, "FCT - A Fast Cosine Transform", IEEE International Conference on Acoustics, Speech and Signal Processing, San Diego, pp. 28A.3.1-28A.3.4, March 1984.
Appendix A Glossary
ADC     Analog to Digital Converter.
CODEC   COder/DECoder.
CPU     Central Processing Unit.
DCT     Discrete Cosine Transform.
DSP     Digital Signal Processor.
FS      Sampling Frequency, e.g. 44100 Hz for CD audio.
FFT     Fast Fourier Transform.
FIFO    First in, first out.
FLOP    Floating-point operation.
FPU     Floating Point Unit. Hardware math acceleration inside a CPU.
ISO     International Standards Organisation.
MFLOPS  Million floating-point operations per second.
MPEG    Motion Picture Expert Group. Working group within ISO.
PCM     Pulse Code Modulation. Output from an ADC.