E. Kurniawati et al.: New Implementation Techniques of an Efficient MPEG Advanced Audio Coder 655

New Implementation Techniques of an Efficient


MPEG Advanced Audio Coder
E. Kurniawati, C. T. Lau, B. Premkumar, J. Absar, S. George

This work was supported by ST Microelectronics Asia Pacific Pte Ltd.
E. Kurniawati, C. T. Lau, and B. Premkumar are with the School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798 (e-mail: [email protected], [email protected], [email protected]).
J. Absar and S. George are with ST Microelectronics Asia Pacific Pte. Ltd., R&D Centre Singapore Science Park II, Teletech Park, Singapore (e-mail: [email protected], [email protected]).
Contributed Paper
Manuscript received February 16, 2004. 0098 3063/04/$20.00 © 2004 IEEE
IEEE Transactions on Consumer Electronics, Vol. 50, No. 2, MAY 2004

Abstract - MPEG-AAC is the current state of the art in audio compression technology. The CD quality promised at bit rates as low as 64 kbps makes AAC a strong candidate for high quality, low bandwidth audio streaming applications over wireless networks. Besides this low bit rate requirement, the codec must be able to run on personal wireless handheld devices with their inherent low power characteristics. While the AAC standard is definite enough to ensure that a valid AAC stream is correctly decodable by all AAC decoders, it is flexible enough to accommodate variations in implementation, suited to different available resources and application areas. This paper reviews various implementation techniques of the encoder. We then propose our method for an optimized software implementation of MPEG-AAC (LC profile). The coder is able to perform the encoding task using half the processing power of the standard implementation without significant degradation in quality, as shown by both a subjective listening test and an ITU-R compliant quality testing program (OPERA).

Index Terms - Audio Compression, MPEG-AAC, Psychoacoustics Model, Quantization.

I. INTRODUCTION
Audio technology has evolved tremendously over the last century. With the advent of digital systems, sound reproduction has reached state of the art performance in terms of quality. However, the high bit rate characteristic of digital music does not suit applications with limited bandwidth, for example digital audio streaming. To achieve efficient transmission, compression needs to be employed.
Efficient coding systems are those that can optimally eliminate irrelevant and redundant parts of an audio stream. The first is achieved by reducing psychoacoustical irrelevancy through psychoacoustics analysis. The term "perceptual audio coder" was coined to refer to those compression schemes that exploit the properties of human auditory perception. Further reduction is obtained from redundancy reduction.
Among the perceptual audio coding schemes available today, MPEG-AAC is the leading option, giving transparent CD quality at 64 kbps. In this scheme, each AAC frame is independently decodable. With the time domain aliasing cancellation concept, the information is carried by two consecutive AAC frames. These features make the scheme favourable for audio streaming applications.
The recent advances in wireless networks bring about the challenge of developing applications on portable devices, including digital audio streaming. Not only is a low bit rate desired, but the encoder and decoder pair must also be able to run on these low power portable devices. These are the motivations behind our research.
The AAC decoder is less demanding computationally, particularly because of the lack of psychoacoustics and bit allocation modules. These two modules will be the focus of our discussion. Section 2 will give a brief description of AAC and its efficiency issues. Section 3 will discuss psychoacoustics and time to frequency transformation in greater detail and section 4 will focus on the bit allocation-quantization module. Finally, section 5 will highlight the experimental results and the conclusion will be presented in section 6.

II. MPEG-ADVANCED AUDIO CODER (AAC)
AAC is the latest audio compression standard released by the Moving Picture Experts Group (MPEG). Being a perceptual encoder, it follows the basic structure depicted in figure 1.

[Figure: signal path input, filter bank, quantisation, entropy coding, output; psychoacoustics module containing spectral analysis and masking threshold calculation]
Fig 1. Basic structure of perceptual audio coder

Essentially, a perceptual coder consists of a psychoacoustics model, a filter bank (for time to frequency transformation), and a quantization unit. For AAC, an extra spectral processing is performed before the quantization (a complete diagram of MPEG-4 AAC is shown in figure 2). This spectral processing
block is used to reduce redundant components, consisting mostly of prediction tools.
AAC uses the Modified Discrete Cosine Transform (MDCT) with 50% overlap in its filterbank module. After the overlap-add process, due to the time domain aliasing cancellation, we should be able to get a perfect reconstruction of the original signal. However, this is not the case because error is introduced during the quantization process. The idea of a perceptual coder is to hide this quantization error such that our hearing will not notice it. Those spectral components that we would not be able to hear are also eliminated from the coded stream. This irrelevancy reduction exploits the masking properties of the human ear (more details on this will be given in a subsequent section). The quality of a perceptual coder depends on the psychoacoustics module because this is where all the psychoacoustical analysis is performed. The calculation of the masking threshold is among the most computationally intensive tasks of the encoder.
AAC has 2 different window sizes to be used depending on whether the signal is stationary or transient. This feature combats the pre-echo artifact, which all perceptual encoders are prone to. The decision to switch between window sizes is also determined by the psychoacoustics module, making it more crucial to the performance of the encoder.
The AAC quantization module operates in two nested loops. The inner loop quantizes the input vector and increases the quantizer step size until the output vector can be coded with the available number of bits. After completion of the inner loop, an outer loop checks the distortion of each scale factor band and, if the allowed distortion is exceeded, amplifies the scale factor band and calls the inner loop again. AAC uses a non-uniform quantizer.
Figure 2 shows the complete diagram of MPEG4-AAC [1]. There are 3 profiles defined in the standard:
- Main Profile, with all the tools enabled, demanding substantial processing power.
- Low Complexity (LC) Profile, with lesser compression ratio to save processing and RAM usage.
- Scaleable Sampling Rate Profile, with ability to adapt to various bandwidths.
We will discuss only the second profile as processing power savings is our main concern.
Besides the main modules explained earlier, AAC-LC has Temporal Noise Shaping (TNS) and stereo coding enabled without the rest of the prediction modules in the spectral processing unit (please refer to figure 2). Working in tandem with block switching, TNS is also used to reduce the pre-echo artefact by controlling the temporal shape of the quantization noise. However, in LC profile the order of TNS is limited. The stereo coding is used to control the imaging of coding noise by coding the left and right coefficients as sum and difference.
The AAC standard only ensures that a valid AAC stream is correctly decodable by all AAC decoders. The encoder can accommodate variations in implementation, suited to different available resources and application areas. AAC-LC is the profile tailored to have a lesser computational burden compared to the other profiles. However, the overall efficiency still depends on the detailed implementation of the encoder itself.

[Figure: blocks shown are Input time signal, AAC Gain Control Tool, Window Length Decision, Filter Bank, Psychoacoustic Model (Perceptual model), Spectral Processing (TNS, LTP, Intensity/Coupling, Prediction, PNS, M/S), Bark Scale to Scalefactor Band Mapping, Quantization and Coding (AAC: Scalefactor coding, Quantization, Noiseless coding; Twin VQ: Spectrum normalization, Interleaved VQ), Bitstream Formatter, 14496-3 Coded Audio Stream]
Fig 2. Block diagram of MPEG4-AAC

Figure 3 shows the computational demand of a standard AAC-LC implementation from the MPEG reference coder, run at a 64 kbps bit rate with the CD quality sampling rate of 44.1 kHz. The psychoacoustics module takes up 22% of the processing power due to its heavy spectral analysis for the masking threshold calculation. The most demanding module is quantization, due to the presence of the nested loop for rate distortion control. These are the 2 modules that have been of interest in the effort to optimize the encoder.
In this paper, we will describe these two modules and various options for their implementation to improve the efficiency, along with the pros and cons. Sufficient arguments will be presented before choosing the final method and a comparison will be performed with the initial implementation in terms of both computation and subjective quality.

[Figure: Psychoacoustics (22%), Filterbank (5%), Quantization (64%), Other (9%)]
Fig 3. Distribution of resources in AAC-LC encoder
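
As a numerical illustration of the time domain aliasing cancellation described in this section, the following minimal sketch runs a short test signal through a 50% overlapped MDCT analysis and synthesis without quantization and checks that overlap-add recovers the input. The forward transform follows the MDCT formulation quoted from [1] in Section III. The inverse scaling (2/N), the sine window, the block length N = 16 and the test signal are assumptions chosen for illustration and are not taken from the paper.

```python
import math

def mdct(z, N):
    """Forward MDCT of N windowed samples, giving N/2 coefficients."""
    n0 = (N / 2 + 1) / 2
    return [2.0 * sum(z[n] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
                      for n in range(N))
            for k in range(N // 2)]

def imdct(X, N):
    """Inverse MDCT (2/N scaling assumed); the N output samples still contain aliasing."""
    n0 = (N / 2 + 1) / 2
    return [2.0 / N * sum(X[k] * math.cos(2 * math.pi / N * (n + n0) * (k + 0.5))
                          for k in range(N // 2))
            for n in range(N)]

def sine_window(N):
    """Sine window, which satisfies w[n]^2 + w[n + N/2]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / N) for n in range(N)]

def tdac_roundtrip(signal, N):
    """Window, transform, inverse transform, window again and overlap-add."""
    hop = N // 2
    w = sine_window(N)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - N + 1, hop):
        block = [w[n] * signal[start + n] for n in range(N)]
        y = imdct(mdct(block, N), N)
        for n in range(N):
            out[start + n] += w[n] * y[n]
    return out

N = 16  # hypothetical block length, far shorter than the AAC windows
signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(4 * N)]
rec = tdac_roundtrip(signal, N)
# Only samples covered by two overlapping blocks are alias-free.
max_err = max(abs(rec[n] - signal[n]) for n in range(N // 2, len(signal) - N // 2))
print(max_err)
```

In an actual encoder the coefficients are quantized between the two transforms, so the reconstruction error is no longer zero; that error is what the psychoacoustic model tries to keep below the masking threshold.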

III. PSYCHOACOUSTICS AND FILTERBANK

A perceptual audio coder achieves compression by reducing psychoacoustical irrelevancy in the audio data stream. The masking threshold is determined to judge which parts of the signal are less important to our perception. This is done by exploiting the simultaneous masking property of our auditory system, which states that under the influence of one prominent tone, the adjacent spectral components lose their significance to our perception.
The main steps in calculating the masking threshold in the Psychoacoustics Module (PAM) are summarized as follows:

1. Transformation to frequency domain.
To calculate the complex spectrum of the input signal, an FFT is performed on the windowed (Hann window) segment of the input signal, resulting in the spectral coefficients X(k):

$$X(k) = r(k)\,e^{-j\theta(k)} = \frac{1}{N}\sum_{n=0}^{N-1} x(n)\,e^{-j 2\pi n k / N}$$

2. Calculation of the energy spectrum
The spectral coefficients are segmented into seventy partitions [1] and the energy of each partition is computed as below:

$$e(b) = \sum_{k=k_{low}(b)}^{k_{high}(b)} r^{2}(k)$$

where $k_{low}$ and $k_{high}$ are the lowest and highest frequency lines in the partition, and b is the partition index.

3. Convolution with the spreading functions.
This step accounts for the spread of the masking phenomenon across critical (or bark) bands. An analytical expression for the spreading function (for each partition) is given by:

$$SF_{dB} = 15.81 + 7.5\,(x + 0.474) - 17.5\sqrt{1 + (x + 0.474)^{2}}$$

where x represents the distance (in bark) of the masker from the maskee. This gives us a triangular model for the spreading function with slopes of +25 and -10 dB per bark. Observe in Fig. 4 that the masking effect is gently sloping on the higher frequency end while on the lower side it is considerably steep. This accounts for the fact that it is easier to mask higher frequency components than lower ones.

4. Determination of the tonality index
Tonal and noise components have different masking capabilities (the latter being the better masker). A precise assessment of tonality is crucial in order to avoid under-coding and over-coding. In AAC, this parameter is estimated using an unpredictability measure. Here, let Xp(k) be the predicted value for coefficient X(k). Xp(k) is computed by extrapolating values of X(k) over the previous two frames. A measure of unpredictability is computed as shown below:

$$u(k) = \frac{|X(k) - X_p(k)|}{|X(k)| + |X_p(k)|}$$

which is essentially the ratio of the prediction error to the sum of the magnitudes of the coefficient and its predicted value. The weighted unpredictability is determined by multiplying the energy $r^{2}(k)$ with the unpredictability measure u(k), and summing the product over each partition. A convolution of the sum is then performed with the spreading function. Since this result is weighted by the signal energy, it needs to be renormalized. The tonality is calculated from this result cb(b), using the formula:

$$\alpha = \min\bigl(1, \max\bigl(0,\; -0.299 - 0.43\,\log(cb(b))\bigr)\bigr)$$

5. Adjustment to masking
An offset is determined using the tonality (index) criteria α, with the generic formula:

$$\mathit{Offset} = \alpha\,TMN + (1 - \alpha)\,NMT$$

where TMN is tone masking noise and NMT is noise masking tone. Being a better masker, NMT has a lower value compared to TMN. The values suggested in AAC [1] are 6 dB for NMT and 18 dB for TMN. The offset is subtracted from the log spread bark spectrum to obtain the masking threshold.

[Figure: level [dB] versus critical band rate [Bark]; curves labelled Offset and Spreading Function]
Fig 4. Offset determined from tonality index

6. Comparison of mask with hearing threshold
The masking threshold is compared with the absolute threshold of hearing, approximated with the formula

$$T(f) = 3.64\,f^{-0.8} - 6.5\,e^{-0.6\,(f - 3.3)^{2}} + 0.001\,f^{4}$$

where f is the frequency in kHz.

The computational complexities of steps 1, 2 and 3 are N log N, N and N² respectively. Our ear analyzes sounds according to the bark scale. Therefore, conversion to the frequency domain (step 1) and grouping of the spectral lines to 1/3rd bark resolution (step 2), as well as convolution with the spreading function (step 3), are inevitable. PAM implementations differ mostly in the 4th step. Furthermore, the quality of the masking threshold depends greatly on how accurate this tonality index estimation is. The last two steps have negligible computational cost compared to the previous ones.
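
The closed-form expressions in steps 3, 5 and 6 above can be sketched as small helper functions. This is a minimal illustration rather than the reference implementation: the function names are hypothetical, the natural logarithm in the tonality formula is an assumption (the text only writes "log"), and the assignment of 18 dB to TMN and 6 dB to NMT follows the statement that NMT has the lower value.

```python
import math

TMN_DB = 18.0  # tone-masking-noise offset in dB, value quoted from [1]
NMT_DB = 6.0   # noise-masking-tone offset in dB, value quoted from [1]

def spreading_db(x):
    """Spreading function in dB for a masker-to-maskee distance x in Bark (step 3)."""
    return 15.81 + 7.5 * (x + 0.474) - 17.5 * math.sqrt(1.0 + (x + 0.474) ** 2)

def tonality_index(cb):
    """Tonality index from the renormalized unpredictability cb(b) (step 4).
    Natural logarithm assumed."""
    return min(1.0, max(0.0, -0.299 - 0.43 * math.log(cb)))

def masking_offset_db(alpha):
    """Offset between the spread spectrum and the masking threshold (step 5)."""
    return alpha * TMN_DB + (1.0 - alpha) * NMT_DB

def threshold_in_quiet_db(f_khz):
    """Absolute threshold of hearing for a frequency in kHz (step 6)."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 0.001 * f_khz ** 4)

# The asymptotic slopes of the spreading function approach +25 and -10 dB per Bark.
lower_slope = spreading_db(-9.0) - spreading_db(-10.0)
upper_slope = spreading_db(10.0) - spreading_db(9.0)
print(lower_slope, upper_slope)
```

Evaluating the two slopes far from the masker is a quick way to confirm the triangular shape described in step 3, and masking_offset_db moves from 6 dB for a fully noise-like partition to 18 dB for a fully tonal one.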

The standard tonality calculation using weighted unpredictability highlighted above involves the N² complexity of a convolution process. Instead of this, we propose identifying the nature of the spectrum locally in each bark band, thus avoiding this convolution process. This helps to isolate the calculation strictly within a partition. The unpredictability is averaged within the partition:

$$\mathit{average\_u}(b) = \frac{1}{\bigl(k_{high}(b) - k_{low}(b)\bigr) + 1}\sum_{k=k_{low}(b)}^{k_{high}(b)} u(k)$$

Using this method, the complexity is reduced to N.
Besides unpredictability, which makes use of the spectral certainty across frames, it is also possible to look at the Spectral Flatness Measure (SFM) within one frame to decide the tonality characteristic [3][4]. SFM is defined as the ratio of the geometric mean Gm to the arithmetic mean Am of the power spectrum.

$$SFM_{dB} = 10\log_{10}\!\left(\frac{G_m}{A_m}\right)$$

and the tonality is determined as follows:

$$\alpha = \min\!\left(\frac{SFM_{dB}}{SFM_{dB\,max}},\, 1\right)$$

where $SFM_{dB\,max}$ = -60 dB is used to estimate if the signal is entirely tone-like. A flat spectrum will give an SFM of 0 dB, which indicates noise characteristics. The advantage of using this method is in memory usage. This is because we no longer need to keep the spectral coefficient values of the two previous frames that were essential in the previous method for calculating the unpredictability.

[Figure: level [dB] versus partition; curves labelled Unpredictability measure, SFM, avg. unpredictability]
Fig 5. Masking threshold comparison for classical music segment

Figure 5 shows the masking threshold obtained from 3 different methods of estimating the tonality index. Average unpredictability was selected in our implementation due to its computational advantage and its quality. A more thorough comparison has been done in [2].
Up to this point, the PAM implementation assumed additivity of masking for practical reasons. For a better approximation of the masking threshold, one should cater for the non-linearity involved in the human auditory system [5]. However, the use of this non-linear PAM would increase the computational weight, which is against the goal of our experiment.
Further improvement in efficiency was realized in conjunction with the filter bank module. A transform is a costly process, and the fact that AAC has MDCT in its filter bank module and DFT in PAM makes this a computational overhead. The MDCT used in AAC [1] is formulated as follows:

$$X_{i,k} = 2\sum_{n=0}^{N-1} z_{i,n}\cos\!\left(\frac{2\pi}{N}\,(n + n_0)\left(k + \frac{1}{2}\right)\right), \quad \text{for } 0 \le k < \frac{N}{2}$$

where z is the windowed input sequence, n is the sample index, k is the spectral coefficient index, i is the block index, N is the window length (2048 for long and 256 for short) and $n_0$ is computed as (N/2 + 1) / 2.
There have been several suggestions to use MDCT in the psychoacoustics module [6][7][8][9] with a view to avoiding performing these two transforms. However, the characteristics of MDCT itself could make it inappropriate for psychoacoustics analysis.
MDCT is a purely real transform. If the input signal has a strong component that is π/2 out of phase with respect to the MDCT basis function, the corresponding coefficient will be zero [10]. This problem was also discussed in [11] regarding the peculiar properties of MDCT for signals that exhibit local symmetry. Figure 6 illustrates this problem of misdetection.

Fig 6. Misdetection of frequency component in MDCT spectrum. A) Time domain signal B) DFT spectrum C) MDCT spectrum

Figure 6a shows the time domain signal and 6b shows the magnitude of the Discrete Fourier Transform (DFT). DFT
managed to catch the two frequency components of the signal whereas the MDCT coefficients in figure 6c give zero results due to the problem highlighted earlier. This misdetection poses a problem when one tries to track the signal tonality with the unpredictability function. One way to work around this is by using SFM to determine the tonality [6][7]. If unpredictability is still desired, an extra process needs to be incorporated to track the presence of the spectral component. Every detected tonal component is assigned a life stage to make sure that if a tone was detected in a previous frame and not in the current frame due to phase and/or resolution, it will not be ignored [10].
Instead of using MDCT in PAM, we propose the use of Odd DFT (ODFT), which can be easily manipulated to obtain the MDCT coefficients [12]. ODFT corresponds to DFT with the discrete frequency bins shifted by π/N. Using this modification, the complexity is also reduced by one transform process, but in this case we do not have to deal with the problem of misdetection stated earlier. The restructured coder is illustrated in figure 7.

[Figure: signal path input, ODFT, ODFT to MDCT, quantization, entropy coding, output; the ODFT also feeds the psychoacoustics module containing spectral analysis and masking threshold calculation]
Fig 7. Restructured AAC-LC encoder

ODFT is defined as:

$$X_o(k) = \sum_{n=0}^{N-1} h(n)\,x(n)\,e^{-j\frac{2\pi}{N}\left(k + \frac{1}{2}\right)n}$$

where x(n) is the time domain sample and h(n) is the window function. This ODFT output is fed into the psychoacoustics module for further spectral analysis, whereas for the filter bank module, the coefficients of the MDCT are obtained as:

$$MDCT(k) = \mathrm{Re}\{X_o(k)\}\cos\theta(k) + \mathrm{Im}\{X_o(k)\}\sin\theta(k)$$

where

$$\theta(k) = \frac{\pi}{N}\left(k + \frac{1}{2}\right)\left(1 + \frac{N}{2}\right)$$

The main problem with this method is the mismatch of window functions between the filterbank and PAM. MDCT has two choices: the sine or the Kaiser Bessel Derived (KBD) window as defined in the standard [1]. However, in PAM the window function used is Hann. The Hann window is not acceptable for MDCT calculation due to its failure to meet the perfect reconstruction criteria required for MDCT time domain aliasing cancellation. The PAM window function, on the other hand, does not have this strict criterion. Furthermore, the calculation in PAM is done in the one-third bark domain. This grouping of spectral coefficients makes the use of a different window function in PAM less noticeable. During the experiment, the sine or KBD window function from MDCT is employed for the psychoacoustics analysis. Figure 8 illustrates the minor differences in the masking threshold result obtained from using different window functions. A more thorough comparison on the use of each window in PAM is discussed in [13].

[Figure: dB versus scalefactor band; curves labelled hann, sine, kbd]
Fig 8. Masking threshold comparison with different window functions

IV. BIT ALLOCATION-QUANTIZATION
AAC Quantization module: AAC uses a non-uniform quantizer:

$$\mathit{x\_quantized}(i) = \mathrm{int}\!\left(\frac{x^{3/4}}{2^{\frac{3}{16}\,(gl - scf(i))}} + 0.4054\right) \qquad (1)$$

where i is the scale factor band index, x is the spectral value within that band to be quantized, gl is the global scale factor (the rate controlling parameter), and scf(i) is the scale factor value (the distortion controlling parameter). Figure 9 illustrates the nested loop in this module, which obtains the parameters gl and scf(i) from the inner and outer loop respectively.

[Figure: flowchart with Begin, (1) initialize gl and scf(i), exit criteria test leading to end, adjust gl, (3) calculate bit_used, test bit_used < rate (inner loop), (4) calculate quantization error per scalefactor band, (5) adjust scf(i) in bands whose error > masking threshold]
Fig 9. Nested-loop in bit allocation / quantization module

The ideal exit criteria for the above process are when
bit_used is below the chosen bit rate and the quantization noise in all scale factor bands is below the masking threshold. However, this is not always achievable, especially at very low bit rates. Two more exit criteria are defined in the standard: firstly, when all scale factor bands have been amplified, and secondly, when the difference between two consecutive scale factor bands exceeds 60 (which is the maximum number decodable). A time constraint has to be employed as well when real time encoding is desired.
Improving the efficiency of this module involves optimizing each of the steps outlined in figure 9.

1. Initialization of gl and scf(i)
Normally the initial value of scf(i) would be zero, and gl would be

$$gl = \frac{16}{3}\log_{2}\!\left(\frac{\mathit{max\_mdct\_line}^{3/4}}{8191}\right) \qquad (2)$$

which ensures that the maximum MDCT coefficient is decoded as 8191 (the maximum value decodable by an AAC decoder). However, in general audio signals, the current audio frame is highly correlated with the previous one. Due to this property, the bit allocation parameters of the current audio frame are similar to those of the previous frame. By using the previous frame result as the initial estimate of the gl and scf(i) parameters, the number of iterations in the bit allocation can be reduced [14].

2. Global scale factor (gl) adjustment
Instead of using a linear search from the initial to the desired value for this parameter, a binary search is employed. This reduces the number of iterations from N to log N.

3. Calculation of bit_used
One of the reasons why the bit allocation module is time consuming is the presence of Huffman coding within the inner loop. For this reason the relation between the global scale factor (gl) and bit_used is not linear. Every time gl is adjusted, the coefficients need to be requantized and Huffman coded. There are eleven Huffman codebook options for each scale factor band, and there is a grouping option for adjacent scale factor bands that use the same Huffman codebook. Grouping is performed to reduce the amount of side information, but it is not always advantageous to use the same codebook for adjacent scale factor bands. Hence, in choosing the optimum codebook, we have to try the grouping possibilities as well. This is an NP-complete problem and it is not always feasible to get the optimum solution, mostly due to time constraints.
A sub-optimal solution has been suggested in [7][9] by checking the grouping possibilities in just one iteration for adjacent scale factor bands. Another option is to have a fixed Huffman sectioning, by having three nonzero bands share the same codebook [15].
We adopt both ideas by fixing the lower scale factor bands to use the same codebook and using the sub-optimal solution for the upper bands. The reason is that the lower bands contain fewer spectral lines. The savings in bits gained from using the optimal codebook per band are less than the overhead of the side information. Therefore the grouping of the first few bands, containing only 4 spectral lines each, resulted in a better (smaller) bit_used.

4. Quantization error calculation
The quantization error is calculated per scale factor band, by summing the squared difference between the original spectral value and the dequantized value. The dequantization process uses the following formula:

$$\mathit{x\_dequantized}(i) = \bigl(\mathit{x\_quantized}(i)\bigr)^{4/3} \cdot 2^{\frac{1}{4}\,(gl - scf(i))}$$

which is also the process performed at the decoder side. This analysis by synthesis process has to be performed every time the distortion control in the outer loop is executed.
To reduce this task, a pre-allocation and pre-exclusion of bits can be adopted [16][17]. From robust experiments, there are bands in which bits are always allocated and bands which always have zero allocation (this mostly occurs in the upper bands due to the high threshold of hearing in this region). For these special bands, iteration is no longer needed and the process of calculating the quantization error can be skipped. However, this technique can only be used when we have enough bits at hand. Pre-allocation might result in a bit shortage in a more important band and hence is not advisable for low bit rate coding.
A more general approach to optimize this task would be to approximate the quantization error mathematically. In order to strongly reduce the number of operations, a uniform quantizer can be considered to estimate the noise power [18], that is $\Delta^{2}/12$ where the step size $\Delta = 2^{\frac{3}{16}\,(gl - scf(i))}$. This method disregards the compression process ($x^{3/4}$) of a non-uniform quantizer in exchange for simplicity. We will adopt a more precise approximation for the quantization noise, which will be discussed later in this section in conjunction with the approximation of the global and the individual scale factors.

5. Scale factor (scf(i)) adjustment
As mentioned earlier, when bit resources are low, more often than not we have to choose to amplify only the scale factor bands with the highest NMR (Noise to Mask Ratio). This search process can be optimized by using a complete binary tree data structure with the property that the value (NMR) of each node is at least as large as the values of its children nodes [19]. In this case, the scale factor adjustment reduces to just deleting the top element of the tree, adjusting the scale factor (modifying the NMR accordingly), and reinserting this element back into the tree. When the NMR becomes lower than zero, it need not be inserted back into
the tree, as no scale factor adjustment needs to be made, so the tree size becomes smaller during this process. The advantage of this approach is that for each modification, we need to work on log N elements as opposed to N elements in a traditional linear search.

Apart from these optimizations, we are still faced with the problem that all these processes are repetitively executed until the best solution is found or until the time in the exit criteria expires. A more intuitive way to get a better result is to start with a better initial value for the parameters. The best case then is to arrive at the best solution within the first trial. This is the method attempted in [15][20][21][22], especially in obtaining the initial value of the scale factor (scf(i)). This distortion controlling parameter depends on the masking threshold, and we will try to relate these two variables mathematically.
Combining equation (1) with the dequantization formula, we have the dequantized value

$$\mathit{x\_dequantized}(i) = \left[\mathrm{int}\!\left(\frac{x^{3/4}}{2^{\frac{3}{16}(gl - scf(i))}} + 0.4054\right)\right]^{4/3} \cdot 2^{\frac{1}{4}(gl - scf(i))}$$

$$= \left[\frac{x^{3/4}}{2^{\frac{3}{16}(gl - scf(i))}} + e\right]^{4/3} \cdot 2^{\frac{1}{4}(gl - scf(i))}$$

$$= \left[\frac{x^{3/4}}{2^{\frac{3}{16}(gl - scf(i))}}\left(1 + \frac{e\,2^{\frac{3}{16}(gl - scf(i))}}{x^{3/4}}\right)\right]^{4/3} \cdot 2^{\frac{1}{4}(gl - scf(i))}$$

$$= x\left[1 + \frac{e\,2^{\frac{3}{16}(gl - scf(i))}}{x^{3/4}}\right]^{4/3}$$

where e is the error introduced by the constant addition and the integer rounding. Ideally, without the constant addition and the integer rounding process, we would be able to get the original x back from this process. However, error is introduced due to this process. Using binomial expansion, we can expand the term in the bracket:

$$\mathit{x\_dequantized}(i) = x\left[1 + \frac{4}{3}\left(\frac{e\,2^{\frac{3}{16}(gl - scf(i))}}{x^{3/4}}\right) + \frac{2}{9}\left(\frac{e\,2^{\frac{3}{16}(gl - scf(i))}}{x^{3/4}}\right)^{2} + \ldots\right] \approx x + \frac{4}{3}\,e\,x^{1/4}\,2^{\frac{3}{16}(gl - scf(i))} \qquad (3)$$

This is the same approach used in [22] with the assumption that the random variables e and x are independent and uniformly distributed. The error is relatively small and this series converges fast in practice. For simplicity, only the first order result is employed.
From experimental results, the quantization noise derived from the above approximation is often lower than that obtained from the traditional analysis by synthesis method. This could be due to the truncation of the higher order terms. Underestimating the noise could lead to perceptual artefacts, because what we thought was already under the masking threshold might end up being higher. We adopted a scaling factor within the noise approximation to circumvent this problem, as overestimating this value will not have any effect on the perceptual quality. Figure 10 shows the comparison between the noise and its approximated counterpart.

[Figure: dB versus scale factor band; curves labelled noise and estimated noise]
Fig 10. Estimated noise comparison

With this method, no iteration is needed in the outer loop because the scale factor value is derived directly from the masking threshold result (which is the maximum allowable noise).
For the adjustment of the rate controlling parameter (the global scale factor), it has been suggested to avoid iteration here as well by deriving its value using a linear model-based algorithm, which relates the global gain and the bits used [22]. However, as mentioned earlier, the relation between these two variables is actually non-linear due to the presence of Huffman coding within the inner loop. Furthermore, this linearity assumption is valid only when the scale factors are kept constant in all bands, which is not the case most of the time, because we have to keep adjusting them to the masking threshold requirements.
We propose a different linear model to derive the global scale factor with minimal iteration. Unlike the scale factor, which is applied to an individual band, the global scale factor is applied to all bands. Therefore, the distortion introduced by this parameter (as the global scale factor practically determines the step size of the quantizer) must be acceptable to all bands. This is the observation that motivates us to relate the global scale factor to the minimum masking threshold. Unlike the previous method, this relation holds even with variations in the scale factor values. Figure 11 shows the correlation between
the resulting global gain and the minimum masking threshold. Initial value refers to the initial gl calculated with equation 2.

[Figure 11: plot with horizontal ticks from 1 to 210 and vertical ticks from 0 to 80; series: global scalefactor, initial value, minimum xmin]
Fig 11. Correlation between gl and minimum masking threshold. The minimum xmin serves as a better initial value and has a better correlation.

Instead of using this initial value, we take the previous global gain value and use linear interpolation based on the gradient of the minimum masking threshold obtained. Figure 12 shows the linear regression analysis of the two variables, which have a correlation value of 0.85.

[Figure 12: plot with horizontal ticks from 0 to 60 and vertical ticks from 45 to 75; labels: global gain, predicted gl]
Fig 12. Linear regression analysis

This method, however, was not applied during the transient part of the signal (when the short window is used). The inter-frame correlation during a transient is extremely low; hence the use of the previous window does not yield a good result. In this case, the xmin value is used as the initial estimate until the coder switches back to the long window.

V. RESULTS AND DISCUSSIONS

We tested the codec to verify the performance of the encoding system in both quality and encoding speed. The comparison is performed against a standard implementation, the ISO reference coder. Table 1 highlights the main differences between the two encoder implementations from an algorithmic point of view.

TABLE I
MAIN DIFFERENCES IN IMPLEMENTATION

                                  Traditional impl.               Optimized impl.
1. Transform                      MDCT calculation                Derived from PAM's FFT
   • block switching              Perceptual entropy (PE) based   PE and energy based
2. Psychoacoustics Module (PAM)   FFT with Hann window            FFT with Sin/KBD and π/N freq. shift
   • tonality index               Weighted unpredictability       Average unpredictability
3. Bit Allocation
   • initial sfb                  All zero                        Estimated with equation 3
   • initial gl                   Equation (2)                    Previous frame value (except after short block)
   • gl adjustment                Linear adjustment               Interpolated based on xmin gradient
   • scf adjustment               Linear adjustment               Minor fine-tuning
   • noise calculation            Analysis by synthesis           Approximated with equation 3

The encoding speed was evaluated on a PC with a Pentium II 350 MHz processor for two different bit rates of a 44.1 kHz audio signal. Tables 2 and 3 summarize the results for the critical audio signals. The optimized method does not show a different result for the two bit rates because both rarely involve iteration in the bit allocation module (due to the direct estimation of the scale factor and global scale factor values). For the original method, the higher the bit rate, the faster the rate control loop converges. This is because the bit budget is much higher. In this experiment, it can be observed that for 96 kbps, the encoding time of the original method is generally shorter.

TABLE II
PROCESSING TIME COMPARISON FOR BIT RATE 64 KBPS

               Number of   Original method   Optimized method
               frames      (seconds)         (seconds)          Gain
Castanet       301         11                5                  2.20
Flute          804         14                8                  1.75
Glockenspiel   345         11                6                  1.83
Pop music      330         13                7                  1.86
Speech         727         14                7                  2.00
Hihat          109         5                 2                  2.50

TABLE III
PROCESSING TIME COMPARISON FOR BIT RATE 96 KBPS

               Number of   Original method   Optimized method
               frames      (seconds)         (seconds)          Gain
Castanet       301         10                5                  2.00
Flute          804         10                9                  1.11
Glockenspiel   345         9                 7                  1.29
Pop music      330         12                7                  1.71
Speech         727         12                8                  1.50
Hihat          109         5                 2                  2.50

The computational demand of the optimized encoder is shown in figure 14. Compared with figure 3 from the initial implementation, the major improvement comes from the quantization module due to the reduction of the nested loop. The filterbank module also shows improvement because its major calculation has been absorbed by the psychoacoustics module. Overall, the proposed optimized method was able to save half of the computational resources.

[Figure 14: chart of resource shares: Psychoacoustics (19%), Filterbank (1%), Quantisation (23%), Other (7%), Unused (50%)]
Fig 14. Distribution of resources in the optimized coder

The perceptual quality was tested using two approaches. The first approach is a subjective listening test involving the six critical signals listed in table 2. These signals are known to be difficult for a perceptual coder to encode because they are prone to perceptual audio artefacts [24]. The second approach uses a quality-testing program called OPERA (Objective Perceptual Analyzer), which simulates the human ear. This software is compliant with PEAQ (Perceptual Evaluation of Audio Quality), an ITU-R standard. The result is presented in figure 13 for bit rates of 64 and 96 kbps.

Figure 13 shows MOS differences, with diffscore = 0 for the original reference. There is a discrepancy of about 0.3 on the MOS scale between the subjective test result and the OPERA result in the 64 kbps tests. Nevertheless, both agree that the original and the optimized implementations are indistinguishable in terms of quality for both bit rates.
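As a rough cross-check of this statement, the ODG values printed in Figure 13 can be placed on the impairment bands annotated along its vertical axis. The sketch below is illustrative only. The band boundaries at whole grades, the helper name `impairment_band`, and the grouping of the four ODG labels into two pairs are assumptions: the extracted figure labels do not say which value belongs to which implementation or bit rate.

```python
import math

# Impairment bands as annotated on the vertical axis of Fig 13.
# Assumption: each band spans one whole grade, starting at 0.
BANDS = ["not annoying", "slightly annoying", "annoying", "very annoying"]


def impairment_band(grade: float) -> str:
    """Map a grade in [-4, 0] to the band it falls into (hypothetical helper)."""
    if not -4.0 <= grade <= 0.0:
        raise ValueError("grade must lie in [-4, 0]")
    # -0.49 -> band 0, -1.6 -> band 1, ...; -4.0 is clamped into the last band.
    index = min(int(math.floor(-grade)), len(BANDS) - 1)
    return BANDS[index]


# ODG labels printed in Fig 13; the grouping into pairs is an assumption.
odg_pairs = [(-0.49, -0.6), (-1.6, -1.57)]

for first, second in odg_pairs:
    same_band = impairment_band(first) == impairment_band(second)
    print(f"{first:+.2f} vs {second:+.2f}: "
          f"|difference| = {abs(first - second):.2f}, same band = {same_band}")
```

Under these assumptions, the two values of each pair fall into the same band and differ by at most 0.11 grades.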
Although the subjective listening test result seems to favour the new optimized model in the 64 kbps test, a significant improvement in quality cannot be concluded, as the results did not show non-overlapping confidence intervals.

[Figure 13: vertical axis "Subjective Quality Measurement" from 0.00 to -4.00, annotated Not Annoying, Slightly Annoying, Annoying, Very Annoying; horizontal axis: 64 kbps and 96 kbps; series: MOS (Mean Opinion Score) from listening test (mean result from original implementation, mean result from optimized implementation) and ODG (Objective Difference Grade) from OPERA (original implementation, optimized implementation); ODG labels: -0.49, -0.6, -1.6, -1.57]
Fig 13. Subjective quality test and OPERA ODG at two different bit rates

VI. CONCLUSION

We have presented in this paper an optimized software implementation of MPEG-AAC (LC profile). The experiment was conducted to answer the challenge of running the encoder on low power personal handheld devices. We start with an analysis of the distribution of resources among the tasks, and then focus on the two most computationally intensive tasks, namely the psychoacoustics analysis and the bit allocation.

As a perceptual encoder, AAC relies heavily for its quality on the psychoacoustics module, which generates the masking threshold curve. This threshold represents the maximum level of noise that will not be perceptible to our ear. The analysis exploits the simultaneous masking properties of our auditory system, which are calculated on the bark scale. Therefore, the conversion from the time domain to the bark frequency domain is inevitable. The other issue is the calculation of the tonality index. Since tone and noise have different masking properties, a precise estimation of this index is important to avoid over- and under-masking. Three methods have been discussed in this paper, and the average unpredictability scheme was selected for implementation mainly because of its low computational weight and relatively good quality.

AAC employs a transform coding scheme with the MDCT as its transform engine. In a traditional encoder, this is the second transform performed, besides the DFT in the psychoacoustics module. We propose using the ODFT for the psychoacoustics analysis and deriving the MDCT coefficients from its results. The different window function issues in these two transforms have been thoroughly discussed in this paper. With this scheme, we only need to perform one transform in the encoder without degrading the overall quality.

The bit allocation unit took more than half of the processing power due to the presence of the rate distortion control loop. This nested loop iterates until the optimum global and individual scale factors are found. A better way to calculate the initial value for these parameters is presented in this paper in an
effort to avoid unnecessary iteration. The calculation of the quantization error is improved using an estimator derived from the scale factor values. This saves us from performing the dequantization process in the encoder, as is normally done in the analysis-by-synthesis method to calculate the error.

The perceptual quality of the optimized encoder was evaluated using a subjective listening test and an objective evaluation with a quality-testing program called OPERA. Both results show no significant degradation in the optimized coder for bit rates of 64 kbps and 96 kbps, as overlapping confidence intervals were obtained from both listening tests.

The latest effort to further reduce the audio bit rate has resulted in the standardization of High Efficiency AAC (HE-AAC) as part of MPEG-4 systems, promising CD quality at 48 kbps. HE-AAC contains a standard AAC coder for the low frequency region and a new Spectral Band Replication (SBR) technology to generate the high frequency portion. All the modifications highlighted in this paper can be utilized in the core coder of HE-AAC. Our future research will focus on the optimization of the SBR part of this newly defined coding system.

REFERENCES
[1] ISO/IEC 14496-3, “Information Technology – Coding of audio-visual objects, Part 3: Audio”, 1999.
[2] E. Kurniawati, J. Absar, S. George, C. T. Lau, B. Premkumar, “An Investigation Into Different Masking Behaviours Resulting from Estimation of Tonality Index”, 14th International Conference on Digital Signal Processing, July 2002, Santorini, Greece.
[3] J. D. Johnston, “Estimation of Perceptual Entropy Using Noise Masking Criteria”, IEEE CH2561-9/88/0000-2524, 1988.
[4] J. D. Johnston, “Transform Coding of Audio Signals Using Perceptual Noise Criteria”, IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2, February 1988.
[5] E. Kurniawati, J. Absar, S. George, C. T. Lau, B. Premkumar, “The Significance of Tonality Index and Nonlinear Psychoacoustics Models for Masking Threshold Estimation”, Audio Engineering Society 22nd International Conference on Virtual, Synthetic and Entertainment Audio, June 2002, Espoo, Finland.
[6] Ivan Dimkovic, Dragorad Milovanovic, Zoran Bojkovic, “Fast Software Implementation of MPEG Audio Encoder”, 14th International Conference on Digital Signal Processing, July 2002, Santorini, Greece.
[7] Toshiyuki Nomura, Yuichiro Takamizawa, “Processor-Efficient Implementation of a High Quality MPEG-2 AAC Encoder”, Audio Engineering Society 110th Convention 2001, Preprint #5294.
[8] T. H. Tsai, S. W. Huang, L. G. Chen, “Design of a Low Power Psychoacoustic Model Co-Processor for MPEG-2/4 AAC LC Stereo Encoder”, 0-7803-7761-3/03, IEEE.
[9] Yuichiro Takamizawa, Toshiyuki Nomura, and Masao Ikekawa, “High-Quality and Processor-Efficient Implementation of an MPEG-2 AAC Encoder”, 0-7803-7041-4/01, IEEE.
[10] A. D. Duenas, R. Perez, B. Rivas, E. Alexandre, A. S. Pena, “A Robust and Efficient Implementation of MPEG-2/4 AAC Natural Audio Coders”, Audio Engineering Society 112th Convention 2002, Preprint #5556.
[11] Ye Wang, Leonid Yaroslavsky, Miikka Vilermo, Mauri Vaananen, “Some Peculiar Properties of the MDCT”, 0-7803-5747-7/00, 2000.
[12] Anibal J. S. Ferreira, “Perceptual coding using sinusoidal modeling in the MDCT domain”, Audio Engineering Society 112th Convention 2002, Preprint #5569.
[13] E. Kurniawati, J. Absar, S. George, C. T. Lau, B. Premkumar, “Single Transform Perceptual Audio Encoder”, 14th International Conference on Digital Signal Processing, July 2002, Santorini, Greece.
[14] Kai-Tat Fung, Yui-Lam Chan, Wan-Chi Siu, “A Fast Bit Allocation Algorithm for MPEG Audio Encoder”, Proceedings of 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, May 2001.
[15] C. M. Liu, W. J. Lee, R. S. Hong, “Bit Allocation for Advanced Audio Coding using Bandwidth Proportional Noise Shaping Criterion”, Proceedings of the 6th International Conference on Digital Audio Effects (DAFX-03).
[16] Hyen-O Oh, Joon-Seok Kim, Chang-Jun Song, Young-Cheol Park, Dae Hee Youn, “Low Power MPEG/Audio Encoders Using Simplified Psychoacoustic Model and Fast Bit Allocation”, 0-7803-6622-0/01, 2001 IEEE.
[17] Hyen-O Oh, Joon-Seok Kim, Chang-Jun Song, Dae-Hee Youn, Il-Whan Cha, “New Implementation Techniques of A Real Time MPEG-2 Audio Encoding System”, 0-7803-5041-3/99, 1999 IEEE.
[18] A. D. Duenas, R. Perez, B. Rivas, E. Alexandre, A. S. Pena, “Realtime Implementation of MPEG-2 and MPEG-4 Natural Audio Coders”, Audio Engineering Society 110th Convention 2001, Preprint #5302.
[19] Manoj Kumar, Mohammad Zubair, “A High Performance Software Implementation of MPEG Audio Encoder”, ICASSP, Vol. 2, 1996.
[20] C. M. Liu, W. J. Lee, R. S. Hong, “A New Criterion and Associated Bit Allocation Method For Current Audio Coding Standards”, Proceedings of the 5th International Conference on Digital Audio Effects (DAFX-02).
[21] Chi-Min Liu, Chin-Ching Chen, Wen-Chieh Lee, Szu-Wei Lee, “A Fast Bit Allocation Method for MPEG Layer III”, 0-7803-5123-1/99, 1999 IEEE.
[22] C. Y. Lee, Y. C. Fang, H. C. Chuang, C. N. Wang, T. H. Chiang, “A Fast Audio Bit Allocation Technique Based on a Linear R-D Model”, IEEE Transactions on Consumer Electronics, Vol. 48, No. 3, August 2002.
[23] Kelvin H. C. Eng, D. Y. Huang, S. W. Foo, “A New Bit Allocation Method for Low Delay Audio Coding at Low Bit Rates”, Audio Engineering Society 112th Convention 2002, Preprint #5573.
[24] Markus Erne, “Perceptual Audio Coders, What to listen for”, Audio Engineering Society 111th Convention 2001.

Evelyn Kurniawati received her Bachelor of Applied Science (Computer Engineering) degree from Nanyang Technological University (NTU), Singapore, in 2000. She is now pursuing her doctoral degree in the School of Computer Engineering, NTU. Her research interests are in digital audio compression, network security and computer animation.

Chiew-Tong Lau received his B.Eng. degree from Lakehead University in 1983, and M.A.Sc. and Ph.D. degrees in Electrical Engineering from the University of British Columbia in 1985 and 1990 respectively. He is currently an Associate Professor and Head of the Division of Computer Communications in the School of Computer Engineering, Nanyang Technological University, Singapore. His main research interests are in wireless communications.

Benjamin Premkumar received his Bachelor of Science degree in Physics and Math from Bangalore University (India) and a Bachelor's degree in Electrical Communication Engineering from the Indian Institute of Science (India). He worked briefly in the Research and Development division of a large communication company in Bangalore before proceeding to the US to earn his M.S. from North Dakota State University. His M.S. research was in the area of Digital Speech Processing. He taught as a graduate teaching fellow at NDSU. He then went on to obtain his Ph.D. from the University of Idaho. His Ph.D. thesis was in the area of Synthetic Aperture Radar Signal Processing, a project funded by NASA. He has held various teaching positions since 1991, both in the US and in Singapore. Currently he is an Associate Professor in the School of Computer Engineering (NTU). His research interests include Digital Signal Processing and its applications in Wireless Communication, Software Defined Radio and Impulse Radio. He also works in the areas of multirate signal processing, filter banks, transform techniques, speech coding techniques, Number Theory, and the Wavelet transform and its application to signal analysis.
Javed Absar received his Bachelor of Applied Science (Computer Engineering) degree from Nanyang Technological University (NTU), Singapore, in 1996. He has been with ST Microelectronics Asia Pacific Pte. Ltd. since then, working on audio compression schemes and low power compilers. His main research interests are in low power design for multimedia.

Sapna George completed her BTech in Electronics and Communications in 1985 at the College of Engineering, Kerala, India. Since 1995, she has been with ST Microelectronics Asia Pacific Pte. Ltd. She is Technical Manager, in charge of Audio research and development at STMicroelectronics's R&D centre in Singapore.
