
A Floating-Point to Fixed-Point Conversion Methodology

for Audio Algorithms


Mihir Sarkar

Most Digital Signal Processors perform computations on integers, or fixed-point numbers, rather
than floating-point numbers. In contrast, Digital Signal Processing algorithms are often designed
with real numbers in mind and usually implemented in floating-point. Apart from finite word-
length effects that may appear during signal acquisition and intermediate computations, limits on
the signal precision and range often compromise the stability of the system.
Audio algorithms are particularly sensitive to fixed-point implementations due to the audible
artifacts that the conversion process may introduce. Therefore, it is essential to validate the
stability and static characteristics of the system after conversion. Then, the dynamic behavior of
the system can be studied by applying suitable test signals.
Starting with a presentation of basic Digital Signal Processing concepts relevant to our discussion,
this paper carries on with a floating-point to fixed-point conversion strategy for audio processing
algorithms. Finally, a high-pass IIR filter implementation example is presented.

1. INTRODUCTION

An audio signal is an electrical (or optical) representation of the continuous variation in time of a sound pressure wave traveling through a medium such as air. In mathematical terms, we describe an audio signal as a real function of a single real variable, where the x-axis represents time, and the y-axis, the signal amplitude, which can take an arbitrary value at any point in time; in other words, an audio signal is analog.

Most modern audio products, however, process audio signals digitally, which is to say, they process numerical values representing an analog audio signal. Fuelled by increasingly complex audio algorithms, Digital Signal Processors (DSPs) have become a key component for audio systems. Moreover, fixed-point DSPs are preferred over their floating-point counterparts in an attempt to control costs and consume less power for portable applications. These fixed-point DSPs do not support floating-point data types and operations at the instruction level; furthermore, emulating a floating-point operation on a fixed-point DSP is out of the question as it results in an intolerable run-time penalty. Since fixed-point numbers require fewer bits than floating-point numbers to achieve the same accuracy in the mantissa [6], designers can reduce the chip size by not including a Floating-Point Unit (FPU), and still maintain a high throughput while running at a limited clock rate. On the flip side, fixed-point processors require a great number of macro manipulations on the data in order to complete the required math operations with minimal loss of precision. However, this additional effort is usually a one-time engineering cost that is quickly amortized on consumer products.

When analyzing a system, it is important to make a distinction between the input signals and the algorithms that are going to process those signals. In audio systems, input signals are typically digitized and coded as Pulse Code Modulation (PCM) data by an Analog-to-Digital Converter (ADC). A CD-quality signal, for instance, is sampled at 44.1kHz with 16-bit precision. Thus, audio signals are generally represented as fixed-point numbers in most systems; we shall not discuss here algorithms that handle real numbers (discrete-time signals). On the other hand, audio-processing algorithms can either be implemented on a floating-point or a fixed-point computing system.

The floating-point to fixed-point conversion process involves several steps. First, a static model of the fixed-point algorithm is designed. For example, we select the filter structure that maximizes stability, and quantize the filter coefficients. If the filter loses its properties (cut-off frequency, etc.) during this one-time conversion, alternate representations such as double precision or block floating-point can be used. Second, the dynamic behavior of the system is defined by appropriate scaling and quantization of the input signals and intermediate results. A careful selection of test signals is required to validate this step.
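The coefficient-quantization step described above can be illustrated with a short sketch (Python is used here purely for illustration; the resonator and the deliberately coarse 4-bit rounding are our assumptions, chosen to exaggerate the effect that the paper mitigates with double precision or block floating-point):

```python
import math, cmath

def poles(a1, a2):
    # Poles of H(z) = 1 / (1 - a1*z^-1 - a2*z^-2): roots of z^2 - a1*z - a2 = 0.
    d = cmath.sqrt(a1 * a1 + 4 * a2)
    return (a1 + d) / 2, (a1 - d) / 2

def quantize(c, fwl=4):
    # Round a coefficient to 4 fractional bits (deliberately coarse).
    return round(c * 2**fwl) / 2**fwl

# Hypothetical resonator: complex pole pair at radius 0.98, angle +/-0.2 rad,
# so a1 = 2*r*cos(w) and a2 = -r*r.
r, w = 0.98, 0.2
a1, a2 = 2 * r * math.cos(w), -(r * r)

print(abs(poles(a1, a2)[0]))                      # 0.98: stable, inside the unit circle
print(abs(poles(quantize(a1), quantize(a2))[0]))  # 1.0: the quantized pole lands on the unit circle
```

With real 16-bit coefficients the pole movement is far smaller, but the mechanism is the same: rounding the coefficients moves the poles, and a pole pushed onto or outside the unit circle destroys stability.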

2. NUMBER REPRESENTATIONS

Floating-Point Representation

Representing infinitely many real numbers in a finite number of bits requires an approximate representation. Most calculations involving real numbers produce quantities that cannot be exactly represented using a finite number of bits. Therefore, not only the operands, but also the result of a floating-point calculation must often be rounded in order to fit into a finite representation.

There are two reasons why a real number may not be exactly representable as a floating-point number. The decimal number 0.1 illustrates the most common situation: although it has a finite decimal representation, it has an infinite repeating binary representation; the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. Another, less common, situation is that the real number is out of range, i.e., its absolute value is larger than the maximum, or smaller than the minimum, floating-point value.

The IEEE floating-point standard, which is generally accepted by commercial hardware manufacturers, specifies the rounding method for basic operations (addition, subtraction, multiplication, division, and square root), and requires that implementations produce a bit-accurate result with that algorithm. By bit-accurate, we mean that the implementation produces a result that is identical bit-for-bit with the specification.

The IEEE floating-point representation is of the form N = (-1)^S M 2^E, where S is the sign bit, M, the (normalized) fractional mantissa, and E, the (biased) exponent [2].

On most 32-bit systems (e.g. Intel Pentium-based personal computers), floating-point numbers are represented by the following data sizes:
• Single precision: NE = 8, NM = 23;
• Double precision: NE = 11, NM = 52;
• Accumulator: NE = 23, NM = 64;
where NE is the number of bits in the exponent, and NM, the number of bits in the mantissa.

The floating-point format allows a large dynamic range to be represented. The single-precision format, for instance, can represent real values in the range [1.17·10^-38; 3.40·10^+38[. The main advantage of floating-point over fixed-point is its constant relative accuracy. However, it is more complex to use [8]. A floating-point multiplication requires the computation of the product of the Ms, the sum of the Es, and the scaling of the result. A floating-point addition requires the scaling of the Es, followed by the sum of the Ms, and a scaling of the result.

Fixed-Point Representation

In a fixed-point representation, the decimal point is actually an illusion–a formatting artifact that is always in the same position. We can imagine the fractional values to be scaled by a constant factor that would render them integers. Programmers typically use the Q notation to indicate the virtual radix point [6]. For example, Q15 indicates a signed 16-bit fixed-point format, of which 15 bits are fractional.

A fixed-point value is coded by 1 sign bit, N1 bits before the point (integer word-length or iwl), and N2 bits after the point (fractional word-length or fwl). In a 16-bit fixed-point DSP, numbers are typically represented in the following sizes:
• Memory and registers: N1 = 0, N2 = 15;
• Accumulator: N1 = 8, N2 = 31.
In the accumulator, N1 holds 8 overflow bits, and N2, a 15-bit MSB part and a 16-bit LSB part. The fixed-point range is [-1; 1[ in 16-bit memory locations and registers, and [-256; 256[ in the 40-bit accumulator.

Further understanding of a particular Q format can be gained by calculating the quantization step q (the smallest difference between two numbers that can be represented), and the [–L; L[ range of values that can be represented (signed and unsigned):
• q = 2^-fwl;
• [–L; L[ signed = [-2^(iwl-1); 2^(iwl-1) – q];
• [–L; L[ unsigned = [0; 2^iwl – q].
For example, in Q15, the precision is 2^-15.

The principal advantage of fixed-point math is the inherent simplicity of its arithmetic operations. Addition and subtraction are the same operations as they are for ordinary integer values. Multiplication and division are expensive, divisions even more so. Therefore, we try to eliminate divisions by writing algorithms that, at most, contain divisions by a power of two, which, in effect, can be substituted by a right shift. In fact, when a number is divided by a power of two whose exponent is a multiple of eight, no actual shifting is necessary: the result is obtained by discarding the least significant byte(s). Multiplication involves intermediate values that are up to 2^B times larger than the operands (where B is the number of bits of the operands). This is realized in DSPs by having a 32-bit-wide product register to store the multiplication result of two 16-bit values. The addition of two B-bit signals gives B + 1 bits, which, when scaled back to B bits, may produce an overflow.

Signed Numbers

There is considerable variety in the way negative numbers are represented. By way of illustration, four common number representations for B = 3 are given in the following table:

Decimal   Sign and     One's        Two's        Offset
value     magnitude    complement   complement   binary
 +4       -            -            -            111
 +3       011          011          011          110
 +2       010          010          010          101
 +1       001          001          001          100
 +0       000          000          000          -
 -0       100          111          -            011
 -1       101          110          111          010
 -2       110          101          110          001
 -3       111          100          101          000
 -4       -            -            100          -
Table 1: Signed Number Representations

Most DSPs use the two's complement representation. In the case of a signed fixed-point number with only fractional bits, the successive bits have the following weights: -(2^-1), 2^-2, 2^-3, …

3. FIXED-POINT IMPLEMENTATIONS OF DIGITAL SIGNALS

When representing an analog signal in the digital domain, signals undergo non-reversible transformations to fit the finite representation format.

Quantization

Quantization is the process in which a quantity x with infinite precision is converted into a quantity xQ that is approximately equal to x but can assume fewer distinct values than x. The relation between x and xQ is called the quantization characteristic. The following conversion rules are applicable:
• Rounding-off (Fig. 1),
• Value truncation (Fig. 2), and
• Magnitude truncation (Fig. 3).

Fig. 1: Rounding Characteristic
Fig. 2: Value Truncation Characteristic
Fig. 3: Magnitude Truncation Characteristic

Overflow

Overflow takes care of signals that exceed the limits [–L; L[ enforced by the finite representation. In formal terms, we can describe this as a conversion of x into xP, where the relation between x and xP is called the overflow characteristic. The following conversion rules can be applied:
• Saturation (Fig. 4),
• Zeroing (Fig. 5), and
• Sawtooth or wrap-around (Fig. 6).
Saturation is a specific DSP operation. The sawtooth characteristic comes for free when dropping the extra MSB in two's complement notation. In theory, any form of quantization can be combined with any form of overflow.
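The three quantization rules and three overflow rules can be sketched as follows (a minimal illustration assuming a Q15 step; the function names are ours, the paper gives only the characteristic plots):

```python
import math

q = 2.0 ** -15          # quantization step (Q15 assumed)
L = 1.0                 # representable range is [-L; L[

# Quantization characteristics (Figs. 1-3)
def round_off(x):       # round to the nearest multiple of q
    return math.floor(x / q + 0.5) * q

def value_trunc(x):     # truncate toward minus infinity
    return math.floor(x / q) * q

def magnitude_trunc(x): # truncate toward zero
    return math.trunc(x / q) * q

# Overflow characteristics (Figs. 4-6)
def saturate(x):        # clamp to the nearest representable extreme
    return min(max(x, -L), L - q)

def zeroing(x):         # replace out-of-range values by zero
    return x if -L <= x < L else 0.0

def wrap_around(x):     # sawtooth: drop the extra MSBs, as in two's complement
    return ((x + L) % (2 * L)) - L

print(value_trunc(-1.5 * q) == -2 * q)      # True: rounds down
print(magnitude_trunc(-1.5 * q) == -1 * q)  # True: rounds toward zero
print(wrap_around(1.5))                     # -0.5
```

The last line shows why wrap-around is "free" in two's complement: an overflowing positive value reappears at the negative end of the range.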

Fig. 4: Saturation Characteristic
Fig. 5: Zeroing Characteristic
Fig. 6: Sawtooth Characteristic

Scaling

Scaling is a form of fixed-point operation with which we compress, or expand, the actual "physical" values of the original signal into a range suitable to the computer. By suitable, we mean that the computer will be able to store the input value in its registers, and use operations with sufficient range to hold intermediate results (in the accumulator or the product register, for instance). We must ensure that during all of the calculation stages, we are using storage locations and operations suitable for the range of values required at that stage.

In practice, when scaling a signal, either at the input or at an intermediate stage, we generally have to check that the value we are using is still accurate enough (in terms of precision bits). The top and bottom of the range are usually a good choice for this check. If the values at both ends of the range scale correctly, it is reasonable to assume that all of the intermediate values will also scale correctly. We typically try to use power-of-two scaling so that this operation can be performed using shifts.

A comment of general validity is that an increase of the word length B by one bit improves the maximum achievable signal/quantization-noise ratio by approximately 6dB [3].

In a multiplication, if both operands are scaled up by a factor N, the multiplication result is scaled up by N^2. To restore the original scaling, we need to divide the result by N. Before a division, we have to be careful not to scale down the denominator to zero. Usually, since division cancels the scaling, we keep the scaling in the numerator and ignore it in the denominator in order to obtain a scaled result without losing out on the resolution.

4. FINITE WORD-LENGTH EFFECTS

Finite word-length effects appear when converting continuous-amplitude signals into discrete-amplitude signals. Since the transition from continuous amplitude to discrete amplitude is in principle never completely reversible, there is always a loss of information, however slight. Finite word-length effects are very complicated to analyze because they are non-linear, i.e., xQ + yQ ≠ (x + y)Q. For example, by using the truncation characteristic on the fractional value, we can verify that (3.7)Q + (4.9)Q ≠ (3.7 + 4.9)Q.

A/D Conversion

In converting a signal of continuous amplitude x into a signal of discrete amplitude xQ, we proceed as though xQ were obtained by adding to x the noise signal e = xQ – x. In this way, we arrive at the useful concept of quantization noise. We generally assume that the error can be represented as a uniformly distributed white noise source, added to the ideal operation. This is not always true, but it gives an idea of the noise power in the filter. Other descriptions of the error include stochastic models [7].

Digital computations

The most complicated consequences of working with a finite word length are found when limiting the word length of intermediate results in digital systems. Just as in everyday calculations in the decimal system, addition and multiplication in the binary system often increase the word length. With recursive filters, this inevitably introduces problems because the result of one calculation is the starting point of the next. With non-recursive filters, the increase in the word length is always finite, but even then, some limitation may often be necessary.

In principle, we can choose the form of quantization and overflow to be used, and where they should be used in the filter. However, these choices can have completely different effects on the operation of the filter, and need to be studied thoroughly.

Limit Cycles

The great problem in analyzing the word-length limitation of intermediate results rests in the fact that, because of finite word-length effects, the filter is not completely linear. The most troublesome effects are usually due to overflow: signal distortion can then be very serious, and with recursive structures, there may even be overflow oscillations (also called limit cycles) of very large amplitude. Filters should therefore be designed so that this cannot happen.

It is found that quantization by means of magnitude truncation is not as liable to cause oscillation of this type as other forms of quantization [3]. It is also useful to know that limit cycles are always confined to a number of the least significant bits. If we reduce the quantization step q of all the quantizers in a filter where these cycles occur by increasing all the word lengths, then the amplitude of the limit cycle expressed in an absolute sense (e.g. the signal voltage) will also decrease. By quantizing the output signal, we can then make a filter where the limit cycles are negligible. In the case of audio signals, however, limit cycles can be very annoying, even at very low levels.

Several schemes can be used to prevent limit cycles [3]:
• Downscale the inputs and the coefficients;
• Hide the limit cycles (use more bits internally and quantize the output);
• Randomize the quantization.

A widely used aid in this context is scaling, particularly for filters consisting of cascaded sections. In the design, a multiplier by a constant factor smaller than one is inserted between the sections, to prevent the occurrence of overflow in the following section. If this factor is an integral power of two, the multiplication only means a shift of the sample value by one or more binary places. In some cases, scaling by a constant greater than one may be used if it appears during the design that the most significant bit would otherwise remain unused in the following section. After such a measure, overflow does not have to be considered further. Then, an attempt is made to get some idea of the effects of quantization of the intermediate results on the filter output signal. In practice, there are distinct preferences for particular combinations of quantization and overflow.

5. FILTER DESIGN ASPECTS

In a practical digital filter, the number of bits used for representing the coefficients must be as small as possible, largely for reasons of cost. The values found must therefore be quantized. But this alters the characteristics of the filter, i.e., it changes the locations of the poles and zeros. These changes can be very substantial. It may happen that, after quantization, the filter no longer satisfies the design specifications that were used in calculating the real-valued coefficients. In extreme cases, a stable filter may even become unstable. However, the quantization of filter coefficients introduces no changes in the linear operation of a circuit, nor does it cause any effects that depend on the input signal or that vary with time. It introduces no more than a once-calculable change in the filter characteristics.

Some filter structures are much more sensitive to coefficient quantization than others. In general, a filter structure becomes less sensitive as the location of each pole and each zero becomes dependent on fewer coefficients.

In this part, we discuss various system descriptions that can be used to study the static behavior of filters. We shall confine ourselves to signals that can be abstracted to a function of a single independent variable representing time, while the value of the function itself is denoted as the instantaneous amplitude.

Difference Equation

Quite generally, a practical Linear Time-invariant Discrete (LTD) system can be described by:

y[n] = Σ_{i=0..N} bi x[n−i] + Σ_{i=1..M} ai y[n−i],    (1)

where ai and bi are real constants. This equation is called a linear difference equation (of the Mth order) with constant coefficients.

Apart from being a good starting point for deriving other system descriptions, such as the system function, the difference equation gives a direct indication of a possible structure of a system. However, there can be many structures described by the same difference equation.

System function

The most abstract but at the same time the most versatile description of an LTD system is given by the system function H(z). One, often very effective, manner of determining H(z) can be derived from the difference equation of the system:

H(z) = Y(z)/X(z) = (Σ_{i=0..N} bi z^−i) / (1 − Σ_{i=1..M} ai z^−i).    (2)

In practice, the analysis of an LTD system is often based on a block diagram. To determine the system function, a direct description is usually made by means of z-transforms. We can make considerable use of the property that a delay of one sampling interval in the time domain corresponds to multiplication by z^−1 in the z-domain.

Poles and zeros

In eq. (2) we have found that the system function H(z) of practical LTD systems takes the form of a ratio of two polynomials in z^−1, i.e.:

H(z) = (b0 + b1 z^−1 + b2 z^−2 + … + bN z^−N) / (1 − a1 z^−1 − a2 z^−2 − … − aM z^−M).    (3)

We can always rewrite the numerator and the denominator as a product of factors:

H(z) = b0 z^(M−N) [(z − z1)(z − z2)…(z − zN)] / [(z − p1)(z − p2)…(z − pM)].    (4)

The precise (complex) values of z1, z2, …, zN and p1, p2, …, pM depend on the (real) coefficients b0, b1, b2, …, bN, a1, a2, …, aM in eq. (3). The factor z^(M−N) can be neglected as it represents a simple shift of h[n] in time [3]. The poles and zeros fully determine the function H(z), and hence the corresponding LTD system, except for a constant b0. The positions of poles and zeros are easily visualized in the complex z-plane. This gives the poles-and-zeros plot of H(z), which is a very useful graphic aid.

We can derive some remarks of general validity for practically feasible systems that are relevant to our discussion in relation to quantization [3]:

(i) Since the frequency response corresponds to the system function on the unit circle in the z-plane, each pole and each zero has the most effect on the frequency range associated with the nearest part of the unit circle. Furthermore, the effect of a pole or zero on the frequency response increases the closer it is to the unit circle. In the extreme case of a zero actually on the unit circle, the amplitude of the frequency response is zero at the corresponding frequency, and there is a jump of π radians in phase at that point. On the other hand, a pole on the unit circle gives an infinite amplitude at the corresponding frequency, again with a phase jump of π radians.

(ii) In a stable system, all the poles are inside the unit circle; zeros may lie inside it, on it, or outside it.

These remarks show that a slight variation in the coefficient values may affect the locations of the poles and zeros and therefore completely alter the system behavior.

We shall not discuss further other system descriptions, such as the impulse response and the frequency response, although they might be useful in studying system characteristics.

Filter Coefficients

The effects of quantization can be verified by calculating and plotting the new frequency response and/or the pole-zero diagram. These effects can be reduced by using different filter structures (cascade connection instead of direct-form structures for higher-order filters), or by using other number representations (e.g. 0.9995 = 1.0 − 0.5·10^−3).

Filter Structure

We can classify discrete filters by structure, as indicated by their block diagram. We have to remember that a particular structure is rarely unique: a specified filter characteristic can usually be achieved with different structures. Significant differences in characteristics between structures may also be found when the finite word-length effects of digital filters are taken into account.

We note that a Finite Impulse Response (FIR) filter can only possess zeros (apart from the origin), while an Infinite Impulse Response (IIR) filter can have both zeros and poles [2]. In terms of stability, an FIR filter is always stable; an IIR filter, as we

have seen, is unstable if one or more poles lie on the unit circle or outside it.

Let us take a look at Recursive Discrete Filters (RDFs). One possible structure follows directly from the general description of discrete filters in eq. (1): the direct-form-I structure (Fig. 7).

Fig. 7: Direct-Form-I Structure

We can think of it as split up into a transversal part with filter coefficients b0, b1, …, bN followed by a purely recursive part with filter coefficients a1, a2, …, aM. If the recursive part and the transversal part are made to change places–this is permissible because each part is an LTD system in itself–the result is the direct-form-II structure (Fig. 8).

Fig. 8: Direct-Form-II Structure

The direct-form-I structure comprises M + N unit-delay elements, whereas the direct-form-II structure only contains M unit-delay elements. The system function is the same in both cases; the zeros are determined by the coefficients bi and the poles by the coefficients ai. A disadvantage of the direct-form-II structure is that each coefficient bi affects all of the zeros. The same thing applies for each ai and all of the poles. This means that even small changes in the coefficients can have a considerable effect on the frequency response of the filter. This is definitely a serious drawback if we have to use quantized coefficient values in a digital filter. A solution to this problem can be found by building up the total filter from smaller units in which each pole and each zero is determined by a smaller number of coefficients. Filters of arbitrarily high order can be produced by cascading 1st-order and 2nd-order sections, without the high sensitivity to coefficient variations that characterizes the direct-form structures. In some cases, double precision for the coefficients, the input data and/or the intermediate calculations might be required to prevent the filter from diverging. Additionally, we have to take particular care about a possible overflow in the 1st-order section that might carry over to the 2nd-order section, or go into the feedback loop, which can generate limit cycles.

Alternative computation techniques can be used to prevent overflow in the intermediate calculations, as illustrated by the following examples:

y[n] = ((b0/2) x[n] + (b1/2) x[n−1] + (b2/2) x[n−2] − (a1/2) y[n−1] − (a2/2) y[n−2]) × 2    (5)

y[n] = b0 x[n] + (b1/2) x[n−1] + (b1/2) x[n−1] + b2 x[n−2] − (a1/2) y[n−1] − (a1/2) y[n−1] − a2 y[n−2]    (6)

Fig. 9 is an example of a recursive filter where scaling and saturation are used to prevent limit cycles:

Fig. 9: Direct-Form-I Filter with Scaling and Saturation

6. A FLOATING-POINT TO FIXED-POINT CONVERSION METHODOLOGY

Let us discuss a widely used DSP algorithm implementation strategy. Generally, the algorithm parameters (e.g. the filter coefficients) are computed as real, or sometimes complex, numbers. A floating-point reference code is then generated from the algorithm blueprint, and tested with reference input signals. If, for cost reasons, a fixed-

point target processor is selected, the code needs to (iii) Then, we consider non-linear functions.
be ported from the floating-point reference code to We analyze the input data and make
a fixed-point implementation. Sometimes, an implementation choices, such as look-up tables
intermediary step includes a target-independent versus polynomial approximations. The choice is
fixed-point model (usually in a high-level language mainly driven by the trade-off between precision,
like C), which is then followed by the target computational requirements and memory. Up to
implementation (often in assembly). This this point, we do not consider intermediate results.
intermediate step provides a way to reuse the fixed- (iv) The final step in this methodology deals
point conversion effort for various target with intermediate results and quantization effects.
implementations that may contain hardware- We apply the required arithmetic operators, study
specific optimizations. The final implementation is the effects of quantization on intermediate results,
tested for bit-accuracy with the fixed-point and use the accumulator (which provides the
reference code. This section focuses on the maximum precision) as much as possible.
conversion from a floating-point reference code to
a target-agnostic fixed-point model. 7. AUDIO ALGORITHMS TESTING
(i) The starting point of the conversion process STRATEGY
is the floating-point reference code provided by the In most audio systems, signals are represented
algorithm designer. This code may require large in 16-bits, providing 96dB of dynamic range,
dynamic ranges as well as high precision; it may which is deemed to be sufficient for human hearing
also include non-linear functions (e.g., logarithm, characteristics. However, consecutive operations
square root, cosine, etc.). Typically, some functions on the audio signal, especially scaling, may reduce
will require a block floating-point representation to maintain their accuracy.

(ii) The first pass of the conversion process consists in managing the signals dynamically by scaling the data to lie in the [–1; 1[ range. If the input value is smaller than the quantization step (i.e., x < 2^–(B–1)), then the quantized value is set to zero: xQ = 0. We should be careful about denominators here. Moreover, when applying scalars, we should keep in mind that one right shift means losing one bit of precision.

In terms of scaling, data can either be scaled statically or dynamically. In the former case, a fixed scalar, usually a power of two, is pre-defined, and all incoming data is divided (or shifted) by this factor. In the latter, the dynamic range of the data is evaluated at run-time, and a scalar is applied to the current block. The fixed-scalar method is driven by the worst-case input and does not take the characteristics of the current input buffer into account; hence, it can result in a loss of precision for low-level signals. Dynamic scaling, on the other hand, sometimes also called block floating-point, maintains the largest Signal-to-Noise Ratio (SNR) possible. Dynamic scaling is a trade-off between double precision and single precision: by storing a normalized mantissa for each sample and one common shift (scalar) value per block, it requires less storage space than double-precision data, but needs more CPU time to compute the scale factor, normalize the buffer, and perform operations on the data. Formally, a block floating-point value can be described as: A[i] = ANorm[i] · 2^Scalar.

Any downscaling is detrimental to the dynamic range (1 bit less equals a 6 dB decrease in the SNR). Quantization noise and saturation are to be taken into consideration since the human ear is susceptible to them; on the other hand, the human ear is largely insensitive to phase distortion. Several test signals can be applied to assess the dynamic behavior of the system.

(i) Worst-case criterion

In the case of a Non-Recursive Discrete Filter (NRDF), the output signal can be defined as:

y[n] = Σi=0..M x[i] h[n − i].   (7)

Assuming that |x[i]| ≤ 1.0, we have:

|y[n]| ≤ Σi=0..M |h[i]|.   (8)

To get |y[n]| ≤ 1.0, we can scale down the input signal by Σi=0..M |h[i]|. However, this is a very pessimistic scaling; it should be used sparingly, as downscaling reduces the number of significant bits and, thus, the SNR.

(ii) Second, we assume that the input signal is a sine wave. The maximum gain for the signal is then determined by the peak value of |H(z)| for z = e^(jωT), where –π ≤ ωT ≤ π. To get |y[n]| ≤ 1.0, we scale down the signal by max(|H(e^(jωT))|). Other scaling methods exist, but they do not guarantee that there will be no overflow because they all make some assumptions about the input signal. The best bet is to find a reasonable compromise and, when overflow occurs, use saturation.
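The block floating-point scheme described above is simple to sketch in C. The following is a minimal illustration under assumed conventions (16-bit Q15 samples, in-place normalization); the function name and interface are ours, not taken from the text:

```c
#include <stdint.h>
#include <stdlib.h>

/* Block floating-point: shift every sample of the block left by the
 * same amount so that the largest magnitude occupies the full 16-bit
 * range, and return that common shift: A[i] = ANorm[i] * 2^(-shift). */
int block_normalize(int16_t *buf, size_t len)
{
    int peak = 0;
    for (size_t i = 0; i < len; i++) {
        int mag = abs((int)buf[i]);   /* promote to int so |-32768| is safe */
        if (mag > peak)
            peak = mag;
    }
    if (peak == 0)
        return 0;                     /* all-zero block: nothing to scale */

    int shift = 0;
    while (peak < 0x4000) {           /* until bit 14 is occupied */
        peak <<= 1;
        shift++;                      /* one redundant sign bit removed */
    }
    for (size_t i = 0; i < len; i++)  /* normalize the mantissas in place */
        buf[i] = (int16_t)(buf[i] * (1 << shift));
    return shift;                     /* common scalar stored per block */
}
```

A dual routine that right-shifts by the stored scalar restores the original scale; the extra passes over the buffer are exactly the CPU cost mentioned above.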
Mihir Sarkar 8
(iii) Lastly, we test the system using white-noise signals at various amplitudes. This lets us visualize the frequency response of the system. However, it is also recommended to check with other signals: in some instances, consecutive medium-amplitude values could accumulate into a single large value that overflows [5].
8. AN EXAMPLE: FLOATING-POINT TO FIXED-POINT CONVERSION OF AN IIR FILTER
We shall discuss a high-pass IIR filter that processes a speech signal input. The floating-point code is based on a direct-form-II filter structure (Fig. 8), whereas the fixed-point code has been implemented as a direct-form-I structure (Fig. 7).

The output of the filter is given by the equation:

y[n] = b0 x[n] + b1 x[n − 1] + b2 x[n − 2] − a1 y[n − 1] − a2 y[n − 2].   (9)

Filtering can be done with the following options:
• Single-precision or double-precision coefficients;
• Single-precision, double-precision or block-scaling feedback loop;
• Static, dynamic or no scaling before and after the filter output.

Block scaling is used in order to optimize the code in terms of throughput and memory while maintaining a double-precision-like output.

The floating-point coefficients for a cut-off frequency at 50 Hz have the following values:
a1 = –1.94447765776709;
a2 = 0.945977936232282;
b0 = 0.972613898499844;
b1 = –1.94522779699969;
b2 = 0.972613898499844.

These coefficients have been converted into 16-bit fixed-point values as follows:
a1 = –1.944458;
a2 = 0.9459839;
b0 = 0.9725952;
b1 = –1.945251;
b2 = 0.9725952.

A comparison of the floating-point frequency response with the fixed-point (single-precision coefficients) frequency response gives the theoretically expected behavior (Fig. 10).

Fig. 10: Theoretical Floating-Point and Fixed-Point Frequency Response

For the following computations, we use single-precision fixed-point coefficients and low-amplitude white noise as input. The resulting frequency responses are depicted in Fig. 11, 12 and 13, respectively, for a single-precision, double-precision and block floating-point feedback loop.

Fig. 11: Single-Precision Feedback Loop

In Fig. 11 and 13, the fixed-point filters diverge considerably from the expected response.

Fig. 12: Double-Precision Feedback Loop
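As a cross-check, the 16-bit values above are consistent with a Q2.14 format (14 fractional bits, needed because |a1| and |b1| exceed 1.0): round(−1.94447765776709 × 2^14) = −31858, and −31858 / 2^14 ≈ −1.944458, the listed value. A hypothetical direct-form-I realization of equation (9) with these Q2.14 coefficients (an assumed sketch, not the actual code of [9]) could read:

```c
#include <stdint.h>

#define QF 14  /* Q2.14: coefficients scaled by 2^14 */

/* The quantized coefficients from the text as Q2.14 integers,
 * e.g. a1 = -31858 / 2^14 = -1.944458. */
static const int32_t a1 = -31858, a2 = 15499;
static const int32_t b0 = 15935, b1 = -31871, b2 = 15935;

/* Direct-form-I realization of equation (9):
 * y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] - a1 y[n-1] - a2 y[n-2].
 * The 64-bit accumulator stands in for a DSP's wide MAC accumulator,
 * so intermediate sums cannot overflow. */
int16_t iir_df1(int16_t x)
{
    static int16_t x1, x2, y1, y2;           /* delay lines, start at zero */
    int64_t acc = (int64_t)b0 * x + (int64_t)b1 * x1 + (int64_t)b2 * x2
                - (int64_t)a1 * y1 - (int64_t)a2 * y2;  /* Q15*Q2.14 = Q29 */
    acc >>= QF;                              /* back to Q15 (arithmetic shift) */
    if (acc >  32767) acc =  32767;          /* saturate rather than wrap */
    if (acc < -32768) acc = -32768;
    x2 = x1; x1 = x;
    y2 = y1; y1 = (int16_t)acc;
    return y1;
}
```

On a real fixed-point DSP the shift-and-saturate step typically maps to a single store instruction from the accumulator.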
The results in Fig. 12 show that the filter output follows its theoretical behavior when using double-precision data.

Fig. 13: Block-Scaling Feedback Loop

With maximum-amplitude white noise and different scaling modes, we get the frequency responses depicted in Fig. 14 and 15.

Fig. 14: Single-Precision Feedback Loop, No Scaling

In Fig. 14, with no scaling around the filter, we enter into limit cycles with high-amplitude signals.

Fig. 15: Single-Precision Feedback Loop, Static Scaling

Fig. 15 shows a reasonable frequency response for static scaling, so we do not need to evaluate the system with dynamic scaling (which gives an identical frequency response).

Thus, we see that the optimum fixed-point solution for this filter is reached by using single-precision coefficients and a double-precision feedback loop with static scaling before and after the filter, with a structure similar to that shown in Fig. 9.

9. CONCLUSION

We have discussed how fixed-point arithmetic refers to the handling of numbers that are scaled up by a certain factor to allow space for fractional parts. Moreover, we have studied fixed-point DSP characteristics in order to understand the techniques for implementing a floating-point algorithm on a fixed-point processor.

The techniques described here reflect current industry practice for porting algorithm designs to audio and multimedia consumer products, keeping costs down while providing high-quality solutions. It is important for programmers to understand the trade-off parameters in order to make informed design decisions.

Additional investigations in the area of floating-point to fixed-point conversion can be found in automatic conversion mechanisms and fixed-point simulation tools.

REFERENCES

[1] A.V. Oppenheim, R.W. Schafer, Digital Signal Processing, Prentice-Hall, 1975.
[2] E.C. Ifeachor, B.W. Jervis, Digital Signal Processing, A Practical Approach, Addison-Wesley, 1993.
[3] A.W.M. van den Enden, N.A.M. Verhoeckx, “Digital Signal Processing: Theoretical Background,” Philips Tech. Rev., vol. 42, no. 4, pp. 110-144, Dec. 1985.
[4] D. Goldberg, “What Every Computer Scientist Should Know About Floating-Point Arithmetic,” ACM Computing Surveys, vol. 23, no. 1, Mar. 1991.
[5] R. Gordon, “A Calculated Look at Fixed-Point Arithmetic,” Embedded Systems Programming, Apr. 1998, pp. 72-78.
[6] M. Jersák, M. Willems, “Fixed-Point Extended C Compiler Allows More Efficient High-Level Programming of Fixed-Point DSPs,” Proc. Int'l Conf. Sig. Proc. App. and Tech. (ICSPAT’98), Toronto, Canada, Sept. 1998.
[7] J. Tielen, “Restrictions when Implementing Filters on a Fixed-Point DSP,” presentation, Philips, Leuven, Belgium, Mar. 2001.
[8] T. Gouraud, “Floating-Point to Fixed-Point Conversion: Some Ideas for Complex Algorithms,” presentation, Philips, Leuven, Belgium, June 2000.
[9] M. Sarkar, Fixed-Point Implementation of a High-Pass IIR Filter, tech. report, Philips, Leuven, Belgium, July 2001.
[10] T. Gouraud, Fixed Point Arithmetic, tech. report, Philips, Leuven, Belgium, Apr. 2000.