Implementation of DSP Algorithms On Fixed Point DSP Processors
Digital filters will be used as a vehicle in the discussions. The implementation of DSP algorithms (e.g.
digital filters, adaptive filters and FFTs) in real-time on DSP processors may involve low level
assembly language codes and/or codes in an efficient high level language, such as C or C++. The
use of high level languages is now widespread, especially with newer DSP processors which are
very complex and sophisticated. Typically, assembly language codes are generated from high level
language codes.
Non-recursive N-point FIR filters, with the structure given in Figure 1a, are characterised by the
following difference equation:
y(n) = \sum_{k=0}^{N-1} h(k) x(n-k), where N is the filter length. (2)
A fragment of a C language implementation of the general FIR filter is given in Program 1. For real-time
FIR filtering, the data and coefficients are stored in memory, conceptually, as shown in Figure 16(b).
Figure 16 Implementation of FIR filter: (a) filter structure; (b) coefficient and data memory maps; (c) an
alternative memory map.
Typically, for real-time implementation, the new data sample, x(n), is read from the ADC, the RAM
contents are shifted one place to make room for the new data, the new sample is saved and the output
sample is then computed from Equation 2 and then sent to the DAC. The process is repeated until the
filter is stopped. In off-line processing, the input data may be read from, e.g. hard disk, and the output
stored or used in some way.
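The per-sample procedure described above can be sketched in C as follows. This is a minimal illustration, not the listing of Program 1: the function name fir_sample, the use of double precision arithmetic and the array layout (delay[k] holding x(n-k)) are assumptions made for clarity, and the ADC and DAC transfers are left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Process one input sample through an N-point FIR filter.
 * h[k]     : coefficients h(0)..h(N-1)
 * delay[k] : delay line; holds x(n-k) once the new sample is stored
 * Returns y(n) as defined by Equation 2. */
double fir_sample(double x_new, const double *h, double *delay, size_t n_taps)
{
    assert(n_taps > 0);

    /* Shift the delay line one place; the oldest sample drops off the end. */
    for (size_t k = n_taps - 1; k > 0; k--) {
        delay[k] = delay[k - 1];
    }
    delay[0] = x_new;               /* save the new sample x(n) */

    /* Convolution sum: y(n) = sum over k of h(k) x(n-k). */
    double acc = 0.0;
    for (size_t k = 0; k < n_taps; k++) {
        acc += h[k] * delay[k];
    }
    return acc;
}
```

In a real-time system this function would be called once per sampling period, with x_new read from the ADC and the returned value sent to the DAC. The delay array should be cleared to zero before the first call.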
To appreciate how the FIR filter works, tackle the following problem.
Illustrative Problem
The coefficients of a three-point FIR filter are given by: h(0) = 0.5; h(1) = 0.75; h(2) = 0.25. The filter is fed
with the data sequence given below.
Sampling instant, n    x(n)
0                      0.25
1                      0.5
2                      0.25
3                      -0.25
4                      0.75
(a) Determine the contents of the tapped delay line and associated output sample at each of the
successive sampling instants, n (n=0, 1, 2, 3, 4). Sketch the transversal structure, showing the contents of
the delay line and associated output sample at the successive sampling instants.
(b) Sketch the input and output signals, taking account of the delay through the filter.
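One way to check a hand-worked answer to part (a) is to trace the tapped delay line in software. The sketch below uses the coefficients and data of the problem; the function name trace_three_point_fir and the assumption that the delay line is initially zero are not part of the problem statement.

```c
#include <assert.h>
#include <stdio.h>

#define TAPS 3
#define SAMPLES 5

/* Fill y[0..SAMPLES-1] with the outputs of the three-point filter and
 * print the delay-line contents at each sampling instant. */
void trace_three_point_fir(double y[SAMPLES])
{
    const double h[TAPS] = {0.5, 0.75, 0.25};
    const double x[SAMPLES] = {0.25, 0.5, 0.25, -0.25, 0.75};
    double line[TAPS] = {0.0, 0.0, 0.0};  /* x(n), x(n-1), x(n-2); assumed zero at start */

    assert(y != NULL);
    for (int n = 0; n < SAMPLES; n++) {
        /* Shift the delay line and store the new sample. */
        line[2] = line[1];
        line[1] = line[0];
        line[0] = x[n];

        /* Output sample from Equation 2 with N = 3. */
        y[n] = 0.0;
        for (int k = 0; k < TAPS; k++) {
            y[n] += h[k] * line[k];
        }
        printf("n=%d  delay line=[%6.2f %6.2f %6.2f]  y(n)=%7.4f\n",
               n, line[0], line[1], line[2], y[n]);
    }
}
```

Each printed row corresponds to one snapshot of the transversal structure requested in part (a).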
7.2 Implementation of FIR filters on fixed point DSP processors
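==================================================
NXTPT IN XN, ADC ;read new input sample x(n) from the ADC
ZAC ;clear the accumulator
LT XNM2 ;load T register with x(n-2)
MPY H2 ;h(2)x(n-2)
LTD XNM1 ;load x(n-1), accumulate h(2)x(n-2), shift x(n-1) to x(n-2)
MPY H1 ;h(1)x(n-1)
LTD XN ;load x(n), accumulate h(1)x(n-1), shift x(n) to x(n-1)
MPY H0 ;h(0)x(n)
APAC ;h(2)x(n-2)+h(1)x(n-1)+h(0)x(n)
B NXTPT ;branch back for the next sample
==================================================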
An implementation of the three-point FIR filter in a first generation fixed point DSP processor is given
in Program 2. In this case, the computation of the products starts at the bottom of the data and
coefficients to exploit the TMS320C10 data move instructions. The instruction pair LTD and MPY is
central to the TMS320C10-based FIR filter implementation. For example, the instruction pair below
performs the shift implied in Equation 3 or represented by z^-1 in Figure 17, adds the previous product to
the accumulator and calculates the next product, h(k)x(n-k).
LTD XNM1
MPY H1
Specifically, the instruction LTD XNM1 loads the T (temporary) register with the data sample x(n-1)
(held in data RAM address XNM1), adds the product, h(2)x(n-2), which is still in the P (product)
register to the accumulator, and shifts x(n-1) up to the next address, that is x(n-1) → x(n-2). The second
instruction MPY multiplies the contents of the T register with h(1) and leaves the result in the product
register. The shifting scheme ensures that the input data samples are in the right locations when the next
sample is to be computed.
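The behaviour of the LTD and MPY pair can be modelled in C to make the order of operations explicit. The sketch below is an illustrative model only: the mac_regs structure, the helper names ltd and mpy, and the memory layout (mem[0] = x(n), mem[1] = x(n-1), mem[2] = x(n-2)) are assumptions, and overflow handling is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register model: T (multiplicand), P (product), ACC (accumulator). */
typedef struct {
    int16_t t;
    int32_t p;
    int32_t acc;
} mac_regs;

/* LTD addr: load T from mem[addr], add P to ACC, copy mem[addr] to mem[addr + 1]. */
static void ltd(mac_regs *r, int16_t *mem, int addr)
{
    r->t = mem[addr];
    r->acc += r->p;
    mem[addr + 1] = mem[addr];
}

/* MPY coeff: P = T * coeff. */
static void mpy(mac_regs *r, int16_t coeff)
{
    r->p = (int32_t)r->t * coeff;
}

/* Three-point FIR computed from the bottom of the data upwards.
 * Returns the raw 32-bit accumulator (Q30 if data and coefficients are Q15). */
int32_t fir3(int16_t x_new, int16_t mem[3], const int16_t h[3])
{
    mac_regs r = {0, 0, 0};   /* ZAC: clear the accumulator */

    assert(mem != 0 && h != 0);
    mem[0] = x_new;           /* IN  XN, ADC */
    r.t = mem[2];             /* LT  XNM2 (load only, no shift) */
    mpy(&r, h[2]);            /* MPY H2 */
    ltd(&r, mem, 1);          /* LTD XNM1 */
    mpy(&r, h[1]);            /* MPY H1 */
    ltd(&r, mem, 0);          /* LTD XN */
    mpy(&r, h[0]);            /* MPY H0 */
    r.acc += r.p;             /* APAC: add the final product */
    return r.acc;
}
```

Note that each product is added to the accumulator one step late, by the following LTD or by the final APAC, which is why the sequence ends with APAC.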
Straight line coding of the FIR filter, such as Program 2, leads to a fast implementation, but is not
general purpose, and for large N-point filters will not yield a compact program.
In particular, a general purpose FIR filter is implemented by setting up an inner loop to execute the FIR
equation and calculate the filter output as specified in Equation 2. The flowchart for an N-point FIR
filter showing the inner loop is given in Figure 18. In the first generation DSP processor, the inner loop of
the FIR filter may be executed by the following instructions:
LOOP LTD *, AR0 ; shift/update delay line and accumulate products
MPY *-, AR1 ; multiply next coefficient and data value
BANZ LOOP
In this case, the auxiliary registers, AR0 and AR1, are used to point to the data value and coefficient to
be multiplied. The auxiliary register, AR1, contains the filter length and acts as a loop counter. The
branch on register not zero instruction, BANZ, together with AR1, is used to control the loop. FIR filter
implementation in the first generation DSP processor is not efficient because of the overhead associated
with loop control.
Figure 18 Flowchart for N-point FIR filter. The FIR filter inner loop executes the convolution sum in
Equation 2.
The second generation fixed point DSP processors, such as TMS320C50 and Motorola DSP56000, have
zero-overhead looping capability and special multiply and accumulate instructions which help to cut
down the time to execute the FIR inner loop. In the TMS320C50, the inner loop of an N-point FIR filter
which is shown in Figure 18 can be efficiently executed using the following instructions:
RPT NM1
MACD HNM1, XNM1
The instruction RPT NM1 loads the filter length minus 1 (N-1) into the repeat register and causes the
multiply and accumulate with data shift instruction, MACD, to be executed N times in total (that is,
repeated N-1 times after its first execution) with zero overhead. The MACD combines the instruction
pair LTD MPY into a single instruction, enabling faster
execution. The instruction pair RPT and MACD is a good example of time-saving special instructions
available in DSP processors.
Illustrative Example
A digital FIR notch filter that is required to satisfy the specifications given below is to be implemented on
the second generation fixed point DSP processor, TMS320C50.
Solution
The unquantised and quantised coefficients (quantised to 16 bits by multiplying each unquantised
coefficient by 2^15 and then rounding to the nearest integer) are listed in Table 3.
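The quantisation step described above can be sketched in C as follows. The function name quantise_q15, the tie-breaking rule (halves rounded away from zero) and the saturation of values that would round to +32768 are assumptions for illustration; the values in Table 3 may have been produced with a different tool.

```c
#include <assert.h>
#include <stdint.h>

/* Quantise a coefficient c, with -1 <= c <= 1, to a 16-bit integer by
 * multiplying by 2^15 and rounding to the nearest integer. */
int16_t quantise_q15(double c)
{
    assert(c >= -1.0 && c <= 1.0);

    double scaled = c * 32768.0;                 /* multiply by 2^15 */
    /* Round to nearest; ties go away from zero (assumed tie rule). */
    long v = (long)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));

    /* Assumed saturation: +1.0 cannot be represented in 16 bits. */
    if (v > 32767) {
        v = 32767;
    }
    if (v < -32768) {
        v = -32768;
    }
    return (int16_t)v;
}
```

For example, a coefficient of 0.75 is stored as 24576 and a coefficient of -0.25 as -8192.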
As shown in the flowchart in Figure 18, the complete FIR filter has at least 4 essential parts:
(1) initialisation - initialise system, this may include setting up coefficient table
(2) input section - may include reading of the input sample, x(n), e.g. from an ADC via a serial port
(3) inner loop computation – execution of FIR equation to obtain y(n)
(4) output section – may include shifting/rounding of result of inner loop computation and sending this,
e.g. to the DAC via a serial port
As steps 1, 2 and 4 are largely system dependent, we will concentrate on the inner loop computation
here.
The FIR inner loop may be implemented with the following instructions in the TMS320C50:
In this case, the coefficient and data memories are organised as shown in Figure 16c. The auxiliary
register AR1 is used for indirect addressing in the inner loop computation (the MACD instruction) and
initially points to the oldest data sample, XNM1, in the data memory. In the inner loop, the MACD
instruction does the following:
7.3 Circular buffer-based implementation of FIR filters.
An alternative approach to implementing the N-point FIR filters in second and later generations of DSP
processors is to use circular buffers. It is evident that in FIR filtering, the content of the coefficient
memory is static, but the data memory changes when each new input data sample arrives. Effectively,
successive new data samples are fed into a sliding window whilst the oldest data samples drop off. A
circular buffer may be used to handle the changes in the block of input data samples that are used for
FIR filtering without having to shift the data as in linear data buffers.
Conceptually, a circular buffer is the same as a linear buffer if we consider the two ends of the linear
buffer to be adjacent, i.e. the latest and oldest data samples, x(n) and x[n-(N-1)] are adjacent, see Figure
19. In the circular buffer in Figure 19a, the data pointer (symbolised by the arrow) points to the memory
location of the newest input sample, x(n), and previous input data samples, x(n-1), x(n-2), ..., x(n-7), are
stored in successive locations, clockwise. The FIR inner loop is executed at each sampling period, as
before, by multiplying each data sample by the corresponding filter coefficient, h(k) and accumulating
the products. The only difference is that the data samples are not shifted. After the inner loop
computation, the pointer is located at x(n-7), the oldest data sample, which is then overwritten by the
next input sample, x(n). Figure 19a to d illustrates how the circular buffer works for 3 successive data
samples.
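The circular buffer scheme described above can be sketched in C, where the wraparound that a DSP processor performs in hardware is written out with the modulo operator. The type circ_fir, the function circ_fir_sample and the choice of N = 8 (matching the eight samples x(n) to x(n-7) in the description) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N_TAPS 8   /* eight samples, x(n) .. x(n-7) */

typedef struct {
    double data[N_TAPS];  /* circular data buffer */
    size_t pos;           /* location of the oldest sample, overwritten next */
} circ_fir;

/* Store x(n) over the oldest sample and compute y(n) without moving any data. */
double circ_fir_sample(circ_fir *f, const double h[N_TAPS], double x_new)
{
    assert(f->pos < N_TAPS);

    f->data[f->pos] = x_new;            /* overwrite oldest sample with x(n) */

    double acc = 0.0;
    size_t idx = f->pos;                /* start at the newest sample */
    for (size_t k = 0; k < N_TAPS; k++) {
        acc += h[k] * f->data[idx];     /* h(k) x(n-k) */
        idx = (idx + 1) % N_TAPS;       /* step to the next older sample, with wraparound */
    }

    /* The oldest sample now sits one place behind x(n); it is overwritten next time. */
    f->pos = (f->pos + N_TAPS - 1) % N_TAPS;
    return acc;
}
```

Only the pointer moves from one sampling period to the next; the stored samples stay where they are, which is the saving over the linear buffer.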
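MOVE #XDATA, R0 ; point R0 at the data buffer
MOVE #COEFF, R4 ; point R4 at the coefficient buffer
MOVE #N-1, M0 ; buffer/modulo size
MOVE M0, M4 ; same modulo size for the coefficient buffer
MOVEP X: INPUT, X: (R0) ; read and store input sample
CLR A X:(R0)+, X0 Y:(R4)+, Y0 ; clear the accumulator, fetch first data/coefficient pair
REP #N-1 ; execute FIR inner loop
MAC X0, Y0, A X:(R0)+, X0 Y:(R4)+, Y0
MACR X0, Y0, A (R0)-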
In this case, circular buffers are used to store both the data and coefficients. The circular data buffer
performs the time shift implicitly as described above. However, the circular coefficient buffer is used
here for convenience for automatic wraparound of the coefficient pointer. The first four instructions
above set up the address pointers, R0 and R4. The inner FIR loop is executed by the instruction-pair
REP and MAC. The repeat instruction, REP, repeats the next instruction N-1 times. The next instruction
line exploits the multi-path architecture and parallelism of the DSP56000 to perform a set of multiple
operations – it multiplies the data value and coefficient which are in X0 and Y0, adds the product to the
accumulator and fetches the next data value and coefficient pair to be multiplied from X and Y
memories and updates the pointers.
Apart from FIR filtering, circular addressing is useful in the efficient implementation of a number of
DSP functions that require time shifts or FIFO queues, e.g. correlation, multirate filters (decimation and
interpolation filters) and periodic waveform generation. Its use eliminates the need to move data or for
constant checking/resetting of address pointers. Later generations of DSP processors have enhanced
circular addressing capability.
Figure 19 An illustration of the principles of circular buffer-based FIR implementation.
8. IIR Digital Filtering
In practice, IIR filters are often implemented using the second order canonic structure (Figure
1) and the direct form I structure (Figure 2). The canonic second order filter structure is
characterised by the following two-step difference equation:
y(n) = b0 w(n) + b1 w(n-1) + b2 w(n-2) (1a)
An examination of Equations 1a and 1b shows that in general we need the following hardware
resources to implement a second order canonic filter section:
6 coefficients, including the scale factor (two of these are trivial for the IIR problem)
2 delay elements
11 memory locations (6 for coefficients and 5 for data).
2 accumulators (4 additions)
1 multiplier (5 multiplications).
A possible memory organisation for the general second order IIR filter is shown in Figure 1. A C-
language pseudo code implementation of the IIR filter section is shown in Program 1.
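A second order canonic section can be sketched in C as follows. This is an illustration rather than the listing of Program 1. The output line follows Equation 1a; the internal node update, w(n) = x(n) - a1 w(n-1) - a2 w(n-2), is an assumed sign convention, the scale factor is omitted, and the names biquad and biquad_sample are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double b0, b1, b2;   /* numerator coefficients (Equation 1a) */
    double a1, a2;       /* denominator coefficients (assumed sign convention) */
    double w1, w2;       /* delay elements: w(n-1) and w(n-2) */
} biquad;

/* Compute one output sample of a second order canonic section. */
double biquad_sample(biquad *s, double x)
{
    assert(s != NULL);

    /* Step 1: internal node, assumed as w(n) = x(n) - a1 w(n-1) - a2 w(n-2). */
    double w = x - s->a1 * s->w1 - s->a2 * s->w2;

    /* Step 2: output from Equation 1a. */
    double y = s->b0 * w + s->b1 * s->w1 + s->b2 * s->w2;

    /* Update the two delay elements for the next sampling instant. */
    s->w2 = s->w1;
    s->w1 = w;
    return y;
}
```

The structure needs only two stored data values, w(n-1) and w(n-2), per section. If the filter was designed with the opposite sign convention for a1 and a2, the signs in step 1 must be reversed.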
Figure 1 (a) Second order canonic filter section; (b) coefficient and data storage.
Program 1 C Language pseudo code for the canonic section
A DSP56002 implementation is shown in Figure 2. It operates as follows: The IIR filter routine
reads the input sample from memory (input). It then performs the IIR filtering on the data and
internal node data - w(n), w(n-1) and w(n-2). The filtered output is then written to the memory
location (output).
As with FIR filters, the coefficients are represented as 2's complement numbers. In this
representation, all numbers must be less than 1 in magnitude. To ensure that this is satisfied, one scheme is to
divide each coefficient by 2 before it is stored (see later).
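The halving scheme can be illustrated with a short C sketch. It assumes 16-bit fractional (Q15) storage and coefficients with magnitude below 2; the function names store_halved_q15 and mul_halved_q15 are hypothetical, and the way the factor of 2 is restored here (a divide by 2^14 instead of 2^15) is one possible choice, not necessarily the one used in the DSP56002 code.

```c
#include <assert.h>
#include <stdint.h>

/* Store c/2 as a 16-bit fractional number; valid for -2 <= c < 2. */
int16_t store_halved_q15(double c)
{
    assert(c >= -2.0 && c < 2.0);

    double scaled = (c / 2.0) * 32768.0;
    long v = (long)(scaled + (scaled >= 0.0 ? 0.5 : -0.5));  /* round to nearest */
    if (v > 32767) {
        v = 32767;                                           /* assumed saturation */
    }
    return (int16_t)v;
}

/* Multiply a Q15 sample by a halved coefficient and undo the halving.
 * The result is in Q15 units but is returned as 32 bits because it may
 * exceed the 16-bit range. */
int32_t mul_halved_q15(int16_t x, int16_t half_c)
{
    int32_t prod = (int32_t)x * half_c;   /* Q30 value of x * (c/2) */
    return prod / 16384;                  /* divide by 2^14: Q30 to Q15, times 2 */
}
```

For example, a coefficient of -1.5 cannot be stored directly, but -0.75 can; multiplying the sample 0.5 by the stored value and restoring the factor of 2 gives -0.75, as expected.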
(a)
Data            Coefficients
                b1/2
                b2/2
w(n-1)          a1/2
w(n-2) <- r0    a2/2 <- r4
(b)
Figure 2 (a) DSP56002 code for a canonic IIR filter implementation; (b) Coefficient and
data memory organization for the DSP56002 implementation.
For higher order IIR filters, several second order canonic sections (or direct form I sections) are
connected in cascade or in parallel.
Cascade
The transfer function, H(z), of an Nth-order IIR filter, using second-order sections in cascade, is
given by:
H(z) =
The cascade realization of a fourth order (N=4) filter using second-order canonic sections is
shown in figure x(a). The storage of the filter variables (data and coefficients) is shown in
Figure x. The set of difference equations for the fourth order IIR filter, using canonic sections,
is given by:
w1
y1
w2
y2
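A cascade of canonic sections can be sketched in C by feeding the output of each section into the next. The sketch assumes that each section computes w_k(n) = in_k(n) - a1 w_k(n-1) - a2 w_k(n-2) followed by the output form of Equation 1a; this sign convention and the names section, section_step and cascade_sample are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double b0, b1, b2, a1, a2;  /* coefficients of one second order section */
    double w1, w2;              /* w_k(n-1) and w_k(n-2) */
} section;

/* One canonic section: internal node first, then the output (Equation 1a). */
static double section_step(section *s, double in)
{
    double w = in - s->a1 * s->w1 - s->a2 * s->w2;   /* assumed sign convention */
    double out = s->b0 * w + s->b1 * s->w1 + s->b2 * s->w2;
    s->w2 = s->w1;
    s->w1 = w;
    return out;
}

/* Cascade: the output of section k is the input of section k+1. */
double cascade_sample(section *sec, size_t n_sections, double x)
{
    assert(sec != NULL);

    double v = x;
    for (size_t k = 0; k < n_sections; k++) {
        v = section_step(&sec[k], v);
    }
    return v;
}
```

For a fourth order filter, n_sections is 2: the first section produces y1(n) from x(n), and the second produces the filter output y2(n) from y1(n).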
Parallel realization.
The transfer function of an Nth order IIR filter for parallel realization is given by:
H(z)
The realization diagram, using the second order canonic sections, for N=4 is given in Figure
12.30. For the canonic section, the difference equation is given by:
w1
y1
w2
y2
yn
A simple C language code for an IIR filter realized as a parallel combination of second-order
canonic sections is given in Program 12.6.
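A parallel realization can be sketched in C by driving every section with the same input and summing the section outputs. This is an illustration, not the listing of Program 12.6: the sign convention for a1 and a2, the optional direct-path gain c0 and the names branch and parallel_sample are assumptions.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double b0, b1, b2, a1, a2;  /* coefficients of one second order section */
    double w1, w2;              /* w_k(n-1) and w_k(n-2) */
} branch;

/* Parallel form: all branches see x(n); y(n) is the sum of the branch outputs. */
double parallel_sample(branch *br, size_t n_branches, double c0, double x)
{
    assert(br != NULL);

    double y = c0 * x;          /* assumed direct path; pass 0.0 if there is none */
    for (size_t k = 0; k < n_branches; k++) {
        branch *s = &br[k];
        double w = x - s->a1 * s->w1 - s->a2 * s->w2;      /* assumed sign convention */
        y += s->b0 * w + s->b1 * s->w1 + s->b2 * s->w2;    /* add y_k(n) */
        s->w2 = s->w1;
        s->w1 = w;
    }
    return y;
}
```

For N=4, n_branches is 2 and the output is y(n) = y1(n) + y2(n), plus the direct term if one is present. Unlike the cascade, the branches are independent of one another, so they can be computed in any order.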
The problems below will assist you to review the topics and to gain further insight into the issues. Do
have a go at them.
Next week and in the laboratory sessions we will look at some of these issues again and, in particular,
we will look at IIR filters.
9. Summary
We have discussed at length several aspects of DSP processors and DSP implementation, using FIR
filters as a vehicle. In particular, we have discussed the following two topics:
10. Problems
1 In relation to DSP processors, write short critical notes on each of the following concepts, using
diagrams where appropriate to illustrate your answer:
Harvard architecture;
pipelining;
multiplier-accumulator;
special instructions for DSP;
data and program memory.
Explain how Harvard architecture as used by the TMS320 family differs from the strict Harvard
architecture. Compare this with the architecture of a standard von Neumann processor.
(2) Assume a memory access time of 150 ns, multiplication time of 100 ns,
addition time of 100 ns, and overhead of 5 ns at each pipe stage. Determine
the throughput of the MAC. Comment on your answer.
(3) The DSP system is required to execute the following algorithm in real time:
How long will it take the MAC to compute each output sample?
4 (a) Explain why traditional measures such as processor clock speed, MIPS and MFLOPS may
not be suitable for comparing the execution performance of DSP processors. Suggest, with
justifications, an alternative method of comparing execution performance.
(b) State and then discuss 4 key factors, apart from execution speed, that should be considered in
choosing a DSP processor for each of the following applications:
(i) high fidelity digital audio; (ii) voice over Internet Protocol telephony; (iii) physiological
signal processing for diagnosis in biomedicine.
5 (a) Compare the computational performance of the TMS320C50 and DSP56000 fixed point
processors based on the execution of the inner loop of an N-point FIR filter.
(b) Repeat (a) for an Nth order IIR filter with M second order canonic filter sections in cascade.
Assume that N is even.
6 In relation to DSP processors, write brief explanatory notes, with the aid of sketches where
appropriate, for each of the following techniques:
In each case, clearly point out the advantages and disadvantages of the technique in signal processing.
Bibliography
Ifeachor E C and Jervis B W (2002). Digital Signal Processing: A Practical Approach, 2nd Edition. Pearson Education.
Buyer’s Guide to DSP Processors. Berkeley Design Technology Inc, Fremont, Calif, 1999. Details
available at www.BDTI.com