DSP Arch
UNIT I ARCHITECTURES FOR PROGRAMMABLE DSP PROCESSORS
1. Basic Architectural Features
A programmable DSP device should provide instructions similar to those of
a conventional microprocessor. The instruction set of a typical DSP device
should include the following:
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY etc.
b. Logical operations such as AND, OR, NOT, XOR etc.
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also include:
a. On-chip registers to store intermediate results
b. On-chip memories to store signal samples (RAM)
c. On-chip memories to store filter coefficients (ROM)
2. DSP Computational Building Blocks
Each computational block of the DSP should be optimized for
functionality and speed. At the same time, the design should be
sufficiently general so that it can be easily integrated with other blocks to
implement overall DSP systems.
2.1 Multipliers
The advent of single-chip multipliers paved the way for
implementing DSP functions on a VLSI chip. Parallel multipliers have now
replaced the traditional shift-and-add multipliers. A parallel multiplier takes
a single processor cycle to fetch and execute the instruction and to store the
result. Parallel multipliers are also called array multipliers. The key
features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
The number of bits used to represent the operands decides the
accuracy and the dynamic range of the multiplier, whereas the speed is
decided by the architecture employed. If the multiplier is implemented in
hardware, the speed of execution will be very high, but the circuit
complexity also increases considerably. Thus there is a tradeoff
between the speed of execution and the circuit complexity, and the choice
of architecture normally depends on the application.
2.2 Parallel Multipliers
Let A be represented using m bits as (Am-1 Am-2 ... A1 A0) and B be
represented using n bits as (Bn-1 Bn-2 ... B1 B0). Then the product of
these two numbers is given by
P = A × B = Σ(i=0 to m-1) Σ(j=0 to n-1) Ai Bj 2^(i+j)
This operation can be implemented in parallel using a Braun multiplier,
whose hardware structure is shown in the figure below.
2.3 Multipliers for Signed Numbers
The Braun multiplier does not take the sign of the numbers into
account. In order to implement a multiplier for signed numbers, additional
hardware is required to modify the Braun multiplier. The modified
multiplier is called the Baugh-Wooley multiplier.
Consider two signed numbers A and B,
2.4 Speed
The conventional shift-and-add technique of multiplication requires n cycles
to perform the multiplication of two n-bit numbers, whereas in parallel
multipliers the time required is the longest path delay in the
combinational circuit used. As DSP applications generally require very high
speed, it is desirable to have multipliers operating at the highest possible
speed by using a parallel implementation.
2.5 Bus Widths
Consider the multiplication of two n-bit numbers X and Y. The product Z can
be at most 2n bits long. In order to perform the whole operation in a single
execution cycle, we require two buses of width n bits each to fetch the
operands X and Y, and a bus of width 2n bits to store the result Z to the
memory. Although this performs the operation faster, it is not an efficient
way of implementation as it is expensive. Many alternatives to the above
method have been proposed. One such method is to use the program bus
itself to fetch one of the operands after fetching the instruction, thus
requiring only one bus to fetch the operands. The result Z can then be stored
back to the memory using the same operand bus. The problem with this
is that the result Z is 2n bits long whereas the operand bus is just n bits
long. There are two alternatives to solve this problem:
a. Use the n-bit operand bus and save Z at two successive memory
locations. Although this stores the exact value of Z in the memory, it takes
two cycles to store the result.
b. Discard the lower n bits of the result Z and store only the higher-order n
bits into the memory. This is not suitable for applications where an accurate
result is required.
Another alternative can be used for applications where speed is not a major
concern: latches are used for the inputs and outputs, thus requiring a single
bus to fetch the operands and to store the result (Fig4.2).
2.6 Shifters
Shifters are used to scale operands or results either down or up.
The following scenarios show the need for a shifter:
a. While adding N numbers, each n bits long, the sum can grow up to
n+log2 N bits long. If the accumulator is n bits long, an overflow error
will occur. This can be overcome by using a shifter to scale down the
operands by log2 N bits.
b. Similarly, while calculating the product of two n-bit numbers, the product
can grow up to 2n bits long. Generally the lower n bits are neglected and the
sign bit is shifted to preserve the sign of the product.
c. Finally, in the case of addition of two floating-point numbers, one of the
operands has to be shifted appropriately to make the exponents of the two
numbers equal.
From the above cases it is clear that a shifter is required in the architecture
of a DSP.
2.7 Barrel Shifters
In conventional microprocessors, normal shift registers are used for shift
operations. As a shift register requires one clock cycle for each bit shifted,
it is not desirable for DSP applications, which generally involve many
shifts. In other words, since speed is the crucial issue in DSP applications,
several shifts have to be accomplished in a single execution cycle. This can
be done using a barrel shifter, which connects the input lines representing a
word to a group of output lines, with the required shift determined by its
control inputs. For an input of length n, log2 n control lines are required,
and an additional control line is required to indicate the direction of the
shift.
The block diagram of a typical barrel shifter is shown in the figure below.
The above figure depicts the implementation of a 4-bit shift-right barrel
shifter. A shift to the right by 0, 1, 2 or 3 bit positions can be selected by
setting the control inputs appropriately.
3 Multiply and Accumulate Unit
Most DSP applications require the computation of the sum of the
products of a series of successive multiplications. In order to implement
such functions, a special unit called a Multiply and Accumulate (MAC) unit is
required. A MAC consists of a multiplier and a special register called the
accumulator. MACs are used to implement functions of the type A+BC. A
typical MAC unit is shown in the figure below.
Although addition and multiplication are two different operations, they can
be performed in parallel. While the multiplier is computing a
product, the accumulator can accumulate the product of the previous
multiplication. Thus if N products are to be accumulated, N-1
multiplications can overlap with N-1 additions. During the very first
multiplication the accumulator is idle, and during the last accumulation
the multiplier is idle. Thus N+1 clock cycles are required to compute the
sum of N products.
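The overlap of multiplication and accumulation can be illustrated with a small simulation. This is only a minimal sketch in Python, not a model of any particular device: the function name mac_sum_of_products and the single pipeline register between the multiplier and the accumulator are assumptions made for illustration.

```python
def mac_sum_of_products(b, c):
    """Simulate a two-stage MAC pipeline (hypothetical model).

    Returns (sum_of_products, clock_cycles_used).
    """
    if len(b) != len(c):
        raise ValueError("operand lists must have equal length")
    n = len(b)
    acc = 0          # accumulator register
    product = None   # pipeline register holding the previous product
    cycles = 0
    # In cycle k the adder accumulates the product latched in cycle k-1
    # (if any) while the multiplier computes product k (if any).
    for k in range(n + 1):
        if product is not None:
            acc += product                            # accumulate stage
        product = b[k] * c[k] if k < n else None      # multiply stage
        cycles += 1
    return acc, cycles


total, cycles = mac_sum_of_products([1, 2, 3, 4], [5, 6, 7, 8])
print(total, cycles)
```

For N = 4 operand pairs the sketch uses 5 cycles: the accumulator has nothing to add in the first cycle and the multiplier has nothing to multiply in the last one.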
3.1 Overflow and Underflow
While designing a MAC unit, attention has to be paid to the word sizes
encountered at the input of the multiplier and to the sizes of the
add/subtract unit and the accumulator, as there is a possibility of overflow
and underflow. Overflow/underflow can be avoided by using any of the
following methods:
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
Shifters
Shifters can be provided at the input of the MAC to normalize the data and at
the output to denormalize the same.
Guard bits
As the normalization process does not yield an accurate result, it is not
desirable for some applications. In such cases an alternative is to
provide additional bits, called guard bits, in the accumulator so that
no overflow error occurs. Here the add/subtract unit also has to
be modified appropriately to manage the additional bits of the accumulator.
Saturation Logic
Overflow/underflow will occur if the result goes beyond the most positive
number or below the most negative number the accumulator can handle.
Thus the overflow/underflow error can be resolved by loading the
accumulator with the most positive number it can handle at the time
of overflow, and the most negative number it can handle at the time of
underflow. This method is called saturation logic. A schematic diagram of
saturation logic is shown in the figure below. In saturation logic, as soon as
an overflow or underflow condition is satisfied, the accumulator is
loaded with the most positive or most negative number, overriding the result
computed by the MAC unit.
4 Arithmetic and Logic Unit
The ALU carries out arithmetic operations and logical
operations like AND, OR, NOT, XOR etc. The block diagram of a typical ALU
for a DSP is shown in the figure.
It consists of a status flag register, a register file and multiplexers.
Status Flags
The ALU includes circuitry to generate status flags after arithmetic and logic
operations. These flags include sign, zero, carry and overflow.
Overflow Management
Depending on the status of the overflow and sign flags, the saturation logic
can be used to limit the accumulator content.
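The clamping behaviour of saturation logic can be sketched in a few lines of Python. This is an illustrative sketch only: the 16-bit two's-complement accumulator width and the names saturate and saturating_mac are assumptions, not taken from any specific DSP.

```python
def saturate(value, bits=16):
    """Clamp value to the range of a signed two's-complement word.

    The word width (bits) is an assumed parameter for illustration.
    """
    max_pos = (1 << (bits - 1)) - 1   # most positive representable number
    min_neg = -(1 << (bits - 1))      # most negative representable number
    if value > max_pos:
        return max_pos                # overflow: load most positive number
    if value < min_neg:
        return min_neg                # underflow: load most negative number
    return value


def saturating_mac(acc, b, c, bits=16):
    """Compute acc + b*c and override the result on overflow/underflow."""
    return saturate(acc + b * c, bits)


print(saturating_mac(32000, 100, 10))    # exact result 33000 exceeds 32767
print(saturating_mac(-32000, 100, -10))  # exact result -33000 is below -32768
```

In hardware the decision is made from the overflow and sign flags rather than by comparing against the limits, but the value loaded into the accumulator is the same.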
Register File
Instead of moving data in and out of the memory during the operation, a
large set of general-purpose registers is provided to store the intermediate
results, which improves speed.
5 Bus Architecture and Memory
Conventional microprocessors use the Von Neumann architecture for memory
management, wherein the same memory is used to store both the program
and data (Fig). Although this architecture is simple, it takes more
processor cycles to execute a single instruction, as the same bus is
used for both data and program.
6. A DSP has a circular buffer with the start and the end addresses as 0200h
and 020Fh respectively. What would be the new values of the address
pointer of the buffer if, in the course of address computation, it gets updated
to
a. 0212h
b. 01FCh
Buffer Length = (EAR-SAR+1) = 020Fh-0200h+1 = 10h
a. New Address Pointer = Updated Pointer - Buffer Length = 0212h-10h = 0202h
b. New Address Pointer = Updated Pointer + Buffer Length = 01FCh+10h = 020Ch
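The wrap-around rule used in this solution can be checked with a short Python sketch. The function name wrap_pointer is hypothetical, and the sketch assumes SAR <= EAR and that the updated pointer lies at most one buffer length outside the buffer.

```python
def wrap_pointer(updated, sar, ear):
    """Wrap an updated address pointer back into the circular buffer.

    Assumes sar <= ear and that `updated` is at most one buffer length
    outside the range [sar, ear].
    """
    length = ear - sar + 1
    if updated > ear:
        return updated - length   # ran past the end: subtract buffer length
    if updated < sar:
        return updated + length   # ran before the start: add buffer length
    return updated                # already inside the buffer


for updated in (0x0212, 0x01FC):
    print(f"{wrap_pointer(updated, 0x0200, 0x020F):04X}h")
```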
7. Repeat the previous problem for SAR= 0210h and EAR=0201h