
DSP Arch

The document discusses the architecture and programming of Digital Signal Processors (DSP), focusing on the multiplication of unsigned and signed numbers, and the importance of various computational blocks such as multipliers, shifters, and Multiply and Accumulate (MAC) units. It highlights the trade-offs between speed, accuracy, and circuit complexity in DSP design, as well as the significance of on-chip memory and different addressing modes for efficient data handling. Additionally, it covers the need for specialized addressing modes like circular and bit-reversed addressing for real-time applications and FFT algorithms.

Uploaded by

sheriffvtht
Copyright
© © All Rights Reserved
CEC337 - DSP ARCHITECTURE AND PROGRAMMING

UNIT I ARCHITECTURES FOR PROGRAMMABLE DSP PROCESSORS

1. Basic Architectural Features
A programmable DSP device should provide instructions similar to those of a conventional microprocessor. The instruction set of a typical DSP device should include the following:
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY etc.
b. Logical operations such as AND, OR, NOT, XOR etc.
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also include:
a. On-chip registers to store intermediate results
b. On-chip memories to store signal samples (RAM)
c. On-chip memories to store filter coefficients (ROM)

2. DSP Computational Building Blocks
Each computational block of the DSP should be optimized for functionality and speed. At the same time, the design should be sufficiently general that it can be easily integrated with other blocks to implement overall DSP systems.

2.1 Multipliers
The advent of single-chip multipliers paved the way for implementing DSP functions on a VLSI chip. Parallel multipliers have now replaced the traditional shift-and-add multipliers. Parallel multipliers take a single processor cycle to fetch and execute the instruction and to store the result. They are also called array multipliers. The key features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
The number of bits used to represent the operands decides the accuracy and the dynamic range of the multiplier, whereas the speed is decided by the architecture employed. If the multiplier is implemented in hardware, the speed of execution will be very high, but the circuit complexity also increases considerably. Thus there is a tradeoff between the speed of execution and the circuit complexity, and the choice of architecture normally depends on the application.

2.2 Parallel Multipliers
Consider the multiplication of two unsigned numbers A and B. Let A be represented using m bits as (Am-1 Am-2 … A1 A0) and B be represented using n bits as (Bn-1 Bn-2 … B1 B0). Then the product of these two numbers is given by
P = A × B = Σ (i = 0 to m-1) Σ (j = 0 to n-1) Ai Bj 2^(i+j)
This operation can be implemented in parallel using the Braun multiplier, whose hardware structure is shown in the figure below.

2.3 Multipliers for Signed Numbers
In the Braun multiplier the signs of the numbers are not taken into account. To implement a multiplier for signed numbers, additional hardware is required to modify the Braun multiplier. The modified multiplier is called the Baugh-Wooley multiplier.
Consider two signed numbers A and B in two's complement form,
A = -Am-1 2^(m-1) + Σ (i = 0 to m-2) Ai 2^i
B = -Bn-1 2^(n-1) + Σ (j = 0 to n-2) Bj 2^j

2.4 Speed
The conventional shift-and-add technique of multiplication requires n cycles to multiply two n-bit numbers. In parallel multipliers, the time required is the longest path delay in the combinational circuit used. As DSP applications generally require very high speed, it is desirable to have multipliers operating at the highest possible speed through parallel implementation.

2.5 Bus Widths
Consider the multiplication of two n-bit numbers X and Y. The product Z can be at most 2n bits long. In order to perform the whole operation in a single execution cycle, we require two buses of width n bits each to fetch the operands X and Y, and a bus of width 2n bits to store the result Z to the memory. Although this performs the operation faster, it is expensive and therefore not an efficient way of implementation. Many alternatives to the above method have been proposed. One such method is to use the program bus itself to fetch one of the operands after fetching the instruction, thus requiring only one bus to fetch the operands. The result Z can then be stored back to the memory using the same operand bus. The problem with this is that the result Z is 2n bits long whereas the operand bus is just n bits long. There are two alternatives to solve this problem:
a. Use the n-bit operand bus and save Z at two successive memory locations. Although this stores the exact value of Z in the memory, it takes two cycles to store the result.
b. Discard the lower n bits of the result Z and store only the higher-order n bits into the memory. This is not suitable for applications where an accurate result is required.
Another alternative can be used for applications where speed is not a major concern. Here latches are used for the inputs and outputs, so a single bus suffices to fetch the operands and to store the result (Fig. 4.2).

2.6 Shifters
Shifters are used to scale operands or results either down or up. The following scenarios show the necessity of a shifter:
a. While adding N numbers, each n bits long, the sum can grow up to n + log2 N bits long. If the accumulator is n bits long, an overflow error will occur. This can be overcome by using a shifter to scale down the operands by log2 N bits.
b. Similarly, while calculating the product of two n-bit numbers, the product can grow up to 2n bits long. Generally the lower n bits are neglected and the sign bit is shifted to preserve the sign of the product.
c. Finally, in the addition of two floating-point numbers, one of the operands has to be shifted appropriately to make the exponents of the two numbers equal.
From the above cases it is clear that a shifter is required in the architecture of a DSP.

2.7 Barrel Shifters
In conventional microprocessors, normal shift registers are used for the shift operation. As a shift register requires one clock cycle for each shift, it is not desirable for DSP applications, which generally involve many shifts. In other words, since speed is the crucial issue in DSP applications, several shifts have to be accomplished in a single execution cycle. This can be accomplished using a barrel shifter, which connects the input lines representing a word to a group of output lines, with the required shift determined by its control inputs. For an input of length n, log2 n control lines are required, and an additional control line is required to indicate the direction of the shift.
The block diagram of a typical barrel shifter is shown in the figure below.
The figure above depicts the implementation of a 4-bit shift-right barrel shifter. A shift to the right by 0, 1, 2 or 3 bit positions can be selected by setting the control inputs appropriately.

3 Multiply and Accumulate Unit
Most DSP applications require the computation of the sum of the products of a series of successive multiplications. In order to implement such functions, a special unit called a Multiply and Accumulate (MAC) unit is required. A MAC consists of a multiplier and a special register called the accumulator. MACs are used to implement functions of the type A+BC. A typical MAC unit is shown in the figure below.

Although addition and multiplication are two different operations, they can be performed in parallel. While the multiplier is computing the current product, the accumulator can accumulate the product of the previous multiplication. Thus if N products are to be accumulated, N-1 multiplications can overlap with N-1 additions. During the very first multiplication the accumulator is idle, and during the last accumulation the multiplier is idle. Thus N+1 clock cycles are required to compute the sum of N products.
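
As a rough illustration of the overlap described above, the following Python sketch models a two-stage pipeline in which the multiplier works on one product while the accumulator adds the previous one. The function name `pipelined_mac` and the cycle-by-cycle model are assumptions made for illustration; they do not describe any particular device.

```python
def pipelined_mac(xs, ys):
    """Accumulate sum(x * y) with a two-stage multiply/accumulate pipeline.

    Returns (accumulator, cycles). Hypothetical model for illustration only.
    """
    if len(xs) != len(ys):
        raise ValueError("operand lists must have equal length")
    pairs = list(zip(xs, ys))
    acc = 0
    product_reg = None  # pipeline register between multiplier and adder
    cycles = 0
    k = 0
    while k < len(pairs) or product_reg is not None:
        # Accumulate stage: add the product computed in the previous cycle.
        if product_reg is not None:
            acc += product_reg
            product_reg = None
        # Multiply stage: compute the next product in the same cycle.
        if k < len(pairs):
            x, y = pairs[k]
            product_reg = x * y
            k += 1
        cycles += 1
    return acc, cycles


acc, cycles = pipelined_mac([1, 2, 3], [4, 5, 6])
print(acc, cycles)
```

In this model the accumulator is idle in the first cycle and the multiplier is idle in the last one, so N products take N+1 cycles.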
3.1 Overflow and Underflow
While designing a MAC unit, attention has to be paid to the word sizes encountered at the input of the multiplier and to the sizes of the add/subtract unit and the accumulator, as there is a possibility of overflow and underflow. Overflow/underflow can be avoided by using any of the following methods:
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
Shifters
Shifters can be provided at the input of the MAC to normalize the data and at the output to denormalize it.
Guard bits
As the normalization process does not yield an accurate result, it is not desirable for some applications. In such cases another alternative is to provide additional bits, called guard bits, in the accumulator so that no overflow error occurs. Here the add/subtract unit also has to be modified appropriately to manage the additional bits of the accumulator.
Saturation Logic
Overflow/underflow occurs if the result goes beyond the most positive number or below the most negative number the accumulator can handle. The overflow/underflow error can therefore be resolved by loading the accumulator with the most positive number it can handle at the time of overflow, and with the most negative number it can handle at the time of underflow. This method is called saturation logic. A schematic diagram of saturation logic is shown in the figure below. In saturation logic, as soon as an overflow or underflow condition is detected, the accumulator is loaded with the most positive or most negative number, overriding the result computed by the MAC unit.

4 Arithmetic and Logic Unit
A typical DSP device should be capable of handling arithmetic instructions like ADD, SUB, INC, DEC etc. and logical operations like AND, OR, NOT, XOR etc. The block diagram of a typical ALU for a DSP is shown in the figure. It consists of a status flag register, a register file and multiplexers.
Status Flags
The ALU includes circuitry to generate status flags after arithmetic and logic operations. These flags include sign, zero, carry and overflow.
Overflow Management
Depending on the status of the overflow and sign flags, the saturation logic can be used to limit the accumulator content.
Register File
Instead of moving data in and out of the memory during the operation, a large set of general-purpose registers is provided to store the intermediate results, for better speed.

5 Bus Architecture and Memory
Conventional microprocessors use the Von Neumann architecture for memory management, wherein the same memory is used to store both the program and the data (Fig). Although this architecture is simple, it takes more processor cycles to execute a single instruction, as the same bus is used for both data and program.
In order to increase the speed of operation, separate memories are used to store program and data, and a separate set of data and address buses is given to each memory. This architecture is called the Harvard architecture. It is shown in the figure below.
Although the use of separate memories for data and instructions speeds up the processing, it does not completely solve the problem. As many DSP instructions require more than one operand, the use of a single data memory forces the operands to be fetched one after the other, thus increasing the processing delay. This problem can be overcome by using two separate data memories for storing the operands separately, so that both operands can be fetched together in a single clock cycle, as shown in the figure below.
Although the above architecture improves the speed of operation, it requires more hardware and interconnections, thus increasing the cost and complexity of the system. Therefore there is a tradeoff between cost and speed when selecting the memory architecture for a DSP.

5.1 On-chip Memories
In order to execute DSP functions faster, it is desirable to have some memory located on chip. As dedicated buses are used to access this memory, on-chip memories are faster. Speed and size are the two key parameters to be considered with respect to on-chip memories.
Speed
On-chip memories should match the speed of the ALU operations in order to maintain the single-cycle instruction execution of the DSP.
Size
In a given area of the DSP chip, it is desirable to implement as many DSP functions as possible. Thus the area occupied by the on-chip memory should be kept to a minimum so that there is scope for implementing more DSP functions on chip.

5.2 Organization of On-chip Memories
Ideally, the whole memory required for the implementation of any DSP algorithm should reside on chip so that the whole processing can be completed in a single execution cycle. Although this looks like the better solution, it consumes more space on chip, reducing the scope for implementing other functional blocks on chip, which in turn reduces the speed of execution. Hence other alternatives have to be considered. The following are some other ways in which the on-chip memory can be organized.
a. As many DSP algorithms require instructions to be executed repeatedly, the instructions can be stored in external memory and, once fetched, can reside in the instruction cache.
b. The access times of on-chip memories should be sufficiently small that they can be accessed more than once in every execution cycle.
c. On-chip memories can be configured dynamically so that they can serve different purposes at different times.

6 Data Addressing Capabilities
The data accessing capability of a programmable DSP device is determined by its addressing modes. A summary of the addressing modes used in DSPs is shown in the table below.

6.1 Immediate Addressing Mode
In this addressing mode, the data is included in the instruction itself.
6.2 Register Addressing Mode
In this mode, one of the registers holds the data, and the register has to be specified in the instruction.
6.3 Direct Addressing Mode
In this addressing mode, the instruction holds the memory location of the operand.
6.4 Indirect Addressing Mode
In this addressing mode, the operand is accessed using a pointer. A pointer is generally a register that holds the address of the location where the operand resides. Indirect addressing can be extended to include automatic increment or decrement capabilities, which has led to the following addressing modes.

7 Special Addressing Modes
For the implementation of some real-time applications in DSP, the normal addressing modes do not completely serve the purpose. Thus some special addressing modes are required for such applications.

7.1 Circular Addressing Mode
Circular buffers are used while processing data samples that arrive continuously in a sequential manner. In a circular buffer the data samples are stored sequentially from the initial location until the buffer is filled. Once the buffer is full, the next data samples are stored once again starting from the initial location. This process can go on forever as long as the data samples are processed at a rate faster than the incoming data rate.
Circular addressing mode requires three registers:
a. Pointer register to hold the current location (PNTR)
b. Start Address Register to hold the starting address of the buffer (SAR)
c. End Address Register to hold the ending address of the buffer (EAR)
There are four special cases in this addressing mode. They are
a. SAR < EAR & updated PNTR > EAR
b. SAR < EAR & updated PNTR < SAR
c. SAR > EAR & updated PNTR > SAR
d. SAR > EAR & updated PNTR < EAR
The buffer length in the first two cases is (EAR-SAR+1), whereas in the next two cases it is (SAR-EAR+1).
The pointer updating algorithm for the circular addressing mode is as shown below.
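
A minimal sketch of such a pointer update, written in Python, follows from the four special cases and the two buffer-length formulas listed above. The function name `wrap_pointer` is hypothetical, and the sketch assumes the pointer never moves by more than one buffer length in a single update, so one correction is enough.

```python
def wrap_pointer(updated, sar, ear):
    """Bring an updated pointer back inside the circular buffer.

    updated: pointer value after the increment/decrement or offset step
    sar, ear: start and end address registers of the buffer
    Assumes a single wrap suffices (step smaller than the buffer length).
    """
    if sar < ear:
        length = ear - sar + 1
        if updated > ear:      # case a: ran past the end address
            return updated - length
        if updated < sar:      # case b: ran below the start address
            return updated + length
    elif sar > ear:
        length = sar - ear + 1
        if updated > sar:      # case c: ran past the start address
            return updated - length
        if updated < ear:      # case d: ran below the end address
            return updated + length
    return updated             # still inside the buffer, no correction


# Buffer from 0200h to 020Fh, pointer updated to 0212h and to 01FCh.
print(hex(wrap_pointer(0x0212, 0x0200, 0x020F)))
print(hex(wrap_pointer(0x01FC, 0x0200, 0x020F)))
```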

The four cases explained earlier are shown in the figure below.

7.2 Bit Reversed Addressing Mode
To implement FFT algorithms we need to access the data in a bit-reversed manner. Hence a special addressing mode called bit-reversed addressing mode is used to calculate the index of the next data item to be fetched. It works as follows. Start with index 0. The present index is calculated by adding half the FFT length to the previous index in a bit-reversed manner, with the carry propagated from the MSB towards the LSB:
Current index = Previous index +B (FFT size / 2)
where +B denotes addition with the carry propagated from MSB to LSB.

8 Address Generation Unit
The main job of the Address Generation Unit is to generate the addresses of the operands required to carry out the operation. It has to work fast in order to satisfy the timing constraints. As the address generation unit has to perform some mathematical operations in order to calculate the operand address, it is provided with a separate ALU.
Address generation typically involves one of the following operations.
a. Getting value from immediate operand, register or a memory location
b. Incrementing/ decrementing the current address
c. Adding/subtracting the offset from the current address
d. Adding/subtracting the offset from the current address and generating
new address according to circular addressing mode
e. Generating new address using bit reversed addressing mode
The block diagram of a typical address generation unit is as shown in figure.
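
Operation (e) above, generating an address in bit-reversed mode, amounts to an addition whose carry ripples from the MSB towards the LSB. One way to sketch this in Python is to mirror the bits of both operands, add them normally, and mirror the sum back. The helper names below are hypothetical, and the sketch assumes the FFT size is a power of two.

```python
def reverse_bits(value, width):
    """Mirror the lowest `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(index, step, width):
    """Add step to index with the carry propagating from MSB to LSB."""
    total = reverse_bits(index, width) + reverse_bits(step, width)
    total &= (1 << width) - 1  # drop any carry out of the LSB end
    return reverse_bits(total, width)


def bit_reversed_sequence(fft_size):
    """Indices visited when half the FFT size is added repeatedly."""
    width = fft_size.bit_length() - 1  # number of index bits
    half = fft_size // 2
    sequence = [0]
    for _ in range(fft_size - 1):
        sequence.append(reverse_carry_add(sequence[-1], half, width))
    return sequence


print(bit_reversed_sequence(8))
```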
9 Programmability and Program Execution
A programmable DSP device should provide programming capability involving branching, looping and subroutines. The repeat capability should be implemented in hardware so that it can be programmed with minimal or zero overhead. A dedicated register can be used as a counter. In a normal subroutine call, the return address has to be stored in a stack, which requires memory accesses for storing and retrieving the return address and in turn reduces the speed of operation. Hence a LIFO memory can be directly interfaced with the program counter.

9.1 Program Control
Like microprocessors, a DSP also requires a control unit to provide the necessary control and timing signals for the proper execution of instructions. In microprocessors the control is microcoded: each instruction is divided into microinstructions stored in a micro memory. As this mechanism is slower, it is not suitable for DSP applications. Hence in a DSP the control is hardwired: the control unit is designed as a single, comprehensive hardware unit. Although it is more complex, it is faster.

9.2 Program Sequencer
The program sequencer is the part of the control unit that generates, in sequence, the instruction addresses needed to access instructions. It calculates the address of the next instruction to be fetched. The next address can come from one of the following sources.
a. Program Counter
b. Instruction register in case of branching, looping and subroutine calls
c. Interrupt Vector table
d. Stack which holds the return address
The block diagram of a program sequencer is shown in the figure.
The program sequencer should have the following circuitry:
a. Logic to update the PC after every fetch
b. A counter to hold the count in case of looping
c. A logic block to check conditions for conditional jump instructions
d. Condition logic (status flags)

Problems:
1. Investigate the basic features that should be provided in the DSP architecture to be used to implement the following Nth order FIR filter.
y(n) = Σ h(i) x(n-i), n = 0, 1, 2, …
Solution:
In order to implement the above operation in a DSP, the architecture requires the following features:
i. A RAM to store the signal samples x(n)
ii. A ROM to store the filter coefficients h(n)
iii. A MAC unit to perform the Multiply and Accumulate operation
iv. An accumulator to store the result immediately
v. A signal pointer to point to the signal sample in the memory
vi. A coefficient pointer to point to the filter coefficient in the memory
vii. A counter to keep track of the count
viii. A shifter to shift the input samples appropriately

2. It is required to find the sum of 64 16-bit numbers. How many bits should the accumulator have so that the sum can be computed without the occurrence of overflow error or loss of accuracy?
The sum of 64 16-bit numbers can grow up to (16 + log2 64) = 22 bits long. Hence the accumulator should be 22 bits long in order to avoid overflow.

3. In the previous problem, it is decided to have an accumulator with only 16 bits but to shift the numbers before the addition to prevent overflow. By how many bits should each number be shifted?
As the length of the accumulator is fixed, the operands have to be shifted right by log2 64 = 6 bits prior to the addition, in order to avoid overflow.

4. If all the numbers in the previous problem are fixed-point integers, what is the actual sum of the numbers?
The actual sum can be obtained by shifting the result 6 bits to the left after the sum has been computed. Therefore
Actual sum = Accumulator content × 2^6

5. If a sum of 256 products is to be computed using a pipelined MAC unit, and if the MAC execution time of the unit is 100 nsec, what will be the total time required to complete the operation?
As N = 256 in this case, the MAC unit requires N+1 = 257 execution cycles. As the single MAC execution time is 100 nsec, the total time required will be (257 × 100 nsec) = 25.7 µsec.

6. Consider a MAC unit whose inputs are 16-bit numbers. If 256 products are to be summed up in this MAC, how many guard bits should be provided for the accumulator to prevent overflow from occurring?
As it is required to calculate the sum of 256 16-bit numbers, the sum can be as long as (16 + log2 256) = 24 bits. Hence the accumulator should be capable of handling these 24 bits. Thus the guard bits required will be (24 - 16) = 8 bits. The number of guard bits depends only on the number of terms being summed, so it is the same 8 bits when the accumulated values are the 32-bit products.
The block diagram of the modified MAC after considering the guard or extension bits is shown in the figure.

7. What are the memory addresses of the operands in each of the following cases of indirect addressing modes? In each case, what will be the content of the addreg after the memory access? Assume that the initial contents of the addreg and the offsetreg are 0200h and 0010h, respectively.
a. ADD *addreg
b. ADD +*addreg
c. ADD offsetreg+,*addreg
d. ADD *addreg,offsetreg-

8. A DSP has a circular buffer with the start and the end addresses as 0200h and 020Fh respectively. What would be the new values of the address pointer of the buffer if, in the course of address computation, it gets updated to
a. 0212h
b. 01FCh
Buffer length = (EAR - SAR + 1) = 020Fh - 0200h + 1 = 10h
a. New address pointer = Updated pointer - buffer length = 0212h - 10h = 0202h
b. New address pointer = Updated pointer + buffer length = 01FCh + 10h = 020Ch

9. Repeat the previous problem for SAR = 0210h and EAR = 0201h.
Buffer length = (SAR - EAR + 1) = 0210h - 0201h + 1 = 10h
c. New address pointer = Updated pointer - buffer length = 0212h - 10h = 0202h
d. New address pointer = Updated pointer + buffer length = 01FCh + 10h = 020Ch

10. Compute the indices for an 8-point FFT using bit-reversed addressing mode.
Start with index 0. Therefore the first index is (000). The next index is calculated by adding half the FFT length, in this case (100), to the previous index, i.e.
Present index = (000) +B (100) = (100)
Similarly the next index can be calculated as
Present index = (100) +B (100) = (010)
The process continues until all the indices are calculated. The following table summarizes the calculation.
Step   Present index (binary)   Present index (decimal)
0      000                      0
1      100                      4
2      010                      2
3      110                      6
4      001                      1
5      101                      5
6      011                      3
7      111                      7
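
The word-length problems above (accumulator width, pre-shift amount and guard bits) all rest on the same rule: a sum of N values can grow by log2 N bits. The Python sketch below restates that rule so the numbers can be checked. The function names are hypothetical, and `scaled_sum` assumes non-negative integers and a plain truncating right shift, which is only one of several possible scaling schemes.

```python
def growth_bits(num_terms):
    """Extra bits a sum of num_terms values may need: ceil(log2(N))."""
    if num_terms < 1:
        raise ValueError("num_terms must be at least 1")
    return (num_terms - 1).bit_length()


def accumulator_width(word_bits, num_terms):
    """Accumulator width that avoids overflow without losing accuracy."""
    return word_bits + growth_bits(num_terms)


def scaled_sum(values, shift):
    """Pre-shift each value right, add, then shift the sum back left.

    The low `shift` bits of every operand are lost, so the result only
    approximates the true sum.
    """
    return sum(v >> shift for v in values) << shift


print(accumulator_width(16, 64))   # width for the sum of 64 16-bit numbers
print(growth_bits(256))            # guard bits for 256 accumulated terms

values = [0xFFFF] * 64             # worst case: 64 full-scale 16-bit values
print(sum(values), scaled_sum(values, growth_bits(64)))
```

The last line prints the exact sum next to the scaled one, which shows the accuracy given up when the operands are shifted before addition instead of widening the accumulator.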
