Imp 22

This document discusses architectural features of programmable digital signal processing devices. It describes basic features like arithmetic, logical, and multiply-accumulate instructions. It also discusses computational building blocks like multipliers and shifters. Memory architectures like Von Neumann and Harvard are covered. Addressing modes like immediate, register, direct, indirect, circular and bit-reversed are explained. The document provides details about multipliers, barrel shifters, multiply-accumulate units, and arithmetic logic units.

Uploaded by

Panku Rangaree

Subject Name: Digital Signal Processing Algorithms & Architecture

Subject Code:10EC751

Prepared By: S. Shikky Marice, Prashanth, Shivlila

Department: Electronics and Communication Engineering

Date:24.8.2014
Unit-02

Architectures for Programmable Digital Signal Processing Devices
Basic Architectural Features

A programmable DSP device should provide instructions similar to those of a
conventional microprocessor. The instruction set of a typical DSP device should
include the following:
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY etc
b. Logical operations such as AND, OR, NOT, XOR etc
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also include:
a. On-chip registers to store intermediate results
b. On chip memories to store signal samples (RAM)
c. On chip memories to store filter coefficients (ROM)
DSP Computational Building Blocks

•Multipliers
The advent of single-chip multipliers paved the way for
implementing DSP functions on a VLSI chip. Parallel multipliers
have now replaced the traditional shift-and-add multipliers.
A parallel multiplier takes a single processor cycle to fetch and
execute the instruction and to store the result. Parallel multipliers
are also called array multipliers. The key features to be considered
for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
•Parallel multipliers:
Consider the multiplication of two unsigned numbers A and B. Let A be
represented using m bits as (Am-1 Am-2 …….. A1 A0) and B be
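represented using n bits as (Bn-1 Bn-2 …….. B1 B0). Then the product of
these two numbers is given by
P = A × B = ∑∑ Ai Bj 2^(i+j), with i = 0, 1, …, m-1 and j = 0, 1, …, n-1.
The Braun multiplier implements this sum as an array of partial products.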
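Each bit pair (Ai, Bj) contributes a partial product Ai·Bj weighted by 2^(i+j). The following is a minimal Python sketch of that summation; the function name `array_multiply` is hypothetical, and the code models only the arithmetic of the partial-product array, not the gate-level structure of a Braun multiplier.

```python
def array_multiply(a: int, b: int, m: int, n: int) -> int:
    """Multiply unsigned m-bit a by unsigned n-bit b by summing partial products."""
    if not (0 <= a < (1 << m) and 0 <= b < (1 << n)):
        raise ValueError("a must fit in m bits and b must fit in n bits")
    product = 0
    for i in range(m):
        a_i = (a >> i) & 1              # bit Ai of the multiplicand
        for j in range(n):
            b_j = (b >> j) & 1          # bit Bj of the multiplier
            # An AND gate forms Ai*Bj; its weight in the product is 2^(i+j).
            product += (a_i & b_j) << (i + j)
    return product


# Two 4-bit operands give a product that fits in 4 + 4 = 8 bits.
result = array_multiply(13, 11, 4, 4)   # 143
```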
•Multipliers for signed numbers:
Consider two signed numbers A and B,
•Bus Widths
Consider the multiplication of two n-bit numbers X and Y. The product Z can
be at most 2n bits long. To perform the whole operation in a single
execution cycle, we require two buses of width n bits each to fetch the operands
X and Y, and a bus of width 2n bits to store the result Z in memory. Although
this performs the operation faster, the wide buses make it an expensive
implementation.

There are two alternatives to solve this problem:

a. Use the n-bit operand bus and save Z at two successive memory locations.
Although this stores the exact value of Z in memory, it takes two cycles to store
the result.
b. Discard the lower n bits of the result Z and store only the higher-order n bits
in memory. This is not suitable for applications where an accurate result is
required.
A third alternative can be used in applications where speed is not a major
concern: latches are used at the inputs and outputs, so that a single bus suffices to
fetch the operands and to store the result.
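The two storage alternatives for a 2n-bit product can be illustrated with a short Python sketch. The helper names `split_product` and `truncate_product` are hypothetical; the sketch assumes unsigned operands and only shows which bits reach memory in each case.

```python
def split_product(z: int, n: int) -> tuple:
    """Alternative (a): store the exact product as two n-bit words (high, low)."""
    mask = (1 << n) - 1
    return (z >> n) & mask, z & mask


def truncate_product(z: int, n: int) -> int:
    """Alternative (b): keep only the higher-order n bits and discard the rest."""
    return (z >> n) & ((1 << n) - 1)


n = 16
x, y = 0xABCD, 0x1234
z = x * y                              # up to 2n = 32 bits long

high, low = split_product(z, n)        # exact, but needs two store cycles
approx = truncate_product(z, n) << n   # one store cycle, lower n bits are lost
error = z - approx                     # always smaller than 2**n
```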
Shifters
Shifters are used to scale operands or results either down or up. The
following scenarios show why a shifter is needed:

a. When adding N numbers, each n bits long, the sum can
grow up to n + log2 N bits long. If the accumulator is only n bits long, an
overflow error will occur. This can be overcome by using a shifter to scale down
each operand by log2 N bits.
b. Similarly, when calculating the product of two n-bit numbers, the product can
grow up to 2n bits long. Generally the lower n bits are discarded and the sign bit
is shifted to preserve the sign of the product.
c. Finally, when adding two floating-point numbers, one of the operands has
to be shifted appropriately to make the exponents of the two numbers equal.
Barrel Shifters
For an input of length n, log2 n control lines are required to select the shift
amount. One additional control line is required to indicate the direction of the shift.
•A barrel shifter is to be designed with 16 inputs for left shifts from 0 to 15 bits.
How many control lines are required to implement the shifter?
As the input is 16 bits wide, log2 16 = 4 control inputs
are required. No direction line is needed because only left shifts are performed.
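One common way to see why log2 n control lines are enough is to build the shifter from log2 n stages, where stage k shifts by 2^k bits when control bit k is set. The sketch below assumes that staged structure, which the slides do not specify; the names `control_lines` and `barrel_shift_left` are hypothetical.

```python
def control_lines(width: int, bidirectional: bool = False) -> int:
    """Number of control lines for a barrel shifter with `width` inputs."""
    select_lines = (width - 1).bit_length()     # log2(width) for a power of two
    return select_lines + (1 if bidirectional else 0)


def barrel_shift_left(value: int, shift: int, width: int = 16) -> int:
    """Left-shift `value` by `shift` bits, one power-of-two stage per control bit."""
    if not 0 <= shift < width:
        raise ValueError("shift must be between 0 and width - 1")
    mask = (1 << width) - 1
    value &= mask
    for k in range(control_lines(width)):
        if (shift >> k) & 1:                    # control bit k enables a 2^k shift
            value = (value << (1 << k)) & mask  # bits shifted past the top are lost
    return value


lines_left_only = control_lines(16)             # 4
lines_both_ways = control_lines(16, True)       # 5
shifted = barrel_shift_left(0x00FF, 4)          # 0x0FF0
```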

•It is required to find the sum of 64 16-bit numbers. How many bits should the
accumulator have so that the sum can be computed without overflow
or loss of accuracy?
The sum of 64 16-bit numbers can grow up to (16 + log2 64) = 22 bits long. Hence
the accumulator should be 22 bits long to avoid overflow.
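The accumulator-width rule n + log2 N can be checked numerically. This is a small sketch assuming unsigned operands; the function name `accumulator_bits` is hypothetical, and log2 N is rounded up so that the rule also covers values of N that are not powers of two.

```python
def accumulator_bits(n_bits: int, count: int) -> int:
    """Bits needed to add `count` unsigned numbers of `n_bits` each without overflow."""
    growth = (count - 1).bit_length()   # ceil(log2(count)) for count >= 1
    return n_bits + growth


width = accumulator_bits(16, 64)        # 16 + 6 = 22

# Worst case: all 64 operands hold the largest 16-bit value.
worst_case_sum = 64 * (2 ** 16 - 1)
fits = worst_case_sum < 2 ** width      # True: 22 bits are enough
```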
Multiply and Accumulate Unit
Overflow and Underflow

While designing a MAC unit, attention has to be paid to the word sizes
encountered at the inputs of the multiplier and to the sizes of the add/subtract unit
and the accumulator, as there is a possibility of overflow and underflow.
Overflow/underflow can be avoided by using any of the following methods:
a. Using shifters at the input and the output of the MAC
b. Providing guard bits in the accumulator
c. Using saturation logic
Saturation logic
Overflow/underflow occurs if the result goes above the most positive number or
below the most negative number the accumulator can handle. The
overflow/underflow error can be resolved by loading the accumulator with the most
positive number it can handle on overflow and with the most negative
number it can handle on underflow. This method is called saturation
logic.
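Saturation logic can be sketched in a few lines of Python. The example assumes a two's-complement accumulator of `bits` bits; the function name `saturating_add` is hypothetical and stands in for the add/subtract unit of a MAC.

```python
def saturating_add(acc: int, value: int, bits: int = 16) -> int:
    """Add `value` to `acc`, clamping the result to the signed range of `bits` bits."""
    most_positive = (1 << (bits - 1)) - 1    # +32767 for 16 bits
    most_negative = -(1 << (bits - 1))       # -32768 for 16 bits
    total = acc + value
    if total > most_positive:                # overflow: load the most positive number
        return most_positive
    if total < most_negative:                # underflow: load the most negative number
        return most_negative
    return total


overflowed = saturating_add(32000, 1000)     # 32767
underflowed = saturating_add(-32000, -1000)  # -32768
in_range = saturating_add(100, 200)          # 300
```

Without saturation, a 16-bit two's-complement accumulator would wrap around on overflow and report a large negative value for a large positive sum, which is usually far more harmful to a signal than clipping.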

Arithmetic and Logic Unit


The arithmetic logic unit (ALU) carries out the additional arithmetic and logic
operations required for a DSP:
• add, subtract, increment, decrement, negate
• AND, OR, NOT, XOR, compare
• shift, multiply (uncommon in general microprocessors)
with additional features common to general microprocessors:
1. status flags for sign, zero, carry and overflow
2. overflow management via saturation logic
3. register files for storing intermediate results
Arithmetic Logic Unit of a DSP

Bus Architecture and Memory


Bus architecture and memory play a significant role in dictating cost, speed
and size of DSPs.
Common architectures include the von Neumann and Harvard architectures.
Von Neumann Architecture
• program and data reside in the same memory
• a single bus is used to access both

Implications:
• slows down program execution, since the processor has to wait for data even
after the instruction has been made available

Harvard Architecture
program and data reside in separate memories with two independent buses
Implications:
• faster program execution because of simultaneous memory
access capability
On-Chip Memory
• on-chip = on-processor
• helps in running DSP algorithms faster than when memory is off-chip
• dedicated address and data buses are available
• speed: on-chip memories should match the speed of the ALU operations
• size: the more chip area the memory takes, the less area is available for
other DSP functions

Data Addressing Capabilities


• An efficient way of accessing data (signal samples and filter coefficients) can
significantly improve implementation performance
• Flexible ways to access data help in writing efficient programs
• Data addressing modes enhance DSP implementations
DSP Addressing Modes

• Immediate
•Register
• Direct
•Indirect

Special Addressing Modes:


• Circular
•Bit-reversed
Immediate Addressing Mode:
• operand is explicitly known in value
• capability to include data as part of the instruction
Instruction        Operation
ADD #imm           #imm + A → A
• #imm: value represented by imm (a fixed number, such as a filter coefficient, that is
known ahead of time)
• A: accumulator register

Register Addressing Mode


• operand is always in processor register reg
• capability to reference data through its register
Instruction        Operation
ADD reg            reg + A → A
• reg: processor register that provides the operand
• A: accumulator register
Direct Addressing Mode
• operand is always in memory location mem
• capability to reference data by giving its memory location directly
Instruction        Operation
ADD mem            mem + A → A
• mem: specified memory location that provides the operand (e.g., memory could hold an input
signal value)
• A: accumulator register

Indirect Addressing Mode


• operand memory location is variable
• operand address is given by the value of register addrreg
• operand is accessed using the pointer addrreg
Instruction        Operation
ADD *addrreg       *addrreg + A → A
• addrreg: needs to be loaded with the operand's memory address before use
• A: accumulator register
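The four basic addressing modes differ only in where the ADD instruction finds its operand. The toy model below is a sketch for illustration: the class `ToyDSP`, its method names, and the register names `R0` and `AR1` are hypothetical and do not correspond to any particular DSP device.

```python
class ToyDSP:
    """Toy processor with an accumulator A, a register file and a data memory."""

    def __init__(self, memory, registers):
        self.memory = list(memory)        # data memory, indexed by address
        self.registers = dict(registers)  # named processor registers
        self.A = 0                        # accumulator register

    def add_immediate(self, imm):
        """Immediate: the operand value is part of the instruction."""
        self.A += imm

    def add_register(self, reg):
        """Register: the operand is held in processor register `reg`."""
        self.A += self.registers[reg]

    def add_direct(self, mem):
        """Direct: the instruction gives the operand's memory address."""
        self.A += self.memory[mem]

    def add_indirect(self, addrreg):
        """Indirect: register `addrreg` holds the operand's memory address."""
        self.A += self.memory[self.registers[addrreg]]


dsp = ToyDSP(memory=[10, 20, 30, 40], registers={"R0": 5, "AR1": 2})
dsp.add_immediate(1)      # A = 0 + 1          = 1
dsp.add_register("R0")    # A = 1 + 5          = 6
dsp.add_direct(3)         # A = 6 + memory[3]  = 46
dsp.add_indirect("AR1")   # A = 46 + memory[2] = 76
```

With indirect addressing, stepping through an array of samples only requires changing the value in the address register, which is why the special modes that follow are built on top of it.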
Special Addressing Modes

Circular Addressing Mode: a circular buffer allows one to handle a continuous
stream of incoming data samples; once the end of the buffer is reached,
samples are wrapped around and added to the beginning again. This is useful for
implementing real-time digital signal processing, where the input stream is
effectively continuous.

Bit-Reversed Addressing Mode: the address generation unit can be provided with
the capability of providing bit-reversed indices. This is useful for implementing
radix-2 FFT (fast Fourier transform) algorithms, where either the input or the output is
in bit-reversed order.
Circular Addressing:
Can avoid constantly testing for the need to wrap.
Suppose we consider eight registers to store an incoming data stream.

Reference Index                       Address
0 = 0 mod 8 = 8 mod 8 = 16 mod 8      000 = 0
1 = 1 mod 8 = 9 mod 8 = 17 mod 8      001 = 1
2 = 2 mod 8 = 10 mod 8 = 18 mod 8     010 = 2
3 = 3 mod 8 = 11 mod 8 = 19 mod 8     011 = 3
4 = 4 mod 8 = 12 mod 8 = 20 mod 8     100 = 4
5 = 5 mod 8 = 13 mod 8 = 21 mod 8     101 = 5
6 = 6 mod 8 = 14 mod 8 = 22 mod 8     110 = 6
7 = 7 mod 8 = 15 mod 8 = 23 mod 8     111 = 7
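The modulo mapping in the table can be sketched in software as follows. The class name `CircularBuffer` is hypothetical; on a DSP the wrap-around is performed by the address generation hardware rather than by program code, which is what removes the need to test for the wrap.

```python
class CircularBuffer:
    """Fixed-size buffer in which sample k is stored at address k mod size."""

    def __init__(self, size: int = 8):
        self.size = size
        self.data = [0] * size
        self.count = 0                       # reference index of the next sample

    def write(self, sample) -> int:
        """Store a sample and return the address that was used."""
        address = self.count % self.size     # wraps back to 0 after size - 1
        self.data[address] = sample
        self.count += 1
        return address


buf = CircularBuffer(8)
addresses = [buf.write(k) for k in range(18)]   # samples with indices 0..17
# Reference indices 8 and 16 reuse address 0; index 17 reuses address 1.
```

When the buffer size is a power of two, the modulo can be replaced by keeping only the low-order address bits, which matches the 3-bit addresses shown for eight locations.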
Bit-Reversed Addressing:

Input Index     Output Index
000 = 0         000 = 0
001 = 1         100 = 4
010 = 2         010 = 2
011 = 3         110 = 6
100 = 4         001 = 1
101 = 5         101 = 5
110 = 6         011 = 3
111 = 7         111 = 7
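The index mapping in the table is obtained by reading the bits of the input index in the opposite order. A minimal Python sketch, with the hypothetical function name `bit_reverse`, reproduces it for a 3-bit index (an 8-point radix-2 FFT):

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    reversed_index = 0
    for _ in range(bits):
        # Move the current least significant bit of index into the result.
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


order = [bit_reverse(i, 3) for i in range(8)]   # [0, 4, 2, 6, 1, 5, 3, 7]
```

On a DSP with bit-reversed addressing this reordering is produced by the address generation unit, so no extra instructions are spent on it.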
Speed Issues
fast execution of algorithms is the most important requirement
of a DSP architecture
• high speed instruction operation
• large throughputs
•facilitated by advances in VLSI technology and design
innovations
Hardware Architecture
•dedicated hardware support for multiplications, scaling, loops and repeats, and
special addressing modes is essential for fast DSP implementations
•Harvard architecture significantly improves program execution time compared
to von Neumann
•on-chip memories aid speed of program execution considerably

Parallelism
Parallelism means the provision of multiple function units, which may operate in
parallel to increase throughput:
• multiple memories
• different ALUs for data and address computations
advantage: algorithms can perform more than one operation at a time, increasing speed
disadvantage: complex hardware is required to control the units and to make sure
instructions and data can be fetched simultaneously
Pipelining
an architectural feature in which an instruction is broken into a number of steps
a separate unit performs each step, and all units work at the same time, usually on
different data
advantage: if repeated use of the instruction is required, then after an initial latency
the output throughput becomes one instruction per unit time
disadvantages: pipeline latency, and having to break instructions up into equally timed
units

Pipelining example:
Five steps:
Step 1: instruction fetch
Step 2: instruction decode
Step 3: operand fetch
Step 4: execute
Step 5: save
Pipelining for speeding up the execution of an instruction

Time slot   Step 1   Step 2   Step 3   Step 4   Step 5   Result
T0          Inst 1
T1          Inst 2   Inst 1
T2          Inst 3   Inst 2   Inst 1
T3          Inst 4   Inst 3   Inst 2   Inst 1
T4          Inst 5   Inst 4   Inst 3   Inst 2   Inst 1   Inst 1 complete
T5          Inst 6   Inst 5   Inst 4   Inst 3   Inst 2   Inst 2 complete
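The occupancy of a five-step pipeline can be generated programmatically. The sketch below assumes an ideal pipeline with no stalls, in which instruction k enters Step 1 in time slot T(k-1); the function names `pipeline_schedule` and `completion_slot` are hypothetical.

```python
def pipeline_schedule(num_instructions: int, num_stages: int = 5):
    """Return rows[t][s]: the instruction number in step s+1 during slot t, or None."""
    total_slots = num_instructions + num_stages - 1
    rows = []
    for t in range(total_slots):
        row = []
        for s in range(num_stages):
            inst = t - s + 1                 # instruction that entered s slots earlier
            row.append(inst if 1 <= inst <= num_instructions else None)
        rows.append(row)
    return rows


def completion_slot(inst: int, num_stages: int = 5) -> int:
    """Time slot in which instruction `inst` (counted from 1) leaves the last step."""
    return inst + num_stages - 2


rows = pipeline_schedule(6)
slot_t4 = rows[4]                  # [5, 4, 3, 2, 1]: the pipeline is full
first_done = completion_slot(1)    # 4: Inst 1 completes in T4
second_done = completion_slot(2)   # 5: one more instruction completes every slot
```

After the initial latency of five slots, one instruction completes per slot, so six instructions need 6 + 5 - 1 = 10 slots instead of the 30 slots that unpipelined execution would take.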
Consider an 8-tap FIR filter:
y(n) = ∑ h(i) x(n-i), i = 0, 1, …, 7
The filter can be implemented in many ways depending on the number of multipliers and
accumulators available.

1. Implementation using a single MAC unit
[Figure: single-MAC FIR implementation. Labels: x(n), x(n-1) … x(n-7); seven blocks labeled 8T; multiplexer; multiplier; MAC unit]
2. Pipelined implementation of an 8-tap FIR filter using eight MACs
3. Parallel implementation using two MAC units

Type of implementation                    Maximum sample rate   Maximum throughput
1 MAC                                     1/8T                  1 sample in 8T units of time
Pipelined (8 multipliers and 8 adders)    1/T                   1 sample in T units of time
2 MAC                                     1/4T                  1 sample in 4T units of time
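The sample rates 1/8T, 1/4T and 1/T follow from how many multiply-accumulate operations each MAC unit must perform per output sample, where T is taken to be the time for one MAC operation. The sketch below assumes the taps are divided evenly among the available MAC units; the function names `fir_output` and `sample_period_in_T` are hypothetical.

```python
def fir_output(h, x_history):
    """Compute y(n) = sum of h(i) * x(n-i), where x_history[i] holds x(n-i)."""
    acc = 0
    for coeff, sample in zip(h, x_history):
        acc += coeff * sample          # one multiply-accumulate per tap
    return acc


def sample_period_in_T(taps: int, num_macs: int) -> int:
    """MAC operations each unit performs per output sample (ceiling division)."""
    return -(-taps // num_macs)


h = [1] * 8                            # eight equal coefficients
x_history = [1, 2, 3, 4, 5, 6, 7, 8]   # x(n), x(n-1), ..., x(n-7)
y = fir_output(h, x_history)           # 36

single = sample_period_in_T(8, 1)      # 8 -> sample rate 1/8T
dual = sample_period_in_T(8, 2)        # 4 -> sample rate 1/4T
pipelined = sample_period_in_T(8, 8)   # 1 -> sample rate 1/T
```

The figure for the eight-MAC pipelined case describes throughput only; the first output still appears after the pipeline latency.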
