DSP Architecture
[Figure: analog signal in → antialiasing filter → sample and hold → A/D → DSP → D/A → reconstruction filter → analog signal out]
A perspective of the Digital Signal Processing problem
Application areas: Medical, Radar, Speech, Seismic, Image, ...
Theoretical problem modelling
Basic functions
Algorithms
Architectures
Processor instruction sets and/or hardware functions
Implementation
Component technology
DSP APPLICATION CHARACTERISTICS
Programmable
Can be configured for different applications
Expensive
Complex control
I/O overheads
1. Flexible
2. Suitable for Internet and
Multimedia application
3. Software Intensive
4. Slow for high speed application
5. Too bulky
6. Power hungry
Why are conventional Processors not
suitable for DSP?
[Figure: block diagram showing program counter, data read address, data write address, program/coefficient memory, data memory, multiplier (*), CPU and accumulator (ACC)]
Architecture of Digital Signal
Processors
• General-purpose processors are based on the Von
Neumann architecture: a single memory bank, which the
processor accesses through a single set of address
and data lines
2. PIPELINING
3. HARDWARE MULTIPLIERS AND OTHER
ARITHMETIC FUNCTIONS
4. ON-CHIP AND CACHE MEMORIES
5. A VARIETY OF ADDRESSING MODES
7. INSTRUCTIONS THAT PACK SEVERAL
OPERATIONS
8. ZERO-OVERHEAD LOOPING
9. I/O FEATURES SUCH AS INTERRUPT, SERIAL
I/O, DMA
10. OTHER CONTROL FUNCTIONS SUCH AS WAIT
STATES
A second order FIR filter
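[Figure: second order FIR filter; the delayed samples x(n-1) and x(n-2) are weighted by h(1) and h(2) and summed into y(n) by the MAC unit, with ar1 pointing at x(n-2) and ar2 pointing at h(2)]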
Organization of signal samples and filter coefficients
for a second order FIR filter implementation
An Nth order FIR filter implementation
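Coefficient Memory    Data Memory
A[0]                  X[n]
A[1]                  X[n-1]
A[2]                  X[n-2]
...                   ...
A[N-1]                X[n-N+1]
[Figure: both memories feed the multiplier (*); the product register P is added (+) into the accumulator ACC, which delivers y[n]]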
FIR Filter pseudo-code
Load loop count
Initialize coefficient and data addr regs
Zero Acc and P registers
LOOP: Pnew = A[i] * X[n-i]
      Accnew = Accold + Pold
      Decrement coefficient and data addr regs
      X[n-i] → X[n-i-1] {for next iteration}
      Decr loop count
      BNZ LOOP
Acc = Acc + P {accumulate the final product}
Acc → Y[n]
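The pseudo-code above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the names `fir_step`, `coeffs` and `delay_line` are illustrative and not taken from any DSP toolchain, and the list `delay_line` stands in for the data memory, with index i holding x[n-i]. The loop runs from the oldest sample to the newest so that the data move never overwrites a sample that is still needed.

```python
def fir_step(coeffs, delay_line, new_sample):
    """Compute one output y[n] and age the delay line in the same pass.

    delay_line[i] holds x[n-i]; index 0 is the newest sample.
    (Illustrative model only; names are hypothetical.)
    """
    n_taps = len(coeffs)
    delay_line[0] = new_sample
    acc = 0   # accumulator
    p = 0     # product register, one step behind the accumulator
    # Walk from the oldest sample to the newest.
    for i in range(n_taps - 1, -1, -1):
        acc += p                                # Acc_new = Acc_old + P_old
        p = coeffs[i] * delay_line[i]           # P_new = A[i] * X[n-i]
        if i + 1 < n_taps:
            delay_line[i + 1] = delay_line[i]   # X[n-i] -> X[n-i-1]
    acc += p                                    # accumulate the final product
    return acc


state = [0, 0, 0]
impulse_response = [fir_step([1, 2, 3], state, x) for x in [1, 0, 0, 0]]
```

Feeding a unit impulse returns the coefficients in order, which is a quick sanity check that both the pipelined accumulate and the data move behave as intended.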
A Typical DSP Architecture
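[Figure: block diagram of a typical DSP. Program Memory (PM; instructions and secondary data) and Data Memory (DM; data only) each have their own address generator and their own address and data buses (PM Address, PM Data, DM Address, DM Data). A program sequencer with instruction cache controls the core, which contains registers, multiplier, ALU and shifter. An I/O controller (DMA) connects Input/Output to the memories over the DMA bus.]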
Salient Features
• REPEAT-MAC instruction
- Performs auto-increment of both coefficient and
data pointers
- Frees up program memory bus for fetching
coefficients
• Circular buffer
- to manage data movement at the end of every
output computation
• Handling precision
- Accumulator guard bits
- Saturation mode
- Shifters (both right and left shift)
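The circular buffer idea listed above can be sketched as follows. This is a minimal software model, assuming a hypothetical class name `CircularFIR`; on a real DSP the modulo wrap-around is done by the address generator hardware rather than by an explicit `%` operation. The point it illustrates is that only one memory location is written per output sample, and no other data moves.

```python
class CircularFIR:
    """FIR filter whose delay line is a circular buffer (illustrative model)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)
        self.buf = [0] * len(self.coeffs)
        self.head = 0   # index of the newest sample

    def step(self, sample):
        n = len(self.buf)
        # Step the pointer back and overwrite the oldest sample.
        self.head = (self.head - 1) % n
        self.buf[self.head] = sample
        acc = 0
        for i, a in enumerate(self.coeffs):
            acc += a * self.buf[(self.head + i) % n]   # a[i] * x[n-i]
        return acc


fir = CircularFIR([1, 2, 3])
outputs = [fir.step(x) for x in [1, 0, 0, 0]]
```

Compared with shifting every sample along the delay line, the buffer pointer is the only state that changes between outputs apart from the newly written sample.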
Types of multipliers used
• Array multipliers
• Multipliers based on modified Booth’s
algorithm
Product Computation Unit of a
simple multiplier for 4-bit
unsigned numbers X and Y
Multiplying X and Y: summation
unit of the simple multiplier
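As a rough software analogue of the two units named above, the sketch below forms the 16 partial-product bits with AND operations (the product computation unit) and then adds the shifted rows (the summation unit). The function name `array_multiply_4bit` is hypothetical, and ordinary integer addition stands in for the array of adders used in hardware.

```python
def array_multiply_4bit(x, y):
    """Multiply two 4-bit unsigned numbers via AND partial products."""
    if not (0 <= x < 16 and 0 <= y < 16):
        raise ValueError("operands must be 4-bit unsigned values")
    x_bits = [(x >> i) & 1 for i in range(4)]
    y_bits = [(y >> j) & 1 for j in range(4)]
    # Product computation unit: one AND gate per pair of bits.
    partial = [[x_bits[i] & y_bits[j] for i in range(4)] for j in range(4)]
    # Summation unit: add each row, shifted left by its row index.
    product = 0
    for j, row in enumerate(partial):
        row_value = sum(bit << i for i, bit in enumerate(row))
        product += row_value << j
    return product
```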
Combinational array for Booth’s
algorithm – Basic cell B
Array Multiplier for 4x4-bit
numbers using basic cell B
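The recoding idea behind Booth's algorithm can be illustrated in software. The sketch below uses plain radix-2 Booth recoding for signed numbers, not the modified (radix-4) variant mentioned earlier, and the function name `booth_multiply` is an assumption made for this example. Each multiplier bit pair (y[i], y[i-1]) is recoded into a digit of -1, 0 or +1, so the multiplicand is added, skipped or subtracted at each bit position.

```python
def booth_multiply(x, y, bits=4):
    """Multiply two signed `bits`-wide integers using radix-2 Booth recoding."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not (lo <= x <= hi and lo <= y <= hi):
        raise ValueError("operands do not fit in the given word length")
    y_pattern = y & ((1 << bits) - 1)   # two's-complement bit pattern of y
    product = 0
    prev_bit = 0                        # implicit y[-1] = 0
    for i in range(bits):
        bit = (y_pattern >> i) & 1
        digit = prev_bit - bit          # Booth digit: -1, 0 or +1
        product += digit * (x << i)     # add, skip or subtract shifted x
        prev_bit = bit
    return product
```

The recoding works because the two's-complement value of y equals the sum of (y[i-1] - y[i]) * 2^i over all bit positions, so signed operands need no separate sign correction.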
Arithmetic
Fixed point Vs Floating point
Fixed point:
• Array indices, loop counters etc.
Floating point:
• Wider dynamic range frees user from scaling concerns
• Less sensitive to error accumulation
• Unbiased rounding
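The scaling concern that fixed-point users face can be made concrete with a small Q15 example (16-bit words with 15 fractional bits). This is a sketch under assumptions: the helper names `to_q15`, `saturate16` and `q15_dot` are illustrative, and Python's unbounded integers play the role of an accumulator with guard bits, so only the final result is saturated to 16 bits.

```python
Q15_ONE = 1 << 15
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1


def saturate16(value):
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def to_q15(value):
    """Convert a float to Q15, saturating outside [-1, 1)."""
    return saturate16(round(value * Q15_ONE))


def q15_dot(a, b):
    """Dot product of two Q15 vectors using a wide accumulator."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y                # each product is in Q30
    return saturate16(acc >> 15)    # rescale to Q15, then saturate


quarter = q15_dot([to_q15(0.5)], [to_q15(0.5)])            # 0.5 * 0.5
clipped = q15_dot([to_q15(0.5)] * 4, [to_q15(0.9)] * 4)    # sum is about 1.8
```

The first result fits in Q15, while the second exceeds the representable range and is clipped to the largest positive value. In fixed point the programmer has to scale inputs or coefficients to avoid this; a floating-point format absorbs it in the exponent.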
Internal Vs External
Pincount limitation
Speed penalty
Off-chip bussing
[Figure: separate PROGRAM MEMORY and DATA MEMORY]
MODIFICATION #1: PROGRAM/DATA MEMORY plus DATA MEMORY
MODIFICATION #2: PROGRAM MEMORY plus MULTI-PORT DATA MEMORY
MEMORY ORGANISATION - II
MODIFICATION #3: PROGRAM CACHE, PROGRAM/DATA MEMORY and DATA MEMORY
MODIFICATION #4
MODIFICATION #5
VLIW architecture
• Each instruction specifies several operations
to be done in parallel
• Advantages:
- Simple hardware
- Compilers can spot ILP easily
• Disadvantages:
- Little compatibility between generations
- Explicit NOPs bloat code size
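The NOP overhead mentioned above can be seen with a toy bundle packer. Everything here is hypothetical: a machine with three slots per instruction (`MEM`, `MUL`, `ALU`), single-cycle latency for every operation, and a simple in-order greedy placement rule. It is not how any particular VLIW compiler schedules code.

```python
UNITS = ("MEM", "MUL", "ALU")   # hypothetical 3-slot VLIW machine


def pack_bundles(ops):
    """Pack operations into bundles, filling unused slots with NOP.

    ops: list of (name, unit, [names it depends on]) in program order.
    Assumes every operation completes in one cycle.
    """
    bundles = []   # each bundle maps unit -> operation name
    placed = {}    # operation name -> index of its bundle
    for name, unit, deps in ops:
        # An operation must come after every operation it depends on.
        idx = max((placed[d] + 1 for d in deps), default=0)
        while idx < len(bundles) and unit in bundles[idx]:
            idx += 1   # slot for this unit is already taken
        while len(bundles) <= idx:
            bundles.append({})
        bundles[idx][unit] = name
        placed[name] = idx
    return [[b.get(u, "NOP") for u in UNITS] for b in bundles]


program = [
    ("ld_a", "MEM", []),
    ("ld_x", "MEM", []),
    ("mul", "MUL", ["ld_a", "ld_x"]),
    ("add", "ALU", ["mul"]),
]
schedule = pack_bundles(program)
nop_count = sum(slot == "NOP" for bundle in schedule for slot in bundle)
```

For this dependent chain of four operations the packer needs four bundles, and 8 of the 12 slots are NOPs. This is the code-size cost that the slide refers to.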
Super scalar architecture
[Figure: pipeline timing diagram; stages F2 D2 O2 W2 shown for a conditional branch, if taken]
Pipeline in ADSP 219x
Processors
6-Stage Instruction Pipeline with Single-cycle
Computational Units:
• Look-Ahead: places the address of an instruction
that is going to be executed several cycles down the
road, on the program-memory address (PMA) bus
• Pre-fetch: Pre-fetches an instruction if the
instruction is already in the instruction cache
• Fetch: Fetches the instruction that was “looked
ahead” 2 cycles ago
• Address-decode: Decoding of the DAG operand
fields in the opcode in this cycle
• Decode: The second stage of the instruction
decoding process, where the rest of the opcode is
decoded
• Execute: Instruction is executed, status updated,
results written to destinations
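How instructions overlap in such a six-stage pipeline can be sketched with a small timeline generator. It assumes the ideal case only (one instruction enters per cycle, with no stalls or branches), and the function name `pipeline_timeline` is illustrative.

```python
STAGES = ("Look-Ahead", "Pre-fetch", "Fetch",
          "Address-decode", "Decode", "Execute")


def pipeline_timeline(instructions):
    """Return, for each cycle, the instruction in each stage (None if empty)."""
    n_cycles = len(instructions) + len(STAGES) - 1
    timeline = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(STAGES):
            k = cycle - depth   # instruction k entered the pipeline at cycle k
            row[stage] = instructions[k] if 0 <= k < len(instructions) else None
        timeline.append(row)
    return timeline


timeline = pipeline_timeline(["I0", "I1", "I2"])
```

In this model an instruction reaches Fetch two cycles after Look-Ahead, matching the description above, and it executes five cycles after it entered the pipeline. Once the pipeline is full, one instruction completes every cycle.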
Causes for Pipeline Stalls
Memory block conflicts: If both instruction and data are to be
fetched from the same block of memory, a stall is
automatically inserted
DAG register conflicts: A DAG register is used immediately (or
within 2 cycles) after it is initialized, e.g.
I2 = 0x1234;
AX0 = DM(I2,M2);
Bus conflicts: Instructions which use the PMA/PMD buses for
data transfer may cause bus conflict. e.g.
PM(I5,M7)=M3;
Avoiding DAG-related Pipeline
Stalls
• Note that
– I2 = 0x1234;
– I3 = 0x0001;
– I1 = 0x002;
– AX0 = DM(I2,M2);
will NOT cause a stall, because two other instructions
separate the initialization of I2 from its first use.
• Also, note that switching DAG register
banks (primary ↔ secondary) immediately
before using them will NOT cause a stall.
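The rule of thumb above can be turned into a small checker for instruction listings. This is a sketch with several assumptions: the name `find_dag_hazards` is made up, the regular expressions only recognise the simple `reg = value;` and `DM(Ix,My)` / `PM(Ix,My)` forms shown in these slides, I, M and L registers are all treated alike, and a use within two instructions of the load is flagged. It does not model the actual processor.

```python
import re

INIT = re.compile(r"^\s*([IML]\d)\s*=")                          # e.g. "I2 = 0x1234;"
USE = re.compile(r"\b(?:DM|PM)\(\s*(I\d)\s*,\s*(M\d)\s*\)")      # e.g. "DM(I2,M2)"


def find_dag_hazards(program, window=2):
    """Return (instruction index, register) pairs that would likely stall."""
    hazards = []
    last_init = {}   # register name -> index of the instruction that loaded it
    for idx, line in enumerate(program):
        for match in USE.finditer(line):
            for reg in match.groups():
                if reg in last_init and idx - last_init[reg] <= window:
                    hazards.append((idx, reg))
        init = INIT.match(line)
        if init:
            last_init[init.group(1)] = idx
    return hazards


stalling = find_dag_hazards(["I2 = 0x1234;", "AX0 = DM(I2,M2);"])
clean = find_dag_hazards(
    ["I2 = 0x1234;", "I3 = 0x0001;", "I1 = 0x002;", "AX0 = DM(I2,M2);"]
)
```

The first listing is flagged because I2 is used in the instruction right after it is loaded. The second is not, since two unrelated register loads fill the gap.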
TMS320C25 KEY FEATURES
[Figure: tapped delay line of Z-1 elements from input Xi to output Yi; supply pins +5 V and GND]
• SINGLE-CYCLE MULTIPLY/ACCUMULATE
• MULTIPLY/ACCUMULATE USING EXTERNAL PROGRAM MEMORY
• REPEAT INSTRUCTION
• ADAPTIVE FILTERING INSTRUCTIONS
• BIT-REVERSED ADDRESSING
• 0-16 BIT SCALING SHIFTER (SIGNED OR UNSIGNED)
• AUTOMATIC DATA-MOVE IN MEMORY (Z-1)
• OVERFLOW MANAGEMENT
- SATURATION MODE
- BRANCH ON OVERFLOW
- PRODUCT RIGHT SHIFT
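Bit-reversed addressing, listed among the features above, is what a radix-2 FFT needs to reorder its input or output. The sketch below computes the address sequence in software, assuming a hypothetical helper name `bit_reversed_indices`; a DSP with this addressing mode generates the same sequence in its address hardware at no cycle cost.

```python
def bit_reversed_indices(n):
    """Return the bit-reversed index order for a power-of-two length n."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    bits = n.bit_length() - 1
    indices = []
    for i in range(n):
        reversed_i = 0
        for b in range(bits):
            if i & (1 << b):
                reversed_i |= 1 << (bits - 1 - b)   # mirror bit b
        indices.append(reversed_i)
    return indices


order = bit_reversed_indices(8)
```

For an 8-point transform the order is 0, 4, 2, 6, 1, 5, 3, 7: index 1 (binary 001) maps to 4 (binary 100), and so on.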
TMS320C25 - HIGHER PERFORMANCE AT LESS CODE SPACE
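[Figure: tapped delay line; input xn passes through Z-1 elements, each tap is multiplied (x) by a coefficient and the products are summed into Yn]
$Y_n = \sum_{K=0}^{N} b_K X(n-K)$
TMS320C25: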
RPTK 49
MACD
INDIRECT ADDRESSING
- 8 AUXILIARY REGISTERS
- USED OFTEN IN PROGRAM LOOPS WITH AUTO INC/DEC OPTIONS
Addressing Mode (contd.)
MEMORY ORGANIZATION
• 4K WORDS ON-CHIP MASKED ROM
• 544 WORDS ON-CHIP DATA RAM
• 256 WORDS ON-CHIP RAM RECONFIGURABLE AS DATA/PROGRAM MEMORY
• BLOCK TRANSFERS IN MEMORY
• DIRECT, INDIRECT, AND IMMEDIATE ADDRESSING MODES
BLOCK DIAGRAM OF A TMS320C5X DSP
General-Purpose Microprocessor
circa 1984 : Intel 8088
~29,000 transistors
Clock speed : ~ 5 MHz
Address space : 20 bits
Bus width : 8 bits
100+ instructions
2-35 cycles per instruction
Micro-coded architecture
DSP circa 1984 : TMS32010
Clock speed : 20 MHz
Word length : 16 bits
Address space : 8 bits (data), 12 bits (program)
~ 50 k transistors
~ 35 instructions
Harvard architecture
Hardware multiplier
Double length accumulator with saturation
A few special DSP instructions
Relatively inexpensive
General Purpose Microprocessor 2000
GHz clock speed
32-bit address or more
32-bit bus, 128-bit instructions
Complex MMU
Super scalar CPU
MMX instructions
On chip cache
Single cycle execution
32-bit floating point ALU on board
Very expensive
10s of watts of power
DSP in 2000