Avoiding DAG-related Pipeline
Stalls
• Note that
– I2 = 0x1234;
– I3 = 0x0001;
– I1 = 0x002;
– AX0 = DM(I2,M2);
will NOT cause a stall.
• Also, note that switching DAG register
banks (primary ↔ secondary) immediately
before using them will NOT cause a stall.
Branch Dependency in Pipelining
A Branch instruction can cause a pipeline stall if the
branch is taken, as the next instruction has to be
aborted in that case. If I1 is an unconditional branch
instruction, the next Fetch cycle (F2) can start after D1.
But if I1 is a conditional branch instruction, F2 has to
wait until O1 for the decision as to whether the branch
will be taken or not.
F1 D1 O1 W1              branch instruction
   F2 D2 O2 W2           executed if branch is not taken
      F2 D2 O2 W2        executed for unconditional branch
         F2 D2 O2 W2     for conditional branch, if taken
A Digital Signal Processing System
Analog signal in → Antialiasing Filter → Sample and Hold → A/D → DSP → D/A → Reconstruction Filter → Analog signal out
A perspective of the Digital Signal Processing problem
(layers of the figure, from top to bottom)
Application areas: Medical, Radar, Speech, Seismic, Image, ...
Theoretical problem modelling: Digital signal processing theory
Basic functions
Algorithms
Architectures
Implementation: Processor instruction sets and/or hardware functions
Component technology
DSP APPLICATION CHARACTERISTICS
APPLICATION REQUIREMENT     PROCESSOR ATTRIBUTES
REAL-TIME PROCESSING        HIGH SPEED, HIGH THROUGHPUT
LARGE ARRAY OF DATA         INSTRUCTIONS TO MOVE AND PROCESS LARGE DATA ARRAYS
ALGORITHM INTENSIVE         FAST MATHEMATICAL COMPUTATIONS, SINGLE CYCLE DSP OPERATIONS (MACD)
SYSTEM FLEXIBILITY          GENERAL PURPOSE PROGRAMMABILITY, EPROM, MC/MP MODE
Different approaches to hardware implementation
1. HIGH SPEED GENERAL PURPOSE COMPUTERS
   Advantages: Programmable; can be configured for different applications
   Disadvantages: Expensive; complex control; I/O overheads
2. CUSTOM-DESIGNED VLSI COMPONENTS
   Advantages: Efficient design; large throughputs
   Disadvantages: Application specific; high development cost
3. GENERAL PURPOSE DIGITAL SIGNAL PROCESSORS
   Combine the programmability and control features of general purpose computers with the architectural innovations of special purpose chips.
GOALS: HIGH SPEED, LOW POWER AND LOW COST
General purpose computers
1. Flexible
2. Suitable for Internet and
Multimedia application
3. Software Intensive
4. Slow for high speed
application
5. Too bulky
6. Power hungry
Why are conventional Processors not
suitable for DSP?
Caches are a waste of chip area
Small register files force lots of memory
accesses
- these are different from cache since
these are program managed
Complex instruction issue logic, branch
prediction, speculation etc. are not
needed for DSP
Not enough ALU function
Data Processing vs
Signal Processing
• General-purpose microprocessors are designed
primarily for Data Processing.
– The primary burden is Data Read/Write
• Digital Signal Processors are Microprocessors
specifically designed for Signal Processing.
– The primary burden is Mathematical operation
• DSP architecture therefore incorporates certain
features not found in general-purpose µPs.
DSP Requirements
• Emphasis is on mathematical operations rather
than data manipulation operations like word
processing, database management etc.
– Design is optimized for DSP algorithms which
implement FIR filter, FFT generator etc.
• Processing is real-time, i.e. the input signal comes
continuously, and the output signal is also
produced continuously as the input is acquired.
• Dominant mathematical operation is Multiply and
Accumulate (MAC), on separate inputs in parallel.
Digital Signal Processor features
Caters to high arithmetic demands
Real time operation
Analog input / output
Large number of functional units
for a given size
Small control Logic
Typical MAC Execution Cycle
• Obtain a sample of the Input signal
• Move the input sample into the input buffer
• Fetch the co-efficient from internal buffer
• Multiply the input sample by the co-efficient
• Add the product to the Accumulator
• Move the output to the output buffer
• Send it out as a sample of the output signal
MAC Execution Hardware
[Figure: MAC execution hardware. Labels: Program Counter, Program/Coefficient Memory, Data Read Address, Data Write Address, Data Memory, multiplier (*), CP, ACC]
Architecture of Digital Signal
Processors
• General-purpose processors are based on the Von
Neumann architecture (a single memory bank, which the
processor accesses through a single
set of address and data lines)
• Harvard architecture commonly used in DSP
processors
– Separate Data and Program memories (two memory
banks)
– Separate Address Buses for Data and Program
memories
Additional Features in a DSP
Processor
• Instruction Cache and Pipelined processor
as in any modern microprocessor, but no
Data Cache
• Separate ALU, Multiplier and Shifter,
connected through multiple internal data
buses, enabling fast MAC operations
DSP, CISC and RISC
• DSP processors cannot strictly be classified as
either CISC or RISC processors
• Some features present in a RISC processor
may exist. However, DSP processors are
“tuned” towards operations encountered in
signal processing applications
DSP IMPLEMENTATION APPROACHES
Important desirable characteristics
Adequate word length
Fast multiply & accumulate
High speed RAM
Fast Coefficient table addressing
Fast new sample fetch
mechanism
DSP functions implemented with IC chips
Issues:
Speed       Architectural features
Accuracy    Register lengths and floating point capability
Cost        Advances in VLSI
GENERAL PURPOSE DSP FEATURES
1. PARALLELISM: Multiple Functional Units
Multiple Buses
Multiple Memories
2. PIPELINING
3. HARDWARE MULTIPLIERS AND OTHER
ARITHMETIC FUNCTIONS
4. ON-CHIP AND CACHE MEMORIES
5. A VARIETY OF ADDRESSING MODES
7. INSTRUCTIONS THAT PACK SEVERAL
OPERATIONS
8. ZERO-OVERHEAD LOOPING
9. I/O FEATURES SUCH AS INTERRUPT, SERIAL
I/O, DMA
10. OTHER CONTROL FUNCTIONS SUCH AS WAIT
STATES
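[Figure: the incoming sample x(n+1) feeds a chain of Delay elements holding x(n), x(n-1) and x(n-2); the coefficients are h(0), h(1) and h(2); ar1 marks x(n-2) and ar2 marks h(2); the MAC unit produces y(n)]
Organization of signal samples and filter coefficients
for a second order FIR filter implementation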
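An Nth order FIR filter implementation
[Figure: Coefficient Memory holding A[0], A[1], A[2], ..., A[N-1] and Data Memory holding X[n], X[n-1], X[n-2], ..., X[n-N+1] feed a multiplier (*); the product register P feeds an adder (+) and the accumulator ACC, which produces y[n]]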
FIR Filter pseudo-code
        Load loop count
        Initialize coefficient and data addr regs
        Zero Acc and P registers
LOOP:   Pnew = A[i] . X[n-i]
        Accnew = ACCold + Pold
        Decrement coefficient and data addr regs
        X[n-i] → X[n-i-1]   {for next iteration}
        Decr loop count
        BNZ LOOP
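The loop above can be sketched in Python as follows. This is a minimal illustration and not TMS320 code: the function name `fir_step` and the list-based delay line are assumptions made for the example. It mirrors the pipelined structure of the pseudo-code, in which the product formed in one iteration is accumulated in the next, so one extra accumulate is needed after the loop.

```python
def fir_step(coeffs, delay_line, new_sample):
    """Compute one FIR output and shift the delay line by one sample.

    coeffs[i] plays the role of A[i]; delay_line[i] holds X[n-i].
    Both names are illustrative, not taken from any DSP toolchain.
    """
    n_taps = len(coeffs)
    delay_line[0] = new_sample          # X[n] enters the delay line
    acc = 0                             # zero the accumulator
    p = 0                               # zero the product register
    # Walk from the oldest sample to the newest, as the address
    # registers are decremented in the pseudo-code.
    for i in range(n_taps - 1, -1, -1):
        acc += p                        # Acc_new = Acc_old + P_old
        p = coeffs[i] * delay_line[i]   # P_new = A[i] * X[n-i]
        if i + 1 < n_taps:
            delay_line[i + 1] = delay_line[i]   # data move for the next output
    acc += p                            # flush the last product
    return acc
```

Feeding a unit impulse through `fir_step` returns the coefficients one by one, which is a quick way to check the data movement.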
A Typical DSP Architecture
[Figure: block diagram. Labels: PM Address Generator, DM Address Generator, Program Memory (PM; instructions & secondary data), Data Memory (DM; data only), Program Sequencer, Instruction Cache, PM Address and PM Data buses, DM Address and DM Data buses, Registers, Multiplier, ALU, Shifter, I/O Controller (DMA), DMA Bus, Input/Output]
Salient Features
• REPEAT-MAC instruction
- Performs auto-increment of both coefficient
and data pointers
- Frees up program memory bus for fetching
coefficients
• Circular buffer
- to manage data movement at the end of
every output computation
• Handling precision
- Accumulator guard bits
- Saturation mode
- Shifters (both right and left shift)
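A circular buffer avoids moving every sample after each output: only a pointer moves, and addresses wrap around at the end of the buffer. The sketch below is a software illustration under assumed names (`CircularBuffer`, `push`, `tap`); a DSP does the modulo step in its address generator rather than in code.

```python
class CircularBuffer:
    """Fixed-length delay line addressed modulo its length."""

    def __init__(self, length):
        self.data = [0] * length
        self.newest = 0                 # index of the most recent sample

    def push(self, sample):
        # Step the pointer backwards with wrap-around and overwrite
        # the oldest sample; no other element is moved.
        self.newest = (self.newest - 1) % len(self.data)
        self.data[self.newest] = sample

    def tap(self, k):
        """Return x[n-k], the sample pushed k steps ago."""
        return self.data[(self.newest + k) % len(self.data)]
```

After pushing 1, 2, 3, 4 into a buffer of length 3, `tap(0)`, `tap(1)` and `tap(2)` give the three most recent samples, newest first.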
Product Computation Unit of a
simple multiplier for 4-bit
unsigned numbers X and Y
Saturation Logic
Sets the contents of the register to the largest
representable value of the appropriate sign if overflow
occurs, instead of letting the result wrap around
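The effect of saturation can be seen by comparing it with ordinary two's-complement wrap-around. This is a minimal sketch assuming a 16-bit signed register; the function names are illustrative.

```python
def saturating_add(a, b, bits=16):
    """Add two signed integers, clamping to the two's-complement range."""
    hi = (1 << (bits - 1)) - 1          # most positive value, 32767 for 16 bits
    lo = -(1 << (bits - 1))             # most negative value, -32768 for 16 bits
    return max(lo, min(hi, a + b))


def wrapping_add(a, b, bits=16):
    """Add two signed integers, discarding overflow as plain hardware would."""
    total = (a + b) & ((1 << bits) - 1)
    if total >= 1 << (bits - 1):
        total -= 1 << bits              # reinterpret the top bit as the sign
    return total
```

With inputs 30000 and 10000 the saturating version clamps to 32767, while the wrapping version flips sign to -25536, which in a filter output would be heard as a loud click.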
Block Floating Point
Scaling logic + exponent register: If overflow
condition of any point is detected, the entire
array is rescaled downwards and the scaling is
stored in the block exponent register.
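The rescaling step can be sketched as follows, assuming 16-bit signed data points and a scaling step of one bit at a time. The function name `block_scale` is an assumption for this example, not a hardware operation.

```python
def block_scale(block, bits=16):
    """Return (scaled_block, block_exponent) so every value fits in `bits`."""
    limit = 1 << (bits - 1)
    exponent = 0
    scaled = list(block)
    # Rescale the whole array downwards while any point overflows.
    while any(v >= limit or v < -limit for v in scaled):
        scaled = [v >> 1 for v in scaled]   # arithmetic shift right by one bit
        exponent += 1
    return scaled, exponent
```

The true value of each point is approximately `scaled[i] * 2**exponent`; small values lose precision because the whole block shares one exponent.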
SHIFTERS
- Scales numbers to prevent
overflow/underflow
- Conversion between fixed point and
floating point
- Many bits must be shifted in a single cycle
to preserve single cycle computational
speed (Barrel Shifter)
- Logical shift assumes unsigned data and
fills with zeroes left or right
- Arithmetic shift scales numbers upwards
(left, zero fills) or downward (right, sign extends)
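The difference between logical and arithmetic shifts shows up only on right shifts of negative numbers. The sketch below models a 16-bit register with Python integers; the function names and the 16-bit width are assumptions for illustration.

```python
def to_signed(value, bits=16):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def logical_shift_right(value, amount, bits=16):
    """Shift right, filling the vacated high bits with zeroes."""
    return (value & ((1 << bits) - 1)) >> amount


def arithmetic_shift_right(value, amount, bits=16):
    """Shift right, filling the vacated high bits with copies of the sign bit."""
    return (to_signed(value, bits) >> amount) & ((1 << bits) - 1)


def shift_left(value, amount, bits=16):
    """Shift left, filling with zeroes; bits shifted out of the register are lost."""
    return (value << amount) & ((1 << bits) - 1)
```

For the bit pattern 0x8000 shifted right by 4, the logical shift gives 0x0800 and the arithmetic shift gives 0xF800, that is -32768 / 16 = -2048.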
Super scalar architecture
• Hardware responsible for finding ILP in
a sequential program
• Advantage : Compatibility between
generations
• Disadvantage : Very complex hardware
Explicitly Parallel Instruction
Computing (EPIC)
• Combines VLIW and super scalar
architectures
• Instructions are grouped into 3
operating blocks and a template block
• Template block tells hardware if
instructions can be executed in parallel
• Also gives information whether the
block can be executed in parallel
ILP versus Power
Increasing instructions / cycle
Requires fewer cycles to execute a task
Allows a slower clock for the same performance
A slower clock allows a lower supply voltage
And hence uses less power
However, too many functional units and too
many transitions per clock cycle increase
power consumption.
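The trade-off can be illustrated with the usual first-order model of CMOS switching power, P = C · Vdd² · f. The model and all numbers below are assumptions chosen for illustration, not figures from any specific processor: doubling the hardware is taken to double the switched capacitance, while halving the clock is assumed to permit a drop from 3.3 V to 1.8 V.

```python
def dynamic_power(capacitance, vdd, freq):
    """First-order CMOS switching power: P = C * Vdd^2 * f."""
    return capacitance * vdd ** 2 * freq


# Hypothetical baseline: one unit of switched capacitance at 3.3 V, 100 MHz.
baseline = dynamic_power(1.0, 3.3, 100e6)

# Hypothetical wider machine: twice the capacitance, half the clock,
# and a lower supply voltage made possible by the slower clock.
wide = dynamic_power(2.0, 1.8, 50e6)

ratio = wide / baseline   # fraction of the baseline power
```

Under these assumed numbers the wider, slower design uses roughly 30% of the baseline power for the same throughput. The capacitance term is why adding too many functional units eventually reverses the gain.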
Low Power architecture
Power consumed by additional circuits vs.
ability to lower clock rate while maintaining
performance
Circuits must be highly used
Move complexity into software
Voltage scaling : Reduce Vdd
Clock gating : Turn off clock when chip
is not in use ( applies to
sub-modules of chip also)
VLIW is more suitable than super scalar
for low power
- VLIW is smaller for same number of
functional units
- Compiler is better at finding
parallelism than hardware
Put multiple processors on chip rather
than lots of functional units in one processor
Helps in running independent tasks
Improvement of Speed by
Pipelining
• Processor speed can be enhanced by having separate
hardware units for the different functional blocks,
with buffers between the successive units.
– The number of unit operations into which the instruction
cycle of a processor can be divided for this purpose
defines the number of stages in the pipeline.
– A processor having an n-stage pipeline would have up to
n instructions simultaneously being processed by the
different functional units of the processor.
• Effective processor speed increases ideally by a
factor equal to the number of pipelining stages.
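The ideal speed-up can be checked with a small calculation. The sketch assumes every stage takes one cycle and that there are no stalls; the function names are illustrative.

```python
def pipeline_cycles(n_stages, n_instructions):
    """Cycles to finish all instructions: fill the pipe once, then one per cycle."""
    return n_stages + n_instructions - 1


def speedup(n_stages, n_instructions):
    """Ratio of unpipelined to pipelined execution time."""
    unpipelined = n_stages * n_instructions
    return unpipelined / pipeline_cycles(n_stages, n_instructions)
```

For a 4-stage pipeline, 100 instructions take 103 cycles instead of 400, a speed-up of about 3.9. The factor approaches the number of stages only for long instruction streams.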
Data Dependency in Pipelining
If the input data for an instruction depends on the
outcome of the previous instruction, the Write cycle of
the previous instruction has to be over before the
Operate cycle of the next instruction can start. The
pipeline effectively idles through one instruction,
creating a bubble in the pipeline which persists for
several instructions.
F1   D1   O1   W1
     F2   D2   idle O2   W2
          F3   idle D3   O3   W3
                    F4   D4   O4   W4   (bubble ends here)
Example of dependency
• A ← 3 + A; B ← 4 × A
Can’t perform these two in parallel, since the second
reads the value of A that the first writes
• Another case: A = B + A; B = A – B; A =
A – B (swapping without temp); examine
how you can handle this.
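A dependency of this kind (an instruction reading a register that the previous one writes) can be detected mechanically. The sketch below represents each instruction as a hypothetical (destination, sources) pair; this representation and the function name are assumptions for the example.

```python
def raw_hazard(first, second):
    """True if `second` reads the register that `first` writes.

    Each instruction is a (destination, set_of_sources) pair.
    """
    dest, _ = first
    _, sources = second
    return dest in sources


# A <- 3 + A ; B <- 4 x A
pair = [("A", {"A"}), ("B", {"A"})]

# A = B + A ; B = A - B ; A = A - B   (swap without a temporary)
swap = [("A", {"A", "B"}), ("B", {"A", "B"}), ("A", {"A", "B"})]

swap_hazards = [raw_hazard(swap[i], swap[i + 1]) for i in range(len(swap) - 1)]
```

Every consecutive pair in the swap sequence is dependent, so each instruction must wait for the one before it.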
Pipeline in ADSP 219x
Processors
6-Stage Instruction Pipeline with Single-cycle
Computational Units:
• Look-Ahead: places the address of an instruction
that is going to be executed several cycles down the
road, on the program-memory address (PMA) bus
• Pre-fetch: Pre-fetches an instruction if the
instruction is already in the instruction cache
• Fetch: Fetches the instruction that was “looked
ahead” 2 cycles ago
• Address-decode: Decoding of the DAG operand
fields in the opcode in this cycle
• Decode: The second stage of the instruction
decoding process, where the rest of the opcode is
decoded
• Execute: Instruction is executed, status updated,
results written to destinations
Causes for Pipeline Stalls
Memory block conflicts: If both instruction and data are to be
fetched from the same block of memory, a stall is
automatically inserted
DAG usage immediately (or within 2 cycles) after
initialization. e.g.
I2 = 0x1234;
AX0 = DM(I2,M2);
Bus conflicts: Instructions which use the PMA/PMD buses for
data transfer may cause bus conflict. e.g.
PM(I5,M7)=M3;
TMS320C25 GENERAL PURPOSE FEATURES
· COMPREHENSIVE INSTRUCTION SET: 133 INSTRUCTIONS INCLUDING
  - NUMERICAL (34)
  - LOGICAL (15)
  - MEMORY MANAGEMENT (33)
  - BRANCHES (20)
  - PROGRAM/MODE CONTROL (31)
· EXTENDED-PRECISION ARITHMETIC
· SERIAL PORT (DOUBLE BUFFERED, NO STATIC)
· MULTIPROCESSOR INTERFACES (CONCURRENT DMA, GLOBAL DATA MEMORY)
· BLOCK MOVES (UP TO 10 M WORDS/SEC)
· ON-CHIP TIMER
· THREE EXTERNAL MASKABLE INTERRUPTS
[Figure: flowchart fragment with the labels X=X-Y, BIT 16=0, YES, OUTPUT X]
TMS320C25 ALU DESIGN & OPERATION
• 32-BIT ALU & ACCUMULATOR
• CARRY BIT FOR EXTENDED PRECISION
• OVERFLOW DETECTION & SATURATION
• SIGN EXTENSION OPTION
• 0-16 BIT PARALLEL SHIFTER FOR LOADS AND ARITHMETIC OPS
• SHIFTERS ON PRODUCT REGISTER OUTPUT DATA
• 0-7 BIT PARALLEL SHIFTER FOR ACCUMULATOR STORES
TMS320C25 - MULTIPLY INSTRUCTIONS II
MAC    MPY data memory * program memory & add past P-Reg to ACC
MACD   MPY data memory * program memory, add past P-Reg to ACC, & move data memory
SQRA   Square data memory value & add past P-Reg to ACC
SQRS   Square data memory value & sub past P-Reg from ACC
TMS320C25 SPECIAL PURPOSE FEATURES
• SINGLE-CYCLE MULTIPLY/ACCUMULATE
• MULTIPLY/ACCUMULATE USING EXTERNAL PROGRAM MEMORY
• REPEAT INSTRUCTION
• ADAPTIVE FILTERING INSTRUCTIONS
• BIT-REVERSED ADDRESSING
• 0-16 BIT SCALING SHIFTER (SIGNED OR UNSIGNED)
• AUTOMATIC DATA-MOVE IN MEMORY (Z^-1)
• OVERFLOW MANAGEMENT
  - SATURATION MODE
  - BRANCH ON OVERFLOW
  - PRODUCT RIGHT SHIFT
[Figure: tapped delay line, input Xi passing through Z^-1 delays to output Yi]
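TMS320C25 - HIGHER PERFORMANCE AT LESS CODE SPACE
[Figure: FIR filter, input xn passing through a chain of Z^-1 delays, each tap multiplied (x) and summed into Yn]
$Y_n = \sum_{K=0}^{N} b_K \, X(n-K)$
TMS320C25:
    RPTK 49
    MACD
3 WORDS PROG MEMORY
53 CYCLES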
TMS320C25 ADDRESSING MODES
• IMMEDIATE ADDRESSING
  - BOTH LONG AND SHORT CONSTANTS
  - EXAMPLES:
      ADDK 5
      ADLK >1325
  [Figure: program memory holding ADDK 5, and ADLK followed by the constant 1325]
• DIRECT ADDRESSING
  - SAME AS TMS320C1X BUT DP IS 9 BITS
  - 512 “BANKS” OF 128 WORDS
  - USED OFTEN FOR LONG SEQUENCES OF IN-LINE CODE
  [Figure: operand address formed from 9 bits from DP and 7 bits from the instruction]
• INDIRECT ADDRESSING
  - 8 AUXILIARY REGISTERS
  - USED OFTEN IN PROGRAM LOOPS WITH AUTO INC/DEC OPTIONS
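The direct addressing scheme described above (a 9-bit data page pointer selecting one of 512 banks, plus a 7-bit offset from the instruction selecting one of 128 words) can be sketched as a simple bit concatenation. The function name is an assumption for this example.

```python
def direct_address(dp, offset):
    """Form a 16-bit data address from a 9-bit page pointer and a 7-bit offset."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in 9 bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in 7 bits")
    return (dp << 7) | offset           # DP forms the upper 9 bits
```

Page 4, offset 0 maps to address 512, and page 511, offset 127 maps to 0xFFFF, so the two fields together cover the full 64K-word data space.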
Addressing Mode (contd.)
• Circular buffer addressing mode
• Bit-reversed addressing mode
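Bit-reversed addressing is typically used to reorder FFT data: the address bits are read in the opposite order. A minimal software sketch, with an illustrative function name, is shown below; a DSP performs the same mapping in its address generator at no cycle cost.

```python
def bit_reverse(index, n_bits):
    """Return `index` with its lowest `n_bits` bits in reversed order."""
    result = 0
    for _ in range(n_bits):
        result = (result << 1) | (index & 1)   # move the lowest bit across
        index >>= 1
    return result


# Access order for an 8-point transform (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
```

For 8 points the resulting order is 0, 4, 2, 6, 1, 5, 3, 7.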
TMS320C25 AUXILIARY REGISTER INSTRUCTIONS
LAR    Load aux-reg w/data
LARK   Load AR w/8-bit constant
LRLK   Load AR w/16-bit constant
MAR    Modify auxiliary register
SAR    Store auxiliary register
ADRK   Add 8-bit constant to AR
SBRK   Sub 8-bit constant from AR
LARP   Load auxiliary register pointer
TMS320C25 ON-CHIP MEMORY
MEMORY ORGANIZATION
• 4K WORDS ON-CHIP MASKED ROM
• 544 WORDS ON-CHIP DATA RAM
• 256 WORDS ON-CHIP RAM RECONFIGURABLE AS DATA/PROGRAM MEMORY
• BLOCK TRANSFERS IN MEMORY
• DIRECT, INDIRECT, AND IMMEDIATE ADDRESSING MODES
BLOCK DIAGRAM OF A TMS320C5X DSP
General-Purpose Microprocessor
circa 1984 : Intel 8088
~29,000 transistors
Clock speed : ~ 5 MHz
Address space : 20 bits
Bus width : 8 bits
100+ instructions
2-35 cycles per instruction
Micro-coded architecture
DSP TMS 32010 1984
Clock 20 MHz
16 bits
8, 12 bits addressing space
~ 50 k transistors
~ 35 instructions
Harvard architecture
Hardware multiplier
Double length accumulator with
saturation
A few special DSP instructions
Relatively inexpensive
General Purpose Microprocessor 2000
GHz clock speed
32-bit address or more
32-bit bus, 128-bit instructions
Complex MMU
Super scalar CPU
MMX instructions
On chip cache
Single cycle execution
32-bit floating point ALU on board
Very expensive
10s of watts of power
DSP in 2000
Clock 100 ~ 200 MHz
16-bit fixed point or 32-bit floating
point
16-24 bits address space
Large on-chip and off-chip memories
Single cycle execution of most
instructions
Harvard architecture
Lots of special DSP instructions
50 mW to 2 W power
Future of DSP Microprocessor
Sufficiently unique for an
independent class of applications (HDD,
cell phone)
Low power consumption, low cost
High performance within power, cost
constraints (MIPS/mW, MIPS/$)
Fixed point & floating point
Better compilers - but users must be
informed
Hybrid DSP/ GP systems
DSP Architectures
Professor S. Srinivasan
Electrical Engineering Department
I.I.T.-Madras, Chennai –600 036