DSP Processor Fundamentals
Subhasish Mukherjee
Slide: 1
Salient Features of DSP Processors
Fast multiply and accumulate
Multiple access memory architecture
Specialized addressing modes
Specialized execution control
Peripherals and I/O interfaces
Slide: 2
DSP Processor Embodiments
Multichip modules
Multiple dies in a single package
Increased operating speed & reduced power dissipation
Multiple processors on chip
Chip sets
Dividing the processor into two or more packages
Makes sense when the processor is very complex & has a large number of I/O pins
Saves cost
DSP Cores
Slide: 3
Fixed-Point vs. Floating Point
Most DSPs are fixed-point
Fixed-point DSPs support integer and fraction arithmetic
Limited dynamic range and precision
Cheaper too
Mostly use a 16-bit format, though some use 20- or 24-bit formats
Floating-point DSPs use a mantissa-and-exponent representation
They provide good dynamic range and precision
Mostly use a 32-bit format
Easier to program
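The fraction arithmetic mentioned above is commonly done in the Q15 format on 16-bit fixed-point DSPs. A minimal Python sketch of Q15 arithmetic (illustrative only, not any particular device's behavior):

```python
# Q15: a 16-bit signed integer represents a fraction in [-1, 1).
# Illustrative sketch, not a specific DSP's arithmetic.

Q15_ONE = 1 << 15  # scale factor: 32768

def to_q15(x):
    """Quantize a float in [-1, 1) to Q15, saturating at the limits."""
    return max(-Q15_ONE, min(Q15_ONE - 1, round(x * Q15_ONE)))

def q15_mul(a, b):
    """Q15 * Q15: the 32-bit product carries 30 fractional bits,
    so shift right by 15 to renormalize back to Q15."""
    return (a * b) >> 15

def from_q15(a):
    """Convert a Q15 integer back to a float."""
    return a / Q15_ONE

# 0.5 * 0.5 = 0.25, computed entirely in integers
p = q15_mul(to_q15(0.5), to_q15(0.5))
```

This shows why fixed-point parts are cheaper (integer hardware only) and why their dynamic range is limited: every value must fit in [-1, 1) at a fixed precision of 2^-15.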
Slide: 4
Fixed Point Data Path
Slide: 5
Content of Fixed Point Data Path
Typically incorporate a multiplier, an ALU, shifters, operand registers & accumulators.
Single-cycle multipliers are central to programmable DSPs
Often integrated with an adder to form a multiply-accumulate (MAC) unit
Slide: 6
Accumulator
Holds intermediate and final results of MAC operations
Most DSP processors provide multiple accumulators
Guard bits allow a number of values to be accumulated without overflow
Guard bits provide greater flexibility than scaling
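The benefit of guard bits can be shown numerically. Assuming, for illustration, a 40-bit accumulator with 8 guard bits above 32-bit products:

```python
# Sketch: 8 guard bits let a 40-bit accumulator sum up to 2**8 = 256
# worst-case 32-bit products with no pre-scaling of the inputs.

PROD_MAX = (1 << 31) - 1    # largest positive 32-bit signed product
ACC40_MAX = (1 << 39) - 1   # largest value a 40-bit accumulator holds

acc = 0
for _ in range(256):        # 256 worst-case MAC results in a row
    acc += PROD_MAX         # each step would overflow a 32-bit register

fits = acc <= ACC40_MAX     # the guard bits absorb the growth
```

With scaling instead, each input would have to be shifted down in advance, losing precision; the guard bits defer that decision until after the accumulation.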
Slide: 7
ALU
Implements basic arithmetic and logical operations in a single instruction cycle.
Common operations include add, subtract, increment, negate, logical and, or, not.
Processors differ in the word size used for logical operations
Slide: 8
Shifter
Used for scaling the input by a power of 2
Either eliminates or reduces the possibility of overflow to an acceptably low level
The trade-off is loss of precision and dynamic range
Barrel shifters offer more flexibility
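The scaling trade-off can be sketched with illustrative values: shifting buys overflow headroom at the cost of the bits shifted out.

```python
# Sketch: pre-scaling by a power of 2 (a right shift) prevents a
# 16-bit overflow, but the shifted-out low bits are lost.

samples = [30001, 30002, 30003, 30004]   # near full-scale 16-bit values

raw_sum = sum(samples)                   # 120010: overflows 16 bits
scaled = [s >> 2 for s in samples]       # shifter: scale by 2**-2
scaled_sum = sum(scaled)                 # now fits in a 16-bit register
```

Here the two low bits of every sample are discarded, which is exactly the precision loss the slide refers to.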
Slide: 9
Memory Architecture & Addressing Schemes
Slide: 10
Motivation
FIR filter involves the following operations per tap:
Fetch the MAC instruction
Fetch coefficient h(m)
Fetch delayed input x(n-m)
Multiply both
Add with the previous result
Shift data in the delay line
[Figure: FIR tap-delay line. Input x(n) passes through unit delays z^-1; the taps are weighted by coefficients h0, h1, h2, ..., h(N-1), h(N) and summed to give output y(n)]
The above set of operations is done for all the taps of the filter for each sample
y(n) = Σ h(m) x(n−m), summed over m = 0 to N
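The equation above maps directly to a MAC loop. A minimal Python sketch (zero initial conditions assumed for samples before n = 0):

```python
# Direct-form FIR: y(n) = sum over m of h(m) * x(n - m),
# with samples before n = 0 taken as zero.

def fir(h, x):
    y = []
    for n in range(len(x)):
        acc = 0                        # the accumulator
        for m, coeff in enumerate(h):  # one multiply-accumulate per tap
            if n - m >= 0:
                acc += coeff * x[n - m]
        y.append(acc)
    return y
```

Each pass of the inner loop is the primitive the slides discuss: fetch a coefficient, fetch a delayed sample, multiply, accumulate.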
Slide: 11
Motivation
Conventional processors need more than 5 cycles/tap/sample to implement the above FIR filter
DSP architectures try to reduce the cycles needed to compute this primitive
This is accomplished by
Harvard architecture
Efficient addressing modes
Slide: 12
Harvard Architecture
Basic Harvard Architecture
Separate program and data buses
Different from the Von Neumann architecture
[Figure: basic Harvard architecture. Program memory on the P bus, data memory on the D bus]
Modification 1
Data fetches possible from program memory
Opcode and one data fetch done in parallel
[Figure: modification 1. Combined program/data memory and data memory, connected by the P bus and D bus]
Slide: 13
Harvard Architecture
Modification 2
One program memory
One dual-ported data memory
3 buses for the internal memory: 2 for data, 1 for program
2 data words can be fetched in parallel with an instruction
[Figure: modification 2. Program memory on the P bus; multiport data memory on D bus 1 and D bus 2]
Slide: 14
Harvard Architecture
Modification 3
One program memory & a program cache
Two data memories
3 buses for the internal memory: 2 for data & 1 for program
2 data words can be fetched in parallel with an instruction
[Figure: modification 3. Program cache and program memory on the P bus; data memory 1 on D bus 1; data memory 2 on D bus 2]
Slide: 15
Addressing Mode: Circular Addressing
Avoids shifting of data in the delay line
The oldest element is overwritten by the new element
The pointer wraps around once it crosses the start or the end of the circular buffer
Need to maintain 5 parameters for circular buffer operation
Circular buffer - Example
[Figure: an 8-entry circular buffer holding x(n) through x(n-7). The most recent sample at time n becomes the 2nd most recent at time n+1, and the oldest sample, x(n-7), is overwritten by the new sample at instant n+1. A second diagram shows the general case: the buffer runs from Start to End, holds x(n) through x(n-N), and the pointer sits between x(n-m) and x(n-m-1)]
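The behavior in the example above can be sketched in Python (illustrative; real DSPs implement the wraparound in address-generation hardware, not in software):

```python
class CircularBuffer:
    """Delay line with circular addressing: the newest sample
    overwrites the oldest and the pointer wraps, so no data moves."""

    def __init__(self, length):
        self.buf = [0] * length
        self.head = 0                 # index of the most recent sample

    def push(self, sample):
        # advance the pointer, wrapping at the end of the buffer
        self.head = (self.head + 1) % len(self.buf)
        self.buf[self.head] = sample  # overwrite the oldest sample

    def delayed(self, m):
        """Return x(n - m), the sample m steps in the past."""
        return self.buf[(self.head - m) % len(self.buf)]
```

After pushing samples 1 through 5 into a 4-entry buffer, delayed(0) is 5 and delayed(3) is 2; sample 1 has been overwritten, exactly as in the slide's example.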
Slide: 16
Multiple Access Memories
Support multiple, sequential accesses per instruction cycle
Can be combined with Harvard architecture for better performance
Supporting off-chip memory would introduce significant additional delay between the processor core and memory
Slide: 17
Multiported Memories
Have multiple independent sets of address and data connections
Can provide multiple simultaneous accesses
Costly
Supporting off-chip multiported memory means a larger and more expensive package
Slide: 18
Program Cache
Simplest type is the single-instruction repeat buffer
Can be extended to a multi-word repeat buffer
Another type is the single-sector instruction cache
Extended to multiple independent sector caches
Used only for program instructions, not for data
Slide: 19
Wait States
State in which the processor waits to access memory
Conflict wait states
Multiple accesses to memory that cannot handle multiple accesses
Externally requested wait states
Multiple processors sharing a data bus
TMS320C5x has a special READY pin which can be used by external hardware to signal the processor that it must wait before accessing external memory.
Slide: 20
Multiprocessor Support- Memory Interface
Multiple external memory ports
Sometimes multiple processors share one external memory bus
Bus arbitration required
Two pins can be configured to act as bus request and bus grant signals
TMS320C5x allows external access to on-chip memory through BR and IAQ signals
Helpful for multiprocessor communication without shared memory
Slide: 21
Direct Memory Access
Handled by DMA controller
Coupled with Bus Request and Bus Grant pins of the processor
Some sophisticated DMA controllers reside on-chip and access on-chip memory
Multiple-channel DMA controllers handle multiple memory transfers in parallel
Slide: 22
Memory Addressing Schemes
Implied addressing
The operand addresses are implied by the instruction: P = X * Y
Immediate data
The operand itself is encoded in the instruction: AX0 = 1234
Memory direct addressing
The address of the data in memory is encoded in the instruction word: AX0 = DM(1000)
Slide: 23
Memory Addressing Schemes
Register direct addressing
Data being addressed reside in a register
SUBF R1, R2
Register indirect addressing
Data resides in memory and its address resides in a register: A0 = A0 + *R5
[Figure: an address register holding 0x1000 points to the memory location being read]
Slide: 24
Memory Addressing Schemes
Register indirect addressing with pre and post increment
A0 = A0 + *R5++ (Post Increment)
A0 = A0 + *R5++R17 (Post Increment)
Address incremented by the value stored in register R17
MOVE X: -(R0), A1 (Pre Decrement)
Slide: 25
Memory Addressing Schemes
Register indirect addressing with indexing
Values stored in two address registers are added to form an effective address
Does not change the contents of any of the address registers
MOVE Y1, X: (R6 + N6)
LDI *-AR1(1), R7
Slide: 26
Memory Addressing Schemes
Register addressing with bit reversal
Used for FFT
The output or input is in a scrambled (bit-reversed) order
Input index -> bit-reversed index
000 = 0 -> 000 = 0
001 = 1 -> 100 = 4
010 = 2 -> 010 = 2
011 = 3 -> 110 = 6
100 = 4 -> 001 = 1
101 = 5 -> 101 = 5
110 = 6 -> 011 = 3
111 = 7 -> 111 = 7
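The table above is what bit-reversed address generation computes in hardware; a short Python sketch of the same mapping:

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i into r
        i >>= 1
    return r

# 3-bit addresses for an 8-point FFT: the scrambled order on the slide
order = [bit_reverse(i, 3) for i in range(8)]
```

A DSP with bit-reversed addressing applies this permutation for free while stepping an address register, so no separate reordering pass is needed after an in-place radix-2 FFT.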
Slide: 27
Instruction Set
Slide: 28
Instruction Types
Arithmetic & multiplication
Logic operations
Shifting
Rotation
Comparison
Looping
Branching, subroutine calls and returns
Conditional instructions
Special function instructions
Block floating-point instructions, stack operations, etc.
Bit manipulation instructions
Slide: 29
Registers
Accumulators
General & special purpose registers
Address registers
Other registers
Stack pointer Program counter
Loop registers
Slide: 30
Parallel Move Support
Operand related parallel moves
MPY (R0), (R4)
Accesses are limited to those required by arithmetic operation
Operand unrelated parallel moves
MPY X0, Y0, A  X: (R0)+, X0  Y1, Y: (R4)+
Memory accesses unrelated to the operands of the ALU operation
Slide: 31
Orthogonality
Indicates the extent to which a processor's instruction set is consistent
Depends upon:
Consistency & completeness of the instruction set
Degree to which operands and addressing modes are uniformly available with different operations
Slide: 32
Assembly Language Format
Traditional opcode-operand variety:
MPY X0, Y0
ADD P, A
MOV (R0), X0
JMP LOOP
C-like syntax:
P = X0 * Y0
A = P + A
X0 = *R0
GOTO LOOP
Slide: 33
Execution Control
Slide: 34
Looping
Hardware looping
RPT #16 MAC (R0)+, (R4)+, A
Software looping
MOVE #16, B
LOOP: MAC (R0)+, (R4)+, A
      DEC B
      JNE LOOP
Slide: 35
Considerations in Looping
Sometimes a loop-repetition count of 0 causes the processor to repeat the loop the maximum number of times
Consider loop effects on interrupt latency
Slide: 36
Nesting
Directly nestable
Hardware loop instruction placed within the outer loop
Partially nestable
Single instruction loop inside multi instruction loop
Software nestable
Multi instruction hardware loops are nested by saving various registers like loop index, loop start & loop count
Slide: 37
Interrupts
Interrupt sources
On-chip peripherals, external interrupt lines and software interrupts
Interrupt vectors
Associating each interrupt with a different memory address
Typically one or two words long and located in low memory
Usually contains a branch or subroutine call to an interrupt handler routine
Slide: 38
Interrupt latency
Time between the assertion of an external interrupt line and the execution of the first word of the interrupt vector
The following add up to the interrupt latency:
The interrupt line must be asserted prior to the start of the instruction cycle in which the interrupt is said to have occurred (setup time)
The signal must pass through synchronization stages
Wait until the processor reaches an interruptible state
Wait until all instructions in the pipeline are finished
If interrupt vector holds only address of the interrupt routine then the time required to branch to that location
Slide: 39
Stacks
Typically one of three kinds of stack support is provided:
Shadow registers
Hardware stack
Software stack
Slide: 40
Pipelining
Slide: 41
Pipelining and Performance
Technique for increasing the performance of a processor
Breaks a sequence of operations into smaller pieces
Executes the pieces in parallel whenever possible
Hypothetical processor:
Fetch an instruction word from memory
Decode the instruction
Read/write data operands from/to memory
Execute the ALU or MAC operation of the instruction
Slide: 42
Pipelining and Performance
Clock Cycle        1    2    3    4    5    6    7
Instruction Fetch  I1   I2   I3   I4   I5   I6   I7
Decode                  I1   I2   I3   I4   I5   I6
Data Read/Write              I1   I2   I3   I4   I5
Execute                           I1   I2   I3   I4

Perfect Overlap
100% utilization of processor execution stages
Ideal scenario
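The benefit of perfect overlap can be quantified: with a D-stage pipeline and no conflicts, N instructions complete in D + N − 1 cycles rather than D × N. A small Python sketch of this standard result:

```python
def pipelined_cycles(depth, n_instr):
    # the first instruction takes `depth` cycles to traverse the
    # pipeline; after that one instruction completes every cycle
    return depth + n_instr - 1

def unpipelined_cycles(depth, n_instr):
    # without overlap every instruction occupies all stages in turn
    return depth * n_instr
```

For the 4-stage pipeline above, 7 instructions finish in 10 cycles instead of 28, approaching one instruction per cycle as the stream grows.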
Slide: 43
Conflicting Instruction
Clock Cycle        1    2    3    4    5         6    7
Instruction Fetch  I1   I2   I3   I4   I5        I6   I7
Decode                  I1   I2   I3   I4        I5   I6
Data Read/Write              I1   I2   I2 & I3   I4   I5
Execute                           I1   I2        I3   I4
I2 tries to write to memory while I3 tries to read memory
The solution to this problem is interlocking
Interlocking delays the conflicting instruction in the pipeline
Slide: 44
Interlocking
Clock Cycle        1    2    3    4    5           6     7
Instruction Fetch  I1   I2   I3   I4   I4 (held)   I5    I6
Decode                  I1   I2   I3   I3 (held)   I4    I5
Data Read/Write              I1   I2   I2          I3    I4
Execute                           I1   I2          NOP   I3
Interlocking resolves the resource conflict
The pipeline sequencer holds instruction I3 at the decode stage
I4 is held at the fetch stage
A one-instruction-cycle penalty occurs
Slide: 45
Multicycle Branching Effects
Clock Cycle        1    2    3    4    5     6     7     8
Instruction Fetch  BR   I2   --   --   I4    I5    I6    I7
Decode                  BR   --   --   --    I4    I5    I6
Data Read/Write              BR   --   --    --    I4    I5
Execute                           BR   NOP   NOP   NOP   I4
When a branch instruction reaches the decode stage, one instruction has already been fetched and must be flushed from the pipeline
NOPs are executed for the invalidated pipeline slots
A multicycle branch typically executes for as many cycles as the pipeline depth
Slide: 46
Delayed Branching Effects
Clock Cycle        1    2    3    4    5    6    7    8
Instruction Fetch  BR   N2   N3   N4   I4   I5   I6   I7
Decode                  BR   N2   N3   N4   I4   I5   I6
Data Read/Write              BR   N2   N3   N4   I4   I5
Execute                           BR   N2   N3   N4   I4
An alternative to the multicycle branch that does not flush the pipeline
Instructions to be executed before the branch takes effect must be located immediately after the branch instruction in memory
Increased efficiency, but the code is confusing on casual inspection
Slide: 47
Interrupt Effects
Clock Cycle        3    4      5      6      7     8     9     10
Instruction Fetch  I6   --     --     --     V1    V2    V3    V4
Decode             I5   INTR   --     --     --    V1    V2    V3
Data Read/Write    I4   I5     INTR   --     --    --    V1    V2
Execute            I3   I4     I5     INTR   NOP   NOP   NOP   V1

(Interrupt asserted during cycle 3)
The processor inserts the INTR instruction in the pipeline
INTR is a special branch instruction that flushes the pipeline and jumps to the appropriate interrupt vector location
Causes a 4-cycle delay before the first word of the interrupt vector is executed
I6 is flushed but will be refetched on returning from the interrupt
Slide: 48
Fast Interrupt Processing
Clock Cycle        1    2    3    4    5    6    7    8
Instruction Fetch  I3   I4   V1   V2   I5   I6   I7   I8
Decode             I2   I3   I4   V1   V2   I5   I6   I7
Execute            I1   I2   I3   I4   V1   V2   I5   I6

(Interrupt occurs; V1 is fetched in cycle 3 with no pipeline flush)
Interrupt handler stored at the interrupt vector location
In this case V1 & V2 are the two instructions in the interrupt vector
This is called a fast interrupt as it does not insert any delay in the pipeline
Slide: 49
Peripherals
Slide: 50
Serial Ports
A serial interface transmits and receives data one bit at a time
Requires far fewer interface pins than a parallel interface
Used for variety of applications
Sending/receiving data to/from A/D and D/A converters
Sending/receiving data from other processors or DSP
Communicating with other external peripherals
Slide: 51
Serial Ports
Synchronous
Transmits one bit clock signal in addition to the serial data bits
Receiver uses that for sampling the received data
Asynchronous
Does not transmit a separate clock signal
The receiver deduces the clock from the serial data itself
More complex
Slide: 52
Data and Clock
[Figure: serial port timing. Bit clock, frame sync and data waveforms; data bits are shifted on the bit clock and framed by the frame sync]
Most DSPs allow changing the clock polarity, data polarity and shift direction
Frame sync signal indicates the position of the first bit of a data word on the serial data line
Common frame sync formats are bit-length and word-length
Can also have multiple words per frame
Slide: 53
Serial Clock Generation
Provide circuitry for clock generation
Usually called serial clock generation support
Normally done by scaling the master clock of the DSP
Usually contains a prescaler and a down counter
Slide: 54
Time Division Multiplex
[Figure: TDM bus. Four DSPs share common clock, frame sync and data lines]
One processor (or external circuitry) generates the clock and frame sync signals
Frame sync indicates the start of a new set of time slots
The transmitted data word might contain some bits to indicate the destination DSP; the other bits are used for data
Slide: 55
Timers
Programmable timers are often a source of periodic interrupts
May also be used as a software-controlled square wave generator
[Figure: timer block diagram. A clock source feeds a prescaler with a prescale preload value, which feeds a counter with a counter preload value]
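The prescaler-plus-counter chain in the diagram determines the interrupt rate. A sketch, assuming (for illustration only; actual devices differ) that each stage divides its input clock by preload + 1:

```python
def timer_interrupt_hz(clock_hz, prescale_preload, counter_preload):
    # the clock source is divided first by the prescaler, then by the
    # down counter; divide-by-(preload + 1) per stage is an assumption
    # made here for illustration, not a specific device's behavior
    return clock_hz / ((prescale_preload + 1) * (counter_preload + 1))
```

For example, a 40 MHz clock with a prescale preload of 3 and a counter preload of 9999 yields a 1 kHz periodic interrupt.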
Slide: 56
Parallel Ports
Transmit/receive multiple data bits at a time
Faster than serial ports but require more pins
The external data bus may be used as a parallel port
Can also have separate parallel ports
Bit I/O ports
Individual pins can be made input or output on a bit-by-bit basis
Host ports
Specialized 8/16 bit bidirectional parallel ports used for data transfer between DSP and host microprocessor
May be used to control the DSP
Communication ports
Special parallel port intended for multiprocessor communication
Slide: 57