Unit 1dspa
Unit I : Fundamentals of
Programmable DSPs
Mr. Jitesh R. Shinde
Assistant Professor
9/8/2022
Outline
• Need of P-DSPs
• Architectures for Programmable Digital Signal Processing Devices
– Basic Architectural Features : FIR filter eg
– DSP basic Computational Building Blocks
• Multiplier
• Shifter : Barrel Shifter
• MAC : Overflow & Underflow, Saturation logic
• ALU
– Bus Architecture & Memory : Von Neumann and Harvard Architecture, Modified Bus Structures and
Memory access in P-DSPs,
– On-chip Memories : Need, Organization, fast memories (Multiple access memory), Multiported memories
– Data Addressing Capabilities
– Special Addressing Modes : Circular & Bit Reversed
– Address Generation Unit
– Programmability & Program Control : Program Sequencer
• Computational accuracy in DSP processor
• Pipelining & Parallel Processing
• VLIW architecture
• Innovations in Hardware Architecture to increase the speed of operations of DSP
Processors
• Peripherals
Shift-and-Add Multiplication
• Paper & Pencil Method
Shift-and-Add Multiplication
• Perform the multiplication 9 × 12 (1001 × 1100) using the shift & add
multiplication algorithm
Shift-and-Add Multiplication
• Perform the multiplication 11 × 13 (1011 × 1101) using the shift & add
multiplication algorithm
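The worked examples above can be checked with a short sketch of the shift-and-add algorithm (Python here for illustration; the register names A, Q, C, M follow the table below, and the n-bit widths are an assumption of this model):

```python
def shift_add_multiply(m, q, n=4):
    """Shift-and-add multiplication of two n-bit unsigned numbers.

    Registers: M = multiplicand, Q = multiplier, A = accumulator, C = carry.
    Each cycle: if Q0 = 1, add M into A; then shift C:A:Q right by one.
    After n cycles the 2n-bit product sits in A:Q.
    """
    a = c = 0
    for _ in range(n):                       # one cycle per multiplier bit
        if q & 1:                            # Q0 = 1: A = A + M (C catches overflow)
            s = a + m
            c = s >> n
            a = s & ((1 << n) - 1)
        # shift the combined C:A:Q register right by one position
        combined = ((c << (2 * n)) | (a << n) | q) >> 1
        c = 0
        a = (combined >> n) & ((1 << n) - 1)
        q = combined & ((1 << n) - 1)
    return (a << n) | q                      # product = A:Q

print(shift_add_multiply(0b1001, 0b1100))    # 9 x 12 = 108
print(shift_add_multiply(0b1011, 0b1101))    # 11 x 13 = 143
```

Note that when Q0 = 0 no addition is performed, only the shift, which matches the observation in the slide notes.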
Multiplicand M | Carry C | Accumulator A | Multiplier Q (check Q0) | Operation | Step
1011 | 0 | 0000 | 1101 | Initialization | 0
Notes (Slide 14):
• Whenever Q0 = 0, no addition is required; just shift. Hence there is no A = A + M step.
• Check how many bits Q has: the number of cycles equals the number of bits in Q.
Dr. SHINDE RAMDAS JITESH, 8/29/2022
Shift-and-Add Multiplication
Multipliers
The advent of single-chip multipliers paved the way for implementing DSP functions on a VLSI chip. Parallel multipliers have now replaced the traditional shift-and-add multipliers. A parallel multiplier takes a single processor cycle to fetch and execute the instruction and to store the result. They are also called array multipliers.
Speed
The conventional shift-and-add technique requires n cycles to multiply two n-bit numbers, whereas in a parallel multiplier the time required is the longest path delay through the combinational circuit used. As DSP applications generally require very high speed, it is desirable to have multipliers operating at the highest possible speed through a parallel implementation.
Bus Widths
Consider the multiplication of two n-bit numbers X and Y. The product Z can be up to 2n bits long. To perform the whole operation in a single execution cycle, we require two buses of width n bits each to fetch the operands X and Y, and a bus of width 2n bits to store the result Z to memory. Although this performs the operation faster, it is an expensive and therefore inefficient implementation.
Several alternatives have been proposed. One is to use the program bus itself to fetch one of the operands after fetching the instruction, so that only one operand bus is required; the result Z can then be stored back to memory over the same bus. The problem is that Z is 2n bits long while the operand bus is only n bits wide. There are two ways to resolve this:
a. Use the n-bit operand bus and save Z at two successive memory locations. This stores the exact value of Z but takes two cycles.
b. Discard the lower n bits of Z (truncation) and store only the higher-order n bits. This is unsuitable for applications where an accurate result is required.
Shifters
Shifters are used to scale the operands or results up or down. The following scenarios illustrate the need for a shifter:
a. While performing the addition of N numbers, each n bits long, the sum can grow up to n + log2 N bits long. If the accumulator is only n bits long, an overflow error will occur. This can be avoided by using a shifter to scale the operands down by log2 N bits.
b. Similarly, the product of two n-bit numbers can grow up to 2n bits long. In such cases the programmer will want to choose a particular subset of the result bits to pass along to the next stage of processing. A shifter in the data path eases this selection by scaling (multiplying) its input by a power of two.
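Scenario (a) can be sketched as follows (an illustrative model, not any specific processor's shifter; `acc_bits` is an assumed parameter name):

```python
import math

def accumulate_with_scaling(samples, acc_bits):
    """Sum N fixed-point samples in an acc_bits-wide accumulator,
    pre-scaling (right-shifting) each operand by log2(N) bits so the
    sum of N values cannot overflow the accumulator."""
    shift = math.ceil(math.log2(len(samples)))   # log2(N) guard shift
    acc = 0
    for x in samples:
        acc += x >> shift                        # scale down before adding
    assert acc < (1 << acc_bits), "accumulator overflow"
    return acc

# 256 full-scale 16-bit samples summed into a 16-bit accumulator:
print(accumulate_with_scaling([0xFFFF] * 256, 16))   # 65280, no overflow
```

Without the shift, the sum would need 16 + log2(256) = 24 bits; after scaling it fits in 16.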
Barrel Shifters
• Bits shifted out of the input word are discarded, and the new bit positions are filled with zeros in the case of a left shift.
• In the case of a right shift, the new bit positions are filled with copies of the MSB to maintain the sign of the shifted result.
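The two fill rules above can be modelled in a few lines (a behavioural sketch of a 16-bit barrel shifter, not a gate-level description; negative `amount` here denotes a right shift by convention of this model):

```python
def barrel_shift(x, amount, width=16):
    """One-cycle barrel shifter model.
    amount > 0: left shift, zero-fill, MSBs discarded.
    amount < 0: arithmetic right shift, MSB (sign bit) replicated."""
    mask = (1 << width) - 1
    x &= mask
    if amount >= 0:                      # left shift: zero-fill from the right
        return (x << amount) & mask
    if x & (1 << (width - 1)):           # MSB set: treat as negative 2's complement
        x -= 1 << width
    return (x >> -amount) & mask         # Python's >> sign-extends negatives

print(bin(barrel_shift(0b1000000000000000, -1)))  # 0b1100000000000000 (sign copied)
print(barrel_shift(1, 3))                          # 8
```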
Note (Slide 28): https://www.youtube.com/watch?v=59INORkPeqI
– When guard bits are in use, it is necessary to scale the final result in order to convert from the intermediate representation to the final one. For example, in a 16-bit processor with 4 guard bits in the accumulator, it may be necessary to scale the accumulator by 2^-4 before writing the values to memory as 16-bit words.
– The TMS320C5X processor lacks guard bits; instead, it allows the product register to be automatically shifted right by six bits. The TMS320C54XX has guard bits.
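The write-back scaling just described can be sketched as follows (an illustrative model assuming a 16-bit word and 4 guard bits, as in the example above; this is not the actual register logic of any particular device):

```python
GUARD = 4    # guard bits in the accumulator (as in the slide's example)
WORD = 16    # data word width

def writeback(acc):
    """Scale the accumulator by 2**-GUARD so a value that grew into the
    guard bits fits back into a WORD-bit memory location."""
    scaled = acc >> GUARD
    assert 0 <= scaled < (1 << WORD), "value still does not fit in a word"
    return scaled

# 2**GUARD = 16 additions of the largest 16-bit value grow into the guard bits,
acc = sum([32767] * 16)       # 524272 needs 20 bits
print(writeback(acc))         # but 524272 >> 4 = 32767 fits back into 16 bits
```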
Saturation Logic
• Overflow/underflow occurs when the result goes beyond the most positive number, or below the least negative number, that the accumulator can hold.
• The error can be resolved by loading the accumulator with the most positive number it can handle at the time of overflow, and the least negative number it can handle at the time of underflow.
• This method of limiting the accumulator content to its saturation limit is called saturation logic.
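A minimal sketch of saturation on a 16-bit 2's-complement accumulator (illustrative only; real DSPs implement this in the adder hardware):

```python
def saturating_add(a, b, bits=16):
    """Add two 2's-complement numbers, clamping the result to the
    accumulator's range instead of wrapping around on overflow/underflow."""
    hi = (1 << (bits - 1)) - 1      # most positive value, e.g. +32767
    lo = -(1 << (bits - 1))         # least negative value, e.g. -32768
    return max(lo, min(hi, a + b))

print(saturating_add(30000, 10000))    # 32767 (saturated, not wrapped)
print(saturating_add(-30000, -10000))  # -32768
print(saturating_add(100, 200))        # 300 (unaffected in-range result)
```

Compare this with plain wrap-around arithmetic, where 30000 + 10000 in 16 bits would wrap to a large negative number, a far worse artifact in audio or control loops than clipping.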
• To increase the speed of operation, separate memories were used to store program and data, each with its own set of data and address buses, an arrangement called the Harvard architecture.
• For the FIR filter algorithm, this architecture still requires at least 2 instruction cycles to execute a multiply-accumulate instruction.
• Although using separate memories for data and instructions speeds up the processing, it does not completely solve the problem. As many DSP instructions require more than one operand, the use of a single data memory forces the operands to be fetched one after the other, increasing the processing delay.
8-Sep-22 Dr.Jitesh Shinde
• This problem can be overcome by using two separate data memories to store the operands, so that both operands can be fetched together in a single clock cycle (the modified / super Harvard architecture; see figure).
• Multiple memory accesses per instruction cycle are achieved by using multiple, independent memory banks connected to the processor data path via independent buses.
• Although this architecture improves the speed of operation, it requires more hardware and interconnections, increasing the cost and complexity of the system.
• There is therefore a trade-off between cost and speed when selecting the memory architecture for a DSP.
On-Chip Memories
• Need for on-chip memories
– For faster execution of DSP functions, it is desirable to have some memory located on-chip.
• Because dedicated buses are used to access them, on-chip memories are faster.
• Speed and size are the two key parameters of on-chip memories.
– Speed: on-chip memories should match the speed of the ALU operations in order to maintain the single-cycle instruction execution of the DSP.
– Size: in a given area of the DSP chip, it is desirable to implement as many DSP functions as possible. The area occupied by on-chip memory should therefore be as small as possible, leaving scope to implement a greater number of DSP functions on-chip.
• Explain why P-DSPs have multiple address & data buses for internal memory access, but only a single address bus & data bus for external data memory & peripherals.
– Ideally, the whole memory required to implement a DSP algorithm should reside on-chip so that the whole processing can be completed in a single execution cycle.
– The access times of on-chip memories should be small enough that they can be accessed more than once in every execution cycle.
– Hence, P-DSPs have multiple address & data buses for internal memory access.
• Explain why P-DSPs have multiple address & data buses for internal memory access, but only a single address bus & data bus for external data memory & peripherals. (continued)
– In P-DSPs, fast memories or multiported memories with dedicated buses are used on-chip to improve the speed of instruction execution.
– The drawback of such memories is that they are costly (in terms of chip area).
– Since the cost of an IC increases with its pin count, extending multiple buses outside the chip would unduly increase the price.
– Any operation involving off-chip memory is slow compared with one using on-chip memory.
• To minimize this delay, for DSP algorithms that execute instructions repeatedly, the instructions can be stored in external memory and, once fetched, can reside in the instruction cache.
– Hence, P-DSPs have only a single address bus & data bus for external (off-chip) data memory & peripherals.
ROM
– DSP processors intended for low-cost embedded applications, such as consumer electronics and telecommunications equipment, provide on-chip read-only memory (ROM) to store the application program and constant data.
• On-chip ROM
– The main purpose of internal ROM is to permanently store the program code and data for a specific application during manufacture of the chip itself.
– It is used to store programs, data values, the boot-loader program, the µ-law and A-law expansion tables, the interrupt vector table, and sine look-up tables.
– The contents of the on-chip ROM can be protected so that no external device can access the program code.
Computational Accuracy
• Number Formats
where s represents the sign of the number ( s = 0 for positive and s = 1 for negative)
• Answer : floating-point!
point format.
– The given decimal number is converted into a binary number, and the binary point is then moved, i.e., the value of the exponent is adjusted, so that the MSB of the mantissa is one, with the mantissa adjusted accordingly. This form of a floating-point number is called the normalized form.
– Why is the representation called floating point?
Note (Slide 89): In floating-point representation the binary point can be shifted to a desired position, so the number of digits in the integer and fraction parts of a number can be varied. This allows a larger range of numbers to be represented.
Example:
• Note: both mantissa and exponent use one bit for the sign.
• Find the decimal equivalent of the floating-point binary number, with bias = 2^(n-1) − 1 = 2^3 − 1 = 7; 4 bits for E and 8 bits for M (F):
0 1100 00011000
sign | biased exponent | significand
= + 2^5 × 1.00011
= + 100011₂
= + (1 × 2^5 + 0 × 2^4 + 0 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0)
= + (32 + 0 + 0 + 0 + 2 + 1)
= + 35₁₀
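The decoding steps above can be sketched as a short routine (assuming the slide's format: 1 sign bit, a 4-bit exponent with bias 7, and an 8-bit fraction with a hidden leading 1; this is an illustrative toy format, not IEEE 754):

```python
def decode_float(bits):
    """Decode a 13-bit word laid out as s|eeee|ffffffff:
    sign bit, bias-7 exponent, 8-bit fraction with hidden leading 1."""
    sign = -1 if (bits >> 12) & 1 else 1
    e = ((bits >> 8) & 0xF) - 7          # remove the bias
    frac = bits & 0xFF                   # fraction field
    mantissa = 1 + frac / 256            # 1.ffffffff (normalized form)
    return sign * mantissa * 2 ** e

print(decode_float(0b0_1100_00011000))   # 1.00011b x 2^5 = 35.0
```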
Dynamic range
• The ratio of the maximum value to the minimum non-zero value that the signal can take in a given number-representation scheme.
• Dynamic range is proportional to the number of bits n used for the representation, and increases by about 6 dB for every extra bit used.
• In floating-point formats, the exponent determines the dynamic range.
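The 6 dB-per-bit rule follows directly from 20·log10(2^n), which a one-line helper makes concrete (illustrative; `n_bits` is an assumed parameter name):

```python
import math

def dynamic_range_db(n_bits):
    """Dynamic range of an n-bit fixed-point format in dB:
    20 * log10(2**n) = n * 20 * log10(2), i.e. about 6.02 dB per bit."""
    return 20 * math.log10(2 ** n_bits)

print(round(dynamic_range_db(16), 2))                          # 96.33
print(round(dynamic_range_db(17) - dynamic_range_db(16), 2))   # 6.02 per extra bit
```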
• −2^(24−1) ≤ x ≤ 2^(24−1) − 1, i.e., −2^23 ≤ x ≤ 2^23 − 1
Resolution
• General definition: the smallest non-zero value that can be represented in a given number-representation format.
• Q: What is the resolution if k bits (signed fractional fixed point) are used to represent a number between 0 and 1?
• Resolution = 1/(2^k − 1)
• Resolution ≈ 1/2^k when k is very large.
Precision
• Computed as percentage resolution:
Precision = Resolution × 100% = 1/(2^k − 1) × 100%
• Relates to the accuracy of computations.
• Usually, the greater the precision, the slower the speed, or the more complex the supporting hardware, such as the bus architecture.
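The two definitions above reduce to two one-liners (illustrative helpers; the function names are my own):

```python
def resolution(k):
    """Smallest non-zero value with k-bit fractional fixed point: 1/(2**k - 1)."""
    return 1 / (2 ** k - 1)

def precision_percent(k):
    """Precision expressed as percentage resolution."""
    return resolution(k) * 100

print(resolution(8))              # 1/255, close to 2**-8 already
print(round(precision_percent(8), 3))
```

For k = 16, resolution(16) and 2**-16 differ only in the tenth decimal place, which is why the approximation 1/2^k is used for large k.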
Pipelining
Sequential Laundry
(Figure: four loads A–D on a time axis from 6 PM to midnight; each load takes 30 + 40 + 20 = 90 min of washing, drying, and folding.)
• The operator scheduled his loads to be delivered to the laundry every 90 minutes, the time required to finish one load. In other words, he does not start a new task until he is done with the previous one.
• The process is sequential: sequential laundry takes 6 hours for 4 loads.
Pipelining
• Pipelining is one of the architectural features of a P-DSP device that should be evaluated before implementing an algorithm.
• It is the process of increasing the performance of a DSP processor by breaking a long sequence of operations into smaller pieces and executing those pieces in parallel where possible, thereby decreasing the overall time to complete the set of operations.
• It decomposes a sequential process into segments (overlapping allowed).
• The processor is divided into segment processors, each dedicated to a particular segment.
• Each segment is executed in its dedicated segment processor, which operates concurrently with all the other segments.
• Information flows through these multiple hardware segments.
SPEEDUP
• Consider a k-segment pipeline operating on n data sets (in the example above, k = 3 and n = 4).
• It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.
Example
• A non-pipelined system takes 100 ns to process a task.
• The same task can be processed in a FIVE-segment pipeline at 20 ns per segment.
• For n = 1000 tasks, the pipeline therefore takes (k + n − 1) clock cycles = 5 + (1000 − 1) = 1004 clock cycles to complete all the tasks.
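The cycle count and the resulting speedup can be computed with two small helpers (illustrative; `t_seq` and `t_clk` are assumed parameter names for the non-pipelined task time and the pipeline clock period):

```python
def pipeline_cycles(k, n):
    """Clock cycles for n tasks on a k-segment pipeline: k + n - 1
    (k cycles to fill the pipe, then one result per cycle)."""
    return k + n - 1

def speedup(k, n, t_seq, t_clk):
    """Speedup over the non-pipelined unit: n*t_seq / ((k + n - 1)*t_clk)."""
    return n * t_seq / (pipeline_cycles(k, n) * t_clk)

print(pipeline_cycles(5, 1000))              # 1004 cycles
print(round(speedup(5, 1000, 100, 20), 2))   # 4.98, approaching t_seq/t_clk = 5
```

As n grows, the speedup approaches the ideal ratio t_seq/t_clk = k, since the k − 1 fill cycles become negligible.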
(Figure: sequential processing vs. instruction pipeline.)
Difficulties
• The slowest unit decides the throughput.
• Pipeline latency & pipeline depth:
– Extra time is required at the start of algorithm execution, as the pipeline must fill before the result of the first instruction can start to flow out. This initial delay is the pipeline latency, which is related to the number of stages in the pipeline (the pipeline depth).
• Branching effect:
– If a complicated memory access occurs in stage 1, stage 2 is delayed and the rest of the pipe is stuck.
– If there is a branch (an if.. or a jump), some of the instructions that have already entered the pipeline should not be processed.
– These difficulties must be dealt with to keep the pipeline moving.
Pipeline Hurdles
• Pipeline hazards (hurdles) are situations that prevent the next instruction in the instruction stream from executing during its designated cycle.
• Structural hazard
– Two different instructions use the same hardware in the same cycle.
• Data hazard
– Two different instructions use the same storage.
– An instruction depends on the result of a previous instruction.
Pipelining
• Instruction execution is divided into k segments or stages.
– An instruction exits pipe stage k−1 and proceeds into pipe stage k.
– All pipe stages take the same amount of time, called one processor cycle.
– The length of the processor cycle is determined by the slowest pipe stage.
5-Stage Pipelining
Stages: S1 Fetch Instruction (FI), S2 Decode Instruction (DI), S3 Fetch Operand (FO), S4 Execute Instruction (EI), S5 Write Operand (WO)

Clock cycle: 1  2  3  4  5  6  7  8  9
S1 (FI):     1  2  3  4  5  6  7  8  9
S2 (DI):        1  2  3  4  5  6  7  8
S3 (FO):           1  2  3  4  5  6  7
S4 (EI):              1  2  3  4  5  6
S5 (WO):                 1  2  3  4  5
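The stage-occupancy table above can be generated programmatically (an illustrative model of an ideal, stall-free pipeline; instruction and cycle numbering as on the slide, `None` meaning an empty stage):

```python
def pipeline_schedule(n_instr, k=5):
    """For each clock cycle t, list which instruction occupies each of the
    k stages (stage s holds instruction t - s, if one is in flight)."""
    return {t: [t - s if 1 <= t - s <= n_instr else None for s in range(k)]
            for t in range(1, n_instr + k)}

sched = pipeline_schedule(9)
print(sched[1])    # [1, None, None, None, None]  (only S1 busy)
print(sched[5])    # [5, 4, 3, 2, 1]              (pipeline full)
```

Note the total number of cycles is n + k − 1, matching the speedup formula in the earlier example.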
Parallel Architecture
• The key to higher performance is the ability to exploit parallelism. It is one of the architectural features of a P-DSP device that should be evaluated before implementing a DSP algorithm.
• It increases the speed of operation of the DSP processor.
• It requires complex hardware for the control units, and the controller should be hardwired rather than microprogrammed in order to ensure high speed.
• The architecture should be such that the instructions and data required for a computation are fetched from memory simultaneously.
Parallel Architecture
• Some methods for exploiting parallelism include:
– Multiple processors: instead of the same arithmetic unit doing the computation on both data and addresses, a separate address arithmetic unit is provided to take care of address computation. This frees the main arithmetic unit to concentrate on data computation alone, thereby increasing throughput. Using multiple processors improves performance for only a restricted set of applications.
Parallel Processing
• Pipelining is now universally implemented in high-performance processors; little more can be gained by improving the implementation of a single pipeline.
• Superscalar implementations can improve performance for all types of applications. Superscalar (super: beyond; scalar: one-dimensional) means the ability to fetch, issue to execution units, and complete more than one instruction at a time. Superscalar implementations are required when architectural compatibility must be preserved.
(Figure: pipelining vs. parallel processing of sub-tasks a–d of jobs 1–4 on units P1–P4. Left: each unit performs the same sub-task on successive jobs, e.g. P1: a1 a2 a3 a4. Right: each unit performs all sub-tasks of one job, e.g. P1: a1 b1 c1 d1.)
Data Dependence
(Figure: timing diagrams for units P2–P4 with and without data dependences between sub-tasks.)
VLIW
• VLIW hardware is simple and straightforward.
• A VLIW instruction separately directs each functional unit.
• The number of operations in a VLIW instruction is equal to the number of execution units (FUs) in the processor.
• Each operation specifies the instruction that will be executed on the corresponding execution unit in the cycle in which the VLIW instruction is issued.
VLIW processor
• Very long instruction word means that the program is recompiled so that the operations packed into an instruction run without stalls in the pipeline.
• Thus, programs must be recompiled for the VLIW architecture.
• There is no need for the hardware to examine the instruction stream to determine which instructions may be executed in parallel.
VLIW processor
• Takes a different approach to instruction-level parallelism:
– relying on the compiler to determine which instructions may be executed in parallel, and providing that information to the hardware (FUs);
– each instruction specifies several independent operations (hence "very long words") that are executed in parallel by the hardware.
VLIW Architecture
SIMD Architecture
Peripherals
• Why study peripherals?
– They allow a DSP to be used in an embedded system with the minimum amount of external hardware needed to support its operation and to interface it to the outside world.
– The power of the peripheral interfaces provided by different processors can have a significant impact on their suitability for a particular application.
– On-chip peripherals should be carefully evaluated along with other processor features such as arithmetic performance, memory bandwidth, and so on.
Serial Ports
• A serial interface transmits & receives data one bit at a time. Uses include:
– sending & receiving data to & from A/D & D/A converters & codecs;
– sending & receiving data to & from other microprocessors or DSPs;
– communicating with other external peripherals or hardware.
Serial Port
• Types of serial interface:
– Asynchronous serial port
– Synchronous serial port
– TDM serial port
– Buffered serial port
• One processor is responsible for generating the bit clock & frame-sync signals.
• The frame sync is used to indicate the start of a new set of time slots.
• After the frame sync, each processor must keep track of the current slot number & transmit only during its own slot.
• A transmitted data word (16 bit) might contain some number of bits to indicate the destination DSP (e.g., two bits for 4 processors), with the remainder containing data.
• TDM support requires that a processor be able to put its serial-port transmit data pin into a high-impedance state when the processor is not transmitting. This allows other DSPs to transmit data during their time slots without interference.
Parallel Ports
• A parallel port transmits & receives multiple data bits (typically 8 or 16) at a time.
• It transfers data much faster than a serial port but requires more pins to do so.
• In addition to the data lines, it sometimes includes handshake or strobe lines to indicate to an external device that data has been written to the port by the DSP, or vice versa.
• Approaches for assigning lines to parallel ports:
– The data bus itself is used for the parallel port by allocating specific addresses for I/O.
– Separate lines, including handshake signals, are dedicated to the parallel port.
• Types of parallel ports:
– Traditional
– Bit I/O
– HPI
– Comm
External Interrupts
• These are pins that an external device can assert to interrupt the P-DSP.
• External interrupt lines can be edge-triggered or level-triggered.
RISC
• Example of a RISC processor: TMS320C6X.
• With a reduced number of instructions, the chip area is reduced considerably: about 20% of the chip area is used for the control unit.
• As a result of this considerable reduction in control area, CPU registers and data paths can be replicated, and register throughput can be increased by applying pipelining & parallel processing.
• In RISC, all instructions are of the same length & take the same time to execute. This increases computational speed.
• A simpler & smaller control unit has fewer gates. This reduces propagation delay & increases speed.
RISC
• A reduced number of instruction formats & addressing modes results in a simpler & smaller decoder, which in turn increases speed.
• Delayed branch & call instructions are used effectively.
• HLL support:
– Due to the smaller number of instructions, the compiler for an HLL is shorter and simpler.
– The availability of many general-purpose registers permits more efficient code optimization by maximizing the use of GP registers over slower memories.
– Support for writing programs in C or C++ relieves the programmer from learning the instruction set of the DSP, allowing concentration on the application instead.
• Since RISC has a smaller number of instructions, implementing a single CISC instruction may require a number of RISC instructions. This increases the memory required for storing the program and the traffic between CPU & memory, which increases program execution time and makes debugging more difficult.
CISC
• Has a rich instruction set that supports HLL constructs such as "if condition true then do", "for", & "while".
• Has instructions specifically required for DSP applications, such as MACD, FIRS, etc.
• 30–40% of the chip area is used for the control unit.
UNIT 2: ARCHITECTURE OF TMS320C5X (08)
Architecture,
Bus structure & memory,
CPU,
Addressing modes,
AL syntax.