
Unit I : Fundamentals of
Programmable DSPs
Mr. Jitesh R. Shinde
Assistant Professor

8-Sep-22 Dr.Jitesh Shinde

Outline
• Need of P-DSPs
• Architectures for Programmable Digital Signal Processing Devices
– Basic Architectural Features: FIR filter example
– DSP Basic Computational Building Blocks
• Multiplier
• Shifter: Barrel Shifter
• MAC: Overflow & Underflow, Saturation Logic
• ALU
– Bus Architecture & Memory: Von Neumann and Harvard Architectures, Modified Bus Structures and Memory Access in P-DSPs
– On-chip Memories: Need, Organization, Fast Memories (Multiple-Access Memory), Multiported Memories
– Data Addressing Capabilities
– Special Addressing Modes: Circular & Bit-Reversed
– Address Generation Unit
– Programmability & Program Control: Program Sequencer
• Computational Accuracy in DSP Processors
• Pipelining & Parallel Processing
• VLIW Architecture
• Innovations in Hardware Architecture to Increase the Speed of Operations of DSP Processors
• Peripherals


Why are DSP processors with architectures suited to DSP operations developed?
– A few things are common to all DSP algorithms (filtering, FFT, DFT, etc.):
• Processing of an array is involved.
• The majority of operations are multiply-and-accumulate (MAC) operations.
• Linear and circular shifting of arrays is involved.

– These operations take a long time when implemented on a general-purpose processor (GPP), because GPP hardware is not optimized to perform them quickly. Hence GPPs are not well suited to DSP operations.

– Real-time operations, in particular, are difficult on a GPP.

Hence, DSP processors with architectures suited to DSP operations were developed.


Architectures for Programmable Digital Signal Processing Devices


Basic Architectural Features

A programmable DSP device should provide instructions similar to those of a conventional microprocessor.
The instruction set of a typical DSP device should include the following:
a. Arithmetic operations such as ADD, SUBTRACT, MULTIPLY, etc.
b. Logical operations such as AND, OR, NOT, XOR, etc.
c. Multiply and Accumulate (MAC) operation
d. Signal scaling operation
In addition to the above provisions, the architecture should also include:
a. On-chip registers to store intermediate results
b. On-chip memories to store signal samples (RAM)
c. On-chip memories to store filter coefficients (ROM)
d. Multiple-operand fetch capacity, since DSP operations require multiple operands simultaneously
e. Circular buffers and circular shift operations
f. A powerful interrupt structure and timers, since real-time applications require interrupts and timers


Investigate the basic features that should be provided in the DSP architecture to implement the following Nth-order FIR filter:

y(n) = Σ h(i) x(n − i), i = 0, 1, …, N

In order to implement the above operation in a DSP, the architecture requires the following features:
i. A RAM to store the signal samples x(n), x(n−1), … etc.
ii. A ROM to store the filter coefficients h(0), h(1), …
iii. A MAC unit (hardware) to perform the multiply-and-accumulate operation
iv. An accumulator (register) to store the intermediate results
v. A signal pointer (register) to point to the signal sample in memory
vi. A coefficient pointer (register) to point to the filter coefficient in memory
vii. A counter to keep track of the count of the MAC operations
viii. A shifter to shift the input samples appropriately to scale the signal value x(n) as it is read from memory, and also the computed signal y(n) as it is stored in memory
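The datapath just described can be modeled in software. The sketch below is illustrative only (plain Python, not any device's instruction set); the inner loop mirrors one MAC cycle, with the loop indices playing the roles of the coefficient and signal pointers and the loop bound playing the role of the MAC counter.

```python
def fir_filter(x, h):
    """Direct-form FIR: y(n) = sum over i of h(i) * x(n - i).

    x : signal samples (the 'RAM'), h : coefficients (the 'ROM').
    The loop body mirrors one MAC cycle: fetch two operands,
    multiply, and accumulate into the accumulator register.
    """
    y = []
    for n in range(len(x)):
        acc = 0                          # accumulator register
        for i in range(len(h)):          # coefficient pointer: i
            if n - i >= 0:               # signal pointer: n - i
                acc += h[i] * x[n - i]   # multiply and accumulate
        y.append(acc)
    return y

print(fir_filter([1, 2, 3, 4], [1, 1]))  # [1, 3, 5, 7]
```

Each output sample costs one pass of the MAC loop; this is exactly the work that the pointer, counter, and accumulator hardware above is meant to accelerate.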

DSP Computational Building Blocks

Each computational block of a DSP should be optimized for functionality and speed, while at the same time the design should be sufficiently general that it can be easily integrated with other blocks to implement overall DSP systems.
The basic computational blocks are:
i. Multiplier
ii. Shifter
iii. Multiply and Accumulate (MAC) unit
iv. Arithmetic Logic Unit (ALU)


Shift-and-Add Multiplication
• Paper & Pencil Method


Shift-and-Add Multiplication
• Perform the multiplication 9 × 12 (1001 × 1100) using the shift & add
multiplication algorithm

Multiplicand M | Carry C | Accumulator A | Multiplier Q | Operation (check Q0)    | Step
1001           | 0       | 0000          | 1100         | Initialization          | —
1001           | 0       | 0000          | 0110         | Q0 = 0: shift CAQ right | 1
1001           | 0       | 0000          | 0011         | Q0 = 0: shift CAQ right | 2
1001           | 0       | 1001          | 0011         | Q0 = 1: A = A + M       | 3
               | 0       | 0100          | 1001         | shift CAQ right         |
1001           | 0       | 1101          | 1001         | Q0 = 1: A = A + M       | 4
               | 0       | 0110          | 1100         | shift CAQ right         |

Product result in AQ = (0110 1100)base 2 = (108)base 10

Shift-and-Add Multiplication
• Perform the multiplication 11 × 13 (1011 × 1101) using the shift-and-add multiplication algorithm

Multiplicand M | Carry C | Accumulator A | Multiplier Q | Operation (check Q0)           | Step
1011           | 0       | 0000          | 1101         | Initialization                 | —
1011           | 0       | 1011          | 1101         | Q0 = 1: A = A + M              | 1
               | 0       | 0101          | 1110         | shift CAQ right                |
1011           | 0       | 0010          | 1111         | Q0 = 0: shift CAQ right        | 2
1011           | 0       | 1101          | 1111         | Q0 = 1: A = A + M              | 3
               | 0       | 0110          | 1111         | shift CAQ right                |
1011           | 1       | 0001          | 1111         | Q0 = 1: A = A + M (carry set)  | 4
               | 0       | 1000          | 1111         | shift CAQ right (carry into A) |

Product result in AQ = (1000 1111)base 2 = (143)base 10
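The two worked examples can be checked with a short software model of the algorithm. This is an illustrative Python sketch of the unsigned shift-and-add datapath (the C, A, Q registers), not production code:

```python
def shift_add_multiply(m: int, q: int, n: int = 4) -> int:
    """Unsigned shift-and-add multiplication of two n-bit numbers.

    Registers: C (1-bit carry), A (n-bit accumulator), Q (n-bit multiplier).
    Each cycle: if Q0 == 1, A = A + M (carry into C); then shift C, A, Q
    right by one bit as a single combined register.
    """
    mask = (1 << n) - 1
    a, c = 0, 0
    for _ in range(n):               # one cycle per multiplier bit
        if q & 1:                    # Q0 = 1 -> add the multiplicand
            total = a + m
            c = total >> n           # carry out of the n-bit adder
            a = total & mask
        # shift C, A, Q right: A's LSB enters Q's MSB, C enters A's MSB
        q = ((a & 1) << (n - 1)) | (q >> 1)
        a = (c << (n - 1)) | (a >> 1)
        c = 0
    return (a << n) | q              # 2n-bit product lives in A:Q

print(shift_add_multiply(0b1001, 0b1100))  # 9 x 12 = 108
print(shift_add_multiply(0b1011, 0b1101))  # 11 x 13 = 143
```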
Note: Whenever Q0 = 0, no addition is required (the A = A + M step is skipped); only the shift is performed. The number of cycles equals the number of bits in Q.


Multipliers
The advent of single-chip multipliers paved the way for implementing DSP functions on a VLSI chip. Parallel multipliers have nowadays replaced the traditional shift-and-add multipliers. A parallel multiplier takes a single processor cycle to fetch and execute the instruction and to store the result. They are also called array multipliers.

The key features to be considered for a multiplier are:
a. Accuracy
b. Dynamic range
c. Speed
The number of bits used to represent the operands decides the accuracy and the dynamic range of the multiplier (the finite word length effect), whereas the speed is decided by the architecture employed.
If the multipliers are implemented using dedicated hardware, the speed of execution will be very high, but the circuit complexity will also increase considerably.
Thus, there is a tradeoff between the speed of execution and the circuit complexity; hence the choice of architecture normally depends on the application.

Speed
The conventional shift-and-add technique requires n cycles to multiply two n-bit numbers, whereas in parallel multipliers the time required is the longest path delay of the combinational circuit used.
As DSP applications generally require very high speed, it is desirable to have multipliers operating at the highest possible speed through parallel implementation.


Bus Widths
Consider the multiplication of two n-bit numbers X and Y. The product Z can be as long as 2n bits.
In order to perform the whole operation in a single execution cycle, we require two buses of width n bits each to fetch the operands X and Y, and a bus of width 2n bits to store the result Z to memory.
Although this performs the operation faster, it is not an efficient implementation, as it is expensive.
Many alternatives to the above method have been proposed:
a. Use the n-bit operand bus and save Z at two successive memory locations. Although this stores the exact value of Z in memory, it takes two cycles to store the result.
b. Discard the lower n bits of the result Z (truncate) and store only the higher-order n bits in memory. This is not applicable for applications where an accurate result is required.


Another alternative can be used for applications where speed is not a major concern: latches are used for the inputs and outputs, so a single bus suffices to fetch the operands and to store the result.


One further method is to use the program bus itself to fetch one of the operands after the instruction has been fetched, thus requiring only one bus for the operands; the result Z can be stored back to memory using the same operand bus. The problem with this is that the result Z is 2n bits long whereas the operand bus is only n bits wide. The same two alternatives solve this problem:

a. Use the n-bit operand bus and save Z at two successive memory locations. Although this stores the exact value of Z in memory, it takes two cycles to store the result.
b. Discard the lower n bits of the result Z (truncate) and store only the higher-order n bits in memory. This is not applicable for applications where an accurate result is required.


Shifters
Shifters are used to scale the operands or results up or down. The following scenarios show the necessity of a shifter:

a. While performing the addition of N numbers, each n bits long, the sum can grow up to n + log2(N) bits long. If the accumulator is only n bits long, an overflow error can occur. This can be overcome by using a shifter to scale down the operands by log2(N) bits.

b. Similarly, the product of two n-bit numbers can grow up to 2n bits long. In such cases the programmer will want to choose a particular subset of the result bits to pass on to the next stage of processing. A shifter in the data path eases this selection by scaling (multiplying) its input by a power of two.

c. Finally, for the addition of two floating-point numbers, one of the operands has to be shifted appropriately to make the exponents of the two numbers equal.

From the above cases, a shifter is required in the architecture of a DSP. The shifter is usually found immediately following the multiplier and the ALU.

• It is required to find the sum of 64 16-bit numbers. How many bits should the accumulator have so that the sum can be computed without overflow error or loss of accuracy?
• What will happen if the accumulator is 16 bits wide?
– N = 64, i.e., the sum operation is to be performed 64 times on n = 16-bit numbers.
– The final sum can be n + log2(N) = 16 + 6 = 22 bits long.
– The accumulator should therefore be 22 bits wide to avoid overflow.
– If the accumulator is only 16 bits wide, overflow will occur. To prevent overflow, each input should be shifted right (scaled) by 6 bits before addition; the output can then be scaled back up by 2^6 after addition (at the cost of accuracy).
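The bit-growth rule used above, n + log2(N), is easy to capture in a helper (an illustrative Python sketch):

```python
import math

def accumulator_bits(n_bits: int, count: int) -> int:
    """Worst-case accumulator width needed to sum `count` n-bit numbers.

    Summing `count` values can grow the result by ceil(log2(count)) bits
    beyond the input word length.
    """
    return n_bits + math.ceil(math.log2(count))

print(accumulator_bits(16, 64))   # 22 -> a 16-bit accumulator needs 6 bits of scaling
print(accumulator_bits(16, 256))  # 24
```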


Barrel Shifters (Comb Ckt)

• In conventional microprocessors, normal shift registers are used for shift operations. As they require one clock cycle per shift, they are not desirable for DSP applications, which generally involve many shifts. In other words, since speed is the crucial issue for DSP applications, several shifts must be accomplished in a single execution cycle.
• This can be accomplished using a barrel shifter:
– a combinational circuit
– shifts or rotates the n-bit data in one cycle


Barrel Shifters (Comb Ckt)

• A barrel shifter connects the input lines representing a word to a group of output lines, with the required shift determined by its control inputs.
– For an input of length n, log2(n) control lines are required.
– An additional control line is required to indicate the direction of the shift.


Barrel Shifters
• Bits shifted out of the input word are discarded & new
bit positions are filled with zeros in case of left shift.
• In case of right shift, the new bit positions are
replicated with the MSB to maintain the sign of the
shifted result.


The figure depicts the implementation of a 4-bit shift-right barrel shifter. Shifts to the right by 0, 1, 2, or 3 bit positions are selected by setting the control inputs appropriately.

A barrel shifter is to be designed with 16 inputs for left shifts from 0 to 15 bits. How many control lines are required to implement the shifter?
=> As the number of bits used to represent the input is n = 16, log2(n) = log2(16) = 4 control inputs are required.
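A behavioral model of such a shifter is sketched below (illustrative Python, not a gate-level design). It mirrors the log2(n) mux stages, one control bit per stage, with MSB (sign) replication on right shifts as described above:

```python
def barrel_shift_right(x: int, shift: int, n: int = 4, arithmetic: bool = True) -> int:
    """Model of an n-bit barrel shifter built from log2(n) mux stages.

    Stage k (k = 0 .. log2(n)-1) shifts by 2**k bits when control bit k
    of `shift` is set, so any shift of 0..n-1 completes in one pass
    (one cycle in hardware). Right shifts replicate the MSB when
    `arithmetic` is True; otherwise vacated positions are zero-filled.
    """
    msb = (x >> (n - 1)) & 1
    stages = n.bit_length() - 1            # log2(n) control lines
    for k in range(stages):
        if (shift >> k) & 1:               # control bit k enables this stage
            step = 1 << k
            fill = ((1 << step) - 1) << (n - step) if (arithmetic and msb) else 0
            x = (x >> step) | fill         # shift and fill vacated MSBs
    return x & ((1 << n) - 1)

print(bin(barrel_shift_right(0b1000, 2)))  # 0b1110 (sign bit replicated)
```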

Note: The number of 2:1 multiplexers required for an n-bit barrel shifter is n·log2(n); for n = 4, this is 4 × log2(4) = 4 × 2 = 8.
(Reference: https://www.youtube.com/watch?v=59INORkPeqI)

Multiply and Accumulate (MAC)Unit


• Most of the DSP applications require the
computation of the sum of the
products of a series of successive
multiplications.
• In order to implement such functions a
special unit called a Multiply and
Accumulate (MAC) unit is required.
• A MAC consists of a multiplier and a special register called the accumulator. MACs are used to implement functions of the type A + B·C.
• A typical MAC unit is as shown in the figure. It is based on the pipelining concept.

Multiply and Accumulate (MAC)Unit


• Although addition and multiplication are two different operations, they
can be performed in parallel.
• By the time the multiplier is computing the product, accumulator can
accumulate the product of the previous multiplications.
• Thus, if N products are to be accumulated, N-1 multiplications can
overlap with N-1 additions.
• During the very first multiplication, accumulator will be idle and during
the last accumulation, multiplier will be idle. Thus N+1 clock cycles are
required to compute the sum of N products.
• The presence of a hardware multiplier and/or MAC is one of the mandatory requirements of a P-DSP.
– As DSP applications generally require very high speed, it is desirable to have multipliers operating at the highest possible speed.
– Speed is decided by the architecture employed; if the multipliers are implemented using dedicated hardware, the speed of execution will be very high.


Multiply and Accumulate (MAC)Unit

• If a sum of 256 products is to be computed using a pipelined MAC unit, and if the MAC execution time of the unit is 100 ns, what will be the total time required to complete the operation?
– As N = 256 in this case, the MAC unit requires N + 1 = 257 execution cycles. As a single MAC execution takes 100 ns, the total time required will be 257 × 100 ns = 25.7 μs.
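The N + 1 cycle count can be wrapped in a one-line helper (illustrative sketch):

```python
def pipelined_mac_time(n_products: int, cycle_ns: float) -> float:
    """Total time (ns) to accumulate N products on a pipelined MAC unit.

    Multiplies overlap with accumulates, but the very first multiply and
    the very last accumulate cannot overlap anything, giving N + 1 cycles.
    """
    return (n_products + 1) * cycle_ns

print(pipelined_mac_time(256, 100))  # 25700 ns = 25.7 microseconds
```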


Overflow and Underflow


• While designing a MAC unit, attention must be paid to
the word sizes encountered at the input of the multiplier
and the sizes of the add/subtract unit and the
accumulator, as there is a possibility of overflow and
underflows.
• Overflow/underflow can be avoided by using any of the
following methods viz
– a. Using shifters at the input and the output of the MAC
– b. Providing guard bits in the accumulator
– c. Using saturation logic.
• Shifters
– Shifters can be provided at the input of the MAC to normalize the data and at the output to denormalize it.


Overflow and Underflow


• Guard bits
– As the normalization process does not yield accurate result, it is not
desirable for some applications.
– In such cases we have another alternative by providing additional bits
called guard bits in the accumulator so that there will not be any
overflow error. Here the add/subtract unit also must be modified
appropriately to manage the additional bits of the accumulator.
– Ideally, the size of the accumulator registers should be larger than the
size of multiplier output word by several bits. The extra bits, called guard
bits, allow the programmer to accumulate a number of values without
the risk of overflowing the accumulator and without the need for
scaling intermediate results to avoid overflow.
– Guard bits provide greater flexibility than scaling the multiplier
product because they allow the maximum precision to be retained in
intermediate steps of computation
– An accumulator with n guard bits provides the capacity for up to 2^n values to be accumulated without the possibility of overflow. Most processors provide either four or eight guard bits.


Overflow and Underflow


• Guard bits

– When the guard bits are in use, it is necessary to scale the final result in order to convert from the intermediate representation to the final one. For example, in a 16-bit processor with 4 guard bits in the accumulator, it may be necessary to scale the accumulator by 2^-4 before writing values to memory as 16-bit results.
– The TMS320C5X processor lacks guard bits; instead, it allows the product register to be automatically shifted right by six bits. The TMS320C54XX has guard bits.


• Consider a MAC unit whose inputs are 16-bit numbers. If 256 products are to be summed up in this MAC, how many guard bits should be provided for the accumulator to prevent an overflow condition from occurring?
– As it is required to calculate the sum of N = 256 n = 16-bit numbers, the sum can be as long as n + log2(N) = 16 + log2(256) = 24 bits.
– Hence the accumulator should be capable of handling these 24 bits. Thus, the guard bits required will be 24 − 16 = 8 bits.
– The block diagram of the modified MAC, after considering the guard (extension) bits, is as shown in the figure.


Saturation Logic
• Overflow/ underflow will occur if the result goes beyond the most positive
number or below the least negative number the accumulator can handle.
• Thus, the overflow/underflow error can be resolved by loading the
accumulator with the most positive number which it can handle at the time
of overflow and the least negative number that it can handle at the time of
underflow.
• This method of limiting the accumulator content to its saturation limit is called saturation logic.

• A schematic diagram of saturation logic is as shown in figure.


• In saturation logic, as soon as an overflow or underflow condition is
satisfied the accumulator will be loaded with the most positive or least
negative number overriding the result computed by the MAC unit.

Saturation Logic

• Why saturation logic?
• E.g., a 2-digit decimal calculator:
– (55)base 10 + (60)base 10 = (115)base 10
– Here the result has three digits. Without saturation logic, the stored result would be 15 (the leading digit is lost); with saturation logic, the most positive two-digit decimal value, 99, is loaded into the accumulator and the overflow flag is set.
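In binary two's-complement terms, the same idea looks like this (illustrative Python sketch of a 16-bit saturating accumulator):

```python
def saturating_add(a: int, b: int, n: int = 16) -> int:
    """Two's-complement addition with saturation logic for an n-bit accumulator.

    On overflow the accumulator is loaded with the most positive value,
    on underflow with the least negative value, instead of wrapping around.
    """
    max_val = (1 << (n - 1)) - 1        # e.g. +32767 for n = 16
    min_val = -(1 << (n - 1))           # e.g. -32768 for n = 16
    total = a + b
    if total > max_val:
        return max_val                  # overflow -> saturate high
    if total < min_val:
        return min_val                  # underflow -> saturate low
    return total

print(saturating_add(30000, 10000))    # 32767, not the wrapped value -25536
print(saturating_add(-30000, -10000))  # -32768
```

For audio and similar signals, clipping at the rails is far less objectionable than the sign flip that wrap-around would produce, which is why DSPs provide this mode in hardware.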

Arithmetic and Logic Unit


• A typical DSP device should be capable of handling arithmetic instructions like ADD, SUB, INC, DEC, etc., and logical operations like AND, OR, NOT, XOR, etc.
• The block diagram of a typical ALU for a DSP is as shown in the figure.
• It consists of a status flag register, a register file, and multiplexers, and operates alongside the multiplier, shifter (barrel shifter), and MAC/accumulator.
• Status Flags
– ALU includes circuitry to generate status flags after arithmetic and logic
operations. These flags include sign, zero, carry and overflow.
• Overflow Management
– Depending on the status of overflow and sign flags, the saturation logic
can be used to limit the accumulator content.
• Register File
– Instead of moving data in and out of the memory during the operation,
for better speed, a large set of general-purpose registers are provided to
store the intermediate results.


Bus Architecture and Memories Structures

• Conventional microprocessors use Von Neumann architecture for


memory management wherein the same memory is used to store
both the program and data .
• Both program instructions and data are stored in the single
memory. In the simplest case, the processor can make one access
(either a read or a write) to memory during each instruction cycle
• For the FIR filter algorithm, this architecture will require at least four instruction cycles to execute a multiply-accumulate instruction.
• Although this architecture is simple, it takes a greater number of processor cycles to execute a single instruction, as the same bus is used for both data and program.
Note: This is one reason why conventional processors often do not perform well on DSP-intensive applications, and why designers of DSP processors have developed a wide range of alternatives to the Von Neumann architecture.

Bus Architecture and Memory

• In order to increase the speed of operation, separate memories are used to store program and data, with a separate set of data and address buses for each memory; this architecture is called the Harvard architecture.
• For the FIR filter algorithm, this architecture requires at least two instruction cycles to execute a multiply-accumulate instruction.
• Although the use of separate memories for data and instructions speeds up processing, it does not completely solve the problem. As many DSP instructions require more than one operand, the use of a single data memory means the operands must be fetched one after the other, increasing the processing delay.

Bus Architecture and Memory

• This problem can be overcome by using two separate data memories for storing the operands, so that both operands can be fetched in a single clock cycle (Modified / Super Harvard Architecture, see figure).
• Multiple memory accesses per instruction cycle are achieved by using multiple, independent memory banks connected to the processor data path via independent buses.
• Although this architecture improves the speed of operation, it requires more hardware and interconnections, increasing the cost and complexity of the system.
• Therefore, there is a trade-off between cost and speed when selecting a memory architecture for a DSP.

• Explain the difference between the Von Neumann and Harvard architectures. Which architecture is preferred for DSP applications, and why?


On-Chip Memories
• Need for on-chip memories
– In order to have faster execution of DSP functions, it is desirable to have some memory located on-chip.
• As dedicated buses are used to access it, on-chip memory is faster.
• Speed and size are the two key parameters of on-chip memories.
– Speed: on-chip memories should match the speed of ALU operations in order to maintain single-cycle instruction execution of the DSP.
– Size: in a given DSP chip area, it is desirable to implement as many DSP functions as possible. Thus, the area occupied by the on-chip memory should be as small as possible, so that there is scope for implementing a greater number of DSP functions on-chip.


Organization of On-Chip Memories

To improve the speed of execution, the on-chip memory can be organized in the following ways:
a. As many DSP algorithms require instructions to be executed repeatedly, an instruction can be stored in external memory and, once fetched, can reside in the instruction cache (or program cache: a small memory within the processor core used for storing program instructions, eliminating the need to access program memory when fetching certain instructions, e.g., a single-instruction repeat buffer).


Organization of On-Chip Memories

b. The access times of on-chip memories should be sufficiently small that they can be accessed more than once in every execution cycle.
   i. Fast memories (multiple-access memories) support multiple sequential accesses per instruction cycle over a single set of buses.
      For example, in a modified Harvard architecture with two banks of fast memory, each bank can complete two sequential memory accesses per instruction cycle. The two banks together can complete four memory accesses per instruction cycle, assuming the accesses are arranged so that each bank handles two of them.
   ii. Multiported memories have multiple independent sets of address and data connections, allowing independent memory accesses in parallel.
      The drawback of multiported memories is that they are much more costly (in terms of chip area) to implement than standard single-ported memories.


Organization of On-Chip Memories

c. On-chip memories can be configured dynamically so that they serve different purposes at different times.
For example, if a DSP has two blocks of on-chip memory, ordinarily one of them is configured as program memory and the other as data memory. However, for execution of instructions that require two operands to be fetched simultaneously, both can be configured as data memories.


• Explain why the P-DSPs have multiple address & data buses for
internal memory access but have only a single address bus &
data bus for the external data memory & peripherals?
– Ideally whole memory required for the implementation of
any DSP algorithm has to reside on-chip so that the whole
processing can be completed in a single execution cycle.
– The access times for memories on-chip should be sufficiently
small so that it can be accessed more than once in every
execution cycle
– Hence, the P-DSPs have multiple address & data buses for
internal memory access


• Explain why the P-DSPs have multiple address & data buses for internal memory
access but have only a single address bus & data bus for the external data memory
& peripherals?
– In P-DSPs, fast memories or multiported memories with dedicated buses are used as on-chip memory to improve the speed of instruction execution.
– The drawback of such memories is that they are much more costly (in terms of chip area).
– Since the cost of IC increases with the number of pins in the IC, extending
number of buses outside the chip would unduly increase the price.
– Any operation that involves off-chip memory is slow compared to one using on-chip memory.
• To minimize this delay, for DSP algorithms that require instructions to be executed repeatedly, the instruction can be stored in external memory and, once fetched, can reside in the instruction cache.
– Hence, P-DSPs have only a single address bus and data bus for the external (off-chip) data memory and peripherals.


ROM
– DSP processors that are intended for low-cost, embedded
applications like consumer electronics and telecommunications
equipment provide on-chip read-only memory (ROM) to store the
application program and constant data
• On-chip ROM
– The main purpose of internal ROM is to permanently store
the program code and data for a specific application during
manufacturing of the chip itself.
– used to store program, data values, boot loader program,
µ law expansion table, A law expansion table, interrupt
vector table & sine look up table.
– The content of the on-chip ROM can be protected so that
any external device cannot have access to the program
code.


Static & Dynamic Memory (RAM)


– All of the writable memory found on DSP processors and most of
the memory found in systems based on DSP processors is static
memory, also called SRAM (for static random-access memory; a
better name would have been static read and write memory).
– Static memory is simpler to use and faster than dynamic memory
(DRAM), but it also requires more silicon area and is more costly
for a given number of bits of memory.
– The key operational attribute distinguishing static from dynamic
memories is that static memories retain their data as long as power is
available.
• Dynamic memories must be refreshed periodically; that is, a special sequence
of signals must be applied to reinforce the stored data, or it eventually
(typically in a few tens of milliseconds) is lost.
– In addition, interfacing to static memories is usually simpler than
interfacing to dynamic memories; the use of dynamic memories
usually requires a separate, external DRAM controller to generate
the necessary control signals.


Direct Memory Access


– Direct memory access (DMA) is a technique whereby
data can be transferred to or from the processor's
memory without the involvement of the processor itself.
– DMA is typically used to provide improved performance
for input/output devices
• Rather than have the processor read data from an I/O device and
copy the data into memory or vice versa, a separate DMA
controller can handle such transfers more efficiently.
– This DMA controller may be a peripheral on the DSP chip
itself or it may be implemented using external hardware.


Data Addressing Capabilities


• Addressing refers to means by which location of operands are specified for
instruction in general.
– types of addressing are called addressing modes
- operands may be input operands for the operation as well as results of the
operation
• Data access capability of a programmable DSP device is configured by means of its
addressing modes.


Data Addressing Capabilities


• Implied Addressing mode
– Implied addressing means that the operand
addresses are implied by the instruction; there is
no choice of operand locations.
– In these cases, the operands are in registers.
• For example, in the AT&T DSP16xx, all multiplication operations take their inputs from the multiplier input registers X and Y, and deposit their result into the product register P.


Indirect addressing uses ‘*’ in conjunction with one of the


address pointer registers



• 1. Circular Addressing Mode // modulo addressing mode


• 2. Bit Reversed Addressing Mode

• Circular Addressing Mode


• Circular addressing is used to create a circular buffer.
• The buffer is created in hardware and is very useful for applications like digital filtering.
• Most DSPs implement circular addressing in hardware in order to conserve memory and minimize software overhead.
• This addressing mode, in conjunction with a circular buffer, updates samples by shifting data without creating the overhead of direct shifting.
• When the pointer reaches the bottom location and is incremented, it automatically wraps around to the top location.
• Circular buffering allows processors to access a block of data sequentially and then automatically wrap around to the beginning address, exactly the pattern used to access coefficients in an FIR filter.

The TMS320C67x does not support modulo addressing for 64-bit


data.

Bit Reversed Addressing Mode


• To implement FFT algorithms we need to access the data in a bit-reversed manner. Hence a special addressing mode, called bit-reversed addressing mode, is used to calculate the index of the next data item to be fetched.
• It works as follows: start with index 0. The present index is calculated by adding half the FFT length to the previous index in a bit-reversed manner, with the carry being propagated from MSB to LSB.
• Current index = Previous index +B (1/2 (FFT size)), where +B denotes bit-reversed addition.
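The index update described above can be sketched in Python (an illustrative model; `reversed_add` is a hypothetical helper, not a DSP instruction):

```python
def reversed_add(index, nbits):
    """Next bit-reversed index: add half the FFT size (the MSB weight) to
    `index` with the carry propagating from MSB toward LSB.  Reversing the
    bit order turns this into an ordinary +1."""
    rev = lambda x: int(format(x, f'0{nbits}b')[::-1], 2)
    return rev(rev(index) + 1)

# Access order for an 8-point FFT (nbits = 3, half the FFT size = 4)
order = [0]
for _ in range(7):
    order.append(reversed_add(order[-1], 3))
print(order)   # -> [0, 4, 2, 6, 1, 5, 3, 7]
```

The resulting sequence is exactly the bit-reversed ordering of 0..7 that a radix-2 FFT needs.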



Address generation unit consists of


1. ALU
2. Registers to store current value, offset & new value.
3. Register to store limits of circular buffer & FFT length.
4. Logic to implement circular addressing mode.
5. Logic to implement bit-reversed addressing mode.


• The program sequencer acts as a multiplexer which selects


the address of the next instruction to be obtained from one of
the sources listed above.

Computational Accuracy
• Number Formats

• In a DSP, signals are represented as discrete sets of numbers from the input stage, through intermediate processing stages, to the output.
• Even DSP structures such as filters require numbers to specify coefficients for operation.

• Two typical formats for these numbers:


• fixed-point format
• floating-point format


Fixed point format


• simplest scheme
• number is represented as an integer or fraction
using a fixed number of bits
• An n-bit fixed-point signed integer, −2^(n−1) ≤ x ≤ 2^(n−1) − 1, is represented as:

x = −s · 2^(n−1) + b_(n−2) · 2^(n−2) + b_(n−3) · 2^(n−3) + · · · + b_1 · 2^1 + b_0 · 2^0

where s represents the sign of the number (s = 0 for positive and s = 1 for negative)


Fixed Point Integer Format


• What is the most negative number that can be represented?
when s = 1 and all b coefficients are zero:
−2^(n−1), represented as [1 0 0 -------- 0 0]

• What is the most positive number that can be represented?
2^(n−1) − 1, represented as [0 1 1 -------- 1 1]

Example: Fixed Point Integer Format


• n = 3
• −2^(n−1) ≤ x ≤ 2^(n−1) − 1
• −2^(3−1) ≤ x ≤ 2^(3−1) − 1
• −2^2 ≤ x ≤ 2^2 − 1
• x ∈ {−4, −3, −2, −1, 0, 1, 2, 3}
−2 is represented as x = −s · 2^2 + b_1 · 2^1 + b_0 · 2^0
= −1 · 2^2 + 1 · 2^1 + 0 · 2^0 = −4 + 2 = −2
[1 1 0]
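The encoding above can be checked with a small Python sketch of n-bit two's-complement encoding and decoding (illustrative helpers, not part of any DSP toolchain):

```python
def to_fixed_int(x, n):
    """Encode integer x as an n-bit two's-complement bit pattern."""
    assert -(1 << (n - 1)) <= x <= (1 << (n - 1)) - 1, "out of range"
    return x & ((1 << n) - 1)

def from_fixed_int(bits, n):
    """Decode: value = -s*2^(n-1) + (remaining weighted bits)."""
    s = bits >> (n - 1)           # sign bit
    return bits - (s << n)

print(format(to_fixed_int(-2, 3), '03b'))  # -> 110
print(from_fixed_int(0b110, 3))            # -> -2
```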


Fixed point Fractional Format


• Similarly, an n-bit fixed-point signed fraction, −1 ≤ x ≤ 1 − 2^−(n−1), is represented as:
x = −s · 2^0 + b_−1 · 2^−1 + b_−2 · 2^−2 + · · · + b_−(n−1) · 2^−(n−1)
where s represents the sign of the number (s = 0 for positive and s = 1 for negative)
• What is the most negative number that can be represented?
when s = 1 and all b coefficients are zero:
−1, represented as [1 0 0 -------- 0 0]

• What is the most positive number that can be represented?
1 − 2^−(n−1), represented as [0 1 1 -------- 1 1]

• What granularity of numbers can be represented in this range?
x = −s · 2^0 + b_−1 · 2^−1 + b_−2 · 2^−2 + · · · + b_−(n−1) · 2^−(n−1)
Resolution = 2^−(n−1)


Example: Fixed point Fractional


Format
• n = 3
• −1 ≤ x ≤ 1 − 2^−(n−1)
• −1 ≤ x ≤ 1 − 2^−(3−1)
• −1 ≤ x ≤ 3/4
• x ∈ {−1, −3/4, −1/2, −1/4, 0, 1/4, 1/2, 3/4}
• 1/4 = 1/2^(n−1) is the smallest step of measurement for n = 3
• Representation of 1/4:
x = −s · 2^0 + b_−1 · 2^−1 + b_−2 · 2^−2
x = 0 · 2^0 + 0 · 2^−1 + 1 · 2^−2
[0 0 1]
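The fractional (Q-format) case can be sketched the same way (hypothetical helper names; `to_q` quantizes to the nearest representable step):

```python
def to_q(x, n):
    """Quantize x in [-1, 1 - 2^-(n-1)] to an n-bit signed fraction."""
    code = round(x * (1 << (n - 1)))       # scale by 2^(n-1)
    return code & ((1 << n) - 1)           # two's-complement bit pattern

def from_q(bits, n):
    """Decode an n-bit signed-fraction bit pattern back to a float."""
    s = bits >> (n - 1)
    return (bits - (s << n)) / (1 << (n - 1))

print(format(to_q(0.25, 3), '03b'))  # -> 001
print(from_q(0b100, 3))              # -> -1.0
```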

• What is the range of numbers that can be represented in fixed-point format using 16 bits if the numbers are treated as (a) signed integers (b) signed fractions?
– (a) treating the 16-bit numbers as signed integers, i.e., n = 16:
−2^(n−1) ≤ x ≤ 2^(n−1) − 1 => −2^15 ≤ x ≤ 2^15 − 1
=> −32,768 to 32,767
– (b) signed fractions, n = 16:
−1 ≤ x ≤ 1 − 2^−(n−1) => −1 ≤ x ≤ 1 − 2^−15
=> −1 to 0.9999694824


Fixed point format and size


Q. How would one increase the range of numbers that can be represented in integer fixed-point format? State its implications.
• Increase its size (i.e., the number of bits n); doubling the size substantially increases the range of numbers represented:
– for n = 3, −4 ≤ x ≤ 3
– for n = 6, −2^(6−1) ≤ x ≤ 2^(6−1) − 1 => −32 ≤ x ≤ 31
• Doubling the size has implications:
– need double the storage for the same data
– may need to double the number of accesses using the original size of data bus
– may need to increase the data bus size, which may affect area, speed & power of the circuit.


Fixed point format and size


• Question: Is there a number format with a
different compromise between overflow,
precision and storage needs?

• Answer : floating-point!


Floating point format


• Suitable for computations where a large number of bits (in fixed-point format) would be required to store intermediate and final results
– example: an algorithm involving summation of a large number of products (multiply and accumulate)
• A floating-point number x is represented as:
x = Mx · 2^Ex
where Mx is called the mantissa and Ex is called the exponent
• The mantissa represents a binary fraction.
• The exponent is a positive or negative binary integer.
• The value of the exponent can be adjusted to move the position of the binary point in the mantissa. Hence the name floating point.


Floating point format


• The product of two floating-point numbers
x = Mx · 2^Ex and y = My · 2^Ey is given by
xy = Mx · My · 2^(Ex + Ey)
• A floating-point multiplier must contain a multiplier for the mantissas and an adder for the exponents.
• A floating-point adder requires normalization
of the numbers to be added so that they have
the same exponents.


Floating point format


• Conversion of a decimal number into floating-point format:
– The given decimal number is converted into a binary number, & then the binary point is moved to a position (i.e., the value of the exponent is adjusted) such that the MSB of the mantissa is one, with the mantissa adjusted accordingly. This form of a floating-point number is called the normalized form.
– Why is the representation called floating point?


Floating point format


• Convert +5 (decimal) into 10-bit floating-point format with 7 bits for the mantissa & 3 bits for the exponent.
+5 (decimal) = 101 (binary) = 101 × 2^0 = 0.101000 × 2^+3
= 0.101000 × 2^011
+5 (decimal) = 0101000 × 2^011 ; -- answer in floating-point format
(sign bit 0, mantissa 101000, exponent 011)

Note (Slide 89): In floating-point representation the binary point can be shifted to any desired position, so that the number of digits in the integer part and the fraction part of a number can be varied. This leads to a larger range of numbers that can be represented, hence the name floating point.

Floating point format


• Convert −5 (decimal) into 10-bit floating-point format with 7 bits for the mantissa & 3 bits for the exponent.
−5 (decimal) = −101 (binary) = −101 × 2^0 = −0.101000 × 2^+3
= −0.101000 × 2^011
−5 (decimal) = 1101000 × 2^011
(sign bit 1, mantissa 101000, exponent 011)


Floating point format


Convert given no. in floating point format with 5 bits for mantissa & 3 bits for
exponent.


Floating point format


• A commonly used single-precision floating-point representation is the IEEE 754-1985 format, given as:
x = (−1)^S × (1.M) × 2^(E − bias)
S, E and M are all in unsigned fixed-point format.


Floating point format


• A commonly used single-precision floating-point representation is the IEEE 754-1985 format, given as:
x = (−1)^S × (1.M) × 2^(E − bias)
• M is the magnitude fraction of the mantissa
Note: in determining the full mantissa value, a 1 is placed immediately before the implied binary point
• E is the biased exponent
Note: the bias makes sure that the exponent is signed, to represent both small and large numbers. The bias is set to 127 (the largest positive number represented by (8 − 1) bits).
• S gives the sign of the fractional part of the number


IEEE-754 Single Precision Format


• +25 (decimal) = +11001 (binary) × 2^0 = 1.1001 × 2^+4,
so the biased exponent is 131 = 127 + 4.
The number N in IEEE 754 single-precision 32-bit format is
N = (−1)^S × (1.M) × 2^(E − bias)
1.M = 1.1001 = 1.1001 0000 0000 0000 0000 000
E = 131 (decimal) = 1000 0011 (binary)
+25 (decimal) = 0 1000 0011 1001 0000 0000 0000 0000 000
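The +25 example can be verified with Python's struct module, which exposes the IEEE 754 single-precision bit pattern directly:

```python
import struct

# Pack 25.0 as a big-endian IEEE 754 single and inspect its fields.
word = struct.unpack('>I', struct.pack('>f', 25.0))[0]
sign = word >> 31
biased_exp = (word >> 23) & 0xFF     # 131 = 127 + 4
frac = word & 0x7FFFFF               # mantissa bits 1001 000...

print(hex(word), sign, biased_exp)   # -> 0x41c80000 0 131
```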


Example:
• Note: here the exponent is stored in biased form, with bias = 2^(n−1) − 1 = 2^3 − 1 = 7; 4 bits are used for E & 8 bits for M (or F), plus one sign bit.
• Find the decimal equivalent of the floating-point binary number 1011000011100:
1 | 0110 | 00011100
sign | biased exponent | significand

• F = 0 · 2^−1 + 0 · 2^−2 + 0 · 2^−3 + 1 · 2^−4 + 1 · 2^−5 + 1 · 2^−6 + 0 · 2^−7 + 0 · 2^−8 = 0.109375
• E = 0 · 2^3 + 1 · 2^2 + 1 · 2^1 + 0 · 2^0 = 6

• ∴ x = −1 × 1.109375 × 2^(6−7) = −0.5546875


Example of conversion of a decimal number into 32-bit IEEE-754 binary format
[Worked example shown as a figure in the original slides]



Example of conversion of a number in 32-bit IEEE-754 binary format to a decimal number

0 10000100 00011000000000000000000 (binary)
= (−1)^0 × 2^(132 − 127) × 1.00011 (binary)

= + 2^5 × 1.00011 (binary)

= + 100011 (binary)

= + (1 × 2^5 + 0 × 2^4 + 0 × 2^3 + 0 × 2^2 + 1 × 2^1 + 1 × 2^0)
= + (32 + 0 + 0 + 0 + 2 + 1)
= + 35 (decimal)
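The decoding steps above can be expressed directly as code (a sketch of the formula x = (−1)^S × (1.M) × 2^(E−127); the hex constant 0x420C0000 is the 32-bit pattern 0 10000100 00011000… from the example):

```python
def decode754(word):
    """Decode a 32-bit IEEE 754 single from its integer bit pattern."""
    s = word >> 31
    e = (word >> 23) & 0xFF
    m = 1 + (word & 0x7FFFFF) / (1 << 23)   # implied leading 1
    return (-1) ** s * m * 2.0 ** (e - 127)

print(decode754(0x420C0000))                # -> 35.0
```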

Floating point format


• What is the main disadvantage of using floating-point over fixed-point?
– The product of two floating-point numbers
x = Mx · 2^Ex and y = My · 2^Ey is given by
xy = Mx · My · 2^(Ex + Ey)
– speed reduction:
• floating-point multiplication requires addition of exponents and multiplication of mantissas
• floating-point addition requires exponents to be normalized prior to addition


Dynamic range
• Ratio of the maximum value to the minimum non-zero value that the signal can take in a given number representation scheme
• Dynamic range is proportional to the number of bits n used for representation and increases by about 6 dB for every extra bit.
• In floating-point format, the exponent determines the dynamic range.


Example: Dynamic range (fixed-point format)
• −2^(n−1) ≤ x ≤ 2^(n−1) − 1
• For n = 24: −2^23 ≤ x ≤ 2^23 − 1
• −8,388,608 ≤ x ≤ 8,388,607
• x ∈ {−8,388,608, −8,388,607, . . . , −1, 0, 1, . . . , 8,388,607}
• xmax = 8,388,608 and xmin = 1
• dynamic range = 8,388,608 / 1 = 8,388,608
• dynamic range = 20 log10(8,388,608) ≈ 138 dB
• dynamic range ≈ 24 × 6 = 144 dB (rule of thumb)
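The 24-bit figures can be reproduced in a few lines (illustrative only; the small gap between 138 dB and 144 dB is the error of the 6-dB-per-bit rule of thumb):

```python
import math

n = 24
xmax = 1 << (n - 1)              # magnitude of the most negative value: 8_388_608
dr_db = 20 * math.log10(xmax)    # exact dynamic range: about 138.47 dB
approx = 6 * n                   # rule of thumb, ~6 dB per bit: 144 dB

print(round(dr_db, 2), approx)   # -> 138.47 144
```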

Resolution
• general definition: smallest non-zero value
that can be represented using a number
representation format
• Q: What is the resolution if k bits (signed fractional fixed-point) are used to represent a number between 0 and 1?
• Resolution = 1/(2^k − 1)
• Resolution ≈ 1/2^k if k is very large


Precision
• Computed as percentage resolution:
Precision = Resolution × 100% = 1/(2^k − 1) × 100%
• relates to accuracy of computations
• usually, the greater the precision, the slower
the speed or the more complex the support
hardware such as bus architectures


Pipelining


Pipelining: Laundry Example

• This example is from Dr. Bernard Chen's presentation on Pipeline and Vector Processing.
• A small laundry has one washer, one dryer and one operator, and it takes 90 minutes to finish one load (four loads: A, B, C, D):
– Washer takes 30 minutes
– Dryer takes 40 minutes
– "Operator folding" takes 20 minutes



Sequential Laundry
[Timing diagram: loads A-D processed one after another from 6 PM to midnight; each load takes 30 + 40 + 20 = 90 min]
• This operator scheduled his loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new task unless he is already done with the previous task.
• The process is sequential. Sequential laundry takes 6 hours for 4 loads.

Pipelining Facts
[Timing diagram: pipelined laundry, loads A-D overlapped from 6 PM to about 9:30 PM; the washer waits 10 minutes for the dryer]
• Multiple tasks operate simultaneously
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to "fill" the pipeline and time to "drain" it reduce speedup


Pipelining
• Pipelining is one of the architectural features of the P-DSP device
that should be evaluated before implementing an algorithm.
• It is the process of increasing the performance of a DSP processor by breaking a longer sequence of operations into smaller pieces & executing these pieces in parallel when possible, thereby decreasing the overall time to complete the set of operations.
• Decomposes a sequential process into segments (overlapping
allowed).
• Divide the processor into segment processors each one is
dedicated to a particular segment.
• Each segment is executed in a dedicated segment processor that operates concurrently with all other segments.
• Information flows through these multiple hardware segments.



SPEEDUP
• Consider a k-segment pipeline operating on n data sets. (In the above example, k = 3 and n = 4.)

• It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.

• After that, the remaining (n − 1) results come out at each clock cycle.

• It therefore takes (k + n − 1) clock cycles to complete the task.


Note (Slide 115): In the pipeline diagram, the number of stages k = 4 and the number of instructions to be executed n = 5; the 5 instructions complete in k + (n − 1) = 4 + (5 − 1) = 8 clock cycles.

Example
• A non-pipelined system takes 100 ns to process one task.
• The same task can be processed by a five-segment pipeline with a clock cycle of 20 ns per segment.
• It therefore takes (k + n − 1) = 5 + (1000 − 1) = 1004 clock cycles to complete 1000 tasks.

• Determine the speedup ratio of the pipeline for 1000 tasks.


Example Answer

• Speedup ratio for 1000 tasks gained by pipelining:
(100 × 1000) / ((5 + 1000 − 1) × 20) = 100,000 / 20,080 ≈ 4.98
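The speedup computation generalizes into a one-line function (illustrative; `t_seq` is the non-pipelined time per task and `t_clk` the pipeline clock cycle):

```python
def speedup(t_seq, k, t_clk, n):
    """Speedup of a k-segment pipeline (cycle time t_clk) over a
    non-pipelined unit taking t_seq per task, for n tasks:
    n * t_seq / ((k + n - 1) * t_clk)."""
    return (n * t_seq) / ((k + n - 1) * t_clk)

print(round(speedup(100, 5, 20, 1000), 2))   # -> 4.98
```

As n grows, the ratio approaches t_seq / t_clk = 5, the number of segments, which is the ideal speedup.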


Branching Effect in Pipelining


Delayed branching effect in Pipelining


Instruction pipeline versus sequential processing
[Figure: sequential processing vs. instruction pipeline]

Difficulties
• Slowest Unit decides the throughput.
• Pipeline latency & Pipeline depth
– Extra time is required at the start of algorithm execution, as the
pipeline has to be filled before the result of the first instruction can
start to flow out. This initial delay in units of time is pipeline latency,
related to number of stages in pipeline (pipeline depth)
• Branching effect:
– If a complicated memory access occurs in stage 1, stage 2 will be
delayed and the rest of the pipe is stuck.
– If there is a branch, if.. and jump, then some of the instructions that
have already entered the pipeline should not be processed.
– We need to deal with these difficulties to keep the pipeline moving


Pipeline Hurdles
• Pipeline hazards/hurdles are situations, that prevent the next instruction
in the instruction stream from executing during its designated cycle

• Structural hazard
– two different instructions use same hardware in same cycle

• Data hazard
– two different instructions use same storage
– An instruction depends on the results of a previous instruction

• Control /Branch hazard


– one instruction affects which instruction is next


Pipelining
• Instruction execution is divided into k segments or stages
– Instruction exits pipe stage k-1 and proceeds into pipe
stage k
– All pipe stages take the same amount of time; called one
processor cycle
– Length of the processor cycle is determined by the
slowest pipe stage

[Figure: instruction execution divided into k pipeline segments]


5-Stage Pipelining
S1: Fetch Instruction (FI), S2: Decode Instruction (DI), S3: Fetch Operand (FO), S4: Execute Instruction (EI), S5: Write Operand (WO)

Time (cycles) →
S1: 1 2 3 4 5 6 7 8 9
S2:   1 2 3 4 5 6 7 8
S3:     1 2 3 4 5 6 7
S4:       1 2 3 4 5 6
S5:         1 2 3 4 5
(entry k on row Sj = instruction k occupying stage Sj in that cycle)


Parallel Architecture
• The key to higher performance is the ability to
exploit parallelism. It is one of the architectural
features of P-DSP device that should be evaluated
before implementing DSP algorithm.
• Increases the speed of operation of DSP processor.
• Requires complex hardware to control units, &
controller should be hardwired rather than micro
programmed in order to ensure high speed.
• The architecture should be such that the instructions & data required for computation are fetched from memory simultaneously.


Parallel Architecture
• Some methods for exploiting parallelism include:
– multiple processors : Instead of same arithmetic unit used to do the
computation on data & address, a separate address arithmetic unit be
provided to take care of address computation. This frees main
arithmetic unit to concentrate on data computation alone &
thereby increasing the throughput.
Using multiple processors improves performance for only a restricted
set of applications.

– Provision of multiple memories & multiple buses to fetch an


instruction & operand simultaneously.
– pipelining
– superscalar implementation
– Data parallelism- SIMD architecture
– Instruction parallelism- VLIW architecture


Parallel Processing
• Pipelining is now universally implemented in high-
performance processors. Little more can be gained by
improving the implementation of a single pipeline.
• Superscalar implementations can improve
performance for all types of applications. Superscalar
(super: beyond; scalar: one dimensional) means the
ability to fetch, issue to execution units, and
complete more than one instruction at a time.
Superscalar implementations are required when
architectural compatibility must be preserved


• Specifying multiple operations per instruction creates a very-long instruction


word architecture or VLIW.

• SIMD (Single Instruction Multiple Data)architectures are based on data-level


parallelism, i.e., only one instruction is issued at a time but the same operation
specified by the instruction is performed on multiple data sets. SIMD describes
computers with multiple processing elements that perform the same operation on
multiple data points simultaneously

• A VLIW implementation has capabilities very similar to those of a superscalar


processor—issuing and completing more than one operation at a time—with
one important exception: the VLIW hardware is not responsible for discovering
opportunities to execute multiple operations concurrently.
• For the VLIW implementation, the long instruction word already encodes the
concurrent operations. This explicit encoding leads to dramatically reduced
hardware complexity compared to a high-degree superscalar implementation of
a RISC or CISC.
• The big advantage of VLIW, then, is that a highly concurrent (parallel)
implementation is much simpler and cheaper to build than equivalently
concurrent RISC or CISC chips.


Difference between parallel processing and pipelining
• Parallel processing (each processor handles one complete data stream over time):
P1: a1 a2 a3 a4
P2: b1 b2 b3 b4
P3: c1 c2 c3 c4
P4: d1 d2 d3 d4
• Pipelined processing (each processor performs one type of operation on every stream in turn):
P1: a1 b1 c1 d1
P2: a2 b2 c2 d2
P3: a3 b3 c3 d3
P4: a4 b4 c4 d4
• Parallel: less inter-processor communication, but more complicated processor hardware.
• Pipelined: more inter-processor communication, but simpler processor hardware.
(In the original figure, colors denote different types of operations performed; a, b, c, d denote different data streams processed.)

Data Dependence

• Parallel processing requires NO data dependence between processors; it increases the speed of operation.
• Pipelined processing will involve inter-processor communication; it increases throughput.
[Figure: task timelines for processors P1-P4 under each scheme]


Superscalar vs. VLIW

• Superscalar and VLIW: More than a single instruction can


be issued to the execution units per cycle.
• Superscalar machines are able to dynamically issue
multiple instructions each clock cycle from a conventional
linear instruction stream. CPU hardware dynamically
checks for data dependencies between instructions at run
time (versus software checking at compile time).
• VLIW processors use a long instruction word that contains
a usually fixed number of instructions that are fetched,
decoded, issued, and executed synchronously.
• Superscalar: dynamic issue, VLIW: static issue


VLIW (Very Long Instruction Word)


• VLIW is one of the methods of exploiting parallelism to enhance the performance of DSP processors.
• VLIW processors use a long instruction word
that contains a usually fixed number of
instructions that are fetched, decoded,
issued, and executed synchronously. Hence
static.
• VLIW processors have emerged as the mainstay of
digital signal processing and high performance
embedded computing.


VLIW
• VLIW hardware is simple and straightforward,
• VLIW separately directs each functional unit
• The number of operations in a VLIW instruction =
equal to the number of execution units (FU) in the
processor
• Each operation specifies the instruction that will be
executed on the corresponding execution unit in the
cycle that the VLIW instruction is issued


VLIW processor
• Very long instruction word means that the program is recompiled so that the instructions in each long word can run without stalling the pipeline.
• Thus, programs must be recompiled for the VLIW architecture.
• There is no need for the hardware to examine the instruction stream to determine which instructions may be executed in parallel.


VLIW processor
• Take a different approach to instruction-level parallelism
• Relying on the compiler to determine which instructions
may be executed in parallel and providing that information
to the hardware (FU)
• Each instruction specifies several independent operations
(called very long words) that are executed in parallel by the
hardware


VLIW (Very Long Instruction Word)


• VLIW is one of the methods of exploiting parallelism to enhance the performance of DSP processors.
• VLIW processors use a long instruction word
that contains a usually fixed number of
instructions that are fetched, decoded, issued,
and executed synchronously. Hence static.
• VLIW processors have emerged as the mainstay of
digital signal processing and high performance
embedded computing. DSP processors such as the Texas
Instruments TMS320C6000 and Analog Devices
SHARC as well as media processors such as the Philips
Trimedia use VLIW technology.


Each operation in a VLIW instruction


• Equivalent to one instruction in a superscalar
or purely sequential processor
• The number of operations in a VLIW
instruction = equal to the number of
execution units in the processor
• Each operation specifies the instruction that
will be executed on the corresponding
execution unit in the cycle that the VLIW
instruction is issued


VLIW Architecture


SIMD Architecture


Simple Superscalar pipeline

Simple superscalar pipeline. By fetching and dispatching two instructions at a


time, a maximum of two instructions per cycle can be completed. (IF =
Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory
access, WB = Register write back, i = Instruction number, t = Clock cycle [i.e.,
time])

Innovations in Hardware Architecture to increase


the speed of operations of DSP Processors
• Conventional microprocessors implement multiplication by means of a microprogram (microcode) using the shift & add algorithm. This approach takes a large number of clock cycles.
On the other hand, DSP processors use parallel multipliers to carry out the entire multiplication in a single clock cycle. This increase in the speed of operations has been possible because of VLSI technology.
• Harvard architecture, which separates program & data memories with separate buses for each, increases the speed of programs considerably.
• Dual-ported (multi-ported) memories, with individual buses for each port, help in accessing dual operands simultaneously.
• On-chip memories (fast memory) can be accessed twice or more in a clock cycle, thereby reducing the number of separate memories & buses required in the device.
• Use of the concept of parallelism along with pipelining
• Use of VLIW & SIMD architectures
• Use of a barrel shifter


Peripherals


Peripherals
• Why Study peripherals?
– Allows DSP to be used in an embedded system with
minimum amount of external hardware to support
its operation & interface it to the outside world.
– Power of the peripheral interfaces provided by
different processors can have significant impact on
their suitability for particular application.
– On-chip peripherals should be carefully evaluated
along with other processor features such as
arithmetic performance, memory bandwidth & so
on.



Serial Ports


Serial Ports
• A serial interface transmits & receives data
one bit at a time.
– Sending & receiving data to & from A/D & D/A
converters & codecs.
– Sending & receiving data to & from other
micro-processors or DSPs.
– Communicating with other external peripherals
or hardware.


Serial Port
• Types of Serial Interface:-
– Asynchronous serial port
– Synchronous serial port
– TDM serial port
– Buffered serial port


Asynchronous Serial Port
• It does not transmit a separate clock signal.
• It relies on the receiver deducing the clock signal from the serial data itself, which complicates the serial interface & limits the maximum speed at which bits can be transmitted.
• Asynchronous ports are typically used for RS-232 or RS-422 communication.

Synchronous Serial Port
• It transmits a bit-clock signal in addition to the serial data bits. The receiver uses this clock signal to decide when to sample the received data.
• All synchronous serial ports assume that the transmitter changes data on one clock edge (rising or falling) & that the data are stable on the other clock edge.
• DSP chips allow the programmer to choose the clock polarity, i.e., which edge controls when data change.
• Serial ports on some DSP chips may allow the programmer to select the data polarity as well.
• They may allow selection of the shift direction, i.e., LSB or MSB first.

Serial Ports


Synchronous Serial Port


• When two devices are connected via a synchronous serial link, both devices must agree on where the clock will come from: one of the devices must generate the clock, or the clock must be generated by an external third device.
• Serial ports support a variety of word lengths, i.e., the number of bits to be transmitted. They use a frame synchronization (frame sync or word sync) signal to indicate to the receiver the position of the first bit of a data word on the serial data line.

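The role of the frame sync can be sketched in Python. In this hypothetical model (MSB-first, one sync pulse per word), the receiver uses the sync line purely to find where each word starts on the otherwise featureless data line:

```python
def receive_words(data_bits, frame_sync, word_len):
    """Frame sync pulses mark the first bit of each word on the
    serial data line; the receiver groups bits accordingly.
    Bits seen before the first sync pulse are discarded."""
    words, current = [], None
    for bit, sync in zip(data_bits, frame_sync):
        if sync:                     # start of a new word
            current = []
        if current is not None:
            current.append(bit)
            if len(current) == word_len:
                # assemble MSB-first (an assumption of this model)
                words.append(int("".join(map(str, current)), 2))
                current = None
    return words

# Two 4-bit words, each announced by a sync pulse.
data = [1, 0, 1, 0, 0, 1, 1, 1]
sync = [1, 0, 0, 0, 1, 0, 0, 0]
assert receive_words(data, sync, 4) == [0b1010, 0b0111]
```

Without the sync line the receiver would have no way to tell a word boundary from any other bit position.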
Time Division Multiplex serial port


• Synchronous serial ports are sometimes used to connect more than two DSP processors.
• In TDM, time is divided into time slots. During any given time slot, one processor can transmit, & all other processors must listen.
• Processors communicate over a three-wire bus: a data line, a clock line & a frame sync line.
• Each processor's frame sync & clock signals are tied to the appropriate lines on the bus, & each DSP's transmit & receive pins are both tied to the data line.

• One processor is responsible for generating the bit clock & frame sync signals.
• The frame sync is used to indicate the start of a new set of time slots.
• After the frame sync, each processor must keep track of the current slot number & transmit only during its own slot.


• A transmitted data word (e.g., 16 bits) might contain some number of bits to indicate the destination DSP (e.g., two bits for four processors), with the remainder containing data.
• TDM support requires that a processor be able to put its serial port transmit data pin in a high-impedance state when it is not transmitting. This allows other DSPs to transmit data during their time slots without interference.

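The slot discipline and the 2-bit destination field described above can be sketched as a small Python model. The slot count, field widths, and payload values are hypothetical, chosen to match the four-processor example:

```python
NUM_SLOTS = 4                # four processors, one slot each
DEST_BITS = 2                # two address bits suffice for four DSPs

def make_word(dest, payload):
    """Pack a 16-bit TDM word: 2 destination bits + 14 data bits."""
    return (dest << 14) | (payload & 0x3FFF)

def bus_frame(transmitters):
    """One TDM frame after the frame sync: slot k belongs to
    processor k; absent processors leave the data line released
    (high impedance, modelled here as None)."""
    return [transmitters.get(slot) for slot in range(NUM_SLOTS)]

# DSP 0 sends to DSP 2; DSP 3 sends to DSP 1; slots 1 and 2 are idle.
frame = bus_frame({0: make_word(2, 0x123), 3: make_word(1, 0x456)})
assert frame[1] is None and frame[2] is None
assert frame[0] >> 14 == 2          # destination field of slot 0
```

The `None` entries correspond to the high-impedance requirement: an idle processor must not drive the shared data line during another DSP's slot.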
Parallel Ports
• It transmits & receives multiple data bits (typically 8 or 16 bits) at a time.
• It transfers data much faster than serial ports but requires more pins to do so.
• In addition to the data lines, it sometimes includes handshake or strobe lines to indicate to an external device that data have been written to the port by the DSP, or vice versa.
• Approaches for assigning lines to parallel ports:
– The data bus itself is used for the parallel port by allocating specific addresses for I/O.
– Separate lines are dedicated to the parallel port, including handshake signals.
• Types of parallel ports:
– Traditional
– Bit I/O
– HPI
– Comm


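The "data bus itself" approach can be illustrated with a toy Python model: the parallel port lives at a dedicated I/O address, and a callback stands in for the hardware strobe line that tells the external device a write has occurred. The address `0x50` and the class are invented for illustration only.

```python
PORT_ADDR = 0x50     # hypothetical I/O address assigned to the port

class IOMap:
    """Minimal model of memory-mapped parallel I/O: the port is
    reached by reading/writing one address on the data bus, and a
    strobe callback stands in for the hardware handshake line."""
    def __init__(self, on_strobe):
        self.regs = {}
        self.on_strobe = on_strobe
    def write(self, addr, value):
        self.regs[addr] = value & 0xFFFF   # 16-bit-wide port
        self.on_strobe(addr)               # notify the external device
    def read(self, addr):
        return self.regs.get(addr, 0)

events = []
io = IOMap(on_strobe=events.append)
io.write(PORT_ADDR, 0xBEEF)
assert io.read(PORT_ADDR) == 0xBEEF and events == [PORT_ADDR]
```

The alternative approach, dedicated port pins, would correspond to registers that exist outside the ordinary address space.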
Bit I/O Ports


• The pins are just like the bits of a flag register, used to monitor the status of external devices.
• It is a bidirectional port.
• It refers to parallel ports wherein individual pins can be made inputs or outputs on a bit-by-bit basis.
• It does not have handshake or strobe line support & may not have support for interrupts.
• The processor must poll the bit I/O port to determine whether the input values have changed. It is often used for control purposes (e.g., in conditional branch instructions such as BIOZ in TI fixed-point processors) and may sometimes be used for data transfer.

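Because bit I/O has no interrupt or strobe support, software must poll and compare against the last-seen pin state. A minimal Python sketch of that loop (the pin masks are hypothetical):

```python
SWITCH_PIN = 0x04     # hypothetical mask for one input pin

def poll(read_port, prev):
    """Poll the bit I/O port: with no interrupt or strobe support,
    software must XOR against the previously read value to detect
    a change on the pin of interest."""
    now = read_port()
    changed = bool((now ^ prev) & SWITCH_PIN)
    return now, changed

# Simulated port readings: the pin goes high on the third poll.
samples = iter([0x00, 0x00, 0x04, 0x04])
read_port = lambda: next(samples)
prev, hits = 0x00, 0
for _ in range(4):
    prev, changed = poll(read_port, prev)
    hits += changed
assert hits == 1      # exactly one transition was observed
```

A conditional-branch instruction like BIOZ effectively performs the `read_port()` test in hardware, branching on the state of a dedicated pin.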
Host Ports or Host Port Interface


• A specialized parallel port used for data transfer between the DSP & a host processor (or GPP), i.e., it allows 8- or 16-bit host devices or host processors to be interfaced.
• It can be used to control the DSP (e.g., force it to execute an instruction or interrupt service routine, or read & write registers & memory) => it has interrupt support.
• It may be used for bootstrap loading.
• The TMS320C5X & 54XX have an 8-bit parallel HPI.


Comm Ports or Communication Ports


• It is a specialized parallel port intended for communication among multiple processors of the same type.
• It does not provide support for the special functions for controlling DSPs that are often found in an HPI.
• E.g., the TMS320C40 has 6 Comm ports, each 8 bits wide.


On-Chip A/D & D/A converters


• Some P-DSPs (e.g., the Motorola DSP561XX) that are targeted at speech applications have on-chip A/D & D/A converters (together called a codec).
• Criteria for evaluating on-chip codecs:
– Resolution (in bits) of samples
– Sampling rates
– Signal-to-noise plus distortion ratio
– Number of analog input channels
– Programmable output gain
– Analog power gain


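The resolution criterion has a direct numeric consequence: an ideal N-bit converter driven by a full-scale sine wave has a best-case quantization SNR of about 6.02·N + 1.76 dB. A one-line Python helper makes the trade-off visible:

```python
def ideal_snr_db(n_bits):
    """Best-case quantization SNR of an ideal N-bit converter
    with a full-scale sine input: 6.02*N + 1.76 dB."""
    return 6.02 * n_bits + 1.76

# Each extra bit of resolution buys roughly 6 dB of SNR.
assert round(ideal_snr_db(16), 2) == 98.08
assert round(ideal_snr_db(8), 2) == 49.92
```

Real codecs fall short of this ideal figure, which is exactly why the signal-to-noise plus distortion ratio is listed above as a separate evaluation criterion rather than being inferred from the bit count.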
External Interrupts
• These are pins that an external device can assert to interrupt the P-DSP.
• External interrupt lines can be edge triggered or level triggered.


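The edge-triggered vs. level-triggered distinction can be shown with a tiny Python simulation over sampled values of the interrupt line. This is an idealized model: no debouncing, one sample per clock.

```python
def edge_triggered(samples):
    """Fires once per rising edge (0 -> 1 transition) on the pin."""
    fires, prev = 0, 0
    for s in samples:
        if s and not prev:
            fires += 1
        prev = s
    return fires

def level_triggered(samples):
    """Requests service on every sample during which the line is
    held active, until the source deasserts it."""
    return sum(samples)

line = [0, 1, 1, 1, 0, 1, 0]
assert edge_triggered(line) == 2    # two rising edges
assert level_triggered(line) == 4   # line active for four samples
```

The practical difference: a level-triggered source held high keeps requesting service, while an edge-triggered source is counted once per assertion.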
RISC
• E.g. of a RISC processor: TMS320C6X.
• With a reduced number of instructions, the chip area is reduced considerably: only about 20% of the chip area is used for the control unit.
• As a result of the considerable reduction in control area, CPU registers & data paths can be replicated, & register throughput can be increased by applying pipelining & parallel processing.
• In RISC, all instructions are of the same length & take the same time to execute. This increases computational speed.
• A simpler & smaller control unit has fewer gates. This reduces propagation delay & increases speed.


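The speed benefit of uniform-length, uniform-time instructions can be quantified. With a k-stage pipeline issuing one instruction per cycle, n instructions retire in k + n − 1 cycles instead of the k·n cycles a non-pipelined machine would need; the sketch below works the arithmetic:

```python
def pipelined_cycles(n_instr, k_stages):
    """Cycles to retire n instructions through a k-stage pipeline
    with no stalls: k cycles to fill, then one result per cycle."""
    return k_stages + n_instr - 1

def unpipelined_cycles(n_instr, k_stages):
    """Each instruction occupies all k stages before the next starts."""
    return k_stages * n_instr

n, k = 1000, 4
speedup = unpipelined_cycles(n, k) / pipelined_cycles(n, k)
assert pipelined_cycles(n, k) == 1003
assert speedup > 3.9     # approaches k as n grows large
```

This ideal speedup is only reachable because every RISC instruction fits the same stage timing; variable-length CISC instructions break the lockstep and stall the pipeline.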
RISC
• A reduced number of instruction formats & addressing modes results in a simpler & smaller decoder, which in turn increases speed.
• Delayed branch & call instructions are used effectively.

• HLL support:
– Due to the smaller number of instructions, the compiler for an HLL is shorter and simpler.
– The availability of many general-purpose (GP) registers permits more efficient code optimization by maximizing the use of GP registers over slower memories.
– Support for writing programs in C or C++ relieves the programmer from learning the instruction set of the DSP & lets them instead concentrate on the application.
• Since RISC has a smaller number of instructions, implementing the work of a single CISC instruction might require a number of RISC instructions. This increases the memory required for storing the program, and the traffic between the CPU & memory is increased. This increases program execution time and makes the code more difficult to debug.


CISC
• Has a rich instruction set that supports HLL constructs similar to "if condition true then do", "for", & "while".
• Has instructions specifically required for DSP applications, such as MACD, FIRS, etc.
• 30-40% of the chip area is used for the control unit.


Comparison: CISC, RISC, VLIW


On Chip Timer/ Timers


UNIT 1: FUNDAMENTALS OF PROGRAMMABLE DSPs
Multiplier and Multiplier Accumulator,
Modified Bus Structures and Memory Access in P-DSPs,
Multiple Access Memory,
Multi-ported Memory,
VLIW Architecture,
Pipelining,
Special Addressing Modes in P-DSPs,
On-chip Peripherals,
Computational Accuracy in DSP Processors,
Von Neumann and Harvard Architecture,
MAC


UNIT 2: ARCHITECTURE OF TMS320C5X (08)
Architecture,
Bus Structure & Memory,
CPU,
Addressing Modes,
Assembly Language (AL) Syntax.
