DTSP Unit 5

UNIT- V
1) Elaborate on the Addressing modes and Instruction set of TMS 320C6x processor.
SOLUTION
LINEAR AND CIRCULAR ADDRESSING MODES
Addressing modes determine how one accesses memory. They specify how data are accessed,
such as retrieving an operand indirectly from a memory location. Both linear and circular
modes of addressing are supported. The most commonly used mode is the indirect addressing
of memory.
Indirect Addressing
Indirect addressing can be used with or without displacement. Register R represents one of the
32 registers A0 through A15 and B0 through B15 that can specify or point to memory
addresses. As such, these registers are pointers. Indirect addressing mode uses a “*” in
conjunction with one of the 32 registers. To illustrate, consider R as an address register.
1. *R. Register R contains the address of a memory location where a data value is stored.
2. *R++(d). Register R contains the memory address (location). After the memory address is
used, R is postincremented (modified), such that the new address is the current address offset
by the displacement value d. If d = 1 (by default), the new address is R + 1, or R is incremented
to the next-higher address in memory. A double minus (--) instead of a double plus would
update or postdecrement the address to R - d.
3. *++R(d). The address is preincremented or offset by d, such that the current address is R +
d. A double minus would predecrement the memory address so that the current address is R -
d.
4. *+R(d). The address used for the access is R + d (as with the preceding case). However,
in this case the offset is applied only to form the address: unlike the previous case, R itself
is not updated or modified.
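The four indirect forms above can be sketched as a small Python model. This is only an illustration, not processor code: the function name `access` and the register dictionary are hypothetical, and the displacement d is counted in abstract address units (any scaling by access size is ignored).

```python
def access(regs, r, mode, d=1):
    """Return the address used for the access; update regs[r] only if the mode modifies R.

    mode is one of "*R", "*R++", "*R--", "*++R", "*--R", "*+R", "*-R".
    """
    base = regs[r]
    if mode == "*R":        # plain indirect: use R, leave it alone
        return base
    if mode == "*R++":      # use R, then postincrement by d
        regs[r] = base + d
        return base
    if mode == "*R--":      # use R, then postdecrement by d
        regs[r] = base - d
        return base
    if mode == "*++R":      # preincrement by d, use the new value
        regs[r] = base + d
        return regs[r]
    if mode == "*--R":      # predecrement by d, use the new value
        regs[r] = base - d
        return regs[r]
    if mode == "*+R":       # use R + d, R is not modified
        return base + d
    if mode == "*-R":       # use R - d, R is not modified
        return base - d
    raise ValueError(mode)
```

Comparing the returned address with the register value afterwards shows the difference between cases 2, 3, and 4: only the last form leaves R unchanged.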
Circular Addressing
Circular addressing is used to create a circular buffer. This buffer is created in hardware and is
very useful in several DSP algorithms, such as in digital filtering or correlation algorithms
where data need to be updated.
The C6x has dedicated hardware to allow a circular type of addressing. This addressing mode
can be used in conjunction with a circular buffer to update samples by shifting data without the
overhead created by shifting data directly. As a pointer reaches the end or “bottom” location of
a circular buffer that contains the last element in the buffer, and is then incremented, the pointer
is automatically wrapped around or points to the beginning or “top” location of the buffer that
contains the first element.
Two independent circular buffers are available using BK0 and BK1 within the address mode
register (AMR). The eight registers A4 through A7 and B4 through B7, in conjunction with the
two .D units, can be used as pointers (all registers can be used for linear addressing). Table
below shows the modes associated with registers A4 through A7 and B4 through B7.
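The wraparound behavior can be sketched in Python. This is a simplified model under stated assumptions: the buffer is taken to be a power-of-two number of bytes aligned to its own size, so wrapping amounts to changing only the low address bits; the function name `circular_increment` is hypothetical.

```python
def circular_increment(addr, step, block_size):
    """Advance addr by step bytes, wrapping inside the aligned block of block_size bytes."""
    if block_size <= 0 or block_size & (block_size - 1):
        raise ValueError("block_size must be a power of two")
    mask = block_size - 1
    # Upper bits select the block and stay fixed; lower bits wrap modulo block_size.
    return (addr & ~mask) | ((addr + step) & mask)
```

With a 64-byte buffer starting at 0x1000, incrementing the pointer at 0x103E by 2 returns 0x1000 (the "top" of the buffer), and no data are moved.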
TMS320C6x INSTRUCTION SET
The following illustrates some of the syntax of assembly code
1. Add/Subtract/Multiply
(a) The instruction
ADD .L1 A3,A7,A7 ;add A3 + A7 → A7 (accum in A7)
adds the values in registers A3 and A7 and places the result in register A7. The unit .L1 is
optional. If the destination or result is in B7, the unit would be .L2.
(b) The instruction
SUB .S1 A1,1,A1 ;subtract 1 from A1
subtracts 1 from A1 to decrement it, using the .S unit.
(c) The parallel instructions
MPY .M2 A7,B7,B6 ;multiply 16 LSBs of A7,B7 → B6
|| MPYH .M1 A7,B7,A6 ;multiply 16 MSBs of A7,B7 → A6
multiplies the lower or least significant 16 bits (LSBs) of both A7 and B7 and places the product
in B6, in parallel (concurrently within the same execution packet) with a second instruction
that multiplies the higher or most significant 16 bits (MSBs) of A7 and B7 and places the result
in A6. In this fashion, two multiply operations can be executed within a single
instruction cycle. This can be used to decompose a sum of products into two sets of sum of
products: one set using the lower 16 bits to operate on the first, third, fifth,... number, and
another set using the higher 16 bits to operate on the second, fourth, sixth,... number. Note that
the parallel symbol is not in column 1.
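The effect of the MPY/MPYH pair can be modeled in Python as follows. This is a sketch, assuming both instructions treat their 16-bit operands as signed; registers are represented as 32-bit unsigned integers and the function names are illustrative only.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def mpy(a, b):
    """Product of the 16 LSBs of two 32-bit register values."""
    return to_signed16(a) * to_signed16(b)

def mpyh(a, b):
    """Product of the 16 MSBs of two 32-bit register values."""
    return to_signed16(a >> 16) * to_signed16(b >> 16)

# Two 16-bit samples packed per register: high half 3, low half 2, and so on.
a7 = 0x00030002
b7 = 0x00050004
b6 = mpy(a7, b7)    # 2 * 4
a6 = mpyh(a7, b7)   # 3 * 5
```

Packing two 16-bit values per register in this way is what allows a sum of products to be split into a "low-half" set and a "high-half" set.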
2. Load/Store
(a) The instruction
LDH .D2 *B2++,B7 ;load (B2) → B7, increment B2
|| LDH .D1 *A2++,A7 ;load (A2) → A7, increment A2
loads into B7 the half-word (16 bits) whose address in memory is specified/pointed by B2.
Then register B2 is incremented (postincremented) to point at the next-higher memory address.
In parallel is another indirect addressing mode instruction to load into A7 the content in
memory, whose address is specified by A2. Then A2 is incremented to point at the next-higher
memory address.
The instruction LDW loads a 32-bit word. Two paths using .D1 and .D2 allow for the loading
of data from memory to registers A and B using the instruction LDW. The double-word load
instruction LDDW on the floating-point C6711 loads 64 bits into a register pair, so two LDDW
instructions in parallel can simultaneously load two 32-bit registers on side A and
two 32-bit registers on side B.
(b) The instruction
STW .D2 A1,*+A4[20] ;store A1 → (A4) offset by 20
stores the 32-bit word A1 into memory whose address is specified by A4
offset by 20 words (32 bits each), that is, 80 bytes. The address register A4 is preincremented
with the offset, but it is not modified (two plus signs are used if A4 is to be modified).
3. Branch/Move.
The following code segment illustrates branching and data
transfer.
Loop MVK .S1 x,A4 ;move 16 LSBs of x address → A4
MVKH .S1 x,A4 ;move 16 MSBs of x address → A4
.
SUB .S1 A1,1,A1 ;decrement A1
[A1] B .S2 Loop ;branch to Loop if A1 ≠ 0
NOP 5 ;five no-operation instructions
STW .D1 A3,*A7 ;store A3 into (A7)
The first instruction moves the lower 16 bits (LSBs) of address x into register A4. The second
instruction moves the higher 16 bits (MSBs) of address x into A4, which now contains the full
32-bit address of x. One must use the instructions MVK/MVKH in order to get a 32-bit constant
into a register. Register A1 is used as a loop counter. After it is decremented with the SUB
instruction, it is tested for a conditional branch. Execution branches to the label (address) Loop
if A1 is not zero. If A1 = 0, execution continues and data in register A3 are stored in memory
whose address is specified (pointed) by A7.
2) Realize the difference equation for a 9 tap FIR filter.
SOLUTION
On a basic level, a bona fide DSP chip must, as a minimum requirement, be able to optimally
perform the convolution summation used to compute the output of an FIR (finite impulse
response) filter:
$$ y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] $$
where
y[n] = FIR filter output
x[n] = FIR filter input
h[k] = impulse response of FIR filter (also FIR coefficients)
N = number of taps
A 9 tap FIR filter is given by

$$ y[n] = h[0]x[n] + h[1]x[n-1] + h[2]x[n-2] + h[3]x[n-3] + h[4]x[n-4] + h[5]x[n-5] + h[6]x[n-6] + h[7]x[n-7] + h[8]x[n-8] $$
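The 9-tap difference equation can be checked numerically with a short Python sketch. The function name `fir_output` and the sample values are illustrative assumptions; input samples before index 0 are treated as zero.

```python
def fir_output(h, x, n):
    """y[n] = sum over k of h[k] * x[n-k]; samples outside x are taken as zero."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))

# Nine equal taps (a moving sum) applied to the ramp 1, 2, 3, ..., 20.
h = [1] * 9
x = list(range(1, 21))
y8 = fir_output(h, x, 8)    # uses x[8] down to x[0], i.e. 9 + 8 + ... + 1
y10 = fir_output(h, x, 10)  # uses x[10] down to x[2], i.e. 11 + 10 + ... + 3
```

Each output sample costs nine multiplications and nine accumulations, which is why the MAC operation dominates FIR filtering.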
3) Elaborate on the MAC unit used to realize the equation.
SOLUTION
Multiply and Accumulate (MAC) Unit :

This is the most important block of the complete DSP. It is composed of an adder, multiplier
and the accumulator. Basically the multiplier will multiply the inputs and give the results to the
adder, which will add the multiplier results to the previously accumulated results.
• The inputs to the MAC are fetched from memory and fed to the multiplier block, which
performs the multiplication and passes the result to the adder. The adder accumulates the
result and, if needed, also stores it into a memory location. This entire process is to be
achieved in a single clock cycle.
An N-bit by N-bit multiplication produces a 2N-bit result, and an N-bit by N-bit addition
produces an (N+1)-bit result. Thus, if two multiplier results (2N bits each) are added, the
sum needs 2N+1 bits, so an extra guard bit must be provided. Because the addition is repeated
many times, and each addition may generate a further carry, a number of guard bits are
required. These bits are called extension bits. Desired features of a MAC are summarized
below:
1. Speed: The faster the better
2. Input format: The MACs should have input format control so that any kind of numbers can
be multiplied, whether 2’s complement or unsigned, fractional or integer
3. Output precision: Single or double precision options should be available
4. Fixed point vs. floating point tradeoffs
5. For high throughput – use Pipelined Multiplier
6. Use of saturation arithmetic and not Wrap around
7. Appropriate number of extension bits be provided in the accumulator for fixed point
implementations
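The difference between saturation and wraparound in the accumulator (feature 6) can be illustrated with a minimal Python model. This is a sketch with hypothetical function names; the accumulator width `acc_bits` is a free parameter, not a property of any particular device.

```python
def saturate(value, bits):
    """Clamp value to the range of a signed two's-complement number of the given width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))

def wrap(value, bits):
    """Keep only the low bits of value and reinterpret them as a signed number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mac(acc, a, b, acc_bits, saturating=True):
    """One multiply-accumulate step: acc + a*b, limited to acc_bits."""
    total = acc + a * b
    return saturate(total, acc_bits) if saturating else wrap(total, acc_bits)
```

Starting from the largest positive 32-bit value, one more positive product leaves a saturating accumulator at the maximum, whereas a wrapping accumulator flips to the most negative value. Extension bits postpone the point at which either happens.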
4) Implementation of DSP algorithms requires data to be transferred to and from memory.
Discuss any two architectures that support this functionality.
SOLUTION
One of the biggest bottlenecks in executing DSP algorithms is transferring information to and
from memory. This includes data, such as samples from the input signal and the filter
coefficients, as well as program instructions, the binary codes that go into the program
sequencer. For example, suppose we need to multiply two numbers that reside somewhere in
memory. To do this, three binary values must be fetched from memory, the numbers to be
multiplied, plus the program instruction describing what to do.
VON NEUMANN ARCHITECTURE:
Fig. (a) shown below is often called a Von Neumann architecture, after the brilliant
Hungarian-American mathematician John Von Neumann (1903-1957).
• As shown in (a), a Von Neumann architecture contains a single memory and a single bus for
transferring data into and out of the central processing unit (CPU). Multiplying two numbers
requires at least three clock cycles, one to transfer each of the three numbers over the bus from
the memory to the CPU. The time to transfer the result back to memory is not counted because
it is assumed that it remains in the CPU for additional manipulation (such as the sum of
products in an FIR filter). The Von Neumann design is quite satisfactory when all of the tasks
are required to be executed in serial.
HARVARD ARCHITECTURE:
• In fact, most computers today are of the Von Neumann design. Other architectures are needed
only when very fast processing is required and we are willing to pay the price of increased complexity.
This leads to the Harvard architecture, shown in (b). This is named for the work done at
Harvard University in the 1940s under the leadership of Howard Aiken (1900-1973). As shown
in this illustration, Aiken insisted on separate memories for data and program instructions, with
separate buses for each. Since the buses operate independently, program instructions and data
can be fetched at the same time, improving the speed over the single bus design. Most present
day DSPs use this dual bus architecture.
SUPER HARVARD ARCHITECTURE:

Figure (c) illustrates the next level of sophistication, the Super Harvard Architecture. This term
was coined by Analog Devices to describe the internal operation of their ADSP-2106x and new
ADSP-211xx families of Digital Signal Processors. These are called SHARC® DSPs, a
contraction of the longer term, Super Harvard ARChitecture. The idea is to build upon the
Harvard architecture by adding features to improve the throughput. While the SHARC DSPs
are optimized in dozens of ways, two areas are important enough to be included in Fig.(c): an
instruction cache, and an I/O controller.
• Instruction cache improves the performance of the Harvard architecture. A handicap of the
basic Harvard design is that the data memory bus is busier than the program memory bus. When
two numbers are multiplied, two binary values (the numbers) must be passed over the data
memory bus, while only one binary value (the program instruction) is passed over the program
memory bus. To improve upon this situation, we start by relocating part of the "data" to
program memory. For instance, we might place the filter coefficients in program memory, while
keeping the input signal in data memory. (This relocated data is called "secondary data" in the
illustration).
• However, DSP algorithms generally spend most of their execution time in loops.
This means that the same set of program instructions will continually pass from
program memory to the CPU. The Super Harvard architecture takes advantage of this situation
by including an instruction cache in the CPU. This is a small memory that contains about 32 of
the most recent program instructions. The first time through a loop, the program instructions
must be passed over the program memory bus. This results in slower operation because of the
conflict with the coefficients that must also be fetched along this path. However, on additional
executions of the loop, the program instructions can be pulled from the instruction cache. This
means that all of the memory to CPU information transfers can be accomplished in a single
cycle: the sample from the input signal comes over the data memory bus, the coefficient comes
over the program memory bus, and the program instruction comes from the instruction cache.
In the jargon of the field, this efficient transfer of data is called a high memory-access
bandwidth.
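The cycle counts discussed for the three architectures can be tallied with a rough Python sketch. It assumes a deliberately simple model: each bus (or the cache) delivers one value per cycle, all paths work in parallel, and one multiply needs one instruction, one sample, and one coefficient. The dictionary names are illustrative.

```python
def cycles_per_mac(path_loads):
    """Cycles needed when every path moves one value per cycle and all paths run in parallel."""
    return max(path_loads.values())

# Number of values each path must deliver for one multiply.
von_neumann = {"shared bus": 3}                           # instruction + two operands
harvard = {"program bus": 1, "data bus": 2}               # both operands on the data bus
super_harvard_first_pass = {"program bus": 2, "data bus": 1}   # instruction + coefficient collide
super_harvard_cached = {"program bus": 1, "data bus": 1, "instruction cache": 1}
```

Under this model the single-bus design needs three cycles, the basic Harvard design two, and the Super Harvard design one cycle once the loop instructions are in the cache.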
5) Discuss the architecture of TMS 320C6x floating point DSP processor.

SOLUTION
TMS320C6x ARCHITECTURE
The TMS320C6711 onboard the DSK is a floating-point processor based on the VLIW
architecture. Internal memory includes a two-level cache architecture with 4 kB of level 1
program cache (L1P), 4 kB of level 1 data cache (L1D), and 64 kB of RAM or level 2 cache
for data/program allocation (L2). It has a glueless (direct) interface to both synchronous
memories (SDRAM and SBSRAM) and asynchronous memories (SRAM and EPROM).
Synchronous memory requires clocking but provides a compromise between static SRAM and
dynamic DRAM, with SRAM being faster but more expensive than DRAM.
On-chip peripherals include two multichannel buffered serial ports (McBSPs), two
timers, a 16-bit host port interface (HPI), and a 32-bit external memory interface (EMIF). It
requires 3.3 V for I/O and 1.8 V for the core (internal). Internal buses include a 32-bit program
address bus, a 256-bit program data bus to accommodate eight 32-bit instructions, two 32-bit
data address buses, two 64-bit data buses, and two 64-bit store data buses. With a 32-bit address
bus, the total memory space is 2^32 bytes = 4 GB, including four external memory spaces: CE0, CE1,
CE2, and CE3. Figure (a) shows a functional block diagram of the C6711 processor included
with CCS. Independent memory banks on the C6x allow for two memory accesses within one
instruction cycle.
The CPU consists of eight independent functional units divided into two data paths A
and B, as shown in Figure (a). Each path has a unit for multiply operations (.M), for logical
and arithmetic operations (.L), for branch, bit manipulation, and arithmetic operations (.S), and
for loading/storing and arithmetic operations (.D). The .S and .L units are for arithmetic,
logical, and branch instructions. All data transfers make use of the .D units. The arithmetic
operations, such as subtract or add (SUB or ADD), can be performed by all the units except
the .M units (one from each data path). The eight functional units consist of four floating/fixed-
point ALUs (two .L and two .S), two fixed-point ALUs (.D units), and two floating/fixed-point
multipliers (.M units). Each functional unit can read directly from or write directly to the
register file within its own path. Each path includes a set of sixteen 32-bit registers, A0 through
A15 and B0 through B15. Units ending in 1 write to register file A, and units ending in 2 write
to register file B. Two cross-paths (1x and 2x) allow functional units from one data path to
access a 32-bit operand from the register file on the opposite side. There can be a maximum of
two cross-path source reads per cycle. There are 32 general-purpose registers, but some of them
are reserved for specific addressing or are used for conditional instructions.
Two sets of register files, each set with 16 registers, are available: register file A (A0
through A15) and register file B (B0 through B15). Registers A0, A1, B0, B1, and B2 are used
as conditional registers. Registers A4 through A7 and B4 through B7 are used for circular
addressing. Registers A0 through A9 and B0 through B9 (except B3) are temporary registers.
Any of the registers A10 through A15 and B10 through B15 used are saved and later restored
before returning from a subroutine. A 40-bit data value can be contained across a register pair.
The 32 least significant bits (LSBs) are stored in the even register (e.g., A2) and the remaining
8 bits are stored in the 8 LSBs of the next-upper (odd) register (A3). A similar scheme is used
to hold a 64-bit double-precision value within a pair of registers (even and odd). These 32
registers are considered as general-purpose registers.
The peripherals on C6x processors are as follows:
1. EDMA (Enhanced Direct Memory Access): It has 16 programmable channels and RAM
space to hold multiple configurations. It moves data from one place in memory to another
without interfering with CPU operation.
2. Boot loader: It boots the code from the HPI to internal memory. It determines what actions
the DSP performs when the device is reset.
3. McBSP (Multichannel Buffered Serial Port): It provides a high-speed multichannel serial
communication link. This port can buffer serial samples in memory automatically with the help
of the EDMA controller. Its multichannel capability is compatible with various networking
standards.
4. HPI (Host Port Interface): It allows the host to access internal memory. The host and CPU
can exchange data via internal memory.
5. Timer and power-down unit: Two 32-bit general-purpose timers are used to time events, count
events, generate pulses, interrupt the CPU, and so on. This unit also sends synchronization
events to the DMA controller. The power-down unit saves power while the CPU is inactive.
6. EMIF (External Memory Interface): This block supports an interface to several external
devices, such as synchronous burst memories, asynchronous devices, and external shared-memory devices.
6) Identify the addressing mode and the sequence of steps involved in the execution of each of
the following instructions on the TMS320C6x processor
(i) *A2++(3) (2)
(ii) *++A2(3) (2)
(iii) *+A2(3) (2)
(iv) MVK .S2 0x0004, B2 (2)
(v) MVKLH .S2 0x0005, B2 (2)
(vi) MVC .S2 B2,AMR (3)
SOLUTION
*A2++(3) : Indirect addressing. Register A2 contains the memory address. After the memory
address is used, A2 is post-incremented, so the new address is A2 + 3. – 2 marks
*++A2(3) : Indirect addressing. The address is pre-incremented by 3, so the new address is
A2 + 3, and this new memory address is used. – 2 marks
*+A2(3) : Indirect addressing. The address used is A2 + 3. However, A2 itself is not updated
or modified. – 2 marks
MVK .S2 0x0004, B2 : Circular addressing setup. Move 0x0004 into the 16 LSBs of register B2. –
2 marks
MVKLH .S2 0x0005, B2 : Circular addressing setup. Move 0x0005 into the 16 MSBs of B2. –
2 marks
MVC .S2 B2,AMR : The 32-bit value created in B2 is transferred to the AMR with the MVC
instruction, which is the instruction used to access the AMR. – 3 marks
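The value that ends up in the AMR can be decoded with a short Python sketch. The field layout used here is an assumption stated from memory of the C6x AMR (two mode bits per pointer register in bits 0 to 15, in the order A4 to A7 then B4 to B7; BK0 in bits 16 to 20; BK1 in bits 21 to 25), and the function names are hypothetical, so it should be checked against the CPU reference guide.

```python
MODE_NAMES = {0b00: "linear", 0b01: "circular (BK0)", 0b10: "circular (BK1)", 0b11: "reserved"}
POINTERS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]

def build_b2(low16, high16):
    """Model MVK (writes the 16 LSBs) followed by MVKLH (writes the 16 MSBs)."""
    return ((high16 & 0xFFFF) << 16) | (low16 & 0xFFFF)

def decode_amr(amr):
    """Split an AMR value into per-register modes and the two block-size fields."""
    modes = {reg: MODE_NAMES[(amr >> (2 * i)) & 0b11] for i, reg in enumerate(POINTERS)}
    bk0 = (amr >> 16) & 0x1F
    bk1 = (amr >> 21) & 0x1F
    return modes, bk0, bk1

amr = build_b2(0x0004, 0x0005)
modes, bk0, bk1 = decode_amr(amr)
block_size_bytes = 2 ** (bk0 + 1)   # assumed relation: block size = 2^(N+1) bytes
```

Under these assumptions the sequence selects A5 as a circular pointer using BK0, with N = 5 giving a 64-byte buffer, while all other pointer registers stay in linear mode.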
7) A sum of 512 products is to be computed using a pipelined MAC unit. If the execution time
of a single MAC operation is 200 ns, deduce the total time required to complete the operation.
SOLUTION
Number of products N = 512
Pipelined MAC execution = N + 1 execution cycles
= 513 execution cycles
Single execution time = 200 ns
Total time required = 513 × 200 ns = 102,600 ns = 102.6 µs (about 0.1 ms) – 3 marks
8 ) Draw the block diagram of MAC with provision for guard bits. Consider the MAC unit
whose inputs are 16 bits wide. If 512 products are to be accumulated in this MAC, how many
guard bits must be provided to prevent the accumulator from overflowing?
SOLUTION
M specifies the number of guard bits
N = 16
To accumulate 512 products (OP), accumulator width = 2N + log2(OP)
= 32 + log2(512) = 32 + log10(512) / log10(2)
= 32 + 9 = 41 bits
2N + M = 41, where M is the number of guard bits
Hence, M = 41 – 2N = 41 – 32
= 9 bits
Number of guard bits required = 9 bits
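The guard-bit count can be reproduced with integer arithmetic in Python. This is a minimal sketch; the helper names are illustrative, and `(n - 1).bit_length()` is used as an exact integer form of the ceiling of log2(n).

```python
def guard_bits(num_products):
    """Smallest M such that 2**M >= num_products, i.e. ceil(log2(num_products))."""
    return (num_products - 1).bit_length()

def accumulator_width(input_bits, num_products):
    """Product width (2N) plus guard bits for the given number of accumulations."""
    return 2 * input_bits + guard_bits(num_products)
```

For 16-bit inputs and 512 products this gives 9 guard bits and a 41-bit accumulator, matching the hand calculation.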
The modified block diagram with the required guard bits is given below:
[Figure: block diagram of MAC unit with guard bits]
9) Consider indirect addressing mode in a DSP processor. The initial contents of Areg (address
register) and Ofreg (offset register) are 0200h and 0010h, respectively. In each case listed
below, what will be the content of Areg after the memory access operation is complete?
a. ADD *Areg-
b. ADD +*Areg
c. ADD Ofreg+,*Areg
d. ADD *Areg,Ofreg-
SOLUTION
Instruction | Sequence of operations | Operand address | Areg content after memory access
ADD *Areg- | Post-decrement | 0200h | 0200h - 01h = 01FFh
ADD +*Areg | Pre-increment | 0200h + 01h = 0201h | 0201h
ADD Ofreg+,*Areg | Pre-increment by offset and store | 0200h + 0010h = 0210h | 0210h
ADD *Areg,Ofreg- | Store and then post-decrement by offset | 0200h | 0200h - 0010h = 01F0h
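The four results can be verified with a few lines of Python. This is a sketch of the generic address-register behavior described in the solution, not of any specific processor; the instruction strings are used only as dictionary-style keys, with the post-decrement form written as `ADD *Areg-`.

```python
def run(instruction, areg=0x0200, ofreg=0x0010):
    """Return (operand_address, areg_after) for the four forms in the question."""
    if instruction == "ADD *Areg-":          # use Areg, then decrement by 1
        return areg, areg - 1
    if instruction == "ADD +*Areg":          # increment by 1, then use Areg
        return areg + 1, areg + 1
    if instruction == "ADD Ofreg+,*Areg":    # add the offset, then use Areg
        return areg + ofreg, areg + ofreg
    if instruction == "ADD *Areg,Ofreg-":    # use Areg, then subtract the offset
        return areg, areg - ofreg
    raise ValueError(instruction)
```

Printing the results with `hex()` gives 0x1ff, 0x201, 0x210, and 0x1f0 for the final Areg contents.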
10) An accumulator is deployed to find the sum of 64 16-bit numbers. The accumulator must
be devoid of overflow error and loss of accuracy. To meet this requirement, how many bits
must the accumulator have?
SOLUTION
Initial accumulator width (N): 16 bits
Number of addition operations (OP): 64
Final accumulator width = N + log2(OP) = N + log10(OP) / log10(2)
= 16 + log2(64) = 16 + 6
= 22 bits – 7 marks
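A worst-case check in Python confirms that 22 bits is both sufficient and necessary. The sketch assumes unsigned 16-bit inputs, so the largest possible sum is 64 times 0xFFFF; the helper name is illustrative.

```python
def fits_unsigned(value, bits):
    """True if value can be stored in an unsigned register of the given width."""
    return 0 <= value < (1 << bits)

# Largest possible sum of 64 unsigned 16-bit numbers.
worst_case_sum = 64 * 0xFFFF
```

The worst-case sum fits in 22 bits but not in 21, so no overflow or loss of accuracy can occur with a 22-bit accumulator.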
11) Analyze how the eight functional units of TMS320C6X processor are distributed and
allocated to perform operations among the two data paths A and B.
SOLUTION
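The eight functional units are divided equally between the two data paths: path A has .L1,
.S1, .M1, and .D1, and path B has .L2, .S2, .M2, and .D2. Each path has a unit for multiply
operations (.M), for logical and arithmetic operations (.L), for branch, bit manipulation,
and arithmetic operations (.S), and for loading/storing and arithmetic operations (.D).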
12) Determine the instructions and appropriate functional units needed to execute the following
operations on a TMS320C6x processor
A3 + A7
A7 * B7, where A7 and B7 are each 32 bits wide
SOLUTION
A3+A7 :
ADD .L1 A3,A7,A7 ;add A3 + A7 →A7 (accum in A7)
A7*B7:
MPY .M2 A7,B7,B6 ;multiply 16LSBs of A7,B7 → B6
|| MPYH .M1 A7,B7,A6 ;multiply 16MSBs of A7,B7 → A6 – 5marks
13) The data in 32 bits wide registers A2 and B2 have to be moved to 32bit wide registers A7
and B7. Analyze how this operation can be executed using LDH and LDW instructions.
Identify the differences between these 2 ways of execution.
SOLUTION
LDH .D2 *B2++,B7 ;load (B2) → B7, increment B2
|| LDH .D1 *A2++,A7 ;load (A2) → A7, increment A2
LDH loads a half-word (16 bits).
LDW .D2 *B2++,B7 ;load (B2) → B7, increment B2
|| LDW .D1 *A2++,A7 ;load (A2) → A7, increment A2
LDW loads a 32-bit word. LDH therefore transfers only 16 bits per load, whereas LDW
transfers the full 32 bits in one load. – 5 marks
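The difference between the two loads can be illustrated in Python with the standard `struct` module. This is a sketch under stated assumptions: memory is a little-endian byte array with made-up contents, the half-word is treated as signed, and the function names `ldh` and `ldw` merely mirror the mnemonics.

```python
import struct

# Eight bytes of example memory (little-endian), contents chosen for illustration.
memory = bytes([0x34, 0x12, 0xFF, 0xFF, 0x78, 0x56, 0x34, 0x12])

def ldh(mem, addr):
    """Load a signed 16-bit half-word from byte address addr."""
    return struct.unpack_from("<h", mem, addr)[0]

def ldw(mem, addr):
    """Load a signed 32-bit word from byte address addr."""
    return struct.unpack_from("<i", mem, addr)[0]

low_half = ldh(memory, 0)    # bytes 0-1 only
next_half = ldh(memory, 2)   # bytes 2-3 only
full_word = ldw(memory, 4)   # bytes 4-7 in one access
```

A post-incremented pointer would advance by 2 bytes after each half-word load and by 4 bytes after each word load, so reading 32 bits with LDH takes two accesses where LDW takes one.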
