DTSP Unit 5
1) Elaborate on the addressing modes and instruction set of the TMS320C6x processor.
Addressing modes determine how one accesses memory. They specify how data are accessed,
such as retrieving an operand indirectly from a memory location. Both linear and circular
modes of addressing are supported. The most commonly used mode is the indirect addressing
of memory.
Indirect Addressing
Indirect addressing can be used with or without displacement. Register R represents one of the
32 registers A0 through A15 and B0 through B15 that can specify or point to memory
addresses. As such, these registers are pointers. Indirect addressing mode uses a “*” in
conjunction with one of the 32 registers. To illustrate, consider R as an address register.
1. *R. Register R contains the address of a memory location where a data value is stored.
2. *R++(d). Register R contains the memory address (location). After the memory address is
used, R is postincremented (modified), such that the new address is the current address offset
by the displacement value d. If d = 1 (by default), the new address is R + 1, or R is incremented
to the next-higher address in memory. A double minus (--) instead of a double plus would
update or postdecrement the address to R - d.
3. *++R(d). The address is preincremented or offset by d, such that the current address is R +
d. A double minus would predecrement the memory address so that the current address is R - d.
4. *+R(d). The address is offset by d, such that the current address is R + d (as with
the preceding case). However, unlike the previous case, R itself is not updated or modified.
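The four forms above can be sketched as a small Python model. This is only an illustration of the address arithmetic, not processor code: the function name `indirect_access` and the string labels for each form are hypothetical, and the displacement d is treated as a plain number added to the address.

```python
def indirect_access(form, r, d=1):
    """Return (address_used, r_after) for one indirect addressing form.

    form is the syntax with the displacement left out:
    "*R", "*R++", "*R--", "*++R", "*--R" or "*+R".
    (Hypothetical labels, chosen for this sketch only.)
    """
    if form == "*R":      # plain indirect: use R, leave it unchanged
        return r, r
    if form == "*R++":    # use R, then postincrement R by d
        return r, r + d
    if form == "*R--":    # use R, then postdecrement R by d
        return r, r - d
    if form == "*++R":    # preincrement R by d, then use the new R
        return r + d, r + d
    if form == "*--R":    # predecrement R by d, then use the new R
        return r - d, r - d
    if form == "*+R":     # use R + d, but R itself is not modified
        return r + d, r
    raise ValueError(f"unknown addressing form: {form}")
```

For example, `indirect_access("*+R", 0x100, 3)` uses address 0x103 and leaves R at 0x100, whereas `"*++R"` uses the same address but also updates R.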
Circular Addressing
Circular addressing is used to create a circular buffer. This buffer is created in hardware and is
very useful in several DSP algorithms, such as in digital filtering or correlation algorithms
where data need to be updated.
The C6x has dedicated hardware to allow a circular type of addressing. This addressing mode
can be used in conjunction with a circular buffer to update samples by shifting data without the
overhead created by shifting data directly. As a pointer reaches the end or “bottom” location of
a circular buffer that contains the last element in the buffer, and is then incremented, the pointer
is automatically wrapped around or points to the beginning or “top” location of the buffer that
contains the first element.
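The wrap-around behaviour can be modelled in a few lines of Python. This is a software sketch of what the hardware does automatically; the names `circular_advance` and `newest_first`, and the idea of passing the buffer base and size explicitly, are assumptions made for illustration.

```python
def circular_advance(pointer, base, block_size, step=1):
    """Advance pointer by step inside [base, base + block_size), wrapping."""
    return base + (pointer - base + step) % block_size


def newest_first(buffer, newest_index):
    """List samples from newest to oldest without moving any data.

    Only the index wraps around; the stored samples stay where they are,
    which is the saving over shifting every sample on each update.
    """
    n = len(buffer)
    return [buffer[(newest_index - k) % n] for k in range(n)]
```

With an 8-location buffer starting at 0x100, advancing a pointer at the "bottom" location 0x107 returns it to the "top" location 0x100.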
Two independent circular buffers are available using BK0 and BK1 within the address mode
register (AMR). The eight registers A4 through A7 and B4 through B7, in conjunction with the
two .D units, can be used as pointers (all registers can be used for linear addressing). Table
below shows the modes associated with registers A4 through A7 and B4 through B7.
The following illustrates some of the syntax of assembly code
1. Add/Subtract/Multiply
(a) The instruction
ADD .L1 A3,A7,A7 ;add A3 + A7 → A7 (accum in A7)
adds the values in registers A3 and A7 and places the result in register A7. The unit .L1 is
optional. If the destination or result is in B7, the unit would be .L2.
(b) The instruction
SUB .S1 A1,1,A1 ;subtract 1 from A1
subtracts 1 from A1 to decrement it, using the .S unit.
(c) The parallel instructions
MPY .M2 A7,B7,B6 ;multiply 16 LSBs of A7,B7 → B6
|| MPYH .M1 A7,B7,A6 ;multiply 16 MSBs of A7,B7 → A6
multiply the lower or least significant 16 bits (LSBs) of both A7 and B7, placing the product
in B6, in parallel (concurrently within the same execution packet) with a second instruction
that multiplies the higher or most significant 16 bits (MSBs) of A7 and B7 and places the result
in A6. In this fashion, two multiply/accumulate operations can be executed within a single
instruction cycle. This can be used to decompose a sum of products into two sets of sum of
products: one set using the lower 16 bits to operate on the first, third, fifth,... number, and
another set using the higher 16 bits to operate on the second, fourth, sixth,... number. Note that
the parallel symbol is not in column 1.
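The split of a sum of products into a "lower 16 bits" set and a "higher 16 bits" set can be checked with a short Python model. It assumes that MPY and MPYH treat each 16-bit half as a signed value; the helper names (`to_signed16`, `pack`, `dot_product_packed`) are hypothetical and exist only for this sketch.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed 16-bit number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def mpy(a, b):
    """Model of MPY: product of the 16 LSBs of two 32-bit registers."""
    return to_signed16(a) * to_signed16(b)


def mpyh(a, b):
    """Model of MPYH: product of the 16 MSBs of two 32-bit registers."""
    return to_signed16(a >> 16) * to_signed16(b >> 16)


def pack(high, low):
    """Pack two 16-bit values into one 32-bit register image."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def dot_product_packed(a_words, b_words):
    """Sum of products over packed pairs: LSB products plus MSB products."""
    low_sum = sum(mpy(a, b) for a, b in zip(a_words, b_words))
    high_sum = sum(mpyh(a, b) for a, b in zip(a_words, b_words))
    return low_sum + high_sum
```

Packing the sequences 1, 2, 3, 4 and 5, 6, 7, 8 two per register gives a lower-half sum of 1·5 + 3·7 = 26 and an upper-half sum of 2·6 + 4·8 = 44, which together equal the full sum of products, 70.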
2. Load/Store
(a) The instruction
LDH .D2 *B2++,B7 ;load (B2) → B7, increment B2
|| LDH .D1 *A2++,A7 ;load (A2) → A7, increment A2
loads into B7 the half-word (16 bits) whose address in memory is specified/pointed by B2.
Then register B2 is incremented (postincremented) to point at the next-higher memory address.
In parallel is another indirect addressing mode instruction to load into A7 the content in
memory, whose address is specified by A2. Then A2 is incremented to point at the next-higher
memory address.
The instruction LDW loads a 32-bit word. Two paths using .D1 and .D2 allow for the loading
of data from memory to registers A and B using the instruction LDW. The double-word load
floating-point instruction LDDW on the C6711 can simultaneously load two 32-bit registers
into side A and two 32-bit registers into side B.
(b) The instruction
STW .D2 A1,*+A4[20] ;store A1 → (A4) offset by 20
stores the 32-bit word A1 into memory whose address is specified by A4 offset by 20 words
(32 bits each), or 80 bytes. The address register A4 is preincremented with the offset,
but it is not modified (two plus signs are used if A4 is to be modified).
3. Branch/Move.
The following code segment illustrates branching and data transfer.
Loop   MVK  .S1 x,A4      ;move 16 LSBs of x address → A4
       MVKH .S1 x,A4      ;move 16 MSBs of x address → A4
       .
       SUB  .S1 A1,1,A1   ;decrement A1
[A1]   B    .S2 Loop      ;branch to Loop if A1 ≠ 0
       NOP  5             ;five no-operation instructions
       STW  .D1 A3,*A7    ;store A3 into (A7)
The first instruction moves the lower 16 bits (LSBs) of address x into register A4. The second
instruction moves the higher 16 bits (MSBs) of address x into A4, which now contains the full
32-bit address of x. One must use the instructions MVK/MVKH in order to get a 32-bit constant
into a register. Register A1 is used as a loop counter. After it is decremented with the SUB
instruction, it is tested for a conditional branch. Execution branches to the label or address Loop
if A1 is not zero. If A1 = 0, execution continues and data in register A3 are stored in memory
whose address is specified (pointed) by A7.
2) Realize the difference equation for a 9 tap FIR filter.
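On a basic level, a bona fide DSP chip must, as a minimum requirement, be able to optimally
perform the convolution summation used to compute the output of an FIR (finite impulse
response) filter. For a 9-tap filter with coefficients h[0] through h[8], the difference equation is
y[n] = \sum_{k=0}^{8} h[k] x[n-k] = h[0]x[n] + h[1]x[n-1] + ... + h[8]x[n-8]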
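As a minimal sketch, the convolution sum for a 9-tap filter (each output is the sum of nine products of a coefficient h[k] and a delayed input x[n-k]) can be written in Python as follows. The function name `fir_output` and the convention that samples before the start of the input are zero are assumptions of this example.

```python
def fir_output(h, x, n):
    """Compute y[n] = sum over k of h[k] * x[n - k] for an FIR filter.

    h holds the tap coefficients (9 of them for a 9-tap filter) and x the
    input samples. Samples outside the range of x are taken as zero.
    """
    total = 0
    for k in range(len(h)):
        index = n - k
        if 0 <= index < len(x):
            total += h[k] * x[index]   # one multiply-accumulate per tap
    return total
```

Each pass through the loop is one multiply-accumulate, so a 9-tap filter needs nine such operations per output sample.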
• As shown in (a), a Von Neumann architecture contains a single memory and a single bus for
transferring data into and out of the central processing unit (CPU). Multiplying two numbers
requires at least three clock cycles, one to transfer each of the three numbers (the two operands
and the program instruction) over the bus from the memory to the CPU. The time to transfer the
result back to memory is not counted because
it is assumed that it remains in the CPU for additional manipulation (such as the sum of
products in an FIR filter). The Von Neumann design is quite satisfactory when all of the tasks
are required to be executed in serial.
• In fact, most computers today are of the Von Neumann design. Other architectures are needed
when very fast processing is required and we are willing to pay the price of increased complexity.
This leads to the Harvard architecture, shown in (b). This is named for the work done at
Harvard University in the 1940s under the leadership of Howard Aiken (1900-1973). As shown
in this illustration, Aiken insisted on separate memories for data and program instructions, with
separate buses for each. Since the buses operate independently, program instructions and data
can be fetched at the same time, improving the speed over the single bus design. Most present
day DSPs use this dual bus architecture.
• Instruction cache improves the performance of the Harvard architecture. A handicap of the
basic Harvard design is that the data memory bus is busier than the program memory bus. When
two numbers are multiplied, two binary values (the numbers) must be passed over the data
memory bus, while only one binary value (the program instruction) is passed over the program
memory bus. To improve upon this situation, we start by relocating part of the "data" to
program memory. For instance, we might place the filter coefficients in program memory, while
keeping the input signal in data memory. (This relocated data is called "secondary data.")
• However, DSP algorithms generally spend most of their execution time in loops.
This means that the same set of program instructions will continually pass from
program memory to the CPU. The Super Harvard architecture takes advantage of this situation
by including an instruction cache in the CPU. This is a small memory that contains about 32 of
the most recent program instructions. The first time through a loop, the program instructions
must be passed over the program memory bus. This results in slower operation because of the
conflict with the coefficients that must also be fetched along this path. However, on additional
executions of the loop, the program instructions can be pulled from the instruction cache. This
means that all of the memory to CPU information transfers can be accomplished in a single
cycle: the sample from the input signal comes over the data memory bus, the coefficient comes
over the program memory bus, and the program instruction comes from the instruction cache.
In the jargon of the field, this efficient transfer of data is called a high memory-access
bandwidth.
6) Identify the addressing mode and the sequence of steps involved in the execution of each of
the following instructions of the TMS320C6x processor.
(i) *A2++(3) (2)
(ii) *++A2(3) (2)
(iii) *+A2(3) (2)
(iv) MVK .S2 0x0004, B2 (2)
(v) MVKLH .S2 0x0005, B2 (2)
(vi) MVC .S2 B2,AMR (3)
(i) *A2++(3) : Indirect addressing with postincrement. Register A2 contains the memory address.
After the memory address is used, A2 is postincremented so that the new address is A2 + 3. - 2 marks
(ii) *++A2(3) : Indirect addressing with preincrement. A2 is first incremented by 3, and the new
address A2 + 3 is then used for the memory access. - 2 marks
(iii) *+A2(3) : Indirect addressing with offset. The address used is A2 + 3. However, A2 is not
updated or modified. - 2 marks
(iv) MVK .S2 0x0004, B2 : Setup for circular addressing. Moves 0x0004 into the 16 LSBs of
register B2. - 2 marks
(v) MVKLH .S2 0x0005, B2 : Setup for circular addressing. Moves 0x0005 into the 16 MSBs of
register B2. - 2 marks
(vi) MVC .S2 B2,AMR : The 32-bit value created in B2 is transferred to the address mode register
(AMR) with the instruction MVC, which is used to access AMR. - 3 marks
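The 32-bit value built in B2 by the MVK and MVKLH pair can be traced with a short Python sketch. The register models `mvk` and `mvklh` are hypothetical helpers. The field layout used in `decode_amr` (2-bit mode fields for A4 through A7 and B4 through B7 in bits 0 to 15, BK0 in bits 16 to 20, BK1 in bits 21 to 25, mode 01 meaning circular with BK0, and a block size of 2^(N+1) bytes) is an assumption based on the commonly documented AMR format and should be checked against the TI reference guide.

```python
POINTERS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]


def mvk(const16):
    """Model of MVK: 16-bit constant, sign-extended into a 32-bit register."""
    const16 &= 0xFFFF
    return (const16 | 0xFFFF0000) if const16 & 0x8000 else const16


def mvklh(const16, reg):
    """Model of MVKLH: constant into the 16 MSBs, 16 LSBs left unchanged."""
    return ((const16 & 0xFFFF) << 16) | (reg & 0xFFFF)


def decode_amr(amr):
    """Split an AMR value into per-pointer modes and the BK0/BK1 fields.

    Assumed layout: two mode bits per pointer starting at bit 0,
    BK0 in bits 16-20, BK1 in bits 21-25.
    """
    modes = {name: (amr >> (2 * i)) & 0b11 for i, name in enumerate(POINTERS)}
    bk0 = (amr >> 16) & 0x1F
    bk1 = (amr >> 21) & 0x1F
    return modes, bk0, bk1


b2 = mvk(0x0004)          # B2 = 0x00000004
b2 = mvklh(0x0005, b2)    # B2 = 0x00050004, the value MVC copies to AMR
```

Under the assumed layout, 0x00050004 selects A5 as a circular pointer using BK0, with BK0 = 5 giving a block size of 2^(5+1) = 64 bytes.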
7) A sum of 512 products is to be computed using a pipelined MAC unit. If the execution time
of single MAC unit is 200nsec, deduce the total time required to complete the operation?
Number of products, N = 512
Execution cycles for a pipelined MAC unit = N + 1 = 513 execution cycles
Execution time of a single MAC operation = 200 nsec
Total time required = 513 × 200 nsec = 102600 nsec = 102.6 microsec - 3 marks
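The arithmetic can be checked with a one-line Python function. It assumes, as in the working above, that a pipelined MAC unit needs one cycle more than the number of products; the function name `pipelined_mac_time_ns` is hypothetical.

```python
def pipelined_mac_time_ns(num_products, mac_time_ns):
    """Total time for a pipelined MAC: (number of products + 1) cycles."""
    execution_cycles = num_products + 1   # one extra cycle to drain the pipeline
    return execution_cycles * mac_time_ns


total_ns = pipelined_mac_time_ns(512, 200)   # 513 cycles of 200 nsec each
total_us = total_ns / 1000                   # convert nsec to microsec
```

For 512 products at 200 nsec per MAC operation this gives 102600 nsec, that is, 102.6 microsec.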
8 ) Draw the block diagram of MAC with provision for guard bits. Consider the MAC unit
whose inputs are 16 bits wide. If 512 products are to be accumulated in this MAC, how many
guard bits must be provided to prevent the accumulator from overflowing?
N = 16 (input width in bits); M specifies the number of guard bits
To accumulate 512 products (OP), accumulator width = 2N + log2(OP)
= 32 + log2(512) = 32 + log10(512) / log10(2)
= 32 + 9 = 41 bits
2N + M = 41, where M is the number of guard bits
Hence, M = 41 - 2N = 41 - 32 = 9 bits
Number of guard bits required = 9 bits
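The guard-bit calculation can be reproduced in Python. This sketch uses integer arithmetic for the base-2 logarithm so that it also rounds up when the number of products is not a power of two; the function names are hypothetical.

```python
def guard_bits(num_products):
    """Guard bits M = ceil(log2(number of products accumulated))."""
    return (num_products - 1).bit_length()


def mac_accumulator_width(input_bits, num_products):
    """Accumulator width = 2N (full product width) + M guard bits."""
    return 2 * input_bits + guard_bits(num_products)
```

For 16-bit inputs and 512 products, `guard_bits(512)` is 9 and the accumulator must be 41 bits wide.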
The modified block diagram with required guard bits is given below:
9) Consider indirect addressing mode in a DSP processor. The initial contents of Areg (address
register) and Ofreg (offset register) are 0200h and 0010h, respectively. In each case listed
below, what will be the content of Areg after the memory access operation is complete?
a. ADD *Areg–
b. ADD +*Areg
c. ADD Ofreg+,*Areg
d. ADD *Areg,Ofreg–
Instruction | Sequence of operations | Operand address | Areg content after memory access
ADD *Areg– | Post decrement | 0200h | 0200h - 01h = 01FFh
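The four cases can be worked through with a small Python sketch. The interpretation of each form is an assumption of this example rather than something the answer table states for every case: (a) is taken as a post-decrement by 1, as in the row above, (b) as a pre-increment by 1, (c) as adding the offset before the access, and (d) as subtracting the offset after the access. The function name and operation labels are hypothetical.

```python
def areg_after_access(operation, areg, ofreg):
    """Return (operand_address, areg_after) for one indirect access."""
    if operation == "post-decrement":        # a. use Areg, then Areg - 1
        return areg, areg - 1
    if operation == "pre-increment":         # b. Areg + 1, then use it
        return areg + 1, areg + 1
    if operation == "pre-add-offset":        # c. Areg + Ofreg, then use it
        return areg + ofreg, areg + ofreg
    if operation == "post-subtract-offset":  # d. use Areg, then Areg - Ofreg
        return areg, areg - ofreg
    raise ValueError(f"unknown operation: {operation}")


AREG, OFREG = 0x0200, 0x0010
results = {
    op: areg_after_access(op, AREG, OFREG)
    for op in ("post-decrement", "pre-increment",
               "pre-add-offset", "post-subtract-offset")
}
```

Under these assumptions, Areg ends at 01FFh, 0201h, 0210h, and 01F0h for cases a through d, respectively.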
10) An accumulator is deployed to find the sum of 64, 16bit numbers. The accumulator must
be devoid of overflow error and loss of accuracy. To meet this requirement, how many bits
must the accumulator have?
Initial accumulator width (N) = 16 bits
Number of addition operations (OP) = 64
Final accumulator width = N + log2(OP) = N + log10(OP) / log10(2)
= 16 + log2(64) = 16 + 6
= 22 bits - 7 marks
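The result can be verified in Python by checking the worst case directly. The sketch assumes signed two's-complement numbers, so the largest-magnitude 16-bit value is -32768; the function names are hypothetical.

```python
def sum_accumulator_width(input_bits, num_values):
    """Width needed to add num_values numbers of input_bits each: N + ceil(log2(OP))."""
    return input_bits + (num_values - 1).bit_length()


def fits_signed(value, bits):
    """True if value is representable as a signed two's-complement number of this width."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


worst_case_sum = 64 * -32768   # 64 copies of the most negative 16-bit value
```

The worst-case sum fits in 22 bits but not in 21, which agrees with N + log2(OP) = 16 + 6.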
11) Analyze how the eight functional units of TMS320C6X processor are distributed and
allocated to perform operations among the two data paths A and B.
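The eight functional units are split evenly between the two data paths: .M1, .L1, .S1, and .D1
on side A, and .M2, .L2, .S2, and .D2 on side B. They perform multiply operations (.M), logical
and arithmetic operations (.L), branch, bit manipulation, and arithmetic operations (.S), and
loading/storing and arithmetic operations (.D).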
12) Determine the instructions and the appropriate functional units needed to execute the following
operations in a TMS320C6x processor:
A3 + A7
A7*B7, where A7 and B7 are each 32 bits wide
A3 + A7 :
ADD .L1 A3,A7,A7 ;add A3 + A7 → A7 (accum in A7)
A7*B7 :
MPY .M2 A7,B7,B6 ;multiply 16 LSBs of A7,B7 → B6
|| MPYH .M1 A7,B7,A6 ;multiply 16 MSBs of A7,B7 → A6 - 5 marks
13) The data in 32 bits wide registers A2 and B2 have to be moved to 32bit wide registers A7
and B7. Analyze how this operation can be executed using LDH and LDW instructions.
Identify the differences between these 2 ways of execution.
LDH .D2 *B2++,B7 ;load (B2) → B7, increment B2
|| LDH .D1 *A2++,A7 ;load (A2) → A7, increment A2
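The difference between a half-word load and a word load can be illustrated with a Python model of byte-addressed memory. This is a sketch under stated assumptions: little-endian byte order, sign extension of the 16-bit value by LDH, and a postincrement that advances the pointer by the size of the item loaded. The function names `ldh` and `ldw` simply mirror the instruction mnemonics.

```python
def ldh(memory, address):
    """Model of LDH with postincrement: load 16 bits, sign-extend to 32.

    Returns (register_value, next_address).
    """
    half = int.from_bytes(memory[address:address + 2], "little", signed=True)
    return half & 0xFFFFFFFF, address + 2   # pointer moves by one half-word


def ldw(memory, address):
    """Model of LDW with postincrement: load a full 32-bit word.

    Returns (register_value, next_address).
    """
    word = int.from_bytes(memory[address:address + 4], "little")
    return word, address + 4                # pointer moves by one word


memory = bytes([0x34, 0x12, 0x78, 0x56, 0xFF, 0xFF, 0x00, 0x00])
```

With this memory image, `ldh(memory, 0)` returns only the 16-bit value 0x1234, while `ldw(memory, 0)` returns the whole 32-bit word 0x56781234, so two LDH instructions are needed to cover the data that one LDW brings in.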