TMS320C6X Digital Signal Processors Architecture Programming and Applications
13
INTRODUCTION 13.1
The TMS320C6X DSPs use the VelociTI™ architecture and are the first DSPs to use an advanced VLIW (Very Long
Instruction Word) architecture to achieve high performance through increased instruction parallelism.
This makes the 'C6X DSPs an excellent choice for multichannel and multifunction applications.
The conventional VLIW architecture consists of multiple execution units running in parallel
performing multiple instructions during a single clock cycle. The VelociTI architecture is a highly
deterministic architecture having reduced code size, flexibility of code and data type and zero overhead
in branching.
The TMS320C62X, TMS320C64X and TMS320C67X are the families of DSPs in the 'C6X generation.
The 'C62X and 'C64X devices are fixed-point DSPs and the 'C67X devices are floating-point DSPs. Within the
'C6X generation, the 'C62X and 'C64X processors are code compatible, and the 'C62X and 'C67X processors
are code compatible.
The 'C6X devices execute up to eight 32-bit instructions per cycle with an execution speed of up
to 6000 million instructions per second (MIPS). The 'C6X CPU consists of eight functional units (two
multipliers and six ALUs) and a set of general purpose registers. The CPU of the 'C62X and 'C67X devices
has 32 general purpose registers of 32-bit size, whereas the 'C64X devices have 64 general purpose
registers of 32-bit size.
CPU 13.4
The central processing unit of a 'C6X device is 32 bits wide. The block diagram of the 'C6X CPU is given in
Fig. 13.2. The CPU contains the following units:
(a) Program fetch unit
(b) Instruction dispatch unit
(c) Instruction decode unit
(d) Two data paths, each consisting of four functional units
(e) Register file for each data path
(f) Control registers
(g) Control logic
(h) Test, emulation and interrupt logic
The functional units shaded in Fig. 13.2 are common to all 'C6X devices. The 'C6X CPU is based on an
advanced VLIW architecture, which accepts eight 32-bit instructions (the instruction fetch packet size is
256 bits) at a time. The program fetch unit generates the addresses of the eight instructions and sends
them to the program memory for each fetch packet. Once the program memory has been read, the fetch
packet is received at the CPU.
Fig. 13.2 CPU Unit of TMS320C6X DSP
The instruction dispatch unit receives the fetch packet and splits it into execute packets. The
instructions in an execute packet (up to eight instructions) are assigned to the appropriate functional
units in the data paths. During instruction decode, the source registers, destination registers and
associated paths are decoded for the execution of the instructions in the functional units. Finally the
instructions are executed by the functional units.
The register files (A and B) of all the 'C6X devices except the 'C64X contain 32 registers of 32-bit size
(16 registers for each data path). The 'C64X register files have 64 registers of 32-bit size, with
32 registers for each data path.
The 'C6X CPU contains eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1 and .D2): six arithmetic
and logic units and two multipliers. These functional units can be divided into two groups of four, one
for each data path. The .L, .S and .D units are arithmetic and logic units (ALUs), and the .M unit is a
multiplier unit. The two data paths have almost identical functional units.
Table 13.2 Functional units of 'C6X and its fixed point operations
11 Reserved
Floating-point auxiliary configuration register    FAUCR   Specifies underflow mode, rounding mode, not a number (NaN) and other exceptions for the .S unit
Floating-point multiplier configuration register   FMCR    Specifies underflow mode, rounding mode, not a number (NaN) and other exceptions for the .M unit
Review Questions
13.1 What is VLIW architecture?
13.2 List the processors in 'C6X family. Which processors are code compatible?
13.3 What are the blocks present in the CPU of 'C6X?
13.4 What is register file? What is the size of register files in 'C6X processors?
13.5 What is 'C6X register pair? Explain its use.
13.6 List the various functional units in 'C6X CPU.
13.7 What are the functions performed by .L unit?
13.8 Explain the functions performed by .S unit.
13.9 What are the different multiply operations performed by .M unit?
13.10 List the functions performed by .D unit.
13.11 How many data paths are in 'C6X register file?
13.12 What is register file cross path? What is its use?
13.13 How many data paths are in 'C6X to access memory? What is its size?
13.14 List the control registers common to 'C6X family of processors.
13.15 What are the fields in the addressing mode register? Explain the functions of each field.
13.16 List the additional control registers in 'C64X and 'C62X processors.
Architecture of TMS320C6X 369
Table 14.2 (Contd.)
                        SSHL                    Shift left with saturation operation
Branch operations       B disp                  Branch operation using a displacement
                        B reg                   Branch operation using a register
                        B NRP                   Branch operation using NMI return pointer
                        B IRP                   Branch operation using interrupt return pointer
Move operations         MV                      Move from register to register operation (pseudo-operation)
                        MVC                     Move between control file and the register file operation
                        MVK                     Move a 16-bit signed constant into a register and sign extend
                        MVKH/MVKLH              Move the upper/lower 16 bits of a constant into the upper bits of a register
Other operations        CLR                     Clear a bit field operation
                        EXT/EXTU                Extract and sign-extend/zero-extend a bit field operation
                        SET                     Set a bit field operation
                        ZERO                    Zero a register (pseudo-operation)
Load store operations   LDB/LDBU/LDH/LDHU/LDW   Load byte/half word/word from memory with 5-bit/15-bit unsigned constant offset or register offset
Example 14.1 ADD .L1 A1,A2,A3 – This instruction adds the hexadecimal signed integer operands
in register A1 and A2. The result is stored in register A3. The content of register A1
and A2 are unchanged. The functional unit used is .L1 and the registers of path A are used for both source
and destination operands.
Before execution After execution
A1 11223344 A1 11223344
A2 33445566 A2 33445566
A3 22222222 A3 446688AA
In the above example, the functional units .S1 and .D1 can also be used to perform the add operation. For
the source and destination operands, registers from register file A alone are to be used. In the same way,
to do the add operation in register path B, the functional units .L2, .S2 and .D2 are used. The source and
destination operand registers are to be used only from register file B (registers B0-B15). For the
arithmetic and logic instructions, the same register of the register file can be specified as both a
source operand and the destination operand.
Example 14.2 ADD .S2 B1,B2,B2 – This instruction adds the hexadecimal signed integer operands
in register B1 and B2. The result is stored in register B2 itself after addition. The
content of register B1 is unchanged. The functional unit used is .S2 and the registers of path B are used
for both source and destination operands.
Before execution After execution
B1 3456789A B1 3456789A
B2 11112222 B2 45679ABC
The data path of the 'C6X architecture has cross paths between path A and path B (1X and 2X). A cross path
is used to access one of the source operands from the opposite path. The destination operand cannot use
the cross path.
Example 14.3 ADD .L1X A1,B2,A2 – This instruction adds the hexadecimal signed integer operands
in register A1 and B2. The result is stored in register A2. The content of register A1
and B2 are unchanged. The functional unit used is .L1 and the registers of path A are used for the source
operand (A1) and destination operand (A2). The source operand B2 is obtained through cross path from
register file B.
Before execution After execution
A1 22221111 A1 22221111
B2 33332222 B2 33332222
A2 44444444 A2 55553333
Table 14.5 Address generation Option for Mode field in Linear addressing mode
Mode field Syntax Address modification performed
*+baseR[offsetR/ucst5] Positive offset from baseR specified by offsetR/ucst5
*-baseR[offsetR/ucst5] Negative offset from baseR specified by offsetR/ucst5
*++baseR[offsetR/ucst5] Pre-increment from baseR specified by offsetR/ucst5
*--baseR[offsetR/ucst5] Pre-decrement from baseR specified by offsetR/ucst5
*baseR++[offsetR/ucst5] Post increment from baseR specified by offsetR/ucst5
*baseR--[offsetR/ucst5] Post decrement from baseR specified by offsetR/ucst5
The offset value specified in the offset register (offsetR) or the 5-bit unsigned constant given in the
instruction is left shifted by 0, 1 or 2 for the byte, halfword and word access instructions respectively.
Then, to find the address of the operand the following procedure is used:
(i) The shifted offset value is added or subtracted from the value in the base register (baseR) for
*+ or *- mode fields respectively. The added or the subtracted value from the content of the
base register is the address of the operand to be accessed from memory. The content of the base
register is unchanged.
(ii) For the *++ or *-- mode fields, the content of the base register is first incremented or
decremented by the shifted offset value, before the memory is accessed (pre-increment/
pre-decrement). The updated content of the base register is the address of the operand.
(iii) In the case of *baseR++ or *baseR--, the address of the operand is the original content of the
base register. After the memory is accessed, the content of the base register is incremented or
decremented by the shifted offset value (post increment/post decrement).
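The three cases above can be sketched in Python as a small model. This is not 'C6X code: the function name `generate_address` and the mode labels are hypothetical, chosen to mirror the syntax column of Table 14.5, and the sketch covers only linear addressing.

```python
# Left shift applied to the offset for each access size (from the text above).
SHIFT = {"byte": 0, "halfword": 1, "word": 2}
MASK32 = 0xFFFFFFFF


def generate_address(mode, base, offset, access="word"):
    """Return (address_used, new_base) for one load/store access.

    mode is one of "*+", "*-", "*++", "*--", "*R++", "*R--"
    (illustrative labels, not assembler syntax).
    """
    step = offset << SHIFT[access]
    if mode == "*+":                      # offset only, base unchanged
        return (base + step) & MASK32, base
    if mode == "*-":
        return (base - step) & MASK32, base
    if mode == "*++":                     # pre-increment: base updated first
        new_base = (base + step) & MASK32
        return new_base, new_base
    if mode == "*--":                     # pre-decrement
        new_base = (base - step) & MASK32
        return new_base, new_base
    if mode == "*R++":                    # post-increment: old base is the address
        return base, (base + step) & MASK32
    if mode == "*R--":                    # post-decrement
        return base, (base - step) & MASK32
    raise ValueError("unknown mode: " + mode)


# Base 500h with a word offset of 1: address 504h, base unchanged.
print([hex(v) for v in generate_address("*+", 0x500, 1)])
```

Returning the address and the new base as a pair keeps the pre- and post-modification cases easy to compare side by side.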
Example 14.4 LDW .D1 *+A0[1],A1 – This instruction loads a word from memory into register A1. The
address of the memory is the base address value in register A0 added to the 5-bit constant offset given
in brackets, left shifted by two bits. If the base address is 500h, the given offset 1 left shifted by
two bits is 4, and the address of the memory to be accessed is 504h. The content of A0 is unchanged
after the access.
Before execution After execution
A0 00000500 A0 00000500
A1 11111111 A1 3456789A
504h 3456789A 504h 3456789A
Example 14.5 LDW .D1 *++A0[A4],A1 – This instruction loads a hexadecimal word from memory
to register A1. The address of the memory is the base address value in register A0
added with the content of offset register A4 given in brackets left shifted by two times. If the base
address is 500h, the content of offset register A4 is say 4, then it is left shifted by two is 10h (16). The
content of A0 is incremented to 510h before accessing the memory. The address of the memory to be
accessed is 510h. The content of offset register A4 is unchanged after access.
Before execution After execution
A0 00000500 A0 00000510
A4 00000004 A4 00000004
A1 34587698 A1 55667788
510h 55667788 510h 55667788
Example 14.6 LDW .D1 *A0++[2],A1 – This instruction loads a word from memory into register A1. The
address of the memory is the base address value in register A0. After accessing the memory, the new
address in register A0 is the content of A0 added to the offset given in brackets, left shifted by two
bits. If the base address is 500h, the address of the memory to be accessed is 500h. The given offset 2
left shifted by two bits is 8h, so the new address in A0 is the register value A0 plus the left shifted
value, i.e. 508h.
Before execution After execution
A0 00000500 A0 00000508
A1 76234589 A1 99887766
500h 99887766 500h 99887766
Example 14.7 For circular addressing mode, register A4 is used. To specify the block size, the BK0
field in the AMR register is used. The two-bit mode field for A4 is 01 and the 5-bit field to specify the
block size in BK0 is 01, hence the control word for AMR is 00010001h. The size of the block is
2^(1+1) = 4 bytes. If the starting address of the memory is 0x0100, the circular buffer boundary is from
0x0100 to 0x0103. The content of memory locations 0100h-0103h is 44332211h.
MVK .S1 0x0001,A0 ; move the two-bit mode field value to the lower 16 bits of A0
MVKLH .S1 0x0001,A0 ; move the 5-bit BK0 value to the upper 16 bits of A0
MVC .S2X A0,AMR ; move the control word from A0 to the AMR register
MVK .S1 0x0100,A4 ; load register A4 with the start address of the buffer (0x0100)
LDB .D1 *A4++[1],A1 ; load the byte at the memory address pointed to by A4 into A1, then increment A4 by one
NOP 4 ; four no-operation cycles follow the load
Before executions After execution
A4 00000100 A4 00000101
A1 00000000 A1 00000011
LDB .D1 *A4++[1], A1 A4 00000101 A4 00000102
NOP 4 A1 00000011 A1 00000022
LDB .D1 *A4++[1], A1 A4 00000102 A4 00000103
NOP 4 A1 00000022 A1 00000033
LDB .D1 *A4++[1], A1 A4 00000103 A4 00000100
NOP 4 A1 00000033 A1 00000044
In this example, the memory address increments by one location for each load byte instruction. Once
it reaches the end of the buffer (0x0103), the next content of A4 is 0x0100. The data access happens
circularly between address locations 0x0100 and 0x0103.
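The wrap-around in this example can be reproduced with a short Python model. It assumes, as in the example, that the buffer starts on a block-size boundary and that the block size is a power of two, so only the low address bits change. The helper names are hypothetical, and the memory contents are the bytes of 44332211h stored with the least significant byte at 0100h.

```python
def block_size_from_bk(bk):
    """Block size in bytes for a 5-bit BK field value N: 2**(N + 1)."""
    return 1 << (bk + 1)


def circular_step(addr, step, block_size):
    """Advance addr by step, wrapping inside its block_size-aligned block."""
    mask = block_size - 1
    # The upper address bits stay fixed; only the low bits wrap.
    return (addr & ~mask) | ((addr + step) & mask)


# Bytes of 44332211h at 0100h-0103h, least significant byte first.
memory = {0x100: 0x11, 0x101: 0x22, 0x102: 0x33, 0x103: 0x44}

a4 = 0x100
loaded = []
for _ in range(5):                 # five LDB *A4++[1] accesses
    loaded.append(memory[a4])
    a4 = circular_step(a4, 1, block_size_from_bk(1))

print([hex(b) for b in loaded], hex(a4))
```

The fifth load reads location 0100h again, which is the circular behaviour the table shows for the fourth post-increment.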
The syntax of the circular addressing mode instruction for add with addressing mode and subtract
with addressing mode case is given below.
mnemonic .unit src2, src1, dst
The mnemonic field can be add and subtract with addressing mode instructions given in Section
14.2.2. The source operand src2 should be registers A4-A7 and B4-B7 of the respective data paths. The
source operand src1 can be any register in the register file and the destination operand dst should be the
same register used for source operand src2.
The content of source operand src1 in the instruction is left shifted by 0, 1 or 2 bits for the byte,
halfword and word instructions respectively. The shifted content of src1 is added to or subtracted from
the content of src2. If the result exceeds the circular buffer boundary, it is wrapped around within the
buffer size. The result is available in the destination register dst, so the content of src2 is always
within the circular buffer.
Example 14.8 For circular addressing mode, register B5 is used. To specify the block size, the BK1
field in the AMR register is used. The two-bit mode field for B5 is 10 and the 5-bit field to specify the
block size in BK1 is 03, hence the control word for AMR is 00600800h. The size of the block is
2^(3+1) = 16 bytes. If the starting address of the memory is 0x0100, the circular buffer boundary is from
0x0100 to 0x010F.
MVK .S2 0x0800,B0 ; move the two-bit mode field value to the lower 16 bits of B0
MVKLH .S2 0x0060,B0 ; move the 5-bit BK1 value to the upper 16 bits of B0
MVC .S2 B0,AMR ; move the control word from B0 to the AMR register
MVK .S2 0x0100,B5 ; load register B5 with the start address of the buffer (0x0100)
MVK .S2 0x0002,B1 ; load register B1 with the value 02h
ADDAH .D2 B5,B1,B5 ; the content of B1 left shifted by one bit (04h) is added to the content of B5 (0100h) and the result (0104h) is stored in B5; the content of B1 is unchanged
Before execution After execution
B5 00000100 B5 00000104
ADDAH .D2 B5,B1,B5 B5 00000104 B5 00000108
ADDAH .D2 B5,B1,B5 B5 00000108 B5 0000010C
ADDAH .D2 B5,B1,B5 B5 0000010C B5 00000100
ADDAH .D2 B5,B1,B5 B5 00000100 B5 00000104
In this example, the content of B5 increments by a value of 04h for each time ADDAH instruction is
executed. Once the content of B5 exceeds the end value of the circular buffer 0x010Fh, it is wrapped
around to the first value 0x0100h. The register B5 content increments four values within the circular
buffer size.
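As a cross-check of the two AMR control words and of the ADDAH sequence, here is a minimal Python sketch. The bit positions are an assumption inferred from the values 00010001h and 00600800h used in Examples 14.7 and 14.8 (two-bit mode fields for A4-A7 and B4-B7 starting at bit 0, BK0 at bit 16, BK1 at bit 21). The function names are hypothetical.

```python
# Assumed bit position of each register's two-bit mode field in AMR.
MODE_SHIFT = {"A4": 0, "A5": 2, "A6": 4, "A7": 6,
              "B4": 8, "B5": 10, "B6": 12, "B7": 14}
BK0_SHIFT, BK1_SHIFT = 16, 21   # assumed positions of the block-size fields


def amr_word(modes, bk0=0, bk1=0):
    """Build an AMR control word from {register: mode} plus BK0/BK1 values."""
    word = (bk0 << BK0_SHIFT) | (bk1 << BK1_SHIFT)
    for reg, mode in modes.items():
        word |= mode << MODE_SHIFT[reg]
    return word


def addah_circular(base, src1, block_size):
    """Model ADDAH in circular mode: add src1 << 1 and wrap inside the block."""
    mask = block_size - 1
    return (base & ~mask) | ((base + (src1 << 1)) & mask)


print(hex(amr_word({"A4": 0b01}, bk0=1)))   # control word of Example 14.7
print(hex(amr_word({"B5": 0b10}, bk1=3)))   # control word of Example 14.8

b5, trace = 0x100, []
for _ in range(5):
    b5 = addah_circular(b5, 2, 16)
    trace.append(b5)
print([hex(v) for v in trace])
```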
Example 14.9 MV .S1 A1,A2 – Move register to register instruction. The content of register A1 is
moved to register A2, the content of register A1 is unchanged and the functional
unit used is .S1
Before executions After execution
A1 22334455 A1 22334455
A2 20408754 A2 22334455
MV .L1X B1,A3 – Move register to register instruction using the cross path. The content of register B1 is
moved to register A3, the content of B1 is unchanged and the functional unit used is .L1
Before executions After execution
A3 30504321 A3 547698AB
B1 547698AB B1 547698AB
Example 14.10 MVC .S2X A1,AMR – Move value between control register file and register file
instruction. The content of register A1 is moved to the address mode register (AMR) in the control
register file through the cross path, the content of register A1 is unchanged and the functional unit
used is .S2
Before executions After execution
A1 00020005 A1 00020005
AMR 00400001 AMR 00020005
MVC .S2 AMR,B2 – The content of AMR is moved to register B2, the content of AMR is unchanged and the
functional unit used is .S2
Before executions After execution
AMR 00020005 AMR 00020005
B2 20408754 B2 00020005
Example 14.11 MVK .S1 0x1223,A1 – Move a 16-bit signed constant into a register and sign-extend it.
The 16-bit constant 1223h is moved to the lower 16 bits of register A1 and, because the constant is
positive, the upper 16 bits are filled with zeros. The functional unit used is .S1
Before execution After execution
A1 00020005 A1 00001223
MVK .S2 -0x012,B2 – The negative 16-bit constant -012h is moved to the lower 16 bits of register B2 and
the sign bit is extended to the MSBs. The 2's complement value of 012h (FFEEh) appears in the lower
16 bits of register B2. The MSBs are sign extended and the functional unit used is .S2
Before execution After execution
B2 00050002 B2 FFFFFFEE
Example 14.12 MVKLH .S1 0x3344,A2 – Move the lower 16 bits of the constant to the upper 16 bits of a
register in the register file. The 16-bit constant 3344h is moved to the upper 16 bits of register A2,
the lower 16 bits are unchanged and the functional unit used is .S1
Before execution After execution
A2 00220055 A2 33440055
MVKH .S2 0x44552233,B2 – The upper 16-bit of the 32-bit constant is moved to upper 16-bit of register B2.
The upper 16-bit value 4455h is moved to register B2 upper 16-bit, lower 16-bit are unchanged. The
functional unit used is .S2
Before executions After execution
B2 20404252 B2 44554252
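The three constant-move instructions differ only in which half of the register they write and in whether they sign-extend. A hedged Python model follows; the function names are hypothetical and the behaviour is taken from the descriptions in Table 14.2 and in the examples above.

```python
MASK32 = 0xFFFFFFFF


def mvk(cst16):
    """MVK: load a 16-bit signed constant and sign-extend it to 32 bits."""
    value = cst16 & 0xFFFF
    if value & 0x8000:            # negative constant: extend the sign bit
        value -= 0x10000
    return value & MASK32


def mvkh(cst32, reg):
    """MVKH: upper 16 bits of the constant go to the upper half of reg."""
    return (cst32 & 0xFFFF0000) | (reg & 0xFFFF)


def mvklh(cst16, reg):
    """MVKLH: lower 16 bits of the constant go to the upper half of reg."""
    return ((cst16 & 0xFFFF) << 16) | (reg & 0xFFFF)


print(hex(mvk(0x1223)), hex(mvk(-0x12)))
print(hex(mvklh(0x3344, 0x00220055)), hex(mvkh(0x44552233, 0x20404252)))

# MVK followed by MVKH builds a full 32-bit constant in one register.
full = mvkh(0x44552233, mvk(0x2233))
print(hex(full))
```

Because MVK overwrites the whole register while MVKH and MVKLH keep the lower half, the MVK must come first when a 32-bit constant is assembled from two moves.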
Example 14.13 LDB .D1 *A0,A1 – Load byte instruction. The byte content of the memory location
whose address is present in base address register A0 is loaded into register A1, and the sign bit is
extended to the MSBs of register A1. The memory address is 100h and the byte content of location 100h
is 44h. The value 44h is moved to the LSB of register A1 and, because the sign bit of 44h is 0, the MSBs
are filled with zeros. The contents of register A0 and location 100h are unchanged; the functional unit
used is .D1
Before executions After execution
A1 11111111 A1 00000044
A0 00000100 A0 00000100
100h 11223344 100h 11223344
TMS320C6X Assembly Language Instructions 381
LDH .D1 *+A0[2],A2 – Load half-word instruction with positive offset. To calculate the address of the
memory to be accessed, the 5-bit constant offset given in the instruction is left shifted once and added
to the content of base register A0. The content of register A0 is unchanged. The half-word content of the
memory address is moved to register A2. The offset value 2 left shifted once is 4; the content of the base
register is 100h, hence the address of the memory is 104h. The half-word content of memory (104h), 8899h,
is loaded into the lower 16 bits of register A2 and the MSBs are sign extended as FFFFh. The functional
unit used is .D1
Before execution After execution
A2 11111111 A2 FFFF8899
A0 00000100 A0 00000100
104h 44558899 104h 44558899
LDW .D1 *++A0[1],B2 – Load word instruction with pre-increment. The 5-bit constant offset given in the
instruction is left shifted two times and added to base register content A0. The content of base register
A0 is pre-incremented and it is the address of the memory to be accessed. The word content of memory
address is moved to register B2. The offset value 1 left shifted twice is 4. If the content of base register
A0 is 100h, the new content of A0 is 104h and the memory address is also 104h. The word content of
memory (104h) 44558899h is loaded into register B2. The content of 104h location is unchanged, the
functional units used is .D1 along with the cross path.
Before executions After execution
B2 11111111 B2 44558899
A0 00000100 A0 00000104
104h 44558899 104h 44558899
STW .D2 B3,*B1--[B0] – Store word instruction with post-decrement. The content of register B3 is stored
in memory. The address of the operand is the content of base register B1 and the offset is specified in
offset register B0. After the store, the content of offset register B0, left shifted by two bits, is
subtracted from the content of base register B1 to give the new content of B1. If the content of register
B3 is 00004578h and the content of base register B1 is 100h, the content of register B3 is stored in
memory location 100h, pointed to by register B1. The content of offset register B0 is 1h, which left
shifted by two bits is 4h; subtracting this from 100h gives 0FCh, the new content of B1. The contents of
B0 and B3 are unchanged and the functional unit used is .D2
Before executions After execution
B3 00004578 B3 00004578
B1 00000100 B1 000000FC
B0 00000001 B0 00000001
100h 11223344 100h 00004578
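The sign extension performed by LDB and LDH can be modelled as below. The sketch assumes little-endian byte order, which is what the example values imply (the byte at 100h of the word 11223344h is 44h). The function name `load` and its parameters are hypothetical.

```python
def load(memory_word, byte_offset, size, signed=True):
    """Extract `size` bytes at `byte_offset` from a little-endian 32-bit word.

    signed=True models LDB/LDH (sign extension to 32 bits);
    signed=False models LDBU/LDHU (zero extension).
    """
    bits = 8 * size
    raw = (memory_word >> (8 * byte_offset)) & ((1 << bits) - 1)
    if signed and raw & (1 << (bits - 1)):
        raw -= 1 << bits                  # propagate the sign bit
    return raw & 0xFFFFFFFF


print(hex(load(0x11223344, 0, 1)))                 # LDB from 100h
print(hex(load(0x44558899, 0, 2)))                 # LDH from 104h
print(hex(load(0x44558899, 0, 2, signed=False)))   # LDHU from 104h
```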
Example 14.14 ADD .D1 31,A0,A1 – 5-bit unsigned constant (0 to 31) add instruction. The given 5-bit
constant is added to the content of register A0 and the result is stored in register A1. The 5-bit
constant 31 is added to the content of register A0, 00008754h, and the result 00008773h is loaded in
register A1. The functional unit used is .D1; the content of register A0 is unchanged.
Before executions After execution
A0 00008754 A0 00008754
A1 00020005 A1 00008773
Example 14.15 ADD .L1 A0,A1,A2 – Signed 32-bit integer add instruction. The signed integer
content of register A0 and A1 are added, the result is stored in register A2. If the
32-bit positive integers 00045566h in A0 and 00076655h in A1, they are added, the result 000BBBBBh is
stored in register A2. The content of registers A0 and A1 are unchanged, the functional unit used is .L1
Before executions After execution
A0 00045566 A0 00045566 +284006
A1 00076655 A1 00076655 +484949
A2 12348765 A2 000BBBBB +768955
ADD .S2X A0,B1,B2 – If the 32-bit positive integer in register A0 is 00045566h and the negative integer in
register B1 is FFFFC742h, they are added and the result 00041CA8h is stored in register B2. The content
of registers A0 and B1 are unchanged, the functional unit used is .S2 with the cross path.
Before executions After execution
A0 00045566 A0 00045566 +284006
B1 FFFFC742 B1 FFFFC742 - 14526
B2 12348765 B2 00041CA8 +269480
Example 14.16 SADD .L1 A1,A2,A3 – Signed integer add instruction with saturation. The signed
integer contents of registers A1 and A2 are added and, if there is no saturation, the result is stored in
register A3. If the result saturates, the maximum positive number (7FFF FFFFh) is loaded in register A3
on positive overflow and the maximum negative number (8000 0000h) on negative overflow, and the SAT bit
in the CSR register is set. The functional unit used is .L1; the contents of registers A1 and A2 are
unchanged.
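Since no numeric case is given for SADD, the following Python sketch illustrates the saturation rule. The names are hypothetical, and the returned flag only reports whether this one addition saturated; it does not model the CSR register itself.

```python
def to_signed32(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def sadd(a, b):
    """Return (result, saturated) for a saturating signed 32-bit add."""
    total = to_signed32(a) + to_signed32(b)
    if total > 0x7FFFFFFF:                 # positive overflow
        return 0x7FFFFFFF, 1
    if total < -0x80000000:                # negative overflow
        return 0x80000000, 1
    return total & 0xFFFFFFFF, 0


print(sadd(0x11223344, 0x33445566))   # no saturation
print(sadd(0x7FFFFFFF, 0x00000001))   # clamps to the maximum positive value
print(sadd(0x80000000, 0xFFFFFFFF))   # clamps to the maximum negative value
```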
Example 14.17 ADDU .L1 A5,A6,A9:A8 – Unsigned 32-bit add instruction. The unsigned 32-bit
contents in register A5 and register A6 are added and the resultant 40-bit content
is loaded in A9:A8 register pair. If the 32-bit integer in register A5 is 00087654h and in register A6 is
FFFF4332h, both of them are added and the resultant 40 bit content 10007B986h is loaded in A9:A8
register pair. The functional unit used is .L1, the register contents A5 and A6 are unchanged.
Before executions After execution
A5 00087654 A5 00087654 +554580
A6 FFFF4332 A6 FFFF4332 +4294918962
A9:A8 00020005 00020005 A9:A8 00000001 0007B986 +4295473542
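A minimal Python check of the 40-bit result, with the register pair modelled as a hypothetical (upper, lower) tuple standing for A9:A8:

```python
def addu(a, b):
    """Unsigned 32-bit add with a 40-bit result, returned as (upper 8 bits, lower 32 bits)."""
    total = (a & 0xFFFFFFFF) + (b & 0xFFFFFFFF)
    return (total >> 32) & 0xFF, total & 0xFFFFFFFF


upper, lower = addu(0x00087654, 0xFFFF4332)
print(hex(upper), hex(lower))
```

The carry out of bit 31 lands in the odd register of the pair, which is why a plain 32-bit destination would lose it.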
Example 14.18 ADDK .S1 2345,A1 – A 16-bit signed constant add instruction. The signed 16-bit
constant given in the instruction is added with the content of register A1 and the
result stored in A1. If the 16-bit positive constant is 2345 (0929h), the content in register A1 is 00015432h,
they are added, the result 00015D5Bh is stored in register A1. The functional unit used is .S1
Before executions After execution
A1 00015432 A1 00015D5B
ADD2 .S2 B1,B2,B3 – Two 16-bit integer add instruction on upper and lower register halves. The upper
and lower halves content of register B1 are added to the upper and lower halves content of register B2,
the result is stored in upper and lower halves of register B3 respectively. If the content of register B1 is
00347698h, register B2 is 03127654h, the upper and lower halves are added, the result 0346ECECh stored
in register B3. The content of registers B1 and B2 are unchanged, the functional unit used is .S2.
Before executions After execution
B1 00347698 B1 00347698
B2 03127654 B2 03127654
B3 00000544 B3 0346ECEC
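The point of ADD2 is that the two halves are added independently, so a carry out of the lower half does not reach the upper half. A short Python sketch (hypothetical function name) makes this visible:

```python
def add2(a, b):
    """Add the upper and lower 16-bit halves of a and b independently."""
    upper = ((a >> 16) + (b >> 16)) & 0xFFFF
    lower = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF   # carry is discarded
    return (upper << 16) | lower


print(hex(add2(0x00347698, 0x03127654)))
# A lower-half overflow does not disturb the upper half:
print(hex(add2(0x0001FFFF, 0x00000001)))
```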
Example 14.19 SUBC .L1 A1,A2,A3 – Conditional subtract and shift operation. The content of
register A2 is subtracted from the content of register A1. If the subtraction result is ≥ 0, then the
result is left shifted by one bit, 1 is added to the LSB and the final value is loaded in register A3.
Otherwise (the subtraction result is less than zero), the content of register A1 is left shifted by one
bit and the shifted value is loaded in register A3.
(i) If the register content A2 is 00000404h and A1 is 00002222h, the A2 content is subtracted from A1
content. The result 00001E1E which is > 0 is left shifted by 1 bit (00003C3C) and 1 is added to LSB bit,
the final result 00003C3D is loaded in register A3.
Before executions After execution
A1 00002222 A1 00002222
A2 00000404 A2 00000404
A3 12243333 A3 00003C3D
(ii) If the register content A2 is 00002424h and A1 is 00002222h, the A2 content is subtracted from A1 con-
tent. The result is less than zero. The content of register A1 is left shifted by 1 bit, the result 00004444h
is loaded in register A3. The content of register A1 and A2 are unchanged and the unit used is .L1.
Before executions After execution
A1 00002222 A1 00002222
A2 00002424 A2 00002424
A3 12243333 A3 00004444
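Both cases of the conditional subtract can be checked with a few lines of Python. This is a sketch of the rule as stated in the example, with a hypothetical function name; the operands are treated as non-negative values.

```python
def subc(src1, src2):
    """Conditional subtract and shift, following the rule in Example 14.19."""
    diff = src1 - src2
    if diff >= 0:
        return ((diff << 1) + 1) & 0xFFFFFFFF   # shift the difference, set the LSB
    return (src1 << 1) & 0xFFFFFFFF             # otherwise shift src1 only


print(hex(subc(0x2222, 0x0404)))   # case (i)
print(hex(subc(0x2222, 0x2424)))   # case (ii)
```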
Example 14.20 MPYU .M1 A1,A2,A3 – Unsigned integer multiply instruction. The Unsigned 16-bit
number present in 16 LSBs of registers A1 and A2 are multiplied and the result is
stored in register A3. If the content of register A1 is 56003442h and register A2 is 23451122h, the 16 LSBs
are multiplied and the result 037F52C4h is stored in register A3. The content of register A1 and A2 are
unchanged and the functional unit used is .M1
Before executions After execution
A1 56003442 13378 A1 56003442 13378 16 LSB value
A2 23451122 4386 A2 23451122 4386 16 LSB value
A3 00007689 A3 037F52C4 58675908 Product value
MPYHL .M1 A4,A5,A6 – Signed integer multiply instruction on 16 MSBs and 16 LSBs of registers. The signed
16-bit number present in 16 MSBs of registers A4 and 16 LSBs of register A5 are multiplied and the result
is stored in register A6. If the content of register A4 is FFA13344h and register A5 is 48480044h, the 16
MSBs (FFA1h) and 16 LSBs (0044h) are multiplied and the result FFFFE6C4h is stored in register A6. The
content of register A4 and A5 are unchanged and the functional unit used is .M1
Before executions After execution
A4 FFA13344 (-95) A4 FFA13344 (-95) 16 MSB value
A5 48480044 (68) A5 48480044 (68) 16 LSB value
A6 00560544 A6 FFFFE6C4 (-6460) Product value
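The 16 x 16 multiplies differ in which halves they take and whether those halves are read as signed. A small Python model, with hypothetical helper names, reproduces the two products above:

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mpyu(a, b):
    """Unsigned multiply of the 16 LSBs of a and b."""
    return ((a & 0xFFFF) * (b & 0xFFFF)) & 0xFFFFFFFF


def mpyhl(a, b):
    """Signed multiply of the 16 MSBs of a by the 16 LSBs of b."""
    return (s16(a >> 16) * s16(b)) & 0xFFFFFFFF


print(hex(mpyu(0x56003442, 0x23451122)))
print(s16(0xFFA13344 >> 16), hex(mpyhl(0xFFA13344, 0x48480044)))
```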
Example 14.21 SMPYLH .M1 A1,A2,A3 – Integer multiply with left shift and saturation instruction.
The signed number in 16 LSBs of register A1 and 16 MSBs of register A2 are
multiplied and the result is left shifted by one bit and stored in register A3. If the left shifted result is
8000 0000h, then the result is saturated to 7FFF FFFFh. If the content of register A1 is F023 3344h,
register A2 is 8787 4A81h, the 16 LSBs (3344h) and 16 MSBs (8787h) are multiplied and the result E7DF
E4DCh is left shifted by one bit and the value CFBF C9B8h is stored in register A3. The content of register
A1 and A2 are unchanged and the functional unit used is .M1
Before executions After execution
A1 F0233344 13124 A1 F0233344 16 LSB value
A2 87874A81 -30841 A2 87874A81 16 MSB value
A3 00007689 A3 CFBFC9B8 -809514568
MPY .M1 14,A1,A2 – Signed 5-bit constant multiply instruction on 16 LSBs of register. The signed 5-bit
number in the instruction is multiplied with 16 LSBs of registers A1 and the result is stored in register
A2. If the content of register A1 is 2131 3344h, the 16 LSBs (3344h) and 14 (Eh) are multiplied and the
result 0002 CDB8h is stored in register A2. The content of register A1is unchanged and the functional
unit used is .M1
Before executions After execution
A1 21313344 A1 21313344
A2 48480044 A2 0002CDB8
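The saturating multiply and the constant multiply of this example can be sketched the same way. The function names are hypothetical, and `mpy_const` simply treats its first argument as a small signed integer rather than encoding a 5-bit field.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def smpylh(a, b):
    """Signed 16 LSBs of a times 16 MSBs of b, shifted left by one, with saturation."""
    product = (s16(a) * s16(b >> 16)) << 1
    if product == 0x80000000:       # only (-32768) * (-32768) produces this value
        return 0x7FFFFFFF
    return product & 0xFFFFFFFF


def mpy_const(cst, a):
    """Signed constant times the signed 16 LSBs of a."""
    return (cst * s16(a)) & 0xFFFFFFFF


print(hex(smpylh(0xF0233344, 0x87874A81)))
print(hex(smpylh(0x00008000, 0x80000000)))   # the saturating case
print(hex(mpy_const(14, 0x21313344)))
```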
Example 14.22 AND .L1 A1,A2,A3 – Bitwise AND operation instruction. The bitwise AND operation
is performed between the contents of register A1 and A2. The result is placed in
register A3. If signed 5-bit constant is used as operand, the sign is extended to 32 bits. If the content of
register A1 is 7367 5454h, register A2 is 8282 7676h, the bitwise AND operation between the register
contents 0202 5454h is loaded in register A3. The functional unit used is .L1, the register content A1 and
A2 are unchanged.
Before executions After execution
A1 73675454 A1 73675454
A2 82827676 A2 82827676
A3 11224509 A3 02025454
Table 14.11 Logical, Compare and Shift Instructions of 'C6X processor
Instruction     Functional unit          Description
NOT             .L1 or .L2, .S1 or .S2   Bitwise NOT (pseudo-operation)
AND             .L1 or .L2, .S1 or .S2   Bitwise AND
OR              .L1 or .L2, .S1 or .S2   Bitwise OR
NEG             .L1 or .L2, .S1 or .S2   Negate (pseudo-operation)
SHL             .S1 or .S2               Arithmetic shift left
SHR             .S1 or .S2               Arithmetic shift right
SHRU            .S1 or .S2               Logical shift right
SSHL            .S1 or .S2               Shift left with saturation
CMPEQ           .L1 or .L2               Integer compare for equality
CMPGT/CMPGTU    .L1 or .L2               Signed/Unsigned integer compare for greater than
CMPLT/CMPLTU    .L1 or .L2               Signed/Unsigned integer compare for less than
Example 14.23 SHR .S2 B1,B2,B3 – Arithmetic shift right instruction. The content of register B1
is right shifted by n-bits specified in register B2, the result is stored in register
B3. If the content of register B1 is 7367 5454h, register B2 is 0012h, the content of B1 is right shifted by
the content of B2 (12h=18bits) times, the result is stored in register B3. The functional unit used is .S2,
the register content B1 and B2 are unchanged.
Before executions After execution
B1 73675454 B1 73675454
B2 00000012 B2 00000012
B3 00020005 B3 00001CD9
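The difference between the arithmetic shift (SHR) and the logical shift (SHRU) only shows for negative operands. A Python sketch, with hypothetical function names and shift counts assumed to lie between 0 and 31:

```python
def shr(value, count):
    """Arithmetic shift right: the sign bit is copied into the vacated bits."""
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> count) & 0xFFFFFFFF


def shru(value, count):
    """Logical shift right: the vacated bits are filled with zeros."""
    return (value & 0xFFFFFFFF) >> count


print(hex(shr(0x73675454, 0x12)))                            # positive operand
print(hex(shr(0x80000000, 4)), hex(shru(0x80000000, 4)))     # negative operand
```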
Example 14.24 CMPGT .L1X A1,B2,A2 – Integer compare for greater than instruction. The
contents of registers A1 and B2 are compared. If the content of register A1 is greater than that of B2,
the comparison is true and 1 is stored in register A2. If the content of A1 is not greater than that of
B2, the comparison is false and 0 is stored in register A2. If the content of register A1 is 7676h and
register B2 is 5454h, the content of A1 is greater than B2. The comparison is true and 1 is set in
register A2. The functional unit used is .L1 through the cross path; the contents of registers A1 and B2
are unchanged.
Before executions After execution
A1 00007676 A1 00007676
B2 00005454 B2 00005454
A2 00020005 A2 00000001
If the p-bit of instruction i is 1, the next instruction i+1 is executed in parallel with
instruction i in the same machine cycle. If the p-bit is 0, instruction i+1 is executed in the
machine cycle after the execution of instruction i. All the instructions executing in parallel
constitute an execute packet. Each instruction in the execute packet must use a different functional unit.
The execute packet cannot be more than eight words, so the p-bit of the last instruction in a fetch
packet is always set to 0. There are three types of execution of the instructions in the fetch packet based
on the p-bits. They are
(i) Fully serial
(ii) Fully parallel
(iii) Partially serial
In the fully serial type of execution, all the p-bits are set to 0. The eight instructions of a fetch packet are
executed serially one after the other in eight machine cycles. In the fully parallel type, all the p-bits are
set to 1 except that of the last instruction. All the eight instructions of a fetch packet are executed in parallel in the
same machine cycle. In the partially serial scheme, the p-bits of some instructions are set to 0 and
those of others to 1. The instructions are executed serially one after the other from left to right of the fetch
packet until the first p-bit of 1 is detected. That instruction, the next instruction and the successive
instructions whose p-bits are 1 are executed in parallel, together with the first instruction whose
p-bit is 0, which closes the execute packet. The instruction after it starts the next execute packet,
and so on. An instruction that is to be executed in parallel with the previous one is marked by the symbol || at the beginning
of the line. Sample programs illustrating the above concept are given in Examples 14.25, 14.26
and 14.27.
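The grouping rule described above can be sketched in C as follows. The sketch assumes that the p-bit is bit 0 of each 32-bit instruction word; the function name split_execute_packets and the array layout are hypothetical and serve only as an illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define FETCH_PACKET_WORDS 8

/* Splits one fetch packet into execute packets by scanning the p-bits.
   sizes[k] receives the number of instructions in the k-th execute packet;
   the return value is the number of execute packets (1 to 8). */
size_t split_execute_packets(const uint32_t words[FETCH_PACKET_WORDS],
                             size_t sizes[FETCH_PACKET_WORDS])
{
    size_t packets = 0;
    size_t current = 0;

    for (size_t i = 0; i < FETCH_PACKET_WORDS; i++) {
        current++;
        int p = (int)(words[i] & 1u);      /* assumed position of the p-bit */
        /* p = 0 closes the execute packet; so does the end of the fetch packet */
        if (p == 0 || i == FETCH_PACKET_WORDS - 1) {
            sizes[packets++] = current;
            current = 0;
        }
    }
    return packets;
}
```

A fetch packet whose p-bits are all 0 yields eight execute packets of one instruction each, the pattern 1,1,1,1,1,1,1,0 yields a single execute packet of eight instructions, and the pattern 0,1,1,0,1,0,0,0 yields five execute packets of sizes 1, 3, 2, 1 and 1.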
Example 14.25 Fully serial execution with conditional operation. The following code finds
the sum of N numbers. The instructions are executed one after the other.
The conditional operation is used for the branch instruction. Register B0 is used to check for the nonzero
condition. The count N is loaded in register B0. The sequence is generated in register A3 and the
summation of the N numbers is done in register B1 using the ADD instruction. At the end of each summation the count
N in register B0 is decremented using the SUB instruction. The nonzero condition of B0 is checked each
time using the notation [B0], and the branch to location LOOP is taken until the content of B0 becomes
zero. When the content of B0 is zero, execution leaves the loop.
MVK .S2 05h, B0 ; count N loaded in register B0
LOOP ADD .L1 1,A3,A3 ; generation of sequence in A3
ADD .L2X A3,B1,B1 ; summation of sequence in B1 (A3 is read through the cross path)
SUB .S2 B0,1,B0 ; decrement of count N in register B0
[B0] B .S1 LOOP ; check for nonzero content of B0, branch to LOOP
NOP 5 ; five no operations to fill the delay slots of the branch
Example 14.26 Partially serial and parallel execution. The code given in Example 14.25 to find
the sum of N numbers is modified for partially serial and parallel execution
and is given below. Loading the count value in register B0 is done serially and all other operations are
executed in parallel. The parallel operations are performed in the .L1, .L2, .S1 and .S2 units.
MVK .S2 05H, B0
LOOP1 ADD .L1 1,A3,A3
|| ADD .L2X A3,B1,B1
|| SUB .S2 B0,1,B0
|| [B0] B .S1 LOOP1
NOP 5
NOP
NOP
NOP
Example 14.27 Fully parallel execution. The following instructions are executed in parallel. All the
eight functional units in the 'C6X are used to perform fully parallel execution.
LDW .D1 *A4++,A3
|| STW .D2 B3,*B2++
|| ADD .L1 A4,A4,A5
|| ADD .L2 B4,B4,B5
|| MPY .M1 A5,A5,A6
|| MPYH .M2 B5,B5,B6
|| SUB .S1 A6,A5,A7
|| SUB .S2 B6,B5,B7
The formula to translate the s, e and f fields into a single-precision floating point number is given
below.
Normal range: $(-1)^{s} \times 2^{(e-127)} \times 1.f$, for $0 < e < 255$
Denormalized range: $(-1)^{s} \times 2^{-126} \times 0.f$, for $e = 0$ and $f$ nonzero
The fields of the double-precision floating point format operand are shown in Fig. 14.3. All 32 bits of
the even register and bits 0-19 (20 bits) of the odd register of the register pair represent the fraction (mantissa)
part (52 bits), bits 20-30 of the odd register represent the exponent part (11 bits) and the
MSB (bit 31) of the odd register is the sign bit. In double-precision format the value of the exponent
is between 0 and 2047 for the normalized range and is 0 for the denormalized range.
The formula to translate the s, e and f fields into a double-precision floating point number is given
below.
Normal range: $(-1)^{s} \times 2^{(e-1023)} \times 1.f$, for $0 < e < 2047$
Denormalized range: $(-1)^{s} \times 2^{-1022} \times 0.f$, for $e = 0$ and $f$ nonzero
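The single-precision formulas can be evaluated directly from a 32-bit pattern. The following C sketch splits a word into its s, e and f fields and applies the normal-range or denormalized-range formula; the names pow2 and decode_single are hypothetical, and the sketch covers finite values only (e = 255 is not handled).

```c
#include <assert.h>
#include <stdint.h>

/* 2^n computed by repeated multiplication or division (exact for the
   exponent range needed here). */
static double pow2(int n)
{
    double r = 1.0;
    for (; n > 0; n--) r *= 2.0;
    for (; n < 0; n++) r /= 2.0;
    return r;
}

/* Converts a single-precision bit pattern to its numeric value. */
double decode_single(uint32_t bits)
{
    uint32_t s = bits >> 31;               /* sign bit, bit 31       */
    uint32_t e = (bits >> 23) & 0xFFu;     /* exponent, bits 30..23  */
    uint32_t f = bits & 0x7FFFFFu;         /* fraction, bits 22..0   */
    double sign = s ? -1.0 : 1.0;
    double frac = (double)f / 8388608.0;   /* f / 2^23 gives 0.f     */

    assert(e < 255);                       /* sketch: finite values only */
    if (e == 0)
        return sign * pow2(-126) * frac;          /* denormalized range */
    return sign * pow2((int)e - 127) * (1.0 + frac);  /* normal range   */
}
```

For instance, decode_single(0x46E4E400) evaluates to 29298 and decode_single(0xC4534000) to -845.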
Example 14.28 INTSP .L1 A1,A2 – Signed integer to single-precision floating point conversion
instruction. The signed integer content of register A1 is converted to
single-precision floating point format and stored in register A2. If the integer content of register A1 is
00007272h (29298), its single-precision floating point value 46E4 E400h (2.9298E+4) is loaded in register
A2. The functional unit used is .L1 and the content of register A1 is unchanged.
Before execution After execution
A1 00007272 29298 A1 00007272
A2 00020005 A2 46E4E400 2.9298E+4
Example 14.29 ADDSP .L1 A1,A2,A3 – Single-precision floating point add instruction. The
single-precision floating point contents of registers A1 and A2 are added; the result in
single-precision floating point format is stored in register A3. If the floating point content of register A1
is 4370 0000h (2.4E+2) and that of register A2 is C453 4000h (-8.45E+2), the sum C417 4000h (-6.05E+2) is
stored in register A3. The functional unit used is .L1; the contents of registers A1 and A2 are unchanged.
Before execution After execution
A1 43700000 (2.40E+2) A1 43700000 (2.40E+2)
A2 C4534000 (-8.45E+2) A2 C4534000 (-8.45E+2)
A3 500F0D18 A3 C4174000 (-6.05E+2)
Example 14.31 MPYI .M1X A1,B1,A2 – 32-bit integer multiply instruction. The 32-bit integer
contents of registers A1 and B1 are multiplied; the lower 32 bits of the product are
stored in register A2. If the content of register A1 is 0008 5BDBh and register B1 is 000D B371h, the
product is 72 8609 ACABh. Only the 32 LSBs of the product (8609 ACABh) are stored in register A2. The
functional unit used is .M1 with the cross path; the contents of registers A1 and B1 are unchanged. If the
same operation is performed with the MPYID instruction, which uses a register pair for the destination, the entire
product is stored in the register pair.
Before execution After execution
A1 00085BDB 547803 A1 00085BDB 547803
B1 000DB371 897905 B1 000DB371 897905
A2 00000000 A2 8609ACAB (32 LSBs of 491875052715)
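The relation between the full product and the 32 LSBs kept by MPYI can be reproduced on a host with 64-bit arithmetic. In this sketch the names mpyid_model and mpyi_model are hypothetical; they model only the arithmetic result, not the instruction timing.

```c
#include <stdint.h>

/* Full signed 64-bit product, as a destination register pair would hold it. */
int64_t mpyid_model(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}

/* Lower 32 bits of the product, as left in a single destination register. */
uint32_t mpyi_model(int32_t a, int32_t b)
{
    return (uint32_t)((uint64_t)mpyid_model(a, b) & 0xFFFFFFFFu);
}
```

For the operands of Example 14.31, mpyid_model(0x00085BDB, 0x000DB371) is 491875052715 (72 8609 ACABh) and mpyi_model returns 8609 ACABh.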
Example 14.32 CMPGTSP .S1 A3,A4,A5 – Single-precision floating point compare instruction for
the greater than case. The single-precision floating point contents of registers A3 and
A4 are compared; if the content of register A3 is greater than the content of A4, then ‘1’ is written in
register A5. If the floating point content of register A3 is 4608 F59Ah (8.7654E+3) and that of register A4 is 45F6
3E66h (7.8798E+3), the comparison is true; hence ‘1’ is written in register A5. The functional unit used
is .S1; the contents of registers A3 and A4 are unchanged.
Before execution After execution
A3 4608F59A (8.7654E+3) A3 4608F59A (8.7654E+3)
A4 45F63E66 (7.8798E+3) A4 45F63E66 (7.8798E+3)
A5 00000000 A5 00000001
Example 14.33 RSQRSP .S1 A1,A2 – Single-precision floating point square-root reciprocal
approximation instruction. The square root of the single-precision floating point
content of register A1 is obtained and its reciprocal value is stored in register A2 in single-precision
floating point format. If the single-precision floating point content of register A1 is 4380 0000h (2.56E+2),
its square root value is 1.6E+1 and the reciprocal value 3D80 0000h (6.25E-2) is stored in register A2. The
functional unit used is .S1; the content of register A1 is unchanged.
Before execution After execution
A1 43800000 (2.56E+2) A1 43800000 (2.56E+2)
A2 C4534000 (-8.45E+2) A2 3D800000 (6.25E-2)
are sent to memory and in PW phase the memory read operation is performed. Finally, in PR phase
the eight instructions are received at the CPU. The number of execute packets in the fetch packet
depends on whether the instructions are written for fully serial, fully parallel or partially serial execution. If the eight
instructions of a fetch packet are serial, there are eight execute packets, whereas if the eight instructions are
in parallel, there is only one execute packet. In the partially serial type, the number of execute packets
varies from two to seven and depends on the number of instructions that are parallel in the fetch
packet.
INTERRUPTS 14.8
The 'C6X processors have three types of interrupts based on their priorities: first the reset interrupt
(RESET), which has the highest priority; second the nonmaskable interrupt (NMI), which has the second
highest priority; and third the twelve maskable interrupts INT4-INT15, which have the lowest priorities. In
the 'C6X, eight registers control the servicing of the interrupts. The list of these registers and their
functions is given in Table 14.18.
The reset interrupt is an active low signal and all other interrupts are active high signals. The reset
interrupt must be held low for 10 clock cycles. The nonmaskable interrupt is used to alert the CPU to a
serious hardware problem, such as an imminent power failure. The twelve maskable
interrupts are associated with external devices, on-chip peripherals or software control, or in some
processors may not be available. The 'C6X processors have an interrupt acknowledgement signal (IACK) to
alert the external hardware that an interrupt has occurred and is being processed, and INUMx signals
(INUM3-INUM0) to indicate the number of the interrupt that is being processed.
When an interrupt occurs, the CPU begins to process it and references the interrupt service table (IST).
The IST is a table containing code for servicing the interrupts. The IST contains 16 consecutive fetch
packets, where each fetch packet contains eight instructions. Each 'C6X instruction is 32 bits, so the eight
instructions of a fetch packet occupy 32 bytes of program memory. Hence the address
in the IST is incremented by 32 bytes (20h) for the next interrupt to be serviced. An interrupt service
routine of up to eight instructions fits within one fetch packet.
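The 32-byte spacing of the IST entries can be illustrated with a short C sketch. The function name isfp_address and the parameter ist_base are hypothetical, and the sketch assumes that entry k of the table serves interrupt number k (reset as 0, NMI as 1, INT4-INT15 as 4-15).

```c
#include <assert.h>
#include <stdint.h>

#define FETCH_PACKET_BYTES 32u   /* eight 32-bit instructions            */
#define IST_ENTRIES        16u   /* sixteen consecutive fetch packets    */

/* Address of the fetch packet that services a given interrupt number,
   for a table starting at ist_base (assumed mapping: entry k <-> interrupt k). */
uint32_t isfp_address(uint32_t ist_base, unsigned interrupt_number)
{
    assert(interrupt_number < IST_ENTRIES);
    return ist_base + FETCH_PACKET_BYTES * interrupt_number;
}
```

Under these assumptions, with the table at address 0 the entry for INT4 lies at 0080h and the entry for INT15 at 01E0h.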
Table 14.17
Execute phase | Instruction type | Operations performed
E1 | Conditional instructions | For all instructions, the conditions for the instructions are checked and operands are read.
 | Load and store instructions | Address generation is performed and address modifications are written to a register file
 | Store instructions | The address and the data are sent to memory
 | Single-cycle instructions | Single-cycle instructions with saturate results, if saturation occurs, set the SAT bit in the control status register (CSR)
 | Multiply instructions | For 16x16 multiply instructions, results are written to a register file. In 'C64X multiply unit, for the non-multiply instructions, results are written to a register file
 | Multiply instructions | Multiply instructions with saturate results, if saturation occurs, set the SAT bit in the control status register (CSR)
 | Multiply instructions | In 'C64X multiply extensions, results are written to a register file
 | Branch instructions | Branch fetch packet in PG phase is affected
 | Single-cycle instructions | The results are written to a register file
 | DP compare, ADDDP/SUBDP and MPYDP instructions | The lower 32 bits of the source are read. For all other instructions, the sources are read
 | 2-cycle DP instructions | The lower 32 bits of the result are written to a register file
E2 | Load instructions | The address is sent to memory
 | Store instructions | The address and the data are sent to memory
 | Single-cycle instructions | Single-cycle instructions with saturate results, if saturation occurs, set the SAT bit in the control status register (CSR)
 | Multiply, 2-cycle DP and DP compare instructions | Results are written to a register file
 | DP compare and ADDDP/SUBDP instructions | The upper 32 bits of the source are read
 | MPYDP instruction | The lower 32 bits of src1 and the upper 32 bits of src2 are read
 | MPYI and MPYID instructions | The sources are read
E3 | Store instructions | Data memory accesses are performed
 | Multiply instructions | Multiply instructions with saturate results, if saturation occurs, set the SAT bit in the control status register (CSR)
 | MPYDP instruction | The upper 32 bits of src1 and the lower 32 bits of src2 are read
 | MPYI and MPYID instructions | The sources are read
E4 | Load instructions | Data is brought to the CPU boundary
 | MPYI and MPYID instructions | The sources are read
 | MPYDP instruction | The upper 32 bits of the source are read
 | 4-cycle instructions | Results are written to a register file
 | INTDP instruction | The lower 32 bits of the result are written to a register file
E5 | Load instructions | The data is written into a register
 | INTDP instruction | The upper 32 bits of the result are written to a register file
E6 | ADDDP/SUBDP instructions | The lower 32 bits of the result are written to a register file
E7 | ADDDP/SUBDP instructions | The upper 32 bits of the result are written to a register file
E8 | - | Nothing read or written
E9 | MPYI instruction | The result is written to a register file
 | MPYDP and MPYID instructions | The lower 32 bits of the result are written to a register file
E10 | MPYDP and MPYID instructions | The upper 32 bits of the result are written to a register file
Each fetch packet of the IST is called an interrupt service fetch packet (ISFP). An interrupt service
routine that fits in its ISFP ends with a branch to the interrupt return pointer (B IRP) followed by five no
operations (NOP 5), which allow the branch to reach the execution stage of the pipeline. If the
interrupt service routine for an interrupt is larger than eight instructions, it cannot fit in the ISFP; the ISFP
then contains a branch to another memory location and the additional interrupt
service routine code is written from the branched memory location. In both cases the interrupt
service table pointer (ISTP) register is used to locate the interrupt service routine.
Table 14.18 Interrupt Control Registers and their Functions
Name of the register | Abbreviation | Functions
Control status register | CSR | To globally enable or disable the maskable interrupts
Interrupt enable register | IER | To enable the maskable interrupts
Interrupt flag register | IFR | Shows the status of interrupts
Interrupt set register | ISR | To set the flags in the IFR register manually
Interrupt clear register | ICR | To clear the flags in the IFR register manually
Interrupt service table pointer | ISTP | Pointer to the beginning of the interrupt service table
Nonmaskable interrupt return pointer | NRP | Contains the return address used on return from a nonmaskable interrupt. This is accomplished using the B NRP instruction
Interrupt return pointer | IRP | Contains the return address used on return from a maskable interrupt. This is accomplished using the B IRP instruction
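How the registers of Table 14.18 interact can be sketched with a small software model in C. The model is illustrative only: the structure IntRegs and the function names are hypothetical, and the bit positions (global interrupt enable as CSR bit 0, NMI enable as IER bit 1, one IER/IFR bit per interrupt number) are assumptions of this sketch that should be checked against the device documentation.

```c
#include <stdint.h>

typedef struct {
    uint32_t csr;   /* control status register  */
    uint32_t ier;   /* interrupt enable register */
    uint32_t ifr;   /* interrupt flag register   */
} IntRegs;

#define CSR_GIE  0x1u   /* assumed: global interrupt enable is CSR bit 0 */
#define IER_NMIE 0x2u   /* assumed: NMI enable is IER bit 1              */

/* Writing a mask to ISR sets the corresponding flags in IFR. */
void write_isr(IntRegs *r, uint32_t mask) { r->ifr |= mask; }

/* Writing a mask to ICR clears the corresponding flags in IFR. */
void write_icr(IntRegs *r, uint32_t mask) { r->ifr &= ~mask; }

/* Returns 1 when maskable interrupt n (4..15) is flagged and enabled
   both individually and globally, otherwise 0. */
int maskable_pending(const IntRegs *r, unsigned n)
{
    if (n < 4 || n > 15)
        return 0;
    uint32_t bit = 1u << n;
    return (r->csr & CSR_GIE) && (r->ier & IER_NMIE) &&
           (r->ier & bit) && (r->ifr & bit);
}
```

In this model a flag set through write_isr() is reported by maskable_pending() only after the global and individual enable bits are set, and it disappears again after write_icr().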
Review Questions
14.1 What are the types of operations performed by .L functional units?
14.2 List the various types of multiply operations performed by .M functional units.
14.3 Which unit is used to process the branch instructions? List the various types of branch instructions in 'C6X.
14.4 What are the various types of load and store operations performed by .D units of 'C6X processor?
14.5 List the addressing modes supported by the 'C6X processor.
14.6 What are the address generation options present in linear addressing mode?
14.7 Explain the operation of circular addressing mode with example.
14.8 What are the various types of move instructions in 'C6X processors?
14.9 List the various types of addition and subtract instructions in 'C6X processors.
14.10 What are the various shift and compare operations supported by 'C6X processors?
14.11 Explain how logical conditional can be defined in 'C6X instructions?
14.12 What are the various instruction execution types in 'C6X? Explain.
14.13 What are the various data formats supported by the 'C67X processors?
14.14 List the various data format conversion instructions in 'C67X processors.
14.15 What are the floating point arithmetic operations 'C67X processor supports?
14.16 Explain the different phases of fetch operation of 'C6X pipeline.
14.17 What are operations performed in decode phase of 'C67X pipeline.
14.18 List the categories of 'C6X fixed point processor pipeline execute phases
14.19 What are the categories of 'C67X processor pipeline execute phases?
14.20 List the register present in 'C6X processor to process the interrupts.
14.21 What are the registers in 'C6X register file used for conditional operations?
Example 15.1 The assembly code generates an arithmetic series, finds the sum of the series
and stores it in memory. The first few instructions initialize the contents of
the registers used to zero. Register A1 specifies the number of values of the series N (20h); register A2
specifies the start address of the memory (0200h) to store the series. The series is generated in register
A3 and the sum of the series values is accumulated in register A4. The N values of the series are stored in
N words (4 bytes each) of the memory starting from the word after the address in register A2 (0204h), and the sum
of the series is stored in the (N+1)th word. The content of register A1 is used for the conditional operation.
Label Mnemonic Comments
.text ; assembler directive to initialize the program section (case sensitive)
ZERO .S1 A1 ; zero the content of registers A1, A2, A3, done in parallel
|| ZERO .D1 A2
|| ZERO .L1 A3
NOP 5
ZERO .D1 A4 ; zero the content of register A4
|| MVK .S1 020h, A1 ; the number of series values N (20h) is entered in A1, done in parallel
NOP 6
MVK .S1 200h,A2 ; the start address of the memory (0200h) to store the
; sequence is loaded in A2
LOOP ADD .L1 1,A3,A3 ; the series is generated in A3
STW .D1 A3,*++A2[1] ; the values of the series are stored
ADD .L1 A3,A4,A4 ; the sum of the series is accumulated in A4
SUB .S1 A1,1,A1 ; decrement the count N
NOP 6
[A1] B .S1 LOOP ; branch to LOOP while the content of A1 is nonzero
NOP 6
STW .D1 A4,*++A2[1] ; the sum of the series is stored in the (N+1)th location
NOP
.end ; assembler directive to specify the end of section (case sensitive)
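The expected memory contents of Example 15.1 can be cross-checked with a host-side C model of the same loop. This is a sketch, not a translation of the assembly: the function name series_and_sum is hypothetical, the output array stands in for the memory words from 0204h onwards, and a count of zero simply produces an empty series here.

```c
#include <stdint.h>

/* Writes the series 1, 2, ..., n to out[0..n-1], writes the sum to out[n]
   and returns the sum. The caller provides room for n + 1 words. */
uint32_t series_and_sum(uint32_t *out, uint32_t n)
{
    uint32_t term = 0;      /* plays the role of A3 */
    uint32_t sum = 0;       /* plays the role of A4 */
    uint32_t *p = out;      /* plays the role of the store pointer in A2 */

    for (uint32_t count = n; count != 0; count--) {   /* count mirrors A1 */
        term += 1;          /* generate the next value of the series */
        *p++ = term;        /* store it in the next word             */
        sum += term;        /* accumulate the sum                    */
    }
    *p = sum;               /* the (n+1)th word holds the sum        */
    return sum;
}
```

For N = 20h the model returns 528 (210h). With the start address 0200h of the example, this corresponds to the series occupying the words at 0204h to 0280h and the sum being stored at 0284h.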
Step 4: Building the code: Select the Project menu – Build option in CCS; a Debug window will appear
in CCS. The build checks the syntax of the 'C6X assembly code. If the build is successful, an .out file
with the name of the project (series.out) is created in the Debug folder of the project folder; otherwise error
messages appear in the Debug window. Using the error messages, the syntax in the
assembly language file can be corrected. The build is repeated until it
succeeds.
Step 5: Downloading the code to the target processor: The .out file is now downloaded to the
on-chip memory of the target processor; for this, the target processor must first be connected.
Select the Debug menu – Connect option in CCS; a Disassembly window will open in CCS. To
load the .out file, select the File menu – Load program option in CCS. A Load program window will
pop up; click the Debug folder and select the .out file with the name of the project (series.out).
The disassembly window will point to the starting address 0000 0000h or the default location of
the program counter (PC). The assembly code developed by the user is loaded from the
starting address 0000 0020h. The complete downloaded assembly code can be viewed in the
disassembly window from this address.
Program 15.6 DFT computation (8-point) using DIT FFT radix-2 algorithm
Fig. 15.3
a=100;
b=120;
c=a+b;
printf("Sum of %d and %d is %d\n",a,b,c);
}
Fig 15.4
The code generation tools underlying CCS, that is, the C compiler, assembler and linker, have a number of
options associated with each of them. These options must be set appropriately before attempting to build
a project. Once set, these options are stored in the project file. Figure 15.5 shows the build options
set for new.pjt. After setting the build options and adding the necessary files, go to Debug → Connect
to connect the target board with CCS. Then go to Project → Rebuild All to build the entire project.
After building the entire project, go to File → Load Program and select new.out to be loaded onto the
target board. Then go to Debug → Run to run the project and see the result on the output window as shown in
Fig. 15.6.
Uint32 input_sample()
{
short CHANNEL_data;
if (poll) while(!MCBSP_rrdy(DSK6713_AIC23_DATAHANDLE));//if ready to receive
AIC_data.uint=MCBSP_read(DSK6713_AIC23_DATAHANDLE); //read data
CHANNEL_data=AIC_data.channel[RIGHT];
AIC_data.channel[RIGHT]=AIC_data.channel[LEFT];
AIC_data.channel[LEFT]=CHANNEL_data;
return(AIC_data.uint);
}
The C program for the computation of the 8-point DFT using the FFT algorithm is given in Program 15.10.
The speech input through the microphone interface is sampled at the rate of 8 KSPS, digitized and stored
in a data file ‘samplefft.txt’. The 8-point DFT is computed and the DFT coefficients are printed on the
output window of CCS.
#include "DSK6713_aic23.h"
Uint32 fs=DSK6713_AIC23_FREQ_8KHZ;
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define noof_stages 3 /* the no. of butter fly stages – 3*/
#define noof_samples 8 /* the no. of inputs 8*/
#define PI 3.14159
struct complex {
float real;
float imag;
};
struct buffer {
struct complex data[1][20];
};
#pragma DATA_SECTION(real_buffer,".EXTRAM")
struct buffer real_buffer;
FILE *f1;
void fft (struct buffer *, int , int );
/* Main Program */
void main()
{
int k;
float sample;
int sn,sm;
sn=noof_samples;
sm=noof_stages;
printf("Input\n");
f1=fopen("samplefft.txt","r");
for(k=0;k<noof_samples;k++)
{
fscanf(f1,"%f",&sample);
real_buffer.data[0][k].real = ((float)sample);
fscanf(f1,"%f",&sample);
real_buffer.data[0][k].imag = ((float)sample);
printf("%f\t%fi\n",real_buffer.data[0][k].real,real_buffer.data[0][k].imag);
}
fclose(f1);
fft(&real_buffer,sn,sm);
}
/* Function to Compute Fast Fourier Transform */
void fft (struct buffer *input_data, int n, int m) {
int n1,n2,i,j,k,l;
float xt,yt,c,s,e,a;
n2 = n;
for ( k=0; k<m; k++) {
n1 = n2;
n2 = n2/2;
e = 2*PI/n1; /* twiddle angle step for this stage: 2*pi/n1 */
for ( j= 0; j<n2; j++) {
a = j*e;
c = (float) cos(a);
s = (float) sin(a);
for (i=j; i<n; i+= n1) {
l = i+n2;
xt = input_data->data[0][i].real - input_data->data[0][l].real;
input_data->data[0][i].real = input_data->data[0][i].real+input_data->data[0][l].real;
yt = input_data->data[0][i].imag - input_data->data[0][l].imag;
input_data->data[0][i].imag = input_data->data[0][i].imag+input_data->data[0][l].imag;
input_data->data[0][l].real = c*xt + s*yt;
input_data->data[0][l].imag = c*yt - s*xt;
}
}
}
j = 0;
for ( i=0; i<n-1; i++) {
if (i<j) {
xt = input_data->data[0][j].real;
input_data->data[0][j].real = input_data->data[0][i].real; input_data->data[0][i].real = xt;
yt = input_data->data[0][j].imag; input_data->data[0][j].imag = input_data->data[0][i].imag;
input_data->data[0][i].imag = yt;
}
k = n/2; /* advance j to the bit-reversed counterpart of i+1 */
while (k<=j) {
j = j-k;
k = k/2;
}
j = j+k;
}
/* printf("Output\n");
for ( l=0; l<n; l++)
{
printf("%f\t",input_data->data[0][l].real);
printf("%fi\n",input_data->data[0][l].imag);
}
*/
return;
}
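A radix-2 decimation-in-frequency FFT of this form can be cross-checked on a host against the direct DFT sum. The following self-contained sketch does this for 8 points. It is an independent illustration, not the listing above: the names FFT_N, Cplx, fft8_dif and dft8_direct are hypothetical, double precision is used, and the twiddle factors come from a fixed 8-entry table so that no math library is needed.

```c
#define FFT_N 8
#define SQRT_HALF 0.70710678118654752440

/* cos(2*pi*k/8) and sin(2*pi*k/8) for k = 0..7 */
static const double COS8[FFT_N] = {1, SQRT_HALF, 0, -SQRT_HALF, -1, -SQRT_HALF, 0, SQRT_HALF};
static const double SIN8[FFT_N] = {0, SQRT_HALF, 1, SQRT_HALF, 0, -SQRT_HALF, -1, -SQRT_HALF};

typedef struct { double re, im; } Cplx;

/* In-place 8-point radix-2 decimation-in-frequency FFT. */
void fft8_dif(Cplx x[FFT_N])
{
    int n2 = FFT_N;
    for (int stage = 0; stage < 3; stage++) {
        int n1 = n2;
        n2 /= 2;
        int step = FFT_N / n1;             /* table stride: angle 2*pi*j/n1 */
        for (int j = 0; j < n2; j++) {
            double c = COS8[j * step], s = SIN8[j * step];
            for (int i = j; i < FFT_N; i += n1) {
                int l = i + n2;
                double xt = x[i].re - x[l].re;
                double yt = x[i].im - x[l].im;
                x[i].re += x[l].re;
                x[i].im += x[l].im;
                x[l].re = c * xt + s * yt; /* (xt + j*yt) * (c - j*s) */
                x[l].im = c * yt - s * xt;
            }
        }
    }
    /* The outputs are in bit-reversed order; swap them into natural order. */
    for (int i = 0, j = 0; i < FFT_N - 1; i++) {
        if (i < j) { Cplx t = x[i]; x[i] = x[j]; x[j] = t; }
        int k = FFT_N / 2;
        while (k <= j) { j -= k; k /= 2; }
        j += k;
    }
}

/* Direct evaluation of X[k] = sum over n of x[n] * exp(-j*2*pi*n*k/8). */
void dft8_direct(const Cplx x[FFT_N], Cplx out[FFT_N])
{
    for (int k = 0; k < FFT_N; k++) {
        out[k].re = 0.0;
        out[k].im = 0.0;
        for (int n = 0; n < FFT_N; n++) {
            double c = COS8[(n * k) % FFT_N], s = SIN8[(n * k) % FFT_N];
            out[k].re += x[n].re * c + x[n].im * s;
            out[k].im += x[n].im * c - x[n].re * s;
        }
    }
}
```

Comparing the outputs of the two functions for the same input, within a small tolerance, is a quick way to validate the twiddle factors and the bit-reversal step before moving the code to the target.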
15.3.3 Estimation of Clock Cycles Required for Code Execution using CCS
The number of clock cycles (machine cycles) required to execute a complete program written in assembly,
in C or in a combined assembly and C environment can be estimated using the CCS tool. In CCS, first
select the option Profile – Clock – Enable and then select the option Profile – Clock – View; a clock
icon will appear at the bottom right corner of the CCS window. The clock can be reset by double
clicking the clock icon. Once the project file is downloaded to the target processor, the PC is set to the
starting point of the program code (the default value of the start address is 0000 0020h). A break point can be
introduced at the last line of the code (to introduce a break point, refer to section 15.1.4). Select the Debug –
Run option in the CCS tool, or use the shortcut key, to run the code from the current value of the program
counter (PC) to the address where the break point is introduced. The count shown at the clock icon of the CCS
tool is the number of clock cycles required to execute the block of code from the starting
address to the address of the break point. In the same way, by introducing break points
at other places in the program, the clock cycle count required to execute any block of the program
can be found.
15.3.4 Comparison of the Number of Clock Cycles Required for the Computation of
8 Point DFT in both Assembly Language and C Environment
The number of clock cycles required for the computation of the DFT using the assembly language program given
in section 15.2.3 (Program 15.6) and the C program given in section 15.3.2 (Program 15.10) are evaluated and
compared in this section. For Program 15.6, the number of clock cycles required for bit-reversing the input sequence
is computed using the CCS tool as described in section 15.3.3. Input sequences of 8, 16, 32 and 64 points take
119, 197, 353 and 671 machine cycles respectively for bit-reversing. The clock cycle count for the
computation of the 8-point DFT using Program 15.6 is also evaluated. The number of clock cycles required
to compute the 2-point, 4-point and 8-point butterfly outputs are 220, 585 and 565 respectively. For the
complete computation of the 8-point DFT using assembly code, the number of clock cycles evaluated
using the CCS tool on the 'C6416 starter kit is 1,479.
The number of clock cycles required for the 8-point DFT computation using C code on the 'C6713 starter kit
is also evaluated and the value is 14,739 clock cycles. Hence, the number of clock cycles required for
computing the 8-point DFT in the C environment is larger by a factor of about 10 (14,739/1,479 ≈ 9.97). In addition to
the non-optimality of the C compiler, the type of processor used for the implementation also contributes to
the difference. The C6416 processor used for the assembly language programming in section 15.2.4 is
a fixed point processor, whereas the C6713 processor used for executing the program in the C environment is
a floating point processor. The floating point processor in general requires more cycles than the fixed
point processor. It may be noted that CCS can be used with both of these processors to develop programs in
assembly language, C language or a combination of both.
Callable Assembly The callable assembly approach uses the C source code, which calls an externally
declared user defined assembly language function. The C code can be rewritten using callable
assembly as shown in Program 15.12.
Program 15.12 C code for the computation of Euclidean distance using a callable assembly
language function:
#include<math.h>
#include<stdio.h>
extern float errasm(float,float,float,float);
main( )
{
float x1,x2,y1,y2,e;
e = errasm(x1,y1,x2,y2);
}
In Program 15.12, errasm() is a function called by the C code; it is written in assembly language and
saved as errasm.asm. The errasm.asm file is given in Program 15.13.
.def _errasm
Intrinsic Functions Intrinsics are special functions that map directly to inline C6x instructions. For
example, int _mpy() is equivalent to the assembly instruction MPY, which multiplies the 16 LSBs of two
numbers. The above-mentioned C code example can be written using intrinsic functions as shown in
Program 15.14.
#include<ieeef.h>
#include<fastrts67x.h>
#include<stdio.h>
main( )
{
float x1,x2,y1,y2,dx,dy,e;
dx = x1-x2;
dy = y1-y2;
e = _rcpsp(_rsqrsp(dx*dx + dy*dy));
}
In the above C-code, two intrinsic functions are used. float _rcpsp(float src) computes the approximate
32-bit float reciprocal and float _rsqrsp(float src) computes the approximate 32-bit float square root
reciprocal.
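The nesting of the two intrinsics evaluates 1/(1/√v), that is √v, so e is the Euclidean distance. Since the intrinsics are available only with the TI compiler, the following host-side C sketch uses portable stand-ins to show the same composition. The names approx_rsqrt, approx_rcp and distance are hypothetical; the stand-ins are not the hardware instructions, and the reciprocal square root here is an estimate refined by two Newton-Raphson iterations, valid for positive arguments only.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for an approximate reciprocal square root (v > 0 assumed):
   a bit-level initial estimate refined by two Newton-Raphson steps
   y = y * (1.5 - 0.5 * v * y * y). */
static float approx_rsqrt(float v)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &v, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);      /* well-known initial estimate */
    memcpy(&y, &bits, sizeof y);
    for (int i = 0; i < 2; i++)
        y = y * (1.5f - 0.5f * v * y * y);
    return y;
}

/* Stand-in for an approximate reciprocal; plain division is used here. */
static float approx_rcp(float v)
{
    return 1.0f / v;
}

/* Euclidean distance between (x1, y1) and (x2, y2), computed as the
   reciprocal of the reciprocal square root of the squared distance. */
float distance(float x1, float y1, float x2, float y2)
{
    float dx = x1 - x2;
    float dy = y1 - y2;
    return approx_rcp(approx_rsqrt(dx * dx + dy * dy));
}
```

For the points (0, 0) and (3, 4) the function returns a value close to 5; the small residual error comes from the approximate reciprocal square root.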
Linear Assembly Linear assembly code is a cross between assembly and C. It uses the syntax of
assembly code instructions such as ADD, SUB and MPY, but with operands/registers as used in C. The
above-mentioned C code example can be written using linear assembly as shown in Program 15.15.
#include<math.h>
#include<stdio.h>
extern float err(float,float,float,float);
main( )
{
float x1,x2,y1,y2,e;
e = err(x1,y1,x2,y2);
}
In Program 15.15, err() is a function called by the C code; it is written in linear assembly and saved
as the err.sa file. The linear assembly code for the function err() is given in Program 15.16.
Inline Assembly Inline assembly code can be used with the asm statement within a C program.
For example, asm(" MVK 0x0040,B6"). The above-mentioned C code example can be written using
inline assembly as shown in Program 15.17.
Note: i) For 'C67X processors – LD1 and LD2 data bus size - 64 bits
ii) For 'C64X processors – LD1, LD2, ST1 and ST2 data bus size - 64 bits.
Fig. 15.7 Internal memory block diagram of 'C6X processors
Table 15.3 MAR Registers and its corresponding CE space address range
15.6.1 Timers
The 'C6X devices have two 32-bit general purpose timers that are used to time events, count events,
generate pulses, interrupt the CPU and send synchronization events to the DMA. The timer operation is
configured through three memory mapped registers, namely the timer control register, the timer period register
and the timer counter register. The block diagram of the 'C6X on-chip timer is given in Fig. 15.10. The
timer control register (TCR) is programmed to select the different modes of operation of the timer; the timer
period register contains the number of timer input clock cycles to count; and the timer counter register
increments when it is enabled to count. The timer counter register resets to 0 when the count reaches the
value in the period register.
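The counting behaviour described above can be sketched as a small software model in C. The structure TimerModel and the function timer_tick are hypothetical, and the model ignores the control register modes and the exact cycle on which the hardware performs the reset.

```c
#include <stdint.h>

typedef struct {
    uint32_t period;    /* models the timer period register  */
    uint32_t counter;   /* models the timer counter register */
    int enabled;        /* nonzero when counting is enabled  */
} TimerModel;

/* Advances the model by one timer input clock cycle. Returns 1 when the
   counter reached the period value and was reset to 0, otherwise 0. */
int timer_tick(TimerModel *t)
{
    if (!t->enabled)
        return 0;
    t->counter++;
    if (t->counter == t->period) {
        t->counter = 0;         /* count reached the period: restart from 0 */
        return 1;
    }
    return 0;
}
```

With a period of 4, the model returns 1 on every fourth call to timer_tick(), which is the point at which an event such as a CPU interrupt or a DMA synchronization event would be generated.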
The McBSP consists of two paths, a data path and a control path, which are used to connect to external
devices. The block diagram of the McBSP is shown in Fig. 15.11. There are thirteen memory mapped
registers for each McBSP present in the processor and these registers are accessed via the 32-bit peripheral
bus. The list of registers and their memory mapped addresses is given in Table 15.4. The different modes
of operation of the McBSP are programmed through a 32-bit serial port control register (SCR).
The data communication in the McBSP is through the data transmit (DX) and data receive (DR) pins. The
clocking and frame synchronization are via the CLKX, CLKR, FSX and FSR pins. The data to be transmitted
is written by either the CPU or the DMA controller to the data transmit register (DXR). The transmit shift
register (XSR) shifts the data from DXR out to the DX pin. In the same way, the data received on the DR pin
is shifted into the receive shift register (RSR), copied into the receive buffer register (RBR) and then
copied to the data receive register (DRR), from which the received data is read by
the CPU or the DMA controller.
Review Questions
15.1 List the steps to do programming in 'C6X tool.
15.2 What are the basic features of 'C6416 starter kit?
15.3 Explain the memory resources available in 'C6416 DSK.
15.4 What are the steps involved in 'C6X code generation using CCS tool?
15.5 Which instruction is used for division? How?
15.6 Explain the internal memory details of 'C6X processors.
15.7 For what operations L2 controller is used?
15.8 List the on-chip peripheral in 'C6X processors.
15.9 Explain the operation of 'C6X timer.
15.10 What are the features of McBSP?
15.11 For what interfaces McBSP is used?
15.12 List the signals used for clocking and frame synchronization of 'C6X McBSP.
15.13 What regions of memory map of 'C6X DMA controller can be used?
15.14 Explain the uses of EMIF.
15.15 What is the use of interrupt selector?
15.16 Why power-down logic is needed? Explain the 'C6X power-down logics.