TMS320C6X Digital Signal Processors Architecture Programming and Applications

Uploaded by

PARTH JAGTAP
Copyright
© © All Rights Reserved

ARCHITECTURE OF TMS320C6X 13

INTRODUCTION 13.1
The TMS320C6X DSPs use the VelociTI™ architecture. They are the first DSPs to use an advanced VLIW (Very Long Instruction Word) architecture, which achieves high performance through increased instruction-level parallelism. This makes the 'C6X DSPs an excellent choice for multichannel and multifunction applications.
A conventional VLIW architecture consists of multiple execution units running in parallel, which together execute several instructions during a single clock cycle. The VelociTI architecture is highly deterministic and offers reduced code size, flexibility in code and data types, and zero-overhead branching.
The 'C6X generation comprises the TMS320C62X, TMS320C64X and TMS320C67X families. The 'C62X and 'C64X devices are fixed-point DSPs and the 'C67X devices are floating-point DSPs. The 'C62X and 'C64X processors are code compatible, as are the 'C62X and 'C67X processors.
The 'C6X devices execute up to eight 32-bit instructions per cycle, with an execution speed of up to 6000 million instructions per second (MIPS). The 'C6X CPU consists of eight functional units (two multipliers and six ALUs) and a set of general-purpose registers. The CPU of the 'C62X and 'C67X devices has 32 general-purpose 32-bit registers, whereas the 'C64X devices have 64 general-purpose 32-bit registers.
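
As a rough check of the figures above, the peak instruction rate is the number of instructions issued per cycle multiplied by the clock frequency. The sketch below uses a hypothetical helper, `peak_mips`, and an assumed 750 MHz clock, which is the clock rate that would give the quoted 6000 MIPS; the text itself does not state the clock frequency.

```python
INSTRUCTIONS_PER_CYCLE = 8  # one instruction per functional unit


def peak_mips(clock_mhz: float) -> float:
    """Peak million instructions per second for a clock given in MHz."""
    return INSTRUCTIONS_PER_CYCLE * clock_mhz


# Assumed clock rate, chosen only to reproduce the 6000 MIPS figure.
print(peak_mips(750))
```

This is a peak figure: it is reached only when all eight functional units are given an instruction in every cycle.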

FEATURES OF 'C6X PROCESSORS 13.2


• Advanced VLIW CPU with eight functional units, including two multipliers and six ALUs
• Executes up to eight instructions per cycle, which allows efficient RISC-like code to be developed
• Instruction packing reduces code size, program fetches and power consumption
• Conditional execution of all instructions
• Efficient code execution on independent functional units
• Supports 8/16/32-bit data formats
• 40-bit arithmetic operations, saturation and normalization operations
• Bit-field manipulation instructions: extract, set, clear and bit counting operations
• The 'C67X device has hardware support for single-precision (32-bit) and double-precision (64-bit) IEEE floating-point operations and also for 32 × 32-bit integer multiplication with 32- or 64-bit results
• The 'C64X device multiplier can perform two 16 × 16-bit or four 8 × 8-bit multiplications per cycle. The 'C64X also provides quad 8-bit and dual 16-bit instruction set extensions with data flow support, non-aligned 32-bit and 64-bit memory access, special communication-specific instructions useful in realizing error-correcting codes, and bit count and rotate hardware

INTERNAL ARCHITECTURE 13.3


The block diagram of the TMS320C6X devices is given in Fig. 13.1. The 'C6X devices contain a 32-bit CPU, on-chip program and data memory, and on-chip peripherals. The on-chip memory includes cache, either for program space only or for both program and data space. The 'C6X devices have peripherals such as an external memory interface (EMIF), a direct memory access (DMA) controller, timers, multichannel buffered serial ports (McBSP), a host port interface (HPI) and power-down logic.

Fig. 13.1 Internal Architecture of TMS320C62X/'C64X/'C67X Devices

CPU 13.4
The central processing unit of the 'C6X device is a 32-bit CPU. The block diagram of the 'C6X CPU is given in Fig. 13.2. The CPU contains the following units:
(a) Program fetch unit
(b) Instruction dispatch unit
(c) Instruction decode unit
(d) Two data paths, each consisting of four functional units
(e) Register file for each data path
(f) Control registers
(g) Control logic
(h) Test, emulation and interrupt logic

Fig. 13.2 CPU Unit of TMS320C6X DSP

The functional units shaded in Fig. 13.2 are common to all 'C6X devices. The 'C6X CPU is based on an advanced VLIW architecture, which accepts eight 32-bit instructions at a time (the instruction fetch packet size is 256 bits). For each fetch packet, the program fetch unit generates the address of the eight instructions and sends it to the program memory. Once the program memory read completes, the fetch packet is received at the CPU.
The instruction dispatch unit receives the fetch packet and splits it into execute packets. The instructions in an execute packet (up to eight instructions) are assigned to the appropriate functional units in the data paths. During instruction decode, the source registers, destination registers and associated paths are decoded for the execution of the instructions in the functional units. Finally, the instructions are executed by the functional units.
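
The splitting of a fetch packet into execute packets can be sketched as follows. The text does not say how the dispatch unit finds the packet boundaries; this sketch assumes the convention documented by TI for the 'C6X, in which the least significant bit of each 32-bit instruction (the p-bit) is 1 when the next instruction executes in parallel with it. The function name and the sample opcodes are hypothetical.

```python
from typing import List


def split_into_execute_packets(fetch_packet: List[int]) -> List[List[int]]:
    """Group the eight 32-bit words of a fetch packet into execute packets.

    Assumption: bit 0 of each word is the p-bit; p = 1 chains the next
    instruction into the same execute packet, p = 0 ends the packet.
    """
    if len(fetch_packet) != 8:
        raise ValueError("a fetch packet holds exactly eight instructions")
    packets: List[List[int]] = []
    current: List[int] = []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:      # p-bit clear: the execute packet ends here
            packets.append(current)
            current = []
    if current:                # last p-bit was set: close the packet anyway
        packets.append(current)
    return packets


# Made-up opcodes; only bit 0 matters for this sketch.
fetch_packet = [0x11, 0x21, 0x30, 0x40, 0x51, 0x61, 0x71, 0x80]
print([len(p) for p in split_into_execute_packets(fetch_packet)])
```

With these sample words the fetch packet yields execute packets of three, one and four instructions, so the eight instructions are issued over three cycles rather than one.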
The register files (A and B) of all the 'C6X devices except the 'C64X contain 32 32-bit registers in total (16 registers for each data path). The 'C64X register files contain 64 32-bit registers, with 32 registers for each data path.
The 'C6X CPU contains eight functional units (.L1, .L2, .S1, .S2, .M1, .M2, .D1 and .D2): six arithmetic and logic units and two multipliers. These functional units are divided into two groups of four, one for each data path. The .L, .S and .D units are arithmetic and logic units (ALUs), and the .M unit is a multiplier unit. The functional units of the two data paths are almost identical.

GENERAL-PURPOSE REGISTER FILES 13.5


There are two general-purpose register files, A and B, in the 'C6X CPU data paths. In the 'C62X/'C67X devices each register file contains 16 32-bit registers: A0-A15 in register file A and B0-B15 in register file B. The 'C64X devices have twice as many general-purpose registers as the 'C62X/'C67X processors: 32 32-bit registers for each data path, A0-A31 in register file A and B0-B31 in register file B. The general-purpose registers can be used to hold data, as data address pointers or as condition registers.
The 'C62X/'C67X general-purpose register files support packed 16-bit data, 40-bit fixed-point data and 64-bit floating-point data. Packed data types store four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair. Values larger than 32 bits, such as 40-bit fixed-point and 64-bit floating-point quantities, are stored in register pairs. The storage scheme for 40-bit long data in a register pair is shown in Fig. 13.3. In a register pair (A3:A2), the 32 LSBs of the data are placed in the even-numbered register (A2) and the remaining 8 MSBs (for 40-bit data) or 32 MSBs (for 64-bit data) in the next higher register (A3), which is always odd-numbered. The 'C64X register file extends this facility by also supporting packed 8-bit data and 64-bit fixed-point data. For 40-bit and 64-bit data, there are 16 valid register pairs in the 'C62X/'C67X and 32 valid register pairs in the 'C64X core. The valid register pairs in the 'C6X devices are given in Table 13.1.
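
The storage scheme for a 40-bit value can be illustrated with a short sketch. The helper names `split_long40` and `join_long40` are hypothetical; the only facts taken from the text are that the 32 LSBs go to the even register and the remaining 8 MSBs to the odd register of the pair.

```python
from typing import Tuple

MASK32 = 0xFFFFFFFF
MASK40 = 0xFFFFFFFFFF


def split_long40(value: int) -> Tuple[int, int]:
    """Return (odd_register, even_register) for a 40-bit value."""
    value &= MASK40
    even = value & MASK32        # 32 LSBs go to the even register (e.g. A2)
    odd = (value >> 32) & 0xFF   # 8 MSBs go to the odd register (e.g. A3)
    return odd, even


def join_long40(odd: int, even: int) -> int:
    """Rebuild the 40-bit value held in an odd:even register pair."""
    return ((odd & 0xFF) << 32) | (even & MASK32)


odd, even = split_long40(0x123456789A)
print(hex(odd), hex(even))
```

For the sample value the pair A3:A2 would hold 0x12 in A3 and 0x3456789A in A2.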

Fig. 13.3 Storage Scheme for 40-Bit Data in a Register Pair

Table 13.1 Valid Register Pairs in 'C6X CPU Register Files

Device family | Register pairs, data path A | Register pairs, data path B
------------- | --------------------------- | ---------------------------
'C62X/'C64X/'C67X | A1:A0 | B1:B0
 | A3:A2 | B3:B2
 | A5:A4 | B5:B4
 | A7:A6 | B7:B6
 | A9:A8 | B9:B8
 | A11:A10 | B11:B10
 | A13:A12 | B13:B12
 | A15:A14 | B15:B14
'C64X only | A17:A16 | B17:B16
 | A19:A18 | B19:B18
 | A21:A20 | B21:B20
 | A23:A22 | B23:B22
 | A25:A24 | B25:B24
 | A27:A26 | B27:B26
 | A29:A28 | B29:B28
 | A31:A30 | B31:B30

FUNCTIONAL UNITS AND OPERATION 13.6


The 'C6X CPU consists of eight functional units: .L1, .S1, .M1, .D1, .L2, .S2, .M2 and .D2. They are divided into two groups, one group for each data path. Each functional unit in one data path is almost identical to the corresponding unit in the other data path, and the two groups are arranged as mirror images of each other. The .L, .S and .D units are arithmetic and logic units, and the .M unit is a multiplier unit. The fixed-point operations performed by the 'C6X functional units, and the bit sizes of those operations, are given in Table 13.2.
The .L unit performs arithmetic and logical operations, as well as compare and count operations. The .S unit is used for arithmetic and logical operations as well as for branch, shift, constant generation and move operations. The .D unit performs add and subtract operations and is the dedicated unit for load and store operations and for linear and circular address calculations. The .M unit is dedicated to multiply operations. In addition to these normal operations, the functional units of the 'C64X processor support dual 16-bit and quad 8-bit operations appropriate to each unit.
The operations performed by the 'C67X functional units are given in Table 13.3. The .L unit performs the floating-point arithmetic operations and the .S unit performs the compare operations. The .M unit performs 32 × 32-bit fixed-point multiply operations and floating-point multiply operations. The .D unit is used to load double words with a 5-bit constant offset.

Table 13.2 Functional units of 'C6X and their fixed-point operations

Type of operation | .L unit | .S unit | .M unit | .D unit
----------------- | ------- | ------- | ------- | -------
Arithmetic operations | 32/40-bit operations; dual 16-bit and quad 8-bit arithmetic and min/max operations* | 32-bit operations; dual 16-bit and quad 8-bit saturated arithmetic operations* | - | 32-bit add and subtract operations only
Logical operations | 32-bit operations | 32-bit operations | - | 32-bit logical operations*
Multiply operations | - | - | 16 × 16 multiply operations; 16 × 32, quad 8 × 8 and dual 16 × 16 multiply operations* | -
Shift operations | Byte shifts* | 32/40-bit shift operations; byte shifts and dual 16-bit shift operations* | Variable shift operations* | -
Compare operations | 32/40-bit operations | Dual 16-bit and quad 8-bit compare operations* | - | -
Branch operations | - | Yes | - | -
Load and store operations | - | - | - | Loads and stores with 5-bit constant offset (15-bit constant offset in .D2 only)
Linear and circular address calculation | - | - | - | Yes
Constant generation | 5-bit constant generation* | Yes | - | 5-bit constant generation*
Count operations | 32/40-bit count operations | - | - | -
Move operations | Register to register only; 16-bit move operations* | 16-bit move operations; register to register only* | - | Register to register only; 16-bit move operations*

* Additional operations performed by the functional units in 'C64X processors.

Table 13.3 Functional units of 'C67X and their floating-point operations

Name of the unit | Type of floating-point operation
---------------- | --------------------------------
.L unit | Arithmetic operations
.S unit | Compare, square-root and absolute value operations
.M unit | 32 × 32-bit fixed-point multiply operations and floating-point multiply operations
.D unit | Load double word with 5-bit constant offset

DATA PATHS 13.7


The 'C6X CPU has two data paths, data path A and data path B. The data paths of the 'C62X, 'C67X and 'C64X devices are shown in Fig. 13.4, Fig. 13.5 and Fig. 13.6 respectively.

13.7.1 Register File Data Paths


Most of the data lines in the CPU data paths are 32 bits wide, but some support 40-bit (long) and 64-bit (double-word) operands. The functional units whose names end in 1 (.L1, .S1, .M1 and .D1) have access to register file A, and those ending in 2 (.L2, .S2, .M2 and .D2) have access to register file B. Each functional unit has two 32-bit ports for reading the source operands src1 and src2 from its register file. The .L and .S units have an extra 8-bit line for reading 40-bit long source operands.
Each functional unit has its own 32-bit write port into its register file for the destination operand dst. The .M unit of the 'C64X is an exception: it can return up to a 64-bit result, so it has an extra 32-bit write port to the register file. As with the read ports, the .L and .S units have an extra 8-bit line for writing 40-bit long dst operands. Since each unit has its own ports for operand reads and writes, all eight functional units can be used in parallel in every machine cycle when performing 32-bit operations.

13.7.2 Register File Cross Paths


The functional units of the 'C6X processors read and write operands directly in their own register files through their own data paths. Each register file is also connected to the functional units on the opposite side through the 1X and 2X cross paths. These cross paths allow the functional units of one data path to access a 32-bit operand from the register file on the opposite side. The functional units of data path A read source operands from register file B via the 1X cross path, and the 2X cross path allows the functional units of data path B to read source operands from register file A.
In the 'C62X and 'C67X processors, six of the eight functional units (.L1, .L2, .S1, .S2, .M1 and .M2) have access to the opposite-side register file via a cross path. In the .S1, .S2, .M1 and .M2 units only the src2 operand is selectable between the cross path and the same-side register file path, whereas in the .L1 and .L2 units both src1 and src2 are selectable.
In the 'C64X processor, all eight functional units have access to the opposite-side register file through a cross path. Here too, both src1 and src2 of the .L1 and .L2 units are selectable between the cross path and the same-side register file path, but in the other six functional units only src2 is selectable.
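
The cross-path rules above can be summarized as a small lookup. This is only a sketch of the rules as stated in this section; the function name `cross_path_operands` and the device strings "C62X", "C64X" and "C67X" are hypothetical.

```python
from typing import Set


def cross_path_operands(unit: str, device: str) -> Set[str]:
    """Source operands of `unit` that may be read over the cross path."""
    if device not in {"C62X", "C64X", "C67X"}:
        raise ValueError("unknown device: " + device)
    kind = unit[:2].upper()  # ".L", ".S", ".M" or ".D"
    if kind not in {".L", ".S", ".M", ".D"}:
        raise ValueError("unknown functional unit: " + unit)
    if kind == ".D" and device != "C64X":
        return set()              # .D units have cross-path access on 'C64X only
    if kind == ".L":
        return {"src1", "src2"}   # either operand is selectable on .L units
    return {"src2"}               # all other units: src2 only


print(cross_path_operands(".D1", "C62X"))
print(cross_path_operands(".L2", "C67X"))
```

The sketch describes only which operand positions are wired to a cross path; it does not model any limit on how many units may use a cross path in the same cycle.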

Fig. 13.4 TMS320C62X CPU Data Paths

Fig. 13.5 TMS320C67X CPU Data Paths

Fig. 13.6 TMS320C64X CPU Data Paths

13.7.3 Register File Memory Access Paths
To move data between memory and the CPU register files, the 'C6X CPU has address paths, data load paths and data store paths. The address paths DA1 and DA2, the data load paths LD1 and LD2, and the data store paths ST1 and ST2 are used for memory access.
The DA1 and DA2 address paths are 32 bits wide and are connected to the .D units of the data paths. They allow addresses generated in either data path to access data to or from any register. The DA1 and DA2 resources and their associated data paths are specified as T1 and T2 respectively in the instruction set. It is important to note that there is a cross path for the address buses: the addresses generated in the .D1 and .D2 units can use the opposite address paths, DA2 and DA1 respectively.
The 'C62X processor has two 32-bit paths for loading data from memory to the register files, LD1 for register file A and LD2 for register file B. The 'C64X and 'C67X processors have an additional 32-bit load path for each register file (LD1a and LD1b for register file A, LD2a and LD2b for register file B). This allows the CPU to load two 32-bit values (64 bits in total) into a register file simultaneously.
As far as the store paths are concerned, both the 'C62X and 'C67X processors have two 32-bit paths, ST1 and ST2, for storing data from the register files to memory. The 'C64X has additional 32-bit store paths (ST1a and ST1b for register file A, ST2a and ST2b for register file B). Only the 'C64X processor supports both double-word load and double-word store instructions. The sizes of the memory access paths in the 'C6X processors are given in Table 13.4.
Table 13.4 Size of memory access paths in 'C6X processors

Data path type | 'C62X size | 'C62X number | 'C64X size | 'C64X number | 'C67X size | 'C67X number
-------------- | ---------- | ------------ | ---------- | ------------ | ---------- | ------------
Address path | 32-bit | 2 (DA1 and DA2) | 32-bit | 2 (DA1 and DA2) | 32-bit | 2 (DA1 and DA2)
Load path | 32-bit | 2 (LD1 and LD2) | 64-bit | 2 (LD1a, LD1b and LD2a, LD2b) | 64-bit | 2 (LD1a, LD1b and LD2a, LD2b)
Store path | 32-bit | 2 (ST1 and ST2) | 64-bit | 2 (ST1a, ST1b and ST2a, ST2b) | 32-bit | 2 (ST1 and ST2)

CONTROL REGISTER FILE 13.8


The control register file of the 'C6X processors contains ten control registers common to the 'C62X, 'C64X and 'C67X. Only the .S2 unit can read from and write to the control register file. The control registers are generally accessed with the MVC (move between the control file and the register file) instruction, but some control register bits are accessed in other ways. For example, the global interrupt enable bit, the maskable interrupt bits and the interrupt flag bits are accessed differently. The control registers common to the 'C6X processors are listed and described in Table 13.5.

13.8.1 Addressing Mode Register (AMR)


The eight registers A4-A7 and B4-B7 of the CPU register files can be used for linear and circular addressing. The Addressing Mode Register (AMR) specifies the addressing mode for each of them; it consists of mode select fields and block size fields. The fields of the AMR are shown in Fig. 13.7. A 2-bit mode select field for each register selects the address modification mode, linear or circular. The 5-bit block size fields BK0 and BK1 select the block size of the circular buffer in circular addressing. In circular mode, the 2-bit mode select field also specifies which block size field (BK0 or BK1) is used for the circular buffer. The mode select field encoding is given in Table 13.6. The block size for circular addressing is calculated from the 5-bit value in BK0 or BK1 as follows:
Block size in bytes = 2^(N+1)
where N is the 5-bit value in BK0 or BK1.
The buffer must be aligned on a byte boundary equal to the block size. The reserved portion of the AMR is always 0, and the AMR is initialized to 0 at reset.
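
The block size formula and the alignment requirement can be tried out with a short sketch. The helper names are hypothetical, and the wrap-around rule in `circular_add` is an assumption based on the alignment requirement: only the address bits inside the block change, and the aligned base of the buffer stays fixed.

```python
def block_size_bytes(n: int) -> int:
    """Block size in bytes for a 5-bit BK0/BK1 value N: 2^(N+1)."""
    if not 0 <= n <= 31:
        raise ValueError("N must fit in 5 bits")
    return 1 << (n + 1)


def circular_add(addr: int, offset: int, n: int) -> int:
    """Advance addr by offset bytes, wrapping inside the aligned block."""
    size = block_size_bytes(n)
    base = addr & ~(size - 1)             # aligned start of the buffer
    return base | ((addr + offset) & (size - 1))


print(block_size_bytes(3))
print(hex(circular_add(0x100E, 4, 3)))
```

With N = 3 the block is 16 bytes, so stepping 4 bytes past 0x100E wraps back to 0x1002 instead of leaving the buffer that starts at 0x1000.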

Table 13.5 Control registers common to 'C6X processors

Register name | Abbreviation | Description
------------- | ------------ | -----------
Addressing Mode Register | AMR | Specifies linear or circular addressing for the eight registers A4-A7 and B4-B7; also selects the size of the circular buffer in circular addressing
Control Status Register | CSR | Contains important control and status bits of the processor
Program Counter, E1 phase | PCE1 | Contains the address of the fetch packet that is in the E1 phase of the pipeline
Interrupt Flag Register | IFR | Contains the status of the maskable interrupts INT4-INT15 and of NMI
Interrupt Set Register | ISR | Used to set pending maskable interrupts manually
Interrupt Clear Register | ICR | Used to clear pending maskable interrupts manually
Interrupt Enable Register | IER | Used to enable or disable individual maskable interrupts
Interrupt Service Table Pointer | ISTP | Points to the beginning of the interrupt service table
Interrupt Return Pointer | IRP | Contains the address used to return from a maskable interrupt
Nonmaskable Interrupt Return Pointer | NRP | Contains the address used to return from a nonmaskable interrupt

Fig. 13.7 Address Mode Register (AMR) fields


Table 13.6 AMR mode select field encoding

Mode select bits | Description of mode
---------------- | -------------------
00 | Linear modification of address (default at reset)
01 | Circular addressing using the BK0 field
10 | Circular addressing using the BK1 field
11 | Reserved
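
The mode encoding can be combined with the block size fields to build an AMR value, as sketched below. The bit positions are assumptions, since Fig. 13.7 is not reproduced here: the sketch follows the layout in TI's CPU reference guide, with the 2-bit mode fields for A4, A5, A6, A7, B4, B5, B6 and B7 in ascending order from bit 0, BK0 in bits 16-20 and BK1 in bits 21-25. The function name `build_amr` is hypothetical.

```python
from typing import Dict

MODE = {"linear": 0b00, "circular_bk0": 0b01, "circular_bk1": 0b10}

# Assumed field order: two mode bits per register, starting at bit 0.
REGISTERS = ["A4", "A5", "A6", "A7", "B4", "B5", "B6", "B7"]
BK0_SHIFT = 16  # assumed position of the 5-bit BK0 field
BK1_SHIFT = 21  # assumed position of the 5-bit BK1 field


def build_amr(modes: Dict[str, str], bk0: int = 0, bk1: int = 0) -> int:
    """Build a 32-bit AMR value from per-register modes and block sizes."""
    if not (0 <= bk0 <= 31 and 0 <= bk1 <= 31):
        raise ValueError("BK0 and BK1 are 5-bit fields")
    amr = (bk0 << BK0_SHIFT) | (bk1 << BK1_SHIFT)
    for reg, mode in modes.items():
        amr |= MODE[mode] << (2 * REGISTERS.index(reg))
    return amr


# A5 circular with BK0 = 3 (a 16-byte block); every other register linear.
print(hex(build_amr({"A5": "circular_bk0"}, bk0=3)))
```

Registers that are not named keep the mode bits 00, which matches the linear default at reset.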

13.8.2 Control Status Register (CSR)


The Control Status Register (CSR) of the 'C6X contains control and status bits of the processor. The fields of the CSR are shown in Fig. 13.8 and the function of each field is listed in Table 13.7. Bits 0-7 and 10-15 are both readable and writable, whereas bits 8, 9 and 16-31 are read-only. At processor reset the 16 LSBs are reset to zero; the 16 MSBs, which contain the Revision ID and the CPU ID, are fixed for a particular processor.

Fig. 13.8 Control Status Register (CSR) fields

Table 13.7 Control Status Register field functions

Field name | Function of the field
---------- | ---------------------
CPU ID | Identifies the CPU family: 00h = 'C62X family of processors, 02h = 'C67X family of processors, 04h = 'C64X family of processors
Revision ID | Identifies the silicon version of the CPU
PWRD | Controls the power-down modes; the value is always read as zero
SAT | Saturate bit. It is set only by a functional unit when that unit performs a saturation, and it can be cleared only by the MVC instruction
EN | Endian bit: 1 = little endian, 0 = big endian
PCC | Program cache control mode
DCC | Data cache control mode
PGIE | Previous GIE bit; saves GIE when an interrupt is taken
GIE | Global Interrupt Enable bit; enables (1) or disables (0) all the maskable interrupts

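Reading the CSR fields out of a 32-bit value can be sketched as follows. The bit positions used here are assumptions, since Fig. 13.8 is not reproduced: they follow the layout in TI's CPU reference guide (GIE bit 0, PGIE bit 1, DCC bits 2-4, PCC bits 5-7, EN bit 8, SAT bit 9, PWRD bits 10-15, Revision ID bits 16-23, CPU ID bits 24-31), which is consistent with the read-only and read/write ranges given in the text. The function name and the sample value are hypothetical.

```python
from typing import Dict


def decode_csr(csr: int) -> Dict[str, int]:
    """Split a 32-bit CSR value into its fields (assumed bit positions)."""
    return {
        "GIE": csr & 0x1,
        "PGIE": (csr >> 1) & 0x1,
        "DCC": (csr >> 2) & 0x7,
        "PCC": (csr >> 5) & 0x7,
        "EN": (csr >> 8) & 0x1,
        "SAT": (csr >> 9) & 0x1,
        "PWRD": (csr >> 10) & 0x3F,
        "REVISION_ID": (csr >> 16) & 0xFF,
        "CPU_ID": (csr >> 24) & 0xFF,
    }


# Made-up register value, used only to exercise the decoder.
fields = decode_csr(0x02010101)
print(fields["CPU_ID"], fields["REVISION_ID"], fields["EN"], fields["GIE"])
```
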
13.8.3 Control Register File Extensions
The 'C67X and 'C64X processors contain additional control registers. The 'C67X processor contains three configuration registers to support floating-point operations. These registers specify the desired floating-point rounding mode for the .L, .S and .M units. The additional 'C67X control registers and their functions are given in Table 13.8. There is only one additional control register in the 'C64X, the Galois Field Polynomial Generator Function Register (GFPGFR). This register, together with the Galois field multiply hardware in the 'C64X, can be used for Reed-Solomon encode and decode functions. The GFPGFR contains an 8-bit polynomial generator field (POLY, bits 0-7) and a 3-bit field size field (SIZE, bits 24-26); the remaining bits are reserved. The Galois field multiply on the 'C64X processor is performed with the GMPY4 instruction, which performs four parallel operations on 8-bit packed data in the .M unit. Galois field multiplies for all fields of the form GF(2^m), with m between 1 and 8 and with any generator polynomial, can be programmed using the Galois field multiplier in the 'C64X.
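
A Galois field multiply of the kind described above can be sketched in software. This is a generic shift-and-reduce multiplication in GF(2^m), not a model of the GMPY4 hardware or of the GFPGFR encoding; the function names are hypothetical, and the default polynomial 0x11D (x^8 + x^4 + x^3 + x^2 + 1), a common choice for Reed-Solomon codes, is an assumption.

```python
def gf_mul(a: int, b: int, poly: int = 0x11D, m: int = 8) -> int:
    """Multiply a and b in GF(2^m).

    `poly` is the full generator polynomial, including the x^m term.
    """
    mask = (1 << m) - 1
    a &= mask
    b &= mask
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a        # addition in GF(2^m) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << m):       # degree reached m: reduce by the polynomial
            a ^= poly
    return result


def packed_gf_mul4(x: int, y: int, poly: int = 0x11D) -> int:
    """Four independent GF(2^8) multiplies on packed 8-bit lanes."""
    out = 0
    for lane in range(4):
        shift = 8 * lane
        out |= gf_mul((x >> shift) & 0xFF, (y >> shift) & 0xFF, poly) << shift
    return out


print(hex(gf_mul(0x02, 0x80)))
print(hex(packed_gf_mul4(0x02020202, 0x80808080)))
```

The packed version mirrors the idea of four parallel operations on 8-bit packed data: each byte lane is multiplied on its own, with no interaction between lanes.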

Table 13.8 Control Register File Extensions in 'C67X

Register name | Abbreviation | Description
------------- | ------------ | -----------
Floating-point adder configuration register | FADCR | Specifies underflow mode, rounding mode, not-a-number (NaN) and other exceptions for the .L unit
Floating-point auxiliary configuration register | FAUCR | Specifies underflow mode, rounding mode, not-a-number (NaN) and other exceptions for the .S unit
Floating-point multiplier configuration register | FMCR | Specifies underflow mode, rounding mode, not-a-number (NaN) and other exceptions for the .M unit

Review Questions
13.1 What is VLIW architecture?
13.2 List the processors in the 'C6X family. Which processors are code compatible?
13.3 What are the blocks present in the CPU of the 'C6X?
13.4 What is a register file? What is the size of the register files in 'C6X processors?
13.5 What is a 'C6X register pair? Explain its use.
13.6 List the various functional units in the 'C6X CPU.
13.7 What are the functions performed by the .L unit?
13.8 Explain the functions performed by the .S unit.
13.9 What are the different multiply operations performed by the .M unit?
13.10 List the functions performed by the .D unit.
13.11 How many data paths are there in the 'C6X register file?
13.12 What is a register file cross path? What is its use?
13.13 How many data paths does the 'C6X have to access memory? What are their sizes?
13.14 List the control registers common to the 'C6X family of processors.
13.15 What are the fields in the addressing mode register? Explain the function of each field.
13.16 List the additional control registers in the 'C64X and 'C67X processors.

Self Test Questions


13.1 The 'C6X processor is based on ______ architecture
(a) Modified Harvard (b) Advanced Harvard (c) VelociTI (d) Davinci
13.2 The fixed point devices in 'C6X processors are
(a) 'C62X (b) 'C62X and 'C64X (c) 'C67X (d) 'C64X
13.3 The floating point devices in 'C6X processors are
(a) 'C62X (b) 'C62X and 'C64X (c) 'C67X (d) 'C64X
13.4 The number of functional units in 'C6X CPU is
(a) 2 (b) 8 (c) 4 (d) 16
13.5 The size of the 'C6X CPU is
(a) 16-bit (b) 32-bit (c) 40-bit (d) 64-bit
13.6 The number of general purpose register files in 'C6X CPU is
(a) 2 (b) 3 (c) 4 (d) 8
13.7 The number of registers in 'C62X and 'C67X CPU register file is
(a) 16 (b) 32 (c) 40 (d) 64
13.8 The number of registers in 'C64X CPU register file is
(a) 16 (b) 32 (c) 40 (d) 64
13.9 The number of ALU units in 'C6X CPU is
(a) 8 (b) 4 (c) 6 (d) 2
13.10 The number of multiplier units in 'C6X CPU is
(a) 8 (b) 4 (c) 6 (d) 2
13.11 The 'C6X CPU accepts ______ instructions at a time
(a) 8 (b) 4 (c) 6 (d) 2
13.12 Which units of the following are ALU units in 'C6X CPU?
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.13 The ______ functional unit of 'C6X is used for 32/40 bit shift operation
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.14 The functional unit of 'C6X that can be used for branch operation is
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.15 ______ functional unit of 'C6X is used to load and store the data values.
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.16 The functional unit of 'C6X used for linear and circular addressing is ______
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.17 ______ functional unit is used for 32/40 bit compare operations
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.18 The ______ functional unit of 'C6X is used for 32/40 bit count operation
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.19 The number of data paths from 'C6X register file to functional units is
(a) 16 (b) 32 (c) 24 (d) 40
13.20 The number of cross paths in 'C6X register file is
(a) 4 (b) 2 (c) 6 (d) 8
13.21 ______ functional unit of 'C62X and 'C67X does not have cross path access
(a) .L unit (b) .S unit (c) .M unit (d) .D unit
13.22 ______ 'C6X processor has all the memory access paths with 32-bit.
(a) 'C62X (b) 'C62X and 'C64X (c) 'C62X and 'C67X (d) 'C64X
13.23 The 'C6X processor having both load and store paths with 64-bit is ______
(a) 'C62X (b) 'C62X and 'C64X (c) 'C62X and 'C67X (d) 'C64X
13.24 The 'C6X processor having load path with 64-bit is ______
(a) 'C62X (b) 'C62X and 'C64X (c) 'C64X and 'C67X (d) 'C64X
13.25 The number of control registers common to 'C6X family of processors is ______
(a) 10 (b) 16 (c) 8 (d) 3
13.26 ______ additional control registers are in 'C67X processor.
(a) 10 (b) 16 (c) 8 (d) 3

TMS320C6X ASSEMBLY LANGUAGE INSTRUCTIONS 14

In the 'C6X family of DSPs, the 'C62X and 'C64X processors are code compatible, as are the 'C62X and 'C67X processors. All the fixed-point instructions of the 'C62X processor are valid for the 'C64X and 'C67X processors. The 'C67X is a floating-point device, and certain instructions are unique to it and do not execute on the fixed-point devices. Similarly, the 'C64X, which adds functionality to the 'C62X devices, has some unique instructions of its own. This chapter describes the assembly language instructions corresponding to the functional units of the CPU, the addressing modes, and parallel and conditional operations. It also describes the fixed-point instructions common to the 'C62X, 'C64X and 'C67X devices as well as the 'C67X floating-point instructions.

FUNCTIONAL UNITS AND ITS INSTRUCTIONS 14.1


The 'C6X devices have six ALUs (the .L1, .L2, .S1, .S2, .D1 and .D2 units) and two multiplier units (.M1 and .M2). The ALUs perform basic arithmetic and logical operations; in addition, each unit has special functions as listed in Table 13.2. The multiplier units perform only multiply operations.

14.1.1 Instructions to .L Functional unit


The .L unit (basic ALU) performs 32/40-bit arithmetic, 32-bit logical, 32/40-bit compare and 32/40-bit count operations. On the 'C64X processor the .L unit additionally performs dual 16-bit and quad 8-bit arithmetic, byte shifts and 5-bit constant generation. The fixed-point instructions of the .L unit common to the 'C62X, 'C64X and 'C67X processors are given in Table 14.1.

14.1.2 Instructions to .S Functional unit


The .S unit (shift and branch unit) is the dedicated unit for 32/40-bit shift operations and branch operations. It also performs 32-bit arithmetic, 32-bit logical and constant generation operations. On the 'C64X processor the .S unit additionally performs dual 16-bit and quad 8-bit arithmetic operations, dual 16-bit shift operations, and dual 16-bit and quad 8-bit compare operations. The fixed-point instructions of the .S unit that are common to the 'C62X, 'C64X and 'C67X processors are given in Table 14.2.

Table 14.1 Assembly Language Instructions for .L Functional Unit

Type of operation | Mnemonic | Description
----------------- | -------- | -----------
Arithmetic operations | ABS | Integer absolute value with saturation
 | ADD/ADDU | Signed/unsigned integer addition without saturation
 | SADD | Integer addition with saturation to result size
 | SSUB | Integer subtraction with saturation to result size
 | SUB/SUBU | Signed/unsigned integer subtraction without saturation
 | SUBC | Conditional integer subtract and shift
 | NEG | Negate (pseudo-operation)
Logical operations | AND | Bitwise AND
 | NOT | Bitwise NOT
 | OR | Bitwise OR
 | XOR | Bitwise XOR
Compare operations | CMPEQ | Integer compare for equality
 | CMPGT/CMPGTU | Signed/unsigned integer compare for greater than
 | CMPLT/CMPLTU | Signed/unsigned integer compare for less than
Other operations | NORM | Normalize integer
 | MV | Move from register to register (pseudo-operation)
 | LMBD | Leftmost bit detection
 | SAT | Saturate a 40-bit integer to a 32-bit integer
 | ZERO | Zero a register (pseudo-operation)

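
The saturating instructions in Table 14.1 can be illustrated with a small software model. The functions below are hypothetical sketches of the arithmetic only: `sat32` clamps a wider value to the signed 32-bit range as SAT does for a 40-bit integer, `sadd` models a saturating 32-bit add, and `lmbd` assumes the TI-documented behaviour of counting bit positions from the MSB until the searched bit value is found, returning 32 if it is absent.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def sat32(value: int) -> int:
    """Clamp a signed value to the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))


def sadd(a: int, b: int) -> int:
    """Add two signed 32-bit integers with saturation."""
    return sat32(a + b)


def lmbd(bit: int, value: int) -> int:
    """Position, counted from the MSB, of the leftmost bit equal to `bit`."""
    value &= 0xFFFFFFFF
    for pos in range(32):
        if (value >> (31 - pos)) & 1 == bit:
            return pos
    return 32                  # the searched bit value does not occur


print(sadd(INT32_MAX, 1) == INT32_MAX)
print(lmbd(1, 0x40000000))
```

Without saturation the first addition would wrap around to the most negative 32-bit value; with saturation the result stays at the largest positive value.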
Table 14.2 Assembly Language Instructions for .S Functional Unit

Type of operation | Mnemonic | Description
----------------- | -------- | -----------
Arithmetic operations | ADD | Signed integer addition without saturation
 | ADDK | Integer addition using a signed 16-bit constant
 | ADD2 | Two 16-bit integer additions on the upper and lower register halves
 | SUB/SUBU | Signed/unsigned integer subtraction without saturation
 | SUB2 | Two 16-bit integer subtractions on the upper and lower register halves
 | NEG | Negate (pseudo-operation)
Logical operations | AND | Bitwise AND
 | NOT | Bitwise NOT
 | OR | Bitwise OR
 | XOR | Bitwise XOR
Shift operations | SHL | Arithmetic shift left
 | SHR | Arithmetic shift right
 | SHRU | Logical shift right
 | SSHL | Shift left with saturation
Branch operations | B disp | Branch using a displacement
 | B reg | Branch using a register
 | B NRP | Branch using the NMI return pointer
 | B IRP | Branch using the interrupt return pointer
Move operations | MV | Move from register to register (pseudo-operation)
 | MVC | Move between the control file and the register file
 | MVK | Move a 16-bit signed constant into a register and sign extend
 | MVKH/MVKLH | Move a 16-bit constant (the upper/lower 16 bits of the constant operand) into the upper 16 bits of a register
Other operations | CLR | Clear a bit field
 | EXT/EXTU | Extract and sign-extend/zero-extend a bit field
 | SET | Set a bit field
 | ZERO | Zero a register (pseudo-operation)
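
Two entries of Table 14.2 are easier to follow with numbers: building a 32-bit constant with MVK followed by MVKH, and the split addition of ADD2. The functions below are hypothetical software models of the register results, not TI code; they assume that MVKH leaves the lower 16 bits of the register unchanged and that ADD2 passes no carry from the lower half into the upper half.

```python
def mvk(cst16: int) -> int:
    """MVK: place a 16-bit constant in a register, sign-extended to 32 bits."""
    cst16 &= 0xFFFF
    if cst16 & 0x8000:
        return cst16 | 0xFFFF0000
    return cst16


def mvkh(reg: int, cst32: int) -> int:
    """MVKH: load the upper 16 bits of the constant into the upper half."""
    return (cst32 & 0xFFFF0000) | (reg & 0xFFFF)


def add2(a: int, b: int) -> int:
    """ADD2: independent 16-bit additions on the two register halves."""
    lo = (a + b) & 0xFFFF
    hi = ((a >> 16) + (b >> 16)) & 0xFFFF
    return (hi << 16) | lo


value = 0x12348765
reg = mvk(value & 0xFFFF)      # sign extension sets the upper half to 0xFFFF
reg = mvkh(reg, value)         # MVKH then overwrites the upper half
print(hex(reg))
print(hex(add2(0x0001FFFF, 0x00010001)))
```

The second result shows the lower half wrapping to zero without disturbing the sum in the upper half.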

14.1.3 Instructions to .M Functional unit


The .M unit (multiply unit) is a dedicated unit that performs 16 × 16-bit multiply operations. On the 'C64X processor the .M unit can also perform 16 × 32, dual 16 × 16 and quad 8 × 8 multiply operations. The fixed-point instructions of the .M unit common to the 'C62X, 'C64X and 'C67X processors are given in Table 14.3.

Table 14.3 Assembly Language Instructions for .M Functional Unit

Type of operation | Mnemonic | Description
----------------- | -------- | -----------
Multiply operations | MPY/MPYU/MPYUS/MPYSU | Signed/unsigned integer multiply, 16 LSBs × 16 LSBs
 | MPYH/MPYHU/MPYHUS/MPYHSU | Signed/unsigned integer multiply, 16 MSBs × 16 MSBs
 | MPYHL/MPYHLU/MPYHULS/MPYHSLU | Signed/unsigned integer multiply, 16 MSBs × 16 LSBs
 | MPYLH/MPYLHU/MPYLUHS/MPYLSHU | Signed/unsigned integer multiply, 16 LSBs × 16 MSBs
 | SMPY/SMPYHL/SMPYLH/SMPYH | Integer multiply with left shift and saturation
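
The naming pattern in Table 14.3 (which 16-bit half of each source register is multiplied) can be made concrete with a small model. The functions are hypothetical sketches that return ordinary Python integers rather than 32-bit register patterns; `smpy` assumes a left shift by one bit and saturation to the largest positive 32-bit value, as the table entry suggests.

```python
def s16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def lo16(reg: int) -> int:
    return reg & 0xFFFF


def hi16(reg: int) -> int:
    return (reg >> 16) & 0xFFFF


def mpy(a: int, b: int) -> int:      # signed 16 LSBs x signed 16 LSBs
    return s16(lo16(a)) * s16(lo16(b))


def mpyu(a: int, b: int) -> int:     # unsigned 16 LSBs x unsigned 16 LSBs
    return lo16(a) * lo16(b)


def mpyh(a: int, b: int) -> int:     # signed 16 MSBs x signed 16 MSBs
    return s16(hi16(a)) * s16(hi16(b))


def mpyhl(a: int, b: int) -> int:    # signed 16 MSBs x signed 16 LSBs
    return s16(hi16(a)) * s16(lo16(b))


def mpylh(a: int, b: int) -> int:    # signed 16 LSBs x signed 16 MSBs
    return s16(lo16(a)) * s16(hi16(b))


def smpy(a: int, b: int) -> int:
    """Signed LSB multiply, shifted left by one bit, with saturation."""
    product = mpy(a, b) << 1
    return min(product, 0x7FFFFFFF)  # only -32768 * -32768 can overflow


a, b = 0x0002FFFF, 0x00030004
print(mpy(a, b), mpyh(a, b), mpyhl(a, b), mpylh(a, b), mpyu(a, b))
print(hex(smpy(0x8000, 0x8000)))
```

The left shift in `smpy` is what makes the product of two Q15 fractions come out as a Q31 fraction, which is the usual reason for using the saturating multiply.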

14.1.4 Instructions to .D Functional unit


The .D unit (Data access unit) is a dedicated unit for memory access. Linear and circular address
generation and load and store operations with a 5-bit constant offset are performed by this unit. Only the
load and store operations of the .D2 unit can use a 15-bit constant offset. Apart from data access, the .D
unit performs only 32-bit add and subtract operations. The .D unit of the 'C64X can also perform 32-bit
logical operations and 5-bit constant generation. The fixed point instructions of the .D unit common to
'C62X, 'C64X and 'C67X processors are given in Table 14.4.
Table 14.4 Assembly Language Instructions for .D Functional unit

Type of operations Mnemonic Description
Arithmetic operations ADD Signed integer addition operation without saturation
ADDAB/ADDAH/ADDAW Integer addition using byte/half word/word addressing mode
SUB Signed integer subtraction operation without saturation
SUBAB/SUBAH/SUBAW Integer subtraction using byte/half word/word addressing mode
Load store operations LDB/LDBU/LDH/LDHU/LDW Load byte/half word/word from memory with 5-bit/15-bit unsigned constant offset or register offset
STB/STH/STW Store byte/half word/word to memory with 5-bit/15-bit unsigned constant offset or register offset
Other operations MV Move from register to register operation (Pseudo-operation)
ZERO Zero a register (Pseudo-operation)

ADDRESSING MODES 14.2


The addressing modes of 'C62X, 'C64X and 'C67X are
(i) Register addressing mode
(ii) Linear addressing mode (or indirect addressing mode)
(iii) Circular addressing mode
All the functional units (.L, .S, .M and .D) can use any register in the register file (A0-A15 and B0-B15)
for register addressing. Linear and circular addressing modes use the .D unit alone.
All the registers of the register file can be used for linear addressing mode, but for circular addressing the
registers A4-A7 are used by the .D1 unit and registers B4-B7 are used by the .D2 unit.

14.2.1 Register Addressing Mode


The register file of 'C62X and 'C67X contains 32 registers and that of 'C64X contains 64 registers. The
contents of these registers are used as operands. The syntax of the assembly language instruction for
register addressing mode is given below. The instruction contains four fields: the mnemonic, the functional
unit, the source operands and the destination operand.
mnemonic .unit src1, src2, dst
The mnemonic field holds assembly codes such as ADD, MPY and SUB that support register
addressing mode. For the .unit field, any of the eight functional units is specified depending upon the
operation performed, as per Tables 14.1 to 14.4. The source operands (src1, src2) and destination
operand (dst) are registers of the register file.

Example 14.1 ADD .L1 A1,A2,A3 – This instruction adds the hexadecimal signed integer operands
in register A1 and A2. The result is stored in register A3. The content of register A1
and A2 are unchanged. The functional unit used is .L1 and the registers of path A are used for both source
and destination operands.
Before execution After execution
A1 11223344 A1 11223344
A2 33445566 A2 33445566
A3 22222222 A3 446688AA

In the above example, the functional units .S1 and .D1 can also be used to perform the add operation. For the
source and destination operands, registers from register file A alone are to be used. In the same way, to do an add
operation in register path B, the functional units .L2, .S2 and .D2 are used. The source and destination
operand registers are to be taken only from register file B (registers B0-B15). For the arithmetic and logic
instructions, a source operand and the destination operand can be the same register of the register file.

Example 14.2 ADD .S2 B1,B2,B2 – This instruction adds the hexadecimal signed integer operands
in register B1 and B2. The result is stored in register B2 itself after addition. The
content of register B1 is unchanged. The functional unit used is .S2 and the registers of path B are used
for both source and destination operands.
Before execution After execution
B1 3456789A B1 3456789A
B2 11112222 B2 45679ABC

The data path of the 'C6X architecture has cross paths between paths A and B (1X and 2X). A cross path
is used to access one of the source operands from the opposite path. The destination operand cannot use
the cross path.

Example 14.3 ADD .L1X A1,B2,A2 – This instruction adds the hexadecimal signed integer operands
in register A1 and B2. The result is stored in register A2. The content of register A1
and B2 are unchanged. The functional unit used is .L1 and the registers of path A are used for the source
operand (A1) and destination operand (A2). The source operand B2 is obtained through cross path from
register file B.
Before execution After execution
A1 22221111 A1 22221111
B2 33332222 B2 33332222
A2 44444444 A2 55553333
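The before/after tables of Examples 14.1 to 14.3 can be checked with a few lines of Python. This is a sketch under simple assumptions: the register file is modelled as a dictionary, the helper add is hypothetical, and the result is truncated to 32 bits because ADD does not saturate.

```python
MASK32 = 0xFFFFFFFF

def add(regs, src1, src2, dst):
    """Model of ADD src1,src2,dst: 32-bit add, only dst is written."""
    regs[dst] = (regs[src1] + regs[src2]) & MASK32
    return regs

# Example 14.1: ADD .L1 A1,A2,A3
ex1 = add({"A1": 0x11223344, "A2": 0x33445566, "A3": 0x22222222}, "A1", "A2", "A3")
# Example 14.2: ADD .S2 B1,B2,B2 (destination is also a source)
ex2 = add({"B1": 0x3456789A, "B2": 0x11112222}, "B1", "B2", "B2")
# Example 14.3: ADD .L1X A1,B2,A2 (B2 is read through the cross path)
ex3 = add({"A1": 0x22221111, "B2": 0x33332222, "A2": 0x44444444}, "A1", "B2", "A2")

for regs in (ex1, ex2, ex3):
    print({name: f"{value:08X}" for name, value in regs.items()})
```

The model does not distinguish functional units or cross paths; those affect which registers an instruction may name, not the arithmetic result.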

14.2.2 Linear Addressing Mode


The linear addressing mode uses .D (.D1 and .D2) unit alone, along with all the registers of the register
file. The load instruction, store instruction, add and subtract with addressing mode instructions can
use linear addressing mode. These instructions are of three kinds, byte access, half word access and
word access. The syntax of the linear addressing mode type instruction is given below. The instruction
contains four fields, the mnemonic, functional unit, mode field and destination field.
mnemonic .unit mode field, dst
The mnemonic field uses load, store, add and subtract with addressing mode instructions only
(LDB(U)/LDH(U)/LDW, STB/STH/STW, ADDAB/ADDAH/ADDAW/ADDAD and
SUBAB/SUBAH/SUBAW). For the unit field, the .D1 and .D2 units are used. The mode field specifies the type
of address access and the type of address modification. The destination field (dst) can use any of the registers
in the register file. The different types of mode fields that are used in linear addressing mode are given
in Table 14.5. The register containing the base address of the operand is denoted as baseR. The offset
(displacement) from the base address, when specified in a register, is represented as offsetR. Instead of
using a register to specify the offset, a 5-bit unsigned constant can be used as the offset, which is denoted
as ucst5. The registers used for baseR and offsetR must be in the same register file. The destination
(dst) register can be from the opposite register file through the cross path.

Table 14.5 Address generation Option for Mode field in Linear addressing mode
Mode field Syntax Address modification performed
*+baseR[offsetR/ucst5] Positive offset from baseR specified by offsetR/ucst5
*-baseR[offsetR/ucst5] Negative offset from baseR specified by offsetR/ucst5
*++baseR[offsetR/ucst5] Pre-increment from baseR specified by offsetR/ucst5
*--baseR[offsetR/ucst5] Pre-decrement from baseR specified by offsetR/ucst5
*baseR++[offsetR/ucst5] Post-increment from baseR specified by offsetR/ucst5
*baseR--[offsetR/ucst5] Post-decrement from baseR specified by offsetR/ucst5

The offset value specified in the offset register (offsetR) or the 5-bit unsigned constant given in the
instruction is left shifted by 0, 1 or 2 bits for the byte, half-word and word access instructions respectively.
The address of the operand is then found as follows:
(i) For the *+ or *- mode fields, the shifted offset value is added to or subtracted from the value
in the base register (baseR). The result is the address of the operand to be accessed in
memory. The content of the base register is unchanged.
(ii) For the *++ or *-- mode fields, the address is calculated as in (i), and the base register is
also updated to this incremented or decremented value before the memory is accessed
(pre-increment/pre-decrement). The address of the operand is the updated content of the
base register.
(iii) For *baseR++ or *baseR--, the address of the operand is the original content of the base
register. After the memory is accessed, the base register is incremented or decremented by
the shifted offset value (post-increment/post-decrement).
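The three cases above can be summarised in a short Python sketch. The function linear_address and its mode strings are hypothetical names chosen for this illustration; the calls at the end reproduce the address arithmetic of Examples 14.4 to 14.6.

```python
MASK32 = 0xFFFFFFFF
SHIFT = {"byte": 0, "half": 1, "word": 2}  # left shift applied to the offset

def linear_address(mode, base, offset, access):
    """Return (address accessed, base register value after the instruction)."""
    step = offset << SHIFT[access]
    if mode == "*+R":    # positive offset, base unchanged
        return (base + step) & MASK32, base
    if mode == "*-R":    # negative offset, base unchanged
        return (base - step) & MASK32, base
    if mode == "*++R":   # pre-increment: base updated, then used
        new = (base + step) & MASK32
        return new, new
    if mode == "*--R":   # pre-decrement: base updated, then used
        new = (base - step) & MASK32
        return new, new
    if mode == "*R++":   # post-increment: base used, then updated
        return base, (base + step) & MASK32
    if mode == "*R--":   # post-decrement: base used, then updated
        return base, (base - step) & MASK32
    raise ValueError(f"unknown mode field {mode!r}")

ex4 = linear_address("*+R", 0x500, 1, "word")   # LDW .D1 *+A0[1],A1
ex5 = linear_address("*++R", 0x500, 4, "word")  # LDW .D1 *++A0[A4],A1 with A4 = 4
ex6 = linear_address("*R++", 0x500, 2, "word")  # LDW .D1 *A0++[2],A1
for addr, base in (ex4, ex5, ex6):
    print(f"address {addr:03X}h, base after {base:03X}h")
```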

Example 14.4 LDW .D1 *+A0[1],A1 – This instruction loads a word from memory into
register A1. The memory address is the base address in register A0
plus the 5-bit constant offset given in brackets, left shifted by two bits. If the base address is
500h, the offset 1 left shifted by two bits is 4, so the address of the memory to be accessed is
504h. The content of A0 is unchanged after the access.
Before execution After execution
A0 00000500 A0 00000500
A1 11111111 A1 3456789A
504h 3456789A 504h 3456789A

Example 14.5 LDW .D1 *++A0[A4],A1 – This instruction loads a word from memory
into register A1. The memory address is the base address in register A0
plus the content of offset register A4 given in brackets, left shifted by two bits. If the base
address is 500h and the content of offset register A4 is 4, the offset left shifted by two bits is 10h (16). The
content of A0 is incremented to 510h before the memory is accessed, so the address of the memory
accessed is 510h. The content of offset register A4 is unchanged after the access.
Before execution After execution
A0 00000500 A0 00000510
A4 00000004 A4 00000004
A1 34587698 A1 55667788
510h 55667788 510h 55667788

Example 14.6 LDW .D1 *A0++[2],A1 – This instruction loads a word from memory into
register A1. The memory address is the base address in register A0.
After the memory is accessed, the new address in register A0 is the content of A0 plus the
offset given in brackets, left shifted by two bits. If the base address is 500h, the address of the
memory accessed is 500h. The offset 2 left shifted by two bits is 8h, so
the new address in A0 is 500h + 8h = 508h.
Before execution After execution
A0 00000500 A0 00000508
A1 76234589 A1 99887766
500h 99887766 500h 99887766

14.2.3 Circular Addressing Mode


In circular addressing mode the .D1 unit of register path A and the .D2 unit of register path B are used. The
registers A4-A7 of path A and B4-B7 of path B can be used for circular addressing. To activate the
circular buffer, the corresponding mode select bits (a two-bit field per register) and the block size (BK0/BK1,
5-bit fields) are to be loaded in the Address Mode Register (AMR) as given in Section 13.8.1.
The load instruction, store instruction, add with addressing mode and subtract with addressing mode
instructions can use circular addressing mode. These instructions are of three kinds, byte access, half
word access and word access. The syntax of the circular addressing mode for load and store instructions
is given below.
mnemonic .unit mode field, dst
The instruction contains four fields: the mnemonic, functional unit, mode field and destination field.
The mnemonic field can be any of the load and store instructions described in Section 14.2.2. The unit field
has to be the .D1 or .D2 unit. The mode field specifies the type of address modification; the types
of address modification used in circular addressing mode are given in Table 14.5. In circular
addressing mode the base register (baseR) specified in the mode field must be one of the registers A4-A7
or B4-B7; the destination (dst) register can be any register of the register file.
The offset value specified in the offset register (offsetR) or the 5-bit unsigned constant given in the
instruction is left shifted by 0, 1 or 2 for the byte, halfword and word access instructions respectively.
The address increment/decrement for the shifted offset value happens up to the end address/start address
of the circular buffer; once it is reached, the address is wrapped around to the start/end address of the
circular buffer.

Example 14.7 For circular addressing mode, register A4 is used. To specify the block size, the BK0
field in the AMR register is used. The two-bit mode field for A4 is 01 and the 5-bit field
specifying the block size in BK0 is 01, hence the control word for AMR is 00010001h. The size of the block
is 2^(1+1) = 4 bytes. If the starting address of the memory is 0100h, the circular buffer extends from 0100h
to 0103h. The content of memory locations 0100h-0103h is 44332211h.
MVK .S1 0x0001,A0 ; move the two-bit mode field value to the lower 16 bits of A0
MVKLH .S1 0x0001,A0 ; move the 5-bit BK0 value to the upper 16 bits of A0
MVC .S2X A0,AMR ; move the control word from A0 to the AMR register
MVK .S1 0x0100,A4 ; load register A4 with the start address of the buffer, 0100h
LDB .D1 *A4++[1],A1 ; load the byte at the address in A4 into A1, then increment A4 by one
NOP 4 ; four no-operation cycles follow the load
Instruction Before execution After execution
LDB .D1 *A4++[1],A1 A4 00000100 A4 00000101
NOP 4 A1 00000000 A1 00000011
LDB .D1 *A4++[1],A1 A4 00000101 A4 00000102
NOP 4 A1 00000011 A1 00000022
LDB .D1 *A4++[1],A1 A4 00000102 A4 00000103
NOP 4 A1 00000022 A1 00000033
LDB .D1 *A4++[1],A1 A4 00000103 A4 00000100
NOP 4 A1 00000033 A1 00000044
In this example, the memory address increments by one location for each load byte instruction. Once
it reaches the end of the buffer, 0103h, the next content of A4 is 0100h. The data access happens
circularly between address locations 0100h and 0103h.
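The wrap-around of Example 14.7 can be sketched in Python. The helper circular_update is hypothetical, and it assumes that the hardware wraps the address by updating only its lowest N+1 bits, where N is the value in the BK field, so the buffer is taken to start on a block-size boundary. The memory bytes follow the little-endian layout implied by the example.

```python
MASK32 = 0xFFFFFFFF

def circular_update(base, offset, shift, bk):
    """Advance base by (offset << shift) inside a block of 2**(bk + 1) bytes.

    Only the low bk + 1 address bits change; the upper bits are preserved.
    """
    low_mask = (1 << (bk + 1)) - 1
    low = (base + (offset << shift)) & low_mask
    return (base & ~low_mask & MASK32) | low

# Bytes of the word 44332211h stored at 0100h-0103h
memory = {0x100: 0x11, 0x101: 0x22, 0x102: 0x33, 0x103: 0x44}

a4 = 0x100
trace = []
for _ in range(4):                          # four LDB .D1 *A4++[1],A1 instructions
    a1 = memory[a4]                         # load from the current address
    a4 = circular_update(a4, 1, 0, bk=1)    # post-increment with BK0 = 1
    trace.append((a1, a4))

for a1, a4_after in trace:
    print(f"A1 {a1:08X}  A4 {a4_after:08X}")
```

After the fourth load the pointer is back at 0100h, which matches the last row of the table in the example.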

The syntax of the circular addressing mode instruction for add with addressing mode and subtract
with addressing mode case is given below.
mnemonic .unit src2, src1, dst
The mnemonic field can be add and subtract with addressing mode instructions given in Section
14.2.2. The source operand src2 should be registers A4-A7 and B4-B7 of the respective data paths. The
source operand src1 can be any register in the register file and the destination operand dst should be the
same register used for source operand src2.
The content of source operand src1 in the instruction is left shifted by 0, 1 or 2 bits for the byte, half-word
and word instructions respectively. The shifted content of src1 is added to or subtracted from the
content of src2. If the result exceeds the circular buffer boundary, it
is wrapped around within the buffer size. The result is available in the destination register dst, so the
content of src2 always stays within the circular buffer.

Example 14.8 For circular addressing mode, register B5 is used. To specify the block size, the BK1
field in the AMR register is used. The two-bit mode field for B5 is 10 and the 5-bit field
specifying the block size in BK1 is 03, hence the control word for AMR is 00600800h. The size of the block
is 2^(3+1) = 16 bytes. If the starting address of the memory is 0100h, the circular buffer extends from
0100h to 010Fh.
MVK .S2 0x0800,B0 ; move the two-bit mode field value to the lower 16 bits of B0
MVKLH .S2 0x0060,B0 ; move the 5-bit BK1 value to the upper 16 bits of B0
MVC .S2 B0,AMR ; move the control word from B0 to the AMR register
MVK .S2 0x0100,B5 ; load register B5 with the start address of the buffer, 0100h
MVK .S2 0x0002,B1 ; load register B1 with the value 02h
ADDAH .D2 B5,B1,B5 ; B1 left shifted by one bit (04h) is added to B5 (0100h) and the result (0104h) is stored in B5; B1 is unchanged
Instruction Before execution After execution
ADDAH .D2 B5,B1,B5 B5 00000100 B5 00000104
ADDAH .D2 B5,B1,B5 B5 00000104 B5 00000108
ADDAH .D2 B5,B1,B5 B5 00000108 B5 0000010C
ADDAH .D2 B5,B1,B5 B5 0000010C B5 00000100
ADDAH .D2 B5,B1,B5 B5 00000100 B5 00000104
In this example, the content of B5 increments by 04h each time the ADDAH instruction is
executed. Once the content of B5 exceeds the end value of the circular buffer, 010Fh, it is wrapped
around to the first value, 0100h. The content of register B5 therefore cycles through four values within the circular
buffer.
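The AMR control words used in Examples 14.7 and 14.8, and the ADDAH sequence of Example 14.8, can be reproduced with the following sketch. The bit positions in MODE_SHIFT, BK0_SHIFT and BK1_SHIFT are an assumption taken from the AMR layout in TI's CPU reference guide (two mode bits per register from bit 0 upward, BK0 at bit 16, BK1 at bit 21); they agree with both control words quoted in the examples. The helper names are hypothetical.

```python
# Assumed AMR layout: mode fields for A4..B7 in bits 15-0, BK0 in bits 20-16,
# BK1 in bits 25-21.
MODE_SHIFT = {"A4": 0, "A5": 2, "A6": 4, "A7": 6,
              "B4": 8, "B5": 10, "B6": 12, "B7": 14}
BK0_SHIFT, BK1_SHIFT = 16, 21

def amr_word(reg, mode, bk0=0, bk1=0):
    """Build an AMR control word for one circular-addressing register."""
    return (mode << MODE_SHIFT[reg]) | (bk0 << BK0_SHIFT) | (bk1 << BK1_SHIFT)

def addah_circular(base, index, bk):
    """ADDAH in circular mode: add (index << 1), wrapping in 2**(bk + 1) bytes."""
    low_mask = (1 << (bk + 1)) - 1
    low = (base + (index << 1)) & low_mask
    return (base & ~low_mask & 0xFFFFFFFF) | low

amr_a4 = amr_word("A4", 0b01, bk0=1)   # Example 14.7
amr_b5 = amr_word("B5", 0b10, bk1=3)   # Example 14.8
print(f"{amr_a4:08X} {amr_b5:08X}")

b5 = 0x100
b5_trace = []
for _ in range(5):                     # five ADDAH instructions with B1 = 2
    b5 = addah_circular(b5, 2, bk=3)
    b5_trace.append(b5)
print([f"{value:08X}" for value in b5_trace])
```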

FIXED POINT INSTRUCTIONS 14.3


In this section, the details of the fixed-point instructions that are common to 'C62X, 'C64X and 'C67X
processors are given. The syntax of each instruction, its functional units and its addressing modes
are given with examples.

14.3.1 Move Instructions


The move instructions are used to move contents between registers of the register file, between the control
register file and the register file, and to move a 16-bit constant into the lower or upper bits of a register
of the register file. To move contents between registers, any of the .L, .S and .D units can be used; to move
values between the control register file and the register file, the .S2 unit alone is used. To move a 16-bit constant to
a register of the register file, the .S1 and .S2 units are used, but in 'C64X processors the same 16-bit
constant move operation can use any of the .L, .S and .D units. The register file cross path is not accessible for the
MVK instruction. The only addressing mode used for move instructions is register addressing mode.
The move instructions of the 'C6X processors and their descriptions are listed in Table 14.6.
Table 14.6 Move Instructions of 'C6X processor
Instruction Functional unit Description
MV .L1 or .L2, .S1 or .S2, .D1 or .D2 Move value from one register to another register in register file
MVC .S2 only Move value between control register file and register file
MVK .S1 or .S2 (all .L, .S and .D units in 'C64X only) Move a 16-bit constant into lower 16 bits of a register, sign extended
MVKLH .S1 or .S2 Move a 16-bit constant into upper 16 bits of a register
MVKH .S1 or .S2 Move upper 16-bit value of 32-bit constant to upper 16 bits of a register

Example 14.9 MV .S1 A1,A2 – Move register to register instruction. The content of register A1 is
moved to register A2, the content of register A1 is unchanged and the functional
unit used is .S1
Before executions After execution
A1 22334455 A1 22334455
A2 20408754 A2 22334455
MV .L1X B1,A3 – Move register to register instruction using the cross path. The content of register B1 is
moved to register A3, the content of B1 is unchanged and the functional unit used is .L1
Before executions After execution
A3 30504321 A3 547698AB
B1 547698AB B1 547698AB

Example 14.10 MVC .S2X A1,AMR – Move value between control register file and register file
instruction. The content of register A1 is moved to the Address Mode Register (AMR) in the
control register file. The content of register A1 is unchanged. The functional unit used is .S2, and since A1 is in register file A it is read through the cross path.
Before execution After execution
A1 00020005 A1 00020005
AMR 00400001 AMR 00020005
MVC .S2 AMR,B2 – The content of AMR is moved to register B2, the content of AMR is unchanged and the
functional unit used is .S2
Before execution After execution
AMR 00020005 AMR 00020005
B2 20408754 B2 00020005

Example 14.11 MVK .S1 0x1223,A1 – Move a 16-bit constant to the lower 16 bits of a register in the register
file. The 16-bit constant 1223h is moved to the lower 16 bits of register A1 and its sign bit (0) is extended into the upper 16 bits. The
functional unit used is .S1
Before execution After execution
A1 00020005 A1 00001223
MVK .S2 -0x012,B2 – The negative 16-bit constant -012h is moved to the lower 16 bits of register B2 and the
sign bit is extended into the upper bits. The 2's complement value of 012h (FFEEh) appears in the lower 16 bits of register
B2, and the upper 16 bits are filled with the sign bit. The functional unit used is .S2
Before execution After execution
B2 00050002 B2 FFFFFFEE

Example 14.12 MVKLH .S1 0x3344,A2 – Move a 16-bit constant to the upper 16 bits of a register
in the register file. The 16-bit constant 3344h is moved to the upper 16 bits of register A2,
the lower 16 bits are unchanged and the functional unit used is .S1
Before execution After execution
A2 00220055 A2 33440055
MVKH .S2 0x44552233,B2 – The upper 16 bits of the 32-bit constant are moved to the upper 16 bits of register B2.
The upper 16-bit value 4455h is moved to the upper 16 bits of register B2; the lower 16 bits are unchanged. The
functional unit used is .S2
Before execution After execution
B2 20404252 B2 44554252
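The effect of the three constant-move instructions on a 32-bit register can be modelled as below. This is an illustrative sketch, with hypothetical helper names, of the behaviour described in Table 14.6: MVK sign extends its 16-bit constant over the whole register, while MVKLH and MVKH replace only the upper 16 bits.

```python
def mvk(cst16):
    """MVK: sign extend a 16-bit constant into a 32-bit register value."""
    c = cst16 & 0xFFFF
    return c | 0xFFFF0000 if c & 0x8000 else c

def mvklh(cst16, reg):
    """MVKLH: lower 16 bits of the constant go to the upper half of reg."""
    return ((cst16 & 0xFFFF) << 16) | (reg & 0xFFFF)

def mvkh(cst32, reg):
    """MVKH: upper 16 bits of the constant go to the upper half of reg."""
    return (cst32 & 0xFFFF0000) | (reg & 0xFFFF)

print(f"{mvk(0x1223):08X}")                     # MVK .S1 0x1223,A1
print(f"{mvk(-0x12):08X}")                      # MVK .S2 -0x012,B2
print(f"{mvklh(0x3344, 0x00220055):08X}")       # MVKLH .S1 0x3344,A2
print(f"{mvkh(0x44552233, 0x20404252):08X}")    # MVKH .S2 0x44552233,B2
```

A full 32-bit constant is normally built with an MVK followed by an MVKH (or MVKLH) on the same register, as done for the AMR control words in Section 14.2.3.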

14.3.2 Load/Store Instructions


The load and store instructions are used for memory to register file and register file to memory data
transfers through the load and store data paths respectively. The various types of load/store instructions are
based on byte, half-word and word access. For load and store instructions, linear and circular addressing
modes are used (Sections 14.2.2, 14.2.3). The offset value given in the instruction is scaled by a left shift
of 0, 1 or 2 bits for byte, half-word or word access respectively. It is then added to or subtracted from the base
register content based on the address modification specified in the instruction. The functional units used
for load and store operations with a register offset or a 5-bit constant offset are the .D1 and .D2 units. For
instructions with a 15-bit constant offset, the .D2 unit alone is used and the base register
must be B14 or B15. The load and store instructions of the 'C6X processors and their
descriptions are listed in Table 14.7.
Table 14.7 Load/Store Instructions of 'C6X processor
Instruction Functional unit Description
LDB/STB .D1 or .D2 Load byte from memory to register in register file/Store byte from register in register file to memory
LDBU .D1 or .D2 Load byte unsigned from memory to register in register file
LDH/STH .D1 or .D2 Load half word from memory to register in register file/Store half word from register in register file to memory
LDHU .D1 or .D2 Load half word unsigned from memory to register in register file
LDW/STW .D1 or .D2 Load word from memory to register in register file/Store word from register in register file to memory
(For 15-bit constant offset, functional unit used is .D2 only)

Example 14.13 LDB .D1 *A0,A1 – Load byte instruction. The byte content of the memory location
whose address is in base address register A0 is loaded into register A1, and the
sign bit is extended into the upper bits of register A1. The memory address is 100h and the byte content of location 100h
is 44h. The value 44h is moved to the LSBs of register A1 and, since its sign bit is 0, the upper bits are filled with zeros. The contents of
register A0 and location 100h are unchanged; the functional unit used is .D1
Before execution After execution
A1 11111111 A1 00000044
A0 00000100 A0 00000100
100h 11223344 100h 11223344

LDH .D1 *+A0[2],A2 – Load half-word instruction with positive offset. To calculate the address of the
memory to be accessed, the 5-bit constant offset given in the instruction is left shifted once and added
to the content of base register A0. The content of register A0 is unchanged. The half-word content of the
memory address is moved to register A2. The offset value 2 left shifted once is 4; the content of the base
register is 100h, hence the address of the memory is 104h. The half-word content of memory (104h), 8899h, is
loaded into the LSBs of register A2 and the MSBs are sign extended as FFFFh. The functional unit used is .D1
Before execution After execution
A2 11111111 A2 FFFF8899
A0 00000100 A0 00000100
104h 44558899 104h 44558899
LDW .D1 *++A0[1],B2 – Load word instruction with pre-increment. The 5-bit constant offset given in the
instruction is left shifted by two bits and added to the content of base register A0. The content of base register
A0 is pre-incremented and becomes the address of the memory to be accessed. The word content of that memory
address is moved to register B2. The offset value 1 left shifted twice is 4. If the content of base register
A0 is 100h, the new content of A0 is 104h and the memory address is also 104h. The word content of
memory (104h), 44558899h, is loaded into register B2. The content of location 104h is unchanged. The
functional unit used is .D1 along with the cross path.
Before execution After execution
B2 11111111 B2 44558899
A0 00000100 A0 00000104
104h 44558899 104h 44558899
STW .D2 B3,*B1--[B0] – Store word instruction with post-decrement. The content of register B3 is stored
in memory. The address of the operand is the content of base register B1 and the offset is specified in
offset register B0. After the store, the content of offset register B0 is left shifted by two bits and subtracted from the
content of base register B1 to give the new content of base register B1. If the content of register B3
is 00004578h and the content of base register B1 is 100h, the content of register B3 is stored in the memory
location pointed to by register B1 (100h). The content of offset register B0 is 1h, which left shifted twice is 4h;
subtracting this from 100h gives 0FCh, the new content of B1. The contents of B0 and B3 are unchanged and
the functional unit used is .D2
Before execution After execution
B3 00004578 B3 00004578
B1 00000100 B1 000000FC
B0 00000001 B0 00000001
100h 11223344 100h 00004578
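The sign extension and zero extension performed by the load instructions can be checked with a small byte-addressed memory model. This is a sketch only: the helpers load and store are hypothetical, and little-endian byte order is assumed because it matches the examples above (the byte at 100h of the word 11223344h is 44h).

```python
MASK32 = 0xFFFFFFFF

def store(mem, addr, value, nbytes):
    """STB/STH/STW model: write the low nbytes of value, little-endian."""
    for i in range(nbytes):
        mem[addr + i] = (value >> (8 * i)) & 0xFF

def load(mem, addr, nbytes, signed=True):
    """LDB/LDH/LDW model (signed=False gives LDBU/LDHU): returns 32 bits."""
    value = sum(mem[addr + i] << (8 * i) for i in range(nbytes))
    sign_bit = 1 << (8 * nbytes - 1)
    if signed and value & sign_bit:
        value -= 1 << (8 * nbytes)      # sign extend
    return value & MASK32

mem = {}
store(mem, 0x100, 0x11223344, 4)
store(mem, 0x104, 0x44558899, 4)

ldb = load(mem, 0x100, 1)                  # LDB .D1 *A0,A1
ldh = load(mem, 0x104, 2)                  # LDH .D1 *+A0[2],A2
ldhu = load(mem, 0x104, 2, signed=False)   # LDHU on the same half-word
ldw = load(mem, 0x104, 4)                  # LDW .D1 *++A0[1],B2

store(mem, 0x100, 0x00004578, 4)           # STW .D2 B3,*B1--[B0]
stw_readback = load(mem, 0x100, 4)

print(f"{ldb:08X} {ldh:08X} {ldhu:08X} {ldw:08X} {stw_readback:08X}")
```

The half-word 8899h has its top bit set, so LDH and LDHU give different register contents for the same memory location.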

14.3.3 Add Instructions


The various types of add instructions of the 'C6X processor and their descriptions are given in Table 14.8.
All add instructions use register addressing mode except the addition using addressing mode instructions
(ADDAB/ADDAH/ADDAW). Addition with addressing mode instructions use circular addressing
mode (Section 14.2.3). Signed integer addition with and without saturation, unsigned integer addition,
16-bit constant addition and two 16-bit integer addition operations can be performed using add instructions.
To perform signed integer addition, any of the .L, .S and .D functional units can be used. For unsigned
integer addition and integer addition with saturation, the .L1 and .L2 units are used. To add a 16-bit constant
and to add two 16-bit integers, the .S1 and .S2 units are used. Integer addition with addressing mode instructions
uses the .D1 and .D2 units.
Table 14.8 Add Instructions of 'C6X processor
Instruction Functional unit Description
ADD .L1 or .L2, .S1 or .S2, .D1 or .D2 Signed integer addition without saturation
ADDU .L1 or .L2 Unsigned integer addition without saturation
SADD .L1 or .L2 Integer addition with saturation to result size
ADDK .S1 or .S2 Integer addition using signed 16-bit constant
ADD2 .S1 or .S2 Two 16-bit integer additions on upper and lower register halves
ADDAB/ADDAH/ADDAW .D1 or .D2 Integer Byte/Half-word/Word addition using addressing mode

Example 14.14 ADD .D1 31,A0,A1 – Five-bit constant add instruction. On the .D unit the
five-bit constant is unsigned (0 to 31). The given constant is added to the content of register A0 and the result is
stored in register A1. The 5-bit constant 31 (1Fh) is added to the content of register A0, 00008754h, and the result
00008773h is loaded in register A1. The functional unit used is .D1 and the content of register A0 is
unchanged.
Before execution After execution
A0 00008754 A0 00008754
A1 00020005 A1 00008773

Example 14.15 ADD .L1 A0,A1,A2 – Signed 32-bit integer add instruction. The signed integer
contents of registers A0 and A1 are added and the result is stored in register A2. If the
32-bit positive integers 00045566h in A0 and 00076655h in A1 are added, the result 000BBBBBh is
stored in register A2. The contents of registers A0 and A1 are unchanged; the functional unit used is .L1
Before execution After execution
A0 00045566 A0 00045566 +284006
A1 00076655 A1 00076655 +484949
A2 12348765 A2 000BBBBB +768955
ADD .S2X A0,B1,B2 – If the 32-bit positive integer in register A0 is 00045566h and the negative integer in
register B1 is FFFFC742h, they are added and the result 00041CA8h is stored in register B2. The contents
of registers A0 and B1 are unchanged; the functional unit used is .S2 with the cross path.
Before execution After execution
A0 00045566 A0 00045566 +284006
B1 FFFFC742 B1 FFFFC742 -14526
B2 12348765 B2 00041CA8 +269480

Example 14.16 SADD .L1 A1,A2,A3 – Signed integer add instruction with saturation. The signed
integer contents of registers A1 and A2 are added; the result is stored in
register A3 if there is no saturation. If the result saturates, the maximum positive
number (7FFF FFFFh) is loaded in register A3 for a positive overflow and the most negative number (8000 0000h) for a negative overflow,
and the SAT bit in the CSR register is set. The functional unit used is .L1; the contents of registers
A1 and A2 are unchanged.
Before execution After execution
A1 7D007D00 A1 7D007D00
A2 13881388 A2 13881388
A3 12247666 A3 7FFF FFFF
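Saturating addition can be sketched as follows. The helper sadd is a hypothetical model that returns the 32-bit result together with a flag standing in for the SAT bit; it is meant only to show the clamping rule described in the example.

```python
MASK32 = 0xFFFFFFFF
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000

def to_signed(x):
    """Interpret a 32-bit register value as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def sadd(src1, src2):
    """Return (dst, sat): saturated 32-bit sum and whether saturation occurred."""
    total = to_signed(src1) + to_signed(src2)
    if total > INT32_MAX:
        return INT32_MAX, True
    if total < INT32_MIN:
        return INT32_MIN & MASK32, True
    return total & MASK32, False

dst, sat = sadd(0x7D007D00, 0x13881388)   # SADD .L1 A1,A2,A3
print(f"A3 {dst:08X}  SAT {int(sat)}")
```

Without saturation the same operands would give 90889088h, which reads as a negative number; clamping to 7FFFFFFFh keeps the sign of the true sum.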

Example 14.17 ADDU .L1 A5,A6,A9:A8 – Unsigned 32-bit add instruction. The unsigned 32-bit
contents of register A5 and register A6 are added and the resultant 40-bit content
is loaded in the A9:A8 register pair. If the 32-bit integer in register A5 is 00087654h and that in register A6 is
FFFF4332h, they are added and the resultant 40-bit content 10007B986h is loaded in the A9:A8
register pair. The functional unit used is .L1; the contents of registers A5 and A6 are unchanged.
Before execution After execution
A5 00087654 A5 00087654 +554580
A6 FFFF4332 A6 FFFF4332 +4294918962
A9:A8 00020005 00020005 A9:A8 00000001 0007B986 +4295473542

Example 14.18 ADDK .S1 2345,A1 – A 16-bit signed constant add instruction. The signed 16-bit
constant given in the instruction is added with the content of register A1 and the
result stored in A1. If the 16-bit positive constant is 2345 (0929h), the content in register A1 is 00015432h,
they are added, the result 00015D5Bh is stored in register A1. The functional unit used is .S1
Before executions After execution
A1 00015432 A1 00015D5B
ADD2 .S2 B1,B2,B3 – Two 16-bit integer add instruction on upper and lower register halves. The upper
and lower halves of register B1 are added to the upper and lower halves of register B2 respectively, and
the results are stored in the upper and lower halves of register B3. If the content of register B1 is
00347698h and that of register B2 is 03127654h, the upper and lower halves are added and the result 0346ECECh is stored
in register B3. The contents of registers B1 and B2 are unchanged; the functional unit used is .S2.
Before executions After execution
B1 00347698 B1 00347698
B2 03127654 B2 03127654
B3 00000544 B3 0346ECEC
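The results of Examples 14.17 and 14.18 can be verified with the sketch below. The helpers addu, addk and add2 are hypothetical models; addu returns the upper 8 bits and lower 32 bits of the 40-bit sum as a pair standing in for A9:A8, and add2 assumes that a carry out of the lower half is discarded rather than passed to the upper half.

```python
MASK32 = 0xFFFFFFFF

def s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def addu(src1, src2):
    """ADDU: unsigned add with a 40-bit result, returned as (odd, even) pair."""
    total = (src1 & MASK32) + (src2 & MASK32)
    return (total >> 32) & 0xFF, total & MASK32

def addk(cst16, reg):
    """ADDK: add a signed 16-bit constant to a register."""
    return (reg + s16(cst16)) & MASK32

def add2(src1, src2):
    """ADD2: independent 16-bit adds on the upper and lower halves."""
    upper = ((src1 >> 16) + (src2 >> 16)) & 0xFFFF
    lower = ((src1 & 0xFFFF) + (src2 & 0xFFFF)) & 0xFFFF
    return (upper << 16) | lower

a9, a8 = addu(0x00087654, 0xFFFF4332)   # ADDU .L1 A5,A6,A9:A8
a1 = addk(2345, 0x00015432)             # ADDK .S1 2345,A1
b3 = add2(0x00347698, 0x03127654)       # ADD2 .S2 B1,B2,B3
print(f"A9:A8 {a9:02X} {a8:08X}  A1 {a1:08X}  B3 {b3:08X}")
```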

14.3.4 Subtract Instructions


The various types of subtract instructions of the 'C6X processor and their descriptions are given in Table
14.9. All subtract instructions use register addressing mode except the subtraction using addressing mode
instructions (SUBAB/SUBAH/SUBAW). Subtraction with addressing mode instructions use circular
addressing mode (Section 14.2.3). Signed integer subtraction with and without saturation, unsigned
integer subtraction, conditional integer subtraction and two 16-bit integer subtraction operations can be
performed using subtract instructions. To perform signed integer subtraction, any of the .L, .S and .D functional
units can be used. For unsigned integer subtraction, the .L and .S units are used. The .L units are used for
conditional integer subtract and integer subtract with saturation operations. To subtract two 16-bit
integers, the .S1 and .S2 units are used. Integer subtraction with addressing mode instructions uses the .D1 and
.D2 units. The SUB, SUBU, SSUB, SUB2 and subtraction using addressing mode instructions operate
in the same way as the respective add instructions, except that the operands are subtracted rather than
added. The conditional subtract instruction SUBC is used for signed and unsigned integer division
operations.

Table 14.9 Subtract Instructions of 'C6X processor

Instruction Functional unit Description
SUB .L1 or .L2, .S1 or .S2, .D1 or .D2 Signed integer subtraction without saturation (For .D units src1 is subtracted from src2)
SUBU .L1 or .L2, .S1 or .S2 Unsigned integer subtraction without saturation
SUBC .L1 or .L2 Conditional integer subtract and shift used for division
SSUB .L1 or .L2 Integer subtraction with saturation to result size
SUB2 .S1 or .S2 Two 16-bit integer subtractions on upper and lower register halves
SUBAB/SUBAH/SUBAW .D1 or .D2 Integer Byte/Half-word/Word subtraction using addressing mode

Example 14.19 SUBC .L1 A1,A2,A3 – Conditional subtract and shift operation. The content of
register A2 is subtracted from the content of register A1. If the subtraction result
is ≥ 0, the result is left shifted by one bit, 1 is added to the LSB and the final value is loaded in
register A3. If the subtraction result is less than zero, the content of register A1 is left shifted by one
bit and the shifted value is loaded in register A3.
(i) If the content of register A2 is 00000404h and A1 is 00002222h, the A2 content is subtracted from the A1
content. The result 00001E1Eh, which is ≥ 0, is left shifted by 1 bit (00003C3Ch), 1 is added to the LSB and
the final result 00003C3Dh is loaded in register A3.
Before execution After execution
A1 00002222 A1 00002222
A2 00000404 A2 00000404
A3 12243333 A3 00003C3D
(ii) If the content of register A2 is 00002424h and A1 is 00002222h, the A2 content is subtracted from the A1
content. The result is less than zero, so the content of register A1 is left shifted by 1 bit and the result
00004444h is loaded in register A3. The contents of registers A1 and A2 are unchanged and the unit used is .L1.
Before execution After execution
A1 00002222 A1 00002222
A2 00002424 A2 00002424
A3 12243333 A3 00004444
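
The conditional subtract and shift step of Example 14.19 can be checked with a small behavioural model. The Python sketch below is only an illustration, not TI code: the function names subc and divide16 are hypothetical, operands are assumed to be non-negative values below 2^31, and the division routine assumes the usual pattern of aligning a 16-bit divisor and repeating SUBC 16 times.

```python
MASK32 = 0xFFFFFFFF

def subc(src1, src2):
    """Model of SUBC src1,src2,dst for non-negative 32-bit operands."""
    diff = src1 - src2
    if diff >= 0:
        # Subtraction succeeded: shift the difference left and set the LSB.
        return ((diff << 1) + 1) & MASK32
    # Subtraction failed: shift the original src1 left, LSB stays 0.
    return (src1 << 1) & MASK32

def divide16(numerator, denominator):
    """Unsigned 16-bit division built from 16 SUBC steps (assumed usage)."""
    den = denominator << 15      # align the divisor with the numerator
    acc = numerator
    for _ in range(16):          # one quotient bit per SUBC step
        acc = subc(acc, den)
    return acc & 0xFFFF, acc >> 16   # (quotient, remainder)

print(hex(subc(0x00002222, 0x00000404)))   # case (i) of Example 14.19
print(hex(subc(0x00002222, 0x00002424)))   # case (ii) of Example 14.19
print(divide16(1000, 7))
```

Each SUBC step produces one quotient bit in the LSB, which is why the instruction is repeated once per bit of the quotient.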

14.3.5 Multiply Instructions


All multiply instructions of the 'C6X use register addressing mode. The multiply instructions of the
'C6X processor and their descriptions are given in Table 14.10. The multiply instructions are of signed
and unsigned type. Multiplication can be performed on the 16 LSBs, the 16 MSBs, or the 16 LSBs of one
register with the 16 MSBs of the other (and vice versa). Integer multiplication with left shift and
saturation can also be performed on the lower and higher order halves of the register contents. The .M1
and .M2 units are used to perform multiplication. The MPY and MPYSU instructions also support
multiplication by a signed 5-bit constant.
Table 14.10 Multiply Instructions of 'C6X processor
Instruction   Functional unit   Description
MPY/MPYU   .M1 or .M2   Signed/Unsigned integer multiplication on 16 LSBs
MPYH/MPYHU   .M1 or .M2   Signed/Unsigned integer multiplication on 16 MSBs
MPYLH/MPYLHU   .M1 or .M2   Signed/Unsigned integer multiply on 16 LSBs and 16 MSBs
MPYHL/MPYHLU   .M1 or .M2   Signed/Unsigned integer multiply on 16 MSBs and 16 LSBs
MPYUS/MPYSU   .M1 or .M2   US: unsigned and signed / SU: signed and unsigned integer multiplication on 16 LSBs
MPYHUS/MPYHSU   .M1 or .M2   US: unsigned and signed / SU: signed and unsigned integer multiplication on 16 MSBs
MPYLUHS/MPYLSHU   .M1 or .M2   Unsigned 16 LSBs and signed 16 MSBs / signed 16 LSBs and unsigned 16 MSBs multiplication
MPYHULS/MPYHSLU   .M1 or .M2   Unsigned 16 MSBs and signed 16 LSBs / signed 16 MSBs and unsigned 16 LSBs multiplication
SMPY/SMPYH   .M1 or .M2   Integer multiplication with left shift and saturation on 16 LSBs / 16 MSBs
SMPYHL/SMPYLH   .M1 or .M2   Integer multiplication with left shift and saturation on 16 MSBs and 16 LSBs / 16 LSBs and 16 MSBs

Example 14.20 MPYU .M1 A1,A2,A3 – Unsigned integer multiply instruction. The unsigned 16-bit
numbers present in the 16 LSBs of registers A1 and A2 are multiplied and the result is
stored in register A3. If the content of register A1 is 56003442h and register A2 is 23451122h, the 16 LSBs
(3442h and 1122h) are multiplied and the result 037F52C4h is stored in register A3. The contents of
registers A1 and A2 are unchanged and the functional unit used is .M1.
Before execution After execution
A1 56003442 (13378) A1 56003442 (13378) 16 LSB value
A2 23451122 (4386) A2 23451122 (4386) 16 LSB value
A3 00007689 A3 037F52C4 (58675908) Product value
MPYHL .M1 A4,A5,A6 – Signed integer multiply instruction on 16 MSBs and 16 LSBs of registers. The signed
16-bit numbers present in the 16 MSBs of register A4 and the 16 LSBs of register A5 are multiplied and the
result is stored in register A6. If the content of register A4 is FFA13344h and register A5 is 48480044h, the
16 MSBs (FFA1h) and 16 LSBs (0044h) are multiplied and the result FFFFE6C4h is stored in register A6. The
contents of registers A4 and A5 are unchanged and the functional unit used is .M1.
Before execution After execution
A4 FFA13344 (-95) A4 FFA13344 (-95) 16 MSB value
A5 48480044 (68) A5 48480044 (68) 16 LSB value
A6 00560544 A6 FFFFE6C4 (-6460) Product value

Example 14.21 SMPYLH .M1 A1,A2,A3 – Integer multiply with left shift and saturation instruction.
The signed numbers in the 16 LSBs of register A1 and the 16 MSBs of register A2 are
multiplied, and the result is left shifted by one bit and stored in register A3. If the left shifted result is
8000 0000h, the result is saturated to 7FFF FFFFh. If the content of register A1 is F023 3344h and
register A2 is 8787 4A81h, the 16 LSBs (3344h) and 16 MSBs (8787h) are multiplied, the product E7DF
E4DCh is left shifted by one bit and the value CFBF C9B8h is stored in register A3. The contents of
registers A1 and A2 are unchanged and the functional unit used is .M1.
Before execution After execution
A1 F0233344 (13124) A1 F0233344 16 LSB value
A2 87874A81 (-30841) A2 87874A81 16 MSB value
A3 00007689 A3 CFBFC9B8 (-809514568)
MPY .M1 14,A1,A2 – Signed 5-bit constant multiply instruction on 16 LSBs of register. The signed 5-bit
constant in the instruction is multiplied with the 16 LSBs of register A1 and the result is stored in register
A2. If the content of register A1 is 2131 3344h, the 16 LSBs (3344h) and 14 (Eh) are multiplied and the
result 0002 CDB8h is stored in register A2. The content of register A1 is unchanged and the functional
unit used is .M1.
Before execution After execution
A1 21313344 A1 21313344
A2 48480044 A2 0002CDB8
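
The half-word selection and sign handling in Examples 14.20 and 14.21 can be reproduced with the following Python sketch. It is a behavioural model written for illustration only; the helper names (lo, hi, s16, mpyu, mpyhl, smpylh) are hypothetical and simply mirror the mnemonics.

```python
MASK32 = 0xFFFFFFFF

def lo(reg):
    """16 LSBs of a 32-bit register value."""
    return reg & 0xFFFF

def hi(reg):
    """16 MSBs of a 32-bit register value."""
    return (reg >> 16) & 0xFFFF

def s16(half):
    """Interpret a 16-bit field as a signed two's complement number."""
    return half - 0x10000 if half & 0x8000 else half

def mpyu(a, b):
    """MPYU: unsigned 16 LSBs x unsigned 16 LSBs."""
    return (lo(a) * lo(b)) & MASK32

def mpyhl(a, b):
    """MPYHL: signed 16 MSBs of a x signed 16 LSBs of b."""
    return (s16(hi(a)) * s16(lo(b))) & MASK32

def smpylh(a, b):
    """SMPYLH: signed 16 LSBs of a x signed 16 MSBs of b, shifted left by 1."""
    product = (s16(lo(a)) * s16(hi(b))) << 1
    # Only 8000h x 8000h produces +2^31, which is saturated to 7FFF FFFFh.
    return 0x7FFFFFFF if product == 0x80000000 else product & MASK32

print(hex(mpyu(0x56003442, 0x23451122)))
print(hex(mpyhl(0xFFA13344, 0x48480044)))
print(hex(smpylh(0xF0233344, 0x87874A81)))
```

The saturation branch is reached only when both 16-bit operands are 8000h, because that is the only product whose left shifted value does not fit in a signed 32-bit register.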

14.3.6 Logical, Shift and Compare Instructions


Like other processors, the 'C6X supports logical operations, arithmetic and logical shift operations, and
signed and unsigned integer compare operations. The logical, shift and compare instructions
of the 'C6X are listed in Table 14.11. The logical operations are performed by the .L and .S units, the shift
operations by the .S units and the compare operations by the .L units. All the logical
operations are bitwise operations that use only register addressing mode. The AND and OR operations
accept a signed 5-bit constant as a source operand; the constant is sign extended into the remaining 27 MSBs.
The shift and compare operations also use only register addressing mode.
The arithmetic shift supports both left and right shifting of register contents, whereas the logical shift
supports only right shifting. In both arithmetic and logical shift operations, when a register specifies the
shift amount its 6 LSBs are used and the shift value is 0–40; when an immediate value is given in the
instruction the shift amount is 0–31 (5 bits). In the case of the shift left with saturation instruction, the
shift amount is 0–31 for both register and immediate types. If the shift amount is greater than 31 bits the
result is saturated to 7FFF FFFFh. In all cases the shift value should be an unsigned number.
In the case of compare operations, signed and unsigned integers can be compared for equality, greater
than and less than cases. The compare instructions accept a signed 5-bit constant as a source operand.
If the comparison is true, 1 is written to the destination (dst) register; otherwise 0 is written.

Example 14.22 AND .L1 A1,A2,A3 – Bitwise AND operation instruction. The bitwise AND operation
is performed between the contents of register A1 and A2. The result is placed in
register A3. If signed 5-bit constant is used as operand, the sign is extended to 32 bits. If the content of
register A1 is 7367 5454h, register A2 is 8282 7676h, the bitwise AND operation between the register
contents 0202 5454h is loaded in register A3. The functional unit used is .L1, the register content A1 and
A2 are unchanged.
Before executions After execution
A1 73675454 A1 73675454
A2 82827676 A2 82827676
A3 11224509 A3 02025454
Table 14.11 Logical, Compare and Shift Instructions of 'C6X processor
Instruction   Functional unit   Description
NOT   .L1 or .L2, .S1 or .S2   Bitwise NOT (pseudo-operation)
AND   .L1 or .L2, .S1 or .S2   Bitwise AND
OR   .L1 or .L2, .S1 or .S2   Bitwise OR
NEG   .L1 or .L2, .S1 or .S2   Negate (pseudo-operation)
SHL   .S1 or .S2   Arithmetic shift left
SHR   .S1 or .S2   Arithmetic shift right
SHRU   .S1 or .S2   Logical shift right
SSHL   .S1 or .S2   Shift left with saturation
CMPEQ   .L1 or .L2   Integer compare for equality
CMPGT/CMPGTU   .L1 or .L2   Signed/Unsigned integer compare for greater than
CMPLT/CMPLTU   .L1 or .L2   Signed/Unsigned integer compare for less than

Example 14.23 SHR .S2 B1,B2,B3 – Arithmetic shift right instruction. The content of register B1
is right shifted by the number of bits specified in register B2 and the result is stored in register
B3. If the content of register B1 is 7367 5454h and register B2 is 0000 0012h, the content of B1 is right
shifted by 12h = 18 bits and the result 0000 1CD9h is stored in register B3. The functional unit used is .S2;
the contents of registers B1 and B2 are unchanged.
Before executions After execution
B1 73675454 B1 73675454
B2 00000012 B2 00000012
B3 00020005 B3 00001CD9

Example 14.24 CMPGT .L1X A1,B2,A2 – Integer compare for greater than instruction. The
contents of register A1 and B2 are compared for greater number. If register A1
content is greater than B2, the comparison is true, 1 is stored in register A2. If content of A1 is less than
B2, the comparison is false, 0 is stored in register A2. If the content of register A1 is 7676h, register B2
is 5454h, the content of A1 is greater than B2. The comparison is true, 1 is set in register A2. The
functional unit used is .L1 through cross path; the content of registers A1 and B2 are unchanged.
Before executions After execution
A1 00007676 A1 00007676
B2 00005454 B2 00005454
A2 00020005 A2 00000001
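
The difference between the arithmetic and logical right shifts, and between the signed and unsigned compares, can be seen in a short Python model. This is a sketch under stated assumptions: the function names are hypothetical, registers are modelled as 32-bit integers, and the shift count is simply masked to its 6 LSBs.

```python
MASK32 = 0xFFFFFFFF

def s32(value):
    """Interpret a 32-bit register value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def shr(value, count):
    """SHR: arithmetic shift right, the sign bit is replicated."""
    return (s32(value) >> (count & 0x3F)) & MASK32

def shru(value, count):
    """SHRU: logical shift right, zeros are shifted in."""
    return (value & MASK32) >> (count & 0x3F)

def cmpgt(src1, src2):
    """CMPGT: signed compare, returns 1 when src1 > src2."""
    return int(s32(src1) > s32(src2))

def cmpgtu(src1, src2):
    """CMPGTU: unsigned compare, returns 1 when src1 > src2."""
    return int((src1 & MASK32) > (src2 & MASK32))

print(hex(shr(0x73675454, 0x12)))     # Example 14.23
print(cmpgt(0x00007676, 0x00005454))  # Example 14.24
print(hex(shr(0x80000000, 4)), hex(shru(0x80000000, 4)))
```

For a negative operand such as 8000 0000h the two shifts give different results, and FFFF FFFFh is less than 1 for CMPGT but greater than 1 for CMPGTU.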

14.3.7 Branch and other Instructions


The 'C6X processor supports branch operations. The location to which the branch occurs can be specified
as a label (displacement) in the branch instruction, or it can be held in a register of the register file, in
the interrupt return pointer or in the NMI return pointer. The branch using displacement instruction is
processed by the .S1 and .S2 units. All other branch instructions are processed only by the .S2 unit. The
CLR, SET, EXT, EXTU, LMBD and ZERO instructions are used for accessing the bit fields of registers in the
register file. These instructions are used to set, clear or extract information about the bit fields. The
ABS, NORM and SAT instructions are used for handling the sign bits of the operands. The NOP instruction is
used in programming to avoid conflicts between the functional units in the pipeline operation. The IDLE
instruction is used to halt the processor operation until an interrupt occurs. The 'C6X branch instructions
and other instructions are listed in Table 14.12.

Table 14.12 Branch and other Instructions of 'C6X processor

Instruction   Functional unit   Description
B   .S1 or .S2   Branch using displacement
B   .S2   Branch using a register
B IRP/B NRP   .S2   Branch using an interrupt return pointer / using NMI return pointer
CLR   .S1 or .S2   Clear a bit field
SET   .S1 or .S2   Set a bit field
EXT   .S1 or .S2   Extract and sign-extend a bit field
EXTU   .S1 or .S2   Extract and zero-extend a bit field
LMBD   .L1 or .L2   Leftmost bit detection
ZERO   .L1 or .L2, .S1 or .S2, .D1 or .D2   Zero a register (pseudo-operation)
ABS   .L1 or .L2   Integer absolute value with saturation
NORM   .L1 or .L2   Normalize integer
SAT   .L1 or .L2   Saturate a 40-bit integer to a 32-bit integer
NOP   —   No operation
IDLE   —   Multi-cycle NOP with no termination until interrupt

CONDITIONAL OPERATIONS 14.4


All the instructions in 'C6X can be conditional instructions. The contents of registers A1, A2, B0, B1
and B2 can be tested for conditional operations, and the condition can be that the register content is
either zero or nonzero. A conditional instruction is written with square brackets, [ ], surrounding the
register to be tested. If the register name alone is present in the square brackets, as in [A1], the
instruction executes when the register content is nonzero. If an exclamation mark precedes the register
name, as in [!A1], the instruction executes when the register content is zero. The specified
condition register is tested at the beginning of the E1 phase of the pipeline for all instructions. Example
14.25 shows how 'C6X conditional instructions are written and executed.

PARALLEL OPERATIONS 14.5


In 'C6X, instructions are fetched eight at a time to form a fetch packet. Fetch packets are aligned on
256-bit boundaries (8 words x 32 bits) and the basic format of the fetch packet is shown in Fig. 14.1. The
execution of these eight instructions is controlled by the p-bit. Whether an instruction in the fetch packet
executes in parallel with another instruction is determined by scanning the p-bit, bit 0 of each instruction,
from left to right.

Fig. 14.1 Basic Format of a Fetch packet

If the p-bit of instruction i is 1, then the next instruction i+1 is executed in parallel with
instruction i in the same machine cycle. If the p-bit is 0, then instruction i+1 is executed in the
machine cycle after instruction i. All the instructions executing in parallel
constitute an execute packet. Each instruction in the execute packet must use a different functional unit.
The execute packet cannot be more than eight words, so the p-bit of the last instruction in a fetch
packet is always set to 0. There are three types of execution of the instructions in the fetch packet based
on the p-bits. They are
(i) Fully serial
(ii) Fully parallel
(iii) Partially serial
In the fully serial type of execution, all the p-bits are set to 0. The eight instructions of a fetch packet
are executed serially one after the other in eight machine cycles. For the fully parallel type, all the p-bits
are set to 1 except that of the last instruction. All eight instructions of the fetch packet are executed in
parallel in the same machine cycle. In the partially serial scheme, the p-bits of some instructions are set
to 0 and those of others to 1. The instructions are executed serially one after the other from left to right
of the fetch packet until the first p-bit set to 1 is detected. That instruction, the next instruction and
the successive instructions whose p-bits are 1 are then executed in parallel, up to and including the first
instruction whose p-bit is 0. The instruction after it starts a new execute packet, and so on. An instruction
that is to be executed in parallel with the previous one is marked by the symbol || before
the opcode. Sample programs illustrating the above concept are given in Examples 14.25, 14.26
and 14.27.
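
The p-bit scanning rule described above can be expressed as a short routine that groups the eight slots of a fetch packet into execute packets. The Python sketch below is illustrative only; the function name execute_packets and the representation of the p-bits as a list of 0/1 values are assumptions made for this example.

```python
def execute_packets(p_bits):
    """Group fetch-packet slots into execute packets by scanning the p-bits.

    p_bits[i] == 1 means instruction i+1 executes in parallel with instruction i.
    Returns a list of execute packets, each a list of slot numbers.
    """
    packets, current = [], []
    for slot, p in enumerate(p_bits):
        current.append(slot)
        if p == 0:               # a 0 p-bit closes the current execute packet
            packets.append(current)
            current = []
    if current:                  # only reached if the last p-bit was not 0
        packets.append(current)
    return packets

print(execute_packets([0, 0, 0, 0, 0, 0, 0, 0]))   # fully serial
print(execute_packets([1, 1, 1, 1, 1, 1, 1, 0]))   # fully parallel
print(execute_packets([0, 1, 1, 0, 1, 0, 0, 0]))   # partially serial
```

The number of lists returned is the number of machine cycles needed to issue the fetch packet: eight for the fully serial case, one for the fully parallel case and five for the partially serial pattern shown.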

Example 14.25 Fully serial execution with conditional operation. The following code finds
the sum of the first N numbers. The instructions are executed one after the other.
The conditional operation is used for the branch instruction, with register B0 checked for the nonzero
condition. The count N is loaded in register B0. The sequence is generated in register A3 and the
summation of the N numbers is done in register B1 using ADD instructions (A3 and B1 are assumed to be
zero on entry). At the end of each summation the count N in register B0 is decremented using the SUB
instruction. The nonzero condition of B0 is checked each time using the notation [B0], and the branch to
location LOOP is taken until the content of B0 becomes zero. When B0 reaches zero, execution falls out of
the loop.
MVK .S2 05h, B0 ; count N loaded in register B0
LOOP ADD .L1 1,A3,A3 ; generation of sequence in A3
ADD .L2X B1,A3,B1 ; summation of sequence in B1 (A3 read through the cross path)
SUB .S2 B0,1,B0 ; decrement of count N in register B0
[B0] B .S1 LOOP ; branch to LOOP while content of B0 is nonzero
NOP 5 ; no operation for 5 cycles to fill the branch delay slots
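
The effect of the serial loop in Example 14.25 can be traced with an equivalent Python routine. This is only a behavioural sketch: the function name sum_first_n is hypothetical, the variables stand for the registers of the same name, A3 and B1 are assumed to start at zero, and N is assumed to be at least 1.

```python
def sum_first_n(n):
    """Trace of the loop in Example 14.25 (registers modelled as integers)."""
    b0 = n          # MVK  n,B0       ; loop count
    a3 = 0          # assumed initial value of A3
    b1 = 0          # assumed initial value of B1
    while True:
        a3 += 1     # ADD  1,A3,A3    ; next number of the sequence
        b1 += a3    # ADD  B1,A3,B1   ; running sum
        b0 -= 1     # SUB  B0,1,B0    ; decrement the count
        if b0 == 0: # [B0] B LOOP     ; branch only while B0 is nonzero
            break
    return b1

print(sum_first_n(5))
```

With the count 05h used in the example, the loop body runs five times and B1 ends with 1 + 2 + 3 + 4 + 5.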

Example 14.26 Partially serial and parallel execution. The code given in Example 14.25 to find
the sum of N numbers is rewritten below for the partially serial and parallel type.
Loading the count value in register B0 is done serially and all other operations are
executed in parallel. The parallel operations are performed in the .L1, .L2, .S1 and .S2 units.
MVK .S2 05H, B0
LOOP1 ADD .L1 1,A3,A3
|| ADD .L2X B1,A3,B1
|| SUB .S2 B0,1,B0
|| [B0] B .S1 LOOP1
NOP 5
NOP
NOP
NOP

Example 14.27 Fully parallel execution. The following instructions are executed in parallel. All the
eight functional units in 'C6X are used to perform fully parallel execution.
LDW .D1 *A4++,A3
|| STW .D2 B3,*B2++
|| ADD .L1 A4,A4,A5
|| ADD .L2 B4,B4,B5
|| MPY .M1 A5,A5,A6
|| MPYH .M2 B5,B5,B6
|| SUB .S1 A6,A5,A7
|| SUB .S2 B6,B5,B7

FLOATING POINT INSTRUCTIONS 14.6


The 'C67X floating point DSP supports all the fixed point instructions described in Section 14.3, and in
addition has instructions that are specific to the 'C67X, such as 32-bit integer multiply, double word load
and floating point addition, subtraction and multiplication.
This section describes those specific instructions.

14.6.1 Data Formats


The 'C67X floating point DSPs support both fixed point and floating point data formats. For the fixed
point case, the operands can be signed 32-bit integer values or unsigned 32-bit integer values. As far
as the floating point operands are concerned, either Single-Precision (SP) or Double-Precision (DP)
floating point format can be used. Single-precision floating point operands are 32-bit values stored in
a single register, whereas double-precision floating point operands are 64-bit values stored in register
pairs (Section 13.5) present in the register file.
The fields of the single-precision floating point format operand are shown in Fig. 14.2. Bits
0-22 of the register represent the fraction (mantissa) part (23 bits), bits 23-30 represent
the exponent part (8 bits) and the MSB (bit 31) is the sign bit. The floating point fields represent
floating point numbers in two ranges, normalized (exponent field greater than 0 and less than 255) and
denormalized (exponent field is 0).

Fig. 14.2 Single-precision Floating point Fields

The formula to translate the s, e and f fields into a single-precision floating point number is given
below.
Normal range: $(-1)^{s} \times 2^{(e-127)} \times 1.f$, for $0 < e < 255$
Denormalized range: $(-1)^{s} \times 2^{-126} \times 0.f$, for $e = 0$ and $f$ nonzero
The fields of the double-precision floating point format operand are shown in Fig. 14.3. All 32 bits of
the even register together with bits 0-19 (20 bits) of the odd register of the register pair represent the
fraction (mantissa) part (52 bits), bits 20-30 of the odd register represent the exponent part (11 bits) and
the MSB (bit 31) of the odd register is the sign bit. In the double-precision format the exponent is greater
than 0 and less than 2047 for the normalized range, and it is 0 for the denormalized range.

Fig. 14.3 Double-precision Floating point Fields

The formula to translate the s, e and f fields into a double-precision floating point number is given
below.
Normal range: $(-1)^{s} \times 2^{(e-1023)} \times 1.f$, for $0 < e < 2047$
Denormalized range: $(-1)^{s} \times 2^{-1022} \times 0.f$, for $e = 0$ and $f$ nonzero
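
The two translation formulas can be applied directly to register contents. The Python sketch below is an illustration of the field layout only; the function names decode_sp and decode_dp are hypothetical, and the exponent values 255 and 2047 (infinity and NaN) are deliberately left out because the formulas above do not cover them.

```python
def decode_sp(word):
    """Translate a 32-bit single-precision word using the s, e and f fields."""
    s = (word >> 31) & 1
    e = (word >> 23) & 0xFF
    f = word & 0x7FFFFF
    if 0 < e < 255:                                   # normal range: 1.f
        return (-1) ** s * 2.0 ** (e - 127) * (1 + f / 2 ** 23)
    if e == 0:                                        # denormalized range: 0.f
        return (-1) ** s * 2.0 ** -126 * (f / 2 ** 23)
    raise ValueError("exponent 255 (infinity/NaN) is not modelled")

def decode_dp(odd, even):
    """Translate a double-precision value held in an odd:even register pair."""
    s = (odd >> 31) & 1
    e = (odd >> 20) & 0x7FF
    f = ((odd & 0xFFFFF) << 32) | (even & 0xFFFFFFFF)  # 20 + 32 = 52 bits
    if 0 < e < 2047:
        return (-1) ** s * 2.0 ** (e - 1023) * (1 + f / 2 ** 52)
    if e == 0:
        return (-1) ** s * 2.0 ** -1022 * (f / 2 ** 52)
    raise ValueError("exponent 2047 (infinity/NaN) is not modelled")

print(decode_sp(0x43700000))
print(decode_sp(0xC4534000))
print(decode_dp(0x40C1562B, 0x851EB852))
```

The sample words are the register contents used in Examples 14.29 and 14.30 of this chapter.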

14.6.2 Data Format Conversion Instructions


The 'C67X processor supports fixed point, single-precision and double-precision floating point data
formats, and provides data format conversion instructions to convert one data format to another.
The data format conversion instructions of the 'C67X processors
are listed in Table 14.13. All the data format conversion instructions are processed by the .L units (.L1 and
.L2) except the single-precision to double-precision floating point conversion instruction
(SPDP), which is processed by the .S units of the CPU. All the data format conversion instructions use
register addressing mode.
Table 14.13 Data Format Conversion Instructions of 'C67X processor

Instruction   Functional unit   Description
INTSP/INTSPU   .L1 or .L2   Convert signed/unsigned integer to single-precision floating point value
INTDP/INTDPU   .L1 or .L2   Convert signed/unsigned integer to double-precision floating point value
SPINT   .L1 or .L2   Convert single-precision floating point value to integer
SPTRUNC   .L1 or .L2   Convert single-precision floating point value to integer with truncation
DPINT   .L1 or .L2   Convert double-precision floating point value to integer
DPTRUNC   .L1 or .L2   Convert double-precision floating point value to integer with truncation
SPDP   .S1 or .S2   Convert single-precision floating point value to double-precision floating point value
DPSP   .L1 or .L2   Convert double-precision floating point value to single-precision floating point value
(SP: Single-Precision, DP: Double-Precision, INT: Integer)

Example 14.28 INTSP .L1 A1,A2 – Signed integer to single-precision floating point conversion
instruction. The signed integer content of register A1 is converted to single-
precision floating point format and stored in register A2. If the integer content of register A1 is
00007272h (29298), its single-precision floating point value 46E4 E400h (2.9298E+4) is loaded in register
A2. The functional unit used is .L1 and the content of register A1 is unchanged.
Before execution After execution
A1 00007272 (29298) A1 00007272
A2 00020005 A2 46E4E400 (2.9298E+4)
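
The conversion in Example 14.28 can be reproduced on a host computer with Python's standard struct module, which packs a number into the IEEE single-precision bit pattern. The function name intsp below is hypothetical, and the sketch assumes round-to-nearest conversion; it does not model the rounding-mode controls of the processor.

```python
import struct

def intsp(reg):
    """Model of INTSP: signed 32-bit integer to single-precision bit pattern."""
    reg &= 0xFFFFFFFF
    signed = reg - (1 << 32) if reg & 0x80000000 else reg
    # Pack as a big-endian IEEE single, then read the same 4 bytes back as a word.
    return struct.unpack(">I", struct.pack(">f", float(signed)))[0]

print(hex(intsp(0x00007272)))   # Example 14.28
print(hex(intsp(0xFFFFFFFF)))   # the integer -1
```

The second call shows the sign handling: the integer -1 becomes BF80 0000h, which is -1.0 in single precision.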

14.6.3 Arithmetic Operation Instructions


The 'C67X processors have arithmetic instructions for single and double-precision addition, subtraction
and multiplication operations. Integer addition using double word addressing mode (ADDAD) is
supported. They also have instructions to perform 32-bit integer multiplications, where either 32 bits or
64 bits of the product can be obtained. The 'C67X arithmetic instructions are listed in Table 14.14. All add
and subtract instructions are processed by the .L units except the ADDAD instruction, which is processed by
the .D units. The multiply instructions are processed by the .M units of the CPU. All the arithmetic
instructions use register addressing mode except the ADDAD instruction. For the ADDAD instruction the default
is linear addressing mode, but if the src2 operand is one of the registers A4-A7 or B4-B7, circular
addressing mode can be used. The src1 operand is left shifted by 3 bits for double word addressing mode (refer
to ADDAB/ADDAH/ADDAW in Section 14.2.3).
Table 14.14 Arithmetic Operation Instructions of 'C67X processor
Instruction   Functional unit   Description
ADDSP   .L1 or .L2   Single-precision floating point add instruction
ADDDP   .L1 or .L2   Double-precision floating point add instruction
ADDAD   .D1 or .D2   Integer addition using double word addressing mode instruction
SUBSP   .L1 or .L2   Single-precision floating point subtract instruction
SUBDP   .L1 or .L2   Double-precision floating point subtract instruction
MPYSP   .M1 or .M2   Single-precision floating point multiply instruction
MPYDP   .M1 or .M2   Double-precision floating point multiply instruction
MPYI   .M1 or .M2   32-bit integer multiply instruction (32 LSBs of the product are placed in the destination)
MPYID   .M1 or .M2   32-bit integer multiply instruction (64 bits of the product are placed in the destination register pair)

Example 14.29 ADDSP .L1 A1,A2,A3 – Single-precision floating point add instruction. The single-
precision floating point content of register A1 and A2 are added; the result in
single-precision floating point format is stored in register A3. If the floating point content of register A1
is 4370 0000h (2.4E+2), register A2 is C453 4000h (-8.45E+2), the added result C417 4000h (-6.05E+2) is
stored in register A3. The functional unit used is .L1; the content of registers A1 and A2 are unchanged.
Before executions After execution
A1 43700000 (2.40E+2) A1 43700000 (2.40E+2)
A2 C4534000 (-8.45E+2) A2 C4534000 (-8.45E+2)
A3 500F0D18 A3 C4174000 (-6.05E+2)
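
The register values of Example 14.29 can be verified by converting the bit patterns to numbers, adding them and converting the sum back. The Python sketch below uses the standard struct module; the names word_to_float, float_to_word and addsp are hypothetical, and special cases such as denormalized operands are not modelled.

```python
import struct

def word_to_float(word):
    """Reinterpret a 32-bit word as an IEEE single-precision number."""
    return struct.unpack(">f", struct.pack(">I", word & 0xFFFFFFFF))[0]

def float_to_word(value):
    """Round a number to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def addsp(src1, src2):
    """Model of ADDSP on two single-precision register values."""
    return float_to_word(word_to_float(src1) + word_to_float(src2))

print(word_to_float(0x43700000), word_to_float(0xC4534000))
print(hex(addsp(0x43700000, 0xC4534000)))   # Example 14.29
```

The two operands decode to 240.0 and -845.0, so the sum is -605.0, whose bit pattern is C417 4000h.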

Example 14.30 SUBDP .L2 B1:B0,B3:B2,B5:B4 – Double-precision floating point subtract
instruction. The double-precision floating point content of register pair B3:B2 is
subtracted from the content of register pair B1:B0 and the result in double-precision floating point
format is stored in register pair B5:B4. If the floating point content of register pair B3:B2 is 8.87634E+3
and the content of register pair B1:B0 is -1.043567E+5, the B3:B2 content is subtracted from the B1:B0
content and the result -1.1323304E+5 is stored in register pair B5:B4. The functional unit used is .L2; the
contents of register pairs B1:B0 and B3:B2 are unchanged.
Before execution
B1:B0 C0F97A4B 33333333 (-1.043567E+5)
B3:B2 40C1562B 851EB852 (8.87634E+3)
B5:B4 00000000 00000000 (0.0)
After execution
B1:B0 C0F97A4B 33333333 (-1.043567E+5)
B3:B2 40C1562B 851EB852 (8.87634E+3)
B5:B4 C0FBA510 A3D70A3D (-1.1323304E+5)
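
A double-precision value occupies an odd:even register pair, so a host-side check of Example 14.30 has to split the 64-bit pattern into two 32-bit words. The Python sketch below does this with the standard struct module; the names pair_to_float, float_to_pair and subdp are hypothetical, and round-to-nearest arithmetic is assumed.

```python
import struct

def pair_to_float(odd, even):
    """Combine an odd:even register pair into a double-precision number."""
    return struct.unpack(">d", struct.pack(">II", odd, even))[0]

def float_to_pair(value):
    """Split a double-precision number into (odd, even) 32-bit words."""
    odd, even = struct.unpack(">II", struct.pack(">d", value))
    return odd, even

def subdp(src1, src2):
    """Model of SUBDP: src1 - src2, each operand given as an (odd, even) pair."""
    return float_to_pair(pair_to_float(*src1) - pair_to_float(*src2))

b1_b0 = (0xC0F97A4B, 0x33333333)   # -1.043567E+5
b3_b2 = (0x40C1562B, 0x851EB852)   #  8.87634E+3
odd, even = subdp(b1_b0, b3_b2)
print(f"{odd:08X} {even:08X}")
```

The sign bit of the result sits in bit 31 of the odd register, so a negative difference always shows up as an odd word of 8000 0000h or above.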

Example 14.31 MPYI .M1X A1,B1,A2 – 32-bit integer multiply instruction. The 32-bit integer
contents of registers A1 and B1 are multiplied; the lower 32 bits of the product are
stored in register A2. If the content of register A1 is 0008 5BDBh and register B1 is 000D B371h, the
product is 72 8609 ACABh. Only the 32 LSBs of the product (8609 ACABh) are stored in register A2. The
functional unit used is .M1 with the cross path; the contents of registers A1 and B1 are unchanged. If the
same operation is performed with the MPYID instruction, which uses a register pair for the destination, the
entire product is stored in the register pair.
Before execution After execution
A1 00085BDB (547803) A1 00085BDB (547803)
B1 000DB371 (897905) B1 000DB371 (897905)
A2 00000000 A2 8609ACAB (32 LSBs of 491875052715)
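
The difference between MPYI and MPYID is only in how much of the 64-bit product is kept, which the following Python sketch illustrates. The function names mpyi and mpyid are hypothetical models of the instructions, with registers represented as 32-bit integers.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def s32(value):
    """Interpret a 32-bit register value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def mpyi(src1, src2):
    """MPYI: keep only the 32 LSBs of the signed 32 x 32 product."""
    return (s32(src1) * s32(src2)) & MASK32

def mpyid(src1, src2):
    """MPYID: full 64-bit product returned as an (odd, even) register pair."""
    product = (s32(src1) * s32(src2)) & MASK64
    return product >> 32, product & MASK32

print(hex(mpyi(0x00085BDB, 0x000DB371)))
print([hex(word) for word in mpyid(0x00085BDB, 0x000DB371)])
```

For the operands of Example 14.31 the odd register of the MPYID result holds the upper part 72h that MPYI discards.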

14.6.4 Compare and Reciprocal Approximation Instructions


The 'C67X processors have instructions for compare and reciprocal approximation operations. The
comparisons are performed for equality, less than and greater than cases using single and double-
precision floating point data formats. The src1 and src2 operands given in registers are compared; if the
comparison is true, '1' is written to the destination register, otherwise '0' is written.
The reciprocal and square-root reciprocal approximation operations are performed for single and
double-precision floating point data formats. The 'C67X compare and reciprocal approximation
instructions are listed in Table 14.15. The compare and reciprocal approximation instructions are
processed by the .S1 and .S2 units and all these instructions use register addressing mode.
Table 14.15 Compare and Reciprocal Approximation Instructions of 'C67X processor

Instruction Functional unit Description


CMPEQSP .S1 or .S2 Single-precision floating point compare for equality instruction
CMPEQDP .S1 or .S2 Double-precision floating point compare for equality instruction
CMPLTSP .S1 or .S2 Single-precision floating point compare for less than instruction
CMPLTDP .S1 or .S2 Double-precision floating point compare for less than instruction
CMPGTSP .S1 or .S2 Single-precision floating point compare for greater than instruction
CMPGTDP .S1 or .S2 Double-precision floating point compare for greater than instruction
RCPSP .S1 or .S2 Single-precision floating point reciprocal approximation instruction
RCPDP .S1 or .S2 Double-precision floating point reciprocal approximation instruction
RSQRSP .S1 or .S2 Single-precision floating point square-root reciprocal approximation instruction
RSQRDP .S1 or .S2 Double-precision floating point square-root reciprocal approximation instruction

Example 14.32 CMPGTSP .S1 A3,A4,A5 – Single-precision floating point compare instruction for
greater than case. The single-precision floating point contents of registers A3 and
A4 are compared; if the content of register A3 is greater than the content of A4, '1' is written in
register A5. If the floating point content of register A3 is 4608 F59Ah (8.7654E+3) and register A4 is 45F6
3E66h (7.8798E+3), the comparison is true; hence '1' is written in register A5. The functional unit used
is .S1; the contents of registers A3 and A4 are unchanged.
Before execution After execution
A3 4608F59A (8.7654E+3) A3 4608F59A (8.7654E+3)
A4 45F63E66 (7.8798E+3) A4 45F63E66 (7.8798E+3)
A5 00000000 A5 00000001

Example 14.33 RSQRSP .S1 A1,A2 – Single-precision floating point square-root reciprocal
approximation instruction. The square root of the single-precision floating point
content of register A1 is obtained and its reciprocal value is stored in register A2 in single-precision
floating point format. If the single-precision floating point content of register A1 is 4380 0000h (2.56E+2),
its square root value is 1.6E+1 and the reciprocal value 3D80 0000h (6.25E-2) is stored in register A2. The
functional unit used is .S1; the content of register A1 is unchanged.
Before execution After execution
A1 43800000 (2.56E+2) A1 43800000 (2.56E+2)
A2 C4534000 (-8.45E+2) A2 3D800000 (6.25E-2)
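
The results of Examples 14.32 and 14.33 can be cross-checked on a host computer with the Python sketch below. It is not a model of the hardware approximation: the function rsqrt_reference (a hypothetical name, like cmpgtsp) computes a fully rounded 1/sqrt(x), whereas the instruction returns an approximation, so the two are only expected to agree exactly for simple inputs such as the power of two used in the example.

```python
import math
import struct

def word_to_float(word):
    """Reinterpret a 32-bit word as an IEEE single-precision number."""
    return struct.unpack(">f", struct.pack(">I", word & 0xFFFFFFFF))[0]

def float_to_word(value):
    """Round a number to single precision and return its 32-bit pattern."""
    return struct.unpack(">I", struct.pack(">f", value))[0]

def cmpgtsp(src1, src2):
    """Model of CMPGTSP: 1 when src1 > src2 as single-precision numbers."""
    return int(word_to_float(src1) > word_to_float(src2))

def rsqrt_reference(src):
    """Reference value for RSQRSP: correctly rounded 1/sqrt(x), x > 0 assumed."""
    return float_to_word(1.0 / math.sqrt(word_to_float(src)))

print(cmpgtsp(0x4608F59A, 0x45F63E66))     # Example 14.32
print(hex(rsqrt_reference(0x43800000)))    # Example 14.33
```

For A1 = 4380 0000h (256.0) the square root is 16.0 and the reciprocal 0.0625 is exactly representable, which is why the example shows the exact pattern 3D80 0000h.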

14.6.5 Other Instructions


The 'C67X processor supports finding the absolute value of single and double-precision floating point
numbers. The absolute value of the source register/register pair content is stored in the destination
register/register pair. The functional units used are .S1 and .S2; the addressing mode used is register
addressing. The 'C67X processors also support loading a double word from memory with an unsigned constant
offset or register offset (refer to Section 14.3.2). The absolute and load double word instructions of the
'C67X are given in Table 14.15. The addressing mode used for the load double word instruction is linear
addressing; the functional units used are .D1 and .D2 through the LD1b and LD2b 32-bit MSB buses in the
'C67X (refer to Fig. 13.5).
Table 14.15 Absolute and Load Double word Instructions of 'C67X processor
Instruction Functional unit Description
ABSSP .S1 or .S2 Absolute value of single-precision floating point number
ABSDP .S1 or .S2 Absolute value of double-precision floating point number
LDDW .D1 or .D2 Load double word from memory with an unsigned constant offset or register offset

PIPELINE OPERATION 14.7


The 'C6X pipeline simplifies programming and improves performance. The major
phases of the 'C6X pipeline are
• Fetch
• Decode
• Execute
All instructions require the same number of pipeline phases for fetch and decode, but require a varying
number of execute phases depending on the type of instruction. The 'C62X/'C64X fixed point processors
require fewer execute phases than the 'C67X floating point processor. The fetch operation consists of
four phases and the decode operation has two phases for all 'C6X processors. The execute operation
of the fixed point processors has five phases, whereas it has ten phases for the floating point processors.
The 'C6X fixed point and floating point processor pipeline stages are shown in Figs 14.4 and 14.5.

Fig. 14.4 Fixed point processor ('C62X/'C64X) pipeline stages

Fig. 14.5 Floating point processor ('C67X) pipeline stages

14.7.1 Fetch Operation


The 'C6X processor uses eight instructions in a fetch packet (FP). The eight instructions are fetched
from memory in four phases. The fetch phase is subdivided into the following phases.
• Program address generate (PG)
• Program address send (PS)
• Program access ready wait (PW)
• Program fetch packet receive (PR)
Figure 14.6 shows the functional block diagram of the fetch phase. During the PG phase the memory
addresses corresponding to the eight instructions of the fetch packet are generated. In the PS phase, the
addresses are sent to memory and in the PW phase the memory read operation is performed. Finally, in the PR
phase the eight instructions are received at the CPU. The number of execute packets in the fetch packet
depends on whether the instructions are written in fully serial, fully parallel or partially serial form. If
the eight instructions of a fetch packet are serial, there are eight execute packets, whereas if the eight
instructions are in parallel, there is only one execute packet. In the partially serial case, the number of
execute packets varies from two to seven, depending on the number of instructions that are parallel in the
fetch packet.

Fig. 14.6 Functional Block Diagram of 'C6X Fetch phases

14.7.2 Decode Operation


In the decode phase, the fetch packet of eight instructions is split into execute packets, and the
instructions are assigned to the appropriate functional units and decoded. The decode phase is
subdivided into the following two phases.
• Instruction dispatch (DP)
• Instruction decode (DC)
An execute packet consists of one instruction or of two to eight parallel instructions. In the instruction
dispatch phase (DP), the instructions in an execute packet are assigned to the appropriate functional
units. In the instruction decode phase (DC), the source registers, destination registers and associated data
paths are decoded for the execution of the instructions in the eight functional units.

14.7.3 Execute Operation


The execute phase of the pipeline is subdivided into five phases (E1-E5) for fixed-point processors and
into ten phases (E1-E10) for floating-point processors. Different types of instructions
require different numbers of execute phases to complete their execution. The execute phases and the
operations performed in each phase are given in Table 14.16 for fixed-point processors and in
Table 14.17 for floating-point processors. The pipeline operations of fixed-point processors are
categorized into seven instruction types: single-cycle, single 16x16 multiply (two-cycle) and
‘C64X .M unit non-multiply, store, ‘C64X extended multiply, load, branch and no operation (NOP) instructions.
The pipeline operations of floating-point processors are categorized into fourteen instruction types:
single-cycle, single 16x16 multiply, store, load, branch, 2-cycle DP, 4-cycle, INTDP, DP compare,
ADDDP/SUBDP, MPYI, MPYID, MPYDP and no operation (NOP) instructions. The execute phases in
which these instruction categories are executed are shown in Tables 14.16 and 14.17.
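The phase counts given in Section 14.7 can be tallied with a small host-side Python sketch (not ‘C6X code). The phase names are those used in the text; the function name is hypothetical.

```python
FETCH_PHASES = ["PG", "PS", "PW", "PR"]   # four fetch phases
DECODE_PHASES = ["DP", "DC"]              # two decode phases


def pipeline_phases(floating_point):
    """Return the ordered list of pipeline phases for a 'C6X processor.

    Fixed-point devices have execute phases E1-E5, floating-point
    devices have E1-E10, as stated in the text.
    """
    execute_count = 10 if floating_point else 5
    execute_phases = ["E%d" % i for i in range(1, execute_count + 1)]
    return FETCH_PHASES + DECODE_PHASES + execute_phases


fixed_point_total = len(pipeline_phases(floating_point=False))
floating_point_total = len(pipeline_phases(floating_point=True))
```

The totals are 4 + 2 + 5 = 11 phases for the fixed-point pipeline and 4 + 2 + 10 = 16 phases for the floating-point pipeline.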

INTERRUPTS 14.8
The ‘C6X processors have three types of interrupts, distinguished by priority. The reset interrupt
(RESET) has the highest priority, the nonmaskable interrupt (NMI) has the second
highest priority, and the twelve maskable interrupts INT4-INT15 have the lowest priorities. In the
‘C6X, eight registers control the servicing of interrupts. These interrupt control registers and their
functions are given in Table 14.18.
The reset interrupt is an active low signal and all other interrupts are active high signals. The reset
signal must be held low for 10 clock cycles. The nonmaskable interrupt is used to alert the CPU to a
serious hardware problem, such as an imminent power failure. The twelve maskable
interrupts can be associated with external devices, on-chip peripherals or software control, or may be
unavailable on some processors. The ‘C6X processors have an interrupt acknowledge signal (IACK), which
alerts the external hardware that an interrupt has occurred and is being processed, and the INUMx signals
(INUM3-INUM0), which indicate the number of the interrupt that is being processed.
When an interrupt occurs, the CPU begins to process it by referencing the interrupt service table (IST).
The IST is a table containing the code for servicing the interrupts. It consists of 16 consecutive fetch
packets, each containing eight instructions. A ‘C6X instruction is 32 bits long, so the eight
instructions of each fetch packet occupy 32 bytes of program memory. Hence the address
within the IST is incremented by 32 bytes (20h) from one interrupt to the next. An interrupt service
routine of up to eight instructions fits entirely within its fetch packet.
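The 32-byte spacing of the IST entries can be illustrated with a short host-side Python sketch (not ‘C6X code). It assumes that the fetch packets are indexed by interrupt number, with RESET as 0, NMI as 1 and INT4-INT15 as 4 to 15; the function name and the base address used in the example are hypothetical.

```python
INSTRUCTIONS_PER_FETCH_PACKET = 8
BYTES_PER_INSTRUCTION = 4                     # one 32-bit instruction
FETCH_PACKET_BYTES = INSTRUCTIONS_PER_FETCH_PACKET * BYTES_PER_INSTRUCTION  # 32 = 20h


def ist_entry_address(ist_base, interrupt_number):
    """Address of the fetch packet that services a given interrupt.

    ist_base is the start address of the interrupt service table;
    interrupt_number selects one of its 16 fetch packets (assumed indexing).
    """
    if not 0 <= interrupt_number <= 15:
        raise ValueError("the IST holds 16 fetch packets")
    return ist_base + interrupt_number * FETCH_PACKET_BYTES


# Example with a hypothetical table placed at address 0:
int4_address = ist_entry_address(0x00000000, 4)    # 4 * 20h = 80h
```

Under these assumptions the fetch packet for INT4 starts 80h bytes after the beginning of the table, and consecutive entries are always 20h bytes apart.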

Table 14.16 Operations performed in execute phases of ‘C6X fixed-point processors

Execute phase  Type                         Operations performed
E1             Conditional instructions     For all instructions, the conditions for the instructions are checked and operands are read
               Load and store instructions  Address generation is performed and address modifications are written to a register file
               Branch instructions          Branch fetch packet in PG phase is affected
               Single-cycle instructions    The results are written to a register file
E2             Load instructions            The address is sent to memory
               Store instructions           The address and the data are sent to memory
               Single-cycle instructions    For single-cycle instructions with saturated results, the SAT bit in the control status register (CSR) is set if saturation occurs
               Multiply instructions        For 16x16 multiply instructions, results are written to a register file. For non-multiply instructions in the ‘C64X multiply unit, results are written to a register file
E3             Store instructions           Data memory accesses are performed
               Multiply instructions        For multiply instructions with saturated results, the SAT bit in the control status register (CSR) is set if saturation occurs
E4             Load instructions            Data is brought to the CPU boundary
               Multiply instructions        For ‘C64X multiply extensions, results are written to a register file
E5             Load instructions            The data is written into a register

Table 14.17 Operations performed in execute phases of ‘C6X floating-point processors

Execute phase  Type                                            Operations performed
E1             Conditional instructions                        For all instructions, the conditions for the instructions are checked and operands are read
               Load and store instructions                     Address generation is performed and address modifications are written to a register file
               Branch instructions                             Branch fetch packet in PG phase is affected
               Single-cycle instructions                       The results are written to a register file
               DP compare, ADDDP/SUBDP and MPYDP instructions  The lower 32 bits of the sources are read. For all other instructions, the sources are read
               2-cycle DP instructions                         The lower 32 bits of the result are written to a register file
E2             Load instructions                               The address is sent to memory
               Store instructions                              The address and the data are sent to memory
               Single-cycle instructions                       For single-cycle instructions with saturated results, the SAT bit in the control status register (CSR) is set if saturation occurs
               Multiply, 2-cycle DP and DP compare instructions  Results are written to a register file
               DP compare and ADDDP/SUBDP instructions         The upper 32 bits of the sources are read
               MPYDP instruction                               The lower 32 bits of src1 and the upper 32 bits of src2 are read
               MPYI and MPYID instructions                     The sources are read
E3             Store instructions                              Data memory accesses are performed
               Multiply instructions                           For multiply instructions with saturated results, the SAT bit in the control status register (CSR) is set if saturation occurs
               MPYDP instruction                               The upper 32 bits of src1 and the lower 32 bits of src2 are read
               MPYI and MPYID instructions                     The sources are read
E4             Load instructions                               Data is brought to the CPU boundary
               MPYI and MPYID instructions                     The sources are read
               MPYDP instruction                               The upper 32 bits of the sources are read
               4-cycle instructions                            Results are written to a register file
               INTDP instruction                               The lower 32 bits of the result are written to a register file
E5             Load instructions                               The data is written into a register
               INTDP instruction                               The upper 32 bits of the result are written to a register file
E6             ADDDP/SUBDP instructions                        The lower 32 bits of the result are written to a register file
E7             ADDDP/SUBDP instructions                        The upper 32 bits of the result are written to a register file
E8             -                                               Nothing is read or written
E9             MPYI instruction                                The result is written to a register file
               MPYDP and MPYID instructions                    The lower 32 bits of the result are written to a register file
E10            MPYDP and MPYID instructions                    The upper 32 bits of the result are written to a register file

If the interrupt service routine for an interrupt is larger than eight instructions and cannot fit in
its fetch packet of the IST, the interrupt service fetch packet (ISFP) contains a branch to the memory
location from which the additional interrupt service routine code is written. The routine ends with a
branch to the interrupt return pointer (B IRP) followed by five no operations (NOP 5), which allow the
branch to reach the execution stage of the pipeline. In both cases the interrupt
service table pointer (ISTP) register is used to locate the interrupt service routine.
Table 14.18 Interrupt control registers and their functions

Name of the register                  Abbreviation  Functions
Control status register               CSR           To globally enable or disable the maskable interrupts
Interrupt enable register             IER           To enable the maskable interrupts
Interrupt flag register               IFR           Shows the status of interrupts
Interrupt set register                ISR           To set the flags in the IFR register manually
Interrupt clear register              ICR           To clear the flags in the IFR register manually
Interrupt service table pointer       ISTP          Pointer to the beginning of the interrupt service table
Nonmaskable interrupt return pointer  NRP           Contains the return address used on return from a nonmaskable interrupt. This is accomplished using the B NRP instruction
Interrupt return pointer              IRP           Contains the return address used on return from a maskable interrupt. This is accomplished using the B IRP instruction

To process a maskable interrupt, the following conditions must be satisfied.

• The global interrupt enable bit (GIE) in the control status register (CSR) is set to 1
• The NMIE bit in the interrupt enable register (IER) is set to 1
• The interrupt enable bit (IE) in the interrupt enable register (IER) for the corresponding interrupt
is set to 1
When an interrupt occurs, the corresponding bit in the interrupt flag register (IFR) is set. If the above
conditions are satisfied, then, based on the priority of the interrupt, the interrupt service table pointer
(ISTP) locates the interrupt service routine and the interrupt is processed.
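The three conditions can be expressed as a small host-side Python check (not ‘C6X code). The bit positions are assumptions of this sketch: GIE is taken as bit 0 of CSR, NMIE as bit 1 of IER, and the enable and flag bits of INTn as bit n of IER and IFR. The function name is hypothetical.

```python
GIE_BIT = 0    # assumed position of GIE in CSR
NMIE_BIT = 1   # assumed position of NMIE in IER


def maskable_interrupt_serviced(csr, ier, ifr, n):
    """Return True if maskable interrupt INTn (4 <= n <= 15) can be processed.

    csr, ier and ifr are the register contents as integers.
    """
    if not 4 <= n <= 15:
        raise ValueError("maskable interrupts are INT4-INT15")
    gie = (csr >> GIE_BIT) & 1       # global interrupt enable
    nmie = (ier >> NMIE_BIT) & 1     # nonmaskable interrupt enable
    ie = (ier >> n) & 1              # individual enable bit for INTn
    pending = (ifr >> n) & 1         # flag set when the interrupt occurred
    return bool(gie and nmie and ie and pending)


# INT4 pending, with GIE, NMIE and IE4 all set:
example = maskable_interrupt_serviced(csr=0x1, ier=0x12, ifr=0x10, n=4)
```

In the example all three enable conditions hold and the IFR flag of INT4 is set, so the model reports that the interrupt is processed; clearing any one of the enable bits makes it return False.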

Review Questions
14.1 What are the types of operations performed by .L functional units?
14.2 List the various types of multiply operations performed by .M functional units.
14.3 Which unit is used to process the branch instructions? List the various types of branch instructions in ‘C6X.
14.4 What are the various types of load and store operations performed by .D units of ‘C6X processor?
14.5 List the addressing modes supported by the ‘C6X processor.
14.6 What are the address generation options present in linear addressing mode?
14.7 Explain the operation of circular addressing mode with example.
14.8 What are the various types of move instructions in ‘C6X processors?
14.9 List the various types of addition and subtract instructions in ‘C6X processors.
14.10 What are the various shift and compare operations supported by ‘C6X processors?
14.11 Explain how logical conditional can be defined in ‘C6X instructions?
14.12 What are the various instruction execution types in ‘C6X? Explain.
14.13 What are the various data formats supported by the ‘C67X processors?
14.14 List the various data format conversion instructions in ‘C67X processors.
14.15 What are the floating point arithmetic operations ‘C67X processor supports?
14.16 Explain the different phases of fetch operation of ‘C6X pipeline.
14.17 What are operations performed in decode phase of ‘C67X pipeline.
14.18 List the categories of ‘C6X fixed point processor pipeline execute phases
14.19 What are the categories of ‘C67X processor pipeline execute phases?
14.20 List the register present in ‘C6X processor to process the interrupts.
14.21 What are the registers in ‘C6X register file used for conditional operations?

Self Test Questions


14.1 Arithmetic operations are performed by ——— units.
(a) .L (b) .S (c) .M (d) .L, .S and .D
14.2 The logical operations are processed using ——— units.
(a) .L (b) .D (c) .M (d) .L and .S
14.3 Shift operations are processed by ——— units.
(a) .L (b) .D (c) .M (d) .S
14.4 Compare operations are processed by ——— units.
(a) .L (b) .D (c) .M (d) .S
14.5 ——— units are used to perform move operations
(a) .L (b) .D (c) .S (d) .S and .D
14.6 Multiply operations are processed by ——— units.
(a) .L (b) .D (c) .M (d) .S
14.7 Branch operations are processed by ——— units.
(a) .L (b) .D (c) .M (d) .S
14.8 ——— units are used to perform load and store operations.
(a) .L (b) .D (c) .M (d) .S
14.9 In linear addressing mode the number of address generation options is
(a) 4 (b) 10 (c) 6 (d) 8
14.10 For circular addressing registers used are ———.
(a) A0-A15 (b) B0-B15 (c) A0-A32 (d) A4-A7 and B4-B7
14.11 Instruction to move values between control register file and register file is
(a) MV (b) MVK (c) MVC (d) MVKH.
14.12 Instruction to perform signed 16-bit constant addition operation is ———.
(a) ADD (b) ADDU (c) ADD2 (d) ADDK.
14.13 Instruction to perform two 16-bit addition on upper and lower register halves is ———
(a) ADD (b) ADDU (c) ADD2 (d) ADDK.
14.14 ——— instruction is used to perform division operation.
(a) ADD (b) SUBC (c) SUB (d) SUBU
14.15 For fully parallel type of execution the P-bit set for ———.
(a) all the eight instructions (b) the first instruction (c) the last instruction (d) first seven instructions
14.16 The condition checked for [!A1] is ———
(a) content of register A1 being non zero
(b) content of register A1 being zero
(c) content of register A1 being negative
(d) content of register A1 being positive
14.17 Integer addition using double word addressing mode instruction is in ——— processor.
(a) ‘C62X (b) ‘C64X (c) ‘C62X and ‘C64X (d) ‘C67X
14.18 32-bit integer multiply instruction is in ——— processor
(a) ‘C62X (b) ‘C64X (c) ‘C62X and ‘C64X (d) ‘C67X.
14.19 Reciprocal approximation operations are processed by ——— units
(a) .L (b) .D (c) .M (d) .S
14.20 The number of phases of ‘C62X fixed point processor pipeline is ———.
(a) 5 (b) 10 (c) 11 (d) 16
14.21 The number of phases of ‘C67X floating point processor pipeline is ———.
(a) 5 (b) 10 (c) 11 (d) 16
TMS320C6X APPLICATION
PROGRAMS AND PERIPHERALS
15
In this chapter some application programs for TMS320C6X processors and some details on memory and
on-chip peripherals are given. To develop and test ‘C6X application codes, programming tools and hardware
accessories are needed. The programming tool used is Code Composer Studio (CCS), and TMS320C6X
starter kits are used for the implementation. The details of the internal memory resources and the
various peripherals such as timers, multichannel buffered serial ports, DMA controllers and the external
memory interface are discussed.

CODE COMPOSER STUDIO (CCS) 15.1


Code composer studio has the basic code generation tools together with a set of debugging and real-time
analysis capabilities. The code composer studio is available as an integrated development environment (IDE),
which is designed to edit, build and debug ‘C6X processor target programs. The steps to do programming in
the ‘C6X processor environment are:
• Setting up the target processor
• Code generation
• Debugging and execution of codes

15.1.1 Setting up the Target Processor


The code composer studio tool of the ‘C6X platform supports code generation for the ‘C62X and ‘C64X fixed
point and the ‘C67X floating point processors. The CCS tool supports a processor simulator mode of operation,
for use when no target ‘C6X processor is present. The other modes of operation use a starter kit
(DSK) or an evaluation module (EVM), in which the particular target ‘C6X processor present on the board can
be selected. The selection of simulator mode, starter kit or EVM for a specific target ‘C6X processor is
made using the setup option in the CCS tool.
The setup menu of the ‘C6X code composer studio is shown in Fig. 15.1. The ‘C62X, ‘C64X and ‘C67X
devices supported by the code composer studio tool are listed in the second window. Depending on the mode
(simulator, DSK or EVM), a device of a family can be selected. The selected device appears in the
system configuration window, and the details of the selected device are available in the third window.
After selecting the mode and device, click the save and quit button at the bottom left corner of
the setup window. This completes the setup of the target processor for ‘C6X code generation. In this
chapter the code generation, debugging and execution are carried out using both the TMS320C6416 DSK
(starter kit, operating at 720MHz) and the TMS320C6713 DSK.

Fig. 15.1 Code composer studio setup for ‘C6416 DSK

15.1.2 An Overview of the ‘C6416 DSK


The code composer studio tool present in the host computer communicates with the DSK through an
embedded JTAG emulator with a USB host interface. The ‘C6416 starter kit is a standalone board with the
following features.
• TMS320C6416T processor operating at 1GHz
• 16 Mbytes of synchronous DRAM connected to CE0 space of EMIA
• 512 Kbytes of non-volatile Flash memory connected to CE1 space of EMIB
• Software board configuration through registers implemented in CPLD of CE0 space of EMIB
• An AIC23 stereo codec connected to McBSP
• Configurable boot options and clock input selection
• External memory, external peripheral and PCI/HPI connectors
• JTAG emulation through on-board JTAG emulator with USB host interface or external
emulator
The ‘C6416 starter kit has a TMS320C6416T processor with 1024 K (1M) bytes of on-chip RAM
as a unified memory space in the address range 0000 0000h to 000F FFFFh. This space can be used to
store program code as well as data values. For applications that need more memory, the external 16M
bytes of DRAM connected to the CE0 space of the starter kit can be used. The 512 K bytes of external
Flash memory in the CE1 space can be used for the boot option. The CPLD is used to implement simple logic
functions without additional discrete devices. The AIC23 stereo codec is used to input and output audio
signals through the multichannel buffered serial port (McBSP) of the processor.
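The stated size of the on-chip RAM can be checked against its address range with a line of arithmetic, shown here as a host-side Python sketch (not ‘C6X code). The helper name is hypothetical; the addresses are those given in the text.

```python
def region_size_bytes(first_address, last_address):
    """Number of bytes in an inclusive address range."""
    return last_address - first_address + 1


# On-chip RAM range quoted in the text: 0000 0000h to 000F FFFFh
onchip_ram_bytes = region_size_bytes(0x00000000, 0x000FFFFF)
onchip_ram_kbytes = onchip_ram_bytes // 1024
```

The range covers 100000h bytes, which is 1024 K (1M) bytes and agrees with the figure quoted above.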

15.1.3 Code Generation in CCS


The CCS tool supports ‘C6X code generation both in assembly and in ‘C’ language. The code generation
flow in assembly language is explained with an example in this section. The example
generates an arithmetic series of N numbers, finds the summation of the series, and stores the series and
the sum in memory. Invoke the code composer studio tool using the shortcut on the desktop or from
the program menu. The code composer studio window will open; the following are the steps to create a
new project and build assembly language code.
Step 1: Creating a new project: Select the Project menu – New project option in CCS; a project selection
window will appear. Enter a project name (e.g. series). The default location of the
new project is the folder Code composer studio – Myprojects. The folder for the new project can be
changed using the browse option. A folder with the name of the
project will be created, and a file with the name of the project and the file extension .pjt (series.pjt)
will be present in that folder. The project name appears in the files window of CCS.
Step 2: Creating source file in assembly: Select the File – New – Source file option of CCS. A text edit
window will appear, in which the assembly language code can be entered. The
procedure to enter codes in assembly is common for all the processors (Section 6.1.4). The
complete assembly language code for series generation is given in Example 15.1. Once the
complete code is entered in the text window, use the save option or the shortcut keys of CCS to save
the file in the project folder created (series). A save window will appear, in which the
file name is entered with the file extension .asm (sumn.asm).
Step 3: Adding file to the project: The assembly language file created is to be added to the project.
Select the Project menu – Add files to the project option in CCS; the add files to project window will
appear, in which select the file type ‘Asm source files’ and then the asm file in the project folder
(sumn.asm). In the files window click the project name and then the source folder; the added .asm file
will appear.

Example 15.1 The assembly code generates the arithmetic series, finds the sum of the series
and stores both in memory. The first few instructions initialize the registers used to zero.
Register A1 specifies the number of values of the series N (20h), and register A2
specifies the start address of the memory (0200h) used to store the series. The series is generated in
register A3, and the sum of the series values is accumulated in register A4. The N values of the series are
stored in N words (4 bytes each) of memory starting from the word following the address in register A2
(0204h), and the sum of the series is stored in the (N+1)th word. The content of register A1 is used for
the conditional operation.
Label   Mnemonic                  Comments
        .text                     ; assembler directive to initialize the program section (case sensitive)
        ZERO .S1 A1               ; zero the content of registers A1, A2 and A3, done in parallel
     || ZERO .D1 A2
     || ZERO .L1 A3
        NOP 5
        ZERO .D1 A4               ; zero the content of register A4
     || MVK .S1 020h,A1           ; the no. of series values (20h) entered in A1, done in parallel
        NOP 6
        MVK .S1 200h,A2           ; the start address of the memory (0200h) to store the sequence is loaded in A2
LOOP    ADD .L1 1,A3,A3           ; the series generation is done in A3
        STW .D1 A3,*++A2[1]       ; the values of the series are stored
        ADD .L1 A3,A4,A4          ; the sum of the series is accumulated in A4
        SUB .S1 A1,1,A1           ; decrement the count N
        NOP 6
[A1]    B .S1 LOOP                ; branch while the content of A1 is nonzero
        NOP 6
        STW .D1 A4,*++A2[1]       ; the sum of the series is stored in the (N+1)th location
        NOP
        .end                      ; assembler directive to specify the end of section (case sensitive)
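The expected memory contents after Example 15.1 has run can be predicted with a host-side Python model (not ‘C6X code). It follows the register usage described in the example and treats each STW with *++A2[1] as a pre-increment of the address by one word (4 bytes) followed by a store. The dictionary standing in for data memory and the function name are assumptions of this sketch.

```python
def arithmetic_series(count=0x20, start_address=0x200):
    """Model of Example 15.1: returns {address: 32-bit word} after the run."""
    memory = {}
    a1 = count            # A1: number of series values N, used as loop counter
    a2 = start_address    # A2: memory pointer
    a3 = 0                # A3: current series value
    a4 = 0                # A4: running sum
    while True:
        a3 += 1           # ADD 1,A3,A3
        a2 += 4           # STW A3,*++A2[1] : pre-increment by one word
        memory[a2] = a3
        a4 += a3          # ADD A3,A4,A4
        a1 -= 1           # SUB A1,1,A1
        if a1 == 0:       # [A1] B LOOP is taken only while A1 is nonzero
            break
    a2 += 4               # STW A4,*++A2[1] : the sum goes into the next word
    memory[a2] = a4
    return memory


result = arithmetic_series()
```

For N = 20h the model places the values 1 to 32 in the words at 0204h to 0280h and the sum 528 (210h) in the word at 0284h, which is what the memory window should show after execution.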

Step 4: Building the code: Select the Project menu – Build option in CCS; a build output window will appear
in CCS. The build checks the syntax of the ‘C6X assembly code. If the build is successful, an .out file
with the name of the project is created in the Debug folder of the project folder (series.out); otherwise
error messages will appear in the output window. Using the error messages, the syntax
can be corrected in the assembly language file. The build is repeated until it completes
successfully.
Step 5: Downloading the code to the target processor: The .out file is now to be downloaded to the
on-chip memory of the target processor; for this, the target processor must first be connected.
Select the Debug menu – Connect option in CCS; a Disassembly window will open in CCS. To
load the .out file, select the File menu – Load program option in CCS. A Load program window will
pop up; click the Debug folder and select the .out file with the name of the project (series.out).
The disassembly window will point to the starting address 0000 0000h or the default location of
the program counter (PC). The assembly code developed by the user will be loaded from the
starting address 00000020h. The complete downloaded assembly code can be viewed in the
disassembly window from this address.

Fig. 15.2 Various windows of CCS for ‘C6416

15.1.4 Execution of ‘C6X Codes in Target Processor


The assembly code loaded into the target processor is then executed, and the results can
be verified in the CPU registers and memory of the processor. To view the results, the register window
and memory window are to be enabled. Select the View – Registers – Core registers option in CCS. A new
window appears in CCS, in which the content of all the CPU core registers can be viewed. In the same
way, select the View – Memory option in CCS; a new small window will appear for options. Select the
address of the memory that is to be viewed (e.g. 0x00000200) and the format in which the data is to be
displayed (e.g. 32 Bit Hex – C style); a memory window appears in CCS. In the memory window, click
the right button of the mouse and choose the Float in main window option to view the memory window along
with the disassembly window in CCS.
To start execution, the program counter (PC) should point to the starting address of the code. This can
be done in two ways. The first is to double click the PC in the register window; the edit register window
will appear, in which the start address (0x00000020) is entered. The other way is to place the cursor on
the first line of the code, either in the text edit window or in the
disassembly window, right click the mouse button and select the
Set PC to cursor option, or to use the shortcut options.
To execute the code, select the Debug – Step into option in CCS or its shortcut key; the code will be executed
line by line. A breakpoint can be placed at the end address of the
code using the option in the debug menu, or by double clicking the last line of the code in the disassembly
window. Breakpoints can be placed on any line of the code, and any number of breakpoints can be used in the
program. The ‘run’ option in the debug menu can be used to execute the code in one step. The arithmetic
series values and the sum can be viewed in the memory window. The various windows of the CCS along
with the result in the memory window are shown in Fig. 15.2.

APPLICATION PROGRAMS IN ‘C64X 15.2


The assembly language programs for various functions such as convolution, discrete Fourier transform,
FIR filter and real time audio signal capture are implemented on the TMS320C6416 starter kit. The details
of the implementation are given in this section.

15.2.1 Integer Division


Integer division is performed efficiently on the ‘C6X processor using the SUBC instruction.
Division using SUBC requires the denominator to be aligned to the numerator.
The alignment is found by detecting the leftmost ‘1’ bit of each operand using the LMBD
instruction of the ‘C6X. The TMS320C6X assembly program for unsigned integer division is given
in Program 15.1 and for signed integer division in Program 15.2. The numerator and denominator are stored in
two CPU registers (A2 and A3), where the denominator must be less than the numerator. The LMBD
instruction is applied to both the numerator and the denominator, and the results are
stored in two further registers (A5 and A6). The difference between the leftmost ‘1’ bit detection result of
the denominator and that of the numerator, say X, is the critical value used in the division process. The
denominator is left shifted by X bits to align it to the numerator. The aligned
denominator is then conditionally subtracted from the numerator using the SUBC instruction. It is important
to note that the SUBC instruction must be executed X+1 times to complete the division. After the division,
both the quotient and the remainder are in the single register in which the numerator was loaded (A2). The
quotient is given by the X+1 least significant bits of this register, and the remaining most significant bits
give the remainder. In signed integer division, the absolute values of
the signed numbers are obtained, the division is performed in the same way as in the unsigned case and, at
the end, the sign information is added to the quotient. Programs 15.1 and 15.2 can be used for dividing
unsigned and signed integers of up to 32 bits respectively.
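The alignment and conditional-subtract procedure described above can be traced with a host-side Python model (not ‘C6X code). The helpers are simplified stand-ins for the instructions: lmbd_one models LMBD searching for a ‘1’ in a 32-bit word, and subc models the conditional subtract-and-shift step. All function names are hypothetical. Python integers are unbounded, so 32-bit overflow is not modelled, and the sketch restricts the numerator to less than 2^31.

```python
def lmbd_one(value):
    """Number of zero bits to the left of the leftmost 1 in a 32-bit word."""
    return 32 - value.bit_length()


def subc(src1, src2):
    """One conditional subtract step: shift in a 1 if the subtraction succeeds."""
    diff = src1 - src2
    if diff >= 0:
        return (diff << 1) + 1
    return src1 << 1


def divide_unsigned(numerator, denominator):
    """Return (quotient, remainder) using X+1 conditional subtract steps."""
    if not 0 < denominator <= numerator < 2 ** 31:
        raise ValueError("model assumes 0 < denominator <= numerator < 2**31")
    x = lmbd_one(denominator) - lmbd_one(numerator)   # alignment shift X
    aligned = denominator << x
    acc = numerator
    for _ in range(x + 1):                            # SUBC executed X+1 times
        acc = subc(acc, aligned)
    quotient = acc & ((1 << (x + 1)) - 1)             # X+1 least significant bits
    remainder = acc >> (x + 1)                        # remaining upper bits
    return quotient, remainder


def divide_signed(numerator, denominator):
    """Divide the absolute values, then negate the quotient if the signs differ."""
    quotient, remainder = divide_unsigned(abs(numerator), abs(denominator))
    if (numerator < 0) != (denominator < 0):
        quotient = -quotient
    return quotient, remainder


unsigned_result = divide_unsigned(0x20, 0x03)   # operands of Program 15.1
signed_result = divide_signed(-19, 10)          # operands of Program 15.2
```

For the operands of Program 15.1 (20h divided by 3) X is 4, five conditional subtract steps are performed and the model gives a quotient of 10 with a remainder of 2. For the operands of Program 15.2 (-19 divided by 10) it gives a quotient of -1, with the remainder 9 taken from the absolute values.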

Program 15.1 Unsigned Integer Division


Label Mnemonic Comments
.text ; assembler directive to initialize the program section
zero .s1 a1 ;zero the content of registers A1,A2 and A3
|| zero .d1 a2
|| zero .l1 a3
zero .s1 a4 ; zero the content of registers A4,A5 and A6
|| zero .d1 a5
|| zero .l1 a6
zero .s1 a7 ;zero the content of registers A7,A8 and A9
|| zero .d1 a8
|| zero .l1 a9
zero .s1 a10 ;zero the content of registers A10 and A11
|| zero .d1 a11
mvk .s1 20h,a2 ; 16 LSBs of numerator moved to register A2
mvklh .s1 0h,a2 ; 16 MSBs of numerator moved to register A2
mvk .s1 03h,a3 ; 16 LSBs of denominator moved to register A3
mvklh .s1 0h,a3 ; 16 MSBs of denominator moved to register A3
mvk .s1 01h,a4 ; 1 is moved in register A4
lmbd .l1 a4,a2,a5 ; left most 1 detection for numerator, result in A5
lmbd .l1 a4,a3,a6 ;left most 1 detection for denominator, result in A6
sub .s1 a6,a5,a1 ;the difference in left most 1 detection, result in A1
shl .s1 a3,a1,a3 ;the denominator aligned to numerator by left shifting,
; the shift value is the content of A1
add .l1 1,a1,a1
neg .s1 a4,a7
shl .s1 a7,a1,a8
not .s1 a8,a9
mv .l1 a1,a7
loop subc .s1 a2,a3,a2 ;register A3 content subtracted from A2 content, result
;in A2 register
sub .l1 a1,a4,a1 ;the content of A1 register decremented
[a1] b .s2 loop ;branch to loop for content register A1 non-zero
|| nop 5 ; no operations
and .d1 a2,a9,a10 ;quotient in A10 register
and .d1 a2,a8,a11
shr .s1 a11,a7,a11 ; remainder in A11 register
.end

Program 15.2 Signed Integer Division

Label Mnemonic Comments


.text ; assembler directive to initialize the program section
zero .s1 a1 ; zero the content of registers A1-A11,B0 and B1
|| zero .d1 a2
|| zero .l1 a3
|| zero .s2 b0
|| zero .d2 b1
zero .s1 a4
|| zero .d1 a5
|| zero .l1 a6
zero .s1 a7
|| zero .d1 a8
|| zero .l1 a9
zero .s1 a10
|| zero .d1 a11
mvk .s2 -19,b0 ; numerator specified in lsb 16 bits of register B0
mvk .s2 10,b1 ; denominator specified in lsb 16 bits of register B1
abs .l1 b0,a2 ; absolute value of B0 stored in A2
cmplt .l2 b0,a2,b0 ;the sign information is stored in B0
abs .l1 b1,a3 ;absolute value of B1 stored in A3
cmplt .l2 b1,a3,b1 ; the sign information is stored in B1
sub .d2 b0,b1,b0
mvk .s1 01h,a4
lmbd .l1 a4,a2,a5 ; left most 1 detection for numerator, result in A5
lmbd .l1 a4,a3,a6 ;left most 1 detection for denominator, result in A6
sub .s1 a6,a5,a1
shl .s1 a3,a1,a3 ;the denominator aligned to numerator by left shifting,
; the shift value is the content of A1
add .l1 1,a1,a1
neg .s1 a4,a7
shl .s1 a7,a1,a8
not .s1 a8,a9
mv .l1 a1,a7
loop subc .l1 a2,a3,a2 ; subtract operation performed result stored in A2
sub .l1 a1,a4,a1
[a1] b .s2 loop
|| nop 5
and .d1 a2,a9,a10
[b0] neg .s1 a10,a10 ;quotient in register A10, for signed nos. sign information added
and .d1 a2,a8,a11
shr .s1 a11,a7,a11 ; remainder in register A11
.end

15.2.2 Convolution Operation


The basic operation to be implemented for signal processing applications is convolution. In ¢C6X
processor it can be implemented using the multiply (MPY) and add (ADD) instructions. The multiply
and add instructions are executed in parallel to perform a single-cycle multiply and accumulate (MAC)
operation. In 'C6X there are two multipliers and six ALUs functioning in parallel, so two single-cycle
MAC operations can be performed simultaneously, one in path A and one in path B of the CPU.
The convolution operation can be performed on 8, 16 and 32-bit data values in the 'C6X processor, and
the assembly language program to perform 8, 16 and 32-bit convolution is given in Program 15.3. The
8, 16 and 32-bit data values are defined using the assembler directives .byte, .half and .word respectively.
The data values of the sequences can be defined directly in the program or stored in separate
data files, which are brought into the assembly program using the .include or .copy
assembler directives. In Program 15.3 the two sequences to be convolved are defined
by the variables x and h, and the numbers of values in the sequences are defined by the variables n and
m respectively. The number of convolution outputs to be computed is n+m-1. While storing
the sequence values in memory, both sequences x and h must be padded with zeros to avoid
garbage values being accessed from memory during the convolution operation. For sequence x, m-1 zeros
are padded after the sequence; for sequence h, n-1 zeros are padded both before and after the sequence, as
shown in Program 15.3. The stored values of the two sequences are read from memory one by one
using load instructions through path A and path B simultaneously, multiplied and accumulated, and
this is repeated n+m-1 times to get the first convolution output. The result is stored in memory using a
store instruction. After the address update of both sequences, the next output is computed, and this process
continues n+m-1 times. The conditional operations of 'C6X are used to check the count values. The
convolved output can be viewed in memory by invoking the memory window in CCS.
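
As a cross-check for the assembly program, the same linear convolution can be modelled in C on a host machine. The sketch below is not taken from the original program; the function name convolve and its argument names are illustrative assumptions. It evaluates each of the n+m-1 outputs as the sum of the products x[i]*h[k-i], skipping the index combinations that the zero padding absorbs in the assembly version.

```c
#include <stddef.h>

/* Linear convolution of x (n values) with h (m values).
 * y must have room for n + m - 1 outputs.
 * Host-side reference model; names are illustrative only. */
void convolve(const int *x, size_t n, const int *h, size_t m, int *y)
{
    for (size_t k = 0; k < n + m - 1; k++) {
        int acc = 0;                      /* running sum for output k */
        for (size_t i = 0; i < n; i++) {
            /* h index k - i is valid only when 0 <= k - i < m */
            if (k >= i && k - i < m)
                acc += x[i] * h[k - i];   /* one multiply-accumulate */
        }
        y[k] = acc;
    }
}
```

For the sequences defined in Program 15.3, x = {1,2,2,2,2,2,1} and h = {1,2,2,2,1}, this model gives the 11 outputs 1, 4, 8, 12, 15, 16, 15, 12, 8, 4, 1, which can be compared with the values seen in the CCS memory window.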

Program 15.3 Convolution operation

Label Mnemonic Comments


.data ; assembler directive to initialize the data section
x .byte 1h,2h,2h,2h,2h,2h,1h ; the values of sequence x defined
; .byte, .half and .word assembler directives to represent-
; data in 8, 16 and 32 bit data formats respectively
xpa .byte 0h,0h,0h,0h ; m-1 values of zeros padded after the sequence x
hpb .byte 0h,0h,0h,0h,0h,0h ; n-1 values of zeros padded before the sequence h
h .byte 1h,2h,2h,2h,1h ; the values of sequence h defined
hpa .byte 0h,0h,0h,0h,0h,0h ; n-1 values of zeros padded after the sequence h
 n .set 7 ; the no. of values in sequence x (n)
 m .set 5 ; the no. of values in sequence h (m)
.text ; assembler directive to initialize the program section
zero .s1 a1 ; zero the contents of CPU registers
|| zero .d1 a2
|| zero .l1 a3
zero .s1 a0
|| zero .d1 a4
|| zero .l1 a5
zero .s2 b2
|| zero .s1 a5
|| zero .d1 a6
zero .s2 b3
|| zero .l2 b4
|| zero .d1 a7
mvkl .s1 n+m-1,a7 ; (n+m-1), ((n+m)*2)-2 and ((n+m)*4)-4 for 8,16 and 32 bit-
; data values respectively. The address displacement after-
; every convolution output
||mvkl .s2 h, b3 ; the start address of the sequence h loaded in register B3
mvkl .s1 n+m -1,a1 ; the no. of times the convolution output to be computed
|| mvkl .s2 0100h,b5 ; the start address to store result is loaded in register B5
loop1 mvkl .s2 n+m -2,b0 ;the no. of times the multiplication and accumulation to be
; performed
|| mvkl .s1 x,a3 ; the start address of the sequence x loaded in register A3
|| zero .d1 a5
|| zero .l1 a6
loop ldb .d1 *a3++,a4 ; the sequence values x and h are loaded from memory to-
 ||ldb .d2 *b3--,b4 ; register A4 and B4 respectively. ldb, ldh and ldw instruction-
||nop 6 ; for byte, half word and word load respectively
mpy .m1 a4,b4,a5 ; multiplication and accumulation of x and h values
|| add .s1 a5,a6,a6
|| sub .s2 b0,1,b0
|| nop 5
[b0] b loop
||nop 7
sub .s1 a1,1,a1
|| stb .d2 a6,*b5++ ; the convolved output sequence stored in memory. stb, sth-
||nop 6 ; and stw instructions for byte, half word and word store
add .s2 b3,a7,b3 ; the address update for sequence h
[a1] b loop1
||nop 7
.end

15.2.3 DFT using FFT Algorithm


The Discrete Fourier Transform (DFT) can be computed quickly using the Fast Fourier Transform
(FFT) algorithm (refer to chapter 1.14). The FFT algorithm is based on the symmetry property of the
factor $W_N^{kn}$, where $W_N^{kn} = e^{-j(2\pi/N)kn}$. The computation of the DFT using the FFT algorithm is carried out
by one of two methods, Decimation in Time (DIT) or Decimation in Frequency (DIF). The best way to
implement the DFT in either the DIT or the DIF method is through butterfly structures. In this chapter an 8-point
DFT implementation using the DIT radix-2 FFT algorithm is presented and its 'C64X assembly language
program is given in Program 15.6. The first module needed in DFT computation using the radix-2 algorithm
is the rearrangement of the input sequence in bit-reversed order. In the 'C5X, 'C3X and 'C54X processors a
specific addressing mode called bit-reversed addressing mode is available to perform the bit reversal,
but in 'C6X there is no such addressing mode. Hence a 'C6X assembly language program to rearrange the
input sequence in bit-reversed order is given in Program 15.4. The input sequence in the DFT computation
is real, but the coefficients of the FFT and the outputs of the intermediate stages of the butterfly structure are
complex; hence the second module required is complex number multiplication. The 'C6X assembly program
to perform complex multiplication is given in Program 15.5.
The FFT coefficients are the twiddle factors $W_N^k$, represented in trigonometric form as $\cos(2\pi k/N)
- j\sin(2\pi k/N)$, where k varies from 0 to 7 and N is the number of inputs, i.e. 8 for the 8-point DFT. The
cosine and sine functions take values from +1 to -1 as k varies from 0 to 7, so the fractional values of the
coefficients are scaled by an appropriate scaling factor S and rounded off to the nearest integers. The scaling
factor is selected as a power of 2 ($S = 2^x$), because after multiplying the inputs with the coefficients in the
butterfly structure, the resultant product has to be divided by the scaling factor S to get back the actual
value. If the scaling factor is a power of 2, the division can be done easily by a shift right operation
(SHR); for other scaling factors the division programs given in Programs 15.1 and 15.2 can be used. The
DFT outputs obtained from the butterfly structure will differ from manual calculation in the
paths where the twiddle factors are fractional, and this error is due to the rounding off of the coefficients.

Program 15.4 Bit reversal of input sequence

Label Mnemonic Comments


.data ; assembler directive to initialize the data section
x .byte 1,-1,1,0,2,-1,2,1 ; the input sequence that is to be bit reversed
.text ; assembler directive to initialize the program section
n .set 8 ; number of points
l .equ (n/4) ; number of swaps
k .equ (n/2) ; half point value
zero .s1 a1 ; zero the register contents
|| zero .d1 a2
|| zero .l1 a3
zero .s1 a4
|| zero .d1 a5
|| zero .l1 a6
zero .s1 a7
|| zero .d1 a8
|| zero .l1 a9
zero .s1 a10
|| zero .d1 a11
|| zero .l1 a12
mvkl .s1 x,a3 ;the start address of the sequence x loaded in register A3
mvkl .s1 k,a1
|| mvkl .s2 l,b0
loop ldb .d1 *++a3[a1],a4
||nop 7
mv .s1 a4,a5
||sub a1,1,a0
||nop 6
ldb .d1 *--a3[a0],a4
|| nop 7
stb .d1 a5,*a3++[a0]
||sub a1,2,a0
||nop 7
stb .d1 a4,*a3--[a0]
||sub b0,1,b0
||nop 7
[b0] b .s2 loop
||nop 7
.end
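
The reordering carried out by Program 15.4 can be modelled in C for checking the result. This is a host-side sketch under the assumption that N is a power of 2; the function names bit_reverse and bit_reverse_reorder are illustrative and do not appear in the original program.

```c
/* Reverse the lowest `bits` bits of index i (e.g. 3 bits for N = 8). */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1u);   /* move the LSB of i into r */
        i >>= 1;
    }
    return r;
}

/* Reorder x[0..n-1] in place into bit-reversed order, where n = 2^bits. */
void bit_reverse_reorder(signed char *x, unsigned n, unsigned bits)
{
    for (unsigned i = 0; i < n; i++) {
        unsigned j = bit_reverse(i, bits);
        if (i < j) {               /* swap each pair only once */
            signed char t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }
}
```

For the input sequence 1,-1,1,0,2,-1,2,1 of Program 15.4 this performs two swaps (indices 1 and 4, indices 3 and 6) and yields 1,2,1,2,-1,-1,0,1, which is the sequence xb used as input in Program 15.6.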

Program 15.5 Complex number multiplication


Label Mnemonic Comments
.data ; assembler directive to initialize the data section
re .byte -1,-1 ; real part of two complex numbers defined
im .byte 1,1 ; imaginary part of two complex numbers defined
pro .byte 0,0 ; multiplication result real and imaginary part output buffer
.text ; assembler directive to initialize the program section
mvkl re,a3 ; start address of real part loaded in register A3
mvkl im,a4 ; start address of imaginary part loaded in A4
mvkl pro,a15 ; start address of output buffer loaded in A15
ldb *a3++,a5 ; real part loaded in register A5 and A6
||nop 7
ldb *a3++,a6
||nop 7
ldb *a4++,a7 ; imaginary part loaded in register A7 and A8
||nop 7
ldb *a4++,a8
||nop 7
mpy a5,a6,a9 ; real parts multiplied result in a9
||nop 4
mpy a7,a8,a10 ; imaginary parts multiplied result in a10
||nop 4
 neg a10,a10 ; product of imaginary parts negated, since j*j = -1
add a9,a10,a9 ; real part of multiplication obtained
stb a9,*a15++ ; real part of multiplication stored in memory
|| nop 7
 mpy a5,a8,a9 ; real part of first number multiplied by imaginary part of second
 ||nop 4
 mpy a6,a7,a10 ; real part of second number multiplied by imaginary part of first
|| nop 4
add a9,a10,a9 ; imaginary part of multiplication obtained
stb a9,*a15++ ; imaginary part of multiplication stored in memory
|| nop 7
.end
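
For reference, the complex product computed by Program 15.5 follows (a + jb)(c + jd) = (ac - bd) + j(ad + bc). A minimal C sketch of the same arithmetic is shown below; the type name cplx and the function name cplx_mul are assumptions introduced for illustration.

```c
/* Complex number with integer parts, as used with the scaled coefficients. */
struct cplx {
    int re;
    int im;
};

/* (a.re + j a.im)(b.re + j b.im)
 *   = (a.re*b.re - a.im*b.im) + j(a.re*b.im + a.im*b.re) */
struct cplx cplx_mul(struct cplx a, struct cplx b)
{
    struct cplx p;
    p.re = a.re * b.re - a.im * b.im;   /* real part */
    p.im = a.re * b.im + a.im * b.re;   /* imaginary part */
    return p;
}
```

For the data of Program 15.5, both numbers are -1 + j1, so the product is 0 - j2.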

Program 15.6 DFT computation (8-point) using DIT FFT radix-2 algorithm

Label Mnemonic Comments


.data ; assembler directive to initialize the data section
core .byte 8,6,0,-6,-8,-6,0,6 ;real value of coeff. scaled by factor 8
coim .byte 0,-6,-8,-6,0,6,8,6 ;imaginary value of coeff. scaled by factor 8
xb .byte 1,2,1,2,-1,-1,0,1 ;input sequence x in bit reversed order
n .set 8 ;no of inputs to find DFT
h .set (n/2)
h1 .set h-1
q .set (n/4)
q1 .set q-1
x2r .byte 0,0,0,0,0,0,0,0 ;2point butterfly output buffer
x4r .byte 0,0,0,0,0,0,0,0 ;4point butterfly real and imaginary output value buffer
.byte 0,0,0,0,0,0,0,0
x8r .byte 0,0,0,0,0,0,0,0 ;8point butterfly real and imaginary output value buffer
.byte 0,0,0,0,0,0,0,0
.text ; assembler directive to initialize the program section
; 2 point butterfly computation
mvkl .s1 core,a0 ;start address of real part of coeff. loaded in A0
mvkl .s1 h,a1
mvkl .s1 xb,a4 ;start address of input sequence loaded in A4
mvkl .s1 x2r,a15 ;start address of x2r buffer loaded in A15
ldb *++a0[h],a14 ;load coefficient in register A14
||nop 7
loop
ldb *a4++,a5 ; load first two inputs in register A5 and A6
||nop 7
ldb *a4++,a6
|| nop 7
add .l1 a5,a6,a8
||mpy .m1 a6,a14,a9
||nop 6
shr a9,3,a9 ; product divided by a factor 8 by shift right operation
add a5,a9,a9
||stb a8,*a15++ ; two point butterfly output stored in buffer
||nop 7
stb a9,*a15++
|| sub a1,1,a1
||nop 7
[a1] b loop
||nop 7
; 4 point butterfly computation
mvkl x4r,a15 ;start address of x4r buffer loaded in A15
||mvkl 1,b2
mvkl q,a2
loop3
mvkl core,a0 ;start address of real part of coeff. loaded in A0
mvkl coim,a1 ;start address of imaginary part of coeff. loaded in A1
||mvkl q,b1
loop2 [b2] mvkl .s1 x2r,a3 ;start address of x2r buffer loaded in A3
[!b2] mvkl .s1 x2r+h,a3 ;start address of the half of the buffer x2r loaded in A3
mvkl q,b0
loop1 ldb *a0++[q],a14 ;load real part of coefficient in A14
||nop 7
ldb *a1++[q],a13 ;load imaginary part of coefficient in A13
||nop 7
ldb *a3++[q],a4 ; load inputs
||nop 7
 ldb *a3--[q1],a5
||nop 7
mpy a5,a14,a6
||nop 4
shr a6,3,a6 ; product divided by a factor 8
||mpy a5,a13,a7
|| nop 4
shr a7,3,a7 ; product divided by a factor 8
||add a4,a6,a6
||nop 6
stb a6,*a15++ ; real part of 4-point butterfly output stored
||nop 7
stb a7,*a15++ ; imaginary part of 4-point butterfly output stored
|| sub b0,1,b0
||nop 7
[b0] b loop1
||nop 7
sub b1,1,b1
[b1] b loop2
|| nop 7
zero b2
||sub a2,1,a2
[a2] b loop3
||nop 7
; 8 point butterfly computation
mvkl x8r,a15 ;start address of x8r buffer loaded in A15
mvkl core,a0 ;start address of real part of coeff. loaded in A0
mvkl coim,a1 ;start address of imaginary part of coeff. loaded in A1
||mvkl q,b0
loop5 mvkl x4r,a3 ;start address of x4r buffer loaded in A3
mvkl h,a2
loop4 ldb *a0++,a14 ;load real part of coefficient in A14
||nop 7
ldb *a1++,a13 ;load imaginary part of coefficient in A13
||nop 7
ldb *a3++,a4 ; load inputs
||nop 7
ldb *a3++(h1+h),a5
||nop 7
ldb *a3++,a6
||nop 7
 ldb *a3--(h1+h),a7
||nop 7
mpy a6,a14,a8
||nop 4
shr a8,3,a8 ; product divided by a factor 8
||mpy a7,a13,a9
||nop 4
shr a9,3,a9 ; product divided by a factor 8
neg a9,a9
add a8,a9,a9
add a4,a9,a4
stb a4,*a15++ ; real part of 8 point butterfly output stored
||nop 7
mpy a6,a13,a8
||nop 4
shr a8,3,a8 ; product divided by a factor 8
||mpy a7,a14,a9
||nop 4
shr a9,3,a9 ; product divided by a factor 8
add a8,a9,a9
add a5,a9,a5
stb a5,*a15++ ; imaginary part of 8 point butterfly output stored
|| sub a2,1,a2
|| nop 7
[a2] b loop4
|| nop 7
sub b0,1,b0
[b0] b loop5
||nop 7
.end

APPLICATION PROGRAMS IN 'C67X 15.3


The programs in this section are executed on the TMS320C6713 starter kit. The programs may be written
in assembly language, in C language, or in a combination of both. In this section, examples using the
last two approaches are presented.

15.3.1 Code Development in C Environment using Code Composer Studio


Code Composer Studio (CCS) provides an integrated development environment (IDE) for real-time
digital signal processing applications based on the C programming language. It incorporates a C
compiler, an assembler, and a linker. It has graphical capabilities and supports real-time debugging.
The following file extensions are used by Code Composer Studio:
1. file.pjt : To create and build a project named file.
2. file.c : C source program.
3. file.asm : Assembly source program created by the user,
by the C compiler, or by the linear optimizer.
4. file.sa : Linear assembly source program. The linear optimizer uses file.sa
as input to produce an assembly program file.asm.
5. file.h : Header support file.
6. file.lib : Library file, such as the run-time support library file rts6700.lib.
7. file.cmd : Linker command file that maps sections to memory.
8. file.obj : Object file created by the assembler.
9. file.out : Executable file created by the linker to be loaded and run
on the TMS320C6713 processor.
10. file.cdb : Configuration file when using DSP/BIOS.
The following steps are adopted for code development in the C environment. In Code Composer
Studio, click New Project under the Project menu. Enter the project name, the project output and the target
processor as shown in Figure 15.3. The project is then added to the project view on the left side, as
shown in Figure 15.4. A sample C program, new.c, is given in Program 15.7 and the linker command file
C6713dsk.cmd is given in Program 15.8. The linker command file is added to the project in order to
provide the details of the memory map for the program.

Fig. 15.3

Program 15.7 New.c


#include <stdio.h>
#include <math.h>
void main()
{
int a,b,c;

a=100;
b=120;
c=a+b;
printf("Sum of %d and %d is %d\n",a,b,c);
}

Fig 15.4

Program 15.8 C6713dsk.cmd Linker command File

/*C6713dsk.cmd Linker command file*/


MEMORY
{
IVECS: org=0h, len=0x220
IRAM: org=0x00000220, len=0x0000A000 /*internal memory*/
SDRAM: org=0x80000000, len=0x00100000 /*external memory*/
FLASH: org=0x90000000, len=0x00020000 /*flash memory*/
}
SECTIONS
{
.EXTRAM :> SDRAM
.vectors :> IVECS
.text :> IRAM
.bss :> IRAM
.cinit :> IRAM
.stack :> IRAM
.sysmem :> SDRAM
.const :> IRAM
.switch :> IRAM
.far :> IRAM
.cio :> SDRAM
.csldata :> IRAM
}

Fig. 15.5 Build Options



Fig. 15.6 CCS IDE with output window

The code generation tools underlying CCS, that is, the C compiler, assembler and linker, have a number of
options associated with each of them. These options must be set appropriately before attempting to build
a project. Once set, these options are stored in the project file. Figure 15.5 shows the build options
set for new.pjt. After setting the build options and adding the necessary files, go to Debug → Connect
to connect the target board to CCS. Then go to Project → Rebuild All to build the entire project.
After building the entire project, go to File → Load Program and select new.out to load it onto the
target board. Then go to Debug → Run to run the project and see the result in the output window, as
shown in Fig. 15.6.

15.3.2 Computation of the 8-point DFT using FFT Algorithm in C Environment


The DSK6713 kit has an on-board audio codec (TLV320AIC23), which can be configured for speech
input and speech output. The sampling frequency for speech input can be varied from 8 kHz to 96 kHz
using software. More details about the AIC23 parameters which can be modified using software are
provided in dsk6713_aic23.h. The c6713dskinit.c file contains the functions necessary for speech input
and speech output, including the function input_sample(). This function, given in Program 15.9, is
used to capture speech input through the microphone interface provided in the kit.

Program 15.9 Program to capture speech input through microphone interface

Uint32 input_sample()
{
short CHANNEL_data;
if (poll) while(!MCBSP_rrdy(DSK6713_AIC23_DATAHANDLE));//if ready to receive
AIC_data.uint=MCBSP_read(DSK6713_AIC23_DATAHANDLE); //read data
CHANNEL_data=AIC_data.channel[RIGHT];
AIC_data.channel[RIGHT]=AIC_data.channel[LEFT];
AIC_data.channel[LEFT]=CHANNEL_data;
return(AIC_data.uint);
}
The C program for the computation of the 8-point DFT using the FFT algorithm is given in Program 15.10.
The speech input through the microphone interface is sampled at the rate of 8 kSPS, digitized and stored
in a data file ‘samplefft.txt’. The 8-point DFT is computed and the DFT coefficients are printed in the
output window of CCS.

Program 15.10 Computation of 8-Point DFT in 'C6713 using C code

#include "DSK6713_aic23.h"
Uint32 fs=DSK6713_AIC23_FREQ_8KHZ;
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define noof_stages 3 /* the no. of butterfly stages: 3 */
#define noof_samples 8 /* the no. of inputs: 8 */
#define PI 3.14159
struct complex {
float real;
float imag;
};
struct buffer {
struct complex data[1][20];
};
#pragma DATA_SECTION(real_buffer,".EXTRAM")
struct buffer real_buffer;
FILE *f1;
void fft (struct buffer *, int , int );
/* Main Program */
void main()
{
int k;
float sample;
int sn,sm;
sn=noof_samples;
sm=noof_stages;

printf("Input\n");
f1=fopen("samplefft.txt","r");
for(k=0;k<noof_samples;k++)
{
fscanf(f1,"%f",&sample);
real_buffer.data[0][k].real = ((float)sample);
fscanf(f1,"%f",&sample);
real_buffer.data[0][k].imag = ((float)sample);
printf("%f\t%fi\n",real_buffer.data[0][k].real,real_buffer.data[0][k].imag);
}
fclose(f1);
fft(&real_buffer,sn,sm);
}
/* Function to Compute Fast Fourier Transform */
void fft (struct buffer *input_data, int n, int m) {
int n1,n2,i,j,k,l;
float xt,yt,c,s,e,a;
n2 = n;
for ( k=0; k<m; k++) {
n1 = n2;
n2 = n2/2;
e = 2*PI/n1; /* angle step 2*pi/n1 between successive twiddle factors */
for ( j= 0; j<n2; j++) {
a = j*e;
c = (float) cos(a);
s = (float) sin(a);
for (i=j; i<n; i+= n1) {
l = i+n2;
xt = input_data->data[0][i].real - input_data->data[0][l].real;
input_data->data[0][i].real = input_data->data[0][i].real+input_data->data[0][l].real;
yt = input_data->data[0][i].imag - input_data->data[0][l].imag;
input_data->data[0][i].imag = input_data->data[0][i].imag+input_data->data[0][l].imag;
input_data->data[0][l].real = c*xt + s*yt;
input_data->data[0][l].imag = c*yt - s*xt;
}
}
}
j = 0;
for ( i=0; i<n-1; i++) {
if (i<j) {
xt = input_data->data[0][j].real;
input_data->data[0][j].real = input_data->data[0][i].real; input_data->data[0][i].real = xt;
yt = input_data->data[0][j].imag; input_data->data[0][j].imag = input_data->data[0][i].imag;
input_data->data[0][i].imag = yt;
}
k = n/2; /* advance j to the bit-reversed value of i+1 */
while (k<=j) {
j = j-k;
k = k/2;
}
j = j+k;
}
/* printf("Output\n");
for ( l=0; l<n; l++)
{
printf("%f\t",input_data->data[0][l].real);
printf("%fi\n",input_data->data[0][l].imag);
}
*/
return;
}

15.3.3 Estimation of Clock Cycles Required for Code Execution using CCS
The number of clock cycles (machine cycles) required to execute a complete program written in assembly,
in C, or in a combination of assembly and C can be estimated using the CCS tool. In CCS, first
select the option Profile – Clock – Enable and then select the option Profile – Clock – View; a clock
icon will appear at the lower right corner of the CCS window. The clock can be reset by double-
clicking the clock icon. Once the project file is downloaded to the target processor, the PC is set to the
starting point of the program code (the default value of the start address is 0000 0020h). A breakpoint can be
introduced at the last line of the code (to introduce a breakpoint, refer to section 15.1.4). Select the Debug –
Run option in the CCS tool, or use the shortcut key, to run the code from the current value of the program
counter (PC) to the address where the breakpoint is introduced. The count shown beside the clock icon
is the number of clock cycles required to execute the block of code from the starting
address to the address where the breakpoint is introduced. In the same way, by introducing breakpoints
at any other places in the program, the clock cycle count required to execute any block of the program
can be measured.

15.3.4 Comparison of the Number of Clock Cycles Required for the Computation of
8 Point DFT in both Assembly Language and C Environment
The number of clock cycles required for the computation of the DFT using the assembly language program given
in section 15.2.3 (Program 15.6) and the C program given in section 15.3.2 (Program 15.10) are evaluated and
compared in this section. For the evaluation of the number of clock cycles required for the computation
of the DFT using Program 15.6, the number of clock cycles required for bit-reversing the input sequence
is measured using the CCS tool as described in section 15.3.3. The 8, 16, 32 and 64-point input sequences take
119, 197, 353 and 671 machine cycles respectively for bit reversal. The clock cycle count for the
computation of the 8-point DFT using Program 15.6 is also evaluated. The numbers of clock cycles required
to compute the 2-point, 4-point and 8-point butterfly outputs are 220, 585 and 565 respectively. For the
complete computation of the 8-point DFT using assembly code, the number of clock cycles measured
using the CCS tool on the 'C6416 starter kit is 1,479.
The number of clock cycles required for the 8-point DFT computation using C code on the 'C6713 starter kit
is also evaluated and the value is 14,739 clock cycles. Hence, the number of clock cycles required for
computing the 8-point DFT by programming in the C environment is larger by a factor of about 10. In addition to
the non-optimality of the C compiler, the type of processor used for the implementation also contributes to
the difference. The C6416 processor used for the assembly language programming in section 15.2.3 is
a fixed point processor, whereas the C6713 processor used for executing the program in the C environment is
a floating point processor. The floating point processor in general requires more cycles than the fixed
point processor. It may be noted that CCS can be used with both of these processors to develop programs in
assembly language, C language or a combination of both.

15.3.5 Mixing Assembly Language and C Language


Programs in assembly language can be optimized by efficient use of the architecture of the processor
and hence require fewer clock cycles for execution. However, this requires the designer to learn
the assembly language of the processor. In addition, a source program in assembly
language is longer than the equivalent program in C language and hence requires more time for development,
debugging and testing. Moreover, for programming in C language, the designer need not know either
the architecture or the assembly language of the processor. In order to combine the advantages of
both, assembly language may be used for implementing the functions which are computation
intensive, and these functions can be invoked from the C environment.

15.3.6 Different Ways of Invoking Assembly Language in C-code


In this section, different ways of invoking assembly language in C code are illustrated on a TMS320C6713
floating point DSP using the Code Composer Studio software.
There are four ways of invoking assembly language in C code for DSP programming:
• callable assembly
• intrinsic functions
• linear assembly
• inline assembly
The above-mentioned approaches are illustrated with an example application which requires
computing the Euclidean distance for inputs with two variables. The C code for the computation of the
Euclidean distance is given in Program 15.11.

Program 15.11 C Code for the computation of Euclidean distance


#include<math.h>
#include<stdio.h>
main( )
{
float x1,x2,y1,y2,dx,dy,e;
dx = x1-x2;
dy = y1-y2;
e = sqrt(dx*dx + dy*dy);
}

Callable Assembly The callable assembly approach uses C source code which calls an externally
declared, user-defined assembly language function. The C code can be rewritten using callable
assembly as shown in Program 15.12.

Program 15.12 C Code for the computation of Euclidean distance using callable assembly
language function:

#include<math.h>
#include<stdio.h>
extern float errasm(float,float,float,float);
main( )
{
float x1,x2,y1,y2,e;
e = errasm(x1,y1,x2,y2);
}
In Program 15.12, errasm() is a function called from the C code; it is written in assembly language and
saved as errasm.asm. The errasm.asm file is given in Program 15.13.

Program 15.13 errasm.asm Code

.def _errasm

_errasm: SUBSP .L1 A4,A6,A7;


NOP 5
SUBSP .L2 B4,B6,B7;
NOP 5
MV .S1 A7,A8;
MV .S2 B7,B8;
MPYSP .M1 A7,A8,A5;
NOP 5
MPYSP .M2 B7,B8,B5;
NOP 5
ADDSP .L1X A5,B5,A9;
NOP 5
RSQRSP .S1 A9,A6;
NOP 5
RCPSP .S1 A6,A4;
NOP 5
B B3;
NOP 5 ; fill the five delay slots of the branch
.end

Intrinsic Functions Intrinsics are special functions that map directly to inline 'C6X instructions. For
example, int _mpy() is equivalent to the assembly instruction MPY, which multiplies the 16 LSBs of two
numbers. The above-mentioned C code example can be written using intrinsic functions as shown in
Program 15.14.

Program 15.14 C-code with Intrinsic functions:

#include<ieeef.h>
#include<fastrts67x.h>
#include<stdio.h>
main( )
{
float x1,x2,y1,y2,dx,dy,e;
dx = x1-x2;
dy = y1-y2;
e = _rcpsp(_rsqrsp(dx*dx + dy*dy));
}

In the above C code, two intrinsic functions are used: float _rcpsp(float src) computes the approximate
32-bit floating point reciprocal and float _rsqrsp(float src) computes the approximate 32-bit floating point
reciprocal square root, so that _rcpsp(_rsqrsp(v)) approximates the square root of v.

Linear Assembly Linear assembly code is a cross between assembly and C. It uses the syntax of
assembly code instructions such as ADD, SUB and MPY, but with operands written as variable names, as in C,
rather than as specific registers. The above-mentioned C code example can be written using linear
assembly as shown in Program 15.15.

Program 15.15 C Code for linear assembly

#include<math.h>
#include<stdio.h>
extern float err(float,float,float,float);
main( )
{
float x1,x2,y1,y2,e;
e = err(x1,y1,x2,y2);
}
In Program 15.15, err() is a function called from the C code; it is written in linear assembly and saved
as the file err.sa. The linear assembly code for the function err() is given in Program 15.16.

Program 15.16 Linear asm Code


.def _err
_err: .cproc zc,zcs,msf,msfs
.reg x,y,z,w,d1,d2,r;
 mv zc,x ; x = first argument (x1)
 mv msf,y ; y = third argument (x2)
 mv zcs,z; z = second argument (y1)
 mv msfs,w; w = fourth argument (y2)
subsp x,y,d1;
mpysp d1,d1,y;
subsp z,w,d2;
mpysp d2,d2,w;
addsp y,w,x;
rsqrsp x,y;
rcpsp y,r;
.return r
.endproc

Inline Assembly Inline assembly code can be embedded in a C program using the asm statement,
for example asm(" MVK 0x0040,B6"). The above-mentioned C code example can be written using
inline assembly as shown in Program 15.17.

Program 15.17 C Code with Inline assembly


#include<math.h>
#include<stdio.h>
main( )
{
float x1,x2,y1,y2,dx,dy,e;
dx = x1-x2;
dy = y1-y2;
asm(" mpysp .m1 a4,a4,a6");
asm(" NOP 5");
asm(" mpysp .m2 b4,b4,b6");
asm(" NOP 5");
asm(" addsp .l1x a6,b6,a5");
asm(" NOP 5");
asm(" rsqrsp .s1 a5,a6");
asm(" NOP 5");
asm(" rcpsp .s1 a6,a4");
asm(" NOP 5");
e=getans();
asm(" NOP 5");
}
The number of clock cycles required for the computation of the Euclidean distance using the different
approaches of invoking assembly from C code is reported in Table 15.1. It may be observed from
Table 15.1 that the pure C source code takes the longest execution time, followed by the inline assembly,
callable assembly, intrinsic function and linear assembly based approaches.
Table 15.1 Number of clock cycles required for computation of Euclidean distance
Approach              No. of clock cycles   Accel. factor
C code                1378                  -
Inline assembly       466                   2.957
Callable assembly     411                   3.353
Intrinsic functions   401                   3.436
Linear assembly       371                   3.714
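
The acceleration factor in Table 15.1 is the cycle count of the pure C code divided by the cycle count of the approach in question. A one-line C helper makes the arithmetic explicit; the function name accel_factor is an illustrative assumption.

```c
/* Acceleration factor of an approach relative to the pure C baseline:
 * baseline cycle count divided by the cycle count of the approach. */
double accel_factor(unsigned baseline_cycles, unsigned cycles)
{
    return (double)baseline_cycles / (double)cycles;
}
```

For example, accel_factor(1378, 371) is approximately 3.714, the value listed for linear assembly.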

INTERNAL MEMORY 15.4


The internal memory configuration varies between the different 'C6X processors. The TMS320C620X/
TMS320C670X family processors have separate on-chip program and data memories. The internal
program memory can be accessed directly by the CPU or it can be operated as a program cache. The size of
the internal program memory is 64K bytes of RAM, which can accommodate 16K 32-bit instructions. The
CPU accesses this program memory space through the program memory controller. The program memory
controller services CPU and DMA (Direct Memory Access) requests to the internal program memory,
passes CPU requests to external memory through the external memory interface (EMIF) and also
manages the internal program memory when it is configured as cache.
The size of the internal data memory is 64K bytes of RAM. Both the CPU and the DMA controller can access
this data memory space through the data memory controller. The data memory controller connects the CPU to
external memory and to the on-chip peripherals through the EMIF and the peripheral bus controller respectively.
The 'C6202 processor has 2 x 128K bytes of internal program memory blocks, out of which one 128K-byte
block can be used as program cache.
The 'C621X/'C671X family processors have cache-based internal memory architectures. They are
provided with a two-level memory architecture for the internal program and data buses. The first level of
internal memory consists of a separate level-one program cache (L1P) and level-one data cache (L1D), each
of size 4K bytes. The program and data cache spaces are not included in the memory map and are enabled at all
times. The level-one cache memories are accessible only by the CPU.
The program cache controller interfaces the CPU to the L1P cache. A 256-bit wide path is provided from
the L1P cache to the CPU to allow a continuous stream of eight 32-bit instructions for maximum performance.
The 4K-byte L1P cache is organized as a 64-line direct mapped cache with a 64-byte line size.
The data cache controller provides the interface between the CPU and the L1D cache. The L1D is a dual-
ported memory, which allows simultaneous access by both paths of the CPU (paths A and B). The 4K-byte
L1D cache is organized as a 64-set, 2-way set associative cache with a 32-byte line size. The second level of
internal memory is 64K bytes of RAM that is shared by the program and data memory spaces and is managed
by the L2 cache controller. The internal memories and bus connections between the CPU and the various
controllers are shown in Fig. 15.7.
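
The cache organizations stated above fix how an address is divided into tag, index and byte offset: the L1P has 64 lines of 64 bytes (6 offset bits, 6 index bits) and the L1D has 64 sets of 32-byte lines (5 offset bits, 6 set-index bits). The following C sketch shows this standard decomposition. The structure and function names are illustrative assumptions, and the sketch models only the address arithmetic, not the behaviour of the cache controllers.

```c
#include <stdint.h>

/* Fields of an address as seen by a cache with 2^offset_bits bytes per line
 * and 2^index_bits lines (direct mapped) or sets (set associative). */
struct cache_addr {
    uint32_t tag;
    uint32_t index;
    uint32_t offset;
};

struct cache_addr split_address(uint32_t addr, unsigned offset_bits,
                                unsigned index_bits)
{
    struct cache_addr f;
    f.offset = addr & ((1u << offset_bits) - 1u);
    f.index  = (addr >> offset_bits) & ((1u << index_bits) - 1u);
    f.tag    = addr >> (offset_bits + index_bits);
    return f;
}

/* L1P: 64 lines x 64 bytes -> 6 offset bits, 6 index bits. */
struct cache_addr l1p_fields(uint32_t addr)
{
    return split_address(addr, 6, 6);
}

/* L1D: 64 sets x 2 ways x 32 bytes -> 5 offset bits, 6 set-index bits. */
struct cache_addr l1d_fields(uint32_t addr)
{
    return split_address(addr, 5, 6);
}
```

In the direct mapped L1P, two program addresses that are a multiple of 4K bytes apart have the same index and therefore compete for the same line, whereas the 2-way L1D can hold two such lines in one set at the same time.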
The L1P and L1D caches are accessed first; on a miss in either L1D or L1P, the request is passed to the
L2 controller. The L2 controller facilitates the following accesses:
• CPU and enhanced direct memory access (EDMA) controller accesses to the internal
memory, with the necessary arbitration
• CPU data accesses to the EMIF
• CPU accesses to on-chip peripherals
• Requests to the EMIF on an L2 data miss

Fig. 15.7 Internal memory block diagram of 'C6X processors
Note: i) For 'C67X processors, the LD1 and LD2 data bus size is 64 bits.
ii) For 'C64X processors, the LD1, LD2, ST1 and ST2 data bus size is 64 bits.
When a request reaches L2, the way it is serviced depends on the operation mode of L2, which is set in the
Cache Configuration Register (CCFG). This is a memory-mapped register whose address
is 0184 0000h. The format of the CCFG is shown in Fig. 15.8, and the various L2 modes are shown in
Table 15.2. The L2 memory is organized as four 64-bit wide banks.

Fig. 15.8 Format of Cache Configuration Register (CCFG)


Table 15.2 Cache configuration register field description
Field Description
L2 MODE L2 Operation modes
000b – 64 K bytes RAM
001b – 16 K bytes 1-way cache/48 K bytes mapped RAM
010b – 32 K bytes 2-way cache/32 K bytes mapped RAM
011b – 48 K bytes 3-way cache/16 K bytes mapped RAM
111b – 64 K bytes 4-way cache
ID Invalidate L1D
ID =0 - normal L1D operation, ID = 1 – All L1D lines invalidated
IP Invalidate L1P
IP =0 - normal L1P operation, IP = 1 – All L1P lines invalidated
P L2 Requestor Priority
P=0, CPU accesses prioritized over enhanced DMA accesses
P=1, Enhanced DMA accesses prioritized over CPU accesses
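
The L2 mode encoding of Table 15.2 can be captured in a small lookup. The C sketch below uses only the mode values listed in the table; the function names are illustrative assumptions, and the position of the L2 MODE field within the CCFG (given in Fig. 15.8) is deliberately not modelled.

```c
/* L2 cache size in K bytes for a given L2 MODE value (Table 15.2).
 * Returns -1 for mode values not listed in the table. */
int l2_cache_kbytes(unsigned l2mode)
{
    switch (l2mode) {
    case 0x0: return 0;    /* 000b: all 64K bytes are RAM */
    case 0x1: return 16;   /* 001b: 1-way cache */
    case 0x2: return 32;   /* 010b: 2-way cache */
    case 0x3: return 48;   /* 011b: 3-way cache */
    case 0x7: return 64;   /* 111b: 4-way cache, no mapped RAM */
    default:  return -1;   /* not listed in Table 15.2 */
    }
}

/* Part of L2 that remains mapped RAM, in K bytes, or -1 if unlisted. */
int l2_ram_kbytes(unsigned l2mode)
{
    int cache = l2_cache_kbytes(l2mode);
    return (cache < 0) ? -1 : 64 - cache;
}
```

In every listed mode the cache and mapped RAM portions add up to the 64K bytes of L2 memory, and each additional cache way takes 16K bytes from the mapped RAM.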

EXTERNAL MEMORY 15.5


On an L2 data miss, the L2 controller sends a request to the external memory interface (EMIF). The
memory attribute registers (MARs) can be programmed to turn on caching for each of the external chip
enable (CE) spaces. When caching is turned off for a range, single-word reads of externally mapped
devices can be performed; without this feature, any external read would always fetch an entire L2 line
of data. Each of the four CE spaces is divided into four ranges, each of which is controlled by the
least significant bit of one MAR. If that bit is set, the corresponding address range is cached by L2.
At reset, the MARs are cleared to 0, so the appropriate MAR must be set to 1 before data in its range
can be cached in L2. The MARs define cacheability for the EMIF only. Addresses accessed by the EMIF
which are not defined by an MAR are always cacheable. Table 15.3 lists the MARs and the CE space
address range that each one controls.
All the memory space base address registers, word count registers and the sixteen memory attribute
registers (MAR0 to MAR15) are memory-mapped registers located between 0184 0000h and 0184 82CCh.
The appropriate registers must be initialized before the memory is accessed.
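
The mapping in Table 15.3 is regular: each CE space starts on a 1000 0000h boundary beginning at 8000 0000h, and each MAR covers a 16 M byte range within it. As a sketch under those assumptions, the C function below finds the MAR number for an external address; the macro and function names are hypothetical, and the array passed to `mar_enable_caching` merely stands in for the real registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CE0_BASE      0x80000000u  /* first address covered by MAR0 (Table 15.3) */
#define CE_SPACE_STEP 0x10000000u  /* distance between CE space base addresses */
#define MAR_RANGE     0x01000000u  /* 16 M bytes controlled by one MAR */
#define MARS_PER_CE   4u
#define NUM_CE_SPACES 4u

/* Return the MAR number (0..15) that controls addr, or -1 if no MAR does. */
int mar_index_for_address(uint32_t addr)
{
    if (addr < CE0_BASE) {
        return -1;
    }
    uint32_t offset = addr - CE0_BASE;
    uint32_t ce     = offset / CE_SPACE_STEP;                /* CE space number */
    uint32_t range  = (offset % CE_SPACE_STEP) / MAR_RANGE;  /* range inside it */

    if (ce >= NUM_CE_SPACES || range >= MARS_PER_CE) {
        return -1;  /* outside the ranges listed in Table 15.3 */
    }
    return (int)(ce * MARS_PER_CE + range);
}

/* Set the least significant bit of the MAR that controls addr.
 * mars[] is a software stand-in for MAR0..MAR15, not the hardware registers. */
int mar_enable_caching(uint32_t mars[16], uint32_t addr)
{
    assert(mars != NULL);
    int mar = mar_index_for_address(addr);
    if (mar < 0) {
        return -1;
    }
    mars[mar] |= 1u;
    return mar;
}
```

For instance, address 9123 4567h falls in the second range of CE1 and is therefore controlled by MAR5.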

ON-CHIP PERIPHERALS 15.6


The programmable on-chip peripherals of the ¢C6X processors are listed below.
∑ Two 32-bit timers
∑ Two multichannel buffered serial ports (McBSPs)
∑ Direct memory access (DMA) controller or enhanced direct memory access (EDMA) controller
∑ External memory interface (EMIF)
∑ Host-port interface (HPI)
∑ Boot configuration
∑ Interrupt selector
∑ Expansion bus
∑ Power-down logic
All ¢C6X processors have two McBSPs, except the ¢C6202, which has three. The ¢C620X/¢C670X
family processors have DMA controllers, whereas the ¢C621X/¢C671X processors have EDMA
controllers. The expansion bus is available only in the ¢C6202 processor, which has no HPI. All
other peripherals are available in all ¢C6X processors. The peripherals are configured through a set
of memory-mapped control registers, and the peripheral bus controller arbitrates accesses to the
on-chip peripherals. The boot configuration is set through external signals only, and the power-down
logic is accessed directly by the CPU. The block diagram of the ¢C6X processor with all on-chip
peripherals is shown in Fig. 15.9.

Table 15.3 MAR registers and their corresponding CE space address ranges

MAR   Address range enabled       CE space
15    B300 0000h – B3FF FFFFh     CE3
14    B200 0000h – B2FF FFFFh     CE3
13    B100 0000h – B1FF FFFFh     CE3
12    B000 0000h – B0FF FFFFh     CE3
11    A300 0000h – A3FF FFFFh     CE2
10    A200 0000h – A2FF FFFFh     CE2
9     A100 0000h – A1FF FFFFh     CE2
8     A000 0000h – A0FF FFFFh     CE2
7     9300 0000h – 93FF FFFFh     CE1
6     9200 0000h – 92FF FFFFh     CE1
5     9100 0000h – 91FF FFFFh     CE1
4     9000 0000h – 90FF FFFFh     CE1
3     8300 0000h – 83FF FFFFh     CE0
2     8200 0000h – 82FF FFFFh     CE0
1     8100 0000h – 81FF FFFFh     CE0
0     8000 0000h – 80FF FFFFh     CE0

15.6.1 Timers
The ¢C6X devices have two 32-bit general-purpose timers that can be used to time events, count events,
generate pulses, interrupt the CPU and send synchronization events to the DMA controller. Timer
operation is configured through three memory-mapped registers: the timer control register, the timer
period register and the timer counter register. The block diagram of the ¢C6X on-chip timer is given in
Fig. 15.10. The timer control register (TCR) is programmed to select the mode of operation of the timer,
the timer period register contains the number of timer input clock cycles to count, and the timer
counter register increments while it is enabled to count. The timer counter register resets to 0 when
the count reaches the value in the period register.
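
The counting behaviour described above can be illustrated with a small software model. This is only a sketch: the structure and function names are hypothetical, the fields stand in for the period and counter registers, and the control-register bits are reduced to a single `enabled` flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for one timer (not the hardware register layout). */
typedef struct {
    uint32_t period;   /* models the timer period register */
    uint32_t counter;  /* models the timer counter register */
    int      enabled;  /* nonzero while the timer is allowed to count */
} timer_model_t;

/* Advance the model by one input clock cycle.
 * Returns 1 when the counter reaches the period value and resets to 0. */
int timer_tick(timer_model_t *t)
{
    assert(t != NULL && t->period != 0);

    if (!t->enabled) {
        return 0;               /* a held timer does not count */
    }
    t->counter++;
    if (t->counter == t->period) {
        t->counter = 0;         /* counter resets on reaching the period */
        return 1;
    }
    return 0;
}

/* Run the model for a number of input cycles and count the period matches. */
uint32_t timer_run(timer_model_t *t, uint32_t cycles)
{
    uint32_t events = 0;
    for (uint32_t i = 0; i < cycles; i++) {
        events += (uint32_t)timer_tick(t);
    }
    return events;
}
```

With a period of 4, ten input cycles produce two period matches and leave the counter at 2, so a match occurs once every `period` input cycles.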

Fig. 15.9 TMS320C621X/ ¢C671X block diagram with on-chip peripherals

Fig. 15.10 TMS320C6X timer block diagram



The timer has two signaling modes, clock mode and pulse mode, which are selected by the C/P bit in the
TCR. The timer has an input pin, TINP, and an output pin, TOUT, which can serve as the timer clock
input and the timer output. These pins can also be configured as general-purpose input and output
pins, respectively, using the FUNC bit in the TCR. The timer can run from either the internal clock
signal from the CPU or an external clock; the clock source is selected by the CLKSRC bit in the TCR.
Starting, holding and resetting the timer are controlled by the GO and HLD bits in the TCR. The
frequencies of the timer output in clock mode and in pulse mode are given below.
f_clock = f(clock source) / (2 × timer period register)
f_pulse = f(clock source) / timer period register
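
The two formulas above translate directly into code. The sketch below uses hypothetical function names and takes the clock source frequency in hertz as a parameter, since the actual source frequency depends on the device and on the CLKSRC setting.

```c
#include <assert.h>
#include <stdint.h>

/* Output frequency in clock mode: the output toggles on every period match,
 * so one full output cycle takes two timer periods. */
double timer_clock_mode_hz(double source_hz, uint32_t period)
{
    assert(period != 0);
    return source_hz / (2.0 * (double)period);
}

/* Output frequency in pulse mode: one pulse is produced per timer period. */
double timer_pulse_mode_hz(double source_hz, uint32_t period)
{
    assert(period != 0);
    return source_hz / (double)period;
}
```

As a worked example, a 50 MHz clock source with a period register value of 25000 gives 1 kHz in clock mode and 2 kHz in pulse mode.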

15.6.2 Multichannel Buffered Serial Port (McBSP)


The multichannel buffered serial port (McBSP) is based on the standard serial port interface available
in earlier TI processors. The McBSP has the following features:
∑ Full-duplex communication
∑ Multichannel transmit and receive of up to 128 channels
∑ Selectable data sizes of 8, 12, 16, 20, 24 and 32 bits
∑ Independent framing and clocking for receive and transmit
∑ External shift clock or an internal programmable-frequency shift clock for data transfer
∑ 8-bit data transfer with the option of LSB or MSB first
∑ Programmable polarity for both frame synchronization and data clocks
∑ Double-buffered data registers, which allow continuous data transmission
∑ Autobuffering capability through the 5-channel DMA controller
∑ μ-law and A-law companding
∑ Direct interface to industry-standard codecs, A/D and D/A converters, analog interface chips,
T1/E1, MVIP, H.100 and SCSA framers, IOM-2, AC97 and IIS compliant devices, and SPI™ devices

Fig. 15.11 Multichannel buffered serial port (McBSP) block diagram

The McBSP consists of a data path and a control path, which connect it to external devices. The block
diagram of the McBSP is shown in Fig. 15.11. Each McBSP in the processor has thirteen registers; those
that are memory mapped are accessed via the 32-bit peripheral bus. The registers and their
memory-mapped addresses are listed in Table 15.4. The modes of operation of the McBSP are programmed
through the 32-bit serial port control register (SPCR).
Data is communicated through the data transmit (DX) and data receive (DR) pins, while clocking and
frame synchronization use the CLKX, CLKR, FSX and FSR pins. The CPU or the DMA controller writes the
data to be transmitted to the data transmit register (DXR); the transmit shift register (XSR) then
shifts the data from DXR out to the DX pin. In the same way, data received on the DR pin is shifted
into the receive shift register (RSR), copied into the receive buffer register (RBR) and then copied
to the data receive register (DRR), from which it is read by the CPU or the DMA controller.
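
The receive path (DR pin to RSR, then RBR, then DRR) can be pictured with a simplified software model. This is only an illustration: the structure and function names are hypothetical, bits are assumed to arrive MSB first, and the model copies RBR to DRR immediately without modelling the case where DRR has not yet been read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software stand-in for the McBSP receive registers (not the hardware layout). */
typedef struct {
    uint32_t rsr;        /* models the receive shift register */
    uint32_t rbr;        /* models the receive buffer register */
    uint32_t drr;        /* models the data receive register */
    unsigned bits_in;    /* bits shifted into RSR for the current element */
    unsigned word_bits;  /* element size: 8, 12, 16, 20, 24 or 32 */
    int      drr_full;   /* set when a complete element has reached DRR */
} mcbsp_rx_model_t;

/* Shift one bit sampled on the DR pin into RSR, MSB first. */
void mcbsp_rx_shift(mcbsp_rx_model_t *m, unsigned bit)
{
    assert(m != NULL && m->word_bits >= 1 && m->word_bits <= 32);

    m->rsr = (m->rsr << 1) | (bit & 1u);
    m->bits_in++;
    if (m->bits_in == m->word_bits) {
        m->rbr = m->rsr;     /* RSR is copied into RBR */
        m->drr = m->rbr;     /* RBR is then copied to DRR */
        m->drr_full = 1;
        m->rsr = 0;
        m->bits_in = 0;
    }
}

/* Feed one whole element into the model and read it back from DRR,
 * as the CPU or DMA controller would. */
uint32_t mcbsp_rx_element(mcbsp_rx_model_t *m, uint32_t value)
{
    assert(m != NULL);
    for (unsigned i = m->word_bits; i-- > 0; ) {
        mcbsp_rx_shift(m, (value >> i) & 1u);
    }
    m->drr_full = 0;         /* reading DRR empties it */
    return m->drr;
}
```

The CPU never touches `rsr` or `rbr` in this model, which mirrors the fact that only DRR is readable on the real device.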

Table 15.4 McBSP memory mapped registers

Memory mapped register address (McBSP2 in ¢C6202 only)
McBSP0       McBSP1       McBSP2       Abbreviation   Register name
-            -            -            RBR            Receive buffer register
-            -            -            RSR            Receive shift register
-            -            -            XSR            Transmit shift register
018C 0000    0190 0000    01A4 0000    DRR            Data receive register
018C 0004    0190 0004    01A4 0004    DXR            Data transmit register
018C 0008    0190 0008    01A4 0008    SPCR           Serial port control register
018C 000C    0190 000C    01A4 000C    RCR            Receive control register
018C 0010    0190 0010    01A4 0010    XCR            Transmit control register
018C 0014    0190 0014    01A4 0014    SRGR           Sample rate generator register
018C 0018    0190 0018    01A4 0018    MCR            Multichannel control register
018C 001C    0190 001C    01A4 001C    RCER           Receive channel enable register
018C 0020    0190 0020    01A4 0020    XCER           Transmit channel enable register
018C 0024    0190 0024    01A4 0024    PCR            Pin control register
Note: The RBR, RSR and XSR registers are not directly accessible by the CPU or the DMA controller.
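
Because every port in Table 15.4 uses the same register offsets, a register address can be formed as base plus offset. The following C sketch encodes the table that way; the identifiers are hypothetical and are not taken from any TI header file.

```c
#include <assert.h>
#include <stdint.h>

/* Base address of each port's register block, from Table 15.4.
 * Index 2 (McBSP2) exists in the C6202 only. */
static const uint32_t MCBSP_BASE[3] = {
    0x018C0000u, 0x01900000u, 0x01A40000u
};

/* Offsets of the memory-mapped registers from the port base address. */
typedef enum {
    MCBSP_DRR  = 0x00,  /* data receive register */
    MCBSP_DXR  = 0x04,  /* data transmit register */
    MCBSP_SPCR = 0x08,  /* serial port control register */
    MCBSP_RCR  = 0x0C,  /* receive control register */
    MCBSP_XCR  = 0x10,  /* transmit control register */
    MCBSP_SRGR = 0x14,  /* sample rate generator register */
    MCBSP_MCR  = 0x18,  /* multichannel control register */
    MCBSP_RCER = 0x1C,  /* receive channel enable register */
    MCBSP_XCER = 0x20,  /* transmit channel enable register */
    MCBSP_PCR  = 0x24   /* pin control register */
} mcbsp_reg_t;

/* Return the memory-mapped address of a register of McBSP0, 1 or 2. */
uint32_t mcbsp_reg_addr(unsigned port, mcbsp_reg_t reg)
{
    assert(port < 3);
    assert(((unsigned)reg & 3u) == 0 && (unsigned)reg <= (unsigned)MCBSP_PCR);
    return MCBSP_BASE[port] + (uint32_t)reg;
}
```

On the target device the returned value would typically be cast to a `volatile` 32-bit pointer before the register is read or written; on a host machine the function is useful only for checking addresses against the table.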

15.6.3 DMA/EDMA Controller


The direct memory access (DMA) controller is available in ¢C620X/¢C670X devices. It transfers data
between regions in the memory map without affecting the operation of the CPU, so that data can be
moved to and from internal memory, internal peripherals or external devices in the background of CPU
operation. The DMA controller has four independent programmable channels, allowing four different DMA
operations, and a fifth auxiliary channel that services requests from the host-port interface (HPI).
The DMA controller can access the following regions in the memory map.
∑ On-chip data memory
∑ On-chip program memory, if it is mapped into memory space rather than being used as cache
∑ On-chip peripherals
∑ External memory via EMIF
∑ Expansion memory via expansion bus
The enhanced DMA (EDMA) controller is available in ¢C621X/¢C671X devices. The EDMA controller
performs block transfers of data to and from internal memory, services transfer requests from
peripherals and moves data between external memory spaces, all in parallel with CPU-intensive
operations. Compared with the DMA controller, the EDMA controller provides 16 channels with
programmable priority and the ability to link data transfers. The EDMA operations are controlled by
eight memory-mapped EDMA control registers.

15.6.4 External Memory Interface (EMIF)


The external memory interface (EMIF) of ¢C6X devices supports a glueless interface to a variety of
external devices. It can interface to synchronous as well as asynchronous devices such as
SRAM, DRAM, ROM, FIFOs, FPGAs and external shared-memory devices. The ¢C620X/¢C670X
EMIF services requests for the external bus from four requesters:
∑ The on-chip program memory controller that services CPU program fetches
∑ The on-chip data memory controller that services CPU data fetches
∑ The on-chip DMA controller
∑ An external shared-memory device controller
If multiple requests arrive at the same time, the EMIF prioritizes them and performs the necessary
number of operations. The ¢C621X/¢C671X EMIF services requests for the external bus from two
requesters:
∑ An enhanced DMA controller
∑ An external shared-memory device controller

15.6.5 Host-Port Interface (HPI)


The host-port interface (HPI) is a 16-bit-wide parallel port through which a host processor can
directly access the CPU’s memory space. The host device functions as the master of the interface,
which makes access easier for the host. The host and the CPU can exchange information via internal or
external memory, and the host also has direct access to the memory-mapped peripherals. Connectivity to
the CPU’s memory space is provided through the DMA controller. Both the host and the CPU can access
the HPI control register (HPIC). The host can access the HPI address register (HPIA), the HPI data
register (HPID) and the HPIC using the external data interface control signals.

15.6.6 Boot Configuration


The ¢C6X devices use a variety of boot configurations to determine what actions the device performs
after reset. Each ¢C6X device has some or all of the following boot configuration options:
∑ Selection of the memory map, which determines whether internal or external memory is mapped at
address zero
∑ Selection of the type of external memory mapped at address zero, if the external memory map is
selected
∑ Selection of the boot process used to initialize the memory at address zero before the CPU is
released from reset
The external pins BOOTMODE[4:0] are used to select the boot configuration. The values of
BOOTMODE are latched during the low period of RESET.

15.6.7 Interrupt Selector


The ¢C6X peripheral set has up to 32 interrupt sources, but the CPU has only 12 interrupts available
for use. The interrupt selector allows the user to choose and prioritize 12 of the 32 sources to suit
the needs of the system. The interrupt selector also allows the polarity of the external interrupt
inputs to be changed. RESET and NMI are the non-maskable interrupts; the other CPU interrupts are
maskable. The maskable interrupts are enabled globally by setting the global interrupt enable bit (GIE)
in the control status register (CSR) to 1 and are masked by clearing it to 0. To enable an individual
interrupt, the respective bit in the interrupt enable register (IER) is set to 1. When the
corresponding interrupt occurs, its bit in the interrupt flag register (IFR) is set and the CPU starts
processing the interrupt.
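
The conditions under which a maskable interrupt is processed can be summarized in a few lines of C. This is a simplified sketch rather than a description of the hardware: the structure and function names are hypothetical, GIE is assumed to be bit 0 of the CSR, the 12 maskable interrupts are assumed to be numbered 4 to 15 with one bit each in IER and IFR, and the separate enable for NMI is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CSR_GIE 0x1u   /* assumed position of the GIE bit in the CSR */

/* Software stand-in for the three interrupt-related control registers. */
typedef struct {
    uint32_t csr;  /* control status register */
    uint32_t ier;  /* interrupt enable register */
    uint32_t ifr;  /* interrupt flag register */
} cpu_irq_model_t;

/* Return 1 if maskable interrupt n (assumed range 4..15) would be processed:
 * GIE must be 1, the interrupt must be enabled in IER and flagged in IFR. */
int irq_is_taken(const cpu_irq_model_t *c, unsigned n)
{
    assert(c != NULL && n >= 4 && n <= 15);

    uint32_t mask = 1u << n;
    return (c->csr & CSR_GIE) != 0 &&
           (c->ier & mask)    != 0 &&
           (c->ifr & mask)    != 0;
}
```

In this model a pending and individually enabled interrupt is still ignored while GIE is 0, and it is taken as soon as GIE is set to 1.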

15.6.8 Expansion Bus


The expansion bus is available only in the ¢C6202 processor. It is a 32-bit-wide bus used to
interface to different types of asynchronous peripherals, asynchronous and synchronous FIFOs, PCI
bridge chips and other external masters. The expansion bus offers a flexible bus arbitration scheme.

15.6.9 Power-down Logic


In CMOS logic circuits, power dissipation can be reduced by decreasing the amount of switching from
one logic state to another. By preventing some or all of the chip’s logic from switching, significant
power can be saved without losing data or operational context. Three power-down modes, PD1, PD2 and
PD3, are available for this purpose. The PD1 mode blocks the internal clock inputs at the boundary of
the CPU, preventing most of its logic from switching; PD1 effectively shuts down the CPU. The PD2
mode halts the entire on-chip clock structure at the output of the PLL. The PD3 mode is like PD2
but also disconnects the external clock source (CLKIN) from the PLL. In addition to these power-down
modes, the IDLE instruction lowers CPU power consumption by executing continuous NOPs. The IDLE
instruction terminates only upon servicing an interrupt.

Review Questions
15.1 List the steps to do programming in the ¢C6X tool.
15.2 What are the basic features of the ¢C6416 starter kit?
15.3 Explain the memory resources available in the ¢C6416 DSK.
15.4 What are the steps involved in ¢C6X code generation using the CCS tool?
15.5 Which instruction is used for division? How?
15.6 Explain the internal memory details of ¢C6X processors.
15.7 For what operations is the L2 controller used?
15.8 List the on-chip peripherals in ¢C6X processors.
15.9 Explain the operation of the ¢C6X timer.
15.10 What are the features of the McBSP?
15.11 For what interfaces is the McBSP used?
15.12 List the signals used for clocking and frame synchronization of the ¢C6X McBSP.
15.13 Which regions of the memory map can the ¢C6X DMA controller access?
15.14 Explain the uses of the EMIF.
15.15 What is the use of the interrupt selector?
15.16 Why is power-down logic needed? Explain the ¢C6X power-down modes.

Self Test Questions


15.1 The operating frequency of ¢C6416 starter kit is _____
(a) 200 MHz (b) 720 MHz (c) 1 GHz (d) 800 MHz
15.2 The size of on-chip RAM in ¢C6416T processor is ____
(a) 64 K words (b) 16 K words (c) 1024 K bytes (d) 64 K bytes
15.3 The Max. operating frequency of ¢C6416T processor is ____
(a) 200 MHz (b) 720 MHz (c) 1 GHz (d) 800 MHz
15.4 The size of external DRAM in ¢C6416 starter kit is _____
(a) 1024 K bytes (b) 512 K bytes (c) 8 M bytes (d) 16 M bytes
15.5 The size of flash memory in ¢C6416 starter kit is _____
(a) 1024 K bytes (b) 512 K bytes (c) 8 M bytes (d) 16 M bytes
15.6 The name of extension given for a project in CCS is ___
(a) .pjt (b) .mak (c) .asm (d) .out
15.7 The file extension name an assembly language file should have is ___
(a) .pjt (b) .mak (c) .asm (d) .out
15.8 The executable file name extension for a project is _____
(a) .pjt (b) .mak (c) .asm (d) .out
15.9 The starting memory address of ¢C6X where the code is downloaded is _____
(a) 0x0000 0000h (b) 0x0000 0200h (c) 0x0000 0020h (d) 0x0000 F000h
15.10 The instruction used to perform division in ¢C6X processor is ___
(a) ADDH (b) SUB (c) SUBC (d) ADDU
15.11 The ¢C6X instruction used for the left most bit detection is ____
(a) LDB (b) LDHU (c) LMBD (d) LDH
15.12 ____ instruction is used to perform convolution in ¢C6X processor.
(a) MPY & ADD (b) MPY & SUB (c) MAC (d) MACD
15.13 The size of on-chip memory in all ¢C6X processors except ¢C6202 is ___
(a) 64 K words (b) 16 K words (c) 1024 bytes (d) 64 K bytes
15.14 The size of on-chip memory in ¢C6202 processor is ____
(a) 64 K bytes (b) 128 K bytes (c) 256 K bytes (d) 64 K words
15.15 The size of program & data cache in ¢C6X processor is ___
(a) 2 K bytes (b) 2 K words (c) 4 K bytes (d) 4 K words
15.16 The number of external memory space in ¢C6X processor is ___
(a) 2 (b) 5 (c) 4 (d) 3
15.17 The no. of McBSP in ¢C6202 processor is _____
(a) 2 (b) 5 (c) 4 (d) 3
15.18 The expansion bus is available in ____ processor
(a) ¢C6201 (b) ¢C6202 (c) ¢C6211 (d) ¢C6711
15.19 The ¢C6X processor without HPI is ____
(a) ¢C6201 (b) ¢C6202 (c) ¢C6211 (d) ¢C6711
15.20 The no. of on-chip timers in ¢C6X processors is ____
(a) 2 (b) 5 (c) 4 (d) 3
15.21 The no. of channels the McBSP can transmit and receive is ___
(a) 64 (b) 128 (c) 32 (d) 200
15.22 The no. of DMA channels in ¢C6X processor is __
(a) 2 (b) 5 (c) 4 (d) 3
15.23 The no. of EDMA channels in ¢C6X processor is ___
(a) 4 (b) 8 (c) 16 (d) 13
15.24 The Max. no. of interrupt sources present in ¢C6X processor is ___
(a) 12 (b) 13 (c) 32 (d) 28
15.25 The no. of interrupt sources the CPU can use in ¢C6X processor is ____
(a) 12 (b) 13 (c) 32 (d) 28
15.26 The no. of power-down logic modes in ¢C6X processor is ____
(a) 4 (b) 2 (c) 5 (d) 3
