Class: T.E. E &TC Subject: DSP Expt. No.: Date: Title: Implementation of Convolution Using DSP Processor Objective
SUBJECT: DSP
EXPT. NO.:
DATE:
TITLE: Implementation of Convolution using DSP Processor
OBJECTIVE:
THEORY:
Difference in DSP Processor and General Purpose Processor
DSP Processor:
Digital Signal Processors follow the Harvard architecture, in which data memory and
program memory have separate data and address buses. DSPs have two multipliers that
enable two multiply-accumulate operations per instruction cycle.
General Purpose Processor:
... single instruction. Mostly, a single multiplier and accumulator unit is available.
P:F:-LTL-UG/03/R1
DSP
INTRODUCTION
Texas Instruments introduced the first-generation TMS32010 DSP in 1982, the
TMS320C25 in 1986 [1], and the TMS320C50 in 1991. Several versions of each of these
processors (C1x, C2x, and C5x) are available with different features, such as faster
execution speed. These 16-bit processors are all fixed-point processors and are code-compatible.
In the von Neumann architecture, program instructions and data are stored in a
single memory space. A processor with von Neumann architecture can make only a single
read or write to memory during each instruction cycle, whereas typical DSP applications
require several accesses to memory within one instruction cycle. The fixed-point processors
C1x, C2x, and C5x are based on a modified Harvard architecture with separate
memory spaces for data and instructions that allow concurrent accesses.
Quantization error, or round-off noise, from an ADC is a concern with a fixed-point
processor. An ADC uses only a best-estimate digital value to represent an input.
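As a rough illustration of the quantization error described above, the following Python sketch models an ideal uniform ADC. The function name `quantize`, the 3-bit resolution, and the 0 to 1 V input range are assumptions chosen only for illustration; they do not describe any particular converter.

```python
def quantize(voltage, bits=3, v_min=0.0, v_max=1.0):
    """Hypothetical ideal ADC: map a voltage to the nearest of 2**bits levels.

    Returns (code, reconstructed_voltage).
    """
    levels = 2 ** bits
    step = (v_max - v_min) / (levels - 1)       # size of one quantization step
    clipped = min(max(voltage, v_min), v_max)   # out-of-range inputs saturate
    code = round((clipped - v_min) / step)      # best-estimate digital value
    return code, v_min + code * step


for v in (0.10, 0.33, 0.50, 0.77):
    code, approx = quantize(v)
    print(f"input={v:.2f} V  code={code}  error={approx - v:+.4f} V")
```

For inputs inside the range, the error never exceeds half of one quantization step, which is why adding bits reduces the round-off noise.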
A C6x processor can be used as a standard general-purpose DSP programmed for a
specific application. Special-purpose digital signal processors include modems, echo
cancellers, and others. A fixed-point processor is better for battery-powered devices,
such as cellular phones, since it uses less power than an equivalent floating-point
processor. The fixed-point processors C1x, C2x, and C5x are 16-bit processors
with limited dynamic range and precision.
The C6x fixed-point processor is a 32-bit processor with improved dynamic
range and precision. In a fixed-point processor, it is necessary to scale the data.
Overflow, which occurs when an operation such as the addition of two numbers
produces a result with more bits than can fit within a processor's register, becomes a
concern.
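The overflow and scaling issues above can be sketched in Python by emulating a 16-bit register. The helper names (`wrap16`, `saturate16`, `to_q15`, `q15_mul`) and the choice of the Q15 fractional format are assumptions made for this sketch; the text itself only states that data must be scaled.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap16(value):
    """Keep only the low 16 bits, as a two's-complement register would."""
    return (value + 32768) % 65536 - 32768


def saturate16(value):
    """Clamp to the 16-bit range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, value))


def to_q15(x):
    """Scale a real number in [-1, 1) to a 16-bit Q15 integer (assumed format)."""
    return saturate16(round(x * 32768))


def q15_mul(a, b):
    """Multiply two Q15 values and shift the wide product back to Q15."""
    return saturate16((a * b) >> 15)


total = 30000 + 10000                     # needs more than 16 bits
print(wrap16(total), saturate16(total))   # wrapped versus saturated result
half = to_q15(0.5)
print(q15_mul(half, half) / 32768)        # 0.5 * 0.5 in scaled arithmetic
```

Wrapping turns a large positive sum into a negative number, while saturation limits the damage; scaling inputs into a fractional range keeps products from overflowing in the first place.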
TMS320C6x ARCHITECTURE
The TMS320C6713 onboard the DSK is a floating-point processor based on the
VLIW architecture.
Features of the Board
Independent memory banks on the C6x allow for two memory accesses
within one instruction cycle.
Two independent memory banks can be accessed using two independent
buses. Since internal memory is organized into memory banks, two loads or
two stores of instructions can be performed in parallel.
No conflict results if the data accessed are in different memory banks. Separate buses
for program, data, and direct memory access (DMA) allow the C6x to perform
concurrent program fetches, data read and write, and DMA operations. With data
and instructions residing in separate memory spaces, concurrent memory accesses
are possible.
The C6x has a byte-addressable memory space. Internal memory is organized
as separate program and data memory spaces, with two 32-bit internal ports
(two 64-bit ports with the C64x) to access internal memory.
Figure shows
With the DSK operating at 225 MHz, one can ideally achieve two multiplies
and accumulates per cycle, for a total of 450 million multiplies and accumulates
(MACs) per second.
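Since this experiment concerns convolution, it may help to see why the MAC rate matters. The Python sketch below computes a direct linear convolution and counts the multiply-accumulate operations it performs. The function name `convolve` and the sample sequences are illustrative assumptions, and the time estimate is an idealized lower bound that ignores memory access and loop overhead.

```python
def convolve(x, h):
    """Direct linear convolution y[n] = sum over k of h[k] * x[n - k].

    Returns (y, mac_count).
    """
    y = [0] * (len(x) + len(h) - 1)
    macs = 0
    for n in range(len(y)):
        acc = 0
        for k in range(len(h)):
            if 0 <= n - k < len(x):
                acc += h[k] * x[n - k]   # one multiply-accumulate (MAC)
                macs += 1
        y[n] = acc
    return y, macs


PEAK_MACS_PER_SECOND = 450e6   # ideal figure quoted in the text for 225 MHz

y, macs = convolve([1, 2, 3, 4], [1, 1, 1])
print("y =", y)
print("MACs =", macs)
print("ideal time (s) =", macs / PEAK_MACS_PER_SECOND)
```

Convolving sequences of lengths N and M this way takes N x M MACs, so the achievable MAC rate sets a floor on the execution time.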
With six of the eight functional units in Figure 3.1 (not the .D units described
below) capable of handling floating-point operations, it is possible to perform 1350
million floating-point operations per second (MFLOPS). At 225 MHz, with all eight
functional units issuing one instruction per cycle, the peak rate is 1800 million
instructions per second (MIPS), with a 4.44-ns instruction cycle time.
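The peak figures quoted above follow directly from the clock rate and the number of units involved. The following Python sketch reproduces the arithmetic; the function name `peak_rates` and its parameter names are hypothetical, and the results are ideal upper bounds rather than measured performance.

```python
def peak_rates(clock_hz=225e6, multipliers=2, fp_units=6, total_units=8):
    """Ideal peak rates, assuming every unit issues one operation per cycle."""
    return {
        "cycle_time_ns": 1e9 / clock_hz,            # one instruction cycle
        "mmacs": multipliers * clock_hz / 1e6,      # two .M units
        "mflops": fp_units * clock_hz / 1e6,        # six floating-point units
        "mips": total_units * clock_hz / 1e6,       # all eight units
    }


for name, value in peak_rates().items():
    print(f"{name}: {value:.2f}")
```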
The CPU consists of eight independent functional units divided into two
data paths, A and B, as shown in Figure. Each path has a unit for multiply operations
(.M), for logical and arithmetic operations (.L), for branch, bit manipulation, and
arithmetic operations (.S), and for loading/storing and arithmetic operations (.D).
The .S and .L units are for arithmetic, logical, and branch instructions. All data
transfers make use of the .D units.
The arithmetic operations, such as subtract or add (SUB or ADD), can be
performed by all the units, except the .M units (one from each data path). The eight
functional units consist of four floating/fixed-point ALUs (two .L and two .S), two
fixed-point ALUs (.D units), and two floating/fixed-point multipliers (.M units).
Two cross-paths (1x and 2x) allow functional units from one data path to
access a 32-bit operand from the register file on the opposite side. There can be a
maximum of two cross-path source reads per cycle. There are
32 general-purpose registers, but some of them are reserved for specific addressing
or are used for conditional instructions.
FETCH AND EXECUTE PACKETS
The VelociTI architecture, introduced by TI, is derived from the VLIW
architecture.
An execute packet (EP) consists of a group of instructions that can be executed in
parallel within the same cycle time. The number of EPs within a fetch packet (FP)
can vary from one (with eight parallel instructions) to eight (with no parallel
instructions). The VLIW architecture was modified to allow more than one EP to be
included within an FP. The least significant bit of every 32-bit instruction (the p bit)
determines whether the next instruction belongs to the same EP (if 1) or
is part of the next EP (if 0).
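The grouping rule above can be sketched in Python as follows. The function name `split_into_execute_packets` and the made-up instruction words are assumptions for illustration only; the words are not real C6x opcodes, and only their least significant bit is meaningful here. Closing an unfinished packet at the end of the fetch packet is also an assumption of this sketch.

```python
def split_into_execute_packets(fetch_packet):
    """Group the 32-bit words of one fetch packet into execute packets.

    Bit 0 of each word is treated as the p bit: 1 means the next
    instruction runs in parallel (same EP), 0 means the EP ends here.
    """
    packets, current = [], []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:          # p bit clear: this instruction closes the EP
            packets.append(current)
            current = []
    if current:                    # assumption: an EP never crosses the FP boundary
        packets.append(current)
    return packets


# Hypothetical fetch packet: upper bits are just an index, bit 0 is the p bit.
p_bits = [1, 1, 0, 1, 0, 1, 1, 0]
fetch_packet = [(i << 1) | p for i, p in enumerate(p_bits)]

for ep in split_into_execute_packets(fetch_packet):
    print([f"{word:#06x}" for word in ep])
```

With these p bits the eight instructions fall into three execute packets of three, two, and three instructions.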
PIPELINING
Pipelining is a key feature of a DSP: it lets parallel instructions work
properly, but it requires careful timing. There are three stages of pipelining: program
fetch, decode, and execute.
FIGURE One FP with three EPs showing the p bit of each instruction.
1. The program fetch stage is composed of four phases:
(a) PG: program address generate (in the CPU) to fetch an address
(b) PS: program address send (to memory) to send the address
(c) PW: program address ready wait (memory read) to wait for data
(d) PR: program fetch packet receive (at the CPU) to read opcode from memory
2. The decode stage is composed of two phases:
(a) DP: to dispatch all the instructions within an FP to the appropriate functional
units
(b) DC: instruction decode
3. The execute stage is composed of 6 phases (with fixed point) to 10 phases
(with floating point) due to delays (latencies) associated with the following
instructions:
(a) Multiply instruction, which consists of two phases due to one delay
(b) Load instruction, which consists of five phases due to four delays
(c) Branch instruction, which consists of six phases due to five delays
Table shows the pipeline phases, and Table shows the pipelining effects.
The first row in Table 3.3 represents cycles 1, 2, ..., 12. Each subsequent row represents
an FP. The entries PG, PS, ... in each row show the phases associated with that
FP. The program address generate (PG) phase of the first FP starts in cycle 1, the PG
phase of the second FP starts in cycle 2, and so on. Each FP takes four phases for program
fetch and two phases for decoding. However, the execution stage can take from 1
to 10 phases (not all execution phases are shown in Table 3.3). We are assuming that
each FP contains one EP.
For example, at cycle 7, while the instructions in the first FP are in the first execution
phase E1 (which may be the only one), the instructions in the second FP are
in the decoding phase, the instructions in the third FP are in the dispatching phase,
and so on. All seven instructions are proceeding through the various phases.
Therefore, at cycle 7, the pipeline is full.
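The cycle-by-cycle overlap described above can be reproduced with a small Python sketch. It assumes, as the text does, that each FP contains one EP, that a new FP enters the pipeline every cycle with no stalls, and that only the first execute phase E1 is tracked. The names `PHASES`, `phase_at`, and `pipeline_table` are hypothetical.

```python
PHASES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1"]


def phase_at(fp_index, cycle):
    """Phase of fetch packet fp_index during the given cycle (both 1-based).

    Returns None before the FP enters the pipeline or after E1
    (later execute phases are not modelled in this sketch).
    """
    offset = cycle - fp_index          # FP k enters PG in cycle k
    if 0 <= offset < len(PHASES):
        return PHASES[offset]
    return None


def pipeline_table(num_fps=7, num_cycles=12):
    """One row per FP, one column per cycle; '--' marks an untracked slot."""
    return [[phase_at(fp, c) or "--" for c in range(1, num_cycles + 1)]
            for fp in range(1, num_fps + 1)]


for fp, row in enumerate(pipeline_table(), start=1):
    print(f"FP{fp}: " + " ".join(row))

print("cycle 7:", [phase_at(fp, 7) for fp in range(1, 8)])
```

Reading down the column for cycle 7 shows the first FP in E1, the second in DC, the third in DP, and so on down to the seventh in PG, which is the full-pipeline condition.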
Most instructions have one execute phase. Instructions such as multiply
(MPY), load (LDH/LDW), and branch (B) take two, five, and six phases,
respectively. Additional execute phases are associated with floating-point and
double-precision types of instructions, which can take up to 10 phases. For example,
the double-precision multiply operation (MPYDP), available on the C67x, has nine
delay slots, so that the execution phase takes a total of 10 phases.
The functional unit latency, which represents the number of cycles that an
instruction ties up a functional unit, is 1 for all instructions except double-precision
instructions, available with the floating-point C67x. Functional unit latency is
different from a delay slot. For example, the instruction MPYDP has a functional
unit latency of four but nine delay slots. This implies that no other instruction can use the
associated multiply functional unit for four cycles. A store has no delay slot but
finishes its execution in the third execution phase of the pipeline. If the outcome of a
multiply instruction such as MPY is used by the next instruction, a NOP (no
operation) must be inserted after the MPY instruction for the pipelining to operate
properly. Four or five NOPs must be inserted when an instruction uses the
outcome of a load or a branch instruction, respectively.
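The relationship between delay slots, execute phases, and NOPs can be summarized in a short Python sketch. The delay-slot values are the ones given in the text, with ADD included as an example of a single-phase instruction. The names `DELAY_SLOTS`, `execute_phases`, and `nops_needed` are hypothetical, and the optional `independent_between` parameter reflects the assumption that unrelated instructions placed in the delay slots reduce the number of NOPs required.

```python
# Delay slots per instruction, taken from the text (ADD: single execute phase).
DELAY_SLOTS = {"ADD": 0, "MPY": 1, "LDH": 4, "LDW": 4, "B": 5, "MPYDP": 9}


def execute_phases(mnemonic):
    """Total execute phases = delay slots + 1."""
    return DELAY_SLOTS[mnemonic] + 1


def nops_needed(mnemonic, independent_between=0):
    """NOPs required before a dependent instruction may use the result.

    independent_between: number of unrelated instructions already placed
    in the delay slots (assumption of this sketch).
    """
    return max(0, DELAY_SLOTS[mnemonic] - independent_between)


for name in ("ADD", "MPY", "LDW", "B", "MPYDP"):
    print(f"{name}: {execute_phases(name)} execute phases, "
          f"{nops_needed(name)} NOPs if the result is used immediately")
```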
CONCLUSION:
REFERENCES:
1.
2.
3.
4.