
PUNE INSTITUTE OF COMPUTER TECHNOLOGY, PUNE - 411043
Department of Electronics & Telecommunication
CLASS: T.E. E&TC
SUBJECT: DSP
EXPT. NO.:
DATE:
TITLE: Implementation of Convolution using DSP Processor
OBJECTIVE:
1. Study the difference in DSP Processor and General Purpose Processor
2. Study DSP Architecture for TMS320C6713
3. Implement Convolution sum Using DSK C6713
S/W & H/W: Code Composer Studio 3.1, TMS320C6713 DSP Kit
THEORY:
Difference between DSP Processor and General Purpose Processor

1. DSP Processor: Digital signal processors follow the Harvard architecture, in which data memory and program memory have separate data and address buses.
   General Purpose Processor: General-purpose microprocessors and microcontrollers follow the Von Neumann architecture (commonly with a CISC instruction set), in which data memory and program memory share a common data and address bus.

2. DSP Processor: Multiply and accumulate, program memory fetch, and data memory write are all executed in a single cycle. A barrel shift is also executed in a single cycle.
   General Purpose Processor: A CISC architecture takes multiple instructions to execute the fetch and data memory write operations.

3. DSP Processor: A DSP processor has features designed to support high-performance, repetitive, numerically intensive tasks.
   General Purpose Processor: GPPs are either not specialized for a specific kind of application (in the case of general-purpose processors), or they are designed for control-oriented applications (in the case of microcontrollers).

4. DSP Processor: Single-cycle multiply-accumulate capability; high-performance DSPs often have two multipliers that enable two multiply-accumulate operations per instruction cycle.
   General Purpose Processor: Most CISC processors take multiple clock cycles to execute a single instruction. Mostly, only a single multiplier and accumulator unit is available.

5. DSP Processor: Specialized addressing modes, for example, pre- and post-modification of address pointers, circular addressing, and bit-reversed addressing.
   General Purpose Processor: None of the CISC architecture processors support such specialized addressing modes.

6. DSP Processor: DSPs generally feature multiple-access memory architectures that enable them to complete several accesses to memory in a single instruction cycle.
   General Purpose Processor: A GPP takes multiple instruction cycles to finish the same task.

7. DSP Processor: Most DSPs provide various configurations of on-chip memory and peripherals tailored for DSP applications. For example, the level 1 cache is divided into L1P (program cache) and L1D (data cache).
   General Purpose Processor: No such significant utility is available with GPP architectures.

8. DSP Processor: Specialized execution control. Usually, DSP processors provide a loop instruction that allows tight loops to be repeated without spending any instruction cycles on updating and testing the loop counter or on jumping back to the top of the loop.
   General Purpose Processor: Looping in a GPP is bounded by the maximum bit support of the processor. Jump and loop instructions have a significant execution delay.

9. DSP Processor: DSP processors are known for their irregular instruction sets, which generally allow several operations to be encoded in a single instruction.
   General Purpose Processor: Totally bounded by the maximum bit support of the processor.

10. DSP Processor: For example, a processor that uses 32-bit instructions may encode two additions, two multiplications, and four 16-bit data moves into a single instruction.
    General Purpose Processor: It is hardly possible to configure a single resource in multiple ways. (For example, in the 89C51 only DPTR can be configured as a single 16-bit register or as two 8-bit registers.)

11. DSP Processor: In general, DSP processor instruction sets allow a data move to be performed in parallel with an arithmetic operation. GPPs/MCUs, in contrast, usually specify a single operation per instruction.
    General Purpose Processor: Not fully supported by traditional general-purpose processors and microcontrollers.
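The circular addressing mode named in the comparison above can be pictured with a software delay line. The following is only a sketch of the idea in portable C: the names circ_buf, circ_push, circ_get and the length BUF_LEN are hypothetical, and the explicit modulo operation stands in for the address wrap that a DSP performs in hardware at no cycle cost.

```c
#include <stddef.h>

#define BUF_LEN 4  /* hypothetical delay-line length */

typedef struct {
    float  data[BUF_LEN];
    size_t head;           /* index of the most recent sample */
} circ_buf;

/* Store a new sample, overwriting the oldest one. */
static void circ_push(circ_buf *cb, float sample)
{
    cb->head = (cb->head + 1) % BUF_LEN;   /* wrap back to 0 at the end */
    cb->data[cb->head] = sample;
}

/* Read the sample that arrived `delay` pushes ago (0 = newest). */
static float circ_get(const circ_buf *cb, size_t delay)
{
    return cb->data[(cb->head + BUF_LEN - (delay % BUF_LEN)) % BUF_LEN];
}
```

After six pushes of the values 1 to 6 into a zero-initialised buffer, circ_get(&cb, 0) returns 6 and circ_get(&cb, 3) returns 3, since only the last four samples are retained.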
While the above differences traditionally distinguish DSPs from GPPs/MCUs, in practice it is not important what kind of processor you choose. What is really important is to choose the processor that is best suited for your application; if a GPP/MCU is better suited for your DSP application than a DSP processor, the processor of choice is the GPP/MCU. It is also worth noting that the difference between DSPs and GPPs/MCUs is fading: many GPPs/MCUs now include DSP features, and DSPs are increasingly adding microcontroller features.
INTRODUCTION
Texas Instruments introduced the first-generation TMS32010 DSP in 1982, the TMS320C25 in 1986 [1], and the TMS320C50 in 1991. Several versions of each of these processors (C1x, C2x, and C5x) are available with different features, such as faster execution speed. These 16-bit processors are all fixed-point processors and are code-compatible.
In Von Neumann architecture, program instructions and data are stored in a
single memory space. A processor with von Neumann architecture can make a read or
a write to memory during each instruction cycle. Typical DSP applications require
several accesses to memory within one instruction cycle. The fixed-point processors
C1x, C2x, and C5x are based on a modified Harvard architecture with separate
memory spaces for data and instructions that allow concurrent accesses.
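The need for several memory accesses per instruction cycle is easiest to see in the convolution sum that this experiment implements. The routine below is a minimal reference sketch in plain C, not the DSK-specific program. The function name convolve and its argument layout are assumptions made for illustration. Each pass through the inner statement needs one sample of x, one sample of h and the instruction itself, which is exactly the access pattern that separate program and data memories serve concurrently.

```c
#include <stddef.h>

/* Linear convolution y[n] = sum over k of x[k] * h[n - k].
 * y must have room for nx + nh - 1 samples. */
void convolve(const float *x, size_t nx,
              const float *h, size_t nh,
              float *y)
{
    for (size_t n = 0; n < nx + nh - 1; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < nx; k++) {
            /* Use only the terms for which h[n - k] exists. */
            if (n >= k && n - k < nh) {
                acc += x[k] * h[n - k];   /* one multiply-accumulate */
            }
        }
        y[n] = acc;
    }
}
```

For x = {1, 2, 3} and h = {1, 1}, the routine fills y with {1, 3, 5, 3}.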
Quantization error, or round-off noise, from an ADC is a concern with a fixed-point processor. An ADC uses only a best-estimate digital value to represent an input.
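The "best-estimate" behaviour can be sketched with an idealised uniform quantizer. This is an illustration under stated assumptions, not a model of any particular ADC. The function name quantize, the symmetric input range and the round-to-nearest rule are all assumed, and inputs outside the range are not clipped.

```c
/* Quantize v (assumed to lie within -full_scale .. +full_scale) to the
 * nearest level of a `bits`-bit converter and return the reconstructed value. */
double quantize(double v, double full_scale, int bits)
{
    double step = 2.0 * full_scale / (double)(1L << bits);   /* size of one LSB */
    long   code = (long)(v / step + (v >= 0.0 ? 0.5 : -0.5)); /* nearest level  */
    return (double)code * step;
}
```

With a full scale of 1.0 and 3 bits, the step is 0.25, so an input of 0.3 is represented as 0.25 and the quantization error is about 0.05. The error never exceeds half a step, and it shrinks as the number of bits grows.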
A C6x processor can be used as a standard general-purpose DSP programmed for a specific application. Examples of specific-purpose digital signal processors include modems and echo cancellers. A fixed-point processor is better for devices that use batteries, such as cellular phones, since it uses less power than an equivalent floating-point processor. The fixed-point processors C1x, C2x, and C5x are 16-bit processors with limited dynamic range and precision.
The C6x fixed-point processor is a 32-bit processor with improved dynamic range and precision. In a fixed-point processor, it is necessary to scale the data. Overflow, which occurs when an operation such as the addition of two numbers produces a result with more bits than can fit within a processor's register, becomes a concern.
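Two common ways of dealing with overflow can be sketched as follows. The 16-bit word length is chosen only to keep the numbers small, and the function names sat_add16 and scaled_add16 are hypothetical. The first routine clamps the result at the largest representable value, and the second scales the operands down before adding, at the cost of one bit of precision.

```c
#include <stdint.h>

/* Add two 16-bit values, clamping the result instead of letting it
 * wrap around when it no longer fits in 16 bits. */
int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* 32 bits cannot overflow here */
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Halve both operands first so that the sum always fits in 16 bits. */
int16_t scaled_add16(int16_t a, int16_t b)
{
    return (int16_t)(a / 2 + b / 2);
}
```

Adding 30000 and 10000 gives 32767 with saturation, whereas the scaled version returns 20000, which is half of the true sum of 40000.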
A floating-point processor is generally more expensive since it has more "real estate", that is, it is a larger chip because of the additional circuitry necessary to handle integer as well as floating-point arithmetic. Several factors, such as cost, power consumption, and speed, come into play when choosing a specific DSP. The C6x processors are particularly useful for applications requiring intensive computations. Family members of the C6x include both fixed-point (e.g., C62x, C64x) and floating-point (e.g., C67x) processors. Other DSPs are also available from companies such as Motorola and Analog Devices.
Other architectures include the superscalar, which requires special hardware to determine which instructions are executed in parallel. The burden is then on the processor more than on the programmer, as it is in the VLIW architecture. A superscalar processor does not necessarily execute the same group of instructions each time, and as a result, it is difficult to time. Thus, the superscalar architecture is rarely used in DSP.
TMS320C6x ARCHITECTURE
The TMS320C6713 onboard the DSK is a floating-point processor based on the
VLIW architecture.
Features of the Board
Internal memory includes a two-level cache architecture with 4 kB of level 1 program cache (L1P), 4 kB of level 1 data cache (L1D), and 256 kB of level 2 memory shared between program and data space.
It has a glueless (direct) interface to both synchronous memories (SDRAM and SBSRAM) and asynchronous memories (SRAM and EPROM).
On-chip peripherals include two McBSPs, two timers, a host port interface (HPI), and a 32-bit EMIF.
It requires 3.3 V for I/O and 1.26 V for the core (internal).
Internal buses include a 32-bit program address bus, a 256-bit program data bus to accommodate eight 32-bit instructions, two 32-bit data address buses, two 64-bit data buses, and two 64-bit store data buses.
Independent memory banks on the C6x allow for two memory accesses within one instruction cycle. Two independent memory banks can be accessed using two independent buses. Since internal memory is organized into memory banks, two load instructions or two store instructions can be performed in parallel.
No conflict results if the data accessed are in different memory banks. Separate buses
for program, data, and direct memory access (DMA) allow the C6x to perform
concurrent program fetches, data read and write, and DMA operations. With data
and instructions residing in separate memory spaces, concurrent memory accesses
are possible.
The C6x has a byte-addressable memory space. Internal memory is organized as separate program and data memory spaces, with two 32-bit internal ports (two 64-bit ports with the C64x) to access internal memory.
With the DSK operating at 225 MHz, one can ideally achieve two multiplies and accumulates per cycle, for a total of 450 million multiplies and accumulates (MACs) per second.
With six of the eight functional units in Figure 3.1 (not the .D units described below) capable of handling floating-point operations, it is possible to perform 1350 million floating-point operations per second (MFLOPS). With all eight functional units operating at 225 MHz, this translates into 1800 million instructions per second (MIPS) with a 4.44-ns instruction cycle time.
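The peak figures quoted above follow directly from the clock rate and the number of units involved. The short sketch below reproduces that arithmetic. The struct peak_rates and the function c6713_peak_rates are hypothetical names used only for this illustration, and the results are ideal upper bounds rather than measured performance.

```c
/* Ideal peak rates for a clock frequency given in MHz. */
typedef struct {
    double mmacs;     /* million multiply-accumulates per second      */
    double mflops;    /* million floating-point operations per second */
    double mips;      /* million instructions per second              */
    double cycle_ns;  /* instruction cycle time in nanoseconds        */
} peak_rates;

peak_rates c6713_peak_rates(double clock_mhz)
{
    peak_rates r;
    r.mmacs    = 2.0 * clock_mhz;     /* two multipliers (.M units)        */
    r.mflops   = 6.0 * clock_mhz;     /* six floating-point-capable units  */
    r.mips     = 8.0 * clock_mhz;     /* eight functional units in total   */
    r.cycle_ns = 1000.0 / clock_mhz;  /* period of an f-MHz clock is 1000/f ns */
    return r;
}
```

Calling c6713_peak_rates(225.0) gives 450, 1350 and 1800 for the three rates and a cycle time of about 4.44 ns.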
The CPU consists of eight independent functional units divided into two
data paths, A and B, as shown in Figure. Each path has a unit for multiply operations
(.M), for logical and arithmetic operations (.L), for branch, bit manipulation, and
arithmetic operations (.S), and for loading/storing and arithmetic operations (.D).
The .S and .L units are for arithmetic, logical, and branch instructions. All data
transfers make use of the .D units.
The arithmetic operations, such as subtract or add (SUB or ADD), can be
performed by all the units, except the .M units (one from each data path). The eight
functional units consist of four floating/fixed-point ALUs (two .L and two .S), two
fixed-point ALUs (.D units), and two floating/fixed-point multipliers (.M units).
Two cross-paths (1x and 2x) allow functional units from one data path to
access a 32-bit operand from the register file on the opposite side. There can be a
maximum of two cross-path source reads per cycle. Each functional unit side can
access data from the registers on the opposite side using a cross-path (i.e., the
functional units on one side can access the register set from the other side). There are
32 general purpose registers, but some of them are reserved for specific addressing
or are used for conditional instructions.
FETCH AND EXECUTE PACKETS
The VelociTI architecture, introduced by TI, is derived from the VLIW architecture.
An execute packet (EP) consists of a group of instructions that can be executed in
parallel within the same cycle time. The number of EPs within a fetch packet (FP)
can vary from one (with eight parallel instructions) to eight (with no parallel
instructions). The VLIW architecture was modified to allow more than one EP to be
included within an FP. The least significant bit of every 32-bit instruction is used
to determine if the next or subsequent instruction belongs in the same EP (if 1) or
is part of the next EP (if 0).
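The role of this least significant bit (the p bit) can be sketched as a small routine that divides one fetch packet into execute packets. The function name split_fetch_packet is hypothetical, the instruction words in any test data are placeholders rather than real opcodes, and the sketch assumes that an execute packet never continues past the end of its fetch packet.

```c
#include <stdint.h>
#include <stddef.h>

#define FP_WORDS 8   /* a fetch packet holds eight 32-bit instructions */

/* Fill ep_len[] with the number of instructions in each execute packet
 * of one fetch packet and return the number of execute packets. */
size_t split_fetch_packet(const uint32_t fp[FP_WORDS], size_t ep_len[FP_WORDS])
{
    size_t n_ep = 0;
    size_t count = 0;

    for (size_t i = 0; i < FP_WORDS; i++) {
        count++;
        /* A p bit of 0 ends the current EP; so does the end of the FP. */
        if ((fp[i] & 1u) == 0 || i == FP_WORDS - 1) {
            ep_len[n_ep++] = count;
            count = 0;
        }
    }
    return n_ep;
}
```

For p bits of 1, 1, 0, 1, 0, 1, 1, 0 the routine reports three execute packets containing 3, 2 and 3 instructions. Seven leading ones give a single packet of eight, and all zeros give eight packets of one instruction each.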
PIPELINING
Pipelining is a key feature in a DSP to get parallel instructions working
properly, requiring careful timing. There are three stages of pipelining: program
fetch, decode, and execute.
FIGURE One FP with three EPs showing the p bit of each instruction.
1. The program fetch stage is composed of four phases:
(a) PG: program address generate (in the CPU) to fetch an address
(b) PS: program address send (to memory) to send the address
(c) PW: program address ready wait (memory read) to wait for data
(d) PR: program fetch packet receive (at the CPU) to read opcode from memory
2. The decode stage is composed of two phases:
(a) DP: to dispatch all the instructions within an FP to the appropriate functional
units
(b) DC: instruction decode
3. The execute stage is composed of 6 phases (with fixed point) to 10 phases
(with floating point) due to delays (latencies) associated with the following
instructions:
(a) Multiply instruction, which consists of two phases due to one delay
(b) Load instruction, which consists of five phases due to four delays
(c) Branch instruction, which consists of six phases due to five delays
Table shows the pipeline phases, and Table shows the pipelining effects.
The first row in Table 3.3 represents cycles 1, 2, . . ., 12. Each subsequent row represents an FP. The entries PG, PS, . . . in each row show the phases associated with that FP. The program address generate (PG) phase of the first FP starts in cycle 1, the PG phase of the second FP starts in cycle 2, and so on. Each FP takes four phases for program fetch and two phases for decoding. However, the execution stage can take from 1 to 10 phases (not all execution phases are shown in Table 3.3). We are assuming that each FP contains one EP.
For example, at cycle 7, while the instructions in the first FP are in the first execution
phase E1 (which may be the only one), the instructions in the second FP are
in the decoding phase, the instructions in the third FP are in the dispatching phase,
and so on. All seven instructions are proceeding through the various phases.
Therefore, at cycle 7, the pipeline is full.
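The cycle-7 snapshot described above can be reproduced with a few lines of code. This sketch assumes, as the text does, that each FP contains one EP, that a new FP enters the pipeline every cycle, and that only one execute phase (E1) is needed. The table PHASES and the function phase_at are hypothetical names.

```c
#include <stddef.h>

/* Phases an FP passes through when it needs a single execute phase. */
static const char *const PHASES[] = { "PG", "PS", "PW", "PR", "DP", "DC", "E1" };
#define N_PHASES (sizeof PHASES / sizeof PHASES[0])

/* Phase occupied by fetch packet `fp` (1-based) during `cycle` (1-based),
 * or NULL if the packet has not yet entered or has already left the pipeline. */
const char *phase_at(unsigned fp, unsigned cycle)
{
    if (fp == 0 || cycle < fp) {
        return NULL;
    }
    unsigned idx = cycle - fp;          /* FP number n enters PG in cycle n */
    return idx < N_PHASES ? PHASES[idx] : NULL;
}
```

At cycle 7 the routine places the first FP in E1, the second in DC, the third in DP and the seventh in PG, while an eighth FP has not yet entered the pipeline.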
Most instructions have one execute phase. Instructions such as multiply
(MPY), load (LDH/LDW), and branch (B) take two, five, and six phases,
respectively. Additional execute phases are associated with floating-point and
double-precision types of instructions, which can take up to 10 phases. For example,
the double-precision multiply operation (MPYDP), available on the C67x, has nine
delay slots, so that the execution phase takes a total of 10 phases.
The functional unit latency, which represents the number of cycles that an instruction ties up a functional unit, is 1 for all instructions except double-precision instructions, available with the floating-point C67x. Functional unit latency is different from a delay slot. For example, the instruction MPYDP has a functional unit latency of four but nine delay slots. This implies that no other instruction can use the associated multiply functional unit for four cycles. A store has no delay slot but finishes its execution in the third execution phase of the pipeline. If the outcome of a multiply instruction such as MPY is used by the instruction that follows it, one NOP (no operation) must be inserted after the MPY instruction for the pipelining to operate properly. Four or five NOPs are to be inserted in case an instruction uses the outcome of a load or a branch instruction, respectively.
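The delay-slot counts given in this section can be collected in a small lookup table. The sketch below contains only the instructions named in the text. The type latency_entry and the function nops_needed are hypothetical, and the returned count assumes that no other useful instruction is scheduled in the delay slots.

```c
#include <string.h>

typedef struct {
    const char *mnemonic;
    int         delay_slots;   /* cycles before the result can be used */
} latency_entry;

/* Delay slots as stated in the text above. */
static const latency_entry LATENCIES[] = {
    { "MPY",   1 },
    { "LDH",   4 },
    { "LDW",   4 },
    { "B",     5 },
    { "MPYDP", 9 },
};

/* NOPs needed before a dependent instruction; -1 if the mnemonic is
 * not in the table. */
int nops_needed(const char *mnemonic)
{
    for (size_t i = 0; i < sizeof LATENCIES / sizeof LATENCIES[0]; i++) {
        if (strcmp(LATENCIES[i].mnemonic, mnemonic) == 0) {
            return LATENCIES[i].delay_slots;
        }
    }
    return -1;
}
```

For example, nops_needed("LDW") returns 4 and nops_needed("B") returns 5, matching the four and five NOPs mentioned above.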
CONCLUSION:
REFERENCES:
1. Digital Signal Processing, Ifeachor & Jervis
2. Digital Signal Processors, Sen M. Kuo
3. TMS320C6713 Datasheet
4. www.dsprelated.com
P:F:-LTL-UG/03/R1
DSP
