Dspworkshop Part2 2006

Copyright
© Attribution Non-Commercial (BY-NC)
Digital Signal Processing and Applications with the TMS320C6713 and the TMS320C6416 DSK

D. Richard Brown III
Associate Professor
Worcester Polytechnic Institute
Electrical and Computer Engineering Department
October 16-17, 2005

Day 2

Profiling Your Code and Making it More Efficient


- How to estimate the execution time of your code.
- How to use the optimizing compiler to produce more efficient code.
- How data types and memory usage affect the efficiency of your code.

How to estimate code execution time when connected to the DSK


1. Start CCS with the C6713 DSK connected
2. Debug -> Connect (or alt+C)
3. Open project, build it, and load .out file to the DSK
4. Open the source file you wish to profile
5. Set two breakpoints for the start/end of the code range you wish to profile
6. Profile -> Clock -> Enable
7. Profile -> Clock -> View
8. Run to the first breakpoint
9. Reset the clock
10. Run to the second breakpoint
11. Clock will show raw number of execution cycles between breakpoints.

Tip: You can save your breakpoints, probe points, graphs, and watch windows with File -> Workspace -> Save Workspace As
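
The raw cycle count from the profile clock becomes more useful once it is converted to time and compared with the real-time budget per sample. The following is a minimal sketch, assuming the 225 MHz CPU clock quoted on the block diagram slide; the function names `cycles_to_us` and `sample_budget_used` and the macro `CPU_CLOCK_HZ` are hypothetical helpers, not part of CCS.

```c
#include <stdint.h>

/* Assumed CPU clock: 225 MHz, the C6713 rate quoted on the block diagram slide. */
#define CPU_CLOCK_HZ 225000000.0

/* Convert a raw cycle count read from the profile clock into microseconds. */
double cycles_to_us(uint32_t cycles)
{
    return (double)cycles * 1.0e6 / CPU_CLOCK_HZ;
}

/* Fraction of one sample period consumed by `cycles` at sample rate fs_hz.
 * A value above 1.0 means the code cannot keep up in real time. */
double sample_budget_used(uint32_t cycles, double fs_hz)
{
    double cycles_per_sample = CPU_CLOCK_HZ / fs_hz;
    return (double)cycles / cycles_per_sample;
}
```

For example, at an 8 kHz sample rate there are 28125 cycles available per sample, so a routine measured at 2250 cycles uses 8% of the budget.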

Another method for estimating code execution time (part 1 of 3)


Repeat steps 1-4 of the previous method, then:

5. Clear any breakpoints in your code
6. Profile -> Setup
7. Click on Custom tab
8. Select Cycles
9. Click on clock (enable profiling)

Another method for estimating code execution time (part 2 of 3)


10. Select Ranges tab
11. Highlight code you want to profile and drag into ranges window (hint: you can drag whole functions into this window)
12. Repeat for other ranges if desired

Another method for estimating code execution time (part 3 of 3)


13. Profile -> Viewer
14. Run (let it run for a minute or more)
15. Halt
16. Observe profiling results in Profile Viewer window

Hint: edit the columns to see averages

What does it mean?


- Access count is the number of times that CCS profiled the function
  - Note that the function was probably called more than 49 times. CCS only timed it 49 times.
- Inclusive average is the average number of cycles needed to run the function including any calls to subroutines
- Exclusive average is the average number of cycles needed to run the function excluding any calls to subroutines

Optimizing Compiler


Profiling results after compiler optimization


- In this example, we get a 3x-4x improvement with Speed Most Critical and File (-o3) optimization
- Optimization gains can be much larger, e.g. 20x

Limitations of hardware profiling


- Breakpoint/clock profiling method may not work with compiler-optimized code
- Profile -> View method is known to be somewhat inaccurate when connected to real hardware (see profiling limitations in CCS help)
  - Accuracy is better when only one or two ranges are profiled
  - Best accuracy is achieved by running a simulator

Other factors affecting code efficiency


- Memory
  - C6713 has 64kB internal RAM (L2 cache)
  - DSK provides additional 16MB external RAM (SDRAM)
  - Code location (.text in command file)
    - internal memory (fast)
    - external memory (slow, typically 2-4x worse)
  - Data location (.data in command file)
    - internal memory (fast)
    - external memory (slow, depends on datatypes)
- Data types
  - Slowest execution is double-precision floating point
  - Fastest execution is fixed point, e.g. short
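
To make the data type comparison concrete, here is a minimal sketch of the same FIR output computation written once with 16-bit fixed-point (Q15) data and once with single-precision floats, so the two can be profiled side by side. The function names `fir_q15` and `fir_float` are hypothetical, and the Q15 scaling is an assumption about how the fixed-point coefficients are stored.

```c
#include <stdint.h>

/* One FIR output using Q15 coefficients (value = h / 32768) and 16-bit samples.
 * Products are accumulated in 32 bits and scaled back by 15 bits at the end.
 * The caller is assumed to keep the accumulated sum within 32-bit range.
 * Note: >> on a negative value is an arithmetic shift on typical compilers. */
int16_t fir_q15(const int16_t *h, const int16_t *x, int n)
{
    int32_t acc = 0;
    int i;
    for (i = 0; i < n; i++) {
        acc += (int32_t)h[i] * (int32_t)x[i];
    }
    return (int16_t)(acc >> 15);
}

/* The same computation in single-precision floating point, for comparison. */
float fir_float(const float *h, const float *x, int n)
{
    float acc = 0.0f;
    int i;
    for (i = 0; i < n; i++) {
        acc += h[i] * x[i];
    }
    return acc;
}
```

With two coefficients of 0.5 (16384 in Q15) and samples 1000 and 3000, both versions return 2000. Profiling each version, and a double-precision variant, shows the cost of each data type on the target.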

Command file example


MEMORY
{
    vecs:   o = 00000000h   l = 00000200h
    IRAM:   o = 00000200h   l = 0000FE00h
    CE0:    o = 80000000h   l = 01000000h
}

SECTIONS
{
    "vectors"   >   vecs
    .cinit      >   IRAM
    .text       >   IRAM    /* code goes here */
    .stack      >   IRAM
    .bss        >   IRAM
    .const      >   IRAM
    .data       >   IRAM    /* data goes here */
    .far        >   IRAM
    .switch     >   IRAM
    .sysmem     >   IRAM
    .tables     >   IRAM
    .cio        >   IRAM
}

- Addresses 00000000-0000FFFF are mapped to internal memory (IRAM). This is 64kB.
- External memory (CE0) is mapped to address range 80000000-80FFFFFF. This is 16MB.

Some Things to Try


- Try profiling parts of your FIR filter code from Day 1 without optimization. Try both profiling methods.
- Rebuild your project under various optimization levels and try various settings from size most critical to speed most critical. Compare profile results for no optimization and various levels of optimization.
- Change the data types in your FIR filter code and rebuild (with and without optimization) to see the effect on performance.
- Try moving the data and/or program to internal/external memory and profiling (you will need to modify the linker command file to do this).
- Contest: Who can make the most efficient 8th order bandpass filter (that works)?

Assembly Language Programming on the TMS320C6713


- Sometimes you have to take matters into your own hands...
- Three options:
  1. Linear assembly (.sa)
     - Compromise between effort and efficiency
     - Typically more efficient than C
     - Assembler takes care of details like assigning functional units, registers, and parallelizing instructions
  2. ASM statement in C code (.c)
     - asm("assembly code")
  3. C-callable assembly function (.asm)
     - Full control of assigning functional units, registers, parallelization, and pipeline optimization

C-Callable Assembly Language Functions


- Basic concepts:
  - Arguments are passed in via registers A4, B4, A6, B6, ... in that order. All registers are 32-bit.
  - Result returned in A4 also.
  - Return address of calling code (program counter) is in B3. Don't overwrite B3!
- Naming conventions:
  - In C code: label
  - In ASM code: _label (note the leading underbar)
- Accessing global variables in ASM:
  - .ref _variablename

Skeleton C-Callable ASM Function


; header comments
; passed in parameters in registers A4, B4, A6, ... in that order
            .def    _myfunc             ; allow calls from external
ACONSTANT   .equ    100                 ; declare constants
            .ref    _aglobalvariable    ; refer to a global variable
_myfunc:    NOP                         ; instructions go here
            B       B3                  ; return (branch to addr B3)
                                        ; function output will be in A4
            NOP     5                   ; pipeline flush
            .end

Example C-Callable Assembly Language Program


TMS320C67x Block Diagram


- One instruction is 32 bits. Program bus is 256 bits wide.
- Can execute up to 8 instructions per clock cycle (225MHz -> 4.4ns clock cycle).
- 8 independent functional units:
  - 2 multipliers
  - 6 ALUs
- Code is efficient if all 8 functional units are always busy.
- Register files each have 16 general purpose registers, each 32 bits wide (A0-A15, B0-B15).
- Data paths are each 64 bits wide.

C6713 Functional Units


- Two data paths (A & B)
- Data path A
  - Multiply operations (.M1)
  - Logical and arithmetic operations (.L1)
  - Branch, bit manipulation, and arithmetic operations (.S1)
  - Loading/storing and arithmetic operations (.D1)
- Data path B
  - Multiply operations (.M2)
  - Logical and arithmetic operations (.L2)
  - Branch, bit manipulation, and arithmetic operations (.S2)
  - Loading/storing and arithmetic operations (.D2)
- All data (not program) transfers go through .D1 and .D2

Fetch & Execute Packets


- C6713 fetches 8 instructions at a time (256 bits)
- Definition: Fetch packet is a group of 8 instructions fetched at once.
- Coincidentally, C6713 has 8 functional units.
  - Ideally, all 8 instructions would be executed in parallel.
- Often this isn't possible, e.g.:
  - 3 multiplies (only two .M functional units)
  - Results of instruction 3 needed by instruction 4 (must wait for 3 to complete)

Execute Packets

- Definition: Execute Packet is a group of (8 or fewer) consecutive instructions in one fetch packet that can be executed in parallel.

[Figure: one fetch packet divided into execute packet 1, execute packet 2, and execute packet 3]

- C compiler provides a flag to indicate which instructions should be run in parallel.
- You have to do this manually in Assembly using ||. See Chapter 3 of the Chassaing textbook.
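
The way a fetch packet divides into execute packets can be sketched in C. This is a minimal illustration, assuming the C6000 encoding in which bit 0 of each 32-bit instruction (the p-bit) says whether the next instruction runs in parallel with it, and assuming that on the C67x an execute packet does not cross a fetch packet boundary. The function name `count_execute_packets` is hypothetical.

```c
#include <stdint.h>

#define FETCH_PACKET_SIZE 8   /* 8 instructions x 32 bits = 256 bits */

/* Count the execute packets contained in one fetch packet.
 * p-bit = 1: the next instruction executes in parallel with this one.
 * p-bit = 0: this instruction is the last one of its execute packet. */
int count_execute_packets(const uint32_t fetch_packet[FETCH_PACKET_SIZE])
{
    int packets = 0;
    int i;
    for (i = 0; i < FETCH_PACKET_SIZE; i++) {
        int p_bit = (int)(fetch_packet[i] & 1u);
        /* Assumption: the last slot always closes an execute packet. */
        if (p_bit == 0 || i == FETCH_PACKET_SIZE - 1) {
            packets++;
        }
    }
    return packets;
}
```

A fetch packet whose p-bits are 1,1,0,1,0,1,1,0 splits into three execute packets, as in the figure; a fully parallel fetch packet (1,1,1,1,1,1,1,0) is a single execute packet.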

C6713 Instruction Pipeline Overview


All instructions flow through the following steps:

1. Fetch
   a) PG: Program address Generate
   b) PS: Program address Send
   c) PW: Program address ready Wait
   d) PR: Program fetch packet Receive
2. Decode
   a) DP: Instruction DisPatch
   b) DC: Instruction DeCode
3. Execute
   a) 10 phases labeled E1-E10
   b) Fixed point processors have only 5 phases (E1-E5)

each step = 1 clock cycle

Pipelining: Ideal Operation

Remarks:
- At clock cycle 11, the pipeline is full
- There are no holes (bubbles) in the pipeline in this example

Pipelining: Actual Operation

Remarks:
- Fetch packet n has 3 execution packets
- All subsequent fetch packets have 1 execution packet
- Notice the holes/bubbles in the pipeline caused by lack of parallelization

Execute Stage of C6713 Pipeline


- C67x has 10 execute phases (floating point)
- C62x/C64x have 5 execute phases (fixed point)
- Different types of instructions require different numbers of these phases to complete their execution
  - Anywhere between 1 and all 10 phases
  - Most instructions tie up their functional unit for only one phase (E1)

Execution Stage Examples (1)

- Results available after E1 (zero delay slots)
- Functional unit free after E1 (1 functional unit latency)

Execution Stage Examples (2)

- Results available after E4 (3 delay slots)
- Functional unit free after E1 (1 functional unit latency)

Execution Stage Examples (3)

- Results available after E10 (9 delay slots)
- Functional unit free after E4 (4 functional unit latency)

Functional Latency & Delay Slots


- Functional Latency: How long must we wait for the functional unit to be free?
- Delay Slots: How long must we wait for the result?
- General remarks:
  - Functional unit latency <= Delay slots + 1 (1 vs. 0, 1 vs. 3, and 4 vs. 9 in the three execution stage examples)
  - Strange results will occur in ASM code if you don't pay attention to delay slots and functional unit latency
  - All problems can be resolved by "waiting" with NOPs
  - Efficient ASM code tries to keep functional units busy all of the time.
  - Efficient code is hard to write (and follow).
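
The bookkeeping behind delay slots and functional unit latency can be sketched as a few small C helpers. This is only an illustration of the arithmetic implied by the three execution stage examples; the function names `result_phase`, `nops_needed`, and `unit_is_busy` are hypothetical and are not part of any TI tool.

```c
/* Execute phase (1 = E1) after which the result is available.
 * 0 delay slots -> E1, 3 delay slots -> E4, 9 delay slots -> E10. */
int result_phase(int delay_slots)
{
    return delay_slots + 1;
}

/* NOP cycles still required before an instruction that consumes the result,
 * when `independent_cycles` cycles of unrelated useful work have already
 * been scheduled in the delay slots. */
int nops_needed(int delay_slots, int independent_cycles)
{
    int n = delay_slots - independent_cycles;
    return n > 0 ? n : 0;
}

/* Nonzero if the functional unit cannot yet accept a new instruction,
 * `cycles_since_issue` cycles after the previous one was issued to it. */
int unit_is_busy(int fu_latency, int cycles_since_issue)
{
    return cycles_since_issue < fu_latency;
}
```

For the third example (9 delay slots, functional unit latency 4), `nops_needed(9, 4)` returns 5: even with four cycles of unrelated work scheduled after the instruction, five more cycles must pass before its result can be used. The `NOP 5` after `B B3` in the skeleton function fills the branch delay slots in the same way.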

Some Things to Try


- Try rewriting your FIR filter code as a C-callable ASM function
  - Create a new ASM file
  - Call the ASM function from your main code
  - See Chassaing examples fircasm.pjt and fircasmfast.pjt for ideas
- Profile your new FIR code and compare it to the compiler-optimized code.

Lunch Break

Workshop resumes at 1:30pm


Infinite Impulse Response (IIR) Filters


- Advantages:
  - Can achieve a desired frequency response with less memory and computation than FIR filters
- Disadvantages:
  - Can be unstable
  - Affected more by finite-precision math due to feedback
- Input/output relationship: [equation not shown]

IIR Filtering - Stability


- Transfer function: [equation not shown]
- Note that the filter is stable only if all of its poles (roots of the denominator) have magnitude less than 1.
- Quantization of coefficients (a's and b's) will move the poles. A stable filter in infinite precision may not be stable after coefficient quantization.
- Numerator of H(z) does not affect stability.
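
For reference, one common way to write the input/output relationship and the transfer function of an IIR filter is sketched below. The sign convention (feedback terms subtracted, with a_0 = 1) and the symbols M, N, b_k, and a_k are assumptions here, chosen to match the convention of Matlab's filter function; the workshop slides may define them differently.

```latex
y[n] = \sum_{k=0}^{M} b_k \, x[n-k] \; - \; \sum_{k=1}^{N} a_k \, y[n-k]

H(z) = \frac{\sum_{k=0}^{M} b_k \, z^{-k}}{1 + \sum_{k=1}^{N} a_k \, z^{-k}}
```

Under this convention the poles are the roots of the denominator polynomial, which depends only on the a_k, consistent with the remark that the numerator does not affect stability.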

Creating IIR Filters


In Matlab:
1. Design filter
   - Type: low pass, high pass, band pass, band stop, ...
   - Filter order N
   - Desired frequency response
2. Decide on a realization structure
3. Decide how coefficients will be quantized.
4. Compute coefficients

In CCS:
5. Decide how everything else will be quantized (input samples, output samples, result of multiplies, result of additions)
6. Write code to realize filter
7. Test filter and compare to theoretical expectations

IIR Realization Structures


- Many different IIR realization structures available (see options in fdatool)
  - Structures can have different memory and computational requirements
  - All structures give the same behavior when the math is infinite precision
  - Structures can have very different behavior when the math is finite precision
    - Stability
    - Accuracy with respect to the desired response
    - Potential for overflow/underflow

Direct Form I

Notation: 1/z = one sample delay


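
A Direct Form I realization keeps separate delay lines for past inputs and past outputs. The following is a minimal floating-point sketch, assuming the convention y[n] = sum of b_k x[n-k] minus sum of a_k y[n-k] with a_0 = 1; the names `df1_state`, `df1_step`, and `IIR_ORDER` are hypothetical and the order of 2 is chosen only for illustration.

```c
#define IIR_ORDER 2   /* hypothetical filter order for this sketch */

typedef struct {
    float b[IIR_ORDER + 1];   /* numerator coefficients b0..bN */
    float a[IIR_ORDER + 1];   /* denominator coefficients, a[0] assumed to be 1 */
    float x[IIR_ORDER + 1];   /* input delay line: x[n], x[n-1], ... */
    float y[IIR_ORDER + 1];   /* output delay line: y[n], y[n-1], ... */
} df1_state;

/* Process one input sample and return one output sample. */
float df1_step(df1_state *s, float in)
{
    int k;
    float acc;

    /* Each array shift below plays the role of a 1/z delay element. */
    for (k = IIR_ORDER; k > 0; k--) {
        s->x[k] = s->x[k - 1];
        s->y[k] = s->y[k - 1];
    }
    s->x[0] = in;

    acc = s->b[0] * s->x[0];
    for (k = 1; k <= IIR_ORDER; k++) {
        acc += s->b[k] * s->x[k] - s->a[k] * s->y[k];
    }
    s->y[0] = acc;
    return acc;
}
```

With b = {1, 0, 0} and a = {1, -0.5, 0}, feeding a unit impulse produces 1, 0.5, 0.25, ..., the expected decaying response of a single pole at z = 0.5. Note that this structure stores 2N past values, which is the memory cost the Direct Form II slide compares against.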

Direct Form II

Note fewer delay elements (less memory) than DFI. Can prove that DFII has minimum number of delay elements.


Direct Form II: Second Order Sections


- Transfer function H(z) is factored into H1(z)H2(z)...HK(z) where each factor Hk(z) has a quadratic denominator and numerator
- Each quadratic factor is called a Second Order Section (SOS)
- Each SOS is realized in DFII
- The results from each SOS are then passed to the next SOS
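
The cascade described above can be sketched as one reusable section function plus a loop. This is a minimal floating-point sketch, assuming each section has the form (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2); the names `sos_section`, `sos_step`, and `sos_cascade_step` are hypothetical.

```c
typedef struct {
    float b0, b1, b2;   /* numerator coefficients of this section */
    float a1, a2;       /* denominator coefficients, a0 assumed to be 1 */
    float w1, w2;       /* the two DFII delay elements: w[n-1], w[n-2] */
} sos_section;

/* One sample through a single second order section in Direct Form II. */
float sos_step(sos_section *s, float in)
{
    float w0  = in - s->a1 * s->w1 - s->a2 * s->w2;            /* feedback part */
    float out = s->b0 * w0 + s->b1 * s->w1 + s->b2 * s->w2;    /* feedforward part */
    s->w2 = s->w1;
    s->w1 = w0;
    return out;
}

/* One sample through a cascade: the output of each SOS feeds the next SOS. */
float sos_cascade_step(sos_section *sections, int num_sections, float in)
{
    int k;
    float v = in;
    for (k = 0; k < num_sections; k++) {
        v = sos_step(&sections[k], v);
    }
    return v;
}
```

Because the section function does not depend on the number of sections, the same code serves a filter of any order; only the array of coefficients changes. Each section needs just two delay elements, in line with the memory argument for DFII.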

Direct Form II: Second Order Sections


- Very popular realization structure
  - Low memory requirements (same as DFII)
  - Easy to check the stability of each SOS
  - Can write one DFII-SOS filter function and reuse it for any length filter
- Tends to be less sensitive to finite precision math than DFI or DFII. Why?
  - Dynamic range of coefficients in each SOS is smaller
  - Coefficient quantization only affects local poles/zeros
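
The stability check for a single SOS is simple enough to sketch in a few lines. For a denominator 1 + a1 z^-1 + a2 z^-2, both poles lie strictly inside the unit circle exactly when |a2| < 1 and |a1| < 1 + a2 (the "stability triangle"). The sketch below applies that test before and after rounding the coefficients to a fixed-point grid; the names `sos_is_stable` and `quantize_coeff` are hypothetical, and the rounding scheme (round to nearest, no saturation) is an assumption.

```c
/* Nonzero if 1 + a1*z^-1 + a2*z^-2 has both poles strictly inside the unit circle. */
int sos_is_stable(double a1, double a2)
{
    double abs_a1 = a1 < 0.0 ? -a1 : a1;
    double abs_a2 = a2 < 0.0 ? -a2 : a2;
    return (abs_a2 < 1.0) && (abs_a1 < 1.0 + a2);
}

/* Round a coefficient to the nearest multiple of 2^-frac_bits.
 * Saturation to a particular word length is not modeled in this sketch. */
double quantize_coeff(double c, int frac_bits)
{
    double scale = 1.0;
    long rounded;
    int i;
    for (i = 0; i < frac_bits; i++) {
        scale *= 2.0;
    }
    rounded = (long)(c * scale + (c >= 0.0 ? 0.5 : -0.5));
    return (double)rounded / scale;
}
```

As an illustration, a1 = -1.998 and a2 = 0.9985 pass the test in full precision, but with 8 fractional bits a2 rounds to exactly 1.0 and the section is no longer stable. This is the effect the stability slide warns about, and checking each section this way is what makes the SOS structure easy to validate.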

Determining How Coefficient Quantization Will Affect Your Filter

set quantization parameters


IIR Filtering Final Remarks


- IIR filters are more sensitive to choice of realization structure and data types than FIR filters due to feedback
  - Memory requirements
  - Time required to compute filter output
  - Accuracy with respect to the desired response
  - Stability
  - Potential for overflow/underflow
- fdatool can be useful for examining the tradeoffs before writing code

Some Things to Try


- In fdatool, design an IIR filter with the following specs:
  - Bandstop
  - First passband 0-2500Hz, 0dB nominal gain, 0.5dB max deviation
  - First transition band 2500-3500Hz
  - Stop band 3500-10500Hz, -20dB minimum suppression
  - Second transition band 10500-12500Hz
  - Second passband 12500-22050Hz, 0dB nominal gain, 0.5dB max deviation
  - Minimum filter order
- Explore DFII with and without Second Order Sections
- Try various coefficient quantizations including fixed point
- Implement your best filter in CCS
- Compare actual performance to the theoretical predictions
Some Interesting Applications of Real-Time DSP


- Fast Fourier Transform (FFT): Chapter 6
  - Example projects: DFT, FFT256C, FFTSinetable, FFTr2, FFTr4, FFTr4_sim, fastconvo, fastconvo_sim, graphicEQ
  - Note that TI provides optimized FFT functions (search for cfftr2_dit, cfftr2_dif, cfftr4_dif)
- Adaptive Filtering: Chapter 7
  - Example projects: Adaptc, adaptnoise, adaptnoise_2IN, adaptIDFIR, adaptIDFIRw, adaptIDIIR, adaptpredict, adaptpredict_2IN,

Tip: Making Interesting Waveforms in Matlab


Example: In-Phase and Quadrature Sinusoids
>> fs=44100; % set up sampling frequency
>> t=(0:1/fs:5)'; % time vector as a column (5 seconds long)
>> x=[sin(2*pi*1000*t) cos(2*pi*1000*t)]; % two columns: left = sin, right = cos
>> soundsc(x,fs); % play sound through sound card

Another example: white noise (in stereo)
>> L=length(t);
>> x=[randn(L,1) randn(L,1)];
>> soundsc(x,fs); % play sound through sound card

You can also save your sounds to .wav files with Matlab's wavwrite function. These .wav files can be burned to CD and played with conventional stereo equipment.

Workshop Day 2 Summary


What you learned today:
- How to profile code size and execution times.
- How data types and memory usage affect code execution times.
- How to reduce code size and execution time with CCS's optimizing compiler.
- How assembly language can be integrated into your projects.
- Basics of the TMS320C6713 architecture.
  - Fetch packets, execute packets, pipelining
  - Functional unit latency and delay slots
- How to design and implement IIR filters on the C6713
  - Realization structures
  - Quantization considerations
- Other applications for the C6713 DSK
  - FFT
  - Adaptive filtering

Workshop Day 2 Reference Material


- Chassaing textbook Chapters 3, 5-8
- CCS Help system
- SPRU509F.PDF: CCS v3.1 IDE Getting Started Guide
- C6713DSK.HLP: C6713 DSK specific help material
- SPRU198G.PDF: TMS320C6000 Programmer's Guide
- SPRU189F.PDF: TMS320C6000 CPU and Instruction Set Reference Guide
- Matlab fdatool help (>> doc fdatool)
- Other Matlab help (>> doc soundsc, >> doc wavwrite)

Latest documentation available at http://www.ti.com/sc/docs/psheets/man_dsp.htm

Some Things to Try


- Explore some of Chassaing's FFT and adaptive filtering projects in the myprojects directory
- Explore some of the reference literature (especially the Chassaing text and the CCS help system)
- Try a lab assignment in the ECE4703 real-time DSP course: http://spinlab.wpi.edu/courses/ece4703