Dspworkshop Part2 2006
D. Richard Brown III Associate Professor Worcester Polytechnic Institute Electrical and Computer Engineering Department [email protected] October 16-17, 2005
Day 2
How to estimate the execution time of your code. How to use the optimizing compiler to produce more efficient code. How data types and memory usage affect the efficiency of your code.
1. Start CCS with the C6713 DSK connected.
2. Debug -> Connect (or alt+C).
3. Open project, build it, and load .out file to the DSK.
4. Open the source file you wish to profile.
5. Set two breakpoints for the start/end of the code range you wish to profile.
6. Profile -> Clock -> Enable
7. Profile -> Clock -> View
8. Run to the first breakpoint.
9. Reset the clock.
10. Run to the second breakpoint.
11. Clock will show raw number of execution cycles between breakpoints.
Tip: You can save your breakpoints, probe points, graphs, and watch windows with File -> Workspace -> Save Workspace As
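The raw cycle count read from the profile clock can be turned into an execution time and compared with the time available per sample. The following is a minimal sketch, assuming the 225 MHz CPU clock of the C6713 DSK; the function names are illustrative and are not part of CCS.

```c
#define CPU_CLOCK_HZ 225000000.0   /* assumption: C6713 DSK CPU clock, 225 MHz */

/* Convert a raw cycle count from the profile clock to microseconds. */
double cycles_to_us(unsigned long cycles)
{
    return (double)cycles * 1.0e6 / CPU_CLOCK_HZ;
}

/* Number of CPU cycles available between two samples at rate fs_hz. */
double cycle_budget(double fs_hz)
{
    return CPU_CLOCK_HZ / fs_hz;
}

/* Returns 1 if code taking 'cycles' cycles fits in one sample period. */
int meets_realtime(unsigned long cycles, double fs_hz)
{
    return (double)cycles <= cycle_budget(fs_hz);
}
```

For example, at a (hypothetical) 8 kHz sampling rate the budget is 225000000 / 8000 = 28125 cycles per sample, so a filter that profiles at 5000 cycles leaves ample headroom.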
12. Select the Ranges tab.
13. Highlight the code you want to profile and drag it into the ranges window (hint: you can drag whole functions into this window).
14. Repeat for other ranges if desired.
15. Profile -> Viewer
16. Run (let it run for a minute or more).
17. Halt.
18. Observe profiling results in Profile Viewer window.
- Access count is the number of times that CCS profiled the function.
  - Note that the function was probably called more than 49 times. CCS only timed it 49 times.
- Inclusive average is the average number of cycles needed to run the function including any calls to subroutines.
- Exclusive average is the average number of cycles needed to run the function excluding any calls to subroutines.
Optimizing Compiler
- In this example, we get a 3x-4x improvement with "Speed Most Critical" and File (-o3) optimization.
- Optimization gains can be much larger, e.g. 20x.
- Breakpoint/clock profiling method may not work with compiler-optimized code.
- Profile -> Viewer method is known to be somewhat inaccurate when connected to real hardware (see "profiling limitations" in CCS help).
  - Accuracy is better when only one or two ranges are profiled.
  - Best accuracy is achieved by running a simulator.
Memory
- C6713 has 64kB internal RAM (L2 cache)
- DSK provides additional 16MB external RAM (SDRAM)
- Code location (.text in command file)
  - internal memory (fast)
  - external memory (slow, typically 2-4x worse)
- Data location (.data in command file)
  - internal memory (fast)
  - external memory (slow, depends on datatypes)

Data types
- Slowest execution is double-precision floating point
- Fastest execution is fixed point, e.g. short
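To make the data-type comparison concrete, the same dot product (the core of an FIR filter) can be written once with float data and once with 16-bit fixed point. This is only a sketch: the Q15 scaling, the function names, and the lack of saturation are assumptions made for illustration, and the code assumes that short is 16 bits and int is 32 bits, as on the C6713.

```c
/* Floating-point dot product of ntaps coefficients and samples. */
float fir_float(const float *h, const float *x, int ntaps)
{
    float acc = 0.0f;
    int i;
    for (i = 0; i < ntaps; i++)
        acc += h[i] * x[i];
    return acc;
}

/*
 * Fixed-point (Q15) version: a short value v represents v / 32768.
 * Assumes 16-bit short and 32-bit int. No saturation or rounding is done,
 * so the coefficients must be scaled to keep the sum in range.
 */
short fir_q15(const short *h, const short *x, int ntaps)
{
    int acc = 0;                        /* 32-bit accumulator in Q30 */
    int i;
    for (i = 0; i < ntaps; i++)
        acc += (int)h[i] * (int)x[i];   /* Q15 * Q15 = Q30 */
    return (short)(acc / 32768);        /* scale Q30 back to Q15 */
}
```

Profiling both versions with the same number of taps, with and without optimization, shows how much the data type alone contributes to the cycle count.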
[Linker command file listing: section names not recoverable. The first section is assigned to vecs and the remaining eleven sections to IRAM.]
Callouts on the listing: "Code goes here", "Data goes here".
- Addresses 00000000-0000FFFF are mapped to internal memory (IRAM). This is 64kB.
- External memory (CE0) is mapped to address range 80000000-80FFFFFF. This is 16MB.
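After moving code or data between memories, it can be useful to confirm where a symbol actually ended up, for example by looking at the address of a buffer in a watch window. The helper below is a small sketch that classifies an address using the two ranges quoted above; the names are hypothetical and anything outside those two ranges is simply reported as "other".

```c
/* Address ranges taken from the memory map described above. */
#define IRAM_END   0x0000FFFFUL   /* internal memory: 00000000-0000FFFF */
#define CE0_START  0x80000000UL   /* external memory: 80000000-80FFFFFF */
#define CE0_END    0x80FFFFFFUL

typedef enum { MEM_INTERNAL, MEM_EXTERNAL, MEM_OTHER } MemRegion;

/* Report which memory region an address belongs to. */
MemRegion classify_address(unsigned long addr)
{
    if (addr <= IRAM_END)
        return MEM_INTERNAL;                 /* IRAM starts at address 0 */
    if (addr >= CE0_START && addr <= CE0_END)
        return MEM_EXTERNAL;
    return MEM_OTHER;
}
```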
- Try profiling parts of your FIR filter code from Day 1 without optimization. Try both profiling methods.
- Rebuild your project under various optimization levels and try various settings from "size most critical" to "speed most critical".
- Compare profile results for no optimization and various levels of optimization.
- Change the data types in your FIR filter code and rebuild (with and without optimization) to see the effect on performance.
- Try moving the data and/or program to internal/external memory and profiling (you will need to modify the linker command file to do this).
- Contest: Who can make the most efficient 8th order bandpass filter (that works)?
Sometimes you have to take matters into your own hands... Three options:
1. Linear assembly (.sa)
   - Compromise between effort and efficiency
   - Typically more efficient than C
   - Assembler takes care of details like assigning functional units, registers, and parallelizing instructions
2. ASM statement in C code (.c)
   - asm("assembly code")
3. C-callable assembly function (.asm)
   - Full control of assigning functional units, registers, parallelization, and pipeline optimization
Basic concepts:
- Arguments are passed in via registers A4, B4, A6, B6, ... in that order. All registers are 32-bit.
- Result returned in A4 also.
- Return address of calling code (program counter) is in B3. Don't overwrite B3!
- Naming conventions:
  - In C code: label
  - In ASM code: _label (note the leading underbar)
- Accessing global variables in ASM:
  - .ref _variablename
            .def   _myfunc          ; allow calls from external
ACONSTANT   .equ   ...              ; declare constants
            .ref   _variablename    ; refer to a global variable
_myfunc:                            ; instructions go here
            B      B3               ; return (branch to addr B3)
                                    ; function output will be in A4
            NOP    5                ; pipeline flush
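From the C side, a C-callable assembly function looks like any other external function. The sketch below assumes a hypothetical two-argument version of myfunc. The host fallback, which simply adds its arguments, exists only so that the sketch compiles without the DSK toolchain and does not represent any routine from the workshop. _TMS320C6X is assumed here to be the macro predefined by TI's C6000 compiler.

```c
#ifdef _TMS320C6X
/* On the DSK: implemented in a .asm file under the label _myfunc. */
extern int myfunc(int a, int b);
#else
/* Hypothetical host stand-in so the sketch builds off-target. */
int myfunc(int a, int b)
{
    return a + b;
}
#endif

int call_myfunc(void)
{
    /* First argument goes in A4, second in B4; the result comes back in A4. */
    return myfunc(3, 4);
}
```

Note that the C code uses the name myfunc, while the assembly file must define and export _myfunc.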
Data path A
- Multiply operations (.M1)
- Logical and arithmetic operations (.L1)
- Branch, bit manipulation, and arithmetic operations (.S1)
- Loading/storing and arithmetic operations (.D1)

Data path B
- Multiply operations (.M2)
- Logical and arithmetic operations (.L2)
- Branch, bit manipulation, and arithmetic operations (.S2)
- Loading/storing and arithmetic operations (.D2)
- C6713 fetches 8 instructions at a time (256 bits).
- Definition: A "fetch packet" is a group of 8 instructions fetched at once.
- Coincidentally, C6713 has 8 functional units.
  - Ideally, all 8 instructions would be executed in parallel.
- This is not always possible, for example:
  - 3 multiplies (only two .M functional units)
  - Results of instruction 3 needed by instruction 4 (must wait for 3 to complete)
Execute Packets
- Definition: An "execute packet" is a group of (8 or fewer) consecutive instructions in one fetch packet that can be executed in parallel.

[Figure: one fetch packet divided into execute packet 1, execute packet 2, and execute packet 3]

- C compiler provides a flag to indicate which instructions should be run in parallel.
- You have to do this manually in Assembly using ||. See Chapter 3 of the Chassaing textbook.
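The way a fetch packet breaks into execute packets can be sketched in C. This sketch assumes that the flag mentioned above is the per-instruction "p-bit" described in TI's CPU reference guide: a p-bit of 1 means the next instruction runs in parallel with the current one (the effect of || in assembly), and a p-bit of 0 ends the execute packet. The function and array names are hypothetical.

```c
#define FETCH_PACKET_SIZE 8

/*
 * pbit[i] == 1: instruction i+1 executes in parallel with instruction i.
 * pbit[i] == 0: instruction i is the last one in its execute packet.
 * Writes the number of instructions in each execute packet to size[]
 * and returns the number of execute packets in the fetch packet.
 */
int split_fetch_packet(const int pbit[FETCH_PACKET_SIZE],
                       int size[FETCH_PACKET_SIZE])
{
    int npackets = 0;
    int count = 0;
    int i;

    for (i = 0; i < FETCH_PACKET_SIZE; i++) {
        count++;
        /* Assumed: an execute packet cannot cross the fetch packet
           boundary, so the last instruction always closes a packet. */
        if (pbit[i] == 0 || i == FETCH_PACKET_SIZE - 1) {
            size[npackets++] = count;
            count = 0;
        }
    }
    return npackets;
}
```

With p-bits 1,1,0,1,0,1,1,0 the fetch packet splits into three execute packets of 3, 2, and 3 instructions. All ones gives one fully parallel execute packet, and all zeros gives eight serial ones.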
1. Fetch
   a) PG: Program address Generate
   b) PS: Program address Send
   c) PW: Program address ready Wait
   d) PR: Program fetch packet Receive
2. Decode
   a) DP: Instruction DisPatch
   b) DC: Instruction DeCode
3. Execute
   a) 10 phases labeled E1-E10
   b) Fixed point processors have only 5 phases (E1-E5)
Remarks:
- At clock cycle 11, the pipeline is "full".
- There are no holes ("bubbles") in the pipeline in this example.
Remarks:
- Fetch packet n has 3 execute packets.
- All subsequent fetch packets have 1 execute packet.
- Notice the holes/bubbles in the pipeline caused by lack of parallelization.
- C62x/C64x have 5 execute phases (fixed point).
- Different types of instructions require different numbers of these phases to complete their execution.
  - Anywhere between 1 and all 10 phases
  - Most instructions tie up their functional unit for only one phase (E1)
- Results available after E1 (zero delay slots)
- Functional unit free after E1 (1 functional unit latency)
- Results available after E4 (3 delay slots)
- Functional unit free after E1 (1 functional unit latency)
- Results available after E10 (9 delay slots)
- Functional unit free after E4 (4 functional unit latency)
- Functional unit latency: How long must we wait for the functional unit to be free?
- Delay slots: How long must we wait for the result?
- General remarks:
  - Functional unit latency <= delay slots + 1
  - Strange results will occur in ASM code if you don't pay attention to delay slots and functional unit latency.
  - All problems can be resolved by "waiting" with NOPs.
  - Efficient ASM code tries to keep functional units busy all of the time.
  - Efficient code is hard to write (and follow).
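The bookkeeping behind delay slots and functional unit latency can be sketched in C. The three timing entries below are taken from the three cases shown earlier (0, 3, and 9 delay slots). The struct, the labels, and the function names are hypothetical, and issue cycles are counted from the cycle in which the instruction enters E1.

```c
typedef struct {
    const char *label;   /* illustrative label for the instruction class */
    int delay_slots;     /* extra cycles before the result can be used */
    int fu_latency;      /* cycles before the functional unit is free again */
} InstrTiming;

static const InstrTiming SINGLE_CYCLE = { "result after E1",  0, 1 };
static const InstrTiming FOUR_CYCLE   = { "result after E4",  3, 1 };
static const InstrTiming TEN_CYCLE    = { "result after E10", 9, 4 };

/* Earliest cycle at which an instruction that needs the result can issue. */
int result_ready_cycle(int issue_cycle, InstrTiming t)
{
    return issue_cycle + t.delay_slots + 1;
}

/* Earliest cycle at which the same functional unit can start new work. */
int unit_free_cycle(int issue_cycle, InstrTiming t)
{
    return issue_cycle + t.fu_latency;
}

/* NOP cycles to insert before a dependent instruction when only
   'useful_cycles' cycles of independent work are placed in between. */
int nops_needed(InstrTiming t, int useful_cycles)
{
    int n = t.delay_slots - useful_cycles;
    return n > 0 ? n : 0;
}
```

For the 9-delay-slot case issued in cycle 0, the functional unit is free again in cycle 4 but the result cannot be used until cycle 10. Efficient assembly code fills the cycles in between with independent instructions instead of NOPs.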
- Create a new ASM file.
- Call the ASM function from your main code.
- See Chassaing examples fircasm.pjt and fircasmfast.pjt for ideas.
- Profile your new FIR code and compare it to the code produced by the optimizing compiler.
Lunch Break
Advantages:
- Can achieve a desired frequency response with less memory and computation than FIR filters

Disadvantages:
- Can be unstable
- Affected more by finite-precision math due to feedback
Input/output relationship:
Transfer function:
Note that the filter is stable only if all of its poles (roots of the denominator) have magnitude less than 1.
- Quantization of coefficients (a's and b's) will move the poles. A stable filter in infinite precision may not be stable after coefficient quantization.
- Numerator of H(z) does not affect stability.
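The effect of coefficient quantization on stability can be checked numerically for a second-order denominator. The sketch below assumes the denominator is written as 1 + a1*z^-1 + a2*z^-2 (an assumed sign convention) and uses the standard "stability triangle" test, under which both poles lie inside the unit circle exactly when |a2| < 1 and |a1| < 1 + a2. The function names and the rounding rule are illustrative choices.

```c
/* Absolute value helper (avoids a dependency on the math library). */
static double dabs(double v)
{
    return v < 0.0 ? -v : v;
}

/* Returns 1 if both roots of z^2 + a1*z + a2 have magnitude less than 1. */
int sos_is_stable(double a1, double a2)
{
    return dabs(a2) < 1.0 && dabs(a1) < 1.0 + a2;
}

/* Round a coefficient to the nearest value with frac_bits fractional bits. */
double quantize(double c, int frac_bits)
{
    double scale = (double)(1L << frac_bits);
    long q = (long)(c * scale + (c >= 0.0 ? 0.5 : -0.5));
    return (double)q / scale;
}
```

For example, a1 = -1.98 and a2 = 0.9801 (a double pole at z = 0.99) passes the test. Rounded to 4 fractional bits, the coefficients become a1 = -2.0 and a2 = 1.0, which places the poles on the unit circle, and the test fails.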
In Matlab:
1. Design filter
   - Type: low pass, high pass, band pass, band stop, ...
   - Filter order N
   - Desired frequency response
2. Decide on a realization structure
3. Decide how coefficients will be quantized.
4. Compute coefficients

In CCS:
5. Decide how everything else will be quantized (input samples, output samples, result of multiplies, result of additions)
6. Write code to realize filter
7. Test filter and compare to theoretical expectations
- Structures can have different memory and computational requirements.
- All structures give the same behavior when the math is infinite precision.
- Structures can have very different behavior when the math is finite precision:
  - Stability
  - Accuracy with respect to the desired response
  - Potential for overflow/underflow
Direct Form I
Direct Form II
Note fewer delay elements (less memory) than DFI. Can prove that DFII has minimum number of delay elements.
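A Direct Form II filter can be sketched in C as follows. The sketch assumes a transfer function with numerator b0 + b1*z^-1 + ... + bN*z^-N and denominator 1 + a1*z^-1 + ... + aN*z^-N (an assumed sign convention). The struct, the ORDER macro, and the function name are hypothetical. Note that only N delay elements are stored, shared by the feedback and feedforward paths.

```c
#define ORDER 2   /* filter order N (illustrative value) */

typedef struct {
    float b[ORDER + 1];   /* numerator coefficients b0..bN */
    float a[ORDER + 1];   /* denominator coefficients, a[0] assumed to be 1 */
    float w[ORDER];       /* the N delay elements: w[n-1] .. w[n-N] */
} DF2Filter;

/* Process one input sample and return one output sample. */
float df2_step(DF2Filter *f, float x)
{
    float w0 = x;
    float y;
    int k;

    /* Feedback part: w[n] = x[n] - a1*w[n-1] - ... - aN*w[n-N] */
    for (k = 1; k <= ORDER; k++)
        w0 -= f->a[k] * f->w[k - 1];

    /* Feedforward part: y[n] = b0*w[n] + b1*w[n-1] + ... + bN*w[n-N] */
    y = f->b[0] * w0;
    for (k = 1; k <= ORDER; k++)
        y += f->b[k] * f->w[k - 1];

    /* Shift the delay line by one sample. */
    for (k = ORDER - 1; k > 0; k--)
        f->w[k] = f->w[k - 1];
    f->w[0] = w0;

    return y;
}
```

As a quick check, with b = {1, 0, 0} and a = {1, -0.5, 0} the filter implements y[n] = x[n] + 0.5*y[n-1], so a unit impulse produces the outputs 1, 0.5, 0.25, ...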
- Transfer function H(z) is factored into H1(z)H2(z)...HK(z), where each factor Hk(z) has a quadratic denominator and numerator.
- Each quadratic factor is called a "Second Order Section" (SOS).
- Each SOS is realized in DFII.
- The results from each SOS are then passed to the next SOS.
- Low memory requirements (same as DFII)
- Easy to check the stability of each SOS
- Can write one DFII-SOS filter function and reuse it for any length filter
- Tends to be less sensitive to finite precision math than DFI or DFII. Why?
  - Dynamic range of coefficients in each SOS is smaller
  - Coefficient quantization only affects local poles/zeros
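The idea of writing one DFII-SOS function and reusing it for any filter length can be sketched as follows. Each section is assumed to have numerator b0 + b1*z^-1 + b2*z^-2 and denominator 1 + a1*z^-1 + a2*z^-2 (an assumed sign convention); the struct and function names are hypothetical.

```c
typedef struct {
    float b0, b1, b2;   /* numerator coefficients */
    float a1, a2;       /* denominator coefficients (a0 assumed to be 1) */
    float w1, w2;       /* the two delay elements of this section */
} Sos;

/* One DFII second-order section: one sample in, one sample out. */
float sos_step(Sos *s, float x)
{
    float w0 = x - s->a1 * s->w1 - s->a2 * s->w2;
    float y  = s->b0 * w0 + s->b1 * s->w1 + s->b2 * s->w2;
    s->w2 = s->w1;
    s->w1 = w0;
    return y;
}

/* Cascade: the output of each section is the input of the next one. */
float sos_cascade_step(Sos *sections, int nsections, float x)
{
    int k;
    for (k = 0; k < nsections; k++)
        x = sos_step(&sections[k], x);
    return x;
}
```

A filter of order 2K then needs only an array of K sections, and each section can be checked for stability separately from its own a1 and a2.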
IIR filters are more sensitive to choice of realization structure and data types than FIR filters due to feedback. The choice affects:
- Memory requirements
- Time required to compute filter output
- Accuracy with respect to the desired response
- Stability
- Potential for overflow/underflow
fdatool can be useful for examining the tradeoffs before writing code
- Bandstop
- First passband 0-2500Hz, 0dB nominal gain, 0.5dB max deviation
- First transition band 2500-3500Hz
- Stop band 3500-10500Hz, -20dB minimum suppression
- Second transition band 10500-12500Hz
- Second passband 12500-22050Hz, 0dB nominal gain, 0.5dB max deviation
- Minimum filter order

- Explore DFII with and without Second Order Sections
- Try various coefficient quantizations including fixed point
- Implement your best filter in CCS
- Compare actual performance to the theoretical predictions
Example projects:
- DFT, FFT256C, FFTSinetable, FFTr2, FFTr4, FFTr4_sim, fastconvo, fastconvo_sim, graphicEQ
- Note that TI provides optimized FFT functions (search for cfftr2_dit, cfftr2_dif, cfftr4_dif)
Example projects:
- Fetch packets, execute packets, pipelining
- Functional unit latency and delay slots
- How to design and implement IIR filters on the C6713
  - Realization structures
  - Quantization considerations
- Other applications for the C6713 DSK
  - FFT
  - Adaptive filtering
- Chassaing textbook Chapters 3, 5-8
- CCS Help system
- SPRU509F.PDF: CCS v3.1 IDE Getting Started Guide
- C6713DSK.HLP: C6713 DSK specific help material
- SPRU198G.PDF: TMS320C6000 Programmer's Guide
- SPRU189F.PDF: TMS320C6000 CPU and Instruction Set Reference Guide
- Matlab fdatool help (>> doc fdatool)
- Other Matlab help (>> doc soundsc, >> doc wavwrite)
Latest documentation available at http://www.ti.com/sc/docs/psheets/man_dsp.htm
- Explore some of Chassaing's FFT and adaptive filtering projects in the "myprojects" directory.
- Explore some of the reference literature (especially the Chassaing text and the CCS help system).
- Try a lab assignment in the ECE4703 real-time DSP course: http://spinlab.wpi.edu/courses/ece4703