WiiProfiler v3.0
Steve Rabin
Principal Software Engineer
Software Development Support Group
Agenda
WiiProfiler introduction
– What it provides
WiiProfiler Methodology
– Game integration
– V2.0 features
WiiProfiler v3.0 features
– Sampling based on performance counters
– Instrumenting using performance counters
– Tracking user data
– Code coverage
Introduction
Measures CPU function performance
– How much time spent in each function
– Cycles, instructions, branches, cache misses
– Function call tree
– Function code coverage
– Frame rate performance
Free tool created exclusively for Wii
– Version 1.0 (May 2007)
– Version 2.0 (April 2008)
– Version 3.0 (Open BETA now, Final Summer 2009)
Requirements
– NDEV and minor programmer integration
WiiProfiler v1.0
WiiProfiler v2.0
WiiProfiler v3.0
WiiProfiler Design Methodology
Extremely fast and easy to integrate
– Only a couple required function calls to library functions
(10 minute integration)
Extremely fast and easy to operate
– Minimalist interface that just works
– Deep functionality with little cognitive overhead
Effortless visual exploration of data
– Use graphs to maximize comprehension
– Frame-based graphs show problem frames
– Easy to compare and interpret
Methodology:
Fast and easy to integrate
Code Integration: Step 1
Link against "wiiprofiler.a"
Code Integration: Step 2
Include the header file:
#include <revolution/wiiprofiler.h>
Code Integration: Step 3
WIIPROFILER_Init(void * bufferMEM2,u32 sizeInBytes,
BOOL doesGameWaitForRetrace);
Call init function with a MEM2 buffer
– At least 8MB, as large as 100MB
– Larger buffer = longer profiling
Answer the question:
– Does your main loop wait for the vertical retrace?
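A minimal sketch of Step 3, assuming the game hands the profiler whatever MEM2 space it can spare. The WIIPROFILER_Init stub is a host-side stand-in so the sketch compiles off-target; its void return type, the u32/BOOL typedefs, and the helper name StartProfiler are assumptions, not SDK definitions. On an NDEV you would include <revolution/wiiprofiler.h> and drop the stub.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for SDK types; on target these come from the SDK headers. */
typedef uint32_t u32;
typedef int BOOL;
#define TRUE 1
#define FALSE 0

#define MB (1024u * 1024u)
#define PROFILER_MIN_BYTES (8u * MB)    /* slide: at least 8MB */
#define PROFILER_MAX_BYTES (100u * MB)  /* slide: as large as 100MB */

/* Host-side stub that only records its arguments.
   The void return type is an assumption. */
void *g_initBuffer;
u32 g_initSize;
BOOL g_initWaits;

void WIIPROFILER_Init(void *bufferMEM2, u32 sizeInBytes,
                      BOOL doesGameWaitForRetrace)
{
    g_initBuffer = bufferMEM2;
    g_initSize = sizeInBytes;
    g_initWaits = doesGameWaitForRetrace;
}

/* Hypothetical helper: clamp the buffer size to the documented range,
   then initialize. Returns the size handed to the profiler, or 0 if
   the buffer is missing or smaller than the 8MB minimum. */
u32 StartProfiler(void *mem2Buffer, u32 availableBytes, BOOL waitsForRetrace)
{
    u32 size = availableBytes;

    if (mem2Buffer == NULL || size < PROFILER_MIN_BYTES)
        return 0;
    if (size > PROFILER_MAX_BYTES)
        size = PROFILER_MAX_BYTES;

    WIIPROFILER_Init(mem2Buffer, size, waitsForRetrace);
    return size;
}
```

Clamping rather than failing on an oversized buffer is a design choice of this sketch; a larger buffer only buys longer profiling.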
Code Integration: Step 4
while(true)
{ //Top of main loop
    WIIPROFILER_MarkFrameBegin(); //Add this
    //Game code, etc.
}
Methodology:
Fast and easy to operate
Only Two Choices
Connect to NDEV
Open a profile
Demo:
Fast and Easy to Operate
Statistical sampling
– Various rates available, Simple vs Full
– Accuracy vs Overhead/Size tradeoff
Start and Stop
Open and Save
Settings and right click menus
Methodology:
Effortless Visualization
Demo:
Effortless Visualization
Functions
– Sparklines
– Self vs Total
– Hide insignificant
Call tree exploration
– Reverse call tree
Statistical graph
– Click functions
– Zoom, scroll, choose frame
– Highlight Band
– Range and average
Demo:
Effortless Visualization
Frame rate graph
– Examine frame rate spikes
– Events
Resort functions (new in v3.0)
– Sort based on selected frame
– Sort based on average (default)
– Sort alphabetically
– Continuously resort
Performance Counter
Factoid Theater
4 CPU performance counters in Broadway CPU
– Reset, start, stop, and read in code
Reset counters
– PPCMtpmc1(0); PPCMtpmc2(0); PPCMtpmc3(0); PPCMtpmc4(0);
Start counters
– PPCMtmmcr0( <counter1> | <counter2> );
– PPCMtmmcr1( <counter3> | <counter4> );
Stop counters
– PPCMtmmcr0( 0 );
– PPCMtmmcr1( 0 );
Read counters
– PPCMfpmc1(); PPCMfpmc2(); PPCMfpmc3(); PPCMfpmc4();
Performance Counter
Factoid Theater
Performance counter event examples (~60 total)
– PMC1_CYCLE # processor cycles
– PMC1_L2_HIT # of accesses that hit L2
– PMC1_L1_MISS # of accesses that miss L1
– PMC1_Bx_UNRESOLVED # of branches unresolved
– PMC1_Bx_STALL_CYCLE # of cycles stalled due to branches
– PMC2_CYCLE # processor cycles
– PMC2_INSTRUCTION # of instructions completed
– PMC2_IC_MISS # of L1 instruction cache misses
– PMC2_L1_CASTOUT # of L1 castouts to L2
– PMC2_Bx_FALL_THROUGH # of fall through branches
Select one PMC1, PMC2, PMC3, PMC4 at a time
Bracket code (Reset, Start, Stop) and measure results
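The bracket pattern (reset, start, stop, read) can be sketched as follows. This is a host-side simulation, not target code: the PPCM* functions are stubbed with assumed signatures, the event selector values are placeholders (the real ones come from the SDK headers), and SimulatedWork and MeasureWork are hypothetical names standing in for the code being measured.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u32;  /* stand-in for the SDK typedef */

/* Placeholder event selectors; these values are hypothetical. */
#define PMC1_CYCLE       0x1u
#define PMC2_INSTRUCTION 0x2u

/* Host-side simulation of the counter registers (signatures assumed). */
static u32 g_pmc1, g_pmc2, g_mmcr0;
void PPCMtpmc1(u32 v)  { g_pmc1 = v; }
void PPCMtpmc2(u32 v)  { g_pmc2 = v; }
void PPCMtmmcr0(u32 v) { g_mmcr0 = v; }
u32  PPCMfpmc1(void)   { return g_pmc1; }
u32  PPCMfpmc2(void)   { return g_pmc2; }

/* Simulated workload: the counters advance only while MMCR0 is non-zero,
   mimicking counters that have been started. */
static void SimulatedWork(u32 instructions, u32 cyclesPerInstruction)
{
    if (g_mmcr0 != 0) {
        g_pmc2 += instructions;
        g_pmc1 += instructions * cyclesPerInstruction;
    }
}

typedef struct {
    u32 cycles;
    u32 instructions;
} Measurement;

/* Bracket a piece of work: reset, start, run, stop, read. */
Measurement MeasureWork(u32 instructions, u32 cyclesPerInstruction)
{
    Measurement m;

    PPCMtpmc1(0);                               /* reset counters */
    PPCMtpmc2(0);
    PPCMtmmcr0(PMC1_CYCLE | PMC2_INSTRUCTION);  /* start counters */

    SimulatedWork(instructions, cyclesPerInstruction);

    PPCMtmmcr0(0);                              /* stop counters */
    m.cycles = PPCMfpmc1();                     /* read counters */
    m.instructions = PPCMfpmc2();
    return m;
}
```

Stopping the counters before reading keeps the read calls themselves out of the measurement.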
Performance Counters in
WiiProfiler v3.0
Use performance counters to
– Statistically sample functions
– Instrument individual functions
Performance Counter
Statistical Sampling
Sample by
– Mispredicted branches
– Undecided branches
– Floating point instructions
– L1 or L2 instruction misses
– L1 or L2 data misses
– L1 writes to L2
– L2 writes to memory
[Diagram: CPU; Data L1 Cache (32KB) and Instruction L1 Cache (32KB); Combined L2 Cache (256KB); Main Memory: MEM1 (24MB), MEM2 (64MB)]
Performance Counter
Statistical Sampling
Choose a sampling rate
– Between every 10 and every 100K
Too often (every 10 to 100)
– Large overhead
– Can be less accurate (cache pollution)
– Fills up buffer fast
Often (every 100 to 1K)
– Medium overhead
– Good accuracy
Less often (every 1K to 100K)
– Least overhead
– Most accurate overall (less accurate per frame)
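The tradeoff above comes down to simple arithmetic, sketched here under stated assumptions. The bytes-per-sample figure is hypothetical (the slides do not give the size of a stored sample), and both function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Samples recorded in one frame when a sample is taken every
   `interval` occurrences of the chosen event. */
uint64_t SamplesPerFrame(uint64_t eventsPerFrame, uint32_t interval)
{
    assert(interval > 0);
    return eventsPerFrame / interval;
}

/* Number of whole frames that fit in the profiling buffer.
   bytesPerSample is a hypothetical figure supplied by the caller.
   Returns UINT64_MAX when no samples are taken at all. */
uint64_t FramesUntilBufferFull(uint64_t bufferBytes,
                               uint32_t bytesPerSample,
                               uint64_t samplesPerFrame)
{
    assert(bytesPerSample > 0);
    if (samplesPerFrame == 0)
        return UINT64_MAX;
    return bufferBytes / (samplesPerFrame * bytesPerSample);
}
```

With one million events per frame, moving the interval from 100 to 10,000 cuts the samples per frame from 10,000 to 100, so the same buffer lasts roughly a hundred times longer while each frame is described by far fewer samples.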
Instrumenting Functions
Choose a class of performance counters
– Cycles only
– Cycles and instructions
– Branch prediction performance
– Why branch prediction failed
– Cache and memory performance
– L1 cache performance
– L2 cache performance
– Outbound cache writes
Explanation of selected class shown in big gray box
Decide:
– Whether or not to also statistically sample by time
Instrumenting Functions:
Branch Prediction Performance
Performance counters selected
– PMC1_Bx_UNRESOLVED
– PMC2_Bx_FALL_THROUGH
– PMC3_Bx_TAKEN
– PMC4_Bx_MISSED
Data teased out from these 4 counters
– % of correctly predicted branches
– % of incorrectly predicted branches
– Correctly predicted branches
– Incorrectly predicted branches
– Skipped branches based on prediction
– Taken branches based on prediction
– Branches predicted by hardware
– Branches unconditionally taken
– All branches
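One plausible way to derive a few of these figures from the four counters is sketched below. The formulas are assumptions, not WiiProfiler's documented calculation: they treat unresolved branches as the ones the hardware had to predict, missed branches as the incorrect predictions, and taken plus fall-through as all branches. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t unresolved;   /* PMC1_Bx_UNRESOLVED   */
    uint32_t fallThrough;  /* PMC2_Bx_FALL_THROUGH */
    uint32_t taken;        /* PMC3_Bx_TAKEN        */
    uint32_t missed;       /* PMC4_Bx_MISSED       */
} BranchCounters;

typedef struct {
    uint32_t allBranches;      /* taken + fall-through (assumed) */
    uint32_t predicted;        /* branches predicted by hardware */
    uint32_t correct;          /* correctly predicted branches   */
    uint32_t incorrect;        /* incorrectly predicted branches */
    double   percentCorrect;
    double   percentIncorrect;
} BranchStats;

/* Derive summary statistics under the assumptions stated above. */
BranchStats DeriveBranchStats(BranchCounters c)
{
    BranchStats s;

    assert(c.missed <= c.unresolved);

    s.allBranches = c.taken + c.fallThrough;
    s.predicted   = c.unresolved;
    s.incorrect   = c.missed;
    s.correct     = c.unresolved - c.missed;

    /* Guard against division by zero when nothing was predicted. */
    s.percentIncorrect = s.predicted ? 100.0 * s.incorrect / s.predicted : 0.0;
    s.percentCorrect   = s.predicted ? 100.0 - s.percentIncorrect : 0.0;
    return s;
}
```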
Instrumenting Functions:
L1 Cache Performance
Performance counters selected
– PMC1_L1_MISS
– PMC2_IC_MISS
– PMC3_DC_MISS
– PMC4_CYCLE
Data teased out from these 4 counters
– Cycles
– Cycles waiting for memory
– Instruction not found in L1
– Data not found in L1
– Memory not found in L1
– Average cycles waiting for memory
– % of time waiting for memory
Instrumenting Functions:
Selecting Functions
Up to 10 functions profiled at a time
3 ways to select a function
– Choose a Self or Total function
– Drop down list of all game functions
– Choose a function from Code Coverage
Data captured is similar to "Total"
– Function call and child calls
Instrumenting Functions:
Profile and Explore
# function calls tracked
# recursive calls tracked
Performance counters
– Total count for performance counter
– Range per frame (max, ave, min)
– Raw call data (might graph slowly)
Helpers
– Expand top level
– Auto-select similar
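The per-frame range (max, ave, min) is a straightforward reduction over per-frame counter totals. A minimal sketch, assuming the totals are already available as an array; FrameRange and ComputeFrameRange are illustrative names, not part of the tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t min;
    uint32_t max;
    double   ave;
} FrameRange;

/* Reduce per-frame counter totals to their minimum, maximum and average. */
FrameRange ComputeFrameRange(const uint32_t *perFrame, size_t frameCount)
{
    FrameRange r;
    uint64_t sum = 0;  /* 64-bit so many large frames cannot overflow */
    size_t i;

    assert(perFrame != NULL && frameCount > 0);

    r.min = perFrame[0];
    r.max = perFrame[0];
    for (i = 0; i < frameCount; ++i) {
        if (perFrame[i] < r.min) r.min = perFrame[i];
        if (perFrame[i] > r.max) r.max = perFrame[i];
        sum += perFrame[i];
    }
    r.ave = (double)sum / (double)frameCount;
    return r;
}
```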
Tracking User Data
Track any data you want in code
– Track floating point values
WIIPROFILER_TrackValue(name, value);
– Will track multiple values per frame
WIIPROFILER_TrackAccumulatedValue(name, value);
– Will track one accumulated value per frame
WiiProfiler on PC
– Appears in Instrumented tab
– Graphs in Instrumented Graph tab
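A sketch of how the two tracking calls might sit in game code. The stubs are host-side stand-ins so the example compiles off-target: the parameter types (const char *, float), the void return types, and the way the accumulating stub sums into a single running value are assumptions based on the bullets above. UpdateEnemies and the value names are hypothetical.

```c
#include <assert.h>

#define MAX_RECORDS 64

typedef struct {
    const char *name;
    float value;
} Record;

/* Host-side recording used by the stubs below. */
Record g_records[MAX_RECORDS];
int g_recordCount;
float g_accumulated;

/* Stub: every call becomes its own data point within the frame. */
void WIIPROFILER_TrackValue(const char *name, float value)
{
    if (g_recordCount < MAX_RECORDS) {
        g_records[g_recordCount].name = name;
        g_records[g_recordCount].value = value;
        ++g_recordCount;
    }
}

/* Stub: calls are summed into one value (a single accumulator is kept
   here for brevity; the name is ignored). */
void WIIPROFILER_TrackAccumulatedValue(const char *name, float value)
{
    (void)name;
    g_accumulated += value;
}

/* Hypothetical game code: one distance sample per enemy, plus a running
   count of enemies updated this frame. */
void UpdateEnemies(const float *distances, int count)
{
    int i;
    for (i = 0; i < count; ++i) {
        WIIPROFILER_TrackValue("EnemyDistance", distances[i]);
        WIIPROFILER_TrackAccumulatedValue("EnemiesUpdated", 1.0f);
    }
}
```

TrackValue suits quantities where each call matters on its own; TrackAccumulatedValue suits counts and totals where only the per-frame sum is interesting.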
Code Coverage
During a profile (or over multiple)
– Which functions get called
– Which functions don't get called
Filter
– Exclude SDK and platform libraries
– Exclude functions with certain prefixes
– Include functions with certain prefixes
Reset button
Instrument button
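Prefix filtering of the coverage list can be sketched as below. This only illustrates the idea and is not the tool's implementation: the precedence rule (exclusions always win, and a non-empty include list acts as a whitelist) is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* True if `name` begins with `prefix`. */
static int HasPrefix(const char *name, const char *prefix)
{
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* True if `name` begins with any of the given prefixes. */
static int MatchesAny(const char *name,
                      const char *const *prefixes, size_t count)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        if (HasPrefix(name, prefixes[i]))
            return 1;
    }
    return 0;
}

/* Returns 1 if the function should appear in the coverage list.
   Assumed policy: an exclude match always hides the function; if the
   include list is non-empty, only functions matching it are shown. */
int IsFunctionShown(const char *name,
                    const char *const *include, size_t includeCount,
                    const char *const *exclude, size_t excludeCount)
{
    assert(name != NULL);

    if (MatchesAny(name, exclude, excludeCount))
        return 0;
    if (includeCount == 0)
        return 1;
    return MatchesAny(name, include, includeCount);
}
```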
WiiProfiler v3.0 Release
Open BETA for next 1-2 months
– Sign up and we'll send it to you:
https://www.warioworld.com/wii/wiiprofiler
Final release v3.0 early Summer
– More robust communications layer
– Instrumenting functions
Allow RSO and REL functions
Remove interrupts from data
WiiProfiler Summary
Statistical sampling profiler
– Time and performance counters
Instrument functions
– Using performance counters
Track and graph arbitrary data
Function-based code coverage
Questions?
Ask me after the presentation
Or e-mail
[email protected]