0% found this document useful (0 votes)
57 views29 pages

06 Profiling

This document summarizes profiling and code optimization techniques for improving program performance. It discusses Amdahl's Law, profiling to identify hot spots, standard compiler optimizations like common subexpression elimination and dead code elimination, and architectural optimizations like avoiding function calls. The goal is to spend time optimizing the parts of code responsible for 80% of execution time using these different profiling and optimization strategies.

Uploaded by

api-3726520
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
57 views29 pages

06 Profiling

This document summarizes profiling and code optimization techniques for improving program performance. It discusses Amdahl's Law, profiling to identify hot spots, standard compiler optimizations like common subexpression elimination and dead code elimination, and architectural optimizations like avoiding function calls. The goal is to spend time optimizing the parts of code responsible for 80% of execution time using these different profiling and optimization strategies.

Uploaded by

api-3726520
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPT, PDF, TXT or read online on Scribd
You are on page 1/ 29

Profiling & Code Optimization

Lecture 6

Introduction to Embedded Systems


Administrivia
• Lecture first half
• Quiz #1 for the second half of today’s lecture

Introduction to Embedded Systems


Summary of Previous Lecture
• Overview of the ARM Debug Monitor
• Loading a Program
• The ARM Image Format
• What happens on program startup?

Introduction to Embedded Systems


Outline of This Lecture
• Profiling
– Amdahl’s Law
– The 80/20 rule
– Profiling in the ARM environment

• Improving program performance


– Standard compiler optimizations
– Aggressive compiler optimizations
– Architectural code optimizations

Introduction to Embedded Systems


Quote of the Day

I haven’t failed. I’ve found 10,000 ways that won’t work.


– Benjamin Franklin

Introduction to Embedded Systems


Profiling and Benchmark Analysis
• Problem: You're given a program's source code (which someone else
wrote) and asked to improve its performance by at least 20%

• Where do you begin?


– Look at source code and try to find inefficient C code
– Try rewriting some of it in assembly
– Rewrite using a different algorithm
– (Remove random portions of the code) 

Introduction to Embedded Systems


Gene Amdahl
• One of the original architects of
the IBM 360 mainframe series

• Founded four companies


– Amdahl Corporation
– Trilogy Systems (Part of Elxsi)
– Andor Systems
– Commercial Data Servers (CDS)

• A relatively few sequential


instructions might have a limiting
factor on program speedup such
that adding more processors may
not make the program run faster.

Introduction to Embedded Systems


Amdahl’s Law

Introduction to Embedded Systems


Profiling and Benchmark Analysis (cont’d)
• Most important question ...
– Where is the program spending most of its time?

• Amdahl's Law
– The performance improvement gained from using some faster mode of
execution is limited by the fraction of the total time the faster mode can be
used

• Example:

Optimizable
2x Speedup

Unoptimizable Unoptimizable

Introduction to Embedded Systems


Profiling and Benchmark Analysis (cont’d)
• How do we figure out where a program is spending its time?

– If we could count every static instruction, we would know which


routines (functions) were the biggest
• Big deal, large functions that aren't executed often don't really
matter

– If we could count every dynamic instruction, we would know which


routines executed the most instructions
• Excellent! It tells us the “relative importance” of each function
• But doesn't account for memory system (stalls)

– If we could count how many cycles were spent in each routine, we


would know which routines took the most amount of time

Introduction to Embedded Systems


Profiling
• Profiling: collecting statistics from example executions
– Very useful for estimating importance of each routine
– Common profiling approaches:
• Instrument all procedure call/return points (expensive: e.g., 20% overhead)
• Sampling PC every X milliseconds ­­ so long as program run is significantly
longer than the sampling period, the accuracy of profiling is pretty good
– Usually results in output such as
Routine % of Execution Time
function_a 60%
function_b 27%
function_c 4%
...
function_zzz 0.01%
– Often over 80% of the time spent in less than 20% of the code (80/20 rule)
– Can now do more accurate profiling with on­chip counters and analysis
tools
• Alpha, Pentium, Pentium Pro, PowerPC
• DEC Atom analysis tool
• Both are covered in Advanced Computer Architecture courses
Introduction to Embedded Systems
Introduction to Embedded Systems
Timing execution with armsd
• The simulator simulates every cycle
– Can gather very accurate timings for each function

• Run the simulator to determine total time


– Section 4.8 of the ARM Developer Suite AXD and armsd Debugger’s Guide

• Compiler can optimize for speed


prompt> armcc ­Otime ­o sort sorts.c

• Can also optimize for size


prompt> armcc ­Ospace ­o sort sorts.c

• Re­run the simulator to determine new total time


– new time is 2,059,629 µsecs ­­ an improvement of 4.5% (compared to ­g)

Introduction to Embedded Systems


Profiling with armsd
• No compile­time options needed

• Run the simulator to profile, capturing callgraph data


prompt> armsd
armsd: load/callgraph sorts
armsd: ProfOn
armsd: go
armsd: ProfWrite sorts.prf
armsd: quit
prompt> armprof ­Parent sorts.prf > profile

• To profile for only samples, skip the “/callgraph” portion


– avoids the 20% overhead (in this example)

Introduction to Embedded Systems


armprof output
Name cum% self% desc% calls
main 96.4% 0.16% 95.88% 0
qsort 0.44% 0.75% 1
_printf 0.00% 0.00% 3
clock 0.00% 0.00% 6
_sprintf 0.34% 3.56% 1000
randomise 0.12% 0.69% 1
hell_sort 1.59% 3.43% 1
insert_sort 19.91% 59.44% 1
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
main 19.91% 59.44% 1
insert_sort 79.35% 19.91% 59.44% 1
strcmp 59.44% 0.00% 243432
­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
qs_string_compare 3.17% 0.00% 13021
shell_sort 3.43% 0.00% 14059
insert_sort 59.44% 0.00% 243432
strcmp 66.05% 66.05% 0.00% 270512

Introduction to Embedded Systems


Optimizing “sorts”
• Almost 60% of time spent in strcmp called by insert_sort
• strcmp compares two strings and returns int
– 0 if equal, negative if first is ``less than'' second, positive otherwise
• Replace “strcmp(a,b)” call with some initial compares
if (a[0] < b[0]) {
result is neg
}
if (a[0] == b[0]) {
if (a[1] < b[1]) {
result is neg
}
if (a[1] == b[1]) {
if (strcmp(a,b) <= 0) {
result is neg or zero
}
}
}

• Result of this change is 20% reduction in execution time


– Avoids some procedure call overheads (in­lining)
– Avoids some loop control overheads (loop unrolling)
– Handles common cases efficiently and other cases correctly

Introduction to Embedded Systems


Improving Program Performance
• Compiler writers try to apply several standard optimizations
– Do not always succeed

• Compiler writers sometimes apply aggressive optimizations


– Often not “informed” enough to know that change will help rather than hurt

• Optimizations based on specific architecture/implementation


characteristics can be very helpful
– Much harder for compiler writers because it requires multiple, generally
very different, “back­end” implementations

• How can one help?


– Better code, algorithms and data structures (of course)
– Re­organize code to help compiler find opportunities for improvement
– Replace poorly optimized code with assembly code (i.e., bypass compiler)

Introduction to Embedded Systems


Standard Compiler Optimizations
• Common Sub­expression Elimination
– Formally, “An occurrence of an expression E is called a common sub-
expression if E was previously computed, and the values of variables in E
have not changed since the previous computation.”
– You can avoid re­computing the expression if we can use the previously
computed one.
– Benefit: less code to be executed

b: b:
t6 = 4 * i t6 = 4* i
x = a[t6] x = a[t6]
t7 = 4 * i t8 = 4 * j
t8 = 4 * j t9 = a[t8]
t9 = a[t8] a[t6] = t9
a[t7] = t9 a[t8] = x
t10 = 4 * j goto b
a[t10] = x
Before After
goto b

Introduction to Embedded Systems


Standard Compiler Optimizations
• Dead­Code Elimination
– If code is definitely not going to be executed during any run of a program,
then it is called dead code and can be removed.
– Example:
debug = 0;
...
if (debug){
print .....
}
– You can help by using ASSERTs and #ifdefs to tell the compiler about
dead code
• It is often difficult for the compiler to identify dead code itself

Introduction to Embedded Systems


Standard Compiler Optimizations (con't)
• Induction Variables and Strength Reduction
– A variable X is called an induction variable of a loop L if every time the
variable X changed value, it is incremented or decremented by some
constant
– When there are 2 or more induction variables in a loop, it may be possible to
get rid of all but one
– It is also frequently possible to perform strength reduction on induction
variables
• the strength of an instruction corresponds to its execution cost
– Benefit: fewer and less expensive operations

t4 = 0 t4 = 0
label_XXX label_XXX
j = j + 1 t4 += 4
t4 = 4 * j t5 = a[t4]
t5 = a[t4] if (t5 > v) goto label_XXX
if (t5 > v) goto label_XXX

Before After

Introduction to Embedded Systems


Aggressive Compiler Optimizations
• In­lining of functions
– Replacing a call to a function with the function's code is called “in­lining”
– Benefit: reduction in procedure call overheads and opportunity for additional code
optimizations
– Danger: code bloat and negative instruction cache effects
– Appropriate when small and/or called from a small number of sites
MOV r0, r4 ; r4 ­­> r0 (param 1) ADD r5, r4, #4
MOV r1, #4 ; 4 ­­> r1 (param 2) SWI 0x11
BL c_add ; call c_add
MOV r5, r0 ; r0 (result) ­­> r5
SWI 0x11 ; terminate
c_add
MOV r12, r13 ; save sp
STMDB r13!, {r0,r1,r11,r12,r14,pc} ; save regs
SUB r11, r12, #4 ; (sp ­ 4) ­­> r11
MOV r2, r0 ; param 1 ­­> r2
ADD r3, r2, r1 ; param 1 + param 2 --> r3
MOV r0, r3 ; move result to r0
LDMDB r11, {r11, r13, pc} ; restore regs
Before After

Introduction to Embedded Systems


Aggressive Compiler Optimizations (2)
• Loop Unrolling
– Doing multiple iterations of work in each iteration is called “loop unrolling”
– Benefit: reduction in looping overheads and opportunity for more code opts.
– Danger: code bloat, negative instruction cache effects, and non-integral loop div.
– Appropriate when small and/or called from small number of sites
MOV r4, #0
MOV r4, #0
sym1:CMP r4, #4
sym1: CMP r4, #0x10 BLT sym3
BLT sym3 B sym4
B sym4 sym2:ADD r4, r4, #1
sym2: B sym1
ADD r4, r4, #1 sym3: LDR r1, [r13, r4, lsl #2]
B sym1 ADD r0, r13, r4, lsl #2 1
sym3: LDR r0, [r13, r4, lsl #2] LDR r0, [r0, #4] 2
ADD r1, r1, r0
ADD r5, r0, r5 ADD r0, r13, r4, lsl #2 3
B sym2 LDR r0, [r0, #8]
sym4: ADD r1, r1, r0
ADD r0, r13, r4, lsl #2 4
LDR r0, [r0, #0xc]
Before After ADD r6, r1, r0
B sym2
Loop in sym3 is unrolled 4 times sym4:

Introduction to Embedded Systems


Introduction to Embedded Systems
Architectural/Code Optimizations
• Often, it is important to understand the architecture's implementation in
order to effectively optimize code
– Much more difficult for compilers to do because it requires a different
compiler back­end for every implementation

• One example of this is the ARM barrel shifter


– Can convert Y * Constant into series of adds and shifts
• Y*9=Y*8+Y*1
• Assume R1 holds Y and R2 will hold the result
– ADD R2, R1, R1, LSL #3 ; LSL #3 is same as * by 8

• Another example is the ARM 7500 write buffer specifics

Introduction to Embedded Systems


ARM Path to Memory
• Normally, a STR will
Address Register
write data directly to
memory
Addr Incrementer
• Example Incrementer Bus
– STR r1, SP!
– Writes contents of r1 to ALU Bus Register Bank
memory
– Requires n cycles,
A Bus Barrel Shifter
where n is the time B Bus
necessary to access 32­bit ALU
memory (typically 5 ­
100 cycles)
– Very costly to Read Data/ Instr Reg
Mem Addr Register Write Data Register
performance but
doesn't really matter Dout[31:0] Data[31:0]
what the code looks
like RAM

Introduction to Embedded Systems


ARM Write Buffer
• “Write buffer”
holds writes and Address Register
slowly retires them
Addr
to memory while ALU Bus Incrementer
Incrementer
Bus
processor
continues to
Write Buffer Register Bank
execute other (holds address and data)
instructions
• A Bus Barrel Shifter
Allows multiple B Bus
writes to occur 32­bit ALU
back­to­back
• Now the order of
code does matter Mem Addr Register Write Data Register Read Data/Instr Reg

Dout[31:0] Data[31:0]

RAM

Introduction to Embedded Systems


Critical Thinking
• When is optimization a bad thing?

Introduction to Embedded Systems


Summary of Lecture
• Profiling
– Amdahl’s Law
– The 80/20 rule
– Profiling in the ARM environment

• Improving program performance


– Standard compiler optimizations
• Common sub­expression elimination
• Dead­code elimination
• Induction variables
– Aggressive compiler optimizations
• In­lining of functions
• Loop unrolling
– Architectural code optimizations

Introduction to Embedded Systems


And Now For Something Completely Different

Good luck for Quiz #1 !

Introduction to Embedded Systems

You might also like