
Unit 2
Basic optimization techniques for serial code

Objectives of Chapter 2

“Common sense” and simple optimization strategies for serial code
The role of compilers
Basics of performance profiling
“Common sense” optimizations

Very simple code changes can sometimes lead to a significant performance boost.
The most important “common sense” principle: avoiding performance pitfalls!
Do less work; example 1
Example: assume A is an array of numerical values, and threshold_value is a prescribed threshold.

int flag = 0;
for (i=0; i<N; i++) {
  if ( some_function(A[i]) < threshold_value )
    flag = 1;
}

Improvement: leave the loop as soon as flag becomes 1.

int flag = 0;
for (i=0; i<N; i++) {
  if ( some_function(A[i]) < threshold_value ) {
    flag = 1;
    break;
  }
}
Do less work; example 2

for (i=0; i<500; i++)
  for (j=0; j<80; j++)
    for (k=0; k<4; k++)
      a[i][j][k] = a[i][j][k] + b[i][j][k]*c[i][j][k];

How many times is the k-indexed loop executed? And how many times the j-indexed loop? (The k-indexed loop is started 500*80 = 40,000 times and the j-indexed loop 500 times; that is a lot of loop overhead for an inner loop with only 4 iterations.)
Do less work; example 2 (cont’d)

If the 3D arrays a, b and c have contiguous memory storage for all their values, then we can re-code as follows:

double *a_ptr = a[0][0];
double *b_ptr = b[0][0];
double *c_ptr = c[0][0];

for (i=0; i<(500*80*4); i++)
  a_ptr[i] = a_ptr[i] + b_ptr[i]*c_ptr[i];

This technique is called loop collapsing. The main motivation is to reduce loop overhead; it may also enable other (compiler-supported) optimizations.
Do less work; example 3

for (i=0; i<ARRAY_SIZE; i++) {
  a[i] = 0.;
  for (j=0; j<ARRAY_SIZE; j++)
    a[i] = a[i] + b[j]*d[j]*c[i];
}

Observation: c[i] is independent of the j-indexed loop.
Do less work; example 3 (cont’d)

Improvement:

for (i=0; i<ARRAY_SIZE; i++) {
  a[i] = 0.;
  for (j=0; j<ARRAY_SIZE; j++)
    a[i] = a[i] + b[j]*d[j];
  a[i] = a[i]*c[i];
}

Can we improve further?
Do less work; example 3 (further simplification)

There is a common factor:

b[0]*d[0] + b[1]*d[1] + ... + b[ARRAY_SIZE-1]*d[ARRAY_SIZE-1]

which is unnecessarily re-computed in every i iteration!

t = 0.;
for (j=0; j<ARRAY_SIZE; j++)
  t = t + b[j]*d[j];

for (i=0; i<ARRAY_SIZE; i++)
  a[i] = t*c[i];

This technique is called loop factoring, or elimination of common subexpressions.
Another example of common subexpression elimination

for (i=0; i<N; i++)
  A[i] = A[i] + s + r*sin(x);

tmp = s + r*sin(x);
for (i=0; i<N; i++)
  A[i] = A[i] + tmp;
Avoid expensive operations!
Special math functions (such as trigonometric, exponential and logarithmic functions) are usually very costly to compute.
An example from simulating non-equilibrium spins:

for (i=1; i<Nx-1; i++)
  for (j=1; j<Ny-1; j++)
    for (k=1; k<Nz-1; k++)
    {
      iL = spin_orientation[i-1][j][k];
      iR = spin_orientation[i+1][j][k];
      iS = spin_orientation[i][j-1][k];
      iN = spin_orientation[i][j+1][k];
      iO = spin_orientation[i][j][k-1];
      iU = spin_orientation[i][j][k+1];
      edelz = iL+iR+iS+iN+iO+iU;
      body_force[i][j][k] = 0.5*(1.0+tanh(edelz/tt));
    }
Example continued
If the values of iL, iR, iS, iN, iO, iU can only be −1 or +1, then the value of edelz (which is the sum of iL, iR, iS, iN, iO, iU) can only be −6, −4, −2, 0, 2, 4, 6.
If tt is a constant, then we can create a lookup table:

double tanh_table[13];
for (i=0; i<=12; i+=2)
  tanh_table[i] = 0.5*(1.0+tanh((i-6)/tt));

for (i=1; i<Nx-1; i++)
  for (j=1; j<Ny-1; j++)
    for (k=1; k<Nz-1; k++)
    {
      ....
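
The rewritten loop body is not shown on the slide; a sketch of how it might be completed, assuming edelz + 6 is used to map the possible sums −6, ..., 6 onto the table indices 0, ..., 12:

      iL = spin_orientation[i-1][j][k];
      iR = spin_orientation[i+1][j][k];
      iS = spin_orientation[i][j-1][k];
      iN = spin_orientation[i][j+1][k];
      iO = spin_orientation[i][j][k-1];
      iU = spin_orientation[i][j][k+1];
      edelz = iL+iR+iS+iN+iO+iU;
      /* table lookup replaces the expensive tanh() call */
      body_force[i][j][k] = tanh_table[edelz+6];
    }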
Strength reduction

for (i=0; i<N; i++)
  y[i] = pow(x[i],3)/s;

double inverse_s = 1.0/s;
for (i=0; i<N; i++)
  y[i] = x[i]*x[i]*x[i]*inverse_s;
Strength reduction (another example)

for (i=0; i<N; i++)
  y[i] = a*pow(x[i],4)+b*pow(x[i],3)+c*pow(x[i],2)+d*pow(x[i],1)+e;

for (i=0; i<N; i++)
  y[i] = (((a*x[i]+b)*x[i]+c)*x[i]+d)*x[i]+e;

Use of Horner’s rule of polynomial evaluation:

ax^4 + bx^3 + cx^2 + dx + e = (((ax + b)x + c)x + d)x + e
Shrinking the working set!

The working set of a code is the amount of memory it uses (or touches), also called its memory footprint.
In general, shrinking the working set (if possible) is good for performance, because it raises the probability of cache hits.
One example: the spin_orientation array should store values of type char instead of type int. (A factor of 4 difference in memory footprint.)
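
A minimal sketch of the declaration change (Nx, Ny, Nz are assumed here to be compile-time constants; signed char is chosen because the spins are −1 or +1):

/* before: 4 bytes per spin value */
int spin_orientation[Nx][Ny][Nz];

/* after: 1 byte per spin value -- same information, one quarter of the footprint */
signed char spin_orientation[Nx][Ny][Nz];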
Avoiding branches

“Tight” loops: few operations per iteration, typically optimized by the compiler using some form of pipelining. Conditional branches in the loop body can easily make this compiler optimization fail.

for (j=0; j<N; j++)
  for (i=0; i<N; i++) {
    if (i>j)
      sign = 1.0;
    else if (i<j)
      sign = -1.0;
    else
      sign = 0.0;

    C[j] = C[j] + sign * A[j][i] * B[i];
  }
Avoiding branches (cont’d)

for (j=0; j<N-1; j++)
  for (i=j+1; i<N; i++)
    C[j] = C[j] + A[j][i] * B[i];

for (j=1; j<N; j++)
  for (i=0; i<j; i++)
    C[j] = C[j] - A[j][i] * B[i];

We have got rid of the if-tests completely! (The i == j case contributes nothing, since sign would be 0 there.)
Another example of avoiding branches

for (i=0; i<n; i++) {
  if (i==0)
    a[i] = b[i+1]-b[i];
  else if (i==n-1)
    a[i] = b[i]-b[i-1];
  else
    a[i] = b[i+1]-b[i-1];
}
Another example of avoiding branches (cont’d)

Using the technique of loop peeling, we can re-code as follows:

a[0] = b[1]-b[0];
for (i=1; i<n-1; i++)
  a[i] = b[i+1]-b[i-1];
a[n-1] = b[n-1]-b[n-2];
Yet another example of avoiding branches

for (i=0; i<n; i++) {
  if (j>0)
    x[i] = x[i] + 1;
  else
    x[i] = 0;
}

Since the test on j is loop-invariant, the branch can be hoisted out of the loop:

if (j>0)
  for (i=0; i<n; i++)
    x[i] = x[i] + 1;
else
  for (i=0; i<n; i++)
    x[i] = 0;
Using SIMD instructions

A “vectorizable” loop can potentially run faster if multiple operations can be performed with a single instruction.
Using SIMD instructions, register-to-register operations will be greatly accelerated.
Warning: if the code is strongly limited by memory bandwidth, no SIMD technique can bridge this gap.
Ideal scenario for applying SIMD to a loop

All iterations are independent
There is no branch in the loop body
The arrays are accessed with a stride of one

Example:

for (i=0; i<N; i++)
  r[i] = x[i] + y[i];

(We assume here that the memory regions pointed to by r, x, y do not overlap—no aliasing.)
An example of applying SIMD

Pseudocode of applying SIMD (assuming that each SIMD register can store 4 values):

int i, rest = N % 4;
for (i=0; i<N-rest; i+=4) {
  load R1 = [x[i], x[i+1], x[i+2], x[i+3]];
  load R2 = [y[i], y[i+1], y[i+2], y[i+3]];
  R3 = ADD(R1,R2);
  store [r[i], r[i+1], r[i+2], r[i+3]] = R3;
}
for (i=N-rest; i<N; i++)
  r[i] = x[i] + y[i];
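
For comparison (not part of the original slides), a sketch of the same loop written with x86 AVX intrinsics, assuming a CPU with AVX support and non-aliasing double-precision arrays:

#include <immintrin.h>

void add_avx(double *r, const double *x, const double *y, int N)
{
  int i, rest = N % 4;
  for (i=0; i<N-rest; i+=4) {
    __m256d R1 = _mm256_loadu_pd(&x[i]);  /* load x[i..i+3] */
    __m256d R2 = _mm256_loadu_pd(&y[i]);  /* load y[i..i+3] */
    __m256d R3 = _mm256_add_pd(R1, R2);   /* 4 additions with one instruction */
    _mm256_storeu_pd(&r[i], R3);          /* store r[i..i+3] */
  }
  for (i=N-rest; i<N; i++)                /* remainder loop */
    r[i] = x[i] + y[i];
}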
Beware of loop dependency!

If a loop iteration depends on the result of another iteration, we have a loop-carried dependency.

for (i=start; i<end; i++)
  A[i] = 10.0*A[i+offset];

If offset < 0 → real dependency (read-after-write hazard)
If offset > 0 → pseudo dependency (write-after-read hazard)
When there is loop-carried dependency...

In case of real dependency, SIMD cannot be applied if the magnitude of the negative offset is smaller than the SIMD width. For example,

for (i=start; i<end; i++)
  A[i] = 10.0*A[i-1];

In case of pseudo dependency, SIMD can be applied. For example, when offset > 0:

for (i=start; i<end; i++)
  A[i] = 10.0*A[i+offset];
Risk of aliasing

Is it safe to vectorize the following function?

void compute(int start, int stop, double *a, double *b)
{
  for (int i=start; i<stop; i++)
    a[i] = 10.0*b[i];
}
Risk of aliasing (cont’d)

A problem of “aliasing” will arise if the compute function is called as follows:

compute(0, N-1, &(array_a[1]), array_a);

If the programmer can guarantee that aliasing won’t happen, this hint can be provided to the compiler.
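
In C99, for example, this hint can be given with the restrict qualifier; a sketch of the same function under the assumption that a and b never overlap:

void compute(int start, int stop, double * restrict a, const double * restrict b)
{
  /* The restrict (and const) qualifiers promise the compiler that a and b
     do not alias within this function, so the loop can be safely vectorized. */
  for (int i=start; i<stop; i++)
    a[i] = 10.0*b[i];
}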
The role of compilers

A compiler translates a program, which is implemented in a programming language, into machine code.
A compiler can carry out code optimizations of various degrees, dictated by the compiler options provided by the user (-O0, -O1, -O2, ...).
Different compilers may allow different compiler options; refer to the user manual!
Numerical accuracy may suffer from overly aggressive compiler optimizations.
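
As an illustration (not from the slides), with GCC the optimization level and some more aggressive options might be selected like this; other compilers use different option names:

gcc -O0 prog.c -o prog                  # no optimization, easiest to debug
gcc -O2 prog.c -o prog                  # common optimization level
gcc -O3 -march=native prog.c -o prog    # aggressive optimization for the host CPU
gcc -O3 -ffast-math prog.c -o prog      # fastest, but may change floating-point results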
Profiling

Profiling: gathering information about a program’s behavior, especially its use of resources. The purpose is to pinpoint the “hot spots” and, more importantly, to identify performance optimization opportunities (if any) and/or bugs.
Two approaches to “information gathering”:

Instrumentation—the compiler automatically inserts code to log each function call during the actual execution
Sampling—the program execution is interrupted at periodic intervals, with information being recorded
GNU gprof

One well-known profiler: GNU gprof
https://sourceware.org/binutils/docs/gprof/

Step 1: compile and link the program with profiling enabled;
Step 2: execute the program to generate a profile data file;
Step 3: run gprof to analyze the profile data.

(There are other profilers, of course.)
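
A typical gprof session for the three steps might look like this (a sketch; the program name prog.c and the output file names are just placeholders):

gcc -pg -O2 prog.c -o prog            # Step 1: compile and link with profiling enabled
./prog                                # Step 2: run; profile data is written to gmon.out
gprof ./prog gmon.out > report.txt    # Step 3: analyze the profile data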


Hardware performance counters

Knowing how much time is spent where is the first step. But what is the actual reason for a “slow code”, and by which resource is the performance limited?
Modern processors feature a small number of performance counters, which are special on-chip registers that get incremented each time a certain event occurs.
Possible events that can be monitored (an example of reading them out follows this list):

number of cache line transfers
number of loads and stores
number of floating-point operations
number of branch mispredictions
number of pipeline stalls
number of instructions executed
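
On Linux, one widely available way to read such counters is the perf tool; a minimal sketch (assuming perf is installed and the executable is called prog, which is not from the slides):

perf stat ./prog                                              # summary of basic hardware counters
perf stat -e instructions,branch-misses,cache-misses ./prog   # selected events only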
