Unit 2 Basic Optimization Techniques For Serial Code
Do less work; example 1

int flag = 0;
for (i = 0; i < N; i++) {
    if (some_function(A[i]) < threshold_value)
        flag = 1;
}
If we only need to know whether any element falls below the threshold, the loop can stop as soon as the flag has been set:

int flag = 0;
for (i = 0; i < N; i++) {
    if (some_function(A[i]) < threshold_value) {
        flag = 1;
        break;
    }
}
Do less work; example 2
for (i = 0; i < 500; i++)
    for (j = 0; j < 80; j++)
        for (k = 0; k < 4; k++)
            a[i][j][k] = a[i][j][k] + b[i][j][k]*c[i][j][k];
How many times is the k-indexed loop executed? And how many times the j-indexed loop? (The k-indexed loop is started 500 × 80 = 40,000 times and the j-indexed loop 500 times, so a large share of the run time goes into loop overhead rather than into the actual arithmetic.)
Do less work; example 2 (cont’d)
double *a_ptr = a[0][0];
double *b_ptr = b[0][0];
double *c_ptr = c[0][0];
for (i = 0; i < (500*80*4); i++)
    a_ptr[i] = a_ptr[i] + b_ptr[i]*c_ptr[i];
Do less work; example 3

for (i = 0; i < ARRAY_SIZE; i++) {
    a[i] = 0.;
    for (j = 0; j < ARRAY_SIZE; j++)
        a[i] = a[i] + b[j]*d[j]*c[i];
}
Observation: c[i] is independent of the j-indexed loop.
Do less work; example 3 (cont’d)
Improvement:
for (i = 0; i < ARRAY_SIZE; i++) {
    a[i] = 0.;
    for (j = 0; j < ARRAY_SIZE; j++)
        a[i] = a[i] + b[j]*d[j];
    a[i] = a[i]*c[i];
}
Further improvement: the sum over j does not depend on i at all, so it can be computed once before the i-indexed loop:

t = 0.;
for (j = 0; j < ARRAY_SIZE; j++)
    t = t + b[j]*d[j];
for (i = 0; i < ARRAY_SIZE; i++)
    a[i] = t*c[i];
Another example: the expression s + r*sin(x) is loop-invariant, so it should be computed once outside the loop:

for (i = 0; i < N; i++)
    A[i] = A[i] + s + r*sin(x);

tmp = s + r*sin(x);
for (i = 0; i < N; i++)
    A[i] = A[i] + tmp;
Avoid expensive operations!
Special math functions (such as trigonometric, exponential and
logarithmic functions) are usually very costly to compute.
An example from simulating non-equilibrium spins:
for (i = 1; i < Nx-1; i++)
    for (j = 1; j < Ny-1; j++)
        for (k = 1; k < Nz-1; k++)
        {
            iL = spin_orientation[i-1][j][k];
            iR = spin_orientation[i+1][j][k];
            iS = spin_orientation[i][j-1][k];
            iN = spin_orientation[i][j+1][k];
            iO = spin_orientation[i][j][k-1];
            iU = spin_orientation[i][j][k+1];
            edelz = iL+iR+iS+iN+iO+iU;
            body_force[i][j][k] = 0.5*(1.0+tanh(edelz/tt));
        }
Example continued
If the values of iL, iR, iS, iN, iO, iU can only be −1 or +1, then the value of edelz (which is the sum of iL, iR, iS, iN, iO, iU) can only be −6, −4, −2, 0, 2, 4, 6.
If tt is a constant, then we can create a lookup table:
double tanh_table[13];
for (i = 0; i <= 12; i += 2)
    tanh_table[i] = 0.5*(1.0+tanh((i-6)/tt));
for (i = 1; i < Nx-1; i++)
    for (j = 1; j < Ny-1; j++)
        for (k = 1; k < Nz-1; k++)
        {
            ....
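The rest of the loop body is elided above. As a sketch only (not shown on the slide), assuming edelz is formed from the six neighbours exactly as before, the tanh evaluation becomes a table lookup:

            /* sketch (assumption): edelz is computed as in the original loop */
            edelz = iL+iR+iS+iN+iO+iU;
            /* edelz is one of -6, -4, ..., 6, so edelz+6 indexes the even
               table entries 0..12 that were filled above */
            body_force[i][j][k] = tanh_table[edelz+6];
        }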
Strength reduction
for (i = 0; i < N; i++)
    y[i] = pow(x[i],3)/s;
double inverse_s = 1.0/s;
for (i = 0; i < N; i++)
    y[i] = x[i]*x[i]*x[i]*inverse_s;
Strength reduction (another example)
for (i = 0; i < N; i++)
    y[i] = a*pow(x[i],4)+b*pow(x[i],3)+c*pow(x[i],2)+d*pow(x[i],1)+e;
for (i = 0; i < N; i++)
    y[i] = (((a*x[i]+b)*x[i]+c)*x[i]+d)*x[i]+e;

This rewriting (Horner's scheme) replaces the expensive pow calls with a few multiplications and additions.
Avoiding branches

for (j = 0; j < N; j++)
    for (i = 0; i < N; i++) {
        if (i > j)
            sign = 1.0;
        else if (i < j)
            sign = -1.0;
        else
            sign = 0.0;
        C[j] = C[j] + sign * A[j][i] * B[i];
    }
Avoiding branches (cont’d)
for (j = 0; j < N-1; j++)
    for (i = j+1; i < N; i++)
        C[j] = C[j] + A[j][i] * B[i];

for (j = 1; j < N; j++)
    for (i = 0; i < j; i++)
        C[j] = C[j] - A[j][i] * B[i];
Another example of avoiding branches

for (i = 0; i < n; i++) {
    if (i == 0)
        a[i] = b[i+1]-b[i];
    else if (i == n-1)
        a[i] = b[i]-b[i-1];
    else
        a[i] = b[i+1]-b[i-1];
}
Another example of avoiding branches (cont’d)

a[0] = b[1]-b[0];
for (i = 1; i < n-1; i++)
    a[i] = b[i+1]-b[i-1];
a[n-1] = b[n-1]-b[n-2];
Yet another example of avoiding branches
for (i = 0; i < n; i++) {
    if (j > 0)
        x[i] = x[i] + 1;
    else
        x[i] = 0;
}
if (j > 0)
    for (i = 0; i < n; i++)
        x[i] = x[i] + 1;
else
    for (i = 0; i < n; i++)
        x[i] = 0;
Using SIMD instructions
Example:
for (i = 0; i < N; i++)
    r[i] = x[i] + y[i];
int i, rest = N % 4;
for (i = 0; i < N-rest; i += 4) {
    load  R1 = [x[i], x[i+1], x[i+2], x[i+3]];
    load  R2 = [y[i], y[i+1], y[i+2], y[i+3]];
    R3 = ADD(R1, R2);
    store [r[i], r[i+1], r[i+2], r[i+3]] = R3;
}
for (i = N-rest; i < N; i++)
    r[i] = x[i] + y[i];
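The register notation above is pseudocode. As a hedged illustration (not part of the slide), the same idea can be expressed with compiler intrinsics; this sketch assumes x, y and r are double arrays as above and that the CPU supports AVX:

#include <immintrin.h>   /* AVX intrinsics */

int i, rest = N % 4;
for (i = 0; i < N-rest; i += 4) {
    __m256d vx = _mm256_loadu_pd(&x[i]);              /* load x[i..i+3]   */
    __m256d vy = _mm256_loadu_pd(&y[i]);              /* load y[i..i+3]   */
    _mm256_storeu_pd(&r[i], _mm256_add_pd(vx, vy));   /* r[i..i+3] = x+y  */
}
for (i = N-rest; i < N; i++)                          /* remainder loop   */
    r[i] = x[i] + y[i];

In practice a compiler will usually generate such instructions itself when the loop is written simply and vectorization is enabled.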
Beware of loop dependency!
for (i = start; i < end; i++)
    A[i] = 10.0*A[i+offset];
If offset < 0 → real dependency (read-after-write hazard).
If offset > 0 → pseudo dependency (write-after-read hazard).
When there is a loop-carried dependency...
for (i = start; i < end; i++)
    A[i] = 10.0*A[i-1];

for (i = start; i < end; i++)
    A[i] = 10.0*A[i+offset];
Risk of aliasing
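When a loop works on two different pointers, the compiler often cannot prove that they do not overlap and must assume a possible dependency. As a hedged sketch (not from the slide), the C99 restrict qualifier is one way to promise the compiler that the arrays never alias:

/* Sketch (assumption): restrict tells the compiler that dst and src never
   overlap, so the loop can be optimized and vectorized without a
   dependency check. */
void scale(double *restrict dst, const double *restrict src, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = 10.0*src[i];
}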
Knowing how much time is spent where is the first step. But what is the actual reason for a code being slow, and by which resource is the performance limited?
Modern processors feature a small number of performance
counters, which are special on-chip registers that get incremented
each time a certain event occurs.
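As a hedged sketch (not part of the slide), such counters can be read from within a program through a library such as PAPI; the function names below are from PAPI's high-level API, and the function and region names are illustrative only:

#include <papi.h>   /* assumes the PAPI library is installed; link with -lpapi */

void measured_loop(double *a, const double *b, int n)
{
    int i;
    /* events are typically selected via the PAPI_EVENTS environment
       variable, e.g. PAPI_EVENTS="PAPI_TOT_CYC,PAPI_L2_DCM" */
    PAPI_hl_region_begin("update_loop");
    for (i = 0; i < n; i++)
        a[i] = a[i] + 2.0*b[i];
    PAPI_hl_region_end("update_loop");
}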
Possible events that can be monitored: