
IN3200/IN4200: Chapter 2

Basic optimization techniques for serial code

Textbook: Hager & Wellein, Introduction to High Performance Computing for Scientists and
Engineers
Objectives of Chapter 2

“Common sense” and simple optimization strategies for serial code
(Data access optimization will be discussed in Chapter 3)
The role of compilers
Basics of performance profiling
“Common sense” optimizations

Very simple code changes can sometimes lead to a significant performance boost.
The most important “common sense” principle: avoiding
performance pitfalls!
Do less work; example 1
Example: assume A is an array of numerical values, and threshold_value is a prescribed threshold.

int flag = 0;
for (i=0; i<N; i++) {
if ( some_function(A[i]) < threshold_value )
flag = 1;
}

Improvement: leave the loop as soon as flag becomes 1.

int flag = 0;
for (i=0; i<N; i++) {
if ( some_function(A[i]) < threshold_value ) {
flag = 1;
break;
}
}
Do less work; example 2

for (i=0; i<500; i++)
for (j=0; j<80; j++)
for (k=0; k<4; k++)
a[i][j][k] = a[i][j][k] + b[i][j][k]*c[i][j][k];

How many times is the k-indexed loop started, and how many times the j-indexed loop? (The innermost k-indexed loop is started 500*80 = 40000 times but performs only 4 iterations of useful work each time, so the loop overhead is considerable; the j-indexed loop is started 500 times.)
Do less work; example 2 (cont’d)

If the 3D arrays a, b and c have contiguous memory storage for all their values, then we can re-code as follows:

double *a_ptr = a[0][0];
double *b_ptr = b[0][0];
double *c_ptr = c[0][0];

for (i=0; i<(500*80*4); i++)
a_ptr[i] = a_ptr[i] + b_ptr[i]*c_ptr[i];

This technique is called loop collapsing. The main motivation is to reduce loop overhead; it may also enable other (compiler-supported) optimizations.
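A note on the assumption behind the pointer trick: it relies on a, b and c being true multidimensional C arrays (one contiguous block each), for example declared as sketched below; it would not be valid for arrays built from separately allocated rows (double ***).

double a[500][80][4], b[500][80][4], c[500][80][4];  /* one contiguous block each */
/* a[0][0] decays to a double* pointing at the first of the 500*80*4 elements */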
Do less work; example 3

for (i=0; i<ARRAY_SIZE; i++) {
a[i] = 0.;
for (j=0; j<ARRAY_SIZE; j++)
a[i] = a[i] + b[j]*d[j]*c[i];
}

Observation: c[i] is independent of the j-indexed loop.


Do less work; example 3 (cont’d)

Improvement:

for (i=0; i<ARRAY_SIZE; i++) {
a[i] = 0.;
for (j=0; j<ARRAY_SIZE; j++)
a[i] = a[i] + b[j]*d[j];
a[i] = a[i]*c[i];
}

Can we improve further?


Do less work; example 3 (further simplification)

There is a common factor:

b[0]*d[0]+b[1]*d[1]+...+b[ARRAY_SIZE-1]*d[ARRAY_SIZE-1]
which is unnecessarily re-computed in every i iteration!

t = 0.;
for (j=0; j<ARRAY_SIZE; j++)
t = t + b[j]*d[j];

for (i=0; i<ARRAY_SIZE; i++)
a[i] = t*c[i];

This technique is called loop factoring, or elimination of common subexpressions.
Another example of common subexpression elimination

for (i=0; i<N; i++)
A[i] = A[i] + s + r*sin(x);

tmp = s + r*sin(x);
for (i=0; i<N; i++)
A[i] = A[i] + tmp;
Avoid expensive operations!
Special math functions (such as trigonometric, exponential and
logarithmic functions) are usually very costly to compute.
An example from simulating non-equilibrium spins:

for (i=1; i<Nx-1; i++)
for (j=1; j<Ny-1; j++)
for (k=1; k<Nz-1; k++) {
iL = spin_orientation[i-1][j][k];
iR = spin_orientation[i+1][j][k];
iS = spin_orientation[i][j-1][k];
iN = spin_orientation[i][j+1][k];
iO = spin_orientation[i][j][k-1];
iU = spin_orientation[i][j][k+1];
edelz = iL+iR+iS+iN+iO+iU;
body_force[i][j][k] = 0.5*(1.0+tanh(edelz/tt));
}
Example continued
If the values of iL, iR, iS, iN, iO, iU can only be −1 or +1,
then the value of edelz (which is the sum of iL, iR, iS, iN,
iO, iU) can only be −6, −4, −2, 0, 2, 4, 6.
If tt is a constant, then we can create a lookup table:

double tanh_table[13];
for (i=0; i<=12; i+=2)   /* only the even indices are needed: edelz+6 is always even */
tanh_table[i] = 0.5*(1.0+tanh((i-6)/tt));

for (i=1; i<Nx-1; i++)
for (j=1; j<Ny-1; j++)
for (k=1; k<Nz-1; k++) {
....
edelz = iL+iR+iS+iN+iO+iU;
body_force[i][j][k] = tanh_table[edelz+6];
}
Strength reduction

for (i=0; i<N; i++)
y[i] = pow(x[i],3)/s;

double inverse_s = 1.0/s;

for (i=0; i<N; i++)
y[i] = x[i]*x[i]*x[i]*inverse_s;
Strength reduction (another example)

for (i=0; i<N; i++)
y[i] = a*pow(x[i],4)+b*pow(x[i],3)+c*pow(x[i],2)
+d*pow(x[i],1)+e;

for (i=0; i<N; i++)
y[i] = (((a*x[i]+b)*x[i]+c)*x[i]+d)*x[i]+e;

Use of Horner’s rule of polynomial evaluation:

ax^4 + bx^3 + cx^2 + dx + e = (((ax + b)x + c)x + d)x + e


Shrinking the working set!

The working set of a code is the amount of memory it uses (or touches), also called its memory footprint.
In general, shrinking the working set (if possible) is good for performance, because it raises the probability of cache hits.
One example: the spin_orientation array could store values of type char instead of type int, a factor-of-4 reduction in memory footprint.
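As a sketch of the change (assuming here that Nx, Ny, Nz are compile-time constants; the same factor-of-4 saving applies with dynamically allocated storage):

/* before: 4 bytes per spin value */
int spin_orientation[Nx][Ny][Nz];

/* after: 1 byte per value is enough to hold -1 or +1 */
char spin_orientation[Nx][Ny][Nz];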
Avoiding branches

“Tight” loops: few operations per iteration, typically optimized by the compiler using some form of pipelining. Conditional branches in the loop body can easily make this compiler optimization fail.

for (j=0; j<N; j++)
for (i=0; i<N; i++) {
if (i>j)
sign = 1.0;
else if (i<j)
sign = -1.0;
else
sign = 0.0;

C[j] = C[j] + sign * A[j][i] * B[i];
}
Avoiding branches (cont’d)

for (j=0; j<N-1; j++)
for (i=j+1; i<N; i++)
C[j] = C[j] + A[j][i] * B[i];

for (j=1; j<N; j++)
for (i=0; i<j; i++)
C[j] = C[j] - A[j][i] * B[i];

We have got rid of the if-tests completely!


Another example of avoiding branches

for (i=0; i<n; i++) {
if (i==0)
a[i] = b[i+1]-b[i];
else if (i==n-1)
a[i] = b[i]-b[i-1];
else
a[i] = b[i+1]-b[i-1];
}
Another example of avoiding branches (cont’d)

Using the technique of loop peeling, we can re-code as follows:

a[0] = b[1]-b[0];
for (i=1; i<n-1; i++)
a[i] = b[i+1]-b[i-1];
a[n-1] = b[n-1]-b[n-2];
Yet another example of avoiding branches

for (i=0; i<n; i++) {
if (j>0)
x[i] = x[i] + 1;
else
x[i] = 0;
}

if (j>0)
for (i=0; i<n; i++)
x[i] = x[i] + 1;
else
for (i=0; i<n; i++)
x[i] = 0;
Using SIMD instructions

A “vectorizable” loop can potentially run faster if multiple operations can be performed with a single instruction.
Using SIMD instructions, register-to-register operations will be
greatly accelerated.
Warning: if the code is strongly limited by memory bandwidth, no
SIMD technique can bridge this gap.
Ideal scenario for applying SIMD to a loop

All iterations are independent
There is no branch in the loop body
The arrays are accessed with a stride of one

Example:

for (i=0; i<N; i++)
r[i] = x[i] + y[i];

(We assume here that the memory regions pointed to by r, x, y do not overlap, i.e. no aliasing.)
An example of applying SIMD

Pseudocode of applying SIMD (assuming that each SIMD register can store 4 values):

int i, rest = N%4;

for (i=0; i<N-rest; i+=4) {
load R1 = [x[i],x[i+1],x[i+2],x[i+3]];
load R2 = [y[i],y[i+1],y[i+2],y[i+3]];
R3 = ADD(R1,R2);
store [r[i],r[i+1],r[i+2],r[i+3]] = R3;
}
for (i=N-rest; i<N; i++)
r[i] = x[i] + y[i];
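For concreteness, a minimal sketch of the same loop written with real SIMD intrinsics, assuming an x86 CPU with AVX (256-bit registers holding 4 doubles), the same no-aliasing assumption as above, and a hypothetical function name add_avx (compile with a flag such as -mavx):

#include <immintrin.h>

void add_avx(int N, const double *x, const double *y, double *r) {
  int i, rest = N % 4;
  for (i = 0; i < N - rest; i += 4) {
    __m256d vx = _mm256_loadu_pd(&x[i]);  /* load x[i..i+3] */
    __m256d vy = _mm256_loadu_pd(&y[i]);  /* load y[i..i+3] */
    __m256d vr = _mm256_add_pd(vx, vy);   /* 4 additions with one instruction */
    _mm256_storeu_pd(&r[i], vr);          /* store r[i..i+3] */
  }
  for (i = N - rest; i < N; i++)          /* remainder loop for the last N%4 elements */
    r[i] = x[i] + y[i];
}

In practice, a compiler at a sufficient optimization level will usually generate such SIMD code automatically for a simple vectorizable loop like this.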
Beware of loop dependency!

If a loop iteration depends on the result of another iteration, we have a loop-carried dependency.

for (i=start; i<end; i++)
A[i] = 10.0*A[i+offset];

If offset<0 → real dependency (read-after-write hazard)
If offset>0 → pseudo dependency (write-after-read hazard)
When there is loop-carried dependency...

In case of a real dependency, SIMD cannot be applied if the magnitude of the negative offset is smaller than the SIMD width. For example,

for (i=start; i<end; i++)
A[i] = 10.0*A[i-1];

In case of pseudo dependency, SIMD can be applied. For example when offset>0,

for (i=start; i<end; i++)
A[i] = 10.0*A[i+offset];
Risk of aliasing

Is it safe to vectorize the following function?

void compute(int start, int stop, double *a, double *b) {
for (int i=start; i<stop; i++)
a[i] = 10.0*b[i];
}
Risk of aliasing (cont’d)

A problem of “aliasing” will arise if the compute function is called as follows:

compute(0, N-1, &(array_a[1]), array_a);

If a programmer can guarantee that aliasing won’t happen, this hint can be provided to the compiler.
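One standard way to give the compiler this hint in C (since C99) is the restrict qualifier, which promises that the data accessed through each qualified pointer is not accessed through any other pointer in the same scope; a minimal sketch:

/* restrict: a and b are promised not to alias, so the loop can safely be vectorized */
void compute(int start, int stop, double * restrict a, double * restrict b) {
  for (int i=start; i<stop; i++)
    a[i] = 10.0*b[i];
}

With restrict in place, an aliasing call such as the one above would violate the promise and lead to undefined behavior, so the responsibility rests entirely with the programmer.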
The role of compilers

A compiler translates a program, written in a programming language, into machine code.
A compiler can carry out code optimization of various degrees, dictated by the compiler options provided by the user (-O0, -O1, -O2, ...).
Different compilers support different options; refer to the user manual!
Numerical accuracy may suffer from too aggressive compiler
optimizations.
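As an illustration, some common GCC/Clang invocations (the exact flags and their effects are compiler-specific; check the manual):

gcc -O0 -g -o prog prog.c             # no optimization, easiest to debug
gcc -O2 -o prog prog.c                # a common default level for production code
gcc -O3 -ffast-math -o prog prog.c    # aggressive; -ffast-math relaxes IEEE floating-point
                                      # rules, so numerical results may change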
Profiling

Profiling: gathering information about a program’s behavior, especially its use of resources. The purpose is to pinpoint the “hot spots” and, more importantly, to identify performance optimization opportunities (if any) and/or bugs.
Two approaches to “information gathering”:

Instrumentation: the compiler automatically inserts code to log each function call during the actual execution
Sampling: the program execution is interrupted at periodic intervals, with information being recorded
GNU gprof

One well-known profiler: GNU gprof
https://sourceware.org/binutils/docs/gprof/

Step 1: compile and link the program with profiling enabled;
Step 2: execute the program to generate a profile data file;
Step 3: run gprof to analyze the profile data.
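With GCC, for example, the three steps could look as follows (prog.c and report.txt are placeholder names):

gcc -pg -O2 -o prog prog.c            # Step 1: -pg enables gprof instrumentation
./prog                                # Step 2: the run writes the profile data file gmon.out
gprof prog gmon.out > report.txt      # Step 3: analyze the profile data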

(There are other profilers, of course.)


Hardware performance counters

Knowing how much time is spent where is the first step. But what is the actual reason for a “slow code”, and by which resource is the performance limited?
Modern processors feature a small number of performance
counters, which are special on-chip registers that get incremented
each time a certain event occurs.
Possible events that can be monitored:

number of cache line transfers
number of loads and stores
number of floating-point operations
number of branch mispredictions
number of pipeline stalls
number of instructions executed
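These counters are usually accessed through a tool or library rather than directly; as one common example (assuming a Linux system with the perf tool available):

perf stat ./prog       # default report includes cycles, instructions, branches, branch-misses
perf stat -d ./prog    # "detailed" mode adds cache-related events (e.g. L1 data-cache loads/misses)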
