
Unit 2
Basic optimization techniques for serial code

Objectives of Chapter 2

“Common sense” and simple optimization strategies for serial code
The role of compilers
Basics of performance profiling
“Common sense” optimizations

Very simple code changes can sometimes lead to a significant performance boost.
The most important “common sense” principle: avoiding performance pitfalls!
Do less work; example 1
Example: assume A is an array of numerical values, and threshold_value is a prescribed threshold.

int flag = 0;
for (i=0; i<N; i++) {
  if ( some_function(A[i]) < threshold_value )
    flag = 1;
}

Improvement: leave the loop as soon as flag becomes 1.

int flag = 0;
for (i=0; i<N; i++) {
  if ( some_function(A[i]) < threshold_value ) {
    flag = 1;
    break;
  }
}
Do less work; example 2

for (i=0; i<500; i++)
  for (j=0; j<80; j++)
    for (k=0; k<4; k++)
      a[i][j][k] = a[i][j][k] + b[i][j][k]*c[i][j][k];

How many times is the k-indexed loop executed? And how many times the j-indexed loop? (The k-indexed loop is started 500*80 = 40,000 times and the j-indexed loop 500 times; that is a lot of loop overhead for an inner loop with only 4 iterations.)
Do less work; example 2 (cont’d)

If the 3D arrays a, b and c have contiguous memory storage for all their values, then we can re-code as follows:

double *a_ptr = a[0][0];
double *b_ptr = b[0][0];
double *c_ptr = c[0][0];

for (i=0; i<(500*80*4); i++)
  a_ptr[i] = a_ptr[i] + b_ptr[i]*c_ptr[i];

This technique is called loop collapsing. The main motivation is to reduce loop overhead; it may also enable other (compiler-supported) optimizations.
Do less work; example 3

for (i=0; i<ARRAY_SIZE; i++) {
  a[i] = 0.;
  for (j=0; j<ARRAY_SIZE; j++)
    a[i] = a[i] + b[j]*d[j]*c[i];
}

Observation: c[i] is independent of the j-indexed loop.
Do less work; example 3 (cont’d)

Improvement:

for (i=0; i<ARRAY_SIZE; i++) {
  a[i] = 0.;
  for (j=0; j<ARRAY_SIZE; j++)
    a[i] = a[i] + b[j]*d[j];
  a[i] = a[i]*c[i];
}

Can we improve further?
Do less work; example 3 (further simplification)

There is a common factor:

b[0]*d[0] + b[1]*d[1] + ... + b[ARRAY_SIZE-1]*d[ARRAY_SIZE-1]

which is unnecessarily re-computed in every i iteration!

t = 0.;
for (j=0; j<ARRAY_SIZE; j++)
  t = t + b[j]*d[j];

for (i=0; i<ARRAY_SIZE; i++)
  a[i] = t*c[i];

This technique is called loop factoring, or elimination of common subexpressions.
Another example of common subexpression elimination

for (i=0; i<N; i++)
  A[i] = A[i] + s + r*sin(x);

tmp = s + r*sin(x);
for (i=0; i<N; i++)
  A[i] = A[i] + tmp;
Avoid expensive operations!
Special math functions (such as trigonometric, exponential and logarithmic functions) are usually very costly to compute.
An example from simulating non-equilibrium spins:

for (i=1; i<Nx-1; i++)
  for (j=1; j<Ny-1; j++)
    for (k=1; k<Nz-1; k++)
    {
      iL = spin_orientation[i-1][j][k];
      iR = spin_orientation[i+1][j][k];
      iS = spin_orientation[i][j-1][k];
      iN = spin_orientation[i][j+1][k];
      iO = spin_orientation[i][j][k-1];
      iU = spin_orientation[i][j][k+1];
      edelz = iL+iR+iS+iN+iO+iU;
      body_force[i][j][k] = 0.5*(1.0+tanh(edelz/tt));
    }
Example continued
If the values of iL, iR, iS, iN, iO, iU can only be −1 or +1, then the value of edelz (which is the sum of iL, iR, iS, iN, iO, iU) can only be −6, −4, −2, 0, 2, 4, 6.
If tt is a constant, then we can create a lookup table:

double tanh_table[13];
for (i=0; i<=12; i+=2)
  tanh_table[i] = 0.5*(1.0+tanh((i-6)/tt));

for (i=1; i<Nx-1; i++)
  for (j=1; j<Ny-1; j++)
    for (k=1; k<Nz-1; k++)
    {
      ....
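
The rewritten loop body is not shown on the slide; a sketch of how it might be completed, assuming edelz + 6 is used to map the possible sums −6, ..., 6 onto the table indices 0, ..., 12:

      iL = spin_orientation[i-1][j][k];
      iR = spin_orientation[i+1][j][k];
      iS = spin_orientation[i][j-1][k];
      iN = spin_orientation[i][j+1][k];
      iO = spin_orientation[i][j][k-1];
      iU = spin_orientation[i][j][k+1];
      edelz = iL+iR+iS+iN+iO+iU;
      /* table lookup replaces the expensive tanh() call */
      body_force[i][j][k] = tanh_table[edelz+6];
    }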
Strength reduction

for (i=0; i<N; i++)
  y[i] = pow(x[i],3)/s;

double inverse_s = 1.0/s;
for (i=0; i<N; i++)
  y[i] = x[i]*x[i]*x[i]*inverse_s;
Strength reduction (another example)

for (i=0; i<N; i++)
  y[i] = a*pow(x[i],4)+b*pow(x[i],3)+c*pow(x[i],2)+d*pow(x[i],1)+e;

for (i=0; i<N; i++)
  y[i] = (((a*x[i]+b)*x[i]+c)*x[i]+d)*x[i]+e;

Use of Horner’s rule of polynomial evaluation:

ax^4 + bx^3 + cx^2 + dx + e = (((ax + b)x + c)x + d)x + e
Shrinking the working set!

The working set of a code is the amount of memory it uses (or touches), also called its memory footprint.
In general, shrinking the working set (if possible) is good for performance, because it raises the probability of cache hits.
One example: the spin_orientation array should store values of type char instead of type int. (A factor of 4 difference in memory footprint.)
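
A minimal sketch of the declaration change (Nx, Ny, Nz are assumed here to be compile-time constants; signed char is chosen because the spins are −1 or +1):

/* before: 4 bytes per spin value */
int spin_orientation[Nx][Ny][Nz];

/* after: 1 byte per spin value -- same information, one quarter of the footprint */
signed char spin_orientation[Nx][Ny][Nz];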
Avoiding branches

“Tight” loops: few operations per iteration, typically optimized by the compiler using some form of pipelining. Conditional branches in the loop body can easily make this compiler optimization fail.

for (j=0; j<N; j++)
  for (i=0; i<N; i++) {
    if (i>j)
      sign = 1.0;
    else if (i<j)
      sign = -1.0;
    else
      sign = 0.0;

    C[j] = C[j] + sign * A[j][i] * B[i];
  }
Avoiding branches (cont’d)

for (j=0; j<N-1; j++)
  for (i=j+1; i<N; i++)
    C[j] = C[j] + A[j][i] * B[i];

for (j=1; j<N; j++)
  for (i=0; i<j; i++)
    C[j] = C[j] - A[j][i] * B[i];

We have got rid of the if-tests completely! (The i == j case contributes nothing, since sign would be 0 there.)
Another example of avoiding branches

for (i=0; i<n; i++) {
  if (i==0)
    a[i] = b[i+1]-b[i];
  else if (i==n-1)
    a[i] = b[i]-b[i-1];
  else
    a[i] = b[i+1]-b[i-1];
}
Another example of avoiding branches (cont’d)

Using the technique of loop peeling, we can re-code as follows:

a[0] = b[1]-b[0];
for (i=1; i<n-1; i++)
  a[i] = b[i+1]-b[i-1];
a[n-1] = b[n-1]-b[n-2];
Yet another example of avoiding branches

for (i=0; i<n; i++) {
  if (j>0)
    x[i] = x[i] + 1;
  else
    x[i] = 0;
}

Since the test on j is loop-invariant, the branch can be hoisted out of the loop:

if (j>0)
  for (i=0; i<n; i++)
    x[i] = x[i] + 1;
else
  for (i=0; i<n; i++)
    x[i] = 0;
Using SIMD instructions

A “vectorizable” loop can potentially run faster if multiple operations can be performed with a single instruction.
Using SIMD instructions, register-to-register operations will be greatly accelerated.
Warning: if the code is strongly limited by memory bandwidth, no SIMD technique can bridge this gap.
Ideal scenario for applying SIMD to a loop

All iterations are independent
There is no branch in the loop body
The arrays are accessed with a stride of one

Example:

for (i=0; i<N; i++)
  r[i] = x[i] + y[i];

(We assume here that the memory regions pointed to by r, x, y do not overlap—no aliasing.)
An example of applying SIMD

Pseudocode of applying SIMD (assuming that each SIMD register can store 4 values):

int i, rest = N % 4;
for (i=0; i<N-rest; i+=4) {
  load R1 = [x[i], x[i+1], x[i+2], x[i+3]];
  load R2 = [y[i], y[i+1], y[i+2], y[i+3]];
  R3 = ADD(R1,R2);
  store [r[i], r[i+1], r[i+2], r[i+3]] = R3;
}
for (i=N-rest; i<N; i++)
  r[i] = x[i] + y[i];
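
For comparison (not part of the original slides), a sketch of the same loop written with x86 AVX intrinsics, assuming a CPU with AVX support and non-aliasing double-precision arrays:

#include <immintrin.h>

void add_avx(double *r, const double *x, const double *y, int N)
{
  int i, rest = N % 4;
  for (i=0; i<N-rest; i+=4) {
    __m256d R1 = _mm256_loadu_pd(&x[i]);  /* load x[i..i+3] */
    __m256d R2 = _mm256_loadu_pd(&y[i]);  /* load y[i..i+3] */
    __m256d R3 = _mm256_add_pd(R1, R2);   /* 4 additions with one instruction */
    _mm256_storeu_pd(&r[i], R3);          /* store r[i..i+3] */
  }
  for (i=N-rest; i<N; i++)                /* remainder loop */
    r[i] = x[i] + y[i];
}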
Beware of loop dependency!

If a loop iteration depends on the result of another iteration, we have a loop-carried dependency.

for (i=start; i<end; i++)
  A[i] = 10.0*A[i+offset];

If offset < 0 → real dependency (read-after-write hazard)
If offset > 0 → pseudo dependency (write-after-read hazard)
When there is loop-carried dependency...

In case of real dependency, SIMD cannot be applied if the magnitude of the negative offset is smaller than the SIMD width. For example,

for (i=start; i<end; i++)
  A[i] = 10.0*A[i-1];

In case of pseudo dependency, SIMD can be applied. For example, when offset > 0:

for (i=start; i<end; i++)
  A[i] = 10.0*A[i+offset];
Risk of aliasing

Is it safe to vectorize the following function?

void compute(int start, int stop, double *a, double *b)
{
  for (int i=start; i<stop; i++)
    a[i] = 10.0*b[i];
}
Risk of aliasing (cont’d)

A problem of “aliasing” will arise if the compute function is called as follows:

compute(0, N-1, &(array_a[1]), array_a);

If the programmer can guarantee that aliasing won’t happen, this hint can be provided to the compiler.
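
In C99, for example, this hint can be given with the restrict qualifier; a sketch of the same function under the assumption that a and b never overlap:

void compute(int start, int stop, double * restrict a, const double * restrict b)
{
  /* The restrict (and const) qualifiers promise the compiler that a and b
     do not alias within this function, so the loop can be safely vectorized. */
  for (int i=start; i<stop; i++)
    a[i] = 10.0*b[i];
}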
The role of compilers

A compiler translates a program, which is implemented in a programming language, into machine code.
A compiler can carry out code optimizations of various degrees, dictated by the compiler options provided by the user (-O0, -O1, -O2, ...).
Different compilers may allow different compiler options; refer to the user manual!
Numerical accuracy may suffer from overly aggressive compiler optimizations.
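
As an illustration (not from the slides), with GCC the optimization level and some more aggressive options might be selected like this; other compilers use different option names:

gcc -O0 prog.c -o prog                  # no optimization, easiest to debug
gcc -O2 prog.c -o prog                  # common optimization level
gcc -O3 -march=native prog.c -o prog    # aggressive optimization for the host CPU
gcc -O3 -ffast-math prog.c -o prog      # fastest, but may change floating-point results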
Profiling

Profiling: gathering information about a program’s behavior, especially its use of resources. The purpose is to pinpoint the “hot spots” and, more importantly, to identify performance optimization opportunities (if any) and/or bugs.
Two approaches to “information gathering”:

Instrumentation—the compiler automatically inserts code to log each function call during the actual execution
Sampling—the program execution is interrupted at periodic intervals, with information being recorded
GNU gprof

One well-known profiler: GNU gprof
https://sourceware.org/binutils/docs/gprof/

Step 1: compile and link the program with profiling enabled;
Step 2: execute the program to generate a profile data file;
Step 3: run gprof to analyze the profile data.

(There are other profilers, of course.)
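
A typical gprof session for the three steps might look like this (a sketch; the program name prog.c and the output file names are just placeholders):

gcc -pg -O2 prog.c -o prog            # Step 1: compile and link with profiling enabled
./prog                                # Step 2: run; profile data is written to gmon.out
gprof ./prog gmon.out > report.txt    # Step 3: analyze the profile data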


Hardware performance counters

Knowing how much time is spent where is the first step. But what is the actual reason for a “slow code”, and by which resource is the performance limited?
Modern processors feature a small number of performance counters, which are special on-chip registers that get incremented each time a certain event occurs.
Possible events that can be monitored (an example of reading them out follows this list):

number of cache line transfers
number of loads and stores
number of floating-point operations
number of branch mispredictions
number of pipeline stalls
number of instructions executed
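
On Linux, one widely available way to read such counters is the perf tool; a minimal sketch (assuming perf is installed and the executable is called prog, which is not from the slides):

perf stat ./prog                                              # summary of basic hardware counters
perf stat -e instructions,branch-misses,cache-misses ./prog   # selected events only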
