
6.172
Performance Engineering of Software Systems

LECTURE 2
Bentley Rules for Optimizing Work
Julian Shun

© 2008–2018 by the MIT 6.172 Lecturers
Work

Definition.
The work of a program (on a given input) is
the sum total of all the operations executed
by the program.

Image used under CC0 by openclipart.
Optimizing Work
● Algorithm design can produce dramatic
reductions in the amount of work it takes to
solve a problem, as when a Θ(n lg n)-time sort
replaces a Θ(n²)-time sort.
● Reducing the work of a program does not auto-
matically reduce its running time, however, due
to the complex nature of computer hardware:
• instruction-level parallelism (ILP),
• caching,
• vectorization,
• speculation and branch prediction,
• etc.
● Nevertheless, reducing the work serves as a good
heuristic for reducing overall running time.
“BENTLEY”
OPTIMIZATION RULES

New “Bentley” Rules
● Most of Bentley’s original rules dealt with work, but
some dealt with the vagaries of computer
architecture three and a half decades ago.
● We have created a new set of Bentley rules dealing
only with work.
● We shall discuss architecture-dependent
optimizations in subsequent lectures.

New Bentley Rules
Data structures
● Packing and encoding
● Augmentation
● Precomputation
● Compile-time initialization
● Caching
● Lazy evaluation
● Sparsity

Logic
● Constant folding and propagation
● Common-subexpression elimination
● Algebraic identities
● Short-circuiting
● Ordering tests
● Creating a fast path
● Combining tests

Loops
● Hoisting
● Sentinels
● Loop unrolling
● Loop fusion
● Eliminating wasted iterations

Functions
● Inlining
● Tail-recursion elimination
● Coarsening recursion


DATA STRUCTURES

Packing and Encoding
The idea of packing is to store more than one data
value in a machine word. The related idea of
encoding is to convert data values into a
representation requiring fewer bits.

Example: Encoding dates


● The string “September 11, 2018” can be stored in 18
bytes — more than two double (64-bit) words — which
must be moved whenever a date is manipulated.
● Assuming that we only store years between 4096
B.C.E. and 4096 C.E., there are about 365.25 × 8192
≈ 3 M dates, which can be encoded in ⌈lg(3×10⁶)⌉ =
22 bits, easily fitting in a single (32-bit) word.
● But determining the month of a date takes more
work than with the string representation.
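As a concrete sketch of such a sequential encoding (the ranges and layout here are our own illustration, not the lecture's exact scheme), a date maps to one small integer, and extracting a field back out takes arithmetic:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical sequential encoding: years in [-4096, 4095],
// months 1..12, days 1..31.  That is 8192 * 12 * 31 ≈ 3.0 M codes,
// which fits in 22 bits (2^22 ≈ 4.2 M).
uint32_t encode_date(int year, int month, int day) {
  return ((uint32_t)(year + 4096) * 12 + (month - 1)) * 31
         + (uint32_t)(day - 1);
}

// Decoding the month requires divisions, unlike the string form.
int month_of(uint32_t code) {
  return (int)(code / 31 % 12) + 1;
}
```

Note how the compactness is paid for at extraction time, which is exactly the trade-off the bullet above describes.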
Packing and Encoding (2)
Example: Packing dates
● Instead, let us pack the three fields into a word:

typedef struct {
  int year: 13;
  int month: 4;
  int day: 5;
} date_t;

● This packed representation still only takes 22 bits,
but the individual fields can be extracted much
more quickly than if we had encoded the 3 M dates
as sequential integers.
● Sometimes unpacking and decoding are the
optimization, depending on whether more work is
involved moving the data or operating on it.
Augmentation
The idea of data-structure augmentation is to add
information to a data structure to make common
operations do less work.
Example: Appending singly linked lists
● Appending one list to another requires walking
the length of the first list to set its null pointer to
the start of the second.
● Augmenting the list with a tail pointer allows
appending to operate in constant time.
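A minimal sketch of the augmented list (the `list_t` type carrying both head and tail pointers is our own naming); the append becomes a constant-time pointer splice:

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
  int value;
  struct node *next;
} node_t;

// Augmented list: a tail pointer is kept alongside the head.
typedef struct {
  node_t *head;
  node_t *tail;
} list_t;

// Append list b onto list a in O(1), with no walk to a's end.
void list_append(list_t *a, list_t *b) {
  if (b->head == NULL) return;    // nothing to append
  if (a->head == NULL) {
    *a = *b;                      // a was empty; take b wholesale
  } else {
    a->tail->next = b->head;      // splice b after a's last node
    a->tail = b->tail;
  }
  b->head = b->tail = NULL;       // b is now empty
}
```

The cost is that every operation which changes the end of the list must also maintain the tail pointer, a typical price of data-structure augmentation.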
Precomputation
The idea of precomputation is to perform calculations
in advance so as to avoid doing them at “mission-
critical” times.
Example: Binomial coefficients
(n choose k) = n! / (k! (n − k)!)

Computing the “choose” function by implementing
this formula can be expensive (lots of multiplications),
and watch out for integer overflow for even modest
values of n and k.
Idea: Precompute the table of coefficients when
initializing, and perform table look-up at runtime.
Pascal’s Triangle
(n choose k) = n! / (k! (n − k)!)

     k=0   1   2   3   4   5   6   7   8
n=0    1   0   0   0   0   0   0   0   0
n=1    1   1   0   0   0   0   0   0   0
n=2    1   2   1   0   0   0   0   0   0
n=3    1   3   3   1   0   0   0   0   0
n=4    1   4   6   4   1   0   0   0   0
n=5    1   5  10  10   5   1   0   0   0
n=6    1   6  15  20  15   6   1   0   0
n=7    1   7  21  35  35  21   7   1   0
n=8    1   8  28  56  70  56  28   8   1

int choose(int n, int k) {
  if (n < k) return 0;
  if (n == 0) return 1;
  if (k == 0) return 1;
  return choose(n-1, k-1) + choose(n-1, k);
}
Precomputing Pascal
#define CHOOSE_SIZE 100
int choose[CHOOSE_SIZE][CHOOSE_SIZE];

void init_choose() {
  for (int n = 0; n < CHOOSE_SIZE; ++n) {
    choose[n][0] = 1;
    choose[n][n] = 1;
  }
  for (int n = 1; n < CHOOSE_SIZE; ++n) {
    choose[0][n] = 0;
    for (int k = 1; k < n; ++k) {
      choose[n][k] = choose[n-1][k-1] + choose[n-1][k];
      choose[k][n] = 0;
    }
  }
}

Now, whenever we need a binomial coefficient (less
than 100), we can simply index the choose array.
Compile-Time Initialization
The idea of compile-time initialization is to store the
values of constants during compilation, saving work at
execution time.

Example
int choose[10][10] = {
  { 1, 0,  0,  0,   0,   0,  0,  0, 0, 0 },
  { 1, 1,  0,  0,   0,   0,  0,  0, 0, 0 },
  { 1, 2,  1,  0,   0,   0,  0,  0, 0, 0 },
  { 1, 3,  3,  1,   0,   0,  0,  0, 0, 0 },
  { 1, 4,  6,  4,   1,   0,  0,  0, 0, 0 },
  { 1, 5, 10, 10,   5,   1,  0,  0, 0, 0 },
  { 1, 6, 15, 20,  15,   6,  1,  0, 0, 0 },
  { 1, 7, 21, 35,  35,  21,  7,  1, 0, 0 },
  { 1, 8, 28, 56,  70,  56, 28,  8, 1, 0 },
  { 1, 9, 36, 84, 126, 126, 84, 36, 9, 1 },
};

Compile-Time Initialization (2)
Idea: Create large static tables by metaprogramming.

int main(int argc, const char *argv[]) {
  init_choose();
  printf("int choose[10][10] = {\n");
  for (int a = 0; a < 10; ++a) {
    printf("  {");
    for (int b = 0; b < 10; ++b) {
      printf("%3d, ", choose[a][b]);
    }
    printf("},\n");
  }
  printf("};\n");
}

Caching
The idea of caching is to store results that have been
accessed recently so that the program need not
compute them again.

inline double hypotenuse(double A, double B) {
  return sqrt(A*A + B*B);
}

double cached_A = 0.0;
double cached_B = 0.0;
double cached_h = 0.0;

inline double hypotenuse(double A, double B) {
  if (A == cached_A && B == cached_B) {
    return cached_h;
  }
  cached_A = A;
  cached_B = B;
  cached_h = sqrt(A*A + B*B);
  return cached_h;
}

The cached version is about 30% faster if the cache
is hit 2/3 of the time.
Sparsity
The idea of exploiting sparsity is to avoid storing and
computing on zeroes. “The fastest way to compute is
not to compute at all.”
Example: Matrix-vector multiplication
⎛ 3 0 0 0 1 0 ⎞⎛ 1 ⎞
⎜ ⎟⎜ ⎟
⎜ 0 4 1 0 5 9 ⎟⎜ 4 ⎟
⎜ 0 0 0 2 0 6 ⎟⎜ 2 ⎟
y = ⎜ ⎟⎜ ⎟
5 0 0 3 0 0 8
⎜ ⎟⎜ ⎟
⎜⎜ 5 0 0 0 8 0 ⎟⎟ ⎜⎜ 5 ⎟⎟
⎝ 0 0 0 9 7 0 ⎠⎝ 7 ⎠

Dense matrix-vector multiplication performs n² = 36
scalar multiplies, but only 14 entries are nonzero.
Sparsity (2)
Compressed Sparse Row (CSR)
index   0  1  2  3  4  5  6  7  8  9 10 11 12 13
rows:   0  2  6  8 10 12 14
cols:   0  4  1  2  4  5  3  5  0  3  0  4  3  4
vals:   3  1  4  1  5  9  2  6  5  3  5  8  9  7

(The rows array holds per-row offsets into cols and
vals; for the matrix above, n = 6 and nnz = 14.)

Storage is O(n + nnz) instead of O(n²).
Sparsity (3)
CSR matrix-vector multiplication
typedef struct {
  int n, nnz;
  int *rows;     // length n+1
  int *cols;     // length nnz
  double *vals;  // length nnz
} sparse_matrix_t;

void spmv(sparse_matrix_t *A, double *x, double *y) {
  for (int i = 0; i < A->n; i++) {
    y[i] = 0;
    for (int k = A->rows[i]; k < A->rows[i+1]; k++) {
      int j = A->cols[k];
      y[i] += A->vals[k] * x[j];
    }
  }
}

Number of scalar multiplications = nnz, which is
potentially much less than n².
Sparsity (4)
Storing a static sparse graph

Vertex IDs  0  1  2  3  4
Offsets     0  2  5  5  6  7

Edges       1  3  2  3  4  2  2

● Can run many graph algorithms efficiently on this
representation, e.g., breadth-first search, PageRank
● Can store edge weights with an additional array or
interleaved with Edges
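For instance, with the Offsets/Edges arrays above, the neighbors of vertex v occupy Edges[Offsets[v] .. Offsets[v+1]−1], so a degree query or a neighbor scan needs no search. A small sketch (the lowercase array names are ours):

```c
#include <assert.h>

// Offsets and Edges for the 5-vertex graph above.
static const int offsets[] = {0, 2, 5, 5, 6, 7};
static const int edges[]   = {1, 3, 2, 3, 4, 2, 2};

// Out-degree of v is just the difference of adjacent offsets.
int degree(int v) {
  return offsets[v + 1] - offsets[v];
}

// Count vertices with an edge to target by scanning each
// vertex's contiguous neighbor range.
int in_degree(int target, int n) {
  int count = 0;
  for (int v = 0; v < n; ++v) {
    for (int k = offsets[v]; k < offsets[v + 1]; ++k) {
      if (edges[k] == target) ++count;
    }
  }
  return count;
}
```

This contiguous layout is also what makes breadth-first search cache-friendly on this representation: each frontier vertex touches one dense slice of Edges.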
LOGIC

Constant Folding and Propagation
The idea of constant folding and propagation is to
evaluate constant expressions and substitute the
result into further expressions, all during compilation.
#include <math.h>

void orrery() {
  const double radius = 6371000.0;
  const double diameter = 2 * radius;
  const double circumference = M_PI * diameter;
  const double cross_area = M_PI * radius * radius;
  const double surface_area = circumference * diameter;
  const double volume = 4 * M_PI * radius * radius * radius / 3;
  // ...
}

With a sufficiently high optimization level, all the
expressions are evaluated at compile-time.

Common-Subexpression Elimination
The idea of common-subexpression elimination is to
avoid computing the same expression multiple times
by evaluating the expression once and storing the
result for later use.

a = b + c;        a = b + c;
b = a - d;        b = a - d;
c = b + c;        c = b + c;
d = a - d;        d = b;

The third line cannot be replaced by c = a, because
the value of b changes in the second line.

Algebraic Identities
The idea of exploiting algebraic identities is to replace
expensive algebraic expressions with algebraic
equivalents that require less work.
#include <stdbool.h>
#include <math.h>

typedef struct {
  double x; // x-coordinate
  double y; // y-coordinate
  double z; // z-coordinate
  double r; // radius of ball
} ball_t;

double square(double x) {
  return x*x;
}

bool collides(ball_t *b1, ball_t *b2) {
  double d = sqrt(square(b1->x - b2->x)
               + square(b1->y - b2->y)
               + square(b1->z - b2->z));
  return d <= b1->r + b2->r;
}
Algebraic Identities (2)
Comparing squared distances avoids the expensive
square root entirely:

bool collides(ball_t *b1, ball_t *b2) {
  double dsquared = square(b1->x - b2->x)
                  + square(b1->y - b2->y)
                  + square(b1->z - b2->z);
  return dsquared <= square(b1->r + b2->r);
}

For nonnegative u and v, we have √u ≤ v exactly when
u ≤ v².
Short-Circuiting
When performing a series of tests, the idea of short-
circuiting is to stop evaluating as soon as you know
the answer.
#include <stdbool.h>
// All elements of A are nonnegative
bool sum_exceeds(int *A, int n, int limit) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    sum += A[i];
  }
  return sum > limit;
}

#include <stdbool.h>
// All elements of A are nonnegative
bool sum_exceeds(int *A, int n, int limit) {
  int sum = 0;
  for (int i = 0; i < n; i++) {
    sum += A[i];
    if (sum > limit) {
      return true;
    }
  }
  return false;
}

Note that && and || are short-circuiting logical
operators, whereas & and | are not.
Ordering Tests
Consider code that executes a sequence of logical
tests. The idea of ordering tests is to perform those
that are more often “successful” — a particular
alternative is selected by the test — before tests that
are rarely successful. Similarly, inexpensive tests
should precede expensive ones.

#include <stdbool.h>
bool is_whitespace(char c) {
  if (c == '\r' || c == '\t' || c == ' ' || c == '\n') {
    return true;
  }
  return false;
}

#include <stdbool.h>
bool is_whitespace(char c) {
  if (c == ' ' || c == '\n' || c == '\t' || c == '\r') {
    return true;
  }
  return false;
}
Creating a Fast Path
#include <stdbool.h>
#include <math.h>

typedef struct {
  double x; // x-coordinate
  double y; // y-coordinate
  double z; // z-coordinate
  double r; // radius of ball
} ball_t;

double square(double x) {
  return x*x;
}

bool collides(ball_t *b1, ball_t *b2) {
  double dsquared = square(b1->x - b2->x)
                  + square(b1->y - b2->y)
                  + square(b1->z - b2->z);
  return dsquared <= square(b1->r + b2->r);
}

Creating a Fast Path (2)
#include <stdbool.h>
#include <math.h>

typedef struct {
  double x; // x-coordinate
  double y; // y-coordinate
  double z; // z-coordinate
  double r; // radius of ball
} ball_t;

double square(double x) {
  return x*x;
}

bool collides(ball_t *b1, ball_t *b2) {
  if ((fabs(b1->x - b2->x) > (b1->r + b2->r)) ||
      (fabs(b1->y - b2->y) > (b1->r + b2->r)) ||
      (fabs(b1->z - b2->z) > (b1->r + b2->r))) {
    return false;
  }
  double dsquared = square(b1->x - b2->x)
                  + square(b1->y - b2->y)
                  + square(b1->z - b2->z);
  return dsquared <= square(b1->r + b2->r);
}

Combining Tests
The idea of combining tests is to replace a sequence
of tests with one test or switch.

Full adder truth table:

a b c | carry sum
0 0 0 |   0    0
0 0 1 |   0    1
0 1 0 |   0    1
0 1 1 |   1    0
1 0 0 |   0    1
1 0 1 |   1    0
1 1 0 |   1    0
1 1 1 |   1    1

void full_add(int a, int b, int c, int *sum, int *carry) {
  if (a == 0) {
    if (b == 0) {
      if (c == 0) {
        *sum = 0;
        *carry = 0;
      } else {
        *sum = 1;
        *carry = 0;
      }
    } else {
      if (c == 0) {
        *sum = 1;
        *carry = 0;
      } else {
        *sum = 0;
        *carry = 1;
      }
    }
  } else {
    if (b == 0) {
      if (c == 0) {
        *sum = 1;
        *carry = 0;
      } else {
        *sum = 0;
        *carry = 1;
      }
    } else {
      if (c == 0) {
        *sum = 0;
        *carry = 1;
      } else {
        *sum = 1;
        *carry = 1;
      }
    }
  }
}
Combining Tests (2)
The idea of combining tests is to replace a sequence
of tests with one test or switch.

void full_add(int a, int b, int c, int *sum, int *carry) {
  int test = ((a == 1) << 2)
           | ((b == 1) << 1)
           | (c == 1);
  switch (test) {
    case 0:
      *sum = 0; *carry = 0;
      break;
    case 1:
      *sum = 1; *carry = 0;
      break;
    case 2:
      *sum = 1; *carry = 0;
      break;
    case 3:
      *sum = 0; *carry = 1;
      break;
    case 4:
      *sum = 1; *carry = 0;
      break;
    case 5:
      *sum = 0; *carry = 1;
      break;
    case 6:
      *sum = 0; *carry = 1;
      break;
    case 7:
      *sum = 1; *carry = 1;
      break;
  }
}

For this example, table look-up is even better!
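The table look-up variant alluded to above can be sketched like this (the table encoding is our own illustration), indexing constant arrays by the 3-bit value (a<<2)|(b<<1)|c:

```c
#include <assert.h>

// Sum and carry for each of the 8 input combinations abc,
// read straight off the truth table.
static const int sum_table[8]   = {0, 1, 1, 0, 1, 0, 0, 1};
static const int carry_table[8] = {0, 0, 0, 1, 0, 1, 1, 1};

void full_add(int a, int b, int c, int *sum, int *carry) {
  int idx = (a << 2) | (b << 1) | c;  // pack the three input bits
  *sum = sum_table[idx];
  *carry = carry_table[idx];
}
```

All tests collapse into one index computation and two loads, with no branches at all.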
LOOPS

Hoisting
The goal of hoisting — also called loop-invariant code
motion — is to avoid recomputing loop-invariant code
each time through the body of a loop.

#include <math.h>

void scale(double *X, double *Y, int N) {
  for (int i = 0; i < N; i++) {
    Y[i] = X[i] * exp(sqrt(M_PI/2));
  }
}

#include <math.h>

void scale(double *X, double *Y, int N) {
  double factor = exp(sqrt(M_PI/2));
  for (int i = 0; i < N; i++) {
    Y[i] = X[i] * factor;
  }
}

Sentinels
Sentinels are special dummy values placed in a data
structure to simplify the logic of boundary conditions,
and in particular, the handling of loop-exit tests.
#include <stdint.h>
#include <stdbool.h>

// All elements of A are nonnegative
bool overflow(int64_t *A, size_t n) {
  int64_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    sum += A[i];
    if (sum < A[i]) return true;
  }
  return false;
}

// Assumes that A[n] and A[n+1] exist and
// can be clobbered
bool overflow(int64_t *A, size_t n) {
  // All elements of A are nonnegative
  A[n] = INT64_MAX;
  A[n+1] = 1;  // or any positive number
  size_t i = 0;
  int64_t sum = A[0];
  while (sum >= A[i]) {
    sum += A[++i];
  }
  if (i < n) return true;
  return false;
}
Loop Unrolling
Loop unrolling attempts to save work by combining
several consecutive iterations of a loop into a single
iteration, thereby reducing the total number of
iterations of the loop and, consequently, the number
of times that the instructions that control the loop
must be executed.

● Full loop unrolling: All iterations are unrolled.
● Partial loop unrolling: Several, but not all, of the
iterations are unrolled.

Full Loop Unrolling
int sum = 0;
for (int i = 0; i < 10; i++) {
  sum += A[i];
}

int sum = 0;
sum += A[0];
sum += A[1];
sum += A[2];
sum += A[3];
sum += A[4];
sum += A[5];
sum += A[6];
sum += A[7];
sum += A[8];
sum += A[9];

Partial Loop Unrolling
int sum = 0;
for (int i = 0; i < n; ++i) {
  sum += A[i];
}

int sum = 0;
int j;
for (j = 0; j < n-3; j += 4) {
  sum += A[j];
  sum += A[j+1];
  sum += A[j+2];
  sum += A[j+3];
}
for (int i = j; i < n; ++i) {
  sum += A[i];
}

Benefits of loop unrolling
● Lower number of instructions in loop control code
● Enables more compiler optimizations

Unrolling too much can cause poor use of the
instruction cache.
Loop Fusion
The idea of loop fusion — also called jamming — is to
combine multiple loops over the same index range
into a single loop body, thereby saving the overhead
of loop control.

for (int i = 0; i < n; ++i) {
  C[i] = (A[i] <= B[i]) ? A[i] : B[i];
}

for (int i = 0; i < n; ++i) {
  D[i] = (A[i] <= B[i]) ? B[i] : A[i];
}

for (int i = 0; i < n; ++i) {
  C[i] = (A[i] <= B[i]) ? A[i] : B[i];
  D[i] = (A[i] <= B[i]) ? B[i] : A[i];
}

Eliminating Wasted Iterations
The idea of eliminating wasted iterations is to modify
loop bounds to avoid executing loop iterations over
essentially empty loop bodies.

for (int i = 0; i < n; ++i) {
  for (int j = 0; j < n; ++j) {
    if (i > j) {
      int temp = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = temp;
    }
  }
}

for (int i = 1; i < n; ++i) {
  for (int j = 0; j < i; ++j) {
    int temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

FUNCTIONS

Inlining
The idea of inlining is to avoid the overhead of a
function call by replacing a call to the function with
the body of the function itself.
double square(double x) {
  return x*x;
}

double sum_of_squares(double *A, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    sum += square(A[i]);
  }
  return sum;
}

double sum_of_squares(double *A, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    double temp = A[i];
    sum += temp*temp;
  }
  return sum;
}

Inlining (2)
The idea of inlining is to avoid the overhead of a
function call by replacing a call to the function with
the body of the function itself.
static inline double square(double x) {
  return x*x;
}

double sum_of_squares(double *A, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i) {
    sum += square(A[i]);
  }
  return sum;
}

Declaring the function static inline encourages the
compiler to inline it for us. Inlined functions can be
just as efficient as macros, and they are better
structured.
Tail-Recursion Elimination
The idea of tail-recursion elimination is to replace a
recursive call that occurs as the last step of a function
with a branch, saving function-call overhead.

void quicksort(int *A, int n) {
  if (n > 1) {
    int r = partition(A, n);
    quicksort(A, r);
    quicksort(A + r + 1, n - r - 1);
  }
}

void quicksort(int *A, int n) {
  while (n > 1) {
    int r = partition(A, n);
    quicksort(A, r);
    A += r + 1;
    n -= r + 1;
  }
}

Coarsening Recursion
The idea of coarsening recursion is to increase the
size of the base case and handle it with more efficient
code that avoids function-call overhead.
void quicksort(int *A, int n) {
  while (n > 1) {
    int r = partition(A, n);
    quicksort(A, r);
    A += r + 1;
    n -= r + 1;
  }
}

#define THRESHOLD 8

void quicksort(int *A, int n) {
  while (n > THRESHOLD) {
    int r = partition(A, n);
    quicksort(A, r);
    A += r + 1;
    n -= r + 1;
  }
  // insertion sort for small arrays
  for (int j = 1; j < n; ++j) {
    int key = A[j];
    int i = j - 1;
    while (i >= 0 && A[i] > key) {
      A[i+1] = A[i];
      --i;
    }
    A[i+1] = key;
  }
}
SUMMARY

New Bentley Rules
Data structures
● Packing and encoding
● Augmentation
● Precomputation
● Compile-time initialization
● Caching
● Lazy evaluation
● Sparsity

Logic
● Constant folding and propagation
● Common-subexpression elimination
● Algebraic identities
● Short-circuiting
● Ordering tests
● Creating a fast path
● Combining tests

Loops
● Hoisting
● Sentinels
● Loop unrolling
● Loop fusion
● Eliminating wasted iterations

Functions
● Inlining
● Tail-recursion elimination
● Coarsening recursion

Closing Advice
● Avoid premature optimization. First get correct
working code. Then optimize, preserving
correctness by regression testing.
● Reducing the work of a program does not
necessarily decrease its running time, but it is a
good heuristic.
● The compiler automates many low-level
optimizations.
● To tell if the compiler is actually performing a
particular optimization, look at the assembly code.

● If you find interesting examples of work
optimization, please let us know!

MIT OpenCourseWare
https://ocw.mit.edu

6.172 Performance Engineering of Software Systems


Fall 2018

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
