MATH49111/69111: Scientific Computing

Lecture 10
26th October 2017

Dr Chris Johnson
[email protected]
Optimisation

. Arguments for and against optimisation

. Algorithms and complexity

. Measuring performance

. Worked example: Mandelbrot set

. Parallel code

. Memory and cache optimisation

Nothing in this lecture is essential knowledge for the projects.


Writing fast code

. Scientific programs can run for a long time (> 100 CPU-years)

. Therefore, worth investing time optimising them for speed


. Every level of a program's design affects its speed:
. Program architecture
. Algorithm choice
. Algorithm implementation
. Compilation / assembly language code
. Hardware

. Decisions to be taken both before we start writing code, and
  during refactoring of existing code

. Optimisation does not respect abstraction: the fastest code
  might not have the best structure.
Algorithms: time complexity

. The time an algorithm takes to execute is called its time complexity

. This is approximated by its number of operations, a count of the
  number of basic scalar operations (+, −, ×, ÷)

. We say that the time complexity T(n) of an algorithm is O(f(n))
  (we say "order f of n"), where n is the size of the input to the
  algorithm, if

      limsup_{n→∞} T(n)/f(n) < ∞

  i.e. if T(n) is no greater than a constant multiple of f(n).


Algorithms: time complexity

Example: for dense n × n matrices and n-vectors


. Vector dot product: O(n) operations
. Matrix-vector product: O(n²) operations
. Matrix-matrix product: O(n³) operations

More slowly growing complexity is (almost) always better:


. Strassen algorithm calculates matrix-matrix product in
  O(n^(log₂ 7)) = O(n^2.81...) operations
. In practice, slower than naïve algorithm unless n ≳ 1000

Recursive algorithms (e.g. Fast Fourier Transform, heap sort, fast
multipole method) can often reduce O(n²) problems to O(n log n)
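
As a minimal illustration of these operation counts (not from the lecture;
the function names and flat row-major storage are just one possible choice):

#include <vector>

// Dot product of two n-vectors: n multiply-adds, so O(n) operations
double dot(const std::vector<double>& x, const std::vector<double>& y)
{
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); i++)
        s += x[i] * y[i];
    return s;
}

// Naive product C = A*B of dense n-by-n matrices, each stored row by row
// in a flat array: three nested loops of length n, so O(n^3) operations
std::vector<double> matMul(const std::vector<double>& A,
                           const std::vector<double>& B, std::size_t n)
{
    std::vector<double> C(n*n, 0.0);
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            for (std::size_t k = 0; k < n; k++)
                C[i*n + j] += A[i*n + k] * B[k*n + j];
    return C;
}
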
Writing fast code: optimisation

. In most programs, the execution time is dominated by a small
  number of lines of code

. For most lines of code the trade-off of clearest structure vs.
  highest speed is not in favour of optimising for speed.
. Optimisation adversely affects code:
. portability
. flexibility
. maintainability

. Important to measure which lines of code are taking the time,
  by profiling.
Mandelbrot set: an example of optimisation

The Mandelbrot set M is the set of complex numbers c ∈ ℂ such that
the iteration

    z_0 = 0,
    z_{k+1} = z_k² + c

remains bounded as k → ∞.

How can we convert this ‘infinite’ problem into a finite one?

There are two ‘infinities’ in the definition of M:

. |z_k| becomes infinite when z_k diverges
. We are interested in the divergence as k → ∞
Mandelbrot set: Divergence of zk
If |z_k| > 2 and |z_k| ≥ |c|, then

    |z_{k+1}| = |z_k² + c| ≥ |z_k|² − |c| ≥ |z_k|² − |z_k| > 2|z_k| − |z_k| = |z_k|,

so c is not in the Mandelbrot set (c ∉ M) if

    |c| > 2,   and/or   |z_k| > 2 for some k.

We approximate

    "z_k remains bounded as k → ∞"

by

    |z_k| < 2 for all k ≤ K

for some (large) constant K, and check that our approximation to M
converges as K increases.
Mandelbrot set: unoptimised C++ code
#include <complex>   // mandelbrot_1.cpp
#include <iostream>

int main()
{
    for (int i=-500; i<=250; i++)        // loop over real part of c
    {
        for (int j=-375; j<=375; j++)    // loop over imag part of c
        {
            std::complex<double> z(0,0), c(i/250.0, j/250.0);
            int k = 0;                   // loop up to 2500 times, or until |z|>2
            for (; k < 2500 && std::abs(z)<2.0; k++)
                z = z*z + c;             // apply iteration
            std::cout << k << " ";       // output # of iterations
        }
        std::cout << std::endl;
    }
    return 0;
}
Mandelbrot set: Program output
Mandelbrot set: Run times (unoptimised)
Mandelbrot set run time (using single core of Core i7-4770K):

Language          Time (s)   Relative speed
MATLAB (R2015a)   213.4      16.4× slower
Python (2.7.10)   116.4      8.94× slower
C++ (gcc 4.8.5)   13.02      (reference)

. MATLAB and Python are interpreted languages
. C++ is compiled to machine code

. Further optimisation possible with all languages

. We will profile and optimise the C++ code, by
  1. Improving the algorithm, so that less work is required
  2. Improving the code/compilation, so that the required work is
     done more quickly
Timing code

. On Linux, time shows total program execution time:


$ time ./my_program

real 0m13.015s
user 0m12.998s
sys 0m0.004s

. Timer functions in C++ allow timing of program sections:
  . Platform-dependent timers
  . std::chrono::high_resolution_clock::now() in the
    <chrono> header (C++11 only); a minimal sketch follows at the
    end of this slide
  . See examples on course website

. Using a profiler gives a breakdown of the time taken by each
  section of code
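
A minimal sketch of timing a section of code with the <chrono> timer above
(C++11); the loop being timed is just a placeholder:

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    // ... section of code to be timed ...
    double s = 0.0;
    for (int i = 0; i < 100000000; i++)
        s += 1.0/(i + 1.0);

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;   // in seconds
    std::cout << "Elapsed: " << elapsed.count() << " s (result " << s << ")\n";
    return 0;
}
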
Profiling output
We can measure which parts of the code are using the most time
with a profiler (e.g. gprof (Linux), VS Diagnostic Tools (Windows))
% time   total seconds        calls   name
16.87 1.75 239353549 abs<D>(complex<D> const&)
14.46 1.50 main
13.34 1.39 238883944 complex<D>::operator*=<D>(complex<D> const&)
11.99 1.25 239353549 __complex_abs(Dcomplex )
10.15 1.05 238883944 operator+<D>(complex<D> const&, complex<D> const&)
8.90 0.92 238883944 complex<D>::operator+=<D>(complex<D> const&)
7.74 0.80 238883944 operator*<D>(complex<D> const&, complex<D> const&)
5.61 0.58 477767888 complex<D>::imag() const
5.08 0.53 239353549 complex<D>::__rep() const
4.74 0.49 477767888 complex<D>::real() const
1.06 0.11 1128002 complex<D>::complex(D, D)

. Algorithm evaluated for ≈ 5.6 × 10⁵ values of c
. Innermost loop repeated ≈ 2.3 × 10⁸ times
. Most expensive single function is abs (used in abs(z)<2.0)
. Cost is due to evaluation of square root
Square-root free algorithm C++ code
. Replace abs(z)<2.0 test with the equivalent
z.real()*z.real()+z.imag()*z.imag()<4.0

for (int i=-500; i<=250; i++)
{
    for (int j=-375; j<=375; j++)
    {
        std::complex<double> z(0,0), c(i/250.0, j/250.0);
        int k = 0;
        for (; k < 2500 &&
               z.real()*z.real()+z.imag()*z.imag()<4.0; k++)
            z = z*z + c;
        std::cout << k << " ";
    }
    std::cout << std::endl;
}

. Program takes 8.472 seconds (1.54× faster)


Compiler flags
. When calling gcc we can set optimisation flags:
-O2 Set the Optimisation level to 2 (produces faster
code, but takes longer to compile)
-ffast-math Makes floating point maths faster, but doesn’t
guarantee floating point associativity rules,
behaviour of infinity, NaN, etc.
-march=native Optimises code for the CPU type, cache
size etc. of the current machine

. Compiled with
      g++ -O2 -ffast-math -march=native mandelbrot_2.cpp
  the program takes 0.776s (10.9× speedup over the previous version)

. Dramatic speedup mainly due to -O2 inlining functions.
. In Visual Studio, use ‘Release’ target to enable optimisations
Improving the algorithm: Periodicity checking

What happens to zk at values inside the Mandelbrot set?

(iterates shown for c = −0.9 + 0.1i, a point inside the set)

 k     z_k
 0      0.000000 + 0.000000i
 1     −0.900000 + 0.100000i
 2     −0.100000 − 0.080000i
...    ...
36     −0.093627 − 0.123035i
37     −0.906372 + 0.123039i
38     −0.093629 − 0.123038i
39     −0.906372 + 0.123040i
...    ...

z_k approaches an oscillation with period ≥ 1


Improving the algorithm: Periodicity checking
. If z_k = z_i for i < k, the iteration will never diverge.
. Abort the iteration if z_k repeats an earlier value (to FP precision)
. Most efficient to check only a few i (here i is a power of 2)

int maxIters = 2500;
std::complex<double> z(0,0), c(i/250.0, j/250.0), p(0,0);
int k = 0, pIndex = 2;
for (; k < maxIters && abs(z)<2.0; k++)
{
    z = z*z + c;
    if (z == p)          // if we are in an orbit, quit
        {k = maxIters; break;}
    if (k == pIndex)     // update p every 2^n iterations
        {p = z; pIndex *= 2;}
}

. Reduces average iterations per c value from 424 to 92
. Program takes 0.188 seconds (4.12× speedup over the previous version)
Mandelbrot set: optimisations so far
Optimisations                       Time (s)   Speedup
No optimisations                    13.02      (reference)
+ square-root-free |z|² test        8.472      1.54×
+ -O2 -ffast-math -march=native     0.776      16.8×
+ periodicity checking              0.188      69.3×

What more can we do to optimise?

. Vector SIMD instructions (beyond the scope of this course)
. Memory/cache optimisation
  . Data layout in memory can affect speed by a factor of > 10
  . Not important for Mandelbrot calculation; see example later
. Parallelisation
  . Extremely important, especially for large programs
Parallel programs
. Modern processors have several independent cores, each of
which runs one or more threads
. Each thread executes instructions independently
. All our code so far has executed on just one thread

1955–2003: Processors double in clock speed every three years
  . A ‘free lunch’: all existing code gets faster
2003–: Speed increases come through more cores
  . From ≈ 4 cores (PCs) to ≈ 10⁶ cores (supercomputers)
  . Only efficiently parallelised code gets faster

. Threads are relatively slow at communicating with one another
. Challenge is to split algorithms into tasks for each thread
Parallelisation of Mandelbrot algorithm
Our Mandelbrot set program calculates the limit of an iteration
z_0, z_1, ... for many values of c:
. For each iteration we need z_k in order to calculate z_{k+1}
  ⇒ cannot parallelise across the iterations in k
. The iterations at different values of c are independent
  ⇒ easy to parallelise across different values of c

(diagram: four threads running in parallel over time, each computing the
iteration z_0 = 0, z_{k+1} = z_k² + c independently for its own value
c = c_1, c_2, c_3, c_4)
Parallelisation example: C++11 threads
#include <thread>
#include <vector>

void DoWork(int threadID)
{ /* Parallel work done in this function body */ }

int main()
{
    int nThreads = 8;
    std::vector<std::thread> threads(nThreads);

    // Start parallel work on each of 8 threads
    for (int i=0; i<nThreads; i++)
        threads[i] = std::thread(DoWork, i);

    // Wait here until all thread functions have returned
    for (int i=0; i<nThreads; i++)
        threads[i].join();
}
Parallelisation example: Mandelbrot set
Use C++11 threads to parallelise the Mandelbrot set program
. Each thread calculates a subset of the values of c
(every Nth column, where we have N threads)

(diagram: image columns assigned in turn to threads 1–4)

. Display of iteration counts must be single threaded:
  we now store iteration counts in memory, then display at the
  end of the program.
. Algorithm is otherwise the same as the single-threaded case.
Parallelisation example: Mandelbrot set
#include <complex>
#include <iostream>
#include <thread>
#include <vector>

int iSize=751, iStart=-500, jSize=751, jStart=-375; // loop lengths and start points
std::vector<int> iters(iSize*jSize);                // storage for iteration counts

void CalcMandelbrot(int skip, int ofs)  // skip = nThreads, ofs = threadID
{
    for (int i=ofs; i<iSize; i+=skip)
        for (int j=0; j<jSize; j++)
        {
            int maxIters = 2500;
            std::complex<double> z(0,0), c((i+iStart)/250.0, (j+jStart)/250.0), p(0,0);
            int k = 0, pIndex = 2;
            for (; k < maxIters && z.real()*z.real()+z.imag()*z.imag()<4.0; k++)
            {
                z = z*z + c;
                if (z == p)          // if we are in an orbit, quit
                    { k=maxIters; break; }
                if (k == pIndex)     // update p every 2^n iterations
                    {p = z; pIndex *= 2;}
            }
            iters[i*jSize + j] = k;
        }
}
Parallelisation example: Mandelbrot set
int main()
{
    int nThreads=2;
    std::vector<std::thread> threads(nThreads);

    // Start parallel work
    for (int i=0; i<nThreads; i++)
        threads[i] = std::thread(CalcMandelbrot, nThreads, i);

    // Finish parallel work
    for (int i=0; i<nThreads; i++)
        threads[i].join();

    // Display output in single-threaded code
    for (int i=0; i<iSize; i++)
    {
        for (int j=0; j<jSize; j++)
            std::cout << iters[i*jSize + j] << " ";
        std::cout << "\n";
    }
    return 0;
}

Optimised code is usually more complex!


Parallelisation example: Mandelbrot set
Threads   Time (s)   Speedup
1         0.182      (reference)
2         0.095      1.91×
3         0.067      2.71×
4         0.055      3.30×

. Speedup is a little less than the number of threads.
. Discrepancy due to remaining serial code (Amdahl’s law);
  a rough estimate is sketched below.
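  A rough estimate (not from the lecture): if a fraction s of the run time
  is serial, Amdahl's law gives the speedup on N threads as

      S(N) = 1 / (s + (1 − s)/N),

  so the measured 3.30× on 4 threads corresponds to s ≈ 0.07, i.e. roughly
  7% of the run time remaining serial.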

Optimisations                       Time (s)   Speedup
No optimisations                    13.02      (reference)
+ square-root-free |z|² test        8.472      1.54×
+ -O2 -ffast-math -march=native     0.776      16.8×
+ periodicity checking              0.188      69.3×
+ parallelisation (4 cores)         0.055      236.7×
Memory optimisation
Computers have a hierarchy of memory, with a latency/size trade-off:

Memory              Size          Latency (ns)
Registers           ≈ 50 bytes    0
L1 Cache            ≈ 32 KB       1.5
L2 Cache            ≈ 256 KB      4
L3 Cache            ≈ 8 MB        25
RAM                 ≈ 4 GB        100
Solid-state drive   ≈ 256 GB      16,000
Hard disk drive     ≈ 2 TB        4,000,000

. These latencies are improving only very slowly over time
. Need to arrange the most frequently used data in the fastest memory
. The layout of data in memory affects the speed at which it can be accessed
Memory optimisation: caches
. The cache stores copies of frequently accessed data from main memory
. Accessing a memory address puts the cache line (64 bytes of
  adjacent memory) containing that address in the cache
. Once this is done, it is much faster to access data from this line
. ‘Nearby’ memory accesses are much faster than random accesses

(plot: time per memory access (ns) against buffer size, from 256 bytes to
256 MB; access time rises in steps from ≈ 1 ns while the buffer fits in
L1 cache to ≈ 4 ns once it exceeds the L3 cache)
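
A minimal sketch (not from the lecture) of the cache-line effect behind the
‘nearby vs. random’ point above: the two loops below read the same array
elements, but the strided loop uses only one double from each 64-byte cache
line it fetches, so on arrays much larger than the cache it is typically
several times slower (sizes and timings will vary by machine):

#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    const std::size_t n = 1 << 24;        // 16M doubles (128 MB), well beyond L3
    const std::size_t stride = 8;         // 8 doubles = 64 bytes = one cache line
    std::vector<double> a(n, 1.0);
    double sum = 0.0;

    // Sequential: every element of each fetched cache line is used
    auto t0 = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i < n; i++)
        sum += a[i];
    auto t1 = std::chrono::high_resolution_clock::now();

    // Strided: same elements, but each access pulls in a fresh cache line
    for (std::size_t s = 0; s < stride; s++)
        for (std::size_t i = s; i < n; i += stride)
            sum += a[i];
    auto t2 = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double> seq = t1 - t0, strided = t2 - t1;
    std::cout << "sequential: " << seq.count() << " s, strided: "
              << strided.count() << " s (sum " << sum << ")\n";
    return 0;
}
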
Memory optimisation example: matrix-vector product

    b = Ax,        b_i = Σ_j a_ij x_j

for (int i=0; i<A.rows(); i++)
    for (int j=0; j<A.cols(); j++)
        b[i] += A(i, j)*x[j];

Faster¹ to store A = (a_ij) in memory in row-major order

    a_11 a_12 a_13 a_14 ... a_21 a_22 a_23 a_24 ... a_31 ...

than in column-major order

    a_11 a_21 a_31 a_41 ... a_12 a_22 a_32 a_42 ... a_13 ...

N.B.: The opposite is true for b = xA. The data format must suit the algorithm.
¹ I found 1.42 ms (column-major) vs. 3.22 ms (row-major) for a 1024 × 1024 matrix
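
As a concrete sketch of the two layouts (illustrative, not the lecture's
code; the function names and flat-array storage are invented here), computing
b = Ax with A stored either row-major or column-major in a flat array:

#include <vector>

// Row-major: element (i,j) is A[i*n + j], so the inner loop over j reads A
// contiguously (cache-friendly for b = Ax)
std::vector<double> matVecRowMajor(const std::vector<double>& A,
                                   const std::vector<double>& x)
{
    std::size_t n = x.size();
    std::vector<double> b(n, 0.0);
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            b[i] += A[i*n + j] * x[j];
    return b;
}

// Column-major: element (i,j) is A[j*n + i], so the inner loop strides
// through memory by n doubles, touching a new cache line at each step
std::vector<double> matVecColMajor(const std::vector<double>& A,
                                   const std::vector<double>& x)
{
    std::size_t n = x.size();
    std::vector<double> b(n, 0.0);
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            b[i] += A[j*n + i] * x[j];
    return b;
}
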
Summary
. Optimisation has costs: only optimise where necessary

. Measure/profile to work out where optimisation is required

. Finding a better algorithm usually leads to greater speedup
  than ‘micro-optimising’ individual lines of code

. Turn on compiler optimisations

. Three aims when optimising code:
  . Use the minimum number of operations
  . Use multiple cores effectively
  . Use memory efficiently

. Further reading:
. Code Complete (S. McConnell) chapters 25 and 26
. What every programmer should know about memory
(U. Drepper) http://www.akkadia.org/drepper/cpumemory.pdf
