Tobias Weinzierl
Principles of Parallel Scientific Computing
A First Guide to Numerical Concepts and Programming Methods
Undergraduate Topics in Computer
Science
Series Editor
Ian Mackie, University of Sussex, Brighton, UK
Advisory Editors
Samson Abramsky, Department of Computer Science, University of Oxford, Oxford, UK
Chris Hankin, Department of Computing, Imperial College London, London, UK
Mike Hinchey, Lero – The Irish Software Research Centre, University of Limerick, Limerick, Ireland
Dexter C. Kozen, Department of Computer Science, Cornell University, Ithaca, NY, USA
Andrew Pitts, Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
Hanne Riis Nielson, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
Steven S. Skiena, Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
Iain Stewart, Department of Computer Science, Durham University, Durham, UK
‘Undergraduate Topics in Computer Science’ (UTiCS) delivers high-quality
instructional content for undergraduates studying in all areas of computing and
information science. From core foundational and theoretical material to final-year
topics and applications, UTiCS books take a fresh, concise, and modern approach
and are ideal for self-study or for a one- or two-semester course. The texts are all
authored by established experts in their fields, reviewed by an international advisory
board, and contain numerous examples and problems, many of which include fully
worked solutions.
The UTiCS concept relies on high-quality, concise books in softback format, and
generally a maximum of 275–300 pages. For undergraduate textbooks that are
likely to be longer, more expository, Springer continues to offer the highly regarded
Texts in Computer Science series, to which we refer potential authors.
Tobias Weinzierl
Department of Computer Science
Durham University
Durham, UK
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
I started to think about this manuscript when I was appointed as Associate Professor at Durham University and had been asked to teach numerical algorithms for computer scientists plus the parallel programming submodule. In Durham, we teach this first submodule on numerics in the third year of the computer science degree. It spans only ten hours. Ten hours is ridiculously short, but I think it is enough to spark interest in simulation codes and the simulation business. Thinking about it, it is great that our numerics teaching runs in parallel with a parallel programming course.
Numerical simulation without parallelism today is barely imaginable, so this
arrangement gives us the great opportunity to use examples from one area in the
other one. We can co-teach the two things: numerics basics and parallel coding.
When we later designed our Master in Scientific Computing and Data Analysis
(MISCADA), it became clear that co-teaching the two fields would be a good idea
for this course, too. With a very heterogeneous student intake—in terms of qualification, cultural background, educational system, and disciplinary interest—we need to run introductory courses that bring all students quickly to a decent level, while numerical principles as well as parallel programming are omnipresent.
Obviously, the M.Sc.-level teaching progresses at a different pace and, in Durham,
also covers more lectures. Yet, some content overlaps with the third year’s course.
There seems to be a limited amount of literature that discusses parallel pro-
gramming and (the basics of) numerical techniques in one go and still remains
easily accessible. There are many books that cover scientific computing from a
mathematics point of view. Many scientists (including me) insist that a formal
approach with a strong mathematical background is imperative for research in
scientific computing. However, we have to appreciate that many of our undergrads
are enrolled in courses that are lighter on the mathematics side or focus, for
example, more on discrete maths. Some maths thus might be a little bit rusty when
students enter their third year. Mathematical textbooks run the risk of losing students if the books orbit around the profound maths first. Students might get bored before they reach the point where they can program a first numerical code that does
“something useful”.
The other type of teaching literature that is out there combines the first course in
numerics with a programming tutorial. Although a lot of the material I have seen is
really good, this type of approach is ill-suited for our students. Most of them are
decent programmers already.
To make a long story short, I think that there is a need for courses on numerics
with a “rolling up your sleeves” and “get the rough idea and make it work in code”
attitude. Some students like to get their hands dirty and write something “that does
something” before they start to dive into more details—for which excellent liter-
ature does exist. I am also strongly convinced that it makes sense to teach intro-
ductory scientific computing in combination with parallelisation. After all, we want
to have future Ph.D. students and colleagues who can help us with the big simu-
lation codes.
Mission Statement
Since we try to keep the theory light and hands-on, this book is not a comprehensive overview of the vast field of numerical algorithms and analysis, or even of only a certain subfield. Nor is it a rock-solid introduction to the difficulties, pitfalls and depths of programming in high performance computing (HPC). It is
• a rush through some basic numerical concepts that are omnipresent when we
write simulation software. We want to get an idea of how some numerical things
work—so we can eventually look them up in a “real” maths book.
• a numerics primer with implementation glasses on: For all of the concepts, we
want to have an idea of how to realise them on a real machine or which impact
they have on our codes.
• a hands-on text. The best way to understand numerical concepts is to program
them or, even better, to see your own code fail and then find a theoretical
explanation (together with the fix) for the break down. And it is close to
impossible to learn parallel programming solely from textbooks anyway.
• a guide that makes students start to write simulation codes running on parallel
computers.
The manuscript jumps from parallel programming to numerics and back. I sprinkle
in some light-touch revision of things that I consider to be important but have been
told by my students that many computer science curricula do not cover them
anymore. Even if taught at school before, students might have forgotten them.
Where possible, I try to make all content feed into one simple N-body solver. That’s
the application leitmotif throughout the manuscript. If readers translate it into a
working code, they will end up with an N-body solver with simple gravity, will
have learned fundamental ideas behind the solver such as numerical terminology or
challenges faced by integrators for ordinary differential equations, and will have the
skill set to make this solver run efficiently on a shared memory machine or some
GPUs. There’s no need for a reader (or teacher) to stick to my concept or to read the
chapters one by one. Feel free to skip chapters, pick chapters, or use a different
application example.
Mathematicians will be horrified by my hands-on explanations and lack of
formalism. I am fine with this. This book is not yet another maths course. It is a
book that shall be quick and easy to read and motivate students to study the maths
behind the scenes later, once the first generation of their simulation code is up and
running. This book is also not a course where students really learn something about
N-body codes for a particular application area. Simple N-body codes just seemed to
be a convenient toy problem that we can understand with A-levels maths.
Experts deeply familiar with the details of, e.g., OpenMP will also be horrified by the inaccuracies in the text. For example, the visibility of a variable in the OpenMP
standard is determined through a sequence of rules. These rules are important, but
most students quickly will be lost in them. Also, most programmers don’t have to
memorise them. They have to know what happens in 99% of the cases.
As I want to push students into the field of scientific computing plus HPC, I
made all examples in the text rely on C/C++. I am well aware that many codes still
are written in Fortran, and I am well aware that people nowadays can write interesting and fast software with, e.g., Python. But C still is very dominant and, fortunately, many computer science students are still exposed to C in one form or the
other throughout their undergraduate studies. While A-levels maths is sufficient to
digest the book’s content, I hence expect students to bring along C programming
skills.1
Maybe unusual for a student-facing book, this text does not contain any exer-
cises. There are two reasons for this: On the one hand, there are plenty of books, in
particular in numerics, that are full of exercises; and more and more webpages with
(interactive) examples follow. On the other hand, the prime way of exercising scientific computing is to write codes, and I think that there is a big difference between writing small, well-defined example calculations and writing a bigger piece of
software. With this text, I want to encourage students to do the latter.
Acknowledgements
Thanks are due to numerous students who provided feedback. Particular thanks are
due to my colleagues Florian Rupp and Philipp Neumann. They found tons of bugs
and helped me to fix a lot of phrases. Thanks are also due to Lawrence Mitchell for
proofreading some of the chapters with an HPC touch. Holger Schulz worked with
1 The UTiCS series hosts, for example, the Pitt-Francis and Whiteley book "Guide to Scientific Computing in C++", where Chaps. 1–5 provide a reasonable base for the present book.
Major parts of this manuscript have been written during the first Corona lockdown.
I am sure that this unexpected situation has forced many lecturers to revise, rethink
and re-evaluate their academic teaching style. I had to. Though I do not know
(yet) whether this book is of any use or help for the brand new world of blended and
virtual teaching, the manuscript style is a result of my personal “rethink your
teaching” exercise.
The appendix hosts some ideas on how I use the book’s content to structure my
classes. The text is written in a language such that students can smoothly work
through the content themselves: if I have to make a choice, I go for the sloppy
jargon rather than ultra-formal, dryish phrases. Furthermore, I tried to use illus-
trations for technical details, as I am myself a visual learner. Key learning outcomes
are extracted and organised into boxes, and the appendix hosts some checklists and
cheatsheets for students’ coursework. All of this allows me to use the text as a basis
for courses in a “flipped classroom”-style. Obviously, this style can be comple-
mented by online/synchronous sessions, videos, exercises and code studies; and,
obviously, reading lists that dive deeper into the subject.
Enjoy the text, and I hope that it is useful.
1 The Pillars of Science
Abstract
We discuss the two classic approaches (pillars) to acquire new insight in science:
experiment and theory. As computer simulations supersede the classic experiment
and blackboard theory, we obtain a third pillar. This notion of a third pillar can be
challenged, but it motivates us to define new research areas assembled under the
umbrella of the catchphrase computational science and engineering. After a quick
sketch of the simulation pipeline—typical steps required in any computational
discipline—we introduce scientific computing as the area conducting research into
the tools enabling insight through computing.
This is a thought experiment, but feel free to make it a real one: We take a heavy
object (not grandma’s favourite Ming vase) and a light one such as our pen and
stand on a chair in our office. We drop both objects. Which one hits the ground
first?
The experiment is a classic in school physics. The urban legend says that
Galileo Galilei (1589–1592) stood on the Leaning Tower of Pisa and dropped
objects when he served as professor of mathematics there. The experiment
studies how the mass of an object affects the speed with which it falls down.
We might assume that the heavier object falls down faster. Aristotle had taught
exactly this, so the assumption is actually not that naïve or stupid. Galileo’s
experiment shows—as long as we neglect effects such as friction—that the ob-
jects’ speed does not depend on the mass. It thus invalidates Aristotle’s idea:
The “arrival” time of the object (when it crashes into the floor) is independent
of its mass.
We are not interested here in who did and thought what. This is not a history
of science lesson. The story illustrates something else: In science, we often
have a theory and an experiment, and one thing either validates or invalidates
the other one.
Humans and in particular scientists make models of the world all the time. We
postulate theories of how things work and why. Every discipline has its own language that helps us to create these models. In science and engineering, mathematics is the
modelling language of choice.
With a model at hand, we compare its predictions with reality, i.e. with experi-
ments. Let's assume T(m) is a function that gives us—for our tower experiment—the time an object with mass m needs to hit the ground. The Aristotelian hypothesis is that the heavier an object, the smaller T. In the words of maths: m1 > m2 ⇒ T(m1) < T(m2). The observations from the experiment teach us that this is not correct: T(m) is independent of m.
The interplay of theory and experiment is not a one-way street. New models
drive scientists to come up with new experiments to validate them, while the model
development is driven by observations: Whenever we observe something that we
cannot explain with the ideas and theories we have at hand, we ask ourselves “why is
this” and then start to develop a more sophisticated model which captures the things
we just observed.
•> The two pillars of science
Theory and experiment are the two classic pillars (or legs) of science. They validate and push each other. Theoretical scientists and engineers make all kinds of models of the real world, but they are considered to be valid, i.e. we consider them to be reasonably accurate representations of the real world, only after we have measurements
(data) that follow the predictions of these theories. If the experiments yield different
results or show effects that we can’t explain, the theory guys have to go back to the
blackboard.
1.1 A Third Pillar?
The two pillars as we have introduced them so far are barely enough to explain how
we advance in science and engineering today.
• Some experiments are economically infeasible. If we want to design the next gen-
eration of jumbo jets, we might want to put each model into a wind tunnel to see
whether it does take off. But fuelling a wind tunnel (in particular on a reasonable
scale) is extremely expensive.
• Some experiments are ethically inappropriate. If we design novel ways of radiation
treatment, it would not be ethical to try this out with patients in a trial-and-error
fashion. Another example: If we construct a new bridge, we don’t want the first
cars driving over this bridge to be Guinea pigs.
• Some experiments are ecologically dubious. If we design novel nuclear reactors,
we don’t want to rely on trial-and-error when it comes to security.
• Some experiments are by construction impossible. If we make up new theories
about the Big Bang, we are lost with our two pillar model: we will likely never be
able to run a small, experimental Big Bang.
• Some equations are so complex that we cannot solve them (analytically). For many
setups, we have good mathematical models. Maybe, we can even make claims
about the existence and properties of solutions to these models. But that does not
always mean that the maths gives us a constructive answer, i.e. can tell us what
the solution to our model is.
This list is certainly not comprehensive. Its last point is particularly intriguing in
modern science: We have some complex equations comprising a term u(x). Let this
u(x) be the quantity that we are interested in. Yet, we cannot transform this formula
into something written as u(x) = . . . with no u(x) on the right-hand side. That is,
even when we have x, we still do not know u(x) directly.
A simple example goes as follows: The acceleration of a free object in space is
determined by gravity. Acceleration is the second derivative of the object’s position
in time. The first one is the velocity. With a formula for the gravitational force F, we know ∂t ∂t u(x) = C · F (where ∂t is the time derivative). This is something we can
solve analytically, i.e. write down the solution—either by hand or via a computer
program or a simple websearch. With two free objects that attract each other, solving
the arising equations analytically, i.e. distilling u(x) = . . ., becomes tricky. For three,
it becomes impossible.
All statements on the shortcomings of theory and experiment illustrate an impor-
tant trend: More and more areas of science need a computer to compensate for the
challenges traditional experiments and theoretic approaches face. Engineers simu-
late in a computer how a bridge behaves under load before they build it. Physicists
plug their new Big Bang models into a Universe simulation, make this simulation
compute what the Universe would look like today, and then compare the outcome to
observations. Mathematicians ask a computer to plot the solutions to complex equa-
tions calibrated to some measurements they know and thus obtain a feeling whether a
Fig. 1.1 Left: New insight relies on the interplay of theory and experiment in the classic model.
Right: As theory and experiment often fall short, simulations become a third pillar of insight. Or
not?
model represents reality. More and more disciplines heavily rely on computer simu-
lations. This makes people speak of them as a third pillar of insight besides experiment
and theory (Fig. 1.1).
However, there are people who challenge this notion of a third pillar. After all,
there are few things in nature with three legs (Fig. 1.2). And at the same time, there are
people who claim that we even have a fourth pillar today: big data bringing in machine
learning and artificial intelligence. The argument against the three pillar concept
orbits around the observation that most theoretical work today is not a pen-and-
paper exercise anyway. Mathematicians use computers all the time to approximate
and solve equations. Experimentalists in return have always used machines to track,
bookkeep and interpret their data, starting from a slide rule or a notebook. The
four pillar advocates argue that we live in an age where data explodes and computer
algorithms obtain insight (learn) from these data autonomously; without complex
models of what they are supposed to see in the data.
I am a big fan of the two-pillar view of the world. It is not only about keeping things
simple, it is about re-reading the two classic legs of science, theory and experiments,
through today’s glasses. Read Moshe Y. Vardi’s discussion of why science has only
two legs (see below on further reading). He points out that we can think of science as
theory and modelling plus experiments and analysis of data. Computer simulations
are always used whenever we come up with a new theory or model. And when we do
experiments, we are typically swamped with data. Without computers and notably
simulations which give us a clue what to search for, it is hard to find the right thing
in all of our measurements. Artificial intelligence fits in here, too—as “yet” another
tool to digest vast experimental data.
Definition 1.1 (The third pillar of science) Some people call computer simulations
the third pillar of science. Others disagree. Anyway, computer simulations are indispensable in modern science—whether as a pillar of their own or as an essential part of theory and experiment.
1.2 Computational X
Fig. 1.3 The three areas coming together in Computational X projects. These projects employ
(generic) tools and knowledge from Scientific Computing. In return, they stimulate the involved
disciplines
This sequence and variants thereof are called the simulation pipeline. I give a first,
simple example of the most prominent steps within this pipeline in Chap. 3. I person-
ally dislike the term pipeline. It suggests—similar to the term waterfall in software
development—some kind of sequentiality. In practice, we jump around and even
make excursions into the theory and experiment world. Let’s turn our attention to
the generic challenges within CSE and within this pipeline:
Chris Johnson: “Before every great invention was the discovery of a new tool”.1
This is a statement that holds in many different ways: Brunelleschi not only invented
the double shell system when he constructed the cupola (dome) of the Duomo of
Florence, he also first constructed some novel cranes that allowed him to get the roof
on. Reinforced concrete first had to be invented before mankind had been able to
construct some large bridges. There are many such examples. We need (to invent,
discover luckily, or pile up and assemble the set of) the right tools to solve problems.
If we want to solve problems on a computer, we need the right computational tools,
i.e. mathematics, algorithms, data structures, programming techniques, …We need
the right tools for Computational X. This book covers some aspects of these tools.
•> Scientific Computing
We work within the realm of Scientific Computing whenever we work on the tools
that eventually allow Computational X to make progress.
Further reading
• Moshe Y. Vardi: Science Has Only Two Legs. Communications of the ACM. 53(9) (2010)

2 Moore Myths
Abstract
We introduce Moore's Law and Dennard scaling. They illustrate that further increases of compute capabilities will result from an increase in parallelism over the upcoming years, which implies that our codes have to be prepared to exploit this parallelism. We identify three levels of parallelism (inter-node, intra-node and vector parallelism) and eventually characterise GPUs within these levels.
In the early Middle Ages, the construction business was guided by experience.
A simple rule went like this: the larger a building such as a church, the higher its weight. Higher weight means more massive walls to carry the weight. You should also only have small windows, so you literally get a rock-solid structure.
With the advent of the Gothic architecture came the insight that it might be
cleverer to use a stone rib structure that is filled out with light walls, vaults
or even glass windows. The rib structure carries the weight and distributes it
properly. Larger, bigger, more impressive buildings became feasible.
1 The “law” is so popular that Intel paid 10,000 USD for a copy of the original paper print in 2005.
For simulation codes as we have sketched them before, it is not directly clear why
the transistor count makes a difference. We are interested in speed. However, there
is a correlation: First, vendors use the opportunity to have more transistors to allow
the computer to do more powerful things. A computer architecture provides some
services (certain types of calculations). With more transistors, we can offer more of
these calculation types, i.e. broaden the service set. Furthermore, vendors use the
opportunity to squeeze more cores onto the chip. Finally, the growing number of transistors historically went hand in hand with a shrinkage of the individual transistors. This leads us to
another important law:
Definition 2.1 (Dennard scaling) The power cost P to drive a transistor follows
roughly
P = α · C · F · V².
This law is called Dennard scaling.
C is the capacitance, i.e. encodes the size of the individual transistors. I use C here
for historic reasons. In the remainder of this manuscript, C is some generic constant
without a particular meaning. F is the frequency and V the voltage. α is some fixed
constant so we can safely skip it in our follow-up discussions. Note that the original
Dennard scaling ignores that we also have some leakage. Leakage did not play a
major role when the law was formulated in 1974.
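To get a quick feeling for the formula, here is a tiny, self-contained C++ calculation (the concrete numbers are made up for illustration): halving the frequency halves the power, while halving the voltage, which enters squared, quarters it.

#include <cstdio>

int main() {
  // Relative power according to P = alpha * C * F * V^2, with alpha and C lumped
  // into one fixed constant. All values are normalised against a baseline of 1.
  const double alphaC  = 1.0;
  double baseline      = alphaC * 1.0 * 1.0 * 1.0;  // F = 1,   V = 1
  double halfFrequency = alphaC * 0.5 * 1.0 * 1.0;  // F = 0.5, V = 1
  double halfVoltage   = alphaC * 1.0 * 0.5 * 0.5;  // F = 1,   V = 0.5 (enters squared)
  std::printf("baseline=%.2f  half frequency=%.2f  half voltage=%.2f\n",
              baseline, halfFrequency, halfVoltage);
  return 0;
}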
Dennard's scaling law is all about power. For both chip designers and computing centres buying and running chips, controlling the power envelope of a chip is a sine qua non.
While we want to bring the power needs down, we still want a computer to be as
capable as possible. That means it should be able to do as many calculations per second as possible. Dennard scaling tells us that we have only three degrees of
freedom:
1. Reduce the voltage. This is clearly the gold solution as the V term enters the
equation squared. Reducing the voltage however is not trivial: If we reduce it too
much, the transistors don’t switch reliably anymore. As long as we need a reliable
chip, i.e. a chip that always gives us the right answer, we work already close to
the minimum voltage limit with modern architectures.
2. Reduce the transistor size. Chip vendors always try to decrease transistor sizes
with the launch of most new chip factories or assembly lines. Unfortunately, this
option now is, more or less, maxed out. You can’t go below a few atoms. A
further shrinkage of transistors means that the reliability of the machine starts
to suffer—we ultimately might have to add additional transistors to handle the
errors which once more need energy. Most importantly, smaller chips are more
expensive to build (if they have to meet high quality constraints) which makes
further shrinking less attractive.
3. Reduce the frequency. If we reduce the frequency, we usually also get away with
a slightly lower voltage, so this amplifies the savings effect further. However, we
want to have a faster transistor, not a slower one!
We asked for faster chips but what we get instead is plateauing or even decreasing
frequencies. There used to be a time where roadmaps predicted 10 GHz desktops.
This has never happened for the reasons above. For computational scientists, this
means that we can’t rely on upcoming machine generations to run our codes faster
automatically. We even have to be prepared that our codes become slower unless we manage to exploit the increasing number of transistors.
2.3 The Three Layers of Parallelism
With Moore's law continuing to hold and a breakdown of Dennard scaling (the frequency cannot be increased anymore at a given power envelope per transistor), there is
a “new” kid on the block that helps us to build more powerful computers. This one
eventually allows Computational X to run more challenging simulations. Actually,
there are three new kids around that dominate code development today (Fig. 2.2).
However, they all are flavours of one common pattern:
1. The parallelism in the computer increases, as modern computers do not apply an addition or multiplication or … to just one or two pieces of data in one (abstract) step.2 They apply the same operation to a whole vector of entries in one rush. We call this vector parallelism.
2. The parallelism in the computer increases, as modern CPUs do not only host one core but an ensemble of cores per chip. Since this multicore idea implies that all cores share their memory, we call this shared memory parallelism (this level and vectorisation are illustrated in the code sketch after this list).
3. The parallelism in the computer increases, as modern supercomputers consist of thousands of compute nodes. A compute node is our term for a classic computer which "speaks" to other computers via a network. Since the individual nodes are independent computers, they do not share their memory. Each one has memory of its own. We therefore call this distributed memory parallelism.
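As a tiny illustration of the first two levels (the snippet is my own sketch, not code from the book), a loop over an array can be spread over the cores of one chip with OpenMP and vectorised within each core:

#include <cstddef>
#include <vector>

// Scales all entries of x. The pragma asks compiler and runtime to split the loop
// over the available cores (shared memory parallelism) and to use vector
// instructions within each core (vector parallelism). Without OpenMP support,
// the pragma is simply ignored and the loop runs serially.
void scale(std::vector<double>& x, double factor) {
#pragma omp parallel for simd
  for (std::size_t i = 0; i < x.size(); ++i) {
    x[i] *= factor;
  }
}

Distributed memory parallelism, in contrast, requires explicit message exchange between the nodes, e.g. via MPI, and is skipped in this book.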
The three levels of parallelism have different potential to speed up calculations. This
potential depends on the character of the underlying calculations as well as on the
hardware, while the boundaries in-between the parallelism flavours are often blurred.
We discuss ways to program two out of three levels of parallelism here, but skip
the distributed memory world that unfolds its full power in multi-node supercom-
puters. The role of accelerators (GPGPUs) in such a classification spans all three
rubrics: An accelerator is a computer within the computer including its own memo-
ry. GPGPUs themselves rely heavily on both vectorisation and multicore processing:
They host multiple (streaming) processors, and the individual processors can han-
dle one instruction acting on multiple pieces of data per step. The latter realisation
is traditionally significantly different to what we find in CPUs—this lockstepping
2 Some of these operations might require multiple cycles when expressing them by means of chip
frequency, but the punchline is that there’s one operation per cycle sequence.
Fig. 2.2 We discuss three different types of parallelisation techniques in this book: Vectorisation,
shared memory and distributed memory parallelisation. The potential gains in performance on the
x-axis (the labels there) are typical quantities from a classical 2020 CPU-only machine. They will
change dramatically over the next few years. Yet, the classification in principle will remain valid
We have to write code, right from the start, that accommodates the different types of
parallelism and that can exploit increasing hardware parallelism in the future. If a new
machine is installed after a year’s time, only codes that are flexible w.r.t. parallelism
can benefit from it. This is not easy. Herb Sutter has likely summarised it best: “The
free lunch is over”. Programmers have to be skilled in different parallel programming
techniques, and they have to think right from the start about how they write future-
proof codes w.r.t. parallelism. Parallelisation is not a nice gimmick or add-on that
you start to study once you got the functional behaviour right. In this manuscript,
I therefore introduce parallelisation techniques hand in hand with the numerical
techniques.
• Herb Sutter: The Free Lunch Is Over—A Fundamental Turn Toward Concurrency in Software. Dr. Dobb's Journal. 30(3) (2005) http://www.gotw.ca/publications/concurrency-ddj.htm
• Gordon E. Moore: Cramming more components onto integrated circuits. Electronics Magazine. 38(8), pp. 33–35 (1965) https://doi.org/10.1109/N-SSC.2006.4785860
• Robert H. Dennard et al.: Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits. 9(5), pp. 256–268 (1974) https://doi.org/10.1109/JSSC.1974.1050511
• Moore’s law is all about transistor counts, i.e. complexity per cost. This used to
translate 1:1 into performance growth for years. It does not anymore.
• Dennard scaling states that the power consumption of a chip scales linearly with the
frequency but quadratically with the voltage. Its historic formulation ignores leakage
which is becoming a non-negligible factor in modern chips.
• If we can't reduce the voltage anymore, we cannot increase the frequency either, as the chip shrinkage is naturally constrained. Vendors have to increase the parallelism
on the chip.
• Codes and algorithms have to be redesigned and programmed right from the start
to fit to massively parallel systems.
• We distinguish three different levels/types of parallelism on modern hardware in
this book.
• GPGPUs are technically separated from the rest of the computer. Modern hardware and programming techniques hide this fact and integrate GPGPUs better and better. Logically, we can thus treat them as shared memory and vector computing devices.
3 Our Model Problem (Our First Encounter with the Explicit Euler)
Abstract
A simple N-body problem serves as a demonstrator for the topics discussed in
upcoming sections. We introduce an explicit Euler for this showcase problem
informally, and thus obtain a baseline toy code to play with. The introduction
allows us to discriminate numerical approximations from analytical solutions.
We return to our Leaning Tower thought experiment and remind ourselves that the force acting on an object is F = mg with g = 9.81 m/s² being a magic constant.1 m in F = mg is the mass of the object that we drop. We also know from school that the acceleration of an object is the derivative of its velocity, i.e. ∂t v(t) = a(t), while F = ma.
We integrate and plug in our knowledge to obtain
\[ v(t) = \int_0^t a(\tilde{t}) \, d\tilde{t} = \int_0^t g \, d\tilde{t}. \]
Here, we’ve exploited the fact that the object that we drop is at rest when we
release it up there on the tower. If the dropped object had an initial velocity, we
would have to add a constant term v(0) to the integral above.
Next, we remind ourselves that velocity is the change over time of the position
p(t). When we drop an object from p(0) > 0 over the ground, we can compute
the object’s position through
1 The notation here uses an m in m/s² as a unit, while the m in F = mg denotes a parameter. From hereon, I work quite lazily with units and just skip them.
\[ p(t) = p(0) - \int_0^t v(\tilde{t}) \, d\tilde{t} = p(0) - \int_0^t \int_0^t g \, d\tilde{t} \, d\tilde{t} = p(0) - g \cdot \frac{1}{2} \tilde{t}^2 \Big|_{\tilde{t}=0}^{t}. \]
Therefore, we can compute exactly when the object crashes2 into the ground, i.e. when 0 = p(t) = p(0) − g(½ t² − ½ 0²) = p(0) − (9.81/2) t².
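A quick, illustrative calculation along these lines in C++ (the drop height of 2 is an arbitrary choice of mine):

#include <cmath>
#include <cstdio>

int main() {
  // Solve 0 = p(0) - (g/2) * t^2 for t, i.e. t = sqrt(2 * p(0) / g).
  double p0 = 2.0;   // initial height above the ground
  double g  = 9.81;
  double t  = std::sqrt(2.0 * p0 / g);
  std::printf("the object hits the ground after roughly %.2f seconds\n", t);  // ~0.64
  return 0;
}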
This was easy. And it was wrong. In real life, the object that we drop and the Earth attract each other. And the force thus follows the equation
\[ F = G \cdot \frac{m_1 \cdot m_2}{r^2}. \]
For a massive object (or two planets in space), we have to work with the real formula. This time, things become ugly as the antiderivative is not trivial anymore:
\[ v_n(t) = \int_0^t a(\tilde{t}) \, d\tilde{t} = \int_0^t \frac{F(\tilde{t})}{m_n} \, d\tilde{t} = \int_0^t G \cdot \frac{m_1 \cdot m_2}{m_n \cdot r(\tilde{t})^2} \, d\tilde{t}. \]
Let’s sit on object 1 and observe how far object 2 is away. That is, we make object
1 rest at position 0, while the other one moves freely. We move the coordinate
system with p1 . Then,
\[ p_2(t) = p_2(t=0) - \int_0^t \int_0^t G \cdot \frac{m_1 \cdot m_2}{m_2 \cdot r(\tilde{t})^2} \, d\tilde{t} \, d\tilde{t} = p_2(0) - \int_0^t \int_0^t G \cdot \frac{m_1}{p_2(\tilde{t})^2} \, d\tilde{t} \, d\tilde{t}. \]
For many problems in science and engineering, we have good models that describe
how individual parts of the system behave. We have, for example, formulae that
describe how rigid bodies (planets) move under gravity. Unfortunately, that does not
imply that we can always solve the equations that arise once we use the relatively
simplistic equations as building blocks within a larger setup. The N-body problem as sketched above (for N ∈ {2, 3}) is a classical example of a relatively simple problem, i.e. the underlying mechanics are simple equations. However, we cannot rewrite the physical model into a formula like p(t) = … which gives us the position
p of the objects for a given time t directly once N ≥ 3. To be fair, there’s a sum
representation for N = 3, but even that one is irrelevant for practical applications as
it is infinitely long.
Given are N bodies (planets, e.g.) out there in space. They interact with each other through gravity. Let body 1 be described by its position p1 = (px,1, py,1, pz,1)^T ∈ R³ (a vector) and its mass m1. Furthermore, it has a certain velocity v1 = (vx,1, vy,1, vz,1)^T ∈ R³, which is a vector again. Body 2 is defined analogously.
Body 1 experiences a force
\[ F_1 = G \cdot \frac{m_1 \cdot m_2}{|p_1 - p_2|_2^2} \cdot \frac{p_2 - p_1}{|p_1 - p_2|_2}. \tag{3.1} \]
If there are more than two objects, then Body 1 also gets a contribution from Body
3, 4, and so forth. The forces simply sum up. We furthermore know that
\[ \partial_t v_1(t) = \frac{F_1}{m_1} \quad \text{and} \tag{3.2} \]
\[ \partial_t p_1(t) = v_1(t). \tag{3.3} \]
This is a complete mathematical model of the reality. It highlights in (3.2) and (3.3)
that velocity and position of our object depend on time. These equations are our whole
theory of how the world out there in space behaves (cmp. Chap. 1) in a Newton sense.
Einstein has later revised this model.2
Notation remark
The expressions ∂t y(t) = ∂/∂t y(t) both denote the derivative of a function y(t). Often, we drop the (t) parameter—we have already done so for F above. For the second derivative, there are various notations that all mean the same: ∂t ∂t y = ∂tt y = ∂²y/∂t². I often use ∂t^(2). This notation makes it easy to specify arbitrarily high derivatives.
2 Einstein developed his theories, since the classic Newton model struggled to explain certain effects: The misfit of Newton's theory with observations—Newton's theory cannot quantitatively correctly predict how light "bends" around objects, though it was pointed out in 1784 (unpublished) and 1801 that it does bend to some degree—was well-known since 1859, but it took till 1916 until
someone came up with a new theory that explains the effect. After that, experimentalists needed
another three years plus a total solar eclipse to run further experiments which eventually validated
Einstein’s theories. This is a great historic illustration of the ping-pong between the two classic
pillars of science.
The force in (3.1) is the bigger, the bigger the product of the masses of the two objects. It decreases rapidly if the objects are far away from each other. The Euclidean norm
\[ |p_1 - p_2|_2 = \sqrt{ \left(p_{x,1} - p_{x,2}\right)^2 + \left(p_{y,1} - p_{y,2}\right)^2 + \left(p_{z,1} - p_{z,2}\right)^2 } \]
denotes the distance of the two objects from each other. It is the r (distance) we have used before in the introductory example, while
\[ p_2 - p_1 = \begin{pmatrix} p_{x,2} - p_{x,1} \\ p_{y,2} - p_{y,1} \\ p_{z,2} - p_{z,1} \end{pmatrix} \]
is a direction vector pointing from object 1 to object 2. It gives the whole force a direction. Assume we had a force F̂ and we computed its directional force as F̂ · (p2 − p1). This would yield a force with a direction. However, the further away the two bodies are from each other, the bigger this vector, i.e. the bigger its magnitude. This is clearly not what we want, so we have to normalise the direction vector. We have to make it have unit length. This is achieved by the division through |p1 − p2|2. We end up with a normalisation of r⁻³ overall. One of the r's is kicked out again via the direction vector.
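A hedged sketch of this force computation in C/C++; the struct layout, the helper name force and the example values in main are my own illustrative choices, not the book's code:

#include <cmath>
#include <cstdio>

struct Body {
  double p[3];  // position
  double v[3];  // velocity
  double m;     // mass
};

const double G = 6.674e-11;  // gravitational constant

// Force (3.1) that body b1 experiences due to body b2, written into f.
void force(const Body& b1, const Body& b2, double f[3]) {
  double dx = b2.p[0] - b1.p[0];
  double dy = b2.p[1] - b1.p[1];
  double dz = b2.p[2] - b1.p[2];
  double r  = std::sqrt(dx * dx + dy * dy + dz * dz);  // Euclidean distance |p1 - p2|_2
  // G * m1 * m2 / r^2 gives the magnitude; dividing the direction vector by r
  // normalises it, so overall we divide by r^3.
  double scale = G * b1.m * b2.m / (r * r * r);
  f[0] = scale * dx;
  f[1] = scale * dy;
  f[2] = scale * dz;
}

int main() {
  Body earth = { {0.0, 0.0, 0.0},     {0.0, 0.0, 0.0}, 5.972e24 };
  Body apple = { {6.371e6, 0.0, 0.0}, {0.0, 0.0, 0.0}, 1.0 };
  double f[3];
  force(apple, earth, f);
  std::printf("force on the apple: (%e, %e, %e)\n", f[0], f[1], f[2]);  // roughly (-9.8, 0, 0)
  return 0;
}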
We next study the last equation. The ∂t in (3.3) is the time derivative. The faster the
position of an object changes over time, the higher its velocity. If the velocity equals
zero, the object does not move. The change of the velocity in (3.3) is something we
call acceleration. If we apply a force to an object, this force accelerates the object.
This acceleration is the more significant the lighter the object. I can push my bike
but I struggle to push my car, even if I always accelerate them with my maximum
force. We immediately see that our model stands the test of Galileo’s experiments
from Sect. 1. The gravity force’s scaling with the body mass and the acceleration
scaling cancel each other out.
Definition 3.1 (Differential equation) An equation with a function y where the func-
tion's (time) derivatives enter the equation, too, describes a (system of) differential equations. So y(t) = t² is a normal (explicit) equation, but ∂t y(t) = t² is a differential equation. A normal equation is brilliant for scientists: once we fix t, we directly know what y(t) is. Differential equations can become difficult or even impossible to rewrite into a normal, explicit formula such that we can evaluate the outcome directly.
We call our differential equations ordinary, i.e. ordinary differential equation (ODE),
as there are only derivatives w.r.t. one unknown (time). We formalise the term ODE
later (Chap. 12).
With given positions pn(t = 0), n ∈ {1, 2, . . . , N}, we can throw all of our analysis skills onto the Eqs. (3.1)–(3.3), and hope that we can find an expression for pn(t). This is laborious. Bad news: Without further assumptions, it is impossible to solve this problem.
Movie universe
Let’s assume that we live in a movie universe where the whole universe runs at 60
frames per second. In each frame, we know all objects pn as well as their velocities
vn . We can compute the forces at this very frame. They push the particles, i.e. they
alter their velocities. Due to the (changed) velocity, all objects end up at a different
place in the next frame (Fig. 3.1). The longer the time in-between two frames, the
more the particles can change their position (the velocity acts longer). To make the
impact of our force pushes realistic, the forces are the bigger the bigger the time in-between two frames, too. The more frames per second we use, the more accurate our universe. With the time step size going to zero, we eventually end up with the real thing.

Fig. 3.1 Three frames (left to right) of our movie universe. Forces act in-between the particles but they only update the velocity once per frame. Therefore, we might miss collisions, and we are definitely not exactly mirroring the behaviour of our "continuous" Universe
The construction of the movie universe from the real one is called discretisation. We
know that time is continuous—at least at our scale—and computers struggle to hold
things that are infinitely small and detailed. After all, we have finite memory, finite
compute power and finite compute time. So we sample or discretise the time and
track the solution in time steps. Let dt be the time step size, i.e. the time in-between
two snapshots.3 Our computer program simulating the space bodies then reads as follows.
3 The notation here can be quickly misunderstood, so you have to be careful: ∂t and d t̃ are derivatives. dt is the time in-between two time steps.
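A minimal sketch of such a program could look as follows; the two-body setup, the variable names and all concrete values are illustrative choices of mine, not the book's listing:

#include <cmath>
#include <cstdio>

const int NumberOfBodies = 2;
double p[NumberOfBodies][3] = { {0.0, 0.0, 0.0}, {1.0, 0.0, 0.0} };  // positions
double v[NumberOfBodies][3] = { {0.0, 0.0, 0.0}, {0.0, 1.0, 0.0} };  // velocities
double m[NumberOfBodies]    = { 1.0, 1.0 };                          // masses

int main() {
  const double dt = 1e-4;  // time step size, i.e. the time in-between two frames
  const double G  = 1.0;   // gravitational constant set to 1 for the toy setup
  for (double t = 0.0; t < 1.0; t += dt) {
    double f[NumberOfBodies][3] = {};  // forces, recomputed from scratch every frame
    for (int i = 0; i < NumberOfBodies; i++) {
      for (int j = 0; j < NumberOfBodies; j++) {
        if (i == j) continue;          // no self-interaction; no symmetry exploited
        double dx = p[j][0] - p[i][0];
        double dy = p[j][1] - p[i][1];
        double dz = p[j][2] - p[i][2];
        double r  = std::sqrt(dx * dx + dy * dy + dz * dz);
        double s  = G * m[i] * m[j] / (r * r * r);
        f[i][0] += s * dx;
        f[i][1] += s * dy;
        f[i][2] += s * dz;
      }
    }
    for (int i = 0; i < NumberOfBodies; i++) {
      for (int d = 0; d < 3; d++) {
        v[i][d] += dt * f[i][d] / m[i];  // the force pushes, i.e. alters, the velocity
        p[i][d] += dt * v[i][d];         // the velocity moves the body to its next position
      }
    }
  }
  std::printf("body 0 ends up at (%f, %f, %f)\n", p[0][0], p[0][1], p[0][2]);
  return 0;
}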
This code is not particularly sophisticated. For example, we do not exploit any sym-
metries between the particles. It is also not necessary to introduce force variables in
this code, as we could directly add the force contributions to the object velocities. I
leave it to the reader to come up with more intelligent versions of this code. For the time being, it is sufficient for us to know that the scheme we have just implemented
is an explicit Euler.
Writing data into a file is very slow. Therefore, programs typically give you the
opportunity to specify the time in-between two snapshots. This time is usually way
bigger than dt. Don’t mix them up: You step through your simulation with dt and
then every k of these time steps you write a snapshot. As an alternative to counting time steps, you can specify a second time interval DT after which you write the next snapshot.
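A small, illustrative sketch of this bookkeeping (all concrete values are made up; the snapshot itself is replaced by a print statement):

#include <cstdio>

int main() {
  const double dt = 1e-4;         // time step size of the simulation
  const double snapshotDT = 0.1;  // time in-between two snapshots; way bigger than dt
  double nextSnapshot = 0.0;
  for (double t = 0.0; t < 1.0; t += dt) {
    // ... advance the simulation by one time step of size dt here ...
    if (t >= nextSnapshot) {
      std::printf("write snapshot at t=%.2f\n", t);  // stands in for the (slow) file output
      nextSnapshot += snapshotDT;
    }
  }
  return 0;
}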
Computers are calculators. If we pass them a certain problem like “here are two
bodies interacting through gravity”, they yield values as solution: “the bodies end
up at position x, y, z”. They yield numerical solutions. This is different to quite a lot
of maths we do in school. There, we manipulate formulae and compute expressions
like F(a, b) = ∫_a^b 4x dx = 2b² − 2a². Indeed, many teachers save us till the very
last minute (or until we have pocket calculators) from inserting actual numbers for
b and a.
In programming languages, we often speak of variables. But these variables still
contain data at any point of the program run. They are fundamentally different to
variables in a mathematical formula which might or might not hold specific values.
We conclude: There are two different ways to handle equations: We can try to solve
them for arbitrary input, i.e. find expressions like F(a, b) = 2b² − 2a². Once we have such a solution, we can insert different a and b values. The solution is an analytical solution.4
Computers yield numerical solutions. This statement is not 100% correct. There
are computer programs which yield symbolic solutions. They are called computer
algebra systems. While they are very powerful, they cannot always find an analytical solution and obviously do not yield a result if there is no analytical solution. We walk
down the numerics route in this course.
Analogous to this distinction of numerical and analytical solutions, we can also
distinguish how we manipulate formulae. If you want to determine the derivative
∂t y(t) of y(t) = t², you know ∂t y(t) = 2t from school. You manipulate the formulae
symbolically. In a computer, you can also evaluate ∂t y(t) only for a given input. This
often comprises some algorithmic approximations. In this case, you again tackle the
expression numerically.
4 Mathematicians like analytic functions. That’s something completely different, namely functions
that can infinitely often be differentiated in C. We always speak of analytical solutions here. Our
“analytical” highlights how the solution has been obtained, i.e. via exact, symbolic manipulation
of the equations. The “analytic” in contrast is a property of a function.
The three chapters of Part I give us an idea why we have to study numerical tech-
niques: Many (perhaps most of the) interesting phenomena in science and engineering
are phrased as differential equations. Often, no analytical solution to these equations
is known. We therefore need numerical codes that give us answers for scenarios we
are interested in.
These answers have to drop in fast. Tomorrow’s weather forecast is useless if it
drops in after 48 hours. Furthermore, many problems become fun and meaningful
only if we can make them really large and hence costly. If you can simulate three
molecules, then this is interesting. But what people really are interested in are gas
volumes hosting millions of molecules. The N -body problem as sketched is relatively
simple yet already impossible to solve analytically. We therefore stick to this problem
for the majority of this book.
Getting results faster requires us to write code that is aware that performance growth
today and in the future is due to an increase in parallelism. We therefore will try to
make our N -body code scale. Multiple flavours of parallelism do exist in modern
computers.
In the next sections, we will study how floating point calculations are to be phrased
such that they are processed effectively on modern machines. We already mentioned
that numerical solutions often are approximations. There is more to that: Computers
make errors in every single computation, and they inherently cannot represent some
numbers. We thus have to study these errors. We have to understand to which degree
we can trust the outcome of our calculations.
Part II
How the Machine Works
4 Floating Point Numbers
Abstract
Numbers from R cannot be handled directly by computers, as they can have
infinitely many digits after the decimal point. We have to truncate their represen-
tation. Storage schemes with a finite number of digits before and after the decimal
point struggle: There are always regimes of values where they work prefectly fine
and regimes where they cannot represent values accurately. We therefore intro-
duce the concept of floating point numbers and discuss the IEEE standard. Its
concept of normalisation allows us to establish the notion of machine precision,
as we realise that the truncation into a finite number of digits equals rounding and
introduces a bounded relative error. Since we work with approximations to real
numbers, we have to be careful when we compare floating point numbers in our
codes.
code the numbers x/100,000 or x/3, as we are running out of proper digits after
the decimal point. Very large numbers also pose a problem. Can you quantify
the error we obtain?
In our second attempt as chip designers, we devise a new format where we
sacrifice the first digit. Let the first digit tell us where the decimal point is found.
4.......... implies 4....|......, i.e. we have the decimal point after
the fourth real digit. With the first digit being 4, the remaining bits encode a
number similar to the scheme before. However, we can now encode numbers
like x/200,000 as 3000000005, i.e. 000|000005. A few questions arise:
(i) What is the smallest number we can write down now? (ii) We obviously still
struggle to represent 1/3. The best we can do is to write it down as 0333333333
meaning 0.333333333. What is the error that we get? (iii) How much more
accurate is this than the same number written down in the first format?
Computers use a fixed number of bytes per number, typically 4 or 8, to store values from R. In theory, machines could work with very flexible formats, where the number
of bits that the machine invests per number is not fixed a priori. However, that would
mean that the memory footprint of codes could suddenly explode, some values remain
“unstorable” (take 1/3 or π ) and it would make the hardware very slow. Therefore,
we work with a finite number of bits and approximate “the real thing”. With a finite
number of bits, we can encode a finite number of values. R however is unbounded and continuous, i.e. we can zoom in further and further. Infinitely many values in R thus
cannot be encoded. This property causes pain.
The simplest scheme one might come up with for a machine is a fixed point storage
format. In fixed point notation, we write down all numbers with a certain number of
digits before the decimal point and a certain number after the decimal point. With
four digits (decimal) and two leading digits, we can, for example, write the number
three as 0300. This means 03.00. Fixed point storage is the first variant I have
sketched in our introductory thought experiment.
We immediately see that such a representation is not a fit for scientific computing.
Let x = 3 in the representation be divided by three. The result x/3 = 1.00 suits
our data structure. However, once we divide by three once more, we start to run
into serious trouble. Indeed 1/3 ≈ 0.33—which is the closest value to 1/3 we can numerically encode—is already off the real result by roughly 0.003.
You might be tempted to accept that you have a large error. But as a computational
scientist, you neither can accept that most of your bits soon start to hold zeroes, i.e. no
information at all, nor that your storage format is only suited to hold numbers from
a very limited range (basically from 0.01 to 30 and even here with quite some error).
Definition 4.1 (Relative and absolute error) Computers always yield wrong results.
To quantify this effect, we distinguish the absolute from the relative error. Let x M be
the machine’s representation of x. The absolute round-off error then is given by
\[ e = |x_M - x|. \]
The relative round-off error is given by
\[ \frac{e}{|x|} = \left| \frac{x_M - x}{x} \right|. \]
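A small, self-contained C++ example that evaluates both errors for the fixed point value from above, i.e. 0.33 as the machine's stand-in for 1/3:

#include <cmath>
#include <cstdio>

int main() {
  double x  = 1.0 / 3.0;  // the "real" value (itself, of course, only stored approximately)
  double xM = 0.33;       // fixed point representation with two digits after the point
  double absoluteError = std::fabs(xM - x);
  double relativeError = absoluteError / std::fabs(x);
  std::printf("absolute error: %f\n", absoluteError);  // roughly 0.0033
  std::printf("relative error: %f\n", relativeError);  // roughly 0.01, i.e. one per cent
  return 0;
}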
starting from the decimal point both towards the left and the right, but also for the
exponent (a8 and a9 ). If I want to make it very clear that we are using this decimal
system, I attach |10 to the numbers. If nothing is written down, you may assume that
any number is given for a base of 10.
On a computer, a different data representation is used: a binary system. Here, the digit sequence a1 a2 a3 a4 .a5 a6 a7 has a different meaning. First, we use a base of 2 for the scaling of the entries. Second, the ai s are from {0, 1} only. Third, we would use 2^(a8 a9) if we had a scientific notation. In a binary system, digit sequences denote the number (a4 · 2^0 + a3 · 2^1 + a2 · 2^2 + a1 · 2^3 + a5 · 2^(−1) + a6 · 2^(−2) + a7 · 2^(−3)). If required, I will write a1 a2 a3 a4 .a5 a6 a7 · 2^(a8 a9) |2 to make it clear that I'm using a binary base.
We conclude that we search for a scheme where the number of positions after the
decimal point is not fixed. This way, we can cover a way bigger value range than
with fixed point formats. We are more flexible. At the same time, the new format should run as efficiently as possible on our hardware, and its memory footprint should be under control. It should work with a fixed, predetermined number of bytes. In
computer architecture, there used to be a zoo of such formats. In 1985 however,
vendors agreed on the IEEE Standard 754:
The IEEE standard makes computers work with floating point numbers. The point
will be floating around within the number representation. It is conceptually close to our second variant from the introductory thought experiment. However, we write down the format with an explicit exponent, and we let the exponent determine where the decimal point is. The point does not float around in the bit representation, but it floats around in the real number that is represented. Let's derive this format step by step.
4.2.1 Normalisation
We start our introduction into machine data representations with a simple observation: we need normalisation. It is obvious that a.bcd · 10^4 = ab.cd · 10^3. Whenever we write down a number, we have some degree of freedom. Let's exploit this degree of freedom—well, let's actually remove it—and always move the point to the left such that the leading digit is 1. This makes the notation unique, as we work in a binary world. So let ab.cd · 2^4 |2 be a number that is not normalised. The same number a.bcd · 2^5 |2 = 1.bcd · 2^5 |2 is normalised.
Normalisation is nice to realise (in hardware or software): It requires bit shifts which
is something computers are really good at:
In the upper representations, there are four zeroes to the left of the leftmost digit 1.
If we assume that the one and only bit left of the decimal point should be a one, then
we have to move all the digits four positions to the left. This is called a bit shift.
Before the shift, we had an exponent of 4|10 = 100|2 which we have to reduce in
return.
I emphasise that the discussion of the exponent above is “wrong”. We’ll discuss
that in a minute. However, the principle idea holds: Hardware shifts around the bits,
a process we call normalisation.
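To play with normalisation on a real machine, std::frexp from the C/C++ standard library splits a value into significand and exponent. Note that it normalises the significand into [0.5, 1) rather than into [1, 2) as sketched above; the snippet is merely an illustration, not how the hardware realises normalisation internally:

#include <cmath>
#include <cstdio>

int main() {
  double x = 6.5;
  int exponent = 0;
  double significand = std::frexp(x, &exponent);  // x = significand * 2^exponent
  std::printf("%g = %g * 2^%d\n", x, significand, exponent);  // 6.5 = 0.8125 * 2^3
  return 0;
}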
•> Zero
Zero is the only number we cannot write down in a normalised way, since there is
no leading non-zero.
One bit is “lost” for the sign. This leaves us with 23 bits. We get one back, as we
do not have to store the leading 1 bit of the normalised representation. We know
it has to be there as we have defined normalisation that way. There is no point in
storing it. As a consequence, we know exactly how many bits we have available to
store the significand. For a 32-bit floating point number, we use 23 bits for all the
significant bits right from the decimal point. This is C’s float. For a 64-bit number,
a C double, we use 52 bits (Table 4.1).
We obtain a number x M (note the subscript M for machine precision) which differs
from the real number x. x M is given in machine floating-point precision whereas a
real number x ∈ R in mathematics has an infinite precision.
In hardware, the table below represents exactly how data is stored: The sign bit
comes first, then the exponent. All remaining bits hold the significand. The advantage
of such a standardised format is clear: Once we commit to it, we can build hardware
to realise it, i.e. vendors can “hardwire” the floating point format and thus make
computations with it fast.
Definition 4.4 (Biased exponent) The exponent in the IEEE standard is stored in a
biased form. That is, we do not store the E from (4.1) directly within our 8 or 11
bits, respectively. Instead, we store E + 127 (single precision) or E + 1023 (double
precision).
1. we can always recompute the real exponent quickly (just subtract 127 or 1023, respectively);
2. the value we store is always positive, i.e. we don’t have to encode a sign explicitly;
3. we can hold values in the range −126 ≤ E ≤ 127 (single precision) or −1022 ≤
E ≤ 1023 (double), respectively.
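A short experiment that pulls the three bit fields out of a C float and removes the bias; it assumes that float is the 32-bit IEEE single precision format described above:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  float x = 6.5f;
  std::uint32_t bits = 0;
  std::memcpy(&bits, &x, sizeof(bits));              // reinterpret the 32 bits of the float
  std::uint32_t sign        = bits >> 31;            // 1 sign bit
  std::uint32_t biasedExp   = (bits >> 23) & 0xFF;   // 8 exponent bits
  std::uint32_t significand = bits & 0x7FFFFF;       // 23 significand bits (leading 1 not stored)
  int exponent = static_cast<int>(biasedExp) - 127;  // remove the bias
  std::printf("sign=%u  exponent=%d  significand bits=0x%06X\n",
              static_cast<unsigned>(sign), exponent,
              static_cast<unsigned>(significand));   // 6.5 = 1.101|2 * 2^2
  return 0;
}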
With all of these definitions available, we can study a few particular bit sequences
within the IEEE standard. For this, we split the range of numbers into certain regions.
We start with numbers of the type ±1.00001 · 2^11111111 |2. This is the largest exponent one could think of and a significand that is slightly bigger than one. Though the codes
would represent a meaningful number according to our rules, the IEEE standard
reserves them:
•> NaN
If the exponent bits are all set and any bit of the significand is set, too, then the
floating point number represents NaN (not a number). As we have a dedicated sign
bit, there’s a +NaN and a −NaN.
The biggest or smallest, respectively, number that remains now is ±1.0 · 2^11111111 |2.
We assign them special semantics, too:
•> ∞
If the significand equals one and all the bits of the exponent are set, then the number
represents ±∞ (depending on the sign bit). As we do not store the leading bit in
the significand, the code for these two numbers is all zeros in the significand and all
ones in the exponent.
Any bit sequence −1.0 · 2^11111111|2 < x_M < 1.0 · 2^11111111|2 is a valid number.
Starting from the biggest one, we can decrease them further and further. Eventually,
we approach 1.0 · 2^0|2. Note the biased exponent. This is the positive number closest
to zero that we can encode. The bitset still would allow us to encode 0.1 · 2^0|2. Yet,
we cannot normalise it anymore, as we run out of digits with the exponent.
Fig. 4.1 Schematic classification of the different IEEE codes/number types on the number line
Finally, we have a zero value where the biased exponent is all zeros and the bits in the
significand are zero, too. We reserve these codes for the zero value by convention:
We summarise that there are five types of IEEE numbers (Fig. 4.1): We have not-
a-number (NaN) types, ±∞, normalised numbers, denormalised numbers and two
zeroes.
The fixed number of bits per number allows vendors to build bespoke hardware for
floating point values. In the early days, floating-point units (FPUs) were shipped as
accelerators, i.e. they were not part of the main chip; they were floating-point
co-processors. We had to wait until the 1980s for vendors to integrate the FPU into the
CPU.
On the one hand, this is due to the fact that complicated math routines and calcula-
tions used to be a specialist domain before that. Today, every smart phone uses image
processing, text recognition, and so forth. On the other hand, it follows Moore’s logic
from Chap. 2. FPUs are very complex and their transistor count today exceeds the
number of transistors of other parts of the chip. As a consequence, it used to be
economically unfeasible to ship FPUs with every chip. However, having an FPU
separate from the CPU means that we have to accept some overhead for them to
communicate, i.e. to instruct the FPU and to bring data into the CPU. As soon as
a higher transistor count came into reach, it thus was clever to integrate CPU and
FPU. There is still some legacy of all of this in modern chips: FPUs often run at a
lower frequency than the main chip and have separate access to some memory parts
(caches), while high-end chips today feature more than one FPU per CPU.
The IEEE standard is an at-least standard. If you store a binary file with IEEE
values, then you know exactly how many bits are invested into every single number.
However, there is no guarantee that the data within the chip is held in that precision.
Indeed, most FPUs work with a higher precision internally. Intel, for example, spends
80 bits on a double precision value. When we add two numbers, we temporarily have
to denormalise. Without the additional bits, we would immediately lose information.
It therefore is important that computers internally hold way more bits. When they
write values into a file or main memory, then they squeeze them into the IEEE
format. As a result, you can get two different simulation results when you once run
a simulation without any file dumps and then rerun the simulation but dump its state
into a file and reload it.
As the two numbers have a different exponent (they are of different magnitude), the
computer’s floating point unit first has to bring them to the same exponent format.
1.00101 · 2^10|2 + 0.0100001 · 2^10|2
Here, it helps that the computer internally works with more significant bits. If denor-
malisation for the right summand exceeds the number of internally available bits,
we already lose information. With the two exponents having the same value, we can
sum up the significands:
1.00101 · 2^10|2 + 0.0100001 · 2^10|2 = 1.0110101 · 2^10|2
If this result is now immediately used again by the computer (as part of a larger
formula, e.g.) and resides within the CPU’s internal memory (register), then we are
done. Otherwise, we have to round it, as we have only five significant bits in our toy
data format:
1.00101 · 2^10|2 + 0.0100001 · 2^10|2 = 1.0110101 · 2^10|2 ≈ 1.01101 · 2^10|2
We introduce an error of 0.0000001 · 2^10|2.
After any operation—besides some special cases such as a multiplication by two—
the system will apply some kind of normalisation, i.e. round, and thus "spoil" our
result. We therefore know that we are computing with slightly incorrect values from
the very moment we run the first arithmetic operation on data.
We pretend to work with values x, but for most choices of x, or for results of
calculations, the floating point format does not allow us to represent them precisely.
The format rounds the data to a number that it can represent, and we end up with a
machine number
x_M ∈ [x − e, x + e].
The number x_M we are actually working with is from an ε-environment around the
real value x. In this manuscript, we typically write
x_M = x(1 + ε)
and allow the ε to have either sign. Note that we use two different symbols (e and
ε) as they are different things. They fit to our definition of relative versus absolute
error. The sketch supports the statement that ε shall have the "worst-case sign". The
worst-case magnitude of ε is something that depends on the floating point format,
i.e. how many significant bits we have. It obviously is bigger for single precision
than for double precision.
The latter definition, i.e. the one with the relative error, is useful: If our x is to
hold a value around 1,000 but is actually off by 0.1, then this might be acceptable. If
our x should store a value around 1 but is off by 0.1, then this is a massive deviation.
Definition 4.5 (Machine precision) The machine precision for a given floating point
format is the upper bound on the relative error.
The above definition is slightly biased, as we assume we have a value which is in the
range of what we can represent with our normalised floating point format, i.e. we
can “cover” it through our exponent. Another definition of this term goes as follows:
The machine precision is the difference between 1.0 and the next number closest
to 1.0 that we can represent in our floating point format. Both definitions are the
same in the regions we are interested in. The latter eliminates the division from the
relative round-off error and we can calculate the machine precision on our computer
straightforwardly. It is only a brief source code snippet:
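A minimal sketch of what such a snippet could look like (the variable names are illustrative):

#include <iostream>

int main() {
  double epsilon         = 1.0;   // ridiculously high initial estimate
  double previousEpsilon = 1.0;   // backup of the last value that still made a difference
  while (1.0 + epsilon != 1.0) {
    previousEpsilon = epsilon;
    epsilon         = epsilon / 2.0;
  }
  std::cout << previousEpsilon << std::endl;
  return 0;
}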
We start with a ridiculously high estimate of ε and successively reduce it until the
code cannot spot any difference anymore between 1.0 and 1.0 + ε. When we end
up in this situation (the while loop terminates), we unfortunately have just lost the last
meaningful bit of ε already. Therefore, we always back up the previous value of ε,
which is also the value we should plot in the end.
Machine precision in C
We have a machine precision of around 10^-7 for C's float (single precision) and
10^-16 for double (double precision).
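If you do not want to compute this value yourself, C++'s standard library reports it directly; a brief sketch:

#include <iostream>
#include <limits>

int main() {
  std::cout << std::numeric_limits<float>::epsilon()  << std::endl;  // roughly 1.2e-07
  std::cout << std::numeric_limits<double>::epsilon() << std::endl;  // roughly 2.2e-16
  return 0;
}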
Comparisons
Never compare two floating point numbers with the equals operator ==. In C, this
operator compares the left and right side bit by bit. As we work with imprecise
numbers subject to round-offs, we always have to write code which checks against
“reasonably close”.
Whenever we compare two numbers, we can basically assume that they are the same
if their (relative) difference is small. I always use statements like
if (
std::abs(x-y)/std::max(std::abs(x),std::abs(y)) < 1e-10
) {
// x and y are equal in floating point precision
}
or I use a comparison of the absolute values if I have an idea about the magnitude of the values involved:
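Such an absolute check could look like the following sketch (the tolerance 1e-10 is an illustrative choice):

if (
std::abs(x-y) < 1e-10
) {
// x and y are equal up to an absolute tolerance
}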
If you want to make it 100% watertight, you have to modify the range comparisons,
too: Let x < y, x <= y, x == y be three expressions that you evaluate up
to a precision of z̃—so the z̃ is either an absolute or a relative delta. As you use
x ==_M y → |x − y| ≤ z̃ in your code in line with the rules above, a consistent logic
has to adapt the other comparisons as well: x <_M y then means x < y − z̃, and
x <=_M y means x ≤ y + z̃.
Overflows
If you add two floating point numbers and their result exceeds the range that can be
represented, you will get ∞.
The process described above is called floating-point overflow. On the one hand, this
behaviour is reassuring: different from integer numbers, nothing weird happens such
as a sudden sign flip. On the other hand, you have to be very careful (and check
yourself somehow) if you work close to the maximum value that can be encoded by
your number format.
NaNs
Your computer will not check for NaNs.
Feel free to divide by zero, e.g., but be prepared to bear the consequences. You can
always check whether a datum is valid by adding
if (x==x) {
// x==x will fail for NaNs
}
Two NaNs are by definition never the same. As so often, C++ has a nicer way to
write down things:
if (std::isnan(x)) {
// true if and only if x is a NaN
}
Denormalised numbers
Avoid working with denormalised numbers.
Modern machines can work with denormalised numbers. There are three problems
however: First, working within the denormalised regime is slow. Modern chips need
significantly more time per floating point operation if you feed them denormalised
numbers.
Second, a floating point number has 23 or even 52 bits to discriminate different
values from each other. If your first 20 bits, for example, are zero as you work with
denormalised data, then only 3 or 32 bits, respectively, hold actual information. That
is, the information density of your values is very limited. The effect when you lose
the very last meaningful bit and drop from a denormalised number to the real zero is
called floating point underflow. Underflows can occur from two directions, i.e. into
+0 and −0.
Finally, some IO/data formats (such as VTK) usually work with reduced precision
and in particular struggle to handle denormalised values. You will get visual garbage
if you use denormalised data, or your data files might become unreadable. Therefore,
I often round to values significantly away from machine precision before I actually
write them into a file. Note that C/C++’s terminal output typically also truncates,
i.e. you won’t get full floating point accuracy here either.
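A minimal sketch of such a pre-output filter; the threshold of 1e-12 is an illustrative assumption and has to fit the value range of your data:

#include <cmath>

double filterForOutput(double value) {
  // flush values close to zero before they are written into a file
  return (std::abs(value) < 1e-12) ? 0.0 : value;
}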
• We have defined the terms relative versus absolute error. We will reencounter these
terms frequently throughout this book.
• We know the storage format of IEEE floating point numbers.
• We can perform simple calculations with floating point numbers. This implies that
we can manually normalise and denormalise.
• We have introduced the term round-off error, and we can compute it for a given
piece of data (cmp. normalisation and IEEE format’s bit length).
• We are familiar with the ideas behind NaN and ±∞.
• We have discussed a couple of rules of thumb for what to take into account when
we work with floating point numbers.
5 A Simplistic Machine Model
Abstract
The von Neumann architecture equips us with an understanding how a computer
processes our program code step by step. Our brief review of the architecture
highlights the role of registers for our numerical computations. We will augment
the straightforward definition of a register later to realise vector processing, i.e.
the first flavour of hardware parallelism, while the architecture review provides
clues why many numerical codes struggle with performance flaws and bottlenecks
on our CPUs.
• We cannot memorise any data. All results and intermediate results have to be
written on sheets of paper.
• Every single sheet of paper can hold only one number at a time.
• We work at our desk. Only two sheets of paper fit onto this desk.
• In the room in front of our office with the desk, we can store an arbitrary
number of sheets. They are enumerated.
Today's computers in principle all evolve from one common blueprint. It is called the
von Neumann architecture. There are a few other fundamental machine architectures
that were published earlier and differ in nuances (there is a Harvard architecture
and a Zuse architecture, for example), but we stick to von Neumann here.
In the von Neumann architecture, a (minimalist) computer consists of a few building
blocks. Each part of the machine, i.e. each block, is responsible for a particular job.
Definition 5.1 (von Neumann architecture) The von Neumann architecture consists
of four parts which are relevant for us: A control unit is the mastermind or conductor
which orchestrates all other machine parts. An arithmetic logic unit (ALU) is the
only part of the machine which can run calculations. A main memory holds data. A
bus connects the ALU and the control unit with the main memory. It also connects
the machinery to other devices such as the keyboard or the screen.
The control unit and the ALU are often integrated into one piece. The whole machine
works in steps. It is pulsed. Per step, its components perform at most one operation.
If you specify your machine with GHz (1 Hz = 1/s), this tells you how many of
these basic steps are done by the computer per second, though modern machines
use different frequencies for the different machine parts. The arithmetic unit is the
workhorse of the computer. Yet, it is surprisingly primitive: it can, for example,
add two numbers or subtract them. It can also compare two numbers. This is the
reason for the term arithmetic logic unit (ALU). In the memory, all data of a program
are stored as a long sequence. The enumeration starts from 0.
The ALU has registers. A register is its own (fast) local memory which can hold
one datum. There is a limited number of registers which is small compared to the main
memory, as registers are expensive to manufacture and power. As a consequence, we
have to bring data into the ALU before we use it, and then we have to move it
back into the memory to free registers for the next calculation.
Fig. 5.1 Schematic sketch of a von Neumann architecture: We have three components (ALU,
Control Unit and memory) which are connected via a bus. The main memory holds both the machine
code and the application’s data. Its locations are enumerated. In each step, the control unit reads
from the position held in PC from the main memory. The code (instruction) there tells it what to
do next. It then loads values into the ALU, or tells the ALU to compute something, or moves data
back from the ALU registers into the main memory
double f = 0.0;
double x[...]; // our input array
double y[...]; // the other input array
for (int i=0; i<I; i++) {
f += x[i] * y[i];
}
Definition 5.2 (Machine code) The above code is high level code, i.e. it is nothing a
computer can directly execute. A computer needs much simpler instructions which
the control unit can directly use to instruct the von Neumann machine parts. Such a
low level code is also called a machine code.
The creation of machine code for GPGPUs can be read along these lines: You write down the code on your
machine, and you then ask the compiler to produce machine code for the GPU. You
usually do not translate on the GPU itself.
To understand what machine code looks like, we next clarify what kind of ALU
we have in our made-up machine. We have to know what it is capable of doing and how
it works. Though there are significant architectural differences, the core principles of
ALU/von Neumann architectures as we use them in the artificial machine are similar
to those in real architectures. For our code snippet, the ALU has to be able to do three
things: add two numbers, multiply two numbers and compare a number to another
number. Let the ALU have only five storage places. Recall the office desk on which
we can fit only a few sheets of paper: Each sheet can hold one number. The ALU can, if
instructed to do so by the controller, compare two sheets and write its result to either
of them again, or it can combine two sheets’ values by a multiplication or addition
and write the result back onto one of the sheets. We call these “sheets” registers.
It is the responsibility of the control unit to ask the bus to bring the actual x, y and
f values into the ALU registers. After that, the control unit has to tell the ALU to
add, compare or multiply the two values. The result is deposited by the ALU in one
register. Finally, the controller has to instruct the bus to bring the result back from
the registers into the memory; to free the former for further work. With this machine
(code) idea, a part of the machine operation sequence for our high level code looks
as follows:
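A sketch of such a sequence, mimicked in C with an explicit memory array and register variables; the concrete addresses 20, 21 and 22 are illustrative assumptions:

double memory[64];   // main memory: enumerated storage locations
double RegA, RegB;   // two ALU registers

void addTwoNumbers() {
  RegA = memory[20];     // load the value of a from its memory location into RegA
  RegB = memory[21];     // load the other variable into RegB
  RegA = RegA + RegB;    // the ALU adds the two registers; the result ends up in RegA
  memory[22] = RegA;     // write the result back to main memory; RegA becomes free again
}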
We load the data from a particular address in memory into one of our registers (here
called RegA). The address is a positive number identifying a memory location. The
location holds the value of a. We then do the same thing for the other variable and
RegB. After we have added the two numbers, the result ends up in RegA. We finally
write this result back to main memory, and RegA becomes available for the next
calculation.
A simple C instruction like f = x[...] + y[...] translates into at least four machine
code steps, where we have to be careful which data is held in which register at which
point in time. Identifying a suitable set of registers and getting the memory moves
right is a tricky challenge mastered by your compiler. The underlying process is
called register assignment.
Our machine code for our simple von Neumann architecture consists only of a
few instruction types. It itself is a sequence of codes stored in the main memory just
as the data itself.1 Each step within this code is either
• move a constant into a register. In our simple case, let’s have four registers: RegA,
RegB, RegC, RegD;
• move one piece of data from a position in main memory to a register. Where to
take the data from, i.e. its address, is held in one of the registers, too;
• move data stored in RegA, RegB, RegC or RegD to the memory location specified
by the content of RegA, RegB, RegC or RegD;2
• add or multiply the content of any of the two registers. The result becomes available
in one register as specified;
• compare any register to any other register. If the two are equal, write a non-zero
number into one of the registers. Otherwise, store zero;
• if a register holds zero, skip the next instruction in the stream.
The control unit has a special register itself, the program counter (PC). After a ma-
chine code step has terminated, this counter is incremented; unless the code explicitly
overwrites the PC register's value or encounters the skip command which increases
the PC by two. At the startup, the PC register points to the very first instruction of
your code. By manipulating the PC, a code can jump around in the instruction stream,
as the controller always reads the next instruction from where the PC points to.
Machine code has no variables or even arrays. It works only with memory
addresses. That is, the notion of an array a and its elements a[i] is a high level
language concept. In reality, an array materialises as a long consecutive sequence
of numbers in main memory. For our simple machine, we can assume that constants,
i.e. any hard-coded number, can be stored within the instruction stream. We can
write things like move 14 into RegA. For real machines, this is often possible
for integers only. For floating point numbers, it is the responsibility of the compiler
to create a big table of floating point values embedded into the machine code. Instead
of moving a magic constant into a register, we can tell the bus to move a particular
memory location from the lookup table into a register.
1 The fact that we do not distinguish data and instructions within the memory—they are technically
both just number sequences and the software has to interpret them either as data or instruction
codes—is a built-in security problem. Malware can alter code by pretending it wrote data into the
memory.
2 This is where the machine truncates the bit sequence of the floating point number into the IEEE
format (cmp. Sect. 4.2.3): If the ALU uses 80 bits per double internally for example, the number is
now reduced to 64.
Let's assume that the initial memory configuration for our example with I=10 looks
like this:
This is a very primitive computer architecture. Yet, with only a few different
instructions, it can compute a scalar product:
• We assume that the memory regions 01–10 hold entries of the array x, while 11–19
holds y.
• 20 and 21 are spare memory locations that will hold variables later on. I wrote in
i and f, but initially these memory locations hold whatever has been there before:
garbage from a code’s point of view.
• The first thing we do is make the controller jump to location 50 in memory.
This is where the actual machine code (algorithm) starts. 00 is the code entry point,
01–49 is the data region and everything ≥50 holds machine code.
• We load a zero into RegD and use this value to initialise both the storage locations
of i (the loop counter) and the result f.
• We initialise RegA and RegB. These two registers will always point to the current
entry of the array x or y, respectively.
• We load the current array elements into the registers RegC and RegD, multiply
them (result is kept in RegC), fetch the content of f, add the multiplication’s
result, and eventually write back the new value of f into the memory location 20.
This sequence is the loop body.
• We load the value 1 into RegD and use it to increment the two pointers as well
as the variable i. This is the pointer increment within the high level code’s for
statement.
• We compare i to 10. If it is equal, we skip the subsequent line. This is the end of
the algorithm. If it is not equal (yet), we jump back to 54 to run through the code,
i.e. the loop body of our C code, once again.
In subsequent chapters, we will refine the architecture further and, hence, make
the machine faster. At the same time, we will work exclusively with C, i.e. a high level
programming language. To understand what is going on "under the hood", it is however
important to realise how simple high level constructs (such as a for loop) are
translated into code sequences with register comparisons and register manipulations.
Definition 5.3 (Single Instruction Single Data (SISD)) The plain machine as intro-
duced here is a single instruction single data machine (SISD). It runs one single
instruction at a time. It furthermore runs this instruction on a single piece of data or
a pair of entries feeding into a binary operator, respectively.
The term SISD has been coined by the scientist Michael J. Flynn and is part of the
so-called Flynn's taxonomy. We return to this taxonomy multiple times.
5.3 Flaws
There are obvious flaws in our computer design, and indeed in today's machines. Two
are important at this point.
For computations such as our vector addition, we have to pipe a stream of entries
through the registers. Our ALU has to be served by the memory before it can trigger
its actual computation.
Definition 5.4 (Memory-bound) Nowadays, ALUs are fast, while the memory cannot
keep pace. We end up in situations where the ALU has to wait for the memory
most of the time. Such a code is memory-bound.
Its runtime is not determined by the sheer amount of computations but by the capa-
bility of the memory subsystem. Many important codes today suffer from memory-
boundness.
3 This is the price to pay for fame: the most important technological achievements might be named
after you, but then the flaws tied to it inherit your name, too.
another flavour of this bottleneck. Data movements are not only problematic from a
performance point of view. It is memory movements that are responsible for a major
part of the power budget, not the actual computations.4
A second implication of our SISD/von Neumann design is that it is intrinsically
sequential. There is no parallelism in there so far. To extend this machine model, we
will return to our desk metaphor. But that’s next …
• John von Neumann: First Draft of a Report on the EDVAC. Technical report (1945)
• Michael J. Flynn: Some Computer Organizations and Their Effectiveness. IEEE
Transactions on Computers. C-21(9), pp. 948–960 (1972) https://doi.org/10.1109/
TC.1972.5009071
• Randall Hyde: Write Great Code—Understanding the Machine: Volume 1. No
Starch Press (2004)
• The four core components of the von Neumann architecture are control unit, ALU,
bus and memory.
• A register is the ALU’s internal memory (scratchpad) for computations. There is
a very small number of registers.
• SISD means single instruction single data. SISD architectures can compute one
operation on one piece of data (or one tuple of data for binary operations) at a
time. The term is part of Flynn’s taxonomy.
• Before we can compute anything, we have to move the required values from the
main memory to the registers. Codes that spend the majority of their time moving
data in and out are called memory-bound.
• Traditional efficiency metrics often assume compute-bound codes.
4 With a theoretical computer science background, you will characterise algorithms as O(n) or
O(n log n) or … where n counts the algorithm's operations. In reality, there might be an additional
O(m), O(m log m), … term paraphrasing the memory movement cost. It is not clear a priori which
term dominates the actual runtime.
Wrap-up
There are two fundamental principles of computers that drive our studies: Computers
have to work with finite representations of data, and computers work with a small
number of simple unary and binary operations in a clocked way. These are the two
topics in Part II. As our computer cannot work with R, it works with an approximation
of numbers from R; rounding will become our steady companion throughout the text.
The runtime of a code (manipulating data from R) will be determined by the number
of binary and unary steps within the machine code produced by the compiler.
With these two fundamental principles at hand, we can next study both the arising
errors of a code and its computational efficiency. We can start to do some number
crunching. Our goal is, on the one hand, to obtain a good understanding of the un-
derlying principles needed to dive into more details in further literature. On the other
hand, we want to start to write down some “rules of thumb” or recommendations:
Where do we need extra care when we write code?
Different to classic programming, numerical codes always give the wrong results
even when they are correct. It is the approximation with IEEE precision which
introduces errors. We have to understand how big these errors can grow for particular
cases. Can we trust the prediction of how the particles in our simulation move through
the domain? On top of this, some ways to write down basic calculations are horribly
slow while others run extremely fast. It is important to understand reasons for this,
so we can deliver code that is also of high non-functional quality, i.e. speed. So far,
we know that counting the number of computations - such as calculations of particle-
particle distances - is flawed, as the data first has to be moved from the memory into
the registers if it does not yet reside there. Further to that, the type of operation on
sets of registers makes a difference!
Part III
Abstract
Round-off errors creep into our calculations with every non-trivial floating point
instruction. We introduce a formalism for analysing round-off errors in implementations,
and discuss when and why numerical calculations suffer from a lack
of associativity, cancellation or accumulation stagnation.
1. The left and right object are stationary while the middle one is subject to
an initial velocity v_2 = (0, 1, 0)^T.
2. Let object 1 (the left one) have a prescribed velocity v_1 = (−ṽ, 0, 0)^T,
while object 3 has the velocity v_3 = (ṽ, 0, 0)^T. They move apart with a tiny
ṽ.
The whole system should then behave like a huge pendulum or spring. As the
two outer objects are prescribed and the free object is right in the middle, the
forces pull the middle object back whenever it “leaves” the axis the three objects
are initially aligned on. The free object oscillates: the left guy pulls it closer, but
the right guy does the same. All the “horizontal” pulls cancel out.
The lack of any friction explains why the oscillation carries on and on and on.
Indeed, its amplitude remains invariant over time in the first experiment. In the
second experiment, the amplitude of the oscillation becomes bigger over time.
The left and right object move apart. They are the spring anchors. Therefore,
their forces upon the free object become smaller and smaller.
The interesting thing to us is not that the amplitude grows. The surprising
observation is that the middle object suddenly goes crazy: It does not remain in
the middle, but after a while starts to roam around in the domain.
Our setup starts with perfect oscillations. There are two forces F_{0→1} (force of 0 on
1) and F_{2→1}. Both decompose into a force parallel to the axis connecting 0 and 2
and a force orthogonal to this axis:
F_{0→1} = F^⊥_{0→1} + F^∥_{0→1} and F_{2→1} = F^⊥_{2→1} + F^∥_{2→1}.
The two partial forces parallel to the axis cancel out, i.e. F^∥_{0→1} + F^∥_{2→1} = 0, while the two
partial forces orthogonal to the axis sum up. They push object 1 back towards the
line connecting 0 and 2 whenever it leaves the line due to the initial velocity. In our
example, they are even the same: F^⊥_{0→1} = F^⊥_{2→1}.
After a while, the central particle ends up slightly off the centre if 0 and 2 move
away from each other. F_{0→1} and F_{2→1} are scaled with the inverse of the distance
between objects 0 and 1 or 1 and 2, respectively. When we compute this distance, we get a
slight error due to the scaling's normalisation and rounding. Therefore, (F_{0→1})_M ≠
−(F_{2→1})_M. We introduce a tiny acceleration parallel to the line between 0 and
2, and consequently move object 1 slightly. Once it has left its central oscillation
path, the horizontal forces do not cancel out anymore, so the "wrong" acceleration
starts to grow in our simulation. We obtain a weird trajectory rather than a perfect
oscillation. Our simulation becomes totally wrong (and neither matches any theory
nor any properly done experiment). We now want to ask ourselves where these small
deviations came from in the first place. They navigated us into this unsatisfactory
situation.
We first formalise our considerations. Let F_{0→1} and F_{2→1} be the horizontal forces
in the initial configuration. We know (F_{0→1})_M = −(F_{2→1})_M bit-wisely, as object
1 stays in the middle of 0 and 2 if the latter two objects do not move. As 0 and 2
move away from each other, object 1 experiences forces
s · F_{0→1} and s · F_{2→1} with s < 1.
Due to the machine's finite precision, we cannot represent this value on our machine
accurately as s decreases over time:
(s · F_{0→1})_M ≠ s · F_{0→1} and (s · F_{2→1})_M ≠ s · F_{2→1}.
Even worse, both sides might round the force to the next slightly bigger value. If we
had centred our setup around 0, then both forces would be rounded the same way.
However, we have shifted the setup, and we know that IEEE does not round two
values the same way, i.e. add or remove the same amount to fit the values into IEEE,
if they have a different magnitude. With our definition of machine precision, we can
assume
(s · F_{0→1})_M ≈ (s · F_{0→1})(1 + ε_{0→1}) and (s · F_{2→1})_M ≈ (s · F_{2→1})(1 + ε_{2→1}).
The ε_{0→1} and ε_{2→1} are some values whose magnitudes are in the order of the machine
precision ε. Now we sum up the rounded values and keep in mind that the object 0
is left of the coordinate axis' origin, i.e. this force component is negative.
(s · F_{0→1})(1 + ε_{0→1}) + (s · F_{2→1})(1 + ε_{2→1}) = s · (F_{0→1} + F_{2→1})
+ s · F_{0→1} · ε_{0→1} + s · F_{2→1} · ε_{2→1}.
The F_{0→1} = −F_{2→1} (in exact arithmetics) makes the first term drop out, but the
other two terms do not cancel as ε_{0→1} ≠ ε_{2→1}. It could, for example, happen that
the left value is rounded towards the next smaller value, while we have rounded
the other value to the next bigger one along the number line. After all, we have to
"rasterise" the unknowns into IEEE precision. With |ε_{0→1}| ≤ ε and |ε_{2→1}| ≤ ε, the
right term's size is bounded by around 2ε; but it does not disappear.
If we work with floating point numbers in a computer, the permanent denormalisa-
tion/normalisation over a finite bit field introduces round-off errors which propagate
through our calculations. The results become biased. The fewer bits we have for our
values, the more biased we are. This is something we have to accept. We however
cannot accept that all of these tiny errors start to blow up unheeded. Our goal next
thus is to establish a formalism for how to study the error propagation.
Since we know that even our most basic arithmetic operations are “wrong”, we should
write a formula, once it is translated into source code, as
f = x + y ⇒ f_M = round(x + y).
In a large calculation, every single term therein should be replaced with the above
rounding wrapper. Wherever we write x + y (or another operation) in our code, we
will end up with something slightly off the correct result. To highlight this, I write
x +_M y. The subscript M denotes that this addition is done in machine precision,
i.e. it will, for most cases, not be exact. It is a shortcut for wrapping everything into
the round(...) function from above.
We now can use the knowledge from (4.3): We know where the round-off error
comes from. The actual calculations suffer from the number (re-)normalisations and
the rounding and thus calculate (x + y)_M instead of (x + y). I usually prefer to write
x +_M y = (x + y)_M to save a few brackets—I am not a LISP programmer after all.
We end up with
(x + y)_M = x +_M y = round(x + y) = (x + y)(1 + ε).
The ε is a worst-case estimate always picking the sign least favourable for us. So it
could be smaller or bigger than zero. It could also be exactly zero, but we cannot
count on this. We work, on purpose, quite sloppily here. In a more formal world,
we should multiply the real result with (1 + ε̃) with |ε̃| ≤ ε from our machine
precision definition. But we keep the notation simple. Our goal is not to construct a
watertight formalism. We want to understand the implications and issues for us as
programmers.
If we have two operations x +_M y ·_M z in a row, we'd have to use two epsilon
values ε_1 and ε_2. One for the addition and one for the multiplication. We however
ignore all of this and again work with only one generic ε (read "± the machine
precision" as worst-case). This allows us to introduce a few simple compute rules:
x +_M y ⇒ (x + y)(1 + ε) (6.1)
x −_M y ⇒ (x − y)(1 + ε)
x ·_M y ⇒ (x · y)(1 + ε)
x /_M y ⇒ (x/y)(1 + ε).
This formalises what happens in our ALU in a single step. We furthermore simplify
our business and neglect all higher order terms:
ε · ε ⇒ 0,
i.e. assume that ε is really small, so the product of two ε is something we can safely
ignore. Finally, we introduce
ε + ε ⇒ 2ε and
ε − ε ⇒ 2ε,
as we don't know the sign of the errors ε. If we add two ε, the maximum error doubles.
If we subtract one from the other, we have to assume that they have different signs
and thus eventually also add up. Finally,
x −_M y ⇒ garbage
if x ≈ y. This is an effect we call cancellation:
Definition 6.1 (Cancellation) If we subtract two values x and y from each other and
if x and y are almost the same (up to machine precision), normalisation makes us end
up with lots of garbage bits. Scientifically correct, we speak of a loss of significant
bits: The additional bits that we insert when we renormalise the result do not carry
any semantics. With x (and y) we might have 64 or 32 bits with a real meaning. If
x and y are of the same size, x − y will yield a result where very few bits actually
mean something, i.e. are significant. This effect is called cancellation.
This worst-case analysis is done in the tradition of Wilkinson, and that’s from the
1960s. In reality, it is usually way too pessimistic. Some values are slightly too big,
some too small. If we add them up, the resulting error does not amplify but more or
less cancels out. Modern error analysis models this behaviour by treating the round-off
errors as random variables following some probability distribution.
With a formal system at hand, we can analyse longer calculations. This analysis
describes how a computer realises a particular source code. Our activity consists of
two steps: We first write down the calculation as done on the machine and replace all
arithmetic operations by their machine counterpart. In a second step, we use the rules
from (6.1) and thereafter to map the machine calculations onto “real” mathematical
operators including error thresholds. The first step is purely mechanical. The second
step requires us to think about potential variable content.
I demonstrate this agenda via the summation of three variables. Technically, each
compute step would require us to work with an ε of its own:
x +_M y +_M z = (x + y)(1 + ε_1) +_M z
             = ((x + y)(1 + ε_1) + z)(1 + ε_2)
             = (x + y) + (x + y)ε_1 + (x + y)ε_2 + (x + y)ε_1ε_2 + z + zε_2
For each calculation, we have another error. As discussed before, we use a pessimistic
upper bound and hence use a generic ε as worst case, i.e. we set ε_1 = ε_2 = ε.
Furthermore, we eliminate higher order terms and end up with
x +_M y +_M z = ((x + y)(1 + ε) + z)(1 + ε)
             = (x + y + z + xε + yε)(1 + ε) (6.2)
             = x + y + z + xε + yε + xε + yε + zε + xε^2 + yε^2
             = x + y + z + xε + yε + xε + yε + zε
             = x(1 + 2ε) + y(1 + 2ε) + z(1 + ε). (6.3)
We reiterate that this is not a precise formula in a mathematical sense: Per term, the
ε accepts the role of a bad guy and switches sign and magnitude just as appropriate.
The equals sign also is really a ≈ as we drop the higher order terms. All of this has
been known before. There are however a few implicit assumptions that we used and
should articulate explicitly:
The simple formalism yields some interesting insight into the way we should
program:
Associativity
With machine arithmetics, associative operations are not associative anymore.
Reordering calculations
If you can reorder your operations, run the operation introducing the potentially
largest error last.
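A minimal sketch illustrating the lack of associativity; the constants are chosen purely for illustration:

#include <iostream>

int main() {
  float a = 1.0e8f;
  float b = -1.0e8f;
  float c = 1.0f;
  std::cout << (a + b) + c << std::endl;  // yields 1: a and b cancel exactly first
  std::cout << a + (b + c) << std::endl;  // yields 0: the 1.0f is absorbed by the huge b
  return 0;
}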
Stagnation
The addition of two numbers is problematic if we add a very small to a very big
number. To add them, the ALU has to make them have the same exponent: To add
1.0|2 · 2^3 + 1.0|2 · 2^4, it first converts the numbers into 0.1|2 · 2^4 + 1.0|2 · 2^4. Throughout
this process, we could lose bits: bits are pushed to the right when we denormalise
1.0|2 · 2^3 → 0.1|2 · 2^4 and thus might be squeezed out of the significand; even if
it occupies significantly more bits than IEEE precision internally. Therefore, adding a
big to a small number—speaking in absolute values—is problematic.
Definition 6.2 (Accumulation) If we calculate f = Σ_{n=0}^{N−1} x_n in a register of the
ALU, we accumulate this result step by step in a computer.
There are ways to work around these machine restrictions. A simple strategy is to
sort the entries within an array by their magnitude. That is, adding 1.0 · 2^1 + 1.0 ·
2^2 + 1.0 · 2^3 + 1.0 · 2^4 + · · · is less problematic than adding 1.0 · 2^10 + 1.0 · 2^9 +
1.0 · 2^8 + 1.0 · 2^7 + · · · . This doesn't necessarily work if we have to compute a scalar
product with two arrays; or it might at least be cumbersome.
In this case, we can employ another trick: We know that we can split up any
sum into two sums, accumulate left and right, and then sum up the result: f =
Σ_{n=0}^{N/2−1} x_n · y_n + Σ_{n=N/2}^{N−1} x_n · y_n. As computer scientists, we can apply this scheme
recursively, i.e. split up left and right again. This way, we end up with a tree scheme
to obtain the result which is, on the one hand, better-suited to parallelise (we will see
this later). On the other hand, it can be more robust if all x_n · y_n are of roughly the same size.
We never sum up two partial results which are of completely different magnitude.
We never run into stagnation.
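A sketch of such a recursive tree summation; the interface and the small base case are illustrative choices:

#include <cstddef>
#include <vector>

double pairwiseSum(const std::vector<double>& x, std::size_t first, std::size_t last) {
  if (last - first <= 2) {                  // tiny base case: accumulate directly
    double result = 0.0;
    for (std::size_t i = first; i < last; i++) {
      result += x[i];
    }
    return result;
  }
  const std::size_t middle = first + (last - first) / 2;
  // sum the two halves independently, then combine the two partial results
  return pairwiseSum(x, first, middle) + pairwiseSum(x, middle, last);
}

Called as pairwiseSum(x, 0, x.size()), the partial results added in each step stem from ranges of comparable size.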
Our formalism allows us to formalise this effect:
The error impact is more equally distributed among the sum terms: We have N = 4
terms in our sum and thus get a maximum error amplification of (N − 1) for a
plain left-to-right sum. A recursive summation yields a maximum amplification of
log_2(N).
Terminology
In different books, you find different terms. Some books call the plain left-to-right
summation a "recursive summation" as it is formally equivalent to f(x_1, x_2, . . .) =
x_1 + f(x_2, . . .). If we split up the range into blocks, compute f over the blocks and
then combine the partial results, this is typically called blocked summation. The idea
to split up a range into blocks, and then to split up these blocks recursively over and
over again is called pairwise summation, cascadic summation or tree summation.
Blocked summation will play a major role once we do things in parallel: If we have
b blocks, then these b blocks can be summed concurrently. Alternatively, we can
sum the first entries of different blocks parallel to the second entries, and so forth.
We will use the first pattern on multicore parallel machines and the other pattern on
GPUs and vector computers.
Our recursive tree summation uses bipartitioning in the sketch. Nothing stops us
from splitting up a range into more sections per recursion step, and nothing stops
us from abandoning the recursion once blocks become too small. We can construct hybrid
schemes.
Datatype
Another flavour of hybrid becomes interesting when we consider that we have multi-
ple precision formats available on our chips. Recent papers around fast and accurate
blocked summation not only propose to split up large sums into blocks which fit par-
ticularly well to our hardware. They also propose to compute the individual blocks
with a lower precision than the total sum. That is, they propose to compute the partial
sums for example with single precision, but then to sum up these partial results with
double precision. An analysis of such an approach is technical, but the rationale is
simple: The lion’s share of the work is the summations within the blocks and it thus
has to be fast. If the blocks of size b are reasonably small, then our error scales with
(b − 1)ε_{low precision}. The flaws in the partial results then are amplified by the summa-
tion over the blocks. However, these operations are done with high precision; maybe
even with higher precision than what we need in the end by means of our IEEE
format of interest. As a result, we have pretty good control over the overall error plus
a fast code. This is an appealing example of a mixed precision implementation.
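A sketch of this idea with single precision block sums and a double precision total; the block size of 256 is an arbitrary illustrative choice:

#include <algorithm>
#include <cstddef>
#include <vector>

double blockedMixedPrecisionSum(const std::vector<float>& x) {
  const std::size_t blockSize = 256;
  double total = 0.0;                              // high precision accumulator over the blocks
  for (std::size_t begin = 0; begin < x.size(); begin += blockSize) {
    const std::size_t end = std::min(begin + blockSize, x.size());
    float partial = 0.0f;                          // low precision accumulator within one block
    for (std::size_t i = begin; i < end; i++) {
      partial += x[i];
    }
    total += partial;                              // promote to double when combining the blocks
  }
  return total;
}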
Unfortunate constants
We know that our computers work with bits and thus are intrinsically good in working
with multiples of two. They however struggle with other values. We furthermore
know that many rational numbers introduce a significant round-off error as we have
to squeeze them into the significand. 1/2 is something we can represent exactly, but
1/3 or 1/10 (we are not living in a decimal world) are not. As a result, the following
two loops yield qualitatively different results:
sum = 0.0;
for (int n=0; n<N; n++) {
sum += 0.5;
}
std::cout << sum << std::endl;
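The companion loop accumulates 0.1 instead of 0.5; a sketch:

sum = 0.0;
for (int n=0; n<N; n++) {
sum += 0.1;
}
std::cout << sum << std::endl;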
On my computer, both loops deliver the correct result. Let's increase N to 10,000. It
still all seems to be correct. If we however switch sum to float, i.e. if we reduce
the accumulation variable to single precision, we get 999.903 for the loop adding 0.1.
This is wrong. The errors introduced by each individual rounded 0.1 addition accumulate.
The loop adding 0.5, in contrast, is fine.
Nice constants
If you work with floating-point constants and if you have some degree of freedom,
use "well-behaved" constants such as 1.0 · 2^k instead of something that intrinsically
introduces errors.
Cancellation
Postpone cancellation
If your calculations run risk to suffer from cancellation, rearrange the calculations
such that the cancellation arises as late as possible.
Mathematically, (x − y)(x + y) = x^2 − y^2. Yet, the left side yields bigger errors for x ≈ y. Also,
zx − zy might be better than z(x − y) if x and y are close.
Compilers do a good job of trying to keep all data within the registers as long as
possible. Moving data from the register into memory is expensive, so we try to use
our full spectrum of fast in-house (register) storage locations. There’s a subtlety to
this: Registers internally work with higher precision than IEEE, so we reduce the
impact of floating point round-offs, too.
It is therefore never a good idea to work on a variable, then not to touch it for
ages, and then finally to reuse it. It is always better to do all the work on a variable in
one go. Rewrites to achieve this increase spatial and temporal data access locality.
This is the positive side of things.
However, any instruction reordering can alter the floating point results. If you
bring calculations forward, e.g., you might force the compiler to temporarily move
values out of registers that it otherwise would have kept there. Modern compilers use
instruction permutations quite heavily to produce fast code. Whenever we increase
a compiler’s optimisation level (from -O0 to -O2 (default) to -O3 to …), we give
the translator more freedom to make performance-relevant code modifications. For
floating-point numbers, these might not be semantics-preserving.
Most compilers today allow you to disable code permutations: The GNU com-
piler’s switch -fassociative-math for example allows the re-association of
operands in series of floating-point operations. There's a "don't do this" alternative,
-fno-associative-math, which explicitly forbids it. Clang supports different floating point policies, too. The
default is not “safe”.
Compiler impact
Check the compiler’s influence on your result from time to time, i.e. try out different
optimisation levels, compiler types and generations. In most cases, it makes no
difference to the outcome. In some, it does.
• David Goldberg: What Every Computer Scientist Should Know About Floating-
Point Arithmetic. ACM Computing Surveys (CSUR), 23(1), pp. 5–48 (1991)
https://doi.org/10.1145/103162.103163
• James H. Wilkinson: Error analysis of floating-point computation. Numerische
Mathematik. 2, pp. 319–340 (1960) https://doi.org/10.1007/BF01386233
• Pierre Blanchard, Nicholas J. Higham, Theo Mary: A Class of Fast and Accu-
rate Summation Algorithms. SIAM J. Sci. Comput., 42(3), A1541–A1557 (2020)
https://doi.org/10.1137/19M1257780
Abstract
Vector computing can be read as equipping a processing unit with replicated
ALUs. To benefit from this hardware concurrency, we have to phrase our calcula-
tions as operation sequences over small vectors. We discuss what well-suited loops
that can exploit vector units have to look like, we discuss the relation between
loops with vectorisation and partial loop unrolling, and we discuss the difference
between horizontal and vertical vectorisation. These insights allow us to study
realisation flavours of vectorisation on the chip: Are wide vector registers used, or
do we introduce hardware threads with lockstepping, how do masking and blending
facilitate slight thread divergence, and why does non-contiguous data access
require us to gather and scatter register content?
• The ALU is told by a controller what to do. The controller acts as master of
puppets.
• There are two ALUs and they do exactly the same, i.e. if the controller (master)
shouts “add”, both add their registers.
• The controller is able to address the 2 × 2 registers separately.
How can we compute y in such an environment way quicker than before? What
is the maximum speedup resulting from this architecture? What would happen
if we doubled the number of ALUs once more?
A first extension of the simple von Neumann machine (Chap. 5) is the introduction
of a vector register: Assume that there is not only a register RegA but that there are
two registers RegA1 and RegA2. Logically, we could make register A twice as big
such that it can hold (a small vector of) two entries. Alternatively, we could assume
that there are two ALUs with exactly the same register types. Both approaches
are variations of the same conceptual idea. We obviously assume that the same
extension holds for register RegB, too.
Let there be a new command which we call add2. It takes the content of RegA1,
adds RegB1 and stores the result into RegB1 for example. At the same time, the
new routine adds RegA2 and RegB2 and stores the result. We call such an addition
a vector operation, as it effectively runs
(y_1, y_2)^T ← (x_1, x_2)^T + (y_1, y_2)^T
in one step.
The potential of such an idea is immediately clear. If we need exactly one cycle for
each instruction—an assumption which is wrong, but helps us to bring the message
across—then we have to load 2N values from the main memory into the chip every
time we add two vectors of N elements, i.e. we need 2N cycles for all of the loads.
We furthermore have to write N values back. The number of computations totals to
another N cycles. We end up with 4N cycles for the whole vector addition.
With our new vector operation, we reduce the cost for the actual computation by
a factor of two. We end up with a total cost of 3N + N/2 cycles. Not a tremendous
saving. Yet, we see where this leads us to: If we now introduce new load and store
operations, too, then we can bring down the cost to a half of the original cost. And
there is no reason why we should not apply this idea to more than just two registers.
Definition 7.1 (Vector operation) A vector operation is an operation which runs the
same operation on multiple pieces of data in one go. If programmers or a compiler
translate a (scalar) piece of code into a code using these vector operations, they
vectorise a code.
The idea to introduce vector operations is old. The underlying paradigm has been
described by Michael Flynn as Single Instruction Multiple Data (SIMD). The mean-
ing of this is intuitively clear: One single operation (add two values, e.g.) is applied
to multiple entries of a vector in one step. Others thus speak of SIMD as vector
computing.
•> SIMD
Single instruction multiple data (SIMD) is the hardware paradigm equipping chips
with vector computing capabilities.
Historically, vector computing had always been popular with computers constructed
for scientific calculations, as our codes traditionally had been written using lots of
vector operations—essentially matrix-vector products (the jargon/slang for this is
mat-vecs). It hence paid off to create specialised hardware. This was in the 1980s.
After that, the economy of scale took over. It became cheaper to buy a few more
standard computers without these specialised instructions off the shelf. It also was
more convenient for developers—no need to replace our additions with specialised
SIMD operations, e.g.
Things changed due to the demand for better computer games and movie graphics:
If you process an image, very often you apply the same operator to all the pixels.
You pipe data subject to the same code through the system. You apply the same
operator to one big vector of data. GPU vendors thus started to revive the vector
computing idea. One might argue that realistic blood spilling and car races paved
the way towards exascale computing with GPUs. As vectorisation is a proper and
energy-effective way to provide what users want, it found its ways back into CPUs
too. Today, we have SIMD in almost any chip—though with different realisation
flavours.
The name vector computing both highlights for which types of codes our paradigm
works and is slightly misleading at the same time. We next try to get a better (formal)
understanding of which types of code snippets are well-suited for parallelisation:
Let’s break down our program into elementary steps as they are supported by the
machine, i.e. break down our code into steps like a ← b and a ← b ◦ c with some
operators ◦. Each step of the resulting code now becomes a node in a graph. Two
nodes n_1 and n_2 are connected via a directed edge from node n_1 to n_2 if the outcome
of n_1 is required on the right-hand side of the computation in n_2. The graph of the
vector addition f = x + y, f, x, y ∈ R^10 corresponding to the code
then equals
Definition 7.2 (Control flow graph) The control flow graph is a representation of
a particular realisation of our algorithm. It consists of elementary steps connected
through causal dependencies.
The control flow graph encodes a partial order: If there's no path in the graph between
two nodes n_1 and n_2, we are free to compute either n_1 or n_2 first.
Programmers can alter their code by rearranging instructions (nodes) that are
independent of each other, and compilers do so all the time when they optimise.
Whenever they rearrange/reorder a code, they “only” have to ensure that they do
not violate any data dependencies. In the example above, there is nothing we can
reorder that fits to our idea of a vector operation. There is however a very simple
code transformation:
1. Copy the body within the loop once, i.e. each loop iteration now runs through two
entries of the input vectors. In return, the loop increment becomes an increment
of two. For the time being, we assume N is an even number.
2. Within the loop body, rearrange the operations such that two operations using the
same binary operator but working on different input/output data follow immedi-
ately after one another.
3. Finally, fuse these pairs of operations with the same operator into one vector
operation.
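A sketch of steps 1 and 2; the concrete loop body, a scaling of x into y by some factor a, is an illustrative assumption:

void scale(int N, double a, const double* x, double* y) {
  for (int i=0; i<N; i+=2) {    // unrolled once, so the increment becomes two
    y[i]   = a * x[i];          // same operation ...
    y[i+1] = a * x[i+1];        // ... on different data: a candidate for one vector operation
  }
}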
This yields a fundamentally different control flow graph: the update of y[i+1]
still follows y[i] in our textual description. However, the two steps are completely
independent of each other. Indeed, the definition of the control flow graph implies
that there is no edge between the two multiplications within the loop body. We thus
are free to pick any order. Alternatively, we can do both calculations at the same time
as one vector operation over tiny vectors of size two. A compiler will immediately
recognise this.
It is clear what vector operations within our (unrolled) loop body have to look like:
• They have to group exactly the same operation together (after all, it is called “Single
Instruction”),
• each operand has to have the same type (don’t mix doubles and int), and
• the results have to go into separate memory locations.
Compared to our vector addition, it is more challenging to bring the inner product
f = Σ_{n=0}^{10−1} x_n · y_n onto a vector machine: we can do the multiplications in parallel,
but the accumulation of the result—so far—has to be done step-by-step.
In most cases, the ◦ above is an addition or multiplication. The need for a reduction
to realise the scalar product limits our gain from vector operations. However, nothing
stops us from reusing our blocking idea from the floating point discussion:
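A sketch of what such blocked pseudo code could look like; the two partial accumulators and the even N are illustrative assumptions:

double innerProduct(int N, const double* x, const double* y) {
  double partial0 = 0.0;
  double partial1 = 0.0;
  for (int i=0; i<N; i+=2) {
    partial0 += x[i]   * y[i];    // the two updates are independent of each other ...
    partial1 += x[i+1] * y[i+1];  // ... and thus map onto one vector operation
  }
  return partial0 + partial1;     // final reduction of the two partial results
}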
With this pseudo code on our desk, we have a control flow graph fit for a SIMD
machine. Here, operation pairs can be replaced by one vector operation. We will
make compilers do this job for us later. They will yield significantly improved
performance “automatically”.
So far, we have unrolled the loop once and assumed that the loop cardinality is
even. For N = 10, this holds. If we have a vectorisation parallelism of a factor
of four—think about four-tuples of registers—we have to rethink our solution. A
straightforward solution in this case would be to augment the input vectors x and
y by two zeroes each, i.e. to make them x̂, ŷ ∈ R12 . This is not always possible or
desirable—we cannot always extend data structures in memory without overwriting
information and if we rewrite code, we might be hesitant to sacrifice the memory.
What we can do instead is to split the vector logically into two parts:
x = (x_0, x_1, . . . , x_7, x_8, x_9), where x_0, . . . , x_7 form the first part and x_8, x_9 the second part.
We now can apply the code transformation pattern to the first part with a vectorisation
factor of four, and then handle the remaining two entries of the input vectors with
another “loop” (it is a degenerated one with only one run-through) and an unrolling
factor of two. Again, we will later throughout this book rely on a compiler to do this
for us. But we have to understand these patterns to understand why certain codes
yield a certain performance.
1. Countable. We have to know how many loop iterations are to be performed when
we first hit the loop. A while loop where we check after each sweep whether a
criterion holds is not a fit for vectorisation.
2. Well-formed. Some people like loops with break or continue statements or loops
which can jump out of a function. More “sophisticated” programmers make loop
bodies throw exceptions or alter the loop counter within the loop body. All of
these are poison to vectorisation. They make it impossible for the compiler to
translate loop instructions into vector instructions.
3. Branching/simplicity. The loop body has to be simple.
a. Loops should not host if statements. The idea behind all vector computations
is that you do the same thing for all entries of a vector. If one vector entry
goes down one source code path and the others do something different, this
contradicts the idea behind SIMD. We will weaken this statement on ifs in a
minute, but in principle we would like to have a no-if policy.
b. Vectorisation can only work for inner loops. If we have nested loops, we can’t
expect the machine to vectorise over both the outer loops and the inner loop;
at least not if the inner loop’s range depends on the outer loop indices.
c. We cannot expect vectorisation to work if we use function calls (think about
something like trigonometric functions in the worst case) within the loop.
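Two small sketches (with illustrative names) of what does and does not fit these criteria:

```c
/* a countable, well-formed, branch-free loop: a good vectorisation candidate */
for (int i = 0; i < N; i++) {
  y[i] = x[i] + y[i];
}

/* poison: the trip count is unknown up front and the loop can exit early */
int i = 0;
while (i < N && y[i] > 0.0) {
  y[i] = x[i] + y[i];
  i++;
}
```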
On top of the “hard” constraints formalised by the term canonical form, loops have to
exhibit a reasonable cost. Vectorisation is not for free. There are multiple technical
reasons for this: Most chips for example downclock once they encounter vector
operations. They become slower. If you get it right, this downclocking does not eat
up all of the speedup of vectorisation. But it reduces the pay-off. In the worst case, however, it renders vectorisation counterproductive.
We will discuss in detail later how to force or encourage a compiler to vectorise.
In most cases (and for most programmers), code that can be vectorised nicely is
code written in a way such that a compiler can automatically create the right vector
instructions for the code fragment. Such a code follows the recipes above.
There are two fundamentally different ways to realise SIMD in hardware: We can
work with large registers that host N values at once. When we add two of these
massive registers, they effectively perform N additions in one rush. Alternatively,
we can work with 2N normal registers where the N pairs of registers all perform the
same operation.
The former realisation variant is what we find in standard processors today. Though
there are numerous early adoptions of the SIMD concept, today’s architectural
blueprint dates back to around 1999 when Intel introduced a technique they called
SSE. The main processor here is sidelined by an FPU (floating point unit as com-
pared to ALU) hosting additional registers. These registers are called xmmk with
k ∈ {0, 1, . . . , 7}. They are larger than their ALU counterparts. In the original SSE, they had 128 bits. SSE can only be used for single precision arithmetic—its primary
market had been computer games and graphics—which means each xmm register
can host up to four single precision values with 32 bits each. If four entries of a
vector x ∈ R4 are held in xmm0 and four entries y ∈ R4 in xmm1, then the addition
of xmm0 and xmm1 computes four additions in one rush. The hardware ensures that
xmm0 does not spoil xmm2 and so forth. The xmmk registers are the RegA, RegB,
…registers from our introductory example, i.e. the RegA1, RegA2, and so forth are
physically stored in one large RegA register.
Fig. 7.1 Let x = (x1 , x2 ) and y = (y1 , y2 ) be two vector registers holding two values each. We can
combine these registers vertically into f = (x1 ◦y1 , x2 ◦y2 ) or horizontally into f = (x1 ◦x2 , y1 ◦y2 )
Single precision support is not what we need in science. With SSE2, Intel therefore
allows us to store two double precision rather than four single precision values per
register. They also offer many new operations and support for further (integer) data
types, but the single and double precision are the predominant use cases for scientific
computing. With SSE3, new instructions were made available which allow us to add
two (or four) values stored in one register. Before that, we had been able to add xmm0
and xmm1 but not the four values within xmm0, e.g.
Definition 7.6 (Horizontal versus vertical vector operations) If one vector register
holds two entries and if we combine its first and second entry, this is a horizontal
vector operation. If entries from multiple registers are combined with each other, we
call this a vertical operation.
Horizontal operations work with tuples of vector registers (Fig. 7.1). Each entry in the
outcome vector results from the combination of two neighbouring entries from one
source vector register. Vertical operations combine the entries from different vector
registers. The operations introduced as RegA1, RegA2, …combined with RegB1,
RegB2, …in this chapter are vertical SIMD operations. If we added RegA1 to RegA2, we would have introduced a horizontal vector operation. At first glance, it
is not obvious why one should introduce horizontal operations; notably as vertical
operations are way faster than their horizontal “counterparts”. However, they are
convenient whenever partial results are already distributed among entries of one
vector register. Without horizontal operations, we would have to disentangle a vector
register before we can combine its individual entries.
Take, e.g., the scalar product of two vectors with two entries each: the code first loads the vectors into two vector registers and multiplies them component-wise via one vertical operation. Thus, there will be one vector register holding (x1 y1 , x2 y2 ). Without horizontal vector operations, we next have to decompose (split) this vector register up into two registers—another step—before we eventually add up the partial results.
Further improvement of vector computing capabilities results from the fact that
modern vector units offer fused multiply add (FMA): They compute f = x + (y · z)
in one step. That is two arithmetic operations (a multiplication plus an addition) in
one step rather than two! The operations are fused.
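A loop whose body maps naturally onto FMAs might, e.g., look like this (names are illustrative):

```c
/* each iteration is one multiplication plus one addition: an ideal FMA candidate */
for (int i = 0; i < N; i++) {
  f[i] = x[i] + y[i] * z[i];
}
```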
Beyond the extensions of the vector instruction set, the biggest improvement
upon SSE is SSE’s successor Advanced Vector Extensions (AVX), which widens the
individual register from 128 bits to 256. Later, we got the AVX-512 extension. Eight
double values of eight bytes each now fit into one register.
Performance breakdowns
Statements on the pay off of vector operations as factors of two or four lack two
details: On the one hand, vector operations typically have a way higher latency than
their scalar counterparts. That means, loading data into vector registers is expensive
and we have to amortise this speed penalty by vector efficiency. On the other hand,
vector units are independent of the CPU. Vendors thus drive them with slightly
different clock speed. They reduce the frequency for AVX-heavy code.1 Otherwise,
the chip would become too hot. We conclude that optimal code, from a vector point
of view, relies on sequences of f = x + (y · z) operations, but the impact on the
time-to-solution has to be analysed carefully and experimentally.
7.3.2 Lockstepping
An alternative to large vector registers is the use of many standard registers that all
do the same thing at the same time. They synchronise after each step. The selling
point compared to scalar architectures remains the same: We can save hardware and
energy for all the orchestration and execution logic.
Definition 7.7 (Lockstepping) If multiple ALUs do exactly the same thing at the
same time, they run in lockstep mode.
1 In return, vendors sell you a temporary increase of the clock speed as turbo boost; which sounds
nice but actually means that you will not benefit from this with a floating-point heavy code.
We summarise that SIMT is one way to realise the vision behind SIMD. The other
way is the large vector registers found in CPUs.
7.3.3 Branching
Both vector units and SIMT architectures today offer some support for branching.
They can evaluate codes like
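the following sketch, where the concrete update in each branch is only illustrative:

```c
for (int i = 0; i < N; i++) {
  if (x[i] > 0.0) {
    y[i] = y[i] + x[i];   /* taken by the positive entries */
  } else {
    y[i] = y[i] - x[i];   /* taken by the negative entries */
  }
}
```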
At first glance, this comes as a surprise, as an input to this code snippet might be
x = (1, −1, 1, −1, . . .)T . There seems to be no real potential for vectorisation.
However, modern systems use a technique called masking:
Definition 7.9 (Masking) Masking means that a condition is evaluated and its out-
come is stored in a register. A bitfield with one entry per vector register entry is
sufficient. Once a SIMD computation is completed, the result is discarded for those
vector elements for which the masking’s register bit is not set.
2 Two different things are called CUDA: There is a programming language (a C dialect) that we call
CUDA, and there is a hardware architecture paradigm called CUDA. We do not discuss the CUDA
programming here.
Fig. 7.2 Illustration of masking/if statements of our example code snippet with a vector multiplicity
of four. In the if/else branches, we only use 50% of our compute resources as the impact of every
second vector calculation is masked out. This is also called thread or vector divergence
With masking, we do not skip any part of the code. We always execute everything line by line. The masking
solely can mask out the effect of complete if or else blocks. The problem can be
mitigated—though only from a performance point of view, not from a “wasted”
computation point of view—via blending:
Definition 7.11 (Blending) If a code has two branches, blending implies that we
evaluate both branches simultaneously but only use one of the outcomes.
Blending can be read as natural extension of masking: We don’t mask out but make
the masking bit decide which result to use in the end. This gives a compiler a larger
set of options how to arrange computations (and how to vectorise them).
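A compiler may, for instance, translate a conditional assignment like the following sketch (illustrative names) into blend instructions rather than a branch:

```c
/* both results are computed; the comparison outcome selects which one is kept */
for (int i = 0; i < N; i++) {
  y[i] = (x[i] > 0.0) ? y[i] + x[i] : y[i] - x[i];
}
```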
Branches in loops
Our requirements for loops from Sect. 7.2 are too strict: As we have masking and
blending, we can use conditionals, but (i) they have to be simple—complete if-then-
else cascades or branches in outer loops do not work—and (ii) they make our code
less efficient.
Definition 7.12 (Gather and scatter) If we have to grab various entries from dif-
ferent places in memory into one large vector register, we gather information. The
counterpart is called scatter. They are collective operations, as they affect all vector
register entries.
In the SSE era, gathers and scatters were so expensive that it often was disadvantageous to use vector operations at all for scattered data. This is changing. Why scatter and gather are so important that vendors have started to support them better in hardware becomes clear when we revisit matrix operations:
Matrix-matrix multiplication
If you have a matrix, you can either store it row by row (C-style) or column by
column. The latter is the Fortran way. When you multiply two matrices X · Y , you
can either take the first row of X times the first column of Y , then the first row of X
and the second column of Y , and so forth. Or you can take the first column of X and
multiply it with the first row of Y . The result is a matrix to which you next add the
result of the second column of X times the first row of Y . No matter which variant
you follow and no matter whether you store your matrices in row-major order (this
is the C way) or column-major (Fortran), one of your matrix accesses always will
suffer from scattering.
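A sketch of the row-times-column variant for row-major matrices (flat arrays, illustrative names) makes the access pattern explicit:

```c
/* row-major (C-style) storage of N x N matrices X, Y and the result F */
for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    double sum = 0.0;
    for (int k = 0; k < N; k++) {
      /* X is read contiguously along a row, Y with stride N down a column */
      sum += X[i*N + k] * Y[k*N + j];
    }
    F[i*N + j] = sum;
  }
}
```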
A new trend around vectorisation—in particular on the GPU side and in particular
focused on machine learning—is the progression from vector operations to matrix
calculations. Classic vectorisation orbits around operations with small vectors. Due
to the advance of machine learning, calculations with small matrices become more
and more important. With out-of-the-box SIMD, we have to break matrix operations
down into sequences of operations over vectors. New machines however offer so-
called tensor units:
Definition 7.13 (Tensor units) Tensor units are dedicated hardware to compute F =
X + (Y · Z ) and variations of these, where F, X, Y and Z are (tiny) matrices. They
translate the vectorisation idea into the matrix world. Some vendors speak of tensor
cores as they are separate cores to the standard processors on the GPGPU.
With tensor units, we witness a little bit of history repeating: Again, they first became available for reduced (half) precision, before they grew into hardware that supports the IEEE precisions used in most scientific codes, i.e. single and double
precision.
• Dominic E. Charrier, Benjamin Hazelwood, and others: Studies on the energy and
deep memory behaviour of a cache-oblivious, task-based hyperbolic PDE solver.
The International Journal of High Performance Computing Applications. 33(5),
pp. 973–986 (2019) http://doi.org/10.1177/1094342019842645
• Johannes Hofmann, Jan Treibig, Georg Hager, Gerhard Wellein: Comparing the
performance of different x86 SIMD instruction sets for a medical imaging appli-
cation on modern multi- and manycore chips. WPMVP ’14: Proceedings of the
2014 Workshop on Programming models for SIMD/Vector processing, pp. 57–64
(2014) https://doi.org/10.1145/2568058.2568068
• We know what a vector computer is and that it belongs to the class of SIMD
architectures.
• Control graphs are one tool to analyse given code and to identify its vectorisation
potential.
• We have defined the term reduction and identified implications for the arising
vector efficiency.
• There are four criteria loops have to meet to benefit from vectorisation.
• Different vendors use different terminology and technical concepts to realise SIMD—in particular GPUs versus x86 chips. The vendor strategies mainly differ in whether they use large registers to hold multiple vector entries or rely on lockstepping. The implications for programmers however remain similar. If you can write CPU code that vectorises brilliantly, you can usually transfer this code to GPUs relatively easily.
• Masking and blending are the two key concepts by which vector computing facilities
support branching. They imply that we can use if statements in the inner loop but
also imply that we should be careful when we use them as they might harm the
performance. The latter insight is formalised by the term thread divergence.
8 Arithmetic Stability of an Implementation
Abstract
Round-off errors are tied to the implementation, i.e. two different implementa-
tions of the same algorithm might exhibit different error propagation patterns.
We introduce the term arithmetic stability and formalise how to find out if an
implementation is stable under round-off errors. The formalism allows us to show
that our previously studied N -body problem is indeed single-step stable and also
stable over longer simulation runs. For implementations of extreme-scale scalar
products, e.g., stability however does not come for free, and we have to carefully
arrange all computational steps.
We compute
$$f(x_1, x_2) = x_1 - x_2 = \frac{x_1^2 - x_2^2}{x_1 + x_2}$$
for $x_2 = 1$. If we plot either the left or the right variant of the above equation,
they seem to compute the same. However, if we compare the right to the left
side, we observe that they do not yield the same result for all x1 :
The different plots are for different precisions, i.e. we set the machine preci-
sion to $10^{-6}$, $10^{-7}$, $10^{-8}$ and $10^{-10}$. The higher the precision, the smaller the
difference between the left and the right variant. For reasonably big differences
between x1 and x2 , both sides of the equation yield the same value. For small
differences, the two sides however yield significantly different results. Which
variant (left or right) gives the more accurate one, and how can we show this?
What does this mean for larger calculations where x1 −x2 is only an intermediate
step?
$$x_1 -_M x_2 = (x_1 - x_2)(1 + \epsilon), \quad \text{but} \tag{8.1}$$
$$\frac{x_1 \cdot_M x_1 -_M x_2 \cdot_M x_2}{x_1 +_M x_2}
  = \frac{\bigl(x_1^2(1+\epsilon) - x_2^2(1+\epsilon)\bigr)(1+\epsilon)}{(x_1 + x_2)(1+\epsilon)}\,(1+\epsilon) \tag{8.2}$$
$$\qquad = \frac{(x_1^2 - x_2^2)(1+\epsilon)(1+\epsilon)}{(x_1 + x_2)(1+\epsilon)}\,(1+\epsilon)$$
$$\qquad \approx \frac{x_1^2 - x_2^2}{x_1 + x_2}\,(1 + 4\epsilon) \tag{8.3}$$
The analysis of the straightforward implementation of the difference in (8.1) is, well,
straightforward. The analysis of the alternative formulation (8.2) is more interesting:
In a first step, we mechanically insert our replacement rules. We then drop all higher
order terms (see the remarks in Sect. 6.3) and therefore also get rid of the division by
$1 + \epsilon$, before we finally once again rush into cancellation when we compute $x_1^2 - x_2^2$
in (8.3).
In our “sophisticated” rewrite, we obtain a factor $(1 + C\epsilon)$ with $C = 4$ that we multiply with the solution. This factor is bigger than the sole $\epsilon$ that we get in (8.1). However, we see that the “large” error term is scaled with $(x_1 + x_2)^{-1}$. That suggests
that the complex rephrasing of x1 − x2 is of value for very close variables x1 , x2 as
soon as |x1 + x2 | is sufficiently large. We have to pay for the additional computations,
but the error is damped. Otherwise, just writing down x1 − x2 is the method of choice.
Definition 8.1 (Absolute and relative error) Let f M (x) be the outcome of an im-
plementation of f (x) evaluated on a machine. f M (x) is subject to round-off errors.
The absolute error is e = f M (x) − f (x). The relative error is
$$\frac{e}{f(x)} = \frac{f_M(x) - f(x)}{f(x)} = \frac{f_M(x)}{f(x)} - 1.$$
This is an equivalent of Definition 4.1 with the exact outcome of a function as
baseline.
Three remarks
1. Most people, books and papers put absolute brackets around the error formulae.
They ensure that errors are always positive. Sometimes, it is however useful to
highlight whether an error means “too much” or “not enough”.
2. For f (x) → 0, the definition introduces a division by zero. In this case, you
might have to use L’Hôpital’s rule.
3. If you have f (x, y), you have to specify an error w.r.t. x and one w.r.t. y. They
can be qualitatively different.
Fig. 8.1 Schematic illustration of the evaluation of $y + x^2$, e.g.: A first function $f_M^{(1)}$ computes the square, but the result is slightly off the analytical value due to round-offs. This is a unary function. The slightly skewed result is used by $f_M^{(2)}$, which computes the sum of two values and again yields a slightly wrong result within an area around the real result (cf. error bar around the real result denoted as a point)
Stability analysis assumes correct input. It thus characterises to which degree imple-
mentations give the “(slightly) wrong answer to the correct question”. Unfortunately,
the definition of arithmetic stability is not the same in all books. Some differ quite
dramatically. Many maths books for example call an algorithm stable if
$$\frac{|f_M(x) - f(x_M)|}{|f(x_M)|} \le C_1 \epsilon \quad \text{for} \quad \frac{|x_M - x|}{|x|} \le C_2 \epsilon. \tag{8.4}$$
I wrote before that I usually use C as generic constant. Along these lines, I could
have used C twice instead of the C1 and C2 . However, I wanted to emphasise here
that the two Cs differ from each other. The right-hand side of (8.4) states that an exact
(correct) input x has to be an idealisation. On computers, x is subject to machine
precision/truncation effects (or other “flaws”) as well, and we thus have to deal with
a machine approximation x M . As long as this difference remains bounded, a stable
algorithm yields “almost the right answer to almost the right question”. This is the
statement behind the left inequality in (8.4). Equation (8.4) is more sophisticated
than our initial notion of stability as it accommodates deviations in the input data.
The less formal, hands-on characterisation of stability is sufficient for us.
More important than the philosophical question whether you can represent input
accurately or not is the fact that you will find books that don’t even speak of “stable”
but rather prefer statements along the line “one implementation is more stable than
the other one”. This is a convenient way to phrase things: There’s a set of problems
that a computer can solve accurately. We still have to formalise this set. Within the set,
there are codes that give the right answer and there are codes, i.e. implementations,
that don’t give the right answer—even though they all implement (symbolically) the
same thing. Some implementations run into rounding issues while others do not.
Between both extremes, there are codes that yield accurate results for some input
(Fig. 8.2).
Our notion of arithmetic stability takes the round-off error analysis and relates it to
the real outcome of a piece of code. We ask whether the outcome is accurate. In the
example, we see that
the rewritten variant scales its error with
$$\frac{1 + 4\epsilon}{x_1 + x_2},$$
which works in our favour if $x_1 + x_2 \gg 1 + 4\epsilon$. There is a threshold which determines which of the two
implementation variants is more stable (read “accurate”). Further to these accuracy
Fig. 8.2 Schematic illustration of problem/code classes: Among the problems we want to solve in
scientific computing, we are interested in those that we can solve accurately. This property is one
we still have to formalise. Within this category, there are implementations that are always stable
w.r.t. round-off errors and some that are never stable. In-between, there is a blurry region, i.e. most
codes are stable for some inputs and not for others. Among stable, unstable and in-between codes,
some implementations are fast—they exploit features such as vector units and have low computation
counts, e.g.—others are not
statements, our analysis demonstrates that neither variant is stable always: A simple
limit where we make x2 approach x1 shows
$$\lim_{x_2 \to x_1,\; x_2 \neq x_1} \frac{(x_1 -_M x_2) - (x_1 - x_2)}{x_1 - x_2} = \infty \quad \text{and}$$
$$\lim_{x_2 \to x_1,\; x_2 \neq x_1} \left( \frac{x_1 \cdot_M x_1 -_M x_2 \cdot_M x_2}{x_1 +_M x_2} - (x_1 - x_2) \right) \cdot (x_1 - x_2)^{-1} = \infty.$$
In our analysis, we can arbitrarily increase the relative error for both implementation
variants. On a machine, it is impossible to choose x1 and x2 arbitrarily close, i.e. there
is a natural upper bound on the absolute error. But it can be very high. A high error
might, in more complex code, feed into follow-up calculations, and that is where the
problems start.
Definition 8.3 (Loss of significance (again)) If the relative error increases faster than the absolute error, we basically lose all the bits in our floating point number that carry meaningful information.
We return to our model problem from Chap. 3 and ask the question whether the
algorithm proposed is vulnerable to round-off errors. For this, we abstract the problem
and study
$$p(t + dt) = p(t) + dt \cdot F\bigl(p(t)\bigr). \tag{8.5}$$
As we plug in our machine arithmetics into (8.5), our stability analysis accepts that
the time evolution is subject to machine errors. Before we start, we however make
one more assumption: We assume that the calculation of F( p(t)) is stable. If F were
not stable, it would not make any sense to continue to investigate the stability of the
overall code. Hence, let $F_M(p(t)) = F(p(t))(1 + C\epsilon)$, and we obtain
1. We still assume that p(t) is correct. Therefore, I use a subscript M for p M (t +dt),
but I use no subscript for p(t).
2. We assume that F’s computation is stable (as discussed above).
3. We assume that we have a small time step size dt. Twice, we thus kick out terms
scaled with $\epsilon \cdot dt$.
The value we start from, i.e. p(t), scales the impact of the round-off error. We can
consider p(t) to be a constant handed into the (first) time step computation, as it is
given from the outset.
The stability statement refers to one step, as we exploit that we consider p(0) to be
correct. Some people hence call it local stability. After one step only, we however
have a flawed p M (dt). This flaw propagates through subsequent iterations. It is hence
not valid to claim “one step is stable, therefore a sequence of steps is stable, too”. The
one step stability relies on a correct (in a rounding sense) input and this assumption
is already wrong for step two.
When we make a statement on the whole algorithm, we thus have to take the
number of steps into account and can only assume something on the initial condition
p(0). To determine the error after N time steps, we can derive the following two
equations following the argument from above:
$$p_M(0) = p(0) \text{ is exact, but}$$
$$p_M(N \cdot dt) = p_M\bigl((N-1)dt\bigr) + dt \cdot F\bigl(p_M((N-1)dt)\bigr) + \epsilon\, p_M\bigl((N-1)dt\bigr)$$
$$\qquad\;\;= p_M\bigl((N-1)dt\bigr)(1+\epsilon) + dt \cdot F\bigl(p_M((N-1)dt)\bigr).$$
This is an induction formula. We unfold it and end up with
$$p_M(N \cdot dt) = p_M\bigl((N-1)dt\bigr)(1+\epsilon) + dt \cdot F\bigl(p_M((N-1)dt)\bigr)$$
$$= p_M(0)(1+\epsilon)^N + \sum_{n=0}^{N-1} dt \cdot F\bigl(p_M(n \cdot dt)\bigr)$$
$$= p(0)(1 + N\epsilon) + \sum_{n=0}^{N-1} dt \cdot F\bigl(p_M(n \cdot dt)\bigr)$$
$$= p(0)(1 + N\epsilon) + \sum_{n=0}^{N-1} dt \cdot (1 + C\epsilon) \cdot F\bigl(p(n \cdot dt)\bigr),$$
where we, once again, assume that F does not yield a dramatically different output if
we plug in slightly skewed input. In this formula, the very first value, i.e. our initial
value, times the number of time steps scales the error. However, the error does not
explode. It grows linearly in the number of time steps.
For a given time interval from 0 to T , our code cuts the continuous time into
pieces. We have
$$N \sim \frac{T}{dt}.$$
Once we fix the observation interval (0, T ), we know the number of time steps N
that we need to step through time. There is a constant incorporating N such that the
growth of the relative error is bounded.
The analytical solution models the population of a species p(t). The species replicates
and dies. The higher the population count p the more children are born, i.e. the faster
p grows. This replication rate is calibrated through the parameter a. However, if the
species’ population exceeds a magic threshold introduced by the finite amount of
food, members of the population starve. The constant b in the model parameterises
this threshold. The above plot shows typical solution curves for different initial p
(population) values.
In the plot above, I run each simulation per initial p(0) value twice. Once, I select
60 valid digits in decimal precision. This is ridiculously high, so we can assume that
there is basically no arithmetic error. In the other case, I use only six valid decimal digits in the significand. I track the relative difference
$$\frac{\,p(t)|_{6\text{ digits}} - p(t)|_{60\text{ digits}}\,}{\max\bigl(p(t)|_{6\text{ digits}},\, p(t)|_{60\text{ digits}}\bigr)}$$
over time. Even though all three experiments converge towards the same value of
p(t) = 2 over time, we obtain qualitatively different curves for the relative differ-
ences.
For t < 2, all three studies are subject to rounding errors, but these cancel out.
The errors wobble around, yet stay close to zero. This confirms our statement that our
stability analysis plays through worst-case scenarios: In this regime, the rounding
error sometimes overshoots, sometimes yields something slightly too small, and
effectively these effects cancel out.
Around t ≈ 2, the wobbling amplitudes grow significantly and the rounding errors
start to accumulate. They do not cancel out over multiple time steps anymore. The bigger p(0), the earlier this happens: The initial value scales the global round-off error as it builds up, so with a larger p(0) we hit the threshold after which errors start to accumulate earlier. This is the worst-case regime we are afraid of: Each time step
adds yet another error, they all have the same sign, and we thus get a worse result
over time.
All differences are bounded for t > 4, i.e. in the long term the choice of p(0) seems not to make a difference. This does not come as a surprise, as the updates to p(t) in the time stepping algorithm all go to zero as p(t) approaches p(t) = 2 and the solution becomes stationary. We study this behaviour later under the rubric of attractive solutions (Sect. 12.2). For the time being, we ignore this and observe: Once a solution has approached the stationary value within 2 ± 0.0008 (which happens for t > 4), the right-hand side evaluation seems to yield zero, as the numerical approximation p_M(t) does not change anymore. The ±0.0008 is the effective machine precision for this calculation with our artificial machine with six valid decimal digits.
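A minimal sketch of such a blocked scalar product, assuming for simplicity that N is a multiple of the block size b (all names are illustrative):

```c
double result = 0.0;
for (int block = 0; block < N; block += b) {
  double partial = 0.0;                 /* small, local accumulation      */
  for (int n = block; n < block + b; n++) {
    partial += x[n] * y[n];             /* vectorisable inner loop        */
  }
  result += partial;                    /* add up the per-block results   */
}
```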
It is intuitively clear that this blocked version of a scalar product combines the
advantages of small relative accumulation errors with high vectorisation efficiency.
We preserve the logarithm in the error estimate though this logarithm is now subject
to a small factor b.
• Robert M. Corless and Leili R. Sevyeri: The Runge Example for Interpolation and
Wilkinson’s Examples for Rootfinding. SIAM Review. 62(1), pp. 231–243 (2020)
https://doi.org/10.1137/18M1181985
• Nicholas J. Higham: Accuracy and Stability of Numerical Algorithms. 2nd edition.
Society for Industrial and Applied Mathematics (2002)
• Lloyd N. Trefethen and David I. Bau: Numerical Linear Algebra. Society for
Industrial and Applied Mathematics (1997)
9 Vectorisation of the Model Problem

Abstract
With an understanding how computers handle floating-point calculations, we vec-
torise our model problem. For this, we first categorise different parallelisation pro-
gramming techniques, before we commit to a pragma-based approach, i.e. rely
on auto-vectorisation plus vectorisation hints to the compiler. Compilers can give
us feedback where they succeed to vectorise, which helps us when we introduce
OpenMP pragmas to our code. We wrap up the vectorisation topic with a discus-
sion of showstoppers to vectorisation, such as aliasing, reductions, or function
calls.
We have discussed so far how a computer handles floating point numbers and
how we can tune floating point operations. It is time to translate this knowledge
into working code. Before we start to do so, let’s brainstorm: How would we
like to write fast scientific codes if it were up to us to decide? Would we like
to have a totally new programming language fit for the job, or could the idea
behind SIMD fit somehow to C? How would we realise vectorisation and what
is the right metric to assess whether we have been successful? Will sole runtime
measurements be sufficient, or do we need more information to assess the quality
of our implementation?
Relying on the compiler to do the work for us seems to be lazy, but very often it
means that we have to revisit our code over and over again to analyse whether the way
we have written down an algorithm allows a compiler to translate it into a parallel
piece of code. Even though we do not directly “add parallelism” to our code base,
the parallelisation does not come for free. It manifests in work to validate if, why and
how a tool manages to create parallel code. On the pro side, as soon as the compiler
yields parallelised code we may assume that it has fewer bugs compared to a manual
parallelisation. We are also platform-independent.
Taking our existing code and adding hints to it is an alternative. We still rely
on the compiler or a tool to do the techi part of producing parallel code, but we
give the tool the domain-specific information explicitly such that it knows what to
do. This can be read as a compromise: We do not outsource all of the work to a
tool which eventually might miss out parallelisation potential, but we also do not
write parallel code ourselves. This means that we have lower development cost
compared to a rewrite, and we reduce the risk to introduce errors—though obviously
our instructions can make the tool do the wrong things. We also avoid the need to
have multiple source code branches if we want to keep the original, serial version.
Another option is a modification of our serial code using a library. We explicitly
call a library to make particular parts of our code run in parallel or to wrap our code
into parallelisation constructs. This is an explicit way how to express parallelism. We
take full ownership. It is also a radical way and requires significant programming.
While it might yield the best performance, it is a strategy that very quickly leads into
situations where our code is tailored towards one parallel computer, and we commit
to one library. If the library works only on some machines or its development is
discontinued, we have to rewrite.
Finally, we could completely rewrite our code in a new language such as CUDA.
This is a strategy we ignore here.
For vectorisation, all three “do not rewrite” variants are on the table if we start from
C code. Intel for example provides a set of function calls that we can invoke. They are
realised through some assembler code which triggers SSE/AVX instructions. These
functions are called intrinsics. We will not go down this route, as there are hundreds
of these intrinsic functions, they might change from chip generation to generation,
and code with intrinsics is not (always) portable to other chips by ARM, AMD or
IBM, e.g.
Luckily, compilers today vectorise themselves if we hand them over the right
code. We have discussed recipes how to write vector code in Sect. 7.2. As long as
we follow these guidelines, we can assume that the compiler does the job for us.
Taking into account that vectorisation is not a one-way street performance-wisely,
our first objective is to find out what the compiler actually does. After that, we
discuss ways how to explicitly tell a compiler to vectorise particular code parts in a
certain way. That is, we discuss how we can provide additional information to the
translator that helps or forces it to vectorise. For this, we rely on the OpenMP standard.
OpenMP (Open Multi-Processing) was originally designed for multicore processors,
but since its 4th generation it also supports SIMD. As it is platform-independent, this
is our method of choice. Even better, OpenMP’s 5th generation (actually already 4.x)
support GPGPUs. We have pointed out that major ideas behind GPUs are similar to
SIMD. OpenMP will later be able to yield portable vector codes that even fit to GPUs.
To find out whether vectorisation has been used, we ask the compiler to provide us
with feedback. For Intel's compiler, translate your code with the -qopt-report flag.
You obtain an optimisation feedback file with the extension .optrpt per source
code file or some reports on the terminal. It is a plain text file which tells you what
the compiler has done. The GNU tools have a similar option (-fopt-info), while
1 In recent versions, Intel have replaced their icc compiler with a new compiler product built upon
Clang/LLVM. These compilers are invoked via icx instead of icc. But Intel’s feedback is still
obtained via -qopt-report.
The report flags, for example, loops which have not been vectorised because they do not meet criterion 3 from our list. This is the output of one particular Intel tool generation.
Different tools give you different feedback, but the information published is not that
dissimilar.
Vectorisation steps
Here is a first guideline what we should do from a vectorisation point of view:
1. Any proper optimisation should be driven by profiling. First, we have to find out
where the majority of our time is spent. Through a profile (it might be reasonable
to profile for different input data), we get an idea where the CPUhrs are burnt.
2. Next, we use the optimisation report to identify hotspots that suffer from a lack
of vectorisation. With luck, the report tells us already some of the obstacles (such
as an embedded if that can not be masked).
3. We tune the code.
4. Before we run our tuned, i.e. optimised code, we recompile with the vectorisation
report once again to validate that the vectorisation has been successful. This step
also provides us with estimates how much faster the code should be.
5. Finally, we measure the optimisation's impact on the time-to-solution. Then, we
start all over again.
Using the compiler feedback all the way through is not just a nice gimmick. As scientific computing specialists, we should have a good understanding of what calculations are done where and which ones are vectorised. This distinguishes us from
trial-and-error coding monkeys.
2 I found on some systems that both Clang and LLVM tend to ignore the analysis request if you
launch it without -O3 (or another explicit optimisation instruction). So specify an optimisation
level explicitly to be on the safe side.
If the vectorisation report shows us that some parts of the code have not been
vectorised though we know it would benefit from vectorisation, it is time to enter
phase two of our vectorisation endeavour:
OpenMP syntax
Pragmas in C always start with the keyword #pragma. After that, we write which
pragma slang we employ. This is omp for OpenMP. Every pragma in this book starts
with
#pragma omp
After the omp identifier, we finally say what our pragma’s message is. This is the
key instruction. It is called a directive. It tells the compiler what to do and typically
consists of one or few words. Finally, most OpenMP instructions support further clauses that tailor their behaviour.
The SIMD pragma in OpenMP is a prescriptive annotation that always is the preamble
to a loop. The loop has to be in a canonical form.
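A minimal sketch (the array names are illustrative):

```c
#pragma omp simd
for (int i = 0; i < N; i++) {
  y[i] = x[i] + y[i];
}
```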
The example urges the compiler to vectorise this loop. Often, you will get away with
these three words. There are however a few further clauses that you can hand over
to OpenMP to tweak the vectorisation:
• SIMDisation can be read as loop unrolling. To allow the compiler to do the un-
rolling “trick”, it has to know how often it may unroll the loop. By adding
safelen(length), you give the compiler a hint which unroll factor is ab-
solutely safe to use. With this argument, you might get away with a loop range
which is not 100% in canonical form. As you manually specify safe loop unrolling
counts, the compiler does not have to try to understand some nuances of your code.
• A vectoriser has to assume that each individual vector lane has to calculate its own
local variables. You might know better, i.e. you might know that a variable within
a loop always holds the same value. In this case, tell the compiler about such a
local variable via uniform(myVar). Values then are computed only once.
• Vectorisation is particularly fast if you run through multiple arrays and all these
arrays are traversed continuously, i.e. one by one in line with the loop counter.
Whenever you know that your code runs through arrays in the same way, then you
can tell the compiler about this via linear(myVar). It means that you run over
the array myVar in exactly the same way as you run over the loop index within
your for-loop.
• You can add an if clause to a simd instruction. The clause accepts a boolean
expression. If the expression evaluates to true at runtime, the loop is vectorised. Otherwise, it is not. With if, the compiler produces two loop variants: a scalar one and a vectorised one.
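A small sketch combining two of these clauses (assuming a compiler that supports the OpenMP 5.0 if clause; names are illustrative):

```c
/* the compiler may use up to eight lanes; the loop is only vectorised
   at runtime if the trip count makes the effort worthwhile             */
#pragma omp simd safelen(8) if(N > 64)
for (int i = 0; i < N; i++) {
  y[i] = a * x[i] + y[i];
}
```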
Nested loops are a natural enemy to vectorisation. Whenever you loop over j and
then immediately over i and whenever the loop range of i is independent of j’s
range, you can permute the loops. Unfortunately, you might not know always which
permutation is the better one. You always run the risk that the iteration space of the inner
loop—and that’s the one vectorisation will tackle—is too small to yield efficient
vectorisation ranges, yields unfortunate memory accesses, or does something else
that is not clever. If you write
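, for instance, the following (the nested loops are only an illustrative sketch):

```c
#pragma omp simd collapse(2)
for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
    y[i*M + j] = a * x[i*M + j];  /* any independent loop body */
  }
}
```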
then you tell OpenMP to take the next two for-loops and to collapse them into one
big loop. For this, they have to be independent of each other. The pragma yields
one large, one-dimensional loop range and thus gives OpenMP more freedom to
vectorise.
Nested loops
Search for hints in the vectorisation feedback that tells you that a loop has not been
vectorised as it has not been an inner loop. Often, you might have one big loop that
has some vector operations inside. Many compilers work greedily: They vectorise the innermost loops and then tell you “sorry, have already vectorised a nested one”. They do the tiny little pieces but miss the big picture. Once you manually collapse
the loops, vectorisation can make a bigger difference.
Once we do the f (x, y) = − f (y, x)-“trick”, we effectively compute only the upper
half of this matrix and then wrap it over with a flipped sign. We exploit its anti-
symmetry. With SIMD, it is not clear anymore whether this triangularisation is clever.
If the particle count is reasonably small, we might be better off with removing our
trick. We go back to our rectangular iteration space, i.e. compute both the upper
and lower part of the force matrix, but now can maybe collapse the loops over the
particles.
9.2.3 Aliasing
A loop body that benefits from vectorisation has to consist of instructions which are
independent within the control flow graph: Two instructions can be vectorised, if they
are of the same type and if one doesn’t feed into the other. We can run instruction A
and B in parallel if A’s result is not required by B and vice versa.
For a sequence over operations on simple (scalar) variables, it is straightforward
for a compiler to extract the control flow graph from your code and, hence, to identify
operations that could run in parallel. If they are the same ones, we can vectorise.
Things become nasty for C arrays. Arrays in C are identified by pointers, i.e. variables
that identify a memory location. You can move these pointers around. Actually, the
statement x[i] adds i to the pointer x and then works with the result address. If
there are two arrays and accesses x[i] and y[j], the compiler is in trouble: It
has no (built-in) guarantee that the two pointers do not point to the same memory
location. In such a case, a compiler has to be pessimistic. It has to assume that two
pointers from different code locations or loop iterations point to the same address
and thus cannot run in parallel.
Definition 9.3 (Aliasing) Aliasing names the situation that two memory locations
can be accessed via different variables (pointers).
Aliasing in a queue
Aliasing is by no means an esoteric problem. Assume you have an array with n
elements which we call a queue. Let an operation remove the first element and then
move the remaining ones:
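One possible sketch (the helper name removeFirstElement is illustrative; bar is the routine the text below refers to):

```c
/* variant (a): shift the queue entries within one and the same array;
   the aliasing between source and destination is apparent               */
void removeFirstElement(double* queue, int n) {
  for (int i = 0; i < n - 1; i++) {
    queue[i] = queue[i + 1];
  }
}

/* variant (b): a generic move routine; the compiler has to assume that
   the regions behind x and y may overlap in memory                      */
void bar(double* x, double* y, int N) {
  for (int i = 0; i < N; i++) {
    x[i] = y[i];
  }
}
```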
Whether you shift the entries within the queue's own array yourself or call a generic routine such as bar to move data around, you encounter aliasing. In the former case, the aliasing is
apparent. In the latter case, we have to assume it can happen (Fig. 9.1). If we knew
however that bar is never invoked on two overlapping memory regions x through
x+N and y through y+N, we could vectorise the memory movements.
By definition, a compiler must not vectorise code where aliasing might occur. Oth-
erwise, it might break the code’s semantics. If you enable your compiler feedback,
it should tell you where aliasing prevents it from generating fast code. Often, these
will be false positives. You might know better.
In C, programmers have to tell the compiler explicitly when they are sure that no
aliasing happens. There are multiple ways to do so. First, individual compilers like
the ones from Intel or the GNU suite typically offer special pragmas or annotations to pass on this information. This is a tool-specific solution. Second, almost all compilers support the restrict keyword; it entered the C standard with C99, and C++ compilers typically provide it as an extension (e.g. __restrict__). To inform a compiler that no aliasing can happen, you can thus add a restrict after the pointer:
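A sketch for the routine bar from above (the placement of the qualifier after the * is the decisive part):

```c
/* the restrict qualifiers promise that the regions behind x and y never
   overlap, so the compiler may reorder and vectorise the copy            */
void bar(double* restrict x, double* restrict y, int N) {
  for (int i = 0; i < N; i++) {
    x[i] = y[i];
  }
}
```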
restrict qualifies the pointer, not the data it is pointing to. It means that no one else points into the region accessed through the qualified pointer (x or y above).
Finally, you can use OpenMP’s safelen qualifier (see remarks when we in-
troduced the simd pragma). By notifying OpenMP that it is safe to vectorise over
a given chunk within the iteration space, you implicitly inform the translator that
aliasing cannot happen.
Be aware of aliasing
Search for hints in the plain vectorisation feedback that tells you that no vector oper-
ations have been used as the compiler assumes that arrays (pointer regions) overlap
in memory. Work your way backward from here by inserting restrict keywords
or annotations or safelen clauses until the compiler succeeds in vectorising.
There is this myth in the scientific computing community that Fortran is faster than
C. When people claim this, they suggest that the C compilers do a poor job bench-
marked against their Fortran counterparts. Yet the algorithms are often the same, the
implementations strikingly similar, and the target machine is the same as well. One
of the reasons that often makes Fortran code indeed faster is aliasing: Fortran is, by design, a language in which the compiler may assume that the arguments passed to a routine do not alias each other, so it can vectorise aggressively where a C compiler has to be pessimistic.
9.2.4 Reductions
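Consider, as an illustrative sketch, a loop that accumulates all entries of an array x into a scalar y:

```c
double y = 0.0;
#pragma omp simd reduction(+:y)
for (int i = 0; i < N; i++) {
  y += x[i];
}
```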
At first glance on the control graph, this loop can not benefit from any vectorisa-
tion, as each loop iteration adds something to y. The snippet above tells OpenMP
explicitly that we reduce the sum over the xs into y via an addition. The compiler
now can introduce internal code rewrites—similar to our discussion—and vectorise
nevertheless.
The clause’s argument supports further basic operations: You can also use the
minus sign, the multiplication and logical and bit-wise and (& and &&), or (| and
||) and xor (ˆ). You can also reduce several parameters in one rush; just enlist them
separated by a comma. Finally, there is support for min and max.
9.2.5 Functions
Function calls within a loop prevent a compiler from vectorising this loop. A function call has to back up local variables from registers onto the call stack, has to create new local
variables on the stack, has to store a jump-back address, and so forth. If you however
inline a function, then you can get rid of all of this overhead and, consequently,
vectorise over functions, too.
Manual inlining is not very attractive for developers. It does not make your code
nicer. However, you can inform OpenMP that you intend to use a function within
a loop that you want to vectorise later on. The function then is prepared by the
compiler. We have to declare a function as SIMD-ready:
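A minimal sketch (the function scale is an illustrative name):

```c
/* asks the compiler to generate additional vectorised variants of scale() */
#pragma omp declare simd
double scale(double x, double a) {
  return a * x;
}
```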
Again, you can augment this declaration with a simdlen clause and thus hand over information on how many vector lanes the generated function variants should process per invocation. uniform is supported, too. The declaration affects only the subsequent function, so you might have to insert
multiple declarations into a file: one per function.
In principle, the OpenMP standard does not really care how the compiler reacts
to a simd declaration, i.e. how the declaration is translated into code. For many
mainstream compilers, you will however find that the compiler generates multiple
versions of the function for different vector widths. It creates a scalar version (that is
the one corresponding to a 1:1 translation), it creates one accepting a vector of two
entries, of four, and so forth. When it encounters a function call to a declared function,
it thus knows that vector variants of this function are available. Depending on the
context, it can select the appropriate one.
While these details are all hidden from the developer—you only declare the
function—you have to ensure that a SIMDised function is reasonably simple such that
the compiler manages to create the vector variants. Vectorising overly complex functions will fail.
The chapters of this Part III orbit around number crunching, i.e. the handling of vast
amounts of floating point calculations. We highlight two facets of such calculations:
From a numerical point of view, we discuss representation issues—how accurate can
we run calculations on our machines—while the speed of floating point calculations
is the other aspect of our studies.
We observe that modern machines are well-prepared to handle large calculations
efficiently as long as we keep two things in mind: Don’t mess up floating point calcu-
lations early throughout complex calculations—avoid cancellation or summing up
quantities of different magnitude, e.g.—and try to phrase calculations as sequences
of small vector calculations while you avoid branching. The two ingredients guaran-
tee that errors do not propagate through your calculations and do not blow up. They
also guarantee that the calculations are efficiently piped through the SIMD facilities.
While vector features (and the GPUs’ SIMT model) do give us quite a performance
boost, exploiting these units is only one way to make a code faster. Vendors today
increase the number of cores per node or GPGPU card, too. Our next step hence is to
study how to program multicore machines. At the same time, we have so far studied
all floating point issues with the assumption that all input data were exact. This is a
strong assumption: Input data are also represented by floating point numbers, stem
from measurements, or are man-made—very unlikely that they are exact. Besides
first multicore upscaling, we thus have to study the effect of input data variations.
There are two important tools that we need to master to achieve our next goals: On
the one hand, we estimate how much we actually could gain from a parallel machine.
It is one thing to acquire the skills to make a code exploit parallelism. It is another
thing to know a priori what might be achievable through which technique—and
eventually which technique is the right one to use in which case. On the other hand,
we return to our simulation code blueprint and start to argue where the formulae in
there come from and whether we can find a (proper) solution at all. This will help
us to understand how much of the error that we get is inherent to the problem, and
what errors are machine-made and maybe avoidable. Parallelisation and solution plus
algorithm properties are totally different things. We therefore treat them in separate
manuscript parts from here on.
Part IV
10 Conditioning and Well-Posedness

Abstract
Hadamard’s definition of well-posedness makes us start to think if a particular
problem is solvable on a computer at all. We introduce the terms well-conditioned
and ill-conditioned, and finally define the condition number to quantify how sen-
sitive an algorithm is to slight changes of its input. After a brief excursus what the
condition number means for linear equation systems—it turns out to be a special
case of our more generic definition—we discuss the interplay of the conditioning
with forward stability and introduce backward stability as evaluation condition
number.
Take a pen into your hand. Your challenge is to bring this pen into a stationary
vertical position, i.e. it should not move. Also, you should hold it on one end
only.
There are two solutions to this problem: We can hold the pen at the top and let
it dangle. Or we can balance it on our palm to keep it upright. These are the two
valid, stationary configurations for the pen. The dangling thing is easy, but we
know that balancing it on our palm is tricky. We permanently have to move our
hand to keep the pen in position. Once we stop these correcting movements, the
pen will fall.
Assume that you put a thermometer into your oven when you bake pizza, and the
thermometer shows 220 ◦ C when you switch it off. There is an equation that describes
how the oven gradually cools down until it has reached the temperature of your
kitchen. Once this equation is written down, we can solve it, i.e. there is a unique
solution. The solution even will be relatively insensitive: If our thermometer gets the
temperature slightly wrong—for example 222 ◦ C instead of 220 ◦ C—the temperature
decrease will not change dramatically. It will not be massively different after an hour
from the “right” setup with 220 ◦ C.
Assume you do not know the initial temperature. You however know that the
temperature after an hour equals the room temperature. Is it possible to say what the
original temperature had been? The answer is no. There are many potential solutions.
The oven could have been at room temperature (as we forgot to switch it on). It could
have been 220 ◦ C and then we switched it off. Or it continued to heat at 180 ◦ C for
a while, we then switched it off, switched it quickly on again at 240 ◦ C and then
eventually off. There are infinitely many potential solutions.
It is important to formalise the difference between the two challenges. The first
one can be solved by a computer. The second one is tricky, as any solution dumped
by a computer could be right or could be totally wrong.
Hadamard calls a problem well-posed if
1. it has a solution,
2. this solution is unique, and
3. the solution depends continuously on the input data.
Well-posedness is a property that is inherent to the problem setup—we have not yet
started to speak about an algorithm. However, we know from the previous sessions
that computers make tiny errors all the time. Even worse, we may assume that even
our input data are slightly wrong all the time. Consequently, the well-posedness of
a problem is extremely relevant for the development of computer codes. We are
doomed if we try to make our computer solve an ill-posed problem.1 But even once
we know that a problem is properly set (aka well-posed), we still have to understand
how sensitive it is to variations of the input.
Orbit of a satellite
A stable orbit, i.e. an orbit where the satellite neither flies into the void nor crashes
into the planet, is very sensitive to the input data: There is a solution, and this solution
is unique given the satellite’s original position.
However, once we fix its initial position, only one specific initial velocity ensures
that the satellite remains on the stationary orbit. If we slightly alter this one, it follows
a completely different trajectory which is nowhere close to the original one. Let the
position of a satellite be known for time t = 0. Let’s furthermore assume that we
know the exact velocity that we have to choose to keep the satellite on a stationary
orbit, i.e. on a trajectory around the planet with constant radius. A slight reduction
of the radius (let’s slightly push the satellite towards the planet) or a decrease of
the velocity sends the rotating satellite onto a spiral trajectory. It eventually crashes
into the planet. If we stay on the stationary orbit but slightly increase the satellite’s
rotation speed, it will fly away.
An analogous argument holds for the initial position.
1 As always, there are people developing solutions to ill-posed problems. But that is out of our scope here.
The formalism ignores higher order terms. It expresses the impact of a change in
x on the outcome of f through a first-order Taylor expansion.2 This allows us to
compute the ratio of the relative change in the image to the relative wobbling:
$$\text{cond}_x f = \frac{\text{relative change of output}}{\text{relative change of input}}
  = \frac{\;\frac{f(x+\delta) - f(x)}{f(x)}\;}{\;\frac{\delta}{x}\;}
  \approx \frac{\bigl(f(x) + \delta \cdot f'(x) - f(x)\bigr)\cdot x}{f(x)\cdot \delta}
  = \frac{x \cdot f'(x)}{f(x)}.$$
We speak about small changes δ above. To be precise what small means, we normalise
it. For x = 1, δ = 0.1 is quite significant. For $x = 10^5$, δ = 0.1 is tiny. Dividing by
x makes us recognise this fact, though we have to be careful for x ≈ 0.
The actual function change due to this modification of x enters the numerator.
Again, we are interested in the relative change, i.e. we compare f (x + δ) − f (x) to
f (x). Overall, we put the relative change of the function outcome in relation to the
relative change of the input data. As we speak, once more, about small modifications
to x, we finally approximate the change in f (x) by Taylor expansion.
2 We revise Taylor expansion in Chap. 11. For the time being, it is sufficient to accept that $f(x+\delta) \approx f(x) + \delta \cdot f'(x)$ is a reasonable assumption here since δ is small.
Definition 10.3 (Condition number (Definition #1)) The condition number of a func-
tion f is
$$\kappa = \left| \frac{x \cdot f'(x)}{f(x)} \right|.$$
κ is traditionally the symbol used for the condition number, and the derivative in the formula above is the derivative w.r.t. the quantity of interest, i.e. $f'(x) = \frac{d}{dx} f(x)$.
If f depends on many different input variables, we have to study the conditioning
for all of these variables one by one, before we make a worst-case statement on the
overall condition number. That is, problems can be ill-conditioned w.r.t. particular
variables only.
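As a quick worked example, we can apply this definition to the difference $f(x_1, x_2) = x_1 - x_2$ from the beginning of Chap. 8 and study its conditioning w.r.t. $x_1$:
$$\kappa_{x_1} = \left| \frac{x_1 \cdot \partial_{x_1} f(x_1, x_2)}{f(x_1, x_2)} \right| = \frac{|x_1|}{|x_1 - x_2|},$$
which blows up as $x_2$ approaches $x_1$: the cancellation that we studied before is not only arithmetically delicate, it is an ill-conditioned operation no matter how cleverly we implement it.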
Choice of norms
If we write down the condition number’s definition over a scalar function f (x)
accepting a scalar input x, we use an absolute value. If we work with vector inputs or
vector outputs, we need appropriate vector norms. Therefore, there might be different
notions of condition (condition numbers) for different vector norms. We usually use
the standard Euclidean norm.
The implication in the definition above illustrates that a small relative perturbation δ
of the input data which is bounded by ε does not arbitrarily mess up the output data.
The rule emphasises that we always work with “wrong” data in our simulations, as
we work on a computer with finite precisions. Therefore, also our input datum has
only a certain number of valid bits. In the best case, this number is given by the IEEE
standard, but there might be way fewer bits if you work with data from measurements,
e.g. The κ tells us how many of these valid bits are lost due to our calculations. This
statement still might be too optimistic if we pick an unstable algorithm.
While the condition number quantifies how inaccurate our result intrinsically
might become—intrinsically means no matter how clever our algorithm and im-
plementation is—a problem with a big κ still might be well-posed according to
Hadamard. It is only for κ → ∞ that we enter a regime where a small change in the
input data triggers a discontinuous change in the outcome. In this case, we violate the
third criterion of Hadamard’s definition. Overall, we observe that there are well-posed
and ill-posed problems. Within the well-posed problems, there are well-conditioned
and ill-conditioned ones. The transition is blurry. On the well-conditioned side, there
are algorithms which are stable—so far we know only stability w.r.t. the round-off
error though we’ll introduce a second notion of stability soon—and then there are al-
gorithms/implementations which are not stable. Among the stable algorithms, there
are efficient and not that efficient implementations (Fig. 10.1).
With all of this in mind, we can sketch what a programmer has to take into account
when she starts to write a simulation code.
Whenever you work with linear equation systems, a lot of literature relies on a
specialisation of the condition number. While not adding anything substantially new,
it is good to have this specialisation in mind (Fig. 10.2), as linear equation systems
are omnipresent.
With linear equation systems, your y = f (x) solves Ay = x. You are interested
in the value y = A−1 x. The statement assumes that A can be inverted. This allows
us to characterise the κ: A (square) matrix A takes a vector and maps it onto another
vector. This mapping can be a rotation, stretching, mirroring, and so forth. If a vector’s
direction remains preserved, we call it an eigenvector v with an eigenvalue λ:
Av = λv.
Given a diagonalisable matrix A, we can decompose any input vector x into a linear
combination of the eigenvectors: x = a1·v1 + a2·v2 + ⋯. The vi are the eigenvectors,
and the ai are scalar weights. This yields a decomposition into vector components for
which we know which ones are particularly amplified: Ax = A(a1·v1 + a2·v2 + ⋯) =
a1·λ1·v1 + a2·λ2·v2 + ⋯. The weights ai which correspond to large eigenvalues λi make
a significant contribution to the outcome. The weights corresponding to very small
eigenvalues do not.
We are not really interested in Ax, but we exploit that λi(A⁻¹) = 1/λi(A): If we
know the eigenvalues of A, we know the eigenvalues of A⁻¹, too. Returning to our
condition number discussion where we want to solve y = A⁻¹x, the worst case
becomes apparent: We could pick an input x which is very close to the eigenvector of
A⁻¹ with the smallest eigenvalue, i.e. the eigenvector of A with the largest eigenvalue.
Then, we wobble around x by a delta δ which is close to the eigenvector of A⁻¹ with the
largest eigenvalue. We pick our ε such that |δ|/|x| ≤ ε. For this particular
input, let's quantify the calibrated output |f(δ)|/|f(x)| using the eigenvalues from
A:
$$\frac{|f(\delta)|}{|f(x)|} = \frac{|A^{-1}\delta|}{|A^{-1}x|} = \frac{\lambda_{\max}(A^{-1})\,|\delta|}{\lambda_{\min}(A^{-1})\,|x|} = \frac{\lambda_{\max}(A)\,|\delta|}{\lambda_{\min}(A)\,|x|} \quad\text{and therefore}\quad \kappa = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}.$$
Inverting A is what our algorithm for f(x) is responsible for. We do not know A⁻¹.
However, the estimate above requires only statements on the eigenvalues of A. For many
problems, there are good a priori estimates of λmax(A) and λmin(A), so we can provide
an estimate for κ.
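For a small symmetric positive definite matrix, we can write the eigenvalues down in closed form and read off κ directly. The sketch below uses an example matrix of my own choosing; for the matrix [[2,1],[1,2]] the eigenvalues are 3 and 1, so κ = 3.

```cpp
#include <cmath>
#include <iostream>

// Eigenvalues of a symmetric 2x2 matrix [[a,b],[b,c]]:
// lambda_{1,2} = (a+c)/2 +- sqrt( ((a-c)/2)^2 + b^2 ).
int main() {
  const double a = 2.0, b = 1.0, c = 2.0;   // example matrix, chosen arbitrarily

  const double mean   = 0.5 * (a + c);
  const double radius = std::sqrt(0.25 * (a - c) * (a - c) + b * b);
  const double lambdaMax = mean + radius;
  const double lambdaMin = mean - radius;

  std::cout << "lambda_max=" << lambdaMax
            << ", lambda_min=" << lambdaMin
            << ", kappa=" << lambdaMax / lambdaMin << "\n";   // prints 3
  return 0;
}
```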
Our introduction of the condition number and stability carefully distinguishes the
two things. Stability is about an implementation and conditioning about the problem
itself. Condition number and round-off stability however have something to do with
each other. When we introduce the term forward stability in Chap. 8, we emphasise
that we work with exact input data. It is the inexact arithmetics within the machine
that nevertheless pollute the outcome, and we end up with something which is not
precisely what we wanted to compute.
10.3 Backward Stability: Arithmetic Stability Revisited
This is one way to read the error introduced by the machine. Backward stability
is another way, where we interpret numerical errors throughout the calculations as
modifications of the input data. It goes back to something some books call Wilkinson's
principle. We ask ourselves whether we can read the outcome of a computation as
the exact solution for different input data: We obtain something slightly off. Yet, this
is the exact solution for a slightly different problem!
With this interpretation, the question is whether we have solved a reasonably
close alternative problem, or whether we’ve solved something completely different
and unrelated. As long as the round-off errors have made us solve a problem that’s
close to the problem we originally wanted to solve, our algorithm is (backward)
stable. Otherwise, it is not. The name backward stable comes from the fact that we
relate the outcome to where we come from, i.e. we look backwards.
Definition 10.5 (Backward stable) Let f M (x) be the outcome of an exact input x
computed through f with machine precision. That is, f M suffers from some round-
off error. Determine x̂ that would yield f M in exact arithmetics. An algorithm is
backward stable iff
$$\exists\, C: \ \forall x \text{ with } f_M(x) = f(\hat{x}): \quad \frac{|\hat{x} - x|}{|x|} \leq C.$$
With the backward stability analysis, we search for a magic constant C such that—no
matter which input x we select—we can rewrite the outcome as the exact solution
for a perturbed x̂ and this x̂ is not too far away from x. Too far means C times the
machine precision.
Square root
Let's compute f(2) = √2, where 2 = 1.0·2¹|₂ (with an unbiased exponent). The
result is f(2) = 1.414213… = 1.0110101000001001…|₂. If we work with
half precision and if we assume that our square root implementation (which is
With this backward idea, we still break down an algorithm into a sequence of ele-
mentary steps, i.e. unary and binary calculations. Instead of accepting that each step
introduces some error, we assume that each step computes exactly the right result
yet works on some altered data. This data is off by some value.
We interpret the function evaluation f M as precise, i.e. not polluted. It has however
been given a slightly biased input which is off from the real intermediate result by
x̃ − x. This offset models the truncation error within the code of f M as a priori
alteration of the input fed into f .
The algorithm is backward stable if this “being off” is not amplified too much—but
that’s our definition of conditioning: Let an algorithm be a sequence of functions.
Each function has a condition number. If a sequence of function evaluations does not
yield a condition number that’s unbounded, then the initial error is not amplified too
much. If only one step is ill-conditioned, then the overall algorithm is ill-conditioned
and there is no stable implementation. For this reason, some scientists avoid the
term backward stable and instead speak of the evaluation condition number: It is not
about the condition number of the problem but about the condition number of the
individual evaluation steps that we have chosen when we wrote down the algorithm.
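A tiny experiment illustrates the backward reading for a single addition (the setup and the numbers are mine): the machine returns a rounded sum in single precision, and we reinterpret it as the exact sum of inputs that have both been perturbed by the same relative δ. That δ stays in the order of the machine precision, so this one operation is backward stable.

```cpp
#include <cmath>
#include <iostream>
#include <limits>

int main() {
  const float a = 0.1f;
  const float b = 1000.0f;

  // The machine returns a rounded sum in single precision.
  const float sumMachine = a + b;

  // Backward reading: sumMachine is the *exact* sum of the perturbed inputs
  // a*(1+delta) and b*(1+delta). We recover delta in double precision.
  const double exactSum = static_cast<double>(a) + static_cast<double>(b);
  const double delta    = (static_cast<double>(sumMachine) - exactSum) / exactSum;

  std::cout << "relative input perturbation |delta| = " << std::abs(delta) << "\n"
            << "machine epsilon (float)             = "
            << std::numeric_limits<float>::epsilon() << "\n";
  return 0;
}
```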
• Robert M. Corless and Leili R. Sevyeri: The Runge Example for Interpolation and
Wilkinson’s Examples for Rootfinding. SIAM Review. 62(1), pp. 231–243 (2020)
https://doi.org/10.1137/18M1181985
• Felix Kwok, Martin J. Gander, Walter Gander: Scientific Computing—An Intro-
duction Using Maple and MATLAB. Springer (2014)
• Nick Higham: What Is a Condition Number? “What Is” Series. https://nhigham.
com/2020/03/19/what-is-a-condition-number (2020)
• Nick Higham: What Is Backward Error? “What Is” Series. https://nhigham.com/
2020/03/25/what-is-backward-error (2020)
11 Taylor Expansion
Abstract
We revise the Taylor series and Taylor expansion. The text’s intention is to remind
us of this important tool, and to refresh some important technical skills such
as Taylor for functions with multiple arguments. We reiterate the smoothness
assumptions we implicitly employ. This chapter revisits Taylor expansion in a
very hands-on fashion. Feel free to skip the whole chapter if you don’t have to
refresh your knowledge.
Imagine you are driving with your car along a road through a national park.
You know that you are 100 m above sea level. Where will you be in around five
minutes' time? Assume the road ahead of you is completely dark. Without further
information, the best answer you can give is “guess we will still be at 100 m”.
Let’s assume you are at 100 m at the moment, but the road is ascending. You are
climbing 5 m per minute. A more sophisticated answer now is “I guess we’ll be
at 125 m in 5 min from now”. You take the knowledge about the current climb
into account.
Let’s finally assume that we are at 100 m, the road is ascending, but we see that
the slope starts to decrease. There’s actually a plateau in front of you. Without
further information, you might assume that the road continues to descend after
the plateau and eventually leads right into the sea. What does a sophisticated
prediction now sound like?
Whenever we want to argue about some functions, we first have to make a choice of
how to write down these functions. One of the simplest ways is to write them down
as polynomials. In this case, functions look similar to
f (x) = a1 · x, or
f (x) = a2 · x 2 + a0 .
This type of ansatz is also called power series.
Definition 11.2 (Power series) If we work with polynomials, our shape functions
are φi (x) = x i , i.e. they are from the set {1, x, x 2 , x 3 , . . .}. The functions φi form
a basis, as no two φi s can be combined into a third one. The (infinitely long) linear
combination of these shape functions yields a power series.
For the Taylor series¹, we assume that our functions look like $f(x) = \sum_{i\geq 0} a_i x^i$.
Pick one of our points. Let this one be the x0 with f (x0 ). The best approximation,
i.e. the best function that runs through this point is f(x) = f(x0). We set a0 = f(x0),
and set all the other ai to zero. This is a constant function, and the best we can do for only
one single sample point. We have no further data.
Next, we use f(x1), too. Unless f(x1) = f(x0), which would be a coincidence,
our initial fit f(x) = a0 = f(x0) is wrong for x = x1. So we have to use one more
function from our basis:
f (x) = a0 + a1 x.
Once we insert x1 and the known a0 , we obtain a line which is a good approximation
for two given points. We continue with the other points.
The above process is called interpolation as we use a set of (simple) basis functions
to successively approximate a function. Due to the fact that we know our basis
functions, we get a feeling what the solution in-between given sample points looks
like. We interpolate.
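The first two steps of this recipe are easy to write out. The following sketch uses two sample points of my own and centres the first one at x0 = 0, so that a0 is literally the first sample value and a1 is the slope needed to hit the second sample, too.

```cpp
#include <iostream>

// Interpolate through two sample points with the monomial basis {1, x}.
// We follow the incremental recipe from the text and centre the first
// sample at x0=0, so a0 is simply the first sample value.
int main() {
  const double x0 = 0.0, f0 = 2.0;   // first sample (chosen arbitrarily)
  const double x1 = 1.0, f1 = 3.5;   // second sample

  const double a0 = f0;                    // constant fit through (x0, f0)
  const double a1 = (f1 - a0) / (x1 - x0); // slope such that (x1, f1) is met, too

  // Evaluate the interpolant in-between the samples.
  for (double x = 0.0; x <= 1.0; x += 0.25) {
    std::cout << "p(" << x << ") = " << a0 + a1 * x << "\n";
  }
  return 0;
}
```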
1 Asthis text is written at Durham University, it is only fair to emphasise that Taylor’s mother was
from Durham, i.e. there is Durham heritage in these formulas.
Unless we are lucky and hit multiple sample points “prematurely” (if all points
have the same value, picking a0 right makes us meet all of them in one go), we need as
many basis functions as we have points. If we have more points than basis functions—
as we stick to a constant number of them, e.g.—we have an overdetermined system,
i.e. we have to find weights such that f(x) is approximated as well as possible.
At the same time, we have to accept that we will not be able to match all given
sample points. If we have more shape functions than sample points, then we have an
underdetermined system. Many ai combinations then interpolate all sample points.
We have to come up with a plan for which combinations are reasonable ones.
Machine learning
Interpolation is the backbone of machine learning: Each layer of a neural network
yields a large linear combination of simple, elementary (shape) functions. The total
function is a linear combination of these elementary ones. The “training” of a machine
learning model is finding proper weights ai .
11.1 Taylor
Definition 11.3 (Taylor series) The Taylor series for a function f (x) is a power
series where the ai s stem from the derivatives of f (x).
With this formalism, we can fiddle in weights ai in the following way: Let’s assume
that we know a function in the point x = 0, i.e. we know f (0). Then, we automatically
know a0 = f (0) in (11.1). Let’s assume that we furthermore know something about
∂x f (x = 0). We do not know what the function overall looks like, but we know its
value in x = 0 plus its derivative. In this case, we can plug our match a1 = ∂x f(x = 0)
into the formula above.
Definition 11.4 (Taylor expansion) The Taylor expansion takes the function f (x)
and approximates it with a constant function. It then expands this approximation
further and further by adding more basis functions from the power series. The cor-
responding weights are given by the derivatives of the function f (x) that is ap-
proximated. The (infinite) result of the Taylor expansion is the Taylor series for the
function f (x).
Taylor takes a function value plus an arbitrary set of derivatives in one point, and
it provides us with an idea what the function around this point looks like. If we are
given only a point, we assume the environment is represented by a constant function.
If we are given with a point plus its derivative, we end up with a linear extrapolation.
If we are given a point, its derivative and a second derivative, we obtain a parabola.
The Taylor series for a function around a point x equals
$$f(x+dx) = f(x) + dx\cdot\partial_x f(x) + \frac{dx^2}{2}\cdot\partial_x\partial_x f(x) + \frac{dx^3}{3\cdot 2}\cdot\partial_x\partial_x\partial_x f(x) + \ldots = \sum_{i\geq 0}\frac{dx^i}{i!}\,\partial_x^{(i)} f(x).$$
This is a formula that we will need over and over again throughout this text. Its
derivation is a straightforward result of the Taylor expansion, i.e. we repeat the steps
from (11.1) centred around x. The point f (x) is fixed, and we study the function in
a d x-environment around x. Though we prefer the arrangement above, a reordering
highlights the connection to a power series:
$$f(x+dx) = \underbrace{f(x)}_{:=a_0} + \underbrace{\partial_x f(x)}_{:=a_1}\cdot dx + \underbrace{\frac{\partial_x\partial_x f(x)}{1\cdot 2}}_{:=a_2}\cdot dx^2 + \underbrace{\frac{\partial_x\partial_x\partial_x f(x)}{1\cdot 2\cdot 3}}_{:=a_3}\cdot dx^3 + \ldots$$
Notation remark
The convention 0! = 1 for the factorial is what enables us to write down the Taylor
series in its compact form. That's what a good convention looks like!
There are three important assumptions tied to the Taylor expansion and the way we
use it:
1. We assume that the function is smooth, i.e. we assume that all the derivatives
∂x^(n) f(x) do exist. No ∂x^(n) f(x) jumps. Otherwise, the ∂x^(n+1) f(x) would
not be there.
2. We assume that the sum over all the derivatives remains finite. In particular, we
assume that the f (x) does not run into ±∞. This implies that the ai very quickly
become very small.
3. We assume that the ai , once fitted, are constant: We fit them around x through
the derivatives of f (x) and then keep them invariant. The development of the
function around f (x) thus depends solely on the distance.
Since the magnitudes of the ai s decrease, we are on the safe side if we cut off the
Taylor series: The "higher i" terms are scaled with dxⁱ, their weights ai are fixed,
and the ∂x^(i) f(x) are bounded. Due to the dxⁱ scaling, we can cut off. As we however
assume that the ai are constant once we have fitted them to the derivatives of f(x) at
a point x, we should only make statements about f(x) in a small dx-area around x.
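A quick numerical experiment shows how the cut-off behaves. The toy example below is my own: we expand f(x) = eˣ around x = 1—every derivative of eˣ equals eˣ, so the coefficients are trivial—and compare truncations of increasing order against std::exp for a small and a not-so-small dx.

```cpp
#include <cmath>
#include <iostream>

// Truncated Taylor expansion of exp around x: sum_{i=0..order} dx^i/i! * exp(x).
double taylorExp(double x, double dx, int order) {
  double term = std::exp(x);  // i=0 term; every derivative of exp(x) is exp(x)
  double sum  = term;
  for (int i = 1; i <= order; i++) {
    term *= dx / i;           // builds up dx^i / i! * exp(x) incrementally
    sum  += term;
  }
  return sum;
}

int main() {
  const double x = 1.0;
  for (double dx : {0.1, 1.0}) {
    const double exact = std::exp(x + dx);
    for (int order : {1, 2, 4, 8}) {
      std::cout << "dx=" << dx << ", order=" << order
                << ", error=" << std::abs(taylorExp(x, dx, order) - exact) << "\n";
    }
  }
  return 0;
}
```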
11.2 Functions with Multiple Arguments
We have so far used Taylor as a one-directional thing. If we have a function f(x, y),
we can use Taylor for either coordinate. That is, we can either develop Taylor for
x or for y. It is a little bit like the Manhattan distance or an old 2d computer game
where you can either go left/right or up/down.
Once we have committed to one direction and have obtained the Taylor expression,
we can next run in the other direction. Let's assume we have committed to the upper
row first. Once the Taylor over x is written down, we take (11.2) and apply the
expansion again for dy. The terms on the right-hand side in the first line of (11.2)
expand to
$$f(x+dx, y+dy) = f(x,y) + dy\cdot\partial_y f(x,y) + \ldots + dx\cdot\partial_x f(x,y) + dx\,dy\cdot\partial_x\partial_y f(x,y) + \ldots$$
It is convenient to revisit our notation. Let's write down f(x, y) as a function with
only one parameter
$$f\begin{pmatrix} x \\ y \end{pmatrix},$$
which is a vector. Hence, we study
$$\begin{aligned}
f\begin{pmatrix} x+dx \\ y+dy \end{pmatrix}
&= f\begin{pmatrix} x \\ y+dy \end{pmatrix} + dx\cdot\partial_x f\begin{pmatrix} x \\ y+dy \end{pmatrix} + \ldots \\
&= f\begin{pmatrix} x \\ y \end{pmatrix} + dy\cdot\partial_y f\begin{pmatrix} x \\ y \end{pmatrix} + \ldots + dx\cdot\partial_x f\begin{pmatrix} x \\ y+dy \end{pmatrix} + \ldots && \text{roll out left term}\\
&= f\begin{pmatrix} x \\ y \end{pmatrix} + dy\cdot\partial_y f\begin{pmatrix} x \\ y \end{pmatrix} + dx\cdot\left(\partial_x f\begin{pmatrix} x \\ y \end{pmatrix} + dy\cdot\partial_y\partial_x f\begin{pmatrix} x \\ y \end{pmatrix} + \ldots\right) + \ldots\\
&= f\begin{pmatrix} x \\ y \end{pmatrix} + dy\cdot\partial_y f\begin{pmatrix} x \\ y \end{pmatrix} + dx\cdot\partial_x f\begin{pmatrix} x \\ y \end{pmatrix} + dx\,dy\cdot\partial_x\partial_y\ldots\\
&= f\begin{pmatrix} x \\ y \end{pmatrix} + \underbrace{\begin{pmatrix} dx \\ dy \end{pmatrix}\cdot\underbrace{\begin{pmatrix}\partial_x f((x,y)^T)\\ \partial_y f((x,y)^T)\end{pmatrix}}_{=:\nabla f((x,y)^T)}}_{\text{scalar product}} + \ldots
\end{aligned}$$
The vector Taylor expansion thus can be written down just like the normal Taylor
expansion: The scalar step size becomes a vector. The first derivative is a vector of
the partial derivatives called the gradient, and we multiply it with the step vector via
a scalar product. The second derivative would then be a matrix of second derivatives
multiplied with the step vector twice such that it “collapses” into a scalar.
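As a minimal illustration of this vector notation (the function below is my own choice, not one from the text), we predict f(x+dx, y+dy) for f(x, y) = x²·y from f and its gradient only, i.e. we keep just the first-order terms of the expansion.

```cpp
#include <cmath>
#include <iostream>

// Toy function f(x,y) = x^2 * y and its two partial derivatives (my own example).
double f(double x, double y)    { return x * x * y; }
double dfdx(double x, double y) { return 2.0 * x * y; }
double dfdy(double x, double y) { return x * x; }

int main() {
  const double x = 1.0, y = 2.0;
  const double dx = 0.01, dy = -0.02;

  // First-order (gradient) Taylor prediction: f + grad f . (dx,dy)^T
  const double prediction = f(x, y) + dx * dfdx(x, y) + dy * dfdy(x, y);
  const double exact      = f(x + dx, y + dy);

  std::cout << "prediction=" << prediction
            << ", exact=" << exact
            << ", error=" << std::abs(prediction - exact) << "\n";
  return 0;
}
```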
We intuitively use Taylor for our planet model in Chap. 3, where we assume that
we can represent the planet position with a Taylor series. As the velocity per object
is stored, it determines (fixes) our a1. Then, we throw away all the terms ai with
i ≥ 3, i.e. we set them to zero. This leaves us with a2 into which we plug our force
equations. The example illustrates that we only should use small time step sizes, as
the velocity per object is not constant but also subject to the forces. Fixing it thus is
an approximation.
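Written out as code, such a truncated Taylor update of a particle position looks roughly as follows. This is only a sketch in the spirit of the Chap. 3 discussion; the variable names and the force term are placeholders of mine, not the book's actual code.

```cpp
#include <iostream>

int main() {
  // One particle in 1d: position, velocity, and an acceleration from the forces.
  double x = 0.0;
  double v = 1.0;
  const double dt = 0.01;

  for (int step = 0; step < 1000; step++) {
    const double a = -x;  // placeholder force/mass term (here: a simple spring)

    // Truncated Taylor series of x(t+dt): keep a0=x, a1=v, a2=a/2 and drop the rest.
    x += dt * v + 0.5 * dt * dt * a;
    // The velocity itself follows the acceleration (first-order Taylor).
    v += dt * a;
  }
  std::cout << "x(T)=" << x << ", v(T)=" << v << "\n";
  return 0;
}
```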
If we want to find out how a function reacts to tiny variations around a value,
i.e. how it reacts to pollution of the input values or computational errors, we can
recast the behaviour around the value into a Taylor series. We assume that the |ai |s
decrease rapidly and throw all the stuff away that is associated with higher order
derivatives. Consequently, we end up with a description of the function’s behaviour
around a value which is shaped only by the function value plus its first derivative.
This is exactly the mindset we have employed for our condition number definition.
Revision Material
12 Ordinary Differential Equations
Abstract
There are a lot of definitions, terminology and theory around ordinary differential
equations (ODEs). We introduce the terminology and properties that we need for
our programming as well as the knowledge that we can exploit when we debug and
assess codes. With the skills to rewrite any ODE into a system of first-order ODEs
and the Taylor expansion, we finally have the mathematical tools to derive our
explicit Euler formally. This chapter revisits some very basic properties of ordinary
differential equations (ODEs). Feel free to skip the majority of this chapter and
jump straight into Sect. 12.3, where we bring our ODE knowledge and the ideas
behind Taylor together.
For those who fancy a revision, we start once more with a pen ’n paper exper-
iment: Let there be a radioactive substance which is quantified by r (t), i.e. it
is given as a value r over time. Assume we have 16 nuclei initially. That is
r (t = 0) = 16. After 1s (I omit the units from hereon), only half of them are
left. r (1) = r (0)/2 = 8. After another second, again half of them have vanished
and we consequently have r (2) = 4.
(Sketch: the measured values r(0), r(1), r(2) plotted over time t.)
The pattern here is obvious: Every time unit of 1, we lose half of our substance.
This can be described as r (t) = r (t−1)/2, but what happens if the t is continuous
and the r (t) is continuous as well? What if we are interested in r (0.5)?
The continuous counterpart of our recursive equation is
∂t r (t) = αr (t) with an α < 0.
The evolution of the quantity r (t) in time, i.e. its time derivative, depends on
the quantity itself. If we have twice the amount of the substance, we also lose
twice the amount of substance over a given time span.
Chapter 3 introduces equations where the evolution of the equation does not only
depend on an explicit argument such as time, but also on the solution itself. Our evo-
lution of the planets, i.e. how they move, depends on both time and the arrangement
of the planets themselves.
We notice that this is a different type of equation compared to what we know from
school. Its unknowns are both the traditional parameters (here t) as well as functions
and their derivatives. That is, we are not searching for a particular argument here—if
we solve y(x) = x² = 16 we want to obtain x = ±4—but we are searching for a
whole function that materialises a solution, i.e. fits to the ODE. This function
usually depends on t, too.
Our definition contains only time-derivatives. There are also differential equations
that have time derivatives plus spatial derivatives or, more general, derivatives along
more than one direction. Such equations are partial differential equations. They are
not subject of study here.
Our planet trajectories are one example for an ODE. The introduction’s decay
equation is another one; one for which we can even write down a solution. It has
solutions r (t) = r0 · eαt as we quickly validate:
$$\begin{aligned}
r(t) &= r_0 \cdot e^{\alpha t} \quad\text{is our ansatz (educated guess) with}\\
\partial_t r(t) &= \alpha\, r_0 \cdot e^{\alpha t}, \quad\text{i.e.}\\
\underbrace{\partial_t^{(1)} r(t)}_{=:\partial_t^{(1)} f(t)} &= \underbrace{\alpha\,\partial_t^{(0)} r(t)}_{=:F(t, f(t))}.
\end{aligned}$$
What we do here is a classic: We rely on our intuition and come up with an ansatz:
“this is what the solution could look like, but there are many free parameters”. Next,
we validate that this educated guess fits to the ODE. In our example, it does. The
r (t) is the f (t) in the definition, while the n from the definition equals n = 1. We
also see that t does not enter F directly in this particular case. We have F(t, r (t)) =
F(r (t)). Finally, we take the two free parameters r0 and α of our ansatz and plug in
r (0) = r0 = 16 and r (1) = 8 = 16eα .
This is an approach that yields an analytical solution. It yields a write-up of a
“proper” function, i.e. something we can simply evaluate. However, constructing
analytical solutions is not the strategy we follow in this course. We want a computer
to construct approximations for complex ODEs where we struggle or fail to write
down analytical solutions.
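We can nevertheless let the computer double-check such a fit (a two-liner of my own): with r0 = 16 and α = ln(8/16) = ln 0.5, the ansatz reproduces r(2) = 4 and also fills in the value in-between, r(0.5).

```cpp
#include <cmath>
#include <iostream>

int main() {
  const double r0    = 16.0;
  const double alpha = std::log(8.0 / 16.0);  // fitted from r(1) = 8 = 16*e^alpha

  auto r = [=](double t) { return r0 * std::exp(alpha * t); };

  std::cout << "r(0.5)=" << r(0.5)   // roughly 11.31
            << ", r(1)=" << r(1.0)   // 8
            << ", r(2)=" << r(2.0)   // 4
            << "\n";
  return 0;
}
```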
12.1 Terminology
Even though we are not interested in the symbolic approach to solve ODEs, we need
a certain set of terms and concepts to discuss solutions properly. Furthermore, the
terms often allow us to validate whether a numerical solution is reasonable—if we
know that a solution should have certain properties, we can look out for them in our
plots, e.g.
Before we start, we have to clarify whether and where a solution of an ODE does
exist. Otherwise, it is meaningless to even boot our computer. We assume that there
is a certain area of t and f (t) for which the F(. . .) is well-defined and a solution
does exist. That means that the solution is reasonably smooth, i.e. its derivatives do
exist. You can not write down a term ∂t f if the function exhibits a zigzag pattern
(Fig. 12.1).1
Within the area where everything is well-defined, we are typically interested in a
particular solution:
Definition 12.2 (Initial value problem) If we search for a solution of an ODE fitting
to a certain f (0) = f 0 , we speak of an initial value problem (IVP).
1There are weaker notions of derivatives such that certain jumps (in the derivatives, e.g.) are fine.
But that goes way beyond the maths that we do here.
In this book, all of the problems we study with our code are IVPs. This does not come
as a surprise, as our numerical computer programs need a (floating point) value to
start from. Next, we discuss the order of an equation:
Standard equations from school such as y(x) = x² can be read as degenerate ODEs.
We can call them 0th-order ODEs. If a problem has only first derivatives, we speak
of a problem given in first-order formulation.
The ODE’s solution f can be a scalar or a vector function, i.e. it can either return
a single value or a whole vector. In this context, there is a nice technique which we
can use for all ODEs:
Our prototype code in Chap. 3 however does not solve this second-order system.
It augments the f and keeps track of the velocities explicitly. This way, it can stick to
a first-order formulation of the problem, where the velocities follow the acceleration
and the positions follow the velocities. If you have N objects, we obtain 2 · 3N = 6N
first-order ODEs.
Linearity for ODEs means linear in the derivatives. You can still have a t² term somewhere
in there, but you may not encounter something like f(t)·∂t f(t) or (∂t^(3) f(t))².
If we know that we have a linear, homogeneous equation, then we immediately see
that any linear combination of two solutions to the ODE yields a new solution. We
call this the superposition principle.
Autonomous ODEs have two interesting properties that are useful when we debug
code. First, they are translation invariant in time. Let's assume we know a solution
for a given initial value. If we have another solution (simulation) that runs into this
initial value (and derivative for higher-order ODEs) later on, then it continues along
the known trajectory: After all, the t does not enter F, so it does not really matter
at which point in time we hit a state. The other solution yields the same trajectory
though shifted in time.
The second interesting property for us results from the translational invariance:
Let’s assume you have found a configuration f such that F(. . .) = 0. This means,
you have identified a stationary f , as the f does not change anymore from hereon.
You therefore can argue about particular solutions to ODEs:
Fig. 12.2 The solution f (t) here is stable: The maximum difference between f (t) and a dotted
fˆ(t) above is the initial difference times a scaling. So the trajectory of fˆ(t) starts slightly off f (t)
but then remains within a certain corridor
Think, for example, of the stationary orbit of a satellite around a planet—you can look up the analytical formula for it. If we
start on this orbit with exactly the right velocity, then we should stay on the orbit.
By means of its radius, the orbit yields a stationary solution. Obviously, the velocity
is not constant as we rotate around the planet. The ODE
This definition describes the following: If we have a stable solution and wobble
around with the initial value, then the resulting new solution doesn’t move away
from the stable solution arbitrarily. It always remains in a corridor, parameterised by
C, around the stable solution (Fig. 12.2).
Fig. 12.3 Example visualisation of two solutions for two different initial values of the logistic
growth equation (12.2). There are two stationary solutions, but only one is attractive/stable. The
arrows illustrate the gradients. They make up a direction field. Though f(t) = β is stable, not the
whole environment around it is attractive/stable. Once we go below f(t) = 0, we have no attractive
behaviour anymore. Notably, f(t) = 0 is a repulsive, stationary solution
However, if we study the definition closely, we see that the definition has exactly
the same structure and format in both cases. This makes sense; otherwise we should
have used a different term and not stability. And they are linked: If a solution to an
ODE is stable, then truncation errors, e.g., put the solution off. But the trajectory
of the computed solution does not yield something completely different, unless the
truncation errors all point in the same direction and accumulate. Notably, the
underlying problem is well-posed.
Before we discuss this, let's introduce another characterisation:
Attractive means that the solution fˆ(t) approaches f (t) over time. This can be a
very slow process, but they get closer and closer and closer. We immediately see
that this tells us something about the condition number: Small alterations of the
input data yield no significant long-term difference in the solution. The setup is well-
conditioned around f (t). Even better: Any small error we make will be damped out
completely on the long run.
If you have a solution, you can formally argue about its attractiveness via some
operator eigenvalues. We use a more hands-on approach which usually is sufficient
for the kind of problems we study with our codes. It focuses on stationary solutions
(Fig. 12.3): Assume you have found one. Move a little bit up and compute the deriva-
tive ∂t f (t)—which is easy as it is explicitly given by the ODE. Look only at the
derivative’s sign. If it is negative, the derivative pushes any solution that is slightly
too big down towards the stable solution below. Next, we study trajectories starting
slightly below the stable solution and analyse whether F(t, f (t)) > 0. If this is the
case, starting slightly below the stable solution yields a trajectory which approaches
the stable one.
If both tests yield a “yes, it does approach”, we have an attractive stable solution.
If one of them pushes the off-solution towards the stable trajectory but the other one
does not, we have identified a saddle point and the overall thing is not stable: Errors
towards the “wrong” side make the trajectory shoot off; errors towards the other side
will bounce back, but then we might next shoot in the wrong direction and we are lost
again. If neither side approaches, we speak of a repulsive stationary solution. Numerical
codes basically never hit them.
Once we do our hands-on analysis for our logistic growth equation, we observe
a second interesting property: Sometimes, there is a region around a stable solu-
tion where other solutions approach it. Here, the solution is stable plus attractive.
However, if we start off too far away, then the resulting trajectory is definitely not
attractive anymore (Fig. 12.3). Stability and attractiveness can be localised proper-
ties: There might be stability regions for which everything is fine, but once we leave
these regions, anything can happen. Attractive is a special case of stable, but
this "attractive implies stable" statement is a property which holds only for scalar
ODEs.
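The sign test is easy to automate. The sketch below assumes a logistic right-hand side of the form F(f) = α·f·(β − f)—the concrete form of (12.2) is not repeated here, so treat this as an assumption—and checks the sign of F slightly above and slightly below a stationary solution.

```cpp
#include <iostream>
#include <string>

// Assumed logistic right-hand side: F(f) = alpha * f * (beta - f).
// Its stationary solutions are f = 0 and f = beta.
double F(double f, double alpha, double beta) {
  return alpha * f * (beta - f);
}

std::string classify(double fStationary, double alpha, double beta) {
  const double eps = 1.0e-3;
  const bool pushedDownFromAbove = F(fStationary + eps, alpha, beta) < 0.0;
  const bool pushedUpFromBelow   = F(fStationary - eps, alpha, beta) > 0.0;

  if (pushedDownFromAbove && pushedUpFromBelow)   return "attractive";
  if (!pushedDownFromAbove && !pushedUpFromBelow) return "repulsive";
  return "saddle-like, not stable";
}

int main() {
  const double alpha = 1.0, beta = 2.0;   // arbitrary positive parameters
  std::cout << "f=0:    " << classify(0.0,  alpha, beta) << "\n";   // repulsive
  std::cout << "f=beta: " << classify(beta, alpha, beta) << "\n";   // attractive
  return 0;
}
```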
12.3 Approximation of ODEs Through Taylor Expansion
With our knowledge of ODEs and Taylor refreshed, it is time to revise how we
construct the algorithm in Chap. 3. Instead of the complex N-body setup, let the
simple ODE
∂t f (t) = α f (t) (12.3)
be our equation of interest. It is supplemented by suitable initial value conditions
f (0) > 0 and some α < 0, so the result will be the decay equation (with the same
analytic solution). It yields a curve that approaches f (t) = 0 from above.
The test equation's analytical solution is infinitely smooth, i.e. you can throw ∂t on it
infinitely often. None of its derivatives vanishes. As the function is that smooth,
we know that the Taylor expansion works well. If we have infinitely many terms in
it, then it will exactly yield the solution.
As we can not handle sums of infinite length on a computer, we truncate it. This is
one fundamental idea behind many approaches in the simulation business. We take
the Taylor expansion of the solution
$$f(t+dt) = f(t) + dt\cdot\partial_t f(t) + \underbrace{\sum_{i\geq 2}\frac{dt^i}{i!}\,\partial_t^{(i)} f(t)}_{\text{higher order terms}}$$
and throw away the higher-order terms. Next, we take the ∂t f (t) term and plug in
the ODE as written down in first-order formulation.
The result is a time-stepping scheme, i.e. it sketches how we can write down a
for-loop running through time with a step length of dt. Since we know how to rewrite
any system of ODEs of any order into a first-order formulation, and as the truncation
idea works for both scalar- and vector-valued f (t), we can solve any ODE via this
scheme.
Definition 12.9 (Explicit Euler) Take the solution and develop its Taylor expansion.
Truncate this expansion after the second term and plug the ODE definition into the
derivative. This yields the time stepping loop from the model problem. It is called
explicit Euler.
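In code, the explicit Euler for our model problem (12.3) is literally a for-loop that applies this truncated Taylor step. A minimal sketch with parameters chosen by me:

```cpp
#include <cmath>
#include <iostream>

int main() {
  const double alpha = -1.0;  // decay rate of the model problem (12.3)
  const double dt    = 0.1;   // time step size
  const double T     = 1.0;   // simulated time span
  const int    steps = static_cast<int>(std::round(T / dt));

  double f = 1.0;             // initial value f(0)

  for (int i = 0; i < steps; i++) {
    // Truncated Taylor: f(t+dt) = f(t) + dt * F(t, f(t)) with F(t, f) = alpha*f.
    f += dt * alpha * f;
  }

  std::cout << "numerical f(T)=" << f
            << ", analytical f(T)=" << std::exp(alpha * T) << "\n";
  return 0;
}
```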
• There are three properties for ODE solutions which are important for us: stationary,
stable, attractive. If we solve a problem with our code, it is reasonable first to
validate our code against (artificial) scenarios for which we know the solution or
the solution properties. The tests help us to find bugs, and they allow us to quantify
(if we know an analytical solution) how bad our computer’s approximation is.
• The explicit Euler is a straightforward outcome of a truncated Taylor expansion
with only two terms into which we plug the ODE.
13 Accuracy and Appropriateness of Numerical Schemes
Abstract
For the accuracy and stability of numerical codes, we have to ensure that a dis-
cretisation is consistent and is stable. Both depend on the truncation error, while
we distinguish zero- and A-stability. We end up with the notion of convergence ac-
cording to the Lax Equivalence Theorem, and finally discuss how we can compute
the convergence order experimentally.
We study two particles as they orbit around each other. They dance through
space. Intuitively, we understand that this dancing behaviour should become more
accurate the smaller the time steps are. In the sketch below, the top "animation"
runs with half the time step size compared to the bottom simulation.
The left (starting) frame is exactly the same. It hosts two particles with two
velocities. The bottom row performs one time step, whenever the top row per-
forms two time steps with half the time step size. Due to the larger number of time steps,
the top row yields a different, more accurate particle configuration in the right
frame.
Accuracy is what we want to quantify now. Unfortunately, we lack a clear
reference point: We don’t know where the particles should (analytically) be after
a certain time T . Therefore, we track the positions of the particles p1 (T ) and
p2 (T ) after a certain fixed time T and compare them with each other. What do
you expect if you simulate the time span (0, T ) with a time step size dt, then
with dt/2, then with dt/4, …?
We derive our explicit Euler from the Taylor expansion by truncating terms:
$$f(t+dt) = f(t) + dt\cdot\underbrace{\partial_t f(t)}_{\text{plug in } F} + \underbrace{\sum_{i=2}^{\infty}\frac{dt^i}{i!}\,\partial_t^{(i)} f(t)}_{\text{cut off}}.$$
In this formula, we assume that the Taylor series for f (t) exists within a dt-area
around t (cmp. criteria on page 134). Therefore, the sum over i quantifies the error
that we introduce by solving the problem numerically instead of analytically:
Definition 13.1 (Truncation error) The truncation error is the error that results from
the time discretisation, i.e. from cutting off the Taylor series.
A more formal definition goes as follows: Let τ be the truncation error, i.e. the stuff
we cut off. Consistency means
$$\lim_{h\to 0}\frac{\tau}{h} = 0,$$
i.e. the truncation error goes down faster than h.
For the explicit Euler, we directly see that our approach is consistent. Once we have
written down the scheme, we replace the ∂t f_h(t) in the Taylor series with the given
ODE definition
$$\partial_t f(t) = F(t, f(t)),$$
and argue that the remaining terms fade away with h² once h becomes small. That is
faster than h. We solve the right problem when h goes to zero. For the explicit Euler,
the claim that we are consistent is trivial.
Non-consistent schemes
More sophisticated numerical schemes that approximate higher-order terms in the
Taylor series to yield better estimates are not automatically consistent. In this case,
we have to invest work to show that we solve the correct problem as h → 0.
• Is the solution also stable for any dt (or h; whichever notation we prefer), i.e. do
we always get something meaningful?
• How do the truncation and the round-off errors sum up, and what is the accuracy
we obtain in the end?
When we focus on the first question, we basically ask our consistency questions
the other way round: We know that we get the right answer for dt → 0. But what
happens for bigger dt?
13.1 Stability
Before we study complex challenges such as our multibody simulation and before
we formalise yet another notion of stability, we study what could possibly go wrong
with an explicit Euler. For this, it makes sense to focus on the Dahlquist test equation.
Writing down the explicit Euler for this setup is reasonably simple, as we can plug
in the right-hand side F and isolate the f h (t + h):
1 N here is a counter for the time steps. It has nothing to do with the N from “N -body simulations”.
Fig. 13.1 Explicit Euler trajectories for the Dahlquist test equation with λ = −1 and various time
step sizes
When we dump the outcome for smaller and smaller h, we observe interesting things:
A plot of the behaviour for various time step sizes confirms this conclusion (Fig. 13.1):
With h = 1.0, we are far away from the analytical solution of the test equation. Once
we make the h smaller, we get closer and closer to the real behaviour. We also observe
that we undershoot the real solution (cmp. to the sign in the above error computation)
all the time: The explicit Euler starts from a point and then walks along the derivative
for a time interval of h. Since the function smooths out, i.e. the derivative approaches
0, we always go down too aggressively.
Fig. 13.2 Explicit Euler trajectories for the Dahlquist test equation with λ = −2 and the same time
step sizes as in Fig. 13.1
We next decrease the material parameter λ.2 We check λ = −2 (Fig. 13.2). While
we see a similar behaviour as before for very small time step sizes, we get oscillations
for bigger time step sizes. These oscillations damp out as long as h < 1. For h = 1,
there is no damping at all and the solution does not approach the real solution of
f (t) = 0 for large t. If h > 1, the amplitudes grow.
Everything seems to be right if we have a smallish time step size. For h = 0.5, we
run along the gradient “too long”, end up exactly on f h = 0 which is an attractive
stable solution, and thus remain on the axis. Trivially, our numerical solution on the
long run is consistent, but we struggle to call the observed curve characteristic for
this problem. For h slightly bigger than 0.5, we run for too long along the derivative
in the first step and obtain an approximate value f h < 0. As a consequence, the
derivative becomes positive, i.e. switches the sign, and we go up again. We produce
an oscillation. The code still yields the right solution on the long term (the solution
goes to zero), but this is qualitatively yet another solution than the ones we have
had before. Even worse is the situation for h > 1. We create oscillations where the
amplitude becomes bigger and bigger.
2 I call λ the material parameter. Different to flow through a subsurface medium, e.g., our λ is not
really a material. But the name material here highlights that it is not a parameter determined by yet
another equation but something fixed.
Fig. 13.3 We assume that hλ of the Dahlquist test equation spans the complex plane. So the x-axis
is the real part (which we usually use, but this plot makes no constraint on λ, i.e. allows it to be
complex-valued). Only if the product h · λ ends up in the shaded circle, the numerical scheme is
stable. The calibration of hλ seems to be weird: We would maybe expect only h on the x-axis.
However, with h on the x-axis, we would need a different plot per λ. The present plot holds for all
λ. It clarifies which h yield a stable solver for all λ
of under control, i.e. as we know that we have the implementation error under con-
trol, we do not really have to worry for h → 0. However …
Definition 13.4 (A-stable) The A-stable region of a numerical scheme for the test
equation is the set of (h, λ) parameters for which the scheme converges towards zero.
A numerical scheme is A-stable if its A-stable region comprises all (h, λ) inputs with λ < 0.
As we see above, the explicit Euler is not A-stable. To formalise this behaviour, we
go back to (13.2) where we write down the explicit formula computing the outcome
for the test equation after N time steps in one rush. To ensure
$$f_h(Nh) = (1 + h\lambda)^N f_h(0) \to 0,$$
we need −1 < 1 + hλ < 1. The whole study has to be done for λ < 0 (we want the
analytical solution to decrease towards zero) and we know that h is positive (we can
not go back in time). As a consequence, 1 + hλ < 1 is trivial. We however have to
be concerned about −1 < 1 + hλ which implies
0 < h < −2/λ. (13.3)
This inequality determines our A-stable region.
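We can verify the bound with a small experiment of my own: for λ = −2 the critical time step size is h = −2/λ = 1. Time step sizes below it yield a damped (possibly oscillating) amplification factor, while sizes at or above it make |1 + hλ|^N refuse to decay or even grow.

```cpp
#include <cmath>
#include <iostream>

int main() {
  const double lambda = -2.0;  // material parameter of the test equation
  const int    N      = 50;    // number of time steps we look at

  for (double h : {0.4, 0.9, 1.0, 1.1}) {
    const double amplification = 1.0 + h * lambda;            // per-step factor of the explicit Euler
    const double fN            = std::pow(amplification, N);  // f_h(N h) / f_h(0)
    std::cout << "h=" << h
              << ": 1+h*lambda=" << amplification
              << ", (1+h*lambda)^N=" << fN
              << (std::abs(amplification) < 1.0 ? "  (stable)" : "  (not stable)") << "\n";
  }
  return 0;
}
```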
Scientists like schematic illustrations. Figure 13.3 introduces the standard way to
depict the insight from (13.3): If λ > 0, we are lost right from the start. If λ < 0,
we have to meet (13.3). In theory, λ ∈ C. We thus care about the magnitude of λh.
If λ ∈ R (which is what we have in this manuscript), we are only interested in the
x-axis.
13.2 Convergence
With our notion of the stability and consistency, we can finally bring both concepts
together:
Convergence means that no matter at which time step N we study a solution, we close
in on the right answer once h goes to zero. We can phrase the definition differently:
I prefer the Lax Equivalence Theorem as it highlights explicitly that algorithm choice
and implementation go hand in hand: When h approaches zero, we want to be sure
that we solve the right problem plus that no round-off error pollutes the outcome. In
accordance with the definition, we can now also speak of convergent codes. They
are stable realisations of a consistent numerical scheme. Notably, consistency and
stability are both necessary for convergence, but each property alone is not sufficient.
Terminology
Convergence is often used in a sloppy way and some people use it as synonym for
“nothing changes in my iterative code anymore”. Always check whether it includes
both stability and consistency. You might for example have run into the wrong so-
lution if a while loop terminates as it does not change the solution anymore, or you
might miss the right solution as you decrease h since your round-off errors mess
everything up.
In practice, we can not show real convergence (or consistency) in line with the
mathematical definition. Once h ends up in the order of machine precision, the error
will likely plateau as well. Yet, a decent selection of small h values that exhibit a
clear trend is usually seen as sufficient evidence—despite the fact that even smaller
h might introduce compute instabilities.
So far, our convergence statement is a qualitative one. We next derive a quantitative
notion of convergence. For this, we study the truncation error we make: We compare
the numerical approach f h (h) with the not-truncated Taylor series of f (h), i.e. the
real solution, for one time step of h. This gives us
Fig. 13.4 Illustration of the global and local error. As the real solution meanders up and down,
we observe that our global error statements are a worst-case estimate. Sometimes we overshoot,
sometimes we undershoot. The global error is the sum of local errors (though each local error
corresponds to a different ODE solution) which might cancel out
Since F(t, f(0)) is given by the problem, i.e. it is a real “constant” in this sense,
we are zero-stable. The error disappears for h → 0. Zero-stability is no surprise, but
we now can quantify the error:
The local truncation error of the explicit Euler is in O(h²). The explicit Euler's single
step is thus second order.
However, this is only the analysis of the first step. Once we have completed the
first step, i.e. moved from the initial condition to f_h(h), the second term in the error
computation F(t, f_h) − F(t, f) does not vanish anymore. Therefore, we carefully
have to distinguish a local error (made by one time step) vs. the global error that we
actually observe. They are two different things (Fig. 13.4), as the reference point for
both quantities differs: it is either the starting point of a time step (local error) or the
real solution (global). The global error is an accumulation of local errors; although,
in practice, local errors often cancel out as we have seen it before for the floating
point numbers. This is the same story as for the truncation error, i.e. all just a little
bit of history repeating!
Deriving the global error requires a little bit more work than the local error. Let’s
assume that we run our algorithm over a time span T , i.e. that we do Th time steps.
We start with an analysis of the last step of our scheme which suffers from the error
$$\begin{aligned}
|e_h(T)| &= |f_h(T) - f(T)|\\
&= |\underbrace{f_h(T-h) + h\,F(T-h, f_h(T-h))}_{\text{one Euler step}} - f(T)|\\
&= \Big|f_h(T-h) + h\,F(T-h, f_h(T-h)) - \Big(f(T-h) + h\,F(T-h, f(T-h)) + \frac{h^2}{2}\ldots\Big)\Big|\\
&\leq |f_h(T-h) - f(T-h)| + h\cdot|F(T-h, f_h(T-h)) - F(T-h, f(T-h))| + h^2\Big|\frac{1}{2!}\ldots\Big|\\
&= |e(T-h)| + h\cdot|F(T-h, f_h(T-h)) - F(T-h, f(T-h))| + h^2\Big|\frac{1}{2!}\ldots\Big|
\end{aligned}$$
This time, there is an error contribution carried over from the previous time step, the |e(T − h)| term. It does not disappear. Instead, we keep it and apply the triangle inequality to the
whole expression. I omitted the absolute values for the error before to highlight that
the error can be a “too much” or “not enough”, but now we need the norm. We obtain
a recursive expression for the error, where the error (bound) increases per time step
by the inaccuracies in the first derivative plus an additional O(h 2 ) term that is also
often called global error increment.
To get the second term under control, we exploit the fact that our ODEs have to
be Lipschitz continuous.
Definition 13.7 (Lipschitz continuous) Let f 1 and f 2 be two solutions to the ODE
∂t f (t) = F(t, f (t)),
i.e. f 1 and f 2 belong to two different initial values. The F is globally Lipschitz-
continuous iff
|F(t, f 1 (t)) − F(t, f 2 (t))| ≤ C| f 1 (t) − f 2 (t)| ∀t ≥ 0.
Indeed, one can show that this is the only condition we need to ensure that the
ODE has a solution. We therefore can safely rely on the fact that this holds for our
problems of interest. While the proof is out-of-scope, we do emphasise what it means:
If we slightly modify the solution f , then the F-value does not change dramatically.
It means that we are kind of well-conditioned along the f (t)-axis. Plugging in a
“wrong” f into our solver will yield an error, but this error will not be amplified
arbitrarily.3
Let's use the continuity property in our global error estimate. We obtain
$$\begin{aligned}
|e_h(T)| &\leq |e(T-h)| + h\cdot|F(T-h, f_h(T-h)) - F(T-h, f(T-h))| + h^2\Big|\frac{1}{2!}\ldots\Big|\\
&\leq |e(T-h)| + h\cdot C\,|f_h(T-h) - f(T-h)| + h^2\Big|\frac{1}{2!}\ldots\Big|\\
&= (1 + hC)\,|e(T-h)| + h^2\Big|\frac{1}{2!}\ldots\Big|
\end{aligned}$$
where the C is the so-called Lipschitz-constant from the definition. Some remarks
on this formula:
• One might assume that the explicit Euler does not converge, as the recursive
equation above makes us use Th steps. In each step, we get an additional error
amplification from the left part of the right-hand side in the order of h. So it seems
that we obtain Th · h. However, the initial error is zero! Therefore, if we unfold the
(1 + hC)|e(T − h)| term, we eventually hit e(0) = 0 and the term drops out.
• Each recursive unfolding step (induction or time step) produces one h²·|1/2! …|
term. In the end, we will have N of these guys. So their number scales with T/h
or O(h⁻¹).
• These remaining terms scale with the second derivative ∂t(2) f (t). It is the inaccurate
capture of the function’s second derivative that injects more and more error into
the approximation.
The interested student can now study the discrete Gronwall lemma. For us, we set
N = T / h and write
$$\begin{aligned}
|e_h(T)| &\leq \underbrace{(1 + hC)\,|e(T-h)|}_{\text{recursion spills out terms}} + h^2\left|\frac{1}{2!}\ldots\partial_t^{(2)} f(t)\right| + \ldots\\
&\leq (1 + hC)^N \underbrace{|e(0)|}_{=0} + \underbrace{N\cdot h^2\left|\frac{1}{2!}\ldots\partial_t^{(2)} f(t)\right| + \ldots}_{\text{ignore higher order terms}} && \text{unfold}\\
&\leq \frac{T}{h}\cdot C\cdot h^2\left|\frac{1}{2!}\ldots\partial_t^{(2)} f(t)\right|\\
&\leq C\,h\,\left|\max \partial_t^{(2)} f(t)\right|.
\end{aligned}$$
The “ignore” remark means that the magic constant before the preceding term has
to be increased accordingly.
This is the global error statement: If we halve the time step size h, we can also expect
the error to go down by a factor of two.
13.3 Convergence Plots
One challenge we face for real problems is that we typically can not compute the
error, as we do not know the exact solution. In this case, it helps to compare two
errors for different mesh sizes:
$$e_h(T) - e_{h/2}(T) = (f_h(T) - f(T)) - (f_{h/2}(T) - f(T)) = \underbrace{f_h(T) - f_{h/2}(T)}_{\text{measure}} = C h^p - C\Big(\frac{h}{2}\Big)^p = C\Big(1 - \frac{1}{2^p}\Big) h^p.$$
We measure the outcome for two different time step sizes and plug in our model of
a convergence order (Fig. 13.5). This yields one equation for the two quantities C and p.
A further measurement with yet another time step size gives us a second equation, so
we can eliminate C and determine p.
Fig. 13.5 A planet runs through space on a trajectory which is unknown to us. To study the
convergence behaviour of our code, we can however fix a certain observation time T . We still do
not know the solution f (T ), but we know that the points f h (T ), f h/2 (T ), f h/4 (T ), . . . close in on
the solution for a consistent algorithm
(Sketch of a convergence plot on log scales: the error over the number of time steps N (h = T/N), with ticks at N = 100, 200, 400.)
Note that this sketch of a plot has flaws (cmp. remarks in Appendix F). Yet, it
illustrates the principle behind our experimental strategy plus the fact that it is very
convenient for such plots to use log scales. An interesting decision is that I did not use
h on the x-axis. With h on the x-axis, left would mean accurate (“smaller h is
better”). Many readers intuitively associate right with better, so rather than depicting
h, I used the time step count as unit. Further right thus means more accurate.
It is never a good idea to study only one setup or only three or four step sizes to
claim a certain convergence order. Run more experiments. If all of them yield parallel
lines, then you can claim that you observe a certain order. This is in line with our
previous observation that stability and consistency can depend on the experimental
setup.
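A sketch of the corresponding post-processing (the measured values below are hypothetical, not actual book data): given three runs with h, h/2 and h/4, the ratio of successive differences equals 2^p, so we can eliminate C and solve for p.

```cpp
#include <cmath>
#include <iostream>

int main() {
  // Hypothetical end-time results f_h(T) for three time step sizes h, h/2, h/4.
  const double f_h  = 1.9000;
  const double f_h2 = 1.9520;
  const double f_h4 = 1.9785;

  // f_h - f_{h/2} = C (1 - 2^{-p}) h^p and f_{h/2} - f_{h/4} = C (1 - 2^{-p}) (h/2)^p,
  // so the ratio of the two differences equals 2^p.
  const double p = std::log2(std::abs(f_h - f_h2) / std::abs(f_h2 - f_h4));

  std::cout << "measured convergence order p = " << p << "\n";   // close to 1 for explicit Euler
  return 0;
}
```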
Key points and lessons learned
In this part of the manuscript, we revise some of the most fundamental concepts
behind numerical algorithms. Our discussion orbits around our model problem—a
few particles “dancing” through space—or around very basic ordinary differential
equations. The simplest case is the Dahlquist test equation. Though basic, these
examples help us to establish the language we need in almost any area of scientific
computing. Terms like consistency, stability or truncation error are found in the field
of numerical linear algebra or for the solvers of partial differential equations as well.
They are omnipresent.
With our toolset, i.e. an appropriate language and fundamental ideas, we are in a
position to write N -body simulations and analyse their outcome, plus we can discuss,
assess and understand their characteristics. Our work is not mere code-monkeying
anymore, since we know why we are writing certain code, how accurate it is, and
when and why it breaks down. This toolset also maneuvers us into a position where
we can introduce more sophisticated numerical algorithms.
While the latter is among our long-term goals, this manuscript departs from the
route of numerical algorithms for the time being, and instead equips the reader with
the skill set and knowledge to write simulation code that makes use of today’s om-
nipresent multicore computers. We also introduce tools to employ GPGPUs. Overall,
our intention is to become able to write simulations with really large particle counts
that run over a longer time span. In my opinion, it is reasonable to discuss better nu-
merics with a large-scale mindset. Designing mathematically advanced algorithms
is a challenge. Designing mathematically advanced algorithms that scale is a harder
challenge. As the basics of numerics are now known to us, we next acquire a sound
understanding how parallel codes can be written.
Part V
14 Writing Parallel Code
Abstract
We distinguish three different strategies (competition, data decomposition, func-
tional decomposition) when we design parallelisation schemes. SIMD’s vectori-
sation is a special case of data parallelism which we now generalise to MIMD
machines where we distinguish (logically) shared and distributed memory. Our
text discusses some vendor jargon regarding hardware, before we introduce BSP
and SPMD as programming techniques.
Assume there are p boxes filled with Lego© bricks of p different colours. We
label them as input boxes. Every box holds a random number of bricks and
colours. Let’s further assume that you have a team of p colleagues, and each
one “owns” one of the filled (input) boxes. Finally, each colleague has one box of
their own which is initially empty—we call it the output box. Colleagues are allowed
to chat with each other. They also can exchange input boxes. However, your
colleagues can not handle more than one input box at a time, and they can only
access their own output box.
1. How many red bricks are in all the input boxes together?
2. Is there a red brick in one of the boxes?
3. Sort all red bricks into the first output box, all green bricks into the second,
and so forth.
In my opinion, there are three fundamentally different ways to come up with a parallel
solution for a problem. Obviously, not all problems fit all three approaches,
but often they complement each other. Let us assume that a problem decomposes into
subproblems, and that we have various compute entities, where entities can mean
cores or computers or servers.
Pipelining
Even if steps within a functional decomposition have to run one after another, and
even though we might not be able to run steps for different work items in parallel—
due to a shortage of well-suited compute components such as vector computing units,
e.g.—there is an additional, potential gain from parallelism if we can functionally
decompose our work into a series of activities A, B, C,…and have to run over a
sequence of problems: While worker number two in the car manufacturing example
above starts painting, the first worker already continues with the next car. In the end,
we get the impression that cars are built in parallel even though the cars drop out
sequentially, one by one.
This pattern is called pipelining: We give the first subproblem to the first unit
running the first task (a particular activity) on this subproblem. Meanwhile, we give
the second subproblem to the second unit which performs a different task. Whenever
the entities have completed their task, they pass the subproblem on to another unit.
Pipelining (as materialisation of a functional decomposition) and data decompo-
sition can often be applied as alternatives or can be combined. Pipelining is typically
more complex and always needs some warm-up and cool-down phases (when not
all steps of the car manufacturing pipeline are occupied, e.g.). It is also notoriously
vulnerable to a step within the pipeline that lasts longer than all others; the slowest
step determines the throughput.
Pipelining is heavily used by hardware. If a particular instruction runs through S
phases in a pipeline—phases typically are “understand the type of operation”, “bring
in the variables required”, “write the outcome into a register”, …—the pipelining
concept can bring down the effective cost per instruction to 1: In the ideal case, one
instruction “leaves” the chip per cycle. This obviously implies that instructions are
reasonably independent of each other such that we can progress them independently.
In this ideal case, it seems that S steps are done per time unit. From a user’s per-
spective, we may thus argue that we have done multiple instructions for one piece
of data in parallel. The acronym in Flynn’s taxonomy for this is MISD (Multiple
Instructions Single Data).
The definition of MIMD (Multiple Instruction Multiple Data) doesn’t prohibit two compute
units from sharing data. It just means that every unit works on its own piece of data at a time (in the registers). Nevertheless,
these “multiple data” are all allowed to reside within the same memory.
Definition 14.2 (Shared memory versus distributed memory) A shared memory ma-
chine is a machine where multiple compute units share a common data region, i.e. can
access the same memory. In a distributed memory environment, the memory per com-
pute unit is not connected, so if one compute unit wants to access data of the other
one, the two units have to exchange the data explicitly via a message.
MIMD is a hardware classification. Both multicore computers and whole clusters are
materialisations of this paradigm. ALUs with vector capabilities however are not.
They run the same instruction per data tuple even though these instructions might be
subject to masking.
1. The von Neumann pair of control unit plus ALU (incl. an FPU if required) is
a core. So each core plus bus plus memory yields a complete von Neumann
architecture, but bus and memory are shared between cores.
2. Some vendors speak of (hardware) thread rather than core. At the same time, we
call a thread in a programming language one execution stream running indepen-
dently of the other threads. So if someone tells you about a thread, this might
mean a (software) thread where a piece of the program runs its code in a MIMD
sense (though MIMD in turn is a hardware term) or a real thing in hardware.
3. You can run multiple logical (program) threads, i.e. different instruction path-
ways, on one core. It is the responsibility of your operating system to swap them.
Actually, you do this all the time on your laptop when you open Skype, emails,
Word, Twitter, …Obviously, if multiple threads are executed on one core, we
cannot expect any improvement in runtime. Unless there are enough registers
around to store data temporarily—that’s what GPUs do—there will be a high
price to pay for the swaps. For our purposes, it usually does not make sense to
have more threads than CPU cores.
4. On GPGPUs, it makes sense to have way more logical than physical threads
(hardware),1 as GPUs have a massive number of registers. So they do not have
to empty and free registers into the memory if another thread continues. They
simply make the other thread use another set of registers. The hardware threads
on a GPU are also sometimes called CUDA threads.
5. Today, most vendors ship chips with simultaneous multithreading (SMT). If your
chip has SMT, it means that a core has NSMT hardware threads. However, these
(hardware) threads share some parts of the chip such as the FPU. So if one of the
threads runs into some floating point calculations, it occupies the FPU, and the
other threads have to wait if they need the FPU, too. Intel brands its SMT with
NSMT = 2 as hyperthreading. The machine, or the sales department of the vendor,
respectively, pretends to give you NSMT full cores/threads, but they are actually
not full cores. Through this trick, most systems today seem to have twice (or even more) as
many threads as full-blown cores. For most standard (office) software, the lack
of parallel FPU silicon is not a big deal, but for scientific software, we have to
carefully evaluate the role of hyperthreading. It might be “fake multithreading”.
6. The majority of compute servers today are shipped in a two-socket configuration.
A socket is a piece of silicon that hosts several cores and their shared memory.
If you have two sockets, the two sockets connect their memory and thus pretend
to have one large shared memory serving twice the core count (cmp. tips in Sect.
C.2). In reality, these are however different pieces of memory wired to each
other. They are not really one homogeneous thing. This means, you might face
runtime penalties (the system has to ship data over this wire) when one socket
needs memory unfortunately stored on the other guy. In this introductory text, we
do not really care about the arising non-uniform memory access (NUMA) cost.
1 GPUs in the classic CUDA tradition use a slightly different terminology as discussed in Chap. 18.
logical cores to the user. Out of these, only the per-node cores are available via
the shared memory programming techniques we discuss here, whereas only half
of them are real physical cores with full hardware concurrency.
Terminology
I use the term compute unit sometimes when I don’t want to discriminate between
core, node or thread. Unit is no formal term. I use it lacking a better word and to
emphasise that we discuss generic concepts that are, to some degree, independent
of the hardware realisation.
14.2 BSP
We next introduce one of the all-time classics on how to program a parallel (MIMD)
machine: Bulk synchronous processing (BSP). BSP is among the oldest ideas of how
to write a parallel program. It is a perfect example for a way to realise data parallelism
on MIMD hardware.
In BSP, a program runs through four conceptual phases:
• It starts with some code which runs in serial. There’s no parallelism here.
• The code then splits up into multiple parts (threads). Some call this a fork and
speak of a fork-join model rather than of BSP.
• The threads now compute something in parallel. In the original BSP, there’s no
communication between the threads. The threads make up the bulk within the
acronym BSP.
• Once all parallel activities have terminated, we fuse all the threads. This is the
synchronisation step or join. It may comprise some data exchange as the forked
threads are terminated.
double r[N];
double x[N];
double y[N];
double alpha = ...;
for (int i=0; i<N; i++) {
double alphaX = alpha * x[i];
r[i] = alphaX + y[i];
}
The dependency graph (DAG) for this example is close to trivial and highlights that
all the individual loop body executions can run in parallel. So a parallelisation of this
code for two cores could conceptually look like
double r[N];
double x[N];
double y[N];
double alpha = ...;
for (int i=0; i<N/2; i++) {
  double alphaX = alpha*x[i];
  r[i] = alphaX + y[i];
}
for (int i=N/2; i<N; i++) {
  double alphaX = alpha*x[i];
  r[i] = alphaX + y[i];
}
where the first loop runs on one core and the second loop on the other one. The
initial setup of all data is, however, done only on one core. Such a parallelisation is
exactly BSP (Fig. 14.1): There are a few initialisation steps where we create the arrays
and initialise the α. These steps run on one thread only as there is no concurrency
here. We then enter the bulk which we can split up into two or more chunks. This is
where we fork the threads, i.e. now the threads are started. Each thread runs through a
segment of the loop. There are no dependencies between the steps on one thread and
the steps on the other thread. The steps are in no way synchronised or orchestrated.
Fig. 14.1 Illustration of bulk synchronous parallel processing for a simple for loop realising a
DAXPY. The code runs on two cores.
Eventually, both threads have done their work and we join them. This is a syn-
chronisation. We don’t know how quick the threads are, whether the work is evenly
split or whether they physically run on different hardware threads. Hence, we might
have to wait for some to terminate. We finally continue serially.
Definition 14.3 (Superstep) In BSP, the individual bulk plus its synchronisation
including the final information exchange forms a superstep.
14.3 Realisation of Our Approach
All of our programming techniques in this manuscript fall into one class: They are
SPMD.
Definition 14.4 (SPMD) Single program multiple data (SPMD) means that all par-
allel compute units execute the same program (instruction stream).
This does not mean that they all follow the same instruction branch (if-statements),
and it in particular does not mean that they run in lockstep. It simply means that they
run, in principle, through the same piece of code. As such, the SIMD instructions
we have seen—SIMD formally is a machine architecture model, i.e. has nothing
to do with programming, so I write “SIMD instructions”—are a very special case
of SPMD. We can even be more precise and read SIMD as a very special case of
BSP, where each instruction forms its own superstep and the join’s synchronisation
happens implicitly after the instruction. BSP is an example for SPMD programming,
too. Every technique in this manuscript relies on SPMD. We never write two different
programs that collaborate.
If you work on an application where you have a dedicated server program, a client
software and a database software, then you have a classic example of a non-SPMD
application. All three different components typically are shipped as different pieces
of code.
• There are three classes of parallelism that we often find in computer codes: func-
tional and data decomposition as well as competition. Data decomposition is the
most popular approach.
• MIMD is the hardware architecture paradigm underlying multicore processors.
• In hardware specifications, we have to distinguish nodes, sockets, cores, and
(hyper-)threads. As programmers, we typically handle (logical/software) threads
which are mapped onto these machine concepts.
• BSP means bulk synchronous processing and describes a straightforward fork-join
model where threads do not communicate with each other and synchronise only
at the end. It is a paradigm how to design a data-parallel algorithm.
• BSP and traditional (SISD-focused) vectorisation can be, and usually are, realised via
SPMD programming.
Upscaling Models and Scaling
Measurements 15
Abstract
Two different upscaling models dominate our work: weak and strong scaling.
They are introduced as Amdahl’s and Gustafson’s law, and allow us to predict
and understand how well a BSP (or other) parallelisation is expected to scale. We
close the brief speedup discussion with some remarks how to present acquired
data.
Let a simulation work on a big square. Assume that the square represents a huge
image and the simulation computes something per pixel. The code runs for 24s
on one core. This computation per pixel is where the 24s come from. How long
will it run on two cores if we cut the square in two equally sized pieces? How
long will it run on four cores if we cut the pieces once more?
The natural answer is 12s and 6s. At the same time, we know that this is likely
too optimistic. If we split our homework among our three roommates, they
likely have to chat, i.e. coordinate, with each other, and there might be tasks that
just cannot be split among them or cannot be split into equally sized parts.
Let another simulation run for 4s on four cores. Each core handles two pieces
of work. How much time would this run need on only one core? Finally we take
the setup and deploy it to 16 cores. At the same time, we also make it four times
bigger. For our image, we can for example increase the resolution in both the
x- and y-direction by a factor of two, so we get four times more pixels. What
happens if we run the simulation with four times as many compute resources
on a problem that is four times bigger?
Before we start tuning or parallelising our codes, we have to ask ourselves what we
can and want to achieve. There are two reasons for this: On the one hand, we want
to get a feeling what we might end up with before we start the work. It is important
to assess whether parallelisation is worth the effort. On the other hand, we want to
assess the quality of a delivered solution: Have we delivered something reasonable?
For this, we first derive a model of our code:
At any time of a simulation run, we may have a different concurrency level. However,
we can obviously compute an average one. In our BSP introductory example in Sect.
14.2, we have a concurrency level of N for the loop.
In the example, we then split up the loop into two parts only. There are many
potential reasons for this: The compiler or programmer might come to the conclusion
that the loop body is too cheap, and splitting it up into too small pieces induces too
much overhead. Alternatively, our computer might have only two cores, so there’s
no point in splitting up the loop further, or we might want to map some concurrency
onto SIMD hardware features. In any case, we get a level of parallelism that is lower
than the concurrency level:
Definition 15.2 (Parallelism) Parallelism means that things are happening at the
same time.
In line with our previous discussion, one of the prime properties we are interested
in is scalability. Scalability means how the runtime improves, i.e. goes down, when
we add more and more compute units (cores or vector registers). It characterises the
observed parallelism. Let t ( p) be the runtime of our code for p compute units. With
this definition we can introduce two important quantities:
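Written out, the two quantities (referenced later as (15.1) for the speedup and (15.2) for the efficiency; this is the standard definition consistent with the surrounding discussion) are:

$$ S(p) = \frac{t(1)}{t(p)}, \qquad\qquad E(p) = \frac{S(p)}{p} = \frac{t(1)}{p \cdot t(p)}. $$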
I tend to forget what is in the numerator and what is in the denominator. But it is
easy to reconstruct: If we have two processing units p instead of one, the time t (2)
should go down and we expect a speedup, i.e. S(2) > 1. So t(1), the bigger one, has
to be the numerator.
Speedup tells us how fast we can make the code, but it does not tell us what price
we have to pay for this, i.e. how much we have to invest.
Efficiency puts the scalability in relation to the number of compute units we invested.
There are some natural properties for both quantities:
• Both efficiency and speedup are positive. If not, we’ve computed something wrong.
• If the speedup falls below one, we’ve also done something wrong. S( p) < 1 means
that we have invested more compute facilities, but our code has slowed down.
• In the best case, we have a speedup of S(p) = p. If we invest two compute
units rather than one, we halve the compute time. This is optimal. We call it linear
speedup.
• The best-case efficiency therefore is one.
• If the speedup is S( p) > p, i.e. the efficiency E( p) > 1, we have to be sceptical.
Sometimes, slightly superlinear speedup occurs due to caching effects for example.
Yet, this is rare.
• The concurrency level yields a natural upper bound for the speedup.
15.1 Strong Scaling
Our first theoretical model for upscaling results from a simple speedup model. Around
1967, Gene Amdahl started from the observation that many codes
do some initialisation, then branch out and use many cores, but eventually have to
synchronise all compute units again. As a result, we are never able to exploit all
compute units all the time. There is always some work which is (almost) sequential.
The prime example for such codes is given by BSP.
Fig. 15.1 Speedup curves for Amdahl’s Law for various choices of f . The cores (x-axis) is the p
from the formula, while the speedup is S(p).
Let 0 < f ≤ 1 be the fraction of code that runs serially. f close to 0 means
that the code keeps all compute units busy almost all the time. This implies that its
concurrency level is always equal to or higher than the number of parallel compute
nodes. f close to 1 means that the majority of our code is not running in parallel.
The model is simple to explain: We start from a given time t (1). A fraction f · t (1)
of this time can’t be parallelised. It does not benefit from more compute units p
and thus “just is there” for a parallel run, too. The remaining fraction of the runtime
(1 − f ) · t (1) however is shared between the compute units, so it is reduced. The
total runtime is the sum of these two ingredients.
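As a formula, a direct transcription of this description reads:

$$ t(p) = f \cdot t(1) + (1-f) \cdot \frac{t(1)}{p}. $$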
Amdahl’s law yields characteristic curves (Fig. 15.1): For small core counts, we
see a significant speedup. The improvement due to additional cores becomes less
significant with higher total core count. The code’s parallel efficiency deteriorates.
Eventually, the speedup plateaus almost completely.
Though Amdahl’s law serves us well, it relies on some crude assumptions: most notably,
the problem size and the serial fraction f stay the same, no matter how many compute
units we add (cmp. Sect. 15.3).
These shortcomings suggest that we often have to extend the law to reflect our
particular cases. We have to augment it. It is however a good starting point to assess
our code.
In our second theoretical model, we invert the train of thought. We assume that we
have a piece of code that is running on a parallel machine. To make a statement on its
speedup or efficiency, respectively, we ask ourselves “what would happen if we ran this
code without the parallelism”. In 1988, John Gustafson argued along these lines. He
still sticks to the same idea of a code, i.e. he still assumes that there is a serial part f
in the code as in Amdahl’s case.
This model is simple to explain, too: We start from a given time t ( p). A fraction
f · t ( p) is not running in parallel. When we rerun our code on one compute unit
only, we have to do this part once. The remainder of the runtime is work that did run
in parallel before. This whole workload now has to be processed by our single unit
one by one. The total runtime is the sum of these two ingredients.
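As a formula, a direct transcription of this description (the parallel workload (1 − f)·t(p) has to be processed p times over by the single unit) reads:

$$ t(1) = f \cdot t(p) + (1-f) \cdot p \cdot t(p). $$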
Gustafson’s law gives us characteristic curves for speedup studies as we see them
in Fig. 15.2: For small core counts, there is a price to pay to get the whole paralleli-
sation thing going. Once we enter a regime with a reasonable number of cores, the
code however performs well. The speedup curve “recovers”.
Fig. 15.2 Speedup curves for Gustafson’s Law for various choices of f
Though Gustafson’s law serves us well, it relies on some crude assumptions. Among
them: the parallel code exhibits a certain f , and the same f holds for a serial rerun. This
implies that any parallel code we start from has a sufficiently high concurrency
level.
These shortcomings suggest that we often have to extend the law to reflect our
particular cases. We have to augment it. It is however a good starting point to assess
our code.
15.3 Comparison
The two models behave very differently. One enters a saturation regime, the other one
suffers from some startup penalty before it recovers. To understand the difference,
we have to go back to our initial assumptions when we construct the formulae.
For Amdahl’s law, we assume that we have a problem of size N and run this
problem on bigger and bigger machines. We can assume that N enters the time we
need to compute the result as a scaling factor subject to f . As it enters both t(1)
and t(p), it cancels out immediately. The important thing to remember is:
We keep this N fixed as we scale up. Consequently, the f we observe for a single
compute node is the same for massively parallel runs. This is a strong notion of
upscaling.
For Gustafson’s law, things are slightly different: We take a problem from the parallel
machine and run this one on one unit. For both runs, we assume that the f is a constant
and the same. To keep f constant in a real application, we might have to increase
the problem size if we switch to bigger machines. Otherwise, parallelisation and
synchronisation overhead will alter f . If we want to keep the ratio of serial code
to parallel code constant, we have to make N bigger for bigger p. This observation
at Sandia National Labs motivated John Gustafson to introduce the new law. The
important thing to remember is: We increase the problem when we scale up to ensure
that the ratio f remains constant. This is a weak notion of upscaling as we alter the
underlying problem.
When we discuss the behaviour of our code, we have to make it crystal clear which
type of upscaling model we use. We also have to be aware that either model might
be ill-suited:
• If we work on a machine with limited memory and do not connect multiple of these
machines, weak scaling does not make sense beyond a certain point. At the very least, it is capped.
• If we work with an application where the accuracy depends on the problem size,
strong scaling might make limited sense. Users will use a bigger computer to solve
a bigger problem.
• If we work with a code that suffers from too much parallelism overhead, users will
never scale up further. They will instead switch to other/bigger problems.
Depending on the model used, we obtain two different formulae for the speedup:
$$ S_{\text{strong}}(p) = \frac{t(1)}{t(p)} = \frac{t(1)}{f \cdot t(1) + (1-f) \cdot \frac{t(1)}{p}} = \frac{1}{f + (1-f)/p}, \quad \text{vs.} $$
$$ S_{\text{weak}}(p) = \frac{t(1)}{t(p)} = \frac{f \cdot t(p) + (1-f) \cdot p \cdot t(p)}{t(p)} = f + (1-f) \cdot p. \qquad (15.3) $$
Algorithmic complexity
Whenever we compare timings for different problem sizes, we have to be very careful
with the algorithm’s complexity. If we measure runtimes for two different setups, we
cannot directly compare them to say which one is better. Let our particle setup serve
as an example: A straightforward implementation is in O(N^2), i.e. if we double the
number of particles our runtime goes up by a factor of four. As a result, the t(1)s
in the formula above are not the same and we cannot eliminate them directly by
inserting one line into the other.
Fixing this is relatively easy once you know the complexity of your algorithm. A
popular approach is not to report on runtimes, but to report on relative runtimes: If
one setup is twice as big as the other and if our algorithm is in O(N^2), we divide the
time for the bigger setup by four and then reconstruct f . In a particle context, we
now report on “force computations per second” rather than runtimes. If we have a
different complexity, we need a different scaling.
Determine f
Before you scale up a code, it is reasonable to determine f experimentally.
Determining f gives you a good indication of how well your code should scale for
large core counts. If you do not match your predictions for higher core counts, you
either subdivide the problem too rigorously and suffer from parallelisation administration
overhead, or you have a flaw in your parallel implementation.
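A hedged sketch of how such an experimental determination can work with the strong-scaling model: measure t(1) and t(p) for one moderate p, compute S(p) = t(1)/t(p), and solve Amdahl's law for f:

$$ f = \frac{p/S(p) - 1}{p - 1}. $$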
15.4 Data Presentation
There are different ways to present speedup data. A convenient way is to plot the
speedup, i.e. the quantity (15.1), over the processor/core count p. In a typical plot of
this kind, a dotted line represents the linear speedup. All data is calibrated towards the (1, 1)-
point, as S(1) = 1 by definition. You have to decide carefully whether to connect
your measurement points with a (trend-)line (cmp. Appendix F.2).
A typical plot flattens out or plateaus with increasing p. This saturation effect
simply means that the problem starts to become non-decomposable and thus adding
further processing units p does not yield any runtime improvement anymore. The
stagnation in performance is a clear materialisation of strong scaling behaviour.
Two major issues with this plot are that
1. we do not see any absolute times, i.e. we cannot compare different trends;
2. we cannot plot any experiment that cannot be solved on one
compute unit in reasonable time.
As an alternative to speedup plots, we can plot the efficiency (15.2). The fundamental
problems remain the same.
If you want to give your reader an idea about real timings plus want to compare
different setups where some setups are too big to fit onto a single compute unit, you
can rely on some spider web graphs:
• On the one hand, we do not present any speedup or efficiency, but we plot (nor-
malised) timings for different compute unit counts.
• On the other hand, we present data for different experiments in one plot.
In the example above, we assume that experiment b is roughly four times as expen-
sive as a, while c is four times as expensive as b, and so forth. The calibration,
i.e. what “expensive” means, can be tricky and it requires a careful description in the
caption/text to make it clear to the reader what quantities are displayed. Once this
is mastered, we directly visualise interesting properties: If we connect the measure-
ments for one problem size choice (here labelled as (a), e.g.), we obtain a classic
strong scaling curve. It should show a linearly decreasing trend. If we connect the
measurements of different problem sizes which correlate—if (b) is four times the
compute cost of (a), (c) four times of (b), …, we might connect the dots for p = 1,
p = 4, p = 16—a flat line would indicate perfect weak scaling.
Often, scalability plots rely on a logarithmic scale. Ensure however that you pick
a log scale for both the x-axis and the y-axis: perfect scaling is called “linear
scaling” (cmp. the wording above), and you want linear scaling to show up as a straight line.
OpenMP Primer: BSP on Multicores
16
Abstract
We introduce OpenMP for multicore programming through the eyes of the BSP
programming model. An overview of some terminology for the programming of
multicore CPUs and some syntax discussions allows us to define well-suited loops
for OpenMP BSP. Following a sketch of the data transfer at the termination of the
bulk, we abandon the academic notion of BSP and introduce shared
memory data exchange via critical sections and atomics. Remarks on variable
visibility policies and some scheduling options close the chapter.
If you cook a proper meal following a recipe, then the recipe consists of steps.
Your cookbook might contain paragraphs like “fry A in batches and put aside,
then add B and put everything in the oven; after that, take C and …”. This is a
sequential model which is optimised to make the best out of your time. If you
have to do the same dish for your family’s Christmas dinner (with enough space
in your kitchen and lots of people who volunteer to help), your first step might
be to rearrange these instructions such that you can all work in parallel.
If you write down the cooking instructions for this particular meal onto a fresh
sheet of paper, you have two options: You can stick to the original recipe and
add annotations who does what, i.e. which steps run in parallel. Alternatively,
you can write down one recipe per helper.
Writing parallel code typically comprises three steps:
1. Write down the code in such a way that it can be executed in parallel.
2. Make the code run in parallel by inserting the right directives.
3. Do some performance analysis and engineering.
Often, the three steps are part of an iterative process. The algorithm engineering
of the first step is beyond the scope of this text. We focus primarily on the second
step. However, any algorithm engineering, i.e. rewrite of code such that it exhibits
parallelism, requires a sound understanding of what is possible on a machine, i.e. what
is possible in step two.
For the “make it run in parallel” task, we rely on OpenMP. OpenMP originates
from a multicore/BSP mindset. Although it has emancipated itself from it, the BSP heritage
is still visible within the standard. Finally, new generations of OpenMP support
GPGPUs besides classic multiprocessors.
Our strategy thus reads as follows: Since we are already familiar with OpenMP’s
vectorisation capabilities, we first introduce OpenMP within the classic multicore
BSP context. The aim of our studies is to sketch the OpenMP mindset, to allow a
quick start and to draw connections to more general computer science patterns. The
goal is not to provide a comprehensive OpenMP introduction or description—there’s
enough (better) material available. Starting from the introductory BSP, we discuss
more advanced concepts—mainly OpenMP’s tasking—which not only generalise
and eventually emancipate from BSP but also allow us to realise functional decom-
position. Finally, we return to BSP parallelisation to sketch how this fits to modern
GPGPUs.
The core compute routines in Chap. 3 are the force calculations and the particle
update. We can straightforwardly parallelise them with BSP:
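(The listings from Chap. 3 are not reproduced here; the following is only an illustrative sketch. The arrays pos, velocity and force, the constants numberOfParticles and timeStepSize, and the toy force kernel are assumptions, not the book's actual data structures.)

#pragma omp parallel for
for (int i=0; i<numberOfParticles; i++) {
  force[i] = 0.0;
  for (int j=0; j<numberOfParticles; j++) {
    if (i!=j) {
      force[i] += pos[j] - pos[i];   // toy stand-in for the real force kernel
    }
  }
}

#pragma omp parallel for
for (int i=0; i<numberOfParticles; i++) {
  velocity[i] += timeStepSize * force[i];
  pos[i]      += timeStepSize * velocity[i];
}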
1. There are two supersteps. These are the two outer for-loops.
2. Whenever the code hits a superstep, it spawns three OpenMP threads.
3. OpenMP automatically splits up the for-range into three chunks.
4. Once a thread has done its part of the iteration range, it waits for the others to join
again. This synchronisation happens implicitly as we hit the end of the for-loop.
Run multithreaded code. Before we dive into details, we discuss how to run
OpenMP multicore code.
An executable translated with OpenMP should automatically link to all required
libraries. The software that’s integrated into your code by the compiler to take care
of all of your parallel needs is called the runtime. This runtime has to know how
many threads it should use on your system. For this, it evaluates the environment
variable OMP_NUM_THREADS. On bash, type in
export OMP_NUM_THREADS=3
before you start your code. It makes OpenMP use three threads. By default, most
systems set OMP_NUM_THREADS to the number of logical cores; this includes hy-
perthreads.
Fig. 16.1 Schematic illustration of an example execution of the first simple OpenMP parallelisation
in a BSP sense.
Upon entering the pragma’s block, which is the loop following it
in our example, OpenMP fires up OMP_NUM_THREADS threads. The thread count
is set at runtime, which is great for scalability studies and also makes your code
platform-independent: If you run it on a different node with another core count, no
recompile is required. At the end of the block, all threads of a team join again.
Definition 16.1 (Thread team) Whenever OpenMP hits a parallel region, it spawns
one thread team with OMP_NUM_THREADS threads.
While modern OpenMP is very good at administering the teams, it might make sense
to distinguish the team creation from the actual work distribution. Historically, the
team’s threads were literally destroyed at the end of the region and re-created later
on. This induces overhead. Today, OpenMP implementations often let threads idle
and reuse them, i.e. wake them up when you hit the next parallel region.
Loop split. #pragma omp parallel for is a shortcut for #pragma omp
parallel (to spawn/fork the threads) followed by #pragma omp for. The
latter is OpenMP’s pragma for for-loops. It ensures that the loop range is properly
split up and the loop iterations are distributed among the threads of the team. Such
combinations of a for and a parallel directive are called combined constructs in
OpenMP.
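As an illustration (a sketch; N and doWork() are placeholders):

#pragma omp parallel for          // combined construct ...
for (int i=0; i<N; i++) {
  doWork(i);
}

// ... behaves like a parallel region with a nested for directive:
#pragma omp parallel
{
  #pragma omp for
  for (int i=0; i<N; i++) {
    doWork(i);
  }
}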
All threads of the team have to reach such a synchronisation point, a barrier, before any of them continues. The barrier thus
is a collective operation.
The end of a parallel region implicitly introduces a barrier, i.e. you can add a barrier
statement or leave it out—OpenMP will always add one. This barrier precedes the
actual join.
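Should you ever need an explicit barrier, it is a one-line pragma (a sketch; both function names are placeholders):

#pragma omp parallel
{
  computeLocalContribution();   // runs concurrently on every thread of the team
  #pragma omp barrier           // nobody proceeds before the whole team has arrived here
  combineNeighbourResults();    // safe: all local contributions exist by now
}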
While barriers are rarely used in OpenMP applications, their existence (and the
discussion of their existence) emphasises that OpenMP has been specified with a
MIMD architecture in mind. That is, the individual threads run asynchronously, and
you never know which thread processes which statement at a certain point. In partic-
ular, the threads are not in lockstep mode. OpenMP’s simd statement is explicitly
the opposite thing: Here, all steps run in lockstep. Indeed, we can (logically) interpret
a simd region as a sequence of instructions where each statement is followed by a
barrier.
OpenMP advocates a pragma-based parallelisation. However, it defines a few
library functions as well. I never use them, though sometimes they are good for
debugging, e.g.
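Two such runtime library routines, handy for a quick debugging printout (a minimal, self-contained sketch):

#include <cstdio>
#include <omp.h>

int main() {
  #pragma omp parallel
  {
    // omp_get_num_threads() returns the team size, omp_get_thread_num() the id of the calling thread.
    std::printf("this is thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}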
OpenMP functions
Try to avoid OpenMP’s functions. Work with the pragmas only.
16.2 Suitable For-Loops
To identify loops that are well-suited for OpenMP parallelism, I recommend to sketch
the control flow via a graph first. OpenMP regions become apparent by a graph that
fans out and fans in again.
1. Countable. We have to know how often we run through the loop when we hit it
for the first time. This allows OpenMP to distribute a loop over multiple threads.
Something like a while loop where we don’t know how often we run through it is
ill-suited for BSP parallelism.
2. Well-formed. Analogous to SIMD loops, a canonical for-loop once more has to
have a simplistic counter increment/decrement and the ranges may not change
at runtime (see countability requirement). You may not jump out of a loop and
you should not throw an exception, as OpenMP in general does not deal with
exceptions.
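A hedged illustration of the two requirements (N, a, f(), residual, tolerance and doIteration() are placeholders):

// Countable and well-formed: the trip count is known when we hit the loop.
#pragma omp parallel for
for (int i=0; i<N; i++) {
  a[i] = f(a[i]);
}

// Not countable: we do not know how often we iterate, so OpenMP cannot split the range.
while (residual > tolerance) {
  residual = doIteration();
}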
On top of the “hard” constraints formalised by the term canonical form, loops have
to exhibit a reasonable cost. Launching a parallel loop is expensive. Different to
compiler-based vectorisation, OpenMP is prescriptive. Analogous to its simd di-
rective, an annotated loop will definitely be parallelised. It is thus important that you
are sure that your loop chunk is reasonably expensive.
Compared to the SIMD requirements, there are delicate differences of a canonical
loop for multithreading compared to a loop that can be vectorised: Both flavours
of canonical require the loop count and range to be fixed and simplistic such that
OpenMP can straightforwardly split it up. It is particularly important that there are
no dependencies between the loop body and the loop count. You may not alter the
loop counter within the loop.
Different to the vectorisation, “canonical” for multithreading is less restrictive
for the loop body. Complex branching and function calls are allowed. This is not a
surprise given that we execute the loop on different threads, i.e. full logical von Neu-
mann cores. Each thread has its own program counter and speed, and can arbitrarily jump
through the source code. This is a direct consequence of a programming paradigm
made for MIMD machines. There’s a further small difference: Parallel regions can
be nested.
parallel block, if there is no code between the outer and the inner block. It is closely
nested, if there is no other parallel region nested between them: If a SIMDised loop
A is embedded within a parallel loop B within a parallel loop C, A is closely nested
within B, but not within C. Finally, we formally distinguish worksharing regions
from regions that are not worksharing. A worksharing region is one that makes
OpenMP run different instructions on different threads. An omp parallel alone is not
worksharing, as it only fires up threads. A for however splits up work.
SIMD loops can be nested within parallel regions, but not the other way round. If you
use a parallel loop within another parallel loop, this works, but you pay for it in
terms of administration (scheduling) overhead; though every new OpenMP generation
becomes better at handling this from a performance point of view.
Both SIMD and MIMD parallelisation rely on source code with sufficient cost, as
the parallelisation comes with overhead. The notion of cost however differs. OpenMP
parallelisation suffers from a thread start-up cost. After the thread kick off, the
underlying loop range is split up and distributed over the threads. As a consequence,
either the loop range has to be vast—individual threads then are assigned a significant
chunk of the iteration space—or the individual loop iteration has to be expensive. In
both cases, the cost per thread is high and the parallelisation overhead is amortised.
In a SIMD environment, thread divergence reduces the effective average cost of
a loop—if only half of the registers are really used, we can read this as “half of the
compute cost per instruction”. It renders a loop less likely to benefit from SIMDisation.
parallel for statements are not sensitive to complex if statements as long as
the cost per loop iteration remains high or each thread is given a big enough range
to run through. Since the MIMD parallelism does not rely on lockstepping, thread
divergence is not an issue and techniques like masking or blending do not play a role.
Yet, thread ill-balancing—if one thread needs more time than the others—eventually
harms the efficiency.
16.3 Thread Information Exchange
In the pure, academic BSP model, threads do not communicate while they compute.
They run totally independently of each other. Only at the end, they synchronise through
an (implicit) barrier, i.e. they wait for each other. This is where they also combine
their individual results. We make threads work on their local (partial) results.
Definition 16.5 (Race condition) A race condition is a situation where two threads
access a variable in a non-deterministic fashion without proper orchestration. One of
them hence “wins”. They “race” for superiority. We distinguish three types of race
conditions:
1. Read-write race condition. Thread A writes to a location and thread B reads from
this location. If thread B requires data from A through a shared variable yet A is
too slow (or B too quick), then B reads the variable before A has actually dumped
the result.
2. Write-read race condition. Thread A writes to a location after thread B reads from
this location. If thread A is quicker than B, then A has overwritten the variable
both have access to before B fetches the result, i.e. A feeds B with new data though
B has been interested in the old value.
3. Write-write race condition. Both A and B write to a variable. Whoever is the slower
thread overwrites the other result and is therefore the “winner”.
As OpenMP is constructed around a shared memory mindset, i.e. all data are by
default visible to all threads, the example below produces data races:
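A sketch of such a racy loop (the arrays x and y and the length of 20 follow the discussion below):

double r = 0;
#pragma omp parallel for
for (int i=0; i<20; i++) {
  r += x[i] * y[i];   // unprotected update of the shared variable r
}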
The code determines an inner scalar product (stored in r). Let’s assume this code
is executed on two cores and OpenMP splits up the loop range into 0–9 and 10–19.
There are multiple data races (Fig. 16.2): If both cores add something to r, they first
have to bring r into their register. They then add x[i]*y[i] to the register content
and write the result back to memory. If thread A and B write concurrently to the
memory location and if B writes late, then the contribution of A will be overwritten
and consequently lost. We have a classic write-write race where we don’t know whether
A’s or B’s result “will rule”. Furthermore, there’s a write-read race (B should have taken
the result from A).
Fig. 16.2 Illustration of a parallel scalar product evaluation with data races. If we assume that the
arrays x and y from r = x · y, x, y ∈ R^8 are split into two times four on a dual core machine, then
both cores load the value of r from the memory as well as x[0] and y[0] or x[4] and y[4],
respectively. They multiply the x and y components. After that, they have to write back r at one
point. As we lack any synchronisation, there is a risk that one core overwrites the result from the
other one.
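The snippet discussed in the following paragraph is a critical section; a minimal sketch of one protecting an update of a shared variable x (the section name updateX and the helper function are illustrative assumptions):

#pragma omp parallel
{
  double myContribution = computeContribution();   // hypothetical thread-local work
  #pragma omp critical (updateX)
  {
    x += myContribution;   // only one thread at a time may execute this block
  }
}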
If we use a code snippet like the one above within a parallel environment of OpenMP, no two
threads can concurrently modify x. OpenMP does not force us to name a section
(the name goes in round brackets), but any unnamed critical section blocks all other (unnamed)
sections globally. Two critical sections can be in totally different branches and still
exclude each other. Nested critical sections allow us to realise complex exclusion
logic such as “A can run if B doesn’t run, but B may not run concurrently to C”.
They also allow us to produce deadlocks.
Deadlock means that your program does not progress anymore (Fig. 16.3). A dead-
lock often is kind of the counterpart of a data race. Where a data race results from
a lack of data access protection, a deadlock results from too aggressive protection.
The easiest way to find a deadlock is to use a debugger, to kill the execution once
it hangs, and then to study the backtrace. Sporadically occurring, non-deterministic
deadlocks however remain nasty.
The concept of different critical sections in different code parts works, once again,
due to the fact that OpenMP does not imply lockstepping. The individual threads run
through “their” code independently. Indeed, a real critical section is difficult to realise
in a lockstepped environment, as critical sections can protect a whole block. You can
mask out all other threads besides the main thread, but there is no way to stop some
lockstepped threads from progressing through the code; non-trivial workarounds are
required.
Fig. 16.3 Classic deadlock setup: A thread locks one critical section (dark red), and starts to wait
for permission to enter a second one (lighter red). This second section however has been entered by
another thread before (dark green) which in return waits for permission to enter A’s section (light
red)
The code snippets above use curly brackets even though they protect only a single statement
each. The curly brackets are syntactic sugar. The code semantics are preserved without
them. Unfortunately, critical sections are slow. For faster “critical access”, i.e. race-
free access, to a single built-in variable, we switch to atomics:
Definition 16.8 (Atomics) Critical sections are expensive. Administering them in-
duces severe overhead. If one individual variable access is to be protected, atomics
are a better option. Atomics work only for primitive types such as integers.
Semantically, atomics are not different from critical sections around a single state-
ment. However, they are significantly faster as they are natively supported by the
hardware. Chips have special “wires” for atomics.
As atomics apply only to a single statement, there is no way to label them. Actually,
the system uses the variables’ locations in memory as label, i.e. to identify two
atomics which exclude each other. Atomics wrap around one single variable access.
You can read from a variable (atomic read), in which case the read from x is protected;
you can write to a variable (atomic write); or you can update it (atomic update).
In the latter case, we have a read-modify-write, i.e. the variable is read, incremented and
written back; all of it is protected. The default clause of an atomic, i.e. the one used if
you specify none, is update.
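A sketch of an atomic-protected scalar product (x, y and size as in the earlier snippets; the atomic relies on the default update clause):

double r = 0;
#pragma omp parallel for
for (int i=0; i<size; i++) {
  double tmp = x[i] * y[i];   // the only genuinely concurrent work
  #pragma omp atomic update
  r += tmp;                   // protected read-modify-write of the shared result
}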
Such a loop is a valid implementation of the reduction, as it protects the access to the result r with
an atomic. It is not fast, however. Even though atomics themselves are not expensive,
this code effectively serialises the scalar product computation. Only the evaluation of
the temporary results in tmp exhibits some concurrency. If we replaced the atomic
with critical, we would get the same result, but the code would be even slower.
The variant sketched below is significantly faster. Here, we first fork our threads and create a new helper variable
myR. The following OpenMP for then splits up the loop range, and we accumulate
the thread-local result over a chunk of x and y within our local helper. Once an
individual loop has terminated—this part runs totally in parallel on the individual
threads—we add the local accumulation result to r. This last step is protected by an
atomic.
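A sketch of this pattern (consistent with the description above and Fig. 16.4; x, y and size as before):

double r = 0;
#pragma omp parallel
{
  double myR = 0;               // thread-local accumulator
  #pragma omp for
  for (int i=0; i<size; i++) {
    myR += x[i] * y[i];         // runs fully in parallel, no synchronisation
  }
  #pragma omp atomic update
  r += myR;                     // one protected roll-over per thread
}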
The sketched data flow from Fig. 16.4 is an all-time classic of parallel computing.
It is nevertheless ugly, as it requires us to alter the original code significantly. This
does not conform to the OpenMP philosophy where parallelisation should be an add-on
and code should remain unchanged and working if the compiler does not understand
OpenMP. The annotation standard indeed provides an alternative which is as efficient
yet a one-liner: reduction clauses. Before we discuss this particular type of all-to-
one communication (all threads accumulate their partial results into one variable), we
however first discuss variable visibility; a topic we have implicitly taken for granted
so far.
Fig. 16.4 Illustration of a scalar product realisation where each thread accumulates a local result
in myR. This local result finally is added to the global outcome. The roll-over is protected by an
atomic
The discussion so far has assumed that all variables within our loops are shared if they
are declared outside of the loop body. This is due to our sloppy remark that we work
within a shared memory environment. Upon close inspection, the assumption that all
data not declared within the loop body are shared is already wrong: If we parallelise
a for-loop with OpenMP, each thread gets its own iteration range. That means, if we
run a statement for(int i=0; i<N; i++), each thread has its own notion of
(a modified) N plus initial value for the loop counter, and, in particular, its own i.
If all threads shared all variables, all would use the same i. As the OpenMP code
yields the correct result, we conclude that OpenMP has, without our interference,
not shared these variables naively. Indeed, OpenMP distinguishes different variable
visibilities.
Definition 16.9 (Shared variables) By default, all variables that have been declared
outside of the OpenMP region are shared. Shared means that all threads have access
to the same memory location. Loop counters and loop range thresholds are excluded
from this convention.
There is no need to tell OpenMP explicitly that variables declared outside of the
OpenMP region are shared, though you can do so if you want:
double r = 0;
#pragma omp parallel for shared(r)
for(int i=0; i<size; i++) {
r += x[i] * y[i];
}
Private variables are variables that are owned by the respective thread. They are
not visible to other threads. We have two options to create private variables: If we
declare a variable within the parallel region, it is a local variable and hence private,
i.e. each thread creates it on its own.
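A minimal sketch of this first option, using the variable name myPrivateVariable that the following sentence refers to:

#pragma omp parallel
{
  // Declared inside the parallel region: every thread creates its own instance.
  double myPrivateVariable = 0.0;
  // ... work with myPrivateVariable ...
}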
This ensures that each thread has its own working copy of myPrivateVariable. We
use this technique when we define myR for the reduction (Fig. 16.4). Though this is
a natural way to create thread-local variables in C, introducing helper variables in
the original code just to ensure that it is not broken by a pragma is not 100% in line
with our OpenMP philosophy. We want to stick as close to a serial code as possible
and add all parallelisation through pragmas only.
Indeed, OpenMP allows us to specify private variables explicitly within pragmas.
This means that the variable labelled as private is not used once OpenMP kicks in.
Instead, OpenMP ensures that one instance of the specified variable is created per
thread. In the example
double r = 0;
#pragma omp parallel for private(r)
for(int i=0; i<size; i++) {
r += x[i] * y[i];
}
nothing happens to the actual variable r declared in the first line. OpenMP hits
the fork instruction, splits up the loop range, and creates a new instance of r per
thread. Each thread now works against its own instance. Once the threads join again
(implicitly at the end of the for-loop), all these local result copies are destroyed and
lost. In the above example, the private clause breaks the code semantics.
To understand the implications of this definition, we keep in mind that there are
no default values in C: Whenever OpenMP creates a thread-local variable, it is not
properly initialised. It holds garbage. If we want a private variable to be initialised,
we have to do this ourselves within the OpenMP parallel block, e.g. set r=0 there. Yet, OpenMP
allows us to declare a variable as firstprivate(r). In this case, each thread-
local variable is initialised with a copy of the original variable. There is also a
lastprivate. I’ve never used the latter.
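A brief sketch of firstprivate (the variable scale, the helper computeScale() and the array x are illustrative assumptions):

double scale = computeScale();   // set up once, before the parallel region
#pragma omp parallel for firstprivate(scale)
for (int i=0; i<size; i++) {
  x[i] *= scale;   // every thread works on its own copy of scale, initialised to the original value
}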
Visibility modifiers
Prefer private and firstprivate over an explicit creation of thread-local
variables. They help to keep your code the same and, hence, valid even if no OpenMP
is available.
Loop counters are by default made private, and it is the OpenMP runtime which
initialises them properly. The same holds for the lower and upper loop bounds. It is
the runtime that decides how to initialise the variables. It knows at runtime how big
the actual loop count is and how many threads are to be used.
Parallel for
Whenever you write
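a loop of the canonical shape sketched here (iMin, iMax and the body are illustrative placeholders, not a specific listing from the text),

#pragma omp parallel for
for (int i=iMin; i<iMax; i++) {
  // ... loop body ...
}

then OpenMP makes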
• iMin and iMax private. Once the thread-local copies are created, it ini-
tialises the values appropriately. Logically, the values are something closer to
firstprivate, as OpenMP automatically sets them to the right value. In this
particular case, this “right” value is not a plain copy of the iMin’s or iMax’s
original value, but one that provides meaningful boundaries;
• the counter i private. The C++ expression automatically sets it to the right
iMin per thread before we run through the loop body for the first time.
So far, all of our OpenMP threads run asynchronously. Once the fork is done, each
thread runs through its source code with its data. At the end of the OpenMP region, the
threads all join again. They synchronise with each other. OpenMP’s synchronisation
at the end of a BSP superstep is implicit, i.e. nothing is to be done on the programmer’s
side.
A barrier is the simplest collective we can think of. It synchronises threads yet does
not allow them to exchange data. If threads exchange data towards the end of their
lifetime, we perform a reduction. Reductions are an advanced type of a collective,
which we can either implement manually or realise through built-in mechanisms.
They are more specific than a barrier as they synchronise, plus they move information
around, plus they compute something:
Addition, multiplication or the minimum or maximum function are frequently
used binary operators within reductions. However, we can also use binary/logical ors
and ands. While we can insert barriers literally everywhere, OpenMP’s reductions
close the BSP superstep. You cannot insert them manually at an arbitrary place
within your loop’s code.
With the reduction, we can finally write our scalar product routine efficiently:
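A sketch of such a reduction-based scalar product (x, y and size as in the earlier snippets):

double r = 0;
#pragma omp parallel for reduction(+:r)
for (int i=0; i<size; i++) {
  r += x[i] * y[i];   // each thread accumulates into a private copy of r; OpenMP merges the copies at the join
}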
Nondeterministic output
If you deploy the reduction’s realisation to the OpenMP runtime, you have to accept
that the output is nondeterministic. We know that the order in which we perform
a reduction affects the outcome in floating point precision. With OpenMP’s reduc-
tion, we have no influence on the order, and the order might change as users set
OMP_NUM_THREADS to different values. For most of the codes I’ve worked with,
this is a negligible effect. There are however application areas, where you should, in
theory, keep this nondeterminism in mind.
16.4 Grain Sizes and Scheduling
Scheduling algorithms for iteration ranges work internally with grain sizes.
Definition 16.12 (Grain size) The grain size of a decomposition is the smallest
number of parallel work items (loop iterations) that are grouped into one work ag-
glomerate for a thread.
The grain size used by a scheduler determines the granularity of the work decompo-
sition. In theory, a loop with N items could be decomposed into N work packages
that all run in parallel (Fig. 16.5). The grain size then would be one. However, any
decomposition comes along with an overhead: Work packages have to be managed,
synchronised, tracked. Therefore, OpenMP usually uses a grain size that is bigger:
Fig. 16.5 Schematic illustration of static scheduling over a loop with 11 entries with a grain size of
one over three threads. The work is reasonably balanced (4 vs. 4 vs. 3) over the work queues of the
three threads. If the seventh entry (assigned to thread 2; marked) however needs significantly more
compute time than the other work packages, we run into ill-balances. With dynamic scheduling,
the thread three in such a case might steal a work package originally assigned to thread two (red
dotted arrow)
With static scheduling, the split-up of the iteration range
happens when the code hits a loop. The distribution is not changed anymore at
runtime.
The grain size of the partitioning hence is max{N / p, 1}. Most implementations
follow the following pattern: Each thread (spawned by parallel) has a queue
of work items. When the code hits a parallel for, it creates work packages
of grain size. They are contiguous segments of the loop iteration range. Next, it
distributes these over the thread queues. The threads take their work package from
their queue and process their iteration range.
Users can alter the grain size by adding a schedule clause with an explicit chunk size
(sketched below) to their loops. This constrains the scheduling and ensures that no chunk of the iteration
range is smaller or significantly bigger than 20 entries. We explicitly specify a value
for the grain size.
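One way to express such a grain size in OpenMP is the schedule clause's chunk size (a sketch; the clause choice is an assumption, the value 20 mirrors the text, and doWork() is a placeholder):

#pragma omp parallel for schedule(static, 20)
for (int i=0; i<N; i++) {
  doWork(i);   // placeholder loop body
}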
This implies that either more than p chunks are distributed among the p threads—
which is an artificial overdecomposition which in turn can lead to ill-
balance—or it means that we use a smaller number of threads than what is actually
available.
Another application for a manual grain size is that you might find out that paral-
lelisation pays off if and only if the underlying for loop is sufficiently big. In this
context, OpenMP’s if is worth mentioning:
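A sketch of the if clause (the threshold of 10000 is an arbitrary illustrative value, doWork() a placeholder):

#pragma omp parallel for if(size > 10000)
for (int i=0; i<size; i++) {
  doWork(i);   // only runs in parallel if the loop is big enough
}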
It switches OpenMP dynamically on or off. You can put almost any boolean
expression into the if; in particular one involving the array size. As an alternative to the if condition
or the grain size, you can also specify the exact number of threads used by a loop. I do
not recommend the latter—a hard-coded thread count prevents users from modifying/tuning
the parallel environment via the command line and makes your code less platform-
independent.
The grain size for static partitioning helps us to eliminate unreasonable small work
units per thread. Another problem arises if some loop iterations last significantly
longer than others. Assume the first ten iterations of a loop with 20 elements last ten
times longer than any of the remaining ten iterations. In this case, static partitioning
with two threads will yield a workload distribution of 10 vs. 100, i.e. an extremely
ill-balanced one. To eliminate this problem, you can use dynamic scheduling:
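A sketch (the syntax mirrors the static variant above; the chunk size of 2 follows the text, doWork() is a placeholder):

#pragma omp parallel for schedule(dynamic, 2)
for (int i=0; i<N; i++) {
  doWork(i);   // iterations of strongly varying cost
}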
Here, the problem is cut into chunks of 2 (feel free to use another constant) which are
initially distributed evenly among the work queues. So far, the scheduling does not differ
from static scheduling. However, as soon as one queue is empty, the corresponding
thread is allowed to steal a work package from another queue. The code starts with
some initial estimate what it thinks might be well-balanced and dynamically balances
work packages after that. Obviously, this dynamic balancing is not for free.
Dynamic scheduling
Use dynamic scheduling if you are sure that the workload per loop iteration fluctuates
significantly and if the individual work packages are reasonably expensive.
We summarise that we have two types of scheduling flaws: Imbalances resulting from
inhomogeneous workloads have to be tackled via dynamic scheduling. Inefficiencies
due to too small work units have to be tackled via appropriate guarding mechanisms,
i.e. minimal grain sizes, ifs, or explicit restrictions on the number of threads.
OpenMP provides a third scheduling flavour called guided. It is dynamic yet works
with exponentially decreasing subproblem sizes. The ones distributed initially are
large, the next ones only half of the size and so forth. This way, it tries to bring
together the best of two worlds: the large grains scheduled initially ensure that not
too much overhead is caused by over-decomposition. The small grains scheduled
finally ensure that some small work packages remain available that can be used to
balance out different threads.
Future OpenMP generations will likely offer further bespoke scheduling variants
and tailoring opportunities.
• OpenMP realises the fork-join model through the omp parallel pragma. The
behaviour of this pragma is controlled via the environment variable OMP_NUM_THREADS.
• Loops are automatically split and OpenMP takes the thread count and loop range
into account. With the scheduling and grain size information as well as the if, we
can control how this splitting behaves.
• Synchronisation and orchestration between threads is coordinated via collectives
and critical sections plus shared memory and atomics.
• Reductions can be tailored through clauses and automatically pick the right neutral
element to initialise the thread-local reduction variable.
• Important visibility modes of OpenMP are private, shared and
firstprivate.
Shared Memory Tasking
17
Abstract
Starting from a brief revision of task terminology, we introduce task-based pro-
gramming in OpenMP and the required technical terms. After a summary of
variable visibilities for tasks—all concepts are well-known from the BSP world
though some defaults change—we conclude the task discussion with the intro-
duction of explicit task dependencies.
You are given the following code from a bigger project. Your colleagues ask
you to explain what the programmer, who since left the company, wanted to
achieve:
The growing degree of parallelism on the hardware side implies that a sole BSP approach underperforms.
No matter how well we parallelise, we quickly run into the situation where the serial
part (in an Amdahl sense) between BSP supersteps dominates the code’s efficiency.
Definition 17.1 (Task) Let a task be a sequence of instructions that we would like
to run on one thread. A task in our manuscript is atomic: Once started, there are no further dependencies on other tasks unless these tasks are child tasks, which have to complete before we continue.
The control flow graph (Sect. 7.2) of a problem helps us to identify tasks. Yet, the
decomposition of a problem into tasks is seldom unique. The criteria for what makes a good task are wishy-washy. In the end, the decision which instructions to combine into one task (the task granularity) is up to the programmer.
Once we group our control flow graph into sequences or blocks of instructions
and thus identify tasks, these tasks inherit the data dependencies: If a task A hosts
an instruction a, a task B hosts an instruction b, and there is a data flow dependency from a to b, then task B has to be executed after task A.
Definition 17.2 (Task graph) A task graph is a directed acyclic graph (DAG): Each
task (node in the DAG) has a clear set of preconditions represented by incoming edges, while the graph over the tasks must not contain any cycles.
Loops in the control flow graph thus require us to break down the way we construct
our task graph into multiple steps: Whenever we create a task graph (task graph
assembly), it has to be a DAG, and we cannot map a loop directly onto a DAG. We
might therefore assemble one task graph per loop iteration, or we might unroll the
loop before we create the task graph. A task graph is a snapshot of the loop behaviour.
In this context, we emphasise that a task graph does not have to be static: We
can always dynamically add tasks to a task graph. The only constraint is that the
graph has to remain acyclic. We can only add further children to existing tasks or
further follow-up tasks that either have dependencies on previous steps or not. This
constraint helps us to formalise how tasks are deployed to the threads of our system:
Definition 17.3 (Ready task) A ready task is a task which can be computed as all
input data are available.
Once we have grouped our code logically into tasks, the tasks (including their de-
pendencies) are thrown into a task pool. Individual threads grab ready tasks from this
task pool and execute them. Throughout this execution, they might add more tasks
to the task pool. As soon as a task has terminated, the executing thread grabs the next ready task. As the termination of a task implies that data feeding into follow-up steps is computed, i.e. data flow dependencies are fulfilled, each termination may unlock further tasks,
i.e. make them ready.
As we work with a DAG, there is always at least one ready task as long as there
are tasks in the pool.
The execution scheme describes a work stealing mechanism. However, this dy-
namic scheduling approach is now phrased in terms of activities (tasks) as modelled
by the programmer rather than data regions (cmp. Sect. 16.4). The number of ready
tasks within the task pool is a lower bound on the concurrency level of the code: If
tasks themselves are parallelised—as they host a parallel for, e.g.—then the concur-
rency level of the overall pool is even higher than the ready task count. It is up to the
OpenMP implementation to decide whether it works with a centralised task pool
or a task pool per thread where task stealing occurs if and only if a local pool runs
empty.
We introduce our OpenMP task ingredients with the following code snippet:
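// A sketch of the snippet the walkthrough below refers to; the printed messages
// and the task names A, A.1, A.2, B and C are assumptions.
#include <cstdio>

int main() {
  #pragma omp parallel
  {
    #pragma omp single
    {
      #pragma omp task            // task A
      {
        std::printf("Task A\n");
        #pragma omp task          // task A.1
        std::printf("Task A.1\n");
        #pragma omp task          // task A.2
        std::printf("Task A.2\n");
        #pragma omp taskwait      // A waits for A.1 and A.2
        std::printf("Task A resumes\n");
      }
      #pragma omp task            // task B
      std::printf("Task B\n");
      #pragma omp taskwait        // S waits for A and B
      #pragma omp task            // task C: a one-liner without block brackets
      std::printf("Task C\n");
    }
  } // closing the parallel region implicitly waits for all tasks, including C
  return 0;
}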
1. The parallel section fires up our threads. Let’s assume we have four threads
(OMP_NUM_THREADS is set to four). Prior to the pragma, only one thread, the
master thread, is alive. We hit the pragma and fork the other three guys. This is
the fork-join model. We now have four threads up and running.
2. OpenMP follows a SPMD model, i.e. all threads run the same code. We next
produce a couple of tasks. However, we don’t want all four threads to produce
tasks, so we tell OpenMP that we want only one thread to continue. This is
achieved through single. We could have used master, alternatively. It has
the same effect yet ensures that the following code section is only executed on
the master (first/original) thread. In our example, we do not care: Any thread can
run the following code, as long as only one does so.
3. We create a task (let’s call it A as it prints Task A to the terminal) and add this
one to the task pool. Skip the remainder within task A’s brackets and notice that
we immediately create a task B, too.
4. Once A and B are put into the task pool, our original task (let’s call this task S due to
the single) waits for A and B. They have to terminate before S can continue.
This is achieved through the taskwait.
5. After both have terminated, the task S yields another task C. C is a one-liner as
there are no block brackets. No further tasks are spawned after that. However, the closing bracket of the parallel section implicitly makes us wait for all
tasks to terminate, so S implicitly waits for task C.
6. If a thread grabs task B, it prints Task B to the terminal. After that, the task is
complete and the thread tries to grab the next one from the pool.
7. If a thread grabs task A, it prints the message Task A. It then spawns another
two tasks A.1 and A.2 and immediately after that waits for them. Once they are
complete, it dumps the resume message and thus completes the task’s instructions.
8. We finally join all threads again.
Creating (spawning) tasks and inserting waits both tell OpenMP which task dependencies exist. To the right in Fig. 17.1, I illustrate the task dependencies as the
OpenMP runtime system might see them at a certain point: It recognises that our
program has produced four tasks so far, plus the enclosing task S. Three of them are ready, as they have no incoming dependencies, while two (task A and S) cannot continue or start at the moment.
1. The code’s structuring into task blocks defines the task control graph.
2. Whenever we make a snapshot of the system’s state, we can write down all the
existing tasks (both those assigned to threads and those pending in the task pool)
plus their dependencies. This is the task DAG. In our example it is always a tree,
but we will generalise this one later. The tasks without any incoming dependencies
in this snapshot are the ready tasks, i.e. those tasks that an idling thread can grab
and process.
3. Every task statement creates a new node within this task graph.
4. Every taskwait adds fan-in dependencies to the task graph.
Fig. 17.1 Task control graph of the code snippet (left), and task dependency graph at a certain point
(right)
Fig. 17.2 Execution pattern of example source code from Sect. 17.2: The single environment S
is grabbed by the second thread. This thread creates two tasks which are immediately grabbed by threads one and three, such that thread two hosting S has to wait for their completion. When task A
spawns A.2, the runtime decides not to put it into the task pool but instead immediately processes
this one on the current thread as it hits the taskwait instruction. Dotted lines illustrate task creation,
i.e. throwing tasks into the task pool
5. If a task terminates, its dependency edge in the graph of tasks managed by the
runtime system is removed. The task graph thus hosts both a subset of the tasks
that exist overall and a subset of the dependencies. It changes over time.
6. If a task spawns another task, both tasks are immediately ready (if we read the code
step by step). If the creating task (the one that spawns) issues a synchronisation
(either via taskwait or as it closes the parallel section), it adds a task
dependency, i.e. it becomes waiting and thus is no longer ready.
OpenMP’s task graph construction is always dynamic, i.e. your code tells OpenMP
step by step what the task DAG looks like. Since you build up the task graph incre-
mentally, OpenMP can start to distribute and execute tasks while you still build up
the overall graph.
Scheduling Point
The decision which ready task to process next is up to OpenMP. OpenMP however
relies on scheduling points in your code, i.e. steps where the OpenMP runtime system
is allowed to “think” about the next steps. It does not make further scheduling decisions arbitrarily or in parallel to your run. OpenMP implicitly reaches a scheduling
point whenever a task terminates. It is then up to the runtime to pick the next one.
Furthermore, OpenMP regards any task creation point as scheduling point. That is,
directly after the programmer creates a task, it is up to OpenMP to decide with which
task to continue. Finally, any type of implicit or explicit synchronisation provides a
further scheduling point: If you close a parallel region or if you issue a taskwait,
the scheduler takes over.
Task Interruption/Deferral
Tasks are usually not interrupted by OpenMP unless they yield explicitly. However,
the OpenMP runtime is allowed to defer a task, i.e. to interrupt it to continue with
a direct child task. The original task typically resumes on the same hardware thread unless it is explicitly made migratable. This ensures that not too much data has to be
moved between cores.
Definition 17.5 (Starvation) Starvation means that a task is ready yet is never ex-
ecuted, as the program always spawns further tasks and the scheduler makes them
“jump the queue”.
It is instructive to study what happens if we alter the code snippet from above:

• If we removed the single from our toy code, the whole task graph would be built up OMP_NUM_THREADS times.
• If we removed the first taskwait instruction, then task A spawns A.1 and A.2 but
does not wait for them. It might terminate before A.1 and A.2 terminate. There is a
second taskwait if we scroll down. This one makes the single thread spawning
A and B wait for A and B before it spawns C. As soon as we comment out the first
taskwait, task C might run in parallel with A.1 and A.2.
• If we removed both taskwait annotations, all tasks could run in any order.
• If we removed the first taskwait annotation and replaced the second one by a
taskgroup, then the taskgroup would wait for A and B plus A.1 and A.2.
By default, all tasks have the same priority, i.e. we do not distinguish which one is
important and which one is not.
If you want to impose a priority on your task, you can annotate the spawn mechanisms
with the priority(n) clause. n is an integer. The higher the value the more
important the task.
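A brief sketch (the work routines are hypothetical; the priority value is merely a hint to the runtime):

#pragma omp task priority(10)   // prefer this task when several tasks are ready
doUrgentWork();

#pragma omp task priority(0)    // default priority
doBackgroundWork();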
In principle, tasks fit the BSP mindset we have used to introduce classic
OpenMP. All data are shared, threads are forked and joined, and the same synchroni-
sation mechanisms as for BSP are available. There is a delicate difference however
when it comes to the default visibility of variables. OpenMP has three categories of
variable visibilities:
• Predetermined data sharing is a data sharing rule that derives directly from the
usage. Our loop iteration counters and ranges within a parallel for for ex-
ample are predetermined: You know that they are thread local and you cannot
change this behaviour (you can still explicitly write down their visibility but it has
to match the default, i.e. there’s not really any freedom).
• Explicitly determined visibility means that you write down private, firstprivate or shared in the pragma.
• Implicitly determined means you skip any explicit annotation and you rely on the
default rules.
In a classic loop context, shared is the default visibility for implicitly determined
sharing. For tasks, this does not make sense: If a task is spawned, it sees all the
variables defined “around it” at the moment it is created. In the absence of a task
wait mechanism, these variables might already be dead when the
task launches: Let the parent create a local variable, spawn a new task and then
terminate. The termination of the parent might mean that the local variable is long
gone by the time the spawned task is grabbed by a thread and actually executed.
Therefore, OpenMP has altered the default visibility: variables that a task picks up implicitly are firstprivate by default, i.e. each task works with its own copy, initialised when the task is created.
If you label variables as shared or you work with pointers, then multiple tasks can still access the shared memory as a kind of shared scratchboard. As a result, you might
still produce data races. Yet, all the shared memory synchronisation mechanisms
continue to work. That is, you can use atomics, and you can use critical sections to
orchestrate memory access.
Our task mechanisms so far imply that the task pool (snapshot of the task graph)
always hosts trees. Tasks either are ready, or they wait for their children to terminate.
OpenMP gives you the opportunity to model more complex task graphs. For this, it
simply uses memory locations, i.e. memory addresses:
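// A minimal sketch of explicit task dependencies; computeSomething() and
// useResult() are placeholders. The depend clauses refer to the address of x:
// the consumer task becomes ready only once the producer has terminated.
double x = 0.0;

#pragma omp parallel
#pragma omp single
{
  #pragma omp task depend(out: x)   // producer writes x
  x = computeSomething();

  #pragma omp task depend(in: x)    // consumer; ready only after the producer
  useResult(x);
}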
• OpenMP relies on a task pool into which tasks can inject further tasks.
• Task dependencies are built up through wait constructs (taskwait, taskgroup) or through explicit depend clauses.
• Tasks synchronise explicitly, implicitly, or they yield.
• Dependencies model execution order, not data flows. If we need information flow
between tasks, we use the shared memory space. Yet, the default visibility is
firstprivate and we have discussed why. For information flow between tasks, we
can use critical sections or atomics.
• If our task graph resembles BSP, we should, despite all the nice task features, use
the more traditional, BSP-style pragmas. They are faster.
GPGPUs with OpenMP
18
Abstract
GPGPUs can be read as multicore chips within the compute node which host their own, separate memory and many cores that are individually slow yet, in aggregate, extremely powerful. This power results from SIMD on a massive scale. We give a brief, very high-level overview of the GPU architecture paradigm, before we introduce OpenMP’s target blocks. The data movement between host and accelerator requires dedicated attention, while a brief treatment of collectives, function offloading and structs (classes) for GPUs closes the chapter.
Assume that a house is on fire, the next river is a mile away, and there is only
one fire fighter who can get close enough to the fire so he can throw water in.
This water is, in our thought experiment, delivered in buckets.
You have to decide how you provide this one fire fighter with buckets:
Do you hire one sprinter, i.e. a guy who can run really fast, or do you rather hire
an old-men’s football team, i.e. a bunch of slow guys?
Standard CPUs are the racehorses of computing. They are fast and versatile. GPUs
have been designed exactly the other way round: They are relatively slow and com-
parably simplistic (cmp. our discussion on lockstepping in Sect. 7.3.2). However,
they can process vast amounts of data and instructions in a very homogeneous way
in parallel.
Historically, they are designed for computer graphics/games where the same op-
eration has to be done for every pixel. If all pixels are processed in parallel, the indi-
vidual operations are allowed to be relatively slow, as long as the overall throughput
is high. We know that such a uniform processing scheme maps efficiently onto vector
processing.
The scientific computing community has a large interest in GPU architectures,
too. Uniform (vector) operations are omnipresent, and the community suffers from the end of Dennard scaling, i.e. needs to get the power consumption (frequency) down. General purpose graphics processing units (GPGPUs) have emancipated themselves from graphics and offer
vector processing capabilities in a novel way. They are additional devices that plug
into your computer and can do graphics but also general computations. Some modern
GPGPUs don’t even have a graphics port anymore, i.e. they are compute-only devices.
Definition 18.1 (Host and devices) We distinguish host and devices. The host is the
main compute unit (the CPU) which orchestrates other machine parts, i.e. all the
GPGPU accelerators.
With the increasing importance of GPUs for the mainstream (super-)computing mar-
ket and the increasing capabilities of GPUs, some developers started to articulate their
frustration that they effectively have to write two codes when they deal with GPUs:
The main code for the CPU (host) and then the functions (kernels) that they want
to run on the GPU (device). While the programming language CUDA continues to
dominate the GPU software market, more and more developers favour a pragma-
based approach. It allows them to write all of their code in one language, while they
annotate those parts that are a fit for a GPU. A compiler then can take care of the data
transfer and how to generate the right microcode for the right GPU. Furthermore,
the code continues to run “as it is” on a CPU without an accelerator.
There are two important pragma dialects for GPUs: OpenACC and OpenMP.
OpenACC is tailored towards accelerators (hence the acronym). Though it resembles OpenMP in many aspects, we stick to OpenMP in this manuscript.
Definition 18.2 (CUDA) CUDA stands for Compute Unified Device Architecture
and is a name, on the one hand, for a particular programming language. On the other
hand, it circumscribes the architecture of a GPU chip, i.e. the principles of how to
build an accelerator.
In this manuscript, we ignore the CUDA language, which is a language that is very
similar to C yet sufficiently different to make it a topic of its own. Yet, we should
have a rough understanding of how a GPU works. Though NVIDIA is not the sole
vendor of GPUs, CUDA remains the blueprint of a GPU hardware design.
18.1.1 Memory
Definition 18.3 (Unified memory) Modern CUDA architectures offer unified mem-
ory. In this case, a compute kernel can (logically) access the CPU’s memory (and
vice versa). We work with one large memory address space. Technically, the two
memories are still separate and we have on-chip memory access and off-chip access.
If the GPU faces a CPU memory request, it issues a page miss, i.e. it internally copies
data over. To allow the hardware to release your software from the cost of transferring data, you have to tell your system that you want it to store data in this unified memory when you allocate the data. NVIDIA offers specialised versions of new or alloc,
respectively, for this.
Originally, GPGPUs—I will use GPGPU, GPU and graphics cards as synonyms from hereon—were totally separate from the CPU. With improving integration,
i.e. more powerful kernel capabilities and better memory integration, this distinction
blurs.
We do not dig deeper into this idea of a logically shared memory and the in-
tegration, but instead copy data forth and back. This can either happen explicitly,
i.e. we transfer the data, or the GPU runtime and compilers can derive the required
data movements from source code and trigger data movements. We call the latter
approach managed memory. Managed memory versus manual data transfer is a decision we have to make as programmers.
Each GPU card hosts multiple processors. It can be seen as a MIMD multicore ma-
chine, as all the processors have access to the common GPU memory. Each processor
has its own control unit, i.e. runs its instruction stream completely independent from
all the other processors. If two processors communicate with each other, one pro-
cessor has to write data to the shared memory from where the other one can grab
it.
On state-of-the-art GPUs, we have around 80 of these processors. They run at
relatively low speed (typically way below 2GHz). The massive power of GPUs thus
has to come from somewhere else. Indeed, the GPU processors are called streaming multiprocessors (SMs) in NVIDIA’s CUDA terminology as they are particularly good at streaming data through the chip. AMD calls them compute units (CUs). Both rely
on a particular SIMD design flavour within their cores:
Each GPU processor hosts a certain number of threads. NVIDIA today calls them
CUDA cores. AMD speaks of “parallel ALUs” which is closer to the terminology
we have used so far. 32 or 64 are popular numbers for these CUDA cores or parallel
ALUs. With 80 such processors and 64 CUDA cores each, a GPU offers around 5,000 CUDA cores in total.
These CUDA cores are a mixture between hardware threads and full-blown cores.
They have their own registers and ALUs, but they cannot work independently. All the
cores within an SM operate in lockstep mode, even though recent GPU generations
weaken this constraint: the cores have their own data paths, yet they share the control logic.
The CUDA cores per SM are CUDA’s materialisation of vector computing.
There is one further difference compared to standard CPUs: The register count
within each CUDA core—and therefore SM—is massive. More than 200 is typical.
If an instruction on a CPU thread has to wait for data from the memory, the CPU
has two options: It can either wait or it can save all register content to fast nearby
memory that we call cache, load the registers of another (logical) thread and continue
with this one. We call this a thread switch. Thread switches are expensive.
GPUs organise their code in warps (NVIDIA’s name) or wavefronts (AMD). I use
warp most of the time for historical reasons. A warp is a set of 32 or 64, respectively,
logical threads in the source code. As discussed, the instructions of these logical
threads run in lockstep mode. AMD calls them work items. In a CPU world, the
bundling (lockstepping) of vector instructions introduces a problem: If one of them
Fig. 18.1 Left: Assume an array of 20 entries is split into two chunks. Each chunk is handled in
parallel via SIMD instructions. We run three SIMD steps on the array entries. Unfortunately, the
third operation requires the CPU to wait for some data to drop in. If we had two cores, we could
run the two chunks in parallel and each chunk would have to wait. We don’t have two cores here,
so we deploy both chunks to the same core. We oversubscribe it. The chunks are scheduled one
after the other as the chip recognises that saving all register content after operation two would last
longer than the actual wait. Right: A GPU architecture relies on oversubscription. Different to the
CPU, it can switch between the warps, i.e. chunks, very quickly. There are enough “spare” registers
to hold all temporary results from operation two. Should it have to wait for some data for the third
operation, it can consequently switch to another chunk and continue to work with this one while
the first chunk “waits” for data to drop in
has to wait for input data, the remaining ones have to wait, too, or we have to switch threads.
On the GPU, we switch threads all the time: If one of the 32 lockstepping threads
has to wait as data is not there yet, the GPU can simply switch to another warp, i.e. let
another set of threads progress. The high number of registers makes this possible.
If the GPU code is not extraordinarily complex, there is no need to back up register content, as we have so many registers per hardware thread. Up to 64 of the warps can run “in parallel”: they do not really run in parallel, they simply are active, i.e. it is up to the GPU to decide which of the 64 active warps to progress at any point (Fig. 18.1).
Thread switches are a design principle rather than a performance flaw. NVIDIA calls
the set of warps that are currently active on a GPU processor a thread block, while
AMD speaks of a work group.
Definition 18.4 (Warp interleaving) Let multiple warps (sets of threads) be active
on an SM. Some of them might be ready to run the next instructions, others might
wait for data to arrive. The SM cannot run the independent warps in parallel, but
it can always pick a warp that can progress without any further wait. It effectively
interleaves the execution of different thread sets (warps).
MISD
We can interpret warp interleaving as a realisation of MISD in the tradition of
pipelining.
Today, NVIDIA and other vendors make their cores flexible. New GPUs for example
have a different program counter per CUDA core and not one for all threads of a
warp. Therefore, they become more resilient against thread divergence. However,
high thread divergence leads to high scheduling overhead and any synchronisation
of “unlocked” threads remains expensive. The principle therefore remains:
GPUs are hardware that is built for codes with (i) a significantly higher concurrency
than what we find in the hardware and (ii) infrequent synchronisation needs.
Occupancy denotes the ratio of active warps to the maximum number of warps an SM can keep active. If you have a code with an occupancy of one, the GPU has no more resources to take up another active warp. If it stalls (as data is, for example, not available yet), then there is nothing the chip can do to mitigate this problem. The term highlights the capability of the GPU to switch warps efficiently.
Architecture review
In a way, GPUs add the von Neumann bottleneck to our code once more: Data has to be brought from the CPU memory to the CPU cores and from the GPU memory to the SMs. On top of that, data has to be moved onto the accelerator and has to be fetched back.
Creating a compute kernel, i.e. a program that could run on the GPU, is close to trivial
with OpenMP:
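// A sketch: x is a hypothetical, fixed-size array, so the compiler maps it
// implicitly; the host thread blocks until the offloaded block has finished.
double x[1024];

#pragma omp target
{
  for (int i = 0; i < 1024; i++) {
    x[i] = 2.0 * x[i];
  }
}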
With the target pragma, the compiler is told to create machine code for the GPU. Furthermore, the target
instructs OpenMP to kick off the block’s instructions on the GPU, to wait for its
termination, and then to continue. We call such a behaviour blocking, as the host
core is blocked while the GPU is busy. You cannot nest targets.
We formalise this behaviour: The target first opens a stream to the GPU. The term
stream here means that data is streamed to the accelerator. It moves data. Next, the
target pragma creates one new team:
If we write down omp parallel, we establish a team on the host and fork the
threads. If we write down omp target, we create one team on the device and fork
threads. The team will be mapped onto one SM:
Target
A target (without any further teams modifications) creates a new virtual processor
within OpenMP that behaves in the same way as the original multithreaded processor.
It is up to OpenMP to roll this new “processor” out to an accelerator.
Within the new team, we can now use parallel for to run through a loop
in BSP style; though technically all threads of a team exist right from the start and
are just masked out prior to the parallel for. Each team hosts its own thread
pool. That is, within a team, we can use dynamic loop scheduling, and we could
even try to port a whole multitasking environment onto the GPU—though it really
depends on the hardware and software ecosystem in your machine to which degree
this is supported (efficiently). In general, classic BSP with low thread divergence
will perform best.
On most systems, you cannot specify how many threads you want to use within a
team. This depends on the hardware. Therefore, there is also no OMP_NUM_THREADS
environment variable for our “virtual” processor on the GPU. You can however spec-
ify the maximum thread count via thread_limit. If this is not a multiple of the
SM’s CUDA core count, then you will leave threads idle.
It is up to OpenMP how to distribute the threads within a team over warps.
While target marks code that should go to the device, you have to explicitly tell
OpenMP if you want to use more than one team. This is achieved via OpenMP
teams. The statement can be augmented with a num_teams clause. With the
clause thread_limit you can specify a maximum number of threads per team.
OpenMP however might choose a smaller number. In any case, the number of threads
per team will be the same for all teams of a league:
Fig. 18.2 With teams distribute parallel for simd, a large for loop is first split up
(distributed) into teams. This splitting is static as each team is mapped onto one SM and the SMs do
not synchronise or load balance between each other. Once a team is given a certain loop range, it has
to process this loop range. The simd statement splits the per-SM loop range into vector fragments
which keep all SM threads busy. They form warps. The final set of warps is progressed by the SM.
As it is labelled with a parallel for, some load balancing is realised by the SM. Yet, this load
balancing is realised through warp interleaving
Definition 18.8 (League of teams) A teams block creates a league of teams, i.e. a set
of teams where each team has a distinguished master thread. OpenMP’s parallel
statement creates the host thread team, while target without a teams qualifier
creates one device team.
Teams do not really add anything new that we haven’t been able to express from a
programmer’s point of view. The essential rules around teams are straightforward:
With the CUDA architecture in mind, it is obvious why we may not have any instruc-
tion before or after a teams block within the same target: Teams are made to allow a
compiler to map logical threads onto different SMs. As SMs are more or less totally
independent of each other, we want to launch whole kernels on them. We don’t want
one kernel on one SM affect the other SMs. As teams are so restrictive, I typically
always use them in combination directly with a target, i.e. instead of opening a target
block and then immediately a teams after that, I write omp target teams. Even
more than that, I combine it with distribute directly:
1 This constraint is weakened starting with OpenMP 5, but it makes sense, for the time being, to adhere to it.
The distribute construct is the parallel for of teams: It takes the subsequent for
loop (which has to follow all rules we had set up for the parallel for loops before)
and splits up the iteration range between the teams. Though distribute accepts a dist_schedule clause, the only distribution schedule allowed here is static. That
is, teams cannot load balance between them. This is exactly what allows OpenMP
to map them efficiently onto warps and tasks (Fig. 18.2).
We conclude that a popular OpenMP GPU pragma looks similar to
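// x and N are placeholder names for an array and its (known) size
#pragma omp target teams distribute parallel for simd map(tofrom: x[0:N])
for (int i = 0; i < N; i++) {
  x[i] = 2.0 * x[i];
}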
which reflects the hierarchical decomposition of a loop over the GPU, its SMs and
(CUDA) cores on the SMs. It also illustrates that the distribute statement only distributes the loop range over the teams’ master (or initial) threads. It is the parallel for
which then launches the per-team parallel region and the further loop decomposition.
Multi-SM computations
Each target region is a task in an OpenMP sense and specifies one compute ker-
nel. Modern GPUs can run multiple compute kernels simultaneously. If you cannot
occupy all of your SMs via one compute kernel, nothing stops you from launching
different targets on your GPU at the same time.
The task analogy for GPU targets explains why scalars within OpenMP targets are by
default firstprivate. Whenever OpenMP launches the compute kernel on the
GPU, it copies scalars automatically over. The GPU task is ran—so far immediately—
and has all data locally available.
Copying scalars is cheap. Big arrays are a different story. They are not copied auto-
matically.
For big arrays, the programmer has to instruct the compiler which (parts of the) arrays
go to the GPU, remain there, and which have to come back with updated content.
By default, arrays are mapped onto a pointer (array) on the device of size zero. That
is, your code might compile if you forget to specify how an array is transferred. The
first access to such an array however will cause a crash, as the array’s pointer points
to zero and to a field of length zero.
Programmers have to annotate the target environment with data movement clauses.
1. map(to:x) instructs the compiler to create an array x on the device. The array
is initialised with firstprivate semantics. It is copied from the host to the
device before the kernel starts to run.
2. map(from:x) instructs the compiler to create an array x on the device, and
to copy its content back once the kernel has terminated.
3. map(tofrom:x) combines both clauses.
4. map(alloc:x) tells the compiler to create an array x on the device. This array
is neither initialised nor is its content copied back once the kernel has terminated.
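A sketch combining these clauses for a vector addition (the arrays and their compile-time size N are assumptions):

const int N = 1024;
double a[N], b[N], c[N];
// ... initialise a and b on the host ...

#pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
for (int i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}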
In line with OpenMP’s vision that codes should still be correct if we translate without
OpenMP/GPU features, all the clauses assume implicitly that the array x exists on
the host if there is no GPU support. The four clauses are all you need if you handle
arrays whose size is known at compile time. However, if you have a function
which accepts pointers to dynamic memory regions, OpenMP cannot know the size
of the underlying data structure.
Definition 18.9 (Arrays on devices) If you handle arrays on your device, you have
to map them explicitly. If the array is dynamic, you also have to instruct OpenMP
which regions of an array have to be mapped.
If you use dynamic arrays, the data transfer clause has to be augmented by further
range specifiers.
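For example (x and n are placeholders; the section x[0:n] specifies an offset of 0 and n items):

double* x = new double[n];   // dynamic memory; its extent is unknown to the compiler

#pragma omp target map(tofrom: x[0:n])
for (int i = 0; i < n; i++) {
  x[i] = 2.0 * x[i];
}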
This means that you can also synchronise only subranges of an array, as you specify
the offset and the number of items to be copied forth and back.
We dive into OpenMP’s data movement realisation to clarify that it works internally
with some reference counters: For all of our GPU code, OpenMP hosts one table per
GPU. For our purpose, we may assume that the table has three columns: a column
with (CPU) pointers, a column with GPGPU pointers, and the right column is an
integer counter. The table maps each pointer from the host onto a memory on the
GPU and it tracks its lifetime (Fig. 18.3).
When you translate, the compiler runs through your GPU code, i.e. your target
block, and analyses its data reads and writes: For each access to a certain variable
within a block, the target’s kick-off code that our OpenMP compiler adds automati-
cally looks up whether the GPU knows the address of this variable, i.e. whether there
is a row in the table which maps the CPU address onto a GPU address. If not, the
generated code allocates a new memory location on the accelerator, adds an entry of
its address to the GPU table, and copies the data over from the host into the newly
generated GPGPU memory. Once copied, the data reference counter is set to 1. All
memory accesses to the respective variable on the accelerator are now replaced with
the GPU memory location from the table. This does not work for data of dynamic
range. In this case, the compiler requires you to copy stuff manually via map. The
kernel itself creates a mapping entry for any memory access but makes it point to 0
on the GPGPU. Your explicit map statement allocates the memory on the GPU and
sets the counter to 1. Once we know that the counter is at least 1, we can copy the
data from the CPU to the GPU (if wanted by the user). In the same way, the compiler
can generate an epilogue which copies GPU data back to the CPU if instructed to
Fig. 18.3 The memory of CPU and GPU is (logically) totally separated. I denote this through
different addresses here. OpenMP’s memory table holds an entry (23, 2345, 1) which means that
the CPU memory location 23 is mapped onto the GPU’s memory location 2345. The 1 indicates
that this mapping entry is valid. The mapping from 24 onto 4667 is not valid anymore as the GPU
memory is already freed and might be used by some other GPU kernel
do so. When we leave a target block, the counters of all the variables that are used at
least once within the target block are decremented.
We end up with an address replacement process: OpenMP generates two versions
of a target block. One for the GPU, one for the CPU. The CPU version works with
the real memory addresses. In the GPU version, all memory addresses are replaced
with their GPU counterpart.
Making the reference entry an integer allows us to launch multiple GPU kernels
concurrently. If a pointer for a CPU memory region does already exist in the table,
the launch increments its counter and decrements it once the kernel has terminated.
Data entries on the GPU are not replicated. Effectively, the table on the GPU is an overview of which parts of the CPU memory are mirrored on the accelerator and how
many kernels currently use these mirrored data.
It is the counter semantics that allows us to organise data transfer more explicitly,
i.e. outside of a map statement:
Definition 18.10 (Explicit data transfer steps) OpenMP allows us to move data
explicitly to the device, from the device and to update it on the device:
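// A sketch with a placeholder array x of n elements:
#pragma omp target enter data map(to: x[0:n])     // allocate on the device and copy over

#pragma omp target update to(x[0:n])              // refresh the device copy from the host
#pragma omp target update from(x[0:n])            // refresh the host copy from the device

#pragma omp target exit data map(from: x[0:n])    // copy back and release the device copy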
These pragmas are stand-alone pragmas, i.e. they have nothing to do with the
block/instruction following them.
While enter and exit change the GPU’s internal reference counter, update
leaves it unaltered. As a consequence, you can update data from within a target, too.
The other two instructions have to be put outside of the compute target region.
As we can separate data movements from computations, it makes sense that there
is a delete call as counterpart of the allocation. We thus have the map types to, from, tofrom, alloc and delete.
It is obvious how the enter and exit annotations are mapped onto GPU table entry
modifications. Enter in combination with to or alloc reserves memory on the
accelerator and increments the reference counter in the table. Exit in combination
with from or delete reduces the counter. If you explicitly delete data on the GPU,
this delete also requires you to specify the array sizes if the compiler cannot know
these at compile time.
It is important to memorise that alloc and delete mimic the behaviour of
new and delete on the GPU. They expect that a pointer with the same name on
the CPU exists before they are called. Our OpenMP philosophy is that OpenMP code
remains valid even if a compiler fails to understand our OpenMP pragmas. The data
instructions allow us to keep data between different memory regions consistent, but
if neither data nor the target blocks are translated, our code remains correct.
Memory errors
When you work with explicit data movements, there are a few evergreens of memory
errors:
• If you allocate data on the GPU and you forget the corresponding from or
delete, your GPU will eventually run out of memory. You produce a GPU
memory leak.
• If an array on the CPU is changed while there is a replica counterpart on the GPU,
the GPU table works with an outdated view of the memory. Never alter the data behind a pointer while it is in use on an accelerator.
• If you free memory on the CPU while the corresponding pointer replica on the
GPU is still valid, you will eventually get some strange memory error.
GPU kernels, i.e. target blocks, are tasks. The explicit memory transfer opera-
tions triggered by target enter data and target exit data are tasks, too. Whenever
we launch such a GPU kernel on the host, the host blocks (idles) until the corre-
sponding GPU task has finished:
Definition 18.12 (Target tasks) All target blocks are tasks in OpenMP. However,
target adds an (implicit) synchronisation (barrier) to its task by default.
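The listing the following paragraph refers to is not reproduced here; a sketch of such a sequence (x and n are placeholders) reads:

#pragma omp target enter data map(to: x[0:n])

#pragma omp target        // the compute kernel; x is already resident on the device
{
  for (int i = 0; i < n; i++) {
    x[i] = 2.0 * x[i];
  }
}

#pragma omp target exit data map(from: x[0:n])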
Such a sequence is equivalent to one target map where the corresponding data transfer statements
are merged into the kernel annotation. Its split-up into explicit data movements and
sole compute kernels does not really give us anything, as we work with a blocking
data transfer model and implicit barriers: We always wait for the GPU to finish its
job and until the data transfer is complete before we continue. There are three tasks
which are invoked one after the other. Formally, we should model this via four tasks:
There’s the main task running on the CPU. It launches the data transfer task, waits,
launches the compute kernel task, waits, launches another data transfer task, waits,
and finally continues.
Like any other task, targets can be augmented with a nowait clause. This clause
implies that the GPU shall run its computation while the host continues with some
other work—maybe meanwhile launching the next kernel. With a nowait, we need
some further depend clauses to ensure that everything on the GPU is executed in
the right order (all target instructions also are tasks, so the OpenMP runtime might
permute them), but we are now able to make the host computations and the data
transfer plus the GPU calculations overlap (Fig. 18.4).
Fig.18.4 A schematic break-down of the three GPU steps (data transfer, computation, data transfer)
followed by some expensive host code on the CPU which is independent of the GPU kernels. With
a nowait, we can exploit the fact that GPU target invocations are tasks and run the operations all
in parallel. Indeed, we “only” use the dependencies and let the runtime decide how and when to
schedule a kernel on the GPU
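The listing the following bullets discuss is not reproduced here; a sketch consistent with the description (the array a, its size n, the dummy variable x and doExpensiveHostWork() are assumptions) could read:

double a[1024];
const int n = 1024;
int x;   // dummy variable; only its address is used to express the dependencies

#pragma omp target enter data map(to: a[0:n]) depend(out: x) nowait

#pragma omp target depend(inout: x) nowait
{
  // ... the actual GPU kernel working on a ...
}

#pragma omp target exit data map(from: a[0:n]) depend(in: x) nowait

#pragma omp task                 // expensive, independent host computation
doExpensiveHostWork();

#pragma omp taskwait             // wait for the three GPU tasks plus the CPU task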
In this code,
• the original data transfer to the GPU is launched as a task. The implicit barrier at
the end of the target is removed due to the nowait clause.
• We immediately launch a second task which is the actual GPU kernel. It does not
start before the data transfer task has finished, as it is subject to an in-dependency.
• An additional out-dependency ensures that the task that brings data back is not
launched too early.
• Parallel to the three GPU tasks, we spawn a fourth task on the CPU which does
all the expensive calculations on the host.
• A final taskwait ensures that we wait for both our GPU tasks plus the CPU
task. All other (implicit) synchronisation has been removed due to the nowait
clauses.
x can be an arbitrary placeholder variable. Its content might never be used, it is solely
its address that matters. You can distribute the kernels over your whole code, as long
as the x remains alive while the kernels on the GPU are still pending or running,
respectively.
Multitasking on GPU
Nothing stops you from having multiple (different) tasks on the GPU. Indeed, you
can use one data migration task to make multiple compute tasks ready. You can
deploy a whole task graph to the accelerator.
18.5 Collectives
GPU tasks can communicate through shared memory and atomics. We know that
this allows us to realise all-to-one communication but it remains cumbersome in a
GPU context: We have to get all the reduction done on the GPU manually, and then
bring the result back to the host.
OpenMP allows us to have reductions over loops. This also works on GPUs. You
can even have a reduction over a distribute statement, i.e. for multiple teams. That is, reductions also work over multiple SMs. If you use a reduction clause in a line which hosts both a parallel for and a distribute, then the reduction applies
to both clauses. If you split it up, i.e. if you first use distribute followed by a
separate parallel for, then you have to use the same reduction for both clauses
to ensure that the data is properly reduced. A reduction implies that the reduced
scalar is mapped back from the device to the host.
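A sketch of such a device-side reduction (x, n and sum are placeholders; the reduced scalar ends up on the host):

const int n = 1024;
double x[n];
// ... fill x on the host ...
double sum = 0.0;

#pragma omp target teams distribute parallel for reduction(+: sum) map(to: x)
for (int i = 0; i < n; i++) {
  sum += x[i];
}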
target blocks so far are code blocks which work with scalars and arrays over
built-in types. It is clear that this gives the OpenMP compiler the opportunity to
create bespoke GPU code. It knows all the instructions and exact data layouts.
If you want to use functions within your GPU code, too, you have to tell OpenMP
that you need both a host and a device variant. For this, you have to embed the
function into a declare target block:
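// scaleEntry() is a placeholder for a function called from within a target block
#pragma omp declare target
double scaleEntry(double value) {
  return 2.0 * value;
}
#pragma omp end declare target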
This is a recursive process: If your GPUised function uses another function again,
then you have to embed this one into declare target, too.
Definition 18.13 (Target declarations) All functions, data structures, variables and
classes that you want to use on the GPU have to be embedded into a declare
target environment.
Structures have to be marked with a declare, too, if you want to use them on the
accelerator. As soon as the compiler knows that a structure might be used on a GPU,
it can analyse its memory layout (that includes its field offsets and arrangement of
all the attributes within the structure) and create the counterpart for the accelerator.
If you mark a variable with declare target, then this variable is replicated
on the GPU at startup. The technique can be used for lookup tables which you want
to hold on the accelerator persistently. Finally, you can offload complete classes to a
GPU. Just ensure that your declaration wraps both the fields of a struct and all of its
routines.
• You cannot have string routines and outputs to the terminal on the accelerator.
Plain printf works, but anything that has to do with string construction or
concatenation might fail.
• Templates seem to work, but I faced issues multiple times with functors.
• Forget about assertions or similar things that stop your code if they are violated.
• Inheritance, virtual function calls, class attributes (static), and so forth seem to
challenge the compiler. Stick to plain classes or, even better, functions.
• We experienced real difficulties with function overloading for some compilers.
• Some of the tests we did with compilers suggest that it is problematic to move
classes to and from the accelerator. You might have to map individual (scalar)
attributes one by one.
Higher Order Methods
19
Abstract
To construct higher-order time stepping methods, we discuss two paradigms: On
the one hand, we can write down an integral equation for the time stepping and
construct more accurate integrators for the right-hand side. On the other hand, we
can shoot multiple times into the future to obtain a guess for the additional terms
from the Taylor expansion. The first approach leads to multistep methods, the
second approach introduces the principles behind Runge-Kutta methods. After a
brief discussion of how to program arbitrary-order Runge-Kutta methods once the Butcher
tableau is known, we briefly sketch leapfrog-type methods which yield particularly
elegant, second order methods for our N-body problem.
The explicit Euler takes a snapshot of a system yh (t) and constructs the next
snapshot yh (t + h) from there through Taylor. If we had two initial values,
i.e. yh (0) and yh (h), we could make yh (2h) depend on yh (0), yh (3h) depend
on yh (h), and so forth, if we pick a step size of 2h. There is no formal reason
not to do this, though it is clear that the result, even for the simple test equation,
now yields some strange solutions with oscillations. We effectively interleave
two solvers running with 2h each.
These oscillations fade away for tiny time step sizes, but overall it just does not
seem to be a particularly good idea to make two solvers run in a parallel, shifted
way.
However, shouldn’t we be able to use the history of a solution, i.e. our com-
putational data, to make our predictions better if we have two initial values?
Can you sketch such a scheme? Alternatively, could we do multiple shots into
the future, and mangle them up into one better guess?
Both questions boil down to the question whether we can get the next terms
in the Taylor expansion right. Is there a way to get the $\partial_t \partial_t y(t)$ term right even if the ODE is given in first-order formulation? In this case, we do not have to chop it off and ignore it, which makes the $\partial_t^{(3)} y(t)$ term dominate the error. We obtain higher accuracy.
Euler belongs to the class of single step methods—only the current solution is taken
into account—and it is a single shot method, as it comes up with the next solution
with one F-evaluation.
In a first attempt to design better time stepping schemes, we reinterpret time stepping
as an integration problem. To solve
\[ \partial_t y(t) = F(t, y(t)), \]
we integrate over time:
\[ y(t + h) = y(t) + \int_t^{t+h} F(s, y(s))\,ds. \]
In practice, this does not get us anywhere, as we do not know the antiderivative of
F for interesting problems. This motivated us in the first place to approximate the
solution with a computer rather than solving it symbolically. However, we can make
Fig. 19.1 Schematic derivation of the explicit Euler via an integral formulation: (1) We know yh (t)
and thus the value of F(t, yh (t)) (2). If we assume that this value remains constant over the time
interval (t, t + h), we can compute the area under F(s, yh (s)). This yields a straight line (3). We
can add this straight line to yh (t) and obtain the value yh (t + h) (4)
a simple assumption: We can assume that F(s, y(s)) = F(t, y(t)) is constant over
one time slice. This yields
\begin{align}
y(t + h) &= y(t) + \int_t^{t+h} F(s, y(s))\,ds \nonumber\\
&= y(t) + \int_t^{t+h} F(t, y(t))\,ds \nonumber\\
&= y(t) + F(t, y(t)) \cdot \int_t^{t+h} 1\,ds \tag{19.1}\\
&= y(t) + F(t, y(t)) \cdot s \big|_{s=t}^{t+h} \nonumber\\
&= y(t) + F(t, y(t)) \cdot h. \nonumber
\end{align}
Section 12.3 derives the explicit Euler through a Taylor series. The above sketch
yields exactly the same algorithm with integration glasses on (Fig. 19.1). It is clear
that both approaches produce the same formula in the end—otherwise, one formalism
would be wrong. On the following pages, we exploit this equivalence to construct
so-called multistep algorithms:
19.2 Adams-Bashforth
For the explicit Euler, we assume that the time step yh (t) gives us an F(t, yh (t)) which
then does not change until we hit yh (t + h). Equation (19.1) formalises that. Let’s
assume that we know one further previous time step yh (t − h), too. We therefore
also know the value of F at t − h. In the thought experiment with step size 2h,
Fig. 19.2 Schematic derivation of Adams-Bashforth using notation from Fig. 19.1: This time, we
assume we know both yh (t) (1a) and yh (t − h) (1b). Therefore, we also know F(t, yh (t)) (2a) and
F(t − h, yh (t − h)) (2b). We take these two data points and extrapolate what F looks like over the interval
(t, t + h). This yields an area under a straight line (3). We integrate over this area to obtain a new
value yh (t + h) (4)
we know two derivatives in the point y(h): We have the derivative from y(0) and
we have the derivative at y(h). Instead of following either of them, and instead
of switching between different solutions, we try to combine these two pieces of
information cleverly.
With additional information from previous time steps, we can make a more edu-
cated guess for the integrand (Fig. 19.2): Let F not be constant anymore, but assume
it grows linearly! As we know F(t − h, yh (t − h)) and F(t, yh (t)), we draw a line
through both values and assume that this F continues linearly beyond t until we hit
the next snapshot at t + h.
Definition 19.1 (Multistep methods) Multistep methods keep track of previous solu-
tions to come up with a more educated guess for F. They use historical information
to predict what F looks like over the next time span.
We redo the (symbolic) integration over our improved approximation of F and obtain
a new formula where the new time step depends on yh (t) as starting point (same as
Euler), the value of F around that starting point (same as Euler) plus an additional
term which depends on the difference of F between t and t − h (not the same as in
the Euler construction):
\begin{align*}
y_h(t + h) &= y_h(t) + \int_t^{t+h} F(s, y(s))\,ds\\
&\approx y_h(t) + \underbrace{F(t, y_h(t)) \cdot h}_{\text{area under constant function}}
 + \underbrace{\underbrace{\frac{F(t, y_h(t)) - F(t-h, y_h(t-h))}{h}}_{\text{slope of better approximation}} \cdot \underbrace{h}_{\text{walk $h$ along slope}} \cdot \frac{h}{2}}_{\text{area under triangle}}\\
&= y_h(t) + h \cdot \Bigl( \tfrac{3}{2} F(t, y_h(t)) - \tfrac{1}{2} F(t-h, y_h(t-h)) \Bigr).
\end{align*}
The multistep methods here do not really care about old solutions before our last
time step yh (t). They care about the F values at previous time steps! There are three
important implications for programmers:
1. We have to bookkeep F values from previous time steps. This will increase our
memory footprint.
2. We make certain assumptions about the smoothness of F in time. We take it for
granted that the history of F tells us how F develops in the future.
3. We need historic data when we kick off the solve. When we solve an IVP, we
typically do not have such historic data. In this case, we can for example run
explicit Euler steps (with tiny time step sizes) until we have a sufficient set of
historical F data to switch to our multistep method of choice.
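A minimal sketch of one such update for the two-step Adams-Bashforth scheme (the vector layout and names are assumptions; the caller has to bookkeep Fold as discussed above):

#include <cstddef>
#include <vector>

// y_h(t+h) = y_h(t) + h * ( 3/2 * F(t, y_h(t)) - 1/2 * F(t-h, y_h(t-h)) )
std::vector<double> adamsBashforth2Step(
  const std::vector<double>& y,      // y_h(t)
  const std::vector<double>& F,      // F(t, y_h(t))
  const std::vector<double>& Fold,   // F(t-h, y_h(t-h)), kept from the previous step
  double h
) {
  std::vector<double> yNew(y.size());
  for (std::size_t i = 0; i < y.size(); i++) {
    yNew[i] = y[i] + h * (1.5 * F[i] - 0.5 * Fold[i]);
  }
  return yNew;
}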
Fig. 19.3 Schematic illustration of Runge (RK2; left) and Heun’s method (right): RK2 runs an
explicit Euler, but only for h/2 (end up in orange circle). It then evaluates the F value (gradient)
of the ODE there and uses this F value to kick off a whole “Euler” step (end up in green circle).
Heun first runs an explicit Euler for the whole h interval (orange circle). It then takes half of the F
value at this predicted point plus half of the original F value (gradient) to perform one time step
from the starting point (end up in green circle)
The alternative paradigm shoots into the future multiple times with different time step sizes h and thus obtains a feeling for how F changes over time. Such schemes are also called multiple shooting methods.
Here is a sketch of a very simple multiple shooting method: With the explicit
Euler, we use the F value at time t and assume that it is constant over the whole
interval (t, t + h). If we want to stick to the concept that F is constant, we could
evaluate F(t + h/2) and use this one as a better approximation to F. Unfortunately,
we don’t know F(t + h/2). Therefore, we use two shots:
• We run an explicit Euler step for half of the time step to get an estimate for a
ŷ(t + h/2). The hat emphasises that this is a helper result that we do not use
directly in the end.
• We plug this estimate into the right-hand side of the ODE, i.e. compute F(t +
h/2, ŷ(t + h/2)).
• We finally return to yh (t) and use this second F-value to realise an Euler step over
the whole time interval.
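A sketch of one step of this scheme (Runge’s method); the types and names are assumptions, with the right-hand side F passed in as a function object:

#include <cstddef>
#include <functional>
#include <vector>

using State = std::vector<double>;
using RHS   = std::function<State(double, const State&)>;

// Half an Euler step, re-evaluate F at the midpoint, then take the full step
// with this improved slope.
State rk2Step(const State& y, double t, double h, const RHS& F) {
  const State k1 = F(t, y);                  // F(t, y_h(t))
  State yHat(y.size());                      // helper value at t + h/2
  for (std::size_t i = 0; i < y.size(); i++) {
    yHat[i] = y[i] + 0.5 * h * k1[i];
  }
  const State k2 = F(t + 0.5 * h, yHat);     // F(t + h/2, \hat{y}(t + h/2))
  State yNew(y.size());
  for (std::size_t i = 0; i < y.size(); i++) {
    yNew[i] = y[i] + h * k2[i];
  }
  return yNew;
}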
Schemes as the one sketched are also called predictor-corrector schemes (Fig. 19.3):
We compute an estimate of F and feed it into the time step. We know that it is a
guess, i.e. we can call it a prediction of what the right-hand side looks like. Next,
we use this prediction to compute a new solution which we use to replace or correct
the original guess. In the introductory example, we walk for 2h along the derivative,
but stop at h to construct a second derivative. We can use these two directions to
construct two independent solves. Or we can combine the two derivatives into one,
better guess and use this one instead of either of them.
Runge’s method is more accurate than the explicit Euler, and yet it does not use
any historic data. It does not make the assumption that the history of F tells us something about its future. It makes an estimate of what $y_h$ would look like half a
time step down the line, and then evaluates the F there again. No technical reason
forces us to do half a time step for the first guess:
Definition 19.4 (Heun’s method) For Heun’s method, we also use two shots, but the
first attempt spans the whole h. This attempt (prediction) yields a new guess for F.
We average the F at the current time step and the predicted F:
\[
y_h(t + h) = y_h(t) + h \Bigl( \underbrace{\tfrac{1}{2} F(t, y_h(t))}_{\text{Euler part}} + \tfrac{1}{2} \underbrace{F\bigl(t + h,\, y_h(t) + h\,F(t, y_h(t))\bigr)}_{\text{test shot}} \Bigr).
\]
Whenever we see such a “let’s shoot into the future” algorithm as computer scientists,
we can immediately generalise it via recursion:
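\begin{align*}
k_i &= F\Bigl(t + c_i h,\; y_h(t) + h \sum_{j=1}^{i-1} a_{ij} k_j\Bigr), \qquad i = 1, \ldots, s,\\
y_h(t+h) &= y_h(t) + h \sum_{i=1}^{s} b_i k_i .
\end{align*}
This is the generic explicit $s$-step Runge-Kutta scheme (the formulation referred to further below as (19.2), reconstructed here in standard notation); the weights $c_i$, $a_{ij}$ and $b_i$ are the free parameters.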
Systems of ODEs
If you are dealing with systems of ODEs, you will have to develop the scheme in all
unknowns, i.e. the ks are time derivatives of all of your unknowns. If you have, for
example, N particles and each particle has three position and velocity components,
your ODE system has 2 · 3 · N entries and each k therefore has 6N entries, too
(cmp. discussion on page 142).
That is, for each trial shot, you have to compute new trial positions plus velocities
of all particles using your F approximations, i.e. the velocity and position updates
computed so far, subject to weights $a_{si}$. Once you have these trial properties, you
compute the new forces plus the corresponding position updates. The new forces
plus positions then feed into the final position and velocity updates. All intermediate
vectors have to be stored.
Definition 19.5 (Butcher tableau) The weights behind Runge-Kutta methods are
something you typically look up. The table where you look them up is called Butcher
tableau.
Runge-Kutta notation
When authors refer to Runge-Kutta without specifying the number of steps, they
typically mean four steps. Sometimes, people write RK(4).
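As an illustration (this particular tableau is standard textbook material and not reproduced from the original text), the Butcher tableau of the classical RK(4) scheme reads
\[
\begin{array}{c|cccc}
0 & & & & \\
\tfrac{1}{2} & \tfrac{1}{2} & & & \\
\tfrac{1}{2} & 0 & \tfrac{1}{2} & & \\
1 & 0 & 0 & 1 & \\
\hline
 & \tfrac{1}{6} & \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{6}
\end{array}
\]
with the $c_i$ in the left column, the $a_{ij}$ in the lower triangle, and the weights $b_i$ in the bottom row.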
19.3.1 Accuracy
For us, copying indices from the Butcher tableau into our code is usually sufficient.
Yet, one should have an intuition of where the parameters in the tableau come from.
For this, we return to the Taylor series—and use it multiple times. Assume we want
to write down a scheme which is second order accurate. We want to realise
y_h(t + h) = y_h(t) + (h/1!) · ∂_t y_h(t) + (h^2/2!) · ∂_t^(2) y_h(t) + . . . ,
where y_h(t) is given, the two derivative terms are what we want to get right, and the
trailing terms are an acceptable error.
The Taylor expansion with a second-order term contains terms with F(t, y), with
∂_t F(t, y) and ∂_y F(t, y); all second derivatives of y_h are eliminated in favour of derivatives of F.
We next pursue the following strategy: Write down the definition of the s-step method
and unroll the recursive expressions therein. We continue to demonstrate this for the
s = 2-step method:
The next step is the crucial one: The unrolled recursive formula expects us to evaluate F at
a time t + c2 h. It is time for Taylor!
F(t + c_2 h, . . .) = F(t, . . .) + c_2 h · ∂_t F(t, . . .), respectively
F(. . . , y_h(t) + h a_21 · F(t, y_h(t))) = F(. . . , y_h(t)) + h a_21 · F(t, y_h(t)) · ∂_y F(. . . , y_h(t)). (19.5)
The last trick reduces all terms to rely on F evaluations at (t, yh ). The terms in the
generic Runge-Kutta formulation (19.2) thus match the terms in the Taylor expansion.
In the s = 2-example,
y_h(t + h) = y_h(t) + h b_1 · F(t, y_h(t)) + h b_2 · F(t + c_2 h, y_h(t) + h a_21 · F(t, y_h(t)))
           = y_h(t) + h b_1 · F(t, y_h(t)) + h b_2 · F(t, y_h(t))
             + h^2 b_2 c_2 · ∂_t F(t, y_h(t))
             + h^2 b_2 a_21 · F(t, y_h(t)) · ∂_y F(t, y_h(t)).
This is a plain “plug our equations in” exercise. The only pitfall is that the Taylor
expansions in (19.5) run either in the direction of the first argument or the second.
As the term structure in (19.5) and above match now, we can pick the free weights
such that the equations are exactly the same. However, (19.3) gives us four unknowns
to play around with. The fact that there are four "free" parameters (b_1, b_2, c_2 and a_21)
explains why there are (infinitely) many Runge-Kutta schemes per order. We have
some freedom of choice.
In the example above, we collect all the h terms and obtain the constraint b_1 + b_2 =
1. We furthermore collect the h^2 terms and obtain b_2 c_2 = 1/2 as well as b_2 a_21 = 1/2.
Both Runge’s method and Heun’s method match these constraints. We conclude that
they both yield the same order of accuracy.
The exercise above constructs a second-order scheme. All techniques can be ap-
plied recursively, i.e. over and over again, to construct schemes with order p > 2.
Yet, there is bad news for RK methods: You do not always get a pth order accurate
integrator after s steps. With proper parameter choices, we can get away with p = s
for p ∈ {3, 4}. To construct higher-order RK methods ( p > 4), we however will
need more than s steps to get enough free parameters for tailoring them such that the
result matches the higher order Taylor expansion terms. Furthermore, finding these
weights becomes nasty and the number of conditions which your parameters in the
Butcher tableau have to meet explodes.
Definition 19.6 (Butcher barrier) For p ≥ 5, there are no generic, explicit Runge-
Kutta construction rules with s = p. p ≥ 5 is the Butcher barrier.
While RK(2) (Runge’s method) or Heun’s method both are more accurate than Euler,
we have a price to pay: We need two F evaluations per time step rather than one.
Let us assume that both a second-order scheme and an explicit Euler yield a result
of roughly the same quality for a particular h. To reduce the error of a calculation
by a factor of four with standard explicit Euler, we have to divide the time step size
by four. We have to halve it twice. Instead of computing with h/4, we could use only
h/2 with RK(2). As it is second order globally, one refinement step in h reduces the
error by a factor of four, too. However, each RK(2) step requires us to evaluate F
twice. If we assume that F is by far the most expensive part of our calculation, we
run into a tie: The more accurate method needs half the steps, but each step has twice
the cost. Following the same line of argument, things change if we switch to RK(3) or RK(4).
While RK(3) or RK(4) exhibit an advantageous cost-to-accuracy ratio, other draw-
backs cannot be easily compensated for:
• We have to bookkeep the predictions ks . They are intermediate results that feed into
the subsequent predictions, but they also contribute to the final linear combination.
Therefore, we have to store them throughout the computation. More shots means
bigger memory footprint.
• The rightmost entries in every line of the Butcher tableau are unequal to 0. That
is, the evaluation of ks requires the completion of all computations for ks−1 . The
“outer loop” running through the shots does not exhibit trivial concurrency. Instead,
each s-step introduces a parallel barrier.
• In a distributed memory world, each s-step requires a complete data exchange. If
we have N particles, we have to compute ks for every single particle before we
compute the ks+1 for all particles in the next sweep. ks+1 will need an updated
guess of the particle properties (positions).
The three reasons above make many supercomputing codes ignore higher-order RK
methods. Instead, there is one particularly popular scheme which is second order
(almost) for free:
19.4 Leapfrog
When we write code for our second-order ODEs describing particle motion—where
the acceleration is affected by the particles’ position—we first rewrite these ODEs
into a system of first-order quantities (cmp. discussion on page 141). Pseudo code
for the resulting scheme consists of a sequence of three loops per time step:
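A minimal sketch of this structure (illustrative names; computeAcceleration stands for
the force evaluation, already normalised by the particle mass) could read:

#include <vector>

// One explicit Euler time step for N particles in one dimension
// (three dimensions work component-wise). Three separate loops:
// forces first, then positions, then velocities.
void eulerTimeStep(std::vector<double>& x, std::vector<double>& v, double h,
                   double (*computeAcceleration)(int, const std::vector<double>&)) {
  const int N = static_cast<int>(x.size());
  std::vector<double> a(N);
  for (int i = 0; i < N; i++) {          // loop 1: evaluate all forces
    a[i] = computeAcceleration(i, x);    //   reads the old positions of all particles
  }
  for (int i = 0; i < N; i++) {          // loop 2: position updates
    x[i] += h * v[i];
  }
  for (int i = 0; i < N; i++) {          // loop 3: velocity updates
    v[i] += h * a[i];
  }
}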
You can fuse the second and the third loop to get rid of one loop. This results
directly from the respective control flow graph. If we permute loop two and three,
we update the velocity first and this new velocity enters the position immediately.
This modification of the explicit Euler is a different algorithm and is called Euler-
Cromer.
Invalid optimisation
I have seen multiple “optimised” codes which try to fuse the force computation loop
with the updates. The motivation behind such a merger is valid: Update loops are
computationally cheap and thus hard to scale up. If we can hide them behind another
loop, we get code that scales better.
Unfortunately, such optimisations are vulnerable to bugs: We must not compute
the forces on a particle, update the particle’s positions plus velocity, and then continue
with the next particle. We have to ensure that one, invariant particle configuration
feeds into all calculations of the follow-up step. If we change some particles in the
preimage on-the-fly, we end up in a situation where some particles are already a step
ahead and feed into calculations, while others have not been updated yet. We get
inconsistent data.
A totally new scheme arises once we decide not to stick to integer time steps anymore:
Leapfrog time stepping keeps the position discretisation, but shifts the execution of
the velocity updates (Fig. 19.4): The velocities are not updated per time step but
after half a time step, after 1.5 time steps, and so forth. Velocity and position updates
permanently overtake each other, i.e. jump over each other.
Fig. 19.4 The idea behind leapfrog is that velocity and position updates happen in the same time
intervals but are set off against each other by h/2. The dotted arrows indicate which snapshots
influence v(t + h/2). Intuitively, it is clear why leapfrog is more accurate than Euler: The value
of v(t + h/2) requires data from v(t − h/2) and (formally) from x(t + h/2). The latter is not
available (yet) and we thus extrapolate it from x(t) using the knowledge how it will evolve. This
extrapolation/knowledge argument is the reason behind the improved accuracy
Definition 19.7 (Leapfrog time stepping) The blueprint of leapfrog time stepping is
given by
x h (t + h) = x h (t) + h · vh (t + h/2), (19.6)
and
vh (t + h/2) = vh (t − h/2) + h · F(t, x h (t)).
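A minimal sketch of one leapfrog step for a single degree of freedom (illustrative names;
F returns the acceleration, and vHalf holds v(t − h/2) on entry):

// One leapfrog step following (19.6): kick the staggered velocity, then drift.
void leapfrogStep(double& x, double& vHalf, double t, double h,
                  double (*F)(double t, double x)) {
  vHalf += h * F(t, x);   // v(t + h/2) = v(t - h/2) + h * F(t, x(t))
  x     += h * vHalf;     // x(t + h)   = x(t) + h * v(t + h/2)
}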
To analyse the accuracy of this scheme, we have to map it back onto a Taylor ex-
pansion. The formal problem with the position update in (19.6) is that it has this
fractional time step in there, and such a time step does not arise in Taylor. The same
argument holds for the velocities. The dilemma is resolved by recognising that the
explicit Euler extrapolates linearly and that it works both forward and backward in
time:
Time-reversibility
The Taylor expansion works forward and backward in time, i.e. you can develop
f (t + h) around t through the Taylor series as well as f (t − h). Most notably, the
derivatives entering the Taylor expansions are exactly the same in
f (t + h) = f (t) + h · ∂ f (t) and f (t − h) = f (t) − h · ∂ f (t).
The expansion is time-reversible.
Numerics philosophy
Phenomena in nature have plenty of properties (energy conservation, time reversibil-
ity, smoothness, …), which we might lose when we solve them on a computer. If a
simple trick (such as the shift of the operator evaluations) helps us to regain some of
these properties, we typically get more robust code plus new opportunities to design
better numerical schemes, e.g. schemes that allow us to pick larger time step sizes
while preserving the same accuracy.
The leapfrog is different. If we study the solution around {t, t + h, t + 2h, . . .},
the solution is time-reversible in the velocity. If we study the solution around {t +
h/2, t + 3h/2, t + 5h/2, . . .}, the solution is time-reversible in the position. We can
This is exactly the Taylor expansion where we truncate after the third term. As the same
train of thought holds for the velocities, too, the leapfrog is globally second-order
accurate.
Velocity Verlet
In the context of particle methods, we often use Velocity Verlet time integrators. They
are alternative formulations of leapfrog, which yield velocity values at the same points
in time where we also encounter position updates. Having the velocity explicitly
at hand at these points is important for visualisation and a lot of postprocessing
or coupling to other models. Velocity Verlet relies—in its simplest form—on the
following steps:
After step 4, both position and velocity are valid for this point.
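In its commonly used form, the four steps are a half velocity update with the old force,
a full position update, a force evaluation at the new position, and a second half velocity
update; a minimal sketch (illustrative names, with F returning the acceleration) reads:

// One Velocity Verlet step; both x and v are valid at t+h afterwards.
void velocityVerletStep(double& x, double& v, double t, double h,
                        double (*F)(double t, double x)) {
  v += 0.5 * h * F(t, x);      // step 1: half velocity update with the old force
  x += h * v;                  // step 2: full position update
  double aNew = F(t + h, x);   // step 3: force/acceleration at the new position
  v += 0.5 * h * aNew;         // step 4: second half velocity update
}

In practice, the acceleration from step 3 is kept and reused as the "old force" of the
next step, so only one force evaluation per step is required.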
•> Key points and lessons learned
• Multistep methods keep track of some previous time steps and construct a smooth
curve for F through these steps to predict the real F over the upcoming time
interval.
• Runge-Kutta (RK) methods are the most prominent type of multiple shooting
methods. The principle here is that we shoot into the future using all known ap-
proximations of F so far, and then take the result to obtain a new F guess. This
works recursively.
• Butcher tableaus provide us with a formal specification of how to program a Runge-
Kutta (RK) method.
• Leapfrog-type methods are very popular with particle methods where we maintain
velocity plus position anyway.
20 Adaptive Time Stepping
Abstract
The discussion around A-stability makes it clear that ODE integrators are very
sensitive w.r.t. time step choices. Some simple thought experiments illustrate that
a fixed time step size for a whole simulation often is inappropriate. We sketch one
simple multiscale approach to guide h-adaptivity (adaptive time step choices) and
one simple p-adaptivity strategy.
Assume you have to create a bike map of two tracks. More specifically, you
want to create a height profile, i.e. how these tracks go up and down. There are
two different tracks on your todo list: One track is a single track mountain bike
trail with a lot of bumps. The other trail is a dismantled railway line.
If you ride along a former train route, you don’t need to take many snapshots
per time span to track the train track’s altitude. Railway lines are smooth and
do not ascend or descend suddenly. No need to invest a lot of snapshots. Things
are different on the mountain bike trail: Here, you frequently have to make
snapshots to track all the features of the ground.
While a certain snapshotting strategy might work for one particular trail type,
mixed-type trails require mixed strategies. Assume you bike up a former railway
line to ride downhill across the fields next to it. To map such a circular loop, a
low snapshot frequency is economically reasonable for the uphill path, but the
downhill ride still requires a lot of snapshots. Having one fixed snapshotting
strategy is not a good idea.
Definition 20.1 (Stiff problem) A stiff problem is a problem that requires condition-
ally stable algorithms to use extremely small time step sizes.
If codes have to use tiny time step sizes, we have to recognise this. If they do not
need tiny time step sizes all the time, we should come up with an algorithm that
uses tiny step sizes where required but maximises h otherwise. We want to balance
approximation errors versus compute effort.
We simulate a billiard table with an explicit Euler. It is clear that it is almost impossible
to make the billiard balls totally rigid and incompressible: As we cut time into slices,
i.e. we study snapshots, there will always be setups where a ball should have bumped
into the other one and pushed it to the side, but the simulation made it penetrate it.
A simple fix wraps the balls into a rubber cushion. If two balls move away from each
other or if they are further away than a hard-coded ε, we just simulate how they roll
on the table. If they approach each other and the distance d between them becomes
d ≤ 2ε, we accelerate them with (2ε − d)^12 to push them away from each other.
With an exponent of 12, the push force "explodes" as the particles approach. This
explosion approximates a real rigid body collision where the force is infinite for
penetrating objects. We call this a soft collision model (Fig. 20.1).
Fig. 20.1 Left to right: Spherical rigid bodies are surrounded by a rubber halo of size ε which is
compressible. If they approach each other, there's no force in-between them as long as they are more
than 2ε away from each other. Once their halos overlap, a massive force is applied to push them
away from each other. If they move away from each other, there is no force interaction (anymore)
For very fast billiard balls, the soft collision model continues to fail. We add forces
once two balls are closer than 2ε, but we still assume that they never really penetrate.
If they roll very fast, we therefore still need tiny time step sizes.
It is not a trivial exercise to find out whether and when we need small sizes. One
solution is to design an algorithm which always runs the Euler with two time step
sizes h and h/2 concurrently. If the h/2 setup yields reasonably similar results to
the h-simulation, we accept the time step with h and continue. If not, we rerun our
calculation with h/2 and h/4, i.e. we halve the hs.
Some people call this a two-level, feature-based refinement criterion. Let’s use
the scalpel: It takes two “levels”—one with h and one with h/2—and compares the
outcomes of the two runs. The h/2 then acts as a guardian for the real simulation
with h. It compares the features of the solution. We don’t know the real solution to
our simulation. Yet, if there are large differences between two solutions running with
h versus h/2, we better be careful and use small time step sizes.
Coarsening
The text sketches only a refinement criterion, as we discuss when to refine h. For a
full-blown code, we would likely complement this refinement criterion with a time
step coarsening criterion making h larger again if this is admissible.
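A minimal sketch of such a two-level check for a scalar state (illustrative names;
step(y, t, h) is assumed to perform one explicit Euler step and to return the new
solution):

#include <cmath>

// Two-level, feature-based refinement: compare one step with h against
// two steps with h/2 and halve h until the two results agree.
double twoLevelStep(double (*step)(double y, double t, double h),
                    double y, double t, double& h, double tolerance) {
  while (true) {
    double coarse = step(y, t, h);                        // one step with h
    double fine   = step(step(y, t, h/2), t + h/2, h/2);  // two steps with h/2
    if (std::abs(coarse - fine) <= tolerance) {
      return coarse;     // the two levels agree: accept the h-step and continue
    }
    h /= 2;              // otherwise rerun the comparison with h/2 and h/4
  }
}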
20.2 Formalisation
We want to solve our problem as quickly as possible. At the same time, the error
should be bounded. Both goals can only be achieved by ensuring that all time steps
produce an error of roughly the same size: If there is one single time step producing
a massive error, we know that this massive error propagates in time and eventually
destroys the quality of our final outcome. In this chapter, we distinguish two different
algorithmic strategies how to reduce the compute cost, while we keep the error under
control:
1. Make the time step size h as big as possible. Large time step sizes imply that we
need fewer steps to simulate up to a given time stamp.
2. Pick a numerical scheme per time step which is as cheap as possible. As discussed
throughout Chap. 19, this often means switching to higher-order methods where
appropriate, i.e. where the solution is smooth.
Definition 20.2 (h- vs. p-adaptivity) If we change the time step size, we call this
h-adaptivity. If we change the (polynomial) order of our scheme, we call this p-
adaptivity.
Both strategies rely on an error estimator. As we don’t know the real solution, we
don’t know the error of our numerical approximation either. However, we can try to
construct a (pessimistic) guess of what the error looks like:
Definition 20.3 (Error estimator) Let eh (t) describe the local error of a numerical
scheme. A function τ (t) with |eh (t)| ≤ τ (t) is an error estimator.
The idea behind an error estimator is that we use τ to determine which numerical
model to use and which time step size. With the most aggressive choice, we still
ensure τ (t) ≤ tol per time step, with a fixed global tolerance tol. We distinguish
two different selection strategies:
Definition 20.4 (Optimistic vs. pessimistic time stepping) An optimistic time step-
ping approach constructs an estimate yh (t + h) and then evaluates τ (t + h). If
τ (t + h) ≤ tol, it accepts the solution, i.e. continues with the time stepping. Oth-
erwise, it performs a rollback, i.e. returns to yh (t) and runs the time step again
with a modified numerical scheme/parameters. A pessimistic implementation always
chooses a stable numerical scheme right from the start.
Let’s exploit the fact that we know the local convergence order of our time stepping
scheme. As we truncate the Taylor series,
τ_h(t + h) = C h^p. (20.31)
Although we do not know C, we have h (the time step size) and we know p from our
numerical analysis. The remainder is a calibration problem. We have to determine
C from two measurements:
If we rerun our simulation with half of the time step size, we get τ_{h/2}(t + h/2)
and, after one more time step, τ_{h/2}(t + h). The definition of the two τ terms yields
y(t + h) = y_h(t + h) + τ_h(t + h) and (20.32)
y(t + h) ≈ y_{h/2}(t + h) + 2 τ_{h/2}(t + h).
In the second line of (20.32), we assume that the local truncation error τh/2 is roughly
the same over both time steps, i.e. τ_{h/2}(t + h/2) ≈ τ_{h/2}(t + h). Indeed, if the
solution is reasonably smooth, the error behind this assumption is negligible. Next,
we insert what we know about the shape of τ , i.e. we exploit that
τ_{h/2}(t + h) = C (h/2)^p = (1/2^p) τ_h(t + h).
This expansion assumes that the C is the same for both runs. We are only interested
in one versus two time steps. Subtracting the two equations (20.32) from each other
gives us
y_{h/2}(t + h) − y_h(t + h) = (1 − 2/2^p) τ_h(t + h), i.e.
|τ_h(t + h)| ≤ 1/(1 − 2/2^p) · |y_h(t + h) − y_{h/2}(t + h)|.
With two measurements with two different h choices, we can compute which h is
admissible. For this, we insert our ansatz τ_h(t + h) = C h^p and set the absolute value
of this expression equal to tol.
C ≈ 1/(1 − 2/2^p) · |y_h(t + h) − y_{h/2}(t + h)| · h^{−p}, and therefore
C · (h^(adm))^p ≤ tol  ⇔  h^(adm) ≤ (tol/C)^{1/p}. (20.33)
With only three time steps, we get a formula for how to determine an optimal admissible
time step size for our setup, i.e. a maximal time step size for a given tolerance. This
is a check we can do in-situ, i.e. while our code is running. In practice, there are a
few challenges to keep in mind:
• The approach only works because the ansatz (20.31) describes a function that runs
through 0. In theory, it is correct that τ runs through zero—otherwise, our scheme would
not be consistent. In practice, it does not run through zero: once h becomes too
small, we enter a regime where the round-off errors of floating-point arithmetic
dominate the result.
• The ansatz (20.31) works with a constant p where we know p a priori and observe
this order in our data. It is important to remind ourselves that we always use a
fundamental assumption when we construct integration schemes: For a numeri-
cal scheme to have pth order locally, the analytical solution has to exhibit this
smoothness, too. If the smoothness of our solution changes (suddenly), the error
estimator breaks down, too. This reminder is of limited relevance for Euler (we
can’t work “simpler” than that), but it plays a big role for higher-order schemes.
• The approach works as long as the ODE’s right-hand side F is well-behaved
and does not change dramatically over the time step under examination. If the F
suddenly becomes very large in the second smallish time step, then the τ is not
equally distributed among the two intervals with h/2 anymore.
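To make the calibration above concrete, here is a minimal sketch that determines C and
the admissible time step size from the measurements with h and h/2 (illustrative names;
p is the known local order and is assumed to be larger than one):

#include <cmath>

// Calibrate tau_h = C*h^p from one step with h (yH) and two steps with h/2
// (yHalf), then return the admissible step size for the tolerance tol.
double admissibleTimeStepSize(double yH, double yHalf,
                              double h, int p, double tol) {
  double factor = 1.0 / (1.0 - 2.0 / std::pow(2.0, p));   // assumes p > 1
  double C      = factor * std::abs(yH - yHalf) * std::pow(h, -p);
  return std::pow(tol / C, 1.0 / p);                      // (tol/C)^(1/p)
}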
Many interesting phenomena in science span multiple (time) scales. Our planets orbit
around the sun, but this is a very slow process compared to their moons which tend
to spin around their respective planets. If we want to simulate the solar system, the
moons make the numerical treatment cumbersome: As they are so light and, hence,
fast and spinning—they change their direction (second derivative) all the time—they
require us to use very small time step sizes. At the same time, they have only a small
impact on the trajectories of all other heavier objects. Yet their impact on the overall
system evolution is not negligible. It plays a role. Few fast objects make the overall
problem stiff.
Definition 20.5 (Local time stepping) Local time stepping (LTS) techniques describe
approaches where some parts of a simulation progress in time with their own (local)
time step sizes.
In a multibody simulation, each particle is subject to its own set of ODEs affecting
its velocity and position. The ODEs directly tell us when these properties change
dramatically over one time step: The position changes quickly if the particle’s velocity
is high. The velocity changes quickly if the particle’s mass is small compared to others
or if many other particles roam nearby. With our time step adaptivity techniques from
above, we can determine an optimal time step size per particle and make each particle
20.3 Error Estimators and Adaptivity Strategies 271
progress in time at its own (numerical) pace. There are a few design decisions to
make:
Interpolate. Assume there are two particles p1 and p2 , and they travel at their
own speed v_1 and v_2. As v_1 ≫ v_2, we want to use a tiny time step size for particle
p1 and update p2 infrequently. To simulate this system we can
• send the slow particle ahead, and then keep it at p1 (t + h 1 ) until particle p2 has
caught up; or
• send the fast particle ahead. When the small time step sizes h 2 have accumulated
such that they match or exceed h 1 , we update the big particle, too; or
• send the slow particle off, but memorise its old position. We can now interpolate
its position per time step of the fast particle.
The latter variant is the most accurate yet most expensive one, as we have to bookkeep
previous solutions. In any case, we have to be careful which positions feed into F at
any point in time.
Homogeneous numerical schemes. High-order schemes work best if the trajectory
of a particle is smooth, i.e. does not change rapidly. If we categorise particles into
(light) particles which race around and are subject to “sudden” direction changes
versus (heavy) particles which are slow, then it is natural to ask whether
we might be well-advised to use different numerical schemes. In this context, we
have to be particularly careful with the local time step sizes however:
Bucketing. If you simulate N objects in our setup, you will end up with N different
admissible time step sizes. This makes the implementation tricky. A lot of codes thus
discretise and bucket the local time stepping scheme:
Let h_max^adm be the biggest admissible time step size in our simulation. Bucketing
groups the particles into sets: All particles with h_max^adm/2 < h^adm ≤ h_max^adm belong
into class A, all particles with h_max^adm/4 < h^adm ≤ h_max^adm/2 belong into class B,
all particles with h_max^adm/8 < h^adm ≤ h_max^adm/4 into class C, and so forth. We advance
all particles in class A with h_max^adm/2. This is pessimistic—some of them would be
allowed to use way bigger time step sizes—but it makes our orchestration simpler.
Class B particles advance in time with h_max^adm/4, class C with h_max^adm/8, and so forth.
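A minimal sketch of such a binning (illustrative names; class k holds all particles with
h_max^adm/2^(k+1) < h^adm ≤ h_max^adm/2^k, and every particle in class k is advanced with
h_max^adm/2^(k+1)):

#include <cmath>

// Map a particle's admissible time step size onto its bucket/class index.
int bucketOf(double hAdm, double hMax) {
  int k = 0;
  while (hAdm <= hMax / std::pow(2.0, k + 1)) {
    k++;                       // particle needs a smaller step: next class
  }
  return k;
}

// Pessimistic time step size actually used for a particle in its bucket.
double bucketTimeStepSize(double hAdm, double hMax) {
  return hMax / std::pow(2.0, bucketOf(hAdm, hMax) + 1);
}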
20.3.3 p-Refinement
The chapter so far has fixed the order p and altered the time step sizes, driven by accuracy
indicators. These stem from comparisons of different runs with different hs. An
alternative metric used in literature is the comparison of two schemes with different
orders over the same time step size. Whenever we run two schemes with order p
versus order p +1, we know that the difference between them will give us an estimate
for the error:
With a pth-order scheme, we have a local error term in the Taylor series which
is eliminated by a subsequent (p + 1)th-order scheme. Consequently, the difference
between two solutions leaves us with the leading local error of the lower-order
scheme.
While it is straightforward to run two schemes in parallel, it is also expensive;
in particular for Runge-Kutta schemes where we have to memorise multiple steps.
A particularly attractive flavour of an RK scheme thus is RKF45 (the Runge-Kutta-
Fehlberg method). Its Butcher tableau is given by
    0  |
  1/4  |    1/4
  3/8  |   3/32        9/32
12/13  | 1932/2197  −7200/2197   7296/2197
    1  |  439/216      −8        3680/513    −845/4104
  1/2  |   −8/27        2       −3544/2565   1859/4104    −11/40
-------+----------------------------------------------------------------
       |   16/135       0        6656/12825  28561/56430  −9/50   2/55
       |   25/216       0        1408/2565    2197/4104   −1/5    0
The upper part of the tableau is well-known to us. It clarifies how we compute six
estimates of F. The bottom part tells us how to combine these estimates into two
approximations: The upper row yields a 5th order approximation, the lower row a
4th order one. One F guess is not used at all for the two final outcomes. It feeds,
however, into subsequent stages of the RK scheme as an intermediate result. This need
for an additional auxiliary solution is the price to pay for the fact that we get two
approximations instead of only one.
There are different applications of error estimators for higher-order methods: We
know that the error should go down with h p , and we also know that we should use a
higher-order method where possible. However, we also know that this higher order
quickly breaks down if the underlying solution becomes non-smooth. With a built-in
error estimator, we can dynamically change the order, or we can even run codes
where a time step is rolled back whenever we find out that the error did grow too
much (optimistic time stepping). That is, we run RKF45. If the resulting error is
above a given threshold, we roll back and rerun the time step, e.g. with an explicit
Euler, yet with very fine time step sizes.
The crucial element behind all of our adaptive schemes so far is a proper error
estimator. Once we get an idea of the magnitude of the error that we face, we can decide
which time step size or numerical scheme to use. Error estimators are unfortunately
not for free.
20.4 Hard-Coded Refinement Strategies
In many applications, we have a good feeling (expert knowledge) where big errors
arise if we are not picky with the time step size choices: The density of particles
around a particle of interest gives us a good feeling of how massive trajectories
change. Small particles are, by definition, more “vulnerable” than heavier ones. If
we have clusters of particles, these clusters will interact strongly with each other,
while isolated particles far away only alter their flight direction slowly. More generally,
we know that a large right-hand side of the velocity ODE forces us to use accurate
time stepping with a small time step size. Often, an analysis of the simulation state
provides us with an estimate where these large right-hand sides arise, or we might
know from previous experiments where we have to be careful with h and p. Such
domain knowledge allows us to hard-code refinement criteria which are cheaper than
their dynamic counterparts.
Further material
• E. Fehlberg, Low-order classical Runge-Kutta formulas with step size control and
their application to some heat transfer problems. NASA Technical Report 315
(1969)
• R.J. LeVeque, Finite Difference Methods for Ordinary and Partial Differential
Equations: Steady-State and Time-dependent Problems. Classics in Applied Math-
ematics. SIAM (2007)
•> Key points and lessons learned
• Stiff problems can force explicit methods to use tiny time step sizes to remain
stable.
• Techniques to choose the time step sizes adaptively ensure that we find a stable
time step size automatically. For these schemes, we need an error estimator.
• We can alter the time step size or the numerical scheme or even the physical
phenomena that we track (locally).
• Binning of admissible time step sizes helps us to bring the computational overhead
down, though the load balancing of adaptive local time stepping remains a subject
of active research.
Wrap-up
The last part of the manuscript sketches two algorithmic ingredients of simulation
codes that help to significantly reduce our codes’ time-to-solution. Indeed, high order
methods and local time stepping are considered a must in many research communities
today.
Though only sketched, one of the fundamental challenges for computer scientists
from the field of scientific computing is to map the efficient numerical techniques
onto efficient implementations. For some subchallenges, this is a straightforward
engineering task. For others, the numerical concepts (seem to) contradict the ef-
ficient exploitation of modern computers. Local time stepping is difficult to load
balance, optimistic time stepping is very sensitive to the cost and additional footprint
of backups/rollbacks, higher-order introduces additional synchronisation points per
time step, and so forth. A lot of these challenges are subject of active research in
high performance scientific computing, and a lot of the state-of-the-art numerical
techniques have not yet found their way into mainstream simulation codes that run
on the largest machines in the world.
Students reading this text know now all they need to write a first N -body code
with various interesting features. If you can realise such a simulation with local time
stepping and/or a higher order method on a multicore- or GPU-cluster, you can be
sure that you also understood a decent part of the underlying theory. Another option
for a challenging project that combines important aspects of this manuscript in a
new way is to replace the gravity force with a Lennard-Jones potential. You switch
from planets attracting each other via gravity to classic molecular dynamics. The
additional interesting aspect here is that you can cut off the interaction, i.e. it is
sufficient to “only” study the objects around a particle to obtain valid forces. This
reduces the algorithm to an O(N) algorithm, which poses yet another challenge: to
realise this efficiently.
Appendix A: Using the Text
Below is a suggestion of how to structure a course around this book plus some
remarks what could be added and done beyond its content. I also add some remarks
on the notation used and some useful software.
The enumeration below assumes that you have 50–60 min per session.
1. how to use the simple time command and how compiler optimisation flags affect
the outcome;
2. that compiler options have a significant impact on the runtime;
3. how to use a profiler. Starting with gprof is a good option;
4. how to use a debugger (gdb, e.g.) to find errors systematically;
5. how to run valgrind to identify memory leaks. An alternative is the Intel
Inspector or the Clang correctness checker toolchain, e.g.
• Session 11: Conditioning and well-posedness. Discusses this key idea behind
numerical algorithms and introduces the idea behind backward stability.
• Session 12: Taylor and ODEs. This is mainly a revision session. Most students
will have studied these ideas throughout their A-levels already, but it is nevertheless
good to revise it. I usually skip the remarks on stationary solutions, attractiveness,
…and squeeze the two Chaps. 11 and 12 into one session.
• Session 13: Stability. I return to the attractiveness/stability definition from
Chap. 12 and introduce flavours of (discretisation) stability. If you play around
with the code snippets from Chap. 13, then there might not be enough time to
introduce convergence and convergence plots.
• Session 14: Convergence. I wrap up the core block on numerics with a discussion
of convergence and convergence measurements plus plots. There should be some
time left to run through parts of the Parallelisation Paradigms from Chap. 14; as
a cliff hanger.
• Session 15: Upscaling laws and shared memory parallelisation with BSP. This
session covers the remainder of Chap. 14—I usually do a "live demo" of how to deter-
mine machine parallelism from likwid-topology or vendor descriptions—
and continues with Chap. 15. The discussion of proper upscaling plots can be
left to the students if they have to provide such plots as part of their coursework
anyway.
• Session 16 and 17: BSP in OpenMP. This session runs through Chap. 16. As I
also discuss variable visibilities and collective computations as well as scheduling,
I typically need two hours; notably if I want to do some live programming.
• (skipped): Task-based parallelisation. Chapter 17 discusses the majority of tasks.
I think this is an important topic for modern codes, but my students typically prefer
to learn something about GPGPUs instead, and I want to speak about something
beyond explicit Euler and about proper performance analysis, too. The task chapter
is thus one I often “sacrifice” and leave it as self-study exercise.
• Session 18: GPGPUs. Chapter 18 is popular with students. As I typically skip the
task discussion and thus have to omit the fancy GPU concepts, this one fits into
one session.
• Session 19—Intermezzo—Parallel Debugging and Performance Analysis. I
propose to use at least one session to discuss tools like VTune, APS, Map or other
tools. The NVIDIA analysis tools are another attractive option. The most important
message behind these sessions is not to make students prolific in using the tools.
The message that has to come across is that these tools do exist, that they are worth
learning, and that there is tons of good material out there—most of it provided
for free by the vendors. Another useful live demo session orbits around parallel
debugging.
• Session 20: Higher-order (RK) methods or adaptive time stepping. You might
want to add some more in-depth stuff on top of the content of Chaps. 19 and 20,
but there is some core content on advanced algorithms in the manuscript.
A.2 What I Have Not Covered (But Maybe Should Have Done)
There are a few things that make up a natural second part of the presented material:
• I have not spoken about any linear algebra. This is one natural next topic.
• Partial differential equations are missing. I focus exclusively on ordinary differen-
tial equations, as students, in my experience, like particles that dance around, and
ODEs are more intuitively accessible.
• The performance models are rudimentary.
• MPI is missing. This is the big missing thing from a high-performance computing
perspective.
ponents for the three directions, then I write x1,1 , x1,2 , x1,3 , x2,1 , x2,2 , x2,3 , x3,1 , . . .
Besides this, I try to avoid multi-indices.
My indices (loop variables in C) are usually drawn from i, j, . . . , n. They are
lowercase. I use uppercase letters for upper bounds, i.e. 1 ≤ n ≤ N or 0 ≤ n < N .
Sometimes, I start counting with 1 in my maths formulae and use 0 ≤ n < N in the
corresponding C code. Please be aware of such index shifts.
By default, I use lowercase letters such as f (t) or y(t) for normal functions, and
uppercase F for higher-order functions accepting further functions as arguments.
Again f or F can either be scalar or vector-valued (with f 1 (x), f 2 (x), . . .). It makes
sense in a couple of situations to use other symbols. I use p(x) for a position, e.g.,
and I use v(x) for velocities.
C is something I use (besides the one exception with Dennard scaling in Sect. 2.2)
as a generic constant, i.e. it is just a fixed positive value. If multiple Cs are found
within one formula, they might have different values (but each one a distinguished
fixed one). As C is just a generic identifier that says “this value is fixed, but we
might not know it”, any combination (additional or multiplication) of Cs yields a
new constant. Besides C, I also use a, b, c, . . . or a1 , a2 , a3 , . . ., e.g., as constants;
in particular when they calibrate shape functions.
e is the letter I use for (absolute) error. ε can be the relative error, but usually I
reserve it for machine precision (so, yes, a particular type of relative error). Depending
on the context, it can be negative: Most books define errors as absolute values. In
few cases, I however want to point out that an error might be “too much” or “too
small” compared to a wanted outcome.
Small variations of entries (due to errors or normalisation, e.g.) are always denoted
by Greek letters, too.
The term dt is used as difference between two t values. It means “delta t”. In
literature, the symbol Δt is often used instead of dt. I mainly try to stick to dt, as Δ
has a different meaning in the context of partial differential equations (out of scope
here). The ∂ and d symbols finally are used for derivatives. I usually use the subscript
to denote what type of derivative I mean, i.e. ∂t and dt are the derivative in time t.
Some rationale behind this as well as links to other notations are discussed on page
22.
Here is a list of software that I frequently use in my course. Besides the compiler,
nothing is required to master the book’s content, but some of these tools make
students’ life easier. I enlist only software that is free for students:
• You need a reasonably up-to-date C/C++ compiler. I recommend at least GCC 9.x,
Intel 20.x, or LLVM/Clang 8.0. Visual Studio 2019 or newer supports the majority
of features discussed, too.
Appendix B: Cheat Sheet: System Benchmarking
Before we start to assess any code of ours or to run any performance studies, we need
an understanding what we might be able to achieve: Modern machines have hundreds
of different settings and options, and we can combine them in various ways. If we
use an inferior setup, we pay in terms of time-to-solution. The other way round, we
might see excellent speedup if we work with a poor baseline. But such excellent
speedup is fake; it typically results from an “artificially” decreased serial ratio f .
Let’s first understand what we can theoretically expect from the machine’s nodes.1
If you bike to work and want to assess your fitness level, you cannot expect to have
an average speed of 100 km/h. That is the wrong calibration! It might hold for a car
but not for your bike. Only the knowledge of how much we can expect allows us to
make meaningful statements about the quality of our code and our setup.
The data acquisition exercise to answer our capability question consists of three
steps.
• Find out what processor you use and what its clock frequency is. On a Unix system,
a simple
1 For a first code run-through or tuning session, you might want to skip this section. Feel free to do
so. But in the end, you will have to tell your readers how well your code performs relative to the
hardware’s capabilities.
cat /proc/cpuinfo
typically gives you all the info you need: You get the CPU make (something like
Intel(R) Core(TM), e.g.) including the base frequency.
• Now go to the Internet and search for the specification of this chip. It should also
tell you which instruction set the chip does support. Search for something like
AVX512, e.g. This way, you know how much you can expect from SIMD vectorisation.
Next, we find out how many cores you have on your system. You can take this
from your supercomputer’s homepage, you can extract it from the cat output (don’t
be fooled by the hyperthreads that are displayed there), or you can use a tool like
LIKWID2 : LIKWID gives you a set of useful tools wrapping around processor-
specific APIs, and it offers command line tools to translate these data into digestible
information. For step two, we simply call
likwid-topology
which gives us all information (and way more data) we need at the moment.
With the number of cores and the instruction set/specification, we can compute the
theoretical number of floating point operations per second that we could get on our
machine:
peak = α · F · NCores · NFPUs/Core · Ninstr/cycle
Scaling by the frequency F gives us the number of steps (cycles) our chip can
complete per second. We multiply this quantity with the number of cores on our
system times the number of floating point units. The latter is usually one or two.
Hyperthreads do not count, as they don’t own FPUs of their own. Finally, we take
this number and multiply it with the number of operations our vector unit can deliver
2 https://hpc.fau.de/research/tools/LIKWID/.
per cycle. For AVX512 and double precision, we have 512 bits per register. This
is 64 bytes. We can hence squeeze eight double precision values into each register
(8 · 8 · 8 = 512). Plain SIMD hence yields Ninstr/cycle = 8. If we can write our code
with FMA, we even get Ninstr/cycle = 2 · 8, as we do a multiplication plus an addition
per cycle. The remaining α is a little bit nasty. Processors often downclock in
AVX mode—though it is up to the system administrator to disable this feature—or clock up
(turbo boost) if the power envelope (temperature) allows it. α ≈ 1 is a reasonable first
guess. The final result is the magic target. It tells us how many GFlops our system
can deliver; in theory. If we hit this target, then our code is clearly compute-bound.
This is our ultimate goal—as long as we use the best numerical techniques we are
aware of.
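As a worked example with made-up numbers (a hypothetical 16-core chip at 2.4 GHz with
two FPUs per core, AVX512 plus FMA, and α = 1), we obtain

peak = 1 · 2.4 · 10^9 · 16 · 2 · (2 · 8) ≈ 1.2 · 10^12 Flop/s,

i.e. roughly 1.2 TFlops of theoretical peak performance.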
In a third and final step, we acknowledge that the theoretical peak is kind of useless for
many codes, as the memory will not be able to move that much data into the registers
on time. To assess our code, we need a detailed understanding how much data our
memory could—for a real problem, not according to some marketing presentations—
move in and out of the chip. For this, we use the STREAM benchmark.3
3 https://www.cs.virginia.edu/STREAM.
Calibrated Insight
It is convenient and fair if you quantify achieved performance in time-to-solution plus
percentage of peak plus percentage of STREAM throughput. These three quantities
give a good picture of a code’s behaviour: They tell the reader how long to wait for
a result and to which degree an implementation exploits a given hardware.
•> Checklist
Before you start, try to get a sound understanding of your system. Here is a checklist
what you might want to find out before you start your measurements or write a report.
This collection is not comprehensive:
Find out what your machine looks like. This should include the instruction set
(vectorisation), frequency, and core counts.
Summarise this insight. Do not expect a reader to look up how many cores or
frequency or whatever your machine has. Do not just throw a machine name into
the write-up. You cannot expect the reader of your report to look up all these
data; machines change over time, or machine information might not be public.
State the theoretical peak. What can we expect from your code in the best case?
State the memory bandwidth. This is where STREAM data goes into. What can
we expect from your code in the best case?
Relate all data you present to the (theoretical) best-case quantities.
…
Further Reading
• J.D. McCalpin, Memory Bandwidth and Machine Balance in Current High Per-
formance Computers, IEEE Computer Society Technical Committee on Computer
Architecture (TCCA) Newsletter, pp. 19–25, (1995)
• T. Röhl, J. Eitzinger, G. Hager, G. Wellein, LIKWID Monitoring Stack: A Flexible
Framework Enabling Job Specific Performance monitoring for the masses, 2017
IEEE International Conference on Cluster Computing (CLUSTER), pp. 781–784,
(2017) https://doi.org/10.1109/CLUSTER.2017.115
Appendix C: Cheat Sheet: Performance Assessment
This is my cheat sheet, i.e. my template, how to run a first performance assessment
and start tuning a code. I do it top-down, I keep it generic, and I run everything as
black box initially. Often, I find it useful even for my own codes to lean back and
to pretend I knew nothing about them. This stops me from running into a certain
direction following my own runtime hypotheses—which more often than not turn
out to be wrong. Before you start to assess your code, ensure you know how your
machine behaves (Appendix B).
Symbol Information
Different to a plain performance tuning, our analysis should give us some expla-
nations why certain runtime patterns arise or where runtime flaws come from. For
this, you have to give the computer the opportunity to map machine instructions
(back) to source code. If you add -g, the compiler enriches the machine code with
symbol information, i.e. helper data. Without -g, it is totally unclear where machine
instructions originally come from, i.e. which high-level construct generates which
machine level instructions.
Let’s find out which environment setup and code configuration gives us the best
performance. Compilers today accept an astonishing number of parameters to guide
and control the translation process. When we run our tests, we have to ensure we use
the best compiler options, and that we deactivate everything that obscures our data.
• In practice, that means we use the highest compiler optimisation level (typically
-O3 or -Ofast), disable all auxiliary output (all of your prints) and switch
off input and output (IO).
• On most machines, it also pays off to compile such that the compiler builds a code
that’s tailored towards the particular machine. With LLVW, -march=native
-mtune=native makes the compile create the best machine instruction scheme
tailored towards the particular machine you compile on. With GNU,
-march=native has the same effect.
• Finally, compilers yield particularly fast code if you use further specialised optimi-
sation. You have to study their documentation and handbooks for your particular
system.
Cross-Compilation
On many supercomputers, the login nodes are of the same make as the compute
nodes. They might have more memory or more cores as they are shared amongst all
users, but otherwise understand the same instruction set. If this is not the case, you
have to use cross-compilation, i.e. you have to tell the compiler explicitly to write
out machine code for a machine it is not running on.
This is the time to search the internet for details and to read the compiler man-
ual. Last but not least, test with different compilers if you have different compilers
installed. The compilers all use different heuristics and internal optimisation algo-
rithms, i.e. they all perform differently for different test cases.
In principle, OpenMP runs are close to trivial. You set the environment variable
OMP_NUM_THREADS to pick the right number of cores and run your experiment.
However, there are a few things to consider.
Most supercomputer schedulers such as SLURM installations already set the right
core count automatically. I do recommend that you submit a test script and add a
simple export to it. That allows you to check once which type of variables are set
by SLURM on your system. If your SLURM installation does set the core count,
then I recommend not to overwrite OMP_NUM_THREADS. Instead, use the SLURM
parameters within the script file header to configure your run. Whenever possible,
leave it to SLURM to then set the right variables. Otherwise, you quickly run into
data inconsistencies if you alter the SLURM parameters but forget to update your
script’s variables.
export OMP_PROC_BIND=spread
ensures that threads are spread-out. The other way round, setting the variable to
close keeps the threads together.
Fiddling around with where threads physically go is called thread binding or
thread pinning. It pays off to test different variants from time to time. If you need
even more fine-granular control, you can use the environment variable OMP_PLACES
to pin threads to particular sockets, cores, or hardware threads. Consult your cluster/OpenMP docu-
mentation for details. You have to test which one is better.
Some compilers such as the one by Intel offer different OpenMP backends, i.e. dif-
ferent implementations of the scheduler, heuristics, and so forth. It is worth checking
the impact of these variants.
Once we know what we could expect from the machine and once we are sure that
we use a reasonable configuration, we can analyse what we obtain from our code.
Characterising and explaining the runtime behaviour typically requires a tool like In-
tel VTune or the Nsight profiler by NVIDIA or a non-commercial alternative. These
tools allow you to obtain some high-level overview and to answer some fundamental
questions:
1. Is the work between the cores reasonably balanced, i.e. how many cores do you
effectively use?
2. Does your code use all the GPUs all the time, or are they busy only for a brief
period?
3. Has your code spent significant time in synchronisation, i.e. in barriers or atomics?
4. How many MFlop/s does the code or do parts of the code deliver?
5. How many bytes are read and written to and from the main memory?
I recommend that you run these tools regularly on your codes. They are simple to
use nowadays, there are a lot of good tutorials, most of them are free, and it is fun to
understand why your code behaves in a certain way.
Frequency
Try to assess to which degree your computer alters the frequency. There are different
applications to read out the actual frequency. likwid-perfscope is my favourite.
If a computer reduces the frequency due to heavy AVX-usage, e.g., you cannot expect
the theoretical improvement of speed that you computed.
Once we know how our code performs and why—is it a scalability issue or are we
bandwidth-bound, e.g.—we next are interested in the code parts that are responsible
for this behaviour. We switch from a mere black-box performance assessment to
performance analysis.
To find out which code parts are responsible for some behaviour, we have two options:
We can instrument the code, i.e. tell the compiler to inject routines into the generated
code that keep track of the runtime burnt within some code parts, or we can sample
the code. Sampling-based profiling means that an external tool stops the code in
certain time intervals, and takes snapshots (which function is currently active, which
cores are busy, how much memory is currently used, …) to create some statistics.
Sampling is less intrusive, i.e. introduces a lower overhead, while instrumentation
can yield more details.
There are many good commercial or academic tools that help you to obtain a better
understanding of how your code performs. Among the most popular ones is Intel’s
VTune Performance Analyser. NVIDIA has its own set of tools, too. They are free
for academics/students. Finally, the VI-HPS (https://www.vi-hps.org) offers a com-
prehensive overview incl. tutorials of free tools which are developed by academics
for other academics and not driven by major chip vendors.
Once you have identified bottlenecks in your code, you might want to alter the im-
plementation and tune it. After that, return to Sect. C.3 and rerun the assessment.
Performance analysis can be time-consuming and difficult. As it is a key activity be-
hind high-performance software engineering, there is however a lot of good tutorials
and documentation available.
•> Checklist
When you submit a code assessment, here is a checklist what might be covered by
your report. This collection is not comprehensive:
Summary of compile options you have used (and explanations if you have used
unorthodox ones). If necessary, this one might be supported by some benchmarks.
What did a high-level (black box) performance assessment tell you? Is this
code bandwidth- or compute-bound? What’s your high-level judgement on the
simulation?
What is the vectorisation and multicore performance character of your code?
Where are the hotspots? Which code parts are responsible for the observed
behaviour?
If you have altered the code, some before versus after plots are always good.
…
Appendix D: Cheat Sheet: Calibrating the Upscaling Models
Speedup laws are great to assess the quality and potential of our code. However,
we seldom know the f used in both Amdahl’s and Gustafson’s law. Some people
sit down and count operations and then come up with an f for their code. This
is cumbersome and, for larger codes, almost impossible. A better option is to run
experiments.
Before you do so, ensure you use the most optimised code. It makes no sense
to conduct upscaling experiments with a debug version of your executable. Further-
more, I recommend to switch off any IO. Do not read or write files, and even switch
off excessive terminal output. The IO subsystem of computers is notoriously slow,
i.e. writing files will dominate your runtime and the real upscaling potential is hidden
behind these cost.4
4 It is a completely different story obviously if you want to benchmark the scalability impact of IO
on a parallel run.
Strong and weak scaling are parameterised by f ∈ [0, 1] and the time t (1) we
need/would need using one core. With two unknowns, we can calibrate these laws
taking only two measurements: You measure t ( p1 ) for a certain core count p1 . And
you measure t ( p2 ). For many real-world applications, none of your runs will use
pi = 1 as you won’t be able to squeeze something meaningful onto a single core
without waiting for ages. However, you can instantiate Amdahl’s or Gustafson’s law
twice. Here’s the example for Gustafson:
t(1) = f · t(p_1) + (1 − f) · t(p_1) · p_1
t(1) = f · t(p_2) + (1 − f) · t(p_2) · p_2
As you know t ( p1 ), t ( p2 ), p1 and p2 , you have two unknowns remaining (t (1) and
f ), so you can solve this problem. One option is to subtract the two equations:
0 = f · (t(p_1) − t(p_2)) + (t(p_1) · p_1 − t(p_2) · p_2) − f · (t(p_1) · p_1 − t(p_2) · p_2)
⇒ f = (t(p_2) · p_2 − t(p_1) · p_1) / (t(p_1) − t(p_2) + t(p_2) · p_2 − t(p_1) · p_1).
From (15.3), we see that we are not interested in t (1) for statements on the speedup.
The speedup effectively normalises the runtimes by t (1).
With more than two measurements—which is actually what we should do—we
have to replace our simple substitution strategy with some parameter fitting.
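A minimal sketch of this two-point calibration (illustrative names; t1 and t2 are the
measured runtimes on p1 and p2 cores, and the formula is the one derived above for
Gustafson's law):

// Serial fraction f in Gustafson's law, calibrated from two measurements.
double gustafsonSerialFraction(double p1, double t1, double p2, double t2) {
  double numerator   = t2 * p2 - t1 * p1;
  double denominator = t1 - t2 + t2 * p2 - t1 * p1;
  return numerator / denominator;
}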
The measurement span has to be long enough to get rid of noise. Many codes have a
setup phase and then run many time steps. This setup phase often consists of memory
allocations and data structure preparation. It thus is notoriously difficult to parallelise.
You should be very careful (and document) whether you ignore this setup time or
you take it into account.
In our thought experiment behind the weak scaling, we increase the problem size as we increase the compute unit count. For most algorithms this is problematic, as their runtime is not linear in the number of unknowns. If we double the particle count in our baseline code, for example, we get a four times bigger compute load.
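To make this concrete: since doubling the particle count quadruples the load, the work grows like N² in the particle count N. Keeping the work per compute unit constant under weak scaling therefore means

$$\frac{N(p)^2}{p} = N(1)^2 \quad\Longrightarrow\quad N(p) = N(1)\,\sqrt{p},$$

i.e. we may double the particle count only when we quadruple the number of compute units.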
A similar argument holds if we study an algorithm with an adaptive time step
size for different setups. One setup might be “stiffer” than the other and thus yield
a different time step size. So we cannot compare runtimes directly. Things become
better once we follow some simple rules:
1. Present time per time step rather than raw total runtime.
2. Depending on whether you do strong or weak scaling tests, present the time per
particle update or time per force computation.
3. Discuss the problem sizes explicitly.
•> Checklist
When you submit a scalability assessment, here is a checklist of what you might want to discuss in your report. This collection is not comprehensive:
• Be fair and compare your data to a serial run that is compiled without any parallelisation feature. How much do you have to pay for the parallelisation per se?
…
Appendix E: Cheat Sheet: Convergence and Stability Studies
This is my cheat sheet, i.e. my template, for how to run some first convergence tests with my codes: how to ensure that the code solves the correct problem with the correct accuracy, or how to determine this accuracy, respectively. Most tests approach the challenge
in a quite generic way. This is reasonable: The steps of any stability and conver-
gence analysis are usually very similar and problem- and algorithm-independent. It
is the interpretation of these data that requires domain knowledge and a numerics
background.
Any convergence analysis follows a certain pattern using key concepts from the
lecture: We first ensure that we have a consistent scheme, i.e. that the solution goes in
the right direction once we make the discretisation error smaller. Then, we study the
code’s stability. Due to the Lax Equivalence Theorem, we can then make a statement
on the code’s convergence behaviour and quantify it. Before we run through this
sequence of tests, we have to discuss what good test cases are.
Our first consideration has to orbit around proper test cases. To construct test scenarios, we have to ask "what do we want to analyse", "what setup should make our
algorithm yield the expected characteristic solution”, “what do we expect to see”,
“how do we get data to support our claim”, “how many shots do we need”.
Consistency of time integrators. To assess the consistency of our code, we have to
ensure that we solve the right problem as the discretisation error vanishes. To make
our job easier, we first should investigate whether we know an analytical solution or
can construct one. If we write an ODE solver, the Dahlquist test equation is a good
test. But there might be better ones. An all-time classic for particles is a collision test:
We construct symmetric starting conditions (arrange all particles on a sphere around
the coordinate system’s origin, e.g.), and thus know where exactly particles should
clash.
Timestamp
It is important that we always track the solution at the same point in time. If we have
different time step sizes (depending on object count, e.g.), we have to take this into
account and alter the time step count accordingly.
With an exact function representation, we know what we want our solver to return. We
now run a series of tests for different time step sizes, where we make the discretisation
error smaller and smaller. If the solutions get closer and closer to the analytical
solution, we conclude that we have a consistent scheme.
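A minimal sketch of such a series of tests, assuming an explicit Euler integrator and the Dahlquist equation y' = λ·y with λ = −1 (both are my choices for illustration, not the only sensible ones):

#include <cmath>
#include <iostream>

// Explicit Euler for the Dahlquist test equation y' = lambda*y, y(0)=1,
// integrated up to time T with step size h.
double integrate(double lambda, double T, double h) {
  double y = 1.0;
  const int steps = static_cast<int>(std::round(T / h));
  for (int i = 0; i < steps; i++) {
    y += h * lambda * y;
  }
  return y;
}

int main() {
  const double lambda = -1.0;
  const double T      = 1.0;   // observe the solution at the same timestamp for every run
  const double exact  = std::exp(lambda * T);

  // Halve the time step size from experiment to experiment.
  for (int k = 0; k < 5; k++) {
    const double h     = 0.1 / std::pow(2.0, k);
    const double error = std::abs(integrate(lambda, T, h) - exact);
    std::cout << "h=" << h << "  error=" << error << std::endl;
  }
  return 0;
}

If the printed error roughly halves whenever h is halved, the behaviour is consistent with a first-order scheme; a higher-order integrator should shrink the error faster.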
Consistency of physical model. Many algorithmic optimisations (in an N-body context) do not kick in for a simple (two-body) setup: Collisions of different particles or adaptive time stepping, e.g., become interesting only for larger N. In this case, we should do some literature research on what other groups have done and whether
they provide quantitative data. Study exactly how they measure the data they report
on and measure the same way. It is highly unlikely that you get completely different
results and all the other codes have been wrong.
For more complex physics, I do recommend brainstorming which properties should be preserved or visible to the bare eye. Often common sense helps. Is a setup symmetric and should it thus yield a symmetric outcome? Is a setup given in three dimensions but does it basically boil down to a one-dimensional solution, which should be visible in
the outcome, too? Is a setup kind of homogeneous (particles are distributed along a
Cartesian mesh, e.g.) and should thus yield a homogeneous outcome?
Challenging test setups. Once we know that our code is correct for simple setups, we should run a few setups which challenge our algorithm: I always recommend having at least one setup that is very stiff, i.e. suffers from a large F at one point. In the N-body context, it makes sense to study setups which either host particles with extremely differing speeds or are inhomogeneously distributed. You split up the domain and assign the particles in one part a very high velocity. These particles are convection-dominated, while the particles in the other part of the domain merely drift, i.e. diffuse.
This stresses any time stepping algorithm, as high speed tends to require tiny time
step sizes (cmp. our stiffness discussion). If you start from a very inhomogeneous
particle distribution, you again stress your algorithms as some particles are subject
to quite a lot of strong interactions while others are almost not affected by their
neighbours.
Once we have clarified what type of behaviour we want to analyse (stiffness, e.g.),
we construct a setup that is as simple as possible but still runs into the phenomena of
interest. This could be two particles on a collision trajectory, e.g. Write down what
you would expect to happen. What stability behaviour do you expect to see, what is the smoothness of the problem that you would like to see reflected in the code outcome,
… Finally, test the algorithm for various input parameters and support your claims.
It is important for a fair report that you carefully document what you assume about
your solution, and that you are clear where your implementation struggles or breaks
down.
Convergence behaviour. To characterise the convergence behaviour, we study the
behaviour of our code for a set of different time step sizes, and track the quantity of
interest (velocities or positions or distances of particles, e.g.) at the end point. Before
we start, it is important that we validate two things: Does our setup have a solution which is sufficiently smooth, and does it have a solution which is sufficiently interesting?
The Dahlquist test equation is a poor test if you pick a large observation time stamp,
as all proper numerical schemes—independent of their order—will have produced
solutions close to zero. Even for schemes differing massively in order, the results
will be hard to distinguish, so you cannot measure anything. A simple solution here
is to pick an observation time stamp shortly after you have unleashed the simulator,
such that you still face massive changes in the solution per time step.
With a table of time step sizes versus quantity of interest, we can finally compute
the accuracy. If we know the exact solution, we can do this directly. If we do not
know the exact solution, we cannot compute any error. The only thing we can do is
to compute the differences between subsequent approximations and to extrapolate
(cmp. Sect. 13.3). Hence, we need at least three measurement points, but the more we have, the better.
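A small sketch of this extrapolation idea: the three values below are made-up placeholders for a quantity of interest obtained with step sizes h, h/2 and h/4, and the estimate assumes that the leading error term C·h^p already dominates.

#include <cmath>
#include <iostream>

int main() {
  // Hypothetical quantity-of-interest values for step sizes h, h/2 and h/4
  // (placeholder numbers; replace them with your own measurements).
  const double uH = 1.25, uH2 = 1.0625, uH4 = 1.015625;

  // Without an exact solution, differences of subsequent approximations still
  // reveal the order: for |e| ~ C*h^p and halved step sizes,
  //   |u_h - u_{h/2}| / |u_{h/2} - u_{h/4}| ~ 2^p.
  const double order = std::log(std::abs(uH - uH2) / std::abs(uH2 - uH4)) / std::log(2.0);
  std::cout << "observed convergence order p ~ " << order << std::endl;
  return 0;
}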
Finally, we plot the data. Log plots or log-log plots are your friends here, as we typically halve the time step size per new experiment. Try to squeeze a trend line through your data points and, hence, to spot what convergence order your data exhibits.
When we plot data, we typically want to work with a scalar. We want to assign a setup a certain number. This is what a norm over a simulation state, i.e. vector, does. The two most frequently used norms are the max norm and the Euclidean norm (2-norm):

$$\|y\|_{\max} = \max_i |y_i| \qquad\text{and}\qquad \|y\|_2 = \sqrt{\sum_i y_i^2}.$$
If you have N particles, you can for example report on their maximum velocity, which evaluates $\|v\|_{\max}$ over N quantities (remember you first have to compute the per-particle velocity via $v = \sqrt{v_1^2 + v_2^2 + v_3^2}$). Or you can report on the $\|\cdot\|_2$ norm over the minimum distances between particles. In any case, you have to specify explicitly what you measure.
You will recognise that many authors do not write down $\|\cdot\|_2$ but use $\|\cdot\|_2^2$. This allows them to omit the square root.
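A minimal sketch of the two norms over a simulation state stored as a plain vector (the sample entries are placeholders):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Maximum norm: picks the entry with the biggest magnitude.
double maxNorm(const std::vector<double>& y) {
  double result = 0.0;
  for (double yi : y) result = std::max(result, std::abs(yi));
  return result;
}

// Euclidean norm (2-norm): square root over the sum of squares.
double euclideanNorm(const std::vector<double>& y) {
  double sum = 0.0;
  for (double yi : y) sum += yi * yi;
  return std::sqrt(sum);
}

int main() {
  const std::vector<double> y = {7.0, 7.0, 7.0, 1000.0};
  std::cout << "max norm: " << maxNorm(y) << ", 2-norm: " << euclideanNorm(y) << std::endl;
  return 0;
}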
If $\|\cdot\|_{\max}$ and $\|\cdot\|_2$ are the popular choices, which one shall we take? This depends on the property you are interested in (Fig. E.1). The $\|\cdot\|_2$ is an averaging norm. If all entries within a large vector have a magnitude of around seven, the norm will return 7 scaled with the square root of the size of the vector. If a few entries are really large yet the vector is large, too, then you won't see this under the $\|\cdot\|_2$ norm. The $\|\cdot\|_{\max}$ has a different focus: It picks the biggest entry. So if you have a large vector with sevens everywhere, but one single entry is 1,000, then $\|\cdot\|_{\max}$ will return 1,000.
Choice of Norms
The $\|\cdot\|_{\max}$ is good to study outliers, the $\|\cdot\|_2$ is good to argue about overall trends.
Norms return a positive scalar and we thus might be tempted to compare norms of different experimental runs. This can go horribly wrong. Imagine you want to argue about the $\|\cdot\|_2$ norm (trend) of a vector which holds 1s only. If the vector has 1,000 entries, you obtain $\sqrt{1000}$. If you rerun the experiment with a bigger setup and the vector now has 2,000 entries, you obtain $\sqrt{2000}$. You can't compare them. So it might be reasonable to calibrate/normalise the norms, i.e. to use $\|\cdot\|_x = \frac{1}{N}\|\cdot\|_2$. The $\|\cdot\|_{\max}$ is immune to such size changes.
The interpretation of your data is the really tricky part and, obviously, strongly dependent on your problem. There are a few generic things to take into account, however:
1. Try to spot regions where your algorithm is not A-stable, i.e. where the time step size is simply too big. In practice, it is really interesting to know where these critical time steps start. For your subsequent data analysis, it is important to exclude “unstable measurements” from any computation. You cannot put a trend line through data that stem from an unstable regime.
2. Try to spot whether your algorithm becomes unstable if the time steps become tiny (zero-stability). This can be laborious: If the convergence plot stalls around machine precision, then you are fine and there's nothing to worry about. However, if it stalls or the error increases before you have hit machine precision, then the code's zero-stability is not great.
With these things in mind, you can start to use your observed data to calibrate the error formula. Formally, you use the ansatz

$$|e| = C\,h^p$$

to come up with the problem formulation

$$(C, p) = \arg\min_{C,p}\; \frac{1}{2} \sum_i \bigl( C\,h_i^p - e(h_i) \bigr)^2$$

over all measurements with $h_i$. Section 13.3 discusses all required techniques.
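A sketch of such a calibration: instead of tackling the nonlinear problem above directly, I take the common shortcut of fitting log|e| = log C + p·log h by ordinary linear regression; the step sizes and errors below are made-up placeholder values assumed to come from a stable regime.

#include <array>
#include <cmath>
#include <cstddef>
#include <iostream>

int main() {
  // Hypothetical step sizes and measured errors (placeholder values;
  // unstable measurements must be excluded before the fit).
  const std::array<double,4> h = {0.1, 0.05, 0.025, 0.0125};
  const std::array<double,4> e = {2.1e-3, 5.4e-4, 1.3e-4, 3.4e-5};

  // Linear regression in (log h, log |e|): the slope is p, the intercept is log C.
  double n = 0.0, sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  for (std::size_t i = 0; i < h.size(); i++) {
    const double x = std::log(h[i]);
    const double y = std::log(e[i]);
    n += 1.0; sx += x; sy += y; sxx += x*x; sxy += x*y;
  }
  const double p    = (n*sxy - sx*sy) / (n*sxx - sx*sx);
  const double logC = (sy - p*sx) / n;

  std::cout << "observed order p = " << p << ", constant C = " << std::exp(logC) << std::endl;
  return 0;
}

Note that fitting in log space weights the measurements differently than the least-squares formulation above; for a first assessment of the convergence order, the difference rarely matters.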
First Order
It is really hard to cook up a method that is not at least first-order accurate. So if
your method exhibits a convergence order below 1 and you are operating in a stable
regime, something went horribly wrong.
What happens if we had expected p > 1 and it turns out to be 1 or, in general, lower than the expected accuracy? One of the major reasons for this (besides programming errors) can be that the function that you approximate is not sufficiently smooth. If the F of an ODE jumps around suddenly, i.e. is not differentiable, then you cannot expect higher-order accuracy.
•> Checklist
When you submit a convergence study, here is a checklist of what should be covered by your report. This collection is not comprehensive:
…
Appendix F: Cheat Sheet: Data Presentation
Presenting your experimental data is more than dumping it into a spreadsheet program
and pressing a “create graph” button—though this might be appropriate to get a first
idea of what the data looks like. This is my cheat sheet that helps you to avoid the
most “popular” flaws when you present your data.
The first question to ask yourself before you start to draw a plot is whether a table
or figure is the better option. Here’s a rule of thumb:
• A table is better to present quantitative data. If the exact value is of relevance, then
you should rather go for a table.
• A figure with a graph is better to present qualitative data. If the reader shall be
able to spot a trend, a data region where something interesting happens, or shall
be able to compare different measurement series, then a figure is usually the better
choice.
F.1 Tables
• Table captions in most styles are put above the actual table, not below.
• Ensure the caption allows the reader to understand what is displayed in the table.
The reader should not be forced to search for details in the text.
• Measurements are presented top-down within a column. Each measurement goes
into a row of its own. Different quantities of interest go into different columns.
• How many valid digits do you have? Is it two digits behind the decimal point or
only one? Different measurements of the same quantity should all use the same
precision.
Fig. F.1 Example of a rather poor table with several flaws: There is no caption that tells the reader
what is shown. The text is slightly too small. It should be similar to the size used throughout
the script. This is a pity as there would have been enough white space to increase the font size
given your current page width. The number of significant digits per measurement is not the same
per column—remember that data is grouped per column, i.e. different columns can host different
precisions—and the numeric measurements should be aligned right. The text is aligned left (that is
fine), but the punctuation is inconsistent: One remark ends with a full stop, the other one does not.
You might argue that the variance column is useless or could be presented in a nicer way
• Text within table cells is usually aligned left, but data are aligned right. As you
use the same number of valid digits, the decimal point is automatically aligned.
• Use a font within the table that is roughly as big as the text font you use throughout
your write-up.
• If multiple rows of a column host exactly the same data, you might be able to write
it down only once and use quotation marks. Or you might be able to highlight
the few entries that are interesting/different (via bold font, e.g.). If all entries are
the same or only one or two differ, it might be reasonable to remove the column
altogether. This is information that might better fit into the text or the caption.
• Captions do not interpret data. They describe what you see.
F.2 Graphs
When you create a figure hosting a graph, extra care is required. Before you start,
clarify what type of data you have:
• Is your data continuous, discrete or is it discrete with a certain trend? If you have
speedup measurements for 1, 2 and 4 cores, then you clearly have discrete data
points. When you display it later on, it does not make sense to suggest that there
were data for 1.5 cores. If you have however data from 1 through 256 cores and the
data is reasonably smooth, then you might argue that there is a continuous trend.
If you argue about different time step sizes, then your data are continuous.
• Is your data noisy or are you reasonably confident that it is pretty much spot on?
If it is noisy you might have to display the standard deviation, or you might want
to work with a scatter plot where the average (the trend) is displayed as a line.
Once you understand the character of your data, try to follow some simple guidelines:
Fig. F.2 Example of a lousy figure. It is of low quality (not a vector graphics format), and one axis
is not labelled. There are only three measurements—at least markers indicate where they are—but
the author has decided to interpolate in-between with a smooth curve. It suggests that we have some
additional information how the data behaves in-between the measurement points. The caption is
above the figure, which is not in line with most journal styles, and the legend lacks proper text
• Figure captions in most styles are put below the actual figure.
• Ensure the caption allows the reader to understand what is displayed in the figure.
But leave the interpretation (why) for the text.
• Label all axes including units where appropriate. Make sure the reader understands immediately whether you plot on a linear scale or a log scale.
• If you plot, ensure that the reader can understand the plot even if it is plotted in
black and white only. It is good practice to use markers (circles, squares, triangles)
to both distinguish different lines and to highlight where actual data are available
and where you interpolate.
• Ensure there is a legend.
• Use a font for all labels, the legend and the caption that is roughly as big as the
text font you use in your text.
A few of the flaws from above are illustrated in Figs. F.2 and F.3. The list grows
longer as soon as we find multiple figures within one report. Popular “flaws” are
that multiple figures show data that are to be compared by the reader, but they use
different ranges on the axes. Other papers present data distributed over multiple
graphs but use different colour maps and different markers for similar measurements
or measurements that belong (logically) together.
Fig. F.3 Plot without (left) and with (right) markers. The markers make the graph readable even if it is printed without colours or if a reader struggles to distinguish colours (cmp. bottom row), and they help us to argue about the three plots in an interpretation even though the graphs overlap most of the time and thus are hard to distinguish. It might be reasonable to use a symlog plot here or a log plot and to eliminate the initial 0 measurement from the plot to allow the reader to distinguish the measurements even clearer. Some authors would omit the legend from three out of four plots if they are grouped as one, while it is absolutely essential that all plots use the same scale on the x- and y-axis such that we can visually compare trends
Make it as easy as possible for the reader:
• Keep data that belong together close (same page, e.g.) and
• use the same style for data that belongs together, such that the logical connection
corresponds to a similar visual style. At the same time
• try not to present too much data per graph. Three to five data sets is a good guideline.
Graphs have to be easy to digest as they communicate qualitative insight.
F.3 Formulae
Formulae are properly set by typesetting systems such as LaTeX. There is not that much to do. A few pitfalls, however, continue to exist:
• If you give a formula a number, then you have to reference it later. If you do not
refer to a particular equation later on, do not assign it a number!
• Equations are just read as part of the sentence, i.e. if I claim that
$$a \neq b,$$
then read it out loud as “I claim that a is not equal to b”. It immediately is clear that we need the comma after the formula. Formulae are part of your text and thus need proper punctuation.
• If you find a (10) in your sentence, then read it as “if you find an equation ten in
your sentence”. There’s no need to write down “equation”. The brackets do the
job.
• The only exception arises if the (10) starts a sentence. In this case, you write
“Equation 10 illustrates …” and you omit the brackets.
A manuscript should be digestible without any of the figures and tables. It should be able to stand alone. However, a figure or table is considered to be non-existent if it is not referenced within the text. When you reference it, try to be consistent: either write Figure X or Fig. X and note that most styles today consider tables and figures to be proper names once they come with a number. That is, Figure 10 is uppercase but “all the figures” is lowercase.
When you introduce a data set, I recommend that you first tell the reader exactly
what you measure, i.e. what type of experimental setup is used. Next, describe what
you see in a figure or table. This is where the reference should be found. Do not leave
it to the reader to study a plot. Tell the reader exactly what you can spot. Finally,
present your interpretation and evaluation. You discuss: what is done, what is seen,
what does it mean; in this order.
Most typesetting systems take ownership of how they move figures and tables
around. However, you have some influence on this. Ensure that a figure or table is
roughly placed (on the page) where it is referenced for the first time. Do not force
your readers to turn pages over and over again.
Lots of style guides recommend that figures, tables and equations should not be
subjects in your sentence. Indeed, I find it strange to read “Table 20 shows that the
algorithm converges in O(h 2 ).” Good text rather reads: “The algorithm converges
in O(h 2 ) (Table 20).”
Validity
These are personal (biased) guidelines. Some supervisors will have different opinions
and some journals will have other guidelines and conventions. Check them carefully!
•> Checklist
When you submit a report, here is a checklist of what you should take into account. This collection is not comprehensive:
• For every single plot or table, there is a clear case why this is a table and not a figure (or the other way round).
• Both tables and figures are clearly labelled and captioned.
• All figures and tables are referenced in the text at the right place and are displayed roughly where they should be.
• The recommendations from the sections above either are followed or there are good reasons not to do so.
• All figures and tables follow the same stylesheet.
• You have printed everything in black and white, too, and you have checked that the reader can still distinguish all details.
• All captions are consistent, i.e. they either are only brief remarks or complete sentences with a full stop. Avoid inconsistencies.
• If multiple figures or tables use similar data sets or experiments, ensure they use the same symbols, colours, …
…
Further Reading
• R.A. Day, B. Gastel, How to Write and Publish a Scientific Paper. Cambridge University Press (2006)
• Stephen King, On Writing: A Memoir of the Craft. Hodder Paperbacks (2012)
• W. Strunk Jr., E.B. White, The Elements of Style. Allyn and Bacon/Longman (1999)
• N.J. Higham, Handbook of Writing for the Mathematical Sciences. SIAM (2020)
Index
T
Task, 212, 233

W
Well-posedness (Hadamard), 120, 124, 145