01 - Introduction: 1 Why Parallel Programming Is Important in Research
In [2]: Image(filename="bd_hpc.png")
Out[2]:
In [3]: Image(filename="whyhpc1.png")
Out[3]:
In [4]: Image(filename="whyhpc2.png")
Out[4]:
2 Class goals:
1. Explain what factors force us to consider parallel computing
2. Give an overview of parallel programming paradigms and patterns
3. Give an understanding of modern computer architectures from a performance point of view
• Processor, [Cache, Memory subsystem]
• Use x86-64 as a de-facto standard
• Look at how accelerators are made
4. Explain hardware factors that improve or degrade program execution speed
5. Explain how to do detailed performance measurements, highlighting the most important events for such measurements
3 Contents
Section ??
Section ??
Section ??
In [5]: Image(filename="500eurofr_HR.jpg")
Out[5]:
In [6]: Image("photo-intel-haswell-core-4th-gen-logos.jpg")
Out[6]:
What do we want to measure? 1. performance in terms of cost; even "commodity hardware" costs grow large in big facilities 2. performance per Watt; let your 1 kW GPU mining rig loose at home and you'll understand why
4.0.1 Moore’s law
In 1965(1) , Intel co-founder Gordon Moore predicted (from just 3 data points!) that semiconductor
density would double every 18 months. – He was right! Transistors are still shrinking at the same
rate
(1) Moore, G.E.: Cramming more components onto integrated circuits. Electronics, 38(8), April
1965.
In [7]: Image("moores.jpeg")
Out[7]:
In [8]: Image("1200px-Moore's_Law_Transistor_Count_1971-2016.png")
Out[8]:
available Section ??
This has become a de-facto standard accepted by:
• Semiconductor manufacturers (Intel, ARM, AMD)
• Hardware integrators (e.g. Asus)
• Software companies (pick your favourite)
• Customers (that is, us)
Consequences: an incredible level of integration
• CPUs: many cores, hardware vectors, hardware threading
• GPUs: an enormous number of floating-point units
For now, we just assume that hardware threads, cores, and so on are different computing elements on one integrated chip.
Today, we commonly acquire chips with 1'000'000'000 (10^9) transistors!
Server chips and high-end GPU devices have more:
In [9]: Image("titan.jpg")
Out[9]:
- GPU Name: GM200
- GPU Variant: GM200-400-A1
- Architecture: Maxwell 2.0
- Process Size: 28 nm
- Transistors: 8,000 million
- Die Size: 601 mm²
But if computers double their power every 18 months, we don't need to worry about performance...
not quite.
That used to be the case in the "days of the Pentium":
In [10]: Image("oldays.png")
Out[10]:
An implicit agreement: - write your code as you want - performance is a problem for those smart guys in the hardware dept.
In [11]: Image("power.png")
Out[11]:
Not only does this suck up a lot of energy (and money), it also creates a lot of heat, since Q ∝ P
In [12]: Image("sun.png")
Out[12]:
See also: [1] S. H. Fuller and L. I. Millett, "Computing Performance: Game Over or Next Level?," Computer, vol. 44, no. 1, pp. 31–38, Jan. 2011.
So what solution did the smart guys in the HW dept. invent?
In [13]: Image("power2.png")
Out[13]:
4.0.2 Possible solutions?
In [14]: Image("gskill-hwbot-overclock-kit-cinebench-r15.jpg")
Out[14]:
... nitrogen cooling is cool (literally) but quite expensive
In [15]: Image("natick.jpg")
Out[15]:
this one may be rather difficult if you are not a scuba lover
and we still have to pay the electricity bill
V = IR
How much power does it consume? At any time the circuit holds a charge q = CV and hence performs a work equal to:
W = qV = CV^2
The power consumption (plus fixed terms) is, with f the operating frequency:
W t^-1 = W f = CV^2 f
Consider now what happens when we split the CPU into two cores operating at half the clock speed:
In [16]: Image("more_cores.png")
Out[16]:
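As a back-of-the-envelope check, here is a minimal sketch (not course code) of the argument in the figure above, assuming the dynamic-power model P = CV^2 f and that the supply voltage can be scaled down linearly with the frequency:

// Hedged sketch: dynamic power P = C*V^2*f, assuming voltage scales with frequency.
#include <stdio.h>

int main(void)
{
    double C = 1.0, V = 1.0, f = 1.0;                 // normalized units
    double p_one = C * V * V * f;                     // one core at full clock
    double p_two = 2.0 * C * (V/2) * (V/2) * (f/2);   // two cores at half clock and half voltage
    printf("one core: P = %.2f   two half-speed cores: P = %.2f\n", p_one, p_two);
    // same aggregate instruction throughput (2 * f/2 = f), roughly a quarter of the dynamic power
    return 0;
}

In practice the voltage cannot be scaled that aggressively, but the direction of the effect is what makes multi-core designs attractive.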
[1] S. H. Fuller and L. I. Millett, "Computing Performance: Game Over or Next Level?," Computer, vol. 44, no. 1, pp. 31–38, Jan. 2011.
As a practical example, the E5640 Xeon (4 cores @ 2.66 GHz) has a power envelope of 95 watts while the L5630 (4 cores @ 2.13 GHz) requires only 40 watts. That's 137% more electrical power for 24% more CPU power, for CPUs that are for the most part feature-compatible. The X5677 pushes the speed up to 3.46 GHz with some more features, but that's only 60% more processing power for 225% more electrical power.
Now compare the X5560 (2.8 GHz, 4 cores, 95 watts) with the newer X5660 (2.8 GHz, 6 cores, 95 watts) and there's 50% extra computing power in the socket.
In [17]: Image("manycores.png")
Out[17]:
note that three out of six here come from the game industry
But what about the gentlemen's agreement with the HW dept.?
well ...
We have to cope with a new contract: the HW dept. will continue to add transistors (in lots of simple cores) ...
... and the SW dept. will have to adapt (rewrite everything).
That's why we have to worry about parallel programming.
Market forces, not technology, will drive technology (core counts in this case)
(Nietzsche would like this one)
In [18]: Image("manycores2.png")
Out[18]:
Fragmentation into multiple execution units and more complicated logic creates additional hindrances for performance: - ILP Wall (Instruction Level Parallelism) - Memory Wall
Why so many "walls"?
Computing units are "ancient": - As "stupid" as 50 years ago - Still based on the von Neumann architecture - Using a primitive "machine language"
Matrix/vector multiplication in assembly:
__Z6matmulv (snippet):
vmovlhps    %xmm0, %xmm3, %xmm3
vmovss      +_b(%rip), %xmm4
vinsertf128 $1, %xmm3, %ymm3, %ymm3
vinsertps   $0x10, 44+_b(%rip), %xmm7,
vmovss      48+_b(%rip), %xmm6
vinsertps   $0x10, 36+_b(%rip), %xmm1,
vmovlhps    %xmm0, %xmm2, %xmm2
vinsertps   $0x10, 60+_b(%rip), %xmm4,
vxorps      %xmm4, %xmm4, %xmm4
<snip>
• Intel translates "CISC" x86 assembly instructions into "RISC" micro-operations, which can vary with each CPU generation
• NVIDIA translates PTX (Parallel Thread eXecution, a virtual assembly) into machine instructions, which can vary with each GPU generation
In [3]: Image("pyramid.png")
Out[3]:
In [19]: Image("opt_areas.png")
Out[19]:
4.0.5 Supercomputers
Much of modern computational science is performed on GNU/Linux clusters where multiple processors can be utilized and many calculations can be run in parallel. The biggest, built from non-off-the-shelf hardware, are dubbed "supercomputers".
The first supercomputer in history:
In [20]: Image(filename="cray.png")
Out[20]:
In the mid-to-late 1970s the CRAY-1 was the fastest computer in the world.
Clock cycle of 12.5 ns (80 MHz), computational rate of 138 MFLOPS during sustained periods.
Unveiled in 1976 by its inventor Seymour Roger Cray, it spawned a new class of computer called "The Supercomputer".
The tree of HPC:
In [21]: Image("archs.png")
Out[21]:
In [22]: Image("mainframes.png")
Out[22]:
The rise of micros
• The Caltech Cosmic Cube developed by Charles Seitz and Geoffrey Fox in 1981
• 64 Intel 8086/8087 processors with 128kB of memory per processor
• 6-dimensional hypercube network
In [23]: Image("cosmic_cube.png")
Out[23]:
http://calteches.library.caltech.edu/3419/1/Cubism.pdf
Micros spawned the Massively Parallel Processors (MPP) era: - Parallel computers with large numbers of microprocessors - High-speed, low-latency, scalable interconnection networks - Lots of custom hardware to support scalability - Required massive changes to software (parallelization)
In [24]: Image("mpps.png")
Out[24]:
• MPPs using mass-market commercial off-the-shelf (COTS) microprocessors and standard memory and I/O components
• Decreased hardware and software costs make huge systems affordable
In [25]: Image("mpps2.png")
Out[25]:
• NASA Goddard's Beowulf cluster demonstrated publicly that high-visibility science could be done on clusters.
In [26]: Image("goddard.png")
Out[26]:
In [27]: Image("constellation.png")
Out[27]:
The current listing of the "biggest of biggest" can be found on www.top500.org (see also
www.green500.org)
In [28]: Image("top500-november-2017-2-1024.jpg")
Out[28]:
In [29]: Image("top500-november-2017-9-1024.jpg")
Out[29]:
5 Key Concepts in Parallel computing
• Basic definitions: Parallelism and Concurrency
• Notions of parallel performance
Bandwidth: the amount of data moved per unit of time. Measured in bytes s^-1 (hard disks) and bit s^-1 (sockets and nodes)
Latency: the minimum time needed to assign a requested resource
Performance: the number of 64-bit floating point operations per second
Concurrency: a condition of a system in which multiple tasks are logically active at one time.
Parallelism: a condition of a system in which multiple tasks are actually active at one time.
In [31]: Image(filename="parallel.png")
Out[31]:
In [32]: Image("conc_vs_par.png")
Out[32]:
Figure from "An Introduction to Concurrency in Programming Languages" by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen, 2010.
An example: - A Web Server is a Concurrent Application (the problem is fundamentally defined in terms of concurrent tasks):
- An arbitrary, large number of clients make requests which reference per-client persistent state - An Image Server, which relieves load on primary web servers by storing, processing, and serving only images
In [33]: Image("web_server_1.png")
Out[33]:
The HTML server, image server, and clients (you have to plan on having many clients) all execute at the same time.
The problem of one or more clients interacting with a web server not only contains concurrency, the problem is fundamentally concurrent. It doesn't exist as a serial problem.
Concurrent application: an application for which the problem definition is fundamentally concurrent.
In [34]: Image("web_server_2.png")
Out[34]:
A second example: the Mandelbrot set, generated by iterating
z_{n+1} = z_n^2 + c
where c is a constant and z_0 = 0.
Points that do not diverge after a finite number of iterations are part of the set.
In [35]: Image("out.png")
Out[35]:
To generate the famous Mandelbrot set image, we use the function mandel(C), where C comes from the points in the complex plane.
The computation for each point is independent of all the other points ... a so-called embarrassingly parallel problem.
In [36]: Image("mandel2.png")
Out[36]:
y0 = ymin;
for (int i = 0; i < height; i++)
{
    x0 = xmin;
    for (int j = 0; j < width; j++)
    {
        image[i][j] = mandel(x0, y0, horiz, maxiter);
        x0 = x0 + xres;
    }
    y0 = y0 + yres;
} // close on i
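The mandel() kernel itself is not shown in these notes; a possible sketch (the parameter horiz is assumed here to be the escape radius) looks like this:

// Hedged sketch of the per-point kernel: iterate z -> z^2 + c for c = x0 + i*y0
// and return the iteration count at which the orbit escapes (or maxiter).
int mandel(double x0, double y0, double horiz, int maxiter)
{
    double x = 0.0, y = 0.0;   // z = x + i*y, starting from z_0 = 0
    int iter = 0;
    while (x * x + y * y < horiz * horiz && iter < maxiter) {
        double xtmp = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xtmp;
        iter++;
    }
    return iter;               // used to colour the pixel
}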
• The problem of generating an image of the Mandelbrot set can be viewed serially.
• We may choose to exploit the concurrency contained in this problem so we can generate the
image in less time
Key points: - A web server has concurrency in its problem definition ... it doesn't make sense to even think of writing a "serial web server". - The Mandelbrot program doesn't have concurrency in its problem definition. It would take a long time, but it could be serial.
The parallel programming process
In [37]: Image("ppar_process.png")
Out[37]:
EVERY parallel program requires a task decomposition and a data decomposition: - Task
decomposition: break the application down into a set of tasks that can execute concurrently
- Data decomposition: How must the data be broken down into chunks and associated with
threads/processes to make the parallel program run efficiently.
For the Mandelbrot set: map the pixels into row blocks and deal them out to the cores. This will give each core a memory-efficient block to work on.
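A minimal sketch of that decomposition, assuming OpenMP (compile with -fopenmp) and reusing the variables and the illustrative mandel() from the serial loop above:

// Row-block data decomposition of the Mandelbrot image (hedged sketch, not course code).
#include <omp.h>

int mandel(double x0, double y0, double horiz, int maxiter);  // kernel sketched earlier

void mandel_image(int **image, int height, int width,
                  double xmin, double ymin, double xres, double yres,
                  double horiz, int maxiter)
{
    // schedule(static) deals contiguous blocks of rows to the threads,
    // so each core gets a memory-friendly chunk of the image
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < height; i++) {
        double y0 = ymin + i * yres;
        for (int j = 0; j < width; j++) {
            double x0 = xmin + j * xres;
            image[i][j] = mandel(x0, y0, horiz, maxiter);
        }
    }
}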
In [38]: Image("ppar_process2.png")
Out[38]:
In [39]: Image("glue.png")
Out[39]:
An easy way to go is “Bulk Synchronous Processing”.
In [40]: Image("bulk.png")
Out[40]:
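A minimal sketch of the bulk-synchronous pattern, assuming MPI (the local computation and the reduced quantity are placeholders):

// Bulk Synchronous Processing, hedged sketch: compute -> communicate -> synchronize, repeated.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double local, global = 0.0;
    for (int step = 0; step < 10; step++) {          // supersteps
        local = rank + step;                         // 1) independent local computation (placeholder)
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);      // 2) communication
        MPI_Barrier(MPI_COMM_WORLD);                 // 3) global sync (Allreduce already implies it)
    }
    if (rank == 0) printf("last superstep sum = %f\n", global);
    MPI_Finalize();
    return 0;
}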
Is this efficient?
How can we measure the outcome of our parallelization effort?
Let's consider the speedup that we can gain from parallelization; a simple measure would be:
S = T_ser / T_par
Molecular Dynamics
Simulate the motion of atoms as point masses interacting by classical laws:
m_i d^2 R_i(t) / dt^2 = -∇V(R)
V = V_bond + V_angle + V_tors + V_Coul + V_LJ
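As an aside, a hedged sketch of how such equations of motion are integrated in practice (velocity Verlet for a single 1-D particle in a harmonic potential; a real MD code evaluates the full force field V above for every atom):

// Velocity Verlet integration sketch (assumed toy system: m = k = 1, x(0) = 1, v(0) = 0).
#include <stdio.h>

int main(void)
{
    double m = 1.0, k = 1.0;
    double x = 1.0, v = 0.0, dt = 0.01;
    double f = -k * x;                               // F = -dV/dx for V = k*x^2/2
    for (int step = 0; step < 1000; step++) {
        x += v * dt + 0.5 * (f / m) * dt * dt;       // position update
        double f_new = -k * x;                       // recompute the force
        v += 0.5 * (f + f_new) / m * dt;             // velocity update
        f = f_new;
    }
    printf("x(t = 10) = %f\n", x);                   // should stay close to cos(10) ~ -0.84
    return 0;
}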
In [41]: Image("PES.png")
Out[41]:
In [42]: Image("image.png")
Out[42]:
In [43]: Image("charmm_bench2.png")
Out[43]:
the speedup is not constant with the number of processors; why?
let’s get back to our Mandelbrot toy:
In [44]: Image("kcache.png")
Out[44]:
Not every part of the code runs in parallel; even in the utopian situation of linear speedup for the parallel code, that serial cost stays fixed (for a fixed problem size).
    }
    // writing the image to file: a part of the code that is not parallelized
    fprintf(fp, "%d %d %d ", colour[0], colour[1], colour[2]);
}
Amdahl's Law
Consider a generic program running in serial mode, taking T_ser to complete and made up of a serial part and a parallelizable part. Then
T_ser = f_s * T_ser + f_p * T_ser
where f_s and f_p (with f_s + f_p = 1) are the serial and parallelizable fractions of the code.
Now, running in parallel on p processors we have (assuming linear speedup of the parallel part, and writing α = f_s):
T_par = f_s * T_ser + f_p * T_ser / p = (α + (1 - α)/p) * T_ser
In [45]: 1/0.27
Out[45]: 3.7037037037037033
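The cell above appears to be the asymptotic Amdahl limit for a serial fraction of roughly 0.27 (the share suggested by the profile shown earlier): no matter how many processors we use, the speedup cannot exceed 1/0.27 ≈ 3.7. A small hedged sketch of the full formula:

// Amdahl's law, S(p) = 1 / (alpha + (1 - alpha)/p), for an assumed serial fraction alpha = 0.27.
#include <stdio.h>

static double amdahl(double alpha, double p)
{
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main(void)
{
    double alpha = 0.27;
    for (int p = 1; p <= 64; p *= 2)
        printf("p = %2d   S = %.2f\n", p, amdahl(alpha, (double)p));
    printf("p -> inf: S -> %.2f\n", 1.0 / alpha);    // 1/0.27 ~ 3.70
    return 0;
}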
but there’s still work to do:
In [46]: 4.406/2.726
Out[46]: 1.6162876008804108
In [47]: Image("charmm_bench3.png")
Out[47]:
Two major sources of parallel overhead:
Load imbalance: the slowest process determines when everyone is done. Time waiting for other processes to finish is time wasted.
Communication overhead: a cost incurred only by the parallel program. It grows with the number of processes for collective communication.
In [48]: Image("charmm_bench.png")
Out[48]:
Weak scaling
However: non-bonded interactions (all vs. all) scale as N^2, while the bonded part scales as N (an atom typically has only a handful of bonded neighbours). Hence, a serial (or less parallel) part may grow more slowly than the parallel one for some problems.
S(P) → S(P, N)
S(P, N) = T_ser / T_par = 1 / (α + (1 - α)/p)
N → ∞ ⇒ α → 0
S(P, N)|_{α→0} = P
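A hedged illustration of why this rescues scalability: if the serial work grows like N and the parallel work like N^2 (as for the non-bonded interactions above), the serial fraction α(N) shrinks with problem size and the speedup approaches P:

// Weak-scaling sketch: alpha(N) = N / (N + N^2) for assumed serial ~ N and parallel ~ N^2 work.
#include <stdio.h>

int main(void)
{
    double p = 64.0;                                      // assumed processor count
    double sizes[] = {1e2, 1e4, 1e6};
    for (int i = 0; i < 3; i++) {
        double N = sizes[i];
        double alpha = N / (N + N * N);                   // serial fraction at this problem size
        double S = 1.0 / (alpha + (1.0 - alpha) / p);     // Amdahl speedup with that alpha
        printf("N = %8.0f   alpha = %.6f   S = %.1f\n", N, alpha, S);
    }
    return 0;
}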
In [49]: Image("dense_matrices.png")
Out[49]:
Environments from the literature 2010-2012:
In [51]: Image("envs10.png")
Out[51]:
From Wikipedia: The von Neumann architecture is a computer design model that uses a processing unit and a single separate storage structure to hold both instructions and data.
In [53]: Image("cpu_layout.svg.png")
Out[53]:
In [54]: Image("onecore.png")
Out[54]:
Instructions and data must be continuously fed to the control and arithmetic units, so that the speed of the memory interface poses a limitation on compute performance (the von Neumann bottleneck).
The architecture is inherently sequential, processing a single instruction with (possibly) a single operand or a group of operands from memory. SISD (Single Instruction Single Data) has been coined for this concept.
All the components of a CPU core can operate at some maximum speed called peak performance.
The rate at which the floating point units generate results for multiply and add operations is measured in floating-point operations per second (Flops/sec).
Typical single-core performance for the latest (e.g. Coffee Lake) Intel architectures reaches about 100 GFlop/s.
Feeding arithmetic units with operands is a complicated task. The most important data paths from the programmer's point of view are those to and from the caches and main memory (see later). The performance, or bandwidth, of those paths is quantified in GBytes/s.
Fathoming the chief performance characteristics of a processor or system is one of the purposes of low-level benchmarking, such as the vector triad:
double start, end, mflops;
timing(&start); // a generic timestamp function
for (int j = 0; j < NITER; j++)
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i] * D[i]; // the triad: 2 flops per element (A, B, C, D are arrays of length N)
timing(&end);
mflops = 2.0 * NITER * N / (end - start) / 1000000.0;
6.0.1 Performance dimensions in the good old days
Two dimensions: - Frequency of the CPU - Number of CPUs
Today there are many more:
1. Pipelining
2. Superscalar
3. Hardware vectors/SIMD
4. Hardware threads
5. Cores
6. Sockets
7. Nodes
In [55]: Image("perfo_dim.png")
Out[55]:
1. Latency - Each instruction takes a certain time to complete. - The amount of time between instruction issue and completion.
2. Throughput - The number of instructions that complete in a span of time.
Pipelining and the ILP wall
Assembly line: workstations that perform a single production step before passing the partially completed automobile to the next workstation. When Henry Ford(1) introduced it in 1920, 31 workstations could assemble a Model T car in about 1.5 hrs.
Early platforms around 1960 used to have multiple parallel units (i.e. multiple workers) performing arithmetic and logic operations.
The first "dedicated workers", each performing a single task before passing data to the next, were introduced in the Cray-1, which had a pipeline of 12 stages.
If it takes m different steps to finish the product, m products are continually worked on in different stages of completion. If all tasks are carefully tuned to take the same amount of time (the "time step"), all workers are continuously busy. At the end, one finished product per time step leaves the assembly line.
The simplest setup is a "fetch–decode–execute" pipeline, in which each stage can operate independently of the others. While one instruction is being executed, another one is being decoded and a third one is being fetched from the instruction (L1I) cache.
Breaking up tasks into many different elementary stages (one of the reasons behind RISC):
• The Good: potential for a higher clock rate, as functional units can be kept simple.
• The Bad: the deeper the pipeline, the longer the wind-up phase needed for all units to become operational.
In [56]: Image("ILP_wall.png")
Out[56]:
T_seq / T_pipe = m N / (N + m - 1)
Throughput is:
N / T_pipe = 1 / (1 + (m - 1)/N)
that is, for large N the speedup tends to m and the throughput tends to 1 (one result per cycle). The critical N_c needed to achieve at least a throughput of p (0 ≤ p ≤ 1) is:
N_c = p (m - 1) / (1 - p)
At p = 0.5, N_c = m - 1. Think how to manage a pipeline of depth 20 to 30 and possibly slow operations (e.g. transcendental functions).
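A small worked check of that formula (pipeline depth m = 20 is an assumed value):

// Critical number of independent operations N_c = p*(m - 1)/(1 - p) for a pipeline of depth m.
#include <stdio.h>

int main(void)
{
    int m = 20;                                   // assumed pipeline depth
    double targets[] = {0.5, 0.7, 0.9};           // desired fraction of peak throughput
    for (int i = 0; i < 3; i++) {
        double p = targets[i];
        printf("p = %.1f   N_c = %.0f\n", p, p * (m - 1) / (1.0 - p));
    }
    // p = 0.5 -> 19, p = 0.9 -> 171: deep pipelines need long streams of
    // independent operations (long loops) to run anywhere near peak.
    return 0;
}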
Digression: maybe this was not actually invented by Ford. Do you know what this is?
In [57]: Image("ars2.jpg")
Out[57]:
In [58]: Image("superscalar.png")
Out[58]:
But ... the automatic search for independent instructions requires extra resources
In [59]: Image("ooe.png")
Out[59]:
Vector registers
In [60]: Image("vectors.png")
Out[60]:
• AltiVec
• MMX
• SSE2
• SSE3
• SSE4
• 3DNow!
• AVX1
• AVX2
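All the extensions listed above expose the same idea: wide registers that apply one instruction to several operands at once. A minimal hedged sketch with AVX intrinsics (assumes an AVX-capable x86-64 CPU and compilation with -mavx):

// One AVX instruction adds eight single-precision numbers at once.
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {10, 10, 10, 10, 10, 10, 10, 10};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    // load 8 floats into a 256-bit register
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); // 8 additions in a single instruction
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}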
Larger caches
Small, fast, on-chip memories serve as temporary data storage for holding copies of data that is to be used again "soon," or that is close to data that has recently been used. Enlarging the cache size usually does not hurt application performance, but there is some trade-off because a big cache tends to be slower than a small one.
Also:
In [61]: Image("transisto_cores.jpg")
Out[61]:
Hardware threads
Pipelined architectures, however performant, will inevitably have stalls. One way to reduce the cost of stalls is to keep units that would otherwise stall (workstations of the assembly line) during the execution of one program busy with another one. This is called hardware threading, multithreading, or hyper-threading.
Multithreading is implemented by creating multiple sets of registers and latches to hold the state of the multiple programs. The appropriate register set is selected for each pipeline resource by the processor's scheduling hardware.
7 Conclusions
... too early but
• Hardware threads, Cores, Sockets
In [62]: Image("500eurofr_HR.jpg")
Out[62]: