01 - Introduction: 1 Why Parallel Programming Is Important in Research
In [2]: Image(filename="bd_hpc.png")
Out[2]:
In [3]: Image(filename="whyhpc1.png")
Out[3]:
In [4]: Image(filename="whyhpc2.png")
Out[4]:
2 Class goals:
1. Explain what factors force us to consider parallel computing
2. Give an overview of parallel programming paradigms and patterns
3. Give an understanding of modern computer architectures from a performance point of view
• Processor, [Cache, Memory subsystem]
• Use x86-64 as a de-facto standard
• Look at how accelerators are made
4. Explain hardware factors that improve or degrade program execution speed
5. Explain how to do detailed performance measurements, highlighting the most important events for such measurements
3 Contents
Section ??
Section ??
Section ??
In [5]: Image(filename="500eurofr_HR.jpg")
Out[5]:
In [6]: Image("photo-intel-haswell-core-4th-gen-logos.jpg")
Out[6]:
What do we want to measure? 1. performance in terms of cost; even "commodity hardware" costs grow large in big facilities 2. performance per Watt; let your 1 kW GPU mining rig loose at home and you'll understand why
4.0.1 Moore’s law
In 1965(1) , Intel co-founder Gordon Moore predicted (from just 3 data points!) that semiconductor
density would double every 18 months. – He was right! Transistors are still shrinking at the same
rate
(1) Moore, G.E.: Cramming more components onto integrated circuits. Electronics, 38(8), April
1965.
In [7]: Image("moores.jpeg")
Out[7]:
In [8]: Image("1200px-Moore's_Law_Transistor_Count_1971-2016.png")
Out[8]:
available Section ??
This has become a de-facto standard accepted by:
• Semiconductor manufacturers (Intel, ARM, AMD)
• Hardware integrators (e.g. Asus)
• Software companies (pick your favourite)
• Customers (that is, us)
Consequences: an incredible level of integration
• CPUs: many cores, hardware vectors, hardware threading
• GPUs: an enormous number of floating-point units
For now, we just assume that hardware threads, cores, and so on are different computing elements on one integrated chip.
Today, we commonly acquire chips with 1'000'000'000 (10^9) transistors!
Server chips and high-end GPU devices have more:
In [9]: Image("titan.jpg")
Out[9]:
- GPU Name: GM200
- GPU Variant: GM200-400-A1
- Architecture: Maxwell 2.0
- Process Size: 28 nm
- Transistors: 8,000 million
- Die Size: 601 mm²
But if computers double their power every 18 months, we don't need to worry about performance...
not quite.
That used to be the case in the "days of the Pentium":
In [10]: Image("oldays.png")
Out[10]:
An implicit agreement: - write your code as you want - performance is a problem for those smart guys in the hardware dept.
In [11]: Image("power.png")
Out[11]:
Not only does this suck up a lot of energy (and money), it also creates a lot of heat, since Q ∝ P
In [12]: Image("sun.png")
Out[12]:
See also: [1] S. H. Fuller and L. I. Millett, "Computing Performance: Game Over or Next Level?," Computer, vol. 44, no. 1, pp. 31–38, Jan. 2011.
So what solution did the smart guys in the HW dept. invent?
In [13]: Image("power2.png")
Out[13]:
4.0.2 Possible solutions?
In [14]: Image("gskill-hwbot-overclock-kit-cinebench-r15.jpg")
Out[14]:
... nitrogen cooling is cool (literally) but quite expensive
In [15]: Image("natick.jpg")
Out[15]:
this one may be rather difficult if you are not a scuba lover
and we still have to pay the electricity bill
V = IR
How much power does it consume? At any time the circuit holds a charge q = CV and hence performs a work equal to:
W = qV = CV^2
The power consumption (plus fixed terms) is, with f the operating frequency:
W t^-1 = W f = CV^2 f
Consider now what happens when we split the CPU into two cores operating at half the clock speed:
In [16]: Image("more_cores.png")
Out[16]:
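As a back-of-the-envelope check, here is a minimal sketch (not course code) of the argument in the figure above, assuming the dynamic-power model P = CV^2 f and that the supply voltage can be scaled down linearly with the frequency:

// Hedged sketch: dynamic power P = C*V^2*f, assuming voltage scales with frequency.
#include <stdio.h>

int main(void)
{
    double C = 1.0, V = 1.0, f = 1.0;                 // normalized units
    double p_one = C * V * V * f;                     // one core at full clock
    double p_two = 2.0 * C * (V/2) * (V/2) * (f/2);   // two cores at half clock and half voltage
    printf("one core: P = %.2f   two half-speed cores: P = %.2f\n", p_one, p_two);
    // same aggregate instruction throughput (2 * f/2 = f), roughly a quarter of the dynamic power
    return 0;
}

In practice the voltage cannot be scaled that aggressively, but the direction of the effect is what makes multi-core designs attractive.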
[1] S. H. Fuller and L. I. Millett, "Computing Performance: Game Over or Next Level?," Computer, vol. 44, no. 1, pp. 31–38, Jan. 2011.
As a practical example, the E5640 Xeon (4 cores @ 2.66 GHz) has a power envelope of 95 watts while the L5630 (4 cores @ 2.13 GHz) requires only 40 watts. That's 137% more electrical power for 24% more CPU power, for CPUs that are for the most part feature-compatible. The X5677 pushes the speed up to 3.46 GHz with some more features, but that's only 60% more processing power for 225% more electrical power.
Now compare the X5560 (2.8 GHz, 4 cores, 95 watts) with the newer X5660 (2.8 GHz, 6 cores, 95 watts) and there's 50% extra computing power in the socket.
In [17]: Image("manycores.png")
Out[17]:
note that three out of six here come from the game industry
But what about the gentlemen's agreement with the HW dept.?
well ...
We have to cope with a new contract: the HW dept. will continue to add transistors (in lots of simple cores) ...
... and the SW dept. will have to adapt (rewrite everything).
That's why we have to worry about parallel programming.
Market forces, not technology, will drive technology (core counts in this case)
(Nietzsche would like this one)
In [18]: Image("manycores2.png")
Out[18]:
Fragmentation into multiple execution units and more complicated logic creates additional hindrances for performance: - ILP Wall (Instruction Level Parallelism) - Memory Wall
Why so many "walls"?
Computing units are "ancient": - As "stupid" as 50 years ago - Still based on the von Neumann architecture - Using a primitive "machine language"
Matrix/vector multiplication in assembly:
__Z6matmulv (snippet):
vmovlhps    %xmm0, %xmm3, %xmm3
vmovss      +_b(%rip), %xmm4
vinsertf128 $1, %xmm3, %ymm3, %ymm3
vinsertps   $0x10, 44+_b(%rip), %xmm7,
vmovss      48+_b(%rip), %xmm6
vinsertps   $0x10, 36+_b(%rip), %xmm1,
vmovlhps    %xmm0, %xmm2, %xmm2
vinsertps   $0x10, 60+_b(%rip), %xmm4,
vxorps      %xmm4, %xmm4, %xmm4
<snip>
• Intel translates "CISC" x86 assembly instructions into "RISC" micro-operations, which can vary with each CPU generation
• NVIDIA translates PTX (Parallel Thread eXecution, a virtual assembly) into machine instructions, which can vary with each GPU generation
In [3]: Image("pyramid.png")
Out[3]:
In [19]: Image("opt_areas.png")
Out[19]:
4.0.5 Supercomputers
Much of modern computational science is performed on GNU/Linux clusters where multiple processors can be utilized and many calculations can be run in parallel. The biggest, built from non-off-the-shelf hardware, are dubbed "supercomputers".
The first supercomputer in history:
In [20]: Image(filename="cray.png")
Out[20]:
In the mid-to-late 1970s the CRAY-1 was the fastest computer in the world.
Clock cycle of 12.5 ns (80 MHz), computational rate of 138 MFLOPS during sustained periods.
Unveiled in 1976 by its inventor Seymour Roger Cray, it spawned a new class of computer called "The Supercomputer".
The tree of HPC:
In [21]: Image("archs.png")
Out[21]:
In [22]: Image("mainframes.png")
Out[22]:
The rise of micros
• The Caltech Cosmic Cube developed by Charles Seitz and Geoffrey Fox in 1981
• 64 Intel 8086/8087 processors with 128kB of memory per processor
• 6-dimensional hypercube network
In [23]: Image("cosmic_cube.png")
Out[23]:
http://calteches.library.caltech.edu/3419/1/Cubism.pdf
Micros spawned the Massively Parallel Processors (MPP) era: - Parallel computers with large numbers of microprocessors - High-speed, low-latency, scalable interconnection networks - Lots of custom hardware to support scalability - Required massive changes to software (parallelization)
In [24]: Image("mpps.png")
Out[24]:
• MPPs using mass-market commercial off-the-shelf (COTS) microprocessors and standard memory and I/O components
• Decreased hardware and software costs make huge systems affordable
In [25]: Image("mpps2.png")
Out[25]:
• NASA Goddard's Beowulf cluster demonstrated publicly that high-visibility science could be done on clusters.
In [26]: Image("goddard.png")
Out[26]:
In [27]: Image("constellation.png")
Out[27]:
The current listing of the "biggest of biggest" can be found on www.top500.org (see also
www.green500.org)
In [28]: Image("top500-november-2017-2-1024.jpg")
Out[28]:
In [29]: Image("top500-november-2017-9-1024.jpg")
Out[29]:
5 Key Concepts in Parallel computing
• Basic definitions: Parallelism and Concurrency
• Notions of parallel performance
Bandwidth: the amount of data moved per unit of time. Measured in bytes s^-1 (hard disks) and bit s^-1 (sockets and nodes)
Latency: the minimum time needed to assign a requested resource
Performance: the number of 64-bit floating point operations per second
Concurrency: a condition of a system in which multiple tasks are logically active at one time.
Parallelism: a condition of a system in which multiple tasks are actually active at one time.
In [31]: Image(filename="parallel.png")
Out[31]:
In [32]: Image("conc_vs_par.png")
Out[32]:
Figure from "An Introduction to Concurrency in Programming Languages" by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen, 2010.
An example: - A Web Server is a Concurrent Application (the problem is fundamentally defined in terms of concurrent tasks):
- An arbitrary, large number of clients make requests which reference per-client persistent state - An Image Server, which relieves load on primary web servers by storing, processing, and serving only images
In [33]: Image("web_server_1.png")
Out[33]:
The HTML server, image server, and clients (you have to plan on having many clients) all execute at the same time.
The problem of one or more clients interacting with a web server not only contains concurrency, the problem is fundamentally concurrent. It doesn't exist as a serial problem.
Concurrent application: an application for which the problem definition is fundamentally concurrent.
In [34]: Image("web_server_2.png")
Out[34]:
A second example: the Mandelbrot set, generated by iterating
z_{n+1} = z_n^2 + c
where c is a constant and z_0 = 0.
Points that do not diverge after a finite number of iterations are part of the set.
In [35]: Image("out.png")
Out[35]:
To generate the famous Mandelbrot set image, we use the function mandel(C), where C comes from the points in the complex plane.
The computation for each point is independent of all the other points ... a so-called embarrassingly parallel problem.
In [36]: Image("mandel2.png")
Out[36]:
y0 = ymin;
for (int i = 0; i < height; i++)
{
    x0 = xmin;
    for (int j = 0; j < width; j++)
    {
        image[i][j] = mandel(x0, y0, horiz, maxiter);
        x0 = x0 + xres;
    }
    y0 = y0 + yres;
} // close on i
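The mandel() kernel itself is not shown in these notes; a possible sketch (the parameter horiz is assumed here to be the escape radius) looks like this:

// Hedged sketch of the per-point kernel: iterate z -> z^2 + c for c = x0 + i*y0
// and return the iteration count at which the orbit escapes (or maxiter).
int mandel(double x0, double y0, double horiz, int maxiter)
{
    double x = 0.0, y = 0.0;   // z = x + i*y, starting from z_0 = 0
    int iter = 0;
    while (x * x + y * y < horiz * horiz && iter < maxiter) {
        double xtmp = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xtmp;
        iter++;
    }
    return iter;               // used to colour the pixel
}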
• The problem of generating an image of the Mandelbrot set can be viewed serially.
• We may choose to exploit the concurrency contained in this problem so we can generate the
image in less time
Key points: - A web server has concurrency in its problem definition ... it doesn't make sense to even think of writing a "serial web server". - The Mandelbrot program doesn't have concurrency in its problem definition. It would take a long time, but it could be serial.
The parallel programming process
In [37]: Image("ppar_process.png")
Out[37]:
EVERY parallel program requires a task decomposition and a data decomposition: - Task
decomposition: break the application down into a set of tasks that can execute concurrently
- Data decomposition: How must the data be broken down into chunks and associated with
threads/processes to make the parallel program run efficiently.
For the Mandelbrot set: map the pixels into row blocks and deal them out to the cores. This will give each core a memory-efficient block to work on.
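A minimal sketch of that decomposition, assuming OpenMP (compile with -fopenmp) and reusing the variables and the illustrative mandel() from the serial loop above:

// Row-block data decomposition of the Mandelbrot image (hedged sketch, not course code).
#include <omp.h>

int mandel(double x0, double y0, double horiz, int maxiter);  // kernel sketched earlier

void mandel_image(int **image, int height, int width,
                  double xmin, double ymin, double xres, double yres,
                  double horiz, int maxiter)
{
    // schedule(static) deals contiguous blocks of rows to the threads,
    // so each core gets a memory-friendly chunk of the image
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < height; i++) {
        double y0 = ymin + i * yres;
        for (int j = 0; j < width; j++) {
            double x0 = xmin + j * xres;
            image[i][j] = mandel(x0, y0, horiz, maxiter);
        }
    }
}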
In [38]: Image("ppar_process2.png")
Out[38]:
In [39]: Image("glue.png")
Out[39]:
An easy way to go is “Bulk Synchronous Processing”.
In [40]: Image("bulk.png")
Out[40]:
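A minimal sketch of the bulk-synchronous pattern, assuming MPI (the local computation and the reduced quantity are placeholders):

// Bulk Synchronous Processing, hedged sketch: compute -> communicate -> synchronize, repeated.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double local, global = 0.0;
    for (int step = 0; step < 10; step++) {          // supersteps
        local = rank + step;                         // 1) independent local computation (placeholder)
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);      // 2) communication
        MPI_Barrier(MPI_COMM_WORLD);                 // 3) global sync (Allreduce already implies it)
    }
    if (rank == 0) printf("last superstep sum = %f\n", global);
    MPI_Finalize();
    return 0;
}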
Is this efficient?
How can we measure the outcome of our parallelization effort?
Let's consider the speedup that we can gain from parallelization; a simple measure would be:
S = T_ser / T_par
Molecular Dynamics
Simulate the motion of atoms as point masses interacting by classical laws:
m_i d^2 R_i(t) / dt^2 = -∇V(R)
V = V_bond + V_angle + V_tors + V_Coul + V_LJ
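As an aside, a hedged sketch of how such equations of motion are integrated in practice (velocity Verlet for a single 1-D particle in a harmonic potential; a real MD code evaluates the full force field V above for every atom):

// Velocity Verlet integration sketch (assumed toy system: m = k = 1, x(0) = 1, v(0) = 0).
#include <stdio.h>

int main(void)
{
    double m = 1.0, k = 1.0;
    double x = 1.0, v = 0.0, dt = 0.01;
    double f = -k * x;                               // F = -dV/dx for V = k*x^2/2
    for (int step = 0; step < 1000; step++) {
        x += v * dt + 0.5 * (f / m) * dt * dt;       // position update
        double f_new = -k * x;                       // recompute the force
        v += 0.5 * (f + f_new) / m * dt;             // velocity update
        f = f_new;
    }
    printf("x(t = 10) = %f\n", x);                   // should stay close to cos(10) ~ -0.84
    return 0;
}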
In [41]: Image("PES.png")
Out[41]:
In [42]: Image("image.png")
Out[42]:
In [43]: Image("charmm_bench2.png")
Out[43]:
the speedup is not constant with the number of processors; why?
let’s get back to our Mandelbrot toy:
In [44]: Image("kcache.png")
Out[44]:
Not every part of the code runs in parallel; even in the utopian situation of linear speedup for the parallel code, that serial cost stays fixed (for a fixed problem size).
    }
    // writing the image to file: a part of the code that is not parallelized
    fprintf(fp, "%d %d %d ", colour[0], colour[1], colour[2]);
}
Amdahl's Law
Consider a generic program running in serial mode, taking T_ser to complete and made up of a serial part and a parallelizable part. Then
T_ser = f_s * T_ser + f_p * T_ser
where f_s and f_p (with f_s + f_p = 1) are the serial and parallelizable fractions of the code.
Now, running in parallel on p processors we have (assuming linear speedup of the parallel part, and writing α = f_s):
T_par = f_s * T_ser + f_p * T_ser / p = (α + (1 - α)/p) * T_ser
In [45]: 1/0.27
Out[45]: 3.7037037037037033
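The cell above appears to be the asymptotic Amdahl limit for a serial fraction of roughly 0.27 (the share suggested by the profile shown earlier): no matter how many processors we use, the speedup cannot exceed 1/0.27 ≈ 3.7. A small hedged sketch of the full formula:

// Amdahl's law, S(p) = 1 / (alpha + (1 - alpha)/p), for an assumed serial fraction alpha = 0.27.
#include <stdio.h>

static double amdahl(double alpha, double p)
{
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main(void)
{
    double alpha = 0.27;
    for (int p = 1; p <= 64; p *= 2)
        printf("p = %2d   S = %.2f\n", p, amdahl(alpha, (double)p));
    printf("p -> inf: S -> %.2f\n", 1.0 / alpha);    // 1/0.27 ~ 3.70
    return 0;
}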
but there’s still work to do:
In [46]: 4.406/2.726
Out[46]: 1.6162876008804108
In [47]: Image("charmm_bench3.png")
Out[47]:
Two major sources of parallel overhead:
Load imbalance: the slowest process determines when everyone is done. Time waiting for other processes to finish is time wasted.
Communication overhead: a cost incurred only by the parallel program. It grows with the number of processes for collective communication.
In [48]: Image("charmm_bench.png")
Out[48]:
Weak scaling
However: non-bonded interactions (all vs. all) scale as N^2, while the bonded part scales as N (an atom typically has only a handful of bonded neighbours). Hence, a serial (or less parallel) part may grow more slowly than the parallel one for some problems.
S(P) → S(P, N)
S(P, N) = T_ser / T_par = 1 / (α + (1 - α)/p)
N → ∞ ⇒ α → 0
S(P, N)|_{α→0} = P
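A hedged illustration of why this rescues scalability: if the serial work grows like N and the parallel work like N^2 (as for the non-bonded interactions above), the serial fraction α(N) shrinks with problem size and the speedup approaches P:

// Weak-scaling sketch: alpha(N) = N / (N + N^2) for assumed serial ~ N and parallel ~ N^2 work.
#include <stdio.h>

int main(void)
{
    double p = 64.0;                                      // assumed processor count
    double sizes[] = {1e2, 1e4, 1e6};
    for (int i = 0; i < 3; i++) {
        double N = sizes[i];
        double alpha = N / (N + N * N);                   // serial fraction at this problem size
        double S = 1.0 / (alpha + (1.0 - alpha) / p);     // Amdahl speedup with that alpha
        printf("N = %8.0f   alpha = %.6f   S = %.1f\n", N, alpha, S);
    }
    return 0;
}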
In [49]: Image("dense_matrices.png")
Out[49]:
Environments from the literature 2010-2012:
In [51]: Image("envs10.png")
Out[51]:
From Wikipedia: The von Neumann architecture is a computer design model that uses a processing unit and a single separate storage structure to hold both instructions and data.
In [53]: Image("cpu_layout.svg.png")
Out[53]:
In [54]: Image("onecore.png")
Out[54]:
Instructions and data must be continuously fed to the control and arithmetic units, so that the speed of the memory interface poses a limitation on compute performance (the von Neumann bottleneck).
The architecture is inherently sequential, processing a single instruction with (possibly) a single operand or a group of operands from memory. SISD (Single Instruction Single Data) has been coined for this concept.
All the components of a CPU core can operate at some maximum speed called peak performance.
The rate at which the floating point units generate results for multiply and add operations is measured in floating-point operations per second (Flops/sec).
Typical single-core performance for the latest (e.g. Coffee Lake) Intel architectures reaches about 100 GFlop/s.
Feeding arithmetic units with operands is a complicated task. The most important data paths from the programmer's point of view are those to and from the caches and main memory (see later). The performance, or bandwidth, of those paths is quantified in GBytes/s.
Fathoming the chief performance characteristics of a processor or system is one of the purposes of low-level benchmarking, such as the vector triad:
double start, end, mflops;
timing(&start); // a generic timestamp function
for (int j = 0; j < NITER; j++)
    for (int i = 0; i < N; i++)
        A[i] = B[i] + C[i] * D[i]; // the triad: 2 flops per element (A, B, C, D are arrays of length N)
timing(&end);
mflops = 2.0 * NITER * N / (end - start) / 1000000.0;
6.0.1 Performance dimensions in the good old days
Two dimensions: - Frequency of the CPU - Number of CPUs
Today there are many more:
1. Pipelining
2. Superscalar
3. Hardware vectors/SIMD
4. Hardware threads
5. Cores
6. Sockets
7. Nodes
In [55]: Image("perfo_dim.png")
Out[55]:
1. Latency - Each instruction takes a certain time to complete. - The amount of time between instruction issue and completion.
2. Throughput - The number of instructions that complete in a span of time.
Pipelining and the ILP wall
Assembly line: workstations that perform a single production step before passing the partially completed automobile to the next workstation. When Henry Ford(1) introduced it in 1920, 31 workstations could assemble a Model T car in about 1.5 hrs.
Early platforms around 1960 used to have multiple parallel units (i.e. multiple workers) performing arithmetic and logic operations.
The first "dedicated workers", each performing a single task before passing data to the next, were introduced in the Cray-1, which had a pipeline of 12 stages.
If it takes m different steps to finish the product, m products are continually worked on in different stages of completion. If all tasks are carefully tuned to take the same amount of time (the "time step"), all workers are continuously busy. At the end, one finished product per time step leaves the assembly line.
The simplest setup is a "fetch–decode–execute" pipeline, in which each stage can operate independently of the others. While one instruction is being executed, another one is being decoded and a third one is being fetched from the instruction (L1I) cache.
Breaking up tasks into many different elementary stages (one of the reasons behind RISC):
• The Good: potential for a higher clock rate, as functional units can be kept simple.
• The Bad: the deeper the pipeline, the longer the wind-up phase needed for all units to become operational.
In [56]: Image("ILP_wall.png")
Out[56]:
T_seq / T_pipe = m N / (N + m - 1)
Throughput is:
N / T_pipe = 1 / (1 + (m - 1)/N)
that is, for large N the speedup tends to m and the throughput tends to 1 (one result per cycle). The critical N_c needed to achieve at least a throughput of p (0 ≤ p ≤ 1) is:
N_c = p (m - 1) / (1 - p)
At p = 0.5, N_c = m - 1. Think how to manage a pipeline of depth 20 to 30 and possibly slow operations (e.g. transcendental functions).
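A small worked check of that formula (pipeline depth m = 20 is an assumed value):

// Critical number of independent operations N_c = p*(m - 1)/(1 - p) for a pipeline of depth m.
#include <stdio.h>

int main(void)
{
    int m = 20;                                   // assumed pipeline depth
    double targets[] = {0.5, 0.7, 0.9};           // desired fraction of peak throughput
    for (int i = 0; i < 3; i++) {
        double p = targets[i];
        printf("p = %.1f   N_c = %.0f\n", p, p * (m - 1) / (1.0 - p));
    }
    // p = 0.5 -> 19, p = 0.9 -> 171: deep pipelines need long streams of
    // independent operations (long loops) to run anywhere near peak.
    return 0;
}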
Digression: maybe this was not actually invented by Ford. Do you know what this is?
In [57]: Image("ars2.jpg")
Out[57]:
In [58]: Image("superscalar.png")
Out[58]:
But ... the automatic search for independent instructions requires extra resources
In [59]: Image("ooe.png")
Out[59]:
Vector registers
In [60]: Image("vectors.png")
Out[60]:
• AltiVec
• MMX
• SSE2
• SSE3
• SSE4
• 3DNow!
• AVX1
• AVX2
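All the extensions listed above expose the same idea: wide registers that apply one instruction to several operands at once. A minimal hedged sketch with AVX intrinsics (assumes an AVX-capable x86-64 CPU and compilation with -mavx):

// One AVX instruction adds eight single-precision numbers at once.
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    float a[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    float b[8] = {10, 10, 10, 10, 10, 10, 10, 10};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    // load 8 floats into a 256-bit register
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); // 8 additions in a single instruction
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}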
Larger caches
Small, fast, on-chip memories serve as temporary data storage for holding copies of data that is to be used again "soon," or that is close to data that has recently been used. Enlarging the cache size usually does not hurt application performance, but there is some trade-off because a big cache tends to be slower than a small one.
Also:
In [61]: Image("transisto_cores.jpg")
Out[61]:
Hardware threads
Pipelined architectures, however performant, will inevitably have stalls. One way to reduce the cost of stalls is to keep units that would otherwise stall (workstations of the assembly line) during the execution of one program busy with another one. This is called hardware threading, multithreading, or hyper-threading.
Multithreading is implemented by creating multiple sets of registers and latches to hold the state of the multiple programs. The appropriate register set is selected for each pipeline resource by the processor's scheduling hardware.
7 Conclusions
... too early but
• Hardware threads, Cores, Sockets
In [62]: Image("500eurofr_HR.jpg")
Out[62]: