
01_Introduction

April 27, 2018

In [1]: from IPython.display import Image

1 Why parallel programming is important in research


1.1 Computational science
is concerned with constructing mathematical models and quantitative analysis techniques and
using computers to analyze and solve scientific problems.
Scientists and engineers develop computer programs, application software, that model systems
being studied and run these programs with various sets of input parameters. In some cases,
these models require massive amounts of calculations (usually floating-point) and are often
executed on supercomputers or distributed computing platforms.
Numerical analysis is an important underpinning for techniques used in computational science.

1.2 High Performance Computing (HPC)


is the set of methods by which scientists and engineers solve complex problems using applications that
require high-bandwidth, low-latency networking and high computing capabilities.

In [2]: Image(filename="bd_hpc.png")

Out[2]:

In [3]: Image(filename="whyhpc1.png")

Out[3]:

In [4]: Image(filename="whyhpc2.png")

Out[4]:

2 Class goals:
1. Explain what factors force us to consider parallel computing
2. Have an overview of parallel programming paradigms and patterns
3. Give an understanding of modern computer architectures from a performance point-of-view
• Processor, cache, memory subsystem
• Use x86-64 as a de-facto standard
• Look at how accelerators are made
4. Explain hardware factors that improve or degrade program execution speed
5. Show how to do detailed performance measurements, highlighting the most important events for
such measurements

3 Contents
Section ??
Section ??
Section ??

4 Why worry about performance


• processors are currently sold in terms of clock speed, number of cores, and cache size
• yet in the late ’90s, clock speed was the only advertised figure

In [5]: Image(filename="500eurofr_HR.jpg")

Out[5]:

how does the previous one compare to these?

In [6]: Image("photo-intel-haswell-core-4th-gen-logos.jpg")

Out[6]:

What do we want to measure?
1. performance in terms of cost; even "commodity" hardware costs add up in large facilities
2. performance per watt; let a 1 kW GPU mining rig loose at home and you'll find out

4.0.1 Moore’s law
In 1965 (1), Intel co-founder Gordon Moore predicted (from just 3 data points!) that semiconductor
density would double every 18 months. He was right: transistors are still shrinking at the same rate.
(1) Moore, G.E.: Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.

In [7]: Image("moores.jpeg")

Out[7]:

In [8]: Image("1200px-Moore's_Law_Transistor_Count_1971-2016.png")

Out[8]:

available Section ??
This has become a de-facto standard accepted by:
• Semiconductor manufacturers (Intel, ARM, AMD)
• Hardware integrators (e.g. Asus)
• Software companies (pick yours)
• Customers (that is, us)
Consequences: an incredible level of integration
• CPUs: many cores, hardware vectors, hardware threading
• GPUs: an enormous number of floating-point units
For now, we just assume that things like hardware threads, cores, and so on are different computing
elements on one integrated chip.
Today, we commonly buy chips with 1'000'000'000 (10⁹) transistors!
Server chips and high-end GPU devices have even more:

In [9]: Image("titan.jpg")

Out[9]:

• GPU Name: GM200
• GPU Variant: GM200-400-A1
• Architecture: Maxwell 2.0
• Process Size: 28 nm
• Transistors: 8,000 million
• Die Size: 601 mm²
But if computers double their power every 18 months, we don't need to worry about performance...
not quite.
That used to be the case in the "days of the Pentium":

In [10]: Image("oldays.png")

Out[10]:

an implicit agreement:
• write your code as you want
• performance is a problem for those smart guys in the hardware dept.

In [11]: Image("power.png")

Out[11]:

not only does this suck up a lot of energy (and money), it also creates a lot of heat, since Q ∝ P

In [12]: Image("sun.png")

Out[12]:

see also: [1] S. H. Fuller and L. I. Millett, "Computing Performance: Game Over or Next
Level?," Computer, vol. 44, no. 1, pp. 31–38, Jan. 2011.
so what solution did the smart guys in the HW dept. invent?

In [13]: Image("power2.png")

Out[13]:

4.0.2 Possible solutions?
In [14]: Image("gskill-hwbot-overclock-kit-cinebench-r15.jpg")

Out[14]:

... nitrogen cooling is cool (literally) but quite expensive
In [15]: Image("natick.jpg")
Out[15]:

this one may be rather difficult if you are not a scuba lover
and we still have to pay the electricity bill

4.0.3 Power consumption in a chip


To switch faster (increase the frequency) we have to drive current through a circuit more quickly. Now,
the circuit has various fixed resistances, hence (naively) we have to increase the voltage (all else being equal):

$$V = IR$$

How much power does it consume? At any time the circuit holds a charge q = CV and hence
performs a work equal to:

$$W = qV = CV^2$$

The power consumption (plus fixed terms) is (f is the operating frequency):

$$W\,t^{-1} = W f = C V^2 f$$

Consider now what happens when we split the CPU into two cores operating at half the clock speed:
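As a back-of-the-envelope sketch (assuming, naively as in the V = IR argument above, that the supply voltage can be scaled down roughly in proportion to the frequency):

$$P_{1\,\mathrm{core}} = C V^2 f \qquad P_{2\,\mathrm{cores}} = 2\, C \left(\frac{V}{2}\right)^2 \frac{f}{2} = \frac{1}{4}\, C V^2 f$$

i.e. the two slower cores deliver the same aggregate number of cycles per second at roughly a quarter of the dynamic power. Real voltage/frequency scaling is more complicated, but this is the reasoning behind going multicore.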

In [16]: Image("more_cores.png")

Out[16]:

[1] S. H. Fuller and L. I. Millett, "Computing Performance: Game Over or Next Level?," Computer,
vol. 44, no. 1, pp. 31–38, Jan. 2011.
As a practical example, the E5640 Xeon (4 cores @ 2.66 GHz) has a power envelope of 95 watts
while the L5630 (4 cores @ 2.13 GHz) requires only 40 watts. That's 137% more electrical power
for 24% more CPU power, for CPUs that are for the most part feature-compatible. The X5677
pushes the speed up to 3.46 GHz with some more features, but that's only 60% more processing
power for 225% more electrical power.
Now compare the X5560 (2.8 GHz, 4 cores, 95 watts) with the newer X5660 (2.8 GHz, 6 cores,
95 watts) and there's 50% extra computing power in the socket.

In [17]: Image("manycores.png")

Out[17]:

note that three out of six here come from the game industry
but what about the gentlemen's agreement with the HW dept.?
well ...
we have to cope with this new contract: the HW dept. will continue to add transistors (in lots of
simple cores) ...
... and the SW dept. will have to adapt (rewrite everything).
That's why we have to worry about parallel programming.
Market forces, not technology, will drive technology (core counts in this case)
(Nietzsche would like this one)

In [18]: Image("manycores2.png")

Out[18]:

Fragmentation into multiple execution units and more complicated logic creates additional
hindrances to performance:
• ILP Wall (Instruction Level Parallelism)
• Memory Wall
why so many "walls"?
Computing units are "ancient":
• as "stupid" as 50 years ago
• still based on the von Neumann architecture
• using a primitive "machine language"
Matrix/vector multiplication in assembly:

__Z6matmulv (snippet):
vmovlhps    %xmm0, %xmm3, %xmm3
vmovss      +_b(%rip), %xmm4
vinsertf128 $1, %xmm3, %ymm3, %ymm3
vinsertps   $0x10, 44+_b(%rip), %xmm7,
vmovss      48+_b(%rip), %xmm6
vinsertps   $0x10, 36+_b(%rip), %xmm1,
vmovlhps    %xmm0, %xmm2, %xmm2
vinsertps   $0x10, 60+_b(%rip), %xmm4,
vxorps      %xmm4, %xmm4, %xmm4
<snip>

Even assembly is "too complicated":

• Intel translates "CISC" x86 assembly instructions into "RISC" micro-operations, which can
vary with each CPU generation
• NVIDIA translates PTX (parallel thread execution, a virtual assembly) into machine instructions,
which can vary with each GPU generation

CISC: Complex Instruction Set Computing


RISC: Reduced Instruction Set Computing

1. We start with a concrete, real-life problem to solve


2. We write programs in high level languages (C++, JAVA, Python, etc.)
3. A compiler (or an interpreter) transforms the high-level code to machine-level code
4. We link in external libraries
5. A sophisticated processor with a complex architecture and an even more complex micro-architecture
executes the code
6. In most cases, we have little clue as to the efficiency of this transformation process

In [3]: Image("pyramid.png")

Out[3]:

4.0.4 Any good news?


few: 1. Parallel programming is not new; it has been around for many years, since well before the "many-core
era" 2. By learning something about parallel programming on our multicore CPU and shiny GP-GPU,
we can learn something about using supercomputers

In [19]: Image("opt_areas.png")

Out[19]:

4.0.5 Supercomputers
Much of modern computational science is performed on GNU/Linux clusters where multiple
processors can be utilized and many calculations can be run in parallel. The biggest, built from
non-off-the-shelf hardware, are dubbed "supercomputers".
The first supercomputer in history:

In [20]: Image(filename="cray.png")

Out[20]:

In the mid-to-late 1970s the CRAY-1 was the fastest computer in the world.
Clock period of 12.5 ns (80 MHz); computational rate of 138 MFLOPS over sustained periods.
Unveiled in 1976 by its inventor Seymour Roger Cray, it spawned a new class of computer
called "the supercomputer".
The tree of HPC:

In [21]: Image("archs.png")

Out[21]:

The Cretaceous of mainframes

• Large mainframes that operated on vectors of data
• Custom-built, highly specialized hardware and software
• Multiple processors in a shared-memory configuration
• Required modest changes to software (vectorization)

In [22]: Image("mainframes.png")

Out[22]:

The rise of micros

• The Caltech Cosmic Cube developed by Charles Seitz and Geoffrey Fox in 1981
• 64 Intel 8086/8087 processors with 128kB of memory per processor
• 6-dimensional hypercube network

In [23]: Image("cosmic_cube.png")

Out[23]:

http://calteches.library.caltech.edu/3419/1/Cubism.pdf
Micros spawned the Massively Parallel Processor (MPP) era:
• parallel computers with large numbers of microprocessors
• high-speed, low-latency, scalable interconnection networks
• lots of custom hardware to support scalability
• required massive changes to software (parallelization)

In [24]: Image("mpps.png")

Out[24]:

• MPPs using mass-market Commercial Off The Shelf (COTS) microprocessors and standard
memory and I/O components
• Decreased hardware and software costs make huge systems affordable

In [25]: Image("mpps2.png")

Out[25]:

Mass extinctions caused by clusters

• A cluster is a collection of connected, independent computers that work in unison to solve a problem.
• Nothing is custom ... motivated users could build a cluster on their own
• The Intel Pentium Pro in 1995, coupled with Linux, made them competitive.
• NASA Goddard's Beowulf cluster demonstrated publicly that high-visibility science could
be done on clusters.

In [26]: Image("goddard.png")

Out[26]:

In [27]: Image("constellation.png")

Out[27]:

The current listing of the "biggest of the biggest" can be found on www.top500.org (see also
www.green500.org)

In [28]: Image("top500-november-2017-2-1024.jpg")
Out[28]:

Distribution of top500 systems in the world

In [29]: Image("top500-november-2017-9-1024.jpg")
Out[29]:

Trend over time:


In [30]: Image(filename="top500-november-2017-10-1024.jpg")
Out[30]:

5 Key Concepts in Parallel computing
• Basic definitions: Parallelism and Concurrency
• Notions of parallel performance

Bandwidth: the amount of data moved per unit of time. Measured in bytes/s (hard disks) and bit/s (sockets and
nodes)
Latency: the minimum time needed to assign a requested resource
Performance: the number of 64-bit floating-point operations per second
Concurrency: a condition of a system in which multiple tasks are logically active at one time.
Parallelism: a condition of a system in which multiple tasks are actually active at one time.

In [31]: Image(filename="parallel.png")

Out[31]:

In [32]: Image("conc_vs_par.png")

Out[32]:

Figure from "An Introduction to Concurrency in Programming Languages" by Matthew J. Sottile,
Timothy G. Mattson, and Craig E. Rasmussen, 2010
An example: a Web Server is a Concurrent Application (the problem is fundamentally defined
in terms of concurrent tasks):
• An arbitrarily large number of clients make requests which reference per-client persistent state
• An Image Server, which relieves load on primary web servers by storing, processing, and serving
only images

In [33]: Image("web_server_1.png")

Out[33]:

The HTML server, image server, and clients (you have to plan on having many clients) all
execute at the same time.
The problem of one or more clients interacting with a web server not only contains concurrency,
the problem is fundamentally concurrent. It doesn't exist as a serial problem.
Concurrent application: an application for which the problem definition is fundamentally
concurrent.

In [34]: Image("web_server_2.png")

Out[34]:

Another example: Mandelbrot’s set


The Mandelbrot set is an iterated map in the complex plane:

$$z_{n+1} = z_n^2 + c$$

where c is a constant and z_0 = 0.
Points c whose iterates do not diverge (in practice, checked after a finite number of iterations) are part of the set.

In [35]: Image("out.png")

Out[35]:

To generate the famous Mandelbrot set image, we use the function mandel(C) where C comes
from the points in the complex plane.

• At each point C, use n = mandel(C) to determine whether:

– the map converges (n = max_iters): assign the color black
– the map diverges (|z| > a given threshold after n < max_iters iterations): assign a color based on the
value of n

The computation for each point is independent of all the other points ... a so-called embarrassingly
parallel problem

In [36]: Image("mandel2.png")
Out[36]:

int mandel(double x0, double y0, double tr, int maxiter)
{
    double re = 0.0;
    double im = 0.0;
    double re2 = 0.0;
    double im2 = 0.0;
    int k;

    for (k = 1; k < maxiter && (re2 + im2 < tr); k++)
    {
        im = 2. * re * im + y0;
        re = re2 - im2 + x0;
        re2 = re * re;
        im2 = im * im;
    }
    return k;
}

y0 = ymin;
for (int i=0; i< height; i++)
{
    x0 = xmin;
    for (int j=0; j < width; j++)
    {
        image[i][j] = mandel(x0,y0,horiz,maxiter);
        x0 = x0 + xres;
    }
    y0 = y0 + yres;
} // close on i

• The problem of generating an image of the Mandelbrot set can be viewed serially.
• We may choose to exploit the concurrency contained in this problem so we can generate the
image in less time

Parallel application: an application composed of tasks that actually execute concurrently in
order to (1) consider larger problems in a fixed time or (2) complete in less time for a fixed-size
problem.
How much can we gain from the parallelization (2048x2048 pixels, 1000 iterations and horizon
at 3)?

[g.mancini@zama mandelbrot] time OMP_NUM_THREADS=1 ./mandelbrot.exe
real 0m4.406s
[g.mancini@zama mandelbrot] time OMP_NUM_THREADS=2 ./mandelbrot.exe
real 0m2.726s
[g.mancini@zama mandelbrot] time OMP_NUM_THREADS=4 ./mandelbrot.exe
real 0m2.401s
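These timings use OpenMP threads (note the OMP_NUM_THREADS variable). A minimal sketch of how the pixel loop above might be parallelized; the pragma here is an illustration of the idea, not necessarily the exact code used for the measurements, and it assumes the same variables as the serial loop plus compilation with -fopenmp:

/* rows are independent, so the outer loop can be split among threads;
   x and y are derived from the loop indices instead of being accumulated,
   to avoid a loop-carried dependence between iterations */
#pragma omp parallel for schedule(static)
for (int i = 0; i < height; i++)
{
    double y = ymin + i * yres;
    for (int j = 0; j < width; j++)
    {
        double x = xmin + j * xres;
        image[i][j] = mandel(x, y, horiz, maxiter);
    }
}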

Key points:
• A web server has concurrency in its problem definition ... it doesn't make sense
to even think of writing a "serial web server".
• The Mandelbrot program didn't have concurrency in its problem definition. It would take a
long time, but it could be serial.
The parallel programming process

In [37]: Image("ppar_process.png")

Out[37]:

EVERY parallel program requires a task decomposition and a data decomposition:
• Task decomposition: break the application down into a set of tasks that can execute concurrently
• Data decomposition: how must the data be broken down into chunks and associated with
threads/processes to make the parallel program run efficiently?
For the Mandelbrot set: map the pixels into row blocks and deal them out to the cores (see the
sketch below). This will give each core a memory-efficient block to work on.
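An illustrative row-block data decomposition in C (a sketch of the idea only; rank and nprocs are hypothetical names for the index of the current process/thread and the total number of them):

/* process "rank" out of "nprocs" computes one contiguous block of rows */
int rows_per_proc = height / nprocs;      /* assume height is divisible by nprocs */
int row_start = rank * rows_per_proc;
int row_end   = row_start + rows_per_proc;

for (int i = row_start; i < row_end; i++)
    for (int j = 0; j < width; j++)
        image[i][j] = mandel(xmin + j * xres, ymin + i * yres, horiz, maxiter);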

In [38]: Image("ppar_process2.png")

Out[38]:

5.0.1 Parallel performance


How a "real life" parallel program is built?
A large fraction of HPC applications (such as Molecular Dynamics) use a message passing notation
with the Single Instruction Multiple Data or SIMD design pattern.

In [39]: Image("glue.png")

Out[39]:

An easy way to go is “Bulk Synchronous Processing”.

In [40]: Image("bulk.png")

Out[40]:

Is this efficient?
How can we measure the outcome of our parallelization effort?
Let's consider the Speedup that we can gain from parallelization; a simple measure would be:

$$S = \frac{T_{ser}}{T_{par}}$$

Molecular Dynamics
Simulate the motion of atoms as point masses interacting according to classical laws:

$$m_i \frac{d^2}{dt^2} R_i(t) = -\nabla V(R)$$

$$V = V_{bond} + V_{angle} + V_{tors} + V_{Coul} + V_{LJ}$$
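As a minimal illustration of what such a code does at every time step, here is a sketch of one velocity-Verlet integration step in C (my own example: the function names, signatures and the flat 3N coordinate layout are assumptions, and compute_forces() stands in for the evaluation of -grad V over all the terms above):

/* assumed to exist: fills f[0..3n-1] with the forces -grad V at positions r */
void compute_forces(int n, const double *r, double *f);

/* one velocity-Verlet step for n point masses m[i], positions r, velocities v, forces f */
void verlet_step(int n, double dt, const double m[],
                 double r[], double v[], double f[])
{
    for (int i = 0; i < 3 * n; i++) {          /* half kick, then drift            */
        v[i] += 0.5 * dt * f[i] / m[i / 3];
        r[i] += dt * v[i];
    }
    compute_forces(n, r, f);                   /* new forces at the new positions  */
    for (int i = 0; i < 3 * n; i++)            /* second half kick                 */
        v[i] += 0.5 * dt * f[i] / m[i / 3];
}

In a parallel MD code, compute_forces() (the non-bonded part in particular) is where almost all the time is spent, and it is what gets distributed across processes.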

In [41]: Image("PES.png")

Out[41]:

In [42]: Image("image.png")

Out[42]:

In [43]: Image("charmm_bench2.png")

Out[43]:

the speedup is not constant with the number of processors; why?
let’s get back to our Mandelbrot toy:

In [44]: Image("kcache.png")

Out[44]:

not every part of the code runs in parallel; even in the utopian situation of linear speedup of the
parallel code, that cost would stay fixed (for a fixed problem size).

unsigned char colour[3];

FILE * fp = fopen(filename,"w");
fprintf(fp,"P3\n%d %d %d\n",width, height, 255);
for (int i=0; i< height; i++)
    for (int j=0; j < width; j++)
    {
        if(image[i][j] >= maxiter)
        {
            colour[0] = 0;
            colour[1] = 0;
            colour[2] = 0;
        }
        else
        {
            colour[0] = 255;
            colour[1] = 255;
            colour[2] = 255;
        }
        fprintf(fp,"%d %d %d ",colour[0],colour[1],colour[2]);
    }
Amdahl's Law
Consider a generic program running in serial mode, taking a time T_ser to complete and made up
of a serial part and a parallelizable part. Then

$$T_{ser} = f_s \, T_{ser} + f_p \, T_{ser}$$

where f_s and f_p are the serial and parallelizable fractions of the code (f_s + f_p = 1).
Now, running on p processors we have (assuming a linear speedup of the parallel part):

$$T_{par} = f_s \, T_{ser} + f_p \, \frac{T_{ser}}{p} = \left(\alpha + \frac{1-\alpha}{p}\right) T_{ser}$$

with α = f_s. The total speedup is then:

$$S = \frac{T_{ser}}{T_{par}} = \frac{1}{\alpha + \frac{1-\alpha}{p}}$$

which goes to 1/α with many processors


the maximum speedup for Mandelbrot is

In [45]: 1/0.27
Out[45]: 3.7037037037037033
but there’s still work to do:

In [46]: 4.406/2.726
Out[46]: 1.6162876008804108
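A small C sketch (my own, not from the lecture) that evaluates Amdahl's law for the serial fraction α ≈ 0.27 measured above, so it can be compared with the timings:

#include <stdio.h>

int main(void)
{
    const double alpha = 0.27;                 /* serial fraction from the profile above */
    const int procs[] = {1, 2, 4, 8, 1000};

    for (int i = 0; i < 5; i++) {
        int p = procs[i];
        double s = 1.0 / (alpha + (1.0 - alpha) / p);   /* S(p) = 1/(alpha + (1-alpha)/p) */
        printf("p = %4d  ->  S = %.2f\n", p, s);
    }
    return 0;
}

For p = 2 this gives S ≈ 1.57, in the same ballpark as the measured 1.62; the measured speedup at p = 4 falls further short of the prediction, which is where the overheads discussed next come in.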
In [47]: Image("charmm_bench3.png")
Out[47]:

Two major sources of parallel overhead:
Load imbalance: the slowest process determines when everyone is done. Time spent waiting for
other processes to finish is time wasted.
Communication overhead: a cost only incurred by the parallel program, which grows with the
number of processes for collective communications.

In [48]: Image("charmm_bench.png")

Out[48]:

Weak scaling
However: non-bonded interactions (all vs. all) scale as N² while bonded ones scale as N (an atom
typically has only a handful of bonded neighbours). Hence, the serial (or less parallel) part may grow
more slowly than the parallel one for some problems.

$$S(P) \rightarrow S(P, N)$$

$$S(P, N) = \frac{T_{ser}}{T_{par}} = \frac{1}{\alpha + \frac{1-\alpha}{P}}$$

$$N \rightarrow \infty \Rightarrow \alpha \rightarrow 0$$

$$S(P, N)\Big|_{\alpha \rightarrow 0} = P$$

In [49]: Image("dense_matrices.png")

Out[49]:

5.0.2 Parallel programming environments in the 90’s


In [50]: Image(filename="envs90.png")

Out[50]:

Environments from the literature 2010-2012:

In [51]: Image("envs10.png")

Out[51]:

6 CPU architecture 101


In [52]: Image("von_neumann.png")

Out[52]:

From Wikipedia: the von Neumann architecture is a computer design model that uses a processing
unit and a single separate storage structure to hold both instructions and data.

In [53]: Image("cpu_layout.svg.png")

Out[53]:

In [54]: Image("onecore.png")

Out[54]:

Instructions and data must be continuously fed to the control and arithmetic units, so that the
speed of the memory interface poses a limitation on compute performance (the von Neumann
bottleneck).
The architecture is inherently sequential, processing a single instruction with (possibly) a single
operand or a group of operands from memory. The term SISD (Single Instruction Single Data) has
been coined for this concept.
All the components of a CPU core can operate at some maximum speed called peak performance.
The rate at which the floating-point units generate results for multiply and add operations is
measured in floating-point operations per second (Flops/sec).
Typical single-core performances for the latest (e.g. Coffee Lake) Intel architectures reach about 100
GFlop/s.
Feeding the arithmetic units with operands is a complicated task. The most important data paths
from the programmer's point of view are those to and from the caches and main memory (see
later). The performance, or bandwidth, of those paths is quantified in GBytes/s.
Fathoming the chief performance characteristics of a processor or system is one of the purposes
of low-level benchmarking, such as the vector triad:
double start, end, mflops;
timing(&start); // a generic timestamp function

for(k=0; k<NITER; k++)
{
    for(i=0; i<N; ++i)
        a[i] = b[i] + c[i] * d[i];
    if(a[2]<0.) dummy(a,b,c,d); // prevent optimization
}

timing(&end);
mflops = 2.0*NITER*N/(end-start)/1000000.0;
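A self-contained version one might actually compile and run (a sketch: the timing() and dummy() helpers, the array size N and NITER are my own stand-ins, not part of the original benchmark):

#include <stdio.h>
#include <time.h>

#define N     1000000
#define NITER 100

static double a[N], b[N], c[N], d[N];

/* wall-clock timestamp in seconds */
static void timing(double *t)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    *t = ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

/* never actually called (a[2] stays >= 0); in a real benchmark this should live
   in a separate compilation unit so the compiler cannot drop the triad loop */
static void dummy(double *a, double *b, double *c, double *d)
{
    (void)a; (void)b; (void)c; (void)d;
}

int main(void)
{
    double start, end, mflops;

    for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = c[i] = d[i] = 1.0; }

    timing(&start);
    for (int k = 0; k < NITER; k++) {
        for (int i = 0; i < N; ++i)
            a[i] = b[i] + c[i] * d[i];          /* 2 flops per element   */
        if (a[2] < 0.) dummy(a, b, c, d);       /* prevent optimization  */
    }
    timing(&end);

    mflops = 2.0 * NITER * (double)N / (end - start) / 1.0e6;
    printf("vector triad: %.1f MFlop/s\n", mflops);
    return 0;
}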

6.0.1 Performance dimensions in the good old days
Two dimensions: - Frequency of CPU - number of CPUs

6.0.2 Performance dimensions now


we got seven:

1. Pipelining
2. Superscalar
3. Hardware vectors/SIMD
4. Hardware threads
5. Cores
6. Sockets
7. Nodes

In [55]: Image("perfo_dim.png")

Out[55]:

Latency: each instruction takes a certain time to complete; it is the amount of time between
instruction issue and completion.
Throughput: the number of instructions that complete in a span of time.

Pipelining and the ILP Wall Assembly line: workstations that perform a single production step
before passing the partially completed automobile to the next workstation. When Henry Ford(1)
introduced it in 1920, 31 workstations could assemble a Model T car in about 1.5 hrs.
Early platforms, around 1960, used to have multiple parallel units (i.e. multiple workers)
performing arithmetic and logic operations.
The first "dedicated workers", performing a single task before passing data on to the next one,
were introduced in the Cray-1, which had a pipeline of 12 stages.
If it takes m different steps to finish the product, m products are continually worked on in
different stages of completion. If all tasks are carefully tuned to take the same amount of time (the
"time step"), all workers are continuously busy. At the end, one finished product per time step
leaves the assembly line.
The simplest setup is a "fetch–decode–execute" pipeline, in which each stage can operate
independently of the others. While an instruction is being executed, another one is being decoded
and a third one is being fetched from the instruction (L1I) cache.
Breaking up tasks into many different elementary stages (one of the reasons behind RISC):

• The Good: potential for a higher clock rate, as functional units can be kept simple.

• The Bad: the deeper the pipeline, the longer the wind-up phase that is needed for all
units to become operational

• The Evil: only so many pipeline stages, possible conflicts

In [56]: Image("ILP_wall.png")

Out[56]:

For a pipeline of depth m, executing N independent, subsequent operations takes N + m − 1
steps. We can thus calculate the expected speedup versus a general-purpose unit that needs m
cycles to generate a single result:

$$\frac{T_{seq}}{T_{pipe}} = \frac{m N}{N + m - 1}$$

Throughput is:

$$\frac{N}{T_{pipe}} = \frac{N}{N + m - 1}$$

that is, for large N the speedup tends to m and the throughput to 1 result per cycle. The critical
N_c needed to achieve at least a throughput of p (0 ≤ p ≤ 1) is:

$$N_c = \frac{p\,(m - 1)}{1 - p}$$

At p = 0.5, N_c = m − 1. Think how to manage a pipeline of depth 20 to 30 and possibly slow
operations (e.g. transcendental functions).
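A quick worked example (numbers chosen purely for illustration): for a pipeline of depth m = 5 and N = 100 independent operations,

$$\frac{T_{seq}}{T_{pipe}} = \frac{5 \cdot 100}{100 + 5 - 1} \approx 4.8$$

i.e. close to the asymptotic speedup of m = 5, whereas for N = 10 the same formula gives only about 3.6: deep pipelines need long streams of independent instructions to pay off.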
Digression: maybe this was not actually invented by Ford. Do you know what this is:

In [57]: Image("ars2.jpg")

Out[57]:

Superscalar Execution and ILP Wall

• Multiple instructions can be fetched and decoded concurrently.

• Address and other integer calculations are performed in multiple integer (add, mult, shift,
mask) units (2–6).
• Multiple floating-point pipelines can run in parallel. Often there are one or two combined
multiply-add pipes that perform a = b + c*d with a throughput of one result each per cycle.
• Caches are fast enough to sustain more than one load or store operation per cycle.

In [58]: Image("superscalar.png")

Out[58]:

But ... the automatic search for independent instructions requires extra resources

Out-of-Order Execution and the Memory Wall If arguments to instructions are not available in
registers "on time", e.g. because the memory subsystem is too slow to keep up
with processor speed, out-of-order execution can avoid idle times (also called stalls) by executing
instructions that appear later in the instruction stream but have their parameters available. This
improves instruction throughput and makes it easier for compilers to arrange machine code for
optimal performance. Current out-of-order designs can keep hundreds of instructions in flight at
any time, using a reorder buffer that stores instructions until they become eligible for execution.

In [59]: Image("ooe.png")

Out[59]:

Vector registers

In [60]: Image("vectors.png")
Out[60]:

• AltiVec
• MMX
• SSE2
• SSE3
• SSE4
• 3DNow!
• AVX1
• AVX2

Larger caches Small, fast, on-chip memories serve as temporary data storage for holding copies
of data that is to be used again "soon", or that is close to data that has recently been used. Enlarging
the cache size does not usually hurt application performance, but there is a tradeoff
because a big cache tends to be slower than a small one.
Also:

In [61]: Image("transisto_cores.jpg")

Out[61]:

Hardware threads Pipelined architectures, however performant, will inevitably have stalls. One
way to minimize the cost of stalls is to keep units that stall (workstations of the assembly line)
during the execution of one program occupied by another one. This is called hardware threading,
multithreading, or hyper-threading.
Multithreading is implemented by creating multiple sets of registers and latches to hold the
state of the multiple programs. The correct register set can be selected for the right pipeline
resource by a very smart compiler.

6.0.3 Real life latencies


Most integer/logic instructions have a one-cycle execution latency:
• ADD, AND, SHL (shift left), ROR (rotate right)
• amongst the exceptions: IMUL (integer multiply): 3; IDIV (integer divide): 13–23
Floating-point latencies are typically multi-cycle:
• FADD (3), FMUL (5)
• same for both x87 and SIMD double-precision variants
• exception: FABS (absolute value): 1
• many-cycle: FDIV (20), FSQRT (27)

7 Conclusions
... too early but

• You must get your code to use vectors


• You must understand if your ILP is seriously limited by serial code, complex math functions,
and other contracts
• You must parallelise across all “CPU slots”

• Hardware threads, Cores, Sockets

if you really need to get as much as possible from this:

In [62]: Image("500eurofr_HR.jpg")

Out[62]:
