ACA 2024W 01 Introduction

2024-10-23

Multicore & GPU Programming : An Integrated Approach

Ch. 1: Introduction
By G. Barlas

<C> G. Barlas, 2016  Modifications by H. Weber

Objectives

! Understand the current trends in computing machine design, and how this influences software development.
! Learn how to categorize computing machines based on Flynn's taxonomy.
! Learn the essential tools used to evaluate multicore/parallel performance, i.e. speedup and efficiency.
! Learn the proper experimental procedure for measuring and reporting performance.
! Learn Amdahl's and Gustafson-Barsis' laws and apply them in order to predict the performance of parallel programs.

<C> G. Barlas, 2016 2


The Era of Multicore Machines

Source: https://commons.wikimedia.org/wiki/File:Transistor_Count_and_Moore's_Law_-_2011.svg

The Era of Multicore Machines

<C> G. Barlas, 2016 4

Source: https://en.wikipedia.org/wiki/File:Moore%27s_Law_Transistor_Count_1970-2020.png
Flynn's Taxonomy (1966)
! Single Instruction, Single Data (SISD): a simple sequential machine that executes one instruction at a time, operating on a single data item. Surprisingly, the vast majority of contemporary CPUs do not belong to this category.
! Single Instruction, Multiple Data (SIMD): a machine where each instruction is applied to a collection of items. Vector processors were the very first machines that followed this paradigm. GPUs also follow this design at the level of the Streaming Multiprocessor (SM for NVidia) or the SIMD unit (for AMD).
! Multiple Instructions, Single Data (MISD): this configuration seems like an oddity.
! Multiple Instructions, Multiple Data (MIMD): the most versatile machine category. Multicore machines follow this paradigm, including GPUs.

Symmetric Multiprocessing (figure)

<C> G. Barlas, 2016 5
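To make the SIMD idea above concrete, here is a minimal, illustrative C++ sketch (not from the original slides): an element-wise array addition applies the same operation to many data items, which is exactly the pattern a compiler can map to SIMD/vector instructions when optimizations are enabled.

#include <cstddef>
#include <iostream>
#include <vector>

// The same instruction (an addition) is applied to every element.
// With optimizations enabled (e.g. g++ -O3) the compiler may emit
// SIMD/vector instructions that process several elements at once.
void add(const std::vector<float>& a, const std::vector<float>& b,
         std::vector<float>& c)
{
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = a[i] + b[i];
}

int main()
{
    std::vector<float> a(1024, 1.0f), b(1024, 2.0f), c(1024);
    add(a, b, c);
    std::cout << c[0] << "\n";   // prints 3
}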

Industry Trends
! Increase the on-chip core count, combined with
augmented specialized SIMD instruction sets and
larger caches. This is best exemplified by Intel's x86
line of CPUs and the Intel Xeon Phi co-processor.
! Combine heterogeneous cores in the same
package, typically CPU and GPU ones, each
optimized for a different type of task. This is best
exemplified by AMD's line of APU (Accelerated
Processing Unit) chips. Intel is also offering
OpenCL-based computing on its line of CPUs with
integrated graphics chips.
<C> G. Barlas, 2016  6
CPUs VS GPUs
! CPUs:
− Large caches, repetitive use of data
− Instruction decoding and prediction hardware
− Pipelined execution

! GPUs:
− Small caches, single-time use of data
− Big data volume
− Simple program logic, simple ALUs

<C> G. Barlas, 2016 7

Examples of recent multicore chips

! Name of chip: ...


! Company: ...
! Architecture description (architecture diagram):

! Architecture description (keywords):


– …
– ...

<C> G. Barlas, 2016 8


A Glimpse at the Top 500

June 2024

<C> G. Barlas, 2016 9

https://www.top500.org/lists/top500/2024/06/

A Glimpse at the Top 500

June 2023

<C> G. Barlas, 2016 10


How do we measure performance?
! It's all about time.
! Counting steps or calculating the asymptotic complexity has little to no benefit.
! At the very least, a parallel program should be able to beat its sequential counterpart in terms of execution time (not always certain).
! The improvement in execution time is typically expressed as the speedup:

  speedup = t_seq / t_par

  where t_seq is the execution time of the sequential program, and t_par is the execution time of the parallel program for solving the same instance of a problem.
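A minimal C++ sketch of how the two times could be measured (illustrative only; solve_sequential and solve_parallel are hypothetical stand-ins, not functions from the course material):

#include <chrono>
#include <iostream>

// Hypothetical stand-ins for the two implementations being compared;
// replace the bodies with the real sequential and parallel solvers.
void solve_sequential() { /* ... sequential work ... */ }
void solve_parallel()   { /* ... parallel work ... */ }

// Elapsed wall-clock time of a single call, in seconds.
template <typename F>
double elapsed_seconds(F&& f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main()
{
    double t_seq = elapsed_seconds(solve_sequential);
    double t_par = elapsed_seconds(solve_parallel);
    std::cout << "speedup = t_seq / t_par = " << t_seq / t_par << "\n";
}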

<C> G. Barlas, 2016 11

Speedup – How objective is it?

! Both t_seq and t_par are total response times (elapsed times), and as such they are not objective. They can be influenced by:
− The skill of the programmer who wrote the implementations
− The choice of compiler (e.g. GNU C++ versus Intel C++)
− The compiler switches (e.g. turning optimization on/off)
− The operating system
− The type of filesystem holding the input data (e.g. EXT4, NTFS, etc.)
− The time of day... (different workloads, network traffic, etc.)

<C> G. Barlas, 2016  12


Speedup - Conditions

! One should abide by the following rules:
− Both the sequential and the parallel programs should be tested on identical software and hardware platforms, and under similar conditions.
− The sequential program should be the fastest known solution to the problem at hand.
! Why is the second condition there?
! Shouldn't we compare against the sequential version of the parallel program instead?

<C> G. Barlas, 2016 13

Efficiency
! Speedup tells only part of the story: it can tell us if it is feasible to accelerate the solution of a problem, e.g. if speedup > 1.
! It cannot tell us if this can be done with a modest amount of resources.
! The second metric employed for this purpose is efficiency, defined as:

  efficiency = speedup / N = t_seq / (N · t_par)

  where N is the number of CPUs/cores employed for the execution of the parallel program.
! One can interpret the efficiency as the average percentage of time that a node is utilized during the parallel execution.
! When speedup = N, the corresponding parallel program exhibits what is called linear speedup.
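A small worked example with made-up numbers (not from the slides): if t_seq = 10 s and a run on N = 8 cores takes t_par = 2 s, then speedup = 10 / 2 = 5 and efficiency = 5 / 8 = 62.5%, i.e. on average each core does useful work for about 62.5% of the parallel execution.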

<C> G. Barlas, 2016 14


Efficiency Example
! Speedup and efficiency curves for a sample program that
calculates the definite integral of a function by applying the
trapezoidal rule algorithm.

Hint concerning the shown curves (Efficiency Example)

There is a discrepancy here for the cautious student:
If we only have a quad-core CPU like the Intel i7, how can we test and report speedup for 8 threads?
Hyperthreading!

What is a Thread?

A sequence of instructions within a process that is managed by the operating system scheduler as a separate unit.
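A minimal illustration in standard C++ (not part of the original slide): a process starts a second thread, and the operating system scheduler manages the two instruction streams independently. Build with -pthread on Linux.

#include <iostream>
#include <thread>

int main()
{
    // The worker thread and the main thread are scheduled independently
    // by the operating system.
    std::thread worker([] { std::cout << "hello from the worker thread\n"; });
    std::cout << "hello from the main thread\n";
    worker.join();   // wait for the worker to finish
}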

<C> G. Barlas, 2016 16



Efficiency Milestones

! 2005 : first single-die dual-core CPU (AMD Athlon)
! 2007 : first heterogeneous CPU : Cell BE
! Mid 2000s : introduction of the GPGPU paradigm.
! GPUs offer distinct advantages:
− Bulk computation power
− High FLOP/Watt ratio

<C> G. Barlas, 2016 17

Speedup-Efficiency Considerations

! How could we calculate efficiency, if the sequential and parallel programs run on different platforms? For example, a CPU and a GPU respectively.
! Caution for testing: make sure you are using real hardware resources.

<C> G. Barlas, 2016 18


Speedup-Efficiency Considerations cont.

! Is it possible to get:
  speedup > N
  efficiency > 100% ?
! This is the so-called super-linear speedup scenario.
! It can be caused by using a different algorithm, one which cannot be used on a single processor, e.g. the acquisition of an item in a search space.

<C> G. Barlas, 2016 19

Super-linear Speedup Example


! Let us consider the problem of breaking DES. In the DES encryption standard, a secret number in the range [0, 2^56 − 1] is used as the key to encrypt a message.
! A brute-force attack on a ciphertext would involve trying out all the keys until the decoded message could be identified as readable text. If we assume that each attempt to decipher the message costs time T on a single CPU, and the key was the number 2^55, then a sequential program would take t_seq = (2^55 + 1) T time to solve the problem.
! If we were to employ 2 CPUs to solve the same problem, and we partitioned the search space of 2^56 keys equally among the 2 CPUs, i.e. range [0, 2^55 − 1] to the first one and range [2^55, 2^56 − 1] to the second one, then the key would be found after only one attempt by the second CPU! We would then have:

  speedup = t_seq / t_par = (2^55 + 1) T / T = 2^55 + 1

! What would happen if we increased the number of CPUs? Would we get a further reduction in run time?
! What would happen if the secret key was 2?
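A sketch of the partitioning idea above in C++ (illustrative only: a trivial comparison stands in for the actual DES decryption test, and the key space is scaled down to 2^24 so the toy program finishes quickly):

#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    const std::uint64_t keySpace  = 1ull << 24;   // toy key space (2^24, not 2^56)
    const std::uint64_t secretKey = keySpace / 2; // worst case for a sequential search
    const unsigned      nWorkers  = 2;

    std::atomic<bool>          found(false);
    std::atomic<std::uint64_t> answer(0);
    std::vector<std::thread>   workers;

    // Each worker scans an equal share of the key space and stops as soon
    // as any worker reports success.
    for (unsigned w = 0; w < nWorkers; ++w) {
        std::uint64_t lo = w * (keySpace / nWorkers);
        std::uint64_t hi = (w + 1) * (keySpace / nWorkers);
        workers.emplace_back([&, lo, hi] {
            for (std::uint64_t k = lo; k < hi && !found.load(); ++k)
                if (k == secretKey) {            // stand-in for "decrypts to readable text"
                    answer.store(k);
                    found.store(true);
                }
        });
    }
    for (auto& t : workers) t.join();
    std::cout << "key found: " << answer.load() << "\n";
}

With two workers, the second worker's sub-range starts exactly at the secret key, so it succeeds on its first attempt, mirroring the super-linear case described above.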

<C> G. Barlas, 2016 20


Scalability

! Speedup covers the efficacy of a parallel solution: is it beneficial or not?
! Efficiency is a measure of resource utilization: how much of the potential afforded by the computing resources we commit is actually used?
! Finally, we would like to know how a parallel algorithm behaves with increased computational resources and/or problem sizes: does it scale?
! Scalability is the ability of a (software or hardware) system to efficiently handle a growing amount of work.

<C> G. Barlas, 2016 21

Scalability, cont.

! In the context of a parallel algorithm and/or platform, scalability translates to being able to
  (a) solve bigger problems and/or
  (b) incorporate more computing resources.
! To measure (a) we use the weak scaling efficiency:

  weak scaling efficiency = t_seq / t'_par

  where t'_par is the time to solve an N-times bigger problem than the one the single-processor machine solves in t_seq.
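A worked example with made-up numbers: if one processor solves a problem of size n in t_seq = 10 s, and N = 4 processors solve a problem of size 4n in t'_par = 12.5 s, the weak scaling efficiency is 10 / 12.5 = 80%.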

<C> G. Barlas, 2016 22


Scalability, cont.

! To measure (b) we use the strong scaling efficiency:

  strong scaling efficiency = t_seq / (N · t_par)

  which is the same as the efficiency discussed earlier.
! The one that is most challenging to improve is the strong scaling efficiency.

<C> G. Barlas, 2016 23

Predicting and Measuring Parallel Program Performance

! The development of a parallel solution to a problem starts with the development of its sequential variant!
! Questions that need to be answered:
− Which parts of the sequential program are the greatest consumers of computational power?
− What is the potential speedup?
− Is the parallel program correct?
! The development of the sequential algorithm and associated program can also provide essential insights about the design that should be pursued for parallelization.

<C> G. Barlas, 2016 24


Predicting and Measuring Parallel Program
Performance (cont.)

! Once the sequential version is implemented, we can use a profiler to guide the design process.
! Profilers can use:
− Instrumentation: modifies the code of the program that is being profiled, so that information can be collected (usually requires re-compilation).
− Sampling: the execution of the target program is interrupted periodically, in order to query which function is being executed.
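As an illustration of the two approaches on Linux (commonly used tools, assumed here; the bucketsort program from the next slide is used as the example workload): gprof relies on compile-time instrumentation, while perf is a sampling profiler.

$ g++ -pg -O2 -o bucketsort bucketsort.cpp   # instrument at compile time
$ ./bucketsort 10000                         # the run writes gmon.out
$ gprof ./bucketsort gmon.out                # per-function time report

$ perf record ./bucketsort 10000             # periodic sampling, no re-compilation needed
$ perf report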

<C> G. Barlas, 2016 25

Profiler Example
$ valgrind --tool=callgrind ./bucketsort 10000

<C> G. Barlas, 2016 26

See: https://www.valgrind.org/
Experimentation Guidelines
! The duration of the whole execution should be measured, unless specifically stated otherwise.
! Results should be reported in the form of averages over multiple runs, possibly including standard deviations.
! Outliers, i.e. results that are too big or too small, should be excluded from the calculation of the averages, as they are typically the expression of an anomaly. However, care should be taken so that unfavorable results are explained rather than simply brushed away.
! Scalability is paramount, so results should be reported for a variety of input sizes (ideally covering the size and/or quality of real-life data), and a variety of parallel platform sizes.
! Test inputs should vary from very small to very big, but they should always include problem sizes that would be typical of a production environment, if these can be identified.
! When multicore machines are employed, the number of threads and/or processes should not exceed the number of available hardware cores. (Disable simultaneous multithreading, i.e. hyper-threading!)
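A minimal C++ sketch of the "averages over multiple runs" guideline (illustrative; run_once is a hypothetical stand-in for the program being benchmarked):

#include <chrono>
#include <cmath>
#include <iostream>
#include <vector>

void run_once() { /* ... the computation being benchmarked ... */ }

int main()
{
    const int runs = 10;
    std::vector<double> t(runs);

    for (int i = 0; i < runs; ++i) {
        auto start = std::chrono::steady_clock::now();
        run_once();
        auto stop = std::chrono::steady_clock::now();
        t[i] = std::chrono::duration<double>(stop - start).count();
    }

    // Mean and standard deviation over all runs.
    double mean = 0.0;
    for (double x : t) mean += x;
    mean /= runs;

    double var = 0.0;
    for (double x : t) var += (x - mean) * (x - mean);
    double stddev = std::sqrt(var / runs);

    std::cout << "mean = " << mean << " s, stddev = " << stddev << " s\n";
}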

<C> G. Barlas, 2016 27

Amdahl's Law
! In 1967 Gene Amdahl formulated a simple thought experiment:
− A sequential application requires T time to execute on a single CPU.
− The application consists of a part α (0 ≤ α ≤ 1) that can be parallelized. The remaining 1 − α has to be done sequentially.
− Parallel execution incurs no communication overhead, and the parallelizable part can be divided evenly among any chosen number of CPUs. This assumption suits multicore architectures particularly well, where cores have access to the same shared memory.
! Given the above assumptions, the speedup obtained by N nodes should be upper-bounded by:

  speedup ≤ T / ((1 − α) T + α T / N) = 1 / ((1 − α) + α / N)
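A quick worked example with illustrative numbers: for α = 0.9 (90% of the work parallelizable) and N = 8 cores, the bound is 1 / (0.1 + 0.9 / 8) = 1 / 0.2125 ≈ 4.7, i.e. well below 8.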

<C> G. Barlas, 2016 28


Amdahl's Predictions
! How much faster can we go?
! What if we had infinite resources?

  lim (N → ∞) speedup = 1 / (1 − α)
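For example (illustrative values): with α = 0.9 the speedup can never exceed 1 / 0.1 = 10, and even with α = 0.95 it is capped at 20, no matter how many cores are added.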

<C> G. Barlas, 2016 29

Amdahl's Predictions (2)

<C> G. Barlas, 2016 30
„An army of ants versus a herd of elephants“

! What is the best investment (in mainframe/minicomputer time)?
! Assume that we have a program that can execute in time T_A on a single powerful CPU and in time T_B on a less powerful, inexpensive CPU.
! We can declare, based on the execution time, that CPU A is

  r = T_B / T_A

  times faster than B.
! If we can afford to buy N_B CPUs of the inexpensive type, the best speedup we can get relative to the execution on a single CPU of type A is:

  speedup = T_A / (T_B ((1 − α) + α / N_B)) = 1 / (r ((1 − α) + α / N_B))
<C> G. Barlas, 2016 31

„An army of ants versus a herd of elephants“

! For infinite N_B, we can get the absolute upper bound on the speedup:

  lim (N_B → ∞) speedup = 1 / (r (1 − α))

  which means that the speedup will never be above 1, no matter how many „ants“ you use, if r (1 − α) ≥ 1.
! So if α = 90% and r = 10, then 1 / (r (1 − α)) = 1 / (10 · 0.1) = 1, and we are better off using a single expensive CPU than going the parallel route with inexpensive components.
! BUT, math does not tell a true story all the time!

<C> G. Barlas, 2016 32


Gustafson-Barsis' rebuttal (1988)

! Empirical data shows: parallel programs routinely exceed the speedup limits predicted by Amdahl's law.
! The key to understanding the fundamental error in Amdahl's law is problem size.
! Do we solve the same size of problems with parallel machines as with sequential ones?
! Assume that we have:
− A parallel application that requires T time to execute on N CPUs.
− The application spends a fraction α of the total time running in parallel on all N machines. The remaining fraction (1 − α) has to be done sequentially.

<C> G. Barlas, 2016 33

Gustafson-Barsis' rebuttal (1988) cont.


! A sequential machine would require a total time:

  t_seq = (1 − α) T + N α T

! The speedup would then be:

  speedup = t_seq / T = (1 − α) + α N

! And the corresponding efficiency:

  efficiency = speedup / N = (1 − α) / N + α

! The efficiency has a lower bound of α, as N goes to infinity.
! Gustafson-Barsis' speedup curves are worlds apart from Amdahl's curves! (see next slides)
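A worked contrast with illustrative numbers: for α = 0.9 and N = 8, Gustafson-Barsis predicts a scaled speedup of 0.1 + 0.9 · 8 = 7.3, whereas Amdahl's bound for the same α and N is only about 4.7 (see the earlier example).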

<C> G. Barlas, 2016 34


Gustafson-Barsis' Predictions

<C> G. Barlas, 2016 35

Gustafson-Barsis' Predictions

<C> G. Barlas, 2016 36


Which one is right?

! Actually..... none!
! The assumption of zero communication overhead is too optimistic, even with shared-memory platforms.
! Scenarios where coordination is completely absent even for cores sharing the same memory space are not typical.
! There are two things we can keep from the discussion:
− There is room for optimism in parallel computing.
− Simple models can help us predict performance before actually building the software.

<C> G. Barlas, 2016 37
