ACA 2024W 01 Introduction
Ch. 1: Introduction
By G. Barlas
Objectives
! Understand the current trends in computing machine design, and how this influences software development.
! Learn how to categorize computing machines based on Flynn's taxonomy.
! Learn the essential tools used to evaluate multicore/parallel performance, i.e. speedup and efficiency.
! Learn the proper experimental procedure for measuring and reporting performance.
! Learn Amdahl's and Gustafson-Barsis' laws and apply them in order to predict the performance of parallel programs.
[Figure: transistor counts over time, illustrating Moore's law (1970-2020).]
Sources: http://commons.wikimedia.org/wiki/File:Transistor_Count_and_Moore's_Law_-_2011.svg
https://en.wikipedia.org/wiki/File:Moore%27s_Law_Transistor_Count_1970-2020.png
Flynn's Taxonomy (1966)
! Single Instruction, Single Data (SISD): a simple sequential machine that executes one instruction at a time, operating on a single data item. Surprisingly, the vast majority of contemporary CPUs do not belong to this category.
! Single Instruction, Multiple Data (SIMD): a machine where each instruction is applied to a collection of items (see the sketch after this list). Vector processors were the very first machines that followed this paradigm. GPUs also follow this design at the level of the Streaming Multiprocessor (SM for NVidia) or the SIMD unit (for AMD).
! Multiple Instructions, Single Data (MISD): this configuration seems like an oddity; fault-tolerant systems that process the same data with redundant units are sometimes cited as an example.
! Multiple Instructions, Multiple Data (MIMD): the most versatile machine category. Multicore machines follow this paradigm, including GPUs.
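To make the SIMD idea concrete, here is a minimal sketch (not from the original slides), assuming an x86 CPU and a compiler that provides the SSE intrinsics in <immintrin.h>: a single instruction adds four pairs of floats at once.

#include <immintrin.h>  // SSE intrinsics (x86)
#include <cstdio>

int main() {
    alignas(16) float a[4] = {1, 2, 3, 4};   // 16-byte alignment required by _mm_load_ps
    alignas(16) float b[4] = {10, 20, 30, 40};
    alignas(16) float c[4];
    __m128 va = _mm_load_ps(a);        // load 4 floats into one vector register
    __m128 vb = _mm_load_ps(b);
    __m128 vc = _mm_add_ps(va, vb);    // ONE instruction adds all 4 pairs
    _mm_store_ps(c, vc);
    printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
    return 0;
}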
Symmetric Multiprocessing
Industry Trends
! Increasing the on-chip core count, combined with augmented specialized SIMD instruction sets and larger caches. This is best exemplified by Intel's x86 line of CPUs and the Intel Xeon Phi co-processor.
! Combining heterogeneous cores in the same package, typically CPU and GPU ones, each optimized for a different type of task. This is best exemplified by AMD's line of APU (Accelerated Processing Unit) chips. Intel also offers OpenCL-based computing on its line of CPUs with integrated graphics chips.
CPUs VS GPUs
! CPUs:
− Large caches, repetitive use of data
− Instruction decoding and prediction hardware
− Pipelined execution
! GPUs:
− Small caches, single-time use of data
− Big data volume
− Simple program logic, simple ALUs
[TOP500 supercomputer lists, June 2024 and June 2023. See: https://www.top500.org/lists/top500/2024/06/]
Speedup
! The speedup of a parallel program is defined as:
$speedup = \frac{t_{seq}}{t_{par}}$
where $t_{seq}$ is the execution time of the sequential program, and $t_{par}$ is the execution time of the parallel program for solving the same instance of a problem.
! Both $t_{seq}$ and $t_{par}$ are total response times (elapsed times), and as such they are not objective. They can be influenced by:
− The skill of the programmer who wrote the implementations
− The choice of compiler (e.g. GNU C++ versus Intel C++)
− The compiler switches (e.g. turning optimization on/off)
− The operating system
− The type of filesystem holding the input data (e.g. EXT4, NTFS, etc.)
− The time of day... (different workloads, network traffic, etc.)
! One should abide by the following rules:
− Both the sequential and the parallel programs should be tested on identical software and hardware platforms, and under similar conditions.
− The sequential program should be the fastest known solution to the problem at hand.
! Why is the second condition there?
! Shouldn't we compare against the sequential version of the parallel program instead?
Efficiency
! Speedup tells only part of the story: it can tell us if it is feasible to accelerate the solution of a problem, e.g. if speedup > 1.
! It cannot tell us if this can be done with a modest amount of resources.
! The second metric employed for this purpose is efficiency, defined as:
$efficiency = \frac{speedup}{N} = \frac{t_{seq}}{N \cdot t_{par}}$
where N is the number of CPUs/cores employed for the execution of the parallel program.
! One can interpret the efficiency as the average percentage of time that a node is utilized during the parallel execution.
! When speedup = N, the corresponding parallel program exhibits what is called linear speedup.
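! A worked example (the numbers are illustrative, not from the original slides): if $t_{seq} = 100s$ and $t_{par} = 25s$ on $N = 8$ cores, then $speedup = \frac{100}{25} = 4$ and $efficiency = \frac{4}{8} = 50\%$, i.e. on average each core does useful work only half of the time.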
! There is a discrepancy here for the cautious student: if we only have a quad-core CPU like the Intel i7, how can we test and report speedup for 8 threads? Hyperthreading!
! What is a Thread? A sequence of instructions of a process which is managed separately by the operating system scheduler as a unit.
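A minimal sketch of spawning threads in C++ (illustrative, not from the slides), using std::thread. Note that hardware_concurrency() reports hardware threads, e.g. 8 on a hyper-threaded quad-core:

#include <thread>
#include <vector>
#include <cstdio>

int main() {
    // e.g. 8 on a 4-core CPU with hyperthreading enabled
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> workers;
    for (unsigned id = 0; id < n; id++)
        workers.emplace_back([id] { printf("thread %u running\n", id); });
    for (auto &t : workers)
        t.join();  // wait for all threads to finish
    return 0;
}

(Compile with g++ -std=c++11 -pthread.)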
Speedup-Efficiency Considerations
! Is it possible to get:
speedup > N
efficiency > 100%
! This is the so-called super-linear speedup scenario.
! It can be caused by the parallel program effectively using another algorithm, one that cannot be replicated on a single processor, e.g. the acquisition of an item in a search space: a parallel search may stumble upon the target much earlier than a sequential scan.
! What would happen if we increased the number of CPUs? Would we get a reduction in run time?
! What would happen if the secret key was 2?
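A sketch of the kind of search these questions refer to (an illustrative toy; SECRET, SPACE and the thread count are made-up values): N threads each scan a different slice of the key space, so the position of the key decides whether speedup is super-linear, linear, or non-existent. If the key were 2, the sequential scan would find it almost immediately and parallelism could not help.

#include <atomic>
#include <thread>
#include <vector>
#include <cstdio>

const long SECRET = 7654321;   // hypothetical secret key
const long SPACE  = 10000000;  // size of the key space

int main() {
    const int N = 4;                 // number of worker threads
    std::atomic<long> found(-1);
    std::vector<std::thread> workers;
    long slice = SPACE / N;
    for (int id = 0; id < N; id++)
        workers.emplace_back([&, id] {
            long hi = (id + 1) * slice;
            // each thread scans its own slice, stopping once any thread succeeds
            for (long k = id * slice; k < hi && found.load() < 0; k++)
                if (k == SECRET) { found.store(k); break; }
        });
    for (auto &t : workers) t.join();
    printf("found key %ld\n", found.load());
    return 0;
}

Here the key lies near the start of the last thread's slice, so the parallel version checks far fewer than SECRET keys: a different effective algorithm, and a possible super-linear result.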
! Speedup covers the efficacy of a parallel solution: is it beneficial or not?
! Efficiency is a measure of resource utilization: how much of the potential afforded by the computing resources we commit is actually used?
! Finally, we would like to know how a parallel algorithm behaves with increased computational resources and/or problem sizes: does it scale?
! Scalability is the ability of a (software or hardware) system to efficiently handle a growing amount of work.
Scalability, cont.
! In the context of a parallel algorithm and/or platform, scalability translates to being able to
(a) solve bigger problems and/or
(b) incorporate more computing resources.
! To measure (a) we use the weak scaling efficiency:
$effic_{weak}(N) = \frac{t_{seq}}{t'_{par}}$
where $t'_{par}$ is the time needed by N nodes to solve a problem instance N times bigger than the one solved sequentially in $t_{seq}$.
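! A worked illustration (with made-up numbers): if a problem of size $s$ takes $t_{seq} = 60s$ on one core, and a problem of size $8s$ takes $t'_{par} = 75s$ on 8 cores, then $effic_{weak}(8) = \frac{60}{75} = 80\%$.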
! The development of a parallel solution to a problem starts with the development of its sequential variant!
! Questions that need to be answered:
− Which parts of the sequential program are the greatest consumers of computational power?
− What is the potential speedup?
− Is the parallel program correct?
! The development of the sequential algorithm and associated program can also provide essential insights about the design that should be pursued for parallelization.
! Once the sequential version is implemented, we can use a profiler to guide the design process.
! Profilers can use:
− Instrumentation: modifies the code of the program that is being profiled, so that information can be collected (usually requires re-compilation).
− Sampling: the execution of the target program is interrupted periodically, in order to query which function is being executed.
Profiler Example
$ valgrind --tool=callgrind ./bucketsort 10000
See: https://www.valgrind.org/
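Callgrind writes its measurements to a file named callgrind.out.<pid>; the per-function cost breakdown can then be inspected on the command line with callgrind_annotate (or graphically with KCachegrind):

$ callgrind_annotate callgrind.out.<pid>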
Experimentation Guidelines
! The duration of the whole execution should be measured, unless specifically stated otherwise.
! Results should be reported in the form of averages over multiple runs, possibly including standard deviations (see the sketch after this list).
! Outliers, i.e. results that are too big or too small, should be excluded from the calculation of the averages, as they are typically the expression of an anomaly. However, care should be taken so that unfavorable results are explained rather than brushed away.
! Scalability is paramount, so results should be reported for a variety of input sizes (ideally covering the size and/or quality of real-life data), and a variety of parallel platform sizes.
! Test inputs should vary from very small to very big, but they should always include problem sizes that would be typical of a production environment, if these can be identified.
! When multicore machines are employed, the number of threads and/or processes should not exceed the number of available hardware cores. (Disable hyperthreading/simultaneous multithreading!)
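A minimal timing sketch along these lines (illustrative; run() stands for whatever workload is being measured), using std::chrono to average elapsed times over several runs:

#include <chrono>
#include <cmath>
#include <cstdio>

void run() { /* workload to be measured (hypothetical) */ }

int main() {
    const int RUNS = 10;
    double sum = 0, sum2 = 0;
    for (int i = 0; i < RUNS; i++) {
        auto start = std::chrono::steady_clock::now();
        run();
        auto stop = std::chrono::steady_clock::now();
        double t = std::chrono::duration<double>(stop - start).count(); // seconds
        sum += t;
        sum2 += t * t;
    }
    double mean = sum / RUNS;
    double stdev = std::sqrt(sum2 / RUNS - mean * mean); // population std. deviation
    printf("mean %.6fs, stdev %.6fs over %d runs\n", mean, stdev, RUNS);
    return 0;
}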
Amdahl's Law
! In 1967 Gene Amdahl formulated a simple thought experiment:
− A sequential application requires T time to execute on a single CPU.
− A fraction $\alpha$ ($0 \le \alpha \le 1$) of the application can be parallelized. The remaining $1 - \alpha$ has to be done sequentially.
− Parallel execution incurs no communication overhead, and the parallelizable part can be divided evenly among any chosen number of CPUs. This assumption suits multicore architectures particularly well, where cores have access to the same shared memory.
! Given the above assumptions, the speedup obtained by N nodes is upper-bounded by:
$speedup \le \frac{T}{(1-\alpha)T + \frac{\alpha T}{N}} = \frac{1}{(1-\alpha) + \frac{\alpha}{N}}$
which for $N \to \infty$ approaches the absolute limit $\frac{1}{1-\alpha}$.
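! A quick numeric illustration (values chosen for the example): with $\alpha = 0.9$ and $N = 8$, $speedup \le \frac{1}{0.1 + 0.9/8} \approx 4.7$, and no matter how many cores we add, the limit is $\frac{1}{1-0.9} = 10$.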
"An army of ants versus a herd of elephants"
! What is the best investment (in mainframe/minicomputer time)?
! Assume that we have a program that can execute in time $T_A$ on a single powerful CPU, and in time $T_B$ on a less powerful, inexpensive CPU.
! We can declare, based on the execution times, that CPU A is $r = \frac{T_B}{T_A}$ times faster than CPU B.
! Running the program on $N_B$ inexpensive CPUs, under Amdahl's assumptions, yields:
$speedup = \frac{T_A}{(1-\alpha)T_B + \frac{\alpha T_B}{N_B}} = \frac{1}{r\left((1-\alpha) + \frac{\alpha}{N_B}\right)}$
! For infinite $N_B$, we can get the absolute upper bound on the speedup:
$speedup_{max} = \frac{1}{r(1-\alpha)}$
which means that the speedup will never be above 1, no matter how many "ants" you use, if $r \ge \frac{1}{1-\alpha}$.
! So if α = 90% and r = 10 we are better off using a single expensive CPU than going the parallel route with inexpensive components.
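! A complementary illustration (our numbers): if the inexpensive CPUs are only r = 2 times slower and α = 90%, the bound becomes $\frac{1}{2 \cdot 0.1} = 5$, so the army of ants can still win.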
! BUT, math does not tell a true story all the time!
! Empirical data shows that parallel programs routinely exceed the speedup limits predicted by Amdahl's law.
! The key to understanding the fundamental error in Amdahl's law is problem size.
! Do we solve the same size of problems with parallel machines as with sequential ones?
Gustafson-Barsis' Law
! Assume that we have:
− A parallel application that requires T time to execute on N CPUs.
− The application spends $\alpha$ percent of the total time running in parallel on all machines. The remaining $1 - \alpha$ has to be done sequentially.
! The same work would require $(1-\alpha)T + N \alpha T$ time on a single CPU, so the (scaled) speedup is:
$speedup = \frac{(1-\alpha)T + N \alpha T}{T} = (1-\alpha) + N\alpha$
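! With the same illustrative numbers as before, α = 0.9 and N = 8: $speedup = 0.1 + 8 \cdot 0.9 = 7.3$, far above Amdahl's 4.7, because the problem is assumed to grow with the machine.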
Gustafson-Barsis' Predictions
! So which of these predictions can we trust? Actually..... none!
! The assumption of zero communication overhead is too optimistic, even with shared-memory platforms.
! Scenarios where coordination is completely absent, even for cores sharing the same memory space, are not typical.
! There are two things we can keep from the discussion:
− There is room for optimism in parallel computing.
− Simple models can help us predict performance before actually building the software.