Lect 1 Overview
SYLLABUS
The Idea of Parallelism: A Parallelised version of the Sieve of
Eratosthenes
PRAM Model of Parallel Computation
Pointer Jumping and Divide & Conquer: Useful Techniques for
Parallelization
PRAM Algorithms: Parallel Reduction, Prefix Sums, List Ranking, Preorder
Tree Traversal, Merging Two Sorted Lists, Graph Coloring
Reducing the Number of Processors and Brent's Theorem
Dichotomy of Parallel Computing Platforms
Cost of Communication
Programmer's view of modern multi-core processors
The role of compilers and writing efficient serial programs
SYLLABUS (CONTD.)
Parallel Complexity: The P-Complete Class
Mapping and Scheduling
Elementary Parallel Algorithms
Sorting
Parallel Programming Languages: Shared Memory Parallel Programming using OpenMP
Writing efficient OpenMP programs
Dictionary Operations: Parallel Search
Graph Algorithms
Matrix Multiplication
Industrial Strength programming 1:
Programming for performance; Dense Matrix-matrix multiplication through various stages:
data access optimization, loop interchange, blocking and tiling
Analyze BLAS (Basic Linear Algebra Subprograms) and ATLAS (Automatically Tuned Linear
Algebra Software) code
SYLLABUS (CONTD.)
Distributed Algorithms: models and complexity measures.
Safety, liveness, termination, logical time and event ordering
Global state and snapshot algorithms
Mutual exclusion and Clock Synchronization
Distributed Graph algorithms
Distributed Memory Parallel Programming: Cover MPI programming basics with
simple programs and most useful directives; Demonstrate Parallel Monte Carlo
Industrial strength programming 2:
Scalable programming for capacity
Distributed sorting of massive arrays
Distributed Breadth-First Search of huge graphs and finding Connected
Components
Course site:
http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/PAlgo/index.htm
PARALLEL ALGORITHMS
The fastest computers in the world are built of numerous conventional
microprocessors.
The emergence of these high-performance, massively parallel
computers demands the development of new algorithms to take
advantage of this technology.
Objectives:
Design
Analyze
Implement
Parallel algorithms on such computers with numerous processors
APPLICATIONS
Scientists often use high performance computing to validate their theory.
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
― Arthur Conan Doyle, The Adventure of the Copper Beeches
Many scientific problems are so complex that solving them requires extremely
powerful machines.
Some complex problems where parallel computing is useful:
Computer Aided Design
Cryptanalysis
Quantum Chemistry, statistical mechanics, relativistic physics
Cosmology and astrophysics
Weather and environment modeling
Biology and pharmacology
Material design
MOTIVATIONS OF PARALLELISM
Computational Power Argument – from Transistors to FLOPS
In 1965, Gordon Moore said
“The complexity for minimum component costs has increased at a rate of
roughly a factor of two per year. Certainly over the short term this rate
can be expected to continue if not to increase. Over the long term, the
rate of increase is a bit more uncertain although there is no reason to
believe it will not remain nearly constant for at least 10 years.”
In 1975, he revised the doubling period to 18 months; this came
to be known as Moore’s law.
With more devices on a chip, the pressing issue is achieving an ever-increasing number of
OPS (operations per second):
A logical recourse is to rely on parallelism.
MOTIVATIONS OF PARALLELISM
The memory/disk speed argument
The overall speed of computation is determined not just by the speed of the
processor, but also by the ability of the memory system to feed data to it.
While clock rates of high-speed processors have increased at roughly 40% per year over the last decade,
DRAM access times have improved at only about 10% per year.
Coupled with the increase in instructions executed per clock cycle, this gap presents a huge
performance bottleneck.
Further, the principles of parallel algorithm design themselves lead to cache-friendly serial algorithms, as the sketch below illustrates!
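To make the cache-friendliness point concrete, here is a minimal C sketch (an illustration only; the matrix size and the computation are arbitrary choices, not from the course material). Both functions compute the same y = A*x, but the row-major version walks each row of the matrix contiguously in memory and therefore uses the cache far better; loop interchange, blocking and tiling, which appear later in the syllabus, are systematic versions of this idea.

/* Illustration: two loop orders for the same y = A*x computation.
 * row_major() touches A contiguously (cache friendly); column_major()
 * strides through memory and suffers many more cache misses. */
#include <stdio.h>

#define N 1024

static double a[N][N], x[N], y[N];

void column_major(void) {                 /* strided accesses */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            y[i] += a[i][j] * x[j];
}

void row_major(void) {                    /* contiguous accesses */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] += a[i][j] * x[j];
}

int main(void) {
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    }
    row_major();                          /* time this against column_major() */
    printf("y[0] = %f\n", y[0]);          /* expect 1024.0 */
    return 0;
}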
MOTIVATIONS OF PARALLELISM
The Data Communication Argument
The Internet is often envisioned as one large heterogeneous
parallel/distributed computing environment.
Some of the most impressive applications:
SETI (Search for Extraterrestrial Intelligence) utilizes the power of a large number of
home computers to analyze electromagnetic signals from outer space.
Factoring large integers
Bitcoin mining (searching for a nonce that drives a cryptographic hash below a target value)
MICROPROCESSORS VS
SUPERCOMPUTERS: PARALLEL MACHINES
Parallelism is exploited on a variety of high performance computers, in particular
massively parallel computers (MPPs) and clusters.
MPPs, clusters, and high-performance vector computers are termed supercomputers
(vector processors have instructions that operate on one-dimensional arrays).
Examples: the Cray Y-MP and the NEC SX-3
Supercomputers had been augmented with several architectural advances by the 1970s, such as:
bit-parallel memory, bit-parallel arithmetic, cache memory, interleaved memory,
instruction lookahead, multiple functional units, and pipelining.
However, microprocessors had a long way to go!
Huge developments in their architectures, coupled with reduced instruction cycle times,
have led to a convergence in the relative performance of microprocessors and supercomputers.
TYPES OF PARALLELISM
SCHEDULING EXAMPLE
Time   Jobs running   Utilisation
1      S, M           75%
2      L               100%
3      S, S, M         100%
4      L               100%
5      L               100%
6      S, M            75%
7      M                50%
Average utilisation is 83.3%
Time to complete all jobs is 7 time units.
A BETTER SCHEDULE
A better schedule would allow jobs to be taken out of order to give
higher utilisation.
Job queue: S M L S S M L L S M M
Allow jobs to “float” to the front of the queue to maintain high
utilisation (a small sketch of this greedy idea follows below).
The actual situation is more complex, as jobs may run for differing lengths
of time.
A real job scheduler must balance high utilisation with fairness
(otherwise large jobs may never run).
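A small C sketch of this “float to the front” policy follows (an illustration only: it assumes a 4-processor machine, job widths of S = 1, M = 2 and L = 4 processors, and that every job runs for exactly one time unit; none of these figures appear on the slide). At each time step the scheduler scans the queue in order and starts every waiting job that still fits on the free processors, letting small jobs overtake a large job that cannot yet run.

/* Greedy "backfilling" sketch: assumed widths S=1, M=2, L=4 on 4 processors,
 * each job lasting one time unit. */
#include <stdio.h>

#define NPROC 4
#define NJOBS 11

int main(void) {
    /* the queue from the slide: S M L S S M L L S M M */
    int width[NJOBS] = {1, 2, 4, 1, 1, 2, 4, 4, 1, 2, 2};
    int done[NJOBS]  = {0};
    int remaining    = NJOBS;

    for (int t = 1; remaining > 0; t++) {
        int avail = NPROC, used = 0;
        for (int i = 0; i < NJOBS; i++) {          /* scan the queue in order */
            if (!done[i] && width[i] <= avail) {   /* job fits: let it "float" forward */
                avail -= width[i];
                used  += width[i];
                done[i] = 1;
                remaining--;
            }
        }
        printf("time %d: utilisation %d%%\n", t, 100 * used / NPROC);
    }
    return 0;
}

Under these assumptions the queue drains in 6 time units at full utilisation, compared with the 7 time units of the in-order schedule above.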
ROBOT EXAMPLE
Tasks, each requiring some combination of the robot's vision, manipulation and motion capabilities:
1. Looking for the electrical socket
2. Going to the electrical socket
3. Plugging into the electrical socket
DOMAIN DECOMPOSITION
A common form of program-level parallelism arises from
the division of the data to be processed into subsets.
This division is called domain decomposition.
Parallelism that arises through domain decomposition is
called data parallelism.
The data subsets are assigned to different
computational processes. This is called data distribution.
Processes may be assigned to hardware processors by
the program or by the runtime system. There may be
more than one process on each processor.
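A minimal OpenMP sketch of these ideas (an illustration only; the array size and the static schedule are arbitrary choices): the loop's iteration space is divided into contiguous chunks, each chunk is a subdomain assigned to one thread, and the runtime system maps the threads onto the available processors. Compile with an OpenMP flag such as -fopenmp on GCC.

/* Domain decomposition of a 1-D array: schedule(static) gives each thread
 * one contiguous block of iterations, i.e. one subdomain of the data. */
#include <stdio.h>
#include <omp.h>

#define N 1600

static double data[N];

int main(void) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        data[i] = 2.0 * i;                /* each thread updates only its own chunk */

    printf("last element = %f, threads available = %d\n",
           data[N - 1], omp_get_max_threads());
    return 0;
}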
DATA PARALLELISM
Consider an image digitised
as a square array of pixels,
which we want to process by
replacing each pixel value
by the average of its
neighbours.
DOMAIN DECOMPOSITION
Suppose we decompose the
problem into 16 subdomains
We then distribute the data
by assigning each
subdomain to a process.
This is a homogeneous
problem because each pixel
requires the same amount of
computation (almost - which
pixels are different?).
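A minimal OpenMP sketch of this homogeneous problem (an illustration only; it assumes a 4-neighbour average and simply leaves the boundary pixels untouched, and those boundary pixels are exactly the ones that differ from the rest):

/* Data-parallel pixel smoothing: every interior pixel gets the average of its
 * four neighbours.  Boundary pixels have fewer neighbours and are skipped. */
#include <stdio.h>

#define N 512

static double img[N][N], result[N][N];

void smooth(double src[N][N], double dst[N][N]) {
    #pragma omp parallel for collapse(2)      /* pixels are independent of each other */
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            dst[i][j] = 0.25 * (src[i-1][j] + src[i+1][j] +
                                src[i][j-1] + src[i][j+1]);
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            img[i][j] = 1.0;
    smooth(img, result);
    printf("result[1][1] = %f\n", result[1][1]);  /* 1.0 for a constant image */
    return 0;
}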
PARALLEL PROCESSING
Information processing that emphasizes the concurrent manipulation of
data elements belonging to one or more processes solving a single
problem.
A parallel computer is a multiple processor computer capable of
parallel processing.
The throughput of the device is the number of results it produces per
unit time.
Throughput can be improved by increasing the speed at which data are processed,
and also by increasing the number of operations that are performed at a time.
CONTROL PARALLELISM
Pipelining is actually a special case of a more general class of parallel
algorithms, called control parallelism.
Data Parallelism: Same operation is performed on a data set.
Control Parallelism: Different operations are performed on different data
elements concurrently.
Consider an example: the task of maintaining an estate’s landscape
as quickly as possible.
Mowing the lawn, edging the lawn, checking the sprinklers, weeding the flower beds.
Except for checking the sprinklers, the other jobs would be quicker if there were multiple workers.
Increasing the lawn mowing speed by creating a team and assigning each member a
portion of the lawn is an example of data parallelism.
We can also perform the other tasks concurrently. This is an example of control parallelism.
There is a precedence relationship, since all other tasks must be completed before the
sprinklers are tested.
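A minimal OpenMP sketch of the estate example (an illustration only; the task functions are placeholders that just print messages). The sections construct expresses the control parallelism between the different chores, the loop over lawn portions expresses the data parallelism within mowing, and the sprinklers are checked only after the parallel region completes, which respects the precedence relationship.

/* Control parallelism: different chores run concurrently as sections.
 * Data parallelism: the mowing chore is itself split over lawn portions.
 * Precedence: the sprinklers are checked only after everything else is done. */
#include <stdio.h>

#define LAWN_PORTIONS 8

void mow(int portion)        { printf("mowing portion %d\n", portion); }
void edge_lawn(void)         { printf("edging the lawn\n"); }
void weed_beds(void)         { printf("weeding the flower beds\n"); }
void check_sprinklers(void)  { printf("checking the sprinklers\n"); }

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            #pragma omp parallel for      /* a team of workers shares the lawn */
            for (int p = 0; p < LAWN_PORTIONS; p++)
                mow(p);
        }
        #pragma omp section
        edge_lawn();
        #pragma omp section
        weed_beds();
    }                                     /* implicit barrier: all chores finished */
    check_sprinklers();
    return 0;
}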
SCALABILITY
An algorithm is scalable if the level of parallelism increases at least
linearly with the problem size.
An architecture is scalable if it continues to yield the same
performance per processor, albeit on a larger problem size, as
the number of processors increases.
This allows a user to solve larger problems in the same amount of time
by using a parallel computer with more processors.
Data-parallel algorithms are more scalable than control-parallel
algorithms, because the degree of control parallelism is usually a constant,
independent of the problem size.
We shall study more such algorithms which are amenable to data parallelism!