Parallel and Distributed Algorithms
Course for Undergraduate Students in the 3rd year
(Major in Computer Science)
Instructor:
Mihai L. Mocanu, Ph.D., Professor
E-mail: [email protected]
Office: Room 303. Office hours: Wednesday, 14:00-16:00
Course page: http://software.ucv.ro/~mmocanu
(ask for the password and use the appropriate entry)
What this course is about
Classically, algorithm designers assume a computer with only one
processing element
Algorithms they design are “sequential”: all steps of the
algorithm must be performed in a particular sequence
But today’s computers have multiple processors, each of which can
perform its own chain of execution
Even basic desktop computers often have multicore
processors that include a handful of processing elements.
Also significant are vast clusters of computers, aggregated in
networks, grids, clouds etc.
Not to mention that real world processes that may need algorithmic
descriptions are far from sequential
Why study P&D Algorithms?
Suppose we’re working with a commodity multiprocessor system
(like almost all are today)
Inevitably, we’ll run into some complex problem that we want it to
solve in less time
Let T represent the amount of time that a single processor takes to
solve this problem; we would hope to develop an algorithm that
takes T/p time on a p-processor system.
Why will this rarely be fully achievable in practice?
Depending not only on the problem but also on the solution
we adopted, the processors will almost always spend some
time coordinating their work, which will lead to a penalty
Example 1
Sometimes a near-optimal speedup is easy to achieve
Suppose that we have an array of integers, and we want to display
all the negative integers in the array
We can divide the array into equal-sized segments, one
segment for each processor, and each processor can display
all the negative integers in its segment
The sequential algorithm would take O(n) time
In our multiprocessor algorithm each processor handles an
array segment with at most n/p elements
So, processing the whole array takes O(n/p) time
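A minimal sketch of this idea in C with OpenMP (the use of OpenMP and the function name print_negatives are illustrative assumptions, not part of the course material): schedule(static) hands each thread one contiguous chunk of roughly n/p iterations, matching the segment-per-processor division described above.

/* Each thread scans its own contiguous segment of the array and
 * prints the negative values it finds; the output order across
 * threads is not guaranteed. Compile with: gcc -fopenmp negatives.c */
#include <stdio.h>

void print_negatives(const int *a, int n)
{
    #pragma omp parallel for schedule(static)   /* one chunk of ~n/p items per thread */
    for (int i = 0; i < n; i++) {
        if (a[i] < 0)
            printf("%d\n", a[i]);
    }
}

int main(void)
{
    int a[] = { 3, -1, 7, -8, 0, -2, 5, -9 };
    print_negatives(a, 8);
    return 0;
}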
Example 2
There are some problems where it is hard to figure out how to use
multiple processors to speed up the processing time
Let’s consider the problem of arbitrary-precision arithmetic: we
are given 2 integer arrays representing very large numbers (where
a[0] contains the 1’s digit, a[1] contains the 10’s digit, and so on) and
we want to construct a new array containing the sum of the two numbers
We know a sequential algorithm to do this that takes O(n) time.
We can’t easily adapt this algorithm to use many processors,
though: In processing each segment of the arrays, we need to know
whether there is a carry from the preceding segments.
Thus, this problem, like many others, seems inherently sequential.
In fact there is an algorithm that can add two such big-integers in
O(n/p + log p) time
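For reference, a minimal sketch of the sequential O(n) addition (the digit layout follows the slide: a[0] holds the 1’s digit; the function name big_add is an illustrative assumption). The carry produced at position i must be known before digit i+1 can be finalized, which is exactly what a naive segment-per-processor split cannot provide; the O(n/p + log p) algorithm removes this bottleneck by computing the carries with a parallel prefix (scan) step, not shown here.

/* Sequential big-integer addition c = a + b, least-significant digit
 * first. The carry chains from position i to i+1, making the loop
 * appear inherently sequential. */
#include <stdio.h>

void big_add(const int *a, const int *b, int *c, int n)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int s = a[i] + b[i] + carry;
        c[i] = s % 10;
        carry = s / 10;          /* needed before digit i+1 can be set */
    }
    c[n] = carry;                /* possible extra most-significant digit */
}

int main(void)
{
    int a[4] = { 9, 9, 9, 1 };   /* 1999 */
    int b[4] = { 2, 0, 0, 3 };   /* 3002 */
    int c[5];
    big_add(a, b, c, 4);
    for (int i = 4; i >= 0; i--) printf("%d", c[i]);   /* prints 05001, i.e. 5001 */
    printf("\n");
    return 0;
}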
PDA Field of Study inside the CS Area
[Diagram: the place of Parallel & Distributed Algorithms within the Computer Science area, following the ACM & IEEE curriculum recommendations. Neighbouring sub-areas shown: Algorithms & Data Structures; Architecture; Operating Systems; Programming Languages; Numerical and Symbolic Computation; Artificial Intelligence and Robotics; Software Methodology & Software Engineering; Database and Information Retrieval; Human-Computer Interaction; Social and Professional Context.]
Course objectives
To provide you with an introduction and overview to the
computational aspects of parallel and distributed computing
To introduce several important parallel computing models
To study the typical models for distributed computing.
To make basic concepts of P & D computing both accessible and
ready for immediate use
Broadly understand P&D architectures
Become familiar with typical software/ programming approaches
Be able to quickly adapt to any programming environment
Study the main classes of P&D algorithms
Learn the jargon … understanding what people are talking about
Be able to apply this knowledge
Course objectives (cont.)
More specifically, with the aim of drastically flattening the learning
curve in a parallel and/or distributed environment:
Study the development methods of efficient P&D algorithms using:
Data partitioning and functional partitioning
Parallelism exploitation for different process topologies
Efficient overcoming of communication latencies,
Load balancing
Termination detection
Failure tolerance etc.
Learn the conceptual models for the specification and verification
of P&D algorithms
Understand the complexity models of P&D algorithms
Learning Outcomes
On successfully completing this course you should understand a
number of different models of P&D computing and the basic
techniques for designing algorithms in these models.
Development techniques for parallel algorithms - examples from
the main classes: prefix algorithms, search, sort, selection, matrix
processing, graph algorithms, lists, trees
Main development techniques of classes of distributed algorithms:
wave algorithms, mutual exclusion, topology establishment,
termination, fault tolerance, leader election, genetic algorithms
Textbooks and other references
Textbooks:
1. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis - Introduction to
Parallel Computing, Benjamin/Cummings, 2003 (2nd Edition, ISBN 0-201-64865-2)
or Benjamin/Cummings, 1994 (1st Edition, ISBN 0-8053-3170-0)
2. Behrooz Parhami - Introduction to Parallel Processing: Algorithms and
Architectures, Kluwer Academic Publ, 2002
3. Dan Grigoras – Parallel Computing. From Systems to Applications,
Computer Libris Agora, 2000, ISBN 973-97534-6-9
4. Mihai Mocanu – Algorithms and Languages for Parallel Processing, Publ.
University of Craiova, updated yearly for 20+ years on my home page
5. George Coulouris, Jean Dollimore, Tim Kindberg and Gordon Blair,
Distributed Systems Concepts and Design (5th ed.), Addison Wesley, 2011
6. M. Ben-Ari, Principles of Concurrent and Distributed Programming
(2nd ed.), Addison-Wesley, 2006
Working Resources
Laboratory and Projects:
1. Mihai Mocanu, Alexandru Patriciu – Parallel Computing in C for Unix and
Windows NT Networks, Publ. University of Craiova, 1998
2. Christopher H. Nevison et al. - Laboratories for Parallel Computing, Jones and
Bartlett, 1994
3. SPIN multi-threaded software verification tool, http://spinroot.com/spin/
whatispin.html
Why?
It is better to be aware of the physical and economic constraints and tradeoffs
of the parallel system you are designing for, so as not to be sorry later.
Topics (continued)
Why?
You can start writing simple parallel programs early on.
Topics (cont.)
Why?
These are fundamental issues that appear/apply to every parallel program.
You really should learn this stuff by heart.
Topics (cont.)
Why?
These are fundamental primitives you will use often and should know well:
not only what they do, but also how much they cost and when and
how to use them.
Going through the details of their implementation lets us see how the principles
from the previous topic are applied to relatively simple problems.
Topics (cont.)
Why?
Parallel programming is done to increase performance.
Debugging and profiling are extremely difficult in a parallel setting, so it is better
to understand from the beginning what performance to expect from a given
parallel program and, more generally, how to design parallel programs with
low execution time. It is also important to know the limits of what can and
cannot be done.
Topics (cont.)
6. Parallel Dense Matrix Algorithms
matrix-vector multiplication (a short sketch follows below)
matrix-matrix multiplication
solving systems of linear equations
7. Parallel Graph Algorithms
minimum spanning tree
single-source shortest paths
all-pairs shortest paths
connected components
algorithms for sparse graphs
Why?
Classical problems with lots of applications, many interesting and useful
techniques exposed.
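As a preview of topic 6, a minimal sketch of matrix-vector multiplication with a 1-D row-wise partitioning, in C with OpenMP (the use of OpenMP and the fixed 4x4 example are illustrative assumptions): the rows are independent, so each thread computes the dot products for its own block of rows.

/* y = A*x with row-wise partitioning: each thread handles a block of
 * rows. Compile with: gcc -fopenmp matvec.c */
#include <stdio.h>

#define N 4

void matvec(double A[N][N], const double x[N], double y[N])
{
    #pragma omp parallel for schedule(static)   /* rows are independent */
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }
}

int main(void)
{
    double A[N][N] = { {1,0,0,0}, {0,2,0,0}, {0,0,3,0}, {0,0,0,4} };
    double x[N] = { 1, 1, 1, 1 }, y[N];
    matvec(A, x, y);
    for (int i = 0; i < N; i++) printf("%g ", y[i]);   /* 1 2 3 4 */
    printf("\n");
    return 0;
}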
Topics (cont.)
8. Parallel Sorting
odd-even transposition sort (a short sketch follows below)
sorting networks, bitonic sort
parallel quicksort, bucket and sample sort
9. Search Algorithms for Discrete Optimization Problems
search overhead factor, speedup anomalies
parallel depth-first search
sorting and searching on PRAMs
10. Geometric Algorithms
convex hulls, closest pair of points
Why?
As before, plus shows many examples of hard-to-parallelize problems.
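As a preview of topic 8, a minimal sketch of odd-even transposition sort in C with OpenMP (the OpenMP pragma and the small test array are illustrative assumptions): within each phase all compare-exchange pairs are disjoint, so they can be executed in parallel; n phases suffice to sort n elements.

/* Odd-even transposition sort: phase k compares pairs starting at an
 * even (or odd) index; pairs within a phase are independent.
 * Compile with: gcc -fopenmp oets.c */
#include <stdio.h>

static void compare_swap(int *a, int i)
{
    if (a[i] > a[i + 1]) { int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t; }
}

void odd_even_sort(int *a, int n)
{
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;                  /* even phase: 0, odd phase: 1 */
        #pragma omp parallel for schedule(static)
        for (int i = start; i < n - 1; i += 2)  /* disjoint pairs, safe in parallel */
            compare_swap(a, i);
    }
}

int main(void)
{
    int a[] = { 5, 2, 9, 1, 7, 3 };
    odd_even_sort(a, 6);
    for (int i = 0; i < 6; i++) printf("%d ", a[i]);   /* 1 2 3 5 7 9 */
    printf("\n");
    return 0;
}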
Topics (cont.)
11. Concepts of distributed computation
termination detection
failure tolerance
distributed network topologies
12. Formal Models
state transitions and executions
asynchronous and synchronous models
causality constraints and the computation theorem
logical clocks (Lamport) and vector clocks (Mattern-Fidge); a Lamport-clock sketch follows below
Why?
It is important to understand both their social importance (the Internet, the
WWW, small devices such as mobiles and sensors) and their technical
importance: inherent distribution is exploited to improve scalability and reliability.
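Topic 12 mentions Lamport’s logical clocks; a minimal single-process sketch of the update rules in C follows (the struct and function names are illustrative assumptions): increment the local clock on every local event and on every send, and on receive take the maximum of the local clock and the message timestamp, plus one.

/* Lamport logical clock rules, demonstrated with two "nodes" in one
 * process; a message is just a struct carrying its timestamp. */
#include <stdio.h>

typedef struct { int clock; } node_t;
typedef struct { int ts; }    msg_t;

static void local_event(node_t *n) { n->clock += 1; }

static msg_t send_event(node_t *n)                /* timestamp the outgoing message */
{
    n->clock += 1;
    msg_t m = { n->clock };
    return m;
}

static void recv_event(node_t *n, msg_t m)        /* max(local, ts) + 1 */
{
    n->clock = (m.ts > n->clock ? m.ts : n->clock) + 1;
}

int main(void)
{
    node_t p = { 0 }, q = { 0 };
    local_event(&q);                /* q: 1 */
    msg_t m = send_event(&p);       /* p: 1, message timestamp 1 */
    local_event(&q);                /* q: 2 */
    recv_event(&q, m);              /* q: max(2, 1) + 1 = 3 */
    printf("p = %d, q = %d\n", p.clock, q.clock);
    return 0;
}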
Topics (cont.)
13. Distributed abstractions
event-based component model
specification of services
safety and liveness
failure models for nodes and channels (links)
timing assumptions
14. Failure detectors
classes of detectors
leader elections and reductions
quorums; Byzantine quorums and leader election
Why?
Making use of abstractions, like timing assumptions and ways to encapsulate
them (failure detectors), proves to be very useful.
Topics (cont.)
15. Broadcast abstractions
(lazy/ eager/ uniform/ fail-silent) reliable broadcast
causal broadcast, vector-clock algorithms
performance of algorithms
16. Consistency
replicated shared memory
linearizable registers
linearizable single-writer, multiple-reader registers
multiple-writer registers
17. Consensus
control-oriented vs. event-based, uniform consensus
Paxos and multi-Paxos consensus
Topics (cont.)
18. Reconfigurable Replicated State Machines
total order broadcasting
robust distributed networks
19. Random Walks
introduction to Markov processes
random walks (hitting time, cover time, (s,t)-connectivity)
20. Time and Clocks in Distributed Systems
time lease
clock drift and the complete algorithm
…
Grading (tentative)
20% continuous practical laboratory tests/ assignments (L)
20% practical evaluation through homework projects (P)
20% final test quiz (T)
40% final exam (E)
You have to get at least 50% on each evaluation form (T, L and P) in
order to be allowed to take the final exam in the next session.
You have to get at least 50% on the final exam (E) to pass and to
obtain a mark greater than 5. All the grades obtained enter, with the
specified weights, into the computation of the final mark.
If you have problems with setting up your working environment and/or
running your programs, ask the TA for help/advice. That is what the TA is
there for; but do not expect the TA to fix ordinary programming bugs for you.
Assignments and evaluations
A few core problems in Distributed Computing
Background
Parallel Computer: a computer with many processors that
are closely connected.
frequently, all processors share the same memory
they also communicate by accessing this shared memory
Examples of parallel computers include the multicore
processors found in many computers today (even cheap
ones), as well as many graphics processing units (GPUs)
Parallel Computing: “using more than one computer, or a
computer with more than one processor, to solve a task ”
The parallel programming paradigm is not new: it has been
around for more than 50 years! Motives: faster computation,
larger amount of memory available, etc.
“... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The author
cannot believe that there will be any insuperable difficulty in
extending it to computers. It is not to be expected that the
necessary programming techniques will be worked out
overnight. Much experimenting remains to be done. After all,
the techniques that are commonly used in programming today
were only won at the cost of considerable toil several years
ago. In fact the advent of parallel programming may do
something to revive the pioneering spirit in programming
which seems at the present to be degenerating into a rather
dull and routine occupation ...”
Gill, S. (1958), “Parallel Programming,” The Computer Journal, vol. 1, April 1958, pp. 2-10.
Background
Distributed Computer: one in which the processors are less
strongly connected –“a set of nodes connected by a network,
which appear to its users as a single coherent system”
A typical distributed system consists of many independent
computers, attached via network connections
Such an arrangement is called a cluster
In a distributed system, each processor has its own memory.
This precludes using shared memory for communicating
Processors instead communicate by sending messages
Though message passing is slower than shared memory, it
scales better for many processors, and it is cheaper
PARALLEL vs. Distributed Computing
Maximum Speedup
Is usually p with p processors (linear speedup)
Speedup factor is given by:
S(p) = ts / (f·ts + (1 - f)·ts/p) = p / (1 + (p - 1)·f)
where ts is the single-processor execution time and f is the fraction of the
computation that is inherently serial
This equation is known as Amdahl’s law
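For a concrete (illustrative) figure: with a serial fraction f = 0.1 and p = 8 processors, S(8) = 8 / (1 + 7·0.1) ≈ 4.7, and no matter how many processors are added the speedup can never exceed 1/f = 10.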
Remark: It is possible, though unusual, to get superlinear speedup
(greater than p), due to specific reasons such as:
Extra memory in multiprocessor system
Nondeterministic algorithm
Maximum Speedup
Amdahl’s law
[Figure: (a) one processor takes time ts, split into a serial part f·ts and a parallelizable part (1 - f)·ts; (b) with p processors the parallelizable part takes (1 - f)·ts/p, so tp = f·ts + (1 - f)·ts/p.]
Speedup against number of processors
[Figure: speedup S(p) against the number of processors, for several values of the serial fraction f.]
[Figure: searching a partitioned solution space. (a) Searching each sub-space sequentially: the solution is found x·ts/p into the search (x indeterminate), plus a further Δt. (b) Searching each sub-space in parallel: the solution is found after Δt.]
Speedup is given by:
S(p) = (x·ts/p + Δt) / Δt
Worst case for sequential search: the solution is found in the last sub-space.
Then the parallel version offers the greatest benefit, i.e.
S(p) = ((p - 1)·ts/p + Δt) / Δt, which grows without bound as Δt tends to zero
Measures of Performance
OR
a moderate to very large number of simpler processors
So: when combined with a high-bandwidth, but logically simple, …
Execution Speed
With a more than 10² times increase in (floating-point) execution speed across successive computing platforms
1985 – 1990: in spite of an average 20x increase in processor performance, the communication speed remained constant
Parallel Computing – how execution time decreased
[Figure: time (s) per floating-point instruction, by year]
Motto: “I think there is a world market for maybe five
computers” (Thomas Watson, IBM Chairman, 1943)
Towards Parallel Computing – The 5 ERAs
More on the historical perspective
Parallel computing has been around since the early days of computing.
Traditionally: custom HW, custom SW, high prices
The “doom” of Moore’s law:
- custom HW has a hard time catching up with commodity processors
Current trend: use commodity HW components, standardize SW
Parallelism sneaking into commodity computers:
• Instruction Level Parallelism - wide issue, pipelining, out-of-order execution (OOO)
• Data Level Parallelism – 3DNow, Altivec
• Thread Level Parallelism – Hyper-threading in Pentium IV
Transistor budgets allow for multiple processor cores on a chip.
More on historical perspective (cont.)
Most applications would benefit from being parallelized and
executed on a parallel computer.
• even PC applications, especially the most demanding ones – games,
multimedia
Chicken & Egg Problem:
1. Why build parallel computers when the applications are sequential?
2. Why parallelize applications when there are no parallel commodity
computers?
Answers:
1. What else to do with all those transistors?
2. Applications already are a bit parallel (wide issue, multimedia
instructions, hyper-threading), and this bit is growing.
Core Problems in Distributed Computing
• What types of problems are there?
• An example: Generals’ Problem
Two generals need to coordinate an attack
1. Must agree on time to attack
2. They’ll win only if they attack simultaneously
3. Communicate through messengers
4. Messengers may be killed on their way
Let’s try to solve it for generals g1 and g2
step 1: g1 sends time of attack to g2
Problem: how to ensure g2 received msg?
step 2 (Solution): let g2 ack receipt of msg
Problem: how to ensure g1 received ack
step 3 (Solution): let g1 ack the receipt of the ack…
…
This problem seems (is!) really impossible to solve!
Consensus
Two nodes need to agree on a value
Communicate by messages using an unreliable channel
Agreement is a core problem… agreeing on a number = consensus
The Consensus problem
All nodes propose a value
Some nodes might crash & stop responding
The algorithm must ensure:
All correct nodes eventually decide
Every node decides the same
Only decide on proposed values
Broadcast
Atomic Broadcast
A node broadcasts a message
If sender correct, all correct nodes deliver msg
All correct nodes deliver same messages
Messages delivered in the same order
Given Atomic broadcast
Can use it to solve Consensus (see the sketch at the end of this section)
Every node broadcasts its proposal
Decide on the first received proposal
Messages received same order
All nodes will decide the same
Given Consensus
Can use it to solve Atomic broadcast
Atomic Broadcast equivalent to Consensus
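A minimal sketch of the reduction stated above, in C (the single-process simulation of atomic broadcast as one shared, totally ordered log is an assumption made only to keep the example self-contained, not part of the course material): every node broadcasts its proposal, and since all correct nodes deliver the same messages in the same order, each can simply decide on the first proposal delivered.

/* Consensus from Atomic Broadcast: broadcast your proposal, decide on
 * the first delivered proposal. Atomic broadcast is simulated by one
 * shared, ordered log (single-process demo only). */
#include <stdio.h>

#define NODES 3

static int log_vals[NODES];                    /* the totally ordered delivery log */
static int log_len = 0;

static void atomic_broadcast(int value)        /* simulated: append to the shared log */
{
    log_vals[log_len++] = value;
}

int main(void)
{
    int proposal[NODES] = { 7, 4, 9 };
    int decision[NODES];

    for (int n = 0; n < NODES; n++)            /* every node broadcasts its proposal */
        atomic_broadcast(proposal[n]);

    for (int n = 0; n < NODES; n++)            /* same delivery order at every node... */
        decision[n] = log_vals[0];             /* ...so all decide on the first one */

    for (int n = 0; n < NODES; n++)
        printf("node %d decides %d\n", n, decision[n]);
    return 0;
}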