Parallel and Distributed Algorithms
Course for Undergraduate Students in the 3rd year
(Major in Computer Science)
Instructor:
Mihai L. Mocanu, Ph.D., Professor
E-mail: [email protected]
Office: Room 303. Office hours: Wednesday, 14:00-16:00
Course page: http://software.ucv.ro/~mmocanu
(ask for the password and use the appropriate entry)
What this course is about
Classically, algorithm designers assume a computer with only one
processing element
Algorithms they design are “sequential”: all steps of the
algorithm must be performed in a particular sequence
But today’s computers have multiple processors, each of which can
perform its own chain of execution
Even basic desktop computers often have multicore
processors that include a handful of processing elements.
Also significant are vast clusters of computers, aggregated in
networks, grids, clouds etc.
Not to mention that real world processes that may need algorithmic
descriptions are far from sequential
Why study P&D Algorithms?
Suppose we’re working with a commodity multiprocessor system
(like almost all are today)
Inevitably, we’ll run into some complex problem that we want it to
solve in less time
Let T represent the amount of time that a single processor takes to
solve this problem; we would hope to develop an algorithm that
takes T/p time on a p-processor system.
Why will this rarely be fully achievable in practice?
Depending not only on the problem but also on the solution
we adopted, the processors will almost always spend some
time coordinating their work, which will lead to a penalty
Example 1
Sometimes a near-optimal speedup is easy to achieve
Suppose that we have an array of integers, and we want to display
all the negative integers in the array
We can divide the array into equal-sized segments, one
segment for each processor, and each processor can display
all the negative integers in its segment
The sequential algorithm would take O(n) time
In our multiprocessor algorithm each processor handles an
array segment with at most n/p elements
So, processing the whole array takes O(n/p) time
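A minimal sketch of this idea in C with OpenMP (the use of OpenMP and the function name print_negatives are illustrative assumptions, not part of the course material): schedule(static) hands each thread one contiguous chunk of roughly n/p iterations, matching the segment-per-processor division described above.

/* Each thread scans its own contiguous segment of the array and
 * prints the negative values it finds; the output order across
 * threads is not guaranteed. Compile with: gcc -fopenmp negatives.c */
#include <stdio.h>

void print_negatives(const int *a, int n)
{
    #pragma omp parallel for schedule(static)   /* one chunk of ~n/p items per thread */
    for (int i = 0; i < n; i++) {
        if (a[i] < 0)
            printf("%d\n", a[i]);
    }
}

int main(void)
{
    int a[] = { 3, -1, 7, -8, 0, -2, 5, -9 };
    print_negatives(a, 8);
    return 0;
}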
Example 2
There are some problems where it is hard to figure out how to use
multiple processors to speed up the processing time
Let’s consider the problem of arbitrary-precision arithmetic: we
are given 2 integer arrays representing very large numbers (where
a[0] contains the 1’s digit, a[1] contains the 10’s digit, and so on) and
we want to construct a new array containing the sum of the two numbers
We know a sequential algorithm to do this that takes O(n) time.
We can’t easily adapt this algorithm to use many processors,
though: In processing each segment of the arrays, we need to know
whether there is a carry from the preceding segments.
Thus, this problem, like many others, seems inherently sequential.
In fact there is an algorithm that can add two such big-integers in
O(n/p + log p) time
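For reference, a minimal sketch of the sequential O(n) addition (the digit layout follows the slide: a[0] holds the 1’s digit; the function name big_add is an illustrative assumption). The carry produced at position i must be known before digit i+1 can be finalized, which is exactly what a naive segment-per-processor split cannot provide; the O(n/p + log p) algorithm removes this bottleneck by computing the carries with a parallel prefix (scan) step, not shown here.

/* Sequential big-integer addition c = a + b, least-significant digit
 * first. The carry chains from position i to i+1, making the loop
 * appear inherently sequential. */
#include <stdio.h>

void big_add(const int *a, const int *b, int *c, int n)
{
    int carry = 0;
    for (int i = 0; i < n; i++) {
        int s = a[i] + b[i] + carry;
        c[i] = s % 10;
        carry = s / 10;          /* needed before digit i+1 can be set */
    }
    c[n] = carry;                /* possible extra most-significant digit */
}

int main(void)
{
    int a[4] = { 9, 9, 9, 1 };   /* 1999 */
    int b[4] = { 2, 0, 0, 3 };   /* 3002 */
    int c[5];
    big_add(a, b, c, 4);
    for (int i = 4; i >= 0; i--) printf("%d", c[i]);   /* prints 05001, i.e. 5001 */
    printf("\n");
    return 0;
}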
PDA Field of Study inside the CS Area
[Diagram: the place of Parallel & Distributed Algorithms within the Computer Science area, following the ACM & IEEE curriculum recommendations. Neighbouring sub-areas shown: Algorithms & Data Structures; Architecture; Operating Systems; Programming Languages; Numerical and Symbolic Computation; Artificial Intelligence and Robotics; Software Methodology & Software Engineering; Database and Information Retrieval; Human-Computer Interaction; Social and Professional Context.]
Course objectives
To provide you with an introduction and overview to the
computational aspects of parallel and distributed computing
To introduce several important parallel computing models
To study the typical models for distributed computing.
To make basic concepts of P & D computing both accessible and
ready for immediate use
Broadly understand P&D architectures
Become familiar with typical software/ programming approaches
Be able to quickly adapt to any programming environment
Study the main classes of P&D algorithms
Learn the jargon … understanding what people are talking about
Be able to apply this knowledge
Course objectives (cont.)
More specifically, with the aim of drastically flattening the learning
curve in a parallel and/or distributed environment:
Study the development methods of efficient P&D algorithms using:
Data partitioning and functional partitioning
Parallelism exploitation for different process topologies
Efficient overcoming of communication latencies,
Load balancing
Termination detection
Failure tolerance etc.
Learn the conceptual models for the specification and verification
of P&D algorithms
Understand the complexity models of P&D algorithms
Learning Outcomes
On successfully completing this course you should understand a
number of different models of P&D computing and the basic
techniques for designing algorithms in these models.
Development techniques for parallel algorithms - examples from
the main classes: prefix algorithms, search, sort, selection, matrix
processing, graph algorithms, lists, trees
Main development techniques of classes of distributed algorithms:
wave algorithms, mutual exclusion, topology establishment,
termination, fault tolerance, leader election, genetic algorithms
Textbooks and other references
Textbooks:
1. Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis - Introduction to
Parallel Computing, Benjamin/Cummings, 2003 (2nd Edition, ISBN 0-201-64865-2)
or Benjamin/Cummings, 1994 (1st Edition, ISBN 0-8053-3170-0)
2. Behrooz Parhami - Introduction to Parallel Processing: Algorithms and
Architectures, Kluwer Academic Publ, 2002
3. Dan Grigoras – Parallel Computing. From Systems to Applications,
Computer Libris Agora, 2000, ISBN 973-97534-6-9
4. Mihai Mocanu – Algorithms and Languages for Parallel Processing, Publ.
University of Craiova, updated yearly for 20+ years on my home page
5. George Coulouris, Jean Dollimore, Tim Kindberg and Gordon Blair,
Distributed Systems Concepts and Design (5th ed.), Addison Wesley, 2011
6. M. Ben-Ari, Principles of Concurrent and Distributed Programming
(2nd ed.), Addison-Wesley, 2006
Working Resources
Laboratory and Projects:
1. Mihai Mocanu, Alexandru Patriciu – Parallel Computing in C for Unix and
Windows NT Networks, Publ. University of Craiova, 1998
2. Christopher H. Nevison et al. - Laboratories for Parallel Computing, Jones and
Bartlett, 1994
3. SPIN multi-threaded software verification tool, http://spinroot.com/spin/
whatispin.html
Why?
It is better to be aware of the physical and economic constraints and tradeoffs
of the parallel system you are designing for, so as not to be sorry later.
Topics (continued)
Why?
You can start writing simple parallel programs early on.
Topics (cont.)
Why?
These are fundamental issues that appear/apply to every parallel program.
You really should learn this stuff by heart.
Topics (cont.)
Why?
These are fundamental primitives you will use often and should know well:
not only what they do, but also how much they cost and when and
how to use them.
Going through the details of their implementation lets us see how the principles
from the previous topic are applied to relatively simple problems.
Topics (cont.)
Why?
Parallel programming is done to increase performance.
Debugging and profiling are extremely difficult in a parallel setting, so it is better
to understand from the beginning what performance to expect from a given
parallel program and, more generally, how to design parallel programs with
low execution time. It is also important to know the limits of what can and
cannot be done.
Topics (cont.)
6. Parallel Dense Matrix Algorithms
matrix-vector multiplication (a short sketch follows below)
matrix-matrix multiplication
solving systems of linear equations
7. Parallel Graph Algorithms
minimum spanning tree
single-source shortest paths
all-pairs shortest paths
connected components
algorithms for sparse graphs
Why?
Classical problems with lots of applications, many interesting and useful
techniques exposed.
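As a preview of topic 6, a minimal sketch of matrix-vector multiplication with a 1-D row-wise partitioning, in C with OpenMP (the use of OpenMP and the fixed 4x4 example are illustrative assumptions): the rows are independent, so each thread computes the dot products for its own block of rows.

/* y = A*x with row-wise partitioning: each thread handles a block of
 * rows. Compile with: gcc -fopenmp matvec.c */
#include <stdio.h>

#define N 4

void matvec(double A[N][N], const double x[N], double y[N])
{
    #pragma omp parallel for schedule(static)   /* rows are independent */
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            s += A[i][j] * x[j];
        y[i] = s;
    }
}

int main(void)
{
    double A[N][N] = { {1,0,0,0}, {0,2,0,0}, {0,0,3,0}, {0,0,0,4} };
    double x[N] = { 1, 1, 1, 1 }, y[N];
    matvec(A, x, y);
    for (int i = 0; i < N; i++) printf("%g ", y[i]);   /* 1 2 3 4 */
    printf("\n");
    return 0;
}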
Topics (cont.)
8. Parallel Sorting
odd-even transposition sort (a short sketch follows below)
sorting networks, bitonic sort
parallel quicksort, bucket and sample sort
9. Search Algorithms for Discrete Optimization Problems
search overhead factor, speedup anomalies
parallel depth-first search
sorting and searching on PRAMs
10. Geometric Algorithms
convex hulls, closest pair of points
Why?
As before, plus shows many examples of hard-to-parallelize problems.
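As a preview of topic 8, a minimal sketch of odd-even transposition sort in C with OpenMP (the OpenMP pragma and the small test array are illustrative assumptions): within each phase all compare-exchange pairs are disjoint, so they can be executed in parallel; n phases suffice to sort n elements.

/* Odd-even transposition sort: phase k compares pairs starting at an
 * even (or odd) index; pairs within a phase are independent.
 * Compile with: gcc -fopenmp oets.c */
#include <stdio.h>

static void compare_swap(int *a, int i)
{
    if (a[i] > a[i + 1]) { int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t; }
}

void odd_even_sort(int *a, int n)
{
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;                  /* even phase: 0, odd phase: 1 */
        #pragma omp parallel for schedule(static)
        for (int i = start; i < n - 1; i += 2)  /* disjoint pairs, safe in parallel */
            compare_swap(a, i);
    }
}

int main(void)
{
    int a[] = { 5, 2, 9, 1, 7, 3 };
    odd_even_sort(a, 6);
    for (int i = 0; i < 6; i++) printf("%d ", a[i]);   /* 1 2 3 5 7 9 */
    printf("\n");
    return 0;
}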
Topics (cont.)
11. Concepts of distributed computation
termination detection
failure tolerance
distributed network topologies
12. Formal Models
state transitions and executions
asynchronous and synchronous models
causality constraints and the computation theorem
logical clocks (Lamport) and vector clocks (Mattern-Fidge); a Lamport-clock sketch follows below
Why?
It is important to understand both their social importance (the Internet, the
WWW, small devices such as mobiles and sensors) and their technical
importance: inherent distribution is exploited to improve scalability and reliability.
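Topic 12 mentions Lamport’s logical clocks; a minimal single-process sketch of the update rules in C follows (the struct and function names are illustrative assumptions): increment the local clock on every local event and on every send, and on receive take the maximum of the local clock and the message timestamp, plus one.

/* Lamport logical clock rules, demonstrated with two "nodes" in one
 * process; a message is just a struct carrying its timestamp. */
#include <stdio.h>

typedef struct { int clock; } node_t;
typedef struct { int ts; }    msg_t;

static void local_event(node_t *n) { n->clock += 1; }

static msg_t send_event(node_t *n)                /* timestamp the outgoing message */
{
    n->clock += 1;
    msg_t m = { n->clock };
    return m;
}

static void recv_event(node_t *n, msg_t m)        /* max(local, ts) + 1 */
{
    n->clock = (m.ts > n->clock ? m.ts : n->clock) + 1;
}

int main(void)
{
    node_t p = { 0 }, q = { 0 };
    local_event(&q);                /* q: 1 */
    msg_t m = send_event(&p);       /* p: 1, message timestamp 1 */
    local_event(&q);                /* q: 2 */
    recv_event(&q, m);              /* q: max(2, 1) + 1 = 3 */
    printf("p = %d, q = %d\n", p.clock, q.clock);
    return 0;
}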
Topics (cont.)
13. Distributed abstractions
event-based component model
specification of services
safety and liveness
failure models for nodes and channels (links)
timing assumptions
14. Failure detectors
classes of detectors
leader elections and reductions
quorums; Byzantine quorums and leader election
Why?
Making use of abstractions, like timing assumptions and ways to encapsulate
them (failure detectors), proves to be very useful.
Topics (cont.)
15. Broadcast abstractions
(lazy/ eager/ uniform/ fail-silent) reliable broadcast
causal broadcast, vector-clock algorithms
performance of algorithms
16. Consistency
replicated shared memory
linearizable registers
linearizable single-writer, multiple-reader registers
multiple-writer registers
17. Consensus
control-oriented vs. event-based, uniform consensus
Paxos and multi-Paxos consensus
Topics (cont.)
18. Reconfigurable Replicated State Machines
total order broadcasting
robust distributed networks
19. Random Walks
introduction to Markov processes
random walks (hitting time, cover time, (s,t)-connectivity)
20. Time and Clocks in Distributed Systems
time lease
clock drift and the complete algorithm
…
Grading (tentative)
20% continuous practical laboratory tests/ assignments (L)
20% practical evaluation through homework projects (P)
20% final test quiz (T)
40% final exam (E)
You have to get at least 50% on each evaluation form (T, L and P) in
order to be allowed to take the final exam in the next session.
You have to get at least 50% on the final exam (E) to pass and to
obtain a mark greater than 5. All the grades obtained enter, with the
specified weights, into the computation of the final mark.
If you have problems with setting up your working environment and/or
running your programs, ask the TA for help/advice. That is what the TA is
there for; but do not expect the TA to fix ordinary programming bugs for you.
Assignments and evaluations
A few core problems in Distributed Computing
Background
Parallel Computer: a computer with many processors that
are closely connected.
frequently, all processors share the same memory
they also communicate by accessing this shared memory
Examples of parallel computers include the multicore
processors found in many computers today (even cheap
ones), as well as many graphics processing units (GPUs)
Parallel Computing: “using more than one computer, or a
computer with more than one processor, to solve a task ”
The parallel programming paradigm is not new: it has been
around for more than 50 years! Motives: faster computation,
larger amount of memory available, etc.
“... There is therefore nothing new in the idea of parallel
programming, but its application to computers. The author
cannot believe that there will be any insuperable difficulty in
extending it to computers. It is not to be expected that the
necessary programming techniques will be worked out
overnight. Much experimenting remains to be done. After all,
the techniques that are commonly used in programming today
were only won at the cost of considerable toil several years
ago. In fact the advent of parallel programming may do
something to revive the pioneering spirit in programming
which seems at the present to be degenerating into a rather
dull and routine occupation ...”
Gill, S. (1958), “Parallel Programming,” The Computer Journal, vol. 1, April 1958, pp. 2-10.
Background
Distributed Computer: one in which the processors are less
strongly connected –“a set of nodes connected by a network,
which appear to its users as a single coherent system”
A typical distributed system consists of many independent
computers, attached via network connections
Such an arrangement is called a cluster
In a distributed system, each processor has its own memory.
This precludes using shared memory for communicating
Processors instead communicate by sending messages
Though message passing is slower than shared memory, it
scales better for many processors, and it is cheaper
PARALLEL vs. Distributed Computing
Maximum Speedup
Is usually p with p processors (linear speedup)
Speedup factor is given by:
S(p) = ts / (f·ts + (1 - f)·ts/p) = p / (1 + (p - 1)·f)
where ts is the single-processor execution time and f is the fraction of the
computation that is inherently serial
This equation is known as Amdahl’s law
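For a concrete (illustrative) figure: with a serial fraction f = 0.1 and p = 8 processors, S(8) = 8 / (1 + 7·0.1) ≈ 4.7, and no matter how many processors are added the speedup can never exceed 1/f = 10.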
Remark: It is possible, though unusual, to get superlinear speedup
(greater than p), due to specific reasons such as:
Extra memory in multiprocessor system
Nondeterministic algorithm
Maximum Speedup
Amdahl’s law
[Figure: (a) one processor takes time ts, split into a serial part f·ts and a parallelizable part (1 - f)·ts; (b) with p processors the parallelizable part takes (1 - f)·ts/p, so tp = f·ts + (1 - f)·ts/p.]
Speedup against number of processors
[Figure: speedup S(p) against the number of processors, for several values of the serial fraction f.]
[Figure: searching a partitioned solution space. (a) Searching each sub-space sequentially: the solution is found x·ts/p into the search (x indeterminate), plus a further Δt. (b) Searching each sub-space in parallel: the solution is found after Δt.]
Speedup is given by:
S(p) = (x·ts/p + Δt) / Δt
Worst case for sequential search: the solution is found in the last sub-space.
Then the parallel version offers the greatest benefit, i.e.
S(p) = ((p - 1)·ts/p + Δt) / Δt, which grows without bound as Δt tends to zero
Measures of Performance
OR
a moderate to very large number of simpler processors
So: when combined with a high-bandwidth, but logically simple, …
Execution Speed
With a more than 10² times increase in (floating-point) execution speed across successive computing platforms
1985 – 1990: in spite of an average 20x increase in processor performance, the communication speed remained constant
Parallel Computing – how execution time decreased
[Figure: time (s) per floating-point instruction, by year]
Motto: “I think there is a world market for maybe five
computers” (Thomas Watson, IBM Chairman, 1943)
Towards Parallel Computing – The 5 ERAs
More on the historical perspective
Parallel computing has been around since the early days of computing.
Traditionally: custom HW, custom SW, high prices
The “doom” of Moore’s law:
- custom HW has a hard time catching up with commodity processors
Current trend: use commodity HW components, standardize SW
Parallelism sneaking into commodity computers:
• Instruction Level Parallelism - wide issue, pipelining, out-of-order execution (OOO)
• Data Level Parallelism – 3DNow, Altivec
• Thread Level Parallelism – Hyper-threading in Pentium IV
Transistor budgets allow for multiple processor cores on a chip.
More on historical perspective (cont.)
Most applications would benefit from being parallelized and
executed on a parallel computer.
• even PC applications, especially the most demanding ones – games,
multimedia
Chicken & Egg Problem:
1. Why build parallel computers when the applications are sequential?
2. Why parallelize applications when there are no parallel commodity
computers?
Answers:
1. What else to do with all those transistors?
2. Applications already are a bit parallel (wide issue, multimedia
instructions, hyper-threading), and this bit is growing.
Core Problems in Distributed Computing
• What types of problems are there?
• An example: Generals’ Problem
Two generals need to coordinate an attack
1. Must agree on time to attack
2. They’ll win only if they attack simultaneously
3. Communicate through messengers
4. Messengers may be killed on their way
Let’s try to solve it for generals g1 and g2
step 1: g1 sends time of attack to g2
Problem: how to ensure g2 received msg?
step 2 (Solution): let g2 ack receipt of msg
Problem: how to ensure g1 received ack
step 3 (Solution): let g1 ack the receipt of the ack…
…
This problem seems (is!) really impossible to solve!
Consensus
Two nodes need to agree on a value
Communicate by messages using an unreliable channel
Agreement is a core problem… agreeing on a number = consensus
The Consensus problem
All nodes propose a value
Some nodes might crash & stop responding
The algorithm must ensure:
All correct nodes eventually decide
Every node decides the same
Only decide on proposed values
Broadcast
Atomic Broadcast
A node broadcasts a message
If sender correct, all correct nodes deliver msg
All correct nodes deliver same messages
Messages delivered in the same order
Given Atomic broadcast
Can use it to solve Consensus (see the sketch at the end of this section)
Every node broadcasts its proposal
Decide on the first received proposal
Messages received same order
All nodes will decide the same
Given Consensus
Can use it to solve Atomic broadcast
Atomic Broadcast equivalent to Consensus
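A minimal sketch of the reduction stated above, in C (the single-process simulation of atomic broadcast as one shared, totally ordered log is an assumption made only to keep the example self-contained, not part of the course material): every node broadcasts its proposal, and since all correct nodes deliver the same messages in the same order, each can simply decide on the first proposal delivered.

/* Consensus from Atomic Broadcast: broadcast your proposal, decide on
 * the first delivered proposal. Atomic broadcast is simulated by one
 * shared, ordered log (single-process demo only). */
#include <stdio.h>

#define NODES 3

static int log_vals[NODES];                    /* the totally ordered delivery log */
static int log_len = 0;

static void atomic_broadcast(int value)        /* simulated: append to the shared log */
{
    log_vals[log_len++] = value;
}

int main(void)
{
    int proposal[NODES] = { 7, 4, 9 };
    int decision[NODES];

    for (int n = 0; n < NODES; n++)            /* every node broadcasts its proposal */
        atomic_broadcast(proposal[n]);

    for (int n = 0; n < NODES; n++)            /* same delivery order at every node... */
        decision[n] = log_vals[0];             /* ...so all decide on the first one */

    for (int n = 0; n < NODES; n++)
        printf("node %d decides %d\n", n, decision[n]);
    return 0;
}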