Chapter 01

The document discusses parallel processing, emphasizing its necessity due to computational demands and the evolution of practical parallel processing from early machines. It contrasts pipeline and data parallelism, explores scalability, and details the Sieve of Eratosthenes algorithm in both control and data parallel approaches. Additionally, it addresses the speedup achievable through parallel execution and the impact of the number of processors on performance.


Chapter 1: Introduction

 Why Parallel Processing?
 Advent of Practical Parallel Processing.
 Parallel Processing Terminology.
 Contrasting Pipeline and Data Parallelism.
 Control Parallelism.
 Scalability.
 The Sieve of Eratosthenes (Sieve Technique).
 Control Parallel Approach.
 Data Parallel Approach.
Why Parallel Processing?
 Because of the computational demand.
Advent of Practical Parallel Processing
Early machines in the 70's & 80's in the laboratories:
 ILLIAC IV at Burroughs Corporation (70's).
 Cm* & C.mmp at Carnegie-Mellon University (70's).
 Cosmic Cube at Caltech (80's), later commercialized by nCUBE.

 The performance of a single processor can be improved by either architectural or technological advances.
 Architecture: by increasing the amount of work performed per instruction cycle.
 Technological advances: by reducing the time needed per instruction cycle.
Parallel Processing Terminology
 Most high-performance modern computers exhibit concurrency, but it is not desirable to call them all parallel computers.
 Parallel Processing: information (data) processing that emphasizes the concurrent manipulation of data elements belonging to one or more processes solving a single problem.
 Parallel Computer: a multiple-processor computer capable of parallel processing.
 Super Computer: a general-purpose computer capable of solving individual problems at extremely high speed. A super computer is a parallel computer.
 Throughput: the number of results per unit of time.
 Pipeline: a computation divided into steps/segments/stages.
 Data Parallelism: the use of multiple functional units to apply the same operation simultaneously to elements of a data set. Ex: parallel addition.
 Speedup: the ratio between the time needed to perform a computation using the sequential algorithm and the time needed to perform the same computation on a parallel machine.
Contrasting Pipeline and Data Parallelism
Ex: a copy machine whose job consists of three one-second stages A, B, C.

1. Sequential: a single machine copies one paper at a time, passing it through stages A, B, C.

2. Pipeline (control parallelism): three units, one per stage; papers move through A, B, C in assembly-line fashion, so a new paper finishes every second once the pipeline is full.

3. Data parallelism: three complete machines, each copying a different paper at the same time.

 To copy 4 papers sequentially takes 12 seconds.
 Using the pipeline, it takes 3+1+1+1 = 6 seconds.
 Using data parallelism with 3 functional units, it takes 3+3 = 6 seconds.
Ex:

Papers   Pipeline   Parallel
  1         3          3
  …         …          …
  4         6          6
  …         …          …
  7         9          9
  …         …          …
 10        12         12

A three-way data parallel machine produces 3 papers every 3 units of time.
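The timings in the table can be reproduced with simple formulas. This is a sketch; the closed forms below are inferred from the example numbers (3-second copies, 3 stages, 3 units), not stated explicitly in the slides:

```python
import math

def sequential_time(papers, per_copy=3):
    """One machine, 3 time units per paper."""
    return papers * per_copy

def pipeline_time(papers, stages=3):
    """3 one-unit stages: fill the pipe, then one paper finishes per unit."""
    return stages + (papers - 1)

def data_parallel_time(papers, units=3, per_copy=3):
    """3 complete copiers: each batch of 3 papers takes one full copy time."""
    return per_copy * math.ceil(papers / units)

for w in (1, 4, 7, 10):
    print(w, sequential_time(w), pipeline_time(w), data_parallel_time(w))
```

For 4 papers this gives 12, 6, and 6 time units, matching the example above.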


 Speedup of the pipeline when papers = 4 is 12/6 = 2.
 Speedup of data parallelism when papers = 4 is 12/6 = 2.

 When the number of papers is very large, the speedup of the pipeline machine is the same as that of the data-parallel machine, given that the number of functional units equals the number of stages.
 Pipelining is achieved by applying different operations to different data elements simultaneously, which is called control parallelism.
Scalability
 An architecture is scalable if it continues to yield the same performance per processor as the number of processors increases.
 An algorithm is scalable if the level of parallelism increases at least linearly with the problem size.

 Control Parallelism (pipelining): applying different operations to different data simultaneously.
 Data Parallelism: applying the same operation, using multiple functional units, to different data simultaneously.
The Sieve of Eratosthenes (Sieve Technique)
(The classic prime-finding algorithm)
 We want to find the number of primes less than or equal to some positive integer n.
a) Begin with the list of natural numbers 2, 3, 4, …, n.
b) Strike out the multiples of 2, 3, 5, and successive primes.
c) Terminate after the multiples of the largest prime ≤ √n have been struck.
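The three steps above can be sketched as a minimal sequential implementation in Python:

```python
def sieve(n):
    """Classic Sieve of Eratosthenes: return the primes <= n."""
    marked = [False] * (n + 1)                   # marked[i] True => i is composite
    k = 2
    while k * k <= n:                            # stop once k exceeds sqrt(n)
        for multiple in range(k * k, n + 1, k):  # strike multiples, starting at k^2
            marked[multiple] = True
        k += 1
        while marked[k]:                         # advance to the next unmarked value
            k += 1
    return [i for i in range(2, n + 1) if not marked[i]]

print(sieve(30))   # → [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Striking begins at k², since all smaller multiples of k were already struck by smaller primes.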
Sequential Implementation

(Figure: seven snapshots (a)–(g) of the list 2–30; each pass strikes the multiples of the next remaining prime: first 2, then 3, then 5.)

√30 ≈ 5.5
5 is the largest prime ≤ 5.5  STOP

(Figure: the array 2, 3, 4, …, n-1, n, together with a variable holding the current prime and a variable holding the loop index.)

The sequential algorithm maintains an array of natural numbers, a variable storing the current prime, and a variable storing the index of the loop iterating through the array of natural numbers.
Control Parallel Approach
 Algorithm:
Every processor repeatedly goes through the two-step process of finding the next prime number and striking from the list the multiples of that prime, beginning with its square.

(Figure: processors P1, P2, P3, each with its own index, share a memory holding the current prime and the array 2, 3, 4, …, n-1, n.)

 Algorithm (example):
One processor will be responsible for marking the multiples of 2, beginning with 4. While this processor marks multiples of 2, another may be marking multiples of 3 beginning with 9, and so on.

Two problems might occur:
a) Two or more processors may end up sieving multiples of the same prime.
b) A processor may end up sieving multiples of a composite number.
Analysis
(Striking starts from each prime's square.)
a) Consider the time taken by the sequential algorithm:
⌈(n-3)/2⌉ = number of steps needed to sieve the multiples of 2.
⌈(n-8)/3⌉ = number of steps needed to sieve the multiples of 3.

The total number of steps needed to sieve all the primes p_1 = 2, p_2 = 3, …, p_k (the largest prime ≤ √n) is

⌈(n+1-p_1²)/p_1⌉ + ⌈(n+1-p_2²)/p_2⌉ + … + ⌈(n+1-p_k²)/p_k⌉
= ⌈(n-3)/2⌉ + ⌈(n-8)/3⌉ + ⌈(n-24)/5⌉ + … + ⌈(n+1-p_k²)/p_k⌉
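The step-count formula can be checked against a direct count of strike operations. A short sketch (helper names are illustrative):

```python
import math

def primes_upto(m):
    """Small helper: the primes <= m (trial division is fine at this scale)."""
    return [k for k in range(2, m + 1)
            if all(k % d for d in range(2, math.isqrt(k) + 1))]

def steps_formula(n):
    """Closed form: sum of ceil((n+1-k^2)/k) over primes k <= sqrt(n)."""
    return sum(math.ceil((n + 1 - k * k) / k) for k in primes_upto(math.isqrt(n)))

def steps_counted(n):
    """Direct count: the multiples of each prime k from k^2 up to n."""
    return sum(len(range(k * k, n + 1, k)) for k in primes_upto(math.isqrt(n)))

print(steps_formula(30), steps_counted(30))   # → 24 24
```

For n = 30 both give 14 + 8 + 2 = 24 steps (primes 2, 3, 5), agreeing with the snapshots shown earlier.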

b) Consider the time taken by the parallel algorithm:
 Note that the parallel speedup will not increase if more than 3 processors are used.
 An upper bound on the speedup of the parallel algorithm for n = 1000 is 2.83.

(Figure: time lines from 0 to 1500 showing, for n = 1000, the striking work on (a) one processor, (b) two processors, and (c) three processors, distributed over the sieving primes 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31.)

 Time for one processor is 1411.
 Time for two processors is 706; therefore the speedup is 1411/706 = 2.
 Time for three processors is 499; therefore the speedup is 1411/499 = 2.83.
Data Parallel Approach
a) Algorithm:
The processors work together to strike the multiples of each newly found prime. Every processor is responsible for a segment of the array representing the natural numbers.

H.W.: Compute the time and speedup for p = 1, 2, …, 10 and find the relation between the speedup and the number of processors.

b) Consider a different model of parallel computation:
There is no shared memory; interaction occurs through message passing.

(Figure: processors P1, P2, …, Pp, each with its own copy of the current prime and its own index; P1 holds the numbers 2 … n/p, P2 holds n/p+1 … 2n/p, and Pp holds (p-1)n/p+1 … n.)

 Assume we solve the problem with p processors.
 Every processor is assigned no more than ⌈n/p⌉ natural numbers.
 Assume p is much less than √n, so that all the sieving primes (≤ √n) lie in the segment assigned to the first processor.

Algorithm:
 Processor 1 finds the next prime and broadcasts it to the other processors.
 Then all processors strike from their lists all multiples of the newly found prime, and so on.
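The find–broadcast–strike cycle above can be simulated sequentially. This is a toy sketch of the message-passing scheme; function and variable names are illustrative, not from the slides:

```python
import math

def message_passing_sieve(n, p):
    """Simulate the segmented sieve: returns (primes found, messages sent)."""
    seg = math.ceil(n / p)                # each processor owns <= ceil(n/p) numbers
    assert seg >= math.isqrt(n), "all sieving primes must lie in P1's segment"
    marked = [False] * (n + 1)
    messages = 0
    k = 2
    while k * k <= n:
        messages += p - 1                 # P1 broadcasts k to the other p-1 processors
        for i in range(p):                # every processor strikes inside its segment
            lo, hi = i * seg + 1, min((i + 1) * seg, n)
            first = max(k * k, ((lo + k - 1) // k) * k)  # first multiple of k >= lo
            for m in range(first, hi + 1, k):
                marked[m] = True
        k += 1
        while marked[k]:                  # P1 scans its segment for the next prime
            k += 1
    return [i for i in range(2, n + 1) if not marked[i]], messages

primes, msgs = message_passing_sieve(1000, 4)
print(len(primes), msgs)   # → 168 33  (K(p-1) = 11*3 broadcast messages)
```

The message count matches the K(p-1) communication term derived below: for n = 1000 there are K = 11 sieving primes ≤ 31, each sent to p-1 = 3 other processors.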
Analysis
 We focus on the time spent on:
  Marking composite numbers.
  Communicating the current prime from P1 to the rest.
 Assume it takes χ time units for a processor to mark one multiple of a prime.

 The total amount of time a processor spends striking out composite numbers is no greater than

χ(⌈⌈n/p⌉/2⌉ + ⌈⌈n/p⌉/3⌉ + … + ⌈⌈n/p⌉/p_k⌉)

where p_k is the largest prime ≤ √n.

 Communication:
Assume a processor spends λ time units each time it passes a number to another processor. The total communication time for all K primes is

K(p-1)λ

K: the number of sieving primes
(p-1): the number of receiving processors
λ: the per-message communication time
Example
 n = 1,000,000.
 There are 168 primes < 1,000 = √1,000,000.
 The largest such prime is 997.
 The maximum possible execution time spent striking out composites is

χ(⌈⌈1,000,000/p⌉/2⌉ + ⌈⌈1,000,000/p⌉/3⌉ + … + ⌈⌈1,000,000/p⌉/997⌉)

 The total communication time is 168(p-1)λ.
 Assume the relation between χ and λ is λ = 100χ.
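The striking and communication terms pull in opposite directions as p grows, so the total time has a minimum. A sketch evaluating the total (in units of χ, with λ = 100χ as assumed above):

```python
import math

def primes_upto(m):
    """Primes <= m via a quick sieve."""
    marked = [False] * (m + 1)
    for k in range(2, math.isqrt(m) + 1):
        if not marked[k]:
            for j in range(k * k, m + 1, k):
                marked[j] = True
    return [i for i in range(2, m + 1) if not marked[i]]

SIEVING = primes_upto(1000)              # the 168 primes below sqrt(1,000,000)

def total_time(p, n=1_000_000, lam=100):
    """Total time in units of chi: striking plus 168*(p-1)*lambda communication."""
    striking = sum(math.ceil(math.ceil(n / p) / q) for q in SIEVING)
    communication = len(SIEVING) * (p - 1) * lam
    return striking + communication

times = {p: total_time(p) for p in range(1, 21)}
best = min(times, key=times.get)
print(best, times[best])
```

The minimum falls near p = 11, consistent with the note below that adding processors beyond that point only adds communication cost.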

Note that the speedup declines, and the total time begins to increase, after 11 processors.
Amdahl's Law
 S ≤ 1/(f + (1-f)/p)
S: speedup.
f: the fraction of operations in a computation that must be performed sequentially, 0 ≤ f ≤ 1.
p: number of processors.
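The bound can be evaluated directly; note that as p grows, S is capped at 1/f no matter how many processors are added:

```python
def amdahl_bound(f, p):
    """Upper bound on speedup: S <= 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)

# If 10% of a computation is inherently sequential, 10 processors give at
# most about 5.26x speedup, and no processor count can exceed 1/f = 10x.
print(round(amdahl_bound(0.1, 10), 2))   # → 5.26
```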
Q.1-6)

Widgets   Sequential   Pipelined   Speedup
   1           3            3         1
   2           6            3         2
   3           9            3         3
   4          12            4         3
   5          15            4         3.75
   6          18            4         4.5
   7          21            5         4.2
   8          24            5         4.8
   9          27            5         5.4
  10          30            6         5

(Chart: speedup vs. number of widgets, 1–10, plotted from the table above.)
Question
 Analyze the speedup achievable by the data parallel algorithm on the shared-memory model.
 Assume it takes a unit of time for a processor to mark one multiple of a prime as a composite number.
 The total amount of time a processor spends striking out composite numbers is no greater than

⌈⌈n/p⌉/2⌉ + ⌈⌈n/p⌉/3⌉ + … + ⌈⌈n/p⌉/p_k⌉
Ex:
 n = 1000.
 The sieving primes are 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31.
 When p = 1:

⌈⌈1000/1⌉/2⌉ + ⌈⌈1000/1⌉/3⌉ + ⌈⌈1000/1⌉/5⌉ + … + ⌈⌈1000/1⌉/31⌉
= 500+334+200+143+91+77+59+53+44+35+33
= 1569
Speedup = 1569/1569 = 1

 When p = 2:

⌈⌈1000/2⌉/2⌉ + ⌈⌈1000/2⌉/3⌉ + ⌈⌈1000/2⌉/5⌉ + … + ⌈⌈1000/2⌉/31⌉
= 250+167+100+72+46+39+30+27+22+18+17
= 788
Speedup = 1569/788 = 1.99

 When p = 3:

⌈⌈1000/3⌉/2⌉ + ⌈⌈1000/3⌉/3⌉ + ⌈⌈1000/3⌉/5⌉ + … + ⌈⌈1000/3⌉/31⌉
= 167+112+67+48+31+26+20+18+15+12+11
= 527
Speedup = 1569/527 = 2.97

 When p = 10:

⌈⌈1000/10⌉/2⌉ + ⌈⌈1000/10⌉/3⌉ + ⌈⌈1000/10⌉/5⌉ + … + ⌈⌈1000/10⌉/31⌉
= 50+34+20+15+10+8+6+6+5+4+4
= 162
Speedup = 1569/162 = 9.68
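A short sketch reproducing the p = 1, 2, 3, and 10 computations above (one ceiling term per sieving prime):

```python
import math

SIEVING = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]   # sieving primes for n = 1000

def striking_time(n, p):
    """Worst-case marking time on the shared model, one time unit per mark."""
    seg = math.ceil(n / p)                            # each processor's segment size
    return sum(math.ceil(seg / q) for q in SIEVING)

t1 = striking_time(1000, 1)
for p in range(1, 11):
    t = striking_time(1000, p)
    print(p, t, round(t1 / t, 2))                     # p, time, speedup
```

This reproduces 1569, 788, 527, and 162 time units for p = 1, 2, 3, and 10, and fills in the intermediate values asked for in the homework below.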

(Chart: speedup vs. number of processors, 1–10; the speedup grows nearly linearly.)
H.W.
 Continue for p = 4, 5, …, 9 and draw the relation between the speedup and the number of processors.
