Chapter 01
[Figure: widgets w1, w2, ... queued for a single machine ABC, and widgets w1 to w5 queued for a three-stage pipeline A, B, C]
…
Data Parallelism.
[Figure: three identical machines ABC working side by side, taking widgets (w1, w4), (w2, w5), and (w3, w6) respectively]
…
To copy 4 papers sequentially, it takes 12 seconds.
Using a pipeline, it takes 3+1+1+1 = 6 seconds.
Using data parallelism with 3 functional units, it takes 3+3 = 6 seconds.
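The three timings above can be reproduced with a small sketch. It assumes, as the numbers imply, that each paper passes through 3 stages of 1 second each; the function names are illustrative only.

```python
import math

def sequential_time(items, stages, stage_time=1):
    # One unit handles every stage of every item, one after another.
    return items * stages * stage_time

def pipeline_time(items, stages, stage_time=1):
    # The first item needs all stages; each later item finishes
    # one stage_time after the previous one.
    return (stages + items - 1) * stage_time

def data_parallel_time(items, stages, units, stage_time=1):
    # Each unit processes whole items; the busiest unit gets
    # ceil(items / units) of them.
    return math.ceil(items / units) * stages * stage_time

print(sequential_time(4, 3))        # 12
print(pipeline_time(4, 3))          # 6
print(data_parallel_time(4, 3, 3))  # 6
```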
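Ex: Sieve of Eratosthenes (finding the primes up to n)
a) Create a list of the natural numbers 2, 3, 4, …, n.
b) Strike the multiples of 2, 3, 5, and successive primes.
c) Terminate after the multiples of the largest prime ≤ √n have been struck.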
Sequential Implementation
[Figure: steps (a) through (g) of the sieve applied to the list 2, 3, 4, ..., 30]
√30 ≈ 5.48
5 is the largest prime ≤ √30, so stop after striking the multiples of 5.
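A minimal sequential sketch of these steps is shown here. It assumes that striking for each prime starts at its square, which is the convention the analysis later in these slides uses; the function name is illustrative.

```python
import math

def sieve(n):
    """Return the primes in 2..n using the Sieve of Eratosthenes."""
    struck = [False] * (n + 1)
    # Only primes up to sqrt(n) need their multiples struck.
    for q in range(2, math.isqrt(n) + 1):
        if not struck[q]:                      # q is the next prime
            for m in range(q * q, n + 1, q):   # start at the prime's square
                struck[m] = True
    return [i for i in range(2, n + 1) if not struck[i]]

print(sieve(30))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```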
…
[Figure: sequential model. One processor p keeps the current prime and an index into the array 2, 3, 4, ..., n-1, n.]
[Figure: shared memory model. Processors P1, P2, P3 each keep their own index; the current prime and the array 2, 3, 4, ..., n-1, n are held in shared memory.]
…
Algorithm:
For example, one processor will be responsible for marking the multiples of 2, beginning with 4.
While this processor marks the multiples of 2, another may be marking the multiples of 3, beginning with 9, and so on.
…
Two problems might occur:
a) Two or more processors may end up sieving multiples of the same prime.
b) A processor may end up sieving multiples of a composite number.
Analysis
Analysis (sieving for each prime starts from its square)
a) Consider the time taken by the sequential algorithm:
$\lceil (n-3)/2 \rceil$ is the number of steps needed to sieve the multiples of 2.
$\lceil (n-8)/3 \rceil$ is the number of steps needed to sieve the multiples of 3.
…
…
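The total number of steps is
$\left\lceil \frac{n-3}{2} \right\rceil + \left\lceil \frac{n-8}{3} \right\rceil + \left\lceil \frac{n-24}{5} \right\rceil + \cdots + \left\lceil \frac{(n+1)-\pi_k^2}{\pi_k} \right\rceil$
where $\pi_k$ denotes the k-th prime, the largest prime ≤ √n.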
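As a quick check of this sum, the sketch below evaluates it for a given n. The helper names are illustrative, and n = 1000 is used only because that value appears in the later example.

```python
import math

def sieving_primes(n):
    # Primes up to sqrt(n), found by trial division (fine for small n).
    return [q for q in range(2, math.isqrt(n) + 1)
            if all(q % d for d in range(2, math.isqrt(q) + 1))]

def sequential_steps(n):
    # ceil((n + 1 - q^2) / q) strikes for each sieving prime q,
    # computed with integer arithmetic.
    return sum(-(-(n + 1 - q * q) // q) for q in sieving_primes(n))

print(sequential_steps(1000))  # 1411
```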
…
b) Consider the time taken by the
parallel algorithm:
Note that parallel execution will not
[Figure: timelines (b) and (c) showing which of the primes 2, 3, 5, ..., 31 each processor sieves when two and three processors are used]
…
Time for one processor is 1411.
Time for two processors is 706, therefore the speedup is 1411/706 ≈ 2.00.
Time for three processors is 499, therefore the speedup is 1411/499 ≈ 2.83.
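These times can be reproduced with a small simulation. It is a sketch under stated assumptions, not necessarily the exact model behind the slides: n = 1000, a processor that becomes free takes the next unsieved prime, sieving prime q costs ⌈(n+1−q²)/q⌉ steps, and the time to find the next prime is ignored. All function names are illustrative.

```python
import heapq
import math

def sieving_primes(n):
    return [q for q in range(2, math.isqrt(n) + 1)
            if all(q % d for d in range(2, math.isqrt(q) + 1))]

def control_parallel_time(n, p):
    # finish[i] is the time at which processor i becomes free.
    finish = [0] * p
    heapq.heapify(finish)
    for q in sieving_primes(n):
        cost = -(-(n + 1 - q * q) // q)   # strikes needed for prime q
        free_at = heapq.heappop(finish)   # earliest free processor takes q
        heapq.heappush(finish, free_at + cost)
    return max(finish)

for p in (1, 2, 3):
    t = control_parallel_time(1000, p)
    print(p, t, round(control_parallel_time(1000, 1) / t, 2))
```

Under these assumptions the time with three processors equals the time spent on the multiples of 2 alone, so adding more processors cannot reduce it further.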
Data Parallel Approach
a) Algorithm:
Processors will work together to
strike multiples of each newly found
prime. Every processor will be
responsible for a segment of the
array representing the natural
numbers.
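[Figure: data parallel model. Each processor P1, P2, ..., Pp keeps its own current prime and index; P1 holds the numbers 2 to n/p, P2 holds n/p+1 to 2n/p, ..., and Pp holds (p-1)n/p+1 to n.]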
…
Assume we solve the problem with p processors.
Every processor is assigned no more than ⌈n/p⌉ natural numbers.
Assume p is much less than √n.
Then all the primes used for sieving (those ≤ √n) are in the segment assigned to the first processor.
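The segment assignment can be illustrated with a sequential simulation. This is only a sketch: the "processors" are loop iterations rather than real threads, striking starts at each prime's square, and the function name and the returned strike counts are illustrative additions.

```python
import math

def data_parallel_sieve(n, p):
    size = math.ceil(n / p)  # at most ceil(n/p) numbers per processor
    # Processor i owns the numbers i*size+1 .. (i+1)*size (the first starts at 2).
    segments = [(max(2, i * size + 1), min((i + 1) * size, n)) for i in range(p)]
    limit = math.isqrt(n)
    # The assumption from the slides: sieving primes lie in the first segment.
    assert segments[0][1] >= limit
    struck = [False] * (n + 1)
    strikes = [0] * p            # work done by each simulated processor
    for q in range(2, limit + 1):
        if struck[q]:
            continue             # q is composite; processor 1 moves on
        for i, (lo, hi) in enumerate(segments):
            # First multiple of q inside the segment, not below q*q.
            start = max(q * q, -(-lo // q) * q)
            for m in range(start, hi + 1, q):
                struck[m] = True
                strikes[i] += 1
    primes = [i for i in range(2, n + 1) if not struck[i]]
    return primes, strikes

primes, strikes = data_parallel_sieve(30, 3)
print(primes)
print(strikes)
```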
…
Algorithm:
Processor 1 finds the next prime and
[Figure: speedup (0 to 6) plotted against the number of widgets (1 to 10)]
Question
Analyze the speedup achievable by the data parallel algorithm on the shared memory model.
Assume it takes a unit of time for a
processor to mark a multiple of a
prime as being a composite number.
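The total amount of time a processor spends striking out composite numbers is no greater than
$\left\lceil \frac{\lceil n/p \rceil}{2} \right\rceil + \left\lceil \frac{\lceil n/p \rceil}{3} \right\rceil + \cdots + \left\lceil \frac{\lceil n/p \rceil}{\pi_k} \right\rceil$
where $\pi_k$ is the largest prime ≤ √n.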
Ex:
n = 1000
The primes are (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31).
When p=1
= 500+334+200+143+91+77+59+53+44+35+33
= 1569
When p=2
= 250+167+100+72+46+39+30+27+22+18+17
= 788
Speedup = 1569/788 = 1.99
…
When p=3
= 167+112+67+48+31+26+20+18+15+12+11
= 527
Speedup = 1569/527 = 2.98
…
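When p=10
$\left\lceil \frac{1000/10}{2} \right\rceil + \left\lceil \frac{1000/10}{3} \right\rceil + \left\lceil \frac{1000/10}{5} \right\rceil + \cdots + \left\lceil \frac{1000/10}{31} \right\rceil$
= 50+34+20+15+10+8+6+6+5+4+4
= 162
Speedup = 1569/162 = 9.69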
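The worked sums above follow one pattern, which the sketch below evaluates for any p. It counts only the striking time from the bound (finding primes and any communication are ignored), and the function names are illustrative.

```python
import math

def sieving_primes(n):
    # Primes up to sqrt(n), found by trial division.
    return [q for q in range(2, math.isqrt(n) + 1)
            if all(q % d for d in range(2, math.isqrt(q) + 1))]

def strike_time_bound(n, p):
    segment = -(-n // p)  # ceil(n / p) numbers per processor
    # ceil(segment / q) strikes at most for each sieving prime q.
    return sum(-(-segment // q) for q in sieving_primes(n))

def speedup_bound(n, p):
    return strike_time_bound(n, 1) / strike_time_bound(n, p)

for p in (1, 2, 3, 10):
    print(p, strike_time_bound(1000, p), round(speedup_bound(1000, p), 2))
```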
…
[Figure: speedup (0 to 12) plotted against the number of processors (1 to 10)]
H.W
Continue for p = 4, 5, …, 9 and plot the speedup against the number of processors.