18-Assignment 1 - Solution
Assignment 1
CS4402B / CS9635B University of Western Ontario
Submission instructions.
Format: The answers to the problem questions should be typed:
• source programs must be accompanied with input test files,
• in the case of CilkPlus code, a Makefile (for compiling and running) is required, and
• for algorithms or complexity analyses, LaTeX is highly recommended.
A PDF file (no other format allowed) should gather all the answers to non-programming
questions. All the files (the PDF, the source programs, the input test files and Make-
files) should be archived using the UNIX command tar.
Submission: The assignment should be submitted through the OWL website of the class.
Collaboration. You are expected to do this assignment on your own without assistance
from anyone else in the class. However, you can use the literature and, if you do so, briefly
list your references in the assignment. Be careful! You might find on the web solutions
to our problems which are not appropriate, for instance because the parallelism model
is different. So please avoid those traps and work out the solutions by yourself. You
should not hesitate to contact me if you have any questions regarding this assignment.
I will be more than happy to help.
Marking. This assignment will be marked out of 100. A 10 % bonus will be given if your
paper is clearly organized, the answers are precise and concise, the typography and the
language are in good order. Messy assignments (unclear statements, lack of correctness
in the reasoning, many typographical and language mistakes) may incur a 10 % penalty.
PROBLEM 1. [20 points] Consider the following multithreaded algorithm for performing pairwise addition on n-element arrays A[1..n] and B[1..n], storing the sums in D[1..n], shown in Algorithm 5.
1.1 Suppose that we set grain size = 1. What are the work, span and parallelism of this
implementation?
Solution.
• With grain size = 1, the for-loop of the procedure Sum-Array performs n iterations, and at each iteration the call to Add-Subarray performs constant work. Therefore, the work is Θ(n).
• The span is also Θ(n): spawning the function calls does not shorten the critical path, because the n spawns themselves are issued serially by the for-loop.
• Therefore, the parallelism is Θ(1).
1.2 For an arbitrary grain size, what are the work, span and parallelism of this implementation?
Solution.
• Denote the grain size by g; each call to Add-Subarray then has cost Θ(g).
• With grain size = g, the for-loop of the procedure Sum-Array performs n/g iterations, and at each iteration the call to Add-Subarray performs Θ(g) work. Therefore, the work remains Θ(n).
• Here again, spawning the function calls does not shorten the critical path. Each of the n/g calls has span Θ(g), and in the worst case these n/g calls execute one after another. Hence, the span is O(n).
• Therefore, the parallelism is Ω(1), which is not an attractive result. In practice, some benefit can come from spawning a function call at each iteration of a for-loop, but this is hard to capture theoretically. Moreover, using cilk_for is generally the better way to go.
1.3 Determine the best value for grain size that maximizes parallelism. Explain the reasons.
Solution.
• To give a precise answer, we would need to know whether some of the function
calls to Add-Subarray are performed concurrently. Let us consider the best
and the worst cases.
• In the worst case, these function calls execute serially, one after another, whatever g is. In that case, the parallelism is Θ(1) and the value of g has no effect.
• In the best case, all the function calls execute in parallel, in which case the span drops to Θ(n/g + g). The function g ↦ n/g + g reaches its minimum (for g > 0) at g = √n, which suggests using this value for maximizing parallelism.
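The choice g = √n can be justified by a one-line AM–GM (or calculus) argument:

```latex
\frac{n}{g} + g \;\ge\; 2\sqrt{\frac{n}{g}\cdot g} \;=\; 2\sqrt{n},
\qquad \text{with equality iff } \frac{n}{g} = g, \text{ i.e. } g = \sqrt{n}.
```

With g = √n the best-case span is Θ(√n), so the parallelism improves to Θ(n/√n) = Θ(√n).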
1.4 Implement this algorithm in C/C++ with the best value of grain size (which can be
determined from either theory or practice), and then use Cilkview to collect the following
information for the whole program with n = 4096 or larger:
• Work (instructions)
• Span (instructions)
• Burdened span (instructions)
• Parallelism
• Burdened parallelism
as well as the speedup estimated on 2, 4, 8, 16, 32, 64 and 128 processors, respectively.
This question receives 10 points distributed as follows:
• the code compiles: 3 points,
• the code runs: 4 points,
• the code runs correctly against verification: 3 points.
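A minimal serial sketch of the implementation, assuming grain size g = √n; the chunk loop is the one a CilkPlus version would turn into a cilk_for (the C++ identifiers sum_array and add_subarray are my own transcription of the pseudocode names Sum-Array and Add-Subarray):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Add-Subarray: pairwise addition on one chunk [lo, hi).
static void add_subarray(const std::vector<double>& A, const std::vector<double>& B,
                         std::vector<double>& D, std::size_t lo, std::size_t hi) {
    for (std::size_t k = lo; k < hi; ++k)
        D[k] = A[k] + B[k];
}

// Sum-Array with grain size g = sqrt(n): roughly n/g chunks of g elements each.
// In CilkPlus, the chunk loop below would be a cilk_for (or each call a cilk_spawn).
void sum_array(const std::vector<double>& A, const std::vector<double>& B,
               std::vector<double>& D) {
    const std::size_t n = A.size();
    const std::size_t g =
        std::max<std::size_t>(1, static_cast<std::size_t>(std::sqrt(static_cast<double>(n))));
    for (std::size_t lo = 0; lo < n; lo += g)      // candidate cilk_for
        add_subarray(A, B, D, lo, std::min(lo + g, n));
}
```

For n = 4096 this gives 64 chunks of 64 elements each.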
PROBLEM 2. [20 points] The objective of this problem is to prove that, with respect
to the Theorem of Graham & Brent, a greedy scheduler achieves the stronger bound:
TP ≤ (T1 − T∞ )/p + T∞ .
Let G = (V, E) be the DAG representing the instruction stream for a multithreaded
program in the fork-join parallelism model. The sets V and E denote the vertices and edges
of G respectively. Let T1 and T∞ be the work and span of the corresponding multithreaded
program. We assume that G is connected. We also assume that G admits a single source
(vertex with no predecessors) denoted by s and a single target (vertex with no successors)
denoted by t. Recall that T1 is the total number of elements of V and T∞ is the maximum
number of nodes on a path from s to t (counting s and t).
Let S0 = {s}. For i ≥ 0, we denote by Si+1 the set of the vertices w satisfying the
following two properties:
(i) all immediate predecessors of w belong to Si ∪ Si−1 ∪ · · · ∪ S0 , and
(ii) w does not belong to Si ∪ Si−1 ∪ · · · ∪ S0 .
Therefore, the set Si represents all the units of work which can be done during the i-th
parallel step (and not before that point) on infinitely many processors.
Let p > 1 be an integer. For all i ≥ 0, we denote by wi the number of elements in Si .
Let ℓ be the largest integer i such that wi ≠ 0. Observe that S0 , S1 , . . . , Sℓ form a partition
of V . Finally, we define the following sequence of integers:
ci = 0 if wi ≤ p,   and   ci = ⌈wi /p⌉ − 1 if wi > p.
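The sets Si can be computed mechanically by a longest-path layering of the DAG. A small sketch (the graph representation, with vertices indexed 0..n−1 in topological order, is my own and not part of the assignment):

```cpp
#include <cstddef>
#include <vector>

// Partition the vertices of a DAG into levels S0, S1, ...:
// level(v) = 0 if v has no predecessors, else 1 + max(level over predecessors).
// pred[v] lists the immediate predecessors of vertex v; vertices are assumed
// topologically ordered (every predecessor has a smaller index than v).
std::vector<std::vector<int>> level_sets(const std::vector<std::vector<int>>& pred) {
    const int n = static_cast<int>(pred.size());
    std::vector<int> level(n, 0);
    int max_level = 0;
    for (int v = 0; v < n; ++v) {
        for (int u : pred[v])
            if (level[u] + 1 > level[v]) level[v] = level[u] + 1;
        if (level[v] > max_level) max_level = level[v];
    }
    std::vector<std::vector<int>> S(max_level + 1);
    for (int v = 0; v < n; ++v) S[level[v]].push_back(v);
    return S;  // S[i] is the set S_i of strands executable at parallel step i
}
```

On a diamond DAG s → {a, b} → t this yields S0 = {s}, S1 = {a, b}, S2 = {t}.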
2.1 For the computation of the 5-th Fibonacci number (as studied in class) what are
S0 , S1 , S2 , . . .?
Solution.
For each i = 0, . . . , ℓ − 1, the set Si+1 consists of strands which cannot be executed
before those in Si ∪ Si−1 ∪ · · · ∪ S0 are executed. Therefore the span T∞ is at least
ℓ + 1. On the other hand, all strands in Si+1 can be executed (concurrently) once those
in Si ∪ Si−1 ∪ · · · ∪ S0 are executed. Therefore T∞ is at most ℓ + 1. These two
observations imply ℓ + 1 = T∞.
Since S0 , S1 , . . . , Sℓ form a partition of V , we clearly have w0 + · · · + wℓ = T1 .
Solution. We have
c0 + · · · + cℓ ≤ Σ_{i=0}^{ℓ} (⌈wi /p⌉ − 1)
             ≤ Σ_{i=0}^{ℓ} (wi /p − 1/p)        (1)
             ≤ (1/p) Σ_{i=0}^{ℓ} (wi − 1)
             ≤ (1/p) (T1 − T∞) .
Indeed, for all positive integers a and b, one can easily verify the inequality
⌈a/b⌉ − 1 ≤ (a − 1)/b .        (2)
TP ≤ (T1 − T∞ )/p + T∞ .
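A sketch of how this bound follows from the construction above: executing the DAG level by level, the set Si takes ⌈wi /p⌉ = ci + 1 rounds on p processors, and the standard complete/incomplete-steps argument shows that a greedy scheduler needs no more rounds than this schedule. Hence

```latex
T_P \;\le\; \sum_{i=0}^{\ell} \Big\lceil \frac{w_i}{p} \Big\rceil
    \;=\; \sum_{i=0}^{\ell} (c_i + 1)
    \;=\; (c_0 + \cdots + c_\ell) + (\ell + 1)
    \;\le\; \frac{T_1 - T_\infty}{p} + T_\infty ,
```

using inequality (1) and ℓ + 1 = T∞.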
2.5 Application: Professor Brown takes some measurements of his (deterministic) multithreaded program, which is scheduled by a greedy scheduler, and finds that T8 = 80 seconds and T64 = 20 seconds. Give a lower bound and an upper bound for Professor Brown's running time on p processors, for 1 ≤ p ≤ 100. Using a plot is recommended.
Solution.
The above solution is elegant and addresses the question in the best possible way.
Nevertheless we accept coarser solutions where Equation (3) is used as an equality in order
to numerically determine T1 and T∞. After that, one observes
(T1 + (p − 1) T∞) / p ≥ TP ≥ max(T1 /p, T∞)
and plots the above upper and lower bounds of TP .
PROBLEM 3. [20 points] Given a weighted directed graph G = (V , E), where each edge
(v, w) ∈ E (vertices v, w ∈ V ) has a non-negative weight, the Floyd-Warshall algorithm,
shown in Algorithm 2, can find the shortest paths between all pairs of vertices in G. Let |V |
be the number of vertices in G.
3.1 Determine which loops among the k-loop, i-loop and j-loop can be parallelized and
explain the reasons.
Solution. From the proposed pseudo-code, it is unclear that any of the three for-loops
could become a parallel loop. Thus, it is an acceptable solution to say: none! The
challenge is the dynamic programming formulation. In fact, one needs to rework the
algorithm a bit so as to obtain a blocking-strategy formulation. See for instance:
Algorithm 2: The Floyd-Warshall algorithm
/* Let D be a |V | × |V | array of minimum distances initialized by the
weighted directed graph G. */
for k = 0; k < |V |; ++k do
for i = 0; i < |V |; ++i do
for j = 0; j < |V |; ++j do
if D[i][j] > D[i][k] + D[k][j] then
D[i][j] = D[i][k] + D[k][j];
https://gkaracha.github.io/papers/floyd-warshall.pdf
and
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1333649
From there, one deduces that the two inner for-loops can become parallel for-loops.
Indeed, the "i" and "j" iterations are independent of each other. This yields a
fork-join algorithm with work Θ(n³) and span Θ(n).
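For reference, a serial C++ transcription of Algorithm 2; the two inner loops are the ones that could become cilk_for loops, while the k-loop must stay serial since iteration k reads the results of iteration k − 1 (the INF sentinel and function name are my own choices):

```cpp
#include <vector>

const double INF = 1e18;  // sentinel for "no edge"

// Floyd-Warshall on an adjacency matrix D (D[i][j] = weight of edge (i, j),
// INF if absent, 0 on the diagonal). After the call, D[i][j] is the length
// of a shortest path from i to j.
void floyd_warshall(std::vector<std::vector<double>>& D) {
    const int n = static_cast<int>(D.size());
    for (int k = 0; k < n; ++k)            // serial: carries a dependence on step k-1
        for (int i = 0; i < n; ++i)        // candidate cilk_for
            for (int j = 0; j < n; ++j)    // candidate cilk_for
                if (D[i][k] + D[k][j] < D[i][j])
                    D[i][j] = D[i][k] + D[k][j];
}
```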
Solution. The section on the parallelization of the Floyd-Warshall algorithm, in the Wikipedia
page, provides an interesting point of view. We can see the Floyd-Warshall algorithm
as a stencil computation, see Algorithm 3. Note that the parallel for-loops in
Algorithm 3 can be expressed in the Cilk language using cilk_for with the appropriate
grain size.
3.3 Analyze the work, span and parallelism of your multithreaded pseudo-code.
Solution.
• Removing the two in parallel clauses yields a serial algorithm with work Θ(N³).
• The outermost loop and the two innermost loops are serial. This yields a span of
Θ(N (2 log((N − 1)/b) + b²)). If we view b as a small constant, we can simply answer
Θ(N log N).
Algorithm 3: Parallel Floyd-Warshall algorithm using blocking
/* Let D be a |V | × |V | array of minimum distances initialized by the
weighted directed graph G. */
Define D(0) = D and let N = |V | ;
Let b be an integer dividing N − 1 ;
for k = 0; k < N ; ++k do
Initialize an N × N matrix D(k+1) to zero ;
for i = 0; i ≤ (N − 1)/b; ++i ; in parallel do
for j = 0; j ≤ (N − 1)/b; ++j ; in parallel do
for h = 0; h < b; ++h do
for ` = 0; ` < b; ++` do
D(k+1)[ib+h, jb+ℓ] = min( D(k)[ib+h, jb+ℓ] , D(k)[ib+h, k] + D(k)[k, jb+ℓ] )
D(k) = D(k+1) ;
We can divide the n × n array A into four n/2 × n/2 subarrays,
A = [ A11  A12 ]
    [ A21  A22 ] ,
and then recursively update each subarray in parallel.
Solution.
4.2 Draw the computation dag of your pseudo-code, and show how to schedule the dag on
4 processors using greedy scheduling.
4.3 Give and solve recurrences for the work and span for this algorithm in terms of n. What
is the parallelism?
Solution.
Copy part:
Work: O(n²)
Span: C∞(n) = C∞(n/2) + O(1) ∈ O(log n)
The whole algorithm:
Work: O(n²)
Span: S∞(n) = S∞(n/2) + O(log n) = Θ(log² n)
Parallelism: O(n² / log² n)
Choose an integer b ≥ 2. Divide the n × n array into b2 subarrays, each of size n/b × n/b,
recursing with as much parallelism as possible.
4.4 In terms of n and b, what are the work, span and parallelism of your algorithm?
Copy part:
Work: O(n²)
Span: C∞(n) = C∞(n/b) + O(1) ∈ O(log_b n)
The whole algorithm:
Work: O(n²)
Span: S∞(n) = S∞(n/b) + O(log_b n) = Θ(log_b² n)
Parallelism: O(n² / log_b² n)
Algorithm 5: Parallel Stencil
Update(A, D, b, N )
Update-blocks (A, D, b, 0, 0, N − 1, N − 1);
Copy-blocks (A, D, b, 0, 0, N − 1, N − 1);
Update-blocks(A, D, b, i0 , j0 , di , dj )
if di > b then
d = di /2;
spawn Update-blocks (A, D, b, i0 , j0 , d, dj ) ;
Update-blocks (A, D, b, i0 + d, j0 , d, dj ) ;
return ;
if dj > b then
d = dj /2;
spawn Update-blocks (A, D, b, i0 , j0 , di , d) ;
Update-blocks (A, D, b, i0 , j0 + d, di , d) ;
return ;
Update-block(A, D, i0 , j0 , di , dj )
Copy-blocks(A, D, b, i0 , j0 , di , dj )
if di > b then
d = di /2;
spawn Copy-blocks (A, D, b, i0 , j0 , d, dj ) ;
Copy-blocks (A, D, b, i0 + d, j0 , d, dj ) ;
return ;
if dj > b then
d = dj /2;
spawn Copy-blocks (A, D, b, i0 , j0 , di , d) ;
Copy-blocks (A, D, b, i0 , j0 + d, di , d) ;
return ;
Copy-block(A, D, i0 , j0 , di , dj )
Update-block(A, D, i0 , j0 , di , dj )
for i = i0 ; i < i0 + di ; ++i do
for j = j0 ; j < j0 + dj ; ++j do
D[i, j] = 0.25 * (A[i − 1, j] + A[i + 1, j] + A[i, j − 1] + A[i, j + 1]);
Copy-block(A, D, i0 , j0 , di , dj )
for i = i0 ; i < i0 + di ; ++i do
for j = j0 ; j < j0 + dj ; ++j do
A[i, j] = D[i, j] ;
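The two base-case kernels of Algorithm 5 can be written out in C++ as follows (a sketch; the Grid typedef and array layout are my own). Update-block averages the four neighbours of each cell of the block into D, and Copy-block writes the block back into A:

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// Update-block: D[i][j] = average of the four neighbours of cell (i, j) in A,
// over the block [i0, i0+di) x [j0, j0+dj). Assumes the block lies in the
// interior of the grid, so that indices i-1, i+1, j-1, j+1 are all valid.
void update_block(const Grid& A, Grid& D,
                  std::size_t i0, std::size_t j0, std::size_t di, std::size_t dj) {
    for (std::size_t i = i0; i < i0 + di; ++i)
        for (std::size_t j = j0; j < j0 + dj; ++j)
            D[i][j] = 0.25 * (A[i-1][j] + A[i+1][j] + A[i][j-1] + A[i][j+1]);
}

// Copy-block: write the freshly computed block back into A.
void copy_block(Grid& A, const Grid& D,
                std::size_t i0, std::size_t j0, std::size_t di, std::size_t dj) {
    for (std::size_t i = i0; i < i0 + di; ++i)
        for (std::size_t j = j0; j < j0 + dj; ++j)
            A[i][j] = D[i][j];
}
```

The recursive Update-blocks and Copy-blocks procedures simply split the index range until di ≤ b and dj ≤ b, then call these kernels.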
4.5 For any choice of b ≥ 2, analyze the trends of the parallelism and the burdened parallelism.
The code can be found in problem4/stencilDnC.cpp. For simplicity, the order of the
matrix is set to n + 2 and we ignore the edge cells.
5.1 Describe, in plain words, how to construct a tableau in a k-way fashion, for an arbitrary
integer k ≥ 2, using the same stencil (the one of the Pascal triangle construction) as
in the lectures.
One can use either a divide-and-conquer or a blocking strategy, as seen in class for
Pascal’s triangle.
5.2 Determine the work and the span for an input square array of order n.
For an input n × n array, the work is clearly in Θ(n2 ) Let Sk (n) be the non-burdened
span for the k-way divide and conquer approach. We have:
5.3 Determine the burdened span, similarly to what we did for the Pascal triangle construction at the end of the chapter Multithreaded Parallelism and Performance Measures.
Sk (n) ∈ Θ((n/k) log(n/k))