Assignment 1
1. Parallelize the program of finding the sum of n numbers (a1+a2+...+an) using different
numbers of processors (Algorithm 1: n/2 processors; Algorithm 2: n/log2(n) processors).
(1) Draw the diagrams for the implementations of Algorithm 1 and Algorithm 2 respectively.
(2) Find the numbers of operations for the implementations of the sequential algorithm,
Algorithm 1, and Algorithm 2 respectively.
(5) Are Algorithm 1 and Algorithm 2 cost optimal respectively? Justify your answer.
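Note: the counting asked for in Problem 1 can be checked with a small sequential simulation. The sketch below is only one plausible reading of the two algorithms, not a reference solution: it assumes Algorithm 1 is a pairwise (tree) reduction, and that Algorithm 2 lets each of p processors first add up a contiguous block of ceil(n/p) numbers before the partial sums are tree-reduced. The names reduce_stats, tree_rounds, tree_sum, and blocked_sum are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Result of a simulated reduction: the sum plus two counters. */
typedef struct {
    long sum;         /* a1 + a2 + ... + an */
    size_t additions; /* total additions performed by all processors */
    size_t steps;     /* parallel time steps (rounds) */
} reduce_stats;

/* Pairwise tree rounds over buf[0], buf[s], buf[2s], ... where s is
   first_stride. Each round is counted as one parallel step. */
static void tree_rounds(long *buf, size_t n, size_t first_stride,
                        reduce_stats *r)
{
    for (size_t stride = first_stride; stride < n; stride *= 2) {
        for (size_t i = 0; i + stride < n; i += 2 * stride) {
            buf[i] += buf[i + stride];
            r->additions++;
        }
        r->steps++;
    }
}

/* Assumed Algorithm 1: tree reduction starting from single elements.
   Overwrites buf while reducing. */
reduce_stats tree_sum(long *buf, size_t n)
{
    reduce_stats r = {0, 0, 0};
    assert(n > 0);
    tree_rounds(buf, n, 1, &r);
    r.sum = buf[0];
    return r;
}

/* Assumed Algorithm 2: p processors each sum one block sequentially
   (all blocks in parallel), then the block sums are tree-reduced. */
reduce_stats blocked_sum(long *buf, size_t n, size_t p)
{
    reduce_stats r = {0, 0, 0};
    assert(n > 0 && p > 0 && p <= n);
    size_t block = (n + p - 1) / p; /* ceil(n / p) */
    for (size_t lo = 0; lo < n; lo += block) {
        size_t hi = (n - lo < block) ? n : lo + block;
        for (size_t i = lo + 1; i < hi; i++) {
            buf[lo] += buf[i];
            r.additions++;
        }
    }
    r.steps += block - 1; /* the longest block takes block - 1 additions */
    tree_rounds(buf, n, block, &r);
    r.sum = buf[0];
    return r;
}
```

Calling tree_sum on n numbers, and blocked_sum with p chosen as n/log2(n), gives the addition and step counts needed to compare the cost (processors times steps) of each algorithm with the n-1 additions of the sequential loop.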
2. Consider a memory system with a level 1 cache of 32 KB and DRAM of 512 MB with
the processor operating at 1 GHz. The latency to L1 cache is one cycle and the latency to
DRAM is 100 cycles. In each memory cycle, the processor fetches four words (cache line
size is four words). What is the peak achievable performance of a dot product of two
vectors? Note: Where necessary, assume an optimal cache placement policy.
    /* dot product loop */
    for (i = 0; i < dim; i++)
        dot_prod += a[i] * b[i];
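Note: one common way to reason about Problem 2 is a latency-bound model. The helper below is a minimal sketch under stated assumptions: DRAM line fetches are not overlapped with each other, cache hits and arithmetic are treated as free, and each loop iteration counts as two FLOPs (one multiply, one add). The function name and parameters are hypothetical.

```c
/* Peak rate in MFLOPS for a loop limited by DRAM latency.
   flops_per_group:     FLOPs done using the data of one group of fetches
   lines_per_group:     cache lines fetched from DRAM for that group
   dram_latency_cycles: cycles per DRAM line fetch (assumed serialized)
   clock_hz:            processor clock rate in Hz */
double latency_bound_mflops(double flops_per_group, double lines_per_group,
                            double dram_latency_cycles, double clock_hz)
{
    double seconds = lines_per_group * dram_latency_cycles / clock_hz;
    return flops_per_group / seconds / 1e6;
}
```

For the dot product, one line of a and one line of b (four words each) serve four iterations, so under these assumptions the call would be latency_bound_mflops(8, 2, 100, 1e9).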
3. Now consider the problem of multiplying a dense matrix with a vector using a two-loop
dot-product formulation. The matrix is of dimension 4K x 4K. (Each row of the matrix
takes 16 KB of storage.) What is the peak achievable performance of this technique using
a two-loop dot-product based matrix-vector product?
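Note: Problem 3 does not list its loop, so the following is a sketch of what a two-loop dot-product formulation might look like. It assumes the matrix is stored row-major in a flat array; the name matvec and its parameters are hypothetical.

```c
#include <stddef.h>

/* y = A * x for a dim x dim matrix stored row-major in a flat array.
   Each output element is the dot product of one matrix row with x. */
void matvec(size_t dim, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < dim; i++) {
        double dot = 0.0;
        for (size_t j = 0; j < dim; j++)
            dot += a[i * dim + j] * x[j]; /* row i is read with stride 1 */
        y[i] = dot;
    }
}
```

In this form the vector x is reused for every row while each matrix element is read exactly once, which is the access pattern the cache analysis has to account for.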
4. Extending this further, consider the problem of multiplying two dense matrices of
dimension 4K x 4K. What is the peak achievable performance using a three-loop
dot-product based formulation? (Assume that matrices are laid out in a row-major fashion.)
    /* matrix-matrix product loop */
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            for (k = 0; k < dim; k++)
                c[i][j] += a[i][k] * b[k][j];
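Note: with a row-major layout, the inner loop over k reads a[i][k] with a stride of one word but b[k][j] with a stride of one full row. The sketch below counts how many distinct cache lines a strided sequence of accesses touches. It assumes the first access is line-aligned and that addresses are measured in words; the name lines_touched is hypothetical.

```c
#include <stddef.h>

/* Number of distinct cache lines touched by `count` accesses that start
   at a line-aligned address and advance by `stride_words` each time.
   Addresses only grow, so a new line shows up as a changed line index. */
size_t lines_touched(size_t count, size_t stride_words, size_t line_words)
{
    size_t lines = 0;
    size_t prev = (size_t)-1; /* sentinel: no line seen yet */
    for (size_t k = 0; k < count; k++) {
        size_t line = (k * stride_words) / line_words;
        if (line != prev) {
            lines++;
            prev = line;
        }
    }
    return lines;
}
```

With four-word lines, four consecutive elements of a row of a share a single line, while four consecutive elements of a column of b (stride 4096 words) fall on four different lines. These line counts are the inputs to the peak-performance estimate.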
5. Consider an SMP with a distributed shared-address-space. Consider a simple cost
model in which it takes 10 ns to access local cache, 100 ns to access local memory, and
400 ns to access remote memory. A parallel program is running on this machine. The
program is perfectly load balanced with 80% of all accesses going to local cache, 10% to
local memory, and 10% to remote memory. What is the effective memory access time for
this computation? If the computation is memory bound, what is the peak computation
rate?
Now consider the same computation running on one processor. Here, the processor hits
the cache 70% of the time and local memory 30% of the time. What is the effective peak
computation rate for one processor? What is the fractional computation rate of a processor
in a parallel configuration as compared to the serial configuration?
Hint: Notice that the cache hit rate for multiple processors is higher than that for one
processor. This is typically because the aggregate cache available on multiprocessors is
larger than on single-processor systems.
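Note: the effective memory access time in Problem 5 is a weighted average of the latencies of the levels being accessed. The sketch below shows that arithmetic. It assumes, for the rate conversion, that a memory-bound computation performs one operation per memory access; the function names are hypothetical.

```c
#include <stddef.h>

/* Weighted average access time in ns: sum of fraction[i] * latency_ns[i].
   The fractions are expected to add up to 1. */
double effective_access_ns(const double *fraction, const double *latency_ns,
                           size_t levels)
{
    double t = 0.0;
    for (size_t i = 0; i < levels; i++)
        t += fraction[i] * latency_ns[i];
    return t;
}

/* Assumption: one operation per memory access, so the rate is 1 / t.
   One operation per nanosecond equals 1000 million operations per second. */
double memory_bound_mflops(double access_ns)
{
    return 1000.0 / access_ns;
}
```

For the parallel case the inputs would be the fractions {0.8, 0.1, 0.1} with latencies {10, 100, 400} ns, and for the serial case {0.7, 0.3} with {10, 100} ns. The ratio of the two resulting rates is the fractional computation rate asked for.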