
Assignment 1

Due on September 17, 2019

1. Parallelize the program that finds the sum of n numbers (a1 + a2 + ... + an) using different
numbers of processors (Algorithm 1: n/2 processors; Algorithm 2: n/log2(n) processors). A serial
sketch of the pairwise-summation pattern of Algorithm 1 is given after part (5) for reference.

(1) Draw the diagrams for the implementations of Algorithm 1 and Algorithm 2 respectively.

(2) Find the numbers of operations for the implementations of the sequential algorithm,
Algorithm 1, and Algorithm 2 respectively.

(3) Calculate the speedups of Algorithm 1 and Algorithm 2 respectively.

(4) Calculate the efficiencies of Algorithm 1 and Algorithm 2 respectively.

(5) Are Algorithm 1 and Algorithm 2 cost optimal respectively? Justify your answer.
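
For reference only, here is a minimal serial sketch of the pairwise-summation idea behind Algorithm 1 (n is assumed to be a power of two, and the variable names are illustrative, not prescribed by the assignment). In each of the log2(n) steps, the body of the inner loop is the work that the active processors would perform concurrently.

/* Illustrative serial simulation of Algorithm 1 (tree-based sum).
   In a parallel execution, each iteration of the inner loop is the
   work of one processor, and all of them run at the same time.    */
#include <stdio.h>

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};               /* n = 8 for illustration */
    int n = 8;
    for (int stride = 1; stride < n; stride *= 2)         /* log2(n) steps          */
        for (int i = 0; i + stride < n; i += 2 * stride)  /* n/(2*stride) active    */
            a[i] += a[i + stride];                         /* pairwise addition      */
    printf("sum = %g\n", a[0]);                            /* total ends up in a[0]  */
    return 0;
}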

2. Consider a memory system with a level 1 cache of 32 KB and DRAM of 512 MB, with the
processor operating at 1 GHz. The latency to L1 cache is one cycle and the latency to DRAM
is 100 cycles. In each memory cycle, the processor fetches four words (the cache line size is
four words). What is the peak achievable performance of a dot product of two vectors?
Note: Where necessary, assume an optimal cache placement policy.
/* dot product loop */
for (i = 0; i < dim; i++)
    dot_prod += a[i] * b[i];
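
For orientation only, the arithmetic below estimates this peak rate under two assumptions of mine that are not stated in the assignment: neither vector fits in the 32 KB cache, so every element arrives from DRAM in 4-word lines, and line fetches do not overlap with computation.

/* Illustrative estimate: every 4 iterations consume one cache line of a
   and one of b (2 x 100 ns at 1 GHz) and perform 4 multiply-add pairs. */
#include <stdio.h>

int main(void) {
    double line_ns = 100.0;           /* DRAM latency per 4-word line    */
    double mem_ns  = 2.0 * line_ns;   /* one line of a and one of b      */
    double flops   = 8.0;             /* 4 multiplications + 4 additions */
    printf("estimated peak rate: %.0f MFLOPS\n", flops / mem_ns * 1000.0);  /* ~40 */
    return 0;
}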

3. Now consider the problem of multiplying a dense matrix with a vector using a two-loop
dot-product formulation. The matrix is of dimension 4K x 4K. (Each row of the matrix
takes 16 KB of storage.) What is the peak achievable performance of this two-loop
dot-product based matrix-vector product?

/* matrix-vector product loop */
for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        c[i] += a[i][j] * b[j];
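
For orientation only, and under assumptions of mine rather than statements from the assignment: the vector b occupies 4K words = 16 KB, so with an optimal placement policy it can stay resident in the 32 KB cache after its first pass, and the memory traffic is then dominated by streaming the matrix a from DRAM. Each 4-word line of a costs 100 ns and supports four multiply-add pairs (8 FLOPs), which suggests a peak on the order of 8 FLOPs / 100 ns = 80 MFLOPS under these assumptions.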

4. Extending this further, consider the problem of multiplying two dense matrices of
dimension 4K x 4K. What is the peak achievable performance using a three-loop
dot-product based formulation? (Assume that matrices are laid out in a row-major fashion.)
/* matrix-matrix product loop */
for (i = 0; i < dim; i++)
    for (j = 0; j < dim; j++)
        for (k = 0; k < dim; k++)
            c[i][j] += a[i][k] * b[k][j];
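
For orientation only, a rough bound under assumptions of mine: with a row-major layout, a[i][k] streams through cache lines efficiently (one 100 ns fetch per four values), but successive accesses b[k][j] in the inner loop are 16 KB apart, so each lands on a different cache line, and a full column's worth of lines (4K lines x 16 bytes = 64 KB) exceeds the 32 KB cache, leaving little of b to be reused across iterations. Charging roughly one 100 ns line fetch per multiply-add pair (2 FLOPs) for b alone bounds the rate at about 2 FLOPs / 100 ns = 20 MFLOPS, ignoring the smaller cost of streaming a.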
5. Consider an SMP with a distributed shared address space. Consider a simple cost
model in which it takes 10 ns to access local cache, 100 ns to access local memory, and
400 ns to access remote memory. A parallel program is running on this machine. The
program is perfectly load balanced, with 80% of all accesses going to local cache, 10% to
local memory, and 10% to remote memory. What is the effective memory access time for
this computation? If the computation is memory bound, what is the peak computation
rate?
Now consider the same computation running on one processor. Here, the processor hits
the cache 70% of the time and local memory 30% of the time. What is the effective peak
computation rate for one processor? What is the fractional computation rate of a processor
in a parallel configuration as compared to the serial configuration?
Hint: Notice that the cache hit rate for multiple processors is higher than that for one
processor. This is typically because the aggregate cache available on multiprocessor systems
is larger than that on single-processor systems.
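
As an illustrative check only, the weighted-average arithmetic can be written out as below; the assumption that a memory-bound computation performs one floating-point operation per memory access is mine, not the assignment's.

/* Illustrative sketch: effective access time as a weighted average, and
   the corresponding rate assuming one FLOP per memory access.          */
#include <stdio.h>

int main(void) {
    /* parallel configuration: 80% cache, 10% local memory, 10% remote memory */
    double t_par = 0.8 * 10.0 + 0.1 * 100.0 + 0.1 * 400.0;    /* ns per access */
    /* serial configuration: 70% cache, 30% local memory                       */
    double t_ser = 0.7 * 10.0 + 0.3 * 100.0;                  /* ns per access */
    printf("parallel: %.0f ns/access -> %.1f MFLOPS per processor\n",
           t_par, 1000.0 / t_par);
    printf("serial:   %.0f ns/access -> %.1f MFLOPS\n", t_ser, 1000.0 / t_ser);
    printf("fractional rate (parallel / serial): %.2f\n", t_ser / t_par);
    return 0;
}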
