Assignment 1: Sample Solution
Assignment Objectives:
• To reinforce your understanding of some key concepts/techniques introduced in class.
• To introduce you to doing independent study in parallel computing.
Assignment Questions:
Parallel speedup is defined as the ratio of the time required to compute some function using a
single processor (T1) divided by the time required to compute it using P processors (TP). That
is: speedup = T1/TP. For example, if it takes 10 seconds to run a program sequentially and
2 seconds to run it in parallel on some number of processors, P, then the speedup is 10/2 = 5
times.
Parallel efficiency measures how much use of the parallel processors we are making. For P
processors, it is defined as: efficiency = (1/P) x speedup = (1/P) x (T1/TP). For
example, continuing with the same example, if P is 10 processors and the speedup is 5 times,
then the parallel efficiency is 5/10 = 0.5. In other words, on average, only half of the processors
were used to gain the speedup and the other half were idle.
Amdahl’s law states that the maximum speedup possible in parallelizing an algorithm is
limited by the sequential portion of the code. Given an algorithm which is P% parallel,
Amdahl’s law states that: MaximumSpeedup = 1/(1 - (P/100)). For example, if 80% of
a program is parallel, then the maximum speedup is 1/(1 - 0.8) = 1/0.2 = 5 times. If the program in
question took 10 seconds to run serially, the best we could hope for in a parallel execution
would be for it to take 2 seconds (10/5 = 2). This is because the serial 20% of the program
cannot be sped up: it takes 0.2 x 10 seconds = 2 seconds even if the rest of the code is run
perfectly in parallel on an infinite number of processors so that it takes 0 seconds to execute.
The Gustafson-Barsis law states that speedup tends to increase with problem size (since the
fraction of time spent executing serial code goes down). Gustafson-Barsis’ law is thus a
measure of what is known as “scaled speedup” (scaled by the number of processors used on a
problem) and it can be stated as: MaximumScaledSpeedup=p+(1-p)s, where p is the
number of processors and s is the fraction of total execution time spent in serial code. This
law tells us that attainable speedup is often related to problem size not just the number of
processors used. In essence Amdahl’s law assumes that the percentage of serial code is
independent of problem size. This is not necessarily true. (E.g. consider overhead for
managing the parallelism: synchronization, etc.). Thus, in some sense, Gustafson-Barsis’ law
generalizes Amdahl’s law.
[15] 2. Another class of parallel architecture is the pipelined vector processor (PVP). PVP machines
consist of one or more processors each of which is tailored to perform vector operations very
efficiently. An example of such a machine is the NEC SX-5. Do a little online research and
then describe what a PVP is, and how, specifically, it supports fast vector operations. Another
type of closely related parallel architecture is the array processor (AP). AP machines support
fast array operations. An early example of such a machine was the ILLIAC-IV. Do a little
more research then describe what an AP is and how it supports fast array operations. Which
architecture do you think offers more potential parallelism? Why?
A PVP is a parallel architecture where each machine consists of one or more processors
designed explicitly to support vector operations. To make vector operations efficient, each
processor has multiple, deep D-Unit pipelines and supports a vector instruction set. The
availability of the vector instructions allows a continuous flow of vector elements into the D-
Unit pipelines thereby making them efficient. The high level architecture of a PVP looks like:
[Figure: high-level PVP architecture. A high-speed main memory (banks M0, M1, M2, ..., Mn-1) is connected to an instruction processing unit, a scalar processor (scalar registers plus pipelines Pipe 1 ... Pipe P), and one or more vector processors, each with a vector instruction controller, a vector access controller, vector registers, and D-Unit pipelines Pipe 1 ... Pipe N.]
In general, APs are likely to be more scalable and therefore offer more potential parallelism.
Of course, this will depend on the problem being solved being able to make use of the
parallelism offered. The parallelism provided in one D-Unit pipeline in a PVP is limited by
the number of stages in the pipeline. Further, there is unlikely to be sufficient ILP to keep
very many D-Unit pipelines busy.
[10] 3. Design an EREW-PRAM algorithm for computing the inner product of two vectors, X and Y,
of length N, (where N is a power of two). Your algorithm should run in O(logN) time with N
processors. You should express your algorithm in a fashion similar to the “fan-in” algorithm
done in class using PARDO and explicitly indicating which variables are shared between
processes and which are local.
There are a variety of possible O(logN) EREW PRAM parallel algorithms for inner product.
Here is one possible algorithm.
We begin by extending the merge network from the one given in class, which can merge
two sorted 4-element sequences, to one that can merge two sorted 8-element sequences. We use two
of the smaller merge networks (with odd and even inputs grouped) and feed the outputs of the
two small networks to another layer of comparators as shown. In all cases, corresponding
pairs are routed to the same node in the new layer to compare adjacent result values.
[Figure: 8-element merge network. Odd-indexed inputs (A1, B1, A3, B3, A5, B5, A7, B7) feed one 4-element merge network and even-indexed inputs (A2, B2, A4, B4, A6, B6, A8, B8) feed the other; a final layer of comparators combines corresponding outputs.]
Now we combine the necessary stages of the merge network to implement the sort in the
same way as was done in class for the 4 element networks.
[20] 5. You are to write a simple p-threads program that will implement a parallel search in a large
vector. Your main thread will declare a global shared vector of 6,400,000 integers where each
element is initialized to the value of the corresponding vector index (i.e. element i in the
vector will always contain the value i). The main thread should then prompt for and read two
values (one between 0 and 6,399,999, which is the value to be searched for in the vector, and a
second, having the value 1, 2, 8 or 16, which is the number of threads to use in the search).
In C, you might use the following code sequence to do the necessary I/O:
Your main thread should then spawn NumThreads threads each running the function
search which is passed its thread number, MyThreadNum (in the range 0 to
NumThreads-1), as a parameter. The searching in the vector must be partitioned into
NumThreads pieces with each piece being done by a separate thread. After partitioning,
each piece of the vector will contain NumElts = 6,400,000/NumThreads elements.
Your search routine should thus search for the value SearchVal in elements
MyThreadNum * NumElts through ((MyThreadNum+1) * NumElts) - 1
(inclusive) of the vector. For convenience, you may assume that SearchVal is declared
globally like the vector. Whichever thread finds the value in the vector should print a
message announcing that the value has been found and the position in the vector at which it
was found. It is not necessary to stop other threads once a value is found. You are to include
calls to measure the runtime of your program (C code will be provided on the homepage for
this purpose if you need it). Run your program multiple times with 1, 2, 8 and 16 threads on a
machine in the Linux lab and then on one of the machines helium-XX.cs.umanitoba.ca
(where XX=01 through 05) and collect the results. Compute the average times for each case
and then generate a simple bar graph of your two sets of runs comparing the relative average
execution times. Given the results in each case, how many execution units (i.e. processors or
cores) do you think each machine has? Some links to p-threads tutorials are provided in case
anyone is unfamiliar with programming with p-threads.
See the course homepage for sample code (pthreadSearch.c).
[Graph: relative average execution times on the helium-XX machines for 1, 2, 8 and 16 threads.]
[Graph: relative average execution times on the Linux lab machines for 1, 2, 8 and 16 threads.]
Based on the results, it would appear that the Linux lab machines have two cores and the
helium-XX machines have eight cores. This is determined by the point at which adding more
processors ceases to yield speedup. Interestingly, the helium-XX machines actually have 16
cores so there is some other factor affecting the performance of the embarrassingly parallel
search process.
[10] 6. A simple approach to blur boundaries in a grayscale bitmap image would be to replace each
pixel with the average of its immediate neighboring pixels. A given pixel, Px,y at location
(x,y) in an image of size (X,Y) will have neighbors at (x-1,y-1), (x-1,y), (x-1,y+1), (x, y-1),
(x,y+1), (x+1,y-1), (x+1,y), and (x+1, y+1) unless it is on a boundary of the image (i.e. x=0
or x=X-1 and/or y=0 or y=Y-1). Illustrate the application of the PCAM method to this problem.
Assume that X=Y=N, that N mod 16=0 and that you have 16 processors. Justify your choices.
The PCAM parallel algorithm design methodology consists of four steps: Partitioning,
Communication, Agglomeration, and Mapping.
In the partitioning step, we decompose the computation into its finest natural units: one task per pixel, giving an X x Y (= N x N) grid of fine-grained partitions with indices 0 through X-1 in each dimension.
[Figure: the image decomposed into one partition per pixel, indices 0 to X-1.]
In the communication step, we determine the communication patterns between the partitions.
In this problem, each partition communicates with its neighboring partitions as defined in the
problem description, since the computation is NewP(x,y) = (P(x-1,y-1) + P(x-1,y) + P(x-1,y+1)
+ P(x,y-1) + P(x,y+1) + P(x+1,y-1) + P(x+1,y) + P(x+1,y+1)) / 8. In essence, each node must send its old pixel value to each
of its neighbors and every partition can do this in parallel. Hence, for a particular
partition/pixel, (i,j), the communication patterns are as follows:
[Figure: pixel (i,j) exchanging values with its eight neighboring partitions.]
This communication pattern repeats throughout the partitions. We assume that pixels on the
boundaries are available through this communication process.
In the agglomeration step, we combine the fine-grained per-pixel tasks into 16 larger partitions, one per processor. Since N mod 16 = 0, one natural choice is 16 horizontal strips of N/16 rows each (a 4 x 4 grid of square blocks would reduce boundary communication further); communication is then needed only for pixels on partition boundaries.
Finally, in mapping, we assign the agglomerates onto the available processors. Assuming that
communication/data sharing is equally efficient among all processors then the assignment of
agglomerates to processors is arbitrary.
Total: 75 marks