Assignment 1: Sample Solution
Assignment Objectives:
• To reinforce your understanding of some key concepts/techniques introduced in class.
• To introduce you to doing independent study in parallel computing.
Assignment Questions:
Parallel speedup is defined as the ratio of the time required to compute some function using a
single processor (T1) divided by the time required to compute it using P processors (TP). That
is: speedup = T1/TP. For example, if it takes 10 seconds to run a program sequentially and
2 seconds to run it in parallel on some number of processors, P, then the speedup is 10/2 = 5
times.
Parallel efficiency measures how much use of the parallel processors we are making. For P
processors, it is defined as: efficiency = (1/P) x speedup = (1/P) x (T1/TP). For
example, continuing with the same example, if P is 10 processors and the speedup is 5 times,
then the parallel efficiency is 5/10 = 0.5. In other words, on average, only half of the processors
were used to gain the speedup and the other half were idle.
Amdahl’s law states that the maximum speedup possible in parallelizing an algorithm is
limited by the sequential portion of the code. Given an algorithm which is P% parallel,
Amdahl’s law states that: MaximumSpeedup = 1/(1 - (P/100)). For example, if 80% of
a program is parallel, then the maximum speedup is 1/(1 - 0.8) = 1/0.2 = 5 times. If the program in
question took 10 seconds to run serially, the best we could hope for in a parallel execution
would be for it to take 2 seconds (10/5 = 2). This is because the serial 20% of the program
cannot be sped up: it takes 0.2 x 10 seconds = 2 seconds even if the rest of the code is run
perfectly in parallel on an infinite number of processors so that it takes 0 seconds to execute.
The Gustafson-Barsis law states that speedup tends to increase with problem size (since the
fraction of time spent executing serial code goes down). Gustafson-Barsis’ law is thus a
measure of what is known as “scaled speedup” (scaled by the number of processors used on a
problem) and it can be stated as: MaximumScaledSpeedup=p+(1-p)s, where p is the
number of processors and s is the fraction of total execution time spent in serial code. This
law tells us that attainable speedup is often related to problem size not just the number of
processors used. In essence Amdahl’s law assumes that the percentage of serial code is
independent of problem size. This is not necessarily true. (E.g. consider overhead for
managing the parallelism: synchronization, etc.). Thus, in some sense, Gustafson-Barsis’ law
generalizes Amdahl’s law.
[15] 2. Another class of parallel architecture is the pipelined vector processor (PVP). PVP machines
consist of one or more processors each of which is tailored to perform vector operations very
efficiently. An example of such a machine is the NEC SX-5. Do a little online research and
then describe what a PVP is, and how, specifically, it supports fast vector operations. Another
type of closely related parallel architecture is the array processor (AP). AP machines support
fast array operations. An early example of such a machine was the ILLIAC-IV. Do a little
more research then describe what an AP is and how it supports fast array operations. Which
architecture do you think offers more potential parallelism? Why?
A PVP is a parallel architecture where each machine consists of one or more processors
designed explicitly to support vector operations. To make vector operations efficient, each
processor has multiple, deep D-Unit pipelines and supports a vector instruction set. The
availability of the vector instructions allows a continuous flow of vector elements into the D-
Unit pipelines thereby making them efficient. The high level architecture of a PVP looks like:
[Figure: high-level PVP architecture. A high-speed main memory (banks M0, M1, M2, ..., Mn-1) is connected to an instruction processing unit, a scalar processor (scalar registers plus pipelines Pipe 1 ... Pipe P), and one or more vector processors, each with a vector instruction controller, a vector access controller, vector registers, and D-Unit pipelines Pipe 1 ... Pipe N.]
In general, APs are likely to be more scalable and therefore offer more potential parallelism.
Of course, this will depend on the problem being solved being able to make use of the
parallelism offered. The parallelism provided in one D-Unit pipeline in a PVP is limited by
the number of stages in the pipeline. Further, there is unlikely to be sufficient ILP to keep
very many D-Unit pipelines busy.
[10] 3. Design an EREW-PRAM algorithm for computing the inner product of two vectors, X and Y,
of length N, (where N is a power of two). Your algorithm should run in O(logN) time with N
processors. You should express your algorithm in a fashion similar to the “fan-in” algorithm
done in class using PARDO and explicitly indicating which variables are shared between
processes and which are local.
There are a variety of possible O(logN) EREW PRAM parallel algorithms for inner product.
Here is one possible algorithm.
We begin by extending the merge network from the one given in class, which can merge
two sorted 4-element sequences, to one that can merge two sorted 8-element sequences. We use two
of the smaller merge networks (with odd and even inputs grouped) and feed the outputs of the
two small networks to another layer of comparators as shown. In all cases, corresponding
pairs are routed to the same node in the new layer to compare adjacent result values.
[Figure: 8-element merge network. Odd-indexed inputs (A1, B1, A3, B3, A5, B5, A7, B7) feed one 4-element merge network and even-indexed inputs (A2, B2, A4, B4, A6, B6, A8, B8) feed the other; a final layer of comparators combines corresponding outputs.]
Now we combine the necessary stages of the merge network to implement the sort in the
same way as was done in class for the 4 element networks.
[20] 5. You are to write a simple p-threads program that will implement a parallel search in a large
vector. Your main thread will declare a global shared vector of 6,400,000 integers where each
element is initialized to the value of the corresponding vector index (i.e. element i in the
vector will always contain the value i). The main thread should then prompt for and read two
values (one between 0 and 6,399,999, which is the value to be searched for in the vector, and a
second, having the value 1, 2, 8 or 16, which is the number of threads to use in the search).
In C, you might use the following code sequence to do the necessary I/O:
Your main thread should then spawn NumThreads threads each running the function
search which is passed its thread number, MyThreadNum (in the range 0 to
NumThreads-1), as a parameter. The searching in the vector must be partitioned into
NumThreads pieces with each piece being done by a separate thread. After partitioning,
each piece of the vector will contain NumElts = 6,400,000/NumThreads elements.
Your search routine should thus search for the value SearchVal in elements
MyThreadNum * NumElts through ((MyThreadNum+1) * NumElts) - 1
(inclusive) of the vector. For convenience, you may assume that SearchVal is declared
globally like the vector. Whichever thread finds the value in the vector should print a
message announcing that the value has been found and the position in the vector at which it
was found. It is not necessary to stop other threads once a value is found. You are to include
calls to measure the runtime of your program (C code will be provided on the homepage for
this purpose if you need it). Run your program multiple times with 1, 2, 8 and 16 threads on a
machine in the Linux lab and then on one of the machines helium-XX.cs.umanitoba.ca
(where XX=01 through 05) and collect the results. Compute the average times for each case
and then generate a simple bar graph of your two sets of runs comparing the relative average
execution times. Given the results in each case, how many execution units (i.e. processors or
cores) do you think each machine has? Some links to p-threads tutorials are provided in case
anyone is unfamiliar with programming with p-threads.
See the course homepage for sample code (pthreadSearch.c).
[Graph: relative average execution times on the helium-XX machines for 1, 2, 8 and 16 threads.]
[Graph: relative average execution times on the Linux lab machines for 1, 2, 8 and 16 threads.]
Based on the results, it would appear that the Linux lab machines have two cores and the
helium-XX machines have eight cores. This is determined by the point at which adding more
processors ceases to yield speedup. Interestingly, the helium-XX machines actually have 16
cores so there is some other factor affecting the performance of the embarrassingly parallel
search process.
[10] 6. A simple approach to blur boundaries in a grayscale bitmap image would be to replace each
pixel with the average of its immediate neighboring pixels. A given pixel, Px,y at location
(x,y) in an image of size (X,Y) will have neighbors at (x-1,y-1), (x-1,y), (x-1,y+1), (x, y-1),
(x,y+1), (x+1,y-1), (x+1,y), and (x+1, y+1) unless it is on a boundary of the image (i.e. x=0
or x=X-1 and/or y=0 or y=Y-1). Illustrate the application of the PCAM method to this problem.
Assume that X=Y=N, that N mod 16=0 and that you have 16 processors. Justify your choices.
The PCAM parallel algorithm design methodology consists of four steps: Partitioning,
Communication, Agglomeration, and Mapping.
In the partitioning step, we decompose the computation into its finest natural units: one task per pixel, giving an X x Y (= N x N) grid of fine-grained partitions with indices 0 through X-1 in each dimension.
[Figure: the image decomposed into one partition per pixel, indices 0 to X-1.]
In the communication step, we determine the communication patterns between the partitions.
In this problem, each partition communicates with its neighboring partitions as defined in the
problem description, since the computation is NewP(x,y) = (P(x-1,y-1) + P(x-1,y) + P(x-1,y+1)
+ P(x,y-1) + P(x,y+1) + P(x+1,y-1) + P(x+1,y) + P(x+1,y+1)) / 8. In essence, each node must send its old pixel value to each
of its neighbors and every partition can do this in parallel. Hence, for a particular
partition/pixel, (i,j), the communication patterns are as follows:
[Figure: pixel (i,j) exchanging values with its eight neighboring partitions.]
This communication pattern repeats throughout the partitions. We assume that pixels on the
boundaries are available through this communication process.
In the agglomeration step, we combine the fine-grained per-pixel tasks into 16 larger partitions, one per processor. Since N mod 16 = 0, one natural choice is 16 horizontal strips of N/16 rows each (a 4 x 4 grid of square blocks would reduce boundary communication further); communication is then needed only for pixels on partition boundaries.
Finally, in mapping, we assign the agglomerates onto the available processors. Assuming that
communication/data sharing is equally efficient among all processors then the assignment of
agglomerates to processors is arbitrary.
Total: 75 marks