V. Rajaraman, C. Siva Ram Murthy - Parallel Computers Architecture and Programming-PHI (2016)
V. RAJARAMAN
Honorary Professor
Supercomputer Education and Research Centre
Indian Institute of Science Bangalore
C. SIVA RAM MURTHY
Richard Karp Institute Chair Professor
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Chennai
Delhi-110092
2016
PARALLEL COMPUTERS: Architecture and Programming, Second Edition
V. Rajaraman and C. Siva Ram Murthy
© 2016 by PHI Learning Private Limited, Delhi. All rights reserved. No part of this book may be reproduced in any form, by
mimeograph or any other means, without permission in writing from the publisher.
ISBN-978-81-203-5262-9
The export rights of this book are vested solely with the publisher.
Published by Asoke K. Ghosh, PHI Learning Private Limited, Rimjhim House, 111, Patparganj Industrial Estate, Delhi-
110092 and Printed by Mohan Makhijani at Rekha Printers Private Limited, New Delhi-110020.
To
the memory of my dear nephew Dr. M.R. Arun
— V. Rajaraman
To
the memory of my parents, C. Jagannadham and C. Subbalakshmi
— C. Siva Ram Murthy
Table of Contents
Preface
1. Introduction
1.1 WHY DO WE NEED HIGH SPEED COMPUTING?
1.1.1 Numerical Simulation
1.1.2 Visualization and Animation
1.1.3 Data Mining
1.2 HOW DO WE INCREASE THE SPEED OF COMPUTERS?
1.3 SOME INTERESTING FEATURES OF PARALLEL COMPUTERS
1.4 ORGANIZATION OF THE BOOK
EXERCISES
BIBLIOGRAPHY
2. Solving Problems in Parallel
2.1 UTILIZING TEMPORAL PARALLELISM
2.2 UTILIZING DATA PARALLELISM
2.3 COMPARISON OF TEMPORAL AND DATA PARALLEL PROCESSING
2.4 DATA PARALLEL PROCESSING WITH SPECIALIZED PROCESSORS
2.5 INTER-TASK DEPENDENCY
2.6 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
3. Instruction Level Parallel Processing
3.1 PIPELINING OF PROCESSING ELEMENTS
3.2 DELAYS IN PIPELINE EXECUTION
3.2.1 Delay Due to Resource Constraints
3.2.2 Delay Due to Data Dependency
3.2.3 Delay Due to Branch Instructions
3.2.4 Hardware Modification to Reduce Delay Due to Branches
3.2.5 Software Method to Reduce Delay Due to Branches
3.3 DIFFICULTIES IN PIPELINING
3.4 SUPERSCALAR PROCESSORS
3.5 VERY LONG INSTRUCTION WORD (VLIW) PROCESSOR
3.6 SOME COMMERCIAL PROCESSORS
3.6.1 ARM Cortex A9 Architecture
3.6.2 Intel Core i7 Processor
3.6.3 IA-64 Processor Architecture
3.7 MULTITHREADED PROCESSORS
3.7.1 Coarse Grained Multithreading
3.7.2 Fine Grained Multithreading
3.7.3 Simultaneous Multithreading
3.8 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
4. Structure of Parallel Computers
4.1 A GENERALIZED STRUCTURE OF A PARALLEL COMPUTER
4.2 CLASSIFICATION OF PARALLEL COMPUTERS
4.2.1 Flynn’s Classification
4.2.2 Coupling Between Processing Elements
4.2.3 Classification Based on Mode of Accessing Memory
4.2.4 Classification Based on Grain Size
4.3 VECTOR COMPUTERS
4.4 A TYPICAL VECTOR SUPERCOMPUTER
4.5 ARRAY PROCESSORS
4.6 SYSTOLIC ARRAY PROCESSORS
4.7 SHARED MEMORY PARALLEL COMPUTERS
4.7.1 Synchronization of Processes in Shared Memory Computers
4.7.2 Shared Bus Architecture
4.7.3 Cache Coherence in Shared Bus Multiprocessor
4.7.4 MESI Cache Coherence Protocol
4.7.5 MOESI Protocol
4.7.6 Memory Consistency Models
4.7.7 Shared Memory Parallel Computer Using an Interconnection Network
4.8 INTERCONNECTION NETWORKS
4.8.1 Networks to Interconnect Processors to Memory or Computers to Computers
4.8.2 Direct Interconnection of Computers
4.8.3 Routing Techniques for Directly Connected Multicomputer Systems
4.9 DISTRIBUTED SHARED MEMORY PARALLEL COMPUTERS
4.9.1 Cache Coherence in DSM
4.10 MESSAGE PASSING PARALLEL COMPUTERS
4.11 COMPUTER CLUSTER
4.11.1 Computer Cluster Using System Area Networks
4.11.2 Computer Cluster Applications
4.12 WAREHOUSE SCALE COMPUTING
4.13 SUMMARY AND RECAPITULATION
EXERCISES
BIBLIOGRAPHY
5. Core Level Parallel Processing
5.1 CONSEQUENCES OF MOORE’S LAW AND THE ADVENT OF CHIP MULTIPROCESSORS
5.2 A GENERALIZED STRUCTURE OF CHIP MULTIPROCESSORS
5.3 MULTICORE PROCESSORS OR CHIP MULTIPROCESSORS (CMPs)
5.3.1 Cache Coherence in Chip Multiprocessor
5.4 SOME COMMERCIAL CMPs
5.4.1 ARM Cortex A9 Multicore Processor
5.4.2 Intel i7 Multicore Processor
5.5 CHIP MULTIPROCESSORS USING INTERCONNECTION NETWORKS
5.5.1 Ring Interconnection of Processors
5.5.2 Ring Bus Connected Chip Multiprocessors
5.5.3 Intel Xeon Phi Coprocessor Architecture [2012]
5.5.4 Mesh Connected Many Core Processors
5.5.5 Intel Teraflop Chip [Peh, Keckler and Vangal, 2009]
5.6 GENERAL PURPOSE GRAPHICS PROCESSING UNIT (GPGPU)
EXERCISES
BIBLIOGRAPHY
6. Grid and Cloud Computing
6.1 GRID COMPUTING
6.1.1 Enterprise Grid
6.2 CLOUD COMPUTING
6.2.1 Virtualization
6.2.2 Cloud Types
6.2.3 Cloud Services
6.2.4 Advantages of Cloud Computing
6.2.5 Risks in Using Cloud Computing
6.2.6 What has Led to the Acceptance of Cloud Computing
6.2.7 Applications Appropriate for Cloud Computing
6.3 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
7. Parallel Algorithms
7.1 MODELS OF COMPUTATION
7.1.1 The Random Access Machine (RAM)
7.1.2 The Parallel Random Access Machine (PRAM)
7.1.3 Interconnection Networks
7.1.4 Combinational Circuits
7.2 ANALYSIS OF PARALLEL ALGORITHMS
7.2.1 Running Time
7.2.2 Number of Processors
7.2.3 Cost
7.3 PREFIX COMPUTATION
7.3.1 Prefix Computation on the PRAM
7.3.2 Prefix Computation on a Linked List
7.4 SORTING
7.4.1 Combinational Circuits for Sorting
7.4.2 Sorting on PRAM Models
7.4.3 Sorting on Interconnection Networks
7.5 SEARCHING
7.5.1 Searching on PRAM Models
Analysis
7.5.2 Searching on Interconnection Networks
7.6 MATRIX OPERATIONS
7.6.1 Matrix Multiplication
7.6.2 Solving a System of Linear Equations
7.7 PRACTICAL MODELS OF PARALLEL COMPUTATION
7.7.1 Bulk Synchronous Parallel (BSP) Model
7.7.2 LogP Model
7.8 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
8. Parallel Programming
8.1 MESSAGE PASSING PROGRAMMING
8.2 MESSAGE PASSING PROGRAMMING WITH MPI
8.2.1 Message Passing Interface (MPI)
8.2.2 MPI Extensions
8.3 SHARED MEMORY PROGRAMMING
8.4 SHARED MEMORY PROGRAMMING WITH OpenMP
8.4.1 OpenMP
8.5 HETEROGENEOUS PROGRAMMING WITH CUDA AND OpenCL
8.5.1 CUDA (Compute Unified Device Architecture)
8.5.2 OpenCL (Open Computing Language)
8.6 PROGRAMMING IN BIG DATA ERA
8.6.1 MapReduce
8.6.2 Hadoop
8.7 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
9. Compiler Transformations for Parallel Computers
9.1 ISSUES IN COMPILER TRANSFORMATIONS
9.1.1 Correctness
9.1.2 Scope
9.2 TARGET ARCHITECTURES
9.2.1 Pipelines
9.2.2 Multiple Functional Units
9.2.3 Vector Architectures
9.2.4 Multiprocessor and Multicore Architectures
9.3 DEPENDENCE ANALYSIS
9.3.1 Types of Dependences
9.3.2 Representing Dependences
9.3.3 Loop Dependence Analysis
9.3.4 Subscript Analysis
9.3.5 Dependence Equation
9.3.6 GCD Test
9.4 TRANSFORMATIONS
9.4.1 Data Flow Based Loop Transformations
9.4.2 Loop Reordering
9.4.3 Loop Restructuring
9.4.4 Loop Replacement Transformations
9.4.5 Memory Access Transformations
9.4.6 Partial Evaluation
9.4.7 Redundancy Elimination
9.4.8 Procedure Call Transformations
9.4.9 Data Layout Transformations
9.5 FINE-GRAINED PARALLELISM
9.5.1 Instruction Scheduling
9.5.2 Trace Scheduling
9.5.3 Software Pipelining
9.6 TRANSFORMATION FRAMEWORK
9.6.1 Elementary Transformations
9.6.2 Transformation Matrices
9.7 PARALLELIZING COMPILERS
9.8 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
10. Operating Systems for Parallel Computers
10.1 RESOURCE MANAGEMENT
10.1.1 Task Scheduling in Message Passing Parallel Computers
10.1.2 Dynamic Scheduling
10.1.3 Task Scheduling in Shared Memory Parallel Computers
10.1.4 Task Scheduling for Multicore Processor Systems
10.2 PROCESS MANAGEMENT
10.2.1 Threads
10.3 PROCESS SYNCHRONIZATION
10.3.1 Transactional Memory
10.4 INTER-PROCESS COMMUNICATION
10.5 MEMORY MANAGEMENT
10.6 INPUT/OUTPUT (DISK ARRAYS)
10.6.1 Data Striping
10.6.2 Redundancy Mechanisms
10.6.3 RAID Organizations
10.7 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
11. Performance Evaluation of Parallel Computers
11.1 BASICS OF PERFORMANCE EVALUATION
11.1.1 Performance Metrics
11.1.2 Performance Measures and Benchmarks
11.2 SOURCES OF PARALLEL OVERHEAD
11.2.1 Inter-processor Communication
11.2.2 Load Imbalance
11.2.3 Inter-task Synchronization
11.2.4 Extra Computation
11.2.5 Other Overheads
11.2.6 Parallel Balance Point
11.3 SPEEDUP PERFORMANCE LAWS
11.3.1 Amdahl’s Law
11.3.2 Gustafson’s Law
11.3.3 Sun and Ni’s Law
11.4 SCALABILITY METRIC
11.4.1 Isoefficiency Function
11.5 PERFORMANCE ANALYSIS
11.6 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
Appendix
Index
Preface
Of late there has been a lot of interest all over the world in parallel processors and parallel computers. This is because all current microprocessors are parallel
processors. Each processor in a microprocessor chip is called a core and such a
microprocessor is called a multicore processor. Multicore processors have an on-chip
memory of a few megabytes (MB). Before trying to answer the question “What is a parallel
computer?”, we will briefly review the structure of a single processor computer (Fig. 1.1). It
consists of an input unit which accepts (or reads) the list of instructions to solve a problem (a
program) and data relevant to that problem. It has a memory or storage unit in which the
program, data and intermediate results are stored, a processing element which we will
abbreviate as PE (also called a Central Processing Unit (CPU)) which interprets and executes
instructions, and an output unit which displays or prints the results.
The role of experiments, theoretical models, and numerical simulation is shown in Fig.
1.2. A theoretically developed model is used to simulate the physical system. The results of
simulation allow one to eliminate a number of unpromising designs and concentrate on those
which exhibit good performance. These results are used to refine the model and carry out
further numerical simulation. Once a good design on a realistic model is obtained, it is used
to construct a prototype for experimentation. The results of experiments are used to refine the
model, simulate it and further refine the system. This repetitive process is used until a
satisfactory system emerges. The main point to note is that experiments on actual systems are
not eliminated but the number of experiments is reduced considerably. This reduction leads to
substantial cost saving. There are, of course, cases where actual experiments cannot be
performed such as assessing damage to an aircraft when it crashes. In such a case simulation
is the only feasible method.
Figure 1.2 Interaction between theory, experiments and computer simulation.
With advances in science and engineering, the models used nowadays incorporate more
details. This has increased the demand for computing and storage capacity. For example, to
model global weather, we have to model the behaviour of the earth’s atmosphere. The
behaviour is modelled by partial differential equations in which the most important variables
are the wind speed, air temperature, humidity and atmospheric pressure. The objective of
numerical weather modelling is to predict the status of the atmosphere at a particular region
at a specified future time based on the current and past observations of the values of
atmospheric variables. This is done by solving the partial differential equations numerically
in regions or grids specified by using lines parallel to the latitude and longitude and using a
number of atmospheric layers. In one model (see Fig. 1.3), the regions are demarcated by
using 180 latitudes and 360 longitudes (meridian circles) equally spaced around the globe. In
the vertical direction 12 layers are used to describe the atmosphere. The partial differential
equations are solved by discretizing them to difference equations which are in turn solved as
a set of simultaneous algebraic equations. For each region one point is taken as representing
the region and this is called a grid point. At each grid point in this problem, there are 5
variables (namely air velocity, temperature, pressure, humidity, and time) whose values are
stored. The simultaneous algebraic equations are normally solved using an iterative method.
In an iterative method several iterations (100 to 1000) are needed for each grid point before
the results converge. The calculation of each trial value normally requires around 100 to 500
floating point arithmetic operations. Thus, the total number of floating point operations
required for each simulation is approximately given by:
Number of floating point operations per simulation
= Number of grid points × Number of values per grid point × Number of trials × Number
of operations per trial
Figure 1.3 Grid for numerical weather model for the Earth.
In this example we have:
Number of grid points = 180 × 360 × 12 = 777600
Number of values per grid point = 5
Number of trials = 500
Number of operations per trial = 400
Thus, the total number of floating point operations required per simulation = 777600 × 5 × 500 × 400 = 7.776 × 10^11. If each floating point operation takes 100 ns, the total time taken for one simulation = 7.8 × 10^4 s = 21.7 h. If we want to predict the weather at intervals of 6 h, there is no point in computing for 21.7 h for a prediction! If we want to simulate this problem, a floating point arithmetic operation on 64-bit operands should be completed within 10 ns. This time is too short for a computer which does not use any parallelism and we need a
parallel computer to solve such a problem. In general the complexity of a problem of this
type may be described by the formula:
Problem complexity = G × V × T × A
where
G = Geometry of the grid system used
V = Variables per grid point
T = Number of steps per simulation for solving the problem
A = Number of floating point operations per step
For the weather modelling problem,
G = 777600, V = 5, T = 500 and A = 400 giving problem complexity = 7.8 × 10^11.
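The arithmetic above is easy to check with a few lines of code. The following C fragment is a quick sketch (not taken from the book) that recomputes the number of operations per simulation and the corresponding times at 100 ns and at 10 ns per floating point operation.

#include <stdio.h>

int main(void) {
    double G = 180.0 * 360.0 * 12.0;   /* grid points               */
    double V = 5.0;                    /* variables per grid point  */
    double T = 500.0;                  /* trials (iterations)       */
    double A = 400.0;                  /* operations per trial      */
    double flops = G * V * T * A;      /* about 7.776e11 operations */

    printf("Operations per simulation = %.3e\n", flops);
    printf("Time at 100 ns/operation  = %.1f hours\n", flops * 100e-9 / 3600.0);
    printf("Time at 10 ns/operation   = %.1f hours\n", flops * 10e-9 / 3600.0);
    return 0;
}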
There are many other problems whose complexity is of the order of 10^12 to 10^20. For example, the complexity of numerical simulation of turbulent flows around aircraft wings and body is around 10^15. Some other areas where numerically intensive simulation is required are:
In this chapter we will explain with examples how simple jobs can be solved in parallel in
many different ways. The simple examples will illustrate many important points in perceiving
parallelism, and in allocating tasks to processors for getting maximum efficiency in solving
problems in parallel.
2.1 UTILIZING TEMPORAL PARALLELISM
Suppose 1000 candidates appear in an examination. Assume that there are answers to 4
questions in each answer book. If a teacher is to correct these answer books, the following
instructions may be given to him:
Procedure 2.1 Instructions given to a teacher to correct an answer book
Step 1: Take an answer book from the pile of answer books.
Step 2: Correct the answer to Q1 namely, A1.
Step 3: Repeat Step 2 for answers to Q2, Q3, Q4, namely, A2, A3, A4.
Step 4: Add marks given for each answer.
Step 5: Put answer book in a pile of corrected answer books.
Step 6: Repeat Steps 1 to 5 until no more answer books are left in the input.
A teacher correcting 1000 answer books using Procedure 2.1 is shown in Fig. 2.1. If a
paper takes 20 minutes to correct, then 20,000 minutes will be taken to correct 1000 papers.
If we want to speedup correction, we can do it in the following ways:
The speedup with k teachers working in a pipeline is k/[1 + (k – 1)/n], where n is the number of answer books. If n >> k then (k – 1)/n → 0 and the speedup is nearly equal to k, the number of teachers in the pipeline. Thus, the speedup is directly proportional to the number of teachers working in a pipeline mode provided the number of jobs is very large as compared to the number of tasks per job. The main problems encountered in implementing this method are:
1. Synchronization. Identical time should be taken for doing each task in the pipeline so that
a job can flow smoothly in the pipeline without holdup.
2. Bubbles in pipeline. If some tasks are absent in a job “bubbles” form in the pipeline. For
example, if there are some answer books with only 2 questions answered, two teachers will
be forced to be idle during the time allocated to them to correct these answers.
3. Fault tolerance. The system does not tolerate faults. If one of the teachers takes a coffee
break, the entire pipeline is upset.
4. Inter-task communication. The time to pass answer books between teachers in the
pipeline should be much smaller as compared to the time taken to correct an answer by a
teacher.
5. Scalability. The number of teachers working in the pipeline cannot be increased beyond a
certain limit. The number of teachers would depend on how many independent questions a
question paper has. Further, the time taken to correct the answer to each of the questions
should be equal and the number of answer books n must be much larger than k, the number of
teachers in the pipeline.
In spite of these problems, this method is a very effective technique of using parallelism as
it is easy to perceive the possibility of using temporal parallelism in many jobs. It is very
efficient as each stage in the pipeline can be optimized to do a specific task well. (We know
that a teacher correcting the answer to one question in all answer books becomes an “expert”
in correcting that answer soon and does it very quickly!). Pipelining is used extensively in
processor design. It was the main technique used by vector supercomputers such as CRAY to
attain their high speed.
2.2 UTILIZING DATA PARALLELISM
Method 2: Data Parallelism
In this method, we divide the answer books into four piles and give one pile to each teacher
(see Fig. 2.3). Each teacher follows identical instructions given in Procedure 2.1 (see Section
2.1).
Assuming each teacher takes 20 minutes to correct an answer book, the time taken to
correct all the 1000 answer books is 5000 minutes as each teacher corrects only 250 papers
and all teachers work simultaneously. This type of parallelism is called data parallelism as
the input data is divided into independent sets and processed simultaneously. We can quantify
this method as shown below:
Let the number of jobs = n
Let the time to do a job = p
Let there be k teachers to do the job.
Figure 2.3 Four teachers working independently and simultaneously on 4 sets of answer books.
Let the time to distribute the jobs to k teachers be kq. Observe that this time is proportional
to the number of teachers.
The time to complete n jobs by a single teacher = np
The time to complete n jobs by k teachers = kq + np/k
Speedup = np/(kq + np/k) = k/[1 + k^2 q/(np)]
If we define efficiency as the ratio of actual speedup to the maximum possible speedup (k) we get:
Efficiency = 1/[1 + k^2 q/(np)]
If k = 100 then the efficiency is 1/[1 + 10^4 q/(np)], which is close to 1 only when np is much larger than 10^4 q.
Observe that the speedup is not directly proportional to the number of teachers as the time
to distribute jobs to teachers (which is an unavoidable overhead) increases as the number of
teachers is increased. The main advantages of this method are:
However, there are many situations in practice (for example, distributing portions of an
array to processors) where the task distribution time is negligible as only the starting and
ending array indices are given to each processor. In such cases the efficiency will be almost
100%.
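The speedup and efficiency expressions of Sections 2.1 and 2.2 can be tried out with a small program. The following C sketch (not from the book) uses the grading example with n = 1000 answer books, k = 4 teachers and p = 20 minutes per paper; the distribution overhead q is an assumed value, since the text leaves it as a parameter.

#include <stdio.h>

int main(void) {
    double n = 1000, k = 4, p = 20, q = 1;   /* q is an assumed overhead (minutes) */

    /* Temporal (pipeline) parallelism: speedup = k / (1 + (k - 1)/n) */
    double s_pipe = k / (1.0 + (k - 1.0) / n);

    /* Data parallelism: time = kq + np/k, so speedup = np / (kq + np/k) */
    double s_data = (n * p) / (k * q + n * p / k);

    printf("Pipeline speedup      = %.3f (efficiency %.1f%%)\n",
           s_pipe, 100.0 * s_pipe / k);
    printf("Data parallel speedup = %.3f (efficiency %.1f%%)\n",
           s_data, 100.0 * s_data / k);
    return 0;
}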
Method 3: Combined Temporal and Data Parallelism
We can combine Method 1 and Method 2 as shown in Fig. 2.4. Here two pipelines of teachers
are formed and each pipeline is given half the total number of jobs. This is called parallel
pipeline processing. This method almost halves the time taken by a single pipeline. If it takes 5 minutes to correct each answer (20 minutes per answer book), the time taken by the two pipelines, each grading 500 answer books, is (20 + 499 × 5) = 2515 minutes.
Figure 2.4 Eight teachers working in two pipelines to grade 2 sets of answer books.
Even though this method reduces the time to complete the set of jobs, it also has the
disadvantages of both temporal parallelism and to some extent that of data parallelism. The
method is effective only if the number of jobs given to each pipeline is much larger than the
number of stages in the pipeline.
Multiple pipeline processing was used in supercomputers such as Cray and NEC-SX as
this method is very efficient for numerical computing in which a number of long vectors and
large matrices are used as data and could be processed simultaneously.
Method 4: Data Parallelism with Dynamic Assignment
This method is shown in Fig 2.5. Here a head examiner gives one answer paper to each
teacher and keeps the rest with him. All teachers simultaneously correct the paper given to
them. A teacher who completes correction goes to the head examiner for another paper which
is given to him for correction. If a second teacher completes correction at the same time, then
he queues up in front of the head examiner and waits for his turn to get an answer paper. The
procedure is repeated till all the answer papers are corrected. The main advantage of this
method is the balancing of the work assigned to each teacher dynamically as work
progresses. A teacher who finishes his work quickly gets another paper immediately and he is
not forced to be idle. Further, the time to correct a paper may widely vary without creating a
bottleneck. The method is not affected by bubbles, namely, unanswered questions or blank
answer papers. The overall time taken for paper correction will be minimized. The main disadvantage of this method is that the head examiner may become a bottleneck; a teacher who finishes early may have to wait in a queue in front of him for the next paper, and this waiting time grows as the number of teachers increases. The following points may be noted about static and dynamic assignment of tasks to processors:
1. If the loads given to processors are balanced, i.e., almost equal, then a static assignment is very good.
2. One should resort to dynamic assignment of tasks to processors only if there is a wide variation in the completion time of tasks.
3. The task distribution overhead in quasi-dynamic assignment with coarse grained tasks is usually much smaller than in fine grained dynamic assignment and is thus a better method.
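The dynamic assignment of Method 4 maps naturally onto a shared work queue. The following POSIX threads sketch (not from the book; compile with -pthread) lets a shared counter play the role of the head examiner: each teacher thread repeatedly takes the next uncorrected paper, so faster teachers automatically end up correcting more papers. The paper count and correction times are made up for illustration.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define PAPERS   40
#define TEACHERS  4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int next_paper = 0;              /* index of the next uncorrected paper */
static int corrected_by[TEACHERS];      /* papers corrected by each teacher    */

static void *teacher(void *arg) {
    int id = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int paper = next_paper < PAPERS ? next_paper++ : -1;
        pthread_mutex_unlock(&lock);
        if (paper < 0)
            break;                          /* no papers left in the input pile */
        usleep(1000 * (1 + (paper % 3)));   /* unequal correction times         */
        corrected_by[id]++;
    }
    return NULL;
}

int main(void) {
    pthread_t t[TEACHERS];
    int id[TEACHERS];
    for (int i = 0; i < TEACHERS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, teacher, &id[i]);
    }
    for (int i = 0; i < TEACHERS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < TEACHERS; i++)
        printf("Teacher %d corrected %d papers\n", i + 1, corrected_by[i]);
    return 0;
}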
2.3 COMPARISON OF TEMPORAL AND DATA PARALLEL
PROCESSING
We compare the two methods of parallel processing we discussed in the last two sections in
Table 2.1.
TABLE 2.1 Comparison of Temporal and Data Parallel Processing

1. Temporal: Job is divided into a set of independent tasks and the tasks are assigned for processing.
   Data parallel: Full jobs are assigned for processing.
2. Temporal: Tasks should take equal time. Pipeline stages should thus be synchronized.
   Data parallel: Jobs may take different times. No need to synchronize the beginning of jobs.
3. Temporal: Bubbles in jobs lead to idling of processors.
   Data parallel: Bubbles do not cause idling of processors.
4. Temporal: Processors are specialized to do specific tasks efficiently.
   Data parallel: Processors should be general purpose and may not do all tasks efficiently.
7. Temporal: Efficient with fine grained tasks.
   Data parallel: Efficient with coarse grained tasks and quasi-dynamic scheduling.
8. Temporal: Scales well as long as the number of data items to be processed is much larger than the number of processors in the pipeline and the time taken to communicate a task from one processor to the next is negligible.
   Data parallel: Scales well as long as the number of jobs is much greater than the number of processors and the processing time is much higher than the time to distribute data to the processors.
2.4 DATA PARALLEL PROCESSING WITH SPECIALIZED
PROCESSORS
We saw that pipeline processing is effective if each answer takes the same time to correct and
there are not many answer papers with unanswered questions. Data parallel processing is
more tolerant but requires each teacher to be capable of correcting answers to all questions
with equal ease. If we have a situation where we would like to constrain each teacher to
correct only the answer to one question in each paper (may be to ensure uniformity in grading
or because he is a specialist who can do that job very quickly) and the answer books are such
that not all students answer all questions then a method using specialist data parallelism may
be developed as explained below.
Method 6: Specialist Data Parallelism
In this method, we assume that there is a head examiner who assigns answer papers to
teachers. The organization of this method is shown in Fig. 2.6. The procedure followed by the
head examiner is given as Procedure 2.2. We assume that teacher 1 (T1) grades A1, teacher 2
(T2) grades A2 and teacher i (Ti) the answer Ai to question Qi.
Figure 2.6 Head examiner assigning specific answers to each teacher to grade.
Procedure 2.2 Task assignment method followed by the head examiner
1. Give one answer paper to T1, T2, T3, T4 (Remark: Teacher Ti corrects only the
answer to question Qi).
2. When a corrected answer paper is returned check if all questions are graded. If yes
add marks and put the paper in the output pile.
3. If no, check which questions are not graded.
4. For each i, if Ai is ungraded and teacher Ti is idle, send the paper to teacher Ti. If any other teacher Tp is idle and an answer paper remains in the input pile with Ap uncorrected, send that paper to him.
5. Repeat Steps 2, 3 and 4 until no answer paper remains in the input pile and all
teachers are idle.
The main problem with this method is that the load is not balanced. If some answers take
much longer time to grade than others then some of the teachers will be busy while others are
idle. The same problem will occur if a particular question is not answered by many students.
Further, the head examiner will waste a lot of time seeing which questions are unanswered
and which teachers are idle before he is able to assign work to a teacher. In other words, the
maximum possible speedup will not be attained.
Method 7: Coarse Grained Specialist Temporal Parallel Processing
The same problem can also be done without a head examiner. Here all teachers work
independently and simultaneously at their pace. We will see, however, that each teacher will
end up spending a lot of time inefficiently waiting for other teachers to complete their work.
The model of processing is shown in Fig. 2.7. The answer papers are initially divided into 4
equal parts and one part is kept in an in-tray with each teacher. The teachers sit in a circle as
shown in Fig. 2.7. Teacher T1 takes a paper from his in tray, grades answer A1 to question Q1
in it and places it in his out-tray. When no papers are left in his in-tray, he checks if teacher T2’s in-tray is empty. If it is, he empties his out-tray (the graded papers) into T2’s in-tray and waits for his own in-tray to be filled. Teacher T1’s in-tray will be filled by teacher T4. This is a circular pipeline
arrangement with a coarse grain size. All papers would be graded when each teacher’s out-
tray is filled 4 times. The method is given more precisely as Procedure 2.3.
Figure 2.7 One teacher grades one answer in all papers—A circular pipeline method.
Procedure 2.3 Coarse grained specialist temporal processing
Answer papers are divided into 4 equal piles and put in the in-trays of each teacher. Each
teacher repeats 4 times Steps 1 to 5. All teachers work simultaneously.
For teachers Ti (i = 1 to 4) do in parallel Steps 1 to 5.
Step 1: Take an answer paper from in-tray.
Step 2: Grade answer Ai to question Qi and put it in out-tray.
Step 3: Repeat Steps 1 and 2 till no papers left in in-tray.
Step 4: Check if teacher T(i + 1) mod 4’s in-tray is empty.
Step 5: As soon as it is empty, empty own out-tray into the in-tray of that teacher.
Step 6: All answers will be graded (in other words the procedure will terminate) when each
teacher’s output tray is filled 4 times.
This section mainly illustrates the point that if special tasks are to be processed by special processors, then balancing the load is essential and the promised speedup of parallel processing is not attainable unless this condition is fulfilled. It may also be observed
that the method uses the concept of pipelined processing using a circular pipeline. Further,
each stage in the pipeline has a chunk of work to do. This method does not require strict
synchronization. It also tolerates bubbles due to unanswered questions in a paper or blank
answer papers.
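Procedure 2.3 can be simulated directly. The following C sketch (not from the book) represents the four piles and four teachers, grades one answer per pile per round, and passes each pile around the circular pipeline; after four rounds every answer in every pile has been graded.

#include <stdio.h>

#define TEACHERS 4
#define PAPERS_PER_PILE 250             /* 1000 papers split into 4 piles */

int main(void) {
    /* graded[p][q] = 1 when answer q of pile p has been graded */
    int graded[TEACHERS][TEACHERS] = {{0}};
    int pile_at[TEACHERS];              /* which pile is in teacher i's in-tray */
    for (int i = 0; i < TEACHERS; i++)
        pile_at[i] = i;                 /* initially pile i is with teacher i */

    for (int round = 0; round < TEACHERS; round++) {
        for (int t = 0; t < TEACHERS; t++)
            graded[pile_at[t]][t] = 1;  /* teacher t grades "his" answer in his pile */
        /* each teacher empties his out-tray into the next teacher's in-tray */
        int new_pile_at[TEACHERS];
        for (int t = 0; t < TEACHERS; t++)
            new_pile_at[(t + 1) % TEACHERS] = pile_at[t];
        for (int t = 0; t < TEACHERS; t++)
            pile_at[t] = new_pile_at[t];
    }

    int all_done = 1;
    for (int p = 0; p < TEACHERS; p++)
        for (int q = 0; q < TEACHERS; q++)
            if (!graded[p][q]) all_done = 0;
    printf("All %d papers graded after %d rounds: %s\n",
           TEACHERS * PAPERS_PER_PILE, TEACHERS, all_done ? "yes" : "no");
    return 0;
}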
Method 8: Agenda Parallelism
In this method an answer book is thought of as an agenda of answers to be graded. All
teachers are asked to work on the first item on the agenda, namely grade the answer to the
first question in all papers. A head examiner gives one paper to each teacher and asks him to
grade the answer A1 to Q1. When a teacher finishes this he is given another paper in which he
again grades A1. When A1 of all papers are graded then A2 is taken up by all teachers. This is
repeated till all the answers in all papers are graded. This is a data parallel method with dynamic scheduling and fine grained tasks. This method for the problem being considered is not a
good idea as the grain size is small and waiting time of teachers to receive a paper to grade
may exceed the time to correct it! Method 4 is much better.
We consolidate and summarize the features of all the 8 methods of parallel processing in
Fig. 2.8.
An interesting point brought out by this simple example is the rich opportunity which
exists to exploit parallelism. Many jobs have inherent parallel features and if we think deeply
many different methods for exploiting parallelism become evident. The important question is
to decide which of the methods is good. The choice of a method depends on the nature of the
problem, the type of input data, processors, interconnection between processors, etc. This will
be discussed again in later chapters.
Figure 2.8 Chart showing various methods of parallel processing.
2.5 INTER-TASK DEPENDENCY
So far we have made the following assumptions in evolving methods of assigning tasks to
teachers.
In general a job consists of tasks which are inter-related. Some tasks can be done
simultaneously and independently while others have to wait for the completion of previous
tasks. For example, in the grading example we have discussed, the answer to a question may
depend on the answers to previous questions. The inter-relations of various tasks of a job may
be represented graphically as a task graph. In this graph circles represent tasks which in the
example we have discussed are answers to be graded. A line with an arrow connecting two
circles shows dependency of tasks. The direction of an arrow shows precedence. A task at the
head of an arrow can be done after all tasks at their respective tails are done.
If the question paper is such that the answers to the 4 questions are independent of one
another the answers to the four questions can be graded in any order. If grading the answer to
Qi is called Ti then the task graph for this case may be represented as shown in Fig. 2.9(a). If
the question paper is such that the answer to question Qi+1 depends on the answer to Qi (i = 1, 2, 3), then Qi+1 cannot be graded (namely, task Ti+1) unless the answer to Qi is graded (task
Ti). The task graph for this case is shown in Fig. 2.9(b). If the question paper is such that T2 is
dependent on T1, and T4 on T1 and T3, then the task graph for this case may be drawn as
shown in Fig. 2.9(c).
If the tasks are to be assigned to teachers for pipeline processing (Method 1) then if the
tasks are independent (Fig. 2.9(a)) any task can be assigned to any teacher. If the task graph is
Fig. 2.9(b) or (c), then T1 should be assigned to the first teacher, T2 to the second teacher who
takes the paper from the first teacher and T3 and T4 assigned to teachers 3 and 4 respectively
who sit next to teacher 2 in the assembly line (or pipeline).
Figure 2.9 Task graphs for grading answer papers.
In the examples we have considered so far many identical jobs were processed in parallel.
Another important case is where there is one complicated job which yields one end product.
This job can be broken up into separate tasks which are cooperatively done faster by multiple
processors. A simple everyday example of this type is cooking using a given recipe. We give
as Procedure 2.4, the recipe for cooking Chinese vegetable fried rice.
Procedure 2.4 Recipe for Chinese vegetable fried rice
T1: Clean and wash rice
T2: Boil water in a vessel with 1 teaspoon salt
T3: Put rice in boiling water with some oil and cook till soft (do not overcook)
T4: Drain rice and cool
T5: Wash and scrape carrots
T6: Wash and string French beans
T7: Boil water with 1/2 teaspoon salt in two vessels
T8: Drop carrots and French beans separately in boiling water and keep for 1 minute
T9: Drain and cool carrots and French beans
T10: Dice carrots
T11: Dice French beans
T12: Peel onions and dice into small pieces.
T13: Clean cauliflower. Cut into small pieces
T14: Heat oil in iron pan and fry diced onion and cauliflower for 1 minute in heated oil
T15: Add diced carrots, French beans to above and fry for 2 minutes
T16: Add cooled cooked rice, chopped spring onions, and soya sauce to the above and stir and
fry for 5 minutes.
There are 16 tasks in cooking Chinese vegetable fried rice. Some of these tasks can be
carried out simultaneously whereas others have to be done in sequence. For instance tasks
T1, T2, T5, T6, T7, T12 and T13 can all be done simultaneously, whereas T8 cannot be done
unless T5, T6, T7 are done. A graph showing the relationship among tasks is given as in Fig.
2.10. The reader is urged to check the correctness of the graph.
Task   T1   T2   T3   T4   T5   T6   T7   T8
Time   5    10   15   10   5    10   8    1
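Given a task graph and task times, the earliest time at which each task can start is easy to compute. The following C sketch (not from the book) does this for the first eight tasks of the recipe using the times listed above and a dependency list read off the recipe (T3 needs T1 and T2, T4 needs T3, and, as stated in the text, T8 needs T5, T6 and T7); the remaining tasks are omitted because their times are not listed here.

#include <stdio.h>

#define NTASKS 8

int main(void) {
    int time[NTASKS + 1] = {0, 5, 10, 15, 10, 5, 10, 8, 1};   /* T1..T8 (minutes) */
    /* deps[i] lists the tasks that must finish before Ti can start (0-terminated) */
    int deps[NTASKS + 1][4] = {
        {0}, {0}, {0},          /* (unused), T1, T2: no predecessors */
        {1, 2, 0},              /* T3 after T1 and T2                */
        {3, 0},                 /* T4 after T3                       */
        {0}, {0}, {0},          /* T5, T6, T7: no predecessors       */
        {5, 6, 7, 0}            /* T8 after T5, T6 and T7            */
    };
    int finish[NTASKS + 1] = {0};

    for (int i = 1; i <= NTASKS; i++) {     /* tasks are already in a valid order */
        int start = 0;
        for (int k = 0; deps[i][k] != 0; k++)
            if (finish[deps[i][k]] > start)
                start = finish[deps[i][k]];
        finish[i] = start + time[i];
        printf("T%-2d start %2d  finish %2d\n", i, start, finish[i]);
    }
    return 0;
}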
Besides these issues, there are also constraints placed by the structure and interconnection
of computers in a parallel computing system. Thus, picking a suitable method is also
governed by the architecture of the parallel computer using which a problem is to be solved.
EXERCISES
2.1 An examination paper has 8 questions to be answered and there are 1000 answer books.
Each answer takes 3 minutes to correct. If 4 teachers are employed to correct the papers
in a pipeline mode, how much time will be taken to complete the job of correcting 1000
answer papers? What is the efficiency of processing? If 8 teachers are employed instead
of 4, what is the efficiency? What is the job completion time in the second case? Repeat
with 32 teachers and 4 pipelines.
2.2 An examination paper has 4 questions. The answers to these questions do not take equal
time to correct. Answer to question 1 takes 2 minutes to correct, question 2 takes 3
minutes, question 3 takes 2.5 minutes and question 4 takes 4 minutes. Due to this speed
mismatch, storage should be provided between teachers (e.g., a tray where a teacher
finishing early can keep the paper for the slower teacher to take it). Answer the
following questions assuming 1000 papers are to be corrected by 4 teachers.
(i) What is the idle time of teachers?
(ii) What is the system efficiency?
(iii) How much tray space should be provided between teachers due to speed
mismatch?
2.3 If Exercise 2.2 is to be solved using a data parallel processing method, what will be its
efficiency? Compare with pipeline mode.
2.4 In an examination paper there are 5 questions and each will take on the average 10
minutes to correct. 2000 candidates write the examination. 5 teachers are employed to
correct the papers using pipeline mode. Every question is not answered by all
candidates. 10% of the candidates do not answer question 1, 15% question 2, 5%
question 3, 5% question 4, and 25% question 5.
(i) How much time is taken to complete grading?
(ii) What is the efficiency of pipeline processing?
(iii) Work out a general formula for calculating efficiency assuming n papers, k
teachers and percent unanswered questions as p1, p2, …, pk respectively.
(iv) What is the efficiency of the data parallel method?
(v) If data parallel method is used how much time will be taken to complete grading?
2.5 In the pipeline mode of processing we assumed that there is no communication delay
between stages of the pipeline. If there is a delay of y between pipeline stages derive a
speedup formula. What condition should y satisfy to ensure a speedup of at least 0.8 k,
where k is the number of stages in the pipeline?
2.6 Assume that there are k questions in a paper and k teachers grade them in parallel using
specialist data parallelism (Method 6, Section 2.4). Assume the time taken to grade the
k answers is t1, t2, t3, …, tk respectively. Assume that there are n answer books and the
time taken to despatch a paper to a teacher is kq.
(i) Obtain a formula for the speedup obtainable from the method assuming that every
student answers all questions.
(ii) If the percentage of students not answering questions 1, 2, 3, ..., k are respectively,
p1, p2, p3, …, pk, modify the speedup formula.
2.7 Making the same assumptions as made in Exercise 2.6, obtain the speedup formula for
Method 7 of the text (Procedure 2.3). Find speedup for both cases (i) and (ii).
2.8 Making the same assumptions as made in Exercise 2.6, obtain the speedup formula for
Method 8 (Agenda parallelism). Find speedup for both cases (i) and (ii).
2.9 In Method 4, we assumed that the head examiner gives one paper to each teacher to
correct. Assume that instead of giving one paper each, x papers are given to each
teacher.
(i) Derive the speedup formula for this case. State your assumptions.
(ii) If the percentage of unanswered questions are a1, a2, …, ak, respectively (for the k
questions), obtain the speedup.
(iii) Compare (ii) with static assignment for the same case.
2.10 In Method 7, we assumed that each teacher has an in-tray and an out-tray. Instead if
each teacher has only one in-tray and when he corrects a paper he puts it in the in-tray
of his neighbour, modify the algorithm for correcting all papers. Compare this method
with the method given in the text. Clearly specify the termination condition of the
algorithm.
2.11 A recipe for making potato bondas is given below: Ingredients: Potatoes (100), onions
(10), chillies (10), gram flour (1 kg), oil, salt.
Method:
Step 1: Boil potatoes in water till cooked.
Step 2: Peel and mash potatoes till soft.
Step 3: Peel onions and chop fine.
Step 4: Clean chillies and chop fine.
Step 5: Mix mashed potatoes, onion, green chillies, and salt to taste and make small
balls.
Step 6: Mix gram flour with water and salt to taste till a smooth and creamy batter is
obtained.
Step 7: Dip the potato balls in batter. Take out and deep fry in oil on low fire.
Step 8: Take out when the balls are fried to a golden brown colour.
Result: 300 bondas.
(i) Obtain a task graph for making bondas in parallel. Clearly specify the tasks in your
task graph. Assign appropriate time to do each task.
(ii) If 3 cooks are employed to do the job and 3 stoves are available for frying, how
will you assign tasks to cooks?
2.12 The following expressions are to be evaluated:
a = sin(x^2 y) + cos(xy^2) + exp(–xy^2)
b = g(p) + e^(–x) f(y) + h(x^2) + f(y) g(p)
c = f(u^2) + sin(g(p)) + cos^2(h(y^2))
(i) Obtain a task graph for calculating a, b, c.
(ii) Assuming 4 processors are available, obtain a task assignment to processors
assuming the following timings for various operations.
Squaring = 1, Multiplication = 1
sin = cos = exponentiation = 2
g(x) = h(x) = f(x) = 3
(iii) Compare the time obtained by your assignment with the ideal time needed to complete
the job with 4 processors.
2.13 A program is to be written to find the frequency of occurrence of each of the vowels a,
e, i, o, u (both lower case and upper case) in a text of 1,00,000 words. Obtain at least 3
different methods of doing this problem in parallel. Assume 4 processors are available.
2.14 A company has 5000 employees and their weekly payroll is to be prepared. Each
employee’s record has employee number, hours worked, rate per hour, overtime worked
(hours > 40 is considered overtime), overtime rate, percent to be deducted as tax and
other deductions.
(i) Write a sequential procedure to edit input for correctness, print the payroll, print
sum of all the amounts paid, average payment per employee, a frequency table of
payment made in 5 slabs and the employee number of the person(s) receiving
maximum amount.
(ii) Break up the sequential procedure into tasks and find which tasks can be carried
out in parallel.
(iii) If there are 3 processors available to carry out the job show how tasks can be
assigned to processors.
2.15 For the task graph of Fig. 2.10 with the timings given in Table 2.2, assign tasks to three
cooks.
(i) Find the job completion time and the efficiency of your assignment.
(ii) Repeat for 6 cooks. Compare with answers to (i) and comment.
2.16 A task graph with times of various tasks is given in Fig. 2.17. Assuming that 4
processors are available, assign tasks to processors.
Task   T1   T2   T3   T4   T5   T6   T7
Time   3    4    5    4    6    7    6
Task   T8   T9   T10  T11  T12  T13
Time   5    6    8    9    10   9
In this chapter we will discuss how instruction level parallelism is employed in designing
modern high performance microprocessors which are used as Processing Elements (PEs) of
parallel computers. The earliest use of parallelism in designing PEs to enhance processing
speed, was pipelining. Pipelining has been extensively used in Reduced Instruction Set
Computers (RISC). RISC processors have been succeeded by superscalar processors that
execute multiple instructions in one clock cycle. The idea in superscalar processor design is
to use the parallelism available at the instruction level by increasing the number of arithmetic
and functional units in a PE. This idea has been exploited further in the design of Very Long
Instruction Word (VLIW) processors in which one instruction word encodes more than one
operation. The idea of executing a number of instructions of a program in parallel by
scheduling them suitably on a single processor has been the major driving force in the design
of many recent processors. The number of transistors which can be integrated in a chip has
been doubling every 24 months (Moore’s law). Now a processor chip has over ten billion
transistors. The major architectural question is how to use these transistors effectively to enhance the
speed of PEs. Currently most microprocessor chips use multiple processors called “cores”
within a microprocessor chip. Clearly the trend is to design a high performance
microprocessor as a parallel processor. In this chapter we will describe the use of pipelining
in the design of PEs in some detail. We will also discuss superscalar and multithreaded
processor architectures.
3.1 PIPELINING OF PROCESSING ELEMENTS
In the last chapter we discussed in detail the idea of pipelining. We saw that pipelining uses
temporal parallelism to increase the speed of processing. One of the important methods of
increasing the speed of PEs is pipelined execution of instructions.
Pipelining is an effective method of increasing the execution speed of processors provided
the following “ideal” conditions are satisfied:
In actual practice these “ideal” conditions are not always satisfied. The non-ideal situation
arises because of the following reasons:
The challenge is thus to make pipelining work under these non-ideal conditions.
In order to fix our ideas, it is necessary to take a concrete example. We will describe the
architectural model of a small hypothetical computer. The computer is a Reduced Instruction
Set Computer (RISC) which is designed to facilitate pipelined instruction execution. The
computer is similar to SMAC2 (Small Computer 2) described by Rajaraman and
Radhakrishnan [2008]. We will call the architecture we use SMAC2P (Small Computer 2
Parallel). It has the following units:
A data cache (or memory) and an instruction cache (or memory). It is assumed that
the instructions to be executed are stored in the instruction memory and the data to
be read or written are stored in the data memory. These two memories have their
own address and data registers. We call the memory address register of the
instruction memory IMAR and that of data memory DMAR. The data register of
the instruction memory is called IR and that of the data memory MDR.
A Program Counter (PC) which contains the address of the next instruction to be
executed. The machine is word addressable. Each word is 32 bits long.
A register file with 32 registers. These are general purpose registers used to store
operands and index values.
The instructions are all of equal length and the relative positions of operands are
fixed. There are 3 instruction types as shown in Fig. 3.1.
The only instructions which access memory are load and store instructions. A load
instruction reads a word from the data memory and stores it in a specified register
in the register file and a store instruction stores the contents of a register in the
register file in the data memory. This is called load-store architecture.
An Arithmetic Logic Unit (ALU) which carries out one integer arithmetic or one
logic operation in one clock cycle.
Operation mnemonic   Instruction type   Symbolic form    Semantics
JMP                  J                  JMP Y            PC ← Y
LDI                  M                  LDI R2, 0, Y     C(R2) ← Y (the 16-bit address is treated as a 2’s complement signed number)
This set has a reasonable variety of instructions to illustrate the basic ideas of pipeline
processing. In Table 3.1 observe the load (LD) and store (ST) instructions. One of the
registers is used as an index register to calculate the effective address. The other interesting
instruction is load immediate (LDI) which uses the address part of the instruction as an
operand. We will now examine the steps in the execution of the instructions of SMAC2P.
We will see that an instruction execution cycle can be broken up into 5 steps, each step
taking one clock cycle. We will now explain each of these steps in detail.
Figure 3.3 Data flow for decode and fetch register step.
Step 3: Execute instruction and calculate effective address (EX)
In this step the ALU operations are carried out. The instructions where ALU is used may be
classified as shown below:
Arithmetic operations (ADD/SUB/MUL/DIV)
B3 ← B1 <operation> B2
where <operation> is ADD, SUB, MUL or DIV and B3 is the register where the ALU output is stored
Increment/Decrement (INC/DEC/BCT)
B3 ← B1 + 1 (Increment)
B3 ← B1 – 1 (Decrement or BCT)
Branch on equal (JEQ)
B3 ← B1 – B2
If B3 = 0, set the zero flag in the status register to 1
Effective address calculation (for LD/ST)
For load and store instructions, the effective address is stored in B3.
B3 ← B2 + IMM
The data flow for these operations is shown in Fig. 3.4. Observe that multiplexers
(MUXes) have been used to select the appropriate alternate values to be input to ALU and
output from ALU. Control signals (shown by dotted lines) are necessary for the MUXes.
Clock cycle →  1    2    3    4    5    6    7    8    9    10
i              FI   DE   EX   MEM  SR
i+1                 FI   DE   EX   MEM  SR
i+2                      FI   DE   EX   MEM  SR
i+3                           FI   DE   EX   MEM  SR
i+4                                FI   DE   EX   MEM  SR
i+5                                     FI   DE   EX   MEM  SR
We will now find out the speedup due to pipelining in the ideal case.
Ideal case
Let the total number of instructions executed = m.
Let the number of clock cycles per instruction = n.
Time taken to execute m instructions with no pipelining = mn cycles
Time taken with pipelining (ideal) = n + (m – 1) cycles.
Speedup due to pipelining = mn/(n + m – 1), which approaches n when m >> n.
When the pipeline stalls, let e be the average number of extra (stall) cycles per instruction; the effective number of cycles per instruction becomes (1 + e) and the speedup reduces to n/(1 + e). If a non-pipelined computer takes 5 clock cycles per instruction and e = 0.1, the speedup is 5/1.1 = 4.5.
In practice, the speedup is lower as the number of clock cycles needed per instruction in
non-pipelined mode is smaller than n, the number of pipeline stages as we saw earlier in this
section.
EXAMPLE 3.1
A non-pipelined computer uses a 10 ns clock. The average number of clock cycles per
instruction required by this machine is 4.35. When the machine is pipelined it requires an 11 ns
clock. We now find out the speedup due to pipelining.
Assume that the depth of the pipeline is n and the number of instructions executed is N.
The speedup due to pipelining is (4.35 × 10 × N)/((n + N – 1) × 11). As (n + N – 1)/N is nearly 1 when N >> n, the speedup is (4.35 × 10)/11 = 3.95.
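The calculation in Example 3.1 can be verified numerically. The following C sketch (not from the book) assumes a pipeline depth of n = 5 and a large instruction count N; both values are assumptions made only to show that (n + N – 1)/N tends to 1.

#include <stdio.h>

int main(void) {
    double N = 1e6;            /* number of instructions executed (assumed)     */
    double n = 5;              /* pipeline depth (assumed)                      */
    double t_np = 4.35 * 10;   /* ns per instruction, non-pipelined machine     */
    double t_p  = (n + N - 1) / N * 11;   /* ns per instruction when pipelined  */

    printf("Speedup = %.2f\n", t_np / t_p);   /* about 3.95 for large N */
    return 0;
}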
Besides the need to increase the period of each clock cycle, the ideal condition of being
able to complete one instruction per clock cycle in pipeline execution is often not possible
due to other non-ideal situations which arise in practice.
We will discuss the common causes for these non-ideal conditions and how to alleviate
them in the next section.
3.2 DELAYS IN PIPELINE EXECUTION
Delays in pipeline execution of instructions due to non-ideal conditions are called pipeline
hazards. As was pointed out earlier the non-ideal conditions are:
Each of these non-ideal conditions causes delays in pipeline execution. We can classify them
as:
We will discuss each of these next and examine how to reduce these delays.
Referring to Fig. 3.12, we see that the value of R3 (the result of the ADD operation) will
not be available till clock cycle 6. Thus, the MUL operation should not be executed in cycle 4
as the value of R3 it needs will not be stored in the register file by the ADD operation and
thus not available. Thus, the EX step of MUL instruction is delayed till cycle 6. This stall is
due to data dependency and is called data hazard as the required data is not available when
needed. The next instruction SUB R7, R2, R6 does not need R3. It also reads R2 whose value
is not changed by MUL. Thus, it can proceed without waiting as shown in Fig. 3.12. The next
instruction INC R3 can be executed in clock period 6 as R3 is already available in the register
file at that time. However, as ALU is being used in clock cycle 6, it has to wait till cycle 7 to
increment R3. The new value of R3 will be available only in cycle 9. This delay can be
eliminated if the computer has a separate unit to execute INC operation.
INC R3:  FI  DE  X  X  EX  MEM  SR   (X denotes a stall cycle)
Observe that the third instruction SUB R7, R2, R6 completes execution before the
previous instruction. This is called out-of-order completion and may be unacceptable in some
situations. Thus, many machine designers lock the pipeline when it is stalled (delayed). In
such a case the pipeline execution will be as shown in Fig. 3.13. Observe that this preserves
the order of execution but the overall execution takes 1 more cycle as compared to the case
where pipeline is not locked.
INC R3:  X  X  FI  DE  EX  MEM  SR   (the pipeline is locked during the stall)
A question which naturally arises is how we can avoid pipeline delay due to data
dependency. There are two methods available to do this. One is a hardware technique and the
other is a software technique. The hardware method is called register forwarding. Referring
to Fig. 3.12, the result of ADD R1, R2, R3 will be in the buffer register B3. Instead of waiting
till SR cycle to store it in the register file, one may provide a path from B3 to ALU input and
bypass the MEM and SR cycles. This technique is called register forwarding. If register
forwarding is done during ADD instruction, there will be no delay at all in the pipeline. Many
pipelined processors have register forwarding as a standard feature. Hardware must of course
be there to detect that the next instruction needs the output of the current instruction and
should be fed back to ALU. In the following sequence of instructions not only MUL but also
SUB needs the value of R3. Thus, the hardware should have a facility to forward R3 to SUB
also.
ADD R1, R2, R3
MUL R3, R4, R5
SUB R3, R2, R6
Consider another sequence of instructions
Figure 3.14 (a) Pipeline execution with register forwarding. (b) Pipeline execution with the instructions reordered.
The pipeline execution with register forwarding is shown in Fig. 3.14(a). Observe that the
LD instruction stores the value of R2 only at the end of cycle 5. Thus, ADD instruction can
perform addition only in cycle 6. MUL instruction even though it uses register forwarding is
delayed first due to the non-availability of R3 as ADD instruction is not yet executed and in
the next cycle as the execution unit is busy. The next instruction SUB, even though not
dependent on any of the previous instructions is delayed due to the non-availability of the
execution unit. Hardware register forwarding has not eliminated pipeline delay. There is,
however, a software scheduling method which reduces delay and in some cases completely
eliminates it. In this method, the sequence of instructions is reordered without changing the
meaning of the program. In the example being considered SUB R7, R8, R9 is independent of
the previous instructions. Thus, the program can be rewritten by moving SUB R7, R8, R9 up so that it immediately follows the LD instruction.
The pipeline execution of these instructions is shown in Fig. 3.14(b). Observe that SUB can
be executed without any delay. However, ADD is delayed by a cycle as a value is stored in
R2 by LD instruction only in cycle 5. This rescheduling has reduced delay by one clock cycle
but has not eliminated it. If the program has two or more instructions independent of LD and
ADD then they can be inserted between LD and ADD eliminating all delays.
i        FI  DE  EX  MEM  SR
(i + 2)  FI  X   X   FI   DE  EX  MEM  SR   (X denotes a stall cycle)
TABLE 3.3 Modified Decode Step of SMAC2P to Reduce Stall During Branch

Step in execution: Decode instruction, fetch registers and find branch address
Data flow during step:
B1 ← Reg [IR 21..25]
B2 ← Reg [IR 16..20]
B2 ← (B1 – B2)
It is clear from Table 3.3 that we have introduced an add/subtract unit in the decode stage
to implement JEQ and BCT instructions in Step 2. This add/subtract unit will also be useful if
an effective address is to be computed. By adding this extra circuitry we have reduced delay
to 1 cycle if branch is taken and to zero if it is not taken. We will calculate the improvement
in speedup with this hardware.
EXAMPLE 3.3
Assume again 5% unconditional jumps, 15% conditional jumps and 80% of conditional
jumps as taken.
Average delay cycles with extra hardware
= 1 × 0.05 + 1 × (0.15 × 0.8) = 0.05 + 0.12 = 0.17
∴ Speedup with branches = 5/1.17 ≈ 4.27
% loss of speedup = 14.6%.
Thus, we get a gain of 22% in speedup due to the extra hardware which is well worth it in
our example.
In SMAC2P there are only a small number of branch instructions and the extra hardware
addition is simple. Commercial processors are more complex and have a variety of branch
instructions. It may not be cost effective to add hardware of the type we have illustrated.
Other techniques have been used which are somewhat more cost-effective in certain
processors.
We will discuss two methods both of which depend on predicting the instruction which
will be executed immediately after a branch instruction. The prediction is based on the
execution time behaviour of a program. The first method we discuss is less expensive in the
use of hardware and consequently less effective. It uses a small fast memory called a branch
prediction buffer to assist the hardware in selecting the instruction to be executed
immediately after a branch instruction. The second method which is more effective also uses
a fast memory called a branch target buffer. This memory, however, has to be much larger
and requires more control circuitry. Both ideas can, of course, be combined. We will first
discuss the use of branch prediction buffer.
Branch Prediction Buffer
In this technique some of the lower order bits of the address of branch instructions in a
program segment are used as addresses of the branch prediction buffer memory. This is
shown in Fig. 3.16. The contents of each location of this buffer memory is the address of the
next instruction to be executed if the branch is taken. In addition, two bits are used to predict
whether a branch will be taken when a branch instruction is executed. If the prediction bits
are 00 or 01, the prediction is that the branch will not be taken. If the prediction bits are 10 or
11 then the prediction is that the branch will be taken. While executing an instruction, at the
DE step of the pipeline, we will know whether the instruction is a branch instruction or not. If
it is a branch instruction, the low order bits of its address are used to look up the branch
prediction buffer memory. Initially the prediction bits are 00. Table 3.4 is used to change the
prediction bits in the branch prediction buffer memory.
TABLE 3.4 Rule for Changing the Prediction Bits Depending on Whether the Branch is Taken (Y) or Not Taken (N)
The prediction bits are examined. If they are 10 or 11, control jumps to the branch address
found in the branch prediction buffer. Otherwise the next sequential instruction is executed.
Experimental results show that the prediction is correct 90% of the time. With 1000 entries in
the branch prediction buffer, it is estimated that the probability of finding a branch instruction
in the buffer is 95%. The probability of finding the correct branch address in the buffer is thus at least 0.9 × 0.95 ≈ 0.85.
There are two questions which must have occurred to the reader. The first is “How many
clock cycles do we gain, if at all, with this scheme?” The second is “Why should there be 2
bits in the prediction field? Would it not be sufficient to have only one bit?” We will take up
the first question. In SMAC2P with hardware enhancements there will not be any gain by
using this scheme as the branch address will be found at the DE step of the pipeline. Without
any hardware enhancements of SMAC2P two clock cycles will be saved when this buffer
memory is used provided the branch prediction is correct. In many machines where address
computation is slow this buffer will be very useful.
Each entry of the branch prediction buffer consists of: the low order bits of the branch instruction address (used to address the buffer), the address to which the branch will jump, and 2 prediction bits.
A single bit predictor incorrectly predicts branches more often, particularly in most loops,
compared to a 2-bit predictor. Thus, it has been found more cost-effective to use a 2-bit
predictor.
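One common realization of a 2-bit predictor is a saturating counter which moves one step towards 11 when the branch is taken and one step towards 00 when it is not; states 00 and 01 predict not taken and states 10 and 11 predict taken. The following C sketch illustrates this scheme; it is not necessarily the exact transition rule of Table 3.4.

#include <stdio.h>

static int predict_taken(int state)     { return state >= 2; }   /* 10 or 11 */
static int update(int state, int taken) {
    if (taken) return state < 3 ? state + 1 : 3;   /* move towards 11 */
    else       return state > 0 ? state - 1 : 0;   /* move towards 00 */
}

int main(void) {
    int state = 0;                        /* prediction bits start at 00      */
    int outcomes[] = {1, 1, 1, 0, 1, 1};  /* a loop branch: mostly taken      */
    int correct = 0, n = sizeof outcomes / sizeof outcomes[0];

    for (int i = 0; i < n; i++) {
        if (predict_taken(state) == outcomes[i]) correct++;
        state = update(state, outcomes[i]);
    }
    printf("%d of %d branches predicted correctly\n", correct, n);
    return 0;
}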
Branch Target Buffer
Unlike a branch prediction buffer, a branch target buffer is used at the instruction fetch step
itself. The various fields of the Branch Target Buffer memory (BTB) are shown in Fig. 3.17.
Fields of the BTB (Fig. 3.17): Address of the branch instruction, Address to which the branch will jump, and Prediction bits (1 or 2 bits, optional).
Observe that the address field has the complete address of all branch instructions. The
contents of BTB are created dynamically. When a program is executed whenever a branch
statement is encountered, its address and branch target address are placed in the BTB. Remember that the fact that an instruction is a branch will be known only during the decode step. At the end
of execution step, the target address of the branch will be known if branch is taken. At this
time the target address is entered in BTB and the prediction bits are set to 01. Once a BTB
entry is made, it can be accessed at instruction fetching phase itself and target address found.
Typically when a loop is executed for the first time, the branch instruction governing the loop
would not be found in BTB. It will be entered in BTB when the loop is executed for the first
time. When the loop is executed the second and subsequent times the branch target would be
found at the instruction fetch phase itself thus saving 3 clock cycles delay. We explain how
the BTB is used with a flow chart in Fig. 3.18. Observe that left portion of this flow chart
describes how BTB is created and updated dynamically and the right portion describes how it
is used. In this flow chart it is assumed that the actual branch target is found during DE step.
This is possible in SMAC2P with added hardware. If no hardware is added then the fact that
the predicted branch and actual branch are same will be known only at the execution step of
the pipeline. If they do not match, the instruction fetched from the target has to be removed
and the next instruction in sequence should be taken for execution.
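A minimal model of the BTB lookup performed at the fetch step is sketched below. The direct-mapped organization and the omission of the optional prediction bits are simplifying assumptions; only the fields of Fig. 3.17 (branch address and target address) are used.

#include <stdint.h>
#include <stdbool.h>

#define BTB_ENTRIES 1024              /* "about 1000 entries are normally used in practice" */

/* Fields of a BTB entry as in Fig. 3.17; the optional prediction bits are omitted here. */
typedef struct {
    bool     valid;
    uint32_t branch_addr;             /* complete address of the branch instruction */
    uint32_t target_addr;             /* address to which the branch jumps */
} BTBEntry;

static BTBEntry btb[BTB_ENTRIES];

/* Consulted at the instruction fetch step itself: if the address of the
   instruction being fetched hits in the BTB, the next fetch is redirected
   to the stored target instead of the next sequential instruction. */
bool btb_lookup(uint32_t fetch_addr, uint32_t *next_fetch) {
    BTBEntry *e = &btb[fetch_addr % BTB_ENTRIES];   /* direct-mapped: an assumption */
    if (e->valid && e->branch_addr == fetch_addr) {
        *next_fetch = e->target_addr;
        return true;
    }
    return false;
}

/* An entry is created dynamically when a taken branch resolves, so the
   second and later executions of the branch hit in the BTB at fetch time. */
void btb_enter(uint32_t branch_addr, uint32_t target_addr) {
    BTBEntry *e    = &btb[branch_addr % BTB_ENTRIES];
    e->valid       = true;
    e->branch_addr = branch_addr;
    e->target_addr = target_addr;
}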
Observe that we have to search BTB to find out whether the fetched instruction is in it.
Thus, BTB cannot be very large. About 1000 entries are normally used in practice. We will
now compute the reduction in speedup due to branches when BTB is employed using the
same data as in Example 3.2.
EXAMPLE 3.4
Assume unconditional branches = 5%.
Conditional branches = 15%
Taken branches = 80% of conditional
For simplicity we will assume that branch instructions are found in BTB with probability
0.95. We will assume that in 90% cases, the branch prediction based on BTB is correct.
Figure 3.18 Branch Target Buffer (BTB) creation and use.
We will also assume that the branch address is put in PC only after MEM cycle. In other
words, there is no special branch address calculation hardware.
By having a BTB the average delay cycles when unconditional branches are found in BTB
= 0.
Average delay when unconditional branches are not found in BTB = 3.
Thus, average delay due to unconditional branches = 2 × 0.05 = 0.1 (the probability of a branch not being found in the BTB is 0.05).
(For unconditional branches there can be no misprediction.)
Average delay due to conditional branches when they are found in BTB = 0.
Average delay when conditional branches are not found in BTB = ( 3 × 0.8) + (2 × 0.2) =
2.8.
As probability of not being in BTB = 0.05, the average delay due to conditional branches
= 0.05 × 2.8 = 0.14.
Average delay due to misprediction of conditional branches when found in BTB = 0.1 × 3
× 0.95 = 0.285 (As probability of conditional branch being in BTB is 0.95). As 5% of
instructions are unconditional branches and 15% are conditional branches in a program, the
average delay due to branches = 0.05 × 0.1 + 0.15 × (0.14 + 0.285) = 0.005 + 0.064 = 0.069.
Therefore, speedup with branches when BTB is used = 5/(1 + 0.069) = 4.677.
% loss of speedup due to branches = 6%.
Compare with the loss of speedup of 36.4% found in Example 3.2 with no BTB. Thus, use
of BTB is extremely useful.
Figure 3.21 Hoisting code from target to delay slot to reduce stall.
3.3 DIFFICULTIES IN PIPELINING
We have discussed in the last section problems which arise in pipeline execution due to
various non-ideal conditions in programs. We will examine another difficulty which makes
the design of pipeline processors challenging. This difficulty is due to interruption of normal
flow of a program due to events such as illegal instruction codes, page faults, and I/O calls.
We call all these exception conditions. In Table 3.5 we list a number of exception conditions.
For each exception condition we have indicated whether it can occur in the middle of the
execution of an instruction and if yes during which step of the pipeline. Referring to Table
3.5 we see, for example, that memory protection violation can occur either during the
instruction fetch (FI) step or load/store memory (MEM) step of the pipeline. In Table 3.5 we
have also indicated whether we could resume the program after the exception condition or
not. In the case of undefined instruction, hardware malfunction or power failure we have no
option but to terminate the program. If the Operating System (OS) has a checkpointing
feature, one can restart a half finished program from the last checkpoint.
An important problem faced by an architect is to be able to attend to the exception
conditions and resume computation in an orderly fashion whenever it is feasible. The
problem of restarting computation is complicated by the fact that several instructions will be
in various stages of completion in the pipeline. If the pipeline processing can be stopped
when an exception condition is detected in such a way that all instructions which occur
before the one causing the exception are completed and all instructions which were in
progress at the instant exception occurred can be restarted (after attending to the exception)
from the beginning, the pipeline is said to have precise exceptions. Referring to Fig. 3.22,
assume an exception occurred during the EX step of instruction (i + 2). If the system supports
precise exceptions then instructions i and (i + 1) must be completed and instructions (i + 2), (i
+ 3) and (i + 4) should be stopped and resumed from scratch after attending to the exception.
In other words whatever actions were carried out by (i + 2), (i + 3) and (i + 4) before the
occurrence of the exception should be cancelled. When an exception is detected, the
following actions are carried out:
1. As soon as the exception is detected turn off write operations for the current and all
subsequent instructions in the pipeline [Instructions (i + 2), (i + 3) and (i + 4) in Fig.
3.22].
2. A trap instruction is fetched as the next instruction in pipeline (Instruction i + 5 in Fig.
3.22).
3. This instruction invokes the OS, which saves the address (i.e., the PC of the program) of the faulting instruction to enable resumption of the program later, after attending to the exception.
Instruction    Operation          Number of cycles needed    Arithmetic unit needed
I1             R1 ← R1/R5         2                          Integer
I2             R3 ← R1 + R2       1                          Integer
I3             R2 ← R5 + 3        1                          Integer
I4             R7 ← R1 – R11      1                          Integer
I5             R6 ← R4 × R8       2                          Floating point
I6             R5 ← R1 + 6        1                          Integer
I7             R1 ← R2 + 1        1                          Integer
Instruction    Original instruction    After register renaming
I1             R1 ← R1/R5              R1 ← R1/R5
I2             R3 ← R1 + R2            R3 ← R1 + R2
I3             R2 ← R5 + 3             R2N ← R5 + 3
I4             R7 ← R1 – R11           R7 ← R1 – R11
I5             R6 ← R4 × R8            R6 ← R4 × R8
I6             R5 ← R1 + 6             R5N ← R1 + 6
I7             R1 ← R2 + 1             R1N ← R2N + 1
I8             R10 ← R9 × R8           R10 ← R9 × R8
With the instructions modified in this way, the execution sequence is shown in Fig. 3.27. We see that
the delay in the pipeline is reduced and the execution completes in 9 cycles, a saving of one
cycle as compared to the schedule of Fig. 3.26. In developing the execution sequence we
have assumed that 2 floating point units and 2 integer units are available. After completing
the sequence of instructions, the renamed registers overwrite their parents (R2 is R2N’s
parent).
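The renaming shown in the right hand column above can be mimicked by a mapping table from architectural to physical registers: every instruction that writes a register is given a fresh physical register, so later writers can no longer cause WAR or WAW hazards. The sketch below assumes arbitrary register counts and is only an illustration of the idea, not of any particular processor.

#include <stdio.h>

#define NUM_ARCH 16      /* architectural registers R0..R15 (an assumed count) */
#define NUM_PHYS 64      /* size of the physical register pool (also assumed)  */

static int map[NUM_ARCH];            /* current architectural -> physical mapping */
static int next_free = NUM_ARCH;     /* next unused physical register             */

void init_rename(void) {
    for (int r = 0; r < NUM_ARCH; r++)
        map[r] = r;                  /* identity mapping to start with */
}

/* A source operand reads whatever physical register the architectural
   register currently maps to. */
int rename_src(int arch) { return map[arch]; }

/* A destination operand gets a fresh physical register (the "R2N", "R5N",
   "R1N" of the table above), removing WAR and WAW hazards. */
int rename_dest(int arch) {
    if (next_free < NUM_PHYS)        /* a real processor stalls when the pool is empty */
        map[arch] = next_free++;
    return map[arch];
}

int main(void) {
    init_rename();
    int src = rename_src(5);         /* I3 reads R5 ...                     */
    int dst = rename_dest(2);        /* ... and writes R2, which is renamed */
    printf("I3: R2 renamed to physical register P%d, R5 read from P%d\n", dst, src);
    return 0;
}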
A major problem is the lack of sufficient parallelism. Assume that the two Floating Point
Units (FPUs) are pipelined to speedup floating point operations. If the number of stages in the
pipeline of FPU is 4, we need at least 10 floating point operations to effectively use such a
pipelined FPU. With two FPUs we need at least 20 independent floating point operations to
be scheduled to get the ideal speedup. Besides this, we also need two integer operations. The overall instruction level parallelism in a program may not be this high.
The VLIW hardware also needs much higher memory and register file bandwidth to
support wide words and multiple read/write to register files. For instance, to sustain two
floating point operations per cycle, we need two read ports and two write ports to register file
to retrieve operands and store results. This requires a large silicon area on the processor chip.
Lastly it is not always possible to find independent instructions to pack all 7 operations in
each word. On the average about half of each word is filled with no-ops. This increases the
required memory capacity.
Binary code compatibility between two generations of VLIWs is also very difficult to
maintain as the structure of the instruction will invariably change.
Overall VLIW, though an interesting idea, has not become very popular in commercial
high performance processors.
3.6 SOME COMMERCIAL PROCESSORS
In this section we describe three processors which are commercially available. The first
example is a RISC superscalar processor, the second is one of the cores of a CISC high performance processor, and the third is a long instruction word processor called an EPIC (Explicitly Parallel Instruction Computing) processor.
TABLE 3.8 Operations Performed During Eleven Pipeline Stages of ARM Cortex A9
5. DE 2
6. RE     Rename registers to eliminate WAR and WAW hazards during out-of-order execution. Assign registers from a pool of unused physical registers.
7. I1     Up to 4 instructions issued based on availability of inputs and execution units. They may be out of order.
8. EX 1, 9. EX 2, 10. EX 3    Instructions executed. Conditional branch execution done in EX 1 and branch/no branch determined. If mispredicted, a signal is sent to FI 1 to void the pipeline. The execution units available are shown in Fig. 3.30.
11. SR    Update physical register file in the correct order. Retire instructions in the right order.
All these processors have a basic pipelined execution unit and in effect try to make best use
of pipeline processing by using thread parallelism.
In order to answer these questions we must identify the parameters which affect the
performance of the processor. These are:
1. The average number of instructions which a thread executes before it suspends (let
us say it is p).
2. The delay when a processor suspends a thread and switches to another one (let it be
q cycles).
3. The average waiting time of a thread before it gets the resource it needs to resume
execution called latency (let it be w cycles).
Figure 3.36 Processing of multiple threads in a processor.
We will count all these in number of processor cycles. To find out the number of threads n
to reduce the effects of latency (referring to Fig. 3.36), we use the simple formula:
nq + (n – 1)p ≤ w
Observe that a larger p and a smaller w reduce the number of threads to be supported.
A rough calculation of the efficiency of the processor, when there are enough threads to hide the latency, is p/(p + q). This utilization is poor when the average non-blocked run length p of a thread is small relative to the context switching time q. Coarse grained multithreading also has poor performance because the instructions already in the pipeline when a thread is suspended are abandoned, and this loss can equal the length of the pipeline (in the worst case). It is too expensive in
hardware to let them continue while another thread is invoked as we have to provide
independent resources.
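A small calculation along the lines of the formula above is sketched below. The values of p, q and w are sample inputs (the same as those in Exercise 3.37 at the end of this chapter), and treating nq + (n – 1)p = w as the crossover point for the number of threads, together with the saturation efficiency estimate p/(p + q), is one reading of the formula rather than anything stated explicitly in the text.

#include <stdio.h>

int main(void) {
    /* p = average run length before a thread suspends, q = context switch
       cost, w = latency to be hidden (all in processor cycles). */
    int p = 15, q = 3, w = 30;

    /* Smallest n for which the switching and execution time of the threads,
       nq + (n-1)p, covers the latency w (one reading of the formula above). */
    int n = 1;
    while (n * q + (n - 1) * p < w)
        n++;

    /* Rough efficiency once the latency is hidden: p useful cycles out of
       every p + q cycles spent per thread (an assumed approximation). */
    double efficiency = (double)p / (p + q);

    printf("threads to support: n = %d, efficiency ~ %.2f\n", n, efficiency);
    return 0;
}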
The question thus naturally arises whether there are better schemes. One such attempt is
fine-grained multithreading.
Coarse grained multithreading: instruction fetch buffers, register files for threads, control logic/state; context is switched on a pipeline stall.
Simultaneous multithreading: instruction fetch buffers, register files, control logic/state, return address stack, reorder buffer/retirement unit; no context switching.
3.8 CONCLUSIONS
The main computing engine of a parallel computer is a microprocessor used as the PE. It is
thus important to design high speed PEs in order to build high performance parallel
computers. The main method used to build high speed PEs is to use parallelism at the
instruction execution level. This is known as instruction level parallelism. The earliest use of
parallelism in a PE was using temporal parallelism by pipelining instruction execution. RISC
processors were able to execute one instruction every clock cycle by using pipelined
execution. Pipelined execution of instructions is impeded by dependencies between
instructions in a program and resource constraints. The major dependencies are data
dependency and control dependency. Loss of speedup due to these dependencies is alleviated by both hardware and software methods. The major hardware method used to alleviate data dependency is register forwarding, i.e., forwarding values in registers internally to the ALU of the PE instead of first storing them and reading them back. Source instructions may also be rearranged by a compiler to reduce data dependency during execution. To alleviate loss of speedup due to control dependency, the hardware is enhanced by providing two fast memories called the branch prediction buffer and the branch target buffer. They primarily work by predicting which path a program will take when it encounters a branch and fetching the instruction from this predicted path. A software method has also been found useful, in which a compiler rearranges the instructions in the program in such a way that an instruction fetched need not be abandoned due to a branch instruction. Delay in pipelined execution of programs due to resource constraints is normally
solved by providing resources such as an additional floating point arithmetic unit or speeding
up the unit which causes delays. Adding more resources is done judiciously by executing a
number of benchmark problems and assessing cost-benefit.
Besides using temporal parallelism to enhance the speed of PEs, data parallelism is used in
what are known as superscalar RISC processors. Superscalar processing depends on available
parallelism in groups of instructions of programs. Machines have been built which issue 4
instructions in each cycle and execute them in parallel in pipeline mode. These machines
require additional functional units to simultaneously execute several instructions.
In superscalar RISC processors, the hardware examines only a small window of
instructions and schedules them to use all available functional units, taking into account
dependencies. This is sub-optimal. Compilers can take a more global view of a program and
rearrange instructions in it to utilize PE’s resources better and reduce pipeline delays. VLIW
processors use sophisticated compilers to expose a sequence of instructions which have no
dependency and require diverse resources available in the processor. In VLIW architecture, a
single word incorporates many operations. Typically two integer operations, two floating
point operations, two load/store operations and a branch may all be packed into one long instruction word, which may be anywhere from 128 to 256 bits long. Trace scheduling is used to make optimal use of the resources available in VLIW processors. Early VLIW processors were primarily proposed by university researchers. These ideas were later used in commercial PEs, notably Intel's IA-64 processor.
A program may be decomposed into small sequences of instructions, called threads, which
are schedulable as a unit by a processor. Threads which are not dependent on one another can
be executed in parallel. It is found that many programs can be decomposed into independent
threads. This observation has led to the design of multithreaded processors which execute
threads in parallel. There are three types of multithreaded processors: coarse-grained, fine-grained and simultaneous. All multithreaded processors use pipelined execution of instructions. In coarse-grained multithreaded processors, a thread executes till it is stalled due to a long latency operation, at which time another thread is scheduled for execution. In fine-grained multithreading, one instruction is fetched from each thread in each clock cycle. This eliminates data and control hazards in pipelined execution and tolerates long latencies such as that of a cache miss. Such a PE was designed specifically for use in parallel computers and was used in a commercial parallel computer (the Tera MTA-32). In simultaneous multithreaded
processors several threads are scheduled to execute in each clock cycle. There is considerable
current research being conducted on multithreading as it promises latency tolerant PE
architecture.
Levels of integration have reached a stage where, as of 2015, about five billion transistors are integrated in a single integrated circuit chip. This has allowed a number of processing elements called
“cores” to be built on a chip. We will discuss the architecture of multi-core processors in
Chapter 5. It is evident that all future PEs will be built to exploit as much parallelism as
possible at instruction and thread level leading to very high performance. These PEs will be
the computing engines of future parallel computers thereby enhancing the speed of parallel
computers considerably in the coming years.
EXERCISES
3.1 In the text we assumed five-stage pipeline processing for the hypothetical computer SMAC2P. Assume instead that the processor uses a 4-stage pipeline. What would you suggest as the 4 pipeline stages? Justify your answer.
3.2 Develop data flows of SMAC2P with a 4 stage pipeline (a table similar to Table 3.2 is to
be developed by you).
3.3 If the percentage of unconditional branches is 10%, conditional branches 18%, and
immediate instructions 8% in programs executed in SMAC2P, compute the average
clock cycles required per instruction.
3.4 Assume 5 stage pipelining of SMAC2P with percentage mix of instructions given in
Exercise 3.3. Assuming 8 ns clock, and 80% of the branches are taken find out the
speedup due to pipelining.
3.5 Repeat Exercise 3.4 assuming a 4 stage pipeline.
3.6 In our model of SMAC2P we assumed availability of separate instruction and data
memories. If there is a single combined data and instruction cache, discuss how it will
affect pipeline execution. Compute loss of speedup due to this if a fraction p of total
number of instructions require reference to data stored in cache.
3.7 Describe how a floating point add/subtract instruction can be pipelined with a 4 stage
pipeline. Assume normalized floating point representation of numbers.
3.8 If a pipelined floating point unit is available in SMAC2P and if 25% of arithmetic
operations are floating point instructions, calculate the speedup due to pipeline
execution with and without pipelined floating point unit.
3.9 Draw pipeline execution diagram during the execution of the following instructions of
SMAC2P.
MUL R1, R2, R3
ADD R2, R3, R4
INC R4
SUB R6, R3, R7
Find out the delay in pipeline execution due to data dependency of the above
instructions.
3.10 Describe how the delay can be reduced in execution of the instructions of Exercise 3.9
by (i) Hardware assistance, and (ii) Software assistance.
3.11 Repeat Exercise 3.9, if MUL instruction takes 3 clock cycles whereas other
instructions take one cycle each.
3.12 Given the following SMAC2P program
(i) With R1 = 2, R4 = 6, Temp 1 = 4, draw the pipeline execution diagram.
(ii) With R0 = 0, R3 = 3, R4 = 4, Temp 1 = 6, draw the pipeline execution diagram.
3.13 Explain how branch instructions delay pipeline execution. If a program has 18%
conditional branch instructions and 4% unconditional branch instructions and if 7% of
conditional branches are taken branches, calculate the loss in speedup of a processor
with 4 pipeline stages.
3.14 What modifications would you suggest in SMAC2P hardware to reduce delay due to
branches? Using the data of Exercise 3.13, calculate the improvement in speedup.
3.15 What is the difference between branch prediction buffer and branch target buffer used
to reduce delay due to control dependency?
3.16 Using the data of Exercise 3.13, compute the reduction in pipeline delay when branch
prediction buffer is used.
3.17 Using the data of Exercise 3.13, compute the reduction in pipeline delay when branch
target buffer is used.
3.18 Estimate the size of branch target buffer (in number of bytes) for SMAC2P using the
data of Exercise 3.13.
3.19 How can software methods be used to reduce delay due to branches? What conditions
should be satisfied for software method to succeed?
3.20 What conditions must be satisfied by the statement appearing before a branch
instruction so that it can be used in the delay slot?
3.21 For the program of Exercise 3.12, is it possible to reduce pipelining delay due to
branch instructions by rearranging the code? If yes show how it is done.
3.22 What do you understand by the term precise exception? What is its significance?
3.23 When an interrupt occurs during the execution stage of an instruction, explain how the
system should handle it.
3.24 What is the difference between superscalar processing and superpipelining? Can one
combine the two? If yes, explain how.
3.25 What extra resources are needed to support superscalar processing?
(i) For the following sequence of instructions develop superscalar pipeline execution diagrams similar to that given in Fig. 3.26. Assume there are one floating point and two integer execution units.
Instruction Number of cycles needed Arithmetic unit needed
R2 ← R2 × R6 2 Floating point
R3 ← R2 + R1 1 Integer
R1 ← R6 + 8 1 Integer
R8 ← R2 – R9 1 Integer
R6 ← R2 + 4 1 Integer
R2 ← R1 + 2 1 Integer
(ii) If there are 2 integer and 2 floating point execution units, repeat (i).
(iii) Is it possible to rename registers to reduce the number of execution cycles?
(iv) Reschedule instructions (if possible) to reduce the number of cycles needed to
execute this set of instructions.
3.26 Distinguish between flow dependency, anti-dependency and output dependency of instructions. Give one example of each of these dependencies.
(i) Why should one be concerned about these dependencies in pipelined execution of
programs?
(ii) If in pipelined execution of programs out of order completion of instructions is avoided, will these dependencies matter?
3.27 What is score boarding? For the sequence of instructions given in Exercise 3.25,
develop a score board. Illustrate how it is used in register renaming.
3.28 (i) Define a trace.
(ii) How is it used in a VLIW processor?
(iii) List the advantages and disadvantages of VLIW processor.
3.29 (i) Is ARM Cortex A9 a RISC or a CISC processor?
(ii) Is it superscalar or superpipelined?
(iii) Is it a VLIW processor?
(iv) How many integer and floating point execution units does it have?
(v) Show pipeline execution sequence of instructions of Exercise 3.25 on ARM Cortex
A9.
3.30 (i) Is Intel Core i7 processor a RISC or CISC processor?
(ii) Is it superscalar or superpipelined?
(iii) What is the degree of superscalar processing in i7?
(iv) Show pipelined execution sequence of instructions of Table 3.6 on a i7 processor.
3.31 (i) Is IA-64 a VLIW processor?
(ii) Is it a superscalar processor?
(iii) How many instructions can it carry out in parallel?
(iv) Explain how it uses predicate registers in processing.
3.32 (i) Define the term thread.
(ii) What is the difference between a thread, a trace and a process?
(iii) What is multithreading?
(iv) List the similarities and differences between a coarse grained multithreaded processor, a fine grained multithreaded processor and a simultaneous multithreaded processor.
3.33 What are the similarities and differences between multithreading and multiprogramming?
3.34 If a computer does not have an on-processor cache, which type of multithreaded processor would you use and why?
3.35 What type of multithreading is used by:
(i) HEP processor?
(ii) Tera processor?
3.36 Explain with a pipeline diagram how the instructions in Exercise 3.12 will be carried
out by a fine grained multithreaded processor if R1 = 10.
3.37 Assume that the average number of instructions a thread executes before it suspends is
15, the delay when a thread suspends and switches to another one is 3 cycles and the
average number of cycles it waits before it gets the resource it needs is 30. What is the
number of threads the processor should support to hide the latency? What is the
processor efficiency?
3.38 Explain how simultaneous multithreading is superior to multithreading. What extra
processor resources are required to support simultaneous multithreading?
BIBLIOGRAPHY
Culler, D.E., Singh, J.P. and Gupta, A., Parallel Computer Architecture: A Hardware/
Software Approach, Morgan Kaufmann, San Francisco, USA, 1999.
Fisher, J.A., “Very Long Instruction Word Architectures and ELI-512”, Proceedings of ACM
Symposium on Computer Architecture, Sweden, 1983.
IA-64 (Website: www.intel.com)
Laudon, J., Gupta, A. and Horowitz, A., Multithreaded Computer Architecture, [Iannucci,
R.A. (Ed.)], Kluwer Academic, USA, 1994.
Patterson, D.A. and Hennessy, J.L., Computer Architecture—A Quantitative Approach, 5th
ed., Reed Elsevier India, New Delhi, 2012.
Rajaraman, V. and Radhakrishnan, T., An Introduction to Digital Computer Design, 5th ed.,
PHI Learning, New Delhi, 2008.
Shen, J.P., and Lipasti, M.H., Modern Processor Design, Tata McGraw-Hill, New Delhi,
2010.
Smith, B.J., The Architecture of HEP in Parallel MIMD Computation, [Kowalik, J.S. (Ed.)],
MIT Press, USA, 1985.
Stallings, W., Computer Organization and Architecture, 4th ed., Prentice-Hall of India, New
Delhi, 1996.
Tanenbaum, A.S. and Austin, T., Structured Computer Organization, 6th ed., Pearson, New
Delhi, 2013.
Theme Papers on One Billion Transistors on a Chip, IEEE Computer, Vol. 30, No. 9, Sept.
1997.
Tullsen, D.M., Eggers, S.J. and Levy, H.M., “Simultaneous Multithreading: Maximizing On-
chip Parallelism”, Proceedings of ACM Symposium on Computer Architecture, 1995.
Structure of Parallel Computers
In the last chapter we saw how parallelism available at the instruction level can be used
extensively in designing processors. This type of parallelism is known in the literature as fine
grain parallelism as the smallest grain is an instruction in a program. In this chapter we will
examine how processors may be interconnected to build parallel computers which use a
coarser grain, for example, threads or processes executing in parallel.
4.1 A GENERALIZED STRUCTURE OF A PARALLEL COMPUTER
A parallel computer is defined as:
an interconnected set of Processing Elements (PEs) which cooperate by communicating with
one another to solve large problems fast.
We see from this definition that the keywords which define the structure of a parallel
computer are PEs, communication and cooperation. Figure 4.1 shows a generalized structure
of a parallel computer. The heart of the parallel computer is a set of PEs interconnected by a
communication network. This general structure can have many variations based on the type
of PEs, the memory available to PEs to store programs and data, and how memories are
connected to the PEs, the type of communication network used and the technique of
allocating tasks to PEs and how they communicate and cooperate. The variations in each of
these lead to a rich variety of parallel computers. The possibilities for each of these blocks
are:
Type of Processing Elements
1. A PE may be only an Arithmetic Logic Unit (ALU) or a tiny PE. The ALUs may
use 64-bit operands or may be quite tiny using 4-bit operands.
2. A PE may also be a microprocessor with only a private cache memory or a full
fledged microprocessor with its own cache and main memory. We will call a PE
with its own private memory a Computing Element (CE). PEs themselves are
becoming quite powerful and may themselves be parallel computers as we will see in the next chapter.
3. A PE may be a server or a powerful large computer such as a mainframe or a Vector Processor.
Communication Network
Memory System
1. Each PE has its own private caches (multi-level) and main memory.
2. Each PE has its own private caches (multi-level) but all PEs share one main
memory which is uniformly addressed and accessed by all PEs.
3. Each PE has its own private caches (multi-level) and main memory and all of them
also share one large memory.
Mode of Cooperation
1. Each CE (a PE with its own private memory) has a set of processes assigned to it.
Each CE works independently and CEs cooperate by exchanging intermediate
results.
2. All processes and data to be processed are stored in the memory shared by all PEs.
A free PE selects a process to execute and deposits the results in the shared
memory for use by other PEs.
3. A host CE stores a pool of tasks to be executed and schedules tasks to free CEs
dynamically.
From the above description we see that a rich variety of parallel computer structures can
be built by permutation and combination of these different possibilities.
4.2 CLASSIFICATION OF PARALLEL COMPUTERS
The vast variety of parallel computer architecture may be classified based on the following
criteria:
1. How do instructions and data flow in the system? This idea for classification was
proposed by Flynn [1972] and is known as Flynn’s classification. It is considered
important as it is one of the earliest attempts at classification and has been widely
used in the literature to describe various parallel computer architectures.
2. What is the coupling between PEs? Coupling refers to the way in which PEs
cooperate with one another.
3. How do PEs access memory? Accessing relates to whether data and instructions
are accessed from a PE’s own private memory or from a memory shared by all PEs
or a part from one’s own memory and another part from the memory belonging to
another PE.
4. What is the quantum of work done by a PE before it communicates with another
PE? This is commonly known as the grain size of computation.
In this section we will examine each of these classifications. This will allow us to describe
a given parallel computer using adjectives based on the classification.
arrays simultaneously (e.g., aik + bik). Such computers are known as array processors, which we will discuss later in this chapter.
Figure 4.2 Structure for a single instruction multiple data computer.
Figure 4.3 Regular structure of SIMD computer with data flowing from neighbours.
The third class of computers according to Flynn’s classification is known as Multiple
Instructions stream Single Data stream (MISD) Computers. This structure is shown in Fig.
4.4. Observe that in this structure different PEs run different programs on the same data. In
fact pipeline processing of data explained in Section 2.1 is a special case of this mode of
computing.
In the example we took in Section 2.1, the answer books passed from one teacher to the next correspond to the data stored in DM, and the instructions to grade different questions given to the set of teachers are analogous to the contents of IM1 to IMn in the structure of Fig. 4.4. In pipeline processing the result R1 produced by PE1 is fed to PE2, R2 to PE3, etc., and the DM contents will be input only to PE1. This type of processor may be generalized using a 2-
dimensional arrangement of PEs. Such a structure is known as a systolic processor. Use of
MISD model in systolic processing will be discussed later in this chapter.
Figure 4.4 Structure of a Multiple Instruction Single Data (MISD) Computer.
The last and the most general model according to Flynn’s classification is Multiple
Instructions stream Multiple Data stream (MIMD) Computer. The structure of such a
computer is shown in Fig. 4.5.
where we have assumed for simplicity that si = s. To fix our ideas we will substitute some
values in the above equation.
Let the overhead T for each communication event be 100 cycles and the communication
times be 20 cycles. Let n = 100 (number of processors) and the total compute time p of each
processor = 50000 cycles and m the number of communication events be 100.
Thus, there is 20% loss in speedup when the grain size is 500 cycles. If the processors are
pipelined, one instruction is carried out in each cycle. Thus, if the parallel computer uses such
processors and if each processor communicates once every 500 instructions, the overall
efficiency loss is 20%.
In general, if the loss in efficiency is to be less than 10%, then (m/p)(T + s) < 0.1. This implies that the grain size p/m > 10(T + s).
If T = 100 and s = 20 then grain size > 1200 cycles. If one instruction is carried out every
cycle, the grain size is 1200 instructions. Thus, each processor of a loosely coupled parallel
computer may communicate only once every 1200 instructions if loss of efficiency is to be
kept low.
On the other hand in a tightly coupled system T is 0 as the memory is shared and no
operating system call is necessary to write in the memory. Assuming s of the order of 2
cycles the grain size for 10% loss of speedup is 20 cycles. In other words, after every 20
instructions a communication event can take place without excessively degrading
performance.
The above calculations are very rough and intended just to illustrate the idea of grain size.
In fact we have been very optimistic in the above calculations and made many simplifying
assumptions. These points will be reiterated later with a more realistic model.
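The rough arithmetic above can be reproduced with a few lines of code. The numbers below are the ones used in the text (T = 100, s = 20, p = 50000, m = 100); the efficiency expression p/(p + m(T + s)) is our reading of the simple model described above.

#include <stdio.h>

int main(void) {
    double T = 100.0;      /* overhead per communication event (cycles) */
    double s = 20.0;       /* communication time per event (cycles)     */
    double p = 50000.0;    /* total compute time of each processor      */
    double m = 100.0;      /* number of communication events            */

    double grain = p / m;                          /* cycles between communications */
    double efficiency = p / (p + m * (T + s));     /* useful fraction of total time */

    printf("grain size = %.0f cycles, loss in efficiency = %.1f%%\n",
           grain, (1.0 - efficiency) * 100.0);     /* roughly the 20 per cent quoted above */
    return 0;
}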
4.3 VECTOR COMPUTERS
One of the earliest ideas used to design high performance computers was the use of temporal
parallelism (i.e., pipeline processing). Vector computers use temporal processing extensively.
The most important unit of a vector computer is the pipelined arithmetic unit. Consider
addition of two floating point numbers x and y. A floating point number consists of two parts,
a mantissa and an exponent. Thus, x and y may be represented by the tuple (mant x, exp x)
and (mant y, exp y) respectively. Let z = x + y. The sum z may be represented by (mant z, exp
z). The task of adding x and y can be broken up into the following four steps:
Step 1: Compute (exp x – exp y) = m
Step 2: If m > 0 shift mant y, m positions right and fill its leading bit positions with zeros. Set
exp z = exp x. If m < 0 shift mant x, m positions right and fill the leading bits of mant x
with zeros. Set exp z = exp y. If m = 0 do nothing. Set exp z = exp x.
Step 3: Add mant x and mant y. Let mant z = mant x + mant y.
Step 4: If mant z ≥ 1 then shift mant z right by 1 bit and add 1 to exp z. If one or more leading bits of mant z are 0, shift mant z left until the leading bit of mant z is not zero. Let the number of shifts be p. Subtract p from exp z.
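The four steps can be tried out in C; frexp and ldexp (from <math.h>) split a double into a mantissa and an exponent and reassemble it, standing in for the hardware's bit-field manipulations. This is only an illustrative sketch of the algorithm, not of how the pipelined hardware is implemented.

#include <math.h>
#include <stdio.h>

double add_in_four_steps(double x, double y) {
    int exp_x, exp_y;
    double mant_x = frexp(x, &exp_x);   /* x = mant_x * 2^exp_x, 0.5 <= |mant_x| < 1 */
    double mant_y = frexp(y, &exp_y);

    /* Step 1: compare exponents */
    int m = exp_x - exp_y;

    /* Step 2: align the mantissa of the smaller operand */
    int exp_z;
    if (m > 0)      { mant_y = ldexp(mant_y, -m); exp_z = exp_x; }
    else if (m < 0) { mant_x = ldexp(mant_x,  m); exp_z = exp_y; }
    else            { exp_z = exp_x; }

    /* Step 3: add mantissas */
    double mant_z = mant_x + mant_y;

    /* Step 4: normalize the result */
    if (mant_z != 0.0) {
        int adjust;
        mant_z = frexp(mant_z, &adjust);   /* renormalize to [0.5, 1) */
        exp_z += adjust;
    }
    return ldexp(mant_z, exp_z);
}

int main(void) {
    printf("1.5 + 2.25 = %g\n", add_in_four_steps(1.5, 2.25));   /* prints 3.75 */
    return 0;
}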
A block diagram of a pipelined floating point adder is shown in Fig. 4.8. Such a pipelined
adder can be used very effectively to add two vectors. A vector may be defined as an ordered
sequence of numbers. For example x = (x1, x2, …, xn) is a vector of length n. To add vectors x
and y, each of length n, we feed them to a pipelined arithmetic unit as shown in Fig. 4.9. One
pair of operands is shifted into the pipeline from the input registers every clock period, let us
say T.
The main units of a typical vector supercomputer are:
1. Input/Output processors
2. Main memory
3. Instruction registers, scalar registers, and vector registers
4. Pipelined processors
5. Secondary storage system
6. Frontend computer system
If s = 8 and k = 4, we obtain
In Fig. 4.16, we have given an example of a systolic array which replaces a sequence of
values (xn+2, xn+1 …, x3) in a memory with a new sequence of values (yn, yn–2, yn–1 …, y3)
where
Observe that the contents of the serial memory are shifted one position to the right and
pushed into the cells at each clock interval. The most significant bit position of the serial
memory which is vacated when the contents are shifted right is filled up by the value
transformed by the cells.
The details of the transformation rules are shown in Fig. 4.16. Complex systolic arrays
have been built for applications primarily in signal processing.
Figure 4.16 A systolic sequence generator.
4.7 SHARED MEMORY PARALLEL COMPUTERS
One of the most common parallel architectures using a moderate number of processors (4 to 32) is a shared memory multiprocessor [Culler, Singh and Gupta, 1999]. This architecture
provides a global address space for writing parallel programs. Each processor has a private
cache. The processors are connected to the main memory either using a shared bus or using
an interconnection network. In both these cases the average access time to the main memory from any processor is the same. Thus, this architecture is also known as a Symmetric Multiprocessor, abbreviated SMP. This architecture is popular as it is easy to program due to the availability of a globally addressed main memory where the program and all data are
stored. A shared bus parallel machine is also inexpensive and easy to expand by adding more
processors. Thus, many systems use this architecture now and many future systems are
expected to use this architecture. In this section we will first discuss this architecture.
Process X when it encounters fork Y invokes another Process Y. After invoking Process Y, it
continues doing its work. The invoked Process Y starts executing concurrently in another
processor. When Process X reaches join Y statement, it waits till Process Y terminates. If Y
terminates earlier, then X does not have to wait and will continue after executing join Y.
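The fork/join behaviour just described can be mirrored with POSIX threads, where pthread_create plays the role of fork Y and pthread_join the role of join Y. This is only an analogy in C; the text's fork/join is a language level construct.

#include <pthread.h>
#include <stdio.h>

/* Process Y: invoked by fork Y and runs concurrently with Process X. */
void *process_Y(void *arg) {
    (void)arg;
    printf("Process Y doing its work concurrently\n");
    return NULL;
}

int main(void) {                                   /* Process X */
    pthread_t y;
    pthread_create(&y, NULL, process_Y, NULL);     /* fork Y */

    printf("Process X continues with its own work\n");

    pthread_join(y, NULL);                         /* join Y: wait only if Y has not yet terminated */
    printf("Process X proceeds past join Y\n");
    return 0;
}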
When multiple processes work concurrently and update data stored in a common memory
shared by them, special care should be taken to ensure that a shared variable value is not
initialized or updated independently and simultaneously by these processes. The following
example illustrates this problem. Assume that Sum ← Sum + f(A) + f(B) is to be computed
and the following program is written:
Suppose Process A loads Sum into its register to add f(A) to it. Before the result is stored back in main memory by Process A, suppose Process B also loads Sum into its local register to add f(B) to it. Process A will then have Sum + f(A) and Process B will have Sum + f(B) in their respective local registers. Now, both Processes A and B will store the result back in Sum. Depending on which process stores Sum last, the value in the main memory will be either Sum + f(A) or Sum + f(B), whereas what was intended was to store Sum + f(A) + f(B) as the result in the main memory. If Process A stores Sum + f(A) first in Sum and then Process B takes this result and adds f(B) to it, the answer will be correct. Thus, we have to ensure that only
one process updates a shared variable at a time. This is done by using a statement called lock
<variable name>. If a process locks a variable name, no other process can access it till it is
unlocked by the process which locked it. This method ensures that only one process is able to
update a variable at a time.
The updating program can thus be rewritten as:
In the above case whichever process reaches lock Sum statement first will lock the variable
Sum disallowing any other process from accessing it. It will unlock Sum after updating it.
Any other process wanting to update Sum will now have access to it. That process will lock
Sum and then update it. Thus, updating a shared variable is serialized. This ensures that the
correct value is stored in the shared variable.
In order to correctly implement lock and unlock operations used by the software, we
require hardware assistance. Let us first assume there is no hardware assistance. The two
assembly language programs given below attempt to implement lock and unlock respectively
in SMAC2P, the hypothetical computer we described in the last chapter [Culler, Singh and
Gupta, 1999].
In the above two programs L is the lock variable and #0 and #1 immediate operands. A
locking program locks by setting the variable L = 1 and unlocks by setting L = 0. A process
trying to obtain the lock should check if L = 0 (i.e., if it is unlocked), enter the critical section
and immediately lock the critical section by setting L = 1 thereby making the lock busy. If a
process finds L = 1, it knows that the lock is closed and will wait for it to become 0 (i.e.,
unlocked). The first program above implements a “busy-wait” loop which goes on looping
until the value of L is changed by some other process to 0. It now knows it is unlocked, enters
a critical section locking it to other processes. After completing its work, it will use the
unlock routine to allow another process to capture the lock variable. The lock program given
above looks fine but will not work correctly in actual practice due to the following reason.
Suppose L = 0 originally and two processes P0 and P1 execute the above lock code. P0 reads L, finds it to be 0, passes the BNZ statement and in the next instruction will set L = 1. However, if P1 reads L before P0 sets L = 1, it will also think L = 0 and assume the lock is open! This problem has arisen because the sequence of instructions: reading L, testing it to
see if it is 0 and changing it to 1 are not atomic. In other words, they are separate instructions
and another process is free to carry out any instruction in between these instructions. What is
required is an atomic instruction which will load L into register and store 1 in it. With this
instruction we can rewrite the assembly codes for lock and unlock as follows:
The new instruction TST is called Test and Set in the literature even though a better name for
it would be load and set. It loads the value found in L in the register R1 and sets the value of
L to 1.
Observe that if L = 1, the first program will go on looping, keeping L at 1. If L is at some time set to 0 by another process, R1 will be set to 0, capturing the value of L, and the procedure will return after making L = 1, thereby implementing locking. Observe that another process
cannot come in between and disturb locking as test and set (TST) is an atomic instruction and
is carried out as a single indivisible operation.
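In C, the same effect is obtained with the language's atomic test and set operation; the sketch below uses C11's atomic_flag, whose atomic_flag_test_and_set reads the old value of the lock variable and sets it to 1 as one indivisible operation, exactly the property required of TST.

#include <stdatomic.h>

static atomic_flag L = ATOMIC_FLAG_INIT;   /* 0 = unlocked, 1 = locked */

/* Busy-wait until the old value of L was 0; the read of the old value and
   the setting of L to 1 happen as one indivisible operation. */
void lock(void) {
    while (atomic_flag_test_and_set(&L))
        ;                                  /* lock is busy; keep looping */
}

void unlock(void) {
    atomic_flag_clear(&L);                 /* set L back to 0 */
}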
Test and set is a representative of an atomic read-modify-write instruction in the
instruction set of a processor which one intends to use as a processor of a parallel computer.
This instruction stores the contents of a location “L” (in memory) in a processor register and
stores another value in “L”. Another important primitive called barrier which ensures that all
processes complete specified jobs before proceeding further also requires such atomic read-
modify-write instruction to be implemented correctly. Implementing barrier synchronization
using an atomic read-modify write instruction is left as an exercise to the reader.
An important concept in executing multiple processes concurrently is known as sequential
consistency. Lamport [1979] defines sequential consistency as:
“A multiprocessor is sequentially consistent if the result of any execution is the same as if
the operation of all processors were executed in some sequential order and the operations of
each individual processor occurs in this sequence in the order specified by its program”.
In order to ensure this in hardware each processor must appear to issue and complete
memory operations one at a time atomically in program order. We will see in a succeeding section that in a shared memory computer with a cache coherence protocol, sequential consistency is ensured.
Figure 4.17 A shared memory parallel computer (number of PEs between 4 and 32).
EXAMPLE 4.1
Consider a shared bus parallel computer built using 32-bit RISC processors running at 2 GHz
which carry out one instruction per clock cycle. Assume that 15% of the instructions are
loads and 10% are stores. Assume a 0.95 cache hit rate for reads and that the caches are write-through.
The bandwidth of the bus is given as 20 GB/s.
1. How many processors can the bus support without getting saturated?
2. If caches are not there how many processors can the bus support assuming the
main memory is as fast as the cache?
Solution
Assuming each processor has a cache.
Number of transactions to main memory/s = Number of read transactions + Number of
write transactions
= 0.15 × 0.05 × 2 × 10⁹ + 0.10 × 2 × 10⁹
= (0.015 + 0.2) × 10⁹ = 0.215 × 10⁹ transactions/s
Bus bandwidth = 20 × 10⁹ bytes/s.
Average bytes of traffic to memory = 4 × 0.215 × 10⁹ = 0.86 × 10⁹ bytes/s
∴ Number of processors which can be supported
= (20 × 10⁹)/(0.86 × 10⁹) ≈ 23
If no caches are present, then every load and store instruction requires main memory
access.
Average bytes of traffic to memory
= 0.25 × 2 × 10⁹ × 4 = 2 × 10⁹ bytes/s
∴ Number of processors which can be supported
= (20 × 10⁹)/(2 × 10⁹) = 10
This example illustrates the fact that the use of caches allows more processors to share the bus, even when the main memory is as fast as the cache, because the shared bus is the bottleneck.
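The arithmetic of Example 4.1 is easy to parameterise, which also makes it simple to explore other hit rates or bus bandwidths; the small program below just repeats the calculation above.

#include <stdio.h>

int main(void) {
    double instr_rate = 2e9;              /* instructions per second per processor    */
    double loads = 0.15, stores = 0.10;   /* fractions of the instruction mix         */
    double read_miss = 0.05;              /* 0.95 hit rate for reads                  */
    double word = 4.0;                    /* bytes transferred per memory transaction */
    double bus_bw = 20e9;                 /* bus bandwidth in bytes per second        */

    /* With write-through caches, read misses and all writes reach the bus. */
    double with_cache = (loads * read_miss + stores) * instr_rate * word;
    /* Without caches, every load and store reaches the bus. */
    double no_cache = (loads + stores) * instr_rate * word;

    printf("with caches   : about %.0f processors\n", bus_bw / with_cache);  /* ~23 */
    printf("without caches: about %.0f processors\n", bus_bw / no_cache);    /* 10  */
    return 0;
}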
The use of caches is essential but it brings with it a problem known as the cache coherence problem. The problem arises because when a PE writes a data value, say 8, into address x of its private cache, the write is not known to the caches of the other PEs. If another PE reads data from address x of its cache, it will read the value stored in x earlier, which may be, say, 6. It is essential to keep the data in a given address x the same in all the caches to avoid
errors in computation. There are many protocols (or rules) to ensure coherence. We will
describe in the next sub-section a simple protocol for bus based systems. Before describing it,
we quickly review the protocol used to read and write in a single processor system.
1. Cold miss which happens when cache is not yet loaded with the blocks in the
working set. This happens at the start of operation and is inevitable.
2. Capacity miss which happens when the cache is too small to accommodate the
working set. The only way to prevent this is to have a large cache which may not
always be possible.
3. Conflict miss which happens when multiple cache blocks contend for the same set
in a set associative cache. This is difficult to prevent.
These three types of misses are commonly called the 3C misses in single processor
systems.
The situation in a multiprocessor is more complicated because there are many caches, one
per processor, and one has to know the status of each block of the cache in each PE to ensure
coherence. A bus based system has the advantage that a transaction involving a cache block
may be broadcast on the bus and other caches can listen to the broadcast. Thus, cache
coherence protocols are based on the cache controller of each cache receiving Read/Write
instructions from the PE to which it is connected and also listening (called snooping which
means secretly listening) to the broadcast on the bus initiated when there is a Read/Write
request by any one of the PEs and taking appropriate action. Thus, these protocols are known
as snoopy cache protocols. In this subsection, we will describe one simple protocol which
uses only the information whether a cache block is valid or not. Details of cache coherence
protocols are discussed by Sorin, Hill and Wood [2011]. A cache block will be invalid if that
block has new data written in it by any other PE. Many protocols are possible as there are
trade offs between speed of reading/writing the requested data, bandwidth available on the
bus, and the complexity of the cache and the memory controllers. In the protocol we describe below, the cache controller is very simple.
The actions to be taken to maintain cache coherence depend on whether the command is to
read data from memory to a register in the processor or to write the contents of a register in
the processor to a location in memory. Let us first consider a Read instruction (request).
Read request for a variable stored in the cache of PEk
The following cases can occur:
Case 1: If the block containing the address of the variable is in PEk’s cache and it is valid (a
bit associated with the block should indicate this), it is retrieved from the cache and sent to
PEk.
Case 2: If the block containing the requested address is in PEk’s cache but is not valid then it
cannot be taken from PEk’s cache and there is a Read-miss. (This is a Read-miss which is due
to lack of cache coherence and is called Coherence miss). The request is broadcast on the bus
and if the block is found in some other PE’s cache and is valid, the block containing the data
is moved to PEk’s cache. The data is read from PEk’s cache. (It should be remembered that at
least one PE must have a valid block).
Case 3: If the block containing the requested address is not in PEk's cache, the request is broadcast on the bus and, if a block containing the requested data is found in any other PE's cache and is valid, the block is moved to PEk's cache and the data is read from there. A block in PEk's cache is replaced and, if it is dirty, it is written in the main memory.
Case 4: If the block containing the requested address is not in PEk's cache and the request broadcast on the bus does not find the block in any other PE's cache, the block with the required data is read from the main memory and placed in PEk's cache replacing a block, and the data is read from there. If the block replaced is dirty, it is written in the main memory.
Table 4.1(a) summarises the above cases.
TABLE 4.1(a) Decision Table for Maintaining Cache Coherence in a Bus Based Multiprocessor: Reading from cache
Cases →                      1   2   3   4
Is block in PEk's cache?     Y   Y   N   N
Is block valid?              Y   N   —   —
Actions: as described in Cases 1 to 4 above.
As we pointed out earlier, many variations to this protocol are possible. For instance in
case 2 (see Table 4.1(a)), this protocol moves a valid cache block and replaces the cache
block in PEk with it. Instead the protocol may have moved the valid block to the main
memory and then moved it to PEk’s cache. This would increase the read delay and may be
inevitable if blocks cannot be moved from one cache to another via the bus.
M = Modified    The data in cache block has been modified and it is the only valid copy. Main memory copy is an old copy. (Dirty bit set in cache block.)
E = Exclusive   The data in cache block is valid and is the same as in the main memory. No other cache has this copy.
S = Shared      The data in cache block is valid and is the same as in the main memory. Some other caches may also have valid copies.
I = Invalid     The data in cache block has been invalidated as another PE has the same cache block with a newly written value.
Figure 4.18 State transition diagram for MESI protocol.
Referring to Fig. 4.18, observe that the protocol is explained using two state transition
diagrams marked (a) and (b). In Fig. 4.18, the circles represent the state of a cache block in
the processor initiating action to Read or Write. The solid line shows the transition of the
cache block state after Read/Write. The action taken by the processor is also shown as A1, A2, etc. Referring to Fig. 4.18(a), when a cache block is in state S and a write command is given by a PE, the cache block transitions to state M and A4 is the control action initiated by the system.
In Fig. 4.18(b) we show the state transitions of the corresponding cache blocks in other
PEs which respond to the broadcast by the initiating PE. We show these transitions using
dashed lines and label them with the actions shown in Fig. 4.18(a).
A1: Read data into processor register from the block in own cache.
A2: Copy modified, shared or exclusive copy into own cache block. Write new data into own
cache block. Invalidate copies of block in other caches.
A3: Write data into cache block.
A4: Broadcast intent to write on bus. Mark all blocks in caches with shared copy of block
invalid. Write data in own cache block.
A5: Broadcast request for valid copy of block from another cache. Replace copy of block in
cache with valid copy. Read data from own cache block.
The above actions are specified when the required data is found in the cache. If it is not
found in the cache, namely, Read-miss or Write-miss, the following actions are taken:
Read-miss actions:
A6: A command to read the required block is broadcast. If another cache has this block and it is in modified state, that copy is stored in the main memory. This block replaces a
block in the cache. The initiating and supplying caches go to shared state. Data is read
from the supplied block. If the replaced block is in modified state, it is written in the main
memory.
A7: A command to read the required block is broadcast. If any other processor’s cache has
this block in exclusive or shared state, it supplies the block to the requesting processor’s
cache. The cache block status of requesting and supplying processors go to shared state.
Data is read from the supplied block. If the replaced block is in modified state, it is written
in the main memory.
A8: A command to read the required block is broadcast. If no processor's cache has this block, it is read from the main memory to the requesting processor's cache, replacing a block. If the block which is replaced is in modified state, it is written in the main memory. The processor's cache block goes to the exclusive state. Data is read from the newly fetched block.
Write-miss actions:
A9: Intent to write is broadcast. If any cache has the requested block in modified state, it is
written in the main memory and supplied to the requesting processor. Data is written in
the supplied block. The state of requesting cache block is set as modified. The other
processor’s cache block is invalidated. If the block which is replaced is in modified state,
it is written in main memory.
A10: Intent to write is broadcast. If another cache has the requested block in shared or
exclusive state, the block is read from it and replaces a block. Data is written in it. Other
copies are invalidated. The requesting cache block will be in modified state. If the block
which is replaced is in modified state, it is written in main memory.
A11: Intent to write is broadcast. If no cache has the requested block, it is read from the main
memory and replaces a block in the requesting cache. Data is written in this block and it is
placed in exclusive state. If the block replaced is in modified state, it is written in the main
memory.
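The transitions of the requesting processor's own cache block on read and write hits, driven by the actions A1 to A5 above, can be written as a compact state machine; the sketch below encodes them in C. Bus-side behaviour and the miss actions A6 to A11 are deliberately left out, so this is an illustration of the protocol's local transitions rather than a complete coherence controller.

#include <stdio.h>

typedef enum { M, E, S, I } MesiState;
typedef enum { PROC_READ, PROC_WRITE } ProcOp;

/* Next state of the requesting processor's own cache block for a read or
   write that finds the block in some cache (actions A1 to A5 above). */
MesiState mesi_next(MesiState cur, ProcOp op, const char **action) {
    if (op == PROC_READ) {
        if (cur == I) { *action = "A5"; return S; }   /* fetch valid copy, then read */
        *action = "A1";                               /* read hit: state unchanged   */
        return cur;
    }
    /* PROC_WRITE */
    switch (cur) {
    case M: *action = "A3"; return M;   /* write into own block                  */
    case E: *action = "A3"; return M;   /* write into own block, now modified    */
    case S: *action = "A4"; return M;   /* broadcast intent to write, invalidate */
    case I: *action = "A2"; return M;   /* obtain valid copy, write, invalidate  */
    }
    return cur;                         /* not reached */
}

int main(void) {
    const char *act;
    MesiState next = mesi_next(S, PROC_WRITE, &act);
    printf("S + write -> %s, action %s\n", next == M ? "M" : "?", act);
    return 0;
}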
We have described the MESI protocol using a state diagram. We may also use decision
tables to explain the protocol. We have decision tables for the MESI protocol in Table 4.3.
Case 1: Read data. The required block with data is in own cache or in some other cache
Current state   M    E    S    I
Actions         A1   A1   A1   A5
Next state      M    E    S    S

Case 2: Write data in block. The required block is in own cache or in some other cache
Current state   M    E    S    I
Actions         A3   A3   A4   A2
Next state      M    M    M    M

Case 3: Read data. The required block is not in own cache (Read-miss)
Actions         A7   A7   A6   A8

Case 4: Write data. The required block is not in own cache (Write-miss)
Read data block from main memory and place it in requesting cache replacing a block.   —   —   —   X
If the replaced block is in modified state, write it in main memory.
EXAMPLE 4.2
A 4-processor system shares a common memory connected to a bus. Each PE has a cache
with block size of 64 bytes. The main memory size is 16 MB and the cache size is 16 KB.
Word size of the machine is 4 bytes/word. Assume that a cache invalidate command is 1 byte long. The following sequence of instructions is carried out. Assume blocks containing these addresses are initially in all 4 caches. We use hexadecimal notation to represent addresses.
P1: Store R1 in AFFFFF
P2: Store R2 in AFFEFF
P3: Store R3 in BFFFFF
P4: Store R4 in BFFEFF
P1: Load Rl from BFFEFF
P2: Load R2 from BFFFFF
P3: Load R3 from AFFEFF
P4: Load R4 from AFFFFF
(i) If write invalidate protocol is used, what is the bus traffic?
(ii) If write update protocol is used, what is the bus traffic?
This example should not be taken to mean that the write-invalidate and write-update protocols always generate the same bus traffic. A different pattern of reads and writes will favour one protocol over the other; for instance, if one processor writes many times to a location in memory and it is read only once by another processor, the write-invalidate protocol generates less bus traffic (see Exercise 4.23).
M = Modified    The data in the cache block has been modified and it is the only valid copy. The main memory copy is an old copy. (Dirty bit set in cache block.)
O = Owned       The data in the cache block is valid. Some other caches may also have valid copies. However, the block in this state has exclusive right to make changes to it and must broadcast the change to all other caches sharing this block.
E = Exclusive   The data in cache block is valid and is the same as in the main memory. No other cache has this copy.
S = Shared      The data in the cache block is valid and may not be the same as in the main memory. Some other caches may also have valid copies.
I = Invalid     The data in cache block has been invalidated as another cache block has a newly written value.
This parallel program has the possible results (p,q) = (0,0), (0,1), (1,1), whereas the result (p,q) = (1,0) is not possible: p = 1 implies that F has been set to 1, which cannot happen unless X has already been set to 1, and this in turn makes q = 1, so q cannot be 0. Observe that even though this parallel program is sequentially consistent as per Lamport's definition, it can give three possible results. This
indeterminacy is inherent to programs running independently in two processors due to what is
known as data races. In the above example, (p,q) : = (0,0) will occur if PE1 is slow and PE2
is fast so that it reaches statements p:=F and q: = X before PE1 gets to the statements X:=1,
and F:=1. The result (p,q): = (1,1) will occur if PE1 is fast and wins the race and makes X:=1
and F:=1 before PE2 reaches the statements p:=F and q:=X. To get determinate results, a
programmer must make the programs data race free. For example if the result (p,q) :=(1,1) is
required the programs should be rewritten as:
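The following is a minimal sketch of such a rewrite using C11 atomics and POSIX threads; the variable names X, F, p and q follow the discussion above, while the use of C and of a busy-wait loop is only illustrative and not the book's own listing.

    /* PE2 busy-waits on the flag F before reading, so the write of X by
       PE1 is guaranteed to be visible; the result is always (1,1).       */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int X = 0, F = 0;
    int p, q;

    void *pe1(void *arg) {               /* PE1 */
        atomic_store(&X, 1);             /* X := 1 */
        atomic_store(&F, 1);             /* F := 1 */
        return NULL;
    }

    void *pe2(void *arg) {               /* PE2 */
        while (atomic_load(&F) != 1)     /* wait until F = 1 */
            ;                            /* busy wait        */
        p = atomic_load(&F);             /* p := F           */
        q = atomic_load(&X);             /* q := X           */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, pe1, NULL);
        pthread_create(&t2, NULL, pe2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("(p,q) = (%d,%d)\n", p, q);   /* always (1,1) */
        return 0;
    }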
This program will always give p:=1, q:=1 regardless of the speeds of PE1 and PE2. The
program is now data race free.
Sequential consistency of a shared memory parallel computer is guaranteed if the
following conditions are satisfied [Adve and Gharachorloo, 1996].
1. Cache coherence of the parallel computer is ensured (if the PEs have caches).
2. All PEs observe writes to a specified location in memory in the same program
order.
3. For each PE, delay access to a location in memory until all previous Read/Write
operations to memory are completed. A read is complete when a value is returned to the
PE, and a write is complete when all PEs see the written value in the main memory.
(It is assumed that writes are atomic.)
It is important to note that the above conditions are sufficient conditions for maintaining
sequential consistency.
In a bus based shared memory multiprocessor system whose caches are coherent,
sequential consistency is maintained because:
1. Writes to a specified location in the shared memory are serialized by the bus and
observed by all PEs in the order in which they are written. Writes are atomic
because all PEs observe a write at the same time.
2. Reads by a PE are completed in program order. If there is a cache miss, the cache is
busy and delays any subsequent read requests.
A data race free program should be the aim of a programmer writing a parallel program, as
it will give predictable results. The sequentially consistent memory model is easy to understand
and is what programmers expect as the behavior of shared memory parallel computers. Many
commercial shared memory parallel computer architects have, however, used relaxed
memory models as they felt these would improve the performance of the computer. For example,
the multiprocessor architecture of Fig. 4.19 does not guarantee sequential consistency.
As was pointed out earlier in this section, architects use these models in many commercial
shared memory computers as they are convinced that performance gains can be obtained.
However, these models are difficult to understand from a programmer's point of view. As
synchronization of programs is specific to the multiprocessor using such a relaxed model and
is prone to error, it is recommended that programmers use the synchronization libraries
provided by the manufacturer of a multiprocessor, which effectively hide the intricacies of the
relaxed model.
Hill [1998] argues convincingly that with the advent of speculative execution, both
sequentially consistent and relaxed memory model implementations in hardware can execute
instructions out of order aggressively. Sometimes instructions may have to be undone when
speculation is incorrect. A processor will commit an instruction only when it is sure that it
need not be undone. An instruction commits only when all previous instructions are already
committed and the operation performed by the instruction is itself committed. A load or store
operation will commit only when it is sure to read or write a correct value from/to memory.
Even though relaxed and sequential implementations can both perform the same speculation,
relaxed implementation may often commit to memory earlier. The actual difference in
performance depends on the type of program being executed. Benchmark programs have
been run [Hill, 1998] and it has been found that relaxed models’ superiority is only around 10
to 20%. Thus, Hill recommends that even though many current systems have implemented
various relaxed consistency models, it is advisable for future systems to be designed with
sequential consistency as the hardware memory model. This is because programming
relaxed models adds complexity and consequently increases programming cost.
If a relaxed model is implemented, relaxing the W → R order (the processor consistency or total
store order model) leads to the least extra programming problems.
    State                     Lock bit    Modified bit    Explanation
    Absent (A)                0           0               All processors' cache bits in the directory are 0
                                                          (no cache holds a copy of this block)
    Present (P)               0           0               One or more processor cache bits are 1
                                                          (one or more caches hold this block)
    Present exclusive (PM)    0           1               Exactly one cache bit is 1
                                                          (exactly one cache holds a copy)
TABLE 4.6 States of Blocks in Cache (There is one 2-bit entry per block in the cache)

    State    Valid bit    Modified bit    Explanation
The cache consistency is maintained by actions taken by the memory controllers. The
actions depend on whether it is a read or a write request from processors, state of appropriate
cache block and the state of the requested blocks in main memory. These actions constitute
the cache coherence protocols for directory based multiprocessors. We will explain now a
protocol which uses a write invalidate policy. It also updates the main memory whenever a
new value is written in the cache of a processor.
Thus, the main memory will always have the most recently written value. We look at
various cases which arise when we try to read a word from the cache.
Read from memory to processor register
kth processor executes load word command.
Case 1: Read-hit, i.e., block containing word is in kth processor’s cache.
Cases 1.1, 1.2: If the status of the block in the cache is V or VM, read the word. No other action.
Case 1.3: If the status of the block in cache is I (i.e., the block is invalid), read the valid copy
from the cache of a PE holding a valid copy (Remember at least one other PE must have a
valid copy).
Case 2: Read-miss; i.e., block containing word not in cache. Check status of block in main
memory.
Case 2.1: If it is A, i.e., no processor cache has this block.
A1: Copy block from main memory to the kth processor's cache using the block replacement
policy.
A2: Fulfill read request from cache.
A3: Set 1 in the kth bit position of directory entry of this block.
A4.1: Change status bits of directory entry of this block to PM.
A5: Change status bits of kth processor’s cache block to V.
Case 2.2: If status of block in main memory is P (i.e., one or more caches has this block) then
carry out actions A1, A2, A3 and A5 as in Case 2.1.
Case 2.3: If status of block in main memory is PM (exactly one cache has a copy and it is
modified) then carry out actions A1, A2, A3 as in Case 2.1.
A4.2: Change status bits of directory entry of this block to P.
A5: Change status bits of kth processor’s cache block to V. These rules are summarized in
Table 4.7.
TABLE 4.7 Read Request Processing in Multiprocessor with Directory Based Cache Coherence Protocol
(Read request by kth processor)
    Cases →
    Actions
        Set R =                    —   —   1   2
        Go to Decision Table R     —   —   X   X
        Exit                       X   X   —   —

    Decision Table R
    Cases →
    Actions

Note: Write-invalidate protocol is used with a write-through policy to main memory.
Write from processor register to memory
Store from kth processor’s register to its cache. (Assume write-invalidate policy and write
through to main memory).
Case 1: Write hit, i.e., kth processor’s cache has the block where the word is to be stored.
Case 1.1: If the block’s status in cache is VM (i.e., it is an exclusive valid copy).
A1: Write contents of register in cache.
A2: Write updated cache block in main memory.
Case 1.2: If the block’s status is V carry out actions A1, A2 of Case 1.1.
A3: Change the status of this block in other processors to I (The identity of other
processors having this block is found from the directory in main memory). (Note: This
is write-invalidate policy).
Case 1.3: If the block’s status is I.
A0: Copy block from main memory to cache and set its status to V. Carry out actions A1,
A2 of Case 1.1 and action A3 of Case 1.2.
Case 2: Write miss, i.e., block containing word is not in kth processor’s cache. Read directory
of block from main memory.
Case 2.1: If its status is A do following:
A2.1: Copy block from memory into k’s cache using block replacement policy.
A2.2: Write register value in block.
A2.3.1: Change status of cache block bits to VM.
A2.4: Write back block to main memory.
A2.5: Put 1 in kth bit of directory.
A2.6.1: Set status bits of directory entry to PM.
Case 2.2: If its status is P do following:
Perform actions A2.1, A2.2, A2.4, A2.5 and the following:
A2.3.2: Change status of cache block bits to V.
A2.8: Change status bits of cache blocks storing this block to I.
Case 2.3: If its status is PM, perform actions A2.1, A2.2, A2.4, A2.5, A2.3.2, A2.8.
A2.6.2: Change status of directory entry to P.
The write protocol is summarized in Table 4.8.
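A hedged sketch of the write protocol just described is given below; it is not the book's code. Only the state changes of the cache blocks and of the directory entry are modelled, data movement (actions A1, A2, A2.1, A2.2, A2.4) is indicated by comments, and the names and data structures are illustrative assumptions.

    /* Write-invalidate policy with write-through to main memory.          */
    #include <stdio.h>

    #define NPROC 4
    enum cstate { C_I, C_V, C_VM };          /* cache block states I, V, VM */
    enum dstate { D_A, D_P, D_PM };          /* directory states A, P, PM   */

    struct dir_entry {
        enum dstate state;
        int present[NPROC];                  /* one presence bit per processor */
    };

    void write_request(int k, int hit, enum cstate cache[NPROC],
                       struct dir_entry *d)
    {
        if (hit) {                                    /* Case 1: write hit    */
            if (cache[k] == C_I)                      /* Case 1.3             */
                cache[k] = C_V;                       /* A0: refetch block    */
            /* A1: write word in cache;  A2: write through to main memory    */
            if (cache[k] != C_VM) {                   /* Cases 1.2 and 1.3    */
                for (int j = 0; j < NPROC; j++)       /* A3: invalidate copies*/
                    if (j != k && d->present[j])
                        cache[j] = C_I;
            }
        } else {                                      /* Case 2: write miss   */
            /* A2.1: copy block from memory; A2.2: write word;
               A2.4: write updated block back to main memory                 */
            if (d->state == D_A) {                    /* Case 2.1             */
                cache[k] = C_VM;                      /* A2.3.1               */
                d->state = D_PM;                      /* A2.6.1               */
            } else {                                  /* Cases 2.2 and 2.3    */
                cache[k] = C_V;                       /* A2.3.2               */
                for (int j = 0; j < NPROC; j++)       /* A2.8: invalidate     */
                    if (j != k && d->present[j])
                        cache[j] = C_I;
                if (d->state == D_PM)
                    d->state = D_P;                   /* A2.6.2               */
            }
            d->present[k] = 1;                        /* A2.5                 */
        }
    }

    int main(void)
    {
        enum cstate cache[NPROC] = { C_V, C_V, C_V, C_V };
        struct dir_entry d = { D_P, { 1, 1, 1, 1 } };
        write_request(2, 1, cache, &d);      /* write hit by processor 2     */
        printf("cache[0]=%d cache[2]=%d dir=%d\n", cache[0], cache[2], d.state);
        return 0;
    }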
One of the main disadvantages of the directory scheme is the large amount of memory
required to maintain the directory. This is illustrated by the following example:
EXAMPLE 4.3
A shared memory multiprocessor with 256 processors uses directory based cache coherence
and has 2 TB of main memory with block size of 256 bytes. What is the size of the directory?
Solution
There are (2 TB/256B) blocks = 8G blocks.
Each block has one directory entry of size (256 + 2) = 258 bits.
∴ Number of bits in the directory = 8G × 258 bits = 2064 Gbits
= 258 GB.
This is more than one eighth of the total main memory!
Number of bits for directory in each cache = 2 × Number of blocks/cache. If each cache is
1 MB, number of blocks in each cache = 4096.
∴ Number of bits for directory in each cache = 8192 bits.
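A small check of this arithmetic (assumed helper, not from the text) is sketched below.

    /* Directory size arithmetic of Example 4.3.                           */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t mem_bytes   = 2ULL << 40;             /* 2 TB main memory   */
        uint64_t block_bytes = 256;                    /* block size         */
        uint64_t nproc       = 256;                    /* presence bits/entry*/

        uint64_t blocks     = mem_bytes / block_bytes;           /* 8G blocks*/
        uint64_t entry_bits = nproc + 2;                         /* 258 bits */
        uint64_t dir_bytes  = blocks * entry_bits / 8;

        printf("blocks = %llu, directory = %llu GB\n",
               (unsigned long long)blocks,
               (unsigned long long)(dir_bytes >> 30));           /* 258 GB   */
        return 0;
    }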
TABLE 4.8 Write Request Processing in Multiprocessor with Directory Based Cache Coherence Protocol
Cases →
Actions
A0 Copy block from main memory to cache and set cache block state to V — — X —
Decision Table W
Cases →
Actions
The directory scheme we have described was proposed by Censier and Feautrier [1978]
and is a simple scheme. The main attempts to improve this scheme try to reduce the size of
the directory by increasing the block size. For example, in Example 4.3 if the block size is
increased to 1 KB, the directory size will be around 64 GB. Larger blocks will, however,
increase the number of invalidated cached copies and the consequent delay in moving large blocks
from the main memory to the processors' caches, even when only a small part of a block is changed.
One method which has been proposed is a limited directory scheme in which the number of
simultaneously cached copies of a particular block is restricted. Suppose not more than 7
simultaneous copies of a cache block are allowed in Example 4.3 then the directory entry can
be organized as shown in Fig. 4.22.
1. The total network bandwidth, i.e., the total number of bytes/second the
network can support. This is the best case, when all the CEs transmit data
simultaneously. For a ring of n CEs with individual link bandwidth B, the total
bandwidth is nB.
2. The bisection bandwidth is determined by imagining a cut which divides the
network into two parts and finding the bandwidth of the links across the cut. In the
case of a ring of n CEs it is 2B (independent of n).
3. The number of ports each network switch has. In the case of a ring it is 3.
4. The total number of links between the switches and between the switches and the
CEs. For a ring it is 2n.
    Source CE address | Destination CE address | Address in memory from where data is to be loaded

The retrieved data is sent over the network to the requesting CE using the following format:

    Source CE address | Destination CE address | Data

In the case of a store instruction, the store request has the format:

    Source CE address | Destination CE address | Address in memory where data is to be stored | Data

After storing, an acknowledgement packet is sent back to the CE originating the request in the format:

    Source CE address | Destination CE address | Store successful
1. A fixed time T needed by the system program at the host node to issue a command
over the network. This will include the time to decode (remote node to which the
request is to be sent) and format a packet.
2. Time taken by the load request packet to travel over the interconnection network.
This depends on the bandwidth of the network and the packet size. If B is the
network bandwidth (in bytes/s) and the packet size is n bytes, this time is (n/B) s.
3. Time taken to retrieve a word from the remote memory, q.
4. Fixed time T needed by the destination CE system program to access the network.
5. Time taken to transport the reply packet (which contains the data retrieved) over
the network. If the size of this packet is m bytes, the time needed is (m/B) s.
Thus, the total time is 2T + q + [(n + m)/B].
The time taken to store is similar. The values of n and m may, however, be different from
the above case. The above model is a simplified model and does not claim to be an exact
model. A basic question which will arise in an actual system is “Can the CE requesting
service from a remote CE continue with processing while awaiting the arrival of the required
data or acknowledgement?” In other words, can computation and communication be
overlapped? The answer depends on the program being executed. Another issue is how
frequently service requests to a remote processor are issued by a CE. Yet another question is
whether multiple transactions by different CEs can be supported by the network. We will
now consider an example to get an idea of the penalty paid if the services of remote CEs are
needed in solving a problem.
EXAMPLE 4.4
A NUMA parallel computer has 256 CEs. Each CE has 16 MB main memory. In a set of
programs, 10% of instructions are loads and 15% are stores. The memory access time for
local load/store is 5 clock cycles. An overhead of 20 clock cycles is needed to initiate
transmission of a request to a remote CE. The bandwidth of the interconnection network is
100 MB/s. Assume 32-bit words and a clock cycle time of 5 ns. If 400,000 instructions are
executed, compute:
Solution
Case 1: Number of load/store instructions = 400,000/4 = 100,000.
Time to execute load/store locally = 100,000 × 5 × 5 ns = 2500 μs.
Case 2: Number of load instructions = 40,000.
Number of local loads = 40000 × 3/4 = 30,000.
Time taken for local loads = 30000 × 25 ns = 750 μs (Each load takes 25 ns).
Number of remote loads = 10,000
Request packet format is:
(Remember that clock time is 5 ns and T = 20 clock cycles and memory access time is 5
clock cycles).
Number of requests to remote CE = 10,000.
∴ Total time for remote load = 3350 μs
Time for local load = 750 μs
Thus, total time taken for loads = 4100 μs
Number of store instructions = 400000 × 0.15 = 60,000.
Number of local stores = 60000 × 0.75 = 45000.
Number of remote stores = 15,000.
Time taken for local stores = 45000 × 25 ns = 1125 μs.
Time taken for 1 remote store of 1 word = (Fixed overhead to initiate request over network
+ Time to transmit remote store packet + Data store time + Fixed overhead to initiate request
for acknowledgement + Time to transmit acknowledgement packet)
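The arithmetic of this example can be reconstructed as sketched below. The packet field widths are assumptions consistent with the data of the example (1-byte CE addresses since there are 256 CEs, 3-byte memory addresses since each CE has 16 MB, 4-byte data words, a 1-byte acknowledgement code), so the output is indicative of the worked answer rather than a quotation of it.

    /* Hedged reconstruction of the timing arithmetic of Example 4.4.      */
    #include <stdio.h>

    int main(void)
    {
        double clock = 5e-9;                  /* 5 ns clock cycle           */
        double T     = 20 * clock;            /* fixed network overhead     */
        double q     = 5 * clock;             /* memory access, 5 cycles    */
        double B     = 100e6;                 /* network bandwidth, bytes/s */

        double load_req  = 1 + 1 + 3;         /* src CE, dst CE, address    */
        double load_rep  = 1 + 1 + 4;         /* src CE, dst CE, data word  */
        double store_req = 1 + 1 + 3 + 4;     /* address plus data word     */
        double store_ack = 1 + 1 + 1;         /* "store successful"         */

        double t_rload  = 2 * T + q + (load_req  + load_rep)  / B;
        double t_rstore = 2 * T + q + (store_req + store_ack) / B;

        double total = 30000 * q + 10000 * t_rload        /* loads          */
                     + 45000 * q + 15000 * t_rstore;      /* stores         */

        printf("remote load  = %.0f ns\n", t_rload  * 1e9);  /* ~335 ns     */
        printf("remote store = %.0f ns\n", t_rstore * 1e9);  /* ~345 ns     */
        printf("total load/store time = %.0f us\n", total * 1e6);
        printf("all-local time        = %.0f us\n", 100000 * q * 1e6);
        return 0;
    }

Under these assumptions the total load/store time comes to roughly 10,400 μs against 2500 μs if all accesses were local, a little over four times longer; re-running the sketch with B set to 1 GB/s gives remote load and store times of about 236 ns and 237 ns, which bears on the question considered below.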
Observe that due to remote access, the total time taken is over 4 times the time taken if all accesses
were local. It is thus clear that it is important for the programmer/compiler of a parallel
program for a NUMA parallel computer to eliminate remote accesses (if possible) to reduce the
parallel processing time.
A question which may arise is, "What is the effect of increasing the speed of the
interconnection network?" In the example we have considered, if the bandwidth of the
interconnection network is increased from 100 MB/s to 1 GB/s, let us compute the time taken
for remote load and store.
The time for remote load =
Figure 4.33 A distributed cache block directory using a doubly linked list.
Let us now examine how cache coherence is maintained. This is given in the following
procedure:
Read Procedure-maintenance of cache coherence
Read command issued by processor p
The major disadvantage of this method is the large number of messages which have to be
sent over the interconnection network while traversing the linked list. This is very slow. The
main advantage is that the directory size is small; it holds only the address of the node at the head
of the list. The second advantage is that the latest transacting node is nearer to the head of the
list. This ensures some fairness in the protocol. Lastly, invalidation messages are distributed
rather than centralized, thereby balancing the load. There are many other subtle problems
which arise, particularly when multiple nodes attempt to update a cache block simultaneously.
The protocol for resolving such conflicts is given in detail in the IEEE standard [Gustavson,
1992]; it is outside the scope of this book.
4.10 MESSAGE PASSING PARALLEL COMPUTERS
A general block diagram of a message passing parallel computer (also called loosely coupled
distributed memory parallel computer) is shown in Fig. 4.34. On comparing Fig. 4.32 and
Fig. 4.34, they look identical except for the absence of a directory block in each node of Fig.
4.34. In this structure, each node is a full fledged computer. The main difference between
distributed shared memory machine and this machine is the method used to program this
computer. In a DSM computer a programmer assumes a single flat memory address space and
programs it using fork and join commands. A message passing computer, on the other hand,
is programmed using send and receive primitives. There are several types of send-receive
used in practice. We discuss a fairly general one in which there are two types of send,
synchronous send and asynchronous send. There is one command to receive a message. A
synchronous send command has the general structure:
Synch-Send (source address, n, destination process, tag)
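The send/receive primitives described here are close in spirit to those of MPI, which is discussed later in the book. As a rough illustration only (not the book's own listing), a blocking synchronous send and the matching receive might look like this in MPI, where the buffer, element count, destination rank and tag play the roles of the four arguments above:

    /* MPI analogue of a synchronous send and the matching receive.
       MPI_Ssend completes only when the matching receive has started.    */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, data[4] = { 1, 2, 3, 4 };
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* source buffer, n elements, destination process, tag        */
            MPI_Ssend(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_INT, 0, 99, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("process 1 received %d %d %d %d\n",
                   data[0], data[1], data[2], data[3]);
        }
        MPI_Finalize();
        return 0;
    }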
1. In these systems each server is self-contained with on-board disk memory besides
the main memory. As such systems have a large number of servers and the total
cost of the system may be of the order of $100 million, a small saving in the cost of
each server will add up to several million dollars. Thus, the servers must be of low
cost without sacrificing reliability to make the system cost effective.
2. The power dissipation of a server must be minimal commensurate with its
capability to reduce the overall power consumption. If each server consumes 300
W, the power consumption of 100,000 servers will be 30 MW. A 50 W reduction
in power consumption by a server will result in a saving of 5 MW. The system
design involves proper cooling to dissipate the heat. Without proper cooling the
performance of the system will be adversely affected.
3. The systems are accessed using the Internet. Thus, very high bandwidth network
connection and redundant communication paths are essential to access these
systems.
4. Even though many of the services are interactive, good high performance parallel
processing is also an important requirement. Web search involves searching many
web sites in parallel, getting information appropriate for the search terms from
them and combining it.
5. The architecture of a WSC normally consists of around 48 server boards mounted in
a standard rack interconnected using low cost Ethernet switches. The racks are in
turn connected by a hierarchy of high bandwidth switches which can support
higher bandwidth communication among the computers in the racks.
6. The Single Program Multiple Data (SPMD) model of parallel computing is eminently
suitable for execution on a WSC infrastructure as a WSC has thousands of loosely
coupled computers available on demand. This model is used, for example, by
search engines to find the URLs of websites in which a specified set of keywords
occurs. A function called Map is used to distribute sets of URLs to computers in the
WSC, and each computer is asked to select the URLs in which the specified keywords
occur. All the computers search simultaneously and find the relevant URLs. The
results are combined by a function called Reduce to report all the URLs where the
specified keywords occur. This speeds up the search in proportion to the number of
computers in the WSC allocated to the job. An open source software suite called
Hadoop MapReduce, based on the MapReduce model of [Dean and Ghemawat, 2004],
has been developed for writing applications on WSCs that process petabytes of data
in parallel in a reliable, fault tolerant manner, in essence using the above idea. A toy
illustration of the Map/Reduce idea is sketched after this list.
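As a toy illustration only, with plain sequential C standing in for thousands of servers and made-up URLs and keyword, the Map step below selects matching URLs from each worker's share and the Reduce step merges the per-worker results:

    /* Toy Map/Reduce sketch: keyword search over URLs split among workers. */
    #include <stdio.h>
    #include <string.h>

    #define NURLS    6
    #define NWORKERS 3

    static const char *urls[NURLS] = {
        "example.com/parallel", "example.org/cooking",
        "example.net/parallel-computers", "example.edu/history",
        "example.io/parallel-programming", "example.dev/music"
    };

    /* Map step: worker w scans its share of URLs for the keyword.         */
    static int map_select(int w, const char *keyword,
                          const char *hits[], int nhits)
    {
        for (int i = w; i < NURLS; i += NWORKERS)
            if (strstr(urls[i], keyword) != NULL)
                hits[nhits++] = urls[i];
        return nhits;
    }

    int main(void)
    {
        const char *hits[NURLS];
        int nhits = 0;

        /* In a WSC the NWORKERS map tasks run on different servers in
           parallel; here they run one after another for illustration.     */
        for (int w = 0; w < NWORKERS; w++)
            nhits = map_select(w, "parallel", hits, nhits);

        /* Reduce step: combine the per-worker results into one report.    */
        printf("URLs containing the keyword:\n");
        for (int i = 0; i < nhits; i++)
            printf("  %s\n", hits[i]);
        return 0;
    }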
4.13 SUMMARY AND RECAPITULATION
In this chapter we saw that there are seven basic types of parallel computers. They are vector
computers, array processors, shared memory parallel computers, distributed shared memory
parallel computers, message passing parallel computers, computer clusters, and warehouse
scale parallel computers. Vector computers use temporal parallelism whereas array
processors use fine grain data parallelism. Shared memory multiprocessors, message passing
multicomputers, and computer clusters may use any type of parallelism. Warehouse scale
computers are well suited for SPMD model of programming. Individual processors in a
shared memory computer or message passing computer, for instance, can be vector
processors. Shared memory computers use either a bus or a multistage interconnection
network to connect a common global memory to the processors. Individual processors have
their own cache memory and it is essential to maintain cache coherence. Maintaining cache
coherence in a shared bus-shared memory parallel computer is relatively easy as bus
transactions can be monitored by the cache controllers of all the processors [Dubois et al.,
1988]. Increasing the number of processors in this system beyond 16 is difficult as a single bus will
saturate, degrading performance. Maintaining cache coherence is more difficult in a shared
memory parallel computer using an interconnection network to connect a global memory to
processors. A shared memory parallel computer using an interconnection network, however,
allows a larger number of processors to be used as the interconnection network provides
multiple paths from processors to memory. As shared memory computers have a single global
address space which can be uniformly accessed, parallel programs are easy to write for such
systems. They are, however, not scalable to a larger number of processors. This led to the
development of Distributed Shared Memory (DSM) parallel computers in which a number of
CEs are interconnected by a high speed interconnection network. In this system, even though
the memory is physically distributed, the programming model is a shared memory model.
DSM parallel computers, also known as CC-NUMA machines, are scalable while providing a
logically shared memory programming model.
It is expensive and difficult to design massively parallel computers (i.e., parallel machines
using hundreds of CEs) using CC-NUMA architecture due to the difficulty of maintaining
cache coherence. This led to the development of message passing distributed memory parallel
computers. These machines do not require cache coherence as each CE runs programs using
its own memory system and cooperates with other CEs by exchanging messages. Message
passing distributed memory parallel computers require good task allocation to CEs which
reduces the need to send messages to remote CEs. There are two major types of message
passing multicomputers. In one of them, the interconnection network connects the memory
buses of CEs. This allows faster exchange of messages. The other type called a computer
cluster interconnects CEs via their I/O buses. Computer clusters are cheaper. Heterogeneous
high performance PCs can be interconnected as a cluster because I/O buses are standardized.
Message passing is, however, slower as messages are taken as I/O transactions requiring
assistance from the operating system. Computer clusters are, however, scalable. Thousands of
CEs distributed on a LAN may be interconnected. With the increase in speed of high
performance PCs and LANs, the penalty suffered due to message passing is coming down
making computer clusters very popular as low cost, high performance, parallel computers.
Another recent development, driven by widely used services such as email, web search, and
e-commerce, is the emergence of warehouse scale computers which use tens of thousands of
servers interconnected by Ethernet switches connected to the computers' network interface
cards. They are used to cater to request level parallel jobs and to run SPMD style parallel
programs. In Table 4.10 we give a chart comparing the various parallel computer architectures
we have discussed in this chapter.
    Message passing       MIMD    Coarse                    Needs good task allocation    Very good
    multicomputer                                           and reduced inter-CE
                                                            communication
    Cluster computer      MIMD    Very coarse as O.S.       Same as above                 Very good (low cost
                                  assistance needed                                       system)
    Warehouse scale       SPMD    Very coarse               Same as above                 Excellent (uses tens of
    computer (WSC)                                                                        thousands of low cost
                                                                                          servers)
EXERCISES
4.1 We saw in Section 4.1 that a parallel computer is made up of interconnected processing
elements. Can you explain how PEs can be logically interconnected using a storage
device? How can they cooperate with this logical connection and solve problems?
4.2 Parallel computers can be made using a small number of very powerful processors (e.g.,
8-processor Cray) or a large number of low speed micro-processors (say 10,000 Intel
8086 based microcomputers). What are the advantages and disadvantages of these two
approaches?
4.3 We gave a number of alternatives for each of the components of a parallel computer in
Section 4.1. List as many different combinations of these components as will form
viable parallel computer architectures (Hint: One viable combination is 100 full fledged
microprocessors with a cache and main memory connected by a fixed interconnection
network and cooperating by exchanging intermediate results).
4.4 For each of the alternative parallel computer structure you gave as solution to Exercise
4.3, classify them using Flynn’s classification.
4.5 Give some representative applications in which SIMD processing will be very effective.
4.6 Give some representative applications in which MISD processing will be very effective.
4.7 What is the difference between loosely coupled and tightly coupled parallel computers?
Give one example of each of these parallel computer structures.
4.8 We have given classifications based on 4 characteristics. Are they orthogonal? If not
what are the main objectives of this classification method?
4.9 When a parallel computer has a single global address space, is it necessarily a uniform
memory access computer? If not, explain why it is not necessarily so.
4.10 What do you understand by grain size of computation? A 4-processor computer with
shared memory carries out 100,000 instructions. The time to access shared main
memory is 10 clock cycles and the processors are capable of carrying out 1 instruction
every clock cycle. The main memory needs to be accessed only for inter-processor
communication. If the loss of speedup due to communication is to be kept below 15%,
what should be the grain size of computation?
4.11 Repeat Exercise 4.10 assuming that the processors cooperate by exchanging messages
and each message transaction takes 100 cycles.
4.12 A pipelined floating point adder is to be designed. Assume that exponent matching
takes 0.1 ns, mantissa alignment 0.2 ns, adding mantissas 1 ns and normalizing result
0.2 ns. What is the highest clock speed which can be used to drive the adder? If two
vectors of 100 components each are to be added using this adder, what will be the
addition time?
4.13 Develop a block diagram for a pipelined multiplier to multiply two floating point
numbers. Assuming times for each stage similar to the ones used in Exercise 4.12,
determine the time to multiply two 100 component vectors.
4.14 What is vector chaining? Give an example of an application where it is useful.
4.15 If the access time to a memory bank is 8 clock cycles, how many memory banks are
needed to retrieve one component of a 64-component vector each cycle?
4.16 A vector machine has a 6-stage pipelined arithmetic unit and 10 ns clock. The time
required to interpret and start executing a vector instruction is 60 ns. What should be the
length of vectors to obtain 95% efficiency on vector processing?
4.17 What are the similarities and differences between
(i) Vector processing,
(ii) Array processing, and
(iii) Systolic processing?
Give an application of each of these modes of computation in which their unique
characteristics are essential.
4.18 Obtain a systolic array to compute the following:
4.19 A shared bus parallel computer with 16 PEs is to be designed using 32-bit RISC
processor running at 2 GHz. Assume average 1.5 clocks per instruction. Each PE has a
cache and the hit ratio to cache is 98%. The system is to use write-invalidate protocol. If
10% of the instructions are loads and 15% are stores, what should be the bandwidth of
the bus to ensure processing without bus saturation? Assume reasonable values of other
necessary parameters (if any).
4.20 In Section 4.7.1, we stated that an atomic read-modify-write machine level instruction
would be useful to implement barrier synchronization. Normally a counter is
decremented by each process which reaches a barrier and when the count becomes 0, all
processes cross the barrier. Until then processors wait in busy wait loops. Write an
assembly language program with a read-modify-write atomic instruction to implement
barrier synchronization.
4.21 Assume a 4-PE shared memory, shared bus, cache coherent parallel computer. Explain
how lock and unlock primitives work with write-invalidate cache coherence protocol.
4.22 Develop a state diagram or a decision table to explain MOESI protocol for maintaining
cache coherence in a share memory parallel computer.
4.23 A 4-PE shared bus computer executes the streams of instructions given below:
Stream 1: (R PE1)(R PE2)(W PE1)(W PE2)(R PE3)(W PE3)(R PE4)(W PE4)
Stream 2: (R PE1)(R PE2)(R PE3)(R PE4)(W PE1)(W PE2)(W PE3)(W PE4)
Stream 3: (R PE1)(W PE1)(R PE2)(W PE2)(R PE3)(W PE3)(R PE4)(W PE4)
Stream 4: (R PE2)(W PE2)(R PE1)(R PE3)(W PE1)(R PE4)(R PE3)(W PE4)
where (R PE1) means a read from memory by PE1 and (W PE1) means a write into memory by PE1.
Assume that all caches are initially empty. Assume that if there is a cache hit the
read/write time is 1 cycle and that all the read/writes are to the same location in
memory. Assume that 100 cycles are needed to replace a cache block and 50 cycles for
any transaction on the bus. Estimate the number of cycles required to execute the
streams above for the following two protocols.
(i) MESI protocol
(ii) Write update protocol
4.24 List the advantages and disadvantages of shared memory parallel computers using a
bus for sharing memory and an interconnection network for sharing memory.
4.25 We have given detailed protocols to ensure cache coherence in directory based shared
memory systems assuming a write-invalidate policy and write-through to main memory.
Develop a detailed protocol for a write-update policy and write-back to main memory.
Compare and contrast the two schemes. Specify the parameters you have used for the
comparison.
4.26 A shared memory parallel computer has 128 PEs and shares a main memory of 1 TB
using an interconnection network. The cache block size is 128B. What is the size of the
directory used to ensure cache coherence? Discuss methods of reducing directory size.
4.27 Solve Exercise 4.19 for a directory based shared memory computer. In the case of
directory scheme the solution required is the allowed latency in clock cycles of the
interconnection network for “reasonable functioning” of the parallel computer.
4.28 A parallel computer has 16 processors which are connected to independent memory
modules by a crossbar interconnection network. If each processor generates one
memory access request every 2 clock cycles, how many memory modules are required
to ensure that no processor is idle?
4.29 What is the difference between a blocking switch and a non-blocking switch? Is an
omega network blocking or non-blocking?
4.30 If switch delay is 2 clock cycles, what is the network latency of an n-stage omega
network?
4.31 Obtain a block diagram of a 16-input 16-output general interconnection network
similar to the one shown in Fig. 4.25.
4.32 What is the difference between a direct and an indirect network?
4.33 What is the significance of bisection bandwidth? What is the bisection bandwidth of a
3-stage omega network?
4.34 Draw a diagram of a 32-node hypercube. Number the nodes using systematic binary
coding so that processors which are physical neighbours are also logical neighbours.
What is the bisection bandwidth? How many links are there in this hypercube? What is
the maximum number of hops in this network? How many alternate paths are there
between any two nodes?
4.35 What is packet switching? Why is it used in CE to CE communication in direct
connected networks?
4.36 What do you understand by routing? What is the difference between direct packet
routing, virtual cut through routing and wormhole routing?
4.37 Distinguish between UMA, NUMA and CC-NUMA parallel computer architectures.
Give one example block diagram of each of these architectures. What are the
advantages and disadvantages of each of these architectures? Do all of these
architectures have a global addressable shared memory? What is the programming
model used by these parallel computers?
4.38 Take the data of Example 4.4 given in the text except the bandwidth of the
interconnection network. If 15% of the accesses are to remote CEs, what should be the
bandwidth of the interconnection network to keep the penalty due to remote access
below 2?
4.39 A NUMA parallel computer has 64 CEs. Each CE has 1 GB memory. The word length
is 64 bits. A program has 500,000 instructions out of which 20% are loads and 15%
stores. The time to read/write in local memory is 2 cycles. An overhead of 10 clock
cycles is incurred for every remote transaction. If the network bandwidth is 500 MB/s
and 30% of the accesses are to remote computers, what is the total load/store time?
How much will it be if all accesses are to local memory?
4.40 Obtain an algorithm to ensure cache coherence in a CC-NUMA parallel computer.
Assume write-invalidate protocol and write through caches.
4.41 A CC-NUMA parallel computer has 128 CEs connected as a hypercube. Each CE has
256 MB memory with 32 bits/word. Link bandwidth is 500 MB/s. Local load/store take
1 clock cycle. Fixed overhead for remote access is 50 cycles. Each hop from link to
next link costs extra 5 cycles. Answer the following questions:
(i) Describe the directory organization in this machine.
(ii) If a load from the memory of PE 114 to PE 99’s register is to be performed, how
many cycles will it take?
4.42 Compare and contrast directory based cache coherence scheme and scalable coherent
interface standard based scheme.
4.43 A 64-PE hypercube computer uses SCI protocol for maintaining cache coherence.
Statistically it is found that not more than 4 caches hold the value of a specified
variable. Memory size at each node is 256 MB with 32-bit words. Cache line size is 128
bytes. Link bandwidth is 500 MB/s. If a read from Node x requires reading from Node
(x + 4)’s cache as it is not in x’s cache, how many cycles does it take? Access time to
main memory at each node is 10 cycles and to local cache 1 cycle. Assume reasonable
value(s) for any other essential parameter(s).
4.44 What are the major differences between a message passing parallel computer and a
NUMA parallel computer?
4.45 A series of 10 messages of 64 bytes each are to be sent from node 5 to 15 of a 32-CE
hypercube. Write synchronous send commands to do this. If start up to send a message
is 10 cycles, link latency 15 cycles and memory cycle is 5 cycles, how much time (in
cycles) will be taken by the system to deliver these 10 messages?
4.46 Repeat Exercise 4.45 if it is asynchronous send. What are the advantages (if any) of
using an asynchronous send instead of a synchronous send? What are the
disadvantages?
4.47 What are the advantages and disadvantages of cluster computer when compared with a
message passing parallel computer connected by a hypercube?
4.48 A program is parallelized to be solved on a computer cluster. There are 16 CEs in the
cluster. The program is 800,000 instructions long and each CE carries out 500,000
instructions. Each CE has a 2 GHz clock and an average of 0.75 instructions are carried
out each cycle. A message is sent by each CE once every 20,000 instructions on the
average. Assume that all messages do not occur at the same time but are staggered. The
fixed overhead for each message is 50,000 cycles. The average message size is 1,000
bytes. The bandwidth of the LAN is 1 GB/s. Calculate the speedup of this parallel
machine.
4.49 What are the advantages of using a NIC (which has a small microprocessor) in
designing a computer cluster? Make appropriate assumptions about NIC and solve
Exercise 4.48. Assume that the network bandwidth remains the same.
4.50 Repeat Exercise 4.48 assuming that the LAN bandwidth is 10 GB/s. Compare the
results of Exercises 4.48 and 4.50 and comment.
4.51 What are the differences between computer clusters and warehouse scale computers?
In what applications are warehouse scale computers used?
4.52 Differentiate between request level parallelism and data level parallelism. Give some
examples of request level parallelism.
4.53 A warehouse scale computer has 10,000 computers. Calculate the availability of the
system in a year if there is a hardware failure once a month and software failure once in
fifteen days. Each hardware failure takes 15 minutes to rectify (assuming a simple
replacement policy) and a software failure takes 5 minutes to rectify by just rebooting.
If the availability is to be 99.99%, what would be the allowed failure rates?
BIBLIOGRAPHY
Adve, S.V., and Gharachorloo, K., “Shared Memory Consistency Models: A Tutorial”, IEEE
Computer, Vol. 29, No. 12, Dec. 1996, pp. 66–76.
Censier, L. and Feautrier, P., “A New Solution to Cache Coherence Problems in
Multiprocessor Systems”, IEEE Trans. on Computers, Vol. 27, No. 12, Dec. 1978, pp.
1112–1118.
Culler, D.E., Singh, J.P. and Gupta, A., Parallel Computer Architecture: A Hardware
Software Approach, Morgan Kaufmann, San Francisco, USA, 1999.
Dean, J., and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters”,
Proceedings of the USENIX Symposium on Operating System Design and Implementation,
2004, pp. 137–150, http://research.google.com/archive/mapreduce.html
DeCegama, A.L., Parallel Processing Architectures and VLSI Hardware, Vol. 1, Prentice-
Hall Inc., Englewood Cliffs, New Jersey, USA, 1989.
Dubois, M., et al., “Synchronization, Coherence, and Event Ordering”, IEEE Computer, Vol.
21, No. 2, Feb. 1988, pp. 9–21.
Flynn, M.J., “Some Computer Organizations and their Effectiveness”, IEEE Trans. on
Computers, Vol. 21, No. 9, Sept. 1972, pp. 948–960.
Gustavson, D., “The Scalable Coherent Interface and Related Standards”, IEEE Micro, Vol.
12, No. 1, Jan. 1992, pp. 10–12.
Hennessy, J.L., and Patterson, D.A., Computer Architecture—A Quantitative Approach, 5th
ed., Morgan Kaufmann, Waltham, MA, USA, 2012.
Hill, M.D., “Multiprocessors should Support Simple Memory Consistency Models”, IEEE
Computer, Vol. 31, No. 8, Aug. 1998, pp. 28–34.
Hord, M.R., Parallel Supercomputing in SIMD Architectures, CRC Press, Boston, USA,
1990.
Kung, H.T., “Why Systolic Architectures”?, IEEE Computer, Vol. 15, No. 1, Jan. 1982, pp.
37–46.
Lamport, L., “How to Make a Multiprocessor Computer that Correctly Executes Multiprocess
Programs”, IEEE Trans. on Computers, Vol. 28, No. 9, Sept. 1979, pp. 690–691.
Leighton, F.T., Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann,
San Francisco, USA, 1992.
Litzkow, M., Livny, M. and Mutka, M.W., “CONDOR-A Hunter of Idle Workstations”,
Proceedings of IEEE International Conference of Distributed Computing Systems, June
1988, pp. 104–111.
Mohapatra, P., “Wormhole Routing Techniques for Directly Connected Multicomputer
Systems”, ACM Computing Surveys, Vol. 30, No. 3, Sept. 1998, pp. 374–410.
Nakazato, S., et al., “Hardware Technology of the SX-9”, NEC Technical Journal, Vol. 3,
No. 4, 2008.
Rajaraman, V., Supercomputers, Universities Press, Hyderabad, 1999.
Sorin, D.J., Hill, M.D., and Wood, D.A., A Primer on Memory Consistency and Cache
Coherence, Morgan and Claypool, USA, 2011.
Stenstrom, P., “A Survey of Cache Coherence Protocols for Multiprocessors”, IEEE
Computer, Vol. 23, No. 6, June 1990, pp. 12–29.
Core Level Parallel Processing
In Chapter 3, we saw how the speed of processing may be improved by perceiving instruction
level parallelism in programs and exploiting this parallelism by appropriately designing the
processor. Three methods were used to speedup processing. One was to use a large number of
pipeline stages, the second was to initiate execution of multiple instructions simultaneously
which was facilitated by out of order execution of instructions in a program, and the third was
to execute multiple threads in a program in such a way that all the sub units of the processor
are always kept busy. (A thread was defined as a small sequence of instructions that shares
the same resources). All this was done by the processor hardware automatically and was
hidden from the programmer. If we define the performance of a processor as the number of
instructions it can carry out per second then it may be increased (i) by pipelining which
reduces the average cycle time needed to execute instructions in a program, (ii) by increasing
the number of instructions executed in each cycle by employing instruction level parallelism
and thread level parallelism, and (iii) by increasing the number of clock cycles per second by
using a higher clock speed. Performance increased continually up to 2005 by increasing the
clock speed and by exploiting pipelining, instruction level parallelism, and thread level
parallelism to the maximum extent possible.
While research and consequent implementation of thread level parallelism to increase the
speed of processors was in progress, researchers were examining how several processors
could be interconnected to work in parallel. Research in parallel computing started in the late
70s. Many groups and start-up companies were active in building parallel machines during
the 80s and 90s, up to the early 2000s. We discussed the varieties of parallel computers that were
built and their structure in Chapter 4. The parallel computers we described in Chapter 4 were
not used as general purpose computers such as desktop computers and servers because
single processor computers were steadily increasing in speed with no increase in cost, owing
to rising clock speeds and the use of multithreading. Parallel computer programming was not
easy, and the performance of parallel computers was not commensurate with their cost for
routine computing. In spite of the many advantages pointed out in Chapter 1, they were used
only where high performance was needed in numeric intensive computing.
However, as we will explain in the next section, the performance of single chip processors
reached saturation around 2005. Chip designers started examining integrating many
processors on a single chip to continue to build processors with better performance. When
this happened many of the ideas we presented in Chapter 4 were resurrected to build parallel
computers using many processors or cores integrated on a single silicon chip. The objective
of this chapter is to describe parallel computers built on a single integrated circuit chip called
Chip Multiprocessors (CMP) which use what we call core level parallelism to speedup
execution of programs. While many of the architectural ideas described in the previous
chapter are applicable to chip multiprocessors, there are important differences which we
discuss in the next section.
5.1 CONSEQUENCES OF MOORE’S LAW AND THE ADVENT OF
CHIP MULTIPROCESSORS
In 1965 Moore predicted, based on the data available at that time, that the number of
transistors in an integrated circuit chip would double every two years. This empirical
prediction (called Moore's law) has been surprisingly accurate. Starting with about
1000 transistors in a chip in the 70s, by the year 2014 there were almost 10 billion transistors
in a chip. The increase in the number of transistors allowed processor designers to implement
many innovations, culminating in complex pipelined, superscalar, multithreaded processors
which exploited the available instruction level parallelism to the maximum extent.
Simultaneously the clock frequency was being increased. The increase in clock speed was
sustainable till 2003. Beyond a clock frequency of around 4 GHz, the heat dissipated in the
processors became excessive. The faster the transistors are switched, the higher the heat dissipated.
There are three components of heat dissipation. They are:
1. When transistors switch states, their equivalent capacitances charge and discharge,
leading to power dissipation. This is the dynamic power dissipation, which is
proportional to CVd²f, where C is the equivalent capacitance of the transistor, Vd
the voltage at which the transistor operates, and f the clock frequency. (A small
numerical sketch of these three components follows this list.)
2. Small currents called leakage currents flow continuously between the differently doped
parts of a transistor. This current increases as the transistor becomes smaller and as
the ambient temperature goes up. The leakage current IL thus leads to additional
power consumption proportional to VdIL and a resultant increase in heat
dissipation.
3. Lastly, during the finite time in which a logic gate toggles, there is a direct path between the
voltage source and ground. This leads to a short circuit current Ist and consequent
power dissipation proportional to VdIst, which in turn heats up the gates.
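The sketch below puts rough, assumed numbers on the three components for a single core; the values of C, Vd, f and the currents are illustrative only.

    /* Rough numerical sketch of the three power components listed above.  */
    #include <stdio.h>

    int main(void)
    {
        double C   = 1.0e-9;     /* effective switched capacitance, farads */
        double Vd  = 1.0;        /* supply voltage, volts                  */
        double f   = 3.0e9;      /* clock frequency, hertz                 */
        double Il  = 0.5;        /* leakage current, amperes               */
        double Ist = 0.2;        /* average short circuit current, amperes */

        double dynamic = C * Vd * Vd * f;   /* C Vd^2 f                    */
        double leakage = Vd * Il;           /* Vd IL                       */
        double shortc  = Vd * Ist;          /* Vd Ist                      */

        printf("dynamic %.1f W, leakage %.1f W, short circuit %.1f W\n",
               dynamic, leakage, shortc);
        /* Doubling f doubles only the dynamic term; halving Vd cuts the
           dynamic term by four and the other two terms by two.            */
        return 0;
    }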
These three sources of power dissipation and the consequent heat generation increase
as transistor size decreases. However, the dynamic power loss due to the increase in
switching frequency causes the greatest increase in heat generation. This puts a limit on the
frequency at which processors can operate, called the frequency wall in the literature. As was
pointed out earlier, when the clock frequency reached 4 GHz, processor designers had to look
for means of increasing the speed of processors other than a brute force increase in
clock speed, which required special cooling systems. As Moore's law was still valid and more
transistors could be packed in a chip, architects had to find ways of using these transistors.
Simultaneous multithreading uses some of the extra transistors. However, instruction level
parallelism has its limitations. Even if many arithmetic units are put in one processor,
there is not enough instruction level parallelism available in threads to use them. This
upper bound on the number of instructions which can be executed simultaneously by a
processor is called the ILP wall in the literature. Thus, instead of using the extra transistors to
increase the number of integer units, floating point units, and registers, which had reached a point
of diminishing returns, architects had to explore other ways of using them.
Further, as the complexity of the processor increased, design errors crept in; debugging
complex processors on a single chip with too many transistors was difficult and expensive.
There are two other ways of using transistors. One is to increase the size of the on-chip
memory called the L1 cache and the other is to put several processors in a single chip and
make them cooperate to execute a program. There is a limit beyond which increasing the L1
cache size has no advantage as there is an upper limit to locality of reference to memory.
Putting n processors in a chip, however, can very often lead to n-fold speedup and in some
special cases even super-linear speedup.
Starting in the late 70s, many research groups had been working on interconnecting
independent processors to architect parallel computers which we discussed in the previous
chapter. Much of the research could now be used to design single chip parallel computers
called multicore processors or chip multiprocessors. However, there are important differences
between the architectural issues in designing parallel computers using independent processing
units and designing Chip Multiprocessors (CMP). These primarily relate to the fact that
individual cores heat up at high clock frequency and may affect the performance of
physically adjoining cores. Heat sensors are often built as part of a core to switch off the core,
or reduce its operating clock frequency, or switch the thread being processed to another core
which may be idle. Another problem arises due to threads or processes running on different
cores sharing an on-chip cache memory leading to contention and performance degradation
unless special care is taken both architecturally and in programming. The total power budget
of a chip has to be apportioned to several cores. Thus, the power requirement of each core has
to be examined carefully. There are, however, many advantages of CMPs compared to
parallel computers using several computers. As the processors are closely packed, the delay
in transferring data between cooperating processors is small. Thus, very light weight threads
(i.e., threads with very small number of instructions) can run in the processors and
communicate more frequently without degrading performance. This allows the software
designers to re-think appropriate algorithms for CMPs. There is also a downside. In
parallel computers built from separate machines, the wires interconnecting the computers for
inter-computer communication were outside the computers and there was no constraint on the
number of wires. The wires inside a chip, even though short, occupy chip area and thus their
number has to be reduced. The switches used for inter-processor communication also use chip
space and their complexity has to be reduced. The trade-offs in CMPs are thus different from
those in parallel computers, and this has to be remembered by architects.
As cores are duplicated, the design of a single core may be replicated without incurring
extra design cost. Besides this, depending on the requirements of the application, the power
budget, and the market segment, processors may be tailor-made with 2, 4, 6, 8 or more
cores, as duplicating processors on a die for special chips does not incur high cost.
The terminology processing core is used to describe an independent Processing Element
(PE) in an integrated circuit chip. When many PEs are integrated in a chip, it is called a
multicore chip or a chip multiprocessor as we pointed out earlier. The independent processing
cores normally share common on-chip cache memory and use parallelism available in
programs to solve problems in parallel. You may recall that we called it core level
parallelism. In this chapter, we will describe single chip parallel computers which use core
level parallelism.
Unlike multithreaded processors in which the hardware executes multiple threads without
a programmer having to do anything, multicore processors require careful programming. As
the number of cores per processor is increasing rapidly, programming becomes more difficult.
We discuss programming multicore processors in Chapter 8. In Table 5.1, we summarise the
discussions of this section.
Power dissipation (power wall)
  - Heating due to increase in clock frequency: reduce frequency; reduce Vd
  - Increase in leakage current as transistor size becomes smaller: switch off unused cores/sub-systems
  - Short circuit power loss during switching: transistor technology improvement

Clock frequency
  - Heating as frequency increases: use multiple slower cores in a chip
  - Power consumption increases as frequency increases: use heterogeneous cores
  - Transmission line effects at higher frequencies
  - Electromagnetic radiation at higher frequencies

Memory
  - Working set size does not increase beyond a point based on locality of reference: use a distributed cache hierarchy (L1, L2, L3 caches)
  - Memory bandwidth lower than CPU bandwidth: shared memory multiprocessor with caches in each processor
  - On-chip memory requires large area and dissipates power: multiple processors can use private caches better

Design complexity
  - Design verification of complex processors: reuse earlier designs by replication; architectural innovation facilitated by duplicating well designed cores; hardware software co-design

Improvement in performance
  - Increase in performance of processors is proportional to the square root of the increase in complexity (doubling logic gates gives a 40% increase in performance): use multiple cores with simple logic in each core, with the promise of a linear increase in performance
5.2 A GENERALIZED STRUCTURE OF CHIP MULTIPROCESSORS
Following the model we used in Chapter 4, we define a chip multiprocessor (CMP) as:
“An interconnected set of processing cores integrated on a single silicon chip which
communicate and cooperate with one another to execute one or more programs fast”.
We see from the above definition that the key words are: processing cores, single silicon
chip, communication, and cooperation. It should also be noted that a CMP may be used to
execute a number of independent programs simultaneously.
Memory System
Communication Network
Mode of Cooperation
1. Each core is assigned an independent task. All the cores work simultaneously on
assigned tasks. This is called request level parallel processing.
2. All cores have a single program running but on different data sets (called data level
parallelism, or SPMD).
3. All cores execute a single instruction on multiple data streams or threads (called
single instruction multiple data/thread parallelism, or SIMD/SIMT).
4. All threads and data to the processor are stored in the memory shared by all the
cores. A free core selects a thread to execute and deposits the results in the
memory for use by other cores (called multiple instructions multiple data style of
computing, or MIMD).
From the above description, we see that a rich variety of CMPs can be built. We will
describe some of the chip multiprocessors which have been built.
The main considerations which have to be remembered while designing CMPs and many
core processors (when the number of processors exceeds 30) are:
Figure 5.5 A multicore processor with separate L2 and a shared on-chip L3 cache.
The choice of architecture depends on the nature of the application of the multicore chip
and also on the area available in the chip to integrate memory. For example, the Intel Core Duo,
an early CMP, had the architecture shown in Fig. 5.3, whereas a later CMP, the AMD Opteron,
used the architecture of Fig. 5.4. In the architecture of Fig. 5.5, which uses Intel i7 cores, the
processor is a general purpose processor and having an L3 cache on-chip is a great advantage.
The choice of the architecture also depends on the application of the CMP. A CMP may be
used as:
In the first two cases, the cores may be easily used to run separate tasks of a multitasking
system. In other words, when one core is being used as a word processor, another may be
used to play music from an Internet radio station and a third core may be used for running a
virus scanner. In such cases, there is hardly any interaction or cooperation between the
processor cores. The architecture of Fig. 5.3 would be appropriate as a fast L1 cache will be
sufficient as no communication among the cores is required. As a server is normally used to
run programs such as query processing of large databases and running independent tasks of
several users, the architecture of Fig. 5.4 which has an L2 cache with each core would be
advantageous to speedup processing. In this case also in many situations the cores may be
working on independent tasks. In case 4, the processor will be required to solve a single
problem which needs cooperation of all the cores. The architecture of Fig. 5.5 with a built-in
L3 cache would facilitate cooperation of all the cores in solving a single numeric intensive
problem. In this case also different cores may be assigned independent jobs which will
increase the efficiency of the multicore processor. The architecture of co-processors is
problem specific. We discuss graphic processors later in this chapter.
1. As the wires are inside one chip, their lengths are smaller, resulting in lower delay
in communication between processors. This in turn allows the units of
communication between processors to be light-weight threads (a few hundred
bytes).
2. Planar wiring interconnections are easier to fabricate on a chip. Thus, a ring
interconnection and a grid interconnection described in Section 4.8.2 are the ones
which are normally used.
3. On-chip routers are normally used to interconnect processors as they are more
versatile than switches. It is not very expensive to fabricate routers on a chip. They,
however, dissipate heat. Power efficiency in the design of routers is of great
importance.
4. Routing of data between processors has to be designed to avoid deadlocks. Simple
routing methods which guarantee deadlock prevention even if they take a longer
route are preferable.
5. The router connecting the processing cores need not be connected to the I/O
system of the core. It may be connected to registers or caches in the core. This
significantly reduces the delay in data communication.
6. The logical unit of communication we used in Chapter 4 was a message. In most on-chip
networks, communication is in units of packets. Thus when a message is sent
on a network, it is first broken up into packets. The packets are then divided into
fixed length flits, an abbreviation for flow control digits. This is done because storage
space for buffering messages is at a premium in on-chip routers and it is
economical to buffer smaller flits. For example, a 64-byte cache block sent
from a sender to a receiver will be divided into a sequence of fixed size flits; with
4-byte flits, the cache block becomes 16 data flits. In addition to the data flits, a
header flit specifies the address of the destination processor; it is followed by the
data flits and by a flit carrying flow control and end-of-packet information. The flits
that follow the header do not carry the destination address and follow the same route as
the header flit. This is called wormhole routing, in which flits are sent in a pipelined
fashion through the network (see Exercise 7.5). Details of router design are outside the
scope of this book and the readers are referred to the book On-chip Networks for
Multicore Systems [Peh, Keckler and Vangal, 2009] for details. (A small sketch of this
packetization is given after this list.)
7. As opposed to a communication system in which the TCP/IP protocol is used and
packets are sent as a serial string of bits on a single cable or a fibre optic link, the
flits sent between cores in a many-core processor are sent on a bus whose width is
determined by the distance between routers, inter-line coupling, line capacitance,
line resistance, etc. The bus width may vary from 32 bits to as much as 256
bits depending on the semiconductor technology used when the chip is fabricated.
The flit size is adjusted accordingly.
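The following is a minimal Python sketch of how a message such as a 64-byte cache block might be split into a header flit, data flits, and a tail flit. It is only an illustration of the idea, not any particular router's packet format; the 4-byte flit width is an assumption made for this sketch.

def packetize(payload, dest, flit_bytes=4):
    # Split a message (e.g., a 64-byte cache block) into fixed-size flits.
    # flit_bytes = 4 is an assumed flit width; real widths depend on the link.
    head = {"type": "head", "dest": dest}               # carries the destination/route
    body = [{"type": "body", "data": payload[i:i + flit_bytes]}
            for i in range(0, len(payload), flit_bytes)]
    tail = {"type": "tail"}                             # flow control / end of packet
    return [head] + body + [tail]

flits = packetize(bytes(64), dest=7)
print(len(flits))    # 1 header + 16 data flits + 1 tail = 18

Only the header flit knows the destination; in wormhole routing the data and tail flits simply follow, in a pipelined fashion, the path reserved by the header.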
Figure 5.11 A ring bus interconnecting Intel i7 cores (also called the Sandy Bridge architecture for servers).
The ring is composed of four different rings: a 32-byte data ring (half of a cache block), a
request ring, a snoop ring, and an acknowledgement ring. These four rings implement a
distributed communication protocol that enforces coherency and ordering. The
communication on the rings is fully pipelined and runs at the same clock speed as the core.
The rings are duplicated. One set runs in the clockwise direction and the other in the anti-
clockwise direction. Each agent interfacing a core with the rings interleaves accesses to the
two rings on a cycle by cycle basis. A full 64-byte cache block takes two cycles to deliver to
the requesting core. All routing decisions are made at the source of a message and the ring is
not buffered which simplifies the design of the ring system. The agent on the ring includes an
interface block whose responsibility is to arbitrate access to the ring. The protocol informs
each interface agent one cycle in advance whether the next scheduling slot is available and if
so allocates it. The cache coherence protocol used is a modified MESI protocol discussed in
section 4.7.4.
A question which arises is why a ring interconnect was chosen by the architects. The
number of wires used by the ring is quite large, as 32 bytes of data to be sent in parallel
require 256 lines. The total number of wires in the rings is around 1000. The delay in sending
a cache block from one core to another is variable. In spite of these shortcomings, a ring
interconnection is used due to the flexibility which it provides. The design of the ring will be
the same for a 4, 8, or 16 processor system. If the number of cores exceeds 8, instead of a MESI
protocol a directory protocol will be used to ensure cache coherence. It is also easy to connect
agents on the ring that control off-chip memory and off-chip I/O systems.
As we pointed out earlier, temperature sensors are essential in CMPs. In the Sandy Bridge
architecture, Intel has used better sensors. Further, the voltages used by the on-chip memory,
the processor cores, and the dynamic memories are different. There is also a provision to
dynamically vary the power supply voltage and the clock of different parts of the chip.
Thus, the GPUs were soon used with appropriate software for general purpose computing
and were renamed GPGPU (General Purpose GPUs).
As the origin of GPGPUs is from graphics, a number of technical terms relevant to
graphics processing have been traditionally used which confuses students who want to
understand the basic architecture of GPGPUs [Hennessy and Patterson, 2011]. We avoid
reference to graphics oriented terminology and attempt to simplify the basic idea of
processing threads in parallel. Consider the following small program segment.
for i := 1 to 8192 do
    p(i) := k * a(i) + b(i)
end for
This loop can be unrolled and a thread created for each value of i. Thus there are 8192
threads which can be carried out in parallel. In other words, the computations p(1) = k * a(1)
+ b(1), p(2) = k * a(2) + b(2), … p(8192) = k * a(8192) + b(8192) can all be carried out
simultaneously as these computations do not depend on one another. If we organize a parallel
computer with 8192 processing units with each processor computing k * a(i) + b(i) in say 10
clock cycles, the for loop will be completed in 10 cycles. Observe that in this case each
processing unit is a simple arithmetic logic unit which can carry out a multiply operation and
add operation. If a(i) and b(i) are floating point numbers and k an integer, we need only a
floating point adder and a multiplier. If we do not have 8192 processing units but only 1024
processing units, then we can employ a thread scheduler to schedule 1024 threads and after
they are computed, schedule the next 1024 threads. Assuming 10 cycles for each thread
execution, (8192/1024) × 10 = 80 cycles will be required to complete the operation.
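A minimal Python simulation of this batching arithmetic is given below. The function names and the sequential loop are only illustrative; on a real GPGPU the thread scheduler issues the batches in hardware.

def thread_body(i, k, a, b, p):
    # The work done by thread i (one unrolled loop iteration).
    p[i] = k * a[i] + b[i]

def simulate_simt(n_threads=8192, n_units=1024, cycles_per_thread=10):
    # Threads are issued in batches of n_units; all units execute the same code.
    batches = (n_threads + n_units - 1) // n_units   # 8192/1024 = 8 batches
    return batches * cycles_per_thread               # 8 * 10 = 80 cycles

print(simulate_simt())   # 80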
This example illustrates the essence of the idea in designing GPGPUs. In general GPGPUs
may be classified as Single Instruction Multiple Thread (SIMT) architecture. They are
usually implemented as a multiprocessor composed of a very large number of multithreaded
SIMD processors. If a program consists of a huge number of loops that can be unrolled into a
sequence of independent threads and can be executed in parallel, this architecture is
appropriate. A simplified model of an architecture for executing such programs is shown in Fig. 5.16.
There are three basic systems which are used to build parallel and distributed computer
systems. They are:
There has been immense progress in the design and fabrication of all these three systems.
Processors and memory systems which depend on semiconductor technology have seen very
rapid development. The number of transistors which can be integrated in a microprocessor
chip has been doubling almost every two years with consequent improvement of processing
power at constant cost. This development has led to single chip multiprocessors as we saw in
Chapter 5. Storage systems have also kept pace with the development of processors. Disk
sizes double almost every eighteen months at constant cost. The density of packing bits on a
disk’s surface was 200 bits/sq.inch in 1956 and it increased to 1.34 Tbits/sq.inch in 2015,
nearly a 7 billion fold increase in 59 years. The cost of hard disk storage is less than $0.03
per GB in 2015. Communication speed increase in recent years has been dramatic. The
number of bits being transmitted on fibre optic cables is doubling every 9 months at constant
cost. The improvement of communication is not only in fibre optics but also in wireless
communication. A report published by the International Telecommunication Union
[www.itu.int/ict] in 2011 states that between 2006 and 2011, the communication bandwidth
availability in the world has increased eight fold and the cost of using bandwidth has halved
between 2008 and 2011. It is predicted that by 2016 wireless bandwidth will overtake wired
bandwidth.
The improvement in computer hardware technology has led to the development of servers
at competitive cost. The increase of communication bandwidth availability at low cost has a
profound impact. It has made the Internet all pervasive and led to many new computer
environments and applications. Among the most important applications which have changed
our lifestyle are:
Two important new computing environments which have emerged due to improvements in
computing and communication technologies are:
Grid computing
Cloud computing
In this chapter, we will describe the emergence of these two environments and their impact
on computing. We will also compare the two environments and their respective roles in the
computing scenario in general and parallel computing in particular.
6.1 GRID COMPUTING
We saw that the all pervasive Internet and the improvement of bandwidth of communication
networks have spawned two new computing environments, namely, grid computing and
cloud computing. In this section, we will describe grid computing which was historically the
first attempt to harness the power of geographically distributed computers. A major
consequence of the increase of speed of the Internet is that distance between a client and a
server has become less relevant. In other words, computing resources may be spread across
the world in several organizations, and if the organizations decide to cooperate, a member of
an organization will be able to use a computer of a cooperating organization whenever he or
she needs it, provided it is not busy. Even if it is busy, the organization needing the resource
may reserve it for later use. High performance computers, popularly known as
supercomputers, have become an indispensable tool for scientists and engineers, especially
those performing research. High performance computer based realistic simulation models are
now considered as important in science and engineering research as formulating hypotheses
and performing experiments. “Big Science” is now a cooperative enterprise with scientists all
over the world collaborating to solve highly complex problems. The need for collaboration,
sharing data, scientific instruments (which invariably are driven by computers), and High
Performance Computers led to the idea of grid computing [Berman, Fox and Hey, 2003].
Organizations may cooperate and share not only high performance computers but also
scientific instruments, databases, and specialized software. The interconnected resources of
cooperating organizations constitute what is known as a computing grid. The idea of grid
computing was proposed by Ian Foster and his coworkers in the late 1990s, and has spawned
several projects around the globe. A project called Globus [Foster and Kesselman, 1997] was
started as a cooperative effort of several organizations to make the idea of grid computing a
reality. Globus has developed standard tools to implement grid computing systems.
The Globus project (www.globus.org) defined a computer grid as: an infrastructure that
enables the integrated, collaborative use of high-end computers, networks, databases and
scientific instruments owned and managed by multiple organizations. It later refined the
definition as: a system that coordinates resources that are not subject to centralized control
using standards, open general purpose protocols, and interfaces to deliver non trivial quality
of service.
Comparing these two definitions, we see that both of them emphasize the fact that the
resources are owned and operated by independent organizations, not subject to centralized
control and that the resources may be combined to deliver a requested service. The first
definition emphasizes collaborative use of high performance computers and other facilities.
The second definition emphasizes, in addition, the use of open standards and protocols as
well as the importance of ensuring a specified Quality of Service (QoS). A number of
important aspects related to grid computing which we may infer from these definitions are:
1. Remote facilities are used by an organization only when such facilities are not
locally available.
2. Remote facilities are owned by other organizations which will have their own
policies for use and permitting other organizations to use their facilities.
3. There may not be any uniformity in the system software, languages, etc., used by
remote facilities.
4. While using remote facilities, an organization should not compromise its own
security and that of the host.
5. There must be some mechanisms to discover what facilities are available in a grid,
where they are available, policies on their use, priority, etc.
6. As the grid belongs to a “virtual organization”, namely, separate organizations with
agreed policies on collaboration, there must also be a policy defining Quality of
Service (QoS). QoS may include a specified time by which the requested resource
should be provided and its availability for a specified time.
These software components are typically organized as a layered architecture. Each layer
depends on the services provided by the lower layers. Each layer in turn supports a number of
components which cooperate to provide the necessary service. The layered architecture is
shown in Table 6.1.
Application layer: Scientific and engineering applications (e.g., TeraGrid Science Gateway). Collaborative computing environment (e.g., National Virtual Observatory). Software environment.
Collective layer: Directory of available services and resource discovery. Brokering (resource management, selection, aggregation), diagnostics, monitoring, and usage policies.
Connectivity layer: Communication and authentication protocols. Secure access to resources and services. Quality of Service monitoring.
Fabric layer: Computers, storage media, instruments, networks, databases, OS, software libraries.
The software required to support a grid infrastructure is so huge that it is
impractical for a single organization to develop it. Several organizations including computer
industry participants have developed an open-source Globus toolkit consisting of a large suite
of programs. They are all standards-based and the components are designed to create an
appropriate software environment to use the grid. There is also a global Open Grid Forum
(www.ogf.org) which meets to evolve standards.
We will now examine how the grid infrastructure is invoked by a user and how it carries
out a user’s request. The steps are:
Very often the programs which are executed using the grid infrastructure are parallel
programs which use the high performance computers and appropriate programming systems
(such as MPI) we discuss in Chapter 8 of this book.
1. Savings which accrue due to the reduced idle time of computers. Idle computers in
one location may be used by users in other locations of the enterprise.
2. Access to high performance computing to all employees who request them from
their desktop.
3. Improvement in reliability and QoS.
4. Reduced hardware and software costs, reduced operational cost and improvement
in the productivity of resources.
5. Better utilization of the enterprise’s intranet.
Overall enterprise grid infrastructure benefits the enterprises by enabling innovations and
making better business decisions as an enterprise wide view is now possible. It also reduces
the overall cost of computing infrastructure.
Several vendors have cooperated to establish the Enterprise Grid Alliance (www.ega.org).
The objectives of this alliance are to adopt and deploy grid standards including
interoperability. It works closely with the Globus consortium. An important point to be
observed is that an enterprise grid need not be confined to a single enterprise. It may
encompass several independent enterprises which agree to collaborate on some projects,
sharing resources and data.
There are also examples of enterprises collaborating with universities to solve difficult
problems. One often cited industrial global grid project is distributed aircraft maintenance
environment (www.wrgrid.org.uk/leaflets/DAME.pdf). In this project Rolls Royce aircraft
engine design group collaborated with several UK universities to implement a decision
support system for maintenance applications. The system used huge amounts of data logged
in real-time by temperature sensors, noise sensors and other relevant sensors on thousands of
aircraft, and analyzed them using a computing grid to detect problems and suggest solutions
to reduce the down time of aircraft and to avert accidents.
There are several such projects on the anvil and enterprise grids will be deployed widely in
the near future.
6.2 CLOUD COMPUTING
There are many definitions of cloud computing. We give a definition adapted from the
definition given by the National Institute of Standards and Technology, U.S.A. [Mell and
Grance, 2011]. Cloud computing is a method of availing computing resources from a
provider, on demand, by a customer using computers connected to a network (usually the
Internet). The name “cloud” in cloud computing is used to depict remote resources available
on the Internet [which is often depicted as enclosed by a diffuse cloud in diagrams (see Fig.
6.1)] over which the services are provided. In Fig. 6.1 Amazon, Google, Microsoft, and IBM
are providers of cloud services.
AD: Access Device to cloud (desktop, laptop, tablet, thin client, etc.)
Figure 6.1 Cloud computing system.
A cloud computing service has seven distinct characteristics:
Even though the cloud computing model is very similar to a utility such as a power utility,
it is different in three important respects due to the distinct nature of computing. They are:
6.2.1 Virtualization
An important software idea that has enabled the development of cloud computing is
virtualization. The dictionary meaning of virtual is “almost or nearly the thing described, but
not completely”. In computer science the term virtual is used to mean: “a system which is
made to behave as though it is a physical system with specified characteristics by employing
a software layer on it”. For example, in a virtual memory system, a programmer is under the
illusion that a main random access memory is available with a large address space. The actual
physical system, however, consists of a combination of a small random access semiconductor
memory and a large disk memory. A software layer is used to mimic a larger
addressable main memory. This fact is hidden (i.e. transparent) to the programmer. This is
one of the early uses of the term “virtual” in computing.
The origin of virtualization in servers was driven by the proliferation of many servers, one
for each service, in organizations. For example, organizations use a server as a database
server, one to manage email, another as a compute server and yet another as a print server.
Each of these servers was used sub-optimally. It was found that sometimes not more than
10% of the available server capacity was used, as managers normally bought servers to
accommodate the maximum load. Each of these applications used the facilities provided by
the respective Operating System (OS) to access the hardware. The question arose whether all
the applications could be executed on one server without interfering with one another, each one
using its own native OS. This led to the idea of a software layer called a hypervisor which runs
on one server hardware and manages all the applications which were running on several
servers. Each application “thinks” that it has access to an independent server with all the
hardware resources available to it. Each application runs on its own “virtual machine”. By
virtual machine we mean an application using the OS for which it is written. (See Fig. 6.2).
Figure 6.2 Virtual machines using a server.
There are several advantages which accrue when Virtual Machines (VM) are used. They
are:
The hardware is utilized better. As the probability of all the applications needing
the entire hardware resources simultaneously is rare, the hypervisor can balance
the load dynamically.
If one of the applications encounters a problem, it can be fixed without affecting
other applications.
Virtual machines may be added or removed dynamically.
If a set of servers called a “server farm” is used, they can all be consolidated and a
hypervisor can move VMs between the physical servers. If a physical server fails, the VM
running on it will not be affected as it can be moved to another server. The entire operation
of shifting VMs among servers is hidden from the user. In other words, where a VM runs is
not known to the user.
1. An enterprise need not invest huge amounts to buy, maintain, and upgrade
expensive computers.
2. The service may be hired for even short periods, e.g., one server for one hour. As
and when more resources such as more powerful servers or disk space are needed, a
cloud service provider may be requested to provide them. As we pointed out earlier,
some providers may offer over 10,000 servers on demand. There is no need to plan
ahead. This flexibility is of great value to small and medium size enterprises and
also to entrepreneurs starting a software business. The billing is based on the
resources and the period they are used.
3. System administration is simplified.
4. Several copies of commonly used software need not be licensed as payment is
based on software used and time for which it is used.
5. A large variety of software is available on the cloud.
6. Migration of applications from one’s own data centre to a cloud provider’s
infrastructure is easy due to virtualization and a variety of services available from a
large number of cloud services providers.
7. Quality of service is assured based on service level agreements with the service
provider.
8. The system is designed to be fault tolerant as the provider would be able to shift
the application running on a server to another server if the server being used fails.
9. Organizations can automatically back up important data. This allows quick
recovery if data is corrupted. Data archiving can be scheduled regularly.
10. Disaster recovery will be easy if important applications are running simultaneously
on servers in different geographic areas.
1. Managers of enterprises are afraid of storing sensitive data on disks with a provider
due to security concerns, especially due to the fact that cloud resources are shared
by many organizations which may include their competitors. To mitigate this,
strong encryption such as AES 256-bit should be used. Users are also afraid of
losing data due to hardware malfunction. To mitigate this concern, a backup copy
should normally be kept with the organization.
2. Failure of communication link to the cloud from an enterprise would seriously
disrupt an enterprise’s function. To mitigate this, duplicate communication links to
the cloud should be maintained.
3. Failure of servers of a provider will disrupt an enterprise. Thus, Service Level
Agreements (SLA) to ensure a specified QoS should be assured by the provider.
4. If the service of a cloud service provider deteriorates, managers may not be able to
go to another provider as there are no standards for inter-operability or data
portability in the cloud. Besides this, moving large volumes of data out of a cloud
may take a long time (a few hundred hours) due to the small bandwidth available
on the Internet.
5. If a provider’s servers are in a foreign country and the data of a customer are
corrupted or stolen, complex legal problems will arise to get appropriate
compensation.
6. There is a threat of clandestine snooping of sensitive data transmitted on the
Internet by some foreign intelligence agencies that use very powerful computers to
decode encrypted data [Hoboken, Arnbak and Eijk, 2013].
Cloud computing is provided by several vendors. A customer signs a contract with one or
more of these vendors. As we pointed out earlier, if a customer is dissatisfied with a vendor,
it is difficult to migrate to another vendor as there is no standardization and thus no inter-
operability. There is no service provided by a group of cooperating vendors.
1. Routine data processing jobs such as pay roll processing, order processing, sales
analysis, etc.
2. Hosting websites of organizations. The number of persons who may login to the
websites of organizations is unpredictable. There may be surges in demand which
can be easily accommodated by a cloud provider as the provider can allocate more
servers and bandwidth to access the servers as the demand increases.
3. High performance computing or specialized software not available in the
organization which may be occasionally required. Such resources may be obtained
from a provider on demand.
4. Parallel computing, particularly Single Program Multiple Data (SPMD) model of
parallel computing. This model is eminently suitable for execution on a cloud
infrastructure as a cloud provider has several thousands of processors available on
demand. This model is used, for example, by search engines to find the URLs of
websites in which a specified set of keywords occur. A table of URLs and the
keywords occurring in the web pages stored in each URL may be created. This
table may be split into, say 1000 sets. Each set may be given to a computer with a
program to search if the keywords occur in them and note the URLs, where they
are found. All the 1000 computers search simultaneously and find the URLs. The
results are combined to report all the URLs where the specified keywords occur.
This can speed up the search almost a thousand fold. Software called MapReduce
[Dean and Ghemawat, 2004] was developed by Google using this idea. An
open source implementation of MapReduce called Hadoop [Turner, 2011] has been
developed for writing applications on the cloud that process petabytes of data in
parallel, in a reliable fault tolerant manner, which in essence uses the above idea.
We will discuss MapReduce in Chapter 8. (A small sketch of the idea is given
after this list.)
5. Seasonal demand such as processing results of an examination conducted during
recruitment of staff once or twice a year.
6. Archiving data to fulfill legal requirements and which are not needed day to day. It
is advisable to encrypt the data before storing it in a cloud provider’s storage
servers.
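The following is a small, self-contained Python sketch of the keyword search idea described in point 4 above. It is not the MapReduce or Hadoop API; the chunking and the map_phase and reduce_phase names are ours, and on a cloud each chunk would be processed by a different server.

def map_phase(table_chunk, keywords):
    # table_chunk: list of (url, set_of_words_on_page) pairs given to one worker.
    return [url for url, words in table_chunk if keywords <= words]

def reduce_phase(partial_results):
    # Combine the URL lists produced by all the workers.
    hits = []
    for part in partial_results:
        hits.extend(part)
    return sorted(hits)

table = [("u1", {"parallel", "computing"}), ("u2", {"cloud"}),
         ("u3", {"parallel", "computing", "cloud"})]
chunks = [table[0:2], table[2:3]]                  # split the table into sets
partial = [map_phase(c, {"parallel", "computing"}) for c in chunks]
print(reduce_phase(partial))                       # ['u1', 'u3']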
6.3 CONCLUSIONS
As of now (2015), cloud computing is provided by a large number of vendors (over 1000
worldwide) using the Internet. Each vendor owns a set of servers and operates them. There is
no portability of programs or data from one vendor’s facility to that of another. On the other
hand, the aim of grid computing is to create a virtual organization by interconnecting the
resources of a group of cooperating organizations. It has inter-operability standards and tools
for portability of programs among the members of the group. It would be desirable to
combine the basic idea of the grid, that of cooperation and open standards, with that of the
cloud, namely, vendors maintaining infrastructure and providing services (IaaS, SaaS and
PaaS) as a pay for use service to customers. This will result in what we term a cooperative or
federated cloud. Such a cloud infrastructure is not yet available.
We conclude the chapter by comparing grid and cloud computing summarized in Table
6.2. This is based on a comparison given by Foster, et al. [2008]. As can be seen, these
computing services provide a user a lot of flexibility. In fact, today it is possible for a user to
dispense with desktop computers and use instead a low cost mobile thin client to access any
amount of computing (even supercomputing), storage and sophisticated application programs
and pay only for the facilities used. In fact Google has recently announced a device called
Chromebook which is primarily a mobile thin client with a browser to access a cloud.
TABLE 6.2 Comparison of Grid and Cloud Computing

Business model
  Grid: Cooperating organizations; operation in project mode; community-owned infrastructure; no “real money” transactions; driven originally by scientists working in universities; cost shared on a no-profit/no-loss basis.
  Cloud: Massive infrastructure with a profit motive; economies of scale; 100,000+ servers available on demand; customers pay by use.
Architecture
  Grid: Heterogeneous high performance computers; federated infrastructure; standard protocols allowing inter-operability; no virtualization.
  Cloud: Uniform warehouse scale computers; independent, competing providers; no standards for inter-operability; IaaS, PaaS, SaaS; virtualized systems.
Resource management
  Grid: Batch operation and scheduling; resource reservation and allocation based on availability; not multi-tenanted.
  Cloud: Massive infrastructure allocation on demand; multi-tenanted; resource optimization.
Programming model
  Grid: High performance computing parallel programming methods such as MPI.
  Cloud: SPMD model; MapReduce, PACT, etc.
Applications
  Grid: High performance computing; high throughput computing; portals for applications in scientific research.
  Cloud: Loosely coupled, transaction oriented; mostly commercial; big data analysis.
Security model
  Grid: Uses public key based grid security infrastructure; users need proper registration and verification of credentials.
  Cloud: Customer credential verification lacks rigour; multi-tenancy risk; data location risk; multinational legal issues.
EXERCISES
6.1 What are the rates of increase of speed of computing, size of disk storage, and
communication bandwidth as of now? What is the main implication of this?
6.2 The distance between a client and server is 500 m. The speed of the communication
network connecting them is 10 Gbps. How much time will it take to transfer 10 MB of
data from client to server? If the distance between a client and server is 100 km and the
network speed is 1 Gbps, how much time will be required to transfer 10 MB data from
client to server? How much time will be needed to transfer 1 TB data?
6.3 Define grid computing. We gave two definitions of grid computing in the text. Combine
them and formulate a new definition.
6.4 Can a grid environment be used to cooperatively use several computers to solve a
problem? If yes, suggest ways of achieving this.
6.5 What functions should be provided by the software system in a grid computing
environment?
6.6 Why is grid software architecture specified as layers? Give the components in each
layer and explain their functions.
6.7 Explain how a grid environment is invoked by a user and how it carries out a user’s
request for service.
6.8 What is an enterprise grid? In what way is it different from an academic grid?
6.9 What are the advantages which accrue to enterprises when they use an enterprise grid?
6.10 Define cloud computing. What are its distinct characteristics?
6.11 Define IaaS, SaaS, and PaaS.
6.12 In what way is cloud computing similar to grid computing?
6.13 Describe different types of clouds and their unique characteristics.
6.14 List the advantages of using cloud computing.
6.15 What are the risks of using cloud computing?
6.16 What are the applications which are appropriate for outsourcing to a cloud computing
provider?
6.17 What are the major differences between cloud computing and grid computing?
6.18 What are the major differences between a community cloud and grid computing?
6.19 What are the major differences between a private cloud and a public cloud?
BIBLIOGRAPHY
Amazon Elastic Compute Cloud (Amazon EC2), 2008, (aws.amazon.com/ec2).
Armbrust, M., et al., “Above the Clouds: A Berkeley View of Cloud Computing”, Technical
Report UCB/EECS-2009-28, UC Berkeley, 2009.
(http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-28.html).
Berman, F., Fox, G. and Hey, A. (Eds.), Grid Computing: Making Global Infrastructure a
Reality, Wiley, New York, USA, 2003.
Dean, J. and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters”,
OSDI, 2004. (http://research.google.com/archive/mapreduce.html).
Foster, I. and Kesselman, C., “Globus: A Metacomputing Infrastructure Toolkit”, Intl.
Journal of Supercomputer Applications, Vol. 11, No. 2, 1997, pp. 115–128.
Foster, I., et al., “Cloud Computing and Grid Computing 360-Degree Compared”, 2008.
(www.arxiv.org/pdf/0901.0131).
Van Hoboken, J., Arnbak, A. and Van Eijk, N., “Obscured by Clouds or How to Address
Government Access to Cloud Data from Abroad”, 2013. (http://ssrn.com/abstract=2276103).
Hypervisor, Wikipedia, Oct., 2013.
International Telecommunication Union, “ICT Facts and Figures”, Switzerland, 2011.
(www.itu.int/ict).
Mell, P. and Grance, T., “The NIST Definition of Cloud Computing”, National Institute of
Standards and Technology, USA, Special Publication 800–145, Sept. 2011 (Accessible on
the World Wide Web).
Rajaraman, V. and Adabala, N., Fundamentals of Computers, (Chapter 17), 6th ed., PHI
Learning, Delhi, 2014.
Rajaraman, V., Cloud Computing, Resonance, Indian Academy of Sciences, Vol. 19, No. 3,
2014, pp. 242–258.
Turner, J., “Hadoop: What it is, how it works, and what it can do”, 2011.
(http://strata.oreilly.com/2011/01/what-is-hadoop.html).
Parallel Algorithms
In order to solve a problem on a sequential computer we need to design an algorithm for the
problem. The algorithm gives the sequence of steps which the sequential computer has to
execute in order to solve the problem. The algorithms designed for sequential computers are
known as sequential algorithms. Similarly, we need an algorithm to solve a problem on a
parallel computer. This algorithm, naturally, is quite different from the sequential algorithm
and is known as a parallel algorithm. We gave some ideas on how parallel algorithms are
evolved in Chapter 2. A parallel algorithm defines how a given problem can be solved on a
given parallel computer, i.e., how the problem is divided into subproblems, how the
processors communicate, and how the partial solutions are combined to produce the final
result. Parallel algorithms depend on the kind of parallel computer they are designed for.
Thus even for a single problem we need to design different parallel algorithms for different
parallel architectures.
In order to simplify the design and analysis of parallel algorithms, parallel computers are
represented by various abstract machine models. These models try to capture the important
features of a parallel computer. The models do, however, make simplifying assumptions
about the parallel computer. Even though some of the assumptions are not very practical, they
are justified in the following sense: (i) In designing algorithms for these models one often
learns about the inherent parallelism in the given problem. (ii) The models help us compare
the relative computational powers of the various parallel computers. They also help us in
determining the kind of parallel architecture that is best suited for a given problem.
In this chapter we study the design and analysis of parallel algorithms for the
computational problems: (i) prefix computation, (ii) sorting, (iii) searching, (iv) matrix
operations. These problems are chosen as they are widely used and illustrate most of the
fundamental issues in designing parallel algorithms. Finally, we look at some recent models
of parallel computation. Throughout this chapter we use a Pascal-like language to express
parallel algorithms.
7.1 MODELS OF COMPUTATION
In this section, we present various abstract machine models for parallel computers. These
models are useful in the design and analysis of parallel algorithms. We first present the
abstract machine model of a sequential computer for the sake of comparison.
Any step of an algorithm for the RAM model consists of (up to) three basic phases
namely:
1. Read: The processor reads a datum from the memory. This datum is usually stored
in one of its local registers.
2. Execute: The processor performs a basic arithmetic or logic operation on the
contents of one or two of its registers.
3. Write: The processor writes the contents of one register into an arbitrary memory
location.
For the purpose of analysis, we assume that each of these phases takes constant, i.e., O(1)
time.
It is important to note that the shared memory also functions as the communication
medium for the processors. A PRAM can be used to model both SIMD and MIMD machines.
In this chapter, we see that an interesting and useful application of the PRAM occurs when it
models an SIMD machine. When a PRAM models an SIMD machine, all the processors
execute the same algorithm synchronously. As in the case of RAM, each step of an algorithm
here consists of the following phases:
1. Read: (Up to) N processors read simultaneously (in parallel) from (up to) N
memory locations (in the common memory) and store the values in their local
registers.
2. Compute: (Up to) N processors perform basic arithmetic or logical operations on
the values in their registers.
3. Write: (Up to) N processors write simultaneously into (up to) N memory locations
from their registers.
Each of the phases, READ, COMPUTE, WRITE, is assumed to take O(1) time as in the
case of RAM. Notice that not all processors need to execute a given step of the algorithm.
When a subset of processors execute a step, the other processors remain idle during that time.
The algorithm for a PRAM has to specify which subset of processors should be active during
the execution of a step.
In the above model, a problem might arise when more than one processor tries to access
the same memory location at the same time. The PRAM model can be subdivided into four
categories based on the way simultaneous memory accesses are handled.
Exclusive Read Exclusive Write (EREW) PRAM
In this model, every access to a memory location (read or write) has to be exclusive. This
model provides the least amount of memory concurrency and is therefore the weakest.
Concurrent Read Exclusive Write (CREW) PRAM
In this model, only write operations to a memory location are exclusive. Two or more
processors can concurrently read from the same memory location. This is one of the most
commonly used models.
Exclusive Read Concurrent Write (ERCW) PRAM
This model allows multiple processors to concurrently write into the same memory location.
The read operations are exclusive. This model is not frequently used and is defined here only
for the sake of completeness.
Concurrent Read Concurrent Write (CRCW) PRAM
This model allows both multiple read and multiple write operations to a memory location. It
provides the maximum amount of concurrency in memory access and is the most powerful of
the four models.
The semantics of the concurrent read operation is easy to visualize. All the processors
reading a particular memory location read the same value. The semantics of the concurrent
write operation on the other hand is not obvious. If many processors try to write different
values to a memory location, the model has to specify precisely, the value that is written to
the memory location. There are several protocols that are used to specify the value that is
written to a memory in such a situation. They are:
1. Priority CW: Here the processors are assigned certain priorities. When more than
one processor tries to write a value to a memory location, only the processor with
the highest priority (among those contending to write) succeeds in writing its value
to the memory location.
2. Common CW: Here the processors that are trying to write to a memory location
are allowed to do so only when they write the same value.
3. Arbitrary CW: Here one of the processors that is trying to write to the memory
location succeeds and the rest fail. The processor that succeeds is selected
arbitrarily without affecting the correctness of the algorithm. However, the
algorithm must specify exactly how the successful processor is to be selected.
4. Combining CW: Here there is a function that maps the multiple values that the
processors try to write to a single value that is actually written into the memory
location. For instance, the function could be a summing function in which the sum
of the multiple values is written to the memory location.
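As an illustration only, the following Python function simulates how a single memory cell might resolve simultaneous writes under the four policies. Treating the smallest processor index as the highest priority is an assumption made for this sketch.

def resolve_concurrent_write(requests, policy, combine=sum):
    # requests: list of (processor_id, value) pairs aimed at one memory cell.
    if policy == "priority":                 # assumed: lower id means higher priority
        return min(requests)[1]
    if policy == "common":                   # all contending values must be identical
        values = {v for _, v in requests}
        assert len(values) == 1, "Common CW requires identical values"
        return values.pop()
    if policy == "arbitrary":                # any single writer may succeed
        return requests[0][1]
    if policy == "combining":                # e.g., store the sum of all values
        return combine(v for _, v in requests)
    raise ValueError(policy)

print(resolve_concurrent_write([(3, 5), (1, 7)], "priority"))    # 7
print(resolve_concurrent_write([(3, 5), (1, 7)], "combining"))   # 12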
1. Running time
2. Number of processors
3. Cost
1. The function g(n) is said to be of order at least f(n), denoted by Ω(f(n)), if there
exist positive constants c and n₀ such that g(n) ≥ cf(n) for all n ≥ n₀.
2. The function g(n) is said to be of order at most f(n), denoted by O(f(n)), if there exist
positive constants c and n₀ such that g(n) ≤ cf(n) for all n ≥ n₀.
3. The function g(n) is said to be of the same order as f(n) if g(n) ∈ O(f(n)) and g(n)
∈ Ω(f(n)).
Consider the following examples: The function g(n) = n² is of order at least f(n) = n log n.
In this case, one can take c = 1 and n₀ = 1. The function g(n) = 4n² + 3n + 1 is of the same
order as the function f(n) = n² + 6n + 3.
The order notation is used to compare two algorithms. For instance, an algorithm whose
running time is O(n) is asymptotically better than an algorithm whose running time is O(n²)
because, as n is made larger, n² grows much faster than n.
Lower and Upper Bounds
Every problem can be associated with a lower and an upper bound. The lower bound on a
problem is indicative of the minimum number of steps required to solve the problem in the
worst case. If the number of steps taken by an algorithm for execution in the worst case is
equal to or of the same order as the lower bound, then the algorithm is said to be optimal.
Consider the matrix multiplication problem. The best known algorithm for this particular
problem takes O(n^2.38) time while the known lower bound for the problem is Ω(n²). In the
case of sorting, the lower bound is Ω(n log n) and the upper bound is O(n log n). Hence
provably optimal algorithms exist for the sorting problem, while no such algorithm exists for
matrix multiplication to date.
Our treatment of lower and upper bounds in this section has so far focused on sequential
algorithms. Clearly, the same general ideas also apply to parallel algorithms while taking two
additional factors into account: (i) the model of parallel computation used and (ii) the number
of processors in the parallel computer.
Speedup
A good measure for analyzing the performance of a parallel algorithm is speedup. Speedup is
defined as the ratio of the worst case running time of the best (fastest) known sequential
algorithm and the worst case running time of the parallel algorithm.
The greater the value of the speedup, the better the parallel algorithm. The speedup S of any
algorithm A is less than or equal to N, where N denotes the number of processors in the parallel
computer. The lower bound on the time for summing (adding) n numbers is Ω(log n) for any
interconnection network parallel computer. The maximum speedup achievable for the
summing problem is therefore O(n/log n), as the best sequential algorithm for this problem
takes O(n) time.
7.2.3 Cost
The cost of a parallel algorithm is defined as the product of the running time of the parallel
algorithm and the number of processors used. It is indicative of the number of steps executed
collectively by all the processors in solving a problem in the worst case. Note that in any step
of a parallel algorithm a subset of processors could be idle. The above definition of cost
includes all these idle steps as well.
Cost = Running time × Number of processors
If the cost of the parallel algorithm matches the lower bound of the best known sequential
algorithm to within a constant multiplicative factor, then the algorithm is said to be cost-
optimal. The algorithm for adding (summing) n numbers takes O(log n) steps on an (n – 1)-processor
tree. The cost of this parallel algorithm is thus O(n log n). But the best sequential
algorithm for this problem takes O(n) time. Therefore, the parallel algorithm is not cost-
optimal.
The efficiency of a parallel algorithm is defined as the ratio of the worst case running time
of the fastest sequential algorithm and the cost of the parallel algorithm.
Usually, efficiency is less than or equal to 1; otherwise a faster sequential algorithm can be
obtained from the parallel one!
7.3 PREFIX COMPUTATION
Prefix computation is a very useful suboperation in many parallel algorithms. The prefix
computation problem is defined as follows.
A set χ is given, with an operation o on the set such that
1. o is a binary operation.
2. χ is closed under o.
3. o is associative.
Let X = {x0, x1, …, xn–1}, where xi ∈ χ (0 ≤ i ≤ n – 1). Define
s0 = x0
s1 = x0 o x1
...
sn–1 = x0 o x1 o ... o xn–1
Obtaining the set S = {s0, s1, …, sn–1} given the set X is known as the prefix computation.
Note that the indices of the elements used to compute si form a string 012 … i, which is a
prefix of the string 012 … n – 1, and hence the name prefix computation for this problem.
Until explicitly mentioned otherwise, we assume that the binary operation o takes constant
time. In this section, we look at some algorithms for the prefix computation on the PRAM
models.
Analysis
Steps 1 and 3 of the algorithm require O(k) time. Step 2 requires O(log m) time. Since k =
log n and m = n/log n, the running time of the algorithm is O(log n) + O(log (n/log n)),
which is O(log n). The cost of the algorithm is O(log n) × (n/log n), which is O(n). Hence this
algorithm is cost optimal.
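A sketch consistent with this analysis is given below: each of the m = n/log n processors handles a block of k = log n elements. It is simulated sequentially in Python; on the PRAM, Step 2 would itself use the O(log m) parallel prefix algorithm and the blocks of Steps 1 and 3 would be handled by different processors.

import math

def block_prefix(x, op=lambda a, b: a + b):
    n = len(x)
    k = max(1, int(math.log2(n)))                  # block size k = log n
    blocks = [x[i:i + k] for i in range(0, n, k)]
    # Step 1: each processor computes the prefixes of its own block (O(k) time).
    local = []
    for b in blocks:
        pre = [b[0]]
        for v in b[1:]:
            pre.append(op(pre[-1], v))
        local.append(pre)
    # Step 2: prefix computation over the block totals (O(log m) on the PRAM).
    offset, running = [], None
    for pre in local:
        offset.append(running)
        running = pre[-1] if running is None else op(running, pre[-1])
    # Step 3: each processor adds its offset to every element of its block (O(k) time).
    out = []
    for off, pre in zip(offset, local):
        out.extend(pre if off is None else [op(off, v) for v in pre])
    return out

print(block_prefix([1, 2, 3, 4, 5, 6, 7, 8]))      # [1, 3, 6, 10, 15, 21, 28, 36]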
7.3.2 Prefix Computation on a Linked List
We will now look at prefix computation on linked lists. The algorithm for the prefix
computation which we describe here illustrates a powerful technique, called pointer
jumping. The idea of pointer jumping is used in solving problems whose data are stored in
pointer-based data structures such as linked lists, trees, and general graphs. The exact linked
list data structure that is used in the rest of this subsection is given below.
Linked List Data Structure
A linked list is composed of nodes that are linked by pointers. Each node i of the linked list
consists of the following:
The last node in the list, the tail, has no successor. Hence its pointer is equal to nil. In what
follows, we omit the info field from the discussion.
Linked List Prefix Computation
In the linked list version of the prefix computation, the values of the input set X are given as
elements of a linked list. Consider a linked list L shown in Fig. 7.7. Given the linked list L,
we need to perform the prefix computation
x0, x0 o x1, x0 o x1 o x2, ...
Each node of the linked list is assigned to a processor. However, the processor
knows only the “location” of the node that is assigned to it. It does not know the
relative position of the node in the linked list. It can however access the next node
in the linked list by following the succ(i) pointer. This is in contrast to the usual
case of the prefix computation where each processor knows the position (or index)
of the element it is working with. (Though the notion of a node’s index is
meaningless here, we still use it in order to be able to distinguish among nodes and
their respective fields. Note that the index of a node denotes its position from the
beginning of the list. For instance, in Fig. 7.7, node 2 has an index 0 and node 5
has an index 4.)
None of the processors knows how many nodes are present in the linked list. In
other words, each processor has only a “local” picture of the linked list, and not the
“global” one. Hence the processors participating in the prefix computation should
have some means of determining when the algorithm has terminated.
1. Perform an o operation on the value of its node and the value of its successor node,
and leave the resultant value in the successor node.
2. Update its succ(i) pointer to the successor of the current successor node.
The iterations of the algorithm are shown in Fig. 7.8 for N = |L| = 8. Notice that this
algorithm is quite similar to the prefix computation algorithm described earlier except that it
uses pointer jumping to determine the nodes which perform the operation.
An important issue in this algorithm is the termination. The processors should know when
the algorithm has terminated. For this purpose, the algorithm uses a COMMON CW memory
location called finished. At the start of each iteration finished is set to true and then all the
processors whose succ is not nil write false to it; since they all write the same value, this is
permitted by the COMMON CW model. Thus finished remains true only when no processor
has a successor left, and when finished contains the value true the algorithm terminates. At
the beginning of each iteration, all processors read finished and, if it is false, determine that
another iteration is required. This way, all processors will “know” simultaneously when the
algorithm has terminated.
Procedure PRAM Linked List Prefix Computation
Analysis
Steps 1 and 2 and each iteration of Step 3 take constant time. Since the number of partial
sums doubles after each iteration of Step 3, the number of iterations is O(log |L|). Assuming
|L| = n = the number of processors, the algorithm runs in O(log n) time.
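The pointer jumping idea can be simulated sequentially in Python as below. Each pass of the while loop corresponds to one O(1) parallel step, and the snapshot of val and succ plays the role of the synchronous read phase of the PRAM.

def linked_list_prefix(values, succ, op=lambda a, b: a + b):
    # values[i]: value at node i; succ[i]: index of node i's successor, or None (nil).
    val, nxt = list(values), list(succ)
    while any(s is not None for s in nxt):           # the 'finished' test
        old_val, old_nxt = list(val), list(nxt)      # synchronous read phase
        for i in range(len(val)):                    # every processor i in parallel
            s = old_nxt[i]
            if s is not None:
                val[s] = op(old_val[i], old_val[s])  # leave the result in the successor
                nxt[i] = old_nxt[s]                  # pointer jumping
    return val

# A list 0 -> 1 -> 2 -> 3 holding the values 1, 2, 3, 4:
print(linked_list_prefix([1, 2, 3, 4], [1, 2, 3, None]))   # [1, 3, 6, 10]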
7.4 SORTING
The problem of sorting is important as sorting data is at the heart of many computations. In
this section, we describe a variety of sorting algorithms for the various models of parallel
computers discussed earlier. The problem of sorting is defined as follows: Given a sequence
of n numbers S = s1, s2, …, sn, where the elements of S are initially in arbitrary order,
rearrange the numbers so that the resulting sequence is in non-decreasing (sorted) order.
All elements of the sequence S1 have values less than or equal to all elements of the
sequence S2.
Each sequence, S1 and S2, is a bitonic sequence.
This way of splitting a bitonic sequence into two bitonic sequences is known as bitonic
split. The idea of bitonic split can be used to sort a given bitonic sequence. A sequential
recursive algorithm for sorting a bitonic sequence S = <a0, a1, …, an–1> is given below:
1. Split the sequence S into two sequences S1 and S2 using the bitonic split.
2. Recursively sort the two bitonic sequences S1 and S2.
3. The sorted sequence of S is the sorted sequence of S1 followed by the sorted
sequence of S2.
This process of sorting a bitonic sequence is known as bitonic merging. This bitonic
merging can easily be implemented as a combinational circuit. The bitonic merging circuit
(network) for a bitonic sequence of length n = 16 is shown in Fig. 7.10. The bitonic merging
network for a sequence of length n consists of log n layers each containing n/2 comparators.
If all the comparators are increasing comparators, we get an ascending sequence at the output.
Such a bitonic merging circuit is represented by (+)BM(n). If all the comparators are
decreasing comparators, we get a decreasing sequence at the output. Such a bitonic merging
circuit is represented by (–)BM(n). Note that the depth of both (+)BM(n) and (–)BM(n) is log
n.
The basic idea of the bitonic sorting algorithm is to incrementally construct larger and
larger bitonic sequences starting from bitonic sequences of length 2. Given two bitonic
sequences of length m/2 each, we can obtain one ascending order sorted sequence using
(+)BM(m/2) and one descending order sorted sequence using (–)BM(m/2). These two can be
concatenated to get a single bitonic sequence of length m. Once a bitonic sequence of length
n is formed, this is sorted using (+)BM(n) to get a sorted output sequence. The bitonic sorting
network for length n = 16 is shown in Fig. 7.11.
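A compact recursive Python sketch of bitonic merging and bitonic sorting (for power-of-two length sequences) is given below. The circuits of Figs. 7.10 and 7.11 perform the same comparisons, but with each layer of comparators working in parallel; the sequential recursion here is only for illustration.

def bitonic_merge(seq, ascending=True):
    # Sorts a bitonic sequence using repeated bitonic splits.
    n = len(seq)
    if n == 1:
        return list(seq)
    half, s = n // 2, list(seq)
    for i in range(half):                            # one layer of n/2 comparators
        if (s[i] > s[i + half]) == ascending:
            s[i], s[i + half] = s[i + half], s[i]
    return (bitonic_merge(s[:half], ascending) +
            bitonic_merge(s[half:], ascending))

def bitonic_sort(seq, ascending=True):
    # Builds larger and larger bitonic sequences and finally merges them.
    n = len(seq)
    if n == 1:
        return list(seq)
    first = bitonic_sort(seq[:n // 2], True)         # ascending half
    second = bitonic_sort(seq[n // 2:], False)       # descending half
    return bitonic_merge(first + second, ascending)

print(bitonic_sort([5, 1, 8, 2, 7, 3, 6, 4]))        # [1, 2, 3, 4, 5, 6, 7, 8]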
1. Using an (m/2, m/2) merging circuit, merge the odd-indexed elements of the two
sequences: <x1, x3, …, xm–1> and <y1, y3, …, ym–1> to produce a sorted sequence
<u1, u2, …, um>.
2. Using an (m/2, m/2) merging circuit, merge the even-indexed elements of the two
sequences: <x2, x4, … xm> and <y2, y4, …, ym> to produce a sorted sequence <v1,
v2, …, vm>.
3. The output sequence <z1, z2, …, z2m> is obtained as follows: z1 = u1, z2m = vm,
z2i = min(ui+1, vi), and z2i+1 = max(ui+1, vi), for i = 1, 2, …, m – 1.
The schematic diagram of an (m, m) odd-even merging circuit, which uses two (m/2, m/2)
odd-even merging circuits, is shown in Fig. 7.13.
Figure 7.13 Odd-even merging circuit.
Analysis
1. Width: Each comparator has exactly two inputs and two outputs. The circuit takes 2m
inputs and produces 2m outputs. Thus it has a width of m.
2. Depth: The merging circuit consists of two smaller circuits for merging sequences of
length m/2, followed by one stage of comparators. Thus the depth d(m) of the circuit
(i.e., the running time) is given by the following recurrence relations.
d(m) = 1 (m = 1)
d(m) = d(m/2) + 1 (m > 1)
Solving this we get d(m) = 1 + log m. Thus the time taken to merge two sequences is
quite fast compared to the obvious lower bound of Ω(m) on the time required by the
RAM to merge two sorted sequences of length m.
3. Size: The size p(m) of the circuit (i.e., the number of comparators used) can be obtained
by solving the following recurrence relations.
p(1) = 1 (m = 1)
p(m) = 2p(m/2) + m–1 (m > 1)
Solving this we get p(m) = 1 + m log m.
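The three-step merging rule above translates directly into the following recursive Python sketch for merging two sorted sequences of equal power-of-two length; the circuit performs the comparisons of the final loop in a single parallel layer, which is what gives the d(m) = 1 + log m depth derived above.

def odd_even_merge(x, y):
    # Merge two sorted lists of equal (power-of-two) length m.
    m = len(x)
    if m == 1:
        return [min(x[0], y[0]), max(x[0], y[0])]    # a single comparator
    u = odd_even_merge(x[0::2], y[0::2])             # odd-indexed elements (1-based)
    v = odd_even_merge(x[1::2], y[1::2])             # even-indexed elements (1-based)
    z = [u[0]]                                       # z1 = u1
    for i in range(1, m):                            # the final layer of comparators
        z.append(min(u[i], v[i - 1]))                # z2i   = min(ui+1, vi)
        z.append(max(u[i], v[i - 1]))                # z2i+1 = max(ui+1, vi)
    z.append(v[m - 1])                               # z2m = vm
    return z

print(odd_even_merge([1, 3, 5, 7], [2, 4, 6, 8]))    # [1, 2, 3, 4, 5, 6, 7, 8]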
Odd-Even Merge Sorting Circuit
The sorting circuit can now be derived using the merging circuit, developed above, as the
basic building block. The circuit for sorting an input sequence of length n is constructed as
follows:
1. Split the input unsorted sequence <a1, a2, …, an> into two unsorted sequences <a1,
a2, …, an/2> and <an/2 + 1, an/2 + 2, …, an>. Recursively sort these two sequences.
2. Merge the two sorted sequences using an (n/2, n/2) merging circuit to get the
sorted sequence.
The above algorithm takes O(1) time, but it uses O(n²) processors, which is very large.
Further, the assumption made about the write conflict resolution process is quite unrealistic.
CREW Sorting
The sorting algorithm for the CREW model is very similar to the one described for the
CRCW model. The algorithm uses n processors and has a time complexity of O(n). Let the
processors be denoted by P1, P2, …, Pn. The processor Pi computes the value ri after n
iterations. In the jth iteration of the algorithm, all the processors read sj; the processor Pi
increments ri if si > sj or (si = sj and i > j). Therefore, after n iterations, ri contains the
number of elements smaller than si in the input sequence. As in the previous case, by writing
the value si in the location ri + 1 we get a completely sorted sequence.
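The counting idea can be simulated sequentially in Python as below. This is only an illustration of the idea, not the CREW PRAM procedure given below: the outer loop stands for the n iterations and the inner loop for the n processors working in parallel.

def crew_rank_sort(s):
    n = len(s)
    r = [0] * n
    for j in range(n):                 # j-th iteration: every processor reads s[j]
        for i in range(n):             # processor P_i updates its counter r_i
            if s[i] > s[j] or (s[i] == s[j] and i > j):
                r[i] += 1
    out = [None] * n
    for i in range(n):                 # P_i writes s_i into location r_i + 1
        out[r[i]] = s[i]
    return out

print(crew_rank_sort([3, 1, 2, 1]))    # [1, 1, 2, 3]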
Procedure CREW Sorting
EREW Sorting
The above algorithm uses concurrent read operations at every iteration. Concurrent read
conflicts can be avoided if we ensure that each processor reads a unique element in each
iteration. This can be achieved by allowing the processors to read elements in a cyclic
manner. In the first iteration the processors P1, P2, …, Pn read s1, s2, …, sn respectively and
in the second iteration the processors read s2, s3, …, sn, s1 respectively and so on. Thus in the
jth iteration the processors P1, P2, …, Pn read sj, sj+1, …, sj–1 respectively and update the
corresponding r values. This algorithm has a running time complexity of O(n). Therefore we
get an O(n) running time algorithm for the EREW model using O(n) processors.
Procedure EREW Sorting
1. If si > x, then if an element equal to x is in the sequence at all, it must precede si.
Consequently, si and all the elements that follow it (i.e., to its right) are removed
from consideration.
2. If si < x, then all elements to the left of si in the sequence are removed from
consideration.
Thus each processor splits the sequence into two parts: a part that is discarded because it
definitely does not contain an element equal to x, and a part that might contain it and is hence retained.
This narrows down the search to the intersection of all the parts to be retained. This process
of splitting and probing continues until either an element equal to x is found or all the
elements of S are discarded. Since at every stage the size of the subsequence that needs to be
searched drops by a factor of N + 1, O(log_{N+1}(n + 1)) stages are needed. The algorithm is
given below. The sequence S = {s1, s2, …, sn} and the element x are given as input to the
algorithm. The algorithm returns k if x = sk for some k, else it returns 0. Each processor Pi
uses a variable ci that takes the value left or right according to whether the part of the
sequence Pi decides to retain is to the left or right of the element it compared with x during
this stage. The variables q and r keep track of the boundary of the sequence that is under
consideration for searching. Precisely one processor updates q and r in the shared memory,
and all remaining processors simultaneously read the updated values in constant time.
Procedure CREW Search
Analysis
If the element x is found to occur in the input sequence, the value of k is set accordingly and
consequently the while loop terminates. After each iteration of the while loop, the size of the
sequence under consideration shrinks by a factor of N + 1. Thus after ⌈log_{N+1}(n + 1)⌉ steps (iterations), in the worst case, q becomes greater than r and the while loop terminates. Hence the worst case time complexity of the algorithm is O(log_{N+1}(n + 1)). When N = n, the
algorithm runs in constant time.
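A rough sequential C sketch of this stage-by-stage search, with the N processors simulated by a loop over the N probe positions (the layout of the probes is illustrative, not the book's listing):

#include <stdio.h>
#define N 3          /* number of "processors" probing in parallel */

/* Search a sorted array s[0..n-1] for x; returns the 1-based index k with
   s[k-1] == x, or 0 if x is absent.  Each stage places N probes that split
   the current interval [q, r] into N + 1 parts; the retained interval is
   the intersection of the parts kept by the individual processors.       */
static int crew_search(const int s[], int n, int x) {
    int q = 0, r = n - 1, k = 0;
    while (q <= r && k == 0) {
        int len = r - q + 1;
        int newq = q, newr = r;
        for (int i = 1; i <= N; i++) {       /* concurrent on a CREW PRAM */
            int p = q + (int)((long)i * len / (N + 1));
            if (p > r) p = r;
            if (s[p] == x)       k = p + 1;                     /* found  */
            else if (s[p] > x) { if (p - 1 < newr) newr = p - 1; }
            else               { if (p + 1 > newq) newq = p + 1; }
        }
        q = newq; r = newr;
    }
    return k;
}

int main(void) {
    int s[] = {2, 5, 7, 11, 13, 17, 19, 23, 29, 31};
    printf("%d\n", crew_search(s, 10, 17));   /* prints 6 */
    printf("%d\n", crew_search(s, 10, 4));    /* prints 0 */
    return 0;
}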
Note that the above algorithm assumes that all the elements in the input sequence S are
sorted and distinct. If each si is not unique, then possibly more than one processor will
succeed in finding an element of S equal to x. Consequently, possibly several processors will
attempt to return a value in the variable k, thus causing a write conflict which is not allowed
in both the EREW and CREW models. However, the problem of uniqueness can be solved
with an additional overhead of O(log N) (The reader is urged to verify this). In such a case,
when N = n the running time of the algorithm is O(log_{N+1}(n + 1)) + O(log N) = O(log n)
which is precisely the time required by the sequential binary search algorithm!
CRCW Searching
The searching algorithm discussed for the CREW model is applicable even to the CRCW
model. The uniqueness criterion which posed a problem for the CREW model is resolved in
this case by choosing an appropriate conflict resolution policy. The algorithm, therefore, does
not incur the O(log N) extra cost. Thus the algorithm runs in O(log_{N+1}(n + 1)) time.
Searching a Random Sequence
We now consider the more general case of the search problem. Here the elements of the
sequence S = {s1, s2, …, sn} are not assumed to be in any particular order and are not
necessarily distinct. As we will see now, the general algorithm for searching a sequence in
random order is straightforward. We use an N-processor computer to search for an element x
in S, where 1 < N < n. The sequence S is divided into N subsequences of equal size and each
processor is assigned one of the subsequences. The value x is then broadcast to all the N
processors. Each processor now searches for x in the subsequence assigned to it. The
algorithm is given below.
Procedure Search
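As a rough sequential C illustration of this partitioned search (each "processor" is simulated by one pass over its own subsequence; this is a sketch, not the procedure above):

#include <stdio.h>

/* Search an unordered array s[0..n-1] for x using N "processors", each
   scanning a contiguous subsequence of about n/N elements.  Returns a
   1-based position of x, or 0 if x is absent.  If x occurs in several
   subsequences, the last writer wins, mimicking a conflict resolution. */
static int parallel_search(const int s[], int n, int x, int N) {
    int k = 0;
    for (int p = 0; p < N; p++) {                  /* simulated processors */
        int lo = p * n / N, hi = (p + 1) * n / N;
        for (int i = lo; i < hi; i++)
            if (s[i] == x) { k = i + 1; break; }   /* store the result */
    }
    return k;
}

int main(void) {
    int s[] = {12, 7, 31, 7, 20, 3, 18, 25};
    printf("%d\n", parallel_search(s, 8, 20, 3));  /* prints 5 */
    printf("%d\n", parallel_search(s, 8, 99, 3));  /* prints 0 */
    return 0;
}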
Analysis
1. EREW: In this model, the broadcast step takes O(log N) time. The sequential search takes O(n/N) time in the worst case. The final store operation takes O(log N) time since several processors may need to write into a common memory location, which is not permitted in this model and must therefore be resolved in a tree-like fashion. Therefore the overall time
complexity of the algorithm, t(n), is
t(n) = O(log N) + O(n/N)
and its cost, c(n), is
c(n) = O(N log N) + O(n)
which is not optimal.
2. CREW: In this model, the broadcast operation can be accomplished in O(1) time since
concurrent read is allowed. Steps 2 and 3 take the same time as in the case of EREW.
The time complexity of the algorithm remains unchanged.
3. CRCW: In this model, both the broadcast operation and the final store operation take
constant time. Therefore the algorithm has a time complexity of O(n/N). The cost of the
algorithm is O(n) which is optimal.
1. The root node reads x and passes it to its two child nodes. In turn, these send x to
their child nodes. The process continues until a copy of x reaches each leaf node.
2. Simultaneously, every leaf node Li compares the element x with si. If they are
equal, the node Li sends the value 1 to its parent, else it sends a 0.
3. The results (1 or 0) of the search operation at the leaf nodes are combined by going
upward in the tree. Each intermediate node performs a logical OR of its two inputs
and passes the result to its parent. This process continues until the root node
receives its two inputs, performs their logical OR and produces either a 1 (for yes)
or a 0 (for no).
The working of the algorithm is illustrated in Fig. 7.18 where S = {26, 13, 36, 28, 15, 17,
29, 17} and x = 17. It takes O(log n) time to go down the tree, constant time to perform the
comparison, and again O(log n) time to go back up the tree. Thus the time taken by the
algorithm to search for an element is O(log n). Surprisingly, its performance is superior to
that of the EREW PRAM algorithm.
Figure 7.18 Searching on a tree.
Suppose, instead of one value of x (or query, which only requires a yes or no answer), we
have q values of x, i.e., q queries to be answered. These q queries can be pipelined down the
tree since the root node and intermediate nodes are free to handle the next query as soon as
they have passed the current one along their child nodes. The same remark applies to the leaf
nodes—as soon as the result of one comparison has been performed, each leaf node is ready
to receive a new value of x. Thus it takes O(log n) + O(q) time for the root node to process all
the q queries. Note that for large values of q, this behaviour is equivalent to that of the
CRCW PRAM algorithm.
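A compact C sketch of the single-query case, with the tree simulated level by level in an array (the downward broadcast of x is implicit in a sequential simulation); the data are those of Fig. 7.18:

#include <stdio.h>
#define NLEAF 8                     /* number of leaf nodes, a power of two */

int main(void) {
    int s[NLEAF] = {26, 13, 36, 28, 15, 17, 29, 17};
    int x = 17;
    int v[NLEAF];                   /* results at the current tree level */

    /* leaves: each leaf node compares its element with x */
    for (int i = 0; i < NLEAF; i++)
        v[i] = (s[i] == x);

    /* go up the tree: each internal node ORs its two children */
    for (int m = NLEAF / 2; m >= 1; m /= 2)
        for (int i = 0; i < m; i++)
            v[i] = v[2 * i] | v[2 * i + 1];

    printf("%s\n", v[0] ? "yes" : "no");   /* prints "yes" for x = 17 */
    return 0;
}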
Searching on a Mesh
We now present a searching algorithm for the mesh network. This algorithm runs in O(n^(1/2)) time. It can also be adapted for pipelined processing of q queries so that q search queries can be processed in O(n^(1/2)) + O(q) time.
The algorithm uses n processors arranged in an n^(1/2) × n^(1/2) square array. The processor in the ith row and jth column is denoted by P(i, j). The n elements are distributed among the n processors.
Let si,j denote the element assigned to the processor P(i, j). The query is submitted to the
processor P(1, 1). The algorithm proceeds in two stages namely unfolding and folding.
Unfolding: Each processor has a bit bi, j associated with it. The unfolding step is
effectively a broadcast step. Initially, the processor P(1, 1) reads x. At the end of the
unfolding stage, all the processors get a copy of x. The bit bi, j is set to 1 if x = si,j, otherwise it
is set to 0. The processor P(1, 1) sends the value x to P(1, 2). In the next step, both P(1, 1)
and P(1, 2) send their values to P(2, 1) and P(2, 2), respectively. This unfolding process,
which alternates between row and column propagation, continues until x reaches processor
P(n^(1/2), n^(1/2)). The exact sequence of steps for a 4 × 4 mesh is shown in Fig. 7.19 where each
square represents a processor P(i, j) and the number inside the square is element si,j, and x =
15.
A simple sequential algorithm for the matrix multiplication is given below. Assuming that m
≤ n and k ≤ n, the algorithm, Matrix Multiplication, runs in O(n^3) time. However, there exist several sequential matrix multiplication algorithms whose running time is O(n^x) where 2 < x < 3. For example, Strassen has shown that two matrices of size 2 × 2 can be multiplied using only 7 multiplications, and by applying this result recursively he designed an algorithm for multiplying two matrices of size n × n in O(n^2.81) time. Interestingly, to this date it is not known whether the fastest of such algorithms is optimal. Indeed, the only known lower bound on the number of steps required for matrix multiplication is the trivial Ω(n^2), which is obtained by observing that n^2 outputs are to be produced, and therefore any algorithm must require at least this many steps. In view of the gap between n^2 and n^x, in what follows, we
present algorithms whose cost is matched against the running time of the algorithm, Matrix
Multiplication, given below. Further, we make the simplifying assumption that all the three
matrices A, B, and C are n × n matrices. It is quite straightforward to extend the algorithms
presented here to the asymmetrical cases.
Procedure Matrix Multiplication
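For reference, the conventional triple-loop computation can be sketched in C as follows (a sketch consistent with, but not necessarily identical to, the procedure above); it clearly performs n^3 scalar multiplications:

#include <stdio.h>
#define N 2

/* C = A * B for N x N matrices: three nested loops */
static void matmul(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

int main(void) {
    double A[N][N] = {{1, 2}, {3, 4}}, B[N][N] = {{5, 6}, {7, 8}}, C[N][N];
    matmul(A, B, C);
    printf("%g %g\n%g %g\n", C[0][0], C[0][1], C[1][0], C[1][1]);  /* 19 22 / 43 50 */
    return 0;
}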
Finally, note that the cost of all the three algorithms given above matches the running time
of sequential algorithm Matrix Multiplication.
Matrix Multiplication on a Mesh
The algorithm uses n × n processors arranged in a mesh configuration. The matrices A and B
are fed into the boundary processors in column 1 and row 1, respectively as shown in Fig.
7.20. Note that row i of matrix A and column j of matrix B lag one time unit behind row i –1
(2 ≤ i ≤ n) and column j – 1 (2 ≤ j ≤ n), respectively in order to ensure that ai,k meets bk, j (1 ≤
k ≤ n) in processor Pi,j at the right time. Initially, ci,j is zero. Subsequently, when Pi,j receives
two inputs a and b it (i) multiplies them, (ii) adds the result to ci,j (iii) sends a to Pi,j+1 unless j
= n, and (iv) sends b to Pi+1,j unless i = n. At the end of the algorithm, ci,j resides in processor
Pi, j.
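The timing argument can be checked with a small sequential simulation: under the skewed feeding described above, a[i][k] and b[k][j] meet in processor P(i, j) at global time step i + j + k (0-based indexing assumed here), where their product is accumulated into c[i][j]. A hedged C sketch:

#include <stdio.h>
#define N 3

int main(void) {
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double c[N][N] = {{0}};

    /* t plays the role of the global systolic time step */
    for (int t = 0; t <= 3 * (N - 1); t++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;     /* the index meeting P(i,j) at time t */
                if (k >= 0 && k < N)
                    c[i][j] += a[i][k] * b[k][j];
            }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%6.1f ", c[i][j]);
        printf("\n");
    }
    return 0;
}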
1. First, we form an augmented matrix A|b (i.e., the b-vector is appended as the (n + 1)th column) of order n × (n + 1). This matrix is duplicated to form A0 | b0 and b1 |
A1 (where the b-vector is appended to the left of the coefficient matrix A) with A0
and A1 being the same as the coefficient matrix A, and similarly, b0 and b1 being
the same as the b-vector. Note the b-vector is appended to the left or right of the
coefficient matrix only for programming convenience.
2. Using the GE method, we triangularize A0 | b0 in the forward direction and b1 | A1 in the backward direction through n/2 columns, thus reducing the order of each augmented matrix to half of its original size.
3. Each of the two reduced augmented matrices is duplicated again, giving A00 | b00, b01 | A01, A10 | b10 and b11 | A11, all of the same order. We now triangularize A00 | b00 and A10 | b10 in the forward direction and b01 | A01 and b11 | A11 in the backward direction through n/4 columns using the GE method, thus reducing the order of each of these matrices to half of their original size. Note that the above four augmented matrices are reduced in parallel.
4. We continue this process of halving the size of the submatrices using GE and
doubling the number of submatrices log n times so that we end up with n
submatrices each of order 1 × 2. The modified b-vector part, when divided by the
modified A matrix part in parallel, gives the complete solution vector, x.
The solution of Ax = b using the BGE algorithm is shown in Fig. 7.21 in the form of a
binary tree for n = 8. We leave the implementation of this BGE algorithm on mesh-connected
processors to the reader as an exercise. Finally, note that in the BGE algorithm, all xi(i = 1, 2,
…, n) are found simultaneously, unlike in the GE algorithm. Hence the BGE algorithm is
expected to have better numerical stability characteristics (i.e., lesser rounding errors due to
the finite precision of floating-point numbers stored in the memory) than the GE algorithm.
The idea of bidirectional elimination is used in efficiently solving sparse systems of linear
equations and also dense systems of linear equations using Givens rotations.
Figure 7.21 Progression of the BGE algorithm.
7.7 PRACTICAL MODELS OF PARALLEL COMPUTATION
So far we have looked at a variety of parallel algorithms for various models of a parallel computer.
In this section, we first look at some drawbacks of the models we studied and then present
some recent models of a parallel computer which make realistic assumptions about a parallel
computer.
The PRAM is the most popular model for representing and analyzing the complexity of
parallel algorithms due to its simplicity and clean semantics. The majority of theoretical
parallel algorithms are specified using the PRAM model or a variant thereof. The main
drawback of the PRAM model lies in its unrealistic assumptions of zero communication
overhead and instruction level synchronization of the processors. This could lead to the
development of parallel algorithms with a degraded performance when implemented on real
parallel machines. To reduce the theory-to-practice disparity, several extensions of the
original PRAM model have been proposed such as delay PRAM (which models delays in
memory access times) and phase PRAM (which models synchronization delays).
In an interconnection network model, communication is only allowed between directly
connected processors; other communication has to be explicitly forwarded through
intermediate processors. Many algorithms have been developed which are perfectly matched
to the structure of a particular network. However, these elegant algorithms lack robustness, as
they usually do not map with equal efficiency onto interconnection networks different from
those for which they were designed. Most current networks allow messages to cut through (Exercise 7.5) intermediate processors without disturbing those processors; this is much faster than explicit forwarding.
Culler et al. argue that the above technological factors are forcing parallel computer
architectures to converge towards a system made up of a collection of essentially complete
computers (each consisting of a microprocessor, cache memory, and sizable DRAM memory)
connected by a communication network. This convergence is reflected in the LogP model.
The LogP model abstracts out the interconnection network in terms of a few parameters.
Hence the algorithms designed using the LogP model do not suffer a performance
degradation when ported from one network to another. The LogP model tries to achieve a
good compromise between having a few design parameters to make the task of designing
algorithms in this model easy, and trying to capture the parallel architecture without making
unrealistic assumptions.
The parameters of the LogP model to characterize a parallel computer are:
L: an upper bound on the latency incurred in communicating a message containing a word (or a small number of words) from its source processor to its destination processor.
o: the overhead, i.e., the length of time for which a processor is engaged in the transmission or reception of a message; during this time it cannot perform any other operation.
g: the gap, i.e., the minimum time interval between consecutive message transmissions or consecutive message receptions at a processor; the reciprocal of g corresponds to the available per-processor communication bandwidth.
P: the number of processor/memory modules.
The LogP model assumes the following about the processors and the underlying network:
(i) The processors execute asynchronously, (ii) The messages sent to a destination processor
may not arrive in the order that they were sent, and (iii) The network has a finite capacity
such that at most ⌈L/g⌉ messages can be in transit from any processor or to any processor at
any time.
Notice that, unlike the PRAM model, the LogP model takes into account the inter-
processor communication. It does not assume any particular interconnection topology but
captures the constraints of the underlying network through the parameters L, o, g. These three
parameters are measured as multiples of the processor cycle.
We now illustrate the LogP model with an example. Consider the problem of broadcasting
a single datum from one processor to the remaining P – 1 processors. Each processor that has
the datum transmits it as quickly as possible while ensuring that no processor receives more
than one message. The optimal broadcast tree is shown in Fig. 7.24 for the case of P = 8, L =
6, g = 4, o = 2. The number shown here for each processor is the time at which it has received
the datum and can begin sending it on. The broadcast operation starts at time t = 0 from
processor P1. The processor P1 incurs an overhead of o (= 2) units in transferring the datum
to processor P2. The message reaches P2 after L (= 6) units of time i.e., at time t = 8 units.
The processor P2 then spends o (= 2) units of time in receiving the datum. Thus at time t = L
+ 2o = 10 units the processor P2 has the datum. The processor P1 meanwhile initiates
transmissions to other processors at time g, 2g, ..., (assuming g ≥ o) each of which acts as the
root of a smaller broadcast tree. The processor P2 now initiates transmission to other nodes
independently. The last value is received at time 24 units. Note that this solution to the
broadcast problem on the LogP model is quite different from that on the PRAM model.
Figure 7.24 Optimal broadcast tree for P = 8, L = 6, g = 4, o = 2.
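The receive times in Fig. 7.24 can be reproduced with a small greedy simulation: whichever processor holding the datum can start the earliest transmission sends next; a processor that starts a send at time t can start its next send at t + g (since g ≥ o here), and the receiver holds the datum at t + o + L + o. A hedged C sketch with the parameter values of the example:

#include <stdio.h>

#define P 8
#define L 6
#define G 4      /* gap g */
#define O 2      /* overhead o */

int main(void) {
    int ready[P];      /* time at which holder i has the datum and can send */
    int next_send[P];  /* earliest time holder i can start its next send    */
    int have = 1;      /* number of processors currently holding the datum  */

    ready[0] = 0; next_send[0] = 0;
    while (have < P) {
        int best = 0;                       /* holder with the earliest send */
        for (int i = 1; i < have; i++)
            if (next_send[i] < next_send[best]) best = i;
        int t = next_send[best];
        next_send[best] = t + G;            /* sender must wait out the gap  */
        ready[have] = t + O + L + O;        /* receive time of the new holder */
        next_send[have] = ready[have];
        have++;
    }

    int last = 0;
    for (int i = 0; i < P; i++) {
        printf("holder %d has the datum at t = %d\n", i, ready[i]);
        if (ready[i] > last) last = ready[i];
    }
    printf("broadcast completes at t = %d\n", last);   /* 24 for these values */
    return 0;
}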
The LogP model only deals with short messages and does not adequately model parallel
computers with support for long messages. Many parallel computers have special support for
long messages which provide a much higher bandwidth than for short messages. A
refinement to the LogP model, called LogGP model, has been proposed which accurately
predicts communication performance for both short and long messages.
LogGP Model
Here the added parameter G, called gap per byte, is defined as the time per byte for a long
message. The reciprocal of G characterizes the available per processor communication
bandwidth for long messages.
The time to send a short message can be analyzed, in the LogGP model, as in the LogP
model. Sending a short message between two processors takes o + L + o cycles: o cycles in
the sending processor, L cycles for the communication latency, and finally, o cycles on the
receiving processor. Under the LogP model, sending a k-byte message from one processor to another requires sending ⌈k/w⌉ messages, where w is the underlying (short) message size of the parallel computer. This would take o + (⌈k/w⌉ – 1) * max{g, o} + L + o cycles. In contrast,
sending everything as a single large message takes o + (k – 1) G + L + o cycles when we use
the LogGP model (Note that g does not appear here since only a single message is sent).
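A small C helper capturing the two expressions (all times in processor cycles; the parameter values in main are purely illustrative):

#include <stdio.h>
#include <math.h>

/* Time to send k bytes as ceil(k/w) short messages (LogP view). */
static double t_logp(double k, double w, double L, double o, double g) {
    double msgs = ceil(k / w);
    return o + (msgs - 1) * fmax(g, o) + L + o;
}

/* Time to send the same k bytes as one long message (LogGP view). */
static double t_loggp(double k, double L, double o, double G) {
    return o + (k - 1) * G + L + o;
}

int main(void) {
    double L = 20, o = 4, g = 8, G = 0.1, w = 16, k = 1024;
    printf("LogP : %.1f cycles\n", t_logp(k, w, L, o, g));
    printf("LogGP: %.1f cycles\n", t_loggp(k, L, o, G));
    return 0;
}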
LogGPS Model
The LogGP model has been extended to the LogGPS model by including one additional
parameter ‘S’ in order to capture synchronization cost that is needed when sending long
messages by high-level communication libraries. The parameter ‘S’ is defined as the
threshold for message length above which synchronous messages are sent (i.e.,
synchronization occurs, before sending a message, between sender and receiver).
7.8 CONCLUSIONS
The primary ingredient in solving a computational problem on a parallel computer is the
parallel algorithm without which the problem cannot be solved in parallel. In this chapter we
first presented three models of parallel computation—(i) PRAM, (ii) interconnection
networks, and (iii) combinational circuits, which differ according to whether the processors
communicate among themselves through a shared memory or an interconnection network,
whether the interconnection network is in the form of an array, a tree, or a hypercube, and so
on. We then studied the design and analysis of parallel algorithms for the computational
problems: (i) prefix computation, (ii) sorting, (iii) searching, (iv) matrix operations on the
above models of parallel computation. Finally, we presented some recent models of parallel
computation, namely, BSP, LogP and their extensions which intend to serve as a basis for the
design and analysis of fast, portable parallel algorithms on a wide variety of current and
future parallel computers.
EXERCISES
7.1 CRCW PRAM model is the most powerful of the four PRAM models. It is possible to
emulate any algorithm designed for CRCW PRAM model on the other PRAM models.
Describe how one can emulate an algorithm of SUM CRCW PRAM model on (a)
CREW PRAM model and (b) EREW PRAM model. Also compute the increase in
running time of the algorithms on these two weaker models.
7.2 Consider an EREW PRAM model with n processors. Let m denote the size of the
common memory. Emulate this model on n-processor (a) ring, (b) square mesh and (c)
hypercube interconnection network models, where each processor has m/n memory
locations. Also compute the increase in running time of an n-processor EREW PRAM
algorithm when it is emulated on these three interconnection network models.
7.3 Design an efficient algorithm for finding the sum of n numbers on a CREW PRAM
model.
7.4 Develop an O(log n) time algorithm for performing a one-to-all broadcast
communication operation (where one processor sends an identical data element to all
the other (n – 1) processors) on a CREW PRAM model, where n is the number of
processors.
7.5 The time taken to communicate a message between two processors in a network is
called communication latency. Communication latency involves the following
parameters.
(a) Startup time (ts): This is the time required to handle a message at the sending
processor. This overhead is due to message preparation (adding header, trailer, and
error correction information), execution of the routing algorithm, etc. This delay is
incurred only once for a single message transfer.
(b) Per-hop time (th): This is the time taken by the header of a message to reach the
next processor in its path. (This time is directly related to the latency within the
routing switch for determining which output buffer or channel the message should
be forwarded to.)
(c) Per-word transfer time (tw): This is the time taken to transmit one word of a
message. If the bandwidth of the links is r words per second, then it takes tw = 1/r
seconds to traverse a link.
There are two kinds of routing techniques, namely, store-and-forward routing, and
cut-through routing. In store-and-forward routing, when a message is traversing a path
with several links, each intermediate processor on the path forwards the message to the
next processor after it has received and stored the entire message. Therefore, a message
of size m (words) takes
tcomm = ts + (m tw + th) l
time to traverse l links. Since, in general, th is quite small compared to mtw, we simplify
the above equation as
tcomm = ts + m tw l
In cut-through routing, an intermediate node waits for only parts (of equal size) of a
message to arrive before passing them on. It is easy to show that in cut-through routing
a message of size m (words) requires
tcomm = ts + l th + m tw
time to traverse l links. Note that, disregarding the startup time, the cost to send a
message of size m over l links (hops) is O(ml) and O(m + l) in store-and-forward
routing and cut-through routing, respectively.
Determine an optimal strategy for performing a one-to-all broadcast commu-nication
operation on (a) a ring network with p processors and (b) a mesh network with p1/2 ×
p1/2 processors using (i) store-and-forward routing and (ii) cut-through routing.
7.6 Repeat Exercise 7.5 for performing an all-to-all broadcast operation. (This is a
generalization of one-to-all broadcast in which all the processors simultaneously initiate
a broadcast.)
7.7 Using the idea of parallel prefix computation develop a PRAM algorithm for solving a
first-order linear recurrence:
xj = aj xj–1 + dj,
for j = 1, 2, ... , n where the values x0, a1, a2, …, an and d1, d2, …, dn are given. For
simplicity, and without loss of generality, you can assume that x0 = a1 = 0.
7.8 Design and analyze a combinational circuit for computing the prefix sums over the input
{x0, x1, …, xn–1}.
7.9 Consider a linked list L. The rank of each node in L is the distance of the node from the
end of the list. Formally, if succ(i) = nil, then rank(i) = 0; otherwise rank(i) =
rank(succ(i)) + 1. Develop a parallel algorithm for any suitable PRAM model to
compute the rank of each node in L.
7.10 Prove formally the correctness of the odd-even transposition sort algorithm.
7.11 Design a bitonic sorting algorithm to sort n elements on a hypercube with fewer than n
processors.
7.12 Extend the searching algorithm for a tree network, presented in this chapter, when the
number of elements, n, is greater than the number of leaf nodes in the tree.
7.13 The transpose of the matrix A = [aij]m×n, denoted by AT, is given by the matrix [aji]n×m.
Develop an algorithm for finding the transpose of a matrix on (a) EREW PRAM
machine and (b) hypercube network.
7.14 Design an algorithm for multiplying two n × n matrices on a mesh with fewer than n2
processors.
7.15 Describe in detail a parallel implementation of the Bidirectional Gaussian Elimination
algorithm on (a) a PRAM machine and (b) a mesh network. Also derive expressions for
the running time of the parallel implementations.
7.16 A Mesh Connected Computer with Multiple Broadcasting (MCCMB) is shown in Fig.
7.25. It is a mesh connected computer in which the processors in the same row or
column are connected to a bus in addition to the local links. On an MCCMB there are
two types of data transfer instructions executed by each processor: routing data to one
of its nearest neighbours via a local link and broadcasting data to other processors in the
same row (or column) via the bus on this row (or column). At any time only one type of
data routing instruction can be executed by processors. Further, only one processor is
allowed to broadcast data on each row bus and each column bus at a time. Note that
other than the multiple broadcast feature the control organization of an MCCMB is the
same as that of the well-known mesh connected computer. Design and analyze
algorithms for (a) finding the maximum of n elements and (b) sorting n elements on an
MCCMB. Are your algorithms cost-optimal? Assume that each broadcast along a bus
takes O(1) time.
In the previous chapter, we have presented parallel algorithms for many problems. To run
these algorithms on a parallel computer, we need to implement them in a programming
language. Exploiting the full potential of a parallel computer requires a cooperative effort
between the user and the language system. There is clearly a trade-off between the amount of
information the user has to provide and the amount of effort the compiler has to expend to
generate efficient parallel code. At one end of the spectrum are languages where the user has
full control and has to explicitly provide all the details while the compiler effort is minimal.
This approach, called explicit parallel programming, requires a parallel algorithm to
explicitly specify how the processors will cooperate in order to solve a specific problem. At
the other end of the spectrum are sequential languages where the compiler has full
responsibility for extracting the parallelism in the program. This approach, called implicit
parallel programming, is easier for the user because it places the burden of parallelization on
the compiler. Clearly, there are advantages and disadvantages to both, explicit and implicit
parallel programming approaches. Many current parallel programming languages for parallel
computers are essentially sequential languages augmented by library functions calls, compiler
directives or special language constructs. In this chapter, we study four different (explicit)
parallel programming models (a programming model is an abstraction of a computer system
that a user sees and uses when developing a parallel program), namely, message passing
programming (using MPI), shared memory programming (using OpenMP), heterogeneous
programming (using CUDA and OpenCL) and MapReduce programming (for large scale
data processing). Implicit parallel programming approach, i.e., automatic parallelization of a
sequential program, will be discussed in the next chapter.
8.1 MESSAGE PASSING PROGRAMMING
In message passing programming, programmers view their programs (applications) as a
collection of cooperating processes with private (local) variables. The only way for an
application to share data among processors is for the programmer to explicitly code
commands to move data from one processor to another. The message passing programming
style is naturally suited to message passing parallel computers in which some memory is local
to each processor but none is globally accessible.
Message passing parallel computers need only two primitives (commands) in addition to
normal sequential language primitives. These are Send and Receive primitives. The Send
primitive is used to send data/instruction from one processor to another. There are two forms
of Send called Blocked Send and Send. Blocked Send sends a message to another processor
and waits until the receiver has received it before continuing the processing. This is also
called synchronous send. Send sends a message and continues processing without waiting.
This is known as asynchronous send. The arguments of Blocked Send and Send are:
Blocked Send (destination processor address, variable name, size)
Send (destination processor address, variable name, size)
The argument “destination processor address” specifies the processor to which the
message is destined, the “variable name” specifies the starting address in memory of the
sender from where data/instruction is to be sent and “size” the number of bytes to be sent.
Similarly Receive instructions are:
Blocked Receive (source processor address, variable name, size)
Receive (source processor address, variable name, size)
The “source processor address” is the processor from which message is expected, the
“variable name” the address in memory of the receiver where the first received data item will
be stored and “size” the number of bytes of data to be stored.
If data is to be sent in a particular order or when it is necessary to be sure that the message
has reached its destination, Blocked Send is used. If a program requires specific piece of data
from another processor before it can continue processing, Blocked Receive is used. Send and
Receive are used if ordering is not important and the system hardware reliably delivers
messages. If a message passing system’s hardware and operating system guarantee that
messages sent by Send are delivered correctly in the right order to the corresponding Receive
then there will be no need for a Blocked Send and Blocked Receive in that system. Such a
system is called an asynchronous message passing system.
We illustrate programming a message passing parallel computer with a simple example.
The problem is to add all the components of an array A[i], i = 1, …, N, where N is the size of
the array.
We begin by assuming that there are only two processors. If N is even then we can assign
adding A[i], …, A[N/2] to one processor and adding A[N/2 + 1], …, A[N] to the other. The
procedures for the two processors are shown below.
Processor0
Read A[i] for i = 1, … , N
Send to Processor1 A[i] for i = (N/2 + 1), … , N
Find Sum0 = Sum of A[i] for i = 1, … , N/2
Receive from Processor1, Sum1 computed by it
Sum = Sum0 + Sum1
Write Sum
Processor1
Receive from Processor0 A[i] for i = (N/2 + 1), … , N
Find Sum1 = Sum of A[i] for i = (N/2 + 1), … , N
Send Sum1 to Processor0
If we use the notation Send (1, A[N/2 + 1], N/2), it will be equivalent to: Send to
Processor1, N/2 elements of the array A starting from (N/2 + 1)th element.
Similarly the notation Receive (0, B[1], N/2) will be equivalent to saying receive from
Processor0, N/2 values and store them starting at B[1]. Using this notation, we can write
programs for Processor0 and Processor1 for our example problem using a Pascal-like
language as shown in Program 8.1.
Program 8.1 Two-processor addition of array
Program for Processor0
We can extend the same Program to p processors as shown in Program 8.2. One point to
be observed is that when the array size N is not evenly divisible by p, the integer quotient N
div p is allocated to Processor0, Processor1, … up to Processor(p – 2) and (N div p + N mod
p) is allocated to Processor(p – 1). In the worst case Processor(p – 1) is allocated (N div p + p
– 1) array elements. If N >> p then N div p >> p and the extra load on Processor(p – 1)
compared to the load on the other processors is not significant.
Program 8.2 Adding array using p processors
Program for Processor0
Program for Processork, k = 1, ..., (p–2)
We present below an SPMD message passing program for parallel Gaussian elimination
that we described in the previous chapter.
Program 8.4 SPMD program for Gaussian elimination
Gaussian Elimination (INOUT: A[l:N,l:N], INPUT: B[l:N], OUTPUT: Y[l:N]);
8.2 MESSAGE PASSING PROGRAMMING WITH MPI
In this section, we show how parallel processing can be achieved using a public domain
message passing system, namely MPI. It should be noted that the programmer is responsible
for correctly identifying parallelism and implementing parallel algorithms using MPI
constructs.
In the above program, the routine MPI_Pack_size determines the size of the buffer
required to store 50 MPI_DOUBLE data items. The buffer size is next used to dynamically
allocate memory for TempBuffer. The MPI_Pack routine is used to pack the even numbered
elements into the TempBuffer in a for loop. The inputs to MPI_Pack are the address of an
array element to be packed, the number (1) and datatype (MPI_DOUBLE) of the items,
output buffer (TempBuffer) and the size of the output buffer (BufferSize) in bytes, the current
Position (to keep track of how many items have been packed) and the communicator. The
final value of Position is used as the message count in the following MPI_Send routine. Note
that all packed messages in MPI use the datatype MPI_PACKED.
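A hedged C sketch of the packing step just described (the array name A, the destination rank and the message tag are illustrative; the surrounding MPI_Init/MPI_Finalize code is omitted):

#include <mpi.h>
#include <stdlib.h>

/* Pack the 50 even numbered elements of A[100] and send them to process 1. */
void send_even_elements(double A[100]) {
    int BufferSize, Position = 0;
    void *TempBuffer;

    MPI_Pack_size(50, MPI_DOUBLE, MPI_COMM_WORLD, &BufferSize);
    TempBuffer = malloc(BufferSize);

    for (int i = 0; i < 100; i += 2)       /* even numbered elements (0-based here) */
        MPI_Pack(&A[i], 1, MPI_DOUBLE, TempBuffer, BufferSize,
                 &Position, MPI_COMM_WORLD);

    /* the final value of Position is the message count */
    MPI_Send(TempBuffer, Position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    free(TempBuffer);
}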
Derived Datatypes
In order to send a message with non-contiguous data items of possibly mixed datatypes
wherein a message could consist of multiple data items of different datatypes, MPI introduces
the concept of derived datatypes. MPI provides a set of routines in order to build complex
datatypes. This is illustrated in the example below.
In order to send the even numbered elements of array A, a derived datatype EvenElements
is constructed using the routine MPI_Type_vector whose format is as follows:
MPI_Type_vector(count, blocklength, stride, oldtype, &newtype)
The derived datatype newtype consists of count copies of blocks, each of which in turn
consists of blocklength copies of consecutive items of the existing datatype oldtype. stride specifies the number of oldtype elements between the starting points of two consecutive blocks. Thus stride – blocklength is the gap between two blocks.
Thus the MPI_Type_vector routine in the above example creates a derived datatype EvenElements which consists of 50 blocks, each of which in turn consists of one double precision number. The starting points of consecutive blocks are two double precision numbers apart, i.e., there is a gap of one element between blocks. Thus only the even numbered elements are considered, skipping the odd numbered elements using the
gap. The MPI_Type_commit routine is used to commit the derived type before it is used by
the MPI_Send routine. When the datatype is no longer required, it should be freed with
MPI_Type_free routine.
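A hedged C sketch of this derived-datatype version (array name, destination rank and tag are again illustrative):

#include <mpi.h>

/* Send the 50 even numbered elements of A[100] using a derived datatype. */
void send_even_elements_vector(double A[100]) {
    MPI_Datatype EvenElements;

    /* 50 blocks, 1 double per block, block starting points 2 doubles apart */
    MPI_Type_vector(50, 1, 2, MPI_DOUBLE, &EvenElements);
    MPI_Type_commit(&EvenElements);

    MPI_Send(A, 1, EvenElements, 1, 0, MPI_COMM_WORLD);

    MPI_Type_free(&EvenElements);
}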
Message Envelope
As mentioned earlier, MPI uses three parameters to specify the message envelope, the
destination process id, the tag, and the communicator. The destination process id specifies the
process to which the message is to be sent. The tag argument allows the programmer to deal
with the arrival of messages in an orderly way, even if the arrival of messages is not in the
order desired. We now study the roles of tag and communicator arguments.
Message Tags
A message tag is an integer used by the programmer to label different types of messages and
to restrict message reception. Suppose two send routines and two receive routines are
executed in sequence in node A and node B, respectively, as shown below without the use of
tags.
Here, the intent is to first send 64 bytes of data which is to be stored in P in process Y and
then send 32 bytes of data which is to be stored in Q. If N, although sent later, reaches earlier
than M at process Y, then there would be an error. In order to avoid this error tags are used as
shown below:
If N reaches earlier at process Y it would be buffered till the recv() routine with tag2 as tag
is executed. Thus, message M with tag 1 is guaranteed to be stored in P in process Y as
desired.
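In MPI notation the tagged version described above might look like the following sketch (buffer names, sizes and ranks are illustrative):

#include <mpi.h>

void node_A(void) {                       /* sender */
    char M[64], N[32];
    /* ... fill M and N ... */
    MPI_Send(M, 64, MPI_CHAR, 1, 1, MPI_COMM_WORLD);   /* tag 1: to be stored in P */
    MPI_Send(N, 32, MPI_CHAR, 1, 2, MPI_COMM_WORLD);   /* tag 2: to be stored in Q */
}

void node_B(void) {                       /* receiver, process Y */
    char P[64], Q[32];
    MPI_Status status;
    MPI_Recv(P, 64, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);  /* matches only tag 1 */
    MPI_Recv(Q, 32, MPI_CHAR, 0, 2, MPI_COMM_WORLD, &status);  /* matches only tag 2 */
}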
Message Communicators
A major problem with tags (integer values) is that they are specified by users who can make
mistakes. Further, difficulties may arise in the case of libraries, written far from the user in
time and space, whose messages must not be accidentally received by the user program.
MPI’s solution is to extend the notion of tag with a new concept—the context. Contexts are
allocated at run time by the system in response to user (and library) requests and are used for
matching messages. They differ from tags in that they are allocated by the system instead of
the user.
As mentioned earlier, processes in MPI belong to groups. If a group contains n processes,
its processes are identified within a group by ranks, which are integers from 0 to n–1.
The notions of context and group are combined in a single object called a communicator,
which becomes an argument in most point-to-point and collective communications. Thus the
destination process id or source process id in the MPI_Send and MPI_Recv routines always
refers to the rank of the process in the group identified within the given communicator.
Consider the following example:
Process 0:
MPI_Send(msg1, count1, MPI_INT, 1, tag1, comm1);
parallel_matrix_inverse();
Process 1:
MPI_Recv(msg1, count1, MPI_INT, 0, tag1, comm1);
parallel_matrix_inverse();
The intent here is that process 0 will send msg1 to process 1, and then both execute a routine parallel_matrix_inverse(). Suppose the parallel_matrix_inverse() code contains an MPI_Send(msg2, count1, MPI_INT, 1, tag2, comm2) routine. If there were no
communicators, the MPI_Recv in process 1 could mistakenly receive the msg2 sent by the
MPI_Send in parallel_matrix_inverse(), when tag2 happens to be the same as tag1. This
problem is solved using communicators as each communicator has a distinct system-assigned
context. The communicator concept thus facilitates the development of parallel libraries,
among other things.
MPI has a predefined set of communicators. For instance, MPI_COMM_WORLD
contains the set of all processes which is defined after MPI_Init routine is executed.
MPI_COMM_SELF contains only the process which uses it. MPI also facilitates user defined
communicators.
status, the last parameter of MPI_Recv, returns information (source, tag, error code) on the
data that was actually received. If the MPI_Recv routine detects an error (for example, if
there is insufficient storage at the receiving process, an overflow error occurs), it returns an
error code (integer value) indicating the nature of the error.
Most MPI users only need to use routines for (point-to-point or collective) communication
within a group (called intra-communicators in MPI). MPI also allows group-to-group
communication (inter-group) through inter-communicators.
Point-to-Point Communications
MPI provides both blocking and non-blocking send and receive operations; the completion of the non-blocking versions can be tested for and waited for explicitly. Recall that a blocking
send/receive operation does not return until the message buffer can be safely written/read
while a non-blocking send/receive can return immediately. MPI also has multiple
communication modes. The standard mode corresponds to current common practice in
message passing systems. (Here the send can be either synchronous or buffered depending on
the implementation.) The synchronous mode requires a send to block until the corresponding
receive has occurred (as opposed to the standard mode blocking send which blocks only until
the message is buffered). In buffered mode a send assumes the availability of a certain
amount of buffer space, which must be previously specified by the user program through a
routine call MPI_Buffer_attach(buffer,size) that allocates a user buffer. This buffer can be
released by MPI_Buffer_detach(*buffer,*size). The ready mode (for a send) is a way for the
programmer to notify the system that the corresponding receive has already started, so that
the underlying system can use a faster protocol if it is available. Note that the send here does
not have to wait as in the synchronous mode. These modes together with blocking/non-
blocking give rise to eight send routines as shown in Table 8.1. There are only two receive
routines. MPI_Test may be used to test for the completion of a receive/send started with
MPI_Isend/MPI_Irecv, and MPI_Wait may be used to wait for the completion of such a
receive/send.
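A minimal sketch of the non-blocking routines mentioned above (MPI_Isend, MPI_Irecv, MPI_Wait); the comment marks where independent computation can overlap the communication:

#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int n, int partner) {
    MPI_Request sreq, rreq;
    MPI_Status  status;

    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &rreq);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &sreq);

    /* ... computation that does not touch the buffers can overlap here ... */

    MPI_Wait(&sreq, &status);   /* or MPI_Test(&sreq, &flag, &status) to poll */
    MPI_Wait(&rreq, &status);
}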
At this point, the reader is urged to go through Program 8.5 again, especially the MPI
routines and their arguments in this program.
TABLE 8.1 Different Send/Receive Operations in MPI
Collective Communications
When all the processes in a group (communicator) participate in a global communication
operation, the resulting communication is called a collective communication. MPI provides a
number of collective communication routines. We describe below some of them assuming a
communicator Comm contains n processes:
In the above program, there are numprocs number of processes in the default communicator,
MPI_COMM_WORLD. The process with rank (myid) 0 reads the value of the number of
intervals (n) and uses MPI_Bcast routine to broadcast it to all the other processes. Now each
process computes its contribution (partial sum) mypi. Then all these partial sums are added
up by using the MPI_Reduce routine with operator MPI_SUM and the final result is thus
stored in the process ranked 0. Process 0 prints the computed value of π.
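A hedged C/MPI sketch with the same structure as the program described above (numerical integration of 4/(1 + x*x) over [0, 1]; the interval count and output format are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int numprocs, myid, n, i;
    double h, x, mypi, pi, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) n = 100000;                    /* process 0 chooses the number of intervals */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD); /* broadcast it to all processes */

    h = 1.0 / n;
    for (i = myid + 1; i <= n; i += numprocs) {   /* every numprocs-th interval */
        x = h * (i - 0.5);
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = h * sum;                               /* this process's partial sum */

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}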
Virtual Topologies
The description of how the processors in a parallel computer are interconnected is called the
topology of the computer (or more precisely, of the interconnection network). In most
parallel programs, each process communicates with only a few other processes and the
pattern of communication in these processes is called an application topology (or process
graph or task graph). The performance of a program depends on the way the application
topology is fitted onto the physical topology of the parallel computer (see Exercises 8.14 and
8.15). MPI supports the mapping of application topologies onto virtual topologies, and of virtual topologies onto physical hardware topologies.
MPI allows the user to define a particular application, or virtual topology. Communication
among processes then takes place within this topology with the hope that the underlying
physical network topology will correspond and expedite the message transfers. An important
virtual topology is the Cartesian or (one- or multi-dimensional) mesh topology as the solution
of several computation-intensive problems results in Cartesian topology. MPI provides a
collection of routines for defining, examining, and manipulating Cartesian topologies.
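For instance, a two-dimensional periodic Cartesian (mesh/torus) topology can be set up roughly as follows (the 4 × 4 grid dimensions are illustrative):

#include <mpi.h>

void make_2d_grid(MPI_Comm *grid_comm, int *myrow, int *mycol) {
    int dims[2]    = {4, 4};     /* 4 x 4 process grid */
    int periods[2] = {1, 1};     /* wrap around in both dimensions */
    int coords[2], myrank;

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1 /* reorder */, grid_comm);
    MPI_Comm_rank(*grid_comm, &myrank);
    MPI_Cart_coords(*grid_comm, myrank, 2, coords);  /* my position in the grid */
    *myrow = coords[0];
    *mycol = coords[1];
}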
Process X when it encounters fork Y invokes another Process Y. After invoking Process
Y, it continues doing its work. The invoked Process Y starts executing concurrently in
another processor. When Process X reaches join Y statement, it waits till Process Y
terminates. If Y terminates earlier, then X does not have to wait and will continue after
executing join Y. When multiple processes work concurrently and update data stored in a
common memory shared by them, special care should be taken to ensure that a shared
variable value is not initialized or updated independently and simultaneously by these
processes. The following example illustrates this problem. Assume that Sum : = Sum + f(A) +
f(B) is to be computed and the following program is written.
Suppose Process A loads Sum in its register to add f(A) to it. Before the result is stored
back in main memory by Process A, Process B also loads Sum in its local register to add f(B)
to it. Process A will have Sum + f(A) and Process B will have Sum + f(B) in their respective
local registers. Now both Process A and B will store the result back in Sum. Depending on
which process stores Sum last, the value in the main memory will be either Sum + f(A) or
Sum + f(B) whereas what was intended was to store (Sum+ f(A) + f(B)) as the result in the
main memory. If Process A stores Sum + f(A) first in Sum and then Process B takes this
result and adds f(B) to it then the answer will be correct. Thus we have to ensure that only
one process updates a shared variable at a time. This is done by using a statement called lock
<variable name>. If a process locks a variable name no other process can access it till it is
unlocked by the process which locked it. This method ensures that only one process is able to
update a variable at a time. The above program can thus be rewritten as:
In the above case whichever process reaches lock Sum statement first will lock the
variable Sum disallowing any other process from accessing it. It will unlock Sum after
updating it. Any other process wanting to update Sum will now have access to it. The process
will lock Sum and then update it. Thus updating a shared variable is serialized. This ensures
that the correct value is stored in the shared variable.
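The same serialization, expressed with POSIX threads rather than the lock/unlock notation used above (a sketch; f is an arbitrary function standing in for f(A) and f(B)):

#include <pthread.h>
#include <stdio.h>

static double Sum = 0.0;
static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

static double f(double v) { return v * v; }      /* any function of the input */

static void *add_f(void *arg) {
    double local = f(*(double *)arg);   /* compute outside the critical section */
    pthread_mutex_lock(&sum_lock);      /* "lock Sum"   */
    Sum = Sum + local;
    pthread_mutex_unlock(&sum_lock);    /* "unlock Sum" */
    return NULL;
}

int main(void) {
    double A = 2.0, B = 3.0;
    pthread_t ta, tb;
    pthread_create(&ta, NULL, add_f, &A);
    pthread_create(&tb, NULL, add_f, &B);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    printf("Sum = %g\n", Sum);          /* always 13, regardless of ordering */
    return 0;
}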
We illustrate the programming style appropriate for shared memory parallel computer
using as an example adding the components of an array. Program 8.7 has been written
assuming that the parallel computer has 2 processors. Observe the use of fork, join, lock, and
unlock. In this program the statement:
fork(l) Add_array(input: A[N/2 + l:N], output: Sum)
states that the Procedure Add_array is to be executed on Processor1 with the specified input
parameters. Following this statement the procedure Add_array is invoked again and this
statement is executed in Processor0 (the main processor in the system).
Program 8.7 Adding an array with 2-processor shared memory computer
{Main program for the 2-processor computer [Processor0 (main), Processor1 (slave)]}
The above program may be extended easily to p processors as shown in Program 8.8.
Program 8.8 Adding an array with p-processor shared memory computer
{Procedure for main Processor, Processor0}
Single Program Multiple Data Programming—Shared Memory Model
The SPMD programming style is not limited to message passing parallel computers. Program
8.9 shows how SPMD could be used on a shared memory parallel computer. Here
synchronization is achieved by using the barrier primitive. Each process waits at the barrier
for every other process. (After synchronization, all processes proceed with their execution.)
Program 8.9 SPMD program for addition of array
We present below an SPMD shared memory program (Program 8.10) for the parallel
Gaussian elimination.
Program 8.10 SPMD program for Gaussian elimination
Gaussian Elimination (INOUT: A[l:N,l:N], INPUT: B[l:N], OUTPUT:Y[l:N]);
8.4 SHARED MEMORY PROGRAMMING WITH OpenMP
In this section, we present a portable standard, namely, OpenMP for programming shared
memory parallel computers. Like MPI, OpenMP is also an explicit (not automatic)
programming model, providing programmers full control over parallelization. A key
difference between MPI and OpenMP is the approach to exploiting parallelism in an
application or in an existing sequential program. MPI requires a programmer to parallelize
(convert) the entire application immediately. OpenMP, on the other hand, allows the
programmer to incrementally parallelize the application without major restructuring.
OpenMP (Open specifications for Multi-Processing) is an API (Application Programmer
Interface) for programming in FORTRAN and C/C++ on shared memory parallel computers.
The API comprises three distinct components, namely compiler directives, runtime library
routines, and environment variables, for programming shared memory parallel computers,
and is supported by a variety of shared memory architectures and operating systems.
Programmers or application developers decide how to use these components.
The OpenMP standard designed in 1997 continues to evolve and the OpenMP
Architecture Review Board (ARB) that owns the OpenMP API continues to add new features
and/or constructs to OpenMP in each version. Currently most compilers support OpenMP
Version 2.5 released in May 2005. The latest OpenMP Version 4.0 was released in July 2013.
The OpenMP documents are available at http://www.openmp.org.
8.4.1 OpenMP
OpenMP uses the fork-join, multi-threading, shared address model of parallel execution and
the central entity in an OpenMP program is a thread and not a process. An OpenMP program
typically creates multiple cooperating threads which run on multiple processors or cores. The
execution of the program begins with a single thread called the master thread which executes
the program sequentially until it encounters the first parallel construct. At the parallel
construct the master thread creates a team of threads consisting of a certain number of new threads, called slaves, and the master itself. Note that the fork operation is implicit. The
program code inside the parallel construct is called a parallel region and is executed in
parallel typically using SPMD mode by the team of threads. At the end of the parallel region
there is an implicit barrier synchronization (join operation) and only the master thread
continues to execute further. Threads communicate by sharing variables. OpenMP offers
several synchronization constructs for the coordination of threads within a parallel region.
Figure 8.2 shows the OpenMP execution model. Master thread spawns a team of threads as
required. Notice that as parallelism is added incrementally until the performance goals are
realized, the sequential program evolves into a parallel program.
Figure 8.2 OpenMP execution model.
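A minimal OpenMP sketch consistent with the output shown below (three threads, seven loop iterations; with the usual static default assignment thread 0 gets iterations 0-2, thread 1 gets 3-5 and thread 2 gets 6, although the interleaving of the printed lines can vary from run to run):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    #pragma omp parallel for num_threads(3)
    for (i = 0; i < 7; i++)
        printf("ThreadID %d is executing loop iteration %d\n",
               omp_get_thread_num(), i);
    return 0;
}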
Output:
ThreadID 0 is executing loop iteration 0
ThreadID 0 is executing loop iteration 1
ThreadID 0 is executing loop iteration 2
ThreadID 2 is executing loop iteration 6
ThreadID 1 is executing loop iteration 3
ThreadID 1 is executing loop iteration 4
ThreadID 1 is executing loop iteration 5
Note that the scope of the for loop index variable i in the parallel region is private by
default. We observe from the output that out of the three threads, thread numbers 0 and 1 are
working on three iterations each and thread number 2 is working on only one iteration. We
will shortly see how to distribute the work of the parallel region among the threads of the
team.
It is possible to incrementally parallelize a sequential program that has many for loops by
successively using the OpenMP parallel for directive for each of the loops. However, this
directive has the following restrictions: (i) the total number of iterations is known in advance
and (ii) the iterations of the for loop must be independent of each other. This implies that the
following for loop cannot be correctly parallelized by OpenMP parallel for directive.
for (i = 2; i < 7; i++)
{
a[i] = a[i] + c; /* statement 1 */
b[i] = a[i-1] + b[i]; /* statement 2 */
}
Here, when i = 3, statement 2 reads the value of a[2] written by statement 1 in the previous
iteration. This is called a loop-carried dependence. Note that dependences among the statements within a single iteration of the loop are acceptable, but dependences across iterations are not, because OpenMP compilers do not check for dependences among iterations when the parallel for directive is used. It is the responsibility of the programmers to identify loop-carried dependences (by
looking for variables that are read/written in one iteration and written in another iteration) in
for loop and eliminate them by restructuring the loop body. If it is impossible to eliminate
them, then the loop is not amenable to parallelization. OpenMP does not parallelize while
loops and do-while loops and it is the responsibility of the programmers to convert them into
equivalent for loops for possible parallelization.
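For the particular loop shown above, the dependence can be removed by loop fission: first update all of a, then compute b from the already-updated values. Each resulting loop has no loop-carried dependence and can be parallelized (a sketch; a, b, c and i are the variables of the fragment above):

#pragma omp parallel for
for (i = 2; i < 7; i++)
    a[i] = a[i] + c;             /* statement 1, now a loop of its own */

#pragma omp parallel for
for (i = 2; i < 7; i++)
    b[i] = a[i-1] + b[i];        /* statement 2: a[] is only read here */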
OpenMP for directive, unlike the parallel for directive, does not create a team of threads.
It partitions the iterations in for loop among the threads in an existing team. The following
two code fragments are equivalent:
Loop Scheduling
OpenMP supports different scheduling strategies, specified by the schedule parameters, for
distribution or mapping of loop iterations to threads. It should be noted that a good mapping
of iterations to threads can have a very significant effect on the performance. The simplest
possible is static scheduling, specified by schedule(static, chunk_size), which assigns blocks
of size chunk_size in a round-robin fashion to the threads. See Program 8.13. The iterations
are assigned as: <Thread 0: 0,1,2,9>, <Thread 1: 3,4,5>, <Thread 2: 6,7,8>.
If the amount of work (computation) per loop iteration is not constant (for example,
Thread 0 gets vastly different workload), static scheduling leads to load imbalance. A
possible solution to this problem is to use dynamic scheduling, specified by
schedule(dynamic, chunk_size), which assigns a new block of size chunk_size to a thread as
soon as the thread has completed the computation of the previously assigned block. However,
there is an overhead (bookkeeping in order to know which thread is to compute the next
block/chunk) associated with dynamic scheduling. Guided scheduling, specified by
schedule(guided, chunk_size), is similar to dynamic scheduling but the chunk_size is relative
to the number of iterations left. For example, if there are 100 iterations and 2 threads, the
chunk sizes could be 50,25,12,6,3,2,1,1. The chunk_size is approximately the number of
iterations left divided by the number of threads. Another scheduling strategy supported by the
OpenMP is schedule(auto) where the mapping decision is delegated to the compiler and/or
runtime system.
Program 8.13 An OpenMP program with schedule clause
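A minimal sketch consistent with the schedule(static, 3) clause and the output below (10 iterations, three threads; the interleaving of the printed lines may vary):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i;
    #pragma omp parallel for num_threads(3) schedule(static, 3)
    for (i = 0; i < 10; i++)
        printf("ThreadID: %d, iteration: %d\n", omp_get_thread_num(), i);
    return 0;
}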
Output:
ThreadID: 0, iteration: 0
ThreadID: 0, iteration: 1
ThreadID: 0, iteration: 2
ThreadID: 0, iteration: 9
ThreadID: 2, iteration: 6
ThreadID: 2, iteration: 7
ThreadID: 2, iteration: 8
ThreadID: 1, iteration: 3
ThreadID: 1, iteration: 4
ThreadID: 1, iteration: 5
Non-Iterative Constructs
OpenMP for construct shares iterations of a loop across the team representing “data
parallelism”. OpenMP sections construct, which breaks work into separate independent
sections that can be executed in parallel by different threads, can be used to implement “task
or functional parallelism”.
As stated above structured blocks are independent of each other and are executed in
parallel by different threads. There is an implicit synchronization at the end of the construct.
If there are more sections than threads, it is up to the implementation to decide how the extra
sections are executed.
Synchronization
The OpenMP offers several high-level (such as critical, atomic and barrier) and low-level
(such as locks and flush) synchronization constructs to protect critical sections or avoid race
conditions as multiple threads in a parallel region access the same shared data.
1. #pragma omp critical
{ structured block }
The critical construct guarantees that the structured block (critical section) is executed
by only one thread at a time.
2. #pragma omp atomic
statement_expression
The atomic construct ensures that a single line of code (only one statement, for example
x = x + 1) is executed by only one thread at a time. While the critical construct protects
all the elements of an array variable, atomic construct can be used to enforce exclusive
access to one element of the array.
3. #pragma omp barrier
The barrier construct is used to synchronize the threads at a specific point of execution.
That is every thread waits until all the other threads arrive at the barrier.
4. nowait, as in ‘#pragma omp for nowait’, eliminates the implicit barrier synchronization that occurs by default at the end of a work-sharing construct such as for or sections.
5. OpenMP provides several lock library functions for ensuring mutual exclusion between
threads in critical sections. For example, the functions omp_init_lock(),
omp_set_lock(), omp_unset_lock(), initialize a lock variable, set the lock, unset or
release the lock respectively.
6. #pragma omp flush [(list)]
Different threads can have different values for the same shared variable (for example,
value in register). To produce a consistent view of memory, flush construct updates
(writes back to memory at this synchronization point) all variables given in the list so
that the other threads see the most recent values. If no list is specified, then all thread
visible variables are updated. OpenMP uses a relaxed memory consistency model.
7. reduction(operator: list) reduces a list of variables into one, using operator. It provides a
neat way to combine private copies of a variable into a single result at the end of a
parallel region. Note that (i) without changing any code and by adding one simple line,
the sequential code fragment shown below gets parallelized, (ii) with reduction clause,
a private copy for the variable sum is created for each thread, initialized and at the end
all the private copies are reduced into one using the addition operator, and (iii) no
synchronization construct such as critical construct is required to combine the private
copies of the variable sum.
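A hedged version of the kind of fragment being described, before and after adding the reduction clause (the array a and its size n are illustrative):

/* sequential */
sum = 0.0;
for (i = 0; i < n; i++)
    sum = sum + a[i];

/* parallelized by adding one line: each thread gets a private copy of sum,
   and the copies are combined with + at the end of the loop */
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++)
    sum = sum + a[i];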
In Section 8.2, we provided an MPI parallel program (Program 8.6) to compute the value
of π. Sequential and OpenMP π programs are given below (Program 8.14 and Program 8.15).
Observe that without making any changes to the sequential program and by just inserting two
lines into it, the sequential program is converted to an OpenMP parallel program.
Program 8.14 A sequential program to compute π
Program 8.15 An OpenMP program to compute π
Environment Variables
OpenMP provides several environment variables that can be used to control the parallel
execution environment. We list a few of them here.
(i) OMP_SCHEDULE: It alters the behaviour of the schedule clause (mapping of iterations
to threads) when schedule(runtime) clause is specified in for or parallel for directive.
(ii) OMP_NUM_THREADS: It sets the maximum number of threads to use in a parallel
region during execution.
(iii) OMP_DYNAMIC: It specifies whether the runtime can adjust the number of threads in
a parallel region.
(iv) OMP_STACKSIZE: It controls the size of the stack for slaves (non-master threads).
In this section, we presented some of the most important directives, clauses and functions
of the OpenMP standard for exploiting thread-level parallelism. We also presented how to
start multiple threads, how to parallelize for loops and schedule them, and also how to
coordinate and synchronize threads to protect critical sections. Another important directive,
added in the Version 3.0 of OpenMP, is the task directive which can be used to parallelize
recursive functions and while loops. Also, the collapse clause added in this version enables parallelization of nested loops. The most recent OpenMP Version 4.0, released in July 2013, has SIMD constructs to support SIMD or vector-level parallelism to exploit the full potential of today’s multicore architectures.
8.5 HETEROGENEOUS PROGRAMMING WITH CUDA AND OpenCL
8.5.1 CUDA (Compute Unified Device Architecture)
CUDA, a parallel programming model and software environment developed by NVIDIA,
enables programmers to write scalable parallel programs using a straightforward extension of
C/C++ and FORTRAN languages for NVIDIA’s GPUs. It follows the data-parallel model of
computation and eliminates the need of using graphics APIs, such as OpenGL and DirectX,
for computing applications. Thus CUDA enables programmers to harness the processing power of GPUs for general purpose processing (not exclusively graphics) on GPUs (GPGPU), without mastering graphics terminology and without going into the details of transforming mathematical computations into equivalent pixel manipulations.
As mentioned in Chapter 5, a GPU consists of an array of Streaming Multi-processors
(SMs). Each SM further consists of a number of Streaming Processors (SPs). The threads of a
parallel program run on different SPs with each SP having a local memory associated with it.
Also, all the SPs in an SM communicate via a common shared memory provided by the SM.
Different SMs access and share data among themselves by means of the GPU DRAM (distinct from the CPU DRAM), which functions as the GPU main memory and is called the global memory. Note
that CPU (having a smaller number of very powerful cores, aimed at minimizing the latency
or the time taken to complete a task/thread) and GPU (having thousands of less powerful
cores, aimed at maximizing the throughput, that is, number of parallel tasks/threads
performed per unit time) are two separate processors with separate memories.
In CUDA terminology, CPU is called the host and GPU, where the parallel computation of
tasks is done by a set of threads running in parallel, the device. A CUDA program consists of
code that is executed on the host as well as code that is executed in the device, thus enabling
heterogeneous computing (a mixed usage of both CPU and GPU computing). The execution
of a CUDA device program (parallel code or “kernel”) involves the following steps: (i)
(device) global memory for the data required by the device is allocated, (ii) input data
required by the program is copied from the host memory to the device memory, (iii) the
program is then loaded and executed on the device, (iv) the results produced by the device are
copied from the device memory to the host memory, and (v) finally, the memory on the
device is deallocated.
CUDA Threads, Thread Blocks, Grid, Kernel
The parts of the program that can be executed in parallel are defined using special functions
called kernels, which are invoked by the CPU and run on the GPU. A kernel launches a large
number of threads that execute the same code on different SPs (Streaming Processors) thus
exploiting data parallelism. A thread is the smallest execution unit of a program. From a
thread’s perspective, a kernel is a sequential code written in C. The set of all threads launched
by a kernel is called a grid. The kernel grid is divided into thread blocks with all thread
blocks having an equal number of threads (See Fig. 8.3). Each thread block has a maximum
number of threads it can support. Newer GPUs support 1024 threads per block while the
older ones can only support 512. A maximum of 8 thread blocks can be executed
simultaneously on an SM. Also, the maximum number of threads that can be assigned to an
SM is 1536. Note that each SM can execute one or more thread blocks, however, a thread
block cannot run on multiple SMs.
Figure 8.3 A 2D grid of 2 × 2 thread blocks; each 2D thread block has 2 × 3 threads.
The hierarchical grouping of threads into thread blocks that form a grid has many
advantages. First, different thread blocks operate independently of each other. Hence, thread blocks can be scheduled in any order relative to each other (in serial or in parallel) on the SM cores. This makes the CUDA architecture scalable across multiple GPUs having different numbers of cores. Second, the threads belonging to the same thread block are scheduled on the
same SM concurrently. These threads can communicate with each other by means of the
shared memory of the SM. Thus, only the threads belonging to the same thread block need to
synchronize at the barrier. Different thread blocks execute independently and hence no
synchronization is needed between them. In short, the grid is a set of loosely coupled thread
blocks (expressing coarse-grained data parallelism) and a thread block is a set of tightly
coupled threads (expressing fine-grained data/thread parallelism).
Once the execution of one kernel gets completed, the CPU invokes another kernel for
execution on the GPU. In newer GPUs (for example, Kepler), multiple kernels can be
executed simultaneously (expressing task parallelism). CUDA supports three special
keywords for function declaration. These keywords are added before the normal function
declaration and differentiate normal functions from kernel functions. They include:
__global__ keyword indicates that the function being declared is a kernel function which
is to be called only from the host and gets executed on the device.
__device__ keyword indicates that the function being declared is to be executed on the
device and can only be called from a device.
__host__ keyword indicates that the function being declared is to be executed on the host.
This is a traditional C function. This keyword can be omitted since the default type for each
function is __host__.
Let functionDemo() be a device kernel. Its definition can be given as:
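A minimal sketch of such a definition (the body shown is only a placeholder) is:

__global__ void functionDemo(int param1, int param2)
{
    /* code executed by every thread launched by this kernel */
}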
The keyword __global__ indicates that this is a kernel function which will be invoked by
the host and executed on the device. It takes two integer arguments param1 and param2.
Whenever a kernel is launched, the number of threads to be generated by the kernel is
specified by means of its execution configuration parameters. These parameters include grid
dimension defined as the number of thread blocks in the grid launched by the kernel and
block dimension defined as the number of threads in each thread block. Their values are
stored in the predefined variables gridDim and blockDim respectively. They are initialized by
the execution configuration parameters of the kernel functions. These parameters are
specified in the function call between <<< and >>> after the function name and before the
function arguments as shown in the following example.
functionDemo<<<32, 16>>> (param1, param2);
Here the function functionDemo() will invoke a total of 512 threads in a group of 32
thread blocks where each thread block consists of 16 threads.
CUDA has certain built-in variables with pre-initialized values. These variables can be
accessed directly in the kernel functions. Two such important variables are threadIdx and
blockIdx representing the thread index in a block and the block index in a grid respectively.
These keywords are used to identify and distinguish threads from each other. This is
necessary since all the threads launched by a kernel need to operate on different data. Thus by
associating a unique thread identifier with each thread, the thread specific data is fetched.
gridDim and blockDim are also pre-initialized CUDA variables. A thread is uniquely
identified by its threadIdx, blockIdx and blockDim values as:
id = blockIdx.x * blockDim.x + threadIdx.x
Here .x represents that the threads and blocks are one dimensional only.
Memory Operations
As mentioned earlier, GPU (device) does not have access to the system (host) RAM and has
its own device memory. Hence, before executing a kernel on the device, memory has to be
allocated on the device global memory and necessary data required by the kernel is to be
transferred from the host memory to the device memory. After the kernel execution, the
results available in the global memory have to be transferred back to the host memory. All
these are achieved by means of the following APIs provided with CUDA and are used in the
host code.
cudaMalloc((void**)A, size): This API function is used to allocate a chunk of device global
memory for the data objects to be used by the kernel. It is similar to the malloc() function of
standard C programming language. The parameter A is the address of the pointer variable that will point to the allocated device memory. The function cudaMalloc() expects a generic pointer, hence this address should be cast to (void**).
Second parameter, size, is the allocation size in bytes reserved for the data object pointed to
by A.
cudaFree(A): This API function is used to free the device global memory allocated
previously by cudaMalloc().
cudaMemcpy(A, B, size, direction): This API function is used to transfer or copy the data
from host to device memory and vice-versa. It is used after the memory is allocated for the
object in the device memory using cudaMalloc(). First parameter, A, to the cudaMemcpy()
function is a pointer to the destination location where data object has to be copied. Second
parameter, B, is again a pointer to the data source location. Third parameter, size, represents
the number of bytes to be transferred. The direction of memory operation involved is
represented by the fourth parameter, direction, that is, host to device, device to host, host to
host and device to device. For data transfer from host to device memory, the direction
parameter would be cudaMemcpyHostToDevice. Similarly, the parameter value for data
transfer from device to host memory is cudaMemcpyDeviceToHost. Here
cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are predefined constants of the
CUDA environment.
Array Addition Example
We now present a simple CUDA program to add two given arrays or vectors and explain the
difference between sequential and parallel programming. Consider a function that takes four
parameters as input. The first two arguments are pointers to the integer arrays whose element-wise sum is to be computed. The third argument is a pointer to the array where the resulting sums should be stored. The fourth argument is the number of elements in the arrays to be added (assuming both arrays are of equal size). We will first write the sequential function for the problem and
then add the relevant statements that make the function parallel thereby enabling it for GPU
execution.
Sequential CPU Execution
Under sequential execution, the function is as given below. The function addArray() takes the
integer arrays A and B, computes the sum and stores the result in C.
1. void addArray(int *A, int *B, int *C, int count)
2. {
3. for (int i = 0; i < count; i++)
4. C[i] = A[i] + B[i];
5. }
Parallel GPU Execution
Under parallel execution in the GPU, the function is as given below.
1. void addArray(int *A, int *B, int *C, int count)
2. {
3. int allocationSize = count * sizeof(int);
4. int *deviceA, *deviceB, *deviceC;
5. cudaMalloc((void**)&deviceA, allocationSize);
6. cudaMemcpy(deviceA, A, allocationSize, cudaMemcpyHostToDevice);
7. cudaMalloc((void**)&deviceB, allocationSize);
8. cudaMemcpy(deviceB, B, allocationSize, cudaMemcpyHostToDevice);
9. cudaMalloc((void**)&deviceC, allocationSize);
10. addArrayKernel<<<ceil(count/128.0), 128>>>
(deviceA, deviceB, deviceC, count);
11. cudaMemcpy(C, deviceC, allocationSize, cudaMemcpyDeviceToHost);
12. cudaFree(deviceA);
13. cudaFree(deviceB);
14. cudaFree(deviceC);
15. }
In the function addArray(), statements 5, 7 and 9 allocate memory for the array elements using the cudaMalloc() API. The pointers deviceA, deviceB and deviceC point to the addresses
of the arrays involved in computation in the device global memory. Arrays A and B are
transferred to the device memory using the cudaMemcpy() API in the statements 6 and 8.
Statement 10 launches the kernel function. Understanding how to choose the execution configuration parameters is very important, as the way the GPU groups the threads depends only upon these parameters. As explained earlier, here the first configuration
parameter specifies the number of thread blocks to be launched by the kernel while the
second parameter specifies the number of threads within each thread block.
In our example, we want array elements of each index to be handled by a separate thread.
This requires the total number of threads to be equal to the value of count, the fourth
parameter to the function addArray(). Out of the count array elements (or threads), let us
consider that each thread block is made up of 128 threads. This is specified by the second
configuration parameter. This means that 128 threads of a given thread block will get
executed by the SPs of the same SM. The total number of thread blocks needed to hold a total
of count threads would be equal to count/128, if count is an integer multiple of 128. However, this might not always be the case, thus we use ceil(count/128.0) (the division is done in floating point so that the result is rounded up rather than truncated by integer division). Statement 11 copies
the result of addition from the device global memory present in the array deviceC to the host
memory array C. Statements 12, 13 and 14 free the memory allocated in the device.
The operation of kernel function, addArrayKernel(), is given below:
1. __global__ void addArrayKernel(int *A, int *B, int *C, int count)
2. {
3. int id = blockIdx.x * blockDim.x + threadIdx.x;
4. if (id < count)
5. C[id] = A[id] + B[id];
6. }
Statement 3 computes the identifier associated with the thread under execution by using
the built-in variables blockIdx, threadIdx and blockDim. This is used as an index into the
array. Each thread manipulates the array element whose index is equal to the thread identifier
as given by statement 5. Since the number of array elements might not be a multiple of 128, the number of threads created may be more than the actual number of array elements. Thus,
the check in the conditional statement 4 is necessary.
Grid and Block Dimensions
The kernel grids and the thread blocks are three dimensional variables of the type dim3. A
grid can be viewed as a 3D array of thread blocks. Similarly, a thread block is composed of a
3D array of threads. It is up to the programmer to use all the dimensions or keep higher
dimensions unused by setting the corresponding dimension parameters to 1.
The data type dim3 is a C struct which is made up of the three unsigned integer fields x, y
and z. The number of thread blocks that can be present in each dimension of a grid is 65,536.
Thus, the values of each of the grid variables gridDim.x, gridDim.y and gridDim.z range
from 1 to 65,536.
Within a grid, the value of blockIdx.x ranges from 0 to gridDim.x – 1; the same is true for the blockIdx.y and blockIdx.z variables. Only 1024 threads can be accommodated within
a thread block (including all the dimensions). They can be grouped as (512,2,1), (128,2,2) or
(64,4,2) or any other valid combination as long as they do not exceed the total thread count of
1024.
The dim3 variables are declared and used as kernel execution configuration parameters as
shown in the following example:
1. dim3 numBlocks(8,4,2);
2. dim3 numThreads(64,2,1);
3. funcDemo<<<numBlocks, numThreads>>>(param1, param2);
The statement 1 defines the dimensions of the kernel grid in terms of the number of thread
blocks. There are 64 thread blocks in this grid which can be visualized as a 3D array with 8
unit blocks in x-direction, 4 unit blocks in y-direction and 2 unit blocks in z-direction.
Similarly, the statement 2 gives the structure of a thread block. In this example, each thread
block is two dimensional with 64 unit threads in x-direction and 2 unit threads in y-direction,
thus having 128 threads. In the statement 3, the variables numBlocks and numThreads are
used as configuration parameters. The kernel funcDemo() thus launches a total of 64 × 128 = 8192 threads.
In case a kernel consists of single dimensional threads and blocks, then the variables
numBlocks and numThreads can be replaced by the corresponding integer value arguments,
as is done in the previous example. Note that the configuration parameters numBlocks and
numThreads can be accessed inside the kernel function funcDemo() by using the predefined
variables gridDim and blockDim respectively.
The number of dimensions chosen for grids and blocks depends upon the type of data
being operated upon, thus helping in better visualization and manipulation of the data. For
example, for an image that is made up of a 2D array of pixels, it is natural to use a 2D
grid and block structure. In this case, each thread can be identified by a unique ID in x- and
y-directions, with each thread operating on a specific image pixel. An example of using 3D
variables could be for the software involving multiple layers of graphics where each layer
adds another dimension to the grids and blocks.
In C language, multidimensional arrays are stored in the main memory in row-major
layout. This means that in case of a 2D array, first all the elements in the first row are placed
in the consecutive memory locations. Then the elements of the next row are placed
afterwards. This is in contrast with the column-major layout followed by FORTRAN, in
which all elements of the first column are assigned consecutive memory locations and the
next columns are assigned memory afterwards in a similar way. For dynamically allocated
multidimensional arrays, there is no way to specify the number of elements in each
dimension beforehand. Hence while accessing such arrays using CUDA C, programmers
convert the higher dimensional element indexes to the equivalent 1D array indexes. Programmers are able to access higher dimensional arrays through a single dimensional index because the flat memory system linearizes all multidimensional arrays in a row-major fashion.
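As a sketch, for an array viewed as having nrows rows and ncols columns but stored in a flat buffer, the conversion can be written as follows (the macro name is illustrative):

/* row-major linearization: element (r, c) of an nrows x ncols array */
#define IDX(r, c, ncols)  ((r) * (ncols) + (c))
/* a[IDX(r, c, ncols)] then plays the role of the conceptual a[r][c] */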
Consider a 2D array of size m × n. Since it is a two dimensional data object, it is intuitive
that the kernel accessing it creates a two dimensional grid and block structure. Let us consider
that each thread block consists of 16 threads in x- and y-dimensions. Then, the variable
specifying the dimensions of the block is declared as:
dim3 numThreads(16,16,1);
The total number of thread blocks in x dimension (which represents the columns) will be
ceil (n/16) and y dimension (which represents the rows) will be ceil (m/16). Thus, the
variable specifying the dimensions of the grid is declared as:
dim3 numBlocks(ceil(n/16.0), ceil(m/16.0), 1);
If the 2D array has a size of 100 × 70 and the block dimension is 16 × 16, then the number
of thread blocks in x and y dimensions is 5 and 7, respectively, creating a total of (7*16=112)
× (5*16=80) = 8960 threads. However, the number of data elements in the array = 100 × 70 =
7000. For the extra created threads, a check is required on their identifier whether it is
exceeding the number of rows or columns of the 2D array.
Matrix Multiplication Example
Consider two matrices A and B of sizes m × n and n × p, respectively. Let C be the matrix
that stores the product of A and B. The size of C would be m × p. We want each cell element
of the matrix C to be evaluated by a separate thread. Thus, we follow 2D structure of the
kernel grid with each thread pointing to a separate cell element of C. Assuming that the size
of each thread block is 16 × 16, dimensions of the grid computing matrix C would be
ceil(m/16) × ceil(p/16). Thus, each block will consist of 256 threads, ranging from T(0,0), ..., T(0,15), ..., up to T(15,15). The program for matrix multiplication is given below:
1 void productMatrix(float *A, float *B, float *C, int m, int n, int p)
2 {
3 int allocationSize_A = m*n*sizeof(float);
4 int allocationSize_B = n*p*sizeof(float);
5 int allocationSize_C = m*p*sizeof(float);
6 float *deviceA, *deviceB, *deviceC;
7 cudaMalloc((void**) &deviceA, allocationSize_A);
8 cudaMalloc((void**) &deviceB, allocationSize_B);
9 cudaMalloc((void**) &deviceC, allocationSize_C);
10 cudaMemcpy(deviceA, A, allocationSize_A, cudaMemcpyHostToDevice);
11 cudaMemcpy(deviceB, B, allocationSize_B, cudaMemcpyHostToDevice);
12 dim3 numBlocks(ceil(p/16.0), ceil(m/16.0), 1);
13 dim3 numThreads(16,16,1);
14 productMatrixKernel<<<numBlocks, numThreads>>>
(deviceA, deviceB, deviceC, m, n, p);
15 cudaMemcpy(C, deviceC, allocationSize_C, cudaMemcpyDeviceToHost);
16 }
17 __global__ void productMatrixKernel(float *deviceA, float *deviceB,
float *deviceC, int m, int n, int p)
18 {
19 /* y_id corresponds to the row of the matrix */
20 int y_id = blockIdx.y * blockDim.y + threadIdx.y;
21 /* x_id corresponds to the column of the matrix */
22 int x_id = blockIdx.x * blockDim.x + threadIdx.x;
23 if ((y_id < m) && (x_id < p)) {
24 float sum = 0;
25 for (int k = 0; k < n; k++)
26 sum += deviceA[y_id*n + k] * deviceB[k*p + x_id];
27 deviceC[y_id*p + x_id] = sum;
28 }
29 }
Statements 1–11 perform the basic steps required before the execution of a kernel, that is,
memory allocation and data transfer. The configuration parameters numBlocks and numThreads are defined, based on the data to be operated on (the matrix C in our example), in statements 12 and 13.
It is very important to understand how the kernel productMatrixKernel() works, since it
involves two dimensions. Statements 20 and 22 compute the y and x thread indexes,
respectively. For our 2D data, these indices map to the rows and columns of the matrix.
Every thread accesses data elements based on its unique thread ID. The if condition in statement 23 is necessary because some threads will not map to any data element of the matrix and should not process (or write) any data.
An important point to note is that accessing the matrices using these IDs directly in the
form deviceA[y_id][x_id] and deviceB[y_id][x_id] is not possible because the number of
columns in the arrays is not available during the compile time. Hence, the arrays are
linearized as explained earlier and programmers have to map the 2D indices into the
corresponding linear or 1D indices, as is done in statement 26. Since C uses row-major layout, memory is allocated row-wise. Each row of matrix deviceA has n elements, thus we
need to move y_id*n memory locations to reach the row having the required elements. Then
by traversing k adjacent locations, the data deviceA[y_id][k] is fetched by using the index
[y_id*n+k]. Similarly, the data element deviceB[k][x_id] is accessed using the equivalent
linearized form deviceB[k*p+x_id], since each row of matrix deviceB has p elements. Each
element of matrix deviceC is the sum of products of corresponding row elements of deviceA
and column elements of deviceB. Once the final value of sum is calculated after the completion of the loop, it is copied to the corresponding element location in the matrix deviceC by statement 27.
Figure 8.4 shows in detail various threads and thread blocks for the matrix multiplication
example with m=n=p=4. Block size is taken as 2 × 2.
8.6.1 MapReduce
This celebrated data-parallel programming model, popularized by Google and having its roots in the map and fold operations of functional languages, defines a computation step using two functions:
map() and reduce().
Map: (key1,value1) → list(key2,value2)
Reduce: (key2,list(value2)) → list(key3,value3)
The map function takes an input key/value pair (key1,value1) and outputs a list of
intermediate key/value pairs (key2,value2). The reduce function takes all values associated
with the same intermediate key and produces a list of key/value pairs (key3,value3).
Programmers express their computation using the above two functions and as mentioned
earlier the MapReduce implementation or runtime system manages the parallel execution of
these two functions.
The following pseudocode illustrates the basic structure of a MapReduce program that
counts the number of occurrences of each word in a collection of documents. Map and
Reduce are implemented using system provided APIs, EmitIntermediate() and Emit()
respectively.
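A sketch of this pseudocode (this particular formulation is illustrative and follows the commonly used one; it is not necessarily the exact listing):

map(String key, String value):
    /* key: document name, value: document contents */
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    /* key: a word, values: the list of counts collected for that word */
    int result = 0;
    for each v in values:
        result = result + ParseInt(v);
    Emit(key, AsString(result));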
Figure 8.5 illustrates an execution overview of the MapReduce word count program. The Map function emits each word in a document with a temporary count of 1. The Reduce function, for each unique word, emits the word together with its total count. In this example, we assume that the run-time system
decides to use three Map instances (or three Map tasks) and two Reduce instances (or two
Reduce tasks). For ease of understanding of our example program execution, we also assume
that the three map instances operate on three lines of the input file as shown in the figure. In
practice, a file is split into blocks as explained later. Observe that there are two phases,
namely shuffle and sort, in between the three mappers and two reducers. The shuffle phase
performs an all-map-to-all-reduce personalized communication so that all the <key,value>
tuples for a particular key are sent to one reduce task. As the number of unique keys is usually much larger than the number of reduce tasks, each reduce task may process more than one key.
The sort phase sorts the tuples on the key field essentially grouping all the tuples for the same
key. As each reduce task receives tuples from all the map tasks, this grouping may not take
place naturally and the tuples for the different keys meant for a reduce task may be jumbled.
8.6.2 Hadoop
Hadoop is an open-source Java-based MapReduce implementation (as opposed to Google’s
proprietary implementation) provided by Apache Software Foundation
(http://hadoop.apache.org). It is designed for clusters of commodity machines that contain
both CPU and local storage. Like MapReduce, Hadoop consists of two layers: (i) Hadoop
Distributed File System (HDFS) or data storage layer and (ii) MapReduce implementation
framework or data processing layer. These two layers are loosely coupled.
HDFS is primarily designed to store large files and built around the idea of “write-once
and read-many-times”. It is an append-only file system. Once a file on HDFS is created and
written to, the file can either be appended to or deleted. It is not possible to modify any
portion of the file. Files are stored as blocks, with a typical block size of 64 MB or 128 MB. Different blocks
of the same file are stored on different machines or nodes (striping of data blocks across
cluster nodes enables efficient MapReduce processing) and each block is replicated across
(usually, 3) nodes for fault-tolerance. The HDFS implementation has a master-slave
architecture. The HDFS master process, called NameNode, is responsible for maintaining the
file system metadata (directory structure of all files in the file system, location of blocks and
their replicas, etc.), and thus manages the global file system name space. The slaves, called
DataNodes, perform operations on blocks stored locally following instructions from the
NameNode. In typical Hadoop deployments, HDFS is used to store both the input data to and the output data from a MapReduce job. Any intermediate data (for example, the output from map tasks) is not saved in HDFS, but is instead stored on the local disks of the nodes where the map tasks are executed.
The MapReduce implementation framework handles scheduling of map/reduce tasks of a
job (processes that invoke user’s Map and Reduce functions are called map and reduce tasks
respectively), moving the intermediate data from map tasks to reduce tasks and fault-
tolerance. A user’s (client’s) program submitted to Hadoop is sent to the JobTracker process
running on the master node (NameNode). Another process called TaskTracker is run on each
slave node (DataNode). The JobTracker splits a submitted job into map and reduce tasks and
schedules them (considering factors such as load balancing and locality, for example, it
assigns map (reduce) tasks close to the nodes where the data (map output) is located) to the
available TaskTrackers. The JobTracker then requests the TaskTrackers to run tasks and
monitor them. Each TaskTracker has a fixed number of map (reduce) slots for executing map
(reduce) tasks. When a TaskTracker becomes idle, the JobTracker picks a new task from its
queue to feed it. If the request from the JobTracker is a map task, the TaskTracker will process the job split (data chunk) specified by the JobTracker. If the request is a reduce task, it
initializes the reduce task and waits for the notification from the JobTracker (that the map
tasks are completed) to start processing (reading intermediate results, sorting, running the
user-provided reduce function and saving the output in HDFS). Communication between the
JobTracker and the TaskTracker processes, for information and requests, happens using a
heartbeat protocol. Similar to how data storage follows master-slave architecture, the
processing of map/reduce tasks follows the master-slave model.
Figure 8.6 shows the execution of our word count example program on Hadoop cluster.
The secondary NameNode shown in this figure is not a standby node for handling the single
point of failure of the NameNode. It (a misleadingly named component of Hadoop) is used for storing the latest checkpoints of the HDFS state. Typically, a map task executes sequentially on
a block and different map tasks execute on different blocks in parallel. Map function is
executed on each record in a block. A record can be defined based on a delimiter such as ‘end
of line’. Typically, a line delimiter is used and the map function is executed for each line in a block. Executing the map function sequentially, record-by-record within a block, improves the data read time.
EXERCISES
Convert the above program into an equivalent SPMD shared memory parallel program.
8.7 Write SPMD message passing and shared memory parallel programs for processing
student grades as described in Chapter 2.
8.8 A popular statement is that shared memory model supports fine-grained parallelism
better than the message passing model. Is it true? (Hint: Fine-grained parallelism
requires efficient communication/synchronization mechanism.)
8.9 Another popular statement is that shared memory programming is easier than message
passing programming. Is it true? If so justify your answer with examples.
8.10 Which programs, shared memory or message passing, are difficult to debug? (Hint:
Consider the address space of the processes.)
8.11 A parallel programming language, by extending (sequential) FORTRAN or C, can be
realized using three different approaches—Library Functions (e.g. MPI), Compiler
Directives (e.g. OpenMP) and New Constructs (e.g. FORTRAN 90). List the relative
merits and shortcomings of these three approaches.
8.12 Write an MPI program that sends a message (a character string, say, “hello world”)
from processes with rank greater than zero to the process with rank equal to zero.
8.13 Write an MPI program that distributes (using MPI_Scatter) the random numbers
generated by a root node (the node corresponding to the process with rank equal to
zero) to all the other nodes. All nodes should compute the partial sums which should be
gathered into the root node (using MPI_Gather). The root node should print the final
result (sum of all random numbers it generated).
8.14 The need for the embedding of one graph into another (graph embedding) arises from
at least two different directions. First (portability of algorithms across networks), it may be necessary to adapt an algorithm designed for a specific interconnection network (graph) to another network (graph). Second (mapping/assigning of processes to
processors), the flow of information in a parallel algorithm defines a program
(task/application) graph and embedding this into a network tells us how to organize the
computation on the network. Figure 8.7 shows an embedding of linear array (chain) into
a mesh and Fig. 8.8 shows an embedding of complete binary tree on a mesh. Embed (a)
a ring into a 4 × 5 mesh and a 3 × 3 mesh, and (b) a 2 × 4 mesh into a 3-dimensional
hypercube.
8.22 Consider the word count program described in this chapter. How do you adapt this program if our goal is to output all words sorted by their frequencies?
(Hint: Make two rounds of MapReduce and in the second round take the output of word
count as input and exchange <key,value>)
8.23 Hadoop speculative task scheduler uses a simple heuristic method which compares the
progress of each task to the average progress. A task with the lowest progress as
compared to the average is selected for re-execution on a different slave node.
However, this method is not well suited to a heterogeneous environment (slave nodes with different computing power). Why?
8.24 How are the following handled by Hadoop runtime system: (i) Map task failure, (ii)
Map slave node failure, (iii) Reduce task failure, (iv) Reduce slave node failure, and (v)
NameNode failure?
BIBLIOGRAPHY
Apache Hadoop Project, http://hadoop.apache.org/
Barney, B., “Introduction to Parallel Computing”, https://computing.llnl.gov/tutorials/parallel_comp/#DesignPerformance.
Chapman, B., Jost, G. and Pas, R., Using OpenMP: Portable Shared Memory Parallel
Programming, MIT Press, USA, 2007.
Dean, J., and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters”,
Proceedings of the USENIX Symposium on Operating Systems Design and
Implementation, 2004, pp. 137–150.
Diaz, J., Munoz-Caro, C. and Nino, A., “A Survey of Parallel Programming Models and
Tools in the Multi and Many-Core Era”, IEEE Transactions on Parallel and Distributed
Systems, Vol. 23, No. 8, Aug. 2012, pp. 1369–1386.
Dobre, C. and Xhafa, F., “Parallel Programming Paradigms and Frameworks in Big Data
Era”, International Journal of Parallel Programming, Vol. 42, No. 5, Oct. 2014, pp. 710–
738.
Gaster, B.R., Howes, L., Kaeli, D.R., Mistry, P. and Schaa, D., Heterogeneous Computing
with OpenCL, Morgan Kaufmann Publishers, USA, 2011.
Hwang, K. and Xu, Z., Scalable Parallel Computing: Technology, Architecture,
Programming, WCB/McGraw-Hill, USA, 1998.
Kalavri, V. and Vlassov, V., “MapReduce: Limitations, Optimizations and Open Issues”,
Proceedings of IEEE International Conference on Trust, Security, and Privacy in
Computing and Communications, 2013, pp. 1031–1038.
Kirk, D.B. and Hwu, W.W., Programming Massively Parallel Processors, Morgan
Kaufmann Publishers, USA, 2010.
Lee, K.H., Lee, Y.J., Choi, H., Chung, Y.D. and Moon, B., “Parallel Data Processing with
MapReduce: A Survey”, ACM SIGMOD Record, Vol. 40, No. 4, Dec. 2011, pp. 11–20.
Pacheco, P., An Introduction to Parallel Programming, Morgan Kaufmann Publishers, USA,
2011.
Rajaraman, V. and Siva Ram Murthy, C., Parallel Computers: Architecture and
Programming, PHI Learning, Delhi, 2000.
Rauber, T. and Runger, G., Parallel Programming for Multicore and Cluster Systems,
Springer-Verlag, Germany, 2010.
Sanders, J. and Kandrot, E., CUDA by Example: An Introduction to General Purpose GPU
Programming, Addison-Wesley, NJ, USA, 2010.
Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B. and Baldeschwieler, E., “Apache Hadoop YARN: Yet Another Resource Negotiator”,
Proceedings of the ACM Annual Symposium on Cloud Computing, 2013.
White, T., Hadoop: The Definitive Guide, O’Reilly Media Publisher, Sebastopol, CA, USA,
2012.
Wilson, G.V., Practical Parallel Programming, Prentice-Hall, New Delhi, 1998.
Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S. and Stoica, I., “Spark: Cluster
Computing with Working Sets”, Proceedings of the USENIX Conference on Hot Topics in
Cloud Computing, 2010.
Compiler Transformations for Parallel Computers
The role of a compiler is to translate code (the input program) from a high level language into
an equivalent machine language. In the process, the compiler has a number of opportunities
to transform the code into an optimized code and consequently minimize the running time (or
size) of the final executable code. This is what an optimizing compiler aims to achieve. In
this chapter, we describe several transformations that optimize programs written in
(imperative) languages such as Pascal, C and FORTRAN (a de facto standard of the high-
performance engineering and scientific computing community) for parallel computers,
including vector machines, superscalar and multicore processors. However, the optimizations
(transformations) we present here are not specific to Pascal, C and FORTRAN alone.
It is very important to note that a compiler must discover and expose the right amount of parallelism in the input program to make efficient use of (parallel) hardware resources: too little parallelism leads to inefficient execution due to underutilization of the resources, while too much parallelism can lead to execution inefficiency due to parallel overhead such as synchronization and load imbalance.
9.1 ISSUES IN COMPILER TRANSFORMATIONS
To apply an optimization, a compiler must do the following: (i) select the part of the program to optimize and the particular transformation to apply to it, (ii) verify that the transformation preserves the meaning of the program, and (iii) transform the code.
In this chapter, we concentrate on the last step, transformation of the code. However, we
also present dependence analysis techniques that are used to select and verify many loop
transformations.
9.1.1 Correctness
When a program is transformed by a compiler, the meaning of the program should remain the
same. This is the issue of correctness. A transformation is said to be legal if the original and
the transformed programs produce exactly the same output for all identical executions, i.e., executions in which, when supplied with the same input data, every corresponding pair of non-deterministic operations in the two executions produces the same result. Correctness is a complex issue
as illustrated below:
Original program:
procedure correctness(a, b, n, m, k)
integer n, m, k
real a[m], b[m]
for i := 1 to n do
a[i] = a[i] + b[k] + 10000.0
end for
end
A transformed version of the above program is as follows:
procedure correctness(a,b,n,m,k)
integer n, m, k
real a[m], b[m], C
C = b[k] + 10000.0
for i := n downto 1 do
a[i] = a[i] + C
end for
end
Though on the face of it, the above code seems equivalent to the original code, there are a
few possible problems.
Overflow. Suppose b[k] is a very large number and a[1] is negative. The transformation changes the order of the additions, which can cause an overflow to occur in the transformed code when one attempts to add b[k] and 10000.0, while the same would not occur in the original piece of code. This discrepancy complicates debugging since the transformation is not visible to the programmer.
Different results. Even if no overflow occurs, the values of the elements of array a,
may be slightly different. The reason is floating-point numbers are approximations
of real numbers, and the order in which the approximations are applied (rounding)
can affect the result.
Memory fault. If k > m and n < 1, the reference to b[k] is illegal. While the original
code would not access b[k] since it would not run through the loop, the
transformed code does access it and throws an exception.
Different results. Partial aliasing of a and b before the procedure ‘correctness’ is
called can change the values assigned to a in the transformed code. In the original
code, when i = 1, b[k] is changed if k = n and b[n] is aliased to a[1]. Whereas in
the transformed version, the old value of b[k] is used for all the iterations thus
giving different results.
Because of the above problems, henceforth we say a transformation is legal if, for all
semantically correct executions of the original program, the original and the transformed
programs perform equivalent operations for identical executions.
9.1.2 Scope
Transformations can be applied to a program at different levels of granularity. It should be
noted that as the scope of the transformations is enlarged, the cost of analysis generally
increases. Some useful gradations in ascending order of complexity are: (1) Statement, (2)
Basic block (a sequence of statements with single entry and single exit), (3) Innermost loop,
(4) Perfect loop nest, (5) General loop nest, (6) Procedure, and (7) Inter-procedural.
9.2 TARGET ARCHITECTURES
In this section, we briefly present an overview of different parallel architectures and some
optimizations that a compiler performs for each of these architectures.
9.2.1 Pipelines
Pipeline processing is one of the most basic forms of the use of temporal parallelism. A
common use of pipeline in computers is in the arithmetic unit, where complicated floating
point operations are broken down into several stages whose executions overlap in time. The
objective of scheduling a pipeline is to keep it full (bubble-free or stall-free) as this results in
maximum parallelism. An operation can be issued only when there is no pipeline conflict,
i.e., the operation has no resource conflicts or dependences on operations already in the
pipeline. If there is no hardware interlocking mechanism to detect pipeline conflicts, the
compiler must schedule the pipeline completely. If there is full hardware interlocking, the
compiler need not worry about the pipeline conflicts. However, it may wish to reorder the
operations presented to the pipeline to reduce the conflicts at run (or execution) time in order
to enhance the pipeline’s performance. A common compiler optimization for pipelined
machines is instruction scheduling, described later.
The iteration space of the above nest is a d-dimensional discrete Cartesian space. Each
axis of the iteration space corresponds to a particular for loop and each point represents the
execution of all the statements in one iteration of the loop. An iteration can be uniquely
named by a vector of d elements I = (i1, ..., id), where each index falls within the iteration range of its corresponding loop in the nesting (i.e., lp ≤ ip ≤ up). Consider a dependence between statements S1 and S2, denoted S1 ⇒ S2, that is, S2 is dependent on S1. If S1 executes in iteration (x1, ..., xd) and S2 in iteration (y1, ..., yd), the dependence distance between the two statements is defined as the vector (y1 – x1, …, yd – xd). For example, consider the following loop:
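A sketch of such a loop (written in C; the bounds and the loop body are illustrative) is:

for (int i = 1; i < n; i++)
    for (int j = 1; j < m - 1; j++)
        a[i][j] = a[i-1][j+1] + 1;    /* writes a[i][j], reads a[i-1][j+1] */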
Each iteration of the inner loop writes the element a[i, j]. There is a dependence if any
iteration reads or writes that same element. Consider the iterations I = (2, 3) and J = (3, 2).
Iteration I occurs first and writes the value a[2, 3]. This value is read in iteration J, so there is
a flow dependence from iteration I to iteration J, i.e., I → J. The dependence distance is J – I
= (1, –1).
When a dependence distance is used to describe the dependences for all iterations, it is
called a distance vector. (1, –1) is the one and only distance vector for the above loop.
Consider another example:
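A sketch of a loop with this set of distance vectors (again, the body is illustrative) is:

for (int i = 1; i < n; i++)
    for (int j = 1; j < m - 1; j++)
        a[i][j] = a[i][j-1] + a[i-1][j] + a[i-1][j+1];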
The set of distance vectors for this loop is {(0, 1), (1, 0), (1, –1)}. Note that distance vectors
describe dependences among iterations, not among array elements. The operations on array
elements create the dependences.
Direction Vectors
It may not always be possible to determine the exact dependence distance at compile-time, or
the dependence distance may vary between iterations; but there is enough information to
partially characterize the dependence. Such dependences are commonly described using
direction vectors. For a dependence I ⇒ J, the direction vector is defined as W = (w1, …, wd), where each entry wp is <, = or > according to whether the pth component of the dependence distance J – I is positive, zero or negative.
The direction vector for the above first loop (with distance vector set {(1, –1)}) is the set
{(<, >)}. The direction vectors for the second loop (with distance vector set {(0, 1), (1, 0), (1,
–1)}) are {( =, <), (<, =), (<,>)}. It can be easily seen that a direction vector entry of <
corresponds to a positive distance vector entry, = corresponds to 0, and > corresponds to a
negative distance vector entry.
The first four iterations, corresponding to the index values 4, 5, 6, 7 are shown below:
S(4) : A(4) = B(4) + C(4)
T(4) : B(6) = A(3) + A(1) + C(3)
U(4) : A(5) = B(11) + 1
S(5) : A(5) = B(5) + C(5)
T(5) : B(7) = A(4) + A(2) + C(4)
U(5) : A(6) = B(13) + 1
S(6) : A(6) = B(6) + C(6)
T(6) : B(8) = A(5) + A(3) + C(5)
U(6) : A(7) = B(15) + 1
S(7) : A(7) = B(7) + C(7)
T(7) : B(9) = A(6) + A(4) + C(6)
U(7) : A(8) = B(17) + 1
The dependence graph for this program segment is shown in Fig. 9.3. In this figure, an
anti-dependence edge has a cross on it and an output dependence edge has a small circle on it.
Statement T is flow dependent on statement S. The flow dependence of T on S caused by
the output variable A(I ) of S and the input variable A(I – 1) of T is the set
{ (S(4), T(5)), (S(5), T(6)), (S(6), T(7)), …, (S(199), T(200))}
The distance vector for this flow dependence is {1}. The corresponding direction vector is
{<}. The flow dependence of T on S caused by the output variable A(I) of S and the input
variable A(I – 3) of T is the set
{(S(4), T(7)), (S(5), T(8)), (S(6), T(9)), …, (S(197), T(200))}
Since the array X is one-dimensional and the number of loops in the loop nest is 2 (i.e., d =
2), the dependence equation is a single equation in four (i.e., 2d) variables:
2i1 + 4i2 – 4j1 – 2j2 = 7      (1)
The above idea of a dependence equation can be extended to statements where variables
can be array-elements of multiple dimensions. A variable that is an element of an n-
dimensional array has the form
where the coefficients are all integer constants. In matrix notation, this variable can be written as X(IA + a0), where
where (i, j) is the 2d-vector obtained by appending the components of j to the components of i.
Equation (4) represents a set of n linear Diophantine equations in 2d variables. The two
variables X(IA + a0) and X(IB + b0) will cause a dependence between the statements S and T
iff there are index points i and j satisfying this set of linear Diophantine equations. We now
present a basic test for dependence.
9.3.6 GCD Test
The GCD test is a general and approximate test used to determine whether the variable X(IA
+ a0) of a statement S and the variable X(IB + b0) of a statement T cause a dependence of S
on T (or of T on S). First we will describe the test considering only single dimensional array-
element variables. The one-dimensional version of the GCD test is given below:
Let X(a1i1 + a2i2 + … + adid + a0) denote a variable of a statement S and X(b1j1 + b2j2 +
… + bd jd + b0) denote a variable of statement T, where X is a one-dimensional array. If gcd
(a1, a2, … , ad, b1, b2, … , bd) does not divide (b0 – a0), then the variables do not cause a
dependence between S and T.
The dependence equation for X(a1i1 + a2i2 + … + adid + a0) and X(b1j1 + b2j2 + … + bdjd +
b0) is
a1i1 + a2i2 + … + adid + a0 = b1j1 + b2j2 + … + bdjd + b0
This reduces to
a1i1 + a2i2 + … + adid – b1j1 – b2j2 – … – bdjd = b0 – a0
The one-dimensional version of the GCD test follows directly from Eqs. (5) and (6).
Let us apply the GCD test to determine if any dependence exists between statements S and
T in the previous example. The dependence equation is given by Eq. (1) as
2i1 + 4i2 – 4j1 – 2j2 = 7
The GCD of coefficients on the left-hand side is 2. Since 2 does not divide 7, it follows from
the GCD test that there is no dependence between S and T.
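As a sketch, the one-dimensional test can be coded directly from the statement above: compute the gcd of all the coefficients and check whether it divides b0 – a0 (the function names below are illustrative).

#include <stdlib.h>                    /* abs() */

/* gcd of two non-negative integers */
static int gcd(int x, int y)
{
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* One-dimensional GCD test.  a[] and b[] hold the d coefficients a1..ad and
   b1..bd of the two subscript expressions; a0 and b0 are the constant terms.
   Returns 1 if a dependence is possible, 0 if independence is proved. */
int gcd_test(const int a[], const int b[], int d, int a0, int b0)
{
    int g = 0;
    for (int k = 0; k < d; k++) {
        g = gcd(g, abs(a[k]));
        g = gcd(g, abs(b[k]));
    }
    if (g == 0)                        /* all coefficients are zero */
        return (b0 - a0) == 0;
    return (b0 - a0) % g == 0;         /* dependence possible iff g divides b0 - a0 */
}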
The GCD test can be extended to statements involving multi-dimensional array-element
variables. Before we present this extension, we need to define two types of matrices. A
square integer matrix A is said to be unimodular if the determinant of A has modulus 1. For
an m × n matrix B, let li denote the column number of the leading (first non-zero) element of
row i (for a zero row, li is undefined). Then, B is an echelon matrix if for some integer p in 0
≤ p ≤ m, the following conditions hold:
Here, we have
is (250, 350/3, t3, t4) where t3 and t4 are undetermined. Since there is no integer solution, the
GCD test implies that there is no dependence between statements S and T.
9.4 TRANSFORMATIONS
A major goal of optimizing compilers for parallel computers is to detect or increase
parallelism in loops, since generally most of the execution time is spent in loops. We now
present several loop transformations and also some standard ones which have been
systematized and automated.
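One basic transformation is loop-invariant code motion, which hoists a computation whose value does not change across iterations out of the loop. A sketch (assuming a loop that divides every element of a by sqrt(x); the function names are illustrative) is:

#include <math.h>

/* Original: sqrt(x) is loop-invariant but evaluated in every iteration */
void scale_orig(double *a, int n, double x)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] / sqrt(x);
}

/* After code motion: sqrt(x) is computed once, and only if the loop
   executes at least one iteration */
void scale_opt(double *a, int n, double x)
{
    if (n > 0) {
        double t = sqrt(x);
        for (int i = 0; i < n; i++)
            a[i] = a[i] / t;
    }
}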
Here the costly sqrt function is called just once in the optimized code as opposed to n
times in the original code. Also, unless the loop has at least one iteration, the sqrt function is
not invoked.
Loop Unswitching
This optimization is applied when a loop contains a conditional statement with a loop-
invariant test condition. The loop is then replicated inside the if and then parts of the
conditional statement thereby (i) saving the overhead of conditional branching inside the
loop, (ii) reducing the code size of the loop body, and (iii) possibly enabling the
parallelization of if or then part of the conditional statement. This technique is illustrated
below:
The loop unswitching transformation when applied to the above code results in
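As a sketch of both the original loop and its unswitched form (assuming a loop-invariant flag t; the statements are illustrative):

/* Original: the test on the loop-invariant flag t is executed in every iteration */
for (int i = 0; i < n; i++) {
    if (t > 0)
        a[i] = a[i] + b[i];
    else
        a[i] = 0.0;
}

/* After unswitching: the test is executed once, each copy of the loop has a
   smaller body, and each copy can be considered separately for parallelization */
if (t > 0) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
} else {
    for (int i = 0; i < n; i++)
        a[i] = 0.0;
}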
Conditionals that are candidates for unswitching can be spotted during the analysis for
code motion, which identifies loop invariants.
Here the distance vectors are (1, 0), (1, –1), so the inner loop is parallelizable as shown
below:
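A sketch of such a loop nest and its parallelized form (illustrative; both dependences, (1, 0) and (1, –1), are carried by the outer i loop, so the inner j loop carries no dependence):

/* Distance vectors (1, 0) and (1, -1): both carried by the outer loop */
for (int i = 1; i < n; i++)
    for (int j = 1; j < m - 1; j++)
        a[i][j] = a[i-1][j] + a[i-1][j+1];

/* Inner loop executed in parallel */
for (int i = 1; i < n; i++) {
    #pragma omp parallel for
    for (int j = 1; j < m - 1; j++)
        a[i][j] = a[i-1][j] + a[i-1][j+1];
}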
Loop Interchange
In a perfect loop nest, a loop interchange exchanges the position of two loops. This
transformation has several advantages:
The above advantages, simple as they may seem, can be extremely tricky to handle since
one of the above benefits can cancel another. For example, an interchange that improves
register usage may change a stride-1 access pattern to a stride-n access pattern. Consider the
following piece of code.
Assuming (FORTRAN) column-major storage of the array, the inner loop accesses array a
with stride n, i.e., it accesses memory locations which are n apart. This increases the cache
misses which results in lower overall performance. By interchanging the loops, as shown
below, we convert the inner loop to stride-1 access.
However, the original code, unlike the optimized code, allows total[i], to be placed in a
register thereby eliminating the load/store operations in the inner loop. Hence, if the array a
fits in the cache, the original code is better.
Loop Skewing
This transformation, together with loop interchange, is typically used in loops which have
array computations called wavefront computations, so called because the updates to the array
go on (propagate) in a cascading (wave) fashion. Consider the following wavefront
computation:
Observe that the transformed code is equivalent to the original, but the effect on the
iteration space is to align the diagonal wavefronts of the original loop nest so that for a given
value of j, all iterations in i can be executed in parallel. To expose this parallelism, the
skewed loop nest must also be interchanged as shown below:
Loop Reversal
Reversing the direction (indices) of a loop is normally used in conjunction with other
iteration space reordering transformations because it changes the distance vector. The change
in the distance vector sometimes facilitates loop interchange thus possibly optimizing it.
Consider the following example:
Strip Mining
In a vector machine, when a vector has a length greater than that of the vector registers,
segmentation of the vector into fixed size segments is necessary. This technique is called strip
mining. One vector segment (one surface of the mine field) is processed at a time. In the
following example, the iterations of a serial loop are converted into a series of vector
operations, assuming the length of vector segment to be 64 elements. The strip-mined
computation is expressed in array notation, and is equivalent to a do in parallel loop.
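A sketch, assuming a one-dimensional array a updated by a constant c and a vector register length of 64:

/* Serial loop */
for (int i = 0; i < n; i++)
    a[i] = a[i] + c;

/* Strip-mined version: each strip of at most 64 elements corresponds to one
   vector operation, i.e., a[is : is+len-1] = a[is : is+len-1] + c in array notation */
for (int is = 0; is < n; is += 64) {
    int len = (n - is < 64) ? (n - is) : 64;
    for (int i = is; i < is + len; i++)
        a[i] = a[i] + c;
}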
Cycle Shrinking
This is just a special case of strip mining. It converts a serial loop into an outer loop and an
inner parallel loop. Consider the following example. The result of an iteration of the loop is
used only after k iterations and consequently the first k iterations can be performed in parallel
after which the same is done for the next k iterations and so on. This results in a speedup of k
but since k is quite likely to be small (usually 2 or 3), this optimization is primarily used for
exposing fine-grained (instruction level) parallelism.
Loop Tiling
Tiling is the multi-dimensional generalization of strip mining. It is mainly used to improve
the cache reuse by dividing an iteration space into tiles and transforming the loop nest to
iterate over them. It can also be used to improve processor, register, TLB (translation
lookaside buffer or simply translation buffer is a separate and dedicated memory cache that
stores most recent translations of virtual memory to physical addresses for faster retrieval of
instructions and data) or page locality. The following example shows the division of a matrix
into tiles. This sort of division is critical in dense matrix multiplication for achieving good
performance.
Figure 9.4 shows the iteration dependence graphs (iteration spaces) for some example loop
nests that illustrate loop interchange, loop skewing and loop tiling transformations.
Loop Distribution
Distribution (also called loop fission or loop splitting) breaks a loop into two or more. Each
of the new loops has the same iteration space as the original but with fewer statements.
Distribution is used, among other things, to remove dependences (thereby enabling partial parallelization of a loop), to improve locality and to reduce the size of loop bodies.
The following is an example in which distribution removes dependences and allows a part of a loop to be executed in parallel. Distribution can be applied to any loop, but all statements
belonging to a dependence cycle (i.e., all the statements which are dependent on each other)
should be executed in the same loop. Also the flow sequence should be maintained.
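A sketch of such a loop and its distributed form (illustrative; only the second statement carries a dependence, on d):

/* Original loop */
for (int i = 1; i < n; i++) {
    a[i] = b[i] + c[i];           /* no loop-carried dependence      */
    d[i] = a[i] + d[i-1];         /* loop-carried dependence on d    */
}

/* After distribution: the first loop can be executed in parallel,
   only the second remains serial */
for (int i = 1; i < n; i++)
    a[i] = b[i] + c[i];
for (int i = 1; i < n; i++)
    d[i] = a[i] + d[i-1];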
Figure 9.4 Examples of loop transformations.
Loop Fusion
This is the inverse of loop distribution and is also called jamming. It combines two or more
independent loops. In the above example, distribution enables the parallelization of part of
the loop while fusion improves register and cache locality as a[i] need be loaded only once.
Further, fusion reduces the loop control overhead, which is useful in both vector and parallel
machines, and coalesces serial regions, which is useful in parallel execution. With large n
(number of iterations), the distributed loop should run faster on a vector machine while the
fused loop should be better on a superscalar machine.
Loop Unrolling
A loop-unrolled version of a simple loop, with the loop unrolled twice, is:
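As a sketch, assuming the original loop has the simple form shown first (illustrative):

/* Original loop */
for (int i = 0; i < n; i++)
    a[i] = a[i] + b;

/* Unrolled twice (n assumed to be even for simplicity) */
for (int i = 0; i < n; i += 2) {
    a[i]   = a[i]   + b;
    a[i+1] = a[i+1] + b;          /* can execute while the result of a[i] is being stored */
}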
Here the loop overhead is cut in half because two iterations are performed before the test
and branch at the end of the loop. Instruction level parallelism is enhanced because the
second assignment can be performed while the results of the first are being stored.
Software Pipelining
This refers to the pipelining of successive iterations of a loop in the source code. Here, the
operations of a single iteration are broken into s stages, and a single iteration performs stage 1
from iteration i, stage 2 from iteration i – 1, and so on. Software pipelining is illustrated with
the execution details of the following loop on a two-issue processor first without software
pipelining and then with software pipelining.
for i := 1 to n do
a[i] = a[i] * b + c
end for
Here, there is no loop-carried dependence. Let us assume for the analysis that each
memory access (Read or Write) takes one cycle and each arithmetic operation (Mul and Add)
requires two cycles. Without pipelining, one iteration requires six cycles to execute as shown
below:
1 Read Read a[i]
2 Mul Multiply by b
4 Add Add to c
6 Write Write a[i]
Thus n iterations require 6n cycles to complete, ignoring the loop control overhead. Shown
below is the execution of four iterations of the software-pipelined code on a two-issue
processor. Although each iteration requires 8 cycles to flow through the pipeline, the four
overlapped iterations require only 14 clock cycles to execute thereby achieving a speedup of
24/14 = 1.7.
Loop Coalescing
Coalescing combines a loop nest into a single loop, with the original indices computed from
the resulting single induction variable. It can improve the scheduling of the loop on a parallel
machine and may also reduce loop overhead. An example for loop coalescing is given below.
The original loop is:
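A sketch consistent with the discussion below, assuming the nest updates every element of an n × m array a by a constant c:

for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        a[i][j] = a[i][j] + c;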
Here, if n and m are slightly larger than the number of available processors P, say n = m = P + 1, then by scheduling the iterations of either loop (inner or outer) across the processors, it will take 2n units of time.
This is because the P processors can update the first P = n – 1 rows in parallel in n units of
time, with each processor handling one row. And to update the last row a further n units of
time is required. The coalesced form of the loop is shown below:
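Continuing the sketch above, the coalesced loop recovers the original indices from the single induction variable t, and its n*m iterations can be spread evenly over the P processors:

for (int t = 0; t < n * m; t++) {
    int i = t / m;                /* original row index    */
    int j = t % m;                /* original column index */
    a[i][j] = a[i][j] + c;
}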
Forward Substitution
This is a generalization of copy propagation, in that it replaces a variable by the expression
corresponding to it. Consider the following example. The (original) loop cannot be
parallelized because an unknown element of a is being written. On the other hand, the
optimized loop can be implemented as a parallel reduction.
Reassociation
This is a technique to increase the number of common sub-expressions in a program. It is
generally applied to address calculations within loops when performing strength reduction on
induction variable expressions. Address calculations generated by array references involve
several multiplications and additions. Reassociation applies the associative, commutative, and
distributive laws to rewrite these expressions in a canonical sum-of-products form.
Algebraic Simplification and Strength Reduction
The compiler, by applying algebraic rules, can simplify arithmetic expressions. Here are
some of the commonly applied rules.
x × 0 = 0
0 / x = 0
x × 1 = x
x + 0 = x
x / 1 = x
As mentioned earlier, the compiler can replace an expensive operator (such as x × 2 and
x^2) with an equivalent less expensive operator (x + x and x × x, respectively). This is called
strength reduction.
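A small illustrative C sketch of both transformations (not the book's example) follows: x * 2 is replaced by x + x, the square of x by x * x, and the multiplication i * stride inside the subscript by a running offset that is incremented on each iteration.

#include <math.h>

double sum_before(const double *a, int n, int stride, double x)
{
    double s = x * 2.0 + pow(x, 2.0); /* x * 2 and x squared                */
    for (int i = 0; i < n; i++)
        s += a[i * stride];           /* one multiplication per iteration   */
    return s;
}

double sum_after(const double *a, int n, int stride, double x)
{
    double s = (x + x) + (x * x);     /* cheaper equivalent operations      */
    int offset = 0;
    for (int i = 0; i < n; i++) {
        s += a[offset];
        offset += stride;             /* strength-reduced address update    */
    }
    return s;
}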
An inter-procedural analysis between the two procedures generally collects the following
types of information:
(i) alias information related to the procedure call, such as the fact that the actual parameter at the call site (b[i, j – 1] and c) and the formal parameter in the procedure header (x and y, respectively) refer to the same object, if the call by reference mode is used.
(ii) change information, such as a variable nonlocal to a procedure being changed as a result of the procedure invocation, as c is changed as a result of the call to FUNC.
The decomposition of the array, a, onto the processors is shown in Fig. 9.5. Each column
is mapped serially, i.e., all the data in that column is placed on a single processor. As a result
of this, the loop that iterates over the columns (the inner loop) is unchanged.
The above parallelized version of the code computes upper and lower bounds that the
guards use to prevent computation on array elements that are not stored locally within the
processor. It assumes that n is an even multiple of Pnum and the arrays are already block
distributed across the processors.
Redundant Guard Elimination
A compiler can reduce the number of guards by hoisting them to the earliest point where they
can be correctly computed. Hoisting often reveals that identical guards have been introduced
and that all but one can be removed. The above parallelized version of the code after
redundant guard elimination results in the following:
Bounds Reduction
The guards control which iterations of the loop perform computation. When the desired set of
iterations is a contiguous range, the compiler can achieve the same effect by changing the
induction expressions to reduce the loop bounds. The following version of the code is
obtained after bounds reduction:
LB = (n/Pnum)*Pid + 1
UB = (n/Pnum)*(Pid + 1)
FORK(Pnum)
for i := LB to UB do
a[i] = a[i] + c
b[i] = b[i] + c
end for
JOIN()
Communication Optimization
An important part of compilation for distributed memory machines is the analysis of a
program’s communication requirements and introduction of explicit message passing
operations into it. The same fundamental issues arise in communication optimization as in
optimizations for memory access: maximizing reuse, minimizing the working set, and
making use of available parallelism in the communication system. However, the problems are
magnified because the communication costs are at least an order of magnitude higher than
those associated with memory access. We now discuss transformations which optimize communication.
Message Vectorization
This transformation is analogous to the way a vector processor interacts with memory.
Instead of sending each element of an array in an individual message, the compiler can group
many of them together and send them in a single block to avoid paying the startup cost for
each message transfer. The idea behind vectorization of messages is to amortize the startup
overhead over a larger number of messages. Startup overhead includes the time to prepare the
message (adding header, trailer, and error correction information), the time to execute the
routing algorithm, and the time to establish an interface between the local processor and the
router. Consider the following loop which adds each element of array a with the mirror
element of array b.
for i := 1 to n do
a[i] = a[i] + b[n+1-i]
end for
A parallelized version of the loop, to be executed on two processors with the arrays block distributed between them (processor 0 has the first half and processor 1 has the second half), is as follows. The SEND and RECEIVE calls have the form SEND(buffer, message size in bytes, destination processor) and RECEIVE(buffer, message size in bytes, source processor).
This version is very inefficient because during each iteration, each processor sends the
element of b that the other processor will need and waits to receive the corresponding
message. Applying the message vectorization transformation to this loop, we get the
following version.
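The vectorized code itself is not reproduced here; the following C sketch (with hypothetical SEND/RECEIVE prototypes matching the primitives used in this section, arrays of doubles indexed from 1, n even, and buffered sends assumed) conveys the idea:

/* Hypothetical message-passing primitives; their implementation is
   system specific and SEND is assumed to be buffered. */
extern void SEND(void *buf, int nbytes, int dest);
extern void RECEIVE(void *buf, int nbytes, int src);

/* Each processor sends its entire half of b in one message before the
   loop, instead of one element per iteration. a and b have n+1 elements
   and are used with indices 1..n. */
void vectorized(double *a, double *b, int n, int Pid)
{
    int half     = n / 2;
    int LB       = Pid * half + 1;
    int UB       = LB + half - 1;
    int otherPid = 1 - Pid;
    int otherLB  = otherPid * half + 1;

    SEND(&b[LB], half * (int)sizeof(double), otherPid);
    RECEIVE(&b[otherLB], half * (int)sizeof(double), otherPid);

    for (int i = LB; i <= UB; i++)
        a[i] = a[i] + b[n + 1 - i];
}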
This is a much more efficient version because it handles all communication in a single
message before the loop begins executing. But such an optimization has a drawback if the
vectorized messages are larger than the internal buffers available. The compiler may be able
to perform additional communication optimization by using strip mining to reduce the
message length as shown in the following parallelized version of the loop (assuming array
size is a multiple of 256).
LB = Pid*(n/2) + 1
UB = LB + (n/2) - 1
otherPid = 1 - Pid
otherLB = otherPid*(n/2) + 1
otherUB = otherLB + (n/2) - 1
for j := LB to UB with step size of 256 do
    SEND(b[j], 256*4, otherPid)
    RECEIVE(b[otherLB + (j-LB)], 256*4, otherPid)
    for i := j to j+255 do
        a[i] = a[i] + b[n+1-i]
    end for
end for
Collective Communication
Many parallel architectures and message passing libraries offer special purpose
communication primitives such as broadcast, hardware reduction, and scatter-gather. These
primitives are typically implemented very efficiently on the underlying hardware. So a
compiler can improve performance by recognizing opportunities to exploit these operations.
The idea is analogous to idiom and reduction recognition on sequential machines.
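As an illustration, message passing libraries such as MPI expose these operations directly; a loop of explicit point-to-point sends can often be replaced by a single broadcast or reduction call. The sketch below is illustrative only and is not tied to the SEND/RECEIVE primitives of this chapter.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double x = 0.0, partial, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        x = 3.14;                       /* value known only at the root      */
    MPI_Bcast(&x, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    partial = x * rank;                 /* some local computation            */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);
    MPI_Finalize();
    return 0;
}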
Message Pipelining
Another important optimization is to pipeline parallel computations by overlapping
communication and computation. Many message passing systems allow the processor to
continue executing instructions while a message is being sent. The compiler has the
opportunity to arrange for useful computation to be performed while the network is
delivering messages.
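For example, with MPI's non-blocking primitives a transfer can be initiated and then completed later, leaving room for useful computation in between (a hedged sketch; the block sizes and tags are illustrative):

#include <mpi.h>

/* A non-blocking send lets the processor compute while the network
   delivers the previous block; the buffer must not be reused until
   MPI_Wait returns. */
void pipelined_send(double *blocks, int nblocks, int blocklen, int dest)
{
    MPI_Request req;
    MPI_Status  status;

    for (int k = 0; k < nblocks; k++) {
        MPI_Isend(&blocks[k * blocklen], blocklen, MPI_DOUBLE,
                  dest, /*tag=*/k, MPI_COMM_WORLD, &req);
        /* ... useful computation on other data overlaps with the transfer ... */
        MPI_Wait(&req, &status);        /* block k is now safe to reuse      */
    }
}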
9.5 FINE-GRAINED PARALLELISM
We now describe some techniques directed towards the exploitation of instruction-level
parallelism. They are (i) instruction scheduling, (ii) trace scheduling, and (iii) software
pipelining.
2. Reversal: Reversal of the ith loop is represented by the identity matrix with the ith element
on the diagonal equal to –1. Consider the following piece of code:
The distance vector describing the loop nest is D = {(1, 0)}, representing the dependence
of iteration i on i – 1 in the outer loop. Suppose we want to check if the outer loop can be
reversed, i.e., we want to check if the following transformation of the loop is legal.
Applying the loop-reversal transformation, represented by the matrix with rows (–1, 0) and (0, 1), to the dependence vector (1, 0) gives the transformed dependence vector P1 = (–1, 0).
This shows that the loop-reversal transformation is not legal for this loop, because P1 is
not lexicographically positive. This can be easily explained from elementary dependence
analysis.
Similarly, the interchange transformation is represented by the matrix with rows (0, 1) and (1, 0); applying it to the dependence vector (1, 0) gives P2 = (0, 1).
In this case P2 is lexicographically positive and hence the interchange of the two loops to
get the following transformed version is legal.
We have seen that reversing the outer loop is invalid in this case. Let us see if reversing
the outer loop followed by interchange of the loops is valid or not. The transformation matrix
for this compound transformation of reversing the outer loop followed by interchange is
given by the product of the unimodular transformation matrices for each of these individual
transformations taken in the correct order, i.e., the interchange matrix multiplied by the reversal matrix, which is the matrix with rows (0, 1) and (–1, 0). Applying this compound transformation to the dependence vector (1, 0) gives (0, –1), which is not lexicographically positive, so this compound transformation is also not legal for the loop.
9.2 Vectorize the following loops if possible. Otherwise give reasons why it is not possible.
9.3 Loop Peeling is a loop restructuring transformation in which two successive loops are
merged after “peeling” or separating the first few iterations of the first loop or the last
few iterations of the second loop or both. Peeling has two uses: for removing
dependences created by these first/last loop iterations thereby enabling parallelization
and for matching the iteration control of adjacent loops to enable fusion. Apply this
transformation to the following code to obtain a single parallelizable loop.
9.4 Loop Normalization, another loop restructuring transformation, converts all loops so
that the induction variable is initially 1 or 0 and is incremented by 1 on each iteration.
This transformation can expose opportunities for fusion and simplify inter-loop
dependence analysis. The most important use of normalization is to permit the compiler
to apply subscript analysis tests, many of which require normalized iteration ranges.
Now show how this loop transformation should be applied to the following code such
that the different iterations of a loop can be parallelized.
9.6 Find the dependence equation for the variable of the two-dimensional array X in
statements S and T in the following loop.
9.7 Consider the following algorithm which performs Echelon Reduction. Given an m × n
integer matrix A, this algorithm finds an m × m unimodular matrix U and m × n echelon
matrix S = (sij), such that UA = S. Im is the m × m identity matrix and sig(i) denotes the
sign of integer i. Use this algorithm to compute U and S for the example problem given
in the section on GCD test.
9.8 Using the GCD test, decide if there could be dependence between the statements S and T
in the loop nest
Figure 9.10 gives the iteration spaces for the original code as well as for the
transformed version.
(a) Assume that the array a is stored in the memory in column-major order and the
memory has B = 4 banks. Then the elements of the array will be stored in the
memory as shown in Fig. 9.11. How will the given code perform in terms of time
required for consecutive memory accesses?
Figure 9.11 Storage of a in the memory.
(b) From the preceding part, it is clear that if the stride is an exact multiple of the number of
banks, then all successive memory accesses will be on the same bank leading to severe
performance degradation. One straightforward solution to this problem is to pad the
array along the dimension of access with some p (now the stride will be s + p) such that
the accesses do not fall onto the same bank. Find the condition on the stride s, the
number of banks B, and the padding size p such that B successive accesses will all
access different banks.
9.15 Consider the following loop
The iteration space for this loop is given in Fig. 9.12. Locality for neighbour-based
computation and uniform balancing of load across processors are the two important
metrics for array decomposition across processors in a parallel machine. Assume that
you want to optimize on the uniform balancing of load metric. Then which array
decomposition strategy among the following would you choose for decomposing the
array a?
(a) Serial
(b) Block
(c) Cyclic
Scheduling Algorithm
The algorithm takes exactly N scheduling steps for scheduling a task graph and works as
follows: At each scheduling step, it computes a set of ready tasks (Unscheduled tasks which
have no predecessors or which have all their predecessors scheduled are known as ready
tasks.), constructs a priority list of these tasks and then selects the element (task) with the
highest priority for scheduling in the current step. The priorities are computed taking into
consideration the task graph, the multiprocessor architecture and the partial schedule at the
current scheduling step. (A partial schedule at the current scheduling step is the incomplete
schedule constructed by the scheduling steps preceding the current scheduling step.) Thus
each element (task) in the priority list will have three fields, viz., task, processor and priority
of the task with respect to the processor. Now the selected task is scheduled to the selected
processor such that it finishes the earliest.
The priority of a ready task, vi with respect to each processor, pj (j = 1, 2, …, number of
processors), given the partial schedule (Σ) at the current scheduling step, (which reflects the
quality of match between vi and pj given Σ) is given by
Schedule Priority, SP (vi, pj, Σ) = SL(vi) – EFT (vi, pj, Σ)
where EFT (vi, pj, Σ) is the earliest time at which task vi will finish execution if it is scheduled
on processor pj , given the partial schedule Σ at the current scheduling step and SL(vi), called
the static level of vi, is the length of the longest path from task vi to the exit task. Here, the
length of the path is defined as the sum of the execution times of all the tasks on the path. The
level of a task indicates how far the task is away from the exit task (task completion). The
level of the entry task denotes the lower bound on the completion time of the task graph. (The
longest path, from the entry task to the exit task is called the critical path because it
determines the shortest possible completion time.) This means the total execution time for the
task graph cannot be shorter than the level of the entry task even if a completely-connected
interconnection topology (in which there exists a direct communication channel or link
between every pair of processors) with unbounded number of processors is employed. Since
the static level of a task depends on the task graph alone, it can be computed for all tasks
before the scheduling process begins.
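A sketch of this pre-computation in C follows, assuming the tasks are numbered 0..ntasks–1 in topological order (every edge goes from a lower to a higher index) and integer execution times; the data structures are illustrative, not the book's.

#define MAX_TASKS 64   /* illustrative bound on the number of successors */

/* SL(v) = exec(v) + max over successors of SL(successor); for the exit
   task SL equals its own execution time. Communication costs are not
   included, since the static level depends on the task graph alone. */
void static_levels(int ntasks,
                   const int exec_time[],
                   const int nsucc[],
                   int succ[][MAX_TASKS],
                   int sl[])
{
    for (int v = ntasks - 1; v >= 0; v--) {   /* reverse topological order */
        int best = 0;
        for (int s = 0; s < nsucc[v]; s++) {
            int w = succ[v][s];
            if (sl[w] > best)
                best = sl[w];
        }
        sl[v] = exec_time[v] + best;          /* longest path to the exit   */
    }
}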
The algorithm considers simultaneous scheduling of the processors and channels. All
inter-processor communications are scheduled by a router which finds the best path between
the communicating processors for each communication and schedules the communication
onto the channels on this path. Below we present the algorithm for computing EFT(*).
Algorithm EFT (vi, pj, Σ)
Given the idea of schedule priority and the algorithm for computing EFT(*), we now
present the scheduling algorithm below.
Algorithm SCH
while there are tasks to be scheduled do
1. Let Σ be the partial schedule at this scheduling step. Compute the set of ready tasks R
with respect to Σ.
2. foreach task vi ∈ R do
foreach processor pj in the multiprocessor do
—Compute EFT (vi, pj, Σ) using algorithm EFT
—Compute the schedule priority SP(vi, pj, Σ)
using the equation
SP(vi, pj, Σ) = SL(vi) – EFT(vi, pj, Σ)
end foreach
end foreach
3. Let the schedule priority be maximum for task vm with respect to processor pn given the
partial schedule Σ. Schedule all communications from the immediate predecessors of
task vm to the channels using the router which exploits schedule holes in the channels.
Then schedule task vm onto processor pn such that it finishes the earliest, exploiting
schedule holes in processor pn whenever possible.
end while
Figure 10.3 gives the output of the algorithm, presented above, at various steps while
scheduling the task graph in Fig. 10.1 onto a two-processor system shown in Fig. 10.2.
1. Load estimation policy: This determines how to estimate the work load of a
particular processor in the system. The work load is typically represented by a load
index which is a quantitative measure of a processor’s load. Some examples of
load indices are processor’s queue length (number of tasks on the processor) and
processor’s utilization (number of CPU cycles actually executed per unit of real
time). Obviously, the CPU utilization of a heavily loaded processor will be greater
than the CPU utilization of a lightly loaded processor.
2. Task transfer policy: This determines the suitability of a task for reassignment,
i.e., the policy identifies whether or not a task is eligible to be transferred to
another processor. In order to facilitate this, we need to devise a policy to decide
whether a processor is lightly or heavily loaded. Most of the dynamic scheduling
algorithms use the threshold policy to make this decision. The threshold value
(fixed or varying) of a processor is the limiting value of its work load and is used
to decide whether a processor is lightly or heavily loaded. Thus a new task at a
processor is accepted locally for processing if the work load of the processor is
below its threshold value at that time. Otherwise, an attempt is made to transfer the
task to a lightly loaded processor.
3. State information exchange policy: This determines how to exchange the system
load information among the processors. This policy decides whether the load
information is to be distributed periodically (periodic broadcast) to all the
processors or the load information is to be distributed on-demand (on-demand
exchange) at load balancing time (a processor, for example, requests the state
information of other processors when its state switches from light to heavy.) These
two methods generate a number of messages for state information exchange as
they involve a broadcast operation. To overcome this problem, a heavily loaded
processor can search for a suitable receiver (partner) by randomly polling the other
processors one by one.
4. Location policy: This determines the processor to which a task should be
transferred. This policy decides whether to select a processor randomly or select a
processor with minimum load or select a processor whose load is below some
threshold value.
where lp is the load of the sender, K is the total number of its immediate neighbours and lk is
the load of processor k. Each neighbour is assigned a weight hk according to hk = Lavg – lk if
lk < Lavg, and hk = 0 otherwise. These weights are summed to determine the total deficiency Hd = Σk hk.
Finally, we define the proportion of processor p’s excess load, which is assigned to
neighbour k as δk, such that
Once the amount of load to be transferred is determined, then the appropriate number of
tasks are dispatched to the receivers. Figure 10.10 shows an example of the sender-initiated
algorithm.
Here we assume that Lhigh = 10. Hence, the algorithm identifies processor A as the sender
and does its first calculation of the domain’s average load:
Processor B C D E
Weight, hk 8 3 1 0
These weights are summed to determine the total deficiency:
Hd = 8 + 3 + 1 + 0 = 12
Processor B C D E
Load, δk 8 3 1 0
Finally, processor p sends respective load requests to its neighbours. Figure 10.11 shows
an example of the receiver-initiated algorithm. Here we assume that Llow = 6. Hence, the
algorithm identifies processor A as the receiver and does its first calculation of the domain’s
average load:
Figure 10.11 Example of receiver-initiated algorithm in a 4 × 4 mesh.
The weight for each neighbourhood processor is then as follows:
Processor B C D E
Weight, hk 4 3 2 0
Load, δk 4 3 2 0
Processor A B C D E
Load 11 10 10 10 9
Several researchers believe that dynamic scheduling (load balancing), which attempts to
equalize the work load on all processors, is not an appropriate objective as the overhead
involved (in gathering state information, selecting a task and processor for task transfer, and
then transferring the task) may outweigh the potential performance improvement. Further,
load balancing in the strictest sense may not be achievable because the number of tasks in a
processor may fluctuate and the temporal unbalance among the processors exists at every
moment, even if the average load is perfectly balanced. In fact, for proper utilization of the
resources, it is enough to prevent the processors from being idle while some other processors
have two or more tasks to execute. Algorithms which try to avoid unshared states (states in
which some processor lies idle while tasks contend for service at other processors) are called
dynamic load sharing algorithms. Note that load balancing (dynamic scheduling) algorithms
also try to avoid unshared states but go a step beyond load sharing by attempting to equalize
the loads on all processors.
i. Sensor based methods use thermal sensors on the processor to obtain real-time
temperature. It may be noted that efficient placement of a limited number of thermal
sensors on the processor chip (for producing a thermal profile) with their
associated overhead (sensor’s power and run-time performance overhead) is as
important as the sensing (temperature estimation).
ii. Thermal model based methods use thermal models for estimating the temperature
based on chip’s hardware characteristics, power consumption and ambient
temperature.
iii. Performance counter based methods estimate the temperature using the values read
from specific registers (counters). A processor has many functional units each
consuming a different amount of power per access. The access rates of these
functional units can be monitored using programmable performance counters. The
readings of the counters which reflect the activity information of functional units
are used to estimate the power consumption and also temperature using a simple
thermal model.
10.2.1 Threads
It should be noted that hardware multithreading means the use of multiple-context (for
example, multithreaded) processors to hide memory latency as discussed in Chapter 3. What
we consider here is software multithreading which is a different concept. Hardware threads
means physical resources, whereas software threads, as we discuss in this section, are those
created by application (user) programs. The Intel Core i7 processor has four cores and each
core can execute two software threads at the same time using hyper-threading technology.
Thus, the processor has eight hardware threads which can execute up to eight software threads
simultaneously. The application program can create any number of software threads. It is the
responsibility of the operating system to assign software threads to hardware threads for the
execution of the application program.
In operating systems with threads facility, a process consists of an address space and one
or more threads, which are executable (i.e., schedulable) entities, as shown in Fig. 10.18.
Each thread of a process has its own program counter, its own register states and its own
stack. But all the threads of a process share the same address space. In addition, they also
share the same set of operating system resources, such as open files, semaphores and signals.
Thus, switching between threads sharing the same address space is considerably cheaper than
switching between processes that have their own address spaces. Because of this, threads are
called lightweight processes. (In a similar way, a hardware thread can be called a
lightweight processor.) Threads can come and go quickly, without a great deal of system
overhead. Further, when a newly created thread accesses code and data that have recently
been accessed by other threads within its process, it automatically takes advantage of main
memory caching that has already taken place. Thus, threads are the right constructs, from an
operating system point of view, to efficiently support fine-grained parallelism on shared
memory parallel computers or on multicore processors.
Figure 10.18 Single-threaded and multithreaded processes.
An operating system that supports threads facility must provide a set of primitives, called
threads library or threads package, to its users for creating, terminating, synchronizing and
scheduling threads. Threads can be created statically or dynamically. In the static approach,
the number of threads of a process is decided at the time of writing the corresponding
program or when the program is compiled, and a fixed size of stack is allocated to each
thread. In the dynamic approach, a process is started with a single thread, new threads are
created (dynamically) during the execution of the process, and the stack size of a thread is
specified as a parameter to the system call for thread creation. Termination and
synchronization of threads is performed in a manner similar to those with conventional
(heavyweight) processes.
Implementing a Threads Library
A threads library can be implemented either in the user space or in the kernel. (The kernel
that is always in memory is the essential, indispensable program in an operating system
which directly manages system resources, handles exceptions and controls processes). In the
first approach, referred to as user-level approach, the user space consists of a run-time system
which is a collection of thread management routines. These routines are linked at run-time to
applications. Kernel intervention is not required for the management of threads. In the second
approach, referred to as kernel-level approach, there is no run-time system and the kernel
provides the operations for thread management. As we will see now both approaches have
their own advantages and limitations.
In the user-level approach, a threads library can be implemented on top of an existing
operating system that does not support threads. This means no changes are required in the
existing operating system kernel to support user-level threads. The threads library is a set of
application-level utilities shared by all applications. But this is not possible in the kernel-level
approach as all of the work of thread management is done by the kernel. In the user-level
approach, scheduling of threads can be application specific. Users have the flexibility to use
their own customized algorithm to schedule the threads of a process. This is not possible in
kernel level approach as a scheduler is already built into the kernel. However, users may have
the flexibility to select an algorithm, through system call parameters, from a set of already
implemented scheduling algorithms.
Thread switching in user-level approach is faster as it does not require kernel mode
privileges as all of the thread management data structures are within the user space. This
saves the overhead of two mode switches (user to kernel and kernel back to user). A serious
drawback associated with the user-level approach is that a multithreaded application cannot
take advantage of multiprocessing. The kernel assigns one process to only one processor at a
time. Therefore, only a single thread within a process can execute at a time and further, there
is no way to interrupt the thread. All this is due to lack of clock interrupts within a single
process. Note that this problem is not there in kernel-level approach. A way to solve this
problem is to have the run-time system request a clock interrupt (after every fixed unit of
time) to give it control so that the scheduler can decide whether to continue to run the same
thread or switch to another thread.
Another drawback of the user-level approach is as follows. In a typical operating system,
most system calls are blocking. Thus, when a thread executes a system call, not only is that
thread blocked but all threads of its process are blocked and the kernel will schedule another
process to run. This defeats the basic purpose of using threads. Again, note that this problem
is not there in kernel-level approach. A way to solve this problem of blocking threads is to
use a technique called jacketing. The purpose of jacketing is to convert a blocking system call
into a non-blocking system call. For example, instead of directly calling a system I/O routine,
a thread calls an application-level I/O jacket routine. This jacket routine contains a code
which checks whether the I/O device is busy. If it is, the thread enters the ready state and
passes control (through the threads library) to another thread. When this thread later is given
control again, it checks the I/O device again.
Threads Scheduling
Threads library often provides calls to give users or application programmers the flexibility to
schedule the threads. We now describe a few threads scheduling algorithms that may be
supported by a threads library.
1. FCFS or Round-robin: Here the threads are scheduled on a first come, first served
basis or using the round-robin method where the CPU cycles are equally
timeshared among the threads on a quantum-by-quantum basis. A fixed-length
time quantum may not be appropriate on a parallel computer as there may be fewer
runnable threads than the available processors. Therefore, instead of using a fixed-
length time quantum, a scheduling algorithm may vary the size of the time
quantum inversely with the total number of threads in the system.
2. Hand-off Scheduling: Here a thread can choose its successor or the next thread for
running, i.e., the current thread hands off the processor to another thread thereby
eliminating the scheduler interference (or bypassing the queue of runnable
threads). Hand-off scheduling has been shown to perform better when program
synchronization is exploited (for example, the requester thread hands-off the
processor to the holder of a lock) and when inter-process communication takes
place (for example, the sender hands the processor off to a receiver).
3. Affinity Scheduling: Here a thread is scheduled on a processor on which it last
executed hoping that part of its address space (working set) is still present in the
processor’s cache.
4. Coscheduling (Gang Scheduling): The goal of coscheduling is to achieve a high
degree of simultaneous execution of threads belonging to a single program or
application. A coscheduling algorithm schedules the runnable threads of an
application to run simultaneously on different processors. Application preemption
implies the simultaneous preemption of all its threads. Effectively, the system
context switches between applications. Since the overhead involved here can be
significant, processors are often dedicated for periods longer than the typical time-
sharing quantum to amortize this overhead.
In order to provide portability of threaded programs, IEEE has defined a POSIX (Portable
Operating System Interface for Computer Environments) threads standard called Pthreads.
The threads library Pthreads defines function calls to create, schedule, synchronize and
destroy threads. Other threads libraries include Win32 threads and Java threads. Most
operating systems support kernel-level threads.
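A minimal Pthreads example of creating and joining threads (error handling omitted):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    long id = (long)arg;                 /* thread id passed at creation   */
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);   /* create      */

    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);      /* wait for each thread to finish */

    return 0;
}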
Threads Execution Models
A parallel program has one or more processes, each of which contains one or more threads
and each of these threads is mapped to a processor by the scheduler in the operating system.
There are three mapping models used between threads and processors: (i) many-to-one (M:1
or User-level threading), (ii) one-to-one (1:1 or Kernel-level threading) and (iii) many-to-
many (M:N or Hybrid threading), as shown in Fig. 10.19.
Figure 10.19 Mapping of threads to processors/cores.
In the M:1 model, many user-level threads are mapped to a single kernel-level thread.
Since a single kernel-level thread (schedulable entity) can operate only on a single processor,
this model does not allow the execution of different threads of one process on different
processors. This means only process-level parallelism is possible here, but not thread-level.
The library scheduler determines which thread gets priority. This is also called
cooperative multithreading. This model is used in systems that do not support kernel-level
threads. Some examples of this model are Solaris Green threads and GNU Portable threads.
In the 1:1 model, each user-level thread maps to a kernel-level thread. This facilitates the
parallel execution within a single process. This model does not require the threads library
scheduler and the operating system does the thread scheduling. This is also called
preemptive multithreading. A serious drawback of this model is that the creation of every user-level thread of a process requires the creation of a corresponding kernel-level thread; further, the mode switches needed to transfer control across the user-level threads of the same process cause a significant overhead, which affects the performance of the parent process. Some operating systems address this problem by limiting
the growth of the thread count. Some examples of this model are Linux, Windows
NT/XP/2000, and Solaris 9 and later.
In the M:N mapping model, a number of user-level threads are mapped to an equal or
smaller number of kernel-level threads. At different points in time, a user-level thread may be
mapped to a different kernel-level thread and correspondingly a kernel-level thread may
execute different user-level threads at different points in time. Some examples of this model
include Windows NT/2000 with ThreadFiber package and Solaris prior to version 9.
A variation of M:N mapping model, an example of which is HP-UX, is the two-tier model
which allows either M:N or 1:1 operation. It is important to note that in the M:1 model, if a
user-level thread blocks (due to a blocking system call for I/O or a page fault) then the entire
process blocks, even if the other user-level threads of the same process would otherwise be
able to continue. Both 1:1 and M:N models overcome this problem but they require the
implementation of threads library at the kernel-level. The M:N mapping model is flexible as
the programmer can adapt the number of user-level threads depending on the parallel
algorithm (application) used. The operating system can select the number of kernel-level
threads in a way that results in efficient utilization of the processors. A programmer, for
example by selecting a scheduling algorithm using Pthreads library, can influence only the
library scheduler but not the scheduler of the operating system which is fair, starvation-free
and tuned for an efficient utilization of the processors.
It is clear by now that the creation of user-level threads is necessary to exploit fine-grained
parallelism in a program (application). Since threads of one process share the address space
of the process, if one thread modifies a value in the shared address space, the effect is visible
to all threads of the same process. A natural programming model for shared memory parallel
computers or multicore processors/systems is a thread model in which all threads have access
to shared variables. As mentioned earlier, to coordinate access to shared variables,
synchronization of threads is performed in a manner similar to those with conventional
(heavyweight) processes described in the next section.
10.3 PROCESS SYNCHRONIZATION
The execution of a parallel program on a parallel computer (multiprocessor) may require
processors to access shared data structures and thus may cause the processors to concurrently
access a location in the shared memory. Hence a mechanism is needed to serialize this access
to shared data structures to guarantee its correctness. This is the classic mutual exclusion
problem.
In order to realize mutual exclusion, multiprocessor operating systems provide instructions
to atomically read and write a single memory location (in the main memory). If the operation
on the shared data is very elementary (for example, an integer increment), it can be embedded
in a single atomic machine instruction. In such a case, mutual exclusion can be realized
completely in hardware. However, if an access to a shared data structure constitutes several
instructions (critical section), then primitives such as lock and unlock are needed to ensure
mutual exclusion. In such a case, the acquisition of a lock itself entails performing an
elementary operation on a shared variable (which indicates the status of the lock). Atomic
machine instructions can be used to implement lock/unlock operations. We now first present
one such machine (hardware) instruction called test-and-set and then describe how it can be
used to implement lock/unlock operations.
The test-and-set instruction atomically reads and modifies the contents of a memory
location in one memory cycle. It is defined as follows (variable m is a memory location):
The test-and-set instruction returns the current value of variable m and sets it to true. This
instruction can be used to implement lock and unlock operations in the following way (S, a
binary semaphore—a non-negative integer variable, is also called a lock):
lock(S): while test-and-set (S) do nothing;
unlock(S): S: = false;
Initially, S is set to false. When a lock(S) operation is executed for the first time, test-and-
set(S) returns a false value (and sets S to true) and the “while” loop of the lock(S) operation
terminates. All subsequent executions of lock(S) keep looping because S is true until an
unlock(S) operation is executed.
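A sketch of the same lock/unlock scheme in C, using the C11 atomic_flag type, whose atomic_flag_test_and_set operation atomically returns the old value of the flag and sets it to true, exactly as described above:

#include <stdatomic.h>

typedef atomic_flag spinlock_t;     /* initialise with ATOMIC_FLAG_INIT   */

static void lock(spinlock_t *S)
{
    while (atomic_flag_test_and_set(S))
        ;                           /* busy-wait until the lock opens     */
}

static void unlock(spinlock_t *S)
{
    atomic_flag_clear(S);           /* S := false                         */
}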
In the above implementation of a lock operation, several processors may wait for the lock,
S, to open by executing the respective atomic machine language instructions concurrently.
This wait can be implemented in two ways:
Figure 10.20 Independent disks. Each column represents physically sequential blocks on a single disk.
A0 and B0 represent blocks of data in separate disk address spaces.
10.22 Give an example of transactional program that deadlocks. (Hint: consider footprints
of the following transactions shown below.)
10.23 Give an example to show that blindly replacing locks/unlocks with Transaction
Begin/Transaction End may result in unexpected behavior.
10.24 Consider the following two concurrent threads. What are the final values of x and y?
10.25 It appears that input/output operations cannot be supported with transactions. (For
example, there is no way to revert displaying some text on the screen for the user to
see.) Suggest a way to do this. (Hint: buffering)
10.26 When transactions conflict, contention resolution chooses which will continue and
which will wait or abort. Can you think of some policies for contention resolution?
10.27 Prepare a table which shows the advantages and disadvantages of HTM and STM
systems.
10.28 A technique related to transactional memory system is lock elision. The idea here is to
design hardware to handle locks more efficiently, instead of enabling transactions. If
two threads are predicted to only read locked data, the lock is removed (elided) and the
two threads can be executed in parallel. Similar to transactional memory system, if a
thread modifies data, the processor must rollback and re-execute the thread using locks
for correctness. What are the advantages and disadvantages of lock elision?
10.29 What governs the tradeoff between I/O request rate (i.e., access concurrency) and
data transfer rate (i.e., transfer parallelism) in a RAID?
10.30 Which RAID (level 3 or level 5) organization is more appropriate for applications
that require high I/O request rate?
10.31 Which RAID (level 3 or level 5) organization is more appropriate for applications
such as CAD and imaging that require large I/O request size?
10.32 What is the minimum number of disks required for RAID levels 5 and 10?
10.33 What are some inherent problems with all RAIDs? (Hints: (i) correlated failures and
(ii) data integrity)
10.34 Another approach, apart from RAID, for improving I/O performance is to use the
main memory of idle machines on a network as a backing store for the machines that
are active, thereby reducing disk I/O. Under what conditions is this approach effective?
(Hint: Network latency.)
10.35 The most common technique used to reduce disk accesses is the disk cache. The
cache, which is a buffer in the main memory, contains a copy of some of the sectors
(blocks) on the disk. What are the issues that need to be addressed in the design of disk
caches?
BIBLIOGRAPHY
Adl-Tabatabai, A., Kozyrakis, C. and Saha, B., “Unlocking Concurrency”, ACM QUEUE,
Vol. 4, No. 10, Dec.–Jan. 2006–2007, pp. 24–33.
Akhter, S. and Roberts, J., Multi-Core Programming, Intel Press, Hillsboro, 2006.
Antony Louis Piriyakumar, D. and Siva Ram Murthy, C., “Optimal Compile-time Multi-
processor Scheduling Based on the 0–1 Linear Programming Algorithm with the Branch
and Bound Technique”, Journal of Parallel and Distributed Computing, Vol. 35, No. 2,
June 1996, pp. 199–204.
Casavant, T.L., Tvrdik, P. and Plasil, F. (Eds.), Parallel Computers: Theory and Practice,
IEEE Computer Society Press, 1996.
Cascaval, C., Blundell, C., Michael, M., Cain, H.W., Wu, P., Chiras, S. and Chatterjee, S.,
“Software Transactional Memory: Why Is It Only a Research Toy?” ACM QUEUE, Vol.
6, No. 5, Sept. 2008, pp. 46–58.
Chen, P.M., Lee, E.K., Gibson, G.A., Katz, R.H. and Patterson, D.A., “RAID:
High Performance, Reliable Secondary Storage”, ACM Computing Surveys, Vol. 26, No.
2, June 1994, pp. 145–185.
Chisnall, D., “Understanding Hardware Transactional Memory in Intel’s Haswell
Architecture”, 2013. http://www.quepublishing.com/articles/article.aspx?p=2142912
El-Rewini, H., Lewis, T.G. and Ali, H.H., Task Scheduling in Parallel and Distributed
Systems, Prentice-Hall, NJ, USA, 1994.
Ganger, G.R., Worthington, B.L., Hou, R.Y. and Patt, Y.N., “Disk Arrays: High-
Performance, High-Reliability Storage Subsystems”, IEEE Computer, Vol. 27, No. 3,
Mar. 1994, pp. 30–36.
Grahn, H., “Transactional Memory”, Journal of Parallel and Distributed Computing, Vol.
70, No. 10, Oct. 2010, pp. 993–1008.
Harris, T., Larus, J. and Rajwar, R., Transactional Memory, Morgan & Claypool Publishers,
CA, USA, 2010.
Hwang, K. and Xu, Z., Advanced Computer Architecture: Parallelism, Scalability,
Programmability, WCB/McGraw-Hill, USA, 1998.
Kong, J., Chung, S.W. and Skadron, K., “Recent Thermal Management Techniques for
Microprocessors”, ACM Computing Surveys, Vol. 44, No. 3, June 2012.
Krishnan, C.S.R., Antony Louis Piriyakumar, D. and Siva Ram Murthy, C., “A Note on Task
Allocation and Scheduling Models for Multiprocessor Digital Signal Processing”, IEEE
Trans. on Signal Processing, Vol. 43, No. 3, Mar. 1995, pp. 802–805.
Loh, P.K.K., Hsu, W.J., Wentong, C. and Sriskanthan, N., “How Network Topology Affects
Dynamic Load Balancing”, IEEE Parallel & Distributed Technology, Vol. 4, No. 3, Fall
1996, pp. 25–35.
Manimaran, G. and Siva Ram Murthy, C., “An Efficient Dynamic Scheduling Algorithm for
Multiprocessor Real-time Systems”, IEEE Trans. on Parallel and Distributed Systems,
Vol. 9, No. 3, Mar. 1998, pp. 312–319.
Rauber, T. and Runger, G., Parallel Programming for Multicore and Cluster Systems,
Springer-Verlag, Germany, 2010.
Rosario, J.M. and Choudhary, A.N., “High-Performance I/O for Massively Parallel
Computers—Problems and Prospects”, IEEE Computer, Vol. 27, No. 3, Mar. 1994, pp.
59–68.
Selvakumar, S. and Siva Ram Murthy, C., “Scheduling of Precedence Constrained Task
Graphs with Non-negligible Inter-task Communication onto Multiprocessors”, IEEE
Trans. on Parallel and Distributed Systems, Vol. 5, No. 3, Mar. 1994, pp. 328–336.
Shirazi, B.A., Hurson, A.R. and Kavi, K.M. (Eds.), Scheduling and Load Balancing in
Parallel and Distributed Systems, IEEE Computer Society Press, 1995.
Singh, A.K., Shafique, M., Kumar, A. and Henkel, J., “Mapping on Multi/Many-core
Systems: Survey of Current and Emerging Trends”, Proceedings of the ACM Design
Automation Conference, 2013.
Sinha, P.K., Distributed Operating Systems, IEEE Computer Society Press, USA, 1997.
Siva Ram Murthy, C. and Rajaraman, V., “Task Assignment in a Multiprocessor System”,
Microprocessing and Microprogramming, the Euromicro Journal, Vol. 26, No. 1, 1989,
pp. 63–71.
Sreenivas, A., Balasubramanya Murthy, K.N. and Siva Ram Murthy, C., “Reverse
Scheduling—An Effective Method for Scheduling Tasks of Parallel Programs Employing
Divide-and-Conquer Strategy onto Multiprocessors”, Microprocessors and Microsystems
Journal, Vol. 18, No. 4, May 1994, pp. 187–192.
Wang, A., Gaudet, M., Wu, P., Amaral, J.N., Ohmacht, M., Barton, C., Silvera, R. and
Michael, M., “Evaluation of Blue Gene/Q Hardware Support for Transactional Memory”,
Proceedings of the ACM International Conference on Parallel Architectures and
Compilation Techniques, 2012, pp. 127–136.
Yoo, R.M., Hughes, C.J., Rajwar, R. and Lai, K., “Performance Evaluation of Intel
Transactional Synchronization Extensions for High-performance Computing”,
Proceedings of the International Conference for High Performance Computing,
Networking, Storage and Analysis, 2013.
Zhuravlev, S., Saez, J.C., Blagodurov, S., Fedorova, A. and Prieto, M., “Survey of
Scheduling Techniques for Addressing Shared Resources in Multicore Processors”, ACM
Computing Surveys, Vol. 45, No. 1, Nov. 2012.
Zhuravlev, S., Saez, J.C., Blagodurov, S., Fedorova, A. and Prieto, M., “Survey of Energy-
Cognizant Scheduling Techniques”, IEEE Trans. on Parallel and Distributed Systems,
Vol. 24, No. 7, July 2013, pp. 1447–1464.
11
Performance Evaluation of Parallel Computers
A sequential algorithm is usually evaluated in terms of its execution time which is expressed
as a function of its input size. On the other hand, the execution time of a parallel algorithm
depends not only on the input size but also on the parallel architecture and the number of
processors employed. Hence, a parallel algorithm cannot be evaluated in isolation from a parallel
computer. A parallel computing system should be viewed as a combination of a parallel
algorithm and the parallel computer on which it is implemented. In this chapter, we first
introduce various metrics, standard measures, and benchmarks for evaluating the
performance of a parallel computing system. Then we study the sources of parallel overhead
followed by speedup performance laws and scalability principles. Finally, we briefly mention
the need for performance analysis tools which are essential to optimize an application’s
performance.
11.1 BASICS OF PERFORMANCE EVALUATION
In this section, we introduce the metrics and measures for analyzing the performance of
parallel computing systems.
Note that speedup metric is a quantitative measure of performance gain that is achieved by
multiple processor implementation of a given program over a single processor
implementation. In other words, it captures the relative benefit of running a program (or
solving a problem) in parallel.
For a given problem, there may be more than one sequential implementation (algorithm).
When a single processor system is used, it is natural to employ a (sequential) algorithm that
solves the problem in minimal time. Therefore, given a parallel algorithm it is appropriate to
assess its performance with respect to the best sequential algorithm. Hence, in the above
definition of speedup, T(1) denotes the time taken to run a program using the fastest known
sequential algorithm.
Now consider the problem of adding numbers using an n-processor hypercube. Initially,
each processor is assigned one of the numbers to be added. At the end of the computation,
one of the processors accumulates the sum of all the numbers. Assuming that n is a power of
2, we can solve this problem on an n-processor hypercube in log (n) steps. Figure 11.1
illustrates a solution, for n = 8, which has three (= log (8)) steps. Each step, as shown in Fig.
11.1, consists of one addition and the communication of a single message between two
processors which are directly connected.
Thus,
T(n) = O(log n)
Since the problem can be solved in O(n) time on a single processor, its speedup is S(n) = T(1)/T(n) = O(n/log n).
Superlinear Speedup
Theoretically, speedup can never exceed the number of processors, n, employed in solving a
problem. A speedup greater than n is possible only when each processor spends less than
T(1)/n units of time for solving the problem. In such a case, a single processor can emulate
the n-processor system and solve the problem in less than T(1) units of time, which
contradicts the hypothesis that the best sequential algorithm was used in computing the
speedup. Occasionally, claims are made of speedup greater than n, superlinear speedup. This
is usually due to (i) non optimal sequential algorithm, (ii) hardware characteristics, and (iii)
speculative computation which put a sequential algorithm at a disadvantage.
(a) Superlinear Speedup Due to Non-optimal Sequential Algorithm: Consider the problem
of setting the n elements of an array to zero by both a uniprocessor and an n-processor
parallel computer. The parallel algorithm simply gives one element to each processor
and sets each element to zero. But the sequential algorithm incurs the overhead of loop
construct (i.e., incrementing the loop index variable and checking its bounds) which is
absent in the parallel algorithm. Thus the time taken by the parallel algorithm is less
than (1/n)th of the time taken by the sequential algorithm.
Figure 11.2 Parallel DFS generating fewer nodes than sequential DFS.
The efficiency, E(n) = S(n)/n, gives the measure of the fraction of the time for which a processor is usefully
employed. Since 1 ≤ S(n) ≤ n, we have 1/n ≤ E(n) ≤ 1. An efficiency of one is often not
achieved due to parallel overhead. For the problem of adding numbers on an n-processor
hypercube, E(n) = O((n/log n)/n) = O(1/log n).
Now consider the same problem of adding n numbers on a hypercube where the number of
processors, p < n. In this case, each processor first locally adds its n/p numbers in O(n/p)
time. Then the problem is reduced to adding the p partial sums on p-processors. Using the
earlier method, this can be done in O(log(p)) time. Thus the parallel run time of this
algorithm is O(n/p + log(p)). Hence the efficiency in this case is E(p) = O((n/p)/((n/p) + log(p))) = O(n/(n + p log(p))).
Note that, for a fixed p, E(p) increases with increasing n as the computation at each
processor increases. Similarly, when n is fixed, E(p) decreases with increasing p as the
computation at each processor decreases.
Figures 11.3 to 11.5 show the relation between execution time, speedup, efficiency and the
number of processors used. In the ideal case, the speedup is expected to be linear, i.e., it
grows linearly with the number of processors, but in most cases, it falls due to the parallel
overhead.
NAS https://www.nas.nasa.gov/publications/npb.html
PARKBENCH http://www.netlib.org/parkbench/
PARSEC http://parsec.cs.princeton.edu/
Rodinia https://www.cs.virginia.edu/~skadron/wiki/rodinia/
SPLASH-2 http://www-flash.stanford.edu/apps/SPLASH
http://www.capsl.udel.edu/splash/
The performance results of benchmarks also depend on the compilers used. It is quite
possible for separate benchmarking runs on the same machine to produce disparate results,
because different (optimizing) compilers are used to produce the benchmarking object code.
Therefore, as part of recording benchmark results, it is important to identify the compiler and
the level of optimization used in the compilation.
11.2 SOURCES OF PARALLEL OVERHEAD
Parallel computers in practice do not achieve linear speedup or an efficiency of one, as
mentioned earlier, because of parallel overhead (the amount of time required to coordinate
parallel tasks as opposed to doing useful work). The major sources of overhead in a parallel
computer are inter-processor communication, load imbalance, inter-task synchronization, and
extra computation. All these are problem or application dependent.
Figures 11.7 and 11.8 show the speedup as a function of number of processors and α. Note
that the sequential operations will tend to dominate the speedup as n becomes very large. This
means, no matter how many processors are employed, the speedup in this problem is limited
to 1/α. This is called the sequential bottleneck of the problem. Again note that this sequential
bottleneck cannot be removed just by increasing the number of processors. As a result of this,
two major impacts on the parallel computing industry were observed. First, manufacturers
were discouraged from building large-scale parallel computers. Second, more research
attention was focused on developing optimizing (parallelizing) compilers which reduce the
value of α.
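As a small numerical illustration (assuming the usual fixed-work-load form S(n) = 1/(α + (1 – α)/n), which approaches 1/α as n grows; α = 0.05 is an arbitrary choice):

#include <stdio.h>

/* Fixed-workload (Amdahl) speedup for a program whose sequential
   fraction is alpha, evaluated for increasing processor counts. */
static double amdahl(double alpha, int n)
{
    return 1.0 / (alpha + (1.0 - alpha) / n);
}

int main(void)
{
    double alpha = 0.05;                       /* 5% sequential code        */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n = %4d  speedup = %6.2f\n", n, amdahl(alpha, n));
    printf("limit 1/alpha = %.2f\n", 1.0 / alpha);   /* sequential bottleneck */
    return 0;
}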
Figure 11.8 Effect of Amdahl’s law (for n = 1024; figure not to scale).
The major shortcoming in applying Amdahl’s law is that the total work load or the
problem size is fixed as shown in Fig. 11.9. Here Ws and Wp denote the sequential and
parallel work load, respectively. (Note that in Fig. 11.10, the execution time decreases with
increasing number of processors because of the parallel operations.) The amount of
sequential operations in a problem is often independent of, or increases slowly with, the
problem size, but the amount of parallel operations usually increases in direct proportion to
the size of the problem. Thus a successful method of overcoming the above shortcoming of
Amdahl’s law is to increase the size of the problem being solved.
Figure 11.9 Constant work load for Amdahl’s law.
A simple hardware model considered by Hill and Marty assumes that a multicore
processor contains a fixed number, say n, Base Core Equivalents (BCEs) in which either (i)
all BCEs implement the baseline core (homogeneous or symmetric multicore processor) or
(ii) one or more cores are more powerful than the others (heterogeneous or asymmetric
multicore processor). As an example, with a resource budget of n = 16 per chip, a
homogeneous multicore processor can have either 16 one-BCE cores or 4 four-BCE cores,
whereas a heterogeneous multicore processor can have either 1 four-BCE core (one
powerful core is realized by fusing 4 one-BCE cores) and 12 one-BCE cores or 1 nine-BCE
core and 7 one-BCE cores. In general, a homogeneous multicore processor has n/r r-BCE
cores and a heterogeneous chip has 1 + n – r cores because the single larger core uses r
resources leaving n – r resources for the one-BCE cores. The performance of a core is
assumed to be a function of number of BCEs it uses. The performance of a one-BCE is
assumed to be 1; perf(1) = 1 < perf(r) < r. The value of perf(r) is hardware/implementation
dependent and is assumed to be √r. This means r BCE resources give √r sequential
performance.
Homogeneous Multicore Processor
Sequential fraction (1 – f) uses 1 core at performance perf(r) and thus, sequential time = (1 –
f)/perf(r). Parallel fraction uses n/r cores at rate perf(r) each and thus, parallel time =
f/(perf(r) * (n/r)) = f * r/(perf(r) * n). The modified Amdahl's law is given by Speedup(f, n, r) = 1/[(1 – f)/perf(r) + f * r/(perf(r) * n)].
It was found that for small r, the processor performs poorly on sequential code and for
large r, it performs poorly on parallel code.
Heterogeneous Multicore Processor
Sequential time = (1 – f)/perf(r), as before. In parallel, 1 core at rate perf(r), and (n – r) cores
at rate 1 give parallel time = f/(perf(r) + n – r). The modified Amdahl's law in this case is
given by Speedup(f, n, r) = 1/[(1 – f)/perf(r) + f/(perf(r) + n – r)].
It was observed that speedups obtained are better than for homogeneous cores. In both the
cases, the fraction f should be as high as possible (just as followed from traditional Amdahl’s
law) for better speedups. This analysis ignores important effects of static and dynamic power,
as well as on-/off-chip memory system and interconnect design.
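A small sketch that evaluates the two speedup expressions above for an illustrative chip budget (n = 16 BCEs) and parallel fraction f = 0.9, with perf(r) = √r as assumed in the text:

#include <stdio.h>
#include <math.h>

static double symmetric(double f, int n, int r)
{
    double perf = sqrt((double)r);
    return 1.0 / ((1.0 - f) / perf + f * r / (perf * n));
}

static double asymmetric(double f, int n, int r)
{
    double perf = sqrt((double)r);
    return 1.0 / ((1.0 - f) / perf + f / (perf + n - r));
}

int main(void)
{
    int n = 16;          /* resource budget in BCEs (illustrative)          */
    double f = 0.9;      /* parallel fraction (illustrative)                */
    for (int r = 1; r <= n; r *= 2)
        printf("r = %2d  symmetric = %5.2f  asymmetric = %5.2f\n",
               r, symmetric(f, n, r), asymmetric(f, n, r));
    return 0;
}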
Amdahl’s Law for Energy-Efficient Computing on Many-core Processor
A many-core processor is designed in a way that its power consumption does not exceed its
power budget. For example, a 32-core processor with each core consuming an average of 20
watts leads to a total power consumption of 640 watts assuming all cores are active. As noted
in Chapter 5, multicore scaling faces a power wall. The number of cores that fit on a chip is
much more than the number that can operate simultaneously under the power budget. Thus,
power becomes more critical than performance in scaling up many-core processors.
Woo and Lee model power consumption for a many-core processor, assuming that one
core in active state consumes a power of 1. Sequential fraction (1 – f) consumes 1 + (n – 1)k
power, where n is the number of cores and k is the fraction of power the processor consumes
in idle state (0 ≤ k ≤ 1). Parallel fraction f uses n cores consuming n amount of power. As it
takes (1 – f) and f/n to execute the sequential and parallel code, respectively, the formula for
average power consumption (denoted by W) for a homogeneous many-core processor is W = [(1 – f)(1 + (n – 1)k) + f]/[(1 – f) + f/n].
Performance (reciprocal of execution time) per watt (Perf/W), which denotes the
performance achievable at the same cooling capacity, based on the average power (W), is Perf/W = 1/[(1 – f)(1 + (n – 1)k) + f].
Here, unfortunately, parallel execution consumes much more energy than sequential
execution to run the program. Only when f = 1 (ideal case), Perf/W attains the maximum
value of 1. A sequential execution consumes the same amount of energy as that of its parallel
execution version only when the performance improvement through parallelization scales
linearly. When parallel performance does not scale linearly, the many-core processor
consumes much more energy to run the program. A heterogeneous many-core processor
alternative is shown to achieve better energy efficiency.
Like in Amdahl’s law, if we assume that the parallel operations in the program achieve a
linear speedup (i.e., these operations using n processors take (1/n)th of the time taken to
perform them on one processor), then Tp(1, W) = n · Tp(n, W). Let α be the fraction of the
sequential work load in the problem, i.e., α = Ws/(Ws + Wp). The scaled (fixed execution time) speedup is then S′(n) = α + (1 – α)n.
For large n, S′(n) ≈ n(1 – α). A plot of speedup as a function of sequential work load in the
problem is shown in Fig. 11.13. Note that the slope of the S′(n) curve is much flatter than that
of S(n) in Fig. 11.8. This is achieved by keeping all the processors busy by increasing the
problem size. Thus according to Gustafson’s law, when the problem can scale to match the
available computing power, the sequential bottleneck will never arise.
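As a quick illustration (the numbers here are chosen only for illustration), take α = 0.05 and n = 1024, the machine size used in Fig. 11.13. The scaled speedup is S′(1024) = 0.05 + 0.95 × 1024 ≈ 973, whereas the fixed-size (Amdahl) speedup is only S(1024) = 1/(0.05 + 0.95/1024) ≈ 19.6.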
Weak Scaling
A program is said to be weakly scalable if we can keep the efficiency fixed by increasing the
problem size at the same rate as we increase the number of processors. While Amdahl’s law
predicts the performance of a parallel program under strong scaling, Gustafson’s law predicts
a parallel program’s performance under weak scaling.
Figure 11.13 Effect of Gustafson’s law (for n = 1024; figure not to scale).
Figure 11.14 Scaled work load for Sun and Ni’s law.
Figure 11.15 Slightly increasing execution time for Sun and Ni’s law.
Let M be the memory requirement of a given problem and W be its computational work
load. M and W are related to each other through the address space and the architectural
constraints. The execution time T, which is a function of number of processors (n) and the
work load (W), is related to the memory requirement M. Hence
T = g(M) or M = g^(-1)(T)
The total memory capacity increases linearly with the number of processors available. The
enhanced memory can be related to the execution time of the scaled work load by T* = g*(nM), where nM is the increased memory capacity of an n-processor machine.
We can write g*(nM) = G(n)g(M) = G(n) T(n, W), where T(n, W) = g(M) and g* is a
homogeneous function. The factor G(n) reflects the increase in the execution time as memory
increases n times. Using the sequential fraction α defined earlier, the speedup in this case is

S*(n) = [α + (1 – α)G(n)] / [α + (1 – α)G(n)/n]
Two assumptions are made while deriving the above speedup expression:
(i) A global address space is formed from all the individual memory spaces, i.e., there is a
distributed shared memory space.
(ii) All available memory capacity is used up for solving the scaled problem.
In scientific computations, a matrix often represents some discretized data continuum.
Increasing the matrix size generally leads to a more accurate solution for the continuum. For
matrices of dimension n, the number of computations involved in matrix multiplication is 2n^3 and the memory requirement is roughly M = 3n^2. As the memory increases n times in an n-processor multiprocessor system, nM = n × 3n^2 = 3n^3. If the enlarged matrix has dimension N, then 3N^2 = 3n^3. Therefore, N = n^1.5 and G(n) = n^1.5. Substituting G(n) = n^1.5 into the speedup expression above gives, for the scaled matrix multiplication problem,

S*(n) = [α + (1 – α)n^1.5] / [α + (1 – α)n^0.5]
Case 3: G(n) > n: This corresponds to the case where the computational work load (time)
increases faster than the memory requirement. Comparing the speedup factors S*(n) and S′(n)
for the scaled matrix multiplication problem, we realize that the memory-bound (or fixed
memory) model may give a higher speedup than the fixed execution time (Gustafson’s)
model.
The above analysis leads to the following conclusions: Amdahl’s law and Gustafson’s law
are special cases of Sun and Ni’s law. When computation grows faster than the memory
requirement, the memory-bound model may yield a higher speedup (i.e., S*(n) > S′(n) > S(n)) and better resource utilization.
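The ordering S*(n) > S′(n) > S(n) can be verified numerically. The C sketch below (an illustration using the speedup expressions reconstructed above, with α = 0.05 chosen arbitrarily) evaluates the three models for the scaled matrix multiplication problem, for which G(n) = n^1.5.

#include <stdio.h>
#include <math.h>

/* Fixed-size (Amdahl), fixed-time (Gustafson) and memory-bounded (Sun and Ni)
   speedups for sequential fraction a and work-growth factor g = G(n). */
static double amdahl(double a, double n)    { return 1.0 / (a + (1.0 - a) / n); }
static double gustafson(double a, double n) { return a + (1.0 - a) * n; }
static double sun_ni(double a, double n, double g)
{
    return (a + (1.0 - a) * g) / (a + (1.0 - a) * g / n);
}

int main(void)
{
    double a = 0.05;                          /* illustrative sequential fraction */
    for (double n = 4.0; n <= 1024.0; n *= 4.0) {
        double g = pow(n, 1.5);               /* G(n) for scaled matrix multiplication */
        printf("n = %5.0f  S = %7.2f  S' = %7.2f  S* = %8.2f\n",
               n, amdahl(a, n), gustafson(a, n), sun_ni(a, n, g));
    }
    return 0;
}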
11.4 SCALABILITY METRIC
Given that increasing the number of processors decreases efficiency and increasing the
amount of computation per processor increases efficiency, it should be possible to keep the
efficiency fixed by increasing both the size of the problem and the number of processors
simultaneously.
Consider the example of adding n numbers on a p-processor hypercube. Assume that it
takes one unit of time both to add two numbers and to communicate a number between two
processors which are directly connected. As seen earlier, the local addition takes n/p – 1 (i.e., O(n/p)) time. After this, the p partial sums are added in log(p) steps, each consisting of one addition and one communication. Thus the parallel run time of this problem is n/p – 1 + 2 log(p), and the expressions for speedup and efficiency can be approximated by

S(p) ≈ n / (n/p + 2 log(p)) = np / (n + 2p log(p))
E(p) = S(p)/p ≈ n / (n + 2p log(p))
From the above expression E(p), the efficiency of adding 64 numbers on a hypercube with
four processors is 0.8. If the number of processors is increased to 8, the size of the problem
should be scaled up to add 192 numbers to maintain the same efficiency of 0.8. In the same
way, if p (machine size) is increased to 16, n (problem size) should be scaled up to 512 to
result in the same efficiency. A parallel computing system (the combination of a parallel
algorithm and parallel machine on which it is implemented) is said to be scalable if its
efficiency can be kept fixed by simultaneously increasing the machine size and the problem size.
The scalability of a parallel system is a measure of its capacity to increase speedup in
proportion to the machine size.
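A small check of these numbers (an illustrative sketch, not from the text) is shown below; it evaluates E(p) = n/(n + 2p log2(p)) for the three (n, p) pairs mentioned above and prints 0.80 in each case.

#include <stdio.h>
#include <math.h>

/* Efficiency of adding n numbers on a p-processor hypercube,
   E(p) = n / (n + 2 p log2(p)), as approximated above. */
static double efficiency(double n, double p)
{
    return n / (n + 2.0 * p * log2(p));
}

int main(void)
{
    double n[] = {64.0, 192.0, 512.0};
    double p[] = {4.0, 8.0, 16.0};
    for (int i = 0; i < 3; i++)               /* each pair should give E = 0.80 */
        printf("n = %3.0f  p = %2.0f  E = %.2f\n", n[i], p[i], efficiency(n[i], p[i]));
    return 0;
}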
An isoefficiency function shows how the size of a problem must grow as a function of the
number of processors used in order to maintain some constant efficiency. We can derive a
general form of an isoefficiency function by using an equivalent definition of efficiency:

E = U / (U + O)
where U is the time taken to do the useful (essential work) computation and O is the parallel
overhead. Note that O is zero for sequential execution. If we fix the efficiency at some
constant value, K, then

K = U / (U + O), which gives U = (K / (1 – K)) O = K′O

where K′ = K/(1 – K) is a constant for the fixed efficiency K. This function is called the isoefficiency
function of the parallel computing system. If we wish to maintain a certain level of
efficiency, we must keep the parallel overhead, O, no worse than proportional to the total
useful computation time, U. A small isoefficiency function means that small increments in
the problem size (or U) are sufficient for the efficient utilization of an increasing number of
processors, indicating that the parallel system is highly scalable. On the other hand, a large
isoefficiency function indicates a poorly scalable system: to keep the efficiency constant in such a system, the problem size must grow explosively with the machine size (see Fig. 11.16).
Consider the example of adding n numbers on a p-processor hypercube. Assume that it
takes one unit of time both to add two numbers and to communicate a number between two
processors which are directly connected. The parallel run time of this problem, as seen
earlier, is approximately n/p + 2 log(p). Thus, U = n and O = 2p log(p), as each processor, in
the parallel execution, spends approximately n/p time for useful work and incurs an overhead
of 2 log(p) time. Substituting O with 2p log(p) in the above equation, we get
U = K′2p log(p)
Thus the isoefficiency for this parallel computing system is O(p log (p)). That means, if the
machine size is increased from p to p′, the problem size (n) must be increased by a
factor of (p′ log (p′))/(p log (p)) to obtain the same efficiency as with p processors. Finally,
note that the isoefficiency function does not exist for unscalable parallel computing systems.
This is because the efficiency in such systems cannot be kept at any constant value as
machine size increases, no matter how fast the problem size is increased.
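For instance, increasing the machine size from p = 4 to p′ = 16 requires the problem size to grow by a factor of (16 log 16)/(4 log 4) = 64/8 = 8, i.e., from 64 numbers to 512 numbers, which agrees with the figures obtained earlier in this section.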
Figure 11.16 Scalable system characteristics.
11.5 PERFORMANCE ANALYSIS
The main reason for writing parallel programs is speed. The factors which determine the
speed of a parallel program (application) are complex and inter-related. Some important ones
are given in Table 11.2. Once a parallel program has been written and debugged,
programmers then generally turn their attention to performance tuning of their programs.
Performance tuning is an iterative process used to optimize the efficiency of a program (or to
close the “gap” between hardware and software). Performance tuning usually involves (i) profiling, using software tools to record the amount of time spent in different parts of a program and its resource utilization, and (ii) finding the program’s hotspots (areas of code within the program that use a disproportionately high amount of processor time) and eliminating bottlenecks (areas of code within the program that use processor resources inefficiently, thereby causing unnecessary delays) in them.
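As a minimal illustration of step (i), a suspected hotspot can be timed by manual instrumentation before turning to a full profiler. The C sketch below is illustrative only (it is not a tool described in the text) and uses the POSIX clock_gettime() call; the loop stands in for an arbitrary candidate hotspot.

#include <stdio.h>
#include <time.h>

/* Elapsed wall-clock time in seconds between two timestamps. */
static double elapsed(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) * 1e-9;
}

int main(void)
{
    struct timespec t0, t1;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 1; i <= 100000000L; i++)    /* stands in for a candidate hotspot */
        sum += 1.0 / (double)i;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("hotspot took %.3f s (sum = %f)\n", elapsed(t0, t1), sum);
    return 0;
}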
The software tools, which are either platform specific (refers to a specific combination of
hardware and compiler and/or operating system) or cross-platform, provide insight to
programmers to help them understand why their programs do not run fast enough by
collecting and analyzing statistics about processor utilization rate, load index, memory access
patterns, cache-hit ratio, synchronization frequency, inter-processor communication
frequency and volume, I/O (disk) operations, compiler/OS overhead, etc. The statistics can be
used to decide whether a program needs to be improved and, in such cases, where in the code the main effort should be put. The design and development of performance tools that can pinpoint a variety of performance bottlenecks in complex applications running on parallel computers consisting of tens of thousands of nodes, each equipped with one or more multicore microprocessors, is a challenging problem. For such performance tools to be useful, the techniques used to measure and analyze the large volume of performance data, and to provide fine-grained details about application performance bottlenecks to the programmers, must scale to thousands of threads.
TABLE 11.2 Important Factors which Affect a Parallel Program’s Performance
EXERCISES

11.3 Based on the definitions given in the above problem, prove the following:
(a) 1 ≤ S(n) ≤ n
(b) U(n) = R(n) · E(n)
(c) 1/n ≤ E(n) ≤ U(n) ≤ 1
(d) 1/n ≤ R(n) ≤ 1/E(n) ≤ n
(e) Q(n) ≤ S(n) ≤ n
11.4 Visit SPEC website to know more about the benchmark suites SPEC ACCEL, SPEC
MPI2007 and SPEC OMP2012 and prepare a short summary, for each of these, about
programming language used, computations/application performed/included and
information obtained by running the benchmarks.
11.5 What are the different types of programs contained in the SPLASH-2 and PARSEC benchmark suites?
11.6 Consider the problem of sorting n numbers using the bubble sort algorithm. The worst case time complexity of solving it is O(n^2) on a single processor system. The same problem
is to be solved, using the same algorithm, on a two-processor system. Calculate the
speedup factor for n = 100, assuming unit constant of proportionality for both sorting
and merging. Comment on your result.
11.7 The execution times (in seconds) of four programs on three computers are given
below:
Program Computer A Computer B Computer C
Program 1 1 10 20
Assume that each of the programs has 10^9 floating point operations to be carried out.
Calculate the Mflops rating of each program on each of the three machines. Based on
these ratings, can you draw a clear conclusion about the relative performance of each of
the three computers?
11.8 Let α be the percentage of a program code which must be executed sequentially on a
single processor. Assume that the remaining part of the code can be executed in parallel
by all the n homogeneous processors. Each processor has an execution rate of x MIPS.
Derive an expression for the effective MIPS rate in terms of α, n, x assuming only this
program code is being executed.
11.9 A program contains 20% non-parallelizable code, with the remaining code being fully
parallelizable. How many processors are required to achieve a speedup of 3? Is it
possible to achieve a speedup of 5? What do you infer from this?
11.10 What limits the speedup in the case of (a) Amdahl’s Law and (b) Sun and Ni’s Law?
11.11 Extend Amdahl’s law to reflect the reality of multicore processors taking parallel
overhead H(n) into consideration. Assume that the best sequential algorithm runs in one
time unit and H(n) = overhead from operating system and inter-thread activities such as
synchronization and communication between threads.
11.12 Analyze multicore scalability under (i) fixed-time, and (ii) memory-bounded speedup
models.
11.13 Derive a formula for average power consumption for a heterogeneous many-core
processor.
11.14 Consider the problem of adding n numbers on a p-processor hypercube. Assume that
the base problem for p = 1 is that of adding 256 numbers. By what fraction should the
parallel portion of the work load be increased in order to maintain constant execution
time if the number of processors used is increased to (a) p = 4 and (b) p = 8?
11.15 Is it possible to solve an arbitrarily large problem in a fixed amount of time, provided
that unlimited processors are available? Why?
11.16 Consider the problem of adding n numbers on an n-dimensional hypercube. What
should be the size of the machine in order to achieve a good trade-off between
efficiency and execution time? (Hint: Consider the ratio E(n)/T(n).)
11.17 Consider the problem of computing the s-point FFT (Fast Fourier Transform), whose
work load is O(s log(s)), on two different parallel computers, hypercube and mesh. The
overheads for the problem on the hypercube and mesh with n-processors are
List of Abbreviations

CE Computing Element
CI Communication Interface
DE Decode instruction
EX Execute instruction
FI Fetch Instruction
GB Gigabyte
GE Gaussian Elimination
GHz Gigahertz
I/O Input/Output
IR Instruction Register
KB Kilobyte
kHz Kilohertz
MB Megabyte
MC Memory Controller
MHz Megahertz
MTTF Mean-Time-To-Failure
MUX Multiplexer
OS Operating System
PE Processing Element
RMW Read-Modify-Write
RW Regenerate-Write
SL Static Level
SM Shared Memory
SM Streaming Multiprocessor
SP Streaming Processor
SP Schedule Priority
TB Terabyte
TD Tag Directory
VM Virtual Machine
Index

Agenda parallelism, 23
Animation, 6
Anti dependency, 68
ARM Cortex A9 architecture, 75
ARM Cortex A9 multicore processor, 182
Array processors, 99, 119
Asynchronous message passing protocol, 159
Asynchronous send, 157
Atomic read modify write instruction, 117
Automatic parallelization, 379
Autotuning, 379
Cache coherence
directory scheme, 132
in DSM, 158
MESI protocol, 122
MOESI protocol, 127
in shared bus multiprocessors, 119, 121
Cache coherent non-uniform memory access, 148
Cache misses, 120
capacity, 120
cold, 120
conflict, 120
Chip multiprocessors (CMP), 174, 178
cache coherence, 181
definition of, 176
generalized structure, 176
using interconnection networks, 184
Cloud computing, 210
acceptance of, 217
advantages, 215
applications of, 217
characteristics of, 211
comparison with grid computing, 219
definition, 210
quality of service, 211
risks in using, 211, 216
services in, 214
types of, 213
Cloud services, 214
Coarse grain job, 20
Combinational circuit, 226
depth, 227
fan-in, 226
fan-out, 226
size, 227
width, 227
Communication optimization transformation, 370
collective communication, 372
message pipelining, 372
message vectorization, 371
Community cloud, 214
Comparison of grid and cloud computing, 219
Compiler transformation, 331
correctness, 332
scope, 333
Computation partitioning transformation, 369
bounds reduction, 370
guard overhead, 369
redundant guard elimination, 370
Computer cluster, 160
applications of, 161
using system area network, 161
Concurrentization, 336
Control dependence, 337
Core level parallelism, 173, 175
Cray supercomputer, 108
Critical path, 393
Critical section, 414
Cross bar connection, 138
CUDA, 307
block, 308
coalescing, 316
device (GPU), 307
grid, 308
heterogeneous computing with, 307
host (CPU), 307
kernel, 308
thread, 308
thread scheduling, 316
thread synchronization, 316
warp, 316
CUDA Core, 197, 198
Hadoop, 320
ecosystem, 322
HDFS, 321
MapReduce implementation framework, 321
Hardware register forwarding, 69
Heat dissipation in processors, 173
components of, 173
Heterogeneous element processor, 88
Hybrid cloud, 214
IA 64 processor, 79
Ideal pipelining, 39
ILP wall, 174
Infrastructure as a Service (IaaS), 214
Instruction level parallelism, 72
Instruction scheduling, 372
Instruction window, 71
committing instruction, 72
Intel Core i7 processor, 76
Intel i7 multicore processor, 182
Intel Teraflop chip, 195
Intel Xeon–Phi coprocessor, 191
Interconnection networks, 137
2D, 144, 145
2D grid, 143
hypercube, 144, 145
multistage, 139
N cube, 145
parameters of, 146
ring, 143
router, 146
Inter-process communication, 420
Isoefficiency, 460
Join, 115
Machine parallelism, 72
loop unrolling, 73
MapReduce, 318
Matrix multiplication, 256
on interconnection network, 258
on PRAM, 257
Memory access transformation, 359
array padding, 360
scalar expansion, 360
Memory consistency model, 128
relaxed consistency, 130
Memory management, 420
DRAM, 421
SRAM, 421
Mesh connected many core processors, 193
MESI protocol, 122
Message passing parallel computer, 156
Middleware, 214
Models of computation, 222
combinational circuit, 226
interconnection network, 225
PRAM, 223
RAM, 223
MOESI protocol, 127
Moore’s law, 173
Motivation for multicore parallelism, 175
MPI, 283
collective communication, 290
derived datatype, 286
extension, 293
message buffer, 285
message communicator, 288
message envelope, 287
message passing, 285
message tag, 287
packing, 286
point-to-point communication, 289
virtual topology, 293
Multicore processor, 1
Multistage interconnection network
butterfly, 142
three stage cube, 140
three stage omega, 141
Multithreaded processors, 82
coarse grained, 82
comparison of, 89
fine grained, 85
simultaneous multithreading, 88
YARN, 323