PARALLEL COMPUTERS
Architecture and Programming
SECOND EDITION
V. RAJARAMAN
Honorary Professor
Supercomputer Education and Research Centre
Indian Institute of Science Bangalore
C. SIVA RAM MURTHY
Richard Karp Institute Chair Professor
Department of Computer Science and Engineering
Indian Institute of Technology Madras
Chennai
Delhi-110092
2016
PARALLEL COMPUTERS: Architecture and Programming, Second Edition
V. Rajaraman and C. Siva Ram Murthy
© 2016 by PHI Learning Private Limited, Delhi. All rights reserved. No part of this book may be reproduced in any form, by
mimeograph or any other means, without permission in writing from the publisher.
ISBN: 978-81-203-5262-9
The export rights of this book are vested solely with the publisher.
Published by Asoke K. Ghosh, PHI Learning Private Limited, Rimjhim House, 111, Patparganj Industrial Estate, Delhi-
110092 and Printed by Mohan Makhijani at Rekha Printers Private Limited, New Delhi-110020.
To
the memory of my dear nephew Dr. M.R. Arun
— V. Rajaraman
To
the memory of my parents, C. Jagannadham and C. Subbalakshmi
— C. Siva Ram Murthy
Table of Contents
Preface
1. Introduction
1.1 WHY DO WE NEED HIGH SPEED COMPUTING?
1.1.1 Numerical Simulation
1.1.2 Visualization and Animation
1.1.3 Data Mining
1.2 HOW DO WE INCREASE THE SPEED OF COMPUTERS?
1.3 SOME INTERESTING FEATURES OF PARALLEL COMPUTERS
1.4 ORGANIZATION OF THE BOOK
EXERCISES
BIBLIOGRAPHY
2. Solving Problems in Parallel
2.1 UTILIZING TEMPORAL PARALLELISM
2.2 UTILIZING DATA PARALLELISM
2.3 COMPARISON OF TEMPORAL AND DATA PARALLEL PROCESSING
2.4 DATA PARALLEL PROCESSING WITH SPECIALIZED PROCESSORS
2.5 INTER-TASK DEPENDENCY
2.6 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
3. Instruction Level Parallel Processing
3.1 PIPELINING OF PROCESSING ELEMENTS
3.2 DELAYS IN PIPELINE EXECUTION
3.2.1 Delay Due to Resource Constraints
3.2.2 Delay Due to Data Dependency
3.2.3 Delay Due to Branch Instructions
3.2.4 Hardware Modification to Reduce Delay Due to Branches
3.2.5 Software Method to Reduce Delay Due to Branches
3.3 DIFFICULTIES IN PIPELINING
3.4 SUPERSCALAR PROCESSORS
3.5 VERY LONG INSTRUCTION WORD (VLIW) PROCESSOR
3.6 SOME COMMERCIAL PROCESSORS
3.6.1 ARM Cortex A9 Architecture
3.6.2 Intel Core i7 Processor
3.6.3 IA-64 Processor Architecture
3.7 MULTITHREADED PROCESSORS
3.7.1 Coarse Grained Multithreading
3.7.2 Fine Grained Multithreading
3.7.3 Simultaneous Multithreading
3.8 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
4. Structure of Parallel Computers
4.1 A GENERALIZED STRUCTURE OF A PARALLEL COMPUTER
4.2 CLASSIFICATION OF PARALLEL COMPUTERS
4.2.1 Flynn’s Classification
4.2.2 Coupling Between Processing Elements
4.2.3 Classification Based on Mode of Accessing Memory
4.2.4 Classification Based on Grain Size
4.3 VECTOR COMPUTERS
4.4 A TYPICAL VECTOR SUPERCOMPUTER
4.5 ARRAY PROCESSORS
4.6 SYSTOLIC ARRAY PROCESSORS
4.7 SHARED MEMORY PARALLEL COMPUTERS
4.7.1 Synchronization of Processes in Shared Memory Computers
4.7.2 Shared Bus Architecture
4.7.3 Cache Coherence in Shared Bus Multiprocessor
4.7.4 MESI Cache Coherence Protocol
4.7.5 MOESI Protocol
4.7.6 Memory Consistency Models
4.7.7 Shared Memory Parallel Computer Using an Interconnection Network
4.8 INTERCONNECTION NETWORKS
4.8.1 Networks to Interconnect Processors to Memory or Computers to Computers
4.8.2 Direct Interconnection of Computers
4.8.3 Routing Techniques for Directly Connected Multicomputer Systems
4.9 DISTRIBUTED SHARED MEMORY PARALLEL COMPUTERS
4.9.1 Cache Coherence in DSM
4.10 MESSAGE PASSING PARALLEL COMPUTERS
4.11 COMPUTER CLUSTER
4.11.1 Computer Cluster Using System Area Networks
4.11.2 Computer Cluster Applications
4.12 WAREHOUSE SCALE COMPUTING
4.13 SUMMARY AND RECAPITULATION
EXERCISES
BIBLIOGRAPHY
5. Core Level Parallel Processing
5.1 CONSEQUENCES OF MOORE’S LAW AND THE ADVENT OF CHIP MULTIPROCESSORS
5.2 A GENERALIZED STRUCTURE OF CHIP MULTIPROCESSORS
5.3 MULTICORE PROCESSORS OR CHIP MULTIPROCESSORS (CMPs)
5.3.1 Cache Coherence in Chip Multiprocessor
5.4 SOME COMMERCIAL CMPs
5.4.1 ARM Cortex A9 Multicore Processor
5.4.2 Intel i7 Multicore Processor
5.5 CHIP MULTIPROCESSORS USING INTERCONNECTION NETWORKS
5.5.1 Ring Interconnection of Processors
5.5.2 Ring Bus Connected Chip Multiprocessors
5.5.3 Intel Xeon Phi Coprocessor Architecture [2012]
5.5.4 Mesh Connected Many Core Processors
5.5.5 Intel Teraflop Chip [Peh, Keckler and Vangal, 2009]
5.6 GENERAL PURPOSE GRAPHICS PROCESSING UNIT (GPGPU)
EXERCISES
BIBLIOGRAPHY
6. Grid and Cloud Computing
6.1 GRID COMPUTING
6.1.1 Enterprise Grid
6.2 CLOUD COMPUTING
6.2.1 Virtualization
6.2.2 Cloud Types
6.2.3 Cloud Services
6.2.4 Advantages of Cloud Computing
6.2.5 Risks in Using Cloud Computing
6.2.6 What has Led to the Acceptance of Cloud Computing
6.2.7 Applications Appropriate for Cloud Computing
6.3 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
7. Parallel Algorithms
7.1 MODELS OF COMPUTATION
7.1.1 The Random Access Machine (RAM)
7.1.2 The Parallel Random Access Machine (PRAM)
7.1.3 Interconnection Networks
7.1.4 Combinational Circuits
7.2 ANALYSIS OF PARALLEL ALGORITHMS
7.2.1 Running Time
7.2.2 Number of Processors
7.2.3 Cost
7.3 PREFIX COMPUTATION
7.3.1 Prefix Computation on the PRAM
7.3.2 Prefix Computation on a Linked List
7.4 SORTING
7.4.1 Combinational Circuits for Sorting
7.4.2 Sorting on PRAM Models
7.4.3 Sorting on Interconnection Networks
7.5 SEARCHING
7.5.1 Searching on PRAM Models
7.5.2 Searching on Interconnection Networks
7.6 MATRIX OPERATIONS
7.6.1 Matrix Multiplication
7.6.2 Solving a System of Linear Equations
7.7 PRACTICAL MODELS OF PARALLEL COMPUTATION
7.7.1 Bulk Synchronous Parallel (BSP) Model
7.7.2 LogP Model
7.8 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
8. Parallel Programming
8.1 MESSAGE PASSING PROGRAMMING
8.2 MESSAGE PASSING PROGRAMMING WITH MPI
8.2.1 Message Passing Interface (MPI)
8.2.2 MPI Extensions
8.3 SHARED MEMORY PROGRAMMING
8.4 SHARED MEMORY PROGRAMMING WITH OpenMP
8.4.1 OpenMP
8.5 HETEROGENEOUS PROGRAMMING WITH CUDA AND OpenCL
8.5.1 CUDA (Compute Unified Device Architecture)
8.5.2 OpenCL (Open Computing Language)
8.6 PROGRAMMING IN BIG DATA ERA
8.6.1 MapReduce
8.6.2 Hadoop
8.7 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
9. Compiler Transformations for Parallel Computers
9.1 ISSUES IN COMPILER TRANSFORMATIONS
9.1.1 Correctness
9.1.2 Scope
9.2 TARGET ARCHITECTURES
9.2.1 Pipelines
9.2.2 Multiple Functional Units
9.2.3 Vector Architectures
9.2.4 Multiprocessor and Multicore Architectures
9.3 DEPENDENCE ANALYSIS
9.3.1 Types of Dependences
9.3.2 Representing Dependences
9.3.3 Loop Dependence Analysis
9.3.4 Subscript Analysis
9.3.5 Dependence Equation
9.3.6 GCD Test
9.4 TRANSFORMATIONS
9.4.1 Data Flow Based Loop Transformations
9.4.2 Loop Reordering
9.4.3 Loop Restructuring
9.4.4 Loop Replacement Transformations
9.4.5 Memory Access Transformations
9.4.6 Partial Evaluation
9.4.7 Redundancy Elimination
9.4.8 Procedure Call Transformations
9.4.9 Data Layout Transformations
9.5 FINE-GRAINED PARALLELISM
9.5.1 Instruction Scheduling
9.5.2 Trace Scheduling
9.5.3 Software Pipelining
9.6 TRANSFORMATION FRAMEWORK
9.6.1 Elementary Transformations
9.6.2 Transformation Matrices
9.7 PARALLELIZING COMPILERS
9.8 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
10. Operating Systems for Parallel Computers
10.1 RESOURCE MANAGEMENT
10.1.1 Task Scheduling in Message Passing Parallel Computers
10.1.2 Dynamic Scheduling
10.1.3 Task Scheduling in Shared Memory Parallel Computers
10.1.4 Task Scheduling for Multicore Processor Systems
10.2 PROCESS MANAGEMENT
10.2.1 Threads
10.3 PROCESS SYNCHRONIZATION
10.3.1 Transactional Memory
10.4 INTER-PROCESS COMMUNICATION
10.5 MEMORY MANAGEMENT
10.6 INPUT/OUTPUT (DISK ARRAYS)
10.6.1 Data Striping
10.6.2 Redundancy Mechanisms
10.6.3 RAID Organizations
10.7 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
11. Performance Evaluation of Parallel Computers
11.1 BASICS OF PERFORMANCE EVALUATION
11.1.1 Performance Metrics
11.1.2 Performance Measures and Benchmarks
11.2 SOURCES OF PARALLEL OVERHEAD
11.2.1 Inter-processor Communication
11.2.2 Load Imbalance
11.2.3 Inter-task Synchronization
11.2.4 Extra Computation
11.2.5 Other Overheads
11.2.6 Parallel Balance Point
11.3 SPEEDUP PERFORMANCE LAWS
11.3.1 Amdahl’s Law
11.3.2 Gustafson’s Law
11.3.3 Sun and Ni’s Law
11.4 SCALABILITY METRIC
11.4.1 Isoefficiency Function
11.5 PERFORMANCE ANALYSIS
11.6 CONCLUSIONS
EXERCISES
BIBLIOGRAPHY
Appendix
Index
Preface
Of late there has been a great deal of interest all over the world in parallel processors and parallel computers. This is because all current microprocessors are parallel processors. Each processor in a microprocessor chip is called a core, and such a microprocessor is called a multicore processor. Multicore processors have an on-chip memory of a few megabytes (MB). Before trying to answer the question “What is a parallel computer?”, we will briefly review the structure of a single processor computer (Fig. 1.1). It consists of an input unit which accepts (or reads) the list of instructions to solve a problem (a program) and the data relevant to that problem. It has a memory or storage unit in which the program, data and intermediate results are stored; a processing element, which we will abbreviate as PE (also called a Central Processing Unit (CPU)), which interprets and executes instructions; and an output unit which displays or prints the results.
The role of experiments, theoretical models, and numerical simulation is shown in Fig.
1.2. A theoretically developed model is used to simulate the physical system. The results of
simulation allow one to eliminate a number of unpromising designs and concentrate on those
which exhibit good performance. These results are used to refine the model and carry out
further numerical simulation. Once a good design on a realistic model is obtained, it is used
to construct a prototype for experimentation. The results of experiments are used to refine the
model, simulate it and further refine the system. This repetitive process is used until a
satisfactory system emerges. The main point to note is that experiments on actual systems are
not eliminated but the number of experiments is reduced considerably. This reduction leads to
substantial cost saving. There are, of course, cases where actual experiments cannot be performed, such as assessing the damage to an aircraft when it crashes. In such cases simulation is the only feasible method.
Figure 1.2 Interaction between theory, experiments and computer simulation.
With advances in science and engineering, the models used nowadays incorporate more
details. This has increased the demand for computing and storage capacity. For example, to
model global weather, we have to model the behaviour of the earth’s atmosphere. The
behaviour is modelled by partial differential equations in which the most important variables
are the wind speed, air temperature, humidity and atmospheric pressure. The objective of
numerical weather modelling is to predict the status of the atmosphere at a particular region
at a specified future time based on the current and past observations of the values of
atmospheric variables. This is done by solving the partial differential equations numerically
in regions or grids specified by using lines parallel to the latitude and longitude and using a
number of atmospheric layers. In one model (see Fig. 1.3), the regions are demarcated by
using 180 latitudes and 360 longitudes (meridian circles) equally spaced around the globe. In
the vertical direction 12 layers are used to describe the atmosphere. The partial differential
equations are solved by discretizing them to difference equations which are in turn solved as
a set of simultaneous algebraic equations. For each region one point is taken as representing
the region and this is called a grid point. At each grid point in this problem, there are 5
variables (namely air velocity, temperature, pressure, humidity, and time) whose values are
stored. The simultaneous algebraic equations are normally solved using an iterative method.
In an iterative method several iterations (100 to 1000) are needed for each grid point before
the results converge. The calculation of each trial value normally requires around 100 to 500
floating point arithmetic operations. Thus, the total number of floating point operations
required for each simulation is approximately given by:
Number of floating point operations per simulation
= Number of grid points × Number of values per grid point × Number of trials × Number
of operations per trial
Figure 1.3 Grid for numerical weather model for the Earth.
In this example we have:
Number of grid points = 180 × 360 × 12 = 777600
Number of values per grid point = 5
Number of trials = 500
Number of operations per trial = 400
Thus, the total number of floating point operations required per simulation = 777600 × 5 × 500 × 400 = 7.776 × 10^11. If each floating point operation takes 100 ns, the total time taken for one simulation = 7.8 × 10^4 s = 21.7 h. If we want to predict the weather at intervals of 6 h, there is no point in computing for 21.7 h to obtain one prediction! For such a simulation to be useful, a floating point arithmetic operation on 64-bit operands should be completed within about 10 ns. This time is too short for a computer which does not use any parallelism, and we need a parallel computer to solve such a problem. In general, the complexity of a problem of this type may be described by the formula:
Problem complexity = G × V × T × A
where
G = Geometry of the grid system used
V = Variables per grid point
T = Number of steps per simulation for solving the problem
A = Number of floating point operations per step
For the weather modelling problem,
G = 777600, V = 5, T = 500 and A = 400, giving a problem complexity of 7.8 × 10^11.
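As a quick check of this arithmetic, the short C program below (a sketch added for illustration, not part of the original text) evaluates G × V × T × A for the weather model and the corresponding run time at 100 ns per floating point operation.

    #include <stdio.h>

    int main(void) {
        /* Values taken from the weather modelling example above. */
        double G = 180.0 * 360.0 * 12.0;   /* number of grid points               */
        double V = 5.0;                    /* variables (values) per grid point   */
        double T = 500.0;                  /* trials (iterations) per grid point  */
        double A = 400.0;                  /* floating point operations per trial */

        double flops   = G * V * T * A;    /* problem complexity = G x V x T x A  */
        double seconds = flops * 100e-9;   /* at 100 ns per floating point op     */

        printf("Problem complexity = %.3e floating point operations\n", flops);
        printf("Time at 100 ns/op  = %.0f s = %.1f h\n", seconds, seconds / 3600.0);
        return 0;
    }

The program prints roughly 7.776 × 10^11 operations and about 21.6 h, confirming the figures quoted above; only by spreading these operations over many processing elements can the time be brought down to something usable for 6-hourly forecasts.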
There are many other problems whose complexity is of the order of 10^12 to 10^20. For example, the complexity of numerical simulation of turbulent flows around aircraft wings and body is around 10^15. Numerically intensive simulation is also required in many other areas of science and engineering.
2. Solving Problems in Parallel
In this chapter we will explain, with examples, how simple jobs can be solved in parallel in many different ways. These simple examples will illustrate many important points about perceiving parallelism and about allocating tasks to processors so as to get maximum efficiency when solving problems in parallel.
2.1 UTILIZING TEMPORAL PARALLELISM
Suppose 1000 candidates appear in an examination. Assume that there are answers to 4
questions in each answer book. If a teacher is to correct these answer books, the following
instructions may be given to him:
Procedure 2.1 Instructions given to a teacher to correct an answer book
Step 1: Take an answer book from the pile of answer books.
Step 2: Correct the answer to Q1, namely A1.
Step 3: Repeat Step 2 for the answers to Q2, Q3 and Q4, namely A2, A3 and A4.
Step 4: Add the marks given for each answer.
Step 5: Put the answer book in the pile of corrected answer books.
Step 6: Repeat Steps 1 to 5 until no more answer books are left in the input.
A teacher correcting 1000 answer books using Procedure 2.1 is shown in Fig. 2.1. If a paper takes 20 minutes to correct, then 20,000 minutes will be taken to correct 1000 papers. If we want to speed up correction, we can do it in several ways; one of them, using temporal (pipeline) parallelism, is illustrated by the sketch following this paragraph.
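The following C program is a sketch added here (not part of the original text) that contrasts the time taken by one teacher following Procedure 2.1 with a temporal (pipeline) arrangement in which four teachers sit in a line, each correcting the answer to one question and passing the answer book on to the next teacher. The figure of 5 minutes per answer and the four-stage pipeline are assumptions chosen so that a single teacher still takes 20 minutes per paper.

    #include <stdio.h>

    int main(void) {
        const int papers         = 1000;  /* answer books to correct               */
        const int questions      = 4;     /* answers per book                      */
        const int min_per_answer = 5;     /* assumed minutes to correct one answer */

        /* One teacher corrects every answer of every book (Procedure 2.1). */
        int sequential = papers * questions * min_per_answer;

        /* Pipeline of 4 teachers, each correcting one answer and passing the
           book on: the first book takes questions * min_per_answer minutes to
           fill the pipeline; thereafter one corrected book emerges every
           min_per_answer minutes.                                             */
        int pipelined = questions * min_per_answer + (papers - 1) * min_per_answer;

        printf("Sequential: %d minutes\n", sequential);
        printf("Pipelined : %d minutes (speedup about %.2f)\n",
               pipelined, (double)sequential / pipelined);
        return 0;
    }

With four teachers the pipeline gives a speedup of just under 4: the first answer book takes 20 minutes to pass through all four stages, after which a corrected book emerges every 5 minutes.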