HPC Unit 2

Unit No. 2
Principles of Parallel Algorithm Design

 Course Objectives & Outcomes



To understand different parallel programming models

 CO1: Understand various parallel paradigms


Syllabus:

 Introduction to Parallel Computing: Motivating Parallelism


 Modern Processor: Stored-program computer architecture, General-purpose Cache-based Microprocessor architecture
 Parallel Programming Platforms: Implicit Parallelism, Dichotomy of Parallel Computing Platforms, Physical Organization of Parallel Platforms, Communication Costs in Parallel Machines, Levels of Parallelism
 Models: SIMD, MIMD, SIMT, SPMD, Data Flow Models, Demand-driven Computation
 Architectures: N-wide superscalar architectures, multi-core, multi-threaded architectures
Introduction to Parallel Computing

 Traditionally, software has been written for serial computation:
 To be run on a single computer having a single Central Processing Unit (CPU);
 A problem is broken into a discrete series of instructions.
 Instructions are executed one after another.
 Only one instruction may execute at any moment in time.
Fig. - Serial computation
Parallel Computing

 In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem.
 To be run using multiple CPUs
 A problem is broken into discrete parts that can be solved concurrently
 Each part is further broken down to a series of instructions
 Instructions from each part execute simultaneously on different CPUs
Continue…
Motivating Parallelism

• Development of parallel software has traditionally been thought of as time- and effort-intensive.
• This can be largely attributed to the inherent complexity of specifying and coordinating concurrent tasks, together with a lack of portable algorithms, standardized environments, and software development toolkits.
1. The Computational Power Argument – from Transistors to FLOPS
2. The Memory/Disk Speed Argument
3. The Data Communication Argument
The Computational Power Argument – from Transistors to
FLOPS …

• In 1965, Gordon Moore made the following simple observation: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly, over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."
The Memory/Disk Speed Argument

 The overall speed of computation is determined not just by the speed of the processor, but also by the ability of the memory system to feed data to it.
 While clock rates of high-end processors have increased at roughly 40% per year over the past decade, DRAM access times have improved at only roughly 10% per year over this interval.
 The overall performance of the memory system is determined by the fraction of the total memory requests that can be satisfied from the cache.
The Data Communication
Argument

• In many applications there are constraints on the location of data and/or resources across the Internet.
• An example of such an application is mining of large commercial datasets distributed over a relatively low-bandwidth network.
• In such applications, even if the computing power is available to accomplish the required task without resorting to parallel computing, it is infeasible to collect the data at a central location.
• In these cases, the motivation for parallelism comes not just from the need for computing resources but also from the infeasibility or undesirability of alternate (centralized) approaches.
Reference Book: Ananth Grama et al., Introduction to Parallel Computing
Modern Processor

1. Stored-program computer architecture: Its defining property, which set it apart from earlier designs, is that its instructions are numbers that are stored as data in memory. Instructions are read and executed by a control unit; a separate arithmetic/logic unit is responsible for the actual computations and manipulates data stored in memory along with the instructions. A von Neumann computer uses the stored-program concept: the CPU executes a stored program that specifies a sequence of read and write operations on the memory.
Continue...

 Instructions and data must be continuously fed to the control and arithmetic units, so that the speed of the
memory interface poses a limitation on compute performance.
 The architecture is inherently sequential, processing a single instruction with (possibly) a single operand
or a group of operands from memory.(SISD)
2. General-purpose Cache-based Microprocessor architecture:
• Microprocessors implement the stored-program concept.
• Modern processors have many components, but only a small part does the actual work: the arithmetic units for floating-point and integer operations.
• The rest includes the CPU registers; modern processors require all operands to reside in registers.
• LD (load) and ST (store) units handle data transfer between registers and memory.
• Queues buffer instructions that are waiting to execute.
• Finally, caches hold copies of recently used data items close to the core.
Continue...
References

 Book Title: Introduction to High Performance Computing for Scientists and Engineers
 Authors: Georg Hager and Gerhard Wellein
 Reference: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf
Scope of Parallelism

 Conventional architectures coarsely comprise a processor, a memory system, and the datapath.
• Each of these components presents significant performance bottlenecks.
• Parallelism addresses each of these components in significant ways.
• Different applications utilize different aspects of parallelism - e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
• It is important to understand each of these performance bottlenecks.
Implicit Parallelism: Trends in Microprocessor Architectures

 Microprocessor clock speeds have posted impressive gains over the past two decades
(two to three orders of magnitude).
 Higher levels of device integration have made available a large number of transistors.
 The question of how best to utilize these resources is an important one.
 Current processors use these resources in multiple functional units and execute
multiple instructions in the same cycle.
 The precise manner in which these instructions are selected and executed provides
impressive diversity in architectures.
Pipelining and Superscalar Execution

 Pipelining overlaps various stages of instruction execution to achieve performance.


 At a high level of abstraction, an instruction can be executed while the next one is
being decoded and the next one is being fetched.
 This is akin to an assembly line for manufacture of cars.
 Pipelining, however, has several limitations.
 The speed of a pipeline is eventually limited by the slowest stage.
 For this reason, conventional processors rely on very deep pipelines (20 stage
pipelines in state-of-the-art Pentium processors).
 However, in typical program traces, every 5-6th instruction is a conditional jump! This
requires very accurate branch prediction.
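 A standard back-of-the-envelope estimate (not from the slides): with a k-stage pipeline, n instructions complete in roughly k + n - 1 cycles instead of the n × k cycles an unpipelined unit would need, so for large n the speedup approaches k - provided the pipeline never stalls.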
Continue…

 The penalty of a misprediction grows with the depth of the pipeline, since a larger
number of instructions will have to be flushed.
 One simple way of alleviating these bottlenecks is to use multiple pipelines.
 The question then becomes one of selecting these instructions.
 In the example below, there is some wastage of resources due to data dependencies.
Superscalar Execution: An Example
Example of a two-way superscalar execution of instructions.
Superscalar Execution: An Example

 In the above example, there is some wastage of resources due to data dependencies.
 The example also illustrates that different instruction mixes with identical semantics
can take significantly different execution time.
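 A minimal sketch of this effect in C (illustrative, not from the slides): both functions compute the same sum, but the second exposes independent additions that a two-way superscalar core can co-issue.

/* Version A: a serial dependency chain - each add needs the previous result. */
double sum_chain(const double x[4]) {
    double s = x[0];
    s = s + x[1];
    s = s + x[2];
    s = s + x[3];
    return s;            /* roughly 3 dependent add latencies */
}

/* Version B: identical semantics, but the first two adds are independent
   and can issue in the same cycle on a two-way superscalar processor. */
double sum_tree(const double x[4]) {
    double a = x[0] + x[1];
    double b = x[2] + x[3];
    return a + b;        /* roughly 2 dependent add latencies */
}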
Superscalar Execution

 Scheduling of instructions is determined by a number of factors:


– True Data Dependency: The result of one operation is an input to the next.
– Resource Dependency: Two operations require the same resource.
– Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
– The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
– The complexity of this hardware is an important constraint on superscalar processors.
Superscalar Execution: Issue Mechanisms

 In the simpler model, instructions can be issued only in the order in which they are
encountered. That is, if the second instruction cannot be issued because it has a
data dependency with the first, only one instruction is issued in the cycle. This is
called in-order issue.
 •In a more aggressive model, instructions can be issued out of order. In this case,
if the second instruction has data dependencies with the first, but the third
instruction does not, the first and third instructions can be co-scheduled. This is
also called dynamic issue.
 •Performance of in-order issue is generally limited.
Superscalar Execution: Efficiency Considerations

 Not all functional units can be kept busy at all times.


 •If during a cycle, no functional units are utilized, this is referred to as vertical
waste.
 •If during a cycle, only some of the functional units are utilized, this is referred to
as horizontal waste.
 •Due to limited parallelism in typical instruction traces, dependencies, or the
inability of the scheduler to extract parallelism, the performance of superscalar
processors is eventually limited.
 •Conventional microprocessors typically support four-way superscalar execution.
Very Long Instruction Word (VLIW) Processors

 The hardware cost and complexity of the superscalar scheduler is a major


consideration in processor design.
 •To address this issue, VLIW processors rely on compile-time analysis to identify
and bundle together instructions that can be executed concurrently.
 •These instructions are packed and dispatched together, and thus the name very
long instruction word.
 •This concept was used with some commercial success in the Multiflow Trace
machine (circa 1984).
 •Variants of this concept are employed in the Intel IA64 processors.
Very Long Instruction Word (VLIW) Processors: Considerations

 Issue hardware is simpler.


 •Compiler has a bigger context from which to select co-scheduled instructions.
 •Compilers, however, do not have runtime information such as cache misses.
Scheduling is, therefore, inherently conservative.
 •Branch and memory prediction is more difficult.
 •VLIW performance is highly dependent on the compiler. A number of techniques
such as loop unrolling, speculative execution, branch prediction are critical.
 •Typical VLIW processors are limited to 4-way to 8-way parallelism.
VLIW & Superscalar
Limitations of Memory System Performance

 Memory system, and not processor speed, is often the bottleneck for many
applications.
 •Memory system performance is largely captured by two parameters, latency and
bandwidth.
 •Latency is the time from the issue of a memory request to the time the data is
available at the processor.
 •Bandwidth is the rate at which data can be pumped to the processor by the
memory system.
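 An illustrative calculation (numbers assumed for the example): a 1 GHz processor facing a 100 ns DRAM latency waits 100 cycles for every memory operand, so unless most requests are served from cache, even a multi-issue core spends nearly all of its time stalled on memory.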
Dichotomy of Parallel Computing Platforms

 An explicitly parallel program must specify concurrency and interaction between


concurrent subtasks.
 •The former is sometimes also referred to as the control structure and the latter as
the communication model.
Control Structure of Parallel Programs

 Parallelism can be expressed at various levels of granularity - from instruction


level to processes.
 •Between these extremes exist a range of models, along with corresponding
architectural support.
Control Structure of Parallel Programs

 Processing units in parallel computers either operate under the centralized control
of a single control unit or work independently.
 •If there is a single control unit that dispatches the same instruction to various
processors (that work on different data), the model is referred to as single
instruction stream, multiple data stream (SIMD).
 •If each processor has its own control unit, each processor can execute
different instructions on different data items. This model is called multiple
instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors
SIMD Processors

 Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2,
and MasPar MP-1 belonged to this class of machines.
 •Variants of this concept have found use in co-processing units such as the MMX
units in Intel processors and DSP chips such as the Sharc.
 •SIMD relies on the regular structure of computations (such as those in image
processing).
 •It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor should participate in a computation or not (see the sketch under the next heading).
Conditional Execution in SIMD Processors
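 A minimal C sketch of mask-driven conditional execution (lane count and names are illustrative, not from the slides): all lanes step through the same instruction stream, and the mask decides which lanes commit results.

#define LANES 8

/* Emulates how a SIMD machine executes: if (x > 0) x *= 2; else x = 0; */
void masked_update(float data[LANES]) {
    int mask[LANES];

    /* Every lane evaluates the condition and records its mask bit. */
    for (int i = 0; i < LANES; i++)
        mask[i] = (data[i] > 0.0f);

    /* The "then" branch runs on all lanes; only enabled lanes store. */
    for (int i = 0; i < LANES; i++)
        if (mask[i])
            data[i] = data[i] * 2.0f;

    /* The "else" branch runs with the mask inverted; disabled lanes idle,
       which is exactly the overhead of branching on SIMD hardware. */
    for (int i = 0; i < LANES; i++)
        if (!mask[i])
            data[i] = 0.0f;
}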
MIMD Processors

 In contrast to SIMD processors, MIMD processors can execute different programs


on different processors.
 •A variant of this, called single program multiple data streams (SPMD) executes
the same program on different processors.
 •It is easy to see that SPMD and MIMD are closely related in terms of
programming flexibility and underlying architectural support.
 •Examples of such platforms include current generation Sun Ultra Servers, SGI
Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
SIMD-MIMD Comparison

 SIMD computers require less hardware than MIMD computers (single control
unit).
 •However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.
 •Not all applications are naturally suited to SIMD processors.
 •In contrast, platforms supporting the SPMD paradigm can be built from
inexpensive off-the-shelf components with relatively little effort in a short amount
of time.
Communication Model of Parallel
Platforms

 •There are two primary forms of data exchange between parallel tasks - accessing
a shared data space and exchanging messages.
 •Platforms that provide a shared data space are called shared-address-space
machines or multiprocessors.
 •Platforms that support messaging are also called message passing platforms or
multicomputers.
Shared-Address-Space Platforms

 •Part (or all) of the memory is accessible to all processors.


 •Processors interact by modifying data objects stored in this shared-address-
space.
 •If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms
Continue…

 The distinction between NUMA and UMA platforms is important from the point of view
of algorithm design. NUMA machines require locality from underlying algorithms for
performance.
 •Programming these platforms is easier since reads and writes are implicitly visible to
other processors.
 •However, read/write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming).
 •Caches in such machines require coordinated access to multiple copies. This leads to
the cache coherence problem.
 •A weaker model of these machines provides an address map, but not coordinated
access. These models are called non cache coherent shared address space machines.
Shared-Address-Space vs. Shared Memory Machines

 It is important to note the difference between the terms shared address space and
shared memory.
 •We refer to the former as a programming abstraction and to the latter as a
physical machine attribute.
 •It is possible to provide a shared address space using a physically distributed
memory.
Message-Passing Platforms
 These platforms comprise a set of processors, each with its own (exclusive) memory.
 •Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.
 •These platforms are programmed using (variants of) send and receive primitives.
 •Libraries such as MPI and PVM provide such primitives; a minimal sketch follows.
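 A hedged sketch of the send/receive style, using the standard MPI C API (error handling omitted; at least two ranks assumed):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends one integer, tagged 0, to rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Rank 1 blocks until the matching message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}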
Message Passing vs. Shared Address
Space Platforms

 •Message passing requires little hardware support, other than a network.


 •Shared address space platforms can easily emulate message passing. The
reverse is more difficult to do (in an efficient manner).
Physical Organization of Parallel Platforms : Architecture of an Ideal
Parallel Computer

 We begin this discussion with an ideal parallel machine called Parallel Random Access Machine, or
PRAM.
 •A natural extension of the Random-Access Machine (RAM) serial architecture is the Parallel Random-
Access Machine, or PRAM.
 •PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to
all processors.
 •Processors share a common clock but may execute different instructions in each cycle.
Architecture of an Ideal Parallel
Computer

 Depending on how simultaneous memory accesses are handled, PRAMs can be divided
into four subclasses.
 –Exclusive-read, exclusive-write (EREW) PRAM.
 –Concurrent-read, exclusive-write (CREW) PRAM.
 –Exclusive-read, concurrent-write (ERCW) PRAM.
 –Concurrent-read, concurrent-write (CRCW) PRAM.
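 Concurrent writes need an arbitration semantics; the common protocols (as listed in Grama et al.) are: common (the write succeeds only if all processors write the same value), arbitrary (one processor's write succeeds, chosen arbitrarily), priority (the highest-priority processor wins), and sum (the sum of all written values is stored).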
Cont..

 2. Interconnection Networks for Parallel Computers


 Classification of interconnection networks:
 (a) a static network; and (b) a dynamic network.
 3. Network Topologies: Bus-Based Networks, Crossbar Networks, Multistage Networks,
Star-Connected Network, Linear Arrays, Meshes, and k-d Meshes etc
 4. Evaluating Static Interconnection Networks: Diameter, Connectivity, and Bisection Width
 5. Evaluating Dynamic Interconnection Networks
 6. Cache Coherence in Multiprocessor Systems: Snoopy cache based and Directory
based
Interconnection Networks for Parallel
Computers

 Interconnection networks carry data between processors and to


memory.
 •Interconnects are made of switches and links (wires, fiber).
 •Interconnects are classified as static or dynamic.
 •Static networks consist of point-to-point communication links among
processing nodes and are also referred to as direct networks.
 •Dynamic networks are built using switches and communication links.
Dynamic networks are also referred to as indirect networks.
Evaluating Static Interconnection
Networks
Evaluating Dynamic Interconnection Networks
Cache Coherence in Multiprocessor
Systems

 Interconnects provide basic mechanisms for data transfer.


 •In the case of shared address space machines, additional hardware is
required to coordinate access to data that might have multiple copies in
the network.
 •The underlying technique must provide some guarantees on the
semantics.
 •This guarantee is generally one of serializability, i.e., there exists
some serial order of instruction execution that corresponds to the
parallel schedule.
Cache Coherence in Multiprocessor
Systems

 When the value of a variable is changed, all its copies must either be invalidated or updated.

Fig. - Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.
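 A short walk-through (standard textbook behavior, not specific to this figure): under an invalidate protocol, when processor P0 writes x, all other cached copies of x are marked invalid, so P1's next read misses and fetches the fresh value; under an update protocol, the new value is broadcast to every copy, trading extra write traffic for cheaper subsequent reads.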
Communication Costs in Parallel
Machines

 Along with idling and contention, communication is a major overhead in


parallel programs.
 •The cost of communication is dependent on a variety of features
including the programming model semantics, the network topology,
data handling and routing, and associated software protocols.
Store-and-Forward Routing

 A message traversing multiple hops is completely received at an intermediate hop before


being forwarded to the next hop.
 •The total communication cost for a message of size m words to traverse l communication links is

t_comm = t_s + (m * t_w + t_h) * l

where t_s is the startup time, t_w is the per-word transfer time, and t_h is the per-hop (header) time.
 •In most platforms, t_h is small and the above expression can be approximated by

t_comm = t_s + m * l * t_w
Routing Techniques

 Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and
(c) extending the concept to cut-through routing. The shaded regions represent the time that the message
is in transit. The startup time associated with this message transfer is assumed to be zero.
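 For comparison (from the same reference): with cut-through routing the message is forwarded hop by hop in small flits, so the cost becomes t_comm = t_s + l * t_h + m * t_w, where the message size m and the hop count l contribute additively rather than multiplicatively.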
B] Communication Costs in Shared-Address-Space Machines
Levels of parallelism

 1. Data Parallelism: Many problems in scientific computing involve processing large quantities of data stored on a computer. If this manipulation can be performed in parallel, i.e., by multiple processors working on different parts of the data, we speak of data parallelism. As a matter of fact, this is the dominant parallelization concept in scientific computing on MIMD-type computers. It also goes under the name of SPMD (Single Program Multiple Data), as usually the same code is executed on all processors, with independent instruction pointers.
 Ex: Medium-grained loop parallelism, coarse-grained parallelism by domain decomposition (see the loop sketch after this list)
 2. Functional Parallelism: Sometimes the solution of a "big" numerical problem can be split into more or less disparate subtasks, which work together by data exchange and synchronization. In this case, the subtasks execute completely different code on different data items, which is why functional parallelism is also called MPMD (Multiple Program Multiple Data).
 Ex: Master-worker scheme, functional decomposition
 Ref: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf
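 A minimal sketch of medium-grained loop (data) parallelism in C with OpenMP (array size and scaling factor are illustrative; compile with -fopenmp):

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    for (int i = 0; i < N; i++)
        b[i] = (double)i;

    /* The same code runs on every thread (SPMD style); OpenMP splits
       the N iterations across the available threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}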
Models : SIMD, MIMD, SIMT, SPMD
MIMD
SIMT

 Single instruction, multiple threads (SIMT) is an execution model used in parallel


computing where single instruction, multiple data (SIMD) is combined with
multithreading. It is different from SPMD in that all instructions in all "threads" are
executed in lock-step. The SIMT execution model has been implemented on
several GPUs and is relevant for general-purpose computing on graphics
processing units (GPGPU), e.g., some supercomputers that combine CPUs with GPUs.
SPMD

 SPMD (single program, multiple data streams) is a variant of MIMD in which the same program executes on different processors, each following its own control path through the code.
 •SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.
 •Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
Data Flow

 •In a dataflow computer, the execution of an instruction is driven by data availability instead of being guided by a program counter. In theory, any instruction should be ready for execution whenever its operands become available.
 •The instructions in a data-driven program are not ordered in any way. Instead of being stored separately in a main memory, data are directly held inside instructions.
 •This data-driven scheme requires no program counter and no control sequencer. However, it requires special mechanisms to detect data availability, to match data tokens with needy instructions, and to enable the chain reaction of asynchronous instruction executions. No memory sharing between instructions results in no side effects.
Cont..
Demand Driven
N-Wide Superscalar

 •Ideally: in an n-issue superscalar, n instructions are fetched, decoded, executed, and committed per cycle.
 In practice:
 – Data, control, and structural hazards spoil issue flow
 – Multi-cycle instructions spoil commit flow
 •Buffers at issue (the issue queue) and commit (the reorder buffer) decouple these stages from the rest of the pipeline and somewhat regularize breaks in the flow.
Cont..
Multicore Processor
Multithreaded Processor
Modern Processor and Architecture

 Book Title: Introduction to High Performance Computing for Scientists and Engineers
 Authors: Georg Hager and Gerhard Wellein
 Reference: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf
Demand Driven and Data flow

 Book Name: Advanced Computer Architecture
 Author Name: Kai Hwang
 THANK YOU
