Module 1 - New

This document provides an overview of parallel computing fundamentals, including: the motivation for parallel computing arising from limits of uniprocessor hardware design; key concepts such as parallelism, parallel computers, multi-core processors, and shared vs. distributed memory; and an overview of parallel computing covering Flynn's taxonomy, parallel programming challenges, and the communication models of parallel platforms (shared address space and message passing).


CSE4001 Parallel and Distributed Computing

Parallelism Fundamentals
Motivation – Key Concepts and Challenges – Overview of Parallel Computing – Flynn's Taxonomy – Multi-Core Processors – Shared vs. Distributed Memory



Motivation
 Viewed in the context of rapidly improving uniprocessor speeds, one is tempted to question the need for parallel computing.
 However, there are unmistakable trends in hardware design indicating that uniprocessor (or implicitly parallel) architectures may not be able to sustain the rate of realizable performance increments in the future.
 This is the result of a number of fundamental physical and computational limitations.
 The emergence of standardized parallel programming environments, libraries, and hardware has significantly reduced the time to (parallel) solution.
Key Concepts and Challenges

Key Concepts
 Parallel computing – the use of parallel computers to reduce the time needed to solve a single computational problem.
 Parallel computer – a computer containing more than one processor. Parallel computers can be categorized as either multicomputers or centralized multiprocessors.
 Multicomputer – a computer that contains two or more computers connected by an interconnection network.


Key Concepts
 Centralized multiprocessor, also known as a symmetric multiprocessor (SMP) – a computer in which the processors share access to a single, global memory.
 Multi-core processor – a particular type of multiprocessor in which the individual processors (called "cores") are on a single integrated circuit.
 Parallel programming – programming in a language that allows one to explicitly indicate how the different parts of the computation can be executed concurrently by different processors.



Challenges

It is not easy to develop an efficient parallel program.

• Some challenges:
– Parallel Programming
– Complex Algorithms
– Memory issues
– Communication Costs
– Load Balancing



Parallel Computing – Overview

Parallel Computing: Scope for Parallelism
 Conventional architectures consist of a processor, a memory system, and a datapath.
 Each of these components presents significant performance bottlenecks.
 Parallelism addresses each of these components in significant ways.
 Different applications utilize different aspects of parallelism – e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
 It is important to understand each of these performance bottlenecks.



Architectural Demands and Trends

 Microprocessor clock speeds have posted impressive gains over the past two decades (two to three orders of magnitude).
 Higher levels of device integration have made available a large number of transistors.
 The question of how best to utilize these resources is an important one.
 Current processors use these resources in multiple functional units and execute multiple instructions in the same cycle.
 The precise manner in which these instructions are selected and executed provides impressive diversity in architectures.



Pipelining and Superscalar Execution

 Pipelining overlaps various stages of instruction execution to achieve performance.
 At a high level of abstraction, one instruction can be executed while the next is being decoded and the one after that is being fetched.



Pipelining and Superscalar Execution
Pipelining – limitations
 The speed of a pipeline is ultimately limited by its slowest stage.
 For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
 However, in typical program traces, every fifth or sixth instruction is a conditional jump! This requires very accurate branch prediction.
 The penalty of a misprediction grows with the depth of the pipeline, since a larger number of instructions will have to be flushed.



Superscalar Execution
Scheduling of instructions is determined by a number of factors:
◦ True Data Dependency: the result of one operation is an input to the next (illustrated in the sketch below).
◦ Resource Dependency: two operations require the same resource.
◦ Branch Dependency: scheduling instructions across conditional branch statements cannot be done deterministically a priori.
◦ The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
◦ The complexity of this hardware is an important constraint on superscalar processors.
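
The short C fragment below is an illustrative sketch (not part of the original slides): the computation of d cannot begin until c is available (a true data dependency), while e and f are independent and could be issued in the same cycle by a superscalar processor.

    #include <stdio.h>

    int main(void) {
        int a = 3, b = 4;

        /* True data dependency: d uses the result of c, so these two
           operations cannot be issued in the same cycle.             */
        int c = a + b;
        int d = c * 2;

        /* Independent operations: e and f depend only on a and b, so a
           superscalar processor may execute them concurrently.        */
        int e = a * b;
        int f = a - b;

        printf("%d %d %d %d\n", c, d, e, f);
        return 0;
    }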



Goals of Parallelism
 Increases computational speed.
 Increases throughput, i.e., the amount of processing that can be accomplished during a given interval of time.
 Improves the performance of the computer for a given clock speed.
 Two or more ALUs in the CPU can work concurrently to increase throughput.
 The system may have two or more processors operating concurrently.



Communication Model of Parallel Platforms
 There are two primary forms of data exchange between parallel tasks – accessing a shared data space and exchanging messages.
 Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
 Platforms that support messaging are also called message-passing platforms or multicomputers.



Shared-Address-Space Platforms
 Part (or all) of the memory is accessible to all processors.
 Processors interact by modifying data objects stored in this shared address space.
 If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a Uniform Memory Access (UMA) machine; otherwise, it is a Non-Uniform Memory Access (NUMA) machine.



NUMA and UMA Shared-Address-Space Platforms

(Figure) (a) Uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
NUMA and UMA Shared-Address-Space Platforms
 NUMA and UMA platforms differ in terms of algorithm design: NUMA machines require locality in the underlying algorithms for performance.
 Programming these platforms is easier since reads and writes are implicitly visible to other processors.
 However, read/write access to shared data must be coordinated.
 Caches in such machines require coordinated access to multiple copies.
 This leads to the cache coherence problem.
 A weaker model of these machines provides an address map, but not coordinated access.
 These are non-cache-coherent shared-address-space machines.



Shared-Address-Space vs. Shared-Memory Machine
 Shared address space – a programming abstraction.
 Shared-memory machine – a physical machine attribute.
 It is possible to provide a shared address space using a physically distributed memory.



Message-Passing Platforms
 These platforms comprise a set of processors, each with its own (exclusive) memory.
 Instances of such a view come naturally from clustered workstations and non-shared-address-space multicomputers.
 These platforms are programmed using (variants of) send and receive primitives.
 Libraries such as MPI and PVM provide such primitives (a minimal sketch using MPI follows below).
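
The following is a minimal sketch of the send/receive style of programming (an illustrative example, not part of the original slides), assuming an MPI installation; rank 0 sends an integer held in its private memory to rank 1 through an explicit message.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                       /* data lives only in rank 0's memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with mpirun -np 2, the receive completes only after the matching send, so the communication also provides synchronization between the two processes.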



Message Passing vs. Shared-Address-Space Platforms
 Message passing requires little hardware support, other
than a network.
 Shared address space platforms can easily emulate
message passing.
 The reverse is more difficult to do (in an efficient manner).



Coordination
Parallel computing is the simultaneous execution of
the same task (split up and specially adapted) on multiple
processors in order to obtain results faster.
The idea is based on the fact that the process of solving
a problem usually can be divided into smaller tasks, which
may be carried out simultaneously with some coordination.



Coordination
A coordination model is the mechanism used to coordinate the execution of distinct processes within a parallel program.
The term coordination refers to three fundamental issues in concurrency:
 communication,
 synchronization, and
 process management.
Coordination is explicit and occurs at distinct points within a program.



Coordination
Coordination models are usually combined with well-known sequential languages, so they have been very successful at attracting programmers.
Programming environments that use coordination models usually fall into one of two classes:
1. Communication-based coordination models
2. Shared-memory coordination models



Coordination
Communication-based coordination models
Communication between concurrent tasks takes place through the exchange of messages or some other discrete communication event. The semantics of the communication usually provide synchronization as well as communication.

Shared-memory coordination models
Coordination is provided by the semantics of operations on shared data structures.



Amdahl's Law

 "Validity of the single processor approach to achieving large scale computing capabilities" – the title of Amdahl's 1967 paper that motivated the law.



Amdahl's Law - History
 Gene Amdahl, chief architect of IBM's first mainframe series and founder of Amdahl Corporation and other companies, found that there were some fairly stringent restrictions on how much of a speed-up one could get for a given parallelized task.
 These observations were wrapped up in Amdahl's Law.



Amdahl's Law - Introduction
If F is the fraction of a calculation that is sequential, and (1 - F) is the fraction that can be parallelized, then the maximum speed-up that can be achieved by using P processors is

    1 / (F + (1 - F)/P)
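
As a quick check of the formula, the small C program below (an illustrative sketch, not part of the original slides) evaluates this bound for a calculation that is 90% parallelizable (F = 0.1), reproducing the numbers used in the examples later in this module.

    #include <stdio.h>

    /* Maximum theoretical speed-up according to Amdahl's Law:
       F is the sequential fraction, p the number of processors. */
    static double amdahl_speedup(double F, int p) {
        return 1.0 / (F + (1.0 - F) / p);
    }

    int main(void) {
        int procs[] = {5, 10, 20};
        for (int i = 0; i < 3; i++)
            printf("P = %2d: speed-up <= %.1f\n",
                   procs[i], amdahl_speedup(0.1, procs[i]));
        return 0;
    }

Running it prints bounds of roughly 3.6, 5.3, and 6.9 for 5, 10, and 20 processors respectively.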



Amdahl's Law - Essence

 The point that Amdahl was trying to make was that using lots of parallel processors was not a viable way of achieving the sort of speed-ups that people were looking for; i.e., it was essentially an argument in support of investing effort in making single-processor systems run faster.



Generalizations

 The performance of any system is constrained by the speed or capacity of its slowest point.
 The impact of an effort to improve the performance of a program is primarily constrained by the amount of time that the program spends in the parts of the program not targeted by the effort.



Maximum Theoretical Speed-up

 Amdahl's Law is a statement of the maximum theoretical speed-up you can ever hope to achieve.
 Actual speed-ups are always less than the speed-up predicted by Amdahl's Law.



Actual speed-ups are always less – why?
 Distributing work to the parallel processors and collecting the results back together is extra work required in the parallel version that isn't required in the serial version.
 Straggler problem: when the program is executing the parallel parts of the code, it is unlikely that all of the processors will be computing all of the time, as some of them will likely run out of work before others have finished their part of the parallel work.



Examples
If 90% of a calculation can be parallelized (i.e., 10% is sequential), then the maximum speed-up which can be achieved on 5 processors is 1/(0.1 + (1 - 0.1)/5), or roughly 3.6 (i.e., the program can theoretically run 3.6 times faster on five processors than on one).
If 90% of a calculation can be parallelized, then the maximum speed-up on 10 processors is 1/(0.1 + (1 - 0.1)/10), or 5.3 (i.e., investing twice as much hardware speeds the calculation up by about 50%).
If 90% of a calculation can be parallelized, then the maximum speed-up on 20 processors is 1/(0.1 + (1 - 0.1)/20), or 6.9 (i.e., doubling the hardware again speeds up the calculation by only about 30%).
Flynn's Taxonomy
 The best-known classification scheme for parallel computers.
 Classification depends on the parallelism exhibited in the
◦ instruction streams
◦ data streams
 A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream).
 The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M).
 Four combinations: SISD, SIMD, MISD, MIMD.



Flynn’s Taxonomy
 SISD
◦ Single Instruction Stream, Single Data Stream
◦ Its most important member is the sequential computer
◦ Some argue other models are included as well
 SIMD
◦ Single Instruction Stream, Multiple Data Streams
◦ One of the two most important classes in Flynn's Taxonomy
 MISD
◦ Multiple Instruction Streams, Single Data Stream
◦ Relatively unused terminology; some argue that this includes pipeline computing
 MIMD
◦ Multiple Instruction Streams, Multiple Data Streams
◦ An important classification in Flynn's Taxonomy



Flynn's Taxonomy (figure)


Flynn’s Taxonomy
Single-instruction, single-data (SISD) systems
 An SISD computing system is a uniprocessor machine which
is capable of executing a single instruction, operating on a
single data stream.
 Machine instructions are processed in a sequential manner
and computers adopting this model are popularly called
sequential computers.
 Most conventional computers have SISD architecture.
 All the instructions and data to be processed have to be
stored in primary memory.



Single-instruction, single-data (SISD) systems (figure)


Flynn’s Taxonomy
Single-instruction, multiple-data (SIMD) systems
• An SIMD system is a multiprocessor machine capable of executing the same instruction on all of its CPUs while operating on different data streams.
• Machines based on the SIMD model are well suited to scientific computing, since such computations involve many vector and matrix operations.
• The data elements of a vector can be divided into multiple sets (N sets for an N-PE system) so that the information can be distributed to all the processing elements (PEs), and each PE can process one data set (a simple data-parallel loop of this kind is sketched below).
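
The C loop below is an illustrative sketch (not part of the original slides) of the kind of data-parallel operation SIMD hardware targets: the same add is applied to every element of the arrays, so a SIMD unit or vectorizing compiler can process several elements with one instruction.

    #include <stdio.h>

    #define N 8

    int main(void) {
        float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
        float c[N];

        /* The same operation (addition) is applied to every element:
           a natural fit for SIMD execution, where one instruction
           operates on multiple data elements at once.               */
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        for (int i = 0; i < N; i++)
            printf("%.0f ", c[i]);
        printf("\n");
        return 0;
    }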
Single-instruction, multiple-data (SIMD) systems (figure)


Flynn’s Taxonomy
Multiple-instruction, single-data (MISD) systems
 An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs, all of them operating on the same data set.
 Example: Z = sin(x) + cos(x) + tan(x) – the system performs different operations on the same data.
 Machines built using the MISD model are not useful for most applications; a few machines have been built, but none of them are available commercially.



Multiple-instruction, single-data (MISD) systems (figure)


Flynn’s Taxonomy
Multiple-instruction, multiple-data (MIMD) systems
• An MIMD system is a multiprocessor machine capable of executing multiple instructions on multiple data sets.
• Each PE in the MIMD model has separate instruction and data streams; therefore machines built using this model are suitable for any kind of application.
• Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.
• MIMD machines are broadly categorized into shared-memory MIMD and distributed-memory MIMD, based on the way PEs are coupled to the main memory.



Multiple-instruction, multiple-data (MIMD) systems (figure)


Multi-core Processors
 Motivation for multi-core
◦ Exploits improved feature size and density
◦ Increases functional units per chip (spatial efficiency)
◦ Limits energy consumption per operation
◦ Constrains growth in processor complexity
 Challenges resulting from multi-core
◦ Relies on effective exploitation of multiple-thread parallelism
◦ Aggravates the memory wall
◦ Pins become a choke point
◦ Requires mechanisms for efficient inter-processor coordination



Multi-core Processors
 Challenges resulting from multi-core
◦ Relies on effective exploitation of multiple-thread parallelism
 Need for parallel computing model and parallel
programming model
◦ Aggravates memory wall
 Memory bandwidth
◦ Way to get data out of memory banks
◦ Way to get data into multi-core processor array
 Memory latency
 Fragments L3 cache
◦ Pins become a choke point
 Rate of pin growth projected to slow and flatten
 Rate of bandwidth per pin (pair) projected to grow slowly
◦ Requires mechanisms for efficient inter-processor
coordination
 Synchronization
 Mutual exclusion
 Context switching
Heterogeneous Multicore Architecture
 Combines different types of processors
◦ Each optimized for a different operational modality
 Performance > nX better than other n processor types
◦ Synthesis favors superior performance
 For complex computation exhibiting distinct modalities
 Conventional co-processors
◦ Graphical processing units (GPU)
◦ Network controllers (NIC)
◦ Efforts underway to apply existing special purpose
components to general applications
 Purpose-designed accelerators
◦ Integrated to significantly speedup some critical aspect
of one or more important classes of computation
◦ IBM Cell architecture
◦ ClearSpeed SIMD attached array processor



Moore's Law
Moore's Law describes a long-term trend in the history of computing hardware, in which the number of transistors that can be placed inexpensively on an integrated circuit has doubled approximately every two years.



Shared vs. Distributed Memory
According to the memory communication model:
– Shared address (shared memory)
▪ Processes on different processors can use the same virtual address space.
▪ Any processor can directly access memory in another processor's node.
▪ Communication is done through shared memory variables.
▪ Explicit synchronization with locks and critical sections (see the sketch after this list).
▪ Arguably easier to program?
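
As a concrete illustration of shared-memory coordination with locks, here is a minimal sketch (not part of the original slides, assuming a POSIX threads environment): two threads update a shared counter, and the mutex provides the explicit synchronization around the critical section.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                       /* shared variable, one address space */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);             /* enter critical section */
            counter++;                             /* update shared data     */
            pthread_mutex_unlock(&lock);           /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);        /* 200000 when synchronized */
        return 0;
    }

Without the mutex, the two increments could interleave and lose updates; this is the coordination burden that shared-memory programming places on the programmer.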



Shared vs Distributed memory.
 Shared memory enables all processors to access all
memory as global address space.
 Multiple processors can operate independently but share
the same memory resources.
 Changes in a memory location effected by one processor
are visible to all other processors.



Shared vs Distributed memory.
Distributed address (message passing)
▪ Processes on different processors use different virtual address spaces.
▪ Each processor can only directly access memory in its own node.
▪ Communication is done through explicit messages.
▪ Synchronization is implicit in the messages.
▪ Arguably harder to program?
▪ Standard message-passing libraries exist (e.g., MPI).



Shared vs Distributed memory.
• Distributed memory systems require a communication network to connect inter-processor memory.
• Processors have their own local memory.
• Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
• Processors operate independently.
• The programmer usually must explicitly define how and when data is communicated between two processors.
• Synchronization between tasks is likewise the programmer's responsibility.
• The network structure used for data transfer varies widely, though it can be as simple as Ethernet.



Distributed memory.
Advantages:
◦ Memory is scalable with number of processors.
◦ Each processor can rapidly access its own memory
without interference and without the overhead incurred
with trying to maintain cache coherency.
◦ Cost effectiveness: can use commodity, off-the-shelf
processors and networking.
Disadvantages:
◦ The programmer is responsible for many of the details
associated with data communication between
processors.
◦ It may be difficult to map existing data structures, based
on global memory, to this memory organization.
◦ Non-uniform memory access (NUMA) times



Shared vs. distributed memory (figure)
