
Parallel and Distributed Computing

Overview
(CS 3006)

Muhammad Aadil Ur Rehman

Department of Computer Science,


National University of Computer & Emerging Sciences,
Islamabad Campus
Credits: Dr. Muhammad Aleem
Parallel and Distributed Computing?
• What is Parallel and Distributed Computing?
– Use of multiple processors or machines working together on a common task
– Each processor/machine works on its section of the problem
– Processors may exchange information


Parallel and Distributed Computing?
Why Parallel and Distributed Computing?
• Goal: solve large problems more quickly

• Parallelism provides:
– Solving larger problems in the same amount of time (or even less)
– Solving fixed-size problems faster
Why do parallel computing?
Two main aspects:
1. Technology Push
2. Applications Pull

• Acknowledgements: Department of Computer Science Rice University


Why do parallel computing?

Technology Push
Single Processor Systems
– From the mid-1980s until 2004, computers kept getting faster:
• More transistors were packed onto chips every year
• Transistors continued to shrink in size
• The result was faster processors (frequency scaling: more GHz meant more performance), as observed in Moore's law
Moore's Law
• An observation by Gordon Moore (co-founder of Intel)
• A projection based on the historical trend in processor design in the 1970s:
"The number of transistors in a processor doubles every year"

• Real-world data shows the trend:
– The number of transistors doubled roughly every 18 months

ACK: shigeru23 CC BY-SA 3.0, via Wikimedia Commons
Moore’s Law
• More transistors → more parallelism opportunities

• Implicit Parallelism (hidden from the programmer):
– Execution pipelines, multiple functional units, etc.

• Explicit Parallelism:
– VLIW, more execution units

• Higher packing density means shorter electrical paths, giving higher performance
Problems with single “Faster” CPU
– Power Consumption:
• With growing clock frequencies, power consumption increases exponentially

• Not possible forever: a much higher-frequency processor could require a whole nuclear reactor to fulfill its energy needs

https://www.nytimes.com/2004/05/17/business/technology-intel-s-big-shift-after-hitting-technical-wall.html

Acknowledgements: Department of Computer Science Rice University


Problems with single “Faster” CPU
– Heat Dissipation:
• With growing clock frequency, more power
is consumed àMore heat is dissipated..

• Keeping processors cool become a problem


(with increasing clock frequencies)
Problems with single “Faster” CPU
Typical Single-core Processor

Single Execution Unit


[Figure: processor with Processing Unit 1 and Processing Unit 2]
Old Intel Pentium 4 CPU



Superscalar processor execution



Program Order
Instruction level parallelism (ILP)

Instruction Dependency Graph



A more complex example

Instruction Dependency Graph
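A rough illustration of instruction-level parallelism (my own example, not necessarily the one shown in the figure): the three multiplies below are independent and can be issued in parallel by a superscalar core, while the final additions must wait for their results.

// Illustrative only: a, b, and c have no dependencies on each other,
// so a superscalar core can execute the multiplies in parallel;
// the additions must wait for their operands.
float sum_of_squares(float x, float y, float z) {
    float a = x * x;   // independent
    float b = y * y;   // independent of a
    float c = z * z;   // independent of a and b
    return a + b + c;  // depends on a, b, and c
}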



Pre multi-core era processor



Another Example
Diminishing returns of superscalar execution

Instruction Dependency Graph



ILP Tapped out +
End of Frequency Scaling
Other Issues…
– Smaller transistor sizes:
• Fabrication of processors becomes more difficult

– Limited memory size:
• A single processor has limited internal memory, i.e., cache memory (L1, L2), registers, etc.
• A single-processor system has limited overall system (main) memory
Multi-core Era

[Picture credit: https://i.ibb.co/4ftNR3z/mlcore-copy.jpg]


Multi-core Processors
• Single-core processors reached their physical
limits:
– Overheating issues
– Huge power consumption requirements
– Difficult to design

[Picture credits: Intel Corporation]


Serial Performance Scaling is Over
• Cannot continue to scale processor frequencies
– no 20 GHz chips

• Cannot continue to increase power consumption
– can't melt the chip

• Can continue to increase transistor density
– as per Moore's Law
Idea of Multicore Era Processor



Multi-core architectures
• Replicate multiple processor cores on a single die/chip

Core 1 Core 2 Core 3 Core 4

Multi-core CPU chip


The “New” Moore’s Law
• Computers no longer get faster, just wider

• You must re-think your algorithms to be parallel!

• Data-parallel computing is the most scalable solution:

2 cores, 4 cores, 8 cores, 16 cores, …


Idea of Multicore Era Processor



Technology Trends and Multicores
Example Parallel Machines
Examples of parallel machines/hardware:
– A Chip Multi-Processor (CMP) contains multiple cores on a single chip

– A shared-memory multi-processor is built by connecting multiple processors to a single memory system

– A distributed computer contains multiple machines combined with a network, such as a Cluster, Grid, Cloud, etc.
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called Chip Multi-Processor (CMP)

Core 1   Core 2   Core 3   Core 4
Multi-core CPU chip

thread 1   thread 2   thread 3   thread 4
Core 1     Core 2     Core 3     Core 4

Threads may be time-sliced (like a uniprocessor)

several threads   several threads   several threads   several threads
Core 1            Core 2            Core 3            Core 4
Other Option: Simultaneous Multithreading

several threads   several threads   several threads   several threads
Core 1            Core 2            Core 3            Core 4
Mobile parallel processing
Power constraints also heavily influence the design of mobile systems.



8-Core Core-i7 Processor (Extreme Edition)

http://simplecore.intel.com/newsroom/wp-content/uploads/sites/11/HSW-E-Die-Mapping-Hi-Res.jpg
Mobile Parallel Processing
Why do parallel computing?

Application Pull
Why do parallel computing?

Plamen Krastev@San Diego State University


Parallelizing Applications
Traditional vs. Parallel Applications
• Traditionally, software has been written for serial
computation:
– To be run on a single computer having a single Central
Processing Unit (CPU);
– A problem is broken into a discrete series of
instructions.
– Instructions are executed one after another.
Traditional vs. Parallel Applications
• In the simplest sense, parallel execution is the simultaneous
use of multiple compute resources to solve a computational
problem
– To be run using multiple CPUs (on same or different machines)
– A problem is broken into discrete parts that can be solved
concurrently
– Each part is further broken down to a series of instructions
Example: Sin(x) Taylor Series
Execution
Execute program
My very simple processor: executes one instruction per clock

[Figure: a single core with Fetch/Decode, Execution Unit (ALU), and Execution Context runs the instruction stream below, loading x[i] and storing y[i]]

ld  r0, addr[r1]
mul r1, r0, r0
mul r1, r1, r0
...
st  addr[r2], r0
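The program in the figure evaluates sin(x) for every array element using a truncated Taylor series. As a rough C++ sketch of that computation (my own reconstruction, not the exact code behind the slide):

#include <cstddef>

// Compute sin(x[i]) for each element using a truncated Taylor series:
// sin(x) = x - x^3/3! + x^5/5! - x^7/7! + ...
void sinx(int N, int terms, const float* x, float* y) {
    for (int i = 0; i < N; i++) {
        float value = x[i];
        float numer = x[i] * x[i] * x[i];
        float denom = 6.f;        // 3!
        int sign = -1;
        for (int j = 1; j <= terms; j++) {
            value += sign * numer / denom;
            numer *= x[i] * x[i];
            denom *= (2 * j + 2) * (2 * j + 3);
            sign  *= -1;
        }
        y[i] = value;
    }
}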


Two cores: compute two elements in parallel

[Figure: two simpler cores, each with its own Fetch/Decode, ALU, and Execution Context, run the same instruction stream (ld / mul / ... / st) on x[i] and x[j] at the same time]

Simpler cores: each core may be slower at running a single instruction stream than our original "fancy" core (e.g., 25% slower)

But there are now two cores: 2 × 0.75 = 1.5 (potential for speedup!)
Expressing parallelism using C++ threads
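The code for this slide did not survive the text conversion; a minimal sketch of the idea, reusing the sinx() routine from the earlier sketch and splitting the array between the main thread and one std::thread (names such as my_args and parallel_sinx are illustrative, not from the slides):

#include <thread>

// Serial routine from the earlier sketch.
void sinx(int N, int terms, const float* x, float* y);

struct my_args {
    int N, terms;
    const float* x;
    float* y;
};

// Thread entry point: runs sinx() on its assigned portion of the array.
void sinx_thread(my_args args) {
    sinx(args.N, args.terms, args.x, args.y);
}

void parallel_sinx(int N, int terms, const float* x, float* y) {
    my_args first_half  {N / 2, terms, x, y};
    my_args second_half {N - N / 2, terms, x + N / 2, y + N / 2};

    std::thread t(sinx_thread, first_half);  // spawn a worker for the first half
    sinx_thread(second_half);                // main thread computes the second half
    t.join();                                // wait for the worker to finish
}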
Some Terminologies
• Task
– A logically discrete section of computational work.
– A task is typically a program or program-like set of
instructions that is executed by a processor.

• Parallel Task
– A task that can be executed by multiple processors
safely (producing correct results)

• Serial Execution
– Execution of a program sequentially, one statement at
a time using one processor.
4 Steps in Creating Parallel Programs
Parallelization Strategy
1. Problem Understanding
2. Partitioning/Decomposition
3. Assignment
4. Orchestration
5. Mapping
Parallelization Strategy
1. Problem Understanding
2. Partitioning/Decomposition
3. Assignment
4. Orchestration
5. Mapping
1. Problem Understanding
• The first step in developing a parallel application is to understand the problem (that you wish to solve)

• Starting with a serial program → understanding the existing code

• The first decision → whether or not the problem is suitable for parallelization:
– Code dependencies
– Communication
– Synchronization
1. (a) Identify the Program's Hot-spots
• Know where most of the real work is being done (the majority of scientific and technical programs usually accomplish most of their work in a few places)

• Profilers and performance analysis tools can help here

• Focus on parallelizing the hot-spots and ignore those sections of the program that account for little CPU usage
1. (a) Identify the Program's Hot-spots
• Identify areas that are disproportionately slow
– Example: I/O is usually something that slows a program down

• Potential solutions:
– Restructure the program
– Use a different algorithm
– Overlap communication with computation
Parallelization Strategy
1. Problem Understanding
2. Partitioning/Decomposition
3. Assignment
4. Orchestration
5. Mapping
Amdahl's Law
• Speedup = wall-clock time of serial execution / wall-clock time of parallel execution
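The slide defines speedup only as this ratio. For reference, the standard statement of Amdahl's law (not shown on the slide) bounds the speedup when only a fraction p of the program can run in parallel on N processors:

\[
\text{Speedup} = \frac{T_{\text{serial}}}{T_{\text{parallel}}}, \qquad
\text{Speedup}_{\text{Amdahl}}(N) = \frac{1}{(1-p) + p/N} \le \frac{1}{1-p}
\]

For example, if p = 0.9, the speedup can never exceed 10, no matter how many processors are used.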
2. Partitioning/Decomposition
• Divide the work into discrete "chunks"
• The chunks can then be assigned to specific tasks
• Tasks can be executed concurrently
• Decomposition approaches: (1) Domain decomposition, (2) Functional decomposition
• Decomposition granularity:
– Fine-grain: a large number of small tasks
– Coarse-grain: a small number of large tasks
Fine-grained tasks

Coarse-grained
tasks
Granularity

• Grain of parallelism: how big are the units?
– Bits, Instructions, Blocks, Loop iterations, Procedures, …
2. (a) Domain Decomposition
• The data associated with a problem is decomposed

• Each parallel task then works on a portion of the data
2. (a) Domain Decomposition
1D Data Decomposition:
• Block: same-sized large blocks (typically == number of parallel tasks) - Pros & Cons?

• Cyclic: many small blocks (typically >> number of parallel tasks) - Pros & Cons? (see the index sketch below)
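A minimal sketch (my own illustration, not code from the slides) of how block and cyclic 1D decompositions assign element indices to a task:

#include <algorithm>
#include <vector>

// Block decomposition: task t owns one contiguous chunk of about N/P elements.
std::vector<int> block_indices(int N, int P, int t) {
    int base = N / P, rem = N % P;
    int begin = t * base + std::min(t, rem);
    int end   = begin + base + (t < rem ? 1 : 0);
    std::vector<int> idx;
    for (int i = begin; i < end; i++) idx.push_back(i);
    return idx;
}

// Cyclic decomposition: task t owns elements t, t+P, t+2P, ... (many small pieces).
std::vector<int> cyclic_indices(int N, int P, int t) {
    std::vector<int> idx;
    for (int i = t; i < N; i += P) idx.push_back(i);
    return idx;
}

Block keeps each task's data contiguous (good locality); cyclic tends to balance load better when the per-element cost varies.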
2. (a) Domain Decomposition
2D Data Decomposition:
• Block (row* or column*): same-sized large blocks of whole rows or columns (typically == number of parallel tasks) - Pros & Cons?

• Block (row, column): smaller sub-blocks (typically == number of parallel tasks) - Pros & Cons?

2. (a) Domain Decomposition
2D Data Decomposition:
• Cyclic (row* or column*): many small blocks of rows or columns (typically >> number of parallel tasks) - Pros & Cons?

• Cyclic (row, column): many small sub-blocks (typically >> number of parallel tasks) - Pros & Cons?
2. (b) Functional Decomposition
• Focus is on the computation that is to be performed

• The problem is decomposed according to the work (instructions) that must be done

• Each sub-task performs a portion of the overall work
2. (b) Functional Decomposition
• Functional decomposition is useful for problems that can be split into different kinds of tasks (see the sketch after the examples)
Examples:
– Ecosystem Modeling
– Signal Processing
– Climate Modeling
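As a rough illustration (my own sketch, not from the slides), functional decomposition assigns different kinds of work, rather than different pieces of data, to each thread; the model_* functions here are placeholders:

#include <thread>
#include <cstdio>

// Placeholder work functions: each performs a *different* part of the
// overall computation (e.g., components of a climate model).
void model_atmosphere() { std::printf("atmosphere model step\n"); }
void model_ocean()      { std::printf("ocean model step\n"); }
void model_land()       { std::printf("land surface model step\n"); }

int main() {
    // Functional decomposition: one thread per kind of work.
    std::thread t1(model_atmosphere);
    std::thread t2(model_ocean);
    std::thread t3(model_land);
    t1.join(); t2.join(); t3.join();
}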
Parallelization Strategy
1. Problem Understanding
2. Partitioning/Decomposition
3. Assignment
4. Orchestration
5. Mapping
3. Assignment
• Composing "fine-grained" parallel computations into processes or coarse-grained tasks

• Considerations while composing tasks: load balance, uniform communication, ease of synchronization
Parallelization Strategy
1. Problem Understanding
2. Partitioning/Decomposition
3. Assignment
4. Orchestration
5. Mapping
4-5. Orchestration & Mapping
• The following aspects should be considered before the actual mapping (scheduling):

a) Inter-task communication
b) Synchronization among tasks
c) Data locality
d) Other system-related considerations (NUMA, etc.)
Communication
Who Needs Communications?
• No Communication Required:
– Some problems can be decomposed and executed in parallel with virtually no need for tasks to share data:
• Imagine an image-processing operation where every pixel in a black-and-white image needs to have its color reversed

– These types of problems are often called Embarrassingly Parallel because they are so straightforward
• Very little or no inter-task communication is required
Who Needs Communications?
• Communications Required:
– Most parallel applications are not quite so simple
– They do require tasks to share data with each other
• For example, a 3-D heat diffusion problem requires a task to know the temperatures calculated by the tasks that hold neighbouring data

• Changes to neighbouring data have a direct effect on that task's data
Communication Cost
– Inter-task communication virtually always implies overhead
– CPU time and resources are used to package and transmit data
– Communications usually require some synchronization between tasks → results in waiting time
– Communication can saturate the available network bandwidth, further reducing performance
Latency & Bandwidth
• Latency is the time it takes to send a message from point A to point B

• Bandwidth is the amount of data that can be communicated per unit of time

• Sending many small messages can cause a higher overall latency (the per-message overhead is paid each time)

• A more efficient approach is to package small messages into a larger message
– To maximize bandwidth usage and reduce overhead (see the worked example below)
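A back-of-the-envelope example with illustrative numbers (my own, not from the slides), using the common cost model

\[
t_{\text{message}} \approx \text{latency} + \frac{\text{message size}}{\text{bandwidth}}
\]

Assume latency = 1 µs and bandwidth = 1 GB/s:
– 1000 messages of 1 KB each: 1000 × (1 µs + 1 µs) = 2 ms
– one message of 1 MB: 1 µs + 1 ms ≈ 1.001 ms

Packing the same data into one large message roughly halves the total time here, because the per-message latency is paid once instead of 1000 times.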
Visibility of communications
• In the distributed-memory model, communications are explicit and under the control of the programmer

• In the shared-memory model, communications are implicit (they happen transparently to the programmer)
Communication Patterns
A. Synchronous communication: requires some type of "handshaking" between tasks
– Blocking communication (the other task must wait until the communication has completed)

B. Asynchronous communication: allows tasks to transfer data independently of one another
– Non-blocking communication
– Interleaving computation with communication is its single greatest benefit
Scope of Communication
• Knowing which tasks must communicate with each other is critical during the design stage of a parallel code
1. Point-to-point - involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer

2. Collective - involves data sharing between more than two tasks, which are often specified as being members of a common group, or collective
Collective Communications
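The slides do not show code for these operations; as a hedged illustration, here is what point-to-point vs. collective communication typically looks like with MPI (MPI is my assumption here, it is not introduced in this deck):

#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Point-to-point: task 0 sends one value to task 1.
    int value = 42;
    if (rank == 0 && size > 1)
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // Collective: every task contributes its rank; task 0 receives the sum.
    int sum = 0;
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
}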
Synchronization
Synchronization
• Synchronization: a mechanism for coordinating and sequencing parallel tasks

• A significant factor in a program's performance

• Often requires "serialization" of program segments
Synchronization Types
1. Barrier
– Usually all tasks are involved

– Each task performs its work until it reaches the barrier; it then stops, or "blocks"

– When the last task reaches the barrier, all tasks resume (see the sketch below)
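A minimal sketch of barrier synchronization (my own example, assuming C++20): every thread does its work, then blocks at the barrier until all have arrived.

#include <barrier>
#include <thread>
#include <vector>
#include <cstdio>

int main() {
    const int num_tasks = 4;
    std::barrier sync_point(num_tasks);

    auto work = [&](int id) {
        std::printf("task %d: doing its part\n", id);
        sync_point.arrive_and_wait();           // block until all 4 tasks arrive
        std::printf("task %d: past the barrier\n", id);
    };

    std::vector<std::thread> tasks;
    for (int i = 0; i < num_tasks; i++) tasks.emplace_back(work, i);
    for (auto& t : tasks) t.join();
}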
Synchronization Types
2. Lock / semaphore
– Can involve any number of tasks

– Typically used to serialize (protect) access to global data/code; only one task at a time may use (own) the lock / semaphore / flag

– The first task to acquire the lock "sets" it

– Other tasks can attempt to acquire the lock but must wait until the task that owns the lock releases it

– Can be blocking or non-blocking (a minimal example follows)
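A minimal sketch of lock-based mutual exclusion (my own example, not from the slides): only one task at a time may update the shared counter.

#include <mutex>
#include <thread>
#include <vector>

int shared_counter = 0;
std::mutex counter_lock;

void worker(int iterations) {
    for (int i = 0; i < iterations; i++) {
        std::lock_guard<std::mutex> guard(counter_lock);  // acquire ("set") the lock
        ++shared_counter;                                 // protected critical section
    }                                                     // lock released here
}

int main() {
    std::vector<std::thread> tasks;
    for (int t = 0; t < 4; t++) tasks.emplace_back(worker, 100000);
    for (auto& t : tasks) t.join();
    // shared_counter == 400000, because updates were serialized by the lock
}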


Synchronization Types
3. Synchronous Communication Operations
– Involve only those tasks executing a communication operation

– Before initiating the communication, an acknowledgement must be received ("OK to send")
Any Questions?
