Module 2 - Parallel Computing

1) Explicit parallelism is motivated by the growing performance gap between processors and memory, and the distributed nature of computational problems. 2) Early computer architectures like the Von Neumann model were serial, while later models introduced parallelism through concepts like Flynn's taxonomy of SISD, SIMD, MISD, and MIMD architectures. 3) Parallel platforms can be classified based on their physical hardware organization or logical programming view, with control structures and communication models defining the latter.

ELE5211: Advanced Topics in CE

Module Two
(Parallel Computing – Explicit Parallelism)

Tutor: Hassan A. Bashir


Intro: Motivating Parallelism
1- Why explicit parallelism?

- The growing gap between the speed and sustainable peak
  performance of current processors,
- The impact of memory system performance, and
- The distributed nature of many computational problems

present overarching motivations for parallelism.
Architectural Concept
(Serial to Parallel)
The Von Neumann Architecture (1940s)
Comprised of four main components:
- Memory
- Control Unit
- Arithmetic Logic Unit
- Input/Output

Read/write, random-access memory is used to store both program
instructions and data.
Program instructions are coded data which tell the computer to do
something.
Architectural Concept
(Serial to Parallel)
Flynn's Classical Taxonomy (1960s)
Multi-processor computer architectures are classified along the
two independent dimensions of Instruction Stream and Data
Stream.

Each of these dimensions can have only one of two possible states:
Single or Multiple.
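Combining the two dimensions yields the four classes discussed next:

                         Single Data    Multiple Data
  Single Instruction        SISD            SIMD
  Multiple Instruction      MISD            MIMD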
Single Instruction, Single Data (SISD)
This is a serial (non-parallel) computer with
Single Instruction: Only one instruction stream is executed by the CPU
during any one clock cycle
Single Data: Only one data stream is utilized during any one clock cycle
Deterministic execution

Examples: single-processor/core PCs, older-generation mainframes, etc.
Single Instruction, Multiple Data (SIMD)
This is a type of parallel computer with

Single Instruction: All processing units execute the same instruction at
any given clock cycle
Multiple Data: Each processing unit can operate on a different data
element

- Suitable for problems with a high degree of regularity, such as
  graphics/image processing.
- Synchronous (lockstep) and deterministic execution

Two varieties: Processor Arrays and Vector Pipelines

Examples: Most modern computers, particularly those with graphics
processing units (GPUs), employ SIMD instructions and execution units.
Single Instruction, Multiple Data (SIMD)
Single Instruction: All processors execute the same instruction at a given cycle
Multiple Data: Each processing unit can operate on different data
Multiple Instruction, Single Data (MISD)
Multiple Instruction: Each processing unit operates on the data
independently via separate instruction streams.
Single Data: A single data stream is fed into multiple processors.
Examples: Few (if any) actual examples of this class of parallel
computer have ever existed.

Multiple Instruction, Multiple Data (MIMD)
Multiple Instruction: Every processor may be executing a different
instruction stream
Multiple Data: Every processor may be working on a different data
stream

Execution: Can be synchronous or asynchronous, deterministic or
non-deterministic

- Currently the most common type of parallel computer; most modern
  supercomputers fall into this category.

Examples: Most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
Multiple Instruction, Multiple Data (MIMD)

Examples: HP/Compaq AlphaServer, IBM Power5
Dichotomy in Parallel Platforms
1- Physical Parallel Platform
This refers to the actual hardware organization of the
platform.

2- Logical Parallel Platform
This refers to a programmer's view of the platform
and involves two critical components:
- Control Structure: Ways of expressing parallel tasks, and
- Communication Model: Mechanisms for specifying
  interaction between these tasks.
Control Structure of Parallel Platforms

Example 1 (SIMD) – Parallelism from Single
Instruction on Multiple Processors
• Consider the following code segment that adds two vectors:
  for (i = 0; i < 1000; i++)
      c[i] = a[i] + b[i];

• Various iterations of the loop are independent of each other and can
  be executed independently.
• In this case, providing each processor with the appropriate data
  allows the loop to execute much faster.
• In an SIMD parallel computer, the same instruction is dispatched to all
  processors and executed concurrently.
• The Intel Pentium processor with Streaming SIMD Extensions (SSE)
  provides a number of instructions that execute the same instruction
  on multiple data items.
• The above architectural enhancements rely on the highly structured
  (regular) nature of the underlying computation (e.g. image
  processing or graphics) to deliver improved performance.
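As an illustration of the idea (a minimal sketch, not taken from the slides),
the same vector addition can be written in C with SSE intrinsics; it assumes
an SSE-capable compiler and processes four floats per instruction:

    #include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers, 4 floats each */

    /* Adds two length-n float vectors; n is assumed to be a multiple of 4. */
    void vec_add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 elements of a */
            __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 elements of b */
            __m128 vc = _mm_add_ps(va, vb);    /* one instruction adds all 4 pairs */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results into c */
        }
    }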
SIMD – Selective Execution Challenge
• While the SIMD concept works well for structured
  computations on parallel data structures such as arrays,
  it is often necessary to selectively turn off operations on
  certain data items.

• Thus, most SIMD programming paradigms allow for an activity
  mask to determine whether a processor should participate in
  an operation or not.

• Conditional statements are typically used to support selective
  execution and can be detrimental to the performance of SIMD
  processors.
Example 2 (SIMD) – Execution of Conditional
Statements on SIMD Architectures

The idling challenge
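The figure for this example is not reproduced here. A classic illustration
(an assumed stand-in, not the slide's own code) is a per-element conditional
such as c = (b == 0) ? a : a / b: a SIMD machine evaluates both branches on
all processing elements and uses an activity mask to select which result each
lane keeps, so some units sit idle during each phase. A C/SSE sketch of this
masked execution:

    #include <xmmintrin.h>

    /* Per lane: c = (b == 0) ? a : a / b.
     * Both branches are computed for all four lanes; a mask then selects
     * which result each lane keeps, mirroring SIMD activity masks.
     * (Lanes where b == 0 produce inf/NaN in the division, but those
     * results are masked out.) */
    void cond_div_sse(const float *a, const float *b, float *c, int n)
    {
        const __m128 zero = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 va   = _mm_loadu_ps(&a[i]);
            __m128 vb   = _mm_loadu_ps(&b[i]);
            __m128 mask = _mm_cmpeq_ps(vb, zero);      /* lanes where b == 0 */
            __m128 div  = _mm_div_ps(va, vb);          /* "else" branch, all lanes */
            __m128 res  = _mm_or_ps(_mm_and_ps(mask, va),
                                    _mm_andnot_ps(mask, div));
            _mm_storeu_ps(&c[i], res);
        }
    }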
Memory Architecture

Memory Architecture
(Shared memory)
General Characteristics:

The ability for all processors to access all memory as


global address space.

Multiple processors can operate independently but


share the same memory resources.

Changes in a memory location effected by one processor


are visible to all other processors.
17
Memory Architecture
(Shared memory )
Classes of Shared Memory:

The classification is based upon memory access times.


Memory Architecture
(Shared memory )
Uniform Memory Access (UMA)

- Most commonly represented today by Symmetric


Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA - Cache Coherent UMA.

Cache coherence means if one processor updates a location in


shared memory, all the other processors know about the update.
Cache coherency is accomplished at the hardware level.
19
Memory Architecture
(Shared memory )
Non-uniform Memory Access (NUMA)

- It physically links two or more SMPs


- One SMP can directly access memory of another
- Not all processors have equal access time to all memories
- Memory access across link is slower
- If cache coherency is maintained, then may also be called CC-
NUMA - Cache Coherent NUMA

20
Memory Architecture
(Shared memory )
Advantages:
- Global address space provides a user-friendly programming
perspective to memory
- Data sharing between tasks is both fast and uniform due to the
proximity of memory to CPUs

Disadvantages:
- Lack of scalability between memory and CPUs.
  Adding more CPUs can geometrically increase traffic on the shared
  memory-CPU path and, for cache coherent systems, geometrically
  increase the traffic associated with cache/memory management.
- Synchronization: The programmer is responsible for synchronization
  constructs that ensure "correct" access of global memory.
Memory Architecture
(Distributed memory )
General Characteristics:

Distributed memory systems require a communication network to
connect inter-processor memory.

No global address space: Processors have their own local memory.
Memory addresses in one processor do not map to another
processor, so there is no concept of a global address space across all
processors.

No cache coherency: Because each processor has its own local
memory, it operates independently. Changes it makes to its
local memory have no effect on the memory of other
processors.
Memory Architecture
(Distributed memory )

General Characteristics:

Programmer decides on memory access: When a processor
needs access to data in another processor, it is usually the
task of the programmer to explicitly define how and when
data is communicated. Synchronization between tasks is
likewise the programmer's responsibility.

The network "fabric" used for data transfer varies widely, though
it can be as simple as Ethernet.
Memory Architecture
(Distributed memory )

Advantages:
-Memory is scalable with the number of processors. Increase the
number of processors and the size of memory increases
proportionately.

-Rapid memory access: Each processor can rapidly access its own
memory without interference and without the overhead incurred
with trying to maintain global cache coherency.

- Cost effectiveness: Can use commodity, off-the-shelf processors and
  networking.
Memory Architecture
(Distributed memory )

Disadvantages:
- Programmer burden: The programmer is responsible for many of the
  details associated with data communication between processors.

- Mapping difficulty: It may be difficult to map existing data
  structures, based on global memory, to this memory organization.

- Non-uniform memory access times: Data residing on a remote
  node takes longer to access than node-local data.
Memory Architecture
(Hybrid Distributed-Shared memory )

General Characteristics:

The largest and fastest computers in the world today employ both
shared and distributed memory architectures.

The shared memory component can be a shared memory
machine and/or graphics processing units (GPUs).
Memory Architecture
(Hybrid Distributed-Shared memory )

General Characteristics:
Needs network communications: The distributed memory
component is the networking of multiple shared-memory/GPU
machines, which know only about their own memory, not the
memory of another machine. Thus, a network is needed to move
data from one machine to another.

Current trends indicate that this type of memory architecture will
continue to prevail for the foreseeable future.

Advantage: Increased scalability
Disadvantage: Increased programmer complexity
Memory Architecture
Remarks So far
For different computational problems, parallelism may exist in
different forms.

For a given computational problem, parallelism may exist at different
levels.

Finding parallelism (as much as possible) may not be straightforward.

However, once parallelism is identified, parallel computing becomes
possible.

It is vital to understand the required collaboration between processes.

Parallel programming is the next big step.


Parallel Programming Models
1- Shared Memory (Threaded and Non-threaded) models
- Easy to program (such as OpenMP)
- Difficult to scale to many CPUs (NUMA, cache coherence)

2- Message-passing model
- Many programming details (MPI or PVM)
- Better user control (data & work decomposition)
- Larger systems and better performance

3- Stream-based programming (for using GPUs)

4- Hybrid parallel programming

Shared Memory Models
This, perhaps, is the simplest parallel programming model where
processes/tasks share a common address space, which they
read and write to asynchronously.

Shared Memory Models

Various mechanisms such as locks are used to control access to the
shared memory, resolve contention, and prevent race
conditions and deadlocks.

Advantage from the programmer's point of view: The notion of data
"ownership" is lacking, so there is no need to specify explicitly
the communication of data between tasks.

All processes see and have equal access to shared memory.
Thread Models
This is a type of shared memory programming model where a single
"heavy weight" process can have multiple "light weight",
concurrent execution paths.

Thread Models
- From a programming perspective, threads implementations
commonly comprise:

• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in either serial or parallel source
code

Types of Thread Models
Historically, there are two main implementations of the threads model:
- POSIX Threads, and
- OpenMP

POSIX Threads
- Specified by the IEEE POSIX 1003.1c standard (1995). C Language only.
- Part of Unix/Linux operating systems
- Library based
- Commonly referred to as Pthreads.
- Very explicit parallelism; requires significant programmer attention to
detail.

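To illustrate the explicit, library-based style described above, here is a
minimal Pthreads sketch (an illustrative example, not taken from the slides);
compile with cc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Each thread prints the id passed to it through the argument pointer. */
    static void *hello(void *arg)
    {
        long id = (long)arg;
        printf("Hello from thread %ld\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, hello, (void *)i);   /* spawn threads */
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);                        /* wait for all */
        return 0;
    }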
Types of Thread Models
OpenMP
- Industry standard with fork-join model
- Compiler directive based
- Portable / multi-platform, including Unix and Windows platforms
- Available in C/C++ and Fortran implementations
- Can be very easy and simple to use
- Provides for "incremental parallelism". Can begin with a serial code.

Types of Thread Models
OpenMP - Example
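The example code from the original slide is not included in this extract; a
minimal OpenMP sketch of the earlier vector addition (an assumed example,
not the slide's exact code) is:

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {      /* serial initialization */
            a[i] = (float)i;
            b[i] = 2.0f * i;
        }

        /* The directive asks the compiler to share the loop iterations
         * among the available threads (fork-join model). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f (max threads: %d)\n",
               N - 1, c[N - 1], omp_get_max_threads());
        return 0;
    }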

Other threaded implementations include:
- Microsoft threads
- Java and Python threads
- CUDA threads for GPUs
Distributed Memory/Message Passing
Interface (MPI) Models
MPI (message passing interface) is the ‘de facto’ industry standard
which is also library-based;

Its implementation is available on almost every major parallel platform

• Each process has its local memory

• Explicit message passing enables information exchange and
  collaboration between processes
Distributed Memory/Message Passing
Interface (MPI) Models
• Data transfer usually requires cooperative operations to be
performed by each process. For example, a send operation must
have a matching receive operation.

• Portability, good performance & functionality


Distributed Memory/Message Passing
Interface (MPI) Models
MPI – Example

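The slide's example code is not reproduced in this extract; a minimal MPI
sketch (an assumed example) showing a send matched by a receive is given
below. Run with, e.g., mpicc send_recv.c && mpirun -np 2 ./a.out:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* the send must be matched by a receive on process 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }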
Hybrid Programming Model
A typical hybrid model is the combination of the MPI model with OpenMP
threads.

Threads perform computationally intensive kernels using local,
on-node data.
Communications between processes on different nodes occur over
the network using MPI.
This hybrid model lends itself well to the most popular hardware
environment of clustered multi/many-core machines.
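A minimal sketch of this combination (an assumed example, not from the
slides): each MPI process sums its node-local data with OpenMP threads,
then the partial sums are combined across nodes with MPI.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int provided, rank;
        double local_sum = 0.0, global_sum = 0.0;
        static double x[N];

        /* One MPI process per node; OpenMP threads inside each process. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        /* On-node work: threads share the node's memory. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < N; i++)
            local_sum += x[i];

        /* Off-node work: explicit MPI communication between processes. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }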
Hybrid Programming Model
Another hybrid model uses MPI with CPU-GPU (Graphics Processing
Unit) programming.
An approach termed GPGPU (General-Purpose computing on Graphics
Processing Units).
- MPI tasks run on CPUs using local memory and communicate with each
other over a network.
- Computationally intensive kernels are off-loaded to GPUs on-node.
- Data exchange between node-local memory and GPUs uses CUDA (or
equivalent).

SPMD & MPMD Programming Models
1- Single Program Multiple Data (SPMD)

SPMD is actually a "high level" programming model that can be built
upon any combination of the previously mentioned parallel
programming models.

SINGLE PROGRAM: All tasks execute their copy of the same program
simultaneously. This program can be threads, message passing, data
parallel or hybrid.
MULTIPLE DATA: All tasks may use different data

SPMD & MPMD Programming Models
1- Single Program Multiple Data (SPMD)

SPMD programs usually have the necessary logic programmed into
them to allow different tasks to branch or conditionally execute only
those parts of the program they are designed to execute. That is, tasks
do not necessarily have to execute the entire program, perhaps only
a portion of it.

The SPMD model, using message passing or hybrid programming, is
probably the most commonly used parallel programming model for
multi-node clusters.
SPMD & MPMD Programming Models
2- Multiple Program Multiple Data (MPMD)

MPMD is also a "high level" programming model that can be built
upon any combination of the previously mentioned parallel
programming models.

MULTIPLE PROGRAM: Tasks may execute different programs
simultaneously. The programs can be threads, message passing,
data parallel or hybrid.
MULTIPLE DATA: All tasks may use different data.
Designing Parallel Programs
Important considerations in designing parallel programs include:

1- Understand the Problem and Program

• Identify the parts of a serial code that have concurrency

• Be aware of inhibitors to parallelism (e.g. data dependency), such as
  the Fibonacci sequence F(n) = F(n-1) + F(n-2)

• Identify the program's hotspots, i.e. where most computational time is
  spent

• Identify bottlenecks in the program, such as I/O operations
Designing Parallel Programs
Important considerations cont’d

2- Parallel Overhead

The amount of time required to coordinate parallel tasks, as opposed
to doing useful work. Parallel overhead can include factors such as:
• Task start-up time
• Data communications
• Synchronizations
• Granularity
• Load imbalance
• Software overhead imposed by parallel languages, libraries,
  the operating system, etc.
• Task termination time
Designing Parallel Programs
Synchronizations
- Involves coordination of parallel tasks in real time
- Often implemented by establishing a synchronization point
- Usually involves waiting by at least one task
- can therefore increase a parallel application's execution time

Granularity
- A qualitative measure of the ratio of computation to communication.
- Coarse: Relatively large amounts of computational work are done
  between communication events
- Fine: Relatively small amounts of computational work are done
  between communication events
Cost of Parallel Programs
Amdahl's Law:
- States that the potential program speedup is defined by the fraction
of code (P) that can be parallelized:
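In its simplest form the law reads:

    speedup = 1 / (1 - P)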

• No speedup: If none of the code can be parallelized, P = 0 and the
  speedup = 1.

• Infinite speedup: If all of the code is parallelized, P = 1 and the
  speedup is infinite (in theory).

• Doubling: If 50% of the code can be parallelized, the maximum
  speedup = 2, meaning the code will run twice as fast.
Cost of Parallel Programs
Amdahl's Law:
- States that the potential program speedup is defined by the fraction
of code (P) that can be parallelized:

In terms of the number of processors performing the parallel fraction
of the work, the relationship can be modeled by:
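    speedup = 1 / (P/N + S)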

where N = number of processors,
      S = serial fraction, and
      P = parallel fraction = 1 - S.
Limits of Parallel Programs

Speedup limit: The scalability of parallelism has a limit!

"You can spend a lifetime getting 95% of your code to be parallel, and never achieve
better than 20x speedup no matter how many processors you throw at it!"
Further Topics in Parallel Computing

Parallel Computing: Parallel computing is the concurrent use of multiple
processors (CPUs) to do computational work.

Grid Computing: "Grid computing" refers to the connection of distributed
computing, visualization, and storage resources to solve large-scale
computing problems that otherwise could not be solved within the limited
memory, computing power, or I/O capacity of a system or cluster at a
single location.

Supercomputers: "Supercomputer" refers to computing systems capable of
sustaining high-performance computing applications that require a large
number of processors, shared/distributed memory, and multiple disks.
Summary
• Parallel computing relies on parallel hardware
• Parallel computing needs parallel software
• So parallel programming requires:
– New way of thinking
– Identification of parallelism
– Design of parallel algorithm
• Implementation can be a challenge
– requires careful attention to details by the
programmer
LAB Exercises
Run MATLAB on a multiprocessor machine and
execute each of the following codes.

Code 1: Serial Code

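The MATLAB code from the original slide is not included in this extract; a
plausible serial sketch (an assumption, not the original code) is:

    % Code 1 (sketch): serial loop -- each iteration runs one after another
    n = 1e7;
    a = rand(1, n);
    b = rand(1, n);
    c = zeros(1, n);

    tic
    for i = 1:n
        c(i) = a(i) + b(i);
    end
    toc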
LAB Exercises
Run MATLAB on a multiprocessor machine and
execute each of the following codes.
Code 2: Parallel Code

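Again, the original slide's code is not shown here; a plausible parallel
sketch (an assumption, requiring the Parallel Computing Toolbox) is:

    % Code 2 (sketch): parallel loop -- independent iterations are
    % distributed across a pool of workers
    n = 1e7;
    a = rand(1, n);
    b = rand(1, n);
    c = zeros(1, n);

    gcp;                 % start (or reuse) a parallel pool of workers
    tic
    parfor i = 1:n
        c(i) = a(i) + b(i);
    end
    toc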
