Module 2 - Parallel Computing

1) Explicit parallelism is motivated by the growing performance gap between processors and memory, and the distributed nature of computational problems. 2) Early computer architectures like the Von Neumann model were serial, while later models introduced parallelism through concepts like Flynn's taxonomy of SISD, SIMD, MISD, and MIMD architectures. 3) Parallel platforms can be classified based on their physical hardware organization or logical programming view, with control structures and communication models defining the latter.

ELE5211: Advanced Topics in CE

Module Two
(Parallel Computing – Explicit Parallelism)

Tutor: Hassan A. Bashir


Intro: Motivating Parallelism
1- Why explicit parallelism?

- The growing gap between the speed and sustainable peak
  performance of current processors,
- The impact of memory system performance, and
- The distributed nature of many computational problems

present overarching motivations for parallelism.
Architectural Concept
(Serial to Parallel)
The Von Neumann Architecture (1940s)
Comprised of four main components:
- Memory
- Control Unit
- Arithmetic Logic Unit
- Input/Output

Read/write, random-access memory is used to store both program
instructions and data.
Program instructions are coded data which tell the computer to do
something.
Architectural Concept
(Serial to Parallel)
Flynn's Classical Taxonomy (1960s)
Multi-processor computer architectures are classified along the
two independent dimensions of Instruction Stream and Data
Stream.

Each of these dimensions can have only one of two possible states:
Single or Multiple.
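Combining the two dimensions yields the four classes discussed next:

                         Single Data    Multiple Data
  Single Instruction        SISD            SIMD
  Multiple Instruction      MISD            MIMD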
Single Instruction, Single Data (SISD)
This is a serial (non-parallel) computer with
Single Instruction: Only one instruction stream is executed by the CPU
during any one clock cycle
Single Data: Only one data stream is utilized during any one clock cycle
Deterministic execution

Examples: single-processor/core PCs, older-generation mainframes, etc.
Single Instruction, Multiple Data (SIMD)
This is a type of parallel computer with

Single Instruction: All processing units execute the same instruction at
any given clock cycle
Multiple Data: Each processing unit can operate on a different data
element

- Suitable for problems with a high degree of regularity, such as
  graphics/image processing.
- Synchronous (lockstep) and deterministic execution

Two varieties: Processor Arrays and Vector Pipelines

Examples: Most modern computers, particularly those with graphics
processing units (GPUs), employ SIMD instructions and execution units.
Single Instruction, Multiple Data (SIMD)
Single Instruction: All processors execute the same instruction at a given cycle
Multiple Data: Each processing unit can operate on different data
Multiple Instruction, Single Data (MISD)
Multiple Instruction: Each processing unit operates on the data
independently via separate instruction streams.
Single Data: A single data stream is fed into multiple processors.
Examples: Few (if any) actual examples of this class of parallel
computer have ever existed.

Multiple Instruction, Multiple Data (MIMD)
Multiple Instruction: Every processor may be executing a different
instruction stream
Multiple Data: Every processor may be working on a different data
stream

Execution: Can be synchronous or asynchronous, deterministic or
non-deterministic

- Currently the most common type of parallel computer; most modern
  supercomputers fall into this category.

Examples: Most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
Multiple Instruction, Multiple Data (MIMD)

Examples: HP/Compaq AlphaServer, IBM Power5
Dichotomy in Parallel Platforms
1- Physical Parallel Platform
This refers to the actual hardware organization of the
platform.

2- Logical Parallel Platform
This refers to a programmer's view of the platform
and involves two critical components:
- Control Structure: Ways of expressing parallel tasks, and
- Communication Model: Mechanisms for specifying
  interaction between these tasks.
Control Structure of Parallel Platforms

Example 1 (SIMD) – Parallelism from Single
Instruction on Multiple Processors
• Consider the following code segment that adds two vectors:
  for (i = 0; i < 1000; i++)
      c[i] = a[i] + b[i];

• Various iterations of the loop are independent of each other and can
  be executed independently.
• In this case, providing each processor with the appropriate data
  allows the loop to execute much faster.
• In an SIMD parallel computer, the same instruction is dispatched to all
  processors and executed concurrently.
• The Intel Pentium processor with Streaming SIMD Extensions (SSE)
  provides a number of instructions that execute the same instruction
  on multiple data items.
• The above architectural enhancements rely on the highly structured
  (regular) nature of the underlying computation (e.g. image
  processing or graphics) to deliver improved performance.
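As an illustration of the idea (a minimal sketch, not taken from the slides),
the same vector addition can be written in C with SSE intrinsics; it assumes
an SSE-capable compiler and processes four floats per instruction:

    #include <xmmintrin.h>   /* SSE intrinsics: 128-bit registers, 4 floats each */

    /* Adds two length-n float vectors; n is assumed to be a multiple of 4. */
    void vec_add_sse(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 elements of a */
            __m128 vb = _mm_loadu_ps(&b[i]);   /* load 4 elements of b */
            __m128 vc = _mm_add_ps(va, vb);    /* one instruction adds all 4 pairs */
            _mm_storeu_ps(&c[i], vc);          /* store 4 results into c */
        }
    }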
SIMD – Selective Execution Challenge
• While the SIMD concept works well for structured
  computations on parallel data structures such as arrays,
  it is often necessary to selectively turn off operations on
  certain data items.

• Thus, most SIMD programming paradigms allow for an activity
  mask to determine whether a processor should participate in
  an operation or not.

• Conditional statements are typically used to support selective
  execution and can be detrimental to the performance of SIMD
  processors.
Example 2 (SIMD) – Execution of Conditional
Statements on SIMD Architectures

The idling challenge
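The figure for this example is not reproduced here. A classic illustration
(an assumed stand-in, not the slide's own code) is a per-element conditional
such as c = (b == 0) ? a : a / b: a SIMD machine evaluates both branches on
all processing elements and uses an activity mask to select which result each
lane keeps, so some units sit idle during each phase. A C/SSE sketch of this
masked execution:

    #include <xmmintrin.h>

    /* Per lane: c = (b == 0) ? a : a / b.
     * Both branches are computed for all four lanes; a mask then selects
     * which result each lane keeps, mirroring SIMD activity masks.
     * (Lanes where b == 0 produce inf/NaN in the division, but those
     * results are masked out.) */
    void cond_div_sse(const float *a, const float *b, float *c, int n)
    {
        const __m128 zero = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 va   = _mm_loadu_ps(&a[i]);
            __m128 vb   = _mm_loadu_ps(&b[i]);
            __m128 mask = _mm_cmpeq_ps(vb, zero);      /* lanes where b == 0 */
            __m128 div  = _mm_div_ps(va, vb);          /* "else" branch, all lanes */
            __m128 res  = _mm_or_ps(_mm_and_ps(mask, va),
                                    _mm_andnot_ps(mask, div));
            _mm_storeu_ps(&c[i], res);
        }
    }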
Memory Architecture

Memory Architecture
(Shared memory)
General Characteristics:

The ability for all processors to access all memory as


global address space.

Multiple processors can operate independently but


share the same memory resources.

Changes in a memory location effected by one processor


are visible to all other processors.
17
Memory Architecture
(Shared memory )
Classes of Shared Memory:

The classification is based upon memory access times.


Memory Architecture
(Shared memory )
Uniform Memory Access (UMA)

- Most commonly represented today by Symmetric


Multiprocessor (SMP) machines
- Identical processors
- Equal access and access times to memory
- Sometimes called CC-UMA - Cache Coherent UMA.

Cache coherence means if one processor updates a location in


shared memory, all the other processors know about the update.
Cache coherency is accomplished at the hardware level.
19
Memory Architecture
(Shared memory )
Non-uniform Memory Access (NUMA)

- It physically links two or more SMPs


- One SMP can directly access memory of another
- Not all processors have equal access time to all memories
- Memory access across link is slower
- If cache coherency is maintained, then may also be called CC-
NUMA - Cache Coherent NUMA

20
Memory Architecture
(Shared memory )
Advantages:
- Global address space provides a user-friendly programming
perspective to memory
- Data sharing between tasks is both fast and uniform due to the
proximity of memory to CPUs

Disadvantages:
- Lack of scalability between memory and CPUs.
  Adding more CPUs can geometrically increase traffic on the shared
  memory-CPU path and, for cache coherent systems, geometrically
  increase the traffic associated with cache/memory management.
- Synchronization: The programmer is responsible for synchronization
  constructs that ensure "correct" access of global memory.
Memory Architecture
(Distributed memory )
General Characteristics:

Distributed memory systems require a communication network to
connect inter-processor memory.

No global address space: Processors have their own local memory.
Memory addresses in one processor do not map to another
processor, so there is no concept of a global address space across all
processors.

No cache coherency: Because each processor has its own local
memory, it operates independently. Changes it makes to its
local memory have no effect on the memory of other
processors.
Memory Architecture
(Distributed memory )

General Characteristics:

Programmer decides on memory access: When a processor
needs access to data in another processor, it is usually the
task of the programmer to explicitly define how and when
data is communicated. Synchronization between tasks is
likewise the programmer's responsibility.

The network "fabric" used for data transfer varies widely, though
it can be as simple as Ethernet.
Memory Architecture
(Distributed memory )

Advantages:
-Memory is scalable with the number of processors. Increase the
number of processors and the size of memory increases
proportionately.

-Rapid memory access: Each processor can rapidly access its own
memory without interference and without the overhead incurred
with trying to maintain global cache coherency.

- Cost effectiveness: Can use commodity, off-the-shelf processors and
  networking.
Memory Architecture
(Distributed memory )

Disadvantages:
- Programmer burden: The programmer is responsible for many of the
  details associated with data communication between processors.

- Mapping difficulty: It may be difficult to map existing data
  structures, based on global memory, to this memory organization.

- Non-uniform memory access times: Data residing on a remote
  node takes longer to access than node-local data.
Memory Architecture
(Hybrid Distributed-Shared memory )

General Characteristics:

The largest and fastest computers in the world today employ both
shared and distributed memory architectures.

The shared memory component can be a shared memory
machine and/or graphics processing units (GPUs).
Memory Architecture
(Hybrid Distributed-Shared memory )

General Characteristics:
Needs network communications: The distributed memory
component is the networking of multiple shared-memory/GPU
machines, which know only about their own memory, not the
memory of another machine. Thus, a network is needed to move
data from one machine to another.

Current trends indicate that this type of memory architecture will
continue to prevail for the foreseeable future.

Advantage: Increased scalability
Disadvantage: Increased programmer complexity
Memory Architecture
Remarks So far
For different computational problems, parallelism may exist in
different forms.

For a given computational problem, parallelism may exist at different
levels.

Finding parallelism (as much as possible) may not be straightforward.

However, once parallelism is identified, parallel computing becomes
possible.

It is vital to understand the required collaboration between processes.

Parallel programming is the next big step.


Parallel Programming Models
1- Shared Memory (Threaded and Non-threaded) models
- Easy to program (such as OpenMP)
- Difficult to scale to many CPUs (NUMA, cache coherence)

2- Message-passing model
- Many programming details (MPI or PVM)
- Better user control (data & work decomposition)
- Larger systems and better performance

3- Stream-based programming (for using GPUs)

4- Hybrid parallel programming

Shared Memory Models
This, perhaps, is the simplest parallel programming model where
processes/tasks share a common address space, which they
read and write to asynchronously.

Shared Memory Models

Various mechanisms such as locks are used to control access to the
shared memory, resolve contention, and prevent race
conditions and deadlocks.

Advantage from the programmer's point of view: The notion of data
"ownership" is lacking, so there is no need to specify explicitly
the communication of data between tasks.

All processes see and have equal access to shared memory.
Thread Models
This is a type of shared memory programming model where a single
"heavy weight" process can have multiple "light weight",
concurrent execution paths.

Thread Models
- From a programming perspective, threads implementations
commonly comprise:

• A library of subroutines that are called from within parallel source code
• A set of compiler directives embedded in either serial or parallel source
code

Types of Thread Models
Historically, there are two main implementations of the threads model:
- POSIX Threads, and
- OpenMP

POSIX Threads
- Specified by the IEEE POSIX 1003.1c standard (1995). C Language only.
- Part of Unix/Linux operating systems
- Library based
- Commonly referred to as Pthreads.
- Very explicit parallelism; requires significant programmer attention to
detail.

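To illustrate the explicit, library-based style described above, here is a
minimal Pthreads sketch (an illustrative example, not taken from the slides);
compile with cc -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Each thread prints the id passed to it through the argument pointer. */
    static void *hello(void *arg)
    {
        long id = (long)arg;
        printf("Hello from thread %ld\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, hello, (void *)i);   /* spawn threads */
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);                        /* wait for all */
        return 0;
    }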
Types of Thread Models
OpenMP
- Industry standard with fork-join model
- Compiler directive based
- Portable / multi-platform, including Unix and Windows platforms
- Available in C/C++ and Fortran implementations
- Can be very easy and simple to use
- Provides for "incremental parallelism". Can begin with a serial code.

Types of Thread Models
OpenMP - Example
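The example code from the original slide is not included in this extract; a
minimal OpenMP sketch of the earlier vector addition (an assumed example,
not the slide's exact code) is:

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        static float a[N], b[N], c[N];

        for (int i = 0; i < N; i++) {      /* serial initialization */
            a[i] = (float)i;
            b[i] = 2.0f * i;
        }

        /* The directive asks the compiler to share the loop iterations
         * among the available threads (fork-join model). */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[%d] = %f (max threads: %d)\n",
               N - 1, c[N - 1], omp_get_max_threads());
        return 0;
    }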

Other threaded implementations include:
- Microsoft threads
- Java and Python threads
- CUDA threads for GPUs
Distributed Memory/Message Passing
Interface (MPI) Models
MPI (message passing interface) is the ‘de facto’ industry standard
which is also library-based;

Its implementation is available on almost every major parallel platform

• Each process has its local memory

• Explicit message passing enables information exchange and
  collaboration between processes
Distributed Memory/Message Passing
Interface (MPI) Models
• Data transfer usually requires cooperative operations to be
performed by each process. For example, a send operation must
have a matching receive operation.

• Portability, good performance & functionality


Distributed Memory/Message Passing
Interface (MPI) Models
MPI – Example

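The slide's example code is not reproduced in this extract; a minimal MPI
sketch (an assumed example) showing a send matched by a receive is given
below. Run with, e.g., mpicc send_recv.c && mpirun -np 2 ./a.out:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* the send must be matched by a receive on process 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("Process 1 received %d from process 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }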
Hybrid Programming Model
A typical hybrid model is the combination of the MPI model with OpenMP
threads.

Threads perform computationally intensive kernels using local,
on-node data.
Communications between processes on different nodes occur over
the network using MPI.
This hybrid model lends itself well to the most popular hardware
environment of clustered multi/many-core machines.
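A minimal sketch of this combination (an assumed example, not from the
slides): each MPI process sums its node-local data with OpenMP threads,
then the partial sums are combined across nodes with MPI.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv)
    {
        int provided, rank;
        double local_sum = 0.0, global_sum = 0.0;
        static double x[N];

        /* One MPI process per node; OpenMP threads inside each process. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++)
            x[i] = 1.0;

        /* On-node work: threads share the node's memory. */
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < N; i++)
            local_sum += x[i];

        /* Off-node work: explicit MPI communication between processes. */
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %f\n", global_sum);

        MPI_Finalize();
        return 0;
    }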
Hybrid Programming Model
Another hybrid model uses MPI with CPU-GPU (Graphics Processing
Unit) programming.
An approach termed GPGPU (General-Purpose computing on Graphics
Processing Units).
- MPI tasks run on CPUs using local memory and communicate with each
other over a network.
- Computationally intensive kernels are off-loaded to GPUs on-node.
- Data exchange between node-local memory and GPUs uses CUDA (or
equivalent).

SPMD & MPMD Programming Models
1- Single Program Multiple Data (SPMD)

SPMD is actually a "high level" programming model that can be built
upon any combination of the previously mentioned parallel
programming models.

SINGLE PROGRAM: All tasks execute their copy of the same program
simultaneously. This program can be threads, message passing, data
parallel or hybrid.
MULTIPLE DATA: All tasks may use different data

SPMD & MPMD Programming Models
1- Single Program Multiple Data (SPMD)

SPMD programs usually have the necessary logic programmed into
them to allow different tasks to branch or conditionally execute only
those parts of the program they are designed to execute. That is, tasks
do not necessarily have to execute the entire program, perhaps only
a portion of it.

The SPMD model, using message passing or hybrid programming, is
probably the most commonly used parallel programming model for
multi-node clusters.
SPMD & MPMD Programming Models
2- Multiple Program Multiple Data (MPMD)

MPMD is also a "high level" programming model that can be built
upon any combination of the previously mentioned parallel
programming models.

MULTIPLE PROGRAM: Tasks may execute different programs
simultaneously. The programs can be threads, message passing,
data parallel or hybrid.
MULTIPLE DATA: All tasks may use different data.
Designing Parallel Programs
Important considerations in designing parallel programs include:

1- Understand the Problem and Program

• Identify the parts of a serial code that have concurrency

• Be aware of inhibitors to parallelism (e.g. data dependency), such as
  the Fibonacci sequence F(n) = F(n-1) + F(n-2)

• Identify the program's hotspots, i.e. where most computational time is
  spent

• Identify bottlenecks in the program, such as I/O operations
Designing Parallel Programs
Important considerations cont’d

2- Parallel Overhead

The amount of time required to coordinate parallel tasks, as opposed
to doing useful work. Parallel overhead can include factors such as:
• Task start-up time
• Data communications
• Synchronizations
• Granularity
• Load imbalance
• Software overhead imposed by parallel languages, libraries,
  the operating system, etc.
• Task termination time
Designing Parallel Programs
Synchronizations
- Involves coordination of parallel tasks in real time
- Often implemented by establishing a synchronization point
- Usually involves waiting by at least one task
- can therefore increase a parallel application's execution time

Granularity
- A qualitative measure of the ratio of computation to communication.
- Coarse: Relatively large amounts of computational work are done
  between communication events
- Fine: Relatively small amounts of computational work are done
  between communication events
Cost of Parallel Programs
Amdahl's Law:
- States that the potential program speedup is defined by the fraction
of code (P) that can be parallelized:
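In its simplest form the law reads:

    speedup = 1 / (1 - P)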

• No speedup: If none of the code can be parallelized, P = 0 and the
  speedup = 1.

• Infinite speedup: If all of the code is parallelized, P = 1 and the
  speedup is infinite (in theory).

• Doubling: If 50% of the code can be parallelized, the maximum
  speedup = 2, meaning the code will run twice as fast.
Cost of Parallel Programs
Amdahl's Law:
- States that the potential program speedup is defined by the fraction
of code (P) that can be parallelized:

In terms of the number of processors performing the parallel fraction
of the work, the relationship can be modeled by:
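    speedup = 1 / (P/N + S)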

where N = number of processors,
      S = serial fraction, and
      P = parallel fraction = 1 - S.
Limits of Parallel Programs

Speedup limit: The scalability of parallelism has a limit!

"You can spend a lifetime getting 95% of your code to be parallel, and never achieve
better than 20x speedup no matter how many processors you throw at it!"
Further Topics in Parallel Computing

Parallel Computing: Parallel computing is the concurrent use of multiple
processors (CPUs) to do computational work.

Grid Computing: "Grid computing" refers to the connection of distributed
computing, visualization, and storage resources to solve large-scale
computing problems that otherwise could not be solved within the limited
memory, computing power, or I/O capacity of a system or cluster at a
single location.

Supercomputers: "Supercomputer" refers to computing systems capable of
sustaining high-performance computing applications that require a large
number of processors, shared/distributed memory, and multiple disks.
Summary
• Parallel computing relies on parallel hardware
• Parallel computing needs parallel software
• So parallel programming requires:
– New way of thinking
– Identification of parallelism
– Design of parallel algorithm
• Implementation can be a challenge
– requires careful attention to details by the
programmer
LAB Exercises
Run MATLAB on a multiprocessor machine and
execute each of the following codes.

Code 1: Serial Code

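The MATLAB code from the original slide is not included in this extract; a
plausible serial sketch (an assumption, not the original code) is:

    % Code 1 (sketch): serial loop -- each iteration runs one after another
    n = 1e7;
    a = rand(1, n);
    b = rand(1, n);
    c = zeros(1, n);

    tic
    for i = 1:n
        c(i) = a(i) + b(i);
    end
    toc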
LAB Exercises
Run MATLAB on a multiprocessor machine and
execute each of the following codes.
Code 2: Parallel Code

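Again, the original slide's code is not shown here; a plausible parallel
sketch (an assumption, requiring the Parallel Computing Toolbox) is:

    % Code 2 (sketch): parallel loop -- independent iterations are
    % distributed across a pool of workers
    n = 1e7;
    a = rand(1, n);
    b = rand(1, n);
    c = zeros(1, n);

    gcp;                 % start (or reuse) a parallel pool of workers
    tic
    parfor i = 1:n
        c(i) = a(i) + b(i);
    end
    toc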
