
Lecture 2 Computer Architecture Course 2024 1

The document provides an overview of parallel computing, contrasting it with traditional serial computing, and highlights its advantages such as time and cost savings, solving larger problems, and utilizing modern hardware effectively. It discusses various types of parallel computing architectures, including Flynn's taxonomy, and introduces concepts like CUDA for GPU programming. Additionally, it outlines the applications of parallel computing in science, engineering, and commercial sectors, emphasizing its growing importance in modern computing.

9/8/2024

Basic Computer Science for Engineers

Lecturer: Nguyen Van Hong

Introduction to Computer Architecture and Parallel Computing
http://cacs.usc.edu/education/cs653-lecture.html


What is Parallel Computing?

• Traditionally, software has been written for serial computation: a problem is broken into a discrete series of instructions.
• Instructions are executed sequentially, one after another.
• They are executed on a single processor.
• Only one instruction may execute at any moment in time.

Serial Computing


Parallel Computing

• Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: a problem is broken into discrete parts that can be solved concurrently.
• Each part is further broken down into a series of instructions.
• Instructions from each part execute simultaneously on different processors.
• An overall control/coordination mechanism is employed.
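
The decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (split_into_parts, parallel_sum) are hypothetical, and Python threads stand in for "different processors" (in CPython they run concurrently but usually not truly simultaneously for CPU-bound work).

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(data, num_parts):
    """Break the problem (a list) into roughly equal, discrete parts."""
    size = max(1, (len(data) + num_parts - 1) // num_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum(data, num_workers=4):
    """Sum each part concurrently, then combine the partial results."""
    parts = split_into_parts(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, parts))  # one task per part
    return sum(partial_sums)  # the overall coordination step


numbers = list(range(1, 101))
total = parallel_sum(numbers)
print(total)
```

The final sum over the partial results plays the role of the control/coordination mechanism: it is the one step that needs every part to have finished.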


Parallel Computing
The computational problem should be able to:
• Be broken apart into discrete pieces of work that can be solved simultaneously;
• Execute multiple program instructions at any moment in time;
• Be solved in less time with multiple compute resources than with a single compute resource.

Parallel Computing

The compute resources are typically:
• A single computer with multiple processors/cores
• An arbitrary number of such computers connected by a network


Parallel Computers:

Virtually all stand-alone computers today are parallel from a hardware perspective:
• Multiple functional units
• Multiple execution units/cores
• Multiple hardware threads

[Figure slides: CPU; GPU; TPU; DPU; PCIe (Peripheral Component Interconnect Express); QPU vs GPU]


Why Use Parallel Computing?

The Real World is Massively Parallel:

• In the natural world, many complex, interrelated events are happening at the same time, yet within a temporal sequence.
• Compared to serial computing, parallel computing is much better suited for modeling, simulating and understanding complex, real-world phenomena.
• For example, imagine modeling these serially:


Main Reasons

SAVE TIME AND/OR MONEY:

• In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings.
• Parallel computers can be built from cheap, commodity components.


Main Reasons

SOLVE LARGER / MORE COMPLEX PROBLEMS:

• Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory.
• Example: "Grand Challenge Problems" (en.wikipedia.org/wiki/Grand_Challenge) requiring PetaFLOPS and PetaBytes of computing resources.
• Example: Web search engines/databases processing millions of transactions every second.


Main Reasons

PROVIDE CONCURRENCY:
A single compute resource can only do one thing at a time.
Multiple compute resources can do many things
simultaneously.
Example: Collaborative Networks provide a global venue
where people from around the world can meet and conduct
work "virtually".


Main Reasons

TAKE ADVANTAGE OF NON-LOCAL RESOURCES:

• Use compute resources on a wide area network, or even the Internet, when local compute resources are scarce or insufficient. Two examples below, each of which has over 1.7 million contributors globally (May 2018):
  https://setiathome.berkeley.edu/
  https://foldingathome.org/


Main reasons
MAKE BETTER USE OF UNDERLYING PARALLEL HARDWARE:
• Modern computers, even laptops, are parallel in architecture, with multiple processors/cores.
• Parallel software is specifically intended for parallel hardware with multiple cores, threads, etc.
• In most cases, serial programs run on modern computers "waste" potential computing power.

[Figure: Intel Xeon processor with 6 cores and 6 L3 cache units]


The Future:
• During the past 20+ years, the trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing.
• In this same time period, there has been a greater than 500,000x increase in supercomputer performance, with no end currently in sight.
• The race is already on for Exascale Computing!
• Exaflop = 10^18 calculations per second


Floating-point operations per second (FLOPS)

• gigaFLOPS, GFLOPS, 10^9
• teraFLOPS, TFLOPS, 10^12
• petaFLOPS, PFLOPS, 10^15
• exaFLOPS, EFLOPS, 10^18
• zettaFLOPS, ZFLOPS, 10^21
• yottaFLOPS, YFLOPS, 10^24


Who is Using Parallel Computing?

Science and Engineering:

Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult problems in many areas of science and engineering:
• Atmosphere, Earth, Environment
• Physics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonics
• Bioscience, Biotechnology, Genetics
• Chemistry, Molecular Sciences
• Geology, Seismology
• Mechanical Engineering - from prosthetics to spacecraft
• Electrical Engineering, Circuit Design, Microelectronics
• Computer Science, Mathematics
• Defense, Weapons


Who is Using Parallel Computing?

Industrial and Commercial:

Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example:
• "Big Data", databases, data mining
• Artificial Intelligence (AI)
• Web search engines, web based business services
• Medical imaging and diagnosis
• Pharmaceutical design
• Oil exploration
• Financial and economic modeling
• Management of national and multi-national corporations
• Advanced graphics and virtual reality, particularly in the entertainment industry
• Networked video and multi-media technologies


Concepts and Terminology


Von Neumann Architecture

• Named after the Hungarian mathematician/genius John von Neumann, who first authored the general requirements for an electronic computer in his papers in 1945.
• Also known as the "stored-program computer": both program instructions and data are kept in electronic memory. This differs from earlier computers, which were programmed through "hard wiring".
• Since then, virtually all computers have followed this basic design.
• Comprised of four main components:
  - Memory
  - Control Unit
  - Arithmetic Logic Unit
  - Input/Output



• Read/write, random access memory is used to store both program instructions and data.
  - Program instructions are coded data which tell the computer to do something.
  - Data is simply information to be used by the program.
• The Control Unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task.
• The Arithmetic Logic Unit performs basic arithmetic operations.
• Input/Output is the interface to the human operator.
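
The fetch, decode and execute cycle described above can be illustrated with a toy stored-program machine. This is a minimal sketch, not a model of any real processor: the five-instruction set (LOAD, ADD, STORE, PRINT, HALT) and the memory layout are assumptions made up for the example.

```python
def run(memory):
    """Execute a program stored in the same memory as its data."""
    acc = 0       # accumulator register of the arithmetic unit
    pc = 0        # program counter held by the control unit
    output = []   # stands in for the output device
    while True:
        opcode, operand = memory[pc]   # fetch the next instruction
        pc += 1
        if opcode == "LOAD":           # decode, then execute
            acc = memory[operand]
        elif opcode == "ADD":
            acc += memory[operand]
        elif opcode == "STORE":
            memory[operand] = acc
        elif opcode == "PRINT":
            output.append(memory[operand])
        elif opcode == "HALT":
            return output
        else:
            raise ValueError(f"unknown opcode: {opcode}")


# Instructions (addresses 0-4) and data (addresses 5-7) share one memory.
memory = [
    ("LOAD", 5),     # acc = memory[5]
    ("ADD", 6),      # acc = acc + memory[6]
    ("STORE", 7),    # memory[7] = acc
    ("PRINT", 7),    # send memory[7] to the output device
    ("HALT", None),
    2,               # address 5: first operand
    3,               # address 6: second operand
    0,               # address 7: result
]
result = run(memory)
print(result)
```

Because program and data live in the same list, the sketch also shows why the design is called "stored-program": changing the entries at addresses 0-4 changes what the machine does, with no rewiring.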


• More info on his other remarkable accomplishments: http://en.wikipedia.org/wiki/John_von_Neumann
• So what? Who cares? Well, parallel computers still follow this basic design, just multiplied in units. The basic, fundamental architecture remains the same.


Parallel computers: Flynn's Classical Taxonomy

This classification was first proposed by Michael Flynn in 1966 and extended in 1972.

• There are different ways to classify parallel computers; Flynn's taxonomy is one of the more widely used.
• Flynn's taxonomy distinguishes multi-processor computer architectures along two independent dimensions, Instruction Stream and Data Stream. Each of these dimensions can have only one of two possible states: Single or Multiple.
• This yields the 4 possible classifications according to Flynn: SISD, SIMD, MISD and MIMD.


Single Instruction, Single Data (SISD):

• A serial (non-parallel) computer
• Single Instruction: Only one instruction stream is being acted on by the CPU during any one clock cycle
• Single Data: Only one data stream is being used as input during any one clock cycle
• Deterministic execution
• This is the oldest type of computer
• Examples: older generation mainframes, minicomputers, workstations and single processor/core PCs.


Single Instruction, Multiple Data (SIMD):

• A type of parallel computer
• Single Instruction: All processing units execute the same instruction at any given clock cycle
• Multiple Data: Each processing unit can operate on a different data element
• Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
  - Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
  - Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820, ETA10


The GPU: CUDA (Compute Unified Device Architecture)

[Figure: Different Types of CUDA Applications]

CUDA (Compute Unified Device Architecture): Supercomputing for the Masses


What is CUDA?

• CUDA is a set of development tools for creating applications that execute on the GPU (Graphics Processing Unit).

• The CUDA compiler accepts an extended dialect of C and C++.

• CUDA was developed by NVIDIA and as such runs only on NVIDIA GPUs of the G8x series and up.

• CUDA was released on February 15, 2007 for PC, and a beta version for Mac OS X on August 19, 2008.


Why CUDA?

• CUDA provides the ability to use high-level languages such as C to develop applications that take advantage of the high level of performance and scalability that the GPU architecture offers.

• GPUs allow the creation of a very large number of concurrently executing threads at very low system resource cost.

• CUDA also exposes fast shared memory (16 KB on early GPUs) that can be shared between threads.

• Full support for integer and bitwise operations.

• Compiled code runs directly on the GPU.


Multiple Instruction, Single Data (MISD):

• A type of parallel computer
• Multiple Instruction: Each processing unit operates on the data independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple processing units.
• Few (if any) actual examples of this class of parallel computer have ever existed.
• Some conceivable uses might be:
  - multiple frequency filters operating on a single signal stream
  - multiple cryptography algorithms attempting to crack a single coded message.


Multiple Instruction, Multiple Data (MIMD):

• A type of parallel computer
• Multiple Instruction: Every processor may be executing a different instruction stream
• Multiple Data: Every processor may be working with a different data stream
• Execution can be synchronous or asynchronous, deterministic or non-deterministic
• Currently the most common type of parallel computer; most modern supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP (Symmetric Multi-Processor) computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components


Some General Parallel Terminology

Like everything else, parallel computing has its own "jargon". Some of the more commonly used terms associated with parallel computing are listed below.

Supercomputing / High Performance Computing (HPC): Using the world's fastest and largest computers to solve large problems.
Node: A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores, memory, network interfaces, etc. Nodes are networked together to comprise a supercomputer.
CPU / Socket / Processor / Core: This varies, depending upon who you talk to. In the past, a CPU (Central Processing Unit) was a singular execution component for a computer. Then, multiple CPUs were incorporated into a node. Then, individual CPUs were subdivided into multiple "cores", each being a unique execution unit. CPUs with multiple cores are sometimes called "sockets" - vendor dependent. The result is a node with multiple CPUs, each containing multiple cores. The nomenclature is confused at times. Wonder why?


Some General Parallel Terminology


Task: A logically discrete section of computational work. A task is
typically a program or program-like set of instructions that is executed by
a processor. A parallel program consists of multiple tasks running on
multiple processors.
Pipelining: Breaking a task into steps performed by different processor
units, with inputs streaming through, much like an assembly line; a type of
parallel computing.
Shared Memory: From a strictly hardware point of view, describes a
computer architecture where all processors have direct (usually bus based)
access to common physical memory. In a programming sense, it describes
a model where parallel tasks all have the same "picture" of memory and
can directly address and access the same logical memory locations
regardless of where the physical memory actually exists.


In processor design, pipelining (also known as pipeline processing) passes instructions through a sequence of stages. While one instruction is in a later stage, the next one can already occupy an earlier stage, so several instructions are in progress at once and complete in an orderly sequence.
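
The benefit of the assembly-line arrangement can be estimated with a simple cycle count. This is a sketch under idealized assumptions that are not in the slides: every stage takes exactly one cycle and the pipeline never stalls. The function names are hypothetical.

```python
def serial_cycles(num_items, num_stages):
    """Each item passes through every stage before the next item starts."""
    return num_items * num_stages


def pipelined_cycles(num_items, num_stages):
    """The first item needs num_stages cycles to fill the pipeline;
    after that, one item completes per cycle."""
    if num_items == 0:
        return 0
    return num_stages + num_items - 1


items, stages = 100, 5
print(serial_cycles(items, stages))     # 500
print(pipelined_cycles(items, stages))  # 104
```

Under these assumptions the ratio approaches the number of stages as the number of items grows, which is why long, steady streams of work suit pipelines best.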


Some General Parallel Terminology


Symmetric Multi-Processor (SMP): Shared memory hardware architecture where multiple processors share a single address space and have equal access to all resources.
Distributed Memory: In hardware, refers to network based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.
Communications: Parallel tasks typically need to exchange data. There are several ways this can be accomplished, such as through a shared memory bus or over a network; however, the actual event of data exchange is commonly referred to as communications regardless of the method employed.

[Figure: Distributed Memory Architecture]


Some General Parallel Terminology

Synchronization: The coordination of parallel tasks in real time, very often associated with communications. Often implemented by establishing a synchronization point within an application where a task may not proceed further until another task(s) reaches the same or logically equivalent point.
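
A synchronization point of this kind can be sketched with a barrier from Python's standard threading module. The two "phases" and the task function are hypothetical and exist only to make the ordering visible.

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)  # the synchronization point
log = []
log_lock = threading.Lock()             # protects the shared log


def task(task_id):
    with log_lock:
        log.append(("phase 1", task_id))
    barrier.wait()  # no task proceeds until all tasks have arrived
    with log_lock:
        log.append(("phase 2", task_id))


threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in log]
print(phases)
```

The order of task ids within a phase may differ from run to run, but every "phase 1" entry is logged before any "phase 2" entry, because no task gets past the barrier until all of them have reached it.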


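Observed Speedup: The observed speedup of a code which has been parallelized, defined as:

    speedup = (wall-clock time of serial execution) / (wall-clock time of parallel execution)

It is one of the simplest and most widely used indicators of a parallel program's performance.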
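
Observed speedup is the ratio of serial to parallel wall-clock time, which the following sketch computes. The helper names and the timings (120 s and 20 s) are hypothetical values chosen for illustration, not measurements.

```python
import time


def wall_clock_seconds(func, *args):
    """Return the wall-clock time taken by one call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def observed_speedup(serial_seconds, parallel_seconds):
    """Serial wall-clock time divided by parallel wall-clock time."""
    if parallel_seconds <= 0:
        raise ValueError("parallel_seconds must be positive")
    return serial_seconds / parallel_seconds


# Hypothetical timings: 120 s for the serial run, 20 s for the parallel run.
speedup = observed_speedup(120.0, 20.0)
print(speedup)  # 6.0
```

In practice the two timings would come from running the serial and the parallel version of the same code on the same input, for instance with wall_clock_seconds.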



• Parallel Overhead: The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:
  - Task start-up time
  - Synchronizations
  - Data communications
  - Software overhead imposed by parallel languages, libraries, operating system, etc.
  - Task termination time


Massively Parallel: Refers to the hardware that comprises a given parallel system - having many processors. The meaning of "many" keeps increasing, but currently, the largest parallel computers are comprised of processing elements numbering in the hundreds of thousands to millions.
Embarrassingly Parallel: Solving many similar, but independent tasks simultaneously; little to no need for coordination between the tasks.



Scalability: Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more resources. Factors that contribute to scalability include:
• Hardware - particularly memory-CPU bandwidths and network communication properties
• Application algorithm
• Parallel overhead
• Characteristics of your specific application


Concepts and Terminology


Limits and Costs of Parallel Programming
Amdahl's Law states that potential program speedup is defined by the fraction of code (P) that can be parallelized:

    speedup = 1 / (1 - P)

1) If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup).
2) If all of the code is parallelized, P = 1 and the speedup is infinite (in theory).
3) If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast.

Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by:

    speedup = 1 / (P/N + S)

where P = parallel fraction, N = number of processors and S = serial fraction (S = 1 - P).


Limits and Costs of Parallel Programming


It soon becomes obvious that there are limits to the scalability
of parallelism. For example:
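
The limits can be made concrete by tabulating Amdahl's Law, speedup = 1 / (P/N + S) with S = 1 - P, for a few parallel fractions and processor counts. This is a sketch: the function name and the chosen values of P and N are assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Amdahl's Law: speedup = 1 / (P/N + S), where S = 1 - P."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / num_processors + serial_fraction)


fractions = [0.50, 0.90, 0.95, 0.99]
processor_counts = [10, 100, 1_000, 10_000, 100_000]

# Header row, then one row of speedups per processor count.
print("N".rjust(8) + "".join(f"  P={p:.2f}" for p in fractions))
for n in processor_counts:
    row = "".join(f"{amdahl_speedup(p, n):8.2f}" for p in fractions)
    print(f"{n:8d}{row}")
```

Each column levels off at 1/S: with P = 0.50 the speedup never reaches 2, and with P = 0.95 it never reaches 20, no matter how many processors are added.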

https://computing.llnl.gov/tutorials/parallel_comp/#Concepts


Complexity:
• In general, parallel applications are much more complex than corresponding serial applications, perhaps an order of magnitude. Not only do you have multiple instruction streams executing at the same time, but you also have data flowing between them.
• The costs of complexity are measured in programmer time in virtually every aspect of the software development cycle:
  - Design
  - Coding
  - Debugging
  - Tuning
  - Maintenance
API = Application Programming Interface


Portability:
Thanks to standardization in several APIs, such as MPI, POSIX threads, and
OpenMP, portability issues with parallel programs are not as serious as in years
past.
However...
All of the usual portability issues associated with serial programs apply to
parallel programs. For example, if you use vendor "enhancements" to Fortran,
C or C++, portability will be a problem.
Even though standards exist for several APIs, implementations will differ in a
number of details, sometimes to the point of requiring code modifications in
order to effect portability.
Operating systems can play a key role in code portability issues.
Hardware architectures are characteristically highly variable and can affect
portability.
POSIX stands for Portable Operating System Interface


Resource Requirements:
• The primary intent of parallel programming is to decrease execution wall clock time; however, in order to accomplish this, more CPU time is required. For example, a parallel code that runs in 1 hour on 8 processors actually uses 8 hours of CPU time.
• The amount of memory required can be greater for parallel codes than serial codes, due to the need to replicate data and for overheads associated with parallel support libraries and subsystems.
• For short running parallel programs, there can actually be a decrease in performance compared to a similar serial implementation. The overhead costs associated with setting up the parallel environment, task creation, communications and task termination can comprise a significant portion of the total execution time for short runs.


Scalability:

Two types of scaling based on time to solution: strong scaling and weak scaling.
1) Strong scaling:
• The total problem size stays fixed as more processors are added.
• Goal is to run the same problem size faster
• Perfect scaling means the problem is solved in 1/P of the serial time, where P is the number of processors

2) Weak scaling:
• The problem size per processor stays fixed as more processors are added. The total problem size is proportional to the number of processors used.
• Goal is to run a larger problem in the same amount of time
• Perfect scaling means a problem P times larger, run on P processors, takes the same time as the single-processor run
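
The two notions of perfect scaling can be turned into efficiency figures. The slides do not define these metrics; the sketch below assumes the commonly used definitions, where 1.0 means perfect scaling. All timings are hypothetical.

```python
def strong_scaling_efficiency(serial_seconds, parallel_seconds, num_processors):
    """Fixed total problem size: perfect scaling means the parallel run
    takes serial_seconds / num_processors, giving an efficiency of 1.0."""
    return serial_seconds / (num_processors * parallel_seconds)


def weak_scaling_efficiency(single_processor_seconds, scaled_run_seconds):
    """Fixed problem size per processor: perfect scaling means the scaled
    run takes as long as the single-processor run (efficiency 1.0)."""
    return single_processor_seconds / scaled_run_seconds


# Hypothetical strong-scaling run: 100 s serially, 15 s on 8 processors.
print(round(strong_scaling_efficiency(100.0, 15.0, 8), 3))

# Hypothetical weak-scaling run: 100 s on 1 processor, 110 s on 8 processors
# with an 8x larger problem.
print(round(weak_scaling_efficiency(100.0, 110.0), 3))
```

Values below 1.0 indicate time lost to overhead, communication or serial sections of the code.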


Scalability:
• The ability of a parallel program's performance to scale is a result of a number of interrelated factors. Simply adding more processors is rarely the answer.
• The algorithm may have inherent limits to scalability. At some point, adding more resources causes performance to decrease. This is a common situation with many parallel applications.
• Hardware factors play a significant role in scalability. Examples:
  - Memory-CPU bus bandwidth on an SMP machine
  - Communications network bandwidth
  - Amount of memory available on any given machine or set of machines
  - Processor clock speed
• Parallel support libraries and subsystems software can limit scalability independent of your application.


Parallel Computer Memory Architectures

Shared Memory
General Characteristics:
• Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space.
• Multiple processors can operate independently but share the same memory resources.
• Changes in a memory location effected by one processor are visible to all other processors.
• Historically, shared memory machines have been classified as UMA and NUMA, based upon memory access times.


Uniform Memory Access (UMA):

• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.


Non-Uniform Memory Access (NUMA):


Often made by physically linking two or more SMPs
One SMP can directly access memory of another SMP
Not all processors have equal access time to all memories
Memory access across link is slower
If cache coherency is maintained, then may also be called
CC-NUMA - Cache Coherent NUMA


Shared Memory
Advantages:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs
Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management.
• The programmer is responsible for synchronization constructs that ensure "correct" access of global memory.


Distributed Memory

Like shared memory systems, distributed memory systems vary widely but share a common characteristic. Distributed memory systems require a communication network to connect inter-processor memory.


Processors have their own local memory. Memory addresses in one processor do
not map to another processor, so there is no concept of global address space across
all processors.
Because each processor has its own local memory, it operates independently.
Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
When a processor needs access to data in another processor, it is usually the task
of the programmer to explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's responsibility.
The network "fabric" used for data transfer varies widely, though it can be as
simple as Ethernet.

Advantages:
• Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately.
• Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain global cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and networking.


Disadvantages:
The programmer is responsible for many of the details
associated with data communication between processors.
It may be difficult to map existing data structures, based on
global memory, to this memory organization.
Non-uniform memory access times - data residing on a
remote node takes longer to access than node local data.


Hybrid Distributed-Shared Memory

General Characteristics:
• The largest and fastest computers in the world today employ both shared and distributed memory architectures.
• The shared memory component can be a shared memory machine and/or graphics processing units (GPU).
• The distributed memory component is the networking of multiple shared memory/GPU machines, which know only about their own memory - not the memory on another machine. Therefore, network communications are required to move data from one machine to another.
• Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.

Advantages and Disadvantages:

• Whatever is common to both shared and distributed memory architectures.
• Increased scalability is an important advantage.
• Increased programmer complexity is an important disadvantage.


Parallel Programming Models


There are several parallel programming models in common
use:
Shared Memory (without threads)
Threads
Distributed Memory / Message Passing
Data Parallel
Hybrid
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)


Parallel programming models exist as an abstraction above hardware and memory architectures.
Although it might not seem apparent, these models are NOT specific to a particular
type of machine or memory architecture. In fact, any of these models can
(theoretically) be implemented on any underlying hardware. Two examples from the
past are discussed below.
SHARED memory model on a DISTRIBUTED memory machine:
Machine memory was physically distributed across networked
machines, but appeared to the user as a single shared memory global
address space. Generically, this approach is referred to as "virtual
shared memory".

DISTRIBUTED memory model on a SHARED memory machine

Which model to use? This is often a combination of what is


available and personal choice. There is no "best" model, although
there certainly are better implementations of some models over
others.


Shared Memory Model (without threads)


• Processes/tasks share a common address space, which they read and write to asynchronously.
• Various mechanisms such as locks and semaphores are used to control access to the shared memory, resolve contentions and prevent race conditions and deadlocks.
• This is perhaps the simplest parallel programming model.
• Advantage from the programmer's point of view: All processes see and have equal access to shared memory. Program development can often be simplified.
• Disadvantage in terms of performance: it becomes more difficult to understand and manage data locality:
  - Keeping data local to the process that works on it conserves memory accesses, cache refreshes and bus traffic that occur when multiple processes use the same data.
  - Unfortunately, controlling data locality is hard to understand and may be beyond the control of the average user.


Shared Memory Model (without threads)


Implementations:
On stand-alone shared memory machines, native operating
systems, compilers and/or hardware provide support for shared
memory programming. For example, the POSIX standard
provides an API for using shared memory, and UNIX provides
shared memory segments (shmget, shmat, shmctl, etc).
On distributed memory machines, memory is physically
distributed across a network of machines, but made global
through specialized hardware and software. A variety of SHMEM
implementations are available:
http://en.wikipedia.org/wiki/SHMEM.
POSIX = Portable Operating System Interface


Threads Model
• This programming model is a type of shared memory programming.
• In the threads model of parallel programming, a single "heavy weight" process can have multiple "light weight", concurrent execution paths.

For example:
• a.out loads and acquires all of the necessary system and user resources to run. This is the "heavy weight" process.
• a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
• Each thread has local data, but also shares the entire resources of a.out. This saves the overhead associated with replicating a program's resources for each thread ("light weight"). Each thread also benefits from a global memory view because it shares the memory space of a.out.
• A thread's work may best be described as a subroutine within the main program. Any thread can execute any subroutine at the same time as other threads.
• Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more than one thread is not updating the same global address at any time.
• Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed.
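
The threads model can be sketched with Python's standard threading module. This is an illustration only: the shared counter and the worker function are hypothetical, and the main program plays the role of a.out.

```python
import threading

counter = 0                       # global memory shared by all threads
counter_lock = threading.Lock()   # synchronization construct


def worker(increments):
    """A thread's work: a subroutine within the main program."""
    global counter
    for _ in range(increments):
        with counter_lock:        # only one thread updates at a time
            counter += 1


# The main program does serial work, then creates the threads.
NUM_THREADS = 4
INCREMENTS = 10_000
threads = [threading.Thread(target=worker, args=(INCREMENTS,))
           for _ in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()                      # main program outlives its threads

print(counter)  # 40000
```

Without the lock, two threads could read the same old value of counter and both write back the same new value, losing an update; the lock makes each read-modify-write step exclusive.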


Implementations:
• From a programming perspective, threads implementations commonly comprise:
  - A library of subroutines that are called from within parallel source code
  - A set of compiler directives embedded in either serial or parallel source code
• In both cases, the programmer is responsible for determining the parallelism (although compilers can sometimes help).
• Threaded implementations are not new in computing. Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications.
• Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.
  - POSIX Threads tutorial: computing.llnl.gov/tutorials/pthreads
  - OpenMP tutorial: computing.llnl.gov/tutorials/openMP


Distributed Memory / Message Passing Model

This model demonstrates the following characteristics:
A set of tasks that use their own local memory during
computation. Multiple tasks can reside on the same
physical machine and/or across an arbitrary number
of machines.
Tasks exchange data through communications by
sending and receiving messages.
Data transfer usually requires cooperative operations
to be performed by each process. For example, a send
operation must have a matching receive operation.
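
A minimal sketch of cooperative send and receive. As an assumption made to keep it self-contained, Python threads and queue.Queue stand in for separate tasks and for a message passing library such as MPI. The names worker and sum_by_message_passing are hypothetical. The worker keeps its data in local variables and only exchanges values through messages.

```python
import queue
import threading


def worker(inbox, outbox):
    """Task with purely local state: receive numbers, send back their sum."""
    local_total = 0                # local memory, not shared with other tasks
    while True:
        message = inbox.get()      # receive: blocks until a matching send arrives
        if message is None:        # sentinel meaning "no more data"
            break
        local_total += message
    outbox.put(local_total)        # send the result back


def sum_by_message_passing(values):
    """Send each value to the worker, then receive the worker's total."""
    to_worker, from_worker = queue.Queue(), queue.Queue()
    task = threading.Thread(target=worker, args=(to_worker, from_worker))
    task.start()
    for v in values:
        to_worker.put(v)           # send
    to_worker.put(None)            # tell the worker to finish
    result = from_worker.get()     # matching receive for the worker's send
    task.join()
    return result


if __name__ == "__main__":
    print(sum_by_message_passing([1, 2, 3, 4]))   # prints 10
```

If the final get() were omitted, the worker's send would never be matched, which is the kind of mismatch the cooperative rule above guards against.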


From a programming perspective, message passing implementations
usually comprise a library of subroutines. Calls to these subroutines are
embedded in source code. The programmer is responsible for determining
all parallelism.
A variety of message passing libraries have been available since the
1980s. These implementations differed substantially from each other,
making it difficult for programmers to develop portable applications.
In 1992, the MPI Forum was formed with the primary goal of establishing
a standard interface for message passing implementations.
Part 1 of the Message Passing Interface (MPI) was released in 1994. Part
2 (MPI-2) was released in 1997 and MPI-3 in 2012. All MPI specifications
are available on the web at http://www.mpi-forum.org/docs/.
MPI is the "de facto" industry standard for message passing, replacing
virtually all other message passing implementations used for production
work. MPI implementations exist for virtually all popular parallel
computing platforms. Not all implementations include everything in MPI-1,
MPI-2 or MPI-3.


Data Parallel Model


May also be referred to as the Partitioned Global
Address Space (PGAS) model
The data parallel model demonstrates the following
characteristics:
 The address space is treated globally.
 Most of the parallel work focuses on performing
operations on a data set. The data set is typically
organized into a common structure, such as an array
or cube.
 A set of tasks work collectively on the same data
structure, however, each task works on a different
partition of the same data structure.
 Tasks perform the same operation on their partition
of work, for example, "add 4 to every array
element".

On shared memory architectures, all tasks may have access to the data structure through global memory.
On distributed memory architectures, the global data structure can be split up logically and/or physically across tasks.
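
The "add 4 to every array element" example can be sketched as follows, assuming Python's concurrent.futures thread pool as the set of tasks. The names add_four and data_parallel_add_four are hypothetical. Every task applies the same operation to its own contiguous partition of one list.

```python
from concurrent.futures import ThreadPoolExecutor


def add_four(partition):
    """The single operation every task applies to its own partition."""
    return [x + 4 for x in partition]


def data_parallel_add_four(data, num_tasks=4):
    """Split data into contiguous partitions, process each, and reassemble."""
    size = max(1, -(-len(data) // num_tasks))        # ceiling division
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=num_tasks) as pool:
        results = pool.map(add_four, partitions)     # map preserves input order
        return [x for part in results for x in part]


if __name__ == "__main__":
    print(data_parallel_add_four([0, 1, 2, 3, 4, 5]))   # prints [4, 5, 6, 7, 8, 9]
```

In standard CPython builds the threads here show the structure of the model rather than a speed-up, because the interpreter lock serializes this kind of pure-Python arithmetic.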


Hybrid Model
 A hybrid model combines more than one of the previously described programming
models.
 Currently, a common example of a hybrid model is the combination of the message
passing model (MPI) with the threads model (OpenMP).
 Threads perform computationally intensive kernels using local, on-node data.
 Communications between processes on different nodes occur over the network using
MPI.
 This hybrid model lends itself well to the most popular hardware environment of
clustered multi/many-core machines.
 Another similar and increasingly popular example of a hybrid model is using MPI
with CPU-GPU (Graphics Processing Unit) programming.
 MPI tasks run on CPUs using local memory and communicating with each other
over a network.
 Computationally intensive kernels are off-loaded to GPUs on-node.
 Data exchange between node-local memory and GPUs uses CUDA (or
something equivalent).
 Other hybrid models are common:
 MPI with Pthreads
 MPI with non-GPU accelerators


Single Program Multiple Data (SPMD):

SPMD is actually a "high level" programming model that can be built upon any
combination of the previously mentioned parallel programming models.
SINGLE PROGRAM: All tasks execute their copy of the same program
simultaneously. This program can be threads, message passing, data parallel or
hybrid.
MULTIPLE DATA: All tasks may use different data.
SPMD programs usually have the necessary logic programmed into them to
allow different tasks to branch or conditionally execute only those parts of the
program they are designed to execute. That is, tasks do not necessarily have to
execute the entire program - perhaps only a portion of it.
The SPMD model, using message passing or hybrid programming, is probably
the most commonly used parallel programming model for multi-node clusters.
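
A minimal SPMD sketch, assuming Python threads as the tasks. The parameter names rank and size follow MPI convention but are an assumption here, and program and run_spmd are hypothetical names. Every task runs the same function; its rank decides which data it takes and which branch it executes.

```python
import threading


def program(rank, size, data, results):
    """The single program every task runs; rank selects the data and the branch."""
    my_data = data[rank::size]        # each task takes a different part of the data
    results[rank] = sum(my_data)      # same code, different data
    if rank == 0:                     # only task 0 executes this branch
        results["note"] = f"{size} tasks started"


def run_spmd(data, size=3):
    """Start `size` copies of the same program and collect their results."""
    results = {}
    tasks = [threading.Thread(target=program, args=(r, size, data, results))
             for r in range(size)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    return results


if __name__ == "__main__":
    print(run_spmd(list(range(10))))
```

With the ten integers 0 to 9 and three tasks, task 0 sums 0, 3, 6, 9, task 1 sums 1, 4, 7, and task 2 sums 2, 5, 8.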

Multiple Program Multiple Data (MPMD):

 Like SPMD, MPMD is actually a "high level" programming model that
can be built upon any combination of the previously mentioned parallel
programming models.
 MULTIPLE PROGRAM: Tasks may execute different programs
simultaneously. The programs can be threads, message passing, data
parallel or hybrid.
 MULTIPLE DATA: All tasks may use different data.
 MPMD applications are not as common as SPMD applications, but
may be better suited for certain types of problems, particularly those
that lend themselves better to functional decomposition than domain
decomposition (discussed later under Partitioning).


Designing Parallel Programs


Automatic vs. Manual Parallelization
Fully Automatic:
The compiler analyzes the source code and identifies
opportunities for parallelism.
The analysis includes identifying inhibitors to parallelism
and possibly a cost weighting on whether or not the
parallelism would actually improve performance.
Loops are the most frequent target for automatic
parallelization.


Programmer Directed
Using "compiler directives" or possibly compiler flags, the
programmer explicitly tells the compiler how to parallelize
the code.
This approach may also be used in conjunction with some degree of
automatic parallelization.



The most common compiler generated parallelization is done using
on-node shared memory and threads (such as OpenMP).
If you are beginning with an existing serial code and have time or
budget constraints, then automatic parallelization may be the
answer. However, there are several important caveats that apply to
automatic parallelization:
Wrong results may be produced
Performance may actually degrade
Much less flexible than manual parallelization
Limited to a subset (mostly loops) of code
May actually not parallelize code if the compiler analysis suggests there are
inhibitors or the code is too complex
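
One common inhibitor is a loop-carried dependency, where each iteration needs the result of the previous one. The sketch below is an illustration under that assumption, not compiler output, and the function names are hypothetical. It compares a serial running sum with the result of running the same loop on independent chunks, which is what a careless parallelization would do. The chunks are processed one after another so that the wrong result is reproducible.

```python
def running_sum_serial(values):
    """Each iteration reads the result of the previous one."""
    out = list(values)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + out[i]
    return out


def running_sum_naive_chunks(values, num_chunks=2):
    """Run the same loop on each chunk independently.

    The chunk boundaries break the dependency chain, so every chunk after
    the first starts from the wrong value.
    """
    size = max(1, -(-len(values) // num_chunks))     # ceiling division
    out = []
    for start in range(0, len(values), size):
        out.extend(running_sum_serial(values[start:start + size]))
    return out


if __name__ == "__main__":
    print(running_sum_serial([1, 2, 3, 4]))          # prints [1, 3, 6, 10]
    print(running_sum_naive_chunks([1, 2, 3, 4]))    # prints [1, 3, 3, 7]
```

A loop such as "add 4 to every element" has no such dependency, which is why independent loops are the usual target.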



Understand the Problem and the Program
Partitioning
One of the first steps in designing a parallel program is to
break the problem into discrete "chunks" of work that can be
distributed to multiple tasks. This is known as decomposition
or partitioning.
There are two basic ways to partition computational work
among parallel tasks: domain decomposition and functional
decomposition.


Domain Decomposition: In this type of partitioning, the data
associated with a problem is decomposed. Each parallel task
then works on a portion of the data.
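
A minimal sketch of how the data might be divided among tasks. Block and cyclic distributions are two common choices, shown here as an assumption since the slide does not name a scheme. The names block_bounds and cyclic_indices are hypothetical.

```python
def block_bounds(n, num_tasks, rank):
    """Return the half-open index range [start, stop) owned by task `rank`
    when n items are split into num_tasks nearly equal contiguous blocks."""
    base, extra = divmod(n, num_tasks)
    start = rank * base + min(rank, extra)   # the first `extra` tasks get one more item
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def cyclic_indices(n, num_tasks, rank):
    """Return the indices owned by task `rank` under a round-robin split."""
    return list(range(rank, n, num_tasks))


if __name__ == "__main__":
    for rank in range(3):
        print(rank, block_bounds(10, 3, rank), cyclic_indices(10, 3, rank))
```

For 10 items and 3 tasks, the block split gives the ranges (0, 4), (4, 7) and (7, 10), so no task holds more than one item above any other.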

Functional Decomposition:
In this approach, the focus is
on the computation that is to
be performed rather than on
the data manipulated by the
computation. The problem is
decomposed according to the
work that must be done. Each
task then performs a portion
of the overall work.
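
A minimal sketch of functional decomposition, assuming Python's concurrent.futures thread pool. The name functional_decomposition and the choice of three jobs are hypothetical. Each task performs a different piece of work on the same input, in contrast to domain decomposition where each task performs the same work on a different piece of the data.

```python
from concurrent.futures import ThreadPoolExecutor


def functional_decomposition(samples):
    """Give each task a different function to apply to the same data."""
    jobs = {"minimum": min, "maximum": max, "total": sum}
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        # One task per function; all tasks read the same input.
        futures = {name: pool.submit(func, samples) for name, func in jobs.items()}
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    print(functional_decomposition([3, 1, 2]))
```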


Other design considerations:

Communications
Synchronization
Data Dependencies
Load Balancing
Granularity
