
Information Technology

FIT3143 Parallel Computing


Semester 2 2024

Topic 1:
Introduction to Parallel Computing
Dr Carlo Kopp, MACM, SMIEEE, AFAIAA
Sima, Fountain and Kacsuk, Advanced Computer Architectures - a Design Space Approach, Chapter 1
Faculty of Information Technology
© 2024 Monash University
Why Study Parallel Computing?
§ Parallel computing hardware is now pervasive – smartphones, tablets,
notebooks, desktops, servers, clusters, clouds all employ parallel hardware in
various forms. Very few systems today use single core CPUs
§ Designing, developing and implementing code to run well on parallel hardware
requires a robust understanding of parallelism and how it impacts code design
and system performance
§ Theory: Parallelism arises in different forms that impact performance and code
design in different ways, and these require different approaches to system and code design
§ Theory: Without a sound theoretical background, good solutions are impossible
§ Practice: The ability to analyse a parallel computing problem and match it to
available hardware
§ Practice: The ability to code parallel applications on real world hardware

2
Parallelism Concepts and Context
Why Parallelism?
§ The basic reason why parallel systems are built is to improve
computational performance – in the simplest of terms to “make
applications run much faster”
§ Since the 1950s parallelism has been used in various ways to extract
more performance out of hardware that is limited in how fast it can
execute machine instructions
§ Parallelism has been used to improve the performance of individual
CPUs using techniques like pipelining, superscalar processing, and
vector processing
§ Parallelism has been used to improve the performance of systems by
aggregating multiple CPUs or processing elements (in GPUs, NPUs)

4
What is Parallelism?
§ In the most basic sense, all parallelism involves exploiting concurrency when
executing instructions, or code comprising many instructions
§ If a CPU has an arithmetic unit that can execute one machine instruction, e.g. a
multiplication, in say 1 nanosecond, then the ability to execute ten such
instructions simultaneously provides a tenfold speedup – ten multiplies complete
in the time one would otherwise take (see the sketch below)
§ Parallelism always incurs costs in hardware – for instance, a CPU that can
execute ten machine instructions concurrently will be more costly, just as a
multicore CPU chip with ten cores will also be more costly than a single core
CPU, and a server with ten multicore CPU chips will be at least ten times as
expensive as a single multicore CPU chip machine, all else being equal
§ While hardware imposes many limits on parallelism, algorithms often present a
much bigger challenge to achieving high concurrency in a parallel system
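As a minimal sketch of the idea (not from the slides; OpenMP and the array names are illustrative assumptions), ten independent multiplications can be issued concurrently:

/* Minimal sketch: ten independent multiplies run concurrently with
 * OpenMP. Compile with: gcc -fopenmp speedup.c -o speedup */
#include <stdio.h>

int main(void) {
    double a[10], b[10], c[10];
    for (int i = 0; i < 10; i++) { a[i] = i; b[i] = 2.0; }

    /* The ten multiplications are independent, so they can run on ten
     * cores at once; the ideal speedup over a sequential loop is 10x,
     * ignoring thread start-up and scheduling overhead. */
    #pragma omp parallel for
    for (int i = 0; i < 10; i++)
        c[i] = a[i] * b[i];

    for (int i = 0; i < 10; i++)
        printf("c[%d] = %f\n", i, c[i]);
    return 0;
}

In practice a ten-element loop is far too small to benefit, since the cost of creating threads outweighs the work; the point is only to show where the concurrency comes from.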

5
What is Parallel Computing?
§ Parallel computing involves programming for concurrency on many
processors, rather than sequential execution on a single processor
§ Parallel computing requires learning a different approach to
programming to bypass the limits of sequential processing;
§ Programmers need to understand parallel architectures and as needed
re-design applications for parallel platforms;
§ In considering performance, a programmer working in a “traditional”
sequential system thinks in terms of linear timescales;
§ In a parallel environment, concurrency and synchronisation must also
be considered, as the sketch below illustrates.
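A minimal sketch of why synchronisation matters (an assumed example, not from the slides): two threads update one shared counter, and without the mutex the increments would race and the final value would be unpredictable.

/* Two threads incrementing a shared counter; the mutex serialises the
 * updates. Compile with: gcc counter.c -pthread -o counter */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* protect the shared update */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* 2000000 with the lock held */
    return 0;
}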

6
Where is Parallel Computing Used?
§ The most common application of parallel computing techniques today is in
commodity computing products
§ Multicore CPU chips are used in portable devices like smartphones and tablets,
portable equipment like notebooks, and all types of desktops and small servers
– gaming desktops now use up to 24 core CPUs, and servers up to 60 core
CPUs
§ Cloud systems used in data centres use many thousands of multicore CPUs
networked by a fabric, and are typical parallel/distributed systems
§ Supercomputers in the Exascale performance class arrived in 2022 – these are
massively parallel / distributed systems with performance in the class of 10^18
IEEE 754 Double Precision (64-bit) operations per second and are typically
used to solve huge scientific computing problems

7
Intel Sapphire Rapids 2023 (Up to 60 cores)

8
Hewlett Packard Enterprise / Cray Frontier / OLCF-5 (2022)

§ 606,208 CPU cores in 9,472 AMD EPYC 7713 “Trento” 64 core 2 GHz CPUs
§ 8,335,360 GPU cores in 37,888 AMD Instinct MI250X GPUs
§ Fabric: HPE Slingshot 64-port switches using 200 Gbps QSFP Terabit Ethernet
§ Developed by HPE Cray and AMD for the U.S. Oak Ridge National Laboratory
and U.S. Department of Energy
https://www.flickr.com/photos/olcf/52117623843/

9
Large Scale Computational Problems (Sankar, 2008)
§ Science
– Global climate modelling
– Biology: genomics; protein folding; drug design
– Astrophysical modelling
– Computational Chemistry
– Computational Material Sciences and Nanosciences
§ Engineering
– Semiconductor design
– Earthquake and structural modelling, remote sensing
– Computational fluid dynamics (airplane design) & combustion (engine design)
– Simulation
– Deep learning
– Game design
– Telecommunications (e.g. Network monitoring & optimization)
– Autonomous systems
§ Business
– Financial derivatives and economic modelling
– Transaction processing, web services and search engines
– Analytics

10
Parallel versus Distributed Computing
Parallel versus Distributed Computing
§ Parallel and Distributed Computing are frequently confused by novices,
because many applications are both parallel and distributed at the same time;
§ Parallel Computing is mostly focused on problems where the same computing
task is divided up to execute concurrently on many processing cores or
components, regardless of whether these are on the same chip, in the same
computer, or distributed across a fabric or network connecting many computers;
§ Distributed Computing is mostly focused on problems where the same or
different computing tasks are concurrently executed on multiple cores
distributed across a fabric or network connecting many computers;
§ This overlap results in a need to understand a number of important distributed
computing concepts to solve many parallel programming problems

12
Parallel versus Distributed Computing
[Diagram: concerns shared by parallel and distributed systems – scalability,
latency, bandwidth, synchronisation, reliability. Systems that are both
distributed and parallel include clusters and clouds.]

13
Conventional versus Parallel Processing

[Diagram: conventional, i.e. “sequential”, computing versus parallel computing]

Adapted from https://computing.llnl.gov/tutorials/parallel_comp/

14
The Diversity Problem – How to Classify Parallel Processors?
§ Until the 1990s most computers in use in desktop and portable
applications were single CPU (core) machines
§ Large mainframes and servers were mainly machines with multiple
single CPU (core) processors sharing a main memory
§ A small minority of supercomputers were vector processing machines
§ Increasing chip density during the 1990s led to many changes:
a) Multicore CPU chips starting with dual and quad core devices
b) Networking of multicore CPU machines into clusters
c) Multimedia coprocessors to process streaming data and vectors
§ Density now permits single chip solutions for arbitrary parallel models

15
The Taxonomy Problem

§ A taxonomy of parallel architectures can be built based on three relationships:
1. Relationship between PE and the instruction sequence executed
2. Relationship between PE and the memory
3. Relationship between PE and the interconnection network
§ Where a PE (PU) is a Processing Element (Processing Unit) e.g. a CPU
or Execution Unit
§ Please note that different textbooks and papers often use different
labels even if the definitions are fundamentally the same
§ The most widely used taxonomy is Flynn’s 1966 model

16
Flynn’s Taxonomy
Flynn’s Taxonomy

§ Michael Flynn (1966) developed a taxonomy of parallel systems based on the
number of independent instruction and data streams.
A. SISD: Single Instruction Stream- Single Data Stream
(sequential Von Neumann machine)
B. MISD: Multiple Instruction Stream- Single Data Stream
C. SIMD: Single Instruction Stream- Multiple Data Stream
D. MIMD: Multiple Instruction Stream- Multiple Data
Stream

18
Flynn’s Taxonomy

Defined by Prof Michael J Flynn at Stanford University

                     Instructions
                  One          Many
Data    One       SISD         MISD
        Many      SIMD         MIMD

§ SISD: Single Instruction Single Data
§ MISD: Multiple Instruction Single Data
§ SIMD: Single Instruction Multiple Data
§ MIMD: Multiple Instruction Multiple Data

19
Flynn’s Taxonomy

20
Single Instruction Single Data (Von Neumann Model)
§ A serial (non-parallel) computer
§ Single instruction: only one instruction stream is being acted
on by the CPU during any one clock cycle
§ Single data: only one data stream is being used as input
during any one clock cycle
§ Deterministic execution
§ This is the oldest and, until recently, the most prevalent form
of computer
§ Examples: most single core CPU notebooks, desktops,
workstations and servers

21
Single Instruction Single Data (Von Neumann Model)
§ Consists of a processor
executing a program stored in a
(main) memory.
§ Each main memory location is
identified by its address.
Addresses start at 0 and extend
to 2^n - 1 when there are n bits
(binary digits) in the address –
e.g. a 32-bit address identifies
2^32 = 4,294,967,296 locations.

22
Single Instruction Multiple Data System
§ A type of parallel computer
§ Single instruction: All processing units execute the same
instruction at any given clock cycle
§ Multiple data: Each processing unit can operate on a different
data element
§ This type of machine typically has an instruction dispatcher, a
very high-bandwidth internal interconnect, and a very large
array of very small-capacity instruction (execution) units.
§ Best suited for specialized problems characterized by a high
degree of regularity, such as image processing.
§ Synchronous (lockstep) and deterministic execution
§ Two varieties: Processor Arrays and Vector Pipelines
Examples – Processor Arrays: Connection Machine CM-2, MasPar MP-1, MP-2;
Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
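A minimal data-parallel sketch (an assumed example, not from the slides): the same operation is applied to every array element, which is exactly the pattern a SIMD processor array or a CPU vector unit exploits. The OpenMP simd hint asks the compiler to vectorise the loop.

/* Single instruction stream applied to many data elements: y = 2*x + y.
 * Compile with: gcc -O2 -fopenmp-simd simd_axpy.c -o simd_axpy */
#include <stdio.h>

#define N 1024

int main(void) {
    float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = (float)i; y[i] = 1.0f; }

    #pragma omp simd
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f, y[%d] = %f\n", y[0], N - 1, y[N - 1]);
    return 0;
}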

23
Single Instruction Multiple Data System

[Diagram: SIMD processor array – array control unit, processor array, data
interface to a host computer, data in/out, and mass storage]

24
Multiple Instruction Single Data
§ A single data stream is fed into multiple processing units.
§ Each processing unit operates on data independently via independent
instruction streams.
§ Few actual examples of this class of parallel computer have ever existed. One
is the experimental Carnegie-Mellon computer
§ Many textbooks class systolic arrays as MISD architectures but this is
disputed – systolic arrays are used in some AI NPUs
§ Some conceivable uses might be:
a) multiple frequency filters operating on a single signal stream
b) multiple cryptography algorithms attempting to crack a single coded
message.

25
Multiple Instruction Multiple Data

§ Currently, the most common type of parallel computer.
Most modern computers fall into this category.
§ Multiple Instruction: every processor may be executing a
different instruction stream
§ Multiple Data: every processor may be working with a
different data stream
§ Execution can be synchronous or asynchronous,
deterministic or non-deterministic
§ Examples: most current supercomputers, networked
parallel computer "grids" and multi-processor SMP
computers, and anything with a multi-core CPU chip
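A minimal MIMD sketch (an assumed example, not from the slides): two threads on a multicore CPU execute different instruction streams on different data streams at the same time.

/* Different instructions on different data, concurrently.
 * Compile with: gcc mimd.c -pthread -o mimd */
#include <pthread.h>
#include <stdio.h>

#define N 1000

static void *sum_task(void *arg) {          /* instruction stream 1 */
    int *data = arg;
    long s = 0;
    for (int i = 0; i < N; i++) s += data[i];
    printf("sum = %ld\n", s);
    return NULL;
}

static void *max_task(void *arg) {          /* instruction stream 2 */
    int *data = arg;
    int m = data[0];
    for (int i = 1; i < N; i++) if (data[i] > m) m = data[i];
    printf("max = %d\n", m);
    return NULL;
}

int main(void) {
    static int a[N], b[N];                   /* two different data streams */
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_task, a);
    pthread_create(&t2, NULL, max_task, b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}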

26
Multiple Instruction Multiple Data – Distributed and Shared Memory
[Diagrams: MIMD with distributed memory – processors 1..p, each with its own
local memory, connected by an interconnection network; and MIMD with shared
memory – processors 1..p connected via an interconnection network to shared
memory modules 1..m]

27
Memory Architectures
Why Do Memory Architectures Matter?
§ There are multiple ways in which a parallel system might
access memory
§ The memory architecture in use can impact the behaviour of
the system in multiple ways:
a) Performance: bandwidth available to read and write data
from and to memory
b) Reliability: maintaining consistency of data for different
processing elements in the system
c) Security: control of access permissions to data in memory
29
Parallel Computer Memory Architectures
§ Broadly divided into three categories:
A. Shared Memory Architecture: multiple processing
elements access memory via a common local bus or switch
using a common address space
B. Distributed Memory Architecture: multiple processing
elements access memory over a fabric (network, bus,
switch) not always using a common address space
C. Hybrid Memory Architecture: Combines (A) and (B)

30
Shared Memory Architecture
§ Shared memory parallel computers vary widely,
but generally have in common the ability for all
processors to access all memory as a global
address space.
§ Multiple processors can operate independently
but share the same memory resources.
§ Changes in a memory location effected by one
processor are visible to all other processors.
§ Shared memory machines can be divided into
two main classes based upon memory access
times: UMA and NUMA (discussed later).
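A minimal shared-memory sketch (an assumed example, not from the slides): OpenMP threads in one process all address the same array in a single global address space, so no explicit data movement is needed.

/* All threads read the shared array directly; the reduction clause
 * combines their partial sums safely.
 * Compile with: gcc -fopenmp shared_sum.c -o shared_sum */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N];   /* visible to every thread: shared memory */

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < N; i++)
        total += a[i];

    printf("threads = %d, total = %f\n", omp_get_max_threads(), total);
    return 0;
}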

31
Distributed Memory Architecture
§ Distributed memory systems require a communication
network to connect inter-processor memory.
§ Processors have their own local memory. There is no
concept of global address space across all processors.
§ Because each processor has its own local memory, it
operates independently. Changes it makes to its local
memory have no effect on the memory of other processors.
Hence, the concept of cache coherency does not apply.
§ When a processor needs access to data in another
processor, it is usually the task of the programmer to
explicitly define how and when data is communicated.
Synchronization between tasks is likewise the programmer's
responsibility. The network “fabric” used for data transfer
varies widely, though it can be as simple as Ethernet.
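A minimal distributed-memory sketch (an assumed example, not from the slides): each MPI process has its own local memory, so data must be moved explicitly with messages. Here rank 0 sends a value to rank 1 over the fabric.

/* Explicit message passing between two processes.
 * Build/run (assuming an MPI installation), e.g.:
 *   mpicc dist.c -o dist && mpirun -np 2 ./dist */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double x = 3.14;                     /* lives only in rank 0's memory */
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double x;
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", x);   /* a copy now in rank 1's memory */
    }

    MPI_Finalize();
    return 0;
}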

32
Hybrid Memory Architecture
§ Clusters, clouds and supercomputers today employ both
shared and distributed memory architectures.
§ The shared memory component is usually a cache coherent
SMP (Symmetrical Multi-Processing) machine. Processors
on a given SMP can address that machine's memory as
global.
§ The distributed memory component is the networking of
multiple SMPs. SMPs know only about their own memory -
not the memory on another SMP. Therefore, network
communications are required to move data from one SMP to
another.
§ Current trends seem to indicate that this type of memory
architecture will continue to prevail and increase at the high
end of computing for the foreseeable future.
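A minimal hybrid sketch (an assumed example, not from the slides): the common pattern combines MPI message passing between SMP nodes with OpenMP threads sharing memory within each node.

/* MPI between processes/nodes, OpenMP threads within each process.
 * Build/run, e.g.: mpicc -fopenmp hybrid.c -o hybrid && mpirun -np 2 ./hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Shared-memory parallelism inside one process (one SMP node) */
    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();   /* message passing ties the nodes together */
    return 0;
}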

33
Parallel Programming Models
Parallel Programming Models Overview
§ There are several parallel programming models in common use:
A. Shared Memory
B. Threads
C. Message Passing
D. Data Parallel
E. Hybrid
§ Parallel programming models exist as an abstraction above the hardware and
memory architectures.
§ Although it might not seem apparent, these models are NOT specific to a
particular type of machine or memory architecture. In fact, any of these models
can (theoretically) be implemented on any underlying hardware

35
Single Program Multiple Data (SPMD) Structure
§ Another programming structure we may use is the Single Program
Multiple Data (SPMD) structure.
§ In this structure, a single source program is written and each processor
executes its own copy of this program independently, and not in
synchrony.
§ This source program can be constructed so that parts of the program
are executed by certain computers and not others depending on the
identity of the computer.
§ For a master-slave structure, the programs could have parts for the
master and parts for slaves.
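A minimal SPMD sketch (an assumed example, not from the slides): every process runs the same MPI program and branches on its rank, so the master executes one part of the source and the slaves another.

/* One source program, many processes, behaviour selected by rank.
 * Build/run, e.g.: mpicc spmd.c -o spmd && mpirun -np 4 ./spmd */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* master part of the single source program */
        printf("master coordinating %d slave processes\n", size - 1);
    } else {
        /* slave part, executed only by the other ranks */
        printf("slave %d doing its share of the work\n", rank);
    }

    MPI_Finalize();
    return 0;
}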

36
Multiple Program Multiple Data (MPMD) Structure
§ Within the MIMD classification, each processor will have its own
program to execute. This could be described as MPMD.
§ In this case, some of the programs to be executed could be copies of
the same program.
§ Typically, only two source programs are written, one for the designated
master processor, and one for the remaining processors, which are
called slave processors.
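As a sketch of how such a job might be launched (the program names here are illustrative assumptions), Open MPI's mpirun accepts a colon-separated MPMD specification, e.g. mpirun -np 1 ./master : -np 8 ./worker, which starts one copy of a master program and eight copies of a separate worker program within a single job.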

37
Summary
Summary
§ Parallelism Concepts and Context
§ Parallel versus Distributed Computing
§ Flynn’s Taxonomy
§ Memory Architectures
§ Parallel Programming Models

39
Reading Materials
Reading and References
§ M. J. Flynn, “Very high-speed computing systems,” in Proceedings of
the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966, doi:
10.1109/PROC.1966.5273, URI:
https://ieeexplore.ieee.org/document/1447203
§ Sima, Dezsö, Terry J. Fountain and Péter Kacsuk. “Advanced computer
architectures - a design space approach,” Chapter 1, International
computer science series, Addison-Wesley, Reading, MA (1997)

41
