CS516: Parallelization of Programs: Overview of Parallel Architectures
Vishwesh Jatala
Assistant Professor
Department of CSE
Indian Institute of Technology Bhilai
[email protected]
2023-24 W
Recap: Why Parallel Architectures?
• Moore’s Law: The number of transistors on an IC doubles about every two years
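Stated as a rough formula (one common reading of the observation, with N_0 the transistor count in a base year and t the elapsed time in years):

    N(t) = N_0 \cdot 2^{t/2}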
Recap: Moore’s Law Effect
Processor Architecture Roadmap
Course Outline
■ Introduction
■ Overview of Parallel Architectures
■ Performance
■ Parallel Programming
• GPUs and CUDA programming
■ Case studies
■ Extracting Parallelism from Sequential Programs Automatically
Flynn’s Taxonomy
• Flynn’s classification of computer architecture by instruction and data streams: SISD, SIMD, MISD, MIMD
SISD: Single Instruction, Single Data
• The von Neumann architecture
SIMD: Single Instruction, Multiple Data
• Single control stream
• Fine-grained parallelism, as in the sketch below
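As a concrete illustration (a minimal sketch, not from the slides: x86 SSE intrinsics in C, where a single _mm_add_ps instruction adds four floats in parallel; add4 is an illustrative name):

    #include <immintrin.h>

    /* One instruction stream; each vector instruction operates on
       four data elements at once (fine-grained parallelism). */
    void add4(const float *a, const float *b, float *out) {
        __m128 va = _mm_loadu_ps(a);     /* load 4 floats from a */
        __m128 vb = _mm_loadu_ps(b);     /* load 4 floats from b */
        __m128 vc = _mm_add_ps(va, vb);  /* single add instruction, 4 results */
        _mm_storeu_ps(out, vc);          /* store 4 results */
    }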
SIMD: Single Instruction, Multiple Data
• Example: GPUs
MIMD: Multiple Instructions, Multiple Data
• Most machines that are prevalent today are MIMD
Rest of today’s lecture…
• Flynn’s classification of computer architecture, in more detail
Flynn’s Taxonomy
• Flynn’s classification of computer architecture
MIMD: Shared Memory Multiprocessors
• Tightly coupled multiprocessors
• Shared global memory address space
• Traditional multiprocessing: symmetric multiprocessing (SMP)
• Existing multi-core processors, multithreaded processors
• Programming model similar to that of uniprocessors (i.e., a multitasking uniprocessor), except
• Operations on shared data require synchronization, as in the sketch below
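A minimal sketch of that synchronization requirement, assuming POSIX threads (the slide does not prescribe an API; worker and counter are illustrative names):

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;                           /* shared data */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);          /* synchronize access */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);     /* 400000 only because of the lock */
        return 0;
    }

Without the lock, the four threads race on counter and the final value is unpredictable; compile with -pthread.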
Interconnection Schemes for SMP
SMP Architectures
UMA: Uniform Memory Access
• All processors have the same uncontended latency to memory
• Symmetric multiprocessing (SMP) ~ UMA with bus interconnect
UMA: Uniform Memory Access
+ Data placement unimportant/less important (easier to optimize code and make use of available memory space)
- Scaling the system increases all latencies
- Contention could restrict bandwidth and increase latency
How to Scale Shared Memory Machines?
• Two general approaches
• Maintain UMA: provide a scalable interconnect to memory, though scaling the system still increases memory latency for all processors
• Give up UMA: distribute memory with the processors and accept non-uniform access times (NUMA, next slide)
NUMA: Non-Uniform Memory Access
• Shared memory as local versus remote memory
+ Low latency to local memory
- Much higher latency to remote memories
+ Bandwidth to local memory may be higher
- Performance very sensitive to data placement (see the sketch below)
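To make the data-placement sensitivity concrete, a minimal sketch assuming the Linux libnuma API (not mentioned on the slide), which places an allocation on a chosen node:

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {    /* kernel without NUMA support */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }
        size_t size = 1 << 20;
        /* Allocate on node 0: low latency for cores on node 0,
           higher latency when accessed from remote nodes. */
        char *buf = numa_alloc_onnode(size, 0);
        if (buf == NULL) return 1;
        buf[0] = 1;                    /* touch to fault the page in */
        numa_free(buf, size);
        return 0;
    }

Compile with -lnuma; placing data on the node whose cores use it most is exactly the tuning the slide alludes to.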
MIMD: Message Passing Architectures
• Loosely coupled multiprocessors
• No shared global memory address space
• Multicomputer network
• Network-based multiprocessors
• Usually programmed via message passing
• Explicit calls (send, receive) for communication, as sketched below
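The send/receive style in a minimal sketch, assuming MPI as the message-passing library (the slide names only the send/receive operations, not MPI; run with two ranks):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = 42;
        if (rank == 0) {
            /* Explicit send: there is no shared address space to write into. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int received;
            MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", received);
        }
        MPI_Finalize();
        return 0;
    }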
MIMD: Message Passing Architectures
Historical Evolution: 1960s & 70s
• Early MPs
• Mainframes
• Small number of processors
• Crossbar interconnect
• UMA
Historical Evolution: 1980s
• Bus-Based MPs
• Enabler: processor-on-a-board
• Economical scaling
• Precursor of today’s SMPs
• UMA
Historical Evolution: Late 80s, mid 90s
• Large-scale MPs (Massively Parallel Processors)
• Multi-dimensional interconnects
• Each node a computer (processor + cache + memory)
• NUMA
• Still used for “supercomputing”
Flynn’s Taxonomy
• Flynn’s classification of computer architecture
SIMD: Single Instruction, Multiple Data
• Example: GPUs
Data Parallel Programming Model
• Programming Model
• Operations are performed on each element of a large (regular) data structure (array, vector, matrix)
On Sequential Hardware
On Data Parallel Hardware
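Side by side in code (a minimal sketch; the OpenMP pragma stands in for whatever data-parallel hardware executes the loop, and scale_seq/scale_par are illustrative names):

    #define N 1024

    /* Sequential hardware: one control stream visits one element at a time. */
    void scale_seq(float *a, float s) {
        for (int i = 0; i < N; i++)
            a[i] *= s;
    }

    /* Data-parallel hardware: the same operation is applied to many
       elements at once; here OpenMP distributes iterations across
       lanes/cores. Compile with -fopenmp. */
    void scale_par(float *a, float s) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] *= s;
    }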
Data Parallel Architectures
• Early architectures directly mirrored the programming model
Data Parallel Architectures
• Later data parallel architectures
• Higher integration → SIMD units on chip along with caches
• More generic → multiple cooperating multiprocessors (GPUs)
• Specialized hardware support for global synchronization
SIMD: Graphics Processing Units
• The early GPU designs
• Specialized for graphics processing only
• Exhibit SIMD execution
• Limited programmability
• Example: NVIDIA GeForce 256
Single-core CPU vs Multi-core vs GPU
Single-core CPU vs Multi-core vs GPU
NVIDIA V100 GPU
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
Specifications
CPUs vs GPUs
Chip-to-chip comparison of peak memory bandwidth in GB/s and peak double-precision gigaflops for GPUs and CPUs since 2008.
https://www.nextplatform.com/2019/07/10/a-decade-of-accelerated-computing-augurs-well-for-gpus
GPU Applications
Specifications
Multi-GPU Systems
https://www.azken.com/images/dgx1_images/dgx1-system-architecture-whitepaper1.pdf
Summary
• Parallel architectures are inevitable
• Flynn’s taxonomy:
• SISD
• SIMD
• MISD
• MIMD
References
• David Culler, Jaswinder Pal Singh, and Anoop Gupta. 1998. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
• https://safari.ethz.ch/architecture/fall2020/doku.php?id=schedule
• https://www.cse.iitd.ac.in/~soham/COL380/page.html
• https://s3.wp.wsu.edu/uploads/sites/1122/2017/05/6-9-2017-slides-vFinal.pptx
• https://ebhor.com/full-form-of-cpu/