PDC Lecture 7-8: GPU Architectures

1

Lahore Garrison University


Parallel and Distributed Computing
Session Fall 2024
Lecture – 07 & 08
2
Preamble

 Load Balancing
 Load Balancers
 Flynn’s Taxonomy
 Computation Models
 SISD
 SIMD
 MISD
 MIMD


3
Lesson Plan

 GPU Architectures
   Conventional CPU architecture
   Modern GPGPU architectures
   AMD Southern Islands GPU Architecture
   Nvidia Fermi GPU Architecture
   Cell Broadband Engine
   Maryland CPU/GPU Cluster Architecture
   Intel’s response to NVIDIA GPUs
   To accelerate or not to accelerate
   When are GPUs appropriate?
   CPU vs. GPU hardware design philosophies
   CUDA-capable GPU hardware architecture
4
Modern CPU Architecture

 CPUs are optimized to minimize the latency of a single thread.
 They can efficiently handle control-flow-intensive workloads.
 Lots of die area is devoted to caching and control logic.
 Multi-level caches are used to hide memory latency.
 A limited number of registers, since few threads are active at once.
 Control logic reorders execution, provides Instruction-Level Parallelism, and minimizes pipeline stalls.
5
Modern GPGPU Architecture

 Array of independent “cores” called Compute Units.
 High-bandwidth, banked L2 caches and main memory.
   Banks allow multiple accesses to occur in parallel
   100s of GB/s
 Memory and caches are generally non-coherent.
 Compute units are based on SIMD hardware.
   Both AMD and NVIDIA use 16-lane-wide SIMD units
 Large register files are used for fast context switching.
   No saving/restoring of state
   Data is persistent for the entire thread’s execution
 Both vendors pair an automatic L1 cache with a user-managed scratchpad (see the sketch below).
   The scratchpad is heavily banked and very high bandwidth (~terabytes/second)
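As a concrete illustration, here is a minimal CUDA C sketch (kernel name, sizes, and data are our own assumptions, not from the slides) in which each thread block stages data in the user-managed scratchpad, which CUDA exposes as __shared__ memory, and cooperates through a barrier:

// block_reverse.cu -- a minimal sketch; names and sizes are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256

// Each block stages TILE elements in the banked, on-chip scratchpad
// (CUDA shared memory), synchronizes, then writes its tile back reversed.
__global__ void reverse_tile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];                  // user-managed scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];         // fast, banked access
    __syncthreads();                              // block-wide cooperation
    if (i < n) out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

int main()
{
    const int n = 1024;                           // a multiple of TILE
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    reverse_tile<<<n / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %.0f\n", out[0]);            // 255: tile 0, reversed
    cudaFree(in);
    cudaFree(out);
    return 0;
}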
6
Modern GPU Architecture

 Work-items are automatically grouped into hardware threads called “wavefronts” (AMD) or “warps” (NVIDIA).
 A single instruction stream is executed on SIMD hardware.
   64 work-items in a wavefront, 32 in a warp
   Each instruction is issued multiple times on a 16-lane SIMD unit
 Control flow is handled by masking SIMD lanes (see the example below).
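The masking cost appears whenever lanes of the same warp or wavefront take different branch directions. A small illustrative CUDA kernel (our own sketch, not from the slides):

// divergence.cu -- an illustrative kernel, not from the slides.
__global__ void clamp_or_double(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // All lanes of a warp/wavefront share one instruction stream.
        // If lanes disagree on this branch, the hardware executes BOTH
        // paths and masks off the inactive lanes each time, so divergent
        // branches cost extra issue slots.
        if (x[i] < 0.0f)
            x[i] = 0.0f;     // runs with only the negative lanes active
        else
            x[i] *= 2.0f;    // runs with only the non-negative lanes active
    }
}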


7
AMD Southern Islands

 7000-series GPUs based on the Graphics Core Next (GCN) architecture
 4 SIMDs per compute unit
 1 scalar unit handles instructions common to the whole wavefront
   Loop iterators, constant-variable accesses, branches
   Has a single, integer-only ALU
   A separate branch unit is used for some conditional instructions
 Radeon HD7970
   32 compute units
   Max performance: ~3.79 TFLOPS single precision
8

AMD Southern Islands Architecture (diagram)


9
AMD GPGPU & AMD HD7970 series

Exercise: 1000 instructions are passed to two systems separately.
The 1st system uses an AMD 7000-series (Radeon HD7970) GPU and the 2nd system uses a generic AMD GPGPU.
Calculate the number of Compute Units (CUs) and SIMDs each system needs to execute the 1000 instructions.

AMD GPGPU (CUs = cores, wavefronts = work-items):
If 1 CU = 16 SIMDs and 1 SIMD = 64 work-items, then the totals needed to execute 1000 instructions are:
  No. of SIMDs = 1000/64 = 15.625 ≈ 16 (ceiling function)
  No. of CUs = 15.625/16 = 0.97 ≈ 1 (ceiling function)

AMD HD7970-series GPU:
If 1 CU = 4 SIMDs and 1 SIMD = 64 work-items, then the totals needed to execute 1000 instructions are:
  No. of SIMDs = 1000/64 = 15.625 ≈ 16 (ceiling function)
  No. of CUs = 15.625/4 = 3.90 ≈ 4 (ceiling function)
10
NVIDIA Fermi Architecture

 GTX 480: Compute Capability 2.0
 15 cores, or Streaming Multiprocessors (SMs)
 Each SM features 32 CUDA processors
   480 CUDA processors in total
 Global memory with ECC
 An SM executes threads in groups of 32 called warps.
   Two warp issue units per SM
 Concurrent kernel execution
   Multiple kernels execute simultaneously to improve efficiency (see the streams sketch below)
 Each CUDA core consists of a single ALU and a floating-point unit (FPU)
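Concurrent kernel execution is expressed from the host with CUDA streams. A minimal sketch, with an illustrative kernel and sizes of our own choosing:

// streams.cu -- a minimal sketch of concurrent kernels via CUDA streams;
// the kernel and sizes are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

int main()
{
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels launched into different streams have no ordering constraint,
    // so a Compute 2.0+ device may run them simultaneously.
    scale<<<n / 256, 256, 0, s1>>>(a, n, 2.0f);
    scale<<<n / 256, 256, 0, s2>>>(b, n, 3.0f);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}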


11

NVIDIA Fermi Architecture (diagram)


12
NVIDIA GPGPU & NVIDIA GTX 480 series

Exercise: 1000 instructions are passed to two systems separately.
The 1st system uses an NVIDIA GTX-series (GTX 480) GPU and the 2nd system uses a generic NVIDIA GPGPU.
Calculate the number of Streaming Multiprocessors (SMs) and CUDA processors each system needs to execute the 1000 instructions.

NVIDIA GPGPU (SMs = cores, warps = work-items):
If 1 SM = 16 CUDA processors and 1 CUDA processor = 32 warps, then the totals needed to execute 1000 instructions are:
  No. of CUDA processors = 1000/32 = 31.25 ≈ 32 (ceiling function)
  No. of SMs = 31.25/16 = 1.95 ≈ 2 (ceiling function)

NVIDIA GTX 480 GPU:
If 1 SM = 32 CUDA processors and 1 CUDA processor = 32 warps, then the totals needed to execute 1000 instructions are:
  No. of CUDA processors = 1000/32 = 31.25 ≈ 32 (ceiling function)
  No. of SMs = 31.25/32 = 0.97 ≈ 1 (ceiling function)
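Both exercises above reduce to the same ceiling-division pattern. A minimal host-only CUDA C sketch (ceil_div is our own helper name, not a vendor API) that reproduces the slides' results with integer arithmetic:

// exercise_math.cu -- reproduces the two exercises above; host code only.
#include <cstdio>

// Ceiling division: how many whole units of size `per` cover `total`.
static int ceil_div(int total, int per) { return (total + per - 1) / per; }

int main(void)
{
    const int instructions = 1000;

    // AMD: 1 SIMD = 64 work-items.
    int simds   = ceil_div(instructions, 64);  // ceil(15.625) = 16
    int cus_gp  = ceil_div(simds, 16);         // GPGPU, 16 SIMDs/CU  -> 1
    int cus_hd  = ceil_div(simds, 4);          // HD7970, 4 SIMDs/CU  -> 4

    // NVIDIA: each CUDA processor is fed 32-wide.
    int procs   = ceil_div(instructions, 32);  // ceil(31.25) = 32
    int sms_gp  = ceil_div(procs, 16);         // GPGPU, 16 procs/SM  -> 2
    int sms_480 = ceil_div(procs, 32);         // GTX 480, 32 procs/SM -> 1

    printf("AMD:    %d SIMDs, %d CUs (GPGPU), %d CUs (HD7970)\n",
           simds, cus_gp, cus_hd);
    printf("NVIDIA: %d CUDA processors, %d SMs (GPGPU), %d SMs (GTX 480)\n",
           procs, sms_gp, sms_480);
    return 0;
}

Integer ceiling division gives the same answers as the slides' fractional arithmetic (16, 1, 4, 32, 2, 1) without any floating-point rounding.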
13
Cell Broadband Engine

 Developed by Sony, Toshiba, and IBM
 Transitioned from embedded platforms into HPC via the PlayStation 3
 OpenCL drivers are available for Cell BladeCenter servers
 Consists of a Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs)
 Uses the IBM XL C for OpenCL compiler


14
Cell Broadband Engine Architecture (diagram)


15
Maryland CPU/GPU Cluster Infrastructure (diagram)


16

Intel’s Response to NVIDIA GPUs (diagram)


17
When are GPUs appropriate?

 Pro:
   They make your code run faster.
 Cons:
   They’re expensive (False).
   They’re hard to program.
   Your code may not be cross-platform (False).

 Applications:
   Traditional GPU applications: gaming, image processing
     i.e., manipulating image pixels, oftentimes the same operation on each pixel (see the kernel below)
   Scientific and engineering problems: physical modeling, matrix algebra, sorting, etc.
   Data-parallel algorithms:
     Large data arrays
     Single Instruction, Multiple Data (SIMD) parallelism
     Floating-point computations
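The “same operation on each pixel” pattern maps directly onto one GPU thread per pixel. An illustrative CUDA kernel (our own example, not from the slides) that brightens an 8-bit grayscale image:

// brighten.cu -- one thread per pixel, identical work on every pixel.
__global__ void brighten(unsigned char *pixels, int n, int delta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = pixels[i] + delta;                    // same op per pixel
        pixels[i] = (unsigned char)(v > 255 ? 255 : v); // clamp to 8 bits
    }
}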


18
CPU vs. GPU hardware design philosophies (diagram: CPU vs. GPU)


19
CUDA-capable GPU Hardware

 Thread processors execute computing threads
 The thread execution manager issues threads
 128 thread processors are grouped into 16 Streaming Multiprocessors (SMs)
 The Parallel Data Cache enables thread cooperation.


20
CUDA-capable GPU Hardware (diagram)


21
Is it hard to program on a GPU?

 In the olden days (pre-2006), programming GPUs meant either:
   using a graphics standard like OpenGL (which is mostly meant for rendering), or
   getting deep into the graphics rendering pipeline.
 To use a GPU for general-purpose number crunching, you had to make your number crunching pretend to be graphics.
 This is hard. Why bother?


22
How to program on a GPU today?

 Proprietary programming languages or extensions:
   NVIDIA: CUDA (C/C++) (see the minimal example below)
   AMD/ATI: StreamSDK/Brook+ (C/C++)
 OpenCL (Open Computing Language): an industry standard for doing number crunching on GPUs.
 Portland Group Inc. (PGI) Fortran and C compilers with accelerator directives; PGI CUDA Fortran (a Fortran 90 equivalent of NVIDIA’s CUDA C).
 OpenMP version 4.0 includes directives for accelerators.
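For a taste of the first option, here is a complete minimal CUDA C program (our own example, not from the slides); build with: nvcc vecadd.cu -o vecadd

// vecadd.cu -- a complete, minimal CUDA C vector addition.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));       // visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // launch the kernel
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);                  // prints 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}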


23
Lesson Review

 GPU Architectures
 Conventional CPU architecture
 Modern GPGPU architectures
 AMD Southern Islands GPU Architecture
 Exercise for AMD GPGPU and AMD 7000 series GPU
 Nvidia Fermi GPU Architecture
 Exercise for NVIDIA GPGPU and NVIDIA GTX series GPU
 Cell Broadband Engine



24
Next Lesson Preview

 Heterogeneity
 Heterogeneous Concurrent Computing
 Forms of Heterogeneity
 Goals of Heterogeneous Concurrent Computing
 Processing Elements
 Parallel Virtual Machine (PVM)
 The PVM System



25
References

 To cover this topic, different reference material has been consulted.

 Textbooks:
   Distributed Systems: Principles and Paradigms, A. S. Tanenbaum and M. van Steen, Prentice Hall, 2nd Edition, 2007.
   Distributed and Cloud Computing: Clusters, Grids, Clouds, and the Future Internet, K. Hwang, J. Dongarra, and G. C. Fox, Elsevier, 1st Edition.
 Google Search Engine
