PDC Lecture 7-8: GPU Architectures

1

Lahore Garrison University


Parallel and Distributed Computing
Session Fall 2024
Lecture – 07 & 08
2
Preamble

 Load Balancing
 Load Balancers
 Flynn’s Taxonomy
 Computation Models
 SISD
 SIMD
 MISD
 MIMD


3
Lesson Plan

 GPU Architectures
   Conventional CPU architecture
   Modern GPGPU architectures
   AMD Southern Islands GPU Architecture
   Nvidia Fermi GPU Architecture
   Cell Broadband Engine
   Maryland CPU/GPU Cluster Architecture
   Intel’s response to NVIDIA GPUs
   To accelerate or not to accelerate
   When are GPUs appropriate?
   CPU vs. GPU hardware design philosophies
   CUDA-capable GPU hardware architecture
4
Modern CPU Architecture

 CPUs are optimized to minimize the latency of a single thread.
 They can efficiently handle control-flow-intensive workloads.
 Lots of die area is devoted to caching and control logic.
 Multi-level caches are used to hide memory latency.
 A limited number of registers, since few threads are active at once.
 Control logic reorders execution, provides Instruction-Level Parallelism, and minimizes pipeline stalls.
5
Modern GPGPU Architecture

 Array of independent “cores” called Compute Units.
 High-bandwidth, banked L2 caches and main memory.
   Banks allow multiple accesses to occur in parallel
   100s of GB/s
 Memory and caches are generally non-coherent.
 Compute units are based on SIMD hardware.
   Both AMD and NVIDIA use 16-lane-wide SIMD units
 Large register files are used for fast context switching.
   No saving/restoring of state
   Data is persistent for the entire thread’s execution
 Both vendors pair an automatic L1 cache with a user-managed scratchpad (see the sketch below).
   The scratchpad is heavily banked and very high bandwidth (~terabytes/second)
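As a concrete illustration, here is a minimal CUDA C sketch (kernel name, sizes, and data are our own assumptions, not from the slides) in which each thread block stages data in the user-managed scratchpad, which CUDA exposes as __shared__ memory, and cooperates through a barrier:

// block_reverse.cu -- a minimal sketch; names and sizes are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 256

// Each block stages TILE elements in the banked, on-chip scratchpad
// (CUDA shared memory), synchronizes, then writes its tile back reversed.
__global__ void reverse_tile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];                  // user-managed scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];         // fast, banked access
    __syncthreads();                              // block-wide cooperation
    if (i < n) out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

int main()
{
    const int n = 1024;                           // a multiple of TILE
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    reverse_tile<<<n / TILE, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %.0f\n", out[0]);            // 255: tile 0, reversed
    cudaFree(in);
    cudaFree(out);
    return 0;
}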
6
Modern GPU Architecture

 Work-items are automatically grouped into hardware threads called “wavefronts” (AMD) or “warps” (NVIDIA).
 A single instruction stream is executed on SIMD hardware.
   64 work-items in a wavefront, 32 in a warp
   Each instruction is issued multiple times on a 16-lane SIMD unit
 Control flow is handled by masking SIMD lanes (see the example below).
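The masking cost appears whenever lanes of the same warp or wavefront take different branch directions. A small illustrative CUDA kernel (our own sketch, not from the slides):

// divergence.cu -- an illustrative kernel, not from the slides.
__global__ void clamp_or_double(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // All lanes of a warp/wavefront share one instruction stream.
        // If lanes disagree on this branch, the hardware executes BOTH
        // paths and masks off the inactive lanes each time, so divergent
        // branches cost extra issue slots.
        if (x[i] < 0.0f)
            x[i] = 0.0f;     // runs with only the negative lanes active
        else
            x[i] *= 2.0f;    // runs with only the non-negative lanes active
    }
}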


7
AMD Southern Islands

 7000-series GPUs based on the Graphics Core Next (GCN) architecture
 4 SIMDs per compute unit
 1 scalar unit handles instructions common to the whole wavefront
   Loop iterators, constant-variable accesses, branches
   Has a single, integer-only ALU
   A separate branch unit is used for some conditional instructions
 Radeon HD7970
   32 compute units
   Max performance: ~3.79 TFLOPS single precision
8

AMD Southern Islands Architecture (diagram)


9
AMD GPGPU & AMD HD7970 series

Exercise: 1000 instructions are passed to two systems separately.
The 1st system uses an AMD 7000-series (Radeon HD7970) GPU and the 2nd system uses a generic AMD GPGPU.
Calculate the number of Compute Units (CUs) and SIMDs each system needs to execute the 1000 instructions.

AMD GPGPU (CUs = cores, wavefronts = work-items):
If 1 CU = 16 SIMDs and 1 SIMD = 64 work-items, then the totals needed to execute 1000 instructions are:
  No. of SIMDs = 1000/64 = 15.625 ≈ 16 (ceiling function)
  No. of CUs = 15.625/16 = 0.97 ≈ 1 (ceiling function)

AMD HD7970-series GPU:
If 1 CU = 4 SIMDs and 1 SIMD = 64 work-items, then the totals needed to execute 1000 instructions are:
  No. of SIMDs = 1000/64 = 15.625 ≈ 16 (ceiling function)
  No. of CUs = 15.625/4 = 3.90 ≈ 4 (ceiling function)
10
NVIDIA Fermi Architecture

 GTX 480: Compute Capability 2.0
 15 cores, or Streaming Multiprocessors (SMs)
 Each SM features 32 CUDA processors
   480 CUDA processors in total
 Global memory with ECC
 An SM executes threads in groups of 32 called warps.
   Two warp issue units per SM
 Concurrent kernel execution
   Multiple kernels execute simultaneously to improve efficiency (see the streams sketch below)
 Each CUDA core consists of a single ALU and a floating-point unit (FPU)
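Concurrent kernel execution is expressed from the host with CUDA streams. A minimal sketch, with an illustrative kernel and sizes of our own choosing:

// streams.cu -- a minimal sketch of concurrent kernels via CUDA streams;
// the kernel and sizes are illustrative assumptions.
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float f)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= f;
}

int main()
{
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernels launched into different streams have no ordering constraint,
    // so a Compute 2.0+ device may run them simultaneously.
    scale<<<n / 256, 256, 0, s1>>>(a, n, 2.0f);
    scale<<<n / 256, 256, 0, s2>>>(b, n, 3.0f);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}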


11

NVIDIA Fermi Architecture (diagram)


12
NVIDIA GPGPU & NVIDIA GTX 480 series

Exercise: 1000 instructions are passed to two systems separately.
The 1st system uses an NVIDIA GTX-series (GTX 480) GPU and the 2nd system uses a generic NVIDIA GPGPU.
Calculate the number of Streaming Multiprocessors (SMs) and CUDA processors each system needs to execute the 1000 instructions.

NVIDIA GPGPU (SMs = cores, warps = work-items):
If 1 SM = 16 CUDA processors and 1 CUDA processor = 32 warps, then the totals needed to execute 1000 instructions are:
  No. of CUDA processors = 1000/32 = 31.25 ≈ 32 (ceiling function)
  No. of SMs = 31.25/16 = 1.95 ≈ 2 (ceiling function)

NVIDIA GTX 480 GPU:
If 1 SM = 32 CUDA processors and 1 CUDA processor = 32 warps, then the totals needed to execute 1000 instructions are:
  No. of CUDA processors = 1000/32 = 31.25 ≈ 32 (ceiling function)
  No. of SMs = 31.25/32 = 0.97 ≈ 1 (ceiling function)
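Both exercises above reduce to the same ceiling-division pattern. A minimal host-only CUDA C sketch (ceil_div is our own helper name, not a vendor API) that reproduces the slides' results with integer arithmetic:

// exercise_math.cu -- reproduces the two exercises above; host code only.
#include <cstdio>

// Ceiling division: how many whole units of size `per` cover `total`.
static int ceil_div(int total, int per) { return (total + per - 1) / per; }

int main(void)
{
    const int instructions = 1000;

    // AMD: 1 SIMD = 64 work-items.
    int simds   = ceil_div(instructions, 64);  // ceil(15.625) = 16
    int cus_gp  = ceil_div(simds, 16);         // GPGPU, 16 SIMDs/CU  -> 1
    int cus_hd  = ceil_div(simds, 4);          // HD7970, 4 SIMDs/CU  -> 4

    // NVIDIA: each CUDA processor is fed 32-wide.
    int procs   = ceil_div(instructions, 32);  // ceil(31.25) = 32
    int sms_gp  = ceil_div(procs, 16);         // GPGPU, 16 procs/SM  -> 2
    int sms_480 = ceil_div(procs, 32);         // GTX 480, 32 procs/SM -> 1

    printf("AMD:    %d SIMDs, %d CUs (GPGPU), %d CUs (HD7970)\n",
           simds, cus_gp, cus_hd);
    printf("NVIDIA: %d CUDA processors, %d SMs (GPGPU), %d SMs (GTX 480)\n",
           procs, sms_gp, sms_480);
    return 0;
}

Integer ceiling division gives the same answers as the slides' fractional arithmetic (16, 1, 4, 32, 2, 1) without any floating-point rounding.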
13
Cell Broadband Engine

 Developed by Sony, Toshiba, and IBM
 Transitioned from embedded platforms into HPC via the PlayStation 3
 OpenCL drivers are available for Cell BladeCenter servers
 Consists of a Power Processing Element (PPE) and multiple Synergistic Processing Elements (SPEs)
 Uses the IBM XL C for OpenCL compiler


14
Cell Broadband Engine Architecture (diagram)


15
Maryland CPU/GPU Cluster Infrastructure (diagram)


16

Intel’s Response to NVIDIA GPUs (diagram)


17
When are GPUs appropriate?

 Pro:
   They make your code run faster.
 Cons:
   They’re expensive (False).
   They’re hard to program.
   Your code may not be cross-platform (False).

 Applications:
   Traditional GPU applications: gaming, image processing
     i.e., manipulating image pixels, oftentimes the same operation on each pixel (see the kernel below)
   Scientific and engineering problems: physical modeling, matrix algebra, sorting, etc.
   Data-parallel algorithms:
     Large data arrays
     Single Instruction, Multiple Data (SIMD) parallelism
     Floating-point computations
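The “same operation on each pixel” pattern maps directly onto one GPU thread per pixel. An illustrative CUDA kernel (our own example, not from the slides) that brightens an 8-bit grayscale image:

// brighten.cu -- one thread per pixel, identical work on every pixel.
__global__ void brighten(unsigned char *pixels, int n, int delta)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int v = pixels[i] + delta;                    // same op per pixel
        pixels[i] = (unsigned char)(v > 255 ? 255 : v); // clamp to 8 bits
    }
}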


18
CPU vs. GPU hardware design philosophies (diagram: CPU vs. GPU)


19
CUDA-capable GPU Hardware

 Thread processors execute computing threads
 The thread execution manager issues threads
 128 thread processors are grouped into 16 Streaming Multiprocessors (SMs)
 The Parallel Data Cache enables thread cooperation.


20
CUDA-capable GPU Hardware (diagram)


21
Is it hard to program on a GPU?

 In the olden days (pre-2006), programming GPUs meant either:
   using a graphics standard like OpenGL (which is mostly meant for rendering), or
   getting deep into the graphics rendering pipeline.
 To use a GPU for general-purpose number crunching, you had to make your number crunching pretend to be graphics.
 This is hard. Why bother?


22
How to program on a GPU today?

 Proprietary programming languages or extensions:
   NVIDIA: CUDA (C/C++) (see the minimal example below)
   AMD/ATI: StreamSDK/Brook+ (C/C++)
 OpenCL (Open Computing Language): an industry standard for doing number crunching on GPUs.
 Portland Group Inc. (PGI) Fortran and C compilers with accelerator directives; PGI CUDA Fortran (a Fortran 90 equivalent of NVIDIA’s CUDA C).
 OpenMP version 4.0 includes directives for accelerators.
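For a taste of the first option, here is a complete minimal CUDA C program (our own example, not from the slides); build with: nvcc vecadd.cu -o vecadd

// vecadd.cu -- a complete, minimal CUDA C vector addition.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));       // visible to CPU and GPU
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);  // launch the kernel
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);                  // prints 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}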


23
Lesson Review

 GPU Architectures
 Conventional CPU architecture
 Modern GPGPU architectures
 AMD Southern Islands GPU Architecture
 Exercise for AMD GPGPU and AMD 7000 series GPU
 Nvidia Fermi GPU Architecture
 Exercise for NVIDIA GPGPU and NVIDIA GTX series GPU
 Cell Broadband Engine



24
Next Lesson Preview

 Heterogeneity
 Heterogeneous Concurrent Computing
 Forms of Heterogeneity
 Goals of Heterogeneous Concurrent Computing
 Processing Elements
 Parallel Virtual Machine (PVM)
 The PVM System



25
References

 To cover this topic, different reference material has been consulted.

 Textbooks:
   Distributed Systems: Principles and Paradigms, A. S. Tanenbaum and M. van Steen, Prentice Hall, 2nd Edition, 2007.
   Distributed and Cloud Computing: Clusters, Grids, Clouds, and the Future Internet, K. Hwang, J. Dongarra, and G. C. Fox, Elsevier, 1st Edition.
 Google Search Engine
