This document provides an overview of general purpose graphics processing units (GPUs) and the CUDA programming model. It discusses how CUDA is designed for heterogeneous CPU+GPU systems with many parallel threads. A CUDA program consists of a serial CPU code and parallel GPU kernel code. The kernel is launched by CPU threads and executed in parallel by many GPU threads organized into blocks. The document outlines the memory hierarchy and how blocks and threads are scheduled across multiprocessors on the GPU using a single-instruction, multiple-thread execution model.

An Overview of General Purpose Graphics Processing Units

Marc Moreno Maza

University of Western Ontario, Canada

UWO, April 1st, 2014
The CUDA programming and memory models

CUDA design goals

Enable heterogeneous systems (i.e., CPU + GPU)
Scale to hundreds of cores and thousands of parallel threads
Use C/C++ with minimal extensions
Let programmers focus on parallel algorithms
The CUDA programming and memory models

Heterogeneous programming (1/3)

A CUDA program is a serial program with parallel kernels, all written in C.
The serial C code executes in a host (= CPU) thread.
The parallel kernel C code executes in many device threads across multiple GPU processing elements, called streaming processors (SPs).
The CUDA programming and memory models

Heterogeneous programming (2/3)

Thus, the parallel code (the kernel) is launched on a device and executed by many threads.
Threads are grouped into thread blocks.
One kernel is executed at a time on the device.
Many threads execute each kernel.
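
For concreteness, here is a minimal sketch of such a launch (the kernel name, grid size, and block size are illustrative assumptions, not taken from the slides): the host thread launches one kernel, which is then executed by every thread of a grid of thread blocks.

// Hypothetical kernel: each device thread would work on one element.
__global__ void myKernel(float *data)
{
    // body omitted; see the increment example on the following slides
}

int main()
{
    float *d_data = 0;            // device pointer (allocation not shown)
    int numBlocks = 128;          // the grid contains 128 thread blocks
    int threadsPerBlock = 256;    // each block contains 256 threads

    // One kernel launch: 128 * 256 = 32768 device threads execute myKernel.
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data);
    cudaDeviceSynchronize();      // the host thread waits for the kernel
    return 0;
}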
The CUDA programming and memory models

Heterogeneous programming (3/3)

The parallel code is written from the point of view of a single thread:
• Each thread is free to execute a unique code path.
• Built-in thread and block ID variables are used to map each thread to a specific data tile (see the next slide).
Thus, each thread executes the same code on different data, based on its thread and block ID.
The CUDA programming and memory models

Example: increment array elements (1/2)

See our example number 4 in /usr/local/cs4402/examples/4
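
The slide itself shows the device code as a figure, which is not reproduced here. The following is a plausible sketch of such an increment kernel, not the exact code from the example directory:

// Sketch of the device code: add b to each of the N elements of a.
__global__ void increment_gpu(float *a, float b, int N)
{
    // Built-in block and thread IDs map this thread to one array element.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                  // guard: the grid may contain more than N threads
        a[idx] = a[idx] + b;
}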


The CUDA programming and memory models

Example: increment array elements (2/2)


The CUDA programming and memory models

Thread blocks (1/2)

A thread block is a group of threads that can:
• synchronize their execution
• communicate via shared memory
Within a grid, thread blocks can run in any order:
• concurrently or sequentially
• this facilitates scaling of the same code across many devices
The CUDA programming and memory models

Thread blocks (2/2)

Thus, within a grid, any possible interleaving of blocks must be valid.
Thread blocks may coordinate but not synchronize:
• they may share pointers
• they should not share locks (this can easily deadlock)
The fact that thread blocks cannot synchronize gives scalability:
• a kernel scales across any number of parallel cores
However, within a thread block, threads may synchronize with barriers.
That is, threads wait at the barrier until all threads in the same block reach the barrier.
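
As a hedged illustration of a block-level barrier (the kernel, its size limit, and the reversal pattern are assumptions made for this sketch): every thread of the block writes to shared memory, the barrier guarantees those writes are complete, and only then do the threads read what their neighbours wrote.

// Sketch: reverse an array of at most 256 elements within one thread block.
__global__ void reverse_in_block(int *d, int n)
{
    __shared__ int s[256];            // staging area shared by the block
    int t = threadIdx.x;

    if (t < n) s[t] = d[t];           // each thread writes one element

    __syncthreads();                  // barrier: wait until every thread of
                                      // this block has finished its write

    if (t < n) d[t] = s[n - 1 - t];   // safe: all writes to s are now visible
}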
The CUDA programming and memory models

Memory hierarchy (1/3)

Host (CPU) memory:


Not directly accessible by CUDA threads
The CUDA programming and memory models

Memory hierarchy (2/3)

Global (on-device) memory:
Also called device memory
Accessible by all threads, as well as by the host (CPU)
Data lifetime = from allocation to deallocation
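
A short hedged sketch of that lifetime (error checking omitted; h_a is an assumed host array of the same size): the data in global memory exists from cudaMalloc until cudaFree and is visible to the host through explicit copies.

float *d_a = 0;
size_t bytes = 1024 * sizeof(float);

cudaMalloc((void **)&d_a, bytes);                     // lifetime begins here
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host writes the data
// ... any kernel launched here can read and write d_a ...
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // host reads it back
cudaFree(d_a);                                        // lifetime ends here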
The CUDA programming and memory models

Memory hierarchy (3/3)

Shared memory:
Each thread block has its own shared memory, which is accessible only by the threads within that block
Data lifetime = block lifetime
Local storage:
Each thread has its own local storage
Data lifetime = thread lifetime
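
A small sketch of where these declarations live inside a kernel (names and the tile size are illustrative; the tile size assumes at most 128 threads per block):

__global__ void lifetimes_demo(float *g)   // g points into global memory
{
    __shared__ float tile[128];   // shared: one copy per thread block,
                                  // exists for the lifetime of the block
    float tmp;                    // local: one copy per thread,
                                  // exists for the lifetime of the thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g[i];     // stage a global value into shared memory
    tmp = 2.0f * tile[threadIdx.x];
    g[i] = tmp;                   // global memory outlives the kernel launch
}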
The CUDA programming and memory models

Blocks run on multiprocessors


The CUDA programming and memory models

Streaming processors and multiprocessors


The CUDA programming and memory models

Hardware multithreading

Hardware allocates resources to blocks:
• blocks need thread slots, registers, and shared memory
• blocks don’t run until these resources are available
Hardware schedules threads:
• threads have their own registers
• any thread not waiting for something can run
• context switching is free, every cycle
Hardware relies on threads to hide latency:
• thus high parallelism is necessary for performance
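
The per-block resources mentioned above can be inspected at run time. As a hedged sketch (device 0 is assumed), the fields printed below are standard members of cudaDeviceProp in the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Per-block resources the hardware must allocate before a block can run:
    printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("registers per block   : %d\n", prop.regsPerBlock);
    printf("shared mem per block  : %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    printf("warp size             : %d\n", prop.warpSize);
    return 0;
}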
The CUDA programming and memory models

SIMT thread execution

At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp.
• The number of threads in a warp is the warp size (32 on the G80).
• A half-warp is the first or second half of a warp.
Within a warp, threads:
• share instruction fetch/dispatch
• some become inactive when their code paths diverge
• hardware automatically handles divergence
Warps are the primitive unit of scheduling:
• each active block is split into warps in a well-defined way
• threads within a warp are executed physically in parallel, while warps and blocks are executed logically in parallel
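
A minimal sketch of divergence within a warp (the kernel is illustrative): the even and odd lanes of a warp take different branches, so the hardware runs the two paths one after the other, masking off the inactive lanes.

__global__ void divergent(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads of the same warp disagree on this condition, so the warp
    // serializes the two paths; inactive threads are simply masked off.
    if (threadIdx.x % 2 == 0)
        out[i] = 1;   // even lanes are active on this path
    else
        out[i] = 2;   // odd lanes are active on this path
}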
The CUDA programming and memory models

Returning to the example


The CUDA programming and memory models

Example host code for increment array elements
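
The host code itself appears as a figure on the slide and is not reproduced here. The following is a hedged reconstruction of what such host code typically looks like (the array size, block size, and the increment_gpu kernel from the earlier sketch are assumptions):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void increment_gpu(float *a, float b, int N);  // from the earlier sketch

int main()
{
    const int N = 1 << 20;                         // assumed array size
    const size_t bytes = N * sizeof(float);

    float *h_a = (float *)malloc(bytes);           // host (CPU) copy
    for (int i = 0; i < N; ++i) h_a[i] = (float)i;

    float *d_a = 0;                                // device (global memory) copy
    cudaMalloc((void **)&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    increment_gpu<<<numBlocks, threadsPerBlock>>>(d_a, 1.0f, N);

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    free(h_a);
    return 0;
}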
