Programming Models For GPU Architecture

CUDA Programming Model

Xing Zeng, Dongyue Mou

Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Introduction

What is CUDA?
- Compute Unified Device Architecture.
- A powerful parallel programming model for issuing and
managing computations on the GPU without mapping them to a
graphics API.

• Heterogeneous - mixed serial-parallel programming

• Scalable - hierarchical thread execution model
• Accessible - minimal but expressive changes to C
Introduction
Software Stack:
• Libraries:
CUFFT & CUBLAS

• Runtime:
Common component
Device component
Host component

• Driver:
Driver API
Introduction
CUDA SDK
Motivation

GPU Programming Model

GPGPU Programming Model
CUDA Programming Model
Motivation
GPU Programming Model for Graphics
Motivation
GPGPU Programming Model
Trick the GPU into general-purpose
computing by casting problem as graphics

• Turn data into images ("texture maps")

• Turn algorithms into image synthesis ("rendering passes")
Drawbacks:
• Tough learning curve
• Potentially high overhead of graphics API
• Highly constrained memory layout & access model
• Need for many passes drives up bandwidth consumption
Motivation
GPGPU Programming to do A + B
Motivation
What's wrong with GPGPU 1
• APIs are specific to graphics
• Limited texture size and dimension
• Limited instruction set
• No thread communication
• Limited local storage
• Limited shader outputs
• No scatter
Motivation
What's wrong with GPGPU 2
Programming Model
CUDA: Unified Design
Advantage:

HW: fully general data-parallel architecture
• General thread launch
• Global load-store
• Parallel data cache
• Scalar architecture
• Integers, bit operations

SW: program the GPU in C
• Scalable data-parallel execution/memory model
• C with minimal yet powerful extensions
Motivation
From GPGPU to CUDA Programming Model
Programming Model
Feature 1:
• Threads, not pixels
• Full Integer and Bit Instructions
• No limits on branching, looping
• 1D, 2D, 3D threadID allocation

Feature 2:
• Fully general load/store to
GPU memory
• Untyped, not fixed texture types
• Pointer support

Feature 3:
• Dedicated on-chip memory
• Shared between threads for
inter-thread communication
• Explicitly managed
• As fast as registers
Programming Model
Important Concepts:
• Device: GPU, viewed as a
co-processor.
• Host: CPU
• Kernel: data-parallel,
compute-intensive portion
of an application running on the
device.
Programming Model
Important Concepts:
• Thread: basic execution unit
• Thread block:
A batch of threads. Threads
in a block cooperate and
share data efficiently.
Each thread and each block has a unique ID.
• Grid:
A batch of thread blocks
that execute the same kernel.
Threads in different blocks of
the same grid cannot directly
communicate with each other.
Programming Model
Simple example (matrix addition):
CPU C program: CUDA program:
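The two listings on this slide did not survive extraction. As a stand-in, here is a minimal sketch of how a CPU version and a CUDA version of N×N matrix addition might look side by side. The names `matAddCpu`, `matAddKernel` and `matAddGpu`, the row-major layout and the 16×16 block size are assumptions made for illustration, not the authors' original code.

```cuda
#include <cuda_runtime.h>

// CPU version: two nested loops visit every element in turn.
void matAddCpu(const float* a, const float* b, float* c, int n) {
    for (int row = 0; row < n; ++row)
        for (int col = 0; col < n; ++col)
            c[row * n + col] = a[row * n + col] + b[row * n + col];
}

// CUDA version: the loops disappear; each thread handles one element.
__global__ void matAddKernel(const float* a, const float* b, float* c, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)  // guard threads that fall past the matrix edge
        c[row * n + col] = a[row * n + col] + b[row * n + col];
}

// Hypothetical host wrapper: copy inputs in, launch, copy the result out.
// Returns the first CUDA error encountered, or cudaSuccess.
cudaError_t matAddGpu(const float* a, const float* b, float* c, int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);  // assumed block shape
        dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
        matAddKernel<<<grid, block>>>(dA, dB, dC, n);
        err = cudaGetLastError();
    }
    // cudaMemcpy waits for the kernel to finish before copying back.
    if (err == cudaSuccess) err = cudaMemcpy(c, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Calling `matAddGpu(a, b, c, n)` with host arrays of `n * n` floats should give the same result as `matAddCpu`; the grid is rounded up, so `n` need not be a multiple of the block size.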
Programming Model
Hardware implementation:

A set of SIMD multiprocessors with on-chip shared memory
Programming Model
G80 Example:
• 16 Multiprocessors, 128 Thread Processors
• Up to 12,288 parallel threads active (16 multiprocessors × 768 threads each)
• Per-block shared memory accelerates processing.
Programming Model
Streaming Multiprocessor (SM)
• Processing elements
o 8 scalar thread processors
o 32 GFLOPS peak at 1.35GHz
o 8192 32-bit registers (32KB)
o usual ops: float, int, branch...

• Hardware multithreading
o up to 8 blocks (3 active) resident at once
o up to 768 active threads in total

• 16KB on-chip memory
o supports thread communication
o shared amongst threads of a block
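The figures above describe G80. On other GPUs the corresponding limits can be read at run time through the runtime API. The following is a minimal sketch; the helper name `printSmLimits` is hypothetical, and the values printed depend on the installed device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the limits that correspond to the slide's figures.
cudaError_t printSmLimits(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    printf("device:                    %s\n", prop.name);
    printf("multiprocessors:           %d\n", prop.multiProcessorCount);
    printf("warp size:                 %d\n", prop.warpSize);
    printf("max threads per SM:        %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("registers per block:       %d\n", prop.regsPerBlock);
    printf("shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);

    // Upper bound on resident threads for the whole device,
    // the same product that gives 16 x 768 = 12,288 on G80.
    printf("max resident threads:      %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    return cudaSuccess;
}
```

Calling `printSmLimits(0)` reports the first device in the system.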
Programming Model
Execution Model:
Programming Model
Single Instruction Multiple Thread (SIMT) Execution:
• Groups of 32 threads formed
into warps
o always executing same instruction
o share instruction fetch/dispatch
o some become inactive
when code path diverges
o hardware automatically handles divergence

• Warps are the primitive unit of scheduling
o pick 1 of 24 warps for each instruction slot
o all warps from all active blocks are time-sliced
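To make the divergence point concrete, here is a small sketch with two kernels. In the first, the branch condition differs between neighbouring threads of the same warp; in the second, it is constant within each group of 32 threads. Both are correct; only the first makes the hardware run both paths for every warp. The kernel and helper names and the block size of 128 are assumptions chosen for illustration.

```cuda
#include <cuda_runtime.h>

// Divergent: odd and even threads of the same warp take different paths,
// so the warp executes both paths with part of its threads inactive each time.
__global__ void divergentKernel(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = 2 * i;
    else                      out[i] = i + 1;
}

// Warp-uniform: the condition is the same for all 32 threads of a warp,
// so each warp follows a single path.
__global__ void uniformKernel(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) out[i] = 2 * i;
    else                                   out[i] = i + 1;
}

// Hypothetical host helper: run one of the two kernels and fetch the result.
cudaError_t runBranchDemo(int* out, int n, bool divergent) {
    int* dOut = nullptr;
    cudaError_t err = cudaMalloc(&dOut, n * sizeof(int));
    if (err != cudaSuccess) return err;

    int threads = 128;  // assumed block size: four warps per block
    int blocks = (n + threads - 1) / threads;
    if (divergent) divergentKernel<<<blocks, threads>>>(dOut, n);
    else           uniformKernel<<<blocks, threads>>>(dOut, n);

    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(out, dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return err;
}
```

The programmer does not have to handle divergence explicitly in either case; arranging data so that branch conditions line up with warp boundaries is purely a performance consideration.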
Outline

• Introduction
• Motivation
• Programming Model
• Memory Model
• CUDA API
• Example
• Pro & Contra
• Trend
Memory Model
There are 6 memory types:
• Registers
o on chip
o fast access
o per thread
o limited amount
o 32 bit
• Local Memory
o in DRAM
o slow
o non-cached
o per thread
o relatively large
• Shared Memory
o on chip
o fast access
o per block
o 16 KByte
o synchronization between
threads
• Global Memory
o in DRAM
o slow
o non-cached
o per grid
o communication between
grids
• Constant Memory
o in DRAM
o cached
o per grid
o read-only
• Texture Memory
o in DRAM
o cached
o per grid
o read-only
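As an illustration of how per-block shared memory and thread synchronization work together, here is a minimal sketch of a block-level sum. Each block loads its inputs from global memory into shared memory, reduces them on chip, and writes a single partial sum. The names `blockSumKernel` and `sumOnDevice` and the block size of 256 are assumptions for this example.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two for this loop

// Each block sums up to BLOCK_SIZE inputs in shared memory
// and writes one partial sum to global memory.
__global__ void blockSumKernel(const float* in, float* partial, int n) {
    __shared__ float tile[BLOCK_SIZE];  // on chip, one copy per block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // every load must be visible before reducing

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) tile[tid] += tile[tid + stride];
        __syncthreads();  // finish this level before starting the next
    }
    if (tid == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host wrapper. Blocks cannot exchange data directly,
// so the per-block partial sums are added up on the host.
cudaError_t sumOnDevice(const float* in, int n, float* result) {
    if (n <= 0) return cudaErrorInvalidValue;
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    std::vector<float> partial(blocks);
    float *dIn = nullptr, *dPartial = nullptr;

    cudaError_t err = cudaMalloc(&dIn, n * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc(&dPartial, blocks * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(dIn, in, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        blockSumKernel<<<blocks, BLOCK_SIZE>>>(dIn, dPartial, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    if (err != cudaSuccess) return err;

    float total = 0.0f;
    for (float p : partial) total += p;
    *result = total;
    return cudaSuccess;
}
```

Global memory is touched once per input element and once per block; all intermediate additions stay in shared memory.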
Memory Model
• Registers
• Shared Memory
o on chip

• Local Memory
• Global Memory
• Constant Memory
• Texture Memory
o in Device Memory
Memory Model
• Global Memory
• Constant Memory
• Texture Memory
o managed by host code
o persistent across kernels
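The following sketch illustrates both points for global memory: the host allocates and frees the buffer, and the data written by one kernel is still there when a second kernel runs. The kernel and helper names are hypothetical and the block size of 128 is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// First kernel: write an initial value into global memory.
__global__ void fillKernel(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Second kernel: read what the first kernel left behind and update it.
__global__ void incrementKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Hypothetical host wrapper: the buffer is allocated, used by two
// separate kernel launches, copied back and released by host code.
cudaError_t fillThenIncrement(int* out, int n, int value) {
    int* dData = nullptr;
    cudaError_t err = cudaMalloc(&dData, n * sizeof(int));
    if (err != cudaSuccess) return err;

    int threads = 128;  // assumed block size
    int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(dData, n, value);
    // Launches on the same stream run in order, so this sees the filled data.
    incrementKernel<<<blocks, threads>>>(dData, n);

    err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(out, dData, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return err;
}
```

Registers, local memory and shared memory would not work for this hand-off, since their contents do not outlive the thread or block that owns them.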
CUDA API

The CUDA API provides an easy path for users to write programs for
the GPU device.

It consists of:

• A minimal set of extensions to C/C++

o type qualifiers
o call-syntax
o built-in variables

• A runtime library to support the execution

o host component
o device component
o common component
CUDA API

CUDA C/C++ Extensions:

• New function type qualifiers
__host__ void HostFunc(...); //executable on host
__global__ void KernelFunc(...); //callable from host
__device__ void DeviceFunc(...); //callable from device only
o Restrictions for device code (__global__ / __device__)
▪ no recursive calls
▪ no static variables
▪ no function pointers
▪ __global__ functions are invoked asynchronously
▪ __global__ functions must have void return type
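A short sketch of how the three qualifiers might be combined in one source file. The function names are hypothetical. The example also uses the `__host__ __device__` combination, which is not listed above but is accepted by CUDA and compiles a function for both sides.

```cuda
#include <cuda_runtime.h>

// __host__ __device__: compiled for both CPU and GPU, callable from either.
__host__ __device__ float square(float x) { return x * x; }

// __device__: runs on the device and is callable from device code only.
__device__ float squarePlusOne(float x) { return square(x) + 1.0f; }

// __global__: a kernel. Launched from the host, runs on the device,
// and must return void.
__global__ void applyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = squarePlusOne(data[i]);
}

// __host__: ordinary CPU code (also the default for unqualified functions).
// Hypothetical wrapper that applies the kernel to a host array in place.
__host__ cudaError_t applyOnDevice(float* data, int n) {
    float* dData = nullptr;
    cudaError_t err = cudaMalloc(&dData, n * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(dData, data, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 128;  // assumed block size
        applyKernel<<<(n + threads - 1) / threads, threads>>>(dData, n);
        // The launch returns immediately; errors from the launch itself
        // are picked up here, and the copy below waits for completion.
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(data, dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    return err;
}
```

Because the kernel cannot return a value, results travel back through device memory, here the same buffer that carried the input.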
CUDA API

CUDA C/C++ Extensions:

• New variable type qualifiers
__device__ int GlobalVar; //in global memory, lifetime of app
__constant__ int ConstVar; //in constant memory, lifetime of app
__shared__ int SharedVar; //in shared memory, lifetime of block
o Restrictions
▪ no external usage
▪ only file scope (__device__ and __constant__)
▪ no combination with struct or union
▪ no initialization for __shared__
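The sketch below uses all three variable qualifiers in one small program. The host writes two coefficients into constant memory, each block stages them in shared memory, and a counter in global memory survives from one launch to the next. All names (`gCallCount`, `cScale`, `scaleKernel`, `scaleOnDevice`) are hypothetical, and staging constants in shared memory is done here only to show the qualifier, not because it is needed.

```cuda
#include <cuda_runtime.h>

__device__ int gCallCount = 0;  // global memory, lives as long as the application
__constant__ float cScale[2];   // constant memory: set by the host, read-only in kernels

__global__ void scaleKernel(float* data, int n) {
    __shared__ float sScale[2];  // shared memory: one copy per block, no initializer
    if (threadIdx.x < 2) sScale[threadIdx.x] = cScale[threadIdx.x];
    __syncthreads();             // make the staged values visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * sScale[0] + sScale[1];
    if (i == 0) gCallCount += 1; // a single thread updates the counter per launch
}

// Hypothetical host wrapper: computes data[i] * scale + offset in place and
// reports how many times the kernel has been launched so far.
cudaError_t scaleOnDevice(float* data, int n, float scale, float offset,
                          int* callCount) {
    float hScale[2] = {scale, offset};
    float* dData = nullptr;
    cudaError_t err = cudaMemcpyToSymbol(cScale, hScale, sizeof(hScale));
    if (err == cudaSuccess) err = cudaMalloc(&dData, n * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(dData, data, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 64;  // assumed block size; must be at least 2 here
        scaleKernel<<<(n + threads - 1) / threads, threads>>>(dData, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(data, dData, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpyFromSymbol(callCount, gCallCount, sizeof(int));
    cudaFree(dData);
    return err;
}
```

Calling `scaleOnDevice` twice in the same process should report a counter that grows by one each time, which is the "lifetime of app" behaviour noted for `__device__` variables.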
CUDA API

CUDA C/C++ Extensions:

• New syntax to invoke the device code
KernelFunc<<< Dg, Db, Ns, S >>>(...);
o Dg: dimension of grid
o Db: dimension of block
o Ns: optional, bytes of shared memory allocated dynamically per block (for extern __shared__ arrays)
o S : optional, associated stream

• New built-in variables for indexing the threads

o gridDim: dimension of the whole grid
o blockIdx: index of the current block
o blockDim: dimension of each block in the grid
o threadIdx: index of the current thread
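A minimal sketch that uses all four launch parameters. The kernel reverses each block-sized chunk of an array; its shared buffer is sized by `Ns` at launch time, and the work is issued on an explicitly created stream `S`. The names `reverseChunksKernel` and `reverseChunks` are hypothetical.

```cuda
#include <cuda_runtime.h>

// Reverses the elements inside each block-sized chunk of the array.
__global__ void reverseChunksKernel(int* data, int n) {
    extern __shared__ int buf[];                   // size is given by Ns at launch
    int base = blockIdx.x * blockDim.x;            // first element of this block's chunk
    int len = min((int)blockDim.x, n - base);      // the last chunk may be shorter
    int t = threadIdx.x;

    if (t < len) buf[t] = data[base + t];
    __syncthreads();                               // whole chunk is now in shared memory
    if (t < len) data[base + t] = buf[len - 1 - t];
}

// Hypothetical host wrapper: reverse chunks of blockSize elements in place.
cudaError_t reverseChunks(int* data, int n, int blockSize) {
    if (n <= 0 || blockSize <= 0) return cudaErrorInvalidValue;
    size_t bytes = (size_t)n * sizeof(int);
    int* dData = nullptr;
    cudaStream_t stream = nullptr;

    cudaError_t err = cudaStreamCreate(&stream);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dData, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(dData, data, bytes, cudaMemcpyHostToDevice, stream);
    if (err == cudaSuccess) {
        dim3 Db(blockSize);                        // dimension of each block
        dim3 Dg((n + blockSize - 1) / blockSize);  // dimension of the grid
        size_t Ns = blockSize * sizeof(int);       // dynamic shared memory per block
        reverseChunksKernel<<<Dg, Db, Ns, stream>>>(dData, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpyAsync(data, dData, bytes, cudaMemcpyDeviceToHost, stream);
    if (err == cudaSuccess) err = cudaStreamSynchronize(stream);  // wait for the stream
    cudaFree(dData);
    cudaStreamDestroy(stream);
    return err;
}
```

When `Ns` and `S` are omitted, the launch uses no dynamic shared memory and the default stream, which is the form `KernelFunc<<<Dg, Db>>>(...)` seen in most introductory examples.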
CUDA API

CUDA Runtime Library:

• Common component
o Vector/Texture Types
o Mathematical/Time Functions

• Device component
o Mathematical/Time/Texture Functions
o Synchronization Function: __syncthreads()
o Type Conversion/Casting Functions
CUDA API

CUDA Runtime Library:

• Host component

o Structure
▪ Driver API
▪ Runtime API

o Functions
▪ Device, Context, Memory, Module, Texture management
▪ Execution control
▪ Interoperability with OpenGL and Direct3D
CUDA API

The CUDA source file uses .cu as its extension. It contains both host and
device source code.

The CUDA compiler driver nvcc compiles it and generates CPU binary code
and PTX code.
(PTX: Parallel Thread Execution, a device-independent virtual machine code)

PTX code may be further translated for a specific GPU architecture.
Programming Model
Simple example (matrix addition):
CPU C program: CUDA program:
Pro & Contra

CUDA allows
• massively parallel computing
• at a relatively low price
• highly integrated solution
• personal supercomputing
• eco-friendly production
• easy to learn
Pro & Contra

Problems:
• slightly low precision
• limited support for IEEE-754
• no recursive function call
• hard to use for irregular join/fork logic
• no concurrency between jobs
Trend

• More cores on-chip

• Better support for floating point
• More flexible configuration & control/data flow
• Lower price
• Support for higher-level programming languages
References

[1] CUDA Programming Guide, nVidia Corp.

[2] The CUDA Compiler Driver, nVidia Corp.
[3] Parallel Thread Execution, nVidia Corp.
[4] CUDA: A Heterogeneous Parallel Programming Model for
Manycore Computing, ASPLOS 2008, gpgpu.org
Questions?
