CUDA Programming
Lei Zhou, Yafeng Yin, Yanzhi Ren,
Hong Man, Yingying Chen
Outline
GPU
CUDA Introduction
What is CUDA
CUDA Programming Model
CUDA Library
Advantages & Limitations
CUDA Programming
Future Work
GPU
GPUs are massively multithreaded, many-core chips
Hundreds of scalar processors
Tens of thousands of concurrent threads
1 TFLOP peak performance
Fine-grained data-parallel computation
Users across science & engineering disciplines are achieving tenfold and higher speedups on GPUs
What is CUDA?
CUDA is the acronym for Compute Unified Device Architecture.
A parallel computing architecture developed by NVIDIA.
The computing engine in the GPU.
CUDA is accessible to software developers through industry-standard programming languages.
CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs.
Processing Flow
Processing Flow of CUDA:
Copy data from main memory to GPU memory.
The CPU instructs the GPU to start processing.
The GPU executes the kernel in parallel on each core.
Copy the results from GPU memory back to main memory.
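A minimal sketch of this flow in CUDA C follows; the kernel, array size, and launch configuration are illustrative assumptions, not taken from the slides.

#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: doubles each array element.
__global__ void scale(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);                         // allocate GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // step 1: main mem -> GPU mem
    scale<<<(n + 255) / 256, 256>>>(d_data, n);                  // steps 2-3: CPU launches, GPU runs in parallel
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // step 4: GPU mem -> main mem
    cudaFree(d_data);

    printf("h_data[10] = %f\n", h_data[10]);
    return 0;
}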
CUDA Programming Model
Definitions:
Device = GPU
Host = CPU
Kernel = function that runs on the device
CUDA Programming Model
A kernel is executed by a grid of thread blocks
A thread block is a batch of threads that can cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
Threads from different blocks cannot cooperate
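As a hedged sketch of block-level cooperation (the kernel below is a hypothetical example), threads in one block can stage data in shared memory and synchronize before reading it back:

__global__ void reverse_in_block(int *d_data, int n)
{
    __shared__ int s[256];            // visible to all threads in this block
    int t = threadIdx.x;
    if (t < n) s[t] = d_data[t];
    __syncthreads();                  // wait until every thread in the block has written
    if (t < n) d_data[t] = s[n - 1 - t];
}

// Launched with a single block, e.g. reverse_in_block<<<1, 256>>>(d_data, 256);
// threads in different blocks could not cooperate this way.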
CUDA Kernels and Threads
Parallel portions of an application are executed on the device as kernels
One kernel is executed at a time
Many threads execute each kernel
Differences between CUDA and CPU threads
CUDA threads are extremely lightweight
Very little creation overhead
Instant switching
CUDA uses thousands of threads to achieve efficiency
Multi-core CPUs can use only a few
Thread Hierarchy
Thread – Distributed by the CUDA runtime (identified by threadIdx)
Warp – A scheduling unit of up to 32 threads
Block – A user-defined group of 1 to 512 threads (identified by blockIdx)
Grid – A group of one or more blocks; a grid is created for each CUDA kernel function
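A short sketch of how this hierarchy appears at a kernel launch; the grid and block sizes below are illustrative assumptions, and myKernel is a placeholder __global__ function:

dim3 grid(4, 3);                     // 12 blocks: gridDim.x = 4, gridDim.y = 3
dim3 block(16, 16);                  // 256 threads per block (8 warps of 32)
myKernel<<<grid, block>>>(d_data);   // one grid is created for this kernel call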
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads
All threads run the same code
Each thread has an ID that it uses to compute memory addresses and make control decisions
Minimal Kernels
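As a sketch of what minimal kernels can look like (the names and values here are illustrative, not from the slide):

// Smallest useful kernel: a single value is written on the device.
__global__ void minimal(int *d_a) { *d_a = 13; }

// Minimal per-thread kernel: each thread writes its own element.
__global__ void assign(int *d_a, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_a[idx] = value;
}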
Example: Increment Array Elements
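A hedged reconstruction of the usual increment-array comparison (a serial CPU loop versus one CUDA thread per element); the variable names are illustrative:

// CPU version: a serial loop over all N elements.
void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}

// CUDA version: each thread increments the one element selected by its ID.
__global__ void increment_gpu(float *d_a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                      // guard against extra threads in the last block
        d_a[idx] = d_a[idx] + b;
}

// Example launch: increment_gpu<<<(N + 255) / 256, 256>>>(d_a, b, N);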
Thread Cooperation
The Missing Piece: threads may need to cooperate
Thread cooperation is valuable
Share results to avoid redundant computation
Share memory accesses
Drastic bandwidth reduction
Thread cooperation is a powerful feature of CUDA
Manage memory
Moving Data…
CUDA allows us to copy data from one memory type to another.
This includes dereferencing pointers, even in the host's memory (main system RAM).
To facilitate this data movement, CUDA provides cudaMemcpy().
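A brief sketch of cudaMemcpy(); the buffer names and sizes are illustrative:

float h_buf[256];                            // host buffer (main system RAM)
float *d_buf;                                // device buffer (GPU memory)
cudaMalloc((void **)&d_buf, sizeof(h_buf));

// The last argument states the direction of the copy.
cudaMemcpy(d_buf, h_buf, sizeof(h_buf), cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_buf);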
Advantages of CUDA
CUDA has several advantages over traditional general-purpose computation on GPUs:
Scattered reads – code can read from arbitrary addresses in memory.
Shared memory – CUDA exposes a fast shared memory region (16 KB per multiprocessor) that can be shared among threads.
Limitations of CUDA
CUDA has several limitations compared with traditional general-purpose computation on GPUs:
A single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments.
The bus bandwidth and latency between the CPU and the GPU may be a bottleneck.
CUDA-enabled GPUs are only available from NVIDIA.
CUDA Programming
• CUDA Specifications
– Function Qualifiers
– CUDA Built-in Device Variables
– Variable Qualifiers
• CUDA Programming and Examples
– Compile Procedure
– Examples
Function Qualifiers
• __global__ : invoked from within host (CPU) code,
– cannot be called from device (GPU) code
– must return void
• __device__ : called from other GPU functions,
– cannot be called from host (CPU) code
• __host__ : can only be executed by CPU, called from
host
• __host__ and __device__ qualifiers can be combined
– Sample use: overloading operators
– Compiler will generate both CPU and GPU code
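A hedged sketch of the three qualifiers in use; the function names are placeholders:

// __global__: launched from host code, runs on the GPU, must return void.
__global__ void kernel_entry(float *d_out)
{
    d_out[threadIdx.x] = 1.0f;
}

// __device__: callable only from GPU code (e.g. from a __global__ function).
__device__ float gpu_square(float x) { return x * x; }

// __host__ __device__: the compiler generates both a CPU and a GPU version.
__host__ __device__ float square(float x) { return x * x; }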
CUDA Built-in Device Variables
• All __global__ and __device__ functions have access to these automatically defined variables
• dim3 gridDim;
– Dimensions of the grid in blocks (at most 2D)
• dim3 blockDim;
– Dimensions of the block in threads
• uint3 blockIdx;
– Block index within the grid
• uint3 threadIdx;
– Thread index within the block
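As a sketch (the kernel and the row-major array layout below are assumptions), these variables combine into a global index like this:

__global__ void touch2d(float *d_a, int width, int height)
{
    // Global 2D coordinates of this thread within the whole grid.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        d_a[y * width + x] += 1.0f;   // row-major flattening of the 2D index
}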
Variable Qualifiers (GPU code)
• __device__
– Stored in device memory (large, high latency, no cache)
– Allocated with cudaMalloc (__device__ qualifier implied)
– Accessible by all threads
– Lifetime: application
• __shared__
– Stored in on-chip shared memory (very low latency)
– Allocated by execution configuration or at compile time
– Accessible by all threads in the same thread block
– Lifetime: kernel execution
• Unqualified variables:
– Scalars and built-in vector types are stored in registers
– Arrays of more than 4 elements stored in device memory
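A short sketch of the qualifiers inside GPU code; the names and sizes are illustrative, and the kernel assumes a launch with 128 threads per block:

__device__ float d_scale = 2.0f;          // device memory; visible to all threads; lifetime of the application

__global__ void qualifier_demo(float *d_out)
{
    __shared__ float tile[128];           // on-chip shared memory; one copy per block; lifetime of the kernel
    float x = threadIdx.x * d_scale;      // unqualified scalar: held in a register
    tile[threadIdx.x] = x;
    __syncthreads();
    d_out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x];
}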
CUDA Programming
• Kernels are C functions with some restrictions
– Can only access GPU memory
– Must have void return type
– No variable number of arguments (“varargs”)
– Not recursive
– No static variables
• Function arguments are automatically copied from CPU to GPU memory
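As a small sketch (a standard SAXPY kernel, not taken from the slides): scalar arguments such as a and n are copied by value to the GPU, while array data must already live in GPU memory and is passed through device pointers.

__global__ void saxpy(float a, float *d_x, float *d_y, int n)
{
    // 'a' and 'n' arrive as copies; d_x and d_y must point to cudaMalloc'd device memory.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_y[i] = a * d_x[i] + d_y[i];
}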
CUDA Compile Procedure
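As a hedged summary of the build flow: nvcc splits a .cu source file into host code (passed to the system's C/C++ compiler) and device code (compiled to PTX and/or GPU binary code), then links both into a single executable. An illustrative invocation, assuming a hypothetical source file named vector_add.cu:

nvcc -o vector_add vector_add.cu    # compile host and device code, link into one executable
./vector_add                        # run on a machine with a CUDA-capable GPU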