CUDA Programming
Outline
GPU
CUDA Introduction
What is CUDA
CUDA Programming Model
Advantages & Limitations
CUDA Programming
Future Work
GPU
GPUs are massively multithreaded, many-core chips
Originally designed to handle computation only for computer graphics
Hundreds of processors
Tens of thousands of concurrent threads
TFLOPs peak performance
Fine-grained data-parallel computation
Users across science & engineering disciplines are
achieving tenfold and higher speedups on GPUs
CPU v/s GPU
[Figures comparing CPU and GPU architectures; © NVIDIA Corporation 2009]
GPGPU
• What is GPGPU?
– General purpose computing on GPUs
– GPGPU is the use of a GPU, which typically
handles computation only for computer
graphics, to perform computation in
applications traditionally handled by the
central processing unit (CPU).
• Why GPGPU?
– Massively parallel computing power
– Inexpensive
GPGPU
• How?
– CUDA
– OpenCL
– DirectCompute
What is CUDA?
CUDA is the acronym for Compute Unified Device
Architecture.
A parallel computing architecture developed by NVIDIA.
Heterogeneous serial-parallel computing
The computing engine in NVIDIA GPUs.
CUDA is accessible to software developers through
industry-standard programming languages.
CUDA gives developers access to the instruction set
and memory of the parallel computation elements in
GPUs.
Heterogeneous Computing
Terminology:
Host The CPU and its memory (host memory)
Device The GPU and its memory (device memory)
© NVIDIA 2013
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// Parallel function (kernel): runs on the device
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

// Serial code: runs on the host
int main(void) {
    int *in, *out;      // host copies
    int *d_in, *d_out;  // device copies
    int size = (N + 2 * RADIUS) * sizeof(int);

    // Alloc space for host copies and set up values
    in  = (int *)malloc(size); fill_ints(in,  N + 2 * RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2 * RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device (serial code)
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Launch stencil_1d() kernel on GPU (parallel code)
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host, then clean up (serial code)
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
© NVIDIA 2013
CUDA Kernels and Threads
Parallel portions of an application are executed on
the device as kernels
A kernel is a function that runs on the device
One kernel is executed at a time
Many threads execute each kernel
Differences between CUDA and CPU threads
CUDA threads are extremely lightweight
Very little creation overhead
Instant switching
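As a minimal sketch (not from the original slides; the kernel name and launch sizes are illustrative), a kernel is declared with __global__ and launched from host code with the <<<blocks, threads>>> syntax:

#include <cuda_runtime.h>

// Illustrative kernel: every thread increments one array element.
__global__ void add_one(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // unique global thread index
    if (i < n)                                       // guard: ignore surplus threads
        data[i] += 1;
}

int main(void) {
    const int n = 1024;
    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    // One kernel is launched; many lightweight threads (4 blocks x 256) execute it.
    add_one<<<4, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}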
CUDA Programming Model
A kernel is executed by a grid of thread blocks
A thread block is a batch of threads that can
cooperate with each other by:
Sharing data through shared memory
Synchronizing their execution
Threads from different blocks cannot cooperate
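As a hedged illustration of intra-block cooperation (not from the slides; the block size and kernel name are assumptions), each block below sums its portion of an array by staging data in shared memory and synchronizing with __syncthreads():

#define BLOCK 256

// Launch with <<<numBlocks, BLOCK>>>; each block produces one partial sum.
// Threads cooperate only within their own block.
__global__ void block_sum(const int *in, int *block_results) {
    __shared__ int buf[BLOCK];                        // visible to all threads of this block
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    buf[threadIdx.x] = in[i];
    __syncthreads();                                  // all loads complete before reading

    // Tree reduction inside the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();                              // every step must finish block-wide
    }

    if (threadIdx.x == 0)
        block_results[blockIdx.x] = buf[0];           // combining the partial sums across
                                                      // blocks needs a second kernel or the host
}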
CUDA Programming Model
© NVIDIA Corporation
CUDA Programming Model
• All threads within a block can
– Share data through ‘Shared Memory’
– Synchronize using ‘__syncthreads()’
• Threads and Blocks have unique IDs
– Available through special variables
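A small sketch (illustrative names, not from the slides) of the special variables threadIdx, blockIdx, blockDim and gridDim being used to give each thread a unique piece of work:

// Grid-stride loop: every thread derives a unique global index from the
// built-in ID variables, then strides by the total thread count.
__global__ void scale(float *x, float alpha, int n) {
    int tid    = threadIdx.x;                   // thread index within its block
    int bid    = blockIdx.x;                    // block index within the grid
    int global = bid * blockDim.x + tid;        // unique index across the whole grid
    int stride = blockDim.x * gridDim.x;        // total number of launched threads

    for (int i = global; i < n; i += stride)
        x[i] *= alpha;
}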
CUDA Programming Model
• SIMT (Single Instruction Multiple Threads)
Execution
• Threads run in groups of 32 called warps
• Every thread in a warp executes the same
instruction at a time
CUDA Programming Model
© NVIDIA Corporation
CUDA Programming Model
Single Instruction Multiple Thread (SIMT) Execution:
• Groups of 32 threads are formed into warps
  o always executing the same instruction
  o share instruction fetch/dispatch
  o some become inactive when the code path diverges
  o hardware automatically handles divergence
• Warps are the primitive unit of scheduling
• All warps from all active blocks are time-sliced
Control Flow Divergence
Courtesy Fung et al. MICRO ‘07
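As a hedged sketch (not from the slides), the kernel below shows how a branch on the thread index splits a warp: even and odd lanes take different paths, so the hardware runs the two paths one after the other with part of the warp inactive each time:

__global__ void divergent(int *out) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (threadIdx.x % 2 == 0)
        out[i] = 2 * i;      // even lanes run, odd lanes sit idle
    else
        out[i] = 3 * i;      // odd lanes run, even lanes sit idle
    // After the branch reconverges, the full warp executes together again.
}

// A divergence-free variant branches on a warp-aligned value, e.g.
// if ((threadIdx.x / 32) % 2 == 0) { ... }, so all 32 threads of a warp
// take the same path.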
Simple Processing Flow
(data moves between the CPU and GPU over the PCI bus)
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
© NVIDIA 2013
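The three steps map onto CUDA runtime calls roughly as follows (a sketch; the buffer names, sizes and my_kernel are placeholders):

int n = 1 << 20;
size_t bytes = n * sizeof(float);
float *h_buf = (float *)malloc(bytes);    // host (CPU) memory
float *d_buf;
cudaMalloc((void **)&d_buf, bytes);       // device (GPU) memory

// 1. Copy input data from CPU memory to GPU memory
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);

// 2. Load GPU program and execute, caching data on chip for performance
my_kernel<<<n / 256, 256>>>(d_buf, n);    // my_kernel: any __global__ function

// 3. Copy results from GPU memory to CPU memory
cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);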
Memory Model
• Types of device memory
– Registers – read/write per-thread
– Local Memory – read/write per-thread
– Shared Memory – read/write per-block
– Global Memory – read/write across grids
– Constant Memory – read across grids
– Texture Memory – read across grids
Memory Model
© NVIDIA Corporation
Memory Model
There are 6 memory types:
• Registers
  o on chip
  o fast access
  o per thread
  o limited amount
  o 32-bit
• Local Memory
  o in DRAM
  o slow
  o non-cached
  o per thread
  o relatively large
• Shared Memory
  o on chip
  o fast access
  o per block
  o 16 KByte
  o synchronize between threads
• Global Memory
  o in DRAM
  o slow
  o non-cached
  o per grid
  o communicate between grids
• Constant Memory
  o in DRAM
  o cached
  o per grid
  o read-only
• Texture Memory
  o in DRAM
  o cached
  o per grid
  o read-only
Memory Model
• On chip: Registers, Shared Memory
• In device memory (DRAM): Local, Global, Constant, Texture Memory
Memory Model
• Global Memory
• Constant Memory
• Texture Memory
o managed by host code
o persistent across kernels
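A compact sketch (illustrative names and sizes, not from the slides) of how the main memory spaces appear in CUDA C code:

__constant__ float coeffs[16];                 // constant memory: per grid, cached, read-only in kernels

__global__ void memory_spaces(const float *g_in, float *g_out) {
    // g_in and g_out point into global memory (DRAM, per grid).
    __shared__ float tile[256];                // shared memory: on chip, per block
    float r;                                   // ordinary locals normally live in registers (per thread)

    int i = threadIdx.x + blockIdx.x * blockDim.x;
    tile[threadIdx.x] = g_in[i];
    __syncthreads();

    r = tile[threadIdx.x] * coeffs[0];
    g_out[i] = r;
}

// Constant memory is set up by host code and persists across kernel launches:
// cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(coeffs));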
Advantages of CUDA
CUDA has several advantages over traditional general-purpose
computation on GPUs (GPGPU) using graphics APIs:
Scattered reads – code can read from arbitrary
addresses in memory.
Shared memory – CUDA exposes a fast shared
memory region (16 KB) that can be shared
among threads.
Limitations of CUDA
CUDA also has several limitations:
A single process must run across multiple
disjoint memory spaces (host and device), unlike
other C-language runtime environments.
The bus bandwidth and latency between the CPU
and the GPU may be a bottleneck.
CUDA-enabled GPUs are only available from
NVIDIA.
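To illustrate the disjoint-memory-space point (a sketch, not from the slides): a pointer returned by cudaMalloc addresses device memory and cannot be dereferenced by host code; data must be moved explicitly.

float *d_ptr;
cudaMalloc((void **)&d_ptr, 4 * sizeof(float));

// d_ptr[0] = 1.0f;        // invalid on the host: d_ptr refers to device memory
float value = 1.0f;
cudaMemcpy(d_ptr, &value, sizeof(float), cudaMemcpyHostToDevice);  // correct: explicit copy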