The document provides an introduction to CUDA programming, covering concepts such as heterogeneous computing, memory management, and parallel programming. It includes examples of writing and launching CUDA C kernels, managing GPU memory, and performing operations like vector addition using threads and blocks. The document also outlines the steps for compiling and executing CUDA programs using the NVIDIA compiler.

CUDA Programming

NVIDIA Corporation

Note: Some contents have been modified for teaching purposes.

© NVIDIA 2013
Outline
• What will you learn in this session?
– What is heterogeneous computing?
– “Hello World!” for CUDA C
– Write and launch CUDA C kernel
– Manage GPU memory and communication between CPU and GPU
– Hands-on session on CUDA using Google Colab

Heterogeneous Computing
▪ Terminology:
▪ Host: the CPU and its memory (host memory)
▪ Device: the GPU and its memory (device memory)

[Figure: the host (CPU and its memory) alongside the device (GPU and its memory)]

Heterogeneous Computing
#include <iostream>
#include <algorithm>
#include <cstdlib>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: runs on the device
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    // serial code: runs on the host
    int *in, *out;       // host copies
    int *d_in, *d_out;   // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // serial code: copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
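
Because fill_ints() sets every input element to 1, each computed output element should equal 2*RADIUS + 1 = 7. A minimal host-side check (not on the original slide; it uses printf, so <cstdio> or <stdio.h> would also be needed) could be added just before the cleanup:

// Verify: each computed element is the sum of 7 ones
for (int i = RADIUS; i < N + RADIUS; i++)
    if (out[i] != 1 + 2 * RADIUS)
        printf("Mismatch at %d: got %d\n", i, out[i]);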
Simple Processing Flow

[Figure: CPU memory and GPU memory connected by the PCI Bus]

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
Hello World!
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

• NVIDIA compiler (nvcc) can be used to compile programs with no device code

$ nvcc hello_world.cu
$ ./a.out
Hello World!
$

Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();  // make sure the kernel has finished before the program exits
    return 0;
}

▪ Two new syntactic elements…

Hello World! with Device Code
__global__ void mykernel(void) {
}
• CUDA C/C++ keyword __global__ indicates a
function that:
– Runs on the device
– Is called from host code

• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler (e.g., gcc)

Hello World! with Device Code
mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment

• That’s all that is required to execute a function on the GPU!

Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();  // make sure the kernel has finished before the program exits
    return 0;
}

$ nvcc hello.cu
$ ./a.out
Hello World!
$
Parallel Programming in CUDA C
• But wait… GPU computing is about massive parallelism!

• We need a more interesting example…

• We’ll start by adding two integers and build up to vector addition

[Figure: two inputs a and b combined into a result c]

Addition on the Device
• Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory

• We need to allocate memory on the GPU

Memory Management
• Host and device memory are separate entities

– Device pointers point to GPU memory

– Host pointers point to CPU memory

• Simple CUDA API for handling device memory (see the sketch below)
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
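
As an illustration (a minimal sketch, not part of the original deck; it assumes <stdio.h> is included), an allocate/copy/free round trip with a basic error check:

int h_in = 42, h_out = 0;
int *d_x = NULL;

cudaError_t err = cudaMalloc((void **)&d_x, sizeof(int));      // device allocation
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

cudaMemcpy(d_x, &h_in, sizeof(int), cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(&h_out, d_x, sizeof(int), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_x);                                                 // release device memory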

Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}

• Let’s take a look at main()…

Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);

// Allocate space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Setup input values
a = 2;
b = 7;

Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
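
The slide never inspects the result; to confirm that c ends up as 9, a line such as the following (assuming <stdio.h> is included) could be added just before the cleanup:

printf("%d + %d = %d\n", a, b, c);  // expected: 2 + 7 = 9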

CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices

RUNNING IN PARALLEL
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel
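
As a quick illustration (a hypothetical kernel, not from the deck), each of the N parallel invocations can report which block it is by reading blockIdx.x:

#include <stdio.h>

__global__ void whoami(void) {
    printf("Hello from block %d\n", blockIdx.x);  // each block prints its own index
}

int main(void) {
    whoami<<<4, 1>>>();       // 4 blocks, 1 thread each
    cudaDeviceSynchronize();  // wait so the device-side output appears before exit
    return 0;
}

The four lines may appear in any order, since the blocks execute independently.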

Vector Addition on the Device
• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index

Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let’s take a look at main()…

Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N blocks
add<<<N,1>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
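
random_ints() is used here but never defined in the deck; a plausible minimal version (an assumption, not NVIDIA's code) is:

#include <stdlib.h>

// Fill x[0..n-1] with pseudo-random integers
void random_ints(int *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = rand() % 100;  // value range chosen arbitrarily
}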

CUDA Threads
• Terminology: a block can be split into parallel threads

• Let’s change add() to use parallel threads instead of parallel blocks
__global__ void add(int *a, int *b, int *c) {
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

• We use threadIdx.x instead of blockIdx.x

• Need to make one change in main().


Vector Addition Using Threads: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

Vector Addition Using Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
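
A caveat not on the slide: a single block is limited to 1024 threads on current NVIDIA GPUs, so <<<1,N>>> works for N = 512 here but cannot cover large vectors on its own; combining blocks and threads, shown next, removes that limit.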

Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x and threadIdx.x
– Consider indexing an array with one element per thread (8 threads/block)

[Figure: 32 array elements in 4 blocks of 8 threads; threadIdx.x runs 0-7 within each block, and blockIdx.x is 0, 1, 2, 3]

• With M threads/block, a unique index for each thread is given by:

int index = threadIdx.x + blockIdx.x * M;

Indexing Arrays: Example
• Which thread will operate on the highlighted element?

[Figure: a 32-element array (indices 0-31) split into 4 blocks of M = 8 threads; the highlighted element falls in the block with blockIdx.x = 2, at threadIdx.x = 5]

int index = threadIdx.x + blockIdx.x * M
          = 5 + 2 * 8
          = 21;
Vector Addition with Blocks and Threads

• Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined version of add() to use parallel threads and parallel blocks:
__global__ void add(int *a, int *b, int *c) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}

• What changes need to be made in main()?


Addition with Blocks and Threads: main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

Addition with Blocks and Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
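
A quick sanity check of this launch (not on the slide): N = 2048*2048 = 4,194,304 elements at 512 threads per block gives exactly 4,194,304 / 512 = 8,192 blocks, so every element gets one thread and no bounds check is needed yet. The next slide handles sizes that do not divide evenly.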

Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x

• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}

• Update the kernel launch:

add<<<(N + M-1) / M,M>>>(d_a, d_b, d_c, N);
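
Here M stands for the threads per block (e.g. THREADS_PER_BLOCK), and (N + M-1) / M is integer division that rounds up. For example, N = 1000 and M = 256 give (1000 + 255) / 256 = 4 blocks, i.e. 1024 threads; the if (index < n) test keeps the extra 24 threads from writing past the end of the arrays.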

Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

• Unlike parallel blocks, threads within a block have mechanisms to (see the sketch below):
– Communicate
– Synchronize
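
As an illustration (a minimal sketch, not from the original deck; it assumes blockDim.x is a power of two and equals BLOCK_SIZE), threads within one block can cooperate through shared memory and __syncthreads() to sum their elements:

#define BLOCK_SIZE 256

__global__ void block_sum(int *in, int *out) {
    __shared__ int partial[BLOCK_SIZE];            // one slot per thread in the block
    int tid = threadIdx.x;
    partial[tid] = in[threadIdx.x + blockIdx.x * blockDim.x];
    __syncthreads();                               // make all loads visible to the whole block

    // Tree reduction: halve the number of active threads each round
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                           // wait before the next round
    }
    if (tid == 0)
        out[blockIdx.x] = partial[0];              // one partial sum per block
}

Launched as block_sum<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out), each block writes one partial sum that the host (or a follow-up kernel) can finish adding.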

Thank You

© NVIDIA 2013
