
CS516: Parallelization of Programs

CUDA Thread Organization

Vishwesh Jatala
Assistant Professor
Department of EECS
Indian Institute of Technology Bhilai
[email protected]

2022-23M
1
Outline

■ Continue with CUDA Programming


❑ Thread organization
❑ Examples

2
CUDA Programming Flow

[Diagram] The host (CPU, with its own memory) and the device (GPU, with several SMs
and device memory). The programming flow:
(1) CPU-to-GPU data transfer
(2) Kernel execution on the SMs
(3) GPU-to-CPU data transfer

3
VectorAdd in CUDA

■ Given two vectors A and B, both of size N (where
N <= 1024), write a CUDA program to compute C = A + B

4
VectorAdd in CUDA

#include <stdlib.h>        // malloc, free
#include <cuda_runtime.h>  // cudaMalloc, cudaMemcpy, cudaFree
// random_ints(): helper assumed defined elsewhere; fills an array with random values

#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; //device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for host copies of a, b, c and
// setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&dev_a, size);
cudaMalloc((void **)&dev_b, size);
cudaMalloc((void **)&dev_c, size);

5
VectorAdd in CUDA

// Copy inputs to device


cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads


add<<<1,N>>>(dev_a, dev_b, dev_c);

// Copy result back to host


cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
return 0;
}

6
VectorAdd in CUDA

__global__ void add(int *a, int *b, int *c) {
    // one thread per element: thread i computes c[i]
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

7
Practice Problem-1

■ Given two matrices M and N, both of size k*k (where
k <= 1024), write a CUDA program to compute M+N
❑ Hint: Allocate M and N as single-dimensional arrays of
k*k elements (a sketch follows below).
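One possible solution sketch (not from the slides): since only single-block launches
have been covered so far and k <= 1024, this assumes one thread per row, with each
thread looping over its row of the flattened array. The kernel name matAdd and the
row-per-thread mapping are assumptions.

__global__ void matAdd(int *m, int *n, int *p, int k) {
    int row = threadIdx.x;             // one thread per row of the k*k matrix
    for (int j = 0; j < k; j++)        // each thread adds its entire row
        p[row * k + j] = m[row * k + j] + n[row * k + j];
}

// launched with a single block of k threads (k <= 1024):
// matAdd<<<1, k>>>(dev_m, dev_n, dev_p, k);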

8
Thread Configuration

add<<<ThreadConfig>>> (dev_a, dev_b, dev_c);

add<<<ThreadBlocks, Threads>>> (dev_a, dev_b, dev_c);

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

blockIdx.x -> For identifying the block

threadIdx.x -> For identifying the thread within
a thread block

blockDim.x -> Size (number of threads) of the thread block
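A minimal sketch (not from the slides) that prints these built-in variables; the
kernel name whoAmI and the 2-blocks-of-4-threads launch are assumptions for
illustration:

#include <stdio.h>

__global__ void whoAmI(void) {
    printf("blockIdx.x=%d threadIdx.x=%d blockDim.x=%d\n",
           blockIdx.x, threadIdx.x, blockDim.x);
}

// whoAmI<<<2, 4>>>(); cudaDeviceSynchronize();  // prints 8 lines; blockDim.x is 4 in all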

9
Indexing Arrays with Threads and Thread Blocks

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

M=8

0 1 2 3 4 5 6 7 ... 24 25 26 27 28 29 30 31

Array A

What array index is accessed by the thread with a given
threadIdx.x in block blockIdx.x?

int index = threadIdx.x + blockIdx.x * M;

10
Indexing Arrays with Threads and Thread Blocks

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

M=8

0 1 2 3 4 5 6 7 ... 21 22 23 24 25 26 27 28 29 30 31

Array A

Which threadIdx.x and blockIdx.x will operate on index 21?

index = threadIdx.x + blockIdx.x * M
21 = 5 + 2 * 8

(In general: blockIdx.x = index / M and threadIdx.x = index % M.)

11
VectorAdd in CUDA with Thread and Blocks

__global__ void add(int *a, int *b, int *c) {
    // global index = thread offset within the block + block offset
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

12
VectorAdd in CUDA

#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; //device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for host copies of a, b, c and
// setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&dev_a, size);
cudaMalloc((void **)&dev_b, size);
cudaMalloc((void **)&dev_c, size);

13
VectorAdd in CUDA

// Copy inputs to device


cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads in total

add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>
    (dev_a, dev_b, dev_c);

// Copy result back to host


cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
return 0;
}
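A common refinement, added here as a sketch (it is not on the slide): when N is not
a multiple of THREADS_PER_BLOCK, round the block count up and guard the kernel so
the extra threads in the last block do nothing. The extra int n parameter is an
assumption of this sketch.

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                    // out-of-range threads skip the work
        c[index] = a[index] + b[index];
}

// launch with enough blocks to cover all N elements:
// add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>
//     (dev_a, dev_b, dev_c, N);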

14
Thread Block Configuration

add<<<ThreadBlocks, Threads>>> (dev_a, dev_b, dev_c);

■ Thread block configuration

❑ User's choice
❑ Depends on the problem size
■ Problem size = 32768 (1024 * 32)
❑ Thread blocks = 32, threads per block = 1024
❑ Thread blocks = 128, threads per block = 256
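Either choice launches 32768 threads in total; the corresponding launches would be:

add<<<32, 1024>>>(dev_a, dev_b, dev_c);    // 32 * 1024 = 32768 threads
add<<<128, 256>>>(dev_a, dev_b, dev_c);    // 128 * 256 = 32768 threads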

15
Practice Problem-2

■ Given two matrices M and N, both of size k*k, write a
CUDA program to compute M+N using threads and
thread blocks
❑ Assume each thread block has THREADS_PER_BLOCK
threads (a sketch follows below)
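A possible sketch (not from the slides), assuming the matrices are flattened into
1D arrays of k*k elements, THREADS_PER_BLOCK is #defined as in the earlier launch,
and the rounded-up launch with a bounds check covers sizes that are not a multiple
of the block size:

__global__ void matAdd(int *m, int *n, int *p, int total) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < total)                 // guard: k*k need not divide evenly
        p[index] = m[index] + n[index];
}

// one thread per element, total = k * k:
// int total = k * k;
// matAdd<<<(total + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>
//     (dev_m, dev_n, dev_p, total);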

16
Thread Configuration

add<<<ThreadConfig>>> (dev_a, dev_b, dev_c);

add<<<ThreadBlocks, Threads>>> (dev_a, dev_b, dev_c);

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

17
Exercise: 1D Thread Organization

■ For a given matrix A of size N*M, write a CUDA


program to initialize the matrix elements as below

■ Assumptions:
❑ The matrix is stored in a single-dimensional array.

❑ No. of thread blocks = N

❑ No. of threads in each block = M
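The initialization pattern referred to above survives only as a figure on the next
slide, so this sketch shows just the indexing scheme under the stated assumptions;
the value written (the row number) is a placeholder assumption:

__global__ void init(int *A) {
    // with N blocks of M threads: blockIdx.x = row, threadIdx.x = column
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    A[index] = blockIdx.x;   // placeholder: the actual pattern is on the slide figure
}

// launched as: init<<<N, M>>>(dev_A);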

18
1D

19
Why Thread Blocks?

But why not one thread block containing all the threads?

20
GPU Architecture

21
Few Constraints

■ Thread block size has a limit

❑ Each thread executes the same program
❑ Each thread requires some registers
❑ The number of registers in each SM is finite

__global__ void add(int *a, int *b, int *c) {
    // index and the address arithmetic consume per-thread registers,
    // so register usage limits how many threads an SM can hold
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

22
Scalability

[Diagram] The same grid of eight thread blocks (Block 0 ... Block 7) runs unchanged
on two different GPUs: GPU-0 has two SMs (SM0, SM1) and is assigned four blocks per
SM, while GPU-1 has four SMs (SM0-SM3) and is assigned two blocks per SM. Because
thread blocks execute independently, the same program scales transparently with the
number of SMs.
23
Few Constraints

■ Thread block size has a limit

■ Maximum number of thread blocks resident per SM
■ Maximum number of threads resident per SM
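These limits can be queried at runtime. A minimal sketch (not from the slides)
using the CUDA runtime API on device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    printf("Number of SMs:         %d\n", prop.multiProcessorCount);
    return 0;
}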

24
Compute Capabilities

Source: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
25
Summary

■ CUDA Programming
❑ Thread organization
❑ Examples
■ Next Lecture
❑ Thread organization (2D & 3D)
❑ GPU Instruction Execution

26
References

■ CS6023 GPU Programming


❑ https://www.cse.iitm.ac.in/~rupesh/teaching/gpu/jan20/
■ Miscellaneous resources from internet

27
