
CS516: Parallelization of Programs

CUDA Thread Organization

Vishwesh Jatala
Assistant Professor
Department of EECS
Indian Institute of Technology Bhilai
[email protected]

2022-23M
1
Outline

■ Continue with CUDA Programming


❑ Thread organization
❑ Examples

2
CUDA Programming Flow

[Diagram] The host (CPU, with its own memory) and the device (GPU, with several SMs
and device memory). The programming flow:
(1) CPU-to-GPU data transfer
(2) Kernel execution on the SMs
(3) GPU-to-CPU data transfer

3
VectorAdd in CUDA

■ Given two vectors A and B, both of size N (where
N <= 1024), write a CUDA program to compute C = A + B

4
VectorAdd in CUDA

#include <stdlib.h>        // malloc, free
#include <cuda_runtime.h>  // cudaMalloc, cudaMemcpy, cudaFree
// random_ints(): helper assumed defined elsewhere; fills an array with random values

#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; //device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for host copies of a, b, c and
// setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&dev_a, size);
cudaMalloc((void **)&dev_b, size);
cudaMalloc((void **)&dev_c, size);

5
VectorAdd in CUDA

// Copy inputs to device


cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads


add<<<1,N>>>(dev_a, dev_b, dev_c);

// Copy result back to host


cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
return 0;
}

6
VectorAdd in CUDA

__global__ void add(int *a, int *b, int *c) {
    // one thread per element: thread i computes c[i]
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

7
Practice Problem-1

■ Given two matrices M and N, both of size k*k (where
k <= 1024), write a CUDA program to compute M+N
❑ Hint: Allocate M and N as single-dimensional arrays of
k*k elements (a sketch follows below).
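One possible solution sketch (not from the slides): since only single-block launches
have been covered so far and k <= 1024, this assumes one thread per row, with each
thread looping over its row of the flattened array. The kernel name matAdd and the
row-per-thread mapping are assumptions.

__global__ void matAdd(int *m, int *n, int *p, int k) {
    int row = threadIdx.x;             // one thread per row of the k*k matrix
    for (int j = 0; j < k; j++)        // each thread adds its entire row
        p[row * k + j] = m[row * k + j] + n[row * k + j];
}

// launched with a single block of k threads (k <= 1024):
// matAdd<<<1, k>>>(dev_m, dev_n, dev_p, k);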

8
Thread Configuration

add<<<ThreadConfig>>> (dev_a, dev_b, dev_c);

add<<<ThreadBlocks, Threads>>> (dev_a, dev_b, dev_c);

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

blockIdx.x -> For identifying the block

threadIdx.x -> For identifying the thread within
a thread block

blockDim.x -> Size (number of threads) of the thread block
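A minimal sketch (not from the slides) that prints these built-in variables; the
kernel name whoAmI and the 2-blocks-of-4-threads launch are assumptions for
illustration:

#include <stdio.h>

__global__ void whoAmI(void) {
    printf("blockIdx.x=%d threadIdx.x=%d blockDim.x=%d\n",
           blockIdx.x, threadIdx.x, blockDim.x);
}

// whoAmI<<<2, 4>>>(); cudaDeviceSynchronize();  // prints 8 lines; blockDim.x is 4 in all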

9
Indexing Arrays with Threads and Thread Blocks

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

M=8

0 1 2 3 4 5 6 7 ... 24 25 26 27 28 29 30 31

Array A

What array index is accessed by the thread with a given
threadIdx.x in block blockIdx.x?

int index = threadIdx.x + blockIdx.x * M;

10
Indexing Arrays with Threads and Thread Blocks

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

M=8

0 1 2 3 4 5 6 7 ... 21 22 23 24 25 26 27 28 29 30 31

Array A

Which threadIdx.x and blockIdx.x will operate on index 21?

index = threadIdx.x + blockIdx.x * M
21 = 5 + 2 * 8

(In general: blockIdx.x = index / M and threadIdx.x = index % M.)

11
VectorAdd in CUDA with Thread and Blocks

__global__ void add(int *a, int *b, int *c) {
    // global index = thread offset within the block + block offset
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

12
VectorAdd in CUDA

#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; //device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for host copies of a, b, c and
// setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

// Alloc space for device copies of a, b, c


cudaMalloc((void **)&dev_a, size);
cudaMalloc((void **)&dev_b, size);
cudaMalloc((void **)&dev_c, size);

13
VectorAdd in CUDA

// Copy inputs to device


cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads in total

add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>
    (dev_a, dev_b, dev_c);

// Copy result back to host


cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
// Cleanup
free(a); free(b); free(c);
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
return 0;
}
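A common refinement, added here as a sketch (it is not on the slide): when N is not
a multiple of THREADS_PER_BLOCK, round the block count up and guard the kernel so
the extra threads in the last block do nothing. The extra int n parameter is an
assumption of this sketch.

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                    // out-of-range threads skip the work
        c[index] = a[index] + b[index];
}

// launch with enough blocks to cover all N elements:
// add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>
//     (dev_a, dev_b, dev_c, N);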

14
Thread Block Configuration

add<<<ThreadBlocks, Threads>>> (dev_a, dev_b, dev_c);

■ Thread block configuration

❑ User's choice
❑ Depends on the problem size
■ Problem size = 32768 (1024 * 32)
❑ Thread blocks = 32, threads per block = 1024
❑ Thread blocks = 128, threads per block = 256
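Either choice launches 32768 threads in total; the corresponding launches would be:

add<<<32, 1024>>>(dev_a, dev_b, dev_c);    // 32 * 1024 = 32768 threads
add<<<128, 256>>>(dev_a, dev_b, dev_c);    // 128 * 256 = 32768 threads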

15
Practice Problem-2

■ Given two matrices M and N, both of size k*k, write a
CUDA program to compute M+N using threads and
thread blocks
❑ Assume each thread block has THREADS_PER_BLOCK
threads (a sketch follows below)
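A possible sketch (not from the slides), assuming the matrices are flattened into
1D arrays of k*k elements, THREADS_PER_BLOCK is #defined as in the earlier launch,
and the rounded-up launch with a bounds check covers sizes that are not a multiple
of the block size:

__global__ void matAdd(int *m, int *n, int *p, int total) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < total)                 // guard: k*k need not divide evenly
        p[index] = m[index] + n[index];
}

// one thread per element, total = k * k:
// int total = k * k;
// matAdd<<<(total + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>
//     (dev_m, dev_n, dev_p, total);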

16
Thread Configuration

add<<<ThreadConfig>>> (dev_a, dev_b, dev_c);

add<<<ThreadBlocks, Threads>>> (dev_a, dev_b, dev_c);

Thread Block 0 Thread Block 1 Thread Block 2 Thread Block 3

17
Exercise: 1D Thread Organization

■ For a given matrix A of size N*M, write a CUDA


program to initialize the matrix elements as below

■ Assumptions:
❑ The matrix is stored in a single-dimensional array.

❑ No. of thread blocks = N

❑ No. of threads in each block = M
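The initialization pattern referred to above survives only as a figure on the next
slide, so this sketch shows just the indexing scheme under the stated assumptions;
the value written (the row number) is a placeholder assumption:

__global__ void init(int *A) {
    // with N blocks of M threads: blockIdx.x = row, threadIdx.x = column
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    A[index] = blockIdx.x;   // placeholder: the actual pattern is on the slide figure
}

// launched as: init<<<N, M>>>(dev_A);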

18
1D

19
Why Thread Blocks?

But why not one thread block containing all the threads?

20
GPU Architecture

21
Few Constraints

■ Thread block size has a limit

❑ Each thread executes the same program
❑ Each thread requires some registers
❑ The number of registers in each SM is finite

__global__ void add(int *a, int *b, int *c) {
    // index and the address arithmetic consume per-thread registers,
    // so register usage limits how many threads an SM can hold
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

22
Scalability

[Diagram] The same grid of eight thread blocks (Block 0 ... Block 7) runs unchanged
on two different GPUs: GPU-0 has two SMs (SM0, SM1) and is assigned four blocks per
SM, while GPU-1 has four SMs (SM0-SM3) and is assigned two blocks per SM. Because
thread blocks execute independently, the same program scales transparently with the
number of SMs.
23
Few Constraints

■ Thread block size has a limit

■ Maximum number of thread blocks resident per SM
■ Maximum number of threads resident per SM
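These limits can be queried at runtime. A minimal sketch (not from the slides)
using the CUDA runtime API on device 0:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:      %d\n", prop.regsPerMultiprocessor);
    printf("Number of SMs:         %d\n", prop.multiProcessorCount);
    return 0;
}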

24
Compute Capabilities

Source: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
25
Summary

■ CUDA Programming
❑ Thread organization
❑ Examples
■ Next Lecture
❑ Thread organization (2D & 3D)
❑ GPU Instruction Execution

26
References

■ CS6023 GPU Programming


❑ https://www.cse.iitm.ac.in/~rupesh/teaching/gpu/jan20/
■ Miscellaneous resources from internet

27
