Threads
Klaus Mueller

Every thread runs the same kernel code; the thread ID selects which data element it works on:

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

[Figure: many threads shown side by side, each executing this same code fragment on its own threadID]
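A minimal CUDA sketch of this pattern; the names input, output, func, and n are placeholders taken from the figure, and the launch shown in the comment is one common choice, not something prescribed by the slides:

    __device__ float func(float x) { return 2.0f * x; }        // placeholder per-element function

    __global__ void apply_func(const float* input, float* output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread index
        if (threadID < n) {                                     // guard threads past the end of the data
            float x = input[threadID];
            float y = func(x);
            output[threadID] = y;
        }
    }

    // Launch with enough 256-thread blocks to cover n elements:
    //   apply_func<<<(n + 255) / 256, 256>>>(d_input, d_output, n);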
Threads are well suited to processing multidimensional data
– Image processing
– Solving PDEs on volumes

[Figure 3.2: An example of CUDA thread organization. A grid of blocks; Block (1, 1) is expanded to show its threads, e.g. Thread (0,0,0) … Thread (3,0,1). Courtesy: NVIDIA]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Memory Model Overview
• Global memory
  – Main means of communicating R/W data between host and device
  – Allocated and freed from the host:

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);
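A short usage sketch around these two calls, with error checking added; the 64 x 64 size is just an illustration:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int Width = 64;
        size_t size = Width * Width * sizeof(float);
        float* Md = NULL;

        // Allocate Width x Width floats in device global memory
        cudaError_t err = cudaMalloc((void**)&Md, size);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        // ... launch kernels that read and write Md ...

        cudaFree(Md);   // release the device allocation
        return 0;
    }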
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – memory data transfer
  – requires four parameters:
    pointer to destination, pointer to source, number of bytes copied, type of transfer
  – transfer types:
    Host to Host, Host to Device, Device to Host, Device to Device
• Asynchronous transfer
CUDA Host-Device Data Transfer (cont.)
• Code example:
  – Transfer a 64 * 64 single-precision float array
  – M is in host memory and Md is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
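The transfer calls this example refers to, sketched out with the M, Md, and size described above:

    int size = 64 * 64 * sizeof(float);

    // Host to device: copy M (host) into Md (device)
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

    // Device to host: copy Md (device) back into M (host)
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);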
Square Matrix Multiplication Example
• P = M * N, all of size WIDTH x WIDTH
• Without tiling:
  – One thread calculates one element of P
  – M and N are loaded WIDTH times from global memory

[Figure: matrices M, N, and P, each WIDTH x WIDTH]
Memory Layout of a Matrix in C
• C stores matrices in row-major order, so a 4 x 4 matrix M is laid out linearly as
  M0,0 M1,0 M2,0 M3,0  M0,1 M1,1 M2,1 M3,1  M0,2 M1,2 M2,2 M3,2  M0,3 M1,3 M2,3 M3,3
• Element Mx,y (column x, row y) is therefore accessed as M[y * Width + x]
Step 1: Matrix Multiplication, a Simple Host Version in C

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i)
        for (int j = 0; j < Width; ++j) {
            double sum = 0;
            for (int k = 0; k < Width; ++k) {
                double a = M[i * Width + k];
                double b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
}
Step 2: Input Matrix Data Transfer (Host-side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
    int size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;
    …
    1. // Allocate and load M, N into device memory
    cudaMalloc(&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc(&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
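Presumably the host function continues with allocating Pd, invoking the kernel, copying the result back, and freeing the device matrices; a sketch consistent with the code above:

    // Allocate P on the device
    cudaMalloc(&Pd, size);

    2. // Kernel invocation code (shown below)
    …

    3. // Read P back from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}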
The kernel itself: each thread computes one element of Pd, using a single thread block of Width x Width threads.

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Pvalue accumulates the Pd element computed by this thread (tx = threadIdx.x, ty = threadIdx.y)
    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        float Melement = Md[threadIdx.y * Width + k];
        float Nelement = Nd[k * Width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }
    Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
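A sketch of the matching invocation: one block of Width x Width threads, so this only works while Width x Width stays within the per-block thread limit:

    // One block; each of the Width x Width threads computes one Pd element
    dim3 dimGrid(1, 1);
    dim3 dimBlock(Width, Width);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);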
• Have each 2D thread block compute a (TILE_WIDTH)² tile of Pd
  – Each block has (TILE_WIDTH)² threads
• Generate a 2D grid of (WIDTH/TILE_WIDTH)² blocks

You still need to put a loop around the kernel call for cases where
WIDTH/TILE_WIDTH is greater than the max grid size (64K)!
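A sketch of that launch configuration for the multi-block kernel shown further below, assuming WIDTH is a multiple of TILE_WIDTH:

    #define TILE_WIDTH 16

    // 2D grid of (Width/TILE_WIDTH)^2 blocks, each TILE_WIDTH x TILE_WIDTH threads
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);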
Matrix Multiplication Using Multiple Blocks
• Break Pd up into tiles
• Each block calculates one tile
  – Each thread calculates one element
  – Block size equals tile size

[Figure: Md, Nd, and Pd, each WIDTH x WIDTH. Block (bx, by) computes the TILE_WIDTH x TILE_WIDTH sub-matrix Pdsub, and thread (tx, ty) within it computes one element. A small example shows a 2 x 2 grid of blocks, Block(0,0), Block(1,0), Block(0,1), Block(1,1), covering a 4 x 4 Nd and Pd.]
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    // Row and column of the Pd element computed by this thread
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k)
        Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
    Pd[Row * Width + Col] = Pvalue;
}
CUDA Thread Block
• All threads in a block execute the same kernel program (SPMD)
• Programmer declares the block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads
• Threads have thread id numbers within the block
  – The thread program uses the thread id to select work and address shared data
• Each block can execute in any order relative to other blocks

[Figure: a thread block drawn as thread ids 0, 1, 2, 3, … m, all running the same thread program]
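A small sketch of a thread program using its thread id to select work and to address shared data; the kernel name, BLOCK_SIZE, and the doubling operation are illustrative only:

    #define BLOCK_SIZE 256

    // Assumes a launch with BLOCK_SIZE threads per block and input/output of matching length
    __global__ void scale_block(const float* input, float* output)
    {
        __shared__ float tile[BLOCK_SIZE];                   // shared data, one slot per thread

        int tid = threadIdx.x;                               // id within the block
        int gid = blockIdx.x * blockDim.x + threadIdx.x;     // id within the whole grid

        tile[tid] = input[gid];    // each thread loads "its" element into shared memory
        __syncthreads();           // wait until the whole block has loaded

        output[gid] = 2.0f * tile[tid];
    }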
G80 Example: Executing Thread Blocks
t0 t1 t2 … tm SM 0 SM 1 t0 t1 t2 … tm
MT IU MT IU
Blocks
SP SP
– An implementation decision, … … …
not part of the CUDA
programming model
– Warps are scheduling units Streaming Multiprocessor
in SM Instruction L1
256/32 = 8 Warps
SP SP
– There are 8 * 3 = 24 Warps
[Figure: warp scheduling timeline. When a warp stalls (TB1 W1, then TB2 W1, then TB3 W2), the SM switches to another ready warp to hide the latency]
G80 Block Granularity Considerations
• For matrix multiplication with multiple blocks, should we use 8x8, 16x16, or 32x32 blocks?
  – For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! This leads to under-utilization (bad for latency hiding).
  – For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule.
  – For 32x32, we have 1024 threads per block. Not even one block fits into an SM!
Number of Threads
• Threads in a warp run in lock-step on the SM
  – if one thread is delayed by a memory load, all threads in its warp must wait
  – a new warp is scheduled in the meantime
  – so it is good to have more threads per block (more warps available to hide latency)
  – however, there is a limit on the number of threads per SM: 768, 1024, 1536, or 2048, depending on compute capability
  – this is a function of the maximum number of warps (24, 32, 48, or 64) of 32 threads each
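Rather than hard-coding these limits, they can be queried at runtime; a small sketch using standard CUDA runtime attributes:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int device = 0;
        int threadsPerSM = 0, warpSize = 0, threadsPerBlock = 0;

        // Per-SM and per-block limits of the current device
        cudaDeviceGetAttribute(&threadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, device);
        cudaDeviceGetAttribute(&warpSize, cudaDevAttrWarpSize, device);
        cudaDeviceGetAttribute(&threadsPerBlock, cudaDevAttrMaxThreadsPerBlock, device);

        printf("max threads per SM:    %d\n", threadsPerSM);
        printf("warp size:             %d\n", warpSize);
        printf("max warps per SM:      %d\n", threadsPerSM / warpSize);
        printf("max threads per block: %d\n", threadsPerBlock);
        return 0;
    }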
Number of Blocks
• The workload of threads is not always uniform
  – all threads of a block must complete before a new block can be scheduled in its place
  – if the slow thread is part of a large block, the idle time is high
  – so it is better to have smaller blocks
  – however, there is a limit on the number of blocks per SM (8 or fewer, depending on compute capability)
GPU Utilization
• The goal is to allocate as many threads per SM as the maximum limit allows
• This has to take into account:
  – the maximum number of blocks per SM
  – the warp granularity: the SM's warp limit should divide evenly by the warps per block, so no warp slots are left unused
• Whether 100% utilization can be reached therefore depends on the compute capability and on the number of threads per block
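A sketch of checking this with the CUDA occupancy API; the kernel and block size here are placeholders:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data) { }   // placeholder kernel

    int main()
    {
        int blockSize = 256;        // threads per block to evaluate
        int blocksPerSM = 0, threadsPerSM = 0;

        // How many blocks of this size can be resident per SM for this kernel?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, my_kernel, blockSize, 0);
        cudaDeviceGetAttribute(&threadsPerSM, cudaDevAttrMaxThreadsPerMultiProcessor, 0);

        float utilization = 100.0f * blocksPerSM * blockSize / threadsPerSM;
        printf("resident blocks per SM: %d, thread utilization: %.0f%%\n", blocksPerSM, utilization);
        return 0;
    }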
GPU Utilization
[Table from Shane Cook, "CUDA Programming": achievable utilization for various threads-per-block choices across compute capabilities]
Practical Example
• Histogram computation
  (example and figures from Shane Cook, "CUDA Programming")
GPU Algorithm 1
• Not overly fast
• Why?
  – each thread only fetches 1 byte
  – so a half warp fetches only 16 bytes
  – while the maximum supported memory transaction is 128 bytes
  – hence memory bandwidth is heavily underused
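A sketch consistent with that description (256-bin byte histogram, one thread per input byte, atomicAdd on global bins); this is an illustration, not Cook's listing:

    // Naive byte-per-thread histogram: each thread fetches one byte and does one global atomicAdd
    __global__ void histogram_naive(const unsigned char* data, unsigned int* bins, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            unsigned char value = data[i];     // a single-byte fetch per thread
            atomicAdd(&bins[value], 1u);       // heavily contended atomic on global memory
        }
    }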
GPU Algorithm 2
(code listing from Shane Cook, "CUDA Programming")
• In fact, it achieves zero speedup
  – no improvement in memory bandwidth
  – devices of higher compute capability already coalesce these accesses well
  – need to look for another culprit
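The bullets above suggest the second version targeted the fetch width; a plausible sketch, assuming it reads the input as 32-bit words so each thread handles 4 bytes per load (again an illustration, not Cook's code):

    // Word-fetching version: one 32-bit load per thread, still atomics on global bins
    __global__ void histogram_word(const unsigned int* data, unsigned int* bins, int n_words)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_words) {
            unsigned int w = data[i];                   // 4 input bytes in a single fetch
            atomicAdd(&bins[(w >>  0) & 0xFF], 1u);
            atomicAdd(&bins[(w >>  8) & 0xFF], 1u);
            atomicAdd(&bins[(w >> 16) & 0xFF], 1u);
            atomicAdd(&bins[(w >> 24) & 0xFF], 1u);     // contention on the global bins remains
        }
    }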
GPU Algorithm 3
(code listing from Shane Cook, "CUDA Programming")
• Results in a 6-fold speedup
• But we could still reduce global memory traffic
• What can we do?
[Performance figures for GPU Algorithm 3 from Shane Cook, "CUDA Programming"]
GPU Algorithm 3
• Not much growth in bandwidth after N = 32, due to other factors impeding growth (the atomic adds, in this case)
  (Shane Cook, "CUDA Programming")
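One common way to cut global memory traffic in a histogram, consistent with the role the atomic adds and the per-thread workload N play above, is to accumulate per-block counts in shared memory and let each thread process N input words before a single per-block merge into the global bins. A sketch under those assumptions (not Cook's listing):

    #define BINS 256

    // Per-block shared-memory histogram; each thread consumes N words of input,
    // so the BINS global atomic adds happen only once per block.
    __global__ void histogram_shared_n(const unsigned int* data, unsigned int* bins,
                                       int n_words, int N)
    {
        __shared__ unsigned int local_bins[BINS];
        for (int b = threadIdx.x; b < BINS; b += blockDim.x)
            local_bins[b] = 0;                                  // clear the per-block bins
        __syncthreads();

        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;                    // keep accesses coalesced
        for (int j = 0; j < N; ++j, idx += stride) {
            if (idx < n_words) {
                unsigned int w = data[idx];                     // 4 input bytes per fetch
                atomicAdd(&local_bins[(w >>  0) & 0xFF], 1u);   // cheap shared-memory atomics
                atomicAdd(&local_bins[(w >>  8) & 0xFF], 1u);
                atomicAdd(&local_bins[(w >> 16) & 0xFF], 1u);
                atomicAdd(&local_bins[(w >> 24) & 0xFF], 1u);
            }
        }
        __syncthreads();

        for (int b = threadIdx.x; b < BINS; b += blockDim.x)
            atomicAdd(&bins[b], local_bins[b]);                 // one global add per bin per block
    }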