
GPU Programming

Rupesh Nasre.

High-Performance Parallel Computing


June 2016
Outline

● Basics
  ● History and Motivation
  ● Simple Programs
  ● Thread Synchronization
● Optimizations
  ● GPU Memories
  ● Thread Divergence
  ● Memory Coalescing
  ● ...
● Case Studies
  ● Image Processing
  ● Graph Algorithms

Some images are taken from the NVIDIA CUDA Programming Guide.
GPU-CPU Performance Comparison

Source: Thorsten Thormählen


5
GPGPU: General Purpose Graphics Processing Unit
GPU Vendors
● NVIDIA
● AMD
● Intel
● QualComm
● ARM
● Broadcom
● Matrox Graphics
● Vivante
● Samsung
● ...
Earlier GPGPU Programming
GPGPU = General Purpose Graphics Processing Units.

(Graphics pipeline diagram: the application on the CPU feeds vertices (3D) to the GPU, which transforms and lights them into transformed, lit vertices (2D), assembles primitives into screen-space triangles (2D), rasterizes them into fragments (pre-pixels), and shades them into final pixels (color, depth). Video memory holds textures, with render-to-texture feeding results back, all driven by graphics state.)
Applications: Protein Folding, Stock Options Pricing, SQL Queries, MRI Reconstruction.

Required intimate knowledge of graphics API and GPU architecture.

Program complexity: Problems had to be expressed in terms of vertex coordinates, textures, and shader programs.

Random memory reads/writes not supported.

Lack of double precision support.
7
Kepler Configuration
Feature                | K80                                   | K40
# of SMX Units         | 26 (13 per GPU)                       | 15
# of CUDA Cores        | 4992 (2496 per GPU)                   | 2880
Memory Clock           | 2500 MHz                              | 3004 MHz
GPU Base Clock         | 560 MHz                               | 745 MHz
GPU Boost Support      | Yes – Dynamic                         | Yes – Static
GPU Boost Clocks       | 23 levels between 562 MHz and 875 MHz | 810 MHz, 875 MHz
Architecture features  | Dynamic Parallelism, Hyper-Q          | Dynamic Parallelism, Hyper-Q
Compute Capability     | 3.7                                   | 3.5
Wattage (TDP)          | 300W (plus Zero Power Idle)           | 235W
Onboard GDDR5 Memory   | 24 GB                                 | 12 GB

rn-gpu machine: /usr/local/cuda/NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQuery/deviceQuery

Homework: Find out the GPU type on the rn-gpu machine.


Configurations
In your login on rn-gpu, setup the environment:
$ export PATH=$PATH:/usr/local/cuda/bin:
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64:

You can also add the lines to .bashrc.

To create:
$ vi file.cu

To compile:
$ nvcc file.cu

This should create a.out in the current directory.

To execute:
$ ./a.out

9
GPU Configuration: Fermi
● Third Generation Streaming Multiprocessor (SM)
  ● 32 CUDA cores per SM, 4x over GT200
  ● 8x the peak double precision floating point performance over GT200
  ● Dual Warp Scheduler that simultaneously schedules and dispatches instructions from two independent warps
  ● 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
● Second Generation Parallel Thread Execution ISA
  ● Full C++ support
  ● Optimized for OpenCL and DirectCompute
  ● Full IEEE 754-2008 32-bit and 64-bit precision
  ● Full 32-bit integer path with 64-bit extensions
  ● Memory access instructions to support transition to 64-bit addressing
● Improved Memory Subsystem
  ● NVIDIA Parallel DataCache hierarchy with configurable L1 and unified L2 caches
  ● First GPU with ECC memory support
  ● Greatly improved atomic memory operation performance
● NVIDIA GigaThread Engine
  ● 10x faster application context switching
  ● Concurrent kernel execution
  ● Out-of-order thread block execution
  ● Dual overlapped memory transfer engines
CUDA, in a nutshell
● Compute Unified Device Architecture. It is a hardware and software architecture.
● Enables NVIDIA GPUs to execute programs written with C, C++, Fortran, OpenCL,
and other languages.
● A CUDA program calls parallel kernels. A kernel executes in parallel across a set of
parallel threads.
● The programmer or compiler organizes these threads in thread blocks and grids of
thread blocks.
● The GPU instantiates a kernel program on a grid of parallel thread blocks.
● Each thread within a thread block executes an instance of the kernel, and has a
thread ID within its thread block, program counter, registers, per-thread private
memory, inputs, and output results.
● A thread block is a set of concurrently executing threads that can cooperate among
themselves through barrier synchronization and shared memory.
● A grid is an array of thread blocks that execute the same kernel, read inputs from
global memory, and write results to global memory.
● Each thread has a per-thread private memory space used for register spills,
function calls, and C automatic array variables.
● Each thread block has a per-block shared memory space used for inter-thread
communication, data sharing, and result sharing in parallel algorithms.
11
Hello World.
#include <stdio.h>
int main() {
printf("Hello World.\n");
return 0;
}

Compile: nvcc hello.cu


Run: ./a.out
GPU Hello World.
#include <stdio.h>
#include <cuda.h>
__global__ void dkernel() {          // kernel
    printf("Hello World.\n");
}
int main() {
    dkernel<<<1, 1>>>();             // kernel launch
    return 0;
}

Compile: nvcc hello.cu
Run: ./a.out
-- No output. --
GPU Hello World.
Takeaway: the CPU function and the GPU kernel run asynchronously.

#include <stdio.h>
#include <cuda.h>
__global__ void dkernel() {
    printf("Hello World.\n");
}
int main() {
    dkernel<<<1, 1>>>();
    cudaThreadSynchronize();
    return 0;
}

Compile: nvcc hello.cu
Run: ./a.out
Hello World.
GPU Hello World in Parallel.
#include <stdio.h>
#include <cuda.h>
__global__ void dkernel() {
    printf("Hello World.\n");
}
int main() {
    dkernel<<<1, 32>>>();
    cudaThreadSynchronize();
    return 0;
}

Compile: nvcc hello.cu
Run: ./a.out
Hello World.
...
(Hello World. is printed 32 times.)
GPU Hello World with a Global.
Takeaway: CPU and GPU memories are separate (for discrete GPUs).

#include <stdio.h>
#include <cuda.h>
const char *msg = "Hello World.\n";
__global__ void dkernel() {
    printf(msg);
}
int main() {
    dkernel<<<1, 32>>>();
    cudaThreadSynchronize();
    return 0;
}

Compile: nvcc hello.cu
error: identifier "msg" is undefined in device code
Separate Memories
(Diagram: the CPU with its DRAM and the GPU with its DRAM, connected over the PCI Express bus.)

● The CPU and its associated (discrete) GPUs have separate physical memory (RAM).
● A variable in CPU memory cannot be accessed directly in a GPU kernel.
● A programmer needs to maintain copies of variables.
● It is the programmer's responsibility to keep them in sync.
Typical CUDA Program Flow
(Diagram: data flows from the file system to the CPU and between CPU and GPU memories.
 1: load data into CPU memory; 2: copy data from CPU to GPU memory; 3: execute GPU kernel;
 4: copy results from GPU to CPU memory; 5: use results on CPU.)
Typical CUDA Program Flow
1 Load data into CPU memory.
- fread / rand
2 Copy data from CPU to GPU memory.
- cudaMemcpy(..., cudaMemcpyHostToDevice)
3 Call GPU kernel.
- mykernel<<<x, y>>>(...)
4 Copy results from GPU to CPU memory.
- cudaMemcpy(..., cudaMemcpyDeviceToHost)
5 Use results on CPU.
Typical CUDA Program Flow

2 Copy data from CPU to GPU memory.
- cudaMemcpy(..., cudaMemcpyHostToDevice)

This means we need two copies of the same variable: one on the CPU and another on the GPU.
e.g., int *cpuarr, *gpuarr;
      Matrix cpumat, gpumat;
      Graph cpug, gpug;
CPU-GPU Communication
#include <stdio.h>
#include <string.h>
#include <cuda.h>
__global__ void dkernel(char *arr, int arrlen) {
    unsigned id = threadIdx.x;
    if (id < arrlen) {
        ++arr[id];    // each thread increments one character.
    }
}

int main() {
    char cpuarr[] = "Gdkkn\x1fVnqkc-",    // "Hello World." with every character decremented by one.
         *gpuarr;

    cudaMalloc(&gpuarr, sizeof(char) * (1 + strlen(cpuarr)));
    cudaMemcpy(gpuarr, cpuarr, sizeof(char) * (1 + strlen(cpuarr)), cudaMemcpyHostToDevice);
    dkernel<<<1, 32>>>(gpuarr, strlen(cpuarr));
    cudaThreadSynchronize();    // unnecessary.
    cudaMemcpy(cpuarr, gpuarr, sizeof(char) * (1 + strlen(cpuarr)), cudaMemcpyDeviceToHost);
    printf(cpuarr);

    return 0;
}
Classwork
1. Write a CUDA program to initialize an array of
size 32 to all zeros in parallel.
2. Change the array size to 1024.
3. Create another kernel that adds i to array[i].
4. Change the array size to 8000.
5. Check if answer to problem 3 still works.
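A minimal sketch for problems 1 and 3 (assuming a single thread-block of 32 threads; the kernel names are illustrative):

#include <stdio.h>
#include <cuda.h>

__global__ void init(unsigned *arr) {
    arr[threadIdx.x] = 0;                 // each thread zeroes one element.
}
__global__ void addindex(unsigned *arr) {
    arr[threadIdx.x] += threadIdx.x;      // each thread adds its own index.
}
int main() {
    const unsigned N = 32;
    unsigned *arr, harr[N];
    cudaMalloc(&arr, N * sizeof(unsigned));
    init<<<1, N>>>(arr);
    addindex<<<1, N>>>(arr);
    cudaMemcpy(harr, arr, N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned ii = 0; ii < N; ++ii) printf("%u ", harr[ii]);
    printf("\n");
    return 0;
}

For sizes such as 8000, a single block is not enough (a block holds at most 1024 threads); the kernels must be launched with multiple blocks, the index computed as blockIdx.x * blockDim.x + threadIdx.x, and a bounds check added.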

22
Thread Organization
● A kernel is launched as a grid of threads.
● A grid is a 3D array of thread-blocks (gridDim.x,
gridDim.y and gridDim.z).
● Thus, each block has blockIdx.x, .y, .z.
● A thread-block is a 3D array of threads
(blockDim.x, .y, .z).
● Thus, each thread has threadIdx.x, .y, .z.

23
Grids, Blocks, Threads
● Each thread uses IDs to decide what data to work on.
  ● Block ID: 1D, 2D, or 3D
  ● Thread ID: 1D, 2D, or 3D
● This simplifies memory addressing when processing multidimensional data:
  ● Image processing
  ● Solving PDEs on volumes
  ● ...
● Typical configuration:
  ● 1-5 blocks per SM
  ● 128-1024 threads per block
  ● Total 2K-100K threads
● You can launch a kernel with millions of threads.

(Diagram: a grid with 2x2 blocks on the GPU; a single thread highlighted within a 4x2x2 arrangement of threads.)
Accessing Dimensions
#include <stdio.h>
#include <cuda.h>
__global__ void dkernel() {
    if (threadIdx.x == 0 && blockIdx.x == 0 &&
        threadIdx.y == 0 && blockIdx.y == 0 &&
        threadIdx.z == 0 && blockIdx.z == 0) {
        printf("%d %d %d %d %d %d.\n", gridDim.x, gridDim.y, gridDim.z,
                                       blockDim.x, blockDim.y, blockDim.z);
    }
}
int main() {
    dim3 grid(2, 3, 4);
    dim3 block(5, 6, 7);
    dkernel<<<grid, block>>>();
    cudaThreadSynchronize();
    return 0;
}

How many times does the kernel's printf get executed if the condition is changed to if (threadIdx.x == 0)?

Number of threads launched = 2 * 3 * 4 * 5 * 6 * 7.
Number of threads in a thread-block = 5 * 6 * 7.
Number of thread-blocks in the grid = 2 * 3 * 4.
ThreadId in the x dimension is in [0..5).
BlockId in the y dimension is in [0..3).
2D
#include <stdio.h>
#include <cuda.h>
__global__ void dkernel(unsigned *matrix) {
    unsigned id = threadIdx.x * blockDim.y + threadIdx.y;
    matrix[id] = id;
}
#define N 5
#define M 6

int main() {
    dim3 block(N, M, 1);
    unsigned *matrix, *hmatrix;

    cudaMalloc(&matrix, N * M * sizeof(unsigned));
    hmatrix = (unsigned *)malloc(N * M * sizeof(unsigned));

    dkernel<<<1, block>>>(matrix);
    cudaMemcpy(hmatrix, matrix, N * M * sizeof(unsigned), cudaMemcpyDeviceToHost);

    for (unsigned ii = 0; ii < N; ++ii) {
        for (unsigned jj = 0; jj < M; ++jj) {
            printf("%2d ", hmatrix[ii * M + jj]);
        }
        printf("\n");
    }
    return 0;
}

$ a.out
 0  1  2  3  4  5
 6  7  8  9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
1D

Takeaway: one can perform computation on multi-dimensional data using a one-dimensional block.

#include <stdio.h>
#include <cuda.h>
__global__ void dkernel(unsigned *matrix) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    matrix[id] = id;
}
#define N 5
#define M 6
int main() {
    unsigned *matrix, *hmatrix;

    cudaMalloc(&matrix, N * M * sizeof(unsigned));
    hmatrix = (unsigned *)malloc(N * M * sizeof(unsigned));

    dkernel<<<N, M>>>(matrix);
    cudaMemcpy(hmatrix, matrix, N * M * sizeof(unsigned), cudaMemcpyDeviceToHost);

    for (unsigned ii = 0; ii < N; ++ii) {
        for (unsigned jj = 0; jj < M; ++jj) {
            printf("%2d ", hmatrix[ii * M + jj]);
        }
        printf("\n");
    }
    return 0;
}

Classwork: If I want the launch configuration to be <<<2, X>>>, what is X? The rest of the code should be intact.
Launch Configuration for Large Size
Classwork: Find two issues with this code.

#include <stdio.h>
#include <cuda.h>
__global__ void dkernel(unsigned *vector) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    vector[id] = id;                            // Issue 1: can access out-of-bounds.
}
#define BLOCKSIZE 1024
int main(int nn, char *str[]) {
    unsigned N = atoi(str[1]);
    unsigned *vector, *hvector;
    cudaMalloc(&vector, N * sizeof(unsigned));
    hvector = (unsigned *)malloc(N * sizeof(unsigned));

    unsigned nblocks = ceil(N / BLOCKSIZE);     // Issue 2: needs floating point division.
    printf("nblocks = %d\n", nblocks);

    dkernel<<<nblocks, BLOCKSIZE>>>(vector);
    cudaMemcpy(hvector, vector, N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned ii = 0; ii < N; ++ii) {
        printf("%4d ", hvector[ii]);
    }
    return 0;
}
Launch Configuration for Large Size
#include <stdio.h>
#include <cuda.h>
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < vectorsize) vector[id] = id;
}
#define BLOCKSIZE 1024
int main(int nn, char *str[]) {
    unsigned N = atoi(str[1]);
    unsigned *vector, *hvector;
    cudaMalloc(&vector, N * sizeof(unsigned));
    hvector = (unsigned *)malloc(N * sizeof(unsigned));

    unsigned nblocks = ceil((float)N / BLOCKSIZE);
    printf("nblocks = %d\n", nblocks);

    dkernel<<<nblocks, BLOCKSIZE>>>(vector, N);
    cudaMemcpy(hvector, vector, N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned ii = 0; ii < N; ++ii) {
        printf("%4d ", hvector[ii]);
    }
    return 0;
}
Classwork
● Read a sequence of integers from a file.
● Square each number.
● Read another sequence of integers from
another file.
● Cube each number.
● Sum the two sequences element-wise, store in
the third sequence.
● Print the computed sequence.
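A possible sketch for this exercise (assuming both files hold the same count of integers, passed as the first command-line argument; the file names in1.txt and in2.txt are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void squarecube(int *a, int *b, int *c, unsigned n) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) c[id] = a[id] * a[id] + b[id] * b[id] * b[id];   // square + cube, element-wise.
}
int main(int argc, char *argv[]) {
    unsigned n = atoi(argv[1]);
    int *ha = (int *)malloc(n * sizeof(int)), *hb = (int *)malloc(n * sizeof(int)),
        *hc = (int *)malloc(n * sizeof(int)), *da, *db, *dc;
    FILE *f1 = fopen("in1.txt", "r"), *f2 = fopen("in2.txt", "r");
    for (unsigned ii = 0; ii < n; ++ii) { fscanf(f1, "%d", &ha[ii]); fscanf(f2, "%d", &hb[ii]); }

    cudaMalloc(&da, n * sizeof(int)); cudaMalloc(&db, n * sizeof(int)); cudaMalloc(&dc, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);

    unsigned nblocks = (n + 1023) / 1024;
    squarecube<<<nblocks, 1024>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (unsigned ii = 0; ii < n; ++ii) printf("%d ", hc[ii]);
    printf("\n");
    return 0;
}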

30
CUDA Memory Model Overview
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all GPU threads
  – Long latency access
• We will focus on global memory for now.
  – There are also constant and texture memory.

(Diagram: a grid with Block (0, 0) and Block (1, 0); each block has its shared memory and per-thread registers for Thread (0, 0) and Thread (1, 0); the host and all threads access global memory.)
CUDA Function Declarations
Declaration                     | Executed on the | Only callable from the
__device__ float DeviceFunc()   | device          | device
__global__ void KernelFunc()    | device          | host
__host__ float HostFunc()       | host            | host

● __global__ defines a kernel. It must return void.
● A program may have several functions of each kind.
● The same function of any kind may be called multiple times.
● Host == CPU, Device == GPU.
Function Types (1/2)
#include <stdio.h>
#include <cuda.h>
__host__ __device__ void dhfun() {
printf("I can run on both CPU and GPU.\n");
}
__device__ unsigned dfun(unsigned *vector, unsigned vectorsize, unsigned id) {
if (id == 0) dhfun();
if (id < vectorsize) {
vector[id] = id;
return 1;
} else {
return 0;
}
}
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
dfun(vector, vectorsize, id);
}
__host__ void hostfun() {
printf("I am simply like another function running on CPU. Calling dhfun\n");
dhfun();
}
33
Function Types (2/2)
#define BLOCKSIZE 1024
int main(int nn, char *str[]) {
    unsigned N = atoi(str[1]);
    unsigned *vector, *hvector;
    cudaMalloc(&vector, N * sizeof(unsigned));
    hvector = (unsigned *)malloc(N * sizeof(unsigned));

    unsigned nblocks = ceil((float)N / BLOCKSIZE);
    printf("nblocks = %d\n", nblocks);

    dkernel<<<nblocks, BLOCKSIZE>>>(vector, N);
    cudaMemcpy(hvector, vector, N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned ii = 0; ii < N; ++ii) {
        printf("%4d ", hvector[ii]);
    }
    printf("\n");
    hostfun();
    dhfun();
    return 0;
}

(Call graph: on the CPU, main calls hostfun and dhfun, and hostfun calls dhfun; on the GPU, main launches dkernel, dkernel calls dfun, and dfun calls dhfun.)

Classwork: What are the other arrows possible in this diagram?
GPU Computation Hierarchy
● GPU: hundreds of thousands of threads
● Multi-processor: tens of thousands of threads
● Block: 1024 threads
● Warp: 32 threads
● Thread: 1
What is a Warp?

Source: Wikipedia
Warp
● A set of consecutive threads (currently 32) that execute in SIMD fashion.
● SIMD == Single Instruction Multiple Data.
● Warp-threads are fully synchronized: there is an implicit barrier after each step / instruction.
● Memory coalescing is closely related to warps.

Takeaway: It is a misconception that all threads in a GPU execute in lock-step. Lock-step execution is true only for threads within a warp.
Warp with Conditions
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;       // S0
    if (id % 2) vector[id] = id;                               // S1
    else vector[id] = vectorsize * vectorsize;                 // S2
    vector[id]++;                                              // S4
}

(Timeline for threads 0-7 of a warp: all execute S0; the odd threads execute S1 while the even threads execute a no-op; then the even threads execute S2 while the odd threads wait; finally all execute S4.)
Warp with Conditions
● When different warp-threads execute different instructions, the threads are said to diverge.
● The hardware executes the threads satisfying the same condition together, while the other threads execute a no-op.
● This adds sequentiality to the execution.
● This problem is termed thread-divergence.
Thread-Divergence
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    switch (id) {
    case 0: vector[id] = 0; break;
    case 1: vector[id] = vector[id]; break;
    case 2: vector[id] = vector[id - 2]; break;
    case 3: vector[id] = vector[id + 3]; break;
    case 4: vector[id] = 4 + 4 + vector[id]; break;
    case 5: vector[id] = 5 - vector[id]; break;
    case 6: vector[id] = vector[6]; break;
    case 7: vector[id] = 7 + 7; break;
    case 8: vector[id] = vector[id] + 8; break;
    case 9: vector[id] = vector[id] * 9; break;
    }
}
Thread-Divergence
● Since thread-divergence makes execution sequential, are conditions evil in kernel code?

    if (vectorsize < N) S1; else S2;    // Condition, but no divergence.

● Then, are conditions evaluating to different truth-values evil?

    if (id / 32) S1; else S2;           // Different truth-values, but no divergence within a warp.

Takeaway: Conditions are not bad; conditions evaluating to different truth-values are also not bad; conditions evaluating to different truth-values for warp-threads are bad.
Classwork
● Rewrite the following program fragment to remove thread-divergence.

    // assert(x == y || x == z);
    if (x == y) x = z;
    else x = y;
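One possible divergence-free rewrite (relying on the asserted precondition that x equals either y or z; overflow is ignored):

    // assert(x == y || x == z);
    x = y + z - x;    // if x == y this yields z; if x == z this yields y.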

42
Locality
● Locality is important for performance on GPUs
also.
● All threads in a thread-block access their L1
cache.
● This cache on Kepler is 64 KB.
● It can be configured as 48 KB L1 + 16 KB scratchpad
or 16 KB L1 + 48 KB scratchpad.
● To exploit spatial locality, consecutive threads
should access consecutive memory locations.
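The L1 / scratchpad split mentioned above can be requested per kernel. A small, hedged example (dkernel stands for whichever kernel is about to be launched):

    // Prefer the 48 KB scratchpad (shared memory) + 16 KB L1 configuration for this kernel.
    cudaFuncSetCacheConfig(dkernel, cudaFuncCachePreferShared);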
43
Matrix Squaring (version 1)
square<<<1, N>>>(matrix, result, N);    // N = 64

__global__ void square(unsigned *matrix, unsigned *result, unsigned matrixsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned jj = 0; jj < matrixsize; ++jj) {
        for (unsigned kk = 0; kk < matrixsize; ++kk) {
            result[id * matrixsize + jj] +=
                matrix[id * matrixsize + kk] *
                matrix[kk * matrixsize + jj];
        }
    }
}

CPU time = 1.527 ms, GPU v1 time = 6.391 ms
Matrix Squaring (version 2)
square<<<N, N>>>(matrix, result, N);    // N = 64

__global__ void square(unsigned *matrix, unsigned *result, unsigned matrixsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned ii = id / matrixsize;      // Homework: What if you interchange ii and jj?
    unsigned jj = id % matrixsize;
    for (unsigned kk = 0; kk < matrixsize; ++kk) {
        result[ii * matrixsize + jj] += matrix[ii * matrixsize + kk] *
                                        matrix[kk * matrixsize + jj];
    }
}

CPU time = 1.527 ms, GPU v1 time = 6.391 ms, GPU v2 time = 0.1 ms
Memory Coalescing
● If consecutive threads access words from the
same block of 32 words, their memory requests
are clubbed into one.
● That is, the memory requests are coalesced.
● This can be effectively achieved for regular
programs (such as dense matrix operations).
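A small illustration (hypothetical kernels, not from the slides): in the first kernel consecutive threads touch consecutive words, so a warp's 32 accesses coalesce into few transactions; in the second, each thread walks its own chunk, so a warp's accesses are scattered and cannot be combined.

// Coalesced: in each iteration, thread id accesses a[id], a[id + nthreads], ...
__global__ void coalesced(int *a, unsigned n, unsigned nthreads) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned ii = id; ii < n; ii += nthreads) a[ii]++;
}
// Uncoalesced: thread id accesses its own contiguous chunk a[id*chunk .. id*chunk+chunk-1].
__global__ void uncoalesced(int *a, unsigned n, unsigned chunk) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned ii = id * chunk; ii < (id + 1) * chunk && ii < n; ++ii) a[ii]++;
}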

46
Memory Coalescing

● On the CPU, each thread should access consecutive elements of a chunk (strided across threads); an Array of Structures (AoS) has better locality.
● On the GPU, a chunk should be accessed by consecutive threads (coalesced); a Structure of Arrays (SoA) has better performance.

Coalesced:   ... a[id] ...

Strided:     start = id * chunksize;
             end = start + chunksize;
             for (ii = start; ii < end; ++ii)
                 ... a[ii] ...

Random:      ... a[input[id]] ...
AoS versus SoA
// Array of Structures (AoS)
struct node {
    int a;
    double b;
    char c;
};
struct node allnodes[N];

// Structure of Arrays (SoA)
struct node {
    int alla[N];
    double allb[N];
    char allc[N];
};

AoS expectation: when a thread accesses an attribute of a node, it also accesses the other attributes of the same node. Better locality (on the CPU).
SoA expectation: when a thread accesses an attribute of a node, its neighboring thread accesses the same attribute of the next node. Better coalescing (on the GPU).
AoS versus SoA
// Array of Structures (AoS)
__global__ void dkernelaos(struct nodeAOS *allnodesAOS) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;

    allnodesAOS[id].a = id;
    allnodesAOS[id].b = 0.0;
    allnodesAOS[id].c = 'c';
}

// Structure of Arrays (SoA)
__global__ void dkernelsoa(int *a, double *b, char *c) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;

    a[id] = id;
    b[id] = 0.0;
    c[id] = 'd';
}

AoS time: 0.000058 seconds
SoA time: 0.000021 seconds
Let's Compute the Shortest Paths
● You are given an input graph of India, and you want to compute the shortest path from Nagpur to every other city.
● Assume that you are given a GPU graph library and the associated routines.
● Each thread operates on a node and settles the distances of its neighbors (Bellman-Ford style).

(Example graph with nodes a-g and edge weights 7, 3, 4, ...)

__global__ void dsssp(Graph g, unsigned *dist) {    // pseudo-code
    unsigned id = ...
    for each n in g.allneighbors(id) {
        unsigned altdist = dist[id] + weight(id, n);
        if (altdist < dist[n]) {
            dist[n] = altdist;
        }
    }
}

What is the error in this code?
Synchronization
● Atomics
● Barriers
● Control + data flow
● ...

51
atomics
● Atomics are primitive operations whose effects are visible either fully or not at all (never partially).
● Need hardware support.
● Several variants: atomicCAS, atomicMin,
atomicAdd, ...
● Work with both global and shared memory.

52
atomics
__global__ void dkernel(int *x) {
    ++x[0];
}
...
dkernel<<<1, 2>>>(x);

After dkernel completes, what is the value of x[0]?

++x[0] is equivalent to:
    Load x[0], R1
    Increment R1
    Store R1, x[0]

With two threads, the instructions may interleave:

    Thread 1            Thread 2
    Load x[0], R1
                        Load x[0], R2
    Increment R1
                        Increment R2
    Store R1, x[0]
                        Store R2, x[0]

The final value stored in x[0] could be 1 (rather than 2).

What if ++x[0] is split into more instructions? What if there are more threads?
atomics
__global__ void dkernel(int *x) {
    ++x[0];
}
...
dkernel<<<1, 2>>>(x);

● Ensure all-or-none behavior.
● e.g., atomicInc(&x[0], ...);

dkernel<<<K1, K2>>> would then ensure that x[0] is incremented by exactly K1 * K2, irrespective of the thread execution order.
Let's Compute the Shortest Paths
● You are given an input graph of India, and you want to compute the shortest path from Nagpur to every other city.
● Assume that you are given a GPU graph library and the associated routines.
● Each thread operates on a node and settles the distances of its neighbors (Bellman-Ford style).

__global__ void dsssp(Graph g, unsigned *dist) {    // pseudo-code
    unsigned id = ...
    for each n in g.allneighbors(id) {
        unsigned altdist = dist[id] + weight(id, n);
        atomicMin(&dist[n], altdist);
    }
}
Classwork
1. Compute sum of all elements of an array.
2. Find the maximum element in an array.
3. Each thread adds elements to a worklist.
● e.g., next set of nodes to be processed in SSSP.
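A minimal sketch for problem 1 using atomics (assuming the array and a single-element result are already in GPU memory, with the result initialized to zero):

__global__ void sum(int *a, unsigned n, int *result) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) atomicAdd(result, a[id]);    // all-or-none update of the shared accumulator.
}

The maximum (problem 2) can be computed analogously with atomicMax, and a worklist push (problem 3) with an atomicAdd on a shared index: unsigned pos = atomicAdd(&wlsize, 1); worklist[pos] = item;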

56
Barriers
● A barrier is a program point where all threads
need to reach before any thread can proceed.
● End of kernel is an implicit barrier for all GPU
threads (global barrier).
● There is no explicit global barrier supported in
CUDA.
● Threads in a thread-block can synchronize
using __syncthreads().
● How about barrier within warp-threads?
57
Barriers
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    vector[id] = id;                                          // S1
    __syncthreads();
    if (id < vectorsize - 1 && vector[id + 1] != id + 1)      // S2
        printf("syncthreads does not work.\n");
}

(Within a thread-block, every thread completes S1 before any thread starts S2.)
Barriers

● __syncthreads() is not only about control synchronization; it also provides data synchronization.
  ● It performs a memory fence operation.
  ● A memory fence ensures that the writes from a thread are made visible to other threads.
  ● There is also a separate __threadfence() instruction.
● A fence does not ensure that another thread will actually read the updated value.
  ● This can happen due to caching.
  ● The other thread needs to use volatile data.
Classwork
● Write a CUDA kernel to find maximum over a
set of elements, and then let thread 0 print the
value in the same kernel.
● Each thread is given work[id] amount of work.
Find average work per thread and if a thread's
work is above average + K, push extra work to
a worklist.
● This is useful for load-balancing.
● Also called work-donation.
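A sketch for the first exercise (assuming a single thread-block and a power-of-two number of elements, so a simple tree reduction works; the reduction is done in place, overwriting a[]):

__global__ void maxkernel(unsigned *a, unsigned n) {
    unsigned id = threadIdx.x;
    for (unsigned off = n / 2; off; off /= 2) {
        if (id < off && a[id + off] > a[id])
            a[id] = a[id + off];    // keep the larger of the pair.
        __syncthreads();            // all threads finish this step before the next.
    }
    if (id == 0) printf("max = %u\n", a[0]);    // thread 0 prints in the same kernel.
}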

60
Synchronization
● Atomics
● Barriers
● Control + data flow
● ...

Initially, flag == false.

    Thread 1:           Thread 2:
    while (!flag);      S2;
    S1;                 flag = true;

61
Reductions
● What are reductions?
● Computation properties required.
● Complexity measures

Input:  4 3 9 3 5 7 3 2    (n numbers)
         7  12  12  5      (barrier after each step)
          19      17       (log(n) steps)
Output:       36

62
Reductions
for (int off = n / 2; off; off /= 2) {
    if (threadIdx.x < off) {
        a[threadIdx.x] += a[threadIdx.x + off];
    }
    __syncthreads();
}

Input:  4 3 9 3 5 7 3 2    (n numbers)
         7  12  12  5      (barrier after each step)
          19      17       (log(n) steps)
Output:       36
Prefix Sum
● Imagine threads wanting to push work-items to
a central worklist.
● Each thread pushes a different number of work-items.
● This can be computed using atomics or a prefix sum (also called a scan).
Input: 4 3 9 3 5 7 3 2
Output: 4 7 16 19 24 31 33 35
OR
Output: 0 4 7 16 19 24 31 33
64
Prefix Sum
for (int off = 1; off < n; off *= 2) {
    if (threadIdx.x >= off) {
        a[threadIdx.x] += a[threadIdx.x - off];
    }
    __syncthreads();
}
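A sketch of using the scan result to push work-items without atomics (hypothetical names; assumes nitems[tid] has been exclusively scanned into offset[tid], as in the second output form on the previous slide):

    // Each thread writes its items at its reserved offset in the worklist.
    for (unsigned ii = 0; ii < nitems[threadIdx.x]; ++ii) {
        worklist[offset[threadIdx.x] + ii] = myitems[ii];
    }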

65
Shared Memory
● What is shared memory?
● How to declare Shared Memory?
● Combine with reductions.

__shared__ float a[N];
a[id] = id;
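A sketch combining shared memory with the earlier reduction (assuming one thread-block of BLOCKSIZE threads, with BLOCKSIZE a power of two; each thread first copies its element from global memory into the block's scratchpad, and the tree reduction then runs entirely in shared memory):

#define BLOCKSIZE 1024
__global__ void dsum(unsigned *vector, unsigned *result) {
    __shared__ unsigned part[BLOCKSIZE];           // per-block scratchpad.
    unsigned id = threadIdx.x;
    part[id] = vector[id];                         // global -> shared.
    __syncthreads();
    for (unsigned off = BLOCKSIZE / 2; off; off /= 2) {
        if (id < off) part[id] += part[id + off];  // reduce within shared memory.
        __syncthreads();
    }
    if (id == 0) *result = part[0];                // thread 0 writes the block's sum.
}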

66
Barrier-based Synchronization
● Disjoint accesses
● Overlapping accesses
● Benign overlaps

Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers

67
Barrier-based Synchronization
● Disjoint accesses
● Overlapping accesses
● Benign overlaps

Consider threads trying to own a set of elements:
● e.g., for owning cavities in Delaunay mesh refinement: non-atomic mark and check, with races resolved by a prioritized mark (AND).
● e.g., for inserting unique elements into a worklist: non-atomic mark and check, with races resolved (OR).

68
Barrier-based Synchronization
● Disjoint accesses
● Overlapping accesses
● Benign overlaps

Consider threads updating shared variables to the same value (a benign overlap):
● e.g., level-by-level breadth-first search
● can be implemented with atomics or without atomics.

69
Exploiting Algebraic Properties

● Monotonicity
● Idempotency
● Associativity

Consider threads updating distances in a shortest paths computation:
● An atomic-free update by two threads (t5 and t7) can cause the lost-update problem.
● The error is corrected by topology-driven processing, exploiting monotonicity (distances only decrease).

70
Exploiting Algebraic Properties

● Monotonicity
● Idempotency
● Associativity

Consider threads updating distances in a shortest paths computation:
● A node may be updated by multiple threads, leading to multiple instances of the node in the worklist.
● The same node is then processed by multiple threads; idempotency makes the repeated processing harmless.

71
Exploiting Algebraic Properties

● Monotonicity
● Idempotency
● Associativity

Consider threads pushing information to a node:
● Several threads (t1..t4) push values (x, y, z, v, m, n) to the same node, which should end up with all of them.
● Associativity helps push the information using a prefix-sum.
72
Scatter-Gather
Consider threads pushing elements into a worklist:
● atomic per element: O(e) atomics
● atomic per thread: O(t) atomics
● prefix-sum: O(log t) barriers
● scatter
● gather

73
Other Memories
● Texture
● Const
● Global
● Shared
● Cache
● Registers

74
Thrust
● Thrust is a parallel algorithms library (similar in
spirit to STL on CPU).
● Supports vectors and associated transforms.
● Programmer is oblivious to where code executes
– on CPU or GPU.
● Makes use of C++ features such as functors.

75
Thrust
thrust::host_vector<int> hnums(1024);
thrust::device_vector<int> dnums;

dnums = hnums;    // calls cudaMemcpy

// initialization.
thrust::device_vector<int> dnum2(hnums.begin(), hnums.end());

hnums = dnum2;    // array resizing happens automatically.

std::cout << dnums[3] << std::endl;

thrust::transform(dsrc.begin(), dsrc.end(), dsrc2.begin(),
                  ddst.begin(), addFunc);

76
Thrust Functions

● find(begin, end, value);
● find_if(begin, end, predicate);
● copy, copy_if.
● count, count_if.
● equal.
● min_element, max_element.
● merge, sort, reduce.
● transform.
● ...
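A small usage sketch of a few of these calls (illustrative values; thrust::plus is the standard binary functor shipped with the library):

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/count.h>
#include <thrust/reduce.h>
#include <thrust/extrema.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
    thrust::device_vector<int> d(4);
    d[0] = 3; d[1] = 1; d[2] = 3; d[3] = 2;
    thrust::sort(d.begin(), d.end());                                          // 1 2 3 3
    int threes = thrust::count(d.begin(), d.end(), 3);                         // 2
    int total  = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());   // 9
    int mx     = *thrust::max_element(d.begin(), d.end());                     // 3
    std::cout << threes << " " << total << " " << mx << std::endl;
    return 0;
}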
Thrust User-Defined Functors
// calculate result[] = (a * x[]) + y[]
struct saxpy {
    const float _a;
    saxpy(float a) : _a(a) { }

    __host__ __device__
    float operator()(const float &x, const float &y) const {
        return _a * x + y;
    }
};

thrust::device_vector<float> x, y, result;
// ... fill up x & y vectors ...
thrust::transform(x.begin(), x.end(), y.begin(),
                  result.begin(), saxpy(a));
78
Thrust on host versus device
● The same algorithm can be used on the CPU and the GPU.

int x, y;
thrust::host_vector<int> hvec;
thrust::device_vector<int> dvec;
// (thrust::reduce is a sum operation by default)
x = thrust::reduce(hvec.begin(), hvec.end());    // on CPU
y = thrust::reduce(dvec.begin(), dvec.end());    // on GPU

79
Challenges with GPU


● Warp-based execution
  ● Often requires sorting of work or an algorithm change.
● Data structure layout
  ● The best layout for the CPU differs from the best layout for the GPU.
● Separate memory space
  ● Slow transfers.
  ● Pack/unpack data.
● Incoherent L1 caches
  ● May need to explicitly push data out.
● Poor recursion support
  ● Need to make code iterative and maintain explicit iteration stacks.
● Thread and block counts
  ● Hierarchy complicates implementation.
  ● Optimal counts have to be (auto-)tuned.

80
General Optimization Principles
● Finding and exposing enough parallelism to populate
all the multiprocessors.
● Finding and exposing enough additional parallelism to
allow multithreading to keep the cores busy.
● Optimizing device memory accesses for contiguous
data.
● Utilizing the software data cache to store intermediate
results or to reorganize data.
● Reducing synchronization.

81
Other Optimizations
● Async CPU-GPU execution
● Dynamic Parallelism
● Multi-GPU execution
● Unified Memory

82
Bank Conflicts
● Programming guide.

83
Dynamic Parallelism
● Usage for graph algo.

84
Async CPU-GPU execution
● Overlapping communication and computation
● streams
● Overlapping two computations
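A minimal sketch of overlapping transfers with computation using streams (hypothetical buffer names, sizes, and kernel; truly asynchronous copies additionally require pinned host memory):

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    // Chunk 1 is copied and processed in stream s1 while chunk 2 is copied in stream s2.
    cudaMemcpyAsync(d1, h1, size, cudaMemcpyHostToDevice, s1);
    mykernel<<<nblocks, BLOCKSIZE, 0, s1>>>(d1);
    cudaMemcpyAsync(d2, h2, size, cudaMemcpyHostToDevice, s2);
    mykernel<<<nblocks, BLOCKSIZE, 0, s2>>>(d2);
    cudaDeviceSynchronize();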

85
Multi-GPU execution
● Peer-to-peer copying
● CPU as the driver

86
Unified Memory
● CPU-GPU memory coherence
● Show the problem first

87
Other Useful Topics
● Voting functions
● Occupancy
● Compilation flow and .ptx assembly

88
Voting Functions

89
Occupancy
● Necessity
● Pitfall and discussion

90
Compilation Flow
● Use shailesh's flow diagram
● .ptx example

91
Common Pitfalls and
Misunderstandings
● GPUs are only for graphics applications.
● GPUs are only for regular applications.
● On GPUs, all the threads need to execute the
same instruction at the same time.
● A CPU program when ported to GPU runs
faster.

92
GPU Programming

Rupesh Nasre.

High-Performance Parallel Computing


June 2016
