GPU Basics
Rupesh Nasre.
Graphics State
● Applications: protein folding, stock options pricing, SQL queries, MRI reconstruction.
● Required intimate knowledge of the graphics API and the GPU architecture.
● Program complexity: problems expressed in terms of vertex coordinates, textures and shader programs.
● Random memory reads/writes not supported.
● Lack of double-precision support.
Kepler Configuration
[Table: feature comparison of the Tesla K80 and K40; the details can be obtained on the rn-gpu machine by running /usr/local/cuda/NVIDIA_CUDA-6.5_Samples/1_Utilities/deviceQuery/deviceQuery]
To create:
$ vi file.cu
To compile:
$ nvcc file.cu
To execute:
$ ./a.out
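A minimal file.cu to exercise this create-compile-run cycle might look like the sketch below (the kernel name dkernel is illustrative, not from the slides):

#include <stdio.h>
#include <cuda.h>

// A trivial kernel: the single launched thread prints a message from the GPU.
__global__ void dkernel() {
    printf("Hello from the GPU.\n");
}

int main() {
    dkernel<<<1, 1>>>();         // launch 1 block containing 1 thread
    cudaDeviceSynchronize();     // wait for the kernel (and its printf) to finish
    return 0;
}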
GPU Configuration: Fermi
● Third-Generation Streaming Multiprocessor (SM)
● Improved Memory Subsystem

[Figure: CPU and GPU connected over the PCI Express bus; data is loaded from the file system into CPU memory (step 1), and results are copied from GPU memory back to CPU memory (step 4).]
Typical CUDA Program Flow
1. Load data into CPU memory.
   - fread / rand
2. Copy data from CPU to GPU memory.
   - cudaMemcpy(..., cudaMemcpyHostToDevice)
3. Call GPU kernel.
   - mykernel<<<x, y>>>(...)
4. Copy results from GPU to CPU memory.
   - cudaMemcpy(..., cudaMemcpyDeviceToHost)
5. Use results on CPU.
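A minimal end-to-end sketch of these five steps (the doubling kernel is only a stand-in workload; names such as dbl, harr and darr are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define N 1024

// Step 3's kernel: each thread doubles one element.
__global__ void dbl(unsigned *arr) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    arr[id] *= 2;
}

int main() {
    unsigned *harr, *darr;
    harr = (unsigned *)malloc(N * sizeof(unsigned));
    for (unsigned ii = 0; ii < N; ++ii) harr[ii] = rand() % 100;              // 1. load data into CPU memory
    cudaMalloc(&darr, N * sizeof(unsigned));
    cudaMemcpy(darr, harr, N * sizeof(unsigned), cudaMemcpyHostToDevice);     // 2. copy CPU -> GPU
    dbl<<<N / 256, 256>>>(darr);                                              // 3. call the GPU kernel
    cudaMemcpy(harr, darr, N * sizeof(unsigned), cudaMemcpyDeviceToHost);     // 4. copy GPU -> CPU
    printf("%u\n", harr[0]);                                                  // 5. use the results on CPU
    return 0;
}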
Typical CUDA Program Flow
int main() {
    char cpuarr[] = "Gdkkn\x1fVnqkc-",   // "Hello World." with every character shifted down by one
         *gpuarr;

    return 0;
}
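One way this fragment might be completed, decoding the string on the GPU (the kernel name decode and the shift-by-one decoding are assumptions suggested by the encoded string):

#include <stdio.h>
#include <cuda.h>

// Each thread shifts one character up by one, turning "Gdkkn\x1fVnqkc-" into "Hello World."
__global__ void decode(char *arr, unsigned len) {
    unsigned id = threadIdx.x;
    if (id < len) ++arr[id];
}

int main() {
    char cpuarr[] = "Gdkkn\x1fVnqkc-", *gpuarr;
    cudaMalloc(&gpuarr, sizeof(cpuarr));
    cudaMemcpy(gpuarr, cpuarr, sizeof(cpuarr), cudaMemcpyHostToDevice);
    decode<<<1, sizeof(cpuarr) - 1>>>(gpuarr, sizeof(cpuarr) - 1);   // one thread per character
    cudaMemcpy(cpuarr, gpuarr, sizeof(cpuarr), cudaMemcpyDeviceToHost);
    printf("%s\n", cpuarr);                                          // prints "Hello World."
    return 0;
}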
Classwork
1. Write a CUDA program to initialize an array of size 32 to all zeros in parallel.
2. Change the array size to 1024.
3. Create another kernel that adds i to array[i].
4. Change the array size to 8000.
5. Check if the answer to problem 3 still works.
Thread Organization
● A kernel is launched as a grid of threads.
● A grid is a 3D array of thread-blocks (gridDim.x, gridDim.y and gridDim.z).
● Thus, each block has blockIdx.x, .y, .z.
● A thread-block is a 3D array of threads (blockDim.x, .y, .z).
● Thus, each thread has threadIdx.x, .y, .z.
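A common pattern (not from the slides) for flattening these 3D indices into a single global thread id is sketched below; the kernel name whoami is illustrative:

#include <stdio.h>
#include <cuda.h>

__global__ void whoami() {
    // Flatten the 3D block index, then the 3D thread index within the block.
    unsigned blockId = blockIdx.x + blockIdx.y * gridDim.x
                     + blockIdx.z * gridDim.x * gridDim.y;
    unsigned localId = threadIdx.x + threadIdx.y * blockDim.x
                     + threadIdx.z * blockDim.x * blockDim.y;
    unsigned globalId = blockId * (blockDim.x * blockDim.y * blockDim.z) + localId;
    if (globalId == 0)
        printf("total threads = %u\n",
               gridDim.x * gridDim.y * gridDim.z * blockDim.x * blockDim.y * blockDim.z);
}

int main() {
    dim3 grid(2, 2, 2), block(4, 2, 2);
    whoami<<<grid, block>>>();      // 8 blocks of 16 threads = 128 threads
    cudaDeviceSynchronize();
    return 0;
}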
Grids, Blocks, Threads
● Each thread uses its IDs to decide what data to work on.
  – Block ID: 1D, 2D, or 3D
  – Thread ID: 1D, 2D, or 3D
● This simplifies memory addressing when processing multidimensional data:
  – Image processing
  – Solving PDEs on volumes
  – ...
● Typical configuration:
  – 1-5 blocks per SM
  – 128-1024 threads per block
  – Total 2K-100K threads
● You can launch a kernel with millions of threads.
[Figure: the CPU launches a grid with 2x2 blocks on the GPU; a single thread is highlighted inside a 4x2x2 block of threads.]
Accessing Dimensions
#include <stdio.h>
#include <cuda.h>

__global__ void dkernel() {
    if (threadIdx.x == 0 && blockIdx.x == 0 &&
        threadIdx.y == 0 && blockIdx.y == 0 &&
        threadIdx.z == 0 && blockIdx.z == 0) {
        printf("%d %d %d %d %d %d.\n", gridDim.x, gridDim.y, gridDim.z,
                                       blockDim.x, blockDim.y, blockDim.z);
    }
}
int main() {
    dim3 grid(2, 3, 4);
    dim3 block(5, 6, 7);
    dkernel<<<grid, block>>>();
    cudaThreadSynchronize();
    return 0;
}

How many times does the kernel's printf get executed if the condition is changed to if (threadIdx.x == 0)?

Number of threads launched = 2 * 3 * 4 * 5 * 6 * 7.
Number of threads in a thread-block = 5 * 6 * 7.
Number of thread-blocks in the grid = 2 * 3 * 4.
ThreadId in the x dimension is in [0..5).
BlockId in the y dimension is in [0..3).
2D
#include <stdio.h>
#include <cuda.h>

__global__ void dkernel(unsigned *matrix) {
    unsigned id = threadIdx.x * blockDim.y + threadIdx.y;
    matrix[id] = id;
}
#define N 5
#define M 6

int main() {
    dim3 block(N, M, 1);
    unsigned *matrix, *hmatrix;

    cudaMalloc(&matrix, N * M * sizeof(unsigned));
    hmatrix = (unsigned *)malloc(N * M * sizeof(unsigned));
    dkernel<<<1, block>>>(matrix);
    cudaMemcpy(hmatrix, matrix, N * M * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned ii = 0; ii < N; ++ii) {
        for (unsigned jj = 0; jj < M; ++jj)
            printf("%2d ", hmatrix[ii * M + jj]);
        printf("\n");
    }
    return 0;
}

$ a.out
 0  1  2  3  4  5
 6  7  8  9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29

A variant launches the same number of threads as N one-dimensional blocks of M threads each; the kernel then computes id = blockIdx.x * blockDim.x + threadIdx.x:
    dkernel<<<N, M>>>(matrix);
    cudaMemcpy(hmatrix, matrix, N * M * sizeof(unsigned), cudaMemcpyDeviceToHost);
Launch Configuration for Large Size
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < vectorsize) vector[id] = id;
}
#define BLOCKSIZE 1024

int main(int nn, char *str[]) {
    unsigned N = atoi(str[1]);
    unsigned *vector, *hvector;
    cudaMalloc(&vector, N * sizeof(unsigned));
    hvector = (unsigned *)malloc(N * sizeof(unsigned));

    unsigned nblocks = (N + BLOCKSIZE - 1) / BLOCKSIZE;   // ceil(N / BLOCKSIZE)
    dkernel<<<nblocks, BLOCKSIZE>>>(vector, N);
    cudaMemcpy(hvector, vector, N * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (unsigned ii = 0; ii < N; ++ii) {
        printf("%4u ", hvector[ii]);
    }
    return 0;
}
CUDA Memory Model Overview
● Global memory
  – Main means of communicating R/W data between host and device.
  – Contents visible to all GPU threads.
[Figure: a grid of blocks, each containing threads; all blocks read and write global memory.]
CUDA Function Declarations
Qualifier        Executed on the:    Only callable from the:
__device__       device              device
__global__       device              host
__host__         host                host
Function Types (1/2)
#include <stdio.h>
#include <cuda.h>

__host__ __device__ void dhfun() {
    printf("I can run on both CPU and GPU.\n");
}
__device__ unsigned dfun(unsigned *vector, unsigned vectorsize, unsigned id) {
    if (id == 0) dhfun();
    if (id < vectorsize) {
        vector[id] = id;
        return 1;
    } else {
        return 0;
    }
}
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    dfun(vector, vectorsize, id);
}
__host__ void hostfun() {
    printf("I am simply like another function running on CPU. Calling dhfun.\n");
    dhfun();
}
Function Types (2/2)
#define BLOCKSIZE 1024
int main(int nn, char *str[]) {
    unsigned N = atoi(str[1]);
    unsigned *vector, *hvector;
    cudaMalloc(&vector, N * sizeof(unsigned));
    hvector = (unsigned *)malloc(N * sizeof(unsigned));
    ... ...

[Figure: execution hierarchy — a thread is the basic unit; 32 threads form a warp; up to 1024 threads form a block; blocks run on a multi-processor; the GPU as a whole runs tens of thousands of threads.]
What is a Warp?
[Image illustrating a warp. Source: Wikipedia]
Warp
● A set of consecutive threads (currently 32) that execute in SIMD fashion.
● SIMD == Single Instruction Multiple Data.
● Warp-threads are fully synchronized. There is an implicit barrier after each step / instruction.
● Memory coalescing is closely related to warps.

Takeaway:
[Figure: eight warp-threads execute S0 together; four of them execute S1 while the others issue NOPs, then the other four execute S2; all eight reconverge at S4.]
Warp with Conditions
● When different warp-threads execute different instructions, the threads are said to diverge.
● The hardware executes the threads satisfying the same condition together, ensuring that the other threads execute a no-op.
● This adds sequentiality to the execution.
● This problem is termed thread-divergence.
[Figure: threads 0-7 of a warp all execute S0; one half executes S1 while the other half waits, then the other half executes S2; all threads reconverge at S4.]
Thread-Divergence
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    switch (id) {
        case 0: vector[id] = 0; break;
        case 1: vector[id] = vector[id]; break;
        case 2: vector[id] = vector[id - 2]; break;
        case 3: vector[id] = vector[id + 3]; break;
        case 4: vector[id] = 4 + 4 + vector[id]; break;
        case 5: vector[id] = 5 - vector[id]; break;
        case 6: vector[id] = vector[6]; break;
        case 7: vector[id] = 7 + 7; break;
        case 8: vector[id] = vector[id] + 8; break;
        case 9: vector[id] = vector[id] * 9; break;
    }
}
Thread-Divergence
● Since thread-divergence makes execution sequential, are conditions evil in kernel code?

if (vectorsize < N) S1; else S2;        // a condition, but no divergence

Takeaway: a condition by itself does not cause divergence; here all warp-threads evaluate the condition to the same value, so they all follow the same path.
Locality
● Locality is important for performance on GPUs also.
● All threads in a thread-block access their L1 cache.
● This cache on Kepler is 64 KB.
● It can be configured as 48 KB L1 + 16 KB scratchpad or 16 KB L1 + 48 KB scratchpad.
● To exploit spatial locality, consecutive threads should access consecutive memory locations.
Matrix Squaring (version 1)
square<<<1, N>>>(matrix, result, N);    // N = 64

__global__ void square(unsigned *matrix,
                       unsigned *result,
                       unsigned matrixsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned jj = 0; jj < matrixsize; ++jj) {
        for (unsigned kk = 0; kk < matrixsize; ++kk) {
            result[id * matrixsize + jj] +=
                matrix[id * matrixsize + kk] *
                matrix[kk * matrixsize + jj];
        }
    }
}
CPU time = 1.527 ms, GPU v1 time = 6.391 ms
Matrix Squaring (version 2)
square<<<N, N>>>(matrix, result, N);    // N = 64

__global__ void square(unsigned *matrix,
                       unsigned *result,
                       unsigned matrixsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned ii = id / matrixsize;      // Homework: what if you interchange ii and jj?
    unsigned jj = id % matrixsize;
    for (unsigned kk = 0; kk < matrixsize; ++kk) {
        result[ii * matrixsize + jj] += matrix[ii * matrixsize + kk] *
                                        matrix[kk * matrixsize + jj];
    }
}
CPU time = 1.527 ms, GPU v1 time = 6.391 ms, GPU v2 time = 0.1 ms
Memory Coalescing
● If consecutive threads access words from the same block of 32 words, their memory requests are clubbed into one.
● That is, the memory requests are coalesced.
● This can be effectively achieved for regular programs (such as dense matrix operations).
[Figure: three access patterns — coalesced, uncoalesced, coalesced.]
Memory Coalescing
● CPU: each thread should access consecutive elements of a chunk (strided). Array of Structures (AoS) has better locality.
● GPU: a chunk should be accessed by consecutive threads (coalesced). Structure of Arrays (SoA) has better performance.

Coalesced:   … a[id] ...
Strided:     start = id * chunksize;
             end = start + chunksize;
             for (ii = start; ii < end; ++ii)
                 … a[ii] ...
Random:      … a[input[id]] ...
AoS versus SoA
// Array of Structures (AoS)          // Structure of Arrays (SoA)
struct node {                         struct node {
    int    a;                             int    alla[N];
    double b;                             double allb[N];
    char   c;                             char   allc[N];
};                                    };
struct node allnodes[N];

Expectation with AoS: when a thread accesses an attribute of a node, it also accesses the other attributes of the same node. Better locality (on CPU).
Expectation with SoA: when a thread accesses an attribute of a node, its neighboring thread accesses the same attribute of the next node. Better coalescing (on GPU).
AoS versus SoA
__global__ void dkernelaos(struct node *allnodesAOS) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    allnodesAOS[id].a = id;
    allnodesAOS[id].b = 0.0;
    allnodesAOS[id].c = 'c';
}
__global__ void dkernelsoa(int *a, double *b, char *c) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    a[id] = id;
    b[id] = 0.0;
    c[id] = 'd';
}

AoS time: 0.000058 seconds
SoA time: 0.000021 seconds
Let's Compute the Shortest Paths
● You are given an input graph of India, and you want to compute the shortest paths.
[Figure: example graph with edge weights 7, 3 and 4.]
atomics
● Atomics are primitive operations whose effects are visible either fully or not at all (never partially).
● They need hardware support.
● Several variants: atomicCAS, atomicMin, atomicAdd, ...
● They work with both global and shared memory.
atomics
__global__ void dkernel(int *x) {
    ++x[0];
}
...
dkernel<<<1, 2>>>(x);

After dkernel completes, what is the value of x[0]?

++x[0] is equivalent to:
    Load x[0], R1
    Increment R1
    Store R1, x[0]

[Figure: the two threads interleave — thread 1 loads x[0] into R1 while thread 2 loads x[0] into R2; both increment; thread 2 stores R2 and then thread 1 stores R1, overwriting it, so one increment is lost.]
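A race-free variant of the kernel above, using atomicAdd (a sketch; the host-side code is unchanged):

#include <cuda.h>

// Drop-in replacement for dkernel: the read-modify-write on x[0] becomes one
// indivisible operation, so dkernel<<<1, 2>>>(x) reliably adds 2 to x[0].
__global__ void dkernel(int *x) {
    atomicAdd(&x[0], 1);
}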
Barriers
● A barrier is a program point that all threads need to reach before any thread can proceed.
● The end of a kernel is an implicit barrier for all GPU threads (a global barrier).
● There is no explicit global barrier supported in CUDA.
● Threads in a thread-block can synchronize using __syncthreads().
● How about a barrier within warp-threads?
Barriers
__global__ void dkernel(unsigned *vector, unsigned vectorsize) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    vector[id] = id;                                        // S1
    __syncthreads();
    if (id < vectorsize - 1 && vector[id + 1] != id + 1)    // S2
        printf("syncthreads does not work.\n");
}
[Figure: within each thread-block, every thread finishes S1 before any thread starts S2; different thread-blocks may execute at different times.]
Barriers
● __syncthreads() is not only about control synchronization; it also has a data synchronization mechanism.
● It performs a memory fence operation.
● A memory fence ensures that the writes from a thread become visible to the other threads before execution proceeds past the fence.
Classwork
● Write a CUDA kernel to find the maximum over a set of elements, and then let thread 0 print the value in the same kernel (a sketch follows below).
● Each thread is given work[id] amount of work. Find the average work per thread and, if a thread's work is above average + K, push the extra work to a worklist.
  – This is useful for load-balancing.
  – It is also called work-donation.
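One possible sketch for the first task, assuming a single-block launch (the name findMax, the use of atomicMax, and the host-side initialization of *max to 0 are assumptions, not from the slides):

#include <stdio.h>
#include <cuda.h>

// Launch as findMax<<<1, n>>>(data, n, max) with *max initialized to 0 on the host.
__global__ void findMax(unsigned *data, unsigned n, unsigned *max) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) atomicMax(max, data[id]);
    __syncthreads();             // within one block, all updates are done and visible here
    if (id == 0)
        printf("max = %u\n", *max);
}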
Synchronization
● Atomics
● Barriers
● Control + data flow
● ...

Initially, flag == false.

Thread T1:              Thread T2:
while (!flag) ;         S2;
S1;                     flag = true;

T2 sets flag only after completing S2, and T1 waits on flag before running S1, so S2 is guaranteed to execute before S1.
Reductions
● What are reductions?
● Computation properties required.
● Complexity measures.

Input:  4 3 9 3 5 7 3 2        (n numbers)
Step 1:  7   12   12   5
Step 2:    19        17        log(n) steps, with a barrier after each step
Output: 36
Reductions
for (int off = n / 2; off; off /= 2) {
    if (threadIdx.x < off) {
        a[threadIdx.x] += a[threadIdx.x + off];
    }
    __syncthreads();
}
Prefix Sum
● Imagine threads wanting to push work-items to a central worklist.
● Each thread pushes a different number of work-items.
● The positions can be computed using atomics or a prefix sum (also called a scan).

Input:  4 3 9 3 5 7 3 2
Output: 4 7 16 19 24 31 33 35      (inclusive scan)
OR
Output: 0 4 7 16 19 24 31 33       (exclusive scan)
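For the atomics alternative, each thread can reserve its slots by atomically advancing a shared tail index (a sketch; pushWork, tail and count are illustrative names):

#include <cuda.h>

// count[id] gives the number of work-items thread id wants to push.
__global__ void pushWork(unsigned *worklist, unsigned *tail, unsigned *count) {
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned start = atomicAdd(tail, count[id]);   // my private range: [start, start + count[id])
    for (unsigned ii = 0; ii < count[id]; ++ii)
        worklist[start + ii] = id;                 // push count[id] items tagged with my id
}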
Prefix Sum
for (int off = 1; off < n; off *= 2) {
    if (threadIdx.x >= off) {
        a[threadIdx.x] += a[threadIdx.x - off];
    }
    __syncthreads();
}
Shared Memory
● What is shared memory?
● How to declare shared memory?
● Combine with reductions.

__shared__ float a[N];
a[id] = id;
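A sketch that combines shared memory with the earlier reduction loop (the kernel name blockSum and the choice BLOCKSIZE = 256 are illustrative; the input length is assumed to be a multiple of BLOCKSIZE):

#include <cuda.h>

#define BLOCKSIZE 256

// Each block reduces its BLOCKSIZE input elements in shared memory and
// writes one partial sum per block.
__global__ void blockSum(unsigned *in, unsigned *blocksums) {
    __shared__ unsigned a[BLOCKSIZE];
    unsigned id = blockIdx.x * blockDim.x + threadIdx.x;
    a[threadIdx.x] = in[id];            // stage the data in shared memory
    __syncthreads();
    for (int off = blockDim.x / 2; off; off /= 2) {
        if (threadIdx.x < off)
            a[threadIdx.x] += a[threadIdx.x + off];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blocksums[blockIdx.x] = a[0];   // one partial sum per block
}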
Barrier-based Synchronization
● Disjoint accesses
● Overlapping accesses
● Benign overlaps

Examples:
● Threads pushing elements into a worklist.
● Threads trying to own a set of elements — an atomic per element.
● Threads updating shared variables to the same value, with or without atomics (e.g., level-by-level breadth-first search).
Exploiting Algebraic Properties
● Monotonicity
● Idempotency
● Associativity

Consider threads updating distances in a shortest-paths computation.
[Figure: several threads propose distance values 10, 7 and 5 for the same node over edges of weight 2 and 3.]
[Figure: threads t1-t8 process nodes a, b, c, d, p, q, r and push the same node z onto the worklist multiple times.]
Exploiting Algebraic Properties
● Monotonicity
● Idempotency
● Associativity

Consider threads pushing information to a node.
[Figure: threads t1-t4 push values x, y, (z, v) and (m, n) from nodes a, b, c, d to a common node z, which accumulates x, y, z, v, m, n; the update can be organized as a scatter or a gather.]
Other Memories
● Texture
● Const
● Global
● Shared
● Cache
● Registers
Thrust
● Thrust is a parallel algorithms library (similar in spirit to STL on CPU).
● Supports vectors and associated transforms.
● Programmer is oblivious to where code executes – on CPU or GPU.
● Makes use of C++ features such as functors.
Thrust
thrust::host_vector<int> hnums(1024);
thrust::device_vector<int> dnums;

dnums = hnums;       // calls cudaMemcpy

// initialization.
thrust::device_vector<int> dnum2(hnums.begin(), hnums.end());

hnums = dnum2;       // array resizing happens automatically.

std::cout << dnums[3] << std::endl;

thrust::transform(dsrc.begin(), dsrc.end(), dsrc2.begin(),
                  ddst.begin(), addFunc);
Thrust Functions
● find(begin, end, value);
● find_if(begin, end, predicate);
● copy, copy_if.
● count, count_if.
● equal.
● min_element, max_element.
● merge, sort, reduce.
● transform.
● ...
Thrust User-Defined Functors
// calculate result[] = (a * x[]) + y[]
struct saxpy {
    const float _a;
    saxpy(float a) : _a(a) { }

    __host__ __device__
    float operator()(const float &x, const float &y) const {
        return _a * x + y;
    }
};

thrust::device_vector<float> x, y, result;
// ... fill up x & y vectors ...
thrust::transform(x.begin(), x.end(), y.begin(),
                  result.begin(), saxpy(a));
Thrust on host versus device
● The same algorithm can be used on the CPU and the GPU.

int x, y;
thrust::host_vector<int> hvec;
thrust::device_vector<int> dvec;

// (thrust::reduce is a sum operation by default)
x = thrust::reduce(hvec.begin(), hvec.end());   // on CPU
y = thrust::reduce(dvec.begin(), dvec.end());   // on GPU
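A complete, compilable version of this idea (the vector size and contents are chosen arbitrarily for illustration):

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <iostream>

int main() {
    thrust::host_vector<int> hvec(1024, 1);              // 1024 ones on the host
    thrust::device_vector<int> dvec = hvec;              // copies the data to the GPU
    int x = thrust::reduce(hvec.begin(), hvec.end());    // sum on the CPU
    int y = thrust::reduce(dvec.begin(), dvec.end());    // sum on the GPU
    std::cout << x << " " << y << std::endl;             // prints "1024 1024"
    return 0;
}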
Challenges with GPU
● Warp-based execution
  – Often requires sorting of work or algorithm change.
● Data structure layout
  – The best layout for the CPU differs from the best layout for the GPU.
● Separate memory space
  – Slow transfers.
  – Pack/unpack data.
● Incoherent L1 caches
  – May need to explicitly push data out.
● Poor recursion support
  – Need to make code iterative and maintain explicit iteration stacks.
● Thread and block counts
  – Hierarchy complicates implementation.
  – Optimal counts have to be (auto-)tuned.
General Optimization Principles
● Finding and exposing enough parallelism to populate all the multiprocessors.
● Finding and exposing enough additional parallelism to allow multithreading to keep the cores busy.
● Optimizing device memory accesses for contiguous data.
● Utilizing the software data cache to store intermediate results or to reorganize data.
● Reducing synchronization.
Other Optimizations
● Async CPU-GPU execution
● Dynamic Parallelism
● Multi-GPU execution
● Unified Memory
Bank Conflicts
● See the programming guide.

Dynamic Parallelism
● Usage for graph algorithms.

Async CPU-GPU execution
● Overlapping communication and computation
● Streams
● Overlapping two computations

Multi-GPU execution
● Peer-to-peer copying
● CPU as the driver

Unified Memory
● CPU-GPU memory coherence
● Show the problem first

Other Useful Topics
● Voting functions
● Occupancy
● Compilation flow and .ptx assembly

Voting Functions

Occupancy
● Necessity
● Pitfall and discussion

Compilation Flow
● Use Shailesh's flow diagram
● .ptx example
Common Pitfalls and Misunderstandings
● GPUs are only for graphics applications.
● GPUs are only for regular applications.
● On GPUs, all the threads need to execute the same instruction at the same time.
● A CPU program when ported to GPU runs faster.
GPU Programming
Rupesh Nasre.