Class13
Shared memory
Fast, but subject to bank conflicts; limited in size; shared by the threads in a block
Registers
Fast; private to one thread
Local memory
For what doesn’t fit in registers; slow but cached; private to one thread
Variables
Access Penalty
Memory
Registers:
The fastest form of memory on the multiprocessor. Accessible only by the thread that
owns them. Has the lifetime of the thread.
Shared Memory:
Can be as fast as a register when there are no bank conflicts or when all threads read
the same address. Accessible by any thread of the block that created it.
Has the lifetime of the block.
Constant Memory:
Accessible by all threads. Lifetime of application. Fully cached, but limited.
Global memory:
Potentially 150x slower than register or shared memory; watch out for uncoalesced
reads and writes. Accessible from either the host or the device. Has the lifetime of the
application; that is, it persists between kernel launches.
Local memory:
A potential performance gotcha: it resides in global memory and can be 150x slower
than register or shared memory. Accessible only by the thread. Has the lifetime of
the thread.
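A minimal sketch of the shared-memory behaviour described above, assuming a single block of 64 threads. The names `reverse_tile`, `reverse_demo`, and `BLOCK` are made up for illustration and are not from the slides. Each thread writes one element to shared memory, the block synchronizes, and each thread then reads an element written by another thread of the same block. Consecutive threads touch consecutive words, so no bank conflicts are expected here.

```cuda
#include <cuda_runtime.h>

#define BLOCK 64  // hypothetical block size for this sketch

// Reverses one BLOCK-element tile through shared memory.
__global__ void reverse_tile(int *d)
{
    __shared__ int tile[BLOCK];  // one copy per block, visible to all its threads
    int t = threadIdx.x;         // automatic scalar: normally held in a register
    tile[t] = d[t];
    __syncthreads();             // all writes must finish before any thread reads
    d[t] = tile[BLOCK - 1 - t];  // read an element written by another thread
}

// Host helper: returns true if the device result is the reversed input.
bool reverse_demo()
{
    int h[BLOCK], *d = 0;
    for (int i = 0; i < BLOCK; ++i) h[i] = i;
    if (cudaMalloc(&d, sizeof(h)) != cudaSuccess) return false;
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse_tile<<<1, BLOCK>>>(d);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < BLOCK; ++i)
        if (h[i] != BLOCK - 1 - i) return false;
    return true;
}
```

Calling `reverse_demo()` from `main` and checking the return value is enough to try it; the shared array disappears when the block finishes, matching the block lifetime noted above.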
Where to declare variables?
Can the host access it?
yes: global, constant
no: register, local, shared
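A sketch of where each kind of variable is declared, under the usual CUDA rules: global and constant variables at file scope, the rest inside the kernel. All names here (`d_scale`, `c_offset`, `spaces`, `spaces_demo`) are hypothetical. Whether `scratch` really ends up in local memory is the compiler's decision; the comment marks it as a possibility only.

```cuda
#include <cuda_runtime.h>

__device__   float d_scale = 2.0f;  // global memory: declared outside any function
__constant__ float c_offset;        // constant memory: declared outside any function

__global__ void spaces(float *out)
{
    int i = threadIdx.x;            // automatic scalar: normally a register
    __shared__ float s[32];         // shared memory: declared inside the kernel
    float scratch[64];              // automatic array indexed at run time:
                                    // the compiler may place it in local memory
    for (int k = 0; k < 64; ++k) scratch[k] = (float)k;
    s[i] = scratch[(i * 7) % 64];
    __syncthreads();
    out[i] = d_scale * s[31 - i] + c_offset;
}

// Host helper: sets the constant from the host, runs one block of 32 threads,
// and returns true if the output matches the same arithmetic done on the CPU.
bool spaces_demo()
{
    const float h_offset = 0.5f;
    float h_out[32], *d_out = 0;
    if (cudaMemcpyToSymbol(c_offset, &h_offset, sizeof(h_offset)) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    spaces<<<1, 32>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int i = 0; i < 32; ++i) {
        float expected = 2.0f * (float)(((31 - i) * 7) % 64) + 0.5f;
        if (h_out[i] != expected) return false;
    }
    return true;
}
```

The host can reach `c_offset` with `cudaMemcpyToSymbol`, but it has no way to address `i`, `s`, or `scratch`, which is the yes/no split above.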
Global Memory: DRAM on card
Allocated with cudaMalloc and filled from the host with cudaMemcpy:
int *myDeviceMemory = 0;
cudaMalloc(&myDeviceMemory, 256 * sizeof(int));
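A minimal sketch of the allocate, copy in, copy out, free cycle for global memory, reusing the `myDeviceMemory` pointer from the snippet above. The helper name `global_round_trip` and the test data are assumptions made for this example.

```cuda
#include <cuda_runtime.h>

// Returns true if 256 ints survive a host -> device -> host round trip.
bool global_round_trip()
{
    const int n = 256;
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) { h_in[i] = i * i; h_out[i] = -1; }

    int *myDeviceMemory = 0;
    if (cudaMalloc(&myDeviceMemory, n * sizeof(int)) != cudaSuccess) return false;

    // Host -> device, then device -> host; both calls block the host.
    cudaError_t err = cudaMemcpy(myDeviceMemory, h_in, n * sizeof(int),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out, myDeviceMemory, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    cudaFree(myDeviceMemory);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) return false;
    return true;
}
```

The device buffer keeps its contents between the two copies without any kernel running, which reflects the application lifetime of global memory.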
Global Memory
A CUDA stream is a sequence of operations that execute on the device in the order
they are issued by the host. Operations in different streams can interleave.
When everything is issued on the default stream, the kernel call waits until the
cudaMemcpy is complete, and the increment kernel has to complete before the cudaMemcpy
to the host is started.
Alternatively, when the transfer and the kernel are issued asynchronously,
cpu_function waits for neither the memory transfer nor kernel_function to finish.
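The code for these two cases did not survive in the notes, so the following is only a sketch of what they could look like. It assumes pinned host memory (`cudaMallocHost`) and one extra stream; `stream_demo` and the body of `cpu_function` are invented for the example, and `increment` is written here as a plain add-one kernel.

```cuda
#include <cuda_runtime.h>

__global__ void increment(int *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1;
}

// Hypothetical stand-in for host work that does not touch the array.
static long cpu_function()
{
    long s = 0;
    for (int i = 0; i < 1000; ++i) s += i;
    return s;
}

// Returns true if both variants increment the array as expected.
bool stream_demo()
{
    const int n = 1024, threads = 256;
    const size_t bytes = n * sizeof(int);
    int *a = 0, *d_a = 0;
    cudaStream_t stream;
    if (cudaMallocHost((void **)&a, bytes) != cudaSuccess) return false;  // pinned
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) { cudaFreeHost(a); return false; }
    if (cudaStreamCreate(&stream) != cudaSuccess) { cudaFree(d_a); cudaFreeHost(a); return false; }
    for (int i = 0; i < n; ++i) a[i] = i;

    // Case 1: default stream. The kernel starts after the first copy is done,
    // and the second copy starts after the kernel is done.
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    increment<<<n / threads, threads>>>(d_a, n);
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);

    // Case 2: asynchronous calls on a non-default stream. The three operations
    // still run in order on the device, but the host continues immediately.
    cudaMemcpyAsync(d_a, a, bytes, cudaMemcpyHostToDevice, stream);
    increment<<<n / threads, threads, 0, stream>>>(d_a, n);
    cudaMemcpyAsync(a, d_a, bytes, cudaMemcpyDeviceToHost, stream);
    long s = cpu_function();                 // may overlap with the GPU work
    cudaError_t err = cudaStreamSynchronize(stream);  // wait before reading a[]

    bool ok = (err == cudaSuccess) && (s == 499500);
    for (int i = 0; i < n && ok; ++i) ok = (a[i] == i + 2);  // incremented twice
    cudaStreamDestroy(stream);
    cudaFree(d_a);
    cudaFreeHost(a);
    return ok;
}
```

The host must not read `a` in case 2 until `cudaStreamSynchronize` returns, since the copy back may still be in flight while `cpu_function` runs.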
Concurrent copy and kernel execution
Availability: check the deviceQueryDrv output:
“Concurrent copy and kernel execution: Yes with 1 copy engine(s)”
See http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
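The same information can also be read at run time. This is a small sketch that queries device 0 through the runtime API; the helper name `copy_engine_count` is hypothetical, and the printed line only imitates the deviceQueryDrv wording.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Returns the number of copy engines on device 0, or -1 if the query fails.
int copy_engine_count()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return -1;
    // asyncEngineCount > 0 means copies can overlap with kernel execution.
    printf("Concurrent copy and kernel execution: %s with %d copy engine(s)\n",
           prop.asyncEngineCount > 0 ? "Yes" : "No", prop.asyncEngineCount);
    return prop.asyncEngineCount;
}
```

A result of 1 suggests that one transfer direction at a time can overlap with a kernel; 2 would allow a copy in each direction.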
Coalesced Global Memory Access
Threads in a block are executed one warp (32 threads) at a time.
Each successive 128 bytes (32 single-precision words) can be accessed by a warp
together.
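To make the access pattern concrete, here is a sketch that contrasts a coalesced copy with a strided one. The kernel names, the helper `coalescing_demo`, and the stride of 32 are assumptions chosen for illustration. Both kernels give correct results; the difference is in how many 128-byte segments each warp touches, which this sketch does not measure.

```cuda
#include <cuda_runtime.h>

// Coalesced: thread k of a warp reads word k, so the 32 threads of a warp
// read 32 consecutive floats, i.e. one 128-byte span.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read words `stride` apart, so one warp
// touches many different 128-byte spans.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}

// Returns true if both kernels produce the values expected on the host.
bool coalescing_demo()
{
    const int n = 4096, threads = 128, stride = 32;
    const size_t bytes = n * sizeof(float);
    static float h_in[n], h_a[n], h_b[n];
    float *d_in = 0, *d_out = 0;
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    copy_coalesced<<<n / threads, threads>>>(d_in, d_out, n);
    cudaError_t e1 = cudaMemcpy(h_a, d_out, bytes, cudaMemcpyDeviceToHost);
    copy_strided<<<n / threads, threads>>>(d_in, d_out, n, stride);
    cudaError_t e2 = cudaMemcpy(h_b, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_a[i] != h_in[i] || h_b[i] != h_in[(i * stride) % n]) return false;
    return true;
}
```

Timing the two kernels on a larger array (for example with `cudaEvent` timers or a profiler) would be the natural next step to see the cost of the strided reads.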
Constant Memory
Best if all threads in a warp read the same constant data; otherwise access is slower.
deviceQueryDrv reports 65536 bytes (64 kB) of constant memory on my device.
Note: constant memory is available to all threads, like global memory; however, the
data is cached.
N = 2048
g = 9.81 m/s^2, put into constant memory
cudaMallocManaged(&data, N * sizeof(*data));  // size is in bytes, not elements
// … generate data
kernel<<<…>>>(data, N);
cudaDeviceSynchronize();
// Synchronize so the GPU finishes before the CPU does anything with the data
cudaFree(data);
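One possible complete version of this outline, as a sketch only. The notes do not say what the kernel computes, so the free-fall formula d = g t^2 / 2, the kernel name `fall_distance`, the helper `managed_demo`, the `float` element type, and the launch configuration are all assumptions. What comes from the notes is N = 2048, g in constant memory, managed allocation, and the synchronize before the CPU reads the data.

```cuda
#include <cuda_runtime.h>
#include <cmath>

#define N 2048

__constant__ float g;  // gravitational acceleration in m/s^2, set from the host

// Hypothetical kernel: replaces each time t (seconds) with the distance
// fallen from rest, d = g * t^2 / 2. Every thread reads the same constant.
__global__ void fall_distance(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 0.5f * g * data[i] * data[i];
}

// Returns true if the GPU results match the same formula evaluated on the CPU.
bool managed_demo()
{
    const float h_g = 9.81f;
    if (cudaMemcpyToSymbol(g, &h_g, sizeof(h_g)) != cudaSuccess) return false;

    float *data = 0;
    if (cudaMallocManaged(&data, N * sizeof(float)) != cudaSuccess) return false;

    // Generate data on the host: times from 0 s to about 2 s.
    for (int i = 0; i < N; ++i) data[i] = i * 0.001f;

    const int threads = 256;
    fall_distance<<<(N + threads - 1) / threads, threads>>>(data, N);

    // The GPU must finish before the CPU touches the managed data again.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; i < N && ok; ++i) {
        float t = i * 0.001f;
        float expected = 0.5f * h_g * t * t;
        ok = fabsf(data[i] - expected) <= 1e-4f * expected + 1e-6f;
    }
    cudaFree(data);
    return ok;
}
```

With managed memory the same pointer is used on both sides, so there is no explicit `cudaMemcpy`; the constant `g` still needs `cudaMemcpyToSymbol` because it lives in constant memory, not in the managed allocation.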
———————————————