Cheat Sheet CUDA

This document provides instructions for different teams to access code directories and run sample code using SSH. It also includes Linux commands for navigating directories and managing files. Finally, it outlines CUDA programming basics like variable qualifiers, built-in variables, memory allocation and transfer, kernel launching, and error handling.

Working with codes

Team 1-25
ssh -X user#@10.21.1.166
(replace # with your team number, e.g. user16@...)

cd codes
cd helloworld
make
./helloworld

Team 26-50
ssh -X [email protected]
password is guest123 (typing will not be visible)
ssh -X user#@192.168.1.211
(replace # with your team number, e.g. user32@...)
cd codes
cd helloworld
make
./helloworld

cd ..
cd helloworld_blocks
make
./helloworld_blocks
cd ..

Linux commands
ls - lists files in the current directory
mkdir name - creates a new directory name
cd name - changes the current directory to name
pwd - prints the current directory path
gedit filename & - opens the file filename in a text editor
nvcc filename.cu - compiles filename.cu and creates the binary executable a.out
nvcc filename.cu -o exefile - compiles filename.cu and creates the binary executable exefile
./a.out - executes the a.out binary
./exefile - executes the exefile binary
cp name1 name2 - copies file name1 to file name2
mv name1 name2 - renames file name1 to name2
rm name - permanently deletes the file name
rmdir dirname - deletes the empty directory dirname
rm -rf name - deletes the directory name and all its contents
rm na* - deletes all files whose names start with na
logout - ends the session

CUDA Cheat sheet


Function Qualifiers
__global__ - called from host, executed on device
__device__ - called from device, executed on device (always inlined when Compute Capability is 1.x)
__host__ - called from host, executed on host
__host__ __device__ - generates code for both host and device
__noinline__ - do not inline, if possible
__forceinline__ - force the compiler to inline
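A minimal sketch of how these qualifiers combine (the function and kernel names are illustrative, not from the course code):

```cuda
// __device__: callable only from device code
__device__ float square(float x) { return x * x; }

// __host__ __device__: compiled for both host and device
__host__ __device__ float twice(float x) { return 2.0f * x; }

// __global__: a kernel, launched from the host, executed on the device
__global__ void transform(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = twice(square(in[i]));
}
```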
Variable Qualifiers (Device)
__device__ - variable on device (Global Memory)
__constant__ - variable in Constant Memory
__shared__ - variable in Shared Memory
no qualifier - automatic variable; resides in a register, or in Local Memory in some cases (local arrays, register spilling)
Built-in Variables (Device)
dim3 gridDim - dimensions of the current grid (gridDim.x, .y, .z); the grid is composed of independent blocks
dim3 blockDim - dimensions of the current block (composed of threads); the total number of threads per block should be a multiple of the warp size
uint3 blockIdx - block location in the grid (blockIdx.x, .y, .z)
uint3 threadIdx - thread location in the block (threadIdx.x, .y, .z)
int warpSize - warp size in threads (instructions are issued per warp)
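A common use of these built-ins is computing a unique global index per thread; a 1-D sketch (kernel name is illustrative):

```cuda
__global__ void addOne(int *data, int n) {
    // global index = block offset + thread offset within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)        // guard: the grid may contain more threads than elements
        data[i] += 1;
}
```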
Shared Memory
Static allocation: __shared__ int a[128];
Dynamic allocation (size set at kernel launch): extern __shared__ float b[];
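A sketch of both allocation styles; for the dynamic case the byte size is passed as the third launch parameter (kernel names and sizes are illustrative):

```cuda
__global__ void staticShared(void) {
    __shared__ int a[128];        // size fixed at compile time
    a[threadIdx.x] = threadIdx.x;
    __syncthreads();              // wait until the whole block has written
}

__global__ void dynamicShared(void) {
    extern __shared__ float b[];  // size supplied at launch time
    b[threadIdx.x] = 0.0f;
    __syncthreads();
}

// Launch: the third <<<...>>> argument is the dynamic shared-memory size in bytes
// dynamicShared<<<1, 128, 128 * sizeof(float)>>>();
```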
Host / Device Memory
Allocate pinned/page-locked memory on host - cudaMallocHost(&ptr, size)
Allocate device memory - cudaMalloc(&devptr, size)
Free device memory - cudaFree(devptr)
Transfer memory - cudaMemcpy(dst, src, size, cudaMemcpyKind kind)
  kind = {cudaMemcpyHostToDevice, ...}
Non-blocking transfer - cudaMemcpyAsync(dst, src, size, kind[, stream])
  (host memory must be page-locked)
Copy to constant or global memory - cudaMemcpyToSymbol(symbol, src, size[, offset[, kind]])
  kind = {cudaMemcpyHostToDevice, cudaMemcpyDeviceToDevice}
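Putting the allocation and transfer calls together, a typical host-side round trip looks like this (error checking omitted for brevity; variable names are illustrative):

```cuda
#include <cstdlib>

int main(void) {
    const int n = 256;
    const size_t size = n * sizeof(float);

    float *h = (float *)malloc(size);  // pageable host buffer
    float *d = NULL;
    cudaMalloc(&d, size);              // device buffer

    cudaMemcpy(d, h, size, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels operating on d here ...
    cudaMemcpy(h, d, size, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d);
    free(h);
    return 0;
}
```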
Synchronizing
Synchronize threads within one block - __syncthreads() (device call)
Synchronize all blocks - cudaDeviceSynchronize() (host call, CUDA Runtime API)
Kernel
Kernel launch - kernel<<<dim3 blocks, dim3 threads[, ...]>>>(arguments)
CUDA Runtime API Error Handling
CUDA Runtime API error as string - cudaGetErrorString(cudaError_t err)
Last CUDA error produced by any runtime call - cudaGetLastError()
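These two calls are commonly wrapped in a checking macro; a minimal sketch (the macro name is illustrative, not part of the CUDA API):

```cuda
#include <cstdio>
#include <cstdlib>

// Wrap runtime calls: CHECK(cudaMalloc(&d, size));
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

// Kernel launches return no status; check the sticky error afterwards:
// kernel<<<blocks, threads>>>(args);
// CHECK(cudaGetLastError());
```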
CUDA Memory

Memory   | Location | Cached | Access | Scope  | Lifetime
Register | On-Chip  | N/A    | R/W    | Thread | Thread
Local    | Off-Chip | No*    | R/W    | Thread | Thread
Shared   | On-Chip  | N/A    | R/W    | Block  | Block
Global   | Off-Chip | No*    | R/W    | Global | Application
Constant | Off-Chip | Yes    | R      | Global | Application
Texture  | Off-Chip | Yes    | R      | Global | Application
Surface  | Off-Chip | Yes    | R/W    | Global | Application

*) Devices with compute capability 2.0 and higher use L1 and L2 caches.
