GPU Programming

● NVIDIA GPU hardware architecture
● CUDA Programming model

GPU vs CPU

● The GPU and the CPU exist because they are designed with different goals
  – The CPU is designed to execute a sequence of operations, called a thread, as fast as possible
    ● transistors are devoted to instruction control
  – The GPU is designed to execute thousands of threads in parallel
    ● transistors are devoted to data processing

GPU architecture

What is CUDA

● The CUDA parallel programming model is designed to overcome the challenge of transparently scaling application parallelism across GPUs with widely varying numbers of cores, while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

CUDA Programming model

● The CUDA programming model offers three key abstractions
  – Hierarchy of thread groups
  – Shared memories
  – Barrier synchronization
● These abstractions are exposed to the programmer as a minimal set of language extensions.

Thread Hierarchy

● The programmer can partition the problem into:
  – coarse sub-problems that can be solved independently in parallel by blocks of threads,
  – finer pieces within each sub-problem that can be solved cooperatively in parallel by all threads within a block
  – the grid is composed of all the thread blocks

Thread Hierarchy

● This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem
● At the same time it enables automatic scalability: each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors

What is CUDA

● CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C++ functions.

CUDA Programming model

● There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same streaming multiprocessor core and must share the limited memory resources of that core
  – On current GPUs, a thread block may contain up to 1024 threads
● The size of the grid depends on the data size

CUDA Programming model

● A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax (see the vector addition example at the end of this section)

CUDA built-in variables

● threadIdx → this variable contains the thread index within the block
● blockDim → this variable contains the number of threads per block
● blockIdx → this variable contains the block index within the grid

Dimensions of the block/grid (see the 2D example at the end of this section)

CUDA Programming model

Where to run your CUDA code?

● On your PC, if it has an NVIDIA GPU!
  – CUDA Installation Guide for Linux
● On the cloud
  – Google Colab: 3 types of NVIDIA GPU:
    ● T4
    ● A100
    ● L4
  – Kaggle
  – Amazon SageMaker Studio Lab

T4 GPU

● Number of SMs = 40
● Number of CUDA cores per SM = 64
● Total number of cores = 40 × 64 = 2560

References

● CUDA C++ Programming Guide
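Example: vector addition kernel

● A minimal sketch, not taken from the slides, illustrating the __global__ specifier, the <<<...>>> execution configuration, the built-in variables threadIdx, blockDim and blockIdx, and a grid size derived from the data size. The names vecAdd and n, the use of cudaMallocManaged, and the block size of 256 are illustrative choices, not part of the original material.

  #include <cstdio>
  #include <cuda_runtime.h>

  // Kernel: executed in parallel by N different CUDA threads
  __global__ void vecAdd(const float *a, const float *b, float *c, int n)
  {
      // Global thread index built from the built-in variables
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                      // guard: the grid may contain more threads than elements
          c[i] = a[i] + b[i];
  }

  int main()
  {
      const int n = 1 << 20;
      size_t bytes = n * sizeof(float);

      float *a, *b, *c;
      cudaMallocManaged(&a, bytes);   // unified memory keeps the example short
      cudaMallocManaged(&b, bytes);
      cudaMallocManaged(&c, bytes);
      for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

      // Execution configuration: up to 1024 threads per block on current GPUs;
      // the grid size depends on the data size (ceiling division)
      int threadsPerBlock = 256;
      int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
      vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);

      cudaDeviceSynchronize();        // wait for the kernel to finish
      printf("c[0] = %f\n", c[0]);    // expected: 3.0

      cudaFree(a); cudaFree(b); cudaFree(c);
      return 0;
  }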
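Example: two-dimensional blocks and grids

● A minimal sketch, not taken from the slides, showing how the dim3 type expresses the dimensions of the block and the grid for a 2D problem. The kernel name matAdd and the matrix size are illustrative choices.

  #include <cuda_runtime.h>

  // Each thread handles one matrix element, addressed by a 2D index
  __global__ void matAdd(const float *a, const float *b, float *c, int width, int height)
  {
      int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
      int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
      if (x < width && y < height)
          c[y * width + x] = a[y * width + x] + b[y * width + x];
  }

  int main()
  {
      const int width = 1024, height = 1024;
      size_t bytes = (size_t)width * height * sizeof(float);

      float *a, *b, *c;
      cudaMallocManaged(&a, bytes);
      cudaMallocManaged(&b, bytes);
      cudaMallocManaged(&c, bytes);

      // 2D execution configuration: 16 x 16 = 256 threads per block,
      // grid dimensions derived from the matrix size (ceiling division)
      dim3 threadsPerBlock(16, 16);
      dim3 blocksPerGrid((width  + threadsPerBlock.x - 1) / threadsPerBlock.x,
                         (height + threadsPerBlock.y - 1) / threadsPerBlock.y);
      matAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, width, height);
      cudaDeviceSynchronize();

      cudaFree(a); cudaFree(b); cudaFree(c);
      return 0;
  }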