CUDA Toolkit For Sysadmins
! What this talk will cover: The CUDA 5 Toolkit as a toolchain for HPC applications, focused on the needs of sysadmins and application packagers
! Review GPU Computing concepts
! CUDA C/C++ with nvcc compiler
! Example application build processes
! OpenACC compilers
! Common libraries
CPU vs GPU
Latency Processor + Throughput processor
Copyright NVIDIA Corporation
CPU
! Optimized for low-latency access to cached data sets ! Control logic for out-of-order and speculative execution
GPU
! Optimized for data-parallel, throughput computation ! Architecture tolerant of memory latency ! More transistors dedicated to computation
Processing Flow
PCIe Bus
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
[Figure: application code alternating between parallel and serial sections]
CUDA C
Standard C Code
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_serial(4096*256, 2.0, x, y);
Parallel C Code
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_parallel<<<4096,256>>>(n,2.0,x,y);
Libraries
Drop-in acceleration
OpenACC Directives
Compiler directives (like OpenMP)
Programming Languages
Most common: CUDA C. Also CUDA Fortran, PyCUDA, Matlab, ...
! Will hit OpenACC and common libraries at the end of the talk
CUDA Toolkit
! Free developer tools for building applications with CUDA C/C++ and the CUDA Runtime API
! Includes (on Linux):
  ! nvcc compiler
  ! Debugging and profiling tools
  ! Nsight Eclipse Edition IDE
  ! NVIDIA Visual Profiler
  ! A collection of libraries (CUBLAS, CUFFT, Thrust, etc.)
! Currently the most common tool for building NVIDIA GPU applications
Where's CUDA?
! Common to install CUDA somewhere other than /usr/local/cuda, so where is it?
! Common: specify location of the CUDA toolkit using an environment variable
  ! No convention on the name of this variable, though
  ! CUDA_HOME= is common
  ! Also CUDA=, CUDA_PATH=, NVIDIA_CUDA=, ...
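Because the variable name is not standardized, a build wrapper can try the common names in turn. The sketch below is one way to do that. The function name find_cuda_home and the search order are assumptions made for illustration, not a toolkit convention. It treats a directory as a toolkit install only if it contains an executable bin/nvcc.

```shell
#!/bin/sh
# Print the first directory that looks like a CUDA toolkit install,
# i.e. one that contains an executable bin/nvcc.
# Hypothetical helper: the name and the search order are assumptions.
find_cuda_home() {
    for dir in "$CUDA_HOME" "$CUDA_PATH" "$CUDA" "$NVIDIA_CUDA" /usr/local/cuda; do
        # Skip variables that are unset or empty.
        if [ -n "$dir" ] && [ -x "$dir/bin/nvcc" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example (not run here):
#   CUDA_HOME=$(find_cuda_home) || echo "CUDA toolkit not found" >&2
```

The function returns a non-zero status when nothing is found, so a calling script can stop early instead of failing later inside the compile step.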
NVCC Compiler
! Compiler for CUDA C/C++
! Uses the CUDA Runtime API
  ! Resulting binaries link to CUDA Runtime library, libcudart.so
! Takes a mix of host code and device code as input
  ! Uses g++ for host code
  ! Builds code for CPU and GPU architectures
  ! Generates a binary which combines both types of code
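One quick check that a binary was built against the CUDA Runtime is to look for libcudart.so in its dynamic dependencies. This is a minimal sketch, assuming a Linux system with ldd available. The helper name links_cudart is hypothetical. A binary that links the runtime statically would not show up this way.

```shell
#!/bin/sh
# Succeed if the given binary is dynamically linked against the CUDA
# Runtime library. links_cudart is a hypothetical helper name.
links_cudart() {
    ldd "$1" 2>/dev/null | grep -q 'libcudart\.so'
}

# Example (not run here):
#   links_cudart ./a.out && echo "a.out uses the CUDA Runtime"
```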
LIBRARIES
! Apply all device-level math optimizations
! Print GPU resources (shared memory, registers) used per kernel
! Most major MPIs now support addressing CUDA device memory directly
  ! Do MPI_Send/MPI_Recv with pointers to device memory; skip cudaMemcpy step in application code
  ! GPUDirect: do direct device-to-device transfers (skipping host memory)
! OpenMPI, mvapich2, Platform MPI, ... See NVIDIA DevZone for a full list
! Support typically has to be included at compile time
Example Builds
Example: matrixMul
! Part of the CUDA 5 Samples (distributed with CUDA Toolkit)
! Single CUDA source file containing host and device code
! Single compiler command using nvcc
$ nvcc -m64 -I../../common/inc matrixMul.cu
$ ./a.out
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Tesla M2070" with compute capability 2.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...done
...
Example: simpleMPI
! Part of the CUDA 5 Samples (distributed with CUDA Toolkit)
! Simple example combining CUDA with MPI
  ! Split and scatter an array of random numbers, do computation on GPUs, reduce on host node
! MPI and CUDA code separated into different source files, simpleMPI.cpp and simpleMPI.cu
! Works exactly like any other multi-file C++ build
  ! Build the CUDA object file, build the C++ object, link them together
$ make
nvcc -m64 -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -o simpleMPI.o -c simpleMPI.cu
(we'll explain the gencode bits later)
mpicxx -m64 -o main.o -c simpleMPI.cpp
mpicxx -m64 -o simpleMPI simpleMPI.o main.o -L$CUDA/lib64 -lcudart
Example: OpenMPI
! Popular MPI implementation
! Includes CUDA support for sending/receiving CUDA device pointers directly, without explicitly staging through host memory
  ! Either does implicit cudaMemcpy calls, or does direct transfers if GPUDirect is supported
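Since CUDA support is chosen when Open MPI is built (its configure script accepts a --with-cuda option), it is worth checking an existing install before relying on device pointers. The sketch below assumes an Open MPI release whose ompi_info reports the mpi_built_with_cuda_support parameter. The helper name has_cuda_support is hypothetical. It reads the ompi_info output from standard input so the check can be tried without an MPI install.

```shell
#!/bin/sh
# Read `ompi_info --parsable --all` output on stdin and succeed if the
# build reports CUDA support. has_cuda_support is a hypothetical name.
has_cuda_support() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

# Example (not run here):
#   ompi_info --parsable --all | has_cuda_support && echo "CUDA-aware Open MPI"
```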
Example: GROMACS
! Popular molecular dynamics application with CUDA support (mostly simulating biomolecules)
! Version 4.5: CUDA support via OpenMM library, only single-GPU support
! Version 4.6: CUDA supported directly, multi-GPU support
! Requires Compute Capability >= 2.0 (Fermi or Kepler)
Example: GROMACS
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-4.6.tar.gz
tar xzf gromacs-4.6.tar.gz
mkdir gromacs-build
module load cmake cuda gcc/4.6.3 fftw openmpi
CC=mpicc CXX=mpiCC cmake ./gromacs-4.6 -DGMX_OPENMP=ON -DGMX_GPU=ON -DGMX_MPI=ON -DGMX_PREFER_STATIC_LIBS=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=./gromacs-build
make install
! Compiler produces a fat binary which includes all three types of code (host object code, PTX assembly, and device object code)
! Breaking changes in both NVIDIA object code and in PTX assembly can occur with each new GPU release
! PTX is forward-compatible, object code is not
Fat binaries
! When a CUDA fat binary is run on a given GPU, a few different things can happen:
  ! If the fat binary includes object code compiled for the device architecture, that code is run directly.
  ! If the fat binary includes PTX assembly which the GPU understands, that code is Just-In-Time compiled and run on the GPU (this results in a slight startup lag).
  ! If neither is compatible with the GPU, the application doesn't run.
! Always uses the correct object code, or the newest compatible PTX
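To see which of these cases applies on a given node, it helps to list what a fat binary actually contains. The sketch below assumes a toolkit whose cuobjdump supports the --list-elf and --list-ptx options (check cuobjdump --help on your install). It also assumes the listed file names carry the target as sm_XX. The helper name embedded_targets is hypothetical, and it only filters text, so the listing can come from any binary.

```shell
#!/bin/sh
# Read a cuobjdump listing on stdin and print the distinct GPU targets
# (sm_XX) it mentions, one per line.
# embedded_targets is a hypothetical helper name.
embedded_targets() {
    grep -o 'sm_[0-9][0-9]*' | sort -u
}

# Example (not run here):
#   cuobjdump --list-elf ./a.out | embedded_targets   # embedded object code
#   cuobjdump --list-ptx ./a.out | embedded_targets   # embedded PTX
```

If a node's architecture is missing from the object code list and no embedded PTX is old enough for it, the binary will not run there.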
Why do we care?
! A given CUDA binary is not guaranteed to run on an arbitrary GPU ! And if it does run, not guaranteed to get best performance
! JIT startup time ! Your GPU may support newer PTX or object code features than are compiled in
! Mix of hardware you have in your cluster determines what options to include in your fat binaries
[Figure: nvcc compilation flow. Labels: Input Files, nvcc, gcc, nvopencc, ptxas, fatbinary, Device code, PTX (device assembly), Host object code, Combined object code]
! nvopencc generates PTX assembly according to the compute capability
! ptxas generates device binaries according to the device architecture
! fatbinary packages them together
! GPU Architecture
  ! Binary code is architecture-specific, and changes with each GPU generation
  ! Version of the object code
  ! Different architectures use different optimizations, etc.
! You can generate multiple versions of both the PTX and the object code to be included.
nvcc -m64 -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 -o simpleMPI.o -c simpleMPI.cu
--gpu-code <gpu> (short form: -code)
--generate-code (short form: -gencode)
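Since the right set of -gencode options depends on the GPUs in the cluster, it can be convenient to generate them from a list of compute capabilities. The sketch below is one possible convention, not an nvcc feature. The helper name gencode_flags is hypothetical. It emits object code for every listed architecture, plus PTX for the newest one so that later GPUs can still JIT-compile it.

```shell
#!/bin/sh
# Build nvcc -gencode options from compute capabilities written without
# the dot (20 = 2.0, 35 = 3.5). gencode_flags is a hypothetical helper.
# The body runs in a subshell so its variables stay local.
gencode_flags() (
    flags=""
    newest=0
    for cc in "$@"; do
        # Object code for this specific architecture.
        flags="$flags -gencode arch=compute_${cc},code=sm_${cc}"
        if [ "$cc" -gt "$newest" ]; then
            newest=$cc
        fi
    done
    if [ "$newest" -gt 0 ]; then
        # PTX for the newest architecture, for forward compatibility.
        flags="$flags -gencode arch=compute_${newest},code=compute_${newest}"
    fi
    printf '%s\n' "${flags# }"
)

# Example (not run here):
#   nvcc -m64 $(gencode_flags 20 30 35) -o simpleMPI.o -c simpleMPI.cu
```

The command substitution in the example is left unquoted on purpose so the shell splits the result into separate arguments.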
GROMACS revisited
! Default flags in GROMACS:
CUDA_NVCC_FLAGS=-gencode;arch=compute_20,code=sm_20;-gencode;arch=compute_20,code=sm_21;-gencode;arch=compute_30,code=sm_30;-gencode;arch=compute_30,code=compute_30;-use_fast_math;
! Generates object code for compute capability 2.0 (Tesla M2050/M2070), 2.1 (Quadro 600, various GeForce) and 3.0 (Tesla K10), plus PTX for compute capability 3.0
! To generate optimized code for Tesla K20, you'd add compute capability 3.5: -gencode arch=compute_35,code=sm_35
! Common pattern: build shared library containing all CUDA code, link to it from your larger application
NVIDIA cuBLAS
NVIDIA cuRAND
NVIDIA cuSPARSE
NVIDIA NPP
NVIDIA cuFFT
IMSL Library
Building-block Algorithms
OpenACC Directives
OpenACC
! Useful way to quickly add CUDA support to a program without writing CUDA code directly, especially for legacy apps
! Uses compiler directives very similar to OpenMP
! Supports C and Fortran
! Generally doesn't produce code as fast as a good CUDA programmer would, but often gives decent speedups
! Cross-platform; depending on compiler, supports NVIDIA, AMD, Intel accelerators
! Compiler support:
  ! Cray 8.0+
  ! PGI 12.6+
  ! CAPS HMPP 3.2.1+
! http://developer.nvidia.com/openacc
OpenACC
$ pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
PGC-W-0095-Type cast required for this conversion (saxpy.c: 13)
PGC-W-0155-Pointer value created from a nonlong integral type (saxpy.c: 13)
saxpy:
      4, Generating present_or_copyin(x[0:n])
         Generating present_or_copy(y[0:n])
         Generating NVIDIA code
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
      5, Loop is parallelizable
         Accelerator kernel generated
          5, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
PGC/x86-64 Linux 13.2-0: compilation completed with warnings
OpenACC
! PGI compiler generates:
  ! Object code for currently-installed GPU, if supported (auto-detect)
  ! PTX assembly for all major versions (1.0, 2.0, 3.0)
! Depending on the compiler, there may or may not be an OpenACC->CUDA C translation step before compile (but this intermediate code is usually not accessible)
CUDA Fortran
! Slightly-modified Fortran language which uses the CUDA Runtime API
! Almost 1:1 translation of CUDA C concepts to Fortran 90
! Changes mostly to conform to Fortran idioms (Fortranic?)
! Currently supported only by PGI Fortran compiler
! pgfortran acts like nvcc for Fortran with either the -Mcuda option, or if you use the file extension .cuf
! Compiles to CUDA C as intermediate. Can keep C code with option -Mcuda=keepgpu
Other Resources
! CUDA Toolkit Documentation: http://docs.nvidia.com
! OpenACC: http://www.openacc.org/
! CUDA Fortran @ PGI: http://www.pgroup.com/resources/cudafortran.htm
! GPU Applications Catalog (list of known common apps with GPU support): http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf
! Email me! Adam DeConinck, [email protected]
! Many other resources are available via the CUDA Registered Developer program: https://developer.nvidia.com/nvidia-registered-developer-program
Questions?
ISV Applications
! ...or maybe you don't have to build the application at all, if you are using an ISV application that is distributed as a binary.
! Important to be careful about libraries for pre-compiled packages, especially the CUDA Runtime:
  ! Many applications distribute a particular libcudart.so
  ! Dependent on that particular version, may break with later versions
  ! Apps don't always link to it intelligently; be careful with your modules!
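Before installing a pre-compiled package into a module tree, it can help to find out which copies of the CUDA Runtime it ships. The sketch below simply searches an install directory for libcudart.so files and symlinks. The helper name bundled_cudart and the example path are hypothetical.

```shell
#!/bin/sh
# List every copy of the CUDA Runtime library (files and symlinks)
# under an application's install tree, sorted by path.
# bundled_cudart is a hypothetical helper name.
bundled_cudart() {
    find "$1" -name 'libcudart.so*' \( -type f -o -type l \) 2>/dev/null | sort
}

# Example (not run here; /opt/isv-app is a made-up path):
#   bundled_cudart /opt/isv-app
#   ldd /opt/isv-app/bin/solver | grep cudart   # which copy the loader picks
```

Comparing the two outputs shows whether the application resolves its own bundled libcudart.so or one pulled in from a loaded CUDA module.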