NVIDIA CUDA
What is CUDA?
CUDA (an acronym for Compute Unified Device Architecture) is NVIDIA's parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit).
Background
Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA invented the CUDA parallel computing architecture that is now shipping in GeForce, ION, Quadro, and Tesla GPUs, representing a significant installed base for application developers. In the consumer market, nearly every major consumer video application has been, or will soon be, accelerated by CUDA, including products from Elemental Technologies, MotionDSP and LoiLo, Inc.
The latest CUDA architecture, Fermi, has eight times the peak double-precision floating-point performance of NVIDIA's previous-generation Tesla GPUs. It also introduced several new features, including:
Up to 512 CUDA cores and 3.0 billion transistors
NVIDIA Parallel DataCache technology
NVIDIA GigaThread engine
ECC memory support
Native support for Visual Studio
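Many of these properties can be inspected at run time through the CUDA runtime API. A minimal sketch, assuming a CUDA-capable device 0 (the choice of printed fields is illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    // Query the properties of device 0.
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("ECC enabled: %d\n", prop.ECCEnabled);
    return 0;
}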
Overview
Example of the CUDA processing flow:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to begin processing.
3. The GPU executes in parallel on each core.
4. Copy the result from GPU memory back to main memory.
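A minimal sketch of these four steps using the CUDA runtime API (the kernel name scale, the size N, and the launch configuration are illustrative choices, not from the original):

#include <cuda_runtime.h>

#define N 1024

// Step 3: each thread scales one element in parallel on its own core.
__global__ void scale(float *d, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) d[i] *= factor;
}

int main(void) {
    float h[N], *d;
    for (int i = 0; i < N; ++i) h[i] = (float)i;

    cudaMalloc((void **)&d, N * sizeof(float));
    // Step 1: copy data from main memory to GPU memory.
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    // Step 2: the CPU instructs the GPU to launch the kernel.
    scale<<<(N + 255) / 256, 256>>>(d, 2.0f);
    // Step 4: copy the result from GPU memory back to main memory.
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}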
CUDA has been applied in areas such as:
The Search for Extra-Terrestrial Intelligence (SETI)
Accelerated rendering of 3D graphics
Real-time cloth simulation
Distributed calculations, such as predicting the native conformation of proteins
Medical analysis simulations, for example virtual reality based on CT and MRI scan images
Physical simulations, particularly in fluid dynamics
Environmental statistics
Accelerated encryption, decryption and compression
Accelerated interconversion of video file formats
Threads and blocks have IDs, so each thread can decide what data to work on:
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data, such as image processing and solving PDEs on volumes (see the sketch after the figure below).
Figure: a device grid of thread blocks. Grid 1 contains Block (0,0) through Block (2,1); Block (1,1) is expanded to show its 5x3 array of threads, Thread (0,0) through Thread (4,2). (Courtesy: NVIDIA)
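As a sketch of how a thread turns these IDs into a memory address for 2D data (the kernel name invert and the row-major image layout are illustrative assumptions):

// Each thread computes its own (x, y) pixel coordinate from its
// block ID and thread ID, then addresses the image in row-major order.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}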
Figure: CUDA memory model. Each thread has its own local memory; the host communicates with memory on the device.
Successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per clock cycle. The warp size is 32 and the number of banks is 16, so a shared-memory request takes two cycles per warp: one for the first half of the warp and one for the second half. There are no bank conflicts between threads from the first and second halves.
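A sketch of the resulting access patterns on this 16-bank hardware (the kernel bankDemo and the array size are illustrative; it assumes a launch with at most 512 threads per block):

__global__ void bankDemo(float *out) {
    __shared__ float s[512];
    s[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    // Conflict-free: consecutive threads read consecutive 32-bit words,
    // which fall in consecutive banks (word i maps to bank i % 16).
    float a = s[threadIdx.x];

    // 16-way bank conflict: every thread of a half-warp addresses
    // bank 0, because (threadIdx.x * 16) % 16 == 0 for all threads.
    float b = s[(threadIdx.x * 16) % 512];

    out[threadIdx.x] = a + b;
}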
Extended C
Figure: CUDA compilation flow. Integrated source (foo.cu) is processed by cudacc, built on the EDG C/C++ frontend and the Open64 Global Optimizer, which splits it into GPU assembly for the device and C/C++ host code compiled with gcc / cl.
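In practice, such an integrated source file is compiled with NVIDIA's nvcc compiler driver, which performs this device/host split automatically, for example:

nvcc foo.cu -o foo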
__global__
Executed on the device; callable from the host only.
__host__
Executed on the host; callable from the host only.
__constant__
Resides in constant memory space, has the lifetime of the application, and is accessible from all threads within the grid and from the host through the runtime library.
__shared__
Resides in the shared memory space of a thread block, has the lifetime of the block, and is accessible only from threads within that block.
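A minimal sketch combining these qualifiers (the names filterKernel, setCoefficients, and coeff are illustrative, not part of any API):

#include <cuda_runtime.h>

// __constant__: lives in constant memory for the whole application,
// readable by every thread in the grid and set by the host at run time.
__constant__ float coeff[4];

// __global__: runs on the device, launched from the host.
__global__ void filterKernel(const float *in, float *out, int n) {
    // __shared__: one copy per block, lives as long as the block,
    // visible only to the threads of that block.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();

    if (i < n) out[i] = tile[threadIdx.x] * coeff[i % 4];
}

// __host__: runs on the CPU; here it fills the constant table.
__host__ void setCoefficients(const float *h) {
    cudaMemcpyToSymbol(coeff, h, 4 * sizeof(float));
}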
CPU/GPU Comparison
Figure: GPU speedup over the CPU as a function of particle count.
Intel and AMD are now shipping CPU chips with 4 cores, while NVIDIA is shipping GPU chips with 12 streaming processors. Overall, in four years, GPUs have achieved a 1.5-fold annual increase in performance, which exceeds Moore's law.
Differences between GPU and CPU threads
GPU threads are extremely lightweight, with very little creation overhead. A GPU needs thousands of threads for full efficiency, whereas a multi-core CPU needs only a few.
The GPU baseline speedup is approximately 60x. For 500,000 particles, that is a reduction in calculation time from 33 minutes to 33 seconds!
Summary
Thousands of lightweight concurrent threads with no switching overhead.
Shared memory acts as a user-managed L1 cache and enables thread communication within a block.
Random access to global memory.
Current-generation hardware has up to 12 streaming processors.
THANK YOU