Nvidia CUDA

CUDA is a parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU. The current-generation CUDA architecture (codename "Fermi") is standard on NVIDIA's GeForce 400 Series GPUs. CUDA can be used to accelerate a wide range of applications, including medical analysis simulations, for example virtual reality based on CT and MRI scan images.

Introduction

What is CUDA?
CUDA (an acronym for Compute Unified Device Architecture) is NVIDIA's parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit).

Background


Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA invented the CUDA parallel computing architecture that is now shipping in GeForce, ION, Quadro, and Tesla GPUs, representing a significant installed base for application developers. In the consumer market, nearly every major consumer video application has been, or will soon be, accelerated by CUDA, including products from Elemental Technologies, MotionDSP and LoiLo, Inc.

Current CUDA architectures


The current-generation CUDA architecture (codename "Fermi"), which is standard on NVIDIA's GeForce 400 Series GPUs, is designed from the ground up to natively support more programming languages, such as C++.


It has eight times the peak double-precision floating-point performance of NVIDIA's previous-generation Tesla GPUs. It also introduces several new features, including:
- up to 512 CUDA cores and 3.0 billion transistors
- NVIDIA Parallel DataCache technology
- NVIDIA GigaThread engine
- ECC memory support
- native support for Visual Studio


Overview

Example of the CUDA processing flow:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU executes the work in parallel on each core.
4. Copy the result from GPU memory back to main memory.
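A minimal host-side sketch of these four steps, assuming a toy scale kernel that multiplies every element of an array (the kernel name and sizes are illustrative, not part of the original deck):

#include <cuda_runtime.h>
#include <stdlib.h>

// Illustrative kernel: each thread scales one element (step 3).
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1. main memory -> GPU memory

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);        // 2. CPU instructs the GPU
    cudaDeviceSynchronize();                            // 3. GPU runs the threads in parallel

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    // 4. GPU memory -> main memory
    cudaFree(d);
    free(h);
    return 0;
}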

Current and future usages of CUDA architecture




- The Search for Extra-Terrestrial Intelligence (SETI)
- Accelerated rendering of 3D graphics
- Real-time cloth simulation
- Distributed calculations, such as predicting the native conformation of proteins
- Medical analysis simulations, for example virtual reality based on CT and MRI scan images
- Physical simulations, in particular in fluid dynamics
- Environment statistics
- Accelerated encryption, decryption and compression
- Accelerated interconversion of video file formats

CUDA Programming Model


The GPU is a coprocessor to the CPU (the host) and has its own DRAM (device memory). The GPU is viewed as a compute device that executes the portions of an application that:
- have to be executed many times,
- can be isolated as a function,
- work independently on different data.
Such a function can be compiled to run on the device; the resulting program is called a kernel. Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads.
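For instance, a SAXPY loop (y = a*x + y) meets all three criteria, so it maps directly onto a kernel; a minimal sketch (the function name is illustrative):

// Each thread performs one independent iteration of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // which element this thread owns
    if (i < n) y[i] = a * x[i] + y[i];
}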

CUDA Programming Model

Thread Block

A thread block is a batch of threads that can cooperate together:
- fast shared memory,
- synchronization,
- thread IDs.

A block can be a one-, two- or three-dimensional array of threads.

CUDA Programming Model


Thread batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks.
- All threads in a grid share the same data memory space.
- There is a limited number of threads in a block.
- Blocks are identifiable via a block ID.
- A thread block is a batch of threads that can cooperate with each other (as sketched below) by:
  - synchronizing their execution, for hazard-free shared memory accesses,
  - efficiently sharing data through low-latency shared memory.
- Two threads from two different blocks cannot cooperate.
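A small sketch of such intra-block cooperation, assuming a hypothetical kernel that reverses each block-sized segment of an array (the block size of 256 is an assumption):

// Threads of one block stage data in shared memory, synchronize,
// then read values written by other threads of the same block.
__global__ void reverse_per_block(float *data) {
    __shared__ float tile[256];                      // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();                                 // hazard-free shared memory access
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

Note that __syncthreads() only coordinates threads of the same block; there is no equivalent barrier across blocks, which is why threads from different blocks cannot cooperate.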

CUDA Programming Model

Thread and Block IDs

- Threads and blocks have IDs, so each thread can decide what data to work on.
- Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D.
- This simplifies memory addressing when processing multidimensional data, e.g. image processing or solving PDEs on volumes (see the sketch after the figure).

(Figure: a device grid of 3 x 2 thread blocks; Block (1, 1) is expanded to show its 5 x 3 array of threads. Courtesy: NVIDIA.)
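As an illustration of 2D indexing for image processing, a hypothetical image-brightening kernel (the names, tile size and clamping logic are assumptions, not part of the original deck):

__global__ void brighten(unsigned char *img, int width, int height, int delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column from block/thread IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row from block/thread IDs
    if (x < width && y < height) {
        int idx = y * width + x;
        int v = img[idx] + delta;
        img[idx] = (unsigned char)(v > 255 ? 255 : v);
    }
}

// Host side: launch a 2D grid of 2D blocks covering the image.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_img, width, height, 20);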

CUDA Memory Model

Device memory space overview
Block and thread IDs


Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- read only per-grid constant memory
- read only per-grid texture memory
(Figure: device memory hierarchy. Each block in the grid has its own shared memory; each thread has its own registers and local memory; global, constant and texture memory are shared across the whole grid and visible to the host.)
The host can R/W global, constant, and texture memories

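A sketch of host-side access to these spaces through the runtime library (the symbol names and sizes are illustrative):

__constant__ float coeffs[16];      // per-grid constant memory, read-only on the device
__device__   float results[1024];   // per-grid global memory, statically declared

void setup_and_readback(const float *h_coeffs, float *h_results) {
    // The host writes constant memory through the runtime library...
    cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));
    // ...and reads global memory back after kernels have run.
    cudaMemcpyFromSymbol(h_results, results, 1024 * sizeof(float));
}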

CUDA Memory Model


Shared Memory

Shared memory is on-chip:
- much faster than local and global memory,
- as fast as a register when there are no bank conflicts,
- divided into equally-sized memory banks.

Successive 32-bit words are assigned to successive banks; each bank has a bandwidth of 32 bits per clock cycle.

CUDA Memory Model


- Warp size is 32; the number of banks is 16.
- A shared memory request therefore takes two cycles per warp: one for the first half of the warp, one for the second half.
- There are no bank conflicts between threads from the first and the second half of a warp.
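A common way to avoid bank conflicts with this layout is to pad a shared 2D tile by one element per row so that column accesses fall into different banks; a sketch (the matrix is assumed square, with a side that is a multiple of TILE):

#define TILE 16

__global__ void transpose_tile(const float *in, float *out, int width) {
    // The +1 padding shifts each row by one bank, so reading a column
    // of the tile does not hit the same bank repeatedly.
    __shared__ float tile[TILE][TILE + 1];

    int col = blockIdx.x * TILE + threadIdx.x;
    int row = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[row * width + col];
    __syncthreads();

    int out_col = blockIdx.y * TILE + threadIdx.x;
    int out_row = blockIdx.x * TILE + threadIdx.y;
    out[out_row * width + out_col] = tile[threadIdx.x][threadIdx.y];
}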


CUDA API Basics


An Extension to the C Programming Language
- Function type qualifiers to specify execution on the host or the device
- Variable type qualifiers to specify the memory location on the device
- A new directive to specify how a kernel is executed on the device
- Four built-in variables that specify the grid and block dimensions and the block and thread indices
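Concretely, the directive is the <<< >>> execution configuration and the built-in variables are gridDim, blockDim, blockIdx and threadIdx; a minimal sketch (the kernel name is illustrative):

__global__ void fill(int *out, int value, int n) {
    // Built-in variables give each thread its position within the launch.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

// Host side: the execution configuration sets the grid and block dimensions.
// fill<<<(n + 127) / 128, 128>>>(d_out, 7, n);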

Extended C

(Figure: compilation flow for Extended C. Integrated source (.cu) is processed by cudacc, which uses the EDG C/C++ frontend and the Open64 global optimizer; the output is GPU assembly plus CPU host code (.cpp) that is compiled with gcc / cl.)

CUDA API Basics


Function type qualifiers

__device__
Executed on the device; callable from the device only.

__global__
Executed on the device; callable from the host only.

__host__
Executed on the host; callable from the host only.
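A short illustration of the three function qualifiers (the function names are made up for this sketch):

__device__ float square(float x) {          // runs on the device, callable from device code only
    return x * x;
}

__global__ void squares(float *v, int n) {  // kernel: runs on the device, launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = square(v[i]);
}

__host__ void run_squares(float *d_v, int n) {   // ordinary host function
    squares<<<(n + 255) / 256, 256>>>(d_v, n);
}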

CUDA API Basics


Variable type qualifiers

__device__
Resides in global memory space, has the lifetime of the application, and is accessible from all the threads within the grid and from the host through the runtime library.

__constant__ (optionally used together with __device__)
Resides in constant memory space, has the lifetime of the application, and is accessible from all the threads within the grid and from the host through the runtime library.

__shared__ (optionally used together with __device__)
Resides in the shared memory space of a thread block, has the lifetime of the block, and is accessible only from the threads within the block.
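A sketch combining the three variable qualifiers (the names, sizes and thresholding logic are illustrative):

__constant__ float gain;            // constant memory: set by the host, read by all threads
__device__   int   total_hits;      // global memory: lives for the whole application

__global__ void count_loud(const float *samples, int n) {
    __shared__ int block_hits;      // shared memory: one copy per block, block lifetime
    if (threadIdx.x == 0) block_hits = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && samples[i] * gain > 1.0f)
        atomicAdd(&block_hits, 1);  // threads of this block cooperate via shared memory
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(&total_hits, block_hits);  // one update per block to global memory
}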



CPU/GPU Comparison

(Chart: speedup vs. particle number; series include a baseline CPU implementation and an OpenMP CPU implementation.)

Why is GPU outpacing CPU?


GPU chips are customized to handle streaming data. GPU chips do not need a significant amount of cache space because the data is already sequential or cache-coherent.

Intel and AMD are now shipping CPU chips with 4 cores. Nvidia is shipping GPU chips with 12. Overall, in four years, GPUs have achieved a 1.5-fold increase in performance, which exceeds Moore's law.

CPU/GPU Comparison
Differences between GPU and CPU threads
- GPU threads are extremely lightweight, with very little creation overhead.
- A multi-core CPU needs only a few threads, while a GPU needs thousands of threads for full efficiency.
- The GPU speedup over the CPU baseline is approximately 60x; for 500,000 particles that is a reduction in calculation time from 33 minutes to 33 seconds!

CUDA advantages over Legacy GPGPU


- Random access to memory: threads can access any memory location.
- Unlimited access to memory: threads can read/write as many locations as needed.
- User-managed cache per block: threads can cooperatively load data into shared memory (SMEM); any thread can then access any SMEM location.
- Low learning curve: no knowledge of graphics is required.
- No graphics API overhead.

Summary
- Thousands of lightweight concurrent threads with no switching overhead.
- Shared memory: a user-managed L1 cache for thread communication within a block.
- Random access to global memory.
- Current-generation hardware: up to 12 streaming processors.

THANK YOU
