NVIDIA CUDA
What is CUDA?
CUDA (an acronym for Compute Unified Device Architecture) is NVIDIA's parallel computing architecture that enables dramatic increases in computing performance by harnessing the power of the GPU (graphics processing unit).
Background
Computing is evolving from "central processing" on the CPU to "co-processing" on the CPU and GPU. To enable this new computing paradigm, NVIDIA invented the CUDA parallel computing architecture that is now shipping in GeForce, ION, Quadro, and Tesla GPUs, representing a significant installed base for application developers. In the consumer market, nearly every major consumer video application has been, or will soon be, accelerated by CUDA, including products from Elemental Technologies, MotionDSP and LoiLo, Inc.
The latest CUDA architecture, Fermi, has eight times the peak double-precision floating-point performance of NVIDIA's previous-generation Tesla GPUs. It also introduced several new features, including:
Up to 512 CUDA cores and 3.0 billion transistors
NVIDIA Parallel DataCache technology
NVIDIA GigaThread engine
ECC memory support
Native support for Visual Studio
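Many of these properties can be inspected at run time through the CUDA runtime API. A minimal sketch, assuming a CUDA-capable device 0 (the choice of printed fields is illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    // Query the properties of device 0.
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("ECC enabled: %d\n", prop.ECCEnabled);
    return 0;
}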
Overview
Example of the CUDA processing flow:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to begin processing.
3. The GPU executes in parallel on each core.
4. Copy the result from GPU memory back to main memory.
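A minimal sketch of these four steps using the CUDA runtime API (the kernel name scale, the size N, and the launch configuration are illustrative choices, not from the original):

#include <cuda_runtime.h>

#define N 1024

// Step 3: each thread scales one element in parallel on its own core.
__global__ void scale(float *d, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) d[i] *= factor;
}

int main(void) {
    float h[N], *d;
    for (int i = 0; i < N; ++i) h[i] = (float)i;

    cudaMalloc((void **)&d, N * sizeof(float));
    // Step 1: copy data from main memory to GPU memory.
    cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);
    // Step 2: the CPU instructs the GPU to launch the kernel.
    scale<<<(N + 255) / 256, 256>>>(d, 2.0f);
    // Step 4: copy the result from GPU memory back to main memory.
    cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return 0;
}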
CUDA has been applied in areas such as:
The Search for Extra-Terrestrial Intelligence (SETI)
Accelerated rendering of 3D graphics
Real-time cloth simulation
Distributed calculations, such as predicting the native conformation of proteins
Medical analysis simulations, for example virtual reality based on CT and MRI scan images
Physical simulations, particularly in fluid dynamics
Environmental statistics
Accelerated encryption, decryption and compression
Accelerated interconversion of video file formats
Threads and blocks have IDs, so each thread can decide what data to work on:
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
This simplifies memory addressing when processing multidimensional data, such as image processing and solving PDEs on volumes (see the sketch after the figure below).
Figure: a device grid of thread blocks. Grid 1 contains Block (0,0) through Block (2,1); Block (1,1) is expanded to show its 5x3 array of threads, Thread (0,0) through Thread (4,2). (Courtesy: NVIDIA)
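As a sketch of how a thread turns these IDs into a memory address for 2D data (the kernel name invert and the row-major image layout are illustrative assumptions):

// Each thread computes its own (x, y) pixel coordinate from its
// block ID and thread ID, then addresses the image in row-major order.
__global__ void invert(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}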
Figure: CUDA memory model. Each thread has its own local memory; the host communicates with memory on the device.
Successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per clock cycle. The warp size is 32 and the number of banks is 16, so a shared-memory request takes two cycles per warp: one for the first half of the warp and one for the second half. There are no bank conflicts between threads from the first and second halves.
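A sketch of the resulting access patterns on this 16-bank hardware (the kernel bankDemo and the array size are illustrative; it assumes a launch with at most 512 threads per block):

__global__ void bankDemo(float *out) {
    __shared__ float s[512];
    s[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();

    // Conflict-free: consecutive threads read consecutive 32-bit words,
    // which fall in consecutive banks (word i maps to bank i % 16).
    float a = s[threadIdx.x];

    // 16-way bank conflict: every thread of a half-warp addresses
    // bank 0, because (threadIdx.x * 16) % 16 == 0 for all threads.
    float b = s[(threadIdx.x * 16) % 512];

    out[threadIdx.x] = a + b;
}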
Extended C
Figure: CUDA compilation flow. Integrated source (foo.cu) is processed by cudacc, built on the EDG C/C++ frontend and the Open64 Global Optimizer, which splits it into GPU assembly for the device and C/C++ host code compiled with gcc / cl.
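In practice, such an integrated source file is compiled with NVIDIA's nvcc compiler driver, which performs this device/host split automatically, for example:

nvcc foo.cu -o foo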
__global__
Executed on the device; callable from the host only.
__host__
Executed on the host; callable from the host only.
__constant__
Resides in constant memory space, has the lifetime of the application, and is accessible from all threads within the grid and from the host through the runtime library.
__shared__
Resides in the shared memory space of a thread block, has the lifetime of the block, and is accessible only from threads within that block.
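A minimal sketch combining these qualifiers (the names filterKernel, setCoefficients, and coeff are illustrative, not part of any API):

#include <cuda_runtime.h>

// __constant__: lives in constant memory for the whole application,
// readable by every thread in the grid and set by the host at run time.
__constant__ float coeff[4];

// __global__: runs on the device, launched from the host.
__global__ void filterKernel(const float *in, float *out, int n) {
    // __shared__: one copy per block, lives as long as the block,
    // visible only to the threads of that block.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();

    if (i < n) out[i] = tile[threadIdx.x] * coeff[i % 4];
}

// __host__: runs on the CPU; here it fills the constant table.
__host__ void setCoefficients(const float *h) {
    cudaMemcpyToSymbol(coeff, h, 4 * sizeof(float));
}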
CPU/GPU Comparison
Figure: GPU speedup over the CPU as a function of particle count.
Intel and AMD are now shipping CPU chips with 4 cores, while NVIDIA is shipping GPU chips with 12 streaming processors. Overall, in four years, GPUs have achieved a 1.5-fold annual increase in performance, which exceeds Moore's law.
Differences between GPU and CPU threads
GPU threads are extremely lightweight, with very little creation overhead. A GPU needs thousands of threads for full efficiency, whereas a multi-core CPU needs only a few.
The GPU baseline speedup is approximately 60x. For 500,000 particles, that is a reduction in calculation time from 33 minutes to 33 seconds!
Summary
Thousands of lightweight concurrent threads with no switching overhead.
Shared memory acts as a user-managed L1 cache and enables thread communication within a block.
Random access to global memory.
Current-generation hardware has up to 12 streaming processors.
THANK YOU