nVidia G80 GPU & CUDA

Brent Oster
Associate Director, Allosphere, CNSI
PhD studies in Computational Nanotechnology
Previously: Technical Director for video games
(BioWare, EA, Alias|Wavefront, LucasArts)

The Modern GPU Is More General Purpose – Lots of ALUs
The nVidia G80 GPU
► 128 streaming floating-point processors @ 1.5 GHz
► 1.5 GB shared RAM with 86 GB/s bandwidth
► ~500 Gflops on one chip (single precision)
What Has Driven the Evolution of These Chips?

► Males age 15–35 buy $10B in video games / year
► Crysis demo
Are GPUs Useful for Scientific Computing?

► Electronic structure (DFT)
► Finite element modeling
► Molecular dynamics & Monte Carlo
nVidia G80 GPU Architecture Overview
• 16 multiprocessor blocks
• Each MP block has:
  • 8 streaming processors (IEEE 754 single-precision compliant)
  • 16 KB shared memory
  • 64 KB constant cache
  • 8 KB texture cache
• Each processor can access all of the memory at 86 GB/s, but with different latencies:
  • Shared – 2-cycle latency
  • Device – 300-cycle latency
Programming Interface
► Interface to the GPU is via nVidia’s proprietary API – CUDA (very C-like)
► Looks a lot like UPC (simplified CUDA below)

void AddVectors(float *r, float *a, float *b)
{
    int tx = threadIdx.x;    // ~processor rank
    r[tx] = a[tx] + b[tx];   // executed in parallel
}
Actual CUDA Code

#define MAX_THREADS 512

extern "C" void AddVectors(float *r, float *a, float *b, int n)
{
    int nThreads = MAX_THREADS / 2;
    int nBlocks  = n / nThreads;
    AddVectorsKernel<<<nBlocks, nThreads>>>(r, a, b, n);
}

__global__ void AddVectorsKernel(float *r, float *a, float *b, int n)
{
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int i  = tx + bx * blockDim.x;
    r[i] = a[i] + b[i];
} // This would be extremely slow and inefficient code – more later
Still a Specialized Processor
► Very efficient for:
 Fast parallel floating-point processing
 Single-instruction multiple-data (SIMD) operations
 High computation per memory access

► Not as efficient for:
 Double precision (need to test performance)
 Logical operations on integer data
 Branching-intensive operations
 Random-access, memory-intensive operations
__global__ void NxNGenericOp_Kernel(float *r, float *a, float *b, int n) // r[i] = SUMj(a[i]*b[j])
{
    __shared__ float r_sh[MAX_THREADS]; // Allocate in fast 16 KB shared memory
    __shared__ float a_sh[MAX_THREADS];
    __shared__ float b_sh[MAX_THREADS];

    int tx = threadIdx.x;               // Rank of thread within the block
    int bx = blockIdx.x;                // Rank of multiprocessor block
    int i  = tx + bx * MAX_THREADS;     // Compute global index from tx, bx

    a_sh[tx] = a[i];                    // Each thread loads a value for a_sh
    r_sh[tx] = 0;                       // Each thread zeros a value for r_sh

    __syncthreads();                    // Sync until all threads reach this point

    for (int J = 0; J < n; J += MAX_THREADS)   // Loop over blocks in b
    {
        b_sh[tx] = b[J + tx];           // Each thread loads a value for b_sh
        __syncthreads();                // Sync: wait for the whole tile
        for (int j = 0; j < MAX_THREADS; j++)  // For each b_sh
            r_sh[tx] += a_sh[tx] * b_sh[j];    // Compute product a_sh*b_sh, add to r_sh
        __syncthreads();                // Sync before overwriting b_sh
    }
    r[i] = r_sh[tx];                    // Write results to r
}
Making Optimal Use of 16 KB Shared Memory
► The 16 KB of shared memory is allocated in 16 banks
► Array data is allocated across banks:
 B[0] -> bank 0
 B[1] -> bank 1
 B[n] -> bank mod(n, nBanks)
► No bank conflicts if each thread indexes a different bank
► Bank conflicts if threads access the same bank (results in a data stall)
More Detail on GPU Architecture

Exploiting the Texture Samplers
► Designed to map textures onto 3D polygons
► Specialty hardware pipelines for:
 Fast data sampling from 1D, 2D, 3D arrays
 Swizzling of 2D, 3D data for optimal access
 Bilinear filtering in zero cycles
 Image compositing & blending operations
► Arrays indexed by u,v,w coordinates – easy to program
► Extremely well suited for multigrid & finite difference methods – example later
Experiments in Computational Nanotech on GPU

► Electronic structure (DFT)
► Finite element modeling
► Molecular dynamics & Monte Carlo

HP xw9400 with quad AMD CPU & dual nVidia Quadro FX 5600 GPUs
= A Teraflop Workstation?
Molecular Dynamics Trial
► Lennard-Jones inter-atomic potential
► Verlet integration
► Normalized coordinates
► FCC lattice in an NxNxN simulation cell
► Periodic boundary conditions
► Trials with Rc = ∞ and Rc = 3.0
► Tested nVidia 8800 GPU vs. 3.0 GHz Intel P4
► OpenGL used to implement MD on the GPU
MD Timing Tests (NxN brute force)

# Cells   # Atoms   Time/Step   Time/Step   Performance
(X,Y,Z)   (total)   GPU (s)     CPU (s)     Differential
2         32        0.000308    0.000429    139%
3         108       0.000390    0.004513    1157%
4         256       0.000391    0.025295    6469%
5         500       0.000596    0.092766    15565%
6         864       0.001274    0.276810    21728%
7         1372      0.002845    0.689375    24231%
8         2048      0.005665    1.547000    27308%
MD Timing Results (bins & Rc = 9 Å)

# Cells   # Atoms   Time/Step   Timesteps
(X,Y,Z)   (total)   GPU (ms)    per sec
8         2,048     0.532       1879.7
16        16,384    1.984       504.03
32        131,072   16.157      61.89
40        256,000   36.515      27.38
50        500,000   70.985      14.08
Hardware Accelerated DFT Test
• Real-space grid method (Beck, Bryant, …)
• LDA, localized basis functions
• Iterative solution of the KS equations
• Finite difference methods
• Multigrid with FMG-FAS
• Weighted Jacobi relaxation
• Mehrstellen discretization
• Gram-Schmidt orthogonalization on lo-res grid
• 64x64x64 grid x 4 orbitals
• 8 H nuclei, 8 electrons
• >1M data elements
• 81 ms computation time!
Where Next?
G90 Double-Precision GPU in Spring 2008

Configuration                              Performance    Price
G90 GPU (double precision)                 1 Teraflop     ~$2,500
nVidia Quadro PC workstation               4 Teraflops    ~$15,000
nVidia QuadroPlex cluster (16 PC nodes)    64 Teraflops   ~$300,000
NanoCAD in the Allosphere
California NanoSystems Institute
How to Find Out More
► Download CUDA and docs from nVidia
 http://developer.nvidia.com/object/cuda.html
► Buy a $600 nVidia GeForce 8800 GTX
► Get one free through their developer program (talk to me after class)
► CUDA programming course through CS
 Fall ’07 or Winter ’08
 Tobias Höllerer & myself
► NanoCAD collaborative development – www.powerofminus9.net
