nVidia G80 GPU & CUDA

Brent Oster
Associate Director, Allosphere, CNSI
PhD studies in Computational Nanotechnology
Previously: Technical Director for video games
(BioWare, EA, Alias|Wavefront, LucasArts)

The Modern GPU Is More General Purpose – Lots of ALUs
The nVidia G80 GPU
► 128 streaming floating-point processors @ 1.5 GHz
► 1.5 GB shared RAM with 86 GB/s bandwidth
► ~500 Gflops on one chip (single precision)
What Has Driven the Evolution of These Chips?

► Males age 15–35 buy $10B in video games / year
► Crysis demo
Are GPUs Useful for Scientific Computing?

► Electronic structure (DFT)
► Finite element modeling
► Molecular dynamics & Monte Carlo
nVidia G80 GPU Architecture Overview
• 16 multiprocessor blocks
• Each MP block has:
  • 8 streaming processors (IEEE 754 single-precision compliant)
  • 16 KB shared memory
  • 64 KB constant cache
  • 8 KB texture cache
• Each processor can access all of the memory at 86 GB/s, but with different latencies:
  • Shared – 2-cycle latency
  • Device – 300-cycle latency
Programming Interface
► Interface to the GPU is via nVidia’s proprietary API – CUDA (very C-like)
► Looks a lot like UPC (simplified CUDA below)

void AddVectors(float *r, float *a, float *b)
{
    int tx = threadIdx.x;    // ~processor rank
    r[tx] = a[tx] + b[tx];   // executed in parallel
}
Actual CUDA Code

#define MAX_THREADS 512

extern "C" void AddVectors(float *r, float *a, float *b, int n)
{
    int nThreads = MAX_THREADS / 2;
    int nBlocks  = n / nThreads;
    AddVectorsKernel<<<nBlocks, nThreads>>>(r, a, b, n);
}

__global__ void AddVectorsKernel(float *r, float *a, float *b, int n)
{
    int tx = threadIdx.x;
    int bx = blockIdx.x;
    int i  = tx + bx * blockDim.x;
    r[i] = a[i] + b[i];
} // This would be extremely slow and inefficient code – more later
Still a Specialized Processor
► Very efficient for:
 Fast parallel floating-point processing
 Single-instruction multiple-data (SIMD) operations
 High computation per memory access

► Not as efficient for:
 Double precision (need to test performance)
 Logical operations on integer data
 Branching-intensive operations
 Random-access, memory-intensive operations
__global__ void NxNGenericOp_Kernel(float *r, float *a, float *b, int n) // r[i] = SUMj(a[i]*b[j])
{
    __shared__ float r_sh[MAX_THREADS]; // Allocate in fast 16 KB shared memory
    __shared__ float a_sh[MAX_THREADS];
    __shared__ float b_sh[MAX_THREADS];

    int tx = threadIdx.x;               // Rank of thread within the block
    int bx = blockIdx.x;                // Rank of multiprocessor block
    int i  = tx + bx * MAX_THREADS;     // Compute global index from tx, bx

    a_sh[tx] = a[i];                    // Each thread loads a value for a_sh
    r_sh[tx] = 0;                       // Each thread zeros a value for r_sh

    __syncthreads();                    // Sync until all threads reach this point

    for (int J = 0; J < n; J += MAX_THREADS)   // Loop over blocks in b
    {
        b_sh[tx] = b[J + tx];           // Each thread loads a value for b_sh
        __syncthreads();                // Sync: wait for the whole tile
        for (int j = 0; j < MAX_THREADS; j++)  // For each b_sh
            r_sh[tx] += a_sh[tx] * b_sh[j];    // Compute product a_sh*b_sh, add to r_sh
        __syncthreads();                // Sync before overwriting b_sh
    }
    r[i] = r_sh[tx];                    // Write results to r
}
Making Optimal Use of 16 KB Shared Memory
► The 16 KB of shared memory is allocated in 16 banks
► Array data is allocated across banks:
 B[0] -> bank 0
 B[1] -> bank 1
 B[n] -> bank mod(n, nBanks)
► No bank conflicts if each thread indexes a different bank
► Bank conflicts if threads access the same bank (results in a data stall)
More Detail on GPU Architecture

Exploiting the Texture Samplers
► Designed to map textures onto 3D polygons
► Specialty hardware pipelines for:
 Fast data sampling from 1D, 2D, 3D arrays
 Swizzling of 2D, 3D data for optimal access
 Bilinear filtering in zero cycles
 Image compositing & blending operations
► Arrays indexed by u,v,w coordinates – easy to program
► Extremely well suited for multigrid & finite difference methods – example later
Experiments in Computational Nanotech on GPU

► Electronic structure (DFT)
► Finite element modeling
► Molecular dynamics & Monte Carlo

HP xw9400 with quad AMD CPU & dual nVidia Quadro FX 5600 GPUs
= A Teraflop Workstation?
Molecular Dynamics Trial
► Lennard-Jones inter-atomic potential
► Verlet integration
► Normalized coordinates
► FCC lattice in an NxNxN simulation cell
► Periodic boundary conditions
► Trials with Rc = ∞ and Rc = 3.0
► Tested nVidia 8800 GPU vs. 3.0 GHz Intel P4
► OpenGL used to implement MD on the GPU
MD Timing Tests (NxN brute force)

# Cells   # Atoms   Time/Step   Time/Step   Performance
(X,Y,Z)   (total)   GPU (s)     CPU (s)     Differential
2         32        0.000308    0.000429    139%
3         108       0.000390    0.004513    1157%
4         256       0.000391    0.025295    6469%
5         500       0.000596    0.092766    15565%
6         864       0.001274    0.276810    21728%
7         1372      0.002845    0.689375    24231%
8         2048      0.005665    1.547000    27308%
MD Timing Results (bins & Rc = 9 Å)

# Cells   # Atoms   Time/Step   Timesteps
(X,Y,Z)   (total)   GPU (ms)    per sec
8         2,048     0.532       1879.7
16        16,384    1.984       504.03
32        131,072   16.157      61.89
40        256,000   36.515      27.38
50        500,000   70.985      14.08
Hardware Accelerated DFT Test
• Real-space grid method (Beck, Bryant, …)
• LDA, localized basis functions
• Iterative solution of the KS equations
• Finite difference methods
• Multigrid with FMG-FAS
• Weighted Jacobi relaxation
• Mehrstellen discretization
• Gram-Schmidt orthogonalization on lo-res grid
• 64x64x64 grid x 4 orbitals
• 8 H nuclei, 8 electrons
• >1M data elements
• 81 ms computation time!
Where Next?
G90 Double-Precision GPU in Spring 2008

Configuration                              Performance    Price
G90 GPU (double precision)                 1 Teraflop     ~$2,500
nVidia Quadro PC workstation               4 Teraflops    ~$15,000
nVidia QuadroPlex cluster (16 PC nodes)    64 Teraflops   ~$300,000
NanoCAD in the Allosphere
California NanoSystems Institute
How to Find Out More
► Download CUDA and docs from nVidia
 http://developer.nvidia.com/object/cuda.html
► Buy a $600 nVidia GeForce 8800 GTX
► Get one free through their developer program (talk to me after class)
► CUDA programming course through CS
 Fall ’07 or Winter ’08
 Tobias Höllerer & myself
► NanoCAD collaborative development – www.powerofminus9.net
