Introduction To Massively Parallel Computing
Course Goals
- Learn how to program massively parallel processors and achieve:
  - High performance
  - Functionality and maintainability
  - Scalability across future generations
People
- Lecturers
  - Jared Hoberock: jaredhoberock at gmail.com
  - David Tarjan: tar.cs193g at gmail.com
  - Office hours: 3:00-4:00 PM, Tu/Th, Gates 195
- Course TA
  - Niels Joubert: njoubert at cs.stanford.edu
- Guest lecturers
  - Domain experts
Web Resources
- Website: http://stanford-cs193g-sp2010.googlecode.com
  - Lecture slides/recordings
  - Documentation, software resources
  - Note: while we'll make an effort to post announcements on the web, we can't guarantee it, and won't make allowances for people who miss things in class
- Mailing list
  - Channel for electronic announcements
  - Forum for Q&A: lecturers and assistants read the board, and your classmates often have answers
Grading
- This is a lab-oriented course!
- Machine problems: 50%
  - Correctness: ~40%
  - Performance: ~35%
  - Report: ~25%
- Project: 50%
  - Technical pitch: 25%
  - Project presentation: 25%
  - Demo: 50%
Bonus Days
- Every student is allocated two bonus days
  - No-questions-asked one-day extensions that can be used on any MP
  - Use both on the same thing if you want
  - Weekends/holidays don't count toward the number of days of extension (Friday to Monday is just a one-day extension)
- Intended to cover illnesses, interview visits, just needing more time, etc.
- Late penalty is 10% of the possible credit per day, again counting only weekdays
Academic Honesty
- You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people who've already taken the course is also fine.
- Any reference to assignments from previous terms or web postings is unacceptable.
- Any copying of non-trivial code is unacceptable.
  - Non-trivial = more than a line or so
  - Includes reading someone else's code and then going off to write your own
Course Equipment
- Your own PCs with a CUDA-enabled GPU
- NVIDIA GeForce GTX 260 boards
- Lab facilities: Pups cluster, Gates B21
  - Nodes 2, 8, 11, 12, & 13
- New Fermi-architecture GPUs?
  - As they become available
- References:
  - NVIDIA. The NVIDIA CUDA Programming Guide. 2010.
  - NVIDIA. CUDA Reference Manual. 2010.
Schedule
- Week 1
  - Tu: Introduction
  - Th: CUDA Intro
  - MP 0: Hello, World!
  - MP 1: Parallel For
- Week 2
  - Tu: Threads & Atomics
  - Th: Memory Model
  - MP 2: Atomics
- Week 3
  - Tu: Performance
  - Th: Parallel Programming
  - MP 3: Communication
- Week 4
  - Tu: Project Proposals
  - Th: Parallel Patterns
  - MP 4: Productivity
- Week 5
  - Tu: Productivity
  - Th: Sparse Matrix Vector
- Week 6
  - Tu: PDE Solvers Case Study
  - Th: Fermi
- Week 7
  - Tu: Ray Tracing Case Study
  - Th: Future of Throughput
- Week 8
  - Tu: AI Case Study
  - Th: Advanced Optimization
- Week 9
  - Tu: TBD
  - Th: Project Presentations
- Week 10
  - Tu: Project Presentations
"The number of transistors on an integrated circuit doubles every two years." (Gordon E. Moore)
[Figure: power and performance trends over time]
- Data-level parallelism is increasing
  - Vector units, SIMD execution: SSE, AVX, Cell SPE, Clearspeed, GPU
- Thread-level parallelism is increasing
  - Multithreading, multicore, manycore: Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, ...
[Figure: peak performance over time for NVIDIA GPUs (NV30, NV40, G80, GT200) vs. Intel CPUs (Westmere)]
[Diagram: a generic multicore chip, with processors and on-chip memory attached to a shared global memory]
- Handful of processors, each supporting ~1 hardware thread
- On-chip memory near processors (cache, RAM, or both)
- Shared global memory space (external DRAM)
[Diagram: a generic manycore chip, with many processors and on-chip memory attached to a shared global memory]
- Many processors, each supporting many hardware threads
- On-chip memory near processors (cache, RAM, or both)
- Shared global memory space (external DRAM)
GPU Evolution
- High-throughput computation
  - GeForce GTX 280: 933 GFLOP/s
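That peak figure can be sanity-checked with simple arithmetic (an accounting assumption, not from the slides: each of the GTX 280's 240 cores can dual-issue a multiply-add plus a multiply, i.e. 3 single-precision flops per clock, at the 1.296 GHz shader clock):

    240 cores x 1.296 GHz x 3 flops/clock ≈ 933 GFLOP/s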
[Figure: GPU evolution timeline, 1995-2010]
SM Multiprocessor
[Diagram: Fermi SM and GigaThread scheduler]
- 32 CUDA cores per SM (512 total)
- 8x peak FP64 performance
  - 50% of peak FP32 performance
- Direct load/store to memory
  - Usual linear sequence of bytes
  - High bandwidth (hundreds of GB/sec)
- 64KB of fast, on-chip RAM
  - Software- or hardware-managed
  - Shared amongst CUDA cores
  - Enables thread communication (see the sketch below)
- Hardware multithreading
  - HW resource allocation & thread scheduling
  - HW relies on threads to hide latency
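A minimal sketch (not from the slides; names invented) of threads in one block communicating through that on-chip RAM: each block sums 256 inputs in __shared__ memory via a tree reduction, synchronizing with __syncthreads(). It assumes 256-thread blocks and an input length that is a multiple of 256.

    __global__ void block_sum(const float* in, float* per_block_results)
    {
        __shared__ float sdata[256];   // fast on-chip RAM, shared by the whole block

        unsigned int tid = threadIdx.x;
        sdata[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();               // wait until every thread has loaded its element

        // tree reduction in shared memory: 256 -> 128 -> ... -> 1 partial sums
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
        {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)                  // thread 0 publishes the block's result
            per_block_results[blockIdx.x] = sdata[0];
    }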
Enter CUDA
- Scalable parallel programming model
- Minimal extensions to familiar C/C++ environment
- Heterogeneous serial-parallel computing
Motivation
[Figure: reported CUDA application speedups, with bar labels including 35X and 110-240X]
[Diagram: the machine abstraction: thread blocks of threads t0, t1, ..., tB with per-block memory, and processors connected through an interconnection network to global memory]
- Global synchronization isn't cheap (see the sketch below)
- Global memory access times are expensive
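A sketch of the practical consequence (not from the slides; kernel and variable names are hypothetical): threads within a block synchronize cheaply with __syncthreads(), but synchronizing across all blocks is typically done by ending one kernel and launching another, since launches on the same stream execute in order and the boundary makes all of the first kernel's global memory writes visible to the second:

    // hypothetical two-phase computation: the kernel boundary is the global sync
    __global__ void step_one(float* data) { /* phase 1: write partial results */ }
    __global__ void step_two(float* data) { /* phase 2: read phase-1 results  */ }

    // host side:
    //   step_one<<< num_blocks, threads_per_block >>>(d_data);
    //   step_two<<< num_blocks, threads_per_block >>>(d_data);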
Heterogeneous Computing
[Diagram: a multicore CPU working alongside a manycore GPU]
C for CUDA
- Philosophy: provide the minimal set of extensions necessary to expose power
- Function qualifiers:

    __global__ void my_kernel() { }
    __device__ float my_device_func() { }

- Variable qualifiers:

    __constant__ float my_constant_array[32];
    __shared__   float my_shared_array[32];

- Execution configuration:

    dim3 grid_dim(100, 50);  // 5000 thread blocks
    dim3 block_dim(4, 8, 8); // 256 threads per block
    my_kernel<<< grid_dim, block_dim >>>(...);  // launch kernel
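A hypothetical sketch (names invented, not from the slides) showing the qualifiers cooperating: a __constant__ table filled from the host with cudaMemcpyToSymbol, read by a __device__ helper called from a __global__ kernel:

    __constant__ float coeffs[32];            // constant memory, written by the host

    __device__ float scale(float x, int k)    // callable from device code only
    {
        return coeffs[k] * x;
    }

    __global__ void apply_scale(float* data, int k)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        data[i] = scale(data[i], k);
    }

    // host side:
    //   cudaMemcpyToSymbol(coeffs, h_coeffs, 32 * sizeof(float));
    //   apply_scale<<< num_blocks, threads_per_block >>>(d_data, k);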
Example: vector_addition
Device Code

    // compute vector sum c = a + b
    // each thread performs one pair-wise addition
    __global__ void vector_add(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // elided initialization code
        ...

        // launch N/256 blocks of 256 threads each
        vector_add<<< N/256, 256 >>>(d_A, d_B, d_C);
    }
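As written, the launch assumes N is a multiple of 256. A common guard (a sketch, not from the slides; kernel name invented) rounds the grid up and clamps out-of-range threads:

    __global__ void vector_add_n(float* A, float* B, float* C, int N)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < N)                    // threads past the end of the array do nothing
            C[i] = A[i] + B[i];
    }

    // launched with enough blocks to cover all N elements:
    //   vector_add_n<<< (N + 255) / 256, 256 >>>(d_A, d_B, d_C, N);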
Host Code
    // allocate and initialize host (CPU) memory
    float *h_A = ..., *h_B = ...;

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float) );
    cudaMalloc( (void**) &d_B, N * sizeof(float) );
    cudaMalloc( (void**) &d_C, N * sizeof(float) );

    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

    // launch N/256 blocks of 256 threads each
    vector_add<<< N/256, 256 >>>(d_A, d_B, d_C);
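The slide stops at the launch. For completeness, a sketch of the usual remaining steps (h_C is an assumed host buffer, not declared on the slide):

    // copy the result back to the host
    cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

    // free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);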
Speedup of Applications

Description                                                                        | Source (lines of code) | % time
SPEC 06 version, change in guess vector                                            | 34,811                 | 35%
SPEC 06 version, change to single precision and print fewer reports                | 1,481                  | >99%
Distributed.net RC5-72 challenge client code                                       | 1,979                  | >99%
Finite element modeling, simulation of 3D graded materials                         | 1,874                  | 99%
Rys Polynomial Equation Solver, quantum chem, 2-electron repulsion                 | 1,104                  | 99%
Petri Net simulation of a distributed system                                       | 322                    | >99%
Single-precision implementation of saxpy, used in Linpack's Gaussian elim. routine | 952                    | >99%
Two Point Angular Correlation Function                                             | 536                    | 96%
Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation      | 1,365                  | 16%
Computing a matrix Q, a scanner's configuration in MRI reconstruction              | 490                    | >99%
- GeForce 8800 GTX vs. 2.2GHz Opteron 248
- 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized
Final Thoughts
- Parallel hardware is here to stay
- GPUs are massively parallel manycore processors
  - Easily available and fully programmable
- Parallelism & scalability are crucial for success
- This presents many important research challenges
  - Not to speak of the educational challenges
Machine Problem 0
- http://code.google.com/p/stanford-cs193g-sp2010/wiki/GettingStartedWithCUDA
- Work through the tutorial codes:
  - hello_world.cu
  - cuda_memory_model.cu
  - global_functions.cu
  - device_functions.cu
  - vector_addition.cu
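The tutorial files themselves aren't reproduced here; a minimal skeleton in the spirit of hello_world.cu (an assumption, not the actual tutorial code):

    #include <cstdio>

    // a kernel that does nothing, executed once per thread
    __global__ void empty_kernel() { }

    int main()
    {
        empty_kernel<<< 1, 1 >>>();   // launch 1 block of 1 thread
        cudaDeviceSynchronize();      // wait for the GPU to finish
        printf("Hello, world!\n");
        return 0;
    }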