GPU Computing
DATA ANALYTICS | SIMULATION | SUPERCOMPUTING | EDGE APPLIANCE | NETWORK EDGE | VISUALIZATION | STREAMING | EXTREME IO | CLOUD | AI
HOW GPU ACCELERATION WORKS
Application Code
• Compute-Intensive Functions (5% of Code): GPU
• Rest of Sequential CPU Code: CPU
ACCELERATED COMPUTING
GPU Accelerator: Optimized for Parallel Tasks
SILICON BUDGET
This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)
CPU IS A LATENCY REDUCING ARCHITECTURE
CPU Strengths
• Very large main memory
• Very fast clock speeds
• Latency optimized via large caches
• Small number of threads can run very quickly
CPU Weaknesses
[Diagram: CPU, Optimized for Serial Tasks; GPU Accelerator, Optimized for Parallel Tasks]
GPU IS ALL ABOUT HIDING LATENCY
GPU Strengths
• High bandwidth main memory
• Significantly more compute resources
• Latency tolerant via parallelism
• High throughput
• High performance/watt
GPU Weaknesses
[Diagram: CPU, Optimized for Serial Tasks; GPU Accelerator, Optimized for Parallel Tasks]
WHY
HOW GPU COMPUTING WORKS
High Performance Computing for Astronomy and Astrophysics
WHERE'S MY DATA?
NOBODY CARES ABOUT FLOPs
REALLY ALMOST NOBODY CARES ABOUT FLOPs
[Diagram: CPU connected to DRAM]
THIS IS COMPUTE INTENSITY
How many operations must I do on some data to make it worth the cost of loading it?
CPU: $\text{Required Compute Intensity} = \dfrac{\text{FLOPs}}{\text{Data Rate}} = 80$
So for every number I load from memory, I need to do 80 operations on it to break even.
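A rough sketch of the break-even arithmetic behind compute intensity. The two inputs (2,000 GFLOP/s of FP64 compute and 200 GB/s of memory bandwidth) are hypothetical figures, chosen only because they reproduce a value of 80; the text here does not state which numbers were used. The function name is also an illustration, not part of any library.

```python
def required_compute_intensity(gflops: float, bandwidth_gb_s: float,
                               bytes_per_value: int = 8) -> float:
    """Operations needed per loaded value for compute to keep pace with memory."""
    # Bandwidth in bytes/s becomes values/s once divided by the value size
    # (8 bytes for a double-precision number).
    giga_values_per_second = bandwidth_gb_s / bytes_per_value
    return gflops / giga_values_per_second


# Hypothetical CPU: 2,000 GFLOP/s and 200 GB/s (assumed, not from the text).
intensity = required_compute_intensity(2000.0, 200.0)
print(intensity)
```

With these assumed inputs the memory system delivers 25 billion doubles per second, so each one has to feed 80 operations before the arithmetic units stop waiting on memory.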
REALLY ALMOST NOBODY CARES ABOUT FLOPs
...BECAUSE WE SHOULD REALLY BE CARING ABOUT MEMORY BANDWIDTH AND LATENCY
DAXPY: aX + Y = Z
[Timeline: load x[0], then load y[0]; the wait that follows is memory latency]
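For reference, a minimal pure-Python sketch of DAXPY (double-precision a·X + Y). It is an illustration only, not an optimized routine. Each iteration has to load one element of x and one of y before it can do a single multiply and add, which is why the operation spends most of its time waiting on memory rather than on arithmetic.

```python
def daxpy(alpha, x, y):
    """Return z where z[i] = alpha * x[i] + y[i] for every index i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Two loads (x[i], y[i]) feed just one multiply and one add per element.
    return [alpha * xi + yi for xi, yi in zip(x, y)]


z = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(z)
```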
SPEED OF LIGHT = 300,000,000 m/s
COMPUTER CLOCK = 3,000,000,000 Hz
SPEED OF ELECTRICITY IN SILICON = 60,000,000 m/s
[Diagram: chip with L3 Cache, labelled 22mm and 32mm; DRAM 50-100mm away]
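The three figures above can be combined into a distance travelled per clock tick. The sketch below is a simplification: it assumes the signal moves at the quoted speed over the whole path, and it ignores everything other than propagation delay (DRAM access time, controllers, queuing). The helper name is hypothetical.

```python
SPEED_OF_LIGHT_M_S = 300_000_000
SPEED_IN_SILICON_M_S = 60_000_000
CLOCK_HZ = 3_000_000_000


def mm_per_clock(speed_m_s: int, clock_hz: int) -> float:
    """Distance in millimetres a signal covers during one clock cycle."""
    return speed_m_s * 1000 / clock_hz


light_mm = mm_per_clock(SPEED_OF_LIGHT_M_S, CLOCK_HZ)      # 100.0 mm
silicon_mm = mm_per_clock(SPEED_IN_SILICON_M_S, CLOCK_HZ)  # 20.0 mm

# Round trip to DRAM placed 50 mm and 100 mm away, in clock cycles.
round_trip_near = 2 * 50 / silicon_mm
round_trip_far = 2 * 100 / silicon_mm
print(light_mm, silicon_mm, round_trip_near, round_trip_far)
```

Under these assumptions an electrical signal covers about 20 mm per clock, so a round trip to DRAM 50-100 mm away costs roughly 5 to 10 cycles before any memory work is done.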
DAXPY: aX + Y = Z
Intel Xeon 8280
Memory latency: 89 ns
COMPARISON OF DAXPY* EFFICIENCY ON DIFFERENT CHIPS
To keep the memory bus busy, we must run $\dfrac{11{,}659}{16} \approx 729$ iterations at once
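One way to read the 729 figure, sketched below. The interpretation is an assumption: 11,659 is taken to be the number of bytes that must be in flight to cover the 89 ns memory latency, and 16 is taken to be the bytes loaded per DAXPY iteration (one 8-byte x[i] plus one 8-byte y[i]). Dividing 11,659 bytes by 89 ns gives an implied bandwidth of 131 GB/s; that bandwidth is inferred here, not quoted from the text.

```python
import math

LATENCY_NS = 89            # Xeon 8280 memory latency quoted earlier
BYTES_IN_FLIGHT = 11_659   # numerator of the fraction
BYTES_PER_ITERATION = 16   # assumed: two 8-byte doubles, x[i] and y[i]

# Round up: a fractional iteration still needs a whole iteration in flight.
iterations = math.ceil(BYTES_IN_FLIGHT / BYTES_PER_ITERATION)

# bytes per nanosecond is numerically the same as GB/s
implied_bandwidth_gb_s = BYTES_IN_FLIGHT / LATENCY_NS
print(iterations, implied_bandwidth_gb_s)
```

This is the same relation as Little's law: data in flight equals throughput multiplied by latency.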
LOOP UNROLLING
void daxpy(int n, double alpha, double *x, double *y)
{
    /* Unrolled by 8; assumes n is a multiple of 8 */
    for (int i = 0; i < n; i += 8)
    {
        y[i+0] = alpha * x[i+0] + y[i+0];
        y[i+1] = alpha * x[i+1] + y[i+1];
        y[i+2] = alpha * x[i+2] + y[i+2];
        y[i+3] = alpha * x[i+3] + y[i+3];
        y[i+4] = alpha * x[i+4] + y[i+4];
        y[i+5] = alpha * x[i+5] + y[i+5];
        y[i+6] = alpha * x[i+6] + y[i+6];
        y[i+7] = alpha * x[i+7] + y[i+7];
    }
}
• Compilers rarely unroll a loop 729 times
• Just one thread issuing all these commands
• One thread cannot hold 729 outstanding loads
THE ONLY OPTION IS THREADS
void daxpy(int n, double alpha, double *x, double *y)
{
    /* "parallel for": iterations are shared among threads (written here with OpenMP) */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
    {
        y[i] = alpha * x[i] + y[i];
    }
}
• Each thread issues load operations independently
• Ideally requires 729 threads
• Limited by max threads & memory requests
COMPARISON OF DAXPY* EFFICIENCY ON DIFFERENT CHIPS
Ampere A100 GPU
• SM 0 ... SM 107, each with regs (256k) and L1$ (192k)
• L2 Cache (40MB)
• HBM Memory (80GB)
• Diagram annotations (B/W, Latency): L1$ 13x, 1x; L2 Cache 3x, 5x; HBM 1x, 15x

Data Location    Bandwidth (GB/sec)    Compute Intensity
L1 Cache         19,400                8
L2 Cache         4,000                 39
HBM              1,555                 100
PCIe             25                    6,240
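The Compute Intensity column can be reproduced from the Bandwidth column, as the sketch below shows. It assumes a peak of 19,500 GFLOP/s, the A100's published FP64 tensor-core rate. That figure does not appear in the text; it is used because it matches all four rows. Values are 8-byte doubles.

```python
PEAK_GFLOPS = 19_500  # assumption: A100 FP64 tensor-core peak, not stated in the text

BANDWIDTH_GB_S = {
    "L1 Cache": 19_400,
    "L2 Cache": 4_000,
    "HBM": 1_555,
    "PCIe": 25,
}


def compute_intensity(peak_gflops: float, bandwidth_gb_s: float,
                      bytes_per_value: int = 8) -> float:
    """FLOPs needed per loaded value to break even at this bandwidth."""
    return peak_gflops / (bandwidth_gb_s / bytes_per_value)


intensities = {
    name: round(compute_intensity(PEAK_GFLOPS, bw))
    for name, bw in BANDWIDTH_GB_S.items()
}
print(intensities)
```

The further the data sits from the SMs, the more work each loaded value has to justify: about 8 operations from L1, but thousands across PCIe.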
Ampere A100 GPU
SM 0 ... SM 107, each with regs (256k) and L1$ (192k); L1$ annotated B/W 13x, Latency 1x

Data Location    Latency (ns)    Threads Required
L1 Cache         27              32,738
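The Threads Required figure for L1 is consistent with the calculation sketched below: bandwidth times latency gives the bytes that must be in flight, and dividing by 16 bytes per thread gives the thread count. The 16 bytes per thread (one DAXPY iteration, two doubles) is an assumption that happens to reproduce the number. The 19,400 GB/s L1 bandwidth is the A100 figure quoted elsewhere in this deck.

```python
def threads_required(bandwidth_gb_s: int, latency_ns: int,
                     bytes_per_thread: int = 16) -> int:
    """Threads needed so that outstanding loads cover the memory latency."""
    # GB/s multiplied by ns gives bytes (the 1e9 factors cancel).
    bytes_in_flight = bandwidth_gb_s * latency_ns
    # Ceiling division without floating point.
    return -(-bytes_in_flight // bytes_per_thread)


l1_threads = threads_required(19_400, 27)
print(l1_threads)
```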
Ampere A100 GPU
SM 0 ... SM 107, each with regs (256k) and L1$ (192k)

A100 Streaming Multiprocessor (SM)
• 64 warps/SM
• 4 schedulers: 4x concurrent warp exec
• Registers: 4 x (16k x 4) = 64k x 4-byte registers
The GPU can switch from one warp to the next in a single clock cycle
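A small sketch of how many threads the A100 can keep resident, using the SM figures above. The warp size of 32 threads is the standard CUDA value and is not stated in the text. The SM count of 108 is read off the diagram labels (SM 0 to SM 107).

```python
WARPS_PER_SM = 64
THREADS_PER_WARP = 32        # standard CUDA warp size; not stated in the text
SM_COUNT = 108               # SM 0 ... SM 107
REGISTERS_PER_SM = 64 * 1024
BYTES_PER_REGISTER = 4

threads_per_sm = WARPS_PER_SM * THREADS_PER_WARP
resident_threads = threads_per_sm * SM_COUNT
register_file_kib = REGISTERS_PER_SM * BYTES_PER_REGISTER // 1024
print(threads_per_sm, resident_threads, register_file_kib)
```

That is 2,048 threads per SM and 221,184 across the chip, with a 256 KiB register file per SM matching the "(256k)" label. Because far more threads are resident than can execute at once, the schedulers always have another warp to switch to while one waits on memory.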
THROUGHPUT VS. LATENCY
[Figure: two routes from San Francisco to NVIDIA, labelled 73 Minutes and 45 Minutes; stops shown: Millbrae, Burlingame, Belmont, Redwood City, Palo Alto, Mountain View, Sunnyvale]
BUT NOT ALL THREADS WANT TO WORK INDEPENDENTLY
In fact, threads are very rarely completely independent
1. Overlay with a grid
2. Operate on blocks within the grid
CUDA’S HIERARCHICAL EXECUTION MODEL
WHERE'S MY DATA?
WHY
HOW GPU COMPUTING WORKS
CUDA
NVIDIA HPC SDK
Download at developer.nvidia.com/hpc-sdk
Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect
HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA
N-WAYS TO GPU PROGRAMMING
Math Libraries | Standard Languages | Directives | CUDA
Standard Languages (C++)
std::transform(par, x, x+n, y, y,
    [=](float x, float y) {
        return y + a*x;
    });

Standard Languages (Fortran)
do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
enddo

Directives
#pragma acc data copy(x,y)
{
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y) {
            return y + a*x;
        });
    ...
}

CUDA
__global__
void saxpy(int n, float a,
           float *x, float *y) {
    int i = blockIdx.x*blockDim.x +
            threadIdx.x;
    if (i < n) y[i] += a*x[i];
}
int main(void) {
    ...
    cudaMallocManaged(&x, ...);
    cudaMallocManaged(&y, ...);
    ...
    saxpy<<<(N+255)/256,256>>>(...,x, y);
    cudaDeviceSynchronize();
    ...
}
SINGLE PRECISION ALPHA X PLUS Y (SAXPY)
GPU SAXPY in multiple languages and libraries
𝒛 = 𝛼𝒙 + 𝒚
𝒙, 𝒚, 𝒛 : vectors
𝛼 : scalar
SAXPY: OPENACC COMPILER DIRECTIVES
Parallel C Code
void saxpy(int n,
           float a,
           float *x,
           float *y)
{
#pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

Parallel Fortran Code
subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
!$acc kernels
    do i=1,n
        y(i) = a*x(i)+y(i)
    enddo
!$acc end kernels
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

www.openacc.org
SAXPY: CUBLAS LIBRARY
Serial BLAS Code
int N = 1<<20;
...

Parallel cuBLAS Code
int N = 1<<20;
cublasInit();
cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
cublasShutdown();

You can also call cuBLAS from Fortran, C++, Python, and other languages:
http://developer.nvidia.com/cublas
SAXPY: CUDA C
Standard C
void saxpy(int n, float a,
           float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

int N = 1<<20;
// Perform SAXPY on 1M elements
saxpy(N, 2.0, x, y);

Parallel C
__global__
void saxpy(int n, float a,
           float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

int N = 1<<20;
// cudaMemcpy takes a size in bytes: N floats are N*sizeof(float) bytes
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
saxpy<<<4096,256>>>(N, 2.0, d_x, d_y);

http://developer.nvidia.com/cuda-toolkit
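To illustrate how a launch such as saxpy<<<4096,256>>> maps threads onto array elements, here is a sequential Python emulation of the grid. It is a sketch for intuition only: a real GPU runs the threads concurrently, and `launch_saxpy` is a hypothetical helper, not a CUDA or Numba API.

```python
def launch_saxpy(n, a, x, y, threads_per_block=256):
    """Emulate a 1-D CUDA grid sequentially; returns the number of blocks used."""
    # Round up so that a partially filled last block still covers the tail.
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            # Same index arithmetic as blockIdx.x*blockDim.x + threadIdx.x
            i = block_idx * threads_per_block + thread_idx
            if i < n:  # guard: threads past the end of the array do nothing
                y[i] = a * x[i] + y[i]
    return num_blocks


n = 1000
x = [1.0] * n
y = [2.0] * n
blocks = launch_saxpy(n, 2.0, x, y)

# For N = 1<<20 and 256 threads per block the grid has 4096 blocks.
blocks_for_1m = ((1 << 20) + 255) // 256
print(blocks, blocks_for_1m)
```

With n = 1000 the grid needs 4 blocks (1,024 threads), and the `if i < n` guard keeps the 24 surplus threads from writing out of bounds.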
SAXPY: CUDA FORTRAN
Standard Fortran
module mymodule
contains
  subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    do i=1,n
      y(i) = a*x(i)+y(i)
    enddo
  end subroutine saxpy
end module mymodule

Parallel Fortran
module mymodule
contains
  attributes(global) subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
    attributes(value) :: a, n
    i = threadIdx%x+(blockIdx%x-1)*blockDim%x
    if (i<=n) y(i) = a*x(i)+y(i)
  end subroutine saxpy
end module mymodule

http://developer.nvidia.com/cuda-fortran
SAXPY: PYTHON
Standard Python
import numpy as np

def saxpy(a, x, y):
    return [a * xi + yi
            for xi, yi in zip(x, y)]

Numba: Parallel Python
import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def saxpy(a, x, y):
    return a * x + y

http://numpy.scipy.org
https://numba.pydata.org
SAXPY: PSTL
Serial C++ Code (with STL and Boost)
int N = 1<<20;
std::vector<float> x(N), y(N);
...

Parallel C++ Code
int N = 1<<20;
std::vector<float> x(N), y(N);
...

www.boost.org/libs/lambda
SAXPY: MATLAB
C with OpenACC
void saxpy(int n,
           float a,
           float *x,
           float *y)
{
#pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

MATLAB
<<initialize>>
p = parpool
parfor i = 1:numel(x)
    y(i) = 2.0 * x(i) + y(i)
end
CHALLENGES WITH COMPLEX SOFTWARE
Current DIY GPU-accelerated AI and HPC deployments can be complex and time consuming to build, test and maintain
Development of software frameworks by the community is moving very fast
[Diagram: software stack of Open Source Frameworks, NVIDIA Libraries, NVIDIA Docker, NVIDIA Driver]
Benefits of Containers:
• Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
• Isolate individual deep learning frameworks and applications
• Share, collaborate, and test applications across different environments
VIRTUAL MACHINES VS. CONTAINERS
MOTIVATION
COMPUTATIONAL SCIENCES
[Diagram: Inputs → Mathematical Model, First Principles → Outputs; labelled "Create"]
CAN THIS WORK ∀? ABSOLUTELY, YES!
Proof: Universal Approximation Theorem
[Diagram: Inputs → Mathematical Model → Outputs, annotated with 𝛼∗, 𝛽∗ and "Create"]
WHAT MAKES AI * HPC SPECIAL?
[Diagram: Inputs → Mathematical Model → Outputs, labelled "Create"; Training with Inputs, Labels and Outputs; Efficient Implementation; Backpropagation]
RECOGNITION/CLASSIFICATION -> FILTER
De-noising gravitational waves
IIT Kharagpur | SKA India | National Supercomputing Mission | Centre for Development of Advanced Computing
[email protected]