HOW GPU COMPUTING WORKS

High Performance Computing for Astronomy and Astrophysics

IIT Kharagpur, India | SKA India | National Supercomputing Mission | Centre for Development of Advanced Computing
EXPANDING UNIVERSE OF HIGH PERFORMANCE COMPUTING

Supercomputing now spans: simulation, data analytics, AI, cloud, extreme IO, streaming, edge appliances, edge visualization, and the network.
HOW GPU ACCELERATION WORKS

Application code is split in two: the compute-intensive functions (often only ~5% of the code) run on the GPU, while the rest of the sequential code runs on the CPU.
ACCELERATED COMPUTING

CPU: optimized for serial tasks. GPU accelerator: optimized for parallel tasks.
SILICON BUDGET

The three components of any processor, and how CPUs and GPUs budget die area differently:

              CPU     GPU
  ALU         less    more
  Control     more    less
  Cache       more    less
This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)
CPU IS A LATENCY-REDUCING ARCHITECTURE

CPU Strengths
• Very large main memory
• Very fast clock speeds
• Latency optimized via large caches
• Small number of threads can run very quickly

CPU Weaknesses
• Relatively low memory bandwidth
• Cache misses very costly
• Low performance/watt
GPU IS ALL ABOUT HIDING LATENCY

GPU Strengths
• High bandwidth main memory
• Significantly more compute resources
• Latency tolerant via parallelism
• High throughput
• High performance/watt

GPU Weaknesses
• Relatively low memory capacity
• Low per-thread performance
WHY

NOBODY CARES ABOUT FLOPs
ALMOST NOBODY REALLY CARES ABOUT FLOPs
Consider a CPU attached to DRAM:

• Compute: 2000 GFLOPs FP64
• Memory bandwidth: 200 GBytes/sec = 25 Giga-FP64/sec (because FP64 = 8 bytes)
THIS IS COMPUTE INTENSITY

How many operations must I do on some data to make it worth the cost of loading it?

Required Compute Intensity = FLOPs / Data Rate = 2000 GFLOPs / 25 Giga-FP64/sec = 80

So for every number I load from memory, I need to do 80 operations on it just to break even.
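To make the arithmetic concrete, here is a minimal C sketch (not from the original slides; it simply re-derives the CPU example above) that computes the required compute intensity:

    #include <stdio.h>

    /* Illustrative only: the break-even compute intensity is the chip's
     * peak FLOP rate divided by how many values its memory can deliver. */
    int main(void) {
        double peak_flops = 2000e9;   /* 2000 GFLOP/s FP64        */
        double mem_bw     = 200e9;    /* 200 GB/s DRAM bandwidth  */
        double word       = 8.0;      /* FP64 = 8 bytes           */

        double loads_per_sec = mem_bw / word;               /* 25 Giga-FP64/s */
        double required_ci   = peak_flops / loads_per_sec;  /* = 80           */

        printf("FP64 values loadable per second: %.0f G\n", loads_per_sec / 1e9);
        printf("Required compute intensity: %.0f FLOPs per value\n", required_ci);
        /* DAXPY (below) does 2 FLOPs per 2 loads: intensity 1, memory bound. */
        return 0;
    }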
The same calculation for real chips (a GPU fed by HBM vs. CPUs fed by DRAM):

                          NVIDIA A100   Intel Xeon 8280   AMD Rome 7742
  Peak FP64 GigaFLOPs     19,500        2,190             2,300
  Memory B/W (GB/sec)     1,555         131               204
  Compute Intensity       100           134               90
ALMOST NOBODY REALLY CARES ABOUT FLOPs

...BECAUSE WE SHOULD REALLY BE CARING ABOUT MEMORY BANDWIDTH

...AND, ABOVE ALL, ABOUT MEMORY LATENCY
DAXPY: aX + Y = Z

    void daxpy(int n, double alpha, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
        {
            y[i] = alpha * x[i] + y[i];
        }
    }

Per element: 2 FLOPs (multiply & add), issued as a single FMA (fused multiply-add) instruction, and 2 memory loads (x[i] and y[i]).

Timeline for one element:

    load x[0] → load y[0] → [memory latency] → x[0] ready → y[0] ready → a*x → +y → result ready
SPEED OF LIGHT = 300,000,000 m/s
COMPUTER CLOCK = 3,000,000,000 Hz

So in 1 clock tick, light travels 100mm (~4 inches).

SPEED OF ELECTRICITY IN SILICON = 60,000,000 m/s

So in 1 clock tick, a signal on silicon travels only 20mm (~0.8 inches) - yet the L3 cache sits roughly 22-32mm from a core, and DRAM is 50-100mm away.
DAXPY: aX + Y = Z (Intel Xeon 8280)

Memory bandwidth: 131 GB/sec
Memory latency: 89 ns

11,659 bytes can be moved in 89 ns (131 GB/s × 89 ns), but daxpy moves only 16 bytes per 89 ns latency.

Memory efficiency = 16 / 11,659 = 0.14%: the memory bus is idle 99.86% of the time.
COMPARISON OF DAXPY* EFFICIENCY ON DIFFERENT CHIPS

                            NVIDIA A100   AMD Rome 7742   Intel Xeon 8280
  Memory B/W (GB/sec)       1,555         204             131
  DRAM Latency (ns)         404           122             89
  Peak bytes per latency    628,220       24,888          11,659
  Memory efficiency         0.0025%       0.064%          0.14%

*daxpy moves 16 bytes per latency
SO WHAT CAN WE DO ABOUT IT?

    void daxpy(int n, double alpha, double *x, double *y)
    {
        for (int i = 0; i < n; i++)
        {
            y[i] = alpha * x[i] + y[i];
        }
    }

11,659 bytes can be moved in 89 ns, but daxpy moves 16 bytes per latency: memory efficiency = 0.14%.

To keep the memory bus busy, we must run 11,659 / 16 ≈ 729 iterations at once.
LOOP UNROLLING

    void daxpy(int n, double alpha, double *x, double *y)
    {
        for (int i = 0; i < n; i += 8)
        {
            y[i+0] = alpha * x[i+0] + y[i+0];
            y[i+1] = alpha * x[i+1] + y[i+1];
            y[i+2] = alpha * x[i+2] + y[i+2];
            y[i+3] = alpha * x[i+3] + y[i+3];
            y[i+4] = alpha * x[i+4] + y[i+4];
            y[i+5] = alpha * x[i+5] + y[i+5];
            y[i+6] = alpha * x[i+6] + y[i+6];
            y[i+7] = alpha * x[i+7] + y[i+7];
        }
    }

But compilers rarely unroll a loop 729 times; this is still just one thread issuing all these instructions, and one thread cannot hold 729 outstanding loads.

To keep the memory bus busy, we must run 11,659 / 16 ≈ 729 iterations at once.
THE ONLY OPTION IS THREADS

    void daxpy(int n, double alpha, double *x, double *y)
    {
        parallel for (int i = 0; i < n; i++)   // pseudocode: one iteration per thread
        {
            y[i] = alpha * x[i] + y[i];
        }
    }

Each thread issues its load operations independently. Ideally this requires 729 threads in flight, limited only by the maximum thread count and the number of outstanding memory requests the hardware supports.

To keep the memory bus busy, we must run 11,659 / 16 ≈ 729 iterations at once.
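A minimal CUDA sketch of this "parallel for", assuming device pointers d_x and d_y have already been allocated and filled (this kernel is not from the slides, but matches the SAXPY kernels shown later in the deck):

    __global__ void daxpy(int n, double alpha, const double *x, double *y)
    {
        // One element per thread: thousands of threads issue their loads
        // at once, keeping the memory bus busy instead of idle.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = alpha * x[i] + y[i];   // one FMA per thread
    }

    // Launch with enough threads to cover n elements - far more than the
    // ~729 outstanding iterations the Xeon's memory bus would need:
    //   daxpy<<<(n + 255) / 256, 256>>>(n, alpha, d_x, d_y);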
COMPARISON OF DAXPY* EFFICIENCY ON DIFFERENT CHIPS

                            NVIDIA A100   AMD Rome 7742   Intel Xeon 8280
  Memory B/W (GB/sec)       1,555         204             131
  DRAM Latency (ns)         404           122             89
  Peak bytes per latency    628,220       24,888          11,659
  Memory efficiency         0.0025%       0.064%          0.14%
  Threads required          39,264        1,556           729
  Threads available         221,184       2,048           896
  Thread ratio              5.6x          1.3x            1.2x

*daxpy moves 16 bytes per latency
BANDWIDTH ACROSS THE AMPERE A100 MEMORY HIERARCHY

[Diagram: 108 SMs, each with 256KB of registers and 192KB of L1 cache; a shared 40MB L2 cache; 80GB of HBM. Moving outward, bandwidth falls roughly 13x : 3x : 1x while latency grows roughly 1x : 5x : 15x.]

  Data Location   Bandwidth (GB/sec)   Compute Intensity
  L1 Cache        19,400               8
  L2 Cache        4,000                39
  HBM             1,555                100
  NVLink          300                  520
  PCIe            25                   6,240
LATENCY ACROSS THE AMPERE A100 MEMORY HIERARCHY

  Data Location   Latency (ns)   Threads Required
  L1 Cache        27             32,738
  L2 Cache        150            37,500
  HBM             404            39,264
  NVLink          700            13,125
  PCIe            1,470          2,297
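Each "threads required" entry is just the bandwidth-delay product divided by the 16 bytes each daxpy iteration keeps in flight. A small C check (values are taken from the HBM row of the table above):

    #include <stdio.h>

    /* Sketch: bytes in flight = bandwidth x latency; divide by the
     * 16 bytes each DAXPY thread loads to get threads required. */
    int main(void) {
        double bw_bytes_per_ns = 1555.0;  /* 1555 GB/s == 1555 bytes/ns */
        double latency_ns      = 404.0;   /* HBM latency                */

        double bytes_in_flight = bw_bytes_per_ns * latency_ns;  /* ~628,220 */
        double threads         = bytes_in_flight / 16.0;        /* ~39,264  */

        printf("bytes in flight: %.0f, threads required: %.0f\n",
               bytes_in_flight, threads);
        return 0;
    }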
INSIDE AN A100 STREAMING MULTIPROCESSOR (SM)

Each of the A100's 108 SMs has:
• 64 warps per SM, with 4 warps executing concurrently (one per scheduler)
• 64k 4-byte registers (4 files of 16k, one per scheduler)
• 192KB of combined shared memory / L1 cache (configurable split)

The GPU runs threads in groups of 32 - each group is known as a warp.
THE GPU'S SECRET SAUCE: OVERSUBSCRIPTION

                    Per SM    On A100 (108 SMs)
  Total Threads     2,048     221,184
  Total Warps       64        6,912
  Active Warps      4         432
  Waiting Warps     60        6,480
  Active Threads    128       13,824
  Waiting Threads   1,920     207,360

The GPU can switch from one warp to the next in a single clock cycle, so the waiting warps hide memory latency for the active ones.
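As a hedged sketch of how you might inspect this oversubscription in practice, the CUDA runtime's occupancy API reports how many blocks of a given kernel can be resident per SM (the daxpy kernel and block size here are illustrative, reusing the sketch from earlier):

    #include <cstdio>

    __global__ void daxpy(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        int blockSize = 256, blocksPerSM = 0;
        // How many 256-thread blocks of daxpy fit on one SM at once?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, daxpy, blockSize, /*dynamicSMem=*/0);

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int residentThreads = blocksPerSM * blockSize * prop.multiProcessorCount;
        printf("Resident threads across the GPU: %d\n", residentThreads);
        return 0;
    }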
THROUGHPUT VS. LATENCY

[Figure: two ways to get from San Francisco to NVIDIA. The train, stopping at Millbrae, Burlingame, Belmont, Redwood City, Palo Alto, Mountain View, and Sunnyvale, takes 73 minutes; driving takes 45 minutes. The car wins on latency, but the train - moving hundreds of people at once - wins on throughput.]
BUT NOT ALL THREADS WANT TO WORK INDEPENDENTLY

In fact, threads are very rarely completely independent:

• Element-wise: DAXPY
• Local: Convolution
• All-to-All: Fourier Transform
1. Overlay the data with a grid.

2. Operate on blocks within the grid: blocks execute independently, and the GPU is oversubscribed with blocks.

3. Many threads work together in each block, for local data sharing.
CUDA'S HIERARCHICAL EXECUTION MODEL

A grid represents all the work to be done.

The grid comprises many blocks, each with an equal number of threads.

Threads within a block run independently, but may synchronize to exchange data.
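A short CUDA sketch of this hierarchy (illustrative, not from the slides): blocks execute independently, while the threads inside one block cooperate through shared memory and __syncthreads(). It assumes a launch with 256 threads per block.

    __global__ void blockSum(const float *in, float *blockResults, int n)
    {
        __shared__ float tile[256];           // visible to one block only
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                      // all threads in block have loaded

        // Tree reduction: threads exchange data through shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockResults[blockIdx.x] = tile[0];  // one partial sum per block
    }

    // Usage (hypothetical sizes): blockSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);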
CUDA
NVIDIA HPC SDK
Download at developer.nvidia.com/hpc-sdk

Develop for the NVIDIA HPC Platform: GPU, CPU and Interconnect
HPC Libraries | GPU Accelerated C++ and Fortran | Directives | CUDA
N-WAYS TO GPU PROGRAMMING
Math Libraries | Standard Languages | Directives | CUDA

GPU-accelerated C++ and Fortran (standard languages):

    std::transform(par, x, x+n, y, y,
        [=](float x, float y) { return y + a*x; });

    do concurrent (i = 1:n)
        y(i) = y(i) + a*x(i)
    enddo

Incremental performance optimization with directives (OpenACC):

    #pragma acc data copy(x,y)
    {
        ...
    }

Maximize GPU performance with CUDA C++/Fortran:

    __global__
    void saxpy(int n, float a, float *x, float *y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] += a*x[i];
    }

    int main(void) {
        cudaMallocManaged(&x, ...);
        cudaMallocManaged(&y, ...);
        ...
        saxpy<<<(N+255)/256,256>>>(..., x, y);
        cudaDeviceSynchronize();
        ...
    }

...plus GPU-accelerated math libraries.
SINGLE PRECISION ALPHA X PLUS Y (SAXPY)
GPU SAXPY in multiple languages and libraries

Part of the Basic Linear Algebra Subroutines (BLAS) library:

    z = αx + y,  where x, y, z are vectors and α is a scalar
SAXPY: OPENACC COMPILER DIRECTIVES

Parallel C Code:

    void saxpy(int n, float a, float *x, float *y)
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

Parallel Fortran Code:

    subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        !$acc kernels
        do i=1,n
            y(i) = a*x(i)+y(i)
        enddo
        !$acc end kernels
    end subroutine saxpy

    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...

www.openacc.org
SAXPY: CUBLAS LIBRARY

Serial BLAS Code:

    int N = 1<<20;
    ...
    // Use your choice of blas library

    // Perform SAXPY on 1M elements
    blas_saxpy(N, 2.0, x, 1, y, 1);

Parallel cuBLAS Code:

    int N = 1<<20;
    cublasInit();
    cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);

    // Perform SAXPY on 1M elements
    cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);

    cublasGetVector(N, sizeof(y[0]), d_y, 1, y, 1);
    cublasShutdown();

You can also call cuBLAS from Fortran, C++, Python, and other languages:
http://developer.nvidia.com/cublas
SAXPY: CUDA C

Standard C:

    void saxpy(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    int N = 1<<20;

    // Perform SAXPY on 1M elements
    saxpy(N, 2.0, x, y);

Parallel C (CUDA):

    __global__
    void saxpy(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }

    int N = 1<<20;
    cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

    // Perform SAXPY on 1M elements
    saxpy<<<4096,256>>>(N, 2.0, d_x, d_y);

    cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

http://developer.nvidia.com/cuda-toolkit
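For reference, a self-contained version of the CUDA fragments above that compiles and runs as-is; the allocation, initialization, and cleanup steps the slide elides are assumptions filled in here:

    #include <cstdio>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        float *x = (float*)malloc(bytes), *y = (float*)malloc(bytes);
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        float *d_x, *d_y;
        cudaMalloc(&d_x, bytes);
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

        saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);

        cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
        printf("y[0] = %f\n", y[0]);   // expect 4.0 = 2.0*1.0 + 2.0

        cudaFree(d_x); cudaFree(d_y); free(x); free(y);
        return 0;
    }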
SAXPY: CUDA FORTRAN

Standard Fortran:

    module mymodule
    contains
        subroutine saxpy(n, a, x, y)
            real :: x(:), y(:), a
            integer :: n, i
            do i=1,n
                y(i) = a*x(i)+y(i)
            enddo
        end subroutine saxpy
    end module mymodule

    program main
        use mymodule
        real :: x(2**20), y(2**20)
        x = 1.0; y = 2.0

        ! Perform SAXPY on 1M elements
        call saxpy(2**20, 2.0, x, y)
    end program main

Parallel (CUDA) Fortran:

    module mymodule
    contains
        attributes(global) subroutine saxpy(n, a, x, y)
            real :: x(:), y(:), a
            integer :: n, i
            attributes(value) :: a, n
            i = threadIdx%x+(blockIdx%x-1)*blockDim%x
            if (i<=n) y(i) = a*x(i)+y(i)
        end subroutine saxpy
    end module mymodule

    program main
        use cudafor; use mymodule
        real, device :: x_d(2**20), y_d(2**20)
        x_d = 1.0; y_d = 2.0

        ! Perform SAXPY on 1M elements
        call saxpy<<<4096,256>>>(2**20, 2.0, x_d, y_d)
    end program main

http://developer.nvidia.com/cuda-fortran
SAXPY: PYTHON

Standard Python:

    import numpy as np

    def saxpy(a, x, y):
        return [a * xi + yi
                for xi, yi in zip(x, y)]

    x = np.arange(2**20, dtype=np.float32)
    y = np.arange(2**20, dtype=np.float32)

    cpu_result = saxpy(2.0, x, y)

Numba: Parallel Python:

    import numpy as np
    from numba import vectorize

    @vectorize(['float32(float32, float32, float32)'], target='cuda')
    def saxpy(a, x, y):
        return a * x + y

    N = 1048576

    # Initialize arrays
    A = np.ones(N, dtype=np.float32)
    B = np.ones(A.shape, dtype=A.dtype)

    # Add arrays on GPU
    C = saxpy(2.0, A, B)

http://numpy.scipy.org | https://numba.pydata.org
SAXPY: PSTL

Serial C++ Code (with STL and Boost):

    int N = 1<<20;
    std::vector<float> x(N), y(N);
    ...

    // Perform SAXPY on 1M elements
    std::transform(x.begin(), x.end(),
                   y.begin(), y.begin(),
                   2.0f * _1 + _2);

Parallel C++ Code:

    int N = 1<<20;
    std::vector<float> x(N), y(N);
    ...

    // Perform SAXPY on 1M elements
    std::transform(std::execution::par,
                   x.begin(), x.end(),
                   y.begin(), y.begin(),
                   2.0f * _1 + _2);

www.boost.org/libs/lambda
SAXPY: MATLAB

Parallel C Code (OpenACC):

    void saxpy(int n, float a, float *x, float *y)
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

Parallel MATLAB Code:

    <<initialize>>
    p = parpool

    parfor i = 1:N
        y(i) = 2.0 * x(i) + y(i)
    end

    <<post process>>
    delete(p)
SAXPY: MATLAB GPU COMPUTING

500+ GPU-enabled MATLAB functions, plus additional GPU-enabled toolboxes: Neural Networks, Image Processing and Computer Vision, Communications, Signal Processing, Stats.

Transfer data to the GPU from computer memory:

    x = gpuArray(x);

Perform the calculation on the GPU:

    X = saxpy(N, 2.0, x, 0, 1, y, 0, 1);

Gather data or plot:

    y = gather(y)
CHALLENGES WITH COMPLEX SOFTWARE

Current DIY GPU-accelerated AI and HPC deployments can be complex and time-consuming to build, test, and maintain.

Development of software frameworks by the community is moving very fast.

Managing driver, library, and framework dependencies requires a high level of expertise.

[Stack: Open Source Frameworks → NVIDIA Libraries → NVIDIA Docker → NVIDIA Driver → NVIDIA GPU]
WHY CONTAINERS?

Benefits of containers:
• Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work
• Isolate individual deep learning frameworks and applications
• Share, collaborate, and test applications across different environments
VIRTUAL MACHINES VS. CONTAINERS: MOTIVATION

• Packaging mechanism for applications
• Consistent and reproducible deployment
• Lightweight, with faster startup than VMs
• Logical isolation from other applications (at the OS level)
COMPUTATIONAL SCIENCES

Inputs → [create a mathematical model from first principles, at some level of approximation] → Outputs

Inputs → [create an efficient implementation] → Outputs

Are there similarities to the shift from feature engineering to network engineering? Can neural networks serve as a porting strategy?
CAN THIS WORK ∀? ABSOLUTELY, YES!
Proof: the Universal Approximation Theorem

Take many non-linearities (one hidden layer is enough!), combine them to form peaks, and assemble your arbitrary function to arbitrary ε.

Problem: this is an essentially useless theorem for practical purposes.
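For reference, one standard (Cybenko-style) statement of the theorem the slide is gesturing at, with a single hidden layer of N units:

    % Universal approximation: for continuous f on a compact set K and a
    % sigmoidal nonlinearity \sigma, weights exist that approximate f to
    % within any \varepsilon:
    \forall \varepsilon > 0 \;\exists N,\ \alpha_i, b_i \in \mathbb{R},\ w_i \in \mathbb{R}^d :
    \quad \sup_{x \in K} \left| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^{\top} x + b_i) \right| < \varepsilon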
WHAT MAKES AI * HPC SPECIAL?

The same pipeline as before - Inputs → mathematical model (some level of approximation) → Outputs - but the efficient implementation is now learned: training labels and a loss function drive backpropagation, with the open question of where a physical "prior" fits in.

• Note: we have more information about the ground truth in AI*HPC (it is often mathematically precise)
• This should actually be an advantage!
• Why does it sometimes feel like a disadvantage?
RECOGNITION/CLASSIFICATION → FILTER
De-noising gravitational waves

At the Laser Interferometer Gravitational-wave Observatory (LIGO), deep learning enables 5000x faster filtering for real-time multi-messenger astronomy.
IS AN ML MODEL USEFUL FOR SCIENCE?
IIT Kharagpur, India | SKA India | National Supercomputing Mission | Centre for Development of Advanced Computing

[email protected]

This material is released by NVIDIA Corporation under the Creative Commons Attribution 4.0 International (CC BY 4.0)