
Demystifying GPU Architectures For Deep Learning – Part 2

1. Introduction

2. AI specific features in recent NVIDIA GPUs


2.1 Pascal microarchitecture (2016)
2.2 Volta microarchitecture (2018)
2.3 Turing microarchitecture (Late 2018)
2.4 Ampere microarchitecture (2020)
2.5 Hopper microarchitecture (2022)

3. cuDNN for framework development


3.1 Who needs cuDNN?
3.2 Convolution in cuDNN

4. GPU performance for deep learning

5. Summary

1. Introduction
Welcome to the second part of this series of blog posts, where we
go behind the scenes of the GPU hardware and CUDA software
stack that powers most of deep learning. If you haven’t already,
please be sure to read the first part of this series. To quickly recap
the learning goals of this two-part series, after reading both posts
you will:

1. Learn how a GPU accelerates AI workloads (covered in part 1)


2. Understand the features of many generations of NVIDIA GPUs
(covered here)
3. Choose the right GPU for your training or inference workload
(covered here)
4. Learn how to write code to maximize GPU utilization (covered
here)

In part 1, we introduced the CUDA programming model in detail and
implemented a dense layer in CUDA via matrix multiplication. This
background will form the basis of the content covered in this post.
Having understood the terminology of CUDA, in this post we will dive
deep into hardware features for AI acceleration in recent NVIDIA
GPUs. In total we will describe the features of five generations of
NVIDIA GPUs released over the last 6 years.

A deep understanding of the hardware features will allow you to
choose the right GPU for your AI workloads, be it on the cloud or at
the edge. After this we will return to the CUDA software stack,
but with a higher level focus on the cuDNN library which enables
easy integration of CUDA into machine learning frameworks. Since
most deep learning practitioners don’t work directly with either CUDA
or cuDNN, we will provide some practical tips to profile and
benchmark your AI software directly from the framework itself. Here,
we will use PyTorch as an example.

2. AI specific features in recent NVIDIA GPUs

In this section, we will analyze the last 5 generations of NVIDIA
GPUs and compare their features for deep learning workloads. We
have crystallized all the relevant information from disparate
documents and whitepapers into an easily digestible package with
the least possible jargon and marketing hype. If you are a senior
deep learning engineer or a manager looking to understand which
GPU will best serve your needs, this section will be especially helpful
to you.

2.1 Pascal microarchitecture (2016)


Figure 1. A schematic of the P100 SM

(Source: NVIDIA P100 whitepaper)

We will begin the analysis with the Pascal microarchitecture.
Introduced in 2016, the Pascal generation P100 was NVIDIA’s first
major datacenter GPU designed for deep learning. The most
important feature in Pascal was the introduction of hardware support
for float16 calculations. Before Pascal, all GPUs were mainly
designed to perform 32-bit or 64-bit floating point calculations, since
these were the precisions most important for gaming and high
performance computing applications. The P100 design also inspired
the well-known Jetson TX2.
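To give a feel for what native FP16 support means at the CUDA level, here is a minimal sketch of our own (not taken from the whitepaper) using the half-precision intrinsics from cuda_fp16.h; on Pascal-class hardware these compile to native FP16 instructions:

#include <cuda_fp16.h>

// Each thread adds one pair of half-precision numbers.
// On Pascal and later, __hadd maps to a native FP16 instruction; packing two
// values into a __half2 and using __hadd2 doubles the throughput.
__global__ void half_add(const __half* a, const __half* b, __half* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hadd(a[i], b[i]);
}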

Figure 1 shows a schematic of the P100 SM. The full P100 GPU
contains 56 such SMs.

 The overall P100 SM is composed of two identical processing
blocks, which is why NVIDIA calls it a ‘multiprocessor’.
 The green blocks represent CUDA cores,
 Yellow blocks represent CUDA cores dedicated for double
precision calculations (almost never used in deep learning, but
widely used in HPC workloads such as fluid dynamics
simulations).
 SFUs are blocks for special function units which compute
functions like sine, cosine, log and exponential. For instance,
SFUs will be used for calculating the sigmoid activation.
 ‘Tex’ represents texture memory which was explained in part 1.
 ‘Register file’ represents register memory which is not shared
between threads.
 LD/ST represents load/store units which are parts of memory
controllers.
 In addition, there are instruction cache, warp scheduler and
dispatch units, all of which are not directly controlled by
programmers but by firmware within the SMs.

Although both the P100 and TX2 continue to be used widely in industry,
they are nearing the end of their lives. You should use this generation
of hardware if the computational requirements for your application
are not expected to increase in the future, or if you are navigating a
chip shortage and this is the only hardware you can get your hands
on. The Pascal SM forms a great starting point for understanding the
next generations.

2.2 Volta microarchitecture (2018)


Figure 2. One processing block of the Volta V100 SM, the full SM is
composed of four such processing blocks and the full GPU contains
80 SMs (Source: NVIDIA V100 whitepaper)

The Volta microarchitecture was first released at the very end of
2017 and became widely available in 2018. The full SM of the V100
is composed of four processing blocks, one of which is shown in
figure 2. The full V100 contains 80 SMs. Armed with the
understanding of the P100 SM, several components are readily
understandable, such as memory registers, load/store units, CUDA
cores and SFUs. The major new innovation in Volta generation was
the introduction of Tensor Cores.

A Volta tensor core is a special type of CUDA core designed for
Multiply Accumulate (MAC) operations of the form D = A x B + C,
where A, B, C and D are all 4×4 matrices.

Figure 3. Volta Tensor Core MAC operations (Source: NVIDIA V100 whitepaper)

MAC calculations are used in most deep learning layers, since
multiplication can be used to implement dense or convolutional layers
and addition can be used to apply a bias. In a Volta TC, A and B must be
FP16 matrices, but C and D can be either FP16 or FP32. In other
words, Tensor Cores accelerate mixed precision operations for deep
learning. The secret to this acceleration is that the hardware is
designed to perform the two operations of multiplication and addition
in one single clock cycle.
These are called Fused Multiply Add (FMA) instructions, as illustrated
in figure 4. With these enhancements, Volta TCs provide up to a 9x
speedup in mixed precision matrix multiplications over Pascal.
Volta generation hardware remains a workhorse for several deep
learning workloads across industries and will remain so for at
least another couple of years.

Figure 4. Multiplication and addition happen in one clock cycle, also known as FMA.
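To make this concrete, below is a minimal sketch of our own (not from the whitepaper) showing how Tensor Cores can be programmed directly through CUDA’s WMMA API: one warp computes a 16×16 tile of D = A x B + C with FP16 inputs and an FP32 accumulator. It assumes compute capability 7.0 or later and matrices that are exactly 16×16:

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B + C on the Tensor Cores.
__global__ void wmma_16x16x16(const half* A, const half* B, const float* C, float* D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                       // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);  // FP32 accumulator

    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);              // fused multiply-accumulate
    wmma::store_matrix_sync(D, c_frag, 16, wmma::mem_row_major);
}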

2.3 Turing microarchitecture (Late 2018)

Turing is a gaming focused GPU architecture very similar to Volta
and was released in late 2018. The most important AI specific feature
in Turing is that the Turing Tensor Cores support the INT8 and INT4 data
types, in addition to the FP16 type supported by Volta. This enables
INT8 inference on Turing GPUs, which is substantially faster than
FP16.
Turing GPUs like the RTX 20 series are quite popular among AI labs
and small teams. In addition, Turing TCs inspired the Jetson AGX
Xavier die. In a previous blog post, we showed that INT8 inference
on the Jetson Xavier was ~2x faster than FP16 and 10x faster than
FP32. These impressive speedups were enabled by Turing
generation tensor cores.
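For readers who have not used INT8 before, the sketch below (our own simplification, not TensorRT’s calibration code) shows the symmetric, per-tensor quantization arithmetic that INT8 inference relies on; once weights and activations are in INT8, the matrix multiplications can run on the INT8 tensor cores:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Pick a single scale so the largest observed magnitude maps to 127.
float choose_scale(const float* x, int n)
{
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++)
        max_abs = std::max(max_abs, std::fabs(x[i]));
    return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

// Quantize a single value to INT8 (symmetric range) and map it back to float.
int8_t quantize(float v, float scale)
{
    int q = static_cast<int>(std::round(v / scale));
    return static_cast<int8_t>(std::min(127, std::max(-127, q)));
}

float dequantize(int8_t q, float scale)
{
    return q * scale;
}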

2.4 Ampere microarchitecture (2020)

Introduced in May 2020, the Ampere microarchitecture is the
successor to Volta. The A100 GPU is the current flagship CUDA
enabled datacenter GPU for deep learning and has 108 SMs. The
two major innovations in Ampere for deep learning are:

2.4.1 Third generation Tensor Cores

Figure 5. Performance comparison between Volta and Ampere tensor cores.

(Source: NVIDIA Ampere whitepaper)


The Volta tensor cores were quite limited in the data types they
supported. Turing improved on this, and with Ampere essentially all
restrictions on the data types supported by tensor cores have been
removed. The third generation of tensor cores in Ampere supports
all the data types commonly used in deep learning: binary, INT4,
INT8, FP16, BF16, TF32 and even FP64. So, with Ampere, deep
learning practitioners do not have to use mixed precision training
to take advantage of tensor cores. This is great because mixed
precision training can sometimes be numerically unstable. With
Ampere, TF32 throughput is up to 20x that of Volta.
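As a small illustration of using tensor cores without mixed precision, the sketch below (ours; error handling omitted) asks cuBLAS to route ordinary FP32 GEMMs through the TF32 tensor core path on Ampere; frameworks flip an equivalent switch internally:

#include <cublas_v2.h>

// Enable the TF32 tensor core path for FP32 GEMMs (Ampere and later).
// Subsequent cublasSgemm calls on this handle may use TF32 tensor cores
// while still accepting and returning ordinary FP32 arrays.
void enable_tf32(cublasHandle_t handle)
{
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
}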

2.4.2 Structured Sparsity (SS)

Figure 6. How structured sparsity in Ampere works (Source: Ampere whitepaper)

SS is a principled approach to take advantage of sparsity in neural
networks. First, recall that almost all commonly used layers in deep
learning can be represented as matrix multiplications. Structured
sparsity works on neural networks which have been pruned in a
specific way, as explained in figure 6.

 First, train a network as usual (without sparsity) until an
acceptable performance is achieved. We now consider one
layer of the trained network. Divide each row of the layer’s weight
matrix into small groups of 4 consecutive weights.
 Then, for each group of 4 weights, zero out the 2 smallest
weights and retain the other 2 (NVIDIA calls this the 2:4 sparsity
pattern). This results in a matrix that has exactly half of its
elements as zeros. A short sketch of this pruning step is shown
after this list.
 There will be some loss in accuracy if we use this pruned weight
matrix as is, but with a bit more fine-tuning of the network, the
non-zero weights can adjust to compensate for the zeroed out
entries.
 The final result of this process is that the fine-tuned weight
matrix has almost the same accuracy as the initial dense matrix
but requires only half the number of multiplications.
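Here is a minimal CPU-side sketch of that pruning step (our own illustration, not NVIDIA’s pruning tooling): for every group of 4 consecutive weights, the 2 smallest magnitudes are zeroed, leaving the matrix in the 2:4 pattern the sparse tensor cores expect:

#include <algorithm>
#include <cmath>
#include <vector>

// Prune a flattened weight matrix into the 2:4 pattern: in every group of
// 4 consecutive weights, keep the 2 largest magnitudes and zero the other 2.
// The real workflow then fine-tunes the network with this mask in place.
void prune_2_of_4(std::vector<float>& w)
{
    for (size_t g = 0; g + 4 <= w.size(); g += 4)
    {
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(w[g + a]) < std::fabs(w[g + b]);
        });
        w[g + idx[0]] = 0.0f;   // zero out the two smallest magnitudes
        w[g + idx[1]] = 0.0f;
    }
}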

Ampere Structured Sparsity codifies this property of sparse networks
into the hardware. During inference, as shown in the blue box of
figure 6, the Ampere hardware just skips multiplications wherever
zeros are present. Thus, the layer is accelerated by 2x by skipping
half of the matrix entries. A couple of things to note here:

 SS can be used with Tensor Cores to achieve a 2x improvement
on top of what the 3rd generation tensor cores already offer.
 SS is not enabled in any of the deep learning frameworks, but it
is exposed to users via TensorRT 8. Let us know in the
comments if you would like to write to us about using SS in
practice.
 Although it can also be used during training, SS is usually
recommended for inference. You may notice performance drops
if you train a model with SS right from the very beginning.

The Ampere microarchitecture was adapted to consumer graphics
cards without any major changes, giving gamers access to some
pretty hefty AI compute. In particular, even the gaming focused RTX
3090 outperforms the flagship datacenter V100 GPU from a couple of
generations ago for most deep learning workflows. If you are short on
budget and don’t need all the enterprise features that are only
available in datacenter GPUs like V100, the 3090 or 3090Ti may be a
good alternative for a small sized team at a research lab or a startup.

The Ampere A100 has inspired the recently released Jetson AGX
Orin. As a result, Orin’s GPU is much faster than Xavier’s, and Orin
also supports structured sparsity.

2.5 Hopper microarchitecture (2022)

We covered the features of Hopper in great detail when it was
announced in March 2022, but as of writing, the first product, the H100,
is not yet generally available. Let us take a close look at the Hopper SM.
Figure 7. The full Hopper H100 SM; the full GPU contains a total of 144 SMs (Source: Hopper whitepaper)


The SM contains the usual suspects, like the floating point compute
units, SFUs and memory controllers, but there are a few new
features (not all are visible in the above figure):

2.5.1 Support for FP8 data format

Figure 8. The two variants of 8 bit floating point introduced by Hopper.

As shown in figure 8, Hopper introduces two variants of 8 bit floating
point precision: E5M2 (5 exponent bits, 2 mantissa bits) and E4M3
(4 exponent bits, 3 mantissa bits). These data types are most useful
for training large language models, the likes of GPT-3 or PaLM. An
advantage of FP8 training is that during inference there is no need
to convert the model to a lower precision; the model can be deployed
as trained, so no accuracy is lost to post-training quantization.
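To get a feel for how few bits an FP8 value carries, here is a small decoder for the E4M3 variant (a sketch of our own; subnormals and the special NaN encoding are ignored, and the standard exponent bias of 7 is assumed):

#include <cmath>
#include <cstdint>

// Decode an E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
// Only normal numbers are handled; subnormals and NaN are left out for brevity.
float decode_e4m3(uint8_t v)
{
    int sign     = (v >> 7) & 0x1;
    int exponent = (v >> 3) & 0xF;
    int mantissa =  v       & 0x7;
    float value  = std::ldexp(1.0f + mantissa / 8.0f, exponent - 7);
    return sign ? -value : value;
}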

2.5.2 Fourth generation tensor cores

The Hopper tensor cores offer double the performance of Ampere
tensor cores at the same clock frequency. Since the H100 supports
FP8, has more SMs and a higher boost frequency, the overall matrix
multiply throughput from H100 is 6x that of A100 (FP8 on Hopper v/s
FP16 on Ampere).

2.5.3 Tensor Memory Accelerators (TMA)

As we noted in part 1, memory access takes much longer than
computation. Memory access can become a bottleneck for large
models with billions of parameters. TMAs sit between the global
memory and shared memory in the CUDA programming model
hierarchy. They accelerate memory transfers by asynchronously
transferring data to shared memory while the CUDA threads do some
other work. This allows data to be available to every thread without
having to wait for a transfer to be completed. Since Hopper isn’t quite
available yet, it is unclear if TMA acceleration will be built into
frameworks like PyTorch and TensorFlow.

2.5.4 DPX instructions for dynamic programming

Dynamic programming is an important class of computational
workload, most commonly used in genomics and robotics. The DPX
instructions accelerate these problems by 7x over A100. NVIDIA
notes that this would be useful, for example, in accelerating the
calculation of optimal paths for robots in a warehouse.

2.5.5 Thread Block Clusters

We discussed earlier that all threads in a block run only on one SM
and can share memory among themselves. As GPUs scale to over a
hundred SMs, more fine grained control over resources is required to
execute massive computing workloads. To this end, Hopper
introduces a new extension of the CUDA programming model, called
Thread Block Clusters (TBCs). In figure 1 of part 1, we saw the
CUDA thread hierarchy goes from Thread → Block → Grid.

TBCs sit between blocks and grid, so the new hierarchy is Thread →
Block → Thread Block Cluster → Grid. The advantage of having
TBCs is that clusters can contain blocks running on different SMs.
Hopper also contains additional mechanisms to let blocks from one
cluster share memory among themselves without going through the
global memory. This is called Distributed Shared Memory. We can
see that such features can accelerate attention layers for large
transformer models when calculating softmax, for example, since the
softmax of a vector element depends not only on its own value but
also the sum of exponentials of all the values in the vector.

2.5.6 Transformer Engine (TE)


Figure 9. Hopper transformer engine (Source: Hopper whitepaper)

This is perhaps the most important feature of the Hopper
architecture. We are covering it at the end because all the previous
background is necessary to understand transformer engines. TEs
neatly integrate all the above new features introduced in Hopper to
accelerate transformer training.

During the forward pass of an attention layer, activations from the
previous layer are used to calculate attention scores using tensor
cores at a mix of FP16 and FP8 precisions. If FP8 is used carelessly,
it could result in loss of accuracy. Particularly, FP16 → FP8
conversion can lead to a large loss in precision as we have fewer bits
to represent a given value. To minimize this loss in precision, a
software module (in the firmware of the SM) analyzes the range of
activations produced by this layer and the next layer (black arrows)
and uses this information to scale the conversion from FP16 → FP8.
Conceptually, this is similar to INT8 calibration performed for
inference.

The range analysis and format conversion allow most computation
and data transfers to be performed in FP8, and only the metadata
about the minimum, maximum, etc. is kept in FP16. Therefore, every
layer is accelerated optimally according to the range of activations it
produces.
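Conceptually, the scaling step works like the sketch below (our own simplification of the idea; the real logic lives in the SM and in NVIDIA’s software stack): the observed activation range determines a scale factor so that values fill FP8’s representable range, and that scale is the high precision metadata carried alongside the FP8 tensor:

#include <algorithm>
#include <cmath>

// Given the largest activation magnitude observed for a layer, choose a scale
// so that scaled values fill the E4M3 range (largest magnitude ~448).
// The scale itself stays in FP16/FP32; only the scaled payload is cast to FP8.
float choose_fp8_scale(const float* activations, int n)
{
    float amax = 0.0f;
    for (int i = 0; i < n; i++)
        amax = std::max(amax, std::fabs(activations[i]));

    const float fp8_max = 448.0f;   // largest normal E4M3 value
    return amax > 0.0f ? fp8_max / amax : 1.0f;
}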

Hopper is at the bleeding edge of AI hardware as of the time of
writing. Even most of NVIDIA’s SDKs (such as TensorRT) have not
yet been updated to take advantage of all of Ampere’s features. This
will change over this year and the next.

You should not consider Hopper for your deployments for now unless
you absolutely need the highest possible performance and do not
have any budgetary constraints. However, if you are reading this
article long after it is published, please verify how many of the
features explained above are supported in frameworks and whether
your specific workload or application benefits from them.

3. cuDNN for framework development

Have you ever wondered why NVIDIA has never developed their own
framework like PyTorch or TensorFlow, despite having the resources
to do it? Couldn’t they optimize performance on their GPUs better
than the PyTorch or TensorFlow developer teams?
This curious situation is explained by a library called cuDNN. We
have reviewed the fundamental CUDA concepts that underlie every
GPU computation. However, the full CUDA library is too complex and
low-level for most programmers to use directly. Knowing this, as early
as 2014, NVIDIA released cuDNN, a C++ library built on top of CUDA
which provides highly optimised routines for frequently used
operations in deep learning.

With cuDNN, a programmer doesn’t have to deal directly with CUDA
cores, SMs, warps, etc. Rather, they can just treat cuDNN functions
as regular C++ functions and call them as they would with any other
library.

cuDNN provides framework developers with an easy way to add
GPU support for their framework without knowing the details of
CUDA and GPU hardware. Thus, by creating a ‘framework for
frameworks’, NVIDIA ensured that their GPUs were supported by all
major AI/ML frameworks and libraries.

3.1 Who needs cuDNN?


Figure 10. Deep learning framework software stack

The most common use case for cuDNN is framework
development, as in TensorFlow or PyTorch. As a framework
developer you will work with cuDNN and almost never directly with
CUDA. The only case for direct CUDA usage is if you are trying to
implement a custom layer or if you want to merge a few layers for
computational efficiency. Therefore, if you are not planning to
develop your own framework, you do not need to know cuDNN.
However, if you are comfortable with C++, learning cuDNN is easy
and will definitely boost your confidence.

3.2 Convolution in cuDNN

Just like CUDA, cuDNN is quite vast. Here, we will take the example
of a convolutional layer and review how PyTorch uses cuDNN to
implement it. cuDNN exposes the convolution operation through
the cudnnConvolutionForward function, which has the following
signature:

cudnnStatus_t cudnnConvolutionForward(
    cudnnHandle_t                       handle,
    const void                         *alpha,
    const cudnnTensorDescriptor_t       xDesc,
    const void                         *x,
    const cudnnFilterDescriptor_t       wDesc,
    const void                         *w,
    const cudnnConvolutionDescriptor_t  convDesc,
    cudnnConvolutionFwdAlgo_t           algo,
    void                               *workSpace,
    size_t                              workSpaceSizeInBytes,
    const void                         *beta,
    const cudnnTensorDescriptor_t       yDesc,
    void                               *y)

Although we have not introduced all the concepts behind cuDNN,
most of the inputs to this function are quite understandable. For
example, cudnnTensorDescriptor_t is a struct describing properties
of the input tensor, cudnnFilterDescriptor_t describes filter properties,
and cudnnConvolutionDescriptor_t specifies properties of the
convolution operation (a minimal setup sketch showing how these
descriptors come together appears after the list below). One important
parameter is cudnnConvolutionFwdAlgo_t: the low level implementation
by which the convolution calculations should be performed. The
following implementations of convolution are available:

 GEMM: This method implements convolution as a matrix-to-matrix
multiplication.
 Implicit GEMM: Similar to the above, except that the matrices being
multiplied are never explicitly created in memory. This saves memory.
 Implicit precomp GEMM: Similar to the above, except that some
commonly required values are pre-calculated. This requires a bit of
extra memory but can save computation time.
 Direct: This method implements convolution with a sliding window
approach and is slower than the GEMM based methods.
 FFT: This method takes advantage of the mathematical relation
between convolution and the Fast Fourier Transform to implement
convolution.
 FFT tiling: The same FFT based approach, applied on tiles of the
input to reduce memory usage.
 Winograd: This method pre-computes a transform of the
convolution kernel and uses it to accelerate the convolution
operation.
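To give a sense of what calling this API involves end to end, here is a heavily simplified setup sketch of our own (all error checking omitted; the tensor sizes are arbitrary examples) that creates the descriptors, picks an algorithm, allocates the workspace and finally calls cudnnConvolutionForward:

#include <cudnn.h>
#include <cuda_runtime.h>

// Convolve a 1x3x224x224 input with 64 3x3 filters (padding 1, stride 1).
// d_x, d_w and d_y are device pointers of the appropriate sizes.
void convolve(const float* d_x, const float* d_w, float* d_y)
{
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t xDesc, yDesc;
    cudnnFilterDescriptor_t wDesc;
    cudnnConvolutionDescriptor_t convDesc;

    cudnnCreateTensorDescriptor(&xDesc);
    cudnnSetTensor4dDescriptor(xDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 3, 224, 224);

    cudnnCreateFilterDescriptor(&wDesc);
    cudnnSetFilter4dDescriptor(wDesc, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW, 64, 3, 3, 3);

    cudnnCreateConvolutionDescriptor(&convDesc);
    cudnnSetConvolution2dDescriptor(convDesc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_FLOAT);

    int n, c, h, w;
    cudnnGetConvolution2dForwardOutputDim(convDesc, xDesc, wDesc, &n, &c, &h, &w);
    cudnnCreateTensorDescriptor(&yDesc);
    cudnnSetTensor4dDescriptor(yDesc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    cudnnConvolutionFwdAlgo_t algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM;

    size_t wsBytes = 0;
    cudnnGetConvolutionForwardWorkspaceSize(handle, xDesc, wDesc, convDesc, yDesc, algo, &wsBytes);
    void* workspace = nullptr;
    cudaMalloc(&workspace, wsBytes);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnConvolutionForward(handle, &alpha, xDesc, d_x, wDesc, d_w, convDesc,
                            algo, workspace, wsBytes, &beta, yDesc, d_y);

    cudaFree(workspace);
    cudnnDestroyTensorDescriptor(xDesc);
    cudnnDestroyTensorDescriptor(yDesc);
    cudnnDestroyFilterDescriptor(wDesc);
    cudnnDestroyConvolutionDescriptor(convDesc);
    cudnnDestroy(handle);
}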

Let us look at how PyTorch uses cuDNN to implement convolution on
the GPU. This link contains the exact location in PyTorch code where
the forward implementation of convolution is described in a file
named `cuda_op_convolution.cu`. A few lines of the code are
reproduced below:

CUDNN_ENFORCE(cudnnConvolutionForward(
    state->cudnn_handle(),
    cudnnTypeWrapper<T_X>::kOne(),
    bottom_desc_,
    X.template data<T_X>(),
    filter_desc_,
    filter.template data<T_W>(),
    conv_desc_,
    algo_,
    state->workspace().get(cudnn_ws_nbytes_),
    cudnn_ws_nbytes_,
    cudnnTypeWrapper<T_Y>::kZero(),
    top_desc_,
    Y->template mutable_data<T_Y>()));

The code just calls cudnnConvolutionForward and passes references
to the input tensor and convolution filters. As a short exercise, take a look
at the equivalent definition of convolution operation in TensorFlow
and try to understand how the forward and backward passes of the
convolutional layer are implemented. OpenCV’s DNN module also
uses cuDNN under the hood with the convolution operation
defined here. In contrast to TensorFlow and PyTorch, OpenCV DNN
does not define backward pass for convolution since OpenCV’s DNN
module supports only inference and not training.

A major advantage of cuDNN is that whenever new hardware such
as tensor cores is added to GPUs, NVIDIA updates cuDNN to take
advantage of that hardware under the hood, and the framework
developers don’t need to modify anything. As a result, end-users of
PyTorch automatically get enhanced performance when they move to
builds with newer versions of CUDA and cuDNN (the PyTorch binaries
bundle cuDNN, so you usually do not need to install it separately).
This is equally true for TensorFlow.

Although cuDNN is quite a low-level library from the point of view of
ML engineers, it is convenient and high-level for framework
engineers, as it gives them lots of flexibility and performance. As a
deep learning engineer, you do not have to worry about all the
intricacies of the above code walkthrough, but we recommend taking
a look at the cuDNN documentation.
4. GPU performance for deep
learning
CUDA has an extensive suite of debugging and profiling tools
like cuda-memcheck, cuda-gdb, nvprof, nsys, ncu to name a
few. Since deep learning practitioners do not work with CUDA and
cuDNN directly, the developers of frameworks have integrated
profiling tools within the frameworks so that users can understand
how well their code is optimized. Here we will take a look at PyTorch
profiling tools.

4.1 How to profile your code for DL training.

You can profile your code with a few simple steps and visualize the
results with Tensorboard. First, install the tensorboard PyTorch
profiler with

pip install torch_tb_profiler

The next step is to slightly modify a typical PyTorch training loop to
profile the resource usage statistics. This is done by creating a
`profile` object to log both CPU and CUDA events and export the logs
into a format that can be read by tensorboard. This allows the profiler
to record both the CPU and GPU parts of the execution and identify
bottlenecks in training.

from profiler_demo_utils import *
# importing * is not good practice, but simplifies
# this demo. Please do not imitate this 🙂


class VisionTrainer(object):
    def __init__(self, net, dm):
        self.net = net
        self.dm = dm
        self.writer = SummaryWriter()
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.AdamW(self.net.parameters(), lr=1e-6)
        self.savepath = None

    def train(self, epochs, save, profiler=None):
        eval_interval = 200  # evaluate every 200 steps
        self.savepath = save
        device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        train_loader, valid_loader = self.dm.train_loader, self.dm.valid_loader  # ignore test loader if any

        self.net.to(device).train()

        if has_apex:
            self.net, self.optimizer = amp.initialize(self.net, self.optimizer,
                                                      opt_level='O2', enabled=True)

        step = 0

        get_accuracy = lambda p, y: (torch.argmax(p, dim=1) == y).to(torch.float).mean().item()

        for epoch in range(epochs):
            estart = time.time()
            for x, y in train_loader:
                with record_function("training_events"):  # record these as training_events
                    self.optimizer.zero_grad()

                    x = x.to(device)
                    y = y.to(device)

                    pred = self.net(x)

                    loss = self.criterion(pred, y)

                    # print(loss.item())
                    self.writer.add_scalar('Training Loss', loss.item(), step)

                    with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                        scaled_loss.backward()

                    # loss.backward()

                    torch.nn.utils.clip_grad_norm_(self.net.parameters(), 0.01)
                    self.optimizer.step()
                    acc = get_accuracy(pred, y)
                    step += 1
                    self.writer.add_scalar('Training Accuracy', acc, step)

                if step % eval_interval == 0:
                    with record_function("evaluation_events"):  # record these as evaluation_events
                        self.net.eval()
                        valoss = []
                        vaacc = []
                        with torch.no_grad():
                            for imgs, ys in valid_loader:
                                imgs = imgs.to(device)
                                ys = ys.to(device)
                                preds = self.net(imgs)
                                vacc = get_accuracy(preds, ys)
                                vloss = self.criterion(preds, ys)
                                # pdb.set_trace()
                                valoss.append(vloss.flatten().item())
                                vaacc.append(vacc)

                        self.writer.add_scalar('Validation Loss', np.mean(valoss), step)
                        self.writer.add_scalar('Validation Accuracy', np.mean(vaacc), step)
                        self.net.train()

                if profiler:
                    profiler.step()

            self.save(epoch)
            eend = time.time()
            print('Time taken for last epoch = {:.3f}'.format(eend - estart))

    def save(self, epoch):
        if self.savepath:
            path = self.savepath.format(epoch)
            torch.save(self.net.state_dict(), path)
            print(f'Saved model to {path}')


def main():
    dm = CIFAR10_Manager('./cf10')

    # Just change name to one of the following:
    # resnet18, resnet50, mobilenetv3, densenet, squeezenet, inception
    mname = 'resnet50'
    net = VisionClassifier(nclasses=10, mname=mname)

    trainer = VisionTrainer(net, dm)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True,
                 schedule=schedule(
                     wait=1,
                     warmup=1,
                     active=2),
                 on_trace_ready=torch.profiler.tensorboard_trace_handler('./runs'),
                 profile_memory=True,
                 use_cuda=True) as prof:

        trainer.train(epochs=1, save='models/cf10_{}.pth', profiler=prof)


if __name__ == '__main__':
    main()

We are now ready to profile and visualize the results. Just run the
training script and open a tensorboard window in your browser.

Figure 11. Screenshot of the PyTorch profiler for ResNet-50 fine-tuning on CIFAR-10.

The profiler gives a very detailed view of the operations performed in
the training loop. The blue box in figure 11 contains many
perspectives you may choose to look at. However, the profiler
provides the most important statistics in the overview section (see
red box). For example in this case the profiler shows that only 30% of
the CUDA kernels run on tensor cores. Moreover, GPU utilization is
just ~18%. The green box shows a recommendation to improve
performance in easy to understand terms. In this case, the profiler
advises us to increase the batch size to better use the GPU.
5. Summary
In this blog post, we built upon the foundation laid in part 1 of this
series and reviewed the most relevant features for deep learning in
recent NVIDIA GPUs.

Starting from the Pascal generation introduced in 2016, we traced the
evolution of GPU hardware and understood exactly how features
such as tensor cores, structured sparsity and transformer engines
work. This knowledge goes far beyond what a typical engineer would
need in their daily job, but it is helpful to know all these details if you
are planning to invest in high-end GPUs for your team or just want to
rent GPU instances on the cloud.

After understanding the details of the hardware features of GPUs, we
learnt about cuDNN, which was developed by NVIDIA to simplify
CUDA integration into deep learning frameworks. cuDNN is used by
framework developers, and we discussed specific examples showing
implementations of the convolutional layer in PyTorch, TensorFlow
and OpenCV’s DNN module.

Finally, we discussed some practical tips for using profiling tools
directly from the framework, in Python, to understand the
performance of your code. We saw the specific case of the PyTorch
profiler.

It is perfectly possible to build a successful career in deep learning
without knowing anything we have described in this short two-part
series of blog posts. However, sooner or later, you will realize that to
advance your career and stand apart from other engineers, you need
to either (a) keep up to date with the latest research papers and
algorithms or (b) develop a much deeper understanding of the tools
of your trade so you can solve problems that others cannot.

This series is a small step to help you in the latter direction. We hope
you have enjoyed reading the post and learnt something. Please let
us know in the comments or on any social media platform which
topics mentioned here you would like to read more about in the
future.
Demystifying GPU Architectures For Deep Learning – Part 1

Jaiyam Sharma
JULY 5, 2022

1. Introduction

2. CUDA programming model


2.1 What is CUDA?
2.2 Introduction to some important CUDA concepts

3. Implementing a dense layer in CUDA

4. Summary

1. Introduction
A few months ago, we covered the launch of NVIDIA’s latest Hopper
H100 GPU for data centres. The Hopper architecture is packed with
features to accelerate various machine learning algorithms. It
continues a now-established trend of NVIDIA adding more AI-specific
functionality to their GPUs. However, we noticed that most deep
learning practitioners and engineers do not understand the specifics
of each architecture and what benefits they bring to the table. Thus,
we decided to write a two-part series of blog posts to fill in the gap.
What you learn in this series of posts will help you to:

1. Understand how a GPU accelerates AI workloads.


2. Understand most CUDA concepts and implement dense layers
in CUDA.
3. Understand the features of many generations of NVIDIA GPUs.
4. Choose the right GPU for your training or inference workload.
5. Learn how to profile your code and maximise GPU utilisation.

We will begin by introducing the CUDA programming model and go
through the most important concepts of CUDA in detail. This will help
you to understand how GPUs work. Then, we will use this
understanding to implement matrix multiplication in C++ with CUDA.
Matrix multiplication forms the bedrock of most deep learning
computations and most commonly used layers such as dense,
convolutional and attention layers can be represented as matrix
multiplies.

Note: Although we will only cover CUDA in this post, other GPU chip
makers like AMD and Intel also have similar software stacks (though
not as mature as CUDA) and a lot of the concepts discussed here will
carry over to ROCm from AMD or oneAPI from Intel.

2. CUDA programming model

2.1 What is CUDA?

You have no doubt heard about CUDA, and know that it has
something to do with NVIDIA GPUs. You may not know what CUDA
exactly is. For example,

 Is CUDA a library that talks to your GPU? If so, is it a C++ or
Python library?
 Is it a compiler for the GPU?
 Is it a driver for the GPU that lets the operating system talk to the
GPU? If so, do gamers need CUDA to run games (the original
use case for GPUs)?

Back in the early 2000s, much before the widespread use of GPUs
for machine learning, CPUs used to be the most important hardware
for computing. GPUs were primarily developed for graphics and were
very difficult to use for scientific computing. Unsurprisingly, very few
programmers could write efficient code for using GPUs for non-
graphics related computing. NVIDIA realized that programmers
needed to see GPUs as an essential part of computing and not just
as fancy super specialized pieces of hardware (much like FPGAs are
perceived to this day). Thus, they introduced a new way of thinking
about programming, commonly called a programming model. In this
new programming model, different computations could be performed
on different devices most suited to that task. For example, since
CPUs excel at sequential computations while GPUs, by design, excel
at parallel computations, the programming model introduced ways for
CPUs and GPUs to exchange data and synchronize their operations.
This unified model simplified heterogeneous programming, and NVIDIA
called it Compute Unified Device Architecture or CUDA. So, returning
back to the question, what is CUDA? It is a unified programming
model, or architecture, for heterogeneous computing.

The CUDA programming model has a programming interface in
C/C++ which allows programmers to write code for both CPU and
GPU computations. This C/C++ interface is most commonly referred
to when people say they are ‘programming in CUDA’. Bindings also
exist for almost all other major languages like Python, Java, MATLAB
and even Fortran. Deep learning frameworks such as TensorFlow or
PyTorch use the C/C++ CUDA interface to implement operations like
matrix multiplications, which forms the backbone of dense,
convolutional, recurrent and attention layers to name a few. CUDA
has been wildly successful in popularizing GPU programming and no
other heterogeneous computing model has the same reach and
popularity among developers as CUDA.

Since CUDA abstracts away most of the inner workings of GPUs, you
can learn to write a simple GPU program in a few minutes. However,
CUDA also exposes relevant functionality for advanced programmers
to truly extract all possible performance from the GPU. Thus, you as
a programmer can continue improving your skills and your programs
over months and years as you become more comfortable with CUDA
computing.

2.2 Introduction to some important CUDA concepts

GPU programming is a vast topic and it is not possible to explain all
CUDA concepts within one blog post. However, in true LearnOpenCV
fashion, we will give you a flavor of some technical aspects in far
greater detail than is expected from a deep learning practitioner while
keeping the explanation easy and digestible. Specifically, we will
explain how GPU hardware is organized from a CUDA programmer’s
perspective. A GPU contains the following hardware blocks:

 CUDA cores

At the heart of the GPU hardware is a unit called a ‘CUDA
core’, which executes a ‘thread’ in software terms. A CUDA core can
execute instructions for multiplying, dividing or calculating special
functions, for example, activation functions. Although there are many
differences between them, it can help to think of a CUDA core as the
GPU equivalent of a CPU core. Although a CUDA core is weaker
than a CPU core, GPUs have thousands of them. For example, even
the consumer grade RTX 3090 GPU has over 10,000 CUDA cores!
However, there are limitations around the instructions CUDA cores
can execute, which we explain next.
Figure 1. Hierarchy of CUDA threads, blocks and grids

(Source: NVIDIA CUDA C++ Programming Guide)

 CUDA blocks and grids

CUDA threads are grouped together into so-called ‘blocks’. All
threads within a block execute the same instructions and all of them
run on the same SM (explained later). The programmer should divide
the computation into blocks and threads. Blocks are further grouped
into entities called CUDA grids. This will all make sense when we
look at a CUDA C++ program at the end of this section.

 CUDA kernels

Suppose you want to run matrix multiplication. If you were just using
a CPU, you could write matrix multiplication with for loops that go
through all entries in the matrices and perform the required work.
Thus, the same CPU thread will produce all the entries of the output
matrix. However, on a GPU each CUDA thread will work to produce
only one entry of the output matrix. We need a way to specify the
computation that each CUDA thread should perform. This is done via
special functions known as ‘CUDA kernels’. A kernel is written in
such a way that different threads do the same computation but on
different data. This computing paradigm is called Single Instruction
Multiple Thread or SIMT. In CUDA terminology, you perform
computations on a GPU by ‘launching’ CUDA kernels.
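As a tiny, self-contained example of our own (not part of the post’s download code), here is what a CUDA kernel and its launch look like for elementwise vector addition; every thread computes exactly one output element:

// Every thread computes one element of c = a + b (SIMT: same code, different data).
__global__ void vector_add(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Launch: 256 threads per block, enough blocks to cover all n elements.
// vector_add<<<(n + 255) / 256, 256>>>(a, b, c, n);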
Figure 2. A schematic of Streaming Multiprocessor (SM) in the latest
H100 GPU

(Source: NVIDIA H100 whitepaper)


 Streaming multiprocessors (SMs)

We have been building our way up the hardware hierarchy. We
started with the smallest unit of computing hardware, called the CUDA
core. We saw how threads are grouped into blocks and blocks into
grids. Further, we saw that the compute instructions for the grid are
specified in C++ functions called CUDA kernels. Streaming
Multiprocessors (SMs) are the second highest layer in the hardware
hierarchy. An SM is a sophisticated processor within the GPU which
contains hardware and software for orchestrating the execution of
hundreds of CUDA threads. Modern GPUs contain several dozens
of SMs. For instance, the RTX 3090 has 82 SMs. For the purposes of
execution, the SM divides blocks of threads into ‘warps’ which are
groups of size 32. SMs which are physically located close to each
other are further grouped into entities called Graphics Processing
Clusters (GPCs).
Figure 3. CUDA memory hierarchy. Global memory is visible to all
threads and blocks but is slow. Shared memory is visible to all
threads within a block and is ~10 times faster than global memory.
Finally, each thread has its own local memory, which is even faster.

(Source: CUDA C++ Programming Guide)

 Global memory or VRAM

So far we have discussed the computational units of a GPU, but the
memory units are equally, if not more, important. Most of the
power consumption and latency of computation occurs due to
memory transfers rather than computation. Therefore, understanding
and optimising memory access latency can speed up your workloads
by orders of magnitude. The highest level of memory hierarchy in a
GPU is the global memory or VRAM as it is called in consumer
GPUs. This is the specification quoted in marketing materials and
what most users understand by GPU memory. We have all
encountered CUDA Out of Memory errors in TensorFlow or PyTorch
at some point. OOM errors occur when the model or the data batch cannot fit within
the global memory of the GPU. This is why large memory allows us
to train larger models with large batch sizes, but CUDA offers
programmers control over a much richer memory hierarchy,
explained next.

 Shared memory

Shared memory is roughly the GPU equivalent of the cache in a CPU.
While writing software to run on CPUs, the programmer has no
control over the cache. CUDA, on the other hand, provides a way for
the programmer to control what data should be present in the cache
and which threads should have access to which data. For the
uninitiated, cache in any computing system is a small patch of
memory which is located physically close to the actual transistors that
are doing the computation.
In CUDA terms, shared memory is located physically close to the
CUDA cores and fetching data from shared memory is at least 10
times faster than from global memory. This can be extremely helpful
in many deep learning workflows. For instance, when applying the
well known gaussian blur filter to an image, every pixel position of the
input image is required by many CUDA threads. In this case, CUDA
allows threads working on nearby pixels to access the data they need
as quickly as possible using shared memory. Shared memory is
visible to threads in the same block.
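To show what this looks like in code, here is a sketch of our own (and a preview of how the naive matrix multiplication in section 3 could be optimized further): each block stages TILE×TILE patches of the inputs in shared memory, so every element is read from global memory once per tile instead of once per thread:

#define TILE 16

// Tiled matrix multiplication C = A (MxN) * B (NxP) using shared memory.
__global__ void matmul_shared(const float* A, const float* B, float* C,
                              int M, int N, int P)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (N + TILE - 1) / TILE; t++)
    {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < N) ? A[row * N + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < N && col < P) ? B[b_row * P + col] : 0.0f;
        __syncthreads();                       // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // wait before overwriting the tile
    }

    if (row < M && col < P)
        C[row * P + col] = acc;
}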

 Read-only cache/Texture memory

As the name suggests, this is a read-only cache located physically
close to the CUDA cores and is shared within a warp. Here, the term
read only implies that the data stored in this memory does not
change during the course of kernel execution. Common image
processing workloads such as scaling 2D or 3D arrays (like image
resizing) greatly benefit from such memory, which is why this is also
called texture memory.

 Registers

So far all the memory types we have discussed are shared between
threads of a warp, block or SM. In contrast, registers are small
memory banks dedicated to each thread. We have discussed how
threads in a block all execute the same instructions. However, the
numerical values of the results of intermediate calculations are
different for every thread. Registers allow threads to store local
copies of variables which are visible to only that one thread.

Although a CUDA programmer does not need to know or care about
registers to write functional CUDA code, it definitely helps to keep in
mind that each SM has a fixed, limited number of registers. A CUDA
kernel which declares a lot of unnecessary local variables can
perform a lot slower than one which uses registers as efficiently as
possible. Typically, register allocation is handled by the compiler
(nvcc) and the only thing a programmer can control is the number
and size of local variables.

 Unified memory (UM)

It can be quite laborious for a programmer to keep track of which
variables belong to which processor. UM is a software functionality in
CUDA which allows the programmer to forget the distinction between
CPU and GPU memory and see all available memory in the system
as one large unified whole. In programming terms this means that
you declare a variable, allocate memory for it once and use it on both
CPU and GPU.

Like all things CUDA, using UM is quite easy, but the more you know
about how the hardware and compilers work, the better performance you
can extract from your GPU. In the case of UM, the best performance is
achieved by using initialisation kernels on the GPU whenever
necessary, optimising page faults and asynchronously prefetching
data using `cudaMemPrefetchAsync()`.
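Below is a minimal sketch of our own (not part of the post’s download code) of the workflow just described: allocate once with cudaMallocManaged, initialise on the CPU, prefetch to the GPU before launching a hypothetical kernel, then prefetch back before reading the result:

#include <cuda_runtime.h>

// A hypothetical kernel that doubles every element in place.
__global__ void scale(float* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // one allocation, visible to CPU and GPU

    for (int i = 0; i < n; i++)                    // initialised on the CPU
        data[i] = 1.0f;

    int device = 0;
    cudaGetDevice(&device);

    // Move the pages to the GPU up front instead of paying page faults inside the kernel.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);
    scale<<<(n + 255) / 256, 256>>>(data, n);

    // Bring the results back before the CPU reads them.
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}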

3. Implementing a dense layer in CUDA

We have discussed the software and hardware features of CUDA
We have discussed the software and hardware features of CUDA
GPUs in great detail. Now it is time to get into some code. We will
build upon the concepts learnt so far to implement matrix
multiplication using CUDA. Matrix multiplication forms the basis for
dense, convolutional, recurrent and attention layers, so this is a really
fundamental workflow used all the time. Although most deep learning
practitioners don’t program in CUDA directly, we would still advise
you to at least read through the code that follows and get a rough
understanding of how it works. The code is heavily commented to
make this easy.

The process of writing matrix multiplication is the following:

1. First, we declare 4 matrices A, B, C and D using the unified memory
feature in CUDA. This allows us to forget all distinction between
CPU and GPU memory and seamlessly access the matrices
wherever necessary.
2. Then, we define a CUDA kernel called matmul_kernel. As
explained earlier, a CUDA kernel is a function which is executed
by all threads.
3. We will initialise the matrices A and B by writing some values
into them.
4. Next, we declare how we will split up the computation across
blocks and threads. This is done using two
variables blocks_per_grid and threads_per_block.
5. After declaring the splits, we launch the kernel and use CUDA
events feature to accurately measure the time it takes to
perform the computation.
6. Finally, we will do the same computation on the CPU, measure
the time taken and print the time taken to the terminal.

#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <math.h>
#include <time.h>

//#define VERIFY
//uncomment above to print difference between CPU and GPU calculations

__global__ void matmul_kernel(
    const float* M1,
    const float* M2,
    float* M3,
    const int m,
    const int n,
    const int p
)
{
    /*
    CUDA kernel for matrix multiplication M3 = M1 * M2
    This function will be executed by every CUDA thread
    The instructions are the same, but each thread will work
    on a separate chunk of the data, as specified by the array indices.
    Note that the kernel definition is preceded by the __global__
    qualifier. Further, the kernel function returns nothing (void)
    Thus, we must modify the output matrix M3 within this function.
    The changes made to M3 (or M1 and M2) will all be visible outside
    the kernel to CPU and GPU memory after the kernel has executed.
    */

    //Get the x and y indices of output entry for this thread
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    /*
    Wait! what are blockDim, blockIdx and threadIdx??
    These are structs provided by CUDA, which tell the thread
    how many blocks have been launched, what block number
    the current thread resides in and finally, what is the x and y
    index of the current thread within the block.
    These variables allow each thread to choose which sub-section
    of the A, B and C matrices it should work on and we use them next.
    */

    if ((i >= m) || (j >= p))
    {
        //don't process anything outside the bounds of the output matrix
        return;
    }

    float cout = 0.0;
    //this is a local variable we have defined within the thread
    //so, this variable will reside in register memory as explained earlier

    for (int k = 0; k < n; k++)
    {
        cout += M1[i*n + k] * M2[k*p + j];
        //loop through elements of one row of M1 and
        //one column of M2, multiply corresponding elements
        //and add them up. We are just doing standard matrix
        //multiplication.
    }

    M3[i*p + j] = cout;
    //here we modify M3
}

int main(int argc, char* argv[])
{
    /*
    In this demo, we will create matrices of size
    A: M x N
    B: N x P
    C: M x P <-- for GPU
    D: M x P <-- for CPU

    We will initialize A, B, C, D and perform matrix multiplications:
    C = A*B (on GPU)
    D = A*B (on CPU)
    */

    if (argc != 4)
    {
        printf("Matrix multiplication example for A[MxN] and B[NxP]\nUsage: cu_mm.out M N P\n");
        exit(1);
    }

    int M = atoi(argv[1]); //2049;
    int N = atoi(argv[2]); //257;
    int P = atoi(argv[3]); //512;

    float *A, *B, *C, *D;

    /*
    Let's use unified memory
    cudaMallocManaged allows us to allocate memory
    once and use it across both CPU and GPU.
    */

    cudaMallocManaged(&A, M*N*sizeof(float)); //input Mat1
    cudaMallocManaged(&B, N*P*sizeof(float)); //input Mat2

    cudaMallocManaged(&C, M*P*sizeof(float)); //output Mat for GPU

    cudaMallocManaged(&D, M*P*sizeof(float)); //output Mat for CPU
    //we will do matmul on both CPU and GPU and compare the execution times

    for (int i = 0; i < M*N; i++)
    {
        A[i] = sin((float)i/100);
        //init with sine of index, just as an example
    }

    for (int i = 0; i < N*P; i++)
    {
        B[i] = cos((float)i/100);
        //init with cosine of index, just as an example
    }

    //C and D can be left uninitialized

    float elapsed_time_gpu = 0.0;
    double elapsed_time_cpu = 0.0;
    cudaEvent_t gpu_start, gpu_stop;
    struct timespec cpu_start, cpu_stop;

    //BEGIN GPU MATMUL
    //round up so the grid covers the whole output matrix:
    //blockIdx.x/threadIdx.x index the P columns, blockIdx.y/threadIdx.y the M rows
    dim3 blocks_per_grid((P + 31) / 32, (M + 31) / 32);
    dim3 threads_per_block(32, 32);

    /*
    We use CUDA events to accurately measure the time taken by the matmul op
    Refer to page 16 of the CUDA C++ Best Practices Guide:
    https://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf
    */
    cudaEventCreate(&gpu_start);
    cudaEventCreate(&gpu_stop);

    cudaEventRecord(gpu_start, 0);
    matmul_kernel<<<blocks_per_grid, threads_per_block>>>(A, B, C, M, N, P);
    cudaEventRecord(gpu_stop, 0);

    cudaEventSynchronize(gpu_stop);
    //END GPU MATMUL

    timespec_get(&cpu_start, TIME_UTC);

    //BEGIN CPU MATMUL
    for (int i = 0; i < M; i++)
    {
        for (int j = 0; j < P; j++)
        {
            float cout = 0.0;

            for (int k = 0; k < N; k++)
            {
                cout += A[i*N + k] * B[k*P + j];
            }

            D[i*P + j] = cout;
        }
    }
    //END CPU MATMUL

    timespec_get(&cpu_stop, TIME_UTC);

    //Measure elapsed times
    cudaEventElapsedTime(&elapsed_time_gpu, gpu_start, gpu_stop);
    elapsed_time_cpu = ((double)(cpu_stop.tv_sec - cpu_start.tv_sec)) * 1000000
                     + ((double)(cpu_stop.tv_nsec - cpu_start.tv_nsec)) / 1000;
    //tv_nsec is in nanoseconds

    /*
    Define VERIFY above to print diffs for the
    first 100 entries
    you will get all values very close to zero
    */
#ifdef VERIFY
    for (int i = 0; i < 100; i++)
    {
        float diff = C[i] - D[i];
        printf("%f, ", diff);
    }
    printf("\n");
#endif

    //convert microseconds to milliseconds
    printf("Elapsed time (CPU)= %f milliseconds\n", elapsed_time_cpu/1000);
    printf("Elapsed time (GPU)= %f milliseconds\n", elapsed_time_gpu);
    //cudaEventElapsedTime reports time in milliseconds

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    cudaFree(D);
}

To keep things simple, we have defined everything in one file. You
can compile the code using the nvcc compiler on any system with
CUDA installed.

nvcc cuda_matmul.cu -lm -o cu_mm.out
./cu_mm.out 2048 256 512

We tested this code on a computer equipped with an AMD Ryzen
5800X CPU and an RTX 3090 GPU, with 32 GB of RAM, running
Ubuntu 20.04.
Figure 4. GPU is ~650x faster than a CPU.

The results were as follows:

 CPU computation time: 681.51 milliseconds


 GPU computation time: 1.047 milliseconds

Thus, the GPU was ~650 times faster than the CPU!! If the CPU
code was to be written to use CPU parallel processing (the CPU has
16 cores and we assume perfect scaling), then the GPU would be
~40x faster than the CPU.

4. Summary
In this blog post, we have ventured into the technical depths of the
hardware and software stack which forms the foundation of deep
learning as we know it.

We started by understanding what exactly CUDA is and went through
the most important CUDA concepts, such as

 kernels,
 threads,
 blocks,
 SMs and
 various levels of the memory hierarchy.

Armed with this understanding, we implemented the forward pass of
a dense layer purely in CUDA. We found that with relatively little
effort, a GPU can accelerate matrix multiplication by hundreds of
times compared to a CPU.

Since the backward pass of a dense layer also requires a matrix
multiply, the same basic idea can be used to implement the backward
pass and all other major types of layers.

We are just getting started with this journey. In the second blog post
of this series, we will dive deep into hardware features for AI
acceleration in recent NVIDIA GPUs. After this we return back to
software and take a look at the cuDNN library. Finally, we will share
some practical tips to profile your deep learning code and maximize
GPU utilization.
