CUDA
1. Introduction
Welcome to the second part of this series of blog posts, where we go behind the scenes of the GPU hardware and CUDA software stack that powers most of deep learning. If you haven't already, please be sure to read the first part of this series. To quickly recap the learning goals of this two-part series, after reading both these posts, you will:
Figure 1 shows a schematic of the P100 SM. The full P100 GPU
contains 56 such SMs.
The recently released Jetson AGX Orin is based on the same Ampere architecture as the A100. As a result, the Orin's GPU is much faster than the Xavier's, and Orin also supports structured sparsity.
TBCs sit between blocks and the grid, so the new hierarchy is Thread → Block → Thread Block Cluster → Grid. The advantage of TBCs is that the blocks in a cluster can cooperate with each other even though they run on different SMs.
Hopper also contains additional mechanisms to let blocks from one cluster share memory among themselves without going through global memory. This is called Distributed Shared Memory. We can see that such features can accelerate attention layers in large transformer models when calculating softmax, for example, since the softmax of a vector element depends not only on its own value but also on the sum of exponentials of all the values in the vector.
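As a rough sketch of how this might look in code (our own example, requiring a Hopper-class GPU and a recent CUDA toolkit, not something from this article), the cooperative groups API exposes clusters and distributed shared memory roughly as follows. The kernel below sums one value per block across a cluster, similar in spirit to accumulating a softmax denominator; the kernel name and the cluster size of 4 are our own choices:

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

//Each cluster contains 4 blocks (a compile-time choice for this sketch)
__global__ void __cluster_dims__(4, 1, 1) cluster_sum_kernel(const float *in, float *out)
{
    __shared__ float block_sum; //lives in this block's shared memory

    cg::cluster_group cluster = cg::this_cluster();

    //Stand-in for a real per-block reduction
    if (threadIdx.x == 0)
        block_sum = in[blockIdx.x];

    cluster.sync(); //make every block's shared memory visible to the whole cluster

    //Block 0 of each cluster reads the other blocks' shared memory directly,
    //without a round trip through global memory
    if (cluster.block_rank() == 0 && threadIdx.x == 0)
    {
        float total = 0.0f;
        for (unsigned int r = 0; r < cluster.num_blocks(); r++)
        {
            float *remote = cluster.map_shared_rank(&block_sum, r);
            total += *remote;
        }
        out[blockIdx.x / 4] = total; //one result per cluster (grid x assumed to be a multiple of 4)
    }

    cluster.sync(); //keep shared memory alive until all remote reads have finished
}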
You should not consider Hopper for your deployments for now unless
you absolutely need the highest possible performance and do not
have any budgetary constraints. However, if you are reading this
article long after it is published, please verify how many of the
features explained above are supported in frameworks and whether
your specific workload or application benefits from them.
Just like CUDA, cuDNN is quite vast. Here, we will take the example of a convolutional layer and review how PyTorch uses cuDNN to implement it. cuDNN exposes the convolution operation through the cudnnConvolutionForward function, which has the following signature:
cudnnStatus_t cudnnConvolutionForward(
    cudnnHandle_t handle,
    const void *alpha,
    const cudnnTensorDescriptor_t xDesc,
    const void *x,
    const cudnnFilterDescriptor_t wDesc,
    const void *w,
    const cudnnConvolutionDescriptor_t convDesc,
    cudnnConvolutionFwdAlgo_t algo,
    void *workSpace,
    size_t workSpaceSizeInBytes,
    const void *beta,
    const cudnnTensorDescriptor_t yDesc,
    void *y);
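Inside PyTorch's backend, the call to this function looks like this: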
CUDNN_ENFORCE(cudnnConvolutionForward(
    state->cudnn_handle(),
    cudnnTypeWrapper<T_X>::kOne(),
    bottom_desc_,
    X.template data<T_X>(),
    filter_desc_,
    filter.template data<T_W>(),
    conv_desc_,
    algo_,
    state->workspace().get(cudnn_ws_nbytes_),
    cudnn_ws_nbytes_,
    cudnnTypeWrapper<T_Y>::kZero(),
    top_desc_,
    Y->template mutable_data<T_Y>()));
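In day-to-day PyTorch code you rarely call this API directly; for example, setting torch.backends.cudnn.benchmark = True lets PyTorch benchmark the available convolution algorithms and pick the algo argument for you.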
You can profile your code with a few simple steps and visualize the results with TensorBoard. First, install the TensorBoard plugin for the PyTorch profiler with
pip install torch_tb_profiler
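The training script below wraps an ordinary training loop with the PyTorch profiler; the record_function() calls label the training and evaluation phases so they are easy to find in the trace.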
from profiler_demo_utils import *
# importing * is not good practice, but simplifies
# this demo. Please do not imitate this 🙂

class VisionTrainer(object):
    def __init__(self, net, dm):
        self.net = net
        self.dm = dm
        self.writer = SummaryWriter()
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.AdamW(self.net.parameters(), lr=1e-6)
        self.savepath = None

    def train(self, epochs, save, profiler=None):
        eval_interval = 200  # evaluate every 200 steps
        self.savepath = save
        device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
        train_loader, valid_loader = self.dm.train_loader, self.dm.valid_loader  # ignore test loader if any

        self.net.to(device).train()

        if has_apex:
            self.net, self.optimizer = amp.initialize(self.net, self.optimizer,
                                                      opt_level='O2', enabled=True)

        step = 0

        get_accuracy = lambda p, y: (torch.argmax(p, dim=1) == y).to(torch.float).mean().item()

        for epoch in range(epochs):
            estart = time.time()
            for x, y in train_loader:
                with record_function("training_events"):  # record these as training_events
                    self.optimizer.zero_grad()

                    x = x.to(device)
                    y = y.to(device)

                    pred = self.net(x)

                    loss = self.criterion(pred, y)

                    # print(loss.item())
                    self.writer.add_scalar('Training Loss', loss.item(), step)

                    with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                        scaled_loss.backward()

                    # loss.backward()

                    torch.nn.utils.clip_grad_norm_(self.net.parameters(), 0.01)
                    self.optimizer.step()
                    acc = get_accuracy(pred, y)
                    step += 1
                    self.writer.add_scalar('Training Accuracy', acc, step)

                if step % eval_interval == 0:
                    with record_function("evaluation_events"):  # record these as evaluation_events
                        self.net.eval()
                        valoss = []
                        vaacc = []
                        with torch.no_grad():
                            for imgs, ys in valid_loader:
                                imgs = imgs.to(device)
                                ys = ys.to(device)
                                preds = self.net(imgs)
                                vacc = get_accuracy(preds, ys)
                                vloss = self.criterion(preds, ys)
                                # pdb.set_trace()
                                valoss.append(vloss.flatten().item())
                                vaacc.append(vacc)

                        self.writer.add_scalar('Validation Loss', np.mean(valoss), step)
                        self.writer.add_scalar('Validation Accuracy', np.mean(vaacc), step)
                        self.net.train()

                if profiler:
                    profiler.step()

            self.save(epoch)
            eend = time.time()
            print('Time taken for last epoch = {:.3f}'.format(eend - estart))

    def save(self, epoch):
        if self.savepath:
            path = self.savepath.format(epoch)
            torch.save(self.net.state_dict(), path)
            print(f'Saved model to {path}')


def main():
    dm = CIFAR10_Manager('./cf10')

    # Just change name to one of the following:
    # resnet18, resnet50, mobilenetv3, densenet, squeezenet, inception
    mname = 'resnet50'
    net = VisionClassifier(nclasses=10, mname=mname)

    trainer = VisionTrainer(net, dm)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 record_shapes=True,
                 schedule=schedule(
                     wait=1,
                     warmup=1,
                     active=2),
                 on_trace_ready=torch.profiler.tensorboard_trace_handler('./runs'),
                 profile_memory=True,
                 use_cuda=True) as prof:

        trainer.train(epochs=1, save='models/cf10_{}.pth', profiler=prof)

if __name__ == '__main__':
    main()
We are now ready to profile and visualize the results. Just run the
training script and open a tensorboard window in your browser.
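For example, since the trace handler above writes to ./runs, something like `tensorboard --logdir=./runs` should bring up the profiler view; adjust the log directory if you pointed tensorboard_trace_handler somewhere else.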
This series is a small step to help you in that direction. We hope you have enjoyed reading the post and learnt something. Please let us know in the comments or on any social media platform which of the topics mentioned here you would like to read more about in the future.
Demystifying GPU Architectures For Deep Learning
Jaiyam Sharma
July 5, 2022
1. Introduction
A few months ago, we covered the launch of NVIDIA’s latest Hopper
H100 GPU for data centres. The Hopper architecture is packed with
features to accelerate various machine learning algorithms. It
continues a now-established trend of NVIDIA adding more AI-specific
functionality to their GPUs. However, we noticed that most deep
learning practitioners and engineers do not understand the specifics
of each architecture and what benefits they bring to the table. Thus,
we decided to write a two-part series of blog posts to fill in the gap.
What you learn in this series of posts will help you to:
Note: Although we will only cover CUDA in this post, other GPU chip makers like AMD and Intel also have similar software stacks (though not as mature as CUDA), and a lot of the concepts discussed here will carry over to ROCm from AMD or oneAPI from Intel.
You have no doubt heard about CUDA and know that it has something to do with NVIDIA GPUs, but you may not know what exactly CUDA is.
Back in the early 2000s, much before the widespread use of GPUs
for machine learning, CPUs used to be the most important hardware
for computing. GPUs were primarily developed for graphics and were
very difficult to use for scientific computing. Unsurprisingly, very few
programmers could write efficient code for using GPUs for non-
graphics related computing. NVIDIA realized that programmers
needed to see GPUs as an essential part of computing and not just
as fancy super specialized pieces of hardware (much like FPGAs are
perceived to this day). Thus, they introduced a new way of thinking
about programming, commonly called a programming model. In this
new programming model, different computations could be performed
on different devices most suited to that task. For example, since
CPUs excel at sequential computations while GPUs, by design, excel
at parallel computations, the programming model introduced ways for
CPUs and GPUs to exchange data and synchronize their operations.
This unified model simplified heterogeneous programming, and NVIDIA called it Compute Unified Device Architecture, or CUDA. So, returning to the question: what is CUDA? It is a unified programming model, or architecture, for heterogeneous computing.
Since CUDA abstracts away most of the inner workings of GPUs, you
can learn to write a simple GPU program in a few minutes. However,
CUDA also exposes relevant functionality for advanced programmers
to truly extract all possible performance from the GPU. Thus, you as
a programmer can continue improving your skills and your programs
over months and years as you become more comfortable with CUDA
computing.
CUDA cores
CUDA kernels
Suppose you want to run matrix multiplication. If you were just using
a CPU, you could write matrix multiplication with for loops that go
through all entries in the matrices and perform the required work.
Thus, the same CPU thread will produce all the entries of the output
matrix. However, on a GPU each CUDA thread will work to produce
only one entry of the output matrix. We need a way to specify the
computation that each CUDA thread should perform. This is done via
special functions known as ‘CUDA kernels’. A kernel is written in
such a way that different threads do the same computation but on
different data. This computing paradigm is called Single Instruction
Multiple Thread or SIMT. In CUDA terminology, you perform
computations on a GPU by ‘launching’ CUDA kernels.
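Before the full matrix multiplication example later in this post, here is a minimal sketch of the idea (a toy example of our own, not from any library): every thread runs the same add_kernel, but uses its block and thread indices to decide which single element of the output it computes.

//A minimal SIMT example: each thread produces exactly one element of c
__global__ void add_kernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; //global index of this thread
    if (i < n)
        c[i] = a[i] + b[i];
}

//'Launching' the kernel: 256 threads per block, enough blocks to cover all n elements
//add_kernel<<<(n + 255) / 256, 256>>>(a, b, c, n);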
Figure 2. A schematic of Streaming Multiprocessor (SM) in the latest
H100 GPU
Shared memory
Registers
So far all the memory types we have discussed are shared between
threads of a warp, block or SM. In contrast, registers are small
memory banks dedicated to each thread. We have discussed how
threads in a block all execute the same instructions. However, the
numerical values of the results of intermediate calculations are
different for every thread. Registers allow threads to store local copies of variables that are visible to only that one thread.
Like all things CUDA, using UM is quite easy, but the more you know about how the hardware and compilers work, the better the performance you can extract from your GPU. In the case of UM, the best performance is achieved by using initialisation kernels on the GPU whenever necessary, minimising page faults and asynchronously prefetching data using `cudaMemPrefetchAsync()`.
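As a small sketch of that pattern (error checking omitted; the kernel, sizes and launch configuration below are placeholders of our own, not part of the article's demo):

//Placeholder kernel: doubles each element in place
__global__ void scale_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void run_with_prefetch(int n)
{
    int device = 0;
    cudaGetDevice(&device); //which GPU we are currently using

    float *data;
    size_t bytes = n * sizeof(float);
    cudaMallocManaged(&data, bytes); //one allocation, visible to both CPU and GPU

    for (int i = 0; i < n; i++)
        data[i] = 1.0f; //touch pages on the CPU (or use an initialisation kernel instead)

    cudaMemPrefetchAsync(data, bytes, device, 0); //migrate pages to the GPU before the kernel needs them
    scale_kernel<<<(n + 255) / 256, 256>>>(data, n);

    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0); //bring the results back for CPU access
    cudaDeviceSynchronize();

    cudaFree(data);
}

The full demo below puts unified memory to work for matrix multiplication.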
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <math.h>
#include <time.h>

//#define VERIFY
//uncomment above to print difference between CPU and GPU calculations

__global__ void matmul_kernel(
    const float* M1,
    const float* M2,
    float* M3,
    const int m,
    const int n,
    const int p
)
{
    /*
    CUDA kernel for matrix multiplication M3 = M1 * M2
    This function will be executed by every CUDA thread
    The instructions are the same, but each thread will work
    on a separate chunk of the data, as specified by the array indices.
    Note that the kernel definition is preceded by the __global__
    qualifier. Further, the kernel function returns nothing (void)
    Thus, we must modify the output matrix M3 within this function.
    The changes made to M3 (or M1 and M2) will all be visible outside
    the kernel to CPU and GPU memory after the kernel has executed.
    */

    //Get the x and y indices of output entry for this thread
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;

    /*
    Wait! what are blockDim, blockIdx and threadIdx??
    These are structs provided by CUDA, which tell the thread
    how many blocks have been launched, what block number does
    the current thread reside in and finally, what is the x and y
    index of the current thread within the block.
    These variables allow each thread to choose which sub-section
    of the A, B and C matrices it should work on and we use them next.
    */

    if ((i>=m)||(j>=p))
    {
        return;
        //this just means that we don't process anything outside the
        //bounds of the output matrix size
    }

    float cout=0.0;
    //this is a local variable we have defined within the thread
    //so, this variable will reside in register memory as explained earlier

    for (int k=0; k<n; k++)
    {
        cout += M1[i*n + k]*M2[k*p + j];
        //loop through elements of one row of M1 and
        //one column of M2, multiply corresponding elements
        //and add them up. We are just doing standard matrix
        //multiplication.
    }

    M3[i*p+j] = cout;
    //here we modify M3
}

int main(int argc, char* argv[])
{
    /*
    In this demo, we will create matrices of size
    A: M x N
    B: N x P
    C: M x P <-- for GPU
    D: M x P <-- for CPU

    We will initialize A, B, C, D and perform matrix multiplications:
    C = A*B (on GPU)
    D = A*B (on CPU)
    */

    if (argc != 4)
    {
        printf("Matrix multiplication example for A[MxN] and B[NxP]\nUsage: cu_mm.out M N P\n");
        exit(1);
    }

    int M=atoi(argv[1]); //2049;
    int N=atoi(argv[2]); //257;
    int P=atoi(argv[3]); //512;

    float *A, *B, *C, *D;

    /*
    Let's use unified memory
    cudaMallocManaged allows us to allocate memory
    once and use it across both CPU and GPU.
    */

    cudaMallocManaged(&A, M*N*sizeof(float));//input Mat1
    cudaMallocManaged(&B, N*P*sizeof(float));//input Mat2

    cudaMallocManaged(&C, M*P*sizeof(float));//output Mat for GPU

    cudaMallocManaged(&D, M*P*sizeof(float));//output Mat for CPU
    //we will do matmul in both CPU and GPU and compare the execution times

    for (int i=0; i<M*N; i++)
    {
        A[i]=sin((float)i/100);
        //init with sine of index, just as an example
    }

    for (int i=0; i<N*P; i++)
    {
        B[i]=cos((float)i/100);
        //init with cosine of index, just as an example
    }

    //C and D can be left uninitialized

    float elapsed_time_gpu=0.0;
    double elapsed_time_cpu=0.0;
    cudaEvent_t gpu_start, gpu_stop;
    struct timespec cpu_start, cpu_stop;

    //BEGIN GPU MATMUL
    dim3 blocks_per_grid(ceil(P/32.0), ceil(M/32.0));
    //grid x covers the P columns (j index), grid y covers the M rows (i index);
    //dividing by 32.0 keeps ceil from being defeated by integer division
    dim3 threads_per_block(32, 32);

    /*
    We use CUDA events to accurately measure the time taken by matmul op
    Refer to page 16 of CUDA C++ Best Practices Guide:
    https://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf
    */
    cudaEventCreate(&gpu_start);
    cudaEventCreate(&gpu_stop);

    cudaEventRecord(gpu_start, 0);
    matmul_kernel<<<blocks_per_grid, threads_per_block>>>(A, B, C, M, N, P);
    cudaEventRecord(gpu_stop, 0);

    cudaEventSynchronize(gpu_stop);
    //END GPU MATMUL

    timespec_get(&cpu_start, TIME_UTC);

    //BEGIN CPU MATMUL
    for (int i=0; i<M; i++)
    {
        for (int j=0; j< P; j++)
        {
            float cout=0.0;

            for(int k=0; k<N; k++)
            {
                cout+=A[i*N+k]*B[k*P+j];
            }

            D[i*P+j]=cout;
        }
    }
    //END CPU MATMUL

    timespec_get(&cpu_stop, TIME_UTC);

    //Measure elapsed times
    cudaEventElapsedTime(&elapsed_time_gpu, gpu_start, gpu_stop);
    elapsed_time_cpu = ((double)(cpu_stop.tv_sec - cpu_start.tv_sec)) * 1000000 + ((double)(cpu_stop.tv_nsec - cpu_start.tv_nsec)) / 1000;
    //tv_nsec is in nanoseconds

    /*
    Define VERIFY above to print diffs for the
    first 100 entries
    you will get all values very close to zero
    */
    #ifdef VERIFY
    for (int i=0; i<100; i++)
    {
        float diff=C[i]-D[i];
        printf("%f, ", diff);
    }
    printf("\n");
    #endif

    //convert microseconds to milliseconds
    printf("Elapsed time (CPU)= %f milliseconds\n", elapsed_time_cpu/1000);
    printf("Elapsed time (GPU)= %f milliseconds\n", elapsed_time_gpu);
    //cudaEventElapsedTime reports time in milliseconds

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    cudaFree(D);
}
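If you want to try this yourself, the source can be compiled with NVIDIA's nvcc compiler, for example `nvcc cu_mm.cu -o cu_mm.out`, and then run as `./cu_mm.out 2049 257 512` (the file name here is our own choice; the matrix sizes are the ones suggested in the code comments).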
Thus, the GPU was ~650 times faster than the CPU! Even if the CPU code were rewritten to use parallel processing across all 16 CPU cores (assuming perfect scaling), the GPU would still be ~40x faster than the CPU.
4. Summary
In this blog post, we have ventured into the technical depths of the hardware and software stack which forms the foundation of deep learning as we know it. Along the way, we looked at:
kernels,
threads,
blocks,
SMs and
various levels of the memory hierarchy.
We are just getting started with this journey. In the second blog post of this series, we will dive deep into hardware features for AI acceleration in recent NVIDIA GPUs. After that, we will return to software and take a look at the cuDNN library. Finally, we will share some practical tips to profile your deep learning code and maximize GPU utilization.