
Lecture 1

An introduction to CUDA

Prof Wes Armour


[email protected]

Oxford e-Research Centre


Department of Engineering Science
Learning outcomes
In this first lecture we will set the scene for GPU computing.
You will learn about:

• GPU hardware and how it fits into HPC.

• Different generations of GPUs and current state of the art.

• The design of a GPU.

• How to work with GPUs and the CUDA programming language.

Motivation for GPU computing
Moore’s law (roughly) states that the number of transistors in an integrated circuit will double every two years. Moore made this prediction in 1965!

Over the last decade, we have begun to see an end to this (more on this in Tim Lanfear’s lecture).

This motivates the need for other ways to increase computational performance.

This is where heterogeneous computing (specifically for us, GPU computing) comes in.
High level hardware overview
Heterogeneous computing makes use of more than one type of computer processor (for example, a CPU and a GPU).

Heterogeneous computing is sometimes called hybrid computing or accelerated computing.

Some of the motivations for employing heterogeneous technologies are:

• Significant reductions in floor space needed.

• Energy efficiency.

• Higher throughput.

[Figure: a server – often called a node when a collection of servers forms a cluster – containing 2x CPUs and a bank of GPUs. Image: BiomedNMR, CC BY-SA 4.0 (https://creativecommons.org/licenses/by-sa/4.0).]
High level hardware overview

A typical server configuration is to have one or more CPUs (typically two) communicating with one or more GPUs through the PCIe bus.

The CPU has 10’s of cores; the GPU has 1000’s.

The server in which the GPU sits is often referred to as the “host”.

The GPU itself is often called the “device”.
Generations of GPUs

Several generations of GPUs have been released now, starting in 2007 with the first General-Purpose Graphics Processing Unit (GPGPU) called Tesla. The “Tesla” name has been subsequently used as the identifier for all NVIDIA HPC cards.

Currently, 5 generations of hardware cards are in use, although the Kepler and Maxwell generations are becoming more scarce.

Kepler (compute capability 3.x):
• first released in 2012, including HPC cards.
• excellent double precision arithmetic (DP or fp64).
• our practicals will use K40s and K80s.

Maxwell (compute capability 5.x):
• first released in 2014.
• an architecture for gaming, so poor DP.

Pascal (compute capability 6.x):
• first released in 2016.
• many gaming cards and several HPC cards.

Volta (compute capability 7.x):
• first released end 2017 / start 2018.
• only HPC cards, excellent DP.

Turing (compute capability 7.5):
• first released Q3 2018.
• this is the gaming version of Volta.
• some HPC cards (specifically for AI inference). Poor DP.
Current state of the art

Due to the tailoring of GPUs for AI we now have two generations of GPU meeting the needs of:

• Gamers: RTX cards based on the Turing architecture.

• AI/AR/VR researchers: Titan cards and T4, based on both Volta and Turing architectures.

• Scientific computing researchers: Titan V and V100 cards with good DP.

Currently the biggest differentiating factor between the generations is the double precision performance (Volta = good DP, Turing = bad DP).

The Volta generation has only HPC (Tesla) and “prosumer” (Titan) cards:

• Titan V: 5120 cores, 12GB (£2900)
• Tesla V100: 5120 cores, 16/32GB, PCIe or NVLink (£380)

The Turing generation has cards for gaming (GeForce) and AI cards (Titan and T4):

• GeForce RTX 2060: 1920 cores, 6GB (£300)
• GeForce RTX 2080Ti: 4352 cores, 11GB (£1100)
• Titan RTX: 4608 cores, 24GB (£2400)
• T4: 2560 cores, 16GB, HHHL (half-height, half-length), low power (£2300)
The GPU

[Figures: the GPU chip (sometimes called the die), and a block diagram of the V100 GPU.]
Basic building blocks
The basic building block of a GPU is the “Streaming Multiprocessor” or SM. This contains:

                       Pascal     Volta      Turing
  Cores                64         64         64
  L1 / Shared memory   64 KB      96 KB      64 KB
  L2 cache             4096 KB    6144 KB    6144 KB
  Max # threads        2048       2048       1024
Looking into the SM
A V100 has 80 SMs (see the block diagram).

The GV100 SM incorporates 64 FP32 cores and 32 FP64 cores per SM.

The SM is partitioned into four processing blocks, each with:

• 16 FP32 cores;
• 8 FP64 cores;
• 16 INT32 cores;
• a 64 KB register file;

plus 128 KB of combined L1 (data) cache / shared memory shared across the SM.
Looking into the SM
Let’s look at this in more detail:

• An FP32 core is the execution unit that performs single precision floating point arithmetic (floats).

• An FP64 core performs double precision arithmetic (doubles).

• An INT32 core performs integer arithmetic.

• The warp scheduler selects which warp (group of 32 threads) to send to which execution units (more to come).

• The 64 KB register file – lots of transistors used for very fast memory.

• 128 KB of configurable L1 (data) cache or shared memory. Shared memory is a user-managed cache (more to come).

• LD/ST units load and store data to/from cores.

• SFUs – special function units – compute things like transcendentals.
Different numbers of SMs
Different products have different numbers of SMs; although the number of SMs across the product range varies, the SM itself is the same within each generation.

  Product     Generation   SMs   Bandwidth   Memory   Power
  RTX 2060    Turing       30    336 GB/s    6 GB     160 W
  RTX 2070    Turing       36    448 GB/s    8 GB     175 W
  RTX 2080    Turing       46    448 GB/s    8 GB     215 W
  Titan RTX   Turing       72    672 GB/s    24 GB    250 W
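The number of SMs (along with other per-device limits) can be queried at run time through the CUDA runtime API. Below is a minimal sketch (not from the original slides) using cudaGetDeviceProperties:

#include <stdio.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device name:        %s\n", prop.name);
    printf("Number of SMs:      %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("L2 cache size:      %d KB\n", prop.l2CacheSize / 1024);

    return 0;
}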
Different numbers of SMs
Typically each GPU generation brings improvements in the number of SMs, the bandwidth to device (GPU) memory and the amount of memory on each GPU.

Sometimes NVIDIA use rather confusing naming schemes…

  Product       Generation   SMs   Bandwidth   Memory   Power
  GTX Titan     Kepler       14    288 GB/s    6 GB     230 W
  GTX Titan X   Maxwell      24    336 GB/s    12 GB    250 W
  Titan Xp      Pascal       30    548 GB/s    12 GB    250 W
  Titan V       Volta        80    653 GB/s    12 GB    250 W
  Titan RTX     Turing       72    672 GB/s    24 GB    250 W
Multiple GPU Chips
Some “GPUs” (the actual device that plugs into a PCIe slot) have multiple GPU chips on them…

  Product     Generation   SMs     Bandwidth     Memory      Power
  GTX 590     Fermi        2x 16   2x 164 GB/s   2x 1.5 GB   365 W
  GTX 690     Kepler       2x 8    2x 192 GB/s   2x 2 GB     300 W
  Tesla K80   Kepler       2x 13   2x 240 GB/s   2x 12 GB    300 W
  Tesla M60   Maxwell      2x 16   2x 160 GB/s   2x 8 GB     300 W
Multithreading
A key hardware feature is that the cores in a SM are SIMT (Single Instruction Multiple Threads) cores:

• Groups of 32 cores execute the same instructions simultaneously, but with different data.

• Similar to vector computing on CRAY supercomputers.

• 32 threads all doing the same thing at the same time (threads are in lock-step).

• Natural for graphics processing and much of scientific computing.

• SIMT is also a natural choice for many-core chips to simplify each core.

Multithreading

Having lots of active threads is the key to high performance:

• GPUs do not exploit “context switching”; each thread has its own registers, which limits the number of active threads.

• Threads on each SM execute in groups of 32 called “warps”.

• Execution alternates between “active” warps, with warps becoming temporarily “inactive” when waiting for data.

[Figure: a warp – a group of 32 threads executing together.]
Multithreading

Originally, each thread completed one operation before the next started, in order to avoid the complexity of pipeline overlaps. Some examples of the hazards this avoids are:

• Structural – two instructions need to use the same physical hardware component.

• Data – one instruction needs the result of a previous instruction and has to wait (stall – inefficient).

• Branch – waiting for a conditional branch to complete before we know whether to execute the following instructions.

[Figure: 32 execution pipes, Pipe 0 to Pipe 31.]
Multithreading

NVIDIA relaxed this restriction, so each thread can have multiple independent instructions overlapping, but for our purposes we will assume each instruction within a warp is lock-step.

Memory access from device memory has a delay of 200–400 cycles; with 40 active warps this is equivalent to 5–10 operations, so enough to hide the latency?
Working with GPUs
Recall from earlier in these slides, a GPU is attached to the computer by a PCIe bus.

It also has its own memory (for example, a V100 has 16/32 GB of HBM2 memory).

This means that for us to work with the GPU we need to allocate memory on the card (device) and then transfer data to and from the device.

[Figure: host and device connected by the PCIe bus. By RRZEicons, CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0), from Wikimedia Commons.]
Software view
In a bit more detail, at the top level, we have a master process which runs
on the CPU and performs the following steps:

1. Initialises card.

2. Allocates memory on the host and on the device.

3. Copies data from the host memory to device memory.

4. Launches multiple instances of the execution “kernel” on the device.

5. Copies data from the device memory to the host.

6. Repeat 3-5 as needed.

7. De-allocates all memory and terminates.

Software view
In further detail, within the GPU:

• Each instance of the execution kernel executes on a SM.

• If the number of instances exceeds the number of SMs, then more than one will run at a time on each SM if there are enough registers and shared memory, and the others will wait in a queue and execute later.

• All threads within one instance can access local shared memory but can’t see what the other instances are doing (even if they are on the same SM).

• There are no guarantees on the order in which the instances execute.
Software view - CUDA
CUDA (Compute Unified Device Architecture) provides a set of programming extensions based on the C/C++ family of languages.

If you have a basic understanding of C and understand the concept of threads and SIMD execution, then CUDA is easy to pick up.

FORTRAN support is provided through a compiler from PGI (who are now owned by NVIDIA) and also the IBM XL compiler.

The language is now fairly mature; there is lots of example code available, good documentation, and a large user community on the NVIDIA forums.
Installing CUDA
CUDA is supported on Windows, Linux and MacOSX.
https://developer.nvidia.com/cuda-downloads

Driver
• Low-level software that controls the graphics
card.

Toolkit
• nvcc CUDA compiler.
• Nsight IDE plugin for Eclipse or Visual Studio.
• Profiling and debugging tools.
• Several libraries (more to come on this).

SDK
• Lots of demonstration examples.
• Some error-checking utilities.
• Not officially supported by NVIDIA.
• Sparse documentation.

CUDA Programming
A CUDA program comes in two parts:

1. Host code that executes on the CPU and interfaces to the GPU.

2. Kernel code that runs on the GPU.

At the host level, there is a choice of 2 APIs (Application Programming Interfaces):

1. Runtime
   • simpler, more convenient

2. Driver
   • much more verbose
   • more flexible (e.g. allows run-time compilation)
   • closer to OpenCL

We will only use the runtime API in this course.
CUDA Programming – host code
At the host code level, there are library routines for:

• memory allocation on the graphics card
• data transfer to/from device memory, including
  • constants
  • ordinary data
• error-checking
• timing

There is also a special syntax for launching multiple instances of the kernel process on the GPU.

// Allocate pointers for host and device memory
float *h_input, *h_output;
float *d_input, *d_output;

// malloc() host memory (this is in your RAM)
h_input = (float*) malloc(mem_size);
h_output = (float*) malloc(mem_size);

// allocate device memory input and output arrays
cudaMalloc((void**)&d_input, mem_size);
cudaMalloc((void**)&d_output, mem_size);

// Do something here!

// cleanup memory
free(h_input);
free(h_output);
cudaFree(d_input);
cudaFree(d_output);
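The bullets above list error-checking, but the snippet does not show it. One common pattern (a sketch, not from the original slides; my_kernel, nblocks and nthreads are placeholder names) is to test the cudaError_t returned by each runtime call, and to query cudaGetLastError() after a kernel launch:

cudaError_t err = cudaMalloc((void**)&d_input, mem_size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}

// kernel launches return void, so query the last error instead
my_kernel<<<nblocks, nthreads>>>(d_input, d_output);
err = cudaGetLastError();
if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
}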
CUDA Programming – host code

At the host code level, there are library routines for:

• memory allocation on the graphics card
• data transfer to/from device memory, including
  • constants
  • ordinary data
• error-checking
• timing

There is also a special syntax for launching multiple instances of the kernel process on the GPU.

// Copy host memory to device input array
cudaMemcpy(d_input, h_input, mem_size, cudaMemcpyHostToDevice);

// Do something on the GPU

// copy result from device to host
cudaMemcpy(h_output, d_output, mem_size, cudaMemcpyDeviceToHost);
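Timing is also listed above but not shown. A minimal sketch (not from the original slides) using the runtime event API to time the copies and any kernel work between them:

cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                        // mark the start of the timed region
cudaMemcpy(d_input, h_input, mem_size, cudaMemcpyHostToDevice);
// ... kernel launches would go here ...
cudaMemcpy(h_output, d_output, mem_size, cudaMemcpyDeviceToHost);
cudaEventRecord(stop);                         // mark the end of the timed region

cudaEventSynchronize(stop);                    // wait until the stop event has completed
cudaEventElapsedTime(&ms, start, stop);        // elapsed time in milliseconds
printf("elapsed time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);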
CUDA Programming – host code

At the host code level, there are library routines for:

• memory allocation on the graphics card
• data transfer to/from device memory, including
  • constants
  • ordinary data
• error-checking
• timing
  (error-checking and timing are covered in the practicals)

There is also a special syntax for launching multiple instances of the kernel process on the GPU…

__global__ void helloworld_GPU(void){
    printf("Hello world!\n");
}

int main(void) {
    // run CUDA kernel
    helloworld_GPU<<<1,1>>>();

    return (0);
}
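One practical note (not on the original slide): kernel launches are asynchronous, so in a program this small it is safest to synchronise before main() returns, otherwise the device-side printf output may never be flushed to the terminal. A sketch of the same main() with that added:

int main(void) {
    // run CUDA kernel
    helloworld_GPU<<<1,1>>>();

    // wait for the kernel to finish so its printf output is flushed
    cudaDeviceSynchronize();

    return (0);
}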
CUDA Programming – host code
In its simplest form the special syntax looks like:

kernel_routine<<<gridDim, blockDim>>>(args);

gridDim is the number of instances of the kernel (the “grid” size).

blockDim is the number of threads within each instance (the “block” size).

args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value.

The more general form allows gridDim and blockDim to be 2D or 3D to simplify application programs.

(The helloworld_GPU example on the previous slide is the simplest case of this: a launch with gridDim = 1 and blockDim = 1.)
CUDA Programming
At the lower level, when one instance of the kernel is started on a SM it
is executed by a number of threads, each of which knows about:

• some variables passed as arguments.

• pointers to arrays in device memory (also arguments).

• global constants in device memory.

• shared memory and private registers/local variables.

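As an illustrative sketch of those four kinds of data (the names here are hypothetical, not from the slides), a kernel might look like:

__constant__ float scale;        // global constant in device memory, set from the host with cudaMemcpyToSymbol()

__global__ void example_kernel(int n, const float *in, float *out)   // arguments: a value plus pointers to device arrays
{
    __shared__ float tmp[256];   // shared memory, visible to all threads in this block (assumes blockDim.x <= 256)

    int tid = threadIdx.x + blockDim.x*blockIdx.x;
    float v = 0.0f;              // private register / local variable

    if (tid < n) v = scale * in[tid];
    tmp[threadIdx.x] = v;
    __syncthreads();

    if (tid < n) out[tid] = tmp[threadIdx.x];
}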
CUDA Programming
The CUDA language uses some reserved or special variables. These are:

  Variable     Example       Description
  gridDim      gridDim.x     Size (or dimensions) of the grid of blocks
  blockDim     blockDim.y    Size (or dimensions) of each block
  blockIdx     blockIdx.z    Index (or 2D/3D indices) of the block
  threadIdx    threadIdx.y   Index (or 2D/3D indices) of the thread
  warpSize                   Currently 32 lanes (and has been so far)

gridDim, blockDim, blockIdx and threadIdx can all have an x, y or z component, as in the examples listed.

kernel<<<…>>>(args); is the kernel launch syntax.
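A small sketch (hypothetical, for illustration) that prints these built-in variables from each thread:

__global__ void show_builtins(void)
{
    printf("block (%d,%d,%d) of (%d,%d,%d), thread (%d,%d,%d) of (%d,%d,%d), warpSize=%d\n",
           blockIdx.x, blockIdx.y, blockIdx.z, gridDim.x, gridDim.y, gridDim.z,
           threadIdx.x, threadIdx.y, threadIdx.z, blockDim.x, blockDim.y, blockDim.z,
           warpSize);
}

// launched, for example, as:
show_builtins<<<2,4>>>();      // 2 blocks of 4 threads
cudaDeviceSynchronize();       // wait so the printf output appears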
CUDA Programming
Below is a conceptual example of a 1D grid, comprised of 4 blocks, each having 64 threads (see also the sketch after this list):

• gridDim = 4

• blockDim = 64

• blockIdx ranges from 0 to 3

• threadIdx ranges from 0 to 63

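A minimal sketch matching this configuration (the kernel name and the device array d_out are illustrative):

__global__ void fill_index(int *out)
{
    // global thread index: 0 .. 255 for 4 blocks of 64 threads
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    out[tid] = tid;
}

// launch: gridDim = 4, blockDim = 64, so blockIdx runs 0..3 and threadIdx.x runs 0..63
fill_index<<<4,64>>>(d_out);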
CUDA Programming
The kernel code looks fairly normal once you get used to two things:

Code is written from the point of view of a single thread…

• Quite different to OpenMP multithreading


• Similar to MPI, where you use the MPI “rank” to identify the MPI process
• All local variables are private to that thread

It’s important to think about where each variable lives (more on this in the next lecture)

• Any operation involving data in the device memory forces its transfer to/from registers in the GPU.
• It’s often better to copy the value into a local register variable

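A sketch of the last point (illustrative names, not from the slides): read a device-memory value into a register once, rather than re-reading it inside a loop:

__global__ void scale_rows(int n, const float *coeff, float *x)
{
    int tid = threadIdx.x + blockDim.x*blockIdx.x;

    float c = coeff[blockIdx.x];    // copy from device memory into a local register once
    for (int i = 0; i < n; i++)
        x[tid*n + i] *= c;          // reuse the register; no repeated reads of coeff from device memory
}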
Our first host code
int main() {

float *h_x, *d_x; // h=host, d=device.

int nblocks=2, nthreads=8, nsize=2*8; // 2 blocks, 8 threads each.

h_x = (float *)malloc(nsize*sizeof(float)); // Allocate host memory.


cudaMalloc((void **)&d_x, nsize*sizeof(float)); // Allocate device memory.

my_first_kernel<<<nblocks,nthreads>>>(d_x); // GPU kernel launch.

cudaMemcpy(h_x,d_x,nsize*sizeof(float),cudaMemcpyDeviceToHost); // Copy results back from GPU.

for (int n=0; n<nsize; n++) printf(" n, x = %d %f \n",n,h_x[n]); // Print the results.

cudaFree(d_x); free(h_x); // Free memory on host & device.


}

Our first kernel code
#include <helper_cuda.h>

__global__ void my_first_kernel(float *x)


{
int tid = threadIdx.x + blockDim.x*blockIdx.x;
x[tid] = (float) threadIdx.x;
}

• The __global__ identifier says it’s a kernel function.

• Each thread sets one element of the x array.

• Within each block of threads, threadIdx.x ranges from 0 to blockDim.x-1, so each thread
has a unique value for tid.

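Putting the two together: with the launch parameters from the host code above (nblocks=2, nthreads=8), tid runs from 0 to 15 and each element is set to its thread’s threadIdx.x, so the printed output would be expected to look like:

 n, x = 0 0.000000
 n, x = 1 1.000000
 ...
 n, x = 7 7.000000
 n, x = 8 0.000000
 n, x = 9 1.000000
 ...
 n, x = 15 7.000000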
Quick recap
  Software       Hardware
  Thread         FP32 core
  Thread block   SM
  Grid           Device

A thread executes on a core.

A group of threads – a thread block, comprised of groups of 32 threads (warps) – executes on a SM. Thread blocks don’t migrate between SMs. Several concurrent thread blocks can reside on a SM.

A group of thread blocks forms a grid, and a grid runs on the device. In CUDA this is a kernel launch.
Scaling things up
Suppose we have 1000 blocks, and each one has 128 threads – how would this get executed?

On the Kepler cards that we will use in our practicals, we would probably get 8-12 blocks running on each SM (a Kepler SM has 192 cores), and each block has 4 warps, so 32-48 warps running on each SM.

Each clock tick, the SM warp scheduler decides which warps to execute next, choosing from those not waiting for:

• data coming from device memory (memory latency)

• completion of earlier instructions (pipeline delay)

As a programmer, we don’t need to worry about this level of detail; we just need to ensure there are lots of threads / warps.
Quick recap
Thread blocks are formed from warps.

The warp is executed in parallel on the SM. By this we mean that everything that happens within a warp is lock-step: the same operation (instruction) occurs in every thread of the warp (threads 0 to 31) at the same time.

So we have a Single Instruction Multiple Thread (SIMT) architecture.

[Figure: a thread block divided into warps on a SM – Warp 0 contains threads 0–31, Warp 1 threads 32–63, Warp 2 threads 64–95, Warp 3 threads 96–127.]
Higher dimensions
So far, our simple example considers the case of a 1D grid of blocks, and within each block a 1D set of threads.

Many applications – Finite Element, CFD, MD, … – might need to use 2D or even 3D sets of threads.

As mentioned previously, if we want to use a 2D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices, and to launch the kernel we would use something like:

dim3 nblocks(2,3);     // 2 blocks in x, 3 blocks in y
dim3 nthreads(16,4);   // 16 threads in x, 4 threads in y
my_new_kernel<<<nblocks, nthreads>>>(d_x);

Here, dim3 is a special CUDA datatype with 3 components .x, .y, .z, each initialised to 1.
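To go with that launch, a sketch of what my_new_kernel might look like (illustrative, not from the slides; it assumes d_x has room for the full 32 x 12 grid of threads):

__global__ void my_new_kernel(float *x)
{
    // global 2D indices built from block and thread indices
    int col = threadIdx.x + blockDim.x * blockIdx.x;   // 0 .. 31  (2 blocks of 16 threads)
    int row = threadIdx.y + blockDim.y * blockIdx.y;   // 0 .. 11  (3 blocks of 4 threads)

    int ncols = blockDim.x * gridDim.x;                // width of the whole 2D grid of threads
    x[row*ncols + col] = (float) (row + col);          // row-major addressing
}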
Indexing in higher dimensions
To calculate a unique (or 1D) thread identifier (previously we called this tid) when working in 2D or 3D we simply use:

tid = threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y;

and this is then broken up into warps of size 32 – it is this 1D thread ID that defines how 2D / 3D threads get divided into warps.
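A small sketch (illustrative) showing how a 16x4 block flattens into warps – each thread computes its 1D ID and the warp it falls into:

__global__ void show_warps(void)
{
    int tid = threadIdx.x + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;   // 1D thread ID within the block
    int warp = tid / warpSize;                         // which warp this thread belongs to

    printf("thread (%d,%d) -> tid %2d -> warp %d\n", threadIdx.x, threadIdx.y, tid, warp);
}

// for example:
dim3 nthreads(16,4);
show_warps<<<1, nthreads>>>();   // 64 threads -> tids 0..63 -> warps 0 and 1
cudaDeviceSynchronize();         // flush the device printf output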
Mike’s notes on Practical 1

[Several slides of screenshots from the Practical 1 notes – not reproduced here.]
Arcus HTC

• arcus-htc.arc.ox.ac.uk is the head node.

• The GPU compute nodes have two K80 cards with a total of 4 GPUs, numbered 0 – 3.

• Read the Arcus notes before starting the practical.
Nsight

[Several slides of Nsight screenshots – not reproduced here.]
Key reading
CUDA Programming Guide, version 10.1:
• Chapter 1: Introduction
• Chapter 2: Programming Model
• Chapter 5: performance of different GPUs
• Appendix A: CUDA-enabled GPUs
• Appendix B, sections B.1 – B.4: C language extensions
• Appendix B, section B.20: formatted (printf) output
• Appendix H: compute capabilities (features of different GPUs)

Wikipedia (clearest overview of NVIDIA products):
• https://en.wikipedia.org/wiki/Nvidia_Tesla
• https://en.wikipedia.org/wiki/GeForce_10_series
• https://en.wikipedia.org/wiki/GeForce_20_series

[Slide image: https://www.flickr.com/photos/abee5/8314929977]
What have we learnt?

In this lecture you have learnt about the usage of GPUs in HPC. We have looked at different hardware generations of GPUs and some of their differences.

We’ve looked at the GPU architecture and how GPUs execute code.

Finally, we’ve looked at the CUDA programming language and how to create basic host and device code.
