Lecture 1
GPU programming
Mike Giles
[email protected]
Overview
motivation & objectives
hardware view
software view
CUDA programming
Motivation
Current Intel quad-core Xeon:
4 cores
10-40 GFlops in SP; 10-20 GFlops in DP
20-30 GB/s memory bandwidth
Motivation
Other GPUs?
AMD:
similar to NVIDIA for graphics/games applications
not very focussed on computing side, but supporting
new OpenCL standard
IBM Cell:
proved to be challenging to program
no further development for scientific computing?
Intel Larrabee:
16-32 cores each with a vector unit
project has suffered major delays, first-generation
product has been cancelled
Motivation
My predictions?
GPUs will continue to evolve, with increasing cores and
features to make them more easily programmed
CPUs will also continue to evolve, with increasing core counts
and longer vector lengths, to compete with GPUs
GPUs will stay ahead because:
better bandwidth on graphics card than in
motherboard socket
GPUs aimed at high-end market, whereas CPUs
aimed more at low-end mobile mass market
Objectives
an overview understanding of GPU hardware and
software
hands-on experience of CUDA programming
(very relevant to emerging OpenCL open standard)
learn about some key challenges, and how to approach
the GPU implementation of a new application
learn about resources for future learning
Hardware view
At the top level, a PCIe graphics card with a many-core
GPU and high-speed graphics “device” memory sits inside
a standard PC/server with one or two multicore CPUs:
[Figure: one or two multicore CPUs with DDR3 host memory, connected by PCIe to the graphics card's many-core GPU and its GDDR3/5 device memory]
Hardware view
At the GPU level:
basic building block is a “streaming multiprocessor” with
8 cores, each with 2048 registers
16KB of shared memory
8KB cache for constants held in device memory
8KB cache for textures held in device memory
different chips have different numbers of these SMs:
product SMs bandwidth memory
GTX260 27 110 GB/s 1-2 GB
GTX285 30 160 GB/s 1-2 GB
Tesla M1060 30 102 GB/s 4 GB
Hardware view
Key hardware feature is that the 8 cores in a multiprocessor
are SIMT (Single Instruction Multiple Threads) cores:
all 8 cores execute the same instructions
simultaneously, but with different data
similar to vector computing on CRAY supercomputers
minimum of 4 threads per core, so end up with a
minimum of 32 threads all doing the same thing at
(almost) the same time
natural for graphics processing and much scientific
computing
SIMT is also a natural choice for many-core chips to
simplify each core
Multithreading
Lots of active threads is the key to high performance:
no “context switching”; each thread has its own
registers, which limits the number of active threads
threads execute in “warps” of 32 threads per
multiprocessor (4 per core) – execution alternates
between “active” warps, with warps becoming
temporarily “inactive” when waiting for data
Multithreading
for each thread, one operation completes long before
the next starts – avoids the complexity of pipeline
overlaps which can limit the performance of modern
processors
[Figure: per-thread timelines, with each thread's operations 1-5 spread out in time so that each operation completes before the next one starts]
Software view
At the top level, we have a master process which runs on
the CPU and performs the following steps:
1. initialises card
2. allocates memory in host and on device
3. copies data from host to device memory
4. launches multiple copies of execution “kernel” on device
5. copies data from device memory to host
6. repeats 3-5 as needed
7. de-allocates all memory and terminates
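As a rough sketch of these steps in host code (a hedged illustration: my_kernel, h_data, d_data, nbytes, nblocks and nthreads are made-up names; a complete example appears on the Host code slide later):

// 1. the card is initialised implicitly by the first CUDA call
float *h_data, *d_data;
h_data = (float *)malloc(nbytes);              // 2. allocate memory on host ...
cudaMalloc((void **)&d_data, nbytes);          //    ... and on device
cudaMemcpy(d_data, h_data, nbytes,
           cudaMemcpyHostToDevice);            // 3. copy host -> device
my_kernel<<<nblocks,nthreads>>>(d_data);       // 4. launch copies of the kernel
cudaMemcpy(h_data, d_data, nbytes,
           cudaMemcpyDeviceToHost);            // 5. copy device -> host
cudaFree(d_data); free(h_data);                // 7. de-allocate memory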
Software view
At a lower level, within the GPU:
each copy of the execution kernel executes on an SM
if the number of copies exceeds the number of SMs,
then more than one will run at a time on each SM if
there are enough registers and shared memory, and the
others will wait in a queue and execute later
all threads within one copy can access local shared
memory but can’t see what the other copies are doing
(even if they are on the same SM)
there are no guarantees on the order in which the
copies will execute
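To illustrate the shared-memory point, a small hedged sketch (an illustrative kernel, not part of the practicals): each copy (block) has its own shared array, invisible to the other copies, and reverses only its own section of x.

__global__ void reverse_block(float *x)
{
  __shared__ float tmp[64];              // one copy per block; assumes blockDim.x <= 64
  int i = threadIdx.x;
  int g = blockIdx.x*blockDim.x + i;     // global index of this thread's element
  tmp[i] = x[g];                         // load this block's section into shared memory
  __syncthreads();                       // wait until the whole block has loaded
  x[g] = tmp[blockDim.x-1-i];            // reverse within this block only
}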
CUDA programming
CUDA is NVIDIA’s program development environment:
based on C with some extensions
C++ support increasing steadily
FORTRAN support provided by PGI compiler
basis for OpenCL standard pushed by Apple and
supported by AMD, Intel and IBM
lots of example code and good documentation
– 2-4 week learning curve for those with experience of
OpenMP and MPI programming
large user community on NVIDIA forums
CUDA programming
At the host code level, there are library routines for:
memory allocation on graphics card
data transfer to/from device memory
constants
texture arrays (useful for lookup tables)
ordinary data
error-checking
timing
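Hedged fragments showing what some of these routines look like (variable names here are illustrative):

// memory allocation and ordinary data transfer
cudaMalloc((void **)&d_x, nbytes);
cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);

// setting a constant held in device memory
// (assumes a file-scope declaration:  __constant__ float alpha;)
float h_alpha = 2.0f;
cudaMemcpyToSymbol(alpha, &h_alpha, sizeof(float));

// error-checking
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));

// timing using events
cudaEvent_t start, stop;  float ms;
cudaEventCreate(&start);  cudaEventCreate(&stop);
cudaEventRecord(start);
// ... code being timed ...
cudaEventRecord(stop);   cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);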
CUDA programming
In its simplest form it looks like:
kernel_routine<<<gridDim, blockDim>>>(args);
where
gridDim is the number of copies of the kernel
(the “grid” size)
blockDim is the number of threads within each copy
(the “block” size)
args is a limited number of arguments, usually mainly
pointers to arrays in graphics memory
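For example (with illustrative kernel names), a 1D launch of 4 blocks of 64 threads, and a 2D launch using the built-in dim3 type:

my_kernel<<<4, 64>>>(d_x);          // 4 copies (blocks), 64 threads in each copy

dim3 grid(8,8), block(16,16);       // 2D grid of 2D blocks
my_kernel2d<<<grid, block>>>(d_x);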
CUDA programming
At the lower level, when one copy of the kernel is started
on a multiprocessor it is executed by a number of threads,
each of which knows about:
some variables passed as arguments
pointers to arrays in device memory (also arguments)
global constants in device memory
shared memory and private registers/local variables
some special variables:
gridDim size (or dimensions) of grid of blocks
blockIdx index (or 2D/3D indices) of block
blockDim size (or dimensions) of each block
threadIdx index (or 2D/3D indices) of thread
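These are usually combined to give each thread a unique global index into the data, for example:

int tid = threadIdx.x + blockIdx.x*blockDim.x;   // 1D grid of 1D blocks

int i = threadIdx.x + blockIdx.x*blockDim.x;     // 2D case: unique (i,j) pair
int j = threadIdx.y + blockIdx.y*blockDim.y;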
CUDA programming
The kernel code looks fairly normal once you get used to
two things:
code is written from the point of view of a single thread
quite different to OpenMP multithreading
similar to MPI, where you use the MPI “rank” to
identify the MPI process
all local variables are private to that thread
need to think about where each variable lives
any operation involving data in the device memory
forces its transfer to/from registers in the GPU
there’s no cache (currently) so a second operation
with the same data will force a second transfer
usually better to copy the value into a local register
variable
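An illustration of the last point (the compiler will often do this automatically, but writing it explicitly makes the cost clear):

// two references to x[tid] in device memory
y[tid] = x[tid]*x[tid] + x[tid];

// better: copy the value into a local register variable once and reuse it
float xt = x[tid];
y[tid] = xt*xt + xt;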
Host code
#include <stdlib.h>

__global__ void my_first_kernel(float *x);    // kernel defined on the next slide

int main(int argc, char **argv) {
  float *h_x, *d_x;                           // h=host, d=device
  int   nblocks=2, nthreads=8, nsize=2*8;

  h_x = (float *)malloc(nsize*sizeof(float));
  cudaMalloc((void **)&d_x, nsize*sizeof(float));
  my_first_kernel<<<nblocks,nthreads>>>(d_x);
  cudaMemcpy(h_x,d_x,nsize*sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaFree(d_x); free(h_x);
  return 0;
}
Kernel code
#include <cutil_inline.h>
__global__ void my_first_kernel(float *x)
{
  int tid = threadIdx.x + blockDim.x*blockIdx.x;   // unique global thread index
  x[tid] = threadIdx.x;
}
CUDA distribution
3 components:
graphics driver
toolkit (nvcc compiler and libraries)
SDK code samples and utilities
CUDA Makefile
Two choices:
use nvcc within a standard Makefile
use the special Makefile template provided in the SDK
Practical 1
Things to note:
memory allocation
cudaMalloc((void **)&d_x, nbytes);
data copying
cudaMemcpy(h_x,d_x,nbytes,
cudaMemcpyDeviceToHost);
kernel routine is declared with the __global__ prefix, and is
written from the point of view of a single thread
Practical 1
The second version of the code is very similar to the first, but
uses the CUDA SDK toolkit for various safety checks – this gives
useful feedback in the event of errors.
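The exact macros vary between SDK releases, but the idea is a wrapper along these lines (a minimal sketch, not the SDK's own code; needs stdio.h and stdlib.h):

#define checkCuda(call) {                                      \
  cudaError_t e = (call);                                      \
  if (e != cudaSuccess) {                                      \
    printf("CUDA error: %s at %s:%d\n",                        \
           cudaGetErrorString(e), __FILE__, __LINE__);         \
    exit(1);                                                   \
  }                                                            \
}

checkCuda( cudaMalloc((void **)&d_x, nbytes) );
checkCuda( cudaMemcpy(h_x, d_x, nbytes, cudaMemcpyDeviceToHost) );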
Webpages
NVIDIA’s CUDA homepage:
www.nvidia.com/object/cuda_home.html