Chapter 5 - General Purpose GPU (GPGPU), CUDA
CUDA
5.1 GPU Architecture
4
Performance factors (1)
5
Performance factors (2)
6
CPUs: 4 key factors
• Parallel processing
– Until relatively recently, each CPU only had a single core. Now
CPUs have multiple cores, where each can process multiple
instructions per cycle
• Clock frequency
– CPUs aim to maximise clock frequency, but this has now hit a
limit due to power restrictions (more later)
• Memory bandwidth
– CPUs use regular DDR memory, which has limited bandwidth
• Memory latency
– Latency from DDR is high, but CPUs strive to hide the latency
through:
– Large on-chip low-latency caches to stage data
– Multithreading
– Out-of-order execution
7
The Problem with CPUs
• The power used by a CPU core is proportional to
Clock Frequency × Voltage²
– e.g. at a fixed frequency, halving the voltage cuts dynamic power by a factor of 4
• In the past, computers got faster by increasing the
frequency
– Voltage was decreased to keep power reasonable.
• Now, voltage cannot be decreased any further
– 1s and 0s in a system are represented by different
voltages
– Reducing overall voltage further would reduce this
difference to a point where 0s and 1s cannot be properly
distinguished
8
Reproduced from http://queue.acm.org/detail.cfm?id=2181798
9
The Problem with CPUs
• Instead, performance increases can be achieved
through exploiting parallelism
• Need a chip which can perform many parallel
operations every clock cycle
– Many cores and/or many operations per core
• Want to keep power/core as low as possible
• Much of the power expended by CPU cores is on
functionality not generally that useful for HPC
– e.g. branch prediction
10
Accelerators
• So, for HPC, we want chips with simple, low power,
number-crunching cores
• But we need our machine to do other things as well
as the number crunching
– Run an operating system, perform I/O, set up calculation
etc
• Solution: “Hybrid” system containing both CPU and
“accelerator” chips
11
Accelerators
• It costs a huge amount of money to design and
fabricate new chips
– Not feasible for relatively small HPC market
• Luckily, over the last few years, Graphics
Processing Units (GPUs) have evolved for the
highly lucrative gaming market
– And largely possess the right characteristics for HPC
– Many number-crunching cores
12
AMD 12-core CPU
• Not much space on CPU is dedicated to compute
[die photo legend: highlighted area = compute unit (= core)]
13
NVIDIA Pascal GPU
• GPU dedicates much more space to compute
– At expense of caches, controllers, sophistication etc
[die photo legend: highlighted area = compute unit (= SM = 64 CUDA cores)]
14
GPUs: 4 key factors
• Parallel processing
– GPUs have a much higher extent of parallelism than
CPUs: many more cores (high-end GPUs have thousands
of cores).
• Clock frequency
– GPUs typically have lower clock-frequency than CPUs,
and instead get performance through parallelism.
• Memory bandwidth
– GPUs use high bandwidth GDDR or HBM2 memory.
• Memory latency
– Memory latency from GDDR/HBM2 is similar to that of DDR.
– GPUs hide latency through very high levels of
multithreading.
15
NVIDIA Tesla Series GPU
16
Performance trends
17
Programming GPU
• GPUs cannot be used instead of CPUs
– They must be used together
– GPUs act as accelerators
– Responsible for the computationally expensive parts of the code
18
GPU Accelerated Systems
• CPUs and GPUs are used together
– Communicate over PCIe bus
– Or, in case of newest Pascal P100 GPUs, NVLINK (more later)
[diagram: CPU with DRAM and I/O, connected over PCIe to a GPU with GDRAM/HBM2 and I/O]
19
Scaling to larger systems
• Can have multiple CPUs and GPUs within each “workstation” or
“shared memory node”
– E.g. 2 CPUs +2 GPUs (below)
– CPUs share memory, but GPUs do not
[diagram: two CPUs sharing DRAM, each attached over PCIe to its own GPU with GDRAM/HBM2; an interconnect allows multiple such nodes to be connected]
20
GPU Accelerated Supercomputer
21
Summary
• GPUs have higher compute and memory bandwidth
capabilities than CPUs
– Silicon dedicated to many simplistic cores
– Use of high bandwidth graphics or HBM2 memory
• Accelerators are typically not used alone, but work
in tandem with CPUs
• Most common are NVIDIA GPUs
– AMD also make high-performance GPUs, but these are less
widely used, largely due to less mature programming support
• GPU accelerated systems scale from simple
workstations to large-scale supercomputers
22
5.2 Introduction to CUDA Programming
23
What is CUDA?
Programming system for machines with GPUs
• Programming Language
• Compilers
• Runtime Environments
• Drivers
• Hardware
24
CUDA: Heterogeneous Parallel Computing
CPU optimized for fast single-thread execution
• Cores designed to execute 1 or 2 threads concurrently
• Large caches attempt to hide DRAM access times
• Cores optimized for low-latency cache accesses
• Complex control logic for speculation and out-of-order execution
25
CUDA C/C++ Program
Serial code executes in a Host (CPU) thread
Parallel code executes in many concurrent Device (GPU) threads
across multiple parallel processing elements
Host = CPU
Device = GPU
26
Compiling CUDA C/C++ Programs
// foo.cpp
int foo(int x)
{
    ...
}
float bar(float x)
{
    ...
}

// saxpy.cu
__global__ void saxpy(int n, float ...)
{
    int i = threadIdx.x + ...;
    if (i < n) y[i] = a*x[i] + y[i];
}

// main.cpp
void main() {
    float x = bar(1.0);
    if (x < 2.0f)
        saxpy<<<...>>>(foo(1), ...);
    ...
}

NVCC compiles the CUDA C functions, the host CPU compiler compiles the rest of the C/C++ application, and the linker combines the CUDA and CPU object files into a single CPU + GPU executable.
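The whole build can be driven by nvcc, which passes the plain C/C++ parts on to the host compiler. A minimal sketch, assuming the file names above and an arbitrary output name:

nvcc -o app foo.cpp main.cpp saxpy.cu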
27
Canonical execution flow
[diagram: a GPU with many SMs and their shared memory (SMEM), connected over the PCIe bus to a CPU with cores and cache]
28
Step 1 – copy data to GPU memory
29
Step 2 – launch kernel on GPU
30
Step 3 – execute kernel on GPU
31
Step 4 – copy data to CPU memory
32
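A hedged end-to-end sketch of this four-step flow; the kernel scale_by_two, the array names and the sizes are illustrative, not from the slides:

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void scale_by_two(int n, float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // illustrative computation
}

int main()
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_data = (float*)malloc(bytes); // host (CPU) memory
    float *d_data;
    cudaMalloc((void**)&d_data, bytes);    // device (GPU) memory
    // ... fill h_data here ...

    // Step 1: copy data to GPU memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Steps 2-3: launch the kernel and execute it on the GPU
    scale_by_two<<<(n + 255)/256, 256>>>(n, d_data);

    // Step 4: copy results back to CPU memory
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    free(h_data);
    return 0;
}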
Example: Fermi GPU Architecture
32 CUDA Cores per Streaming Multiprocessor (SM)
• 32 fp32 ops/clock
• 16 fp64 ops/clock
• 32 int32 ops/clock
2 warp schedulers per SM
• 1,536 concurrent threads
16 load/store units, 4 special-function units
64 KB shared memory + L1 cache
32 K 32-bit registers
33
Multithreading
CPU architecture attempts to minimize latency within each thread
GPU architecture hides latency with computation from other thread warps
[diagram: a low-latency CPU core context-switches between threads T1 and T2, while a GPU SM keeps many warps (W1, W2, W3, ...) in flight, running whichever warps are ready to execute while others wait for data]
34
CUDA Kernels
Parallel portion of application: executes as a kernel
• Entire GPU executes kernel
• Kernel launches create thousands of CUDA threads efficiently
CUDA threads
• Lightweight
• Fast switching
• 1000s execute simultaneously
Kernel launches create hierarchical groups of threads
• Threads are grouped into Blocks, and Blocks into Grids
• Threads and Blocks represent different levels of parallelism
35
CUDA C: C with a few keywords
Kernel: function that executes on device (GPU) and can be called
from host (CPU)
• Can only access GPU memory
• No variable number of arguments
• No static variables
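A minimal sketch of the __global__ keyword in use (kernel and variable names are illustrative):

__global__ void scale(int n, float a, float *x)   // executes on the device
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                          // may only touch GPU memory
}

// called from the host, e.g.: scale<<<(n + 255)/256, 256>>>(n, 2.0f, d_x);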
36
CUDA Kernels: Parallel Threads
A kernel is a function executed in parallel by an array of threads
[figure: each thread works on its own element, e.g. in[i], in[i+1], in[i+2], in[i+3]]
37
5.3 Synchronization/Communication
38
CUDA Thread Organization
39
Blocks of threads
40
Grids of blocks
41
Blocks execute on Streaming Multiprocessors
[diagram: a thread block runs on one Streaming Multiprocessor; each thread uses the SM's streaming processors and registers, and the block has per-block shared memory (SMEM)]
42
Grids of blocks execute across the GPU
[diagram: the blocks of a grid are distributed across the GPU's SMs, each with its own shared memory (SMEM)]
43
Kernel Execution Recap
A thread executes on a single streaming processor
• Allows use of familiar scalar code within kernel
44
Block abstraction provides scalability
Blocks may execute in arbitrary order, concurrently or
sequentially, and parallelism increases with resources
• Depends on when execution resources become available
Independent execution of blocks provides scalability
• Blocks can be distributed across any number of SMs
45
Blocks enable efficient collaboration
Threads often need to collaborate
• Cooperatively load/store common data sets
• Share results or cooperate to produce a single result
• Synchronize with each other
46
Blocks must be independent
Any possible interleaving of blocks is allowed
• Blocks presumed to run to completion without pre-emption
• May run in any order, concurrently or sequentially
47
Thread and Block ID and Dimensions
Threads: 3D IDs, unique within a block
Thread Blocks: 2D IDs, unique within a grid
Dimensions set at launch: can be unique for each grid
Built-in variables
• threadIdx, blockIdx
• blockDim, gridDim
[figure: a Device running Grid 1; Block (1, 1) contains threads (0, 0), (1, 0), (2, 0), (3, 0), (4, 0), ...]
48
Examples of Indexes and Indexing
__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel(int *a)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
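The 16-element outputs above match a launch of 4 blocks of 4 threads each; a hedged sketch of such a launch (d_a is an assumed device array of 16 ints):

kernel<<<4, 4>>>(d_a);   // gridDim.x = 4, blockDim.x = 4, so idx covers 0..15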
49
Example of 2D indexing
a[idx] = a[idx]+1;
}
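A hedged sketch of the usual 2D indexing kernel this fragment belongs to, together with a dim3 launch; dimx, dimy, d_a and the 16x16 block size are assumptions:

__global__ void kernel(int *a, int dimx, int dimy)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;   // column index
    int iy = blockIdx.y*blockDim.y + threadIdx.y;   // row index
    if (ix < dimx && iy < dimy) {                   // guard partial blocks
        int idx = iy*dimx + ix;                     // flatten 2D position to 1D
        a[idx] = a[idx]+1;
    }
}

// host-side launch covering a dimx x dimy array:
dim3 block(16, 16);
dim3 grid((dimx + block.x - 1)/block.x, (dimy + block.y - 1)/block.y);
kernel<<<grid, block>>>(d_a, dimx, dimy);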
50
Independent address spaces
CPU and GPU have independent memory systems
• PCIe bus transfers data between CPU and GPU memory systems
[diagram: separate CPU and GPU memory systems linked by PCI-e]
51
Independent address spaces:
consequences
Cannot reliably determine whether a pointer references a host
(CPU) or device (GPU) address from the pointer value
• Dereferencing CPU/GPU pointer on GPU/CPU will likely cause crash
Unified virtual addressing (UVA) in CUDA 4.0
• One virtual address space shared by CPU thread and GPU threads
52
CUDA Memory Hierarchy
Thread
• Registers
• Local memory
[figure: each thread has its own registers and local memory, plus shared memory common to the threads of a block]
53
Shared Memory
__shared__ <type> x[<elements>];
Allocated per thread block
Scope: threads in block
Data lifetime: same as block
Capacity: small (about 48 kB)
Latency: a few cycles
Bandwidth: very high
• per SM: 32 * 4 B * 1.15 GHz / 2 = 73.6 GB/s
• whole GPU: 14 * 32 * 4 B * 1.15 GHz / 2 = 1.03 TB/s
Common uses: e.g. inter-thread communication within a block, user-managed cache
58
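A minimal sketch of a common pattern: staging data in shared memory and synchronizing with __syncthreads(), here reversing the elements within each block (names are illustrative; n is assumed to be a multiple of the 256-thread block size):

__global__ void reverse_in_block(float *d)
{
    __shared__ float s[256];          // one element per thread in the block
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    s[t] = d[i];                      // each thread stages one element
    __syncthreads();                  // wait until the whole block has loaded
    d[i] = s[blockDim.x - 1 - t];     // write back in reversed block order
}

// launch: reverse_in_block<<<n/256, 256>>>(d_data);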
Communication and Data Persistence
[diagram: sequential kernels (Kernel 1, Kernel 2, Kernel 3), each with its own per-block shared memory; global memory persists across the kernel launches and carries data between them]
59
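A hedged sketch of what the diagram implies: two sequential kernels communicating through global memory, since shared memory does not outlive the kernel that uses it (names are illustrative):

__global__ void produce(float *g) { g[threadIdx.x] = (float)threadIdx.x; } // writes results to global memory
__global__ void consume(float *g) { g[threadIdx.x] += 1.0f; }              // a later kernel reads the same data

// host side: data written by 'produce' persists in global memory for 'consume'
// produce<<<1, 256>>>(d_buf);
// consume<<<1, 256>>>(d_buf);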
Managing Device (GPU) Memory
Host (CPU) manages device (GPU) memory
cudaMalloc(void** pointer, size_t num_bytes)
cudaMemset(void* pointer, int value, size_t count)
cudaFree(void* pointer)
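A minimal sketch of host-side device-memory management with these calls (sizes and names are illustrative):

int n = 1024;
size_t bytes = n * sizeof(float);
float *d_x = NULL;
cudaMalloc((void**)&d_x, bytes);   // allocate n floats on the device
cudaMemset(d_x, 0, bytes);         // zero the allocation, byte-wise
// ... use d_x in kernels ...
cudaFree(d_x);                     // release the device memory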
60
Transferring Data
cudaMemcpy(void* dst, void* src, size_t num_bytes, enum cudaMemcpyKind direction);
• returns to host thread after the copy completes
• blocks CPU thread until all bytes have been copied
• doesn’t start copying until previous CUDA calls complete
Direction controlled by enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
CUDA also provides non-blocking copies (e.g. cudaMemcpyAsync)
• Allows program to overlap data transfer with concurrent computation
on host and device
• Need to ensure that source locations are stable and destination locations are
not accessed
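A hedged sketch of the non-blocking variant, using cudaMemcpyAsync with a CUDA stream and pinned host memory (the slide does not name the API; all names are illustrative):

size_t bytes = 1 << 20;
float *h_x, *d_x;
cudaMallocHost((void**)&h_x, bytes);   // pinned host memory, needed for true async copies
cudaMalloc((void**)&d_x, bytes);
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_x, h_x, bytes, cudaMemcpyHostToDevice, stream);  // returns immediately
// ... independent work on the host can proceed here ...
cudaStreamSynchronize(stream);         // wait until the copy has finished
cudaStreamDestroy(stream);
cudaFree(d_x);
cudaFreeHost(h_x);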
61
Example: SAXPY Kernel [1/4]
// [computes] for (i = 0; i < n; i++) y[i] = a*x[i] + y[i];
// Each thread processes one element
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

Host Code
int main()
{
    ...
    // invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255)/256;
    saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
    ...
}
62
Example: SAXPY Kernel [2/4]
int main()
{
    // allocate and initialize host (CPU) memory
    float* x = ...;
    float* y = ...;
65
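A hedged sketch of how such host code typically continues before the kernel launch, allocating device memory and copying x and y to the GPU (d_x, d_y are illustrative names):

    // allocate device (GPU) memory
    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));

    // copy x and y from host memory to device memory
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);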
Example: SAXPY Kernel [3/4]
    // invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255)/256;
    saxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);

    // do something with the result
    return 0;
}
67
Example: SAXPY Kernel [4/4]
Standard C Code
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// invoke host SAXPY function
saxpy_serial(n, 2.0, x, y);

CUDA C Code
__global__ void saxpy(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy<<<nblocks, 256>>>(n, 2.0, x, y);
69
Thank you for your attention!
70