
GPU COMPUTING - ARCHITECTURE + PROGRAMMING
LECTURE 02 - CUDA PROGRAMMING
Holger Fröning
[email protected]
Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg
With material from D. Kirk, W. Hwu (“Programming Massively Parallel Processors”)
DIE SHOTS - CPU OR GPU?
[Die shots of an NVIDIA Kepler GK110 and an Intel Xeon E7 Westmere-EX. Labeled features include the GPU's command processor and 4x setup pipelines, and the CPU's cores, LLC slices, memory controller, caching agents and router, PCIe/QPI links, and Scalable Memory Interface (SMI).]


MAIN DIFFERENCES BETWEEN GPU AND CPU

CUDA & GPU - OVERVIEW
NVIDIA CUDA
  Compute kernel as C program
  Explicit data- and thread-level parallelism
  Computing, not graphics processing
  Host communication
Memory hierarchy
  Host memory
  GPU (device) memory
  GPU on-chip memory (later)
More HW details exposed
  Use of pointers
  Load/store architecture
  Barrier synchronization of thread blocks
[System diagram: CPU socket (cores, system request queue, memory interface; 96 GFLOPS DP) attached to host memory at 60 GB/s; north bridge and IO bridge provide 16 GB/s system and peripheral interfaces to the GPU (1,165 GFLOPS DP, 3,494 GFLOPS SP), whose memory interface reaches GPU memory at 288 GB/s.]

G80 ARCHITECTURE FOR GRAPHICS PROCESSING
[Block diagram: the host feeds an input assembler and a setup/raster/ZCull stage; vertex, geometry, and pixel thread issue units dispatch work via the thread processor to arrays of streaming processors (SP) with texture units (TF), per-cluster L1 caches, shared L2 caches, and frame buffer (FB) partitions.]

G80 ARCHITECTURE FOR GENERAL-PURPOSE PROCESSING
SM = Streaming Multiprocessor
[Block diagram: the host, input assembler, and thread execution manager dispatch work to an array of SMs; each SM has a parallel data cache (shared memory), texture and load/store units, and all SMs access a common global memory.]

CUDA PROGRAMMING MODEL
A CUDA program consists of a CPU part and a GPU part (a minimal sketch follows below)
  CPU part: portions of the program with little or no parallelism
  GPU part: highly parallel portions, SPMD-style (kernels)
Concurrent execution
  Non-blocking thread execution
  Explicit synchronization
C extension with three main abstractions
  1. Hierarchy of threads
  2. Shared memory
  3. Barrier synchronization
Exploiting parallelism
  Inner loops
  Fine-grain data-level parallelism (DLP)
  Thread-level parallelism (TLP) -> threads
  Kernels
[Execution timeline: serial CPU phases alternate with parallel GPU kernel launches.]

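The structure above can be made concrete in a few lines. The following is a minimal sketch (illustrative kernel name and sizes, not from the slides): a serial CPU part, a non-blocking kernel launch as the GPU part, and explicit synchronization before the result is used.

#include <cuda_runtime.h>

__global__ void gpuPart ( float *data, int n )      // GPU part: SPMD-style kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    // CPU part: setup and control flow with little or no parallelism
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc ( (void**) &d_data, n * sizeof(float) );

    gpuPart <<< (n + 255) / 256, 256 >>> ( d_data, n );   // non-blocking kernel launch
    cudaDeviceSynchronize();                              // explicit synchronization

    cudaFree ( d_data );
    return 0;
}
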
KERNEL LAUNCH
Kernels: N-fold execution by N threads, marked with __global__
Execution: kernel <<< numBlocks, threadsPerBlock >>> ( args )
Unique ID: threadIdx.{x,y,z}
  Control flow for SPMD programs
  Memory access orchestration

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock ( N, N );
    matAdd <<< 1, dimBlock >>> ( A, B, C );
}

KERNEL LAUNCH
Each thread block ("block" in the following) has up to 3 dimensions
Block dimensions are limited: 512 x 512 x 64 -> 1024 x 1024 x 64, GPU-dependent
Additional hierarchy level: grid = multiple blocks
  Grid = kernel in execution
  Unique ID blockIdx, up to 3 dimensions
Blocks are executed independently, in an implementation-dependent order
Number of blocks is limited (typ. 64k-1 per dimension), up to 3D

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock ( N, N );
    matAdd <<< 1, dimBlock >>> ( A, B, C );
}

KERNEL LAUNCH
Super-fine grained: one thread computes one element
Operator "/" rounds down, so add (block size - 1) to round up!
  E.g. N = 50: grid size = (50 + 16 - 1) / 16 = 4.0625 => 4

__global__ void matAdd ( float A[N][N],
                         float B[N][N],
                         float C[N][N] )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if ( i < N && j < N )
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock ( 16, 16 );
    dim3 dimGrid ( ( N + dimBlock.x - 1 ) / dimBlock.x,
                   ( N + dimBlock.y - 1 ) / dimBlock.y );
    matAdd <<< dimGrid, dimBlock >>> ( A, B, C );
}

THREAD HIERARCHY
Thread hierarchy
  Grid of thread blocks
  Blocks of equal size
Given problem size N, how to choose the parameters threads per block and blocks per grid? (see the sketch below)
Recommendations wrt block count
  > 2x number of SMs
  Optimal: 100-1000 (max. 64k-1)
Recommendations wrt threads/block
  Required concurrency for latency tolerance vs. resources per thread

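A minimal sketch of this parameter choice (assumed problem size and block size, not from the slides): derive the block count from N and compare it against the SM count queried from the device.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 24;                 // assumed problem size
    const int threadsPerBlock = 256;       // typical starting point
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    cudaDeviceProp prop;
    cudaGetDeviceProperties ( &prop, 0 );  // query device 0
    printf ( "blocks = %d, SMs = %d (recommendation: blocks >> SMs)\n",
             blocksPerGrid, prop.multiProcessorCount );
    return 0;
}
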
THREAD COMMUNICATION
Communication and synchronization only within one thread block (see the sketch below)
  Shared memory
  Atomic operations
  Barrier synchronization
Threads from different blocks cannot interact
  Exception: global memory
  Very weak coherence & consistency guarantees
  Iterative kernel invocations
[Figure: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2; each grid consists of blocks (0,0) .. (1,1), and each block of threads indexed (x,y,z).]

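A minimal sketch combining the three intra-block mechanisms named above (not from the slides; assumes a power-of-two block size of at most 256 threads and that *out is zero-initialized by the host): shared memory holds per-block partial sums, __syncthreads acts as the barrier, and an atomic operation combines results across blocks, which cannot synchronize with each other directly.

__global__ void blockSum ( const float *in, float *out, int n )
{
    __shared__ float partial[256];                 // shared memory: one slot per thread
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    partial[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // barrier: make the writes visible

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride]; // tree reduction within the block
        __syncthreads();
    }

    if (tid == 0)
        atomicAdd ( out, partial[0] );             // atomic: combine results across blocks
}
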
MEMORY HIERARCHY – GLOBAL MEMORY
Global memory
  Communication between host and device
  Accessible from all threads (R/W)
  High latency
  Lifetime exceeds thread lifetime
  Sensitive to fine-grained accesses
Allocation:    cudaMalloc ( &dmem, size );
Deallocation:  cudaFree ( dmem );
Data transfer (blocking):  cudaMemcpy ( dst, src, size, transfer_type );
  Non-blocking variant:    cudaMemcpyAsync ( … )
[Figure: memory hierarchy - per-thread registers, per-block shared memory, and device-wide global memory that is also accessible from the host.]

MEMORY HIERARCHY – GLOBAL MEMORY

float *dmem;
cudaMalloc ( (void**) &dmem, N*sizeof ( float ) );       // Allocate GPU memory
float *hmem = (float*) malloc ( N*sizeof ( float ) );    // Allocate CPU memory

// Transfer data from host to device
cudaMemcpy ( dmem, hmem, N*sizeof ( float ), cudaMemcpyHostToDevice );

// Do calculations; kernels are passed only references to device memory
kernel1 <<< numBlocks, numThreadsPerBlock >>> ( dmem, N );
...
kernel2 <<< numBlocks, numThreadsPerBlock >>> ( dmem, N );

// Transfer data from device to host
cudaMemcpy ( hmem, dmem, N*sizeof ( float ), cudaMemcpyDeviceToHost );

cudaFree ( dmem );   // Free device buffer
free ( hmem );       // Free host buffer

Annotate variable scope in the names (e.g., dmem vs. hmem); kernels only ever receive references to device memory.

MEMORY HIERARCHY – SHARED MEMORY
On-chip memory
Lifetime: thread block lifetime
Access costs in the best case equal register access
Organized in n banks
  Typ. 16-32 banks with 32-bit width
  Low-order interleaving
  Parallel access if no conflict
  Conflicts result in access serialization (see the sketch below)

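A minimal sketch of bank behavior (not from the slides; assumes 32 banks of 32-bit words and a launch with a single warp of 32 threads): a stride-1 access pattern touches 32 different banks, while a stride-32 pattern maps every thread of the warp to the same bank and is serialized.

__global__ void bankDemo ( float *out )
{
    __shared__ float tile[32 * 32];
    int tid = threadIdx.x;                    // assumption: blockDim.x == 32 (one warp)

    for (int k = tid; k < 32 * 32; k += 32)   // fill the tile so the reads are defined
        tile[k] = (float) k;
    __syncthreads();

    float ok  = tile[tid];                    // stride 1: 32 different banks, conflict-free
    float bad = tile[32 * tid];               // stride 32: all threads hit the same bank
    out[tid]  = ok + bad;
}
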
VARIABLE DECLARATION
                               Location                          Access from   Lifetime
__device__   float var;        global memory (device memory)     device/host   program
__constant__ float var;        constant memory (device memory)   device/host   program
__shared__   float var;        shared memory                     threads       thread block
texture <float> ref;           texture memory (device memory)    device/host   program

__device__ can be combined with others (e.g., __constant__); see the sketch below
Shared memory & consistency model
  __syncthreads waits for completion of outstanding write operations
  Otherwise, completion of read/write operations is unconstrained (exception: volatile)

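A minimal sketch of these qualifiers in use (illustrative names and sizes, not from the slides; assumes at most 128 threads per block and correspondingly sized data):

__device__   float d_scale  = 2.0f;     // global memory, program lifetime, device/host access
__constant__ float c_offset = 1.0f;     // constant memory, read-only on the device

__global__ void scaleAdd ( float *data )
{
    __shared__ float buf[128];          // shared memory, lifetime of the thread block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    buf[tid] = data[i];
    __syncthreads();                    // make the shared-memory writes visible
    data[i] = d_scale * buf[tid] + c_offset;
}
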
TYPE SPECIFIERS
Vector types
  char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1,
  short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3,
  uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4,
  float1, float2, float3, float4, double2
  Derived from basic types (int, float, …)
Dimension type: dim3
  Based on uint3
  Unspecified components are initialized with 1 (see the sketch below)

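A minimal sketch of the vector and dimension types (illustrative values, not from the slides):

__global__ void addOneX ( float4 *v )
{
    int i = threadIdx.x;
    v[i].x += 1.0f;                 // vector components are accessed as .x .y .z .w
}

int main()
{
    dim3 block ( 16, 16 );          // z unspecified -> initialized to 1
    dim3 grid  ( 4 );               // y and z       -> initialized to 1
    // addOneX <<< grid, block >>> ( d_v );   // d_v: a device float4 array (not allocated here)
    return 0;
}
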
FUNCTION DECLARATION
                                   Executed on   Callable from
__device__ float DeviceFunc()      device        device
__global__ void  KernelFunc()      device        host
__host__   float HostFunc()        host          host

__global__ defines a kernel (return type: void)
__host__ is optional
__host__ and __device__ can be combined (see the sketch below)
No pointers to __device__ functions (exception: __global__ functions)
For functions that are executed on the GPU:
  No recursion
  Only static variable declarations
  No variable parameter count

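A minimal sketch of combined qualifiers (illustrative function names, not from the slides): a __host__ __device__ helper compiled for both sides, a device-only helper, and a kernel calling both.

__host__ __device__ float square ( float x )   // compiled for CPU and GPU
{
    return x * x;
}

__device__ float plusOne ( float x )           // device-only helper, callable from kernels
{
    return x + 1.0f;
}

__global__ void apply ( float *data, int n )   // kernel: executed on device, callable from host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = plusOne ( square ( data[i] ) );
}
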
JUST-IN-TIME COMPILATION
Device code only supports a C subset of C++ (getting better)
Compile with nvcc
  Compiler driver: calls other tools as required (cudacc, g++, clang, …)
  Output: host CPU code (C) plus either PTX object code or source code for run-time interpretation
PTX (Parallel Thread Execution)
  Virtual machine and ISA
  Execution resources and state
Linking
  CUDA runtime library: cudart
  CUDA core library: cuda
[Figure: virtual stage - the CUDA program is compiled by nvcc into PTX and x86 host code; physical stage - PTX is translated to the concrete target GPU (e.g., GF100, GK110, GP100).]

BRIEF PROPERTY SURVEY (DEVICEQUERY)

                                  GeForce GTX 480        Tesla K20c           RTX 2080 Ti
CC revision                       2.0                    3.5                  7.5
Total global memory [bytes]       1.5G                   5G                   11G
Multiprocessors                   15                     13                   68
Cores                             480                    2496                 4352
Total constant memory [bytes]     64k                    64k                  64k
Shared memory per block [bytes]   48k                    48k                  48k
Registers per block               32k                    64k                  64k
Warp size                         32                     32                   32
Max threads per block             1k                     1k                   1k
Max dimension of a block          1k x 1k x 64           1k x 1k x 64         1k x 1k x 64
Max dimension of a grid           65535 x 65535 x 65535  2G x 65535 x 65535   2G x 65535 x 65535
Max memory pitch [bytes]          2G                     2G                   2G
Clock rate [GHz]                  1.4                    0.7                  1.54
Concurrent copy and execution     Y (1)                  Y (2)                Y (3)

CUDA EXAMPLE: SAXPY

SAXPY EXAMPLE
y[i] = α ⋅ x[i] + y[i]
SAXPY: Scalar Alpha X Plus Y
Simple test to compare GPU and CPU performance
Objective: runtime reduction
Max. gridSize * threadsPerBlock elements (1D grid): 65535 * 1k -> ~64M elements
Memory requirement: 32M elements * 2 arrays * 4 byte/element = 256 MB
Source code contains kernels for the GPU and the CPU

SAXPY EXAMPLE

// kernel function (CPU version)
void saxpy_serial(int n, float alpha, float *x, float *y)
{
    int i;
    for (i = 0; i < n; i++) {
        y[i] = alpha*x[i] + y[i];
    }
}

// kernel function (CUDA device, GPU version)
__global__ void saxpy_parallel(int n, float alpha, float *x, float *y)
{
    // compute the global index into the vector from block and thread indices
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // avoid writing past the allocated memory
    if (i < n) {
        y[i] = alpha*x[i] + y[i];
    }
}

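A minimal host-side launch sketch to complement the kernels above (not from the slides; assumes d_x and d_y already hold the input vectors in device memory, as in the allocation/transfer pattern shown earlier):

void saxpy_launch ( int n, float alpha, float *d_x, float *d_y )
{
    int threadsPerBlock = 256;
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;    // round up

    saxpy_parallel <<< numBlocks, threadsPerBlock >>> ( n, alpha, d_x, d_y );
    cudaDeviceSynchronize();    // launches are asynchronous; wait before timing or copying back
}
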
INITIAL PERFORMANCE
[Performance plot: SAXPY runtimes on a Tesla K10 (PCIe 2.0 x16) vs. a single-threaded Intel Xeon E5. Huge advantage for the GPU when no data movements are needed; pageable vs. pinned host memory makes a large difference once transfers are included.]

PINNED MEMORY
Replace malloc with cudaMallocHost
Significant reduction of data movement costs
Pinned memory is a scarce resource!

float *h_x;
float *h_y;
float *d_x;
float *d_y;

if (USE_PINNED_MEMORY) {
    cudaMallocHost ( (void**) &h_x, N*sizeof(float) );
    cudaMallocHost ( (void**) &h_y, N*sizeof(float) );
} else {
    h_x = (float*) malloc ( N*sizeof(float) );
    h_y = (float*) malloc ( N*sizeof(float) );
}
cudaMalloc ( (void**)&d_x, N*sizeof(float) );
cudaMalloc ( (void**)&d_y, N*sizeof(float) );

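The matching cleanup (a sketch, assuming the same variable names as above): pinned host memory must be released with cudaFreeHost rather than free.

if (USE_PINNED_MEMORY) {
    cudaFreeHost ( h_x );
    cudaFreeHost ( h_y );
} else {
    free ( h_x );
    free ( h_y );
}
cudaFree ( d_x );
cudaFree ( d_y );
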
COMMON ERRORS
CUDA Error: the launch timed out and was terminated
  -> Stop X11
CUDA Error: unspecified launch failure
  -> Typically a segmentation fault
CUDA Error: invalid configuration argument
  -> Too many threads per block, or too many resources per thread (shared memory, register count)
Compile problem:
  mmult.cu(171): error: identifier "__eh_curr_region" is undefined
  -> Non-static shared memory; use static allocation of shared memory

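Errors like the ones above are much easier to localize with systematic checking. A minimal sketch (not from the slides): CUDA runtime calls return a cudaError_t, and launch errors are retrieved with cudaGetLastError / cudaDeviceSynchronize.

#include <cstdio>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            fprintf ( stderr, "CUDA error %s at %s:%d\n",             \
                      cudaGetErrorString(err), __FILE__, __LINE__ );  \
    } while (0)

// Usage: kernel launches return nothing, so query the error state afterwards.
//   kernel <<< grid, block >>> ( args );
//   CUDA_CHECK ( cudaGetLastError() );        // launch/configuration errors
//   CUDA_CHECK ( cudaDeviceSynchronize() );   // errors raised during kernel execution
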
SUMMARY
Introduction to CUDA
  Pretty unusual concept compared to CPU programming
  Once understood: pretty easy programming model
  Did you see any vector instructions today?
Direct control over hardware
  Plenty of opportunities for the (experienced) user
  Increases the burden
Main differences to CPU programming
  Sophisticated resource planning
  Many manual data movements
  Limited memory capacity
[System diagram as on the overview slide: CPU socket and host memory vs. GPU cores and GPU memory, with the respective bandwidths and GFLOPS figures.]
