GPGPU (General-Purpose Computing on GPUs)
Components of a GPU
• Three key concepts behind how modern GPU processing cores run code
[Figure: a shader program is compiled and run once per input fragment, producing one shaded fragment output record]
Ideas to make GPU processing cores run fast
Idea # 1: Add CPU Style Cores to run Multiple Fragments
Execution Model    | Multiple independent threads | One thread with wide execution datapath                   | Multiple lockstep threads
Example            | Multicore CPUs               | x86 SSE/AVX                                               | GPUs
Architecture Pros  | More general: supports TLP   | Can mix sequential & parallel code; gather/scatter operations | Easier to program
Like a wide CPU pipeline – except one fetch for entire width
• 16-wide physical ALU: in each cycle it can issue 16 work-items (one quarter of a wavefront)
• Executes a 64-work-item wavefront over 4 cycles. Why? The unit is pipelined, so in 4 cycles it executes 16x4=64 work-items.
• 64KB register state per SIMT Unit
• Compare to x86 (Bulldozer): ~1KB of physical register file state (~1/64 the size)
• Address Coalescing Unit
• A key to good memory performance
• Here, coalescing means merging the memory requests of work-items in the same wavefront that fall in the same cache block into a single request (not the allocator sense of merging two adjacent free blocks of memory).
Address Coalescing
• Wavefront: Issue 64 memory requests
• Common case:
• work-items in same wavefront touch same cache block
• Coalescing:
• Merge many work-items requests into single cache block request
• Important for performance:
• Reduces bandwidth to DRAM
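To make the idea concrete, here is a minimal CUDA sketch (not from the original slides; kernel names are illustrative) contrasting an access pattern that coalesces with one that does not:

// Sketch: adjacent work-items touch adjacent addresses, so the per-wavefront
// requests merge into a few cache-block requests.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // work-item k reads element k
    if (i < n)
        out[i] = in[i];
}

// Sketch: adjacent work-items touch addresses 'stride' elements apart; with a
// large stride every work-item hits a different cache block, nothing merges,
// and DRAM traffic multiplies.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}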
GPU Memory
Bulldozer – FX-8170 vs. GCN – Radeon HD 7970
• GPUs have caches.
• Not Your CPU's Cache

                                                  CPU (Bulldozer)   GPU (GCN)
L1 data cache capacity                            16 KB             16 KB
Active threads (work-items) sharing L1 D cache    1                 2,560
L1 D cache capacity / thread                      16 KB             6.4 bytes
Last-level cache (LLC) capacity                   8 MB              768 KB
Active threads (work-items) sharing LLC           8                 81,920
LLC capacity / thread                             1 MB              9.6 bytes
• GPU Caches
• Maximize throughput, not hide latency
• Not there for either spatial or temporal locality
• L1 Cache: Coalesce requests to same cache block by different work-items
• i.e., streaming locality across work-items, not reuse
• Keep block around just long enough for each work-item to hit once
• Ultimate goal: Reduce bandwidth to DRAM
• L2 Cache: DRAM staging buffer + some instruction reuse
• Ultimate goal: Tolerate spikes in DRAM bandwidth
• If there is any spatial/temporal locality:
• Use local memory (scratchpad)
GPU Memory
Scratchpad Memory
• GPUs have scratchpads (Local Memory)
• Separate address space
• Managed by software:
• Rename address
• Manage capacity – manual fill/eviction
• Allocated to a workgroup
• i.e., shared by wavefronts in workgroup
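As a concrete illustration (a sketch, not from the slides): CUDA exposes the scratchpad as __shared__ memory and OpenCL as __local memory; the workgroup fills it manually, synchronizes, and then reuses the staged data instead of re-reading DRAM.

// Sketch: per-block scratchpad staging in CUDA. Assumes a launch with 256
// threads per block; 'reverseTile' is an illustrative kernel name.
__global__ void reverseTile(const float *in, float *out, int n)
{
    __shared__ float tile[256];                    // scratchpad, allocated per block (workgroup)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];                 // manual fill
    __syncthreads();                               // whole workgroup sees the staged tile
    int r = blockDim.x - 1 - threadIdx.x;          // reuse data loaded by another work-item
    if (i < n && blockIdx.x * blockDim.x + r < n)
        out[i] = tile[r];
    // no eviction needed: the scratchpad contents vanish when the block finishes
}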
Example System – AMD Radeon HD 7970 (GCN)
• 32 Compute Units:
• 81,920 Active work-items
• 32 CUs * 4 SIMT Units * 16 ALUs = 2048 Max FP ops/cycle
• 925 MHz engine clock
• 264 GB/s Max memory bandwidth
• 3.79 TFLOPS single precision (accounting trickery: FMA)
• 210W Max Power (Chip)
• >350W Max Power (card)
• 100W idle power (card)
• Two 7970s on one card:
• 375W (AMD Official) – 450W (OEM)
Example System – Nvidia K80
• GPU: 2
• CUDA Cores : 4992
• 2× 240 GB/s Max memory bandwidth
• 5591–8736 GFLOPS Single Precision (MAD or FMA)
GPUs: Great for data parallelism. Bad for everything else.
• Data Parallelism: Identical, Independent work over multiple data inputs
• GPU version: Add streaming access pattern
• Data Parallel Execution Models: MIMD, SIMD, SIMT
• GPU Execution Model: Multicore Multithreaded SIMT
• OpenCL Programming Model
• NDRange over workgroup/wavefront
• Modern GPU Microarchitecture: AMD Graphics Core Next (GCN)
• Compute Unit (“GPU Core”): 4 SIMT Units
• SIMT Unit (“GPU Pipeline”): 16-wide ALU pipe (16x4 execution)
• Memory: designed to stream
CPU vs. GPU Architecture
• The Graphics Processing Unit (GPU) provides much higher instruction throughput and
memory bandwidth than the CPU within a similar price and power envelope. Many
applications leverage these higher capabilities to run faster on the GPU than on the
CPU as described in the GPU Applications Catalog. Other computing devices, like
FPGAs, are also very energy efficient, but offer much less programming flexibility
than GPUs.
• The difference in capabilities between the GPU and the CPU exists because they are
designed with different goals in mind. While the CPU is designed to excel at executing
a sequence of operations, called a thread, as fast as possible and can execute a few
tens of these threads in parallel, the GPU is designed to excel at executing thousands
of them in parallel (amortizing the slower single-thread performance to achieve
greater throughput).
• The GPU is specialized for highly parallel computations and therefore designed such
that more transistors are devoted to data processing rather than data caching and
flow control. The schematic figure on the next slide shows an example distribution of chip
resources for a CPU versus a GPU.
CPU vs. GPU Architecture
➢GPU devotes more transistors to data processing rather than data caching & flow control.
CUDA (Compute Unified Device Architecture)
[Figure: host CPU and GPU device connected via DMA]
GPGPU Programming with CUDA
[Figure: threads of execution indexed by threadIdx.x = 0, 1, 2, 3, …, 4094, 4095 within blockIdx.x = 0; thread blocks arranged in a 2-D grid]
CUDA C Vectors Add on Grid

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] ) {
    // Size of vectors
    int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);
    // Allocate memory for each vector on host
    h_a = (double*) malloc(bytes);
    h_b = (double*) malloc(bytes);
    h_c = (double*) malloc(bytes);
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    // Initialize vectors on host
    for(int i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }
CUDA C Vectors Add …
// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
// Number of threads in each thread block
int NumThreadsPerBlock = 1024;                           // blockDim.x
// Number of thread blocks in the grid
int NumBlocks = (int)ceil((float)n/NumThreadsPerBlock);  // gridDim.x
// Execute the kernel
vecAdd<<<NumBlocks, NumThreadsPerBlock>>>(d_a, d_b, d_c, n);
// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for( int i = 0; i < n; i++)
sum += h_c[i];
printf("final result: %f\n", sum/n);
// Release device memory
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
// Release host memory
free(h_a); free(h_b); free(h_c);
return 0; }
CUDA – Thread Management API (Thread Index & ID)
• For example, the following code adds two matrices A and B of size NxN and stores the
result into matrix C, using a single block of NxN threads indexed in two dimensions.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
• The vector types are derived from the basic integer and floating-point types.
They are structures, and the 1st, 2nd, 3rd, and 4th components are
accessible through the fields x, y, z, and w, respectively. They all come
with a constructor function of the form make_<type name>;
• for example, int2 make_int2(int x, int y); creates a vector of type int2
with value (x, y).
• The alignment requirements of the vector types are detailed in the table below.
CUDA Built-in Vector Types

Type                      Alignment
char1, uchar1             1
char2, uchar2             2
char3, uchar3             1
char4, uchar4             4
short1, ushort1           2
short2, ushort2           4
short3, ushort3           2
short4, ushort4           8
int1, uint1               4
int2, uint2               8
int3, uint3               4
int4, uint4               16
long1, ulong1             4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long2, ulong2             8 if sizeof(long) is equal to sizeof(int), 16 otherwise
long3, ulong3             4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long4, ulong4             16
longlong1, ulonglong1     8
longlong2, ulonglong2     16
longlong3, ulonglong3     8
longlong4, ulonglong4     16
float1                    4
float2                    8
float3                    4
float4                    16
double1                   8
double2                   16
double3                   8
double4                   16
Built-in variables
• dim3: is an integer vector type based on uint3 that is used to specify
dimensions. When defining a variable of type dim3, any component left
unspecified is initialized to 1.
• Built-in variables specify the grid and block dimensions and the block and
thread indices. They are only valid within functions that are executed on
the device.
1. gridDim variable is of type dim3 and contains the dimensions of the grid.
2. blockIdx variable is of type uint3 and contains the block index within the grid.
3. blockDim variable is of type dim3 and contains the dimensions of the block.
4. threadIdx variable is of type uint3 and contains the thread index within the
block.
5. warpSize variable is of type int and contains the warp size in threads
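For illustration, a short sketch (not from the slides) showing how these built-ins are typically combined into a global 2-D index inside a kernel:

// Sketch: deriving a unique (x, y) coordinate from the built-in variables.
__global__ void scaleMatrix(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column across the whole grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row across the whole grid
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;
    // gridDim.{x,y} = number of blocks per dimension, blockDim.{x,y} = threads per block,
    // warpSize = 32 on current NVIDIA hardware
}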
Vector Add Example
• We just need to modify the loop to stride through the array with parallel threads.
• The kernel code will need to know its block and thread index to find its offset into the
passed arrays. The parallelized kernel often uses a grid-stride loop, such as the
following:
// Original kernel: one thread iterates over all elements
__global__
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

// Parallelized kernel: grid-stride loop
__global__
void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
dim3 blockDim(16, 16, 1);
dim3 gridDim((width + blockDim.x - 1)/ blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);
• A GPU context represents all the state (data, variables, conditions, etc.) that
is collectively required and instantiated to perform certain tasks (e.g., CUDA
compute, graphics, H.264 encode, etc.). A CUDA context is instantiated to
perform CUDA compute activities on the GPU, either implicitly by the CUDA
runtime API, or explicitly by the CUDA device API.
• A command is simply a set of data, and instructions to be performed on that
data. For example a command could be issued to the GPU to launch a kernel,
or to move a graphical window from one place to the other on the desktop.
• A channel represents a communication path between host (CPU) and the GPU.
In modern GPUs this makes use of PCI Express, and represents state and
buffers in both host and device, that are exchanged over PCI express, to issue
commands to, and provide other data to, the GPU, as well as to inform the
CPU of GPU activity.
• For the most part, using the CUDA runtime API, it's not necessary to be
familiar with these concepts, as they are all abstracted (hidden) underneath
the CUDA runtime API.
CUDA Threads, Blocks & Grids
Open Computing Language (OpenCL)
Intel Skylake GPU (GT2)
• Intel HD Graphics 520 (GT2) is an integrated graphics unit, which can be found in
various ULV (Ultra Low Voltage) processors of the Skylake generation. The "GT2"
version of the Skylake GPU offers 24 Execution Units (EUs) clocked at up to 1050 MHz
(depending on the CPU model). Due to its lack of dedicated graphics memory or eDRAM
cache, the HD 520 has to access the main memory (2x 64bit DDR3L-1600 / DDR4-2133).
OpenCL
• OpenCL (Open Computing Language)
• Programming framework for CPUs, GPUs, DSPs, FPGAs with
programming language “OpenCL C”
• Started by Apple, subsequent development with AMD, IBM, Intel, and
NVIDIA, meanwhile managed by Khronos Group
• Open and royalty–free standard
• Goal: Programming framework for portable, parallel programming of
devices in heterogeneous environments (CPUs, GPUs, and other
processors; from smartphone to supercomputer)
OpenCL Program Flow
Architecture of OpenCL
• At the conceptual level:
o Platform model
o Execution model
o Memory model
o Programming model
• At the programming level:
• OpenCL Platform API
• OpenCL Runtime API
• OpenCL C (programming language)
Platform Model
• Basic structure: Host which is
connected to several devices
• Host: Computational unit on which
the host program runs.
➢Usually: CPU of the computer system
• Device: Computational unit which
is accessed via OpenCL library.
➢ Examples: CPUs, GPUs, DSPs, FPGAs
• Further subdivision:
➢Device → “Compute Units”
➢Compute Unit → “Processing
Elements”
Platform Model (CPU, GPU and MIC)
1. CPU
• Device: All CPUs on the mainboard of the computer system
• Compute unit (CU): One CU per core (or per hardware thread)
• Processing element (PE): 1 PE per CU, or if PEs are mapped to SIMD lanes, n PEs
per CU, where n matches the SIMD width.
2. GPU
• Device: Each GPU in the system acts as single device
• Compute unit (CU): One CU per multi–processor (NVIDIA)
• Processing element (PE): 1 PE per CUDA core (NVIDIA) or “SIMD lane” (AMD)
3. MIC
• Device: Each MIC (Many Integrated Cores) in the system acts as single device
• Compute unit (CU): One CU per hardware thread (= 4 x [# of cores - 1])
• Processing element (PE): 1 PE per CU, or if PEs are mapped to SIMD lanes, n PEs
per CU, where n matches the SIMD width
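This subdivision is visible through the API: the host can, for instance, query how many compute units a device exposes. A minimal sketch (not from the slides; error handling omitted, first platform and default device assumed):

#include <stdio.h>
#include <CL/opencl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_uint        num_cus;
    size_t         max_wg;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    // Number of compute units and maximum work-group size of the device
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(num_cus), &num_cus, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);

    printf("compute units: %u, max work-group size: %lu\n",
           (unsigned)num_cus, (unsigned long)max_wg);
    return 0;
}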
Fermi: GF100/GF110 - 16 Streaming Multi–Processors(SM)
Die shot
GF100/GF110: Streaming Multi–Processor (SM)
SM properties
• 32 CUDA cores (Streaming processors/SP)
• 6 Load/store units
• 4 Special function units (SFU)
• 2 Warp scheduler
• In total (16 SMs x 32 cores): 512 ALUs/FPUs available
Platform Model (Platform)
• Platform
• Every OpenCL implementation (with underlying OpenCL library) defines a so–
called “platform“.
• Each specific platform enables the host to control the devices belonging to it.
• Platforms of various manufacturers can coexist on one host and may be used
from within a single application (ICD: “installable client driver model“).
Platform Model - Practical Hints
Get OpenCL running under Linux
• Header files: Get from Khronos website (e.g.)
➢ Central file: CL/cl.h
• OpenCL library stub with ICD loader:
• Get from one of the vendors of your OpenCL devices
➢ Central file: libOpenCL.so
• ICD definition files and platform–specific OpenCL libraries:
• Get from all the vendors of your OpenCL devices
➢ ICD files usually located in: /etc/OpenCL/vendors/
• Mechanism at runtime:
➢libOpenCL.so is dynamically linked to your application at runtime
➢ICD loader uses dlopen(..) to open all required platform–specific OpenCL libraries
➢Calls to OpenCL library functions are routed to the correct implementation
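With the header and the ICD loader in place, building an application is then just a matter of linking against the stub library, for example (assuming the source file is called vecadd.c):

gcc -o vecadd vecadd.c -lOpenCL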
Execution Model - Example: 2D–Arrangement of Work–Items
OpenCL Host API
Basic Programming Steps:
• Query platforms → selection
• Query devices of the platform → selection
• Create context for the devices
• Create queue (for context and device)
• Create program object (for context) ← from C string
➢ Compile program
➢ Create kernel (contained in program)
• Create memory objects (within context)
• Kernel execution:
1. Set kernel arguments
2. Put kernel into queue → Execution
• Copy memory objects with results from device to host (invoke via queue)
• Clean up …
Excursus: Thread Management on GPUs
Kernel
• Function for execution on the device (here: GPU)
• Typical scenario: Many kernel instantiations running simultaneously in parallel
threads
Challenge
• Management of many thousands of threads
Solution
• “Coarse Grained Parallelism” → ”Fine Grained Parallelism”
Thread Management (cont.)
→ Functions for all these steps: OpenCL Platform and Runtime API
Basic Programming Steps … in Practice
1. Query platforms : selection
2. Query devices of the platform : selection
3. Create context for the devices
4. Create queue (for context and device)
5. Create program object (for context) from C string
5.1 Compile program
5.2 Create kernel (contained in program)
6. Create memory objects (within context)
7. Kernel execution:
7.1 Set kernel arguments
7.2 Put kernel into queue : Execution
8. Copy memory objects with results from device to host (invoke via queue)
9. Clean up ...
OpenCL - Addition of two vectors using single work-item

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/opencl.h>

// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                    "\n"
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                 \n"
"__kernel void vecAdd(  __global double *a,                    \n"
"                       __global double *b,                    \n"
"                       __global double *c,                    \n"
"                       const unsigned int n)                  \n"
"{                                                             \n"
"    //Get our global thread ID                                \n"
"    int id = get_global_id(0);                                \n"
"                                                              \n"
"    //Make sure we do not go out of bounds                    \n"
"    if (id < n)                                               \n"
"        c[id] = a[id] + b[id];                                \n"
"}                                                             \n"
                                                               "\n";

int main( int argc, char* argv[] ) {
    // Length of vectors
    unsigned int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;

    cl_platform_id cpPlatform;     // OpenCL platform
    cl_device_id device_id;        // device ID
    cl_context context;            // context
    cl_command_queue queue;        // command queue
    cl_program program;            // program
    cl_kernel kernel;              // kernel
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*) malloc(bytes);
    h_b = (double*) malloc(bytes);
    h_c = (double*) malloc(bytes);
    // Initialize vectors on host
    int i;
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    size_t globalSize, localSize;
    cl_int err;
    // Number of work items in each local work group
    localSize = 64;
    // Number of total work items - localSize must be a divisor
    globalSize = ceil(n/(float)localSize)*localSize;

    // Bind to platform
    err = clGetPlatformIDs(1, &cpPlatform, NULL);
    // Get ID for the device
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    // Create a context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
    // Create a command queue
    queue = clCreateCommandQueue(context, device_id, 0, &err);
    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1,
                                        (const char **) & kernelSource, NULL, &err);
    // Build the program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "vecAdd", &err);
    // Create the input and output arrays in device memory for our calculation
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                               bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                                bytes, h_b, 0, NULL, NULL);
    // Set the arguments to our compute kernel
Addition of two Vectors …
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);
// Execute the kernel over the entire range of the data set
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
// Wait for the command queue to get serviced before reading back results
clFinish(queue);
// Read the results from the device
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL );
//Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for(i=0; i<n; i++)
sum += h_c[i];
printf("final result: %f\n", sum/n);
// Release OpenCL resources
clReleaseMemObject(d_a); clReleaseMemObject(d_b); clReleaseMemObject(d_c);
clReleaseProgram(program); clReleaseKernel(kernel); clReleaseCommandQueue(queue); clReleaseContext(context);
// Release host memory
free(h_a); free(h_b); free(h_c);
//
return 0;}
Exercises
• Task – 1:
• Implement the addition of three vectors instead of two!
• Task – 2:
• Implement a second kernel for element–wise vector multiplication!
• Compute with both kernels (multiplication and pair–wise addition) the equation
e = a * b + c * d as element–wise vector operation!
• BONUS: Use an out–of–order queue instead of the default queue …
• … and ensure by using events that all commands are executed in the right order!
OpenCL API
Query Platforms
cl_int clGetPlatformIDs ( cl_uint num_entries ,
cl_platform_id * platforms ,
cl_uint * num_platforms );
Related functions
• clGetDeviceInfo(..)
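A common usage pattern (a sketch, not from the slides) calls the function twice: first to ask how many platforms exist, then to fetch their IDs:

#include <stdlib.h>
#include <CL/opencl.h>

cl_platform_id *query_platforms(cl_uint *count)
{
    cl_uint num_platforms = 0;
    // 1st call: num_entries = 0, platforms = NULL -> only report the count
    clGetPlatformIDs(0, NULL, &num_platforms);

    cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
    // 2nd call: fill the array with at most num_platforms entries
    clGetPlatformIDs(num_platforms, platforms, NULL);

    *count = num_platforms;
    return platforms;   // caller selects one platform and frees the array
}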
Create Context -- Precondition: Device exists
cl_context clCreateContext ( const cl_context_properties * properties ,
                             cl_uint num_devices ,
                             const cl_device_id * devices ,
                             void (CL_CALLBACK * pfn_notify) (
                                 const char * errinfo ,
                                 const void * private_info , size_t cb ,
                                 void * user_data
                             ),
                             void * user_data ,
                             cl_int * errcode_ret );
Creation of a context
• Return value : The created context
• properties : Bit field for the definition of the desired properties of the context
• num_devices : Number of devices for which the context shall be created
• devices : Array with devices for which the context shall be created
• errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Create Queue -- Precondition: Context & device exist
cl_command_queue clCreateCommandQueue (cl_context context ,
cl_device_id device ,
cl_command_queue_properties properties ,
cl_int * errcode_ret );
Creation of a queue
• Return value : The created queue
• context : Context within which the queue shall be created
• device : Device for which the queue shall be created
• properties : Bit field for the definition of the desired properties of the
queue. The default mode for queues is “in order execution”
(other settings possible via parameter properties).
• errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Create Program Object -- Precondition: Context & source code exist
cl_program clCreateProgramWithSource( cl_context context ,
cl_uint count ,
const char ** strings ,
const size_t * lengths ,
cl_int * errcode_ret );
Memory object flags (cl_mem_flags, as used with clCreateBuffer):
• CL_MEM_READ_WRITE: Memory object will be read and written by a kernel.
• CL_MEM_READ_ONLY: Memory object will only be read by a kernel.
• CL_MEM_WRITE_ONLY: Memory object will only be written by a kernel.
• CL_MEM_USE_HOST_PTR: The buffer shall be located in host memory at address host_ptr (content may
  be cached in device memory). Not combinable with CL_MEM_ALLOC_HOST_PTR or CL_MEM_COPY_HOST_PTR.
• CL_MEM_ALLOC_HOST_PTR: The buffer will be newly allocated in host memory (in some implementations
  page–locked memory!).
• CL_MEM_COPY_HOST_PTR: The buffer will be initialized with the content of the memory region to which
  host_ptr points.
Set Kernel Arguments -- Precondition: Kernel exists
cl_int clSetKernelArg ( cl_kernel kernel ,
cl_uint arg_index ,
size_t arg_size ,
const void * arg_value );
• If you want to pass a global memory buffer as kernel argument, you have to use the corresponding cl_mem object as value.
• In this case, arg_size has to be the size of the cl_mem object (not the length of the buffer)!
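Continuing the vector-add example above (kernel, d_a, bytes, err as declared there), a two-line sketch of the point made in this note:

err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);   // correct: size of the cl_mem handle
// err = clSetKernelArg(kernel, 0, bytes, &d_a);         // wrong: length of the buffer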
Execution Model - Example: 2D–Arrangement of Work–Items
Kernel Execution Precondition: Queue and kernel exist, kernel arguments already set
cl_int clEnqueueNDRangeKernel ( cl_command_queue command_queue ,
                                cl_kernel kernel ,
                                cl_uint work_dim ,
                                const size_t * global_work_offset ,
                                const size_t * global_work_size ,
                                const size_t * local_work_size ,
                                cl_uint num_events_in_wait_list ,
                                const cl_event * event_wait_list ,
                                cl_event * event );
Place a kernel for execution in a queue
• Return value : Error code (ideally equal to CL_SUCCESS)
• command_queue : Queue which shall be used for execution
• kernel : The kernel to be executed
• work_dim : Number of array dimensions (concerning the following three parameters)
• global_work_offset : Fx (Fy, Fz) (see preceding slide)
• global_work_size : Gx (Gy, Gz) (see preceding slide; overall number of work-items in each dimension
  across all work-groups!)
• local_work_size : Sx (Sy, Sz) (see preceding slide; the ratios Gx/Sx, Gy/Sy, Gz/Sz need to be
  integer numbers!)
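A host-side sketch (not from the slides; queue, kernel, n, err as in the vector-add example) that rounds the global size up to a multiple of the local size before enqueueing, so that the ratio constraint holds; the kernel then guards the padded work-items with its if (id < n) test:

size_t localSize  = 64;
size_t globalSize = ((n + localSize - 1) / localSize) * localSize;  // multiple of localSize

err = clEnqueueNDRangeKernel(queue, kernel,
                             1,             // work_dim
                             NULL,          // global_work_offset
                             &globalSize,   // global_work_size
                             &localSize,    // local_work_size
                             0, NULL, NULL);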
Transfer Data from Device to Host -- Precondition: Queue exists
cl_int clEnqueueReadBuffer ( cl_command_queue command_queue ,
cl_mem buffer , cl_bool blocking_read ,
size_t offset , size_t cb , void *ptr ,
cl_uint num_events_in_wait_list ,
const cl_event * event_wait_list ,
cl_event * event );
Copy buffer content into host memory (e.g., buffer with results after kernel execution)
• Return value : Error code (ideally equal to CL_SUCCESS)
• command_queue : Queue which shall be used for execution
• buffer : Buffer object which serves as source of the copy operation
• blocking_read : If true, the function only returns after the copy operation has been
finished (and therefore also all preceding commands in the queue if it operates in “in–
order mode”)
• offset : Read offset in the buffer (in bytes)
• cb : Number of bytes to copy
• ptr : Pointer to the target region in host memory (sufficient size must be allocated)
Free OpenCL Resources -- (Selection)
cl_int clReleaseContext ( cl_context context );
cl_int clReleaseCommandQueue ( cl_command_queue command_queue );
cl_int clReleaseProgram ( cl_program program );
cl_int clReleaseKernel ( cl_kernel kernel );
cl_int clReleaseMemObject ( cl_mem memobj );
In analogy to the release functions also retain functions exist for many types of OpenCL objects. The retain
functions increase an object–internal counter, the release functions decrease it. Only after all retain calls were
compensated by a release call, the next subsequent release call will ultimately free the resources of the object.
OpenCL for Compute Kernels
Basic Facts about “OpenCL C”
• Derived from ISO C99
• A few restrictions: No recursion, no function pointers, no functions from
the C99 standard headers
• Preprocessing directives defined by C99 are supported (e.g., #include)
• Built–in data types: Scalar and vector data types, pointers, images
• Mandatory built–in functions:
➢ Work–item functions, math.h, reading and writing of images
➢ Relational functions, geometric functions, synchronization functions
➢ printf (v1.2 only)
• Optional built–in functions (called “extensions”)
➢ Support for double precision, atomics to global and local memory
Qualifiers and Functions
Function qualifiers:
➢__kernel qualifier declares a function as a kernel, i.e. makes it visible to host code so that it
can be enqueued
• Address space qualifiers:
➢__global, __local, __constant, __private
➢Pointer kernel arguments must be declared with an address space qualifier (excl. __private)
• Work-item functions:
➢get_work_dim(),
➢get_global_id(),
➢get_local_id(),
➢get_group_id(), etc.
• Synchronization functions:
➢Barriers — all work-items within a work-group must execute the barrier function before
any work-item can continue: barrier(cl_mem_fence_flags flags)
➢Memory fences — provides ordering between memory operations:
mem_fence(cl_mem_fence_flags flags)
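A short OpenCL C sketch (not from the slides; TILE and the kernel name are illustrative) combining the qualifiers, work-item functions, and a barrier: each work-group stages a tile in __local memory, synchronizes, then writes the tile out reversed within the group.

#define TILE 64   // must match the local work size used at enqueue time

__kernel void reverse_in_group(__global const float *in,
                               __global float *out)
{
    __local float tile[TILE];

    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];                 // every work-item stages one element
    barrier(CLK_LOCAL_MEM_FENCE);        // wait until the whole tile is loaded

    out[gid] = tile[get_local_size(0) - 1 - lid];   // read an element loaded by a peer
}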
Restrictions
• Recursion is not supported
• Pointers to functions are not allowed
• Pointers to pointers allowed within a kernel, but not as an argument to a kernel invocation
• Bit–fields are not supported
• Variable length arrays are not supported
• Structures and other data types have to be defined in both the host and device code
  (naturally, in exactly the same way; use common header files)
• Double types are optional in OpenCL v1.1, but the key word is reserved (note: most
  implementations support double)
Event Handling
cl_int clWaitForEvents ( cl_uint num_events ,
const cl_event * event_list );
Wait for all events in event_list
Return value : Error code (ideally equal to CL_SUCCESS)
num_events : Number of elements in event_list
event_list : Array of events
cl_int clFlush ( cl_command_queue command_queue );
Issues all previously queued OpenCL commands in command_queue to the device associated with
command_queue
• Return value : Error code (ideally equal to CL_SUCCESS)
cl_int clFinish ( cl_command_queue command_queue );
Blocks until all previously queued OpenCL commands in command_queue are issued to the associated
device and have completed. clFinish is also a synchronization point.
Return value : Error code (ideally equal to CL_SUCCESS)
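For illustration (and as a starting point for the bonus exercise above), a sketch that uses the event returned by clEnqueueNDRangeKernel as a dependency for the read-back instead of calling clFinish; names are as in the vector-add example:

cl_event kernel_done;

err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &globalSize, &localSize,
                             0, NULL, &kernel_done);

// The read waits for kernel_done via its event_wait_list, then blocks until complete.
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c,
                    1, &kernel_done, NULL);

clReleaseEvent(kernel_done);   // events are reference-counted like other OpenCL objects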
1-D NDRange
• The figure illustrates an example of 1-D
NDRange with global size = (4096, 1, 1) and
local size = (512, 1, 1). This allows the
computation to be broken down into eight
work-groups, each with 512 work-items.
• Now consider a simple vector adder kernel written with a work size of (1, 1, 1). The length
of the data is 4096, and the function iterates over the data using an explicit loop:

__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vadd(__global const int* a, __global const int* b, __global int* c) {
    for (int i = 0; i < 4096; i++)
        c[i] = a[i] + b[i];
}

• In OpenCL C, however, it is better to write the kernel as shown below:

__kernel __attribute__ ((reqd_work_group_size(512, 1, 1)))
void vadd(__global const int* a, __global const int* b, __global int* c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

• This produces the NDRange and work-group sizes shown above. Because this example
allows the OpenCL compiler and runtime to control the iteration over the 4096 data items,
it allows a simpler coding style and enables the compiler to make better optimization
decisions to parallelize the operations. The call to get_global_id(0) provides the current
location in the NDRange and is analogous to the index of a for loop.
2-D NDRange
• A 2-D NDRange works well with 2-D data such as matrices.
• The matrix adder kernel defines a local work size of 2x2, specified as a required
size of (2, 2, 1).
• The calls to get_global_id() provide the index in the global work size, while
get_global_size() provides the total range value (e.g., 64 for a 64x64 matrix).
• Alternatively, the kernel could also index the local work indices and sizes using
get_local_id() and get_local_size().

__kernel __attribute__ ((reqd_work_group_size(2, 2, 1)))
void madd(__global int* a, __global int* b, __global int* output) {
    int index = get_global_id(1)*get_global_size(0) + get_global_id(0);
    output[index] = a[index] + b[index];
}
3-D NDRange
• The concept of work size can also be extended to a 3-D space.
• The figure illustrates this work size as a 3-D cube of size 16x16x16. While the total
number of work-items is again 4096, the work space is now defined across three different
dimensions.
• This works well for applications that can be defined across three dimensions, such as
3D computer graphics and data mining algorithms.
• Similarly to the 1- and 2-dimensional cases, three-dimensional work-items can be
implemented to operate in a concurrent fashion on the FPGA device.
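Analogous to the 1-D and 2-D kernels above, a sketch of an adder indexed over the 16x16x16 work space (the kernel name vadd3d and the 2x2x2 required work-group size are illustrative, not from the slides):

__kernel __attribute__ ((reqd_work_group_size(2, 2, 2)))
void vadd3d(__global const int* a, __global const int* b, __global int* c)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);

    int nx = get_global_size(0);
    int ny = get_global_size(1);

    int index = (z * ny + y) * nx + x;   // flatten the 3-D coordinate
    c[index] = a[index] + b[index];
}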
Nomenclature - AMD vs. NVIDIA
Nomenclature - OpenCL vs. CUDA
References
• https://docs.nvidia.com/cuda/cuda-c-programming-guide/
• https://sites.google.com/site/csc8820/opencl-basics/
• https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/pet1504034296131.html
• “From Shader Code to a Teraflop: How GPU Shader Cores Work”,
Kayvon Fatahalian, Stanford University
[Figure: GPU architecture block diagram showing the host, input assembler, thread execution manager, an array of parallel data caches with texture units, and global memory]
Process Context & Context Switching
• Process context is the current state of the process, i.e., what is in its registers.
The context of the currently running process needs to be saved, so that it can be
resumed after the interrupt is handled.
• Context Switch is the process of storing the context/state of a process or
thread, so that it can be restored and resume execution at a later point.
This allows multiple processes to share a single central processing unit
(CPU) and is an essential feature of a multitasking operating system.
• For example, in the case of x86 processors, the process context is based on the
registers: ESP, SS, EIP, CS and more. We need to save the instruction
pointer (EIP) and the CS (Code Segment) so that after the interrupt is
handled we can continue running from where we were stopped.