Module 05 Massive Multi-Core Programming GPGPUs, CUDA
• Block diagrams of the Nvidia Titan GPU and the Intel i7-5960X
octa-core CPU make it clear that while cache memory dominates
the die in the CPU case, compute logic dominates in the case of
the GPU.
CSIS BITS Pilani
Nvidia’s KEPLER
• Kepler is Nvidia's third GPU architecture.
• The cores in a Kepler GPU are arranged in groups
called Streaming Multiprocessors (abbreviated SMX
in Kepler).
• Each Kepler SMX contains 192 cores that execute in a
SIMD fashion, i.e., they run the same sequence of
instructions but on different data. Each SMX can run
its own program.
• The most powerful chip in the Kepler family is the GTX Titan,
with a total of 15 SMXs. One of the SMXs is disabled in order
to improve production yields, resulting in a total of
14 · 192 = 2688 cores!
CUDA
CUDA’S Programming Model: Threads, Blocks, Grids
• GPUs are coprocessors that can be used to accelerate
parts of a program.
• A CUDA program executes like a sequential program,
delegating work to the GPU whenever parallelism is required.
#include <iostream>
#include <algorithm>
#include <cstdlib>

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: the 1D stencil kernel, executed on the device
__global__ void stencil_1d(int *in, int *out) {
  __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
  int gindex = threadIdx.x + blockIdx.x * blockDim.x;
  int lindex = threadIdx.x + RADIUS;
  // Read input elements into shared memory (including the halo)
  temp[lindex] = in[gindex];
  if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
  }
  // Synchronize (ensure all the data is available)
  __syncthreads();
  // Apply the stencil
  int result = 0;
  for (int offset = -RADIUS; offset <= RADIUS; offset++)
    result += temp[lindex + offset];
  out[gindex] = result;
}

int main(void) {
  int *in, *out;     // host copies
  int *d_in, *d_out; // device copies
  int size = (N + 2 * RADIUS) * sizeof(int);
  // serial code: allocate and initialize host memory
  in = (int *)malloc(size);  std::fill_n(in, N + 2 * RADIUS, 1);
  out = (int *)malloc(size); std::fill_n(out, N + 2 * RADIUS, 1);
  // Allocate device memory
  cudaMalloc((void **)&d_in, size);
  cudaMalloc((void **)&d_out, size);
  // Copy to device
  cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
  // parallel code: launch the kernel (offset past the halo), copy result back
  stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
  cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
  // Cleanup (serial code)
  free(in); free(out);
  cudaFree(d_in); cudaFree(d_out);
  return 0;
}
• Output:
$ nvcc hello_world.cu
$ ./a.out
Hello World!
$
The program can be compiled and executed as follows (CUDA programs should
be stored in files with a .cu extension):
$ nvcc -arch=sm_20 hello.cu -o hello
$ ./hello
• The “architecture” switch (-arch=sm_20) in the Nvidia CUDA Compiler
(nvcc) driver command line above instructs the compiler to generate GPU
code for a device of compute capability 2.0; the resulting code is
compatible with devices of capability 2.0 and higher.
Hello World! In CUDA
[Figure: four blocks of M = 8 threads each, numbered 0–7 within every
block; the highlighted thread has blockIdx.x = 2 and threadIdx.x = 5,
i.e., global index 2 · 8 + 5 = 21.]
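The indexing scheme in the figure can be sketched as a small kernel; each thread combines its block and thread IDs into a unique global index (the kernel name and launch configuration below are illustrative, not from the slides):

```cuda
#include <cstdio>

// Each thread computes its global index from its block and thread IDs.
// With 8 threads per block, the thread with blockIdx.x = 2 and
// threadIdx.x = 5 gets global index 2 * 8 + 5 = 21.
__global__ void print_global_id(void) {
  int gid = blockIdx.x * blockDim.x + threadIdx.x;
  printf("block %d, thread %d -> global id %d\n",
         blockIdx.x, threadIdx.x, gid);
}

int main(void) {
  print_global_id<<<4, 8>>>(); // 4 blocks of 8 threads: global ids 0..31
  cudaDeviceSynchronize();     // wait for the device-side printf output
  return 0;
}
```

Note that device-side printf requires compute capability 2.0 or higher, matching the -arch=sm_20 switch discussed above.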
• Because threads execute in blocks, and each block executes warp by warp,
explicit synchronization of the threads must take place between the discrete
phases of a kernel, e.g., between initializing a shared-memory histogram
array and starting to calculate the histogram, and so on.
• The __syncthreads() function can be called inside a kernel to act as a
barrier for all the threads in a block.
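The histogram example mentioned above can be sketched as follows; this is a minimal sketch, and the 256-bin layout, kernel name, and grid-stride loop are assumptions rather than code from the slides:

```cuda
#define BINS 256

// Shared-memory histogram with barriers separating the kernel's phases.
__global__ void histogram(const unsigned char *in, int n, unsigned int *out) {
  __shared__ unsigned int hist[BINS];

  // Phase 1: all threads cooperate to zero the shared-memory histogram.
  for (int i = threadIdx.x; i < BINS; i += blockDim.x)
    hist[i] = 0;
  __syncthreads(); // barrier: every bin is zeroed before counting starts

  // Phase 2: accumulate counts with atomic updates in shared memory.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x)
    atomicAdd(&hist[in[i]], 1u);
  __syncthreads(); // barrier: all counts are in before writing out

  // Phase 3: merge this block's histogram into the global result.
  for (int i = threadIdx.x; i < BINS; i += blockDim.x)
    atomicAdd(&out[i], hist[i]);
}
```

Without the first barrier, fast threads could start counting into bins that slower threads have not yet zeroed; the barrier is what makes the phases discrete.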
• Streams
For a given stream, cudaStreamSynchronize() will block until all the
operations queued in that stream are complete.
• Events
The time instant at which a command (and everything preceding it
in a stream) completes can be captured in the form of an
event. CUDA uses the cudaEvent_t type for managing events.
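A typical use of cudaEvent_t is timing a kernel by recording one event before it and one after it in the same stream; a minimal, self-contained sketch (the kernel being timed is a placeholder of my own, not from the slides):

```cuda
#include <cstdio>

// Placeholder kernel: trivial per-element work to have something to time.
__global__ void busy_kernel(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

int main(void) {
  const int n = 1 << 20;
  float *d_x;
  cudaMalloc(&d_x, n * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);                    // marker before the kernel
  busy_kernel<<<(n + 255) / 256, 256>>>(d_x, n);
  cudaEventRecord(stop, 0);                     // marker after the kernel

  cudaEventSynchronize(stop);                   // block until "stop" completes
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);       // elapsed time in milliseconds
  printf("kernel took %.3f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(d_x);
  return 0;
}
```

Because the events are enqueued in the same stream as the kernel, the elapsed time measures the kernel alone, not any host-side overhead before or after the launch.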
• OpenCL:

__kernel void vecadd(__global const float *a, __global const float *b,
                     __global float *result) {
  int id = get_global_id(0);
  result[id] = a[id] + b[id];
}

• CUDA:

__global__ void cuenergy(…) {
  unsigned int xindex = blockIdx.x * blockDim.x + threadIdx.x;
  unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
  unsigned int outaddr = gridDim.x * blockDim.x * UNROLLX * yindex + xindex;
  …
}
// Set the arguments of the kernel
clStatus = clSetKernelArg(kernel, 0, sizeof(float), (void *)&alpha);
clStatus = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&A_clmem);
clStatus = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&B_clmem);
clStatus = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void *)&C_clmem);

// Execute the OpenCL kernel on the list
size_t global_size = VECTOR_SIZE; // Process the entire list
size_t local_size = 64;           // Process 64 items per work-group
clStatus = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                  &global_size, &local_size,
                                  0, NULL, NULL);

// Finally release all OpenCL allocated objects and host buffers
clStatus = clReleaseKernel(kernel);
clStatus = clReleaseProgram(program);
clStatus = clReleaseMemObject(A_clmem);
clStatus = clReleaseMemObject(B_clmem);
clStatus = clReleaseMemObject(C_clmem);
clStatus = clReleaseCommandQueue(command_queue);
clStatus = clReleaseContext(context);
free(A);
free(B);
free(C);
free(platforms);
free(device_list);
return 0;
}
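For comparison, the OpenCL host sequence above collapses to very little code in CUDA: the triple-chevron launch syntax plays the role of clSetKernelArg plus clEnqueueNDRangeKernel, with the grid and block dimensions standing in for global_size and local_size. A minimal sketch (the SAXPY kernel and sizes here are assumptions chosen to mirror the alpha/A/B/C arguments above):

```cuda
// SAXPY kernel: C[i] = alpha * A[i] + B[i], one thread per element.
__global__ void saxpy(float alpha, const float *A, const float *B,
                      float *C, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) C[i] = alpha * A[i] + B[i];
}

// Launch: kernel arguments are passed directly, and
// <<<grid, block>>> corresponds to global_size / local_size.
// saxpy<<<(n + 63) / 64, 64>>>(alpha, d_A, d_B, d_C, n);
```

The explicit argument-marshalling and release calls of OpenCL buy portability across vendors; CUDA trades that portability for the terser, type-checked launch syntax.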
Thank You