
Overview of GPGPU’s

Deepika H.V
C-DAC Knowledge Park
[email protected]

19, June 2013 Think Parallel - 2013


CONTENTS
 GPU Overview
 Evolution
 Major Cards
 CUDA Basics
 Terminology
 CUDA Kernels & Threads
 Thread Hierarchy
 CUDA Memory Architecture
 CUDA Syntax
 Compilation & Debugging
 Development Tools



GPU Overview

 GPU - a specialized microprocessor that offloads and accelerates graphics rendering from the central processor.
 Used in embedded systems, mobile phones, personal computers, workstations, and game consoles.
 Requirements - the demands of modern computing:
 Fast, smooth gaming
 Features:
 Real-time rendering
 Hardware graphics pipeline


Real-Time Rendering
Graphics hardware enables real-time rendering.
Real-time means a display rate of more than 10 images per second.


GPGPU Evolution
 Dedicated graphics cards
 contain RAM dedicated to the card's use
 Integrated graphics solutions
 utilize a portion of the computer's system RAM, i.e. shared memory
 Hybrid solutions
 share memory with the system plus a small dedicated memory cache, to make up for the high latency of system RAM
 GPGPU
 modified concept of the stream processor
 Given a set of data (a stream), a series of operations (kernel functions) is applied to each element in the stream (see the sketch below)
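
A minimal sketch (assumed, not taken from the slides) of the stream idea in CUDA C: the same kernel function is applied independently to every element of the input stream.

__global__ void scale_stream(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one stream element per thread
    if (i < n)
        out[i] = factor * in[i];                     // same kernel operation on every element
}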



GPGPU
 Turns the massive floating-point computational power of a modern graphics accelerator into general-purpose computing power.
 Massively multi-threaded: 1000s of threads on many cores, i.e. 100s of scalar processors.
 Uses fine-grained data-parallel computation.
 Peak performance is up to 1 TFLOP (Nvidia HPC card).


CPU & GPU Architecture



Major Players
 NVIDIA (formed in 1993)
NV1: NVIDIA's first product, based on quadratic surfaces
RIVA TNT, RIVA TNT2: DirectX 6 support, OpenGL 1
NVIDIA GeForce: graphics processors for personal computers
NVIDIA Quadro: graphics processors for workstations
NVIDIA Tesla: dedicated GPGPU processors for HPC
NVIDIA Tegra: processors for mobile devices
 AMD
Mach: 2D GUI "Windows Accelerator"
Rage: 2D and 3D accelerator chips
Radeon: DirectX 3D accelerators
FireGL & FirePro: workstation video cards
Imageon: handheld devices, cellphones and Tablet PCs
AMD FireStream: for HPC, utilizing the stream processing concept


Nvidia Fermi Architecture



AMD Cypress Architecture



CUDA

CUDA Basics
 COMPUTE UNIFIED DEVICE ARCHITECTURE
 Used to expose the computational horsepower of NVIDIA GPUs for GPU computing
 It is scalable across any number of threads
 Software
 Based on industry-standard C
 Small set of extensions to the C language
 Low learning curve
 Straightforward APIs to manage devices, memory, etc.


Terminology
 Host – The CPU and its memory
 Device - The GPU and its memory
 Kernel - Function compiled for the device and it is
executed on the device with many threads

 Pre-Requisites
 You (probably) need experience with C or C++

 You do not need any GPU experience

 You do not need any graphics experience

 You do not need any parallel programming experience



Why CUDA ???
 Data Parallelism
 Program property where arithmetic operations can be performed simultaneously on data structures.
 E.g. a 1,000 x 1,000 matrix multiplication:
 1,000,000 independent dot products,
 each consisting of 1,000 multiply and 1,000 add arithmetic operations (see the sketch below).
 Thread Creation: CUDA threads are lighter weight than CPU threads.
 They take few cycles to generate and schedule, due to efficient hardware support.
 CPU threads typically take thousands of clock cycles to generate and schedule.
 It avoids the performance overhead of graphics-layer APIs by compiling software directly to the hardware (GPU assembly language).
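
A minimal sketch (assumed, not taken from the slides) of the matrix-multiplication example: one CUDA thread per output element, each performing one n-element dot product.

__global__ void matmul(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)          // one dot product: n multiplies, n adds
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}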



Processing Flow on CUDA
Example of CUDA processing flow:

 Copy data from main memory to GPU memory
 CPU issues the process (kernel launch) to the GPU
 GPU executes the computation in parallel on each core
 Copy the result from GPU memory to main memory
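
A hedged skeleton (names assumed) mapping the four steps onto CUDA API calls:

cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // 1. main memory -> GPU memory
compute<<<nBlocks, nThreads>>>(d_in, d_out);              // 2./3. CPU launches, GPU executes in parallel
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // 4. GPU memory -> main memory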



How to Write
 Create or edit the CUDA program with your favorite editor. Note: CUDA C
language programs have the suffix ".cu".

 Compile the program with nvcc to create the executable.

 Run the executable.

CPU Serial Code
GPU Parallel Kernel:    KernelA<<< nBlk, nTid >>>(args);

CPU Serial Code
GPU Parallel Kernel:    KernelB<<< nBlk, nTid >>>(args);



Hello World

#include <stdio.h>

__global__ void kernel( void )
{
}

int main( void ) {
    kernel<<< 1, 1 >>>();
    printf( "Hello, World!\n" );
    return 0;
}

 Two notable additions to the original “Hello, World!”

 Let’s discuss what these two additions are



Hello World with device code

__global__ void kernel( void )


{
}

 CUDA C keyword __global__ indicates that a function:
-- Runs on the device
-- Is called from host code
 nvcc splits source file into host and device components
 NVIDIA’s compiler handles device functions like kernel()
 Standard host compiler handles host functions like main()



 int main( void ) {
    kernel<<< 1, 1 >>>();
    printf( "Hello, World!\n" );
    return 0;
}

 Triple angle brackets mark a call from host code to device code
-- Sometimes called a “kernel launch”
--We’ll discuss the parameters inside the angle brackets later

 This is all that’s required to execute a function on the GPU!

 The function kernel() does nothing.



A bit complex
 A simple kernel to add two integers:

__global__ void add( int *a, int *b, int *c )
{
    *c = *a + *b;
}

 As before, __global__ is a CUDA C keyword meaning that
 add() will execute on the device, so a, b, and c must point to device memory
 add() will be called from the host
 Notice we use pointers for our variables ... so where do we allocate that memory?


int main( void ) {
    int a, b, c;                  // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = sizeof( int );     // we need space for an integer

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = 2; b = 7;
    // (copy, kernel launch, and cleanup are shown on the next slides)
}


CUDA interface for data allocation & data movement between CPU & device

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

int main(void)
{
    // pointers to host memory
    float *a_h, *b_h;
    // pointers to device memory
    float *a_d, *b_d;
    int N = 14; int i;

    // allocate arrays on host
    a_h = (float *)malloc(sizeof(float)*N);
    b_h = (float *)malloc(sizeof(float)*N);

    // memory allocation: allocate arrays on device
    cudaMalloc((void **) &a_d, sizeof(float)*N);
    cudaMalloc((void **) &b_d, sizeof(float)*N);

    // initialize host data
    for (i=0; i<N; i++) {
        a_h[i] = 10.f+i; b_h[i] = 0.f;
    }

    // memory transfer: send data from host to device: a_h to a_d
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

    // copy on the device: a_d to b_d (added so the check below holds)
    cudaMemcpy(b_d, a_d, sizeof(float)*N, cudaMemcpyDeviceToDevice);

    // get data from device: b_d to b_h
    cudaMemcpy(b_h, b_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // check result
    for (i=0; i<N; i++)
        assert(a_h[i] == b_h[i]);

    // cleanup
    free(a_h); free(b_h);
    cudaFree(a_d); cudaFree(b_d);

    return 0;
}


Memory Allocation and Transfer
Memory Allocation
 cudaMalloc() :
 Address of a pointer to the allocated object
 Size of the allocated object
 cudaFree()
 Frees the object from device memory

Memory Copy
 cudaMemcpy()
 Pointer to the destination location
 Pointer to the source data object
 Number of bytes to copy
 Type/direction of memory transfer involved in the copy
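
The fourth cudaMemcpy() argument selects the transfer direction via a cudaMemcpyKind value; a brief sketch (variable names assumed from the earlier example):

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);    // host   -> device
cudaMemcpy(b_h, b_d, size, cudaMemcpyDeviceToHost);    // device -> host
cudaMemcpy(b_d, a_d, size, cudaMemcpyDeviceToDevice);  // device -> device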



CUDA interface updated

#include <stdio.h>
#include <cuda.h>

__global__ void add(int *a_d, int *b_d, int *c_d)
{
    *c_d = *a_d + *b_d;
}

int main(void)
{
    // host copies
    int a_h, b_h, c_h;
    int size = sizeof(int);
    // pointers to device memory
    int *a_d, *b_d, *c_d;

    // allocate arrays on device
    cudaMalloc((void **) &a_d, size);
    cudaMalloc((void **) &b_d, size);
    cudaMalloc((void **) &c_d, size);

    // initialize host data
    a_h = 10;
    b_h = 20;

    // send data from host to device: a_h to a_d, b_h to b_d
    cudaMemcpy(a_d, &a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, &b_h, size, cudaMemcpyHostToDevice);

    add<<<1,1>>>(a_d, b_d, c_d);

    // get the result from device: c_d to c_h
    cudaMemcpy(&c_h, c_d, size, cudaMemcpyDeviceToHost);

    // cleanup
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    return 0;
}
Terms to understand

 What is a kernel?
 How do you call a kernel?
 How do you synchronize?


CUDA Kernels & Threads
 Parallel portions of an application are executed on the device as kernels
- One kernel is executed at a time
- Many threads execute each kernel
- __global__ indicates the function is a kernel; it can be called from host functions to generate a grid of threads
- Once a kernel is launched, its dimensions cannot change in the current CUDA run-time implementation
 Both host (CPU) and device (GPU) manage their own memory - host & device memory
 Data can be copied between them


Array of Parallel Threads

threadID   0 1 2 3 4 5 6 ...

float x = input[threadID];
float y = func(x);
output[threadID] = y;


Thread hierarchy
 Kernels are executed by threads.
-- A kernel is a simple C program.
-- Each thread has an ID, used to compute memory addresses & make control decisions.
-- Thousands of threads execute the same kernel.
 Threads are grouped into blocks.
-- Threads in a block can synchronize execution.
-- Threads within a block co-operate using shared memory.
 Blocks are grouped into a grid.
-- Blocks are independent (must be able to execute in any order).


CPU Thread
Kernel invocation: 65653 threads

1. How do we access individual threads?
2. How many unique IDs? -- 65653
3. How do we synchronize such a large thread bank?
4. How do we handle memory for them?


CPU Thread
Kernel invocation

[Figure: the kernel launch creates a grid of thread blocks, e.g. Block(0,0) ... Block(2,1) in a 3 x 2 grid, with up to 1024 threads (??) per block]

[Figure: each block in the grid is itself a 2-D array of threads, e.g. threads (0,0) ... (3,2)]


Grid of Thread Blocks
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2-D arrangement of blocks, e.g. Block(0,0) ... Block(2,1), and each block, e.g. Block(1,1), is a 2-D arrangement of threads, e.g. Thread(0,0) ... Thread(4,2)]

 The computational grid consists of a grid of thread blocks
 The application specifies the grid and block size
 The grid layouts can be 1- or 2-dimensional
 The maximal sizes are determined by GPU memory and card capability
 Each block has a unique block ID
 Each thread has a unique thread ID (within the block)


#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

// Function run on the host
void incrementArrayOnHost(float *a, int N)
{
    int i;
    for (i=0; i < N; i++)
        a[i] = a[i]+1.f;
}

// Compute kernel
__global__ void incrementOnDev(float *a, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx<N) a[idx] = a[idx]+1.f;
}

int main(void){
    float *a_h, *b_h;   // host arrays
    float *a_d;         // device array
    int i, N = 10;
    size_t size = N*sizeof(float);

    // allocate arrays and initialize
    a_h = (float *)malloc(size);
    b_h = (float *)malloc(size);
    cudaMalloc((void **) &a_d, size);
    for (i=0; i<N; i++) a_h[i] = (float)i;

    // copy data to GPU
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

    // do calculation on host
    incrementArrayOnHost(a_h, N);

    // compute execution configuration and do calculation on device
    int blockSize = 4;
    int nBlocks = N/blockSize + (N%blockSize == 0 ? 0 : 1);
    incrementOnDev <<< nBlocks, blockSize >>> (a_d, N);

    // copy result back and clean up
    cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    for (i=0; i<N; i++) assert(a_h[i] == b_h[i]);
    free(a_h); free(b_h); cudaFree(a_d);
}
 threadIdx.x - thread ID within block
 blockIdx.x - block ID within grid
 blockDim.x - number of threads per block



Memory Hierarchy

Three types of memory on the graphics card:
 Global memory: 3-6 GB
 Shared memory: 48 KB
 Registers: 32 KB

Latency:
 Global memory: 400-600 cycles
 Shared memory: fast
 Registers: fast

Purpose:
 Global memory: I/O for the grid
 Shared memory: thread collaboration
 Registers: thread space


Global Memory

 Main means of communicating R/W data between host and device
 Contents visible to all threads
 Long latency (100s of cycles)
 Off-chip, read/write access
 Host can read/write
 GT200
 • Up to 4 GB
 GF100
 • Up to 6 GB


Local Memory

 Stored in global memory
 One copy per thread
 Used for automatic arrays
 Unless all accesses use only constant indices


Constant Memory

 Short latency, high bandwidth, read-only access when all threads access the same location
 Stored in global memory but cached
 Host can read/write
 Initialized by the host
 Up to 64 KB
 Cache is 8 KB
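
A minimal sketch (assumed, not from the slides) of declaring constant memory, filling it from the host with cudaMemcpyToSymbol(), and reading it in a kernel:

__constant__ float coeff[16];           // lives in constant memory, cached on chip

__global__ void scale(float *data)
{
    // every thread reads the same location, so it is served from the constant cache
    data[threadIdx.x] *= coeff[0];
}

// host side:
//   float h_coeff[16] = { ... };
//   cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));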



Shared Memory

 Shared memory is allocated to thread blocks
 All threads in a block can access variables in the shared memory locations allocated to the block
 Fast, on-chip, read/write access
 Full-speed random access
 48 KB per SM
 48 KB / 8 blocks = 6 KB per block


Registers

 Allocated to individual threads
 Each thread can only access its own registers
 Used for frequently accessed variables that are private to each thread
 32K registers per SM
 32K / 1024 threads ~ 31 registers per thread
 Exceeding the limit reduces the number of resident threads, a whole block at a time


Summary

Memory           Scope       Access      Location  Cached
Registers        Per thread  Read-write  On-chip   No
Local memory     Per thread  Read-write  Off-chip  No
Shared memory    Per block   Read-write  On-chip   No
Global memory    Per grid    Read-write  Off-chip  No
Constant memory  Per grid    Read-only   Off-chip  Yes


CUDA Syntax



Built-in variables

 dim3 dimGrid(x,y);
// Dimensions of the grid in blocks (1 to 65,535 per dimension)
 dim3 dimBlock;
// Dimensions of the block in threads
 dim3 blockIdx;
// Block index within the grid (0 to gridDim.x-1, etc.)
 dim3 threadIdx;
// Thread index within the block
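
A minimal sketch (assumed) combining the built-in index variables to form a unique 2-D index for each thread across the whole grid:

__global__ void index2d(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column within the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row within the grid
    if (x < width && y < height)
        out[y * width + x] = (float)(y * width + x);
}

// launch (sizes assumed):
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid((width + 15) / 16, (height + 15) / 16);
//   index2d<<<dimGrid, dimBlock>>>(d_out, width, height);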



Variable Type

Declaration                     Memory    Scope         Lifetime
Automatic variables             Register  Thread        Kernel
Automatic array variables       Local     Thread        Kernel
__shared__ int SharedVar;       Shared    Thread block  Kernel
__device__ int GlobalVar;       Global    Grid          Application
__constant__ int ConstantVar;   Constant  Grid          Application

 Automatic variables without any qualifier reside in registers, except for large structures or arrays, which reside in local memory
 Pointers can point to memory allocated or declared in either global or shared memory


Language Extensions

 A kernel function is called with an execution configuration:

dim3 dimGrid(100, 50);        // 5000 thread blocks
dim3 dimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;   // 64 bytes of shared memory
KernelFunc<<< dimGrid, dimBlock, SharedMemBytes >>>(...);

 The optional SharedMemBytes bytes are:
 Allocated in addition to the compiler-allocated shared memory
 Mapped to any variable declared as:
extern __shared__ float DynamicSharedMem[];
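
A minimal sketch (assumed, not from the slides) of using the dynamically sized shared array; the third launch parameter gives each block n floats of shared memory:

__global__ void reverse(float *d)
{
    extern __shared__ float s[];          // size set by the launch configuration
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[blockDim.x - 1 - t];
}

// launch: reverse<<<1, n, n * sizeof(float)>>>(d_data);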



Function Type Qualifiers
                                 Executed on   Callable from
__device__ float DeviceFunc()    Device        Device
__global__ void KernelFunc()     Device        Host
__host__ float HostFunc()        Host          Host

 __global__ defines a kernel function & must return void
 A __global__ launch is an asynchronous call
 __device__ and __host__ can be used together
 __device__ functions cannot have their address taken

__host__ __device__ func()
{
#if __CUDA_ARCH__ == 100
    // Device code path for capability 1.0
#elif __CUDA_ARCH__ == 200
    // Device code path for capability 2.0
#elif !defined(__CUDA_ARCH__)
    // Host code path
#endif
}


Thread Synchronization

Every thread in the block executes:

    Mds[i] = Md[j];
    __syncthreads();
    func(Mds[i], Mds[i+1]);

Time step by time step (the original slides animate threads 0-3):
 Threads reach the shared-memory write Mds[i] = Md[j] at different times.
 Threads 0 and 1 arrive first and are blocked at the __syncthreads() barrier.
 Only once all threads in the block have reached the barrier can any thread continue.
 After the barrier, each thread can safely read its neighbour's element in func(Mds[i], Mds[i+1]).
Sample : Dot Product

c = a ∙ b
c = (a0, a1, a2, a3) ∙ (b0, b1, b2, b3)
c = a0 b0 + a1 b1 + a2 b2 + a3 b3

[Figure: each pairwise product Ai * Bi is computed in parallel; the products are then added (+) into the single result C - how do we do the add?]

__global__ void dot( int *a, int *b, int *c ) {
    // Each thread computes a pairwise product
    int temp = a[threadIdx.x] * b[threadIdx.x];
    ..........
    ..........
}
#define N 512
__global__ void dot( int *a, int *b, int *c ) {
    // Shared memory for the results of the multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    __syncthreads();

    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}
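
A hedged usage sketch (device pointers and a host int result assumed, allocated and filled as in the earlier examples): the kernel is launched as a single block of N threads and the scalar result is copied back.

dot<<< 1, N >>>( dev_a, dev_b, dev_c );
cudaMemcpy( &result, dev_c, sizeof(int), cudaMemcpyDeviceToHost );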



Sample : Dot Product

[Figure: Block 0 computes the pairwise products A0*B0, A1*B1, ... and sums them; Block N computes A512*B512, A513*B513, ... and sums them; the per-block partial sums are then combined (+) into the single result C]


#define N (2048*2048)
#define THREADS_PER_BLOCK 512
__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ )
            sum += temp[i];
        *c += sum;        // Race condition???
    }
}


Race Conditions
 Terminology: a race condition occurs when program behavior depends upon the relative timing of two (or more) event sequences.
 What actually takes place to execute the line in question, *c += sum; :
 Read value at address c
 Add sum to the value
 Write result to address c

 What if two threads are trying to do this at the same time?

 Thread 0, Block 0                  Thread 0, Block 1
 1. Read value at address c         1. Read value at address c
 2. Add sum to value                2. Add sum to value
 3. Write result to address c       3. Write result to address c


Global Memory Contention

Both blocks perform a read-modify-write on *c (Block 0 has sum = 3, Block 1 has sum = 4).
If the two sequences happen to serialize:

 Block 0: reads 0, computes 0+3, writes 3      (*c: 0 -> 3)
 Block 1: reads 3, computes 3+4, writes 7      (*c: 3 -> 7)

If the read-modify-write sequences interleave instead, one block's update can be lost.


Atomic Operations
Terminology: a read-modify-write operation is uninterruptible when it is atomic.

Many atomic operations on memory are available in CUDA C:
 atomicAdd()   atomicInc()
 atomicSub()   atomicDec()
 atomicMin()   atomicExch()
 atomicMax()   atomicCAS()

Predictable results when simultaneous access to memory is required.

We need to atomically add sum to c in our multiblock dot product.


#define N (2048*2048)
#define THREADS_PER_BLOCK 512
__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ )
            sum += temp[i];
        atomicAdd( c, sum );
    }
}
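
A hedged host-side sketch (variable names assumed, not taken from the slides; appended to the kernel listing above) launching the multi-block dot product with one thread per element:

int main( void )
{
    int *a, *b, c = 0;
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof(int);

    a = (int *)malloc( size );
    b = (int *)malloc( size );
    cudaMalloc( (void **)&dev_a, size );
    cudaMalloc( (void **)&dev_b, size );
    cudaMalloc( (void **)&dev_c, sizeof(int) );

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2; }

    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_c, &c, sizeof(int), cudaMemcpyHostToDevice );   // zero the accumulator

    dot<<< N / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

    free(a); free(b);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}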



CUDA Compilation &
Debugging



Compilation

 Source files that contain CUDA language extensions must be compiled with nvcc
 - They carry the .cu suffix
 nvcc outputs:
 - C code,
 - assembly code (PTX), or
 - object code directly.
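
For example (file name assumed), a .cu file is compiled and run with:

nvcc -o vectorAdd vectorAdd.cu
./vectorAdd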



Debugging Tools

 Debugging
-- cuda-gdb
-- cuda-memcheck
-- Parallel NSight Debugger
 Performance Analysis
-- CUDA Visual Profiler
-- Parallel NSight Analyser

CUDA Development Tools :
 cuda-gdb
 cuda-memcheck
 Visual Profiler



CUDA-gdb



 If the kernel's function name is mykernel_main, the break command is as follows:
(cuda-gdb) break mykernel_main
 The user can inspect either a specific host thread or a specific CUDA thread.
-- To switch to a host thread, use the "thread N" command.
-- To switch to a CUDA thread, use
"cuda device/sm/warp/lane/kernel/grid/block/thread"
 A variable array (here a shared int array) can be accessed directly in order to see what values are stored in it:
(cuda-gdb) p &array
$1 = (@shared int (*)[0]) 0x20
(cuda-gdb) p array[0]@4
$2 = {0, 128, 64, 192}


 Determining the coordinates
(cuda-gdb) cuda device sm warp lane block thread
Current CUDA focus: device 0, sm 0, warp 0, lane 0, block(0,0), thread (0,0,0)

 Change the physical coordinates


(cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
New CUDA focus: device 0, sm 1, warp 2, lane 3, grid 1,block (10,0), thread (67,0,0)

 Similarly : cuda thread (15,0,0)


cuda block (1,0) thread (3,0,0)
cuda kernel 0
 Display system information
info cuda system/device/sm/warp/lane
 Checking memory errors
set cuda memcheck on



CUDA Visual Profiler
 Provides strategic metrics to find potential performance problems.
 GPU and CPU timing for all kernel invocations and memcpy calls.
 Evolution over time through time stamps.
 Access to hardware performance counters.


Profiler Signals

Interpreting profiler counters
 Values represent events within a thread warp.
 Only one multiprocessor is targeted, so values will not correspond to the total number of warps launched for a particular kernel.
 Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.
 Values are best used to identify relative performance differences between unoptimized and optimized code.
 Try to reduce the magnitudes of gld/gst_incoherent, divergent_branch and warp_serialize.


Speedups of GPU vs CPU (Nvidia projections)

Note: speedup is calculated only for computation time; data transfer time is not included.
References

 Nvidia CUDA Programming Guide
 CUDA by Example: An Introduction to General-Purpose GPU Programming
 Programming Massively Parallel Processors
 CUDA - GTC 2010 workshop, http://www.nvidia.com/object/gtc2010-presentation-archive.html#tools
 Courseware of the University of Illinois, http://courses.engr.illinois.edu/ece498/al/textbook/
 Parallel Programming course - UC Berkeley
 GPGPU, PEMG - 2010, http://cdac.in/html/events/beta-test/PEMG-2010/pemg10-about-overview.html
 CUDA Optimizations, Debugging and Profiling - University of Malaga
 Introduction to GPU Computing - Nagasaki University
 CUDA, Supercomputing for the Masses, http://drdobbs.com/high-performance-computing/207200659
 CUDA Training Material - Nvidia


THANK YOU
