
Overview of GPGPU’s

Deepika H.V
C-DAC Knowledge Park
[email protected]

19, June 2013 Think Parallel - 2013


CONTENTS
 GPU Overview
 Evolution
 Major Cards
 CUDA Basics
 Terminology
 CUDA Kernels & Threads
 Thread Hierarchy
 CUDA Memory Architecture
 CUDA Syntax
 Compilation & Debugging
 Development Tools



GPU Overview

 GPU - a specialized microprocessor that offloads and accelerates graphics rendering from the central processor.
 Used in embedded systems, mobile phones, personal computers, workstations, and game consoles.
 Requirements - the demands of modern computing:
 Fast, smooth gaming
 Features:
 Real-time rendering
 Hardware graphics pipeline


Real-Time Rendering
Graphics hardware enables real-time rendering.
Real-time means a display rate of more than 10 images per second.


GPGPU Evolution
 Dedicated graphics cards
 contain RAM dedicated to the card's use
 Integrated graphics solutions
 utilize a portion of the computer's system RAM, i.e. shared memory
 Hybrid solutions
 share memory with the system plus a small dedicated memory cache, to make up for the high latency of system RAM
 GPGPU
 modified concept of the stream processor
 Given a set of data (a stream), a series of operations (kernel functions) is applied to each element in the stream (see the sketch below)
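
A minimal sketch (assumed, not taken from the slides) of the stream idea in CUDA C: the same kernel function is applied independently to every element of the input stream.

__global__ void scale_stream(const float *in, float *out, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one stream element per thread
    if (i < n)
        out[i] = factor * in[i];                     // same kernel operation on every element
}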



GPGPU
 Turns the massive floating-point computational power of a modern graphics accelerator into general-purpose computing power.
 Massively multi-threaded: 1000s of threads on many cores, i.e. 100s of scalar processors.
 Uses fine-grained data-parallel computation.
 Peak performance is up to 1 TFLOP (Nvidia HPC card).


CPU & GPU Architecture



Major Players
 NVIDIA (formed in 1993)
NV1: NVIDIA's first product, based on quadratic surfaces
RIVA TNT, RIVA TNT2: DirectX 6 support, OpenGL 1
NVIDIA GeForce: graphics processors for personal computers
NVIDIA Quadro: graphics processors for workstations
NVIDIA Tesla: dedicated GPGPU processors for HPC
NVIDIA Tegra: processors for mobile devices
 AMD
Mach: 2D GUI "Windows Accelerator"
Rage: 2D and 3D accelerator chips
Radeon: DirectX 3D accelerators
FireGL & FirePro: workstation video cards
Imageon: handheld devices, cellphones and Tablet PCs
AMD FireStream: for HPC, utilizing the stream processing concept


Nvidia Fermi Architecture



AMD Cypress Architecture



CUDA

CUDA Basics
 COMPUTE UNIFIED DEVICE ARCHITECTURE
 Used to expose the computational horsepower of NVIDIA GPUs for GPU computing
 It is scalable across any number of threads
 Software
 Based on industry-standard C
 Small set of extensions to the C language
 Low learning curve
 Straightforward APIs to manage devices, memory, etc.


Terminology
 Host – The CPU and its memory
 Device - The GPU and its memory
 Kernel - Function compiled for the device and it is
executed on the device with many threads

 Pre-Requisites
 You (probably) need experience with C or C++

 You do not need any GPU experience

 You do not need any graphics experience

 You do not need any parallel programming experience



Why CUDA ???
 Data Parallelism
 Program property where arithmetic operations can be performed simultaneously on data structures.
 E.g. a 1,000 x 1,000 matrix multiplication:
 1,000,000 independent dot products,
 each consisting of 1,000 multiply and 1,000 add arithmetic operations (see the sketch below).
 Thread Creation: CUDA threads are lighter weight than CPU threads.
 They take few cycles to generate and schedule, due to efficient hardware support.
 CPU threads typically take thousands of clock cycles to generate and schedule.
 It avoids the performance overhead of graphics-layer APIs by compiling software directly to the hardware (GPU assembly language).
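
A minimal sketch (assumed, not taken from the slides) of the matrix-multiplication example: one CUDA thread per output element, each performing one n-element dot product.

__global__ void matmul(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)          // one dot product: n multiplies, n adds
            sum += A[row * n + k] * B[k * n + col];
        C[row * n + col] = sum;
    }
}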



Processing Flow on CUDA
Example of CUDA processing flow:

 Copy data from main memory to GPU memory
 CPU issues the process (kernel launch) to the GPU
 GPU executes the computation in parallel on each core
 Copy the result from GPU memory to main memory
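
A hedged skeleton (names assumed) mapping the four steps onto CUDA API calls:

cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);    // 1. main memory -> GPU memory
compute<<<nBlocks, nThreads>>>(d_in, d_out);              // 2./3. CPU launches, GPU executes in parallel
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);  // 4. GPU memory -> main memory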



How to Write
 Create or edit the CUDA program with your favorite editor. Note: CUDA C
language programs have the suffix ".cu".

 Compile the program with nvcc to create the executable.

 Run the executable.

CPU Serial Code
GPU Parallel Kernel:    KernelA<<< nBlk, nTid >>>(args);

CPU Serial Code
GPU Parallel Kernel:    KernelB<<< nBlk, nTid >>>(args);



Hello World

#include <stdio.h>

__global__ void kernel( void )
{
}

int main( void ) {
    kernel<<< 1, 1 >>>();
    printf( "Hello, World!\n" );
    return 0;
}

 Two notable additions to the original “Hello, World!”

 Let’s discuss what these two additions are



Hello World with device code

__global__ void kernel( void )


{
}

 CUDA C keyword __global__ indicates that a function:
-- Runs on the device
-- Is called from host code
 nvcc splits source file into host and device components
 NVIDIA’s compiler handles device functions like kernel()
 Standard host compiler handles host functions like main()



 int main( void ) {
    kernel<<< 1, 1 >>>();
    printf( "Hello, World!\n" );
    return 0;
}

 Triple angle brackets mark a call from host code to device code
-- Sometimes called a “kernel launch”
--We’ll discuss the parameters inside the angle brackets later

 This is all that’s required to execute a function on the GPU!

 The function kernel() does nothing.



A bit complex
 A simple kernel to add two integers:

__global__ void add( int *a, int *b, int *c )
{
    *c = *a + *b;
}

 As before, __global__ is a CUDA C keyword meaning that
 add() will execute on the device, so a, b, and c must point to device memory
 add() will be called from the host
 Notice we use pointers for our variables ... so where do we allocate that memory?


int main( void ) {
    int a, b, c;                  // host copies of a, b, c
    int *dev_a, *dev_b, *dev_c;   // device copies of a, b, c
    int size = sizeof( int );     // we need space for an integer

    // allocate device copies of a, b, c
    cudaMalloc( (void**)&dev_a, size );
    cudaMalloc( (void**)&dev_b, size );
    cudaMalloc( (void**)&dev_c, size );

    a = 2; b = 7;
    // (copy, kernel launch, and cleanup are shown on the next slides)
}


CUDA interface for data allocation & data movement between CPU & device

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>

int main(void)
{
    // pointers to host memory
    float *a_h, *b_h;
    // pointers to device memory
    float *a_d, *b_d;
    int N = 14; int i;

    // allocate arrays on host
    a_h = (float *)malloc(sizeof(float)*N);
    b_h = (float *)malloc(sizeof(float)*N);

    // memory allocation: allocate arrays on device
    cudaMalloc((void **) &a_d, sizeof(float)*N);
    cudaMalloc((void **) &b_d, sizeof(float)*N);

    // initialize host data
    for (i=0; i<N; i++) {
        a_h[i] = 10.f+i; b_h[i] = 0.f;
    }

    // memory transfer: send data from host to device: a_h to a_d
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

    // copy on the device: a_d to b_d (added so the check below holds)
    cudaMemcpy(b_d, a_d, sizeof(float)*N, cudaMemcpyDeviceToDevice);

    // get data from device: b_d to b_h
    cudaMemcpy(b_h, b_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

    // check result
    for (i=0; i<N; i++)
        assert(a_h[i] == b_h[i]);

    // cleanup
    free(a_h); free(b_h);
    cudaFree(a_d); cudaFree(b_d);

    return 0;
}


Memory Allocation and Transfer
Memory Allocation
 cudaMalloc() :
 Address of a pointer to the allocated object
 Size of the allocated object
 cudaFree()
 Frees the object from device memory

Memory Copy
 cudaMemcpy()
 Pointer to the destination location
 Pointer to the source data object
 Number of bytes to copy
 Type/direction of memory transfer involved in the copy
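
The fourth cudaMemcpy() argument selects the transfer direction via a cudaMemcpyKind value; a brief sketch (variable names assumed from the earlier example):

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);    // host   -> device
cudaMemcpy(b_h, b_d, size, cudaMemcpyDeviceToHost);    // device -> host
cudaMemcpy(b_d, a_d, size, cudaMemcpyDeviceToDevice);  // device -> device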



CUDA interface updated

#include <stdio.h>
#include <cuda.h>

__global__ void add(int *a_d, int *b_d, int *c_d)
{
    *c_d = *a_d + *b_d;
}

int main(void)
{
    // host copies
    int a_h, b_h, c_h;
    int size = sizeof(int);
    // pointers to device memory
    int *a_d, *b_d, *c_d;

    // allocate arrays on device
    cudaMalloc((void **) &a_d, size);
    cudaMalloc((void **) &b_d, size);
    cudaMalloc((void **) &c_d, size);

    // initialize host data
    a_h = 10;
    b_h = 20;

    // send data from host to device: a_h to a_d, b_h to b_d
    cudaMemcpy(a_d, &a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, &b_h, size, cudaMemcpyHostToDevice);

    add<<<1,1>>>(a_d, b_d, c_d);

    // get the result from device: c_d to c_h
    cudaMemcpy(&c_h, c_d, size, cudaMemcpyDeviceToHost);

    // cleanup
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    return 0;
}
Terms to understand

 What is a kernel?
 How do you call a kernel?
 How do you synchronize?


CUDA Kernels & Threads
 Parallel portions of an application are executed on the device as kernels
- One kernel is executed at a time
- Many threads execute each kernel
- __global__ indicates the function is a kernel; it can be called from host functions to generate a grid of threads
- Once a kernel is launched, its dimensions cannot change in the current CUDA run-time implementation
 Both host (CPU) and device (GPU) manage their own memory - host & device memory
 Data can be copied between them


Array of Parallel Threads

threadID   0 1 2 3 4 5 6 ...

float x = input[threadID];
float y = func(x);
output[threadID] = y;


Thread hierarchy
 Kernels are executed by threads.
-- A kernel is a simple C program.
-- Each thread has an ID, used to compute memory addresses & make control decisions.
-- Thousands of threads execute the same kernel.
 Threads are grouped into blocks.
-- Threads in a block can synchronize execution.
-- Threads within a block co-operate using shared memory.
 Blocks are grouped into a grid.
-- Blocks are independent (must be able to execute in any order).


CPU Thread
Kernel invocation: 65653 threads

1. How do we access individual threads?
2. How many unique IDs? -- 65653
3. How do we synchronize such a large thread bank?
4. How do we handle memory for them?


CPU Thread
Kernel invocation

[Figure: the kernel launch creates a grid of thread blocks, e.g. Block(0,0) ... Block(2,1) in a 3 x 2 grid, with up to 1024 threads (??) per block]

[Figure: each block in the grid is itself a 2-D array of threads, e.g. threads (0,0) ... (3,2)]


Grid of Thread Blocks
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; each grid is a 2-D arrangement of blocks, e.g. Block(0,0) ... Block(2,1), and each block, e.g. Block(1,1), is a 2-D arrangement of threads, e.g. Thread(0,0) ... Thread(4,2)]

 The computational grid consists of a grid of thread blocks
 The application specifies the grid and block size
 The grid layouts can be 1- or 2-dimensional
 The maximal sizes are determined by GPU memory and card capability
 Each block has a unique block ID
 Each thread has a unique thread ID (within the block)


#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

// Function run on the host
void incrementArrayOnHost(float *a, int N)
{
    int i;
    for (i=0; i < N; i++)
        a[i] = a[i]+1.f;
}

// Compute kernel
__global__ void incrementOnDev(float *a, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx<N) a[idx] = a[idx]+1.f;
}

int main(void){
    float *a_h, *b_h;   // host arrays
    float *a_d;         // device array
    int i, N = 10;
    size_t size = N*sizeof(float);

    // allocate arrays and initialize
    a_h = (float *)malloc(size);
    b_h = (float *)malloc(size);
    cudaMalloc((void **) &a_d, size);
    for (i=0; i<N; i++) a_h[i] = (float)i;

    // copy data to GPU
    cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

    // do calculation on host
    incrementArrayOnHost(a_h, N);

    // compute execution configuration and do calculation on device
    int blockSize = 4;
    int nBlocks = N/blockSize + (N%blockSize == 0 ? 0 : 1);
    incrementOnDev <<< nBlocks, blockSize >>> (a_d, N);

    // copy result back and clean up
    cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    for (i=0; i<N; i++) assert(a_h[i] == b_h[i]);
    free(a_h); free(b_h); cudaFree(a_d);
}
 threadIdx.x - thread ID within block
 blockIdx.x - block ID within grid
 blockDim.x - number of threads per block



Memory Hierarchy

Three types of memory on the graphics card:
 Global memory: 3-6 GB
 Shared memory: 48 KB
 Registers: 32 KB

Latency:
 Global memory: 400-600 cycles
 Shared memory: fast
 Registers: fast

Purpose:
 Global memory: I/O for the grid
 Shared memory: thread collaboration
 Registers: thread space


Global Memory

 Main means of communicating R/W data between host and device
 Contents visible to all threads
 Long latency (100s of cycles)
 Off-chip, read/write access
 Host can read/write
 GT200
 • Up to 4 GB
 GF100
 • Up to 6 GB


Local Memory

 Stored in global memory
 One copy per thread
 Used for automatic arrays
 Unless all accesses use only constant indices


Constant Memory

 Short latency, high bandwidth, read-only access when all threads access the same location
 Stored in global memory but cached
 Host can read/write
 Initialized by the host
 Up to 64 KB
 Cache is 8 KB
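
A minimal sketch (assumed, not from the slides) of declaring constant memory, filling it from the host with cudaMemcpyToSymbol(), and reading it in a kernel:

__constant__ float coeff[16];           // lives in constant memory, cached on chip

__global__ void scale(float *data)
{
    // every thread reads the same location, so it is served from the constant cache
    data[threadIdx.x] *= coeff[0];
}

// host side:
//   float h_coeff[16] = { ... };
//   cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));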



Shared Memory

 Shared memory is allocated to thread blocks
 All threads in a block can access variables in the shared memory locations allocated to the block
 Fast, on-chip, read/write access
 Full-speed random access
 48 KB per SM
 48 KB / 8 blocks = 6 KB per block


Registers

 Allocated to individual threads
 Each thread can only access its own registers
 Used for frequently accessed variables that are private to each thread
 32K registers per SM
 32K / 1024 threads ~ 31 registers per thread
 Exceeding the limit reduces the number of resident threads, a whole block at a time


Summary

Memory           Scope       Access      Location  Cached
Registers        Per thread  Read-write  On-chip   No
Local memory     Per thread  Read-write  Off-chip  No
Shared memory    Per block   Read-write  On-chip   No
Global memory    Per grid    Read-write  Off-chip  No
Constant memory  Per grid    Read-only   Off-chip  Yes


CUDA Syntax



Built-in variables

 dim3 dimGrid(x,y);
// Dimensions of the grid in blocks (1 to 65,535 per dimension)
 dim3 dimBlock;
// Dimensions of the block in threads
 dim3 blockIdx;
// Block index within the grid (0 to gridDim.x-1, etc.)
 dim3 threadIdx;
// Thread index within the block
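
A minimal sketch (assumed) combining the built-in index variables to form a unique 2-D index for each thread across the whole grid:

__global__ void index2d(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column within the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row within the grid
    if (x < width && y < height)
        out[y * width + x] = (float)(y * width + x);
}

// launch (sizes assumed):
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid((width + 15) / 16, (height + 15) / 16);
//   index2d<<<dimGrid, dimBlock>>>(d_out, width, height);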



Variable Type

Declaration                     Memory    Scope         Lifetime
Automatic variables             Register  Thread        Kernel
Automatic array variables       Local     Thread        Kernel
__shared__ int SharedVar;       Shared    Thread block  Kernel
__device__ int GlobalVar;       Global    Grid          Application
__constant__ int ConstantVar;   Constant  Grid          Application

 Automatic variables without any qualifier reside in registers, except for large structures or arrays, which reside in local memory
 Pointers can point to memory allocated or declared in either global or shared memory


Language Extensions

 A kernel function is called with an execution configuration:

dim3 dimGrid(100, 50);        // 5000 thread blocks
dim3 dimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;   // 64 bytes of shared memory
KernelFunc<<< dimGrid, dimBlock, SharedMemBytes >>>(...);

 The optional SharedMemBytes bytes are:
 Allocated in addition to the compiler-allocated shared memory
 Mapped to any variable declared as:
extern __shared__ float DynamicSharedMem[];
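
A minimal sketch (assumed, not from the slides) of using the dynamically sized shared array; the third launch parameter gives each block n floats of shared memory:

__global__ void reverse(float *d)
{
    extern __shared__ float s[];          // size set by the launch configuration
    int t = threadIdx.x;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[blockDim.x - 1 - t];
}

// launch: reverse<<<1, n, n * sizeof(float)>>>(d_data);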



Function Type Qualifiers
                                 Executed on   Callable from
__device__ float DeviceFunc()    Device        Device
__global__ void KernelFunc()     Device        Host
__host__ float HostFunc()        Host          Host

 __global__ defines a kernel function & must return void
 A __global__ launch is an asynchronous call
 __device__ and __host__ can be used together
 __device__ functions cannot have their address taken

__host__ __device__ func()
{
#if __CUDA_ARCH__ == 100
    // Device code path for capability 1.0
#elif __CUDA_ARCH__ == 200
    // Device code path for capability 2.0
#elif !defined(__CUDA_ARCH__)
    // Host code path
#endif
}


Thread Synchronization

Every thread in the block executes:

    Mds[i] = Md[j];
    __syncthreads();
    func(Mds[i], Mds[i+1]);

Time step by time step (the original slides animate threads 0-3):
 Threads reach the shared-memory write Mds[i] = Md[j] at different times.
 Threads 0 and 1 arrive first and are blocked at the __syncthreads() barrier.
 Only once all threads in the block have reached the barrier can any thread continue.
 After the barrier, each thread can safely read its neighbour's element in func(Mds[i], Mds[i+1]).
Sample : Dot Product

c = a ∙ b
c = (a0, a1, a2, a3) ∙ (b0, b1, b2, b3)
c = a0 b0 + a1 b1 + a2 b2 + a3 b3

[Figure: each pairwise product Ai * Bi is computed in parallel; the products are then added (+) into the single result C - how do we do the add?]

__global__ void dot( int *a, int *b, int *c ) {
    // Each thread computes a pairwise product
    int temp = a[threadIdx.x] * b[threadIdx.x];
    ..........
    ..........
}
#define N 512
__global__ void dot( int *a, int *b, int *c ) {
    // Shared memory for the results of the multiplication
    __shared__ int temp[N];
    temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];

    __syncthreads();

    // Thread 0 sums the pairwise products
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < N; i++ )
            sum += temp[i];
        *c = sum;
    }
}
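
A hedged usage sketch (device pointers and a host int result assumed, allocated and filled as in the earlier examples): the kernel is launched as a single block of N threads and the scalar result is copied back.

dot<<< 1, N >>>( dev_a, dev_b, dev_c );
cudaMemcpy( &result, dev_c, sizeof(int), cudaMemcpyDeviceToHost );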



Sample : Dot Product

[Figure: Block 0 computes the pairwise products A0*B0, A1*B1, ... and sums them; Block N computes A512*B512, A513*B513, ... and sums them; the per-block partial sums are then combined (+) into the single result C]


#define N (2048*2048)
#define THREADS_PER_BLOCK 512
__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ )
            sum += temp[i];
        *c += sum;        // Race condition???
    }
}


Race Conditions
 Terminology: a race condition occurs when program behavior depends upon the relative timing of two (or more) event sequences.
 What actually takes place to execute the line in question, *c += sum; :
 Read value at address c
 Add sum to the value
 Write result to address c

 What if two threads are trying to do this at the same time?

 Thread 0, Block 0                  Thread 0, Block 1
 1. Read value at address c         1. Read value at address c
 2. Add sum to value                2. Add sum to value
 3. Write result to address c       3. Write result to address c


Global Memory Contention

Both blocks perform a read-modify-write on *c (Block 0 has sum = 3, Block 1 has sum = 4).
If the two sequences happen to serialize:

 Block 0: reads 0, computes 0+3, writes 3      (*c: 0 -> 3)
 Block 1: reads 3, computes 3+4, writes 7      (*c: 3 -> 7)

If the read-modify-write sequences interleave instead, one block's update can be lost.


Atomic Operations
Terminology: a read-modify-write operation is uninterruptible when it is atomic.

Many atomic operations on memory are available in CUDA C:
 atomicAdd()   atomicInc()
 atomicSub()   atomicDec()
 atomicMin()   atomicExch()
 atomicMax()   atomicCAS()

Predictable results when simultaneous access to memory is required.

We need to atomically add sum to c in our multiblock dot product.


#define N (2048*2048)
#define THREADS_PER_BLOCK 512
__global__ void dot( int *a, int *b, int *c )
{
    __shared__ int temp[THREADS_PER_BLOCK];
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    temp[threadIdx.x] = a[index] * b[index];
    __syncthreads();
    if( 0 == threadIdx.x ) {
        int sum = 0;
        for( int i = 0; i < THREADS_PER_BLOCK; i++ )
            sum += temp[i];
        atomicAdd( c, sum );
    }
}
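
A hedged host-side sketch (variable names assumed, not taken from the slides; appended to the kernel listing above) launching the multi-block dot product with one thread per element:

int main( void )
{
    int *a, *b, c = 0;
    int *dev_a, *dev_b, *dev_c;
    int size = N * sizeof(int);

    a = (int *)malloc( size );
    b = (int *)malloc( size );
    cudaMalloc( (void **)&dev_a, size );
    cudaMalloc( (void **)&dev_b, size );
    cudaMalloc( (void **)&dev_c, sizeof(int) );

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2; }

    cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
    cudaMemcpy( dev_c, &c, sizeof(int), cudaMemcpyHostToDevice );   // zero the accumulator

    dot<<< N / THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c );

    cudaMemcpy( &c, dev_c, sizeof(int), cudaMemcpyDeviceToHost );

    free(a); free(b);
    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}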



CUDA Compilation &
Debugging



Compilation

 Source files that contain CUDA language extensions must be compiled with nvcc
 - They carry the .cu suffix
 nvcc outputs:
 - C code,
 - assembly code (PTX), or
 - object code directly.
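
For example (file name assumed), a .cu file is compiled and run with:

nvcc -o vectorAdd vectorAdd.cu
./vectorAdd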



Debugging Tools

 Debugging
-- cuda-gdb
-- cuda-memcheck
-- Parallel NSight Debugger
 Performance Analysis
-- CUDA Visual Profiler
-- Parallel NSight Analyser

CUDA Development Tools :
 cuda-gdb
 cuda-memcheck
 Visual Profiler



CUDA-gdb



 If the kernel's function name is mykernel_main, the break command is as follows:
(cuda-gdb) break mykernel_main
 The user can inspect either a specific host thread or a specific CUDA thread.
-- To switch to a host thread, use the "thread N" command.
-- To switch to a CUDA thread, use
"cuda device/sm/warp/lane/kernel/grid/block/thread"
 A variable array (here a shared int array) can be accessed directly in order to see what values are stored in it:
(cuda-gdb) p &array
$1 = (@shared int (*)[0]) 0x20
(cuda-gdb) p array[0]@4
$2 = {0, 128, 64, 192}


 Determining the coordinates
(cuda-gdb) cuda device sm warp lane block thread
Current CUDA focus: device 0, sm 0, warp 0, lane 0, block(0,0), thread (0,0,0)

 Change the physical coordinates


(cuda-gdb) cuda device 0 sm 1 warp 2 lane 3
New CUDA focus: device 0, sm 1, warp 2, lane 3, grid 1,block (10,0), thread (67,0,0)

 Similarly : cuda thread (15,0,0)


cuda block (1,0) thread (3,0,0)
cuda kernel 0
 Display system information
info cuda system/device/sm/warp/lane
 Checking memory errors
set cuda memcheck on



CUDA Visual Profiler
 Provides strategic metrics to find potential performance problems.
 GPU and CPU timing for all kernel invocations and memcpy calls.
 Evolution over time through time stamps.
 Access to hardware performance counters.


Profiler Signals

Interpreting profiler counters
 Values represent events within a thread warp.
 Only one multiprocessor is targeted, so values will not correspond to the total number of warps launched for a particular kernel.
 Launch enough thread blocks to ensure that the target multiprocessor is given a consistent percentage of the total work.
 Values are best used to identify relative performance differences between unoptimized and optimized code.
 Try to reduce the magnitudes of gld/gst_incoherent, divergent_branch and warp_serialize.


Speedups of GPU vs CPU (Nvidia projections)

Note: speedup is calculated only for computation time; data transfer time is not included.
References

 Nvidia CUDA Programming Guide
 CUDA by Example: An Introduction to General-Purpose GPU Programming
 Programming Massively Parallel Processors
 CUDA - GTC 2010 workshop, http://www.nvidia.com/object/gtc2010-presentation-archive.html#tools
 Courseware of the University of Illinois, http://courses.engr.illinois.edu/ece498/al/textbook/
 Parallel Programming course - UC Berkeley
 GPGPU, PEMG - 2010, http://cdac.in/html/events/beta-test/PEMG-2010/pemg10-about-overview.html
 CUDA Optimizations, Debugging and Profiling - University of Malaga
 Introduction to GPU Computing - Nagasaki University
 CUDA, Supercomputing for the Masses, http://drdobbs.com/high-performance-computing/207200659
 CUDA Training Material - Nvidia


THANK YOU
