This document provides an overview of general purpose graphics processing units (GPUs) and the CUDA programming model. It discusses how CUDA is designed for heterogeneous CPU+GPU systems with many parallel threads. A CUDA program consists of a serial CPU code and parallel GPU kernel code. The kernel is launched by CPU threads and executed in parallel by many GPU threads organized into blocks. The document outlines the memory hierarchy and how blocks and threads are scheduled across multiprocessors on the GPU using a single-instruction, multiple-thread execution model.

An Overview of General Purpose Graphics Processing Units

Marc Moreno Maza

University of Western Ontario, Canada

UWO, April 1st, 2014
The CUDA programming and memory models

CUDA design goals

Enable heterogeneous systems (i.e., CPU + GPU)
Scale to hundreds of cores and thousands of parallel threads
Use C/C++ with minimal extensions
Let programmers focus on parallel algorithms
The CUDA programming and memory models

Heterogeneous programming (1/3)

A CUDA program is a serial program with parallel kernels, all written in C.
The serial C code executes in a host (= CPU) thread.
The parallel kernel C code executes in many device threads across multiple GPU processing elements, called streaming processors (SPs).
The CUDA programming and memory models

Heterogeneous programming (2/3)

Thus, the parallel code (the kernel) is launched on a device and executed by many threads.
Threads are grouped into thread blocks.
One kernel is executed at a time on the device.
Many threads execute each kernel.
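
For concreteness, here is a minimal sketch of such a launch (the kernel name, grid size, and block size are illustrative assumptions, not taken from the slides): the host thread launches one kernel, which is then executed by every thread of a grid of thread blocks.

// Hypothetical kernel: each device thread would work on one element.
__global__ void myKernel(float *data)
{
    // body omitted; see the increment example on the following slides
}

int main()
{
    float *d_data = 0;            // device pointer (allocation not shown)
    int numBlocks = 128;          // the grid contains 128 thread blocks
    int threadsPerBlock = 256;    // each block contains 256 threads

    // One kernel launch: 128 * 256 = 32768 device threads execute myKernel.
    myKernel<<<numBlocks, threadsPerBlock>>>(d_data);
    cudaDeviceSynchronize();      // the host thread waits for the kernel
    return 0;
}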
The CUDA programming and memory models

Heterogeneous programming (3/3)

The parallel code is written from the point of view of a single thread:
• Each thread is free to execute a unique code path.
• Built-in thread and block ID variables are used to map each thread to a specific data tile (see the next slide).
Thus, each thread executes the same code on different data, based on its thread and block ID.
The CUDA programming and memory models

Example: increment array elements (1/2)

See our example number 4 in /usr/local/cs4402/examples/4
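
The slide itself shows the device code as a figure, which is not reproduced here. The following is a plausible sketch of such an increment kernel, not the exact code from the example directory:

// Sketch of the device code: add b to each of the N elements of a.
__global__ void increment_gpu(float *a, float b, int N)
{
    // Built-in block and thread IDs map this thread to one array element.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                  // guard: the grid may contain more than N threads
        a[idx] = a[idx] + b;
}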


The CUDA programming and memory models

Example: increment array elements (2/2)


The CUDA programming and memory models

Thread blocks (1/2)

A thread block is a group of threads that can:
• synchronize their execution
• communicate via shared memory
Within a grid, thread blocks can run in any order:
• concurrently or sequentially
• this facilitates scaling of the same code across many devices
The CUDA programming and memory models

Thread blocks (2/2)

Thus, within a grid, any possible interleaving of blocks must be valid.
Thread blocks may coordinate but not synchronize:
• they may share pointers
• they should not share locks (this can easily deadlock)
The fact that thread blocks cannot synchronize gives scalability:
• a kernel scales across any number of parallel cores
However, within a thread block, threads may synchronize with barriers.
That is, threads wait at the barrier until all threads in the same block reach the barrier.
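
As a hedged illustration of a block-level barrier (the kernel, its size limit, and the reversal pattern are assumptions made for this sketch): every thread of the block writes to shared memory, the barrier guarantees those writes are complete, and only then do the threads read what their neighbours wrote.

// Sketch: reverse an array of at most 256 elements within one thread block.
__global__ void reverse_in_block(int *d, int n)
{
    __shared__ int s[256];            // staging area shared by the block
    int t = threadIdx.x;

    if (t < n) s[t] = d[t];           // each thread writes one element

    __syncthreads();                  // barrier: wait until every thread of
                                      // this block has finished its write

    if (t < n) d[t] = s[n - 1 - t];   // safe: all writes to s are now visible
}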
The CUDA programming and memory models

Memory hierarchy (1/3)

Host (CPU) memory:


Not directly accessible by CUDA threads
The CUDA programming and memory models

Memory hierarchy (2/3)

Global (on-device) memory:
Also called device memory
Accessible by all threads, as well as by the host (CPU)
Data lifetime = from allocation to deallocation
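
A short hedged sketch of that lifetime (error checking omitted; h_a is an assumed host array of the same size): the data in global memory exists from cudaMalloc until cudaFree and is visible to the host through explicit copies.

float *d_a = 0;
size_t bytes = 1024 * sizeof(float);

cudaMalloc((void **)&d_a, bytes);                     // lifetime begins here
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host writes the data
// ... any kernel launched here can read and write d_a ...
cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // host reads it back
cudaFree(d_a);                                        // lifetime ends here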
The CUDA programming and memory models

Memory hierarchy (3/3)

Shared memory:
Each thread block has its own shared memory, which is accessible only by the threads within that block
Data lifetime = block lifetime
Local storage:
Each thread has its own local storage
Data lifetime = thread lifetime
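
A small sketch of where these declarations live inside a kernel (names and the tile size are illustrative; the tile size assumes at most 128 threads per block):

__global__ void lifetimes_demo(float *g)   // g points into global memory
{
    __shared__ float tile[128];   // shared: one copy per thread block,
                                  // exists for the lifetime of the block
    float tmp;                    // local: one copy per thread,
                                  // exists for the lifetime of the thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g[i];     // stage a global value into shared memory
    tmp = 2.0f * tile[threadIdx.x];
    g[i] = tmp;                   // global memory outlives the kernel launch
}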
The CUDA programming and memory models

Blocks run on multiprocessors


The CUDA programming and memory models

Streaming processors and multiprocessors


The CUDA programming and memory models

Hardware multithreading

Hardware allocates resources to blocks:
• blocks need thread slots, registers, and shared memory
• blocks don’t run until these resources are available
Hardware schedules threads:
• threads have their own registers
• any thread not waiting for something can run
• context switching is free, every cycle
Hardware relies on threads to hide latency:
• thus high parallelism is necessary for performance
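
The per-block resources mentioned above can be inspected at run time. As a hedged sketch (device 0 is assumed), the fields printed below are standard members of cudaDeviceProp in the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Per-block resources the hardware must allocate before a block can run:
    printf("max threads per block : %d\n", prop.maxThreadsPerBlock);
    printf("registers per block   : %d\n", prop.regsPerBlock);
    printf("shared mem per block  : %zu bytes\n", (size_t)prop.sharedMemPerBlock);
    printf("warp size             : %d\n", prop.warpSize);
    return 0;
}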
The CUDA programming and memory models

SIMT thread execution

At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp.
• The number of threads in a warp is the warp size (32 on the G80).
• A half-warp is the first or second half of a warp.
Within a warp, threads:
• share instruction fetch/dispatch
• some become inactive when their code paths diverge
• hardware automatically handles divergence
Warps are the primitive unit of scheduling:
• each active block is split into warps in a well-defined way
• threads within a warp are executed physically in parallel, while warps and blocks are executed logically in parallel
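
A minimal sketch of divergence within a warp (the kernel is illustrative): the even and odd lanes of a warp take different branches, so the hardware runs the two paths one after the other, masking off the inactive lanes.

__global__ void divergent(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads of the same warp disagree on this condition, so the warp
    // serializes the two paths; inactive threads are simply masked off.
    if (threadIdx.x % 2 == 0)
        out[i] = 1;   // even lanes are active on this path
    else
        out[i] = 2;   // odd lanes are active on this path
}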
The CUDA programming and memory models

Returning to the example


The CUDA programming and memory models

Example host code for increment array elements
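
The host code itself appears as a figure on the slide and is not reproduced here. The following is a hedged reconstruction of what such host code typically looks like (the array size, block size, and the increment_gpu kernel from the earlier sketch are assumptions):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void increment_gpu(float *a, float b, int N);  // from the earlier sketch

int main()
{
    const int N = 1 << 20;                         // assumed array size
    const size_t bytes = N * sizeof(float);

    float *h_a = (float *)malloc(bytes);           // host (CPU) copy
    for (int i = 0; i < N; ++i) h_a[i] = (float)i;

    float *d_a = 0;                                // device (global memory) copy
    cudaMalloc((void **)&d_a, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    increment_gpu<<<numBlocks, threadsPerBlock>>>(d_a, 1.0f, N);

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    free(h_a);
    return 0;
}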
