
GPU Architecture: Overview

P J Narayanan
Centre for Visual Information Technology
IIIT, Hyderabad

PPoPP Tutorial on
GPU Architecture, Programming and Performance Models
GPU: Evolution
• Graphics: a few hundred triangles/vertices map to a few hundred thousand pixels
• Process pixels in parallel. Do the same thing on a large number of different items.
• Data parallel model: parallelism provided by the data
  – Thousands to millions of data elements
  – Same program/instruction on all of them
• Hardware: 8-16 cores to process vertices and 64-128 to process pixels by 2005
  – Less versatile than CPU cores
  – SIMD mode of computation. Less hardware for instruction issue
  – No caching, branch prediction, out-of-order execution, etc.
  – Can pack more cores in the same silicon die area



GPU & CPU

[Figure: Nvidia GTX280]


CPU vs GPU
• CPU Architecture features:
  – Few, complex cores
  – Perform irregular operations well
    • Run an OS, control multiple IO, pointer manipulation, etc.
• GPU Architecture features:
  – Hundreds of simple cores, operating on a common memory (like the PRAM model)
  – High compute power but high memory latency (1:500)
  – No caching, prefetching, etc.
  – High arithmetic intensity needed for good performance
    • Graphics rendering, image/signal processing, matrix manipulation, FFT, etc.

[Figure: CPU (control logic, cache, a few ALUs, DRAM) vs GPU (many small cores, DRAM)]

What do GPUs do?
• GPU implements the graphics pipeline consisting of:
  – Vertex transformations
    • Compute camera coords, lighting
  – Geometry processing
    • Primitive-wide properties
  – Rasterizing polygons to pixels
    • Find pixels falling on each polygon
  – Processing the pixels
    • Texture lookup, shading, Z-values
  – Writing to the framebuffer
    • Colour, Z-value
• Computationally intensive

[Figure: graphics pipeline — Vertex Processing → Geometry Processing → Rasterization → Pixel Processing → Framebuffer → Image]



Programmable GPUs
• Parts of the GPU pipeline were made programmable for innovative shading effects
• Vertex, pixel, & later geometry stages of processing could run user's shaders
• Pixel shaders perform data-parallel computations on a parallel hardware
  – 64-128 single precision floating point processors
  – Fast texture access
• GPGPU: High performance computing on the GPU using shaders. Efficient for vectors, matrix, FFT, etc.

[Figure: pipeline with Vertex, Geometry, and Pixel Shaders attached to the corresponding fixed-function stages]



New Generation GPUs
• The DX10/SM4.0 model required a unified shader model
• Translated into common, unified hardware cores to perform vertex, geometry, and pixel operations
• Brought the GPUs closer to a general parallel processor
• A number of cores that can be reconfigured dynamically
  – More cores: 128 → 240 → 320
  – Each transforms data in a common memory for use by others
Old Array Processors
• Processor and memory tightly attached
• A network to interconnect
  – Mesh, star, hypercube
• Local data: memory read/write. Remote data: network access
• Data reorganization is expensive to perform
• Data-Parallel model works
• Thinking Machines CM-1, CM-2; MasPar MP-1, etc.

[Figure: grid of processors, each with its own memory, connected by a network]



Current GPU Architecture
• Processors have no local memory
• Bus-based connection to the common, large memory
• Uniform access to all memory for a PE
  – Slower than computation by a factor of 500
• Resembles the PRAM model!
• No caches. But instantaneous locality of reference improves performance
  – Simultaneous memory accesses combined into a single transaction
• Memory access pattern seriously affects performance
• Compute power: up to 3 TFLOPs on a $400 add-on card

[Figure: PEs connected over a bus to a common memory]



What is the GPU Good at?
• The GPU is good at data-parallel processing
  – The same computation executed on many data elements in parallel
  – Low control-flow overhead with high SP floating point arithmetic intensity
• Many calculations per memory access
  – Currently also need a high floating point to integer ratio
• High floating-point arithmetic intensity and many data elements can hide memory access latency without a big data cache

SIMD Multiprocessors
• The device is a set of 16 or 30 multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture
  – Shared instruction unit
• At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
• The number of threads in a warp is the warp size

[Figure: Device containing Multiprocessor 1 … N; each multiprocessor has Processor 1 … M sharing a single Instruction Unit]

HW Overview

[Figure: Streaming Processor Array of Texture Processor Clusters (TPCs); each TPC contains TEX units and Streaming Multiprocessors; each SM has Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]

Streaming Multi-Processor
• Streaming Multiprocessor
  – 8 Streaming Processors (SP)
  – 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Cover latency of texture/memory loads
• 30+ GFLOPS
• 16K registers
  – Partitioned among active threads
• 16 KB shared memory
  – Partitioned among logical blocks

[Figure: SM with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs]



Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
• Data-parallel portions of an application are executed
on the device as kernels which run in parallel on
many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few
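A minimal sketch of this coprocessor model, assuming the standard CUDA runtime API (cudaMalloc/cudaMemcpy and a <<<grid, block>>> launch); the kernel and array names are illustrative, not from the slides:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each GPU thread scales one element: thousands of lightweight threads in flight.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                    // guard: the grid may be larger than n
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;                     // copy of the data in device (GPU) DRAM
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // data-parallel kernel on the coprocessor

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]); // 2.0

        cudaFree(d);
        free(h);
        return 0;
    }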


Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks (a minimal launch sketch follows)
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low latency shared memory
• Two threads from two different blocks cannot cooperate

[Figure: the Host launches Kernel 1 on the Device as Grid 1 of blocks (0,0) … (2,1), and Kernel 2 as Grid 2. Courtesy: NVIDIA]
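A minimal launch-configuration sketch in CUDA; the kernel name and the 3 x 2 grid shape are chosen only to mirror the figure and are not prescribed by the slides:

    #include <cuda_runtime.h>

    __global__ void kernel1(float *out)
    {
        // ... per-thread work on 'out' ...
    }

    void launch(float *d_out)
    {
        dim3 block(16, 16);   // threads per block: can cooperate via shared memory and __syncthreads()
        dim3 grid(3, 2);      // blocks per grid: blocks cannot cooperate with each other
        kernel1<<<grid, block>>>(d_out);   // the kernel executes as a grid of 3 x 2 thread blocks
    }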
Block and Thread IDs
• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data (see the index sketch below)
  – Image processing
  – Solving PDEs on volumes
  – …

[Figure: Device with Grid 1 of blocks (0,0) … (2,1); Block (1,1) expanded into threads (0,0) … (4,2). Courtesy: NVIDIA]
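For example, a 2D index can be derived from the built-in block and thread IDs; the row-major width x height image layout is an assumption for illustration:

    // Each thread derives a unique (x, y) pixel coordinate from its block and
    // thread IDs and works only on that pixel.
    __global__ void invert(float *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] = 1.0f - img[y * width + x];
    }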
Threads, Warps, Blocks
• 32 threads in a Warp or a scheduling group
– Only <32 when there are fewer than 32 total threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a
single SM
• G80 has 16 SMs, GTX280 has 30 SMs
• At least 16 Blocks required to “fill” the device
• More is better
– If resources (registers, thread space, shared memory)
allow, more than 1 Block can occupy each SM
Memory Spaces
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can read/write global, constant, and texture memory (a kernel sketch using these spaces follows)

[Figure: Grid of blocks; each block has shared memory, and each thread has registers and local memory; global, constant, and texture memory are shared across the grid and accessible from the Host]
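A kernel sketch touching these spaces; the names (c_gain, s_tile, g_in/g_out) and the 256-thread block size are assumptions for illustration:

    __constant__ float c_gain;            // per-grid constant memory, written by the host

    __global__ void smooth(const float *g_in, float *g_out, int n)   // g_*: per-grid global memory
    {
        __shared__ float s_tile[256];     // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in per-thread registers

        if (i < n) s_tile[threadIdx.x] = g_in[i];
        __syncthreads();                  // hazard-free sharing within the block

        if (i < n) g_out[i] = c_gain * s_tile[threadIdx.x];
    }

    // Host side (the host writes constant and global memory):
    //   cudaMemcpyToSymbol(c_gain, &gain, sizeof(float));
    //   smooth<<<(n + 255) / 256, 256>>>(d_in, d_out, n);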
Memory Access Times
• Register – dedicated HW - single cycle
• Shared Memory – dedicated HW - single cycle
• Local Memory – DRAM, no cache - *slow*
• Global Memory – DRAM, no cache - *slow*
(400-500 cycles)
• Constant Memory – DRAM, cached, 1…10s…
100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…
100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached
Thread Scheduling/Execution
• Each thread block consists of 32-thread warps (currently)
• Warps are the scheduling units in an SM; one warp is scheduled at a time
• Multiple warps time-share the SM processors
• Multiple blocks can also share an SM, if resources permit
  – Available resources are shared among the blocks that time-share an SM
• If more blocks are needed, they use the hardware sequentially
(A small warp-decomposition sketch follows.)
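With a 128-thread block, the hardware issues four 32-thread warps; warp_id and lane_id below are derived quantities used only to make the grouping visible, not special hardware registers:

    __global__ void warp_info(int *warp_of_thread)
    {
        int tid     = threadIdx.x;
        int warp_id = tid / 32;       // which warp of the block this thread belongs to
        int lane_id = tid % 32;       // position within the warp
        warp_of_thread[blockIdx.x * blockDim.x + tid] = warp_id;
        (void)lane_id;                // unused here; shown only for the decomposition
    }
    // Launched as, e.g., warp_info<<<num_blocks, 128>>>(d_out);  -> warps 0..3 per block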
Processors, Memory
• Nvidia 280GTX: 240 Streaming Processors, grouped into 30
Streaming Multiprocessors
– One instruction sequencer per SM
– 16KB of on-chip shared memory per SM
– 16K 32-bit registers per SM
– Single clock access of registers, shared memory
• 1 GB of common, off-chip global memory
  – 130 GB/s of theoretical peak memory bandwidth
  – High memory access latency: 300-500 cycles
  – 128-byte, 64-byte, or 32-byte memory transactions
• 10 special texture access units to the same global memory; 30 SMs grouped into 10 Texture Processor Clusters
• 1.3 GHz clock, 933 GFLOPS peak
• Integer and single-precision float operations in one clock cycle; slower double-precision support



AMD 5870 Architecture
• 20 SIMD engines with 16 stream cores each
  – Each SC with 5 PEs (1600 PEs in total)
  – Each with IEEE754 and integer support
  – Each with local data share memory
    • 32 KB shared low-latency memory
    • 32 banks with hardware conflict management
    • 32 integer atomic units
• 80 read address probes
  – 4 addresses per SIMD engine
  – 4 filter or convert logic per SIMD
• Global memory access
  – 153 GB/sec GDDR5 memory interface



Nvidia 280GTX: Architecture

[Figure: Nvidia 280GTX architecture diagram]



Performance Considerations
• Thread divergence
  – SIMD width is 32 threads; they should execute the very same instruction
  – Serialization otherwise
• Memory access coherence
  – A half-warp of 16 threads should read from a local block (128, 64, or 32 bytes) for speed
  – Random memory access is very expensive
• Occupancy or degree of parallelism
  – Optimum use of registers and shared memory for maximum exploitation of parallelism
  – Memory latency hidden best with high parallelism
• Atomic operations
  – Global and shared memory support slow atomic operations
(Divergence and coalescing sketches follow.)
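Two illustrative sketches of the first two points; the kernels and access patterns are examples, not code from the tutorial:

    // (a) Thread divergence: odd and even lanes of the same warp take different
    //     branches, so the warp serializes and executes both paths.
    __global__ void divergent(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (i % 2 == 0) x[i] = x[i] * 2.0f;   // half the warp idles here
            else            x[i] = x[i] + 1.0f;   // then the other half idles here
        }
    }

    // (b) Memory access coherence: consecutive threads of a half-warp read
    //     consecutive words, so the 16 loads combine into one memory transaction.
    __global__ void coalesced_copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                // thread i touches word i
    }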
Tools and APIs
• OpenGL/Direct3D for older, GPGPU exposure
– Shaders operating on polygons, textures, and
framebuffer
• CUDA: an alternate interface from Nvidia
– Kernel operating on grids using threads
– Extensions of the C language
• DirectX Compute Shader: Microsoft’s version
• OpenCL: A promising open compute standard
– Apple, Nvidia, AMD, Intel, TI, etc.
– Support for task parallel, data parallel, pipeline-parallel,
etc.
  – Exploit the strengths of all available computing resources
Massively Multithreaded Model
• Hiding memory latency: Overlap computation & memory access
  – Keep multiple threads in flight simultaneously on each core
  – Low-overhead switching. Another thread computes when one is stalled for memory data
  – Alternate resources like registers and context enable this
• A large number of threads in flight
  – Nvidia GPUs: up to 128 threads on each core on the GTX280
  – 30K time-shared threads on 240 cores
• Common instruction issue units for a number of cores
  – SIMD model at some level to optimize control hardware
  – Inefficient for if-then-else divergence
• Threads organized in multiple tiers



Multi-tier Thread Structure
• Data parallel model: A kernel on each data element
  – A kernel runs on a core
  – CUDA: an invocation of the kernel is called a thread
  – OpenCL: the same is called a work item
• Group data elements based on simultaneous scheduling (scheduling groups)
  – Execute truly in parallel, SIMD mode
  – Memory access, instruction divergence, etc., affect performance
  – CUDA: a warp of threads
• Group elements for resource usage (resource groups)
  – Share memory and other resources
  – May synchronize within group
  – CUDA: Blocks of threads
  – OpenCL: Work groups



Data-Parallelism
• Data elements provide parallelism
  – Think of many data elements, each being processed simultaneously


Data-Parallelism
• Data elements provide parallelism
  – Think of many data elements, each being processed simultaneously
  – Thousands of threads to process thousands of data elements
• Not necessarily SIMD; most are SIMD or SPMD
  – Each kernel knows its location, identical otherwise
  – Work on different parts using the location
Thinking Data-Parallel
• Launch N data locations, each of which gets a kernel of code
• Data follows a domain of computation.
• Each invocation of the kernel is aware of its location loc within the
domain
– Can access different data elements using the loc
– May perform different computations also
• Variations of SIMD processing
– Abstain from a compute step: if ( f(loc) ) then … else …
• Divergence can result in serialization
– Autonomous addressing for gather: a := b[ f(loc) ]
– Autonomous addressing for scatter: a[ g(loc) ] := b
• GPGPU model supports gather but not scatter
– Operation autonomy: Beyond SIMD.
• GPU hardware uses it for graphics, but not exposed to users
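The gather and scatter forms above, written as CUDA kernels; the index maps f and g are passed in as precomputed arrays (hypothetical names used only for this sketch):

    __global__ void gather(float *a, const float *b, const int *f, int n)
    {
        int loc = blockIdx.x * blockDim.x + threadIdx.x;
        if (loc < n) a[loc] = b[f[loc]];          // autonomous gather: a := b[ f(loc) ]
    }

    __global__ void scatter(float *a, const float *b, const int *g, int n)
    {
        int loc = blockIdx.x * blockDim.x + threadIdx.x;
        if (loc < n) a[g[loc]] = b[loc];          // autonomous scatter: a[ g(loc) ] := b
    }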


Image Processing
• A kernel for each location of the 2D domain of pixels
  – Embarrassingly parallel for simple operations
• Each work element does its own operations
  – Point operations, filtering, transformations, etc.
  – Process own pixels, get neighboring pixels, etc.
• Work groups can share data
  – Get own pixels and "apron" pixels that are accessed multiple times (see the filter sketch below)

[Figure: 3 x 3 filtering over an image tile with apron pixels]
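A sketch of 3 x 3 box filtering with an apron, assuming 16 x 16 thread blocks (launch with blockDim = (16, 16)) and a row-major float image with clamped borders; this is one possible formulation, not the tutorial's code:

    #define TILE 16
    #define RAD  1    // 3 x 3 filter => 1-pixel apron

    __global__ void box3x3(const float *in, float *out, int w, int h)
    {
        __shared__ float tile[TILE + 2 * RAD][TILE + 2 * RAD];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // Cooperative load of the tile plus its apron; threads near the edges
        // load more than one pixel. Out-of-image reads are clamped to the border.
        for (int dy = threadIdx.y; dy < TILE + 2 * RAD; dy += TILE)
            for (int dx = threadIdx.x; dx < TILE + 2 * RAD; dx += TILE) {
                int gx = (int)blockIdx.x * TILE + dx - RAD;
                int gy = (int)blockIdx.y * TILE + dy - RAD;
                gx = min(max(gx, 0), w - 1);
                gy = min(max(gy, 0), h - 1);
                tile[dy][dx] = in[gy * w + gx];
            }
        __syncthreads();

        if (x < w && y < h) {
            float s = 0.0f;
            for (int j = -RAD; j <= RAD; ++j)
                for (int i = -RAD; i <= RAD; ++i)
                    s += tile[threadIdx.y + RAD + j][threadIdx.x + RAD + i];
            out[y * w + x] = s / 9.0f;   // average of the 3 x 3 neighbourhood
        }
    }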


Regular Domains
• Regular 1D, 2D, and nD domains map very well to data-parallelism
• Each work-item operates by itself or with a few neighbors
• Need not be of equal dimensions or length
• A mapping from loc to each domain should exist

[Figure: 1D and 2D index domains]



Irregular Domains
• A regular domain generates varying amounts of data
  – Convert to a regular domain
  – Process using the regular domain
  – Mapping to the original domain using the new location is possible
• Needs computation to do this
• Occurs frequently in data structure building, work distribution, etc.

[Figure: an irregular domain A B C D E F laid out into a regular domain]
Data-Parallel Primitives
• Deep knowledge of architecture needed to get high performance
  – Use primitives to build other algorithms
  – Efficient implementations on the architecture by experts
• reduce, scan, segmented scan: aggregate or progressive results from distributed data (see the scan sketch below)
  – Ordering distributed info
• split, sort: mapping distributed data
[Blelloch 1989]

[Figure: input 1 3 2 0 6 2 5 2 4; add-reduce gives 25; scan (prefix sum) gives 0 1 4 6 6 12 14 19 21; segmented scan with flags 1 0 0 1 0 0 0 1 0 gives 0 1 4 0 0 6 8 0 2]
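A minimal single-block inclusive scan (Hillis-Steele style) as an illustration; real library scans (e.g., CUDPP/Thrust implementations in the spirit of [Blelloch 1989]) are multi-block and more elaborate. The 512-element limit and the array names are assumptions:

    __global__ void block_inclusive_scan(const int *in, int *out, int n)
    {
        __shared__ int s[512];
        int tid = threadIdx.x;

        s[tid] = (tid < n) ? in[tid] : 0;     // pad the rest of the block with zeros
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            int v = (tid >= offset) ? s[tid - offset] : 0;
            __syncthreads();                  // everyone has read before anyone writes
            s[tid] += v;
            __syncthreads();
        }

        if (tid < n) out[tid] = s[tid];       // out[i] = in[0] + ... + in[i]
    }
    // Launch with one block, e.g. block_inclusive_scan<<<1, 512>>>(d_in, d_out, 9);
    // for in = {1,3,2,0,6,2,5,2,4} this gives {1,4,6,6,12,14,19,21,25}; the prefix
    // sum in the figure is the same result shifted right with a leading 0.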
Split Primitive
• Rearrange data according to its category. Categories could be anything.
• Generalization of sort: the categories needn't themselves be ordered
• Important in distributing or mapping data

[Figure: split of elements by category]


Handling Irregular Domains
• Convert from an irregular to a regular domain
• Each old domain element counts its elements in the new domain
• Scan the counts to get the progressive counts or the starting points (see the sketch below)
• Copy data elements to own location

[Figure: elements A … F with per-element counts; scanning the counts gives each element's starting offset in the regular domain]
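The counts-then-scan step can be written with Thrust (assumed to ship with the CUDA toolkit); the function and variable names here are illustrative:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // counts[i]  = how many new-domain elements old element i produces.
    // offsets[i] = where element i starts writing in the regular domain.
    void build_offsets(const thrust::device_vector<int> &counts,
                       thrust::device_vector<int> &offsets)
    {
        offsets.resize(counts.size());
        // Exclusive scan turns the counts into starting points; element i then
        // copies its outputs to offsets[i] .. offsets[i] + counts[i] - 1.
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    }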
Graph Algorithms
• Not the prototypical data-parallel application; an irregular application
• Source of data-parallelism: the data structure (adjacency matrix or adjacency list)
• A 2D domain of V^2 elements or a 1D domain of E elements
• A thread processes each edge in parallel. Combine the results

[Figure: adjacency matrix and adjacency list representations of a graph]



Find min edge for each vertex
Example: Find the minimum outgoing edge of each vertex

• Soln 1: Each node-kernel loops over its neighbors, keeping track of the minimum weight and the edge (see the kernel sketch below):

    for each node u in parallel:
        for all neighbours v:
            if w[v] < min:
                min = w[v]
                mv = v

• Soln 2: Segmented min-scan of the weights array + a kernel to identify the min vertex
• Soln 3: Sort the tuples (u, w, v) using the key (w, v) for all edges (u, v) of the graph of weight w. Take the first entry for each u.

[Figure: per-vertex segment flags (1 0 0 1 1 0 1 1 0 0 1 1 1 0) over the edge list, and the u, w, v arrays]
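Soln 1 as a CUDA kernel sketch over a CSR adjacency list; the array names (offsets, adj, w) and the output arrays are assumptions for illustration, not the tutorial's code:

    #include <cfloat>

    // offsets[u] .. offsets[u+1]-1 index the neighbours (adj) and edge weights (w)
    // of vertex u. One thread handles one vertex and loops over its edges.
    __global__ void min_outgoing_edge(const int *offsets, const int *adj,
                                      const float *w, int num_vertices,
                                      int *min_vertex, float *min_weight)
    {
        int u = blockIdx.x * blockDim.x + threadIdx.x;
        if (u >= num_vertices) return;

        float best_w = FLT_MAX;
        int   best_v = -1;
        for (int e = offsets[u]; e < offsets[u + 1]; ++e) {
            if (w[e] < best_w) { best_w = w[e]; best_v = adj[e]; }
        }
        min_weight[u] = best_w;       // weight of the minimum outgoing edge
        min_vertex[u] = best_v;       // its endpoint (mv in the pseudocode above)
    }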


Task Parallel Computing
• The problem is divided into a number of tasks; data may also be partitioned or shared
• Some can be done in parallel, others depend on previous results
• Combine the results finally
• CPU cores and GPU can be doing task-parallel computing
• OpenCL supports this model of computation as well as the pipelined model
• More on OpenCL later today

[Figure: task dependency graph A → B, C, D → E, F → G]



Summary
• GPU can be an essential computing platform with a
massively multithreaded programming model
• Data-parallel model fits the GPUs best.
• High performance requires deep knowledge of the
architecture. High-level primitives can alleviate
this greatly.
• Think of CPU and GPU together achieving your
computing goals. Not one instead of the other
• OpenCL is an exciting new development that can
make this possible and portable!


For More Information
• GPGPU: gpgpu.org
• SIGGRAPH Courses:
– SIGGRAPH 2008: Available at UC, Davis.
http://s08.idav.ucdavis.edu/
– SIGGRAPH Asia 2008: Available at UC, Davis
http://sa08.idav.ucdavis.edu/
– Upcoming course at SIGGRAPH 2009
• CUDA Zone from Nvidia
• And more …
Thank you!

Image credits to owners such as Intel, Nvidia, AMD/ATI, etc.