
GPU Architecture: Overview

P J Narayanan
Centre for Visual Information Technology
IIIT, Hyderabad

PPoPP Tutorial on
GPU Architecture, Programming and Performance Models
GPU: Evolution
• Graphics: a few hundred triangles/vertices map to a few hundred thousand pixels
• Process pixels in parallel. Do the same thing on a large number of different items.
• Data parallel model: parallelism provided by the data
  – Thousands to millions of data elements
  – Same program/instruction on all of them
• Hardware: 8-16 cores to process vertices and 64-128 to process pixels by 2005
  – Less versatile than CPU cores
  – SIMD mode of computation. Less hardware for instruction issue
  – No caching, branch prediction, out-of-order execution, etc.
  – Can pack more cores in the same silicon die area



GPU & CPU

[Figure: Nvidia GTX280]


CPU vs GPU
• CPU Architecture features:
  – Few, complex cores
  – Perform irregular operations well
    • Run an OS, control multiple IO, pointer manipulation, etc.
• GPU Architecture features:
  – Hundreds of simple cores, operating on a common memory (like the PRAM model)
  – High compute power but high memory latency (1:500)
  – No caching, prefetching, etc.
  – High arithmetic intensity needed for good performance
    • Graphics rendering, image/signal processing, matrix manipulation, FFT, etc.

[Figure: CPU (control logic, cache, a few ALUs, DRAM) vs GPU (many small cores, DRAM)]

What do GPUs do?
• GPU implements the graphics pipeline consisting of:
  – Vertex transformations
    • Compute camera coords, lighting
  – Geometry processing
    • Primitive-wide properties
  – Rasterizing polygons to pixels
    • Find pixels falling on each polygon
  – Processing the pixels
    • Texture lookup, shading, Z-values
  – Writing to the framebuffer
    • Colour, Z-value
• Computationally intensive

[Figure: graphics pipeline — Vertex Processing → Geometry Processing → Rasterization → Pixel Processing → Framebuffer → Image]



Programmable GPUs
• Parts of the GPU pipeline were made programmable for innovative shading effects
• Vertex, pixel, & later geometry stages of processing could run user's shaders
• Pixel shaders perform data-parallel computations on a parallel hardware
  – 64-128 single precision floating point processors
  – Fast texture access
• GPGPU: High performance computing on the GPU using shaders. Efficient for vectors, matrix, FFT, etc.

[Figure: pipeline with Vertex, Geometry, and Pixel Shaders attached to the corresponding fixed-function stages]



New Generation GPUs
• The DX10/SM4.0 model required a unified shader model
• Translated into common, unified hardware cores to perform vertex, geometry, and pixel operations
• Brought the GPUs closer to a general parallel processor
• A number of cores that can be reconfigured dynamically
  – More cores: 128 → 240 → 320
  – Each transforms data in a common memory for use by others
Old Array Processors
• Processor and memory tightly attached
• A network to interconnect
  – Mesh, star, hypercube
• Local data: memory read/write. Remote data: network access
• Data reorganization is expensive to perform
• Data-Parallel model works
• Thinking Machines CM-1, CM-2; MasPar MP-1, etc.

[Figure: grid of processors, each with its own memory, connected by a network]



Current GPU Architecture
• Processors have no local memory
• Bus-based connection to the common, large memory
• Uniform access to all memory for a PE
  – Slower than computation by a factor of 500
• Resembles the PRAM model!
• No caches. But instantaneous locality of reference improves performance
  – Simultaneous memory accesses combined into a single transaction
• Memory access pattern seriously affects performance
• Compute power: up to 3 TFLOPs on a $400 add-on card

[Figure: PEs connected over a bus to a common memory]



What is the GPU Good at?
• The GPU is good at data-parallel processing
  – The same computation executed on many data elements in parallel
  – Low control-flow overhead with high SP floating point arithmetic intensity
• Many calculations per memory access
  – Currently also need a high floating point to integer ratio
• High floating-point arithmetic intensity and many data elements can hide memory access latency without a big data cache

SIMD Multiprocessors
• The device is a set of 16 or 30 multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture
  – Shared instruction unit
• At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
• The number of threads in a warp is the warp size

[Figure: Device containing Multiprocessor 1 … N; each multiprocessor has Processor 1 … M sharing a single Instruction Unit]

HW Overview

[Figure: Streaming Processor Array of Texture Processor Clusters (TPCs); each TPC contains TEX units and Streaming Multiprocessors; each SM has Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]

Streaming Multi-Processor
• Streaming Multiprocessor
  – 8 Streaming Processors (SP)
  – 2 Super Function Units (SFU)
• Multi-threaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Cover latency of texture/memory loads
• 30+ GFLOPS
• 16K registers
  – Partitioned among active threads
• 16 KB shared memory
  – Partitioned among logical blocks

[Figure: SM with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs]



Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
• Data-parallel portions of an application are executed
on the device as kernels which run in parallel on
many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
    • Multi-core CPU needs only a few
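A minimal sketch of this coprocessor model, assuming the standard CUDA runtime API (cudaMalloc/cudaMemcpy and a <<<grid, block>>> launch); the kernel and array names are illustrative, not from the slides:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each GPU thread scales one element: thousands of lightweight threads in flight.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                    // guard: the grid may be larger than n
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d;                     // copy of the data in device (GPU) DRAM
        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // data-parallel kernel on the coprocessor

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);
        printf("h[0] = %f\n", h[0]); // 2.0

        cudaFree(d);
        free(h);
        return 0;
    }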


Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks (a minimal launch sketch follows)
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low latency shared memory
• Two threads from two different blocks cannot cooperate

[Figure: the Host launches Kernel 1 on the Device as Grid 1 of blocks (0,0) … (2,1), and Kernel 2 as Grid 2. Courtesy: NVIDIA]
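A minimal launch-configuration sketch in CUDA; the kernel name and the 3 x 2 grid shape are chosen only to mirror the figure and are not prescribed by the slides:

    #include <cuda_runtime.h>

    __global__ void kernel1(float *out)
    {
        // ... per-thread work on 'out' ...
    }

    void launch(float *d_out)
    {
        dim3 block(16, 16);   // threads per block: can cooperate via shared memory and __syncthreads()
        dim3 grid(3, 2);      // blocks per grid: blocks cannot cooperate with each other
        kernel1<<<grid, block>>>(d_out);   // the kernel executes as a grid of 3 x 2 thread blocks
    }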
Block and Thread IDs
• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data (see the index sketch below)
  – Image processing
  – Solving PDEs on volumes
  – …

[Figure: Device with Grid 1 of blocks (0,0) … (2,1); Block (1,1) expanded into threads (0,0) … (4,2). Courtesy: NVIDIA]
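For example, a 2D index can be derived from the built-in block and thread IDs; the row-major width x height image layout is an assumption for illustration:

    // Each thread derives a unique (x, y) pixel coordinate from its block and
    // thread IDs and works only on that pixel.
    __global__ void invert(float *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] = 1.0f - img[y * width + x];
    }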
Threads, Warps, Blocks
• 32 threads in a Warp or a scheduling group
– Only <32 when there are fewer than 32 total threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a
single SM
• G80 has 16 SMs, GTX280 has 30 SMs
• At least 16 Blocks required to “fill” the device
• More is better
– If resources (registers, thread space, shared memory)
allow, more than 1 Block can occupy each SM
Memory Spaces
• Each thread can:
  – Read/write per-thread registers
  – Read/write per-thread local memory
  – Read/write per-block shared memory
  – Read/write per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can read/write global, constant, and texture memory (a kernel sketch using these spaces follows)

[Figure: Grid of blocks; each block has shared memory, and each thread has registers and local memory; global, constant, and texture memory are shared across the grid and accessible from the Host]
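A kernel sketch touching these spaces; the names (c_gain, s_tile, g_in/g_out) and the 256-thread block size are assumptions for illustration:

    __constant__ float c_gain;            // per-grid constant memory, written by the host

    __global__ void smooth(const float *g_in, float *g_out, int n)   // g_*: per-grid global memory
    {
        __shared__ float s_tile[256];     // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // i lives in per-thread registers

        if (i < n) s_tile[threadIdx.x] = g_in[i];
        __syncthreads();                  // hazard-free sharing within the block

        if (i < n) g_out[i] = c_gain * s_tile[threadIdx.x];
    }

    // Host side (the host writes constant and global memory):
    //   cudaMemcpyToSymbol(c_gain, &gain, sizeof(float));
    //   smooth<<<(n + 255) / 256, 256>>>(d_in, d_out, n);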
Memory Access Times
• Register – dedicated HW - single cycle
• Shared Memory – dedicated HW - single cycle
• Local Memory – DRAM, no cache - *slow*
• Global Memory – DRAM, no cache - *slow*
(400-500 cycles)
• Constant Memory – DRAM, cached, 1…10s…
100s of cycles, depending on cache locality
• Texture Memory – DRAM, cached, 1…10s…
100s of cycles, depending on cache locality
• Instruction Memory (invisible) – DRAM, cached
Thread Scheduling/Execution
• Each thread block consists of 32-thread warps (currently)
• Warps are the scheduling units in an SM; one warp is scheduled at a time
• Multiple warps time-share the SM processors
• Multiple blocks can also share an SM, if resources permit
  – Available resources are shared among the blocks that time-share an SM
• If more blocks are needed, they use the hardware sequentially
(A small warp-decomposition sketch follows.)
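With a 128-thread block, the hardware issues four 32-thread warps; warp_id and lane_id below are derived quantities used only to make the grouping visible, not special hardware registers:

    __global__ void warp_info(int *warp_of_thread)
    {
        int tid     = threadIdx.x;
        int warp_id = tid / 32;       // which warp of the block this thread belongs to
        int lane_id = tid % 32;       // position within the warp
        warp_of_thread[blockIdx.x * blockDim.x + tid] = warp_id;
        (void)lane_id;                // unused here; shown only for the decomposition
    }
    // Launched as, e.g., warp_info<<<num_blocks, 128>>>(d_out);  -> warps 0..3 per block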
Processors, Memory
• Nvidia 280GTX: 240 Streaming Processors, grouped into 30
Streaming Multiprocessors
– One instruction sequencer per SM
– 16KB of on-chip shared memory per SM
– 16K 32-bit registers per SM
– Single clock access of registers, shared memory
• 1 GB of common, off-chip global memory
  – 130 GB/s of theoretical peak memory bandwidth
  – High memory access latency: 300-500 cycles
  – 128-byte, 64-byte, or 32-byte memory transactions
• 10 special texture access units to the same global memory; 30 SMs grouped into 10 Texture Processor Clusters
• 1.3 GHz clock, 933 GFLOPS peak
• Integer and single-precision float operations in one clock cycle; slower double-precision support



AMD 5870 Architecture
• 20 SIMD engines with 16 stream cores each
  – Each SC with 5 PEs (1600 PEs in total)
  – Each with IEEE754 and integer support
  – Each with local data share memory
    • 32 KB shared low-latency memory
    • 32 banks with hardware conflict management
    • 32 integer atomic units
• 80 read address probes
  – 4 addresses per SIMD engine
  – 4 filter or convert logic per SIMD
• Global memory access
  – 153 GB/sec GDDR5 memory interface



Nvidia 280GTX: Architecture

[Figure: Nvidia 280GTX architecture diagram]



Performance Considerations
• Thread divergence
  – SIMD width is 32 threads; they should execute the very same instruction
  – Serialization otherwise
• Memory access coherence
  – A half-warp of 16 threads should read from a local block (128, 64, or 32 bytes) for speed
  – Random memory access is very expensive
• Occupancy or degree of parallelism
  – Optimum use of registers and shared memory for maximum exploitation of parallelism
  – Memory latency hidden best with high parallelism
• Atomic operations
  – Global and shared memory support slow atomic operations
(Divergence and coalescing sketches follow.)
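Two illustrative sketches of the first two points; the kernels and access patterns are examples, not code from the tutorial:

    // (a) Thread divergence: odd and even lanes of the same warp take different
    //     branches, so the warp serializes and executes both paths.
    __global__ void divergent(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (i % 2 == 0) x[i] = x[i] * 2.0f;   // half the warp idles here
            else            x[i] = x[i] + 1.0f;   // then the other half idles here
        }
    }

    // (b) Memory access coherence: consecutive threads of a half-warp read
    //     consecutive words, so the 16 loads combine into one memory transaction.
    __global__ void coalesced_copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                // thread i touches word i
    }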
Tools and APIs
• OpenGL/Direct3D for older, GPGPU exposure
– Shaders operating on polygons, textures, and
framebuffer
• CUDA: an alternate interface from Nvidia
– Kernel operating on grids using threads
– Extensions of the C language
• DirectX Compute Shader: Microsoft’s version
• OpenCL: A promising open compute standard
– Apple, Nvidia, AMD, Intel, TI, etc.
– Support for task parallel, data parallel, pipeline-parallel,
etc.
  – Exploit the strengths of all available computing resources
Massively Multithreaded Model
• Hiding memory latency: Overlap computation & memory access
  – Keep multiple threads in flight simultaneously on each core
  – Low-overhead switching. Another thread computes when one is stalled for memory data
  – Alternate resources like registers and context enable this
• A large number of threads in flight
  – Nvidia GPUs: up to 128 threads on each core on the GTX280
  – 30K time-shared threads on 240 cores
• Common instruction issue units for a number of cores
  – SIMD model at some level to optimize control hardware
  – Inefficient for if-then-else divergence
• Threads organized in multiple tiers



Multi-tier Thread Structure
• Data parallel model: A kernel on each data element
  – A kernel runs on a core
  – CUDA: an invocation of the kernel is called a thread
  – OpenCL: the same is called a work item
• Group data elements based on simultaneous scheduling (scheduling groups)
  – Execute truly in parallel, SIMD mode
  – Memory access, instruction divergence, etc., affect performance
  – CUDA: a warp of threads
• Group elements for resource usage (resource groups)
  – Share memory and other resources
  – May synchronize within group
  – CUDA: Blocks of threads
  – OpenCL: Work groups



Data-Parallelism
• Data elements provide parallelism
  – Think of many data elements, each being processed simultaneously


Data-Parallelism
• Data elements provide parallelism
  – Think of many data elements, each being processed simultaneously
  – Thousands of threads to process thousands of data elements
• Not necessarily SIMD; most are SIMD or SPMD
  – Each kernel knows its location, identical otherwise
  – Work on different parts using the location
Thinking Data-Parallel
• Launch N data locations, each of which gets a kernel of code
• Data follows a domain of computation.
• Each invocation of the kernel is aware of its location loc within the
domain
– Can access different data elements using the loc
– May perform different computations also
• Variations of SIMD processing
– Abstain from a compute step: if ( f(loc) ) then … else …
• Divergence can result in serialization
– Autonomous addressing for gather: a := b[ f(loc) ]
– Autonomous addressing for scatter: a[ g(loc) ] := b
• GPGPU model supports gather but not scatter
– Operation autonomy: Beyond SIMD.
• GPU hardware uses it for graphics, but not exposed to users
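The gather and scatter forms above, written as CUDA kernels; the index maps f and g are passed in as precomputed arrays (hypothetical names used only for this sketch):

    __global__ void gather(float *a, const float *b, const int *f, int n)
    {
        int loc = blockIdx.x * blockDim.x + threadIdx.x;
        if (loc < n) a[loc] = b[f[loc]];          // autonomous gather: a := b[ f(loc) ]
    }

    __global__ void scatter(float *a, const float *b, const int *g, int n)
    {
        int loc = blockIdx.x * blockDim.x + threadIdx.x;
        if (loc < n) a[g[loc]] = b[loc];          // autonomous scatter: a[ g(loc) ] := b
    }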


Image Processing
• A kernel for each location of the 2D domain of pixels
  – Embarrassingly parallel for simple operations
• Each work element does its own operations
  – Point operations, filtering, transformations, etc.
  – Process own pixels, get neighboring pixels, etc.
• Work groups can share data
  – Get own pixels and "apron" pixels that are accessed multiple times (see the filter sketch below)

[Figure: 3 x 3 filtering over an image tile with apron pixels]
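A sketch of 3 x 3 box filtering with an apron, assuming 16 x 16 thread blocks (launch with blockDim = (16, 16)) and a row-major float image with clamped borders; this is one possible formulation, not the tutorial's code:

    #define TILE 16
    #define RAD  1    // 3 x 3 filter => 1-pixel apron

    __global__ void box3x3(const float *in, float *out, int w, int h)
    {
        __shared__ float tile[TILE + 2 * RAD][TILE + 2 * RAD];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;

        // Cooperative load of the tile plus its apron; threads near the edges
        // load more than one pixel. Out-of-image reads are clamped to the border.
        for (int dy = threadIdx.y; dy < TILE + 2 * RAD; dy += TILE)
            for (int dx = threadIdx.x; dx < TILE + 2 * RAD; dx += TILE) {
                int gx = (int)blockIdx.x * TILE + dx - RAD;
                int gy = (int)blockIdx.y * TILE + dy - RAD;
                gx = min(max(gx, 0), w - 1);
                gy = min(max(gy, 0), h - 1);
                tile[dy][dx] = in[gy * w + gx];
            }
        __syncthreads();

        if (x < w && y < h) {
            float s = 0.0f;
            for (int j = -RAD; j <= RAD; ++j)
                for (int i = -RAD; i <= RAD; ++i)
                    s += tile[threadIdx.y + RAD + j][threadIdx.x + RAD + i];
            out[y * w + x] = s / 9.0f;   // average of the 3 x 3 neighbourhood
        }
    }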


Regular Domains
• Regular 1D, 2D, and nD domains map very well to data-parallelism
• Each work-item operates by itself or with a few neighbors
• Need not be of equal dimensions or length
• A mapping from loc to each domain should exist

[Figure: 1D and 2D index domains]



Irregular Domains
• A regular domain generates varying amounts of data
  – Convert to a regular domain
  – Process using the regular domain
  – Mapping to the original domain using the new location is possible
• Needs computation to do this
• Occurs frequently in data structure building, work distribution, etc.

[Figure: an irregular domain A B C D E F laid out into a regular domain]
Data-Parallel Primitives
• Deep knowledge of architecture needed to get high performance
  – Use primitives to build other algorithms
  – Efficient implementations on the architecture by experts
• reduce, scan, segmented scan: aggregate or progressive results from distributed data (see the scan sketch below)
  – Ordering distributed info
• split, sort: mapping distributed data
[Blelloch 1989]

[Figure: input 1 3 2 0 6 2 5 2 4; add-reduce gives 25; scan (prefix sum) gives 0 1 4 6 6 12 14 19 21; segmented scan with flags 1 0 0 1 0 0 0 1 0 gives 0 1 4 0 0 6 8 0 2]
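A minimal single-block inclusive scan (Hillis-Steele style) as an illustration; real library scans (e.g., CUDPP/Thrust implementations in the spirit of [Blelloch 1989]) are multi-block and more elaborate. The 512-element limit and the array names are assumptions:

    __global__ void block_inclusive_scan(const int *in, int *out, int n)
    {
        __shared__ int s[512];
        int tid = threadIdx.x;

        s[tid] = (tid < n) ? in[tid] : 0;     // pad the rest of the block with zeros
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            int v = (tid >= offset) ? s[tid - offset] : 0;
            __syncthreads();                  // everyone has read before anyone writes
            s[tid] += v;
            __syncthreads();
        }

        if (tid < n) out[tid] = s[tid];       // out[i] = in[0] + ... + in[i]
    }
    // Launch with one block, e.g. block_inclusive_scan<<<1, 512>>>(d_in, d_out, 9);
    // for in = {1,3,2,0,6,2,5,2,4} this gives {1,4,6,6,12,14,19,21,25}; the prefix
    // sum in the figure is the same result shifted right with a leading 0.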
Split Primitive
• Rearrange data according to its category. Categories could be anything.
• Generalization of sort: the categories needn't themselves be ordered
• Important in distributing or mapping data

[Figure: split of elements by category]


Handling Irregular Domains
• Convert from an irregular to a regular domain
• Each old domain element counts its elements in the new domain
• Scan the counts to get the progressive counts or the starting points (see the sketch below)
• Copy data elements to own location

[Figure: elements A … F with per-element counts; scanning the counts gives each element's starting offset in the regular domain]
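The counts-then-scan step can be written with Thrust (assumed to ship with the CUDA toolkit); the function and variable names here are illustrative:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>

    // counts[i]  = how many new-domain elements old element i produces.
    // offsets[i] = where element i starts writing in the regular domain.
    void build_offsets(const thrust::device_vector<int> &counts,
                       thrust::device_vector<int> &offsets)
    {
        offsets.resize(counts.size());
        // Exclusive scan turns the counts into starting points; element i then
        // copies its outputs to offsets[i] .. offsets[i] + counts[i] - 1.
        thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    }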
Graph Algorithms
• Not the prototypical data-parallel application; an irregular application
• Source of data-parallelism: the data structure (adjacency matrix or adjacency list)
• A 2D domain of V^2 elements or a 1D domain of E elements
• A thread processes each edge in parallel. Combine the results

[Figure: adjacency matrix and adjacency list representations of a graph]



Find min edge for each vertex
Example: Find the minimum outgoing edge of each vertex

• Soln 1: Each node-kernel loops over its neighbors, keeping track of the minimum weight and the edge (see the kernel sketch below):

    for each node u in parallel:
        for all neighbours v:
            if w[v] < min:
                min = w[v]
                mv = v

• Soln 2: Segmented min-scan of the weights array + a kernel to identify the min vertex
• Soln 3: Sort the tuples (u, w, v) using the key (w, v) for all edges (u, v) of the graph of weight w. Take the first entry for each u.

[Figure: per-vertex segment flags (1 0 0 1 1 0 1 1 0 0 1 1 1 0) over the edge list, and the u, w, v arrays]
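Soln 1 as a CUDA kernel sketch over a CSR adjacency list; the array names (offsets, adj, w) and the output arrays are assumptions for illustration, not the tutorial's code:

    #include <cfloat>

    // offsets[u] .. offsets[u+1]-1 index the neighbours (adj) and edge weights (w)
    // of vertex u. One thread handles one vertex and loops over its edges.
    __global__ void min_outgoing_edge(const int *offsets, const int *adj,
                                      const float *w, int num_vertices,
                                      int *min_vertex, float *min_weight)
    {
        int u = blockIdx.x * blockDim.x + threadIdx.x;
        if (u >= num_vertices) return;

        float best_w = FLT_MAX;
        int   best_v = -1;
        for (int e = offsets[u]; e < offsets[u + 1]; ++e) {
            if (w[e] < best_w) { best_w = w[e]; best_v = adj[e]; }
        }
        min_weight[u] = best_w;       // weight of the minimum outgoing edge
        min_vertex[u] = best_v;       // its endpoint (mv in the pseudocode above)
    }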


Task Parallel Computing
• The problem is divided into a number of tasks; data may also be partitioned or shared
• Some can be done in parallel, others depend on previous results
• Combine the results finally
• CPU cores and GPU can be doing task-parallel computing
• OpenCL supports this model of computation as well as the pipelined model
• More on OpenCL later today

[Figure: task dependency graph A → B, C, D → E, F → G]



Summary
• GPU can be an essential computing platform with a
massively multithreaded programming model
• Data-parallel model fits the GPUs best.
• High performance requires deep knowledge of the
architecture. High-level primitives can alleviate
this greatly.
• Think of CPU and GPU together achieving your
computing goals. Not one instead of the other
• OpenCL is an exciting new development that can
make this possible and portable!


For More Information
• GPGPU: gpgpu.org
• SIGGRAPH Courses:
– SIGGRAPH 2008: Available at UC, Davis.
http://s08.idav.ucdavis.edu/
– SIGGRAPH Asia 2008: Available at UC, Davis
http://sa08.idav.ucdavis.edu/
– Upcoming course at SIGGRAPH 2009
• CUDA Zone from Nvidia
• And more …
Thank you!

Image credits to owners such as Intel, Nvidia, AMD/ATI, etc.