GPU Computing 3
PROGRAMMING
LECTURE 03 - BASIC ARCHITECTURE
Holger Fröning
[email protected]
Institute of Computer Engineering
Ruprecht-Karls University of Heidelberg
GK110 - ARCHITECTURE
Up to 15 SMX, 6 MCs, L2 cache, PCIe 3.0, CC 3.5
GK110 - ARCHITECTURE
Per SMX:
192 SP units
64 DP units
32 load/store units
32 special function units
4 warp schedulers
Optimized for performance/watt -> reduced clock frequency
Remember Pollack's rule
BULK-SYNCHRONOUS PARALLEL
REMINDER: BULK-SYNCHRONOUS PARALLEL
In 1990, Valiant already described GPU computing pretty well
Superstep: compute, communicate, synchronize
Parallel slackness: # of virtual processors v, # of physical processors p
  v = 1: not viable
  v = p: unpromising wrt optimality
  v >> p: leverage slack to schedule and pipeline computation and communication efficiently
Extremely scalable, bad for unstructured parallelism
Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, Volume 33 Issue 8, Aug. 1990
REMINDER: VECTOR ISAS
Compact: single instruction defines N operations
  Amortizes the cost of instruction fetch/decode/issue
  Also reduces the frequency of branches
Parallel: N operations are (data) parallel
  No dependencies
  No need for complex hardware to detect parallelism (similar to VLIW)
  Can execute in parallel assuming N parallel data paths
Expressive: memory operations describe patterns
  Continuous or regular memory access pattern
  Can prefetch or accelerate using wide/multi-banked memory
  Can amortize high latency for 1st element over large sequential pattern
[Figure: 4x SIMD example - one instruction stream feeding four processing units (PUs) that operate on a shared data pool]
OUR VIEW OF A GPU
Software view: a programmable many-core scalar architecture
  Huge amount of scalar threads to exploit parallel slackness, operates in lock-step
  SIMT: single instruction, multiple threads
GPU collaborative computing
  One thread per output element (see the sketch below)
  PCAM: A == form thread blocks, ignore M
  Schedulers exploit parallel slack
GPU collaborative memory access
  One thread per data element
  MCs highly optimized to exploit concurrency -> coalescing issues
-> If you do something on a GPU, do it collaboratively with all threads
[Figure: GPU overview - compute units and memory controllers (MCs), each MC connected to off-chip GDDR]
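A minimal sketch of this collaborative pattern (kernel name, sizes and launch configuration are illustrative, not from the lecture): every thread produces exactly one output element, and neighboring threads touch neighboring data elements so the memory controllers can coalesce the accesses.

// One thread per output element: c[i] = a[i] + b[i]
__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard threads beyond the array
        c[i] = a[i] + b[i];                          // neighboring threads -> neighboring addresses
}

// Launch enough thread blocks to cover all n elements (parallel slackness):
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);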
(GLOBAL) MEMORY SUBSYSTEM
GK110 – MEMORY HIERARCHY
Registers at thread level
  64k registers per thread block
  Registers/thread depends on run-time configuration
  Max. 255 registers/thread
Shared memory / L1$ at block level
  Variable sizes: shared memory 16-48kB, L1$ 16-48kB
  L1$ can serve for register spilling
  L1$ not coherent, write-invalidate
  Compiler controlled RO L1$ (read-only data cache, 48kB)
L2$ / GDDR at device level
  GDDR (off-chip, 6GB): ~400-600 cycles access latency
  L2$ (1.5MB) as victim cache for all upper units, write-back
  Purpose: reducing contention
Host memory (off-device): multiple TBs
[Figure: memory hierarchy - per-thread registers, per-block shared memory / L1$ / read-only cache, device-wide L2$ and GDDR shared by multiple kernels, host memory off-device]
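A hedged sketch of how these levels appear in CUDA source (names and sizes are illustrative): automatic scalars usually live in registers, __shared__ arrays in the per-block shared memory, and pointer arguments refer to global memory in GDDR backed by the L2$.

// Assumes a launch with 256 threads per block, e.g. hierarchy_demo<<<blocks, 256>>>(in, out);
__global__ void hierarchy_demo(const float *in, float *out)   // in/out: global memory (GDDR/L2$)
{
    __shared__ float tile[256];                       // shared memory, one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                                  // x: register, private to this thread
    tile[threadIdx.x] = x;                            // stage the value in shared memory
    __syncthreads();                                  // block-level barrier before reuse
    out[i] = tile[threadIdx.x] * 2.0f;                // write back to global memory
}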
LOCAL MEMORY
Local memory: part of global memory, but thread-local
Register spilling: when SM runs out of resources
  Limited register count per thread
  Limited total number of registers
  LM is used if the source code exceeds these limits (see the sketch below)
Local because each thread has its private area
Differences from global memory
  Stores are cached in L1$
  Addressing is resolved by compiler
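An illustrative sketch (array size and names are made up): a per-thread array indexed with a value unknown at compile time usually cannot be kept in registers and is placed in local memory instead; compiling with nvcc --ptxas-options=-v reports the resulting register and local memory (lmem) usage per kernel.

__global__ void local_mem_demo(const int *idx, float *out)
{
    float buf[64];                       // per-thread array; likely spilled to local memory
    for (int i = 0; i < 64; ++i)
        buf[i] = (float)i;
    int j = idx[threadIdx.x] & 63;       // index unknown to the compiler
    out[threadIdx.x] = buf[j];           // dynamic access needs addressable (local) storage
}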
HOST MEMORY
Pinned/unpinned host memory
  Unpinned host memory: possibility of demand paging -> staging buffers
  Pinned host memory: autonomous device access possible
    cudaMemcpy
    GPU DMA engine(s)
Zero copy (CC >= 2.0)
  GPU threads can operate on pinned host memory
[Figure: CPU socket (96 GFLOPS DP) with north bridge to host memory (60GB/s), system and peripheral interfaces (16GB/s each) via the IO bridge to the GPU, GPU to GDDR (288GB/s)]
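A sketch of the two host-memory flavors (buffer names and sizes are illustrative, error checking omitted): cudaMallocHost provides pinned memory that the DMA engines can transfer autonomously, while cudaHostAlloc with cudaHostAllocMapped plus cudaHostGetDevicePointer enables zero-copy access by GPU threads.

cudaSetDeviceFlags(cudaDeviceMapHost);                         // enable mapped pinned memory (before context creation)

size_t bytes = 1 << 20;
float *h_pinned, *d_buf;
cudaMalloc((void **)&d_buf, bytes);
cudaMallocHost((void **)&h_pinned, bytes);                     // pinned (page-locked) host memory
cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);    // autonomous DMA transfer

float *h_mapped, *d_mapped;
cudaHostAlloc((void **)&h_mapped, bytes, cudaHostAllocMapped); // pinned + mapped into GPU address space
cudaHostGetDevicePointer((void **)&d_mapped, h_mapped, 0);     // device-side alias for zero copy
// kernel<<<grid, block>>>(d_mapped);                          // GPU threads access host memory directly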
HOST MEMORY & STREAMS
Stream: sequence of operations performed in-order
  cudaMemcpy
  Kernel launch
Default stream: id=0
Overlap computation with data movement (see the sketch below)
  Latency hiding
  Only applicable for divisible work
  Most suited for compute-bound workloads
See also zero-copy for initial data movements
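A hedged sketch of such overlap with two streams (stream count, chunking and the kernel name process are assumptions): asynchronous copies in one stream overlap with kernel execution in the other, provided the host buffers h_in/h_out are pinned.

cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

size_t chunk = n / 2;                                // split the work into independent halves
for (int i = 0; i < 2; ++i) {
    size_t off = i * chunk;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, s[i]);   // copy chunk i in stream i
    process<<<grid, block, 0, s[i]>>>(d_in + off, d_out + off);  // overlaps with the other stream's copy
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, s[i]);
}
cudaDeviceSynchronize();                             // wait until both streams are done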
GLOBAL MEMORY - COALESCING
High bandwidth, high latency
Coalesced access
  Combine fine-grain accesses by multiple threads into single GDDR operations (such requests have a certain granularity)
  Coalesced thread access should match a multiple of L1/L2 cache line sizes
  Kepler cache line sizes: L1: 128B, L2: 32B
Misaligned accesses
  One warp is scheduled, but accesses misaligned addresses
GPUs use caches for access coalescing
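Two illustrative access patterns (kernel names are made up): in the coalesced version consecutive threads read consecutive addresses, so a warp's 32 x 4B loads fall into a single 128B line; in the strided version each warp touches many cache lines and wastes bandwidth.

__global__ void read_coalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];                                  // whole warp hits one contiguous 128B region
}

__global__ void read_strided(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];                         // warp scatters over many cache lines
}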
GLOBAL MEMORY – ACCESS PENALTIES
Offset: constant shift of access pattern
  data[addr+offset]
Penalty: fetch 5 cache lines instead of 4
  -> 4/5 of max. bandwidth
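A sketch of the offset experiment (kernel name is made up): with offset = 0 a warp's 128B request maps onto four aligned 32B L2 segments; a misaligned offset makes the same request straddle a fifth segment, which is where the 5-instead-of-4 cache-line fetch and the roughly 4/5 of peak bandwidth come from.

__global__ void read_offset(const float *in, float *out, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i + offset];             // offset not a multiple of 32 floats (128B) -> misaligned requests
}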
CUDA THREAD SCHEDULING
EXAMPLE FOR HARDWARE MULTI-THREADING (G80)
4 warp contexts, max. 1 being executed simultaneously
32 SIMD ALUs
  Explicit 32x SIMD instructions
  32 ALUs execute a single SIMD instruction
Register file (RF) is shared among contexts
  One register entry (vector) has 32 words (each 32bit)
  RF: 16 entries -> max. of 4 registers/warp
Simplifying assumptions
  Each memory access blocks execution for 50 cycles
  A memory access occurs every 20 cycles
[Figure: 32 SIMD ALUs (lanes 0-31), shared register file (16 entries), four thread warp contexts T0-T3]
EXAMPLE FOR HARDWARE MULTI-THREADING (G80)
Each memory access blocks execution for 50 cycles (texture memory)
A memory access occurs every 20 cycles
-> While one warp waits 50 cycles, three other warps can each execute for 20 cycles, so 4 thread warps are required for full utilization
-> Per thread warp 32 entities = 128 entities
[Figure: execution timeline over cycles 0-80, warps T0-T3 alternating between exec, waiting and stall so that the ALUs stay busy]
THREAD SCHEDULING (KEPLER)
Fetch one instruction per cycle (from I$)
Determine dependencies (operands)
Prioritized round-robin scheduling; priority: warp age
Scheduler broadcasts the instruction to all 32 threads in a warp
[Figure: warp scheduler with two instruction dispatch units, issuing up to two independent instructions per cycle from eligible warps (e.g. Warp 4 Instruction 14/15, Warp 3 Instruction 42) over time]
THREAD SCHEDULING - SCOREBOARD
Scoreboard: hardware table that tracks
  Instructions (fetched, issued, executed)
  Resources/functional units (occupation)
  Dependencies (operands)
  Outputs (modified registers)
Tracks all operands of all instructions in the instruction buffer
Any thread can proceed until the scoreboard prevents issue
OOO execution among warps
Unfeasible without warp abstraction (32x fewer issue slots required)
Scoreboard: old concept from the 1960s (wikipedia.org)
THREAD SCHEDULING – BRANCH DIVERGENCE
Scheduler broadcasts the instruction to all 32 threads in a warp
  Dedicated control paths
Branch divergence problem -> write-masks
In kernel1, one thread of every warp takes the expensive path, so every warp must execute both sides of the branch; in kernel2 the threads of each warp take the same side (threads 0-31 form one warp), so no warp diverges.

__global__ void kernel1 (…)              __global__ void kernel2 (…)
{                                        {
  id = threadIdx.x;                        id = threadIdx.x;
  if ( id % 32 == 0 )                      if ( id < 32 )
    out = complex_function_call();           out = complex_function_call();
  else                                     else
    out = 0;                                 out = 0;
}                                        }
SUMMARY
SUMMARY
GPUs have manually-controlled, rather flat memory hierarchies
  CPUs = deep memory hierarchy
Caches in GPUs not used to reduce latency, but to reduce memory contention and to coalesce accesses
Parallel slackness as in BSP
  Latency hiding & scalability
Instruction stream == thread warp, != single thread (as for CPUs)
Global memory subsystem
  Fully featured memory subsystem, including virtual addresses, MMU and TLB
Performance issues
  Latency hiding: insufficient number of threads
  Too many threads: register spilling
  Coalescing issues (global memory): stride and offset
  Branch divergence
BONUS: ADVANCED MEMORY ANALYSIS
POINTER CHASING: MEMORY/CACHE ANALYSIS
[Figure: pointer-chasing latency plot for a GeForce 8800 GTX @ 1350MHz, annotated as follows]
  128MB @ 8MB stride, no overhead: 16-entry, fully-associative TLB
  Saturation @ 32B stride for L1 & L2: cache line size = 32B
  L1 cache latency up to 5kB size (latency increase for 5.5kB)
  768kB @ 32kB stride, latency reverts to L2 cache: 24-way set-associative (ambiguous: or 6 replicated 4-way L2s)
Source: Vasily Volkov, James W. Demmel: LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs, LAPACK Working Note 202
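For reference, a minimal sketch of the pointer-chasing idea behind such measurements (setup and timing omitted, names illustrative): a single thread walks a linked index array, so every load depends on the previous one and the time per hop exposes the latency of whichever level of the hierarchy the chosen footprint and stride hit.

__global__ void chase(const unsigned int *next, int hops, unsigned int *sink)
{
    unsigned int p = 0;
    for (int i = 0; i < hops; ++i)
        p = next[p];                     // each load depends on the previous one -> pure latency
    *sink = p;                           // keep the compiler from optimizing the loop away
}
// Host side: fill next[] so that element i points stride elements ahead (wrapping at the chosen
// array size), launch chase<<<1, 1>>>(...), and divide the elapsed time by hops.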