Architecture Overview

Introduction to CUDA Programming


Andreas Moshovos
Winter 2009
Most slides/material from:
UIUC course by Wen-Mei Hwu and David Kirk
Real World Technologies by David Kanter
System Architecture of a Typical PC / Intel
PCI-Express Programming Model
• PCI device registers are mapped into the CPU's physical address space
  – Accessed through loads/stores (kernel mode)
• Addresses are assigned to the PCI devices at boot time
  – All devices listen for their addresses
• This is one reason why 32-bit Windows XP cannot "see" a full 4 GB of RAM: device addresses occupy part of the 4 GB physical address space
PCI-E 1.x Architecture
• Switched, point-to-point connection
  – Each card has a dedicated "link" to the central switch; no bus arbitration
  – Messages are packet-switched over virtual channels
  – Packets can be prioritized for QoS
    • E.g., real-time video streaming

[Diagram: I/O devices attached to the Northbridge over a shared bus (PCI or older) contrasted with dedicated point-to-point PCI-E links]
PCI-E 1.x Architecture Contd.
• Each link consists of one or more lanes
  – Each lane is 1 bit wide (4 wires; each 2-wire pair can transmit 2.5 Gb/s in one direction)
    • Upstream and downstream traffic is simultaneous and symmetric
    • Differential signalling
  – Each link can combine 1, 2, 4, 8, 12, or 16 lanes: x1, x2, etc.
  – Each data byte is 8b/10b encoded into 10 bits with an equal number of 1's and 0's; net data rate 2 Gb/s per lane each way
  – Thus, the net data rates are 250 MB/s (x1), 500 MB/s (x2), 1 GB/s (x4), 2 GB/s (x8), 4 GB/s (x16), each way
PCI-E 2.x and beyond

Version   Clock Speed   Transfer Rate   Overhead   Data Rate (per lane, each way)
1.x       1.25 GHz      2.5 GT/s        20%        250 MB/s
2.0       2.5 GHz       5 GT/s          20%        500 MB/s
3.0       4 GHz         8 GT/s          0%         1 GB/s
Typical AMD System (for completeness)
• AMD HyperTransport™ Technology bus replaces the Front-Side Bus architecture
• HyperTransport™ similarities to PCIe:
  – Packet-based, switching network
  – Dedicated links for both directions
• Shown in a 4-socket configuration, 8 GB/sec per link
• Northbridge/HyperTransport™ is on die
• Glueless logic
  – to DDR, DDR2 memory
  – PCI-X/PCIe bridges (usually implemented in the Southbridge)
A Typical Motherboard (mATX form factor)
CUDA Refresher
• Grids of Blocks
• Blocks of Threads

Why? Realities of integrated circuits: computation and storage need to be clustered to achieve high speeds.
Thread Blocks Refresher
• Programmer declares (Thread) Block:
  – Block size: 1 to 512 concurrent threads
  – Block shape: 1D, 2D, or 3D
  – Block dimensions in threads
• All threads in a Block execute the same thread program
• Threads have thread id numbers within the Block
• Threads share data and synchronize while doing their share of the work
• The thread program uses the thread id to select work and address shared data (see the sketch below)
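
As a concrete illustration of the last point, here is a minimal kernel sketch; the kernel name, array parameters, and the 256-thread launch are illustrative assumptions, not taken from the slides.

__global__ void scaleKernel(const float *in, float *out, int n, float factor)
{
    // Global index: the block id selects a chunk of the array and the
    // thread id selects one element inside that chunk.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // the last block may be only partially full
        out[i] = in[i] * factor;
}

// Host side: a grid of blocks, each a 1D block of 256 threads.
// scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);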
Architecture Goals
• Use multithreading to hide DRAM latency
• Support fine-grain parallel processing
• Virtualize the processors to achieve scalability
• Simplify programming: develop the program for one thread

• Conventional Processors
  – Latency optimized
  – ILP
  – Caches: ~99% hit rate
• GPU
  – Caches: 90% hit rate or less, so not a good option
  – Throughput optimized
  – ILP + TLP
GT200 Architecture Overview

[Diagram: GT200 block diagram, including the atomic units]
Terminology
• SPA
– Streaming Processor Array
• TPC
– Texture Processor Cluster
• 3 SM + TEX
• SM
– Streaming Multiprocessor (8 SP)
– Multi-threaded processor core
– Fundamental processing unit for CUDA thread block
• SP
– Streaming Processor
– Scalar ALU for a single CUDA thread
Thread Processing Cluster

[Diagram: a Thread Processing Cluster contains 3 SMs sharing one TEX unit]
Stream Multiprocessor Overview
• Streaming Multiprocessor (SM)
  – 8 Streaming Processors (SP)
  – 2 Super Function Units (SFU)
  – 1 Double-Precision FP Unit (DPU)
• Multi-threaded instruction dispatch
  – 1 to 512 threads active
  – Shared instruction fetch per 32 threads
  – Covers latency of texture/memory loads
• 20+ GFLOPS
• 16 KB shared memory
• DRAM texture and memory access

[Diagram: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, 2 SFUs, and a DPU]
Thread Life
• A Grid is launched on the SPA
• Thread Blocks are serially distributed to all the SMs
  – Potentially >1 Thread Block per SM
• Each SM launches Warps of Threads
  – 2 levels of parallelism
• The SM schedules and executes Warps that are ready to run
• As Warps and Thread Blocks complete, resources are freed
  – The SPA can distribute more Thread Blocks

[Diagram: the host launches Kernel 1 as Grid 1 (a 3x2 array of Blocks) and Kernel 2 as Grid 2; Block (1,1) expands into a 5x3 array of Threads]
Cooperative Thread Array
• Break Blocks into warps
• Allocate Resources
– Registers, Shared Mem, Barriers
• Then allocate for execution
Stream Multiprocessors Execute Blocks
• Threads are assigned to SMs at Block granularity
  – Up to 8 Blocks per SM, as resources allow
  – An SM in G200 can take up to 1K threads
    • Could be 256 (threads/block) * 4 blocks
    • Or 128 (threads/block) * 8 blocks, etc.
• Threads run concurrently
  – The SM assigns/maintains thread id #s
  – The SM manages/schedules thread execution

[Diagram: Blocks mapped onto an SM (MT issue unit, SPs, shared memory), with texture fetch, texture L1, L2, and memory below]
Thread Scheduling and Execution
• Each Thread Block is divided into 32-thread Warps
  – This is an implementation decision, not part of the CUDA programming model
• Warp: the primitive scheduling unit
• All threads in a warp:
  – execute the same instruction
  – control flow causes some to become inactive

[Diagram: Block 1 and Block 2 Warps, each of threads t0..t31, feeding the Streaming Multiprocessor (instruction L1, data L1, fetch/dispatch, shared memory, SPs, SFUs, DPU)]
Warp Scheduling
• SM hardware implements zero-overhead Warp scheduling
  – Warps whose next instruction has its operands ready for consumption are eligible for execution
  – Eligible Warps are selected for execution using a prioritized scheduling policy
  – All threads in a Warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp in G200

[Diagram: the SM multithreaded warp scheduler issuing, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96]
How many warps are there?
• If 3 blocks are assigned to an SM and each Block has
256 threads, how many Warps are there in an SM?
– Each Block is divided into 256/32 = 8 Warps
– There are 8 * 3 = 24 Warps
– At any point in time, only one of the 24 Warps will be
selected for instruction fetch and execution.
Warp Scheduling: Hiding Thread Stalls

[Diagram: execution timeline — warps from TB1, TB2, and TB3 (e.g., TB1 W1, TB2 W1, TB3 W1, TB3 W2, ...) are switched in as each one stalls, so the SM stays busy. TB = Thread Block, W = Warp]
Warp Scheduling Ramifications
• Suppose one global memory access is needed for every 4 instructions
• A minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency
• Why?
  – Need to hide 200 cycles every four instructions
  – Every Warp occupies 4 cycles for each instruction it executes
  – So a thread stalls every 4 instructions, i.e., every 16 cycles
  – 200/16 = 12.5, so at least 13 warps are needed (see the sketch below)
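
The same arithmetic can be spelled out in code; this is an illustrative host-side sketch of the reasoning above, not code from the slides.

int instrsBetweenLoads = 4;     // one global load every 4 instructions
int cyclesPerWarpInstr = 4;     // G200 dispatches one warp instruction over 4 cycles
int memoryLatency      = 200;   // cycles to hide

int cyclesPerWarp = instrsBetweenLoads * cyclesPerWarpInstr;               // 16 cycles of useful work per warp
int warpsNeeded   = (memoryLatency + cyclesPerWarp - 1) / cyclesPerWarp;   // ceil(200/16) = 13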
SM Instruction Buffer – Warp Scheduling
• Fetch one warp instruction/cycle
  – from the instruction L1 cache
  – into any instruction buffer slot
• Issue one "ready-to-go" warp instruction/cycle
  – from any warp / instruction buffer slot
  – operand scoreboarding is used to prevent hazards
• Issue selection is based on round-robin/age of warp
• The SM broadcasts the same instruction to the 32 threads of a Warp

[Diagram: I$ L1, multithreaded instruction buffer, register file, constant L1 (C$), shared memory, operand select, MAD and SFU units]
Scoreboarding
• All register operands of all instructions in the
Instruction Buffer are scoreboarded
– Status becomes ready after the needed values are
deposited
– prevents hazards
– cleared instructions are eligible for issue
• Decoupled Memory/Processor pipelines
– any thread can continue to issue instructions until
scoreboarding prevents issue
– allows Memory/Processor ops to proceed in the shadow of other waiting Memory/Processor ops
Granularity Considerations
• For Matrix Multiplication, should I use 8x8, 16x16 or 32x32 tiles? (See the launch sketch below.)
  – For 8x8, we have 64 threads per Block. Since each SM can take up to 1024 threads, it could take up to 16 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM.
  – For 16x16, we have 256 threads per Block. Since each SM can take up to 1024 threads, it can take up to 4 Blocks and achieve full capacity, unless other resource considerations overrule.
  – For 32x32, we have 1024 threads per Block. Not even one Block can fit into an SM.
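
The occupancy reasoning above maps directly onto the launch configuration; the sketch below is illustrative (matMulKernel and its arguments are placeholders, not from the slides).

dim3 tile8 ( 8,  8);    //   64 threads/block: capped at 8 blocks/SM -> only 512 threads
dim3 tile16(16, 16);    //  256 threads/block: 4 blocks fill the 1024-thread SM
dim3 tile32(32, 32);    // 1024 threads/block: exceeds the 512-thread per-block limit on GT200
// matMulKernel<<<gridDim, tile16>>>(dA, dB, dC, width);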
Stream Multiprocessor Detail

[Diagram: detailed SM pipeline, including a 64-entry structure]
• 32 bit ALU and Multiply-Add
• IEEE Single-Precision Floating-Point
• Integer
• Latency is 4 cycles
• FP: NaN, Denormals become signed 0.
• Round to nearest even
Special Function Units
• Transcendental function evaluation and per-
pixel attribute interpolation

• Function evaluator:
– rcp, rsqrt, log2, exp2, sin, cos approximations
– Uses quadratic interpolation based on
Enhanced Minimax Approximation
– 1 scalar result per cycle

• Latency is 16 cycles
• Some are synthesized: 32 cycles or so
Memory System Goals
• High bandwidth
• As much parallelism as possible
  – wide: 512 pins in G200 / many DRAM chips
  – fast signalling: maximize the data rate per pin
  – maximize utilization
    • Multiple bins of memory requests
    • Coalesce requests to be as wide as possible
    • Goal: use every cycle to transfer from/to memory
• Compression: lossless and lossy
• Caches where it makes sense (small)
DRAM considerations
• Multiple banks per chip
  – 4-8 typical
• 2^N rows
  – 16K typical
• 2^M columns
  – 8K typical
• Timing constraints
  – ~10 cycles to open a row
  – 4 cycles within a row
• DDR
  – 1 GHz --> 2 Gbit/s per pin
  – 32-bit interface --> 8 bytes/clock
• GPU to memory: many traffic generators
  – no correlation if scheduling is greedy
  – separate heaps / coalesce accesses
• Longer latency
Parallelism in the Memory System
• Local Memory: per-thread
  – Private per thread
  – Auto variables, register spill
• Shared Memory: per-Block
  – Shared by threads of the same block
  – Inter-thread communication
• Global Memory: per-application
  – Shared by all threads
  – Inter-Grid communication; grids run sequentially in time (see the declaration sketch below)
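
The hierarchy above corresponds directly to CUDA declarations. A minimal sketch, assuming a single block of at most 256 threads; all names are illustrative.

__device__ float globalBuf[1024];            // global memory: all threads, all grids

__global__ void memorySpaces(void)
{
    float perThread = 0.0f;                  // register/local memory: private to one thread
    __shared__ float perBlock[256];          // shared memory: visible to one thread block

    perBlock[threadIdx.x] = perThread;       // intra-block communication via shared memory
    __syncthreads();
    globalBuf[threadIdx.x] = perBlock[threadIdx.x];   // results visible to later grids via global memory
}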
SM Memory Architecture
• Threads in a Block share data & results
  – In (global) Memory and Shared Memory
  – Synchronize at the barrier instruction
• Per-Block Shared Memory Allocation
  – Keeps data close to the processor
  – Minimizes trips to global Memory
  – SM Shared Memory is dynamically allocated to Blocks, one of the limiting resources
SM Register File
• Register File (RF)
  – 64 KB
  – 16K 32-bit registers
  – Provides 4 operands/clock
• The TEX pipe can also read/write the RF
  – 3 SMs share 1 TEX
• The Load/Store pipe can also read/write the RF

[Diagram: I$ L1, multithreaded instruction buffer, register file, constant L1 (C$), shared memory, operand select, MAD and SFU units]
Programmer’s View of Register File
• There are 16K registers in each SM in G200
  – This is an implementation decision, not part of CUDA
  – Registers are dynamically partitioned across all Blocks assigned to the SM
  – Once assigned to a Block, a register is NOT accessible by threads in other Blocks
  – Each thread in the same Block can only access registers assigned to itself

[Diagram: the register file partitioned among 4 blocks vs. 3 blocks]
Register Use Implications Example
• Matrix Multiplication
• If each Block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
  – Each Block requires 10*16*16 = 2560 registers
  – 16384 = 6 * 2560 + change
  – So, six Blocks can run on an SM as far as registers are concerned
• What if each thread increases its register use by 1?
  – Each Block now requires 11*256 = 2816 registers
  – 16384 < 2816 * 6
  – Only five Blocks can run on an SM: a 1/6 reduction in parallelism (the arithmetic is spelled out below)
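
Written out as code, the register-budget arithmetic above looks like this (an illustrative host-side sketch, not code from the slides):

int regsPerSM       = 16384;                                           // 16K 32-bit registers per SM
int threadsPerBlock = 16 * 16;                                         // 256
int regsPerThread   = 10;
int blocksByRegs    = regsPerSM / (regsPerThread * threadsPerBlock);   // 16384 / 2560 = 6

regsPerThread = 11;                                                    // one more register per thread
blocksByRegs  = regsPerSM / (regsPerThread * threadsPerBlock);         // 16384 / 2816 = 5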
Dynamic Partitioning
• Dynamic partitioning gives more flexibility to
compilers/programmers
– One can run a smaller number of threads that
require many registers each or a large number of
threads that require few registers each
• This allows for finer grain threading than traditional CPU
threading models.
– The compiler can tradeoff between instruction-level
parallelism and thread level parallelism
Within or Across Thread Parallelism (ILP vs. TLP)
• Assume:
  – kernel: 256-thread Blocks
  – 4 independent instructions for each global memory load
  – each thread: 21 registers
  – global loads: 200 cycles
  – 6 Blocks can run on each SM
• If the compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load:
  – Only three Blocks can run on each SM
  – However, one only needs 200/(8*4) = 7 Warps to tolerate the memory latency
  – Two Blocks already have 16 Warps
  – Conclusion: performance could actually be better
Constants
• Immediate address constants
• Indexed address constants
• Constants are stored in DRAM and cached on chip
  – L1 cache per SM
  – 64 KB total in DRAM
• A constant value can be broadcast to all threads in a Warp
  – Extremely efficient way of accessing a value that is common to all threads in a Block! (See the sketch below.)

[Diagram: I$ L1, multithreaded instruction buffer, register file, constant L1 (C$), shared memory, operand select, MAD and SFU units]
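
A minimal sketch of the broadcast pattern above; the symbol coeff, the kernel, and the host-side cudaMemcpyToSymbol call are illustrative assumptions, not taken from the slides.

__constant__ float coeff[16];                // lives in DRAM, cached on chip per SM

__global__ void applyCoeff(const float *in, float *out, int n, int which)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeff[which];       // same address for every thread in the warp -> broadcast
}

// Host side: cudaMemcpyToSymbol(coeff, hostCoeff, 16 * sizeof(float));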
Shared Memory
• Each SM has 16 KB of Shared Memory
  – 16 banks of 32-bit words
• CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  – read and write access
• Not used explicitly in pixel shader programs
  – we dislike pixels talking to each other
• Key performance enhancement (see the sketch below):
  – Move data into Shared Memory
  – Operate on it there

[Diagram: I$ L1, multithreaded instruction buffer, register file, constant L1 (C$), shared memory, operand select, MAD and SFU units]
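
A minimal sketch of the "stage into shared memory, operate there" pattern: the kernel reverses each 256-element chunk of an array. The kernel name and the 256-thread block size are assumptions.

__global__ void reverseInBlock(float *data)
{
    __shared__ float tile[256];                    // well within the SM's 16 KB

    int t    = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                      // stage global data into shared memory
    __syncthreads();                               // whole block now sees the full tile

    data[base + t] = tile[blockDim.x - 1 - t];     // operate out of shared memory
}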
Parallel Memory Architecture
• In a parallel machine, many threads access memory
  – Therefore, memory is divided into banks
  – Essential to achieve high bandwidth
• Each bank can service one address per cycle
  – A memory can service as many simultaneous accesses as it has banks
• Multiple simultaneous accesses to a bank result in a bank conflict
  – Conflicting accesses are serialized

[Diagram: shared memory split into Bank 0 .. Bank 15]
Bank Addressing Examples
• No Bank Conflicts
  – Linear addressing, stride == 1
• No Bank Conflicts
  – Random 1:1 permutation

[Diagram: threads 0..15 mapped to banks 0..15, once with stride-1 linear addressing and once with a random 1:1 permutation]
Bank Addressing Examples
• 2-way Bank Conflicts
  – Linear addressing, stride == 2
• 8-way Bank Conflicts
  – Linear addressing, stride == 8

[Diagram: stride-2 addressing maps two threads onto each even bank; stride-8 addressing maps eight threads onto each of two banks]
How addresses map to banks on G80
• Each bank has a bandwidth of 32 bits per clock cycle
• Successive 32-bit words are assigned to successive banks
• G80 has 16 banks
  – So bank = (32-bit word address) % 16 (see the helper sketch below)
  – Same as the size of a half-warp
    • No bank conflicts between different half-warps, only within a single half-warp
• G200? Probably the same
  – Will find out
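
The mapping can be written as a one-line helper; an illustrative sketch assuming 16 banks of 32-bit words, as on G80.

static __device__ int bankOf(unsigned byteAddress)
{
    unsigned wordIndex = byteAddress / 4;   // successive 32-bit words...
    return wordIndex % 16;                  // ...fall into successive banks, wrapping every 16
}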
Shared memory bank conflicts
• Shared memory is as fast as registers if there are no bank
conflicts

• The fast case:


– If all threads of a half-warp access different banks, there is no bank
conflict
– If all threads of a half-warp access the identical address, there is
no bank conflict (broadcast)
• The slow case:
– Bank Conflict: multiple threads in the same half-warp access the
same bank
– Must serialize the accesses
– Cost = max # of simultaneous accesses to a single bank
Linear Addressing
• Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

• This is only bank-conflict-free if s shares no common factors with the number of banks
  – 16 on G200, so s must be odd

[Diagram: thread-to-bank mapping for s = 1 and s = 3, both conflict-free]
Data types and bank conflicts
• This has no conflicts if the type of shared is 32 bits:

    foo = shared[baseIndex + threadIdx.x];

• But not if the data type is smaller
  – 4-way bank conflicts:

    __shared__ char shared[];
    foo = shared[baseIndex + threadIdx.x];

  – 2-way bank conflicts:

    __shared__ short shared[];
    foo = shared[baseIndex + threadIdx.x];

[Diagram: char accesses map four threads onto each bank; short accesses map two threads onto each bank]
Structs and Bank Conflicts
• Struct assignments compile into as many memory accesses as there are struct members:

    struct vector { float x, y, z; };
    struct myType {
        float f;
        int c;
    };
    __shared__ struct vector vectors[64];
    __shared__ struct myType myTypes[64];

• This has no bank conflicts for vector; the struct size is 3 words
  – 3 accesses per thread, contiguous banks (no common factor with 16)

    struct vector v = vectors[baseIndex + threadIdx.x];

• This has 2-way bank conflicts for myType (2 accesses per thread)

    struct myType m = myTypes[baseIndex + threadIdx.x];

[Diagram: threads 0..15 mapped onto banks 0..15]
Common Array Bank Conflict Patterns 1D
• Each thread loads 2 elements into shared mem:
  – 2-way-interleaved loads result in 2-way bank conflicts:

    int tid = threadIdx.x;
    shared[2*tid]   = global[2*tid];
    shared[2*tid+1] = global[2*tid+1];

• This makes sense for traditional CPU threads: locality in cache line usage and reduced sharing traffic
  – Not in shared memory usage, where there are no cache line effects but there are banking effects

[Diagram: thread-to-bank mapping showing the 2-way conflicts of the stride-2 loads]
A Better Array Access Pattern
• Each thread loads one element in every consecutive group of blockDim.x elements:

    shared[tid] = global[tid];
    shared[tid + blockDim.x] = global[tid + blockDim.x];

[Diagram: threads 0..15 map one-to-one onto banks 0..15]
Vector Reduction with Bank Conflicts

[Diagram: reduction tree over array elements 0..11 in which each step pairs interleaved elements, producing bank conflicts]

Vector Reduction with No Bank Conflicts

[Diagram: reduction tree over array elements 0..19 in which each step adds elements from the upper half of the active range onto the lower half, so accesses stay contiguous]
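
A sketch of the conflict-free version pictured above, using sequential addressing so that the active threads of a half-warp touch contiguous banks. The kernel name and the 256-thread block size are assumptions; the conflict-prone variant would instead index with 2 * stride * t.

__global__ void reduceSequential(float *g_data)
{
    __shared__ float s[256];
    int t = threadIdx.x;
    s[t] = g_data[blockIdx.x * blockDim.x + t];
    __syncthreads();

    // Each step adds the upper half of the active range onto the lower half:
    // active threads access contiguous words, so no two of them share a bank.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        g_data[blockIdx.x] = s[0];              // one partial sum per block
}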
Common Bank Conflict Patterns (2D)
• Operating on a 2D array of floats in shared memory
  – e.g., image processing
• Example: 16x16 block
  – Each thread processes a row
  – So threads in a block access the elements in each column simultaneously (example: row 1 in purple)
  – 16-way bank conflicts: rows all start at bank 0
• Solution 1) pad the rows (see the sketch below)
  – Add one float to the end of each row
• Solution 2) transpose before processing
  – Suffer bank conflicts during the transpose
  – But possibly save them later

[Diagram: bank indices without padding — every row starts at bank 0, so each column lives entirely in one bank; with padding, consecutive rows start at banks 0, 1, 2, ..., spreading each column across all 16 banks]
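
A sketch of Solution 1: each row of the shared tile is padded with one extra float so that simultaneous column accesses spread across all 16 banks. The kernel and array names, and the one-thread-per-row layout, are illustrative assumptions.

__global__ void processRows(float *data)
{
    // 17-word rows: element (r, c) lands in bank (17*r + c) % 16, so the 16
    // elements of any column occupy 16 different banks.
    __shared__ float tile[16][16 + 1];

    int r = threadIdx.x;                        // one thread per row
    for (int c = 0; c < 16; ++c)
        tile[r][c] = data[r * 16 + c];
    __syncthreads();

    // Each thread walks its own row; at step i the half-warp reads all of
    // column i. Without the padding this would be a 16-way bank conflict.
    float sum = 0.0f;
    for (int i = 0; i < 16; ++i)
        sum += tile[r][i];
    data[r * 16] = sum;                         // write back so the loads are not dead code
}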
Load/Store (Memory Read/Write) Clustering/Batching
• Use LD to hide LD latency (non-dependent LD ops only)
  – Use the same thread to help hide its own latency
• Instead of:
  – LD 0 (long latency)
  – Dependent MATH 0
  – LD 1 (long latency)
  – Dependent MATH 1
• Do:
  – LD 0 (long latency)
  – LD 1 (long latency - hidden)
  – MATH 0
  – MATH 1
• The compiler handles this!
  – But you must have enough non-dependent LDs and Math (see the sketch below)