Advanced VLSI Architecture
MEL G624
Lecture 30: Data Level Parallelism
Programming the GPU
To distinguish between functions for the GPU (device) and the
system processor (host):
__device__ or __global__ => GPU (device)
__host__ => system processor (host)
Variables declared in __device__ or __global__ functions are
allocated to GPU memory
Function call: name<<<dimGrid, dimBlock>>>(..parameter list..)
threadIdx: identifier for a thread within its block
blockIdx: identifier for a block within the grid
blockDim: number of threads per block
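A minimal sketch (illustrative names, not from the lecture) of how the qualifiers and the <<<dimGrid, dimBlock>>> launch syntax fit together:
//Kernel: runs on the GPU device, one thread per element
__global__ void scale(float* data, float factor, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;   //global thread index
  if (i < n) data[i] = data[i]*factor;
}
//Host function: chooses grid/block dimensions and launches the kernel
__host__ void launch_scale(float* d_data, float factor, int n) {
  dim3 dimBlock(256);              //threads per block
  dim3 dimGrid((n + 255)/256);     //blocks in the grid
  scale<<<dimGrid, dimBlock>>>(d_data, factor, n);
}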
Programming the GPU
//Invoke DAXPY
daxpy(n,2.0,x,y);
//DAXPY in C
void daxpy(int n, double a, double* x, double* y) {
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}
//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
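The slides do not show how x and y reach GPU memory; a hedged host-side sketch using the standard CUDA runtime calls (device pointer names d_x, d_y are illustrative):
//Allocate GPU memory, copy inputs, launch DAXPY, copy the result back
double *d_x, *d_y;
cudaMalloc((void**)&d_x, n*sizeof(double));
cudaMalloc((void**)&d_y, n*sizeof(double));
cudaMemcpy(d_x, x, n*sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n*sizeof(double), cudaMemcpyHostToDevice);
int nblocks = (n + 255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
cudaDeviceSynchronize();                        //wait for the kernel to finish
cudaMemcpy(y, d_y, n*sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);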
Nvidia GPU Instruction Set Architecture
The instruction set targeted by the NVIDIA compiler is an abstraction
of the hardware instruction set
PTX (Parallel Thread Execution)
Stable instruction set for compilers
Compatibility across generations of GPUs
PTX describes operations on a single CUDA thread
Usually a one-to-one mapping with a hardware instruction, although
one PTX instruction can expand to many machine instructions
PTX uses virtual registers (the compiler allocates physical registers
to each thread and maps the virtual registers onto them)
Nvidia GPU Instruction Set Architecture
//Invoke DAXPY with 256 threads per Thread Block
__host__
int nblocks = (n + 255)/256;
daxpy<<<nblocks, 256>>>(n, 2.0, x, y);
//DAXPY in CUDA
__global__
void daxpy(int n, double a, double* x, double* y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}
shl.s32 R8, blockIdx, 8    ; Thread Block ID * Block size (256 or 2^8)
add.s32 R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
shl.s32 R8, R8, 3          ; byte offset (8 bytes per double)
ld.global.f64 RD0, [X+R8]  ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8]  ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0  ; Y[i] = sum (X[i]*a + Y[i])
Conditional Branching
Like vector architectures, GPU branch hardware uses internal masks
Branch synchronization stack
Entries consist of masks for each SIMD lane,
i.e., which threads commit their results (all threads execute)
Instruction markers to manage when a branch diverges into
multiple execution paths
Push on a divergent branch
...and when paths converge
Acts as a barrier, pops the stack
Per-thread-lane 1-bit predicate register, specified by the programmer
Conditional Branching
if (X[i] != 0)
  X[i] = X[i] - Y[i];
else X[i] = Z[i];
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1
@!P1, bra ELSE1, *Push ; Push old mask, set new mask bits
; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2 ; Difference in RD0
st.global.f64 [X+R8], RD0 ; X[i] = RD0
@P1, bra ENDIF1, *Comp ; complement mask bits
; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask
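A small sketch (not from the lecture; assumes the usual warp size of 32) contrasting a branch that diverges within a warp, where both paths execute under masks, with one whose condition is uniform across each warp:
//Divergent: adjacent lanes of the same warp take different paths
__global__ void divergent(float* x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i % 2 == 0) x[i] = x[i]*2.0f;    //half the lanes masked off here
  else            x[i] = x[i] + 1.0f;  //the other half masked off here
}
//Uniform per warp: all 32 lanes of a warp take the same path, no masking needed
__global__ void uniform(float* x) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if ((i/32) % 2 == 0) x[i] = x[i]*2.0f;
  else                 x[i] = x[i] + 1.0f;
}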
NVIDIA GPU Memory Structures
Each SIMD Lane has private section of off-chip DRAM
“Private memory”, not shared by any other lanes
Contains stack frame, spilling registers, and private variables
Recent GPUs cache this in L1 and L2 caches
Each MT SIMD processor also has “Local memory” that is on
chip
Shared by SIMD lanes / threads within a block only
The off-chip memory shared by SIMD processors is “GPU
Memory”
Host can read and write GPU memory
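A minimal sketch (kernel name and the 256-threads-per-block assumption are illustrative) showing how the three levels appear in CUDA source: per-thread private values in registers, per-block on-chip local memory declared with __shared__, and off-chip GPU memory accessed through global pointers:
//Each block stages 256 elements in on-chip shared memory and reduces them
__global__ void block_sum(const float* in, float* out, int n) {
  __shared__ float tile[256];               //local memory, shared within the block
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  float v = (i < n) ? in[i] : 0.0f;         //private, per-thread value
  tile[threadIdx.x] = v;
  __syncthreads();                          //barrier for the block
  for (int stride = blockDim.x/2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
      tile[threadIdx.x] += tile[threadIdx.x + stride];
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  //one result per block to GPU memory
}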
NVIDIA GPU Memory Structures
(figure: GPU memory hierarchy)
Detecting and Enhancing Loop Level Parallelism
Focuses on determining whether data accesses in later
iterations depend on data values produced in
earlier iterations
Loop-carried dependence
A loop has loop-level parallelism only if it has no loop-carried dependence
for (i=999; i>=0; i=i‐1)
x[i] = x[i] + s;
No Loop‐carried dependence
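Because no iteration reads a value written by another iteration, each iteration can become its own CUDA thread; a hedged sketch (kernel name illustrative):
__global__ void add_scalar(double* x, double s, int n) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) x[i] = x[i] + s;   //iteration i touches only x[i]
}
//e.g. add_scalar<<<(n + 255)/256, 256>>>(x, s, n);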
Detecting and Enhancing Loop Level Parallelism
for (i=0; i<100; i=i+1) {
A[i+1] = A[i] + C[i]; /* S1 */
B[i+1] = B[i] + A[i+1]; /* S2 */
}
S1 uses a value computed by S1 in an earlier iteration:
iteration i computes A[i+1] (and B[i+1]), which is used in iteration i+1
Loop-carried dependence
S2 uses the value A[i+1] computed by S1 in the same iteration
Not a loop-carried dependence
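Unrolling the first two iterations (for illustration) makes both kinds of dependence visible:
/* iteration i = 0 */
A[1] = A[0] + C[0];   /* S1 */
B[1] = B[0] + A[1];   /* S2: uses A[1] from S1 of the same iteration */
/* iteration i = 1 */
A[2] = A[1] + C[1];   /* S1: needs A[1] from iteration 0 -> loop-carried */
B[2] = B[1] + A[2];   /* S2: needs B[1] from iteration 0 -> loop-carried */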
Detecting and Enhancing Loop Level Parallelism
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
S1 uses a value computed by S2 in the previous iteration, but
this dependence is not circular:
neither statement depends on itself
Although S1 depends on S2, S2 does not depend on S1, so
interchanging the two statements will not affect the
execution of S2
Detecting and Enhancing Loop Level Parallelism
for (i=0; i<100; i=i+1) {
A[i] = A[i] + B[i]; /* S1 */
B[i+1] = C[i] + D[i]; /* S2 */
}
A[0] = A[0] + B[0];
for (i=0; i<99; i=i+1) {
B[i+1] = C[i] + D[i]; /*S2*/
A[i+1] = A[i+1] + B[i+1]; /*S1*/
}
B[100] = C[99] + D[99];
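The rewrite is legal because S2 of iteration i produces the B[i+1] that S1 consumes in iteration i+1, so pairing them inside one iteration removes the loop-carried dependence. A small self-contained C check (not part of the lecture; array sizes and initial values are illustrative) that both versions compute the same A and B:
#include <stdio.h>
#include <string.h>
#define N 100
void original(double* A, double* B, const double* C, const double* D) {
  for (int i = 0; i < N; i++) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
  }
}
void transformed(double* A, double* B, const double* C, const double* D) {
  A[0] = A[0] + B[0];
  for (int i = 0; i < N-1; i++) {
    B[i+1] = C[i] + D[i];          /* S2 */
    A[i+1] = A[i+1] + B[i+1];      /* S1 */
  }
  B[N] = C[N-1] + D[N-1];
}
int main(void) {
  double A1[N], A2[N], B1[N+1], B2[N+1], C[N], D[N];
  for (int i = 0; i < N; i++) { C[i] = i; D[i] = 2*i; A1[i] = A2[i] = 0.5*i; }
  for (int i = 0; i <= N; i++) { B1[i] = B2[i] = 0.25*i; }
  original(A1, B1, C, D);
  transformed(A2, B2, C, D);
  printf("%s\n", (memcmp(A1, A2, sizeof A1) == 0 &&
                  memcmp(B1, B2, sizeof B1) == 0) ? "results match" : "results differ");
  return 0;
}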
Detecting and Enhancing Loop Level Parallelism
for (i=0;i<100;i=i+1) {
A[i] = B[i] + C[i];
D[i] = A[i] * E[i];
}
The second reference to A need not be translated into a load
instruction:
the two references are to the same address, and there is no intervening
memory access to the same location
A more complex analysis, i.e., loop-carried dependence analysis
combined with data dependence analysis within the same basic
block, enables this optimization
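One way the loop can look after this optimization (a sketch; the temporary t is illustrative), with the reused value kept in a register instead of reloading A[i]:
for (int i = 0; i < 100; i = i+1) {
  double t = B[i] + C[i];   /* computed once, held in a register */
  A[i] = t;                 /* store A[i] */
  D[i] = t * E[i];          /* reuse t; no second load of A[i] */
}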
Detecting and Enhancing Loop Level Parallelism
A recurrence is a special form of loop-carried dependence.
for (i=1; i<100; i=i+1) {
  Y[i] = Y[i-1] + Y[i];
}
Detecting a recurrence is important:
some vector computers have special support for
executing recurrences,
and it may still be possible to exploit a fair amount of ILP
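For example (a sketch in the style of the lecture's loops; the dependence distance of 5 is illustrative), when the recurrence's dependence distance is larger than 1, iterations closer together than that distance are independent of each other, which is where the remaining parallelism/ILP comes from:
for (int i = 5; i < 100; i = i+1)
  Y[i] = Y[i-5] + Y[i];   /* iteration i depends only on iteration i-5 */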
Thank You for Attending