
UNIT-5 Tiling

The document discusses a programming strategy for optimizing GPU performance through data tiling, which involves partitioning data into subsets that fit into shared memory to reduce slow global memory access. It outlines the benefits of using shared, constant, and register memory based on data access patterns, and details the execution of tiled matrix multiplication using CUDA code. The objective is to understand the design of a tiled parallel algorithm, focusing on loading tiles, phased execution, and barrier synchronization.


Tiling/Performance

A Common Programming Strategy


• Global memory resides in device memory (DRAM)
- much slower access than shared memory
• So, a profitable way of performing computation on the device
is to tile data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple
threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread
can efficiently make multiple passes over any data element
• Copying results from shared memory to global memory (a minimal sketch of
this load-compute-store pattern follows below)
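A minimal sketch of this load-compute-store pattern, written as a CUDA kernel. The kernel name (tileAverage), the tile size, and the simple neighbor-average computation are illustrative assumptions, not part of the lecture material.

#define TILE 256  // assumed tile/block size: one loaded element per thread

__global__ void tileAverage(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];                  // on-chip copy of this block's subset
    int idx = blockIdx.x * TILE + threadIdx.x;

    // 1. Load the subset from global memory to shared memory (one element per thread)
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                              // wait until the whole tile is loaded

    // 2. Compute from shared memory; each staged element is read by two threads here
    int right = (threadIdx.x + 1 < TILE) ? threadIdx.x + 1 : threadIdx.x;
    float v = 0.5f * (tile[threadIdx.x] + tile[right]);

    // 3. Copy results back to global memory
    if (idx < n)
        out[idx] = v;
}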
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM)
- much slower access than shared memory
– But… cached!
– Highly efficient access for read-only data
• Carefully divide data according to access patterns (a declaration sketch follows this list)
– R/Only -> constant memory (very fast if in cache)
– R/W shared within Block -> shared memory (very fast)
– R/W within each thread -> registers (very fast)
– R/W inputs/results -> global memory (very slow)
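A short sketch of how these four memory spaces appear in CUDA C. The kernel name, the array sizes, the coefficient table, and the assumption that blockDim.x is 256 and the data size is a multiple of it are illustrative only.

// Read-only data shared by all threads: constant memory (cached, very fast on a hit)
__constant__ float coeffs[64];

__global__ void memorySpacesDemo(const float* gIn, float* gOut)   // gIn/gOut: global memory
{
    __shared__ float tile[256];          // R/W data shared within the block: shared memory
    float acc;                           // R/W data private to each thread: a register

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = gIn[idx];        // global -> shared (slow read, done once)
    __syncthreads();

    acc = tile[threadIdx.x] * coeffs[threadIdx.x % 64];   // shared + constant -> register
    gOut[idx] = acc;                     // register -> global (slow write of the result)
}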
Idea: Use Shared Memory to reuse global memory data

• Each input element is read by WIDTH threads.
• Load each element into Shared Memory and have several threads
use the local version to reduce the memory bandwidth
– Tiled algorithms

[Figure: WIDTH x WIDTH matrices M and P with thread indices tx, ty]
Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses
in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by) and thread indices (tx, ty) locate the Pdsub sub-matrix computed by one thread block]
Breaking Md and Nd into Tiles

[Figure: 4x4 example with Nd elements Nd0,0..Nd1,3 (two tile columns), Md elements Md0,0..Md3,1 (two tile rows), and the 4x4 result Pd0,0..Pd3,3, each partitioned into 2x2 tiles]

• Each phase of a Thread Block uses one tile from Md and one from Nd
In each phase, every thread loads one Md element and one Nd element into the
shared-memory tiles (Mds, Nds) and then accumulates two multiply-add steps
(time runs from Phase 1 to Phase 2):

Phase 1
  T0,0: Md0,0 ↓ Mds0,0   Nd0,0 ↓ Nds0,0   PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md1,0 ↓ Mds1,0   Nd1,0 ↓ Nds1,0   PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md0,1 ↓ Mds0,1   Nd0,1 ↓ Nds0,1   PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md1,1 ↓ Mds1,1   Nd1,1 ↓ Nds1,1   PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Phase 2
  T0,0: Md2,0 ↓ Mds0,0   Nd0,2 ↓ Nds0,0   PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md3,0 ↓ Mds1,0   Nd1,2 ↓ Nds1,0   PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md2,1 ↓ Mds0,1   Nd0,3 ↓ Nds0,1   PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md3,1 ↓ Mds1,1   Nd1,3 ↓ Nds1,1   PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
Threads, Warps, Blocks
• There are (up to) 32 threads in a Warp
– Only <32 when there are fewer than 32 total
threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a single SM
• G80 has 16 SMs
• At least 16 Blocks required to “fill” the device
• More is better
– If resources (registers, thread space, shared memory) allow, more than 1
Block can occupy each SM (these limits can be queried at run time, as sketched below)
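The warp size, SM count, and per-block resource limits vary across devices; a minimal host-side sketch that queries them (error checking omitted) is shown below.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                               // properties of device 0

    printf("SMs:                 %d\n", prop.multiProcessorCount);  // 16 on G80
    printf("Warp size:           %d\n", prop.warpSize);             // 32 threads
    printf("Max threads/block:   %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory/block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers/block:     %d\n", prop.regsPerBlock);
    return 0;
}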
First-order Size Considerations in G80
• Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads

• There should be many thread blocks


– A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks

• Each thread block performs 2*256 = 512 float loads from global
memory for 256 * (2*16) = 8,192 mul/add operations, i.e. 16 operations per load
(the arithmetic is checked below).
– Memory bandwidth no longer a limiting factor
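A plain-C check of these numbers (a sketch, assuming TILE_WIDTH = 16 and a 1024x1024 Pd):

#include <stdio.h>

int main(void)
{
    int tile = 16, width = 1024;
    int threads_per_block = tile * tile;               /* 16*16 = 256 threads                          */
    int blocks = (width / tile) * (width / tile);      /* 64*64  = 4096 thread blocks                  */
    int loads_per_phase = 2 * threads_per_block;       /* 512 float loads (one Md + one Nd per thread) */
    int ops_per_phase = threads_per_block * 2 * tile;  /* 8192 mul/add operations                      */

    printf("blocks=%d loads/phase=%d ops/phase=%d ops-per-load=%d\n",
           blocks, loads_per_phase, ops_per_phase, ops_per_phase / loads_per_phase);
    return 0;
}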
How about performance on a GPU?

– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point operation
– 4 bytes of memory traffic per FLOP
– Assume a GPU with a peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s of bandwidth would be required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits execution to 200/4 = 50 GFLOPS
– This limits the execution rate to 3.3% (50/1,500) of the peak
floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to the
1,500 GFLOPS (the bound is worked out below)
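The bandwidth bound can be written out directly; the figures below are the assumed values from this slide, not measurements.

#include <stdio.h>

int main(void)
{
    double peak_gflops = 1500.0;   /* assumed peak floating-point rate (GFLOPS) */
    double bandwidth   = 200.0;    /* assumed DRAM bandwidth (GB/s)             */
    double bytes_per_flop = 4.0;   /* one 4-byte global access per FLOP         */

    double needed  = bytes_per_flop * peak_gflops;     /* 6000 GB/s needed for peak    */
    double limited = bandwidth / bytes_per_flop;       /* 50 GFLOPS actually feasible  */

    printf("bandwidth needed for peak: %.0f GB/s\n", needed);
    printf("bandwidth-limited rate: %.0f GFLOPS (%.1f%% of peak)\n",
           limited, 100.0 * limited / peak_gflops);
    return 0;
}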
Outline of Tiling Technique

– Identify a tile of global memory contents that are accessed by multiple threads
– Load the tile from global memory into on-chip memory

– Use barrier synchronization to make sure that all threads are ready to start the phase

– Have the multiple threads access their data from the on-chip memory

– Use barrier synchronization to make sure that all threads have completed the
current phase
– Move on to the next tile
Objective

– To understand the design of a tiled parallel algorithm for matrix multiplication
– Loading a tile
– Phased execution
– Barrier Synchronization
Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in the tiled
code
CUDA Code – Kernel
Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {

        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }

    Pd[Row*Width + Col] = Pvalue;
}
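A minimal host-side sketch of how this kernel might be launched with the execution configuration shown earlier. The wrapper name MatrixMulOnDevice, the host array names, and the omission of error checking are assumptions for brevity; TILE_WIDTH is assumed to be #defined (e.g. 16) and Width a multiple of it.

void MatrixMulOnDevice(float* h_M, float* h_N, float* h_P, int Width)
{
    size_t size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    cudaMalloc((void**)&Md, size);                        // allocate device matrices
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, h_M, size, cudaMemcpyHostToDevice);    // copy inputs to the device
    cudaMemcpy(Nd, h_N, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                // one thread per Pd element
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    cudaMemcpy(h_P, Pd, size, cudaMemcpyDeviceToHost);    // copy the result back
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}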
Tiled Multiply (Cont.)

• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by), thread indices (tx, ty), and the loop counters m and k locate the Pdsub sub-matrix and the tiles being traversed]
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– Can potentially have up to 8 Thread Blocks actively executing
• This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
– The next TILE_WIDTH 32 would lead to 2*32*32*4B= 8KB shared memory usage per thread block,
allowing only up to two thread blocks active at the same time
• Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
– The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS! (the arithmetic is checked below)
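A small check of the shared-memory and bandwidth arithmetic above (a sketch; the 16 KB and 86.4 GB/s figures are the G80 values quoted on this slide):

#include <stdio.h>

int main(void)
{
    int sm_shared_bytes = 16 * 1024;                          /* 16 KB shared memory per SM       */
    int tile = 16;
    int bytes_per_block = 2 * tile * tile * 4;                /* Mds + Nds tiles of floats = 2 KB */
    int blocks_per_sm = sm_shared_bytes / bytes_per_block;    /* 8 blocks (shared-memory limit)   */

    double bandwidth = 86.4;                                  /* G80 DRAM bandwidth (GB/s)        */
    double gflops = (bandwidth / 4.0) * tile;                 /* 16x fewer accesses -> 345.6      */

    printf("blocks per SM (shared-memory limit): %d\n", blocks_per_sm);
    printf("supported rate with %dx%d tiling: %.1f GFLOPS\n", tile, tile, gflops);
    return 0;
}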
Tiling Size Effects

[Figure: bar chart (vertical scale 0-100) comparing the untiled kernel with tiled-only and tiled & unrolled versions for 4x4, 8x8, 12x12, and 16x16 tiles]