
UNIT-5 Tiling

The document discusses a programming strategy for optimizing GPU performance through data tiling, which involves partitioning data into subsets that fit into shared memory to reduce slow global memory access. It outlines the benefits of using shared, constant, and register memory based on data access patterns, and details the execution of tiled matrix multiplication using CUDA code. The objective is to understand the design of a tiled parallel algorithm, focusing on loading tiles, phased execution, and barrier synchronization.


Tiling/Performance

A Common Programming Strategy


• Global memory resides in device memory (DRAM)
- much slower access than shared memory
• So, a profitable way of performing computation on the device
is to tile data to take advantage of fast shared memory:
– Partition data into subsets that fit into shared memory
– Handle each data subset with one thread block by:
• Loading the subset from global memory to shared memory, using multiple
threads to exploit memory-level parallelism
• Performing the computation on the subset from shared memory; each thread
can efficiently make multiple passes over any data element
• Copying results from shared memory to global memory (a minimal sketch of
this load-compute-store pattern follows below)
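A minimal sketch of this load-compute-store pattern, written as a CUDA kernel. The kernel name (tileAverage), the tile size, and the simple neighbor-average computation are illustrative assumptions, not part of the lecture material.

#define TILE 256  // assumed tile/block size: one loaded element per thread

__global__ void tileAverage(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];                  // on-chip copy of this block's subset
    int idx = blockIdx.x * TILE + threadIdx.x;

    // 1. Load the subset from global memory to shared memory (one element per thread)
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();                              // wait until the whole tile is loaded

    // 2. Compute from shared memory; each staged element is read by two threads here
    int right = (threadIdx.x + 1 < TILE) ? threadIdx.x + 1 : threadIdx.x;
    float v = 0.5f * (tile[threadIdx.x] + tile[right]);

    // 3. Copy results back to global memory
    if (idx < n)
        out[idx] = v;
}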
A Common Programming Strategy (Cont.)
• Constant memory also resides in device memory (DRAM)
- much slower access than shared memory
– But… cached!
– Highly efficient access for read-only data
• Carefully divide data according to access patterns (a declaration sketch follows this list)
– R/Only -> constant memory (very fast if in cache)
– R/W shared within Block -> shared memory (very fast)
– R/W within each thread -> registers (very fast)
– R/W inputs/results -> global memory (very slow)
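A short sketch of how these four memory spaces appear in CUDA C. The kernel name, the array sizes, the coefficient table, and the assumption that blockDim.x is 256 and the data size is a multiple of it are illustrative only.

// Read-only data shared by all threads: constant memory (cached, very fast on a hit)
__constant__ float coeffs[64];

__global__ void memorySpacesDemo(const float* gIn, float* gOut)   // gIn/gOut: global memory
{
    __shared__ float tile[256];          // R/W data shared within the block: shared memory
    float acc;                           // R/W data private to each thread: a register

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = gIn[idx];        // global -> shared (slow read, done once)
    __syncthreads();

    acc = tile[threadIdx.x] * coeffs[threadIdx.x % 64];   // shared + constant -> register
    gOut[idx] = acc;                     // register -> global (slow write of the result)
}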
Idea: Use Shared Memory to reuse global memory data

• Each input element is read by WIDTH threads.
• Load each element into Shared Memory and have several threads
use the local version to reduce the memory bandwidth
– Tiled algorithms

[Figure: WIDTH x WIDTH matrices M and P with thread indices tx, ty]
Tiled Multiply

• Break up the execution of the kernel into phases so that the data accesses
in each phase are focused on one subset (tile) of Md and Nd

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by) and thread indices (tx, ty) locate the Pdsub sub-matrix computed by one thread block]
Breaking Md and Nd into Tiles

[Figure: 4x4 example with Nd elements Nd0,0..Nd1,3 (two tile columns), Md elements Md0,0..Md3,1 (two tile rows), and the 4x4 result Pd0,0..Pd3,3, each partitioned into 2x2 tiles]

• Each phase of a Thread Block uses one tile from Md and one from Nd
In each phase, every thread loads one Md element and one Nd element into the
shared-memory tiles (Mds, Nds) and then accumulates two multiply-add steps
(time runs from Phase 1 to Phase 2):

Phase 1
  T0,0: Md0,0 ↓ Mds0,0   Nd0,0 ↓ Nds0,0   PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md1,0 ↓ Mds1,0   Nd1,0 ↓ Nds1,0   PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md0,1 ↓ Mds0,1   Nd0,1 ↓ Nds0,1   PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md1,1 ↓ Mds1,1   Nd1,1 ↓ Nds1,1   PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1

Phase 2
  T0,0: Md2,0 ↓ Mds0,0   Nd0,2 ↓ Nds0,0   PValue0,0 += Mds0,0*Nds0,0 + Mds1,0*Nds0,1
  T1,0: Md3,0 ↓ Mds1,0   Nd1,2 ↓ Nds1,0   PValue1,0 += Mds0,0*Nds1,0 + Mds1,0*Nds1,1
  T0,1: Md2,1 ↓ Mds0,1   Nd0,3 ↓ Nds0,1   PValue0,1 += Mds0,1*Nds0,0 + Mds1,1*Nds0,1
  T1,1: Md3,1 ↓ Mds1,1   Nd1,3 ↓ Nds1,1   PValue1,1 += Mds0,1*Nds1,0 + Mds1,1*Nds1,1
Threads, Warps, Blocks
• There are (up to) 32 threads in a Warp
– Only <32 when there are fewer than 32 total
threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a single SM
• G80 has 16 SMs
• At least 16 Blocks required to “fill” the device
• More is better
– If resources (registers, thread space, shared memory) allow, more than 1
Block can occupy each SM (these limits can be queried at run time, as sketched below)
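The warp size, SM count, and per-block resource limits vary across devices; a minimal host-side sketch that queries them (error checking omitted) is shown below.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                               // properties of device 0

    printf("SMs:                 %d\n", prop.multiProcessorCount);  // 16 on G80
    printf("Warp size:           %d\n", prop.warpSize);             // 32 threads
    printf("Max threads/block:   %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory/block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers/block:     %d\n", prop.regsPerBlock);
    return 0;
}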
First-order Size Considerations in G80
• Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads

• There should be many thread blocks


– A 1024*1024 Pd gives 64*64 = 4096 Thread Blocks

• Each thread block performs 2*256 = 512 float loads from global
memory for 256 * (2*16) = 8,192 mul/add operations, i.e. 16 operations per load
(the arithmetic is checked below).
– Memory bandwidth no longer a limiting factor
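A plain-C check of these numbers (a sketch, assuming TILE_WIDTH = 16 and a 1024x1024 Pd):

#include <stdio.h>

int main(void)
{
    int tile = 16, width = 1024;
    int threads_per_block = tile * tile;               /* 16*16 = 256 threads                          */
    int blocks = (width / tile) * (width / tile);      /* 64*64  = 4096 thread blocks                  */
    int loads_per_phase = 2 * threads_per_block;       /* 512 float loads (one Md + one Nd per thread) */
    int ops_per_phase = threads_per_block * 2 * tile;  /* 8192 mul/add operations                      */

    printf("blocks=%d loads/phase=%d ops/phase=%d ops-per-load=%d\n",
           blocks, loads_per_phase, ops_per_phase, ops_per_phase / loads_per_phase);
    return 0;
}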
How about performance on a GPU?

– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point operation
– 4 bytes of memory traffic per FLOP
– Assume a GPU with a peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s of bandwidth would be required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits execution to 200/4 = 50 GFLOPS
– This limits the execution rate to 3.3% (50/1,500) of the peak
floating-point execution rate of the device!
– Need to drastically cut down memory accesses to get close to the
1,500 GFLOPS (the bound is worked out below)
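The bandwidth bound can be written out directly; the figures below are the assumed values from this slide, not measurements.

#include <stdio.h>

int main(void)
{
    double peak_gflops = 1500.0;   /* assumed peak floating-point rate (GFLOPS) */
    double bandwidth   = 200.0;    /* assumed DRAM bandwidth (GB/s)             */
    double bytes_per_flop = 4.0;   /* one 4-byte global access per FLOP         */

    double needed  = bytes_per_flop * peak_gflops;     /* 6000 GB/s needed for peak    */
    double limited = bandwidth / bytes_per_flop;       /* 50 GFLOPS actually feasible  */

    printf("bandwidth needed for peak: %.0f GB/s\n", needed);
    printf("bandwidth-limited rate: %.0f GFLOPS (%.1f%% of peak)\n",
           limited, 100.0 * limited / peak_gflops);
    return 0;
}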
Outline of Tiling Technique

– Identify a tile of global memory contents that are accessed by multiple threads
– Load the tile from global memory into on-chip memory

– Use barrier synchronization to make sure that all threads are ready to start the phase

– Have the multiple threads access their data from the on-chip memory

– Use barrier synchronization to make sure that all threads have completed the
current phase
– Move on to the next tile
Objective

– To understand the design of a tiled parallel algorithm for matrix multiplication
– Loading a tile
– Phased execution
– Barrier Synchronization
Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in the tiled
code
CUDA Code – Kernel
Execution Configuration
// Setup the execution configuration
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the Pd element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) {

        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();
    }

    Pd[Row*Width + Col] = Pvalue;
}
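A minimal host-side sketch of how this kernel might be launched with the execution configuration shown earlier. The wrapper name MatrixMulOnDevice, the host array names, and the omission of error checking are assumptions for brevity; TILE_WIDTH is assumed to be #defined (e.g. 16) and Width a multiple of it.

void MatrixMulOnDevice(float* h_M, float* h_N, float* h_P, int Width)
{
    size_t size = Width * Width * sizeof(float);
    float *Md, *Nd, *Pd;

    cudaMalloc((void**)&Md, size);                        // allocate device matrices
    cudaMalloc((void**)&Nd, size);
    cudaMalloc((void**)&Pd, size);
    cudaMemcpy(Md, h_M, size, cudaMemcpyHostToDevice);    // copy inputs to the device
    cudaMemcpy(Nd, h_N, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                // one thread per Pd element
    dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

    cudaMemcpy(h_P, Pd, size, cudaMemcpyDeviceToHost);    // copy the result back
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}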
Tiled Multiply (Cont.)

• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub

[Figure: Md, Nd, and Pd partitioned into TILE_WIDTH x TILE_WIDTH tiles; block indices (bx, by), thread indices (tx, ty), and the loop counters m and k locate the Pdsub sub-matrix and the tiles being traversed]
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– Can potentially have up to 8 Thread Blocks actively executing
• This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
– The next TILE_WIDTH 32 would lead to 2*32*32*4B= 8KB shared memory usage per thread block,
allowing only up to two thread blocks active at the same time
• Using 16x16 tiling, we reduce the accesses to the global memory by a factor of 16
– The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS! (the arithmetic is checked below)
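A small check of the shared-memory and bandwidth arithmetic above (a sketch; the 16 KB and 86.4 GB/s figures are the G80 values quoted on this slide):

#include <stdio.h>

int main(void)
{
    int sm_shared_bytes = 16 * 1024;                          /* 16 KB shared memory per SM       */
    int tile = 16;
    int bytes_per_block = 2 * tile * tile * 4;                /* Mds + Nds tiles of floats = 2 KB */
    int blocks_per_sm = sm_shared_bytes / bytes_per_block;    /* 8 blocks (shared-memory limit)   */

    double bandwidth = 86.4;                                  /* G80 DRAM bandwidth (GB/s)        */
    double gflops = (bandwidth / 4.0) * tile;                 /* 16x fewer accesses -> 345.6      */

    printf("blocks per SM (shared-memory limit): %d\n", blocks_per_sm);
    printf("supported rate with %dx%d tiling: %.1f GFLOPS\n", tile, tile, gflops);
    return 0;
}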
Tiling Size Effects

[Figure: bar chart (vertical scale 0-100) comparing the untiled kernel with tiled-only and tiled & unrolled versions for 4x4, 8x8, 12x12, and 16x16 tiles]