UNIT-5 Tiling
• Each element of Md and Nd is read by Width threads (see the naive kernel sketch below for why this redundancy arises).
• Load each element into Shared Memory once and have several threads use the local copy, reducing global memory traffic.
– Tiled algorithms
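For reference, a minimal sketch of the untiled kernel (a reconstruction using the slides' names Md, Nd, Pd, and Width, not code taken from the deck) makes the redundancy concrete: every thread streams one full row of Md and one full column of Nd from global memory, so each input element is fetched by Width different threads.

// Naive matrix multiply: every thread reads Width elements of Md and Nd
// straight from global memory, so each element of Md/Nd is fetched by
// Width different threads.
__global__ void MatrixMulNaive(const float* Md, const float* Nd, float* Pd,
                               int Width)
{
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    if (Row < Width && Col < Width) {
        float Pvalue = 0.0f;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
        Pd[Row * Width + Col] = Pvalue;
    }
}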
Tiled Multiply
• Break up the execution of the kernel into phases so that the data accesses in each phase are focused on one subset (tile) of Md and Nd
(Figure: Md, Nd, and Pd divided into TILE_WIDTH x TILE_WIDTH tiles; block indices bx, by select a tile Pdsub of Pd, and thread indices tx, ty in 0..TILE_WIDTH-1 select an element within the tile)
Breaking Md and Nd into Tiles
(Figure: the elements Nd0,0 through Nd1,3 of Nd, grouped into TILE_WIDTH x TILE_WIDTH tiles)
• Each thread block performs 2*256 = 512 float loads from global memory for 256 * (2*16) = 8,192 mul/add operations: in each phase, each of the 256 threads loads one element of Md and one of Nd and performs 16 multiplies and 16 adds.
– Memory bandwidth is no longer a limiting factor
How about performance on a GPU?
– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point addition
– 4 bytes of memory bandwidth needed per FLOP
– Assume a GPU with a peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits execution to 200/4 = 50 GFLOPS
– Identify a tile of global memory contents that are accessed by multiple threads
– Load the tile from global memory into on-chip memory
– Use barrier synchronization to make sure that all threads are ready to start the phase
– Have the multiple threads access their data from the on-chip memory
– Use barrier synchronization to make sure that all threads have completed the current phase
– Move on to the next tile (see the kernel sketch after this list)
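A minimal sketch of a tiled kernel implementing these phases (assuming square Width x Width matrices with Width a multiple of TILE_WIDTH; the names follow the slides, but the body is a reconstruction rather than the deck's own code):

#define TILE_WIDTH 16

__global__ void MatrixMulTiled(const float* Md, const float* Nd, float* Pd,
                               int Width)
{
    // On-chip tiles of Md and Nd, shared by all threads in the block
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0.0f;
    // Loop over the Md and Nd tiles required for this Pd element
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // Phase load: each thread fetches one element of each tile
        Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();   // barrier: tile fully loaded before use

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();   // barrier: phase done before tile is overwritten
    }
    Pd[Row * Width + Col] = Pvalue;
}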
Objective
• Each block computes one square sub-matrix Pdsub of size TILE_WIDTH x TILE_WIDTH
• Each thread computes one element of Pdsub (the corresponding launch configuration is sketched below)
(Figure: bx, by select the tile Pdsub within Pd; tx, ty in 0..TILE_WIDTH-1 select the element within the tile; m indexes the tile phases along the shared dimension k)
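A host-side launch that realizes this mapping might look as follows (a sketch, assuming square Width x Width matrices, Width a multiple of TILE_WIDTH, and the MatrixMulTiled kernel sketched earlier):

// One thread per Pd element, one block per TILE_WIDTH x TILE_WIDTH tile
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulTiled<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);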
G80 Shared Memory and Threading
• Each SM in G80 has 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared memory.
– Can potentially have up to 8 thread blocks actively executing
• This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block); a run-time query is sketched below
– The next TILE_WIDTH of 32 would lead to 2*32*32*4B = 8KB of shared memory usage per thread block, allowing only up to two thread blocks active at the same time
• Using 16x16 tiling, we reduce the accesses to global memory by a factor of 16
– The 86.4 GB/s bandwidth can now support (86.4/4)*16 = 345.6 GFLOPS!
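Because shared memory capacity is implementation dependent, portable code can query it at run time instead of hard-coding the G80 figure. A minimal sketch using the standard CUDA runtime API (cudaGetDeviceProperties and the sharedMemPerBlock field are real API; the block count derived here is only an upper bound from shared memory, since registers and per-SM thread/block limits also cap occupancy):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    // 2 tiles * 16*16 elements * 4 bytes each = 2KB per block
    size_t tileBytes = 2 * 16 * 16 * sizeof(float);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Blocks that fit by shared memory alone: %zu\n",
           prop.sharedMemPerBlock / tileBytes);
    return 0;
}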
Tiling Size Effects
(Figure: bar chart, vertical axis 0 to 100, comparing the performance of not tiled, tiled only, and tiled & unrolled kernels for 4x4, 8x8, 12x12, and 16x16 tiles)