S0285 Optimization of Sparse Matrix-Matrix Multiplication on GPU
[Figure: a sparse matrix with non-zero (x) and zero entries, and the GPU implementation: each warp (Warp 0 … Warp n) owns a hash table split between shared memory and global memory.]
IMPLEMENTATION
Sparse Matrix Representation: CSR
Do not store 0s
Store column indices and values; compress row indices into row offsets (see the sketch below)
Row indices: …
Row offsets: …
Column indices: …
Values: …
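As a concrete illustration (the matrix and its values are made up here, not the slide's elided example), the CSR arrays for a small matrix could look like this:

// CSR arrays for a 3x4 example matrix (illustrative values):
//
//     | 5 0 0 2 |
//     | 0 3 0 0 |
//     | 0 0 8 1 |
//
// Row offsets have nRows + 1 entries; row i owns the range
// [rowOffsets[i], rowOffsets[i+1]) of colIndices/values.
int   rowOffsets[] = { 0, 2, 3, 5 };
int   colIndices[] = { 0, 3, 1, 2, 3 };
float values[]     = { 5.0f, 2.0f, 3.0f, 8.0f, 1.0f };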
GPU Algorithm
Count the number of non-zeroes in each row of C (1 kernel)
— Conceptually, a warp computes the result for one row
When # of rows > # of warps, a warp works on more than one row (see the sketch below)
— Store the numbers in C.rows
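A minimal sketch of that warp-to-row mapping (the index computations are assumptions about the launch configuration, not the talk's actual code):

// Each warp starts at its global warp index and strides by the total
// number of warps, so all rows are covered when nRows > nWarps.
int warpId = ( blockIdx.x * blockDim.x + threadIdx.x ) / 32;
int nWarps = ( gridDim.x * blockDim.x ) / 32;
for( int aRowId = warpId ; aRowId < A_nRows ; aRowId += nWarps )
{
    // count the non-zeroes of row aRowId of C and store them in C.rows[aRowId]
}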
// One outer iteration per lane that still holds a non-zero of A's row.
for( k = 0, end = __popc( __ballot( aColIt < aColEnd ) ) ; k < end ; ++k )
{
    // Lane k broadcasts its non-zero of A through shared memory.
    if( laneId == k ) { sColA[warpId] = colA; sValA[warpId] = valA; }

    // Try each hash function in turn (kNumHashFunctions is illustrative).
    for( int fn = 0 ; fn < kNumHashFunctions ; ++fn )
    {
        if( __all( inserted ) )
            break;
        // If the slot is full, retry with the next hash function (next iteration).
        tryToInsertInGlobalMemory( colB, value );
    }
}
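tryToInsertInGlobalMemory is not defined on the slide; a common way to implement it is an atomicCAS that claims the slot for the column, followed by an atomicAdd that accumulates the product. A sketch under that assumption (the signature and the kEmpty sentinel are illustrative; the slide's two-argument call presumably picks up the table pointers and slot from surrounding state):

// Illustrative insert into the global-memory hash table.
#define kEmpty -1  // sentinel marking a free slot (assumed)

__device__ bool tryToInsertInGlobalMemory( int *tableCols, float *tableVals,
                                           int slot, int colB, float value )
{
    int old = atomicCAS( &tableCols[slot], kEmpty, colB );
    if( old == kEmpty || old == colB )   // slot claimed, or already holds colB
    {
        atomicAdd( &tableVals[slot], value );
        return true;                     // product accumulated
    }
    return false;                        // collision: caller retries with the next hash function
}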
Hash Table Size
Hash tables in global memory contain 2^k slots
— We usually start with 2048 slots
We use inlined PTX to compute hash % kGlobalSize (a power of two, so the modulo is just a bitwise mask)
// Given kGlobalSize == 2^k, bfind returns the position of the most
// significant set bit, i.e. k.
int nBits;
asm( "bfind.u32 %0, %1;" : "=r"( nBits ) : "r"( kGlobalSize ) );
Honestly, it’s not critical for performance here but it’s fun
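Wrapped up as a self-contained helper (names are illustrative), the slot computation looks like this:

// Reduce a hash to a slot index in a table of kGlobalSize == 2^k slots.
__device__ unsigned hashToSlot( unsigned hash, unsigned kGlobalSize )
{
    int nBits;
    asm( "bfind.u32 %0, %1;" : "=r"( nBits ) : "r"( kGlobalSize ) );
    // hash % kGlobalSize == hash & (kGlobalSize - 1) for a power of two.
    return hash & ( ( 1u << nBits ) - 1u );
}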
Speed Comparison with CUSP
[Figure: two panels comparing against CUSP: Time (ms) per number of warps (left) and Memory (MB) per number of warps (right), for 256 to 8192 warps.]
Simple load-balancing:
// A warp keeps fetching rows from the work queue until no thread in the
// block has a row left to process.
for( ; __syncthreads_or( aRowId < A_nRows ) ; aRowId += getWork( workQueue ) )
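getWork is not shown on the slide; one plausible implementation is an atomicAdd on a global counter of the next unprocessed row, broadcast to the warp through shared memory. A sketch under that assumption (the slide's one-argument form must keep the current row in surrounding state; this version passes it explicitly and returns an increment so the += form above still works):

// Illustrative dynamic work queue: *workQueue holds the next unprocessed row.
__device__ int getWork( int *workQueue, volatile int *sNextRow,
                        int warpId, int laneId, int aRowId )
{
    if( laneId == 0 )
        sNextRow[warpId] = atomicAdd( workQueue, 1 );  // lane 0 grabs a row
    return sNextRow[warpId] - aRowId;                  // returned as an increment
}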
[Figure: Time (ms) per number of warps (256 to 8192), static vs. dynamic load-balancing.]
[Figure: Time (ms) per number of rows for atmosmodd (left) and mc2depi (right); x values 16, 8, 4, 2, 1, 1 w/o a…]