S0285 Optimization of Sparse Matrix-Matrix Multiplication on GPU

The document describes an algorithm for sparse matrix multiplication on the GPU. It maps each row of the left matrix to a warp of threads to compute the non-zero elements in the corresponding row of the result matrix. Hash tables stored in shared and global memory are used to accumulate the results. Load balancing is implemented to reduce memory usage. Performance is competitive with CUSP and improvements include loading multiple rows of the right matrix to better utilize the GPU.


Sparse Matrix-Matrix Multiplication on the GPU

Julien Demouth, NVIDIA


Introduction: Problem
• Two sparse matrices A and B, compute:

  C = AB

• Sparse matrix: many zeroes

[Figure: a matrix with non-zero entries marked "x"; blank cells are zeroes]

• Only non-zero elements are stored in memory


Introduction: Approach
foreach rowA : A.rows (in parallel):
    hashtbl = {}
    foreach (colA, valA) : rowA.nonZeroes:
        rowB = B.rows[colA]
        foreach (colB, valB) : rowB.nonZeroes:
            hashtbl[colB] += valA * valB
    store hashtbl

[Figure: a row of A selecting rows of B, accumulated into a row of C]

Two main steps:
• Find the # of non-zeroes per row of C
• Compute the values of the non-zeroes of C
Introduction: Implementation
• CPU implementation

[Figure: CPU threads 0..k each process rows of A, with a per-thread hash table in host memory (and cache)]

• GPU implementation

[Figure: warps 0..n each process rows of A, with a per-warp hash table split between shared memory and global memory]
IMPLEMENTATION
Sparse Matrix Representation: CSR
• Do not store 0s
• Store column indices and values. Compress row indices into row offsets.

[Figure: CSR layout — Row indices: … compressed into Row offsets: …; Column indices: …; Values: …]
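As a concrete example (the matrix values are chosen for illustration), a 3×3 matrix with four non-zeroes is stored in CSR as three arrays:

```cpp
#include <vector>

// CSR encoding of the 3x3 matrix:
// [ 5 0 0 ]
// [ 0 8 3 ]
// [ 0 0 6 ]
const std::vector<int>    rowOffsets = {0, 1, 3, 4}; // row i spans [rowOffsets[i], rowOffsets[i+1])
const std::vector<int>    colIndices = {0, 1, 2, 2};
const std::vector<double> values     = {5, 8, 3, 6};

// Number of non-zeroes in row i: the difference of consecutive offsets.
inline int rowNnz(int i) { return rowOffsets[i + 1] - rowOffsets[i]; }
```

Row 1, for instance, spans positions [1, 3) of `colIndices`/`values`: the entries 8 (column 1) and 3 (column 2).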
GPU Algorithm
• Count the number of non-zeroes in each row of C (1 kernel)
  — Conceptually, a warp computes the result for one row
    ◦ When # of rows > # of warps, a warp works on more rows
  — Store the counts in C.rows

• Exclusive scan on C.rows to compute offsets (1 Thrust call)

• Compute column indices and values (1 kernel)
  — Conceptually, a warp computes the pairs for one row
    ◦ When # of rows > # of warps, a warp works on more rows
  — Store the column indices and values in C.cols and C.vals
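The count-then-scan step can be illustrated on the CPU, with `std::partial_sum` standing in for the Thrust exclusive-scan call (the counts are illustrative):

```cpp
#include <numeric>
#include <vector>

// Turn per-row non-zero counts of C into CSR row offsets.
// An exclusive scan of {2, 0, 3, 1} yields {0, 2, 2, 5, 6}:
// row i of C will occupy [offsets[i], offsets[i+1]) in C.cols / C.vals.
std::vector<int> countsToOffsets(const std::vector<int> &counts) {
    std::vector<int> offsets(counts.size() + 1, 0);
    std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
    return offsets;
}
```

The final offset is the total number of non-zeroes of C, which is exactly the amount of memory to allocate before the second kernel runs.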
Pseudo-code: Load (colA, valA) Pairs
• One warp per row of A (coalesced loads):

[Figure: the 32 threads of a warp reading consecutive entries of A.cols: … and A.vals: …; lanes past the end of the row hold invalid (colA, valA) pairs]

• Each thread keeps its pair (colA, valA) in registers

warpId = threadIdx.x / WARP_SIZE;
laneId = threadIdx.x % WARP_SIZE;

aColIt  = A.rows[warpId] + laneId;
aColEnd = A.rows[warpId + 1];

// Is the (colA, valA) pair valid?
colA = aColIt < aColEnd ? A.cols[aColIt] : -1;
valA = aColIt < aColEnd ? A.vals[aColIt] : 0.0;

// Alternative way to get the lane index:
// asm( "mov.u32 %0, %%laneid;" : "=r"( laneId ) );


Pseudo-code: Vote to Load Rows of B
• Each thread, in turn, asks the warp to load its row of B:

__shared__ volatile int    sColA[nWarps];
__shared__ volatile double sValA[nWarps];

// end == number of valid (colA, valA) pairs in the warp.
for( k = 0, end = __popc( __ballot( aColIt < aColEnd ) ) ; k < end ; ++k )
{
    // The kth thread pushes its (colA, valA) pair.
    if( laneId == k ) { sColA[warpId] = colA; sValA[warpId] = valA; }

    bColIt  = B.rows[sColA[warpId]    ]; // sColA is volatile and the warp's threads
    bColEnd = B.rows[sColA[warpId] + 1]; // are implicitly synchronized.

    // Loop as long as there are unread (colB, valB) pairs in row colA of B.
    for( bColIt += laneId ; __any( bColIt < bColEnd ) ; bColIt += 32 ) { /* … */ }
}

Note: Warp synchronization is a HW concept. Use it with care.
Pseudo-code: Load (colB, valB) Pairs
• Inside the loop, each thread loads its pair (colB, valB):

colB = bColIt < bColEnd ? B.cols[bColIt] : -1;
valB = bColIt < bColEnd ? B.vals[bColIt] : 0.0;

• And inserts its pair into the hash table:

hashtbl.insert( colB, sValA[warpId] * valB );

• Insertions are performed in parallel

• Important note: All threads have different colB values
Hash Table
• Stored in shared memory and global memory

__shared__ volatile unsigned sKeys[nWarps][sMemSize]; // In our impl, sMemSize == 256
__shared__ volatile double   sVals[nWarps][sMemSize];

• Insertion uses four different hash functions in order, first in shared memory, then in global memory:

foreach hashFunction:
    if( __all( inserted ) )
        break;
    tryToInsertInSharedMemory( colB, value );

foreach hashFunction:
    if( __all( inserted ) )
        break;
    tryToInsertInGlobalMemory( colB, value );

• If it fails, we re-run the kernel using more global memory
Hash Table Insertion
• Each thread computes its hash value
• If the slot contains the same key, update the value:

sVals[warpId][hash] += value;

• If the slot is empty, try to insert:

sKeys[warpId][hash] = colB;       // Write conflict! Only one winner.
if( sKeys[warpId][hash] == colB ) // Did this thread win the slot?
    sVals[warpId][hash] = value;

[Figure: two threads with colB=5 and colB=8 racing to write the same empty slot of sKeys]

• If the slot is full, retry with the next hash function (next iteration)
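The insert-or-retry logic can be sketched serially, as a single-threaded stand-in for the warp-parallel version (table size, hash functions, and all names here are illustrative, not the slides' actual implementation):

```cpp
#include <vector>

constexpr int kSize  = 256; // slots per table (sMemSize in the slides)
constexpr int kEmpty = -1;

struct HashTable {
    std::vector<int>    keys = std::vector<int>(kSize, kEmpty);
    std::vector<double> vals = std::vector<double>(kSize, 0.0);

    // Illustrative stand-ins for the four hash functions.
    static int hash(int key, int fn) { return (key * (2 * fn + 1) + fn) % kSize; }

    bool insert(int key, double value) {
        for (int fn = 0; fn < 4; ++fn) { // try each hash function in order
            int h = hash(key, fn);
            if (keys[h] == key) {        // same key: accumulate the value
                vals[h] += value;
                return true;
            }
            if (keys[h] == kEmpty) {     // empty slot: claim it
                keys[h] = key;
                vals[h] = value;
                return true;
            }
            // slot holds another key: retry with the next hash function
        }
        return false; // caller falls back to the global-memory table
    }
};
```

On the GPU, the "claim it" step is the racy write-then-read-back shown above, since several lanes may target the same slot in the same cycle.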
Hash Table Size
• Hash tables in global memory contain 2^k slots
  — We usually start with 2048 slots
• We use inlined PTX to compute hash % kGlobalSize:

// Given kGlobalSize == 2^k, bfind returns k (the position of the highest set bit).
int nBits;
asm( "bfind.u32 %0, %1;" : "=r"( nBits ) : "r"( kGlobalSize ) );

// Compute hash % (1 << nBits) == hash & ((1 << nBits) - 1)
// by extracting the nBits lowest bits of hash.
unsigned dst;
asm( "bfe.u32 %0, %1, 0, %2;" : "=r"( dst ) : "r"( hash ), "r"( nBits ) );

• Honestly, it's not critical for performance here, but it's fun
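The identity the PTX relies on is easy to check on the CPU; here `__builtin_clz` (GCC/Clang) stands in for `bfind` and a mask for `bfe`:

```cpp
// For a power-of-two table size, hash % size == hash & (size - 1).
inline unsigned powerOfTwoMod(unsigned hash, unsigned kGlobalSize) {
    int nBits = 31 - __builtin_clz(kGlobalSize); // "bfind": index of the highest set bit
    return hash & ((1u << nBits) - 1u);          // "bfe": keep the nBits lowest bits
}
```

For kGlobalSize == 2048 this gives nBits == 11, so the mask is 0x7FF.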
Speed Comparison with CUSP
[Figure: bar chart of speedup compared to CUSP (0x–14x) for three configurations: Default setup, Tuned GMEM size, Tuned load width. C = AA on a Tesla M2090 – fp64]
IMPROVEMENTS
Memory Consumption
• Global memory is allocated for each warp (to store hash tables)

[Figure: two charts vs. number of warps (256 to 8192): time in ms (left) and memory in MB (right); memory grows with the number of warps. C = AA - fp64 code on Tesla M2090 – A: hood.mtx]


Load Balancing
• Objective: performance with lower memory requirements
• Initial approach: static scheduling

for( ; __syncthreads_or( aRowId < A_nRows ) ; aRowId += nWarps )

• Simple load-balancing: dynamic scheduling

for( ; __syncthreads_or( aRowId < A_nRows ) ; aRowId += getWork( workQueue ) )

• getWork is implemented using a simple atomic operation:

__shared__ unsigned work;
if( threadIdx.x == 0 )
    work = atomicAdd( workQueue, nWarpsPerBlock );
__syncthreads( );
return work + warpId;
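The same dynamic scheme can be sketched on the CPU with `std::atomic` playing the role of atomicAdd on workQueue (worker count, chunk size, and names are illustrative):

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Dynamic scheduling sketch: workers grab chunks of rows from a shared
// counter until all rows are claimed. Returns the number of rows processed.
int runDynamic(int nRows, int chunk, int nWorkers) {
    std::atomic<int> workQueue{0};
    std::atomic<int> processed{0};

    auto worker = [&] {
        for (;;) {
            int first = workQueue.fetch_add(chunk); // the "atomicAdd"
            if (first >= nRows) break;              // queue exhausted
            processed += std::min(first + chunk, nRows) - first; // stand-in for row work
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < nWorkers; ++i) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
    return processed;
}
```

Fast workers simply come back for more chunks, which is why fewer warps (and thus fewer hash tables) can sustain the same throughput.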
Load Balancing: Results
• Lower memory usage

[Figure: time in ms (0–600) vs. number of warps (256 to 8192) for Static and Dynamic scheduling. C = AA - fp64 code on Tesla M2090 – A: hood.mtx]


Load Several Rows of B
• Modify the algorithm to load several rows of B
• Hash tables then have to use atomics:

sKeys[warpId][hash] = colB;
if( sKeys[warpId][hash] == colB ) // Winner?
    atomicAdd( &sVals[warpId][hash], value );

• Fp64 atomicAdd is implemented using Compare-and-Swap
  — See the CUDA programming guide
Load Several Rows of B: Results

[Figure: time in ms vs. number of rows of B loaded at once (16, 8, 4, 2, 1, and 1 without atomics) for two matrices: atmosmodd (~50–230 ms) and mc2depi (~30–90 ms). "1": with atomics -- "1 w/o a": without atomics]


Summary
• We have implemented a CPU-like algorithm on the GPU
  — It gives good results compared to CUSP
• Ideas
  — Use hash tables stored in shared and global memories
  — Map a CPU thread to a warp of GPU threads
  — Use simple load-balancing to reduce memory consumption
• Future work
  — Try with better load-balancing
  — Adapt the algorithm to multiple GPUs, then MPI nodes
