S0285 Optimization of Sparse Matrix-Matrix Multiplication on GPU

The document describes an algorithm for sparse matrix multiplication on the GPU. It maps each row of the left matrix to a warp of threads to compute the non-zero elements in the corresponding row of the result matrix. Hash tables stored in shared and global memory are used to accumulate the results. Load balancing is implemented to reduce memory usage. Performance is competitive with CUSP and improvements include loading multiple rows of the right matrix to better utilize the GPU.


Sparse Matrix-Matrix Multiplication on the GPU

Julien Demouth, NVIDIA


Introduction: Problem
• Two sparse matrices A and B, compute:

  C = AB

• Sparse matrix: many zeroes

[Figure: a matrix with non-zero entries marked "x"; blank cells are zeroes]

• Only non-zero elements are stored in memory


Introduction: Approach
foreach rowA : A.rows (in parallel):
    hashtbl = {}
    foreach (colA, valA) : rowA.nonZeroes:
        rowB = B.rows[colA]
        foreach (colB, valB) : rowB.nonZeroes:
            hashtbl[colB] += valA * valB
    store hashtbl

[Figure: a row of A selecting rows of B, accumulated into a row of C]

Two main steps:
• Find the # of non-zeroes per row of C
• Compute the values of the non-zeroes of C
Introduction: Implementation
• CPU implementation

[Figure: CPU threads 0..k each process rows of A, with a per-thread hash table in host memory (and cache)]

• GPU implementation

[Figure: warps 0..n each process rows of A, with a per-warp hash table split between shared memory and global memory]
IMPLEMENTATION
Sparse Matrix Representation: CSR
• Do not store 0s
• Store column indices and values. Compress row indices into row offsets.

[Figure: CSR layout — Row indices: … compressed into Row offsets: …; Column indices: …; Values: …]
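As a concrete example (the matrix values are chosen for illustration), a 3×3 matrix with four non-zeroes is stored in CSR as three arrays:

```cpp
#include <vector>

// CSR encoding of the 3x3 matrix:
// [ 5 0 0 ]
// [ 0 8 3 ]
// [ 0 0 6 ]
const std::vector<int>    rowOffsets = {0, 1, 3, 4}; // row i spans [rowOffsets[i], rowOffsets[i+1])
const std::vector<int>    colIndices = {0, 1, 2, 2};
const std::vector<double> values     = {5, 8, 3, 6};

// Number of non-zeroes in row i: the difference of consecutive offsets.
inline int rowNnz(int i) { return rowOffsets[i + 1] - rowOffsets[i]; }
```

Row 1, for instance, spans positions [1, 3) of `colIndices`/`values`: the entries 8 (column 1) and 3 (column 2).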
GPU Algorithm
• Count the number of non-zeroes in each row of C (1 kernel)
  — Conceptually, a warp computes the result for one row
    ◦ When # of rows > # of warps, a warp works on more rows
  — Store the counts in C.rows

• Exclusive scan on C.rows to compute offsets (1 Thrust call)

• Compute column indices and values (1 kernel)
  — Conceptually, a warp computes the pairs for one row
    ◦ When # of rows > # of warps, a warp works on more rows
  — Store the column indices and values in C.cols and C.vals
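The count-then-scan step can be illustrated on the CPU, with `std::partial_sum` standing in for the Thrust exclusive-scan call (the counts are illustrative):

```cpp
#include <numeric>
#include <vector>

// Turn per-row non-zero counts of C into CSR row offsets.
// An exclusive scan of {2, 0, 3, 1} yields {0, 2, 2, 5, 6}:
// row i of C will occupy [offsets[i], offsets[i+1]) in C.cols / C.vals.
std::vector<int> countsToOffsets(const std::vector<int> &counts) {
    std::vector<int> offsets(counts.size() + 1, 0);
    std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);
    return offsets;
}
```

The final offset is the total number of non-zeroes of C, which is exactly the amount of memory to allocate before the second kernel runs.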
Pseudo-code: Load (colA, valA) Pairs
• One warp per row of A (coalesced loads):

[Figure: the 32 threads of a warp reading consecutive entries of A.cols: … and A.vals: …; lanes past the end of the row hold invalid (colA, valA) pairs]

• Each thread keeps its pair (colA, valA) in registers

warpId = threadIdx.x / WARP_SIZE;
laneId = threadIdx.x % WARP_SIZE;

aColIt  = A.rows[warpId] + laneId;
aColEnd = A.rows[warpId + 1];

// Is the (colA, valA) pair valid?
colA = aColIt < aColEnd ? A.cols[aColIt] : -1;
valA = aColIt < aColEnd ? A.vals[aColIt] : 0.0;

// Alternative way to get the lane index:
// asm( "mov.u32 %0, %%laneid;" : "=r"( laneId ) );


Pseudo-code: Vote to Load Rows of B
• Each thread, in turn, asks the warp to load its row of B:

__shared__ volatile int    sColA[nWarps];
__shared__ volatile double sValA[nWarps];

// end == number of valid (colA, valA) pairs in the warp.
for( k = 0, end = __popc( __ballot( aColIt < aColEnd ) ) ; k < end ; ++k )
{
    // The kth thread pushes its (colA, valA) pair.
    if( laneId == k ) { sColA[warpId] = colA; sValA[warpId] = valA; }

    bColIt  = B.rows[sColA[warpId]    ]; // sColA is volatile and the warp's threads
    bColEnd = B.rows[sColA[warpId] + 1]; // are implicitly synchronized.

    // Loop as long as there are unread (colB, valB) pairs in row colA of B.
    for( bColIt += laneId ; __any( bColIt < bColEnd ) ; bColIt += 32 ) { /* … */ }
}

Note: Warp synchronization is a HW concept. Use it with care.
Pseudo-code: Load (colB, valB) Pairs
• Inside the loop, each thread loads its pair (colB, valB):

colB = bColIt < bColEnd ? B.cols[bColIt] : -1;
valB = bColIt < bColEnd ? B.vals[bColIt] : 0.0;

• And inserts its pair into the hash table:

hashtbl.insert( colB, sValA[warpId] * valB );

• Insertions are performed in parallel

• Important note: All threads have different colB values
Hash Table
• Stored in shared memory and global memory

__shared__ volatile unsigned sKeys[nWarps][sMemSize]; // In our impl, sMemSize == 256
__shared__ volatile double   sVals[nWarps][sMemSize];

• Insertion uses four different hash functions in order, first in shared memory, then in global memory:

foreach hashFunction:
    if( __all( inserted ) )
        break;
    tryToInsertInSharedMemory( colB, value );

foreach hashFunction:
    if( __all( inserted ) )
        break;
    tryToInsertInGlobalMemory( colB, value );

• If it fails, we re-run the kernel using more global memory
Hash Table Insertion
• Each thread computes its hash value
• If the slot contains the same key, update the value:

sVals[warpId][hash] += value;

• If the slot is empty, try to insert:

sKeys[warpId][hash] = colB;       // Write conflict! Only one winner.
if( sKeys[warpId][hash] == colB ) // Did this thread win the slot?
    sVals[warpId][hash] = value;

[Figure: two threads with colB=5 and colB=8 racing to write the same empty slot of sKeys]

• If the slot is full, retry with the next hash function (next iteration)
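The insert-or-retry logic can be sketched serially, as a single-threaded stand-in for the warp-parallel version (table size, hash functions, and all names here are illustrative, not the slides' actual implementation):

```cpp
#include <vector>

constexpr int kSize  = 256; // slots per table (sMemSize in the slides)
constexpr int kEmpty = -1;

struct HashTable {
    std::vector<int>    keys = std::vector<int>(kSize, kEmpty);
    std::vector<double> vals = std::vector<double>(kSize, 0.0);

    // Illustrative stand-ins for the four hash functions.
    static int hash(int key, int fn) { return (key * (2 * fn + 1) + fn) % kSize; }

    bool insert(int key, double value) {
        for (int fn = 0; fn < 4; ++fn) { // try each hash function in order
            int h = hash(key, fn);
            if (keys[h] == key) {        // same key: accumulate the value
                vals[h] += value;
                return true;
            }
            if (keys[h] == kEmpty) {     // empty slot: claim it
                keys[h] = key;
                vals[h] = value;
                return true;
            }
            // slot holds another key: retry with the next hash function
        }
        return false; // caller falls back to the global-memory table
    }
};
```

On the GPU, the "claim it" step is the racy write-then-read-back shown above, since several lanes may target the same slot in the same cycle.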
Hash Table Size
• Hash tables in global memory contain 2^k slots
  — We usually start with 2048 slots
• We use inlined PTX to compute hash % kGlobalSize:

// Given kGlobalSize == 2^k, bfind returns k (the position of the highest set bit).
int nBits;
asm( "bfind.u32 %0, %1;" : "=r"( nBits ) : "r"( kGlobalSize ) );

// Compute hash % (1 << nBits) == hash & ((1 << nBits) - 1)
// by extracting the nBits lowest bits of hash.
unsigned dst;
asm( "bfe.u32 %0, %1, 0, %2;" : "=r"( dst ) : "r"( hash ), "r"( nBits ) );

• Honestly, it's not critical for performance here, but it's fun
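The identity the PTX relies on is easy to check on the CPU; here `__builtin_clz` (GCC/Clang) stands in for `bfind` and a mask for `bfe`:

```cpp
// For a power-of-two table size, hash % size == hash & (size - 1).
inline unsigned powerOfTwoMod(unsigned hash, unsigned kGlobalSize) {
    int nBits = 31 - __builtin_clz(kGlobalSize); // "bfind": index of the highest set bit
    return hash & ((1u << nBits) - 1u);          // "bfe": keep the nBits lowest bits
}
```

For kGlobalSize == 2048 this gives nBits == 11, so the mask is 0x7FF.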
Speed Comparison with CUSP
[Figure: bar chart of speedup compared to CUSP (0x–14x) for three configurations: Default setup, Tuned GMEM size, Tuned load width. C = AA on a Tesla M2090 – fp64]
IMPROVEMENTS
Memory Consumption
• Global memory is allocated for each warp (to store hash tables)

[Figure: two charts vs. number of warps (256 to 8192): time in ms (left) and memory in MB (right); memory grows with the number of warps. C = AA - fp64 code on Tesla M2090 – A: hood.mtx]


Load Balancing
• Objective: performance with lower memory requirements
• Initial approach: static scheduling

for( ; __syncthreads_or( aRowId < A_nRows ) ; aRowId += nWarps )

• Simple load-balancing: dynamic scheduling

for( ; __syncthreads_or( aRowId < A_nRows ) ; aRowId += getWork( workQueue ) )

• getWork is implemented using a simple atomic operation:

__shared__ unsigned work;
if( threadIdx.x == 0 )
    work = atomicAdd( workQueue, nWarpsPerBlock );
__syncthreads( );
return work + warpId;
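The same dynamic scheme can be sketched on the CPU with `std::atomic` playing the role of atomicAdd on workQueue (worker count, chunk size, and names are illustrative):

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Dynamic scheduling sketch: workers grab chunks of rows from a shared
// counter until all rows are claimed. Returns the number of rows processed.
int runDynamic(int nRows, int chunk, int nWorkers) {
    std::atomic<int> workQueue{0};
    std::atomic<int> processed{0};

    auto worker = [&] {
        for (;;) {
            int first = workQueue.fetch_add(chunk); // the "atomicAdd"
            if (first >= nRows) break;              // queue exhausted
            processed += std::min(first + chunk, nRows) - first; // stand-in for row work
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < nWorkers; ++i) pool.emplace_back(worker);
    for (auto &t : pool) t.join();
    return processed;
}
```

Fast workers simply come back for more chunks, which is why fewer warps (and thus fewer hash tables) can sustain the same throughput.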
Load Balancing: Results
• Lower memory usage

[Figure: time in ms (0–600) vs. number of warps (256 to 8192) for Static and Dynamic scheduling. C = AA - fp64 code on Tesla M2090 – A: hood.mtx]


Load Several Rows of B
• Modify the algorithm to load several rows of B
• Hash tables then have to use atomics:

sKeys[warpId][hash] = colB;
if( sKeys[warpId][hash] == colB ) // Winner?
    atomicAdd( &sVals[warpId][hash], value );

• Fp64 atomicAdd is implemented using Compare-and-Swap
  — See the CUDA programming guide
Load Several Rows of B: Results

[Figure: time in ms vs. number of rows of B loaded at once (16, 8, 4, 2, 1, and 1 without atomics) for two matrices: atmosmodd (~50–230 ms) and mc2depi (~30–90 ms). "1": with atomics -- "1 w/o a": without atomics]


Summary
• We have implemented a CPU-like algorithm on the GPU
  — It gives good results compared to CUSP
• Ideas
  — Use hash tables stored in shared and global memories
  — Map a CPU thread to a warp of GPU threads
  — Use simple load-balancing to reduce memory consumption
• Future work
  — Try with better load-balancing
  — Adapt the algorithm to multiple GPUs, then MPI nodes
