Exploring Better Speculation and Data Locality in Sparse Matrix Vector Multiplication On Intel Xeon
ICCD’2020
Outline
SpMV on Intel XEON
Sparse Matrix-Vector Multiplication
• (Very) Sparse Matrix
• Dense Vector
• (Normally) Many Iterations with the Same Matrix and Multiple Vectors

For i in Range(Iter):
    y_i = A * x_i
End

[Figure: a sparse matrix A (entries a–l) multiplied by a dense vector x]
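The iteration above can be sketched in plain Python over CSR storage (the format the optimized matrix ultimately uses); this is an illustrative sketch, not MKL's `mkl_sparse_d_mv` API:

```python
# Minimal sketch of one SpMV iteration y = A @ x, with A stored in CSR.
# Function and variable names are illustrative, not from the paper.

def spmv_csr(indptr, indices, data, x):
    """y = A @ x for a CSR matrix A given by (indptr, indices, data)."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Inner-loop trip count = NNZ of row i; its exit branch is what
        # the CPU must predict on every iteration.
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc
    return y

# A 3x3 example:  [[1, 0, 2],
#                  [0, 3, 0],
#                  [4, 0, 5]]
indptr  = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data    = [1.0, 2.0, 3.0, 4.0, 5.0]
print(spmv_csr(indptr, indices, data, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```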
SpMV on Intel XEON
CPU
• Intel Xeon Gold 6146 (Skylake)
• CPU Cores: 12
• Frequency: 3.2–4.2 GHz
• L1 Cache: 32 KB I/D
• L2 Cache: 1 MB Private
• L3 Cache: 24.75 MB Shared
Tested Function
• Intel® Math Kernel Library
• Version: l_mkl_2019.5.281
• Routine: mkl_sparse_d_mv
Profiling
• Intel® VTune Profiler 2020
Graph-based Matrices (SuiteSparse Matrix Collection)
[Figure: test matrices — web-google, dblp-2010, Linux_call_graph, web-Stanford]
SpMV on Intel XEON
Some Take-Aways:
• Poor efficiency (only 34% valid execution)
• Two major bottlenecks:
  ➢ Bad Speculation
  ➢ Memory Access
• Observations applicable to most graph-based sparse matrices

[Figure: pipeline-slot breakdown — Retiring, Front-End, Bad Speculation, Back-End.Core, Back-End.Memory]
Outline
Bottleneck Analysis
• Speculation
• A wrong prediction causes a serious penalty:
  ➢ Invalidate speculatively-executed instructions
  ➢ Recover CPU state to the branch point
• Shorter rows tend to have more frequent wrong predictions

[Figure: branch history of row R1 (T T NT) trains the predictor; on short row R2 the predicted outcome is wrong: 100% mispredict]
Bottleneck Analysis
• Speculation
• Shorter rows tend to have more frequent wrong predictions
• Longer rows tend to have less frequent wrong predictions
• The speculation penalty is influenced by the NNZ density of each row.

[Figure: with a longer row R2, the trained predictor is wrong only at the row's exit: 20% mispredict]
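One rough way to see the short-row effect is to simulate a toy 2-bit saturating-counter predictor (a deliberately simple stand-in for the real Skylake predictor) over the inner-loop branch stream, where each row of NNZ nonzeros produces NNZ−1 Taken branches and one Not-Taken exit:

```python
# Toy model of the inner-loop exit branch: the back-edge is Taken (T)
# for every nonzero except the last, then Not-Taken (NT) at row exit.
# A 2-bit saturating counter mispredicts the NT exit of every row, so
# the mispredict *rate* grows as rows get shorter.

def mispredict_rate(row_lengths):
    state = 2                  # 2-bit counter: 0,1 predict NT; 2,3 predict T
    branches = wrong = 0
    for nnz in row_lengths:
        outcomes = [True] * (nnz - 1) + [False]   # T ... T NT per row
        for taken in outcomes:
            predicted = state >= 2
            if predicted != taken:
                wrong += 1
            state = min(state + 1, 3) if taken else max(state - 1, 0)
            branches += 1
    return wrong / branches

print(mispredict_rate([2] * 1000))   # short rows:  0.5
print(mispredict_rate([20] * 1000))  # long rows:   0.05
```

Under this toy model a matrix of 2-NNZ rows mispredicts half of all branches, while 20-NNZ rows mispredict only one branch in twenty, matching the slide's 100% vs. 20% per-row-exit contrast in spirit.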
Bottleneck Analysis
• Memory Access
• Split the original matrix at a row-density threshold (NNZ = 20) into a Dense Sub-Matrix and a Sparse Sub-Matrix
• Run each sub-matrix through the MKL SpMV routine and perform the bottleneck analysis on each separately

[Figure: flow — Dense/Sparse Sub-Matrix → MKL SpMV Routine → Bottleneck Analysis]
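The split step can be sketched as follows over CSR storage; the function name and the row-id bookkeeping are illustrative, not from the paper:

```python
# Sketch of the dense/sparse split at threshold NNZ = 20: rows with at
# least `threshold` nonzeros go to the dense sub-matrix, the rest to the
# sparse one; each sub-matrix can then be profiled separately.

def split_by_row_density(indptr, indices, data, threshold=20):
    dense, sparse = ([0], [], []), ([0], [], [])   # (indptr, indices, data)
    dense_rows, sparse_rows = [], []
    for i in range(len(indptr) - 1):
        lo, hi = indptr[i], indptr[i + 1]
        part, rows = (dense, dense_rows) if hi - lo >= threshold else (sparse, sparse_rows)
        part[1].extend(indices[lo:hi])
        part[2].extend(data[lo:hi])
        part[0].append(len(part[1]))
        rows.append(i)        # remember original row ids to merge y back later
    return dense, dense_rows, sparse, sparse_rows

# Small demo with threshold=3: row 0 has 3 NNZ (dense), rows 1-2 fewer (sparse).
dense, d_rows, sparse, s_rows = split_by_row_density(
    [0, 3, 4, 6], [0, 1, 2, 1, 0, 2], [1., 1., 1., 2., 3., 3.], threshold=3)
print(d_rows, s_rows)  # [0] [1, 2]
```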
Bottleneck Analysis
• Sparse vs. Dense

[Figure: comparison of the sparse and dense sub-matrices by runtime, row count, and NNZ, with pipeline-slot breakdowns (Retiring, Front-End, Bad Speculation, Back-End.Memory, Back-End.Core)]
Optimize Memory Access

[Figure: rows (e.g. R1, R6–R9) grouped by memory penalty, from high to low]
2. Buckets Generation: each bucket receives a binary tag (Tag: 0000, 0001, 0011, 1101, …)
3. Buckets Ordering: buckets are ordered along the 4-bit reflected Gray-code sequence of their tags (0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000), so adjacent buckets differ in a single tag bit
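The ordering step can be sketched as below; the tag sequence on the slide is the 4-bit reflected Gray code, and the `order_buckets` helper and its tag-to-rows mapping are illustrative assumptions:

```python
# Sketch of the bucket-ordering step: buckets carry 4-bit tags, and
# ordering them along the reflected Gray-code sequence makes consecutive
# buckets differ in exactly one tag bit.

def gray_sequence(bits):
    """Reflected Gray code: i ^ (i >> 1) for i = 0 .. 2**bits - 1."""
    return [i ^ (i >> 1) for i in range(2 ** bits)]

def order_buckets(buckets):
    """buckets: dict mapping a 4-bit tag -> list of row ids (illustrative)."""
    rank = {tag: pos for pos, tag in enumerate(gray_sequence(4))}
    return sorted(buckets, key=rank.__getitem__)

seq = gray_sequence(4)
print([format(t, "04b") for t in seq[:4]])  # ['0000', '0001', '0011', '0010']
```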
Optimize Memory Access

[Figure: quadrant chart — Memory Penalty (y, low→high) vs. Speculation Penalty (x, low→high). Poorly banded sub-matrices sit at high memory penalty; sub-matrices with less NNZ per row sit at high speculation penalty. More NNZ + poorly banded → Optimize Locality (BBO); less NNZ + poorly banded → Optimize Speculation & Locality (BBO + DBO); highly banded sub-matrices sit at low memory penalty]
Outline
Evaluations
8 Threads:
• Ours: 1.8x
• MKL Opt: 1.2x
• Ours + MKL Opt: 3.6x
Conclusion:
• Ours outperforms MKL Opt by a large margin.
• Our method helps MKL Opt achieve a higher vectorization rate and much higher efficiency.
Evaluations
Pre-processing Cost
• Ours: 4.2x
• MKL Opt: 26.5x
• Ours + MKL Opt: 31.9x
Conclusion:
• Extremely low cost compared with MKL Opt or other approaches
• Negligible considering the numerous iterations over the same matrix.
Q&A
Optimization Scheme

[Figure: flowchart — the Original Matrix is split into a Sparse Sub-Matrix and a Dense Sub-Matrix; each is tested with "Highly Banded?"; depending on the Yes/No answers, DBO, DBO + BBO, or BBO is applied; the resulting Optimized Sparse Sub-matrix and Optimized Dense Sub-matrix are merged into the Optimized Matrix (CSR)]
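The routing in the flowchart can be sketched as below. `is_highly_banded`, `dbo`, and `bbo` are hypothetical stand-ins for the bandedness test, the DBO reordering, and the BBO reordering; the pass-through on the dense, highly banded branch is an assumption, since the flowchart shows no reordering box for it:

```python
# Sketch of the per-sub-matrix dispatch in the optimization scheme.
# `is_highly_banded`, `dbo`, `bbo` are placeholder callables (assumed
# names, not from the paper).

def optimize(sub, kind, is_highly_banded, dbo, bbo):
    """Route one sub-matrix ('sparse' or 'dense') through DBO/BBO."""
    if kind == "sparse":
        # Sparse (short) rows suffer branch mispredictions, so DBO runs;
        # BBO is added when the layout is not already highly banded.
        return dbo(sub) if is_highly_banded(sub) else bbo(dbo(sub))
    # Dense rows predict well; only poorly banded layouts need BBO
    # (assumption: the highly banded dense branch is left unchanged).
    return sub if is_highly_banded(sub) else bbo(sub)
```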