Exploring Better Speculation and Data Locality in Sparse Matrix Vector Multiplication On Intel Xeon

This is the oral presentation of the ICCD 2020 accepted research paper entitled Exploring Better Speculation and Data Locality in Sparse Matrix Vector Multiplication on Intel Xeon.


Exploring Better Speculation and Data Locality in

Sparse Matrix-Vector Multiplication on Intel Xeon

Haoran Zhao, Tian Xia, Chenyang Li,


Wenzhe Zhao, Nanning Zheng and Pengju Ren

Xi’an Jiaotong University

ICCD’2020
Outline

• SpMV on Intel Xeon
• Bottleneck Analysis
• Optimize Speculation
• Optimize Memory Access
• Optimization Scheme
• Evaluations
SpMV on Intel Xeon

Sparse Matrix-Vector Multiplication (Iterations)
• (Very) Sparse Matrix
• Dense Vector
• (Normally) Many Iterations with the Same Matrix and Multiple Vectors

For i in Range(Iter):
    y_i = A x_i
End

[Figure: a 5×5 sparse matrix A (non-zeros a–l) multiplied by a dense vector x to produce y]
SpMV on Intel Xeon

CPU
• Intel Xeon Gold 6146 (Skylake)
• CPU Cores: 12
• Frequency: 3.2–4.2 GHz
• L1 Cache: 32 KB I/D
• L2 Cache: 1 MB Private
• L3 Cache: 24.75 MB Shared

Tested Function
• Intel® Math Kernel Library
• Version: l_mkl_2019.5.281
• Routine: mkl_sparse_d_mv

Profiling
• Intel® VTune Profiler 2020

Graph-based Matrices (SuiteSparse Matrix Collection)
• web-google, dblp-2010, Linux_call_graph, web-Stanford
SpMV on Intel Xeon

[Figure: pipeline-slot breakdown — Retiring, Front-End, Bad Speculation, Back-End.Core, Back-End.Memory]

Some Take-Aways:
• Poor efficiency (34% valid execution)
• Two major bottlenecks:
  ➢ Bad Speculation
  ➢ Memory Access
• The observation applies to most graph-based sparse matrices
Outline

• SpMV on Intel XEON


• Bottleneck Analysis
• Optimize Speculation
• Optimize Memory Access
• Optimization Scheme
• Evaluations
Bottleneck Analysis
• CSR Format

[Figure: a 5×5 sparse matrix (non-zeros a–l) and its CSR arrays]
Column Index:   1 3 0 1 2 4 1 0 3 4 1 2
Non-zero Value: a b c d e f g h i j k l
Row Pointer:    0 2 6 7 10 12

Irregular computation:
• Different loop counts among rows mean random branch conditions
• Different vector positions among rows mean no cache-line reuse
Bottleneck Analysis
• Speculation

[Figure: speculative execution in a superscalar pipeline — fetch, decode, rename, dispatch → issue → execution → Reorder Buffer (ROB) → retire; a wrong prediction is detected only deep in the pipeline]

• Modern superscalar CPUs rely on branch prediction to increase pipeline efficiency
• A wrong prediction causes a serious penalty:
  ➢ Invalidate speculatively-executed instructions
  ➢ Recover the CPU state to the branch point
Bottleneck Analysis
• Speculation

[Figure: a pattern history table (PHT) indexed by branch history (T-T, T-NT, NT-T, NT-NT); each entry is a 2-bit saturating counter: Strong Not Taken ↔ Weak Not Taken ↔ Weak Taken ↔ Strong Taken]

• The branch predictor learns from local and global branch history
• The SpMV inner loop is impossible to predict, because its exit branch has no pattern by its nature
• Shorter rows tend to have more frequent wrong predictions (e.g. up to 100% mispredict on a very short row)
• Longer rows tend to have less frequent wrong predictions (e.g. ~20% mispredict)
• The speculation penalty is therefore influenced by the NNZ density of each row
Bottleneck Analysis
• Memory Access

[Figure: sparsity plots of web-google and dblp-2010]
• web-google: NNZ Density = 6.1 × 10⁻⁶, poor locality, NNZ scattered all over
• dblp-2010: NNZ Density = 1.5 × 10⁻⁵, good locality, NNZ gathered in the diagonal

• web-google has lower density but poorer locality
• Memory locality is influenced by the NNZ distribution, not by the NNZ density
• In graph-based matrices, locality can be measured using the diagonal band, i.e. the proportion of NNZ inside this band
Bottleneck Analysis
• Sparse vs. Dense

• In a scale-free matrix, most rows are sparse
• Divide the matrix into sparse & dense sub-matrices: split rows by NNZ (threshold NNZ = 20)
• Separately analyze the bottlenecks

[Figure: Original Matrix → Split Rows by NNZ → Sparse Sub-Matrix / Dense Sub-Matrix → MKL SpMV Routine → Bottleneck Analysis]
Bottleneck Analysis
• Sparse vs. Dense

[Figure: pipeline-slot breakdowns (Retiring, Front-End, Bad Speculation, Back-End.Memory, Back-End.Core) and runtime/NNZ shares of the sparse and dense sub-matrices]

• The sparse part is bounded by memory access and speculation
• The dense part is bounded by memory access
• The sparse part takes up most of the total runtime
Bottleneck Analysis
• SpMV Penalty Model

[Figure: quadrant chart — x-axis: speculation penalty (low → high), y-axis: memory penalty (low → high)]
• More NNZ, highly banded → low penalties: it's already perfect!
• Less NNZ, highly banded → high speculation penalty: optimize speculation
• More NNZ, poorly banded → high memory penalty: optimize locality
• Less NNZ, poorly banded → both penalties high: optimize speculation & locality
Outline

• SpMV on Intel XEON


• Bottleneck Analysis
• Optimize Speculation
• Optimize Memory Access
• Optimization Scheme
• Evaluations
Optimize Speculation

Density-based Optimization (DBO):
• Group together rows of the same length
• Adjacent rows then have similar loop patterns

Results:
✓ Better speculation (63% → 26% misprediction in the example)
ⅹ Breaks the original matrix structure and the Y sequence (come back to this later)

[Figure: row reorder example — rows R0–R9 permuted so that equal-length rows become adjacent]


Outline

• SpMV on Intel XEON


• Bottleneck Analysis
• Optimize Speculation
• Optimize Memory Access
• Optimization Scheme
• Evaluations
Optimize Memory Access

Bitmap-based Optimization (BBO):
1. Downscale: segment each row and downscale it to a bitmap (e.g. Scale = 4); for dense rows, use a threshold to select only the denser sections (e.g. Threshold = 2)
2. Bucket generation: put rows with the same bitmap into the same bucket (Tags: 0000, 0001, 0011, …)
3. Bucket ordering: order the buckets by Gray code, so adjacent buckets differ in only one segment

Results:
✓ Better memory locality on poorly-banded matrices
ⅹ No effect on highly-banded matrices

[Figure: the 4-bit Gray code sequence 0000, 0001, 0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 1000]
Optimize Memory Access

DBO + BBO:
• DBO may degrade memory locality, as it breaks the original structure
• If the matrix is poorly banded, combine DBO & BBO to alleviate the degradation:
  • Use BBO inside each DBO density group
  • Use reversed Gray ordering between adjacent groups (density group K-1: Gray code; group K: inverse Gray code; group K+1: Gray code)

Results:
✓ Better speculation
✓ Better memory locality on poorly-banded matrices
Outline

• SpMV on Intel XEON


• Bottleneck Analysis
• Optimize Speculation
• Optimize Memory Access
• Optimization Scheme
• Evaluations
Optimization Scheme

[Figure: the optimization flowchart overlaid on the penalty-model quadrant]
• Original Matrix → Split Rows by NNZ → Sparse Sub-Matrix & Dense Sub-Matrix
• Sparse Sub-Matrix: highly banded? Yes → DBO; No → DBO+BBO
• Dense Sub-Matrix: highly banded? Yes → already good, keep as-is; No → BBO
• The optimized sub-matrices are merged into the Optimized Matrix (CSR)
Outline

• SpMV on Intel XEON


• Bottleneck Analysis
• Optimize Speculation
• Optimize Memory Access
• Optimization Scheme
• Evaluations
Evaluations
• Sparse vs. Dense
Evaluations
Full Matrix Performance
CPU Benchmark
• Intel Xeon Gold 6146 (SkyLake) • SuiteSparse Matrix Collection
• CPU Core: 12 • 86 Graph-based Matices
• Frequency: 3.2-4.2 GHz • Scale range 10K – 2 Millions
• L1 Cache: 32KB I/D • NNZ range 40K – 70 Millions
• L2 Cache: 1MB Private Thread
• L3 Cache: 24.75 MB Shared • 1 Thread
Function
• 8 Thread
• Intel® Math Kernel Library Tested Methods
• Version: l_mkl_2019.5.281 • Ours
• Routine: mkl_sparse_d_mv (Baseline) • MKL Opt
• Optimize: mkl_sparse_optimize • Ours + MKL Opt
Evaluations
1 Thread:
• Ours: 1.7x
• MKL Opt: 1.1x
• Ours + MKL Opt: 2.5x

8 Threads:
• Ours: 1.8x
• MKL Opt: 1.2x
• Ours + MKL Opt: 3.6x

Conclusion:
• Ours outperforms MKL Opt by a large margin.
• Our method helps MKL Opt achieve a higher vectorization rate and much higher efficiency.
Evaluations
Pre-processing Cost

• Ours: 4.2x
• MKL Opt: 26.5x
• Ours + MKL Opt: 31.9x

Conclusion:
• Extremely low cost compared with MKL Opt or other approaches
• Negligible considering the numerous iterations
Q&A

Xi’an Jiaotong University

ICCD’2020
SpMV on Intel Xeon

[Figure: pipeline-slot breakdowns (Retiring, Front-End, Bad Speculation, Back-End.Core, Back-End.Memory) for p2p-Gnutella, soc-sign-Slashdot, sx-askubuntu, email-EuAll, Stanford, web-Stanford, Linux_call_graph, dblp-2010, web-google]
SpMV on Intel Xeon
• Intel® VTune Profiler 2020
Optimization Scheme

[Figure: Original Matrix → Split Rows by NNZ → Sparse Sub-Matrix (highly banded? Yes → DBO; No → DBO+BBO) / Dense Sub-Matrix (highly banded? No → BBO) → Optimized Sub-matrices → Optimized Matrix (CSR)]