OSDI'23 Slides (Zhao)
The 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI’23)
1 Introduction
2 Overview
3 Schedule Sub-graph Instances
4 Generate Kernels for Sub-graph Instances
5 Experimental Results
6 Conclusion
Introduction Limitations of Prior Work July 12, 2023, Boston, MA, USA 4 / 21
Limitations of Prior Work
Overview Core Idea of GraphTurbo July 12, 2023, Boston, MA, USA 6 / 21
Core Idea of GraphTurbo
[Figure: per-stage sub-graph instance counts — stage2 = n× 3 3 6 6 10 10 13 13; stage3 = n× 7 7 7 7 14 14 14 14; stage4 = n× 15 15 15 15 15 15 15 15 15]
Schedule Sub-graph Instances Collect Splitting Information July 12, 2023, Boston, MA, USA 7 / 21
Collect Splitting Information
Schedule Sub-graph Instances Group Sub-graphs July 12, 2023, Boston, MA, USA 8 / 21
Group Sub-graphs with the aid of SplitInfo
[Figure: sub-graphs SG1–SG4 grouped with the aid of SplitInfo]
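The grouping step above can be sketched as follows. This is an illustrative reconstruction, not GraphTurbo's actual API: the `split_info` mapping and the rule "consecutive sub-graphs with the same split degree form one group" are assumptions based on the slide.

```python
# Hypothetical sketch of grouping sub-graphs by their splitting information
# (SplitInfo); names and grouping rule are assumptions, not GraphTurbo's API.

def group_subgraphs(subgraphs, split_info):
    """Group consecutive sub-graphs that share the same split degree,
    so each group's instances can later be scheduled together."""
    groups = []
    for sg in subgraphs:
        degree = split_info[sg]
        if groups and split_info[groups[-1][-1]] == degree:
            groups[-1].append(sg)      # same split degree: extend the group
        else:
            groups.append([sg])        # degree changes: start a new group
    return groups

# Example: SG1/SG2 split 4-way, SG3/SG4 split 2-way.
split_info = {"SG1": 4, "SG2": 4, "SG3": 2, "SG4": 2}
print(group_subgraphs(["SG1", "SG2", "SG3", "SG4"], split_info))
# [['SG1', 'SG2'], ['SG3', 'SG4']]
```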
Order Sub-graph Instances
[Figure: sub-graph instances a1–a8, b1–b4, c1–c2, and d1]
Schedule Sub-graph Instances Order Sub-graph Instances July 12, 2023, Boston, MA, USA 9 / 21
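The instance counts in the figure (8, 4, 2, 1) suggest an ordering where a consumer instance runs as soon as enough producer instances are ready, so intermediates stay on-chip. The sketch below is an illustrative reconstruction of such an ordering, not the paper's exact algorithm.

```python
# Hedged sketch: order sub-graph instances so each consumer fires as soon as
# its producer instances finish. counts[i] is the number of instances at
# stage i; one stage-(i+1) instance consumes counts[i] // counts[i+1]
# stage-i instances. Instance naming (a1, b1, ...) mirrors the figure.

def order_instances(counts):
    order = []
    done = [0] * len(counts)            # instances already emitted per stage

    def emit(stage):
        done[stage] += 1
        order.append(f"{chr(ord('a') + stage)}{done[stage]}")
        # Fire the next stage as soon as enough of its inputs are ready.
        if stage + 1 < len(counts):
            ratio = counts[stage] // counts[stage + 1]
            if done[stage] % ratio == 0:
                emit(stage + 1)

    for _ in range(counts[0]):
        emit(0)
    return order

print(order_instances([8, 4, 2, 1]))
# ['a1', 'a2', 'b1', 'a3', 'a4', 'b2', 'c1', 'a5', 'a6', 'b3',
#  'a7', 'a8', 'b4', 'c2', 'd1']
```

Note how c1 runs after only half of the a-instances, rather than after all of them — the interleaving that lets intermediate tensors be consumed while still resident in fast memory.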
Schedule Sub-graph Instances Infer Core Binding and Buffer Scopes July 12, 2023, Boston, MA, USA 10 / 21
Infer Core Binding and Buffer Scopes
[Figure: pairwise concatenation tree — a1–a8 concatenated into b1–b4, b1–b4 into c1–c2, and c1–c2 into d1]
Schedule Sub-graph Instances Concatenate Instance Outputs July 12, 2023, Boston, MA, USA 11 / 21
Concatenate the Outputs of Sub-graph Instances
Insert additional ops, e.g., copy and redistribute, to move data across
the memory hierarchy when the binding strategies and memory scopes of
a concatenation op's operands differ from each other.
How the approach is generalized to handle a sub-graph with multiple
output tensors and other cases is discussed in the paper.
Loop Fusion within Layers
Generate one kernel for a sub-graph instance by expanding it as a sequence of layers:
[Figure: conv block (layer #1: conv, batchnorm, ReLU) followed by identity blocks (layers #2–#5); inputs are read from LB buffer b3 and the output is written back]
Generate Kernels for Sub-graph Instances Loop Fusion within Layers July 12, 2023, Boston, MA, USA 12 / 21
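As a minimal illustration of fusing a layer's ops into one loop (so intermediates never leave the local buffer), the sketch below fuses a batchnorm and a ReLU into a single elementwise trip. Shapes, op choice, and scalar formulation are illustrative assumptions, not the generated kernel.

```python
# Hedged sketch of intra-layer loop fusion: batchnorm and ReLU computed in
# the same loop iteration, so the batchnorm result is never materialized in
# slower memory. All parameters here are assumed scalars for brevity.

def fused_bn_relu(xs, gamma, beta, mean, var, eps=1e-5):
    out = []
    for x in xs:                                              # one loop over elements...
        y = gamma * (x - mean) / (var + eps) ** 0.5 + beta    # batchnorm
        out.append(max(y, 0.0))                               # ...ReLU fused into the same trip
    return out

print(fused_bn_relu([1.0, -2.0], gamma=1.0, beta=0.0, mean=0.0, var=1.0))
```

Without fusion, the batchnorm loop would write its full output before the ReLU loop reads it back; fusing removes that round trip.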
Buffer Stitching across Layers/Blocks
Keep the outputs of a layer, e.g., res_l1, in LB instead of spilling
them to slower global memory.
Consider both compute- and memory-intensive ops.
Generate Kernels for Sub-graph Instances Buffer Stitching across Layers/Blocks July 12, 2023, Boston, MA, USA 13 / 21
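The stitching decision above might be sketched as a simple capacity check. The 64 KB figure matches the LB size quoted in the setup slide later; the function and scope names are assumptions.

```python
# Hedged sketch of buffer stitching: keep a layer output (e.g., res_l1) in
# the local buffer (LB) when it fits, so the next layer reads it on-chip;
# otherwise spill to global memory (GB). Decision rule is an assumption.

LB_CAPACITY = 64 * 1024  # bytes; LB size from the setup slide

def place_output(tensor_bytes, lb_used):
    """Choose the memory scope for a layer output."""
    if lb_used + tensor_bytes <= LB_CAPACITY:
        return "LB"       # stitch: next layer reads it on-chip
    return "GB"           # spill: fall back to slower global memory

print(place_output(16 * 1024, lb_used=32 * 1024))  # LB
print(place_output(48 * 1024, lb_used=32 * 1024))  # GB
```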
Memory Allocation and Reuse
Release the space consumed by an output tensor as early as possible.
Generate Kernels for Sub-graph Instances Memory Allocation and Reuse July 12, 2023, Boston, MA, USA 14 / 21
The space with the longest liveness across multiple computation tasks
is spilled first in case LB cannot hold all tensors.
[Figure: liveness of tensors in1, ct2, ct3, and ct5 across computation tasks]
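The two rules on this slide — free a tensor right after its last use, and spill the longest-lived tensor when LB is full — can be sketched as below. The task/tensor representation is assumed; the `ct*`/`in1` names echo the figure.

```python
# Hedged sketch of the slide's two allocation rules, with an assumed
# representation: tasks = list of (output_tensor, input_tensors).

def free_after_last_use(tasks):
    """Rule 1: for each task index, list the tensors whose LB space can be
    released immediately after that task runs (their last use)."""
    last = {}
    for i, (_, inputs) in enumerate(tasks):
        for t in inputs:
            last[t] = i                  # later uses overwrite earlier ones
    frees = [[] for _ in tasks]
    for t, i in last.items():
        frees[i].append(t)
    return frees

def spill_victim(live, last_use):
    """Rule 2: when LB is full, spill the live tensor whose last use lies
    furthest in the future (the longest remaining liveness)."""
    return max(live, key=lambda t: last_use[t])

tasks = [("ct2", ["in1"]), ("ct3", ["ct2"]), ("ct5", ["in1", "ct3"])]
print(free_after_last_use(tasks))   # [[], ['ct2'], ['in1', 'ct3']]
print(spill_victim(["in1", "ct2"], {"in1": 2, "ct2": 1}))  # in1
```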
Across-layer Instruction Scheduling
Weight tensors can be promoted as early as possible.
Generate Kernels for Sub-graph Instances Across-layer Instruction Scheduling July 12, 2023, Boston, MA, USA 15 / 21
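Promoting weight loads can be sketched as a reordering pass: issue layer i+1's weight load before layer i's compute so the two overlap. The instruction encoding below is an assumption for illustration, not the accelerator's ISA.

```python
# Hedged sketch of across-layer instruction scheduling: hoist (promote) each
# layer's weight-tensor load as early as possible, so the load for layer i+1
# overlaps with the compute of layer i. Instruction names are assumed.

def promote_weight_loads(program):
    """program: list of ('load_w', layer) and ('compute', layer) entries,
    originally interleaved per layer. Returns the reordered program."""
    loads = [ins for ins in program if ins[0] == "load_w"]
    computes = [ins for ins in program if ins[0] == "compute"]
    out = [loads[0]]                    # the first load cannot be overlapped
    for i, comp in enumerate(computes):
        if i + 1 < len(loads):
            out.append(loads[i + 1])    # issue the next layer's load first...
        out.append(comp)                # ...then compute; the two can overlap
    return out

prog = [("load_w", 1), ("compute", 1), ("load_w", 2), ("compute", 2)]
print(promote_weight_loads(prog))
# [('load_w', 1), ('load_w', 2), ('compute', 1), ('compute', 2)]
```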
[1] Rongkai Zhan et al. "NeuralScale: A RISC-V Based Neural Processor Boosting AI Inference in Clouds". In: Fifth Workshop on Computer Architecture Research with RISC-V (CARRV). Virtual, 2021.
Experimental Results Environments and Setup July 12, 2023, Boston, MA, USA 16 / 21
Environments and Setup
The experimental platform is STCP920 [1]:
d ← 4; c ← 8; u ← 3
LB ← 64 KB L1
GB ← 8 MB last local buffer (LLB)
CU1 ← vector core; CU2 ← VME; CU3 ← MME
Experimental Results Performance Comparison July 12, 2023, Boston, MA, USA 17 / 21
Performance Comparison
We report performance using the optimal number of batches per cluster.
How these optimal numbers are selected is discussed in the paper.
Experimental Results Performance Breakdown July 12, 2023, Boston, MA, USA 18 / 21
Performance Breakdown
Evaluate how different factors of GraphTurbo contribute to the
overall speedup using four variants:
Experimental Results Hardware Utilization July 12, 2023, Boston, MA, USA 19 / 21
Hardware Utilization
We report how frequently each memory level is accessed.
We acknowledge the TVM community led by Tianqi Chen, without whose work this paper
would not have been possible.