
Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators

Jie Zhao (1), Siyuan Feng (2), Xiaoqiang Dan (3), Fei Liu (3), Chengke Wang (3), Sheng Yuan (3), Wenyuan Lv (3), Qikai Xie (3)

(1) Information Engineering University, Zhengzhou
(2) Shanghai Jiao Tong University, Shanghai
(3) Stream Computing Inc., Hangzhou

The 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI'23)
July 12, 2023, Boston, MA, USA


Outline

1 Introduction

2 Overview

3 Schedule Sub-graph Instances

4 Generate Kernels for Sub-graph Instances

5 Experimental Results

6 Conclusion




A Deep Neural Network (DNN) / DSA Abstraction

Moore's Law ↓   Domain-specific Architecture (DSA) ↑

A DSA abstraction has formed after several years of investigations:

Goya:   d ← 1; c ← 9;    u ← 1; LB ← Local Memory or N/A; GB ← Shared Memory;  CU1 ← GEMM engine/TPC
Ascend: d ← 1; c ← 8;    u ← 3; LB ← Unified/L1 Buffer;   GB ← on-chip Buffer; CU1 ← scalar unit; CU2 ← vector unit; CU3 ← cube unit
IPU:    d ← 2; c ← 1216; u ← 1; LB ← Local Memory;        GB ← N/A;            CU1 ← core

Scheduling DNNs for this DSA abstraction is thus important!

But existing approaches cannot fully exploit its computing power...
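The three rows above are instances of one abstraction. Below is a minimal Python sketch of it, assuming d counts dies/clusters, c the cores per die, and u the compute-unit kinds per core, with a per-core local buffer (LB) and a shared global buffer (GB); the field names and meanings are an interpretation of the slide, not GraphTurbo's data structures.

```python
# A sketch only: the field meanings (d = dies/clusters, c = cores per die,
# u = compute-unit kinds per core) are an interpretation of the slide, not
# GraphTurbo's data structures.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DSAAbstraction:
    d: int                        # dies / clusters
    c: int                        # cores per die
    u: int                        # compute-unit kinds per core
    LB: Optional[str]             # local buffer backing each core (None if absent)
    GB: Optional[str]             # global buffer shared across cores (None if absent)
    CUs: List[str] = field(default_factory=list)

goya   = DSAAbstraction(1, 9, 1, "Local Memory", "Shared Memory", ["GEMM engine/TPC"])
ascend = DSAAbstraction(1, 8, 3, "Unified/L1 Buffer", "on-chip Buffer",
                        ["scalar unit", "vector unit", "cube unit"])
ipu    = DSAAbstraction(2, 1216, 1, "Local Memory", None, ["core"])
```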


Limitations of Prior Work

layer: nodes connected in a straight line, with at most one containing parameters learned using gradients of loss.
block: a layer or a group of layers used recursively.
stage: a logical, high-level abstraction used in a computational graph.

Prior work groups nodes while obscuring the hardware architecture, producing more kernels and requiring more in-between, off-core data movements;
Grouping nodes within a layer generates fine-grained sub-graphs, missing across-layer instruction scheduling opportunities;
Prior work did not expose/exploit the imbalanced memory usage distribution [1], under-utilizing the faster local memory.

[1] Ji Lin et al. "Memory-efficient Patch-based Inference for Tiny Deep Learning". NeurIPS, vol. 34, 2021, pp. 1–13.
Our Solution

Construct coarser-grained sub-graphs, generating larger kernels and converting data movements from off-core to on-core;
Sub-graphs should cover layers or blocks, better hiding memory latency and exploiting the parallelism across CUs;
Consider the internal relations between coarser-grained sub-graphs, better utilizing the faster local memory.

These solutions form our new scheduler for DSAs – GraphTurbo.
Core Idea of GraphTurbo

Maximally preserve the input tensors in LB to convert as many off-core data movements as possible into on-core data exchanges.

stage1 = n×  1  2  4  5  8  9 11 12
stage2 = n×  3  3  6  6 10 10 13 13
stage3 = n×  7  7  7  7 14 14 14 14
stage4 = n× 15 15 15 15 15 15 15 15

[Legend: each cell is a tensor slot storing one image of the batch in the LB of the DSA abstraction; the number gives the step at which that slot is processed by the stage with the same color.]

Each cluster processes 8 images; each stage reduces an image by half.

Construct a larger sub-graph for each stage.
Split each sub-graph into 8, 4, 2, and 1 instance(s), respectively.
Schedule sub-graph instances in the order given by the step numbers above (a sketch follows below).
Saturate LB while exploiting the parallelism across cores.
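This interleaving can be reproduced with a small recursive scheduler. The sketch below is illustrative only (not GraphTurbo's code): it assumes each non-leaf instance consumes the outputs of exactly two instances of the previous stage, which is what the 8/4/2/1 split above implies, and it prints the step numbers shown in the figure.

```python
# Illustrative sketch only (not GraphTurbo's code): reproduce the step
# numbers in the figure above, assuming each instance of stage k consumes
# the outputs of exactly two stage-(k-1) instances, as the 8/4/2/1 split
# implies, so every consumer runs as soon as its producers finish and
# their outputs can stay in LB.
def interleaved_schedule(num_leaf_instances=8, num_stages=4):
    step = 0
    order = []                                  # (stage, instance_id, step)

    def run(stage, idx):
        nonlocal step
        step += 1
        order.append((stage, idx, step))

    def schedule(stage, idx, width):
        # width = how many stage-1 instances ultimately feed this instance
        if width == 1:
            run(stage, idx)                     # a stage-1 instance
            return
        schedule(stage - 1, 2 * idx, width // 2)       # first producer
        schedule(stage - 1, 2 * idx + 1, width // 2)   # second producer
        run(stage, idx)                         # consumer runs right after them

    schedule(num_stages, 0, num_leaf_instances)
    return order

for stage, idx, step in interleaved_schedule():
    print(f"stage{stage} instance {idx} runs at step {step}")
# stage-1 instances run at steps 1 2 4 5 8 9 11 12, stage-2 at 3 6 10 13,
# stage-3 at 7 14, and the single stage-4 instance at step 15.
```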
Collect Splitting Information

Collect hardware information for constructing larger sub-graphs.
SplitInfo includes the split loop dimension, factor, etc.
Each sub-graph SG is initialized by an op.
Each op includes only one output tensor and multiple input tensors.
Compute SplitInfo for the output and propagate it to the inputs.
Define three metrics, and use
    lexmax_{∀ d ∈ SplitInfo} (n_d, −f_d, −d)
to select the loop dimension of a tensor to be split (a sketch of this selection follows below).
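A minimal sketch of that lexicographic selection, assuming each candidate loop dimension d carries the three metrics named on the slide: n_d (read here as the number of pieces splitting d yields), f_d (its split factor), and the index d itself; the exact metric definitions are in the paper, and the candidate values below are hypothetical.

```python
# Sketch of lexmax_{d in SplitInfo} (n_d, -f_d, -d); metric meanings are an
# assumption, and the candidate values are hypothetical.
from typing import List, NamedTuple

class SplitCandidate(NamedTuple):
    d: int   # loop dimension index
    n: int   # metric n_d
    f: int   # metric f_d (split factor)

def select_split_dim(split_info: List[SplitCandidate]) -> SplitCandidate:
    # prefer a larger n_d, then a smaller split factor, then the
    # lower-indexed (outer) loop dimension
    return max(split_info, key=lambda c: (c.n, -c.f, -c.d))

print(select_split_dim([SplitCandidate(d=0, n=8, f=2),
                        SplitCandidate(d=1, n=8, f=4),
                        SplitCandidate(d=3, n=4, f=2)]))
# -> SplitCandidate(d=0, n=8, f=2)
```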
Group Sub-graphs with the Aid of SplitInfo

Sort a graph G in topological order, each node denoting an SG.
Group SGs by repeatedly considering three patterns:

[Figure: the three grouping patterns, drawn over sub-graphs SG1–SG4.]
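As a rough illustration only: the sketch below walks the sub-graphs in topological order and folds a sub-graph into its unique consumer, which corresponds to the simplest, chain-shaped of the three patterns; the other two patterns and the SplitInfo compatibility checks are described in the paper.

```python
# Rough illustration only: fold a sub-graph into its unique consumer while
# walking a topological order; the remaining patterns and SplitInfo checks
# are not modeled here.
import networkx as nx

def group_chains(g: nx.DiGraph) -> nx.DiGraph:
    for sg in list(nx.topological_sort(g)):
        consumers = list(g.successors(sg))
        # pattern SG1 -> SG2: sg has one consumer, and that consumer has no
        # other producer, so the pair can become one coarser sub-graph
        if len(consumers) == 1 and g.in_degree(consumers[0]) == 1:
            g = nx.contracted_nodes(g, consumers[0], sg, self_loops=False)
    return g

# Hypothetical usage on a four-node chain SG1 -> SG2 -> SG3 -> SG4:
chain = nx.DiGraph([("SG1", "SG2"), ("SG2", "SG3"), ("SG3", "SG4")])
print(group_chains(chain).nodes())   # collapses into a single grouped node
```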
Order Sub-graph Instances

a1 a2 a3 a4 a5 a6 a7 a8
  b1    b2    b3    b4
     c1          c2
           d1

GraphTurbo uses either a BFS heuristic to order these sub-graph instances:
  a1 a2 a3 a4 a5 a6 a7 a8 b1 b2 b3 b4 c1 c2 d1
or a DFS heuristic, which simplifies the algorithmic design:
  a8 a7 b4 a6 a5 b3 c2 a4 a3 b2 a2 a1 b1 c1 d1
An ILP-based heuristic is under construction and will be released soon.
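Both orders can be reproduced on the instance tree above. The sketch below hard-codes that tree (names taken from the slide) and is illustrative only; the real pass works on arbitrary instance graphs.

```python
# Sketch reproducing the two orders above on the instance tree from the
# slide (hard-coded names); the real pass works on arbitrary instance DAGs.
producers = {
    "d1": ["c1", "c2"],
    "c1": ["b1", "b2"], "c2": ["b3", "b4"],
    "b1": ["a1", "a2"], "b2": ["a3", "a4"],
    "b3": ["a5", "a6"], "b4": ["a7", "a8"],
}

def bfs_order(root="d1"):
    # level by level from the leaves: all a's, then b's, then c's, then d1
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [p for inst in frontier for p in producers.get(inst, [])]
    return [inst for level in reversed(levels) for inst in sorted(level)]

def dfs_order(inst="d1"):
    # post-order DFS visiting producers right-to-left, so each consumer runs
    # as soon as its inputs are ready and LB pressure stays low
    order = []
    for p in reversed(producers.get(inst, [])):
        order += dfs_order(p)
    return order + [inst]

print(" ".join(bfs_order()))  # a1 a2 a3 a4 a5 a6 a7 a8 b1 b2 b3 b4 c1 c2 d1
print(" ".join(dfs_order()))  # a8 a7 b4 a6 a5 b3 c2 a4 a3 b2 a2 a1 b1 c1 d1
```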
Infer Core Binding and Buffer Scopes

Visit the scheduling result of sub-graph instances in reverse order.
Either initialize the binding information using a plain strategy and the buffer scope using LB,
or infer the binding strategy from the output tensor.
A better strategy is selected if both inferred and initialized binding information exist.
Concatenate the Outputs of Sub-graph Instances

Detect fine-grained dependencies between sub-graph instances and introduce a lightweight concatenation op when necessary.

[Figure: instances a8…a1 feed concat ops toward b4…b1, then c2, c1, and d1. For example, one concat input has shape=[2,28,28,512], scope=GB, bind=[2,2] and another has shape=[2,28,28,512], scope=LB, bind=[2,4]; the concatenated tensor has shape=[4,28,28,512], scope=LB, bind=[4,2], and copy(LB, GB) and redistribute([2,2]) ops are inserted.]

Insert additional ops, e.g., copy and redistribute, for moving data across the memory hierarchy if the binding strategies and memory scopes of a concatenation op's operands differ.
How the approach is generalized to handle a sub-graph with multiple output tensors and other cases is discussed in the paper.
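A minimal sketch of that lowering step, assuming a concat expects each input in a given buffer scope and with a given per-input core binding (the [2,2] from the slide), and inserting copy/redistribute ops when an input disagrees; the Tensor record and op tuples are illustrative, not GraphTurbo's IR.

```python
# Illustrative sketch, not GraphTurbo's IR: before a concat, give every
# input whose buffer scope differs from the concat's a copy op and every
# input whose core binding differs a redistribute op; out_bind is the
# per-input binding the concat expects (the [2,2] from the slide).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tensor:
    shape: List[int]
    scope: str           # "LB" or "GB"
    bind: Tuple[int, int]

def lower_concat(inputs, out_scope="LB", out_bind=(2, 2)):
    ops = []
    for t in inputs:
        if t.scope != out_scope:
            ops.append(("copy", t.scope, out_scope))      # move across the memory hierarchy
        if tuple(t.bind) != tuple(out_bind):
            ops.append(("redistribute", list(out_bind)))  # re-bind across cores
    ops.append(("concat", [t.shape for t in inputs]))
    return ops

# The example from the slide: one input lives in GB with bind [2,2], the
# other in LB with bind [2,4]; the concat result is an LB tensor.
print(lower_concat([Tensor([2, 28, 28, 512], "GB", (2, 2)),
                    Tensor([2, 28, 28, 512], "LB", (2, 4))]))
```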
Loop Fusion within Layers

Generate one kernel for a sub-graph instance by expanding it into its layers:

[Figure: a conv block (layer #1) followed by identity blocks (layers #2–#5); ops such as conv, batchnorm, and ReLU read inputs and write outputs through an LB buffer (b3); layers/blocks are connected by buffer stitching, while ops inside a layer are connected by loop fusion.]

Buffer stitching is performed between components connected by an LB buffer.
Loop fusion is performed between components connected by an op that can be expressed using loop nests of arithmetic operations.
Perform loop fusion within each layer.
Buffer Stitching across Layers/Blocks

Retain the outputs of a layer in LB, e.g., res l1, instead of spilling them to the slower global memory.
Consider both compute- and memory-intensive ops.
Memory Allocation and Reuse

Release the space consumed by an output tensor as early as possible.
The space with the longest liveness across multiple computation tasks is spilled first in case LB cannot hold all tensors (a sketch of both policies follows below).

[Figure: four snapshots of the graph of computation tasks ct1–ct7 and tensors in1, out1–out7, taken after executing ct1, ct4, ct5, and ct7.]
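A minimal sketch of the two policies over a hypothetical task list given as (name, inputs, output) triples with known tensor sizes; the real pass works on the scheduled sub-graph instances and the LB capacity of the target.

```python
# Sketch of the two policies over a hypothetical task list; "tasks" holds
# (name, input_tensors, output_tensor) triples, and the real pass works on
# the scheduled sub-graph instances and the target's LB capacity.
def plan_lb(tasks, tensor_size, lb_capacity):
    last_use = {}
    for step, (_, ins, _) in enumerate(tasks):
        for t in ins:
            last_use[t] = step                   # last step that reads each tensor
    resident, spills = set(), []
    for step, (name, ins, out) in enumerate(tasks):
        resident.add(out)
        # spill first the tensor whose liveness extends furthest into the
        # future whenever LB cannot hold all resident tensors
        while sum(tensor_size[t] for t in resident) > lb_capacity:
            candidates = resident - {out}
            if not candidates:
                break
            victim = max(candidates, key=lambda t: last_use.get(t, step))
            resident.discard(victim)
            spills.append((name, victim))
        # release space as early as possible: right after a tensor's last use
        resident -= {t for t in ins if last_use.get(t) == step}
    return spills

# Hypothetical usage:
print(plan_lb([("ct1", ["in1"], "out1"), ("ct2", ["out1"], "out2")],
              {"in1": 4, "out1": 4, "out2": 4}, lb_capacity=8))   # -> []
```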
Across-layer Instruction Scheduling

Weight tensors can be promoted as early as possible.
The latency of these promotion statements is hidden behind computation tasks (a sketch follows below).

[Figure: DMA hoisting issues weight-promotion statements ahead of dispatching, so their latency is hidden behind a layer's computation.]

Enable across-layer memory latency hiding.
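A minimal sketch of DMA hoisting under a strong simplification: the kernel is a flat list of ("load_weight", layer) and ("compute", layer) instructions, and each layer's weight load is issued one layer early so its latency can overlap the preceding computation.

```python
# Sketch of DMA hoisting under a strong simplification: a kernel is a flat
# list of ("load_weight", layer) and ("compute", layer) instructions, and
# each layer's weight load is issued one layer early so its DMA latency can
# overlap the preceding layer's computation.
def hoist_weight_loads(instrs):
    loads = [i for i in instrs if i[0] == "load_weight"]
    computes = [i for i in instrs if i[0] == "compute"]
    out = []
    if loads:
        out.append(loads[0])                 # the first load cannot be hidden
    for k, comp in enumerate(computes):
        if k + 1 < len(loads):
            out.append(loads[k + 1])         # issue the next layer's DMA early...
        out.append(comp)                     # ...so it overlaps this computation
    return out

prog = [("load_weight", 1), ("compute", 1),
        ("load_weight", 2), ("compute", 2),
        ("load_weight", 3), ("compute", 3)]
print(hoist_weight_loads(prog))
# [('load_weight', 1), ('load_weight', 2), ('compute', 1),
#  ('load_weight', 3), ('compute', 2), ('compute', 3)]
```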
Environments and Setup

The experiment platform is STCP920 [1]:
  d ← 4; c ← 8; u ← 3; LB ← 64 KB L1; GB ← 8 MB last local buffer (LLB);
  CU1 ← vector core; CU2 ← VME; CU3 ← MME

DNN models: ResNet-50 v1.5, BERT, DLRM, MobileNet v2, Vision Transformer, DenseNet, Conformer
DNN frameworks: PyTorch v1.8.1 for DLRM, and TensorFlow v1.13 for all others
Compare with TVM, AStitch, and a vendor-crafted implementation

[1] Rongkai Zhan et al. "NeuralScale: A RISC-V Based Neural Processor Boosting AI Inference in Clouds". Fifth Workshop on Computer Architecture Research with RISC-V (CARRV). Virtual, 2021.
Performance Comparison

We report the performance by selecting the optimal number of batches per cluster.
How these optimal numbers are selected is discussed in the paper.

TVM fuses ops within a sub-graph, producing kernels that exchange data via DDR.
AStitch neither orders sub-graph instances nor considers compute-intensive ops.
On average, GraphTurbo outperforms TVM by 11.15×, AStitch by 6.16×, and the vendor-crafted implementation by 1.04×.
Compilation overhead of different approaches is reported in the paper.
Performance Breakdown

Evaluate how different factors of GraphTurbo contribute to the overall speedup using four variants:

Variant 1: maximally keeps outputs in LLB.
Variant 2: maximally keeps outputs in L1; outperforms Variant 1 by 3.67× (demonstrating the importance of utilizing L1, i.e., the LB of the DSA abstraction).
Variant 3: Variant 2 + sub-graph instance scheduling; outperforms Variant 1 by 2.20×.
Variant 4: Variant 3 + across-layer instruction scheduling; outperforms Variant 1 by 1.72×.
Hardware Utilization

We report the access frequencies of each memory level.
We also report how VME and MME are utilized.
The scalability to GPUs is demonstrated in the paper using ResNet18-Tailor, which outperforms the CUTLASS implementations with and without convolution fusion by 1.06× and 1.23×, respectively.
Contributions

+ We recognize the importance of considering the hardware architecture at the graph partitioning level, enabling a synergy between network and hardware architectures.
+ This synergy reduces off-core data movements, better saturates the valuable local memory, and empowers across-layer instruction scheduling.
+ We design and implement a novel scheduling approach, GraphTurbo, addressing the deployment of DNNs on DSA chips and offering insights to other platforms.
+ The experimental results demonstrate that GraphTurbo outperforms two state-of-the-art tools and achieves performance comparable to the vendor-crafted code.


Questions & Answers

We acknowledge the TVM community led by Tianqi Chen, without whose work this paper would be impossible.
