
Effectively Scheduling Computational Graphs of Deep Neural Networks toward Their Domain-Specific Accelerators

Jie Zhao (1), Siyuan Feng (2), Xiaoqiang Dan (3), Fei Liu (3), Chengke Wang (3), Sheng Yuan (3), Wenyuan Lv (3), Qikai Xie (3)

(1) Information Engineering University, Zhengzhou
(2) Shanghai Jiao Tong University, Shanghai
(3) Stream Computing Inc., Hangzhou

The 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI'23)
July 12, 2023, Boston, MA, USA


Outline

1 Introduction

2 Overview

3 Schedule Sub-graph Instances

4 Generate Kernels for Sub-graph Instances

5 Experimental Results

6 Conclusion




A Deep Neural Network (DNN) / DSA Abstraction

Moore's Law ↓   Domain-specific Architecture (DSA) ↑

A DSA abstraction has formed after several years of investigations:

Goya:   d ← 1; c ← 9;    u ← 1; LB ← Local Memory or N/A; GB ← Shared Memory;  CU1 ← GEMM engine/TPC
Ascend: d ← 1; c ← 8;    u ← 3; LB ← Unified/L1 Buffer;   GB ← on-chip Buffer; CU1 ← scalar unit; CU2 ← vector unit; CU3 ← cube unit
IPU:    d ← 2; c ← 1216; u ← 1; LB ← Local Memory;        GB ← N/A;            CU1 ← core

Scheduling DNNs for this DSA abstraction is thus important!

But existing approaches cannot fully exploit its computing power...
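The three rows above are instances of one abstraction. Below is a minimal Python sketch of it, assuming d counts dies/clusters, c the cores per die, and u the compute-unit kinds per core, with a per-core local buffer (LB) and a shared global buffer (GB); the field names and meanings are an interpretation of the slide, not GraphTurbo's data structures.

```python
# A sketch only: the field meanings (d = dies/clusters, c = cores per die,
# u = compute-unit kinds per core) are an interpretation of the slide, not
# GraphTurbo's data structures.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DSAAbstraction:
    d: int                        # dies / clusters
    c: int                        # cores per die
    u: int                        # compute-unit kinds per core
    LB: Optional[str]             # local buffer backing each core (None if absent)
    GB: Optional[str]             # global buffer shared across cores (None if absent)
    CUs: List[str] = field(default_factory=list)

goya   = DSAAbstraction(1, 9, 1, "Local Memory", "Shared Memory", ["GEMM engine/TPC"])
ascend = DSAAbstraction(1, 8, 3, "Unified/L1 Buffer", "on-chip Buffer",
                        ["scalar unit", "vector unit", "cube unit"])
ipu    = DSAAbstraction(2, 1216, 1, "Local Memory", None, ["core"])
```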


Limitations of Prior Work

layer: nodes connected in a straight line, with at most one containing parameters learned using gradients of loss.
block: a layer or a group of layers used recursively.
stage: a logical, high-level abstraction used in a computational graph.

Prior work groups nodes while obscuring the hardware architecture, producing more kernels and requiring more in-between, off-core data movements;
Grouping nodes within a layer generates fine-grained sub-graphs, missing across-layer instruction scheduling opportunities;
Prior work did not expose/exploit the imbalanced memory usage distribution [1], under-utilizing the faster local memory.

[1] Ji Lin et al. "Memory-efficient Patch-based Inference for Tiny Deep Learning". NeurIPS, vol. 34, 2021, pp. 1–13.
Our Solution

Construct coarser-grained sub-graphs, generating larger kernels and converting data movements from off-core to on-core;
Sub-graphs should cover layers or blocks, better hiding memory latency and exploiting the parallelism across CUs;
Consider the internal relations between coarser-grained sub-graphs, better utilizing the faster local memory.

These solutions form our new scheduler for DSAs – GraphTurbo.
Core Idea of GraphTurbo

Maximally preserve the input tensors in LB to convert as many off-core data movements as possible into on-core data exchanges.

stage1 = n×  1  2  4  5  8  9 11 12
stage2 = n×  3  3  6  6 10 10 13 13
stage3 = n×  7  7  7  7 14 14 14 14
stage4 = n× 15 15 15 15 15 15 15 15

[Legend: each cell is a tensor slot storing one image of the batch in the LB of the DSA abstraction; the number gives the step at which that slot is processed by the stage with the same color.]

Each cluster processes 8 images; each stage reduces an image by half.

Construct a larger sub-graph for each stage.
Split each sub-graph into 8, 4, 2, and 1 instance(s), respectively.
Schedule sub-graph instances in the order given by the step numbers above (a sketch follows below).
Saturate LB while exploiting the parallelism across cores.
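This interleaving can be reproduced with a small recursive scheduler. The sketch below is illustrative only (not GraphTurbo's code): it assumes each non-leaf instance consumes the outputs of exactly two instances of the previous stage, which is what the 8/4/2/1 split above implies, and it prints the step numbers shown in the figure.

```python
# Illustrative sketch only (not GraphTurbo's code): reproduce the step
# numbers in the figure above, assuming each instance of stage k consumes
# the outputs of exactly two stage-(k-1) instances, as the 8/4/2/1 split
# implies, so every consumer runs as soon as its producers finish and
# their outputs can stay in LB.
def interleaved_schedule(num_leaf_instances=8, num_stages=4):
    step = 0
    order = []                                  # (stage, instance_id, step)

    def run(stage, idx):
        nonlocal step
        step += 1
        order.append((stage, idx, step))

    def schedule(stage, idx, width):
        # width = how many stage-1 instances ultimately feed this instance
        if width == 1:
            run(stage, idx)                     # a stage-1 instance
            return
        schedule(stage - 1, 2 * idx, width // 2)       # first producer
        schedule(stage - 1, 2 * idx + 1, width // 2)   # second producer
        run(stage, idx)                         # consumer runs right after them

    schedule(num_stages, 0, num_leaf_instances)
    return order

for stage, idx, step in interleaved_schedule():
    print(f"stage{stage} instance {idx} runs at step {step}")
# stage-1 instances run at steps 1 2 4 5 8 9 11 12, stage-2 at 3 6 10 13,
# stage-3 at 7 14, and the single stage-4 instance at step 15.
```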
Collect Splitting Information

Collect hardware information for constructing larger sub-graphs.
SplitInfo includes the split loop dimension, factor, etc.
Each sub-graph SG is initialized by an op.
Each op includes only one output tensor and multiple input tensors.
Compute SplitInfo for the output and propagate it to the inputs.
Define three metrics, and use
    lexmax_{∀ d ∈ SplitInfo} (n_d, −f_d, −d)
to select the loop dimension of a tensor to be split (a sketch of this selection follows below).
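A minimal sketch of that lexicographic selection, assuming each candidate loop dimension d carries the three metrics named on the slide: n_d (read here as the number of pieces splitting d yields), f_d (its split factor), and the index d itself; the exact metric definitions are in the paper, and the candidate values below are hypothetical.

```python
# Sketch of lexmax_{d in SplitInfo} (n_d, -f_d, -d); metric meanings are an
# assumption, and the candidate values are hypothetical.
from typing import List, NamedTuple

class SplitCandidate(NamedTuple):
    d: int   # loop dimension index
    n: int   # metric n_d
    f: int   # metric f_d (split factor)

def select_split_dim(split_info: List[SplitCandidate]) -> SplitCandidate:
    # prefer a larger n_d, then a smaller split factor, then the
    # lower-indexed (outer) loop dimension
    return max(split_info, key=lambda c: (c.n, -c.f, -c.d))

print(select_split_dim([SplitCandidate(d=0, n=8, f=2),
                        SplitCandidate(d=1, n=8, f=4),
                        SplitCandidate(d=3, n=4, f=2)]))
# -> SplitCandidate(d=0, n=8, f=2)
```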
Group Sub-graphs with the Aid of SplitInfo

Sort a graph G in topological order, each node denoting an SG.
Group SGs by repeatedly considering three patterns:

[Figure: the three grouping patterns, drawn over sub-graphs SG1–SG4.]
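As a rough illustration only: the sketch below walks the sub-graphs in topological order and folds a sub-graph into its unique consumer, which corresponds to the simplest, chain-shaped of the three patterns; the other two patterns and the SplitInfo compatibility checks are described in the paper.

```python
# Rough illustration only: fold a sub-graph into its unique consumer while
# walking a topological order; the remaining patterns and SplitInfo checks
# are not modeled here.
import networkx as nx

def group_chains(g: nx.DiGraph) -> nx.DiGraph:
    for sg in list(nx.topological_sort(g)):
        consumers = list(g.successors(sg))
        # pattern SG1 -> SG2: sg has one consumer, and that consumer has no
        # other producer, so the pair can become one coarser sub-graph
        if len(consumers) == 1 and g.in_degree(consumers[0]) == 1:
            g = nx.contracted_nodes(g, consumers[0], sg, self_loops=False)
    return g

# Hypothetical usage on a four-node chain SG1 -> SG2 -> SG3 -> SG4:
chain = nx.DiGraph([("SG1", "SG2"), ("SG2", "SG3"), ("SG3", "SG4")])
print(group_chains(chain).nodes())   # collapses into a single grouped node
```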
Order Sub-graph Instances

a1 a2 a3 a4 a5 a6 a7 a8
  b1    b2    b3    b4
     c1          c2
           d1

GraphTurbo uses either a BFS heuristic to order these sub-graph instances:
  a1 a2 a3 a4 a5 a6 a7 a8 b1 b2 b3 b4 c1 c2 d1
or a DFS heuristic, which simplifies the algorithmic design:
  a8 a7 b4 a6 a5 b3 c2 a4 a3 b2 a2 a1 b1 c1 d1
An ILP-based heuristic is under construction and will be released soon.
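Both orders can be reproduced on the instance tree above. The sketch below hard-codes that tree (names taken from the slide) and is illustrative only; the real pass works on arbitrary instance graphs.

```python
# Sketch reproducing the two orders above on the instance tree from the
# slide (hard-coded names); the real pass works on arbitrary instance DAGs.
producers = {
    "d1": ["c1", "c2"],
    "c1": ["b1", "b2"], "c2": ["b3", "b4"],
    "b1": ["a1", "a2"], "b2": ["a3", "a4"],
    "b3": ["a5", "a6"], "b4": ["a7", "a8"],
}

def bfs_order(root="d1"):
    # level by level from the leaves: all a's, then b's, then c's, then d1
    levels, frontier = [], [root]
    while frontier:
        levels.append(frontier)
        frontier = [p for inst in frontier for p in producers.get(inst, [])]
    return [inst for level in reversed(levels) for inst in sorted(level)]

def dfs_order(inst="d1"):
    # post-order DFS visiting producers right-to-left, so each consumer runs
    # as soon as its inputs are ready and LB pressure stays low
    order = []
    for p in reversed(producers.get(inst, [])):
        order += dfs_order(p)
    return order + [inst]

print(" ".join(bfs_order()))  # a1 a2 a3 a4 a5 a6 a7 a8 b1 b2 b3 b4 c1 c2 d1
print(" ".join(dfs_order()))  # a8 a7 b4 a6 a5 b3 c2 a4 a3 b2 a2 a1 b1 c1 d1
```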
Infer Core Binding and Buffer Scopes

Visit the scheduling result of sub-graph instances in reverse order.
Either initialize the binding information using a plain strategy and the buffer scope using LB,
or infer the binding strategy from the output tensor.
A better strategy is selected if both inferred and initialized binding information exist.
Concatenate the Outputs of Sub-graph Instances

Detect fine-grained dependencies between sub-graph instances and introduce a lightweight concatenation op when necessary.

[Figure: instances a8…a1 feed concat ops toward b4…b1, then c2, c1, and d1. For example, one concat input has shape=[2,28,28,512], scope=GB, bind=[2,2] and another has shape=[2,28,28,512], scope=LB, bind=[2,4]; the concatenated tensor has shape=[4,28,28,512], scope=LB, bind=[4,2], and copy(LB, GB) and redistribute([2,2]) ops are inserted.]

Insert additional ops, e.g., copy and redistribute, for moving data across the memory hierarchy if the binding strategies and memory scopes of a concatenation op's operands differ.
How the approach is generalized to handle a sub-graph with multiple output tensors and other cases is discussed in the paper.
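A minimal sketch of that lowering step, assuming a concat expects each input in a given buffer scope and with a given per-input core binding (the [2,2] from the slide), and inserting copy/redistribute ops when an input disagrees; the Tensor record and op tuples are illustrative, not GraphTurbo's IR.

```python
# Illustrative sketch, not GraphTurbo's IR: before a concat, give every
# input whose buffer scope differs from the concat's a copy op and every
# input whose core binding differs a redistribute op; out_bind is the
# per-input binding the concat expects (the [2,2] from the slide).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Tensor:
    shape: List[int]
    scope: str           # "LB" or "GB"
    bind: Tuple[int, int]

def lower_concat(inputs, out_scope="LB", out_bind=(2, 2)):
    ops = []
    for t in inputs:
        if t.scope != out_scope:
            ops.append(("copy", t.scope, out_scope))      # move across the memory hierarchy
        if tuple(t.bind) != tuple(out_bind):
            ops.append(("redistribute", list(out_bind)))  # re-bind across cores
    ops.append(("concat", [t.shape for t in inputs]))
    return ops

# The example from the slide: one input lives in GB with bind [2,2], the
# other in LB with bind [2,4]; the concat result is an LB tensor.
print(lower_concat([Tensor([2, 28, 28, 512], "GB", (2, 2)),
                    Tensor([2, 28, 28, 512], "LB", (2, 4))]))
```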
Loop Fusion within Layers

Generate one kernel for a sub-graph instance by expanding it into its layers:

[Figure: a conv block (layer #1) followed by identity blocks (layers #2–#5); ops such as conv, batchnorm, and ReLU read inputs and write outputs through an LB buffer (b3); layers/blocks are connected by buffer stitching, while ops inside a layer are connected by loop fusion.]

Buffer stitching is performed between components connected by an LB buffer.
Loop fusion is performed between components connected by an op that can be expressed using loop nests of arithmetic operations.
Perform loop fusion within each layer.
Buffer Stitching across Layers/Blocks

Retain the outputs of a layer in LB, e.g., res l1, instead of spilling them to the slower global memory.
Consider both compute- and memory-intensive ops.
Memory Allocation and Reuse

Release the space consumed by an output tensor as early as possible.
The space with the longest liveness across multiple computation tasks is spilled first in case LB cannot hold all tensors (a sketch of both policies follows below).

[Figure: four snapshots of the graph of computation tasks ct1–ct7 and tensors in1, out1–out7, taken after executing ct1, ct4, ct5, and ct7.]
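A minimal sketch of the two policies over a hypothetical task list given as (name, inputs, output) triples with known tensor sizes; the real pass works on the scheduled sub-graph instances and the LB capacity of the target.

```python
# Sketch of the two policies over a hypothetical task list; "tasks" holds
# (name, input_tensors, output_tensor) triples, and the real pass works on
# the scheduled sub-graph instances and the target's LB capacity.
def plan_lb(tasks, tensor_size, lb_capacity):
    last_use = {}
    for step, (_, ins, _) in enumerate(tasks):
        for t in ins:
            last_use[t] = step                   # last step that reads each tensor
    resident, spills = set(), []
    for step, (name, ins, out) in enumerate(tasks):
        resident.add(out)
        # spill first the tensor whose liveness extends furthest into the
        # future whenever LB cannot hold all resident tensors
        while sum(tensor_size[t] for t in resident) > lb_capacity:
            candidates = resident - {out}
            if not candidates:
                break
            victim = max(candidates, key=lambda t: last_use.get(t, step))
            resident.discard(victim)
            spills.append((name, victim))
        # release space as early as possible: right after a tensor's last use
        resident -= {t for t in ins if last_use.get(t) == step}
    return spills

# Hypothetical usage:
print(plan_lb([("ct1", ["in1"], "out1"), ("ct2", ["out1"], "out2")],
              {"in1": 4, "out1": 4, "out2": 4}, lb_capacity=8))   # -> []
```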
Across-layer Instruction Scheduling

Weight tensors can be promoted as early as possible.
The latency of these promotion statements is hidden behind computation tasks (a sketch follows below).

[Figure: DMA hoisting issues weight-promotion statements ahead of dispatching, so their latency is hidden behind a layer's computation.]

Enable across-layer memory latency hiding.
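A minimal sketch of DMA hoisting under a strong simplification: the kernel is a flat list of ("load_weight", layer) and ("compute", layer) instructions, and each layer's weight load is issued one layer early so its latency can overlap the preceding computation.

```python
# Sketch of DMA hoisting under a strong simplification: a kernel is a flat
# list of ("load_weight", layer) and ("compute", layer) instructions, and
# each layer's weight load is issued one layer early so its DMA latency can
# overlap the preceding layer's computation.
def hoist_weight_loads(instrs):
    loads = [i for i in instrs if i[0] == "load_weight"]
    computes = [i for i in instrs if i[0] == "compute"]
    out = []
    if loads:
        out.append(loads[0])                 # the first load cannot be hidden
    for k, comp in enumerate(computes):
        if k + 1 < len(loads):
            out.append(loads[k + 1])         # issue the next layer's DMA early...
        out.append(comp)                     # ...so it overlaps this computation
    return out

prog = [("load_weight", 1), ("compute", 1),
        ("load_weight", 2), ("compute", 2),
        ("load_weight", 3), ("compute", 3)]
print(hoist_weight_loads(prog))
# [('load_weight', 1), ('load_weight', 2), ('compute', 1),
#  ('load_weight', 3), ('compute', 2), ('compute', 3)]
```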
Environments and Setup

The experiment platform is STCP920 [1]:
  d ← 4; c ← 8; u ← 3; LB ← 64 KB L1; GB ← 8 MB last local buffer (LLB);
  CU1 ← vector core; CU2 ← VME; CU3 ← MME

DNN models: ResNet-50 v1.5, BERT, DLRM, MobileNet v2, Vision Transformer, DenseNet, Conformer
DNN frameworks: PyTorch v1.8.1 for DLRM, and TensorFlow v1.13 for all others
Compare with TVM, AStitch, and a vendor-crafted implementation

[1] Rongkai Zhan et al. "NeuralScale: A RISC-V Based Neural Processor Boosting AI Inference in Clouds". Fifth Workshop on Computer Architecture Research with RISC-V (CARRV). Virtual, 2021.
Performance Comparison

We report the performance by selecting the optimal number of batches per cluster.
How these optimal numbers are selected is discussed in the paper.

TVM fuses ops within a sub-graph, producing kernels that exchange data via DDR.
AStitch neither orders sub-graph instances nor considers compute-intensive ops.
On average, GraphTurbo outperforms TVM by 11.15×, AStitch by 6.16×, and the vendor-crafted implementation by 1.04×.
Compilation overhead of different approaches is reported in the paper.
Performance Breakdown

Evaluate how different factors of GraphTurbo contribute to the overall speedup using four variants:

Variant 1: maximally keeps outputs in LLB.
Variant 2: maximally keeps outputs in L1; outperforms Variant 1 by 3.67× (demonstrating the importance of utilizing L1, i.e., the LB of the DSA abstraction).
Variant 3: Variant 2 + sub-graph instance scheduling; outperforms Variant 1 by 2.20×.
Variant 4: Variant 3 + across-layer instruction scheduling; outperforms Variant 1 by 1.72×.
Hardware Utilization

We report the access frequencies of each memory level.
We also report how VME and MME are utilized.
The scalability to GPUs is demonstrated in the paper using ResNet18-Tailor, which outperforms the CUTLASS implementations with and without convolution fusion by 1.06× and 1.23×, respectively.
Contributions

+ We recognize the importance of considering the hardware architecture at the graph partitioning level, enabling a synergy between network and hardware architectures.
+ This synergy reduces off-core data movements, better saturates the valuable local memory, and empowers across-layer instruction scheduling.
+ We design and implement a novel scheduling approach, GraphTurbo, addressing the deployment of DNNs on DSA chips and offering insights to other platforms.
+ The experimental results demonstrate that GraphTurbo outperforms two state-of-the-art tools and achieves performance comparable to the vendor-crafted code.


Questions & Answers

We acknowledge the TVM community led by Tianqi Chen, without whose work this paper would be impossible.
