
[Title-slide figure: embedded system with Mem, Sensor, μP, and Battery]

E0 294: Systems for Machine Learning

Lecture #5
28th January 2025
Previous Class
▪ Some more CNNs
▪ Techniques to reduce FLOPs
▪ Output Stationary (OS) dataflow

2
Today’s Agenda
▪ Various dataflows for DNN execution
– Output Stationary (an example)
– Weight Stationary
– Input Stationary
▪ Introduction to DNN accelerators

3
2-D Convolution – Output Stationary

4
2-D Convolution – Output Stationary (1)

5
2-D Convolution – Output Stationary (2)

6
2-D Convolution – Output Stationary (3)

7
2-D Convolution – Output Stationary (4)

8
2-D Convolution – Output Stationary (5)

9
2-D Convolution – Output Stationary (6)
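The 2-D figures on these slides are not reproduced in this transcript. As a stand-in, here is a minimal loop-nest sketch of output stationary 2-D convolution (single channel, stride 1; E, K, and the array sizes are illustrative, not from the slides): each output element stays resident in one PE's accumulator until all of its partial sums have been integrated, and is written out exactly once.

#define E 4   /* output size (illustrative) */
#define K 3   /* kernel size (illustrative) */

/* Output stationary 2-D convolution: O[y][x] stays in one PE's
 * accumulator until it is fully computed, then is written once. */
void conv2d_os(const int I[E + K - 1][E + K - 1],
               const int W[K][K], int O[E][E]) {
    for (int y = 0; y < E; y++) {
        for (int x = 0; x < E; x++) {
            int acc = 0;                            /* stationary psum */
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += I[y + ky][x + kx] * W[ky][kx];
            O[y][x] = acc;                          /* single write-out */
        }
    }
}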

10
Different Types of Data Flow
▪ Output Stationary (OS)
▪ Weight Stationary (WS)
▪ Input Stationary (IS)

11
Weight Stationary (WS) Dataflow

▪ Broadcast activations and accumulate partial sums (psums) spatially across the PE array
▪ Weights are stored in the PE array
– Minimizes weight-read energy consumption
– Maximizes convolutional and filter reuse of weights
▪ Weights stationary
– Weights do not move (they stay in the PEs) until all computations that use them are finished
12
1-D Convolution – Weight Stationary

13
Weight Stationary – Reference Pattern
▪ Observations
– A single weight is reused many times (E, where E is the output size)
– Large sliding window of inputs (E)
– Fixed window of outputs (E)

14
Data Access for WS

[Figure: data access patterns compared – Output Stationary vs. Weight Stationary]

15
Relation to WS Architecture

▪ In a WS architecture, the weights stay in each PE – P0:W0, P1:W1, …
▪ Inputs are broadcast to all PEs: Pi: I[0], I[1], I[2], …
▪ During computation, each PE adds its product into the partial sum each cycle and sends it to the next PE – spatial accumulation (see the sketch below)
– O[0] = I[0]W[0] + I[1]W[1] + I[2]W[2] + I[3]W[3]
– O[6] = I[6]W[0] + I[7]W[1] + I[8]W[2] + I[9]W[3]
▪ P0 → P1 → P2 → P3 …
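As a concrete (purely behavioral) model of this, here is a minimal sketch of weight stationary 1-D convolution, matching the slide's example (K = 4 weights, stride 1); the inner loop over p models the spatial psum chain P0 → P1 → P2 → P3, and the function signature is illustrative.

/* Weight stationary 1-D convolution (stride 1). PE p holds W[p]
 * stationary; inputs are broadcast, and the psum travels spatially
 * across the PEs, e.g. O[0] = I[0]W[0] + I[1]W[1] + I[2]W[2] + I[3]W[3]. */
void conv1d_ws(int N, int K, const int I[], const int W[], int O[]) {
    for (int o = 0; o <= N - K; o++) {
        int psum = 0;
        for (int p = 0; p < K; p++)       /* psum hops P0 -> P1 -> ... */
            psum += I[o + p] * W[p];      /* PE p: stationary W[p], broadcast input */
        O[o] = psum;
    }
}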
16
2-D Convolution – Weight Stationary (1)

17
2-D Convolution – Weight Stationary (2)

18
2-D Convolution – Weight Stationary (3)

19
2-D Convolution – Weight Stationary (4)

20
2-D Convolution – Weight Stationary (5)

21
2-D Convolution – Weight Stationary (6)

22
WS Example: NVDLA (simplified)
▪ NVIDIA Deep Learning Accelerator (NVDLA)

23
WS Example: NVDLA (simplified)
▪ NVIDIA Deep Learning Accelerator (NVDLA)

24
Different Types of Data Flow
▪ Output Stationary (OS)
▪ Weight Stationary (WS)
▪ Input Stationary (IS)

25
Input Stationary (IS) Data Flow
▪ Minimize activation-read energy consumption
– Inputs stay in the PEs until they are no longer needed
– Maximize convolutional and feature-map reuse of activations
▪ Unicast weights and accumulate psums spatially across the PE array (see the sketch below)
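A minimal behavioral sketch of input stationary 1-D convolution, under the same illustrative assumptions as the WS sketch earlier (kernel size K, stride 1): each input stays put in its PE while every weight that needs it is unicast to it, and each product is accumulated into the psum of the output it feeds.

/* Input stationary 1-D convolution (stride 1). PE j holds I[j]
 * stationary; weights are unicast to it, and each product is
 * accumulated spatially into the psum of output O[j - k]. */
void conv1d_is(int N, int K, const int I[], const int W[], int O[]) {
    for (int o = 0; o <= N - K; o++) O[o] = 0;  /* clear psums */
    for (int j = 0; j < N; j++) {               /* I[j] stays in PE j */
        for (int k = 0; k < K; k++) {           /* unicast weights */
            int o = j - k;                      /* output this product feeds */
            if (o >= 0 && o <= N - K)
                O[o] += I[j] * W[k];
        }
    }
}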

26
IS Data Flow

27
Cost of Different Dataflows

28
Today’s Agenda
▪ Various dataflows for DNN execution
– Output Stationary (an example)
– Weight Stationary
– Input Stationary
▪ Introduction to DNN accelerators

29
Accelerator: Why/What?
▪ Drawbacks of CPUs
– Limited parallelism
▪ Drawbacks of GPUs
– Limited memory
▪ Accelerators
– Application specific
– Tailored only for DNNs
– Mostly inference
▪ Major performance metrics
– Inference latency, energy, area

30
DianNao (Chinese for "computer") [1]
▪ One of the earliest accelerators for DNN models
▪ Tiling to reduce memory usage
– Classifier layer
– Convolutional layer
– Pooling layer
▪ An architecture tailored for DNN inference

[1] Chen, Tianshi, et al. "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning." ACM SIGARCH Computer Architecture News 42.1 (2014): 269–284.

31
General Structure of a DNN
▪ Sequential layers
▪ Feature maps
▪ Convolutional (with a non-linearity after the output), pooling, and classifier layers
– Sometimes shared kernels, sometimes private kernels
– Pooling: average and max (a sketch follows below)
– The classifier is a single-/multi-layer perceptron
Quiz: what are sx and sy in the figure?
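Since pooling comes up in the quiz, here is a minimal sketch of 2-D max pooling with explicit window and stride parameters; all names are illustrative, and reading sx/sy as the horizontal/vertical strides is one plausible interpretation of the figure's labels, not a confirmed answer.

/* 2-D max pooling over an H x W input. Ky, Kx: window size;
 * sy, sx: vertical/horizontal strides (illustrative names; reading
 * sx/sy as strides is an assumption about the slide's figure). */
void max_pool(int H, int W, const float in[H][W],
              int Ky, int Kx, int sy, int sx, float *out) {
    int oh = (H - Ky) / sy + 1;          /* output height */
    int ow = (W - Kx) / sx + 1;          /* output width  */
    for (int y = 0; y < oh; y++)
        for (int x = 0; x < ow; x++) {
            float m = in[y * sy][x * sx];
            for (int ky = 0; ky < Ky; ky++)
                for (int kx = 0; kx < Kx; kx++) {
                    float v = in[y * sy + ky][x * sx + kx];
                    if (v > m) m = v;    /* max; average would sum and divide */
                }
            out[y * ow + x] = m;
        }
}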

32
Classifier Layer
▪ Notations
• Tin = number of input neurons
• a[j] = activation (value) at the jth input neuron
• Tout = number of output neurons
• w[i][j] = weight from the jth input neuron to the ith output neuron
• s[i] = sum at the ith output neuron
[Figure: fully connected layer with Tin = 8 inputs and Tout = 2 outputs]
▪ Tin = 8, Tout = 2 in this example
▪ Implementation
for (i = 0; i < Tout; i++) {
    for (j = 0; j < Tin; j++) {
        s[i] += a[j] * w[i][j];
    }
}
▪ #memory accesses? (counted on the next slide)
33
Memory accesses in classifier layer
▪ Notations
• Tin = number of input neurons
• a[j] = activation (value) at the jth input neuron
• Tout = number of output neurons
• w[i][j] = weight from the jth input neuron to the ith output neuron
• s[i] = sum at the ith output neuron
for (i = 0; i < Tout; i++) {
    for (j = 0; j < Tin; j++) {
        s[i] += a[j] * w[i][j];
    }
}
▪ #memory accesses?
– Activations: Tin × Tout
– Weights: Tin × Tout
– Sums: Tout
– Total: Tin × Tout + Tin × Tout + Tout = 2 × Tin × Tout + Tout
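As a quick check with the slide's example sizes (Tin = 8, Tout = 2): 8 × 2 = 16 activation reads, 16 weight reads, and 2 sum accesses, i.e., 34 memory accesses in total.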
▪ Drawback
– The number of input neurons (Tin) can be very large (10 up to 1L, i.e., ~100,000)
– The data then cannot fit in the L1 cache
34
How to Reduce the Memory Accesses
▪ Rather, how do we make use of L1?
▪ Tiling
– Divide the input neurons into tiles
– Store one tile at a time in L1
– Compute the contribution from each tile separately
– Accumulate the outputs
[Figure: the same fully connected layer, Tin = 8, Tout = 2]
▪ Trade-off (illustrated in the sketch below)
– The number of times a[j] needs to be fetched from main memory is lower (+)
– The number of times s[i] needs to be fetched is higher (-)
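A minimal sketch of this tiling applied to the classifier loop, assuming an illustrative tile size Tt that fits in L1 and divides Tin (neither parameter is from the slides); note how the trade-off above shows up directly in the loop structure.

/* Tiled classifier layer. The input neurons are walked in tiles of
 * Tt so each tile of a[] stays resident in L1. Assumes s[] is
 * zero-initialized and Tin is a multiple of Tt (illustrative). */
void classifier_tiled(int Tin, int Tout, int Tt,
                      const float *a, const float *w, float *s) {
    for (int jj = 0; jj < Tin; jj += Tt)         /* one input tile */
        for (int i = 0; i < Tout; i++)           /* s[i] re-fetched per tile (-) */
            for (int j = jj; j < jj + Tt; j++)
                s[i] += a[j] * w[i * Tin + j];   /* a[j] served from L1 (+) */
}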

35
Tiling for Convolutional and Pooling Layers
▪ Tiling for convolutional layers
– Input feature maps
– Output feature maps
▪ Pooling layers
– No kernels (why?)
– Tiling does not improve the BW requirement
drastically
▪ Please go through the papers and come
back with any question(s)
▪ Expect similar question(s) in the quiz
Is this a good figure?

36
A Naïve Hardware Implementation
▪ Fully lay out the neurons and the synapses on the silicon
– Neurons: logic circuits performing the computation
– Synapses: latches/RAMs storing the weights
– Non-linear functions are implemented as piecewise linear functions
[Figure: the same fully connected layer, Tin = 8, Tout = 2]
▪ Advantage
– High speed: the distance travelled by the data is short
▪ Disadvantage
– Area increases drastically with the number of neurons

37
Performance of the Naïve Implementation
▪ Area and energy suffer a lot as the number of input and output neurons increases

38
Accelerator Architecture in DianNao
▪ Storage
– NBin: To store input activations
– SB: To store synapses
– NBout: To store outputs
▪ Neural functional unit (NFU)
– NFU-1: multipliers
– NFU-2: adder trees
– NFU-3: non-linear unit
▪ Control instructions

39
Storage (1)
▪ Split buffers: three separate buffers instead of a single cache
– Reduced conflicts
– Reduced access time
▪ Scratchpad memory
▪ Point to note
– The different widths of NBin/NBout and SB allow data of different widths to be read
for (i = 0; i < To; i++) {
    for (j = 0; j < Ti; j++) {
        s[i] += a[j] * w[i][j];
    }
}
40
Storage (2)
▪ NBout is a circular buffer
– Partial sums are pushed into NBout instead of being sent back to main memory and fetched again
– An output is read out of NBout only after all the input neurons have been integrated into its partial sum
– Useful while tiling (see the sketch below)

for (i = 0; i < To; i++) {
    for (j = 0; j < Ti; j++) {
        s[i] += a[j] * w[i][j];
    }
}
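A minimal sketch of how a circular NBout changes the tiled loop from earlier, under the simplifying (illustrative) assumption that all To partial sums fit in NBout at once: the psums accumulate on chip across input tiles and are written back to memory only once.

/* Tiled classifier with NBout as the on-chip psum store. Assumes,
 * for illustration, that all To psums fit in NBout, so partial sums
 * never round-trip to main memory between input tiles. */
void classifier_nbout(int Ti_total, int To, int Tt,
                      const float *a, const float *w,
                      float *nbout, float *s_mem) {
    for (int i = 0; i < To; i++) nbout[i] = 0.0f;
    for (int jj = 0; jj < Ti_total; jj += Tt)           /* input tiles */
        for (int i = 0; i < To; i++)
            for (int j = jj; j < jj + Tt; j++)
                nbout[i] += a[j] * w[i * Ti_total + j]; /* psum stays in NBout */
    for (int i = 0; i < To; i++)
        s_mem[i] = nbout[i];                            /* single write-back */
}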
41
NFU-1: Multipliers
▪ Convolutional layers require
multiplication
▪ 16-bit fixed point multipliers are
used instead of 32-bit floating
point multipliers
▪ Comments?
[Figure panels: Error, Performance]

42
NFU-2: Adders
▪ Adder trees
▪ Advantages of adder trees
– Fewer generate and propagate (carry) stages
– Lower latency (see the sketch below)
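A minimal behavioral sketch of a balanced adder tree, assuming n is a power of two: each pass of the outer loop models one tree level, so an n-input reduction takes log2(n) levels rather than n - 1 serial additions.

/* Balanced adder-tree reduction, in place. Assumes n is a power
 * of two; each 'stride' pass is one tree level, so the depth is
 * log2(n) instead of n - 1 sequential additions. */
int adder_tree(int x[], int n) {
    for (int stride = 1; stride < n; stride *= 2)   /* one tree level */
        for (int i = 0; i + stride < n; i += 2 * stride)
            x[i] += x[i + stride];                  /* pairwise adds */
    return x[0];
}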

43
NFU-3: Non linear Unit
▪ Involves non-linear operations
– Exponentials, division, etc.
▪ Any optimization idea?
– Implement them as piecewise linear functions (see the sketch below)
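A minimal sketch of the piecewise linear idea applied to the sigmoid; the segment boundaries, slopes, and intercepts below are illustrative values chosen for continuity, not the coefficients used in DianNao (a real NFU-3 would look them up in a small on-chip table).

/* Piecewise linear approximation of sigmoid(x). The boundaries,
 * slopes, and intercepts are illustrative, not DianNao's actual
 * table; segment i computes slope[i]*x + icept[i]. */
float sigmoid_pwl(float x) {
    static const float bound[5] = {-4.0f, -2.0f, 0.0f, 2.0f, 4.0f};
    static const float slope[6] = {0.0f, 0.05f, 0.20f, 0.20f, 0.05f, 0.0f};
    static const float icept[6] = {0.0f, 0.20f, 0.50f, 0.50f, 0.80f, 1.0f};
    int i = 0;
    while (i < 5 && x >= bound[i]) i++;   /* pick the segment */
    return slope[i] * x + icept[i];
}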

44
Control Instructions

Control Instruction Format

An example: Tin = 8192, Tout = 256, 64-entry buffers
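(With the numbers above, and assuming one input per buffer entry, the Tin = 8192 inputs would be streamed through the 64-entry NBin as 8192 / 64 = 128 tiles.)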

45
Results

46

THANK YOU
