Hardware Implementation: Data-Flow and Design Space: 1.1 Memory-Access Simulator
Prof.: François Leduc-Primeau
In this assignment, we go into the details of the implementation of DNNs on specific hardware. The
final goal of this assignment is to figure out how different dataflows can change the performance of
a DNN workload on a given hardware architecture.
1 Preliminaries
To estimate the performance of a DNN workload on a given architecture for a specific dataflow, we
need to simulate the process of transferring data between the levels of the memory hierarchy. Since a
dataflow specifies how data is staged in the memory hierarchy, we can obtain from it the access count
of each component in the given architecture. Finally, we can estimate the energy consumption of
the DNN workload on the designated architecture based on the energy models for each component.
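For example, once the access counts are known, a simple estimate of the total energy (with notation introduced here only for illustration) is E_total = Σ_c (N_read,c · E_read,c + N_write,c · E_write,c) + N_MAC · E_MAC, where N_read,c and N_write,c are the numbers of read and write accesses to component c, E_read,c and E_write,c the corresponding per-access energies, and N_MAC · E_MAC the cost of the arithmetic operations.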
A dataflow mapping for the problem in Fig. 3 can be represented as a set of nested loops as in
Fig. 1, which shows the tiled version of an output stationary (OS) mapping. The loop bounds
(M1, P, Q, M2, R, S) are parameters of the mapping (in Fig. 3, we have M1 = M and M2 = 1).
For the mapping to be valid, the parameters must be such that the data structures fit in the
corresponding level of the memory hierarchy, and such that they match the dimensions
of the problem (e.g., M1 · M2 = M).
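Since Fig. 1 is not reproduced in this text, the sketch below shows one loop nest that is consistent with the listed bounds and buffer contents; the exact ordering in Fig. 1 may differ (this is an assumption). M is tiled as M = M1 · M2, and the inner (m2, r, s) loops operate on data held in the buffer.

import numpy as np

def tiled_os_conv(I, W, M1, M2):
    """Direct evaluation of a tiled output-stationary loop nest (single input channel)."""
    M, R, S = W.shape
    assert M == M1 * M2
    P = I.shape[0] - R + 1
    Q = I.shape[1] - S + 1
    O = np.zeros((M, P, Q))
    for m1 in range(M1):              # tile of output channels (main-memory level)
        for p in range(P):            # output row
            for q in range(Q):        # output column
                # the buffer would hold: the 1 x R x S input patch at (p, q),
                # the M2 x R x S weight tile m1, and the M2 partial sums
                for m2 in range(M2):
                    m = m1 * M2 + m2
                    for r in range(R):
                        for s in range(S):
                            O[m, p, q] += I[p + r, q + s] * W[m, r, s]
    return O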
In the example of Fig. 1, the buffer can store an input block of size 1 × R × S, a weight block
of size M2 × R × S, and an output block of size M2 × 1 × 1. In each cycle, we determine which
blocks of the inputs, weights, and outputs have to be transferred between main memory and the
buffer. First, we check whether they are already available in the buffer; if they are not, we read
them from the main memory and store them in the buffer.
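A minimal sketch of this check (not the required simulator interface; the class and method names are illustrative): the buffer is modeled as a set of block identifiers, and a main-memory read is counted only when a requested block is not already resident.

class SimpleBuffer:
    def __init__(self):
        self.resident = set()    # identifiers of blocks currently held in the buffer
        self.dram_reads = 0      # main-memory read count (in blocks)

    def fetch(self, block_id):
        """Ensure `block_id` is in the buffer, counting a main-memory read on a miss."""
        if block_id not in self.resident:
            self.dram_reads += 1
            self.resident.add(block_id)   # capacity limits and eviction omitted here

# e.g., before the inner loops of an output-stationary mapping:
# buf.fetch(("input", p, q)); buf.fetch(("weight", m1))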
The assignment requires the use of the Timeloop and Accelergy tools. Timeloop is a tool to determine
the number of times a particular memory or processing element is used (similar to Question 1) and
to optimize the dataflow mapping, whereas Accelergy is a tool that estimates the energy cost of
each operation (data movement or arithmetic operation).
You can install Timeloop-Accelergy on your own computer or on a Google Colab Pro instance. To
install on your own computer, the use of Ubuntu is recommended (either directly or by installing a
Figure 1: Dataflow
virtual machine). A local install is the best way to become familiar with Timeloop-Accelergy.
However, note that Question 3 will ask you to optimize a DNN model based on the energy consumption
predicted by Timeloop. This will require having access to GPU acceleration and Timeloop-Accelergy
at the same time. Using Colab Pro is a simple way to achieve this. A script is provided in the
homework git repository to install Timeloop-Accelergy on Ubuntu or in a Colab Pro instance (which
runs Ubuntu).
2 Questions
Q1. Memory-access simulation (45 points)
In this question, we want to investigate the effect of different dataflows for a 1D multi-channel
convolution (shown in Fig. 4) on the architecture shown in Fig. 2. The main memory
consists of an SRAM module with a logical depth of 32768 and a logical width of 8 bits. The
buffer consists of a register file (regfile) module with a depth of 64 and a width of 8 bits.
The PE uses an 8-bit integer MAC (intmac). For this question, we use the simple
energy model given in Table 1.
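One possible way to record these parameters inside a simulator is sketched below; the per-access energy numbers are placeholders only, not the values of Table 1.

# Q1 architecture parameters from the text; the energies are hypothetical
# placeholders to be replaced by the values given in Table 1.
ARCH = {
    "main_memory": {"component": "SRAM",    "depth": 32768, "width_bits": 8},
    "buffer":      {"component": "regfile", "depth": 64,    "width_bits": 8},
    "pe":          {"component": "intmac",  "width_bits": 8},
}
ENERGY = {   # energy per access / per operation (placeholder units and values)
    "main_memory": {"read": 10.0, "write": 10.0},
    "buffer":      {"read": 1.0,  "write": 1.0},
    "pe":          {"mac": 0.5},
}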
Figure 4: 1D multi-channel convolution (Input 1 × (P + R), Weight M × R, Output M × P)
Your task is to implement your own simulator and test it for different dataflows: weight stationary,
untiled output stationary, and tiled output stationary, as presented in Table 2. Configuration
files are provided in YAML format that describe the architecture (Q1_arch.yaml),
the problem (Q1_prob.yaml), and the different dataflows/mappings. The problem can be
equivalently described by the following pseudo-code:
for (m = 0; m < M; m++)
  for (p = 0; p < P; p++)
    for (r = 0; r < R; r++)
      o[m, p] += i[p + r] * f[m, r];
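For reference, a direct transcription of this pseudo-code (here in NumPy) can be used to check that every dataflow mapping produces the same outputs as the plain computation.

import numpy as np

def conv1d_reference(i, f, P):
    """Direct transcription of the pseudo-code: i is the 1-D input, f has shape (M, R)."""
    M, R = f.shape
    o = np.zeros((M, P))
    for m in range(M):
        for p in range(P):
            for r in range(R):
                o[m, p] += i[p + r] * f[m, r]
    return o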
Several dataflows and mappings are provided as configuration files. Using your simulator,
generate the statistics indicated in Table 2 for each case.
Q2. Component energy models
Using the Accelergy energy models, plot the energy cost of each component's
operations. In the case of memories, plot the energy for read and write access. You can use
timeloop-metrics for this purpose, which provides a simplified view of an architecture's
energy consumption. Determine experimentally how the energy scales as a function of each
parameter, and justify the observed scaling based on the corresponding CMOS circuit models.
(A possible sweep script is sketched after the list below.)
(a) First, fix the SRAM depth to 8192 and change the width from 16 to 2048. Then, fix the
width to 128 and change the depth from 512 to 8192.
(b) First, fix the regfile depth to 8192 and change the width from 16 to 2048. Then, fix the
width to 128 and change the depth from 512 to 8192.
(c) Change the DRAM (logical) width from 16 to 2048. (Note: the depth is not a parameter
in this case; the model assumes a large-capacity DRAM chip.)
(d) Change the intmac width from 2 to 64.
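The sweeps of parts (a)-(d) can be scripted. The sketch below assumes a helper run_timeloop_metrics, a hypothetical wrapper that you would write around the timeloop-metrics command, which edits the architecture description and parses the reported per-access energies; its name and return format are illustrative only.

import matplotlib.pyplot as plt

def run_timeloop_metrics(arch_file, depth, width):
    """Hypothetical wrapper: edit `arch_file` with the given SRAM depth/width,
    invoke timeloop-metrics, and return the reported per-access energies."""
    raise NotImplementedError  # implementation left to your own tooling

widths = [16, 32, 64, 128, 256, 512, 1024, 2048]
read_energy = [run_timeloop_metrics("Q1_arch.yaml", depth=8192, width=w)["SRAM"]["read"]
               for w in widths]

plt.loglog(widths, read_energy, marker="o")   # a log-log plot reveals power-law scaling
plt.xlabel("SRAM width (bits)")
plt.ylabel("Read energy per access")
plt.show()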
Q3. Pruning and design space exploration (40 points)
In this question, we consider a joint optimization of the DNN model and of the HW accelerator.
For the base HW architecture,
we consider the Eyeriss [4] chip. Configuration files for this architecture are provided with
Timeloop.
For the base DNN model, we consider the ResNet-32 model that was used in Assignment 1.
In this question, we will consider optimizing the sparsity of the weights separately for each
layer of the model, and the objective will be to minimize energy consumption, as estimated
by Timeloop-Accelergy, subject to a minimum task accuracy constraint of 85% correct classifications.
We will use structured pruning so that the reduction in the number of non-zero
parameters can be directly translated into energy gains even on a hardware architecture like
Eyeriss that is not specialized for sparse models.
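Stated as an optimization problem (the notation below is introduced here for clarity and is not part of the assignment statement): choose per-layer sparsity levels s = (s_1, ..., s_L) that minimize the total estimated energy Σ_l E_l(s_l), where E_l is the Timeloop-Accelergy energy estimate of layer l after pruning, subject to the constraint Acc(s) ≥ 85%.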
First, complete the model_to_spars and generate_resnet_layers functions in solution.py.
Here we focus on structured pruning applied to the output-channel dimension of the convolutional
layers. You can use PyTorch's pruning utilities to prune the network, following the instructions in
the main.ipynb notebook. After fine-tuning (with any training recipe),
save the model and generate the YAML “problem” files for each layer of the pruned network
using generate_resnet_layers. Then you can use run_Accelergy to optimize the mapping
and estimate the energy consumption of the pruned network.
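As one possible starting point (a sketch only; the functions actually required in solution.py are specified in the provided repository, and the helper name below is illustrative), PyTorch's torch.nn.utils.prune module can remove entire output channels with ln_structured pruning along dimension 0 of a convolution's weight tensor.

import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_output_channels(model, amounts):
    """amounts: dict mapping a Conv2d module name -> fraction of output channels to zero out."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d) and name in amounts:
            # L2-norm structured pruning along dim 0, the output-channel dimension
            prune.ln_structured(module, name="weight", amount=amounts[name], n=2, dim=0)
            prune.remove(module, "weight")  # fold the mask in so the zeros are permanent
    return model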
Any reasonable attempt at exploring the design space will be awarded full marks, and more
ingenious approaches will be given bonus points.
3 Deliverables
The answers to all the questions should be presented in a single report. Please also submit the code
you developed, either as an archive or using a GitHub link (make sure the repository is public).¹
For Question 1, make sure to briefly explain the structure of your software simulator. For instance,
for an object-oriented implementation, list the classes as well as the methods and attributes of each
class. For each element, give a brief explanation of its purpose. Alternatively, you can include
documentation that is generated from code comments as an appendix to your report.
¹ Any commit pushed after the due date will not be considered.
References
[1] Yannan Nellie Wu, Joel S Emer, and Vivienne Sze. Accelergy: An architecture-level energy
estimation methodology for accelerator designs. In IEEE/ACM International Conference on
Computer-Aided Design, 2019.
[2] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. Aladdin: A pre-RTL,
power-performance accelerator simulator enabling large design space exploration of customized
architectures. In ACM/IEEE International Symposium on Computer Architecture, 2014.
[3] Sheng Li, Ke Chen, Jung Ho Ahn, Jay B Brockman, and Norman P Jouppi. CACTI-P:
Architecture-level modeling for SRAM-based structures with advanced leakage reduction tech-
niques. In IEEE/ACM International Conference on Computer-Aided Design, 2011.
[4] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient
reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State
Circuits, 2016.