
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 69, NO. 12, DECEMBER 2022

A Fast, Accurate, and Comprehensive PPA Estimation of Convolutional Hardware Accelerators

Leonardo Rezende Juracy, Alexandre de Morais Amory, and Fernando Gehm Moraes, Senior Member, IEEE

Abstract— Convolutional Neural Networks (CNN) are widely adopted for Machine Learning (ML) tasks, such as classification and computer vision. GPUs became the reference platforms for both the training and inference phases of CNNs due to their architecture tailored to the CNN operators. However, GPUs are power-hungry architectures. A path to enable the deployment of CNNs in energy-constrained devices is adopting hardware accelerators for the inference phase. However, the literature presents gaps regarding analyses and comparisons of these accelerators to evaluate Power-Performance-Area (PPA) trade-offs. Typically, the literature estimates PPA from the number of executed operations during the inference phase, such as the number of MACs, which may not be a good proxy for PPA. Thus, it is necessary to deliver accurate hardware estimations, enabling design space exploration (DSE) to deploy CNNs according to the design constraints. This work proposes a fast and accurate DSE approach for CNNs using an analytical model fitted from the physical synthesis of hardware accelerators. The model is integrated with CNN frameworks, like TensorFlow, to generate accurate results. The analytic model estimates area, performance, power, energy, and memory accesses. The observed average error comparing the analytical model to the data obtained from the physical synthesis is smaller than 7%.

Index Terms— CNN, convolutional hardware accelerator, power-performance-area (PPA) estimation, design space exploration (DSE).

Manuscript received 12 July 2022; revised 30 August 2022; accepted 5 September 2022. Date of publication 15 September 2022; date of current version 9 December 2022. This work was supported in part by the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) under Grant 309605/2020-2; in part by the Fundação de Amparo à pesquisa do Estado do RS (FAPERGS) under Grant 21/2551-0002047-4; and in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Finance Code 001. This article was recommended by Associate Editor J. Di. (Corresponding author: Fernando Gehm Moraes.) Leonardo Rezende Juracy and Fernando Gehm Moraes are with the School of Technology, PUCRS University, Porto Alegre 90619-900, Brazil (e-mail: [email protected]; [email protected]). Alexandre de Morais Amory is with Scuola Superiore Sant'Anna, 56127 Pisa, Italy (e-mail: [email protected]). Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSI.2022.3204932. Digital Object Identifier 10.1109/TCSI.2022.3204932

I. INTRODUCTION

MACHINE Learning (ML) is a sub-area of artificial intelligence that contains a class of algorithms able to solve problems involving knowledge and "learning" characteristics from determined patterns. The ML decision capability [1] enables its adoption in classification and pattern recognition problems. Many applications can use ML, such as computational vision, virtual reality, voice assistants, chatbots, health care, and self-driving vehicles [2], [3], [4], [5].

One of the most common ways to deliver ML is by using Artificial Neural Networks (ANN), particularly Convolutional Neural Networks (CNN). CNNs have the advantage of having sparse connections, in contrast to fully connected ANNs, where all neurons of one layer are connected to all neurons of the next layer.

A CNN contains four main layers: (i) the convolutional layer, which is the CNN core and performs the synapses by multiplying and accumulating weights and input feature maps; (ii) the activation function, a nonlinear transformation sent to the next layer of neurons; (iii) the pooling layer, used to reduce the amount of data processed by the CNN; (iv) the fully connected layer, used to produce the classification result.

The deployment of CNN applications comprises two phases [6]. The first is training, which defines the weights of the synapses. The second is inference, which uses the weights previously computed during the training phase to classify or predict output values based on the inputs. A CNN can correctly classify inputs not used in the training phase.

The success of CNNs led to the development of frameworks that help developers build their models by offering the mechanisms required for training and inference. Examples of frameworks include Caffe [7], Pytorch [8] and TensorFlow [9]. These frameworks use a high-level approach to abstract the implementation of functions, such as convolution, and aid in implementing CNN applications. Also, these frameworks abstract the training phase by implementing functions like back-propagation algorithms.

GPUs became the reference platform for training and inference due to their architecture tailored to the CNN operators [10], [11], reducing the time spent in training. The main GPU drawback is its energy consumption. Considering energy-constrained applications, such as the Internet of Things (IoT), autonomous driving, and wearable devices, the adoption of specialized hardware for computing inference became a trend.

CNN hardware accelerators are a suitable replacement for CPUs and GPUs in the inference phase [12]. CNN accelerators can reduce power consumption and/or improve throughput [10], [13], [14]. Also, consumer products increasingly include these blocks [15], [16], [17]. Most of these accelerators are application-specific and can focus on only one characteristic to optimize, such as power, performance, or area [18], [19].

The literature presents gaps regarding analyses and comparisons of CNN hardware accelerators. Even with a representative number of accelerators using different implementations, there is a lack of proposals exploring the trade-offs between
implementations. For example, Eyeriss proposes a comparison between accelerators but lacks performance or area trade-off evaluation [10]. Also, some works compare accelerators considering different technology nodes, resulting in an unfair analysis [20].

For design space exploration (DSE), the literature presents works using analytic models integrated into frameworks and system simulators. Analytical models estimate the power, performance, and area (PPA) of hardware accelerators for a given hardware constraint [21], [22]. System simulators [23], [24] describe accelerators in high-level languages, like Python and C++, reducing the design time and providing PPA evaluation. Both analytical and simulator approaches present as a drawback the PPA accuracy, typically estimated from the number of required MACs (multiply-accumulate operators) [23], [25]. Despite the efforts to increase the abstraction level to implement CNN accelerators using high-level synthesis (HLS) [26], this approach presents several challenges, such as: (i) long synthesis time; (ii) huge design space; (iii) the effective impact of design parameters in the hardware [27].

The goal of this paper is to propose a method to perform a comprehensive DSE in a fast and accurate way, integrated with an ML framework, to estimate the costs of the hardware accelerator parameters. Besides the CNN parameters, the proposed method also allows the user to change hardware parameters, such as the dataflow type (weight stationary (WS), input stationary (IS), and output stationary (OS)), memory type, memory latency, and accelerator frequency. We also demonstrate that estimating PPA values using MACs produces an inaccurate PPA estimation.

The original contributions of this work include:
1) Adoption of an ML framework (TensorFlow) as a front-end to perform DSE for CNN hardware accelerators;
2) The use of data from the physical synthesis of the complete hardware accelerators, not only from basic components, to obtain accurate results;
3) A method to fairly compare different CNN hardware accelerators, allowing the comparison of accelerators with different dataflows while considering the same characteristics, such as the technology node, frequency, and memory type;
4) An analytical method to perform DSE. A set of equations derived from the physical synthesis flow enables the ML framework to estimate power, performance, and area. The analytical model, integrated into the ML framework, is the key to obtaining a fast and accurate DSE.

This paper is organized as follows. Section II reviews CNN hardware accelerators and simulators, positioning our work with regard to the state-of-the-art. Section III introduces the convolutional accelerator architecture proposed in this work. Section IV presents two DSE flows, the first based on physical synthesis and the latter on MAC counting. Section V details the main original contribution of this work, the proposed DSE method based on an analytical approach to produce PPA results. Section VI compares the proposed DSE method with the flows described in Section IV and with related works. Finally, Section VII concludes this paper, pointing out directions for future work.

II. STATE-OF-THE-ART

Section II-A describes frameworks enabling the design and DSE of hardware accelerators. Section II-B presents CNN simulators. Section II-C qualitatively compares the presented works with regard to our proposal.

Besides CNNs, the literature also presents Spiking Neural Network (SNN) proposals [28], [29]. SNNs are based on firing-pattern computation, similar to the human brain, and are commonly implemented using an analog approach. The advantage of using SNNs is reduced area and energy consumption, since analog components can be smaller than a digital adder or multiplier, allowing the connection of thousands of neurons with low area cost. However, our focus is on digital implementations, so SNNs are out of the scope of this work.

A. Frameworks for CNN Hardware Accelerators

MLPAT [30] is a framework that allows modeling power, area, and timing for machine learning accelerators. MLPAT models components such as systolic arrays, on-chip memory, and the activation pipeline. Also, MLPAT supports different precision types, which allows validating the trade-off between accuracy and precision, and different dataflows, such as WS and OS. As input, MLPAT allows specifying the accelerator architecture, the circuit, and the technology. The framework generates an optimized chip representation to report the results, such as area, power, and performance.

MAESTRO [31], [32] is a framework to describe and analyze neural network hardware, which allows obtaining the hardware cost to implement a target architecture. It has a domain-specific language to describe the dataflow that allows specifying the number of PEs, memory size, and NoC bandwidth parameters. The results generated by the framework are focused on performance analyses. In recent work, MAESTRO was used to estimate trade-offs between execution time and energy efficiency for CNN models, such as VGG and AlexNet.

Timeloop [23] is a DSE framework for CNNs. It can emulate a set of accelerators, such as NVDLA [33]. Timeloop focuses on convolution layer analyses. Timeloop uses as input a workload description, such as input dimensions and weight values, a hardware architecture description, such as arithmetic modules, and hardware constraints. Instead of using a cycle-accurate simulator, Timeloop uses the deterministic behavior of data transfers to perform analytic analyses. As energy models, Timeloop has memory, arithmetic unit, and wire/network models based on TSMC 16nm FinFET.

Accelergy [25] allows estimating the energy of accelerators without a complete hardware description, using a library of basic components. Accelergy uses a high-level architectural description to capture the circuit behavior characteristics, such as memory reads. Accelergy considers the number of memory reads and the memory access pattern, which can be random or access the same address repetitively.


Heidorn et al. [21] propose an analytical model that estimates throughput and energy for a given hardware constraint. A DSE is proposed to determine the accelerator architecture limits in terms of throughput, number of parallel operations, and memory. The authors propose an accelerator to evaluate the model, with a tile-local memory, a bus, and a coarse-grained reconfigurable array (CGRA). Each CGRA presents a two-dimensional array of PEs, and the accelerator can have more than one CGRA to parallelize the processing.

Zhao et al. [22] propose an analytical performance predictor to estimate energy, throughput, and latency for ASIC and FPGA. The predictor uses DNN models, the hardware architecture, dataflow types, and the hardware cost regarding a technology node. The results are generated with the AlexNet and SkyNet CNN models, with Eyeriss, an FPGA implementation from [34], and synthesized results of a proposed accelerator.

DNNExplorer [35] is a framework for DSE of ML accelerators. DNNExplorer supports machine learning frameworks (Caffe and PyTorch), besides three accelerator architectures. The architecture also supports WS and IS dataflows. This framework adopts analytical models to estimate performance and hardware configuration.

Gemmini [36] is an open-source systolic array generator that allows evaluating deep-learning architectures. Gemmini generates a custom ASIC accelerator for matrix multiplication based on a systolic array architecture. Gemmini is compatible with the RISC-V Rocket ecosystem [37].

Interstellar [38] is a DSE framework that uses the Halide language (https://halide-lang.org) to generate hardware and compare different accelerators, such as different dataflows (WS, OS, RS) in 2D arrays and MAC tree schemes. The authors propose a systematic approach to describe the design space of DNN accelerators using Halide. The framework also optimizes the memory hierarchy.

DeepOpt [39] is a DSE framework to explore ASIC implementations of systolic hardware accelerators for CNNs. The main goal of this DSE is to reduce the number of memory accesses based on hardware characteristics like on-chip SRAMs and the number of parallel PEs. DeepOpt uses a search tree to schedule the convolution process. Thus, it is possible to minimize the number of accesses to memory by modeling memory access patterns (weight and output stationary) and pruning branches from the search tree.

Karbachevsky et al. [40] propose a method to estimate area and power values based on the bit operations (BOP) metric [41]. BOP is the number of bit operations required to perform the calculation, defined by the input bit size, output bit size, number of inputs, and number of outputs. According to the authors, the BOP metric allows estimating the area and power required by accelerator hardware with high accuracy in the early stages of the design process. Also, the method can show the trade-off between the number of PEs and the bottlenecks caused by parameter quantization, such as memory bandwidth or computational resources.

Ferianc et al. [42] propose a method to improve the performance of DSE analyses. The method is based on a Gaussian process regression model parameterized by the features of the accelerator and the target CNN, such as filter, channel, and data parallelism. The method is capable of predicting the hardware latency and energy, and was compared to machine-learning-based methods to perform DSE (linear regression, gradient tree boosting, and neural networks).

Aladdin [43] is a pre-RTL power-performance accelerator modeling framework. It estimates performance, power, and area. The Aladdin infrastructure uses dynamic data dependence graphs (DDDG) to represent accelerators. The DDDG is generated from a C program and allows Aladdin to report the program dependencies and resource constraints.

B. Hardware Simulators

SCALE-Sim (Systolic CNN Accelerator Simulator) [44], [45] is a systolic array cycle-accurate simulator. This simulator allows configuring micro-architectural features such as array size, array aspect ratio, scratchpad memory size, and dataflow mapping strategy. Also, it is possible to configure system integration parameters, such as memory bandwidth. SCALE-Sim simulates convolutions and matrix multiplications, and models the compute unit as a systolic array. Also, it allows simulation in a system context with CPU and DMA components.

STONNE [24] is a cycle-accurate architecture simulator for CNNs which allows end-to-end evaluation. It is connected with the Caffe framework [7] to generate the CNNs, and models the MAERI accelerator [46]. The results are focused on performance and hardware utilization. To estimate area and energy, STONNE uses the Accelergy energy estimation methodology [25], which considers basic modules, such as adders, to calculate the energy values.

AccTLMSim [47] is a pre-RTL cycle-accurate CNN accelerator simulator based on SystemC transaction-level modeling (TLM). The simulator allows maximizing the throughput performance for a given on-chip SRAM size. An accelerator is proposed to validate the simulator. AccTLMSim is focused only on performance, not power or area.

C. Summary Related to DSE Frameworks and Simulators

Table I summarizes the reviewed works. The second column indicates whether the work has integration with high-level CNN modeling frameworks, such as TensorFlow and Caffe. The third and fourth columns are related to the evaluated metrics. The third column presents metrics based on basic components, such as MACs and register files. The fourth column shows the evaluated metrics regarding the entire convolution, our original contribution.

Maestro [32] does not allow accelerator simulation, limiting the performance evaluation (e.g., throughput). SCALE-Sim [45] does not provide power or energy results. MLPAT [30] and Timeloop [23] provide PPA based on basic operations, such as adders and multipliers. Methods relying on operation counting do not consider how these operators are interconnected (e.g., 1D or 2D systolic arrays or adder trees), resulting in imprecise hardware metrics.

Works [21] and [22] show analytical results for power, performance, and area. Also, [22] considers features like the dataflow type, which can contribute to the power consumption. Similar reasoning applies to Aladdin [43].


Fig. 1. Generic architecture and the modules required to build the convolutional accelerators.

TABLE I
DSE FRAMEWORKS AND SIMULATORS STATE-OF-THE-ART SUMMARY

Works like Gemmini [36], Interstellar [38], DeepOpt [39], [40], and [42] have limited PPA analyses. Also, [40] claims that the main effect of changing the circuit frequency is to reduce power consumption. However, this is not exact, since logic synthesis with different frequency constraints produces different area values.

STONNE [24] and SimuNN [48] present flows integrated with frameworks to model CNNs. However, SimuNN has an energy estimation based on basic elements, not considering data movement through the accelerator. STONNE does not address power estimation, but the authors argue that it is possible to integrate STONNE with Accelergy (which only evaluates power). DNNExplorer [35] also allows frameworks to model CNNs, but lacks PPA analyses. SCALE-Sim and AccTLMSim lack integration with frameworks to model CNNs.

Considering the reviewed works, we identify the following gaps:
1) Comprehensive PPA analyses, considering all performance figures;
2) An analytical method based on the entire convolution accelerator to produce an accurate PPA estimation;
3) A framework or environment to produce a fast DSE.

Our proposal provides a comprehensive PPA estimation method that integrates a CNN modeling framework to perform DSE using data from actual CNNs. The DSE starts with a framework to model CNNs (TensorFlow) and executes an analytical analysis considering the accelerator architecture, allowing a fast and accurate DSE. The adoption of TensorFlow was a design choice, and other frameworks, such as Caffe or Pytorch, can be used as a front-end to execute the training and weight extraction.

III. CONVOLUTIONAL ACCELERATOR ARCHITECTURE OVERVIEW

Figure 1 shows the accelerator model, with a unified interface to the input and output external memories. External memories play an important role in the total accelerator energy, being mandatory to evaluate their cost [10]. This model contains three modules:
• INPUT memory: stores the IFMAP, filter weights, and bias values. It acts as a read-only memory for the accelerator;
• Convolutional core: executes the convolution;
• Output Feature Map (OFMAP) memory: stores partial and complete convolution values.

The main modules of the convolutional core include:
• Input buffer, used to reduce the number of input memory readings. According to the accelerator type, this buffer may store, e.g., an input channel, a set of rows of the input channel, or a set of weights;
• Input memory access control logic, used to control the input memory access. This module is an FSM configured according to the dataflow type;
• MAC array, the unit that processes the convolution operation, which may adopt a 2D (matrix) or 1D (vector) organization. It contains MACs and registers;
• Activation function, a non-linear function applied to the OFMAP results. Examples are sigmoid, ReLU, and leaky ReLU [49]. This work adopts the ReLU function (max(0, x)) due to its simpler hardware implementation;
• Output control logic, used to control the OFMAP memory access. It can be implemented with buffers to reduce memory accesses.

Our previous work [50] details the architecture of the convolutional core (WS and 1D array). The focus of [50] is the convolutional core design, without modeling memory effects in the PPA, and it does not propose a DSE analytical model.

The dataflow type is related to the reused data and how the MAC array computes the convolution. We implemented in the MAC array the main dataflows used in the literature [51]:


• Weight Stationary – WS. Stores weights in an internal buffer, aiming at their reuse. Thus, each weight value is read once from the input memory, and the convolution is performed using the stationary weights and IFMAP values read from memory.
• Input Stationary – IS. Stores an IFMAP window in an internal buffer to provide its reuse. The window size is equal to the filter size. The accelerator reads IFMAP values once and reads the weight values from memory.
• Output Stationary – OS. Stores partial convolution results in registers. The OS does not present buffers to store the inputs; each convolution fetches the IFMAP and weight values from the input memory.

Figure 1 presents the set of signals between the convolutional core and the memories, and the signals used by the hardware controlling the accelerator (start_conv and end_conv). Both memories use a standard set of busses, such as addresses, output data (ifmap_value and pixel_in), input data (pixel_out), memory enable, and write notification. Memories have different implementation technologies (e.g., SRAM or DRAM) and latencies, varying according to their specification. Thus, it is necessary to model the memory latency to estimate the power and performance of the convolutional accelerator. A validity signal (ifmap_valid and ofmap_valid) models the memory latency, indicating the end of memory accesses.

This unified memory interface makes it possible to implement different dataflows with distinct protocols, allowing a fair comparison between the accelerators.

The accelerators adopted for this work have a 3 × 3 MAC array, and the convolution stride equals 2. Despite being a design limitation, state-of-the-art CNNs adopt these values, such as VGG16 [52], ensuring that the proposed accelerators reflect real CNNs.

IV. PHYSICAL SYNTHESIS AND MAC-BASED FLOWS

This Section introduces two flows to execute DSE. The first one uses a standard-cell synthesis flow. Results for one instance of each dataflow are the basis for the main original contribution of this work, the analytical DSE flow, presented in the next Section. The second flow uses the number of MACs, a method used in related work.

A. Physical Synthesis Flow

The DSE physical synthesis flow performs both logical and physical synthesis steps and uses IFMAP and weight data from real CNNs to obtain PPA results. This flow includes the steps described below.

1) TensorFlow Modeling: Model the CNN application in the TensorFlow framework to generate the VHDL packages used in the RTL and gate-level simulations, and the gold model (expected output values). The VHDL packages contain the IFMAP, weights, bias, and OFMAP values. This step also generates parameters for configuring the RTL description, such as the size of the filters, OFMAP, and IFMAP. This step is generic, supporting different CNNs, such as MNIST or CIFAR10.

2) RTL Simulation: Verification of the behavior of the accelerator description. This step uses the VHDL packages, comparing the results against the gold model. It is necessary to check if the simulation output matches the expected values during the development of a new accelerator. Once the accelerator description is validated, it is possible to bypass this step.

3) Physical Synthesis: This step comprises logical and physical synthesis. The logic synthesis inputs include the RTL accelerator description, technology files (LEF and LIB files), and constraints (such as clock frequency or power). The physical synthesis corresponds to placement and routing. This step generates a gate-level netlist with annotated wire capacitances (SDF file).

4) Annotated Gate-Level Simulation and Power Estimation: Simulation of the annotated gate-level netlist, also using the VHDL packages and the gold model. This step may fail due to the applied constraints, such as clock frequency and input/output delays. In this case, the designer must modify the constraints used in the physical synthesis. The output of this simulation is a VCD file, with the switching activity induced by the CNN IFMAP and weight values. The power estimation tool uses the VCD file to estimate the accelerator power dissipation.

The execution of this flow produces an accurate PPA estimation for a given accelerator architecture with actual CNN data. However, it is necessary to execute this flow for each new set of weights and IFMAPs. The reason is that different data sets present different switching activities, changing the power dissipation. Also, the hardware may show differences due to the number of channels in a given layer or the IFMAP and OFMAP sizes, changing the number of bits in counters or the buffers' depth. Thus, we have an accurate PPA, but it requires a significant processing time, which takes hours.

B. MAC-Based Flow

This Section describes a DSE flow based on estimating the required number of MACs. Several works in the literature use the MAC-based method to evaluate area and power [21], [22], [23], [30], [48]. The MAC-based flow does not model memory accesses, having as its primary goal a fast area and power estimation.

The MAC-based flow used in this work considers the power and area values related to MACs and registers extracted from the physical synthesis flow. The number of required MACs and registers, and the data width, are a function of the accelerator design. The performance, i.e., the number of clock cycles to execute a complete convolution, is obtained through the RTL simulation.

The area and power accuracy of this flow are expected to be worse than the physical synthesis flow, as it does not consider the control circuitry (such as FSMs), the buffers, and the accesses to the memory.
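To make the simplicity of this estimate concrete, the sketch below reduces the MAC-based flow to operator counts multiplied by per-operator figures. It is a minimal Python sketch; the numeric constants are illustrative placeholders, not synthesis values from this work (in the actual flow they come from synthesizing a single MAC and register in the target technology), and the register count per MAC is likewise assumed.

```python
# MAC-based PPA estimate (Section IV-B): counts x per-operator figures.
# The constants below are illustrative placeholders, NOT values from the
# paper; in the real flow they come from synthesizing one MAC/register.
MAC_AREA_UM2, REG_AREA_UM2 = 500.0, 60.0   # assumed um^2 per component
MAC_POWER_MW, REG_POWER_MW = 0.12, 0.01    # assumed average mW per component

def mac_based_estimate(n_macs: int, n_regs: int) -> tuple[float, float]:
    """Area (um^2) and power (mW) from operator counts only.
    Ignores FSMs, buffers, interconnect, and memory accesses, which is
    why Section VI-B reports roughly 30-45% errors for this method."""
    area = n_macs * MAC_AREA_UM2 + n_regs * REG_AREA_UM2
    power = n_macs * MAC_POWER_MW + n_regs * REG_POWER_MW
    return area, power

# A 3x3 MAC array with one input and one output register per MAC (assumed):
print(mac_based_estimate(n_macs=9, n_regs=18))
```

The point of the sketch is structural: the estimate is constant per dataflow, since the operator counts do not change with WS, IS, or OS, which is exactly the weakness evaluated in Section VI-B.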


V. ANALYTIC DSE FLOW

The proposed DSE flow analytically estimates the PPA of CNN layers using results obtained from the physical synthesis of one CNN layer. This flow is faster than the physical synthesis flow because the synthesis is executed once for each accelerator, and it is more accurate than the MAC-based flow since it considers the entire convolution hardware. Figure 2 presents the analytic DSE flow.

Fig. 2. DSE analytic flow for PPA extraction.

The flow starts with the training phase of the CNN using TensorFlow, according to the CNN configuration defined in the application.py file. The training phase ends when the target accuracy is reached. After the training phase, the flow executes a quantization step, converting 32-bit floating-point weights and IFMAP values to 16-bit integers using a fixed-point representation. Adopting integer values avoids floating-point arithmetic in the accelerator, reducing its area and power consumption. This step generates two output files: (i) tensorflow.vhd, with the IFMAP and weight values and the gold model (expected output values) of the first CNN layer; (ii) accelerator.cfg, with the CNN convolutional layer parameters.

The right side of Figure 2 corresponds to the physical synthesis flow, presented in Section IV-A. The flow executes the synthesis of the first convolutional layer for each dataflow. The remaining layers are estimated from the first layer results. On one side, this procedure avoids the synthesis of all layers, speeding up the DSE. On the other side, there is a penalty in the PPA accuracy, evaluated in the Results Section. The output of the physical synthesis flow is a set of PPA reports used during DSE.

The Analytic Model uses four inputs to execute the DSE loop: (i) the configuration of the convolutional layers (accelerator.cfg); (ii) the user's constraints; (iii) the PPA reports related to the synthesis of the first convolutional layer; (iv) the memory energy estimated by the Cacti-IO tool [53].

The accelerator.cfg file contains hardware parameters (italic parameters correspond to variables in the analytic model):
1) Word size (bits);
2) IFMAP size: single integer value (we assume square IFMAPs) - IFMAP_D;
3) Number of input channels: integer value - InChannels;
4) Number of output channels: integer value - OutChannels;
5) Filter size: single integer value (we assume square filters) - Filter_D;
6) Stride: integer value.

The user's constraints file contains constraints applied to the accelerator (italic parameters correspond to variables in the analytic model):
1) Clock period (ns);
2) 2D dataflow type: WS, IS, and OS. It is possible to use buffered versions (output buffer) for WS and IS;
3) Memory features: type (SRAM or DRAM) and its latency - MemLat.

For example, suppose a designer wants to define the accelerator for a layer with the following characteristics: (1) 16-bit word size; (2) IFMAP size = 32 (cifar10); (3) 3 input channels (RGB); (4) 16 output channels; (5) 3 × 3 filters; (6) stride = 2. The designer executes the DSE loop by selecting the clock period, dataflow type, and memory parameters. The DSE loop may be executed until reaching a user constraint, such as performance, area, or power.

The analytic model produces the following results:
• Power: power values for the accelerator, the output buffer, and the sum of both (total power), in mW;
• Performance: number of clock cycles required to execute the layer convolution;
• Area: area values for the accelerator, the output buffer, and the sum of both (total area), in µm²;
• Accelerator energy: total power × number of cycles × clock period, in pJ;
• Memory accesses:
  – Number of input memory reads (IFMAPs, weights, and bias);
  – Number of input memory writes (always zero, since the input memory acts as a ROM);
  – Number of output memory reads (partial sum values);
  – Number of output memory writes (partial sum values and OFMAPs);
• Memory read energy: total memory reads × energy per read (estimated by Cacti-IO), in nJ;
• Memory write energy: total memory writes × energy per write (estimated by Cacti-IO), in nJ;
• Total energy: accelerator energy + memory read energy + memory write energy, in nJ.

The physical synthesis produces PPA reports for the accelerator core. The performance to compute the convolution of a given layer is calculated by the analytic model, a process detailed in Section V-A. The analytical model calculates the effect of the memory accesses on performance and power (Section V-B). Another component that affects the PPA is the output buffers. The influence of these components is described in Section V-C.
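The quantization step described earlier in this overview converts the trained float32 parameters to 16-bit fixed point. A minimal sketch follows; the paper fixes only the 16-bit total width, so the 8-bit fractional split (frac_bits = 8) is an assumption for illustration.

```python
import numpy as np

def to_fixed_point(values: np.ndarray, total_bits: int = 16,
                   frac_bits: int = 8) -> np.ndarray:
    """Quantize float32 weights/IFMAPs to signed 16-bit fixed point.
    frac_bits is an assumed split; the paper does not specify it."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = np.round(values.astype(np.float64) * scale)
    return np.clip(q, lo, hi).astype(np.int16)

weights = np.array([0.731, -0.052, 1.204], dtype=np.float32)
print(to_fixed_point(weights))  # [187 -13 308] with 8 fractional bits
```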


A. Performance Estimation

Note that the OFMAP size is not a parameter in the accelerator.cfg file. The OFMAP size is a function of the IFMAP size (IFMAP_D), the filter size (Filter_D), and the stride value. Equation 1 computes the OFMAP size.

$$OFMAP\_D = \left\lfloor \frac{IFMAP\_D - FILTER\_D}{Stride} \right\rfloor + 1 \qquad (1)$$

For example: a 32 × 32 IFMAP, with 3 × 3 filters and stride = 2, generates a 15 × 15 OFMAP.

The convolution performance is a function of the dataflow defined by the designer.

1) WS Dataflow: Equation 2 computes the number of clock cycles, CyclesWS, to execute the convolution of a given layer using the WS dataflow. Weights and bias are stationary, i.e., pre-loaded in a buffer.

$$Cycles_{WS} = 6 \times OFMAP\_D^2 \times InChannels \times OutChannels \times (1 + MemLat) \qquad (2)$$

where:
• the constant 6 is the number of clock cycles to read 9 (3 × 3) IFMAP values. Due to the stride value (equal to 2), each reading reuses one column, reducing memory accesses;
• OFMAP_D² × InChannels is the number of convolutions to produce one output channel. Due to the accelerator pipeline implementation, the IFMAP reading and the convolution occur in parallel;
• the process is repeated for all output channels (OutChannels);
• the constant value added to the memory latency (1 clock cycle) corresponds to the address phase.

Equation 2 computes most of the clock cycles required to calculate the convolution of a given layer (>80%). The analytical model also computes the time to read the weights and the number of bubbles in the pipeline when it is necessary to return to the first X coordinate after OFMAP_D convolutions. For the sake of simplicity, the other equations are not presented.

2) IS Dataflow: Equation 3 computes the number of clock cycles, CyclesIS, to execute the convolution of a given layer using the IS dataflow. In the IS approach, values read from the IFMAP are stationary, i.e., they are used to compute a partial output value at each output channel.

$$\begin{aligned}
Term1 &= OutChannels \times (1 + MemLat) + (Filter\_D^2 \times OutChannels \times InChannels) \times (1 + MemLat)\\
Term2 &= Filter\_D^2 \times OFMAP\_D^2 \times InChannels \times (1 + MemLat)\\
Term3 &= 9 \times OFMAP\_D^2 \times InChannels \times OutChannels\\
Cycles_{IS} &= Term1 + Term2 + Term3 \qquad (3)
\end{aligned}$$

where:
• Term1: cycles to load bias and weights and store them in internal buffers. The number of bias values is equal to the number of OutChannels;
• Term2: cycles to read Filter_D × Filter_D IFMAP values from the memory;
• Term3: cycles to execute all convolutions of the layer.

Equation (3) computes most of the clock cycles required to calculate the convolution of a given layer (CyclesIS > 83% for IS and CyclesIS > 90% for buffered IS). As in the previous dataflow, the time spent with bubbles is also accounted for by the proposed analytic model.

The WS dataflow reads the IFMAP for each partial result (Equation 2). For the IS dataflow, a partial reading of the IFMAP is performed (Term2 of Equation 3), reusing these values for all partial convolutions (Term3 of Equation 3).

3) OS Dataflow: The OS dataflow does not have buffers for IFMAP and weight values. Thus, the OS dataflow reads 18 values from the input memory to execute each convolution (9 weights and 9 IFMAP values). Due to the pipeline implementation, the convolution occurs in parallel to the memory reading.

Equation 4 computes most of the clock cycles required to calculate the convolution of a given layer (>98%). The analytic model also considers the number of clock cycles to write to the OFMAP memory and the bubbles in the pipeline. The number of memory readings is the main difference with regard to the WS dataflow (Equation 2), being larger in the OS dataflow.

$$Cycles_{OS} = 18 \times OFMAP\_D^2 \times InChannels \times OutChannels \times (1 + MemLat) \qquad (4)$$

B. Memory Accesses Estimation

The number of memory accesses is a function of the dataflow defined by the designer.

1) Memory Readings for the WS Dataflow: Equation 5 presents the number of memory readings for the WS and buffered WS dataflows.

$$\begin{aligned}
Term1 &= 6 \times (OFMAP\_D + 5) \times InChannels \times OutChannels\\
Term2 &= (Filter\_D^2 + 1) \times InChannels \times OutChannels\\
Term3 &= 6 \times OFMAP\_D^2 \times InChannels \times OutChannels\\
MemRead_{WS} &= Term1 + Term2 + Term3 \qquad (5)
\end{aligned}$$

where:
• Term1: refers to "invalid" readings. At the end of each row, the WS accelerator accesses memory locations not used in the convolution. It would be possible to avoid these readings at the cost of more control logic in the hardware. Our design choice was to keep the hardware simple.
• Term2: number of reads to load weight and bias values;
• Term3: number of reads to load IFMAP values (core of Equation 2).
the number of OutChannels; Equation 2).


2) Memory Readings for the IS Dataflow: Equation 6 presents the number of memory readings for the IS and buffered IS dataflows.

$$\begin{aligned}
Term1 &= OutChannels + (Filter\_D^2 \times InChannels \times OutChannels)\\
Term2 &= Filter\_D^2 \times OFMAP\_D^2 \times InChannels\\
MemRead_{IS} &= Term1 + Term2 \qquad (6)
\end{aligned}$$

where:
• Term1: number of reads to load bias and weight values;
• Term2: number of reads to load Filter_D × Filter_D IFMAP values from the memory.

3) Memory Readings for the OS Dataflow: Equation 7 presents the number of memory readings for the OS dataflow.

$$MemRead_{OS} = 18 \times OFMAP\_D^2 \times InChannels \times OutChannels \qquad (7)$$

It is possible to observe the smaller number of memory accesses for the IS dataflow (Equation 6, Term2) compared to the WS and OS dataflows (Equation 5, Term3, and Equation 7).

4) Memory Writings: Equation 8 computes the number of memory writings for a buffered accelerator (WS and IS). The output buffer reduces the memory writes, since partial results are stored in it.

$$OfmapWrites\ (buffered\ acc.) = OFMAP\_D^2 \times OutChannels \qquad (8)$$

Equation 9 computes the number of memory writings for a non-buffered accelerator (WS, IS, and OS). Non-buffered accelerators read and write partial sums in the output memory.

$$OfmapWrites\ (non\text{-}buffered\ acc.) = OFMAP\_D^2 \times OutChannels \times InChannels \qquad (9)$$
×OutChannels × I nChannels (9)

C. Output Buffer Area and Power Estimation

Two dataflows may present an output buffer: WS and IS. The output buffer area and power are obtained from interpolation. The data source for the interpolation is the set of results obtained from the physical synthesis flow for a three-layer Cifar10 CNN (presented in Section VI). The variable NumBits is the number of bits of each output buffer. Equation 10 computes the number of bits for the WS output buffer. The WS output buffer has the size of one OFMAP channel (OFMAP_D²) multiplied by the word size (16 bits).

$$NumBits(WS) = OFMAP\_D^2 \times 16 \qquad (10)$$

Equation 11 computes the number of bits for the IS output buffer. The IS dataflow computes one line of results (OFMAP_D) for all output channels (OutChannels).

$$NumBits(IS) = OFMAP\_D \times OutChannels \times 16 \qquad (11)$$

Fig. 3. Output buffer area results obtained from the physical synthesis flow, for the three layers of the Cifar10 CNN.

Fig. 4. Cifar10 CNN.

Figure 3 presents on the x-axis the number of bits for each dataflow, and on the y-axis the area. The Cifar10 CNN has three convolutional layers. The OFMAP sizes for each layer are (Figure 4): L1: 15 × 15, 16 output channels; L2: 7 × 7, 32 output channels; L3: 3 × 3, 64 output channels. According to Equation 10, the size of the WS output buffers decreases from layer 1 to layer 3, since the OFMAP size reduces. On the other side, according to Equation 11, the size of the IS output buffers increases from layer 1 to layer 3, due to the increase in the number of output channels.

The interpolation of the area results is used to compute the output buffer area. Equations 12 and 13 compute the output buffer area for WS and IS, respectively.

$$WSOutputBuffArea = (10.4 \times NumBits) + 493 \qquad (12)$$
$$ISOutputBuffArea = (10.5 \times NumBits) + 539 \qquad (13)$$

The same interpolation method is applied to obtain the power consumption due to the output buffers. However, each memory type has its own interpolation equation, due to the access latency. Equations 14 and 15 compute the output buffer power dissipation for WS and IS using an SRAM, respectively.


$$SRAM\_WSOutputBuffPower = 0.0792 + (0.000305 \times NumBits) + (0.0000000117 \times NumBits^2) \qquad (14)$$
$$SRAM\_ISOutputBuffPower = -5.4 + (0.00346 \times NumBits) + (-0.000000402 \times NumBits^2) \qquad (15)$$

Equations 16 and 17 compute the output buffer power dissipation for WS and IS using a DRAM, respectively.

$$DRAM\_WSOutputBuffPower = 0.0794 + (0.000245 \times NumBits) + (0.0000000109 \times NumBits^2) \qquad (16)$$
$$DRAM\_ISOutputBuffPower = -7.98 + (0.00484 \times NumBits) + (-0.000000595 \times NumBits^2) \qquad (17)$$

TABLE II
MAC-BASED AND PHYSICAL SYNTHESIS FLOWS AREA RESULTS (µm²) FOR LAYER 1. MAXIMUM ERROR: 39.90% IN LAYER 2 AND MINIMUM ERROR: 29.34% IN LAYER 0

TABLE III
MAC-BASED AND PHYSICAL SYNTHESIS FLOWS POWER RESULTS (mW) FOR LAYER 1. MAXIMUM ERROR: 44.58% IN LAYER 0 AND MINIMUM ERROR: 12.62% IN LAYER 1
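Equations 10 through 17 form a small fitted model for the output buffer, which the sketch below packs into lookup functions. The coefficients are those of the equations above; note that the power fits return mW and are only meaningful inside the NumBits range used for the fit (the three Cifar10 layers). Function names are ours.

```python
def num_bits(dataflow: str, ofm_d: int, out_ch: int, word: int = 16) -> int:
    """Equations 10-11: output buffer size in bits (WS or IS)."""
    return ofm_d**2 * word if dataflow == "WS" else ofm_d * out_ch * word

def out_buff_area(dataflow: str, bits: int) -> float:
    """Equations 12-13: fitted output buffer area (um^2)."""
    return 10.4 * bits + 493 if dataflow == "WS" else 10.5 * bits + 539

def out_buff_power(dataflow: str, memory: str, bits: int) -> float:
    """Equations 14-17: fitted output buffer power (mW), per memory type."""
    c0, c1, c2 = {
        ("WS", "SRAM"): (0.0792, 0.000305, 1.17e-8),
        ("IS", "SRAM"): (-5.4, 0.00346, -4.02e-7),
        ("WS", "DRAM"): (0.0794, 0.000245, 1.09e-8),
        ("IS", "DRAM"): (-7.98, 0.00484, -5.95e-7),
    }[(dataflow, memory)]
    return c0 + c1 * bits + c2 * bits**2

bits = num_bits("WS", ofm_d=15, out_ch=16)   # L1: 15x15 OFMAP -> 3600 bits
print(out_buff_area("WS", bits), out_buff_power("WS", "SRAM", bits))
```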

VI. RESULTS

Section VI-A details the experimental setup adopted to obtain the results. Next, we evaluate the MAC-based and analytic flows using the physical synthesis flow as the reference, in Section VI-B and Section VI-C, respectively. Section VI-D compares the analytical model to related works.

A. Experimental Setup

We adopt the CNN illustrated in Figure 4 as a case study, with three convolutional layers and one fully-connected layer. TensorFlow executes the fully-connected layer, which is not accelerated in hardware. The number of filters per layer is 16, 32, and 64. The CNN implemented in TensorFlow uses the Cifar10 dataset with a 32 × 32 × 3 (RGB) IFMAP. After training, the obtained accuracy was 67%. The accuracy obtained with the quantization using 16-bit words at the inputs was 66.98%, a value 0.02% smaller than the one obtained in TensorFlow with floating-point values.

Cadence Genus and Innovus tools were used for logic and physical synthesis, with 28nm technology and a frequency of 500MHz. The logic synthesis uses clock-gating to reduce the accelerator energy consumption. The power estimation uses the VCD file generated after a post-physical-synthesis simulation and the Cadence Voltus tool.

The netlist simulation inputs are the Cifar10 CNN layers (Figure 4) extracted from TensorFlow, according to the flow presented in Figure 2. Layer 0 uses a 32 × 32 × 3 IFMAP (RGB image from the CIFAR10 dataset), 16 3 × 3 filters, stride 2, generating a 15 × 15 × 16 output. Layer 1 uses a 15 × 15 × 16 IFMAP, 32 3 × 3 filters, stride 2, generating a 7 × 7 × 32 output. Layer 2 uses a 7 × 7 × 32 IFMAP, 64 3 × 3 filters, stride 2, generating a 3 × 3 × 64 output. Thus, power values come from a real dataset and not synthetic values. The total energy is computed by multiplying the average power by the number of clock cycles required to execute a complete convolution.

The Cacti-IO tool [53] models the external memories. For a 28nm 64KB SRAM, Cacti-IO reports 0.01356nJ for reading and 0.01351nJ for writing. For a 64kB DRAM, 0.1633nJ for reading and 0.1662nJ for writing.

B. MAC-Based DSE Flow Results

This Section evaluates the MAC-based DSE flow, using the physical synthesis flow as a reference. The goal is to demonstrate that estimating power and area from the required number of MACs (Section IV-B) does not produce accurate results.

Table II presents area results for the three dataflows and their versions with output buffers for CIFAR10 CNN layer 1. The MAC-based flow area is the same regardless of the dataflow type, because this flow only considers the arithmetic core plus input and output registers. The area of the synthesized accelerators does not include the output buffers, in such a way as to fairly compare both approaches. The difference observed between the non-buffered and buffered versions is due to the presence of control logic to access the OFMAP memory, not present in the buffered versions, translating into a larger area.

As expected, the MAC-based flow underestimates the accelerator area because it does not consider the control logic (FSMs), internal registers, internal buffers, and the logic to interconnect components. The area estimation error varies from 29.3% to 39.9%, with an average error of 35.9%.

Table III presents power results using the same setup as the previous table. The MAC-based flow power is the same across dataflows because this flow only considers the arithmetic core.

The MAC-based flow also presents significant errors in the power estimation. The main reasons for the differences observed in the power estimation are the switching activity and the presence of input buffers. The MAC-based flow uses an average of the switching activity of MACs and registers, fixing this value for the estimation (in this example, 1.16 mW). The physical synthesis uses the switching activity of the whole circuit with actual data. The actual switching activity may be smaller than the average due to the presence of IFMAP and weight values near zero. This fact justifies optimization
techniques such as pruning [54]. Power is overestimated for WS and OS, and underestimated for IS due to the presence of input buffers. The absolute power estimation error varies from 12.62% to 44.58%, with an average absolute error of 30.1% for layer 1.

We demonstrate from the above results that estimating area and power using the number of arithmetic operators produces results that are far from the actual hardware. Thus, we claim that methods such as the analytical flow evaluated below are needed to produce reliable PPA estimates for CNN hardware accelerators.

C. Analytic DSE Flow Results

This Section evaluates the proposed analytic DSE flow. The reference for the analytic DSE flow is the physical synthesis flow for the Cifar10 CNN layer 0.

Fig. 5. Area estimation using the analytic DSE flow, for the 3 Cifar10 layers (L0, L1, L2) (percentages represent the difference between flows).

1) Area Estimation: Figure 5 presents results for area estimation. The WS, IS, and OS areas are the same in layer L0 (error = 0), since the synthesis of this layer is the reference for the other layers. Due to the interpolation approach, dataflows with output buffers presented a small error in layer L0 (below 1%). The area estimation error stays below 4% for layers L1 and L2, except for OS in L2, with an error of 5.17%. The reason explaining this error is the increased number of output channels compared to L0, which affects the control logic and counters. The OS dataflow does not have buffers, requiring more complex circuitry to manage memory accesses.

It is worth highlighting that the IS implementations require an input buffer to store weight and bias values, which increases the accelerator area. The area for these buffers is not included in the area results, since this buffer acts as a cache memory, requiring an external memory. We adopted this approach because memory compilers are needed to generate these buffers. The use of memory compilers is considered future work. The IS for this CNN needs 7,168, 74,240, and 295,936 bits for layers L0, L1, and L2, respectively. Thus, in terms of total area, the IS dataflow is larger than the others, requiring further development to reduce this area, such as splitting the weights into small samples to minimize the buffer size instead of reading and storing all weight values in internal buffers.

The overall area estimation error stays below 6%, with an average error of 1.85% and a standard deviation of 1.51% (minimal error: 0.14%, maximum error: 5.17%).

Fig. 6. Performance estimation, in clock cycles, using the analytic DSE flow, for the 3 Cifar10 layers (L0, L1, L2), for SRAM and DRAM memories (percentages represent the difference between flows).

2) Performance Estimation: Figure 6 presents results for performance estimation, using the Equations from Section V-A. The performance results consider different memory latencies: two clock cycles for SRAM and five clock cycles for DRAM. The main source of error is due to synchronization states not included in the Equations. These synchronization states occur mainly at the end of a row, where buffers must be flushed to start a new one. The largest errors occur in layer L2, due to the larger number of output channels.

The overall performance estimation error stays below 10%, with an average error of 3.50% and a standard deviation of 2.78% (minimal error: 0.12%, maximum error: 9.29%).

3) Memory Accesses Estimation: Figure 7 presents results for memory accesses. Estimating the number of memory accesses is mandatory to estimate the total energy consumed by the accelerator.

The analytical model correctly captures the number of memory accesses, except for the IS and buffered IS readings, with the worst case occurring in layer 0 – 9.38%. The reason is similar to the error observed in the performance estimation, where there are synchronization states, mainly in the exchange of rows. At the end of a row, there are invalid reads to avoid increasing the complexity of the FSM. Thus, the memory stays active for some clock cycles, inducing these invalid reads. If the memory latency is small (2 cycles for SRAM), more invalid reads may occur, while this effect is masked for higher latencies (5 cycles for DRAM).
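Combining the pieces gives the total energy defined in Section V: core power × cycles × clock period, plus the per-access read/write energies. A minimal sketch follows, using the Cacti-IO SRAM figures quoted in Section VI-A and the 500 MHz clock of the setup; the core power argument would come from the fitted power model, and the cycle and access counts from Equations 2 through 9. Names are illustrative, not identifiers from the released scripts.

```python
E_READ_NJ, E_WRITE_NJ = 0.01356, 0.01351  # Cacti-IO, 28nm 64KB SRAM
CLK_PERIOD_NS = 2.0                       # 500 MHz, as in the setup

def total_energy_nj(core_power_mw: float, cycles: int,
                    reads: int, writes: int) -> float:
    """Total energy (nJ) = core energy + memory read/write energy."""
    core_pj = core_power_mw * cycles * CLK_PERIOD_NS  # mW x ns = pJ
    return core_pj * 1e-3 + reads * E_READ_NJ + writes * E_WRITE_NJ
```

Because the per-access energies dominate (Section VI-C.5), errors in the read/write counts translate almost directly into total energy errors, which is why only the IS dataflows show a mismatch above 1%.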


TABLE IV
CIFAR10 CNN POWER ANALYTIC RESULTS (ACCELERATOR CORE) - mW

TABLE V
CIFAR10 CNN ENERGY ANALYTIC RESULTS (ACCELERATOR CORE AND EXTERNAL MEMORIES) - nJ

Fig. 7. Estimated number of memory accesses.

4) Accelerator Power Estimation: Table IV shows the results for power estimation. Note that this table only considers the accelerator core. Similar to the area estimation, L0 is the reference. WS, IS, and OS present an absolute error equal to 0% (bold values in Table IV), while the buffered dataflows presented an error due to the interpolation approach.

TABLE VI
SUMMARY OF THE ANALYTIC APPROACH RESULTS

The power estimation has a larger error than the area and performance estimations. Two reasons explain this mismatch:
• The power reference is layer L0, with its switching activity. The switching activity of the other layers is different, affecting the power estimation. The switching activity is a function of the input data, and it is not possible to capture it in the analytic model.
• The buffered dataflows have an error induced by the interpolation method.

The overall power estimation error presents an average error of 7.00% with a standard deviation of 6.21% (minimal error: 0.05%, maximum error: 23.55%).

5) Total Energy Estimation: Table V presents results for the energy estimation, considering the accelerators and the memory accesses. Memory accesses are responsible for most of the consumed energy. According to [10], the memory can spend 200 times more energy than the accelerator array. As observed, there is a mismatch higher than 1% only in the IS dataflows, due to the errors in the memory reading estimations (Section VI-C.3). The errors in the IS dataflows occur due to the following reasons:
• The IS dataflow presents a small energy estimation error because the number of OFMAP accesses is higher
• The IS dataflow presents a small energy estimation error, below 3%, because the number of OFMAP accesses is higher than the number of IFMAP readings (where the estimation presents errors).
• The buffered IS dataflow makes more IFMAP readings (with an estimation error equal to 9.38%) than OFMAP writes. The result is a higher energy estimation error.
The overall energy estimation error stays below 7%, with an average error of 0.66% and a standard deviation of 1.54% (minimum error: 0%, maximum error: 6.26%).
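The composition behind Table V can be sketched as the accelerator core energy plus the per-access energy of the external memories, weighted by the access counts of Fig. 7. In the sketch below, only the structure reflects the model; all numeric values are hypothetical placeholders, not data from Table V:

    # Hypothetical composition of the total energy of one convolutional layer.
    core_power_mw = 2.5      # accelerator core power (Table IV style figure)
    exec_time_us = 120.0     # layer execution time from the performance model
    energy_per_access_nj = {"ifmap": 0.35, "weight": 0.35, "ofmap": 0.40}
    accesses = {"ifmap": 180_000, "weight": 20_000, "ofmap": 60_000}

    core_energy_nj = core_power_mw * exec_time_us  # mW x us = nJ
    memory_energy_nj = sum(energy_per_access_nj[k] * accesses[k] for k in accesses)
    total_energy_nj = core_energy_nj + memory_energy_nj

    print(f"core={core_energy_nj:.0f} nJ, memory={memory_energy_nj:.0f} nJ, "
          f"total={total_energy_nj:.0f} nJ")

Because the memory term dominates, a wrong IFMAP access count, as in the IS dataflows, translates directly into the energy mismatch discussed above.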
6) Analytic Model Summary: The proposed analytic flow enabled an accurate PPA estimation. Table VI summarizes the results. The main source of error is the estimation of the number of IFMAP readings for the IS dataflow, impacting the average power estimation. Improvements may also be made in the interpolation approach used to estimate the area and power consumption of the output buffers.
Besides the PPA accuracy, the method is fast. Below we compare the computing time of each flow.
• Physical synthesis flow. The physical synthesis of one accelerator can take up to 45 minutes (Intel [email protected], 28 cores, 64 GB memory). The DSE of a given CNN configuration takes several hours, considering all layers and channels.
• Analytical flow. It is necessary to synthesize only the first convolutional layer of each dataflow, with the remaining layers estimated from the first-layer results (see the sketch after this list). The obtained PPA data is used to extract the model. The DSE of a given CNN configuration took 0.025 seconds for the CNN evaluated in this section.
• MAC-based flow. It also requires a synthesis step, for the MAC and registers. The area and power obtained in this synthesis step are the values used to estimate the accelerator cost.
For comparison purposes, Shao et al. [43] mention that their simulator, Aladdin, takes 7 minutes to execute a full DSE of a given set of accelerators, against 52 hours for the synthesis flow.
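To illustrate why the analytical flow is orders of magnitude faster, the sketch below scales the synthesized L0 figures by the MAC count of each remaining layer. This proportional scaling is only a stand-in for the calibrated per-dataflow model of this work, which also accounts for switching activity and buffer effects; the L0 energy figure is a placeholder:

    def macs(ofmap_h, ofmap_w, c_out, k, c_in):
        # MAC operations of a convolutional layer with a k x k kernel.
        return ofmap_h * ofmap_w * c_out * k * k * c_in

    # Hypothetical L0 reference, synthesized once per dataflow.
    l0_macs = macs(32, 32, 64, 3, 3)
    l0_energy_nj = 5000.0  # placeholder figure

    def estimate_layer_energy_nj(ofmap_h, ofmap_w, c_out, k, c_in):
        # Proportional extrapolation from the L0 synthesis result
        # (illustrative only, not the calibrated model of this work).
        return l0_energy_nj * macs(ofmap_h, ofmap_w, c_out, k, c_in) / l0_macs

    print(estimate_layer_energy_nj(16, 16, 128, 3, 64))

Since the per-layer estimates reduce to arithmetic of this kind, a full DSE run completes in fractions of a second rather than hours of synthesis.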
D. Analytic Model Compared to the State-of-the-Art

TABLE VII
ANALYTIC AND STATE-OF-THE-ART RESULT ERRORS COMPARISON

Table VII compares the analytic model results with the results available in the literature. Power and energy consider only the accelerator, not the memories. The table presents, for each work, the min/max/average error, when available.
Few works in the literature present an estimation as comprehensive as the method proposed in this paper. MLPAT and Accelergy present a limited evaluation of area and energy. STONNE evaluates only performance, with higher errors compared to our method. Aladdin presents the most complete evaluation compared to the other methods. However, the proposed method has a more accurate area evaluation than Aladdin, while the evaluation of the other metrics has errors of the same order.
In conclusion, the proposed flow presents a more comprehensive evaluation, covering more metrics with lower errors, and can provide a large set of estimates for each dataflow.

VII. CONCLUSION

This work presented a fast, comprehensive, and accurate design space exploration analytic method for CNN hardware accelerators. The method integrates front-end frameworks (such as TensorFlow) with a hardware back-end design flow. Despite the coupling of the DSE framework to the accelerator architecture, the flow presented in this work is a guideline for executing DSE for other dataflows. We showed that methods based on basic components, such as MACs, are not enough to obtain accurate results, presenting errors between 12.51% and 44.68% for area and power. The average error of the analytical model compared to the data obtained from the physical synthesis is smaller than 7%. Compared to the literature, the proposed method shows a more accurate area evaluation, while the assessments of power and performance have errors of the same order.
The accelerators source code and the synthesis and DSE scripts are available at the following GitHub repository: https://github.com/leorezende93/acc_dse_env
Future work covers the accelerator and system levels. At the accelerator level, we plan to: (i) optimize the accuracy of the PPA results, such as the performance estimation of the IS dataflow; (ii) add other dataflows, such as NLR (No Local Reuse), RS (Row Stationary), and FG (Fine-Grained), and implement in hardware other CNN functions, such as max and average pooling; (iii) implement larger arrays, such as 16 × 16, and integrate the DSE flow with the ImageNet dataset to allow simulation of the accelerators using more complex CNNs; (iv) use both SRAM and DRAM, implementing a memory hierarchy scheme. At the system level, we plan to combine the DSE method with system simulators to perform DSE regarding an entire system composed of CPUs, DMA, and CNN accelerators.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[2] Facebook. (2022). Facebook Horizon. [Online]. Available: https://www.oculus.com/horizon-worlds/
[3] Google. (2022). Google Assistant, Your Own Personal Google. [Online]. Available: https://assistant.google.com


[4] ServiceNow. (2022). Enterprise Chatbot—Virtual Agent. [Online]. Available: https://assistant.google.com/
[5] Tesla. (2022). Autopilot. [Online]. Available: https://www.tesla.com
[6] S. S. Haykin, Neural Networks and Learning Machines, 3rd ed. London, U.K.: Pearson, 2009.
[7] (2022). Caffe. [Online]. Available: https://caffe.berkeleyvision.org/
[8] (2022). PyTorch. [Online]. Available: https://pytorch.org/
[9] (2022). TensorFlow. [Online]. Available: https://www.tensorflow.org/
[10] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2016.
[11] N. Strom, "Scalable distributed DNN training using commodity GPU cloud computing," in Proc. Interspeech, Sep. 2015, pp. 1488–1492.
[12] W. J. Dally, Y. Turakhia, and S. Han, "Domain-specific hardware accelerators," Commun. ACM, vol. 63, no. 7, pp. 48–57, Jun. 2020, doi: 10.1145/3361682.
[13] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 48–60, Jan. 2018.
[14] S. Shivapakash, H. Jain, O. Hellwich, and F. Gerfers, "A power efficient multi-bit accelerator for memory prohibitive deep neural networks," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–5.
[15] S.-F. Hsiao, K.-C. Chen, C.-C. Lin, H.-J. Chang, and B.-C. Tsai, "Design of a sparsity-aware reconfigurable deep learning accelerator supporting various types of operations," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 10, no. 3, pp. 376–387, Sep. 2020.
[16] F. Spagnolo, S. Perri, F. Frustaci, and P. Corsonello, "Reconfigurable convolution architecture for heterogeneous systems-on-chip," in Proc. 9th Medit. Conf. Embedded Comput. (MECO), Jun. 2020, pp. 1–5.
[17] S.-F. Hsiao and H.-J. Chang, "Sparsity-aware deep learning accelerator design supporting CNN and LSTM operations," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–4.
[18] Tesla. (2019). Autopilot and Full Self-Driving Capability. [Online]. Available: https://analyticsindiamag.com/under-the-hood-of-teslas-ai-chip-that-takes-the-driverless-battle-to-nvidias-home-turf/
[19] Apple. (2022). iPhone 11. [Online]. Available: https://www.apple.com/iphone-11/
[20] S. Das, A. Roy, K. K. Chandrasekharan, A. Deshwal, and S. Lee, "A systolic dataflow based accelerator for CNNs," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–5.
[21] C. Heidorn, F. Hannig, and J. Teich, "Design space exploration for layer-parallel execution of convolutional neural networks on CGRAs," in Proc. 23rd Int. Workshop Software Compilers Embedded Syst. (SCOPES), 2020, pp. 26–31.
[22] Y. Zhao, C. Li, Y. Wang, P. Xu, Y. Zhang, and Y. Lin, "DNN-chip predictor: An analytical performance predictor for DNN accelerators with various dataflows and hardware architectures," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 1593–1597.
[23] A. Parashar et al., "Timeloop: A systematic approach to DNN accelerator evaluation," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Mar. 2019, pp. 304–315.
[24] F. Muñoz-Martínez, J. L. Abellán, M. E. Acacio, and T. Krishna, "STONNE: A detailed architectural simulator for flexible neural network accelerators," Comput. Res. Repository, vol. 2006, no. 1, pp. 1–8, 2020.
[25] Y. N. Wu, J. S. Emer, and V. Sze, "Accelergy: An architecture-level energy estimation methodology for accelerator designs," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 2019, pp. 1–8.
[26] D. Giri, K.-L. Chiu, G. Di Guglielmo, P. Mantovani, and L. P. Carloni, "ESP4ML: Platform-based design of systems-on-chip for embedded machine learning," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1049–1054.
[27] A. Sohrabizadeh, Y. Bai, Y. Sun, and J. Cong, "Enabling automated FPGA accelerator optimization using graph neural networks," Comput. Res. Repository, vol. abs/2111.08848, pp. 1–12, Jun. 2021.
[28] G. Datta and P. A. Beerel, "Can deep neural networks be converted to ultra low-latency spiking neural networks?" in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2022, pp. 718–723.
[29] P. Panda, S. A. Aketi, and K. Roy, "Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization," Frontiers Neurosci., vol. 14, pp. 1–18, Aug. 2020.
[30] T. Tang and Y. Xie, "MLPAT: A power, area, timing modeling framework for machine learning accelerators," in Proc. IEEE Int. Workshop Domain Specific Syst. Archit. (DOSSA), Aug. 2018, pp. 1–3.
[31] H. Kwon, M. Pellauer, and T. Krishna, "MAESTRO: An open-source infrastructure for modeling dataflows within deep learning accelerators," Comput. Res. Repository, vol. abs/1805.02566, no. 1, pp. 1–5, 2018.
[32] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, "Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach," in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), May 2019, pp. 754–768.
[33] NVIDIA. (2022). NVDLA. [Online]. Available: http://nvdla.org/
[34] C. Hao et al., "FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge," in Proc. ACM/IEEE Design Automat. Conf. (DAC), Mar. 2019, pp. 1–6.
[35] X. Zhang, H. Ye, and D. Chen, "Being-ahead: Benchmarking and exploring accelerators for hardware-efficient AI deployment," Comput. Res. Repository, vol. abs/2104.02251, no. 1, pp. 1–12, Jun. 2021.
[36] H. Genc et al., "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration," in Proc. 58th ACM/IEEE Design Autom. Conf. (DAC), Dec. 2021, pp. 769–774.
[37] K. Asanovic et al., "The rocket chip generator," Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2016-17, 2016. [Online]. Available: https://aspire.eecs.berkeley.edu/wp/wp-content/uploads/2016/04/Tech-Report-The-Rocket-Chip-Generator-Beamer.pdf
[38] X. Yang et al., "Interstellar: Using halide's scheduling language to analyze DNN accelerators," in Proc. ACM Int. Conf. Architectural Support Program. Lang. Operating Syst. (ASPLOS), 2020, pp. 369–383.
[39] S. D. Manasi and S. S. Sapatnekar, "DeepOpt: Optimized scheduling of CNN workloads for ASIC-based systolic deep learning accelerators," in Proc. ACM/IEEE Asia South Pacific Design Automat. Conf. (ASPDAC), May 2021, pp. 235–241.
[40] A. Karbachevsky et al., "Early-stage neural network hardware performance analysis," MDPI Sustainability, vol. 13, no. 2, p. 717, 2021.
[41] C. Baskin et al., "UNIQ: Uniform noise injection for non-uniform quantization of neural networks," ACM Trans. Comput. Syst., vol. 37, nos. 1–4, pp. 1–15, 2021.
[42] M. Ferianc et al., "Improving performance estimation for design space exploration for convolutional neural network accelerators," MDPI Electron., vol. 10, no. 4, pp. 1–14, 2021.
[43] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in Proc. ACM Int. Symp. Comput. Archit. (ISCA), 2014, pp. 97–108.
[44] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-SIM: Systolic CNN accelerator," Comput. Res. Repository, vol. abs/1811.02883, no. 1, pp. 1–11, 2018.
[45] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "A systematic methodology for characterizing scalability of DNN accelerators using SCALE-sim," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Aug. 2020, pp. 58–68.
[46] H. Kwon, A. Samajdar, and T. Krishna, "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects," ACM Special Interest Group Program. Lang. Notices, vol. 53, no. 2, pp. 461–475, 2018.
[47] S. Kim et al., "Transaction-level model simulator for communication-limited accelerators," Comput. Res. Repository, vol. abs/2007.14897, no. 1, pp. 1–11, 2020.
[48] S. Cao, W. Deng, Z. Bao, C. Xue, S. Xu, and S. Zhang, "SimuNN: A pre-RTL inference, simulation and evaluation framework for neural networks," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 10, no. 2, pp. 217–230, 2020.
[49] Keras. (2022). PReLU Layer. [Online]. Available: https://keras.io/api/layers/activations/
[50] L. R. Juracy, M. T. Moreira, A. M. Morais, A. Hampel, and F. G. Moraes, "A high-level modeling framework for estimating hardware metrics of CNN accelerators," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 11, pp. 4783–4795, 2021.
[51] D. Moolchandani, A. Kumar, and S. R. Sarangi, "Accelerating CNN inference on ASICs: A survey," J. Syst. Archit., vol. 113, no. 1, pp. 1–26, 2021.
[52] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Comput. Res. Repository, vol. abs/1409.1556, no. 1, pp. 1–14, 2014.
[53] N. P. Jouppi, A. B. Kahng, N. Muralimanohar, and V. Srinivas, "CACTI-IO: CACTI with off-chip power-area-timing models," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 7, pp. 1254–1267, 2014.


[54] N. J. Kim and H. Kim, "FP-AGL: Filter pruning with adaptive gradient learning for accelerating deep convolutional neural networks," IEEE Trans. Multimedia, early access, Jul. 1, 2022, doi: 10.1109/TMM.2022.3189496.

Leonardo Rezende Juracy received the bachelor's degree in computer engineering and the M.Sc. degree in computer science from the Pontifical Catholic University of Rio Grande do Sul (PUCRS), Porto Alegre, Brazil, in 2015 and 2018, respectively, where he is currently pursuing the Ph.D. degree. His research interests include design for testability, fault-tolerant designs, asynchronous designs, resilient designs, and machine learning hardware accelerators.

Alexandre de Morais Amory received the Ph.D. degree in computer science from UFRGS, Brazil, in 2007. His professional experience includes Lead Verification Engineer at CEITEC Design House from 2007 to 2009, Post-Doctoral Fellow at PUCRS from 2009 to 2012, and Professor at PUCRS from 2012 to 2020. He is currently a Research Fellow at Scuola Superiore Sant'Anna, Italy. His research interests include design, test, fault-tolerance, and safety-critical systems.

Fernando Gehm Moraes (Senior Member, IEEE) received the bachelor's degree in electrical engineering and the M.Sc. degree from UFRGS, Brazil, in 1987 and 1990, respectively, and the Ph.D. degree from the Laboratoire d'Informatique, Robotique et Microélectronique de Montpellier, France, in 1994. Since 2002, he has been a Full Professor at PUCRS University. He has authored and coauthored 48 peer-refereed journal articles in the field of VLSI design. His primary research interests include microelectronics, security, MPSoCs, NoCs, and hardware accelerators.

