A Fast Accurate and Comprehensive PPA Estimation of Convolutional Hardware Accelerators
Abstract—Convolutional Neural Networks (CNN) are widely adopted for Machine Learning (ML) tasks, such as classification and computer vision. GPUs became the reference platforms for both the training and inference phases of CNNs due to their architecture tailored to the CNN operators. However, GPUs are power-hungry architectures. A path to enable the deployment of CNNs in energy-constrained devices is adopting hardware accelerators for the inference phase. However, the literature presents gaps regarding analyses and comparisons of these accelerators to evaluate Power-Performance-Area (PPA) trade-offs. Typically, the literature estimates PPA from the number of executed operations during the inference phase, such as the number of MACs, which may not be a good proxy for PPA. Thus, it is necessary to deliver accurate hardware estimations, enabling design space exploration (DSE) to deploy CNNs according to the design constraints. This work proposes a fast and accurate DSE approach for CNNs using an analytical model fitted from the physical synthesis of hardware accelerators. The model is integrated with CNN frameworks, like TensorFlow, to generate accurate results. The analytic model estimates area, performance, power, energy, and memory accesses. The observed average error comparing the analytical model to the data obtained from the physical synthesis is smaller than 7%.

Index Terms—CNN, convolutional hardware accelerator, power-performance-area (PPA) estimation, design space exploration (DSE).

I. INTRODUCTION

One of the most common ways to deliver ML is by using Artificial Neural Networks (ANN), particularly Convolutional Neural Networks (CNN). CNNs have the advantage of having sparse connections, in contrast to fully connected ANNs, where all neurons of one layer are connected to all neurons of the next layer.

A CNN contains four main layers: (i) the convolutional layer, which is the CNN core and performs the synapses by multiplying and accumulating weights and input feature maps; (ii) the activation function, a nonlinear transformation sent to the next layer of neurons; (iii) the pooling layer, used to reduce the amount of data processed by the CNN; (iv) the fully connected layer, used to produce the classification result.
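As a concrete illustration, the sketch below expresses these four layer types with the TensorFlow Keras API. The topology is illustrative only; it is not the case-study CNN evaluated later in this work.

```python
# Illustrative sketch only: the four CNN layer types listed above,
# expressed with the TensorFlow Keras API (not the case-study CNN).
import tensorflow as tf

model = tf.keras.Sequential([
    # (i) convolutional layer: multiplies and accumulates weights
    # with input feature maps (IFMAPs)
    tf.keras.layers.Conv2D(16, (3, 3), strides=2, input_shape=(32, 32, 3)),
    # (ii) activation function: nonlinear transformation passed to
    # the next layer of neurons
    tf.keras.layers.ReLU(),
    # (iii) pooling layer: reduces the amount of data processed
    tf.keras.layers.MaxPooling2D((2, 2)),
    # (iv) fully connected layer: produces the classification result
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()
```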
The deployment of CNN applications comprises two phases [6]. The first is training, which defines the weights of the synapses. The second is inference, which uses the weights previously computed during the training phase to classify or predict output values based on the inputs. A CNN can correctly classify inputs not used in the training phase.

The success of CNNs led to the development of frameworks that help developers build their models by offering the mechanisms required for training and inference. Examples of frameworks include Caffe [7], PyTorch [8], and TensorFlow [9]. These frameworks use a high-level approach to abstract the implementation of functions, such as convolution, and aid
implementations. For example, Eyeriss proposes a comparison between accelerators but lacks an evaluation of performance or area trade-offs [10]. Also, some works compare accelerators considering different technology nodes, resulting in an unfair analysis [20].

For design space exploration (DSE), the literature presents works using analytic models integrated into frameworks and system simulators. Analytical models estimate the power, performance, and area (PPA) of hardware accelerators for a given hardware constraint [21], [22]. System simulators [23], [24] describe accelerators in high-level languages, like Python and C++, reducing the design time and providing PPA evaluation. Both analytical and simulator approaches have PPA accuracy as a drawback, since PPA is typically estimated from the number of required MACs (multiply-accumulate operators) [23], [25].
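For reference, the MAC-count proxy these approaches rely on reduces to a closed-form operation count, as in the sketch below. This is the standard formula for a convolutional layer without padding, not code from any of the cited tools.

```python
def conv_macs(h_in, w_in, c_in, c_out, k, stride):
    """MACs of one convolutional layer: one multiply-accumulate per
    filter tap, input channel, output channel, and output position
    (standard formula, no padding)."""
    h_out = (h_in - k) // stride + 1
    w_out = (w_in - k) // stride + 1
    return h_out * w_out * c_out * k * k * c_in

# Example: a 32x32x3 IFMAP with 16 3x3 filters, stride 2 -> 97,200 MACs.
print(conv_macs(32, 32, 3, 16, k=3, stride=2))
```

As the results in Section VI show, accelerators with identical MAC counts can differ widely in actual PPA, which is precisely why this proxy is inaccurate.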
Despite the efforts to increase the abstraction level to implement CNN accelerators using high-level synthesis (HLS) [26], this approach presents several challenges, such as: (i) long synthesis time; (ii) a huge design space; (iii) the effective impact of the design parameters on the hardware [27].

The goal of this paper is to propose a method to perform a comprehensive DSE in a fast and accurate way, integrated with an ML framework, to estimate the costs of the hardware accelerator parameters. Besides the CNN parameters, the proposed method also allows the user to change hardware parameters, such as the dataflow type (weight stationary (WS), input stationary (IS), and output stationary (OS)), memory type, memory latency, and accelerator frequency. We also demonstrate that estimating PPA values from the number of MACs produces an inaccurate PPA estimation.
The original contributions of this work include:

1) Adoption of an ML framework (TensorFlow) as a front-end to perform DSE for CNN hardware accelerators;
2) The use of data from the physical synthesis of the complete hardware accelerators, not only from basic components, to obtain accurate results;
3) A method to fairly compare different CNN hardware accelerators, allowing the comparison of accelerators with different dataflows under the same characteristics, such as the technology node, frequency, and memory type;
4) An analytical method to perform DSE. A set of equations derived from the physical synthesis flow enables the ML framework to estimate power, performance, and area. The analytical model, integrated into the ML framework, is the key to obtaining a fast and accurate DSE.

This paper is organized as follows. Section II reviews CNN hardware accelerators and simulators, positioning our work with regard to the state-of-the-art. Section III introduces the convolutional accelerator architecture proposed in this work. Section IV presents two DSE flows, the first based on physical synthesis and the latter on MAC counting. Section V details the main original contribution of this work, the proposed DSE method based on an analytical approach to produce PPA results. Section VI compares the proposed DSE method with the flow described in Section IV and with related works. Finally, Section VII concludes this paper, pointing out directions for future work.
II. STATE-OF-THE-ART

Section II-A describes frameworks enabling the design and DSE of hardware accelerators. Section II-B presents CNN simulators. Section II-C qualitatively compares the presented works with regard to our proposal.

Besides CNNs, the literature also presents Spiking Neural Network (SNN) proposals [28], [29]. SNNs are based on the computation of firing patterns, similar to the human brain, and are commonly implemented using an analog approach. The advantage of using SNNs is reduced area and energy consumption, since analog components can be smaller than a digital adder or multiplier, allowing the connection of thousands of neurons with low area cost. However, our focus is on digital implementations, so SNNs are out of the scope of this work.

A. Frameworks for CNN Hardware Accelerators

MLPAT [30] is a framework that allows modeling power, area, and timing for machine learning accelerators. MLPAT models components such as systolic arrays, on-chip memory, and the activation pipeline. Also, MLPAT supports different precision types, which allows validating the trade-off between accuracy and precision, and different dataflows, such as WS and OS. As input, MLPAT allows specifying the accelerator architecture, the circuit, and the technology. The framework generates an optimized chip representation to report the results, such as area, power, and performance.

MAESTRO [31], [32] is a framework to describe and analyze neural network hardware, which allows obtaining the hardware cost to implement a target architecture. It has a domain-specific language to describe the dataflow that allows specifying the number of PEs, memory size, and NoC bandwidth parameters. The results generated by the framework are focused on performance analyses. In recent work, MAESTRO was used to estimate trade-offs between execution time and energy efficiency for CNN models, such as VGG and AlexNet.

Timeloop [23] is a DSE framework for CNNs. It can emulate a set of accelerators, such as NVDLA [33]. Timeloop focuses on convolution layer analyses. Timeloop uses as input a workload description, such as input dimensions and weight values, a hardware architecture description, such as arithmetic modules, and hardware constraints. Instead of using a cycle-accurate simulator, Timeloop uses the deterministic behavior of data transfers to perform analytic analyses. As energy models, Timeloop has memory, arithmetic unit, and wire/network models based on TSMC 16nm FinFET.

Accelergy [25] allows estimating the energy of accelerators without a complete hardware description, using a library of basic components. Accelergy uses a high-level architectural description to capture the circuit behavior characteristics, such as memory reads. Accelergy considers the number of memory reads and the memory access pattern, which can be random or repeated access to the same address.
Heidorn et al. [21] propose an analytical model that estimates throughput and energy for a given hardware constraint. A DSE is proposed to determine the accelerator architecture limits in terms of throughput, number of parallel operations, and memory. The authors propose an accelerator to evaluate the model with a tile-local memory, a bus, and a coarse-grained reconfigurable array (CGRA). Each CGRA presents a two-dimensional array of PEs, and the accelerator can have more than one CGRA to parallelize the processing.

Zhao et al. [22] propose an analytical performance predictor to estimate energy, throughput, and latency for ASICs and FPGAs. The predictor uses DNN models, the hardware architecture, dataflow types, and the hardware cost of a given technology node. The results are generated with the AlexNet and SkyNet CNN models, with Eyeriss, an FPGA implementation from [34], and synthesized results of a proposed accelerator.

DNNExplorer [35] is a framework for DSE of ML accelerators. DNNExplorer supports machine learning frameworks (Caffe and PyTorch), besides three accelerator architectures. The architecture also supports the WS and IS dataflows. This framework adopts analytical models to estimate performance and hardware configuration.

Gemmini [36] is an open-source systolic array generator that allows evaluating deep-learning architectures. Gemmini generates a custom ASIC accelerator for matrix multiplication based on a systolic array architecture. Gemmini is compatible with the RISC-V Rocket ecosystem [37].

Interstellar [38] is a DSE framework that uses the Halide language (https://halide-lang.org) to generate hardware and compare different accelerators, such as different dataflows (WS, OS, RS) in 2D arrays and MAC tree schemes. The authors propose a systematic approach to describe the design space of DNN accelerators using Halide. The framework also optimizes the memory hierarchy.

DeepOpt [39] is a DSE framework to explore ASIC implementations of systolic hardware accelerators for CNNs. The main goal of this DSE is to reduce the number of memory accesses based on hardware characteristics like on-chip SRAMs and the number of parallel PEs. DeepOpt uses a search tree to schedule the convolution process. Thus, it is possible to minimize the number of memory accesses by modeling memory access patterns (weight and output stationary) and pruning branches from the search tree.
Karbachevsky et al. [40] propose a method to estimate area and power values based on the bit operations performed (BOP) metric [41]. BOP is the number of bit operations required to perform the calculation, defined by the input bit size, output bit size, number of inputs, and number of outputs. According to the authors, the BOP metric allows estimating the area and power required by the accelerator hardware with high accuracy in the early stages of the design process. Also, the method can show the trade-off between the number of PEs and the bottlenecks caused by parameter quantization, such as memory bandwidth or computational resources.
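The sketch below illustrates this style of metric under one common reading, in which a multiply costs the product of the operand bit widths and each accumulation costs the accumulator width. The exact formulation in [40] and [41] may differ in its constants, so treat this as an assumption-laden illustration rather than the authors' equation.

```python
import math

def conv_bops(n_in, n_out, k, b_w, b_a):
    """Approximate bit operations (BOPs) per output activation of a
    convolutional layer, under our reading of the BOP metric: each
    multiply costs b_w*b_a bit operations; each accumulation costs the
    accumulator width b_a + b_w + log2(#accumulated terms). The exact
    formulation in [41] may differ in constants."""
    terms = n_in * k * k                      # values accumulated per output
    acc_width = b_a + b_w + math.ceil(math.log2(terms))
    return n_out * terms * (b_w * b_a + acc_width)

# Example: 16 -> 32 channels, 3x3 filters, 8-bit weights and activations.
print(conv_bops(16, 32, k=3, b_w=8, b_a=8))
```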
Ferianc et al. [42] propose a method to improve the performance of DSE analyses. The method is based on a Gaussian process regression model parameterized by the features of the accelerator and the target CNN, such as filter, channel, and data parallelism. The method is capable of predicting the hardware latency and energy, and was compared to machine learning-based methods to perform DSE (linear regression, gradient tree boosting, and neural networks).

Aladdin [43] is a pre-RTL power-performance accelerator modeling framework. It estimates performance, power, and area. The Aladdin infrastructure uses dynamic data dependence graphs (DDDG) to represent accelerators. The DDDG is generated from a C program and allows Aladdin to report the program dependencies and resource constraints.

B. Hardware Simulators

SCALE-Sim (Systolic CNN Accelerator Simulator) [44], [45] is a systolic array cycle-accurate simulator. This simulator allows configuring micro-architectural features such as array size, array aspect ratio, scratchpad memory size, and dataflow mapping strategy. Also, it is possible to configure system integration parameters, such as memory bandwidth. SCALE-Sim simulates convolutions and matrix multiplications, and models the compute unit as a systolic array. Also, it allows simulation in a system context with CPU and DMA components.

STONNE [24] is a cycle-accurate architecture simulator for CNNs which allows end-to-end evaluation. It is connected with the Caffe framework [7] to generate the CNNs, and models the MAERI accelerator [46]. The results are focused on performance and hardware utilization. To estimate area and energy, STONNE uses the Accelergy energy estimation methodology [25], which considers basic modules, such as adders, to calculate the energy values.

AccTLMSim [47] is a pre-RTL cycle-accurate CNN accelerator simulator based on SystemC transaction-level modeling (TLM). The simulator allows maximizing the throughput performance for a given on-chip SRAM size. An accelerator is proposed to validate the simulator. AccTLMSim is focused only on performance, not power or area.

C. Summary Related to DSE Frameworks and Simulators

Table I summarizes the reviewed works. The second column indicates whether the work has integration with high-level CNN modeling frameworks, such as TensorFlow and Caffe. The third and fourth columns are related to the evaluated metrics. The third column presents metrics based on basic components, such as MACs and register files. The fourth column shows the evaluated metrics regarding the entire convolution, our original contribution.

MAESTRO [32] does not allow accelerator simulation, limiting the performance evaluation (e.g., throughput). SCALE-Sim [45] does not provide power or energy results. MLPAT [30] and Timeloop [23] provide PPA based on basic operations, such as adders and multipliers. Methods relying on operation counting do not consider how these operators are interconnected (e.g., 1D or 2D systolic arrays or adder trees), resulting in imprecise hardware metrics.

Works [21] and [22] show analytical results for power, performance, and area. Also, [22] considers features like the dataflow type, which can contribute to the power consumption. The same occurs with Aladdin [43].
Fig. 1. Generic architecture and the modules required to build the convolutional accelerators.
• Weight Stationary – WS. Stores the weights in an internal buffer, aiming at their reuse. Thus, each weight value is read once from the input memory, and the convolution is performed using stationary weights and IFMAP values read from memory.
• Input Stationary – IS. Stores an IFMAP window in an internal buffer to provide its reuse. The window size is equal to the filter size. The accelerator reads the IFMAP values once and reads the weight values from memory.
• Output Stationary – OS. Stores partial convolution results in registers. The OS does not present buffers to store the inputs; each convolution fetches the IFMAP and weight values from the input memory.
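To make the reuse patterns concrete, the sketch below contrasts the memory traffic implied by the WS dataflow for a single 3 × 3 filter. It is a behavioral illustration in Python, not the RTL of the accelerators.

```python
import numpy as np

def conv2d_ws(ifmap, weights, stride=2):
    """Behavioral sketch of the WS dataflow: the 3x3 weights are read
    once from memory into a local buffer and reused for every output
    position, while an IFMAP window is fetched per output."""
    k = weights.shape[0]
    h_out = (ifmap.shape[0] - k) // stride + 1
    w_out = (ifmap.shape[1] - k) // stride + 1
    w_buf = weights.copy()               # single read of the weights (WS buffer)
    ofmap = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            # IFMAP window fetched from memory for every output (no input reuse)
            window = ifmap[i*stride:i*stride+k, j*stride:j*stride+k]
            ofmap[i, j] = np.sum(window * w_buf)   # 3x3 MAC array operation
    return ofmap

# IS would instead buffer the IFMAP window and stream the weights;
# OS keeps partial sums in registers and fetches both operands per MAC.
out = conv2d_ws(np.random.rand(32, 32), np.random.rand(3, 3))
```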
Figure 1 presents the set of signals between the convolutional core and the memories, and the signals used by the hardware controlling the accelerator (start_conv and end_conv). Both memories use a standard set of buses, such as addresses, output data (ifmap_value and pixel_in), input data (pixel_out), memory enable, and write notification. Memories have different implementation technologies (e.g., SRAM or DRAM) and latencies, varying according to their specification. Thus, it is necessary to model the memory latency to estimate the power and performance of the convolutional accelerator. A validity signal (ifmap_valid and ofmap_valid) models the memory latency, indicating the end of memory accesses.

This unified memory interface makes it possible to implement different dataflows with distinct protocols, allowing a fair comparison between the accelerators.

The accelerators adopted in this work have a 3 × 3 MAC array, and the convolution stride equals 2. Despite being a design limitation, state-of-the-art CNNs adopt these values, such as VGG16 [52], ensuring that the proposed accelerators reflect real CNNs.
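A minimal sketch of how such a latency model can enter a performance estimate is shown below, assuming a fixed latency charged per memory access. This is a hypothetical illustration; the actual model of this work is fitted from the physical synthesis flow of Section V.

```python
def conv_cycles(n_macs, n_ifmap_reads, n_weight_reads, n_ofmap_writes,
                mem_latency, macs_per_cycle=9):
    """Illustrative cycle estimate: compute cycles for a 3x3 MAC array
    (9 MACs per cycle, an assumption) plus stall cycles charged per
    memory access at a fixed latency. Hypothetical model for
    illustration; the paper's analytic model is derived from physical
    synthesis data instead."""
    compute = n_macs // macs_per_cycle
    stalls = (n_ifmap_reads + n_weight_reads + n_ofmap_writes) * mem_latency
    return compute + stalls
```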
IV. PHYSICAL SYNTHESIS AND MAC-BASED FLOWS

This Section introduces two flows to execute DSE. The first one uses a standard-cell synthesis flow. Results for one instance of each dataflow are the basis for the main original contribution of this work, the analytical DSE flow, presented in the next Section. The second flow uses the number of MACs, a method used in related work.

A. Physical Synthesis Flow

The DSE physical synthesis flow performs both the logical and physical synthesis steps and uses IFMAP and weight data from real CNNs to obtain PPA results. This flow includes the steps described below.

1) TensorFlow Modeling: Model the CNN application in the TensorFlow framework to generate the VHDL packages used in the RTL and gate-level simulations, and the gold model (expected output values). The VHDL packages contain the IFMAP, weight, bias, and OFMAP values. This step also generates parameters for configuring the RTL description, such as the sizes of the filters, OFMAP, and IFMAP. This step is generic, supporting different CNNs, such as MNIST or CIFAR10.

2) RTL Simulation: Verification of the behavior of the accelerator description. This step uses the VHDL packages, comparing the results against the gold model. It is necessary to check if the simulation output matches the expected values during the development of a new accelerator. Once the accelerator description is validated, it is possible to bypass this step.

3) Physical Synthesis: This step comprises logical and physical synthesis. The logic synthesis inputs include the RTL accelerator description, technology files (LEF and LIB files), and constraints (such as clock frequency or power). The physical synthesis corresponds to the placement and routing. This step generates a gate-level netlist with annotated wire capacitances (SDF file).

4) Annotated Gate-Level Simulation and Power Estimation: Simulation of the annotated gate-level netlist, also using the VHDL packages and the gold model. This step may fail due to the applied constraints, such as clock frequency and input/output delays. In this case, the designer must modify the constraints used in the physical synthesis. The output of this simulation is a VCD file, with the switching activity induced by the CNN IFMAP and weight values. The power estimation tool uses the VCD file to estimate the accelerator power dissipation.

The execution of this flow produces an accurate PPA estimation for a given accelerator architecture with actual CNN data. However, it is necessary to execute this flow for each new set of weights and IFMAPs. The reason is that different data sets present different switching activities, changing the power dissipation. Also, the hardware may show differences due to the number of channels in a given layer or the IFMAP and OFMAP sizes, changing the number of bits in counters or the buffers' depth. Thus, we have an accurate PPA, but it requires a significant processing time, which takes hours.

B. MAC-Based Flow

This Section describes a DSE flow based on estimating the required number of MACs. Several works in the literature use the MAC-based method to evaluate area and power [21], [22], [23], [30], [48]. The MAC-based flow does not model memory accesses, having as its primary goal a fast area and power estimation.

The MAC-based flow used in this work considers the power and area values related to the MACs and registers extracted from the physical synthesis flow. The number of required MACs and registers and the data width are a function of the accelerator design. The performance, i.e., the number of clock cycles to execute a complete convolution, is obtained through the RTL simulation. The estimation itself reduces to scaling these unit costs, as shown in the sketch below.

The area and power accuracy of this flow is expected to be worse than that of the physical synthesis flow, as it does not consider the control circuitry (such as FSMs), buffers, and accesses to memory.
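The sketch below captures the arithmetic of the MAC-based estimation; the unit costs are placeholders standing in for the values extracted from the physical synthesis of a single MAC and register.

```python
def mac_based_ppa(n_macs, n_regs, area_mac, area_reg, power_mac, power_reg):
    """MAC-based estimate: total area and power are the unit values of
    a synthesized MAC and register scaled by their counts. Unit costs
    below are hypothetical placeholders, not the paper's numbers."""
    return {"area": n_macs * area_mac + n_regs * area_reg,
            "power": n_macs * power_mac + n_regs * power_reg}

# A 3x3 MAC array with hypothetical unit costs (um^2, mW):
print(mac_based_ppa(9, 18, area_mac=900.0, area_reg=50.0,
                    power_mac=0.10, power_reg=0.01))
```

As Section VI-B quantifies, this estimate misses the control logic, buffers, and interconnect, underestimating area by 29.3% to 39.9%.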
V. ANALYTIC DSE FLOW

The proposed DSE flow analytically estimates the PPA of CNN layers using results obtained from the physical synthesis of one CNN layer. This flow is faster than the physical synthesis flow.
VI. RESULTS

Section VI-A details the experimental setup adopted to obtain the results. Next, we evaluate the MAC-based and analytic flows using the physical synthesis flow as the reference, in Section VI-B and Section VI-C, respectively. Section VI-D compares the analytical model to related works.

A. Experimental Setup

We adopt the CNN illustrated in Figure 4 as a case study, with three convolutional layers and one fully-connected layer. TensorFlow executes the fully-connected layer, which is not accelerated in hardware. The number of filters per layer is 16, 32, and 64. The CNN implemented in TensorFlow uses the Cifar10 dataset with a 32 × 32 × 3 (RGB) IFMAP. After training, the obtained accuracy was 67%. The accuracy obtained with quantization using 16-bit words at the inputs was 66.98%, a value 0.02% smaller than the one obtained in TensorFlow with floating-point values.
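The 16-bit quantization can be reproduced by rounding the inputs and weights to a fixed-point grid before evaluation, as in the sketch below. The fraction-bit split is an illustrative assumption; the paper does not state the exact format.

```python
import numpy as np

def quantize_q(x, total_bits=16, frac_bits=8):
    """Round to a fixed-point grid with the given fractional bits and
    saturate to the 16-bit two's-complement range. The 8.8 split is an
    illustrative assumption, not the paper's stated format."""
    scale = 2.0 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    return np.clip(np.round(x * scale), lo, hi) / scale

# Quantize weights and IFMAPs, re-evaluate the model, and compare
# accuracy (67% float vs. 66.98% with 16-bit words in this case study).
```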
Cadence Genus and Innovus tools were used for the logic and physical synthesis, with 28nm technology and a frequency of 500MHz. The logic synthesis uses clock-gating to reduce the accelerator energy consumption. The power dissipation estimation uses the VCD file generated after a post-physical-synthesis simulation and the Cadence Voltus tool.

The netlist simulation inputs are the Cifar10 CNN layers (Figure 4) extracted from TensorFlow, according to the flow presented in Figure 2. Layer 0 uses a 32 × 32 × 3 IFMAP (RGB image from the CIFAR10 dataset) and 16 3 × 3 filters with stride 2, generating a 15 × 15 × 16 output. Layer 1 uses a 15 × 15 × 16 IFMAP and 32 3 × 3 filters with stride 2, generating a 7 × 7 × 32 output. Layer 2 uses a 7 × 7 × 32 IFMAP and 64 3 × 3 filters with stride 2, generating a 3 × 3 × 64 output. Thus, the power values come from a real dataset and not from synthetic values. The total energy is computed by multiplying the average power by the number of clock cycles required to execute a complete convolution.

The Cacti-IO tool [53] models the external memories. For a 28nm 64KB SRAM, Cacti-IO reports 0.01356nJ for reading and 0.01351nJ for writing. For a 64KB DRAM, it reports 0.1633nJ for reading and 0.1662nJ for writing.
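Put as code, the energy bookkeeping behind the result tables looks like the sketch below. The combination of core and memory terms follows the description above (average power times execution time, plus per-access Cacti-IO energies); the function and variable names are ours.

```python
CLK_PERIOD_NS = 2.0                        # 500 MHz clock
E_SRAM_READ_NJ = 0.01356                   # Cacti-IO, 28nm 64KB SRAM
E_SRAM_WRITE_NJ = 0.01351

def total_energy_nj(avg_power_mw, cycles, n_reads, n_writes):
    """Total energy sketch: accelerator core energy (average power x
    execution time) plus external-memory energy (per-access values x
    access counts). Unit note: 1 mW = 1e-3 nJ/ns."""
    core_nj = avg_power_mw * 1e-3 * cycles * CLK_PERIOD_NS
    mem_nj = n_reads * E_SRAM_READ_NJ + n_writes * E_SRAM_WRITE_NJ
    return core_nj + mem_nj
```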
B. MAC-Based DSE Flow Results

This Section evaluates the MAC-based DSE flow, using the physical synthesis flow as a reference. The goal is to demonstrate that estimating power and area from the required amount of MACs (Section IV-B) does not produce accurate results.

Table II presents area results for the three dataflows and their versions with output buffers for CIFAR10 CNN layer 1. The MAC-based flow area is the same, regardless of the dataflow type, because this flow only considers the arithmetic core plus input and output registers. The area of the synthesized accelerators does not include the output buffers, in such a way as to fairly compare both approaches. The difference observed between the non-buffered and buffered versions is due to the presence of control logic to access the OFMAP memory, not present in the buffered versions, translating into a larger area.

As expected, the MAC-based flow underestimates the accelerator area because it does not consider the control logic (FSMs), internal registers, internal buffers, and the logic to interconnect components. The area estimation error varies from 29.3% to 39.9%, with an average error of 35.9%.

Table III presents power results using the same setup as the previous table. The MAC-based flow power is the same for all dataflows because this flow only considers the arithmetic core.

The MAC-based flow also presents significant errors in the power estimation. The main reasons for the differences observed in the power estimation are the switching activity and the presence of input buffers. The MAC-based flow uses an average of the switching activity of the MACs and registers, fixing this value for the estimation (in this example, 1.16 mW). The physical synthesis uses the switching activity of the whole circuit with actual data. The actual switching activity may be smaller than the average due to the presence of IFMAP and weight values near zero. This fact justifies optimization
Fig. 5. Area estimation using the analytic DSE flow, for the 3 Cifar10 layers (L0, L1, L2) (percentages represent the difference between flows).
TABLE IV
CIFAR10 CNN Power Analytic Results (Accelerator Core) - mW

TABLE V
CIFAR10 CNN Energy Analytic Results (Accelerator Core and External Memories) - nJ

TABLE VI
Summary of the Analytic Approach Results
Fig. 7. Estimated number of memory accesses.

4) Accelerator Power Estimation: Table IV shows the results for the power estimation. Note that this table only considers the accelerator core. Similar to the area estimation, L0 is the reference. WS, IS, and OS present an absolute error equal to 0% (bold values in Table IV), while the buffered dataflows presented an error due to the interpolation approach.

The power estimation has a larger error than the area and performance estimations. Two reasons explain this mismatch:

• The power reference is layer L0, with its switching activity. The switching activity of the other layers is different, affecting the power estimation. The switching activity is a function of the input data, and it is not possible to capture it in the analytic model.
• The buffered dataflows have an error induced by the interpolation method.

The overall power estimation error presents an average of 7.00% with a standard deviation of 6.21% (minimal error: 0.05%, maximum error: 23.55%).
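The detailed model equations appear in the part of Section V not reproduced here. Conceptually, the estimation scales the synthesized reference-layer (L0) figures by layer dimensions and interpolates the buffered-dataflow costs, along the lines of the hedged sketch below. The linear forms are our illustration, not the paper's exact fitted equations.

```python
def scale_from_reference(ref_value, ref_channels, channels):
    """Scale a synthesized reference-layer figure (e.g., power of L0)
    linearly with the channel count. Illustrative assumption; the
    fitted equations of Section V may use other layer parameters."""
    return ref_value * channels / ref_channels

def interp_buffer_cost(depths, costs, depth):
    """Linear interpolation of a buffered-dataflow cost between two
    synthesized buffer depths -- the step the text identifies as an
    error source for the buffered dataflows."""
    (d0, d1), (c0, c1) = depths, costs
    return c0 + (c1 - c0) * (depth - d0) / (d1 - d0)
```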
5) Total Energy Estimation: Table V presents results for the energy estimation, considering the accelerators and the memory accesses. Memory accesses are responsible for most of the consumed energy. According to [10], the memory can spend 200 times more energy than the accelerator array. As observed, there is a mismatch higher than 1% only in the IS dataflows, due to the errors in the estimation of memory readings (Section VI-C.3). The errors in the IS dataflows occur due to the following reasons:

• The IS dataflow presents a small energy error estimation because the number of OFMAP accesses is higher
than the number of IFMAP readings (where the estimation presents errors), resulting in a small energy estimation error, below 3%.
• The buffered IS dataflow makes more IFMAP readings (with an estimation error equal to 9.38%) than OFMAP writes. The result is a higher energy estimation error.

The overall energy estimation error stays below 7%, with an average error of 0.66% and a standard deviation of 1.54% (minimal error: 0%, maximum error: 6.26%).
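The access counts behind these energy figures follow from the dataflow definitions of Section III, as in the simplified sketch below (single channel, per filter, ideal reuse). Real counts depend on buffer sizing, which is where the IS IFMAP-read estimate deviates.

```python
def memory_accesses(h_in, w_in, h_out, w_out, k, dataflow):
    """Simplified per-filter access counts derived from the dataflow
    definitions in Section III (single channel, ideal reuse). Actual
    counts depend on buffering details, which is why the IS IFMAP
    estimate deviates in our results."""
    n_out = h_out * w_out
    writes = n_out                                   # OFMAP writes
    if dataflow == "WS":    # weights read once, IFMAP fetched per window
        reads = k * k + n_out * k * k
    elif dataflow == "IS":  # IFMAP values read once, weights per window
        reads = h_in * w_in + n_out * k * k
    else:                   # OS: both operands fetched per window
        reads = 2 * n_out * k * k
    return reads, writes
```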
6) Analytic Model Summary: The proposed analytic flow enabled an accurate PPA estimation. Table VI summarizes the results. The main source of error is the estimation of the number of IFMAP readings for the IS dataflow, impacting the average power estimation. Also, improvements may be made in the interpolation approach to estimate the area and consumption of the output buffers.

Besides the PPA accuracy, the method is fast. Below we compare the computing time of each flow.

• Physical synthesis flow. The physical synthesis of one accelerator can take up to 45 minutes (Intel [email protected], 28 cores, 64 GB memory). The DSE of a given CNN configuration takes several hours, considering all layers and channels.
• Analytical flow. It is necessary to synthesize the first convolutional layer of each dataflow, with the remaining layers estimated from the first-layer results. The obtained PPA data is used to extract the model. The DSE of a given CNN configuration took 0.025 seconds for the CNN evaluated in this Section.
• MAC-based flow. It also requires a synthesis step for the MAC and registers. The area and power related to the synthesis step are the values used to estimate the accelerator cost.
For comparison purposes, Shao et al. [43] mention that their simulator, Aladdin, takes 7 minutes to execute a full DSE of a given set of accelerators, against 52 hours for the synthesis flow.

D. Analytic Model Compared to the State-of-the-Art

TABLE VII
Analytic and State-of-the-Art Result Errors Comparison

Table VII compares the analytic model results with results available in the literature. Power and energy consider only the accelerator, not the memories. The Table presents, for each work, the min/max/average error, when available.

Few works in the literature present an estimation as comprehensive as the method proposed in this paper. MLPAT and Accelergy present a limited evaluation of area and energy. STONNE evaluates only performance, with higher errors compared to our method. Aladdin presents the most complete evaluation compared to the other methods. However, the proposed method has an area evaluation more accurate than Aladdin, while the evaluation of the other metrics has errors of the same order.

As a conclusion, the proposed flow presents a more comprehensive evaluation of more metrics with lower errors, and can provide a large set of estimates for each dataflow.

VII. CONCLUSION

This work presented a fast, comprehensive, and accurate design space exploration analytic method for CNN hardware accelerators. The method integrates front-end frameworks (such as TensorFlow) with a hardware back-end design flow. Despite the coupling of the DSE framework to the accelerator architecture, the flow presented in this work is a guideline for executing DSE for other dataflows. We showed that methods based on basic components, such as MACs, are not enough to obtain accurate results, presenting errors between 12.51% and 44.68% for area and power. The average error comparing the analytical model to the data obtained from the physical synthesis is smaller than 7%. Compared to the literature, the proposed method shows a more accurate area evaluation, while the assessment of power and performance has errors of the same order.

The accelerators' source code and the synthesis and DSE scripts are available at the following GitHub repository: https://github.com/leorezende93/acc_dse_env

Future work covers the accelerator and system levels. At the accelerator level, we plan to: (i) optimize the accuracy of the PPA results, such as the performance of the IS dataflow estimation; (ii) add other dataflows, such as NLR (No Local Reuse), RS (Row Stationary), and FG (Fine-Grained), and implement in hardware other CNN functions, such as max and average pooling; (iii) implement larger arrays, such as 16 × 16, and integrate the DSE flow with the ImageNet dataset to allow simulation of the accelerators using more complex CNNs; (iv) use both SRAM and DRAM, implementing a memory hierarchy scheme. At the system level, we plan to combine the DSE method with system simulators to perform DSE regarding an entire system composed of CPUs, DMA, and CNN accelerators.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[2] Facebook. (2022). Facebook Horizon. [Online]. Available: https://www.oculus.com/horizon-worlds/
[3] Google. (2022). Google Assistant, Your Own Personal Google. [Online]. Available: https://assistant.google.com
[4] ServiceNow. (2022). Enterprise Chatbot—Virtual Agent. [Online]. Available: https://assistant.google.com/
[5] Tesla. (2022). Autopilot. [Online]. Available: https://www.tesla.com
[6] S. S. Haykin, Neural Networks and Learning Machines, 3rd ed. London, U.K.: Pearson, 2009.
[7] (2022). Caffe. [Online]. Available: https://caffe.berkeleyvision.org/
[8] (2022). PyTorch. [Online]. Available: https://pytorch.org/
[9] (2022). TensorFlow. [Online]. Available: https://www.tensorflow.org/
[10] Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan. 2016.
[11] N. Strom, "Scalable distributed DNN training using commodity GPU cloud computing," in Proc. Interspeech, Sep. 2015, pp. 1488–1492.
[12] W. J. Dally, Y. Turakhia, and S. Han, "Domain-specific hardware accelerators," Commun. ACM, vol. 63, no. 7, pp. 48–57, Jun. 2020, doi: 10.1145/3361682.
[13] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An architecture for ultralow power binary-weight CNN acceleration," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 1, pp. 48–60, Jan. 2018.
[14] S. Shivapakash, H. Jain, O. Hellwich, and F. Gerfers, "A power efficient multi-bit accelerator for memory prohibitive deep neural networks," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–5.
[15] S.-F. Hsiao, K.-C. Chen, C.-C. Lin, H.-J. Chang, and B.-C. Tsai, "Design of a sparsity-aware reconfigurable deep learning accelerator supporting various types of operations," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 10, no. 3, pp. 376–387, Sep. 2020.
[16] F. Spagnolo, S. Perri, F. Frustaci, and P. Corsonello, "Reconfigurable convolution architecture for heterogeneous systems-on-chip," in Proc. 9th Medit. Conf. Embedded Comput. (MECO), Jun. 2020, pp. 1–5.
[17] S.-F. Hsiao and H.-J. Chang, "Sparsity-aware deep learning accelerator design supporting CNN and LSTM operations," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–4.
[18] Tesla. (2019). Autopilot and Full Self-Driving Capability. [Online]. Available: https://analyticsindiamag.com/under-the-hood-of-teslas-ai-chip-that-takes-the-driverless-battle-to-nvidias-home-turf/
[19] Apple. (2022). iPhone 11. [Online]. Available: https://www.apple.com/iphone-11/
[20] S. Das, A. Roy, K. K. Chandrasekharan, A. Deshwal, and S. Lee, "A systolic dataflow based accelerator for CNNs," in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), Oct. 2020, pp. 1–5.
[21] C. Heidorn, F. Hannig, and J. Teich, "Design space exploration for layer-parallel execution of convolutional neural networks on CGRAs," in Proc. 23rd Int. Workshop Software Compilers Embedded Syst. (SCOPES), 2020, pp. 26–31.
[22] Y. Zhao, C. Li, Y. Wang, P. Xu, Y. Zhang, and Y. Lin, "DNN-chip predictor: An analytical performance predictor for DNN accelerators with various dataflows and hardware architectures," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2020, pp. 1593–1597.
[23] A. Parashar et al., "Timeloop: A systematic approach to DNN accelerator evaluation," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Mar. 2019, pp. 304–315.
[24] F. Muñoz-Martínez, J. L. Abellán, M. E. Acacio, and T. Krishna, "STONNE: A detailed architectural simulator for flexible neural network accelerators," Comput. Res. Repository, vol. 2006, no. 1, pp. 1–8, 2020.
[25] Y. N. Wu, J. S. Emer, and V. Sze, "Accelergy: An architecture-level energy estimation methodology for accelerator designs," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (ICCAD), Nov. 2019, pp. 1–8.
[26] D. Giri, K.-L. Chiu, G. Di Guglielmo, P. Mantovani, and L. P. Carloni, "ESP4ML: Platform-based design of systems-on-chip for embedded machine learning," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2020, pp. 1049–1054.
[27] A. Sohrabizadeh, Y. Bai, Y. Sun, and J. Cong, "Enabling automated FPGA accelerator optimization using graph neural networks," Comput. Res. Repository, vol. abs/2111.08848, pp. 1–12, Jun. 2021.
[28] G. Datta and P. A. Beerel, "Can deep neural networks be converted to ultra low-latency spiking neural networks?" in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2022, pp. 718–723.
[29] P. Panda, S. A. Aketi, and K. Roy, "Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization," Frontiers Neurosci., vol. 14, pp. 1–18, Aug. 2020.
[30] T. Tang and Y. Xie, "MLPAT: A power area timing modeling framework for machine learning accelerators," in Proc. IEEE Int. Workshop Domain Specific Syst. Archit. (DOSSA), Aug. 2018, pp. 1–3.
[31] H. Kwon, M. Pellauer, and T. Krishna, "MAESTRO: An open-source infrastructure for modeling dataflows within deep learning accelerators," Comput. Res. Repository, vol. abs/1805.02566, no. 1, pp. 1–5, 2018.
[32] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. Krishna, "Understanding reuse, performance, and hardware cost of DNN dataflow: A data-centric approach," in Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), May 2019, pp. 754–768.
[33] NVIDIA. (2022). NVDLA. [Online]. Available: http://nvdla.org/
[34] C. Hao et al., "FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge," in Proc. ACM/IEEE Design Automat. Conf. (DAC), Mar. 2019, pp. 1–6.
[35] X. Zhang, H. Ye, and D. Chen, "Being-ahead: Benchmarking and exploring accelerators for hardware-efficient AI deployment," Comput. Res. Repository, vol. abs/2104.02251, no. 1, pp. 1–12, Jun. 2021.
[36] H. Genc et al., "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration," in Proc. 58th ACM/IEEE Design Autom. Conf. (DAC), Dec. 2021, pp. 769–774.
[37] K. Asanovic et al., "The rocket chip generator," Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2016-17, 2016. [Online]. Available: https://aspire.eecs.berkeley.edu/wp/wp-content/uploads/2016/04/Tech-Report-The-Rocket-Chip-Generator-Beamer.pdf
[38] X. Yang et al., "Interstellar: Using Halide's scheduling language to analyze DNN accelerators," in Proc. ACM Int. Conf. Architectural Support Program. Lang. Operating Syst. (ASPLOS), 2020, pp. 369–383.
[39] S. D. Manasi and S. S. Sapatnekar, "DeepOpt: Optimized scheduling of CNN workloads for ASIC-based systolic deep learning accelerators," in Proc. ACM/IEEE Asia South Pacific Design Automat. Conf. (ASPDAC), May 2021, pp. 235–241.
[40] A. Karbachevsky et al., "Early-stage neural network hardware performance analysis," MDPI Sustainability, vol. 13, no. 2, p. 717, 2021.
[41] C. Baskin et al., "UNIQ: Uniform noise injection for non-uniform quantization of neural networks," ACM Trans. Comput. Syst., vol. 37, nos. 1–4, pp. 1–15, 2021.
[42] M. Ferianc et al., "Improving performance estimation for design space exploration for convolutional neural network accelerators," MDPI Electron., vol. 10, no. 4, pp. 1–14, 2021.
[43] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks, "Aladdin: A pre-RTL, power-performance accelerator simulator enabling large design space exploration of customized architectures," in Proc. ACM Int. Symp. Comput. Archit. (ISCA), 2014, pp. 97–108.
[44] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "SCALE-Sim: Systolic CNN accelerator," Comput. Res. Repository, vol. abs/1811.02883, no. 1, pp. 1–11, 2018.
[45] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, "A systematic methodology for characterizing scalability of DNN accelerators using SCALE-Sim," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), Aug. 2020, pp. 58–68.
[46] H. Kwon, A. Samajdar, and T. Krishna, "MAERI: Enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects," ACM Special Interest Group Program. Lang. Notices, vol. 53, no. 2, pp. 461–475, 2018.
[47] S. Kim et al., "Transaction-level model simulator for communication-limited accelerators," Comput. Res. Repository, vol. abs/2007.14897, no. 1, pp. 1–11, 2020.
[48] S. Cao, W. Deng, Z. Bao, C. Xue, S. Xu, and S. Zhang, "SimuNN: A pre-RTL inference, simulation and evaluation framework for neural networks," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 10, no. 2, pp. 217–230, 2020.
[49] Keras. (2022). PReLU Layer. [Online]. Available: https://keras.io/api/layers/activations/
[50] L. R. Juracy, M. T. Moreira, A. M. Morais, A. Hampel, and F. G. Moraes, "A high-level modeling framework for estimating hardware metrics of CNN accelerators," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 68, no. 11, pp. 4783–4795, 2021.
[51] D. Moolchandani, A. Kumar, and S. R. Sarangi, "Accelerating CNN inference on ASICs: A survey," J. Syst. Archit., vol. 113, no. 1, pp. 1–26, 2021.
[52] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," Comput. Res. Repository, vol. abs/1409.1556, no. 1, pp. 1–14, 2014.
[53] N. P. Jouppi, A. B. Kahng, N. Muralimanohar, and V. Srinivas, "CACTI-IO: CACTI with off-chip power-area-timing models," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 7, pp. 1254–1267, 2014.
[54] N. J. Kim and H. Kim, "FP-AGL: Filter pruning with adaptive gradient learning for accelerating deep convolutional neural networks," IEEE Trans. Multimedia, early access, Jul. 1, 2022, doi: 10.1109/TMM.2022.3189496.

Alexandre de Morais Amory received the Ph.D. degree in computer science from UFRGS, Brazil, in 2007. His professional experience includes Lead Verification Engineer at the CEITEC Design House from 2007 to 2009, Post-Doctoral Fellow at PUCRS from 2009 to 2012, and Professor at PUCRS from 2012 to 2020. He is currently a Research Fellow at Scuola Superiore Sant'Anna, Italy. His research interests include design, test, fault tolerance, and safety-critical systems.