
An Area-Power-Efficient Multiplier-less Processing Element Design for CNN Accelerators


Jiaxiang Li, Masao Yanagisawa, and Youhua Shi
Dept. of Electron. and Phys. Syst., Graduate School of Fundamental Science and Engineering
Waseda Univ., Tokyo, Japan
[email protected], [email protected], [email protected]

Abstract—Machine learning has achieved remarkable success in various domains. However, the computational demands and memory requirements of these models pose challenges for deployment on privacy-secured or wearable edge devices. To address this issue, we propose an area-power-efficient multiplier-less processing element (PE) in this paper. Prior to implementing the proposed PE, we apply a power-of-2 dictionary-based quantization to the model. We analyze the effectiveness of this quantization method in preserving the accuracy of the original model and present the standard and a specialized diagram illustrating the schematics of the proposed PE. Our evaluation results demonstrate that our design achieves approximately 30% lower power consumption and a 35% smaller core area compared to a conventional multiplication-and-accumulation (MAC) PE. Moreover, the applied quantization reduces the model size and operand bit-width, resulting in reduced on-chip memory usage and energy consumption for memory accesses.

Keywords—multiplier-less processing element, energy-efficient, area-efficient, machine learning model quantization

I. INTRODUCTION

In recent years, significant advancements have been made in artificial intelligence (AI), particularly in areas such as object detection and localization, natural language processing, and image generation. To handle complex tasks, GPUs with floating-point multipliers have been predominantly employed in big data centers and servers. However, with the rapid growth of the Internet of Things (IoT) and increasing demands for privacy security, a challenge arises in bringing AI models to edge devices. To address this challenge and reduce on-chip memory usage and power consumption, various hardware accelerators have been proposed to enable more efficient computation. One fundamental circuit component commonly utilized in these accelerators is the processing element (PE).

The basic component of PEs in hardware accelerators typically comprises a fixed-point multiplier, which consumes a significant amount of computation energy. One common approach to reducing computation power is the use of approximate multipliers. Fang et al. [1] introduced error-balancing techniques, while Perri et al. [2] employed on-chip quantization to preserve better precision in approximate multipliers. However, both designs were developed for general applications and lack optimization specific to machine learning models. Another method involves zero-gating designs [3], but such approaches often require additional hardware overhead to detect zero operands.

In binary arithmetic, multiplying a number by a power of 2 can be achieved efficiently by shifting its bits, eliminating the need for an actual multiplication. Leveraging this characteristic, we propose a model quantization method and a corresponding multiplier-less PE. Our proposed method and PE design offer significant advantages, resulting in approximately a 30% reduction in computation power and a 35% reduction in core area when compared to a baseline multiplication-and-accumulation (MAC) PE. The analysis and evaluations conducted in this study focus on convolutional neural networks (CNNs), which are widely used for image-related applications and comprise massive numbers of multiplication operations.

The remainder of this paper is organized as follows. Section II presents the quantization method employed for the proposed PE design. In Section III, we provide a detailed illustration of the schematics of the proposed PE. The evaluation results are presented in Section IV, showcasing the performance of our design. Finally, Section V concludes the paper.

This work is supported in part by the Waseda University Open Innovation Ecosystem Program for Pioneering Research (W-SPRING), Grant Number JPMJSP2128.

II. MODEL QUANTIZATION

Before implementing the proposed multiplier-less PE, a power-of-2 dictionary-based quantization is applied to the pretrained neural network model. In this quantization process, each original weight is represented by a combination of several sub-weights and scaled by a bias coefficient. This approach allows for efficient representation and computation, enabling the subsequent use of the multiplier-less PE in the hardware accelerator design.

Equation (1) defines the power-of-2 dictionary, Dict, used in the quantization process:

Dict = {0, 2^0, ±2^1, ..., ±2^n}; n ∈ Z^+   (1)

For the quantization of a single weight using the items in the dictionary, W = (a + b + c) × 2^(-bias) indicates that the weight W can be represented by three sub-weights in Dict (i.e., a, b, c ∈ Dict), where the bias term is a coefficient shared across the layers, enabling efficient scaling of the quantized weights.

The effectiveness of the bias term in increasing the range of weights after quantization and mitigating the degradation in accuracy has been demonstrated. By incorporating the bias term into the quantization process, the quantized weights are able to cover a wider range, thereby preserving more information and reducing the impact on model accuracy [4]. This finding highlights the significance of the bias term in minimizing the potential loss of accuracy during the quantization process.

Algorithm 1 illustrates the calculation of the bias term. It outlines the iterative process employed to determine an optimal bias value that minimizes the mean square error (MSE) between the quantized weights and the original weights.

The accuracy of quantized models is influenced by two key parameters: the maximum power of the dictionary (n) and the number of sub-weights (m). To assess the impact of these two parameters on model accuracy, we conducted experiments using a 9-layer CNN model with the architecture 'C64-C64-C128-C128-P2-C256-C256-P2-FC512-FC256-FC100' on the CIFAR-100 dataset. TABLE I presents the results and demonstrates that increasing only one of m and n does little to reduce the accuracy degradation. Accuracy recovery is most effective when both n and m are increased simultaneously, as indicated by the diagonal direction in the table. This joint adjustment of n and m helps to better preserve the model's accuracy after quantization. In addition, on this testbench, the accuracy can be recovered to very close to that of the original model with only two sub-weights.
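The listing of Algorithm 1 is not reproduced in this copy of the paper. As a rough illustration of the scheme described above, the Python sketch below pairs a greedy sub-weight selection (an assumption; the paper does not state how sub-weights are chosen) with an exhaustive search over integer bias values that minimizes the MSE, as Algorithm 1 is described to do. All function names and the max_bias bound are illustrative.

```python
import numpy as np

def build_dict(n):
    """Dict = {0, 2^0, +/-2^1, ..., +/-2^n}; -2^0 is excluded
    (its binary code is reused for 0, see TABLE II)."""
    return [0.0, 1.0] + [s * 2.0 ** k for k in range(1, n + 1) for s in (1.0, -1.0)]

def quantize_value(x, m, n):
    """Greedily pick m sub-weights from Dict whose sum approximates x."""
    dictionary = build_dict(n)
    subs, residual = [], x
    for _ in range(m):
        best = min(dictionary, key=lambda d: abs(residual - d))
        subs.append(best)
        residual -= best
    return subs

def quantize_weights(weights, m, n, max_bias=16):
    """Algorithm-1-style search: try integer biases and keep the one
    minimizing the MSE between original and quantized weights."""
    weights = np.asarray(weights, dtype=np.float64)
    best = (None, np.inf, None)
    for bias in range(max_bias + 1):
        scale = 2.0 ** bias
        # W = (a + b + ...) * 2^-bias, so quantize the scaled weight
        q = np.array([sum(quantize_value(w * scale, m, n)) for w in weights]) / scale
        mse = float(np.mean((weights - q) ** 2))
        if mse < best[1]:
            best = (bias, mse, q)
    return best  # (bias, mse, quantized weights)

bias, mse, q = quantize_weights(np.random.randn(128) * 0.05, m=2, n=3)
```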

TABLE I. ACCURACY TABLE OF THE TESTBENCH VGG9 MODEL

          n = 1       n = 2      n = 3
  m = 1   < 2.00 %    51.45 %    54.24 %
  m = 2    52.90 %    54.97 %    55.24 %
  m = 3    52.90 %    54.95 %    55.16 %
  m = 4    53.67 %    55.03 %    55.24 %

  * The accuracy of the original dense model is 55.19%.

TABLE II. AN EXAMPLE OF THE STANDARD SUB-WEIGHT ENCODING

  Sub-weight Decimal Value    Binary Code
  +2^0                        000
  +2^1                        001
  +2^2                        010
  +2^3                        011
   0                          100
  -2^1                        101
  -2^2                        110
  -2^3                        111

III. PROPOSED PE DESIGN

In (1), the value '-2^0' is not included in the dictionary, as its corresponding binary code is assigned to '0'. To illustrate the encoding of sub-weights, TABLE II presents an example in which the maximum power of the dictionary (n) is set to 3. For each sub-weight, the first bit represents the 'sign-bit', while the remaining bits represent the corresponding power, namely the 'value-bits'.
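A minimal sketch of this mapping, written directly from TABLE II (the function names are ours, not the paper's):

```python
def encode_subweight(value, n=3):
    """Encode a Dict entry as (sign_bit, power) per TABLE II.
    sign=1 with power=0 is reused for the sub-weight 0,
    since -2^0 is excluded from the dictionary."""
    if value == 0:
        return (1, 0)                      # code '100' when n = 3
    sign = 1 if value < 0 else 0
    power = abs(value).bit_length() - 1    # value is +/-2^k
    assert value == (-1) ** sign * 2 ** power and 0 <= power <= n
    return (sign, power)

def decode_subweight(sign, power):
    """Inverse mapping back to the decimal sub-weight value."""
    if sign == 1 and power == 0:
        return 0
    return -(2 ** power) if sign else 2 ** power

assert encode_subweight(-4) == (1, 2)    # -2^2 -> '110'
assert decode_subweight(1, 0) == 0       # '100' -> 0
assert decode_subweight(0, 3) == 8       # '011' -> +2^3
```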
A. Proposed Standard PE

Fig. 1 illustrates the schematic diagram of the proposed standard PE. The PE performs the bit-shifting operation between the input activation (IA) and the sub-weights using partial product calculators. First, the IA is shifted according to the maximum power (n), which is fixed at design time, and then undergoes two steps of multiplexing.

The first multiplexer selects the appropriate bit-shifted IA, referred to as a partial product, based on the value-bits of the sub-weight. The second multiplexer determines whether to use the complemented (negated) or the unchanged partial product, based on the sign-bit. Finally, all the partial products are accumulated with the partial summation (Psum) from the previous PE to generate the new Psum'.

Fig. 1. Diagram of the standard proposed PE schematic.
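To make the datapath concrete, here is a small Python behavioral model of the standard PE under the TABLE II encoding. This is a software sketch for clarity, not the paper's RTL; standard_pe and its argument layout are illustrative.

```python
def standard_pe(ia, sub_weight_codes, psum, n=3):
    """Behavioral sketch of the standard PE (Fig. 1): one partial
    product per sub-weight, accumulated onto the incoming Psum.

    ia:               integer input activation
    sub_weight_codes: list of (sign_bit, value_bits) pairs per TABLE II
    psum:             partial sum from the previous PE
    """
    shifted = [ia << k for k in range(n + 1)]  # candidate partial products
    acc = psum
    for sign_bit, value_bits in sub_weight_codes:
        if sign_bit == 1 and value_bits == 0:
            continue                      # code '100' encodes the sub-weight 0
        pp = shifted[value_bits]          # first mux: IA shifted by the power
        acc += -pp if sign_bit else pp    # second mux: negate on the sign-bit
    return acc                            # the new Psum'

# IA = 5 with sub-weights +2^1 ('001') and -2^2 ('110'): 5*2 - 5*4 = -10
assert standard_pe(5, [(0, 0b01), (1, 0b10)], psum=0) == -10
```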
B. Proposed Bisign PE

Based on several simulations, it was observed that for models trained on datasets of lower complexity, setting the number of sub-weights (m) to 2 is sufficient to retain acceptable model accuracy. In such cases, the PE can be modified slightly to handle the two sub-weights as a positive weight and a negative weight by default, allowing the sign-bit to be removed. In this work, this modified design is referred to as the 'bisign' expansion. Fig. 2 shows the corresponding schematic diagram of the 'bisign' design, which eliminates a multiplexer and therefore introduces less delay than the standard proposed design. This modification improves the efficiency of the proposed PE while still producing identical computation results for models of lower complexity.

TABLE III presents the look-up table (LUT) for encoding sub-weights applied in the bisign proposed PE. In this LUT, unlike the arrangement of the encoding in the standard proposed design, the binary representation of zero is assigned back to the value '-2^0'. This is because a zero value can be represented by the summation of two sub-weights with identical value-bits but opposite sign-bits. By exploiting this property, the bisign PE optimizes the encoding of sub-weights, reducing the memory requirement and computational complexity while still accurately representing the corresponding weights.

TABLE III. AN EXAMPLE OF THE BISIGN SUB-WEIGHT ENCODING

  Positive Sub_W    Negative Sub_W    Binary Code
  +2^0              -2^0              00
  +2^1              -2^1              01
  +2^2              -2^2              10
  +2^3              -2^3              11

Fig. 2. Diagram of the bisign proposed PE schematic.
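Under the TABLE III encoding, a behavioral sketch of the bisign datapath reduces to one positive and one negative shift. The code below assumes the m = 2 convention described above; bisign_pe is an illustrative name.

```python
def bisign_pe(ia, pos_code, neg_code, psum):
    """Bisign PE sketch (Fig. 2, TABLE III): with one sub-weight
    positive and one negative by convention, no sign-bit is stored;
    each 2-bit code k simply selects +2^k or -2^k."""
    return psum + (ia << pos_code) - (ia << neg_code)

# A zero weight uses identical value-bits with opposite signs:
assert bisign_pe(7, pos_code=0b10, neg_code=0b10, psum=3) == 3
# +2^3 combined with -2^1: 7*8 - 7*2 = 42
assert bisign_pe(7, pos_code=0b11, neg_code=0b01, psum=0) == 42
```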
IV. EVALUATION RESULTS

To assess the effectiveness of the proposed design, we conducted evaluations of three types of PEs: a normal MAC PE, as depicted in Fig. 3; an approximate PE based on the compressor described in [5]; and our proposed PEs. The evaluations covered several performance metrics, including area, power, and delay. For synthesis, we employed OpenLane [6], an open-source ASIC synthesis flow, with a 130nm technology. Additionally, we used the observations introduced in [7] to estimate further details of the energy savings. These methodologies allowed us to comprehensively compare the performance and efficiency of the different PEs and validate the benefits offered by our proposed design.

Fig. 3. Diagram of a normal MAC PE schematic.

A. Circuit Synthesis

TABLE IV presents the synthesis results, highlighting the performance of the different PEs.

TABLE IV. CIRCUIT PERFORMANCE MATRIX

  Design Name                 Area (μm^2)   Power (nW)   Critical Path (ns)
  MAC PE (16 bit)             37755.0       3.91         7.27
  MAC PE (8 bit)              10277.4       0.71         4.68
  Approximate PE (8 bit)      9729.3        0.72         5.18
  Proposed (m = 2; n = 3)     7175.6        0.53         4.87
  Proposed (m = 3; n = 3)     10277.3       0.75         5.74
  Proposed (m = 2; n = 4)     7330.8        0.54         5.06
  Proposed (bisign; n = 3)    6349.8        0.46         4.72

The PE based on the approximate multiplier shows minimal improvement in area compared to the MAC PE, and no noticeable improvement in power. This is primarily due to the additional complexity involved in the approximate multiplier design: the signed operands are first divided into signed most-significant bits (MSBs) and unsigned least-significant bits (LSBs), and their products are computed separately. The approximate multiplication between the signed MSBs and unsigned LSBs requires additional logic gates to calculate the sign bits accurately.

Regarding the proposed standard PE, its characteristics can be observed in two aspects. First, as the value of m increases, the hardware resource utilization and the critical path delay increase rapidly. This is because m determines the number of partial product calculators, and accumulating more partial products requires a deeper adder tree. Conversely, increasing the value of n has a lesser impact on the hardware, as it primarily adds options to the first multiplexer in each partial product calculator. Based on this observation, when implementing this PE, it is preferable to increase n in order to reduce the accuracy degradation during the model quantization stage. This strategy mitigates the impact on hardware resources while preserving the model's overall accuracy.

The proposed bisign PE demonstrates the best area-power product among the designs considered. It saves approximately 38% in core area and 35% in power consumption compared to the MAC PE operating at 8-bit precision. Additionally, it is noteworthy that the bisign PE introduces minimal impact on the critical path, accounting for less than 1% of the total delay. These advantages make it an attractive choice for optimizing the overall performance of the system while maintaining efficient resource utilization.

B. Memory Access and Weight Compression

Based on the findings in [7], a significant portion of the energy consumed during model inference is attributed to memory accesses. Specifically, the energy expended for read and write operations between RAMs or registers is proportional to the bit-width of the signals involved.

Consequently, the bit-width of the weights after quantization plays a crucial role in determining this energy consumption and can be easily calculated by:

standard_W_bitwidth = m × n   (2)

bisign_W_bitwidth = 2 × (n − 1)   (3)

Here each standard sub-weight code occupies n bits (a sign-bit plus the value-bits, cf. TABLE II), while the bisign encoding stores two sign-free codes of n − 1 bits each (cf. TABLE III).

In typical scenarios, float-point weights are commonly quantized into 8-bit or 16-bit fixed-point numbers to facilitate convolution on hardware accelerators. By applying quantization with parameters (m = 2; n = 3), the weights undergo a transformation that results in a reduction of approximately 25% or 60%, respectively, in the energy consumption associated with memory access. Furthermore, this quantization scheme also reduces the on-chip memory required to store the same number of weights.

Importantly, this weight quantization approach remains compatible with Huffman coding and other weight compression methods that encode the weights as a sparse high-dimensional matrix. This compatibility enhances the potential for achieving additional efficiency gains in terms of memory utilization and computation.
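Assuming the reconstructions of (2) and (3) above, a quick worked check reproduces the quoted savings (the helper names are ours; 62.5% is reported as approximately 60% in the text):

```python
def standard_w_bitwidth(m: int, n: int) -> int:
    """Eq. (2): m sub-weight codes of n bits each."""
    return m * n

def bisign_w_bitwidth(n: int) -> int:
    """Eq. (3): two sign-free sub-weight codes of (n - 1) bits each."""
    return 2 * (n - 1)

bits = standard_w_bitwidth(2, 3)   # (m = 2; n = 3) -> 6 bits per weight
for baseline in (8, 16):
    # Memory-access energy scales with bit-width (Section IV-B),
    # giving 25.0% and 62.5% savings vs 8-bit and 16-bit weights.
    print(f"vs {baseline}-bit: {100 * (1 - bits / baseline):.1f}% energy reduction")

print(bisign_w_bitwidth(3))        # bisign with n = 3 stores 4 bits per weight
```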
C. Sparsity Analysis

While the multiplier-less PE design described in this paper does not incorporate dedicated hardware logic for exploiting sparsity, it still offers advantages with respect to zero-gating, a technique that leverages the presence of zeros in the weight parameters to skip computations. With the proposed multiplier-less PE, not all sub-weights are needed to represent a small but non-zero weight, and each sub-weight can be zero-gated separately. Considering weight sparsity, the computation energy can be approximated using the equations provided and validated in [8]:

E_total = N_MACs × (1 − s_w) × E_MAC   (4)

where E_total stands for the total computation energy, N_MACs is the number of MAC operations, s_w is the ratio of zero weights, and E_MAC is the energy of a single MAC operation.

For this design, equation (4) becomes:

E_total = N_MACs × Σ_{i=1}^{m} (1 − s_wi) × E_flop   (5)

where s_wi is the sparsity of the i-th sub-weight and E_flop is the energy consumption of a single flop in the proposed designs.
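A sketch of how (4) and (5) can be applied is shown below. The layer statistics and the relative e_flop cost are made-up placeholders for illustration, not values from the paper or Fig. 4.

```python
def mac_energy(n_macs: int, weight_sparsity: float, e_mac: float = 1.0) -> float:
    """Eq. (4): zero-gated MAC energy for a conventional PE."""
    return n_macs * (1.0 - weight_sparsity) * e_mac

def subweight_energy(n_macs: int, sub_sparsities, e_flop: float) -> float:
    """Eq. (5): per-sub-weight zero-gating in the proposed PE; each of
    the m sub-weights is gated independently with its own sparsity."""
    return n_macs * sum((1.0 - s) * e_flop for s in sub_sparsities)

# Hypothetical layer: 1M MACs, 50% weight sparsity, sub-weight
# sparsities of 60% and 85%; e_flop is the energy of one
# shift-accumulate relative to one MAC (assumed 0.3 here).
baseline = mac_energy(10**6, 0.50)
proposed = subweight_energy(10**6, [0.60, 0.85], e_flop=0.3)
print(proposed / baseline)  # fraction of the zero-gated MAC baseline
```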
Based on the performance metrics obtained from TABLE IV, Fig. 4 presents an example of computation energy estimation using an AlexNet model provided by the PyTorch framework, in which 50% of the smallest weights are pruned. The computation energy is normalized to the energy consumed by a MAC operation. The results clearly demonstrate that even for layers with relatively low weight sparsity, the dictionary-based quantization significantly increases the effective sub-weight sparsity. This effect becomes more pronounced in deeper hidden layers, where the absolute values of the weights tend to cluster around zero. As a result, the overall energy consumption for the convolution operations decreases by approximately 45% in this particular scenario. These findings highlight the potential benefit of employing the proposed methods jointly with sparsity-related techniques to reduce energy requirements.

Fig. 4. Computation energy estimation for sparsity analysis.

V. CONCLUSION

In this paper, we introduced a novel multiplier-less PE design that offers a favorable balance between area, power, and performance, together with a power-of-2 dictionary-based weight quantization approach for model pre-processing. Through extensive evaluations, we demonstrated that our design surpasses the conventional MAC PE, achieving up to a 30% reduction in power consumption while occupying a 35% smaller core area. Furthermore, we highlighted the compatibility of our approach with other compression methods and sparsity-aware techniques, enhancing its applicability in various contexts.

REFERENCES

[1] B. Fang et al., "Approximate multipliers based on a novel unbiased approximate 4-2 compressor," Integration, vol. 81, pp. 17-24, Nov. 2021.
[2] S. Perri, F. Spagnolo, F. Frustaci, and P. Corsonello, "Designing energy-efficient approximate multipliers," J. Low Power Electron. Appl., vol. 12, no. 4, art. 49, Sep. 2022.
[3] L. Ye, J. Ye, M. Yanagisawa, and Y. Shi, "Power-efficient deep convolutional neural network design through zero-gating PEs and partial-sum reuse centric dataflow," IEEE Access, vol. 9, pp. 17411-17420, Jan. 2021.
[4] M.-H. Hsieh, Y.-T. Liu, and T.-D. Chiueh, "A multiplier-less convolutional neural network inference accelerator for intelligent edge devices," IEEE J. Emerging Selected Topics Circuits Syst., vol. 11, no. 4, pp. 739-750, Dec. 2021.
[5] A. Vigneshwar and G. A. Sathish Kumar, "Approximate multiplier for low power applications," Int. J. Eng. Res. Technol., vol. 4, no. 14, 2016.
[6] M. Shalan and T. Edwards, "Building OpenLANE: a 130nm OpenROAD-based tapeout-proven flow: invited paper," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2020, pp. 1-6.
[7] T.-J. Yang, Y.-H. Chen, J. Emer, and V. Sze, "A method to estimate the energy consumption of deep neural networks," in Proc. Asilomar Conf. Signals, Syst., Comput., 2017, pp. 1916-1920.
[8] M. Dampfhoffer, T. Mesquida, A. Valentian, and L. Anghel, "Are SNNs really more energy-efficient than ANNs? An in-depth hardware-aware study," IEEE Trans. Emerging Topics Comput. Intell., vol. 7, no. 3, pp. 731-741, June 2023.

