Low-Complexity Precision-Scalable Multiply-Accumulate Unit Architectures for Deep Neural Network Accelerators

Wenjie Li, Aokun Hu, Gang Wang, Ningyi Xu, and Guanghui He, Member, IEEE
Abstract—Precision-scalable deep neural network (DNN) accelerator designs have attracted much research interest. Since the computation of most DNNs is dominated by multiply-accumulate (MAC) operations, designing efficient precision-scalable MAC (PSMAC) units is of central importance. This brief proposes two low-complexity PSMAC unit architectures based on the well-known Fusion Unit (FU), which is composed of a few basic units called Bit Bricks (BBs). We first simplify the architecture of the BB by optimizing away redundant logic. A top-level PSMAC architecture is then devised by employing BBs recursively. Accordingly, two low-complexity PSMAC unit architectures are presented for two different kinds of quantization schemes. Moreover, we provide an insight into the decomposed multiplications and further reduce the bitwidths of the two architectures. Experimental results show that our proposed architectures save up to 44.18% area cost and 45.45% power consumption when compared with the state-of-the-art design.

Index Terms—Deep neural networks (DNNs), multiply-accumulate (MAC), precision-scalable, low-complexity architecture.

Manuscript received 19 November 2022; accepted 16 December 2022. Date of publication 22 December 2022; date of current version 29 March 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFB2205500, and in part by the National Natural Science Foundation of China under Grant 62074097. This brief was recommended by Associate Editor M. Huang. (Corresponding authors: Ningyi Xu; Guanghui He.) The authors are with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]). Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2022.3231418. Digital Object Identifier 10.1109/TCSII.2022.3231418

I. INTRODUCTION

Deep neural networks (DNNs) have achieved great success over the past few years. They have not only beaten records in image classification and speech recognition, but have also been employed in a wide range of applications such as object detection and natural language processing. To deal with the tremendous amount of computation, hardware acceleration for DNNs has become an overwhelming trend [1]. For almost all DNNs, the computation is dominated by multiply-accumulate (MAC) operations. Therefore, MAC units occupy an important place in DNN accelerator design.

In addition to reducing the model size, quantization is essential for DNN hardware implementations. It maps floating-point weights or/and activations to a smaller set of fixed-point values. With the assistance of quantization, one can implement a DNN accelerator using only fixed-point arithmetic units. When advanced quantization techniques are adopted, the bitwidth varies significantly across DNNs and even across layers within a DNN [2], [3], [4]. To maximize the utilization of MAC units under different precision configurations, precision scalability is supported by many DNN accelerators. The MAC units of these precision-scalable accelerators are built either on decomposed multipliers [8], [10], [11] or on bit-serial approaches [5], [6], [7], [9]. Among these precision-scalable MAC (PSMAC) unit architectures, the Fusion Unit (FU) [8] and its variants show excellent energy efficiency and throughput [12]. This brief focuses on FU-based PSMAC architectures unless otherwise specified.

A FU comprises 16 basic units called Bit Bricks (BBs), each of which can perform a 2b × 2b multiplication. When configured into a high-precision mode, the partial products generated by the BBs are shifted and added together to recover the final product. A FU usually supports at most an 8b × 8b multiplication in one cycle, and so does the PSMAC unit proposed in this brief. Obviously, it can achieve higher throughput in the low-precision modes than a traditional 8b × 8b multiplier, at the expense of area overhead. Specifically, the introduced shift-add (S&A) logic occupies more than half of the overall area cost. By reducing the number of supported modes, a new FU architecture is proposed in [10] to lower the hardware complexity. However, the improvement is still limited.

The goal of this brief is to further reduce the complexity of the existing FU-based PSMAC architectures. The architecture of the BB is first optimized by eliminating some redundant logic. We next devise a top-level architecture for the PSMAC unit by employing BBs in a recursive manner. Then two FU-based PSMAC architectures are presented for two quantization schemes: one for general purpose and the other for schemes that quantize activations into nonnegative values. Furthermore, an insight into the decomposed multiplication is given and the bitwidths of the proposed architectures are reduced accordingly. These optimization techniques are demonstrated to be effective by our experiments. It is also shown that our proposed architectures can significantly reduce the overall area cost and power consumption.

II. FUSION UNIT AND RELATED WORKS
Since the bitwidth varies significantly across DNNs and may even be individually adjusted for each layer within a DNN, the benefits brought by fixed-bitwidth accelerators are limited. To take full advantage of the bitwidth variety, a bit-level dynamically composable accelerator architecture referred to as Bit Fusion is presented in [8]. It achieves excellent performance in terms of throughput and energy efficiency. The PSMAC unit employed in Bit Fusion is called the FU, which is composed of 16 BBs and some S&A units, as depicted in Fig. 1. The supported bitwidths for the input activations and weights include 2b, 4b and 8b. In the 2b × 2b configuration mode, each BB computes the product of a 2b weight and a 2b activation. In a configuration mode for higher precision, each BB computes a sub-product that is shifted and added with the others to recover the final result(s). For example, Fig. 2 shows how an 8b × 8b multiplication is decomposed such that each sub-multiplication is handled by one BB. Except in the 8b × 8b mode, a FU can perform multiple multiplications between activations and weights and add their products together, i.e., it obtains an inner product of an activation vector and a weight vector. This gives Bit Fusion higher throughput than conventional accelerators that adopt fixed-bitwidth MAC units. As the input bitwidth decreases, more multiplications can be performed simultaneously by a FU.

For different modes or/and different quantization schemes, the inputs of a BB may be signed or unsigned. Thus each of them has to be assigned one extended sign bit to support both signed and unsigned numbers. To this end, a BB is set to perform a multiplication between two 3b signed numbers [8]. Its architecture, based on Baugh-Wooley multiplication, is given in Fig. 3, where "HA" and "FA" represent half adder and full adder, respectively. sx and sy are flag signals indicating whether the inputs are signed or unsigned numbers (1 for signed and 0 for unsigned). In addition to the 16 BBs, S&A units are adopted in the FU, and they take more than half of the overall area cost.

It is pointed out in [10] that the accuracy gap incurred by different precisions is acceptable in some cases. Consequently, the fine granularity of the original FU is not necessary, and [10] suggests supporting only three configuration modes: 8b × 8b, 4b × 4b and 2b × 2b. To save S&A logic, [10] devises a merge-based FU (MFU). For the sake of brevity, we henceforth use xj:i to denote the binary representation (xj xj−1 . . . xi+1 xi)2. Take a 4b × 4b decomposed multiplication x3:0 × y3:0 as an example. Four sub-products are computed, shifted and added together. Note that x1:0 and y1:0 are both unsigned numbers and their sub-product is at most four bits wide. Therefore, [10] directly merges the sub-products of x3:2 × y3:2 and x1:0 × y1:0. Meanwhile, the sub-products of x3:2 × y1:0 and x1:0 × y3:2 are added together and then shifted. Finally, the two intermediate results are added up to recover the product. The process can be formulated as x3:0 × y3:0 = merge{x3:2 × y3:2, x1:0 × y1:0} + ((x3:2 × y1:0 + x1:0 × y3:2) << 2). The computation of the 4b × 4b multiplications executed by the original FU and by MFU is illustrated in Fig. 4. Clearly, Fig. 4(b) needs fewer shifters and two-input adders.

Fig. 5 illustrates the MFU architecture devised in [10]. It can be configured into three modes: 2b × 2b, 4b × 4b and 8b × 8b. The configuration is handled by the MUXs. Fig. 5 also shows the data path when MFU is configured into the 8b × 8b mode. In this mode, the multiplication is decomposed into four parts: x7:0 × y7:0 = x1:0 × y7:0 + ((x3:2 × y7:0) << 2) + ((x5:4 × y7:0) << 4) + ((x7:6 × y7:0) << 6). The four parts are respectively handled by the four columns of BBs. Analogous to the 4b × 4b case, there are a few merging operations. Such an architecture saves S&A logic and hence overall area cost. However, as we will see in Section IV, the improvement brought by MFU is still limited.
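As an illustration only (not part of the original designs), the following Python sketch exhaustively checks the merge-based 4b × 4b decomposition above over all signed 4-bit operands; the helper names split4 and merge are ours and model the arithmetic, not the hardware data path.

```python
# Exhaustive check of the merge-based 4b x 4b decomposition used by MFU:
# x3:0 * y3:0 = merge{x3:2*y3:2, x1:0*y1:0} + ((x3:2*y1:0 + x1:0*y3:2) << 2)

def split4(v):
    """Split a signed 4-bit value into a signed upper 2-bit slice and an
    unsigned lower 2-bit slice, so that v == (hi << 2) + lo."""
    lo = v & 0b11          # unsigned lower slice, in {0, 1, 2, 3}
    hi = (v - lo) >> 2     # signed upper slice, in {-2, -1, 0, 1}
    return hi, lo

def merge(hi_prod, lo_prod):
    """The merge operation of [10]: the lower sub-product is unsigned and at most
    four bits wide, so merging is just (hi_prod << 4) + lo_prod with no carry."""
    return (hi_prod << 4) + lo_prod

for x in range(-8, 8):
    for y in range(-8, 8):
        xh, xl = split4(x)
        yh, yl = split4(y)
        assert merge(xh * yh, xl * yl) + ((xh * yl + xl * yh) << 2) == x * y
print("merge-based 4b x 4b decomposition verified")
```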
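Likewise, a small sketch (again only illustrative; slices2 is a hypothetical helper) checks the column-wise 8b × 8b decomposition of MFU, in which only the top 2-bit slice of the weight keeps its sign.

```python
# Exhaustive check of the column-wise 8b x 8b decomposition used by MFU:
# x7:0*y7:0 = x1:0*y + (x3:2*y << 2) + (x5:4*y << 4) + (x7:6*y << 6)

def slices2(v):
    """Split a signed 8-bit value into four 2-bit slices; only the top slice is signed."""
    s0, s1, s2 = v & 3, (v >> 2) & 3, (v >> 4) & 3
    s3 = (v - (s2 << 4) - (s1 << 2) - s0) >> 6   # signed top slice
    return s3, s2, s1, s0

for x in range(-128, 128):
    for y in range(-128, 128):
        x3, x2, x1, x0 = slices2(x)
        p = x0 * y + ((x1 * y) << 2) + ((x2 * y) << 4) + ((x3 * y) << 6)
        assert p == x * y
print("column-wise 8b x 8b decomposition verified")
```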
TABLE I. Products and Their Two's Complement Representations for All the Possible Multiplication Cases Encountered by a BB.
Fig. 7. (a) Architecture of the extended RFU, which can additionally support the 8b × 4b mode. (b) Bitwidth details of the circuits outside the BBUs but in nRFU.
III. PROPOSED PRECISION-SCALABLE MULTIPLY-ACCUMULATE UNIT ARCHITECTURES

This section provides a detailed discussion of our proposed PSMAC architectures. We first propose one architecture for general purpose, i.e., supporting all the uniform quantization schemes. Considering the prevalent quantization scheme in which activations are quantized to nonnegative numbers, we then propose another architecture that achieves even lower complexity. Furthermore, bitwidth optimization is applied to both proposed architectures.

A. Proposed PSMAC Unit for General Purpose

Some recently reported works [14], [15], [17] show that a quantization scheme supporting only the 8b × 8b or/and 4b × 4b computation case(s) can achieve a good tradeoff between model size and accuracy. Besides, ternary neural networks (TNNs) [18], [19], [20], whose weights and activations are restricted to {−1, 0, +1}, are good alternatives in low-accuracy application scenarios. As a result, we choose to follow MFU [10], which supports three configuration modes: 2b × 2b, 4b × 4b and 8b × 8b.

We first optimize the architecture of the BB. In effect, five bits are sufficient to represent any product actually computed by a BB. In other words, p5 in Fig. 3 can be removed. Recall that each input of a BB is a 2-bit number extended by one sign bit. Thus the possible values of x2:0 and y2:0 are the same as those of the unextended 2-bit numbers, which are restricted to {0, ±1, ±2, +3}. All the possible products are listed in Table I. In most cases, p4 depends only on the sign bits of the inputs, i.e., p4 = x2 ⊕ y2. The one exception is that p4 is zero when one input is zero and the other is a negative number. Consequently, we have p4 = (x2 ⊕ y2)(x1 + x0)(y1 + y0), where "+" represents the OR operation. Since p5 has already been removed, the full adder outputting p4 and the AND gate corresponding to x2 and y2 can be replaced by the circuits implementing the new formula for p4. Moreover, the two full adders in the column of p3 can also be simplified, since their carry bits are no longer required. As the simplified BB (SBB) and the BB perform the same computation, any architecture employing BBs (e.g., FU and MFU) can be modified accordingly by replacing BB with SBB.

We adopt four BBs to construct a BB unit (BBU) that can be configured into the 2b × 2b or 4b × 4b mode. The architecture of the BBU is given in Fig. 6. When it is configured into the 4b × 4b mode, the computation process is the same as in Fig. 4(b) except for the input data arrangement. In the 2b × 2b mode, the outputs of the left-hand two adders pass through the MUXs to compute the inner product of weights and activations. To execute an 8b × 8b multiplication, one can divide each operand into two 4-bit numbers and then compute four sub-products: P1 = x7:4 × y7:4, P2 = x7:4 × y3:0, P3 = x3:0 × y7:4, and P4 = x3:0 × y3:0. Similar to Fig. 4(b), the final product can be recovered as x7:0 × y7:0 = merge{P1, P4} + ((P2 + P3) << 4). Fig. 6 also depicts the architecture of the proposed PSMAC unit, which consists of four BBUs. Clearly, it is constructed by recursively employing BBs to support both the 4b × 4b and 8b × 8b configuration modes. The top-level architecture of the proposed PSMAC unit is almost the same as that of the BBU, except for the shifters and circuit bitwidths. We refer to the proposed architecture as the recursive FU (RFU). In the 8b × 8b mode, the input data are arranged as shown in Fig. 2.

In some mixed-precision quantization schemes [3], [14], there are a few layers whose activations and weights are quantized using eight and four bits, respectively. MFU is not easy to adjust to additionally support the 8b × 4b configuration mode. In contrast, our proposed RFU can easily be extended to additionally support the 8b × 4b configuration mode at the expense of a moderate overhead. The extended RFU is shown in Fig. 7(a). When configured into the 8b × 4b mode, each column of BBUs is in charge of one 8b × 4b multiplication. In the 8b × 8b mode, the two columns of BBUs compute two sub-products Pleft and Pright, and the final result is recovered as (Pleft << 4) + Pright.
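For concreteness, the following Python sketch (ours, purely illustrative) exhaustively checks the simplified sign-bit logic of the SBB over the restricted input values of Table I.

```python
# Exhaustive check of the simplified sign-bit logic of the SBB.
# BB inputs are 3-bit sign extensions of 2-bit operands, i.e., values in
# {0, +-1, +-2, +3} (Table I), so the product fits in 5 bits (p5 is redundant)
# and p4 = (x2 XOR y2) AND (x1 OR x0) AND (y1 OR y0).

VALUES = [-2, -1, 0, 1, 2, 3]

def bits3(v):
    """Return the bits (b2, b1, b0) of the 3-bit two's complement form of v."""
    v &= 0b111
    return (v >> 2) & 1, (v >> 1) & 1, v & 1

for x in VALUES:
    for y in VALUES:
        p = x * y
        assert -16 <= p <= 15                 # five bits suffice, so p5 can be dropped
        p4 = ((p & 0b11111) >> 4) & 1         # bit 4 of the 5-bit two's complement product
        x2, x1, x0 = bits3(x)
        y2, y1, y0 = bits3(y)
        assert p4 == (x2 ^ y2) & (x1 | x0) & (y1 | y0)
print("simplified BB (SBB) sign-bit logic verified")
```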
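The recursive 8b × 8b decomposition used by RFU can be checked in the same spirit; the helpers split8 and merge below are again only an arithmetic model under the stated slice-signedness assumptions, not the BBU data path.

```python
# Exhaustive check of the recursive 8b x 8b decomposition used by RFU:
# x7:0*y7:0 = merge{P1, P4} + ((P2 + P3) << 4), with
# P1 = x7:4*y7:4, P2 = x7:4*y3:0, P3 = x3:0*y7:4, P4 = x3:0*y3:0.

def split8(v):
    """Split a signed 8-bit value into a signed upper nibble and an unsigned lower nibble."""
    lo = v & 0xF
    hi = (v - lo) >> 4
    return hi, lo

def merge(hi_prod, lo_prod):
    """P4 is a product of two unsigned nibbles (< 256), so merging is a
    carry-free concatenation into the lower eight bits."""
    return (hi_prod << 8) + lo_prod

for x in range(-128, 128):
    for y in range(-128, 128):
        xh, xl = split8(x)
        yh, yl = split8(y)
        P1, P2, P3, P4 = xh * yh, xh * yl, xl * yh, xl * yl
        assert merge(P1, P4) + ((P2 + P3) << 4) == x * y
print("recursive 8b x 8b decomposition verified")
```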
TABLE II. Synthesis Results for Different BBs.

B. Adaptation for Nonnegative-Activation Quantization and Bitwidth Optimization
Some quantization schemes, such as the two proposed by Google [13] and Qualcomm [21], quantize the activations into the interval [0, 255] using eight bits, i.e., uint8. In effect, activations are usually quantized into nonnegative numbers since ReLU has been widely adopted as the activation function of DNNs. Owing to the nonnegative-quantized activations, the architecture of the SBB can be further simplified by setting sy (see Fig. 3) to zero, in which case we have p4 = x2(y1 + y0). While activations are quantized into unsigned numbers, weights are always quantized into signed numbers in these quantization schemes. Consequently, we can set sx = 1 to save another AND gate for each SBB. The further simplified BB is denoted as nSBB, where "n" is the abbreviation of "nonnegative". The RFU employing nSBBs is referred to as nRFU.
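A short sketch (illustrative only) confirms that the SBB sign-bit formula collapses as stated once the activation slice is unsigned, over all input values a BB can see according to Table I.

```python
# Exhaustive check of the nSBB sign-bit logic under nonnegative activations:
# with the activation slice y unsigned (sy = 0, hence y2 = 0), the SBB formula
# p4 = (x2 XOR y2)(x1 OR x0)(y1 OR y0) collapses to p4 = x2 AND (y1 OR y0).

for x in [-2, -1, 0, 1, 2, 3]:                # weight slice seen by the BB (Table I)
    for y in [0, 1, 2, 3]:                    # nonnegative activation slice
        p4 = (((x * y) & 0b11111) >> 4) & 1   # bit 4 of the 5-bit two's complement product
        x2 = ((x & 0b111) >> 2) & 1           # sign bit of the 3-bit weight input
        y1, y0 = (y >> 1) & 1, y & 1
        assert p4 == x2 & (y1 | y0)
print("nSBB sign-bit simplification verified")
```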
Conventional TNNs are no longer supported by nRFU, since the nSBB only works for nonnegative-quantized activations. Fortunately, some nonnegative-activation quantization schemes [16], [17] yield acceptable accuracy even when aggressively quantizing both the weights and activations into two bits. In that case, nRFU can be configured into the 2b × 2b mode to support the 2-bit quantization. Note that RFU is also functional for the nonnegative-activation quantization schemes; however, it is not the most efficient choice in this situation.

We now give an insight into the decomposed multiplication and accordingly reduce the bitwidths of RFU and nRFU. Considering all the possible values of each signal, the bitwidths can be reduced compared with a naive implementation. For RFU, the benefits brought by such bitwidth optimization are not significant. The main reason is that only a few signals can have their bitwidths reduced, owing to the high generality of RFU. For example, all BBs have to output five bits to support all the possible product values listed in Table I. For nRFU, bitwidth optimization leads to more improvement. Bitwidth details of the circuits outside the BBUs but in nRFU are shown in Fig. 7(b). Meanwhile, Fig. 8 provides bitwidth details of the BBUs. As listed in Table I, four bits are sufficient for the output of a BB except in the case of 3 × 3 = 9. In the 2b × 2b mode, the 3 × 3 case never happens since the quantized weights are always 2-bit signed numbers. In the 4b × 4b mode, only the lower-right BB of each BBU can perform the 3 × 3 multiplication. Nevertheless, its output is treated as the least significant part in the merging operation, so the sign bit can be removed and four bits are also sufficient for the lower-right BB. Five-bit outputs are required by some BBs only in the 8b × 8b mode. For those BBs with only four output bits, their architectures can be further simplified. By means of this bitwidth optimization, the complexity of RFU and nRFU is further reduced, which is considered neither in [8] nor in [10]. Similarly, the bitwidths of FU and MFU could also be optimized, but this is not discussed in detail due to space limitations.
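The following range check (ours, not from the paper) illustrates the 4b × 4b argument: the slice signedness follows the merge-based decomposition above, and only the unsigned-by-unsigned sub-product can reach 9.

```python
# Range check behind the 4b x 4b bitwidth optimization: within a BBU, the three
# sub-products that involve a signed 2-bit slice fit in four signed bits, and the
# unsigned x unsigned sub-product fits in four unsigned bits, so a 5-bit BB output
# is only ever needed in the 8b x 8b mode.

signed_slice = range(-2, 2)      # {-2, -1, 0, 1}, e.g., x3:2 of a signed 4-bit operand
unsigned_slice = range(0, 4)     # {0, 1, 2, 3},   e.g., x1:0

assert all(-8 <= a * b <= 7 for a in signed_slice for b in signed_slice)      # x3:2 * y3:2
assert all(-8 <= a * b <= 7 for a in signed_slice for b in unsigned_slice)    # x3:2 * y1:0
assert all(-8 <= a * b <= 7 for a in unsigned_slice for b in signed_slice)    # x1:0 * y3:2
assert all(0 <= a * b <= 15 for a in unsigned_slice for b in unsigned_slice)  # lower-right BB: x1:0 * y1:0
print("four output bits suffice for every BB in the 4b x 4b mode")
```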
IV. EXPERIMENTAL RESULTS AND COMPARISONS

In this section, experimental results are provided to demonstrate the superiority of the proposed PSMAC architectures. All the synthesis results are obtained using TSMC 40-nm CMOS technology at a supply voltage of 0.99 V. The clock period is set to 1.5 ns for all designs.

Table II provides the synthesis results for the different BB designs. While SBB saves 9.40% area cost and 14.65% power consumption, the reduction ratios brought by nSBB are up to 48.07% and 54.50%, respectively. As discussed in Section III-B, some BBs can be further optimized since they only need to output four bits. For example, an nSBB with four output bits saves up to 50.83% of the area cost and 56.04% of the power consumption of the original BB.

TABLE III. Area (μm²) Breakdown of Different FU-Based PSMAC Architectures.

Table III gives the area breakdown for the three kinds of FU-based PSMAC architectures. As S&A logic dominates the area cost in [8] and [10], the simplification of the BB alone has only a small impact on the overall architecture.
Fortunately, the proposed recursive architecture surpasses the previous two designs and successfully decreases the proportion of S&A logic regardless of the type of BB employed. Assuming the same BB is employed, e.g., the original BB, our proposed recursive architecture still reduces the area cost by up to 39.11% and 35.39% when compared with [8] and [10], respectively.

Table III also lists the implementation results after applying the bitwidth optimization. Bitwidth optimization has a direct impact on the S&A logic, and some BBs can also be simplified by reducing their outputs to four bits. Over all the cases, bitwidth optimization brings 3.09%–36.94% area reduction.

TABLE IV. Comparisons of the Proposed and Existing FU-Based PSMAC Architectures.

A direct comparison of our proposed PSMAC architectures and the two existing FU-based designs is given in Table IV. Targeted at general-purpose quantization schemes, our proposed RFU saves 44.18% area cost and 45.45% power consumption when compared with MFU. With respect to the original FU, the two reduction ratios brought by RFU are 47.39% and 50.82%. Moreover, as mentioned above, the extended RFU additionally supports the 8b × 4b configuration mode while introducing an overhead of only 11.24%. Hence, the extended RFU outperforms MFU in terms of area cost, power consumption and configuration flexibility. Note that nonnegative-quantized activations are not considered by FU and MFU. In contrast, this brief develops an optimized PSMAC architecture dedicated to them. When nonnegative-activation quantization schemes [13], [21] are adopted, nRFU is more efficient than the designs proposed in [8] and [10].

The proposed PSMAC architectures impose no additional constraint on the top-level accelerator architecture. Therefore, they are applicable to all FU-based precision-scalable accelerators. As reported in [10], the processing element (PE) array of its design consumes more than 47% of the power. Recall that the proposed RFU can reduce the power of MFU by 45.45%. Directly replacing MFU with RFU is therefore expected to save around 20% of the overall accelerator power. Another example is the accelerator presented in [11]. According to its reported results, the PE array occupies over 60% of the chip area, so the benefits brought by our proposed PSMAC architectures would be even more significant.
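A back-of-the-envelope version of the accelerator-level estimate quoted above, under the simplifying assumption that the PE-array power reported in [10] is dominated by its PSMAC (MFU) units:

```python
# Rough accelerator-level estimate; assumes the PE-array power in [10] is
# dominated by its PSMAC (MFU) units, which is a simplification on our part.
pe_array_power_share = 0.47      # fraction of accelerator power spent in the PE array [10]
psmac_power_reduction = 0.4545   # RFU vs. MFU power reduction (Table IV)

overall_saving = pe_array_power_share * psmac_power_reduction
print(f"estimated overall power saving: {overall_saving:.1%}")  # ~21.4%, i.e., around 20%
```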
V. CONCLUSION

This brief proposes two low-complexity PSMAC architectures for two different quantization schemes. They are obtained by simplifying the basic unit BB, devising a new top-level architecture and applying bitwidth optimization. Experimental results demonstrate their superiority in terms of area cost and power consumption. A brief discussion is also given to show that the proposed PSMAC units can bring considerable benefits to some of the existing precision-scalable DNN accelerators.

REFERENCES

[1] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[2] Z. Dong, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, "HAWQ: Hessian aware quantization of neural networks with mixed-precision," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 293–302.
[3] Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer, "ZeroQ: A novel zero shot quantization framework," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 13166–13175.
[4] Y. Huang et al., "LSMQ: A layer-wise sensitivity-based mixed-precision quantization method for bit-flexible CNN accelerator," in Proc. Int. SoC Des. Conf., 2021, pp. 256–257.
[5] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitect., 2016, pp. 1–12.
[6] J. Albericio et al., "Bit-pragmatic deep neural network computing," in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchitect., 2017, pp. 382–394.
[7] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos, "Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks," in Proc. 55th Annu. Des. Autom. Conf., 2018, pp. 1–6.
[8] H. Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit., 2018, pp. 764–775.
[9] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 173–185, Jan. 2019.
[10] W. Liu, J. Lin, and Z. Wang, "A precision-scalable energy-efficient convolutional neural network accelerator," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 10, pp. 3484–3497, Oct. 2020.
[11] S. Ryu et al., "BitBlade: Energy-efficient variable bit-precision hardware accelerator for quantized neural networks," IEEE J. Solid-State Circuits, vol. 57, no. 6, pp. 1924–1935, Jun. 2022.
[12] V. Camus, L. Mei, C. Enz, and M. Verhelst, "Review and benchmarking of precision-scalable multiply-accumulate unit architectures for embedded neural-network processing," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 4, pp. 697–711, Dec. 2019.
[13] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2704–2713.
[14] Y. Li et al., "BRECQ: Pushing the limit of post-training quantization by block reconstruction," in Proc. Int. Conf. Learn. Represent., 2021, pp. 1–16.
[15] Z. Yao et al., "HAWQ-V3: Dyadic neural network quantization," in Proc. Int. Conf. Mach. Learn., 2021, pp. 11875–11886.
[16] S. Xu et al., "Generative low-bitwidth data free quantization," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 1–17.
[17] Y. Liu, W. Zhang, and J. Wang, "Zero-shot adversarial quantization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1512–1521.
[18] H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Petrot, "Ternary neural networks for resource-efficient AI applications," in Proc. Int. Joint Conf. Neural Netw., 2017, pp. 2547–2554.
[19] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, "GXNOR-Net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework," Neural Netw., vol. 100, pp. 49–58, Apr. 2018.
[20] Y. Zhang et al., "When sorting network meets parallel bitstreams: A fault-tolerant parallel ternary neural network accelerator based on stochastic computing," in Proc. Des. Autom. Test Eur. Conf. Exhibit., 2020, pp. 1287–1290.
[21] M. Nagel et al., "A white paper on neural network quantization," 2021, arXiv:2106.08295.