Low-Complexity Precision-Scalable Multiply-Accumulate Unit Architectures for Deep Neural Network Accelerators

Wenjie Li, Aokun Hu, Gang Wang, Ningyi Xu, and Guanghui He, Member, IEEE
Abstract—Precision-scalable deep neural network (DNN) accelerator designs have attracted much research interest. Since the computation of most DNNs is dominated by multiply-accumulate (MAC) operations, designing efficient precision-scalable MAC (PSMAC) units is of central importance. This brief proposes two low-complexity PSMAC unit architectures based on the well-known Fusion Unit (FU), which is composed of a few basic units called Bit Bricks (BBs). We first simplify the architecture of the BB by optimizing away redundant logic. A top-level PSMAC architecture is then devised by employing BBs recursively. Accordingly, two low-complexity PSMAC unit architectures are presented for two different kinds of quantization schemes. Moreover, we provide an insight into the decomposed multiplications and further reduce the bitwidths of the two architectures. Experimental results show that our proposed architectures save up to 44.18% area cost and 45.45% power consumption when compared with the state-of-the-art design.

Index Terms—Deep neural networks (DNNs), multiply-accumulate (MAC), precision-scalable, low-complexity architecture.

Manuscript received 19 November 2022; accepted 16 December 2022. Date of publication 22 December 2022; date of current version 29 March 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2020YFB2205500, and in part by the National Natural Science Foundation of China under Grant 62074097. This brief was recommended by Associate Editor M. Huang. (Corresponding authors: Ningyi Xu; Guanghui He.) The authors are with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]). Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCSII.2022.3231418. Digital Object Identifier 10.1109/TCSII.2022.3231418

I. INTRODUCTION

Deep neural networks (DNNs) have achieved great success over the past few years. They have not only beaten records in image classification and speech recognition, but have also been employed in a wide range of applications such as object detection and natural language processing. To deal with the tremendous amount of computation, hardware acceleration for DNNs has become an overwhelming trend [1]. For almost all DNNs, the computation is dominated by multiply-accumulate (MAC) operations. Therefore, MAC units occupy an important place in DNN accelerator design.

In addition to reducing the model size, quantization is essential for DNN hardware implementations. It maps floating-point weights or/and activations to a smaller set of fixed-point values. With the assistance of quantization, one can implement a DNN accelerator using only fixed-point arithmetic units. When advanced quantization techniques are adopted, the bitwidth varies significantly across DNNs and even across layers within a DNN [2], [3], [4]. To maximize the utilization of MAC units under different precision configurations, precision scalability is supported by many DNN accelerators. The MAC units of these precision-scalable accelerators are built either on decomposed multipliers [8], [10], [11] or on bit-serial approaches [5], [6], [7], [9]. Among these precision-scalable MAC (PSMAC) unit architectures, the Fusion Unit (FU) [8] and its variants show excellent energy efficiency and throughput [12]. This brief focuses on FU-based PSMAC architectures unless otherwise specified.

A FU comprises 16 basic units called Bit Bricks (BBs), each of which can perform a 2b × 2b multiplication. When configured into a high-precision mode, the partial products generated by the BBs are shifted and added together to recover the final product. A FU usually supports at most an 8b × 8b multiplication in one cycle, and so does the PSMAC unit proposed in this brief. Obviously, it can achieve higher throughput in the low-precision modes than a traditional 8b × 8b multiplier, at the expense of area overhead. Specifically, the introduced shift-add (S&A) logic occupies more than half of the overall area cost. By reducing the number of supported modes, a new FU architecture is proposed in [10] to lower the hardware complexity. However, the improvement is still limited.

The goal of this brief is to further reduce the complexity of the existing FU-based PSMAC architectures. The architecture of the BB is first optimized by eliminating some redundant logic. We next devise a top-level architecture for the PSMAC unit by employing BBs in a recursive manner. Then two FU-based PSMAC architectures are presented for two quantization schemes: one for general purpose and the other for schemes that quantize activations into nonnegative values. Furthermore, an insight into the decomposed multiplication is given and the bitwidths of the proposed architectures are reduced accordingly. These optimization techniques are demonstrated to be effective by our experiments. It is also shown that our proposed architectures can significantly reduce the overall area cost and power consumption.

II. FUSION UNIT AND RELATED WORKS
Since the bitwidth varies significantly across DNNs and may even be individually adjusted for each layer within a DNN, the benefits brought by fixed-bitwidth accelerators are limited. To take full advantage of the bitwidth variety, a bit-level dynamically composable accelerator architecture referred to as Bit Fusion is presented in [8]. It achieves excellent performance in terms of throughput and energy efficiency. The PSMAC unit employed in Bit Fusion is called the FU, which is composed of 16 BBs and some S&A units, as depicted in Fig. 1. The supported bitwidths for the input activations and weights include 2b, 4b and 8b. In the 2b × 2b configuration mode, each BB computes the product of a 2b weight and a 2b activation. In a configuration mode for higher precision, each BB computes a sub-product that is shifted and added with the others to recover the final result(s). For example, Fig. 2 shows how an 8b × 8b multiplication is decomposed such that each sub-multiplication is handled by one BB. Except in the 8b × 8b mode, a FU can perform multiple multiplications between activations and weights and add their products together, i.e., it obtains an inner product of an activation vector and a weight vector. This gives Bit Fusion higher throughput than conventional accelerators that adopt fixed-bitwidth MAC units. As the input bitwidth decreases, more multiplications can be performed simultaneously by a FU.

For different modes or/and different quantization schemes, the inputs of a BB may be signed or unsigned. Thus each of them has to be assigned one extended sign bit to support both signed and unsigned numbers. To this end, a BB is set to perform a multiplication between two 3b signed numbers [8]. Its architecture, based on Baugh-Wooley multiplication, is given in Fig. 3, where "HA" and "FA" represent half adder and full adder, respectively. sx and sy are flag signals indicating whether the inputs are signed or unsigned numbers (1 for signed and 0 for unsigned). In addition to the 16 BBs, S&A units are adopted in the FU, and they take more than half of the overall area cost.

It is pointed out in [10] that the accuracy gap incurred by different precisions is acceptable in some cases. Consequently, the fine granularity of the original FU is not necessary, and [10] suggests supporting only three configuration modes: 8b × 8b, 4b × 4b and 2b × 2b. To save S&A logic, [10] devises a merge-based FU (MFU). For the sake of brevity, we henceforth use xj:i to denote the binary representation (xj xj−1 . . . xi+1 xi)2. Take a 4b × 4b decomposed multiplication x3:0 × y3:0 as an example. Four sub-products are computed, shifted and added together. Note that x1:0 and y1:0 are both unsigned numbers and their sub-product is at most four bits wide. Therefore, [10] directly merges the sub-products of x3:2 × y3:2 and x1:0 × y1:0. Meanwhile, the sub-products of x3:2 × y1:0 and x1:0 × y3:2 are added together and then shifted. Finally, the two intermediate results are added up to recover the product. The process can be formulated as x3:0 × y3:0 = merge{x3:2 × y3:2, x1:0 × y1:0} + ((x3:2 × y1:0 + x1:0 × y3:2) << 2). The computation of the 4b × 4b multiplications executed by the original FU and by MFU is illustrated in Fig. 4. Clearly, Fig. 4(b) needs fewer shifters and two-input adders.

Fig. 5 illustrates the MFU architecture devised in [10]. It can be configured into three modes: 2b × 2b, 4b × 4b and 8b × 8b. The configuration is handled by the MUXs. Fig. 5 also shows the data path when MFU is configured into the 8b × 8b mode. In this mode, the multiplication is decomposed into four parts: x7:0 × y7:0 = x1:0 × y7:0 + ((x3:2 × y7:0) << 2) + ((x5:4 × y7:0) << 4) + ((x7:6 × y7:0) << 6). The four parts are respectively handled by the four columns of BBs. Analogous to the 4b × 4b case, there are a few merging operations. Such an architecture saves S&A logic and hence overall area cost. However, as we will see in Section IV, the improvement brought by MFU is still limited.
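As an illustration only (not part of the original designs), the following Python sketch exhaustively checks the merge-based 4b × 4b decomposition above over all signed 4-bit operands; the helper names split4 and merge are ours and model the arithmetic, not the hardware data path.

```python
# Exhaustive check of the merge-based 4b x 4b decomposition used by MFU:
# x3:0 * y3:0 = merge{x3:2*y3:2, x1:0*y1:0} + ((x3:2*y1:0 + x1:0*y3:2) << 2)

def split4(v):
    """Split a signed 4-bit value into a signed upper 2-bit slice and an
    unsigned lower 2-bit slice, so that v == (hi << 2) + lo."""
    lo = v & 0b11          # unsigned lower slice, in {0, 1, 2, 3}
    hi = (v - lo) >> 2     # signed upper slice, in {-2, -1, 0, 1}
    return hi, lo

def merge(hi_prod, lo_prod):
    """The merge operation of [10]: the lower sub-product is unsigned and at most
    four bits wide, so merging is just (hi_prod << 4) + lo_prod with no carry."""
    return (hi_prod << 4) + lo_prod

for x in range(-8, 8):
    for y in range(-8, 8):
        xh, xl = split4(x)
        yh, yl = split4(y)
        assert merge(xh * yh, xl * yl) + ((xh * yl + xl * yh) << 2) == x * y
print("merge-based 4b x 4b decomposition verified")
```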
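Likewise, a small sketch (again only illustrative; slices2 is a hypothetical helper) checks the column-wise 8b × 8b decomposition of MFU, in which only the top 2-bit slice of the weight keeps its sign.

```python
# Exhaustive check of the column-wise 8b x 8b decomposition used by MFU:
# x7:0*y7:0 = x1:0*y + (x3:2*y << 2) + (x5:4*y << 4) + (x7:6*y << 6)

def slices2(v):
    """Split a signed 8-bit value into four 2-bit slices; only the top slice is signed."""
    s0, s1, s2 = v & 3, (v >> 2) & 3, (v >> 4) & 3
    s3 = (v - (s2 << 4) - (s1 << 2) - s0) >> 6   # signed top slice
    return s3, s2, s1, s0

for x in range(-128, 128):
    for y in range(-128, 128):
        x3, x2, x1, x0 = slices2(x)
        p = x0 * y + ((x1 * y) << 2) + ((x2 * y) << 4) + ((x3 * y) << 6)
        assert p == x * y
print("column-wise 8b x 8b decomposition verified")
```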
TABLE I. Products and Their Two's Complement Representations for All the Possible Multiplication Cases Encountered by a BB.
Fig. 7. (a) Architecture of the extended RFU, which can additionally support the 8b × 4b mode. (b) Bitwidth details of the circuits outside the BBUs but in nRFU.
III. PROPOSED PRECISION-SCALABLE MULTIPLY-ACCUMULATE UNIT ARCHITECTURES

This section provides a detailed discussion of our proposed PSMAC architectures. We first propose one architecture for general purpose, i.e., supporting all the uniform quantization schemes. Considering the prevalent quantization scheme in which activations are quantized to nonnegative numbers, we then propose another architecture that achieves even lower complexity. Furthermore, bitwidth optimization is applied to both proposed architectures.

A. Proposed PSMAC Unit for General Purpose

Some recently reported works [14], [15], [17] show that a quantization scheme supporting only the 8b × 8b or/and 4b × 4b computation case(s) can achieve a good tradeoff between model size and accuracy. Besides, ternary neural networks (TNNs) [18], [19], [20], whose weights and activations are restricted to {−1, 0, +1}, are good alternatives in low-accuracy application scenarios. As a result, we choose to follow MFU [10], which supports three configuration modes: 2b × 2b, 4b × 4b and 8b × 8b.

We first optimize the architecture of the BB. In effect, five bits are sufficient to represent any product actually computed by a BB. In other words, p5 in Fig. 3 can be removed. Recall that each input of a BB is a 2-bit number extended by one sign bit. Thus the possible values of x2:0 and y2:0 are the same as those of the unextended 2-bit numbers, which are restricted to {0, ±1, ±2, +3}. All the possible products are listed in Table I. In most cases, p4 depends only on the sign bits of the inputs, i.e., p4 = x2 ⊕ y2. The one exception is that p4 is zero when one input is zero and the other is a negative number. Consequently, we have p4 = (x2 ⊕ y2)(x1 + x0)(y1 + y0), where "+" represents the OR operation. Since p5 has already been removed, the full adder outputting p4 and the AND gate corresponding to x2 and y2 can be replaced by the circuits implementing the new formula for p4. Moreover, the two full adders in the column of p3 can also be simplified, since their carry bits are no longer required. As the simplified BB (SBB) and the BB perform the same computation, any architecture employing BBs (e.g., FU and MFU) can be modified accordingly by replacing BB with SBB.

We adopt four BBs to construct a BB unit (BBU) that can be configured into the 2b × 2b or 4b × 4b mode. The architecture of the BBU is given in Fig. 6. When it is configured into the 4b × 4b mode, the computation process is the same as in Fig. 4(b) except for the input data arrangement. In the 2b × 2b mode, the outputs of the left-hand two adders pass through the MUXs to compute the inner product of weights and activations. To execute an 8b × 8b multiplication, one can divide each operand into two 4-bit numbers and then compute four sub-products: P1 = x7:4 × y7:4, P2 = x7:4 × y3:0, P3 = x3:0 × y7:4, and P4 = x3:0 × y3:0. Similar to Fig. 4(b), the final product can be recovered as x7:0 × y7:0 = merge{P1, P4} + ((P2 + P3) << 4). Fig. 6 also depicts the architecture of the proposed PSMAC unit, which consists of four BBUs. Clearly, it is constructed by recursively employing BBs to support both the 4b × 4b and 8b × 8b configuration modes. The top-level architecture of the proposed PSMAC unit is almost the same as that of the BBU, except for the shifters and circuit bitwidths. We refer to the proposed architecture as the recursive FU (RFU). In the 8b × 8b mode, the input data are arranged as shown in Fig. 2.

In some mixed-precision quantization schemes [3], [14], there are a few layers whose activations and weights are quantized using eight and four bits, respectively. MFU is not easy to adjust to additionally support the 8b × 4b configuration mode. In contrast, our proposed RFU can easily be extended to additionally support the 8b × 4b configuration mode at the expense of a moderate overhead. The extended RFU is shown in Fig. 7(a). When configured into the 8b × 4b mode, each column of BBUs is in charge of one 8b × 4b multiplication. In the 8b × 8b mode, the two columns of BBUs compute two sub-products Pleft and Pright, and the final result is recovered as (Pleft << 4) + Pright.
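For concreteness, the following Python sketch (ours, purely illustrative) exhaustively checks the simplified sign-bit logic of the SBB over the restricted input values of Table I.

```python
# Exhaustive check of the simplified sign-bit logic of the SBB.
# BB inputs are 3-bit sign extensions of 2-bit operands, i.e., values in
# {0, +-1, +-2, +3} (Table I), so the product fits in 5 bits (p5 is redundant)
# and p4 = (x2 XOR y2) AND (x1 OR x0) AND (y1 OR y0).

VALUES = [-2, -1, 0, 1, 2, 3]

def bits3(v):
    """Return the bits (b2, b1, b0) of the 3-bit two's complement form of v."""
    v &= 0b111
    return (v >> 2) & 1, (v >> 1) & 1, v & 1

for x in VALUES:
    for y in VALUES:
        p = x * y
        assert -16 <= p <= 15                 # five bits suffice, so p5 can be dropped
        p4 = ((p & 0b11111) >> 4) & 1         # bit 4 of the 5-bit two's complement product
        x2, x1, x0 = bits3(x)
        y2, y1, y0 = bits3(y)
        assert p4 == (x2 ^ y2) & (x1 | x0) & (y1 | y0)
print("simplified BB (SBB) sign-bit logic verified")
```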
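The recursive 8b × 8b decomposition used by RFU can be checked in the same spirit; the helpers split8 and merge below are again only an arithmetic model under the stated slice-signedness assumptions, not the BBU data path.

```python
# Exhaustive check of the recursive 8b x 8b decomposition used by RFU:
# x7:0*y7:0 = merge{P1, P4} + ((P2 + P3) << 4), with
# P1 = x7:4*y7:4, P2 = x7:4*y3:0, P3 = x3:0*y7:4, P4 = x3:0*y3:0.

def split8(v):
    """Split a signed 8-bit value into a signed upper nibble and an unsigned lower nibble."""
    lo = v & 0xF
    hi = (v - lo) >> 4
    return hi, lo

def merge(hi_prod, lo_prod):
    """P4 is a product of two unsigned nibbles (< 256), so merging is a
    carry-free concatenation into the lower eight bits."""
    return (hi_prod << 8) + lo_prod

for x in range(-128, 128):
    for y in range(-128, 128):
        xh, xl = split8(x)
        yh, yl = split8(y)
        P1, P2, P3, P4 = xh * yh, xh * yl, xl * yh, xl * yl
        assert merge(P1, P4) + ((P2 + P3) << 4) == x * y
print("recursive 8b x 8b decomposition verified")
```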
TABLE II. Synthesis Results for Different BBs.

B. Adaptation for Nonnegative-Activation Quantization and Bitwidth Optimization
Some quantization schemes, such as the two proposed by Google [13] and Qualcomm [21], quantize the activations into the interval [0, 255] using eight bits, i.e., uint8. In effect, activations are usually quantized into nonnegative numbers since ReLU has been widely adopted as the activation function of DNNs. Owing to the nonnegative-quantized activations, the architecture of the SBB can be further simplified by setting sy (see Fig. 3) to zero, in which case we have p4 = x2(y1 + y0). While activations are quantized into unsigned numbers, weights are always quantized into signed numbers in these quantization schemes. Consequently, we can set sx = 1 to save another AND gate for each SBB. The further simplified BB is denoted as nSBB, where "n" is the abbreviation of "nonnegative". The RFU employing nSBBs is referred to as nRFU.
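A short sketch (illustrative only) confirms that the SBB sign-bit formula collapses as stated once the activation slice is unsigned, over all input values a BB can see according to Table I.

```python
# Exhaustive check of the nSBB sign-bit logic under nonnegative activations:
# with the activation slice y unsigned (sy = 0, hence y2 = 0), the SBB formula
# p4 = (x2 XOR y2)(x1 OR x0)(y1 OR y0) collapses to p4 = x2 AND (y1 OR y0).

for x in [-2, -1, 0, 1, 2, 3]:                # weight slice seen by the BB (Table I)
    for y in [0, 1, 2, 3]:                    # nonnegative activation slice
        p4 = (((x * y) & 0b11111) >> 4) & 1   # bit 4 of the 5-bit two's complement product
        x2 = ((x & 0b111) >> 2) & 1           # sign bit of the 3-bit weight input
        y1, y0 = (y >> 1) & 1, y & 1
        assert p4 == x2 & (y1 | y0)
print("nSBB sign-bit simplification verified")
```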
Conventional TNNs are no longer supported by nRFU, since the nSBB only works for nonnegative-quantized activations. Fortunately, some nonnegative-activation quantization schemes [16], [17] yield acceptable accuracy even when aggressively quantizing both the weights and activations into two bits. In that case, nRFU can be configured into the 2b × 2b mode to support the 2-bit quantization. Note that RFU is also functional for the nonnegative-activation quantization schemes; however, it is not the most efficient choice in this situation.

We now give an insight into the decomposed multiplication and accordingly reduce the bitwidths of RFU and nRFU. Considering all the possible values of each signal, the bitwidths can be reduced compared with a naive implementation. For RFU, the benefits brought by such bitwidth optimization are not significant. The main reason is that only a few signals can have their bitwidths reduced, owing to the high generality of RFU. For example, all BBs have to output five bits to support all the possible product values listed in Table I. For nRFU, bitwidth optimization leads to more improvement. Bitwidth details of the circuits outside the BBUs but in nRFU are shown in Fig. 7(b). Meanwhile, Fig. 8 provides bitwidth details of the BBUs. As listed in Table I, four bits are sufficient for the output of a BB except in the case of 3 × 3 = 9. In the 2b × 2b mode, the 3 × 3 case never happens since the quantized weights are always 2-bit signed numbers. In the 4b × 4b mode, only the lower-right BB of each BBU can perform the 3 × 3 multiplication. Nevertheless, its output is treated as the least significant part in the merging operation, so the sign bit can be removed and four bits are also sufficient for the lower-right BB. Five-bit outputs are required by some BBs only in the 8b × 8b mode. For those BBs with only four output bits, their architectures can be further simplified. By means of this bitwidth optimization, the complexity of RFU and nRFU is further reduced, which is considered neither in [8] nor in [10]. Similarly, the bitwidths of FU and MFU could also be optimized, but this is not discussed in detail due to space limitations.
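The following range check (ours, not from the paper) illustrates the 4b × 4b argument: the slice signedness follows the merge-based decomposition above, and only the unsigned-by-unsigned sub-product can reach 9.

```python
# Range check behind the 4b x 4b bitwidth optimization: within a BBU, the three
# sub-products that involve a signed 2-bit slice fit in four signed bits, and the
# unsigned x unsigned sub-product fits in four unsigned bits, so a 5-bit BB output
# is only ever needed in the 8b x 8b mode.

signed_slice = range(-2, 2)      # {-2, -1, 0, 1}, e.g., x3:2 of a signed 4-bit operand
unsigned_slice = range(0, 4)     # {0, 1, 2, 3},   e.g., x1:0

assert all(-8 <= a * b <= 7 for a in signed_slice for b in signed_slice)      # x3:2 * y3:2
assert all(-8 <= a * b <= 7 for a in signed_slice for b in unsigned_slice)    # x3:2 * y1:0
assert all(-8 <= a * b <= 7 for a in unsigned_slice for b in signed_slice)    # x1:0 * y3:2
assert all(0 <= a * b <= 15 for a in unsigned_slice for b in unsigned_slice)  # lower-right BB: x1:0 * y1:0
print("four output bits suffice for every BB in the 4b x 4b mode")
```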
IV. EXPERIMENTAL RESULTS AND COMPARISONS

In this section, experimental results are provided to demonstrate the superiority of the proposed PSMAC architectures. All the synthesis results are obtained using TSMC 40-nm CMOS technology at a supply voltage of 0.99 V. The clock period is set to 1.5 ns for all designs.

Table II provides the synthesis results for the different BB designs. While SBB saves 9.40% area cost and 14.65% power consumption, the reduction ratios brought by nSBB are up to 48.07% and 54.50%, respectively. As discussed in Section III-B, some BBs can be further optimized since they only need to output four bits. For example, an nSBB with four output bits saves up to 50.83% of the area cost and 56.04% of the power consumption of the original BB.

TABLE III. Area (μm²) Breakdown of Different FU-Based PSMAC Architectures.

Table III gives the area breakdown for the three kinds of FU-based PSMAC architectures. As S&A logic dominates the area cost in [8] and [10], the simplification of the BB alone has only a small impact on the overall architecture.
Fortunately, the proposed recursive architecture surpasses the previous two designs and successfully decreases the proportion of S&A logic regardless of the type of BB employed. Assuming the same BB is employed, e.g., the original BB, our proposed recursive architecture still reduces the area cost by up to 39.11% and 35.39% when compared with [8] and [10], respectively.

Table III also lists the implementation results after applying the bitwidth optimization. Bitwidth optimization has a direct impact on the S&A logic, and some BBs can also be simplified by reducing their outputs to four bits. Over all the cases, bitwidth optimization brings 3.09%–36.94% area reduction.

TABLE IV. Comparisons of the Proposed and Existing FU-Based PSMAC Architectures.

A direct comparison of our proposed PSMAC architectures and the two existing FU-based designs is given in Table IV. Targeted at general-purpose quantization schemes, our proposed RFU saves 44.18% area cost and 45.45% power consumption when compared with MFU. With respect to the original FU, the two reduction ratios brought by RFU are 47.39% and 50.82%. Moreover, as mentioned above, the extended RFU additionally supports the 8b × 4b configuration mode while introducing an overhead of only 11.24%. Hence, the extended RFU outperforms MFU in terms of area cost, power consumption and configuration flexibility. Note that nonnegative-quantized activations are not considered by FU and MFU. In contrast, this brief develops an optimized PSMAC architecture dedicated to them. When nonnegative-activation quantization schemes [13], [21] are adopted, nRFU is more efficient than the designs proposed in [8] and [10].

The proposed PSMAC architectures impose no additional constraint on the top-level accelerator architecture. Therefore, they are applicable to all FU-based precision-scalable accelerators. As reported in [10], the processing element (PE) array of its design consumes more than 47% of the power. Recall that the proposed RFU can reduce the power of MFU by 45.45%. Directly replacing MFU with RFU is therefore expected to save around 20% of the overall accelerator power. Another example is the accelerator presented in [11]. According to its reported results, the PE array occupies over 60% of the chip area, so the benefits brought by our proposed PSMAC architectures would be even more significant.
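A back-of-the-envelope version of the accelerator-level estimate quoted above, under the simplifying assumption that the PE-array power reported in [10] is dominated by its PSMAC (MFU) units:

```python
# Rough accelerator-level estimate; assumes the PE-array power in [10] is
# dominated by its PSMAC (MFU) units, which is a simplification on our part.
pe_array_power_share = 0.47      # fraction of accelerator power spent in the PE array [10]
psmac_power_reduction = 0.4545   # RFU vs. MFU power reduction (Table IV)

overall_saving = pe_array_power_share * psmac_power_reduction
print(f"estimated overall power saving: {overall_saving:.1%}")  # ~21.4%, i.e., around 20%
```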
V. CONCLUSION

This brief proposes two low-complexity PSMAC architectures for two different quantization schemes. They are obtained by simplifying the basic unit BB, devising a new top-level architecture and applying bitwidth optimization. Experimental results demonstrate their superiority in terms of area cost and power consumption. A brief discussion is also given to show that the proposed PSMAC units can bring considerable benefits to some of the existing precision-scalable DNN accelerators.

REFERENCES

[1] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017.
[2] Z. Dong, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, "HAWQ: Hessian aware quantization of neural networks with mixed-precision," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 293–302.
[3] Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer, "ZeroQ: A novel zero shot quantization framework," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 13166–13175.
[4] Y. Huang et al., "LSMQ: A layer-wise sensitivity-based mixed-precision quantization method for bit-flexible CNN accelerator," in Proc. Int. SoC Des. Conf., 2021, pp. 256–257.
[5] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in Proc. 49th Annu. IEEE/ACM Int. Symp. Microarchitect., 2016, pp. 1–12.
[6] J. Albericio et al., "Bit-pragmatic deep neural network computing," in Proc. 50th Annu. IEEE/ACM Int. Symp. Microarchitect., 2017, pp. 382–394.
[7] S. Sharify, A. D. Lascorz, K. Siu, P. Judd, and A. Moshovos, "Loom: Exploiting weight and activation precisions to accelerate convolutional neural networks," in Proc. 55th Annu. Des. Autom. Conf., 2018, pp. 1–6.
[8] H. Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit., 2018, pp. 764–775.
[9] J. Lee, C. Kim, S. Kang, D. Shin, S. Kim, and H.-J. Yoo, "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE J. Solid-State Circuits, vol. 54, no. 1, pp. 173–185, Jan. 2019.
[10] W. Liu, J. Lin, and Z. Wang, "A precision-scalable energy-efficient convolutional neural network accelerator," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 10, pp. 3484–3497, Oct. 2020.
[11] S. Ryu et al., "BitBlade: Energy-efficient variable bit-precision hardware accelerator for quantized neural networks," IEEE J. Solid-State Circuits, vol. 57, no. 6, pp. 1924–1935, Jun. 2022.
[12] V. Camus, L. Mei, C. Enz, and M. Verhelst, "Review and benchmarking of precision-scalable multiply-accumulate unit architectures for embedded neural-network processing," IEEE J. Emerg. Sel. Topics Circuits Syst., vol. 9, no. 4, pp. 697–711, Dec. 2019.
[13] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 2704–2713.
[14] Y. Li et al., "BRECQ: Pushing the limit of post-training quantization by block reconstruction," in Proc. Int. Conf. Learn. Represent., 2021, pp. 1–16.
[15] Z. Yao et al., "HAWQ-V3: Dyadic neural network quantization," in Proc. Int. Conf. Mach. Learn., 2021, pp. 11875–11886.
[16] S. Xu et al., "Generative low-bitwidth data free quantization," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 1–17.
[17] Y. Liu, W. Zhang, and J. Wang, "Zero-shot adversarial quantization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 1512–1521.
[18] H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Petrot, "Ternary neural networks for resource-efficient AI applications," in Proc. Int. Joint Conf. Neural Netw., 2017, pp. 2547–2554.
[19] L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li, "GXNOR-Net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework," Neural Netw., vol. 100, pp. 49–58, Apr. 2018.
[20] Y. Zhang et al., "When sorting network meets parallel bitstreams: A fault-tolerant parallel ternary neural network accelerator based on stochastic computing," in Proc. Des. Autom. Test Eur. Conf. Exhibit., 2020, pp. 1287–1290.
[21] M. Nagel et al., "A white paper on neural network quantization," 2021, arXiv:2106.08295.