An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention
the same size of n × d as the input. It then separates each of Q, K, and V into M parts, named multi-head, to capture various dependencies of the input [23]. Each head uses Q, K, and V with a size of n × d_m (d_m = d/M) for attention computing, consisting of Q × K^T, softmax, quantization, and P × V, as shown in Fig. 1(b). Q × K^T and P × V introduce a computation amount proportional to n^2 × d. After that, the two-stage FFN, which is identical to the fully connected (FC) layers in CNNs, performs upward and downward projections to enhance the representation capacity of the model. Unlike the attention block, the computations of the FFN scale with n × d^2 since it does not need to capture the global relevance, avoiding the n^2 computational complexity.

In each layer, the attention block introduces a computational bottleneck. First, it is a trend to apply self-attention to more intractable applications, such as super-resolution or autonomous driving. In these tasks, the token length n is gigantic [24]. For instance, in a video attention network for an automotive application, 16 frames with 112 × 112 resolution result in an n equal to 2 × 10^5, ∼20× larger than d. In this case, the attention block takes a dominant computation proportion of 99.5% since its n^2 × d complexity far outweighs the n × d^2 of the FFN. On the other hand, current AI processors focus on optimizing convolution and FC layers [13], [14], [15], [16]. They work well for the FFN but incur 59.3% throughput degradation for attention computing due to the distinct computing flow, operand reuse pattern, and data access format. This inspires an urgent demand for a dedicated processor for attention blocks.

B. Challenges of Energy-Efficient Self-Attention Computing

Nevertheless, the WR-Tokens introduce three challenges for energy-efficient attention computing, as shown in Fig. 2.

First, the WR-Tokens lead to an energy consumption bottleneck for the entire attention block. Fig. 2 shows the partial derivative of softmax when normalizing a score (X) to a probability (P): a change of ΔX on X_i will affect the output by ΔP = ΔX · (2 · P_i · (1 − P_i)), according to the softmax partial derivative. Generally, constrained by ΣP = 1, at most one P operand is greater than 0.5, and the rest of the P operands must be in the range [0, 0.5]. In this range, ΔP increases with P_i. Based on ΔP, we can analyze the contribution and error tolerance of different X_i. Specifically, for the same variation ΔX, the score X_i that introduces a larger fluctuation ΔP makes a more significant contribution, and vice versa. Assuming X_m and X_n (X_m < X_n) receive the same ΔX, the fluctuation ΔP of X_m will be ΔX · (2 · P_m · (1 − P_m)), which is exponentially smaller than the ΔX · (2 · P_n · (1 − P_n)) of X_n. This means that the small scores (X_m) contribute less than the large ones (X_n). In other words, they are more error tolerant since they have an exponentially reduced influence on the output. In the GPT-2 model with a token length of 512, these small scores, indicating WR-Tokens, consume 93.1% of the energy but contribute only 6.3% of the accuracy.

Second, Q × K^T suffers from dynamic redundancy indicated by the WR-Tokens. For each attention row from Q × K^T, the softmax function uses e^(X_i − X_max) to distinguish score significances and quantizes the obtained results for the following computing. Since the WR-Tokens have small X_i that push X_i − X_max far below zero, the value of e^(X_i − X_max) is tiny. The n-bit quantization will map such a tiny e^(X_i − X_max) to zero if it is smaller than 1/2^n, the minimum value of an n-bit datum. Therefore, the computations of any X_i satisfying X_i − X_max < ln(1/2^n) are redundant; they take a proportion of 34.3% in Q × K^T, limiting the energy efficiency. Besides, this output sparsity depends on the variable threshold X_max of each row, and we cannot start speculation before obtaining the varied X_max. This makes the speculation more complex than for ReLU-based sparsity with its static threshold of zero.

Third, P × V incurs hardware under-utilization due to near-zero P's. After normalization, the summation of the P's is 1, where a few P's have larger values, but the rest of the P's, indicated by WR-Tokens, would be small. For instance, in a
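To make the sensitivity argument concrete, here is a minimal Python sketch of the first challenge (our illustrative addition: the Gaussian score distribution and row length are assumptions, not the paper's measured workload). It evaluates the derivative term 2 · P_i · (1 − P_i) for one attention row and shows that the small-score half of the row has an exponentially smaller summed contribution:

```python
import numpy as np

# Sensitivity sketch for one attention row: the output fluctuation under a
# score perturbation dX scales with 2 * P_i * (1 - P_i), the softmax
# derivative term from the text. The score distribution is illustrative.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 4.0, 512)              # scores X_i of one attention row
p = np.exp(x - x.max())
p /= p.sum()                               # probabilities P_i, sum(P) = 1

sens = 2.0 * p * (1.0 - p)                 # contribution weight of each X_i
order = np.argsort(x)                      # split the row at the median score
small, large = order[:256], order[256:]
print(f"summed sensitivity, small-score half: {sens[small].sum():.2e}")
print(f"summed sensitivity, large-score half: {sens[large].sum():.2e}")
```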
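The redundancy condition of the second challenge can be checked the same way. A hedged sketch, assuming 8-bit quantization; the 34.3% proportion quoted above is a workload measurement, so this synthetic row only demonstrates the mechanism and its row-dependent threshold:

```python
import numpy as np

# Redundancy sketch: an n-bit quantizer maps e^(X_i - X_max) to zero
# whenever X_i - X_max < ln(1 / 2^n), so those Q x K^T results are wasted.
n_bits = 8
rng = np.random.default_rng(1)
x = rng.normal(0.0, 4.0, 512)              # scores of one attention row

threshold = np.log(1.0 / 2 ** n_bits)      # ~ -5.55 for 8-bit data
redundant = (x - x.max()) < threshold      # threshold varies per row,
print(f"redundant Q x K^T results: {redundant.mean():.1%}")  # unlike ReLU's 0
```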
TABLE I
ACCURACY COMPARISON WITH DIFFERENT PRECISIONS
[28], [29], [30], [31], [32], [33]. However, the previous methods cannot match the value-adaptive error tolerance of attention computing, limiting their energy efficiency. We propose a BESA PE, which adaptively detects the value magnitude of the WR-Tokens with MSBs to gate their logic in LSBs for more energy saving and computes the SR-Tokens exactly. It adapts to the error tolerance of the attention block to break the energy bottleneck.

A. Energy Efficiency for Approximate Attention Computing

The previous approximate methods effectively increase the energy efficiency of CNNs but incur performance degradation for attention computing due to a theoretical error mismatch with softmax. Specifically, ReLU in CNNs does not change the value magnitude, so feature maps have invariant contributions before and after computing, as shown in Fig. 4. Softmax, by contrast, exponentially reduces the small scores but increases the significant scores. It introduces a value-adaptive magnitude variation after computing, where the small scores, indicating the WR-Tokens, have exponentially weakened contributions. Therefore, the attention block has a parabola-like error tolerance instead of the constant error tolerance of ReLU. The previous approximate methods, including the Mitchell-based and approximate-component-based multipliers [30], [31], [32], [33], target reducing the average error, which is appropriate for the invariant contribution of ReLU. However, they mismatch the parabola-like error tolerance of softmax, introducing two limitations for energy reduction. Specifically, we can evaluate the energy of different approximate methods with the following equation:

Energy = R_wr × P_sp + (1 − R_wr) × P_lp
       = (P_sp − P_lp) × R_wr + P_lp.    (1)

Here, R_wr denotes the WR-Tokens ratio, for which the PEs produce large errors but consume small power (P_sp). For the remaining (1 − R_wr) SR-Tokens, the PEs improve the precision while consuming large power (P_lp). Generally, P_sp differs significantly between methods due to their distinct approximate mechanisms. For instance, the Mitchell-based multiplier truncates several LSBs for energy reduction. Its average P_sp is 0.68 × P_ex, where P_ex is the power of exact computing. It is relatively smaller than that of the approximate-component-based way, which requires computing all bits and results in a P_sp equal to 0.77 × P_ex. In contrast, P_lp is similar across approaches since they require almost the same logical complexity to achieve similarly high precision. Therefore, we set it to 0.95 × P_ex for all methods to perform a fair comparison.

Since (P_sp − P_lp) < 0, (1) is a monotonically decreasing linear function of R_wr. Therefore, reducing P_sp to decrease the function's slope or increasing R_wr can significantly lessen energy. However, the previous approximate methods can neither achieve an extremely low P_sp nor a high R_wr owing to their average error reduction mechanism. On the one hand, they try to reconcile the computing error for all operands, where the strict precision requirement of large scores limits their truncated bit-width or the number of approximate components used. This results in a large P_sp in (1) that increases energy consumption. On the other hand, the low-precision mode of the previous methods introduces a significant error for the SR-Tokens, which have an exponentially increased contribution to the accuracy. To avoid corrupting the SR-Tokens, they must reduce R_wr. As shown in Fig. 4, the Mitchell-based method has to limit R_wr to ∼50% to satisfy the tolerable accuracy loss of 1.5%. The low R_wr confines its energy reduction to 21.6%.

B. BESA PE for Attention Adaptive Approximation

This article proposes a BESA PE, achieving value-adaptive approximation to match the parabola-like error tolerance of softmax. It significantly reduces P_sp and increases R_wr to break the energy bottleneck of attention computing.

Fig. 5 illustrates the overall workflow of mixed-mode computing with the proposed BESA PE. When computing each attention row of head_i, the attention core uses the 512-bit mask generated by head_{i−1} (BMask_{i−1}) to control the PE array to work in the mixed mode. Specifically, a PE performs approximate computing if it receives bit “0” but exact computing for bit “1.” For the obtained attention matrix, the top-k selector filters out the top k percent most significant scores of each row and generates a corresponding bit-mask, represented as BMask_i. Here, the value k is adjustable and determined by the
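Equation (1) is straightforward to evaluate numerically. The sketch below is a numerical illustration only; the P_sp and P_lp averages are the figures quoted above, while the R_wr values in the sweep are assumptions chosen to show the monotone decrease:

```python
# Energy model of (1), normalized to the exact-computing power Pex = 1.
def attention_energy(r_wr: float, p_sp: float, p_lp: float = 0.95) -> float:
    """Average per-operation energy for a WR-Token ratio r_wr."""
    return (p_sp - p_lp) * r_wr + p_lp     # monotonically decreasing in r_wr

for name, p_sp in [("Mitchell-based", 0.68), ("approx-component", 0.77)]:
    for r_wr in (0.5, 0.8):                # illustrative WR-Token ratios
        e = attention_energy(r_wr, p_sp)
        print(f"{name:16s} Rwr={r_wr:.0%}: {e:.3f} x Pex ({1 - e:.1%} saved)")
```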
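The mixed-mode flow around Fig. 5 can also be summarized behaviorally. This is a sketch under stated assumptions: approximate computing is crudely modeled by truncating operand LSBs, and the top-k selector by a software sort; the real BESA PE gates LSB logic in hardware, and the selector is a dedicated unit.

```python
import numpy as np

def attention_row_mixed(q_row, K, prev_mask, k_percent=10, trunc_bits=4):
    """One attention row of head_i under BMask_{i-1} (one bit per token)."""
    exact = K @ q_row                                   # mask bit "1"
    approx = ((K >> trunc_bits) << trunc_bits) @ q_row  # bit "0": gated LSBs
    scores = np.where(prev_mask, exact, approx)
    k = max(1, scores.size * k_percent // 100)          # keep top k percent
    next_mask = np.zeros(scores.size, dtype=bool)
    next_mask[np.argsort(scores)[-k:]] = True           # BMask_i for head_i+1
    return scores, next_mask

rng = np.random.default_rng(2)
K = rng.integers(-128, 128, size=(512, 64))             # INT8-like keys
q = rng.integers(-128, 128, size=64)
scores, bmask = attention_row_mixed(q, K, np.ones(512, dtype=bool))
print(f"SR-Tokens kept exact for the next head: {bmask.sum()}")  # 51 of 512
```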
A. Operations Dovetailing for Near-Zero Probabilities

Operation dovetailing is a practical approach to omit the “0”-valued PPs' computing for near-zero operands. As shown

B. OPCS for Dynamic Out-of-Order Dovetailing

This article proposes an OPCS, performing dynamic out-of-order dovetailing with the adaptive folding unit and
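The dovetailing idea can be illustrated with a minimal model. This is only a conceptual sketch; the adaptive folding unit and asymmetric BENES routers of the actual OPCS (Fig. 12) are not modeled. A near-zero operand has “0”-valued MSBs, so only its low partial-product (PP) rows are nonzero, and two such operands can share one 8-row PP array instead of leaving most of it idle:

```python
# Conceptual sketch of operation dovetailing for an 8x8 multiplier, which
# generates 8 partial-product (PP) rows. Rows above an operand's MSB are
# all "0"-valued, so two near-zero operands can be dovetailed into one
# PP array in a single multiplier pass.
PP_ROWS = 8

def live_pp_rows(p: int) -> int:
    """Number of nonzero PP rows for an unsigned 8-bit operand."""
    return p.bit_length()

def can_dovetail(p_a: int, p_b: int) -> bool:
    """Two operands fit one PP array if their live rows fit in 8 slots."""
    return live_pp_rows(p_a) + live_pp_rows(p_b) <= PP_ROWS

print(can_dovetail(0b00000011, 0b00000101))  # True: 2 + 3 live rows
print(can_dovetail(0b11110000, 0b00000101))  # False: a large P needs 8 rows
```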
Fig. 12. Overall architecture of OPCS with the adaptive folding unit and asymmetric BENES routers.
TABLE II
COMPARISON RESULTS ON ViT-B WITH THREE DIFFERENT SETUPS

Fig. 17. Accuracy and energy efficiency performance with the BESA PE.
TABLE III
MEASUREMENT RESULTS AND PERFORMANCE COMPARISON WITH STATE-OF-THE-ART PROCESSORS
by 1.43×. Apart from the computing speedup, the computation reduction from the BASU and the PP-generation reduction from the OPCS decrease the computing energy by 1.65×. In addition, the BESA PE exploited in setup 3) matches the error tolerance of the attention block, which reduces the energy consumption of the WR-Tokens and increases the approximation ratio over the traditional approximate method, saving 10.1% more energy. Compared with setup 1), setup 3) prominently reduces the time by 2.88× and the energy by 3.20×.

C. Comparison With State-of-the-Art Processors

Table III compares our processor with several state-of-the-art implementations, including the A100 graphics processing unit (GPU), a sparse matrix multiplication processor for LSTM, and Transformer-based processors. The proposed processor achieves the highest energy efficiency for attention computing. It is 17.66× higher than the A100 GPU at INT8 precision, which has a peak energy efficiency of 3.12 TOPS/W but achieves only 1.56 TOPS/W for the attention block. Specifically, the A100 GPU reaches its peak energy efficiency of 3.12 TOPS/W with 50% input structured sparsity, but it cannot effectively handle the irregular output attention sparsity and near-zero probabilities caused by the WR-Tokens. Therefore, the A100 can only work as a simple matrix multiplication processor without exploiting sparsity, limiting its energy efficiency to 1.56 TOPS/W. In addition, the energy efficiency of our processor is 9.37× higher than that of [12], which has a peak performance of 8.93 TOPS/W for sparse matrix multiplication but suffers a dramatic degradation of 3.04× for attention blocks. The proposed processor outperforms [12] for three reasons. First, the BESA PE takes less power than the exact PE adopted in [12]. Second, the BASU can speculate the output sparsity in the attention block, while [12] only supports input sparsity. Third, besides handling the zero operands as [12] does, the OPCS further skips the “0”-valued PPs' computing with operation dovetailing. Moreover, compared with ELSA [24], which also supports output sparsity speculation, the proposed processor possesses the following superiorities. First, the proposed BASU is lossless as it utilizes the fixed-point system's intrinsic finite word-length effects, while the hash-based speculation in ELSA introduces accuracy degradation. Second, the BASU performs zero skipping along with Q × K^T, avoiding the additional hash computation for sparsity speculation and the repeated memory access for zero skipping. Third, the BASU exploits the local property of the attention mechanism to perform diagonal-prior speculating, reducing more computations than speculating with the orthogonal matrix multiplication in ELSA. Apart from the advantage of sparsity speculation, the BESA PE can reduce MAC power, and the OPCS increases the throughput for the WR-Tokens, while ELSA lacks these abilities. As a result, the proposed processor reduces energy by 4.57× and offers a 3.73× speedup compared with ELSA.

VIII. CONCLUSION

This article designs a Transformer processor to fill the gap between current AI processors and the emerging attention mechanism. It achieves higher energy efficiency than the GPU, sparse matrix multiplication processors, and the prior Transformer processors by exploiting the naturally existent WR-Tokens with three innovations. First, the BESA PE appropriately matches the error-tolerance property of the attention block, significantly reducing the MAC energy while maintaining considerable accuracy. Second, the BASU speculates the output sparsity of the attention matrix by utilizing the local property of global attention, resulting in higher speculation efficiency for more redundant computation reduction. Third, the OPCS aggressively uses the “0”-valued MSBs in near-zero operands through an out-of-order mechanism, remarkably improving the hardware utilization. With these methods, our processor achieves a peak energy efficiency of 27.56 TOPS/W, breaking the computational bottleneck of the attention block and paving the way for attention-based applications, even in power-constrained systems.

REFERENCES

[1] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[2] A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[3] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2020.
[4] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012–10022.
[5] K. M. Choromanski et al., “Rethinking attention with performers,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2021. [Online]. Available: https://openreview.net
[6] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, “Bottleneck transformers for visual recognition,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 16519–16529.
[7] Q. Han et al., “On the connection between local attention and dynamic depth-wise convolution,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2022. [Online]. Available: https://openreview.net
[8] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, “MaX-DeepLab: End-to-end panoptic segmentation with mask transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 5463–5474.
[9] A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens, “Scaling local self-attention for parameter efficient visual backbones,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12894–12904.
[10] H. Sharma et al., “Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network,” in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 764–775.
[11] J.-S. Park et al., “9.5 A 6K-MAC feature-map-sparsity-aware neural processing unit in 5nm flagship mobile SoC,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 152–154.
[12] D. Kadetotad, S. Yin, V. Berisha, C. Chakrabarti, and J.-S. Seo, “An 8.93 TOPS/W LSTM recurrent neural network accelerator featuring hierarchical coarse-grain sparsity for on-device speech recognition,” IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1877–1887, Jul. 2020.
[13] S. Choi, J. Sim, M. Kang, Y. Choi, H. Kim, and L.-S. Kim, “An energy-efficient deep convolutional neural network training accelerator for in situ personalization on smart devices,” IEEE J. Solid-State Circuits, vol. 55, no. 10, pp. 2691–2702, Oct. 2020.
[14] H. Mo et al., “9.2 A 28nm 12.1TOPS/W dual-mode CNN processor using effective-weight-based convolution and error-compensation-based prediction,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 146–148.
[15] A. Agrawal et al., “9.1 A 7nm 4-core AI chip with 25.6TFLOPS hybrid FP8 training, 102.4TOPS INT4 inference and workload-aware throttling,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 144–146.
[16] T. Tambe et al., “9.8 A 25 mm2 SoC for IoT devices with 18ms noise-robust speech-to-text latency via Bayesian speech denoising and attention-based sequence-to-sequence DNN speech recognition in 16nm FinFET,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 158–160.
[17] D. Han et al., “HNPU: An adaptive DNN training processor utilizing stochastic dynamic fixed-point and active bit-precision searching,” IEEE J. Solid-State Circuits, vol. 56, no. 9, pp. 2858–2869, Sep. 2021.
[18] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, “7.7 LNPU: A 25.3TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 142–144.
[19] S. Kang et al., “7.4 GANPU: A 135TFLOPS/W multi-DNN training processor for GANs with speculative dual-sparsity exploitation,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 140–142.
[20] Y. Wang et al., “A 28nm 276.55TFLOPS/W sparse deep-neural-network training processor with implicit redundancy speculation and batch normalization reformulation,” in Proc. Symp. VLSI Circuits, Jun. 2021, pp. 1–2.
[21] F. Tu et al., “Evolver: A deep learning processor with on-device quantization–voltage–frequency tuning,” IEEE J. Solid-State Circuits, vol. 56, no. 2, pp. 658–673, Feb. 2021.
[22] Y. Wang et al., “A 28nm 27.5TOPS/W approximate-computing-based transformer processor with asymptotic sparsity speculating and out-of-order computing,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 464–465.
[23] T. J. Ham et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 328–341.
[24] T. J. Ham et al., “ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks,” in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 692–705.
[25] H. Jang, J. Kim, J.-E. Jo, J. Lee, and J. Kim, “MnnFast: A fast and scalable system architecture for memory-augmented neural networks,” in Proc. 46th Int. Symp. Comput. Archit., Jun. 2019, pp. 250–263.
[26] H. Wang, Z. Zhang, and S. Han, “SpAtten: Efficient sparse attention architecture with cascade token and head pruning,” in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), Feb. 2021, pp. 97–110.
[27] S. Sen and A. Raghunathan, “Approximate computing for long short term memory (LSTM) neural networks,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2266–2276, Nov. 2018.
[28] Z. G. Tasoulas, G. Zervakis, I. Anagnostopoulos, H. Amrouch, and J. Henkel, “Weight-oriented approximation for energy-efficient neural network inference accelerators,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 12, pp. 4670–4683, Dec. 2020.
[29] V. Mrazek, Z. Vasicek, L. Sekanina, M. A. Hanif, and M. Shafique, “ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2019, pp. 1–8.
[30] S. Vahdat, M. Kamal, A. Afzali-Kusha, and M. Pedram, “TOSAM: An energy-efficient truncation- and rounding-based scalable approximate multiplier,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 5, pp. 1161–1173, May 2019.
[31] A. G. M. Strollo, E. Napoli, D. D. Caro, N. Petra, G. Saggese, and G. D. Meo, “Approximate multipliers using static segmentation: Error analysis and improvements,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 6, pp. 2449–2462, Jun. 2022.
[32] A. G. M. Strollo, E. Napoli, D. D. Caro, N. Petra, and G. D. Meo, “Comparison and extension of approximate 4–2 compressors for low-power approximate multipliers,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3021–3034, Sep. 2020.
[33] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, “Approximate multipliers based on new approximate compressors,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
[34] E. Qin et al., “SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training,” in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 58–70.
[35] B. Wang et al., “Exploration of Benes network in cryptographic processors: A random infection countermeasure for block ciphers against fault attacks,” IEEE Trans. Inf. Forensics Security, vol. 12, no. 2, pp. 309–322, Feb. 2017.
[36] R. Yao and Y. Ye, “Toward a high-performance and low-loss Clos–Benes-based optical network-on-chip architecture,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 12, pp. 4695–4706, Dec. 2020.

Yang Wang received the B.S. degree in electronic science and technology from Xidian University, Xi'an, China, in 2014, and the Ph.D. degree in microelectronics and solid-state electronics from the Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China, in 2019. He is currently a Post-Doctoral Researcher with the School of Integrated Circuits, Tsinghua University, Beijing. His research interests include very-large-scale integration of digital signal processing (VLSI DSP), deep learning, and neural network acceleration.

Yubin Qin received the B.S. degree from the School of Electronic Science and Engineering, Southeast University, Nanjing, China, in 2020. He is currently pursuing the Ph.D. degree with the School of Integrated Circuits, Tsinghua University, Beijing, China. His current research interests include deep learning, very-large-scale integration (VLSI) design, and hardware-software co-design.

Dazheng Deng received the B.S. degree in microelectronic science and engineering from Tsinghua University, Beijing, China, in 2020, where he is currently pursuing the M.S. degree with the School of Integrated Circuits. His current research interests include deep learning, computer architecture, and very-large-scale integration (VLSI) design.

Jingchuan Wei received the B.S. degree from Binzhou University, Binzhou, China, in 2010, and the M.S. degree from the Beijing Institute of Technology, Beijing, China, in 2015. He is currently a Senior Engineer with Tsinghua University, Beijing. His research area is the implementation of artificial intelligence (AI) accelerators and system-on-chip (SoC) design.

Yang Zhou received the B.S. degree in electronic science and technology from the Beijing Institute of Technology, Beijing, China, in 2019. He is currently pursuing the M.S. degree with the School of Integrated Circuits, Tsinghua University, Beijing. His current research interests include very-large-scale integration (VLSI) design and neural network acceleration.

Yuanqi Fan received the B.S. degree in microelectronic science and engineering from Tsinghua University, Beijing, China, in 2019, where he is currently pursuing the M.S. degree with the School of Integrated Circuits. His current research interests include deep learning, computer architecture, and artificial intelligence (AI) accelerators.
Tianbao Chen received the B.S. degree from the College of Electronic Science and Engineering, Jilin University, Changchun, Jilin, China, in 2013, and the M.S. degree from Tsinghua University, Beijing, China, in 2016. He is currently an Application-Specific Integrated Circuit (ASIC) Design Engineer with TsingMicro Technology, Beijing. His research interests include very-large-scale integration (VLSI) design and deep learning.

Shaojun Wei (Fellow, IEEE) was born in Beijing, China, in 1958. He received the Ph.D. degree from the Faculté Polytechnique de Mons, Mons, Belgium, in 1991. He became a Professor at the Institute of Microelectronics, Tsinghua University, Beijing, in 1995. His main research interests include very-large-scale integration system-on-chip (VLSI SoC) design, electronic design automation (EDA) methodology, and communication application-specific integrated circuit (ASIC) design. Dr. Wei is a Senior Member of the Chinese Institute of Electronics (CIE).