
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 58, NO. 1, JANUARY 2023

An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention

Yang Wang, Yubin Qin, Dazheng Deng, Jingchuan Wei, Yang Zhou, Yuanqi Fan, Tianbao Chen, Hao Sun, Leibo Liu, Senior Member, IEEE, Shaojun Wei, Fellow, IEEE, and Shouyi Yin, Member, IEEE
Abstract— Transformer-based models achieve tremendous success in many artificial intelligence (AI) tasks, outperforming conventional convolution neural networks (CNNs) from natural language processing (NLP) to computer vision (CV). Their success relies on the self-attention mechanism that provides a global rather than local receptive field as CNNs. Despite its superiority, the global-level self-attention consumes ∼100× more operations than CNNs and cannot be effectively handled by the existing CNN processors due to the distinct operations. It inspires an urgent requirement to design a dedicated Transformer processor. However, global self-attention involves massive naturally existent weakly related tokens (WR-Tokens) due to the redundant contents in human languages or images. These WR-Tokens generate zero and near-zero attention results that introduce an energy consumption bottleneck, redundant computations, and hardware under-utilization issues, making it challenging to achieve energy-efficient self-attention computing. This article proposes a Transformer processor that effectively handles the WR-Tokens to solve these challenges. First, a big-exact-small-approximate processing element (PE) reduces multiply-and-accumulate (MAC) energy for WR-Tokens by adaptively computing the small values approximately while computing the large values exactly. Second, a bidirectional asymptotical speculation unit captures and removes redundant computations of zero attention outputs by exploiting the local property of self-attention. Third, an out-of-order PE-line computing scheduler improves hardware utilization for near-zero values by reordering the operands to dovetail two operations into one multiplication. Fabricated in a 28-nm CMOS technology, the proposed processor occupies an area of 6.82 mm². When evaluated with 90% approximate computing for the generative pre-trained transformer 2 (GPT-2) model, the peak energy efficiency is 27.56 TOPS/W under 0.56 V at 50 MHz, 17.66× higher than the A100 graphics processing unit (GPU). Compared with the state-of-the-art Transformer processor, it reduces energy by 4.57× and offers a 3.73× speedup.

Index Terms— Approximate computing, out-of-order computing, processor, self-attention, speculating, Transformer.

Manuscript received 5 May 2022; revised 30 July 2022; accepted 30 September 2022. Date of publication 25 October 2022; date of current version 28 December 2022. This article was approved by Associate Editor Sophia Shao. This work was supported in part by the NSFC under Grant 62125403, Grant U19B2041, and Grant 92164301; in part by the National Key Research and Development Program under Grant 2021ZD0114400; in part by the Beijing National Research Center for Information Science and Technology; and in part by the Beijing Advanced Innovation Center for Integrated Circuits. (Corresponding author: Shouyi Yin.)

Yang Wang, Yubin Qin, Dazheng Deng, Jingchuan Wei, Yang Zhou, Yuanqi Fan, Hao Sun, Leibo Liu, Shaojun Wei, and Shouyi Yin are with the School of Integrated Circuits, the Beijing Innovation Center for Future Chip, and the Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).

Tianbao Chen is with TsingMicro Technology, Beijing 100084, China.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2022.3213521.

Digital Object Identifier 10.1109/JSSC.2022.3213521

I. INTRODUCTION

Recently, Transformer-based models, such as generative pre-trained transformer 2 (GPT-2), vision transformer (ViT), and Swin-Transformer, have emerged as one of the most crucial advancements in the artificial intelligence (AI) field [1], [2], [3], [4], [5], [6], [7], [8], [9]. These models achieve superior accuracy to conventional convolution neural networks (CNNs) and break CNNs' dominance on various AI tasks [1], [2], [3], [4], [5], [6], [7], [8], [9]. The self-attention mechanism is essential for the tremendous success of Transformer-based models. Unlike sliding-window convolution in CNNs with a limited local receptive field, self-attention captures the correlations of all input tokens (a token denotes a word in a sentence or a patch in an image), resulting in a global receptive field. Generally, a Transformer-based model consists of stacked layers, and each layer contains attention blocks achieving self-attention with the query (Q), key (K), and value (V) matrices, computed from the tokens and weight matrices. Specifically, Q first multiplies with K^T to generate an attention score matrix. The scores in each row, represented as X_i, indicate a specific token's relevance with all others. Then, the row-wise softmax with inputs of X_i − X_max normalizes the attention scores to probabilities (P), exponentially scaling the scores. Finally, the probabilities are quantized and multiplied by V to produce the output. Each output token is a weighted sum of all input tokens, where the strongly related tokens (SR-Tokens) have large weight values. With global attention, the normal GPT-2 model achieves 20.4% higher accuracy than long short-term memory (LSTM) for language modeling. Besides, the Swin-B model increases the accuracy by 12.5% compared with EfficientDet for detection on COCO2017.
Unfortunately, the superior accuracy of Transformer-based models comes at the cost of more operations, but the existing processors designed for CNNs cannot handle these operations effectively [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21]. Specifically, GPT-2 needs 91.2× more computations than LSTM, and the operation amount of Swin-B is 94.8× higher than that of EfficientDet [22]. The intensive computation severely prevents their deployment, especially on power-constrained systems. What is worse, due to the distinct computing patterns of convolution and self-attention, the existing processors cannot effectively handle the Transformer-based models [23], [24], [25], [26]. For instance, Jetson TX2 is a dedicated AI processor for CNN applications, achieving 41 FPS throughput for ResNet152 with 11.3 × 10^9 FLOPs. Nevertheless, the throughput degrades to 14 FPS for ViT-B with the same computation amount of 11.3 × 10^9 FLOPs as ResNet152. It inspires the requirement to design a Transformer processor for efficient self-attention computing.
However, global attention contains massive weakly related tokens (WR-Tokens) with small attention scores, introducing three challenges for energy-efficient self-attention computing. First, the WR-Tokens introduce an energy consumption bottleneck for the attention block. Generally, the attention block adopts a softmax function to enlarge the score distance for clarifying the token relevance. It exponentially reduces the small attention scores to near-zero probabilities, weakening their contribution to the accuracy. However, these small scores adversely take the dominant computing energy, limiting the energy efficiency of the attention block. Second, Q × K^T contains substantial redundancies since many near-zero probabilities of the WR-Tokens from softmax become zero after n-bit quantization. It means that the computation of X_i is ineffective if it satisfies X_i < X_max + ln(1/2^n). The redundant computations waste time and energy. In addition, unlike ReLU-based sparsity in CNNs with a static speculation threshold of zero [18], [19], [20], [21], the softmax-based sparsity relies on X_max, which is a variable and differs for each row, resulting in a dynamic threshold and increasing the difficulty of speculation. Third, besides the inter-PE under-utilization caused by zero values [18], [20], the WR-Tokens still lead to intra-PE under-utilization in P × V due to the near-zero operands. Specifically, the near-zero probabilities have massive "0"-valued most significant bits (MSBs), generating "0"-valued partial products (PPs) in multiplications. The intra-PE PPs' computing resources are wasted since the "0"-valued PPs are ineffective for the computation results.

This article proposes a Transformer processor [22]. It solves the above challenges by effectively exploiting the WR-Tokens in global attention, which can facilitate Jetson TX2 increasing the throughput by 5.69× and reducing the energy by 8.14× for the attention-based model. In addition, it achieves 4.57× higher energy efficiency and 3.73× speedup than the state-of-the-art Transformer processor [24] that supports self-attention computing but cannot efficiently handle the inherent WR-Tokens. The main innovations of the proposed processor contributing to its superior performance are given as follows.

1) A big-exact-small-approximate (BESA) processing element (PE) computes small values with large errors for energy-saving while computing the large values exactly. It matches the error-tolerance of the attention block to reduce the energy of the WR-Tokens by 1.62×.

2) A bidirectional asymptotical speculation unit (BASU) captures the dynamic output sparsity by exploiting the local property of self-attention. It performs diagonal-prior computing to find each row's varying X_max rapidly and skips 41.1% of the redundant computations.

3) An out-of-order PE-line computing scheduler (OPCS) reorders the operands with asymmetrical BENES routing networks to enable dovetailing two operations into one multiplication. It omits the "0"-valued PPs' computing, improving hardware utilization by 1.81×.

The rest of this article is organized as follows. Section II illustrates the principle of the self-attention mechanism and details the implementation challenges. Section III presents the overall architecture of the proposed Transformer processor. In Sections IV–VI, the three innovations are described, respectively. Section VII presents the measurement results, and Section VIII concludes this article.

Fig. 1. Principle and computational property of the self-attention mechanism. (a) Principle of self-attention and schematic of an attention layer. (b) WR-Token-dominated computation of the attention block.

II. BACKGROUND AND MOTIVATION

A. Self-Attention Mechanism

The recently emerging Transformer-based models have made a significant breakthrough and revolutionized the AI field [1], [2], [3], [4], [5], [6], [7], [8], [9]. The pivotal component of their success is the self-attention mechanism. Fig. 1(a) presents its principle and the difference with convolution. Instead of obtaining local information around the football with a size-limited sliding window, self-attention gets the football's relevance with all other elements, including the foot, sky, and grass, through traversal computing. It gives a large score for the strongly related football and foot while a small score for the weakly related football and sky. This procedure helps the model understand the input better for a superior performance than CNNs.

Fig. 2. Three challenges of achieving energy-efficient self-attention computing.

Generally, a Transformer-based model consists of stacked layers with a self-attention mechanism. Fig. 1(a) presents the schematic of a single layer mainly comprised of an attention block and a feedforward network (FFN). For an n × d input matrix, where n is the token length and d represents the model dimension, the attention block first multiplies it with three weight matrices to generate Q, K, and V. These matrices have the same size of n × d as the input. It then separates each of Q, K, and V into M parts, named multi-head, to capture various dependencies of the input [23]. Each head uses partial Q, K, and V with a size of n × d_m (d_m = d/M) for attention computing, consisting of Q × K^T, softmax, quantization, and P × V, as shown in Fig. 1(b). Q × K^T and P × V introduce a computation amount positively related to n² × d. After that, the two-stage FFN, which is identical to the fully connected (FC) layers in CNNs, performs upward and downward projections to enhance the presentation capacity of the model. Unlike the attention block, the computation of the FFN relates to n × d² since it does not need to capture the global relevance, avoiding the n² computational complexity.
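To make the dataflow of a single head concrete, the following NumPy sketch mirrors the computation described above: score generation, row-wise softmax with X_max subtraction, quantization of the probabilities, and the weighted sum with V. It is an illustrative behavioral model only; the toy sizes, the random inputs, and the rounding used for the n-bit quantization are assumptions for the example, not values or circuits taken from the processor.

import numpy as np

def attention_head(Q, K, V, n_bits=12):
    # Q, K, V: (n, d_m) slices of one head, as described in Section II-A.
    X = Q @ K.T                                  # attention score matrix, n x n
    X_max = X.max(axis=1, keepdims=True)         # per-row maximum
    P = np.exp(X - X_max)                        # row-wise softmax numerator
    P = P / P.sum(axis=1, keepdims=True)         # probabilities, each row sums to 1
    step = 1.0 / (1 << n_bits)                   # n-bit quantization step; tiny
    P_q = np.round(P / step) * step              # probabilities collapse to zero
    return P_q @ V                               # weighted sum of value tokens

# Toy example: n = 8 tokens, d_m = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = attention_head(Q, K, V)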
In each layer, the attention block introduces a computational bottleneck. First, it is a trend to use superior self-attention in more intractable applications, such as super-resolution or automated driving. In these tasks, the token length n is always gigantic [24]. For instance, in a video attention network for an automotive application, 16 frames with 112 × 112 resolution result in an n equal to 2 × 10^5, ∼20× larger than d. In this case, the attention block takes a dominant computation proportion of 99.5% since its n² × d complexity far outweighs the n × d² of the FFN. On the other hand, the current AI processors focus on optimizing convolution and FC [13], [14], [15], [16]. They work well for the FFN but incur 59.3% of throughput degradation for attention computing due to the distinct computing flow, operand reuse pattern, and data access format. It inspires an urgent demand to design a dedicated processor for attention blocks.

B. Challenges of Energy-Efficient Self-Attention Computing

Nevertheless, the WR-Tokens introduce three challenges for energy-efficient attention computing, as shown in Fig. 2.

First, the WR-Tokens lead to an energy consumption bottleneck for the entire attention block. Fig. 2 shows the partial derivative of softmax when normalizing a score (X) to a probability (P): a change ΔX on X_i affects the output by ΔP = ΔX · (2 · P_i · (1 − P_i)) according to the softmax partial derivative. Generally, constrained by ΣP = 1, at most one P operand is greater than 0.5, and the rest of the P operands must be in the range of [0, 0.5]. In this range, ΔP increases with P_i. Based on ΔP, we can analyze the contribution and error tolerance of different X_i. Specifically, for the same variation ΔX, we consider that a score (X_i) introducing a larger fluctuation (ΔP) makes a more significant contribution, and vice versa. Assuming X_m and X_n (X_m < X_n) with the same ΔX, the fluctuation ΔP of X_m will be ΔX · (2 · P_m · (1 − P_m)), which is exponentially smaller than the ΔX · (2 · P_n · (1 − P_n)) of X_n. It means that the small scores (X_m) contribute less than the large ones (X_n). In other words, they are more error tolerant since they have an exponentially reduced influence on the output. In the GPT-2 model with a token length of 512, these small scores, indicating WR-Tokens, consume 93.1% of the energy but only contribute 6.3% of the accuracy.
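The sensitivity argument above is easy to check numerically. The short sketch below evaluates the stated relation ΔP = ΔX · 2P_i(1 − P_i) for every score in one softmax row; the example scores and the perturbation size are made-up values for illustration only.

import numpy as np

scores = np.array([6.0, 4.5, 0.5, -1.0, -2.5])   # one toy row of attention scores
P = np.exp(scores - scores.max())
P = P / P.sum()

dX = 0.1                                          # same perturbation for every score
dP = dX * 2 * P * (1 - P)                         # sensitivity per the relation above

for x, p, s in zip(scores, P, dP):
    print(f"X={x:5.1f}  P={p:.4f}  dP={s:.6f}")
# The small (WR-Token) scores yield a dP orders of magnitude below that of the
# largest (SR-Token) score, which is why they tolerate approximate computing.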


Second, Q × K^T suffers a dynamic redundancy indicated by the WR-Tokens. For each attention row from Q × K^T, the softmax function uses e^(X_i − X_max) to distinguish the score significances and quantizes the obtained results for the following computing. Since the WR-Tokens have small X_i that push X_i − X_max far below zero, the value of e^(X_i − X_max) is tiny. The n-bit quantization will transfer the tiny e^(X_i − X_max) to zero if it is smaller than 1/2^n, which is the minimum value of n-bit data. Therefore, the computations of any X_i satisfying X_i − X_max < ln(1/2^n) are redundant, and they take a proportion of 34.3% in Q × K^T, limiting the energy efficiency. Besides, this output sparsity depends on the variable threshold of X_max in each row, and we cannot start speculation before getting the varied X_max. It increases the speculation complexity beyond that of ReLU-based sparsity with a static threshold of zero.
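The redundancy condition above can be restated operationally: a score's softmax output quantizes to zero exactly when its gap to the row maximum exceeds n·ln 2. The small check below uses made-up numbers and directly evaluates the condition X_i < X_max + ln(1/2^n); it is a numerical illustration, not the speculation hardware of Section V.

import math

def is_redundant(x_i, x_max, n_bits=12):
    # exp(x_i - x_max) falls below 1/2^n, so the n-bit quantized probability is zero.
    return x_i < x_max + math.log(1.0 / (1 << n_bits))

x_max = 9.2
for x_i in (8.0, 3.5, 0.4):
    p = math.exp(x_i - x_max)
    print(f"x_i={x_i:4.1f}  exp(x_i - x_max)={p:.2e}  redundant={is_redundant(x_i, x_max)}")
# With n = 12 the threshold gap is ln(1/4096) = -8.32, so only scores more than
# about 8.3 below the row maximum are guaranteed to quantize to zero.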
Third, P × V incurs hardware under-utilization due to the near-zero P's. After normalization, the summation of the P's is 1, where a few P's have larger values, but the rest of the P's, indicated by WR-Tokens, are small. For instance, in a translation task on GPT-2 with n = 256, the first 16 largest P's have a summation of 0.73. The average value of the remaining 240 P's is 0.001125, and 87.3% of them have more than 8-b "0"-valued MSBs with 12-b quantization. Since the "0"-valued MSBs produce "0"-valued PPs that are ineffectual for the result, the multiplier incurs PPs' resource under-utilization. The "0"-valued PPs waste 39.7% of the time and 65.3% of the energy for attention computing.

TABLE I
Accuracy Comparison With Different Precisions

III. OVERALL ARCHITECTURE

Fig. 3. Overall architecture of the proposed processor and its workflow of an attention layer. (a) Overall architecture. (b) Engagement of each module.

Fig. 3(a) presents the overall architecture of the proposed Transformer processor that effectively exploits the WR-Tokens to achieve energy-efficient self-attention computing. The proposed processor consists of four attention cores, a 32-to-12-b quantizer, a reorder unit, and a 336-kB static random-access memory (SRAM). Each attention core has a PE array that contains eight PE lines, a BASU, and eight OPCSs. The eight PE lines, each of which has 16 BESA PEs, compute eight outputs belonging to one row. Each PE performs a fixed-point INT12 multiplication with INT32 accumulation. Here, we use INT12 to achieve a reasonable tradeoff between accuracy and energy. As shown in Table I, the accuracy of INT12 is almost the same as that of INT16 and FP32. It avoids the dramatic accuracy degradation of the 8-bit integer (INT8), which is up to 3.29% for ImageNet classification on ViT-B/16. The reason is that attention computing requires distinguishing the relevance among tokens to highlight the important contents, and the limited resolution of INT8 restricts this discernibility during attention computing. Besides, the INT12 multiply-and-accumulate (MAC) unit can reduce the power by 1.78× compared with INT16 computing, alleviating the deployment pressure of the computation-intensive Transformer models, especially on power-constrained devices. The quantizer transfers the 32-bit accumulation results to 12 bit at runtime for successive computing. The PE supports approximate and exact computing through a BESA multiplier, saving the energy of the WR-Tokens. In addition, it can dovetail operations by using the PPs' computing logic in one multiplier for two multiplications to remove "0"-valued PPs for high hardware utilization. The BASU consists of eight sign-based splitters with two 128-depth RFs, an X_max updater, and a speculator. It controls the reorder unit to prioritize the diagonal scores, which rapidly detects X_max for efficient sparsity speculation. Each OPCS comprises a softmax unit, a folding unit, and a 4-to-8 asymmetrical BENES network. It receives results from the quantizer and schedules the operands into an out-of-order format before sending them to the PE line for computing.
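As a reference for the 32-to-12-b quantizer described above, the sketch below shows one plausible way to fold a 32-bit partial sum into a signed INT12 operand for the next stage. The per-tensor scale selection and the rounding mode are assumptions made for illustration; the article does not specify them.

import numpy as np

def requantize_to_int12(acc32, scale):
    # acc32: INT32 accumulation results; scale: step chosen so that the useful
    # dynamic range maps into the signed 12-bit range [-2048, 2047].
    q = np.rint(acc32 / scale).astype(np.int32)
    return np.clip(q, -(1 << 11), (1 << 11) - 1).astype(np.int16)

acc = np.array([123456, -98765, 40210, -7], dtype=np.int32)
print(requantize_to_int12(acc, scale=64.0))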

Fig. 3(b) details the engagement of the proposed modules. The processor sequentially computes the attention block head-by-head and then completes the FFN. For each layer, the BESA PE works in three modes. It runs in a pure exact mode to obtain Q, K, and V, providing a precise foundation for self-attention computing. Then, during Q × K^T and P × V, the BESA PE calculates the SR-Tokens exactly but the WR-Tokens approximately, based on the bit-mask from the former head. This hybrid mode is motivated by the exponentially reduced contribution of the WR-Tokens. After that, the BESA PE performs approximate computing for the two-stage FFN, which does not capture attention and is relatively insensitive to the resulting error. The BASU only works in Q × K^T to speculate the redundant computations, where it also helps the attention core generate the WR-Tokens bit-mask for the mixed-mode computing of the next head. The OPCS keeps operating all the time. It skips the operations with zero weights during Q/K/V generation and the FFN, or with the naturally existent zero-valued Q and K operands in Q × K^T. Besides, it dovetails two multiplications into one multiplier for small weights and Q/K operands during Q/K/V generation, Q × K^T, and the FFN. In addition, the OPCS significantly increases the hardware utilization for P × V since P contains substantial zero and near-zero values. For the zero operands, it removes their computation according to the bit-mask from the BASU. As a particular statement, only the BESA PE impacts accuracy; the BASU and OPCS preserve numerical fidelity.

Fig. 4. Analysis of different approximate methods for attention computing.

IV. BIG-EXACT-SMALL-APPROXIMATE PE

Approximate computing is a practical way to break the energy bottleneck for error-resilient neural networks [27],
[28], [29], [30], [31], [32], [33]. However, the previous meth- Pex is the power of exact computing. It is relatively smaller
ods cannot match the value-adaptive error-tolerance of atten- than the approximate-component-based way, which requires
tion computing, limiting their energy efficiency. We propose a computing all bits and results in a Psp equaling to 0.77 × Pex .
BESA PE, which adaptively detects the value magnitude of the In contrast, their Plp is similar for various approaches since
WR-Tokens with MSBs to gate their logic in LSBs for more they require almost the same logical complexity to achieve
energy-saving and computes the SR-Tokens exactly. It adapts similarly high precision. Therefore, we set it to 0.95 × Pex for
to the error-tolerance of the attention block to break the energy all methods to perform a fair comparison.
bottleneck. Since (Psp − Plp ) < 0, (1) is a monotone decreasing linear
function with Rwr . Therefore, reducing Psp to decrease the
function’s slope or increasing Rwr can significantly lessen
A. Energy Efficiency for Approximate Attention Computing
energy. However, the previous approximate methods can nei-
The previous approximate methods effectively increase the ther achieve an extremely low Psp nor a high Rwr owing to their
energy efficiency for CNNs but incur performance degradation average error reduction mechanism. On the one hand, they
for attention computing due to the theoretical error mismatch try to reconcile the computing error for all operands, where
with softmax. Specifically, ReLU in CNNs does not change the the strict precision requirement of large scores limits their
value magnitude, so feature maps have invariant contributions truncated bit-width or the use number of approximate com-
before and after computing, as shown in Fig. 4. Nevertheless, ponents. It results in a large Psp of (1) that increases energy
softmax exponentially reduces the small scores but increases consumption. On the other hand, the low precision mode of
the significant scores. It introduces a value-adaptive magnitude the previous methods introduces a significant error for the
variation after computing, where the small scores, indicating SR-Tokens, which have exponentially increased contribution
the WR-Tokens, have exponentially weakened contributions. for the accuracy. To avoid confusing the SR-Tokens, they must
Therefore, the attention block has a parabola-like instead of reduce Rwr . As shown in Fig. 4, the Mitchell-based method has
a constant error-tolerance as ReLU. The previous approxi- to limit the Rwr to ∼50% for satisfying the tolerable accuracy
mate methods, including the Mitchell-based and approximate- loss of 1.5%. The low Rwr confines its energy reduction
component-based multipliers [30], [31], [32], [33], target to to 21.6%.
reduce the average error that is appropriate to the invari-
ant contribution of ReLU. However, they mismatch the B. BESA PE for Attention Adaptive Approximation
parabola-like error-tolerance of softmax, introducing two lim-
itations for energy reduction. Specifically, we can evaluate the This article proposes a BESA PE, achieving value-adaptive
energy of different approximate methods with the following approximation to match the parabola-like error-tolerance of
equation: softmax. It significantly reduces Psp and increases Rwr to break
the energy bottleneck of attention computing.
Energy = Rwr × Psp + (1 − Rwr) × Plp
       = (Psp − Plp) × Rwr + Plp.    (1)
attention row of headi , the attention core uses the 512 bit-mask
Here, Rwr denotes the WR-Tokens ratio, where PEs lead to generated by headi−1 (BMaski−1 ) to control the PE array work
large errors but small power (Psp ). For the rest of (1 − Rwr ) with the mixed mode. Specifically, the PE performs approxi-
SR-Tokens, PEs improve the precision while consuming large mate computing if receiving bit “0” but exact computing for
power (Plp ). Generally, Psp of different methods have signifi- the bit “1.” For the obtained attention matrix, the top-k selector
cant differences due to the distinct approximate mechanisms. filtrates the first k percent most significant scores of each
For instance, the Mitchell-based multiplier truncates several row and generates a corresponding bit-mask, represented as
LSBs for energy reduction. Its average Psp is 0.68× Pex , where BMaski . Here, the value k is adjustable and determined by the


Fig. 5. Mixed-mode computing with BESA PE.

complexity of a specific task. For instance, when constraining


the accuracy degradation to less than 0.05%, the value of k is
3% for WikiText2 on the GPT-2 model. It increases to 5% for
WikiText103 since WikiText103 is relatively more complicated
than WikiText2. A similar phenomenon appears in the tasks of
image classification on ViT-B. The ImageNet requires k of 7%,
which is larger than Cifar100 with a k of 2%. After selecting
the top-k scores, the attention core performs bit-wise OR with
BMaski−1 and BMaski to produce BMaski+1 that will guide
the computation of headi+1 . In this way, a specific head always
performs the exact computing for the SR-Tokens indicated by
all former heads. It avoids missing the SR-Tokens, which are
essential to the accuracy.
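A behavioral sketch of this head-to-head mask propagation is given below. The per-row top-k selection and the bit-wise OR follow the description above; the mask granularity (one bit per score in each attention row), the helper names, and the use of NumPy are illustrative assumptions rather than the hardware interface shown in Fig. 5.

import numpy as np

def topk_row_mask(attn_matrix, k_percent):
    # For every attention row, mark the top k percent scores with bit "1".
    n = attn_matrix.shape[1]
    k = max(1, int(np.ceil(n * k_percent / 100.0)))
    mask = np.zeros_like(attn_matrix, dtype=np.uint8)
    idx = np.argsort(attn_matrix, axis=1)[:, -k:]
    np.put_along_axis(mask, idx, 1, axis=1)
    return mask

def propagate_mask(bmask_prev, attn_matrix, k_percent):
    # BMask(i+1) = BMask(i-1) OR BMask(i): bit "1" forces exact computing, while
    # bit "0" lets the BESA PE approximate, so once any former head marks a
    # score region as strongly related it stays exact for later heads.
    return np.bitwise_or(bmask_prev, topk_row_mask(attn_matrix, k_percent))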
The proposed BESA PE consists of a self-gating code gener-
ator, a multiplier modifier, a compensator, and an approximate
booth multiplier with hybrid exact/approximate components.
It has two work modes. For the WR-Tokens, BESA PE
performs approximate computing by enabling the self-gating
generator and using the generated control code to gate com-
pressors for energy reduction. In this mode, the modifier
and the compensator keep idle. On the contrary, when it
needs to achieve exact computing for the SR-Tokens, the
BESA PE disables the self-gating generator to activate all
Fig. 6. Design details of the proposed multiplier for BESA computing.
compressors. In addition, the modifier and the compensator
work together to use the approximate components for exact
computing. Fig. 6 shows the diagram of the approximate delivery. None of the previous works [31], [32] can achieve
multiplier and details how to achieve BESA computing. The this exact results delivery due to their random approximate
multiplier contains two stages, and we use the first stage to components arrangement and irregularly occurred exact output
explain its operating principle. This stage consists of six PP of each approximate component.
rows with 24 columns. It uses approximate PP generators When computing the WR-Tokens with BESA PE, the
in row 2 and row 5. The proposed generator outputs exact self-gating code generator performs cascaded OR for the
results if booth values (BVs) are ±1/0 but inverts several positive operands, which is AND for the negative data, with
correct results if BVs are ±2 for logic simplification, as shown the 6-bit MSBs. It adaptively generates a 6-bit code that has
in Fig. 6. Besides, columns 7–14 use the approximate 4–2 more “0” (“1”) for the minor positive (negative) values. We use
compressors. Unlike the previous compressors with random this code to gate the compressors belonging to columns 7–12
exact outputs, we purposely design the compressor to retain in LSBs, as shown in Fig. 6. Therefore, instead of treating
exactness if the first input ( A1 ) is 0. In addition, we spe- the operands equally without discrimination, the proposed
cially connect the approximate PPs to A1 of approximate self-gating saves more computing energy for the smaller
compressors. In this way, if the BV is 0, the PPs will be values. It reduces Psp to 0.55 × Pex , which is 1.4× than
exact 0, and the approximate compressors can also have the the previous approximate methods, resulting in 1.17× more
exact results with A1 = 0, achieving stage-by-stage exactness energy reduction. Although this energy reduction is at the


expense of an increased computing error of 23.5%, softmax


weakens the error to a negligible 0.25% benefiting from its
exponential value reduction for the WR-Tokens. Therefore,
it adapts to the parabola-like error-tolerance of attention
computing.
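The gating code described above can be modeled in a few lines. The sketch below derives the 6-bit code from the MSBs of a 12-bit two's-complement operand by a cascaded OR (positive values) or AND (negative values); how each code bit then gates the 4-2 compressors in columns 7-12 is a hardware detail that is not modeled here, and the mapping of code bits to columns is an assumption.

def self_gating_code(x12):
    # x12: signed operand in [-2048, 2047] (INT12, two's complement).
    bits = [(x12 >> i) & 1 for i in range(11, 5, -1)]   # the six MSBs, MSB first
    negative = x12 < 0
    code, acc = [], bits[0]
    for b in bits:
        acc = (acc & b) if negative else (acc | b)      # cascaded AND / OR chain
        code.append(acc)
    # Small positive values give mostly "0"s and small negative values mostly
    # "1"s, so more LSB compressors are gated for the weakly related operands.
    return code

print(self_gating_code(3))      # small positive -> heavy gating
print(self_gating_code(1500))   # large positive -> LSB logic stays active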
For the SR-Tokens, the multiplier modifier and the compen-
sator work together to achieve exact computing with approx-
imate components. Specifically, for the received operand, the
modifier detects its 2–4 and 9–11 bits, which serves for sec-
ond/fifth BVs and corresponds to PP generators of row 2/row
5. It adaptively modifies them to “000” or “111,” indicating a
BV of “0,” based on their 1’s number. For instance, it modifies
“010” to “000” but “011” to “111,” as shown in Fig. 6. In this
way, the modifier only changes 1-bit for all conditions. It sim-
plifies the compensation logic, which requires compensating
all the changing bits. After modification, the approximate PPs’
generators in row 2 and row 5 give exact results since their
BVs are “0.” Besides, these “0” PPs connect to A1 , which
artfully makes the approximate compressors produce accurate
outputs. So far, the approximate multiplier obtains an exact
computing result for the modified operand. The compensator
then performs shift-adding if the modified bit changes from
“1” to “0” or shift-subtracting for changing from “0” to
“1.” The multiplier finally obtains an exact result, even with
approximate circuits. It eliminates the approximate error of
the previous methods, resulting in a Rwr up to 90% with a
Fig. 7. Principle of bidirectional asymptotic speculation.
negligible 0.43% accuracy loss. Therefore, BESA PE reduces
energy by 1.69× compared to exact computing, and it is 1.34×
higher than the prior method [30], [31], [32], [33], breaking CNN [18], [19], [20], [21], it is more formidable to speculate
the energy bottleneck of the attention block. the sparse attention scores. First, instead of relying on a
constant threshold of zero as ReLU, the sparse attention scores
V. B IDIRECTIONAL A SYMPTOTICAL S PECULATION U NIT involve two degrees of freedom, which are X i and X max .
The quantized zero attention probabilities indicate output Therefore, it requires extra effort to detect X max first and then
sparsity during Q × K T determined by X i < X max +ln(1/2n ), perform speculation with the obtained X max , which leads to
which has a dynamic speculation threshold of X max . The severe speculation efficiency degradation for sequential com-
previous speculation methods either suffer efficiency degra- puting. As shown in Fig. 7, before obtaining X max , sequential

dation due to the slow X max detecting or accuracy loss computing has to utilize X max , which is the current maximum
caused by inevitable speculation error on X i , which could value from the preceding scores, for speculation. However,

be SR-Tokens. They cannot effectively exploit the dynamic this randomly selected X max from the former locations always
sparsity of the attention block. We propose a BASU that has a significant gap with the genuine X max , especially for
performs diagonal-prior computing to rapidly capture X max the posterior attention rows, whose X max often occurs at

for high speculation efficiency and computes X i with pos- rearward locations. Therefore, using X max for speculation
itive operations followed by negative to achieve a lossless naturally misses massive sparse attention scores that satisfy

redundancy skipping. The efficient lossless speculation of X i < X max +ln(1/2n ) but not X i < X max +ln(1/2n ). According
BASU significantly improves the energy efficiency of attention to our experiment on the GPT-2 model, the missing ratio
computing. of sequential computing is as large as 49.6%, significantly
restricting the energy efficiency improvement for attention
A. Speculation for Output Attention Sparsity computing. In addition, X i has a value-adaptive exponentially
scaled contribution, as explained in Section IV. The previous
Besides introducing an energy bottleneck, the WR-Tokens approaches use the cursory results computed by the half
also indicate output attention sparsity in Q × K T . Specifically, bit-width or exponent-only for speculation [18], [19]. They
substantial near-zero probabilities corresponding to the small inevitably introduce speculation errors for SR-Tokens that
scores of X i will become zero after n-bit quantization if X i dramatically impact accuracy.
satisfies the equation of X i < X max + ln(1/2n ). Therefore,
the computations of these X i are redundant, and removing
them can significantly increase the energy efficiency, similar B. BASU for Efficient Sparse Attention Speculation
to the output feature-map skipping in CNNs. Nevertheless, This article proposes a BASU that achieves efficient loss-
compared with speculating the zero outputs from ReLU in less speculation for the output attention sparsity, effectively


Fig. 8. Architecture of the bidirectional asymptotic speculation unit.

removing the redundant computation during Q × K T .


Fig. 7 shows the workflow of BASU, and Fig. 8 provides
its details.
When generating the attention matrix with Q × K T , instead
of performing front-to-back sequential computing, the BASU
schedules the reorder unit to preferentially fetch the Q and
K rows contributing to the diagonal scores for computing.
Specifically, for the attention row n, it first computes eight
scores of X [n−4:n+3] with Q r[n] and K r[n−4:n+3] . Along with this
Fig. 9. Attention patterns across heads (12 heads of layer 2) of different
procedure, the BASU compares the obtained eight scores to models. (a) GPT-2 with WikiText2 for language modeling. (b) ViT-B with
∗ 
find the maximum value, defined as X max . Unlike X max having ImageNet for classification. (c) Swin-B with COCO2017 for detection.

a significant gap with X max , X max is usually close to X max due
to the local property of attention where a specific token always
has strong relevance with its near tokens [5], [7], [9]. Several line. It allows for overlapping the splitting with computing,
algorithms work even only connect a token with its near tokens avoiding scheduling latency that will decrease PE utilization.
in a small window to reduce the computational complexity Then, the PE line prioritizes the operands from the positive
[5], [7], [9]. This local property results in relatively large subset, resulting in a maximum partial sum of X i . After that,
scores belonging to the matrix diagonal, and Fig. 9 proves it computes the negative subset, where the partial sum, defined
this property with attention heat maps of different Transformer as X n , is monotone decreasing. In this stage, the speculator

models. In these heat maps, the highlighted pixels denote large compares X n with X max + ln(1/2n ) before computing, and

attention scores that have a high possibility of appearing at the it terminates the computation once X n < X max + ln(1/2n ),
∗ ∗
attention matrix’s diagonal. With X max , the BASU can specu- as shown in Fig. 7. Since X n ≥ X i and X max ≤ X max ,

late substantial missed sparse scores in sequential computing, the computing satisfies X n − X max < X i − X max , indicating

which may not meet X i < X max + ln(1/2n ) but still satisfy lossless speculation. In addition, the proposed separate-based
∗ ∗
X i < X max + ln(1/2 ) since X max is near to X max and much
n
speculating can save more energy than the previous methods

larger than X max . For the same task on GPT-2, the increased for attention sparsity. The previous approaches consume 50%
speculation ratio facilitates the diagonal-prior approach to energy with half bit-width for speculation and then take
achieve a 1.54× higher energy efficiency than sequential extra effort to compute the non-zero output [18]. Therefore,
computing. In addition, the energy efficiency improvement can their energy-saving is less than 50%. Differently, since the
be as large as 1.86× and 1.67× for the ViT-B and Swin- WR-Tokens have small scores, their negative operation amount
B models. After computing the eight diagonal scores, the takes an average proportion of 65.1%, which provides an
BASU controls the attention core to calculate X [n−8:n−5] and opportunity to reduce more than 50% energy with the separate-
X [n+4:n+7] , which located bilaterally of X [n−4:n+3] . The updater based speculation.

compares X max with the newly generated maximum score Besides redundancy speculation, the BASU also assists the

for updating, pushing X max to asymptotically approach the top-k selection, as shown in Fig. 10. It reuses the min/max
real X max . values from every eight scores for pre-processing to avoid

For computing X i , the BASU utilizes the X max to perform traversing all data from SRAM during top-k detecting. Specif-
speculation. As shown in Fig. 8, when receiving Q and ically, for top-k selection on n tokens, the BASU collects M
K , the splitter first separately sends them into the positive min/max candidates. Here, M equals to n/8 × 2 k, k denotes
and negative RFs based on the XOR results with their sign the percentage of top-k, and 2 is the selection margin. For
bits. Here, we parallelly split 32 operands in each cycle, each parallelly computed eight scores, defined as an interval,
which is 2× speedup than computing with 16 PEs in a PE the selection assistor compares its maximum value with all


Fig. 10. Top-k selection assisting with BASU.

minimum scores candidates while simultaneously comparing


the minimum value with maximum scores candidates. Sup-
posing that the maximum value is smaller than the minimum
candidates, indicating an interval with all small scores, the
detection controller will assign a skip signal to omit this
interval during the following top-k detecting. In the meantime,
if this minimum value is larger than all maximum candidates, Fig. 11. Dovetailing with adaptive folding and out-of-order scheduling.
denoting a significant interval, the detection controller will
entitle its priority to the highest and reduce the priorities of all
other intervals by 1. When filtrating the top-k SR-Tokens, the in Fig. 11, we can dovetail two P × V (Pi × Vi and P j × V j )
attention core can detect the operands based on the skipping operations into one multiplier if the ELW summation of the
and priority rating signals, as shown in Fig. 10. The intuition two P is smaller than the multiplier bit-width. To support
behind this approach is that a token is always highly related dovetailing, we designed the 12-b multiplier with a bit-width
to several specific domains, which has a more possibility of 24 b for operand V to accommodate two different Vi
of containing the SR-Tokens. Therefore, we prioritize the m and V j at each time. Although operation dovetailing can
scores (m = 8 × M) belonging to the M intervals to quickly avoid the PPs’ computing resources wasting, the conventional
determine the most significant scores instead of aimlessly static in-order dovetailing method suffers a severe failure ratio
traversing all the scores. It reduces the detection complexity since the randomly distributed operands cannot always sat-
from O(nlog(n)) to O(mlog(m)), where m is far less than n. isfy the dovetailing condition. It restricts hardware utilization
The proposed method saves 56.7% of data access and 63.3% improvement.
of time during top-k selection for the GPT-2 model when For one scenario, the successive operands have a relatively
traversing n from 64 to 1024 for k = 20%, a ratio without large ELW greater than half of 12, as shown in the left
incurring accuracy loss. part of Fig. 11. It prevents us from dovetailing any adjacent
two operands for hardware utilization improvement. Therefore,
VI. OUT-OF-ORDER PE-LINE COMPUTING SCHEDULER
ing, while each multiplier contains wasted PPs resources.
Besides zero probabilities, the WR-Tokens also lead to In addition, the static in-order dovetailing suffers resource
near-zero probabilities with massive “0”-valued MSBs, which misallocation. As shown in the right part of Fig. 11, it cannot
waste PPs’ computing resources of multipliers in P × V . dovetail P4 and P5 since their ELWs summation is 14, and
Operation dovetailing can remove “0”-valued PPs’ comput- the 12-bit multiplier lacks one PPs’ generator to compute them
ing for hardware utilization improvement. Nevertheless, static together. They, therefore, occupy two multipliers, while, in the
in-order dovetailing suffers a high failure ratio since it cannot meantime, the dovetailed operands P6 and P7 with an ELWs
adapt to randomly distributed operands. We propose an OPCS summation of eight waste two PPs’ generators. The static
that dynamically reduces the effective LSBs bit-width (ELW) in-order dovetailing cannot allocate the wasted PPs resources
of operands with adaptive folding and performs out-of-order onto the computation, indeed requiring them, resulting in a
dovetailing with asymmetrical BENES routers, significantly severe hardware under-utilization.
increasing the dovetailing ratio for higher hardware utilization.
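Before the hardware details in the following subsections, the sketch below restates the two scheduling ideas in software terms: adaptive folding, which rewrites an operand with many "1"-valued MSBs as 2^N minus a small residue so that its effective LSB width (ELW) shrinks, and the dovetailing test, which lets two P operands share one 12-bit multiplier when their ELWs fit together. The fold criterion, the exact width margin, and the greedy pairing loop are simplified assumptions standing in for the folding unit and the BENES-based matcher.

def elw(p):
    # Effective LSB width: index of the leading "1" plus one (0 for p == 0).
    return p.bit_length()

def fold(p):
    # Represent p as 2^N - p' when that shortens the ELW; the compensation step
    # later restores 2^N * V - p' * V, so the product is unchanged.
    n = p.bit_length()
    p_folded = (1 << n) - p
    return p_folded if elw(p_folded) < elw(p) else p

def can_dovetail(p_a, p_b, width=12):
    # Two operands may share one 12-bit multiplier if their ELWs fit within it
    # (the exact margin left for the PP generators is a hardware detail).
    return elw(p_a) + elw(p_b) <= width

probs = [0b000000000111, 0b111111111011, 0b000000011001, 0b000011111111]
folded = [fold(p) for p in probs]
pairs, unused = [], list(range(len(folded)))
while unused:                               # greedy out-of-order pairing, illustrative only
    i = unused.pop(0)
    j = next((k for k in unused if can_dovetail(folded[i], folded[k])), None)
    if j is not None:
        unused.remove(j)
    pairs.append((i, j))
print(pairs)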

A. Operations Dovetailing for Near-Zero Probabilities B. OPCS for Dynamic Out-of-Order Dovetailing
Operation dovetailing is a practical approach to omit the This article proposes an OPCS, performing dynamic out-
“0”-valued PPs’ computing for near-zero operands. As shown of-order dovetailing with the adaptive folding unit and


Fig. 12. Overall architecture of OPCS with the adaptive folding unit and asymmetric BENES routers.

asymmetric BENES routers, which breaks the hardware uti-


lization limitation of the static in-order dovetailing approach.
Fig. 12 shows the overall architecture of OPCS, comprising
a softmax unit, an adaptive folding unit, and asymmetric
BENES routers. During attention computing, the OPCS con-
tinuously receives attention scores from the attention cores,
and it adopts the LUT-based softmax unit to normalize the
attention scores to probabilities sequentially. Afterward, the
folding unit in each cycle fetches 32 probabilities of data
belonging to the same attention row. It detects each operand’s
leading one (LO) location. If more than three successive
“1” following the LO, the folding unit inverts the bits after
the LO generated by the LSBs extractor and adds 1 for
the inverted result. It folds a large operand P with mas-
sive “1”-valued MSBs to a small operand P  with many
“0”-valued MSBs, as shown with P0 and P0 in Fig. 11.
In other words, the adaptive folding reduces the ELWs of
operands to meet the dovetailing condition more easily than
using the original data. As shown in Fig. 11, we can dovetail
the folded operands to use two rather than four multipliers
for computing, and this approach increases the dovetailing
efficiency by 21.9%. For the above procedure, the operands Fig. 13. Design details of input matcher and output aggregator.
before and after folding satisfy P + P  = 2 N , where N
represents the LO location. Therefore, the compensation unit
left shifts V for N bits and then subtracts P  × V , resulting
in 2 N × V − P  × V , which is identical to the original result to match them, as shown in Fig. 11. We, therefore, propose
of P × V . The compensation overhead is negligible since it 4-to-8 asymmetrical BENES to reduce the overhead for
only requires simple shift logic. After that, rather than directly operands reordering. As shown in Fig. 12, the presented
dovetailing the adjacent two operands, the OPCS uses four BENES router omits four switch elements and fixes the
4-to-8 asymmetrical BENES routers to achieve out-of-order connection for eight switch elements for the former three
computing. As shown in Fig. 11, the BENES router reorders stages. It only uses the rest two stages to move four operands
the eight operands to compute P0 with P4 while P2 with P3 for operands’ matching. Compared with the symmetrical
and so on. This approach allows each operand to match an approach, the proposed asymmetrical BENES network reduces
appropriate operand to improve the dovetailing ratio. For the the power by 56.7% and power by 46.2%. In addition, since
case in Fig. 11, it can reduce the multiplier number from 6 to 4 the asymmetrical BENES only requires controlling two stages,
compared with in-order dovetailing. Theoretically, achieving it reduces the critical path by 49.3% than the five-stage control
arbitrary location reordering for eight operands needs an 8-to-8 code generation [34], [35], [36]. Along with scheduling the
symmetrical BENES network [34], [35], [36]. However, it is 32 P operands, OPCS also reuses the control logic to reorder
power and area-wasting for dovetailing, which only requires 32 V operands to match them with the corresponding P for
fixing four operands and moving the rest of the four operands computing.


Fig. 15. Area and power breakdown of the proposed processor.

Fig. 14. Chip micrograph and summary of the proposed processor.

The BENES router also contains an input matcher and an


output aggregator to improve hardware utilization, as shown in
Fig. 13. When receiving 4 × 8 operands from the folding unit,
the input matcher detects the maximum LO values, defined
as L m , for every eight operands. It deals with these operands
in three manners based on their L m . First, for any L m equal
to zero, the input matcher skips them directly. Second, if the
summation of 2 L m is smaller than 12, as shown with L mx and
L my in Fig. 13, it denotes that any X i can dovetail with Yi .
Therefore, the input matcher dovetails the X i operands with Yi Fig. 16. Verification system of the proposed processor.
in one-to-one correspondence and gates the BENES to reduce
the dynamic routing power. Third, suppose that their sum-
mation is larger than 12. The input matcher uses two BENES 4.25 TOPS/[email protected] V (1.91 TOPS/W @ 1.1 V). For a
routers to schedule them, respectively. For each BENES router, 90% of approximate ratio exploiting sparsity speculating and
the input matcher sequentially fixes one operand and finds operations dovetailing, the processor increases the peak energy
its matched operand that satisfies the dovetailing condition efficiency to 27.56 TOPS/W @ 0.56 V (14.28 TOPS/W @
with maximum ELWs. After dovetailing, the output aggregator 1.1 V).
removes the zero probabilities before sending them to the Fig. 15 demonstrates the area and power breakdown of
PE line. As shown in Fig. 13, it will shift the dovetailed the proposed processor. Compared with exact multipliers, the
non-zero operands in the second region to the locations with attention cores equipping BESA PE to reduce the area ratio
zero values in the first region. Therefore, the PE line can from 68.7% to 49.4% due to the simplified PP generator and
skip the zero operands for high utilization. We adopt a digital compressor logic. Besides, its power proportion is reduced by
control shift line constructed with power-of-two structures to 1.37×, benefiting from the energy-saving BESA computing
achieve dynamic operands’ shifting. Instead of equipping N pattern for the dominant WR-Tokens. The BASU takes up an
shift unit for a control code with value N, the shift line can area ratio of 2.40%, including RFs occupation, and its power
share the shift unit, reducing the overhead by 75%. proportion is as small as 1.80% since it only requires simple
comparisons. The OPCS with asymmetrical BENES networks
VII. M EASUREMENT R ESULTS reduces the area overheads from 9.1% to 6.2% and power from
6.6% to 4.2% compared with symmetrical BENES.
A. Chip Specification and Verification System Setup Fig. 16 illustrates the verification system. During the eval-
Fig. 14 presents the proposed processor’s die photograph uation, the test processor uses an integrated FPGA mezzanine
and performance summary, fabricated in 28-nm CMOS tech- connector (FMC) interface with a bit-width of 64 to commu-
nology with a 6.82-mm2 area. It contains 512 PEs, and each nicate with FPGA. The FMC works as a bus to bridge the
PE performs an INT12 MAC that equals two operations. Under test processor and FPGA, where FPGA emulates the function
0.56–1.0-V supply voltage, the power is 12.06–272.8 mW. The of DDR. This behavior is the same as integrating a DDR in
maximum frequency of this processor is 510 MHz, resulting the processor. Apart from evaluation, we can also integrate the
in a peak performance of 522.2 GOPS. Without considering proposed processor into a completed system with a three-stage
the proposed optimizations, the highest energy efficiency is memory hierarchy. The DRAM and SRAM from the system


TABLE II
Comparison Results on ViT-B With Three Different Setups

Fig. 17. Accuracy and energy efficiency performances with BESA PE.

serve as the first two-stage memory hierarchy. The DRAM


stores the model parameters for the Q/K/V generation and FFN
computing. The SRAM, working as the L1 cache, will buffer
the intermediate data during attention computing. Generally,
the largest size of the intermediate data is less than 5 MB.
For instance, for a token length of 1024 for GPT-2, the
largest size is 1024 × 768 × 4 × 12 = 4.608 MB from
the first stage of FFN. The current processor system always
has enough space to store them. Then, the on-chip SRAM
is the third memory hierarchy and serves as the L0 cache to
achieve in situ attention score reuse with dedicated dataflow.
Specifically, it sequentially computes four rows of attention
matrix each time and immediately multiplies them with V ,
avoiding transmitting the huge attention matrix with off-chip
memory.
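As a rough check of the buffer sizing above, the first FFN stage produces an n × 4d intermediate matrix of 12-bit values for the GPT-2 dimensions quoted in the text. The short calculation below lands in the same few-megabyte range as the 4.608-MB figure; the exact value depends on the rounding and megabyte convention, which are assumptions here.

n, d, bits = 1024, 768, 12            # token length, model dimension, operand width
elements = n * 4 * d                  # the first FFN stage expands d to 4d
size_mb = elements * bits / 8 / 1e6
print(f"{elements} elements -> {size_mb:.2f} MB")   # roughly 4.7 MB of intermediate data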

B. Evaluation on Key Features


This section shows detailed evaluations of the proposed Fig. 18. Redundant computation reduction with BASU.
techniques. We obtain the result from the language modeling
task on the GPT-2 model with a token length of 512. In this
case, the energy efficiency of the processor is 5.45 TOPS/W, efficiency of the processor to 8.12 TOPS/W, 1.49× higher
as shown in the summary table. The system energy efficiency than the baseline. In addition, it increases the system energy
is 3.98 TOPS/W when including the off-chip DRAM and efficiency by 1.31×, achieving 5.21 TOPS/W.
SRAM access that take an energy proportion of 27%. 2) BASU Evaluation: Fig. 18 shows the computation reduc-
1) BESA PE Evaluation: Fig. 17 shows the accuracy and tion by speculating the redundancy during attention computing
energy efficiency performance with BESA PE. For all mod- with BASU. Instead of sequentially computing the attention
els, the BESA PE can achieve considerable accuracy for an matrix for speculating, which misses 34.5% of output sparsity
approximate ratio smaller than 80%, where a higher ratio due to slow X max detecting, the BASU exploits diagonal-prior
indicates a more significantly increased energy efficiency. The computing to capture X max as quick as possible. It improves
energy efficiency of ViT and Swin models is relatively smaller the speculation efficiency by 1.47× and reduces an average
than GPT-2. It comes from the reason that they contain fewer of 1.53× more computation than the sequential method. This
small values than GPT-2, decreasing the effectiveness of the approach is feasible for either natural language processing
self-gating logic in BESA PE, which reduces more power for (NLP) tasks on GPT-2 model or object detection on the Swin
the smaller values. The approximate ratio of the WR-Tokens model, which contains inherent local property even with global
for the GPT-2 model is up to 90% with less than 0.1% accuracy attention. Besides, the BASU computes the positive operations
loss. It is 1.55× higher than the previous approach for the followed by the negative to perform speculation, which is
same loss since BESA PE can adapt to the error-tolerance numerical fidelity to protect the essential SR-Tokens. It is more
property of the attention block. The superior approximate ratio appropriate for attention sparsity speculating than the previous
allows the proposed processor to reduce 21.3% of the energy approaches [18], [19]. With the proposed efficient lossless
compared with exact computing. In addition, the self-gating redundancy speculation, the BASU can significantly reduce up
logic adaptively disables the compressors in the LSBs for to 43.2% of the computation for the GPT-2 model, increasing
approximate computing, which saves another 1.17× energy the processor’s energy efficiency from 8.12 to 14.30 TOPS/W
consumption. As a result, the BESA PE improves the energy and boosting the system energy efficiency to 6.20 TOPS/W.


TABLE III
Measurement Results and Performance Comparison With State-of-the-Art Processors

Fig. 19. Hardware utilization and energy efficiency with OPCS.

3) OPCS Evaluation: Fig. 19 presents the hardware utilization and energy efficiency improvement achieved by the OPCS. By dovetailing two operations into one multiplier with out-of-order scheduling, the OPCS increases hardware utilization by 1.81× over the original computing, which suffers from ineffectual "0"-valued PPs' computing. In addition, the OPCS outperforms static in-order dovetailing by 1.30× for two reasons. First, the adaptive folding unit in the OPCS reduces the ELWs of the operands to improve the dovetailing efficiency by 1.09×. After that, the out-of-order scheduling allows each operand to be matched with an appropriate operand of maximum ELW, avoiding wasting the PPs' generators as the in-order way does. It further increases hardware utilization by 1.19×. Apart from the significantly increased dovetailing efficiency, the proposed asymmetrical BENES network in the OPCS decreases the out-of-order scheduling power overhead by 46.2% compared with the symmetrical BENES, resulting in a 1.17× higher energy efficiency. With the adaptive folding approach and the asymmetrical BENES network, the OPCS increases the energy efficiency by an average of 1.72×, and by as much as 1.93×, pushing the peak energy efficiency to 27.56 TOPS/W. Besides, the system energy efficiency achieves a 1.21× improvement, reaching 7.53 TOPS/W.
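
As a simplified software analogy of the dovetailing idea evaluated above (the ELW measure, the 16-bit multiplier width, and the greedy pairing policy are illustrative assumptions, not the OPCS microarchitecture), the sketch below pairs a near-zero operand, whose MSBs are all "0," with a long operand so that both share one multiplier instead of wasting partial-product generators on leading zeros.

```python
def effective_length(x, width=8):
    """Effective length of word (ELW): bits that remain once the
    '0'-valued MSBs of an (assumed) 8-bit magnitude are stripped."""
    return max(1, min(width, abs(x).bit_length()))

def dovetail_pairs(operands, mult_width=16):
    """Out-of-order pairing: rank operands by ELW and greedily match a
    short operand with a long one so both fit one 16-bit multiplier."""
    order = sorted(range(len(operands)), key=lambda i: effective_length(operands[i]))
    pairs, lo, hi = [], 0, len(order) - 1
    while lo < hi:
        a, b = order[lo], order[hi]
        if effective_length(operands[a]) + effective_length(operands[b]) <= mult_width:
            pairs.append((a, b))      # two operations dovetailed into one multiplier
            lo, hi = lo + 1, hi - 1
        else:
            pairs.append((b, None))   # long operand occupies the multiplier alone
            hi -= 1
    if lo == hi:
        pairs.append((order[lo], None))
    return pairs
```

Compared with pairing operands in their arrival order, this ELW-aware matching loosely mirrors why out-of-order scheduling recovers utilization that in-order dovetailing leaves on the table. Taken together with the earlier subsections, the reported stepwise gains compose roughly multiplicatively: 5.45 TOPS/W × 1.49 ≈ 8.12 TOPS/W with the BESA PE, × 1.76 ≈ 14.30 TOPS/W with the BASU, and up to × 1.93 ≈ 27.56 TOPS/W peak with the OPCS.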
4) Combined Evaluation: We also evaluate the combined effects of the proposed techniques with three different setups.

1) We set the BESA PE to perform exact computing while disabling the BASU and OPCS. In this way, the processor performs standard attention computing, which does not exploit the WR-Tokens. We obtain the performance baseline with this setup.
2) We also evaluate the performance of handling the WR-Tokens straightforwardly. We set the BESA PE to work under the approximate mode but skip its self-gating logic to emulate the previous approximate-component-based method. Then, we gate the reorder unit in the BASU to achieve sequential speculation. Besides, we disable the folding and BENES router of the OPCS to perform static in-order dovetailing.
3) We enable all proposed approaches to exploit the WR-Tokens. Table II shows the combined evaluation results.

The computing time and energy of setup 1) are quite high since it requires performing computation-intensive global attention. Exploiting the WR-Tokens significantly reduces the time by 39.1% and the energy by 43.4%, even when straightforwardly using the previous approximate multiplier, sequential speculation, and static in-order dovetailing as in setup 2). However, the improvement of setup 2) suffers severe limitations since the traditional approaches cannot effectively exploit the properties of attention computing. Setup 3) breaks these limitations to further improve the time and energy performance by 1.75× and 1.83×, respectively. Concretely, the BASU captures X_max more rapidly than sequential computing, skipping an extra 18.1% of the redundancy for a 1.22× higher speedup. Besides, the OPCS performs dynamic out-of-order dovetailing to adapt to the randomly distributed operands, omitting more "0"-valued PPs' computing than the static in-order method, which saves time by 1.43×.


Apart from the computing speedup, the reduced computation with the BASU and the reduced PPs' generation with the OPCS decrease the computing energy by 1.65×. In addition, the BESA PE exploited in setup 3) matches the error tolerance of the attention block, which reduces the energy consumption of the WR-Tokens and increases the approximate ratio relative to the traditional approximate method, saving 10.1% more energy. Compared with setup 1), setup 3) prominently reduces the time by 2.88× and the energy by 3.20×.
C. Comparison With State-of-the-Art Processors

Table III compares our processor with several state-of-the-art implementations, including the A100 graphics processing unit (GPU), a sparse matrix multiplication processor for LSTM, and Transformer-based processors. The proposed processor achieves the highest energy efficiency for attention computing. It is 17.66× higher than the A100 GPU at INT8 precision, which has a peak energy efficiency of 3.12 TOPS/W but only achieves 1.56 TOPS/W for the attention block. Specifically, the A100 GPU reaches its peak energy efficiency of 3.12 TOPS/W with 50% input structured sparsity, but it cannot effectively handle the irregular output attention sparsity and near-zero probabilities caused by the WR-Tokens. Therefore, the A100 can only work as a simple matrix multiplication processor without exploiting sparsity, limiting its energy efficiency to 1.56 TOPS/W. In addition, the energy efficiency of our processor is 9.37× higher than that of [12], which has a peak performance of 8.93 TOPS/W for sparse matrix multiplication but suffers a dramatic degradation of 3.04× for attention blocks. The proposed processor outperforms [12] for three reasons. First, the BESA PE takes less power than the exact PE adopted in [12]. Second, the BASU can speculate the output sparsity in the attention block, while [12] only supports input sparsity. Third, besides handling the zero operands as [12] does, the OPCS further skips "0"-valued PPs' computing with operation dovetailing. Moreover, compared with ELSA [24], which also supports output sparsity speculation, the proposed processor possesses the following superiorities. First, the proposed BASU is lossless as it utilizes the fixed-point system's intrinsic finite word length effects, while the hash-based speculation in ELSA introduces accuracy degradation. Second, the BASU performs zero skipping along with Q × K^T, avoiding the additional hash computation for sparsity speculation and the repeated memory access for zero skipping. Third, the BASU exploits the local property of the attention mechanism to perform diagonal-prior speculation, reducing more computations than speculating with the orthogonal matrix multiplication in ELSA. Apart from the advantage of sparsity speculation, the BESA PE reduces MAC power and the OPCS increases the throughput for the WR-Tokens, while ELSA lacks these abilities. As a result, the proposed processor reduces energy by 4.57× and offers a 3.73× speedup compared with ELSA.

VIII. CONCLUSION

This article designs a Transformer processor to fill the gap between the current AI processors and the emerging attention mechanism. It achieves superior energy efficiency compared with the GPU, sparse matrix multiplication processors, and prior Transformer processors by exploiting the naturally existent WR-Tokens with three innovations. First, the BESA PE appropriately matches the error-tolerance property of the attention block, significantly reducing the MAC energy while maintaining considerable accuracy. Second, the BASU speculates the output sparsity of the attention matrix by utilizing the local property of global attention, resulting in higher speculation efficiency and more redundant-computation reduction. Third, the OPCS aggressively exploits the "0"-valued MSBs in near-zero operands through an out-of-order mechanism, remarkably improving the hardware utilization. With these methods, our processor achieves a peak energy efficiency of 27.56 TOPS/W, breaking the computational bottleneck of the attention block and paving the way for attention-based applications, even in power-constrained systems.

REFERENCES

[1] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. –11.
[2] A. Radford et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[3] A. Dosovitskiy et al., "An image is worth 16×16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Represent. (ICLR), 2020.
[4] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10012–10022.
[5] K. M. Choromanski et al., "Rethinking attention with performers," in Proc. Int. Conf. Learn. Represent. (ICLR), 2021. [Online]. Available: https://openreview.net
[6] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, "Bottleneck transformers for visual recognition," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 16519–16529.
[7] Q. Han et al., "On the connection between local attention and dynamic depth-wise convolution," in Proc. Int. Conf. Learn. Represent. (ICLR), 2022. [Online]. Available: https://openreview.net
[8] H. Wang, Y. Zhu, H. Adam, A. Yuille, and L.-C. Chen, "MaX-DeepLab: End-to-end panoptic segmentation with mask transformers," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 5463–5474.
[9] A. Vaswani, P. Ramachandran, A. Srinivas, N. Parmar, B. Hechtman, and J. Shlens, "Scaling local self-attention for parameter efficient visual backbones," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12894–12904.
[10] H. Sharma et al., "Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 764–775.
[11] J.-S. Park et al., "9.5 A 6K-MAC feature-map-sparsity-aware neural processing unit in 5nm flagship mobile SoC," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 152–154.
[12] D. Kadetotad, S. Yin, V. Berisha, C. Chakrabarti, and J.-S. Seo, "An 8.93 TOPS/W LSTM recurrent neural network accelerator featuring hierarchical coarse-grain sparsity for on-device speech recognition," IEEE J. Solid-State Circuits, vol. 55, no. 7, pp. 1877–1887, Jul. 2020.
[13] S. Choi, J. Sim, M. Kang, Y. Choi, H. Kim, and L.-S. Kim, "An energy-efficient deep convolutional neural network training accelerator for in situ personalization on smart devices," IEEE J. Solid-State Circuits, vol. 55, no. 10, pp. 2691–2702, Oct. 2020.
[14] H. Mo et al., "9.2 A 28nm 12.1TOPS/W dual-mode CNN processor using effective-weight-based convolution and error-compensation-based prediction," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 146–148.
[15] A. Agrawal et al., "9.1 A 7nm 4-core AI chip with 25.6TFLOPS hybrid FP8 training, 102.4TOPS INT4 inference and workload-aware throttling," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 144–146.
[16] T. Tambe et al., "9.8 A 25 mm2 SoC for IoT devices with 18ms noise-robust speech-to-text latency via Bayesian speech denoising and attention-based sequence-to-sequence DNN speech recognition in 16nm FinFET," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2021, pp. 158–160.

[17] D. Han et al., "HNPU: An adaptive DNN training processor utilizing stochastic dynamic fixed-point and active bit-precision searching," IEEE J. Solid-State Circuits, vol. 56, no. 9, pp. 2858–2869, Sep. 2021.
[18] J. Lee, J. Lee, D. Han, J. Lee, G. Park, and H.-J. Yoo, "7.7 LNPU: A 25.3TFLOPS/W sparse deep-neural-network learning processor with fine-grained mixed precision of FP8-FP16," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019, pp. 142–144.
[19] S. Kang et al., "7.4 GANPU: A 135TFLOPS/W multi-DNN training processor for GANs with speculative dual-sparsity exploitation," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 140–142.
[20] Y. Wang et al., "A 28nm 276.55TFLOPS/W sparse deep-neural-network training processor with implicit redundancy speculation and batch normalization reformulation," in Proc. Symp. VLSI Circuits, Jun. 2021, pp. 1–2.
[21] F. Tu et al., "Evolver: A deep learning processor with on-device quantization–voltage–frequency tuning," IEEE J. Solid-State Circuits, vol. 56, no. 2, pp. 658–673, Feb. 2021.
[22] Y. Wang et al., "A 28nm 27.5TOPS/W approximate-computing-based transformer processor with asymptotic sparsity speculating and out-of-order computing," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2022, pp. 464–465.
[23] T. J. Ham et al., "A3: Accelerating attention mechanisms in neural networks with approximation," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 328–341.
[24] T. J. Ham et al., "ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks," in Proc. ACM/IEEE 48th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2021, pp. 692–705.
[25] H. Jang, J. Kim, J.-E. Jo, J. Lee, and J. Kim, "MnnFast: A fast and scalable system architecture for memory-augmented neural networks," in Proc. 46th Int. Symp. Comput. Archit., Jun. 2019, pp. 250–263.
[26] H. Wang, Z. Zhang, and S. Han, "SpAtten: Efficient sparse attention architecture with cascade token and head pruning," in Proc. IEEE Int. Symp. High-Perform. Comput. Archit. (HPCA), Feb. 2021, pp. 97–110.
[27] S. Sen and A. Raghunathan, "Approximate computing for long short term memory (LSTM) neural networks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 37, no. 11, pp. 2266–2276, Nov. 2018.
[28] Z. G. Tasoulas, G. Zervakis, I. Anagnostopoulos, H. Amrouch, and J. Henkel, "Weight-oriented approximation for energy-efficient neural network inference accelerators," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 12, pp. 4670–4683, Dec. 2020.
[29] V. Mrazek, Z. Vasicek, L. Sekanina, M. A. Hanif, and M. Shafique, "ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining," in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2019, pp. 1–8.
[30] S. Vahdat, M. Kamal, A. Afzali-Kusha, and M. Pedram, "TOSAM: An energy-efficient truncation- and rounding-based scalable approximate multiplier," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 5, pp. 1161–1173, May 2019.
[31] A. G. M. Strollo, E. Napoli, D. D. Caro, N. Petra, G. Saggese, and G. D. Meo, "Approximate multipliers using static segmentation: Error analysis and improvements," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 69, no. 6, pp. 2449–2462, Jun. 2022.
[32] A. G. M. Strollo, E. Napoli, D. D. Caro, N. Petra, and G. D. Meo, "Comparison and extension of approximate 4–2 compressors for low-power approximate multipliers," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 3021–3034, Sep. 2020.
[33] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro, and N. Petra, "Approximate multipliers based on new approximate compressors," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 12, pp. 4169–4182, Dec. 2018.
[34] E. Qin et al., "SIGMA: A sparse and irregular GEMM accelerator with flexible interconnects for DNN training," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2020, pp. 58–70.
[35] B. Wang et al., "Exploration of Benes network in cryptographic processors: A random infection countermeasure for block ciphers against fault attacks," IEEE Trans. Inf. Forensics Security, vol. 12, no. 2, pp. 309–322, Feb. 2017.
[36] R. Yao and Y. Ye, "Toward a high-performance and low-loss Clos–Benes-based optical network-on-chip architecture," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 39, no. 12, pp. 4695–4706, Dec. 2020.

Yang Wang received the B.S. degree in electronic science and technology from Xidian University, Xi'an, China, in 2014, and the Ph.D. degree in microelectronics and solid-state electronics from the Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China, in 2019. He is currently a Post-Doctoral Researcher with the School of Integrated Circuits, Tsinghua University, Beijing. His research interests include very-large-scale integration of digital signal processing (VLSI DSP), deep learning, and neural network acceleration.

Yubin Qin received the B.S. degree from the School of Electronic Science and Engineering, Southeast University, Nanjing, China, in 2020. He is currently pursuing the Ph.D. degree with the School of Integrated Circuits, Tsinghua University, Beijing, China. His current research interests include deep learning, very-large-scale integration (VLSI) design, and hardware-software co-design.

Dazheng Deng received the B.S. degree in microelectronic science and engineering from Tsinghua University, Beijing, China, in 2020, where he is currently pursuing the M.S. degree with the School of Integrated Circuits. His current research interests include deep learning, computer architecture, and very-large-scale integration (VLSI) design.

Jingchuan Wei received the B.S. degree from Binzhou University, Binzhou, China, in 2010, and the M.S. degree from the Beijing Institute of Technology, Beijing, China, in 2015. He is currently a Senior Engineer with Tsinghua University, Beijing. His research area is the implementation of artificial intelligence (AI) accelerators and system-on-chip (SoC) design.

Yang Zhou received the B.S. degree in electronic science and technology from the Beijing Institute of Technology, Beijing, China, in 2019. He is currently pursuing the M.S. degree with the School of Integrated Circuits, Tsinghua University, Beijing. His current research interests include very-large-scale integration (VLSI) design and neural network acceleration.

Yuanqi Fan received the B.S. degree in microelectronic science and engineering from Tsinghua University, Beijing, China, in 2019, where he is currently pursuing the M.S. degree with the School of Integrated Circuits. His current research interests include deep learning, computer architecture, and artificial intelligence (AI) accelerators.

Tianbao Chen received the B.S. degree from the College of Electronic Science and Engineering, Jilin University, Changchun, Jilin, China, in 2013, and the M.S. degree from Tsinghua University, Beijing, China, in 2016. He is currently an Application-Specific Integrated Circuit (ASIC) Design Engineer with TsingMicro Technology, Beijing. His research interests include very-large-scale integration (VLSI) design and deep learning.

Hao Sun received the B.S. degree from the College of Computer Science and Technology, Tianjin University of Science and Technology, Tianjin, China, in 2011, and the M.S. degree from Beihang University, Beijing, China, in 2018. He is currently a Research Associate with Tsinghua University, Beijing. His research interests include deep learning and machine learning.

Leibo Liu (Senior Member, IEEE) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1999, and the Ph.D. degree from the Institute of Microelectronics, Tsinghua University, in 2004. He is currently a Professor with the School of Integrated Circuits, Tsinghua University. His research interests include reconfigurable computing, mobile computing, and very-large-scale integration of digital signal processing (VLSI DSP).

Shaojun Wei (Fellow, IEEE) was born in Beijing, China, in 1958. He received the Ph.D. degree from the Faculté Polytechnique de Mons, Mons, Belgium, in 1991. He became a Professor at the Institute of Microelectronics, Tsinghua University, Beijing, in 1995. His main research interests include very-large-scale integration system-on-chip (VLSI SoC) design, electronic design automation (EDA) methodology, and communication application-specific integrated circuit (ASIC) design. Dr. Wei is a Senior Member of the Chinese Institute of Electronics (CIE).

Shouyi Yin (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tsinghua University, Beijing, China, in 2000, 2002, and 2005, respectively. He was a Research Associate with Imperial College London, London, U.K. He is currently a Full Professor and the Vice-Director of the School of Integrated Circuits, Tsinghua University. He has published more than 100 journal articles and more than 50 conference papers. His research interests include reconfigurable computing, artificial intelligence (AI) processors, and high-level synthesis. Dr. Yin has served as a Technical Program Committee Member for the top very-large-scale integration (VLSI) and electronic design automation (EDA) conferences, such as the Asian Solid-State Circuits Conference (A-SSCC), the IEEE/ACM International Symposium on Microarchitecture (MICRO), the Design Automation Conference (DAC), the International Conference on Computer-Aided Design (ICCAD), and the Asia and South Pacific Design Automation Conference (ASP-DAC). He is also an Associate Editor of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS (TCAS-I), ACM Transactions on Reconfigurable Technology and Systems (TRETS), and Integration, the VLSI Journal.
