
Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning

Hongwu Peng[1], Shaoyi Huang[1], Tong Geng[2], Ang Li[2], Weiwen Jiang[3], Hang Liu[4], Shusen Wang[4], and Caiwen Ding[1]

[1] University of Connecticut, Storrs, CT, USA
[2] Pacific Northwest National Laboratory, Richland, WA, USA
[3] University of Notre Dame, Notre Dame, IN, USA
[4] Stevens Institute of Technology, Hoboken, NJ, USA

* [email protected]

Abstract—Although Transformer-based language representations achieve state-of-the-art accuracy on various natural language processing (NLP) tasks, the large model size has been challenging resource-constrained computing platforms. Weight pruning, as a popular and effective technique for reducing the number of weight parameters and accelerating the Transformer, has been investigated on GPUs. However, Transformer acceleration using weight pruning on field-programmable gate arrays (FPGAs) remains unexplored. This paper investigates column balanced block-wise pruning on the Transformer and designs an FPGA acceleration engine customized for the balanced block-wise matrix multiplication. We implement the Transformer model with proper hardware scheduling, and the experiments show that the Transformer inference on FPGA achieves 10.35 ms latency with a batch size of 32, which is a 10.96× speedup compared to the CPU platform and a 2.08× speedup compared to the GPU platform.

Index Terms—Transformer, deep learning, pruning, acceleration, FPGA

I. INTRODUCTION

In the field of natural language processing (NLP), the recurrent neural network (RNN) [1] and the long short-term memory (LSTM) model [2] have been widely deployed for different tasks in the past. However, RNN and LSTM training and inference are intrinsically sequential computation tasks, making them difficult to accelerate on today's hardware, e.g., GPUs and field-programmable gate arrays (FPGAs) [3], [4]. In 2017, the Transformer architecture, which relies on self-attention mechanisms [5], was proposed. The Transformer model enables a high level of computation parallelism in both training and inference. It outperforms RNN and LSTM on major NLP tasks, such as language inference, question answering, and sentiment analysis.

On the other hand, weight pruning methodologies such as irregular pruning [6], structured pruning [7], and pattern pruning [8] have been widely used for deep neural networks in the computer vision field. Pruning has also been used to accelerate Transformer-based DNNs because of the enormous parameter count and model size of the Transformer. With weight pruning, the size of the Transformer can be significantly reduced without much prediction accuracy degradation [9]. Therefore, we can fit the compressed, highly accurate Transformer model into FPGAs. In recent years, because of their extremely low latency, high energy efficiency, and flexible reprogrammability for easy prototyping, FPGAs have received much attention as an alternative acceleration solution to GPUs for ML and real-time data analytics applications [10]. However, current state-of-the-art Transformer acceleration focuses on GPUs [11]–[13]. A comprehensive investigation of Transformer acceleration on FPGAs using hardware-aware weight pruning is still missing.

In this paper, we focus on the acceleration of the Transformer model on FPGAs with the balanced block pruning technique. We further propose a hardware design and resource scheduling to achieve high parallelism and high throughput on the FPGA device. Our contributions are summarized as follows:

• A column balanced block pruning technique and its storage format are developed. The indices pointer matrix has much lower memory storage overhead than that of other sparse matrix formats.
• A specialized processing element (PE) is introduced for the sparse matrix multiplication accelerator, and multiple PEs can be used to increase the accelerator throughput.
• The hardware resource scheduling for the encoder/decoder accelerator on FPGA is discussed, and we achieve a high overall hardware throughput.

We implement the proposed techniques on different hardware platforms (Intel i5-5257U (2.7 GHz) CPU, Nvidia Jetson TX2 GPU, and Xilinx Alveo U200 FPGA) for further comparison of latency and throughput. Experimental results show that the FPGA hardware design enables a 10.96× speedup compared to the CPU platform and a 2.08× speedup compared to the GPU platform.

The organization of the work is as follows. Section II gives the basics of the Transformer model and DNN model compression, together with the structure of the encoder. Section III proposes the column balanced block-wise pruning structure and its pruning algorithm. Section IV gives the hardware design for the sparse matrix multiplication accelerator and the overall Transformer structure. Section V gives the Transformer model's training, the hardware scheduling results of the encoder/decoder layers, and the inference speed on different hardware platforms. Section VI concludes the hardware design and experiments.


II. BACKGROUND AND RELATED WORK

This section focuses on the background of the Transformer model and the pruning methods for hardware acceleration.

A. Transformer Model

The Transformer model was first proposed in 2017 [5] for NLP tasks. Different Transformer-based models have been proposed, such as BERT [14] and RoBERTa [15], which utilize more layers and heads to achieve better performance on major NLP tasks. To reduce the model size, DistilBERT [16] was invented based on knowledge distillation [16], and it achieves a 40% model size reduction from BERT without much accuracy degradation. Moreover, the Transformer model has also been utilized in the computer vision field, such as the Transformer for image recognition [17] and the Transformer for object detection [18].

Fig. 1: Encoder structure.

Our research focuses on Transformer model inference acceleration on FPGAs. The encoder structure of our Transformer is shown in Fig. 1. Our Transformer has 2 encoder layers and 4 heads. The decoder is simply a linear layer on top of the encoder stack. The scaled dot product attention is used for the self-attention mechanism, and its equation is shown below:

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\mathrm{Mask}\left(\frac{Q \times K^{T}}{\sqrt{d_k}}\right)\right) \times V \qquad (1)

B. DNN Model Compression and Sparsity Pattern

Different model compression techniques have been studied to reduce the parameter size and computation burden of deep neural networks (DNNs). The main challenge for state-of-the-art model compression techniques is maintaining the model accuracy while improving the model execution efficiency on hardware devices. Irregular pruning methods [11] enable a high prune ratio without much performance degradation. However, they are not easy to accelerate on GPUs and FPGAs because of their random memory access pattern and extra index overhead. For instance, when storing an irregular sparse matrix using the Coordinate (COO) format, we store the consecutive non-zeros and their coordinates in memory. Three vectors are needed: row, col, and data, where data[i] is the value at position (row[i], col[i]). Regular pruning features a relatively higher regularity of the non-zero elements, and this regularity helps speed up sparse matrix multiplication on FPGAs and GPUs. Regular pruning methods, such as bank balanced pruning [19]–[21], block-circulant matrix pruning [22], and block pruning [11], have been developed. However, the block pruning structure has only been implemented on GPUs in the past; the block pruning structure's acceleration on FPGAs remains unexplored.

III. SPARSE MATRIX FORMAT AND PRUNING ALGORITHM

In this section, we evaluate different sparse matrix formats. A column balanced block-wise sparsity is proposed, which strikes a better balance between index storage overhead and accuracy, and it achieves high hardware parallelism compared with other pruning methodologies.

A. Comparison of Different Sparse Matrix Formats

Fig. 2 shows an example of four different types of pruning patterns for a sparse matrix. The first one is the irregular pruning technique. The irregular pruning pattern sets a weight threshold for the specific pruning ratio and prunes out the elements below that threshold. The pruning is done in an element-wise pattern, so the irregular pruning pattern has the lowest accuracy drop as the pruning ratio increases. The indices of an irregular sparse matrix can be stored in the Compressed Sparse Row (CSR) format or the Coordinate list (COO) format, both of which occupy extra storage space. When the pruning ratio is low, the required memory space for irregular sparse matrix storage might be higher than that of a dense matrix. Furthermore, the irregular sparse matrix imposes a great challenge on the hardware design. The non-zero elements of the irregular sparse matrix are randomly distributed over the entire matrix, which results in an irregular memory access pattern and stalls the hardware parallelism. Thus, the speedup of the irregular sparse matrix operation can even be negative. As a result, other types of regular pruning structures were proposed.

Fig. 2: Four types of pruning pattern with 0.33 pruning ratio: irregular pruning, bank balanced pruning, block-wise pruning, and column balanced block-wise pruning.

Bank balanced pruning has been widely used in different DNN applications. [19] discussed the hardware acceleration of bank balanced sparse matrix operations on GPUs, and [20], [21] implemented the bank balanced structure for sparse matrix operations on FPGAs. Both papers presented detailed hardware designs and performance evaluations for bank balanced pruning.

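To make the bank balanced pattern concrete, a minimal NumPy sketch is given below (our illustration, not the implementation of [19]–[21]); it assumes each matrix row is split into equal-size banks and that the same number of largest-magnitude weights is kept in every bank, which is the balance property the corresponding hardware exploits.

import numpy as np

def bank_balanced_prune(W, bank_size, keep_per_bank):
    """Zero out all but the `keep_per_bank` largest-magnitude entries in each bank.

    Each row of W is split into contiguous banks of `bank_size` elements,
    so every bank (and hence every row) keeps the same number of non-zeros.
    """
    rows, cols = W.shape
    assert cols % bank_size == 0, "row length must be divisible by the bank size"
    Wp = W.copy()
    for i in range(rows):
        for b in range(0, cols, bank_size):
            bank = Wp[i, b:b + bank_size]                 # view into Wp
            # indices of the smallest-magnitude entries inside this bank
            drop = np.argsort(np.abs(bank))[:bank_size - keep_per_bank]
            bank[drop] = 0.0
    return Wp

# toy usage: 4x8 matrix, banks of 4, keep 2 non-zeros per bank (50% sparsity)
W = np.random.randn(4, 8)
print(bank_balanced_prune(W, bank_size=4, keep_per_bank=2))

Because every bank keeps exactly the same number of non-zeros, the hardware can statically balance the workload across lanes, which is the main reason this pattern accelerates well.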
[20] proposed the Compressed Sparse Banks (CSB) format for sparse matrix storage. However, both the CSR and CSB formats require at least one index pointer for each non-zero element, which leads to extra memory occupation overhead. Because of this index pointer storage overhead, bank balanced pruning may require more memory space than the dense matrix at a low pruning ratio.

The third pattern, block-wise pruning [22], calculates each block's L2 norm and prunes the blocks with lower L2 norms. Block-wise pruning is similar to irregular pruning except that the weight importance is calculated for each block rather than for individual elements. The Block Compressed Sparse Row (BCSR) format can be used to store the weight matrix, which significantly reduces the index pointer storage overhead since a whole block of elements shares the same index. However, the blocks are pruned randomly over the entire matrix, which makes the hardware design for intra-block parallelism a great challenge.

The column balanced block-wise pruning combines the key features of both bank balanced pruning and block-wise pruning. It ranks the blocks' L2 norms within each column to obtain the pruning threshold and prunes blocks column by column. A detailed algorithm for column balanced block-wise pruning is given in the next section. By combining the CSB format and the BCSR format, a Compressed Sparse Column Block (CSCB) format is formed. An example of the CSCB format is shown in Fig. 3. Only one index pointer per block is needed, which leads to a much lower memory consumption than the previous sparse matrix patterns. Moreover, the column balanced block-wise pruning enables both inter-block and intra-block parallelism for efficient hardware design. The hardware design details are revealed in later sections.

Fig. 3: An example of the CSCB format for a column balanced block-wise sparse matrix, where the block size is 2×2.

B. Column Balanced Block-wise Pruning Algorithm

We propose a column balanced block-wise weight pruning algorithm, where column balance is achieved by pruning the same number of blocks in each block column of the weight matrices, which are the 2-D weights of the Transformer's linear layers in our experiments.

Algorithm 1: Column balanced block-wise weight pruning algorithm.
1   Input: weight matrix W, matrix width n, matrix height m, block row r, block column c, percentile perc
2   Output: pruned weight matrix W_p
3   Set W_p = W
4   Set column division j = n / c, row division j' = m / r
5   Divide W_p into j sub-matrices: W_1, W_2, ..., W_j
6   foreach W_i in W_1, W_2, ..., W_j do
7       Divide W_i into j' sub-matrices (blocks)
8       Calculate the l2 norm of each block
9       Set the blocks of W_i whose l2 norm falls within the lowest perc to zero
10  end
11  W_p = concatenate(W_1, W_2, ..., W_j)

We denote the weight matrix as W with width n and height m, the row size and column size of the pruning block as r and c respectively, and perc as the percentile of weights to be excluded from W. Algorithm 1 illustrates the column balanced block-wise weight pruning method. We (a) first divide a Transformer weight matrix into j sub-matrices, where j = n/c. (b) For each sub-matrix W_i, we divide it into j' sub-matrices, where j' = m/r; j × j' blocks of size r by c are generated in this step. (c) We calculate the l2 norm of each block. (d) For each W_i, we set the blocks whose l2 norm falls within the lowest perc to zero. (e) Finally, we concatenate the resulting sub-matrices W_1, W_2, ..., W_j horizontally to form the pruned matrix W_p.
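For concreteness, a compact PyTorch sketch of Algorithm 1 is given below (an illustration only, not the training code used in Section V); it assumes m and n are divisible by the block sizes r and c, as Algorithm 1 does, and the function and variable names are ours.

import torch

def column_balanced_block_prune(W, r, c, perc):
    """Algorithm 1: zero the fraction `perc` of r x c blocks with the
    smallest L2 norm, independently within every block column of W."""
    m, n = W.shape
    assert m % r == 0 and n % c == 0
    jp, j = m // r, n // c                        # j' block rows, j block columns
    # view W as a (j' x j) grid of r x c blocks and take each block's L2 norm
    blocks = W.reshape(jp, r, j, c)
    norms = blocks.pow(2).sum(dim=(1, 3)).sqrt()  # shape (j', j)
    # per block column: threshold at the `perc` quantile of that column's block norms
    thresh = torch.quantile(norms, perc, dim=0, keepdim=True)   # shape (1, j)
    keep = (norms > thresh).float()               # 0/1 block mask, balanced per column
    mask = keep[:, None, :, None].expand(jp, r, j, c).reshape(m, n)
    return W * mask

# toy usage: 90% block pruning of a 512x256 weight with 16x16 blocks
W = torch.randn(512, 256)
Wp = column_balanced_block_prune(W, r=16, c=16, perc=0.9)

Because the quantile is taken per block column, every column keeps the same number of surviving blocks, which is exactly the column balance property the CSCB format and the accelerator rely on.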

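Similarly, a small sketch of packing such a pruned matrix into the CSCB layout of Fig. 3 is shown below (hypothetical helper names, not the paper's code): for every block column, the surviving r × c blocks are stored contiguously in a data array together with one block-row index per block, which is the "one index pointer per block" property used by the accelerator in Section IV.

import torch

def pack_cscb(Wp, r, c):
    """Pack a column balanced block-sparse matrix into CSCB-style arrays.

    Returns
      data:  (j, k, r, c) tensor, the k surviving blocks of each block column
      index: (j, k) tensor, the block-row position of each stored block
    Assumes every block column keeps the same number k of non-zero blocks
    (the column balance guarantee of Algorithm 1).
    """
    m, n = Wp.shape
    jp, j = m // r, n // c
    blocks = Wp.reshape(jp, r, j, c).permute(2, 0, 1, 3)    # (j, j', r, c)
    nonzero = blocks.abs().sum(dim=(2, 3)) > 0              # (j, j') block occupancy
    k = int(nonzero[0].sum())
    assert bool((nonzero.sum(dim=1) == k).all()), "matrix is not column balanced"
    data = torch.stack([blocks[col][nonzero[col]] for col in range(j)])                 # (j, k, r, c)
    index = torch.stack([nonzero[col].nonzero(as_tuple=True)[0] for col in range(j)])   # (j, k)
    return data, index

data, index = pack_cscb(Wp, r=16, c=16)   # Wp from the pruning sketch above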
IV. TRANSFORMER ACCELERATOR DESIGN

Most of the Transformer model, even one with only a few shallow encoder and decoder layers, is too large to fit into the FPGA on-chip block RAM (BRAM). Thus, a model compression technique is needed to compress the model size and fit the model into existing FPGA devices. The hardware design for compressed matrix operations is a great challenge, and it directly determines how much speedup we can achieve in hardware. The embedding layer is normally a lookup table, and it contributes 30% of the parameters, so it is loaded into external memory instead of on-chip BRAM.

In this section, we first present a holistic hardware architecture for the Transformer hardware accelerator. Then we dive into the details of the sparse matrix multiplication accelerator as well as the encoder framework. We aim to reduce the inference latency by adding more hardware parallelism and enabling an efficient pipeline on the hardware data flow.

A. Overall Hardware Architecture

As shown in Fig. 4, the hardware architecture of the Transformer is composed of the host PC, which is in charge of generating and sending the tokenized sentence, the off-chip DDR memory and its controller for the embedding layer, and the FPGA accelerator for the encoder and decoder layers. As mentioned in the previous section, our Transformer model is composed of 2 encoder layers and 1 decoder layer. The decoder is simply a linear layer that takes the encoder's output and generates the final output. The hardware resources for the encoder can be reused across different layers. However, we only have 2 encoder layers, so hardware resource re-utilization is not necessary.

Fig. 4: Overall structure of the Transformer accelerator.

The hardware data flow is as follows. First, the host PC generates the tokenized sentence and sends it to the FPGA through the PCIe interface. Then, the DDR controller fetches the embedding from DRAM for each word of the input sentence. The word embedding sequence is then fed into the FPGA on-chip resources for encoder and decoder inference.

B. Sparse Matrix Multiplication Accelerator Design

The sparse matrix multiplication accelerator design is shown in Fig. 5. The compressed sparse matrix is stored in the CSCB format, which needs two matrices: one is the data matrix, and the other is the indices pointer matrix. Each element of the indices pointer matrix indicates the original location of the corresponding block in the data matrix. To exploit hardware parallelism, we design a processing element (PE) for each task, and multiple PEs can be used in parallel to increase the hardware bandwidth.

Fig. 5: Dot product accelerator for a sparse matrix with the column balanced block-wise sparsity pattern.

The PE design is as follows. Each PE shown in Fig. 5 holds one copy of a single row of the dense input matrix as a register buffer array. For each operation, the PE uses the indices pointer to fetch two inputs: a single bank from the dense matrix row and a whole data block from the compressed sparse matrix. After the data fetching procedure, a dense general matrix multiply (GEMM) is performed within the PE. The GEMM accelerator design has already been exploited in previous work [23], and we can follow the basic procedure to design a highly parallelized GEMM hardware accelerator. Lastly, the GEMM results are accumulated to obtain the final output. Inside the PE, the data fetching, GEMM, and accumulation processes execute serially with an elaborate pipeline structure. The PE iterates over the entire block column of the compressed sparse matrix to obtain the final output. To increase data throughput and enable intra-block parallelism, multiple PEs can be used in parallel. With multiple PEs, BRAM memory partitioning of the compressed sparse matrix is required to aid the parallel computation. After each PE finishes accumulation and has iterated over the entire compressed sparse matrix, the results can be concatenated together to generate the final dot product output. The PE design concept can be applied to both FPGA hardware design and GPU kernel design.
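The per-PE computation described above can be summarized in software as follows; this is a functional sketch of the arithmetic only (not an HLS implementation), it reuses the data/index arrays produced by the packing sketch at the end of Section III, and it conceptually assigns one PE per block column.

import torch

def cscb_row_matvec(x, data, index, r, c):
    """Multiply one dense row x (length m) by a CSCB-packed sparse matrix (m x n).

    For every block column, the index array selects which r-length bank of x
    to fetch, a small dense r x c GEMM is performed, and the partial results
    are accumulated; the per-column outputs are concatenated at the end,
    mirroring the PE loop described in Section IV-B.
    """
    j, k = index.shape                       # j block columns, k blocks per column
    outputs = []
    for col in range(j):                     # conceptually one PE per block column
        acc = torch.zeros(c)
        for b in range(k):
            row_blk = int(index[col, b])     # indices pointer -> bank of the dense row
            bank = x[row_blk * r:(row_blk + 1) * r]       # fetch dense bank (length r)
            acc += bank @ data[col, b]       # small dense GEMM: (1 x r) times (r x c)
        outputs.append(acc)
    return torch.cat(outputs)                # length n = j * c

# sanity check against the dense product, using Wp/data/index from the earlier sketches
x = torch.randn(Wp.shape[0])
assert torch.allclose(cscb_row_matvec(x, data, index, 16, 16), x @ Wp, atol=1e-3)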

C. Encoder Accelerator Structure

The overall structure of a single encoder layer is shown in Fig. 6. The encoder layer comprises 4 sparse matrix dot product accelerators, an activation layer, 4 dot product attention hardware blocks for the 4 heads, and 2 add-normalization layers. The 5 sparse weight matrices with their index matrices are stored in BRAM to speed up the computation.

The data flow of the encoder layer is as follows. The encoder input X is first fed into the dot product accelerator to calculate the matrices Q, K, and V, and the accelerator output is fed into the dot product attention hardware. The output matrices of the dot product attention are then concatenated into a single matrix Z. Z is fed into another dot product accelerator to generate the matrix At. The matrix At is then passed through an add-normalization layer, and the result is stored in the matrix Nr1. The matrix Nr1 is then fed into a specially designed dot product accelerator for two matrix multiplications and one activation function. In our design, we use ReLU as the activation function. The dot product accelerator output FF1 is fed into another add-normalization layer and generates the final output Xo.

Fig. 6: Encoder structure.

D. Resource Scheduling

To allocate a reasonable amount of resources to each function, we adopt operation scheduling methods [24] for the hardware design. The optimization goal is:

\min_{\{W_n\},\{b_n\}} \; \min(T_1, T_2, \ldots, T_k), \quad \text{subject to} \quad R_t \geq \sum_{i=1}^{k} R_i + R_m \qquad (2)

In this formulation, R_t = [R_DSP, R_FF, R_LUT, R_BRAM] is the total available resource on the FPGA chip, R_i is the resource used by each function within an encoder/decoder, and R_m is the resource used by the DDR controller, the PCIe controller, and other miscellaneous functions within the FPGA system. We first begin with a hardware design without any parallelism. Then we add hardware parallelism to the slowest function or loop and check the resource constraint. If the resource constraint is satisfied, we again add hardware parallelism to the slowest function or loop, and repeat until the hardware resources and latency are optimized.

V. EXPERIMENTS

A. Training of the Transformer Model

We conduct our experiment using the Transformer model on the Wikitext-2 dataset [25], on which we use next-word prediction accuracy as the benchmark. The Transformer model illustrated in this paper is a shallow, well pre-trained, unpruned model in its PyTorch GPU version, and it has two encoder layers and one decoder layer, with an embedding layer size (emsize) of 800 and a hidden layer dimension (nhid) of 200. The decoder layer of the Transformer is a fully connected layer with a 28,785-dimensional output. In the experiments, we use 32 as our batch size for both training and inference.

The Transformer model training is conducted on 4 RTX 2080Ti GPU servers (each with 4 or 8 GPUs). Experiments are implemented using Python 3.6.10, GCC 7.3.0, PyTorch 1.4.0, and CUDA 10.1.

Fig. 7: Accuracy of the Transformer model at different pruning ratios using the column balanced block-wise pruning method; two different block sizes (16*16 and 4*4) are compared.

We first run 50 training epochs and obtain the model with the best accuracy as our pre-trained model. We then load the pre-trained model and run 50 epochs of ADMM-based training [26], pruning, and retraining. For the ADMM-based training, we set the initial learning rate to 3.0 and the ADMM epoch to 9, and we adjust the learning rate periodically. To be more specific, if the epoch is divisible by the ADMM epoch, we reset the learning rate to the original one; otherwise, we decay the learning rate every 1/3 of an ADMM epoch. For pruning, we set the prune ratio between 0 and 1 and employ the column balanced block-wise pruning method to prune the trained model. Finally, we retrain the pruned model. For the prune ratio, we first sweep the range of 0 to 0.9 with an interval of 0.1. As we can observe from Fig. 7, there is little accuracy drop with increasing prune ratio in this region for both block sizes. To find the turning point, or the "sweet spot", we further explored the accuracy and sparsity curve between 0.9 and 1 with a smaller interval of 0.02. We finally choose prune ratio = 0.9 as the operating point, where the model has a high pruning ratio with a low accuracy drop. At this point, the Transformer model with a 16*16 block size has an accuracy of 0.9512, comparable to the accuracy of 0.9535 for the 4*4 block size.

B. Hardware Performance

1) Hardware Platform: The Xilinx Alveo U200 board, which has 4,320 18k BRAMs, 6,840 DSPs, and 1,882.2k logic cells (LUTs), is used in the experiment. The FPGA board is connected to the host machine through PCIe for fetching the input words with a batch size of 18. Xilinx SDx 2020.1 and its high-level synthesis tool (C/C++) are used for the hardware development. The hardware inference performance is compared among the Intel i5-5257U (2.7 GHz) CPU, Nvidia Jetson TX2 GPU, and Xilinx Alveo U200 FPGA platforms for latency and throughput (frames/sequences per second). The batch size is chosen as 32 for all platforms for a fair comparison.

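The iterative procedure of Section IV-D can be written as a simple greedy loop. The sketch below is our illustration with made-up cost numbers and a hypothetical cost model (latency scaling as 1/p and resources as p with parallelism factor p); it only demonstrates the "speed up the slowest function while the resource budget R_t is respected" policy.

def greedy_schedule(functions, total_budget, misc_usage, max_parallel=64):
    """Assign a parallelism factor to every function, repeatedly doubling the
    parallelism of the slowest one while the FPGA resource budget is met."""
    parallel = {name: 1 for name in functions}

    def latency(name):
        return functions[name]["latency"] / parallel[name]

    def used(kind):
        return misc_usage[kind] + sum(functions[n][kind] * parallel[n] for n in functions)

    while True:
        slowest = max(functions, key=latency)          # bottleneck function or loop
        trial = parallel[slowest] * 2                  # try to double its parallelism
        if trial > max_parallel:
            break
        parallel[slowest] = trial
        if any(used(k) > total_budget[k] for k in total_budget):
            parallel[slowest] //= 2                    # violates the R_t constraint: roll back
            break
    return parallel

# toy usage with made-up per-function costs at parallelism 1
funcs = {"sparse_mm": dict(latency=8.0, dsp=300, ff=9000, lut=12000),
         "attention": dict(latency=2.0, dsp=70, ff=4000, lut=6000)}
budget = dict(dsp=6840, ff=2364500, lut=1882200)
misc = dict(dsp=200, ff=50000, lut=80000)
print(greedy_schedule(funcs, budget, misc))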
2) Resource Scheduling: We apply the hardware resource scheduling concept from Section IV-D to increase the hardware throughput. The hardware scheduling results for each operation (e.g., matrix multiplication (MM), dot product attention, add normalization) in the encoder and decoder are shown in Table I.

TABLE I: Resource scheduling for the Transformer on FPGA (prune ratio = 0.9, block size = 16*16, batch size = 32)

                                  DSP      FF          LUT         Latency
  Total hardware resources        6,840    2,364.5k    1,882.2k    N/A
  Encoder
    Sparse MM accelerator 1       331      150.4k      150.8k      1.152 ms
    Dot product attention × 4     292      59.6k       101.7k      0.554 ms
    Sparse MM accelerator 2       168      23.3k       31.2k       0.704 ms
    Add normalization 1           62       17.9k       18.3k       0.321 ms
    Sparse MM accelerator 3       172      28.0k       26.1k       0.393 ms
    Add normalization 2           62       17.9k       18.3k       0.321 ms
    Resources for 1 encoder       1,025    279.3k      389.3k      3.446 ms
    Percentage of total           15.0%    11.8%       20.7%       N/A
  Decoder
    Sparse MM accelerator 4       1,318    98.8k       83.3k       3.456 ms
    Resources for 1 decoder       1,318    98.8k       83.3k       3.456 ms
    Percentage of total           19.3%    4.2%        4.4%        N/A

We observe that, through the developed resource scheduling technique, we achieve high resource utilization, i.e., 49.3%, 27.8%, and 45.8% for DSP, FF, and LUT, respectively. The resultant latency is 3.446 ms and 3.456 ms for a single encoder and decoder, satisfying the real-time requirements of various NLP tasks on resource-constrained devices.

3) Cross Platform Comparison: After the resource scheduling is done for each encoder/decoder layer, the Transformer model can be assembled from those layers. The final hardware latency of the Transformer model implemented on FPGA is 10.35 ms for batch size = 32 and block size = 16*16. The comparison of the Transformer model inference speed across the different platforms is shown in Table II.

TABLE II: Comparison between CPU, GPU, and FPGA (prune ratio = 0.9, block size = 16*16, batch size = 32)

                                   Hardware latency (ms)   Throughput (FPS)
  Intel i5-5257U (2.7 GHz) CPU     113.40                  282.2
  Nvidia Jetson TX2 GPU            21.54                   1485.6
  Xilinx Alveo U200 FPGA board     10.35                   3091.8

As shown in the table, we achieve a 10.96× speedup on FPGA compared to the CPU and a 2.08× speedup compared to the GPU. The overall pruning technique and hardware design concepts enable efficient Transformer neural network acceleration on the FPGA platform.

VI. CONCLUSION

This paper introduces an efficient Transformer acceleration framework for FPGA applications. A column balanced block-wise pruning method is proposed, which achieves low accuracy decay under a high pruning ratio. A specialized processing element for sparse matrix multiplication is designed to enable both inter-block and intra-block hardware parallelism, and it can be applied to both GPU and FPGA devices. Then, the training process, as well as the accuracy versus pruning ratio relationship, is demonstrated. Finally, the hardware resource scheduling is done for the Transformer model, and the inference latency is compared across different hardware platforms. The overall framework achieves a reduced NLP model size with only a small accuracy drop, and the FPGA implementation achieves a 10.96× speedup compared to the CPU platform and a 2.08× speedup compared to the GPU platform.

REFERENCES

[1] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches,” arXiv preprint arXiv:1409.1259, 2014.
[2] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[3] A. X. M. Chang, B. Martini, and E. Culurciello, “Recurrent neural networks hardware implementation on fpga,” arXiv preprint arXiv:1511.05552, 2015.
[4] E. Nurvitadhi, J. Sim, D. Sheffield, A. Mishra, S. Krishnan, and D. Marr, “Accelerating recurrent neural networks in analytics servers: Comparison of fpga, cpu, gpu, and asic,” in 2016 26th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2016, pp. 1–4.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
[6] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
[7] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
[8] X. Ma, F.-M. Guo, W. Niu, X. Lin, J. Tang, K. Ma, B. Ren, and Y. Wang, “Pconv: The missing but desirable sparsity in dnn weight pruning for real-time execution on mobile devices,” in AAAI, 2020, pp. 5117–5124.
[9] Z. Wang, J. Wohlwend, and T. Lei, “Structured pruning of large language models,” arXiv preprint arXiv:1910.04732, 2019.
[10] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “Fpga-based accelerator for long short-term memory recurrent neural networks,” in Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific. IEEE, 2017, pp. 629–634.
[11] C. Guo, B. Y. Hsueh, J. Leng, Y. Qiu, Y. Guan, Z. Wang, X. Jia, X. Li, M. Guo, and Y. Zhu, “Accelerating sparse dnn models without hardware-support via tile-wise sparsity,” arXiv preprint arXiv:2008.13006, 2020.
[12] S. Zheng, H. Lin, S. Zha, and M. Li, “Accelerated large batch optimization of bert pretraining in 54 minutes,” arXiv preprint arXiv:2006.13484, 2020.
[13] J. Xin, R. Tang, J. Lee, Y. Yu, and J. Lin, “Deebert: Dynamic early exiting for accelerating bert inference,” arXiv preprint arXiv:2004.12993, 2020.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[16] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
[17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[18] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” arXiv preprint arXiv:2005.12872, 2020.
[19] Z. Yao, S. Cao, W. Xiao, C. Zhang, and L. Nie, “Balanced sparsity for efficient dnn inference on gpu,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 5676–5683.

[20] S. Cao, C. Zhang, Z. Yao, W. Xiao, L. Nie, D. Zhan, Y. Liu, M. Wu,
and L. Zhang, “Efficient and effective sparse lstm on fpga with bank-
balanced sparsity,” in Proceedings of the 2019 ACM/SIGDA Interna-
tional Symposium on Field-Programmable Gate Arrays, 2019, pp. 63–
72.
[21] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao,
Y. Wang et al., “Ese: Efficient speech recognition engine with sparse
lstm on fpga,” in Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, 2017, pp. 75–84.
[22] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, H. Liu,
and C. Ding, “Ftrans: energy-efficient acceleration of transformers using
fpga,” in Proceedings of the ACM/IEEE International Symposium on
Low Power Electronics and Design, 2020, pp. 175–180.
[23] J. de Fine Licht, S. Meierhans, and T. Hoefler, “Transformations of high-
level synthesis codes for high-performance computing,” arXiv preprint
arXiv:1805.08288, 2018.
[24] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, “C-
lstm: Enabling efficient lstm using structured compression techniques
on fpgas,” in Proceedings of the 2018 ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays, 2018, pp. 11–20.
[25] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel
mixture models,” in 5th International Conference on Learning Representations (ICLR), 2017.
[26] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang,
“A systematic dnn weight pruning framework using alternating direction
method of multipliers,” in Proceedings of the European Conference on
Computer Vision (ECCV), 2018, pp. 184–199.

