BRDS: An FPGA-based LSTM Accelerator With Row-Balanced Dual-Ratio Sparsification
Abstract— In this paper, first, a hardware-friendly pruning algorithm for reducing energy consumption and improving the speed
of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for efficient
execution of the pruned networks based on the proposed algorithm is introduced. By considering the sensitivity of two weight
matrices of the LSTM models in pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To
reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low power and high speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed on benchmarks for natural language processing, binary sentiment classification, and speech recognition. Results show that, compared to a recently published work in this field, the proposed accelerator provides up to 272% higher effective GOPS/W, and the perplexity error is reduced by up to 1.4% for the PTB dataset.
Index Terms— LSTM neural network, Pruning, FPGA, Energy efficiency, Accuracy.
1 INTRODUCTION
hardware implementation of sparse matrix-vector multiplication (SpMxV) while creating regular memory accesses for performing the operation. Next, we describe the BRDS accelerator, which is an FPGA-based, row-balanced, dual-ratio sparsity-aware, low-power, and high-performance architecture for LSTM networks. The accelerator takes full advantage of the efficiency of the proposed pruning algorithm. In this accelerator, to minimize the overhead of storing the positions of non-zero values in the rows of sparse matrices, a relative addressing method is exploited [22]. The contributions of this paper are given below:
• Devising a row-balanced dual-ratio sparsity algorithm for improving the accuracy of the LSTM networks.
• … PEs for the forward and recurrent weights.

[Figure: the internal structure of an LSTM cell — inputs 𝑥𝑡 and ℎ𝑡−1, gates 𝑓𝑡, 𝑖𝑡, 𝑔𝑡, and 𝑜𝑡 with weights and biases (𝑊𝑓, 𝑏𝑓), (𝑊𝑖, 𝑏𝑖), (𝑊𝑔, 𝑏𝑔), and (𝑊𝑜, 𝑏𝑜), cell states 𝑐𝑡−1 and 𝑐𝑡, vector concatenation, pointwise multiplication, pointwise summation, tanh, and sigmoid.]

The proposed BRDS hardware architecture, which is an extension of the POLAR accelerator in [7] designed for the inference phase of dense networks, has the ability to support sparse LSTM networks. By utilizing parallel modules and an addressing technique, sparse operations are performed efficiently, and a higher efficacy is achieved for BRDS compared to POLAR.

3 ROW-BALANCED DUAL-RATIO SPARSITY

[Figure: weight sparsity patterns — (a) original dense matrix; (b) unstructured sparse matrix by global pruning; (c) block sparse matrix by pruning 2×2 blocks according to the block average; (d) bank-balanced sparse matrix by local pruning inside each 1×4 bank.]
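The row-balanced, dual-ratio idea can be illustrated with a small software sketch. The following Python/NumPy code is our own illustration (the function names, and the per-row largest-magnitude selection criterion, are assumptions based on the balanced patterns described above), not the authors' implementation:

```python
import numpy as np

def prune_row_balanced(W, sparsity):
    """Zero all but the largest-magnitude elements of each row, so every
    row keeps the same number of nonzeros (row-balanced pattern).
    Illustrative sketch only, not the authors' implementation."""
    n_keep = max(1, int(round(W.shape[1] * (1.0 - sparsity))))
    Wp = np.zeros_like(W)
    for r, row in enumerate(W):
        keep = np.argsort(np.abs(row))[-n_keep:]  # columns of the largest |w|
        Wp[r, keep] = row[keep]
    return Wp

def prune_dual_ratio(Wx, Wh, sx, sh):
    """Apply two different sparsity ratios to the forward (Wx) and
    recurrent (Wh) weight matrices, reflecting their different
    sensitivities to pruning."""
    return prune_row_balanced(Wx, sx), prune_row_balanced(Wh, sh)

rng = np.random.default_rng(0)
Wx, Wh = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
Wx_p, Wh_p = prune_dual_ratio(Wx, Wh, sx=0.75, sh=0.50)
print((Wx_p != 0).sum(axis=1))  # every row of Wx_p keeps 8*(1-0.75) = 2 nonzeros
print((Wh_p != 0).sum(axis=1))  # every row of Wh_p keeps 8*(1-0.50) = 4 nonzeros
```

Because every row retains the same number of nonzeros, each processing element receives an equal workload, which is the load-balancing property the row-wise pattern provides.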
Fig. 6. The internal structure of the proposed BRDS LSTM accelerator.

with different sizes, named Small and Large. By using a multiplexer in the MA Selector, the accelerator can choose, for each set of weights, which MA to be employed. The number of elements in each row of 𝑊ℎ and 𝑊𝑥, which are 𝐻 and 𝑋, respectively, become 𝐻𝑆𝑃 and 𝑋𝑆𝑃 after running the BRDS algorithm. The weight matrix with the larger (smaller) number of elements in each row (i.e., 𝐻𝑆𝑃 or 𝑋𝑆𝑃) utilizes the Large (Small) MA. The two MAs working together conduct 𝑅 signed multiplications in parallel. The Large (Small) MA includes a Mult Array component which conducts 𝑅𝐿 (𝑅𝑆) parallel 𝑛-bit multiply operations in a single cycle (𝑅 = 𝑅𝐿 + 𝑅𝑆). It should be noted that the parameters 𝑅𝐿 and 𝑅𝑆 show the level of parallelization for each weight matrix (i.e., 𝑊ℎ and 𝑊𝑥). Here, the Large (Small) MA processes the weights with the larger (smaller) number of elements in each row. To fully utilize the MAs, the BRDS accelerator sets 𝑅𝐿 and 𝑅𝑆 such that 𝑅𝑆/𝑅𝐿 is equal to (min{𝑋𝑆𝑃, 𝐻𝑆𝑃})/(max{𝑋𝑆𝑃, 𝐻𝑆𝑃}). In this way, the ratio of the number of non-zero elements in the small and large matrices equals the ratio of the number of multipliers in the small and large MAs. Therefore, the number of clock cycles needed for processing each weight matrix is the same, and there is no time in which one MA is not utilized.

It is worth mentioning that the parameter 𝑅, which shows the number of parallel multiplication operations for every row, determines the latency and the resource usage. Since, in the BRDS hardware, the input rows are pruned, to reach a higher level of parallelism, we propose a new parallelization factor, called 𝑄, showing the number of rows whose corresponding calculations can be performed in parallel. Therefore, the parameter 𝑄 shows the number of modules Gate (and, likewise, Buffer and Function) working in parallel. The outputs of the Large and Small MAs are concatenated and passed onto the Add Array component in the module Gate. The Add Array component utilizes a tree of 𝑛-bit adders, which gives the summation of its 𝑅 input operands. To perform the additions and multiplications, we use DSP blocks in the FPGA. Due to the architecture of the DSP blocks, we are able to perform some functions together. In the Tree Adder component, we use three-input adders, which makes it possible to minimize resource utilization. The internal structure of a DSP block of the Xilinx FPGAs (DSP48E) is shown in Fig. 7. The specified path is utilized for realizing the three-input adders. The current output of this component is added to its previous one in an Accumulate component that is implemented by a DSP block. The output of the Accumulate component is passed to an Add unit to take the biases into account. All these components as one unit perform the computation of the MxV.

The proposed accelerator truncates the output of each add and multiply unit to 𝑛 bits. To alleviate the impact of overflow in the result, we utilize the Recovery units suggested in [7] after each Add and Multiply unit. The module Function (see Fig. 6) performs the pointwise operations (i.e., 𝑠𝑖𝑔, 𝑡𝑎𝑛ℎ, and (2)). This module generates the output ℎ and the cell state 𝑐, which are written to their corresponding space in the module Embedded Memory. The operations of this module are overlapped with those of the module Gate, where this overlap is provided by the module Buffer. The proposed accelerator utilizes piecewise linear approximation of the activation functions (e.g., 𝑠𝑖𝑔𝑚𝑜𝑖𝑑 (𝜎) and 𝑡𝑎𝑛ℎ) to balance speed, accuracy, area, and power consumption. For each piece, two 𝑛-bit coefficients 𝑎 and 𝑏 are obtained and stored in LUTs. Hence, the word size of the LUTs is 2𝑛 bits. In the activation function component, by employing the add and multiply operations in one DSP block, the output (𝑎 × 𝑥 + 𝑏) of the activation functions is determined. The operations of (2) are done by deploying a multiply unit and an add unit in the module Function. Because the output of the multiply unit should be passed to the module Embedded Memory, these units are implemented separately by DSP blocks.

To transfer the data from the module Gate to the module Function, and also to feed back the result of the module Function to itself, the module Buffer (see Fig. 6) is deployed and used with the same approach as that of the POLAR architecture [7]. The module Embedded Memory (see Fig. 6) stores the weights, biases, inputs, relative addresses of the inputs, cells, and outputs in the embedded memory banks of the FPGA. To store the weights, two memory arrays denoted by MWX and MWH are employed. Only the nonzero elements of the matrices 𝑊𝑓𝑥, 𝑊𝑖𝑥, 𝑊𝑔𝑥, and 𝑊𝑜𝑥 are stored in MWX. We use a relative row index and a cumulative pointer to store sparse matrices. The relative row index for each element shows the number of zero elements before it. Each 𝑅𝑋 (𝑅𝐻) nonzero elements of the 𝑊𝑥 (𝑊ℎ) weights are stored in four consecutive rows. Hence, the size of MWX is 4𝐻 × 𝑋𝑆𝑃 × 𝑛 bits, where the width of each row of this memory is 𝑅𝑋 × 𝑛. Similarly, the elements of the matrices 𝑊𝑓ℎ, 𝑊𝑖ℎ, 𝑊𝑔ℎ, and 𝑊𝑜ℎ are stored in MWH. The size of MWH is 4𝐻 × 𝐻𝑆𝑃 × 𝑛 bits, where the width of each row of this memory is 𝑅𝐻 × 𝑛.

The memory array 𝑀𝐵 of the module Embedded Memory is used to store the biases. The size of 𝑀𝐵 is 4𝐻 × 𝑛 bits with a word width of 𝑛 bits. The 𝑖th rows of the biases 𝑏𝑓, 𝑏𝑖, 𝑏𝑔, and 𝑏𝑜 are stored in four consecutive rows of the memory 𝑀𝐵. The memory array 𝑀𝑋 stores 𝑁 time steps of the input dataset, where the size of the input dataset in each time step is 𝑋 × 𝑛 bits. To gain more throughput for the inputs, we use duplicate memories in parallel. Because of the utilization of the dual-port RAMs, we need 𝑅𝑋/2 BRAMs (𝑀𝑋) for storing the inputs. In this work, for simplicity, we consider a single time step (𝑁 = 1). The memory array 𝑀𝐻 stores the outputs of the current and previous time steps (ℎ𝑡 and ℎ𝑡−1) with the size of 𝐻 × 𝑛 bits. Similar to the input memory array, duplicated memories for the outputs were utilized (𝑅𝐻/2 BRAMs for 𝑀𝐻).
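The relative addressing described above (each stored nonzero carries the number of zeros since the previous nonzero element) can be sketched in software as follows. The function names are our own and the bit-level packing into MAdX/MAdH is not modeled:

```python
def encode_relative(row):
    """Encode one pruned row as (values, relative indices): for each stored
    nonzero, the relative index counts the zeros since the previous
    nonzero element. Hypothetical sketch of the relative addressing."""
    values, rel_idx, zeros_seen = [], [], 0
    for w in row:
        if w == 0:
            zeros_seen += 1
        else:
            values.append(w)
            rel_idx.append(zeros_seen)
            zeros_seen = 0
    return values, rel_idx

def decode_relative(values, rel_idx, row_len):
    """Rebuild the dense row from the relative encoding."""
    row, pos = [0] * row_len, -1
    for w, r in zip(values, rel_idx):
        pos += r + 1  # skip r zeros, then land on the nonzero position
        row[pos] = w
    return row

row = [0.3, 0, 0, 0.6, 0, -0.5, 0, 0]
vals, rel = encode_relative(row)
print(vals, rel)  # [0.3, 0.6, -0.5] [0, 2, 1]
assert decode_relative(vals, rel, len(row)) == row
```

Storing small relative offsets instead of absolute column indices is what keeps the per-element position overhead low; with a fixed number of nonzeros per row (the row-balanced pattern), each row also occupies a fixed-width memory word.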
Fig. 8. Pattern of storing 𝑊ℎ and its relative addresses in MWH and MAdH. The elements of each row are distinguished by the square, circle, triangle, and cross shapes.

5 RESULTS AND DISCUSSION
In this section, the accuracy of the proposed pruning algorithm is evaluated by applying it to some LSTM networks. Also, the design parameters of the proposed accelerator, as well as its efficacy compared to several prior works, are presented.
proposed pruning algorithm is reduced in the case of this dataset compared to the other considered datasets due to its small size.

[Figure: perplexity, PER (%), and error (%) versus sparsity ratio (0% to 90%); panels (a) and (b).]

5.2 Efficiency of BRDS Accelerator
To evaluate the efficiency of the architecture of the proposed accelerator, it was implemented on an FPGA for executing the TIMIT dataset with the same configurations provided in [4], [9], [14]. The design parameters of the BRDS are compared with those of four state-of-the-art works, including the ones proposed in [4], [9], [14], [16]. The focus of all of these prior works was on implementing LSTM networks on FPGAs using weight pruning and compression. The work in [16] was evaluated on a GRU, which is simpler than an LSTM in terms of computational complexity. Thus, the reported design parameters of [16] should be regarded as optimistic values when compared to those of the LSTM accelerators. That work used the delta network algorithm to reduce MxV operations by skipping unimportant cell activation changes (the changes below a threshold).
TABLE 2
COMPARISON OF THE DESIGN PARAMETERS OF DIFFERENT STATE-OF-THE-ART LSTM ACCELERATORS.