
BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification

Seyed Abolfazl Ghasemzadeh, Erfan Bank Tavakoli, Mehdi Kamal, Ali Afzali-Kusha, and Massoud Pedram

• S. A. Ghasemzadeh, M. Kamal, and A. Afzali-Kusha are with the School of Electrical and Computer Engineering, University of Tehran, Tehran 14399-57131, Iran. E-mail: [email protected]; [email protected]; [email protected].
• E. Bank Tavakoli is with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281, USA. E-mail: [email protected].
• M. Pedram is with the Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089, USA. E-mail: [email protected].

Abstract— In this paper, first, a hardware-friendly pruning algorithm for reducing the energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented. Next, an FPGA-based platform for the efficient execution of networks pruned with the proposed algorithm is introduced. By considering the different sensitivities of the two weight matrices of the LSTM model to pruning, different sparsity ratios (i.e., dual-ratio sparsity) are applied to these weight matrices. To reduce memory accesses, a row-wise sparsity pattern is adopted. The proposed hardware architecture makes use of computation overlapping and pipelining to achieve low power consumption and high speed. The effectiveness of the proposed pruning algorithm and accelerator is assessed on benchmarks for natural language processing, binary sentiment classification, and speech recognition. The results show that, compared to a recently published work in this field, the proposed accelerator provides up to 272% higher effective GOPS/W, while the perplexity error for the PTB dataset is reduced by up to 1.4%.

Index Terms— LSTM neural network, Pruning, FPGA, Energy efficiency, Accuracy.

—————————— ◆ ——————————

1 INTRODUCTION

For applications that require processing time-dependent sequences of data, such as speech recognition [1] and natural language processing (NLP) [2], Recurrent Neural Networks (RNNs) and, more specifically, Long Short-Term Memory (LSTM) neural networks [3] have been introduced. These networks, which create high computational loads, have to cope with resource limitations when they are implemented on hardware platforms such as FPGAs [4]. The limited number of computational resources (e.g., the number and type of available DSP blocks) and memory resources (e.g., the size and access speed of embedded block memories) in FPGAs makes the implementation of LSTM networks challenging. To overcome this problem, several works that invoke resource sharing along with timing optimization have been suggested in the literature. Examples of these research efforts include balancing the memory bandwidth and the internal storage utilization [5], optimizing computational performance and communication requirements [6], overlapping LSTM computations with memory accesses [4], and overlapping the internal computations of the LSTM architecture [7].

The weight pruning technique reduces storage and computational costs by eliminating redundant elements in the weight matrices. Reducing the model size, however, does not necessarily lead to a more efficient hardware implementation. This is because fetching unstructured pruned data may require high memory bandwidth due to the random accesses to the memory. This implies that unstructured pruning may limit the energy and performance gains that are achievable by model reduction and pruning [8]. Accordingly, when reducing the model size, the designer should try to minimize the accuracy loss due to the pruning while providing a sparsity pattern compatible with the hardware architecture.

The lower improvement achieved by unstructured pruning is due to the fact that, while the weight matrices are sparse, the input vector is dense, causing a rather low improvement in matrix-vector multiplication (MxV). Since some elements in a row of the sparse matrix are zero, their multiplications by the corresponding elements of the dense vector are zero and thus do not contribute to the final sum. To avoid these useless computations, we need to access only the nonzero elements, which requires irregular (random) accesses to memory and translates to inefficient utilization of the memory bandwidth and processing elements (PEs) of the NN accelerator. The reason is that the irregularity in the positions of non-zero elements in the weight matrices makes variation in the number of PEs needed for each row of the matrix inevitable, reducing the efficiency of using the processing elements.
In this work, first, a Balanced Row Dual-ratio Sparsity-inducing pruning algorithm (called BRDS) is presented. In this algorithm, the input and recurrent weight matrices of the LSTM are pruned with different sparsity ratios, resulting in a lower accuracy loss while providing the opportunity for more weight pruning for a given target accuracy level. These two sets of matrices have different sensitivities to the pruning owing to their different contributions to the final results of the LSTM model. To lower the required memory bandwidth, the pruning is performed in a row-wise manner. Moreover, since the number of non-zero elements in each row of the sparse matrices is known at design time, the number of PEs required to process each row can be determined. Both of these features provide an efficient hardware implementation of sparse matrix-vector multiplication (SpMxV) while creating regular memory accesses for performing the operation. Next, we describe the BRDS accelerator, which is an FPGA-based, row-balanced, dual-ratio sparsity-aware, low-power, and high-performance architecture for LSTM networks. The accelerator takes full advantage of the efficiency of the proposed pruning algorithm. In this accelerator, to minimize the overhead of storing the positions of non-zero values in the rows of the sparse matrices, a relative addressing method is exploited [22]. The contributions of this paper are given below:
• Devising a row-balanced dual-ratio sparsity algorithm for improving the accuracy of LSTM models while considering the hardware implementation (the BRDS algorithm).
• Presenting a low-energy yet high-speed FPGA-based hardware accelerator based on the above pruning algorithm for facilitating the implementation of BRDS-based sparse models.

The remainder of the paper is organized as follows. Section 2 provides the basic concepts of the LSTM as well as a review of prior work on FPGA-based LSTM architectures. The proposed row-balanced dual-ratio sparsity algorithm is presented in Section 3. Section 4 provides the details of the proposed hardware accelerator. In Section 5, the efficacies of the proposed algorithm and accelerator are evaluated, and finally, the paper is concluded in Section 6.

2 LSTM BASIC CONCEPTS AND RELATED WORK

In this section, first, the internal structure of the considered LSTM network layer is described, and then prior work dealing with the LSTM implementation on FPGA platforms and several LSTM pruning and compression algorithms are briefly reviewed.

2.1 Considered LSTM Network
In this work, we use the LSTM network of [7], which has a simple structure with an acceptable output accuracy. A layer of this network consists of the cells (to store prior information) and gates (i.e., 𝑓𝑡, 𝑖𝑡, 𝑔𝑡, and 𝑜𝑡) to control whether to remember or forget prior information (i.e., 𝑐𝑡−1), inputs (i.e., 𝑥𝑡), and outputs of the previous time step (i.e., ℎ𝑡−1). Based on this, an LSTM layer may be described as [7]:

$$f_t = \mathrm{sig}(W_{fx} x_t + W_{fh} h_{t-1} + b_f)$$
$$i_t = \mathrm{sig}(W_{ix} x_t + W_{ih} h_{t-1} + b_i)$$
$$g_t = \tanh(W_{gx} x_t + W_{gh} h_{t-1} + b_g) \qquad (1)$$
$$o_t = \mathrm{sig}(W_{ox} x_t + W_{oh} h_{t-1} + b_o)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t) \qquad (2)$$

where 𝑡𝑎𝑛ℎ and 𝑠𝑖𝑔 (i.e., Sigmoid) are the activation functions and ⨀ denotes element-wise (pointwise) multiplication. Also, 𝑊 and 𝑏 denote the weight matrices and bias vectors, respectively. Gates 𝑓, 𝑖, 𝑔, and 𝑜 correspond to the forget, input, candidate cell, and output gates, respectively. Weight matrices (i.e., 𝑊𝑥 and 𝑊ℎ) and a bias vector are determined for each gate (e.g., the weight matrices and bias vector for gate 𝑓 are denoted as 𝑊𝑓𝑥, 𝑊𝑓ℎ, and 𝑏𝑓, respectively). In addition, 𝑡 denotes the current time step. The size of the vector 𝑥 is 𝑋, while the sizes of the vectors ℎ, 𝑏, and 𝑐 are 𝐻. Similarly, the sizes of the matrices 𝑊𝑥 and 𝑊ℎ are 𝐻 × 𝑋 and 𝐻 × 𝐻, respectively. The activation functions perform element-wise computations on their input vectors. The internal structure of the considered LSTM layer is shown in Fig. 1.

Fig. 1. The internal structure of the considered LSTM layer.
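For readers who prefer code to notation, the following is a minimal NumPy sketch of one LSTM time step following Eqs. (1) and (2). It is only an illustration of the equations above (floating point, random weights); it does not model the fixed-point datapath of the accelerator described later.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step following Eqs. (1) and (2).

    Wx[g] is the H x X input weight matrix, Wh[g] the H x H recurrent
    weight matrix, and b[g] the H-element bias vector of gate g.
    """
    f = sigmoid(Wx['f'] @ x_t + Wh['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(Wx['i'] @ x_t + Wh['i'] @ h_prev + b['i'])   # input gate
    g = np.tanh(Wx['g'] @ x_t + Wh['g'] @ h_prev + b['g'])   # candidate cell
    o = sigmoid(Wx['o'] @ x_t + Wh['o'] @ h_prev + b['o'])   # output gate
    c_t = f * c_prev + i * g              # Eq. (2): new cell state
    h_t = o * np.tanh(c_t)                # Eq. (2): new output
    return h_t, c_t

# tiny example with X = 4 inputs and H = 3 hidden units
rng = np.random.default_rng(0)
X, H = 4, 3
Wx = {g: rng.standard_normal((H, X)) for g in 'figo'}
Wh = {g: rng.standard_normal((H, H)) for g in 'figo'}
b  = {g: np.zeros(H) for g in 'figo'}
h, c = lstm_step(rng.standard_normal(X), np.zeros(H), np.zeros(H), Wx, Wh, b)
```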
2.2 Related Work
The compression of the network model could lead to speed and energy efficiency improvements in the inference phase [10]. The improvements are achieved by reducing the memory usage and bandwidth and the computational requirements of an NN-based inference. Well-known compression techniques consist of pruning [11], sparsity-inducing regularization [12], and quantization [4]. Several structured sparsity methods have been proposed in prior studies. The proposed algorithms considered constraints on the locality of non-zero weights to limit the scattering of the zero weights in the weight matrices [8], [13]. Compared to unstructured sparsity, accelerating structured sparsity using special hardware is more feasible and affordable.

The LSTM architecture proposed in [4] utilized weight compression and pruning techniques to increase speed and energy efficiency. The gains (in terms of energy efficiency and computational speed) were obtained at the cost of considerable hardware resource usage. The method proposed in [14] reduced the LSTM network size and controlled the network irregularity. It made use of block-circulant matrices [15] (i.e., arbitrary-size circulant submatrices), further applying the FFT algorithm to accelerate the compute-intensive circulant convolution operations. Variable submatrix sizes provided a tradeoff between the compression ratio and the accuracy degradation.

In [16], first, an algorithm for reducing the computations of the Gated Recurrent Unit (GRU) network was suggested. The algorithm induced sparsity in the inputs and activations, thereby lowering the computations. Next, an accelerator architecture, called DeltaRNN, which skipped the updating of an RNN when the input changes were below a certain threshold, was presented. In [9], Bank-Balanced Sparsity (BBS), which partitions each weight matrix row into banks for parallel computing, was proposed. This method adopted fine-grained pruning inside each bank to maintain the model accuracy. The architecture, which was fully parallel, had the drawback of considering the same
PEs for the forward and recurrent weights.

The proposed BRDS hardware architecture, which is an extension of the POLAR accelerator in [7] designed for the inference phase of dense networks, has the ability to support sparse LSTM networks. By utilizing parallel modules and an addressing technique, sparse operations are performed efficiently, and a higher efficacy for BRDS compared to POLAR is achieved.

3 ROW-BALANCED DUAL-RATIO SPARSITY PRUNING

3.1 Row-Balanced Pruning of Weight Matrices
Fig. 2 illustrates the original and pruned matrices with a sparsity ratio of 50%. Fine-grained pruning simply omits the smallest 50% of the weights globally, which leads to an unstructured sparse matrix (Fig. 2(b)). Block sparsity induces a block sparse matrix (Fig. 2(c)) by setting the block size to m×m (which in this example is 2×2) and using the block average as the block representative (the metric for pruning the blocks). Bank-balanced pruning induces a bank-balanced sparse matrix (Fig. 2(d)) by splitting each row into two equal-sized banks and applying fine-grained pruning inside each bank independently.

Fig. 2. Comparing different pruning methods with Row-Balanced: (a) original dense matrix, (b) unstructured sparse matrix by global pruning, (c) block sparse matrix by pruning 2×2 blocks according to the block average, (d) bank-balanced sparse matrix by local pruning inside each 1×4 bank, and (e) row-balanced sparse matrix.

In this work, we propose row-balanced sparsity, whereby the same number of elements from every row of a given weight matrix is pruned. The row-balanced sparse version of the matrix of Fig. 2(a) is shown in Fig. 2(e), where the smallest 50% of the elements in each row have been removed. The pseudo-code of the row-balanced sparsity is shown in Fig. 3. The inputs to this algorithm are the weight matrix and the expected sparsity ratio, while the output is the pruned matrix. It prunes each row separately based on the given sparsity ratio. To do this, based on the defined sparsity ratio (i.e., 𝑆𝑝𝑎𝑟%), some of the elements are pruned (in order from the smallest to the largest values).

Input:
  The matrix to be pruned, 𝑊;
  The expected sparsity, 𝑆𝑝𝑎𝑟%;
Output:
  The pruned matrix, 𝑊𝑝;
1: for each 𝑊𝑖 ∈ 𝑊.𝑟𝑜𝑤𝑠 do
2:   Prune the smallest 𝑆𝑝𝑎𝑟% of 𝑊𝑖;
3: end for
4: return the pruned matrix, 𝑊𝑝;
Fig. 3. The Row-Balanced Pruning Algorithm.
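As a software counterpart to Fig. 3, the sketch below zeroes the smallest 𝑆𝑝𝑎𝑟% of the weights in every row by magnitude, so that all rows keep the same number of non-zero elements. The NumPy form is our illustration, not the authors' implementation.

```python
import numpy as np

def row_balanced_prune(W, spar):
    """Row-balanced pruning (Fig. 3): zero out the smallest `spar` fraction
    of the weights in every row, so all rows keep the same nonzero count."""
    Wp = W.copy()
    n_prune = int(round(spar * W.shape[1]))          # elements dropped per row
    for row in Wp:
        # indices of the n_prune entries with the smallest magnitude
        drop = np.argsort(np.abs(row))[:n_prune]
        row[drop] = 0.0
    return Wp

W = np.array([[0.3, 0.1, 0.4, -0.5],
              [0.3, 0.4, 0.6,  0.1]])
print(row_balanced_prune(W, 0.5))    # each row keeps its 2 largest-magnitude weights
```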
3.2 Row-Balanced Dual-Ratio Sparsification Algorithm
As mentioned in subsection 3.1, in LSTM networks, for the two different sets of weights (i.e., 𝑊𝑥 and 𝑊ℎ), the proposed accelerator can consider two different sparsity ratios. The sparsity ratios are denoted by 𝑆𝑝𝑎𝑟ℎ and 𝑆𝑝𝑎𝑟𝑥 for 𝑊ℎ and 𝑊𝑥, respectively. By applying the aforesaid row-balanced sparsity approach, every feed-forward (recurrent) weight matrix, i.e., 𝑊𝑓𝑥, 𝑊𝑖𝑥, 𝑊𝑔𝑥, and 𝑊𝑜𝑥 (𝑊𝑓ℎ, 𝑊𝑖ℎ, 𝑊𝑔ℎ, and 𝑊𝑜ℎ), will have the same number of non-zero elements per row. Choosing different sparsity ratios for the feed-forward and recurrent weights alleviates the accuracy degradation resulting from the pruning process. This originates from the fact that the algorithm decides which weights are more important to keep. As an example, Fig. 4 shows the effect of having two different sparsity ratios on the accuracy of an LSTM model. The dataset here is PTB [17] with an input size of 1,500. For this example, an overall sparsity (𝑂𝑆) of 65% is considered. When 𝑆𝑝𝑎𝑟𝑥 and 𝑆𝑝𝑎𝑟ℎ were both set to 65%, the perplexity (the metric widely used in NLP) became large, while the best perplexity was achieved when we set 𝑆𝑝𝑎𝑟ℎ = 60% and 𝑆𝑝𝑎𝑟𝑥 = 70%. A low value of the perplexity indicates a well-trained LSTM network [20]. One may find more information about the perplexity metric in [21].

Similar to the previous pruning methods (see, e.g., [4], [9]), we apply the row-balanced pruning method iteratively to a pre-trained network and retrain the network after each pruning iteration to partially restore the model accuracy. Since there are two matrices which should be pruned simultaneously with different sparsity ratios, and the sensitivities of the output quality to the sparsity ratio are different for the two matrices, we determine the sparsity ratios considering an overall sparsity target provided by the designer. The (minimum) sparsity target is determined based on the number of weights that the designer wants to store in the on-chip memories. The goal, therefore, is to achieve the best model accuracy given a designer-specified lower bound on the sparsity factor. Since the values of 𝑆𝑝𝑎𝑟𝑥 and 𝑆𝑝𝑎𝑟ℎ cannot be directly obtained from the value of 𝑂𝑆 due to their dependency on the dataset, these values should be determined by exploring different possible combinations of them, as shown in Fig. 4.

Based on the above discussion, a heuristic algorithm for pruning the 𝑊𝑥 and 𝑊ℎ weight matrices is presented. The pseudo-code of the pruning method, which induces row-balanced dual-ratio sparsity, is shown in Fig. 5. It iteratively explores the search space to find the best sparsity ratios (i.e., the best values for 𝑆𝑝𝑎𝑟ℎ and 𝑆𝑝𝑎𝑟𝑥). In each pruning iteration, based on the considered sparsity ratios, the weights with small importance are dropped. In this algorithm, the importance of weights is represented by their internal ranking within the row, which is dictated by their absolute values. To reduce the accuracy loss due to pruning, in each iteration, the pruned network is retrained to determine the corresponding accuracy of the chosen sparsity ratios. For retraining, we freeze the weights that are set to zero (i.e., the dropped ones) and tune the other network weights.
Fig. 4. The effect of Dual-Ratio Sparsity on the perplexity of the PTB dataset (𝑂𝑆 = 65%).

In the proposed algorithm, to lower the accuracy loss due to the pruning, in the first step, we suggest increasing the pruning ratios (i.e., 𝑆𝑝𝑎𝑟ℎ and 𝑆𝑝𝑎𝑟𝑥) gradually with the same step size (𝛼) from zero to the predefined overall sparsity (𝑂𝑆). The pruned network at this point is considered as the initial point for searching. We denote it as 𝑁𝑁𝑃,𝐼. Next, to explore the search space, one of the sparsity ratios (e.g., 𝑆𝑝𝑎𝑟ℎ) is increased by a predefined step (𝛿ℎ) while the other one (e.g., 𝑆𝑝𝑎𝑟𝑥) is decreased by its predefined step (𝛿𝑥). In each iteration, the chosen tuple of the sparsity ratios is applied to the pruned network of the previous iteration. Altering the sparsity ratios is continued until one of them reaches 0 or 100%. Next, this process is repeated again (by starting from 𝑁𝑁𝑃,𝐼) considering the opposite direction for the sparsity ratios. For each chosen tuple of the sparsity ratios, the accuracy of the network is determined. At the end, the algorithm returns the best tuple. To generate the sparsity ratios with the maximum model accuracy (𝑆𝑝𝑎𝑟𝑥,𝑀𝐴 and 𝑆𝑝𝑎𝑟ℎ,𝑀𝐴), the BRDS algorithm is executed only once, whereas the pruned network is then used for many inference runs, which amortizes the cost of the retraining algorithm.

Input:
  The weights of the LSTM layer to be pruned, 𝑊𝑥 and 𝑊ℎ;
  The expected overall sparsity, 𝑂𝑆;
Output:
  The maximum model accuracy, 𝑀𝐴;
  The sparsity ratios with the maximum model accuracy, 𝑆𝑝𝑎𝑟𝑥,𝑀𝐴 and 𝑆𝑝𝑎𝑟ℎ,𝑀𝐴;
1: set 𝑆𝑝𝑎𝑟𝑥 and 𝑆𝑝𝑎𝑟ℎ to 0;
2: while 𝑆𝑝𝑎𝑟𝑥 < 𝑂𝑆 and 𝑆𝑝𝑎𝑟ℎ < 𝑂𝑆 do
3:   Increase 𝑆𝑝𝑎𝑟𝑥 and 𝑆𝑝𝑎𝑟ℎ by 𝛼;
4:   Prune 𝑊𝑥 and 𝑊ℎ;
5:   Retrain the network;
6: Save the pruned network as 𝑁𝑁𝑃,𝐼;
7: while 𝑆𝑝𝑎𝑟𝑥 < 100% and 𝑆𝑝𝑎𝑟ℎ > 0% do
8:   Increase 𝑆𝑝𝑎𝑟𝑥 by 𝛿𝑥;
9:   Decrease 𝑆𝑝𝑎𝑟ℎ by 𝛿ℎ;
10:  Prune 𝑊𝑥 and 𝑊ℎ;
11:  Retrain the network and save the model accuracy to 𝐴𝑐𝑐;
12:  if 𝐴𝑐𝑐 > 𝑀𝐴 do
13:    𝑀𝐴 = 𝐴𝑐𝑐;
14:    (𝑆𝑝𝑎𝑟𝑥,𝑀𝐴, 𝑆𝑝𝑎𝑟ℎ,𝑀𝐴) = (𝑆𝑝𝑎𝑟𝑥, 𝑆𝑝𝑎𝑟ℎ);
15: Load the pruned network 𝑁𝑁𝑃,𝐼;
16: while 𝑆𝑝𝑎𝑟𝑥 > 0% and 𝑆𝑝𝑎𝑟ℎ < 100% do
17:  Decrease 𝑆𝑝𝑎𝑟𝑥 by 𝛿𝑥;
18:  Increase 𝑆𝑝𝑎𝑟ℎ by 𝛿ℎ;
19:  Prune 𝑊𝑥 and 𝑊ℎ;
20:  Retrain the network and save the model accuracy to 𝐴𝑐𝑐;
21:  if 𝐴𝑐𝑐 > 𝑀𝐴 do
22:    𝑀𝐴 = 𝐴𝑐𝑐;
23:    (𝑆𝑝𝑎𝑟𝑥,𝑀𝐴, 𝑆𝑝𝑎𝑟ℎ,𝑀𝐴) = (𝑆𝑝𝑎𝑟𝑥, 𝑆𝑝𝑎𝑟ℎ);
24: return 𝑀𝐴, (𝑆𝑝𝑎𝑟𝑥,𝑀𝐴, 𝑆𝑝𝑎𝑟ℎ,𝑀𝐴);
Fig. 5. The BRDS Algorithm.
reaches 0 or 100%. Next, this process is repeated again (by 4 HARDWARE ARCHITECTURE
Input: As discussed before, unstructured sparsity leads to unbal-
The weights of the LSTM layer to be pruned, 𝑊𝑥 and 𝑊ℎ ; anced computations as well as irregular memory accesses.
The expected overall sparsity, 𝑂𝑆; To take advantage of the structured sparsity introduced by
Output: the BRDS algorithm, an efficient LSTM hardware accelera-
The maximum model accuracy, 𝑀𝐴; tor, called BRDS LSTM accelerator, is presented next. The
The sparsity ratios with the maximum model accuracy 𝑆𝑝𝑎𝑟𝑥,𝑀𝐴 internal structure of the accelerator, which is shown in Fig.
and 𝑆𝑝𝑎𝑟ℎ,𝑀𝐴 ; 6, is based on POLAR accelerator of [7] with some modifi-
1: set 𝑆𝑝𝑎𝑟𝑥 and 𝑆𝑝𝑎𝑟ℎ to 0; cations to support dual sparsity.
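Eqs. (3)-(6) translate directly into a few lines of Python; the parameter values in the example call are hypothetical and only illustrate the bookkeeping (if 𝑒𝑝𝑡 is given in minutes, 𝑒𝑥𝑡𝑜𝑡 comes out in minutes).

```python
def brds_runtime(OS, alpha, dx, dh, ept, nre):
    """Execution-time estimate of the BRDS search, Eqs. (3)-(6).
    ept: time per training epoch; nre: retraining epochs per pruning step."""
    ex1 = (OS / alpha) * ept * nre                       # Eq. (3), lines 1-6
    ex2 = min((100 - OS) / dx, OS / dh) * ept * nre      # Eq. (4), lines 7-14
    ex3 = min((100 - OS) / dh, OS / dx) * ept * nre      # Eq. (5), lines 15-24
    return ex1 + ex2 + ex3                               # Eq. (6)

# hypothetical numbers: OS = 65%, 5% steps, 10-minute epochs, 2 epochs per step
print(brds_runtime(OS=65, alpha=5, dx=5, dh=5, ept=10, nre=2))   # minutes
```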
4 HARDWARE ARCHITECTURE

As discussed before, unstructured sparsity leads to unbalanced computations as well as irregular memory accesses. To take advantage of the structured sparsity introduced by the BRDS algorithm, an efficient LSTM hardware accelerator, called the BRDS LSTM accelerator, is presented next. The internal structure of the accelerator, which is shown in Fig. 6, is based on the POLAR accelerator of [7] with some modifications to support dual sparsity.

Fig. 6. The internal structure of the proposed BRDS LSTM accelerator.

The BRDS accelerator consists of seven main modules, including DRAM Controller, Embedded Memory, Address Decoder, Gate, Function, Buffer, and LSTM Controller. The bit width of the datapath is 𝑛, and the data is represented in the fixed-point two's complement 𝑛-bit binary format. DRAM Controller performs the load and store instructions related to the off-chip DRAM. A load instruction may occur when data should be read from the off-chip DRAM and written onto the on-chip memories. Similarly, when the outputs are ready, DRAM Controller should read the data and store them in the off-chip DRAM. It should be mentioned that, most of the time, the weights can be fully fit into the FPGA embedded memories. In the cases where, even with pruning, we cannot fit the data in the FPGA memories, the proposed accelerator utilizes this module as the interface to the off-chip DRAM. Note that only the non-zero elements of the sparse matrices are fetched, consecutively, to efficiently utilize the off-chip memory bandwidth.
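As an aside on the datapath format, the sketch below shows one way to map real-valued weights to the 𝑛-bit two's complement fixed-point representation mentioned above. The paper fixes 𝑛 = 16 for the experiments but does not state the split between integer and fractional bits, so the 12 fractional bits here are an assumption for illustration only.

```python
import numpy as np

def to_fixed_point(x, n_bits=16, frac_bits=12):
    """Map real values to n-bit two's complement fixed-point codes.
    frac_bits = 12 is an illustrative Q-format choice, not from the paper."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (n_bits - 1)), (1 << (n_bits - 1)) - 1
    return np.clip(np.round(np.asarray(x) * scale), lo, hi).astype(np.int32)

w = np.array([0.3, -0.5, 0.74])
q = to_fixed_point(w)
print(q, q / (1 << 12))          # integer codes and the values they represent
```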
The module Gate (see Fig. 6) includes two Mult Arrays (MAs) working concurrently, an MA Selector, a Tree Adder, and an Accumulator. Since there are two different pruning ratios (𝑆𝑝𝑎𝑟ℎ and 𝑆𝑝𝑎𝑟𝑥), we consider two MAs with different sizes, named Small and Large. By using a multiplexer in the MA Selector, the accelerator can choose, for each set of weights, which MA is to be employed. The numbers of elements in each row of 𝑊ℎ and 𝑊𝑥, which are 𝐻 and 𝑋, respectively, become 𝐻𝑆𝑃 and 𝑋𝑆𝑃 after running the BRDS algorithm. The weight matrix with the larger (smaller) number of elements in each row (i.e., 𝐻𝑆𝑃 or 𝑋𝑆𝑃) utilizes the Large (Small) MA. The two MAs working together conduct 𝑅 signed multiplications in parallel. The Large (Small) MA includes a Mult Array component which conducts 𝑅𝐿 (𝑅𝑆) parallel 𝑛-bit multiply operations in a single cycle (𝑅 = 𝑅𝐿 + 𝑅𝑆). It should be noted that the parameters 𝑅𝐿 and 𝑅𝑆 show the level of parallelization for each weight matrix (i.e., 𝑊ℎ and 𝑊𝑥). Here, the Large (Small) MA processes the weights with the larger (smaller) number of elements in each row. To fully utilize the MAs, the BRDS accelerator chooses 𝑅𝐿 and 𝑅𝑆 such that 𝑅𝑆/𝑅𝐿 is equal to (min{𝑋𝑆𝑃, 𝐻𝑆𝑃})/(max{𝑋𝑆𝑃, 𝐻𝑆𝑃}). In this way, the ratio of the number of non-zero elements in the small and large matrices equals the ratio of the number of multipliers in the Small and Large MAs. Therefore, the number of clock cycles needed for processing each weight matrix is the same, and there is no time in which one MA is not utilized.
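A small sketch of this MA sizing rule is given below. The rounding used to split 𝑅 between the two arrays is our assumption, but for the TIMIT configuration reported in Section 5 (𝑋𝑆𝑃 = 20, 𝐻𝑆𝑃 = 64, 𝑅 = 336) it reproduces the published 𝑅𝐿 = 256 and 𝑅𝑆 = 80.

```python
def split_multipliers(R, x_sp, h_sp):
    """Split R parallel multipliers between the Small and Large MAs so that
    R_S / R_L = min(x_sp, h_sp) / max(x_sp, h_sp), which keeps both arrays
    busy for the same number of cycles."""
    small, large = min(x_sp, h_sp), max(x_sp, h_sp)
    r_s = round(R * small / (small + large))   # multipliers in the Small MA
    r_l = R - r_s                              # multipliers in the Large MA
    return r_l, r_s

# TIMIT configuration from Section 5: X_SP = 20, H_SP = 64, R = 336
print(split_multipliers(336, 20, 64))   # -> (256, 80), i.e., R_L = 256, R_S = 80
```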
It is worth mentioning that the parameter 𝑅, which shows the number of parallel multiplication operations for every row, determines the latency and the resource usage. Since, in the BRDS hardware, the input rows are pruned, to reach a higher level of parallelism, we propose a new parallelization factor, called 𝑄, showing the number of rows whose corresponding calculations can be performed in parallel. Therefore, the parameter 𝑄 shows the number of modules Gate (and Buffer and Function as well) working in parallel. The outputs of the Large and Small MAs are concatenated and passed onto the Add Array component in the module Gate. The Add Array component utilizes a tree of 𝑛-bit adders, which gives the summation of its 𝑅 input operands. To perform the additions and multiplications, we use the DSP blocks in the FPGA. Due to the architecture of the DSP blocks, we are able to perform some functions together. In the Tree Adder component, we use three-input adders where possible, to minimize resource utilization. The internal structure of a DSP block of the Xilinx FPGAs (DSP48E) is shown in Fig. 7; the specified path is utilized for realizing the three-input adders. The current output of this component is added to its previous one in an Accumulate component that is implemented by a DSP block. The output of the Accumulate component is passed to an Add unit to take the biases into account. All these components, as one unit, perform the computation of the MxV.

Fig. 7. The internal structure of a DSP block configured as a three-input adder.
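In software terms, the computation carried out by these units corresponds to the sparse matrix-vector product sketched below, in which every row carries the same number of stored weights, so each row needs the same number of multiply-accumulate operations. This is a behavioral analogue of the hardware, not a model of its cycle-level operation.

```python
import numpy as np

def row_balanced_spmv(values, cols, x):
    """Sparse matrix-vector product for a row-balanced matrix.
    values[r] holds the k surviving weights of row r and cols[r] their
    column positions; every row has the same k, so the per-row work
    (multiply-accumulate count) is identical across rows."""
    return np.array([np.dot(v, x[c]) for v, c in zip(values, cols)])

# 2 x 4 example pruned to k = 2 non-zeros per row
values = np.array([[0.4, -0.5], [0.6, 0.4]])
cols   = np.array([[2, 3], [2, 1]])
x      = np.array([1.0, 2.0, 3.0, 4.0])
print(row_balanced_spmv(values, cols, x))   # [0.4*3 - 0.5*4, 0.6*3 + 0.4*2]
```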
The proposed accelerator truncates the output of each add and multiply unit to 𝑛 bits. To alleviate the impact of overflow in the result, we utilize the Recovery units after each Add and Multiply unit, as suggested in [7]. The module Function (see Fig. 6) performs the operations that are pointwise (i.e., 𝑠𝑖𝑔, 𝑡𝑎𝑛ℎ, and (2)). This module generates the output ℎ and the cell state 𝑐, which are written to their corresponding space in the module Embedded Memory. The operations of this module are overlapped with those of the module Gate, where this overlap is provided by the module Buffer. The proposed accelerator utilizes a piecewise linear approximation of the activation functions (e.g., 𝑠𝑖𝑔𝑚𝑜𝑖𝑑 (𝜎) and 𝑡𝑎𝑛ℎ) to balance speed, accuracy, area, and power consumption. In each piece, two 𝑛-bit coefficients, a and b, are obtained and stored in LUTs. Hence, the word size of the LUTs is 2𝑛 bits. In the activation function component, by employing the add and multiply operations in one DSP block, the output (𝑎 × 𝑥 + 𝑏) of the activation functions is determined. The operations of (2) are done by deploying a multiply unit and an add unit in the module Function. Because the output of the multiply unit should be passed to the module Embedded Memory, these units are implemented separately by DSP blocks.
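The sketch below illustrates the piecewise linear evaluation 𝑎 × 𝑥 + 𝑏 of an activation function. The eight segments over [-8, 8) and the way the coefficients are derived are hypothetical choices for illustration; the accelerator stores its own 𝑛-bit (a, b) pairs in LUTs.

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))

# hypothetical 8-segment approximation over [-8, 8); in hardware, each
# segment's n-bit (a, b) pair would occupy one 2n-bit LUT word
edges = np.linspace(-8.0, 8.0, 9)
a = (sig(edges[1:]) - sig(edges[:-1])) / (edges[1:] - edges[:-1])   # slopes
b = sig(edges[:-1]) - a * edges[:-1]                                # intercepts

def pwl_sigmoid(x):
    """Evaluate sigmoid(x) as a*x + b, the single multiply-add that the
    paper maps onto one DSP block."""
    i = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, len(a) - 1)
    return a[i] * x + b[i]

x = np.array([-3.0, 0.0, 2.5])
print(pwl_sigmoid(x))   # coarse approximation of sig(x)
print(sig(x))
```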
To transfer the data from the module Gate to the module Function, and also to feed the result of the module Function back to itself, the module Buffer (see Fig. 6) is deployed and used with the same approach as that of the POLAR architecture [7]. The module Embedded Memory (see Fig. 6) stores the weights, biases, inputs, relative addresses of the inputs, cells, and outputs in the embedded memory banks of the FPGA. To store the weights, two memory arrays, denoted by MWX and MWH, are employed. Only the nonzero elements of the matrices 𝑊𝑓𝑥, 𝑊𝑖𝑥, 𝑊𝑔𝑥, and 𝑊𝑜𝑥 are stored in MWX. We use a relative row index and a cumulative pointer to store the sparse matrices. The relative row index for each element shows the number of zero elements before it. Each 𝑅𝑋 (𝑅𝐻) nonzero elements of the 𝑊𝑥 (𝑊ℎ) weights are stored in four consecutive rows. Hence, the size of MWX is 4𝐻 × 𝑋𝑆𝑃 × 𝑛 bits, where the width of each row of this memory is 𝑅𝑋 × 𝑛. Similarly, the elements of the matrices 𝑊𝑓ℎ, 𝑊𝑖ℎ, 𝑊𝑔ℎ, and 𝑊𝑜ℎ are stored in MWH. The size of MWH is 4𝐻 × 𝐻𝑆𝑃 × 𝑛 bits, where the width of each row of this memory is 𝑅𝐻 × 𝑛.

The memory array 𝑀𝐵 of the module Embedded Memory is used to store the biases. The size of 𝑀𝐵 is 4𝐻 × 𝑛 bits, with a word width of 𝑛 bits. The ith rows of the biases 𝑏𝑓, 𝑏𝑖, 𝑏𝑔, and 𝑏𝑜 are stored in four consecutive rows of the memory 𝑀𝐵. The memory array 𝑀𝑋 stores 𝑁 time steps of the input dataset, where the size of the input data, in each time step, is 𝑋 × 𝑛 bits. To gain more throughput for the inputs, we use duplicate memories in parallel. Because of the utilization of the dual-port RAMs, we need 𝑅𝑋/2 BRAMs (𝑀𝑋) for storing the inputs. In this work, for simplicity, we consider a single time step (𝑁 = 1). The memory array 𝑀𝐻 stores the outputs of the current and previous time steps (ℎ𝑡 and ℎ𝑡−1) with a size of 𝐻 × 𝑛 bits. Similar to the input memory array, duplicated memories for the outputs are utilized (𝑅𝐻/2 BRAMs for 𝑀𝐻). At the beginning of each time step, this memory contains ℎ𝑡−1. After generating each element of ℎ𝑡, this element is stored in its corresponding row in all replicated memories of 𝑀𝐻. By using duplicate memories, the proposed architecture achieves a much better performance by increasing the throughput at the cost of using a reasonable amount of embedded memory. The generated cells and the ones for the previous time step are accessible from the memory array 𝑀𝐶. The word size of this memory is 𝑛 bits, while its size is 𝐻 × 𝑛 bits. In this memory approach, before replacing the previous 𝑐𝑡 (i.e., the current 𝑐𝑡−1), the cell is fetched and then replaced with the current 𝑐𝑡.

We store the relative addresses corresponding to the memories MWX and MWH in the memories MAdX and MAdH, respectively. The module Address Decoder, which consists of small add units, decodes the relative addresses to obtain cumulative pointer addresses. The pattern of storing 𝑊ℎ in MWH and the corresponding relative addresses in MAdH is illustrated in Fig. 8. In this example, 𝐻, 𝑆𝑝𝑎𝑟ℎ, and 𝑅𝐻 (the parallelization factor of 𝑊ℎ) are considered to be 4, 50%, and 2, respectively. Storing 𝑊𝑥 and its relative addresses in the corresponding memories is performed similarly to that of 𝑊ℎ.

Fig. 8. Pattern of storing 𝑊ℎ and its relative addresses in MWH and MAdH. The elements of each row are distinguished by the square, circle, triangle, and cross shapes.
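To make the relative addressing concrete, the sketch below encodes one pruned row into its non-zero values plus relative indices (the number of zeros skipped before each stored element) and then recovers the absolute columns with a cumulative pointer, which is the role the paper assigns to the Address Decoder. The Python form is our illustration of the scheme, not the authors' implementation.

```python
import numpy as np

def encode_row(row):
    """Encode one row-balanced sparse row the way MWH/MWX and MAdH/MAdX
    store it: the non-zero values plus, for each of them, the number of
    zeros since the previous stored element (the relative row index)."""
    nz = np.flatnonzero(row)
    values = row[nz]
    prev = np.concatenate(([-1], nz[:-1]))
    rel = nz - prev - 1                      # zeros skipped before this element
    return values, rel

def decode_dot(values, rel, x):
    """Recover absolute columns with a running (cumulative) pointer, as the
    Address Decoder does, and multiply against the dense vector x."""
    cols = np.cumsum(rel + 1) - 1
    return float(np.dot(values, x[cols]))

row = np.array([0.0, 0.4, 0.0, -0.5])        # a pruned row with Spar = 50%
x   = np.array([1.0, 2.0, 3.0, 4.0])
vals, rel = encode_row(row)                   # vals = [0.4, -0.5], rel = [1, 1]
print(vals, rel, decode_dot(vals, rel, x))    # 0.4*2 - 0.5*4 = -1.2
```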
Based on the proposed datapath for the accelerator architecture, a designer may control the trade-off between the resource usage and the latency of the architecture simply by adjusting the parameters 𝑅 and 𝑄 at design time. To switch from one parallelization factor to another, one needs to change the number of mult and add units in the MAs and the Add Array components, respectively. Also, the designer should change the size of the delay units in the module Buffer. It is worth mentioning that if the parallelization factor 𝑅𝑥 (𝑅ℎ) is chosen to be greater than 𝑋𝑆𝑃 (𝐻𝑆𝑃), the designer should use the parameter 𝑄 and utilize multiple instances of the modules Gate, Buffer, and Function.

The module LSTM Controller in the proposed architecture performs the control of the complicated timing scheme of the LSTM network. This module generates the proper signals with the proper timing to meet the architecture requirements. The details of the timing of this architecture are the same as those of the POLAR architecture [7].

5 RESULTS AND DISCUSSION

In this section, the accuracy of the proposed pruning algorithm is evaluated by applying it to some LSTM networks. Also, the design parameters of the proposed accelerator, as well as its efficacy compared to several prior works, are assessed by implementing the accelerator on an FPGA.

5.1 Model Accuracy of BRDS Algorithm
To evaluate the accuracy of the BRDS pruning algorithm, the algorithm was applied to an LSTM language model on the PTB dataset [17], the IMDB Movie Reviews dataset [18], and the TIMIT dataset [19]. The PTB dataset, widely used in NLP research, includes 929K training, 73K validation, and 82K test words. The IMDB dataset has 50,000 highly polar positive and negative movie reviews for binary sentiment classification. It includes 25,000 reviews for training and 25,000 reviews for testing. The TIMIT dataset has been provided for the study of acoustic-phonetics. It includes recordings of 630 speakers of eight major dialects of American English. For the LSTM speech recognition model, we set the input size to 153 and the hidden state to 1024, which are the same as in the prior studies ([4], [9]).
In this study, the accuracy of BRDS is compared to three prior pruning approaches, including unstructured sparsity, block sparsity, and bank-balanced sparsity (BBS) [9]. For the studies of this section, 64 banks in the case of the BBS method and 4×4 blocks in the case of the block sparsity method were utilized, which are the same sizes as those used for the hardware implementation in [9]. The trade-offs between the sparsity ratio and the accuracy of the different sparsity patterns on the PTB, TIMIT, and IMDB datasets are depicted in Fig. 9. To perform experiments on the LSTM language model, we employed the large model of the PTB dataset with 1,500 inputs. Also, the perplexity metric, a widely used merit parameter in quantifying language model quality [9], was exploited as the error metric for this dataset. As the results in Fig. 9(a) show, the perplexities of the networks pruned by the BRDS algorithm are lower than those of the other pruning approaches. Also, the BRDS algorithm preserved the perplexity even until 85% of the weights were pruned. On average, the proposed algorithm led to a 0.7% lower perplexity compared to that of the BBS method for pruning ratios ranging from 0 to 90% with an interval of 5%.

The LSTM model chosen for the experiments on the TIMIT dataset was also the same as in the prior works ([4], [9]). The metric for evaluating the accuracy of the acoustic model is the PER (Phone Error Rate), which is a merit evaluation parameter for speech recognition models [9]. The results in Fig. 9(b) show that the PER of the network pruned by the BRDS algorithm is, on average, about 0.1% less than that of the BBS method. As the results in Fig. 9(c) show, our method outperforms the other algorithms in almost all the cases, particularly at larger sparsity ratios. Compared to the BBS sparsity method, the proposed pruning algorithm resulted in a 0.7% lower accuracy loss. Note that the efficacy of the proposed pruning algorithm is reduced in the case of this dataset compared to the other considered datasets due to its small size.

Fig. 9. The accuracy-sparsity tradeoff on the (a) PTB (perplexity), (b) TIMIT (PER), and (c) IMDB (error) datasets for the dense baseline, unstructured, block sparsity, BBS, and BRDS patterns.

5.2 Efficiency of BRDS Accelerator
To evaluate the efficiency of the architecture of the proposed accelerator, it was implemented on an FPGA for executing the TIMIT dataset with the same configurations provided in [4], [9], [14]. The design parameters of the BRDS are compared with those of four state-of-the-art works, including the ones proposed in [4], [9], [14], [16]. The focus of all of these prior works was on implementing LSTM networks on FPGAs using weight pruning and compression. The work in [16] was evaluated on a GRU, which is simpler than LSTMs in terms of its computational complexity. Thus, the reported design parameters of [16] should be looked at as optimistic values when compared to those of the LSTM accelerators. This work used the delta network algorithm to reduce the MxV operations and skipped unimportant cell activation changes (the changes below a threshold value) to reduce memory accesses. To perform a better comparison, we implemented the BRDS design on an FPGA (XCKU9P) of the same family as that of [4] by exploiting the Xilinx VIVADO 2018.2 tool.

In this study, the pruning ratio was set to 87.5% (the same as [9], [14]). Also, since the data bit width was considered as 16 bits in most of the prior architectures (e.g., [9], [14], [16]), without loss of generality, the same size was considered for the BRDS accelerator. For pruning the network, we applied the BRDS algorithm with an overall sparsity (𝑂𝑆) of 87.5% (the same sparsity ratio as most of the prior works) on the TIMIT dataset. The best 𝑆𝑝𝑎𝑟ℎ and 𝑆𝑝𝑎𝑟𝑥 given by the algorithm were 87.5% for both. It is worth mentioning that, because the parameters 𝑋 and 𝐻 were different in this design, having the same sparsity ratios for 𝑊ℎ and 𝑊𝑥 did not mean the same number of elements in each row for them. Thus, even with the same sparsity ratios, the numbers of elements to be pruned were different for 𝑊ℎ and 𝑊𝑥. After running the BRDS pruning algorithm, the parameters 𝑋𝑆𝑃 and 𝐻𝑆𝑃 were 20 and 64, respectively.

As mentioned in Section 4, to fully utilize the MAs in the BRDS accelerator, the parameters 𝑅𝐿 and 𝑅𝑆 were considered such that 𝑅𝑆/𝑅𝐿 is equal to (min{𝑋𝑆𝑃, 𝐻𝑆𝑃})/(max{𝑋𝑆𝑃, 𝐻𝑆𝑃}). Therefore, we considered 𝑅𝑆/𝑅𝐿 as 80/256, making the parallelization factors 𝑅 and 𝑄 equal to 336 and 4, respectively. TABLE 1 shows the resource utilization of the BRDS accelerator with the mentioned configuration on the XCKU9P Xilinx FPGA device. Also, TABLE 2 shows the frequency, sparsity ratio, accuracy degradation, GOPS, power, GOPS/W, and the effective GOPS, GOPS/W, and DSP and logic efficiency of the BRDS accelerator and those of the considered prior works. The Effective Throughput (GOPS) is defined as 𝐺𝑂𝑃𝑆/(1 − 𝑠𝑝𝑎𝑟𝑠𝑖𝑡𝑦), which takes the impact of pruning into account [9]. In this work, although the operating frequency of the BRDS architecture could be increased to 238 MHz, it was set to 200 MHz to have a fair comparison with the selected prior works. The design parameters of the prior works were borrowed from their published papers. As the results in TABLE 2 show, the throughput (GOPS) of the BRDS accelerator is higher (up to 52.7%) than those of [14], [16], while smaller (up to 52%) than those of [4], [9]. The power consumption was extracted using the Xilinx Power Estimator. The switching activity was set by the tool based on the inputs and weights for the TIMIT dataset stored in the embedded memories. The power consumption of the BRDS accelerator is much less than that of the other selected works except for [16], mostly due to the latter's smaller operating frequency. Additionally, the GOPS/W of the BRDS is higher than those of the other works except for [16], which has a slightly better energy efficiency.

TABLE 1
RESOURCE UTILIZATION OF THE BRDS ACCELERATOR IMPLEMENTED ON XCKU9P FOR THE TIMIT DATASET (𝑂𝑆 = 87.5%).

                  LUT      FF       DSP    BRAM
Available         274080   548160   2520   912
Utilization       5600     83710    1600   724
Utilization (%)   2        15.3     63.5   79.4
TABLE 2
COMPARISON OF THE DESIGN PARAMETERS OF DIFFERENT STATE-OF-THE-ART LSTM ACCELERATORS.

                                          ESE [4]    C-LSTM [14]  DeltaRNN [16]  BBS [9]          BRDS
Platform                                  XCKU060    Virtex-7     XC7Z100        Arria 10 GX1150  XCKU9P
Frequency (MHz)                           200        200          125            200              200
Sparsity (%)                              88.7       87.5         -              87.5             87.5
Quantization                              fixed-12   fixed-16     fixed-16       fixed-16         fixed-16
Accuracy Degradation                      0.30%      0.32%        -              0.25%            0.25%
Throughput (GOPS)                         282        131          192            304              200
Power (W)                                 41.0       22.0         7.3            19.1             9.0
Energy Efficiency (GOPS/W)                6.9        6.0          26.3           15.9             22.2
Effective Throughput (GOPS)               2497       1049         1198           2433             1600
Effective Energy Efficiency (GOPS/W)      60.9       47.7         163.3          127.4            177.8
Effective DSP Efficiency (GOPS/#DSP)      1.66       0.39         1.56           1.60             1.00
Effective Logic Efficiency (GOPS/#Cell)   0.008      0.002        0.005          0.008            0.286
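The effective metrics of TABLE 2 can be reproduced from the raw GOPS, sparsity, and power numbers; for instance, the BRDS column works out as in the short check below.

```python
def effective_metrics(gops, sparsity, power_w):
    """Effective GOPS = GOPS / (1 - sparsity); dividing by power gives the
    effective energy efficiency (GOPS/W) reported in TABLE 2."""
    eff_gops = gops / (1.0 - sparsity)
    return eff_gops, eff_gops / power_w

# BRDS column of TABLE 2: 200 GOPS, 87.5% sparsity, 9.0 W
print(effective_metrics(200, 0.875, 9.0))   # -> (1600.0, ~177.8)
```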
The effective GOPS (GOPS with the sparsity ratio taken into account) of the BRDS is, on average, about 43% higher than those of [14], [16], while it is, on average, about 54% lower than those of [4], [9]. Moreover, the BRDS accelerator outperforms all of the other works in terms of effective GOPS/W. The effective GOPS/W of the BRDS is, on average (up to), 2.3× (3.7×) higher than those of the other selected works. Finally, GOPS/#DSP and GOPS/#Cell are considered to normalize the effective throughput by the amount of utilized DSP blocks and logic cells (i.e., ALMs for Intel and LUTs for Xilinx devices).

6 CONCLUSION

In this paper, first, BRDS, a row-balanced dual-ratio sparsity algorithm, was presented to improve the accuracy of LSTM models considering their hardware implementation. Additionally, BRDS LSTM, an energy-efficient FPGA implementation for the inference phase of sparse LSTM networks, was proposed. Its architecture is compatible with the suggested pruning algorithm and utilizes two configurable processing elements with different sparsity ratios. It takes advantage of the different sensitivities of the two weight matrices to the pruning. Finally, the efficiency of the proposed pruning algorithm and accelerator was evaluated using selected benchmarks in the NLP, sentiment classification, and speech recognition fields. Compared to the state-of-the-art works, the proposed architecture and pruning algorithm provided, on average, a 128% improvement in effective GOPS/W and a 0.7% reduction in perplexity.

REFERENCES

[1] A. Graves, A. R. Mohamed, and G. Hinton, "Speech Recognition with Deep Recurrent Neural Networks," in Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, May 2013.
[2] H. Palangi et al., "Deep Sentence Embedding Using Long Short-Term Memory Networks: Analysis and Application to Information Retrieval," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 24, no. 4, pp. 694-707, Apr. 2016.
[3] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[4] S. Han et al., "ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA," Dec. 2016.
[5] A. Chang and E. Culurciello, "Hardware Accelerators for Recurrent Neural Networks on FPGA," in Proc. of the IEEE Int. Symp. on Circuits and Systems (ISCAS), pp. 1-4, May 2017.
[6] Y. Guan et al., "FPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks," in Proc. of the Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629-634, 2017.
[7] E. Bank-Tavakoli, S. A. Ghasemzadeh, M. Kamal, A. Afzali-Kusha, and M. Pedram, "POLAR: A Pipelined/Overlapped FPGA-Based LSTM Accelerator," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2019.
[8] S. Narang, E. Undersander, and G. F. Diamos, "Block-Sparse Recurrent Neural Networks," CoRR, 2017.
[9] S. Cao et al., "Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity," in Proc. Int. Symp. on Field-Programmable Gate Arrays (FPGA), pp. 63-72, 2019.
[10] S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," in Proc. ICLR, 2016.
[11] S. Han et al., "Learning Both Weights and Connections for Efficient Neural Networks," in Proc. NIPS, pp. 1135-1143, 2015.
[12] W. Wen et al., "Learning Structured Sparsity in Deep Neural Networks," in Proc. NIPS, pp. 2074-2082, 2016.
[13] H. Mao et al., "Exploring the Regularity of Sparse Structure in Convolutional Neural Networks," in Proc. CVPR Workshop on Tensor Methods in Computer Vision, 2017.
[14] S. Wang et al., "C-LSTM: Enabling Efficient LSTM Using Structured Compression Techniques on FPGAs," in Proc. of the 2018 ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 11-20, 2018.
[15] V. Y. Pan, Structured Matrices and Polynomials: Unified Superfast Algorithms. New York: Springer, 2001.
[16] C. Gao et al., "DeltaRNN: A Power-Efficient Recurrent Neural Network Accelerator," in Proc. of the 2018 ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, pp. 21-30, 2018.
[17] M. Marcus et al., Treebank-3 LDC99T42, CD-ROM. Philadelphia, Penn.: Linguistic Data Consortium, 1999.
[18] IMDb Datasets. https://www.imdb.com/interfaces
[19] J. S. Garofolo et al., DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM, NIST Speech Disc 1-1.1, NASA STI/Recon Technical Report N, 93.
[20] J. Park et al., "Maximizing System Performance by Balancing Computation Loads in LSTM Accelerators," in Proc. Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 7-12, March 2018.
[21] C. Huyen, "Evaluation Metrics for Language Modeling," The Gradient, 2019.
[22] S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proc. ISCA, pp. 243-254, 2016.
