Dong 2020
Abstract— This article presents a piecewise linear approximation computation (PLAC) method for all nonlinear unary functions, which is an enhanced universal and error-flattened piecewise linear (PWL) approximation approach. Compared with the previous methods, PLAC features two main parts: an optimized segmenter that seeks the minimum number of segments under the predefined software maximum absolute error (MAE), raising the segmentation performance to the highest theoretical level for logarithm, and a novel quantizer that completely simulates the hardware behavior and determines the required bit width and MAEc (MAE in circuits) for hardware implementation. In addition, the hardware architecture is improved by simplifying the indexing logic, which removes redundant hardware overhead. The ASIC implementation results reveal that the proposed PLAC can improve all metrics without any compromise. Compared with the state-of-the-art methods, when computing the logarithmic function, PLAC reduces area by 2.80%, power consumption by 3.77%, and MAEc by 1.83% with the same delay; when approximating the hyperbolic tangent function, PLAC reduces area by 6.25%, power consumption by 4.31%, and MAEc by 18.86% with the same delay; when evaluating the sigmoid function, PLAC reduces area by 16.50% and power consumption by 4.78% with the same delay and MAEc; and when calculating the softsign function, PLAC reduces area by 17.28%, power consumption by 11.34%, delay by 12.50%, and MAEc by 33.28%.

Index Terms— Error-flattened, nonlinear unary function, piecewise linear (PWL) approximation, piecewise linear approximation computation (PLAC), quantizer, segmenter, VLSI architecture.

I. INTRODUCTION

[...] the conflicts between hardware resource constraints and the requirement of low delay and acceptable accuracy in practical applications.

For instance, recurrent neural networks (RNNs), such as long short-term memory and gated recurrent unit, are widely applied in natural language processing and video processing, where real-time performance is a major concern [1]. As the activation functions in RNNs, hardware implementations of sigmoid, hyperbolic tangent, and softsign functions are investigated in [2]–[5]. In [6]–[9], the efficient implementation of the logarithmic function is studied for graphics processing units. These four widely used functions are all nonlinear unary functions.

Several approximation methods have been proposed to implement nonlinear unary functions. Iteration methods, such as the Newton iteration method [10] and the coordinate rotation digital computer (CORDIC) [11], [12], suffer from long delay due to their repeated iterative operations. Polynomial approximation was then proposed to compute more directly; it takes advantage of a series expansion of the target function, such as the Taylor series approximation and the Chebyshev polynomial approximation [13], [14]. However, polynomial approximation costs many cascaded multiplication and addition (MAC) operations, resulting in high hardware overhead and delay. To further reduce the computation complexity, the piecewise linear (PWL) method is widely applied since it only requires one MAC, leading to [...]
Authorized licensed use limited to: Carleton University. Downloaded on July 26,2020 at 23:28:40 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
[...] the maximum relative error (MRE) to obtain k and b and the latter improves computing accuracy by correcting the computation result using the error value stored in the LUT. Since the variation rates of nonlinear functions differ among input regions, the maximum error in each segment is unequal, resulting in many more segments when the maximum error is required to stay under a given upper bound. As the memory cost is proportional to the number of segments in the PWL method, this segmentation method is not good enough for hardware design.

To overcome the disadvantages of uniform PWL, nonuniform PWL is proposed in [20] and [21]. Kim et al. [20] use 15 segments, while Nam and Yoo [21] use 24 segments to compute the logarithmic function. Both of them increase the number of segments around input value 0. Although nonuniform PWL can achieve relatively high accuracy with a moderate number of segments, it still has an uncontrollable error range, which means that the design skill has a significant impact on circuit performance.

In [7], an error-flattened PWL for the logarithmic function is proposed. This method utilizes the predefined upper bound of MRE as a guide to divide the segments, guaranteeing that each segment has the same MRE, thus obtaining the minimum number of segments for approximation. However, considering hardware efficiency, maximum absolute error (MAE) is a better metric for fixed-point implementation. Liu et al. [8] proposed a method to obtain equal MAE for each segment by dividing the output range into equal subranges. The error-flattened approximation method in [8] achieves the theoretically best segmentation performance, which is proved by strict mathematics. Ha and Lee [9] improved the accuracy of [8] by dividing one linear segment into three segments with the same slope, at the cost of hardware overhead. However, the dividing method in [8] relies on the characteristic of the logarithmic function, which means that the method cannot be extended to other functions. In [22], another logarithmic converter with a novel error-aware segmentation procedure is proposed, which approximates the logarithmic function by unity-slope straight lines and maximizes the length of each segment under the MAE requirement.

To generalize the error-flattened PWL method, Sun et al. [23] decoupled the segmentation scheme from the characteristics of target functions and proposed, for the first time, a universal PWL segmenter for all transcendental functions. However, we notice that the method in [23] cannot reach the best segmentation performance, which is achieved in [8], and that its MAE is only controlled in software approximation. In addition, the hardware architecture in [23] still has redundant logic, resulting in unnecessary hardware overhead. Thus, we propose piecewise linear approximation computation (PLAC) to enhance the previous universal error-flattened PWL method.

In this article, an error-flattened segmenter that raises the segmentation performance to the highest theoretical level is proposed. Similar to the MAE-guided segmentation method in [23], the proposed segmenter keeps MAEs equal for each segment and finds the minimum number of segments under the software MAE requirement. Meanwhile, a novel bisection-seeking method is designed to speed up the procedure of segmentation. After the segmenter has determined the endpoints of each input region and obtained the coefficients k and b, an innovative quantizer quantizes the coefficients and the circuit outputs. To be more specific, we denote the MAE requirement for the segmenter as MAEsoft, the MAE requirement for the quantizer as MAEhard, and the MAE of the circuit output as MAEc. The final goal of PLAC is to make MAEc less than or equal to MAEhard. Since the quantization in hardware implementation brings accuracy loss, a quantization factor QF is designed to scale up MAEsoft. The quantizer can completely simulate the hardware behavior and choose the least data width required for the predefined MAEhard, so that no redundant resources are consumed in hardware implementation. It has to be noted that, unlike previous works, PLAC is an MAEhard-guided method, which means that hardware accuracy is controllable and the iterative modification process in hardware design is significantly reduced.

We use MATLAB to model the proposed segmenter and quantizer and Verilog HDL to model the hardware architecture. For synthesis, TSMC 65-nm technology is applied to the logarithmic function, TSMC 90-nm technology is applied to the hyperbolic tangent function and the sigmoid function, and TSMC 40-nm technology is applied to the hyperbolic tangent function and the softsign function. We have performed extensive hardware comparisons with the prior art [5], [23]–[26]. Typically, compared with the state-of-the-art PWL method [23], PLAC reduces area by 2.80% and power consumption by 3.77% in the implementation of log2(1 + x), without accuracy loss. As for tanh(x), a 6.25% area reduction is achieved, while MAEc is improved by 18.86%. For sigmoid(x), a 16.5% area improvement is achieved without accuracy loss. For these three functions, the delay of PLAC is the same as that of [23]. In addition, our implementation of softsign(x) reduces area by 17.28% and delay by 12.5% while improving MAEc by 33.28% compared with [5].

The contributions of this article are summarized as follows.
1) The proposed segmenter takes advantage of the bisection method, dramatically saving segmentation time.
2) By decoupling the endpoints and the start points of adjacent segments, our segmenter can achieve the theoretically best segmentation performance for the logarithmic function.
3) A quantizer is designed to completely simulate hardware implementation, which determines the data width of coefficients and outputs under the MAEhard requirement. Compared with the traditional design methods that only provide the segmenter, the proposed quantizer can minimize the hardware resource cost under the hardware accuracy requirement before actual VLSI implementation, saving a lot of iterative design time.
4) The hardware architecture is further simplified compared with that in the state-of-the-art work, reducing the redundant indexing logic for coefficients.

The rest of this article is organized as follows. Section II introduces state-of-the-art research on the PWL method and analyzes both advancements and disadvantages. Section III gives a detailed theory of the segmenter, segmentation performance tests, and the theory of the quantizer. The VLSI architecture design is illustrated in Section IV, which also covers the circuits’ [...]
A. Introduction to Segmenter

Sun et al. [23] first discretized the continuous input range [M, N] into discrete points

$$x = x(1:\mathrm{NUM}) = \left\{ M,\; M + \frac{1}{2^{iw}},\; M + \frac{2}{2^{iw}},\; \ldots,\; N \right\} \tag{1}$$

to imitate the actual hardware implementation features, where iw is the number of input fractional bits and $\mathrm{NUM} = (N - M)/2^{-iw} + 1$. The PWL approximation of function f(x) in subrange x(i:j), 1 ≤ i < j ≤ NUM, can be computed as

$$h(x) = k \cdot x + b \tag{2}$$

$$k = \frac{f(x(j)) - f(x(i))}{x(j) - x(i)} \tag{3}$$

$$b = f(x(i)) - k \cdot x(i). \tag{4}$$

Here, h(x) represents the approximate linear segment, while k and b represent the slope and intercept of this segment, respectively. Then, Sun et al. [23] proposed a way to minimize MAE in the input range x(i:j), which is given in (5)–(11). The approximation error is denoted as

$$\delta = f(x(i:j)) - h(x(i:j)). \tag{5}$$

MAEorigin represents the MAE of the original linear approximation, and MAEshift represents the MAE of the shifted linear segment. MAEorigin is computed by

$$\mathrm{MAE}_{\mathrm{origin}} = \max\{|\max(\delta)|,\; |\min(\delta)|\}. \tag{6}$$

According to Fig. 1, the minimization of MAE is realized by vertically shifting the abscissa x by value D, where D is calculated by

$$D = \frac{\max(\delta) + \min(\delta)}{2}. \tag{7}$$

Assuming that the shifted linear segment is $h'$, and the shifted error is $\delta'$

$$h' = k' \cdot x + b' \tag{8}$$

$$\delta' = \delta - D = f(x(i:j)) - h'(x(i:j)) \tag{9}$$

then it can be deduced that

$$k' = k, \quad b' = b + D \tag{10}$$

$$\mathrm{MAE}_{\mathrm{shift}} = \frac{\max(\delta) - \min(\delta)}{2} = |\max \delta'| = |\min \delta'|. \tag{11}$$

Fig. 1. Minimization of MAE by parallel shifting. Coordinate $\delta$–x shows the error curve of the original approximation, and coordinate $\delta'$–x shows the error curve after MAE minimization.

The procedure of the segmenter in [23] can be summarized as follows.
1) It first predefines MAEsoft for the segmenter.
2) Then, it calculates MAEshift using (11) in the range x(start:end), where start is initialized as 1 and end is initialized as NUM. The seeking order is from the last input point x(NUM) to the previous points.
3) If the calculated MAEshift is smaller than MAEsoft, then the searched range is recorded and start is updated to end, while end is reset to NUM.
4) Repeat step 3 until start is updated to NUM.

B. Introduction to Hardware Architecture

The hardware architecture given in [23] is shown in Fig. 2. It includes one MAC, an index generator, and two LUTs for coefficients k and b. Here, n denotes the number of segments, x_i are the starting points of segment i (i = 2, ..., n), and s_i (i = 1, 2, ..., n − 1) denote the sign bits derived from subtraction. The index generator is essentially a comparator that locates the segment to which the input belongs; it consists of a series of subtractions and one MUX, denoted as MUX1. The sign bits of the subtraction results are collected and concatenated as the index of MUX1, which generates the segment index. Then, the segment index is used as a select signal for MUX2 and MUX3, where MUX2 is used to index slope k and MUX3 is used to index intercept b. Finally, k and b are sent to the multiplier and adder to compute the approximate value.

C. Advantages and Disadvantages Analyses

There are three main advantages of [23], which are listed as follows.
1) The segmenter proposed in [23] decouples the segmentation method and the features of target functions by using discrete mathematics instead of continuous mathematics, thereby realizing the universal error-flattened segmentation.
2) Compared with the shifting method in [8], which shifts by a constant value according to the property of the logarithmic function, the shifting method proposed in [23] is universal to all target functions.
3) For hardware architecture, [23] utilized parallel subtractors to index the coefficients k and b, thus significantly reducing the delay of the circuit.

However, the segmenter and hardware architecture proposed in [23] are not fully optimized. Four disadvantages are listed in the following.
1) According to steps 2 and 3 of the segmenter described in Section II-A, the endpoints of the recorded segments are reused: once as the endpoint of segment i and once as the start point of segment i + 1. This wastes input points, since there are no other possible inputs between two contiguous discrete inputs.
2) Another deficiency of the previous segmentation method is the improper seeking order. Since the input ranges of target functions are relatively large, sequentially trying all points from the end to the start is not efficient. Especially for the front part of the search range, it takes a long time to find the required endpoint.
3) According to Fig. 2, three multiplexers are required in [23]: MUX1 for generating the segment index, MUX2 for indexing slope k, and MUX3 for indexing intercept b. Actually, coefficients k and b come in pairs, which means that k and b can be indexed simultaneously. Besides, the indexing logic in the index generator is redundant because the collection of sign bits can be directly used as the index to derive k and b.
4) In addition, [23] only controls MAEsoft in the segmenter, and quantization of the circuit is not addressed before hardware implementation. This leads to an iterative modification process to meet the final requirement for MAEc.

To solve the problems mentioned earlier, we propose PLAC in Section III.

III. PROPOSED APPROXIMATION METHOD

The proposed error-flattened PWL method PLAC mainly features two parts: a segmenter to find the proper segment endpoints for the target function and a quantizer to determine the data width of the output and of the coefficients k and b. The segmenter builds on the state-of-the-art studies in [8] and [23], while a novel quantizer is proposed to further improve the hardware efficiency. Section III-A illustrates the details of the segmenter, and Section III-B presents the segmentation performance experiments. Then, Section III-C describes the design of the quantizer.

A. PLAC Segmenter

To address the two major shortcomings of the current segmenter mentioned in Section II-C, we introduce a nonoverlapping endpoint updating scheme to achieve the best segmentation performance and a bisection-seeking method to dramatically save the segmenter execution time. Unfortunately, the naive bisection method cannot ensure the maximization of a certain segment input range because many endpoints satisfy the target software error. Therefore, we introduce a bisection window, and by leveraging the relationship between the endpoint of a segment and the window, we can make sure that the input range of a certain segment is finally maximized.

The procedure of the segmenter is shown in Fig. 3, and it features two parts: the inner loop and the outer loop. The inner loop is responsible for maximizing the width of the segment, while the outer loop is responsible for controlling the segmentation within the complete input range of the target function. Here, i denotes the number of segments successfully segmented, and j is the start pointer of the region to be segmented. sp is the start pointer of a segment, and ep is the end pointer of a segment. The left pointer of the bisection [...]
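The shift in (5)–(11) and the scan-from-the-end procedure summarized in Section II-A can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' MATLAB model: the names `shifted_fit` and `segment`, the `overlap` flag, the 0-based indexing, and the treatment of a single-point segment are all choices made here. With `overlap=True` the next segment starts at the previous endpoint, as in steps 2 and 3 of the procedure in [23]; with `overlap=False` it starts one sample later, which is one possible reading of a nonoverlapping endpoint update. The bisection window of Fig. 3 is not reproduced; a plain backward scan is used instead.

```python
import math


def shifted_fit(f, xs, i, j):
    """Chord through xs[i] and xs[j] (eqs. 2-4), then shifted by D (eqs. 7, 10).

    Returns (k, b_shifted, mae_shift), with mae_shift as in eq. (11).
    Indices are 0-based here, unlike the 1-based x(1:NUM) of the text.
    """
    if i == j:
        # Assumption of this sketch: a one-point segment is represented
        # exactly by a constant, so its error is zero.
        return 0.0, f(xs[i]), 0.0
    k = (f(xs[j]) - f(xs[i])) / (xs[j] - xs[i])
    b = f(xs[i]) - k * xs[i]
    delta = [f(x) - (k * x + b) for x in xs[i:j + 1]]  # eq. (5)
    d = (max(delta) + min(delta)) / 2                  # eq. (7)
    mae_shift = (max(delta) - min(delta)) / 2          # eq. (11)
    return k, b + d, mae_shift


def segment(f, xs, mae_soft, overlap):
    """Greedy segmentation: for each start, scan the end point backward
    from the last sample until MAE_shift drops below mae_soft.

    Returns a list of (start, end, k, b, mae_shift) tuples.
    """
    if mae_soft <= 0:
        raise ValueError("mae_soft must be positive")
    segments = []
    start, last = 0, len(xs) - 1
    while start <= last:
        if overlap and start == last:
            break  # the last endpoint was already covered
        end = last
        while True:
            k, b, mae = shifted_fit(f, xs, start, end)
            if mae < mae_soft:
                break
            end -= 1  # a two-point chord has (near) zero error, so this stops
        segments.append((start, end, k, b, mae))
        start = end if overlap else end + 1
    return segments


if __name__ == "__main__":
    # Hypothetical settings: log2(1 + x) on [0, 1] with iw = 6.
    grid = [n / 64 for n in range(65)]
    log_fn = lambda x: math.log2(1.0 + x)
    for mode in (True, False):
        result = segment(log_fn, grid, 1e-3, overlap=mode)
        print("overlap" if mode else "nonoverlap", len(result), "segments")
```

The backward scan always returns the widest admissible segment for a given start point, at the cost of evaluating many candidate endpoints; that cost is what a bisection-based search is meant to reduce.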
TABLE I
SEGMENTER PERFORMANCE TEST FOR log2(1 + x)

TABLE II
SEGMENTER PERFORMANCE COMPARISON WITH [23]
TABLE III
PSEUDOCODE OF QUANTIZER
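The idea of a quantizer that simulates the fixed-point datapath and picks the smallest data width meeting MAEhard, as described in Section I, can be illustrated with a minimal Python sketch. It is not the pseudocode of Table III. Several things are assumptions of this sketch: the names `quantize`, `simulate_mae_c`, and `min_width`; a single fractional width `w` shared by k, b, and the output; round-to-nearest for the coefficients; and truncation of the MAC result. The quantization factor QF is not modeled.

```python
def quantize(value, frac_bits):
    """Round a real value to an integer with frac_bits fractional bits."""
    return round(value * (1 << frac_bits))


def simulate_mae_c(f, segments, iw, w):
    """Bit-accurate model of y = k * x + b with w fractional bits.

    segments: list of (lo, hi, k, b), where lo and hi are integer input
    codes on the 2**-iw grid (inclusive) and k, b are real coefficients.
    Returns the maximum absolute error of the simulated output.
    """
    worst = 0.0
    for lo, hi, k, b in segments:
        kq, bq = quantize(k, w), quantize(b, w)
        for xi in range(lo, hi + 1):
            acc = kq * xi + (bq << iw)   # product has w + iw fractional bits
            yq = acc >> iw               # truncate back to w fractional bits
            err = abs(f(xi / (1 << iw)) - yq / (1 << w))
            worst = max(worst, err)
    return worst


def min_width(f, segments, iw, mae_hard, max_w=32):
    """Smallest w in 1..max_w whose simulated MAE is within mae_hard."""
    for w in range(1, max_w + 1):
        if simulate_mae_c(f, segments, iw, w) <= mae_hard:
            return w
    return None  # no width up to max_w meets the requirement


if __name__ == "__main__":
    # Hypothetical one-segment example on [0, 1] with iw = 4.
    line = lambda x: 0.5 * x + 0.25
    print(min_width(line, [(0, 16, 0.5, 0.25)], iw=4, mae_hard=0.05))
```

Searching the width upward and stopping at the first value that meets the bound is the simplest way to avoid over-provisioning bits; a fuller model would treat the widths of k, b, and the output separately.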
TABLE V
CIRCUITS’ ACCURACY CONTROL TESTS
Fig. 8. Approximation results under different quantizer settings. (a)–(c) Seven-segment approximation results of log2(1 + x). (d)–(f) 16-segment approximation results of softplus(x). (g)–(i) 16-segment approximation results of softsign(x). (a) log2(1 + x)-7segs-Segmenter. (b) log2(1 + x)-7segs-QF = 1.03. (c) log2(1 + x)-7segs-QF = 1.2. (d) softplus(x)-16segs-Segmenter. (e) softplus(x)-16segs-QF = 1.04. (f) softplus(x)-16segs-QF = 1.2. (g) softsign(x)-16segs-Segmenter. (h) softsign(x)-16segs-QF = 1.02. (i) softsign(x)-16segs-QF = 1.08.
[...] presents the ASIC implementation results and comparison with previous works.

Table VI details the parameters of the four functions for our ASIC implementations and the parameters for duplicating the work in [23]. For log2(1 + x), we compare our ASIC implementation with [6], [8], and [23], and Table VII gives the experiment results. For tanh(x), we compare our implementation with [23]–[26], and the results are shown in Table VIII. For sigmoid(x), we compare our implementation with [23]–[25], and the results are presented in Table IX. For softsign(x), we compare our implementation with [5], and the comparison results are in Table X. In these tables, MAEc represents the MAE of the circuit output, and E_A represents the mean absolute error.

A. Implementation Results of Logarithmic Function

From Table VII, we can see that the proposed approximation method achieves several improvements compared with the previous works. Compared with the error-flattened approximation method proposed in [6], our approximation method and architecture cost 34.99% less area and 15.91% less delay while improving MAEc by 20.37%. Compared with [8], which achieves the theoretically minimum MAE by shifting the target function by a constant, our work costs 6.82% less area and 10.30% less delay while improving MAEc by 17.31%.

To make a fair comparison with [23], we set the same input fractional bit width and MAEsoft for its segmenter and obtain 16 segments, one more than ours. Then, we use our quantizer to quantize the circuit parameters, so that all the bit width settings are the same as ours. Finally, we use the architecture proposed in [23] for modeling and synthesis. From the results presented in Table VII, we can see that our improvements in the endpoint updating scheme and hardware architecture contribute to a 2.80% area reduction and a 3.77% power reduction.

B. Implementation Results of Hyperbolic Tangent Function

Parhi and Liu [24] took advantage of stochastic computing (SC) and Horner’s rule for the Maclaurin expansions to do the approximation. Nguyen et al. [25] used SC logic based on the PWL to approximate tanh(x) and sigmoid(x), while [26] employed SC logic to compute complex arithmetic functions. Although SC logic uses an AND gate to replace the multiplier, the linear feedback shift register (LFSR) takes quite a long time to generate the stochastic bit stream. In addition, the uniform [...]

TABLE VI
PARAMETERS OF ASIC IMPLEMENTATION

TABLE VII
ASIC IMPLEMENTATION RESULTS OF log2(1 + x)

TABLE VIII
ASIC IMPLEMENTATION RESULTS OF tanh(x)
TABLE IX
ASIC IMPLEMENTATION RESULTS OF sigmoid(x)

TABLE X
ASIC IMPLEMENTATION RESULTS OF softsign(x)
[...] in Section V, we use fewer segments to achieve higher accuracy compared with the uniform PWL in [5] and [24]–[26]. Compared with the universal error-flattened PWL in [23], our implementations also exhibit better performance due to the improvements in segmenter and hardware architecture.

APPENDIX

Table XI lists the frequently used symbols and abbreviations in this article.

TABLE XI
SYMBOLS AND ABBREVIATIONS

REFERENCES

[1] Z. Wang, J. Lin, and Z. Wang, “Accelerating recurrent neural networks: A memory-efficient approach,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2763–2775, Oct. 2017.
[2] B. Zamanlooy and M. Mirhassani, “Efficient VLSI implementation of neural networks with hyperbolic tangent activation function,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 1, pp. 39–48, Jan. 2014.
[3] L. Li, S. Zhang, and J. Wu, “An efficient hardware architecture for activation function in deep learning processor,” in Proc. IEEE 3rd Int. Conf. Image, Vis. Comput. (ICIVC), Jun. 2018, pp. 911–918.
[4] I. Tsmots, O. Skorokhoda, and V. Rabyk, “Hardware implementation of sigmoid activation functions using FPGA,” in Proc. IEEE 15th Int. Conf. Exper. Designing Appl. CAD Syst. (CADSM), Feb. 2019, pp. 34–38.
[5] C.-H. Chang, E.-H. Zhang, and S.-H. Huang, “Softsign function hardware implementation using piecewise linear approximation,” in Proc. Int. Symp. Intell. Signal Process. Commun. Syst. (ISPACS), Dec. 2019, pp. 1–2.
[6] M. Zhu, J. Xiao, W. Wanggen, and H. A. Yajun, “Error flatten logarithm approximation for graphics processing unit,” in Proc. ICM, Dec. 2011, pp. 1–6.
[7] M. Zhu, Y. Ha, C. Gu, and L. Gao, “An optimized logarithmic converter with equal distribution of relative errors,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 9, pp. 848–852, Sep. 2016.
[8] C.-W. Liu, S.-H. Ou, K.-C. Chang, T.-C. Lin, and S.-K. Chen, “A low-error, cost-efficient design procedure for evaluating logarithms to be used in a logarithmic arithmetic processor,” IEEE Trans. Comput., vol. 65, no. 4, pp. 1158–1164, Apr. 2016.
[9] M. Ha and S. Lee, “Accurate hardware-efficient logarithm circuit,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 8, pp. 967–971, Aug. 2017.
[10] A. Seth and W.-S. Gan, “Fixed-point square roots using L-b truncation [DSP tips and tricks],” IEEE Signal Process. Mag., vol. 28, no. 6, pp. 149–153, Nov. 2011.
[11] Y. Luo, Y. Wang, Y. Ha, Z. Wang, S. Chen, and H. Pan, “Generalized hyperbolic CORDIC and its logarithmic and exponential computation with arbitrary fixed base,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 9, pp. 2156–2169, Sep. 2019.
[12] Y. Wang, Y. Luo, Z. Wang, Q. Shen, and H. Pan, “GH CORDIC-based architecture for computing Nth root of single-precision floating-point number,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 4, pp. 864–875, Apr. 2020.
[13] P. Nilsson, A. U. R. Shaik, R. Gangarajaiah, and E. Hertz, “Hardware implementation of the exponential function using Taylor series,” in Proc. NORCHIP, Oct. 2014, pp. 1–4.
[14] M. Sybis, “Log-MAP equivalent Chebyshev inequality based algorithm for turbo TCM decoding,” Electron. Lett., vol. 47, no. 18, pp. 1049–1050, 2011.
[15] D. Das Sarma and D. W. Matula, “Faithful bipartite ROM reciprocal tables,” in Proc. 12th Symp. Comput. Arithmetic, Jul. 1995, pp. 17–28.
[16] M. J. Schulte and J. E. Stine, “Approximating elementary functions with symmetric bipartite tables,” IEEE Trans. Comput., vol. 48, no. 8, pp. 842–847, Aug. 1999.
[17] P. Kumar Meher, “An optimized lookup-table for the evaluation of sigmoid function for artificial neural networks,” in Proc. 18th IEEE/IFIP Int. Conf. VLSI Syst.-on-Chip, Sep. 2010, pp. 91–95.
[18] D. De Caro, N. Petra, and A. G. M. Strollo, “Efficient logarithmic converters for digital signal processing applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 667–671, Oct. 2011.
[19] D. M. Ellaithy, M. A. El-Moursy, G. H. Ibrahim, A. Zaki, and A. Zekry, “Double logarithmic arithmetic technique for low-power 3-D graphics applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 7, pp. 2144–2152, Jul. 2017.
[20] H. Kim, B.-G. Nam, J.-H. Sohn, J.-H. Woo, and H.-J. Yoo, “A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system,” IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2373–2381, Nov. 2006.
[21] B.-G. Nam and H.-J. Yoo, “An embedded stream processor core based on logarithmic arithmetic for a low-power 3-D graphics SoC,” IEEE J. Solid-State Circuits, vol. 44, no. 5, pp. 1554–1570, May 2009.
[22] M. Loukrakpam and M. Choudhury, “Error-aware design procedure to implement hardware-efficient logarithmic circuits,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 5, pp. 851–855, May 2020.
[23] H. Sun et al., “A universal method of linear approximation with controllable error for the efficient implementation of transcendental functions,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 1, pp. 177–188, Jan. 2020.
[24] K. K. Parhi and Y. Liu, “Computing arithmetic functions using stochastic logic by series expansion,” IEEE Trans. Emerg. Topics Comput., vol. 7, no. 1, pp. 44–59, Jan. 2019.
[25] V.-T. Nguyen, T.-K. Luong, H. Le Duc, and V.-P. Hoang, “An efficient hardware implementation of activation functions using stochastic computing for deep neural networks,” in Proc. IEEE 12th Int. Symp. Embedded Multicore/Many-Core Syst.-on-Chip (MCSoC), Sep. 2018, pp. 233–236.
[26] Z. Qin et al., “A universal approximation method and optimized hardware architectures for arithmetic functions based on stochastic computing,” IEEE Access, vol. 8, pp. 46229–46241, 2020.
[27] L. Gu, J. Huang, and L. Yang, “On the representational power of restricted Boltzmann machines for symmetric functions and Boolean functions,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1335–1347, May 2019.

Hongxi Dong received the B.S. degree in electronic information engineering from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2019. She is currently working toward the master’s degree at the School of Electronic Science and Engineering, Nanjing University, Nanjing.
Her current research interests include digital integrated circuit design and neural-network processing unit.
Manzhen Wang received the B.S. degree in integrated circuit design and integrated system from Xidian University, Xi’an, China, in 2018. She is currently working toward the master’s degree at the School of Electronic Science and Engineering, Nanjing University, Nanjing, China.
Her current research interest includes the design of digital VLSI circuits, with an emphasis on the approximate multiplier.

Yuanyong Luo received the B.S. degree in applied physics from Jilin University, Changchun, China, in 2016, and the Ph.D. degree in electronic science and technology from Nanjing University, Nanjing, China, in 2020.
He is currently a Senior Research Engineer at the Department of Turing Architecture Design, HiSilicon, Huawei Corporation, Shenzhen, China. His research interests include novel VLSI computing methods, central processing unit, and neural-network processing unit.

Muhan Zheng received the B.S. degree in electronic information science and technology from Nanjing University, Nanjing, China, in 2019, where she is currently working toward the master’s degree at the School of Electronic Science and Engineering.
Her current research interests include digital integrated circuit design and VLSI implementation of neural networks.

Yajun Ha (Senior Member, IEEE) received the B.S. degree from Zhejiang University, Hangzhou, China, in 1996, the M.Eng. degree from the National University of Singapore, Singapore, in 1999, and the Ph.D. degree from Katholieke Universiteit Leuven, Leuven, Belgium, in 2004, all in electrical engineering.
He is currently a Professor with ShanghaiTech University, Shanghai, China. Before this, he was a Scientist and the Director of the I2R-BYD Joint Lab, Institute for Infocomm Research, Singapore, and an Adjunct Associate Professor at the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. Prior to this, he was an Assistant Professor with the National University of Singapore. His research interests include reconfigurable computing, ultralow power digital circuits and systems, embedded system architecture, and design tools for applications in robots, smart vehicles, and intelligent systems. He has published around 100 internationally peer-reviewed journal/conference papers on these topics.
Dr. Ha was a recipient of two IEEE/ACM Best Paper Awards. He has served as the TPC Co-Chair for ISICAS 2020, the General Co-Chair for ASP-DAC 2014, the Program Co-Chair for FPT 2010 and FPT 2013, the Chair for the Singapore Chapter of the IEEE Circuits and Systems (CAS) Society in 2011 and 2012, and a member for the ASP-DAC Steering Committee and the IEEE CAS VLSI and Applications Technical Committee. He has served a number of positions in the professional communities. He serves as the Associate Editor-in-Chief for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS from 2020 to 2021 and an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS from 2016 to 2019, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS from 2011 to 2013, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS from 2013 to 2014, and the Journal of Low Power Electronics since 2009. He has been a Program Committee Member of a number of well-known conferences in the fields of FPGAs and design tools, such as the Design Automation Conference (DAC), Design, Automation and Test in Europe Conference (DATE), Asia and South Pacific Design Automation Conference (ASP-DAC), Field Programmable Gate Array (FPGA), International Conference on Field Programmable Logic and Applications (FPL), and International Conference on Field Programmable Technology (FPT).