
Dong 2020


This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

PLAC: Piecewise Linear Approximation Computation for All Nonlinear Unary Functions
Hongxi Dong, Manzhen Wang, Yuanyong Luo, Muhan Zheng, Mengyu An, Graduate Student Member, IEEE, Yajun Ha, Senior Member, IEEE, and Hongbing Pan

Abstract— This article presents a piecewise linear approximation computation (PLAC) method for all nonlinear unary functions, which is an enhanced universal and error-flattened piecewise linear (PWL) approximation approach. Compared with the previous methods, PLAC features two main parts: an optimized segmenter that seeks the minimum number of segments under the predefined software maximum absolute error (MAE), raising the segmentation performance to the highest theoretical level for logarithm, and a novel quantizer that completely simulates the hardware behavior and determines the required bit width and MAEc (MAE in circuits) for hardware implementation. In addition, the hardware architecture is also improved by simplifying the indexing logic, leading to nonredundant hardware overhead. The ASIC implementation results reveal that the proposed PLAC can improve all metrics without any compromise. Compared with the state-of-the-art methods, when computing logarithmic function, PLAC reduces 2.80% area, 3.77% power consumption, and 1.83% MAEc with the same delay; when approximating hyperbolic tangent function, PLAC reduces 6.25% area, 4.31% power consumption, and 18.86% MAEc with the same delay; when evaluating sigmoid function, PLAC reduces 16.50% area and 4.78% power consumption with the same delay and MAEc; and when calculating softsign function, PLAC reduces 17.28% area, 11.34% power consumption, 12.50% delay, and 33.28% MAEc.

Index Terms— Error-flattened, nonlinear unary function, piecewise linear (PWL) approximation, piecewise linear approximation computation (PLAC), quantizer, segmenter, VLSI architecture.

Manuscript received March 29, 2020; revised May 30, 2020; accepted June 14, 2020. This work was supported in part by the National Natural Science Foundation of China under Grant 61376075 and Grant 41412020201, and in part by the Key Research and Development Program of Jiangsu Province under Grant BE2015153. (Corresponding authors: Yuanyong Luo; Hongbing Pan.)
Hongxi Dong, Manzhen Wang, Muhan Zheng, Mengyu An, and Hongbing Pan are with the School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).
Yuanyong Luo is with the Department of Turing Architecture Design, HiSilicon, Huawei Corporation, Shenzhen 518129, China (e-mail: [email protected]).
Yajun Ha is with the School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2020.3004602
1063-8210 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Carleton University. Downloaded on July 26,2020 at 23:28:40 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Many studies have been devoted to the efficient approximation of nonlinear functions in recent years due to the conflicts between hardware resource constraints and the requirement of low delay and acceptable accuracy in practical applications.

For instance, recurrent neural networks (RNNs), such as long-short-term memory and gated recurrent unit, are widely applied in natural language processing and video processing, where real-time performance is a major concern [1]. As the activation functions in RNNs, hardware implementations of sigmoid, hyperbolic tangent, and softsign functions are investigated in [2]–[5]. In [6]–[9], the efficient implementation of logarithmic function is studied for graphics processing unit. These four widely used functions mentioned earlier are all nonlinear unary functions.

Several approximation methods have been proposed to implement nonlinear unary functions. Iteration methods, such as the Newton iteration method [10] and coordinated rotation digital computer (CORDIC) [11], [12], suffer from long time delay due to their repeated iterative operations. The polynomial approximation is then proposed to compute more directly, which takes advantage of the series expansion of target functions, such as the Taylor series approximation and the Chebyshev polynomial approximation [13], [14]. However, polynomial approximation costs too many cascaded multiplication and addition (MAC) operations, resulting in high hardware overhead and delay. To further reduce the computation complexity, the piecewise linear (PWL) method is widely applied since it only requires one MAC, leading to low delay and simple circuit architecture.

The predecessor of the PWL method is the lookup table (LUT)-based approximation, which uses LUTs to map input regions to output values. The biggest disadvantage of this method is that the storage requirement will increase exponentially with the improvement of calculation accuracy [15]–[17]. PWL uses several linear segments k × x + b to approximate target functions and only requires the storage of slope k, intercept b, and endpoints for each segment. Compared with the LUT-based method, PWL saves quite a lot of memory resources.

The development of the PWL method can be generally divided into three stages: uniform-segment-based PWL, nonuniform-segment-based PWL, and error-flattened PWL. The uniform method divides the input range into equal-length regions and uses one linear segment to approximate target function in each region. Both [18] and [19] use this method to approximate logarithmic function, while the former minimizes

the maximum relative error (MRE) to obtain k and b and the latter improves computing accuracy by correcting computation result using the error value stored in LUT. Since the variation rate of a nonlinear function differs among input regions, the maximum error in each segment is unequal, resulting in many more segments when the maximum error is required to be controlled under a given upper bound. As the memory cost is proportional to the number of segments in the PWL method, this segmentation method is not good enough for hardware design.

To overcome the disadvantages of uniform PWL, nonuniform PWL is proposed in [20] and [21]. Kim et al. [20] use 15 segments, while Nam and Yoo [21] use 24 segments to compute logarithmic function. Both of them increase the number of segments around input value 0. Although nonuniform PWL can achieve relatively high accuracy using not too many segments, it still has an uncontrollable error range, which means that the design skill has a significant impact on circuit performance.

In [7], an error-flattened PWL for logarithmic function is proposed. This method utilizes the predefined upper bound of MRE as a guide to divide the segments, guaranteeing that each segment has the same MRE, thus obtaining the minimum number of segments for approximation. However, considering the hardware efficiency, maximum absolute error (MAE) is a better metric for fixed-point implementation. Liu et al. [8] proposed a method to obtain equal MAE for each segment by dividing the output range into equal subranges. The error-flattened approximation method in [8] achieves the theoretically best segmentation performance, which is proved by strict mathematics. Ha and Lee [9] improved the accuracy of [8] by dividing one linear segment into three segments with the same slope, at the cost of hardware overhead. However, the dividing method in [8] relies on the characteristic of the logarithmic function, which means that the method cannot be extended to other functions. In [22], another logarithmic converter with a novel error-aware segmentation procedure is proposed, which approximates logarithmic function by unity slope straight lines and maximizes the length of each segment under the MAE requirement.

To realize the generalization of error-flattened PWL method, by decoupling the relation between the segmentation scheme and the characteristics of target functions, Sun et al. [23] proposed a universal PWL segmenter for all transcendental functions for the first time. However, we notice that Sun et al. [23] cannot reach the best segmentation performance, which is achieved in [8], and the MAE is only controlled in software approximation. In addition, the hardware architecture in [23] still has redundant logic, resulting in unnecessary hardware overhead. Thus, we propose piecewise linear approximation computation (PLAC) to enhance the previous universal error-flattened PWL method.

In this article, an error-flattened segmenter that raises the segmentation performance to the highest theoretical level is proposed. Similar to the MAE-guided segmentation method in [23], the proposed segmenter keeps MAEs equal for each segment and finds the minimum number of segments under the software MAE requirement. Meanwhile, a novel bisection-seeking method is designed to speed up the procedure of segmentation. After the segmenter has determined the endpoints of each input region and obtained the coefficients k and b, an innovative quantizer is proposed to quantize the coefficients and the circuit outputs. To be more specific, we denote MAE requirement for the segmenter as MAEsoft, MAE requirement for quantizer as MAEhard, and the MAE of circuit output as MAEc. The final goal of PLAC is to make MAEc less than or equal to MAEhard. Since the quantization in hardware implementation brings accuracy loss, a quantization factor QF is designed to scale up MAEsoft. The quantizer can completely simulate the hardware behavior and choose the least data width required for the predefined MAEhard, so that no redundant resources are consumed in hardware implementation. It has to be noted that PLAC is an MAEhard-guided method compared with previous works, which means that hardware accuracy is controllable and the iterative modification process is significantly reduced in hardware design.

We use MATLAB to model the proposed segmenter and quantizer and Verilog HDL to model the hardware architecture. For synthesis, TSMC 65-nm technology is applied to a logarithmic function, TSMC 90-nm technology is applied to hyperbolic tangent function and sigmoid function, and TSMC 40-nm technology is applied to hyperbolic tangent function and softsign function. We have performed extensive hardware comparisons with the prior arts [5], [23]–[26]. Typically, compared with the state-of-the-art PWL method [23], PLAC reduces 2.80% area and 3.77% power consumption in the implementation of log2(1 + x), without accuracy loss. As for tanh(x), 6.25% area reduction is achieved, while MAEc is improved by 18.86%. For sigmoid(x), 16.5% area improvement is achieved without accuracy loss. For the abovementioned three functions, the delay of PLAC is the same as that of [23]. In addition, our implementation of softsign(x) reduces 17.28% area and 12.5% delay while improving MAEc by 33.28% compared with [5].

The contributions of this article are summarized as follows.
1) The proposed segmenter takes advantage of the bisection method, dramatically saving segmentation time.
2) By decoupling the endpoints and the start points of adjacent segments, our segmenter can achieve the theoretically best segmentation performance for logarithmic function.
3) A quantizer is designed to completely simulate hardware implementation, which determines the data width of coefficients and outputs under the MAEhard requirement. Compared with the traditional design methods that only provide the segmenter, the proposed quantizer can minimize the hardware resources cost under the hardware accuracy requirement before actual VLSI implementation, saving a lot of iterative design time.
4) The hardware architecture is further simplified compared with that in the state-of-the-art work, reducing the redundant indexing logic for coefficients.

The rest of this article is organized as follows. Section II introduces state-of-the-art research on the PWL method and analyzes both advancements and disadvantages. Section III gives a detailed theory of segmenter, segmentation performance tests, and the theory of quantizer. VLSI architecture design is illustrated in Section IV, which also covers the circuits’

DONG et al.: PLAC FOR ALL NONLINEAR UNARY FUNCTIONS 3

accuracy control experiments. Then, hardware experimental results and comparison with previous works are shown in Section V. Finally, we summarize our work in Section VI. In addition, we list the frequently used symbols and abbreviations in the Appendix.

II. STATE OF THE ART OF PWL METHOD

This section will first introduce the state-of-the-art universal
error-flattened PWL method [23] that mainly includes a seg-
menter and a universal hardware architecture and then analyze
its advantages and disadvantages.

A. Introduction to Segmenter
Sun et al. [23] first discretized the continuous input range [M, N] into discrete points

$$x = x(1:\mathrm{NUM}) = \left\{ M,\ M + \frac{1}{2^{iw}},\ M + \frac{2}{2^{iw}},\ \ldots,\ N \right\} \tag{1}$$

to imitate the actual hardware implementation features, where iw is the number of input fractional bits and NUM = (N − M)/2^(−iw) + 1. The PWL approximation of function f(x) in subrange x(i : j), 1 ≤ i < j ≤ NUM can be computed as

$$h(x) = k \ast x + b \tag{2}$$
$$k = \frac{f(x(j)) - f(x(i))}{x(j) - x(i)} \tag{3}$$
$$b = f(x(i)) - k \ast x(i). \tag{4}$$

Here, h(x) represents the approximate linear segment, while k and b represent the slope and intercept of this segment, respectively. Then, Sun et al. [23] proposed a way to minimize MAE in the input range x(i : j), which is given in (5)–(11). The approximation error is denoted as

$$\delta = f(x(i:j)) - h(x(i:j)). \tag{5}$$

MAEorigin represents the MAE of original linear approximation, and MAEshift represents the MAE of shifted linear segments. MAEorigin is computed by

$$\mathrm{MAE}_{\mathrm{origin}} = \max\{|\max(\delta)|,\ |\min(\delta)|\}. \tag{6}$$

According to Fig. 1, the minimization of MAE is realized by vertically shifting the abscissa x by value D, where D is calculated by

$$D = \frac{\max(\delta) + \min(\delta)}{2}. \tag{7}$$

Assuming that the shifted linear segment is h′, and the shifted error is δ′

$$h' = k' \ast x + b' \tag{8}$$
$$\delta' = \delta - D = f(x(i:j)) - h'(x(i:j)) \tag{9}$$

then it can be deduced that

$$k' = k, \quad b' = b + D \tag{10}$$
$$\mathrm{MAE}_{\mathrm{shift}} = \frac{\max(\delta) - \min(\delta)}{2} = |\max(\delta')| = |\min(\delta')|. \tag{11}$$

Fig. 1. Minimization of MAE by parallel shifting. Coordinate δ − x shows the error curve of original approximation, and coordinate δ′ − x′ shows the error curve after MAE minimization.

The procedure of the segmenter in [23] can be summarized as follows.
1) It first predefines the MAEsoft for segmenter.
2) Then, it calculates the MAEshift using (11) in the range x(start : end), where start is initialized as 1 and end is initialized as NUM. The seeking order is from the last input point x(NUM) to the previous points.
3) If the calculated MAEshift is smaller than MAEsoft, then the searched range is recorded and start is updated to end, while end is reset to NUM.
4) Repeat step 3 until start is updated to NUM.

B. Introduction to Hardware Architecture
The hardware architecture given in [23] is shown in Fig. 2, including one MAC, an index generator, and two LUTs for coefficients k and b. Here, n denotes the number of segments, x_i are the starting points of segment i (i = 2, ..., n), and s_i (i = 1, 2, ..., n − 1) denote the sign bits derived from subtraction. The index generator is essentially a comparator to locate the segment that the input belongs to, including a series of subtractions and one MUX, which is denoted as MUX1. Sign bits of subtraction results are collected and concatenated as the index of MUX1, which is responsible for generating the segment index. Then, the segment index is used as a select signal for MUX2 and MUX3, where MUX2 is used to index slope k and MUX3 is used to index intercept b. Finally, k and b are sent to the multiplier and adder to compute the approximate value.

C. Advantages and Disadvantages Analyses
There are three main advantages of [23], which are listed as follows.
1) The segmenter proposed in [23] decouples the segmentation method and the features of target functions by


using discrete mathematics instead of continuous mathematics, thereby realizing the universal error-flattened segmentation.
2) Compared with the shifting method that shifts a constant value according to the property of logarithmic function in [8], the shifting method proposed in [23] is universal to all target functions.
3) For hardware architecture, [23] utilized the parallel subtractors to index the coefficients k and b, thus significantly reducing the delay of the circuit.

Fig. 2. Hardware architecture proposed in the article [23].

However, the segmenter and hardware architecture proposed in [23] are not fully optimized. Four disadvantages are listed in the following.
1) According to steps 2 and 3 of the segmenter described in Section II-A, we can find that the endpoints of the recorded segments are reused: one time as the endpoint of segment i and one time as the start point of segment i + 1. This is a waste of input points since there are no other probable inputs between two contiguous discrete inputs.
2) Another deficiency of the previous segmentation method is the improper seeking order. Since the input ranges of target functions are relatively large, sequentially trying all points from the end to start is not quite efficient. Especially for the front part of the search range, finding the required endpoint takes a long time.
3) According to Fig. 2, three multiplexers are required in [23]: MUX1 for generating segment index, MUX2 for indexing slope k, and MUX3 for indexing intercept b. Actually, coefficients k and b are in pairs, which means that k and b can be indexed simultaneously. Besides, the indexing logic in the index generator is redundant because the collection of sign bits can be directly used as the index to derive k and b.
4) In addition, [23] only controls MAEsoft in segmenter, and quantization of circuit is not addressed before hardware implementation. This will lead to an iterative modification process to meet the final requirement for MAEc.

To solve the problems mentioned earlier, we propose PLAC in Section III.

III. PROPOSED APPROXIMATION METHOD

The proposed error-flattened PWL method PLAC mainly features two parts: a segmenter to find the proper segment endpoints for the target function and a quantizer to determine the data width of output and the coefficients k and b. The segmenter is in some way based on state-of-the-art studies in [8] and [23], while a novel quantizer is proposed to further improve the hardware efficiency. Section III-A illustrates the details of the segmenter, and Section III-B presents the segmentation performance experiments. Then, Section III-C describes the design of quantizer.

A. PLAC Segmenter
To address the two major shortcomings of the current segmenter mentioned in Section II-C, we introduce a nonoverlapping endpoint updating scheme to achieve the best segmentation performance and a bisection-seeking method to dramatically save the segmenter execution time. Unfortunately, the naive bisection method cannot ensure the maximization of a certain segment input range because there are lots of endpoints that satisfy the target software error. Therefore, we innovatively introduce a bisection window, and by leveraging the relationship of the endpoint of a segment and the window, we can make sure that the input range of a certain segment can be finally maximized.

The procedure of the segmenter is shown in Fig. 3, and it features two parts: the inner loop and the outer loop. The inner loop is responsible for maximizing the width of the segment, while the outer loop is responsible for controlling the segmentation within the complete input range of the target function. Here, i denotes the number of segments successfully segmented, and j is the start pointer of the region to be segmented. sp is the start pointer of a segment, and ep is the end pointer of a segment. The left pointer of the bisection


window is denoted as lp, while the right pointer is denoted as rp.

Fig. 3. Procedure of segmenter.

The software execution procedure of the proposed segmenter is listed as follows.
1) MAEsoft is first predefined as the target software error, and i is initialized to 0, while j is initialized to 1.
2) The input range x(j : NUM) remains to be segmented. Then, the starting point x(sp) is set as x(j), and the endpoint x(ep) is set as x(NUM). The left pointer of the bisection window lp is set as j, and the right pointer of the bisection window rp is set as NUM.
3) Calculate the maximum absolute error MAE(sp : ep) of the approximate linear segment on the input width x(sp : ep) by (11) and determine whether it meets the required calculation accuracy MAEsoft.
Condition a: If the accuracy requirement is satisfied, then judge whether one of the maximum width judgment conditions ep == rp and ep == rp − 1 is established. If one of the maximum width judgment conditions is reached, slope k, intercept b, segment start pointer sp, and segment end pointer ep are stored. Otherwise, the left window lp needs to be updated to the current end pointer ep, and then, the end pointer ep is shifted right by half of x(lp : rp).
Condition b: If the calculated MAE(sp : ep) does not meet the accuracy requirements, the right window rp is updated to the current end pointer ep, and then, the end pointer ep is shifted left by half of x(lp : rp).
4) Repeat step 3 until one of the maximum width judgment conditions is reached. Now, the inner loop is finished, and the ith segment is decided. Then, update j using the nonoverlapping scheme and return to step 2, starting the next inner loop.
5) Repeat step 4 until j is equal to NUM + 1.

It has to be noted that there are two different maximum width judgment conditions: ep == rp and ep == rp − 1. Condition ep == rp is set for the last segment, and condition ep == rp − 1 is set for other segments. The reasons are described in the following.

After the left shift of the right window, no index in the range rp : NUM can be used as the end pointer ep while meeting the accuracy requirement. Then, at this time, the judgment condition of the maximum segmentation endpoint that satisfies the calculation accuracy requirement should be ep == rp − 1, that is, the point rp − 1 exactly meets the requirement. Therefore, when MAE(sp : ep) meets the accuracy requirements after several iterations, it is still necessary to determine whether the segment width x(sp : ep) is the largest at this time according to whether ep == rp − 1 holds. After repeated iterations of the abovementioned calculation process, the maximum width judgment condition ep == rp − 1 will eventually be established. When it comes to the last segment, MAE(sp : ep) will always satisfy the accuracy requirement during the inner loop. Thus, the right pointer will not be left shifted, and the maximum segmentation endpoint will definitely be NUM or rp.

The judgment condition ep == NUM − 1 following the inner loop is set up for the case where there is only one point in the last segment. If this judgment condition is established, there will be only one point in the last segment, which cannot form a linear segment. In this case, the endpoint updating scheme is changed to j = ep so that a segment can still be formed. In practical use, we will abandon this segmentation result and adjust MAEsoft slightly to get a better segmentation result.

Once the entire input range [M, N] is segmented, the start point sp and endpoint ep of each segment are stored. We use collection E to represent the start points E = {M, x_2, x_3, ..., x_n}, where x_i = x(sp_i). Meanwhile, slope k_i and intercept b_i of the ith segment are calculated and stored. Then, we get the approximation equation

$$f(x) \approx \begin{cases} k_1 \ast x + b_1, & x \in [M, x_2) \\ k_2 \ast x + b_2, & x \in [x_2, x_3) \\ \quad\vdots \\ k_n \ast x + b_n, & x \in [x_n, N]. \end{cases} \tag{12}$$

Fig. 4 takes a three-segment segmentation as an example to illustrate the difference of endpoint updating schemes between [23] and PLAC. In Fig. 4(a), the endpoint ep_1 belongs


to two adjacent segments, so does ep_2. Fig. 4(b) discards the overlapping endpoints, separating ep_1, sp_2, ep_2, and sp_3. The advancement of new endpoint updating scheme is proved in Section III-B.

Fig. 4. Endpoint updating scheme. (a) Endpoint updating scheme used in [23]. (b) Endpoint updating scheme in the proposed segmenter.

B. Segmentation Performance Experiments
The error-flattened linear approximation segmentation method proposed in paper [8] is only suitable for binary logarithmic function, but its segmentation performance has reached the best effect through rigorous mathematical deduction. Therefore, this section first uses [8] as a performance benchmark to test whether PLAC segmenter has reached the highest level of performance when applied to binary logarithm. Here, the target function is f(x) = log2(1 + x) with input range [0, 1), and we set the input fractional width iw = 23 to finely discretize the input range. In [8], the MAE is equal for each segment and can be calculated according to number of segments n. We define the MAE in [8] as MAElog2(n), and the calculation equation is given in

$$\mathrm{MAE}_{\log 2}(n) = \frac{1}{2}\left( \log_2\!\left( \frac{n\left(2^{1/n} - 1\right)}{\ln 2} \right) + \frac{1}{n\left(2^{1/n} - 1\right)} - \frac{1}{\ln 2} \right). \tag{13}$$

In our performance test, software accuracy requirement MAEsoft is set as the MAElog2(n). Table I gives the number of segments obtained using PLAC segmenter under MAEsoft requirements. From Table I, we can see that when MAEsoft of segmenter is equal to that in [8], the total number of segments n is consistent with that in [8], indicating that the performance of PLAC segmenter has reached the highest theoretical level for logarithmic function.

TABLE I
SEGMENTER PERFORMANCE TEST FOR log2(1 + x)

Then, we also compare our segmenter with [23] to prove that the new endpoint updating scheme is valuable. To be fair, we choose the same input range and input bit width settings as [23] and test the number of segments under different accuracy requirements. The comparison is executed on hyperbolic tangent, sigmoid, and binary logarithmic functions. Table II shows the performance of the segmenter in comparison with that in [23]. As can be seen from the table, our segmenter can reduce the number of segments under most MAE values compared with [23]. The higher the precision requirement, the clearer the defect of overlapping endpoints of segments.

TABLE II
SEGMENTER PERFORMANCE COMPARISON WITH [23]

C. PLAC Quantizer
To address the fourth disadvantage of the current PWL method mentioned in Section II-C, we propose a software quantizer to solve the optimal quantization bit width settings of relevant data under given circuit accuracy requirements, thereby saving the time for iterative circuits design and adjustment.

According to (12), coefficients k and b, intermediate product k × x, and output f(x) remain to be quantized in hardware implementation since input x has already been discretized. Fig. 5 gives the computing circuits regarding the quantizer, where k and b are the inputs of the circuits, while m and f(x) are the outputs of the circuits. In this article, iw represents


the fractional bit width of input x, and qw denotes the output fractional bit width of quantizer, which is applied to coefficients k and b, intermediate product m, and circuits output f(x).

TABLE III
PSEUDOCODE OF QUANTIZER

Fig. 5. Computing circuits regarding the quantizer. iw is the fractional bit width of x, and qw is the fractional bit width of k, b, m, and f(x).
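As a floating-point reference for the datapath that the quantizer later constrains, the evaluation of (12) can be sketched in Python as follows. This is only an illustration under stated assumptions: the two start points are hand-picked rather than produced by the segmenter, iw = 8 is an arbitrary choice, the coefficients come from the chord-and-shift fit of (3), (4), (7), and (10), and the names `fit_segment`, `pwl_eval`, and `target` are hypothetical.

```python
import bisect
import math


def fit_segment(f, xs):
    """Fit one segment to the samples xs: end-point chord, then vertical shift D."""
    k = (f(xs[-1]) - f(xs[0])) / (xs[-1] - xs[0])   # slope, as in (3)
    b = f(xs[0]) - k * xs[0]                         # intercept, as in (4)
    delta = [f(x) - (k * x + b) for x in xs]         # error, as in (5)
    d = (max(delta) + min(delta)) / 2                # shift D, as in (7)
    mae_shift = (max(delta) - min(delta)) / 2        # as in (11)
    return k, b + d, mae_shift                       # b' = b + D, as in (10)


def pwl_eval(x, starts, coeffs):
    """Evaluate (12): select the last segment whose start point is <= x."""
    i = bisect.bisect_right(starts, x) - 1
    k, b = coeffs[i]
    m = k * x        # intermediate product m
    return m + b     # output f(x), before any quantization


def target(x):
    return math.log2(1 + x)


IW = 8                                      # assumed input fractional width
grid = [i / 2**IW for i in range(2**IW)]    # discrete inputs in [0, 1)
starts = [0.0, 0.5]                         # hand-picked start points (illustration only)
bounds = [(0, 128), (128, 256)]             # grid index range of each segment
fits = [fit_segment(target, grid[lo:hi]) for lo, hi in bounds]
coeffs = [(k, b) for k, b, _ in fits]
worst = max(abs(target(x) - pwl_eval(x, starts, coeffs)) for x in grid)
```

Here `worst` is the software MAE over the whole grid; it equals the largest per-segment MAEshift up to floating-point rounding, which is the quantity the segmenter compares against MAEsoft.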
Quantization of k and b has no influence on the computing circuits since it only changes the stored data, so we use round operation in quantization to ensure accuracy with less bit width. The quantization of these two parameters is calculated by the following equations:

$$k_q = \mathrm{round}(k \times 2^{qw}) \times 2^{-qw} \tag{14}$$
$$b_q = \mathrm{round}(b \times 2^{qw}) \times 2^{-qw}. \tag{15}$$
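The coefficient rounding in (14) and (15) can be sketched in Python as below. The sketch assumes that round(·) breaks ties away from zero, which is the MATLAB convention and seems a reasonable reading given that the quantizer is modeled in MATLAB; Python's built-in `round` breaks ties to even, so a small helper is used instead. The coefficient values and the helper names are hypothetical.

```python
import math


def round_half_away(v: float) -> int:
    """Round to the nearest integer, ties away from zero (assumed convention)."""
    a = abs(v)
    r = math.floor(a)
    if a - r >= 0.5:
        r += 1
    return -r if v < 0 else r


def quantize_coeff(value: float, qw: int) -> float:
    """Round value to qw fractional bits, as in (14) and (15)."""
    return round_half_away(value * 2**qw) * 2.0**-qw


k, b, qw = 0.7213, 0.0431, 8    # hypothetical slope, intercept, and bit width
kq, bq = quantize_coeff(k, qw), quantize_coeff(b, qw)
```

With rounding, each stored coefficient differs from its exact value by at most 2^−(qw+1), half of one least significant bit.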
When it comes to the intermediate output m, we simply truncate it to qw fractional bits in the quantizer to simulate the truncating behavior of the computing circuits. For example, if the hardware design keeps only 3 bits to represent a number and the positive input 3.75 is represented as 011.11 in binary, then after truncation in hardware its binary representation becomes 011, which equals 3 in decimal. This mapping can be realized by the floor(·) operation. If the input is the negative number −2.25, whose two's-complement representation is 101.11, the result of truncation is 101, which is −3 in decimal. Therefore, truncation can be simulated by the floor(·) operation for both positive and negative numbers. Here, floor(·) returns the largest integer smaller than or equal to its input. The quantization equation of m is therefore written as

$$m_q = \operatorname{floor}(k_q \times x \times 2^{qw}) \times 2^{-qw}. \tag{16}$$

Because the addition operation does not change the fractional bit width, there is no need to quantize the final output f(x).

As mentioned earlier, the quantization operations introduce accuracy loss into the computing result. To represent the gap between the accuracy requirement MAEsoft for the segmenter and the accuracy requirement MAEhard for the quantizer, a quantization factor QF is defined through the relationship MAEhard = MAEsoft × QF, where QF is a number greater than 1. Once the segmenter has finished the segmentation task, we have the number of segments n, the exact function values f = f(x(1 : NUM)), and the slopes ki and intercepts bi of segment i, i = 1, 2, ..., n. Then, the quantizer is executed to quantize the circuit parameters. Here, we define MAEc as the MAE of the circuit output. The pseudocode of the quantizer is given in Table III, which presents the procedure for solving the quantization bit width. In line 2, ew denotes the least number of fractional bits required for the final output under the limit of MAEhard. For example, if we want to maintain MAEc under 0.003, then the number of fractional bits of the output must be larger than −log2 0.003 = 8.38, which means that 9 fractional bits are the lowest requirement. Due to the truncation error, ew may not be enough to achieve the target accuracy; thus, a guarding width gw is introduced to guarantee the computation accuracy. If MAEc does not meet the requirement of MAEhard, then gw is increased gradually to improve the computation accuracy. The output fractional bit width qw is the sum of ew and gw.

Numerous experiments show that the quantizer can completely simulate the behavior of the circuit, thus greatly saving circuit design time by avoiding repeated adjustments to meet the design requirements. As a software program, the quantizer can quickly obtain the circuit design parameters and the corresponding MAEc under the predefined MAEhard. In addition, it can determine the minimum fractional bit width of the coefficients and outputs, thereby realizing a nonredundant VLSI design.

The proposed segmenter and quantizer work together as follows. For instance, if a practical implementation requires that MAEc ≤ 1e−3, then we set MAEhard to 1e−3. Due to the accuracy loss of the fixed-point implementation, MAEsoft should be less than MAEhard. Therefore, it is suggested to start with QF = 2, i.e., MAEsoft = MAEhard/2 = 5e−4. All that is left is to iteratively perform the PLAC algorithm and tune QF until a good tradeoff is achieved.

IV. VLSI ARCHITECTURE DESIGN

According to the output of the PLAC quantizer, this section will first design the corresponding universal hardware archi-


tecture and then analyze its hardware complexity and delay. Finally, the quantizer's circuit error control capability for the proposed universal hardware architecture will be presented.

A. Universal Architecture

To address the third disadvantage of the current PWL method mentioned in Section II-C, we remove the index generator in Fig. 2 [23] to simplify the hardware architecture. In addition, we concatenate kq_i and bq_i for i = 1, 2, 3, ..., n and store them in one LUT, using one MUX to index kq and bq simultaneously. The improved architecture is shown in Fig. 6.

Fig. 6. Proposed VLSI architecture.

As shown in Fig. 6, n − 1 parallel adders work as n − 1 comparators to locate the segment to which the input belongs and to output the sign bits si at the same time. Then, the n − 1 sign bits are concatenated into the MUX selection signal S = {s1, s2, ..., sn−1} to directly index the slope kq and intercept bq of the corresponding segment. Table IV gives the input–output mapping of the selector.

TABLE IV
INPUT–OUTPUT MAPPING OF SELECTOR

For example, if the input x lands in the interval of the second segment, then x − x2 ≥ 0 and x − xi < 0 for i = 3, 4, ..., n − 1, and the output sign bits of the comparators will be {0111...1}. This selection signal indexes the slope and intercept of the second segment, and the indexed kq_2 and bq_2 are then sent to the computing circuits shown in Fig. 5.

The subsequent computing circuits include a multiplier and an adder. The fractional bit widths of kq and bq are both qw, and the intermediate product m is also truncated to qw fractional bits, as determined by the quantizer. Finally, we get an output with qw fractional bits after the addition.

B. Complexity and Delay Analyses

1) Complexity Analyses: As shown in Fig. 6, the entire hardware architecture uses n − 1 adders as comparators and one n : 1 MUX to index slope kq and intercept bq. A multiplier and an adder are used to calculate kq × x + bq.

It can be seen that the comparators are executed in parallel, and only the sign bits are used as the output results. Therefore, the design compiler will delete the unused sum output when synthesizing the circuit and only keep the carry output. This simplifies the parallel adders to a parallel carry chain. Since one of the inputs of the carry logic is constant, the hardware overhead of the carry logic is further reduced. The parallel carry chain is thus simplified to a parallel constant carry chain.

According to the computing circuits in Fig. 6, kq × x will produce a product with iw + qw fractional bits. After quantization, the intermediate product only retains qw fractional bits, which means that iw fractional bits are truncated. Therefore, a lot of logic will be deleted when this multiplier is synthesized, making its hardware cost lower than that of a normal multiplier.

In summary, if the total number of segments is n, then the hardware overhead in this article is n − 1 logic-reduced comparators, one n : 1 MUX, one logic-reduced multiplier, and one normal adder. Two MUXs are saved compared with the hardware overhead in [23].

2) Delay Analyses: The indexing logic in [23] includes three steps.
1) Generating sign bits collection S as a selection signal for MUX1.
2) MUX1 selects the segment index.
3) MUX2 and MUX3 use the segment index to select the corresponding k and b, respectively.

To further simplify the indexing logic, we merge steps 2 and 3 into one: MUX1 directly selects the corresponding k and b simultaneously. The reduction in indexing logic contributes to the area or delay reduction in ASIC implementation, which is presented in Section V.

Because the n − 1 comparators are executed in parallel, the total delay of the proposed circuit architecture is the sum of the delays of one comparator, one n : 1 MUX, one logic-reduced multiplier, and one normal adder. Compared with the critical path of the architecture in [23], the delay of one MUX is saved.

C. Circuits' Accuracy Control With Quantizer

In this section, we use MATLAB to model the proposed segmenter and quantizer and give our experiment results of three functions: log2(1 + x), softplus(x), and softsign(x). Softplus [27] and softsign [5] are widely used activation functions in neural networks, where softplus(x) = ln(1 + e^x) and softsign(x) = x/(1 + |x|).

Table V shows the software experiment results, including both segmenter and quantizer. We test the quantizer's performance under different QF settings. For log2(1 + x), we set the input range as [0, 1] and MAEsoft as 8.84e–4, getting seven segments in the segmenter. Then, we adjust the value of QF to obtain different values of qw in the quantizer, which indicates the output precision. After the quantization bit width is determined, MAEc is calculated simultaneously. For the


other two functions, the experiment procedure is similar, and the detailed parameters are outlined in Table V.

TABLE V
CIRCUITS' ACCURACY CONTROL TESTS

It can be concluded from Table V that MAEc is slightly lower than MAEhard, proving that the quantizer does have the error controlling ability. Different qw will lead to different MAEc. Fig. 7 shows the traditional design procedure to find an optimal qw, where an arbitrary qw is initialized for the parameters in the circuits and iterative adjustments of qw are required. The procedure framed by dashed lines can be replaced by the PLAC quantizer. By changing the value of QF, we can determine qw in the quantizer according to the accuracy requirements in practical use, thus saving considerable time compared with a traditional design method that only provides a PWL segmentation scheme.

Fig. 7. Traditional design procedure.

Fig. 8 shows the approximation results of the abovementioned three functions. AEsoft denotes the absolute error of the segmenter's output, while AEc is the absolute error of the circuit output after quantization. For each function, we visualize the approximation result of the segmenter and the results after quantization with respect to two different values of QF. Fig. 8(a)–(c) shows the approximation results of log2(1 + x), where Fig. 8(a) shows the result of the segmenter, Fig. 8(b) shows the result when QF = 1.03, and Fig. 8(c) shows the result when QF = 1.2. The error curves in Fig. 8(a) have the same extreme value in each segment, which is equal to the MAEsoft defined at the beginning of the segmentation procedure. When QF is set to 1.03, MAEhard is computed as MAEsoft × QF, that is, 9.11e–4. The quantizer then chooses the minimum qw under the limitation of MAEhard and outputs qw = 16, which means that the fractional bit width of k, b, m, and f(x) is set to 16. Under this condition, MAEc is computed as 9.05e–4, which is slightly larger than MAEsoft. When QF is set to 1.2, after the same procedure described earlier, we get MAEc = 1.05e−3, and the extreme values of each segment are no longer equal. Fig. 8(d)–(i) shows the approximation results of softplus and softsign in order to exhibit the universality of PLAC. According to the data in Table V and the curves in Fig. 8, it can be concluded that the smaller the QF, the tighter the MAEhard constraint, resulting in a larger qw. The AEc curve is then closer to AEsoft, which is flattened in each segment. Conversely, when QF becomes larger and qw becomes smaller, the error-flattened feature is damaged due to the introduction of larger quantization error. Since the data width is positively related to hardware resources, the proposed quantizer is able to make a tradeoff between circuit accuracy and hardware overhead.

V. ASIC IMPLEMENTATION AND COMPARISON

The proposed hardware architecture is modeled in Verilog HDL and synthesized by Synopsys Design Compiler with TSMC 40-, 65-, and 90-nm CMOS technologies. This section


presents the ASIC implementation results and a comparison with previous works.

Fig. 8. Approximation results under different quantizer settings. (a)–(c) Seven-segment approximation results of log2(1 + x). (d)–(f) 16-segment approximation results of softplus(x). (g)–(i) 16-segment approximation results of softsign(x). (a) log2(1 + x)-7segs-Segmenter. (b) log2(1 + x)-7segs-QF = 1.03. (c) log2(1 + x)-7segs-QF = 1.2. (d) softplus(x)-16segs-Segmenter. (e) softplus(x)-16segs-QF = 1.04. (f) softplus(x)-16segs-QF = 1.2. (g) softsign(x)-16segs-Segmenter. (h) softsign(x)-16segs-QF = 1.02. (i) softsign(x)-16segs-QF = 1.08.

Table VI details the parameters of four functions for our ASIC implementations and the parameters for duplicating the work in [23]. For log2(1 + x), we compare our ASIC implementation with [6], [8], and [23], and Table VII gives the experiment results. For tanh(x), we compare our implementation with [23]–[26], and the results are shown in Table VIII. For sigmoid(x), we compare our implementation with [23]–[25], and the results are presented in Table IX. For softsign(x), we compare our implementation with [5], and the comparison results are in Table X. In these tables, MAEc represents the MAE of the circuit output, and E_A represents the mean absolute error.

A. Implementation Results of Logarithmic Function

From Table VII, we can see that the proposed approximation method achieves several improvements compared with the previous works. Compared with the error-flattened approximation method proposed in [6], our approximation method and architecture cost 34.99% less area and 15.91% less delay while improving MAEc by 20.37%. Compared with [8], which achieves the theoretically minimum MAE by shifting the target function by a constant, our work costs 6.82% less area and 10.30% less delay while improving MAEc by 17.31%.

To make a fair comparison with [23], we set the same input fractional bit width and MAEsoft for its segmenter and get 16 segments, one more than ours. Then, we use our quantizer to quantize the circuit parameters, so that all the bit width settings are the same as ours. Finally, we use the architecture proposed in [23] for modeling and synthesis. From the results presented in Table VII, we can see that our improvements in the endpoint updating scheme and hardware architecture contribute to a 2.80% area reduction and a 3.77% power reduction.

B. Implementation Results of Hyperbolic Tangent Function

Parhi and Liu [24] took advantage of stochastic computing (SC) and Horner's rule for the Maclaurin expansions to do


TABLE VI
PARAMETERS OF ASIC IMPLEMENTATION

TABLE VII
ASIC IMPLEMENTATION RESULTS OF log2(1 + x)

TABLE VIII
ASIC IMPLEMENTATION RESULTS OF tanh(x)

the approximation. Nguyen et al. [25] used SC logic based on the PWL to approximate tanh(x) and sigmoid(x), while [26] employed SC logic to compute complex arithmetic functions. Although SC logic uses an AND gate to replace the multiplier, the linear feedback shift register (LFSR) takes quite a long time to generate the stochastic bit stream. In addition, the uniform


segmentation results in a nonflattened MAE distribution, so we compare E_A with [24]–[26]. Compared with [24], our approximation method costs 76.34% less area and 25.29% less delay while improving E_A by 87.43%. Compared with [25], our approximation method costs 68.03% less area and 24.42% less delay while improving E_A by 39.31%. Compared with [26], our approximation method costs 5.87% less area and 3.23% less delay while improving E_A by 32.31%.

Similar to the design procedure of the logarithmic function described earlier, we set the input fractional bit width to 8 and MAEsoft = 2.201e−3 for the segmenter of [23] and get 5 segments. Then, we adjust QF to 3.15, and the quantization width qw becomes 8. Under this condition, our work improves area consumption by 6.25% and MAEc by 18.86%.

C. Implementation Results of Sigmoid Function

For the sigmoid function, only two segments are required under MAEsoft = 8.17e−4 in our experiment. Compared with the fifth-order Maclaurin expansion-based approximation in [24], our work reduces the area by 89.12%, delay by 25.29%, and power consumption by 55.77% while improving E_A by 49.35%. Compared with the eight-segment PWL with SC logic implementation in [25], our work saves area, delay, and power by 85.38%, 24.86%, and 19.62%, respectively, and reduces E_A by 2.92%.

Compared with the two-segment implementation using the architecture proposed in [23], our work achieves a 16.50% reduction in area and 4.78% in power without compromising accuracy. Since the number of segments of [23] is the same as ours, the improvements in area and power are attributed to the superiority of our hardware architecture.

TABLE IX
ASIC IMPLEMENTATION RESULTS OF sigmoid(x)

TABLE X
ASIC IMPLEMENTATION RESULTS OF softsign(x)

D. Implementation Results of Softsign Function

In [5], an approximate multiplier is used to compute the softsign function to achieve lower area and power consumption. However, [5] used the uniform segmentation method, resulting in a nonflattened error distribution. Compared with the implementation results provided in [5], our method consumes 17.28% less area and 12.5% less delay, while MAEc is improved by 33.28%.

VI. CONCLUSION

This article proposes an enhanced universal PLAC method for all nonlinear unary functions, including an improved segmenter and a novel quantizer.

The PLAC segmenter raises the segmentation performance to the theoretically best performance for the logarithmic function. Compared with the state-of-the-art universal error-flattened method in [23], the proposed nonoverlapping endpoints' updating scheme reduces the number of segments, which is proved through the segmentation performance experiments in Section III-B. Also, the PLAC segmenter takes advantage of the bisection method, significantly reducing the segmentation time.

The circuits' accuracy control tests of the PLAC quantizer in Section IV-C show that the quantizer is able to completely simulate hardware behavior and has the circuits' error controlling ability, thus saving design time by controlling the circuit computing accuracy in the software experiment procedure.

Four functions are implemented in this article: log2(1 + x), tanh(x), sigmoid(x), and softsign(x). The logarithmic function is widely used in 3-D graphics applications, while the sigmoid function, hyperbolic tangent function, and softsign function are widely applied in RNNs in natural language processing and video processing. From the ASIC implementation results


in Section V, we use fewer segments to achieve higher accuracy compared with the uniform PWL in [5] and [24]–[26]. Compared with the universal error-flattened PWL in [23], our implementations also exhibit better performance due to the improvements in segmenter and hardware architecture.

APPENDIX

Table XI lists the frequently used symbols and abbreviations in this article.

TABLE XI
SYMBOLS AND ABBREVIATIONS

REFERENCES

[1] Z. Wang, J. Lin, and Z. Wang, "Accelerating recurrent neural networks: A memory-efficient approach," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2763–2775, Oct. 2017.
[2] B. Zamanlooy and M. Mirhassani, "Efficient VLSI implementation of neural networks with hyperbolic tangent activation function," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 1, pp. 39–48, Jan. 2014.
[3] L. Li, S. Zhang, and J. Wu, "An efficient hardware architecture for activation function in deep learning processor," in Proc. IEEE 3rd Int. Conf. Image, Vis. Comput. (ICIVC), Jun. 2018, pp. 911–918.
[4] I. Tsmots, O. Skorokhoda, and V. Rabyk, "Hardware implementation of sigmoid activation functions using FPGA," in Proc. IEEE 15th Int. Conf. Exper. Designing Appl. CAD Syst. (CADSM), Feb. 2019, pp. 34–38.
[5] C.-H. Chang, E.-H. Zhang, and S.-H. Huang, "Softsign function hardware implementation using piecewise linear approximation," in Proc. Int. Symp. Intell. Signal Process. Commun. Syst. (ISPACS), Dec. 2019, pp. 1–2.
[6] M. Zhu, J. Xiao, W. Wanggen, and H. A. Yajun, "Error flatten logarithm approximation for graphics processing unit," in Proc. ICM, Dec. 2011, pp. 1–6.
[7] M. Zhu, Y. Ha, C. Gu, and L. Gao, "An optimized logarithmic converter with equal distribution of relative errors," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 9, pp. 848–852, Sep. 2016.
[8] C.-W. Liu, S.-H. Ou, K.-C. Chang, T.-C. Lin, and S.-K. Chen, "A low-error, cost-efficient design procedure for evaluating logarithms to be used in a logarithmic arithmetic processor," IEEE Trans. Comput., vol. 65, no. 4, pp. 1158–1164, Apr. 2016.
[9] M. Ha and S. Lee, "Accurate hardware-efficient logarithm circuit," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 64, no. 8, pp. 967–971, Aug. 2017.
[10] A. Seth and W.-S. Gan, "Fixed-point square roots using L-b truncation [DSP tips and tricks]," IEEE Signal Process. Mag., vol. 28, no. 6, pp. 149–153, Nov. 2011.
[11] Y. Luo, Y. Wang, Y. Ha, Z. Wang, S. Chen, and H. Pan, "Generalized hyperbolic CORDIC and its logarithmic and exponential computation with arbitrary fixed base," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, no. 9, pp. 2156–2169, Sep. 2019.
[12] Y. Wang, Y. Luo, Z. Wang, Q. Shen, and H. Pan, "GH CORDIC-based architecture for computing Nth root of single-precision floating-point number," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 4, pp. 864–875, Apr. 2020.
[13] P. Nilsson, A. U. R. Shaik, R. Gangarajaiah, and E. Hertz, "Hardware implementation of the exponential function using Taylor series," in Proc. NORCHIP, Oct. 2014, pp. 1–4.
[14] M. Sybis, "Log-MAP equivalent Chebyshev inequality based algorithm for turbo TCM decoding," Electron. Lett., vol. 47, no. 18, pp. 1049–1050, 2011.
[15] D. Das Sarma and D. W. Matula, "Faithful bipartite ROM reciprocal tables," in Proc. 12th Symp. Comput. Arithmetic, Jul. 1995, pp. 17–28.
[16] M. J. Schulte and J. E. Stine, "Approximating elementary functions with symmetric bipartite tables," IEEE Trans. Comput., vol. 48, no. 8, pp. 842–847, Aug. 1999.
[17] P. Kumar Meher, "An optimized lookup-table for the evaluation of sigmoid function for artificial neural networks," in Proc. 18th IEEE/IFIP Int. Conf. VLSI Syst.-on-Chip, Sep. 2010, pp. 91–95.
[18] D. De Caro, N. Petra, and A. G. M. Strollo, "Efficient logarithmic converters for digital signal processing applications," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 667–671, Oct. 2011.
[19] D. M. Ellaithy, M. A. El-Moursy, G. H. Ibrahim, A. Zaki, and A. Zekry, "Double logarithmic arithmetic technique for low-power 3-D graphics applications," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 7, pp. 2144–2152, Jul. 2017.
[20] H. Kim, B.-G. Nam, J.-H. Sohn, J.-H. Woo, and H.-J. Yoo, "A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system," IEEE J. Solid-State Circuits, vol. 41, no. 11, pp. 2373–2381, Nov. 2006.
[21] B.-G. Nam and H.-J. Yoo, "An embedded stream processor core based on logarithmic arithmetic for a low-power 3-D graphics SoC," IEEE J. Solid-State Circuits, vol. 44, no. 5, pp. 1554–1570, May 2009.
[22] M. Loukrakpam and M. Choudhury, "Error-aware design procedure to implement hardware-efficient logarithmic circuits," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 5, pp. 851–855, May 2020.
[23] H. Sun et al., "A universal method of linear approximation with controllable error for the efficient implementation of transcendental functions," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 1, pp. 177–188, Jan. 2020.
[24] K. K. Parhi and Y. Liu, "Computing arithmetic functions using stochastic logic by series expansion," IEEE Trans. Emerg. Topics Comput., vol. 7, no. 1, pp. 44–59, Jan. 2019.
[25] V.-T. Nguyen, T.-K. Luong, H. Le Duc, and V.-P. Hoang, "An efficient hardware implementation of activation functions using stochastic computing for deep neural networks," in Proc. IEEE 12th Int. Symp. Embedded Multicore/Many-Core Syst.-on-Chip (MCSoC), Sep. 2018, pp. 233–236.
[26] Z. Qin et al., "A universal approximation method and optimized hardware architectures for arithmetic functions based on stochastic computing," IEEE Access, vol. 8, pp. 46229–46241, 2020.
[27] L. Gu, J. Huang, and L. Yang, "On the representational power of restricted Boltzmann machines for symmetric functions and Boolean functions," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 5, pp. 1335–1347, May 2019.

Hongxi Dong received the B.S. degree in electronic information engineering from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2019. She is currently working toward the master's degree at the School of Electronic Science and Engineering, Nanjing University, Nanjing.
Her current research interests include digital integrated circuit design and neural-network processing unit.


Manzhen Wang received the B.S. degree in integrated circuit design and integrated system from Xidian University, Xi'an, China, in 2018. She is currently working toward the master's degree at the School of Electronic Science and Engineering, Nanjing University, Nanjing, China.
Her current research interest includes the design of digital VLSI circuits, with an emphasis on the approximate multiplier.

Yuanyong Luo received the B.S. degree in applied physics from Jilin University, Changchun, China, in 2016, and the Ph.D. degree in electronic science and technology from Nanjing University, Nanjing, China, in 2020.
He is currently a Senior Research Engineer at the Department of Turing Architecture Design, HiSilicon, Huawei Corporation, Shenzhen, China. His research interests include novel VLSI computing methods, central processing unit, and neural-network processing unit.

Muhan Zheng received the B.S. degree in electronic information science and technology from Nanjing University, Nanjing, China, in 2019, where she is currently working toward the master's degree at the School of Electronic Science and Engineering.
Her current research interests include digital integrated circuit design and VLSI implementation of neural networks.

Mengyu An (Graduate Student Member, IEEE) received the B.S. degree in communications engineering from the Nanjing University of Science and Technology, Nanjing, China, in 2019. She is currently working toward the master's degree at the School of Electronic Science and Engineering, Nanjing University, Nanjing.
Her current research interests include computational theory and architecture and VLSI implementation of neural networks.

Yajun Ha (Senior Member, IEEE) received the B.S. degree from Zhejiang University, Hangzhou, China, in 1996, the M.Eng. degree from the National University of Singapore, Singapore, in 1999, and the Ph.D. degree from Katholieke Universiteit Leuven, Leuven, Belgium, in 2004, all in electrical engineering.
He is currently a Professor with ShanghaiTech University, Shanghai, China. Before this, he was a Scientist and the Director of the I2R-BYD Joint Lab, Institute for Infocomm Research, Singapore, and an Adjunct Associate Professor at the Department of Electrical and Computer Engineering, National University of Singapore, Singapore. Prior to this, he was an Assistant Professor with the National University of Singapore. His research interests include reconfigurable computing, ultralow power digital circuits and systems, embedded system architecture, and design tools for applications in robots, smart vehicles, and intelligent systems. He has published around 100 internationally peer-reviewed journal/conference papers on these topics.
Dr. Ha was a recipient of two IEEE/ACM Best Paper Awards. He has served as the TPC Co-Chair for ISICAS 2020, the General Co-Chair for ASP-DAC 2014, the Program Co-Chair for FPT 2010 and FPT 2013, the Chair for the Singapore Chapter of the IEEE Circuits and Systems (CAS) Society in 2011 and 2012, and a member for the ASP-DAC Steering Committee and the IEEE CAS VLSI and Applications Technical Committee. He has served a number of positions in the professional communities. He serves as the Associate Editor-in-Chief for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS from 2020 to 2021 and an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS from 2016 to 2019, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS from 2011 to 2013, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS from 2013 to 2014, and the Journal of Low Power Electronics since 2009. He has been a Program Committee Member of a number of well-known conferences in the fields of FPGAs and design tools, such as the Design Automation Conference (DAC), Design, Automation and Test in Europe Conference (DATE), Asia and South Pacific Design Automation Conference (ASP-DAC), Field Programmable Gate Array (FPGA), International Conference on Field Programmable Logic and Applications (FPL), and International Conference on Field Programmable Technology (FPT).

Hongbing Pan received the B.S. degree in applied physics and the Ph.D. degree in microelectronics and solid state electronics from Nanjing University, Nanjing, China, in 1994 and 2005, respectively.
From 2006 to 2012, he was an Associate Professor with the Institute of VLSI Design, Nanjing University, where he has been a Professor with the School of Electronic Science and Engineering since 2013. He has authored more than 40 articles. His research interests include VLSI design, CMOS sensors, reconfigurable computing, and artificial intelligence.
