Han 2020
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2020.2978839, IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems
Abstract: FFT is an essential algorithm in digital signal processing and advanced mobile communications. With the continuous development of modern technology, the area-power efficient hardware implementation of FFT has attracted a lot of attention. In this paper, a novel design for FFT implementation is proposed. The number of resource-expensive multiplications in our design is decreased by a twiddle factor merging technique that reduces the hardware area. Subsequently, a common subexpression sharing scheme is applied to reuse the hardware resources to further save the hardware area. In addition, a magnitude-response aware approximation algorithm is proposed for applications where the transformation accuracy can be compromised slightly for smaller hardware area and power dissipation. Logic synthesis shows that the proposed 16-point FFT architecture can save hardware area and power dissipation on ASIC by up to 65.7% and 53.1% compared with recently published designs. Similarly, the proposed 32-point FFT architecture achieves up to 58.8% reduction on hardware area and 60.0% reduction on power dissipation on ASIC.

Index Terms: Approximated FFT, Twiddle Factor Merging, Common Subexpression Sharing.

I. INTRODUCTION

Therefore, an area-power efficient and high speed FFT architecture has been actively researched in recent years [9]-[12].

FFT can be implemented both in high-level software languages and on hardware. A software solution has the advantage of flexibility, but it is usually accompanied by a bulky and complex hardware system which is unfriendly to small-sized and battery-based devices [13] [14]. Motivated by the challenging requirements in hardware area and power budget, many hardware solutions for FFT implementation have been developed for mobile and smart devices [15] [16]. As technology matures, Application-Specific Integrated Circuit (ASIC) has become more cost effective for some generic classes of problems in digital signal processing such as the FFT [17]. An ASIC solution provides the ability to manage power and exploit the robustness after processing in the digital domain.

Existing hardware solutions for FFT implementation can be mainly divided into reconfigurable and fixed architectures. Reconfigurable architecture [18]-[20] is generally developed for variable length FFTs. Mixed-radix algorithms are common solutions to implementing a variety of transformation lengths. Fixed FFTs can be further categorized into pipelined and parallel architectures. The most classical approaches for
0278-0070 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: University of Canberra. Downloaded on April 27,2020 at 12:48:06 UTC from IEEE Xplore. Restrictions apply.
statistical learning technique based on Normalized Least Mean Square algorithm is used to train the inexact FFTs, where separate training and test sets are requisite. The authors of [28] proposed an efficient parallel FFT architecture where the infinite-precision coefficients are approximated by cutting off their insignificant bits. However, it may not be the most hardware efficient solution when the error introduced is still less than the transformation error allowed. Another approximated parallel FFT architecture was recently proposed

A. DFT and FFT Preliminaries

The 1-D DFT of an input signal sequence x(n) is given by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad n = 0, 1, \ldots, N-1, \tag{1}$$

where X(k) is the frequency domain representation of x(n) with k = 0, 1, …, N-1. $W_N^{nk} = \exp(-j 2\pi nk / N)$ is the twiddle factor. Since $N^2$ complex multiplications and N(N-1) complex additions are required in (1), the computational complexity of an N-point DFT grows quickly with increasing N. In order to compute DFT more efficiently, FFT was proposed. The simplest and most commonly used is Cooley-Tukey radix-2 decimation-in-time (DIT) FFT [33]. Radix-2 DIT-FFT decomposes an N-point DFT into smaller DFTs by dividing the even- and odd-indexed input samples into two parts as:

$$X(k) = X_1(k) + W_N^{k} X_2(k), \tag{2}$$

where $X_1(k)$ and $X_2(k)$ are the $N/2$-point DFTs of the even- and odd-indexed input samples, respectively, each summed over indices $0$ to $N/2-1$.

Fig. 2. The butterfly block.

B. Problem Formulation

According to (1) and (2), complex multiplication (CM) and complex addition (CA) are the two arithmetic operations
required in DFT/FFT. The numbers of CMs and CAs needed in N-point DFT and DIT-FFT architectures are given in the second and third columns of TABLE I. In hardware implementation, CMs and CAs are converted to operations on their real and imaginary parts. The numbers of real multiplications (RMs) and real additions (RAs) required in N-point DFT and DIT-FFT architectures are computed and given in the fourth and fifth columns respectively in TABLE I.

TABLE I. The required resources in N-point DFT and DIT-FFT architectures.

Architecture    CMs              CAs          RMs            RAs
DFT             N^2              N(N-1)       4N^2           2N(2N-1)
DIT-FFT         (N/2)log2(N)     Nlog2(N)     2Nlog2(N)      3Nlog2(N)

In many previous works, RMs are implemented by shift and add network to achieve a low hardware cost [34] [35]. Knowing that the shifts can be implemented by direct hardwiring, addition is the core operation that requires Look-Up Table (LUT) count in FPGA. Therefore, the total number of RAs can be used to estimate the required hardware resources. To obtain a more accurate estimation, we assume all RAs are implemented with ripple carry adders (RCAs), whose complexity is proportional to their wordlength [36] [37]. By doing so, the number of full adders (FAs) in the carry chain of RCA can be used as the cost of each particular RA. For implementation on ASIC, although there are many different types of FA cell, the number of FAs needed gives valid information about the hardware complexity. Therefore, the total FA count is used as an estimation for the hardware cost of the FFT architecture, which is denoted as final_cost.

Besides the hardware cost, transformation accuracy is another major consideration. Given an input signal sequence x=[x(0), x(1), ..., x(N-1)], the exact FFT outputs X=[X(0), X(1), …, X(N-1)] are computed by (2). To implement FFT on hardware, infinite-precision twiddle factors in (2) have to be approximated to finite-precision. When approximated twiddle factors are used in (2), the outputs corresponding to the given input x can be denoted as Xappro=[Xappro(0), Xappro(1), ..., Xappro(N-1)]. The corresponding error E caused by the approximation is evaluated based on the root-mean-square error (RMSE) between Xappro and X as follows:

$$E = \sqrt{\frac{\sum_{k=0}^{N-1} \left| X(k) - X_{appro}(k) \right|^{2}}{N}}. \tag{4}$$

With the estimated hardware cost final_cost and the computed error E, the design problem can then be formulated as a minimization of final_cost provided that E does not exceed the allowed error:

$$\text{Minimize } \{ final\_cost \}, \quad \text{s.t. } E \le \delta, \tag{5}$$

where δ is the maximum RMSE allowed. To solve (5), we propose an algorithm to perform twiddle factor merging which reduces the number of multiplications in FFT architecture and increases hardware resource sharing to reduce final_cost. Moreover, a magnitude-response aware approximation approach is proposed to further reduce final_cost while E is monitored to be no more than δ. The details of the algorithm are presented in the next section.

III. THE PROPOSED METHOD

A. Twiddle Factor Merging

In an FFT architecture, multiplication is the most resource-expensive arithmetic operation. To minimize the FFT hardware cost with a limited transformation error, we propose a twiddle factor merging method to reduce the number of multiplications. Twiddle factor multiplications (TFMs) in FFT architecture can be divided into two categories: trivial TFM and nontrivial TFM. Trivial TFMs refer to the multiplications with twiddle factors which can be implemented by direct hardwiring, such as multiplying the input with -1, 1 or -j. Nontrivial TFMs are multiplications which need to be implemented by shift and add network. Only the nontrivial TFMs are actually resource-expensive multiplications, so it is their number that needs to be reduced. Based on the conventional radix-2 DIT-FFT architecture, we merge two nontrivial twiddle factors (NTFs) $W_N^{s_p}$ and $W_N^{s_{p+1}}$, in two adjacent stages p and p+1, into a new NTF, as shown in Fig. 3. r, t, u, v are four intermediate results in the FFT architecture. R is computed as:

$$\begin{aligned} R &= r + W_N^{s_p} t + W_N^{s_{p+1}} \left( u + W_N^{s_p} v \right) \\ &= r + W_N^{s_p} t + W_N^{s_{p+1}} u + W_N^{s_{p+1}} W_N^{s_p} v. \end{aligned} \tag{6}$$

Using Euler's rule to re-express the twiddle factor, we have:

$$W_N^{nk} = \exp\left( -j \frac{2\pi nk}{N} \right) = \cos\frac{2\pi nk}{N} - j \sin\frac{2\pi nk}{N}, \tag{7}$$

where cos(2πnk/N) and sin(2πnk/N) are twiddle factor coefficients (TFCs). The term $W_N^{s_{p+1}} W_N^{s_p}$ in (6) can then be represented by a single NTF:

$$W_N^{s_p} W_N^{s_{p+1}} = \cos\frac{2\pi (s_p + s_{p+1})}{N} - j \sin\frac{2\pi (s_p + s_{p+1})}{N} = W_N^{s_p + s_{p+1}}. \tag{8}$$

Therefore, each TFM in (6) is decomposed into two multiplications with nontrivial TFCs. Similar to nontrivial TFMs, nontrivial TFCs are coefficients which need to be implemented by shift and adder network. When $s_{p+1} = N/4 - (s_p + s_{p+1})$, (6) is rewritten as

$$\begin{aligned} R = {} & r + \cos\frac{2\pi s_p}{N}\, t + \cos\frac{2\pi s_{p+1}}{N}\, u + \sin\frac{2\pi s_{p+1}}{N}\, v \\ & - j \left( \sin\frac{2\pi s_p}{N}\, t + \sin\frac{2\pi s_{p+1}}{N}\, u + \cos\frac{2\pi s_{p+1}}{N}\, v \right). \end{aligned} \tag{9}$$

Fig. 3. Multiplication of nontrivial twiddle factors across stages in an N-point radix-2 DIT-FFT architecture.

Benefiting from the distributive operation of multiplication in (6) and the twiddle factor merging in (8), the computation of the
FFT output is finally re-expressed by the sum of terms where the input is multiplied with nontrivial TFCs like (9). In such case, it is highly possible that these addition terms in (9) can be shared and the number of multiplications with nontrivial TFCs is reduced correspondingly compared to the conventional radix-2 DIT-FFT architecture. For example, in stage 3 and 4 of Fig. 1, the data path in red for computing X(1) corresponds to the data path for computing R which is also marked in red in Fig. 3. The parameters in Fig. 3 are specified as $p = 3$, $W_N^{s_p} = W_{16}^{2}$, $W_N^{s_{p+1}} = W_{16}^{1}$. r, t, u, v are computed as

$$\begin{aligned} r &= (x(0) - x(8)) - j\,(x(4) - x(12)), & t &= (x(2) - x(10)) - j\,(x(6) - x(14)), \\ u &= (x(1) - x(9)) - j\,(x(5) - x(13)), & v &= (x(3) - x(11)) - j\,(x(7) - x(15)). \end{aligned}$$

component of the input x. When k is even, i.e. $W_2^{k} = 1$, elementary addition terms in X(k) are x(0)+x(8), x(1)+x(9), and so on. While k is odd (i.e. $W_2^{k} = -1$), elementary subtraction terms include x(0)-x(8), x(1)-x(9), and so on. Since the twiddle factor merging technique is applied to nontrivial TFCs, elementary addition/subtraction terms remain unchanged after merging. Because the hardware area for implementing a unified adder-subtractor (UAS) [32] operator is lower than the total hardware areas of one adder and one subtractor, we utilize the UAS to replace the adders and subtractors in the first stage of the proposed FFT architecture.

B. Common Subexpression Sharing

Hardware resources sharing exists when different nontrivial TFCs are multiplied with the same input as shown in (9) as well. In such case, multiplications of the input with these nontrivial TFCs can be implemented simultaneously using a merged structure. In this paper, we transform all TFCs into Canonical Signed Digit (CSD) representation [38]. An M-bit CSD representation $c_{M-1} c_{M-2} \cdots c_0$ of a decimal number C is derived by

C. Magnitude-response Aware Approximation to Twiddle Factor Coefficients

All the infinite-precision TFCs are firstly transformed into M-bit CSD representations by cutting off the insignificant bits of TFCs. If the precision of TFCs is specified as K-bit ($K \le M$), the truncation operation directly cuts off the (M-K) least significant bits (LSBs). The error E* caused by the truncation operation can be evaluated by (4). When a maximum allowed transformation error δ is specified for certain application, there is
usually a small margin between E* and δ. To better utilize the margin, the truncated TFCs can be further approximated by changing some digits in the TFCs to reduce total FA count as long as the error is not bigger than δ. The challenge is how to develop an efficient measure to minimize the total FA count during the approximation while the error is being kept below δ. To address this, a novel magnitude-response aware approximation approach is proposed in this section.

Two methods are considered in the proposed approximation approach. The first one is to change less significant nonzero digits of nontrivial TFCs to zero so that the number of adders used for implementing the corresponding TFMs is reduced. The second method is to change less significant nonzero digits to their complement so that opportunities for sharing CSs between nontrivial TFCs are created. These two methods are simultaneously applied to different nonzero digits of TFCs which have different impacts on the total FA count and transformation error. The effect on the total FA count by approximating the ith TFC at the jth nonzero digit counting from the most significant bit is:

$$c_{i,j} = \frac{FA_{be} - FA_{af}}{FA_{be}}, \tag{11}$$

where FAbe and FAaf are the total FA count of the corresponding FFT implementation before and after one approximation method is adopted, respectively. By evaluating the transformation error using (4), the sensitivity of the jth nonzero digit in the ith TFC with respect to transformation error is defined as

$$s_{i,j} = \frac{E_{af} - E_{be}}{E_{be}}, \tag{12}$$

where Ebe and Eaf are the transformation errors before and after the nonzero digit is changed, respectively. Changing a nonzero digit with a larger c and a smaller s leads to more effective improvement. To evaluate these two measures by using one metric, we define the gain of changing the jth nonzero digit in the ith TFC on the total FA count and transformation error as:

$$g^{i,j} = \frac{c_{i,j}}{s_{i,j}}. \tag{13}$$

Changing the nonzero digit with a larger gain contributes to a more efficient solution. Since the approximation to one nonzero digit can be done by either changing to zero or the complement, the respective gains can be denoted as $g_{zero}^{i,j}$ and $g_{comp}^{i,j}$, which are evaluated by (13). With the above denotations and definitions, two approximated algorithms are presented in the next sub-sections.

C1. Approximation based on existing common subexpressions

If CSs exist in different nontrivial TFCs, we propose an approximated FFT algorithm based on existing CSs (named as AFFT_ECS) in this paper. First of all, an optimal CS sharing solution is generated by the proposed algorithm. With the optimal CSs unchanged, the AFFT_ECS algorithm iteratively changes some of the remaining nonzero digits in TFCs which do not exist in any CS yet. In each iteration, $g_{zero}^{i,j}$ and $g_{comp}^{i,j}$ are computed for all nonzero digits by (13). The approximation is performed to the nonzero digit which has the biggest $g_{zero}^{i,j}$ or $g_{comp}^{i,j}$, provided that the corresponding transformation error Eaf is less than the maximally allowed error δ. If Eaf is bigger than δ due to the change of the nonzero digit with the biggest gain, the algorithm seeks the digit with the second biggest gain. The approximation is performed when the error caused by changing the digit with the second biggest gain is smaller than δ. Otherwise, the algorithm continues to seek the digit with the third biggest gain. The iteration continues while Eaf is always evaluated to decide if any nonzero digit should be changed. If Eaf caused by the changing of the digit with the least gain is still bigger than δ, the algorithm stops and the approximation to this set of TFCs completes. The main steps of the AFFT_ECS are summarized in Algorithm 2.

Algorithm 2 Pseudo Code of The AFFT_ECS Algorithm
Input: C, existing_CS, δ
Output: C
E = error(C);
while ( E ≤ δ ) {
    updated_C = Remove(existing_CS, C);
    D = Count_nonzerodigit(updated_C);
    n = Size(D);
    for i from 1 to n
        (gcomp[i], gzero[i]) = Gain(updated_C[i]);
    end for
    final_C = Max_Gain(gcomp, gzero, C);
    E = error(final_C);
    if E ≤ δ
        C = final_C;
    else
        C_set = Rank_Gain(gcomp, gzero, C);
        for j from 2 to n
            E = error(C_set[j]);
            if E ≤ δ
                C = C_set[j];
                break;
            end if
        end for
    end if }

The function error computes the error of the FFT implementation using the approximated TFC set C. The function Count_nonzerodigit counts the total number of nonzero digits of updated_C after removing existing CSs. For each nonzero digit in updated_C, the function Gain measures its gain. The nonzero digit that has the biggest gain is selected and the TFC set is changed by the function Max_Gain accordingly. The function Rank_Gain ranks the gains in descending order. Finally, the algorithm returns the TFC set where all the qualified nonzero digits are approximated.

C2. Approximation by creating new common subexpressions

For small size FFTs, the number of nontrivial TFCs is limited. As a consequence, it is likely that no CS can be shared by TFCs at the beginning. Moreover, even if CSs exist initially, fixing them as in the AFFT_ECS algorithm may hinder the TFCs from being further approximated to achieve a better solution. For example,
if the existing CS is located at less significant bit position in a TFC, we can only change the remaining nonzero digits at more significant bit positions, which causes bigger transformation error. In the above-mentioned two circumstances, we propose to approximate TFCs with the freedom of creating new CS by changing a nonzero digit to its complement. Eaf caused by this approximation operation may exceed the maximally allowed error δ. However, this does not mean that the approximation cannot be performed because the unacceptable error can be compensated by changing other nonzero digits in the same TFCs. To achieve this, we propose an error compensation technique to adapt TFCs for compensating the error before it is compared with δ. The algorithm starts with computing the initial transformation error using the TFCs in which one nonzero digit is changed to create new CS. Each of the remaining nontrivial digits not appearing in the new CS is changed to zero and the Eaf is re-computed correspondingly. The minimum Eaf is selected and compared with the initial transformation error after the new CS is generated. If the error decreases, the algorithm moves on to change the next nonzero digit to zero until the transformation error stops decreasing. The main steps of the algorithm are summarized in Algorithm 3. The function Change_to_zero changes a nonzero digit which does not exist in CS and returns a new TFC set compensated_C. The function Min_error selects the minimum error and returns the approximated TFC set which produces this error.

Algorithm 3 Pseudo Code of The Error Compensation Technique
Input: C
Output: C, Einitial
Einitial = error(C);
while (true) {
    D = Count_nonzerodigit(C);
    n = Size(D);
    for i from 1 to n
        compensated_C[i] = Change_to_zero(C[i]);
        E[i] = error(compensated_C[i]);
    end for
    (min_E, new_C) = Min_error(E, compensated_C);
    if min_E ≤ Einitial
        Einitial = min_E;
        C = new_C;
    else
        break;
    end if }

With the error compensation, we propose an approximated FFT algorithm by creating new CS (named as AFFT_NCS). If there are CSs existing in TFCs initially, they are ignored and all nonzero digits are considered equally when creating new CS. For each nonzero digit in TFCs, the algorithm first changes it to its complement. All the remaining nonzero digits in the same TFC take turns to be examined. Once a new CS is found, it is fixed and the algorithm stops creating more. This is because of the limited number of nonzero digits existing in TFC. When one CS is fixed, there is little chance that the remaining nonzero digits can form other CSs. The error compensation is applied when the corresponding transformation error exceeds δ. After that, the AFFT_ECS algorithm proposed in Section III.C1 is performed for further approximation. This process is applied to every nonzero digit in the same way as described above. The total FA count is computed for each implementation and the approximated TFC set which results in the lowest FA count is returned as the final solution by the AFFT_NCS. The main steps of the algorithm are summarized in Algorithm 4. The function Change_to_complement changes a particular nonzero digit to its complement and returns the approximated TFC set. All the CSs are saved in CS_set using the function Generate_CS. The function Select_CS selects all CSs that can be shared and saves them into sharable_CS. For each element in sharable_CS, the function Shortest_CS chooses the CS of the shortest length as the newly created CS for further TFC approximation.

Algorithm 4 Pseudo Code of The AFFT_NCS Algorithm
Input: C, δ
Output: final_C
D = Count_nonzerodigit(C);
n = Size(D);
for i from 1 to n
    sharable_CS = ∅;
    appro_C = ∅;
    new_C = Change_to_complement(C[i]);
    CS_set = Generate_CS(new_C);
    sharable_CS = Select_CS(CS_set);
    while (sharable_CS ≠ ∅ ) {
        new_CS = Shortest_CS(sharable_CS);
        Einitial = error(new_C);
        if Einitial ≥ δ
            (adapt_C, Einitial) = Error_Compensate(new_CS, new_C);
        end if
        if Einitial ≤ δ
            appro_C[i] = AFFT_ECS(adapt_C, new_CS, δ);
            cost[i] = FA_count(new_CS);
        end if }
end for
if appro_C ≠ ∅
    (min_cost, final_C) = Min_cost(cost);
else
    final_C = C;
end if

With the proposed twiddle factor merging technique, common subexpression sharing and magnitude-response aware approximation algorithm, a complete approximated FFT architecture design algorithm is established. First of all, the proposed twiddle factor merging technique is applied to an N-point FFT to generate nontrivial TFCs to be approximated. Next, we apply the common subexpression sharing method and magnitude-response aware approximation algorithm to further reduce the hardware complexity, with the maximally allowed transformation error δ. With the nontrivial TFCs, we first check if there are existing CSs that can be shared. If not, the AFFT_NCS algorithm is applied to approximate TFCs. Otherwise, the common subexpression sharing method is applied to provide a solution for resource sharing before the AFFT_ECS algorithm is applied to further approximate nontrivial TFCs. Though a good solution can be returned by the AFFT_ECS algorithm, the fixed CSs create a barrier for further approximation. Therefore, the AFFT_NCS algorithm is also
applied in this situation to provide an alternative even though CS exists initially. The two solutions are finally compared in terms of total FA count of the corresponding architectures. The final solution is the one with the minimum cost. The main steps of the overall algorithm (named as TFC_approximation) are summarized in Algorithm 5. The function TF_merging performs the twiddle factor merging as presented in Section III.A and returns the nontrivial TFCs. The function Initialize_TFC determines the initial precision of nontrivial TFCs by truncating insignificant bits with the constraint on transformation error. The minimum wordlength of TFCs which makes the error lower than δ is selected as initial precision. However, longer TFC wordlength than the initial length does not necessarily cause higher FA count because longer lengths can provide more opportunities for such approximation. Therefore, after applying the initial precision, the coefficient wordlength is increased until further increment can no longer contribute to FA count reduction. The function Find_CS searches CS in the truncated TFCs.

With the proposed algorithm, we generate the approximated TFCs and the minimized cost for an FFT implementation. The problem formulated in (5) is then solved. To clearly show the entire algorithm, the flow chart of the proposed TFC_approximation algorithm is presented in Fig. 4.

IV. LOGIC SYNTHESIS RESULTS AND DISCUSSION

In this section, 16- and 32-point approximated FFT architectures are designed using the proposed algorithm. They

Fig. 4. Flow chart of the TFC_approximation algorithm. Steps recoverable from the figure:
Input: all the TFCs of an N-point FFT and the maximum error allowed for implementation.
1. Perform twiddle factor merging with TF_merging to obtain the nontrivial TFCs.
2. Initialize the precision of nontrivial TFCs by Initialize_TFC; denote the wordlength as w.
3. Truncate nontrivial TFCs by Truncate at w-bit.
4. Check if CSs exist in truncated TFCs or not.
   Yes: find optimal CSs with CSSharing and approximate TFCs with AFFT_ECS (final_CE); also approximate TFCs with AFFT_NCS (final_CN); CE=FA_count(final_CE), CN=FA_count(final_CN).
   No: approximate TFCs with AFFT_NCS (final_C).
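The error measure in (4) and the wordlength initialization described for Initialize_TFC can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the TFCs are truncated as plain binary fractions (not as CSD digits), the error is evaluated on a single hypothetical test vector, and the names `truncated_twiddle`, `rmse` and `initial_wordlength` are illustrative only.

```python
import cmath
import math


def exact_twiddle(n_pts):
    """Return a function m -> W_N^m with full floating-point precision."""
    return lambda m: cmath.exp(-2j * math.pi * m / n_pts)


def truncated_twiddle(n_pts, frac_bits):
    """Return a function m -> W_N^m whose TFCs keep only frac_bits fractional bits."""
    scale = 1 << frac_bits

    def quantize(value):
        # Assumption: plain binary truncation toward zero, not CSD truncation.
        return math.trunc(value * scale) / scale

    def twiddle(m):
        exact = cmath.exp(-2j * math.pi * m / n_pts)
        return complex(quantize(exact.real), quantize(exact.imag))

    return twiddle


def dft(x, twiddle):
    """N-point DFT of x following (1), using the supplied twiddle factors."""
    n_pts = len(x)
    return [sum(x[n] * twiddle((n * k) % n_pts) for n in range(n_pts))
            for k in range(n_pts)]


def rmse(x_exact, x_appro):
    """Transformation error E of (4)."""
    total = sum(abs(a - b) ** 2 for a, b in zip(x_exact, x_appro))
    return math.sqrt(total / len(x_exact))


def initial_wordlength(x, delta, max_bits=24):
    """Smallest number of fractional bits whose error does not exceed delta."""
    reference = dft(x, exact_twiddle(len(x)))
    for bits in range(1, max_bits + 1):
        err = rmse(reference, dft(x, truncated_twiddle(len(x), bits)))
        if err <= delta:
            return bits, err
    raise ValueError("no wordlength up to max_bits meets delta")


# Hypothetical test vector; the inputs used for evaluating E are an assumption here.
x = [complex((3 * n) % 7, (5 * n) % 4) for n in range(16)]
bits, err = initial_wordlength(x, delta=0.05)
print(bits, err)
```

The search stops at the first wordlength that meets δ, which mirrors the "minimum wordlength" rule in the text; the subsequent step of increasing the wordlength to look for further FA count reduction is not modelled.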
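As a numerical check of the twiddle factor merging in Section III.A, the following Python sketch evaluates R for the 16-point example (p = 3, s_p = 2, s_{p+1} = 1) in two ways, through the two-stage form of (6) and through the merged form of (9), and compares both with bin X(1) of a direct DFT. The function names and the test vector are assumptions made for illustration.

```python
import cmath
import math


def w(n_pts, s):
    """Twiddle factor W_N^s = exp(-j*2*pi*s/N)."""
    return cmath.exp(-2j * math.pi * s / n_pts)


def stage_inputs(x):
    """r, t, u, v for X(1) of a 16-point FFT, as listed in Section III.A."""
    r = (x[0] - x[8]) - 1j * (x[4] - x[12])
    t = (x[2] - x[10]) - 1j * (x[6] - x[14])
    u = (x[1] - x[9]) - 1j * (x[5] - x[13])
    v = (x[3] - x[11]) - 1j * (x[7] - x[15])
    return r, t, u, v


def r_two_stage(r, t, u, v, n_pts, sp, sp1):
    """First line of (6): the v path passes through two twiddle multiplications."""
    return (r + w(n_pts, sp) * t) + w(n_pts, sp1) * (u + w(n_pts, sp) * v)


def r_merged(r, t, u, v, n_pts, sp, sp1):
    """Form (9), valid only when s_{p+1} = N/4 - (s_p + s_{p+1})."""
    if 4 * (sp + 2 * sp1) != n_pts:
        raise ValueError("condition of (9) is not met")
    cos_p = math.cos(2 * math.pi * sp / n_pts)
    sin_p = math.sin(2 * math.pi * sp / n_pts)
    cos_1 = math.cos(2 * math.pi * sp1 / n_pts)
    sin_1 = math.sin(2 * math.pi * sp1 / n_pts)
    first = r + cos_p * t + cos_1 * u + sin_1 * v
    second = sin_p * t + sin_1 * u + cos_1 * v
    return first - 1j * second


# Hypothetical input sequence, used only to exercise the identity.
x = [complex((n * n) % 11, n % 3) for n in range(16)]
r, t, u, v = stage_inputs(x)
direct = sum(x[n] * w(16, n) for n in range(16))  # X(1) from (1)
print(abs(r_two_stage(r, t, u, v, 16, 2, 1) - direct))
print(abs(r_merged(r, t, u, v, 16, 2, 1) - direct))
```

In the merged form the v term reuses the coefficients cos(2πs_{p+1}/N) and sin(2πs_{p+1}/N) that already multiply u, which is where the opportunity for sharing arises.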
Fig. 7. The approximated 16-point FFT architecture. Legend (symbols not recoverable): real adder; real subtractor; multiplication with sin(π/8) implemented by shift and add/subtract network; CS-shared multiplications with sin(π/8) and cos(π/8) implemented by shift and add/subtract network; R/I/C: a block to list the final outputs by combining real and imaginary parts.
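The shift and add/subtract networks named in the legend follow from the CSD representation of the TFCs. The Python sketch below converts sin(π/8) and cos(π/8) to CSD digits and multiplies by them using shifts, additions and subtractions only. It is an illustration under assumptions: the coefficients are rounded to 8 fractional bits (the paper truncates, and its wordlengths are not reproduced here), and `to_csd` and `shift_add_multiply` are hypothetical helper names.

```python
import math


def to_csd(value, frac_bits):
    """CSD digits (LSB first, each in {-1, 0, 1}) of round(value * 2**frac_bits)."""
    n = round(value * (1 << frac_bits))
    sign = -1 if n < 0 else 1
    n = abs(n)
    digits = []
    while n:
        if n & 1:
            digit = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= digit
        else:
            digit = 0
        digits.append(sign * digit)
        n >>= 1
    return digits


def shift_add_multiply(x, digits, frac_bits):
    """Multiply integer x by the CSD constant using shifts, adds and subtracts."""
    acc = 0
    for position, digit in enumerate(digits):
        if digit > 0:
            acc += x << position
        elif digit < 0:
            acc -= x << position
    return acc / (1 << frac_bits)


for name, value in (("sin(pi/8)", math.sin(math.pi / 8)),
                    ("cos(pi/8)", math.cos(math.pi / 8))):
    digits = to_csd(value, frac_bits=8)
    # One adder/subtractor is needed per nonzero digit beyond the first.
    adders = sum(1 for d in digits if d) - 1
    print(name, digits[::-1], "adders:", adders)
```

Each nonzero digit costs one shifted operand, so changing a nonzero digit to zero removes an adder, which is the effect the approximation in Section III.C exploits.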
designs are described in Verilog HDL and mapped to Xilinx Virtex7, xc7s75fgga484 FPGA device. Xilinx Vivado Design Suite v17.4 is used to synthesize the designs. The number of LUTs (#LUTs), the utilization density of LUTs, the number of IOs (#IOs), the utilization density of IOs and delays in ns of 16- and 32-point FFT designs are shown in TABLE IV. At least 41.2% and 56.4% improvements are achieved by our 16- and 32-point designs respectively over the designs of [10], [15] in terms of #LUTs. Our designs have shorter delays compared with [10] and [15]. The reason is that the merging technique reduces the number of multiplications and therefore #FAs in the critical path is reduced. The design of [10] has larger delays since the iteration operation in the CORDIC scheme leads to more adders in the critical path. AFFT4 and AFFT8 reduce #LUTs dramatically over the 16- and 32-point designs in [10] and [15] because their high transformation error tolerance causes excessively approximated TFCs, with which all the multiplications in the FFT architecture can be implemented by direct hardwiring. Compared with another excessively approximated 16-point FFT design proposed in [29], AFFT4 saves 8.4% FPGA area benefitting from the proposed techniques. The FPGA areas of the 16- and 32-point approximated FFT designs are plotted in Fig. 8. The FPGA areas of the 16- and 32-point FFT designed by using the conventional radix-2 DIT-FFT algorithm are used as the baseline. All other areas of FFT architectures by our algorithm, [10], [15] and [29] are normalized by the baseline.

TABLE III. The total FA count and transformation error of the FFT designs.

TABLE IV. Comparison between the FFT designs on FPGA

(a) 16-point FFT designs

Design     Area in #LUTs   LUT utilization density in %   #IOs   IO utilization density in %   Delay in ns
AFFT1      2616            5.45                           60     17.75                         18.84
AFFT2      2419            5.04                           58     17.16                         18.39
AFFT3      1965            4.09                           52     15.38                         16.51
AFFT4      1270            2.65                           36     10.65                         10.50
[10]       5091            10.61                          62     18.34                         27.55
[15]       4450            9.27                           60     17.75                         21.00
[29]       1392            2.90                           38     11.24                         7.89
DIT-FFT    6158            12.83                          64     18.93                         22.61

(b) 32-point FFT designs
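The LUT figures of TABLE IV(a) can be turned into relative savings and baseline-normalized areas with a few lines of Python. This is only a sketch of the arithmetic: the helper name `reduction_percent` is an assumption, and only the 16-point rows that survive in the table are used.

```python
# #LUTs of the 16-point designs, copied from TABLE IV(a).
LUTS_16_POINT = {
    "AFFT1": 2616, "AFFT2": 2419, "AFFT3": 1965, "AFFT4": 1270,
    "[10]": 5091, "[15]": 4450, "[29]": 1392, "DIT-FFT": 6158,
}
PROPOSED = ("AFFT1", "AFFT2", "AFFT3", "AFFT4")


def reduction_percent(ours, reference):
    """Relative saving of `ours` with respect to `reference`, in percent."""
    return 100.0 * (reference - ours) / reference


# Smallest saving of any proposed design over the designs of [10] and [15].
worst_case = min(
    reduction_percent(LUTS_16_POINT[design], LUTS_16_POINT[ref])
    for design in PROPOSED
    for ref in ("[10]", "[15]")
)

# Areas normalized by the conventional radix-2 DIT-FFT baseline.
normalized = {name: luts / LUTS_16_POINT["DIT-FFT"]
              for name, luts in LUTS_16_POINT.items()}

print(round(worst_case, 1))
for name in PROPOSED:
    print(name, round(normalized[name], 3))
```

The smallest saving occurs for AFFT1 against [15] and corresponds to the "at least 41.2%" figure quoted for the 16-point designs.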
[22] S.-W. Yang and J.-Y. Lee, "Constant twiddle factor multiplier sharing in multipath delay feedback parallel pipelined FFT processors," IEEE Electronics Letters, vol. 50, no. 15, pp. 1050-1052, Jul. 2014.
[23] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, "Pipelined radix-2^k feedforward FFT architectures," IEEE Transactions on Very Large Scale Integration Systems, vol. 21, no. 1, pp. 23-32, Jan. 2011.
[24] M. Bansal and S. Nakhate, "High speed pipelined 64-point FFT processor based on radix-2^2 for wireless LAN," 2017 International Conference on Signal Processing and Integrated Networks, Noida, India, Feb. 2017, pp. 607-612.
[25] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, "Highly energy-efficient and quality-tunable inexact FFT accelerators," 2014 IEEE Custom Integrated Circuits Conference, San Jose, CA, USA, Sept. 2014, pp. 1-4.
[26] Q.-J. Xing, Z.-G. Ma, and Y.-K. Xu, "A Novel Conflict-Free Parallel Memory Access Scheme for FFT Processors," IEEE Transactions on Circuits and Systems II, vol. 64, no. 11, pp. 1347-1351, Nov. 2017.
[27] H. K. Samudrala, S. Qadeer, S. Azeemuddin, and Z. Khan, "Parallel and pipelined VLSI implementation of the new radix-2 DIT FFT algorithm," 2018 IEEE International Symposium on Smart Electronic Systems, Hyderabad, India, Dec. 2018, pp. 21-26.
[28] X. Han, J. Chen, and S. Rahardja, "A new twiddle factor merging method for low complexity and high speed FFT architecture," 2019 IEEE International Circuit and System Symposium, Kuala Lumpur, Malaysia, Sept. 2019, pp. 1-4.
[29] V. Ariyarathna, D. F. G. Coelho, S. Pulipati, R. J. Cintra, F. M. Bayer, V. S. Dimitrov, and A. Madanayake, "Multibeam Digital Array Receiver Using a 16-Point Multiplierless DFT Approximation," IEEE Transactions on Antennas and Propagation, vol. 67, no. 2, pp. 925-933, Feb. 2019.
[30] Y. Ji-yang, H. Dan, L. Xin, X. Ke, and W. Lu-yuan, "Conflict-free architecture for multi-butterfly parallel processing in-place radix-r FFT," 2016 IEEE International Conference on Signal Processing, Chengdu, China, Nov. 2016, pp. 496-501.
[31] S. Mittal, "A survey of techniques for approximate computing," ACM Computing Surveys, vol. 48, no. 4, pp. 1-34, Mar. 2016.
[32] J. Ding, J. Chen, and C.-H. Chang, "A new paradigm of common subexpression elimination by unification of addition and subtraction," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1605-1617, Oct. 2016.
[33] J. W. Cooley and J. W. Tukey, "An algorithm for machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297-301, Jan. 1965.
[34] J. Chen and C.-H. Chang, "High-level synthesis algorithm for the design of reconfigurable constant multiplier," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 12, pp. 1844-1856, Dec. 2009.
[35] K. Moller, M. Kumm, M. Garrido, and P. Zipf, "Optimal shift reassignment in reconfigurable constant multiplication circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 3, pp. 710-714, Mar. 2018.
[36] B. Koyada, N. Meghana, M. O. Jaleel, and P. R. Jeripotula, "A comparative study on adders," 2017 International Conference on Wireless Communications, Signal Processing and Networking, Chennai, India, Mar. 2017, pp. 2226-2230.
[37] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximated adders," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 1, pp. 124-137, Jan. 2013.
[38] R. Kaur and T. Singh, "Design of 32-point mixed radix FFT processor using CSD multiplier," 2016 International Conference on Parallel, Distributed and Grid Computing, Waknaghat, India, Dec. 2016, pp. 538-543.
[39] J. Chen, C. H. Chang, and H. Qian, "New power index model for switching power analysis from adder graph of FIR filter," in Proc. IEEE International Symposium on Circuits and Systems, Taipei, Taiwan, May 2009, pp. 2197-2200.

Xueyu Han received her B. Eng. degree from Northwestern Polytechnical University, Xi'an, China in 2017. She is currently pursuing her Ph.D. degree at the Center of Intelligent Acoustics and Immersive Communications (CIAIC) of Northwestern Polytechnical University. Her research interests include algorithms and circuit design for digital signal processing.

Jiajia Chen received his B. Eng. (Hons) and Ph.D. from Nanyang Technological University, Singapore, in 2004 and 2010, respectively. From April 2012 to March 2018, he was a faculty member in Singapore University of Technology and Design. Since April 2018, he has been with Nanjing University of Aeronautics and Astronautics, China, where he is currently a Professor. His research interests include computational transformations of low-complexity digital circuits and digital signal processing. Dr. Chen served as Web Chair of Asia-Pacific Computer Systems Architecture Conference 2005 and as Technical Program Committee member of European Signal Processing Conference 2014 and The Third IEEE International Conference on Multimedia Big Data 2017, and has been Associate Editor of Springer EURASIP Journal on Embedded Systems since 2016.

Boyu Qin is currently pursuing the B.Eng. degree with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics. His research interests include digital circuits design and implementation.

Susanto Rahardja (F'11) is currently a Chair Professor at the Northwestern Polytechnical University (NPU) under the Thousand Talent Plan of People's Republic of China. His research interests are in multimedia, signal processing, wireless communications, discrete transforms, machine learning and signal processing algorithms and implementation. He contributed to the development of a series of audio compression technologies such as Audio Video Standards AVS-L, AVS-2 and ISO/IEC 14496-3:2005/Amd.2:2006, ISO/IEC 14496-3:2005/Amd.3:2006, some of which have been licensed to several companies. Dr Rahardja has more than 15 years of experience in leading research teams for media-related research covering areas in Signal Processing (audio coding, video/image processing), Media Analysis (text/speech, image, video), Media Security (biometrics, computer vision and surveillance) and Sensor Networks. He has published more than 300 papers and has been granted more than 70 patents worldwide, of which 15 are US patents. Professor Rahardja was a past Associate Editor of IEEE Transactions on Audio, Speech and Language Processing and IEEE Transactions on Multimedia, a past Senior Editor of the IEEE Journal of Selected Topics in Signal Processing, and is currently serving as Associate Editor for the Elsevier Journal of Visual Communication and Image Representation and IEEE Transactions on Multimedia. He was the Conference Chair of 5th ACM SIGGRAPH ASIA in 2012 and APSIPA 2nd Summit and Conference in 2010 and 2018 as well as other conferences in ACM, SPIE and IEEE. Dr Rahardja is a recipient of several honors including the IEE Hartree Premium Award, the Tan Kah Kee Young Inventors' Open Category Gold award, the Singapore National Technology Award, A*STAR Most Inspiring Mentor Award, Finalist of the 2010 World Technology & Summit Award, the Nokia Foundation Visiting Professor Award and the ACM Recognition of Service Award. Professor Rahardja graduated with a B.Eng. from National University of Singapore and received the M.Eng. and Ph.D. degrees, all in Electronic Engineering, from Nanyang Technological University, Singapore. He attended the Stanford Executive Programme at the Graduate School of Business in Stanford University, USA.