
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TCAD.2020.2978839, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

A Novel Area-Power Efficient Design for Approximated Small-Point FFT Architecture

Xueyu Han, Jiajia Chen, Boyu Qin and Susanto Rahardja, Fellow, IEEE

Abstract — FFT is an essential algorithm in digital signal processing and advanced mobile communications. With the continuous development of modern technology, the area-power efficient hardware implementation of FFT has attracted a lot of attention. In this paper, a novel design for FFT implementation is proposed. The number of resource-expensive multiplications in our design is decreased by a twiddle factor merging technique that reduces the hardware area. Subsequently, a common subexpression sharing scheme is applied to reuse the hardware resources and further save hardware area. In addition, a magnitude-response aware approximation algorithm is proposed for applications where the transformation accuracy can be compromised slightly for lower hardware area and power dissipation. Logic synthesis shows that the proposed 16-point FFT architecture can save hardware area and power dissipation on ASIC by up to 65.7% and 53.1% compared with recently published designs. Similarly, the proposed 32-point FFT architecture achieves up to 58.8% reduction in hardware area and 60.0% reduction in power dissipation on ASIC.

Index Terms — Approximated FFT, Twiddle Factor Merging, Common Subexpression Sharing.

Manuscript received 16 Oct. 2019, revised 8 Jan. 2020 and accepted 1 Mar. 2020. This work was supported by grant 56YAH18043 at Nanjing University of Aeronautics and Astronautics, Nanjing, China. (Corresponding authors: Jiajia Chen ([email protected]) and Susanto Rahardja.) Xueyu Han and Susanto Rahardja are with the School of Marine Science and Technology, Northwestern Polytechnical University, Xi'an, China. Jiajia Chen and Boyu Qin are with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China.

I. INTRODUCTION

Fast Fourier transform (FFT) is a widely used algorithm in digital signal processing [1]-[3] and wireless communication systems [4]-[8]. In Long Term Evolution (LTE) and its high-speed versions LTE-Advanced/LTE-Advanced Pro [7], FFTs with different transformation sizes are desired. Moreover, the FFT generates the required radio frequency (RF) beams for multi-beam beamforming, which is one of the significant techniques in fifth generation (5G) wireless communication [8]. Nowadays, as most mobile communications are designed to be implemented on portable devices, the embedded FFT processor is required to have low hardware area and power dissipation. It is also vital that the computation speed of the FFT processor is high enough to support the high data rate requirement. Therefore, area-power efficient and high-speed FFT architectures have been actively researched in recent years [9]-[12].

FFT can be implemented both in high-level software languages and on hardware. A software solution has the advantage of flexibility, but it is usually accompanied by a bulky and complex hardware system which is unfriendly to small-sized and battery-based devices [13] [14]. Motivated by the challenging requirements on hardware area and power budget, many hardware solutions for FFT implementation have been developed for mobile and smart devices [15] [16]. As technology matures, the Application-Specific Integrated Circuit (ASIC) has become more cost effective for some generic classes of problems in digital signal processing such as the FFT [17]. An ASIC solution provides the ability to manage power and exploit the robustness after processing in the digital domain.

Existing hardware solutions for FFT implementation can be mainly divided into reconfigurable and fixed architectures. Reconfigurable architectures [18]-[20] are generally developed for variable-length FFTs. Mixed-radix algorithms are common solutions for implementing a variety of transformation lengths. Fixed FFTs can be further categorized into pipelined and parallel architectures. The most classical approaches for pipelined FFTs [21]-[25] are the multi-path delay commutator and the single-path delay feedback. The basic structure of these two approaches consists of a processing element (PE) for data computation and the required memory for data storage. Parallel architectures [26]-[30], on the other hand, can be constructed by decomposing the FFT algorithm into several partitions and using a combination of PEs to compute these partitions in parallel. Different FFT architectures have different advantages and disadvantages in terms of hardware complexity and computation speed. Pipelined FFTs have simpler architectures but larger latency, as data are processed sequentially. In addition, a complex controller is required in pipelined architectures. On the contrary, parallel architectures can deal with N inputs simultaneously and have the advantage of easy control, but at the expense of more hardware resources.

It is challenging to implement the FFT on hardware with both high accuracy and low hardware cost. However, the accuracy can be relaxed in some applications to achieve a substantial reduction in hardware cost and power dissipation. To do this, approximation [31] has become imperative for the trade-off. Recently published works in [25] [28] [29] improved existing FFT architectures based on approximate computing. An inexact pipelined FFT accelerator was proposed in [25].


A statistical learning technique based on the Normalized Least Mean Square algorithm is used to train the inexact FFTs, where separate training and test sets are requisite. The authors of [28] proposed an efficient parallel FFT architecture where the infinite-precision coefficients are approximated by cutting off their insignificant bits. However, it may not be the most hardware-efficient solution when the error introduced is still less than the transformation error allowed. Another approximated parallel FFT architecture was recently proposed in [29], where the design was simplified to be multiplierless. Although a low area-power design is achieved in [29], the transformation error caused by its heavy approximation is relatively high.

In this paper, a novel design for an area-power efficient FFT architecture is proposed as a solution to address the above challenges. Firstly, a new twiddle factor merging technique is proposed where the low-cost unified adder-subtractor operator [32] is utilized for the merged expressions. Secondly, duplicated multiplications in the merged expressions are implemented by one sharable structure, and a dedicated common subexpression sharing algorithm is proposed to maximize the sharing. Thirdly, we propose a novel approximation approach for the FFT coefficients under different requirements of transformation error. The approach approximates the FFT coefficients while taking care of both the bit sensitivity to the transformation error and the reduction in hardware cost. The proposed approach has two advantages. The first is the ability to generate a more efficient approximation solution by utilizing the margin between the maximum error allowed and the error caused by direct truncation. The second is the ability to achieve more subexpression sharing by generating more approximated FFT coefficient candidates. They contribute to the design of efficient FFTs without the need for extra training and test sets. In addition, the approximation approach can be applied with any given transformation error constraint to provide a solution of good quality. The proposed FFT architectures are compared with the state-of-the-art, and the logic synthesis results show that our designs successfully reduce the hardware area and power dissipation.

The paper is organized as follows. Section II introduces some prerequisite preliminaries for the FFT algorithm and presents the problem formulation. Section III presents the proposed algorithm for FFT architecture design. The logic synthesis results of our architectures and competing designs are given in Section IV. Finally, a conclusion is given in Section V.

II. PRELIMINARIES AND PROBLEM FORMULATION

A. DFT and FFT Preliminaries

The 1-D DFT of an input signal sequence x(n) is given by

X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1,   (1)

where X(k) is the frequency domain representation of x(n) and W_N^{nk} = \exp(-j 2\pi nk / N) is the twiddle factor. Since N^2 complex multiplications and N(N-1) complex additions are required in (1), the computational complexity of an N-point DFT grows quickly with increasing N. In order to compute the DFT more efficiently, the FFT was proposed. The simplest and most commonly used is the Cooley-Tukey radix-2 decimation-in-time (DIT) FFT [33]. The radix-2 DIT-FFT decomposes an N-point DFT into smaller DFTs by dividing the even- and odd-indexed input samples into two parts as

X(k) = X_1(k) + W_N^k X_2(k),   (2)

where X_1(k) = \sum_{n=0}^{N/2-1} x(2n) W_{N/2}^{nk} and X_2(k) = \sum_{n=0}^{N/2-1} x(2n+1) W_{N/2}^{nk} are two N/2-point DFTs. By recursively applying (2), an N-point DFT can be decomposed into log2(N) stages as shown in Fig. 1, and each stage contains a total of N/2 2-point DFTs. The butterfly block depicted in Fig. 2 is utilized to implement these 2-point DFTs based on the symmetric property of the twiddle factors:

W_N^{nk + N/2} = -W_N^{nk}.   (3)

In each butterfly block, one complex multiplication, one complex addition and one complex subtraction are required. Therefore, the computational complexity of an N-point FFT decreases remarkably from O(N^2) to O(N log2 N) compared with the direct DFT.

Fig. 1. The 16-point DIT-FFT algorithm.

Fig. 2. The butterfly block.
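For readers who want to relate (1)-(3) to something executable, the short Python sketch below implements the radix-2 DIT recursion of (2) directly in floating point and checks it against a reference DFT. It is purely illustrative: the architectures in this paper realize the same decomposition in hardware with fixed-point twiddle factors and shift-and-add arithmetic.

import numpy as np

def dit_fft(x):
    # Radix-2 DIT FFT following (2): split into even/odd halves, recurse,
    # then combine with the twiddle factors W_N^k; (3) gives the second half.
    N = len(x)                      # N must be a power of two
    if N == 1:
        return np.asarray(x, dtype=complex)
    X1 = dit_fft(x[0::2])           # N/2-point DFT of even-indexed samples
    X2 = dit_fft(x[1::2])           # N/2-point DFT of odd-indexed samples
    W = np.exp(-2j * np.pi * np.arange(N // 2) / N)      # W_N^k for k < N/2
    return np.concatenate([X1 + W * X2, X1 - W * X2])    # uses W_N^{k+N/2} = -W_N^k

x = np.random.randn(16)
assert np.allclose(dit_fft(x), np.fft.fft(x))   # matches the exact DFT of (1)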


B. Problem Formulation

According to (1) and (2), complex multiplication (CM) and complex addition (CA) are the two arithmetic operations required in the DFT/FFT. The numbers of CMs and CAs needed in N-point DFT and DIT-FFT architectures are given in the second and third columns of TABLE I. In hardware implementation, CMs and CAs are converted to operations on their real and imaginary parts. The numbers of real multiplications (RMs) and real additions (RAs) required in N-point DFT and DIT-FFT architectures are given in the fourth and fifth columns of TABLE I, respectively.

TABLE I. The required resources in N-point DFT and DIT-FFT architectures.

Architecture   CMs            CAs         RMs           RAs
DFT            N^2            N(N-1)      4N^2          2N(2N-1)
DIT-FFT        (N/2)log2(N)   N log2(N)   2N log2(N)    3N log2(N)

In many previous works, RMs are implemented by a shift-and-add network to achieve a low hardware cost [34] [35]. Knowing that the shifts can be implemented by direct hardwiring, addition is the core operation that requires Look-Up Table (LUT) count on FPGA. Therefore, the total number of RAs can be used to estimate the required hardware resources. To have a more accurate estimation, we assume all RAs are implemented with ripple carry adders (RCAs), whose complexity is proportional to the wordlength [36] [37]. By doing so, the number of full adders (FAs) in the carry chain of the RCA can be used as the cost of each particular RA. For implementation on ASIC, although there are many different types of FA cell, the number of FAs needed gives valid information about the hardware complexity. Therefore, the total FA count is used as an estimate of the hardware cost of the FFT architecture, which is denoted as final_cost.

Besides the hardware cost, transformation accuracy is another major consideration. Given an input signal sequence x = [x(0), x(1), ..., x(N-1)], the exact FFT outputs X = [X(0), X(1), ..., X(N-1)] are computed by (2). To implement the FFT on hardware, the infinite-precision twiddle factors in (2) have to be approximated to finite precision. When approximated twiddle factors are used in (2), the outputs corresponding to the given input x can be denoted as X_appro = [X_appro(0), X_appro(1), ..., X_appro(N-1)]. The corresponding error E caused by the approximation is evaluated based on the root-mean-square error (RMSE) between X_appro and X as follows:

E = \sqrt{ \frac{1}{N} \sum_{k=0}^{N-1} \left| X(k) - X_{appro}(k) \right|^2 }.   (4)

With the estimated hardware cost final_cost and the computed error E, the design problem can then be formulated as a minimization of final_cost provided that E is less than required:

Minimize { final_cost },  s.t.  E ≤ δ,   (5)

where δ is the maximum RMSE allowed. To solve (5), we propose an algorithm that performs twiddle factor merging, which reduces the number of multiplications in the FFT architecture and increases hardware resource sharing to reduce final_cost. Moreover, a magnitude-response aware approximation approach is proposed to further reduce final_cost while E is monitored to remain no more than δ. The details of the algorithm are presented in the next section.

III. THE PROPOSED METHOD

A. Twiddle Factor Merging

In an FFT architecture, multiplication is the most resource-expensive arithmetic operation. To minimize the FFT hardware cost under a limited transformation error, we propose a twiddle factor merging method to reduce the number of multiplications. Twiddle factor multiplications (TFMs) in an FFT architecture can be divided into two categories: trivial TFMs and nontrivial TFMs. Trivial TFMs refer to multiplications with twiddle factors which can be implemented by direct hardwiring, such as multiplying the input by -1, 1 or -j. Nontrivial TFMs are multiplications which need to be implemented by a shift-and-add network. Only the nontrivial TFMs are actually resource-expensive multiplications whose number needs to be reduced. Based on the conventional radix-2 DIT-FFT architecture, we merge two nontrivial twiddle factors (NTFs), W_N^{s_p} and W_N^{s_{p+1}}, in two adjacent stages p and p+1 into a new NTF, as shown in Fig. 3. r, t, u and v are four intermediate results in the FFT architecture. R is computed as

R = r + W_N^{s_p} t + W_N^{s_{p+1}} ( u + W_N^{s_p} v )
  = r + W_N^{s_p} t + W_N^{s_{p+1}} u + W_N^{s_{p+1}} W_N^{s_p} v.   (6)

Using Euler's rule to re-express the twiddle factor, we have

W_N^{nk} = \exp(-j 2\pi nk / N) = \cos(2\pi nk / N) - j \sin(2\pi nk / N),   (7)

where cos(2πnk/N) and sin(2πnk/N) are twiddle factor coefficients (TFCs). The term W_N^{s_{p+1}} W_N^{s_p} in (6) can then be represented by a single NTF:

W_N^{s_p} W_N^{s_{p+1}} = \cos(2\pi (s_p + s_{p+1}) / N) - j \sin(2\pi (s_p + s_{p+1}) / N) = W_N^{s_p + s_{p+1}}.   (8)

Therefore, each TFM in (6) is decomposed into two multiplications with nontrivial TFCs. Similar to nontrivial TFMs, nontrivial TFCs are coefficients which need to be implemented by a shift-and-add network. When s_{p+1} = N/4 - (s_p + s_{p+1}), (6) is rewritten as

R = [ r + \cos(2\pi s_p / N) t + \cos(2\pi s_{p+1} / N) u + \sin(2\pi s_{p+1} / N) v ]
    - j [ \sin(2\pi s_p / N) t + \sin(2\pi s_{p+1} / N) u + \cos(2\pi s_{p+1} / N) v ].   (9)

Fig. 3. Multiplication of nontrivial twiddle factors across stages in an N-point radix-2 DIT-FFT architecture.


Benefiting from the distributive property of multiplication in (6) and the twiddle factor merging in (8), the computation of the FFT output is finally re-expressed as a sum of terms in which the input is multiplied with nontrivial TFCs, as in (9). In such a case, it is highly possible that these addition terms in (9) can be shared, and the number of multiplications with nontrivial TFCs is reduced correspondingly compared to the conventional radix-2 DIT-FFT architecture. For example, in stages 3 and 4 of Fig. 1, the data path in red for computing X(1) corresponds to the data path for computing R, which is also marked in red in Fig. 3. The parameters in Fig. 3 are specified as p = 3, W_N^{s_p} = W_16^2, W_N^{s_{p+1}} = W_16^1. r, t, u and v are computed as

r = (x(0) - x(8)) - j(x(4) - x(12)),  t = (x(2) - x(10)) - j(x(6) - x(14)),
u = (x(1) - x(9)) - j(x(5) - x(13)),  v = (x(3) - x(11)) - j(x(7) - x(15)).

By using (9), the computation of X(1) is finally expressed as

X(1) = (x(0) - x(8)) + \sin(2\pi/8) [ (x(2) - x(10)) - (x(6) - x(14)) ] - \sin(\pi/8) [ (x(5) - x(13)) - (x(3) - x(11)) ] + \cos(\pi/8) [ (x(1) - x(9)) - (x(7) - x(15)) ]
       - j { (x(4) - x(12)) + \sin(2\pi/8) [ (x(2) - x(10)) + (x(6) - x(14)) ] + \cos(\pi/8) [ (x(5) - x(13)) + (x(3) - x(11)) ] + \sin(\pi/8) [ (x(1) - x(9)) + (x(7) - x(15)) ] }.

Similar to the computation of X(1), the other outputs with k = 0, 2, 3, ..., 15 can be computed by using the proposed twiddle factor merging method. It is obvious that only 6 multiplications with nontrivial TFCs are required for computing X(1) using the proposed twiddle factor merging method, while 8 multiplications are required for the nontrivial TFCs (i.e. 3 complex nontrivial TFMs) in Fig. 1. In addition, among the 6 multiplications for computing X(1), the two multiplications with sin(2π/8) are common to X(3), X(5) and X(7), and the remaining four multiplications with sin(π/8) and cos(π/8) are common to X(7). Structures for implementing these common terms can be reused in the FFT architecture, and therefore a large amount of hardware resources can be saved.

A basic structure for implementing the above common terms is the addition/subtraction of two input signals. We define the elementary addition/subtraction term as the computation of the sum/difference of two input signals. Let us take the 16-point FFT as an example. By recursively using (2), the 16-point FFT is expressed as

X(k) = [ (x(0) + W_2^k x(8)) + W_4^k (x(4) + W_2^k x(12)) ] + W_8^k [ (x(2) + W_2^k x(10)) + W_4^k (x(6) + W_2^k x(14)) ] + W_{16}^k [ (x(1) + W_2^k x(9)) + W_4^k (x(5) + W_2^k x(13)) + W_8^k ( (x(3) + W_2^k x(11)) + W_4^k (x(7) + W_2^k x(15)) ) ],

where k is the kth frequency component of the input x. When k is even, i.e. W_2^k = 1, the elementary addition terms in X(k) are x(0)+x(8), x(1)+x(9), and so on. When k is odd (i.e. W_2^k = -1), the elementary subtraction terms include x(0)-x(8), x(1)-x(9), and so on. Since the twiddle factor merging technique is applied to nontrivial TFCs, the elementary addition/subtraction terms remain unchanged after merging. Because the hardware area for implementing a unified adder-subtractor (UAS) [32] operator is lower than the total hardware area of one adder and one subtractor, we utilize the UAS to replace the adders and subtractors in the first stage of the proposed FFT architecture.

B. Common Subexpression Sharing

Hardware resource sharing also exists when different nontrivial TFCs are multiplied with the same input, as shown in (9). In such a case, multiplications of the input with these nontrivial TFCs can be implemented simultaneously using a merged structure. In this paper, we transform all TFCs into the Canonical Signed Digit (CSD) representation [38]. An M-bit CSD representation c_{M-1} c_{M-2} ... c_0 of a decimal number C is derived by

C = \sum_{i=0}^{M-1} c_i 2^i,   (10)

where c_i \in \{\bar{1}, 0, 1\} and \bar{1} denotes -1. In the CSD representation, a weight-two subexpression is defined as a string of 0s starting and ending with nonzero digits. If one subexpression appears more than once, it is called a common subexpression (CS). Structures that implement a CS in different TFCs can be merged into a single structure in which the hardware resources are reused. Therefore, we propose a CS sharing algorithm, which is summarized in Algorithm 1. In an N-point FFT, we denote the nontrivial TFCs which are multiplied with the same input as a set C. The algorithm starts by generating all CSs that can be shared and storing them into a set CS_Share using the function Find_CS. The function Size counts the total number of CSs in CS_Share. For each CS, the function Remove is used to remove it from C before Find_CS finds whether there is another CS that can be further shared. With the selected CSs, the hardware cost of the corresponding FFT architecture is evaluated by the function FA_count. At last, the CSs resulting in a minimum hardware cost are chosen by the function Min_cost and returned by the algorithm.

Algorithm 1 Pseudo Code of The CS Sharing Algorithm
Input: C
Output: CS_Final
CS_Share = Find_CS(C);
n = Size(CS_Share);
for i from 1 to n
    CS_Selected = CS_Share[i];
    updated_C = Remove(CS_Share[i], C);
    CS_Additional = Find_CS(updated_C);
    CS_Selected = Insert(CS_Selected, CS_Additional);
    cost[i] = FA_count(CS_Selected);
end for
CS_Final = Min_cost(cost);

C. Magnitude-response Aware Approximation to Twiddle Factor Coefficients

All the infinite-precision TFCs are first transformed into M-bit CSD representations by cutting off the insignificant bits of the TFCs. If the precision of the TFCs is specified as K bits (K ≤ M), the truncation operation directly cuts off the (M-K) least significant bits (LSBs). The error E* caused by the truncation operation can be evaluated by (4).
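A CSD representation satisfying (10) can be generated with a short routine. The sketch below shows one standard way to do it (this is not code from the paper): a TFC such as cos(π/8) is first quantized to M fractional bits and then converted digit by digit so that no two adjacent digits are nonzero.

import math

def to_csd(n):
    # CSD digits of the integer n (LSB first), each in {-1, 0, +1},
    # with no two adjacent nonzero digits, as used in (10).
    digits = []
    while n != 0:
        if n % 2 == 0:
            digits.append(0)
            n //= 2
        else:
            d = 2 - (n % 4)         # n mod 4 == 1 -> +1, n mod 4 == 3 -> -1
            digits.append(d)
            n = (n - d) // 2
    return digits

M = 8                                            # fractional wordlength
tfc = round(math.cos(math.pi / 8) * (1 << M))    # quantized TFC cos(pi/8)
digits = to_csd(tfc)
assert sum(d * (1 << i) for i, d in enumerate(digits)) == tfc   # (10) holds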


When a maximum allowed transformation error δ is specified for a certain application, there is usually a small margin between E* and δ. To better utilize this margin, the truncated TFCs can be further approximated by changing some digits in the TFCs to reduce the total FA count, as long as the error does not exceed δ. The challenge is how to develop an efficient measure that minimizes the total FA count during the approximation while the error is kept below δ. To address this, a novel magnitude-response aware approximation approach is proposed in this section.

Two methods are considered in the proposed approximation approach. The first is to change less significant nonzero digits of nontrivial TFCs to zero so that the number of adders used for implementing the corresponding TFMs is reduced. The second is to change less significant nonzero digits to their complement so that opportunities for sharing CSs between nontrivial TFCs are created. These two methods are simultaneously applied to different nonzero digits of the TFCs, which have different impacts on the total FA count and the transformation error. The effect on the total FA count of approximating the ith TFC at the jth nonzero digit, counting from the most significant bit, is

c_{i,j} = ( FA_{be} - FA_{af} ) / FA_{be},   (11)

where FA_be and FA_af are the total FA counts of the corresponding FFT implementation before and after one approximation method is adopted, respectively. By evaluating the transformation error using (4), the sensitivity of the jth nonzero digit in the ith TFC with respect to the transformation error is defined as

s_{i,j} = ( E_{af} - E_{be} ) / E_{be},   (12)

where E_be and E_af are the transformation errors before and after the nonzero digit is changed, respectively. It is obvious that changing a nonzero digit with a larger c and a smaller s leads to a more effective improvement. To evaluate these two measures with one metric, we define the gain of changing the jth nonzero digit in the ith TFC, in terms of the total FA count and the transformation error, as

g_{i,j} = c_{i,j} / s_{i,j}.   (13)

It is evident that changing the nonzero digit with a larger gain contributes a more efficient solution. Since the approximation of one nonzero digit can be done by either changing it to zero or to its complement, the respective gains are denoted as g_{zero}^{i,j} and g_{comp}^{i,j}, which are evaluated by (13). With the above notations and definitions, two approximation algorithms are presented in the next subsections.

C1. Approximation based on existing common subexpressions

If CSs exist in different nontrivial TFCs, we propose an approximated FFT algorithm based on existing CSs (named AFFT_ECS) in this paper. First of all, an optimal CS sharing solution is generated by the proposed algorithm. With the optimal CSs unchanged, the AFFT_ECS algorithm iteratively changes some of the remaining nonzero digits in TFCs which do not yet exist in any CS. In each iteration, g_{zero}^{i,j} and g_{comp}^{i,j} are computed for all nonzero digits by (13). The approximation is performed on the nonzero digit which has the biggest gain g_{zero}^{i,j} or g_{comp}^{i,j}, provided that the corresponding transformation error E_af is less than the maximally allowed error δ. If E_af is bigger than δ due to the change of the nonzero digit with the biggest gain, the algorithm seeks the digit with the second biggest gain. The approximation is performed when the error caused by changing the digit with the second biggest gain is smaller than δ. Otherwise, the algorithm continues to seek the digit with the third biggest gain. The iteration continues while E_af is always evaluated to decide if any nonzero digit should be changed. If the E_af caused by changing the digit with the least gain is still bigger than δ, the algorithm stops and the approximation of this set of TFCs is complete. The main steps of AFFT_ECS are summarized in Algorithm 2.

Algorithm 2 Pseudo Code of The AFFT_ECS Algorithm
Input: C, existing_CS, δ
Output: C
E = error(C);
while (E ≤ δ) {
    updated_C = Remove(existing_CS, C);
    D = Count_nonzerodigit(updated_C);
    n = Size(D);
    for i from 1 to n
        (gcomp[i], gzero[i]) = Gain(updated_C[i]);
    end for
    final_C = Max_Gain(gcomp, gzero, C);
    E = error(final_C);
    if E ≤ δ
        C = final_C;
    else
        C_set = Rank_Gain(gcomp, gzero, C);
        for j from 2 to n
            E = error(C_set[j]);
            if E ≤ δ
                C = C_set[j];
                break;
            end if
        end for
    end if }

The function error computes the error of the FFT implementation using the approximated TFC set C. The function Count_nonzerodigit counts the total number of nonzero digits of updated_C after removing the existing CSs. For each nonzero digit in updated_C, the function Gain measures its gain. The nonzero digit that has the biggest gain is selected and the TFC set is changed by the function Max_Gain accordingly. The function Rank_Gain ranks the gains in descending order. Finally, the algorithm returns the TFC set in which all the qualified nonzero digits are approximated.

C2. Approximation by creating new common subexpressions

For small-size FFTs, the number of nontrivial TFCs is limited. As a consequence, it is likely that no CS can be shared by the TFCs at the beginning. Moreover, even if CSs exist initially, fixing them as in the AFFT_ECS algorithm may hinder the TFCs from being further approximated to achieve a better solution.


For example, if an existing CS is located at a less significant bit position in a TFC, we can only change the remaining nonzero digits at more significant bit positions, which causes a bigger transformation error. In the above-mentioned two circumstances, we propose to approximate the TFCs with the freedom of creating a new CS by changing a nonzero digit to its complement. The E_af caused by this approximation operation may exceed the maximally allowed error δ. However, this does not mean that the approximation cannot be performed, because the unacceptable error can be compensated by changing other nonzero digits in the same TFCs. To achieve this, we propose an error compensation technique that adapts the TFCs to compensate the error before it is compared with δ. The technique starts by computing the initial transformation error using the TFCs in which one nonzero digit is changed to create the new CS. Each of the remaining nonzero digits not appearing in the new CS is changed to zero and E_af is re-computed correspondingly. The minimum E_af is selected and compared with the initial transformation error after the new CS is generated. If the error decreases, the algorithm moves on to change the next nonzero digit to zero, until the transformation error stops decreasing. The main steps of the technique are summarized in Algorithm 3. The function Change_to_zero changes a nonzero digit which does not exist in a CS and returns a new TFC set compensated_C. The function Min_error selects the minimum error and returns the approximated TFC set which produces this error.

Algorithm 3 Pseudo Code of The Error Compensation Technique
Input: C
Output: C, Einitial
Einitial = error(C);
while (true) {
    D = Count_nonzerodigit(C);
    n = Size(D);
    for i from 1 to n
        compensated_C[i] = Change_to_zero(C[i]);
        E[i] = error(compensated_C[i]);
    end for
    (min_E, new_C) = Min_error(E, compensated_C);
    if min_E ≤ Einitial
        Einitial = min_E;
        C = new_C;
    else
        break;
    end if }

With the error compensation, we propose an approximated FFT algorithm that creates new CSs (named AFFT_NCS). If there are CSs existing in the TFCs initially, they are ignored and all nonzero digits are considered equally when creating the new CS. For each nonzero digit in the TFCs, the algorithm first changes it to its complement. All the remaining nonzero digits in the same TFC take turns to be examined. Once a new CS is found, it is fixed and the algorithm stops creating more. This is because of the limited number of nonzero digits existing in a TFC: when one CS is fixed, there is little chance that the remaining nonzero digits can form other CSs. The error compensation is applied when the corresponding transformation error exceeds the allowed limit. After that, the AFFT_ECS algorithm proposed in Section III.C1 is performed for further approximation. This process is applied to every nonzero digit in the same way as described above. The total FA count is computed for each implementation, and the approximated TFC set which results in the lowest FA count is returned as the final solution by AFFT_NCS. The main steps of the algorithm are summarized in Algorithm 4. The function Change_to_complement changes a particular nonzero digit to its complement and returns the approximated TFC set. All the CSs are saved in CS_set using the function Generate_CS. The function Select_CS selects all CSs that can be shared and saves them into sharable_CS. For each element in sharable_CS, the function Shortest_CS chooses the CS of the shortest length as the newly created CS for further TFC approximation.

Algorithm 4 Pseudo Code of The AFFT_NCS Algorithm
Input: C, δ
Output: final_C
D = Count_nonzerodigit(C);
n = Size(D);
for i from 1 to n
    sharable_CS = ∅;
    appro_C = ∅;
    new_C = Change_to_complement(C[i]);
    CS_set = Generate_CS(new_C);
    sharable_CS = Select_CS(CS_set);
    while (sharable_CS ≠ ∅) {
        new_CS = Shortest_CS(sharable_CS);
        Einitial = error(new_C);
        if Einitial ≥ δ
            (adapt_C, Einitial) = Error_Compensate(new_CS, new_C);
        end if
        if Einitial ≤ δ
            appro_C[i] = AFFT_ECS(adapt_C, new_CS, δ);
            cost[i] = FA_count(new_CS);
        end if }
end for
if appro_C ≠ ∅
    (min_cost, final_C) = Min_cost(cost);
else
    final_C = C;
end if

With the proposed twiddle factor merging technique, common subexpression sharing and magnitude-response aware approximation algorithm, a complete approximated FFT architecture design algorithm is established. First of all, the proposed twiddle factor merging technique is applied to an N-point FFT to generate the nontrivial TFCs to be approximated. Next, we apply the common subexpression sharing method and the magnitude-response aware approximation algorithm to further reduce the hardware complexity under the maximally allowed transformation error δ. With the nontrivial TFCs, we first check whether there are existing CSs that can be shared. If not, the AFFT_NCS algorithm is applied to approximate the TFCs. Otherwise, the common subexpression sharing method is applied to provide a solution for resource sharing before the AFFT_ECS algorithm is applied to further approximate the nontrivial TFCs. Though a good solution can be returned by the AFFT_ECS algorithm, the fixed CSs create a barrier for further approximation. Therefore, the AFFT_NCS algorithm is also applied in this situation to provide an alternative even though CSs exist initially. The two solutions are compared at last in terms of the total FA count of the corresponding architectures. The final solution is the one with the minimum cost.
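The routines error and Gain that Algorithms 1-4 rely on are kept abstract in the pseudocode. The sketch below shows one plausible realization (our own illustration, not the authors' implementation), in which an approximated twiddle factor table is folded back into a 16-point DIT-FFT model, the RMSE of (4) is evaluated, and the gain of (11)-(13) is formed from the FA counts and errors before and after a digit change.

import numpy as np

def approx_fft(x, tf):
    # DIT-FFT that looks up (possibly approximated) twiddle factors in tf[(n, k)].
    n = len(x)
    if n == 1:
        return np.asarray(x, dtype=complex)
    X1, X2 = approx_fft(x[0::2], tf), approx_fft(x[1::2], tf)
    W = np.array([tf[(n, k)] for k in range(n // 2)])
    return np.concatenate([X1 + W * X2, X1 - W * X2])

def error(x, tf):
    # Transformation error E of (4) for input x under the TFC table tf.
    diff = np.fft.fft(x) - approx_fft(x, tf)
    return np.sqrt(np.mean(np.abs(diff) ** 2))

def gain(x, tf_before, tf_after, fa_before, fa_after):
    # Gain (13): FA-count saving (11) divided by error sensitivity (12).
    c = (fa_before - fa_after) / fa_before
    e_be, e_af = error(x, tf_before), error(x, tf_after)
    s = (e_af - e_be) / e_be        # assumes e_be > 0 (truncated TFCs) and s != 0
    return c / s

def quantize(z, bits=8):
    # Round a complex twiddle factor to the given number of fractional bits.
    scale = 1 << bits
    return complex(round(z.real * scale), round(z.imag * scale)) / scale

tf_exact = {(n, k): np.exp(-2j * np.pi * k / n)
            for n in (2, 4, 8, 16) for k in range(n // 2)}
tf_trunc = {key: quantize(w) for key, w in tf_exact.items()}

x = np.zeros(16); x[0] = 1.0        # unit impulse test input, as used in Section IV
print(error(x, tf_trunc))           # E of (4) for the truncated twiddle factors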


The main steps of the overall algorithm (named TFC_approximation) are summarized in Algorithm 5. The function TF_merging performs the twiddle factor merging as presented in Section III.A and returns the nontrivial TFCs. The function Initialize_TFC determines the initial precision of the nontrivial TFCs by truncating insignificant bits under the constraint on the transformation error. The minimum wordlength of the TFCs which makes the error lower than δ is selected as the initial precision. However, a TFC wordlength longer than the initial length does not necessarily cause a higher FA count, because longer wordlengths can provide more opportunities for approximation. Therefore, after applying the initial precision, the coefficient wordlength is increased until a further increment no longer contributes to FA count reduction. The function Find_CS searches for CSs in the truncated TFCs.

With the proposed algorithm, we generate the approximated TFCs and the minimized cost for an FFT implementation. The problem formulated in (5) is then solved. To clearly show the entire algorithm, the flow chart of the proposed TFC_approximation algorithm is presented in Fig. 4.

Algorithm 5 Pseudo Code of The TFC_approximation Algorithm
Input: coefficients, δ
Output: final_C, final_cost
Initialize final_cost = ∞;
C = TF_merging(coefficients);
minimum_wordlength = Initialize_TFC(C, δ);
for w from minimum_wordlength (increment w = w + 1 each iteration)
    truncated_C = Truncate(C, w);
    existing_CS = Find_CS(truncated_C);
    if existing_CS ≠ ∅
        fixed_CS = CSSharing(truncated_C);
        approximated_CE = AFFT_ECS(truncated_C, fixed_CS, δ);
        approximated_CN = AFFT_NCS(truncated_C, δ);
        cost_ECS = FA_count(approximated_CE);
        cost_NCS = FA_count(approximated_CN);
        if cost_ECS < cost_NCS
            approximated_C = approximated_CE;
        else
            approximated_C = approximated_CN;
        end if
    else
        approximated_C = AFFT_NCS(truncated_C, δ);
    end if
    cost = FA_count(approximated_C);
    if cost > final_cost
        break;
    else
        final_cost = cost;
        final_C = approximated_C;
    end if
end for

Fig. 4. Flow chart of the proposed TFC_approximation algorithm.

IV. LOGIC SYNTHESIS RESULTS AND DISCUSSION

In this section, 16- and 32-point approximated FFT architectures are designed using the proposed algorithm. They are compared with two recently published works, and the results obtained by logic synthesis are presented and discussed.

A. Design Example

We present the design flow of the 16-point FFT architectures to demonstrate the proposed algorithm. Since the proposed algorithm needs the maximum allowed error to run, we assume four specific transformation error requirements. The first one, δ1 = 5.3e-5, is the transformation error of one competing design in [15], where the precision of the TFCs is 10 bits such that the FFT implementation is virtually exact. If the maximum error allowed is relaxed to δ2 = 1.9e-3 and δ3 = 1.9e-2, there is more freedom to approximate the TFCs to achieve a lower FA cost. The last one, δ4 = 1.7e-1, is the transformation error of another competing design proposed in [29]. The TFCs in [29] were approximated heavily, so the corresponding error is much higher.


Therefore, we set δ4 as the biggest error requirement in our experiment. For ease of comparing the four transformation errors, we assume a unit impulse input in the time domain (as shown in Fig. 5(a)) and transform it into its frequency domain representation. The frequency magnitude responses of the exact FFT and of the approximated FFTs with the maximum allowed error being 5.3e-5, 1.9e-3, 1.9e-2 and 1.7e-1 are shown in Fig. 5(b), (c), (d), (e) and (f), respectively. The four transformation errors demonstrate examples of virtually exact, moderately and excessively approximated FFTs. Because the algorithmic flow is the same for different errors, we take δ3 = 1.9e-2 as one example. The first step is to perform the twiddle factor merging. All the expressions for the computation of the FFT outputs are re-expressed by merging nontrivial TFCs. The second step is approximation. All the nontrivial TFCs are processed by the proposed approach under the constraint that the transformation error is no more than 1.9e-2. The algorithm terminates when the wordlength of the TFCs is 7 bits, along with the usage of the common subexpression sharing scheme, and the final total FA count is 1033.

Fig. 5. (a) A unit impulse input in the time domain. (b) The frequency magnitude response of the input by the exact FFT. (c) The frequency magnitude response of the input by the approximated FFT (transformation error 5.3e-5). (d) The frequency magnitude response of the input by the approximated FFT (transformation error 1.9e-3). (e) The frequency magnitude response of the input by the approximated FFT (transformation error 1.9e-2). (f) The frequency magnitude response of the input by the approximated FFT (transformation error 1.7e-1).

TABLE II shows the quantitative improvement of each proposed method in our algorithm. The design of [15] is chosen as the competing method. We first apply the twiddle factor merging (TFMerging) technique to a radix-2 DIT-FFT architecture which has the same transformation error as the design of [15]. The FA count saving by the merging technique is 472. Next, we apply the magnitude-response aware approximation (TFC_approximation) algorithm to the nontrivial TFCs together with the common subexpression sharing (CSSharing), which reduces the FA count by a further 179. The reason why we apply TFC_approximation and CSSharing together is that the objective of the approximation is to create more opportunity for sharing, and the benefit of the approximation must be realized by the subsequent sharing. From TABLE II, it is obvious that the TFMerging technique brings a significant improvement in FA count reduction. Moreover, the proposed TFC_approximation algorithm creates opportunities for reusing hardware, which are realized by CSSharing; an additional 179 FAs are saved consequently.

TABLE II. The quantitative improvement of each proposed method.

Technique                                      FA count   Contribution to improvement
[15]                                           1684       --
Saving by TFMerging                            472        72.5%
Saving by TFC_approximation and CSSharing      179        27.5%
Total saving                                   651        100%

With the approximated TFCs returned by the approximation algorithm, the CS-shared structure for implementing cos(π/8) and sin(π/8) is shown in Fig. 6. The 16-point approximated FFT architecture for this design example is depicted in Fig. 7. The UAS operator performs the addition and subtraction of two input signals simultaneously. The RIC block combines the real and imaginary parts generated by the proposed architecture and lists all the final outputs of the 16-point FFT. It is obvious that only 12 multiplications with nontrivial TFCs (one shared structure performs two multiplications at a time) are required in this design. This is much less than in Fig. 1, where 40 multiplications with nontrivial TFCs are involved.

Fig. 6. CS-shared structure when the maximum error allowed is 1.9e-2.

B. Results and Discussion

The above four 16-point approximated FFT architectures produced by our algorithm are compared with three state-of-the-art works proposed in [10], [15] and [29]. Since the FFT processor in [15] is multi-radix and supports a number of transformation sizes, we extract the 16-point FFT design by using its radix-16 core. The extracted 16-point FFT core only involves the hardware necessary to compute the 16-point FFT; redundant resources which would cause an unfair comparison are not included. Similarly, 32-point FFT architectures are also designed using the proposed algorithm and compared with [15]. As the design in [29] is for 16-point approximated FFT implementation only, we do not compare our 32-point architectures with [29]. The total FA count (#FAs) and the error of the FFT designs are given in TABLE III. Among the proposed FFT designs, AFFT1 and AFFT5 have the same transformation errors as the 16- and 32-point designs in [15], respectively, while the others are less accurate.
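As a quick cross-check of the TABLE II breakdown above, the reported savings are consistent with the 1033-FA result quoted for δ3 = 1.9e-2; the numbers below are taken directly from TABLE II and TABLE III.

baseline = 1684                       # FA count of the design in [15]
save_merge, save_share = 472, 179     # savings by TFMerging and by TFC_approximation + CSSharing
total = save_merge + save_share       # 651 FAs saved in total
assert baseline - total == 1033       # final FA count of AFFT3 in TABLE III
print(round(100 * save_merge / total, 1), round(100 * save_share / total, 1))   # 72.5, 27.5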


Fig. 7. The approximated 16-point FFT architecture.

To compare the hardware cost more accurately, all designs are described in Verilog HDL and mapped to a Xilinx Virtex7 xc7s75fgga484 FPGA device. Xilinx Vivado Design Suite v17.4 is used to synthesize the designs. The number of LUTs (#LUTs), the LUT utilization density, the number of IOs (#IOs), the IO utilization density and the delays in ns of the 16- and 32-point FFT designs are shown in TABLE IV. At least 41.2% and 56.4% improvements in #LUTs are achieved by our 16- and 32-point designs, respectively, over the designs of [10] and [15]. Our designs also have shorter delays compared with [10] and [15]. The reason is that the merging technique reduces the number of multiplications, and therefore the number of FAs in the critical path is reduced. The design of [10] has larger delays since the iterative operation in its CORDIC scheme leads to more adders in the critical path. The reason why AFFT4 and AFFT8 reduce #LUTs dramatically over the 16- and 32-point designs in [10] and [15] is that their high transformation error tolerance leads to excessively approximated TFCs, with which all the multiplications in the FFT architecture can be implemented by direct hardwiring. Compared with another excessively approximated 16-point FFT design proposed in [29], AFFT4 saves 8.4% FPGA area, benefitting from the proposed techniques. The FPGA areas of the 16- and 32-point approximated FFT designs are plotted in Fig. 8. The FPGA areas of the 16- and 32-point FFTs designed using the conventional radix-2 DIT-FFT algorithm are used as the baseline, and all other areas of the FFT architectures by our algorithm, [10], [15] and [29] are normalized by this baseline.

TABLE III. The total FA count and transformation error of the FFT designs.

16-point FFT                          32-point FFT
Design    #FAs    Error               Design    #FAs     Error
AFFT1     1205    5.3e-5              AFFT5     4280     1.4e-4
AFFT2     1155    1.9e-3              AFFT6     3974     1.9e-3
AFFT3     1033    1.9e-2              AFFT7     3160     1.9e-2
AFFT4     623     1.7e-1              AFFT8     2085     1.7e-1
[10]      2063    5.7e-3              [10]      9148     1.9e-1
[15]      1684    5.3e-5              [15]      8945     1.4e-4
[29]      700     1.7e-1              --        --       --
DIT-FFT   3178    1.4e-3              DIT-FFT   13701    2.0e-3

TABLE IV. Comparison between the FFT designs on FPGA.

(a) 16-point FFT designs
Design    Area in #LUTs   LUT utilization (%)   #IOs   IO utilization (%)   Delay (ns)
AFFT1     2616            5.45                  60     17.75                18.84
AFFT2     2419            5.04                  58     17.16                18.39
AFFT3     1965            4.09                  52     15.38                16.51
AFFT4     1270            2.65                  36     10.65                10.50
[10]      5091            10.61                 62     18.34                27.55
[15]      4450            9.27                  60     17.75                21.00
[29]      1392            2.90                  38     11.24                7.89
DIT-FFT   6158            12.83                 64     18.93                22.61

(b) 32-point FFT designs
Design    Area in #LUTs   LUT utilization (%)   #IOs   IO utilization (%)   Delay (ns)
AFFT5     9045            18.84                 62     18.34                21.62
AFFT6     8297            17.29                 60     17.75                21.43
AFFT7     6248            13.02                 54     15.98                19.34
AFFT8     4028            8.39                  42     12.43                14.32
[10]      21176           45.37                 76     22.49                35.25
[15]      20733           43.19                 86     25.44                27.70
[29]      --              --                    --     --                   --
DIT-FFT   24048           50.10                 74     21.89                26.99
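The at-least-41.2% and 56.4% #LUT improvements quoted above can be reproduced from TABLE IV; they appear to correspond to the worst case for the proposed designs, namely AFFT1 versus [15] and AFFT5 versus [15].

luts = {"AFFT1": 2616, "[15] 16-pt": 4450, "AFFT5": 9045, "[15] 32-pt": 20733}   # from TABLE IV
print(round(100 * (1 - luts["AFFT1"] / luts["[15] 16-pt"]), 1))   # 41.2
print(round(100 * (1 - luts["AFFT5"] / luts["[15] 32-pt"]), 1))   # 56.4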


The on-chip memory requirements of the designs are presented in TABLE V. The proposed FFT implementations have a lower register cost, because the proposed merging technique reduces the number of temporary TFM partial products which require storage.

TABLE V. The on-chip memory requirement of the FFT designs.

16-point FFT                                    32-point FFT
Design    #registers   Utilization (%)          Design    #registers   Utilization (%)
AFFT1     1406         1.46                     AFFT5     3035         3.16
AFFT2     1380         1.44                     AFFT6     2993         3.12
AFFT3     1296         1.35                     AFFT7     2721         2.83
AFFT4     1060         1.10                     AFFT8     2297         2.39
[10]      1776         1.85                     [10]      5742         5.98
[15]      1726         1.80                     [15]      6100         6.35
[29]      1250         1.30                     --        --           --
DIT-FFT   2372         2.47                     DIT-FFT   6774         7.06

Fig. 8. Normalized FPGA areas of 16- and 32-point FFT designs.

To verify the performance on ASIC, all the designs are also mapped to a 45nm standard cell library and synthesized by Synopsys Design Compiler. We choose the same cell library and synthesize all the designs using the same version of Design Compiler to make sure that the same FA cells are used to implement the different designs for a fair comparison. The synthesized areas in µm2 and delays in ns of the FFT designs are shown in TABLE VI. The normalized ASIC areas of the FFT architectures designed by the proposed algorithm, [10], [15] and [29] are plotted in Fig. 9. Similarly, the areas of the designs by the radix-2 DIT-FFT algorithm are used as the baseline. From the comparison, AFFT1 and AFFT5 save ASIC area by up to 65.7% and 58.8% compared with the 16- and 32-point designs in [10], [15], respectively. The result is consistent with the hardware cost reduction on the FPGA device.

TABLE VI. Comparison between the FFT designs on ASIC.

16-point FFT                                    32-point FFT
Design    Area in µm2   Delay in ns             Design    Area in µm2   Delay in ns
AFFT1     7240          2.72                    AFFT5     26131         3.15
AFFT2     6649          2.72                    AFFT6     23748         3.15
AFFT3     5346          2.41                    AFFT7     17935         2.87
AFFT4     3235          1.53                    AFFT8     10789         1.98
[10]      15351         4.29                    [10]      63383         7.42
[15]      12383         2.79                    [15]      59680         4.54
[29]      3518          1.48                    [29]      --            --
DIT-FFT   18637         4.11                    DIT-FFT   72251         5.82

Fig. 9. Normalized ASIC areas of 16- and 32-point FFT designs.

In addition to hardware area and delay, the total power dissipation, which includes dynamic power and static power, is another critical metric to evaluate hardware performance [39]. All approximated FFT designs are first implemented on the FPGA device and synthesized by Xilinx Vivado Design Suite v17.4 to obtain their power dissipation. The same output rate of 25 MHz and supply voltage of 1.0 V are set for all experiments to have a fair comparison. Moreover, the designs are mapped to the standard cell library for ASIC and simulated by Synopsys Power Compiler version J-2014.09-SP3. An output rate of 25 MHz and a supply voltage of 1.0 V are used. The power dissipations in mW of all designs simulated on both the FPGA device and ASIC are listed in TABLE VII. To present the results visually, they are plotted in Fig. 10 and Fig. 11 for the 16- and 32-point FFTs, respectively, where the powers of the designs by the radix-2 DIT-FFT algorithm are used as the baseline. Our AFFT1 and AFFT5 save ASIC power by up to 53.1% and 60.0% compared with the 16- and 32-point designs in [10], [15], respectively. The power dissipation of AFFT4 on ASIC outperforms the design in [29] by 5.3%.

TABLE VII. Power dissipations in mW for FPGA and ASIC implementation.

16-point FFT                                    32-point FFT
Design    FPGA Power   ASIC Power               Design    FPGA Power   ASIC Power
AFFT1     116          1.21                     AFFT5     150          4.40
AFFT2     114          1.11                     AFFT6     146          4.00
AFFT3     110          0.88                     AFFT7     130          3.00
AFFT4     101          0.54                     AFFT8     112          1.83
[10]      151          2.58                     [10]      463          10.99
[15]      123          2.07                     [15]      277          10.20
[29]      102          0.57                     [29]      --           --
DIT-FFT   154          3.18                     DIT-FFT   346          12.60

Fig. 10. Normalized power dissipations of 16-point FFT designs.
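The power figures quoted above can likewise be reproduced from the ASIC column of TABLE VII; the 53.1% and 60.0% savings appear to correspond to AFFT1 versus [10] and AFFT5 versus [10], and the 5.3% figure to AFFT4 versus [29].

asic_mw = {"AFFT1": 1.21, "[10] 16-pt": 2.58, "AFFT5": 4.40,
           "[10] 32-pt": 10.99, "AFFT4": 0.54, "[29] 16-pt": 0.57}   # from TABLE VII
print(round(100 * (1 - asic_mw["AFFT1"] / asic_mw["[10] 16-pt"]), 1))   # 53.1
print(round(100 * (1 - asic_mw["AFFT5"] / asic_mw["[10] 32-pt"]), 1))   # 60.0
print(round(100 * (1 - asic_mw["AFFT4"] / asic_mw["[29] 16-pt"]), 1))   # 5.3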


11

Fig. 11. Normalized power dissipations of 32-point FFT designs.

For an N-point DIT-FFT architecture (N ≥ 32), nontrivial TFCs exist in (log2N − 2) stages, so the twiddle factor merging technique can be applied to more stages as N increases. There are therefore more opportunities to reduce the number of multiplications in large-size FFTs, because they contain far more nontrivial TFCs than small-size FFTs do. The magnitude-response aware approximation algorithm can be applied in the same way, because the error evaluation and the approximations that reduce the FA cost are not limited to small-size FFTs. Additionally, large-size FFTs can be decomposed into small-size FFTs and therefore implemented by recursively adopting the small-size FFT cores presented in our work. In conclusion, large-size FFTs can benefit from the proposed methods just as small-size FFTs do, and at least the same or even better savings in hardware cost and power dissipation can be achieved.
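As a software-level illustration of this decomposition argument, the classic Cooley-Tukey index mapping builds an N = N1 × N2 transform from N1- and N2-point kernels plus one layer of inter-stage twiddle multiplications. The sketch below is illustrative only and is not the paper's hardware architecture; NumPy's FFT stands in for the small-point cores, and the function name and test sizes are our own choices.

import numpy as np

def large_fft_from_small_cores(x, N1, N2):
    """N-point DFT (N = N1*N2) built only from N1- and N2-point transforms."""
    N = N1 * N2
    a = np.asarray(x, dtype=complex).reshape(N1, N2)   # a[n1, n2] = x[N2*n1 + n2]
    a = np.fft.fft(a, axis=0)                          # N1-point "cores" down the columns
    k1 = np.arange(N1)[:, None]
    n2 = np.arange(N2)[None, :]
    a = a * np.exp(-2j * np.pi * k1 * n2 / N)          # inter-stage twiddle factors W_N^(k1*n2)
    a = np.fft.fft(a, axis=1)                          # N2-point "cores" along the rows
    return a.ravel(order='F')                          # output index k = k1 + N1*k2

# Example: a 512-point FFT realized with 32- and 16-point transforms only.
x = np.random.randn(512) + 1j * np.random.randn(512)
assert np.allclose(large_fft_from_small_cores(x, 32, 16), np.fft.fft(x))

In a hardware realization, the two inner transforms would be instances of the proposed 16-/32-point cores, and the remaining inter-stage twiddle multiplications are where the merging and approximation techniques discussed above could be reapplied.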
V. CONCLUSION

A new hardware area-power efficient approximated FFT design is presented in this paper. The proposed twiddle factor merging technique and magnitude-response aware approximation algorithm provide an efficient way to approximate TFCs and derive a new architecture for N-point FFT implementation. Both 16- and 32-point FFT architectures are designed by applying the proposed algorithm. In ASIC implementation using a 45 nm standard cell library, our 16- and 32-point designs save area by up to 65.7% and 58.8%, respectively, over relevant recently published designs. Meanwhile, the power simulation results show that the 16- and 32-point FFT architectures designed by our algorithm save power dissipation by up to 53.1% and 60.0%, respectively, compared with recently published solutions.

REFERENCES

[1] S.-N. Tang, and F.-C. Jan, "Energy-efficient and calibration-aware Fourier-domain OCT imaging processor," IEEE Transactions on Very Large Scale Integration Systems, vol. 27, no. 6, pp. 1390-1403, Jun. 2019.
[2] A. Chen, and X. Wang, "An image watermarking scheme based on DWT and DFT," 2017 International Conference on Multimedia and Image Processing, Wuhan, China, Mar. 2017, pp. 177-180.
[3] A. Wahbi, A. Roukhe, and L. Hlou, "Enhancing the quality of voice communications by acoustic noise cancellation (ANC) using a low cost adaptive algorithm based Fast Fourier Transform (FFT) and circular convolution," 2014 International Conference on Intelligent Systems: Theories and Applications, Rabat, Morocco, May 2014, pp. 218-224.
[4] N. Bhagat, D. Valencia, A. Alimohammad, and F. Harris, "High-throughput and compact FFT architecture using the Good-Thomas and Winograd algorithms," IET Communications, vol. 12, no. 8, pp. 1011-1018, May 2018.
[5] S. Liu, and D. Liu, "A high-flexible low-latency memory-based FFT processor for 4G, WLAN, and future 5G," IEEE Transactions on Very Large Scale Integration Systems, vol. 27, no. 3, pp. 511-523, Mar. 2019.
[6] J. Bhattacharya, S. De, S. Bose, I. Banerjee, T. Bhuniya, R. Karmakar, K. Mandal, R. Sinha, A. Roy, and A. Chaudhuri, "Implementation of OFDM modulator and demodulator subsystems using 16 point FFT/IFFT pipeline architecture in FPGA," 2017 IEEE Annual Information Technology, Electronics and Mobile Communication Conference, Vancouver, BC, Canada, Oct. 2017, pp. 295-300.
[7] 4G LTE Networks. (2018). LTE Advanced Pro to 5G Roadmap. [Online]. Available: https://www.4g-lte.net/lte/lte-advanced-pro-to-5g-roadmap/
[8] W. Hong, Z. H. Jiang, C. Yu, J. Zhou, P. Chen, Z. Yu, H. Zhang, B. Yang, X. Pang, M. Jiang, Y. Cheng, M. K. T. Al-Huaimi, Y. Zhang, J. Chen, and S. He, "Multibeam antenna technologies for 5G wireless communications," IEEE Transactions on Antennas and Propagation, vol. 65, no. 12, pp. 6231-6249, Dec. 2017.
[9] S. L. M. Hassan, N. Sulaiman, and I. S. A. Halim, "Low power pipelined FFT processor architecture on FPGA," 2018 IEEE Control and System Graduate Research Colloquium, Shah Alam, Malaysia, Aug. 2018, pp. 31-34.
[10] S. Liu, and D. Liu, "A high-flexible low-latency memory-based FFT processor for 4G, WLAN, and future 5G," IEEE Transactions on Very Large Scale Integration Systems, vol. 27, no. 3, pp. 511-523, Mar. 2019.
[11] Z. Qian, and M. Margala, "Low-power split-radix FFT processors using radix-2 butterfly units," IEEE Transactions on Very Large Scale Integration Systems, vol. 24, no. 9, pp. 3008-3012, Sept. 2016.
[12] B. V. Uma, H. R. Kamash, S. Mohith, V. Sreekar, and S. Bhagirath, "Area and time optimized realization of 16 point FFT and IFFT blocks by using IEEE 754 single precision complex floating point adder and multiplier," 2015 International Conference on Soft Computing Techniques and Implementations, Faridabad, India, Oct. 2015, pp. 99-104.
[13] X. Chen, Y. Lei, Z. Lu, and S. Chen, "A variable-size FFT hardware accelerator based on matrix transposition," IEEE Transactions on Very Large Scale Integration Systems, vol. 26, no. 10, pp. 1953-1966, Oct. 2018.
[14] N. Govil, and S. R. Chowdhury, "High performance and low cost implementation of fast Fourier transform algorithm based on hardware software co-design," 2014 IEEE Region 10 Symposium, Kuala Lumpur, Malaysia, Apr. 2014, pp. 403-407.
[15] J. Chen, J. Hu, S. Lee, and G. E. Sobelman, "Hardware efficient mixed radix-25/16/9 FFT for LTE systems," IEEE Transactions on Very Large Scale Integration Systems, vol. 23, no. 2, pp. 221-229, Feb. 2015.
[16] V. Ariyarathna, A. Madanayake, X. Tang, D. F. G. Coelho, R. Cintra, L. Belostotski, S. Mandal, and T. S. Rappaport, "Analog approximate-FFT 8/16-beam algorithms, architectures and CMOS circuits for 5G beamforming MIMO transceivers," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 8, no. 3, pp. 466-479, May 2018.
[17] A. Changela, M. Zaveri, and A. Lakhlani, "ASIC implementation of high performance radix-8 CORDIC algorithm," 2018 International Conference on Advances in Computing, Communications and Informatics, Bangalore, India, Dec. 2018, pp. 699-705.
[18] X.-Y. Shih, H.-R. Chou, and Y.-Q. Liu, "VLSI design and implementation of reconfigurable 46-mode combined-radix-based FFT hardware architecture for 3GPP-LTE applications," IEEE Transactions on Circuits and Systems I, vol. 65, no. 1, pp. 118-129, Jul. 2017.
[19] H. Xiao, X. Yin, X. Chen, J. Li, and X. Chen, "VLSI design of low-cost and high-precision fixed-point reconfigurable FFT processors," IET Computers & Digital Techniques, vol. 12, no. 3, pp. 105-110, May 2018.
[20] J. Chen, and J. Ding, "New algorithm for design of low complexity twiddle factor multipliers in radix-2 FFT," 2015 IEEE International Symposium on Circuits and Systems, Lisbon, Portugal, May 2015, pp. 958-961.
[21] N. L. Ba, and T. T.-H. Kim, "An area efficient 1024-point low power radix-22 FFT processor with feed-forward multiple delay commutators," IEEE Transactions on Circuits and Systems I, vol. 65, no. 10, pp. 3291-3299, Oct. 2018.


[22] S.-W. Yang, and J.-Y. Lee, "Constant twiddle factor multiplier sharing in multipath delay feedback parallel pipelined FFT processors," IEEE Electronics Letters, vol. 50, no. 15, pp. 1050-1052, Jul. 2014.
[23] M. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, "Pipelined radix-2k feedforward FFT architectures," IEEE Transactions on Very Large Scale Integration Systems, vol. 21, no. 1, pp. 23-32, Jan. 2011.
[24] M. Bansal, and S. Nakhate, "High speed pipelined 64-point FFT processor based on radix-22 for wireless LAN," 2017 International Conference on Signal Processing and Integrated Networks, Noida, India, Feb. 2017, pp. 607-612.
[25] A. Lingamneni, C. Enz, K. Palem, and C. Piguet, "Highly energy-efficient and quality-tunable inexact FFT accelerators," 2014 IEEE Custom Integrated Circuits Conference, San Jose, CA, USA, Sept. 2014, pp. 1-4.
[26] Q.-J. Xing, Z.-G. Ma, and Y.-K. Xu, "A novel conflict-free parallel memory access scheme for FFT processors," IEEE Transactions on Circuits and Systems II, vol. 64, no. 11, pp. 1347-1351, Nov. 2017.
[27] H. K. Samudrala, S. Qadeer, S. Azeemuddin, and Z. Khan, "Parallel and pipelined VLSI implementation of the new radix-2 DIT FFT algorithm," 2018 IEEE International Symposium on Smart Electronic Systems, Hyderabad, India, Dec. 2018, pp. 21-26.
[28] X. Han, J. Chen, and S. Rahardja, "A new twiddle factor merging method for low complexity and high speed FFT architecture," 2019 IEEE International Circuit and System Symposium, Kuala Lumpur, Malaysia, Sept. 2019, pp. 1-4.
[29] V. Ariyarathna, D. F. G. Coelho, S. Pulipati, R. J. Cintra, F. M. Bayer, V. S. Dimitrov, and A. Madanayake, "Multibeam digital array receiver using a 16-point multiplierless DFT approximation," IEEE Transactions on Antennas and Propagation, vol. 67, no. 2, pp. 925-933, Feb. 2019.
[30] Y. Ji-yang, H. Dan, L. Xin, X. Ke, and W. Lu-yuan, "Conflict-free architecture for multi-butterfly parallel processing in-place radix-r FFT," 2016 IEEE International Conference on Signal Processing, Chengdu, China, Nov. 2016, pp. 496-501.
[31] S. Mittal, "A survey of techniques for approximate computing," ACM Computing Surveys, vol. 48, no. 4, pp. 1-34, Mar. 2016.
[32] J. Ding, J. Chen, and C.-H. Chang, "A new paradigm of common subexpression elimination by unification of addition and subtraction," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1605-1617, Oct. 2016.
[33] J. W. Cooley and J. W. Tukey, "An algorithm for machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, no. 90, pp. 297-301, Jan. 1965.
[34] J. Chen, and C.-H. Chang, "High-level synthesis algorithm for the design of reconfigurable constant multiplier," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 12, pp. 1844-1856, Dec. 2009.
[35] K. Moller, M. Kumm, M. Garrido, and P. Zipf, "Optimal shift reassignment in reconfigurable constant multiplication circuits," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 3, pp. 710-714, Mar. 2018.
[36] B. Koyada, N. Meghana, M. O. Jaleel, and P. R. Jeripotula, "A comparative study on adders," 2017 International Conference on Wireless Communications, Signal Processing and Networking, Chennai, India, Mar. 2017, pp. 2226-2230.
[37] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, "Low-power digital signal processing using approximated adders," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 1, pp. 124-137, Jan. 2013.
[38] R. Kaur, and T. Singh, "Design of 32-point mixed radix FFT processor using CSD multiplier," 2016 International Conference on Parallel, Distributed and Grid Computing, Waknaghat, India, Dec. 2016, pp. 538-543.
[39] J. Chen, C. H. Chang, and H. Qian, "New power index model for switching power analysis from adder graph of FIR filter," in Proc. IEEE International Symposium on Circuits and Systems, Taipei, Taiwan, May 2009, pp. 2197-2200.

Xueyu Han received her B.Eng. degree from Northwestern Polytechnical University, Xi'an, China, in 2017. She is currently pursuing her Ph.D. degree at the Center of Intelligent Acoustics and Immersive Communications (CIAIC) of Northwestern Polytechnical University. Her research interests include algorithms and circuit design for digital signal processing.

Jiajia Chen received his B.Eng. (Hons) and Ph.D. degrees from Nanyang Technological University, Singapore, in 2004 and 2010, respectively. From April 2012 to March 2018, he was a faculty member at the Singapore University of Technology and Design. Since April 2018, he has been with Nanjing University of Aeronautics and Astronautics, China, where he is currently a Professor. His research interests include computational transformations of low-complexity digital circuits and digital signal processing. Dr. Chen served as Web Chair of the Asia-Pacific Computer Systems Architecture Conference 2005, as a Technical Program Committee member of the European Signal Processing Conference 2014 and the Third IEEE International Conference on Multimedia Big Data 2017, and as an Associate Editor of the Springer EURASIP Journal on Embedded Systems since 2016.

Boyu Qin is currently pursuing the B.Eng. degree with the College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics. His research interests include digital circuit design and implementation.

Susanto Rahardja (F'11) is currently a Chair Professor at Northwestern Polytechnical University (NPU) under the Thousand Talent Plan of the People's Republic of China. His research interests are in multimedia, signal processing, wireless communications, discrete transforms, machine learning, and signal processing algorithms and implementation. He contributed to the development of a series of audio compression technologies such as Audio Video Standards AVS-L and AVS-2, and ISO/IEC 14496-3:2005/Amd.2:2006 and ISO/IEC 14496-3:2005/Amd.3:2006, some of which have been licensed to several companies. Dr. Rahardja has more than 15 years of experience in leading research teams in media-related research covering Signal Processing (audio coding, video/image processing), Media Analysis (text/speech, image, video), Media Security (biometrics, computer vision and surveillance), and Sensor Networks. He has published more than 300 papers and has been granted more than 70 patents worldwide, of which 15 are US patents. Professor Rahardja is a past Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing and the IEEE Transactions on Multimedia, a past Senior Editor of the IEEE Journal of Selected Topics in Signal Processing, and currently serves as an Associate Editor for the Elsevier Journal of Visual Communication and Image Representation and the IEEE Transactions on Multimedia. He was the Conference Chair of the 5th ACM SIGGRAPH Asia in 2012 and the APSIPA 2nd Summit and Conference in 2010 and 2018, as well as other conferences in ACM, SPIE and IEEE. Dr. Rahardja is a recipient of several honors, including the IEE Hartree Premium Award, the Tan Kah Kee Young Inventors' Open Category Gold Award, the Singapore National Technology Award, the A*STAR Most Inspiring Mentor Award, Finalist of the 2010 World Technology & Summit Award, the Nokia Foundation Visiting Professor Award, and the ACM Recognition of Service Award. Professor Rahardja graduated with a B.Eng. from the National University of Singapore and received the M.Eng. and Ph.D. degrees, both in Electronic Engineering, from Nanyang Technological University, Singapore. He attended the Stanford Executive Programme at the Graduate School of Business, Stanford University, USA.
