Design and Implementation of Block Based Transpose Form FIR Filter
International Journal of Computer Application (2250-1797)
Issue 8 Volume 1, January- February 2018
ABSTRACT
The transpose form configuration of the finite impulse response (FIR) filter does not directly
support block based processing. In this paper, a block based transpose form FIR filter
architecture is optimized and implemented. The basic Data Flow Graph (DFG) of the transpose form
FIR filter is converted into a block based DFG, and retiming is applied to the DFG to achieve low
power consumption, reduced area and minimal delay. A generalized mathematical formulation is
derived for the retimed block based transpose form FIR filter, which is implemented with a block
size of 4 and a filter length of 16 in Verilog Hardware Description Language (HDL). The design is
then synthesized using the Cadence RTL Compiler with a TSMC 45 nm CMOS library, and power, area
and delay reports are generated. The results are compared with a few existing structures.
Keywords: Digital filters, Data Flow Graph, Retiming, low power, FIR, HDL.
1. INTRODUCTION
Digital Signal Processing (DSP) systems are implemented on Field Programmable Gate Arrays
(FPGAs) and Application Specific Integrated Circuits (ASICs); FPGAs are attractive because of
their reconfigurability and flexibility. The FPGA platform is more suitable for optimizing DSP
systems in terms of area, power and delay. Digital filters are widely used in DSP applications
[1], such as biomedical applications, communication systems and mobile applications. For these
applications, the digital filter must consume little power, occupy a small area and operate at
high speed. FIR filters can be implemented in different architectures, such as the direct form
structure, the transpose form structure and hybrid structures.
Several FIR architectures have been implemented in different styles to meet these
specifications. For example, Mahesh et al. [2] implemented an FIR filter using the programmable
shift method (PSM) and the constant shift method (CSM) [8][11]. Park [3] implemented an FIR
filter based on a distributed arithmetic structure in direct form and transpose form, but
without any block based concept in the transpose form structure. Mohanty et al. [4] proposed
block based structures and filter banks, which are not suitable for higher order filter lengths
and are applicable to 2-Dimensional (2D) filters. Mohanty also proposed [5] a reconfigurable
block based transpose form filter and a fixed length transpose form FIR filter for DSP
applications [6].
Transpose form structures are the most preferred FIR filter architectures in signal
processing. The transpose form FIR filter is inherently pipelined. Pipelining in digital filter
design shortens the critical path (delay), reduces power consumption and increases the clock
speed.
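As a rough illustration of why the transpose form is regarded as inherently pipelined, the standard textbook comparison of critical paths for an N-tap filter can be written as below. The symbols T_M (multiplier delay) and T_A (adder delay) are assumptions introduced here for illustration, and the direct form is assumed to use a linear chain of adders; these expressions are not results of this paper.

```latex
T_{\text{direct}} = T_M + (N-1)\,T_A ,
\qquad
T_{\text{transpose}} = T_M + T_A
```

Under these assumptions the transpose form critical path does not grow with the filter length N, because a register separates every adder from the next.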
DOI: https://dx.doi.org/10.26808/rs.ca.i8v1.08
In this paper, a block based transpose form FIR filter is realized and mathematically
formulated for reconfigurable applications. The DFG of the transpose form FIR filter is
converted and modified to reduce power consumption, area and delay. Section II presents the
computational analysis using the DFG and the data flow table (DFT) of the transpose form FIR
filter, together with the mathematical formulation. Section III describes the realization of the
hardware structure and the implementation approach of the proposed FIR filter. Section IV
presents the synthesized circuits and simulation diagrams, and compares the results with
existing structures in terms of Very Large Scale Integration (VLSI) design metrics such as area,
power and delay.
Equation (2) is an N-tap finite impulse response filter with unit sample response $h(k) = b_k$
for $1 \le k \le N-1$ and $h(k) = 0$ otherwise.
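To make the difference between the two structures concrete, the following is a minimal behavioural sketch of an N-tap FIR filter in direct form and in transpose form. It assumes coefficients indexed from 0 to N-1 and zero initial conditions; the function names `fir_direct` and `fir_transpose` are illustrative and do not come from the paper.

```python
def fir_direct(b, x):
    """Direct form: y(n) = sum_k b[k] * x(n - k), with x(n) = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]
        y.append(acc)
    return y


def fir_transpose(b, x):
    """Transpose form: every input sample is multiplied by all coefficients
    at once, and the products are accumulated through a register chain."""
    n_taps = len(b)
    regs = [0] * (n_taps - 1)  # delay registers between successive adders
    y = []
    for sample in x:
        products = [coef * sample for coef in b]
        # The output adds the b[0] product to the first register.
        out = products[0] + (regs[0] if regs else 0)
        # Register i is reloaded with product i+1 plus the old register i+1.
        new_regs = []
        for i in range(n_taps - 1):
            carry = regs[i + 1] if i + 1 < n_taps - 1 else 0
            new_regs.append(products[i + 1] + carry)
        regs = new_regs
        y.append(out)
    return y


if __name__ == "__main__":
    coeffs = [1, 2, 3, 4, 5, 6, 7, 8]          # arbitrary example, N = 8
    samples = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
    assert fir_direct(coeffs, samples) == fir_transpose(coeffs, samples)
```

Both functions produce the same output sequence; only the position of the delays differs, which is what the DFGs in this section depict.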
For the computational analysis, the DFGs are drawn in the transpose form for the filter length
N = 8, as shown in the figures. Figure 1 shows the DFG for the input $x(n)$ and output $y(n)$,
and Figure 2 shows the DFG for the input $x(n-1)$ and output $y(n-1)$.
Figure 1: DFG of transpose form FIR filter for the length of 8 for output y(n) to input x(n).
Figure 2: DFG of transpose form FIR filter for the length of 8 for output y(n-1) to input x(n-1).
For DFG1 and DFG2, the products of the coefficients with the input values and the
corresponding accumulation paths are shown in the data flow tables DFT1 and DFT2 of Figure 3.
The accumulation paths of the product values are indicated by arrows in DFT1 and DFT2.
Observing DFT1 and DFT2, we conclude that five values in each column of the two data flow
tables are the same.
Figure 3: DFT1 for output y(n) to input x(n) with respect to DFG1, and DFT2 for output y(n-1)
to input x(n-1) with respect to DFG2.
This indicates high redundancy in the normal transpose form FIR filter. For two consecutive
inputs, this redundancy can be reduced in the above FIR structures by introducing block based
inputs. Here, non-overlapped input blocks are used. Two modified data flow tables, DFT3 and
DFT4, corresponding to non-overlapped input blocks are presented in Figure 4; they avoid the
redundancy of the normal transpose form FIR filter.
Figure 4: The modified Data flow tables DFT3 and DFT4 for transpose form FIR filter with
block size of 2 and N=6.
DFT3 is the data flow table for the output $y(n)$ and DFT4 for the output $y(n-1)$. There is
no redundancy in DFT3 and DFT4, as can be observed from the table entries. The gray cells
represent the output $y(n)$ and the other values the output $y(n-1)$. DFG1 is then completely
transformed into a new DFG corresponding to DFT3 and DFT4, referred to as DFG3. DFG3 is the
equivalent flow diagram for the computations of DFT3 and DFT4 with non-overlapped blocks of 2
for the filter length of 8, and is shown in Figure 5.
Figure 5: Modified DFG of block based transpose form FIR filter for the length of 8.
DFG3 is further optimized using retiming. Retiming is a method that reduces power, area and
delay in VLSI circuits by changing the positions of the delay elements, such as flip flops,
without altering the input-output characteristics of the circuit. Retiming is mostly used in
synchronous designs. Retiming reduces the switching activity of the circuit, and hence the
power consumption; specifically, the dynamic power dissipation in static CMOS circuits is
reduced [7]. In this paper, DFG3 is retimed to obtain these advantages; the result, named DFG4,
is the block based retimed transpose form FIR filter shown in Figure 6.
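The effect of retiming on delay counts can be sketched with the standard edge-weight rule w_r(e) = w(e) + r(V) - r(U) for an edge e from node U to node V, where r is an integer retiming value per node. The toy graph below is hypothetical and is not DFG3 or DFG4 from the paper; it only illustrates how moving delays across a fan-out node can lower the total number of delay elements while keeping the input-to-output latency of every path unchanged.

```python
def retime(edges, r):
    """Apply w_r = w + r[v] - r[u] to every edge (u, v, w).

    edges: list of (source, destination, number_of_delays)
    r:     dict mapping each node to its integer retiming value
    """
    retimed = []
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"edge {u}->{v} would get a negative delay count")
        retimed.append((u, v, w_r))
    return retimed


def path_delays(edges, path):
    """Total number of delays along a path given as a list of nodes."""
    weight = {(u, v): w for u, v, w in edges}
    return sum(weight[(a, b)] for a, b in zip(path, path[1:]))


# Hypothetical toy graph: node A fans out to B and C, one delay on each branch.
toy_edges = [
    ("in", "A", 0),
    ("A", "B", 1),
    ("A", "C", 1),
    ("B", "out", 0),
    ("C", "out", 0),
]
# Retiming values chosen by hand: move the two branch delays in front of A.
toy_r = {"in": 0, "A": 1, "B": 0, "C": 0, "out": 0}

retimed_edges = retime(toy_edges, toy_r)
total_before = sum(w for _, _, w in toy_edges)
total_after = sum(w for _, _, w in retimed_edges)
```

In this toy case the two delays on the branches are replaced by a single delay on the shared input edge, so `total_after` is smaller than `total_before`, while `path_delays` returns the same value for each input-to-output path before and after retiming.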
Figure 6: Retimed DFG of block based Transpose form FIR filter for the length of 8.
Comparing DFG3 and DFG4 for the block based transpose form FIR filter, both structures
contain the same number of adders and multipliers. Only the number of delay elements (D flip
flops) is lower in the retimed FIR filter structure.
$X_k = [\, X_k^0 \;\; X_k^1 \;\; X_k^2 \;\; \cdots \;\; X_k^{N-1} \,]$ (4)
Suppose N is a composite number decomposed as N = ML, with the index $i = l + 4m$ for
$0 \le l \le 3$ and $0 \le m \le 3$. Substituting $i = l + 4m$ in (5), we have
$X_k^i = X_k^{l+4m} = X_{k-m}^l$ (7)
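The index identity in (7) can be checked numerically. The sketch below assumes, since the defining equations are not reproduced in this text, that block X_k^i holds L = 4 consecutive input samples starting at x(kL - i) and running backwards in time, with samples before time zero taken as 0. The helper names `sample` and `block` are illustrative.

```python
L = 4  # block size used in the paper


def sample(x, n):
    """Return x(n), treating samples outside the recorded range as zero."""
    return x[n] if 0 <= n < len(x) else 0


def block(x, k, i):
    """Assumed definition: X_k^i = [x(kL - i), x(kL - i - 1), ..., x(kL - i - L + 1)]."""
    return [sample(x, k * L - i - j) for j in range(L)]


def identity_holds(x, k_values):
    """Check X_k^(l + 4m) == X_(k-m)^l for all 0 <= l, m <= 3."""
    for k in k_values:
        for l in range(4):
            for m in range(4):
                if block(x, k, l + 4 * m) != block(x, k - m, l):
                    return False
    return True


signal = list(range(1, 41))  # arbitrary test signal of 40 samples
ok = identity_holds(signal, range(0, 10))
```

Under this assumed definition the identity is simply the statement that shifting by 4m samples is the same as going back m whole blocks, which is why `ok` evaluates to `True`.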
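$X_k = [\, X_k^0 \; X_k^1 \; X_k^2 \; X_k^3 \;\; X_{k-1}^0 \; X_{k-1}^1 \; X_{k-1}^2 \; X_{k-1}^3 \;\; X_{k-2}^0 \; X_{k-2}^1 \; X_{k-2}^2 \; X_{k-2}^3 \;\; X_{k-3}^0 \; X_{k-3}^1 \; X_{k-3}^2 \; X_{k-3}^3 \,]$ (8)
$y_k = \sum_{l=0}^{3} \sum_{m=0}^{3} X_{k-m}^{l} \, b(l + 4m)$ (9)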
The input matrix $X_k$ of (8) has the following features: the data block $X_k^0$ is the
current block, while $\{X_{k-1}^0, X_{k-2}^0, X_{k-3}^0\}$ are blocks delayed by 1, 2 and 3
clock cycles. The overlapped blocks $\{X_{k-1}^1, X_{k-2}^1, X_{k-3}^1\}$ are versions of the
overlapped block $X_k^1$ delayed by 1, 2 and 3 clock cycles.
The input matrix $X_k$ is decomposed into four small matrices $R_k^m$ ($0 \le m \le 3$), such
that $R_k^0$ contains the 4 blocks $\{X_k^0, X_k^1, X_k^2, X_k^3\}$ and $R_k^1$ contains
$\{X_{k-1}^0, X_{k-1}^1, X_{k-1}^2, X_{k-1}^3\}$. The coefficient vector b is decomposed into
small vectors $C_m = [\, b(4m),\ b(4m+1),\ b(4m+2),\ b(4m+3) \,]$, where $0 \le m \le 3$.
From (10), $R_k^m$ is $R_k^0$ delayed by m clock cycles, so equation (9) can be expressed in
terms of $R_{k-m}^0$ and $C_m$ as
$y_k = \sum_{m=0}^{3} r_k^m$
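The block formulation above can be checked against an ordinary sample-by-sample FIR computation. The sketch below is a minimal numerical check for N = 16 and L = 4. It assumes that X_k^l holds the samples x(kL - l), ..., x(kL - l - L + 1), that R_k^0 is the L x L matrix whose columns are X_k^0 to X_k^3, and that each partial result is the product of the m-cycle-delayed matrix with C_m; these definitions and all function names are assumptions made for illustration.

```python
L = 4  # block size
M = 4  # number of coefficient sub-vectors, so N = M * L = 16


def sample(x, n):
    """Return x(n), treating samples outside the recorded range as zero."""
    return x[n] if 0 <= n < len(x) else 0


def r_matrix(x, k):
    """Assumed R_k^0: row j holds [x(kL - j), x(kL - j - 1), ..., x(kL - j - 3)]."""
    return [[sample(x, k * L - j - l) for l in range(L)] for j in range(L)]


def mat_vec(a, v):
    """Plain matrix-vector product."""
    return [sum(p * q for p, q in zip(row, v)) for row in a]


def block_output(b, x, k):
    """y_k = sum over m of (R delayed by m blocks) times C_m."""
    y = [0] * L
    for m in range(M):
        c_m = b[L * m: L * m + L]              # C_m = [b(4m), ..., b(4m+3)]
        partial = mat_vec(r_matrix(x, k - m), c_m)
        y = [acc + p for acc, p in zip(y, partial)]
    return y


def direct_output(b, x, n):
    """Reference: y(n) = sum_i b(i) x(n - i)."""
    return sum(b[i] * sample(x, n - i) for i in range(len(b)))


coeffs = list(range(1, 17))                      # arbitrary 16 coefficients
signal = [((7 * n) % 11) - 5 for n in range(40)]  # arbitrary test signal
matches = all(
    block_output(coeffs, signal, k) == [direct_output(coeffs, signal, k * L - j) for j in range(L)]
    for k in range(10)
)
```

With these assumptions each block output y_k collects the four filter outputs y(4k), y(4k-1), y(4k-2) and y(4k-3), so one block iteration replaces four sample iterations.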
Figure 7: Architecture of FIR filter for the length N=16 and block size L=4.
A block of 4 input samples is applied to the RC in the kth cycle. The RC internally consists
of delay flip flops that rearrange the samples as required by the algorithm, and it produces 4
rows of input samples $R_k^0$ in parallel, as shown in Figure 8.
These 4 rows of $R_k^0$ are applied to the M PUs (M = 4) in the structure. The M coefficient
weight vectors from the CU are also transmitted to the PUs: $C_0$ to PU4, $C_1$ to PU3, $C_2$
to PU2 and $C_3$ to PU1, as shown in the above figure. A matrix multiplication then takes place
between the input samples $R_k^0$ from the RC and the coefficient weight vectors $C_m$ (m = 0
to 3). Each PU internally consists of four inner product cells (IPCs), as shown in Figure 9.
The 4 rows of input samples from the RC are fed row-wise to the inner product cells. Each IPC
multiplies these values with the coefficient vector, and the PU generates $r_k^m$ as the
output. Thus the PUs produce $r_k^0$, $r_k^1$, $r_k^2$ and $r_k^3$. The 4 PUs operate in
parallel and produce the 4 result blocks $r_k^m$ (m = 0 to 3).
Parallel processing means that multiple outputs are computed in parallel in a clock period.
Parallel processing and pipelining are used in the PAC architecture to reduce power consumption
and shorten the critical path (delay), which in turn improves the clock speed. Parallel
processing and pipelining are duals of each other [9] [10]: a computation that can be pipelined
can also be processed in parallel.
Figure 10: Internal circuit of Inner product cell (IPC) in the PU.
These four outputs pass through the PAC block, which consists of delay elements and carry
save adders, as shown in Figure 11. In this block, the partial product outputs are added by
pipelined addition to produce L outputs $y_k$, where L = 4. Finally, the transpose form FIR
filter architecture produces a block of 4 outputs for each block of 4 input samples. Pipelining
is used in the PAC block to optimize the filter.
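The data path described in this section (RC, PUs and PAC) can be mimicked by a small behavioural model. The sketch below is not the paper's Verilog design: it assumes that the RC keeps the last L - 1 samples of the previous block, that PU m forms the product of the current register matrix with C_m, and that the PAC accumulates these partial results through M - 1 block-wide delay registers in transpose fashion. Ordinary integer addition stands in for the carry save adders, and the class name `BlockTransposeFIR` and the mapping of m to a particular PU are hypothetical.

```python
class BlockTransposeFIR:
    """Behavioural model of a block based transpose form FIR filter."""

    def __init__(self, b, block_size=4):
        assert len(b) % block_size == 0
        self.L = block_size
        self.M = len(b) // block_size
        # CU: coefficient sub-vectors C_m = [b(Lm), ..., b(Lm + L - 1)]
        self.C = [b[self.L * m: self.L * m + self.L] for m in range(self.M)]
        # RC: last L-1 samples of the previous block, oldest first
        self.history = [0] * (self.L - 1)
        # PAC: M-1 delay registers, each holding one block of partial sums
        self.pac = [[0] * self.L for _ in range(self.M - 1)]

    def step(self, blk):
        """Consume L new samples (oldest first) and return L outputs (oldest first)."""
        L, M = self.L, self.M
        window = self.history + list(blk)   # 2L-1 samples, oldest first
        self.history = window[L:]           # keep the newest L-1 samples
        # RC output: row t = the L samples ending at the t-th new sample, newest first
        rows = [[window[(L - 1) + t - l] for l in range(L)] for t in range(L)]
        # PUs: each inner product cell handles one row and one coefficient vector
        r = [[sum(rows[t][l] * self.C[m][l] for l in range(L)) for t in range(L)]
             for m in range(M)]
        # PAC: output uses r[0] plus the first delay register
        first = self.pac[0] if M > 1 else [0] * L
        y = [r[0][t] + first[t] for t in range(L)]
        # Register m-1 is reloaded with r[m] plus the old register m
        new_pac = []
        for m in range(1, M):
            carry = self.pac[m] if m < M - 1 else [0] * L
            new_pac.append([r[m][t] + carry[t] for t in range(L)])
        self.pac = new_pac
        return y


def fir_reference(b, x):
    """Sample-by-sample reference with zero initial conditions."""
    return [sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
            for n in range(len(x))]


coeffs = list(range(1, 17))                      # N = 16, arbitrary values
signal = [((5 * n) % 13) - 6 for n in range(40)]  # arbitrary test signal
model = BlockTransposeFIR(coeffs, block_size=4)
block_out = []
for k in range(len(signal) // 4):
    block_out.extend(model.step(signal[4 * k: 4 * k + 4]))
```

Feeding the signal four samples at a time gives the same output sequence as `fir_reference`, with four outputs produced per block step. The delays sit between the accumulation stages rather than on the input side, which is the property that makes the structure a transpose form.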
Figure 12: Top level module of block based transpose form FIR filter for the length N=16 and
block size L=4 using XST.
Figure 13: Simulation outputs of proposed FIR filter using ISE Simulator.
Table I presents the complete design summary of the proposed FIR filter obtained with the XST
tool for the Virtex-5 FPGA. Here, the design blocks are mapped to the technology blocks in the
FPGA. The numbers of used and available blocks and the device utilization percentages are shown
in Table I.
Table I Device utilization summary for the proposed FIR filter using XST
Resource | Used | Available | Utilization
Number of BUFG/BUFGCTRLs | 1 | 32 | 3%
The blocks of the FIR filter are coded in Verilog HDL and then synthesized using the Cadence
Encounter RTL Compiler in TSMC 45 nm and TSMC 180 nm CMOS technologies. RTL Compiler targets
nanometer performance goals, reducing chip area, lowering power and improving timing closure.
Reports for power consumption, area and delay are generated from this synthesis. The top module
of the FIR filter architecture from the RTL Compiler tool is shown in Figure 14.
Figure 14: The complete structure of the proposed FIR filter using TSMC 45-nm CMOS technology
by RTL Compiler.
The comparison of the two TSMC technologies using RTL Compiler is given in Table II. The
synthesis results of the FIR filter for the TSMC 45-nm and TSMC 180-nm CMOS libraries are
tabulated from the reports generated by the RTL Compiler tool. The power consumption of the FIR
filter in 180 nm technology is much greater than in 45 nm technology; that is, more power
optimization takes place in the 45 nm technology, owing to the power reduction constraints
applied to the filter. The number of delay elements is also reduced by retiming, and hence the
delay is improved in the 45 nm technology. The clock speed corresponding to the FIR filter delay
in 45 nm is high, i.e., 222 MHz.
The area is also reduced in the advanced 45 nm CMOS technology by using appropriate area
constraints. The number of flip flops is reduced, and optimized adders and multipliers are used
in the design of the FIR filter. According to the RTL Compiler synthesis reports, the leakage
power is very small compared with the dynamic power in both the 45 nm and 180 nm technologies:
the leakage power is 157 nW in 45 nm and 602 nW in 180 nm, which is negligible compared with the
dynamic power. The total power of the proposed FIR filter in 45 nm technology is 326 µW, which
is the optimized power.
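To put the 45 nm leakage figure in proportion, a short worked calculation follows. It assumes that the reported total power of 326 µW includes the 157 nW of leakage; the symbols P_leak and P_total are introduced here for illustration only.

```latex
\frac{P_{\text{leak}}}{P_{\text{total}}}
  = \frac{157 \times 10^{-9}\ \text{W}}{326 \times 10^{-6}\ \text{W}}
  \approx 4.8 \times 10^{-4}
  \approx 0.048\,\%
```

Under that assumption, leakage accounts for well under a tenth of a percent of the total, which supports describing it as negligible relative to dynamic power.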
Table II Synthesis results of 45nm and 180nm CMOS technologies using RTL compiler.
Table III presents the comparison between the existing FIR filter structures from the survey
and the proposed FIR filter structure. The proposed design requires 64 multipliers, the same as
the direct form architecture [4] and the transpose form structure [5], but fewer delay elements
(FFs) than all the existing structures. Because of the smaller number of delay elements, the
overall delay of the proposed FIR filter is reduced to 4.487 ns. The clock speed achieved by
the proposed transpose form FIR filter in 45 nm technology is 222 MHz. The number of adder
blocks is 197, which is lower than in the existing architectures [2] and [3]. The area occupied
by the proposed structure is much smaller than that of the existing FIR filter structures; the
greater area optimization is achieved by this FIR filter in 45 nm technology.
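The reported clock speed is consistent with the reported delay if the delay is taken as the clock period. As a worked check, with f_max and T_delay as illustrative symbols:

```latex
f_{\max} = \frac{1}{T_{\text{delay}}}
         = \frac{1}{4.487 \times 10^{-9}\ \text{s}}
         \approx 222.9\ \text{MHz}
```

This agrees with the 222 MHz figure quoted for the 45 nm design.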
Table III Comparison between different FIR filters parameters.
6. CONCLUSION
An optimized block based transpose form FIR filter is realized using retiming, with fewer
delay elements, for low power consumption, low area and high speed. The basic transpose form
FIR filter DFG is converted into a modified DFG to avoid redundancy for block based inputs.
Retiming is introduced by changing the locations of the D flip flops to optimize the FIR
filter. Constraints are applied in the synthesis tool to reduce the delay, area and power
consumption of the FIR filter. The entire block based transpose form FIR filter structure is
implemented in Verilog HDL and simulated using the ISE simulator. The design is synthesized
using the XST tool and again using RTL Compiler for two technologies, TSMC 45 nm and TSMC
180 nm CMOS. Comparing the two technologies, the 45 nm technology gives better results in terms
of area, delay and power consumption. The area and utilization summary is given by the XST tool,
and the power and delay reports are obtained from the RTL synthesis tool.
REFERENCES
[1] A. Umasankar and N. Vasudevan, “Design and Analysis of Various Slice Reduction Algorithm
for Low Power and Area Efficient FIR Filter,” ICCTET13, IEEE Conf., July 2013.
[2] R. Mahesh and A. P. Vinod, “New Reconfigurable Architectures for Implementing FIR Filters
with Low Complexity,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 2,
pp. 275–288, Feb. 2010.
[3] S. Y. Park and P. K. Meher, “Efficient FPGA and ASIC realizations of a DA-based
reconfigurable FIR digital filter,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 61, no. 7,
pp. 511–515, Jul. 2014.
[4] B. K. Mohanty and P. K. Meher, “A high-performance energy-efficient architecture for FIR
adaptive filter based on new distributed arithmetic formulation of block LMS algorithm,”
IEEE Trans. Signal Process., vol. 61, no. 4, pp. 921–932, Feb. 2013.
[5] B. K. Mohanty and P. K. Meher, “A High-Performance FIR Filter Architecture for Fixed and
Reconfigurable Applications,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24,
no. 2, pp. 444–452, 2016.
[6] A. P. Vinod and E. M. Lai, “Low power and high-speed implementation of FIR filters for
software defined radio receivers,” IEEE Trans. Wireless Commun., vol. 7, no. 5, pp. 1669–
1675, Jul. 2006.
[7] K. K. Parhi, “VLSI Digital Signal Processing Systems: Design and Implementation,” John
Wiley & Sons, 1999.
[8] B. K. Mohanty and P. K. Meher, “A high-performance energy-efficient architecture for FIR
adaptive filter based on new distributed arithmetic formulation of block LMS algorithm,”
IEEE Trans. Signal Process., vol. 61, no. 4, pp. 921–932, Feb. 2013.
[9] R. Mahesh and A. P. Vinod, “A new common sub-expression elimination algorithm for
realizing low-complexity higher order digital filters,” IEEE Trans. Comput.-Aided Design
Integr. Circuits Syst., vol. 27, no. 2, pp. 217–219, Feb. 2008.
[10] J. Park, W. Jeong, H. Mahmoodi-Meimand, Y. Wang, H. Choo, and K. Roy, “Computation
sharing programmable FIR filter for low-power and high-performance applications,” IEEE J.
Solid State Circuits, vol. 39, no. 2, pp. 348–357, Feb. 2004.
[11] K.-H. Chen and T.-D. Chiueh, “A low-power digit-based reconfigurable FIR filter,” IEEE
Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 8, pp. 617–621, Aug. 2006.