DSP
Edition 1.0
2005 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners. NOTICE OF DISCLAIMER: The information stated in this book is not to be used for design purposes. Xilinx is providing this design, code, or information "as is." By providing the design, code, or information as one possible implementation of this feature, application, or standard, Xilinx makes no representation that this implementation is free from any claims of infringement. You are responsible for obtaining any rights you may require for your implementation. Xilinx expressly disclaims any warranty whatsoever with respect to the adequacy of the implementation, including but not limited to any warranties or representations that this implementation is free from claims of infringement and any implied warranties of merchantability or fitness for a particular purpose. All terms mentioned in this book are known to be trademarks or service marks and are the property of their respective owners. Use of a term in this book should not be regarded as affecting the validity of any trademark or service mark. All rights reserved. No part of this book may be reproduced, in any form or by any means, without written permission from the publisher.
Acknowledgements
Chapter 2 through Appendix A have been sourced from the Xilinx Advanced Product Divisions (APD) "XtremeDSP Design Considerations User Guide". For up-to-date information, download the online version located here: http://www.xilinx.com/bvdocs/userguides/ug073.pdf
For more information, contact: Gregg C. Hawkes, Senior Staff Applications Manager, Xilinx, Inc.
TABLE OF CONTENTS
Miscellaneous Functional Use Models
Dynamic, 18-bit Circular Barrel Shifter Use Model
VHDL and Verilog Instantiation Templates
VHDL Instantiation Template
Verilog Instantiation Template
Appendix A: References
Chapter 1
Our insatiable hunger for electronic gadgets that provide high-quality audio, video, data, or all three is driving up the processing power needed to handle these signals. Digital signal processing (DSP) systems, within both infrastructure and customer premises equipment, must provide increasing levels of performance and flexibility to handle the new requirements, yet provide greater scalability for achieving higher economies of scale.
[Figure 1-1: plot of algorithm complexity against DSP/GPP performance from 1960 to 2010, with the difference labeled "Performance Gap". Source: Jan Rabaey, BWRC]
DSP: DESIGNING FOR OPTIMAL RESULTS
Field Programmable Gate Arrays (FPGAs) are very well suited to fill the performance gap for a variety of reasons:
- They offer extremely high-performance signal processing capability through parallelism.
- They provide very low risk due to the flexible architecture.
- They allow design migration to handle changing standards.
- Developers can use them to create a customized and differentiated solution.
- They are quickly coming down in price. In fact, it is possible to find FPGAs for less than $2 per device.
- They provide very low power per function.
XtremeDSP Slice Delivers Maximum Performance, Minimum Power, and Best Economy
The XtremeDSP Slice, operating at a blazing 500 MHz, lies at the heart of Virtex-4 FPGA XtremeDSP performance. As the most powerful addition to the Xilinx XtremeDSP tool kit, it is a unique piece of hard-coded IP embedded in each Virtex-4 device. It provides industry-leading DSP processing performance, unrivalled economy, and the lowest power consumption of any device in this performance range.
Easy to Use
Xilinx and its partners provide the easiest-to-use design solutions for FPGA-based DSP solutions with features such as:
- System Generator for DSP reduces design time.
- A rich DSP IP library implements fast, highly optimized algorithms.
- Award-winning technical support and DSP services enable you to bring products to market much faster.
Whether you are working with spread-spectrum, multi-carrier, or narrowband communication systems, Virtex-4 FPGAs are the ideal choice for ease of use.
A Must-Read
This book is a must-read for DSP designers who want to tap the power of the Virtex-4 XtremeDSP Slice. It provides a detailed description of the multiple features of the slice, along with multiple examples that show you how to harness the power and flexibility of this IP block. Tap into the XtremeDSP Slice and reap the rewards of the highest performance and lowest power at the lowest cost.
Chapter 2
This chapter provides technical details for the XtremeDSP Digital Signal Processing (DSP) element, the DSP48 slice. The DSP48 slice is a new element in the Xilinx development model referred to as Application Specific Modular Blocks (ASMBL). The purpose of this model is to deliver off-the-shelf programmable devices with the best mix of logic, memory, I/O, processors, clock management, and digital signal processing. ASMBL is an efficient FPGA development model for delivering off-the-shelf, flexible solutions ideally suited to different application domains.
Each XtremeDSP tile contains two DSP48 slices to form the basis of a versatile coarse-grain DSP architecture. Many DSP designs follow a multiply with addition. In Virtex-4 devices these elements are supported in dedicated circuits. The DSP48 slices support many independent functions, including multiplier, multiplier-accumulator (MAC), multiplier followed by an adder, three-input adder, barrel shifter, wide bus multiplexers, magnitude comparator, or wide counter. The architecture also supports connecting multiple DSP48 slices to form wide math functions, DSP filters, and complex arithmetic without the use of general FPGA fabric.
The DSP48 slices available in all Virtex-4 family members support new DSP algorithms and higher levels of DSP integration than previously available in FPGAs. Minimal use of general FPGA fabric leads to low power, very high performance, and efficient silicon utilization.
Introduction
The DSP48 slices facilitate higher levels of DSP integration than previously possible in FPGAs. Many DSP algorithms are supported with minimal use of the general-purpose FPGA fabric, resulting in low power, high performance, and efficient device utilization.
At first look, the DSP48 slice is an 18 x 18 bit two's complement multiplier followed by a 48-bit sign-extended adder/subtracter/accumulator, a function that is widely used in digital signal processing (DSP). A second look reveals many subtle features that enhance the usefulness, versatility, and speed of this arithmetic building block. Programmable pipelining of input operands, intermediate products, and accumulator outputs enhances throughput. The 48-bit internal bus allows for practically unlimited aggregation of DSP slices.
One of the most important features is the ability to cascade a result from one XtremeDSP Slice to the next without the use of general fabric routing. This path provides high-performance and low-power post addition for many DSP filter functions of any tap length. For multi-precision arithmetic this path supports a right-wire-shift. Thus a partial product from one XtremeDSP Slice can be right-justified and added to the next partial product computed in an adjacent slice. Using this technique, the XtremeDSP Slices can be configured to support any size operands. Another key feature for filter composition is the ability to cascade an input stream from slice to slice.
The C input port allows the formation of many 3-input mathematical functions, such as 3-input addition and 2-input multiplication with a single addition. One subset of this function is the very valuable support of rounding a multiplication away from zero.
Architecture Highlights
The Virtex-4 DSP slices are organized as vertical DSP columns. Within the DSP column, two vertical DSP slices are combined with extra logic and routing to form a DSP tile. The DSP tile is four CLBs tall. Each DSP48 slice has a two-input multiplier followed by multiplexers and a three-input adder/subtracter. The multiplier accepts two 18-bit, two's complement operands producing a 36-bit, two's complement result. The result is sign extended to 48 bits and can optionally be fed to the adder/subtracter. The adder/subtracter accepts three 48-bit, two's complement operands, and produces a 48-bit two's complement result. Higher level DSP functions are supported by cascading individual DSP48 slices in a DSP48 column. One input (cascade B input bus) and the DSP48 slice output (cascade P output bus) provide the cascade capability. For example, a Finite Impulse Response (FIR) filter design can use the cascading input to arrange a series of input data samples and the cascading output to arrange a series of partial output results. For details on this technique, refer to the section titled Adder Cascade vs. Adder Tree, page 31. 
Architecture highlights of the DSP48 slices are:
- 18-bit by 18-bit, two's-complement multiplier with a full-precision 36-bit result, sign extended to 48 bits
- Three-input, flexible 48-bit adder/subtracter with optional registered accumulation feedback
- Dynamic user-controlled operating modes to adapt DSP48 slice functions from clock cycle to clock cycle
- Cascading 18-bit B bus, supporting input sample propagation
- Cascading 48-bit P bus, supporting output propagation of partial results
- Multi-precision multiplier and arithmetic support with 17-bit operand right shift to align wide multiplier partial products (parallel or sequential multiplication)
- Symmetric intelligent rounding support for greater computational accuracy
- Performance enhancing pipeline options for control and data signals are selectable by configuration bits
- Input port C typically used for multiply-add operation, large three-operand addition, or flexible rounding mode
- Separate reset and clock enable for control and data registers
- I/O registers, ensuring maximum clock performance and highest possible sample rates with no area cost
- OPMODE multiplexers
A number of software tools support the DSP48 slice. The Xilinx ISE software supports DSP48 slice instantiations. The Architecture Wizard is a GUI for creating instantiation VHDL and/or Verilog code. It also helps generate code for designs using a single DSP48 slice (i.e., Multiplier, Adder, Multiply-Accumulate or MAC, and Dynamic Control modes). Using the Architecture Wizard, CORE Generator tool, or System Generator, a designer can quickly generate math or other functions using Virtex-4 DSP48 slices.
[Figure 2-1: diagram not recovered; surviving labels are the bus widths 18, 18, 48, 7 and 18, 48, 48]
Table 2-2 lists the available ports in the DSP48 slice primitive.
Table 2-2: DSP48 Slice Port List and Definitions
Signal Name | Direction | Size | Function
A | I | 18 | The multiplier's A input. This signal can also be used as the adder's Most Significant Word (MSW) input
B | I | 18 | The multiplier's B input. This signal can also be used as the adder's Least Significant Word (LSW) input
C | I | 48 | The adder's C input
OPMODE | I | 7 | Controls the input to the X, Y, and Z multiplexers in the DSP48 slices (see OPMODE, Table 2-7)
SUBTRACT | I | 1 | 0 = add, 1 = subtract
CARRYIN | I | 1 | The carry input to the carry select logic
CARRYINSEL | I | 2 | Selects carry source (see CARRYINSEL, Table 2-8)
CEA | I | 1 | Clock enable: 0 = hold, 1 = enable AREG
CEB | I | 1 | Clock enable: 0 = hold, 1 = enable BREG
CEC | I | 1 | Clock enable: 0 = hold, 1 = enable CREG
CEM | I | 1 | Clock enable: 0 = hold, 1 = enable MREG
CEP | I | 1 | Clock enable: 0 = hold, 1 = enable PREG
CECTRL | I | 1 | Clock enable: 0 = hold, 1 = enable OPMODEREG, CARRYINSELREG
CECINSUB | I | 1 | Clock enable: 0 = hold, 1 = enable SUBTRACTREG and general interconnect carry input
CECARRYIN | I | 1 | Clock enable: 0 = hold, 1 = enable (carry input from internal paths)
RSTA | I | 1 | Reset: 0 = no reset, 1 = reset AREG
RSTB | I | 1 | Reset: 0 = no reset, 1 = reset BREG
RSTC | I | 1 | Reset: 0 = no reset, 1 = reset CREG
RSTM | I | 1 | Reset: 0 = no reset, 1 = reset MREG
RSTP | I | 1 | Reset: 0 = no reset, 1 = reset PREG
RSTCTRL | I | 1 | Reset: 0 = no reset, 1 = reset SUBTRACTREG, OPMODEREG, CARRYINSELREG
RSTCARRYIN | I | 1 | Reset: 0 = no reset, 1 = reset (carry input from general interconnect and internal paths)
CLK | I | 1 | The DSP48 clock
BCIN | I | 18 | The multiplier's cascaded B input. This signal can also be used as the adder's LSW input
PCIN | I | 48 | Cascaded adder's Z input from the previous DSP slice
BCOUT | O | 18 | The B cascade output
PCOUT | O | 48 | The P cascade output
P | O | 48 | The product output
Attributes in VHDL
DSP48 generic map (
   AREG => 1,                        -- Number of pipeline registers on the A input, 0, 1 or 2
   BREG => 1,                        -- Number of pipeline registers on the B input, 0, 1 or 2
   B_INPUT => "DIRECT",              -- B input DIRECT from fabric or CASCADE from another DSP48
   CARRYINREG => 1,                  -- Number of pipeline registers for the CARRYIN input, 0 or 1
   CARRYINSELREG => 1,               -- Number of pipeline registers for the CARRYINSEL, 0 or 1
   CREG => 1,                        -- Number of pipeline registers on the C input, 0 or 1
   LEGACY_MODE => "MULT18X18S",      -- Backward compatibility, NONE, MULT18X18 or MULT18X18S
   MREG => 1,                        -- Number of multiplier pipeline registers, 0 or 1
   OPMODEREG => 1,                   -- Number of pipeline registers on OPMODE input, 0 or 1
   PREG => 1,                        -- Number of pipeline registers on the P output, 0 or 1
   SIM_X_INPUT => "GENERATE_X_ONLY", -- Simulation parameter for behavior for X on input.
                                     -- Possible values: GENERATE_X, NONE or WARNING
   SUBTRACTREG => 1)                 -- Number of pipeline registers on the SUBTRACT input, 0 or 1
Attributes in Verilog
defparam DSP48_inst.AREG = 1;                        // Number of pipeline registers on the A input, 0, 1 or 2
defparam DSP48_inst.BREG = 1;                        // Number of pipeline registers on the B input, 0, 1 or 2
defparam DSP48_inst.B_INPUT = "DIRECT";              // B input DIRECT from fabric or CASCADE from another DSP48
defparam DSP48_inst.CARRYINREG = 1;                  // Number of pipeline registers for the CARRYIN input, 0 or 1
defparam DSP48_inst.CARRYINSELREG = 1;               // Number of pipeline registers for the CARRYINSEL, 0 or 1
defparam DSP48_inst.CREG = 1;                        // Number of pipeline registers on the C input, 0 or 1
defparam DSP48_inst.LEGACY_MODE = "MULT18X18S";      // Backward compatibility, NONE, MULT18X18 or MULT18X18S
defparam DSP48_inst.MREG = 1;                        // Number of multiplier pipeline registers, 0 or 1
defparam DSP48_inst.OPMODEREG = 1;                   // Number of pipeline registers on OPMODE input, 0 or 1
defparam DSP48_inst.PREG = 1;                        // Number of pipeline registers on the P output, 0 or 1
defparam DSP48_inst.SIM_X_INPUT = "GENERATE_X_ONLY"; // Simulation parameter for behavior for X on input.
                                                     // Possible values: GENERATE_X, NONE or WARNING
defparam DSP48_inst.SUBTRACTREG = 1;                 // Number of pipeline registers on the SUBTRACT input, 0 or 1
Figure 2-2: DSP48 Interconnect and Relative Dedicated Element Sizes
Figure 2-3 shows two DSP48 slices and their associated datapaths stacked vertically in a DSP48 column. The inputs to the shaded multiplexers are selected by configuration control signals. These are set by attributes in the HDL source code or by the User Constraint File (UCF).
Figure 2-3: A DSP48 Tile Consisting of Two DSP48 Slices
Notes:
1. The 18-bit A bus and B bus are concatenated, with the A bus being the most significant.
2. The X, Y, and Z multiplexers are 48-bit designs. Selecting any of the 36-bit inputs provides a 48-bit sign-extended output.
3. The multiplier outputs two 36-bit partial products, sign extended to 48 bits. The partial products feed the X and Y multiplexers. When OPMODE selects the multiplier, both X and Y multiplexers are utilized and the adder/subtracter combines the partial products into a valid multiplier result.
4. The multiply-accumulate path for P is through the Z multiplexer. The P feedback through the X multiplexer enables accumulation of P cascade when the multiplier is not used.
5. The Right Wire Shift by 17 bits path truncates the lower 17 bits and sign extends the upper 17 bits.
6. The grey-colored multiplexers are programmed at configuration time.
7. The shared C register supports multiply-add, wide addition, or rounding.
8. Enabling SUBTRACT implements Z - (X + Y + CIN) at the output of the adder/subtracter.
Equation 2-2 describes a typical use where A and B are multiplied and the result is added to or subtracted from the C register. More detailed operations based on control and data inputs are described in later sections. Selecting the multiplier function consumes both X and Y multiplexer outputs to feed the adder. The two 36-bit partial products from the multiplier are sign extended to 48 bits before being sent to the adder/subtracter.
$\mathrm{Adder\ Out} = C \pm (A \times B + \mathit{CIN})$   (Equation 2-2)
Figure 2-4 shows the DSP48 slice in a very simplified form. The seven OPMODE bits control the selection of the 48-bit datapaths by the three multiplexers feeding each of the three inputs to the adder/subtracter. In all cases, the 36-bit input data to the multiplexers is sign extended, forming 48-bit input datapaths to the adder/subtracter. Based on 36-bit operands and a 48-bit accumulator output, the number of guard bits (i.e., bits available to guard against overflow) is 48 - 36 = 12. Therefore, the number of multiply accumulations possible before overflow occurs is 2^12 = 4096. Combinations of OPMODE, SUBTRACT, CARRYINSEL, and CIN control the function of the adder/subtracter.
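The guard-bit arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the widths are the 36-bit product and 48-bit accumulator quoted in the text.

```python
# Hypothetical helper names; widths come from the paragraph above.
PRODUCT_BITS = 36      # width of the signed multiplier result
ACCUMULATOR_BITS = 48  # width of the adder/subtracter output


def guard_bits(product_bits=PRODUCT_BITS, acc_bits=ACCUMULATOR_BITS):
    """Bits of headroom between the product and the accumulator."""
    return acc_bits - product_bits


def safe_accumulations(product_bits=PRODUCT_BITS, acc_bits=ACCUMULATOR_BITS):
    """Accumulations guaranteed not to overflow: 2 ** guard bits."""
    return 2 ** guard_bits(product_bits, acc_bits)


def fits_in_accumulator(value, acc_bits=ACCUMULATOR_BITS):
    """True if value is representable in acc_bits-wide two's complement."""
    return -(1 << (acc_bits - 1)) <= value <= (1 << (acc_bits - 1)) - 1


# Largest-magnitude 18 x 18 product: (-2**17) * (-2**17) = 2**34.
worst_product = (-(1 << 17)) * (-(1 << 17))
worst_total = worst_product * safe_accumulations()
```

Accumulating the worst-case product 4096 times still fits in 48 bits, which is consistent with the limit stated above (the limit is conservative because it assumes the full 36-bit range).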
[Figure 2-4: simplified DSP48 slice diagram not recovered; surviving labels: "OPMODE Controls Behavior", P, A:B, A, B, C, Zero, PCIN, P]
Timing Model
Table 2-3 lists the XtremeDSP switching characteristics.
Table 2-3: XtremeDSP Switching Characteristics
Symbol | Description | Function | Control Signal
Setup and Hold of CE Pins
TDSPCCK_CE/TDSPCKC_CE | Setup/Hold of all CE inputs of the DSP48 slice | Clock Enable | CE
TDSPCCK_RST/TDSPCKC_RST | Setup/Hold of all RST inputs of the DSP48 slice | Reset | RST
Setup and Hold Times of Data/Control Pins
TDSPDCK_{AA, BB, CC}/TDSPCKD_{AA, BB, CC} | Setup/Hold of {A, B, C} input to {A, B, C} register | | A, B, C
TDSPDCK_{AM, BM}/TDSPCKD_{AM, BM} | Setup/Hold of {A, B} input to M register | | A, B
TDSPDCK_{AP, BP}_L/TDSPCKD_{AP, BP}_L | Setup/Hold of {A, B} input to P register (LEGACY_MODE = MULT18X18) | | A, B
TDSPDCK_{AP_NL, BP_NL, CP}/TDSPCKD_{AP_NL, BP_NL, CP} | Setup/Hold of {A, B, C} input to P register (LEGACY_MODE = NONE for A and B) | | A, B, C
TDSPDCK_{CRYINC, CRYINSC, OPO, SUBS}/TDSPCKD_{CRYINC, CRYINSC, OPO, SUBS} | Setup/Hold of {CARRYIN, CARRYINSEL, OPMODE, SUBTRACT} input to {CARRYIN, CARRYINSEL, OPMODE, SUBTRACT} register | Control In | Various
 | Setup/Hold of {CARRYIN, CARRYINSEL, OPMODE, SUBTRACT, PCIN} input to P register | Control In | Various
Clock to Out
TDSPCKO_PP | Clock to out from P register to P output | Data Out | P Output
TDSPCKO_{PA, PB}_L | Clock to out from {A, B} register to P output (LEGACY_MODE = MULT18X18) | Data Out | P Output
 | Clock to out from {A, B, C} register to P output (LEGACY_MODE = NONE for A and B) | Data Out | P Output
TDSPCKO_{PM, PCRYIN, PCRYINS, POP, PSUB} | Clock to out from {M, CARRYIN, CARRYINSEL, OPMODE, SUBTRACT} register to P output | Data Out | P Output
TDSPCKO_PCOUTP | Clock to out from P register to PCOUT output | Data Out | P Output
TDSPCKO_{PCOUTA, PCOUTB}_L | Clock to out from {A, B} register to PCOUT output (LEGACY_MODE = MULT18X18) | Data Out | P Output
TDSPCKO_{PCOUTA_NL, PCOUTB_NL, PCOUTC} | Clock to out from {A, B, C} register to PCOUT output (LEGACY_MODE = NONE for A and B) | Data Out | P Output
TDSPCKO_{PCOUTM, PCOUTCRYIN, PCOUTCRYINS, PCOUTOP, PCOUTSUB} | Clock to out from {M, CARRYIN, CARRYINSEL, OPMODE, SUBTRACT} register to PCOUT output | Data Out | P Output
 | {A, B} input to P output (LEGACY_MODE = MULT18X18) | Data In to Out |
 | {A, B, C} input to P output (LEGACY_MODE = NONE for A and B) | Data In to Out |
TDSPDO_{CRYINP, CRYINSP, OPMODEP, SUBTRACTP, PCINP} | {CARRYIN, CARRYINSEL, OPMODE, SUBTRACT, PCIN} input to P output | Control to Data Out |
 | {A, B} input to PCOUT output (LEGACY_MODE = MULT18X18) | Data In to PC Out | A, B to PC Out
 | {A, B, C} input to PCOUT output (LEGACY_MODE = NONE for A and B) | |
TDSPDO_{CRYINPCOUT, CRYINSPCOUT, OPMODEPCOUT, SUBTRACTPCOUT, PCINPCOUT} | {CARRYIN, CARRYINSEL, OPMODE, SUBTRACT, PCIN} input to PCOUT output | |
 | From {A, B} register to P register (LEGACY_MODE = MULT18X18) | |
 | From {A, B, C, P} register to P register (LEGACY_MODE = NONE for A and B) | |
TDSPCKCK_{CRYINP, CRYINSP, OPMODEP, SUBTRACTP} | From {CARRYIN, CARRYINSEL, OPMODE, SUBTRACT} register to P register | |
TDSPCKCK__{AM, BM} | From {A, B} register to M register | |
The timing diagram in Figure 2-5 uses OPMODE equal to 0x05 with all pipeline registers turned on. For other applications, the clock latencies and the parameter names must be adjusted.
[Figure 2-5: timing diagram not recovered; it shows CLK events 1 through 5 with the CE, RST, A, and C signals (Data A1 to A4, Data C3, Data C4) and the parameters TDSPCCK_CE, TDSPCCK_RST, and TDSPDCK_AA]
The following events occur in Figure 2-5:
1. At time TDSPCCK_CE before CLK event 1, CE becomes valid High to allow all DSP registers to sample incoming data.
2. At time TDSPDCK_{AA,BB,CC} before CLK event 1, data inputs A, B, C have remained stable for sampling into the DSP slice.
3. At time TDSPCKO_PP after CLK event 4, the P output switches into the results of the data captured at CLK event 1. This occurs three clock cycles after CLK event 1.
4. At time TDSPCCK_RST before CLK event 5, the RST signal becomes valid High to allow a synchronous reset at CLK event 5.
5. At time TDSPCKO_PP after CLK event 5, the output P becomes a logic 0.
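The three-cycle latency in event 3 can be illustrated with a small cycle-by-cycle register model in Python. This is a sketch under a stated assumption: "all pipeline registers turned on" is read here as two input register stages (AREG = BREG = 2) followed by MREG and PREG, because that reading reproduces the figure quoted above. The function name is hypothetical, and reset and clock-enable behavior are not modeled.

```python
def simulate_multiply_pipeline(a_samples, b_samples, input_stages=2):
    """Return the value of P after each clock edge for a registered multiply.

    Assumption (not taken from the source): input_stages A/B registers,
    then one M register and one P register, with OPMODE 0x05 (Z = 0).
    input_stages must be 1 or 2.
    """
    if input_stages not in (1, 2):
        raise ValueError("this sketch models 1 or 2 input register stages")
    in_regs = [(0, 0)] * input_stages  # cascaded A/B input registers
    m_reg = 0                          # multiplier output register
    p_reg = 0                          # P output register
    p_trace = []
    for a, b in zip(a_samples, b_samples):
        # All registers update on the same edge, so the next values are
        # computed from the contents held before the edge.
        next_p = m_reg
        next_m = in_regs[-1][0] * in_regs[-1][1]
        in_regs = [(a, b)] + in_regs[:-1]
        m_reg, p_reg = next_m, next_p
        p_trace.append(p_reg)
    return p_trace


# Index 0 is the state after CLK event 1, index 3 after CLK event 4.
trace = simulate_multiply_pipeline([3, 5, 7, 9, 11], [2, 2, 2, 2, 2])
```

In this model the product of the operands captured at CLK event 1 (3 x 2) first appears on P after CLK event 4.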
The A, B, C, and P port logic is shown in Figure 2-6, Figure 2-7, Figure 2-8, and Figure 2-9, respectively.
Figure 2-6: A Input Logic
Figure 2-7: B Input Logic
Figure 2-8: C Input Logic
Figure 2-9: P Output Logic
[Figure: 2-bit CARRYINSEL input with optional register (D, EN, RST, Q); diagram not recovered]
[Figure 2-11: diagram not recovered; surviving labels: 36, 72, Optional MREG, 36]
X, Y, and Z Multiplexer
The Operating Mode (OPMODE) inputs provide a way for the design to change its functionality from clock cycle to clock cycle (e.g., when altering the initial or final state of the DSP48 relative to the middle part of a given calculation). The OPMODE bits can be optionally registered under the control of the configuration memory cells (as denoted by the grey colored MUX symbol in Figure 2-10).
Table 2-4, Table 2-5, and Table 2-6 list the possible values of OPMODE and the resulting function at the outputs of the three multiplexers (X, Y, and Z multiplexers). The multiplexer outputs supply three operands to the following adder/subtracter. Not all possible combinations for the multiplexer select bits are allowed. Some are marked in the tables as illegal selection and give undefined results. If the multiplier output is selected, then both the X and Y multiplexers are consumed supplying the multiplier output to the adder/subtracter.
Table 2-4: OPMODE Control Bits Select X, Y, and Z Multiplexer Outputs
OPMODE Binary Z | Y | X | X Multiplexer Output Fed to Add/Subtract
XXX | XX | 00 | ZERO (Default)
XXX | 01 | 01 | Multiplier Output (Partial Product 1)
XXX | XX | 10 | P
XXX | XX | 11 | A concatenate B
Table 2-5: OPMODE Control Bits Select X, Y, and Z Multiplexer Outputs
OPMODE Binary Z | Y | X | Y Multiplexer Output Fed to Add/Subtract
XXX | 00 | XX | ZERO (Default)
XXX | 01 | 01 | Multiplier Output (Partial Product 2)
XXX | 10 | XX | Illegal selection
XXX | 11 | XX | C
Table 2-6:
OPMODE Binary Z | Y | X | Z Multiplexer Output Fed to Add/Subtract
000 | XX | XX | ZERO (Default)
001 | XX | XX | PCIN
010 | XX | XX | P
011 | XX | XX | C
100 | XX | XX | Illegal selection
101 | XX | XX | Shift (PCIN)
110 | XX | XX | Shift (P)
111 | XX | XX | Illegal selection
There are seven possible non-zero operands for the three-input adder as selected by the three multiplexers, and the 36-bit operands are sign extended to 48 bits at the multiplexer outputs:
1. Multiplier output, supplied as two 36-bit partial products
2. Multiplier bypass bus consisting of A concatenated with B
3. C bus, 48 bits, shared by two slices
4. Cascaded P bus, 48 bits, from a neighbor DSP48 slice
5. Registered P bus output, 48 bits, for accumulator functions
6. Cascaded P bus, 48 bits, right shifted by 17 bits from a neighbor DSP48 slice
7. Registered P bus output, 48 bits, right shifted by 17 bits, for accumulator functions
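The multiplexer selections in Table 2-4 through Table 2-6 can be sketched as a small OPMODE decoder in Python. The dictionary and function names are hypothetical, and the string labels are only shorthand for the operands listed above; the bit fields follow the Z (bits 6:4), Y (bits 3:2), X (bits 1:0) layout of the tables.

```python
# Hypothetical lookup tables built from Table 2-4, Table 2-5, and Table 2-6.
X_MUX = {0b00: "ZERO", 0b01: "MULT_PP1", 0b10: "P", 0b11: "A:B"}
Y_MUX = {0b00: "ZERO", 0b01: "MULT_PP2", 0b11: "C"}        # 0b10 is illegal
Z_MUX = {0b000: "ZERO", 0b001: "PCIN", 0b010: "P", 0b011: "C",
         0b101: "SHIFT(PCIN)", 0b110: "SHIFT(P)"}          # 0b100, 0b111 illegal


def decode_opmode(opmode):
    """Return the (X, Y, Z) multiplexer selections for a 7-bit OPMODE."""
    if not 0 <= opmode <= 0x7F:
        raise ValueError("OPMODE is a 7-bit value")
    x = opmode & 0b11
    y = (opmode >> 2) & 0b11
    z = (opmode >> 4) & 0b111
    if y not in Y_MUX or z not in Z_MUX:
        raise ValueError("illegal selection")
    # The multiplier consumes both X and Y, so they must be selected together.
    if (x == 0b01) != (y == 0b01):
        raise ValueError("multiplier must be selected on both X and Y")
    return X_MUX[x], Y_MUX[y], Z_MUX[z]
```

For example, `decode_opmode(0x05)` selects the two multiplier partial products with a zero Z input, and `decode_opmode(0x5E)` selects P on X, C on Y, and the shifted PCIN on Z.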
Three-Input Adder/Subtracter
The adder/subtracter output is a function of control and data inputs. OPMODE, as shown in the previous section, selects the inputs to the X, Y, Z multiplexer directed to the associated three adder/subtracter inputs. It also describes how selecting the multiplier output consumes both X and Y multiplexers. As with the input multiplexers, the OPMODE bits specify a portion of this function. Table 2-7 shows OPMODE combinations and the resulting functions. The symbol ± in the table means either add or subtract and is specified by the state of the SUBTRACT control signal (SUBTRACT = 1 is defined as subtraction). The symbol : in the table means concatenation. The outputs of the X and Y multiplexer and CIN are always added together. This result is then added to or subtracted from the output of the Z multiplexer.
Table 2-7: OPMODE Control Bits Adder/Subtracter Function
Hex OPMODE [6:0] | Binary OPMODE Z Y X | Z | Y | X | Adder/Subtracter Output
0x00 | 000 00 00 | 0 | 0 | 0 | ±CIN
0x02 | 000 00 10 | 0 | 0 | P | ±(P + CIN)
0x03 | 000 00 11 | 0 | 0 | A:B | ±(A:B + CIN)
0x05 | 000 01 01 | 0 | Note 1 | Note 1 | ±(A × B + CIN)
0x0c | 000 11 00 | 0 | C | 0 | ±(C + CIN)
0x0e | 000 11 10 | 0 | C | P | ±(C + P + CIN)
0x0f | 000 11 11 | 0 | C | A:B | ±(A:B + C + CIN)
0x10 | 001 00 00 | PCIN | 0 | 0 | PCIN ± CIN
0x12 | 001 00 10 | PCIN | 0 | P | PCIN ± (P + CIN)
0x13 | 001 00 11 | PCIN | 0 | A:B | PCIN ± (A:B + CIN)
0x15 | 001 01 01 | PCIN | Note 1 | Note 1 | PCIN ± (A × B + CIN)
0x1c | 001 11 00 | PCIN | C | 0 | PCIN ± (C + CIN)
0x1e | 001 11 10 | PCIN | C | P | PCIN ± (C + P + CIN)
0x1f | 001 11 11 | PCIN | C | A:B | PCIN ± (A:B + C + CIN)
0x20 | 010 00 00 | P | 0 | 0 | P ± CIN
0x22 | 010 00 10 | P | 0 | P | P ± (P + CIN)
0x23 | 010 00 11 | P | 0 | A:B | P ± (A:B + CIN)
0x25 | 010 01 01 | P | Note 1 | Note 1 | P ± (A × B + CIN)
0x2c | 010 11 00 | P | C | 0 | P ± (C + CIN)
0x2e | 010 11 10 | P | C | P | P ± (C + P + CIN)
0x2f | 010 11 11 | P | C | A:B | P ± (A:B + C + CIN)
0x30 | 011 00 00 | C | 0 | 0 | C ± CIN
0x32 | 011 00 10 | C | 0 | P | C ± (P + CIN)
0x33 | 011 00 11 | C | 0 | A:B | C ± (A:B + CIN)
0x35 | 011 01 01 | C | Note 1 | Note 1 | C ± (A × B + CIN)
0x3c | 011 11 00 | C | C | 0 | C ± (C + CIN)
0x3e | 011 11 10 | C | C | P | C ± (C + P + CIN)
0x3f | 011 11 11 | C | C | A:B | C ± (A:B + C + CIN)
0x50 | 101 00 00 | Shift (PCIN) | 0 | 0 | Shift(PCIN) ± CIN
0x52 | 101 00 10 | Shift (PCIN) | 0 | P | Shift(PCIN) ± (P + CIN)
0x53 | 101 00 11 | Shift (PCIN) | 0 | A:B | Shift(PCIN) ± (A:B + CIN)
0x55 | 101 01 01 | Shift (PCIN) | Note 1 | Note 1 | Shift(PCIN) ± (A × B + CIN)
0x5c | 101 11 00 | Shift (PCIN) | C | 0 | Shift(PCIN) ± (C + CIN)
0x5e | 101 11 10 | Shift (PCIN) | C | P | Shift(PCIN) ± (C + P + CIN)
0x5f | 101 11 11 | Shift (PCIN) | C | A:B | Shift(PCIN) ± (A:B + C + CIN)
0x60 | 110 00 00 | Shift (P) | 0 | 0 | Shift(P) ± CIN
0x62 | 110 00 10 | Shift (P) | 0 | P | Shift(P) ± (P + CIN)
0x63 | 110 00 11 | Shift (P) | 0 | A:B | Shift(P) ± (A:B + CIN)
0x65 | 110 01 01 | Shift (P) | Note 1 | Note 1 | Shift(P) ± (A × B + CIN)
0x6c | 110 11 00 | Shift (P) | C | 0 | Shift(P) ± (C + CIN)
0x6e | 110 11 10 | Shift (P) | C | P | Shift(P) ± (C + P + CIN)
0x6f | 110 11 11 | Shift (P) | C | A:B | Shift(P) ± (A:B + C + CIN)
Notes:
1. When the multiplier output is selected, both X and Y multiplexers are used to feed the multiplier partial products to the adder input.
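The rule behind Table 2-7, where X, Y, and CIN are summed and the result is added to or subtracted from Z, can be sketched as a behavioral Python model. The names below are hypothetical and the model is not cycle accurate; the only hardware detail it keeps is the wrap to 48-bit two's complement.

```python
MASK48 = (1 << 48) - 1


def to_signed48(value):
    """Wrap an integer to 48-bit two's complement."""
    value &= MASK48
    return value - (1 << 48) if value & (1 << 47) else value


def adder_subtracter(x, y, z, cin=0, subtract=0):
    """Compute Z + (X + Y + CIN), or Z - (X + Y + CIN) when subtract is 1."""
    partial = x + y + cin
    return to_signed48(z - partial if subtract else z + partial)


def multiply_accumulate(pairs):
    """Accumulate a * b over pairs, as in OPMODE 0x25 (Z = P, multiplier on X and Y)."""
    p = 0
    for a, b in pairs:
        # The two partial products arrive on X and Y; only their sum (a * b)
        # matters for this model, so it is passed on X with Y set to zero.
        p = adder_subtracter(a * b, 0, p)
    return p
```

For instance, `adder_subtracter(3, 4, 10, cin=1)` gives 18, and the same call with `subtract=1` gives 2.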
Figure 2-12: Carry Input Logic Feeding the Adder/Subtracter
Figure 2-12 shows four inputs, selected by the 2-bit CARRYINSEL control with the OPMODE bits providing additional control. The first input CARRYIN (CARRYINSEL is equal to binary 00) is driven from general logic. This option allows implementation of a carry function based on user logic. It can be optionally registered to match the pipeline delay of the MREG when used. This register delay is controlled by configuration. The next input (CARRYINSEL is equal to binary 01) is the inverted MSB of either the output P or the cascaded output, PCIN (from an adjacent DSP48 slice). The final selection between P or PCIN is dictated by OPMODE[4] and OPMODE[6]. The third input (CARRYINSEL is equal to binary 10) is the inverted MSB of A, for rounding A concatenated with B values, or A[17] XNOR B[17] for rounding multiplier outputs. Again, the state of OPMODE determines the final selection. The fourth and final input is merely a registered version of the third input to adjust the carry input delay when using the multiplier output register or MREG.
Table 2-8 lists the possible values of the two carry input select bits (CARRYINSEL), the operation mode bus (OPMODE), and the resulting carry inputs or sources.
Table 2-8: OPMODE and CARRYINSEL Control Carry Source
CARRYINSEL[1:0] | OPMODE | Carry Source | Comments
00 | XXX XX XX | CARRYIN | General fabric carry source (registered or not)
01 | Z MUX output = P or Shift(P) | ~P[47] | Rounding P or Shift(P)
01 | Z MUX output = PCIN or Shift(PCIN) | ~PCIN[47] | Rounding the cascaded PCIN or Shift(PCIN) from adjacent slice
10 | X and Y MUX output = multiplier partial products | A[17] xnor B[17] | Rounding multiplier (MREG pipeline register disabled)
11 | X and Y MUX output = multiplier partial products | A[17] xnor B[17] | Rounding multiplier (MREG pipeline register enabled)
10 | X MUX output = A:B | ~A[17] | Rounding A:B (not pipelined)
11 | X MUX output = A:B | ~A[17] | Rounding A:B (pipelined)
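The carry-source selection in Table 2-8 can be sketched in Python as follows. The function signature and the string labels for the multiplexer selections are hypothetical. CARRYINSEL values 10 and 11 return the same bit in this model because, per the table, they differ only in pipeline delay, which is not modeled here.

```python
def sign_bit(value, width):
    """Return the MSB of value viewed as a width-bit two's complement number."""
    return (value >> (width - 1)) & 1


def carry_source(carryinsel, carryin=0, a=0, b=0, p=0, pcin=0,
                 x_sel="MULT", z_sel="P"):
    """Select the carry input bit following Table 2-8.

    x_sel is "MULT" or "A:B"; z_sel is "P", "SHIFT(P)", "PCIN",
    or "SHIFT(PCIN)". These labels are illustrative, not signal names.
    """
    if carryinsel == 0b00:
        return carryin & 1                           # fabric carry input
    if carryinsel == 0b01:
        if z_sel in ("P", "SHIFT(P)"):
            return 1 - sign_bit(p, 48)               # ~P[47]
        if z_sel in ("PCIN", "SHIFT(PCIN)"):
            return 1 - sign_bit(pcin, 48)            # ~PCIN[47]
        raise ValueError("Z multiplexer must select P or PCIN")
    if carryinsel in (0b10, 0b11):
        if x_sel == "MULT":
            return 1 - (sign_bit(a, 18) ^ sign_bit(b, 18))  # A[17] xnor B[17]
        if x_sel == "A:B":
            return 1 - sign_bit(a, 18)               # ~A[17]
        raise ValueError("X multiplexer must select the multiplier or A:B")
    raise ValueError("CARRYINSEL is a 2-bit value")
```

The result is 1 when the value being rounded is non-negative and 0 when it is negative, which is what the symmetric rounding scheme relies on.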
number of ones present in the C register. Table 2-9 has examples for rounding off the fractional bits from a value (binary point placement and rounded bits placement coincide).
Table 2-9: Symmetric Rounding Examples
Multiplier Output (Decimal) | Multiplier Output (Binary) | C Value | Internally Generated CIN | Multiplier Plus C Plus CIN | After Truncation (Binary) | After Truncation (Decimal)
2.4375 | 0010.0111 | 0000.0111 | 1 | 0010.1111 | 0010 | 2
2.5 | 0010.1000 | 0000.0111 | 1 | 0011.0000 | 0011 | 3
2.5625 | 0010.1001 | 0000.0111 | 1 | 0011.0001 | 0011 | 3
-2.4375 | 1101.1001 | 0000.0111 | 0 | 1110.0000 | 1110 | -2
-2.5 | 1101.1000 | 0000.0111 | 0 | 1101.1111 | 1101 | -3
-2.5625 | 1101.0111 | 0000.0111 | 0 | 1101.1110 | 1101 | -3
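The rows of Table 2-9 can be reproduced with a short Python sketch. The function name is hypothetical, values are held as integers scaled by 2^frac_bits, and the internally generated CIN is approximated from the sign of the value as a stand-in for A[17] XNOR B[17].

```python
def round_symmetric(value, frac_bits):
    """Round a fixed-point value (scaled by 2**frac_bits) away from zero at ties.

    C holds ones in every fractional position below the binary point except
    the top one (0000.0111 for four fractional bits); CIN is 1 for
    non-negative values and 0 for negative values.
    """
    c = (1 << (frac_bits - 1)) - 1
    cin = 1 if value >= 0 else 0
    # Python's >> on integers is an arithmetic shift, matching truncation
    # of the fractional bits of a two's complement number.
    return (value + c + cin) >> frac_bits


# Multiplier outputs from Table 2-9 with four fractional bits (scaled by 16).
table_inputs = [39, 40, 41, -39, -40, -41]   # +/-2.4375, +/-2.5, +/-2.5625
table_results = [round_symmetric(v, 4) for v in table_inputs]
```

With these inputs the results are 2, 3, 3, -2, -3, -3, matching the last column of the table: ties at ±2.5 round away from zero in both directions.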
[Figure 2-13: partial product alignment diagram not recovered; surviving labels: [35:18], P[69:52], 34-bit Offset, P[16:0]]
When separating two's complement numbers into two parts, only the most-significant part carries the original sign. The least-significant part must have a forced zero in the sign position, meaning it is a positive operand. While it seems logical to separate a positive number into the sum of two positive numbers, it can be counterintuitive to separate a negative number into a negative most-significant part and a positive least-significant part. However, after separation, the most-significant part becomes more negative by the amount the least-significant part becomes more positive. The forced zero sign bit in the least-significant part is why only 35-bit operands are found instead of 36-bit operands.
The DSP48 slices, with 18 x 18 multipliers and post adder, can now be used to implement the sum of the four partial products shown in Figure 2-13. The less significant partial products must be right-shifted by 17 bit positions before being summed with the next most-significant partial products. This is accomplished with a built-in wire shift applied to PCIN supplied as one selectable Z multiplexer input. The entire process of multiplication, shifting, and addition using adder cascade to form the 70-bit result can remain in the dedicated silicon of the DSP48 slice, resulting in maximum performance with minimal power consumption. Figure 2-21, page 41 illustrates the implementation of a 35 x 35 multiplier using the DSP48 slices.
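The operand split and the 17-bit shifts described above can be checked with a behavioral Python sketch. The function names are hypothetical, and the model follows the arithmetic only, not the slice-by-slice pipelining of the actual implementation.

```python
LOW_MASK = (1 << 17) - 1   # 17 low bits; the forced-zero sign bit makes 18


def split_35(value):
    """Split a signed 35-bit value into a signed upper 18-bit part and an
    unsigned lower 17-bit part, so that value == upper * 2**17 + lower."""
    return value >> 17, value & LOW_MASK


def multiply_35x35(a, b):
    """Form a 35 x 35 product from four 18 x 18 partial products."""
    a_hi, a_lo = split_35(a)
    b_hi, b_lo = split_35(b)

    acc = a_lo * b_lo                  # least-significant partial product
    low_bits = acc & LOW_MASK          # result bits [16:0]
    acc >>= 17                         # right wire shift by 17 bits

    acc += a_hi * b_lo + a_lo * b_hi   # two middle partial products
    mid_bits = acc & LOW_MASK          # result bits [33:17]
    acc >>= 17                         # right wire shift by 17 bits

    acc += a_hi * b_hi                 # most-significant partial product
    return (acc << 34) + (mid_bits << 17) + low_bits
```

Because Python integers are unbounded, the result can be compared directly with `a * b`; for example, `multiply_35x35(-(1 << 34), (1 << 34) - 1)` equals the exact product of the two extreme 35-bit operands.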
FIR Filters
Basic FIR Filters
FIR filters are used extensively in video broadcasting and wireless communications. DSP filter applications include, but are not limited to the following:

- Wireless Communications
- Image Processing
- Video Filtering
- Multimedia Applications
- Portable Electrocardiogram (ECG) Displays
- Global Positioning Systems (GPS)

Equation 2-3 shows the basic equation for a single-channel FIR filter.

\[ y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k) \]
Equation 2-3
The terms in the equation can be described as input samples, output samples, and coefficients. Imagine x as a continuous stream of input samples and y as a resulting stream (i.e., a filtered stream) of output samples. The n and k in the equation correspond to a particular instant in time, so to compute the output sample y(n) at time n, a group of input samples at N different points in time, or x(n), x(n-1), x(n-2), ..., x(n-N+1), is required. The group of N input samples is multiplied by N coefficients and summed together to form the final result y.

The main components used to implement a digital filter algorithm include adders, multipliers, storage, and delay elements. The DSP48 slice includes all of the above elements, which makes it ideal to implement digital filter functions. All of the input samples from the set of N samples are present at the input of each DSP48 slice. Each slice multiplies the samples with the corresponding coefficients within the DSP48 slice. The outputs of the multipliers are combined in the cascaded adders.
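Equation 2-3 can be written out directly as a reference model. This is a minimal sketch, assuming samples before the start of the stream are zero; the function name `fir` is hypothetical.

```python
def fir(x, h):
    """Direct evaluation of y(n) = sum over k of h(k) * x(n - k)."""
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(n_taps):
            if n - k >= 0:              # samples before the stream start count as zero
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


impulse_response = fir([1, 0, 0, 0], [3, 2, 1])   # an impulse reads back the coefficients
```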
Figure 2-14: [Figure not reproduced. The input x(n) passes through a chain of Z^-1 delay elements; each tap is multiplied by a coefficient h(0) through h(N-1), and the products are summed to form y(n).]
In Figure 2-14, the sample delay logic is denoted by Z^-1, where the -1 represents a single clock delay. The delayed input samples are supplied to one input of the multiplier. The coefficients (denoted by h(0) to h(N-1)) are supplied to the other input of the multiplier through individual ROMs, RAMs, registers, or constants. y(n) is simply the sum of a set of input samples, delayed in time and multiplied by their respective coefficients.
Figure 2-15: Software-Defined Radio Digital Down Converter
[Figure not reproduced. Recoverable labels: input v(n); outputs xI(n) and xQ(n), each through a filter M(z); x(n) = xI(n) + j xQ(n).]

Some video applications use multi-channel implementations for multiple components of a video stream. Typical video components are red, green, and blue (RGB) or luma, chroma red, and chroma blue (YCrCb). The different video components can have the same coefficient sets or different coefficient sets for each channel by simply changing the coefficient ROM structure.
Figure 2-16: FIR Filter Adder Tree Using DSP48 Slices
[Figure not reproduced. Coefficients h0(n) through h5(n) and the input X(n) feed 18 x 18 multipliers whose 48-bit products are summed in an adder tree to produce y(n-6). Annotation: The final stages of the post addition in logic are the performance bottleneck that consume more power.]

One difficulty of the adder tree concept is defining the size. Filters come in various lengths and consume a variable number of adders forming an adder tree. Placing a fixed number of adder tree components in silicon displaces other elements or requires a larger FPGA, thereby increasing the cost of the design. In addition, the adder tree structure with a fixed number of additions forces the designer to use logic resources when the fixed number of additions is exceeded. Using logic resources dramatically reduces performance and increases power consumption. The key to maximizing performance and lowering power for DSP math is to remain inside the DSP48 column consisting entirely of dedicated silicon.

The Virtex-4 solution accomplishes the post-addition process while guaranteeing no wasted silicon resources. It involves computing the additive result incrementally utilizing a cascaded approach as illustrated in Figure 2-17. Figure 2-17 is a systolic version of a direct form FIR with a latency of 10 clocks versus an adder tree latency of six clocks.
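The cascaded approach can be illustrated with a simplified cycle-based model. This sketch is an assumption-laden abstraction rather than a model of the DSP48 pipeline: it uses two sample registers per tap and one adder register per tap, so its latency is N clocks for N taps, and it omits the extra input and multiplier registers that account for the longer latency quoted above. The names `fir_direct` and `fir_systolic` are hypothetical.

```python
def fir_direct(x, h):
    """Reference: y(n) = sum over k of h(k) * x(n - k), zero before the stream start."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def fir_systolic(x, h):
    """Adder-cascade model: tap k sees the input delayed by 2k clocks and adds its
    product to the registered partial sum arriving from tap k-1."""
    n_taps = len(h)
    delay = [0] * (2 * n_taps)      # sample delay line, delay[d] = x(n - d)
    p = [0] * n_taps                # adder-cascade registers, one per tap
    out = []
    for sample in x:
        delay = [sample] + delay[:-1]
        nxt = [(p[k - 1] if k > 0 else 0) + h[k] * delay[2 * k]
               for k in range(n_taps)]
        out.append(p[n_taps - 1])   # registered output of the last tap
        p = nxt
    return out


x = list(range(1, 21))
h = [1, -2, 3, 4]
systolic = fir_systolic(x, h)
reference = fir_direct(x, h)
```

In this model the systolic output is the direct-form output delayed by N clocks, which shows why the input sample delay must be balanced against the adder cascade.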
Figure 2-17: Systolic FIR with Adder Cascade
[Figure not reproduced. Slices 1 through 7 hold coefficients h0(n) through h6(n-6) and are cascaded through 48-bit adders with no wire shift. Slice 1 receives X(n) and a zero adder input, with the product sign extended from 36 bits to 48 bits; the last adder produces Y(n-10). Annotation: The post adders are contained wholly in dedicated silicon for highest performance and lowest power.]

Care should be taken to balance the input sample delay and the coefficients with the cascaded adder. The adaptive coefficients are staggered in time (wave coefficients).
Table 2-10: Single Slice DSP48 Implementation

Single Slice   Slice    Cycle   A           B           C   Function and OPMODE[6:0]   Output
Mode           Number
35 x 18        1        1       0,A[16:0]   B[17:0]     X   0x05                       P[16:0]
Multiply                2       A[34:17]    B[17:0]     X   0x65                       P[52:17]
35 x 35        1        1       0,A[16:0]   0,B[16:0]   X   0x05                       P[16:0]
Multiply                2       A[34:17]    0,B[16:0]   X   0x65
                        3       0,A[16:0]   B[34:17]    X   0x25                       P[33:17]
                        4       A[34:17]    B[34:17]    X   0x65                       P[69:34]
Complex        1        1       ARe[17:0]   BRe[17:0]   X   Multiply 0x05
Multiply                2       AIm[17:0]   BIm[17:0]   X   Multiply-Accumulate 0x25   P (Real)
                        3       ARe[17:0]   BIm[17:0]   X   Multiply 0x05
                        4       AIm[17:0]   BRe[17:0]   X   Multiply-Accumulate 0x25   P (Imaginary)
Figure 2-18: [Figure not reproduced. Clock cycle 1 multiplies 0,A[16:0] by B[17:0] with the Z input set to zero and produces P[16:0]; the remaining output is P[52:17].]
product. The upper partial product is formed by multiplying the signed upper 18 bits of B with the signed upper 18 bits of A: A[34:17] x B[34:17]. The 70-bit result is output sequentially in 17-bit, 17-bit, and 36-bit segments as shown in Figure 2-19.

Figure 2-19 shows the function during all four clock cycles for a single DSP48 slice used as a 35-bit x 35-bit, signed, two's complement multiplier. Increased performance can be obtained by using the pipeline registers before and after the multiplier; however, the clock latency is increased.
Figure 2-19: [Figure not reproduced. Clock cycle 1: 0,A[16:0] x 0,B[16:0] with the Z input set to zero, producing P[16:0]. Clock cycle 2: A[34:17] x 0,B[16:0] added to P after a right wire shift by 17 bits. Clock cycle 3: 0,A[16:0] x B[34:17] added to P, producing P[33:17]. Clock cycle 4: A[34:17] x B[34:17] added to the right-shifted P register, producing P[69:34].]
Mode                        Slice   A           B           C   Function and OPMODE[6:0]   Output
35 x 35 Multiply            1       0,A[16:0]   0,B[16:0]   X   0x05                       P[16:0]
                            2       A[34:17]    0,B[16:0]   X   0x65
                            3       0,A[16:0]   B[34:17]    X   0x25                       P[33:17]
                            4       A[34:17]    B[34:17]    X   0x65                       P[69:34]
18 x 18 Complex Multiply    1       ARe[17:0]   BRe[17:0]   X   Multiply 0x05
(Figure 2-22)               2       AIm[17:0]   BIm[17:0]   X   Multiply Accumulate 0x25
                            3       ARe[17:0]   BIm[17:0]   X   Multiply 0x05
                            4       AIm[17:0]   BRe[17:0]   X   Multiply Accumulate 0x25
18 x 18 Complex MAC         1       ARe[17:0]   BRe[17:0]   X   Multiply 0x05
(Figure 2-23)               2       AIm[17:0]   BIm[17:0]   X   Multiply Accumulate 0x25
                            3       ARe[17:0]   BIm[17:0]   X   Multiply 0x05
                            4       AIm[17:0]   BRe[17:0]   X   Multiply Accumulate 0x25
35 x 18 Complex Multiply    1       ARe[17:0]   BRe[17:0]   X   Multiply 0x05
Real Part (Figure 2-26)     2       AIm[17:0]   BIm[17:0]   X   Multiply Accumulate 0x25
                            3       ARe[17:0]   BIm[17:0]   X   Multiply 0x05
                            4       AIm[17:0]   BRe[17:0]   X   Multiply Accumulate 0x25
35 x 18 Complex Multiply    1       ARe[17:0]   BRe[17:0]   X   Multiply 0x05
Imaginary Part              2       AIm[17:0]   BIm[17:0]   X   Multiply Accumulate 0x25
(Figure 2-27)               3       ARe[17:0]   BIm[17:0]   X   Multiply 0x05
                            4       AIm[17:0]   BRe[17:0]   X   Multiply Accumulate 0x25
Table 2-12 summarizes utilization of more complex digital filters possible using the DSP48. The small n in the Silicon Utilization column indicates the number of DSP48 filter taps. The construction and operation of complex filters is discussed in Chapter 4, MAC FIR Filters, Chapter 5, Parallel FIR Filters, and Chapter 6, Semi-Parallel FIR Filters.

Table 2-12: Composite Digital Filters

Digital Filter                             Silicon Utilization     OPMODE
Multi-Channel FIR                          n DSP slices, n RAM     Static
Direct Form FIR                            n DSP slices            Static
Transposed Form FIR                        n DSP slices            Static
Systolic Form FIR                          n DSP slices            Static
Polyphase Interpolator                     n DSP slices, n RAM     Static
Polyphase Decimator                        n DSP slices, n RAM     Dynamic
CIC Decimation/Interpolation Filters       1 DSP slice per stage   Static
Figure 2-20: [Figure not reproduced. Slice 1 multiplies 0,A[16:0] by B[17:0] with the Z input set to zero and produces P[16:0]. Slice 2 computes PREG2 = right shifted PREG1 + (A[34:17] x B[17:0]) using the right wire shift by 17 bits and produces P[52:17].]
Figure 2-21: [Figure not reproduced. Slice 1: PREG1 = 0,A[16:0] x 0,B[16:0], with P[16:0] taken through a Z^-3 delay. Slice 2: PREG2 = right shifted PREG1 + (A[34:17] x 0,B[16:0]), using the right wire shift by 17 bits. Slice 3: adds 0,A[16:0] x B[34:17] and produces P[33:17].]
Figure 2-22: Pipelined, Complex, 18 x 18 Multiply
[Figure not reproduced. Slice 1: A_real x B_imag, with the Z input set to zero and the product sign extended from 36 bits to 48 bits. Slice 2: A_imag x B_real, added to give P_imag. Slice 3: A_real x B_real. Slice 4: A_imag x B_imag, subtracted to give P_real.]

Note: The real and the imaginary computations are functionally similar using different input data. The real output subtracts the multiplied terms, and the imaginary output adds the multiplied terms.
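The note above corresponds to the usual four-multiplier form of a complex product. As a minimal sketch (the function name `complex_mult` is hypothetical, and the slice assignments in the comments follow the labels recovered from Figure 2-22):

```python
def complex_mult(a_re, a_im, b_re, b_im):
    """Four real multiplies: two are added for the imaginary part,
    two are subtracted for the real part."""
    p_imag = a_re * b_im + a_im * b_re   # slices 1 and 2
    p_real = a_re * b_re - a_im * b_im   # slices 3 and 4
    return p_real, p_imag


product = complex_mult(3, 4, 5, -2)      # (3 + 4j) * (5 - 2j)
```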
Last Cycle:
Slice 1 + Slice 2 = P_imaginary
Slice 3 - Slice 4 = P_real

During the last cycle, the input data must stall while the final terms are added. To avoid stalling the data, use the complex multiply implementation shown in Figure 2-25 instead of the implementation shown in Figure 2-23 and Figure 2-24.
Figure 2-23: [Figure not reproduced. Slice 1: A_real x B_imag, sign extended from 36 bits to 48 bits. Slice 2: A_imag x B_real, producing P_imag. Slice 3: A_real x B_real. Slice 4: A_imag x B_imag, producing P_real.]
In Figure 2-24, the N+1 cycle adds the accumulated products, and the input data stalls one cycle.
Figure 2-24: [Figure not reproduced. Recoverable labels: slice 1 with inputs A_real and B_imag, a zero input, and sign extension from 36 bits to 48 bits; slice 2 producing P_imag; slice 3 with inputs A_real and B_real and a zero input; slice 4 producing P_real.]
An additional slice used for the accumulation is shown in Figure 2-25. The extra slice prevents the input data from stalling on the last cycle. The capability of accumulating the P cascade through the X mux feedback eliminates the pipeline stall.
Figure 2-25: [Figure not reproduced. Slice 1: A_real x B_imag, with a zero input and sign extension from 36 bits to 48 bits. Slice 2: A_imag x B_real. Slice 3: accumulator producing P_imag. Slice 4: A_real x B_real, with a zero input and sign extension from 36 bits to 48 bits. Slice 5: A_imag x B_imag, subtracted. Slice 6: accumulator producing P_real.]
Figure 2-26: [Figure not reproduced. Slice 1: PREG1 = 0,A_imag[16:0] x B_imag[17:0], subtracted, with the Z input set to zero. Slice 2: PREG2 = PREG1 + (0,A_real[16:0] x B_real[17:0]), with P[16:0] taken through a Z^-2 delay. Slice 3: PREG3 = right shifted PREG2 + (A_real[34:17] x B_real[17:0]), using the right wire shift by 17 bits. The remaining slice uses A_imag[34:17] and subtracts.]
Figure 2-27 shows the imaginary part of a fully pipelined, complex, 35-bit x 18-bit multiplier.
Figure 2-27: [Figure not reproduced. Slice 1: PREG1 = 0,A_imag[16:0] x B_real[17:0], with the Z input set to zero. Slice 2: PREG2 = PREG1 + (0,A_real[16:0] x B_imag[17:0]), with P[16:0] taken through a Z^-2 delay. Slice 3: PREG3 = right shifted PREG2 + (A_real[34:17] x B_imag[17:0]), using the right wire shift by 17 bits. Slice 4: A_imag[34:17] x B_real[17:0], with inputs delayed by Z^-3, producing P[52:17].]
Table 2-13: Miscellaneous Functional Use Models (Continued)

Miscellaneous                       Silicon Utilization   OPMODE
18-bit Barrel Shifter
48-bit Add Subtract
36-bit Add Subtract Cascade
n word MUX, 48 bit words
n word MUX, 36 bit words
48-bit Counter                      1 DSP slice           Static
Magnitude Compare                   1 DSP slice           Static
Equal to Zero Compare               1 DSP slice           Static

Figure 2-28: [Figure not reproduced. Recoverable label: Slice 2, P Result.]
Figure 2-29 shows the DSP48 used as an 18-bit circular barrel shifter. The P register for slice 1 contains leading zeros in the MSBs, followed by the most-significant 17 bits of A, followed by n trailing zeros. If n equals zero, then there are no trailing zeros and the P register contains leading zeros followed by 17 bits of A.
48 Xilinx
+
48 48 Right wire shift by 17 bits 48
Slice 1
A[0,17:1] 18 2n 18 Zero
+
48 PREG1 = [000..., 000..., 12 36 18 n + 1 zeros zeros A17:1, ...000] 17-bit n A zeros
ug073_c1_26_071004
Figure 2-29: Dynamic 18-Bit Barrel Shifter In the case of n equal to zero (i.e., no shift), the P register of slice 1 is passed to slice 2 with 17 bits of right shift. This leaves all 48 bits of the P carry input effectively equal to zero since A[17:1] were shifted toward the least-significant direction. If there is a positive shift amount, then P carry out of slice 1 contains A[17:1] padded in front by 48 17 n zeros and in back by n zeros. After the right shift by 17, only the n most-significant bits of A remain in the lower 48 bits of the P carry input. This n-bit guaranteed positive number is added to the A[17:0], left shifted by n bits. In the n least-significant bits there are zeros. The end result contained in A[17:0] of the second slice P register is A[17 n:n, 17:17 n + 1] or a barrel shifted A[17:0]. The design is fully pipelined and can generate a new result every clock cycle at the maximum DSP48 clock rate. A single slice version of the dynamic 18-bit barrel shifter can be implemented. For this implementation, Table 2-14 describes the DSP48 slice function and OPMODE settings for each clock cycle. Table 2-14: Miscellaneous DSP48 Implementations A A[17:0] A[17:0] A[17:0] A[17:0] Inputs B B[17:0] B[17:0] B[17:0] B[17:0] C X X X X Function and OPMODE[6:0] Multiply Multiply Accumulate Multiply Multiply Accumulate 0x05 0x25 0x05 0x25 Output P P
Xilinx 49
-- Copy the following two statements and paste them before the
-- Entity declaration, unless they already exist.
Library UNISIM;
use UNISIM.vcomponents.all;

-- <-----Cut code below this line and paste into the architecture body---->

-- DSP48: DSP Function Block
--        Virtex-4
-- Xilinx HDL Language Template version 6.1i

DSP48_inst : DSP48
generic map (
   AREG => 1,              -- Number of pipeline registers on the A input, 0, 1 or 2
   BREG => 1,              -- Number of pipeline registers on the B input, 0, 1 or 2
   B_INPUT => "DIRECT",    -- B input DIRECT from fabric or CASCADE from another DSP48
   CARRYINREG => 1,        -- Number of pipeline registers for the CARRYIN input, 0 or 1
   CARRYINSELREG => 1,     -- Number of pipeline registers for the CARRYINSEL, 0 or 1
   CREG => 1,              -- Number of pipeline registers on the C input, 0 or 1
   LEGACY_MODE => "MULT18X18S",  -- Backward compatibility, NONE,
                                 -- MULT18X18 or MULT18X18S
   MREG => 1,              -- Number of multiplier pipeline registers, 0 or 1
   OPMODEREG => 1,         -- Number of pipeline registers on OPMODE input, 0 or 1
   PREG => 1,              -- Number of pipeline registers on the P output, 0 or 1
   SIM_X_INPUT => "GENERATE_X_ONLY",  -- Simulation parameter for behavior for X on input.
                                      -- Possible values: GENERATE_X, NONE or WARNING
   SUBTRACTREG => 1)       -- Number of pipeline registers on the SUBTRACT input, 0 or 1
port map (
   BCOUT => BCOUT,         -- 18-bit B cascade output
// <-----Cut code below this line---->

// DSP48: DSP Function Block
//        Virtex-4
// Xilinx HDL Language Template version 7.1i

DSP48 DSP48_inst (
   .BCOUT(BCOUT),            // 18-bit B cascade output
   .P(P),                    // 48-bit product output
   .PCOUT(PCOUT),            // 48-bit cascade output
   .A(A),                    // 18-bit A data input
   .B(B),                    // 18-bit B data input
   .BCIN(BCIN),              // 18-bit B cascade input
   .C(C),                    // 48-bit C data input
   .CARRYIN(CARRYIN),        // Carry input signal
   .CARRYINSEL(CARRYINSEL),  // 2-bit carry input select
   .CEA(CEA),                // A data clock enable input

// The following defparams specify the behavior of the DSP48 slice.
// If the instance name to the DSP48 is changed, that change needs to
// be reflected in the defparam statements.

defparam DSP48_inst.AREG = 1;           // Number of pipeline registers on the A input, 0, 1 or 2
defparam DSP48_inst.BREG = 1;           // Number of pipeline registers on the B input, 0, 1 or 2
defparam DSP48_inst.B_INPUT = "DIRECT"; // B input DIRECT from fabric
                                        // or CASCADE from another DSP48
defparam DSP48_inst.CARRYINREG = 1;     // Number of pipeline registers
                                        // for the CARRYIN input, 0 or 1
defparam DSP48_inst.CARRYINSELREG = 1;  // Number of pipeline registers for the
                                        // CARRYINSEL, 0 or 1
defparam DSP48_inst.CREG = 1;           // Number of pipeline registers on the C input, 0 or 1
defparam DSP48_inst.LEGACY_MODE = "MULT18X18S"; // Backward compatibility,
                                                // NONE, MULT18X18 or MULT18X18S
defparam DSP48_inst.MREG = 1;           // Number of multiplier pipeline registers, 0 or 1
defparam DSP48_inst.OPMODEREG = 1;      // Number of pipeline registers on
                                        // OPMODE input, 0 or 1
defparam DSP48_inst.PREG = 1;           // Number of pipeline registers on the P output, 0 or 1
defparam DSP48_inst.SIM_X_INPUT = "GENERATE_X_ONLY"; // Simulation parameter for behavior
                                                     // for X on input. Possible values:
                                                     // GENERATE_X, NONE or WARNING
defparam DSP48_inst.SUBTRACTREG = 1;    // Number of pipeline registers
                                        // on the SUBTRACT input, 0 or 1

// End of DSP48_inst instantiation
Chapter 3
The DSP48 slice efficiently performs a wide range of basic math functions, including adders, subtracters, accumulators, MACs, multipliers, multiplexers, counters, dividers, square-root functions, and shifters. The optional pipeline stage within the DSP48 tile ensures high performance arithmetic functions. The DSP48 column structure and associated routing provide fast routing between DSP48 tiles with less routing congestion to the FPGA fabric. This chapter describes how to use the DSP48 slice to perform some basic arithmetic functions.
Overview
The DSP48 slice is shown in Figure 3-1. Refer to Figure 2-3, page 13 for a diagram showing two slices cascaded together.
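Figure 3-1: [Figure not reproduced. The 18-bit A and B inputs (B selectable from BCIN, with BCOUT cascading to the next slice) feed the multiplier; the X, Y, and Z multiplexers select the 48-bit operands of the adder/subtracter, including ZERO and PCIN; SUBTRACT and CIN control the adder, which drives the 48-bit P and PCOUT outputs.]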
Notes:
1. If one of X or Y is set to 01, the other one must also be set to 01.
2. For Carryin Select (CIN), see Carry Input Logic in Chapter 2.

The Verilog code for this 48-bit adder is in the reference design file ADDSUB48.v, and the VHDL code is in the reference design file ADDSUB48.vhd. This code can be used to implement any data combination for this equation by using the different OPMODEs found in Table 3-1.
Accumulate
A DSP48 slice can implement add and accumulate functions with up to 36-bit inputs. The output equation of the accumulator is:

Output = Output + A:B + C

Concatenate (:) the A and B inputs to provide a 36-bit input from Multiplexer X using the setting OPMODE[1:0] = 0b11. Select the C input to Multiplexer Y using the setting OPMODE[3:2] = 0b11. To add (accumulate) the output of the slice, select the feedback path (P) through the Z multiplexer using the setting OPMODE[6:4] = 0b010.

Other accumulate functions can be implemented by changing the OPMODE selection for the Z input multiplexer. To get an output of:

Output = Shift(P) ± (A:B + C)

use the setting OPMODE[6:4] = 0b110 to select the Shift(P) input to the Z multiplexer. To get an output of:

Output = 0 ± (A:B + C) (no accumulation)

use the setting OPMODE[6:4] = 0b000 to select the ZERO input to the Z multiplexer.

The Verilog code for the accumulator is in the reference design file ACCUM48.v, and the VHDL code is in the reference design file ACCUM48.vhd.
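The OPMODE settings above can be exercised with a small behavioral sketch of the adder/subtracter. It is not a model of the full slice: only the multiplexer encodings named in this section (plus the all-zero selection) are supported, the carry input is omitted, Shift(P) is taken as the 17-bit right shift, and A:B is assumed to be sign extended. The names `wrap48` and `adder_step` are hypothetical.

```python
MASK48 = (1 << 48) - 1

def wrap48(v):
    """Wrap an integer to a signed 48-bit value, as the P register would."""
    v &= MASK48
    return v - (1 << 48) if v >> 47 else v

def adder_step(p, a, b, c, opmode, subtract=0):
    """One clock of Z +/- (X + Y) for the selections described in the text."""
    ab = ((a & 0x3FFFF) << 18) | (b & 0x3FFFF)   # A:B concatenation, 36 bits
    if ab >> 35:
        ab -= 1 << 36                            # assumed sign extension
    x = {0b00: 0, 0b11: ab}[opmode & 0b11]                       # X multiplexer
    y = {0b00: 0, 0b11: c}[(opmode >> 2) & 0b11]                 # Y multiplexer
    z = {0b000: 0, 0b010: p, 0b110: p >> 17}[(opmode >> 4) & 0b111]  # Z multiplexer
    return wrap48(z - (x + y) if subtract else z + (x + y))


ACCUMULATE = 0b0101111          # Z = P, Y = C, X = A:B
p = 0
p = adder_step(p, a=0, b=5, c=10, opmode=ACCUMULATE)   # 0 + 5 + 10
p = adder_step(p, a=0, b=5, c=10, opmode=ACCUMULATE)   # accumulates again
```

An unsupported encoding raises a `KeyError` in this sketch rather than guessing at behavior not described here.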
Multiplexer
There are three multiplexers in a DSP48 slice: the 3:1 Y multiplexer, the 4:1 X multiplexer, and the 6:1 Z multiplexer. To use the slice as a pure multiplexer, only one multiplexer is active. Make the other two multiplexers inactive by choosing the OPMODE that selects their ZERO inputs. The two DSP48 tiles in a slice can be combined to make wider input multiplexers.
Barrel Shifter
An 18-bit barrel shifter can be implemented using the two DSP48 tiles in the DSP slice. To barrel shift the 18-bit number A[17:0] two positions to the left, the output from the barrel shifter is A[15:0], A[17], and A[16]. This operation is implemented as follows.

The first DSP48 is used to multiply {0,A[17:1]} by 2^2. The output of this DSP48 tile is now {0,A[17:1],0,0}. The output from the first tile is fed into the second DSP48 tile over the PCIN/PCOUT signals, and is passed through the 17-bit right-shifted input. The input to the Z multiplexer becomes {0,A[17],A[16]}, or {0,A[17:1],0,0} shifted right by 17 bits.

The multiplier inputs to the second DSP48 tile are A = A[17:0] and B = 2^2. The output of this multiplier is {A[17:0],0,0}. This output is added to the 17-bit right-shifted value of {0,A[17],A[16]} coming from the previous slice. The 18-bit output of the adder is {A[15:0],A[17],A[16]}. This is the initial A input shifted by two to the left.

The Verilog code is in the reference design file barrelshifter_18bit.v, and the VHDL code is in the reference design file barrelshifter_18bit.vhd.
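The two-tile sequence can be traced with plain integer arithmetic. This is a behavioral sketch, not the reference design; the function name `barrel_shift_18` is hypothetical, and the shift amount n is assumed to lie between 0 and 17.

```python
MASK18 = (1 << 18) - 1

def barrel_shift_18(a, n):
    """Rotate the 18-bit value a left by n positions using the two-multiplier scheme."""
    a &= MASK18
    p1 = (a >> 1) << n            # first tile: {0, A[17:1]} * 2^n
    z = p1 >> 17                  # cascade input after the 17-bit right wire shift
    p2 = (a << n) + z             # second tile: A[17:0] * 2^n plus the wrapped bits
    return p2 & MASK18            # lower 18 bits hold the rotated value


# A[17] and A[0] set: a two-position shift moves A[17] to bit 1 and A[0] to bit 2.
example = barrel_shift_18((1 << 17) | 1, 2)
```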
Counter
The DSP48 slice can be used as a counter to count up by one on each clock cycle. Setting the SUBTRACT input to 0, the carry-in input (CIN) to 1, and OPMODE[6:0] = 0b0100000 gives an output of P + CIN. After the first clock, the output P is 0 + 1 = 1. Subsequent outputs are P + 1. This method is equivalent to counting up by one. The counter can be used as a down counter by setting the SUBTRACT input to 1 at the start.

The counter can also be preloaded using the C input to provide the preload value. Setting the carry-in input (CIN) to 1 and OPMODE[6:0] = 0b0110000 gives an output of P = C + 1 in the first cycle. For subsequent clocks, set the OPMODE to select P = P + 1 by changing OPMODE[6:0] from 0b0110000 to 0b0100000.

The Verilog code for a loadable counter is in the reference design file CNTR_LOAD.v, and the VHDL code for a loadable counter is in the reference design file CNTR_LOAD.vhd.
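The count and preload sequence can be sketched as follows. This is an illustrative model under stated assumptions: only the two OPMODE values named above are handled, the X and Y multiplexers are taken as zero, and the names `wrap48` and `counter_step` are hypothetical.

```python
MASK48 = (1 << 48) - 1

def wrap48(v):
    """Wrap an integer to a signed 48-bit value, as the P register would."""
    v &= MASK48
    return v - (1 << 48) if v >> 47 else v

COUNT = 0b0100000     # Z multiplexer selects P
PRELOAD = 0b0110000   # Z multiplexer selects C

def counter_step(p, c, opmode, cin=1, subtract=0):
    """One clock of P = Z +/- CIN for the two OPMODE values used by the counter."""
    z = {COUNT: p, PRELOAD: c}[opmode]
    return wrap48(z - cin if subtract else z + cin)


p = counter_step(0, c=100, opmode=PRELOAD)   # first cycle: P = C + 1
trace = [p]
for _ in range(3):
    p = counter_step(p, c=100, opmode=COUNT) # subsequent cycles: P = P + 1
    trace.append(p)
```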
Multiply
A single DSP48 slice can implement an 18x18 signed or unsigned multiplier. Larger multipliers can be implemented in a single DSP48 slice by sequentially shifting the appropriate number of bits in each clock cycle. The Verilog implementation of an 18x18 multiplier is in the reference design file MULT18X18_PARALLEL.v, and the VHDL implementation is in the reference design file MULT18X18_PARALLEL.vhd. The Verilog implementation of a 35x35 multiplier and a sequential 35x35 multiplier are in the reference design files MULT35X35_PIPE.v and MULT35X35_SEQUENTIAL_PIPE.v respectively. The VHDL implementation of a 35x35 multiplier and a sequential 35x35 multiplier are in the reference design files MULT35X35_PIPE.vhd and MULT35x35_SEQUENTIAL_PIPE.vhd, respectively.
Divide
Binary division can be implemented in the DSP48 slice by performing a shift and subtract or a multiply and subtract. The DSP48 slice includes a shifter, a multiplier, and an adder/subtracter unit to implement binary division. The division by subtraction and division by multiplication algorithms are shown below. These algorithms assume:

1. N > D
2. N and D are both positive

If either N or D is negative, use the same algorithms by taking the absolute values of N and D and making the appropriate sign change in the result. The terms N and D in the algorithms refer to the number to be divided (N) and the divisor (D). The terms Q and R in the algorithms refer to the quotient and remainder, respectively.
After the eighth iteration, Q[7:0] contains the quotient, and R[7:0] contains the remainder. For example:

\[ \frac{N}{D} = \frac{8}{3} = \frac{0000{,}1000}{011} = Q(10) + R(10) \]

Step   Iteration (n)   Action                          Q After Action   R After Action
1      1               R = 0000,0000                   xxxx,xxxx        0000,0000
2      1               R <-- N[7] = 0000,0000          xxxx,xxxx        0000,0000
3      1               R-D = Negative                  xxxx,xxxx        0000,0000
4      1               Q[7] = 0                        0xxx,xxxx        0000,0000
2      2               R <-- N[6] = 0000,0000          0xxx,xxxx        0000,0000
3      2               R-D = Negative                  0xxx,xxxx        0000,0000
4      2               Q[6] = 0                        00xx,xxxx        0000,0000
2      3               R <-- N[5] = 0000,0000          00xx,xxxx        0000,0000
3      3               R-D = Negative                  00xx,xxxx        0000,0000
4      3               Q[5] = 0                        000x,xxxx        0000,0000
2      4               R <-- N[4] = 0000,0000          000x,xxxx        0000,0000
3      4               R-D = Negative                  000x,xxxx        0000,0000
4      4               Q[4] = 0                        0000,xxxx        0000,0000
2      5               R <-- N[3] = 0000,0001          0000,xxxx        0000,0001
3      5               R-D = Negative                  0000,xxxx        0000,0001
4      5               Q[3] = 0                        0000,0xxx        0000,0001
2      6               R <-- N[2] = 0000,0010          0000,0xxx        0000,0010
3      6               R-D = Negative                  0000,0xxx        0000,0010
4      6               Q[2] = 0                        0000,00xx        0000,0010
2      7               R <-- N[1] = 0000,0100          0000,00xx        0000,0100
3      7               R-D = Positive                  0000,00xx        0000,0100
4      7               Q[1] = 1, R = 0000,0001         0000,001x        0000,0001
2      8               R <-- N[0] = 0000,0010          0000,001x        0000,0010
3      8               R-D = Negative                  0000,001x        0000,0010
4      8               Q[0] = 0                        0000,0010        0000,0010
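The iteration pattern in the table above (shift in the next bit of N, try R - D, set the quotient bit accordingly) can be written as a short sketch. This is a software illustration, not the DIV_SUB reference design; the function name `divide_by_subtraction` and the `bits` parameter are hypothetical, and N and D are assumed non-negative with D greater than zero.

```python
def divide_by_subtraction(n, d, bits=8):
    """Shift-and-subtract division, one quotient bit per iteration, MSB first."""
    q = 0
    r = 0
    for i in range(bits - 1, -1, -1):
        r = (r << 1) | ((n >> i) & 1)   # step 2: shift the next bit of N into R
        if r - d >= 0:                  # step 3: trial subtraction
            r -= d                      # step 4: keep the difference, set Q[i] = 1
            q |= 1 << i
        # otherwise Q[i] stays 0 and R is unchanged
    return q, r


quotient, remainder = divide_by_subtraction(8, 3)
```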
Step   Iteration (n)   Action                                      Q After Action
1      8               Q[8-8] = 1                                  0000,0011
2      8               D*Q = 3 * 3 = 9                             0000,0011
3      8               N - (D*Q) = 8 - 9 = Negative                0000,0010
                       Q[8-8] = 0
                       Remainder = N - (D*Q) = 8 - (3*2) = 2
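The final iteration shown above suggests the general pattern for division by multiplication: set a trial quotient bit, multiply by D, and clear the bit if N - (D*Q) is negative. The following sketch assumes that pattern holds for every iteration from the MSB down; it is not the DIV_MULT reference design, and the function name `divide_by_multiplication` is hypothetical.

```python
def divide_by_multiplication(n, d, bits=8):
    """Trial-bit division: set Q[bits-n], test N - D*Q, clear the bit if negative."""
    q = 0
    for i in range(bits - 1, -1, -1):
        q |= 1 << i                 # step 1: set the trial bit
        if n - d * q < 0:           # steps 2 and 3: multiply and subtract
            q &= ~(1 << i)          # negative result: clear the trial bit
    return q, n - d * q             # remainder = N - (D*Q)


quotient, remainder = divide_by_multiplication(8, 3)
```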
Both of the division implementations are possible in one DSP48 slice. The slice usage for 8-bit division is one DSP48, and the latency is eight clock cycles. The Verilog code for the Divide by Subtraction implementation is in the reference design file DIV_SUB.v, and the VHDL code is in the reference design file DIV_SUB.vhd. The Verilog code for the Divide by Multiplication implementation is in DIV_MULT.v and the VHDL code for the second implementation is in DIV_MULT.vhd.
Square Root
The square root of an integer number can be calculated by successive multiplication and subtraction. This is similar to the subtraction method used to divide two numbers. The square root of an N-bit number will have N/2 (rounded up) bits. If the square root is a fractional number, N/2 clocks are needed for the integer part of the answer, and every following clock gives one bit of the fraction part. The logic needed to compute this is shown in Figure 3-2.
Figure 3-2: [Figure not reproduced. Recoverable labels: Register A Input; Subtractor; Input = Reg C.]
The square root for an 8-bit number can be calculated as follows:

sqrt(X) = Y.Z

Y is the integer part of the root, and Z is the fraction part. Register A refers to the registers found on the A input to the DSP48 slice, and Register C refers to the registers found on the C input to the DSP48 slice.

1. Read the number into Register C. Set Register A to 8'b10000000.
2. Calculate Register C - (Register A * Register A).
3. If step 2 is positive, set Register A[(8-clock)] = 1, Register A[(8-clock)-1] = 1.
   If step 2 is negative, set Register A[(8-clock)] = 0, Register A[(8-clock)-1] = 1.
4. Repeat steps 1 to 3.

Four clocks are required to calculate the integer part of the value (Y). The number of clocks required for the fraction part (Z) depends on the precision required. For an 8-bit input value, the value in Register A after eight clocks includes the integer part given by the four MSBs and the fractional part given by the four LSBs.

For example, find the square root of 11 decimal = 3.3166. Because 11 decimal is a 4-bit binary number, the integer part is two bits wide and is obtained in two clock cycles. The bit width of the fractional part depends on the precision required. In this example, four bits of precision are used, requiring four clock cycles. The binary value of 11 decimal is 1011. Expressed as an 8-bit number, it becomes 0000,1011. Store this value as 0000,1011,0000,0000. The last eight bits are necessary because the result is an 8-bit number, and 8 bits * 8 bits gives a 16-bit multiplication result.

Clock   Step   Action
1       1      Register A = 1000,0000
1       2      0000,1011,0000,0000 - (1000,0000 * 1000,0000)
1       3      Step 2 is negative. Set Register A to 0100,0000
2       1      Register A = 0100,0000
2       2      0000,1011,0000,0000 - (0100,0000 * 0100,0000)
2       3      Step 2 is negative. Set Register A to 0010,0000
3       1      Register A = 0010,0000
3       2      0000,1011,0000,0000 - (0010,0000 * 0010,0000)
3       3      Step 2 is positive. Set Register A to 0011,0000
4       1      Register A = 0011,0000
4       2      0000,1011,0000,0000 - (0011,0000 * 0011,0000)
4       3      Step 2 is positive. Set Register A to 0011,1000
5       1      Register A = 0011,1000
5       2      0000,1011,0000,0000 - (0011,1000 * 0011,1000)
5       3      Step 2 is negative. Set Register A to 0011,0100
6       1      Register A = 0011,0100
6       2      0000,1011,0000,0000 - (0011,0100 * 0011,0100)
6       3      Step 2 is positive. Set Register A to 0011,0110
7       1      Register A = 0011,0110
7       2      0000,1011,0000,0000 - (0011,0110 * 0011,0110)
7       3      Step 2 is negative. Set Register A to 0011,0101
8       1      Register A = 0011,0101
8       2      0000,1011,0000,0000 - (0011,0101 * 0011,0101)
8       3      Step 2 is positive.

The output is in Register A and is 0011,0101. The final answer is 11.0101 binary (3.3125 decimal). The Verilog code for this implementation (8-bit input, 8 clocks) is in SQRT.v, and the VHDL code is in SQRT.vhd.
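The bit-by-bit procedure can be reproduced in a few lines. This is a behavioral sketch rather than the SQRT reference design: the function name `sqrt_by_trial` is hypothetical, a zero difference is treated as positive, and the result is interpreted with four integer and four fractional bits as in the example above.

```python
def sqrt_by_trial(x, bits=8):
    """Successive multiply-and-subtract square root of an unsigned `bits`-wide value."""
    c = x << bits                   # Register C: input aligned to the 16-bit product
    a = 1 << (bits - 1)             # Register A starts at 1000,0000
    for clock in range(1, bits + 1):
        i = bits - clock            # bit under test in this clock
        if c - a * a < 0:
            a &= ~(1 << i)          # negative: clear the trial bit
        if i > 0:
            a |= 1 << (i - 1)       # set the next lower bit for the next trial
    return a


root = sqrt_by_trial(11)            # register value for the example above
```

Dividing `root` by 16 gives the fixed-point value, 3.3125 for an input of 11.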
\[ SoS = A^2 + B^2 \]
Equation 3-1

or

\[ SoS = \sum_{i=0}^{n-1} A_i^2 \]
Equation 3-2

These functions are basic multiply-accumulate operations easily implemented on the DSP48 slice as described in Multiply Accumulate (MAC), page 55. A variation of this function is when the square root of either of the above equations is needed. In this case, the OPMODE does the MAC function for n cycles and then switches to do the square root function for the next n cycles. The SUBTRACT input is dynamic and does an add for the MAC cycles and a subtract for the square root cycles. With the SUBTRACT input equal to 0, the OPMODE for the function is 0110101. A square root function is implemented by changing the SUBTRACT input to a 1.
Conclusion
The DSP48 slice has a variety of features for fast and easy implementation of many basic math functions. The dedicated routing region around the DSP48 slice and the feedback paths provided in each slice result in routing improvements. The high-speed multiplier and adder/subtracter unit in the slice deliver high-speed math functions.
Chapter 4
This chapter describes the implementation of a Multiply-Accumulate (MAC) Finite Impulse Response (FIR) filter using the DSP48 slice in a Virtex-4 device. Because the Virtex-4 architecture is flexible, constructing FIR filters for specific application requirements is practical. Creating optimized filter structures of a sequential nature saves resources and potential clock cycles. This chapter demonstrates two sequential filter architectures: the single-multiplier and the dual-multiplier MAC FIR filter. Reference design files are available for System Generator for DSP, VHDL, and Verilog. These reference designs permit filter parameter changes including coefficients and the number of taps.
Overview
A large array of filtering techniques is available to signal processing engineers. A common filter implementation uses the single multiplier MAC FIR filter. In the past, this structure used the Virtex-II embedded multipliers and 18K block RAMs. The Virtex-4 DSP48 slice contains higher performance multiplication and arithmetic capabilities specifically designed to enhance the use of MAC FIR filters in FPGA-based Digital Signal Processing (DSP).

\[ y_n = \sum_{i=0}^{N-1} x_{n-i}\, h_i \]
Equation 4-1

In this equation, a set of N coefficients is multiplied by N respective data samples, and the inner products are summed together to form an individual result. The values of the coefficients determine the characteristics of the filter (e.g., low-pass filter, band-pass filter, high-pass filter). The equation can be mapped to many different implementations (e.g., sequential, semi-parallel, or parallel) in the different available architectures.
For slow sample rate requirements and a large number of coefficients, the single MAC FIR filter is well suited and dual-port block RAM is the optimal choice for the memory buffer. This structure is illustrated in Figure 4-1. If the number of coefficients is small, distributed memory and the SRL16E can be used as the data and coefficient buffers. For more information on using distributed memory, refer to Using Distributed RAM for Data and Coefficient Buffers, page 70.
Figure 4-1: [Block diagram: dual-port block RAM holding the data samples (96 x 18, port A) and the coefficients (96 x 18, port B), control logic driving the data address, coefficient address, and WE, the DSP48 slice with its load input, and the optional output register.]
The input data buffer is implemented in dual-port block RAM. The read address port is clocked N times faster than the input samples are written into the data port, where N is the number of filter taps. The filter coefficients are also stored in the same dual-port block RAM, and are output at port B. Hence, the RAM is used in a mixed-mode configuration. The data is written and read from port A (RAM mode), and the coefficients are read only from port B (ROM mode).

The control logic provides the necessary address logic for the dual-port block RAM and creates a cyclic RAM buffer for port A (data buffer) to create the FIR filter delay line. An optional output capture register may be required for streaming operation if the accumulation result cannot be used immediately in downstream processing.

The multiplier followed by the accumulator sums the products over the same number of cycles as there are coefficients. With this relationship, the performance of the MAC FIR filter is calculated by the following equation:

Maximum Input Sample Rate = Clock Speed / Number of Taps (Equation 4-2)
If the coefficients are symmetric, a slightly costlier structure that doubles the maximum sample rate is available (see Symmetric MAC FIR Filter, page 72). The sample rate of the costlier structure is defined as follows:

Sample Rate = Clock Speed / (1/2 x Number of Taps) (Equation 4-3)
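As a quick check of Equations 4-2 and 4-3, the following minimal sketch evaluates both sample rates. The function name and the `symmetric` flag are illustrative, not part of any reference design; the 450 MHz and 96-tap figures are the ones used in the performance comparison later in this chapter.

```python
def max_sample_rate(clock_hz: float, num_taps: int, symmetric: bool = False) -> float:
    """Maximum input sample rate of a single-MAC FIR filter.

    Equation 4-2 (non-symmetric): clock / taps
    Equation 4-3 (symmetric):     clock / (taps / 2)
    """
    cycles_per_result = num_taps / 2 if symmetric else num_taps
    return clock_hz / cycles_per_result


if __name__ == "__main__":
    # 96 taps at 450 MHz
    print(f"{max_sample_rate(450e6, 96) / 1e6:.4f} MSPS")                  # 4.6875 MSPS
    print(f"{max_sample_rate(450e6, 96, symmetric=True) / 1e6:.4f} MSPS")  # 9.3750 MSPS
```

The symmetric case takes Equation 4-3 literally, so it is most meaningful for an even number of taps.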
Bit Growth
Because an FIR filter performs numerous multiplies and adds, it outputs a larger number of bits than are present on the filter's input. This effect is the "bit growth" or the "gain" of a filter. These larger results cannot be maintained throughout a system due to cost implications. Therefore, the full precision result is typically rounded and quantized (refer to Rounding, page 69) back to a desired level. However, it is important to calculate the full precision output in order to select the correct bits from the output of the MAC. A simple explanation for implementation purposes involves considering the maximum value expected at the output (saturation level). A greater understanding of the specific filter enhances the accuracy of the output bit width. The following two techniques help determine the full precision output bit width.
where:
ceil: Rounds up to the nearest integer
abs: Takes the absolute value of a number (not negative)
sum: Sums all the values in an array
B: Number of bits in the data samples
C: Number of bits in the coefficients

If the output width exceeds 48 bits, there are notable effects on the size (in terms of the number of DSP48 slices used to implement the filter), because the DSP48 slice is limited to a 48-bit result. The output width can be extended by using more DSP48 slices; however, reconsidering the specification is more practical.
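The sketch below shows one plausible way to compute the full precision output width from the symbols defined above (ceil, abs, sum, B, C). The two formulas are an assumption: a generic worst-case bound that uses only the bit widths and the tap count, and a tighter bound that uses the actual coefficient values. The function names and the sample coefficient list are hypothetical.

```python
import math


def generic_output_width(data_bits: int, coef_bits: int, num_taps: int) -> int:
    """Assumed generic bound: every product reaches full scale."""
    worst_case = 2 ** (data_bits - 1) * 2 ** (coef_bits - 1) * num_taps
    return math.ceil(math.log2(worst_case)) + 1  # +1 for the sign bit


def coefficient_specific_output_width(data_bits: int, coefficients) -> int:
    """Assumed tighter bound: full-scale data against the real coefficients."""
    worst_case = 2 ** (data_bits - 1) * sum(abs(c) for c in coefficients)
    return math.ceil(math.log2(worst_case)) + 1  # +1 for the sign bit


if __name__ == "__main__":
    print(generic_output_width(18, 18, 96))                          # B = C = 18, 96 taps
    print(coefficient_specific_output_width(18, [1, -2, 3, -2, 1]))  # hypothetical coefficients
```

Under these assumptions, an 18 x 18 filter with 96 taps stays below the 48-bit limit of the DSP48 slice.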
Control Logic
The control logic is very straightforward when using an SRL16E for the data buffer. For dual-port block RAM implementations, the cyclic RAM buffer is required. This can complicate the control logic, and there are two different ways this control can be implemented. Both techniques produce the same results, but one way uses all slice-based logic to produce the results, while the other way embeds the control in the available space in the block RAM. The basic architecture of the control logic for the slice-based approach is outlined in Figure 4-2.
Figure 4-2: Dual-Port Block RAM MAC FIR Filter Control Logic Using Slices
The control logic consists of two counters. One counter drives the address of the coefficient section of the dual-port block RAM, while the other controls the address for the data buffer. A comparator controls an enable to the data buffer counter to disable the count for one cycle every output sample, and writes a new sample into the data buffer every N cycles. A simplified diagram of the control logic and the memory is shown in Figure 4-3.
Figure 4-3: [Simplified diagram of the control logic and memory: a data counter with enable addressing the data buffer (addresses 0 to 95) on port A with WE, and a coefficient counter addressing the coefficient ROM section (addresses 96 to 191) on port B.]
The cyclic data RAM buffer is required to emulate the delay line shift register of the FIR filter while using a static RAM. The RAM is addressed sequentially every clock cycle. The counter rolls over to have the last coefficient (N-1) read out. At this point, the data buffer is stalled by the controlling clock enable and the newest sample is read into the buffer AFTER the oldest data sample is read out. This newest data sample is now multiplied by the first coefficient (as the coefficient address counter is never disabled) and the cycle is repeated. The effect is of data shifting over time as the FIR filter equation requires. The ability to perform a simultaneous read and write requires the RAM buffer to have a read port and a write port (called read before write mode).

The inverted WE signal is also used to drive the load input (OPMODE[5]) on the DSP48 slice. This signal must be delayed with a simple SRL16E to make sure the latency on the signal matches the latency through the MAC engine. This delay is typically four clocks, but depends upon the number of pipelining registers used in the DSP48 slice and block RAM. The number of required pipelining stages is a function of the desired achievable clock frequency.

The number of resources used for the control logic is easily calculated. The counters always use two bits per slice plus the additional logic required to limit the count (unless the count limit is a power of two). The count limiter circuit size is determined by the number of bits needed to represent the count limit value divided by four. Therefore, n/2 + n/4 slices are required for each counter, but the coefficient counter is larger due to the higher count value. The other control logic typically yields about N/4 slices due to the comparator required for the enable circuitry and the inverter to disable the data counter.
The total number of slices for the control logic for an 18 x 18 MAC FIR filter with 96 coefficients is listed in Table 4-1.

Table 4-1: Control Logic Using Slices Resource Utilization
Elements | Slices
Coefficient Counter | 5
Data Counter | 4
Relational Operator | 1
Other Logic | 1
Total | 11
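The cyclic data buffer described above can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions, not a cycle-accurate description of the reference design: it ignores pipeline latency, it writes the new sample before reading instead of modeling read-before-write, and it walks the buffer from the newest sample back to the oldest. The function name `mac_fir` is hypothetical.

```python
def mac_fir(samples, coeffs):
    """Single-multiplier MAC FIR using a circular buffer as the delay line."""
    n_taps = len(coeffs)
    ram = [0] * n_taps   # data buffer section of the RAM, initially empty
    write_addr = 0       # location holding the oldest sample
    outputs = []
    for x in samples:
        ram[write_addr] = x              # newest sample replaces the oldest
        acc = 0                          # accumulator is reloaded for each result
        addr = write_addr
        for h in coeffs:                 # one multiply-accumulate per clock cycle
            acc += ram[addr] * h
            addr = (addr - 1) % n_taps   # step to the next-older sample
        outputs.append(acc)
        write_addr = (write_addr + 1) % n_taps
    return outputs


if __name__ == "__main__":
    # An impulse returns the coefficients in order.
    print(mac_fir([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```

No data is ever moved inside the buffer; only the address pointer advances, which is why a static RAM can stand in for a shift register.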
trick. Figure 4-4 illustrates the control logic and memory layout for this embedded control logic implementation.
Figure 4-4: [Diagram of the embedded control logic and memory layout: the coefficient address, accumulator load, CE, and WE signals are output from port B alongside the 18-bit coefficients. The RAM must be in read-before-write mode with the output register on.]
Figure 4-4 demonstrates how the predictable and repeatable control sequence for the coefficient side of the memory can be embedded into the remaining space of the memory. The coefficient address value, accumulator Load signal, CE, and WE for the data buffer are precalculated and concatenated onto the coefficient values. The memory must be used in 512 x 36 mode, instead of 1024 x 18 mode. The individual signals are split up correctly on the output of the memory. This costs nothing in logic utilization apart from routing.

Due to the feedback nature of the address line, it is important to set the initial state of the dual-port block RAM's output register to effectively kick-start the MAC process. The initial values need to be different from each other to start the correct addressing; however, the silicon forces them to be the same. This changes the 1-bit masking of the LSB of the coefficient address such that the first value is 0 instead of the initialized value of 1. The initial value of the output latch is on the address bus the next cycle and, by unmasking the LSB, the count is successfully kick-started. Because the coefficients are placed in the upper half of the memory, only a single LSB must be masked, not the complete address bus. The masking signal can take the form of a reset signal or a registered permanent value to get the required single cycle mask. Each address concatenated onto its respective coefficient is the next required address (ahead by two cycles due to the output latch and register) to keep cycling through the coefficients.

This technique enables a reduction in the control logic required for the MAC FIR filter, but it can only be exploited when the number of coefficients is smaller than 256 for greater than 9-bit data (256 data and 256 coefficient elements are required to be stored). Table 4-2 highlights the smaller resource utilization.

Table 4-2: Control Logic Using Embedded Block RAMs Resource Utilization
Element | Slices
Control Counter | 5
Total | 5
Rounding
As noted earlier, the number of bits on the output of the filter is much larger than the number of bits on the input, and must be reduced to a manageable width. The output can be truncated by simply selecting the MSBs required from the filter. However, truncation introduces an undesirable DC data shift due to the nature of two's complement numbers. Negative numbers become more negative, and positive numbers also become more negative. The DC shift can be improved with the use of symmetric rounding, where positive numbers are rounded up and negative numbers are rounded down.

The rounding capability built into the DSP48 slice maintains performance and minimizes the use of the FPGA fabric. This is implemented in the DSP48 slice using the C input port and the Carry-In port. The rounding is achieved in the following manner:

For positive numbers: Binary Data Value + 0.10000... and then truncate
For negative numbers: Binary Data Value + 0.01111... and then truncate

The actual implementation always adds 0.0111... to the data value using the C input port, as in the negative case, and then adds the extra carry-in required to adjust for positive numbers. Table 4-3 illustrates some examples of symmetric rounding.

Table 4-3: Symmetric Rounding Examples
Binary Value | Add Round | Truncate: Finish | Rounded Value
0010.0111 | 0010.1111 | 0010 | 2
0010.1000 | 0011.0000 | 0011 | 3
0010.1001 | 0011.0001 | 0011 | 3
1101.1001 | 1110.0000 | 1110 | -2
1101.1000 | 1101.1111 | 1101 | -3
1101.0111 | 1101.1110 | 1101 | -3
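The rounding scheme can be checked with a short integer model. This is a minimal sketch, assuming the fixed-point value is held as a Python integer with `frac_bits` fractional bits; the function name is illustrative. The constant plays the role of the C input and the sign-dependent term plays the role of the carry-in.

```python
def symmetric_round(value: int, frac_bits: int) -> int:
    """Round a two's complement fixed-point value half away from zero."""
    constant = (1 << (frac_bits - 1)) - 1  # 0.0111... always added (C input)
    carry_in = 1 if value >= 0 else 0      # extra LSB for positive numbers
    return (value + constant + carry_in) >> frac_bits  # truncate the fraction


if __name__ == "__main__":
    # The six rows of the rounding table, with four fractional bits.
    rows = [0b00100111, 0b00101000, 0b00101001,  # 2.4375, 2.5, 2.5625
            -39, -40, -41]                       # -2.4375, -2.5, -2.5625
    print([symmetric_round(v, 4) for v in rows])  # [2, 3, 3, -2, -3, -3]
```

Python's right shift on negative integers is an arithmetic shift, so it behaves like truncation of a two's complement word.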
In the MAC FIR filter, the C input cannot be added during normal accumulation, because the X and Y multiplexers carry the multiplier output and the Z multiplexer is used for the feedback from the P output. Therefore, for rounding to be performed, either an extra cycle or another DSP48 slice is required. Typically, an extra cycle is used to save on DSP48 slices. On the extra cycle, OPMODE is changed for the X and Y multiplexers, setting the X multiplexer to zero and the Y multiplexer to use the C input to add the user-specified rounding constant for the negative rounding scenario. The Z multiplexer remains unchanged, as the feedback loop is still required, leading to an OPMODE of 0101100. The simplified diagram in Figure 4-5 shows how the DSP48 slice functions during this extra cycle.
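To make the OPMODE values in this chapter easier to read, the sketch below splits a 7-bit OPMODE string into its Z, Y, and X multiplexer fields. The field boundaries (Z = bits 6:4, Y = bits 3:2, X = bits 1:0) and the selection tables are assumptions taken from the general DSP48 documentation rather than from this chapter, and `decode_opmode` is a hypothetical helper.

```python
# Assumed multiplexer encodings; "M" denotes the multiplier output.
X_MUX = {"00": "0", "01": "M", "10": "P", "11": "A:B"}
Y_MUX = {"00": "0", "01": "M", "11": "C"}
Z_MUX = {"000": "0", "001": "PCIN", "010": "P", "011": "C",
         "101": "PCIN>>17", "110": "P>>17"}


def decode_opmode(opmode: str):
    """Return the (Z, Y, X) selections for a 7-character binary OPMODE."""
    if len(opmode) != 7 or set(opmode) - {"0", "1"}:
        raise ValueError("OPMODE must be seven binary digits")
    z, y, x = opmode[0:3], opmode[3:5], opmode[5:7]
    return Z_MUX[z], Y_MUX[y], X_MUX[x]


if __name__ == "__main__":
    print(decode_opmode("0100101"))  # ('P', 'M', 'M'): multiply-accumulate
    print(decode_opmode("0101100"))  # ('P', 'C', '0'): add C to P on the rounding cycle
```

Encodings that are missing from the tables raise a `KeyError`, which is a convenient way to flag a mistyped OPMODE.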
Figure 4-5: [Block diagram of the MAC FIR filter during the rounding cycle: dual-port block RAM (96 x 18 data samples, 96 x 18 coefficients), control logic with OPMODE translation, the rounding constant on the C input, the carry input (CIN), and the DSP48 slice.]
FIR filters. Figure 4-6 illustrates the MAC FIR filter implementation using distributed RAM for the coefficient bank and an SRL16E for the data buffer.
Figure 4-6: [Block diagram: SRL16E data buffer on the A input, coefficient memory on the B input, control logic driving WE, Addr, and load, and the DSP48 slice with OPMODE = 0100101.]
The resource utilization is still small for these small memories. For a 16-tap (or less), n-bit memory bank, the cost is n/2 slices. Therefore, for this example, the cost is nine slices per memory bank (18 slices in total). The added benefit of using SRL16Es is the embedded shifting capabilities leading to a reduction in control logic. Only a single count value is required to address both the coefficient buffer and the data buffer. The terminal count signal is used to write the slower input samples into the data buffer, to capture the results, and to load the accumulator with the new set of inner products. The size of the control logic and memory buffer for a 16-tap, 18-bit data and coefficient FIR is detailed in Table 4-4.

Table 4-4: Control Logic Resource Utilization
Element | Slices
Data Buffer | 9
Coefficient Memory | 9
Control Counter | 2
Relational Operator | 1
Capture/Load Delay | 1
Total | 22
All aspects of the DSP48 and capture register approach to the MAC FIR filter using distributed RAM are identical to the block RAM based MAC FIR.
Performance
Table 4-5 compares the performance of a Virtex-4 MAC FIR filter with a Virtex-II Pro solution. Overall, the Virtex-4 DSP48 slice greatly reduces the logic fabric resource requirement, improves the speed of the design, and reduces filter power consumption.

Table 4-5: 18 x 18 MAC FIR Filter (96 Tap) Comparison
Parameter | Virtex-II Pro FPGA | Virtex-4 FPGA
Size | 99 slices, 1 Embedded Multiplier, 1 block RAM | 24 slices, 1 DSP48 Slice, 1 block RAM
Sample Rate | 3.125 MSPS | 4.69 MSPS
Clock Speed | 250 MHz | 450 MHz
Power | 170 mW | 57 mW
Figure 4-7 shows the architecture for a symmetric MAC FIR filter.
Figure 4-7: [Block diagram of the symmetric MAC FIR filter: dual-port block RAM with dual read access to the 17-bit data samples (96 x 18), a 48 x 18 coefficient memory, control logic driving the Data1, Data2, and coefficient addresses and WE, and the DSP48 slice with its load input.]
There are limitations to using the symmetric MAC FIR filter. Due to the 1-bit growth from the pre-adder shown in Figure 4-7, the data input to the filter must be less than 18 bits to fit into one DSP48 slice. If necessary, the pre-adder can be implemented in slices or in another DSP48 slice. The performance of this fabric-based adder represents the critical path through the filter and limits the maximum clock speed. There are extra resources required for the filter to support symmetry. Three memory ports are needed along with the pre-adder. The control portion increases in resource utilization since the data is read out of one port in a forward direction and in reverse on the second port. This technique should only be utilized when extra sample rate performance is required.
Figure 4-8: [Block diagram: control logic with WE, 43 x 18 data sample memory on the A inputs, two 43 x 18 coefficient memories on the B inputs, and the DSP48 slices.]

Figure 4-9: [Block diagram: control logic with WE, coefficient address, and OPMODE translation, 43 x 18 data sample memory, and 43 x 18 coefficient memories.]
Conclusion
MAC FIR filters are commonly used in DSP applications. With the introduction of the Virtex-4 DSP48 slice, this function can be achieved in a smaller area, while at the same time producing higher performance with lower power consumption. Designers have tremendous flexibility in determining the desired implementation as well as the ability to change the implementation parameters. Each specification and design scenario brings a different set of restrictions for the design. Several more techniques are discussed in the next chapters. The ability to "tune" a filter in an existing system or to have multiple filter settings is a distinct advantage. The HDL and System Generator for DSP reference designs are easily modified to achieve specific requirements, such as different coefficient values and smaller data and coefficient bit widths.
Chapter 5
This chapter describes the implementation of high-performance, parallel, full-precision FIR filters using the DSP48 slice in a Virtex-4 device. Because the Virtex-4 architecture is flexible, it is practical to construct custom FIR filters to meet the requirements of a specific application. Creating optimized, parallel filters saves resources and potential clock cycles. This chapter demonstrates two parallel filter architectures: the Transposed and Systolic Parallel FIR filters. The reference design files in VHDL and Verilog permit filter parameter changes, including coefficients and the number of taps.
Overview
There are many filtering techniques available to signal processing engineers. A common filter implementation for high-performance applications is the fully parallel FIR filter. Implementing this structure in the Virtex-II architecture uses the embedded multipliers and slice-based arithmetic logic. The Virtex-4 DSP48 slice introduces higher performance multiplication and arithmetic capabilities specifically designed to enhance the use of parallel FIR filters in FPGA-based DSP.
Figure 5-1: [Chart with a logarithmic axis; tick labels 5, 10, 50, 100.]
The basic parallel architecture, shown in Figure 5-2, is referred to as the Direct Form Type 1.
Figure 5-2: [Direct Form Type 1 FIR filter with coefficients h0 to h3, 18-bit data and coefficient inputs, and a 38-bit output.]
This structure implements the general FIR filter equation of a summation of products as defined in Equation 5-1.

$$y_n = \sum_{i=0}^{N-1} x_{n-i} \, h_i$$ (Equation 5-1)
In Equation 5-1, a set of N coefficients is multiplied by N respective data samples. The results are summed together to form an individual result. The values of the coefficients determine the characteristics of the filter (e.g., a low-pass filter). The history of data is stored in the individual registers chained together across the top of the architecture. Each clock cycle yields a new complete result and all multiplication and arithmetic required occurs simultaneously.

In sequential FIR filter architectures, the data buffer is created using Virtex-4 dedicated block RAMs or distributed RAMs. This demonstrates a trend: as algorithms become faster, the memory requirement is reduced. However, the memory bandwidth increases dramatically since all N coefficients must be processed at the same time. The performance of the Parallel FIR filter is calculated in Equation 5-2.

Maximum Input Sample Rate = Clock Speed (Equation 5-2)
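The Direct Form structure maps directly onto Equation 5-1. The following behavioral sketch (not cycle-accurate, and with a hypothetical function name) keeps the sample history in a register chain and produces one complete result for every input sample.

```python
def direct_form_fir(samples, coeffs):
    """Direct Form FIR: shift-register delay line feeding N multipliers."""
    delay_line = [0] * len(coeffs)
    outputs = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # every register shifts by one
        # All N products are formed and summed in the same clock cycle.
        outputs.append(sum(d * h for d, h in zip(delay_line, coeffs)))
    return outputs


if __name__ == "__main__":
    print(direct_form_fir([1, 0, 0, 0], [1, 2, 3]))  # [1, 2, 3, 0]
```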
The bit growth through the filter is the same for all FIR filters and is explained in the section Bit Growth in Chapter 4.
Figure 5-3: [Transposed FIR filter built from four cascaded DSP48 slices with coefficients h3 to h0; the first slice uses OPMODE = 0000101 and the remaining slices use OPMODE = 0010101.]
The input data is broadcast across all the multipliers simultaneously, and the coefficients are ordered from right to left with the first coefficient, h0, on the right. These results are fed into the pipelined adder chain acting as a data buffer to store previously calculated inner products in the adder chain. The rearranged structure yields identical results to the Direct Form structure, but gains from the use of an adder chain. This different structure is easily mapped to the DSP48 slice without additional external logic. If more coefficients are required, then more DSP48 slices are required to be added to the chain.

The configuration of the DSP48 slice for each segment of the Transposed FIR filter is shown in Figure 5-4. Apart from the very first segment, all processing elements are to be configured as in Figure 5-4. OPMODE is set to multiply mode with the adder combining the results from the multiplier and from the previous DSP48 slice through the dedicated cascade input (PCIN). OPMODE is set to binary 0010101.
Figure 5-4: [DSP48 slice configuration for one Transposed FIR filter segment, showing the B input and the PCOUT cascade output.]
Resource Utilization
An N coefficient filter uses N DSP48 slices. A design cannot use symmetry to reduce the number of DSP48 slices when using the Transposed FIR filter structure.
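The claim that the Transposed structure yields the same results as the Direct Form can be checked with a small behavioral model. This is a sketch, assuming one P register per slice and ignoring the internal multiplier pipeline registers; `transposed_fir` is a hypothetical name.

```python
def transposed_fir(samples, coeffs):
    """Transposed FIR: input broadcast to all slices, pipelined adder chain."""
    n_taps = len(coeffs)
    p = [0] * n_taps  # P register of each slice; p[0] is the output slice (h0)
    outputs = []
    for x in samples:
        previous = p[:]  # register contents before this clock edge
        for k in range(n_taps):
            # The slice farthest from the output adds zero instead of PCIN.
            cascade_in = previous[k + 1] if k + 1 < n_taps else 0
            p[k] = coeffs[k] * x + cascade_in
        outputs.append(p[0])
    return outputs


if __name__ == "__main__":
    print(transposed_fir([1, 2, 3, 4], [1, 2, 3]))  # [1, 4, 10, 16]
```

The history is held as partial sums in the adder chain rather than as raw samples, which is why no separate data delay line is needed.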
Figure 5-5: [Systolic FIR filter built from four cascaded DSP48 slices; the first slice uses OPMODE = 0000101 and the remaining slices use OPMODE = 0010101.]
The input data is fed into a cascade of registers acting as a data buffer. Each register delivers a sample to a multiplier where it is multiplied by the respective coefficient. In contrast to the Transposed FIR filter, the coefficients are aligned from left to right with the first coefficients on the left side of the structure. The adder chain stores the gradually combined inner products to form the final result. As with the Transposed FIR filter, no external logic is required to support the filter and the structure is extendable to support any number of coefficients. The configuration of the DSP48 slice for each segment of the Systolic FIR filter is shown in Figure 5-6. Apart from the very first segment, all processing elements are to be configured as shown in Figure 5-6. OPMODE is set to multiply mode with the adder combining the results from the multiplier and from the previous DSP48 slice through the dedicated cascade input (PCIN). OPMODE is set to binary 0010101. The dedicated cascade input (BCIN) and dedicated cascade output (BCOUT) are used to create the necessary input data buffer cascade.
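Figure 5-6: [DSP48 slice configuration for one Systolic FIR filter segment, showing the A input, the BCIN and BCOUT data cascade, and the PCOUT cascade output.]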
Resource Utilization
An N coefficient filter uses N DSP48 slices.
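The Systolic structure can be modeled in the same behavioral style. The sketch below assumes the usual systolic arrangement of two data registers per tap in the input cascade and one register per tap in the adder chain; that register count is an assumption, as is the function name. Under it, the output equals the Direct Form result delayed by N-1 clock cycles.

```python
def systolic_fir(samples, coeffs):
    """Systolic FIR: cascaded data registers and a pipelined adder chain."""
    n_taps = len(coeffs)
    history = [0] * (2 * n_taps - 1)  # history[d] = input delayed by d clocks
    sums = [0] * n_taps               # adder-chain register of each slice
    outputs = []
    for x in samples:
        history = [x] + history[:-1]
        previous = sums[:]            # register contents before this clock edge
        for k in range(n_taps):
            cascade_in = previous[k - 1] if k > 0 else 0  # first slice adds zero
            # Tap k sees the input delayed by two registers per preceding tap.
            sums[k] = coeffs[k] * history[2 * k] + cascade_in
        outputs.append(sums[-1])
    return outputs


if __name__ == "__main__":
    # Three taps: the impulse response appears after a latency of two clocks.
    print(systolic_fir([1, 0, 0, 0, 0], [1, 2, 3]))  # [0, 0, 1, 2, 3]
```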
Figure 5-7 shows the implementation of this type of Systolic FIR Filter structure.
Figure 5-7: [Symmetric Systolic FIR filter with 17-bit input data, coefficients h0 to h3, four cascaded DSP48 slices (first slice OPMODE = 0000101, remaining slices OPMODE = 0010101), and a 38-bit output.]
In this structure, DSP48 slices have been traded off for fabric slices. From a performance viewpoint, to achieve the full speed of the DSP48 slice, the fabric 18-bit adder has to run at the same speed. To achieve this, register duplication can be performed on the output from the last tap that feeds all the other multipliers. The two-register delay in the input buffer time series is implemented as an SRL16E and a register output to save on logic area. A further benefit of the symmetric implementation is the reduction in latency, due to the adder chain being half the length.

Figure 5-8 shows the configuration of the DSP48 slice for each segment of the Symmetric Systolic FIR filter. Apart from the very first segment, all processing elements are to be configured as in Figure 5-8. OPMODE is set to multiply mode with the adder combining results from the multiplier and from the previous DSP48 slice via the dedicated cascade input (PCIN). OPMODE is set to binary 0010101.
Figure 5-8: [DSP48 slice configuration for one Symmetric Systolic FIR filter segment, showing the B input and the PCOUT cascade output.]
Resource Utilization
An N-coefficient symmetric filter uses N/2 DSP48 slices. The slice count for the pre-adder and input buffer time series is a function of the input bit width (n) and N. The equation for the size in slices is:

$$\text{Slices} = (n + 1) \cdot \frac{N}{2} + \frac{n}{2}$$ (Equation 5-4)

For the example illustrated in Figure 5-7 (n = 17, N = 8), the size is (17 + 1) * 8/2 + 17/2 = 80.5, which rounds up to 81 slices.
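Equation 5-4 is easy to evaluate for other filter sizes. In the minimal sketch below, rounding the result up to a whole slice is an assumption chosen to match the 81-slice figure quoted above, and the function name is hypothetical.

```python
import math


def symmetric_fabric_slices(input_bits: int, num_taps: int) -> int:
    """Fabric slices for the pre-adders and input buffer (Equation 5-4)."""
    exact = (input_bits + 1) * (num_taps / 2) + input_bits / 2
    return math.ceil(exact)  # assumed: partial slices round up


if __name__ == "__main__":
    print(symmetric_fabric_slices(17, 8))  # 81
```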
Rounding
The number of bits on the output of the filter is much larger than the input and must be reduced to a manageable width. The output can be truncated by simply selecting the MSBs required from the filter. However, truncation introduces an undesirable DC data shift. Due to the nature of two's complement numbers, negative numbers become more negative and positive numbers also become more negative. The DC shift can be improved with the use of symmetric rounding, where positive numbers are rounded up and negative numbers are rounded down.

The rounding capability in the DSP48 slice maintains performance and minimizes the use of the FPGA fabric. This is implemented in the DSP48 slice using the C input port and the Carry-In port. Rounding is achieved by:

For positive numbers: Binary Data Value + 0.10000... and then truncate
For negative numbers: Binary Data Value + 0.01111... and then truncate

The actual implementation always adds 0.0111... to the data value through the C port input, as in the negative case, and then adds the extra carry-in required to adjust for positive numbers. Table 5-1 illustrates some examples of symmetric rounding.

Table 5-1: Symmetric Rounding Examples
Decimal Value | Binary Value | Add Round | Truncate: Finish | Rounded Value
2.4375 | 0010.0111 | 0010.1111 | 0010 | 2
2.5 | 0010.1000 | 0011.0000 | 0011 | 3
2.5625 | 0010.1001 | 0011.0001 | 0011 | 3
-2.4375 | 1101.1001 | 1110.0000 | 1110 | -2
-2.5 | 1101.1000 | 1101.1111 | 1101 | -3
-2.5625 | 1101.0111 | 1101.1110 | 1101 | -3
For both the Transposed and Systolic Parallel FIR filters, the C input is used at the beginning of the adder chain to drive the carry value into the accumulated result. The final segment uses the MSB of the PCIN as the carry-in value to determine if the accumulated product is positive or negative. CARRYINSEL is used to select the appropriate carry-in value. If positive, the carry-in value is used, and if negative, the result is kept the same (see Figure 5-9).
Figure 5-9: [Four-tap parallel FIR filter with rounding: the constant 0.49999 is applied to the C input of the first DSP48 slice (OPMODE = 0000101), the remaining slices use OPMODE = 0010101, the final slice uses CARRYINSEL = 01, and the P output is 18 bits wide.]
The one problem with this solution occurs when the final accumulated inner product input to the final DSP48 slice is very close to zero. If the value is positive and the final inner product makes the result negative (leading to a rounding down), then an incorrect result occurs because the rounding function assumes a positive number instead of a negative. The last coefficient in typical FIR filters is very small, so this situation rarely occurs. However, if absolute certainty is required, an extra DSP48 slice can perform the rounding function (see Figure 5-10). A Transposed FIR filter can have exactly the same problem as the Systolic FIR filter.
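The corner case can be reproduced with a small integer model. This is an illustrative sketch, not the reference design: values are integers with `frac_bits` fractional bits, the rounding constant is injected at the start of the chain, and the carry-in decision is taken from the sign of the value arriving at the final slice, before the last product is added. Both function names and the example numbers are hypothetical.

```python
def symmetric_round(value: int, frac_bits: int) -> int:
    """Reference: sign decision made on the complete sum."""
    constant = (1 << (frac_bits - 1)) - 1
    return (value + constant + (1 if value >= 0 else 0)) >> frac_bits


def chain_round(partial_sum: int, last_product: int, frac_bits: int) -> int:
    """Rounding inside the adder chain: sign taken from PCIN of the last slice."""
    constant = (1 << (frac_bits - 1)) - 1
    pcin = partial_sum + constant     # constant entered at the start of the chain
    carry_in = 1 if pcin >= 0 else 0  # decided before the last product is added
    return (pcin + last_product + carry_in) >> frac_bits


if __name__ == "__main__":
    # Partial sum +0.5, last product -3.0: the complete sum is -2.5.
    print(symmetric_round(8 - 48, 4))  # -3 (correct symmetric rounding)
    print(chain_round(8, -48, 4))      # -2 (carry-in applied to a negative result)
```

When the last product does not change the sign, both functions agree, which is why the discrepancy is rare for filters whose last coefficient is small.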
Figure 5-10: [Four-tap parallel FIR filter with an extra DSP48 slice for rounding: the constant 0.4999 is applied to a C input, the filter slices use OPMODE = 0000101 and 0010101, the extra slice is shown with OPMODE = 0011101 and CARRYINSEL = 01, and the P output is 18 bits wide.]
Performance
When examining the performance of a Virtex-4 Parallel FIR filter, a Virtex-II Pro design is a valuable reference. Table 5-2 illustrates the ability of the Virtex-4 DSP48 slice to greatly reduce logic fabric resource requirements while improving the speed of the design and reducing the power utilization of the filter.

Table 5-2: Performance Analysis
Filter Type | Device Family | Size | Performance | Power (Watts)
18 x 18 Parallel Transposed FIR Filter (51 Tap Symmetric) | Virtex-II Pro FPGA | 1860 Slices, 26 Embedded Multipliers | 300 MHz Clock Speed, 300 MSPS | TBD
18 x 18 Parallel Systolic FIR Filter (51 Tap Symmetric) | Virtex-II Pro FPGA | 2958 Slices, 26 Embedded Multipliers | 300 MHz Clock Speed, 300 MSPS | TBD
18 x 18 Parallel Transposed FIR Filter (51 Tap Symmetric) | Virtex-4 FPGA | 0 Slices, 51 DSP48 Slices | 400 MHz Clock Speed, 400 MSPS | TBD
17 x 18 Systolic FIR Filter (51 Tap Nonsymmetric) | Virtex-4 FPGA | 0 Slices, 51 DSP48 Slices | 450 MHz Clock Speed, 450 MSPS | TBD
17 x 18 Systolic FIR Filter (51 Tap Symmetric) | Virtex-4 FPGA | | 400 MHz Clock Speed, 400 MSPS | TBD
Conclusion
Parallel FIR filters are commonly used in high-performance DSP applications. With the introduction of the Virtex-4 DSP48 slice, this function can be achieved in a smaller area, while producing higher performance with lower power consumption. Designers have tremendous flexibility in determining the desired implementation, and also have the ability to change the implementation parameters. The ability to tune a filter in an existing system or to have multiple filter settings is a distinct advantage. By making the necessary coefficient changes in the synthesizable HDL code, the reconfigurable nature of the FPGA is fully exploited. The coefficients can be either hardwired to the A input of the DSP48 slices or stored in small memories and selected to change the filter characteristics. The HDL and System Generator for DSP reference designs are easily modified to achieve specific requirements.
Chapter 6
This chapter describes the implementation of semi-parallel or hardware-folded, full-precision FIR filters using the Virtex-4 DSP48 slice. Because the Virtex-4 architecture is flexible, constructing FIR filters for specific application requirements is practical. Creating optimum filter structures of a semi-parallel nature saves resources and potential clock cycles. This chapter demonstrates two semi-parallel filter architectures: the four-multiplier FIR filter using distributed RAM and the three-multiplier FIR filter using block RAM. These filters illustrate how resources are saved by using available clock cycles and hardware-folding techniques. Reference design files are available for System Generator for DSP, VHDL, and Verilog. The reference designs permit filter parameter changes, including coefficients and the number of taps.
Overview
A large array of filtering techniques is available to signal processing engineers. A common filter implementation to exploit available clock cycles, while still achieving moderate to high sample rates, is the semi-parallel (also known as folded-hardware) FIR filter. In the past, this structure used the Virtex-II embedded multipliers and slice-based arithmetic logic. However, the Virtex-4 DSP48 slice introduces higher performance multiplication and arithmetic capabilities to enhance the use of semi-parallel FIR filters in FPGA-based DSP designs.
over numerous clock cycles to achieve the result. These techniques are often referred to as semi-parallel and are used to maximize efficiency of the filter (see Figure 6-1).
Figure 6-1: [Chart with a logarithmic axis; tick labels 5, 10, 50, 100, 200, 300, 400, 500.]
The semi-parallel FIR structure implements the general FIR filter equation of a summation of products defined as shown in Equation 6-1.

$$y_n = \sum_{i=0}^{N-1} x_{n-i} \, h_i$$ (Equation 6-1)
Here, a set of N coefficients is multiplied by N respective time series data samples, and the results are summed together to form an individual result. The values of the coefficients determine the characteristics of the filter (for example, a low-pass filter). Along with achievable clock speed and the number of coefficients (N), the number of multipliers (M) is also a factor in calculating semi-parallel FIR filter performance. The following equation demonstrates that the more multipliers are used, the greater the achievable performance of the filter:

Maximum Input Sample Rate = (Clock Speed / Number of Coefficients) x Number of Multipliers

The above equation is rearranged to determine how many multipliers to use for a particular semi-parallel architecture:

Number of Multipliers = (Maximum Input Sample Rate x Number of Coefficients) / Clock Speed

The number of clock cycles between each result of the FIR filter is determined by the following equation:

Number of Clock Cycles per Result = Number of Coefficients / Number of Multipliers

The bit growth on the output of the filter is the same as for all FIR filters and is explained in Bit Growth in Chapter 4. The large 48-bit internal precision of the DSP48 slice means that little concern needs to be paid to the internal bit growth of the filter.
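The three relationships above can be wrapped in small helper functions. This is a sketch with hypothetical function names; the 400 MHz clock and 100 MSPS sample rate in the example are illustrative values, and rounding up to whole multipliers and whole clock cycles is an assumption.

```python
import math


def max_sample_rate(clock_hz: float, num_coefficients: int, num_multipliers: int) -> float:
    """Maximum input sample rate of a semi-parallel FIR filter."""
    return clock_hz / num_coefficients * num_multipliers


def multipliers_required(sample_rate_hz: float, num_coefficients: int, clock_hz: float) -> int:
    """Number of multipliers needed to sustain the requested sample rate."""
    return math.ceil(sample_rate_hz * num_coefficients / clock_hz)


def clocks_per_result(num_coefficients: int, num_multipliers: int) -> int:
    """Clock cycles between successive filter results."""
    return math.ceil(num_coefficients / num_multipliers)


if __name__ == "__main__":
    # Illustrative: 16 coefficients, 100 MSPS input, 400 MHz clock.
    m = multipliers_required(100e6, 16, 400e6)
    print(m, clocks_per_result(16, m), max_sample_rate(400e6, 16, m) / 1e6)
```

With these numbers the result is four multipliers and four clock cycles per result, the same shape as the four-multiplier example that follows.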
Sampling Rate
Number of Coefficients
Assumed Clock Speed
Input Data Width
Output Data Width
Number of Multipliers
Number of Clock Cycles Between Each Result
Figure 6-2 illustrates the main structure for the four-multiplier, semi-parallel FIR filter.
Figure 6-2: [Four-multiplier, semi-parallel FIR filter: four SRL16E data buffers chained through clock enables CE to CE3, four 4 x 18 coefficient memories on the B inputs, cascaded DSP48 slices (OPMODE values 0110101 and 0010101), and a 40-bit P output with capture enable CE4.]
The DSP48 slice arithmetic units are designed to be chained together easily and efficiently due to dedicated routing between slices. Figure 6-2 shows how the four DSP48 slice multiply-add elements are cascaded together to form the main part of the filter structure. Figure 6-3 provides a detailed view of the main multiply-add elements. The two pipeline registers are used on the B input to compensate for the register on the output of the coefficient memory.
Figure 6-3: [Detail of a multiply-add element showing the B input, the PCIN cascade input, and the PCOUT cascade output.]
An extra DSP48 slice is required on the end to perform the accumulation of the partial results, thus creating the final result. A new result is created every four cycles. Every four cycles, the accumulation must be reset to the first partial value of the next result. As in the MAC FIR Filter, this reset (or load) is achieved by changing the OPMODE value of the DSP48 slice for a single cycle. OPMODE is changed from binary 0010010 to binary 0010000 (just a single bit change). At the same time, the capture register is also enabled, and the final result is stored on the output (see Figure 6-4).
Figure 6-4: [Block diagram: the same structure as Figure 6-2 (four SRL16E data buffers, four 4 x 18 coefficient memories, cascaded DSP48 slices) shown during the load cycle, with the capture register enabled by CE4.]
Control logic is required to make this dynamic change occur. The specifics are detailed in Control Logic and Address Sequencing, page 90.
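The load-then-accumulate behavior of the final DSP48 slice can be modeled in a few lines. The following Python sketch is a behavioral illustration only, not a description of the hardware timing: it assumes that OPMODE 0010010 selects P + PCIN and OPMODE 0010000 selects PCIN alone, as the text implies, and the function name and the stream of partial sums are hypothetical.

```python
OPMODE_ACCUMULATE = 0b0010010  # assumed meaning: P <= P + PCIN
OPMODE_LOAD = 0b0010000        # assumed meaning: P <= PCIN (restart accumulation)


def accumulate_with_load(pcin_stream, cycles_per_result=4):
    """Accumulate partial sums, reloading on the first cycle of each group.

    Returns the values latched by the capture register, which is enabled
    on the same cycle that the accumulator is reloaded.
    """
    p = 0
    captured = []
    for cycle, pcin in enumerate(pcin_stream):
        load = cycle % cycles_per_result == 0
        opmode = OPMODE_LOAD if load else OPMODE_ACCUMULATE
        if load and cycle > 0:
            captured.append(p)  # previous result is complete; store it
        p = pcin if opmode == OPMODE_LOAD else p + pcin
    return captured


if __name__ == "__main__":
    partial_sums = [1, 2, 3, 4, 10, 20, 30, 40, 5]
    print(accumulate_with_load(partial_sums))
```

The two OPMODE constants differ in exactly one bit, which is why the control logic only has to toggle a single signal for one cycle.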
... samples along the time series buffer. The extra register on the output of each data buffer is required to match up the data buffer pipeline with the extra delay caused by the adder chain. The extra register should not cost extra resources, because it is already present in the slice containing the SRL16E (see Figure 6-5).
Figure 6-5: [Diagram: half of a SliceM containing an SRL16E (DIN, CE, ADDR[3:0]) followed by a register driving DOUT.]
As long as the depth does not exceed 16, the resources required for each of these input memory buffers are determined by the bit width of the input data (n). Each memory buffer requires n/2 SliceM slices, which gives nine slices per buffer in this filter example. For depths up to 32, resources are a little more than doubled because two SRL16Es are needed, as well as an extra output multiplexer. For more information on SliceM, refer to the CLB section in the Virtex-4 User Guide.
Coefficient Memory
The coefficients are divided into four groups of four. This arrangement is determined by dividing the total number of coefficients by the number of multipliers used in the implementation. In this example, the total number of coefficients is 16 and the number of multipliers is four, so four coefficients per memory are needed. Note that filters whose total number of coefficients is integer-divisible by the required number of multipliers are very desirable. System designers should take this into account when designing their filters to get the optimal filter specification for the implementation used. Otherwise, the coefficients have to be padded with zeros to reach a number of coefficients that is integer-divisible by the number of multipliers. The coefficients are simply split into groups according to their order: the first four in the first memory, the second four in the second memory, and so on (see Figure 6-6).
Figure 6-6:
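The grouping rule, including the zero padding, can be sketched as follows. This Python model is a functional illustration under stated assumptions: it ignores pipeline delays and address sequencing, the function names are hypothetical, and the data history is indexed so that history[k] holds x(n-k). On each modeled clock cycle every multiplier contributes one product, the products are summed as in the adder chain, and the sum is accumulated.

```python
def split_coefficients(coeffs, num_multipliers):
    """Pad with zeros to a multiple of num_multipliers, then split in order."""
    per_memory = -(-len(coeffs) // num_multipliers)  # ceiling division
    padded = list(coeffs) + [0] * (per_memory * num_multipliers - len(coeffs))
    return [padded[m * per_memory:(m + 1) * per_memory]
            for m in range(num_multipliers)]


def semi_parallel_fir_output(memories, history):
    """Compute one filter result; history[k] is x(n-k), padded as needed."""
    per_memory = len(memories[0])
    accumulator = 0
    for cycle in range(per_memory):  # one pass per clock cycle
        # Adder chain: one product from each multiplier in this cycle.
        chain_sum = sum(memory[cycle] * history[m * per_memory + cycle]
                        for m, memory in enumerate(memories))
        accumulator += chain_sum     # final accumulating slice
    return accumulator


if __name__ == "__main__":
    coeffs = list(range(1, 17))                  # 16 illustrative coefficients
    memories = split_coefficients(coeffs, 4)     # four memories of four
    history = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7, 9, 3]
    print(memories)
    print(semi_parallel_fir_output(memories, history))
```

The result equals the direct sum of products over all 16 taps; only the order of evaluation changes.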
The adder chain architecture of the DSP48 slice means that each Multiply-Add cascade multiplication must be delayed by a single cycle so that the results synchronize appropriately when added together. This delay is achieved by the addressing of the memories and is explained in Control Logic and Address Sequencing. Distributed RAMs are used for the coefficient memories (refer to Chapter 2, XtremeDSP Design Considerations, for detailed information on distributed RAMs). Using the larger block RAMs here would be extremely inefficient, especially given their scarcity compared with the smaller, abundant distributed RAMs. The larger block RAM comes into play when the number of coefficients per memory increases to the point where the cost in slice resources becomes significant (for example, greater than 64). The total cost of the current example is 36 slices: the coefficient width is 18 bits, and distributed RAMs cost n/2 slices (that is, nine slices per memory, and there are four memories). For larger distributed RAMs (larger than 16 elements), the size begins to increase because Write Enable (WE) control logic and an output multiplexer are needed. The distributed memory v7.0 in the CORE Generator system can be used to create these small distributed RAMs and get accurate size estimates.
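The n/2 rule gives a quick slice estimate for both the coefficient memories and the SRL16E data buffers. The Python sketch below is a rough estimator only: the function name is hypothetical, rounding odd widths up is an assumption, and the rule applies only to depths of 16 or less, as stated in the text.

```python
def shallow_memory_slices(width_bits, num_memories):
    """Estimate slices for memories of depth <= 16 at n/2 slices each.

    Odd widths are rounded up (an assumption, not stated in the text).
    """
    slices_per_memory = (width_bits + 1) // 2
    return slices_per_memory * num_memories


if __name__ == "__main__":
    # Four 18-bit coefficient memories in the 16-tap example
    print(shallow_memory_slices(18, 4))
```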
Figure 6-7: [Diagram: control counter with count limit N/M - 1 and the clock enable chain CE, CE1, CE2, CE3, CE4.]
Figure 6-7 also shows clock enable sequencing. A relational operator is required to determine when the count-limited counter resets its count. This signal is High for one clock cycle every four cycles, to represent the input and output data rates. The Clock Enable signal is delayed by a single register just like the coefficient address, and each delayed version of the signal is tied to the respective section of the filter. Refer to Figure 6-2 to see the signal connections to each element. Figure 6-8 illustrates the control logic waveforms changing over time.
Figure 6-8: [Waveform diagram: clock, input data x(n) to x(n-5), output data y(n) to y(n-2), and for each of the four DSP48 slice elements the coefficient address (cycling 0, 1, 2, 3) and its control signal (CE, CE1, CE2, CE3), each shifted by one cycle relative to the previous element.]
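The counter, relational operator, and register chain described above can be traced cycle by cycle. The following Python sketch is a behavioral model under stated assumptions: the function name is hypothetical, the registers are assumed to start at zero, and CE is assumed to go High on the last count value; the text specifies only that it is High for one cycle in every four.

```python
def control_sequence(num_cycles, coeffs_per_memory=4, num_stages=5):
    """Trace the address and clock-enable chains.

    Returns one (addresses, enables) pair per cycle, where index k holds
    the signal delayed by k registers (CE, CE1, ..., CE4 for the enables).
    """
    count = 0
    addresses = [0] * num_stages  # assumed reset state
    enables = [0] * num_stages
    trace = []
    for _ in range(num_cycles):
        # Relational operator: detect the count limit (assumed N/M - 1).
        ce = 1 if count == coeffs_per_memory - 1 else 0
        # Shift both chains by one register per stage.
        addresses = [count] + addresses[:-1]
        enables = [ce] + enables[:-1]
        trace.append((tuple(addresses), tuple(enables)))
        count = 0 if ce else count + 1  # count-limited counter
    return trace


if __name__ == "__main__":
    for cycle, (addr, ce) in enumerate(control_sequence(8)):
        print(cycle, addr, ce)
```

Each element sees the same 0, 1, 2, 3 address pattern as its neighbor, one cycle later, which is what staggers the products for the adder chain.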
Resource Utilization
Table 6-2 shows the resources used by a 16-tap, four-multiplier, distributed-RAM-based, semi-parallel FIR filter.

Table 6-2: Resource Utilization

    Elements                        Slices                     DSP48 Slices
    Multiply-Add                                               5
    Input Data Buffers              36
    Coefficient Memories            36
    Capture Register                20
    Main Control Counter            2
    Relational Operator             1
    Multiply-Add Element Control    9 (3 per extra element)
    Total                           104                        5
Figure 6-9: [Block diagram: block-RAM-based semi-parallel FIR filter with 100 x 18 coefficient memories feeding cascaded DSP48 slices; the P output is captured with CE4.]
The decision to use this implementation is based on the filter specification. The filter specifications are described in Table 6-3.

Table 6-3: Three-Multiplier, Block RAM-Based, Semi-Parallel FIR Filter Specifications

    Parameter                                       Value
    Sampling Rate                                   4.5 MSPS
    Number of Coefficients                          300
    Assumed Clock Speed                             450 MHz
    Input Data Width                                18 Bits
    Output Data Width                               18 Bits
    Number of Multipliers                           3
    Number of Clock Cycles Between Each Result      100
The structure is similar to the four-multiplier filter studied earlier. In this instance, the lower sample rate of the filter specification and the larger number of taps indicate that only three multipliers are required, each servicing 100 coefficients, so a new result is produced every 100 clock cycles. Each memory buffer is required to hold 100 coefficients and also 100 input data history values. The dedicated Virtex-4 block RAM can be used in dual-port mode with a cyclic data buffer established in the first half of the memory to serve the shifting input data series. Chapter 4, MAC FIR Filters, describes using these memories to store the input data series, the coefficients, and also the control logic required to make the cyclic RAM buffer operate. The rest of the control logic and data flow is identical to the first filter investigated, except that only three multipliers are serviced. The control logic can therefore be scaled back by one element. Also note that the WE signals are the inversion of their respective CE pair. Table 6-4 shows the resource utilization for the 300-tap, three-multiplier, semi-parallel FIR filter.

Table 6-4: Resource Utilization

    Elements                                        Slices                     DSP48 Slices    Block RAMs
    Multiply-Add                                                               4
    Input Data Buffers and Coefficient Memories                                                3
    Capture Register                                20
    Main Control Counter                            5
    Relational Operator                             1
    Multiply-Add Element Control                    12 (6 per extra element)
    Total                                           38
Figure 6-10:
Only one data storage buffer is required, typically a block RAM. The data buffer output is also broadcast to all DSP48 slices. Each DSP48 slice works in accumulator mode until the last cycle of the calculation, when OPMODE changes to form an adder chain, and then passes the results to the next DSP48 slice. Actually, four results are being calculated at one time, and the completed result is output from the last DSP48 slice. The previous elements are working on their respective parts of the next results. Figure 6-11 shows the filter structure every time the DSP48 slice OPMODE is changed, which occurs once every result cycle.
Figure 6-11: [Block diagram: a 400 x 18 cyclic data buffer whose output feeds the DSP48 slices, shown in the cycle where OPMODE is changed.]
Latency is low: due to the transposed nature of the filter implementation, it is lower than with the systolic approach. The latency is equal to the size of one coefficient bank.

The disadvantages to using the Semi-Parallel, Transposed FIR filter are:
- Lower performance: the broadcast nature of the data buffer output can limit the performance of the filter.
- Control logic: the control logic is more difficult to understand, but it is still compact.
Rounding
The number of bits on the output of the filter is much larger than the input and must be reduced to a manageable width. The output can be truncated by simply selecting the required MSBs from the filter output. However, truncation introduces an undesirable DC shift on the data set. Due to the nature of two's complement numbers, negative numbers become more negative and positive numbers also become more negative. The DC shift can be improved with the use of symmetric rounding, where positive numbers are rounded up and negative numbers are rounded down. The rounding capability built into the DSP48 slice maintains performance and minimizes the use of FPGA fabric. It is provided by the DSP48 slice through the C input port and the Carry-In port. Rounding is achieved in the following manner:
- For positive numbers: Binary Data Value + 0.10000... and then truncate
- For negative numbers: Binary Data Value + 0.01111... and then truncate

The actual implementation always adds 0.0111... to the data value using the C port input, as in the negative case, and then adds the extra carry-in required to adjust for positive numbers. Table 6-5 illustrates some examples of symmetric rounding.

Table 6-5: Symmetric Rounding Examples

    Decimal Value    Binary Value    Add Round    Truncate: Finish    Rounded Value
    2.4375           0010.0111       0010.1111    0010                2
    2.5              0010.1000       0011.0000    0011                3
    2.5625           0010.1001       0011.0001    0011                3
    -2.4375          1101.1001       1110.0000    1110                -2
    -2.5             1101.1000       1101.1111    1101                -3
    -2.5625          1101.0111       1101.1110    1101                -3
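The add-then-truncate procedure can be reproduced with integer arithmetic. The Python sketch below is an arithmetic illustration, not a DSP48 configuration: the function name is hypothetical, values are signed fixed-point integers with frac_bits fractional bits, and Python's arithmetic right shift stands in for truncation. The constant plays the role of the C port value (0.0111...), and the carry-in is set for non-negative inputs.

```python
def symmetric_round(value, frac_bits):
    """Round a signed fixed-point integer symmetrically (half away from zero).

    value     -- integer holding the number scaled by 2**frac_bits
    frac_bits -- number of fractional bits to remove (must be >= 1)
    """
    c_constant = (1 << (frac_bits - 1)) - 1  # binary 0.0111...
    carry_in = 1 if value >= 0 else 0        # extra 1 LSB for positive values
    # Arithmetic shift right discards the fractional bits (truncation).
    return (value + c_constant + carry_in) >> frac_bits


if __name__ == "__main__":
    # The decimal values of Table 6-5, scaled by 16 (four fractional bits)
    for decimal in (2.4375, 2.5, 2.5625, -2.4375, -2.5, -2.5625):
        print(decimal, symmetric_round(int(decimal * 16), 4))
```

Plain truncation of the same inputs would give 2, 2, 2, -3, -3, -3, which shows the shift toward negative values that symmetric rounding removes.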
In the semi-parallel FIR filter, an extra DSP48 slice is required to perform the rounding. Rounding cannot be built into the final accumulator because it would not be applied to the final result: if the C input is used and the accumulator is put into three-input add mode, rounding is performed on the partial result. The more multipliers in the filter, the worse the rounding performance, because even fewer inner products are included in the result. Because DSP48 slices are a finite resource, it is recommended that the symmetric rounder be implemented in the fabric outside of the slices. The function is small and does not have to run at a high frequency because the results are produced at the much slower input data rate.
Performance
It does not make sense to compare the performance of the semi-parallel FIR filter in a Virtex-4 device with Virtex-II Pro results because completely different techniques are used to build the filters. As a general statement though, Virtex-4 devices improve the speed of the design, shrink the area, and reduce power drawn by the filter. All designs assume 18-bit data and 18-bit coefficient widths. Table 6-6 through Table 6-8 compare the specifications for three filters.

Table 6-6: 4-Multiplier, Memory-Based, Semi-Parallel FIR Filter Specifications (16-Tap Symmetric)

    Parameter      Specification
    Size           94 slices, 5 DSP48 slices
    Performance    458 MHz clock speed, 114.5 MSPS
    Power          TBD Watt

Table 6-7: 3-Multiplier, Block-RAM-Based, Semi-Parallel FIR Filter Specifications (300-Tap Symmetric)

    Parameter      Specification
    Size           38 slices, 4 DSP48 slices, 4 block RAMs
    Performance    450 MHz clock speed, 4.5 MSPS
    Power          TBD Watt

Table 6-8: 4-Multiplier, Block-RAM-Based, Semi-Parallel Transposed FIR Filter Specifications (400-Tap Symmetric)

    Parameter      Specification
    Size           46 slices, 4 DSP48 slices, 2 block RAMs
    Performance    450 MHz clock speed, 4.5 MSPS
    Power          TBD Watt
Conclusion
Semi-parallel FIR filters are probably the most frequently used filter technique in Virtex-4 high-performance DSP applications. Figure 6-12 shows the necessary implementation decisions and provides guidelines for choosing the required structure based on the filter specifications.
Figure 6-12: [Selection chart with logarithmic axes. Regions: Transposed FIR; Systolic FIR (symmetric and non-symmetric); 10-Multiplier Semi-Parallel FIR; Semi-Parallel Distributed Memory FIR; Symmetric MACC FIR; Distributed Memory MACC FIR; Embedded Control MACC FIR. Memory regions: Block RAM and Distributed Memory. Arrow: Increasing Number of Multipliers.]
The major lines indicate the guideline thresholds between given implementation techniques. For instance, the shift to using block RAM is desirable when the number of taps needed to be stored in a given memory exceeds 32. This correlates to two SRL16Es for the data buffers. If more than two SRL16Es are used in a data buffer, it will be difficult to reach the high clock rate indicated in Chapter 4, MAC FIR Filters, Chapter 5, Parallel FIR Filters, and this chapter. However, this is only a guideline. A great deal depends upon how many slices or block RAMs are remaining in the device, the power requirements, and the available clock frequencies. The choice of filter implementation is subjective because every application and design imposes a different set of restrictions. In general, the guidelines provided in the past three chapters should enable designers to make sensible and efficient decisions when designing filters. These chapters also complete the foundations required for filter construction in Virtex-4 devices so that more complex, multi-channel and interpolation or decimation multi-rate filters can be constructed. The supplied reference designs further help in understanding and using these filters.
Chapter 7
This chapter illustrates the use of the advanced Virtex-4 DSP features when implementing a widely used DSP function known as multi-channel FIR filtering. Multi-channel filters are used to filter multiple input sample streams in a variety of applications, including communications and multimedia. The main advantage of using a multi-channel filter is leveraging very fast math elements across multiple input streams (i.e., channels) with much lower sample rates. This technique increases silicon efficiency by a factor almost equal to the number of channels.

The Virtex-4 DSP48 slice is one of the new and highly innovative diffused elements that form the basis of the Application Specific Modular BLock or ASMBL architecture. This modular architecture enables Xilinx to rapidly and cost-effectively build FPGA platforms by combining different elements, such as logic, memory, processors, I/O, and of course, DSP functionality targeting specific applications such as wireless or video DSP. The Virtex-4 DSP48 slice contains the basic elements of classic FIR filters: a multiplier followed by an adder, delay or pipeline registers, plus the ability to cascade an input stream (B bus) and an output stream (P bus) without exiting to general slice fabric. The resulting DSP designs can have optional pipelining that permits aggregate multi-channel sample rates of up to 500 million samples per second, while minimizing power consumption and external slice logic.

In the implementation described in this chapter, multi-channel filtering can be looked at as time-multiplexed, single-channel filters. In a typical multi-channel filtering scenario, multiple input channels are filtered using a separate digital filter for each channel. Due to the high performance of the DSP48 block within the Virtex-4 device, a single digital filter can instead serve all of the channels. For example, eight input channels can be filtered by clocking the single filter with an 8x clock. This implementation uses 1/8th of the total FPGA resource as compared to implementing each channel separately.
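The time-multiplexing idea can be verified with a small functional model. The Python sketch below is illustrative only: the function names are hypothetical, all channels are assumed to share one coefficient set, and the model is not cycle-accurate (it ignores the pipeline registers inside the DSP48 slices). Samples from C channels are interleaved into one stream, and the single filter places C sample delays between taps so that each product pairs a coefficient with a sample from the same channel.

```python
def fir(samples, coeffs):
    """Reference single-channel FIR filter with zero initial history."""
    return [sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(samples))]


def multichannel_fir(channels, coeffs):
    """Filter all channels with one time-multiplexed filter."""
    num_channels = len(channels)
    length = len(channels[0])
    # Interleave: one sample from each channel per input period.
    interleaved = [channels[ch][n]
                   for n in range(length) for ch in range(num_channels)]
    outputs = []
    for t in range(len(interleaved)):
        acc = 0
        for k, c in enumerate(coeffs):
            idx = t - k * num_channels  # num_channels delays between taps
            if idx >= 0:
                acc += c * interleaved[idx]
        outputs.append(acc)
    # De-interleave the output stream back into per-channel results.
    return [outputs[ch::num_channels] for ch in range(num_channels)]


if __name__ == "__main__":
    coeffs = [1, 2, 3, 4, 5, 6, 7, 8]  # illustrative eight-tap set
    channels = [[(ch + 1) * (n + 2) for n in range(10)] for ch in range(6)]
    shared = multichannel_fir(channels, coeffs)
    separate = [fir(ch, coeffs) for ch in channels]
    print(shared == separate)
```

The shared filter performs C times as many multiply-accumulate operations per input period as one per-channel filter, which is why it needs a clock C times faster than the channel sample rate.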
- Six-to-one multiplexer that is implemented in slice logic as described in Combining Separate Input Streams into an Interleaved Stream, page 101
- Coefficient ROMs using SRL16Es connected in head-to-tail fashion
- Input sample delay-by-seven SRL16Es to hold the interleaved streams
- DSP48 slices for multiplication and additions
Figure 7-1: [Block diagram: input streams x0(n) to x5(n), Z^-7 delay elements, and four SRL16 coefficient ROMs.]
All datapaths and coefficient paths for this example are 8 bits wide. The coefficient ROMs and input sample delay elements are designed using SRL16Es. The SRL16E is a very compact and efficient memory element, running at the very high 6x clock rate. For adaptive filtering, where coefficients can be different depending upon their input signals, coefficient RAMs can be used to update the coefficient values. The DSP48 slices and interconnects also run at the 6x clock rate, providing unparalleled performance for multiplication and additions in today's FPGAs.
DSP48 Tile
The multi-channel filter block is a cascade implementation of the DSP48 tile. Each tile is implemented as shown in Figure 7-2. An SRL16E is used to shift the input from the six channels. The product cascade path between two DSP48 slices within the tile can be used to bring the product output from one tap into the cascading input of the next tap for the final addition.
Figure 7-2: [Block diagram of a DSP48 tile: six-channel, 8-bit input, two SRL16 tap delay elements, coefficient inputs C1 and C2, and the adder.]
Figure 7-3: [Conceptual diagram: a counter selects one of the input streams, and the selected value is fed into a shift register holding X6(n-1) down to X0(n-1).]
For each clock tick, the counter selects a different input stream (in order), and then supplies this value to the SRL16E shift register. After six clock ticks, the six input samples for a given time period are loaded sequentially, or interleaved into a single stream. A six-to-one multiplexer must be designed carefully, as it is constructed with slice logic that must run at the 6x clock rate. At 446 MHz, good design practices dictate point-to-point connections, a maximum of one Look-Up Table (LUT) between flip-flops, and RLOC techniques.
To reduce the high fanouts on the select lines of the multiplexer, the conceptual multiplexer in Figure 7-3 is implemented as shown in Figure 7-4. This circuit is repeated for all eight bits of the input sample width.
Figure 7-4: [Circuit diagram: the multiplexer built from a shift register and LUTs, with constant '1' and '0' inputs; inputs X4(n) and X5(n) are visible.]
Coefficient RAM
The six coefficient sets are stored in the SRL16 memories. If the same coefficient set is used for all channels, then only a single set is stored in the SRL16. If the different channels use different coefficients, then six sets of SRL16s are used for each tap. (Six RAMs can be used instead, one for each channel.) Each RAM is 8 bits wide and six deep, corresponding to the six taps. The optional Load input is used to change or load a new coefficient set. Six clock cycles are needed to load all six RAMs. Input C1 is used to load the eight locations of RAM1 which are used for Channel1. C8 is used to load the eight locations of RAM8 which are used for Channel8. At the eighth clock, all eight locations of the eight RAMs are loaded; the filter then becomes an adaptive filter. The speed of the overall filter will be reduced when the coefficients are stored in the RAM.
Control Logic
The control logic is used to ensure proper functioning of the different blocks. If the coefficient RAM block is used, the control logic ensures that the load signal is High for six clocks. Different tap-enable signals are used to make sure that RAM values are read into the DSP48 correctly. For instance, clock1 reads in the first location from RAM1, but the first location of RAM2 is read only at the clock number equal to the shift register length. The design assumes a clock is running at 6x that of the input signals. The DCM can also be used to multiply the clock if the only available clock is running at the input channel frequency. The control logic also takes care of the initial latency such that the final output is enabled only after the initial latency period is complete.
Implementation Results
The initial latency of the design is equal to the [(number of channels + 1) * number of taps] plus three pipe stages within the DSP48. After placement and routing, the design uses 216 slices and eight DSP48 blocks. The design has a speed of 454 MHz.
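As a worked example of the latency expression, the following Python sketch evaluates it for the six-channel, eight-tap design. The function name is hypothetical, and the result is simply the formula above applied to those two parameters, not a measured figure.

```python
def initial_latency(num_channels, num_taps, dsp48_pipe_stages=3):
    """Latency in clock cycles: (channels + 1) x taps + DSP48 pipe stages."""
    return (num_channels + 1) * num_taps + dsp48_pipe_stages


if __name__ == "__main__":
    # Six channels, eight taps: (6 + 1) x 8 + 3
    print(initial_latency(6, 8))
```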
Conclusion
The available arithmetic functions within the DSP48 block, combined with fine granularity and high speed, make the Virtex-4 FPGA an ideal device to implement high-speed, multi-channel filter functions. The design shows the efficient implementation of a six-channel, eight-tap filter. Due to the high-performance capability within the DSP48 block, a single-channel, eight-tap filter can be used to implement the six-channel, eight-tap filter, reducing the area utilization to roughly one-sixth.
Appendix A
References
1. Ken Steiglitz, A Digital Signal Processing Primer, ISBN: 0-8053-1684-1
2. Charles Poynton, Digital Video and HDTV Algorithms and Interfaces, ISBN: 1-55860-792-7
3. C. Britton Rorabaugh, DSP Primer, ISBN: 0-07-054004-7
4. Xilinx, Inc., Virtex-4 User Guide
- The Virtex-4 FPGA family: The ideal platform for creating high-performance DSP systems and for boosting the performance of a DSP processor-based system.
- The Virtex-4 technology: Innovative architectural features and design techniques that dramatically reduce power consumption while increasing DSP performance.
- The XtremeDSP slice: Optimal performance IP, fully integrated into the on-chip DSP-specific architecture (up to 512 XtremeDSP slices operating at 500 MHz).
- Design tools and support: A broad palette of easy-to-use design tools and libraries, in addition to in-depth design support, from both Xilinx and its partners.
- Education and design services: Get up and running as quickly as possible.
Xcell Publications help you solve design challenges, bringing you the awareness of the latest tools, devices, and technologies; knowledge on how to design most effectively; and the next steps for implementing working solutions. See all of our books, magazines, technical journals, solutions guides, and brochures at: www.xilinx.com/xcell