UG193
XtremeDSP Design
Considerations
User Guide
The information disclosed to you hereunder (the “Materials”) is provided solely for the selection and use of Xilinx products. To the maximum
extent permitted by applicable law: (1) Materials are made available "AS IS" and with all faults, Xilinx hereby DISCLAIMS ALL
WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF
MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether
in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising
under, or in connection with, the Materials (including your use of the Materials), including for any direct, indirect, special, incidental, or
consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action
brought by a third party) even if such damage or loss was reasonably foreseeable or Xilinx had been advised of the possibility of the same.
Xilinx assumes no obligation to correct any errors contained in the Materials or to notify you of updates to the Materials or to product
specifications. You may not reproduce, modify, distribute, or publicly display the Materials without prior written consent. Certain products are
subject to the terms and conditions of Xilinx’s limited warranty, please refer to Xilinx’s Terms of Sale which can be viewed at
https://www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained in a license issued to you by Xilinx.
Xilinx products are not designed or intended to be fail-safe or for use in any application requiring fail-safe performance; you assume sole risk
and liability for use of Xilinx products in such critical applications, please refer to Xilinx’s Terms of Sale which can be viewed at
https://www.xilinx.com/legal.htm#tos.
© Copyright 2006–2017 Xilinx, Inc. XILINX, the Xilinx logo, Virtex, Spartan, ISE, and other designated brands included herein are
trademarks of Xilinx in the United States and other countries. PowerPC is a trademark of IBM Corp. and is used under license. All other
trademarks are the property of their respective owners.
Revision History
The following table shows the revision history for this document.
Virtex-5 FPGA XtremeDSP Design Considerations www.xilinx.com UG193 (v3.6) July 27, 2017
Date Version Revision
03/20/07 2.3 Updated “Dynamic Shifter” and “18-Bit Barrel Shifter” sections and Figure 4-2,
Figure 4-3. Added “Floating Point Multiply and 59 x 59 Signed Multiply” and
Figure 4-15 and Figure 4-16.
05/17/07 2.4 Updated the “X, Y, and Z Multiplexer” section.
05/22/07 2.4.1 Typographical edits.
08/01/07 2.5 Updated Figure 2-2, Figure 4-21, and Figure 4-22.
10/09/07 2.6 Updated disclaimer.
Chapter 1: Updated Equation 1-1 and “ALUMODE Inputs” and “MULTSIGNOUT
Port Logic” sections.
Chapter 4: Updated the “Convergent Rounding: LSB Correction Technique”
section.
Added Appendix A.
Made minor typographical edits.
12/11/07 2.7 Updated Table 1-1 with Virtex-5 LX155, LX20T, and LX155T devices. Updated
Figure 1-12.
02/05/08 2.8 Made minor typographical edits.
Chapter 2: Updated the “Designing for Performance (to 550 MHz)” and
“Adder/Subtracter or Logic Unit” sections.
03/31/08 3.0 Made minor typographical edits.
Updated “About This Guide.”
Chapter 1: Updated Figure 1-3 and Table 1-1 (added FXT devices).
Chapter 4: Updated Figure 4-16.
04/25/08 3.1 Updated Table 1-1 with SX240T device.
09/23/08 3.2 Added the TXT platform.
Chapter 1: Updated Table 1-1.
01/12/09 3.3 Chapter 2: Updated Figure 2-2.
Chapter 4: Updated “MACC and MACC Extension” section, Figure 4-23,
Figure 4-27, and Figure 4-37.
Appendix A: Updated “MULTSIGNOUT and CARRYCASCOUT” section.
06/01/10 3.4 Chapter 1: Added encoding to the MASK description in Table 1-3.
Chapter 2: Removed the “Connecting DSP48E Slices and Block RAM” section.
Changed the input labels in Figure 2-3 and Figure 2-4 to match Equation 2-1
through Equation 2-3.
Chapter 4: Updated the MACC and MACC Extension section. Replaced
Figure 4-23. Changed both output buses in Figure 4-24 from [42:0] to [43:0].
Changed the input bus to the top slice in Figure 4-27 from a[17:0] to b[17:0].
Chapter 5: Updated hyperlinks.
01/26/12 3.5 Updated “Convergent Rounding” and “Convergent Rounding: LSB Correction
Technique.” Updated titles of Table 4-6 and Table 4-7.
07/27/17 3.6 Updated links to Xilinx documentation in “About This Guide.”
Chapter 4: Reversed order of DSP48E headings in Table 4-1.
Chapter 5: Updated “Synthesis Tools.” Added link to DSP IP Core web page in
“DSP IP.” Updated link to System Generator for DSP User Guide in “System Generator
for DSP.” Removed link to ISE software manuals in “Architecture Wizard.”
Table of Contents
Revision History
Adder Cascade
Connecting DSP48E Slices across Columns
Time Multiplexing the DSP48E Slice
Miscellaneous Notes and Suggestions
Preface
Guide Contents
This manual contains the following chapters:
• Chapter 1, “DSP48E Description and Specifics”
• Chapter 2, “DSP48E Design Considerations”
• Chapter 3, “DSP48E Timing Consideration”
• Chapter 4, “DSP48E Applications”
• Chapter 5, “DSP48E Software and Tool Support Overview”
Additional Documentation
The following documents are also available for download at
http://www.xilinx.com/documentation.
• Virtex-5 Family Overview
The features and product selection of the Virtex-5 family are outlined in this overview.
• Virtex-5 FPGA Data Sheet: DC and Switching Characteristics
This data sheet contains the DC and Switching Characteristic specifications for the
Virtex-5 family.
• Virtex-5 FPGA User Guide
Chapters in this guide cover the following topics:
• Clocking Resources
• Clock Management Technology (CMT)
• Phase-Locked Loops (PLLs)
• Block RAM
• Configurable Logic Blocks (CLBs)
• SelectIO™ Resources
• SelectIO Logic Resources
• Advanced SelectIO Logic Resources
Typographical Conventions
This document uses the following typographical conventions. An example illustrates each
convention.
Online Document
The following conventions are used in this document:
Chapter 1
[Figure: DSP48E slice block diagram (UG193_c1_01_032806). Inputs A (30 bits), B (18 bits), and C (48 bits) feed a 25 x 18 multiplier and the X, Y, and Z multiplexers, which drive the adder/subtracter/logic unit controlled by OPMODE (7 bits), ALUMODE (4 bits), and CARRYINSEL (3 bits). Outputs are P (48 bits), CARRYOUT (4 bits), PATTERNDETECT, and PATTERNBDETECT. The Z multiplexer also accepts 17-bit shifted versions of P and PCIN.]
*The signals CARRYCASCOUT, MULTSIGNOUT, PCOUT, BCOUT, ACOUT, MULTSIGNIN, CARRYCASCIN, BCIN, ACIN, and PCIN are dedicated routing paths internal to the DSP48E column. They are not accessible via fabric routing resources.
Architectural Highlights
The Virtex-5 FPGA DSP48E slice includes all Virtex-4 FPGA DSP48 features plus a variety
of new features. Among the new features are a wider 25 x 18 multiplier and an
add/subtract function that has been extended to function as a logic unit. This logic unit can
perform a host of bitwise logical operations when the multiplier is not used. The DSP48E
slice includes a pattern detector and a pattern bar detector that can be used for convergent
rounding, overflow/underflow detection for saturation arithmetic, and auto-resetting
counters/accumulators. The Single Instruction Multiple Data (SIMD) mode of the
adder/subtracter/logic unit is also new to the DSP48E slice; this mode is available when
the multiplier is not used. The Virtex-5 DSP48E slice also has new cascade paths. The new
features are highlighted in Figure 1-2.
Figure 1-2: New Features in the DSP48E Slice vs. the DSP48 Slice [block diagram omitted; UG193_c1_02_032806]
[Figure: DSP48E slices and their interconnect (UG193_c1_03_020508).]
[Figure: DSP48E slice primitive (UG193_c1_04_032806). Ports shown include the 1-bit inputs CEALUMODE, CECTRL, CEMULTCARRYIN, CECARRYIN, RSTA, RSTB, RSTC, RSTM, RSTP, RSTCTRL, RSTALLCARRYIN, RSTALUMODE, and CLK; the cascade inputs ACIN[29:0], BCIN[17:0], PCIN[47:0], CARRYCASCIN, and MULTSIGNIN; and the 1-bit outputs OVERFLOW and UNDERFLOW.]
Notes:
1. These signals are dedicated routing paths internal to the DSP48E column. They are not accessible via fabric routing resources.
2. All signals are active High.
different, highly pipelined, DSP application solutions. The other data inputs and the
control inputs can be optionally registered once. Full speed operation is 550 MHz when
using the pipeline registers. More detailed timing information is available in Chapter 2,
“DSP48E Design Considerations.”
In its most basic form, the output of the adder/subtracter/logic unit is a function of its
inputs. The inputs are driven by the upstream multiplexers, carry select logic, and
multiplier array.
Equation 1-1 summarizes the combination of X, Y, Z, and CIN by the adder/subtracter. The
CIN, X multiplexer output, and Y multiplexer output are always added together. This
combined result can be selectively added to or subtracted from the Z multiplexer output.
Note that the second option is a new feature of the DSP48E slice and is obtained by setting
the ALUMODE to 0001.
Adder/Sub Out = (Z ± (X + Y + CIN)) or (–Z + (X + Y + CIN) – 1)        (Equation 1-1)
A typical use of the slice is where A and B inputs are multiplied and the result is added to
or subtracted from the C register. More detailed operations based on control and data
inputs are described in later sections. Selecting the multiplier function consumes both X
and Y multiplexer outputs to feed the adder. The two 43-bit partial products from the
multiplier are sign extended to 48 bits before being sent to the adder/subtracter.
When not using the first stage multiplier, the 48-bit, dual input, bit-wise logic function
implements AND, OR, NOT, NAND, NOR, XOR, and XNOR. The inputs to these
functions are A:B, C, P, or PCIN selected through the X and Z multiplexers, with the Y
multiplexer selecting either all 1s or all 0s depending on logic operation.
The output of the adder/subtracter or logic unit feeds the pattern detector logic. The
pattern detector allows the DSP48E slice to support Convergent Rounding, Counter
Autoreset when a count value has been reached, and Overflow/Underflow/Saturation in
accumulators. In conjunction with the logic unit, the pattern detector can be extended to
perform a 48-bit dynamic comparison of two 48-bit fields. This enables functions such as
A:B NAND C == 0, or A:B (bit-wise logic) C == Pattern to be implemented.
Figure 1-5 shows the DSP48E slice in a very simplified form. The seven OPMODE bits
control the selects of X, Y, and Z multiplexers, feeding the inputs to the adder/subtracter or
logic unit. In all cases, the 43-bit partial product data from the multiplier to the X and Y
multiplexers is sign extended, forming 48-bit input datapaths to the adder/subtracter.
Based on 43-bit operands and a 48-bit accumulator output, the number of “guard bits” (i.e.,
bits available to guard against overflow) is 5. Therefore, the number of multiply
accumulations (MACC) possible before overflow occurs is 32. To extend the number of
MACC operations, the MACC_EXTEND feature should be used, which allows the MACC
to extend to 96 bits with two DSP48E slices. If A port is limited to 18 bits (sign-extended to
25), then there are 12 “guard bits” for the MACC, just like the Virtex-4 DSP48 slice. The
CARRYOUT bits are invalid during multiply operations. Combinations of OPMODE,
ALUMODE, CARRYINSEL, and CARRYIN control the function of the adder/subtracter or
logic unit.
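The guard-bit arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes the worst case in which every accumulated product is full scale, so each guard bit doubles the number of safe accumulations.

```python
ACCUMULATOR_WIDTH = 48  # width of the P output


def guard_bits(product_width: int, accumulator_width: int = ACCUMULATOR_WIDTH) -> int:
    """Bits left above the product to absorb accumulator growth."""
    return accumulator_width - product_width


def max_accumulations(product_width: int) -> int:
    """Worst-case number of full-scale products that can be summed safely."""
    return 2 ** guard_bits(product_width)


# 25 x 18 multiply gives a 43-bit product: 5 guard bits, 32 accumulations.
print(guard_bits(43), max_accumulations(43))
# A port limited to 18 bits: 18 x 18 gives a 36-bit product, 12 guard bits.
print(guard_bits(36), max_accumulations(36))
```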
[Figure 1-5: simplified DSP48E slice (UG193_c1_05_032806).]
Input Ports
The ACIN, ALUMODE, CARRYCASCIN, and MULTSIGNIN ports, along with the corresponding
clock enable inputs and reset inputs, are new ports introduced in the Virtex-5 family. The
following section describes the input ports of the Virtex-5 DSP48E slice in detail. The input
ports of the DSP48E slice are highlighted in Figure 1-6.
[Figure 1-6: DSP48E slice block diagram with the input ports highlighted (UG193_c1_06_071206).]
A, B, and C Ports
The DSP48E slice input data ports support many common DSP and math algorithms. The
DSP48E slice has three direct input data ports labeled A, B, and C. The A data port is 30 bits
wide, the B data port is 18 bits wide, and the C data port is 48 bits wide. In Virtex-5 devices,
each DSP48E slice also has an independent C data port. In Virtex-4 devices, the C port was
shared between two DSP48 slices in a tile.
The 25-bit A (A[24:0]) and 18-bit B ports supply input data to the 25-bit by 18-bit, two's
complement multiplier. With independent C port, each DSP48E slice is capable of
Multiply-Add, Multiply-Subtract, and Multiply-Round operations.
Concatenated A and B ports bypass the multiplier and feed the X multiplexer input. The
30-bit A input port forms the upper 30-bits of A:B concatenated datapath, and the 18-bit B
input port forms the lower 18 bits of the A:B datapath. The A:B datapath, together with the
C input port, enables each DSP48E slice to implement a full 48-bit adder/subtracter
provided the multiplier is not used.
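As a rough illustration of the A:B datapath described above, the following Python sketch concatenates a 30-bit A value and an 18-bit B value into a 48-bit word and adds it to C with 48-bit wraparound. The helper names are hypothetical and the model ignores registers and carry inputs.

```python
A_WIDTH, B_WIDTH, P_WIDTH = 30, 18, 48
MASK48 = (1 << P_WIDTH) - 1


def concat_ab(a: int, b: int) -> int:
    """Form A:B with A in the upper 30 bits and B in the lower 18 bits."""
    a &= (1 << A_WIDTH) - 1
    b &= (1 << B_WIDTH) - 1
    return (a << B_WIDTH) | b


def add48(ab: int, c: int) -> int:
    """48-bit add of the A:B word and C, wrapping like a 48-bit adder."""
    return (ab + c) & MASK48


ab = concat_ab(0x1, 0x3FFFF)   # A = 1, B = all ones
print(hex(ab))                 # 0x7ffff
print(hex(add48(ab, 1)))       # the carry ripples from the B field into the A field
```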
Each DSP48E slice also has two cascaded input datapaths (ACIN and BCIN), providing a
cascaded input stream between adjacent DSP48E slices. The cascaded path is 30 bits wide
for the A input and 18 bits wide for the B input. Applications benefiting from this feature
include FIR filters, complex multiplication, multi-precision multiplication and complex
MACCs.
The A and B input ports and the ACIN and BCIN cascade ports can have 0, 1, or 2 pipeline
stages in their datapaths. The A and B port logic is shown in Figure 1-7. The number of
pipeline stages is set using attributes. Attributes AREG and BREG select the number of
pipeline stages for the A and B direct inputs. Attributes ACASCREG and BCASCREG select
the number of pipeline stages in the ACOUT and BCOUT cascade datapaths. The allowed
attribute settings are shown in Table 1-3. Multiplexers controlled by configuration bits
select flow-through paths, optional registers, or cascaded inputs. The data port registers
allow users to trade off clock frequency (i.e., performance) against data latency. The
attribute settings for A and B port registers are listed in Table 1-5.
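[Figure 1-7: A and B port logic, showing the A/B direct inputs, the ACIN/BCIN cascade inputs, and the ACOUT/BCOUT cascade outputs (UG193_c1_07_032806).]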
The 48-bit C port is used as a general input to the Y and Z multiplexer to perform add,
subtract, three-input add/subtract, and logic functions. The C input is also connected to
the pattern detector for rounding function implementations. The C port logic is shown in
Figure 1-8. Attribute CREG is used to select the number of pipeline stages for the C input
datapath.
[Figure 1-8: C port logic. The 48-bit C input passes through an optional register (clock enable CEC, reset RSTC) to the Y and Z multiplexers and the pattern detector (UG193_c1_08_032806).]
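[Figure: OPMODE, ALUMODE, and CARRYINSEL port logic (UG193_c1_09_092106). The 7-bit OPMODE passes through an optional register (clock enable CECTRL, reset RSTCTRL) to the X, Y, and Z multiplexers and the 3-input adder/subtracter. The 4-bit ALUMODE passes through an optional register (clock enable CEALUMODE, reset RSTALUMODE) to the adder/subtracter. The 3-bit CARRYINSEL passes through an optional register to the carry input select logic.]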
X, Y, and Z Multiplexer
The OPMODE (Operating Mode) control input contains fields for X, Y, and Z multiplexer
selects.
The OPMODE input provides a way for the user to dynamically change DSP48E
functionality from clock cycle to clock cycle (e.g., when altering the internal datapath
configuration of the DSP48E slice relative to a given calculation sequence).
The OPMODE bits can be optionally registered using the OPMODEREG attribute (as
noted in Table 1-3).
Table 1-6, Table 1-7, and Table 1-8 list the possible values of OPMODE and the resulting
function at the outputs of the three multiplexers (X, Y, and Z multiplexers). The
multiplexer outputs supply three operands to the following adder/subtracter. Not all
possible combinations for the multiplexer select bits are allowed. Some are marked in the
tables as “illegal selection” and give undefined results. If the multiplier output is selected,
then both the X and Y multiplexers are used to supply the multiplier partial products to the
adder/subtracter.
If AREG/BREG = 0 and USE_MULT = MULT_S (this requires MREG=1), the A:B path
should not be selected via the OPMODE multiplexer. Since the OPMODE can be dynamic,
switching between the registered multiplier with MREG=1 and the combinatorial A:B path
is not supported. If the multiplier is not being used, USE_MULT should be set to NONE.
ALUMODE Inputs
The 4-bit ALUMODE controls the behavior of the second stage add/sub/logic unit.
ALUMODE = 0000 selects add operations of the form Z + (X + Y + CIN), which
corresponds to Virtex-4 SUBTRACT = 0. ALUMODE = 0011 selects subtract operations of
the form Z – (X + Y + CIN), which corresponds to Virtex-4 SUBTRACT = 1. ALUMODE set
to 0001 can implement -Z + (X + Y + CIN) – 1. ALUMODE set to 0010 can implement
-(Z + X + Y + CIN) – 1, which is equivalent to not (Z + X + Y + CIN). The negative of a
two's complement number is obtained by performing a bitwise inversion and adding one,
e.g., -k = not (k) + 1. Other subtract and logic operations can also be implemented with the
enhanced add/sub/logic unit. See Table 1-9.
Notes:
1. In two's complement: -Z = not (Z) + 1
Also, see Table 1-12 for two-input ALUMODE operations and Figure A-3.
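The four arithmetic ALUMODE settings described above can be modeled in a few lines of Python. This is a behavioral sketch only, not a description of the hardware: the function name `alu` is hypothetical, operands are treated as plain integers, and results are wrapped to 48 bits. It also confirms the identity -k - 1 = not (k) used in the text.

```python
MASK48 = (1 << 48) - 1


def alu(alumode: int, x: int, y: int, z: int, cin: int = 0) -> int:
    """Behavioral model of the four arithmetic ALUMODE settings (48-bit result)."""
    s = x + y + cin
    if alumode == 0b0000:      # Z + (X + Y + CIN)
        result = z + s
    elif alumode == 0b0011:    # Z - (X + Y + CIN)
        result = z - s
    elif alumode == 0b0001:    # -Z + (X + Y + CIN) - 1
        result = -z + s - 1
    elif alumode == 0b0010:    # -(Z + X + Y + CIN) - 1
        result = -(z + s) - 1
    else:
        raise ValueError("only the arithmetic ALUMODE settings are modeled here")
    return result & MASK48


x, y, z, cin = 100, 20, 7, 1
# ALUMODE 0010 is the bitwise inversion of the plain sum.
print(alu(0b0010, x, y, z, cin) == (~(z + x + y + cin)) & MASK48)  # True
# ALUMODE 0001 equals not(Z) + (X + Y + CIN).
print(alu(0b0001, x, y, z, cin) == ((~z) + x + y + cin) & MASK48)  # True
```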
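[Figure 1-10: carry input select logic (UG193_c1_10_010906). Select codes shown: 000 (fabric CARRYIN, optionally registered with CECARRYIN and RSTALLCARRYIN), 010 (CARRYCASCIN, for large add/sub/accumulate parallel operation), 111 and 101 (P[47] and inverted P[47], for rounding the output), and 011 and 001 (PCIN[47] and inverted PCIN[47]).]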
Figure 1-10 shows eight inputs selected by the 3-bit CARRYINSEL control.
The first input, CARRYIN (CARRYINSEL set to binary 000), is driven from general logic.
This option allows implementation of a carry function based on user logic. CARRYIN can
be optionally registered. The second input (CARRYINSEL equal to binary 010) is the
CARRYCASCIN input from an adjacent DSP48E slice. The third input (CARRYINSEL equal to
binary 100) is the CARRYCASCOUT from the same DSP48E slice, fed back to itself.
The fourth input (CARRYINSEL equal to binary 110) is A[24] XNOR B[17], used for
symmetrically rounding multiplier outputs. This signal can be optionally registered to
match the MREG pipeline delay. The fifth and sixth inputs (CARRYINSEL equal to
binary 111 and 101) select the true or inverted P output MSB, P[47], for symmetrically
rounding the P output. The seventh and eighth inputs (CARRYINSEL equal to binary 011
and 001) select the true or inverted cascaded P input MSB, PCIN[47], for symmetrically
rounding the cascaded P input.
Table 1-10 lists the possible values of the three carry input select bits (CARRYINSEL) and
the resulting carry inputs or sources.
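The carry input selection can be sketched as a simple lookup. The Python below is an illustrative model with a hypothetical function name; it assumes that where the text pairs two select codes with "true or inverted", the first code (111, 011) selects the true bit and the second (101, 001) selects the inverted bit. Table 1-10 remains the authoritative mapping.

```python
def select_carry_in(carryinsel: int, *, carryin: int = 0, carrycascin: int = 0,
                    carrycascout: int = 0, a: int = 0, b: int = 0,
                    p: int = 0, pcin: int = 0) -> int:
    """Return the carry bit chosen by the 3-bit CARRYINSEL (behavioral sketch)."""
    a24 = (a >> 24) & 1
    b17 = (b >> 17) & 1
    p47 = (p >> 47) & 1
    pcin47 = (pcin >> 47) & 1
    sources = {
        0b000: carryin,            # fabric CARRYIN
        0b010: carrycascin,        # from the adjacent slice
        0b100: carrycascout,       # fed back from the same slice
        0b110: 1 ^ (a24 ^ b17),    # A[24] XNOR B[17], multiplier rounding
        0b111: p47,                # assumed: true P[47]
        0b101: 1 ^ p47,            # assumed: inverted P[47]
        0b011: pcin47,             # assumed: true PCIN[47]
        0b001: 1 ^ pcin47,         # assumed: inverted PCIN[47]
    }
    return sources[carryinsel]


# Two negative multiplier operands give a positive product, so XNOR yields 1.
print(select_carry_in(0b110, a=1 << 24, b=1 << 17))
```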
Output Ports
The Virtex-5 DSP48E slice has eight new output ports compared to the Virtex-4 DSP48
slice. These new output ports include the cascadeable A data port (ACOUT), cascadeable
carryout port (CARRYCASCOUT), cascadeable sign bit of the multiplier output
(MULTSIGNOUT), carry output to fabric (CARRYOUT), pattern detector outputs
(PATTERNDETECT and PATTERNBDETECT), OVERFLOW port, and UNDERFLOW
port. The following section describes the output ports of the Virtex-5 DSP48E in detail. The
output ports of the DSP48E slice are shown in Figure 1-11.
[Figure 1-11: DSP48E slice block diagram with the output ports highlighted (UG193_c1_11_013006).]
All the output ports except ACOUT and BCOUT are reset by RSTP and enabled by CEP
(see Figure 1-12). ACOUT and BCOUT are reset by RSTA and RSTB respectively (shown in
Figure 1-7).
[Figure 1-12: output port logic. P, PCOUT, MULTSIGNOUT, CARRYCASCOUT, CARRYOUT, PATTERNDETECT, and PATTERNBDETECT pass through an optional output register with clock enable CEP and reset RSTP (UG193_c1_12_112907).]
P Port
Each DSP48E slice has a 48-bit-wide output port P. This output can be connected (cascaded
connection) to the adjacent DSP48E slice internally through the PCOUT path. The PCOUT
connects to the input of the Z multiplexer (PCIN) in the adjacent DSP48E slice. This path
provides an output cascade stream between adjacent DSP48E slices.
The MSB of a multiplier output is cascaded to the next DSP48E slice using the
MULTSIGNOUT/MULTSIGNIN path and can be used only in MACC extension applications to build a
96-bit accumulator. This opmode setting is described in Chapter 4, “Advanced Math
Applications.” The actual hardware implementation of MULTSIGNOUT is described in
Appendix A.
Embedded Functions
The embedded functions in Virtex-5 devices include a 25 x 18 multiplier,
adder/subtracter/logic unit, and pattern detector logic (see Figure 1-13).
[Figure 1-13: DSP48E slice block diagram with the embedded functions highlighted (UG193_c1_13_013006).]
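[Figure: two's complement multiplier followed by the optional MREG. Inputs A and B are multiplied to form two 43-bit partial products, Partial Product 1 and Partial Product 2, 86 bits in total (UG193_c1_14_120205).]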
Table 1-12: OPMODE and ALUMODE Control Bits Select Logic Unit Outputs
Logic Unit Mode     OPMODE<3:2>     ALUMODE<3:0>
X XOR Z             00              0100
X XNOR Z            00              0101
X XNOR Z            00              0110
X XOR Z             00              0111
X AND Z             00              1100
X AND (NOT Z)       00              1101
X NAND Z            00              1110
(NOT X) OR Z        00              1111
X XNOR Z            10              0100
X XOR Z             10              0101
X XOR Z             10              0110
X XNOR Z            10              0111
X OR Z              10              1100
X OR (NOT Z)        10              1101
X NOR Z             10              1110
(NOT X) AND Z       10              1111
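Table 1-12 can be transcribed into a lookup for quick experiments. The following Python sketch is a software transcription of the table rows only; the names `LOGIC_UNIT_MODES` and `logic_unit` are hypothetical, and results are truncated to 48 bits.

```python
MASK48 = (1 << 48) - 1

# Keys are (OPMODE<3:2>, ALUMODE<3:0>), values are the two-input function.
LOGIC_UNIT_MODES = {
    (0b00, 0b0100): lambda x, z: x ^ z,        # X XOR Z
    (0b00, 0b0101): lambda x, z: ~(x ^ z),     # X XNOR Z
    (0b00, 0b0110): lambda x, z: ~(x ^ z),     # X XNOR Z
    (0b00, 0b0111): lambda x, z: x ^ z,        # X XOR Z
    (0b00, 0b1100): lambda x, z: x & z,        # X AND Z
    (0b00, 0b1101): lambda x, z: x & ~z,       # X AND (NOT Z)
    (0b00, 0b1110): lambda x, z: ~(x & z),     # X NAND Z
    (0b00, 0b1111): lambda x, z: ~x | z,       # (NOT X) OR Z
    (0b10, 0b0100): lambda x, z: ~(x ^ z),     # X XNOR Z
    (0b10, 0b0101): lambda x, z: x ^ z,        # X XOR Z
    (0b10, 0b0110): lambda x, z: x ^ z,        # X XOR Z
    (0b10, 0b0111): lambda x, z: ~(x ^ z),     # X XNOR Z
    (0b10, 0b1100): lambda x, z: x | z,        # X OR Z
    (0b10, 0b1101): lambda x, z: x | ~z,       # X OR (NOT Z)
    (0b10, 0b1110): lambda x, z: ~(x | z),     # X NOR Z
    (0b10, 0b1111): lambda x, z: ~x & z,       # (NOT X) AND Z
}


def logic_unit(opmode_3_2: int, alumode: int, x: int, z: int) -> int:
    """Apply the table entry to 48-bit operands X and Z."""
    return LOGIC_UNIT_MODES[(opmode_3_2, alumode)](x, z) & MASK48


print(bin(logic_unit(0b00, 0b1100, 0b1100, 0b1010)))  # X AND Z
print(bin(logic_unit(0b10, 0b1100, 0b1100, 0b1010)))  # X OR Z
```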
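[Figure: SIMD mode of the adder/subtracter (UG193_c1_15_032806). The X multiplexer selects among 0, A:B, and P; the Y multiplexer among 0, all 1s, and C; the Z multiplexer among 0, PCIN, P, and C. Under control of ALUMODE[3:0], the 48-bit operands are split into four segments producing P[47:36] with CARRYOUT[3], P[35:24] with CARRYOUT[2], P[23:12] with CARRYOUT[1], and P[11:0] with CARRYOUT[0].]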
• Four segments of dual or ternary adders with 12-bit inputs, a 12-bit output, and a
carry output for each segment
• Function controlled dynamically by ALUMODE[3:0], and operand source by
OPMODE[6:0]
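The four-segment behavior listed above can be illustrated with a small Python model. This is a sketch for a two-input add only, with a hypothetical function name; it shows that each 12-bit segment produces its own carry output and that carries do not propagate between segments.

```python
SEGMENTS = 4
SEG_WIDTH = 12
SEG_MASK = (1 << SEG_WIDTH) - 1


def simd_add_four12(x: int, z: int):
    """Add four independent 12-bit segments of two 48-bit words.

    Returns (p, carryout) where carryout[i] corresponds to CARRYOUT[i].
    """
    p = 0
    carryout = []
    for seg in range(SEGMENTS):
        shift = seg * SEG_WIDTH
        total = ((x >> shift) & SEG_MASK) + ((z >> shift) & SEG_MASK)
        p |= (total & SEG_MASK) << shift   # 12-bit segment result
        carryout.append(total >> SEG_WIDTH)  # per-segment carry, not propagated
    return p, carryout


# Segment 0 overflows; segment 1 is unaffected by that carry.
p, carries = simd_add_four12(0x000_000_001_FFF, 0x000_000_001_001)
print(hex(p), carries)
```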
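[Figure: pattern detector logic. The P output is compared against PATTERN, selected by SEL_PATTERN from the PATTERN attribute or the registered C input. MASK is selected by SEL_MASK and SEL_ROUNDING_MASK from the MASK attribute, the registered C input, C shifted by 1 (Mode 1), or C shifted by 2 (Mode 2). Outputs are PATTERNDETECT, PATTERNBDETECT, PATTERNDETECTPAST, and PATTERNBDETECTPAST.]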
Some of the applications that can be implemented using the pattern detector are:
• Pattern detect with optional mask
• Dynamic C input pattern match with A x B
• Overflow/underflow/saturation past P[46]
• A:B == C and dynamic pattern match, e.g., A:B OR C == 0, A:B AND C == 1
• A:B {function} C == 0
• 48-bit counter auto reset (terminal count detection)
• Detecting mid points for rounding operations
If the pattern detector is not being employed, it can be used for other creative design
implementations. These include:
• Duplicating a pin (e.g., the sign bit) to reduce fan out and thus increase speed.
• Implementing a built-in inverter on one bit (e.g., the sign bit) without having to route
out to the fabric.
• Checking for sticky bits in floating point, handling special cases, or monitoring the
DSP48E slice outputs.
• Raising a flag if a certain condition is met or if a certain condition is no longer met.
Refer to Chapter 2, “DSP48E Design Considerations” of the user guide for details on the
above applications.
A mask field can also be used to mask out certain bit locations in the pattern detector. The
pattern field and the mask field can each come from a distinct 48-bit memory cell field or
from the (registered) C input.
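[Figure: overflow/underflow logic. Overflow is derived from PATTERNDETECTPAST, PATTERNBDETECT, and PATTERNDETECT; Underflow is derived from PATTERNBDETECTPAST, PATTERNBDETECT, and PATTERNDETECT. The example settings shown are PATTERN = 48’b00000000... and MASK = 48’b00111111...]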
output bit width to 47 bits. An alternate implementation using the pattern detector, to
detect overflow past the 47th bit, is described in Chapter 4, “Application Examples.”
By setting the mask to other values like "0000111 …1", the bit value P[N] at which
overflow is detected can be changed. This logic supports saturation to a positive number of
2^N – 1 and a negative number of –2^N in two's complement, where N is the number of 1s in the
mask field.
To check overflow/underflow condition for N = 2, the following example is used:
• Mask is set to 0…..11.
• The (N) LSB bits are not considered for the comparison.
• For N = 2, the legal values (patterns) are 2^2 – 1 to –2^2, or 3 to –4.
See Figure 1-18 and Figure 1-19 for examples.
[Figure 1-18: overflow example. PD goes from High to Low when overflow occurs (UG193_c1_18_071006).]
[Figure 1-19: underflow example (UG193_c1_19_052206).]
• PD is 1 if every bit of P either matches the pattern or is masked
• PBD is 1 if every bit of P either matches patternb or is masked
Overflow is caused by addition when the value at the output of the
adder/subtracter/logic unit goes over 3. Adding 1 to the final value of 0..0011 gives 0..0100
as the result. This causes the PD output to go to 0. When the PD output goes from 1 to 0, an
overflow is flagged.
Underflow is caused by subtraction when the value goes below -4. Subtracting 1 from
1..1100 yields 1..1011 (-5). This causes the PBD output to go to 0. When the PBD output goes
from 1 to 0, an underflow is flagged.
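The N = 2 example can be reproduced with a short Python model. This is a sketch under stated assumptions: the function names are hypothetical, a mask bit of 1 means the bit is ignored, patternb is taken to be the bitwise inverse of the pattern, and overflow (underflow) is flagged when PD (PBD) was 1 on the previous value and neither PD nor PBD is 1 on the current value.

```python
WIDTH = 48
ALL_ONES = (1 << WIDTH) - 1


def detect(p: int, pattern: int, mask: int) -> bool:
    """True when every bit of p matches the pattern or is masked."""
    return ((~(p ^ pattern) | mask) & ALL_ONES) == ALL_ONES


def overflow_underflow(values, n: int):
    """Return (overflow, underflow) flags for a sequence of P values."""
    mask = (1 << n) - 1            # the N LSBs are ignored
    pattern = 0
    patternb = ~pattern & ALL_ONES
    pd_past = pbd_past = False
    flags = []
    for value in values:
        p = value & ALL_ONES       # 48-bit two's complement encoding
        pd = detect(p, pattern, mask)     # P[47:N] all zeros: 0 .. 2^N - 1
        pbd = detect(p, patternb, mask)   # P[47:N] all ones: -2^N .. -1
        flags.append((pd_past and not pd and not pbd,
                      pbd_past and not pd and not pbd))
        pd_past, pbd_past = pd, pbd
    return flags


print(overflow_underflow([2, 3, 4], n=2))     # overflow flagged at 4
print(overflow_underflow([-3, -4, -5], n=2))  # underflow flagged at -5
```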
[Figure: auto reset logic (UG193_c1_20_032806). PATTERNDETECT and PATTERNDETECTPAST are combined according to AUTORESET_PATTERN_DETECT_OPTINV, gated by AUTORESET_PATTERN_DETECT, and ORed with the external RSTP.]
Two attributes are associated with using the auto reset logic. If the
AUTORESET_PATTERN_DETECT attribute is set to TRUE and the
AUTORESET_PATTERN_DETECT_OPTINV attribute is set to MATCH, then the DSP48E slice
automatically resets the P register one clock cycle after a pattern has been detected. For
example, the DSP48E slice resets when 00001000 is detected. Repeatedly adding 1 in the
DSP48E slice then gives a repeating 9-state counter from 0 to 8.
If the AUTORESET_PATTERN_DETECT attribute is set to TRUE and
AUTORESET_PATTERN_DETECT_OPTINV is set to NOT_MATCH, then the P register auto resets on the
next clock cycle only if a pattern was detected but is now no longer detected. For example,
the DSP48E slice resets if 00000XXX is no longer detected (set PATTERN = 000...00000 and
MASK = 000...00111). A 1 is repeatedly added to get a repeating 9-state
counter from 0 to 8. This counter mode can be useful if different numbers are added on
every cycle and a reset is triggered every time a threshold (which must be a power of 2) is crossed.
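Both auto reset modes can be simulated cycle by cycle. The Python sketch below uses hypothetical names and makes two assumptions: a mask bit of 1 excludes that bit from the comparison, and the 00000XXX case is modeled with a zero pattern and the three LSBs masked. It adds 1 on every clock and records the P register value.

```python
MASK48 = (1 << 48) - 1


def run_counter(cycles: int, pattern: int, mask: int, mode: str):
    """Simulate P <= P + 1 with pattern-detect auto reset; return P per cycle."""
    p = 0
    pd_past = False
    trace = []
    for _ in range(cycles):
        trace.append(p)
        pd = ((p ^ pattern) & ~mask & MASK48) == 0   # unmasked bits all match
        if mode == "MATCH":
            reset = pd                   # reset one clock after a match
        elif mode == "NOT_MATCH":
            reset = pd_past and not pd   # matched before, no longer matches
        else:
            raise ValueError("mode must be MATCH or NOT_MATCH")
        pd_past = pd
        p = 0 if reset else (p + 1) & MASK48
    return trace


# MATCH: reset after the exact value 8 is seen.
print(run_counter(12, pattern=0b1000, mask=0, mode="MATCH"))
# NOT_MATCH: reset once 00000XXX stops matching, that is, when 8 is reached.
print(run_counter(12, pattern=0, mask=0b111, mode="NOT_MATCH"))
```

Both calls print a sequence that counts from 0 to 8 and then restarts at 0.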
Chapter 2
Adder Tree
In typical direct form FIR filters, an input stream of samples is presented to one input of the
multipliers in the DSP48E slices. The coefficients supply the other input to the multipliers.
An adder tree is used to combine the outputs from many multipliers as shown in
Figure 2-1.
[Figure 2-1: direct form FIR filter with an adder tree (ug193_c2_01_022706). 18-bit delayed samples of X(n) and 18-bit coefficients h0(n) through h7(n) feed eight multipliers. The 48-bit products are combined by a tree of adders to produce y(n-6).]
In traditional FPGAs, the fabric adders are usually the performance bottleneck. The
number of adders needed and the associated routing depends on the size of the filter. The
depth of the adder tree scales as the log2 of the number of taps in the filter. Using the adder
tree structure shown in Figure 2-1 could also increase the cost, logic resources, and power.
The CLB architecture in Virtex®-5 devices improves the performance of the final adder tree
carry chains by approximately 32%. The Virtex-5 CLB allows the use of both the 6LUT and
the carry chain in the same slice to build an efficient ternary adder. The 6LUT in the CLB
functions as a dual 5LUT. The 5LUT is used as a 3:2 compressor to add three input values
to produce two output values. The 3:2 compressor is shown in Figure 2-2.
[Figure 2-2: 3:2 compressor built from the dual 5LUTs of a slice (UG193_c2_02_010709). For each bit position, inputs X, Y, Z, CY, and SUB/ADDB drive the LUT inputs; the outputs are ABUS, BBUS, and the registered SUM.]
The dual 5LUT configured as a 3:2 compressor in combination with the 2-input carry
cascade adder adds three N-bit numbers to produce one N+2 bit output, as shown in
Figure 2-3, by vertically stacking the required number of slices.
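The 3:2 compression step can be expressed with bitwise operations. The Python sketch below is a software illustration with hypothetical function names, not the LUT implementation: each bit position produces a sum bit (ABUS) and a carry bit (BBUS), and the carry word is shifted left by one before the final two-input add.

```python
def compress_3_2(a: int, b: int, c: int):
    """Reduce three operands to a sum word (ABUS) and a carry word (BBUS)."""
    abus = a ^ b ^ c                       # per-bit sum
    bbus = (a & b) | (a & c) | (b & c)     # per-bit carry (majority)
    return abus, bbus


def ternary_add(a: int, b: int, c: int) -> int:
    """Three-input add: 3:2 compressor followed by a two-input adder."""
    abus, bbus = compress_3_2(a, b, c)
    return abus + (bbus << 1)              # BBUS is left shifted by 1


n = 46
largest = (1 << n) - 1
total = ternary_add(largest, largest, largest)
print(total == 3 * largest, total.bit_length() <= n + 2)  # True True
```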
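[Figure 2-3: three-input adder (UG193_c2_03_051010). Inputs A, B, and C (46 bits each) feed a 3:2 compressor that produces ABUS and BBUS. BBUS is left shifted by 1, and a 2-input cascade adder produces the 48-bit SUM.]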
The 3:1 adder shown in Figure 2-3 is used as a building block for larger adder trees.
Depending on the number of inputs to be added, a 5:3 or a 6:3 compressor is also built in
fabric logic using multiple 5LUTs or 6LUTs. The serial combination of 6:3 compressor,
along with two DSP48E slices, adds six operands together to produce one output, as
shown in Figure 2-4. The LSB bits of the first DSP48E slice that are left open due to left shift
of the Y and Z buses should be tied to zero. The last DSP48E slice uses 2-deep A:B input
registers to align (pipeline matching) the X bus to the output of the first DSP48E slice.
Multiple levels of 6:3 compressors can be used to expand the number of input buses.
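[Figure 2-4: six-input adder (UG193_c2_04_051010). Inputs A through F (45 bits each) feed a 6:3 compressor that produces X, Y (left shifted by 1), and Z (left shifted by 2). Two DSP48E slices add the three buses to produce the 48-bit SUM.]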
The logic equations for the X, Y, and Z buses in Figure 2-4 are listed here:
X(n) = A(n) XOR B(n) XOR C(n) XOR D(n) XOR E(n) XOR F(n)        (Equation 2-1)
Y(n) = A(n)B(n) XOR A(n)C(n) XOR A(n)D(n) XOR A(n)E(n) XOR A(n)F(n) XOR B(n)C(n)
XOR B(n)D(n) XOR B(n)E(n) XOR B(n)F(n) XOR C(n)D(n) XOR C(n)E(n) XOR C(n)F(n)
XOR D(n)E(n) XOR D(n)F(n) XOR E(n)F(n)        (Equation 2-2)
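A 6:3 compressor built from these equations can be checked in software. In the Python sketch below, X and Y follow Equation 2-1 and Equation 2-2 directly. The Z term is not given in the text above, so the code assumes it is the XOR of all four-input products, which is one way to obtain the weight-4 bit of a six-input count. The function name is hypothetical and the operands are treated as unsigned.

```python
from itertools import combinations


def compress_6_3(operands, width: int):
    """Reduce six unsigned operands to the X, Y, and Z buses, bit by bit."""
    assert len(operands) == 6
    x = y = z = 0
    for n in range(width):
        bits = [(value >> n) & 1 for value in operands]
        xb = 0
        for bit in bits:                       # Equation 2-1: XOR of all inputs
            xb ^= bit
        yb = 0
        for p, q in combinations(bits, 2):     # Equation 2-2: XOR of all pair products
            yb ^= p & q
        zb = 0
        for p, q, r, s in combinations(bits, 4):   # assumed form of the Z term
            zb ^= p & q & r & s
        x |= xb << n
        y |= yb << n
        z |= zb << n
    return x, y, z


operands = [1, 2, 3, 4, 5, 6]
x, y, z = compress_6_3(operands, width=4)
# Y is left shifted by 1 and Z by 2 before the final additions.
print(x + (y << 1) + (z << 2) == sum(operands))  # True
```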
The compressor elements and cascade adder can be arranged like a tree in order to build
larger adders. The last add stage should be implemented in the DSP48E slice. Pipeline
registers should be added as needed to meet timing requirements of the design. These
adders can have higher area and/or power than the adder cascade.
Adder Cascade
The adder cascade implementation accomplishes the post addition process with minimal
silicon resources by using the cascade path within the DSP48E slice. This involves
computing the additive result incrementally, utilizing a cascaded approach as illustrated in
Figure 2-5.
[Figure 2-5: adder cascade across eight DSP48E slices (ug193_c2_05_022706). Slice 1 multiplies the 18-bit input X(n) by h0(n) and adds zero; slices 2 through 8 use coefficients h1(n-1) through h7(n-7) and add the 48-bit cascaded result of the previous slice with no wire shift. Products are sign extended from 36 bits to 48 bits, and slice 8 produces Y(n–10). The post adders are contained wholly in dedicated silicon for highest performance and lowest power.]
It is important to balance the delay of the input sample and the coefficients in the cascaded
adder to achieve the correct results. The coefficients are staggered in time (wave
coefficients).
These time-multiplexed DSP designs have optional pipelining that permits aggregate
multichannel sample rates of up to 500 million samples per second. Implementing a
time-multiplexed design using the DSP48E slice results in reduced resource utilization and
reduced power.
The Virtex-5 DSP48E slice contains the basic elements of classic FIR filters: a multiplier
followed by an adder, delay or pipeline registers, and the ability to cascade an input stream
(B bus) and an output stream (P bus) without exiting to a general slice fabric.
In the implementation described in “Multichannel FIR,” page 90, multichannel filtering
can be viewed as time-multiplexed, single-channel filters. In a typical multichannel
filtering scenario, multiple input channels are filtered using a separate digital filter for each
channel. Due to the high performance of the DSP48E slice within the Virtex-5 device, a
single digital filter can be used to filter all eight input channels by clocking the single filter
with an 8x clock. This implementation uses 1/8th of the total FPGA resource as compared
to implementing each channel separately.
• SRL16s/SRL32s in the CLB and block RAM should be used to store filter coefficients
or act as a register file or memory elements in conjunction with the DSP48E slice. The
bit pitch of the input bits (4 bits per interconnect) is designed to pitch match the CLB
and block RAM.
• The block RAM can also be used as a fast, finite state machine to drive the control
logic for the DSP design.
• The DSP48E slice can also be used in conjunction with a processor, e.g., MicroBlaze™
or PicoBlaze™ processors, for hardware acceleration of processor functions.
• A pipeline register should be used at the output of an SRL16 or block RAM before
connecting it to the input of the DSP48E slice. This ensures the best performance of
input operands feeding the DSP48E slice.
• In Virtex-5 devices, the register at the output of the SRL16 in the slice has a reset pin
and a clock-enable pin. To reset the SRL16, a zero is input into the SRL16 for 16 clock
cycles while holding the reset of the output register High. This capability is
particularly useful in implementing filters where the SRL16s are used to store the data
inputs.
Chapter 3
TDSPDCK_{AA, BB, ACINA, BCINB} / TDSPCKD_{AA, BB, ACINA, BCINB}: {A, B, ACIN, BCIN} input to
{A, B} register CLK
Timing Diagram
[Figure: Timing diagram spanning CLK Event 1 through CLK Event 5, showing CLK, CE
(TDSPCCK_CEA2A/CEB2B), RST (TDSPCCK_RSTPP), the data inputs (TDSPDCK_AA, TDSPDCK_BB,
TDSPDCK_CC), and the P output (TDSPCKO_PP).]
Chapter 4
DSP48E Applications
Introduction
The DSP48E slice can be used efficiently in video, wireless, and networking applications.
High performance combined with low power dissipation makes the DSP48E slice an ideal
choice for the above application segments. This chapter discusses the implementation
details of some of the common math and filter functions using the DSP48E slice. These
functions can be the building blocks for a variety of complex systems.
This chapter contains these sections:
• “Multiplexer Selection”
• “HDL Instantiation”
• “Application Examples”
• “Rounding Applications”
Multiplexer Selection
Mode settings are used to select the different multiplexer outputs in the DSP48E slice.
These settings designate which input signals go into the embedded multiplier and the
adder/subtracter/logic unit in the DSP48E slice. The A_INPUT and B_INPUT attributes
select between the fabric or an adjacent DSP48E slice as the source of the A and B input
signals. The outputs for the X, Y, and Z multiplexer are selected by the OPMODE settings.
The ALUMODE settings select the function of the adder/subtracter/logic unit.
CARRYINSEL selects the output of the carry-input multiplexer that feeds the second-stage
adder/subtracter. The OPMODE, ALUMODE, and CARRYINSEL settings and the logic
they implement are listed in Table 1-2. Refer to Table 1-8 for details on the attribute settings
and the default values.
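As an aid for reading the OPMODE values in the figures of this chapter, the following Python sketch splits a 7-bit OPMODE into its Z, Y, and X multiplexer fields. The encodings in the dictionaries are assumptions recalled from the OPMODE tables in Chapter 1 and are not reproduced in this section; they should be checked against those tables before use. The function name is hypothetical.

```python
# Assumed multiplexer encodings (not listed in this section; verify against Chapter 1).
X_MUX = {"00": "0", "01": "M", "10": "P", "11": "A:B"}
Y_MUX = {"00": "0", "01": "M", "10": "all ones", "11": "C"}
Z_MUX = {
    "000": "0", "001": "PCIN", "010": "P", "011": "C",
    "100": "P (MACC extend)", "101": "PCIN >> 17", "110": "P >> 17",
}

def decode_opmode(opmode):
    """Split a 7-character OPMODE string (bit 6 first) into Z, Y, and X selections."""
    if len(opmode) != 7 or set(opmode) - {"0", "1"}:
        raise ValueError("OPMODE must be a 7-bit binary string")
    z_sel, y_sel, x_sel = opmode[0:3], opmode[3:5], opmode[5:7]
    return {
        "Z": Z_MUX.get(z_sel, "reserved"),
        "Y": Y_MUX[y_sel],
        "X": X_MUX[x_sel],
    }
```

For example, under these assumed encodings OPMODE 0110011 decodes to Z = C and X = A:B, and OPMODE 0000101 decodes to X = Y = M, the multiplier output.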
HDL Instantiation
Several tools are available to assist in creating a DSP48E design. These tools are described
in “DSP48E Software Tools,” page 107. A DSP48E design can also be built by instantiating
the DSP48E slice in HDL. The Verilog and VHDL instantiation for the DSP48E slice is also
described in Chapter 5, “DSP48E Software and Tool Support Overview.”
Application Examples
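[Figure: Logic unit (LU) implementation. DSP48E_0 takes A (30 bits) and B (18 bits) as A:B
on the X multiplexer and C (48 bits) on the Z multiplexer, and produces the 48-bit output P.
OPMODE 0110011, ALUMODE as needed.]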
The functions implemented by the different settings of the ALUMODE bits are shown in
Table 1-12.
Dynamic Shifter
A pipelined full-speed 25 x 18 multiply with one of the inputs connected to 2^K can shift the
other input to the left by K bits. In the process, the output is also shifted right by 18 – K bits.
The input value to be shifted, A, can be up to 25 bits wide. However, the shift value, K, can
be a maximum of up to 17 bits.
If A is an 18-bit number, a left shift is implemented in the lower 18 bits of the result P[17:0].
These 18 bits represent the A input shifted left by K bits. Similarly, if A is an 18-bit number,
the output bits P[35:18] represent the input A shifted right by 18 – K bits and
sign-extended.
Note: The ALUMODE should have two registers to match the pipeline of the multiplier
inputs.
This dynamic shifter is also known as an arithmetic shifter because of the sign extension on
the right shift. The behavioral code for the multiplier can be used to implement the
dynamic shifter. The multiplier inputs are connected to the value to be shifted and to the
shift amount.
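The shifter arithmetic can be sketched behaviorally in Python for the 18-bit case. This is a minimal model, not a description of the hardware pipeline. It assumes K is limited to 0 through 16 so that 2^K stays positive in a signed 18-bit B input; that limit is a simplification made for this sketch. The helper names are hypothetical.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def dynamic_shift(a, k):
    """Model an 18-bit A multiplied by 2**k; return (P[17:0], signed P[35:18])."""
    if not 0 <= k <= 16:
        raise ValueError("this sketch keeps 2**k positive in an 18-bit signed B")
    p = to_signed(a, 18) * (1 << k)      # multiplier output P
    left = p & 0x3FFFF                   # P[17:0]: A shifted left by k
    right = to_signed(p >> 18, 18)       # P[35:18]: A shifted right by 18 - k
    return left, right
```

With a = 5 and k = 3, the lower field holds 40 (5 shifted left by 3) and the upper field holds 0 (5 shifted right by 15). For a negative input, the upper field is sign extended.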
Swapping the bit order of the A[24:0] and P[24:0] yields a right-shifted A input without
sign extension. In other words, if the shifter input is S[24:0], a right shift of S by K bits
without sign-extension can be implemented by connecting S[0:24] to A[24:0]. The shifted
output is then S_OUT[0:24] = P[24:0]. This bit swap is not required if the shifter input
S[L-1:0] has a size L that is less than 25 bits. In this case, zero-padding the A input MSBs
instead of sign extending produces a right shifted value on the output P[L-1:0] with zeros
inserted on the left. This is known as a logical shift.
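[Figure: Dynamic shifter in one slice. DSP48E_0 multiplies A (25 bits, sign extended to
A[24]) by B = 2^K (18 bits, generated from K in the fabric); B[17] drives ALUMODE[1:0].
OPMODE 0000101, ALUMODE 0000/0011.]
[Figure: Dynamic shifter in two slices. DSP48E_0 (OPMODE 0000101, ALUMODE 0000/0011) and
DSP48E_1 (OPMODE 1010101, ALUMODE 0000/0011) share B = 2^K through BCIN, with B[17] driving
ALUMODE[1:0]. The A inputs are A[17:0] and {0, A[17:1]}, the slices are linked through the
17-bit shift path, and the result appears on P[17:0].]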
48-Bit Counter
A counter is one of the most commonly used functions in any digital application and is often
used in the control logic. In many cases, the counter in a system has to run much faster
than the rest of the system. The DSP48E slice can be used as a high-speed counter, with a
performance of over 550 MHz. A counter implemented using the DSP48E slice can achieve
high performance with the least amount of resources and power. The DSP48E slice can
implement any of these counters: up counter, down counter, or loadable counter.
Figure 4-4 shows the implementation of a 48-bit counter using the DSP48E slice. The
output is equal to P + C + CARRYIN. For a simple binary counter, CARRYIN is set to 1, and
C is set to 0. For a count-by-N counter, the CARRYIN is set to 0, and C is set to N.
[Figure 4-4: 48-bit counter. DSP48E_0 with C = N (48 bits); OPMODE 0001110, ALUMODE 0000.]
The pattern detector can be used to reset the counter without external fabric. This feature,
Auto Reset, is described in detail in the “Pattern Detect Applications” section.
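The counter behavior can be sketched as a small Python model of the update P = P + C + CARRYIN, wrapped to 48 bits. The auto-reset handling shown here is a simplified assumption (the value following a pattern match is zero) and does not model the exact register timing of the pattern detector. The function names are hypothetical.

```python
MASK48 = (1 << 48) - 1

def counter_step(p, c=0, carryin=1, reset_pattern=None):
    """One clock of the counter: next P = P + C + CARRYIN, wrapped to 48 bits."""
    # Assumed auto-reset behavior: when P matches the pattern, the next value is 0.
    if reset_pattern is not None and p == reset_pattern:
        return 0
    return (p + c + carryin) & MASK48

def run_counter(cycles, **kwargs):
    """Return the sequence of P values over the given number of clocks, starting at 0."""
    p, seen = 0, []
    for _ in range(cycles):
        seen.append(p)
        p = counter_step(p, **kwargs)
    return seen
```

Calling run_counter(4) models the simple binary counter (CARRYIN = 1, C = 0), and run_counter(4, c=5, carryin=0) models a count-by-5 counter.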
[Figure: SIMD adder. DSP48E_0 takes A (30 bits) and B (18 bits) as A:B and C (48 bits), and
produces the 48-bit output P. OPMODE 0110011, ALUMODE as needed. Sign extend each small
adder at bits [11, 23, 35, 47].]
[Figure residue: the sign bit of IN1 - IN2 is used in the fabric.]
Figure 4-6: Absolute Value Calculation Using DSP48E Slice in SIMD Mode
The absolute value of two 12-bit numbers can also be determined by using the same
process used in the 24-bit number case. The DSP48E slice in the 12-bit case has four
outputs. Two separate multiplexers are used in the fabric to get the final two absolute value
results.
Bus Multiplexer
Wide multiplexers are used in many network switching applications where voice and data
need to be multiplexed. Video functions, such as scanners, video routers, and pixel-in-pixel
switching, require high-speed switching with multiplexers. In digital signal processing, a
multiplexer is used to take several separate data streams and combine them into one single
data stream at a higher rate. The DSP48E slice can be used to implement wide (up to
48 bits) multiplexers for networking and video applications.
The OPMODE bits are used to choose between the C input and the A:B input within the
DSP48E slice. Each slice can multiplex between two 48-bit values. Figure 4-7 shows the
implementation of a 48-bit-wide 8:1 multiplexer. Additional pipeline registers can be used
to increase the performance of the multiplexer.
[Figure 4-7: 48-bit-wide 8:1 multiplexer built from four DSP48E slices. Each slice receives
one input on A:B (in0, in2, in4, in6) and one input on C (in1, in3, in5, in7) and multiplexes
them onto its 48-bit output; the final result appears on P.]
The ALUMODE is set to 0000. Different OPMODE settings can be used for each of the four
DSP48E slices. Table 4-1 lists one way of setting the OPMODEs to implement the 48-bit
multiplexer.
If the speed of the multiplexer is critical, more pipeline registers can be added. The
implementation in Figure 4-7 can be easily extended to perform N:1 multiplexing.
[Figure: 25 x 18 multiplier. DSP48E_0 multiplies A (25 bits) by B (18 bits) to produce M and
the 48-bit output P. OPMODE 0000101, ALUMODE 0000. Sign extend to A[24] and B[17].]
Sign extension is very important when implementing a multiplier with bit widths less than
25 x 18. When the bit width is less than 25 x 18, the designer must sign-extend the A input
all the way up to A[24] and the B input all the way up to B[17]. By setting the sign bits A[24]
and B[17] to 0, the DSP48E slice can be used to emulate a 24 x 17 unsigned multiplication.
For operands less than 25 x 18, fabric power can be reduced by placing operands into the
MSBs and zero padding unused LSBs.
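The sign-extension rule can be illustrated with two small Python helpers. They are a sketch of the bit manipulation only; the names and the example widths are hypothetical.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a from_bits-wide two's-complement field to to_bits bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):                       # sign bit is set
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value

def zero_extend(value, from_bits):
    """Zero-pad an unsigned field; all bits above from_bits, including the sign bit, are 0."""
    return value & ((1 << from_bits) - 1)
```

For a 16-bit signed operand on the A input, sign_extend(x, 16, 25) fills A[24:16] with copies of bit 15. For the unsigned 24 x 17 case, zero_extend leaves A[24] and B[17] at 0.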
[Figure: Two-input 48-bit adder. DSP48E_0 adds A:B (A 30 bits, B 18 bits) and C (48 bits) to
produce SUM on P. OPMODE 0001111, ALUMODE 0000. Sign extend to A[29] and C[47].]
[Figure: Cascaded adder using two slices. DSP48E_1 (OPMODE 0011111, ALUMODE 0000) and
DSP48E_0 (OPMODE 0001111, ALUMODE 0000), each with A:B and C inputs, produce SUM.
Sign extend to A[29] and C[47].]
[Figure: Add/subtract in one slice. DSP48E_0 with A:B and C inputs. OPMODE 0001111 with
ALUMODE 0000 (for add), or OPMODE 0110011 with ALUMODE 0000 or ALUMODE 0011 (for subtract).
Sign extend to A[29] and C[47].]
[Figure: Add/subtract with cascade input. DSP48E_0 with A:B, C, and PCIN inputs.
OPMODE 0011111, ALUMODE 0000 (for add) or ALUMODE 0011 (for subtract). Sign extend to A[29]
and C[47].]
Because the A and B inputs are registered twice (AREG/BREG and MREG), two registers
should be used on the C input. One of the C input registers should be implemented in the
fabric, external to the DSP48E slice.
[Figure: Multiply add/subtract. DSP48E_0 multiplies A (25 bits) by B (18 bits) and combines
the product with C (48 bits). OPMODE 0110101, ALUMODE 0000 (for add) or ALUMODE 0011 (for
subtract). Sign extend to A[24], B[17], and C[47].]
Extended Multiply
The 25 x 18 multiply can be extended to a 26 x 18 or 25 x 19 multiply if needed. The
extended multiply function uses one DSP48E slice output concatenated with a one-bit
AND function output in the fabric. This implementation is expressed in Equation 4-3, and
the equation can be implemented as shown in Figure 4-14.
A[25:0] × B[17:0] =
{ ( ( A[25:1] × B[17:0] ) + ( A[0] AND B[17:1] ) ), ( A[0] AND B[0] ) }    (Equation 4-3)
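The decomposition in Equation 4-3 can be checked numerically. The Python sketch below is a behavioral model under the assumption that both operands are signed two's-complement values and that B[17:1] is sign extended before it is added; the function name is hypothetical.

```python
def extended_multiply(a, b):
    """26 x 18 signed multiply built from a 25 x 18 multiply plus fabric logic."""
    a_hi = a >> 1                 # A[25:1], the 25-bit multiplier operand
    a0 = a & 1                    # A[0]
    c = (b >> 1) if a0 else 0     # A[0] AND sign-extended B[17:1], applied to C
    upper = a_hi * b + c          # DSP48E output: multiply plus C
    lsb = a0 & b & 1              # A[0] AND B[0], computed in the fabric
    return (upper << 1) | lsb     # concatenate {upper, lsb}
```

The result matches a direct multiplication because A = 2 × A[25:1] + A[0], so the A[0] × B term is split into its upper bits (added through C) and its LSB (computed in the fabric).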
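[Figure 4-14: Extended multiply. DSP48E_0 multiplies A[25:1] (on A, 25 bits) by B[17:0] (on
B, 18 bits) and adds the sign-extended B[17:1] applied to C through an external fabric
register, which is reset if A[0] is Low. P[43:0] drives OUT[44:1]; OUT[0] is A[0] AND B[0],
computed in the fabric. OPMODE 0110101, ALUMODE 0000. Sign extend to C[47].]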
The largest multiplier that can be implemented in two DSP48E slices is a 35 x 25 multiplier.
For the 35 x 25 bit multiplier, the A input of the first DSP48E slice is connected to A[24:0],
and the B input is connected to {0,B[16:0]}. The A input of the second DSP48E slice is
connected to A[24:0] through the cascaded path. The B input is connected to B[34:17].
The output of the first DSP48E slice is connected to the adder of the second DSP48E slice
through the 17-bit shift PCIN path. The first DSP48E slice produces the 17 LSBs of the final
product. Bits [59:17] of the final product are obtained at the output of the second DSP48E
slice. Larger multipliers can be implemented using a single DSP48E slice taking multiple
cycles, or they can be implemented in multiple slices taking a single cycle. See Table 4-2.
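The two-slice partial product scheme can be sketched as follows. This Python model assumes signed two's-complement operands, with the low 17 bits of B treated as unsigned ({0, B[16:0]}) and the upper 18 bits as signed. It models the arithmetic only, not the pipeline latency; the function name is hypothetical.

```python
def multiply_35x25(a, b):
    """Signed A[24:0] x B[34:0] from two 25 x 18 multiplies in cascaded slices."""
    b_lo = b & 0x1FFFF            # {0, B[16:0]}: unsigned low 17 bits
    b_hi = b >> 17                # B[34:17]: signed high 18 bits
    p1 = a * b_lo                 # first slice
    p2 = a * b_hi + (p1 >> 17)    # second slice: multiply plus 17-bit shifted PCIN
    return (p2 << 17) | (p1 & 0x1FFFF)   # {P2 as bits [59:17], P1[16:0]}
```

The first product supplies the 17 LSBs of the result, and the second product plus the shifted first product supplies bits [59:17].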
Table 4-2: Utilization and Latency Table for Different Sized Multipliers
Multiply Size        Number of DSP48E Slices   Latency   Notes
A[24:0] x B[17:0]    1                         3         DSP1 = 25 x 18
A[24:0] x B[34:0]    2                         4         DSP1 = (A[24:0] x (0, B[16:0]))
                                                         DSP2 = (A[24:0] x B[34:17]) + PCIN shift17 of DSP1
A[41:0] x B[34:0]    4                         6         DSP1 = (0, A[16:0]) x (0, B[16:0])
                                                         DSP2 = (A[41:17] x (0, B[16:0])) + PCIN shift17 of DSP1
                                                         DSP3 = ((0, A[16:0]) x B[34:17]) + PCIN of DSP2
                                                         DSP4 = (A[41:17] x B[34:17]) + PCIN shift17 of DSP3
[Figure: Lower portion of a multi-slice pipelined multiplier of operands R and S. Operand
fields (R[16:0], R[33:17], R[58:34], S[16:0]) reach the A and B inputs through SRL16 delays.
The first slice uses OPMODE 0000101 and later slices use OPMODE 0010101. Outputs include
PROD[16:0] and the cascade output PCOUT4[47:0].]
[Figure: Upper portion of the same multiplier, fed by PCOUT4[47:0]. Operand fields
(R[16:0], R[33:17], R[58:34], S[33:17], S[50:34], S[58:34], sign-extended S[58:51]) reach the
A and B inputs through SRL16 delays. Slices use OPMODE 0010101. Outputs are PROD[50:34],
PROD[67:51], PROD[84:68], and P[32:0] = PROD[117:85].]
[Figure: Cascaded multiply add. DSP48E_0 adds the product of A (25 bits) and B (18 bits) to
PCIN. OPMODE 0010101, ALUMODE 0000. Sign extend to A[24] and B[17].]
Division
One application of the multiply sub function (see Figure 4-13) is in implementing a
division. For an N-bit divide (N is the numerator bit width) where the numerator is greater
than the denominator, the quotient can be calculated in N cycles.
( X ⁄ Y = Q + R ) or X = Y(Q + R) Equation 4-4
The numerator, “X,” is applied to the C input, and the denominator, “Y,” is applied to the
A input. The cycle number is denoted by “n.” In each cycle, bit B[N–n] in the B input
of the DSP48E slice is set to 1; in the first cycle, the remaining bits are set to 0. The
OPMODE is set to calculate X – (Y x B input). If the output MSB P[47] is 1, register bit
Q[N–n] = 0, and vice versa. In the following cycle, if the previous result was positive
(P[47] = 0), bit B[N–n] is set to 1, and the bit to its left retains the 1. If the previous
result was negative (P[47] = 1), B[N–n] is set to 1, and the bit to its left is reset to 0.
The DSP48E slice implements the multiply-subtract cycle again. This process is continued for
N cycles. After the Nth cycle, register Q contains the quotient, and the output register P
contains the remainder.
Refer to the “Basic Math Functions” section in XtremeDSP for Virtex-4 FPGAs User Guide
(UG073) for more details on the Divide implementation.
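The trial-and-correct loop described above can be sketched in Python. This is a simplified behavioral model for unsigned operands: it keeps the trial quotient in a variable standing in for the B input and computes the remainder explicitly as X – Y × Q at the end, which is an assumption about how the final P value is interpreted. The function name is hypothetical.

```python
def divide(x, y, n_bits):
    """Unsigned n_bits-cycle divide; returns (quotient, remainder)."""
    if y <= 0 or x < 0:
        raise ValueError("this sketch handles a non-negative x and a positive y")
    b = 0
    for n in range(1, n_bits + 1):
        trial = b | (1 << (n_bits - n))   # set bit B[N-n] for this cycle
        p = x - y * trial                 # multiply-subtract: X - (Y x B)
        if p >= 0:                        # P[47] = 0: keep the trial bit
            b = trial
        # P[47] = 1: the trial bit is dropped before the next cycle
    return b, x - y * b
```

For example, divide(100, 7, 8) walks through eight trial bits and returns a quotient of 14 with a remainder of 2.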
[Figure: Accumulator using the C input. DSP48E_0 adds C (48 bits) to the 48-bit P feedback.
OPMODE 0001110, ALUMODE 0000. Sign extend to C[47].]
[Figure: Add accumulator. DSP48E_0 adds A:B (A 30 bits, B 18 bits) and C (48 bits) to the
48-bit P feedback. OPMODE 0101111, ALUMODE 0000. Sign extend to A[29] and C[47].]
The addition and subtraction can be dynamically changed by toggling the ALUMODE bits
between 0000 and 0011.
The 48-bit dynamic add/sub accumulator is shown in Figure 4-20.
[Figure 4-20: 48-bit dynamic add/sub accumulator. DSP48E_0 combines A:B (A 30 bits, B 18
bits) and C (48 bits) with the 48-bit P feedback. OPMODE 0101111, ALUMODE 0000 (for add) or
ALUMODE 0011 (for subtract). Sign extend to A[29] and C[47].]
96-Bit Add/Subtract
The DSP48E slices can be cascaded together to implement a large add/subtract function. In
this case, the CARRYCASCOUT signal is used to cascade the DSP48E slices. Setting the
ALUMODE to 0000 implements adder (C + A:B) functions. A subtract function (C – A:B)
is implemented by setting the ALUMODE to 0011. Figure 4-21 shows the implementation
of a 96-bit add/subtract function.
[Figure 4-21: 96-bit add/subtract. DSP48E_0 produces SUM[47:0] from its A:B and C inputs and
passes CARRYCASCOUT to DSP48E_1, which produces SUM[95:48] from its A:B and C inputs. Both
slices use OPMODE 0110011 with ALUMODE 0000 (for add) or ALUMODE 0011 (for subtract).
Sign extend to A[29].]
When the C input is not used, the example in Figure 4-21 implements a 96-bit accumulate
function: Output = P + A:B + CARRYIN. The OPMODE in this case is set to 0100011 and
the ALUMODE is set to 0000.
A 96-bit accumulate function can also be implemented using the 48-bit C input of both
slices. In this case, the A and B inputs are not used, and the output is equal to
P + C + CARRYIN. The OPMODE is set to 0101111, and the ALUMODE is set to 0000.
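The carry cascade between the two slices can be sketched for the add case. The Python model below splits each 96-bit operand into two 48-bit halves and passes a single carry bit from the lower half to the upper half; it does not model subtraction or the pipeline registers. The function name is hypothetical.

```python
MASK48 = (1 << 48) - 1

def add_96(a, b):
    """Add two 96-bit values as two 48-bit slices linked by a carry."""
    lo = (a & MASK48) + (b & MASK48)                    # lower slice: SUM[47:0] plus carry
    carry = lo >> 48                                    # carry passed to the upper slice
    hi = ((a >> 48) & MASK48) + ((b >> 48) & MASK48) + carry   # upper slice: SUM[95:48]
    return ((hi & MASK48) << 48) | (lo & MASK48)
```

The result equals (a + b) modulo 2^96, with the carry out of the upper slice discarded.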
96-Bit Accumulator
Two DSP48E slices can be cascaded together to implement a 96-bit accumulator.
Here, the CARRYCASCOUT signal is used to cascade the DSP48E slices. The 48 LSB bits of
the result are obtained at the output of the first DSP48E slice, and the 48 MSB bits are
obtained at the output of the second DSP48E slice. Figure 4-22 shows the implementation
of a 96-bit accumulator function using the C input. The output of the first DSP48E slice is
equal to P + C. The OPMODE in this case is set to 0101100, and the ALUMODE is set to
0000. The output of the second DSP48E slice is equal to P + A:B. The OPMODE in this case
is set to 0100011, and the ALUMODE is set to 0000.
[Figure 4-22: 96-bit accumulator. DSP48E_0 accumulates the C input (48 bits) to produce
SUM[47:0] and passes CARRYCASCOUT to DSP48E_1, which accumulates A:B (A 30 bits, B 18 bits)
to produce SUM[95:48]. DSP48E_0 uses OPMODE 0101100 and DSP48E_1 uses OPMODE 0100011; both
use ALUMODE 0000 (for add) or ALUMODE 0011 (for subtract). Sign extend to A[29].]
Instead of the C input, the A:B input with one pipeline register can also be used to
implement the output in the first DSP48E slice. The second DSP48E slice should, however,
use the A:B input with two pipeline registers for alignment with the CARRYCASCOUT
signal.
the lower DSP48E slice must be switched. The OPMODE and CARRYINSEL must be
switched as shown in Figure 4-23. If the OPMODE is not switched, the hardware might
produce unpredictable results.
[Figure 4-23 residue: block diagram of an upper and a lower DSP48E slice.
Upper slice: OPMODEREG = 1, CARRYINSELREG = 1, PREG = 1. OPMODE is switched between 1001000
and 0000000 and CARRYINSEL between 010 and 000 through delayed multiplexers; ALUMODE = 0000.
The slice receives CARRYCASCIN and MULTSIGNIN and produces high_out.
Lower slice: OPMODEREG = 1, CARRYINSELREG = 1, AREG = 1, BREG = 1, MREG = 1, PREG = 1.
OPMODE is switched between 0100101 and 0000000; ALUMODE = 0000. The slice produces
CARRYCASCOUT, MULTSIGNOUT, and low_out.]
Figure 4-23: MACC Implementation and MACC Extension Using a DSP48E Slice
This special OPMODE extends only the MACC functions. It cannot be used to extend three
input adders or add-accumulate functions. Extending three input add-accumulate
functions requires a 4-input adder for the two CARRYCASCOUT bits from the lower
DSP48E slice and is therefore not supported. For a smaller multiplier, such as a 20 x 18 or
18 x 18, the lower DSP48E slice itself can provide sufficient guard bits to prevent overflow.
See “MULTSIGNOUT and CARRYCASCOUT,” page 113 for design considerations.
25 x 18 Complex Multiply
A 25 x 18 complex multiply function is implemented using the DSP48E slices in
Figure 4-24. A complex multiply function is:
( A + ja ) × ( B + jb ) = ( AB – ab ) + j ( Ab + Ba )    (Equation 4-6)
Two DSP48E slices implement the real part and two slices are used to implement the
imaginary part. The real and imaginary results use the same slice configuration with the
exception of the adder/subtracter. The A and B input register pipe stages are matched
between the slices for the real and imaginary parts. A similar technique can be used to
implement a butterfly computation in the FFT algorithm.
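The mapping of a complex multiply onto four real multiplies can be sketched in Python. The model below computes the real part as AB – ab and the imaginary part as Ab + Ba, which are the two sums that the slice pairs form; it is an arithmetic illustration only, and the function name is hypothetical.

```python
def complex_multiply(A, a, B, b):
    """(A + ja) x (B + jb) using four real multiplies, two per output."""
    real = A * B - a * b      # two slices; the second one subtracts
    imag = A * b + B * a      # two slices; the second one adds
    return real, imag
```

For example, (3 + j4) × (5 – j2) gives a real part of 23 and an imaginary part of 14.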
[Figure 4-24: 25 x 18 complex multiply using four DSP48E slices. Two slices multiply
a[24:0] by B[17:0] and A[24:0] by b[17:0] to produce IMAGINARY_OUT[43:0]. Two slices multiply
a[24:0] by b[17:0] and A[24:0] by B[17:0] to produce REAL_OUT[43:0].]
35 x 25 Complex Multiply
Many complex multiply algorithms require higher precision in one of the operands. The
equations for combining the real and imaginary parts in complex multiplication are the
same, but the larger operands must be separated into two parts and combined using
partial product techniques. The real and imaginary results use the same slice configuration
with the exception of the adder/subtracter. The adder/subtracter performs subtraction for
the real result and addition for the imaginary result. Equation 4-7 and Equation 4-8
describe the math used to form the real and imaginary parts for the fully pipelined,
complex, 35-bit x 25-bit multiplication.
( A[24:0] + ja[24:0] ) × ( B[34:0] + jb[34:0] )    (Equation 4-7)
REAL_OUT = ( A[24:0] × B[34:17] ) + SHIFT17 { A[24:0] × ( 0, B[16:0] ) }
           – ( a[24:0] × b[34:17] ) – SHIFT17 { a[24:0] × ( 0, b[16:0] ) }    (Equation 4-8)
Figure 4-25 shows the real part of a fully pipelined, complex, 35 x 25 multiplier.
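[Figure residue: four cascaded slices (0 through 3). Slice 0 multiplies a[24:0] by
{0, b[16:0]}. Slice 1 multiplies A[24:0] by {0, B[16:0]} and produces REAL_OUT[16:0].
Slice 2 multiplies the cascaded A input (ACIN) by B[34:17] after a 17-bit shift of the
cascaded result. Slice 3 multiplies a[24:0] by b[34:17] and produces REAL_OUT[59:17].
SRL16s delay the inputs for pipeline alignment.]
Figure 4-25: Real Part of a Fully Pipelined, Complex, 35-Bit x 25-Bit Multiplier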
Figure 4-26 shows the imaginary part of a fully pipelined, complex, 35 x 25 multiplier.
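[Figure residue: four cascaded slices (0 through 3). Slice 0 multiplies a[24:0] by
{0, B[16:0]}. Slice 1 multiplies A[24:0] by {0, b[16:0]} and produces IMAGINARY_OUT[16:0].
Slice 2 multiplies the cascaded A input (ACIN) by b[34:17] after a 17-bit shift of the
cascaded result. Slice 3 multiplies a[24:0] by B[34:17] and produces IMAGINARY_OUT[59:17].
SRL16s delay the inputs for pipeline alignment.]
Figure 4-26: Imaginary Part of a Fully Pipelined, Complex, 35-Bit x 25-Bit Multiplier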
25 x 18 Complex MACC
The implementation of a complex MACC using four DSP48E slices with dynamic
OPMODE settings is shown in Figure 4-27 and Figure 4-28. Equation 4-9 illustrates the
DSP48E implementation and OPMODE settings for the N-1 cycles for an N state MACC
operation. The OPMODE is changed in the last cycle to implement the Nth state. The
addition and subtraction of the terms only occur after the desired number of MACC
operations. For N cycles:
Slice 1 = ( A × b ) accumulation
Slice 2 = ( a × B ) accumulation
Slice 3 = ( A × B ) accumulation
Slice 4 = ( a × b ) accumulation    (Equation 4-10)
[Figure residue: four slices (0 through 3), each with a 48-bit accumulator. Slice 0
multiplies A[24:0] by b[17:0], slice 1 multiplies a[24:0] by B[17:0], slice 2 multiplies
A[24:0] by B[17:0], and slice 3 multiplies a[24:0] by b[17:0].]
Figure 4-27: N–1 Cycles of an N-Cycle Complex MACC Implementation Using Dynamic OPMODE
[Figure residue: in the last cycle, the accumulated terms are combined with a rounding
constant on the C input (48 bits) to produce IMAGINARY_OUT[47:0] and REAL_OUT[47:0].]
Figure 4-28: Nth Cycle of an N-Cycle Complex MACC Implementation Using Dynamic OPMODE
During the last cycle, the input data must stall while the final terms are added. To avoid
having to stall the data, the complex multiply implementation, shown in Figure 4-29,
should be used (instead of using the complex multiply implementation shown in
Figure 4-27 and Figure 4-28).
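[Figure 4-29: Complex MACC without input stall. For the imaginary part, two slices multiply
a[24:0] by B[17:0] and A[24:0] by b[17:0], and a further slice adds a rounding constant on
the C input (48 bits) to produce IMAGINARY_OUT[47:0]. For the real part, two slices multiply
a[24:0] by b[17:0] and A[24:0] by B[17:0], and a further slice adds a rounding constant on
the C input to produce REAL_OUT[47:0].]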
Filters
A wide variety of filter architectures can be implemented efficiently in the DSP48E slice.
The architecture chosen depends on the amount of processing required and the clock
cycles available. The MACC Filter, Parallel Filter, and Semi-Parallel Filter structures,
including their advantages and disadvantages are described in detail in the XtremeDSP for
Virtex-4 FPGAs User Guide (UG073). See chapters 3, 4, and 5.
The 16 coefficients are cycled through the DSP48E slice (see Figure 4-30). These coefficients
can be stored in shift registers (SRL16). The interpolator can also be implemented using
one DSP48E slice if a higher clock latency or a slower clock is acceptable.
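[Figure 4-30: Polyphase interpolating filter using four DSP48E slices (0 through 3). The
input sample enters the A input (25 bits) of slice 0 and is cascaded to the other slices
through ACIN. Each slice takes its B input (18 bits) from an SRL16 holding four of the 16
coefficients: h(0) to h(3), h(4) to h(7), h(8) to h(11), and h(12) to h(15). The 43-bit
products are summed along the P cascade to produce the output y.]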
Included in the reference design files for this chapter is PolyIntrpFilter.zip, which
provides examples of portable, parameterized, design, and simulation VHDL files that
infer DSP48E slices when creating Polyphase Interpolating FIR filters in Virtex®-5 devices.
The number of filter taps, interpolation factors, and data bit widths are parameterizable.
Synplify 8.1 was used to synthesize this portable, RTL VHDL code with generics for
parameterization. The reference design files associated with this user guide can be found
on the Virtex-5 FPGA page on xilinx.com.
For the 16-tap filter, the output sample (y) is the weighted average of 16 input samples (x)
multiplied by 16 coefficients (h) as described in Equation 4-13.
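[Figure 4-31: Polyphase decimating filter using five DSP48E slices (0 through 4). The input
sample enters slice 0 and reaches each following slice through an M+1 FIFO on the A input
(25 bits). Slices 0 through 3 take their B inputs (18 bits) from SRL16s holding four of the
16 coefficients each: h(0) to h(3), h(4) to h(7), h(8) to h(11), and h(12) to h(15). The
43-bit products are summed along the P cascade, and slice 4 accumulates the result, with a
clock enable at 1/M, to produce the output y.]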
The input signals to each DSP48E slice are delayed by (M+1) clocks from the previous slice.
Shift registers are used to achieve this delay. After an initial latency, the four tap filter
output is obtained at the fourth DSP48E slice. This output is accumulated with the
previous values in the final DSP48E slice. This final slice is in the accumulation mode for
four cycles. This filter structure (see Figure 4-31) can also be used as a folded single-rate
16-tap filter.
Included in the reference design files for this chapter is PolyDecFilter.zip, which
provides examples of portable, parameterized, design, and simulation VHDL files that
infer DSP48E slices when creating Polyphase Decimating FIR filters in Virtex-5 devices.
The number of filter taps, decimation factors, and data bit widths are parameterizable.
Synplify 8.1 was used to synthesize this portable, RTL VHDL code with generics for
parameterization. The reference design files associated with this user guide can be found
on the Virtex-5 FPGA page on xilinx.com.
Multichannel FIR
Multichannel filtering is used in applications like wireless communication, image
processing, and multimedia applications. In a typical multichannel filtering scenario,
multiple input channels are filtered using a separate digital filter for each channel. Due to
the high performance of the DSP48E slice, time division multiplexing can be used to filter
up to N separate channels using one DSP48E slice. The number of channels N is calculated
by:
N x channel frequency ≤ maximum frequency of the DSP48E slice
Each DSP48E slice is clocked using an Nx clock. The N input streams are converted to one
parallel, interleaved stream using an N-to-one multiplexer. The multiplexer is designed so
that the maximum delay is equal to one LUT delay. This ensures the best performance. The
N parallel interleaved streams are stored in an N+1 FIFO. This implementation uses 1/Nth
of the total FPGA resource as compared to implementing each channel separately.
Figure 4-32 shows a four-tap, four-channel filter implementation. Refer to the XtremeDSP
for Virtex-4 FPGAs User Guide, Chapter 6, for a detailed multichannel filter application note.
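The channel-count relationship and the input interleaving can be sketched in Python. Both functions are illustrative: max_channels applies the inequality above with integer frequencies in Hz, and interleave shows the order in which the N-to-one multiplexer would present samples to the filter. The function names and the example frequencies are hypothetical.

```python
def max_channels(slice_fmax_hz, channel_rate_hz):
    """Largest N with N x channel frequency <= maximum slice frequency."""
    return int(slice_fmax_hz // channel_rate_hz)

def interleave(channels):
    """Merge N equal-length sample streams into one stream at N times the rate."""
    return [sample for frame in zip(*channels) for sample in frame]
```

For example, with an assumed slice clock of 500 MHz and a channel rate of 62.5 MHz, max_channels returns 8, which matches the eight-channel case with an 8x clock.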
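[Figure 4-32: Four-tap, four-channel filter using four DSP48E slices (0 through 3). The
input channels X through Xm are merged by a MUX into slice 0, and the samples reach each
following slice through an N+1 FIFO on the A input (25 bits). The B inputs (18 bits) carry
the coefficients H0 through H3. The 43-bit products are summed along the P cascade, and a
DEMUX separates the outputs Y0, Y1, and so on.]
[Figure: Coefficient preloading using four DSP48E slices (0 through 3). The coefficient h
enters the B input (18 bits) of slice 0 and is passed to the other slices through BCIN, with
CEB1 and CEB2 controlling the B registers. Each slice multiplies the input x on A (25 bits);
the final output is y3[n-6].]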
Preloading the DSP48E slices as described above can reduce the register utilization in the
fabric.
The use of the pattern detector leads to a moderate speed reduction on the pattern detect
path. The PD output comes out on the same clock cycle as the DSP48E slice output P.
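[Figure: Pattern detect on a multiply. DSP48E_0 multiplies A (25 bits) by B (18 bits) and
compares the output P with the C input (48 bits) to drive PATTERNDETECT. OPMODE 0000101,
ALUMODE 0000, SEL_PATTERN = "C", MASK = 48'h000000000000.]
[Figure: Pattern detect on a multiply accumulate. DSP48E_0 multiplies A (25 bits) by
B (18 bits) and drives PATTERNDETECT/OVERFLOW/UNDERFLOW. OPMODE 0100101, ALUMODE 0000.]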
The C input can feed into the pattern detector. When the C input is fed into the pattern
detector, any function between the Z multiplexer output and the X multiplexer output can
be compared with the dynamic C input value.
With C input connected to the pattern detector, PCIN XNOR A:B == C is implemented, as
in Figure 4-36.
[Figure residue: DSP48E_0 performs PCIN XNOR A:B (A 30 bits, B 18 bits) and compares the
48-bit result P with C (48 bits) to drive PD. OPMODE 0010011, ALUMODE 0101.]
Figure 4-36: PCIN fxn A:B ==C Implementation Using Pattern Detector
If the DSP48E slice is configured to do a subtraction and the pattern detector is used, then
C > A:B and C == A:B can be simultaneously detected. The sign bit of the P output
indicates whether A:B is greater than or less than C. The PD output indicates whether
A:B – C == 0. See Figure 4-37.
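The combined compare can be sketched in Python. The model assumes that the slice computes C – A:B in 48 bits, that a mask bit of 1 excludes the corresponding bit from the comparison, and that the pattern is all zeros; these settings are assumptions made for this sketch. The sign bit is meaningful only when the subtraction does not overflow 48 bits. The function names are hypothetical.

```python
MASK48 = (1 << 48) - 1

def pattern_detect(p, pattern, mask):
    """True when P equals the pattern on every bit whose mask bit is 0."""
    return ((p ^ pattern) & ~mask & MASK48) == 0

def compare(ab, c):
    """Subtract A:B from C in 48 bits; return (sign bit of P, pattern detect)."""
    p = (c - ab) & MASK48
    sign = p >> 47                       # 1 when C - A:B is negative
    equal = pattern_detect(p, 0, 0)      # all-zero pattern, no bits masked
    return sign, equal
```

A result of (0, True) indicates equality, (1, False) indicates that A:B is greater than C, and (0, False) indicates that A:B is less than C.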
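[Figure 4-37: Compare using subtraction and the pattern detector. DSP48E_0 subtracts using
A:B (A 30 bits, B 18 bits) and C (48 bits) and drives PATTERNDETECT from the 48-bit output P.
OPMODE 0110011, ALUMODE 0011.]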
The pattern detector can be used to check overflow and underflow conditions for 12-bit
and 24-bit adders in the SIMD mode. If the logic unit is used in a four 12-bit SIMD mode,
a pattern such as 00x…x00x….x00x….x00x….x can detect if any of the 12-bit adders has
overflowed or underflowed beyond its respective 11th bit. The pattern detector in this case
can only identify if any one of the adders caused an overflow/underflow flag. It cannot
identify which one of the four adders caused this flag.
Control logic in many designs relies on add-compare-select type operations, where
two numbers are added together and then compared to a value to determine if a certain
action needs to be taken. Such functions can be efficiently mapped to the DSP48E adder
and its pattern detector. The add-compare is done in multiple DSP48E slices and the select
function is then done in the fabric (see Figure 4-5, page 63).
[Figure: DSP48E_0 with OPMODE 0001110 and ALUMODE 0000.]
If 48 bits is too large for a counter, then SIMD can be used to duplicate the same 12-bit
counter four times to spread the fanout load. The pattern detector (add compare select)
needs to operate only on one of the 4 identical counters. The auto reset capability can be
used to reset all four counters if the pattern is detected on one of them.
Linked counters can also be packed into the same DSP48E slice. For example, a design can
have one counter that adds by one and another counter that adds by three. The auto reset
feature resets both counters simultaneously if a threshold is reached on one or both
counters. The NOT_MATCH mode of AUTORESET can be used to OR threshold
conditions on both counters and trigger a reset if either counter crosses its limit.
Rounding Applications
Different styles of rounding can be done efficiently in the DSP48E slice. The C port in the
DSP48E slice is used to mark the location of the decimal point. For example, if
C = 000...00111, this indicates that there are four digits after the decimal point. In other
words, the number of continuous ones in the C port bus plus 1 indicates the number of
decimal places in the original number.
The CARRYIN input can be used to determine which rounding technique is implemented.
If CARRYIN is 1, then C + CARRYIN = 0.5 or 0.1000 in binary. If CARRYIN is zero, then
C + CARRYIN = 0.4999... or 0.0111 in binary. Thus, the CARRYIN bit decides if the
number is rounded up or rounded down.
The CARRYIN bit and the C input bus can change dynamically. After the round is
performed by adding C and CARRYIN to the result, the bits to the right of the decimal
point should be discarded.
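The add-then-truncate step described above can be modeled in a few lines of Python. This is a minimal behavioral sketch, not a description of the DSP48E datapath: the function name and the use of an unbounded Python integer in place of the 48-bit P register are assumptions made for illustration.

```python
def round_with_c_port(value, frac_bits, carryin):
    """Round a two's-complement fixed-point integer that has frac_bits
    bits to the right of the decimal point.

    C holds frac_bits - 1 contiguous ones (for example 0b0111 for four
    fraction bits), so C + CARRYIN is either 0.0111 or 0.1000.
    """
    c = (1 << (frac_bits - 1)) - 1
    total = value + c + carryin
    # The arithmetic right shift discards the bits right of the decimal point.
    return total >> frac_bits


# 2.5 with four fraction bits is 0010.1000
x = 0b0010_1000
rounded_up = round_with_c_port(x, 4, carryin=1)    # adds 0.1000
rounded_down = round_with_c_port(x, 4, carryin=0)  # adds 0.0111
```

Only midpoint values are affected by the CARRYIN choice; any other fraction rounds to the same integer in both cases.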
For convergent rounding, the pattern detector can be used to determine whether a
midpoint number is rounded up or down. Truncation is performed after adding C and
CARRYIN to the data.
Rounding Decisions
There are different factors to consider while implementing a rounding function:
• Dynamic or static decimal point
• Symmetric or Random or Convergent
• LSB Correction or Carrybit Correction (if convergent rounding was chosen)
Symmetric Rounding
In symmetric rounding towards infinity, the CARRYIN bit is set to the inverted sign bit (sign bit bar)
of the result. This ensures that midpoint negative and positive numbers are both rounded away
from zero. For example, 2.5 rounds to 3 and -2.5 rounds to -3. In symmetric rounding towards
zero, the CARRYIN bit is set to the sign bit of the result. Positive and negative numbers at
the midpoint are rounded towards zero. For example, 2.5 rounds to 2 and -2.5 rounds to -2.
Although the round towards infinity is the conventional Matlab round, the round
towards zero has the advantage of never causing overflow. Table 4-4 and Table 4-5 show
examples of symmetric rounding.
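As a rough illustration of how the CARRYIN choice follows the sign bit, the sketch below models both symmetric modes. The function name and the `toward_infinity` parameter are hypothetical, and the sign is taken from the unrounded value, which is an assumption about what "the result" refers to.

```python
def symmetric_round(value, frac_bits, toward_infinity=True):
    """Behavioral model of symmetric rounding on a fixed-point integer."""
    c = (1 << (frac_bits - 1)) - 1          # e.g. 0b0111 for four fraction bits
    sign = 1 if value < 0 else 0
    # Towards infinity: CARRYIN = sign bit bar. Towards zero: CARRYIN = sign bit.
    carryin = (1 - sign) if toward_infinity else sign
    return (value + c + carryin) >> frac_bits


pos_mid = 0b0010_1000      #  2.5 with four fraction bits
neg_mid = -0b0010_1000     # -2.5
away = (symmetric_round(pos_mid, 4), symmetric_round(neg_mid, 4))
toward = (symmetric_round(pos_mid, 4, False), symmetric_round(neg_mid, 4, False))
```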
Symmetric rounding towards infinity of a multiply result is built into the DSP48E slice: the
sign bit bar is selected as the carry input by setting CARRYINSEL to 110 (see Table 1-10).
For MACC and add accumulate operations, it is difficult to determine the sign of the
output ahead of time, so the round might cost an extra clock cycle. This extra cycle can be
eliminated by adding the C input on the very first cycle using dynamic OPMODE. The sign
bit of the accumulator in the second-to-last cycle can then be used for the rounding operation
done in the final accumulate cycle. This implementation is a practical way to save a clock
cycle, with one caveat: there is a rare chance that the final accumulate operation flips the
sign of the output relative to the previous accumulated value.
Random Round
In random rounding, the result is rounded up or down. In order to randomize the error due
to rounding, one can dynamically alternate between symmetric rounding towards infinity
and symmetric rounding towards zero by toggling the CARRYIN bit pseudo-randomly. The
CARRYIN bit in this case is a random number. The DSP48E slice adds either 0.4999 or 0.50
to the result before truncation. For example, 2.5 can round to 2 or to 3, randomly.
Repeatability depends on how the pseudo-random number is generated. If the LFSR/seed
is always the same, then results can be repeatable. Otherwise, the results might not be
exactly repeatable.
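The repeatability point can be illustrated with a small model in which CARRYIN is driven by an LFSR. The 16-bit LFSR polynomial and the seed value used here are arbitrary choices for the sketch; the source does not specify how the pseudo-random bit is generated.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def random_round(values, frac_bits, seed=0xACE1):
    """Round each value with CARRYIN taken from the LFSR's low bit."""
    c = (1 << (frac_bits - 1)) - 1
    state = seed
    results = []
    for value in values:
        carryin = state & 1
        results.append((value + c + carryin) >> frac_bits)
        state = lfsr16_step(state)
    return results


midpoints = [0b0010_1000] * 32      # 2.5, thirty-two times
first = random_round(midpoints, 4)
second = random_round(midpoints, 4)  # same seed, so the same sequence
```

With a fixed seed the two runs match exactly, and the midpoint value 2.5 lands on both 2 and 3 over the run.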
Convergent Rounding
In convergent rounding, the final result is rounded to the nearest even number (or odd
number). In conventional implementations, if the midpoint is detected, then the
units-place bit before the round needs to be examined in order to determine whether the
number is going to be rounded up or down. The original number before the round can
change between even/odd from cycle to cycle, so the CARRYIN value cannot be
determined ahead of time.
In convergent rounding towards even, the final result is rounded toward the closest even
number, for example:
2.5 rounds to 2 and -2.5 rounds to -2, but 1.5 rounds to 2 and -1.5 rounds to -2.
In convergent rounding towards odd, the final result is rounded toward the closest odd
number, for example:
2.5 rounds to 3 and -2.5 rounds to -3, but 1.5 rounds to 1 and -1.5 rounds to -1.
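A plain reference model is useful for checking a hardware implementation against the examples above. The sketch below is a conventional software formulation (round half up, then correct the midpoint case). It is not the DSP48E technique itself, and the function name is hypothetical.

```python
def convergent_round(value, frac_bits, to_even=True):
    """Reference model: round to nearest, ties to even (or to odd)."""
    half = 1 << (frac_bits - 1)
    frac = value & ((1 << frac_bits) - 1)      # fraction bits, always non-negative
    rounded = (value + half) >> frac_bits      # round half up
    if frac == half:                           # exact midpoint
        is_odd = rounded & 1
        if to_even and is_odd:
            rounded -= 1
        elif not to_even and not is_odd:
            rounded -= 1
    return rounded


def fixed(x, frac_bits=4):
    """Convert a multiple of 2**-frac_bits to its fixed-point integer."""
    return int(x * (1 << frac_bits))


even_results = [convergent_round(fixed(v), 4) for v in (2.5, -2.5, 1.5, -1.5)]
odd_results = [convergent_round(fixed(v), 4, to_even=False) for v in (2.5, -2.5, 1.5, -1.5)]
```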
The convergent rounding techniques require the use of fabric in addition to the DSP48E
slice. There are two ways of implementing a convergent rounding scheme.
• LSB Correction Technique: In the LSB correction technique, a logic gate is needed in
fabric to compute the final LSB after rounding.
• Carry Correction Technique: In the carry correction technique, an extra bit is produced
by the pattern detector that needs to be added to the truncated output of the DSP48E
slice in order to determine the final rounded number. If a series of computations is
being performed, then this carry bit can be added in a subsequent fabric add or
DSP48E add further along the datapath.
The bits after the multiplier should not be rounded because the DSP48E MACC is 48 bits.
Even though fabric designs round to allow a 36-bit fabric ACC, this style of coding is
inefficient when using the DSP48E slice.
A saturate (SAT) and round (RND) function can be provided by adding a single DSP48E
slice and using the PatternDetector of the preceding DSP48E slice to detect the SAT
condition. The new extra DSP48E slice should then be used to multiplex maximum
positive/negative numbers when a SAT occurs. The extra DSP48E slice PatternDetect
should be used as described in “Convergent Rounding: LSB Correction Technique,” page
101 to perform the RND function. This combination never overflows due to rounding
because the SAT function takes priority over RND.
When either the FIR filter cascade or MACC precision is 36 bits, PatternDetector and
PatternBarDetector can be used instead of the Overflow/Underflow signals. This allows
for the full 48-bit precision to be maintained in the Accumulator/P-cascade with
sign-extension even if only 36-bit precision is used. For 36-bit precision, the user sets the
MASK and PATTERN values as:
MASK = 0x0007 FFFF FFFF
PATTERN = 0x0000 0000 0000
For a 36-bit signed number, the 36th bit (bit 35) is the sign bit, which is sign-extended to a full 48-bit
word. Therefore, PatternDetector (PD) or PatternBarDetector (PBD) should always be asserted
because the upper 13 bits should be all 1s or all 0s. If the upper 13 bits are not all 1s or all
0s, an underflow or overflow condition has occurred. The SAT signal is defined as:
SAT = (~(PatternDetector) AND ~(PatternBarDetector))
If SAT is asserted and P(47)=1, an underflow has occurred. If SAT is asserted and P(47)=0,
an overflow has occurred.
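The SAT detection above can be sketched as follows. The model assumes that a MASK bit of 1 means the corresponding bit is ignored in the comparison, which is consistent with the MASK and PATTERN values given for this example. The function names are hypothetical.

```python
MASK48 = (1 << 48) - 1
MASK = 0x0007_FFFF_FFFF        # ignore the lower 35 bits
PATTERN = 0x0000_0000_0000


def pattern_detect(p, pattern=PATTERN, mask=MASK):
    """Return (PD, PBD) for a 48-bit P value."""
    care = ~mask & MASK48                      # bits that take part in the compare
    pd = ((p ^ pattern) & care) == 0           # P matches PATTERN
    pbd = ((p ^ ~pattern) & care) == 0         # P matches the inverted PATTERN
    return pd, pbd


def classify(p):
    """Classify a 48-bit accumulator value against the 36-bit range."""
    pd, pbd = pattern_detect(p)
    sat = (not pd) and (not pbd)
    if not sat:
        return "ok"
    return "underflow" if (p >> 47) & 1 else "overflow"


def to48(x):
    """Wrap a Python integer into a 48-bit two's-complement word."""
    return x & MASK48
```

For example, `classify(to48(2**35))` reports an overflow because bit 35 is set while bits 47:36 are clear, so the upper 13 bits are neither all 0s nor all 1s.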
P(47:0) must be routed in fabric to the A:B(47:0) input to provide a pipeline stage because
full speed is only possible when the OPMODE and CARRYIN registers are used. The
OPMODE of the SAT/RND DSP48E slice is:
OPMODE(6:0)=(0 SAT SAT 0 0 ~SAT ~SAT)
Thus, the A:B path is selected if there is no SAT condition, while the C port is selected if
there is a SAT condition. The C port provides the maximum positive number for
saturation.
C = 0x0007 FFFF FFFF
CarryIn = (~(PatternDetector) AND ~(PatternBarDetector)) AND P(47)
Therefore, the most positive SAT number is changed to the most negative SAT number if it
is an underflow via the CarryIn path.
MAX Positive = 0x0007 FFFF FFFF
MAX Negative = 0x0008 0000 0000
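The multiplexing behavior of the SAT/RND slice can be approximated with the sketch below. It models only the saturation path (the rounding constant is left out), treats the seven OPMODE bits as X, Y, and Z mux selects in the order listed above, and uses a hypothetical function name.

```python
MASK48 = (1 << 48) - 1
C_MAX_POS = 0x0007_FFFF_FFFF     # C port: maximum positive 36-bit number


def sat_stage(p):
    """Return (opmode, result) for a 48-bit P value from the preceding slice."""
    upper = (p >> 35) & 0x1FFF                   # bits 47:35
    sat = int(upper not in (0, 0x1FFF))          # neither PD nor PBD asserted
    nsat = sat ^ 1
    # OPMODE bits, MSB first: 0 SAT SAT 0 0 ~SAT ~SAT
    opmode = (sat << 5) | (sat << 4) | (nsat << 1) | nsat
    carryin = sat & (p >> 47) & 1                # set only on underflow
    x = p if (opmode & 0b11) == 0b11 else 0      # X mux selects A:B
    z = C_MAX_POS if (opmode >> 4) == 0b011 else 0   # Z mux selects C
    return opmode, (x + z + carryin) & MASK48


in_range = sat_stage(1234)
overflow = sat_stage(1 << 35)
underflow = sat_stage((-(1 << 35) - 1) & MASK48)
```

On underflow the carry input turns the maximum positive constant into the maximum negative one, which is why a single C value serves both cases.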
These are the PatternDetector parameters for the SAT/RND DSP slice for this 36-bit
example where the final result is a 12-bit number after rounding:
PATTERN = 0x0000 0000 0000
MASK = 0xFFFF FF00 0000
The convergent round to even constant for 12-bit data (decimal point at bit 24) from 36-bit data
input is:
RND CNST = 0x0000 0080 0000
This constant includes the CarryIn listed in Table 4-6, page 103 so that the CarryIn can be
used for negative maximum number operation. This constant must be input at the start of
an FIR filter cascade or at the beginning of a MACC FIR filter.
The LSB correction method uses the gate described in Figure 4-40, page 102 which is fed by
PSAT(24) of the SAT/RND DSP slice. Thus, the 12-bit MSB is PSAT(35).
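[Figure: DSP48E_0 with 25-bit A and 18-bit B multiplier inputs (43-bit product), C input, and pattern detector (PD); OPMODE 0110101, ALUMODE 0000. PD is used to force the LSB (FORCE_LSB) in fabric.]
[Figure: LSB correction gate. A: Round to Even. B: Round to Odd.]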
Note that while the PATTERNDETECT searches for XXXX.0000, the PATTERNBDETECT
searches for a match with XXXX.1111. The pattern detector is used here to detect the
midpoint (special case where decimal fraction is 0.5).
Examples of Dynamic Round to Even and Round to Odd are shown in Table 4-6 and
Table 4-7.
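[Figure: DSP48E slice with 25-bit A and 18-bit B multiplier inputs (43-bit product), 48-bit C input, and pattern detector (PD)]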
For dynamic rounding using carry correction, the implementation is different for round to
odd and round to even.
In the dynamic round to even case, when XXX1.1111 is detected, a carry should be
generated. SEL_ROUNDING_MASK should be set to MODE2 (see Table 1-3). MODE2
looks at the units place bit as well as the decimal fraction. This attribute sets the mask to the
complement of C shifted left by 2, so the mask changes dynamically with the C input
decimal point. For example, when the C input is 0000.0111, the mask is 1110.0000. If the
PATTERN is set to all ones, then the PATTERNDETECT is a '1' whenever XXX1.1111 is
detected. The carry correction bit is the PATTERNDETECT output. The PATTERNDETECT
should be added to the truncated P output of the next stage DSP48E slice in order to
complete the rounding operation.
Examples of dynamic round to even are shown in Table 4-8.
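The round-to-even carry correction can be checked with a short behavioral model. It assumes that C is added with CARRYIN = 0, that a mask bit of 1 excludes that bit from the pattern compare, and that the correction bit is added to the truncated output of the same computation. Function names are hypothetical.

```python
WIDTH = 48
ALL_ONES = (1 << WIDTH) - 1


def to_signed(x):
    """Interpret a 48-bit word as a two's-complement integer."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x


def mode2_mask(c):
    """MODE2: complement of C shifted left by 2."""
    return ((~c & ALL_ONES) << 2) & ALL_ONES


def round_to_even(value, c):
    """Round a fixed-point integer to even using carry correction."""
    frac_bits = c.bit_length() + 1             # contiguous ones in C, plus 1
    p = (value + c) & ALL_ONES                 # adder output with CARRYIN = 0
    care = ~mode2_mask(c) & ALL_ONES           # units bit plus fraction bits
    pattern_detect = ((p ^ ALL_ONES) & care) == 0   # PATTERN is all ones
    truncated = to_signed(p) >> frac_bits
    return truncated + int(pattern_detect)     # add the carry correction bit


C = 0b0111                                     # four fraction bits
results = [round_to_even(v, C) for v in (24, 40, 56, -24, -40)]  # 1.5, 2.5, 3.5, -1.5, -2.5
```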
In the dynamic round to odd case, a carry should be generated whenever XXX0.1111 is
detected. SEL_ROUNDING_MASK is set to MODE1 (see Table 1-3). This attribute sets the
mask to the complement of C shifted left by 1, so the mask changes dynamically with the C
input decimal point. For example, when the C input is 0000.0111, the mask is 1111.0000. If the
PATTERN is set to all ones, then the PATTERNDETECT is a '1' whenever XXXX.1111 is
detected. The carry correction bit needs to be computed in fabric, depending on the units
place bit of the truncated DSP48E output and the PATTERNDETECT signal. The units
place bit after truncation should be a '0' and the PATTERNDETECT should be a '1' in order
for the carry correction bit to be a '1'. This carry correction bit should then be added to the
truncated P output of the DSP48E slice in order to complete the round.
Examples of static round to odd are shown in Table 4-9.
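The round-to-odd case differs in that the correction bit is formed in fabric from PATTERNDETECT and the units place bit. The sketch below models that combination under the same assumptions as before: CARRYIN = 0, and a mask bit of 1 is ignored in the compare. Function names are hypothetical.

```python
WIDTH = 48
ALL_ONES = (1 << WIDTH) - 1


def to_signed(x):
    """Interpret a 48-bit word as a two's-complement integer."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x


def mode1_mask(c):
    """MODE1: complement of C shifted left by 1."""
    return ((~c & ALL_ONES) << 1) & ALL_ONES


def round_to_odd(value, c):
    """Round a fixed-point integer to odd using carry correction."""
    frac_bits = c.bit_length() + 1
    p = (value + c) & ALL_ONES                 # adder output with CARRYIN = 0
    care = ~mode1_mask(c) & ALL_ONES           # fraction bits only
    pattern_detect = ((p ^ ALL_ONES) & care) == 0   # XXXX.1111 detected
    truncated = to_signed(p) >> frac_bits
    units_bit = truncated & 1
    # Fabric logic: carry only when PATTERNDETECT is 1 and the units bit is 0.
    carry = int(pattern_detect and units_bit == 0)
    return truncated + carry


C = 0b0111                                     # four fraction bits
results = [round_to_odd(v, C) for v in (40, 24, -40, -24)]   # 2.5, 1.5, -2.5, -1.5
```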
Chapter 5
Synthesis Tools
General information on DSP48E inference can be found in the “Coding for FPGA Device
Flow” chapter of UG626: Synthesis and Simulation Design Guide for the ISE tools, or in the
ISE HDL Language templates. In the Arithmetic Support section, there are VHDL and
Verilog examples. For information on XST, see UG627: XST User Guide for Virtex-4, Virtex-5,
Spartan-3, and New CPLD Devices.
DSP IP
Available DSP IPs are listed here:
• Multiplier
• Divider
• FIR Filter Compiler
• Floating Point Operators
• FFT
• Turbo Decoder
More information on Xilinx DSP IP can be found on the DSP IP Core web page.
Architecture Wizard
The architecture wizard is a GUI design tool in ISE that supports many single DSP48E slice
functions. In some cases, these single DSP48E slice functions can be used to create cascaded
DSP48E slice functions. The architecture wizard is accessible through the ISE tools and
supports the following configurable blocks:
• Accumulator
• Adder/Subtracter
• Multiplier
• Multiplier and Adder/Accumulator
Many of these functions can also be implemented by using Synthesis Inference, which is
the preferred method of using the DSP48E slice.
For information on the Architecture Wizard, please see the Architecture Wizard User
Guide in the ISE Software Manuals.
• Designs that do not use the multiplier should set the USE_MULT attribute to
“NONE” in order to save power by shutting down the multiplier. USE_MULT is a
new attribute in the Virtex-5 device that replaces the LEGACY_MODE attribute in
the Virtex-4 device.
• SUBTRACTREG in the Virtex-4 device should map to ALUMODEREG. Similarly, CECINSUB
(in the Virtex-4 device) should map to both CEALUMODE and CECARRYIN (in the
Virtex-5 device).
• RSTCTRL (in the Virtex-4 device) should map to both RSTCTRL and RSTALUMODE
in the Virtex-5 device.
• The CECARRYIN signal of the Virtex-5 DSP48E slice has a completely different
meaning than in the Virtex-4 DSP48 slice. Designs should retarget the Virtex-4 DSP48
CECARRYIN onto the Virtex-5 device’s CEMULTCARRYIN and should retarget the
Virtex-4 device’s CECINSUB onto both the CECARRYIN and the CEALUMODE in
the Virtex-5 device.
Education Courses
Xilinx Education Services offers comprehensive, flexible training solutions designed to
help you get your product to market quickly and efficiently. The following classes contain
information about using the Virtex-5 DSP48E XtremeDSP slice.
• DSP Design Flows
• DSP Implementation Techniques
• Achieving Breakthrough Performance in Virtex-4 FPGAs
Education Home page:
http://www.xilinx.com/training/index.htm
Application Notes
For application examples targeted to Virtex-5 devices, refer to Chapter 2, “DSP48E Design
Considerations.”
Appendix A
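[Figure: Two-input adder computing A+B with a carry input, and an adder/subtracter computing A±B (Sub/Add = 1/0). Carry input must be 1 for a subtract operation, so it is not available for other uses.]
[Figure: Second adder/subtracter implementation with inputs A and B and a Sub/add control]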
The DSP48E slice uses the second implementation extended to 3-input adders with a
CARRYIN input as shown in Figure A-3. This allows DSP48E SIMD operations to perform
subtract operations without a carryin for each smaller add/sub unit.
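The reason a subtract can be done without a carry input comes down to a two's-complement identity. The sketch below contrasts the two approaches. The placement of the inversions (one operand inverted before the add, the sum inverted after it) is an assumption about what the second implementation does, and all function names are hypothetical.

```python
WIDTH = 48
MASK48 = (1 << WIDTH) - 1


def sub_with_carry(a, b):
    """First approach: A - B = A + ~B + 1, which consumes the carry input."""
    return (a + (~b & MASK48) + 1) & MASK48


def sub_with_inversions(z, xy, cin=0):
    """Second approach: Z - (XY + CIN) = ~(~Z + XY + CIN), carry input stays free."""
    return ~((~z & MASK48) + xy + cin) & MASK48


def pack(lanes, lane_bits=12):
    """Pack a list of lane values into one word, lane 0 in the low bits."""
    word = 0
    for i, lane in enumerate(lanes):
        word |= (lane & ((1 << lane_bits) - 1)) << (i * lane_bits)
    return word


def simd_sub(z, xy, lanes=4, lane_bits=12):
    """Apply the inversion-based subtract independently to each SIMD lane."""
    lane_mask = (1 << lane_bits) - 1
    out = 0
    for i in range(lanes):
        zi = (z >> (i * lane_bits)) & lane_mask
        wi = (xy >> (i * lane_bits)) & lane_mask
        out |= (~((~zi & lane_mask) + wi) & lane_mask) << (i * lane_bits)
    return out
```

Because no lane needs a +1 injected at its least significant bit, each of the four 12-bit units can subtract on its own.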
[Figure A-3: DSP48E Slice Add/Subtract, with inputs Y, Z, and CIN and controls ALUMODE[0] and ALUMODE[1]]
Summary of Appendix A
Adder/Subtracter-only Operation
CARRYOUT[3]: Hardware and software match.
CARRYCASCOUT: Hardware and software match when ALUMODE=0000 and inverted
when ALUMODE=0011. The mismatch happens because the DSP48E slice performs the
subtract operation using a different algorithm from the fabric; thus, the DSP48E slice
requires an inverted carryout from the fabric.
MULTSIGNOUT is invalid in adder-only operation.
MACC Operation
CARRYOUT[3] is invalid in the MACC operation.
CARRYCASCOUT and MULTSIGNOUT: Hardware and software do not match due to a
modeling difference. The software simulation model is an abstraction of the hardware
model. The software views of CARRYCASCOUT and MULTSIGNOUT enable
higher-precision MACC functions to be built in the Unisim model. They are not logically
equivalent to hardware CARRYCASCOUT and MULTSIGNOUT. Only the hardware and
software results (P output) are logically equivalent. The internal signals
(CARRYCASCOUT and MULTSIGNOUT) are not. See Figure A-5.
[Figure A-5: Software Model and Hardware Implementation. In both views, A and B feed the multiplier (x), and the adder (+) combines the product with CARRYIN and the Zmux input (e.g., C, P, PCIN) to produce P[47:0], CARRYCASCOUT, and MULTSIGNOUT. In the hardware implementation, partial products from the multiply operation are added together in the second stage ternary adder.]