Ug949 Vivado Design Methodology
UltraFast Design
Methodology Guide for
Xilinx FPGAs and SoCs
Chapter 1
Introduction
Xilinx provides the following resources to help you take advantage of the UltraFast design
methodology:
• This guide, which describes the various design tasks, analysis and reporting features, and best
practices for design creation and closure.
• UltraFast Design Methodology Quick Reference Guide (UG1231), which highlights key design
methodology steps in an easy-to-use, double-sided card format.
• UltraFast Design Methodology Timing Closure Quick Reference Guide (UG1292), which covers
recommendations for closing timing, including running initial design checks, baselining the
design, and resolving timing violations.
• UltraFast Design Methodology Checklist (XTP301), which is available in the Xilinx
Documentation Navigator and as a standalone spreadsheet. You can use this checklist to
identify common mistakes and decision points throughout the design process.
• UltraFast Design Methodology System-Level Design Flow diagram representing the entire
Vivado® Design Suite design flow, which is available in the Xilinx Documentation Navigator.
You can click a design step in the diagram to open related documentation, collateral, and FAQs
to help get you started.
RECOMMENDED: In addition to these resources, Xilinx recommends the UltraFast Embedded Design
Methodology Guide (UG1046) when working with embedded designs and the Vitis HLS Methodology in
the Vitis HLS User Guide (UG1399) when developing complex systems using Vivado IP integrator with
C-based IP.
TIP: Xilinx also provides methodology-related design rule checks (DRCs) for each design stage, which are
available using the report_methodology Tcl command in the Vivado Design Suite.
• Chapter 2: Board and Device Planning: Covers decisions and design tasks that Xilinx
recommends accomplishing prior to design creation. These include I/O and clock planning,
PCB layout considerations, device capacity and throughput assessment, alternate device
definition, power estimation, and debugging.
• Chapter 3: Design Creation with RTL: Covers the best practices for RTL definition and IP
configuration and management.
• Chapter 4: Design Constraints: Provides recommendations for creating proper timing, power,
and physical constraints as well as specifying additional constraints, attributes, and other
elements used during synthesis and implementation.
• Chapter 5: Design Implementation: Covers the options available and best practices for
synthesizing and implementing the design.
• Chapter 6: Design Closure: Covers the various design analysis and implementation techniques
used to close timing on the design or to reduce power consumption. It also includes
considerations for adding debug logic to the design for hardware verification purposes.
This guide includes references to other documents such as the Vivado Design Suite User Guides,
Vivado Design Suite Tutorials, and QuickTake Video Tutorials. This guide is not a replacement for
those documents. Xilinx still recommends referring to those documents for detailed information,
including descriptions of tool use and design methodology.
This information is designed for use with the Vivado Design Suite, but you can use most of the
conceptual information with the ISE® Design Suite as well.
The questions in the UltraFast Design Methodology Checklist highlight typical areas in which
design decisions are likely to have downstream impact and draw attention to issues that are
often overlooked or ignored. Each tab in the checklist:
• Includes common questions and recommended actions to take during each design flow step,
including project planning, board and device planning, IP and submodule design, and top-level
design closure.
• Includes a Documentation and Training section that lists resources related to the design flow
step.
• Provides links to content in this guide or other Xilinx documentation, which offer guidance on
addressing the design concerns raised by the questions.
VIDEO: For a demonstration of the checklist, see the Vivado Design Suite QuickTake Video: Introducing
the UltraFast Design Methodology Checklist.
RECOMMENDED: For maximum effect, run the methodology DRCs at each design stage and address
Critical Warnings and Warnings prior to moving to the next stage.
For more information on the design methodology DRCs, see the report_methodology Tcl
command in the Vivado Design Suite Tcl Command Reference Guide (UG835).
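As a minimal sketch of running the methodology DRCs at one design stage, assuming a design (elaborated, synthesized, or implemented) is already open in the Vivado tools, the report can be written to a file for review. The report file name is a hypothetical placeholder.

```tcl
# Run the methodology DRCs on the currently open design and save the results.
# methodology_post_synth.rpt is a placeholder file name chosen for this sketch.
report_methodology -file methodology_post_synth.rpt
```

Repeating the same command after each stage makes it easier to confirm that Critical Warnings and Warnings were addressed before moving on.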
Related Information
Implementation
Logic Simulation
Logic Synthesis
Implementation
Decisions made at this stage affect the end product. A wrong decision at this point can result in
problems at a later stage, causing issues throughout the entire design cycle. Spending time early
in the process to carefully plan your design helps to ensure that you meet your design goals and
minimize debug time in the lab.
Figure: Impact of change on performance, by design stage: HLS (C, C++) 1000x; RTL 10x; Synthesis; Routing 1.1x
• Create optimal RTL constructs with Xilinx templates, and validate your RTL with methodology
DRCs prior to synthesis, after elaboration.
Because the Vivado tools use timing-driven algorithms throughout, the design must be
properly constrained from the beginning of the design flow.
• Perform timing analysis after synthesis.
To specify correct timing, you must analyze the relationship between each master clock and
related generated clocks in the design. In the Vivado tools, each clock interaction is timed
unless explicitly declared as an asynchronous or false path.
• Meet timing using the right constraints before proceeding to the next design stage.
You can accelerate overall timing and implementation convergence by following this
recommendation and by using the interactive analysis environment of the Vivado Design
Suite.
TIP: You can achieve further acceleration by combining these recommendations with the HDL design
guidelines in this guide.
Figure: Post-synthesis constraint flow. Run synthesis, then define and refine constraints using:
report_clock_networks -> create_clock / create_generated_clock
report_clock_interaction -> set_clock_groups / set_false_path
check_timing -> I/O delays
report_timing_summary -> timing exceptions
If timing is not acceptable (N), review options and HDL code, and cross-probe instances in the
critical path in the Netlist view and Elaborated view schematics.
Synthesis is considered complete when the design goals are met with a positive margin or a
relatively small negative timing margin. For example, if post-synthesis timing is not met,
placement and routing results are not likely to meet timing. However, you can still go ahead with
the rest of the flow even if timing is not met. Implementation tools might be able to close timing
if they can allocate the best resources to the failing paths. In addition, proceeding with the flow
provides a more accurate understanding of the negative slack magnitude, which helps you
determine how much you need to improve the post-synthesis worst negative slack (WNS). You
can use this information when you return to the synthesis stage with improvements to HDL and
constraints.
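The post-synthesis constraint pass described above can be sketched as the following Tcl sequence, run in an open synthesized design. This is an illustrative outline only: the clock names, port and pin names, periods, and delay values (sys_clk, ext_clk, clk_div2, data_in, and so on) are hypothetical and must be replaced with values from your design.

```tcl
# 1. Identify clock sources, then define primary and generated clocks.
report_clock_networks
create_clock -name sys_clk -period 5.000 [get_ports sys_clk_p]
create_clock -name ext_clk -period 8.000 [get_ports ext_clk_p]
create_generated_clock -name clk_div2 -source [get_ports sys_clk_p] \
    -divide_by 2 [get_pins clk_div_reg/Q]

# 2. Review how clock pairs are timed, then declare domains that are
#    truly asynchronous (assumed here for sys_clk versus ext_clk).
report_clock_interaction
set_clock_groups -asynchronous \
    -group [get_clocks -include_generated_clocks sys_clk] \
    -group [get_clocks -include_generated_clocks ext_clk]

# 3. Check for missing or incomplete constraints, then add I/O delays.
check_timing
set_input_delay  -clock sys_clk -max 1.500 [get_ports data_in]
set_output_delay -clock sys_clk -max 1.000 [get_ports data_out]

# 4. Summarize timing before deciding whether to proceed to implementation.
report_timing_summary -file post_synth_timing.rpt
```

Each report command precedes the constraints it informs, so the constraints are based on what the tools actually found rather than on assumptions about the design.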
• In the context of system design, the I/O bandwidth is validated in-system, before
implementing the entire design. Validating I/O bandwidth can highlight the need to revise
system architecture and interface choices before finalizing the I/Os.
• As part of design implementation, baselining is used to write the simplest set of constraints,
which can identify internal device timing challenges. Baselining is a process used to identify
the need to revise RTL micro-architecture choices before moving to the implementation
phase.
Xilinx Design Hubs provide links to documentation organized by design tasks and other topics,
which you can use to learn key concepts and address frequently asked questions. To access the
Design Hubs:
• In the Xilinx Documentation Navigator, click the Design Hubs View tab.
• On the Xilinx website, see the Design Hubs page.
TIP: For quick access to information on different parts of the Vivado IDE, click the Quick Help button
in the window or dialog box. For detailed information on Tcl commands, enter the command
followed by -help in the Tcl Console.
Chapter 2
Board and Device Planning
Failing to properly plan the I/O configuration can lead to decreased system performance and
longer design closure times. Xilinx highly recommends that you consider I/O planning in
conjunction with board planning.
• Vivado Design Suite User Guide: I/O and Clock Planning (UG899)
• Vivado Design Suite QuickTake Video: I/O Planning Overview
A sketch of the PCB including the critical interfaces can often help determine the best
orientation for the device on the PCB, as well as placement of the PCB components. After
completion, the rest of the device I/O interface can be planned.
High-speed interfaces such as memory can benefit from having very short and direct connections
with the PCB components with which they interface. These PCB traces often must be
length-matched and, if possible, should avoid PCB vias. In these cases, the package pins closest to the
edge of the device are preferred in order to keep the connections short and to avoid routing out
of the large matrix of BGA pins.
The I/O Planning view layout in the Vivado® IDE is useful in this stage for visualizing I/O
connectivity relative to the physical device dimensions, showing both top-side and bottom-side
views.
THERMAL TIP: For thermally-challenged designs, be aware of device placement in relation to other high-
power components to minimize thermal coupling and maximize airflow. Avoid placement where the device
is positioned in the exhaust of another high-power component or where board heating might negatively
impact the operating temperature. Xilinx recommends thermal simulation to understand how the
placement and environmental conditions can affect the junction temperature of the device.
Xilinx devices do not share this property. Devices can implement an almost infinite number of
applications at user-determined frequencies, and in multiple clock domains.
For this reason, it is critical that you understand the power requirements of the design, which you
can assess by completing a power estimation using the Xilinx Power Estimator (XPE). Also refer
to the PCB Design Guide for your device to fully understand the PDS placement and generic
decoupling requirements prior to a power estimation.
• Selecting the right voltage regulators to meet the noise and current requirements based on
power estimation.
Note: To enable and simplify your power design, Xilinx partners with key power vendors to design,
build, document, and test reference designs that meet all power requirements. For more information,
see the Power Delivery Solutions tab on the Power page of the Xilinx website.
• Consolidating power. For supported consolidation options in UltraScale™ devices, see this link
in the UltraScale Architecture PCB Design User Guide (UG583).
POWER TIP: Xilinx recommends adding a shunt resistor to allow the power on each rail to be
monitored. Alternatively, you can use a PMBus-enabled regulator or current monitoring integrated
circuit (IC).
For more information on PDN simulation, see Simulating FPGA Power Integrity Using S-Parameter
Models (WP411).
POWER TIP: Xilinx recommends simulating your power supply design using the SIMPLIS simulator in
SIMetrix/SIMPLIS to ensure your design is within the Xilinx recommended operating conditions. The
majority of power vendors provide a limited version of SIMPLIS and supply the models to allow you to run
this simulation. SIMPLIS is a third-party software used for transient and AC analysis of voltage regulators.
For more information about simulating your power delivery, contact SIMPLIS or your preferred power
delivery vendor.
POWER TIP: The Vivado tools report_power command can analyze power on a per regulator or
voltage regulator module (VRM) basis to ensure the required current on each rail does not exceed the
intended power delivery system.
Related Information
Power Closure
Xilinx recommends using lidless packaging if it is available for your device. Lidless packaging
offers a more efficient thermal solution and allows direct contact with the heat source, removing
a thermal interface material (TIM) layer. Xilinx lidded and lidless parts have the same handling and
manufacturing requirements. The following figure compares the heat sink application for a lidded
and lidless device.
THERMAL TIP: Xilinx recommends a heat sink pressure between 20 and 50 pound-force per square inch (PSI),
which ensures the smallest bond line thickness (BLT), and recommends using 4-hole mounting to
ensure even pressure for both lidded and lidless devices. For more information on lidless techniques, see
Mechanical and Thermal Design Guidelines for Lidless Flip-Chip Packages (XAPP1301).
Figure: Heat sink application for a lidded and a lidless device
Xilinx also recommends thermal simulation to ensure that there is adequate margin and accurate
power estimation. In the XPE, you have control over the following thermal settings:
• Junction Temperature Tj: You can override this setting to a desired junction temperature to
match your thermal simulation. If you are not running a thermal simulation, set the junction
temperature to the worst case.
• Effective ΘJA: Describes the thermal efficiency of a thermal solution, measured in degrees
Celsius per watt (°C/W). For example, a ΘJA of 2.1°C/W means that for every
watt dissipated in the device, the junction temperature increases by 2.1°C. For a 10 W design,
the increase is 21°C above the ambient temperature.
Note: You can obtain the ΘJA through thermal simulation using the following formula:
ΘJA = (Tj − Ta) / P
where Tj is the junction temperature obtained from the thermal simulation, Ta is the ambient
temperature, and P is the total power dissipated in the device.
Figure: Thermal validation flow. Power estimation feeds thermal simulation. If Tj < Tj max (Y), the
thermal solution is acceptable. Otherwise (N), re-evaluate the design or thermal solution and repeat.
After the junction temperature is within specification and sufficient margin is considered, the
thermal solution is considered effective.
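The ΘJA arithmetic from the example above can be checked with a few lines of plain Tcl. This is a sketch for illustration: only the 2.1°C/W and 10 W figures come from the text, while the ambient temperature and maximum junction temperature are assumed values that you would replace with your own operating conditions and data sheet limits.

```tcl
# Values from the text.
set theta_ja  2.1    ;# effective thermal resistance, degC per watt
set power_w   10.0   ;# total power dissipated in the device, watts

# Assumed values for this sketch only.
set ambient_c 50.0   ;# ambient temperature, degC
set tj_max_c  100.0  ;# maximum allowed junction temperature, degC

# Temperature rise above ambient, and resulting junction temperature.
set rise_c [expr {$theta_ja * $power_w}]
set tj_c   [expr {$ambient_c + $rise_c}]

puts [format "Rise above ambient: %.1f degC" $rise_c]
puts [format "Junction temperature: %.1f degC" $tj_c]

# Mirror the decision in the thermal validation flow.
if {$tj_c < $tj_max_c} {
    puts "Tj is below Tj max: thermal solution has margin"
} else {
    puts "Tj meets or exceeds Tj max: re-evaluate design or thermal solution"
}
```

With these inputs the rise is 21°C, matching the text, and the assumed 50°C ambient gives a junction temperature of 71°C.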
THERMAL TIP: Add the results of the power estimation and thermal simulation to the Vivado design
constraints. You can use the following XDC constraints, which you can export from the XPE, as described
in the Xilinx Power Estimator User Guide (UG440):
# Standard constraints:
set_operating_conditions -process Maximum
set_operating_conditions -design_power_budget <value>
# If a thermal simulation was completed:
set_operating_conditions -ambient_temp <value>
set_operating_conditions -thetaja <value>
# Otherwise (no thermal simulation completed):
set_operating_conditions -junction_temp <value>
7. Manually add XDC operating condition constraints to your XDC file for the Vivado tools. Use
the XPE to generate a Xilinx design constraints (XDC) file, and import this file into the
corresponding Vivado project. The XPE environment settings are translated to XDC
constraints. The estimated total on-chip power becomes the design power budget for Vivado
power analysis. For more information, see the Vivado Design Suite User Guide: Power Analysis
and Optimization (UG907).
• The device and the user design create system power supply and heat dissipation
requirements.
• Power supplies must be able to meet maximum power requirements and the device must
remain within the recommended voltage and temperature operating conditions during
operation. Power estimation and thermal modeling are required to ensure that the device
stays within these limits.
• Plan for the consolidation of power rails and their impact on power domain switching.
• Although consolidation is possible, Xilinx recommends using full power management to give
maximum flexibility where possible.
For these reasons, you must understand the power and cooling requirements of the device.
These must be designed on the board.
POWER TIP: For a list of Xilinx partners and Xilinx-approved power delivery reference designs, see the
Power page on the Xilinx website.
The separate sources provide the required power for the different device resources. This allows
different resources to work at different voltage levels for increased performance or signal
strength, while preserving a high immunity to noise and parasitic effects.
Power Types
A device goes through several power phases from power up to power down with varying power
requirements.
Power-On
Power-on power is the transient spike current that occurs when power is first applied to the
device. This current varies for each voltage supply and depends on the device construction, the
ability of the power supply source to ramp up to the nominal voltage, and the device operating
conditions, such as temperature and sequencing between the different supplies.
Spike currents are not a concern in modern device architectures when the proper power-on
sequencing guidelines are followed.
Startup Power
Startup power is the power required during the initial bring-up and configuration of the device.
This power generally occurs over a very short period of time and thus is not a concern for
thermal dissipation. However, current requirements must still be met. In most cases, the active
current of an operating design will be higher and thus no changes are necessary. However, for
lower-power designs where active current can be low, a higher current requirement during this
time might be necessary. XPE can be used to understand this requirement. When Process is set
to Maximum, the current requirement for each voltage rail will be specified to either the
operating current or the startup current, whichever is higher. XPE will display the current value in
blue if the startup current is the higher value.
Static Power
Design static power (also called standby power) is the power supplied when the device is
configured with your design and no activity is applied externally or generated internally. Static
power represents the minimum continuous power that the supplies must provide while the
design operates.
Static power is a function of junction temperature. Therefore, ensuring the ambient and thermal
solution parameters are correctly modeled is critical to allow the power estimation tools to accurately
report the static power.
Dynamic Power
Dynamic power is the power required when the device is running your application and
undergoing switching activity as clocks and datapaths toggle between High and Low logic values.
Dynamic power is calculated based on the average switching activity of device circuits over a
period of time. Total power includes static power plus dynamic power.
THERMAL TIP: The -2LI and -2LE UltraScale+™ devices allow a temperature excursion of up to 110°C for
a defined period of time, which enables a reduction in the thermal solution cost. For more information, see
Extending the Thermal Solution by Utilizing Excursion Temperatures (WP517).
TIP: The Vivado tools also support power rail constraints. For information, see this link in the Vivado
Design Suite User Guide: Power Analysis and Optimization (UG907).
POWER TIP: Power estimation is only as accurate as the data entered. Xilinx recommends conducting a
thorough estimation and using the results of this estimation as well as the thermal evaluation as a design
constraint.
POWER TIP: During the design process, you can compare the total power of the design to the power
budget using the set_operating_conditions -design_power_budget <Power in
Watts> XDC constraint. If the power budget is exceeded, early intervention is the easiest way to correct
design power.
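As a minimal sketch of the power budget check described above, the constraint can be set and a power report generated in an open synthesized or implemented design. The 12.5 W budget and the report file name are hypothetical values chosen for illustration.

```tcl
# Declare the design power budget in watts (12.5 is an assumed example value).
set_operating_conditions -design_power_budget 12.5

# Generate a power report to compare the estimated total power with the budget.
report_power -file power_estimate.rpt
```

Running this after synthesis and again after implementation gives early visibility into whether the design is trending above the budget.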
• Constraint creation, particularly in large devices with high utilization in conjunction with clock
planning.
• Manual placement of clocking resources if needed for design closure.
• Device-specific functionality that might require up-front planning to avoid issues and take
advantage of device features. For information on 7 series features, see the 7 Series FPGAs
Clocking Resources User Guide (UG472). For information on UltraScale device features, see the
UltraScale Architecture Clocking Resources User Guide (UG572).
Related Information
Clocking Guidelines
Auto-Pipelining Considerations
SLR Crossing for Wide Buses
You can visualize the data flow through the device and properly plan I/Os from both an external
and internal perspective. After the I/Os are assigned and configured through the Vivado IDE,
constraints are then automatically created for the implementation tools.
For more information on Vivado Design Suite I/O and clock planning capabilities, see the
following resources:
• Vivado Design Suite User Guide: I/O and Clock Planning (UG899)
• I/O planning project: An I/O planning project is an easy entry point that allows you to specify
select I/O constraints and generate a top-level RTL file from the defined pins.
• RTL project: An RTL project allows synthesis and implementation, which enables more
comprehensive design rule checks (DRCs). An RTL project also allows generation of IP cores,
which is important for memory interface pinout planning and any cores using GTs.
TIP: You can also start by using an I/O planning project and migrate to an RTL project later.
You can run more comprehensive DRCs on a post-synthesis netlist. The same is true after
implementation and bitstream generation. Therefore, Xilinx recommends using a skeleton design
that includes clocking components and some basic logic to exercise the DRCs. This builds
confidence that the pin definition for the board will not have issues later.
The recommended sign-off process is to run the RTL project through to bitstream generation to
exercise all the DRCs. However, not all design cycles allow enough time for this process. Often
the I/O configuration must be defined before you have synthesizable RTL. Although the Vivado
tools enable pre-RTL I/O planning, the level of DRCs performed is fairly basic. Alternatively,
you can use a dummy top-level design with I/O standards and pin assignments to help perform
DRCs related to banking rules.
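A minimal sketch of exercising the DRCs on a skeleton or dummy top-level design, assuming the design is open in the Vivado tools after synthesis or implementation. The report file name is a hypothetical placeholder.

```tcl
# Run the design rule checks on the open design and save the results for review.
report_drc -file io_planning_drc.rpt
```

Running the same check at successive stages (post-synthesis, post-implementation) exercises progressively more of the rules, which is the reason for carrying the skeleton design as far through the flow as the schedule allows.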
I/O ports can also be created and configured interactively. Basic I/O bank DRC rules are
provided.
See the 7 Series FPGAs PCB Design Guide (UG483), UltraScale Architecture PCB Design User Guide
(UG583), or Zynq-7000 SoC PCB Design Guide (UG933) to ensure proper I/O configuration for
your device. For more information, see this link in the Vivado Design Suite User Guide: I/O and
Clock Planning (UG899).
TIP: To migrate your design with reduced risk, carefully plan the following at the beginning of the design
process: device selection, pinout selection, and design criteria. Take the following into account when
migrating to a larger or smaller device in the same package: pinout, clocking, and resource management.
Pin Assignment
Good pinout selection leads to good design logic placement, shorter routes, reduced power
consumption, and improved performance. Good pinout selection is especially important for large
devices, because a pinout that is spread out can cause related signals to span longer distances.
For more information, see this link in the Vivado Design Suite User Guide: I/O and Clock Planning
(UG899).
If a design version is available, a quick top-level floorplan can be created to analyze the data flow
through the device. For more information, see the Vivado Design Suite User Guide: Design Analysis
and Closure Techniques (UG906).
Required Information
For the tools to work effectively, you must provide as much information about the I/O
characteristics and topologies as possible. You must specify the electrical characteristics,
including the I/O standard, drive, slew, and direction of the I/O.
You must also take into account all other relevant information, including clock topology and
timing constraints. Clocking choices in particular can have a significant influence on pinout
selection, and vice versa.
For IP that have I/O requirements, such as transceivers, PCIe, and memory interfaces, you must
configure the IP prior to completing I/O pin assignment. For more information on specifying the
electrical characteristics for an I/O, see this link in the Vivado Design Suite User Guide: I/O and
Clock Planning (UG899).
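The electrical characteristics listed above map to XDC properties on the top-level ports. The following sketch assumes a hypothetical output port named status_led, a hypothetical package pin AB12, and example values for the I/O standard, drive strength, and slew rate. Valid values depend on the device and bank, so check them against the documentation for your device.

```tcl
# Pin location and I/O standard for a hypothetical output port.
set_property PACKAGE_PIN AB12     [get_ports status_led]
set_property IOSTANDARD  LVCMOS18 [get_ports status_led]

# Drive strength (mA) and slew rate for the output buffer.
set_property DRIVE 8    [get_ports status_led]
set_property SLEW  SLOW [get_ports status_led]
```

The port direction itself comes from the RTL or from the port definition in an I/O planning project, not from these properties.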
Related Information
Clocking Guidelines
Pinout Selection
Xilinx recommends careful pinout selection for some specific signals as discussed below.
• Group the same interface data, address, and control pins into the same bank. If you cannot
group these components into the same bank, group them into adjacent banks.
Note: For SSI technology devices, adjacent banks must also be located within the same super logic
region (SLR).
• Place the following interface control signals in the middle of the data buses they control:
clocking, enables, resets, and strobes.
• Place very high fanout, design-wide control signals towards the center of the device.
Note: For SSI technology devices, place the signals in the SLR located in the middle of the SLR
components they drive.
Configuration Pins
To design an efficient system, you must choose the device configuration mode that best matches
the system requirements. Factors to consider include:
Each configuration mode dedicates certain device pins and can temporarily use other multi-
function pins during configuration only. These multi-function pins are then released for
general use when configuration is completed.
• Using configuration mode to place voltage restrictions on some device I/O banks.
• Choosing suitable terminations for different configuration pins.
• Using the recommended values of pull-up or pull-down resistors for configuration pins.
RECOMMENDED: Even though configuration clocks are slow speed, perform signal integrity analysis on
the board to ensure clean signals.
There are several configuration options. Although the options are flexible, there is often an
optimal solution for each system. Consider the following when choosing the best configuration
option:
• Setup
• Speed
• Cost
• Complexity
For more information on device configuration options, see Vivado Design Suite User Guide:
Programming and Debugging (UG908).
Related Information
Configuration
Memory Interfaces
Additional I/O pin planning steps are required when using Xilinx Memory IP. After the IP is
customized, assign the top-level IP ports to physical package pins in either the elaborated or
synthesized design in the Vivado IDE. All of the ports associated with each Memory IP are
grouped together into an I/O Port Interface for easier identification and assignment. A Memory
Bank/Byte Planner is provided to assist you with assigning Memory I/O pin groups to byte lanes
on the physical device pins. For more information, see this link in the Vivado Design Suite User
Guide: I/O and Clock Planning (UG899).
Take care when assigning memory interfaces and try to limit congestion as much as possible,
especially with devices that have a center I/O column. Bunching memory interfaces together can
create routing bottlenecks across the device. The Zynq-7000 SoC and 7 series Devices Memory
Interface Solutions (UG586) and the UltraScale Architecture-Based FPGAs Memory IP LogiCORE IP
Product Guide (PG150) contain design and pinout guidelines. Be sure that you follow the trace
length match recommendations in these guides, verify that the correct termination is used, and
validate the pinout by running the DRCs after memory IP I/O assignment. For more
information on memory interface signal termination and routing guidelines, see the UltraScale
Architecture PCB Design User Guide (UG583).
Xilinx recommends that you use the GT wizard to generate the core. Alternatively, you can use
the Xilinx IP core for the protocol. For pinout recommendations, see the related product guide.
For clock resource balancing, the Vivado placer attempts to constrain loads clocked by GT output
clocks (TXOUTCLK or RXOUTCLK) next to the GTs sourcing the clocks. For SSI technology
devices, if the GTs are located in the clock regions adjacent to another SLR, the routing resources
required for signals entering or exiting SLLs have to compete with the routing resources required
by the GT output clock loads. Therefore, GTs located in clock regions next to SLR crossings might
reduce the available routing connections to and from the SLL crossings available in those clock
regions.
There are multiple options to assist in generating test data for these interfaces. For some of the
interface IP cores, the Vivado tools can provide the test designs:
TIP: If a test design does not exist, consider using AXI traffic generators.
You might need to create a separate design for a system-level test in a production environment.
Usually, this is a single design that includes tested interfaces and optionally includes processors.
You can construct this design using the small connectivity designs to take advantage of design
reuse. Although this design is not required early in the flow, it can enable better DRC checks and
early software development, and you can quickly create it using the Vivado IP integrator.
Multiple SLR components are stacked vertically and connected through an interposer to create
an SSI technology device. The bottom SLR is SLR0, and subsequent SLR components are named
incrementally as they ascend vertically.
For example, the XC7V2000T device includes four SLR components. The bottom SLR is SLR0,
the SLR directly above SLR0 is SLR1, the SLR directly above SLR1 is SLR2, and the top SLR is
SLR3.
Note: The Xilinx tools clearly identify SLR components in the graphical user interface (GUI) and in reports.
SLR Nomenclature
Understanding SLR nomenclature for your target device is important in:
• Pin selection
• Floorplanning
• Analyzing timing and other reports
• Identifying where logic exists and where that logic is sourced or destined
You can use the Vivado Tcl command get_slrs to get specific information about SLRs for a
particular device. For example, use the following commands:
TIP: To query which SLR is the master SLR in the Vivado Design Suite, you can enter the
get_slrs -filter IS_MASTER Tcl command.
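The following sketch shows a few ways get_slrs might be used, assuming an SSI technology device is targeted and a design is open. The cell name my_cell is a hypothetical placeholder for a placed cell in your design.

```tcl
# List the SLRs in the target device.
get_slrs

# Count the SLRs in the target device.
llength [get_slrs]

# Return the SLR in which a placed cell resides (my_cell is a placeholder).
get_slrs -of_objects [get_cells my_cell]

# Return the master SLR, as described in the TIP above.
get_slrs -filter IS_MASTER
```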
Silicon Interposer
The silicon interposer is a passive layer in the SSI technology device, which routes the following
between SLR components:
• Configuration
• Global clocking
• General interconnect
TIP: To determine the number of available SLLs between SLRs, use SLR properties. For example:
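A minimal sketch of querying SLR properties, assuming an SSI technology device is targeted. The report_property command lists every property available on the SLR object. The NUM_SLLS property name used in the second command is an assumption of this sketch and should be confirmed against the report_property output for your device and tool version.

```tcl
# List all properties of SLR0 to see which SLL-related properties are available.
report_property [get_slrs SLR0]

# Query the SLL count directly (NUM_SLLS is an assumed property name; verify it
# in the report_property output above).
get_property NUM_SLLS [get_slrs SLR0]
```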
Propagation Limitations
TIP: For high-speed propagation across SLRs, be sure to register signals that cross SLR boundaries.
SLL signals are the only data connections between SLR components. The following dedicated
connections do not propagate across SLR boundaries:
• Carry chains
• DSP cascades
• Block RAM and UltraRAM cascades
• Other dedicated connections, such as DCI cascades
The tools normally take this limit on propagation into account. To ensure that designs route
properly and meet your design goals, you must also take this limit into account when you:
• Build a very long DSP, Block RAM, or UltraRAM cascade and manually place such logic near
SLR boundaries
• Specify a pinout for the design
To improve timing closure and compile times, you can use Pblocks to assign logic to each SLR and
validate that individual SLRs do not have excessive utilization across all fabric resource types. For
example, a design with block RAM utilization of 70% might cause issues with timing closure if the
block RAM resources are not balanced across SLRs and one SLR is using over 85% block RAM.
TIP: You can define SLR Pblocks by specifying a complete SLR (e.g., resize_pblock pblock_SLR0
-add SLR0).
Xilinx recommends assigning block RAM and DSP groups to SLR Pblocks to minimize SLR
crossings of shared signals. For example, an address bus that fans out to a group of block RAMs
that are spread out over multiple SLRs can make timing closure more difficult to achieve, because
the SLR crossing incurs additional delay for the timing critical signals.
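The SLR Pblock approach described above can be sketched as follows. The Pblock name and the resize_pblock usage follow the TIP above, while u_pkt_buffer is a hypothetical hierarchical cell that contains a block RAM group together with the logic driving its shared address bus.

```tcl
# Create a Pblock and size it to cover all of SLR0.
create_pblock pblock_SLR0
resize_pblock pblock_SLR0 -add SLR0

# Assign the block RAM group and its address logic to the SLR0 Pblock so that
# the shared signals do not cross an SLR boundary (cell name is a placeholder).
add_cells_to_pblock [get_pblocks pblock_SLR0] [get_cells u_pkt_buffer]

# Check resource use within the Pblock to catch an over-utilized SLR early.
report_utilization -pblocks [get_pblocks pblock_SLR0]
```

Repeating this per SLR makes it easier to confirm that no single SLR carries a disproportionate share of block RAM or DSP resources.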
Device resource location or user I/O selection anchors IP to SLRs. Examples include GT, ILKN,
PCIe, and CMAC dedicated blocks and memory interface controllers. Xilinx recommends the following:
• Pay special attention to dedicated block location and pinout selection to avoid data flow
crossing SLR boundaries multiple times.
• Keep tightly interconnected modules and IP within the same SLR. If that is not possible, you
can add pipeline registers to allow the placer more flexibility to find a good solution despite
the SLR crossing between logic groups.
• Keep critical logic within the same SLR. By ensuring that main modules are properly pipelined
at their interfaces, the placer is more likely to find SLR partitions with flip-flop to flip-flop SLR
crossings.
In the following figure, a memory interface that is constrained to SLR0 needs to drive user logic
in SLR1. An AXI4-Lite slave interface connects to the memory IP backend, and the well-defined
boundary between the memory IP and the AXI4-Lite slave interface provides a good transition
from SLR0 to SLR1.
Figure: Memory interface (MIG_DDR3) in SLR0 driving User_logic in SLR1 through AXI4_slave
The following figure illustrates a worst-case crossing for a vu190-2 device. This example starts at
an Interlaken dedicated block in the bottom left of SLR0 to a packet monitor block assigned to
the top right of SLR2. Without pipeline registers for the data bus to and from the packet monitor,
the design misses the 300 MHz timing requirement by a wide margin.
However, adding seven pipeline stages to aid in the traversal from SLR0 to SLR2 allows the
design to meet timing. It also reduces the use of vertical and horizontal long routing resources, as
shown in the following figure.
Figure 11: Data Path Crossing SLR with Pipeline Flip-Flop Added
TIP: Use the AXI Register Slice IP or your custom auto-pipelining IP to close timing on wide buses across
SLRs.
Related Information
Auto-Pipelining Considerations
The following figure shows the Virtex UltraScale+ HBM vu37p device adjacent to a Virtex
UltraScale+ vu13p device. In the VU37P device, the bottom two SLRs of the VU13P device are
replaced by the HBM stacks (SLR0 in the vu13p device) and an SLR that contains the 32 HBM
AXI interfaces (SLR1 in the vu13p device). The top two SLRs of the vu13p and vu37p device are
identical.
Figure: vu13p and vu37p device layouts with CMAC, ILKN, PCIE, PCIEC, and HBM AXI sites
In the vu37p device, SLR0 contains 4 PCIE4C sites, 2 ILKNE4 sites, and the 32 HBM AXI
interfaces. The 4 PCIE4C sites in the Virtex UltraScale+ HBM SLR0 are unique because they
allow for the Cache Coherent Interconnect for Accelerators (CCIX) protocol using PCIe Gen3 x16
when VCCINT is at 0.72V.
Figure: vu37p device sites (ILKNE4, CMACE4, PCIE4, PCIE4C, and HBM AXI)
Paths from fabric logic in SLR2 to the HBM AXI Interfaces in SLR0 often require five or more
pipeline stages to meet timing. Thoughtful design planning of Virtex UltraScale+ HBM devices
can reduce the need for additional pipeline stages and reduce routing congestion. The following
figure shows an example of SLR crossings to the HBM AXI Interfaces from SLR2.
RECOMMENDED: Xilinx recommends keeping the paths from SLR2 and SLR1 vertically aligned to their
respective HBM AXI interfaces to avoid crossing the device diagonally.
TIP: Use auto-pipelining (e.g., AXI Register Slice IP) to ensure timing closure between the HBM interfaces
and any SLR at 450 MHz.
Figure 14: HBM Sub-Optimal Design Planning (left) versus Optimal Design Planning
(right)
Related Information
Auto-Pipelining Considerations
SLR Crossing for Wide Buses
• For designs that heavily utilize the HBM AXI interfaces, budget for lower overall fabric
utilization of non-HBM logic in SLR0 to better accommodate the resources required for the
HBM AXI interfaces.
• Using MIG IP in the SLR0 might result in timing closure challenges for HBM AXI interfaces
located near the I/O columns of the device. When using MIG IP, consider using the I/O
columns located in SLR2 or SLR1.
• Be aware of address ranges and the physical location of the HBM AXI interfaces that can
impact the latency and bandwidth of the design. To optimize the performance of the HBM,
utilize the physical HBM AXI interfaces on the same device side as the addressed HBM stack.
Figure 15: Recommended PCIE4C Sites in SLR0 of a Virtex UltraScale+ HBM vu37p
Device
Configuration
Configuration is the process of loading application-specific data into the internal memory of the
device. Because Xilinx device configuration data is stored in CMOS configuration latches (CCLs),
the configuration data is volatile and must be reloaded each time the device is powered up.
Xilinx devices can load themselves through configuration pins from an external, non-volatile
memory device. Devices can also be configured by an external smart source, such as the
following:
• Microprocessor
• Microcontroller
• DSP processor
• Personal computer (PC)
• Board tester
When planning the board, consider configuration aspects up front, which makes the device easier to
configure and debug. Each device family has a Configuration User Guide that is the primary resource
for detailed information about each of the supported configuration modes and their trade-offs on
pin count, performance, and cost.
In addition, signals such as INIT_B and DONE are critical for device configuration debug. The
INIT_B signal has multiple functions. It indicates completion of initialization at power-up and can
indicate when a CRC error is encountered. Xilinx recommends that you connect the INIT_B and
DONE signals to light-emitting diodes (LEDs) using LED drivers and pull-ups.
For recommended pull-up values, see the configuration user guide for your device:
To identify and check recommended board-level pin connections, see the schematic checklists:
Chapter 3
Decisions made at this stage affect the end product. A wrong decision at this point can result in
problems at a later stage, causing issues throughout the entire design cycle. Spending time early
in the process to carefully plan your design helps to ensure that you meet your design goals and
minimize debug time in the lab.
However, defining a hierarchy based on functionality only does not take into account how to
optimize for timing closure, runtime, and debugging. The following additional considerations
made during hierarchy planning also help in timing closure.
When using the tool to infer IOBUF or OBUFT components, make sure that the enable logic and
the input/output logic are all in the same hierarchy. If the logic is in different hierarchies and
there are KEEP_HIERARCHY or DONT_TOUCH attributes between the hierarchies, the tool will
not be able to infer these buffers.
I/O components that need to be instantiated, such as differential I/O (IBUFDS, OBUFDS) and
double data-rate registers (IDDR, ODDR, ISERDES, OSERDES), should also be instantiated near
the top level. When you instantiate a component, you add an instance of that component to your
HDL file. Instantiation gives you full control over how the component is used. Therefore, you
know exactly how the logic will be used.
Aside from the module the clocks are created in, clock paths should only drive down into
modules. Any paths that go through (down from top and then back to top) can create a delta
cycle problem in VHDL simulation that is difficult and time consuming to debug.
If the cells are not contained within a level of hierarchy, all objects must be included individually
in the floorplan constraint. If synthesis changes the names of these objects, you must update the
constraints. A good floorplan is contained at the hierarchy level, because this requires only a one
line constraint.
For more information on floorplanning, see this link in the Vivado Design Suite User Guide: Design
Analysis and Closure Techniques (UG906).
RECOMMENDED: Although the Vivado tools allow cross hierarchy floorplans, these require more
maintenance. Avoid cross hierarchy floorplans where possible.
CAUTION! Unlike other attributes, the DONT_TOUCH attribute does not propagate from a module to all
the signals inside the module. For more information, see this link in the Vivado Design Suite User Guide:
Synthesis (UG901).
Figure: Hierarchy example with placement_flexibility_wrapper_i, floorplanning_wrapper_i, and
DSP_i (DSP Algorithm block with DATA_IN, VALID_IN, CE, DATA_OUT, and VALID_OUT)
• DSP_i
In the DSP_i algorithm block, both the inputs and outputs are registered. Because registers are
plentiful in a device, it is preferable to use this method to improve the timing budget.
• floorplanning_wrapper_i
Planning IP Requirements
Planning IP requirements is one of the most important stages of any new project:
• Evaluate the IP options available from Xilinx or third-party partners against required
functionality and other design goals. For example:
○ Is custom logic more desirable compared to an available IP core?
○ Does it make sense to package a custom design for reuse in multiple projects in an industry
standard format?
• Consider the interfaces that are required, such as memory, network, and peripherals.
IMPORTANT! To ensure that the tools process the IP-specific constraints properly, add the .xci or .xcix IP
source files to the project. Do not use the IP-generated output DCP files as project sources when working
with IP.
AMBA AXI
Xilinx has standardized IP interfaces on the open AMBA® 4 AXI4 interconnect protocol. This
standardization eases integration of IP from Xilinx and third-party providers, and maximizes
system performance. Xilinx has worked with Arm® to define the AXI4, AXI4-Lite, and AXI4-
Stream specifications for efficient mapping into its device architectures.
AXI4 is targeted at high performance, high clock frequency system designs, and is suitable for
high-speed interconnects. AXI4-Lite is a light-weight version of AXI4, and is used mostly for
accessing control and status registers.
AXI4-Stream is used for unidirectional streaming of data from Master to Slave. This is typically
used for DSP, Video and Communications applications.
From the IP catalog, you can explore the available IP cores, and view the Product Guide, Change
Log, Product Web page, and Answer Records for any IP.
You can access and customize the cores in the IP catalog through the GUI or Tcl shell. You can
also use Tcl scripts to automate the customization of IP cores.
Custom IP
Xilinx uses the industry standard IP-XACT format for delivery of IP, and provides tools (IP
packager) to package custom IP. Accordingly, you can also add your own customized IP to the
catalog and create IP repositories that can be shared in a team or across a company. IP from
third-party providers can also be added to this catalog, provided it is packaged in IP packager,
even if it is already in the IP-XACT format.
A majority of the IP in the IP catalog is free. However, some high value IP has an associated cost
and requires a license. The IP catalog informs you about whether or not the IP requires purchase,
as well as the status of the license. To select an IP from the catalog, consider the following key
features, based on your design requirements, and what the specific IP offers:
○ Memory interfaces - number of memory interfaces, including their size and performance.
○ AXI4-Lite
○ AXI4-Stream
• If multiple protocols are involved, bridging IP cores might have to be chosen using
infrastructure IP from the IP catalog. For example:
○ AXI-AHB bridge
○ AXI-AXI Interconnect
○ AXI-PCIe bridge
○ AXI-PLB bridge
Customizing IP
IP can be customized through the GUI or through Tcl scripts.
You need to know the names of the configuration options, and the values to which they
can be set. Typically, you first perform the customization through the GUI, and then create the
script from that. Once you see the resulting Tcl script, you can easily modify the script for your
needs, such as changing data sizes.
Tcl script-based IP creation is useful for automation, for example, when working with a version
control system. For information about source management and revision control, see this link in the
Vivado Design Suite User Guide: Design Flows Overview (UG892).
IMPORTANT! For memory IP in 7 series devices, a PRJ file is created in addition to the XCI file. When
using revision control with 7 series memory IP, keep the PRJ file in the same directory as the XCI file.
Designs with too many unique control sets might have many wasted resources as well as fewer
options for placement, resulting in higher power and lower performance. Designs with fewer
control sets have more options and flexibility in terms of placement, generally resulting in
improved results.
In UltraScale™ devices, there is more flexibility in control set mapping within a CLB. Resets that
are undriven do not form part of the control set, because the tie off is generated locally within
the slice. However, it is good practice to limit unique control sets to give maximum flexibility in
placement of a group of logic.
Resets
Resets are one of the more common and important control signals to take into account and limit
in your design. Resets can significantly impact your design's performance, area, and power.
• LUTs
• Registers
• SRLs
• Block or LUT memory
• DSP48 registers
The choice and use of resets can affect the selection of these components, resulting in less
optimal resources for a given design. A misplaced reset on an array can mean the difference
between inferring one block RAM, or inferring several thousand registers.
Asynchronous resets described at the input or output of a multiplier might result in registers
placed in the slices rather than the DSP block. In such situations, additional logic resources are
used, which negatively impacts the power consumption and design performance.
If an initial state is not specified, sequential primitives are assigned a default value. In most cases,
the default value is zero. Exceptions are the FDSE and FDPE primitives that default to a logic
one. Every register will be at a known state at the end of configuration. Therefore, it is not
necessary to code a global reset for the sole purpose of initializing a device on power up.
Xilinx highly recommends that you take special care in deciding when the design requires a reset,
and when it does not. In many situations, resets might be required on the control path logic for
proper operation. However, resets are generally less necessary on the data path logic. Limiting
the use of resets:
RECOMMENDED: Evaluate each synchronous block, and attempt to determine whether a reset is
required for proper operation. Do not code the reset by default without ascertaining its real need.
For logic in which no reset is coded, there is much greater flexibility in selecting the device
resources to map the logic.
The synthesis tool can then pick the best resource for that code in order to arrive at a potentially
superior result by considering, for example:
• Requested functionality
• Performance requirements
• Available device resources
• Power
• Synchronous resets can directly map to more resource elements in the device architecture.
• Asynchronous resets impact the performance of the general logic structures. Because all Xilinx
device general-purpose registers can program the set/reset as either asynchronous or
synchronous, it might seem like there is no penalty in using asynchronous resets. If a global
asynchronous reset is used, it does not increase the control sets. However, the need to route
this reset signal to all register elements increases routing complexity.
• Asynchronous resets have a greater probability of corrupting memory contents of block
RAMs, LUTRAMs, and SRLs during reset assertion. This is especially true for registers with
asynchronous resets that drive the input pins of block RAMs, LUTRAMs, and SRLs.
• Synchronous resets offer more flexibility for control set remapping when higher density or fine
tuned placement is needed. A synchronous reset can be remapped to the data path of the
register if an incompatible reset is found in the more optimally placed slice. This can reduce
routing resource utilization and increase placement density where needed to allow proper
fitting and improved performance.
• Some resources such as the DSP48 and block RAM have only synchronous resets for the
register elements within the block. When asynchronous resets are used on register elements
associated with these elements, those registers may not be inferred directly into those blocks
without impacting functionality.
• The clock works as a filter for small reset glitches for synchronous resets. However, if these
glitches occur near the active clock edge, the flip-flop might become metastable.
• Synchronous resets might need to stretch the pulse width to ensure that the reset signal pulse
is wide enough for the reset to be present during an active edge of the clock.
• When using asynchronous resets, remember to synchronize the deassertion of the
asynchronous reset. Although the relative timing between clock and reset can be ignored
during reset assertion, the reset release must be synchronized to the clock. Omitting
synchronization of the reset release edge can lead to metastability. During reset release, setup and hold
timing conditions must be satisfied for the reset pin relative to the clock pin of a register. A
violation of the setup and hold conditions for asynchronous reset (e.g., reset recovery and
removal timing) might cause the flip-flop to become metastable, causing design failure due to
switching to an unknown state. Note that this situation is similar to the violation of setup and
hold conditions for the flip-flop data pin.
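The last point above, asserting a reset asynchronously but releasing it synchronously, can be sketched as follows. This is a minimal illustration rather than a Xilinx-supplied template: the module name reset_sync and the port names arst_in and rst_out are hypothetical, and an active-High reset is assumed.

```verilog
// Minimal sketch: asynchronous assertion, synchronous deassertion.
// Module and port names are hypothetical; reset is assumed active-High.
module reset_sync (
    input  wire clk,
    input  wire arst_in,   // asynchronous reset request
    output wire rst_out    // asserts immediately, releases on a clk edge
);
    // ASYNC_REG marks the two stages as a synchronizer chain so that
    // the tools keep them close together.
    (* ASYNC_REG = "TRUE" *) reg [1:0] sync_ff = 2'b11;

    always @(posedge clk or posedge arst_in) begin
        if (arst_in)
            sync_ff <= 2'b11;               // assert without waiting for clk
        else
            sync_ff <= {sync_ff[0], 1'b0};  // release after two clk edges
    end

    assign rst_out = sync_ff[1];
endmodule
```

The downstream logic uses rst_out, so the release edge always arrives a fixed number of clock edges after arst_in is removed, and the recovery and removal checks on rst_out can be timed against clk.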
In this circuit, the DSP48 primitive is inferred with all pipeline registers packed within the DSP
primitive (AREG/BREG=1, MREG=1, PREG=1).
By simply changing the reset definition as shown in the following figure, such that the multiplier
pipeline registers use a synchronous reset, synthesis can take advantage of the DSP48 internal
registers: AREG/BREG=1, MREG=1, PREG=1.
Due to saving fabric resources and taking advantage of all DSP48 internal registers, the design
performance and power efficiency are optimal.
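As a rough sketch of the synchronous-reset coding style described above, the following multiplier registers its inputs, product, and output in a single clocked process with a synchronous reset. The module name, port names, and 16-bit operand widths are assumptions chosen for illustration, and the comments mark which DSP48 register each stage is a candidate for. The actual packing depends on the device and synthesis settings.

```verilog
// Minimal sketch: pipelined signed multiplier with synchronous reset.
// Names and widths are hypothetical.
module mult_sync_rst (
    input  wire               clk,
    input  wire               rst,   // synchronous, active-High
    input  wire signed [15:0] a,
    input  wire signed [15:0] b,
    output reg  signed [31:0] p
);
    reg signed [15:0] a_q, b_q;  // candidates for AREG/BREG
    reg signed [31:0] m_q;       // candidate for MREG

    always @(posedge clk) begin
        if (rst) begin
            a_q <= 16'sd0;
            b_q <= 16'sd0;
            m_q <= 32'sd0;
            p   <= 32'sd0;
        end else begin
            a_q <= a;
            b_q <= b;
            m_q <= a_q * b_q;
            p   <= m_q;          // candidate for PREG
        end
    end
endmodule
```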
Figure: Register chain clocked by clk in which rst drives the CE and CLR register pins
The optimal way to remove the resets is to create separate sequential logic procedures with one
for reset conditions and the other for non-reset conditions, as shown in the following figure.
Figure 21: Separate Procedural Statements for Registers With and Without Reset
TIP: When using a reset, make sure that all registers in the procedural statement are reset.
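One way to write the separate procedural statements is sketched below. The module and signal names are hypothetical, and the split between a reset control register and a non-reset data register is only an example of the pattern.

```verilog
// Minimal sketch: one always block for registers that need a reset,
// another for registers that do not. Names are hypothetical.
module ctrl_and_data (
    input  wire        clk,
    input  wire        rst,        // synchronous reset, control path only
    input  wire        valid_in,
    input  wire [31:0] data_in,
    output reg         valid_out,
    output reg  [31:0] data_out
);
    // Control path: every register in this block is reset.
    always @(posedge clk) begin
        if (rst)
            valid_out <= 1'b0;
        else
            valid_out <= valid_in;
    end

    // Data path: no reset is coded, so rst does not appear in this block.
    always @(posedge clk) begin
        data_out <= data_in;
    end
endmodule
```

If data_out were assigned only in the else branch of the first block, the reset would act as a clock enable on the data registers, which adds a control set that the design does not need.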
Clock Enables
When used properly, clock enables can significantly reduce design power with little impact on
area or performance. However, when clock enables are used improperly, they can lead to:
In most cases, low fanout clock enables are the main contributor to the high number of control
sets.
In most implementations, this does not result in added logic, and avoids the need for a clock
enable. The exception to this rule is a large bus, for which inferring a clock enable that holds
the value can help reduce power. The basic premise is that when small
numbers of registers are inferred, a clock enable can be detrimental because it increases control
set count. However, in larger groups, it can become more beneficial and is recommended.
Related Information
Clocking Guidelines
When the design includes a synchronous reset/enable, synthesis creates a logic cone mapped
through the CE/R/S pins when the load is equal to or above the threshold set by the
-control_set_opt_threshold synthesis switch, or creates a logic cone that maps through
the D pin if below the threshold. The default thresholds are:
• 7 series devices: 4
• UltraScale devices: 2
In the following figure, the enable signal (en) is only connected to one flip-flop. Therefore, the
synthesis engine connected the en signal to the FDRE/D pin cone of logic. Note that the CE pin
is tied to logic 1.
To override this default behavior, you can use the DIRECT_ENABLE attribute. For example, the
following figure shows how to connect the enable signal (en) to the CE pin of the register by
adding the DIRECT_ENABLE attribute to the port/signal.
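A minimal sketch of the attribute placement follows. The module name and the d and q ports are hypothetical; the attribute is placed on the en port, as described above.

```verilog
// Minimal sketch: request that en drive the register CE pin directly.
// Module and data port names are hypothetical.
module direct_enable_example (
    input  wire clk,
    (* direct_enable = "yes" *) input wire en,
    input  wire d,
    output reg  q
);
    always @(posedge clk) begin
        if (en)
            q <= d;
    end
endmodule
```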
The following figure shows RTL code in which either global_rst or int_rst can reset the
register. By default, both are mapped to the reset pin cone of logic.
You can use the DIRECT_RESET attribute to specify which reset signal to connect to the register
reset pin. For example, the following figure shows how to use the DIRECT_RESET attribute to
connect only the global_rst signal to the register FDRE/R pin and connect the int_rst
signal to the FDRE/D cone of logic.
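The following sketch shows one possible form of this coding. The global_rst and int_rst names come from the description above, and the remaining names are hypothetical.

```verilog
// Minimal sketch: only global_rst is requested on the register reset pin;
// int_rst is expected to be folded into the D-input logic.
module direct_reset_example (
    input  wire clk,
    (* direct_reset = "yes" *) input wire global_rst,
    input  wire int_rst,
    input  wire d,
    output reg  q
);
    always @(posedge clk) begin
        if (global_rst || int_rst)
            q <= 1'b0;
        else
            q <= d;
    end
endmodule
```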
Pushing the Logic from the Control Pin to the Data Pin
During analysis of critical paths, you might find multiple paths ending at control pins. You must
analyze these paths to determine if there is a way to push the logic into the datapath without
incurring penalties, such as extra logic levels. There is less delay in a path to the D pin than
CE/R/S pins given the same levels of logic because there is a direct connection from the output
of the last LUT to the D input of the FF. The following coding examples show how to push the
logic from the control pin to the data pin of a register.
In the following example, the enable pin of dout_reg[0] has 2 logic levels, and the data pin has 0
logic levels. In this situation, you can improve timing by moving the enable logic to the D pin by
setting the EXTRACT_ENABLE attribute to "no" on the dout register definition in the RTL file.
The following example shows how to separate the combinational and sequential logic and map
the complete logic into the datapath. This pushes the logic into the D pin, which still has 2 logic
levels.
You can achieve the same structure by setting the EXTRACT_ENABLE attribute to “no.” For more
information on the EXTRACT_ENABLE attribute, see the Vivado Design Suite User Guide: Synthesis
(UG901).
Figure 27: Critical Path Ending at Data Pin of a Register (Disabling Enable Extraction)
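As an illustration only, the following sketch applies the EXTRACT_ENABLE attribute to a register whose enable comes from a wide decode. The module name, the sel input, and the decode expression are hypothetical and are not taken from the original figure.

```verilog
// Minimal sketch: disable enable extraction so the decode logic is
// absorbed into the D-input cone. Names and decode are hypothetical.
module extract_enable_example (
    input  wire       clk,
    input  wire [7:0] sel,   // inputs to the enable decode
    input  wire       din,
    output wire       dout
);
    // Eight-input decode, too wide for a single 6-input LUT.
    wire en = (sel[3:0] == 4'hA) && (sel[7:4] != 4'h0);

    // "no" asks synthesis not to use the CE pin for this register.
    (* extract_enable = "no" *) reg dout_r;

    always @(posedge clk) begin
        if (en)
            dout_r <= din;
    end

    assign dout = dout_r;
endmodule
```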
The following examples demonstrate how understanding the hardware resources and mapping
can help make certain design decisions:
• For larger than 4-bit addition, subtraction, and add-sub, a carry chain is generally used and
one LUT per 2-bit addition is used (that is, an 8-bit by 8-bit adder uses 8 LUTs and the
associated carry chain). For ternary addition or in the case where the result of an adder is
added to another value without the use of a register in between, one LUT per 3-bit addition is
used (that is, an 8-bit by 8-bit by 8-bit addition also uses 8 LUTs and the associated carry
chain).
If more than one addition is needed, it may be advantageous to specify registers after every
two levels of addition to cut device utilization in half by allowing a ternary implementation to
be generated.
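The register placement described above can be sketched as follows for a four-operand sum. The module name, port names, and 8-bit widths are assumptions. The first stage performs a three-input addition and is registered before the fourth operand is added.

```verilog
// Minimal sketch: register after two levels of addition so that a
// ternary adder can be generated. Names and widths are hypothetical.
module add4_ternary (
    input  wire       clk,
    input  wire [7:0] a, b, c, d,
    output reg  [9:0] sum
);
    reg [9:0] abc_q;  // a + b + c, at most 765, fits in 10 bits
    reg [7:0] d_q;    // d delayed to stay aligned with abc_q

    always @(posedge clk) begin
        abc_q <= a + b + c;     // ternary addition
        d_q   <= d;
        sum   <= abc_q + d_q;   // at most 1020, fits in 10 bits
    end
endmodule
```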
• In general, multiplication is targeted to DSP blocks. Signed bit widths less than 18x25 (18x27
in UltraScale devices) map into a single DSP Block. Multiplication requiring larger products
might map into more than one DSP block. DSP blocks have pipelining resources inside them.
Pipelining properly for logic inferred into the DSP block can greatly improve performance and
power. When a multiplication is described, three levels of pipelining around it generate the best
setup, clock-to-out, and power characteristics. Extremely light pipelining (one-level or none)
might lead to timing issues and increased power for those blocks, while the pipelining
registers within the DSP lie unused.
• Two SRLs with depths of 16 bits or less can be mapped into a single LUT, and single SRLs up
to 32 bits can also be mapped into a single LUT.
• For conditional code resulting in standard MUX components:
○ A 4-to-1 MUX can be implemented into a single LUT, resulting in one logic level.
○ An 8-to-1 MUX can be implemented into two LUTs and a MUXF7 component, still resulting
in effectively one logic (LUT) level.
○ A 16-to-1 MUX can be implemented into four LUTs and a combination of MUXF7 and
MUXF8 resources, still resulting in effectively one logic (LUT) level.
A combination of LUTs, MUXF7, and MUXF8 within the same CLB/slice structure results in a
very small combinational delay. Hence, these combinations are considered as equivalent to
only one logic level. Understanding this code can lead to better resource management, and
can help in better appreciating and controlling logic levels for the data paths.
For general logic, take into account the number of unique inputs for a given register. From that
number, an estimation of LUTs and logic levels can be achieved. In general, 6 inputs or fewer
always results in a single logic level. Theoretically, two levels of logic can manage up to 36 inputs.
However, for all practical purposes, you should assume that approximately 20 inputs is the
maximum that can be managed with two levels of logic. In general, the larger the number of
inputs and the more complex the logic equation, the more LUTs and logic levels are required.
IMPORTANT! Check the availability of hardware resources and how efficiently they are being utilized
early in the design cycle to enable easier modifications. This approach yields better results than waiting
until late in the design cycle during timing closure.
• Inference
Advantages:
○ Highly portable
○ Self-documenting
○ Fast simulation
Disadvantages:
○ Might not have access to all RAM configurations available
Because inference usually gives good results, it is the recommended method, unless a given
use is not supported, or it is not producing adequate results in performance, area, or power. In
that case, explore other methods.
When inferring RAM, Xilinx recommends that you use the HDL Templates provided in the
Vivado tools. As mentioned earlier, using asynchronous reset impacts RAM inference, and
should be avoided.
• Xilinx Parameterizable Macros (XPMs)
Advantages:
○ Portable between Xilinx device families
○ Fast simulation
Disadvantages:
○ Limited to supported XPM options
XPMs are built on inference using fixed templates that you cannot modify. Therefore, they can
guarantee QoR and can support features that standard inference does not. When standard
inference does not support the features required, Xilinx recommends you use XPMs instead.
Note: When you compile simulation libraries using compile_simlib, XPMs are automatically
compiled. For more information, see the Vivado Design Suite User Guide: Logic Simulation (UG900).
Advantages:
○ Highest level control over implementation
Disadvantages:
○ Less portable code
Disadvantages:
○ Less portable code
○ Core management
Using an output register is required for high performance designs, and is recommended for all
designs. This improves the clock to output timing of the block RAM. Additionally, a second
output register is beneficial, as slice output registers have faster clock to out timing than a
block RAM register. Having both registers has a total read latency of 3. When inferring these
registers, they should be in the same level of hierarchy as the RAM array. This allows the tools
to merge the block RAM output register into the primitive.
• Using the Input Pipeline Register
When RAM arrays are large and mapped across many primitives, they can span a considerable
area of the die. This can lead to performance issues on address and control lines. Consider
adding an extra register after the generation of these signals and before the RAMs. To further
improve timing, use phys_opt_design later in the flow to replicate this register. Registers
without logic on the input will replicate more easily.
Figure 28: RAM with Extra Read Register for Block RAM Output Register Inference
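The structure named in this caption can be approximated by the following sketch, which is not the original figure. It assumes a simple dual-port memory with hypothetical names and sizes. The RAM array and both output registers sit in the same module, and the read latency is 3 clock cycles.

```verilog
// Minimal sketch: inferred block RAM with two registers after the read.
// Names and sizes are hypothetical.
module bram_sdp_pipelined #(
    parameter AWIDTH = 12,
    parameter DWIDTH = 36
) (
    input  wire              clk,
    input  wire              we,
    input  wire [AWIDTH-1:0] waddr,
    input  wire [DWIDTH-1:0] wdata,
    input  wire [AWIDTH-1:0] raddr,
    output reg  [DWIDTH-1:0] rdata
);
    (* ram_style = "block" *) reg [DWIDTH-1:0] mem [0:(1<<AWIDTH)-1];
    reg [DWIDTH-1:0] ram_q;   // synchronous read of the array
    reg [DWIDTH-1:0] oreg_q;  // candidate for the block RAM output register

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
    end

    // No reset, no feedback, and a single fanout on each stage.
    always @(posedge clk) begin
        ram_q  <= mem[raddr];
        oreg_q <= ram_q;
        rdata  <= oreg_q;     // candidate for a slice register
    end
endmodule
```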
Certain deviations from these examples can prevent the inference of the output register.
Figure 30: Multiple Fanout Preventing Block RAM Output Register Inference
The following figure highlights an example of what to avoid to ensure correct inference of RAMs
and output registers.
Figure 32: Check for the Presence of Feedback on Registers Around the RAM Block
UltraRAM can be used in your design using one of the following methods:
The following code example shows the instantiation of XPM memory and is available in the HDL
Language templates. Highlighted parameters MEMORY_PRIMITIVE and READ_LATENCY are the
key parameters to infer memory as UltraRAM for high performance.
Larger memories are mapped to an UltraRAM matrix consisting of multiple UltraRAM cells
configured as row x column structures.
A matrix can be created with single or multiple columns based on the depth. The current default
threshold for UltraRAM column height is 8 and it can be controlled with the attribute
CASCADE_HEIGHT.
The difference between single column and multiple column UltraRAM matrix is as follows:
• Single column UltraRAM matrix uses the built-in hardware cascade without fabric logic.
• Multiple column UltraRAM matrix uses built-in hardware cascade within each column, plus
some fabric logic for connecting the columns. Extra pipelining may be required to maintain
performance. This is inferred by increasing the read latency. The Vivado tools automatically
pack these registers into UltraRAM as required.
The preceding example uses a 32 K x 72 memory configuration, which uses eight UltraRAMs. To
increase performance of the UltraRAM, more pipelining registers should be added to the cascade
chain. This is achieved by increasing the read latency integer.
For more information on inferring UltraRAM in Vivado synthesis, see this link in the Vivado
Design Suite User Guide: Synthesis (UG901).
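As an alternative to the XPM flow, the following sketch infers a single-port memory with the ram_style attribute set to "ultra" and a parameterized number of output pipeline stages. The module name, port names, and the NBPIPE parameter are assumptions, and the sketch does not reproduce the XPM example referenced above. Whether the tools absorb the extra stages into the UltraRAM cascade depends on the device and tool version.

```verilog
// Minimal sketch: inferred single-port UltraRAM, 32K x 72, with extra
// read pipeline stages. Names and NBPIPE are hypothetical.
module uram_sp #(
    parameter AWIDTH = 15,   // 32K words
    parameter DWIDTH = 72,
    parameter NBPIPE = 3     // output pipeline stages, must be >= 1
) (
    input  wire              clk,
    input  wire              mem_en,
    input  wire              we,
    input  wire [AWIDTH-1:0] addr,
    input  wire [DWIDTH-1:0] din,
    output wire [DWIDTH-1:0] dout
);
    (* ram_style = "ultra" *) reg [DWIDTH-1:0] mem [0:(1<<AWIDTH)-1];
    reg [DWIDTH-1:0] memreg;
    reg [DWIDTH-1:0] mem_pipe [0:NBPIPE-1];
    integer i;

    // No-change mode: a write does not update the read register.
    always @(posedge clk) begin
        if (mem_en) begin
            if (we)
                mem[addr] <= din;
            else
                memreg <= mem[addr];
        end
    end

    // Read latency is NBPIPE + 1 clock cycles.
    always @(posedge clk) begin
        mem_pipe[0] <= memreg;
        for (i = 1; i < NBPIPE; i = i + 1)
            mem_pipe[i] <= mem_pipe[i-1];
    end

    assign dout = mem_pipe[NBPIPE-1];
endmodule
```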
• Multiplication
• Addition and subtraction
• Comparators
• Counters
• General logic
The DSP blocks are highly pipelined blocks with multiple register stages allowing for high-speed
operation while reducing the overall power footprint of the resource. Xilinx recommends that
you fully pipeline the code intended to map into the DSP48, so that all pipeline stages are
utilized. For a function to map properly to this resource, it cannot contain a set
condition.
DSP48 slice registers within Xilinx devices contain only resets, and not sets. Accordingly, unless
necessary, do not code a set (value equals logic 1 upon an applied signal) around multipliers,
adders, counters, or other logic that can be implemented within a DSP48 slice. Additionally, avoid
asynchronous resets, since the DSP slice only supports synchronous reset operations. Code
resulting in sets or asynchronous resets may produce suboptimal results in terms of area,
performance, or power.
Many DSP designs are well-suited for the Xilinx architecture. To obtain best use of the
architecture, you must be familiar with the underlying features and capabilities so that design
entry code can take advantage of these resources.
The DSP48 blocks use a signed arithmetic implementation. Xilinx recommends code using signed
values in the HDL source to best match the resource capabilities and, in general, obtain the most
efficient mapping. If unsigned bus values are used in the code, the synthesis tools may still be
able to use this resource, but might not obtain the full bit precision of the component due to the
unsigned-to-signed conversion.
If the target design is expected to contain a large number of adders, Xilinx recommends that you
evaluate the design to make greater use of the DSP48 slice pre-adders and post-adders. For
example, with FIR filters, the adder cascade can be used to build a systolic filter rather than using
multiple successive add functions (adder trees). If the filter is symmetric, you can evaluate using
the dedicated pre-adder to further consolidate the function into fewer LUTs, flip-flops, and
DSP slices (in most cases, half the resources).
If adder trees are necessary, the 6-input LUT architecture can efficiently create ternary addition
(A + B + C = D) using the same amount of resources as a simple 2-input addition. This can help
conserve carry logic resources. In many cases, there is no need to use these techniques.
By knowing these capabilities, the proper trade-offs can be acknowledged up front and
accounted for in the RTL code to allow for a smoother and more efficient implementation from
the start. In most cases, Xilinx recommends inferring DSP resources.
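A ternary addition of the form A + B + C = D can be written as a single expression, as in the following minimal sketch. The module name and width are assumptions for illustration, and whether the tool maps the expression to a single ternary adder depends on the device and synthesis settings.

```verilog
// Hypothetical registered three-input adder.
module ternary_add #(
    parameter W = 16
) (
    input  wire         clk,
    input  wire [W-1:0] a,
    input  wire [W-1:0] b,
    input  wire [W-1:0] c,
    output reg  [W+1:0] d   // two extra bits hold the full-precision sum
);
    always @(posedge clk)
        d <= a + b + c;     // operands are extended to W+2 bits by the assignment context
endmodule
```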
For more information about the features and capabilities of the DSP48 slice, and how to best
leverage this resource for your design needs, see the 7 Series DSP48E1 Slice User Guide (UG479)
and UltraScale Architecture DSP Slice User Guide (UG579).
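In general, a shift register is characterized by some or all of the following control and data
signals:
• Clock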
• Serial input
• Asynchronous set/reset
• Synchronous set/reset
• Synchronous/asynchronous parallel load
• Clock enable
• Serial or parallel output
Xilinx devices contain dedicated SRL16 and SRL32 resources (integrated in LUTs). These allow
efficiently implemented shift registers without using flip-flop resources. However, these elements
support only LEFT shift operations, and have a limited number of I/O signals:
• Clock
• Clock Enable
• Serial Data In
• Serial Data Out
In addition, SRLs have address inputs (A3, A2, A1, A0 inputs for SRL16) determining the length of
the shift register. The shift register can be a fixed static length or can be dynamically adjusted. In
dynamic mode, each time a new address is applied to the address pins, the new bit position value
is available on the Q output after the time delay to access the LUT.
Synchronous and asynchronous set/reset control signals are not available in the SRL primitives.
However, if your RTL code includes a reset, the Xilinx synthesis tool infers additional logic around
the SRL to provide the reset functionality.
To obtain the best performance when using SRLs, Xilinx recommends that you implement the last
stage of the shift register in the dedicated slice register. Slice registers have a better clock-to-out
time than SRLs. This allows additional slack for the paths sourced by the shift register logic.
Synthesis tools automatically infer this register unless this resource is instantiated or the
synthesis tool is prevented from inferring this type of register because of attributes or cross-
hierarchy boundary optimization restrictions. To infer the extra register, register the dynamically
delayed signal separately in the RTL.
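The following Verilog sketch shows one way to register a dynamically delayed signal separately, as described above. The module and signal names are assumptions for illustration. The shift register has no reset, and the tap selected by addr is captured in its own register so that the final stage can be placed in a slice flip-flop.

```verilog
// Hypothetical dynamic-length shift register with a separate output register.
module dyn_srl #(
    parameter AW = 4                // address width: 4 gives a 16-deep shift register
) (
    input  wire          clk,
    input  wire          ce,
    input  wire          din,
    input  wire [AW-1:0] addr,      // selects the tap (dynamic length)
    output reg           dout
);
    reg [(1<<AW)-1:0] sr;           // no reset, so it can map to an SRL

    always @(posedge clk) begin
        if (ce)
            sr <= {sr[(1<<AW)-2:0], din};  // left shift
        dout <= sr[addr];                  // last stage in a regular register
    end
endmodule
```

With ce held High, the delay from din to dout in this sketch is addr + 2 clock cycles.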
Xilinx recommends that you use the HDL coding styles represented in the Vivado Design Suite
HDL Templates.
When using registers to obtain placement flexibility in the chip, turn off SRL inference using the
following attribute:
SHREG_EXTRACT = "no"
For more information about synthesis attributes and how to specify the attributes in the HDL
code, see the Vivado Design Suite User Guide: Synthesis (UG901).
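As an illustration, the attribute can be applied to a register chain in Verilog as sketched below. The module name, signal names, and stage count are assumptions for illustration only.

```verilog
// Hypothetical pipeline kept in flip-flops for placement flexibility.
module pipe_no_srl #(
    parameter STAGES = 4
) (
    input  wire clk,
    input  wire din,
    output wire dout
);
    // Prevent synthesis from mapping this chain into an SRL
    (* shreg_extract = "no" *) reg [STAGES-1:0] pipe = {STAGES{1'b0}};

    always @(posedge clk)
        pipe <= {pipe[STAGES-2:0], din};

    assign dout = pipe[STAGES-1];
endmodule
```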
Any inferred SRL, memory, or other synchronous element may also have an initial state defined
that will be programmed into the associated element upon configuration.
Xilinx highly recommends that you initialize all synchronous elements accordingly. Initialization of
registers is completely inferable by all major device synthesis tools. This lessens the need to add
a reset for the sole purpose of initialization, and makes the RTL code more closely match the
implemented design in functional simulation, as all synchronous elements start with a known
value in the device after configuration.
Initial state of the registers and latches VHDL coding example one:
Initial state of the registers and latches Verilog coding example two:
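A minimal Verilog sketch of register initialization follows. It is not the guide's original listing; the module name, signal names, and initial values are assumptions for illustration.

```verilog
// Hypothetical example of initial register values defined in RTL.
module init_state_example (
    input  wire       clk,
    input  wire       ce,
    output wire [7:0] count_out,
    output wire       flag_out
);
    // Initial values are programmed at configuration, so no reset is
    // needed for the sole purpose of initialization.
    reg [7:0] count = 8'hA5;
    reg       flag  = 1'b1;

    always @(posedge clk) begin
        if (ce) begin
            count <= count + 1'b1;
            flag  <= ~flag;
        end
    end

    assign count_out = count;
    assign flag_out  = flag;
endmodule
```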
For example, if an SRL is instantiated and is part of a long path, this path might become a
bottleneck. The SRL has a longer clock-to-out delay than a regular register. To preserve the area
reduction provided by the SRL while improving its clock-to-out performance, an SRL of one delay
less than the actual desired delay is created, with the last stage implemented in a regular flip-flop.
With instantiation, you have total control over the synthesis tool. For example, to achieve better
performance, you can implement a comparator using only LUTs, instead of the combination of
LUT and carry chain elements usually chosen by the synthesis tool.
Sometimes instantiation may be the only way to make use of the complex resources available in
the device. This can be due to:
• Consider the Vivado Design Suite language templates when writing common Verilog and
VHDL behavioral constructs or if necessary instantiating the desired primitives.
RECOMMENDED: Identify high fanout nets using the report_high_fanout_nets Tcl command
after synthesis. Monitor the impact of these nets on design performance as you progress through the
implementation process.
Most synthesis tools use a fanout threshold limit to automatically determine whether to
duplicate a register. Lowering this global threshold allows automatic duplication of high fanout
nets. However, it does not allow control over which registers are duplicated or how their loads
are grouped. In addition, the global replication mechanism does not assess timing slack
accurately, which can lead to unnecessary replicated cells, logic utilization increase, and
potentially higher power consumption.
For high frequency designs, a better approach to reducing fanout is to use a balanced tree for the
high fanout signals. Consider manually replicating registers based on the design hierarchy,
because the cells included in a hierarchy are often placed together. For example, in the balanced
reset tree shown in the following figure, the high fanout reset FF RST2 is replicated in RTL to
balance the fanout across the different modules. If required, physical synthesis can perform
further replication to improve WNS based on placement information.
TIP: To preserve the duplicate registers in synthesis, use a KEEP attribute instead of DONT_TOUCH. A
DONT_TOUCH attribute prevents further optimization during physical optimization later in the
implementation flow.
Note: If a LUT1 rather than a register is replicated, it indicates that an attribute or constraint is applied
incorrectly.
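Manual replication with a KEEP attribute might look like the following Verilog sketch. The module name, port names, and the choice of three copies are assumptions for illustration. Each copy of the second reset stage drives one block, and KEEP stops synthesis from merging the equivalent registers while still allowing later physical optimization.

```verilog
// Hypothetical reset generator with a manually replicated second stage.
module rst_gen (
    input  wire clk,
    input  wire rst_in,       // assumed to be synchronous to clk already
    output wire rst_block_c,
    output wire rst_block_d,
    output wire rst_block_e
);
    reg rst1 = 1'b0;
    (* keep = "true" *) reg rst2_a = 1'b0;  // copy for block_C
    (* keep = "true" *) reg rst2_b = 1'b0;  // copy for block_D
    (* keep = "true" *) reg rst2_c = 1'b0;  // copy for block_E

    always @(posedge clk) begin
        rst1   <= rst_in;
        rst2_a <= rst1;
        rst2_b <= rst1;
        rst2_c <= rst1;
    end

    assign rst_block_c = rst2_a;
    assign rst_block_d = rst2_b;
    assign rst_block_e = rst2_c;
endmodule
```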
[Figure: Balanced reset tree. The reset register RST2 in rst_gen_inst is replicated so that block_C, block_D, and block_E are each driven by a separate copy with a lower fanout.]
RECOMMENDED: Using MAX_FANOUT attributes on global high fanout signals leads to suboptimal
replication similar to when the global fanout limit is lowered in synthesis. For this reason, Xilinx
recommends only using MAX_FANOUT inside the hierarchies on local signals with medium to low fanout.
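For a local signal inside a hierarchy, the attribute can be set in Verilog as in the following sketch. The module name, signal names, and the limit of 32 are assumptions for illustration and are not recommended values.

```verilog
// Hypothetical local enable with a MAX_FANOUT limit.
module local_enable #(
    parameter W = 64
) (
    input  wire         clk,
    input  wire         en_in,
    input  wire [W-1:0] din,
    output reg  [W-1:0] dout
);
    // Local, medium-fanout control signal: synthesis may replicate en_r
    // so that each copy drives no more than 32 loads.
    (* max_fanout = 32 *) reg en_r = 1'b0;

    always @(posedge clk) begin
        en_r <= en_in;
        if (en_r)
            dout <= din;
    end
endmodule
```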
Do not replicate registers used for synchronizing signals that cross clock domains. The presence
of the ASYNC_REG attribute on these registers prevents the tool from replicating these registers.
If the synchronizing chain has a very high fanout and replication is needed to meet timing, add an extra
register after the synchronization chain that does not have the ASYNC_REG constraint.
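The extra register after a synchronizer can be sketched in Verilog as follows. The module and signal names are assumptions for illustration. Only the two synchronizer stages carry ASYNC_REG, so the third register remains free to be replicated by the tools.

```verilog
// Hypothetical two-stage synchronizer followed by a replicable register.
module sync_with_fanout_reg (
    input  wire dst_clk,
    input  wire async_in,
    output reg  sync_out
);
    (* ASYNC_REG = "TRUE" *) reg sync_0 = 1'b0;
    (* ASYNC_REG = "TRUE" *) reg sync_1 = 1'b0;

    initial sync_out = 1'b0;

    always @(posedge dst_clk) begin
        sync_0   <= async_in;
        sync_1   <= sync_0;
        sync_out <= sync_1;  // no ASYNC_REG here: this register may be replicated
    end
endmodule
```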
The following table provides guidelines on the number of fanouts that might be acceptable for
your design.
| Condition | Fanout > 5000 | Fanout > 200 | Fanout > 100 |
|---|---|---|---|
| Low Frequency (1 to 125 MHz) | Few logic levels between synchronous logic, <13 levels of logic at maximum frequency | N/A | N/A |
| Medium Frequency (125 to 250 MHz) | If the design does not meet timing, you might need to reduce fanout and/or logic levels. | <6 levels of logic at maximum frequency. (Driver and load types impact performance.) | N/A |
| High Frequency (> 250 MHz) | Not recommended for most designs. | Small number of logic levels is typically necessary for higher speeds. | Advanced pipelining methods required. Careful logic replication. Compact functions. Low logic levels required. (Driver and load types impact performance.) |
TIP: If the timing reports indicate that high-fanout signals are limiting the design performance, consider
replicating the signals using the implementation tool options, such as
opt_design -hier_fanout_limit, place_design, and phys_opt_design.
TIP: When replicating registers, consider using a naming convention for the registers, such as
<original_name>_a, <original_name>_b, etc., to make it easier to understand intent of the
replication and easier to maintain the RTL code.
Pipelining Considerations
Another way to increase performance is to restructure long datapaths with several levels of logic
and distribute them over multiple clock cycles. This method allows for a faster clock cycle and
increased data throughput at the expense of latency and pipeline overhead logic management.
Because devices contain many registers, the additional registers and overhead logic are usually
not an issue. However, the datapath spans multiple cycles, and you must make special
considerations for the rest of the design to account for the added path latency.
Identifying pipelining opportunities early in the design can often significantly improve timing
closure, implementation runtime (due to easier-to-solve timing problems), and device power (due
to reduced switching of logic).
[Figure: Before Pipelining. A source register and a destination register clocked by Slow_Clock, with four LUT levels between them.]
Use one of the following methods to ensure that your design uses pipeline registers correctly:
• In your RTL code, add the registers before or after the logic to be retimed, preferably within
the hierarchy.
• Use the Vivado synthesis global retiming or BLOCK_SYNTH.RETIMING option, which
analyzes the timing of a path and moves the registers to improve timing, if possible.
• Alternatively, for more control, use the retiming_forward and retiming_backward
synthesis attributes. You can add these attributes on specific registers to force the tool to
retime through combinational logic regardless of the timing score of the logic. For more
information on these attributes, see the Vivado Design Suite User Guide: Synthesis (UG901).
The following figure shows the pipelining after adding extra registers.
[Figure: Pipelining after adding extra registers. Five registers and four LUT levels, still clocked by Slow_Clock.]
The following figure shows the same data path as the Before Pipelining diagram after pipelining
and retiming. Because each flip-flop is contained in the same slice as the function generator, the clock speed is
limited by the clock-to-out time of the source flip-flop, the logic delay through one level of logic,
one routing delay, and the setup time of the destination register. In this example, the system
clock runs faster after pipelining and retiming than in the original design.
[Figure: Pipelining After Retiming. Five registers and four LUT levels clocked by Fast_Clock.]
Following is a code example that shows how to use the retiming attributes to force the specific
retiming shown in the Pipelining After Retiming figure.
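The original listing is not reproduced here. As a hedged illustration of the attribute syntax only, the following Verilog sketch places retiming_backward on one register so that synthesis can move it back into the preceding logic. The module name, signal names, and the logic function are assumptions and do not reproduce the exact structure in the figure.

```verilog
// Hypothetical example of a retiming attribute on a pipeline register.
module retime_example (
    input  wire        clk,
    input  wire [15:0] a,
    input  wire [15:0] b,
    input  wire [15:0] c,
    input  wire [15:0] d,
    output reg  [15:0] q
);
    reg [15:0] a_r, b_r, c_r, d_r;

    // Ask synthesis to move this register backward through the logic,
    // closer to the registers that source it.
    (* retiming_backward = 1 *) reg [15:0] sum_r;

    always @(posedge clk) begin
        a_r   <= a;
        b_r   <= b;
        c_r   <= c;
        d_r   <= d;
        sum_r <= (a_r & b_r) + (c_r ^ d_r);  // several logic levels before sum_r
        q     <= sum_r;
    end
endmodule
```

The retiming_forward attribute is applied the same way on a register that should move toward the elements it drives.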
To determine whether a design requires pipelining, identify the frequency of the clocks and the
amount of logic distributed across each of the clock groups. You can use the
report_design_analysis Tcl command with the -logic_level_distribution option
to determine the logic-level distribution for each of the clock groups.
TIP: The design analysis report also highlights the number of paths with zero logic levels, which you can
use to determine where to make modifications in your code.
Balance Latency
To balance the latency by adding pipeline stages, add the stage to the control path and not the
data path. The data path includes wider buses, which increases the number of flip-flop and
register resources used.
For example, if you have a 128-bit data path, 2 stages of registers, and a requirement of 5 cycles
of latency, inserting 3 register stages results in an extra 3 x 128 = 384 flip-flops. Alternatively,
you can add registers to the control logic that enables the data path. Use 5 stages of single-bit registers
to control the enable signal of the datapath flip-flops, and apply multicycle path timing exceptions
accordingly.
Note: This example is only possible for certain designs. For example, in cases where there is a fanout from
the intermediate data path flip-flops, having only 2 stages does not work.
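The control-path approach can be sketched in Verilog as follows. This is one possible arrangement under stated assumptions, not the guide's implementation: the module and signal names are hypothetical, the inversion stands in for the real datapath logic, and in_valid is assumed to pulse no more often than once every 4 cycles so that stage1 holds its value until it is consumed. The matching multicycle path exception must be written separately in XDC and is not shown.

```verilog
// Hypothetical 5-cycle latency with 2 data stages and a 1-bit enable pipeline.
module latency_balance #(
    parameter W = 128
) (
    input  wire         clk,
    input  wire         in_valid,   // assumed: at most one pulse every 4 cycles
    input  wire [W-1:0] din,
    output wire         out_valid,
    output reg  [W-1:0] dout
);
    reg [4:0]   en_pipe = 5'b00000; // 5 single-bit control stages
    reg [W-1:0] stage1;

    always @(posedge clk) begin
        en_pipe <= {en_pipe[3:0], in_valid};
        if (in_valid)
            stage1 <= din;          // data stage 1
        if (en_pipe[3])
            dout <= ~stage1;        // data stage 2 (placeholder logic)
    end

    assign out_valid = en_pipe[4];  // asserted 5 clock edges after in_valid is sampled
endmodule
```

This sketch uses 2 x 128 datapath flip-flops plus 5 control flip-flops instead of 5 x 128 datapath flip-flops.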
RECOMMENDED: The optimal LUT:FF ratio in a device is 1:1. Designs with significantly more FFs will
increase unrelated logic packing into slices, which will increase routing complexity and can degrade QoR.
There are multiple ways to infer SRLs during synthesis, including the following:
• SRL
• REG -> SRL
• SRL -> REG
• REG -> SRL -> REG
You can create these structures using the srl_style attribute in the RTL code as follows:
• (* srl_style = "srl" *)
• (* srl_style = "reg_srl" *)
• (* srl_style = "srl_reg" *)
• (* srl_style = "reg_srl_reg" *)
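In context, the attribute is placed on the shift register declaration, as in the following sketch. The module name, signal names, and depth are assumptions for illustration.

```verilog
// Hypothetical shift register inferred as REG -> SRL -> REG.
module srl_style_example #(
    parameter DEPTH = 8
) (
    input  wire clk,
    input  wire din,
    output wire dout
);
    (* srl_style = "reg_srl_reg" *) reg [DEPTH-1:0] sr = {DEPTH{1'b0}};

    always @(posedge clk)
        sr <= {sr[DEPTH-2:0], din};  // no reset, so the middle stages can map to an SRL

    assign dout = sr[DEPTH-1];
endmodule
```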
A common mistake is to use different enable/reset control signals in deeper pipeline stages.
Following is an example of a reset used in a 9-deep pipeline stage with the reset connected to
the third, fifth, and eighth pipeline stages. With this structure, the tools map the pipeline stages
to registers only, because there is no reset pin on the SRL primitive.
Note: If there are many paths with 0/1 levels of logic, check to make sure this is intentional.
Auto-Pipelining Considerations
The auto-pipelining feature allows the placer to determine the number of required pipeline
stages and their optimal location, which helps timing closure across interface boundaries. You can
enable this feature by setting up the auto-pipelining mode of the AXI Register Slice core or by
applying the auto-pipelining HDL attribute or XDC constraints for data buses. Because the
insertion is timing-driven, always be sure to apply proper timing constraints on the targeted
paths. For more information, see the Vivado Design Suite User Guide: Implementation
(UG904).
The following example shows auto-pipelining applied on the interface between the module
data01 and data12. The output from data01 consists of registers with no control sets.
Following is the RTL code for this example. The autopipeline_module attribute is applied on the
hierarchical module data01, and the autopipeline_group/autopipeline_limit/autopipeline_include
attributes are applied on the nets directly driven by the Q pins of the registers.
(* autopipeline_module="yes" *)
module data_reg_ap # (
  parameter integer C_DATA_WIDTH = 32
)
( input wire clk,
  input wire [C_DATA_WIDTH-1:0] datain,
  (* autopipeline_group="fwd",autopipeline_limit=24 *)
  output reg [C_DATA_WIDTH-1:0] datareg
);
  // Plain output register with no control set
  always @(posedge clk) begin
    datareg <= datain;
  end
endmodule
Following are the XDC constraints for this example, which is an alternative approach to using
attributes in the RTL code.
if (reg1)
val = reg_in1;
else if (reg2)
val = reg_in2;
else if (reg3)
val = reg_in3;
else val = reg_in4;
The following example highlights the different structures that can be generated to achieve your
requirements. Synthesis can limit the cascading of the block RAM for the performance/power
trade-off using the CASCADE_HEIGHT attribute. The usage and arguments for the attribute are
described in the Vivado Design Suite User Guide: Synthesis (UG901).
The following figure shows an example of 8Kx32 memory configuration for higher performance
(timing).
In this implementation, all block RAMs are always enabled (for each read or write) and consume
more power.
The following figure shows an example of cascading all the block RAMs for low power.
In this implementation, because one block RAM at a time is selected (from each unit), the
dynamic power contribution is almost half. Block RAMs have a dedicated cascade MUX and
routing structure that allows the construction of wide, deep memories requiring more than one
block RAM primitive to be built in a very power efficient configuration.
The following figure shows an example of how to limit the cascading and gain both power and
performance at the same time, often with no trade-off in performance.
Because two block RAMs are selected at a time in this implementation, the dynamic power
contribution is better than for the high performance structure, but not as good as for the low
power structure. The advantage of this structure over the low power structure is that it
uses only two block RAMs in the cascaded path, which has less impact on the target frequency than
the four block RAMs in the critical path of the low power structure.
ram_decomp = "power"
cascade_height = 4
Figure 43: Generated Structure for 32x16K Memory Configuration Example Using
CASCADE_HEIGHT and RAM_DECOMP Attributes
[Figure: Four cascaded chains of four 32x1K block RAMs each, with 32-bit outputs combined through a 4:1 MUX.]
The following RTL code example shows the use of the CASCADE_HEIGHT and RAM_DECOMP
attributes.
Figure 44: RTL Code for 32x16K Memory Configuration Using the CASCADE_HEIGHT and
RAM_DECOMP Attributes
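The figure's listing is not reproduced here. The following Verilog sketch shows how the two attributes could be attached to an inferred 32x16K memory. The module name, port names, and the read-first behavior are assumptions for illustration.

```verilog
// Hypothetical 16K x 32 single-port RAM with decomposition attributes.
module ram_16kx32 (
    input  wire        clk,
    input  wire        we,
    input  wire [13:0] addr,
    input  wire [31:0] din,
    output reg  [31:0] dout
);
    // Decompose for power and limit each cascade chain to 4 block RAMs
    (* ram_decomp = "power", cascade_height = 4 *)
    reg [31:0] mem [0:16383];

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        dout <= mem[addr];  // synchronous read
    end
endmodule
```

Removing cascade_height from the attribute list leaves only the RAM_DECOMP behavior.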
If you apply only the ram_decomp = "power" attribute, 16 RAMB36E2 are inferred and the
memory is decomposed as follows:
Figure 45: Generated Structure for 32x16K Memory Configuration Using the
RAM_DECOMP Attribute
[Figure: Two cascaded chains of eight 32x1K block RAMs each (positions 0 to 7), with 32-bit outputs combined through a 2:1 MUX.]
The following RTL code example shows the use of the RAM_DECOMP attribute.
Figure 46: RTL Code for 32x16K Memory Configuration Using the RAM_DECOMP
Attribute
If you use only the RAM_DECOMP attribute, the overall power savings is similar to using both
the RAM_DECOMP and CASCADE_HEIGHT attributes together, because only one block RAM is
active at a time. Creating a 4-deep cascaded block RAM chain is better for performance when
compared to an 8-deep cascaded block RAM chain.
For more information, see the Vivado Design Suite User Guide: Synthesis (UG901).
Clocking Guidelines
Each device architecture has some dedicated resources for clocking. Understanding the clocking
resources for your device architecture can allow you to plan your clocking to best utilize those
resources. Most designs might not need you to be aware of these details. However, if you can
control the placement and have a good idea of the fanout on each of the clocking domains, you
can explore alternatives based on the following clocking details. If you decide to exploit any of
these clocking resources, you need to explicitly instantiate the corresponding clocking element.
UltraScale devices feature smaller clock regions of a fixed size across devices, and the clock
regions no longer span half of the device width in the horizontal direction. The number of clock
regions per row varies per UltraScale device. Each clock region contains a clock network routing
that is divided into 24 vertical and horizontal routing tracks and 24 vertical and horizontal
distribution tracks. The following figure shows a device with 36 clock regions (6 columns x 6
rows). The equivalent 7 series device has 12 clock regions (2 columns x 6 rows).
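[Figure: UltraScale device with 36 clock regions, showing the CLB, DSP, and block RAM columns, the clocking and I/O columns, and the GTH/GTY columns.]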
The clocking architecture is designed so that only the clock resources necessary to connect clock
buffers and loads for a given placement are used, and no resource is wasted in clock regions with
no loads. The efficient clock resource utilization enables support for more design clocks in the
architecture while improving clock characteristics for performance and power. Following are the
main categories of clock types and associated clock structures grouped by their driver and use:
Clock Primitives
Most clocks enter the device through a global clock-capable I/O (GCIO) pin. These clocks directly
drive the clock network via a clock buffer or are transformed by a PLL or MMCM located in the
clock management tile (CMT) adjacent to the I/O column.
○ 1 MMCM
○ 8 BUFGCTRLs
○ 4 BUFGCE_DIVs
Note: Clocking resources in CMTs that are adjacent to I/O columns with unbonded I/Os are available for
use.
The GT user clocks drive the global clock network via BUFG_GT buffers. There are 24 BUFG_GT
buffers per clock region adjacent to the GTH/GTY columns.
Following is summary information for each of the UltraScale device clock buffers:
• BUFGCE
The most commonly used buffer is the BUFGCE. This is a general clock buffer with a clock
enable/disable feature equivalent to the 7 series BUFHCE.
• BUFGCE_DIV
The BUFGCE_DIV is useful when a simple division of the clock is required. It is considered
easier to use and more power efficient than using an MMCM or PLL for simple clock division.
When used properly, it can also show less skew between clock domains as compared to an
MMCM or PLL when crossing clock domains. The BUFGCE_DIV is often used as replacement
for the BUFR function in 7 series devices. However, because the BUFGCE_DIV can drive the
global clock network, it is considered more capable than the BUFR component.
• BUFGCTRL (also BUFGMUX)
The BUFGCTRL can be instantiated as a BUFGMUX and is generally used when multiplexing
two or more clock sources to a single clock network. As with the BUFGCE and BUFGCE_DIV,
it can drive the clock network for either regional or global clocking.
• BUFG_GT
When using clocks generated by GTs, the BUFG_GT clock buffer allows connectivity to the
global clock network. In most cases, the BUFG_GT is used as a regional buffer with its loads
placed in one or two adjacent clock regions. The BUFG_GT has built-in dynamic clock division
capability that you can use in place of an MMCM for clock rate changes.
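As an illustration of simple clock division with a BUFGCE_DIV, the following Verilog sketch instantiates the primitive with a divide value of 2. The wrapper module name, port names, and the divide value are assumptions for illustration. See the libraries guide for your device for the full parameter list.

```verilog
// Hypothetical wrapper that divides a clock by 2 with a BUFGCE_DIV.
module clk_div_example (
    input  wire clk_in,    // assumed to come from a clock-capable source
    input  wire ce,        // buffer enable
    input  wire clr,       // asynchronous clear of the divider
    output wire clk_div2
);
    BUFGCE_DIV #(
        .BUFGCE_DIVIDE(2)  // divide value is illustrative
    ) bufgce_div_inst (
        .I  (clk_in),
        .CE (ce),
        .CLR(clr),
        .O  (clk_div2)
    );
endmodule
```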
You can use the Clock Utilization Report in the Vivado IDE to visually analyze clocking resource
utilization and clock routing. The following figure shows the clock resource utilization per clock
region overlaid in the Device window. For more information on this report, see the Vivado Design
Suite User Guide: Design Analysis and Closure Techniques (UG906).
For more information on the BUFGCE, BUFGCE_DIV, and BUFGCTRL buffers, see the UltraScale
Architecture Clocking Resources User Guide (UG572). For details on connectivity and use of the
BUFG_GT buffer, see the appropriate UltraScale Architecture Transceiver User Guide:
Note: A global clock net is assigned to a specific track ID in the device for all the vertical, horizontal routing,
and distribution resources the clock uses. A clock cannot change track IDs unless the clock goes through
another clock buffer.
Figure 49: BUFGCE, BUFGCE_DIV, and BUFGCTRL Shared Inputs and Output
Multiplexing
[Figure: BUFGCE_X0Y0 through BUFGCE_X0Y5, BUFGCE_DIV_X0Y0, BUFGCTRL_X0Y0, and BUFGCTRL_X0Y1 with their shared inputs and their output multiplexing onto routing tracks 0 to 23.]
• From the clock buffer to the clock root, the clock signal goes through one or several segments
of vertical and horizontal routing. Each segment must use the same track ID (between 0 and
23).
• At the clock root, the clock signal transitions from the routing track to the distribution track
with the same track ID. To reduce skew, the clock root is usually in the clock region located in
the center of the clock window. The clock window is the rectangular area that includes all the
clock regions where the clock net loads are placed. For skew optimization reasons, the Vivado
IDE might move the clock root off center.
• From the clock root to the CLB columns where the loads are located, the clock signal travels
on the vertical distribution (both up and down the device as needed) and then onto the
horizontal distribution (both to the left and right as needed).
• The CLB columns are split into two halves, which are located above and below the horizontal
distribution resources. Each half of the CLB column contains several leaf clock routing
resources that can be reached by any of the horizontal distribution tracks.
In some cases, a clock buffer can directly drive onto the clock distribution track. This usually
happens when the clock root is located in the same clock region as the clock buffer or when the
clock buffer only drives non-clock pins (for example, high fanout nets).
Because clock routing resources are segmented, only the routing and distribution segments used
to traverse a clock region or to reach a load in a clock region are consumed.
The following figure shows how a clock buffer located in clock region X2Y1 reaches its loads
placed inside the clock window, which is formed by a rectangle of clock regions from X1Y3 to
X5Y5.
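[Figure: Clock buffer in clock region X2Y1 driving a clock window that spans clock regions X1Y3 to X5Y5, with clock region X3Y4 labeled at the center of the window.]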
In the following figure, a routed device view shows an example of a global clock that spans most
of the device. The clock buffer driving the network is marked in blue in clock region X2Y0 and
drives onto the horizontal routing in that clock region. The net then transitions from the
horizontal routing onto the vertical routing in clock region X2Y0 reaching the clock root in clock
region X2Y5. All clock routing is marked in blue. The clock root is marked in red in the clock
region X2Y5. From the clock root in X2Y5, the net transitions onto the vertical distribution and
then the horizontal distribution to the clock leaf pins. The distribution layer and the leaf clock
routing resources in the CLB columns are marked in red.
Any placer error at this phase is due to conflicting connectivity rules, user constraints, or
both. The log file shows extensive information about the possible root cause of the error,
which you must review in detail to make the appropriate design or constraint change.
2. SLR partitioning (SSI technology devices only) and global placement
The placer performs the initial clock tree implementation based on early driver and load
placements. Each clock net is associated with a clock window. The excessive overlap of clock
windows can lead to placer errors due to anticipated clock routing contention.
When a clock partitioning error occurs, the log file shows the last clock budgeting solution
for each clock net as well as the number of unique clock nets present in each clock region.
Review the log file in detail to determine which clocks to remove from the overutilized clock
regions. You can remove clocks using the following methods:
• Reduce the number of clocks in the design by combining identical synchronous clocks,
removing unnecessary MMCM feedback clocks, or consolidating lower fanout clocks with
high fanout clocks.
• Move clock primitives to different clock regions, especially those without connectivity-
based placement rules.
• Add floorplanning constraints on clock loads to keep clocks with smaller fanout closer to
their driver or away from the highly utilized clock regions.
The placer refines the clock tree implementation several times to help improve timing QoR.
For example, during the later placement optimization phases, the placer analyzes each
challenging clock to determine a better clock root location.
3. Clock tree pre-routing
The placer guides the subsequent implementation steps and provides accurate delay
estimates for post-place timing analysis.
After placement, the Vivado tools can modify the clock tree implementation as follows:
• The Vivado physical optimizer can replicate and move cells to clock regions without
associated clocks.
• The Vivado router can make adjustments to improve timing QoR and legalize the clock
routing.
The following table summarizes the placement rules for the main clock topologies and how
constraints affect these rules.
Clocking Capability
Clock planning must be based on the total number of high fanout clocks and low fanout clocks in
the target device.
Note: Using more than 24 clocks in a design might cause issues that require special design considerations
or other up-front planning.
IMPORTANT! In ZHOLD and BUF_IN compensation modes, the MMCM feedback clock path matches the
CLKOUT0 clock path in terms of routing track, clock root location, and distribution tracks. Therefore, the
feedback clock can be considered a high fanout clock when the clock buffer and clock root are far apart.
In some cases, the placer is expected to identify a low fanout clock but fails. This can be caused
by design size, device size, or physical XDC constraints, such as a LOC constraint or Pblock,
which prevent the placer from placing the loads in a local area. To address this issue, you might
need to guide the tool by manually creating a Pblock or modifying the existing physical
constraints.
Clocks driven by BUFG_GTs are an example of a low fanout clock. The Vivado placer
automatically identifies these clock nets and contains the loads to the clock regions adjacent to
the GT interface. The following figure shows a low fanout clock contained in two clock regions
with the BUFG_GT driver shown in red.
TIP: To contain a low fanout clock to a single clock region, you can use the CLOCK_LOW_FANOUT XDC
constraint.
• 24 clocks or fewer
Unless conflicting user constraints exist, all clocks can be treated as high fanout clocks
without risking placement or routing contention.
• Almost 300 clocks
For a design that targets a device with 6 clock region rows and includes only low fanout clocks
with each clock included in 3 clock regions at most, the following number of clocks can be
supported: 6 rows x 2 clock windows per row x 24 clocks per region = 288 clocks.
Low fanout clock windows do not have a fixed size but are usually between 1 and 3 clock
regions. High fanout clocks rarely span the entire device or an entire SLR.
The following method shows how to balance high fanout clocks and low fanout clocks, assuming
that a few low fanout clocks come from I/O interfaces and most from GT interfaces. You can
apply the same method for each SSI technology device SLR.
○ Up to 24 for SSI technology devices (assuming some high fanout clocks are only present in
1 SLR)
• Low fanout clocks
○ Up to 12 plus 8 per utilized GT Quad
Clock Constraints
Physical XDC constraints drive the implementation of clock trees and control the use of high
fanout clocking resources. Because UltraScale device clocking is more flexible than clocking with
previous architectures and includes additional architectural constraints, it is important to
understand how to properly constrain your clocks for implementation.
The clock buffers directly connected to the MMCM or PLL outputs and the input clock ports
connected to the MMCM or PLL inputs are automatically placed in the same clock region. If
an input clock port and an MMCM or PLL are directly connected and constrained to different
clock regions, you must manually insert a clock buffer and set a CLOCK_DEDICATED_ROUTE
constraint on the net connected to the MMCM or PLL.
• On a GT*_CHANNEL or IBUFDS_GT* cell
The BUFG_GTs driven by the cell are placed in the same clock region.
CAUTION! Xilinx does not recommend using LOC constraints on the clock buffer cells. This method
forces the clock onto a specific track ID, which can result in placement that cannot be legally routed.
Only use LOC constraints to place high fanout clock buffers in UltraScale devices when you understand
the entire clock tree of the design and when placement is consistent in the design. Even after taking
these precautions, collisions might occur during implementation due to design or constraint changes.
You can also use a CLOCK_REGION constraint to provide guidance on the placement of
cascaded clock buffers or clock buffers driven by non-clocking primitives, such as fabric logic.
In the following example, the XDC constraint assigns the clkgen/clkout2_buf clock buffer to
the CLOCK_REGION X2Y2.
Note: In most cases, the clock buffers are directly driven by input clock ports, MMCMs, PLLs, or
GT*_CHANNELs that are already constrained to a clock region. If this is the case, the clock buffers are
automatically placed in the same clock region, and you do not need to use the CLOCK_REGION constraint.
Note: Xilinx does not recommend using a Pblock for a single clock region.
Figure 54: USER_CLOCK_ROOT Applied on the Net Segment Driven by the Clock Buffer
After placement, you can use the CLOCK_ROOT property to query the actual clock root as
shown in the following example. The CLOCK_ROOT reports the assigned root whether it was
user assigned or automatically assigned by the Vivado tools.
Another way to review the clock root assignments of your implemented design is to use the
report_clock_utilization Tcl command. For example:
report_clock_utilization -clock_roots_only
Note: When working with UltraScale devices, do not apply the CLOCK_DEDICATED_ROUTE property to
the net driven directly by a port. Instead, apply the CLOCK_DEDICATED_ROUTE property to the output
of the IBUF.
When driving from a clock buffer in one clock region to an MMCM or PLL in a vertically adjacent
clock region, set the CLOCK_DEDICATED_ROUTE to BACKBONE for 7 series devices or to
SAME_CMT_COLUMN for UltraScale devices for optimal results. This can prevent
implementation errors and ensures that the clock is routed with global clock resources only. In
some cases, the placer can legally place two or fewer MMCMs or PLLs in vertically adjacent clock
regions without setting the CLOCK_DEDICATED_ROUTE constraint. In cases where the placer
can find a legal solution for MMCMs or PLLs in vertically adjacent clock regions without the
CLOCK_DEDICATED_ROUTE constraint, it is possible that the resulting solution is sub-optimal
for your design. The following example and figure show a central clock buffer driving two PLLs in
vertically adjacent clock regions above and below.
set_property CLOCK_DEDICATED_ROUTE SAME_CMT_COLUMN [get_nets -of [get_pins BUFG_inst_0/O]]
set_property LOC PLLE3_ADV_X0Y0 [get_cells PLLE3_ADV_inst_0]
set_property LOC PLLE3_ADV_X0Y4 [get_cells PLLE3_ADV_inst_1]
When driving from a clock buffer to other clock regions that are not vertically adjacent, you must
set the CLOCK_DEDICATED_ROUTE to FALSE for 7 series devices or to ANY_CMT_COLUMN
for UltraScale devices. This prevents implementation errors and ensures that the clock is routed
with global clock resources only. The following example and figure show a BUFGCE driving two
PLLs that are not located on the same clock region column as the input buffer.
set_property CLOCK_DEDICATED_ROUTE ANY_CMT_COLUMN [get_nets -of [get_pins BUFG_inst_0/O]]
set_property LOC PLLE3_ADV_X1Y0 [get_cells PLLE3_ADV_inst_0]
set_property LOC PLLE3_ADV_X1Y4 [get_cells PLLE3_ADV_inst_1]
Note: The CLOCK_LOW_FANOUT constraint takes lower precedence when used with other clocking
constraints. If CLOCK_LOW_FANOUT is in conflict with other clock constraints, such as
USER_CLOCK_ROOT, CLOCK_DELAY_GROUP, or CLOCK_DEDICATED_ROUTE, CLOCK_LOW_FANOUT
is not obeyed.
The following example shows the CLOCK_LOW_FANOUT constraint applied to a list of flip-flops
that are used as part of a clock gating synchronization circuit to control the clock enable of a
global clock buffer.
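A constraint of this kind might look like the following sketch. The register name clk_gate_sync_reg is hypothetical and stands for the flip-flops of the clock gating synchronization circuit.

```tcl
# Mark the synchronizer flip-flops as low-fanout clock loads so that opt_design
# can isolate them on a parallel global clock buffer.
# clk_gate_sync_reg[*] is a hypothetical register name; braces protect the brackets from Tcl.
set_property CLOCK_LOW_FANOUT TRUE [get_cells {clk_gate_sync_reg[*]}]
```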
In the design, an always-on clock network initially drives more than 2000 loads, including the
flip-flops that are part of the clock gating synchronization circuit used to clock gate other logic.
The following schematics show the clock gating synchronization circuit and additional logic
connected to the always-on clock network before and after opt_design creates a new parallel
global clock buffer to isolate the clock gating synchronization circuit.
The Device window of the fully implemented design shows the clock gating synchronization
circuit with green markers along with the always-on logic and clock-gated logic. The clock gating
synchronization circuit is placed in the same CLOCK_REGION as the MMCM, close to the global
clock buffers.
Figure 60: Fully Implemented Design with Placement of Clock Gating Synchronization
Circuit
If you set the CLOCK_LOW_FANOUT property on a clock net segment directly driven by a global
clock buffer and the fanout of the global clock buffer is fewer than 2000 loads, the placement of
the loads is contained to a single clock region.
The following example shows the CLOCK_LOW_FANOUT constraint applied to the clock net
segment directly driven by a global clock buffer. The clock network drives fewer than 2000 loads
and is contained to a single clock region. The input clock port, clkIn, has a PACKAGE_PIN
assignment to a GCIO located in the CLOCK_REGION X2Y0 and drives a PLLE3_ADV. The
PLLE3_ADV drives a global clock buffer that subsequently drives the clock network with 1379
loads. The loads of the global clock buffer are all placed in the CLOCK_REGION X2Y0.
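As a sketch, the constraint for this case is applied to the net rather than to the load cells. The buffer instance name clkOut0_bufg_inst is an assumption made for illustration.

```tcl
# Apply CLOCK_LOW_FANOUT to the net segment directly driven by the global clock buffer.
# clkOut0_bufg_inst is a hypothetical instance name for the buffer driven by the PLLE3_ADV.
set_property CLOCK_LOW_FANOUT TRUE [get_nets -of_objects [get_pins clkOut0_bufg_inst/O]]
```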
When the parallel clock buffers are directly driven by the same input clock port, MMCM, PLL,
or GT*_CHANNEL, the buffers are always placed in the same clock region as their driver
regardless of the netlist changes or logic placement variation.
• Match the insertion delays between parallel branches of the clock tree.
Xilinx recommends parallel buffers over cascaded clock buffers, especially when there are
synchronous paths between the branches. When using cascaded buffers, the clock insertion
delay is not matched between the branches of the clock trees even when using the
CLOCK_DELAY_GROUP or USER_CLOCK_ROOT constraints. This can result in high clock
skew, which makes timing closure challenging if not impossible.
The following figure shows three parallel BUFGCE buffers driven by the MMCM CLKOUT0 port.
However, you can use cascaded clock buffers to achieve the following:
• Route the clock to another clock buffer located in a different clock region.
This method is typical when using a clock multiplexer for clocks generated by MMCMs
located in different clock regions. Although one of the MMCMs can directly drive the
BUFGCTRL (BUFGMUX), the other MMCM requires an intermediate clock buffer to route the
clock signal to the other region. The following figure shows an example.
Figure: Cascaded clock buffers routing a clock from Clock Region 1 to Clock Region 2
• Balance the number of clock buffer levels across the clock tree branches when there is a
synchronous path between those branches.
For example, consider an MMCM clock called clk0 that drives both group A (sequential cells
driven via a BUFGCTRL located in a different clock region) and group B (sequential cells). To
better match the delay between the branches, insert a BUFGCE for group B and place it in the
same clock region as the BUFGCTRL. This ensures that the synchronous paths between group
A and group B have a controlled amount of skew. The following figure shows an example.
Note: The Vivado logic optimization command opt_design is not aware of the timing relationship
between timing clocks and clock network branches. As a result, opt_design removes as many
cascaded or redundant clock buffers as possible. In this example, opt_design removes
BUFGCE_inst_1 unless you set a DONT_TOUCH="TRUE" property on it. If there are only asynchronous
paths between the clock tree branches, the branches do not need to be balanced as long as there is
proper synchronization circuitry on the receiving clock domain.
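The property mentioned in the note can be set in XDC as sketched below, using the BUFGCE_inst_1 instance name from the note.

```tcl
# Keep opt_design from removing the balancing buffer inserted for group B.
set_property DONT_TOUCH TRUE [get_cells BUFGCE_inst_1]
```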
Figure 64: Balancing Clock Trees for Synchronous Paths Between Clock Regions
To reduce the variation of insertion delays and skew, Xilinx recommends the following when
using cascaded clock buffers:
Note: If absolutely required, Xilinx recommends using two cascaded BUFGCTRLs instead of cascaded
BUFGCEs. Using dedicated routing, you can cascade two adjacent BUFGCTRLs with minimum delay when
both BUFGCTRLs are placed inside the same clock region.
Clock Multiplexing
You can build a clock multiplexer using a combination of parallel and cascaded BUFGCTRLs. The
placer finds the optimal placement based on the clock buffer site availability. If possible, the
placer places BUFGCTRLs in adjacent sites to take advantage of the dedicated cascade paths. If
that is not possible, the placer will attempt to place the BUFGCTRLs from the same level in the
adjacent clock regions.
The following figure shows a 4:1 MUX with balanced cascading. The first level of BUFGCTRL
buffers are both placed in the directly adjacent sites (X0Y2, X0Y0) of the last BUFGCTRL (X0Y1).
This configuration ensures a comparable insertion delay for all the clocks reaching the last
BUFGCTRL. You can use an equivalent structure for a 3:1 MUX.
When creating a 5:1 or larger clock MUX structure, it is common to create a symmetrical clock
structure as shown in the following figure. However, this is a suboptimal solution, because each
BUFGCTRL only has one cascade path to the two adjacent BUFGCTRLs, which does not provide
minimal delay for all connections between the BUFGCTRLs.
To support larger clock multiplexers (from 5:1 to 8:1 MUX), Xilinx recommends using cascaded
BUFGCTRL buffers as shown in the following figure. This figure shows an optimal 8:1 MUX that
uses 7 BUFGCTRL buffers.
Note: When using wide BUFGCTRL-based clock multiplexers, the clock insertion delays cannot be
balanced because some paths are longer than other paths in hardware. Therefore, this method is
recommended for multiplexing asynchronous clocks only.
When the MMCM compensation is set to ZHOLD or BUF_IN, the placer assigns the same clock
root to the nets driven by the feedback buffer and by all buffers directly connected to the
CLKOUT0 pin. This ensures that the insertion delays are matched so that the I/O ports and the
sequential cells connected to CLKOUT0 are phase-aligned and hold time is met at the device
interface. The Vivado tools consider all the loads of these nets to optimally define the clock root.
The Vivado tools do not automatically match the insertion delay with the other MMCM outputs.
To match the insertion delay for the nets driven by other MMCM output buffers, use the
following properties:
• CLOCK_DELAY_GROUP
Apply the same CLOCK_DELAY_GROUP property value to the nets directly driven by the
feedback clock buffer, the CLKOUT0 buffers, and the other MMCM output buffers as needed.
This is the preferred method.
• USER_CLOCK_ROOT
If you need to force a specific clock root, use the same USER_CLOCK_ROOT property value
on the nets driven by the feedback clock buffer, the CLKOUT0 buffers, and the other MMCM
output buffers as needed.
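The two properties above could be applied as in the following sketch. The net names, the group name, and the clock region are hypothetical, and in practice you would normally pick one of the two methods for a given set of nets.

```tcl
# Hypothetical nets driven by the feedback buffer, the CLKOUT0 buffer, and a CLKOUT1 buffer.
set mmcm_nets [get_nets {clkfb_buf_net clkout0_buf_net clkout1_buf_net}]

# Preferred method: let the tool match insertion delays within one named group.
set_property CLOCK_DELAY_GROUP mmcm0_grp $mmcm_nets

# Alternative: force the same clock root on every net (X2Y3 is an example region).
# set_property USER_CLOCK_ROOT X2Y3 $mmcm_nets
```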
BUFG_GT Divider
The BUFG_GT buffers can drive any loads in the fabric and include an optional divider you can
use to divide the clock from the GT*_CHANNEL. This eliminates the need to use an extra
MMCM or BUFGCE_DIV to divide the clock.
SelectIO Clocking
The UltraScale device SelectIO primitives have maximum skew requirements between clock pins.
Using the optimal clocking topology for the SelectIO primitives prevents maximum skew
violations, improves interface timing between the UltraScale device and the fabric logic, and uses
fewer clocking resources.
In the following figure, the left side shows a suboptimal configuration that uses the CLKOUT0B
output of the MMCM. The right side of the figure shows the optimal configuration that uses the
local inversion on the CLK_B and CB pins of the ISERDESE3 and IDDRE1. Using the optimal
configuration guarantees that the maximum skew requirement is met while using fewer global
clock resources.
Figure 68: Suboptimal to Optimal Clocking Topologies for ISERDESE3 and IDDRE1
OSERDESE3 Clocking
For OSERDESE3 clocking in UltraScale and UltraScale+ devices, maximum skew requirements
exist between the high-speed clock and divided clock pins. To meet the maximum skew
requirements, Xilinx recommends using parallel global clock buffers where one of the global clock
buffers is a BUFGCE_DIV. This removes the additional clock uncertainty between the two
outputs of the MMCM.
In the following figure, the left side shows a suboptimal configuration that uses two separate
outputs of the MMCM. The right side of the figure shows the optimal configuration that uses a
single MMCM output and the BUFGCE_DIV cell, which provides the divided clock using the
BUFGCE_DIVIDE property.
Note: The high-speed clock does not need to be driven by a BUFGCE. Alternatively, you can use
BUFGCE_DIV with a BUFGCE_DIVIDE property setting of 1.
However, if a CCIO drives an MMCM configured in ZHOLD mode in addition to another MMCM,
logic optimization will attempt to legalize the clock routing to the MMCMs by inserting a BUFG
after the CCIO. Because the MMCM with ZHOLD compensation is no longer driven directly by a
CCIO, the compensation is changed to BUF_IN. To avoid this, ensure that the CCIO drives the
MMCM configured in ZHOLD mode directly and drives the additional MMCM through a BUFG.
In addition, set the CLOCK_DEDICATED_ROUTE property for the net driven by the BUFG to
ANY_CMT_COLUMN.
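One way to confirm that the compensation mode was not changed is to read it back after opt_design, as in this sketch. The instance name BUFG_inst is hypothetical, and the filter assumes the MMCM primitive names begin with MMCM.

```tcl
# Route the BUFG-driven clock to the second MMCM without a dedicated-column restriction.
# BUFG_inst is a hypothetical instance name.
set_property CLOCK_DEDICATED_ROUTE ANY_CMT_COLUMN [get_nets -of_objects [get_pins BUFG_inst/O]]

# After opt_design, list the compensation mode of every MMCM in the design.
foreach mmcm [get_cells -hierarchical -filter {REF_NAME =~ MMCM*}] {
    puts "$mmcm: [get_property COMPENSATION $mmcm]"
}
```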
Because the clock insertion delay varies with the clock root locations and the clock root
placement depends on placement of the loads, there might be variability between runs. This
variability affects the timing inside the device as well as the I/O timing.
When dealing with high-frequency I/Os, you might want more control over the I/O timing and
less variability between runs. One way to achieve this is to force the clock root placement. You
can run the tool in automated mode and look at the clock root region. If the I/O timing is
satisfactory, you can force the clock root placement on the buffer nets associated with I/O
timing. To determine the placement of the clock roots, use the report_clock_utilization
[-clock_roots_only] Tcl command.
In the following example, the I/O ports are located in the X0Y0 region. The Vivado placer
determined the placement of the clock roots in X1Y2 based on the I/O placement as well as
placement of other loads.
The following summary shows the I/O timing when the clock root is unconstrained.
In the following example, the clock roots are moved next to the I/O registers in X0Y0, which
reduces the clock insertion delays and timing pessimism and therefore improves the I/O timing.
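Forcing the clock root in this way might be written as in the sketch below. The net names are hypothetical and stand for the buffer-driven nets that clock the I/O registers.

```tcl
# Force the clock root of the I/O-related clock nets into the region holding the I/O registers.
# io_clk_buf_net and io_clkfb_buf_net are hypothetical net names.
set_property USER_CLOCK_ROOT X0Y0 [get_nets {io_clk_buf_net io_clkfb_buf_net}]

# After re-running placement, confirm the assignment.
report_clock_utilization -clock_roots_only
```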
Figure 72: Clock Utilization Summary with User Constrained Clock Root
The following summary shows the I/O timing when the clock root is moved.
Synchronous CDC
When the design includes synchronous CDC paths between clocks that originate from the same
MMCM/PLL, you can use the following techniques to better control the clock insertion delays
and skew and therefore, the slack on those paths.
IMPORTANT! If the CDC paths are between clocks that originate from different MMCM/PLLs, the clock
insertion delays across the MMCMs/PLLs are more difficult to control. In this case, Xilinx recommends that
you treat these clock domain crossings as asynchronous and make design changes accordingly.
When a path is timed between two clocks that originate from different output pins of the same
MMCM/PLL, the MMCM/PLL phase error adds to the clock uncertainty for the path. For designs
using high clock frequencies, the phase error can cause issues with timing closure both for setup
and hold.
The following figure shows an example of paths both with and without the phase error. Path 1 is
a CDC path clocked by two buffers connected to the same MMCM output and does not include
the phase error. Path 2 is clocked by two clocks that originate from two different MMCM
outputs and does include the phase error.
When two synchronous clocks from the same MMCM/PLL have a simple period ratio (/2 /4 /8),
you can prevent the phase error between the two clock domains using a single MMCM/PLL
output connected to two BUFGCE_DIV buffers. The BUFGCE_DIV buffer performs the clock
division (/1 /2 /4 /8). Other ratios are possible (/3 /5 /6 /7), but they require modifying the
clock duty cycle, which makes mixed-edge timing paths more challenging.
Note: Because the BUFGCE and BUFGCE_DIV do not have the same cell delays, Xilinx recommends using
the same clock buffer for both synchronous clocks (two BUFGCE or two BUFGCE_DIV buffers).
The following figure shows two BUFGCE_DIVs that divide the CLKOUT0 clock by 1 and by 2
respectively.
IMPORTANT! To ensure safe timing between parallel BUFGCE_DIV cells where the BUFGCE_DIVIDE
property is set to a value greater than 1, both buffers must use the same enable signal (CE) and the same
reset signal (RST). Otherwise, the divided clocks might become phase shifted from one another in
hardware, which is not reported by the Vivado tools.
Figure 75: MMCM Synchronous CDC with BUFGCE_DIVs Connected to One MMCM
Output
To automatically balance several clocks that originate from the same MMCM or PLL, set the
same CLOCK_DELAY_GROUP property value on the nets driven by the clock buffers that need
to be balanced. Following are additional recommendations:
• Avoid setting the CLOCK_DELAY_GROUP constraint on too many clocks, because this
stresses the clock placer resulting in suboptimal solutions or errors.
• Review the critical synchronous CDC paths in the Timing Summary Report to determine which
clocks must be delay matched to meet timing.
• Limit the use of the CLOCK_DELAY_GROUP on groups of synchronous clocks with tight
requirements and with identical clocking topologies.
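To decide which clocks need delay matching, the synchronous CDC paths between a pair of clocks can be reported directly, as in this sketch. The clock names clk_a and clk_b are hypothetical.

```tcl
# Summarize which clock pairs have timed paths between them.
report_clock_interaction

# Report the worst setup and hold paths from clk_a to clk_b (hypothetical clock names).
report_timing -from [get_clocks clk_a] -to [get_clocks clk_b] -delay_type min_max -max_paths 10
```

If the reported paths fail mainly because of clock skew, the nets of those two clocks are candidates for a shared CLOCK_DELAY_GROUP value.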
IMPORTANT! Xilinx recommends using the Clocking Wizard for creating optimal clocking structures,
which use a mix of BUFGCEs and BUFGCE_DIVs along with related clock grouping constraints.
GT Interface Clocking
Each GT interface requires several clocks, including some clocks that are shared across bonded
GT*_CHANNEL cells located in one or several GT quads. UltraScale devices provide up to 128
GT*_CHANNEL sites, which can lead to the use of several hundreds of clocks in a design. Most
GT clocks have a low fanout with loads placed locally in the clock region next to the associated
GT*_CHANNEL. Some GT clocks drive loads across the entire device and require the utilization
of clock routing resources in many clock regions. The UltraScale architecture includes the
following enhancements to efficiently support the high number of GT clocks required.
You can use the BUFG_GT global clock buffer for GT interfaces where the user logic operates at
half the clock frequency of the internal PCS logic or for PCIe interfaces where the
GT*_CHANNEL needs to generate multiple clock frequencies for user_clk, sys_clk, and pipe_clk.
The following figure compares clocking requirements between 7 series and UltraScale devices for
a single-lane GT interface where the frequency of TXUSRCLK2 is equal to half of the frequency
of TXUSRCLK.
Figure: Single-lane GT clocking in 7 Series Devices (MMCM used for divide) compared with UltraScale Devices (BUFG_GT used for divide)
You can use any output clock of the GT*_CHANNELs within a Quad or any reference clock
generated by an IBUFDS_GTE3/ODIV2 pin within a Quad to drive any of the 24 BUFG_GT
buffers located in the same clock region. A BUFG_GT_SYNC is always required to synchronize
reset and clear of BUFG_GTs driven by a common clock source.
Note: The Vivado tools automatically insert the BUFG_GT_SYNC primitive if it is not present in the design.
Some applications still require the use of an MMCM to generate complex non-integer clock
division of the GT output clocks or the IBUFDS_GTE3/ODIV2 reference clock. In these cases, a
BUFG_GT must directly drive the MMCM. By default, the placer tries to place the MMCM on the
same clock region row as the BUFG_GT. If other MMCMs try to use the same MMCM site, you
must verify that the automated MMCM placement is still as close as possible to the BUFG_GT to
avoid wasting clocking resources due to long routes.
The following figure shows a multi-quad interface. The GT*_CHANNELs are marked in yellow, the
TXUSRCLK is highlighted in blue, and the TXUSRCLK2 is highlighted in red. The BUFG_GTs
driving both TXUSRCLK and TXUSRCLK2 are located in the center quad and are marked in blue
and red.
If the GT interface is contained within a single Quad, the placer treats the BUFG_GT clocks as
local clocks. In this case, the placer attempts to place the BUFG_GT clock loads in the clock
regions horizontally adjacent to the BUFG_GT, starting with the clock region that contains the
BUFG_GT and potentially using up to half the width of the device.
To override the placer regional clock constraint, assign any of the BUFG_GT clock loads to a
Pblock. The following figure shows a single-quad interface. The GT*_CHANNELs are marked in
yellow, the TXUSRCLK is highlighted in blue, and the TXUSRCLK2 is highlighted in red. All the
TXUSRCLK2 loads are placed in the same clock region as the GT*_CHANNELs.
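A Pblock assignment that overrides the regional placement might look like the following sketch. The Pblock name, the cell name, and the clock region range are all hypothetical.

```tcl
# Create a Pblock that spans more clock regions than the default regional constraint allows.
create_pblock pblock_gt_user
resize_pblock [get_pblocks pblock_gt_user] -add {CLOCKREGION_X0Y2:CLOCKREGION_X3Y2}

# Assign the user logic clocked by the BUFG_GT to the Pblock (hypothetical cell name).
add_cells_to_pblock [get_pblocks pblock_gt_user] [get_cells gt_user_logic]
```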
RECOMMENDED: To avoid skew violations, Xilinx highly recommends following this clocking topology
when [RT]XUSRCLK2 operates at half the frequency of [RT]XUSRCLK.
• Assigns the BUFG_GTs that drive the three PCIe clocks in groups to the upper or lower 12
BUFG_GTs in a Quad
• Assigns the clock root for all three clocks to the same clock region
Note: For more information on PCIe clocking requirements, see the UltraScale Devices Gen3 Integrated
Block for PCI Express LogiCORE IP Product Guide (PG156).
Virtex-6 and Virtex-7 devices contain thirty-two global clock buffers known as BUFGs. BUFGs
can serve most clocking needs for designs with less demanding needs in terms of number of
clocks, design performance, and clocking control. Global clocking resources include BUFG,
BUFGCE, BUFGMUX, and BUFGCTRL primitives, which each have their own features. For more
information on the features of these global clock components, see the Clocking Resources Guide
(7 Series FPGAs Clocking Resources User Guide (UG472) or UltraScale Architecture Clocking
Resources User Guide (UG572)) and Libraries Guide (Vivado Design Suite 7 Series FPGA and
Zynq-7000 SoC Libraries Guide (UG953) or UltraScale Architecture Libraries Guide (UG974)) for your
device.
RECOMMENDED: If clocking demands exceed the number of BUFGs, or if better overall clocking
characteristics are desired, analyze the clocking needs against the available clocking resources, and select
the best resource for the task.
In addition to global clocking resources, regional clocking resources are also available, which
allow tighter control of clock networks. Regional clocking resources include the Horizontal Clock
Region Buffers (BUFH, BUFHCE), Regional Clock Buffer (BUFR), I/O Clock Buffer (BUFIO), and
Multi-Regional Clock Buffer (BUFMR). For more information on the features of these regional
clock components, see the Clocking Resources Guide (7 Series FPGAs Clocking Resources User
Guide (UG472) or UltraScale Architecture Clocking Resources User Guide (UG572)) and Libraries
Guide (Vivado Design Suite 7 Series FPGA and Zynq-7000 SoC Libraries Guide (UG953) or UltraScale
Architecture Libraries Guide (UG974)) for your device.
Figure: A BUFHCE with an Enable (CE) input driving gated logic and a BUFG driving non-gated logic from the same clock
When used independently, all loads connected to the BUFH must reside in the same clock
region. This makes it well-suited for very high-speed, more fine-grained (fewer loads) clocking
needs. BUFHCE can be used to achieve medium-grained clock-gating within the specific clock
region. You must ensure that the resources driven by the BUFH do not exceed the available
resources in the clock region and that no other conflicts exist.
The phase relationship might be different between the BUFH and clock domains driven by
BUFGs, other BUFHs, or any other clocking resource. The single exception is when two BUFHs
drive horizontally adjacent regions. In this case, when both BUFHs are driven by the same clock
source, the skew between the left and right clock regions is tightly controlled, and data can
safely cross the two BUFH clock domains. BUFHs can be used to
gain access to MMCMs or PLLs in opposite regions to a clock input or GT. However, care must be
taken in this approach to ensure that the MMCM or PLL is available.
In terms of global clocking, for designs requiring sixteen or fewer global clocks (BUFGs), no
additional considerations are necessary. The tools automatically assign BUFGs in a way to avoid
any possible contention. When more than 16 (but fewer than 32) BUFGs are required, some
consideration must be given to pin selection and placement to avoid resource contention
caused by global clocking line contention and/or the placement of clock loads.
As in all other Xilinx 7 series devices, Clock-Capable I/Os (CCIOs) and their associated Clock
Management Tile (CMT) have restrictions on the BUFGs they can drive within the given SLR.
CCIOs in the top or bottom half of the SLR can drive BUFGs only in the top or bottom half of the
SLR (respectively). For this reason, pin and associated CMT selection should be done in a way in
which no more than sixteen BUFGs are required in either the top or bottom half of all SLRs
collectively. In doing so, the tools can automatically assign all BUFGs in a way to allow all clocks
to be driven to all SLRs without contention.
For designs that require more than 32 global clocks, Xilinx recommends that you explore using
BUFRs and BUFHs for smaller clock domains to reduce the number of needed global clock
domains. BUFRs, when used with a BUFMR, can drive resources within three clock regions, which
encompass one-half of an SLR (approximately 250,000 logic cells in a Virtex-7 class SLR).
Horizontally adjacent clock regions can have both left and right BUFH buffers driven in a
low-skew manner, enabling a clocking domain of one-third of an SLR (approximately 167,000 logic
cells).
Using these resources when possible not only leads to fewer considerations for clocking resource
contention, but many times improves overall placement, resulting in improved performance and
power.
If more than 32 global clocks are needed that must drive more than half of an SLR or multiple
SLRs, it is possible to segment the BUFG global clocking spines. Isolation buffers exist on the
vertical global clock lines at the periphery of the SLRs that allow use of two BUFGs in different
SLRs that occupy the same vertical global clocking track without contention. To make use of this
feature, more user control and intervention is required. In the figure below, BUFG0 through
BUFG2 in the three SLRs have been isolated, and hence have independent clocks within their
respective SLRs. On the other hand, the BUFG31 line has not been isolated. Hence, the same
BUFG31 (located in SLR2 in the figure) drives the clock lines in all three SLRs, and the BUFG31
buffers located in the other SLRs should be disabled.
Careful selection and manual placement (LOCs) must be used for the BUFGs. Additionally, all
loads for each clock domain must be manually grouped and placed in the appropriate SLR to
avoid clocking contention. If all global clocks are placed and all loads managed in a way to not
create any clocking contention and allow the clock to reach all loads, this can allow greater use of
the global clocking resources beyond 32.
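The manual placement described above can be sketched as follows for a 7 series SSI device. The instance names, the BUFGCTRL site, the Pblock name, and the clock region range are hypothetical and must be taken from your own device and design.

```tcl
# Lock a BUFG to a specific site inside the SLR whose clock spine it should use.
# bufg_slr1_clk0 and the site index are hypothetical.
set_property LOC BUFGCTRL_X0Y32 [get_cells bufg_slr1_clk0]

# Confine every load of that clock to the same SLR with a Pblock.
# The clock region range below is an example, not a real SLR boundary.
create_pblock pblock_slr1_clk0
resize_pblock [get_pblocks pblock_slr1_clk0] -add {CLOCKREGION_X0Y3:CLOCKREGION_X1Y5}
add_cells_to_pblock [get_pblocks pblock_slr1_clk0] [get_cells slr1_clk0_domain]
```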
Figure: BUFG sites in each SLR between the interposer boundaries, with BUFG0 through BUFG2 isolated per SLR and the BUFG31 line shared across SLRs
Even with that extra action, the Xilinx timing tools account for these differences as a part of the
timing report. During path analysis, these aspects are analyzed as a part of the setup and hold
calculations, and are reported as a part of the path delay against the specified requirements. No
additional user calculations or consideration are necessary for SSI technology devices, because
the timing analysis tools consider these factors in their calculations.
Skew can increase when using the top or bottom SLR, because the delay differential is higher
between points that are farther away from each other. For this reason, Xilinx recommends placing
global clocks that must drive more than one SLR in the center SLR. This allows a more even
distribution of the overall clocking network across the part, resulting in less overall clock skew.
When targeting UltraScale devices, there is less repercussion to clock placement. However, it is
still highly suggested to place the clock source as close as possible to the central point of the
clock loads to reduce clock insertion delay and improve clock power.
Inference
Without user intervention, Vivado synthesis automatically specifies a global buffer (BUFG) for all
clock structures up to the maximum allowed in an architecture (unless otherwise specified or
controlled by the synthesis tool). As discussed above, the BUFG provides a well-controlled, low-
skew network suitable for most clocking needs. Nothing additional is required unless your design
clocking exceeds the number or capabilities of BUFGs in the part.
Applying additional control of the clocking structure, however, may prove to show better
characteristics in terms of jitter, skew, placement, power, performance, or other characteristics.
Using synthesis constraints allows this type of control without requiring any modification to the
code.
• Directly in the HDL code, which allows them to persist in the code
• As constraints in the XDC file, which allows this control without any changes needed to the
source HDL code
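As a sketch of the XDC approach, the CLOCK_BUFFER_TYPE synthesis property can steer which buffer is inferred for a clock port. The port names below are hypothetical.

```tcl
# Ask synthesis to infer a regional buffer instead of the default BUFG (7 series example).
# rx_clk is a hypothetical top-level clock port.
set_property CLOCK_BUFFER_TYPE BUFR [get_ports rx_clk]

# Suppress automatic buffer inference on a port whose buffer is instantiated by hand.
# aux_clk is a hypothetical top-level clock port.
set_property CLOCK_BUFFER_TYPE NONE [get_ports aux_clk]
```

The same property can instead be written as an attribute in the HDL source when the setting should travel with the code.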
Use of IP
Certain IP assists in the creation of the clocking structures. Clocking Wizard and IO Wizard
specifically can assist in the selection and creation of the clocking resources and structure,
including:
• BUFG
• BUFGCE
• BUFGCE_DIV (UltraScale devices)
• BUFGCTRL
• BUFIO (7 series devices)
• BUFR (7 series devices)
• Clock modifying blocks such as:
○ Mixed Mode Clocking Manager (MMCM)
More complex IP, such as PCIe or Transceivers Wizard IP, might also include clocking structures
as part of the overall IP. This might provide additional clocking resources if properly taken into
account. If not taken into account, it might limit some clocking options for the remainder of the
design.
Xilinx highly recommends that, for any instantiated IP, the clocking requirements, capabilities,
and resources are well understood and leveraged where possible in other portions of the design.
Related Information
Instantiation
The most low-level and direct method of controlling clocking structures is to instantiate the
desired clocking resources into the HDL design. This allows you to access all possible capabilities
of the device and exercise absolute control over them. When using BUFGCE, BUFGMUX,
BUFHCE, or other clocking structure that requires extra logic and control, instantiation is
generally the only option. However, even for simple buffers, sometimes the quickest way to
obtain a desired result is to be direct and instantiate it into your design.
An effective style to manage clocking resources (especially when instantiating) is to contain the
clocking resources in a separate entity or module instantiated at the top or near the top of the
code. By having it at the top-level of code, it may more easily be distributed to multiple modules
in your design.
Be aware of where clocking resources can and should be shared. Creating redundant clocking
resources not only wastes resources but also generally consumes more power and creates more
potential conflicts and placement decisions, resulting in longer overall implementation tool
compile times and potentially more complex timing situations. This is another reason why having
the clocking resources near the top module is important.
TIP: You can use Vivado HDL templates to instantiate specific clocking primitives.
Related Information
To use the MMCM or PLL, several attributes must be coordinated to ensure that the MMCM is
operating within specifications and delivering the desired clocking characteristics on its output.
For this reason, Xilinx highly recommends that you use the Clocking Wizard to properly configure
this resource.
You can also directly instantiate the MMCM or PLL, which allows even greater control. However,
be sure to use the proper settings to avoid causing the following issues:
IMPORTANT! When using the Clocking Wizard to configure the MMCM or PLL, the Clocking Wizard
by default attempts to configure the MMCM for low output jitter using reasonable power
characteristics.
Depending on your goals, you can change the settings in the Clocking Wizard to further minimize
jitter and thus, improve timing at the cost of higher power. Alternatively, you can reduce power
but increase output jitter.
• Do not leave any inputs floating. Relying on synthesis or other optimization tools to tie off the
floating values is not recommended, because the values might be different than expected.
• Connect RST to the user logic, so that it can be asserted by logic controlled by a reliable
clocking source. Grounding of RST can cause problems if the clock is interrupted.
• Use LOCKED output in the implementation of reset. For example, hold the synchronous logic
clocked from the PLL in reset until LOCKED is asserted. The LOCKED signal must be
synchronized before it is used in a synchronous portion of the design. Xilinx recommends
adding LOCKED to a processor map so it is visible when debugging.
• Confirm the connectivity between CLKFBIN and CLKFBOUT. The BUFG only needs to be
included in the feedback path if the PLL/MMCM output clock needs to be phase aligned with
the input reference clock, for example, when using ZHOLD compensation mode.
• To avoid the MMCM or PLL phase error timing penalty on synchronous clock domain crossing
paths in UltraScale devices, use BUFGCE_DIVs instead of BUFGCE.
RECOMMENDED: Explore the different settings within the Clocking Wizard to ensure that the most
desirable configuration is created based on your overall design goals.
Related Information
Synchronous CDC
When a clock can be slowed down for periods of time, you can also use these buffers with
additional logic to periodically enable the clock net. Alternatively, you can use a BUFGMUX or
BUFGCTRL to switch the clock source from a faster clock signal to a slower clock.
Any of these techniques can effectively reduce dynamic power. However, depending on the
requirements and clock topology, one technique may prove more effective than another. For
example, in 7 series devices:
• A BUFR might work best if it is an externally generated clock (under 450 MHz) that is only
needed to source up to three clock regions.
• For Virtex-7 devices, a BUFMRCE might also be needed to use this technique with more than
one clock region (but only up to three vertically adjacent regions).
• A BUFHCE is better suited for higher-speed clocks that can be contained in a single clock
region. Although a BUFGCE may span the device and is the most flexible approach, it might
not be the best choice for the greatest power savings.
If this release point is not synchronized to the given clock domain, or if the clock is running
too fast for the GWE to be released safely, portions of the design can go into an unknown
state. For some designs, this does not matter. In other designs, this can cause the design to
become unstable or to incorrectly process the initial data set.
If the design must start up in a known state, Xilinx recommends that you take action to control
the start-up synchronization process using any of the following methods:
• Use clock enables, local reset (synchronized), or both, on critical parts of the design, such as a
state machine, to ensure that the start-up of those portions of the design are controlled and
known.
• Use instantiated clock buffer components with clock enable capability.
Delay the reset release by as many cycles as needed before enabling the design clock. The
following example shows how to delay the first design clock edge after the reset is released in
an UltraScale device. By setting ASYNC_REG=TRUE on the synchronizer registers, all registers
are placed in a single SLICE and therefore, do not need to be driven by a global clock resource.
To prevent clock buffer insertion on the synchronizer clock, use the
CLOCK_BUFFER_TYPE=NONE property on the input clock port.
Figure 81: Reset Synchronization and Delay for Safe Clock Startup Example (shows the
synchronizer/reset delay clock and the design clock)
• When using an MMCM, you can select the Safe Clock Startup option from the Clocking
Wizard to ensure that design clocks are enabled only after they are stable and reliable.
The following example shows the synchronization stages of an UltraScale device MMCM
LOCKED signal connected to the CE pin of the BUFGCE, which drives the user logic. A second
BUFGCE is connected in parallel to the high fanout BUFGCE (user clock) and is dedicated to
the logic controlling the BUFGCE/CE pin. This topology helps timing closure on the
BUFGCE/CE in UltraScale devices by minimizing the clock skew between the synchronizer
and the BUFGCE pin.
TIP: If the MMCM or PLL compensation mode is set to ZHOLD or BUF_IN, all clocks from CLKOUT0
are grouped with the feedback clock and use the same CLOCK_ROOT. If this introduces timing
violations on BUFGCE/CE, create a CLOCK_DELAY_GROUP constraint between the high fanout clock
and the feedback clock only. Optionally, you can also set a USER_CLOCK_ROOT constraint on the low
fanout clock net to constrain the loads to the same clock region as the MMCM. For 7 series devices, the
second clock buffer is usually not needed for helping timing closure due to the different clocking
architecture.
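The properties named in this section can be applied from an XDC file. The following is a minimal sketch only: the port, cell, net, and group names (clk_in, rst_sync_reg, user_clk, fb_clk, ctrl_clk, grp_user_fb) and the clock region X2Y3 are hypothetical and must be replaced with objects from your own netlist.

```tcl
# Prevent clock buffer insertion on the synchronizer clock input port
# (clk_in is a hypothetical port name).
set_property CLOCK_BUFFER_TYPE NONE [get_ports clk_in]

# Mark the synchronizer registers so that they are placed together
# (rst_sync_reg[*] is a hypothetical register name).
set_property ASYNC_REG TRUE [get_cells {rst_sync_reg[*]}]

# If timing violations appear on BUFGCE/CE, group the high fanout clock
# with the feedback clock only (hypothetical net and group names).
set_property CLOCK_DELAY_GROUP grp_user_fb [get_nets {user_clk fb_clk}]

# Optionally constrain the low fanout clock root to the clock region
# that contains the MMCM (X2Y3 is an assumed location).
set_property USER_CLOCK_ROOT X2Y3 [get_nets ctrl_clk]
```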
In general, avoid using local clocks. Local clocks introduce several challenges to the
implementation tools:
TIP: If local clocks introduce timing QoR problems, try floorplanning the clock driver and loads to a
small area using a Pblock. Use report_clock_utilization to identify the location of the local
clocks, review the clock placement, and decide on how to reduce their number or impact.
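As a rough sketch of the floorplanning approach in this tip, the commands below report clock utilization and then confine a local clock driver and its loads to a small Pblock. The Pblock name, cell names, and SLICE range are hypothetical placeholders, not values taken from this guide.

```tcl
# Locate the local clocks and review their placement.
report_clock_utilization -file clock_utilization.rpt

# Confine the local clock driver and its loads to a small area
# (all names and the site range below are hypothetical).
create_pblock pblock_local_clk
add_cells_to_pblock [get_pblocks pblock_local_clk] \
    [get_cells {u_local/clk_drv_reg u_local/load_block}]
resize_pblock [get_pblocks pblock_local_clk] -add {SLICE_X0Y0:SLICE_X19Y49}
```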
If further phase control is necessary for an external clock, an MMCM or PLL can be used with
external feedback compensation and/or coarse or fine grained, fixed or variable phase
compensation. This allows great control over clock phase and propagation times to other devices
simplifying external timing requirements from the device.
• Reconfigurable module internal clocks: Clocks with driver and all loads inside the
reconfigurable module (RM).
• Boundary clocks: Clocks with nets crossing the cell boundary of the reconfigurable module as
follows:
○ Boundary clock net with driver in static region and loads in reconfigurable partition
○ Boundary clock net with driver in static region and loads in static and reconfigurable partition
○ Boundary clock net with driver in reconfigurable partition and loads in static region
○ Boundary clock net with driver in reconfigurable partition and loads in reconfigurable partition and static region
Figure: DFX clock net types (the internal reconfigurable module clock net and the four boundary
clock net cases listed above)
For more information on DFX, see the Vivado Design Suite User Guide: Dynamic Function eXchange
(UG909).
The clock root of the boundary clock net can be placed anywhere in the device, because the
boundary clock net can drive both static and RP loads. Xilinx recommends using the
USER_CLOCK_ROOT constraint on the boundary clock net to manually constrain the
CLOCK_ROOT location due to the following:
• If the loads of the boundary clock are located mainly in the static region, the clock root might
be placed in the static region.
• If the first implementation uses training logic in the RP Pblock, boundary clock nets might be
locked down after the first implementation with an off-center clock root location.
• Because the boundary clock net is distributed to all clock regions covered by the RP Pblock,
the clock insertion delay for the boundary clock is relatively high compared with the internal
RM clock nets.
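A minimal sketch of constraining the boundary clock root manually, assuming a hypothetical net name (boundary_clk) and an assumed clock region (X2Y3) near the center of the RP Pblock:

```tcl
# Manually constrain the clock root of the boundary clock net
# (net name and clock region are hypothetical).
set_property USER_CLOCK_ROOT X2Y3 [get_nets boundary_clk]

# After placement, read back the clock root that the tools assigned.
get_property CLOCK_ROOT [get_nets boundary_clk]
```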
• Driving specific features in place_design that reduce mean time between failures (MTBF)
on synchronization circuits.
• Ensuring recognition by report_synchronizer_mtbf.
• Avoiding report_cdc errors and warnings, which typically show up late in the design cycle
when iterations are longer.
TIP: For CDC violations that can be safely ignored, you can use the waiver mechanism to waive the
violations. For details, see the Vivado Design Suite User Guide: Design Analysis and Closure
Techniques (UG906).
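The following sketch shows one way to review CDC violations and waive a single one. The rule ID (CDC-1), user name, description, and pin names are assumptions for illustration; confirm the exact create_waiver options for your Vivado version in UG835 or UG906 before relying on them.

```tcl
# Report CDC violations with per-path details.
report_cdc -details -file cdc.rpt

# Waive one violation that was reviewed and judged safe.
# The rule ID, user, description, and pins are hypothetical.
create_waiver -type CDC -id {CDC-1} -user "designer" \
    -description "Static configuration bit, written only while the destination is in reset" \
    -from [get_pins cfg_reg/C] -to [get_pins dst_sync_reg/D]
```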
A CDC circuit is required when crossing between two asynchronous clocks or when attempting
to relax timing between two synchronous clocks by adding false path constraints. When using
XPMs, you can select a single-bit or a multi-bit bus to cross between the domains.
Single-Bit CDC
The following figure shows the decisions required when using a single-bit crossing.
Figure: Single-Bit CDC decision flowchart with two decision points: Asynchronous? and Is it a pulse?
Note: For more information on the different single-bit synchronizers, see the Libraries Guide for your
device.
Multi-Bit CDC
The following figure shows the decisions required when using a multi-bit crossing.
Figure: Multi-Bit CDC decision flowchart. Decision points, in order:
• Is the data known to be static? If yes, do not add CDC circuits. Manage CDC using waivers for
report_cdc.
• Is a transfer required every clock cycle?
• Is the data buffered? If yes, use XPM_FIFO_ASYNC.
• Is the data a counter? If yes, use XPM_CDC_GRAY.
Note: For more information on the different multi-bit synchronizers, see the Libraries Guide for your
device.
• Synchronizer MTBF
• Device failure in time (FIT) rate due to single-event upsets (SEUs)
Note: The device FIT rate due to SEUs largely depends on process and device size.
The synchronizer MTBF is design dependent and varies with the following:
1. Run the design through the Vivado Design Suite implementation flow.
2. Based on your targeted device, do one of the following:
• For 7 series devices, select the default value for DEST_SYNC_FF. This is a conservative
approach to meeting typical reliability requirements. For critical designs, conduct further
analysis.
• For UltraScale devices, run the report_synchronizer_mtbf command, which reports
the MTBF for the entire design. By iterating through the flow as shown in the following
figure, you can find a suitable trade-off between MTBF, latency, and resources.
Note: You can also use this iterative process for a user CDC circuit in which the ASYNC_REG attribute is
correctly applied to all the synchronization registers.
Figure: Determining XPM synchronizer stages. Set DEST_SYNC_FF starting with the default value,
implement the design, and run report_synchronizer_mtbf. If the MTBF, resources, or latency need
to improve, repeat the loop. Otherwise, finalize DEST_SYNC_FF.
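One pass of this loop can be run from the Tcl console. This is a sketch that assumes an implemented UltraScale design is open in memory; the report file name is hypothetical.

```tcl
# Report the synchronizer MTBF for the implemented design
# (UltraScale devices only; file name is hypothetical).
report_synchronizer_mtbf -file synchronizer_mtbf.rpt
```

If the reported MTBF does not meet your reliability target, change DEST_SYNC_FF on the relevant XPM instances, re-implement, and rerun the report.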
Chapter 4
Design Constraints
Design constraints define the requirements that must be met by the compilation flow for the
design to be functional in hardware. For complex designs, constraints also define guidance for
the tools to help with convergence and closure. Not all constraints are used by all steps in the
compilation flow. For example, physical constraints are used only during the implementation
steps: optimization, placement, and routing.
Because synthesis and implementation algorithms are timing-driven, creating proper timing
constraints is essential. Over-constraining or under-constraining your design makes timing
closure difficult. You must use reasonable constraints that correspond to your application
requirements. For more information on constraints, see the following resources:
• Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906)
• Applying Design Constraints video tutorials available from the Vivado Design Suite Video
Tutorials page on the Xilinx® website
Note: Traditional and platform-based design flows use design constraints in a similar manner. However,
platform-based designs require extra attention for signals crossing the boundary from the static region of
the design to the dynamic region of the design. Constraining these signals properly ensures flexibility of
the platform and minimizes platform revisions.
Simple Design
For a simple design with a small team of designers:
Complex Design
For a complex design with IP cores or several designer teams:
• 1 file for top-level timing + 1 file for top-level physical + 1 file per IP/major block
• Variables must be defined before they can be used. Similarly, timing clocks must be defined
before they can be used in other constraints.
• For equivalent constraints that cover the same paths and have the same precedence, the last
one applies.
• When a path is covered by multiple timing exceptions, the constraint with the higher
precedence applies.
When considering the priority rules above, the timing constraints should overall use the following
sequence:
# False Paths
# Max Delay / Min Delay
# Multicycle Paths
# Case Analysis
# Disable Timing
When multiple XDC files are used, you must pay particular attention to the clock definitions and
validate that the dependencies are ordered correctly.
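The ordering rules above can be illustrated with a small XDC file in which clocks are defined before anything that refers to them and timing exceptions come last. This is a sketch only: every port, pin, cell, and clock name and every numeric value below is hypothetical.

```tcl
## Timing assertions: clocks first, so later constraints can reference them
create_clock -name sysClk -period 10.000 [get_ports sysclk]
create_clock -name auxClk -period 8.000 [get_ports auxclk]
create_generated_clock -name divClk -source [get_pins clkdiv_reg/C] \
    -divide_by 2 [get_pins clkdiv_reg/Q]

## I/O delays, relative to clocks that are already defined
set_input_delay -clock sysClk -max 5.4 [get_ports DIN]
set_input_delay -clock sysClk -min 2.1 [get_ports DIN]

## Clock groups
set_clock_groups -asynchronous -group [get_clocks sysClk] -group [get_clocks auxClk]

## Timing exceptions
# False Paths
set_false_path -from [get_ports async_rst_n]
# Max Delay / Min Delay
set_max_delay -datapath_only 8.000 -from [get_cells src_reg] -to [get_cells dst_reg]
# Multicycle Paths
set_multicycle_path 2 -setup -from [get_cells slow_src_reg] -to [get_cells slow_dst_reg]
set_multicycle_path 1 -hold -from [get_cells slow_src_reg] -to [get_cells slow_dst_reg]
```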
The synthesis engine accepts all XDC commands, but only some have a real effect:
○ set_input_delay / set_output_delay
• RTL attributes force decisions made by the mapping and optimization algorithms. Following
are a few examples:
○ DONT_TOUCH / KEEP / KEEP_HIERARCHY / MARK_DEBUG
○ MAX_FANOUT
Note: The same attribute can also be set as a property from an XDC file. Using XDC-based constraints is
convenient for influencing the synthesis results only in some cases without changing the RTL.
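For instance, the attributes listed above might be set from an XDC file as follows. The object names and the fanout limit are hypothetical, and the objects must exist in the elaborated netlist for the properties to take effect during synthesis.

```tcl
# Limit the fanout of one control net without editing the RTL
# (net name and limit are hypothetical).
set_property MAX_FANOUT 50 [get_nets u_ctrl/enable_int]

# Preserve a cell and mark a bus for debug (hypothetical names).
set_property DONT_TOUCH TRUE [get_cells u_core/u_sync]
set_property MARK_DEBUG TRUE [get_nets {u_core/status[*]}]
```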
Synthesis constraints must use names from the elaborated netlist, preferably ports and
sequential cells. During elaboration, some RTL signals can disappear and it is not possible to
attach XDC constraints to them. In addition, due to the various optimizations after elaboration,
nets or logical cells are merged into the various technology primitives such as LUTs or DSP
blocks. To know the elaborated names of your design objects, click Open Elaborated Design in
the Flow Navigator and browse to the hierarchy of interest.
Some registers are absorbed into RAM blocks and some levels of the hierarchy can disappear to
allow cross-boundary optimizations.
Any elaborated netlist object or level of hierarchy can be preserved by using a DONT_TOUCH,
KEEP, KEEP_HIERARCHY, or MARK_DEBUG constraint, at the risk of degrading timing or area
QoR.
Finally, some constraints can conflict and cannot be respected by synthesis. For example, if a
MAX_FANOUT attribute is set on a net that crosses multiple levels of hierarchy, and some
hierarchies are preserved with DONT_TOUCH, the fanout optimization will be limited or fully
prevented.
IMPORTANT! Unlike during implementation, RTL netlist objects that are used for defining timing
constraints can be optimized away by synthesis to allow better area QoR. This is usually not a problem as
long as the constraints are updated and validated for implementation. But if needed, you can preserve any
object by using the KEEP constraint so that the constraints will apply during both synthesis and
implementation.
After synthesis is complete, Xilinx recommends that you review the timing and utilization reports
to validate that the netlist quality meets the application requirements and can be used for
implementation.
In many cases, the same constraints can be used during synthesis and implementation. However,
because the design objects can disappear or have their name changed during synthesis, you must
verify that all synthesis constraints still apply properly with the implementation netlist. If this is
not the case, you must create an additional XDC file containing the constraints that are valid for
implementation only.
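In project mode, one way to restrict an XDC file to implementation is through the file properties. A minimal sketch, assuming the file has already been added to the project under the hypothetical name impl_only.xdc:

```tcl
# Use this constraint file during implementation only
# (impl_only.xdc is a hypothetical file name).
set_property USED_IN_SYNTHESIS false [get_files impl_only.xdc]
set_property USED_IN_IMPLEMENTATION true [get_files impl_only.xdc]
```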
The block-level constraints must be developed independently from the top-level constraints, and
must be as generic as possible so that they can be used in various contexts. In addition, these
constraints must not affect any logic that is beyond the block boundaries.
When implementing a sub-block it is desirable to have the full clocking network included in
timing analysis to ensure accurate skew and clock domain crossing analysis. This might require an
HDL wrapper containing the clocking components and an additional constraint file to replicate
top level clocking constraints. It is used only in the timing validation of the sub-module.
For more information on constraints scoping as well as rules, guidelines, and mechanisms for
loading the block-level constraints into the top-level design, see the Vivado Design
Suite User Guide: Using Constraints (UG903).
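As a sketch of the scoping mechanism, block-level constraints can be loaded so that they apply only within instances of one module. The module name (blockA) and file name (blockA.xdc) are hypothetical.

```tcl
# Non-project mode: scope the block-level XDC to every instance of blockA.
read_xdc -ref blockA blockA.xdc

# Project mode equivalent, assuming blockA.xdc is already in the project.
set_property SCOPED_TO_REF blockA [get_files blockA.xdc]
```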
• When using a C/C++ kernel, you must specify additional user constraints for synthesis or
implementation using Vitis HLS. The Vitis HLS output must then be packaged in the IP
packager, and this packaged IP includes both the user and tool-generated constraints. For
information, see the Vitis HLS User Guide (UG1399).
• When using an RTL kernel, you must specify additional synthesis and implementation
constraints during IP packaging. For information, see the Vivado Design Suite User Guide:
Creating and Packaging Custom IP (UG1118).
In the Vitis environment, all of the design constraints for synthesis and implementation must be
packaged with the IP. If additional constraints are required for synthesis after the IP is packaged,
you must repackage the IP to include the missing constraints.
However, after the IP is packaged, you can specify additional XDC constraints to be used only
during implementation. Although the Vitis environment abstracts the underlying Vivado tools
process for implementing the programmable logic region, the Vitis environment also provides
advanced options to control the Vivado tools flow. With these advanced controls, you can
specify certain Tcl scripts to be executed before (Pre) or after (Post) each implementation phase,
including the following: init_design, opt_design, place_design, phys_opt_design,
route_design, or write_bitstream. For more information on Tcl scripting, see the Vivado
Design Suite User Guide: Using Tcl Scripting (UG894). You can leverage the Pre and Post Tcl scripts
to execute certain Vivado tools commands, such as to apply additional XDC constraints through
the read_xdc or source Tcl commands.
You can specify the Pre and Post Tcl scripts either through the Vitis environment configuration
file or directly on the v++ compiler command line.
To specify the Pre and Post Tcl scripts inside the Vitis environment configuration file, use the
parameters prop=run.impl_1.STEPS.<PHASE>.TCL.<PRE|POST> inside the [vivado]
section.
Where:
For example:
[vivado]
prop=run.impl_1.STEPS.OPT_DESIGN.TCL.PRE=<pathToTclScript>
prop=run.impl_1.STEPS.OPT_DESIGN.TCL.POST=<pathToTclScript>
prop=run.impl_1.STEPS.PLACE_DESIGN.TCL.PRE=<pathToTclScript>
prop=run.impl_1.STEPS.PLACE_DESIGN.TCL.POST=<pathToTclScript>
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.TCL.PRE=<pathToTclScript>
prop=run.impl_1.STEPS.PHYS_OPT_DESIGN.TCL.POST=<pathToTclScript>
prop=run.impl_1.STEPS.ROUTE_DESIGN.TCL.PRE=<pathToTclScript>
prop=run.impl_1.STEPS.ROUTE_DESIGN.TCL.POST=<pathToTclScript>
To specify the Pre and Post Tcl scripts as a v++ parameter, use the --vivado.prop
run.impl_1.STEPS.<PHASE>.TCL.<PRE|POST>=<pathToTclScript> command line
option. For example, to specify a Tcl script to be executed before opt_design:
--vivado.prop run.impl_1.STEPS.OPT_DESIGN.TCL.PRE=<pathToTclScript>
Where:
• --vivado is the v++ command line option to specify directives for the Vivado tools.
• prop indicates a property setting.
• run. indicates a run property.
• impl_1. indicates the name of the run.
• STEPS.OPT_DESIGN.TCL.PRE indicates the run property you are specifying.
• <pathToTclScript> indicates the property value.
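The script that such a property points to can be very small. The following sketch shows a hypothetical Pre script for place_design that loads implementation-only constraints; the script name and the XDC path are assumptions for illustration.

```tcl
# pre_place.tcl (hypothetical name): runs before place_design when
# referenced by the STEPS.PLACE_DESIGN.TCL.PRE run property.
# The XDC path below is a placeholder.
read_xdc /path/to/impl_only.xdc
```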
Figure: Constraints creation steps
1. Create Clocks (Primary/Virtual/Generated, External Feedback/Uncertainty)
   XDC: create_clock, create_generated_clock, set_system_jitter, set_input_jitter,
   set_clock_uncertainty, set_external_delay
   Reports: Clock Networks, Check Timing
2. Input/Output Delays (System/Source Synchronous)
   XDC: set_input_delay, set_output_delay
   Reports: Check Timing, Report Timing
3. Clock Groups and CDC (Asynchronous/Exclusive)
   XDC: set_clock_groups, set_false_path
   Reports: Clock Interaction, Check Timing
4. Timing Exceptions (Ignore/Max/Min)
   XDC: set_false_path, set_min_delay/set_max_delay, set_multicycle_path, set_case_analysis,
   set_disable_timing
   Reports: Timing Summary, Report Timing
• The first two steps refer to the timing assertions where the default timing path requirements
are derived from the clock waveforms and I/O delay constraints.
• During the third step, relationships between the asynchronous/exclusive clock domains that
share at least one logical path are reviewed. Based on the nature of the relationships, clock
groups or false path constraints are entered to ignore the timing analysis on these paths.
• The last step corresponds to the timing exceptions, where the designer can decide to alter the
default timing path requirements by ignoring, relaxing or tightening them with specific
constraints.
Constraints creation is associated with constraints identification and constraints validation tasks
that are only possible with the various reports generated by the timing engine. The timing engine
only works with a fully mapped netlist, for example, after synthesis. While it is possible to enter
constraints with an elaborated netlist, it is recommended to create the first set of constraints
with the post-synthesis netlist so that analysis and reports on the constraints can be performed
interactively.
When creating timing constraints for a new design or completing existing constraints, Xilinx
recommends using the Timing Constraints Wizard to quickly identify missing constraints for the
first three steps in the previous figure. The Timing Constraints Wizard follows the methodology
described in this section to ensure the design constraints are safe and reliable for proper timing
closure. You can find more information on the Timing Constraints Wizard in Vivado Design Suite
User Guide: Using Constraints (UG903).
The following sections describe in detail the four steps described above:
Refer to each section for a detailed methodology and use case when you are at the appropriate
step in the constraint creation process.
IMPORTANT! When defining a clock with a specific name (-name option), you must verify that the clock
name is not already used by another clock constraint or an existing auto-generated clock. The Vivado
Design Suite timing engine issues a message when a clock name is used in several clock constraints to warn
you that the first clock definition is overridden. When the same clock name is used twice, the first clock
definition is lost as well as all constraints referring to that name and entered between the two clock
definitions. Xilinx recommends that you avoid overriding clock definitions unless no other constraints are
impacted and all timing paths remain constrained.
With check_timing, the same clock source pin or port can appear in several groups depending
on the topology of the entire clock tree. In such cases, creating a clock on the recommended
source pin or port resolves the missing clock definition for all the associated groups.
For this reason, it is important to define the primary clocks on objects that correspond to the
boundary of the design, so that their delay, and indirectly their skew, can be accurately
computed.
Input Ports
You can use an input port as the primary clock root as shown in the following figure.
Figure: REGA and REGB connected by a data path and clocked from the input port.
Constraint example:
In this example, the waveform is defined to have a 50% duty cycle. The -waveform argument is
shown above to illustrate its usage and is only necessary to define a clock with a duty cycle other
than 50%. For more information, see the create_clock Tcl command in the Vivado Design Suite Tcl
Command Reference Guide (UG835). For a differential clock input buffer, the primary clock only
needs to be defined on the P-side of the pair.
Constraint example:
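As a minimal sketch of both cases described above, assuming hypothetical port names (sysclk for a single-ended input, sys_clk_p for the P-side of a differential pair) and an assumed 10 ns period:

```tcl
# Single-ended input port. The -waveform option is shown only to
# illustrate its usage; {0 5} with a 10 ns period is a 50% duty cycle,
# which is also the default.
create_clock -name sysClk -period 10.000 -waveform {0.000 5.000} [get_ports sysclk]

# Differential input: define the primary clock on the P-side only.
create_clock -name sysClkDiff -period 10.000 [get_ports sys_clk_p]
```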
RECOMMENDED: For designs that target 7 series devices, Xilinx recommends also defining the GT
incoming clocks, because the Vivado tools calculate the expected clocks on the GT output pins and
compare these clocks with the user-created clocks. If the clocks differ or if the incoming clocks to the GT
are missing, the tools issue a methodology check warning.
Note: For designs that target UltraScale™ and UltraScale+™ devices, Xilinx does not recommend defining a
primary clock on the output of GTs, because GT clocks are automatically derived when the REFCLK input
clocks are defined.
Figure: Clock path from sysclk through an IBUF and instA (no timing arc between IN and OUT) to
instB. Recommended primary clock source point: instA/OUT.
IMPORTANT! No primary clock should be defined in the transitive fanout of another primary clock
because this situation does not correspond to any hardware reality. It will also prevent proper timing
analysis by preventing the complete clock insertion delay calculation. Any time this situation occurs, the
constraints must be revisited and corrected.
The following figure shows an example in which the clock clk1 is defined in the transitive fanout
of the clock clk0. The clock clk1 overrides clk0 starting at the output of BUFG1, where it is
defined. Therefore, the timing analysis between REGA and REGB is not accurate because of the
invalid skew computation between clk0 and clk1.
Figure: REGA and REGB connected by a data path, with BUFG1 on the clock tree.
NOT RECOMMENDED
create_clock -name clk1 -period 10 [get_pins BUFG1/O]
create_clock -name clk0 -period 10 [get_ports sysclk]
Auto-Derived Clocks
Most generated clocks are automatically derived by the Vivado timing engine which recognizes
the clock modifying blocks (CMB) and the transformation they perform on the master clocks.
For 7 series devices, the CMBs are:
• MMCM* / PLL*
• BUFR
• PHASER*
For UltraScale devices, the CMBs are:
• MMCM* / PLL*
• BUFG_GT / BUFGCE_DIV
• GT*_COMMON / GT*_CHANNEL / IBUFDS_GTE3
• BITSLICE_CONTROL / RX*_BITSLICE
• ISERDESE3
For any other combinatorial cell located on the clock tree, the timing clocks propagate through
them and do not need to be redefined at their output, unless the waveform is transformed by the
cell. In general, you must rely on the auto-derivation mechanism as much as possible as it
provides the safest way to define the generated clocks that correspond to the actual hardware
behavior.
If the auto-derived clock name chosen by the Vivado Design Suite timing engine does not seem
appropriate, you can force your own name by using the create_generated_clock command
without specifying the waveform transformation. This constraint should be located right after the
constraint that defines the master clock in the constraint file. For example, if the default name of
a clock generated by an MMCM instance is net0, you can add the following constraint to force
your own name (fftClk in the given example):
To avoid any ambiguity, the constraint must be attached to the source pin of the clock. For more
information, see Vivado Design Suite User Guide: Using Constraints (UG903).
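A minimal sketch of such a renaming constraint, assuming the MMCM instance is named mmcm_i and the clock leaves through its CLKOUT0 pin (both are hypothetical names):

```tcl
# Rename the auto-derived clock on the MMCM output pin to fftClk.
# No waveform transformation is specified, so only the name changes.
create_generated_clock -name fftClk [get_pins mmcm_i/CLKOUT0]
```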
Finally, if the design contains latches, the latch gate pins also need to be reached by a timing
clock and will be reported by Check Timing (no_clock) if the constraint is missing. You can
follow the examples above to define these clocks.
You can also verify that all internal timing paths are covered by at least one clock. The Check
Timing report provides two checks for that purpose:
• no_clock: Reports any active clock pin that is not reached by a defined clock.
• unconstrained_internal_endpoint: Reports all the data input pins of sequential cells that have
a timing check relative to a clock but the clock has not been defined.
If both checks return zero, the timing analysis coverage will be high.
Alternatively, you can run the XDC and Timing Methodology checks to verify that all clocks are
defined on recommended netlist objects without introducing any constraint conflict or inaccurate
timing analysis scenario.
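These checks can be run from the Tcl console on an open synthesized or implemented design. The report file names in this sketch are hypothetical.

```tcl
# Check for missing clocks and unconstrained endpoints.
check_timing -verbose -file check_timing.rpt

# Review where clocks are defined and how they propagate.
report_clock_networks -file clock_networks.rpt

# Run the XDC and Timing Methodology checks.
report_methodology -file methodology.rpt
```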
Jitter
For jitter, it is best to use the default values used by the Vivado Design Suite. You can modify the
default computation as follows:
• If a primary clock enters the device with a random jitter greater than zero, use the
set_input_jitter command to specify the peak-to-peak jitter value in nanoseconds.
• To adjust the global jitter if the device power supply is noisy, use set_system_jitter.
Xilinx does not recommend increasing the default system jitter value.
For generated clocks, the jitter is derived from the master clock and the characteristics of the
clock modifying block. You do not need to adjust these numbers.
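For example, a primary clock with a known incoming jitter might be constrained as in the sketch below, where the clock name (sysClk) and the 0.150 ns peak-to-peak value are hypothetical:

```tcl
# Peak-to-peak input jitter on a primary clock, in nanoseconds.
set_input_jitter [get_clocks sysClk] 0.150
```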
Additional Uncertainty
When you need to add extra margin on the timing paths of a clock or between two clocks, you
must use the set_clock_uncertainty command. This is also the best and safest way to
over-constrain a portion of a design without modifying the actual clock edges and the overall
clocks relationships. The clock uncertainty defined by you is additive to the jitter computed by
the Vivado tools, and can be specified separately for setup and hold analysis.
For example, the margin on all intra-clock paths of the design clock clk0 needs to be tightened
by 500 ps to make the design more robust to noise for both setup and hold:
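A minimal sketch of this constraint, assuming clk0 is already defined. With neither -setup nor -hold specified, the value applies to both analyses.

```tcl
# Add 500 ps of uncertainty on all intra-clock paths of clk0.
set_clock_uncertainty -from clk0 -to clk0 0.500
```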
Note: Tightening the hold margin on a design can lead to hold violations on dedicated intra-site and
cascade paths that the router cannot fix by detouring the intra-site net.
If you specify additional uncertainty between two clocks, the constraint must be applied in both
directions (assuming data flows in both directions). The example below shows how to increase
the uncertainty by 250 ps between clk0 and clk1 for setup only:
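A minimal sketch of this pair of constraints, assuming clk0 and clk1 are already defined:

```tcl
# Add 250 ps of setup uncertainty in both directions between clk0 and clk1.
set_clock_uncertainty -from clk0 -to clk1 0.250 -setup
set_clock_uncertainty -from clk1 -to clk0 0.250 -setup
```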
• To specify the clock delay propagation outside the device independently from the input and
output delay constraints.
• To model the internal propagation latency of a clock used by a block during out-of-context
compilation. In such a compilation flow, the complete clock tree is not described, so the
variation between min and max operating conditions outside the block cannot be
automatically computed and must be manually modeled.
This constraint should only be used by advanced users as it is usually difficult to provide valid
latency values.
IMPORTANT! I/O delays can only be constrained for interfaces using I/O logic, such as ISERDES/
OSERDES/IDDR/ODDR/IOB registers or fabric. For guidance on component mode timing, see Designing
Using SelectIO Interface Component Primitives (XAPP1324). For high-speed I/O interfaces created using
UltraScale device SelectIO native mode, see Answer Record 68618.
• For max delay analysis (setup), the data is captured one clock cycle after the launch edge for
single data rate interface, and half clock cycle after the launch edge for a double data rate
interface.
• For min delay analysis (hold), the data is launched and captured by the same clock edge.
If the relationship between the clock and I/O data must be timed differently, for example, in a
source synchronous interface, different I/O delays and additional timing exceptions must be
specified. This corresponds to an advanced I/O timing constraints scenario.
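Figure: Input delay computation. A device on the board (clock-to-output delay Tco) drives the DIN
port through the board data trace (Ddata). Inside the FPGA device, DIN reaches register REGB
(Tsetup/Thold) after an internal delay, and CLK is distributed through a BUFG.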
The input delay values for both types of analysis are:
Input Delay(max) = Tco(max) + Ddata(max) + Dclock_to_ExtDev(max) - Dclock_to_FPGA(min)
Input Delay(min) = Tco(min) + Ddata(min) + Dclock_to_ExtDev(min) - Dclock_to_FPGA(max)
The following figure shows a simple example of input delay constraints for both setup (max) and
hold (min) analysis, assuming the sysClk clock has already been defined on the CLK port:
set_input_delay -max -clock sysClk 5.4 [get_ports DIN]
set_input_delay -min -clock sysClk 2.1 [get_ports DIN]
Figure: Input delay waveform showing the source clock (CLK), the launch edge, and the clock period.
A negative input delay means that the data arrives at the interface of the device before the
launch clock edge.
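The input delay formulas can be evaluated directly in an XDC file with Tcl variables. In the sketch below, all board-level numbers are hypothetical values chosen for illustration; take the real values from the external device data sheet and the board trace analysis.

```tcl
# Hypothetical board-level values in nanoseconds.
set tco_max        3.0
set tco_min        1.0
set ddata_max      2.0
set ddata_min      1.2
set dclk_ext_max   0.9
set dclk_ext_min   0.4
set dclk_fpga_max  0.5
set dclk_fpga_min  0.5

# Input Delay(max) = Tco(max) + Ddata(max) + Dclock_to_ExtDev(max) - Dclock_to_FPGA(min)
set in_delay_max [expr {$tco_max + $ddata_max + $dclk_ext_max - $dclk_fpga_min}]
# Input Delay(min) = Tco(min) + Ddata(min) + Dclock_to_ExtDev(min) - Dclock_to_FPGA(max)
set in_delay_min [expr {$tco_min + $ddata_min + $dclk_ext_min - $dclk_fpga_max}]

set_input_delay -max -clock sysClk $in_delay_max [get_ports DIN]
set_input_delay -min -clock sysClk $in_delay_min [get_ports DIN]
```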
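Figure: Output delay computation. Register REGB in the FPGA device (Tco) drives the DOUT port
after an internal delay. The board data trace (Ddata) carries DOUT to a device on the board
(Tsetup/Thold), and CLK is distributed through a BUFG.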
The output delay values for both types of analysis are:
Output Delay(max) = Tsetup + Ddata(max) + Dclock_to_FPGA(max) - Dclock_to_ExtDev(min)
Output Delay(min) = Ddata(min) - Thold + Dclock_to_FPGA(min) - Dclock_to_ExtDev(max)
The following figure shows a simple example of output delay constraints for both setup (max) and
hold (min) analysis, assuming the sysClk clock has already been defined on the CLK port:
set_output_delay -max -clock sysClk 2.4 [get_ports DOUT]
set_output_delay -min -clock sysClk -1.1 [get_ports DOUT]
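Figure: Output delay waveform showing the source clock and the destination clock (CLK) over one
PERIOD, with the destination device setup time TSU(DestDev) and hold time TH(DestDev) around the
capture edge.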
The output delay corresponds to the delay on the board before the capture edge. For a regular
system synchronous interface where the clock and data board traces are balanced, the setup
time of the destination device defines the output delay value for max analysis, and the
destination device hold time defines the output delay for min analysis. The specified min output
delay indicates the minimum delay that the signal will incur after coming out of the design,
before it will be used for hold analysis at the destination device interface. Thus, the delay inside
the block can be that much smaller. A positive value for min output delay means that the signal
can have negative delay inside the design. This is why min output delay is often negative. For
example, the following code example indicates that the delay inside the design until DOUT has to
be at least +0.5 ns to meet the hold time requirement.
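A constraint consistent with this description might look like the following sketch. It assumes the sysClk reference clock from the earlier output delay example; the value of -0.5 ns is inferred from the +0.5 ns internal delay stated above.

```tcl
# Min output delay of -0.5 ns: the delay inside the design up to DOUT
# must be at least +0.5 ns to satisfy hold at the destination device.
# Assumption: sysClk is the reference clock, as in the earlier example.
set_output_delay -min -clock sysClk -0.5 [get_ports DOUT]
```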
For a group of I/O ports connected to another device interface on the board, you can use the
board clock that is connected to both the Xilinx device and to the external device interface as the
reference clock for the input or output delay constraints. To control the timing of the related
group of ports, you must verify in the external device data sheet that the board clock is internally
transformed for timing the I/O ports, which ensures that the design generates the same clock
inside the Xilinx device.
For each port, you can expand the path schematics to the first level of sequential cells, and then
trace the clock pins of those cells back to the clock source(s). This approach can be impractical
for ports that are connected to high fanout nets.
Whether a port is already constrained or not, you can use the report_timing command to
identify its related clocks in the design. Once all the timing clocks have been defined, you can
report the worst path from or to the I/O port, create the I/O delay constraint relative to the clock
reported, and rerun the same timing report from/to the other clocks of the design. If it appears
that the port is related to more than one clock, create the corresponding constraint and repeat
the process.
For example, the din input port is related to the clocks clk1 and clk2 inside the design:
The report shows that the din port is related to clk1. The input delay constraint is (for both min
and max delay in this example):
Rerun timing analysis with the same command as previously, and observe that din is also related
to clk2 due to the -sort_by group option, which reports N paths per endpoint clock. You
can add the corresponding delay constraint and rerun the report to validate that the din port is
not related to another clock.
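The iteration described above can be sketched as follows. The delay values are illustrative assumptions, not taken from a data sheet.

```tcl
# Step 1: report the worst path per endpoint clock group from the din port.
report_timing -from [get_ports din] -sort_by group

# Step 2: constrain din relative to the clock reported (clk1).
# Without -min or -max, the value applies to both analyses.
set_input_delay -clock clk1 5.0 [get_ports din]

# Step 3: rerun the same report to check whether din is also related
# to another clock, such as clk2.
report_timing -from [get_ports din] -sort_by group

# Step 4: add a second constraint for clk2. The -add_delay option keeps
# the existing clk1 constraint instead of overriding it.
set_input_delay -clock clk2 2.0 -add_delay [get_ports din]
```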
You can also run the same analysis using the Timing Summary report with the
-report_unconstrained option. With only clock constraints in your design, the
Unconstrained Paths section appears as follows:
------------------------------------------------
| Unconstrained Path Table
------------------------------------------------
Path Group  From Clock  To Clock
----------  ----------  --------
(none)
(none)      clk1
(none)      clk2
(none)                  clk1
(none)                  clk2
The fields without a clock name (or <NONE> in the Vivado IDE) refer to a group of paths where
the startpoints (From Clock) or the endpoints (To Clock) are not associated with a clock. The
unconstrained I/O ports fall in this category. You can retrieve their names by browsing the rest of
the report. For example, in the Vivado IDE, selecting the Setup paths for the clk1 to NONE
category shows the ports driven by clk1 in the To column:
After adding the new constraints and applying them in memory, you must rerun the report to
determine which ports are still unconstrained. For most designs, you must increase the number
of reported paths to make sure all the I/O paths are listed in the report.
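As a sketch, the report can be rerun from Tcl with a larger path count. The path count and the file name are arbitrary assumptions; choose a count large enough to cover all I/O ports in your design.

```tcl
# Rerun the Timing Summary report, including unconstrained paths,
# with more paths reported so that all I/O paths are listed.
report_timing_summary -report_unconstrained -max_paths 100 \
    -file timing_summary_unconstrained.rpt
```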
You can use the set_input_delay and set_output_delay constraints without specifying
the related clock. The Vivado Design Suite timing engine will analyze the design and associate
each port with all the sampling clocks automatically. Then by reporting timing on the I/O paths,
you can see how the tool constrained each I/O port. This is convenient for quickly constraining a
design, but this type of constraint can become a problem if it is too generic and does not
model the hardware reality accurately.
When the primary clock is compensated by a PLL or MMCM inside the device with the zero hold
violation (ZHOLD) mode, the sequential cells on the I/O paths are connected to an internal copy (for
example, a generated clock) of the primary clock. Because the waveforms of both clocks are
identical, Xilinx recommends using the primary clock as the reference clock for the input/output
delay constraints.
Figure 98: Input Delay in the Presence of a ZHOLD MMCM in Clock Path
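[Figure: the CLK port drives the mmcm_zhold MMCM (CLK IN1 to CLK OUT1), whose output clocks the data_reg[0] FDCE. DIN reaches the register D pin through the DIN_IBUF_inst IBUF. The input delay is referenced to the CLK port. (X13454-121919)]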
The constraints are identical to the example provided in Defining Input Delays because the
ZHOLD MMCM acts like a clock buffer with a negative insertion delay, which corresponds to the
amount of compensation.
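A minimal sketch of such constraints, reusing the delay values from the earlier input delay example and assuming a hypothetical 10 ns board clock:

```tcl
# Primary clock on the board clock port (the period is an assumption).
create_clock -name sysClk -period 10.000 [get_ports CLK]

# The input delays reference the primary clock, even though the capturing
# register is clocked by the internal copy generated by the ZHOLD MMCM.
set_input_delay -max -clock sysClk 5.4 [get_ports DIN]
set_input_delay -min -clock sysClk 2.1 [get_ports DIN]
```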
• The internal clock and the board clock have different periods: The virtual clock must be defined
with the same period and waveform as the internal clock. This results in a regular single-cycle
path requirement on the I/O paths.
• For input paths, the internal clock has a positive shifted waveform compared to the board
clock: the virtual clock is defined like the board clock, and a multicycle path constraint of two
cycles for setup is defined from the virtual clock to the internal clock. These constraints force
the setup timing analysis to be performed with a requirement of one clock cycle + amount of
phase shift.
• For output paths, the internal clock has a negative shifted waveform compared to the board
clock: the virtual clock is defined like the board clock and a multicycle path constraint of two
cycles for setup is defined from the internal clock to the virtual clock. These constraints force
the setup timing analysis to be performed with a requirement of one clock cycle + amount of
phase shift.
To summarize, the use of a virtual clock adjusts the default timing analysis to avoid treating I/O
paths as clock domain crossing paths with a tight and unrealistic requirement.
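The input path case above can be sketched as follows. The clock names virt_sysClk and clk_shifted, the 10 ns period, and the delay values are assumptions for illustration; clk_shifted stands for the positively phase-shifted internal clock.

```tcl
# Board clock entering the device (hypothetical 10 ns period).
create_clock -name sysClk -period 10.000 [get_ports CLK]

# Virtual clock defined like the board clock. It is not attached to any
# netlist object, which is what makes it virtual.
create_clock -name virt_sysClk -period 10.000

# Input delays are referenced to the virtual clock.
set_input_delay -max -clock virt_sysClk 5.4 [get_ports DIN]
set_input_delay -min -clock virt_sysClk 2.1 [get_ports DIN]

# Two-cycle setup from the virtual clock to the shifted internal clock,
# so that setup is analyzed with one clock cycle plus the phase shift.
set_multicycle_path 2 -setup -from [get_clocks virt_sysClk] \
    -to [get_clocks clk_shifted]
```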
IMPORTANT! You only need to use the multicycle path for I/O paths with phase-shifted clocks when the
phase-shift results in modification of the clock waveform. When the phase shift is added to the insertion
delay of the clock modifying block and the clock waveform is preserved, you do not need to use a
multicycle path. For more information, see this link in the Vivado Design Suite User Guide: Design Analysis
and Closure Techniques (UG906).
For example, consider the sysClk board clock that runs at 100 MHz and gets multiplied by an
MMCM to generate clk266 that runs at 266 MHz. An output that is generated by clk266
should use clk266 as the reference clock. If you try to use sysClk as the reference clock (for
the set_output_delay specification), it will appear as asynchronous clocks, and the path can
no longer be timed as a single cycle.
In most cases, the I/O reference clock edges correspond to the clock edges used to latch or
launch the I/O data inside the device. By analyzing the I/O timing paths, you can review which
clock edges are used and verify that they correspond to the actual hardware behavior. If by
mistake a rising clock edge is used as a reference clock for an I/O path that is only related to the
falling clock edge internally, the path requirement is ½-period, which makes timing closure more
difficult.
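As an illustration, when the I/O data is actually related to the falling edge of the reference clock, the delay constraint can state this explicitly with the -clock_fall option. This is a sketch with assumed port, clock, and delay values:

```tcl
# Reference the falling edge of sysClk instead of the default rising edge.
# Assumption: the external device launches DIN on the falling clock edge.
set_input_delay -clock sysClk -clock_fall -max 5.4 [get_ports DIN]
set_input_delay -clock sysClk -clock_fall -min 2.1 [get_ports DIN]
```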
• The correct clocks and clock edges are used as reference for the delay constraints.
• The expected clocks are launching and capturing the I/O data inside the device.
• The violations can reasonably be fixed by placement or by setting the proper delay line tap
configuration. If this is not the case, you must review the I/O delay values entered in the
constraints and evaluate whether they are realistic, and whether you must modify the design
to meet timing.
Improper I/O delay constraints can lead to impossible timing closure. The implementation tools
are timing driven and work on optimizing the placement and routing to meet timing. If the I/O
path requirements cannot be met and I/O paths have the worst violations in the design, the
overall design QoR will be impacted.
Example One
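Use a virtual clock with a period greater than or equal to the target maximum delay for the
feedthrough path, and apply max input and output delay constraints such that the max input
delay, the feedthrough path delay, and the max output delay together fit within the virtual
clock period.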
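A minimal sketch of this style of constraint, assuming hypothetical ports din and dout and illustrative values: a 10 ns virtual clock named vclk with 3 ns of input delay and 2 ns of output delay, which leaves at most 5 ns for the feedthrough path.

```tcl
# Virtual clock used only as a reference for the feedthrough path.
create_clock -name vclk -period 10.000

# 10 ns period - 3 ns input delay - 2 ns output delay
# leaves a 5 ns maximum delay budget from din to dout.
set_input_delay  -clock vclk -max 3.000 [get_ports din]
set_output_delay -clock vclk -max 2.000 [get_ports dout]
```

In this sketch, only the maximum delay of the feedthrough path is constrained.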
Example Two
Use a combination of min and max delay constraints between the feedthrough ports. Example:
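A minimal sketch, assuming hypothetical ports din and dout and illustrative delay values:

```tcl
# Constrain the feedthrough path between 2 ns and 10 ns.
set_max_delay -from [get_ports din] -to [get_ports dout] 10.000
set_min_delay -from [get_ports din] -to [get_ports dout] 2.000
```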
This is a simple way to constrain both minimum and maximum delays on the path. Any existing
input and output delay constraints on the same ports are also used during the timing analysis. For
this reason, this style is not very popular.
The max delay is usually optimized and reported against the Slow timing corner, while the min
delay is in the Fast timing corner. It is best to run a few iterations on the feedthrough path delay
constraints to validate that they are reasonable and can be met by the implementation tools,
especially if the ports are placed far from one another.
• set_clock_groups: Disables timing analysis between groups of clocks that you identify but
not between the clocks within the same group.
• set_false_path: Disables timing analysis between the clocks only in the direction
specified by the -from and -to options.
In some cases, you might want to use the following constraints on one or more paths of the clock
domain crossing (CDC) to limit latency or bus skew:
Note: If clock groups or false path constraints already exist between the clocks or on the same CDC
paths, the maximum delay constraints will be ignored. Therefore, it is important to thoroughly review
every path between all clock pairs before choosing one CDC timing constraint over another to avoid
constraint collisions.
• set_bus_skew: Constrains a set of signals between asynchronous CDC paths by bus skew
instead of latency.
TIP: You can also set a bus skew constraint from the Vivado IDE. In the Timing Constraints window,
expand Assertions, and double-click Set Bus Skew.
Synchronous
Clock relationships are synchronous when two clocks have a fixed phase relationship. This is the
case when two clocks share the following:
Asynchronous
Clock relationships are asynchronous when the clocks do not have a fixed phase relationship.
This is the case when one of the following is true for the clocks:
• Do not share any common circuitry in the design and do not have a common primary clock.
• Do not have a common period within 1000 cycles (unexpandable) and the timing engine
cannot properly time them together.
• Have a common clock but do not share a common node.
• Are part of a topology that does not ensure a known phase relationship through the clocks
auto-derivation process.
If two clocks are synchronous but their common period is very small, the setup path
requirement is too tight for timing to be met. Xilinx recommends that you treat the two clocks as
asynchronous and implement safe asynchronous CDC circuitry.
Exclusive
Clock relationships are exclusive when they propagate on the same clock tree and reach the same
sequential cell clock pins but cannot physically be active at the same time.
• Do the two clocks have a common primary clock? When clocks are properly defined, all clocks
that originate from the same source in the design share the same primary clock.
• Do the two clocks have a common period? This shows in the setup or hold path requirement
column (unexpandable), when the timing engine cannot determine the most pessimistic setup
or hold relationship.
• Are the paths between the two clocks partially or completely covered by clock groups or
timing exception constraints?
• Is the setup path requirement between the two clocks very tight? This can happen, when two
clocks are synchronous, but their period is not specified as an exact multiple (for example, due
to rounding off). Over multiple clock cycles, the edges could drift apart, causing the worst case
timing requirement to be very tight.
Based on the clock tree topology, you must apply different constraints as described in the
following sections.
If the clk_mode0 and clk_mode1 clocks generate other clocks, the same constraint needs to
be applied to their generated clocks as well, which can be done as follows:
set_clock_groups -physically_exclusive \
-group [get_clocks -include_generated_clocks clk_mode0] \
-group [get_clocks -include_generated_clocks clk_mode1]
For this reason, you must review the CDC paths and add new constraints to ignore some of the
clock relationships. The correct constraints are dictated by how and where the clocks interact in
the design.
The following figure shows an example of two clocks driving into a multiplexer and the possible
interactions between them before and after the multiplexer.
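[Figure: clocks clk0 and clk1 drive the I0 and I1 inputs of a multiplexer. Registers FD0 and FD1 are clocked by clk0 and clk1 before the multiplexer, and registers FDM0 and FDM1 are clocked by the multiplexer output O. Timing paths A, B, and C run between these registers. (X13455-121919)]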
clk0 and/or clk1 directly interact with the multiplexed clock. To keep timing paths A, B, and
C, the constraint cannot be applied to clk0 and clk1 directly. Instead, it must be applied to
the portion of the clocks in the fanout of the multiplexer, which requires additional clock
definitions.
create_generated_clock -name clk0mux -divide_by 1 \
-source [get_pins mux/I0] [get_pins mux/O]
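The remaining definitions can be sketched along the same lines. The clk1mux name mirrors clk0mux and is an assumption, as is the choice of a physically exclusive clock group between the two multiplexed clocks.

```tcl
# Second generated clock on the same multiplexer output. The -add and
# -master_clock options keep clk0mux defined on mux/O as well.
create_generated_clock -name clk1mux -divide_by 1 \
    -add -master_clock clk1 \
    -source [get_pins mux/I1] [get_pins mux/O]

# Only the portions of the clocks in the fanout of the multiplexer are
# exclusive, so paths A, B, and C remain timed.
set_clock_groups -physically_exclusive -group clk0mux -group clk1mux
```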
Report CDC
The Report CDC (report_cdc) command performs a structural analysis of the clock domain
crossings in your design. You can use this information to identify potentially unsafe CDCs that
might cause metastability or data coherency issues. Report CDC is similar to the Clock
Interaction report, but Report CDC focuses on structures and related timing constraints. Report
CDC does not provide timing information because timing slack does not make sense on paths
that cross asynchronous clock domains.
For more information on the report_cdc command, see this link in the Vivado Design Suite User
Guide: Design Analysis and Closure Techniques (UG906). Also, see report_cdc in the Vivado Design
Suite Tcl Command Reference Guide (UG835).
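As a sketch, the report can be generated for the whole design or restricted to one clock pair. The clock names clkA and clkB and the file name are assumptions.

```tcl
# Detailed CDC report for the whole design, written to a file.
report_cdc -details -file cdc_report.rpt

# CDC report limited to crossings from clkA to clkB.
report_cdc -from [get_clocks clkA] -to [get_clocks clkB]
```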
Specific constraints should be applied to prevent default timing analysis on asynchronous clock
domain crossings.
When two master clocks and their respective generated clocks form two asynchronous domains
between which all the paths are properly synchronized, the clock groups constraint can be
applied to several clocks at once:
set_clock_groups -asynchronous \
-group {clkA clkA_gen0 clkA_gen1 } \
-group {clkB clkB_gen0 clkB_gen1 }
Or simply:
set_clock_groups -asynchronous \
-group [get_clocks -include_generated_clocks clkA] \
-group [get_clocks -include_generated_clocks clkB]
When the ratio between clock periods is high, choosing the minimum of the source and
destination clock periods is also a good option to reduce the transfer latency. A clean
asynchronous CDC path should not have any logic between the source and destination
sequential cells, so the Max Delay Datapath Only constraint is normally easy to meet for the
implementation tools.
Some asynchronous CDC paths require a skew control between the bits of the bus instead of a
constraint on the bus latency. Using a bus skew constraint prevents the receiving clock domain
from latching multiple states of the bus on the same clock edge. You can set the bus skew
constraint on the bus with set_bus_skew command. For example, you can apply
set_bus_skew to a CDC bus that uses gray-coding instead of using the Max Delay Datapath
Only constraint. For more information, see this link in the Vivado Design Suite User Guide: Using
Constraints (UG903).
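A minimal sketch of a bus skew constraint on a gray-coded CDC bus. The register names and the 2.5 ns value are hypothetical; the appropriate value depends on the source and destination clock periods.

```tcl
# Limit the skew between the bits of the gray-coded bus crossing
# from the source registers to the first destination registers.
set_bus_skew -from [get_cells {gray_src_reg[*]}] \
    -to [get_cells {gray_dst_reg[*]}] 2.500
```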
For the paths that do not need latency control, you can define a point-to-point false path
constraint.
In the following figure, the clock clk0 has a period of 5 ns and is asynchronous to clk1. There
are two paths from the clk0 domain to the clk1 domain. The first path is a 1-bit data
synchronization. The second path is a multi-bit gray-coded bus transfer.
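[Figure: two asynchronous CDC paths from the clk0 domain to the clk1 domain: a 1-bit data synchronization and a multi-bit gray-coded bus transfer. (X13456-121919)]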
The designer decides that the gray-coded bus transfer requires a Max Delay Datapath Only to
limit the delay variation among the bits, so it becomes impossible to use a Clock Groups or False
Path constraint between the clocks directly. Instead, two constraints must be defined:
There is no need to set a false path from clk1 to clk0 because there is no path in this example.
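The two constraints might be sketched as follows. All cell names are hypothetical, and the 5 ns value assumes the source clock period (clk0) is used as the maximum datapath delay.

```tcl
# Gray-coded bus: bound the datapath delay only, ignoring clock skew.
set_max_delay -datapath_only \
    -from [get_cells {gray_src_reg[*]}] \
    -to [get_cells {gray_sync_reg[*]}] 5.000

# 1-bit synchronizer: no latency requirement, so use a point-to-point
# false path instead of a constraint between the clocks.
set_false_path -from [get_cells sync_src_reg] -to [get_cells sync_dst_reg]
```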
• Asynchronous CDC paths cannot be safely timed due to the lack of a fixed phase relationship
between the clocks. They should be ignored (Clock Groups, False Path) or simply have a
datapath delay constraint (Max Delay Datapath Only).
• The sequential cell launch and capture edges are not active at every clock cycle, so the path
requirement can be relaxed accordingly (Multicycle Path).
• The path delay requirement needs to be tightened to increase the design margin in hardware
(Max Delay).
• A path through a combinatorial cell is static and does not need to be timed (False Path, Case
Analysis).
• The analysis should be done with only a particular clock driven by a multiplexer (Case
Analysis).
In any case, timing exceptions must be used carefully and must not be added to hide real timing
problems.
• The implementation compile time significantly increases when many exceptions are used,
especially when they are attached to a large number of netlist objects.
• Constraints debugging becomes extremely complicated when several exceptions cover the
same paths.
• Presence of constraints on a signal can hamper the optimization of that signal. Therefore,
including unnecessary exceptions or unnecessary points in exception commands can hamper
optimization.
Following is an example of timing exceptions that can negatively impact the run time:
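A constraint of the kind discussed in the points that follow, sketched here from their description (a false path from the din port to every register in the design):

```tcl
# all_registers can return thousands of cells, which makes this
# exception expensive to process.
set_false_path -from [get_ports din] -to [all_registers]
```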
• If the din port does not have an input delay, it is not constrained. So there is no need to add a
false path.
• If the din port feeds only to sequential elements, there is no need to specify the false path to
the sequential cells explicitly. This constraint can be written more efficiently:
set_false_path -from [get_ports din]
• If the false path is needed, but only a few paths exist from the din port to any sequential cell
in the design, then it can be more specific (all_registers can potentially return thousands
of cells, depending upon the number of registers used in the design):
set_false_path -from [get_ports din] -to [get_cells blockA/config_reg[*]]
• The more specific the constraint, the higher the priority. For example:
set_max_delay -from [get_clocks clkA] -to [get_pins inst0/D] 12
set_max_delay -from [get_clocks clkA] -to [get_clocks clkB] 10
The first set_max_delay constraint has a higher priority because the -to option uses a pin,
which is more specific than a clock.
• The exceptions priority is as follows:
1. set_false_path
2. set_max_delay or set_min_delay
3. set_multicycle_path
For details on XDC precedence and priorities, see this link in the Vivado Design Suite User Guide:
Using Constraints (UG903).
Use Cases
The typical cases for using the false path constraint are:
• Ignoring timing on a path that is never active. For example, a path that crosses two
multiplexers that can never let the data propagate in the same clock cycle because of the select
pins connectivity.
[Figure: a path from REG0 to REG1 that crosses two multiplexers, MUX0 and MUX1, each with inputs I0 and I1, select pin S, and output O. (X13457-121919)]
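A sketch of a false path for this situation. It assumes that the inactive path enters MUX0 through I0 and MUX1 through I1; adjust the pins to the actual connectivity of the select logic.

```tcl
# Ignore only the path that passes through both multiplexer inputs
# that can never be selected in the same clock cycle.
set_false_path -through [get_pins MUX0/I0] -through [get_pins MUX1/I1]
```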
Impact on Synthesis
The false path constraint is supported by synthesis and only impacts max delay (setup/recovery)
path optimization. False path exceptions are recommended for portions of the design where
timing can safely be ignored and for asynchronous CDC paths with no datapath delay
requirements.
Impact on Implementation
All the implementation steps are sensitive to the false path timing exception.
Use Cases
Common reasons for using the min or max delay constraints are as follows:
It is not common or recommended to force extra delay insertion on a path by using the
set_min_delay constraint. The default min delay requirement for hold or removal is usually
sufficient to ensure proper hardware functionality when the slack is positive.
Impact on Synthesis
The set_max_delay constraint is supported by synthesis, including the -datapath_only
option. The set_min_delay constraint is ignored.
Impact on Implementation
The set_max_delay constraint replaces the setup path requirement and influences the entire
implementation flow. The set_min_delay constraint replaces the hold path requirement and
only affects the router behavior whenever it introduces the need to fix hold.
Path segmentation must only be used by experts as it alters the fundamentals of timing analysis:
• Startpoints: Clock, clock pin, sequential cell (implies valid startpoint pins of the cell), input or
inout port.
• Endpoints: Clock, input data pin of sequential cell, sequential cell (implies valid endpoint pins
of the cell), output or inout port.
For details on path segmentation, see this link in the Vivado Design Suite User Guide: Using
Constraints (UG903).
The hold relationships are always tied to the setup ones. Consequently, in most cases, the hold
relationship also needs to be separately adjusted after the setup one has been modified. This is
why a second constraint with the -hold option is needed. The main exception to this rule is for
synchronous CDC paths between phase-shifted clocks: only setup needs to be modified.
[Figure: registers REGA and REGB, each with a clock enable (EN) pin. (X13458-121919)]
[Figure: BEFORE. Default setup and hold relationships relative to the launch edge and the active edges of the clock enable. (X13459-121919)]
Constraints:
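A sketch of the two commands, consistent with the note that follows (setup capture edge moved to the third edge, hold edge moved back by two cycles) and with the REGA and REGB registers in the figure:

```tcl
# Move the setup capture edge to the third edge.
set_multicycle_path -from [get_pins REGA/C] -to [get_pins REGB/D] -setup 3

# Bring the hold edge back to its original location.
set_multicycle_path -from [get_pins REGA/C] -to [get_pins REGB/D] -hold 2
```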
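[Figure: AFTER. Setup and hold relationships relative to the launch edge and the clock enable after the multicycle path constraints are applied. (X13460-121919)]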
Note: With the first command, as the setup capture edge moved to the third edge (that is, by two cycles
from its default position), the hold edge also moved by two cycles. The second command is for bringing the
hold edge back to its original location by moving it again by two cycles (in the reverse direction).
For more information on other common multicycle path scenarios, such as phase shift and
multicycle paths between synchronous clocks, see this link in the Vivado Design Suite User Guide:
Using Constraints (UG903).
IMPORTANT! When the clock phase shift does not modify the clock waveform but is instead included in
the insertion delay of the clock modifying block, you do not need to add a setup-only multicycle path to
properly time the path from or to the clock. For more information, see this link in the Vivado Design Suite
User Guide: Design Analysis and Closure Techniques (UG906).
As for synthesis, multicycle path exceptions help the implementation timing-driven algorithms to
focus on the real critical paths. The hold requirements are important only during routing. If a
setup relationship was adjusted with a set_multicycle_path constraint but not its
corresponding hold relationships, the worst hold requirement may become too hard to meet if it
is over 2 or 3 ns. This situation can have a negative impact on setup slack because of the
additional delay inserted by the router while fixing hold violations.
Common Mistakes
Following are common mistakes that you must avoid:
• Relaxing setup without adjusting hold back to same launch and capture edges in the case of a
multicycle path not functionally active at every clock cycle.
The hold requirement can become very high (at least one clock period in most cases) and
impossible to meet.
• Setting a multicycle path exception between incorrect points in the design.
This occurs when you assume that there is only one path from a startpoint cell to an endpoint
cell. In some cases, this is not true. The endpoint cell can have multiple data input pins,
including clock enable and reset pins, which are active on at least two consecutive clock
edges.
For this reason, Xilinx recommends that you specify the endpoint pin instead of just the cell
(or clock). For example, the endpoint cell REGB has three input pins: C, EN and D. Only the
REGB/D pin should be constrained by the multicycle path exception, not the EN pin because
it can change at every clock cycle. If the constraint is attached to a cell instead of a pin, all the
valid endpoint pins are considered for the constraints, including the EN (clock enable) pin.
To be safe, Xilinx recommends that you always use the following syntax:
set_multicycle_path -from [get_pins REGA/C] \
-to [get_pins REGB/D] -setup 3
set_multicycle_path -from [get_pins REGA/C] \
-to [get_pins REGB/D] -hold 2
Case Analysis
The case analysis command is commonly used to describe a functional mode in the design by
setting constants in the logic like what configuration registers do. It can be applied to input ports,
nets, hierarchical pins, or leaf cell input pins. The constant value propagates through the logic and
turns off the analysis on any path that can never be active. The effect is similar to how the false
path exception works.
The most common example is to set a multiplexer select pin to 0 or 1 to only allow one of the
two multiplexer inputs to propagate through. The following example turns off the analysis on the
paths through the mux/S and mux/I1 pins:
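A minimal sketch, assuming that a select value of 0 propagates the mux/I0 input:

```tcl
# Tie the select pin to 0 for timing analysis. Paths through mux/S and
# mux/I1 are no longer analyzed; only mux/I0 propagates to the output.
set_case_analysis 0 [get_pins mux/S]
```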
Disable Timing
The disable timing command turns off a timing arc in the timing database, which completely
prevents any analysis through that arc. The disabled timing arcs can be reported by the
report_disable_timing command.
CAUTION! Use the disable timing command carefully. It can break more paths than desired!
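As an illustration, the following sketch disables a single arc of a hypothetical mux cell with pins I1 and O, and then lists the disabled arcs to confirm that nothing else was affected:

```tcl
# Disable only the I1-to-O timing arc of the mux cell.
set_disable_timing -from I1 -to O [get_cells mux]

# Review all disabled timing arcs in the design.
report_disable_timing
```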
Data Check
The set_data_check command sets the equivalent of a setup or hold timing check between
two pins in a design. For example, this constraint can be used to report timing on asynchronous
interfaces. This command is ignored by the implementation tools and must only be used for
timing reporting purposes, typically by expert users.
At a minimum, Xilinx recommends applying the total power budget, maximum process, and
worst-case junction temperature to create a worst-case power analysis, using the following XDC
constraints:
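A sketch of such settings. The 25 W budget and the 100 °C junction temperature are placeholder assumptions; use the values that apply to your board and device temperature grade.

```tcl
# Total power budget for the design, in watts (assumed value).
set_operating_conditions -design_power_budget 25.0

# Worst-case (maximum) process.
set_operating_conditions -process maximum

# Worst-case junction temperature, in degrees Celsius (assumed value).
set_operating_conditions -junction_temp 100.0
```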
POWER TIP: For a worst-case power estimation and until the Theta Ja (ΘJa) of the thermal solution is
known, Xilinx recommends setting the Tj to the maximum allowed for the targeted temperature range.
Theta Ja can be calculated as follows based on the thermal simulation result: ΘJa = (Tj – Ta)/ Pd.
Units are Celsius per watt (°C/W).
The most accurate power estimation can be achieved after the Theta Ja of the thermal design is
known. You can apply the Theta Ja and the maximum supported ambient temperature (Ta) of the
application to report_power using the following constraints to replace the junction
temperature setting. Using these constraints allows report_power to estimate the junction
temperature more accurately and therefore, give a more accurate static power estimation.
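A sketch of these settings with assumed values. With ΘJa = 1.8 °C/W, Ta = 50 °C, and a hypothetical dissipation of 25 W, the relationship Tj = Ta + ΘJa × Pd gives 50 + 1.8 × 25 = 95 °C.

```tcl
# Maximum ambient temperature supported by the application (assumed).
set_operating_conditions -ambient_temp 50.0

# Theta Ja from the thermal simulation, in degrees Celsius per watt (assumed).
set_operating_conditions -thetaja 1.8
```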
In addition, you can specify the power delivery design using XDC constraints. Using this
approach allows report_power to report the margin on the total power, check the power
estimation on the power rails, and report the margin based on the specified estimation and
power rail consolidation. For more information on these constraints, see the Vivado Design Suite
User Guide: Power Analysis and Optimization (UG907).
POWER TIP: Ensure the text power report is used for the most detailed power rail constraints reporting.
• For locking placement and routing, including relative placement of macros, see the Vivado
Design Suite User Guide: Using Constraints (UG903).
• For floorplanning, see the Vivado Design Suite User Guide: Design Analysis and Closure
Techniques (UG906).
• For configuration, see the Vivado Design Suite User Guide: Programming and Debugging (UG908).
Chapter 5
Design Implementation
After selecting your device, choosing and configuring the IP, and writing the RTL and the
constraints, the next step is implementation. Implementation compiles the design through
synthesis and place and route, and then generates the file that is used to program the device. The
implementation process might have some iterative loops. This chapter describes the various
implementation steps, highlights points for special attention, and gives tips and tricks to identify
and eliminate specific bottlenecks.
IMPORTANT! You must regularly validate that synthesis and implementation occur without errors and
with minimal timing violations before adding new blocks or generating a platform for the Vitis™ tools.
Note: The implementation steps are run automatically as part of the Vitis environment flow. You can
improve performance by applying the techniques described in this chapter using the Vitis command line
options and configuration file. For more information, see the Vitis Unified Software Platform Documentation.
Running Synthesis
Synthesis takes in RTL and timing constraints and generates an optimized netlist that is
functionally equivalent to the RTL. In general, the synthesis tool can take any legal RTL and
create the logic for it. Synthesis requires realistic timing constraints.
Related Information
Design Constraints
Baselining the Design
Synthesis Flows
In the Vivado® Design Suite, you can run the synthesis flows described in the following sections,
which each have different advantages and trade-offs.
Global Synthesis
In the global synthesis flow, the full design is synthesized in one run, which offers the following
advantages:
• Allows the synthesis tool to perform the maximum optimization. Because the synthesis tool is
aware of the full design, the tool can optimize between hierarchies that other flows might not.
• Enables easy analysis after the synthesis run.
The disadvantage of this flow is longer compile time. Every time synthesis is run, the full design is
rerun. However, this disadvantage can be mitigated by using incremental synthesis.
Note: If your design includes XDC constraints, you must reference the objects to the top-level design.
Related Information
Incremental Synthesis
When creating the block design, you can run synthesis using either out-of-context (OOC)
synthesis mode or global synthesis mode. If you use out-of-context synthesis mode, the block
design is synthesized separately from the rest of the design. This allows for faster resynthesis
when hierarchies outside the BD file are modified. If you use global synthesis mode, the full
design is compiled and synthesized each time. Global synthesis mode is easier to set up because
constraints are set on a global level. However, using this mode results in a higher run time on
resynthesis. You can improve run time using incremental synthesis.
Out-of-Context Synthesis
In the out-of-context (OOC) synthesis flow, certain levels of hierarchy are synthesized separately
from the top level. The out-of-context hierarchies are synthesized first. Then, the top-level
synthesis is run, and each of the out-of-context runs is treated as a black box. Xilinx IP is often
run in out-of-context synthesis mode. After all of the out-of-context synthesis runs and top-level
synthesis runs are complete, the Vivado tools assemble the design from all of the synthesis runs
when you open the top-level synthesized design. This flow offers the following advantages:
• Reduces compile time for subsequent synthesis runs. Only the runs you specify are
resynthesized, leaving the other runs as is.
• Ensures stability when design changes are made. Only the runs that include changes are
resynthesized.
The disadvantage of this flow is that it requires additional setup. You must be careful in selecting
which modules to set as out-of-context synthesis modules. Any additional XDC constraints must
be defined separately and must only be used for the out-of-context synthesis runs.
The following figure shows a design that has a top-level synthesis run (synth_1) and two lower-
level out-of-context synthesis runs.
For more information on setting up out-of-context synthesis runs, see this link in the Vivado
Design Suite User Guide: Synthesis (UG901).
Incremental Synthesis
You can use incremental synthesis to reuse existing synthesis results. This approach reduces
typical synthesis compile times by 50%. When used with the incremental implementation flow,
this approach also improves overall compile time and timing closure consistency.
When a design is synthesized, it is broken into RTL partitions. Incremental synthesis reuses RTL
partitions from a previous synthesis run. RTL partitions are typically created along logical
hierarchies. Incremental synthesis only runs if the design is large enough that synthesis creates at
least 4 RTL partitions, each containing at least 25000 instances. Instances include both logical
hierarchy and RTL primitives.
Following are the different modes available when using the
synth_design -incremental_mode <value> command:
• quick: Fastest results but no cross boundary optimizations. This mode limits logic
performance.
• aggressive: All optimizations are enabled. Compile time is significantly reduced from non-
incremental synthesis.
The quick mode is typically recommended for low-performance designs only. Without cross
boundary optimizations, typical designs have reduced performance. However, if a design is well
constructed with registered hierarchical boundaries, performance might not be affected. Due to
the limits on cross-boundary optimization, resynthesis in a given area is caused by RTL changes
only in that area. Changes in another synthesis partition do not trigger changes beyond that
partition. This leads to more reuse and a faster synthesis result. For other modes, this is not the
case, and an RTL change might trigger the resynthesis of partitions beyond the one that contains
the change. When more than 50% of partitions are modified, a full resynthesis is triggered.
For high-performance designs, the default, aggressive, and off modes are recommended.
More optimizations are enabled in aggressive and off modes, which might lead to more
resynthesis but higher QoR.
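As a minimal sketch of how an incremental mode might be selected in a Non-Project Mode script, the following Tcl assumes that a reference checkpoint from an earlier synthesis run is available. The file paths, the top module name, and the part are placeholders rather than values from this guide.

```tcl
# Hypothetical sources, top module, and part; adjust to your design.
read_verilog [glob ./src/*.v]
read_xdc ./constraints/top.xdc

# Reference netlist from a previous synthesis run, so RTL partitions can be reused.
read_checkpoint -incremental ./ref/top_synth.dcp

# Select the incremental mode: off, quick, default, or aggressive.
synth_design -top top -part xcku040-ffva1156-2-e -incremental_mode quick

# Save the new result; it can serve as the reference for a later run.
write_checkpoint -force ./out/top_synth.dcp
```

Switching the mode value is the only change needed to compare compile time and QoR between modes on the same design.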
To more aggressively address compile time concerns, you can compare the incremental synthesis
against OOC synthesis. OOC synthesis is typically used by IP, and setup is automatic. Global
synthesis with incremental synthesis offers the advantage of cross-boundary optimizations as
well as compile time benefits. Following are areas to consider:
• Compile time
OOC synthesis is faster, because it reduces the amount of code that is elaborated each time.
RTL is only elaborated if the RTL is modified within the OOC module.
• Performance
Incremental synthesis has a performance advantage over OOC synthesis, because
optimization is not possible across OOC boundaries.
• Setup
For non-IP flows, when you create an OOC synthesis run, you must create a wrapper when
generics/parameters are passed from higher modules. In addition, you must create a separate
timing constraint file to target the OOC-level ports. Incremental synthesis does not have these
requirements.
For more information, see the Vivado Design Suite User Guide: Synthesis (UG901).
Synthesis Optimizations
By default, Vivado synthesis applies optimizations that yield the best results for the largest
number of designs. However, you can adjust the default synthesis optimizations as described in
the following sections.
Synthesis Settings
You can set several global settings that affect the entire design using the Vivado Design Suite
Synthesis Settings. These settings optimize how logic is inferred and how incremental synthesis is
run. Xilinx recommends using default options when you start your design and changing the
options based on the specific needs of your design. For more information, see this link in the
Vivado Design Suite User Guide: Synthesis (UG901).
Synthesis Attributes
Synthesis attributes allow you to control the logic inference in a specific way. Although synthesis
algorithms are set to give the best results for the largest number of designs, there are often
designs with differing requirements. In this case, you can use attributes to alter the design to
improve QoR. For information on the attributes supported by synthesis, see the Vivado Design
Suite User Guide: Synthesis (UG901).
POWER TIP: Evaluate synthesis settings carefully, because these settings can have a considerable impact
on the power consumption of a design. For example, a low setting for the control set threshold increases
the usage of register clock enables at the expense of less dense packing. Run the report_power
command after synthesis to evaluate the impact of synthesis settings on power.
Note: Before retargeting your design to a new device, Xilinx recommends reviewing any synthesis
attributes from previous design runs that target older devices.
When using the KEEP, DONT_TOUCH, and MAX_FANOUT attributes, be aware of the special
considerations described in the following sections.
• KEEP is used by the synthesis tool and is not passed along as a property in the netlist. KEEP
can be used to retain a specific signal, for example, to turn off specific optimizations on the
signal during synthesis.
• DONT_TOUCH is used by the synthesis tool and then passed along to the place and route
tools so the object is never optimized.
• A KEEP attribute on a register that receives RAM output prevents that register from being
merged into the RAM, thereby preventing a block RAM from being inferred.
• Do not use these attributes on a level of hierarchy that is driving a 3-state output or
bidirectional signal in the level above. If the driving signal and the 3-state condition are in this
level of hierarchy, the IOBUF is not inferred, because the tool must change the hierarchy to
create the IOBUF.
• Attributes that disable optimization often result in larger, higher power-consuming circuits.
Xilinx recommends using these controls sparingly and removing them when no longer needed.
Also, be aware that there is a difference between putting DONT_TOUCH on a signal or on a level
of hierarchy:
• If the attribute is placed on a level of hierarchy, the tool does not touch the boundaries of that
hierarchy, and no constant propagation occurs through the hierarchy. However, optimizations
inside that level of hierarchy are retained.
MAX_FANOUT
MAX_FANOUT forces the synthesis to replicate logic to meet a fanout limit. The tool is able to
replicate logic, but not inputs or black boxes. Accordingly, if a MAX_FANOUT attribute is placed
on a signal that is driven by a direct input to the design, the tool is unable to handle the
constraint.
Synthesis appends the replicated cells with _rep for the first replication and subsequent
replications are _rep__0, _rep__1 and so on. These cells can be seen in the post synthesized
netlist by selecting Edit → Find on cells.
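The same search can be scripted. The following sketch assumes a synthesized design is open and simply filters on the _rep naming pattern, so it can also match unrelated cells whose names happen to contain that string.

```tcl
# Collect cells whose names contain the _rep suffix added by replication.
set rep_cells [get_cells -hierarchical -filter {NAME =~ "*_rep*"}]
puts "Cells matching *_rep*: [llength $rep_cells]"
foreach c $rep_cells {
    puts "  $c"
}
```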
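Set the block-level synthesis strategy in the XDC file using the following syntax:
set_property BLOCK_SYNTH.<option_name> <value> [get_cells <instance_name>]
Where:
• <option_name> is the option to be set.
• <value> is the value to assign to the option.
• <instance_name> is the hierarchical instance on which to set the option.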
Note: These properties are always set on hierarchical instances. This allows modules or entities that are
instantiated more than once to be synthesized with different options.
For example, you can set the following strategies in an XDC file:
[Figure: block-level synthesis strategies by instance. U1: retiming; U2: area; U3: area; inst1: default.]
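The assignment of retiming to U1, area to U2 and U3, and default to inst1 could be written in XDC roughly as follows. This is a sketch: the property and strategy names follow the BLOCK_SYNTH options documented in UG901, and the path U3/inst1 assumes that inst1 is instantiated inside U3.

```tcl
# Enable retiming on U1 only.
set_property BLOCK_SYNTH.RETIMING 1 [get_cells U1]

# Optimize U2 and U3 for area.
set_property BLOCK_SYNTH.STRATEGY {AREA_OPTIMIZED} [get_cells U2]
set_property BLOCK_SYNTH.STRATEGY {AREA_OPTIMIZED} [get_cells U3]

# Return inst1 (assumed to be nested in U3) to the default strategy.
set_property BLOCK_SYNTH.STRATEGY {DEFAULT} [get_cells U3/inst1]
```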
You can set multiple BLOCK_SYNTH properties on the same instance to experiment with
different options. For example:
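A possible combination, shown only as a sketch, applies a routability-oriented strategy to one instance and also turns off FSM extraction on it. The instance name U1 is illustrative.

```tcl
# Two BLOCK_SYNTH properties on the same hierarchical instance.
set_property BLOCK_SYNTH.STRATEGY {ALTERNATE_ROUTABILITY} [get_cells U1]
set_property BLOCK_SYNTH.FSM_EXTRACTION {OFF} [get_cells U1]
```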
When working with IP, you can use the block-level synthesis strategy as follows:
• If the IP is compiled globally, you can use this strategy on the top level of the IP.
• If the IP is out-of-context, you cannot use the strategy, because the IP appears as a black box.
Instead, use global settings when compiling the IP.
Note: For more information on this feature and the supported strategies and options, see the Vivado Design
Suite User Guide: Synthesis (UG901).
• Post-synthesis netlist
• I/O, BUFG, and other placement specific requirements
• Attributes and wiring on MGTs, IODELAYs, MMCMs, PLLs and other primitives
RECOMMENDED: Review and correct DRC violations as early as possible in the design process to avoid
timing or logic related issues later in the implementation flow.
TIP: For DRC violations that can be safely ignored, you can use the waiver mechanism to waive the
violations. For details, see this link in the Vivado Design Suite User Guide: Design Analysis and Closure
Techniques (UG906).
In Project Mode, the tools automatically run Report Methodology during implementation
(opt_design or route_design) by default. To run these checks manually, use either of the
following methods:
• At the Tcl prompt, open the design to be validated, and enter the following Tcl command:
report_methodology
• To run these checks from the Vivado IDE, open the design to be validated, and select Reports
→ Report Methodology.
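In a scripted flow, the manual check might look like the following sketch. The checkpoint and report paths are placeholders.

```tcl
# Open the design to be validated (hypothetical checkpoint path).
open_checkpoint ./out/top_synth.dcp

# Write the methodology violations to a text report for review.
report_methodology -file ./reports/methodology.rpt
```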
RECOMMENDED: To identify common design issues, run this report the first time you synthesize the
design. Run this report again after significant module additions, constraint changes, or clocking circuit
changes.
Note: For Xilinx®-supplied IP cores, the violations are already reviewed and checked.
Any violations are listed in the Methodology window, as shown in the following figure. If a
specific methodology violation does not need to be fixed, make sure that you understand the
violation and its implication clearly and why the violation does not negatively impact your design.
IMPORTANT! You must resolve all Critical Warnings and most Warnings to ensure good QoR, accurate
timing analysis, and reliable operation in hardware. For methodology check violations that can be
safely ignored, you can use the waiver mechanism to waive the violations. For details, see the
Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906).
Note: Methodology checks related to RAMB and DSP primitive optional pipelining (SYNTH-6, SYNTH-11,
SYNTH-12 and SYNTH-13) are not reported when setup timing is greater than 1 ns on all of the input or
output paths for the primitives.
For more information on running Report Methodology, see the Vivado Design Suite User Guide:
System-Level Design Entry (UG895) and the Vivado Design Suite User Guide: Design
Analysis and Closure Techniques (UG906).
CAUTION! If a message appears more than 100 times, the tool writes only the first 100 occurrences to
the synthesis log file. You can change the limit of 100 through the Tcl command set_param
messaging.defaultLimit.
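For example, the limit could be raised before synthesis runs. The value 500 below is only illustrative.

```tcl
# Allow up to 500 occurrences of each message in the log instead of 100.
set_param messaging.defaultLimit 500
```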
RECOMMENDED: Review all Critical Warnings and Warnings related to timing constraints which indicate
that constraints have not been loaded or properly applied.
Report QoR Assessment provides a score between 1 and 5 that indicates how likely the design is
to close timing. The following table shows the definition of each score. Scores of 1 and 2 have no
chance of meeting timing closure and a score of 3 is unlikely to close. Therefore, low scores mean
more work in closing timing.
Score Meaning
1 Design will likely not complete implementation.
2 Design will complete implementation but will not meet timing.
3 Design will likely not meet timing.
4 Design will likely meet timing.
5 Design will meet timing.
In the report, the detailed table provides information on the basis for the score. The thresholds in
the detailed table are not absolute limits for the device. Instead, the thresholds indicate when
timing closure might become increasingly difficult to achieve. After you exceed the threshold of
any of these items, the difficulty in closing timing increases exponentially.
Plan to correct any items that are marked for review in the Report QoR Assessment. Many of the
items might be resolved automatically using Report QoR Suggestions.
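Both reports can be generated from Tcl on an open design. The following sketch writes them to placeholder file names.

```tcl
# Score the design and list the items marked for review.
report_qor_assessment -file ./reports/qor_assessment.rpt

# Generate suggestions that might resolve some of those items.
report_qor_suggestions -file ./reports/qor_suggestions.rpt
```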
HDL changes tend to have the biggest impact on QoR. You are therefore better off solving
problems before implementation to achieve faster timing convergence. When analyzing timing
paths, pay special attention to the following:
• Most frequent offenders (that is, the cells or nets that show up the most in the top worst
failing timing paths)
• Paths sourced by unregistered block RAMs
• Paths sourced by SRL
• Paths containing unregistered, cascaded DSP blocks
• Paths with large number of logic levels
• Paths with large fanout
Related Information
Timing Closure
Reviewing Utilization
It is important to review utilization for LUT, FF, block RAM, and DSP components independently.
A design with low LUT/FF utilization might still experience placement difficulties if block RAM
utilization is high. The report_utilization command generates a comprehensive utilization
report with separate sections for all design objects.
Note: After synthesis, utilization numbers might change due to optimization later in the design flow.
For interfaces that need a very tight timing relationship, such as source synchronous
interfaces, it is sometimes better to lock specific resources for the signals involved. In general,
as a starting point for your design, lock only the I/Os unless there are specific reasons not to
follow this approach, as cited above.
Related Information
Timing Closure
• Run the report_clock_networks command to show the clock network in a detailed tree view.
• Use clock trees in a way that minimizes skew.
• For the outputs of PLLs and MMCMs, use the same clock buffer type to minimize skew.
• Look for unintended cascaded BUFG elements that can introduce additional delay, skew, or
both.
Strategies
Strategies are used by the Vivado Design Suite to control both the tool options and the reports
that are generated by synthesis and implementation runs in Project Mode. You can use the
strategies to adjust the implementation goals and to control the reports that are generated. For
more information on strategies, see the Vivado Design Suite User Guide: Implementation (UG904).
RECOMMENDED: Try the default strategy Vivado Implementation Defaults first. It provides a good
trade-off between compile time and design performance.
Note: Strategies are tool and version specific. In some cases, strategies might require a longer compile time.
Directives
Directives provide different modes of behavior for the following implementation commands:
• opt_design
• place_design
• phys_opt_design
• route_design
Use the default directive initially. Use other directives when the design nears completion to
explore the solution space for a design. You can specify only one directive at a time. For more
information on directives, see the Vivado Design Suite User Guide: Implementation (UG904).
Iterative Flows
In Non-Project Mode, you can iterate between various optimization commands with different
options. For example, you can run phys_opt_design -directive
AggressiveFanoutOpt followed by phys_opt_design -directive
AlternateFlowWithRetiming to run different physical synthesis optimizations on a placed
design that does not meet timing.
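The iteration described above can be sketched as a small loop that stops once setup timing is met. The helper proc name is hypothetical, and the sketch assumes that a placed, timing-constrained design is open.

```tcl
# Hypothetical helper: worst setup slack (ns) of the open design.
proc worst_setup_slack {} {
    return [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]
}

# Try each physical optimization directive in turn until timing is met.
foreach directive {AggressiveFanoutOpt AlternateFlowWithRetiming} {
    if {[worst_setup_slack] >= 0} { break }
    phys_opt_design -directive $directive
    puts "WNS after $directive: [worst_setup_slack] ns"
}
```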
Checkpoint designs can be run through the remainder of the design flow using Tcl commands.
They cannot be modified with new design sources.
• Saving results so you can go back and do further analysis on that part of the flow.
• Trying place_design using multiple directives and saving the checkpoints for each. This
would allow you to select the place_design checkpoint with the best timing results for the
subsequent implementation steps.
For more information on checkpoints, see the Vivado Design Suite User Guide: Implementation
(UG904).
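As a sketch of the second use case, the following Non-Project Mode loop places the same optimized design with several directives and saves one checkpoint per directive. The input checkpoint name and the choice of directives are assumptions made for illustration.

```tcl
# Hypothetical post-opt_design checkpoint used as the common starting point.
set start_dcp ./out/top_opt.dcp

foreach directive {Default Explore ExtraNetDelay_high} {
    open_checkpoint $start_dcp
    place_design -directive $directive

    # Record the worst setup slack so the runs can be compared afterwards.
    set wns [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]
    puts "place_design -directive $directive: WNS = $wns ns"

    write_checkpoint -force ./out/top_place_$directive.dcp
    close_design
}
```

The checkpoint with the best timing can then be reopened for the remaining implementation steps.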
report_timing_summary
report_timing
report_power
report_methodology
report_drc
After the checkpoint is open, you can open the interactive report file using Reports → Open
Interactive Report.
Note: In Project Mode, the interactive reports are generated and opened automatically.
RECOMMENDED: When a report is generated, there is a size limit on the RPX file. Therefore, Xilinx
recommends using the catch command to prevent errors that might stop the flow. For example:
catch {report_timing_summary -rpx timing_summary.rpx -file timing_summary.rpt}
RECOMMENDED: Incremental implementation is most useful during critical stages of the design cycle
when changes to the flow scripts are difficult to make. Ensure that your flow scripts include incremental
implementation early in the design cycle so you can enable incremental implementation during critical
periods.
Note: For further improvement in compile times and QoR, you can also use incremental synthesis.
Related Information
Incremental Synthesis
Note: The automatic incremental implementation mode is less aggressive than running the default
incremental implementation flow and enables better maintenance of QoR when running the incremental
implementation flow.
Project Mode
In Project Mode, the Vivado tools manage updating of the checkpoint as well as which algorithms
to use. To enable the automatic incremental implementation mode in Project Mode, right-click an
implementation run in the Design Runs window, and select Set Incremental Compile → Automatically
use the checkpoint from the previous run.
Non-Project Mode
In Non-Project Mode, the Vivado tools manage which algorithms to use, but you must decide
whether to update the checkpoint. To enable the automatic incremental implementation mode in
Non-Project Mode, use the -auto_incremental option. Following is an example command:
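A minimal sketch of such a flow follows. The checkpoint paths are placeholders, and the reference checkpoint is assumed to come from an earlier routed run of the same design.

```tcl
# Load the current synthesized design (hypothetical path).
open_checkpoint ./out/top_synth.dcp

# Let the tools decide whether and how to reuse the reference checkpoint.
read_checkpoint -auto_incremental -incremental ./ref/top_routed.dcp

opt_design
place_design
phys_opt_design
route_design
write_checkpoint -force ./out/top_routed.dcp
```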
When updating the checkpoint, ensure that WNS did not degrade beyond acceptable limits by
using the following command at the end of the implementation flow:
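One way to express this check is sketched below. The limit of -0.250 ns and the file names are illustrative assumptions. Choose a limit that matches what is acceptable for your design.

```tcl
# Worst setup slack of the routed design that is currently open.
set wns [get_property SLACK [get_timing_paths -max_paths 1 -nworst 1 -setup]]

# Promote the new result to reference checkpoint only if WNS is acceptable.
if {$wns > -0.250} {
    file copy -force ./out/top_routed.dcp ./ref/top_routed.dcp
}
```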
• High Reuse Mode: High reuse mode is enabled when cell reuse is equal to or greater than
75%. This mode triggers incremental algorithms to be run and is the standard mode for
incremental implementation.
• Low Reuse Mode: Low reuse mode is enabled when cell reuse is less than 75%. This mode
reuses the cell placement of certain cells but runs the default algorithms. This mode can be
effective when targeting block placement of DSPs, BRAMs, or a hierarchical cell.
The following table shows each incremental directive and the corresponding target WNS
behavior.
Parallel Runs
To improve your chances of meeting timing using the default flow, it is common to implement
many parallel runs, each with different placer directives. For incremental flows, the directive
indicates whether to close or maintain timing. To achieve a spread of results, target the desired
incremental directive with different reference checkpoints.
When using the TimingClosure directive with automatic incremental implementation mode or
high reuse mode, time is spent on running extra algorithms to close timing. Compile time can
increase using this mode, especially when it is difficult to close timing or there are timing failures
in congested areas. When the reference checkpoint meets timing, compile time reduction is
similar to using the RuntimeOptimized directive as described previously.
In low reuse mode, compile time is not predictable. When the place and route runs get closer to
meeting timing, the Vivado tools might increase compile time to meet timing. In other cases, the
Vivado tools might decrease compile time if existing placement and routing data is reused
efficiently.
Optimization Analysis
The opt_design command generates messages detailing the results for each optimization
phase. After optimization you can run report_utilization to analyze utilization
improvements. To better analyze optimization results, rerun opt_design with the -verbose
and -debug_log options for complete details on how each optimization affects the logic and
how user constraints prevent some optimizations. For more information, see the
Vivado Design Suite User Guide: Implementation (UG904).
Placement (place_design)
The Vivado Design Suite placer engine positions cells from the netlist onto specific sites in the
target Xilinx device.
Placement Analysis
Use the timing summary report after placement to check the critical paths.
• Paths with very large negative setup time slack might require that you check the constraints for
completeness and correctness, or restructure the logic to achieve timing closure.
• Paths with very large negative hold time slack are most likely due to incorrect constraints or
bad clocking topologies and should be fixed before moving on to route design.
• Paths with small negative hold time slack are likely to be fixed by the router.
You can also run report_clock_utilization after place_design to view a report that breaks down
clock resource and load counts by clock region.
For more information on placement, see this link in the Vivado Design Suite User Guide:
Implementation (UG904).
Related Information
Timing Closure
Routing (route_design)
The Vivado Design Suite router performs routing on the placed design and performs optimization
on the routed design to resolve hold time violations. By default, the router performs optimization
using a balance between runtime and design performance while alleviating congestion. Some
router directives sacrifice runtime for better design performance and more aggressive congestion
reduction. For more information on routing, see this link in the Vivado Design Suite User Guide:
Implementation (UG904).
Route Analysis
Nets that are routed sub-optimally are often the result of incorrect timing constraints. Before you
experiment with router settings, make sure that you have validated the constraints and the
timing picture seen by the router. Validate timing and constraints by reviewing timing reports
from the placed design before routing.
Common examples of poor timing constraints include cross-clock paths and incorrect multicycle
paths causing route delay insertion for hold fixing. Congested areas can be addressed by targeted
fanout optimization in RTL synthesis or through physical optimization. You can preserve all or
some of the design hierarchy to prevent cross-boundary optimization and reduce the netlist
density. Or you can use floorplan constraints to ease congestion.
Related Information
Timing Closure
Chapter 6
Design Closure
Design closure consists of meeting all system performance, timing, and power requirements, and
successfully validating the functionality in hardware. During the design closure phase where you
are starting to run the design through the implementation tools, both timing and power
considerations should be your top priorities.
At this stage of design closure, estimation of design utilization, timing and power gain more
accuracy. This presents an opportunity to reaffirm that the timing and power goals are
achievable. To confirm the design can meet its requirements, Xilinx recommends conducting both
a timing and power baseline. A timing baseline is largely about evaluating timing paths after
accurate timing constraints have been defined. A power baseline needs to provide Vivado with
the right toggle information to determine accurate dynamic power information.
Analyze power requirements and timing requirements together. If one item is off significantly,
a measure taken to resolve it can significantly impact the other. For example:
• An extreme measure might be necessary to meet a power budget such as scaling back
features. This will make timing closure significantly easier as the part is less congested.
• A less extreme measure might involve adding logic to reduce switching. This might make
timing closure more difficult, particularly if in a congested area of the die.
While many power saving items do not impact timing closure, it is possible that other items might
make timing closure harder. Applying the required power saving techniques early will help you
understand the true magnitude of the timing closure task.
Once you start to iterate from the baseline, you should recheck the power numbers when you
make an improvement to timing. This ensures that you understand what change caused a
regression. Generally, turning on wholesale power saving features early and then scaling back on
individual items that are causing timing issues helps to strike the right balance of meeting design
closure goals.
Conducting both power and timing analysis together and early in the design closure
implementation phase will save engineering time and enable more accurate project planning. In
addition, it leaves more time to explore engineering solutions than when problems are discovered
later in the design cycle.
TIP: For more information on reports mentioned in this chapter, see Vivado Design Suite User Guide:
Design Analysis and Closure Techniques (UG906).
TIP: See the UltraFast Design Methodology Timing Closure Quick Reference Guide (UG1292) for a
condensed version of the techniques described in this chapter, including running initial design checks,
baselining the design, and resolving timing violations.
Timing Closure
Timing closure consists of the design meeting all timing requirements. It is easier to reach timing
closure if you have the right HDL and constraints for synthesis. In addition, it is important to
iterate through the synthesis stages with improved HDL, constraints, and synthesis options, as
shown in the following figure.
[Figure: Run Synthesis, then Define & Refine Constraints (report_clock_networks → create_clock / create_generated_clock; report_clock_interaction → set_clock_groups / set_false_path; check_timing → I/O delays; report_timing_summary → timing exceptions), cross-probing instances in the critical path in Netlist view and Elaborated view schematics. If timing is not acceptable (N), review options & HDL code and run synthesis again.]
• When initially not meeting timing, evaluate timing throughout the flow.
• Focus on worst negative slack (WNS) of each clock as the main way to improve total negative
slack (TNS).
• Review large worst hold slack (WHS) violations (<-1 ns) to identify missing or inappropriate
constraints.
• Revisit the trade-offs between design choices, constraints, and target architecture.
• Know how to use the tool options and Xilinx® design constraints (XDC).
• Be aware that the tools do not try to further improve timing (additional margin) after timing is
met.
The following sections provide recommendations for reviewing the completeness and
correctness of the timing constraints using methodology design rule checks (DRCs) and
baselining, identifying the timing violation root causes, and addressing the violations using
common techniques.
Note: Timing results after synthesis use estimated net delays and not the actual routing delays. To get the
final timing results, run implementation and then check the Report Timing Summary.
Review the Check Timing section of the Timing Summary report to quickly assess the timing
constraints coverage, including the following:
CAUTION! Excessive use of wildcards in constraints can cause the actual constraints to be different
from what you intended. Use the report_exceptions command to identify timing exception
conflicts and to review the netlist objects, timing clocks, and timing paths covered by each exception.
In addition to check_timing, the Methodology report (TIMING and XDC checks) flags timing
constraints that can lead to inaccurate timing analysis and possible hardware malfunction. You
must carefully review and address all reported issues.
Note: When baselining the design, you must use all Xilinx IP constraints. Do not specify user I/O
constraints, and ignore the violations generated by check_timing and report_methodology due to
missing user I/O constraints.
• Total Negative Slack (TNS): The sum of the setup/recovery violations for each endpoint in the
entire design or for a particular clock domain. The worst setup/recovery slack is the worst
negative slack (WNS).
• Total Hold Slack (THS): The sum of the hold/removal violations for each endpoint in the entire
design or for a particular clock domain. The worst hold/removal slack is the worst hold slack
(WHS).
• Total Pulse Width Slack (TPWS): The sum of the violations for each clock pin in the entire
design or a particular clock domain for the following checks:
• Worst Pulse Width Slack (WPWS): The worst slack for all pulse width, period, or skew checks
on any given clock pin.
The Total Slack (TNS, THS or TPWS) only reflects the violations in the design. When all timing
checks are met, the Total Slack is null (zero).
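The relationship between the worst and total values can be illustrated with plain Tcl. The endpoint slack values below are invented for the example. Only negative slacks contribute to TNS.

```tcl
# Hypothetical worst setup slack per endpoint, in ns.
set endpoint_slacks {0.350 -0.120 -0.400 0.020}

set wns [lindex $endpoint_slacks 0]
set tns 0.0
foreach s $endpoint_slacks {
    # WNS is the smallest slack over all endpoints.
    if {$s < $wns} { set wns $s }
    # TNS accumulates violations only.
    if {$s < 0} { set tns [expr {$tns + $s}] }
}
puts [format "WNS = %.3f ns, TNS = %.3f ns" $wns $tns]
# prints: WNS = -0.400 ns, TNS = -0.520 ns
```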
The timing path report provides detailed information on how the slack is computed on any logical
path for any timing check. In a fully constrained design, each path has one or several
requirements that must all be met in order for the associated logic to function reliably.
The main checks covered by WNS, TNS, WHS, and THS are derived from the sequential cell
functional requirements:
• Setup time: The time before the next active clock edge by which the new data must be stable
and available in order to be safely captured.
• Hold requirement: The amount of time the data must remain stable after an active clock edge
to avoid capturing an undesired value.
• Recovery time: The minimum time required between the time the asynchronous reset signal
has toggled to its inactive state and the next active clock edge.
• Removal time: The minimum time after an active clock edge before the asynchronous reset
signal can be safely toggled to its inactive state.
A simple example is a path between two flip-flops that are connected to the same clock net.
After a timing clock is defined on the clock net, the timing analysis performs both setup and hold
checks at the data pin of the destination flip-flop under the most pessimistic, but reasonable,
operating conditions. The data transfer from the source flip-flop to the destination flip-flop
occurs safely when both setup and hold slacks are positive.
For more information on timing analysis, see this link in the Vivado Design Suite User Guide: Design
Analysis and Closure Techniques (UG906).
Run check_timing to identify unconstrained paths. You can run this command as a standalone
command, but it is also part of report_timing_summary. In addition,
report_timing_summary includes an Unconstrained Paths section where N logical paths
without timing requirements are listed by the already defined source or destination timing clock.
N is controlled by the -max_paths option.
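A sketch of both approaches on an open design follows. The report file names are placeholders, and the path count of 10 is arbitrary.

```tcl
# Standalone constraint coverage check.
check_timing -file ./reports/check_timing.rpt

# Timing summary that also lists up to 10 unconstrained paths per group.
report_timing_summary -report_unconstrained -max_paths 10 \
    -file ./reports/timing_summary.rpt
```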
After the design is fully constrained, run the report_methodology command and review the
TIMING and XDC checks to identify non-optimal constraints, which are likely to make timing
analysis less accurate and lead to timing margin variations in hardware. To identify and
correct unrealistic target clock frequencies or setup path requirements, use the
report_qor_assessment command.
IMPORTANT! To address missing or incomplete constraints, use the Timing Constraints wizard or see the
Vivado Design Suite User Guide: Using Constraints (UG903).
Zero unconstrained internal endpoints indicate that all internal paths are constrained for timing
analysis. However, the correct value of the constraints is not yet guaranteed.
Generated Clocks
Generated clocks are a normal part of a design. However, if a generated clock is derived from a
master clock that is not part of the same clock tree, this can cause a serious problem. The timing
engine cannot properly calculate the generated clock tree delay. This results in erroneous slack
computation. In the worst case situation, the design meets timing according to the reports but
does not work in hardware.
RECOMMENDED: Start by validating baselining constraints and then complete the constraints with the
I/O timing.
Multiple Clocks
Multiple clocks are usually acceptable. Xilinx recommends that you confirm that these clocks are
expected to propagate on the same clock tree. You must also verify that the path requirements
between these clocks do not introduce tighter requirements than needed for the design to be
functional in hardware.
If this is the case, you must use set_clock_groups or set_false_path between these
clocks on these paths. Any time that you use timing exceptions, you must ensure that they affect
only the intended paths.
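As an illustration, the following XDC sketch assumes two clocks that are defined on the same input port and are never active at the same time. The port name, clock names, and periods are hypothetical.

```tcl
# Two alternative clocks on the same port; -add keeps the first definition.
create_clock -name clk_fast -period 4.000 [get_ports clk_in]
create_clock -name clk_slow -period 10.000 -add [get_ports clk_in]

# The clocks cannot exist in the device at the same time, so paths
# between them do not need to be timed.
set_clock_groups -physically_exclusive -group clk_fast -group clk_slow
```

If only some paths between the clocks should be excluded, a narrower set_false_path on those paths limits the exception to the intended scope.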
For more information on some of these checks, see this link in the Vivado Design Suite User Guide:
Design Analysis and Closure Techniques (UG906). Also, see the Adoption Of The Methodology
Report blog series for more information on how report_methodology helps to resolve issues
and save time.
IMPORTANT! To increase visibility, the summary of the methodology violations is also included in the
timing summary text report, because addressing these issues is critical for having proper signoff timing.
For more information on timing methodology checks, see this link in the Vivado Design Suite User
Guide: Design Analysis and Closure Techniques (UG906).
FMAX (MHz) = max(1000 / (Ti - WNSi))
Where:
• Ti is the target clock period (ns) used during the implementation run "i"
• WNSi is the worst negative slack (ns) of the target clock used during the implementation run
"i"
• Using overly tight clock periods can lead to automatic effort reduction in the Vivado
Implementation tools to avoid high compilation time due to unrealistic target and large timing
violations. Use reasonably tight clock constraints instead.
• For designs with multiple clocks, you must proportionally decrease all synchronous clock
periods until one of them starts failing timing after implementation (preferably the fastest
clock or the clock with the most timing paths).
Note: The FMAX value is not explicitly provided in the report_timing or report_timing_summary
report.
For a given design implementation, the maximum operating frequency on hardware across
temperature and voltage ranges supported by the target device speed grade is defined by
1000/(T - WNS), with WNS positive or negative. When operating under nominal temperature
and voltage conditions, typically in a lab environment, it is usually possible to operate the design
at a slightly higher frequency.
Note: To increase the maximum frequency of the design, you can leverage the techniques described in this
chapter or use Intelligent Design Runs.
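The 1000/(T - WNS) expression can be evaluated with a few lines of plain Tcl. The sketch below is illustrative only: fmax_mhz and best_fmax_mhz are hypothetical helper names, not Vivado commands, and the period and WNS values are made-up numbers rather than results from a real run.

```tcl
# Hypothetical helper (not a Vivado command).
# period_ns: target clock period T used for the run, in ns.
# wns_ns:    worst negative slack reported for that clock, in ns
#            (negative when timing fails, positive when it is met).
proc fmax_mhz {period_ns wns_ns} {
    return [expr {1000.0 / ($period_ns - $wns_ns)}]
}

# Highest frequency across several runs, each given as a {T WNS} pair.
proc best_fmax_mhz {runs} {
    set best 0.0
    foreach run $runs {
        lassign $run t wns
        set f [fmax_mhz $t $wns]
        if {$f > $best} { set best $f }
    }
    return $best
}

# A 4.0 ns target that fails by 0.5 ns: 1000 / (4.0 + 0.5)
puts [format "%.1f MHz" [fmax_mhz 4.0 -0.5]]   ;# prints 222.2 MHz

# Two illustrative runs: {4.0 ns, WNS -0.5 ns} and {4.2 ns, WNS +0.1 ns}
puts [format "%.1f MHz" [best_fmax_mhz {{4.0 -0.5} {4.2 0.1}}]]   ;# prints 243.9 MHz
```

In a Vivado session, the period and WNS inputs would come from the timing summary of each implementation run.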
When baselining the design, you must meet timing after each implementation step by analyzing
and resolving timing challenges throughout the flow. First, you create simple and valid
constraints to give a realistic picture of timing in the Vivado® implementation tools. Then, while
iterating through different implementation steps, you solve timing violations before moving onto
the next step. The following figure shows the baselining process.
RECOMMENDED: Xilinx recommends that you create the baseline constraints very early in the design
process, and plan any major change to the design HDL against these baseline constraints.
Not all constraints need to be defined at this stage. The Vivado tools ignore I/O timing by default
if there are no constraints. Therefore, you do not need to define I/O timing constraints at this
point. Instead, define the I/O timing constraints later in the flow after the baselining process is
complete.
TIP: When using the Timing Constraints wizard, deselect the suggested I/O timing constraints.
To get an accurate picture of internal timing in the device, define the following constraints:
After creating the constraints, identify the paths that cannot meet timing. Rewrite the
corresponding RTL or relax the clock period.
IMPORTANT! All Xilinx IP and partner IP are delivered with specific XDC constraints that comply with the
Xilinx constraints methodology. The IP constraints are automatically included during synthesis and
implementation. You must keep the IP constraints intact when creating the baselining constraints.
If you do not use the Timing Constraints wizard to define the constraints, the following sections
cover the steps you must take to define the baseline constraints manually.
Use the report_clock_networks Tcl command to create a list of all the primary clocks that
must be defined in the design. The resulting list of clock networks shows which clock constraints
should be created. Use the Timing Constraints Editor to specify the appropriate parameters for
each clock.
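If you prefer to enter the clock constraints as Tcl/XDC commands instead of using the Timing Constraints Editor, the flow might look like the following sketch. The port names, clock names, and periods are hypothetical and must be replaced with the values for your board and design.

```tcl
# List the clock networks so that unconstrained clock sources become visible.
report_clock_networks

# Hypothetical primary clocks; replace the port names and periods with the
# values from your board and design specification.
create_clock -name sys_clk    -period 10.000 [get_ports sys_clk_p]
create_clock -name eth_rx_clk -period 8.000  [get_ports eth_rx_clk]

# Confirm that the primary clocks and any generated clocks are now reported.
report_clocks
```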
Note: MMCMs, PLLs, and clock buffers are clock-modifying blocks. For UltraScale™ devices, GTs are also
clock-modifying blocks.
The report_clocks results show that all clocks are propagated. The difference between the
primary clocks (created with create_clock) and the generated clocks is displayed in the
attributes field:
You can also create generated clocks using the create_generated_clock constraint. For
more information, see the Vivado Design Suite User Guide: Using Constraints (UG903).
Figure 112: Clock Report Shows the Clocks Generated from Primary Clocks
TIP: To verify that there are no unconstrained endpoints in the design, see the Check Timing report
(no_clock category). The report is available from within the Report Timing Summary or by using the
check_timing Tcl command.
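As a hedged example, the no_clock check can be run on its own from the Tcl console. The report file name is arbitrary, and restricting the run with -override_defaults is an assumption about how you want to scope the check; running check_timing without options reports all default checks.

```tcl
# Run only the no_clock check and list the affected endpoints in detail.
check_timing -override_defaults no_clock -verbose -file check_timing_no_clock.rpt
```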
Note: This section does not explain how to properly cross clock domain boundaries. Instead, it explains how
to identify which crossings exist and how to constrain them.
The following table explains the meaning of each color in this report.
Before the creation of any false paths or clock group constraints, the only colors that appear in
the matrix are black, red, and green. Because all clocks are timed by default, the process of
decoupling asynchronous clocks takes on a high degree of significance. Failure to decouple
asynchronous clocks often results in a highly over-constrained design.
Use the report_cdc Tcl command for a comprehensive analysis of clock domain crossing
circuitry between asynchronous clocks. For more information on the report_cdc command,
see this link in the Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906).
Also, see report_cdc in the Vivado Design Suite Tcl Command Reference Guide (UG835).
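A possible command sequence for reviewing clock pairs and their crossing circuitry is sketched below. The report file names are arbitrary.

```tcl
# Clock-pair matrix with both setup and hold relationships.
report_clock_interaction -delay_type min_max -file clock_interaction.rpt

# Detailed list of clock domain crossings between asynchronous clocks.
report_cdc -details -file cdc_details.rpt
```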
The Vivado tools identify the path requirements by expanding each clock out to 1000 cycles,
then determining where the closest, non-coincident edge alignment occurs. When 1000 cycles
are not sufficient to determine the tightest requirement, the report shows Not Expanded, in
which case you must treat the two clocks as asynchronous.
For example, consider a timing path that crosses from a 250 MHz clock to a 200 MHz clock:
• The positive edges of the 200 MHz clock are {0, 5, 10, 15, 20}.
• The positive edges of the 250 MHz clock are {0, 4, 8, 12, 16, 20}.
The tightest requirement for this pair of clocks occurs when the following is true:
• The 250 MHz clock launches at 4 ns.
• The 200 MHz clock captures at 5 ns.
This results in all paths timed from the 250 MHz clock domain into the 200 MHz clock domain
being timed at 1 ns.
Note: The simultaneous edge at 20 ns is not the tightest requirement in this example, because the capture
edge cannot be the same as the launch edge.
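The edge search in this example can be reproduced with plain Tcl. The sketch below is a simplified model and not the Vivado algorithm: it assumes both clocks start with a rising edge at 0 ns, considers rising edges only, and uses a hypothetical helper name, tightest_requirement_ps. Periods are given in picoseconds so that the arithmetic stays in exact integers.

```tcl
# Hypothetical helper (not a Vivado command): brute-force search for the
# smallest positive launch-to-capture distance between two clocks.
proc tightest_requirement_ps {launch_period_ps capture_period_ps {cycles 1000}} {
    set best ""
    for {set i 0} {$i < $cycles} {incr i} {
        set launch [expr {$i * $launch_period_ps}]
        # First capture edge strictly after this launch edge, so a
        # coincident edge is never used as the capture edge.
        set capture [expr {($launch / $capture_period_ps + 1) * $capture_period_ps}]
        set req [expr {$capture - $launch}]
        if {$best eq "" || $req < $best} {
            set best $req
        }
    }
    return $best
}

# 250 MHz (4000 ps) launching into 200 MHz (5000 ps)
puts [tightest_requirement_ps 4000 5000]   ;# prints 1000
```

The 1000 ps result corresponds to the 4 ns launch edge and the 5 ns capture edge in the edge lists above.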
Because this is a fairly tight timing requirement, you must take additional steps. Depending on
the design, one of the following constraints might be the correct way to handle these crossings:
If nothing is done, the design might exhibit timing violations that cross these two domains. In
addition, all of the best optimization, placement and routing might be dedicated to these paths
instead of given to the real critical paths in the design. It is important to identify these types of
paths before any timing-driven implementation step.
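The specific constraint options are not reproduced here. As an illustration only, two constraints commonly applied to this kind of crossing are sketched below, assuming hypothetical clock names clk_250 and clk_200 and assuming that every crossing already has proper clock domain crossing circuitry. The 4.000 ns value is a placeholder, not a recommendation.

```tcl
# Hypothetical clock names for the 250 MHz and 200 MHz domains.
# Case 1: the two clocks are treated as asynchronous in both directions.
set_clock_groups -asynchronous \
    -group [get_clocks clk_250] \
    -group [get_clocks clk_200]

# Case 2 (use instead of case 1): keep the crossing delay bounded while
# ignoring clock skew and the 1 ns edge relationship.
# set_max_delay -datapath_only -from [get_clocks clk_250] \
#     -to [get_clocks clk_200] 4.000
```

Only one of the two cases should be applied to a given clock pair, because a clock group constraint takes precedence over a maximum delay constraint on the same paths.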
Figure 114: Clock Domain Crossing from 250 MHz to 200 MHz
TIP: You can use the config_timing_analysis -ignore_io_paths yes Tcl command to
ignore timing on all I/O paths during implementation and in reports that use timing information. You must
manually enter this command before or immediately after opening a design in memory.
Based on recommendations of the RTL designer, timing exceptions must be limited and must not
be used to hide real timing problems. Prior to this point, the false path or clock groups between
clocks must be reviewed and finalized.
IP constraints must be entirely kept. When IP timing constraints are missing, known false paths
might be reported as timing violations.
In addition to evaluating the timing for the entire design after each implementation step, you can
take a more targeted approach for individual paths to evaluate the impact of each step in the
flow on the timing. For example, the estimated net delay for a timing path after the optimization
step might differ significantly from the estimated net delay for the same path after placement.
Comparing the timing of critical paths after each step is an effective method for highlighting
where the timing of a critical path diverges from closure.
Clock skew is accurately estimated and can be used to review the impact of imbalanced clock
trees on slack. You can estimate hold fixing by running min delay analysis. Large hold violations,
where the WHS is -0.500 ns or worse between slices, block RAMs, or DSPs, must be fixed. Small
violations are acceptable and will likely be fixed by the router.
Note: Paths to/from dedicated blocks like the PCIe® block can have estimated hold violations worse than
-0.500 ns that are automatically fixed by the router. For these cases, check report_timing_summary
after routing to verify that all corresponding hold violations are fixed.
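The -0.500 ns guideline can be checked from the Tcl console after placement. This is a minimal sketch that assumes a placed design is open and that at least one hold path exists. SLACK is read as a property of the timing path object, and the report name is arbitrary.

```tcl
# Worst hold path after place_design (based on estimated routing delays).
set hold_path [get_timing_paths -hold -max_paths 1 -nworst 1]
set whs [get_property SLACK $hold_path]

if {$whs <= -0.500} {
    puts "WHS = $whs ns: large estimated hold violation, review before routing"
    # List the ten worst min-delay paths for closer inspection.
    report_timing -delay_type min -max_paths 10 -file post_place_hold.rpt
} else {
    puts "WHS = $whs ns: no large hold violation estimated"
}
```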
• Nets with high fanout (report_high_fanout_nets shows highest fanout non-clock nets)
• Nets with drivers and loads located far apart
• Digital signal processor (DSP) and block RAM with sub-optimal pipeline register usage
No hold violation should remain after route, regardless of the worst setup slack (WNS) value. If
the design fails hold, further analysis is needed. This is typically due to very high congestion, in
which case the router gives up on optimizing timing. This can also occur for large hold violations
(over 4 ns), which the router does not fix by default. Large hold violations are usually due to
improper clock constraints, high clock skew, or improper I/O constraints, which can already be
identified after placement or even after synthesis.
If hold is met (WHS > 0) but setup fails (WNS < 0), follow the analysis steps described in
Analyzing and Resolving Timing Violations.
3. Open the post-synthesis report_timing_summary text report and record the no_clock
section of check_timing.
Number of missing clock requirements in the design: ___________
4. Run report_clock_networks to identify primary clock source pins/ports in the design.
(Ignore QPLLOUTCLK and QPLLOUTREFCLK because they are pulse-width only checks.)
Number of unconstrained clocks in the design: ___________
5. Run report_clock_interaction -delay_type min_max and sort the results by
WNS path requirement.
Smallest WNS path requirement in the design: ___________
6. Sort the results of report_clock_interaction by WHS to see if there are large hold
violations (>500 ps) after synthesis.
Largest negative WHS in the design: ___________
7. Sort results of report_clock_interaction by Inter-Clock Constraints and list all the
clock pairs that show up as unsafe.
8. Upon opening the synthesized design, how many Critical Warnings exist?
Number of synthesized design Critical Warnings: ___________
9. What types of Critical Warnings exist?
Record examples of each type.
10. Run report_high_fanout_nets -timing -load_types -max_nets 25.
Number of high fanout nets not driven by FF: ___________
Number of loads on highest fanout net not driven by FF: ___________
Do any high fanout nets have negative slack? If yes, WNS = ___________
11. Implement the design. After each step, run report_timing_summary and record the
information shown in the following table.
12. Run report_exceptions -ignored to identify if there are constraints that overlap in
the design. Record the results.
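The report commands named in the steps above can be collected into one script so that the results are captured in files. This is a sketch only: the file names are arbitrary, and it assumes that the synthesized design is open when the first four commands run.

```tcl
# Step 3: timing summary, including the no_clock section of check_timing.
report_timing_summary -file post_synth_timing_summary.rpt

# Step 4: primary clock source pins and ports.
report_clock_networks -file post_synth_clock_networks.rpt

# Steps 5 to 7: clock pairs with setup and hold relationships.
report_clock_interaction -delay_type min_max -file post_synth_clock_interaction.rpt

# Step 10: high fanout nets with timing and load type information.
report_high_fanout_nets -timing -load_types -max_nets 25 -file post_synth_high_fanout.rpt

# Step 12: overlapping (ignored) timing exceptions.
report_exceptions -ignored -file ignored_exceptions.rpt
```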
The Report QoR Suggestions command automatically identifies issues and orders suggestions
based on criticality. You can determine the progress made towards timing closure by running the
Report QoR Assessment command both before and after applying the suggestions. An increase in
the QoR Assessment Score and a decrease in the number of items marked for review in the
detailed table indicate improvement.
The following figure shows the basic process for analyzing and resolving timing violations.
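Slack (setup/recovery) = setup path requirement - datapath delay (max) + clock skew
    - clock uncertainty - setup/recovery time
Slack (hold/removal) = hold path requirement + datapath delay (min) - clock skew
    - clock uncertainty - hold/removal time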
• Clock Skew = destination clock delay - source clock delay (after the common node if any)
During the analysis of the violating timing paths, you must review the relative impact of each
variable to determine which variable contributes the most to the violation. Then you can start
analyzing the main contributor to understand what characteristic of the path influences its value
the most and try to identify a design or constraint change to reduce its impact. If a design or
constraint change is not practical, you must do the same analysis with all other contributors
starting with the worst one. The following lists show the typical contributors, ordered from most
to least significant.
For setup/recovery:
• Datapath delay: Subtract the timing path requirement from the datapath delay. If the
difference is comparable to the (negative) slack value, then either the path requirement is too
tight or the datapath delay is too large.
• Datapath delay + setup/recovery time: Subtract the timing path requirement from the
datapath delay plus the setup/recovery time. If the difference is comparable to the (negative)
slack value, then either the path requirement is too tight or the setup/recovery time is larger
than usual and noticeably contributes to the violation.
• Clock skew: If the clock skew and the slack have similar negative values and the absolute
value of the skew is more than a few hundred picoseconds, then the skew is a major
contributor and you must review the clock topology.
• Clock uncertainty: If the clock uncertainty is over 100 ps, then you must review the clock
topology and jitter numbers to understand why the uncertainty is so high.
For hold/removal:
• Clock skew: If the clock skew is over 300 ps, you must review the clock topology.
• Clock uncertainty: If the clock uncertainty is over 200 ps, then you must review the clock
topology and jitter numbers to understand why the uncertainty is so high.
• Hold/removal time: If the hold/removal time is more than a few hundred picoseconds, you
can review the primitive data sheet to validate that this is expected.
• Hold path requirement: The requirement is usually zero. If not, you must verify that your
timing constraints are correct.
Assuming all timing constraints are accurate and reasonable, the most common contributors to
timing violations are usually the datapath delay for setup/recovery timing paths, and skew for
hold/removal timing paths. At the early stage of a design cycle, you can fix most timing problems
by analyzing these two contributors. However, after improving and refining design and
constraints, the remaining violations are caused by a combination of factors, and you must
review all factors in parallel to identify which to improve.
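To compare the contributors for a single path, the individual values can be read from the timing path object. The following sketch assumes an open design with at least one setup path. The property names are those typically listed by report_property on a timing path object, which is an assumption to verify in your Vivado version.

```tcl
# Worst setup path in the design.
set path [get_timing_paths -setup -max_paths 1 -nworst 1]

# Print the main slack contributors side by side (values in ns).
puts "Slack          : [get_property SLACK $path]"
puts "Requirement    : [get_property REQUIREMENT $path]"
puts "Datapath delay : [get_property DATAPATH_DELAY $path]"
puts "Clock skew     : [get_property SKEW $path]"
puts "Uncertainty    : [get_property UNCERTAINTY $path]"
puts "Logic levels   : [get_property LOGIC_LEVELS $path]"
```

Comparing the datapath delay with the requirement, and the skew and uncertainty with the thresholds listed above, gives a quick indication of the dominant contributor.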
See this link for more information on timing analysis concepts, and see this link for more
information on timing reports (report_timing_summary/report_timing) in the Vivado
Design Suite User Guide: Design Analysis and Closure Techniques (UG906).
Note: The report_design_analysis command does not report on the completeness and correctness
of timing constraints.
TIP: Run the Design Analysis Report in the Vivado IDE for improved visualization, automatic filtering, and
convenient cross-probing.
The following sections only cover timing path characteristics analysis. The Design Analysis report
also provides useful information about congestion and design complexity.
To report the 50 worst setup timing paths, you can use the Report Design Analysis dialog box in
the Vivado IDE, or you can use the following command:
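The command itself is not reproduced in the text above. As an assumption based on the report_design_analysis options used elsewhere in this section, one form that reports the 50 worst setup paths is sketched below. The report name is arbitrary.

```tcl
# Path characteristics for the 50 worst setup paths.
report_design_analysis -timing -setup -max_paths 50 -name design_analysis_setup
```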
The following figure shows an example of the Setup Path Characteristics table generated by this
command. To see additional columns in the window, scroll horizontally.
• Toggle between numbers and % by clicking the % (Show Percentage) button. This is
particularly helpful to review proportion of cell delay and net delay.
• By default, columns with only null or empty values are hidden. Click the Hide Unused button
to turn off filtering and show all columns, or right-click the table header to select which
columns to show or hide.
From this table, you can isolate which characteristics are introducing the timing violation for each
path:
○ Are there any constraints or attributes that prevent logic optimization? (DONT_TOUCH,
MARK_DEBUG)
○ Does the path include a cell with high logic delay such as block RAM or DSP? (Logical Path,
Start Point Pin Primitive, End Point Pin Primitive)
○ Is the path requirement too tight for the current path topology? (Requirement)
○ Are the cells assigned to several Pblocks that can be placed far apart? (Pblocks)
○ Are the cells placed far apart? (Bounding Box Size, Clock Region Distance)
○ For SSI technology devices, are there nets crossing SLR boundaries? (SLR Crossings)
○ Are one or several net delay values a lot higher than expected while the placement seems
correct? Select the path and visualize its placement and routing in the Device window.
○ Is there a missing pipeline register in a block RAM or DSP cell? (Comb DSP, MREG, PREG,
DOA_REG, DOB_REG)
• High skew (<-0.5 ns for setup and >0.5 ns for hold) (Clock Skew)
○ Is it a clock domain crossing path? (Start Point Clock, End Point Clock)
TIP: For visualizing the details of the timing paths in the Vivado IDE, select the path in the table, and
go to the Properties tab.
The report_design_analysis command also generates a Logic Level Distribution table for
the worst 1000 paths (default) that you can use to identify the presence of longer paths in the
design. The longest paths are usually optimized first by the placer to meet timing, which will
potentially degrade the placement quality of shorter paths. You must always try to eliminate the
longer paths to improve the overall timing QoR. For this reason, Xilinx recommends reviewing the
longest paths before placement.
The following figure shows an example of the Logic Level Distribution for a design where the
worst 5000 paths include difficult paths with 17 logic levels while the clock period is 7.5 ns. Run
the following command to obtain this report:
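The command is not reproduced in the text above. Based on the options shown in the example that follows, a command of this form would produce the distribution for the worst 5000 paths. The report name is arbitrary.

```tcl
# Logic level distribution for the worst 5000 paths.
report_design_analysis -logic_level_distribution \
    -logic_level_dist_paths 5000 -name design_analysis_1
```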
For logic levels above 10, you can use the -min_level and -max_level options to provide
more distribution information for paths between the min and max level you identify. For
example:
report_design_analysis -logic_level_distribution -min_level 16 -max_level 20 \
    -logic_level_dist_paths 5000 -name design_analysis_1
Run the following command to generate the timing report of the longest paths:
report_timing -name longPaths -of_objects [get_timing_paths -setup \
    -to [get_clocks cpuClk_5] -max_paths 5000 \
    -filter {LOGIC_LEVELS>=16 && LOGIC_LEVELS<=20}]
Based on what you find, you can improve the netlist by changing the RTL or using different
synthesis options, or you can modify the timing and physical constraints.
Was this path impacted by congestion? Look at each individual net delay, the fanout and
observe the routing in the Device view with routing details enabled (post-route analysis only).
You can also turn on the congestion metrics to see if the path is located in or near a congested
area. Use the following analysis steps for a quick assessment or review Reducing Net Delay
Caused by Congestion for a comprehensive analysis.
○ Yes - For the nets with the highest delay value, is the fanout low (<10)?
- Yes - If the routing seems optimal (straight line) but driver and load are far apart, the
sub-optimal placement is related to congestion. Review Addressing Congestion to
identify the best resolution technique.
- No - Try to use physical logic optimization to duplicate the driver of the net. Once
duplicated, each driver can automatically be placed closer to its loads, which will reduce
the overall datapath delay. Review Optimizing High Fanout Nets for more details and to
learn about alternate techniques.
○ No - The design is spread out too much. Try one of the following techniques to improve the
placement:
- Reducing Control Sets
- Tuning the Compilation Flow
- Considering Floorplan
Clock skew in high-performance clock domains (300 MHz and above) can impact performance. In
general, the clock skew should be no more than 500 ps. For example, 500 ps represents 15% of a
300 MHz clock period, which is equivalent to the timing budget of 1 or 2 logic levels. In clock
domain crossing paths, the skew can be higher because the clocks use different resources and the
common node is located further up the clock trees. SDC-based tools time all clocks together unless
constraints specify that they should not be (for example, set_clock_groups,
set_false_path, or set_max_delay -datapath_only).
If the clock uncertainty is over 100 ps, then you must review the clock topology and jitter
numbers to understand why the uncertainty is so high.
Before placement, timing analysis uses estimated delays that correspond to ideal placement and
typical clock skew. By using report_timing, report_timing_summary, or
report_design_analysis, you can quickly identify the paths with too many logic levels or
with high cell delays, because they usually fail timing or barely meet timing before placement.
Use the methodology proposed in Identifying Timing Violations Root Cause to find the long
paths which need to be improved before implementing the design.
TIP: To cross-probe from a post-synthesis path to the corresponding RTL view and source code, see this
link in the Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906).
Related Information
Pushing the Logic from the Control Pin to the Data Pin
• Higher setup/hold/clock-to-output timing arc values for some pins. For example, a block RAM
has a clock-to-output delay around 1.5 ns without the optional output register and 0.4 ns with
the optional output register. Review the data sheet of your target device series for complete
details.
• Higher routing delays than regular FD/LUT connections.
• Higher clock skew variation than regular FD-FD paths.
Also, their availability and site locations are restricted compared to CLB slices, which usually
makes their placement more challenging and often incurs some QoR penalty.
• Pipeline paths from and to dedicated blocks and macro primitives as much as possible.
• Restructure the combinational logic connected to these cells to reduce the logic levels by at
least 1 or 2 cells if latency incurred by pipelining is a concern.
• Meet setup timing by at least 500 ps on these paths before placement.
• Replicate cones of logic connected to too many dedicated blocks or macro primitives if they
need to be placed far apart.
• When the design has tight timing requirements to, within, or from a DSP block, run
opt_design -dsp_register_opt to move registers to a more timing optimal position.
Note: Because timing is approximate during opt_design, you might also need to run
phys_opt_design -dsp_register_opt to correct movements where timing was not accurately
represented at the pre-placement stage.
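One way to arrange these two steps in a non-project flow is sketched below. It assumes that a default opt_design run precedes the targeted DSP register optimization, which is a choice made for this example rather than a requirement stated in this guide. The report file name is arbitrary.

```tcl
# Default logic optimization, followed by the targeted DSP register move.
opt_design
opt_design -dsp_register_opt

place_design

# Revisit the DSP register positions now that placement-based timing is known.
phys_opt_design -dsp_register_opt

route_design
report_timing_summary -file post_route_timing_summary.rpt
```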
In the Vivado IDE Properties window, you can select the path in the Timing Path Characteristic
table to review which Pblocks are constraining cells in the path. Consider removing one or
several Pblock constraints if the constraints force logic spreading.
Note: Congestion levels of 5 or higher often impact QoR and always lead to longer router runtime.
The Interconnect Congestion Level metric provides a quick visual overview of any congestion
hotspots in the device. The following figure shows a placed design with several congested areas.
This metric is based on the current interconnect demand and availability with a threshold of 0.9
(that is, 90% routing usage). The range is 0.1 to 0.9.
Use the Routing Congestion per CLB, which is based on estimation and not actual routing. After
placement or after routing, you can display this congestion metric by right-clicking in the Device
window and selecting Metric → Vertical and Horizontal Routing Congestion per CLB. This
provides a quick visual overview of any congestion hotspots in the device. The following figure
shows a placed design with several congested areas due to high utilization and netlist complexity.
Note: Use this method for 7 series and UltraScale devices only.
In that case the QoR is very likely impacted and it is prudent to address the issues causing the
congestion before continuing on to the router. As stated in the message, use the
report_design_analysis command to report the actual congestion levels, as well as
identify their location and the logic placed in the same area.
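For example, on a placed design, the congestion report can be generated as follows. The report name is arbitrary.

```tcl
# Report congestion levels, congested windows, and the cells placed in them.
report_design_analysis -congestion -name design_analysis_congestion
```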
When congestion level is 4 or higher, the router prints an initial estimated congestion table which
gives more details on the nature of the congestion:
• Global Congestion is similar to how the placer congestion is estimated and is based on all
types of interconnects.
• Long Congestion only considers long interconnect utilization for a given direction.
• Short Congestion considers all other interconnect utilization for a given direction.
Any congestion area greater than 32x32 (level 5) will likely impact QoR and routability
(highlighted in yellow in the table below). Congestion on Long interconnects increases usage of
Short interconnects, which results in longer routed delays. Congestion on Short interconnects
usually induces longer runtimes and, if the tile percentage is more than 5%, will also likely cause
QoR degradation (highlighted in red in the table below).
During Global Iterations, the router first tries to find a legal solution with no overlap that also
meets timing for both setup and hold, with higher priority for hold fixing. When the router does
not converge during a global iteration, it stops optimizing timing until a valid routed solution has
been found, as shown in the following example:
Phase 4.1 Global Iteration 0
Number of Nodes with overlaps = 1157522
Number of Nodes with overlaps = 131697
Number of Nodes with overlaps = 28118
Number of Nodes with overlaps = 10971
Number of Nodes with overlaps = 7324
WARNING: [Route 35-447] Congestion is preventing the router from routing all nets.
The router will prioritize the successful completion of routing all nets over timing
optimizations.
After a valid routed solution has been found, timing optimizations are re-enabled.
The router also flags CLB routing congestion and provides the names of the most congested
CLBs. An Info message is issued and the congested CLBs and nets are written to the text file
listed in the message body. You can examine the text file for the list of CLB tiles and congested
nets that are involved in the CLB pin-feed congestion, and use the congestion alleviation
techniques listed in the Addressing Congestion section to resolve the CLB congestion before
routing the design.
INFO: [Route 35-443] CLB routing congestion detected. Several CLBs have high routing
utilization, which can impact timing closure. Congested CLBs and Nets are dumped in:
iter_200_CongestedCLBsAndNets.txt
TIP: Localized CLB routing congestion can lead to routing failures even when the reported congestion
levels for Global, Long, or Short congestion are within the acceptable range (less than 5). Look for the
message above and in generated text files for localized congestion hotspots.
Finally, when the router cannot find a legally routed solution, several Critical Warning messages,
as shown below, indicate the number of nets that are not fully routed and the number of
interconnect resources with overlaps.
CRITICAL WARNING: [Route 35-162] 44084 signals failed to route due to routing
congestion. Please run report_route_status to get a full summary of the design's
routing.
...
CRITICAL WARNING: [Route 35-2] Design is not legally routed. There are 91566 node
overlaps.
TIP: During routing, nets are spread around the congested areas, which usually reduces the final
congestion level reported in the log file when the design is successfully routed.
The Placed Maximum, Initial Estimated Router Congestion, and Router Maximum congestion
tables provide information on the most congested areas in the North, South, East, and West
direction. When you select a window in the table, the corresponding congested area is
highlighted in the Device window.
The tables show the congestion at different stages of the design flow:
• Placed Maximum: Shows congestion based on the location of the cells and a model of routing.
• Initial Estimated Router Congestion: Shows congestion after a quick router iteration. This is
the most useful stage to analyze congestion because it gives an accurate picture of congestion
due to placement.
• Router Maximum: Shows congestion after the router has worked extensively to reduce
congestion.
The Congestion percentages in the Congestion Table show the routing utilization in the
congestion window. The top three hierarchical cells located in the congested window are listed
and can be selected and cross-probed to the Device window or Schematic window. The cell
utilization percentage in the congestion window is also shown.
With the hierarchical cells present in the congested area identified, you can use the congestion
alleviating techniques discussed later in this guide to try reducing the overall design congestion.
For more information on generating and analyzing the Report Design Analysis Congestion report,
see this link in the Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906).
A design with higher Rent exponent corresponds to a design where the groups of highly
connected logic also have strong connectivity with other groups. This usually translates into a
higher utilization of global routing resources and an increased routing complexity. The Rent
exponent provided in this report is computed on the unplaced and unrouted netlist. After
placement, the Rent exponent of the same design can differ as it is based on physical partitions
instead of logical partitions.
Report Design Analysis runs in Complexity Mode when you do either of the following:
• Check the Complexity option in the Report Design Analysis dialog box Options tab.
• Execute the report_design_analysis Tcl command with the -complexity option.
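For example, the following command runs the report in Complexity Mode from the Tcl console. The hierarchy depth of 2 and the report name are illustrative values.

```tcl
# Complexity metrics for the top level and two levels of hierarchy below it.
report_design_analysis -complexity -hierarchical_depth 2 -name design_analysis_complexity
```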
The following table shows the typical ranges for the Rent Exponent.
Range          Meaning
0.0 to 0.65    This range is low to normal.
0.65 to 0.85   This range is high, especially when the total number of instances is above 15,000.
Above 0.85     This range is very high, indicating that the design might fail during
               implementation if the number of instances is also high.
The following table shows the typical ranges for the Average Fanout.
Range      Meaning
Below 4    This range is normal.
4 to 5     This range is high, indicating that placing the design without congestion might be
           difficult. When using SSI technology devices, if the total number of instances is
           above 100,000, it might be difficult for the placer to find a solution that fits in
           1 SLR or is spread over 2 SLRs.
Above 5    This range is very high, indicating that the design might fail during implementation.
You must treat high Rent exponents and high Average Fanouts for larger modules with higher
importance. Smaller modules, especially under 15,000 total instances, can have high Rent
exponent and high Average Fanout and still be easy to place and route successfully. Therefore,
you must review the Total Instances column along with the Rent exponent and Average Fanout.
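The two tables and the instance-count guidance can be combined into a small triage helper. The sketch below is plain Tcl with hypothetical procedure names. Treating a value of exactly 0.65 as "low to normal" and using 15,000 instances as a hard cutoff are assumptions made for this example, because the tables do not define the boundaries that precisely.

```tcl
# Hypothetical helpers (not Vivado commands) that apply the ranges above.
proc classify_rent {rent} {
    if {$rent > 0.85} { return "very high" }
    if {$rent > 0.65} { return "high" }
    return "low to normal"
}

proc classify_avg_fanout {fanout} {
    if {$fanout > 5.0}  { return "very high" }
    if {$fanout >= 4.0} { return "high" }
    return "normal"
}

# Returns 1 when a module deserves a closer look: it has at least 15,000
# instances and at least one metric outside its normal range.
proc needs_review {rent fanout total_instances} {
    if {$total_instances < 15000} { return 0 }
    return [expr {[classify_rent $rent] ne "low to normal" ||
                  [classify_avg_fanout $fanout] ne "normal"}]
}

puts [classify_rent 0.7]              ;# prints high
puts [needs_review 0.9 3.0 20000]     ;# prints 1
puts [needs_review 0.9 6.0 5000]      ;# prints 0
```

The metric values passed to these helpers would be read manually from the Complexity report for each module of interest.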
TIP: Top-level modules do not necessarily have high complexity metrics even though some of the lower-
level modules have high Rent exponents and high Average Fanouts. Use the -hierarchical_depth
option to refine the analysis to include the lower-level modules.
For more information on generating and analyzing the Report Design Analysis Complexity report
see this link in the Vivado Design Suite User Guide: Design Analysis and Closure Techniques (UG906).
The clock skew is typically less than 300 ps for intra-clock timing paths and less than 500 ps for
timing paths between balanced synchronous clocks. When crossing resource columns, clock
skew shows more variation, which is reflected in the timing slack and optimized by the
implementation tools. For timing paths between unbalanced clock trees or with no common
node, clock skew can be several nanoseconds, making timing closure almost impossible.
1. Review all clock relationships to ensure that only synchronous clock paths are timed and
optimized.
2. Review the clock tree topologies and placement of timing paths impacted by higher clock
skew than expected, as described in the following sections.
3. Identify the possible clock skew reduction techniques, as described in the following sections.
Figure 125: Typical Synchronous Clocking Topology with Common Node Located on
Green Net
When analyzing the clock path in the timing report, the delays before and after the common
node are not provided separately because the common node only exists in the physical database
of the design and not in the logical view. For this reason, you can see the common node in the
Device window of the Vivado IDE when the Routing Resources are turned on but not in the
Schematic window. The timing report only provides a summary of skew calculation with source
clock delay, destination clock delay, and credit from clock pessimism removal (CPR) up to the
common node.
In the following figure, three clocks have several intra-clock and inter-clock paths. The common node of
the two clocks driven by the MMCM is located at the output of the MMCM (red markers). The
common node of the paths between the MMCM input clock and MMCM output clocks is located
on the net before the MMCM (blue marker). For the paths between the MMCM input clock and
MMCM output clocks, the clock skew can be especially high depending on the clkin_buf
BUFGCE location and the MMCM compensation mode.
Figure 126: Synchronous CDC Paths with Common Nodes on Input and Output of a
MMCM
Xilinx recommends limiting the number of synchronous clock domain crossing paths even when
clock skew is acceptable. Also, when skew is abnormally high and cannot be reduced, Xilinx
recommends treating these paths as asynchronous by implementing asynchronous clock domain
crossing circuitry and adding timing exceptions.
You must review all timing paths between asynchronous clocks to ensure the following:
You can use the Clock Interaction Report (report_clock_interaction) to help identify
clocks that are asynchronous and are missing proper timing exceptions.
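A minimal sketch of this review follows. The clock names clk_a and clk_b are hypothetical, and the set_clock_groups constraint is appropriate only after you have confirmed that the two clocks are truly asynchronous and that every crossing uses proper CDC circuitry.

```tcl
# List every clock pair with its relationship and constraint coverage.
report_clock_interaction -delay_type min_max -significant_digits 3

# Hypothetical example: clk_a and clk_b have no common node and all crossings
# use proper CDC circuitry, so the pair is declared asynchronous.
set_clock_groups -asynchronous -group [get_clocks clk_a] -group [get_clocks clk_b]
```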
Figure 127: Asynchronous CDC Paths with Proper CDC Circuitry and No Common Node
TIP: Given the flexibility of the UltraScale device clocking architecture, the report_methodology
command contains checks to aid you in creating an optimal clocking topology.
• Avoid timing paths between cascaded clock buffers by eliminating unnecessary buffers or
connecting them in parallel as shown in the following figure.
• Combine parallel clock buffers into a single clock buffer and connect any clock buffer clock
enable logic to the corresponding sequential cell enable pins, as shown in the following figure. If
some of the clocks are divided by the buffer's built-in divider, implement the equivalent
division with clock enable logic and apply multicycle path timing exceptions as needed. When
both rising and falling clock edges are used by the downstream logic or when power is an
important factor, this technique might not be applicable.
Figure 129: Synchronous Clocking Topology with Parallel Clock Buffer Recombined
into a Single Buffer
• Remove LUTs or any combinatorial logic in clock paths as they make clock delays and clock
skew unpredictable during placement, resulting in lower quality of results. Also, a portion of
the clock path is routed with general interconnect resources which are more sensitive to noise
than global clocking resources. Combinatorial logic usually comes from sub-optimal clock
gating conversion and can usually be moved to clock enable logic, either connected to the
clock buffer or to the sequential cells.
In the following figure, the first BUFG (clk1_buf) is used in LUT3 to create a gated clock
condition.
IMPORTANT! The 7 series and UltraScale device clocking architectures differ. You must follow the
clocking guidelines for your targeted architecture and verify that your design complies.
Related Information
Clocking Guidelines
Figure 131: Comparison of Fabric Clock Route versus Dedicated Clock Route
• Do not allow regional clock buffers (BUFR/BUFIO/BUFH) to drive logic in several clock
regions as the skew between the clock tree branches in each region will be very high. Remove
inappropriate LOC or Pblock constraints to resolve this situation.
• Use the CLOCK_DELAY_GROUP constraint on the driver net of critical synchronous clocks to force
CLOCK_ROOT and route matching during placement and routing. The buffers of the clocks
must be driven by the same cell for this constraint to be honored.
Note: This optimization technique is automatically applied by the report_qor_suggestions Tcl
command.
• If a timing path has difficulty meeting timing and the skew is larger than expected, the path
might be crossing an SLR or an I/O column. In this case, you can use physical constraints such
as Pblocks to force the source and destination into a single SLR or to prevent the crossing of
an I/O column.
• When working with high speed synchronous clock domain crossing timing paths, constraining
the location of the clock modifying blocks, such as the MMCM/PLL, to the center of the clock
loads can aid in meeting timing. The decreased delay on the clock networks will result in less
timing pessimism on the clock domain crossing paths.
• Verify that clock nets with the CLOCK_DEDICATED_ROUTE=FALSE constraint are routed with
global clocking resources. Use ANY_CMT_COLUMN instead of FALSE to ensure the clock
nets with routing waivers are routed with dedicated clocking resources only. If the clock net is
routed with fabric interconnect, identify the design change or clocking placement constraint
needed to resolve this situation and make the implementation tools use global clocking
resources instead. Clock paths routed with fabric interconnect can have high clock skew or be
impacted by switching noise, leading to poor performance or non-functional designs.
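Two of the constraints named in the preceding list can be sketched in XDC as follows. All net names and the group name grp12 are hypothetical and must be replaced with the names from your design.

```tcl
# Hypothetical net names: clk1_net and clk2_net are the nets directly driven by
# two clock buffers that share the same driver cell (for example, one MMCM).
set_property CLOCK_DELAY_GROUP grp12 [get_nets {clk1_net clk2_net}]

# Hypothetical net name: keep the routing waiver on clk_in_net but restrict the
# clock net to dedicated clocking resources.
set_property CLOCK_DEDICATED_ROUTE ANY_CMT_COLUMN [get_nets clk_in_net]
```

After implementation, check in the Device window that the constrained nets are routed on global clocking resources.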
Related Information
Synchronous CDC
Clock Constraints
The following figure shows a clock path from the global clock buffer (BUFG) to the clock root.
The clock routing switches from routing to the vertical distribution track, through the
BUFCE_ROW in each clock region row that drives the horizontal distribution tracks, and then to
the leaf level. The source is shown in green and the destination in red.
Figure 133: Clock Path from BUFG to the Leaf Level via BUFCE_ROW
The row programmable tap delay is the largest near the clock root. This delay decreases by one
tap per clock region as the clock reaches farther away from the root in the vertical direction,
eventually decreasing to zero.
The following figure shows the topology of the programmable row tap values decreasing from
the root. Higher tap values mean higher delays and higher SLR-crossing clock skew, because
they add timing uncertainty from the minimum/maximum delay variation introduced by
manufacturing process variation. This makes it more difficult to meet timing near the root,
where programmable tap delay values are higher. Farther from the root in the vertical direction,
there is less uncertainty, and it is generally easier to fix hold violations on SLR-crossing buses.
For SLR-crossing buses that are farther from the root in the horizontal direction, the clock row
delays increase. This additional delay introduces more minimum/maximum delay variation and
reduces the performance of SLR crossings.
Figure 134: Row Programmable Tap Delay Settings Across an UltraScale+ SSI
Technology Device
For UltraScale+ SSI technology devices, you can improve SLR crossing speed using either of the
following methods:
• Move the clock root close to the SLR crossings in the horizontal direction
• Limit the maximum row programmable tap delay value to reduce the uncertainty
Note: Timing paths farther from the root in the vertical direction might become slightly slower due to
increased delay from hold fixing route detours. However, using these methods results in an overall
performance gain.
You can review the row programmable tap delay settings that the Vivado tool chose for each
global clock in your design in the Device Cell Placement Summary for Global Clock sections in
the Clock Utilization Report. Following is an example that shows the row programmable tap delay
settings for the g13 global clock in the HORIZONTAL PROG DELAY column, which is highlighted
in yellow.
Figure 135: Global Clock Row Programmable Tap Delay Settings in the Clock
Utilization Report
For UltraScale+ SSI technology devices, the placer limits the maximum row programmable tap
delay value to reduce minimum/maximum delay variation and reduce SLR crossing clock skew
near the clock root, while also ensuring that clock regions on either side of SLR crossings have an
increasing or decreasing tap delay value to balance the clock skew on SLR crossing paths farther
from the root. The MAX_PROG_DELAY property value of the clock net can be queried to find
the maximum row programmable tap delay value used by the placer.
You can also limit the row programmable tap value using the USER_MAX_PROG_DELAY
property. To set the USER_MAX_PROG_DELAY property, the value must be applied to the net
segment directly driven by the global clock buffer. If the USER_MAX_PROG_DELAY property is
not set, the placer can use the maximum possible tap setting of 7.
• The recommended USER_MAX_PROG_DELAY tap value is 3 or 4 for clocks that span the
majority of UltraScale+ SSI technology devices. When clock roots are near GT, PCIe®, or
CMAC blocks that are off-center in the device, SLR crossing performance on the opposite
device side is heavily impacted, because the common node for the launch and capture clock is
farther away from the SLR crossing.
• For clock groups using the CLOCK_DELAY_GROUP for clock network matching, ensure that
all clocks within the clock group use the same USER_MAX_PROG_DELAY value.
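The following sketch shows one way to apply and check the tap limit described above. The hierarchy clk_gen/bufg_inst is a hypothetical global clock buffer instance, and the value 4 is taken from the recommended range of 3 to 4.

```tcl
# Hypothetical hierarchy: clk_gen/bufg_inst is the global clock buffer.
# Select the net segment directly driven by the buffer output pin.
set clk_net [get_nets -of_objects [get_pins clk_gen/bufg_inst/O]]

# Cap the row programmable tap delay; 4 is within the 3-to-4 range recommended
# for clocks that span most of an UltraScale+ SSI technology device.
set_property USER_MAX_PROG_DELAY 4 $clk_net

# After place_design, query the maximum tap value that the placer actually used.
get_property MAX_PROG_DELAY $clk_net

# Review the per-clock-region tap settings in the Clock Utilization Report.
report_clock_utilization
```

When several clocks share a CLOCK_DELAY_GROUP, apply the same USER_MAX_PROG_DELAY value to each of their nets.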
The Clocking Wizard provides accurate uncertainty data for the specified device and can
generate various MMCM clocking configurations for comparing different clock topologies. To
achieve optimal results for the target architecture, Xilinx recommends regenerating clock
generation logic using the Clocking Wizard rather than using legacy clock generation logic from
prior architectures.
When configuring an MMCM for frequency synthesis, Xilinx recommends configuring the
MMCM to achieve the lowest output jitter on the clocks. Optimize the MMCM settings to run at
the highest possible voltage-controlled oscillator (VCO) frequency that meets the allowed
operating range for the device. The following equations show the relationship between VCO
frequency, M (multiplier), D (divider), and O (output divider) settings to both the input and output
clock frequencies:
$$F_{VCO} = F_{CLKIN} \times \frac{M}{D}$$
$$F_{OUT} = F_{CLKIN} \times \frac{M}{D \times O}$$
TIP: You can increase the VCO frequency by increasing M, lowering D, or both and compensating for the
change in frequency by increasing O. Increasing the VCO frequency negatively affects the power dissipation
from the MMCM or PLL. You can also make small increases in the VCO frequency when you switch from
multiple MMCM clock outputs using BUFGs to one MMCM clock output using BUFGCE_DIVs, which
allows more clocks to use the fractional divider. When selecting between MMCM and PLL, MMCMs are
preferred because they are able to operate at a higher VCO frequency, have improved granularity for
selecting M and D values, and have fractional dividers (CLKOUT0).
Different architectures have different VCO frequency maximums. Therefore, Xilinx recommends
regenerating clocking components to be optimal for your target architecture. Xilinx recommends
using the Clocking Wizard to automatically calculate M and D values along with the VCO
frequency to properly configure an MMCM for the target device.
TIP: When using the Clocking Wizard from the IP catalog, make sure that Jitter Optimization Setting is set
to Minimize Output Jitter, which provides a higher VCO frequency. In addition, making marginal
changes to the desired output clock frequency can allow for an even higher VCO frequency to further
reduce clock uncertainty.
The following MMCM frequency synthesis example uses an input clock of 62.5 MHz to generate
an output clock of approximately 40 MHz. There are two solutions, but MMCM_2, with its
higher VCO frequency, generates less clock uncertainty due to reduced jitter and phase error.
Setting                | MMCM_1       | MMCM_2
Input clock            | 62.5 MHz     | 62.5 MHz
Output clock           | 40.0 MHz     | 39.991 MHz
CLKFBOUT_MULT_F (M)    | 16           | 22.875
DIVCLK_DIVIDE (D)      | 1            | 1
VCO Frequency          | 1000.000 MHz | 1429.688 MHz
CLKOUT0_DIVIDE_F (O)   | 25           | 35.750
Jitter (ps)            | 167.542      | 128.632
Phase Error (ps)       | 384.432      | 123.641
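As a check, the VCO and output frequencies in the table follow directly from the two equations given earlier for F_VCO and F_OUT. The subscripts 1 and 2 below are introduced here only to distinguish MMCM_1 from MMCM_2.

```latex
\begin{aligned}
F_{VCO,1} &= 62.5\ \text{MHz} \times \frac{16}{1} = 1000\ \text{MHz}, &
F_{OUT,1} &= \frac{1000\ \text{MHz}}{25} = 40\ \text{MHz} \\
F_{VCO,2} &= 62.5\ \text{MHz} \times \frac{22.875}{1} = 1429.6875\ \text{MHz}, &
F_{OUT,2} &= \frac{1429.6875\ \text{MHz}}{35.75} \approx 39.991\ \text{MHz}
\end{aligned}
```

The small deviation from 40 MHz in MMCM_2 is what allows the higher VCO frequency.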
In this case, the clock uncertainty includes 120 ps of Phase Error for both Setup and Hold
analysis. Instead of generating the 150 MHz clock with the MMCM, a BUFGCE_DIV can be
connected to the 300 MHz MMCM output and divide the clock by 2. For optimal results, the 300
MHz clock needs to also use a BUFGCE_DIV with BUFGCE_DIVIDE set to 1 to match the 150
MHz clock delay accurately, as shown in the following figure.
Figure 136: Improving the Clock Topology for an UltraScale Synchronous CDC Timing
Path
• For setup analysis, clock uncertainty does not include the MMCM phase error and is reduced
by 120 ps.
• For hold analysis, there is no more clock uncertainty (only for same edge hold analysis).
• The common node moves closer to the buffers, which saves some clock pessimism.
By applying the CLOCK_DELAY_GROUP constraint on the two clock nets, the clock paths will
have matched routing.
The following tables compare the clock uncertainty for setup and hold analysis of an UltraScale
synchronous CDC timing path.
Setup Analysis            | MMCM Generated 150 MHz Clock | BUFGCE_DIV 150 MHz Clock
Total System Jitter (TSJ) | 0.071 ns                     | 0.071 ns
Discrete Jitter (DJ)      | 0.115 ns                     | 0.115 ns
Phase Error (PE)          | 0.120 ns                     | 0.000 ns
Clock Uncertainty         | 0.188 ns                     | 0.068 ns

Hold Analysis             | MMCM Generated 150 MHz Clock | BUFGCE_DIV 150 MHz Clock
Total System Jitter (TSJ) | 0.071 ns                     | 0.000 ns
Discrete Jitter (DJ)      | 0.115 ns                     | 0.000 ns
Phase Error (PE)          | 0.120 ns                     | 0.000 ns
Clock Uncertainty         | 0.188 ns                     | 0.000 ns
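The tabulated values are consistent with the following relationship between the jitter terms, the phase error, and the clock uncertainty. The relationship is stated here as an assumption that reproduces the table; the symbol T_CU is introduced only for this illustration.

```latex
T_{CU} = \frac{\sqrt{TSJ^2 + DJ^2}}{2} + PE

\begin{aligned}
\text{MMCM generated clock (setup):}\quad
T_{CU} &= \frac{\sqrt{0.071^2 + 0.115^2}}{2} + 0.120 \approx 0.068 + 0.120 = 0.188\ \text{ns} \\
\text{BUFGCE\_DIV clock (setup):}\quad
T_{CU} &= \frac{\sqrt{0.071^2 + 0.115^2}}{2} + 0.000 \approx 0.068\ \text{ns}
\end{aligned}
```

The 120 ps difference between the two setup results is exactly the MMCM phase error that the BUFGCE_DIV topology removes.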
Related Information
Synchronous CDC
For example, one module might require the use of MUXF* resources to implement a timing
critical function, but the rest of the design might benefit from implementation of logic in LUTs
rather than MUXF* to reduce congestion. In this case, set the PERFORMANCE_OPTIMIZED
strategy for the timing-critical module, and synthesize the rest of the design using the
Flow_AlternateRoutability strategy to reduce congestion.
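A minimal XDC sketch of this mixed approach follows. The instance name u_critical is hypothetical, and the sketch assumes that the Flow_AlternateRoutability strategy is selected separately in the synthesis run settings for the rest of the design.

```tcl
# Hypothetical instance name: u_critical is the timing-critical module that
# benefits from MUXF* mapping.
set_property BLOCK_SYNTH.STRATEGY {PERFORMANCE_OPTIMIZED} [get_cells u_critical]
```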
Using the retiming option globally is usually runtime intensive and can negatively impact power.
Therefore, Xilinx recommends that you identify a specific hierarchy with violations on paths with
a high number of logic levels after synthesis or with optimal placement. When the paths in the
fanin or fanout of the longest paths have fewer logic levels and are contained within a small or
medium hierarchical module, you can use the BLOCK_SYNTH.RETIMING block-level synthesis
strategy.
The following figure shows a critical path with five LUTs, constrained by a 600 MHz clock. The
REG2 destination flop drives a timing path with a single LUT that is included one hierarchy up
from REG2.
Figure 137: Schematic Showing Critical Path with Five Logic Levels
In addition to using the Schematic window in the Vivado IDE, you can use the
report_design_analysis -logic_level_distribution command to review the
distribution of logic levels for specific paths. This allows you to determine how many paths need
to be rebalanced to improve the timing QoR.
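The logic level review can be sketched as follows. The path count of 1000 and the inst1/inst2 hierarchy filter are illustrative choices, not required values.

```tcl
# Summarize logic levels for the 1000 worst setup paths (1000 is illustrative).
report_design_analysis -logic_level_distribution -logic_level_dist_paths 1000

# Restrict the analysis to paths that start at registers inside inst1/inst2.
set regs [get_cells -hierarchical -filter {NAME =~ inst1/inst2/* && IS_SEQUENTIAL}]
report_design_analysis -logic_level_distribution \
    -of_timing_paths [get_timing_paths -from $regs -max_paths 1000]
```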
You can use the retiming_forward and retiming_backward attributes available in Vivado
synthesis to control the optimization on a specific register or a path. Using these attributes
applies retiming optimization on a specific set of paths rather than on the top module or
submodules, which reduces the area overhead. You can apply these attributes in the RTL or in the
XDC file. For more information, including usage and restrictions, see the Vivado Design Suite User
Guide: Synthesis (UG901).
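As an illustration of the XDC form of these attributes, the following sketch uses two hypothetical register names. Check the restrictions in UG901 before applying the attributes to registers with timing exceptions or special properties.

```tcl
# Hypothetical register names. Move my_sig_reg forward through the logic it
# drives, and move out_reg backward through the logic that drives it.
set_property retiming_forward 1 [get_cells my_sig_reg]
set_property retiming_backward 1 [get_cells out_reg]
```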
The following figure shows 58 paths with five logic levels within the inst1/inst2 hierarchy
constrained with the 600 MHz clock and 32 paths with only one logic level.
Vivado synthesis can rebalance the logic levels by moving the registers in the low logic level
paths into the high logic level paths. In this example, you can add the following constraint to the
synthesis XDC file to perform retiming on the inst1/inst2 hierarchy:
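A sketch of such a constraint, assuming the standard block-level synthesis property syntax and the inst1/inst2 hierarchy from this example:

```tcl
# Enable retiming only inside the inst1/inst2 hierarchy.
set_property BLOCK_SYNTH.RETIMING 1 [get_cells inst1/inst2]
```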
After rerunning synthesis with the same global settings and the updated XDC file, you can run
regular timing analysis on the inst1/inst2 timing paths or rerun the
report_design_analysis command to verify that the longest paths have fewer logic levels,
as shown in the following figure. The critical path is now REG0 → 3 LUTs → REG2 (backward
retimed), and the path from REG2 to REG4 has three logic levels.
Figure 139: Logic Level Distribution with Retiming Enabled for Synthesis Optimization
Often not much consideration is given to control signals such as resets or clock enables. Many
designers start HDL coding with "if reset" statements without deciding whether the reset is
needed or not. While all registers support resets and clock enables, their use can significantly
affect the end implementation in terms of performance, utilization, and power.
The first factor to consider is the number of control sets. A control set is the group of clock,
enable, and set/reset signals used by a sequential cell. For example, two cells connected to the
same clock have different control sets if only one cell has a reset or if only one cell has a clock
enable. Constant or unused enable and set/reset register pins also contribute to forming control
sets.
The second factor to consider is the targeted architecture. The number of control sets that can
be packed together depends on the architecture:
• A 7 series device slice (or half-CLB) comprises eight registers, which all share one clock, one
set/reset, and one clock enable. Only one control set can be used per group of eight registers.
• An UltraScale device half-CLB comprises two groups of four registers, which share one clock
and one set/reset. In addition, each group of four registers has one clock enable and can
ignore the set/reset. A constant set/reset signal is not routed and can be ignored. A constant
enable signal is treated like a dynamic enable signal and needs to be routed. Under optimal
conditions, up to two control sets can be used per group of eight registers.
CLB packing restrictions caused by control sets force the placer to move some registers,
including their input LUT. In some cases, the registers are moved to less optimal locations. The
additional distance can negatively impact not only utilization but also placement QoR and power
consumption, due to logic spreading (longer net delays) and higher interconnect resources
utilization. This is mainly of concern in designs with many low fanout control signals, such as
clock enables that feed single registers.
Despite the higher UltraScale device CLB control set capacity, typical designs show a control set
utilization similar to 7 series designs. Therefore, Xilinx recommendations are the same for both
architectures.
TIP: The number of unique control sets can be a problem in a small portion of the design, resulting in
longer net delays or congestion in the corresponding device area. Identifying the high local density of
unique control sets requires detailed placement analysis in the Vivado IDE Device window, which includes
highlighted control signals in different colors.
• Remove the MAX_FANOUT attributes that are set on control signals in the HDL sources or
constraint files. Replication on control signals dramatically increases the number of unique
control sets. Xilinx recommends relying on place_design to perform coarse replication and
using phys_opt_design -directive Explore for finer replication after placement. This
prevents unnecessary replication and equivalent control sets from crossing each other, which
can lead to routing congestion.
• Increase the control set threshold of Vivado synthesis (or other synthesis tool). Review the
control sets fanout distribution table in report_control_sets -verbose to determine a
more appropriate control sets threshold to use during synthesis. Note that increasing the
control_set_opt_threshold value can negatively impact power by eliminating clock enables
that actively reduce power. For example:
synth_design -control_set_opt_threshold 16
TIP: Use the BLOCK_SYNTH synthesis constraints to change the control sets threshold on modules
that are the most impacted by placement spreading or congestion.
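The review and the block-level override mentioned above can be sketched as follows. The instance name u_spread_module and the threshold value 10 are hypothetical.

```tcl
# Print the control set fanout distribution to help choose a threshold.
report_control_sets -verbose

# Hypothetical instance name: raise the threshold only on the module most
# affected by placement spreading. The value 10 is illustrative.
set_property BLOCK_SYNTH.CONTROL_SET_THRESHOLD 10 [get_cells u_spread_module]
```

The report is run on an open synthesized or implemented design, while the BLOCK_SYNTH property belongs in the synthesis XDC file.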
Sometimes, designers address the high fanout nets in RTL or synthesis by using a MAX_FANOUT
attribute on a specific net. This does not always result in the most optimal routing resource
usage, especially if the MAX_FANOUT attribute is set too low or is set on a net connected to
several major hierarchies. In addition, if the high fanout signal is a register control signal and is
replicated more than necessary, this can lead to a higher number of control sets and increase
design power by adding registers that are not necessary for timing closure.
Often, a better approach to reducing fanout is to use a balanced tree for the high fanout signals.
Consider manually replicating registers based on the design hierarchy, because the cells included
in a hierarchy are often placed together.
To restructure and reduce the number of control set trees and high fanout nets, you can use the
opt_design Tcl command with one of the following options:
Note: Try this option first, because the tools are aware of major hierarchies and Pblock constraints when
you run this option.
These options are the reverse of fanout replication and result in nets that are better suited for
module-based replication. This merge also works across multi-stage reset trees as shown in the
following figure.
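The merge step can be sketched with the following opt_design invocations. That these two options are the ones intended in the preceding paragraph is an assumption; both are existing opt_design options that merge logically equivalent drivers.

```tcl
# Assumption: these are the merge options referred to above.
# Reduce the drivers of logically equivalent control signals to a single driver.
opt_design -control_set_merge

# Merge logically equivalent drivers of any signal, including control signals.
opt_design -merge_equivalent_drivers
```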
After reducing the number of replicated objects, you can use the opt_design Tcl command to
perform limited replication based on the hierarchy characteristics, with the following option:
The following figure shows replication on a clock enable net with a fanout of 60000 using
opt_design -hier_fanout_limit 1000. Because each module SR_1K contains 1000
loads, the driver is replicated 59 times.
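The replication in this example corresponds to the following command. With 60000 loads and a limit of 1000 loads per hierarchy, 60 drivers are needed in total, which is the original driver plus 59 replicas.

```tcl
# Replicate drivers per hierarchy so that no replica drives more than 1000 loads.
opt_design -hier_fanout_limit 1000
```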
You can force replication based on physical device attributes with the MAX_FANOUT_MODE
property. Supported MAX_FANOUT_MODE values are CLOCK_REGION, SLR, and MACRO. For
example, the MAX_FANOUT_MODE property with a value of CLOCK_REGION replicates the
driver based on the physical clock region, so that the loads placed into the same clock region
are clustered together. For more information, see the Vivado Design Suite User Guide:
Implementation (UG904).
For SSI technology devices, high-fanout drivers can be replicated for each SLR and optionally
assigned to SLR-aligned Pblocks along with their loads. This technique helps reduce the impact of
the SLR crossing delay and gives more freedom to place the replicated high fanout nets
independently in each SLR.
Lower performance high fanout nets can be moved onto the global routing by inserting a clock
buffer between the driver and the loads. This optimization is automatically performed in
opt_design for nets with a fanout greater than 25000 only when a limited number of clock
buffers are already used and the clock period of the logic driven by the net is above the limit
specific to the targeted device and speed grade.
You can force synth_design and opt_design to insert a clock buffer when setting the
CLOCK_BUFFER_TYPE attribute on a net in the RTL file or in the constraint file (XDC). For
example:
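A minimal XDC sketch, assuming a hypothetical net name:

```tcl
# Hypothetical net name: request a BUFG on the high fanout net.
set_property CLOCK_BUFFER_TYPE BUFG [get_nets high_fanout_net]
```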
Using global clocking ensures optimal routing at the cost of higher net delay. For best
performance, clock buffers must drive sequential loads directly, without intermediate
combinatorial logic. In most cases, opt_design reconnects non-sequential loads in parallel to
the clock buffer. If needed, you can prevent this optimization by applying a DONT_TOUCH on
the clock buffer output net. Also, if the high fanout net is a control signal, you must identify why
some loads are not dedicated clock enable or set/reset pins.
The placer also automatically routes high fanout nets (fanout > 10000) on any global routing
tracks available after clock routing is performed. This optimization occurs towards the end of the
placer flow and is only performed if timing does not degrade. You can disable this feature using
the -no_bufg_opt option.
In some cases, the default phys_opt_design command does not replicate all critical high
fanout nets. Use a different directive to increase the command effort: Explore, AggressiveExplore
or AggressiveFanoutOpt. Also, when a high fanout net becomes critical during routing, you can
add an iteration of phys_opt_design to force replication on specific nets before trying to
route the design again. For example:
In this example, the implementation tools give higher priority to the paths that belong to clock
group clock with a weight of 2 over other paths in the design.
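The forced replication and the path group weighting described in the two preceding paragraphs can be sketched as follows. The net names are hypothetical, and the group_path form is an assumption about how the weighted clock group is created.

```tcl
# Hypothetical net names: force replication on nets that became critical
# during routing, before routing the design again.
phys_opt_design -force_replication_on_nets [get_nets {net_a net_b net_c}]

# Give the paths of the clock named "clock" a weight of 2 so that the
# implementation tools prioritize them over other paths.
group_path -name [get_clocks clock] -weight 2
```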
The insertion of negative-edge triggered registers between sequential elements can split a timing
path into two half period paths and significantly reduce hold violations. You can insert the
negative-edge triggered registers using the -insert_negative_edge_ffs option during the
phys_opt_design implementation step. Only paths with flip-flop drivers and at most one LUT
in between the sequential elements are considered for this optimization. The setup slack on the
paths must be sufficiently positive after the optimization or else the optimization is discarded.
The following figure shows a negative-edge triggered register inserted after a flip-flop driving a
CMAC block. Before the optimization, the hold slack between the flip-flop and the driver was
-0.492 ns. After the insertion of the negative-edge triggered register (highlighted in blue), the
setup and hold slack are both positive.
Figure 142: Fixing Hold Violation with Negative Edge Register Insertion
You can also insert LUT1 delays onto datapaths to reduce hold violations. To insert LUT1 delays,
use one of the following options during the phys_opt_design implementation step:
• -hold_fix: Performs LUT1 insertion and only considers paths that are the largest WHS
violators with sufficient positive setup slack.
The following figure shows a LUT1 delay inserted after a flip-flop driving an ILKN block. Before
the optimization, the path from the flip-flop to the ILKN is the WHS path in the design with
-0.277 ns hold slack. After the insertion of the LUT1 delay (highlighted in blue), the hold slack is
positive and the setup slack remains positive.
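The two hold fixing optimizations described above can be invoked as in the following sketch. Each is run here as a separate phys_opt_design pass on a placed design; whether to run one or both depends on the violations in your design.

```tcl
# Insert negative-edge triggered registers to split qualifying paths into
# two half-period paths.
phys_opt_design -insert_negative_edge_ffs

# Insert LUT1 delays on the largest WHS violators that have sufficient
# positive setup slack.
phys_opt_design -hold_fix
```

Rerun timing analysis afterward to confirm that the setup slack on the modified paths remains positive.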
Addressing Congestion
Congestion can be caused by a variety of factors and is a complex problem that does not always
have a straightforward solution. The report_design_analysis congestion report helps you
identify the congested regions and the top modules that are contained within the congestion
window. Various techniques exist to optimize the modules in the congested region. The
report_qor_suggestions can automate the resolution of many of the items that cause
congestion.
TIP: Before you try to address congestion with the techniques discussed in the following sections, make
sure that you have clean constraints and you followed the clocking guidelines recommended by Xilinx.
Excessive hold time failures (or negative hold slack) and clock uncertainties require the router to detour,
which can lead to congestion. Avoid overlapping Pblocks, which can also cause congestion.
TIP: Review resource utilization after opt_design rather than after synthesis. The numbers are more
accurate once unused logic has been trimmed.
In the following figure, the overall utilization for the design is low. However, the utilization in
SLR2 is high and the logic requires more routing resources than logic in the other SLRs. The logic
in this area is a wide bus MUX that saturates the routing resources.
Several placer directives exist that can help alleviate congestion by spreading logic throughout
the device to avoid congested regions. The SpreadLogic placer directives are:
• AltSpreadLogic_high
• AltSpreadLogic_medium
• AltSpreadLogic_low
• SSI_SpreadLogic_high
• SSI_SpreadLogic_low
• SSI_BalanceSLLs placer directive which helps with partitioning the design across SLRs while
attempting to balance SLLs between SLRs.
• SSI_SpreadSLLs placer directive which allocates extra area for regions of higher connectivity
when partitioning across SLRs.
Other placer directives or implementation strategies might also help alleviate congestion
and should be tried after the placer directives mentioned above.
To compare congestion for different placer directives either run the Design Analysis Congestion
report after place_design, or examine the initial estimated congestion in the router log file.
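A sketch of one such comparison run follows. AltSpreadLogic_high is used only as an example; repeat the run with the other directives and compare the resulting congestion reports.

```tcl
# Place with a congestion-oriented directive (one of those listed above).
place_design -directive AltSpreadLogic_high

# Report congestion for this placement so that runs can be compared.
report_design_analysis -congestion
```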
Routing has less impact on congestion than placer directives. However, in some cases it is useful
to attempt different routing directives. The following directive ensures that the router works
harder to access more routing and relieve congestion in the interconnect tiles:
• AlternateCLBRouting
Note: The AlternateCLBRouting routing directive is most effective when there is short congestion or both
short and long congestion. This directive only applies to UltraScale devices.
For more information, see the Vivado Design Suite User Guide: Implementation (UG904).
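The routing directive named above is selected as follows:

```tcl
# UltraScale devices only: make the router work harder in the interconnect tiles.
route_design -directive AlternateCLBRouting
```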
Using MUXF* primitives helps critical paths with many logic levels or a tight clock requirement
while also reducing power. MUXF* includes MUXF7, MUXF8, and MUXF9, which are dedicated
multiplexer resources located within the CLB. These resources are grouped with up to eight LUTs
during placement. This grouping forces high CLB input utilization with higher routing demand
and limits placement flexibility when the netlist connectivity is complex, leading to potential
higher routing congestion and timing degradation.
In addition, the opt_design command provides an optional MUX optimization phase to remap
MUXF* structures to LUT3 primitives to improve routability. You can use the -muxf_remap
option to remap all of the MUXF* cells. Alternatively, set the MUXF_REMAP property to TRUE
on a select number of cells in the congested region to limit the scope of the MUX remapping.
Any MUXF* cells with the MUXF_REMAP property set to TRUE automatically trigger the MUX
optimization phase during opt_design and are remapped to LUT3s.
Note: Disabling these resources can result in increased power. Use this method only when needed to
achieve timing closure.
The following figure shows a 16-1 MUX before and after the MUXF* optimization.
To further optimize the netlist after performing MUX optimization, use the -remap option with
the -muxf_remap option. This combines the LUT3 primitives that are generated by the MUXF*
optimization with connected logic if possible.
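Both scopes of the MUX remapping can be sketched as follows. The module name u_congested is hypothetical and stands for the hierarchy that overlaps the congested region.

```tcl
# Global scope: remap every MUXF* cell to LUT3s, then combine the LUT3s with
# connected logic where possible.
opt_design -muxf_remap -remap

# Narrower alternative (hypothetical module name): mark only the MUXF* cells
# of the congested module, then run opt_design, which triggers the MUX
# optimization phase for the marked cells.
set_property MUXF_REMAP TRUE [get_cells -hierarchical -filter {NAME =~ u_congested/* && REF_NAME =~ MUXF*}]
opt_design
```

Use one of the two approaches, not both in sequence.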
You can determine whether timing closure is impacted by routing congestion by reviewing the
Router Initial Estimated Congestion table in the log files or in the Design Analysis report
(report_design_analysis -congestion) after place or route is complete.
In the following figure, the Design Analysis report shows that 7% of the device is impacted by
Short congestion level 5 (32x32 CLBs) in the South direction while 26% MUXF are utilized in the
corresponding congested area.
In the Vivado IDE, you can select a row in the table of the Design Analysis congestion report to
highlight the corresponding congested area in the Device window. The following figure shows
that the congestion overlaps with a higher MUXF density area. The MUXF cells are highlighted in
magenta using the following command in the Vivado IDE Tcl Console:
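A minimal sketch of such a command, assuming that the MUXF cells can be matched by their reference name:

```tcl
# Highlight all MUXF7/MUXF8/MUXF9 cells in magenta in the Device window.
highlight_objects -color magenta [get_cells -hierarchical -filter {REF_NAME =~ MUXF*}]
```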
Figure 147: MUXF Congestion Highlighted in the Vivado IDE Device Window
When high MUXF* utilization overlaps with areas of higher congestion, Xilinx recommends
reducing the number of MUXF* by mapping their corresponding functionality to LUTs, which
have higher placement and routing flexibility. You can use the following command in the XDC
synthesis constraints to modify the netlist:
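A sketch of such a constraint, assuming the block-level synthesis MUXF_MAPPING property and a hypothetical instance name for the module that overlaps the congested area:

```tcl
# Hypothetical instance name: disable MUXF* inference for this module only.
set_property BLOCK_SYNTH.MUXF_MAPPING 0 [get_cells u_congested]
```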
After rerunning synthesis, place, and route, the updated congestion table in the Design Analysis
report now shows that the South Short congestion is lower (level 4), which typically improves the
timing quality of results.
Figure 148: Initial Router Congestion Table after Reducing MUXF Usage on a Module
LUT combining reduces logic utilization by combining LUT pairs with shared inputs into single
dual-output LUTs that use both O5 and O6 outputs. However, LUT combining can potentially
increase congestion because it tends to increase the input/output connectivity for the slices. If
LUT combining is high in the congested area (> 40%), you can try using a synthesis strategy that
eliminates LUT combining to help alleviate congestion. The Flow_AlternateRoutability
synthesis strategy and directive instructs the synthesis tool to not generate any additional LUT
combining.
Note: If you are using Synplify Pro for synthesis, you can use the Enable Advanced LUT Combining option
in the Implementation Options under the Device tab. This option is on by default. If you are modifying the
Synplify Pro project file (*.prj), the following is specified: set_option -enable_prepacking 1.
You can use the following command to select cells with LUT combining enabled in your design:
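A sketch of such a selection, assuming that combined LUTs carry a non-empty HLUTNM or SOFT_HLUTNM property:

```tcl
# Select all cells that take part in LUT combining.
select_objects [get_cells -hierarchical -filter {SOFT_HLUTNM != "" || HLUTNM != ""}]
```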
The following figure shows the horizontal congestion of a design with and without LUT
combining. The cells with LUT combining are highlighted in purple.
To disable LUT combining on a module that overlaps with areas of higher congestion, use the
following Tcl command:
reset_property SOFT_HLUTNM [get_cells -hierarchical -filter {NAME =~ <module name> && SOFT_HLUTNM != ""}]
High fanout nets that have tight timing constraints require tightly clustered placement to meet
timing. This can cause localized congestion as shown in the following figure. High fanout nets can
also contribute to congestion by consuming routing resources that are no longer available for
other nets in the congestion window.
To analyze the impact of high fanout non-global nets on routability in the congestion window you
can:
• Select the leaf cells of the top hierarchical modules in the congestion window.
• Use the find command (Edit → Find) to select all of the nets of the selected cell objects (filter
out Global Clocks, Power, and Ground nets).
• Sort the nets in decreasing Flat Pin Count order.
• Select the top fan-out nets to show them in relation to the congestion window.
This can quickly help you identify high-fanout nets which potentially contribute to congestion.
For high fanout nets with tight timing constraints in the congestion window, replicating the
driver helps relax the placement constraints and alleviates congestion.
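As an alternative to the interactive steps above, the same information can be approximated from the Tcl Console. This sketch reports nets across the whole design rather than only those in the congestion window, and the net name is hypothetical:

```tcl
# List the highest fanout nets, with timing slack and load types.
report_high_fanout_nets -timing -load_types -max_nets 20

# After placement, force replication of the driver of one timing-critical net.
# u_core/ctrl_en is a hypothetical net name.
phys_opt_design -force_replication_on_nets [get_nets u_core/ctrl_en]
```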
High fanout nets (fanout > 5000) with sufficient positive timing slack can be routed on global
clock resources instead of fabric resources. The placer automatically routes high fanout nets with
fanout > 1000 on global routing resources if those resources are available towards the end of the
placer step. This optimization only occurs if it does not degrade timing.
You can also set the property CLOCK_BUFFER_TYPE=BUFG on the net and let synthesis or logic
optimization automatically insert the buffer prior to the placer step. Review the newly inserted
buffer placement along with its driver and loads placement after place_design to verify that it
is optimal. If it is not optimal, use the CLOCK_REGION constraint (UltraScale devices only) or
LOC constraint (7 series devices only) on the clock buffer to control its placement.
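For example, the property might be applied as follows. The net name is hypothetical:

```tcl
# Request a global buffer on a high fanout net with positive slack.
# u_core/sync_rst is a hypothetical net name.
set_property CLOCK_BUFFER_TYPE BUFG [get_nets u_core/sync_rst]
```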
To use cell bloating, apply the CELL_BLOAT_FACTOR property to hierarchical cells and set the
value to LOW, MEDIUM, or HIGH. When working with smaller modules of several hundred cells,
HIGH is the recommended setting.
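A minimal sketch, assuming a small hierarchical cell named u_small_block (hypothetical):

```tcl
# Ask the placer to spread the cells of a small module over a larger area.
set_property CELL_BLOAT_FACTOR HIGH [get_cells u_small_block]
```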
CAUTION! If the device already uses too many routing resources, cell bloating is not recommended. In
addition, using cell bloating on larger cells might force placed cells to be too far apart.
ML Strategies
Machine learning (ML) strategies allow you to quickly obtain an optimized strategy for your
design. If you are running multiple implementation strategies to generate implementation results,
you can use ML strategies instead to help you predict which strategies are most likely to
generate a good result.
RECOMMENDED: Xilinx recommends performing three implementation runs with different strategy
suggestions to identify and address errors in the prediction.
You can generate strategy suggestion objects on a routed design by running the
report_qor_suggestions command. Prior to running this command, you must run the
implementation flow as follows:
After generating ML strategy suggestions, you must write the suggestions using
write_qor_suggestions -strategy_dir <directory>. This writes one RQS file per
strategy. To activate strategy objects, an RQS file with the strategy suggestion must be read
using read_qor_suggestions prior to running opt_design, and the directives for all
commands must be set to RQS (for example, opt_design -directive RQS).
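The flow described above might be scripted roughly as follows. This is a sketch only: the directory and file names are hypothetical, and the name of the generated strategy RQS file is left as a placeholder because it depends on the design.

```tcl
# On the routed reference design: generate and write the strategy suggestions.
report_qor_suggestions
# ./ml_strategies and ./suggestions.rqs are hypothetical paths.
write_qor_suggestions -strategy_dir ./ml_strategies ./suggestions.rqs

# In a new run, starting from the pre-opt_design netlist: read one strategy
# file before opt_design, then use the RQS directive for every command.
read_qor_suggestions ./ml_strategies/<strategy_file>.rqs
opt_design      -directive RQS
place_design    -directive RQS
phys_opt_design -directive RQS
route_design    -directive RQS
```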
• For best results, resolve all methodology checks, and make sure the design has a QoR
assessment score of three or higher. To verify, run report_qor_assessment after
synth_design or opt_design.
• To further enhance performance, combine ML strategy suggestions with other QoR
suggestions in the same RQS file.
Note: ML strategy suggestions are combined automatically when QoR suggestions are written. To
disable this feature, use write_qor_suggestions -of_objects
[get_qor_suggestions ...], and filter only the desired suggestions.
For more information, see the Vivado Design Suite User Guide: Design Analysis and
Closure Techniques (UG906).
Predefined Strategies
Xilinx provides a set of predefined strategies that are tuned to be effective solutions for the
majority of designs.
Note: Xilinx does not recommend running the SSI technology strategies for a non-SSI technology device.
Custom Strategies
If timing cannot be met with the predefined strategies, you can manually explore a custom
combination of directives. Because placement typically has a large impact on overall design
performance, it can be beneficial to try various placer directives with only the I/O location
constraints and with no other placement constraints. By reviewing both WNS and TNS of each
placer run (these values can be found in the placer log), you can select two or three directives
that provide the best timing results as a basis for the downstream implementation flow.
TIP: For a list of directives and a short description of their functions, enter the implementation command
followed by the -help option (for example, place_design -help). For information on strategies, see
the Vivado Design Suite User Guide: Implementation (UG904).
For each of these checkpoints, several directives for phys_opt_design and route_design
can be tried and again only the runs with the best estimated or final WNS/TNS should be kept. In
Non-Project Mode, you must explicitly describe the flow with a Tcl script and save the best
checkpoints. In Project Mode, you can create individual implementation runs for each placer
directive, and launch the runs up to the placement step. You would continue implementation for
the runs that have the best results after the placer step (as determined by the Tcl-post script).
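In Non-Project Mode, the placer directive exploration might be sketched as the following loop. The checkpoint names are hypothetical, the three directives are only an example subset, and the script records WNS only. TNS can be read from the placer log or from report_timing_summary.

```tcl
# Hypothetical checkpoint written after opt_design, with only I/O location constraints.
set base_dcp ./post_opt.dcp
# Example subset of place_design directives to compare.
set directives {Explore ExtraNetDelay_high AltSpreadLogic_high}

foreach d $directives {
    open_checkpoint $base_dcp
    place_design -directive $d
    # Estimated post-placement WNS: slack of the worst setup path.
    set wns [get_property SLACK [get_timing_paths -setup -max_paths 1]]
    puts "place_design -directive $d : WNS = $wns ns"
    # Keep the placed checkpoint so the best runs can continue downstream.
    write_checkpoint -force ./post_place_$d.dcp
    close_design
}
```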
Physical constraints (Pblocks and DSP and RAM macro constraints) can prevent the placer from
finding the optimal solution. Xilinx therefore recommends that you run the placer directives
without any Pblock constraints. The following Tcl command can be used to delete any Pblocks
before placement with directives commences:
delete_pblocks [get_pblocks *]
Next, run phys_opt_design with any of the directives to improve the overall WNS of the
design.
In Project Mode, the same results can be achieved by running the first phys_opt_design
command as part of a Tcl-pre script for a phys_opt_design run step, which runs using the
-directive option.
• It does not modify the clock relationships (clock waveforms remain unchanged).
• It is additive to the tool-computed clock uncertainty (jitter, phase error).
• It is specific to the clock domain or clock crossing specified by the -from and -to options.
• It can easily be reset by applying a null value to override the previous clock uncertainty
constraint.
• Overconstrain only the clocks or clock crossing that cannot meet setup timing.
• Use the -setup option to tighten the setup requirement only.
Note: If you do not specify this option, both setup and hold requirements are tightened.
Overconstraining Example
A design misses setup timing by 0.2 ns on paths within the clk1 clock domain and by 0.3 ns on
paths from clk2 to clk3, both before and after route.
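The constraints that add the extra uncertainty before the flow is run might look like the following. The values 0.3 ns and 0.4 ns are illustrative assumptions, chosen to be slightly larger than each violation:

```tcl
# Tighten the setup requirement only, on the failing domain and the failing crossing.
set_clock_uncertainty -from clk1 -to clk1 0.3 -setup
set_clock_uncertainty -from clk2 -to clk3 0.4 -setup
```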
3. Run the flow up to the router step. It is best if the pre-route timing is met.
4. Remove the extra uncertainty.
set_clock_uncertainty -from clk1 -to clk1 0 -setup
set_clock_uncertainty -from clk2 -to clk3 0 -setup
After running the router, you can review the timing results to evaluate the benefits of
overconstraining. If timing was met after placement but still fails by some amount after route,
you can increase the amount of uncertainty and try again.
WARNING! Do not overconstrain beyond 0.5 ns. Overconstraining the design can result in increased
power due to the additional logic replication introduced by the implementation tools as well as increased
compile time.
TIP: An alternative to overconstraining the design is to change the relative priority of each path group. By
default, each clock and user-defined path group is analyzed independently with the same priority during
implementation. You can set a higher priority for any clock-based path group using the
group_path -weight 2 -name <ClockName> options. The priority of user-defined path groups cannot be
changed.
Considering Floorplan
Floorplanning allows you to guide the tools, either through high-level hierarchy layout, or
through detail placement. This can provide improved QoR and more predictable results. You can
achieve the greatest improvements by fixing the worst problems or the most common problems.
For example, if there are outlier paths that have significantly worse slack, or high levels of logic,
fix those paths first by grouping them in the same region of the device through a Pblock. Limit
floorplanning to the portions of the design that need additional user intervention, rather than
floorplanning the entire design.
Floorplanning logic that is connected to the I/O in the vicinity of that I/O can sometimes yield
good results in terms of predictability from one compilation to the next. In general, it is best to
keep the size of the Pblocks to a clock region. This provides the most flexibility for the placer.
Avoid overlapping Pblocks, as these shared areas can potentially become more congested.
Where there is a high number of connecting signals between two Pblocks, consider merging them
into a single Pblock. Minimize the number of nets that cross Pblocks.
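As an illustration of these guidelines, a Pblock sized to a single clock region might be created as follows. The Pblock name, the cell name, and the clock region are all hypothetical:

```tcl
# Create a Pblock for one timing-critical module only.
create_pblock pblock_crit
# u_crit is a hypothetical hierarchical cell containing the outlier paths.
add_cells_to_pblock [get_pblocks pblock_crit] [get_cells u_crit]
# Size the Pblock to one clock region (the region name depends on the device).
resize_pblock [get_pblocks pblock_crit] -add {CLOCKREGION_X0Y1:CLOCKREGION_X0Y1}
```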
TIP: When upgrading to a newer version of the Vivado Design Suite, first try compiling without Pblocks or
with minimal Pblocks (i.e., only SLR level Pblocks) to see if there are any timing closure challenges. Pblocks
that previously helped to improve the QoR might prevent place and route from finding the best possible
implementation in the newer version of the tools.
For SSI technology devices, you can also consider using SLR Pblocks or soft floorplanning
constraints (USER_SLR_ASSIGNMENT).
• On the left, the example shows that the placer was unable to find the optimal placement
of the path, because block RAM utilization was high. FIFO36E2 primitives are marked in red.
• On the right, the example shows that the placer was able to meet timing, because the
FIFO36E2 blocks were grouped in a rectangle that avoided the configuration column crossing.
FIFO36E2 primitives are marked in green.
Figure panels: Locations Not Avoiding the Configuration Column (left); Preassigned Locations
Avoiding the Configuration Column (right)
set_property IS_LOC_FIXED 1 \
[get_cells -hier -filter {PRIMITIVE_TYPE =~ BLOCKRAM.*.*}]
write_xdc bram_loc.xdc -exclude_timing
You can edit the bram_loc.xdc file to keep only the block RAM location constraints and apply it
to your subsequent runs.
IMPORTANT! Do not reuse the placement of general slice logic. Do not reuse the placement for sections
of the design that are likely to change. Use the Incremental Compile flow if you make small changes to the
design and want to re-use prior placement to achieve more predictable results and faster compile time.
This section covers recommendations for automatic incremental implementation, including both
high and low reuse modes.
In all other use cases of the incremental implementation flow, you have control over the selection
of the reference checkpoint. Following are guidelines to help improve your selection of the
reference checkpoint:
• Use a reference checkpoint that meets timing or is close to meeting timing. If the reference
checkpoint is close to meeting timing, it might be beneficial to improve timing as follows
before running the incremental implementation flow.
Note: For automatic incremental implementation, the checkpoint is rejected unless WNS is greater than
-0.250 ns.
○ Run route_design -tns_cleanup to optimize paths that are not the worst case path.
○ Use incremental synthesis to reduce changes introduced into the netlist due to RTL
changes. Enable incremental synthesis early in the design closure cycle rather than waiting
until you are ready to use incremental implementation.
○ Ensure that synth_design and opt_design options match for the reference checkpoint
and the incremental implementation runs.
○ Match tool versions. Although this is not a requirement, thresholds change and new
optimizations are added, which can lead to reduced matching.
○ Avoid using opt_design AddRemap and ExploreWithRemap directives unless these
are the only directives that close timing. These directives have reduced naming consistency
when changes are introduced to the codebase.
• Use report_qor_assessment to determine whether the design is ready for the
incremental implementation flow to be run and whether it is preferable to switch from the
default flow.
TIP: To adjust the incremental implementation thresholds, run config_implementation -help for
information. To identify differences between the reference and the incremental checkpoints, run
report_incremental_reuse.
Following are the directives available for use with the incremental implementation flow:
• RuntimeOptimized: Targets the WNS from the reference checkpoint. This helps maintain
consistency with the reference checkpoint and improves placer and router run time by at least
2x. If the reference checkpoint does not close timing, this directive does not attempt to close
timing. This directive is the default.
• TimingClosure: Targets WNS = 0.000 ns. Use this directive when the reference run is very
close to meeting timing, and you are willing to trade off consistency in results and run time
with more effort to try to meet timing. This mode can improve WNS by up to 250 ps on
difficult designs. Use this directive with QoR Suggestions for the best chance at closing timing.
There is usually a run time hit with this directive.
• Quick: This option is intended for designs that easily meet timing with greater than 99%
reuse. Typically, this option is used for ASIC emulation and prototype designs with minor
changes that do not impact timing.
Note: The RuntimeOptimized directive replaces the Default mapping directive, and the
TimingClosure directive replaces the Explore mapping directive from previous Vivado Design Suite
releases.
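In Non-Project Mode, selecting a directive for an incremental run might look like the following sketch. The reference checkpoint name is hypothetical, and the commands are assumed to run after opt_design on the modified design:

```tcl
# Load the reference checkpoint and select the incremental directive.
read_checkpoint -incremental -directive TimingClosure ./reference_routed.dcp
place_design
route_design
# Review how much of the reference placement and routing was reused.
report_incremental_reuse
```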
• Some design runs are showing that a design can meet timing but many runs do not.
• It is early in the design flow or significant changes are still being made.
Reusing hierarchical cells is effective when placement of a particular cell is influencing the WNS
significantly. Reusing DSPs, block RAMs, or both is useful in designs that have a relatively high
density of these blocks.
• Analyze the reference runs, including checking failing checkpoints to identify the difference
between good and bad runs.
○ Identify runs that have a good WNS and low congestion levels.
• After determining the area to target, compare a set of runs using low reuse mode against a
baseline set of runs using the default flow to evaluate effectiveness.
○ Use different place_design directives to generate multiple results for comparison.
Note: In low reuse mode, incremental implementation directives are ignored, and target WNS is always
0.000 ns.
To reuse only block memory placement, use the following Tcl script:
To reuse both Block Memory and DSP placement, use the following Tcl script:
To reuse hierarchy in a particular hierarchical cell and all hierarchies below the cell, use the
following Tcl script:
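The three reuse cases above can be sketched as follows. The checkpoint name routed.dcp and the cell name u_core are hypothetical, and only one of the three commands would be used in a given run:

```tcl
# Reuse block memory placement only.
read_checkpoint -incremental routed.dcp -reuse_objects [all_rams]

# Reuse both block memory and DSP placement.
read_checkpoint -incremental routed.dcp -reuse_objects [all_rams] -reuse_objects [all_dsps]

# Reuse a hierarchical cell and all hierarchies below it (u_core is hypothetical).
read_checkpoint -incremental routed.dcp -reuse_objects [get_cells u_core]
```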
The following figure shows the resulting placement and connectivity from setting the BLI
property to TRUE.
Figure 152: Placement of XPIO-PL Interface BLI Flip-Flops for ODDRE1 and IDDRE1
[Figure: Pblocks dedicated to SLR-crossing flip-flops]
IMPORTANT! Xilinx recommends using CLOCKREGION ranges instead of LAGUNA ranges for
SLR-crossing Pblocks.
TIP: You can define SLR Pblocks by specifying a complete SLR. For example, resize_pblock
pblock_SLR0 -add SLR0.
For more information, see the Vivado Design Suite User Guide: Design Analysis and Closure
Techniques (UG906).
VIDEO: For information on using floorplanning techniques to address design performance issues, see the
Vivado Design Suite QuickTake Video: Design Analysis and Floorplanning.
You can use the USER_SLR_ASSIGNMENT property to floorplan the design by assigning large
design blocks to SLRs. Set this property to a string value, which is applied to hierarchical cells and
ignored on leaf cells. The value you set for this property influences the logic partitioning as
follows:
• SLR name: When a hierarchical cell is assigned the name of an SLR (SLR0, SLR1, SLR2, etc.),
the placer attempts to place the entire cell within the specified SLR.
• String value: When a hierarchical cell is assigned an arbitrary string value, the placer chooses
the SLR. This prevents cells from being partitioned into multiple SLRs.
Note: If multiple cells have the same USER_SLR_ASSIGNMENT value, the placer attempts to group the
cells in the same SLR.
The USER_SLR_ASSIGNMENT property is a soft constraint during SLR partitioning while the
Pblock is always a hard constraint during SLR partitioning and global placement. Unlike Pblocks,
the USER_SLR_ASSIGNMENT can be ignored by the placer to find a valid SLR partitioning of the
design. USER_SLR_ASSIGNMENT also allows the detailed placer and physical
optimization to make fine-tuned adjustments to leaf cell placement near the SLR boundaries to
improve timing. These adjustments include moving pipeline registers across SLR boundaries if the
moves improve timing. These register moves are not permitted across Pblock boundaries.
In the following example, a design contains three timing-critical hierarchical blocks with cell
names IP1, IP2, and IP3 and targets a two-SLR device. To split the three blocks so that IP1 and
IP2 are kept together in SLR1 while IP3 is placed in SLR0, the following XDC constraints are
applied:
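A sketch of constraints matching this description, assuming IP1, IP2, and IP3 are hierarchical cells at the top level of the design:

```tcl
# Keep IP1 and IP2 together in SLR1, and place IP3 in SLR0.
set_property USER_SLR_ASSIGNMENT SLR1 [get_cells {IP1 IP2}]
set_property USER_SLR_ASSIGNMENT SLR0 [get_cells IP3]
```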
The following figure shows the resulting placement. To improve performance, you can
incorporate extra pipeline stages to traverse distances within the device. This is particularly
helpful along expected SLR crossings, between IP2 and IP3 in this example. During detail
placement and phys_opt_design, the pipeline registers from IP2 and IP3 can automatically
move across SLR boundaries if this improves timing.
[Figure: IP1 and IP2 placed in SLR1 and IP3 placed in SLR0, with pipeline registers added
between IP2 and IP3 for placement flexibility]
For cases in which you cannot set USER_SLR_ASSIGNMENT or the placer splits challenging
paths across SLRs, you can use the USER_CROSSING_SLR property to direct where SLR
crossings should or should not occur. Typically, you apply this property to nets or leaf pins where
you want pins to be placed in the same SLR as the net driver, or where you want the SLR crossing
for the case of a register chain. Set this property to a Boolean value, which is applied to nets and
pins to constrain individual SLR crossings:
• TRUE: Indicates that the target net object should cross an SLR or the target pin object should
be connected across an SLR. You can only apply the TRUE value to register-to-register
connections with a single fanout in between.
Note: You cannot use the TRUE value for random logic. This option is useful for ensuring a chain of
registers always crosses an SLR boundary on a specific register when trying multiple implementation
strategies.
• FALSE: Indicates that the target net object should not cross an SLR or the target pin object
should not be connected across an SLR. You can apply the FALSE value to any net or pin.
Note: Pins must not be inside macro primitives, because these pins are internal and cannot be constrained.
In the following example, a pipeline register chain crosses an SLR twice, resulting in an
unintentional, inefficient zigzag path.
Note: In the next two figures, each dot represents a register stage.
[Figure: Pipeline register chain with net_A and net_B shown relative to the SLR boundary]
To achieve the optimal placement in which only net_B crosses the SLR, the following XDC
constraints are applied:
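One way to express this intent is to set the property directly on the two nets. This is a sketch that assumes net_A and net_B are top-level net names:

```tcl
# net_A must stay within one SLR; net_B is the single intended SLR crossing.
set_property USER_CROSSING_SLR FALSE [get_nets net_A]
set_property USER_CROSSING_SLR TRUE  [get_nets net_B]
```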
The resulting placement contains just a single SLR crossing on net_B as shown in the following
figure.
Figure 156: Optimal SLR Crossings After Setting the USER_CROSSING_SLR Property
• The placement of SLR crossings spreads vertically, reducing routing congestion near SLR
boundaries.
• Locating registers in Laguna sites improves delay estimation accuracy, resulting in higher
timing QoR.
• SLR-crossing performance becomes faster and more consistent.
Note: When targeting UltraScale SSI technology devices, you can only use a Laguna TX_REG or RX_REG on
an SLR-crossing net, and you cannot use both at the same time. Performance advantages are similar to the
ones listed above.
You can set the USER_SLL_REG property on registers that you expect to be placed at an SLR
crossing boundary on a Laguna register site. The USER_SLL_REG constraint is ignored by
place_design if the register D and Q pins are connected to a net that either does not cross an
SLR boundary or drives loads placed in multiple SLRs. For example:
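A minimal sketch, assuming a register named reg_A (hypothetical) that drives a single load in the adjacent SLR:

```tcl
# Ask the placer to use a Laguna register site for this SLR-crossing register.
set_property USER_SLL_REG TRUE [get_cells reg_A]
```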
A reliable method of mapping registered crossings to Laguna is to apply both BEL and LOC
constraints to the registers to lock them in place. The LOC value assigns the Laguna site, and the
BEL value chooses a particular Laguna register inside the site, one of six TX_REG registers or one
of six RX_REG registers. Laguna crossing registers are a fixed distance apart, which means that
each TX_REG register is paired with an RX_REG register for a direct connection.
The BEL assignments are applied first, and the register position (0, 1, ... 5) must match between
TX_REG and RX_REG, which is 3 for this example. Finally, the distance between paired Laguna
sites is 120 rows. The register reg_A drives from the bottom row of the SLR2 Laguna column
across to the bottom row of the SLR1 Laguna column. When creating LAGUNA BEL and LOC
constraints, try grouping registers with the same clock, clock enable, and reset signals to avoid
control set compatibility issues.
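The constraints described above might take the following shape. The receiving register name reg_B and both Laguna site names are hypothetical. Valid site names depend on the device, and the two sites must form a legal pair (120 rows apart, per the description above):

```tcl
# BEL assignments first: the register position (3) must match on both sides.
set_property BEL TX_REG3 [get_cells reg_A]
set_property BEL RX_REG3 [get_cells reg_B]
# LOC assignments: hypothetical paired Laguna sites, 120 rows apart.
set_property LOC LAGUNA_X0Y240 [get_cells reg_A]
set_property LOC LAGUNA_X0Y120 [get_cells reg_B]
```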
• Target frequency
• Device floorplan
• Device speed grade
You can leverage the auto-pipelining feature to allow the placer algorithms to decide on the
number of required stages and their optimal location, which helps timing closure across SLR
boundaries. When using this feature, the Vivado placer automatically uses Laguna registers
without additional intervention.
Related Information
Auto-Pipelining Considerations
RECOMMENDED: Because an IDR takes longer than a standard implementation run, Xilinx recommends
using IDRs less frequently than standard runs. For example, use an IDR after you resolve all methodology
warnings and after trying a few common implementation strategies, such as Default and Explore.
TIP: To iterate more quickly, you can extract the QoR suggestions and ML strategies from the IDR for use
in a standard implementation run. If a significant design change is made, rerun the IDR to update the
associated files.
• The implementation must be project based. For non-project users, the easiest method is to
create a post-synthesis netlist-based project using a pre-opt_design checkpoint.
• The device must be from the UltraScale or UltraScale+™ device family.
• The design must have a baseline with accurate and achievable constraints.
• All designs must comply with the recommended methodology, as reported by the
report_methodology Tcl command.
• An SLR-based floorplan might be required for SSI technology-based devices.
• Apply only automatic implementation suggestions. Text-based suggestions or suggestions
with APPLICABLE_FOR = synth_design must be applied before starting an IDR.
For more information, see the Vivado Design Suite User Guide: Design Analysis and
Closure Techniques (UG906).
Power Closure
Given the importance of power, the Vivado tools support methods for obtaining an accurate
estimate for power, as well as providing some power optimization capabilities. For additional
information, see the Vivado Design Suite User Guide: Power Analysis and Optimization (UG907).
RECOMMENDED: When targeting UltraScale and UltraScale+™ devices and using the Explore directives
or Explore-based strategies, you must manually enable block RAM power optimization by running
power_opt_design or using opt_design -bram_power_opt after opt_design runs. Xilinx
recommends minimizing memory resources to reduce power. Review the bit utilization from the RAM
Utilization Report to find memory arrays with inefficient mapping. Also consider using the HDL
RAM_STYLE MIXED attribute for the most efficient mapping of arrays using a combination of block
memory and LUTRAMs.
Specify a power budget to report the power margin using the XDC constraint:
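A sketch of this constraint, with a 77 W budget chosen purely for illustration:

```tcl
# Set the design power budget in watts (77 is an illustrative value).
set_operating_conditions -design_power_budget 77
# The power margin is then included when power is reported.
report_power
```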
This value is used by the report_power command. The difference between the calculated on-
chip power and the power budget is the power margin, which is displayed in red in the Vivado
IDE if the power budget is exceeded. This makes it easier to monitor power consumption
throughout the flow.
TIP: For UltraScale+ devices, you can export an XDC file from the XPE tool that contains the environment
settings, including the XPE estimate that can be used as a power budget constraint. You can override the
power budget using either XPE or the XDC. Add the XDC constraints for power margin reporting.
The accuracy of the power estimates varies depending on the design stage when the power is
estimated. To estimate power post-synthesis through implementation, run the report_power
command, or open the Power Report in the Vivado IDE.
• Post Synthesis: The netlist is mapped to the actual resources available in the target device.
• Post Placement: The netlist components are placed into the actual device resources. With this
packing information, the final logic resource count and configuration becomes available. This
accurate data can be exported to the XPE spreadsheet. This allows you to:
• Provide the basis for accurately filling in the spreadsheet for future designs with similar
characteristics.
• Post Routing: After routing is complete all the details about routing resources used and exact
timing information for each path in the design are defined.
In addition to verifying the implemented circuit functionality under best and worst case logic and
routing delays, the simulator can also report the exact activity of internal nodes and include
glitching. Power analysis at this level provides the most accurate power estimation before you
actually measure power on your prototype board.
The Vivado tools report_power command also allows you to report power on a per
regulator or voltage regulator module (VRM) basis using the following constraints.
POWER TIP: Review and validate the decoupling requirement of the completed Vivado design against the
current schematic/PCB. You can generate a .xpe file from Vivado tools report_power using the
following Tcl commands:
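A minimal sketch of the export, assuming the implemented design is open and using a hypothetical output file name:

```tcl
# Write the power estimate in a format that XPE can import.
report_power -xpe ./design_power.xpe
```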
You can then import the .xpe file into XPE. For example, the XPE Power Delivery sheet shows the
decoupling requirement based on the power estimation and power delivery option.
Power Optimization
If the power estimates are outside the budget, you must follow the steps described in the
following sections to reduce power.
• Examine the total power in the Summary section. Do the total power and junction
temperature fit into your thermal and power budget?
• If the results are substantially over budget, review the power summary distribution by block
type and by the power rails. This provides an idea of the highest power consuming blocks.
• Review the Hierarchy section. The breakdown by hierarchy provides a good idea of the
highest power consuming module. You can drill down into a specific module to determine the
functionality of the block. You can also cross-probe in the GUI to determine how specific
sections of the module have been coded, and whether there are power efficient ways to
recode it.
Note: If the design has a timing margin, conduct multiple runs to evaluate if any of the runs have a
better total power. For example, a design that has 2 ps of margin can perform similarly to a design with
15 ps, but the 2 ps design might have lower power.
Power optimization can be run either pre-place or post-place in the design flow, but not both.
The pre-place power optimization step focuses on maximizing power saving. This can result (in
rare cases) in timing degradation. If preserving timing is the primary goal, Xilinx recommends the
post-place power optimization step. This step performs only those power optimizations that
preserve timing.
In cases where portions of the design should be preserved due to legacy (IP) or timing
considerations, use the set_power_opt command to exclude those portions (such as specific
hierarchies, clock domains, or cell types) and rerun power optimization.
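For example, a legacy block might be excluded before pre-place power optimization as sketched below. The cell name u_legacy_ip is hypothetical:

```tcl
# Exclude a hierarchical cell from power optimization.
set_power_opt -exclude_cells [get_cells u_legacy_ip]
# Pre-place power optimization: run after opt_design and before place_design.
power_opt_design
```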
Where possible, identify and apply power optimizations only on non-timing critical clock
domains or modules using the set_power_opt XDC command. If the most critical clock domain
happens to cover a large portion of the design or consumes the most power, review critical paths
to see if any cells in the critical path have the IS_CLOCK_GATED property with value TRUE,
indicating that the paths are the result of a power optimization. To improve timing at the expense
of increased power in a subsequent implementation, use the set_power_opt XDC constraint
to disable power optimization on the power-optimized cells in the critical path. Then rerun
implementation with the set_power_opt XDC constraints or Tcl commands.
The following Tcl example disables power optimization on cells in the top 100 failing paths:
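One way such a script could be written is sketched below. The filter on IS_CLOCK_GATED and the choice to exclude only clock-gated cells are assumptions based on the description above, not a confirmed reproduction of the original example:

```tcl
# Up to 100 worst setup paths with negative slack.
set failing_paths [get_timing_paths -setup -max_paths 100 -slack_lesser_than 0]
if {[llength $failing_paths] > 0} {
    # Cells on those paths that were clock gated by power optimization.
    set gated_cells [get_cells -quiet -of_objects $failing_paths -filter {IS_CLOCK_GATED == 1}]
    if {[llength $gated_cells] > 0} {
        # Exclude them from power optimization in the next implementation run.
        set_power_opt -exclude_cells $gated_cells
    }
}
```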
The following figure shows an example of this approach. For all 64 timing closure runs, report
power was also run, and all runs are plotted together. From the graph, 36 runs were timing clean,
and from a power perspective, the total power budget is 77W. The 64 runs were in the range of
75W to 83W, an 8W or ~10% range.
Looking at the best run from a timing perspective, run #6 had a power estimate of 79.5W, which
exceeds the total power budget. However, from the timing clean runs, run #13 yielded the lowest
power at 75W and was still timing clean. Understanding the design from both a timing and
power perspective allows you to select the best run for both, without impacting the timing result.
In this example, this approach enabled a 4.5W power saving.
POWER TIP: You can also improve design power by removing the DONT_TOUCH constraint to allow
upfront logic trimming, including clocking primitives.
Figure 158: Power and Timing Slack for Different Place and Route Runs
See the following resources for details on configuration and debug software flows and
commands:
Configuration
You must first successfully synthesize and implement your design to create a bitstream image.
Once the bitstream has been generated and all DRCs are analyzed and corrected, you can load
the bitstream onto the device using one of the following methods:
• Direct Programming: The bitstream is loaded directly to the device using a cable, processor,
or custom solution.
• Indirect Programming: The bitstream is loaded into an external flash memory. The flash
memory then loads the bitstream into the device.
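As a rough illustration of the two methods, the following Tcl sketch programs a device directly over JTAG and, separately, generates a flash image for indirect programming. The file names, the SPI x4 interface, and the 16 MB flash size are assumptions chosen for the example, not values from this guide.

```tcl
# Direct programming over JTAG (assumes a cable and a local hw_server).
open_hw_manager
connect_hw_server
open_hw_target
set dev [lindex [get_hw_devices] 0]
current_hw_device $dev
set_property PROGRAM.FILE {design.bit} $dev   ;# hypothetical bitstream name
program_hw_devices $dev
refresh_hw_device $dev

# Indirect programming: build a flash image from the same bitstream.
# Interface and size (in MB) must match the flash on the board.
write_cfgmem -force -format mcs -interface SPIx4 -size 16 \
    -loadbit "up 0x0 design.bit" -file design.mcs
```

The generated .mcs file is then written to the flash device, which loads the bitstream into the device at power-up.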
IMPORTANT! The Vivado Design Suite Device Programmer can use JTAG to read the Status register
data on Xilinx devices. In case of a configuration failure, the Status register captures the specific error
conditions that can help identify the cause of a failure. In addition, the Status register allows you to
verify the Mode pin settings M[2:0] and the bus width detect. For details on the Status register, see the
Configuration User Guide for your device.
TIP: If configuration is not successful, you can use a JTAG readback/verify operation on the device to
determine whether the intended configuration data was loaded correctly into the device.
Debugging
In-system debugging allows you to debug your design in real time on your target device. This
step is needed if you encounter situations that are extremely difficult to replicate in a simulator.
For debug, you add special debugging IP to your design that allows you to observe and
control the design. After debugging, you can remove the instrumentation or special IP to improve
performance and reduce logic utilization.
Debugging a design is a multistep, iterative process. Like most complex problems, it is best to
break the design debugging process down into smaller parts by focusing on getting smaller
sections of the design working one at a time rather than trying to get the whole design to work
at once.
Though the actual debugging step comes after you have successfully implemented your design,
Xilinx recommends planning how and where to debug early in the design cycle. You can run all
necessary commands to perform programming of the devices and in-system debugging of the
design from the Program and Debug section of the Flow Navigator in the Vivado IDE.
In-system debugging consists of the following phases:
1. Probing: Identify the signals in your design that you want to probe and how you want to
probe them.
2. Implementing: Implement the design that includes the additional debug IP attached to the
probed nets.
3. Analyzing: Interact with the debug IP contained in the design to debug and verify functional
issues.
4. Fixing: Fix any bugs and repeat as necessary.
For more information, see the Vivado Design Suite User Guide: Programming and Debugging
(UG908).
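The analyzing phase can also be scripted. The following is a minimal sketch, assuming the device is already programmed, the Hardware Manager is connected, and the design contains one ILA core with the default name hw_ila_1. The probe name and the output file name are hypothetical.

```tcl
set ila [get_hw_ilas hw_ila_1]

# Trigger when the (hypothetical) 1-bit probe data_valid equals 1.
set trig [get_hw_probes data_valid -of_objects $ila]
set_property TRIGGER_COMPARE_VALUE eq1'b1 $trig

# Arm the core, wait for the trigger, and upload the captured samples.
run_hw_ila $ila
wait_on_hw_ila $ila
set data [upload_hw_ila_data $ila]

# Save the capture for offline analysis.
write_hw_ila_data -force -csv_file capture.csv $data
```

Scripting the capture makes it easier to repeat the same measurement after each fix.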
Debugging the PL
Debugging the programmable logic (PL) can be necessary if you encounter situations that are
difficult to replicate in PL logic simulation. This section covers the debugging tools that allow
visibility into the PL domain.
The Vivado tools provide several methods to add debug probes in your design. The table below
explains the various methods, including the pros and cons of each method.
Follow these recommendations when choosing nets to probe:
• Probe nets at the boundaries (inputs or outputs) of a specific hierarchy. This method helps
isolate problem areas quickly. Subsequently, you can probe further in the hierarchy if needed.
• Do not probe nets in the middle of combinatorial logic paths. If you add MARK_DEBUG on nets in
the middle of a combinatorial logic path, none of the optimizations applicable at the
implementation stage of the flow are applied, resulting in sub-par timing QoR.
• Probe nets that are synchronous to get cycle accurate data capture.
You can mark a signal for debug either at the RTL stage or post-synthesis. The presence of the
MARK_DEBUG attribute on the nets ensures that the nets are not replicated, retimed, removed,
or otherwise optimized. You can apply the MARK_DEBUG attribute on top level ports, nets,
hierarchical module ports and nets internal to hierarchical modules. This method is most likely to
preserve HDL signal names post synthesis. Nets marked for debugging are shown in the
Unassigned Debug Nets folder in the Debug window post synthesis.
VHDL:
Verilog:
You can also add nets for debugging in the post-synthesis netlist. These methods do not require
HDL source modification. However, there may be situations where synthesis might not have
preserved the original RTL signals due to netlist optimization involving absorption or merging of
design structures. Post-synthesis, you can add nets for debugging in any of the following ways:
• Select a net in any of the design views (such as the Netlist or Schematic window), then right-
click and select Mark Debug.
• Select a net in any of the design views, then drag and drop the net into the Unassigned Debug
Nets folder.
• Use the net selector in the Set Up Debug wizard.
• Set the MARK_DEBUG property using the Properties window or the Tcl Console.
set_property mark_debug true [get_nets -hier [list {sine[*]}]]
This applies the mark_debug property on the current, open netlist. This method is flexible,
because you can turn MARK_DEBUG on and off through the Tcl command.
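As a short sketch of that flexibility, the following commands mark the nets, list everything currently marked, and then clear the property again. They assume a synthesized design is open. The sine[*] pattern is the one used above.

```tcl
# Mark the nets for debug.
set dbg_nets [get_nets -hier [list {sine[*]}]]
set_property MARK_DEBUG true $dbg_nets

# List every net that currently carries MARK_DEBUG.
foreach n [get_nets -hier -filter {MARK_DEBUG}] {
    puts $n
}

# Turn debugging off again for the same nets.
set_property MARK_DEBUG false $dbg_nets
```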
The configuration of the ILA core affects your ability to meet the overall design timing goals.
Follow the recommendations below to minimize the impact on timing:
• Choose probe width judiciously. The bigger the probe width the greater the impact on both
resource utilization and timing.
• Choose ILA core data depth judiciously. The bigger the data depth the greater the impact on
both block RAM resource utilization and timing.
• Ensure that the clocks chosen for the ILA cores are free-running clocks. Failure to do so could
result in an inability to communicate with the debug core when the design is loaded onto the
device.
• Ensure that the clock going to the dbg_hub is a free running clock. Failure to do so could
result in an inability to communicate with the debug core when the design is loaded onto the
device. You can use the connect_debug_port Tcl command to connect the clk pin of the
debug hub to a free-running clock.
• Close timing on the design prior to adding the debug cores. Xilinx does not recommend using
the debug cores to debug timing related issues.
• If you still notice that timing has degraded due to adding the ILA debug core and the critical
path is in the dbg_hub, perform the following steps:
1. Open the synthesized design.
2. Find the dbg_hub cell in the netlist.
3. Go to the Properties window of the dbg_hub.
4. Find property C_CLK_INPUT_FREQ_HZ.
5. Set it to the frequency (in Hz) of the clock that is connected to the dbg_hub.
6. Find property C_ENABLE_CLK_DIVIDER and enable it.
7. Re-implement the design.
• Make sure the clock input to the ILA core is synchronous to the signals being probed. Failure
to do so results in timing issues and communication failures with the debug core when the
design is programmed into the device.
• Make sure that the design meets timing before running it on hardware. Failure to do so results
in unreliable probed waveforms.
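Several of these recommendations can be expressed in Tcl on the open synthesized design. The following is a sketch only: the net name clk_free, the core name u_ila_0, the 1024-sample depth, and the 300 MHz frequency are assumptions for illustration, and the sine[*] nets are reused from the earlier example.

```tcl
# Create an ILA core with a modest data depth to limit block RAM use.
create_debug_core u_ila_0 ila
set_property C_DATA_DEPTH 1024 [get_debug_cores u_ila_0]

# Clock the ILA from a free-running clock that is synchronous to the probes.
set_property port_width 1 [get_debug_ports u_ila_0/clk]
connect_debug_port u_ila_0/clk [get_nets clk_free]

# Size the probe port to exactly the number of nets being probed.
set probe_nets [get_nets -hier [list {sine[*]}]]
set_property port_width [llength $probe_nets] [get_debug_ports u_ila_0/probe0]
connect_debug_port u_ila_0/probe0 $probe_nets

# Clock the debug hub from a free-running clock.
connect_debug_port dbg_hub/clk [get_nets clk_free]

# If the critical path is in the dbg_hub, declare the input clock frequency
# (in Hz) and enable the internal clock divider.
set_property C_CLK_INPUT_FREQ_HZ 300000000 [get_debug_cores dbg_hub]
set_property C_ENABLE_CLK_DIVIDER true [get_debug_cores dbg_hub]
```

The last two commands correspond to the C_CLK_INPUT_FREQ_HZ and C_ENABLE_CLK_DIVIDER steps in the list above. The design must be re-implemented afterward.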
The following table shows the impact of using specific ILA features on design timing and
resources.
Note: This table is based on a design with one ILA and does not represent all designs.
Table 19: Impact of ILA Features on Design Timing and Resources
TIP: In the early stages of the design, there are usually many spare resources in the device that can be used
for debugging.
Note: For designs with limited MMCM/BUFG availability, consider clocking the debug hub with the lowest
clock frequency in the design instead of using the clock divider inside the debug hub.
For information on customizing the VIO core, see the Virtual Input/Output LogiCORE IP Product
Guide (PG159). For information on taking measurements with a VIO core, see the Vivado Design
Suite User Guide: Programming and Debugging (UG908).
• Signals connected to VIO input probes must be synchronous to the clock connected to the
VIO clk port on the VIO core. Connecting signals that are not synchronous to the clk port
results in a clock domain crossing at the VIO input probe port.
• Signals driven from VIO output probes are asserted and deasserted synchronous to the clock
connected to the VIO clk port on the VIO core.
• The VIO core has a relatively low refresh rate because it is intended to replace low speed
board I/O, such as push-buttons or light-emitting diodes (LEDs). To capture high-speed signals,
consider using the ILA core.
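A VIO core can also be driven from the Tcl Console in the Hardware Manager. The following is a minimal sketch, assuming a programmed device with one VIO core that has the default name hw_vio_1. The probe names reset_vio and led_status are hypothetical.

```tcl
set vio [get_hw_vios hw_vio_1]

# Pulse a (hypothetical) output probe that stands in for a push-button.
set rst [get_hw_probes reset_vio -of_objects $vio]
set_property OUTPUT_VALUE 1 $rst
commit_hw_vio $rst
set_property OUTPUT_VALUE 0 $rst
commit_hw_vio $rst

# Refresh the core and read a (hypothetical) input probe that stands in
# for an LED.
refresh_hw_vio $vio
set led [get_hw_probes led_status -of_objects $vio]
puts "led_status = [get_property INPUT_VALUE $led]"
```

Because each write and read is a separate JTAG transaction, this style of access suits slow control and status signals only.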
• Debug interfaces, nets, or both in the block design using the System ILA core
Use this flow to:
○ Perform hardware-software co-verification using the cross-trigger feature of a MicroBlaze™
device, Zynq®-7000 SoC, or Zynq UltraScale+ MPSoC.
○ Verify the interface-level connectivity.
Note: You can also use a combination of both flows to debug your design.
For more information on using System ILA in your IP integrator design, see the Vivado Design
Suite User Guide: Designing IP Subsystems Using IP Integrator (UG994).
If you changed the ILA mode to Interface, you can debug and monitor AXI transactions and read
and write events in the Waveform window shown in the following figure. The Waveform window
displays the interface slots, transactions, events, and signal groups that correspond to the
interfaces probed by the interface slots on the ILA.
For more information on System ILA and debugging AXI interfaces in the Vivado Hardware
Manager, see the Vivado Design Suite User Guide: Programming and Debugging (UG908).
The Vivado Serial I/O Analyzer in the Hardware Manager communicates with the core through
JTAG when the design is programmed onto the device. There is only one instance of In-System
IBERT required per design. In-System IBERT can work with all GTs used in the design. However,
you must generate separate In-System IBERT cores according to the different GT types (for
example, GTH, GTY).
Creating an In-System IBERT design with an internal system clock can prevent a scan from being
performed. When creating an eye scan, the status changes from In Progress to Incomplete. Eye
scan is incomplete when the internal system clock (MGTREFCLK) is connected to the
clk/drpclk_i input port of In-System IBERT IP.
Note: If needed, consider using an external clock, which does not exhibit this behavior. Alternatively, click
any available link in the Vivado Serial I/O Analyzer. Go to the Properties window, and find the MB_RESET
reg under the LOGIC field. Set it to 1 and then toggle back to 0. Rerun the eye scan or sweep.
For more information on this core, see the In-System IBERT LogiCORE IP Product Guide (PG246).
• Block RAM resources for the device are exceeded because of the current requirements of the
debug core.
• A non-clock net is connected to the clock port on the debug core.
• A port on the debug core is unconnected.
For information on using the Incremental Compile flow to insert, delete, or edit ILA cores, see
the Vivado Design Suite User Guide: Programming and Debugging (UG908).
• Use the Xilinx Hardware Server product to connect to a remote computer in the lab.
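As a brief sketch of a remote connection, the following commands assume that hw_server is already running on the lab machine and listening on its default port, 3121. The host name is a placeholder.

```tcl
open_hw_manager

# Connect to hw_server on the remote lab machine (placeholder host name).
connect_hw_server -url lab-pc.example.com:3121

# List the JTAG targets that the remote server exposes, then open one.
foreach t [get_hw_targets] {
    puts $t
}
open_hw_target
```

Once the target is open, programming and debug commands work the same way as with a locally attached cable.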
Appendix A
Xilinx Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see Xilinx
Support.
Solution Centers
See the Xilinx Solution Centers for support on devices, tools, and intellectual property at all
stages of the design cycle. Topics include design assistance, advisories, and troubleshooting tips.
Xilinx Design Hubs provide links to documentation organized by design tasks and other topics,
which you can use to learn key concepts and address frequently asked questions. To access the
Design Hubs:
Note: For more information on DocNav, see the Documentation Navigator page on the Xilinx website.
References
These documents provide supplemental material useful with this guide.
22. Vivado Design Suite User Guide: Release Notes, Installation, and Licensing (UG973)
23. Vivado Design Suite User Guide: Designing IP Subsystems Using IP Integrator (UG994)
24. UltraFast Embedded Design Methodology Guide (UG1046)
25. Vivado Design Suite User Guide: Creating and Packaging Custom IP (UG1118)
15. Vitis HLS Methodology in the Vitis HLS User Guide (UG1399)
16. Simulating FPGA Power Integrity Using S-Parameter Models (WP411)
17. Extending the Thermal Solution by Utilizing Excursion Temperatures (WP517)
18. Using SPI Flash with 7 Series FPGAs (XAPP586)
19. BPI Fast Configuration and iMPACT Flash Programming with 7 Series FPGAs (XAPP587)
20. Reference System: Kintex-7 MicroBlaze System Simulation Using IP Integrator (XAPP1180)
21. UltraScale FPGA BPI Configuration and Flash Programming (XAPP1220)
22. SPI Configuration and Flash Programming in UltraScale FPGAs (XAPP1233)
23. Using Encryption to Secure a 7 Series FPGA Bitstream (XAPP1239)
24. Mechanical and Thermal Design Guidelines for Lidless Flip-Chip Packages (XAPP1301)
25. Designing Using SelectIO Interface Component Primitives (XAPP1324)
26. a. 7 Series Schematic Review Recommendations (XMP277)
b. Kintex UltraScale and Virtex UltraScale FPGAs Schematic Review Checklist (XTP344)
c. UltraScale+ FPGAs and Zynq Ultrascale+ Devices Schematic Review Checklist (XTP427)
Training Resources
1. UltraFast Design Methodology Training Course
2. Vivado Design Suite QuickTake Video: UltraFast Vivado Design Methodology
3. Vivado Design Suite QuickTake Video: Vivado Design Flows Overview
4. Vivado Design Suite QuickTake Video: Targeting Zynq Using Vivado IP Integrator
5. Vivado Design Suite QuickTake Video: Partial Reconfiguration in Vivado Design Suite
6. Vivado Design Suite QuickTake Video: Creating Different Types of Projects
7. Vivado Design Suite QuickTake Video: Managing Sources With Projects
8. Vivado Design Suite QuickTake Video: Using Vivado Design Suite with Revision Control
9. Vivado Design Suite QuickTake Video: Managing Vivado IP Version Upgrades
10. Vivado Design Suite QuickTake Video: I/O Planning Overview
11. Vivado Design Suite QuickTake Video: Configuring and Managing Reusable IP in Vivado
12. Vivado Design Suite QuickTake Video: How To Use the "write_bitstream" Command in Vivado
13. Vivado Design Suite QuickTake Video: Design Analysis and Floorplanning
14. Vivado Design Suite QuickTake Video: Introducing the UltraFast Design Methodology
Checklist
15. Vivado Design Suite Video Tutorials
Revision History
The following table shows the revision history for this document.
Using the CLOCK_DEDICATED_ROUTE Constraint: Added information on vertically adjacent clock regions.
Assessing Post-Synthesis Quality of Results: Updated table.
Methodology DRCs with Impact on Timing Closure: Added TIMING-56 check description.
Methodology DRCs with Impact on Signoff Quality and Hardware Stability: Added TIMING-54 to 57 check descriptions.
Overconstraining the Design: Added overconstrain warning note.
Xilinx's limited warranty, please refer to Xilinx's Terms of Sale which can be viewed at
https://www.xilinx.com/legal.htm#tos; IP cores may be subject to warranty and support terms contained
in a license issued to you by Xilinx. Xilinx products are not designed or intended to be fail-safe or
for use in any application requiring fail-safe performance; you assume sole risk and liability for
use of Xilinx products in such critical applications, please refer to Xilinx's Terms of Sale which can
be viewed at https://www.xilinx.com/legal.htm#tos.
Copyright
© Copyright 2013–2022 Xilinx, Inc. Xilinx, the Xilinx logo, Alveo, Artix, Kintex, Kria, Spartan,
Versal, Vitis, Virtex, Vivado, Zynq, and other designated brands included herein are trademarks of
Xilinx in the United States and other countries. AMBA, AMBA Designer, Arm, ARM1176JZ-S,
CoreSight, Cortex, PrimeCell, Mali, and MPCore are trademarks of Arm Limited in the EU and
other countries. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission
by Khronos. PCI, PCIe, and PCI Express are trademarks of PCI-SIG and used under license.
MATLAB and Simulink are registered trademarks of The MathWorks, Inc. All other trademarks
are the property of their respective owners.