AMD SoC
Overview
Navigating Content by Design Process
Terminology
Introduction
Designing with the Core
Tandem Configuration
Overview
Supported Devices
Tandem + DFX
Enable the Tandem Configuration Solution
Deliver Programming Images to Silicon
Tandem Configuration Performance
Design Operation
Loading Tandem PCIe for Stage 2
Segmented Configuration
Known Issues and Limitations
QDMA Subsystem for CPM4
Overview
Product Specification
Design Flow Steps
Customizable Example Design (CED)
Debugging
Application Software Development
Upgrading
QDMA Subsystem for CPM5
Overview
Product Specification
Design Flow Steps
Customizable Example Design (CED)
Application Software Development
Debugging
Upgrading
AXI Bridge Subsystem for CPM4
Overview
Product Specification
Design Flow Steps
Debugging
Upgrading
AXI Bridge Subsystem for CPM5
Overview
Product Specification
Design Flow Steps
Debugging
Upgrading
XDMA Subsystem for CPM4
Overview
Overview
Navigating Content by Design Process
AMD Adaptive Computing documentation is organized around a set of standard design processes to
help you find relevant content for your current development task. You can access the AMD Versal™
adaptive SoC design processes on the Design Hubs page. You can also use the Design Flow
Assistant to better understand the design flows and find content that is specific to your intended
design needs. This document covers the following design processes:
CPM4
QDMA Subsystem
Register Space
Application Software Development
AXI Bridge Subsystem
Register Space
XDMA Subsystem
Register Space
Application Software Development
CPM5
QDMA Subsystem
Register Space
Application Software Development
AXI Bridge Subsystem
Register Space
CPM4
QDMA Subsystem: Lab1: QDMA AXI MM Interface to NoC and DDR
QDMA Subsystem: Lab2: QDMA AXI MM Interface to NoC and DDR with Mailbox
XDMA Subsystem: XDMA AXI MM Interface to NoC and DDR Lab
CPM5
QDMA Subsystem: QDMA AXI MM Interface to NoC and DDR Lab
Terminology
Table: Terminology in this Guide
AXI-ST: AXI4-Stream
Controller 1 QDMA or QDMA1: CPM PCIE controller 1 with QDMA (only CPM5 contains a hardened QDMA with Controller 1)
Introduction
Introduction to the CPM4
The integrated block for PCIe® Rev. 4.0 with DMA and CCIX Rev. 1.0 (CPM4) is shown in the
following figure.
CPM Components
The CPM includes multiple IP cores:
Use Modes
There are several use modes for DMA functionality in the CPM. You can select one of three options
for data transport from host to programmable logic (PL), or PL to host: QDMA, AXI Bridge, and XDMA.
To enable DMA transfers, customize the Control, Interfaces and Processing System (CIPS) IP core as
follows:
1. In the CPM4 Basic Configuration page, set the PCIE Controller 0 Mode to DMA.
2. Set the lane width value.
3. In the CPM4 PCIE Controller 0 Configuration page, set the PCIe Functional Mode for the desired
DMA transfer mode:
QDMA
AXI Bridge
XDMA
The following sections explain how you can further configure and use these different functional modes
for your application.
QDMA Subsystem
QDMA mode enables the use of PCIE Controller 0 with QDMA enabled. QDMA mode provides two
connectivity variants: AXI4-Stream and AXI4 Memory Mapped. Both variants can be enabled simultaneously.
AXI Streaming
QDMA Streaming mode can be used in applications where the nature of the data traffic is
streaming with source and destination IDs instead of a specific memory address location, such
as network accelerators, network quality of service managers, or firewalls.
The main difference between XDMA mode and QDMA mode is that while XDMA mode supports up to
four independent data streams, QDMA mode can support up to 2048 independent data streams.
Based on this strength, QDMA mode is typically used for applications that require many queues or
data streams that need to be virtually independent of each other. QDMA mode is the only DMA mode
that can support multiple functions, either physical functions or single root I/O virtualization (SR-IOV)
virtual functions.
QDMA mode can be used with AXI Bridge mode. For more details on AXI Bridge mode, see the AXI
Bridge Subsystem section.
AXI Bridge Subsystem
AXI Bridge mode enables you to interface the CPM4 PCIE Controller 0 with an AXI4 domain. This use
mode connects directly to the NoC, which allows communication with other peripherals within the
Processing System (PS) and in the Programmable Logic (PL).
AXI Bridge mode is typically used for light traffic data paths, such as writes to or reads from Control and
Status registers. AXI Bridge mode is also the only mode that can be configured for Root Port
applications, with the AXI4 interface used to connect to a processor, typically in the PS.
AXI bridge functionality is available in the following three modes:
1. For CPM4 Basic Configuration options, select DMA for PCIE controller 0 Mode.
2. In controller 0 Basic tab, set PCIE 0 Functional Mode to either XDMA or QDMA.
3. Set one or both of the following options:
In the PCIe BARs tab, select the BAR checkbox next to the AXI Bridge Master. This option
enables the master AXI Bridge interface within the IP, which you can use to receive write or
read transactions from a PCIe source device to AXI peripherals.
XDMA Subsystem
XDMA mode enables the use of PCIE Controller 0 with the XDMA enabled. XDMA mode provides two
connectivity variants: AXI Streaming and AXI Memory Mapped. Only one variant can be enabled at a
time.
AXI Streaming
XDMA Streaming mode can be used in applications where the nature of the data traffic is
streaming with source and destination IDs instead of a specific memory address location, such
as network accelerators, network quality of service managers, or firewalls.
XDMA mode can be used in conjunction with AXI Bridge mode. For more details on AXI Bridge mode,
see the AXI Bridge Subsystem.
✎ Note: The x16 Gen4 configuration is not available in the data path from the CPM directly to the PL. It is
supported only through the CPM AXI4 to NoC to PL datapath.
The IP configuration GUI shows the selectable link width options: x16, x8, and x4.
The PCIe specification requires devices to negotiate link width with the attached link partner during
link training. The IP is capable of training down to narrower link widths than the IP configuration set at
design time. For designs intending to use narrower x2 or x1 link widths, configure the IP as x4 and
connect only the bottom lane(s).
AXI Bridge functional mode features are supported when the AXI4 slave bridge is enabled in the
XDMA or QDMA use mode, or in the standalone Bridge use mode.
Supports Multiple Vector Messaged Signaled Interrupts (MSI), MSI-X interrupt, and Legacy
interrupt.
AXI4 Slave access to PCIe address space.
PCIe access to AXI4 Master.
Tracks and manages Transaction Layer Packets (TLPs) completion processing.
Detects and indicates error conditions with interrupts in Root Port mode.
Supports six PCIe 32-bit or three 64-bit PCIe Base Address Registers (BARs) as an Endpoint.
Supports up to two PCIe 32-bit or a single PCIe 64-bit BAR as Root Port.
Standards
The AMD Versal Adaptive SoC CPM DMA and Bridge Mode for PCI Express adheres to the following
standards:
Table: CPM4 Controller with QDMA, Bridge, or XDMA Hard IP Subsystem Maximum
Configurations (Versal Prime, Versal AI Core, Versal AI Edge)
Speed Grade -1 -1 -2 -2 -2 -3
1. Gen4x16 does not support AXI4 interfaces directly between CPM4 and the programmable
logic. This is supported only through NoC.
This AMD LogiCORE™ IP module is provided at no additional cost with the AMD Vivado™ Design
Suite under the terms of the End User License.
Information about other AMD LogiCORE™ IP modules is available at the Intellectual Property page.
For information about pricing and availability of other AMD LogiCORE IP modules and tools, contact
your local sales representative.
Introduction to the CPM5
The integrated block for PCIe® Rev. 5.0 with DMA and CCIX Rev. 1.0 (CPM5) is shown in the
following figure.
CPM Components
The CPM includes multiple IP cores:
PCIE Controller 0
Data transfer width can be x16, x8, x4, x2 or x1.
AXI4 data can only be transferred through NoC. From NoC, the data can be steered to
DDR or to the programmable logic.
PCIE Controller 1
Data transfer width can be x8, x4, x2 or x1 (Not x16).
AXI4 data can be transferred through NoC or directly to PL logic. This is possible by setting
the host profile programming. See Host Profile.
Use Modes
There are several use modes for QDMA functionality in the CPM. You can select one of the two
options for data transport from host to programmable logic (PL) or PL to host: QDMA or AXI Bridge.
To enable QDMA transfers, customize the Control, Interfaces and Processing System (CIPS) IP core
as follows:
DMA transfers can be initiated by PCIE Controller 0 or by PCIE Controller 1.
The following illustration shows the CPM5 PCIE Controller 0 selection. The same options apply to
Controller 1 QDMA. The CPM to PL option is available only for Controller 1.
1. In the CPM5 Basic Configuration page, set the PCIE Controller 0 Mode to DMA.
2. Set the lane width value.
3. In the CPM5 PCIE Controller 0 Configuration page, set the PCIe Functional Mode for the desired
DMA transfer mode:
QDMA
AXI Bridge
The following sections explain how to further configure and use these different functional modes for
your application.
QDMA Subsystem
QDMA mode enables the use of PCIE Controller 0 QDMA or PCIE Controller 1 QDMA. QDMA mode
provides two connectivity variants: AXI Streaming and AXI Memory Mapped. Both variants can be
enabled simultaneously.
AXI Streaming
QDMA Streaming mode can be used in applications where the nature of the data traffic is
streaming with source and destination IDs instead of a specific memory address location, such
as network accelerators, network quality of service managers, or firewalls.
AXI Bridge Subsystem
AXI Bridge mode enables you to interface the CPM5 PCIE Controller 0 or CPM5 PCIE Controller 1
with an AXI4 domain. This use mode connects directly to the NoC or to the PL (based on Controller
0 or Controller 1 selection), which allows communication with other peripherals within the Processing
System (PS) and in the Programmable Logic (PL).
AXI Bridge mode is typically used for light traffic data paths such as write to or read from Control and
Status registers. AXI Bridge mode is also the only mode that can be configured for Root Port
application with AXI4 interface used to interface with a processor, typically the PS.
AXI bridge functionality is available in the following two modes:
In the PCIe BARs tab, select the BAR checkbox next to the AXI Bridge Master. This option
enables the Master AXI interface within the IP, which you can use to receive write or read
transactions from a PCIe source device to AXI peripherals.
CPM5 has two PCIE controllers, and each controller has a hardened QDMA/AXI Bridge. Each
QDMA/AXI Bridge supports the following features:
AXI4-Stream Interfaces
AXI4 Memory Mapped Interfaces
64-bit PCIe addresses
48-bit AXI MM addresses
10-bit tag support with maximum outstanding 512 PCIe tags (requester and completer)
MSI-X Interrupt type supported
Descriptor input interface
Descriptor output interface
Root Port support (in AXI Bridge Mode)
Endpoint support
4096 Descriptor rings
4096 CMPT rings
Programmable ring sizes for descriptor and CMPT rings
Per queue PASID
4096 functions (Up to 4 physical functions (PFs) and 240 virtual functions (VFs))
Function level reset
Interrupt coalescing
A total of 8192 MSI-X vectors
AXI Bridge functional mode features are supported when the AXI4 slave bridge is enabled in QDMA
use mode or standalone AXI Bridge.
Standards
The CPM5 QDMA and Bridge Mode for PCI Express adheres to the following standards:
Table: CPM5 Controller with QDMA or Bridge Hard IP Subsystem Maximum Configurations
(Versal Premium, Versal HBM, Versal Prime, Versal AI Core, Versal AI Edge)
Speed Grade -1 -1 -2 -2 -2 -2 -2 -3
Voltage Grade L (0.70V) M (0.80V) L (0.70V) L (0.70V) M (0.80V) M (0.80V) H (0.88V) H (0.88V)
Gen1 (2.5 GT/s per lane) x16 x16 x16 x16 x16 x16 x16 x16
1. For information on requirements for supplying Overdrive voltages, see Versal Premium
Series Data Sheet: DC and AC Switching Characteristics (DS959) and Power Design
Manager User Guide (UG1556).
2. CPM5 PCIe Gen5 support is available only in Versal Premium, Versal HBM, and Versal AI
Core series.
This AMD LogiCORE™ IP module is provided at no additional cost with the AMD Vivado™ Design
Suite under the terms of the End User License.
Information about other AMD LogiCORE™ IP modules is available at the Intellectual Property page.
For information about pricing and availability of other AMD LogiCORE IP modules and tools, contact
your local sales representative.
Clocking
All user interface signals are timed with respect to the same clock (user_clk), which can have a
frequency of 62.5, 125, or 250 MHz depending on the configured link speed and width. Use user_clk
to interface with the CPM; within your user logic, any available clock can be used.
Each link partner device shares the same reference clock source. The following figures show a
system using a 100 MHz reference clock. Even if the device is part of an embedded system, if the
Figure: Open System Add-In Card Using 100 MHz Reference Clock
Resets
The fundamental resets for the CPM PCIe® controllers and associated GTs are perst0n and
perst1n. These resets should be provided as an input to the FPGA for both endpoint and root port
modes using the pins identified in GT Selection and Pin Planning for CPM4 and GT Selection and Pin
Planning for CPM5.
PERST# input to the IP is routed from one of the allowed PS/PMC MIO pins through a dedicated
logic into the CPM.
Selection of the PS/PMC MIO pins is made in the CIPS/PS Wizard; the selection is made in the
PS/PMC sub-core, not in the CPM sub-core.
The configurable example designs (CEDs) include default PERST# assignment(s), but it is a best
practice to check the CIPS/PS Wizard IP configuration GUI to ensure the pin assignment matches
the PERST# connectivity in the board schematic.
There are two ways to activate fundamental reset in Root Port mode:
To set an MIO pin as an input, configure it in the CIPS GUI as shown below:
The following pins are allowed as MIO input pins:
Dedicated pins for controller 0: PS MIO 18, PMC MIO 24, or PMC MIO 38.
Dedicated pins for controller 1: PS MIO 19, PMC MIO 25, or PMC MIO 39.
DMA Clock
QDMA and AXI Bridge run on a clock that is provided by the user. This is a change from CPM4,
where the IP provides the clock. You must provide a clock, dma<n>_intrfc_clk, that is used by the
IP. All input and output ports are driven or loaded using this clock. Because this is an independent
clock provided by the user, there are restrictions on the clock frequency based on the IP
configuration, listed below:
The input clock frequency (dma<n>_intrfc_clk and cpm_pl_axi<n>_clk) for Gen3x16 and Gen4x8
configurations is 250 MHz. For Gen4x16 and Gen5x8 configurations, the maximum input clock
frequency allowed is 433 MHz for a -3HP device. For other device speed grades, refer to the
corresponding device data sheet for the maximum frequency applicable to those devices.
For the QDMA1 AXI-MM interface, there are two more clock inputs that you must provide,
cpm_pl_axi0_clk and cpm_pl_axi1_clk.
Figure: Open System Add-In Card Using 100 MHz Reference Clock
Resets
The fundamental resets for the CPM PCIe® controllers and associated GTs are perst0n and
perst1n. The resets are driven by the I/O inside the PS.
PERST# input to the IP is routed from one of the allowed PS/PMC MIO pins through a dedicated
logic into the CPM.
Selection of the PS/PMC MIO pins is made in the CIPS/PS Wizard; the selection is made in the
PS/PMC sub-core, not in the CPM sub-core.
The configurable example designs (CEDs) include default PERST# assignment(s), but it is a best
practice to check the CIPS/PS Wizard IP configuration GUI to ensure the pin assignment matches
the PERST# connectivity in the board schematic.
In addition, there is a power-on-reset for CPM driven by the platform management controller (PMC).
When both PS and the power-on reset from PMC are released, CPM PCIE controllers and the
associated GTs come out of reset.
After the reset is released, the core attempts to link train and resumes normal operation.
Signals dma<n>_intrfc_clk and dma<n>_intrfc_resetn are both inputs to the CPM5 IP. CPM5
interface logic is cleared by the dma<n>_intrfc_resetn signal that you provide. These reset
signals are active-Low and should be held in the reset state (1'b0) until the input clock
dma<n>_intrfc_clk is stable. Once the clock is stable, you can deassert the dma<n>_intrfc_resetn
signal.
To activate fundamental reset in a Root Port mode, see Reset in a Root Port Mode.
The CPM offers several main data interfaces, depending on the CPM subsystem functional mode in
use. The following table shows the available data interfaces to be used as the primary data transfer
interface for each functional mode.
Table: Available Data Interface for Each CPM Subsystem Functional Mode
Functional Mode
CPM_PCIE_NOC_0/1 NOC_CPM_PCIE_0 CPM_PL_AXI_0/1 AXI4 ST C2H/H2C
1. CPM_PCIE_NOC_0/1: These interfaces are for AXI4-MM traffic which is mastered from
within the CPM and exits to the NoC towards DDRMC/PL connections. Examples of such
masters in the CPM include the CPM integrated DMA and the CPM integrated bridge master.
2. NOC_CPM_PCIE_0: This interface is for AXI4-MM traffic which is mastered from an internal
PS or PL connections and exits from the NoC towards CPM. Examples of such slaves in the
CPM include the CPM integrated bridge slave.
3. CPM_PL_AXI_0/1: These interfaces are for AXI4-MM traffic which is mastered from within the
CPM and exits to the PL directly. Examples of such masters in the CPM include the CPM
integrated DMA and the CPM integrated bridge master. These interfaces are only available to
CPM5 controller and DMA/Bridge instance 1 (not available to instance 0).
4. AXI4 ST C2H/H2C: This interface is for inbound and outbound AXI4-ST traffic for the CPM
integrated DMA.
✎ Note: Certain data interfaces are unavailable based on the selected feature set for that particular
functional mode. For more details on these restrictions, refer to the port description in the associated
Load balance data transfer by allocating half of the enabled DMA queues or DMA channels to
interface #0, and the other half to interface #1.
Share the available PCIe link bandwidth among different types of transfers. DMA streaming uses
AXI4 ST C2H/H2C interface while DMA Memory Mapped uses CPM_PCIE_NOC or
CPM_PL_AXI interfaces.
AXI Bridge functional mode alone might not be able to sustain full PCIe link bandwidth in some link
and device configurations because only one data interface is available per bridge instance.
Therefore, the AXI Bridge functional mode is restricted to control and status accesses only; it is not
intended to be used as a primary data mover. However, it can be paired with a DMA functional mode
to make use of the remaining bandwidth. This functional mode has a variety of applications, including
but not limited to root complex (RC) memory bridging and add-in card peer-to-peer (P2P) operation.
P2P use cases are complex with respect to the achievable bandwidth depending on many factors
including but not limited to CPM DMA/Bridge bandwidth capabilities, whether DMA or Bridge is active
depending on the initiator of the P2P operation, the peer capability, and the capability of any
intervening switch component or root complex integrated switch.
You must also analyze the potential for head-of-line blocking and the request and response buffer
sizes for each interface, and ensure that data transfers initiated within a system do not cause cyclic
dependencies between interfaces or between different transfers. The PCIe and AXI specifications
define data types, IDs, and request/response ordering requirements, and the CPM upholds those
requirements. For more details on the CPM_PCIe_NOC and NOC_CPM_PCIe interfaces, refer to Versal
Adaptive SoC Programmable Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide.
Tandem Configuration
Overview
PCI Express® is a plug-and-play protocol, meaning that at power-up the PCIe® host enumerates the
system. This process consists of the host enumerating PCIe devices and assigning them base
addresses. As such, PCIe interfaces must be ready when the host queries them or they do not get
assigned a base address. The PCI Express specification states that PERST# can deassert 100 ms
after the power good of the systems has occurred, and a PCI Express port must be ready to link train
20 ms after PERST# has deasserted. This is commonly referred to as the 100 ms boot time
requirement, even though 120 ms is the fundamental goal.
AMD devices can meet this 120 ms link training requirement by using Tandem Configuration, a
solution that splits the programming image into two stages. The first stage quickly configures the PCIe
endpoint(s) so the endpoint is ready for link training within 120 ms. The second stage then configures
the rest of the device. Two variants are supported:
Tandem PROM
Loads both stages from a single programming image from a standard primary boot interface.
Tandem PCIe
Loads the first stage from a primary boot interface, then the second stage is delivered via one of
the CPM PCIE controllers.
Within the AMD Versal™ device, the CPM consists of two PCIE controllers, DMA features, CCIX features,
and network-on-chip (NoC) integration. This enables direct access to the two high-performance,
independently customizable PCIE controllers. You can select to have one or both of these controllers
enumerate within 120 ms using Tandem Configuration.
✎ Note: While Tandem Configuration is designed to meet the 120 ms link training goal, additional
considerations are required. Configuration memory device selection is a key factor, as some options
(such as OSPI or QSPI) are much faster than others (such as SD or eMMC). Secure boot features
such as encryption and authentication increase the stage 1 loading time, putting the goal of 120 ms at
risk. Use the Versal Adaptive SoC - Boot Time Estimator to calculate the time expected to complete
the stage 1 load.
While the term Tandem Configuration is carried forward from prior iterations of this technology applied
in 7 series, AMD UltraScale™ and AMD UltraScale+™ silicon, the solution is fundamentally different
in an AMD Versal device given how the PCIE controllers are built and configured. No programmable
logic is needed to activate an endpoint, so only CPM, XPIPE or CPIPE, NoC, and GTY or GTYP
resources are included along with the PMC in the first stage.
‼ Important: For certain devices, only production AMD Versal silicon is supported for Tandem
Configuration. Do NOT use this solution on engineering silicon (ES1) for VC1902, VC1802, or VM1802
devices.
Supported Devices
The Integrated Block for PCIe with DMA and CCIX and AMD Vivado™ tool flow support
implementations targeting AMD reference boards and specific part/package combinations.
Tandem Configuration is available as a production solution for monolithic Versal devices, but only
those with CPM resources. Tandem Configuration supports the devices described in the following
table:
Device Description
Device image generation is disabled by default for all ES silicon. Engineering Silicon devices might
not be tested through software and/or hardware and Tandem PDI generation is gated by a parameter.
Any device not listed in this table is either currently unsupported, or does not contain a CPM site
necessary to enable support.
Tandem + DFX
The Dynamic Function eXchange (DFX) feature is supported by much of the AMD silicon portfolio and
Vivado Design Suite that allows for the reconfiguration of various modules within an active device. It
gives system architects the flexibility to switch a portion of the design in and out depending on the
system requirements, removing the need to multiplex multiple functions in a larger device, which
saves on part cost, power and improves system up time. Taking advantage of the PCIe link with CPM
for delivery of reconfigurable partition bitstream data to the PMC allows for high throughput and
Tandem PROM
Tandem PCIe
None
Tandem PROM is the simpler mode for Tandem Configuration, where both stages reside in a single
programming image. If 120 ms enumeration is required, the selection of this option essentially comes
for free, as there is no overhead in design complexity and no requirement for programmable logic to
build. The programming order simply starts with the CPM and other necessary elements before moving
on to the rest of the device.
Because Tandem PCIe uses the PCIe link to program the stage 2 portion of the design, the design
must include connectivity from the enabled CPM Master(s) to the PMC Slave. This should be
accomplished through the block design connectivity. When PCIE Controller 0 is set to DMA, this CPM
interface is set automatically, along with appropriate mapping of the slave into the CPM master address
space(s). This includes enabling the CPM to NoC 0 Interface by checking the appropriate box on the
CPM Basic Configuration customization page. The specific aperture within the PMC slave that must
be accessible from the host is the Slave Boot Interface (SBI), which is available at AMD Versal device
address 0x102100000.
Figure: CPM Master to PMC Slave Connection for Loading Tandem PCIe Stage 2 to SBI
To deliver stage 2 images using MCAP VSEC, see Versal Adaptive SoC CPM Mode for PCI Express
Product Guide (PG346).
✎ Note: If these interfaces are not used, tie the corresponding ready signals to 1. The
dma0_mgmt_cpl_rdy, dma0_st_rx_msg_tready, and dma_tm_dsc_sts_rdy signals must be tied to 1 if not
used.
To deliver stage 2 images using PCIe DMA, the DMA BAR must be set to BAR0. The driver probes
BAR0 to find the DMA BAR. If this BAR maps to the PL, the transaction does not complete, because the
PL is not yet configured.
Confirmation that Vivado parameters and Tandem Configuration in general have been applied can be
seen in the log when write_device_image is run. The following is a snippet of the log for a Tandem
PROM run during the write_device_image step:
INFO: [Designutils 12-2358] Enabled Tandem boot bitstream.
Creating bitmap...
INFO: [Bitstream 40-812] Reading NPI Startup sequence definitions
INFO: [Bitstream 40-811] Reading NPI Shutdown sequence definitions
INFO: [Bitstream 40-810] Reading NPI Preconfig sequence definitions
Creating bitstream...
Tandem stage1 bitstream contains 1243712 bits.
Writing NPI partition ./Versal_CPM_Tandem_tandem1.rnpi...
Creating bitstream...
Tandem stage2 bitstream contains 15633728 bits.
Writing CDO partition ./Versal_CPM_Tandem_tandem2.rcdo...
Writing NPI partition ./Versal_CPM_Tandem_tandem2.rnpi...
Generating bif file Versal_CPM_Tandem_tandemPROM.bif for Tandem PROM.
The resulting run creates (in addition to the files mentioned above) a single .pdi image for this design
called Versal_CPM_Tandem.pdi.
When Tandem PCIe is enabled through CIPS IP customization, two .pdi files are generated:
<design>_tandem1.pdi
This file should be added to the device configuration flash.
<design>_tandem2.pdi
This file should be programmed into the device through the PCIe link once it becomes active.
In addition to the files mentioned above, the resulting run creates two .pdi images for this design
called design_1_wrapper_tandem1.pdi and design_1_wrapper_tandem2.pdi. The _tandem1 and
_tandem2 suffixes are automatically added to differentiate the stages.
‼ Important: Stage 1 and stage 2 bitstreams must remain paired. While this is trivial for Tandem
PROM because both stages are stored in a single PDI image, this is a critical consideration for
Tandem PCIe. If any part of the design is modified such that a full recompilation is triggered, both
stage 1 and stage 2 images must be updated. Always update both stages when any change is made.
✎ Note: Tandem PDI generation for new devices is gated until a device reaches production status. A
parameter is available to build images for pre-production or ES silicon; contact support for information
and to confirm that there are no issues with the new device. Without the parameter,
write_device_image is expected to fail with the following error:
ERROR: [Vivado 12-4165] Tandem Configuration bitstream generation is not supported for part
<device>.
In an AMD UltraScale+ device, the Field Updates solution enables you to build reconfigurable stage 2
regions: you can not only pick a stage 2 image from a list of compatible images, but also
reconfigure that stage 2 area with another stage 2 image to provide dynamic field updates. In an AMD
Versal device, the solution is similar but not identical. The first part (the initial boot of a
device) can be supported in the future to allow you to lock a stage 1 image in a small local boot flash;
the second part (dynamic reconfiguration) requires a Tandem + DFX-based approach to allow for
dynamic reconfiguration of a subsection of the PL.
For test and debug purposes, the HD.TANDEM_BITSTREAMS property can be set on the implemented
design before .pdi file generation to separate a single Tandem PROM .pdi file into separate
tandem1.pdi and tandem2.pdi files.
Similarly, Tandem PROM or Tandem PCIe file generation can be disabled entirely by setting
the HD.TANDEM_BITSTREAMS property on the implemented design before .pdi file generation.
The following command can be used to do this.
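A minimal Tcl sketch of applying this property is shown below. The property name comes from this section; the value is left as a placeholder because the supported values are not listed here and should be confirmed in the Vivado documentation for your release.

```tcl
# Hypothetical sketch: apply HD.TANDEM_BITSTREAMS to the implemented design
# before generating the device image. <value> is a placeholder -- confirm
# the supported property values for your Vivado release.
open_run impl_1
set_property HD.TANDEM_BITSTREAMS <value> [current_design]
write_device_image
```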
✎ Note: AMD provides sample drivers and software to enable stage 2 programming. These drivers
can be found at https://github.com/Xilinx/dma_ip_drivers.
For more information regarding Versal configuration and boot, see Versal Adaptive SoC System
Software Developers Guide (UG1304).
Design Operation
Though the CPM is a hardened integrated block, many features and options that can be selected
during CIPS configuration require implementation in programmable logic (PL). Any part of the
design implemented in the PL is configured in stage 2. Design configurations that
require PL resources during PCIe enumeration should not be used with Tandem PROM or Tandem
PCIe. Specifically, the PCIe extended capability interface should not be enabled for Tandem modes,
because these registers are addressed during enumeration and are not present in the stage 1 portion
of the design. Moreover, any other resource in the Versal device, such as the R5 or A72 processors in
the Scalar Engines, is programmed after the CPM and its PCI Express endpoint(s). While future
enhancements to the Tandem Configuration solution may open opportunities to quickly boot other
dedicated parts of a target device, the current solution focuses exclusively on PCI Express endpoints
in the CPM for the sole purpose of meeting the 100 ms boot time requirement.
The Tandem PCIe and DFX features operate on the same datapaths for this discussion: both use the
PCIe link to deliver bitstream data to the slave boot interface (SBI) buffer, which is read by a PMC
DMA block and delivered to the platform processing unit (PPU) for processing and delivery to the
configurable device resources being programmed. The SBI buffer is an 8 KB FIFO, and any write to
the aperture occurs in order, regardless of the target address. The mechanism for delivery through
the CPM varies depending on the chosen methodology, but all methods have specific hardware design
requirements and accompanying software and driver components.
✎ Note: It is possible to use the CPM in streaming-only mode with the RQ/RC/CQ/CC
interfaces to the PL and deliver reconfigurable partitions to the SBI, but this requires a user-written
soft DMA or bridge IP and custom software and drivers. It is recommended to use the MCAP VSEC
instead for ease of use, unless higher throughput is required. The MCAP VSEC can also be used with
a user DMA or bridge to load a Tandem PCIe stage 2 bitstream because the path to the SBI does not
require PL logic to be present.
For details on configuring a design in the Vivado Design Suite to support using MCAP VSEC or DMA
transfers for Tandem PCIe, refer to Enable the Tandem Configuration Solution. The requirements to
load a reconfigurable partition are the same as what’s described for Tandem PCIe since the datapaths
are the same. To configure a design for DFX and generate partial bitstreams for reconfigurable
modules, refer to Vivado Design Suite User Guide: Dynamic Function eXchange (UG909).
The open-source, AMD-provided drivers and user space applications for the MCAP VSEC, QDMA,
and XDMA IPs can be cloned or downloaded from https://github.com/Xilinx/dma_ip_drivers. The
repository also contains extensive documentation on each of the drivers, with links to external pages.
The following sections show examples of the required commands, assuming a compatible bitstream
has already been loaded to the device, the PCIe link is up, and the stage 2 or partial bitstream(s) are
ready to be delivered to the device. For the examples in the following sections, assume the
Bus:Device.Function of the PCIe device is 01:00.0.
✎ Note: Before loading a DFX PDI using the QDMA/XDMA driver or MCAP, you need to enable the
SBI data path by writing 0x29 to the SBI control register (0xF1220004).
QDMA
MCAP VSEC
‼ Important: The MCAP VSEC can only natively address the lower 4 GB of the address map, because it
can only issue 32-bit address transactions. To reach the SBI buffer address, the NoC NMU address
remapping capability must be employed; the following recommended command shows an example of
remapping from the 32-bit address space to the 48-bit address space for the SBI buffer.
A critical detail when using Tandem Configuration, specifically Tandem PCIe, is to ensure that stage
1 and stage 2 PDI images remain paired. The fundamental solution splits a single implemented
design image into two sections, and there is no mechanism to guarantee either section will pair
successfully with a section from an independent design. It is the designer's responsibility to keep the
two stages in sync. AMD Versal™ adaptive SoC devices help with this effort through safeguards in
the form of unique identifiers (UIDs) that are checked by the PLM when stage 2 programming images
are delivered. Four 32-bit fields are embedded in the PDI as described in the following table:
...
--------------------------------------------------------------------------------
IMAGE HEADER (pl_cfi)
--------------------------------------------------------------------------------
pht_offset (0x00) : 0x00014a24 section_count (0x04) : 0x00000001
mHdr_revoke_id (0x08) : 0x00000000 attributes (0x0c) : 0x00001800
name (0x10) : pl_cfi
id (0x18) : 0x18700000 unique_id (0x24) : 0x738c16a7
parent_unique_id (0x28) : 0x00000000 function_id (0x2c) : 0x00000000
memcpy_address_lo (0x30) : 0x00000000 memcpy_address_hi (0x34) : 0x00000000
checksum (0x3c) : 0x10a2b15d
attribute list -
...
--------------------------------------------------------------------------------
IMAGE HEADER (pl_cfi)
--------------------------------------------------------------------------------
pht_offset (0x00) : 0x00000034 section_count (0x04) : 0x00000002
mHdr_revoke_id (0x08) : 0x00000000 attributes (0x0c) : 0x00001800
name (0x10) : pl_cfi
id (0x18) : 0x18700001 unique_id (0x24) : 0xf278b7b9
parent_unique_id (0x28) : 0x738c16a7 function_id (0x2c) : 0x00000000
memcpy_address_lo (0x30) : 0x00000000 memcpy_address_hi (0x34) : 0x00000000
checksum (0x3c) : 0x1e2b4392
attribute list -
owner [plm] memcpy [no]
load [now] handoff [now]
dependentPowerDomains [spd][pld]
Note how the Unique ID of the stage 1 output matches the Parent ID of the stage 2 output. These
PDIs are from the same design and are therefore compatible.
This data can also be found via cdoutil by grepping for the string "ldr_set_image_info" in a Tandem or
DFX PDI image. The four UID fields are always listed in the same order: Node ID, Unique ID, Parent
ID, and Function ID. The stage 1 PDI has the UID information for the stage 1 domain (clearly
identified by the Parent ID, the third field, set to 0) followed by the stage 2, whereas the stage 2 PDI
only shows the UID information for the PL portion of the design, unless DFX is also enabled. If DFX is
enabled, the stage 2 PDI lists all RMs in the design, each with an incrementing Node ID and a Parent
ID pointing back to stage 2.
Example without DFX:
During the stage 2 image load, the PLM examines these contents and confirms that the identifiers
match. The Parent ID of the stage 2 must match the Unique ID of the stage 1 image already loaded in
the Versal device. If they do not, a failure is reported and the stage 2 PDI is rejected before it is
programmed, allowing you to take corrective action. Following are example logs for passing and
failing conditions:
Passing example:
Failing example:
A set of example designs is hosted on GitHub in the XilinxCEDStore repository. The repository can
be accessed through AMD Vivado, where the store can be refreshed with a valid internet connection
and includes an AMD Versal device CPM Tandem PCIe and DFX example design. You can also
download or clone the GitHub repository to your local machine and point Vivado to that location on
your PC. The example design can be generated for CPM4 (VCK190) and CPM5 (VPK120) targets.
The following diagram demonstrates the Tandem PCIe and DFX features:
5. In the Flow Navigator, click Generate Device Image to run synthesis and implementation and to
generate a programmable device image (.pdi) file that can be loaded to the target Versal device.
Instructions on how to download, install, and use the DMA drivers for this design are here.
Segmented Configuration
Segmented Configuration is a solution that enables designers to boot the processors in a Versal
device and access DDR memory before the programmable logic (PL) is configured. This allows DDR-
based software such as Linux to boot first, followed by the PL, which can be configured later if needed
via any primary or secondary boot device or through a DDR image store. The Segmented
Configuration feature is intended to give the Versal boot sequence flexibility to configure the PL
similar to what is possible with AMD Zynq™ UltraScale+™ MPSoCs.
This solution uses a standard Vivado tool flow through implementation; the only additional annotation
required is the identification of NoC path segments to be included in the initial boot image, which
occurs automatically after the project property enabling the feature is set. Programming image
generation (write_device_image) automatically splits the programming images into two PDI files to be
stored and delivered separately. The entire PL is dynamic and can be completely reloaded while any
operating system and DDR memory access remain active.
Segmentation of Versal adaptive SoC programming images allows the processing domain, which
includes the CPM, to be available much more quickly than with a monolithic programming solution.
The difference between Tandem Configuration and Segmented Configuration is where the split is
done. Tandem includes only the elements necessary for link training in stage 1 (CPM, GTY, and
PMC). Segmented includes everything except the programmable logic (PL) domain and the NoC
resources within that domain.
✎ Note: In the Vivado 2024.2 release, select one feature or the other. Selecting both Tandem and
Segmented options results in an error during write_device_image.
To load PLD (Segmented) or partial (DFX) PDI images over the CPM QDMA interface, PCIe must be
declared as a secondary boot interface. In the boot.bif generated by the Vivado flow (found in the
implementation runs directory), add a single line: insert boot_device { pcie } after line 5 (id = 0x2).
This is managed automatically when Tandem PCIe is selected, but must be declared explicitly when
using Segmented Configuration or DFX.
Figure: boot.bif
For more information on Segmented Configuration, including design requirements and a tutorial walk-
through, see the Segmented Configuration tutorial available on the AMD Vivado GitHub repository.
QDMA Architecture
DMA Engines
The Host to Card (H2C) and Card to Host (C2H) descriptors are fetched by the Descriptor Engine in
one of two modes: Internal mode or Descriptor bypass mode. The descriptor engine maintains per-
queue contexts where it tracks the software (SW) producer index pointer (PIDX), consumer index
pointer (CIDX), base address of the queue (BADDR), and queue configuration for each queue. The
descriptor engine uses a round robin algorithm for fetching descriptors. The descriptor engine has
separate buffers for H2C and C2H queues, and ensures it never fetches more descriptors than it has
space available. The descriptor engine has only one DMA read outstanding per queue at a time and
can read as many descriptors as fit in an MRRS. The descriptor engine is responsible for reordering
out-of-order completions and ensures that descriptors for each queue are always delivered in order.
The descriptor bypass can be enabled on a per-queue basis; the fetched descriptors, after buffering,
are sent to the respective bypass output interface instead of directly to the H2C or C2H engine. In
internal mode, based on the context settings, the descriptors are sent to the H2C memory mapped
(MM), C2H MM, H2C Stream, or C2H Stream engine.
The descriptor engine is also responsible for generating the status descriptor for the completion of the
DMA operations. With the exception of C2H Stream mode, all modes use this mechanism to convey
completion of each DMA operation so that software can reclaim descriptors and free up any
associated buffers. This is indicated by the CIDX field of the status descriptor.
✎ Recommended: If a queue is associated with interrupt aggregation, AMD recommends that the
status descriptor be turned off, and instead the DMA status be received from the interrupt aggregation
ring.
To limit the number of fetched descriptors (for example, to limit the amount of buffering required to
store them), credit throttling can be enabled on a per-queue basis. In this mode, the descriptor engine
fetches descriptors only up to the available credit, so the total number of descriptors fetched per
queue is limited to the credit provided. The user logic returns credit through the dsc_crdt interface;
credit is granted at the granularity of one descriptor.
To help a user-developed traffic manager prioritize the workload, the number of descriptors available
to fetch (the incremental PIDX value) is sent to the user logic on the tm_dsc_sts interface with each
PIDX update. Using this interface, it is possible to implement a design that prioritizes and optimizes
descriptor storage.
H2C MM Engine
The H2C MM Engine moves data from the host memory to card memory through the H2C AXI-MM
interface. The engine generates reads on PCIe, splitting descriptors into multiple read requests based
on the MRRS and the requirement that PCIe reads do not cross 4 KB boundaries. Once completion
data for a read request is received, an AXI write is generated on the H2C AXI-MM interface. For
source and destination addresses that are not aligned, the hardware will shift the data and split writes
on AXI-MM to prevent 4 KB boundary crossing. Each completed descriptor is checked to determine
whether a writeback and/or interrupt is required.
For Internal mode, the descriptor engine delivers memory mapped descriptors straight to the H2C MM
engine. The user logic can also inject the descriptor into the H2C descriptor bypass interface to move
data from host to card memory. This gives the ability to do interesting things such as mixing control
and DMA commands in the same queue. Control information can be sent to a control processor
indicating the completion of DMA operation.
C2H MM Engine
The C2H MM Engine moves data from card memory to host memory through the C2H AXI-MM
interface. The engine generates AXI reads on the C2H AXI-MM bus, splitting descriptors into multiple
requests based on 4 KB boundaries. Once completion data for the read request is received on the
AXI4 interface, a PCIe write is generated using the data from the AXI read as the contents of the
write. For source and destination addresses that are not aligned, the hardware will shift the data and
split writes on PCIe to obey Maximum Payload Size (MPS) and prevent 4 KB boundary crossings.
Each completed descriptor is checked to determine whether a writeback and/or interrupt is required.
For Internal mode, the descriptor engine delivers memory mapped descriptors straight to the C2H MM
engine. As with H2C MM Engine, the user logic can also inject the descriptor into the C2H descriptor
bypass interface to move data from card to host memory.
For multi-function configuration support, the PCIe function number information will be provided in the
aruser bits of the AXI-MM interface bus to help virtualization of card memory by the user logic. A
parity bus, separate from the data and user bus, is also provided for end-to-end parity support.
H2C Stream Engine
The H2C stream engine moves data from the host to the H2C Stream interface. For internal mode,
descriptors are delivered straight to the H2C stream engine; for a queue in bypass mode, the
descriptors can be reformatted and fed to the bypass input interface. The engine is responsible for
breaking up DMA reads to MRRS size, guaranteeing the space for completions, and also makes sure
completions are reordered to ensure H2C stream data is delivered to user logic in-order.
The engine has sufficient buffering for up to 256 descriptor reads and up to 32 KB of data. The DMA
fetches the data and aligns it to the first byte to be transferred on the AXI4 interface side, which
allows every descriptor to have an arbitrary offset and arbitrary length. The total length of all
descriptors put together must be less than 64 KB.
For internal mode queues, each descriptor defines a single AXI4-Stream packet to be transferred to
the H2C AXI-ST interface. A packet straddling multiple descriptors is not allowed, due to the lack of
per-queue storage. However, packets straddling multiple descriptors can be implemented using the
descriptor bypass mode. In this mode, the user logic waits until it has enough descriptors to form a
packet, then initiates the H2C DMA engine by delivering the straddled packet's descriptors, along
with other H2C ST packet descriptors, through the bypass interface, making sure they are not
interleaved. Also through the bypass interface, the user logic can control the generation of the status
descriptor.
C2H Stream Engine
The C2H streaming engine is responsible for receiving data from the user logic and writing to the Host
memory address provided by the C2H descriptor for a given Queue.
The C2H engine has two major blocks to accomplish C2H streaming DMA: the Descriptor Prefetch
Cache (PFCH) and the C2H-ST DMA Write Engine. The PFCH maintains a per-queue context,
programmed by software, to enhance the performance of its function.
The PFCH cache has three main modes, selectable on a per-queue basis: Simple Bypass Mode,
Internal Cache Mode, and Cached Bypass Mode.
Completion Engine
The Completion (CMPT) Engine is used to write to the completion queues. Although the Completion
Engine can be used with an AXI-MM interface and Stream DMA engines, the C2H Stream DMA
engine is designed to work closely with the Completion Engine. The Completion Engine can also be
used to pass immediate data to the Completion Ring. The Completion Engine can be used to write
Completions of up to 64B in the Completion ring. When used with a DMA engine, the completion is
used by the driver to determine how many bytes of data were transferred with every packet. This
allows the driver to reclaim the descriptors.
The Completion Engine maintains the Completion Context. This context is programmed by the Driver
and is maintained on a per-queue basis. The Completion Context stores information like the base
address of the Completion Ring, PIDX, CIDX and a number of aspects of the Completion Engine,
which can be controlled by setting the fields of the Completion Context.
The engine also can be configured on a per-queue basis to generate an interrupt or a completion
status update, or both, based on the needs of the software. If the interrupts for multiple queues are
aggregated into the interrupt aggregation ring, the status descriptor information is available in the
interrupt aggregation ring as well.
The CMPT Engine has a cache of up to 64 entries to coalesce the multiple smaller CMPT writes into
64B writes to improve the PCIe efficiency. At any time, completions can be simultaneously coalesced
for up to 64 queues. Beyond this, any additional queue that needs to write a CMPT entry will cause
the eviction of the least recently used queue from the cache. The depth of the cache used for this
purpose is configurable with possible values of 8, 16, 32, and 64.
Bridge Interfaces
The AXI MM Bridge Master interface is used for high bandwidth access to AXI Memory Mapped space
from the host. The interface supports up to 32 outstanding AXI reads and writes. One or more PCIe
BAR of any physical function (PF) or virtual function (VF) can be mapped to the AXI-MM bridge
master interface. This selection must be made prior to design compilation.
Virtual function group (VFG) refers to the VF group number, which is equivalent to the PF number
associated with the corresponding VF. VFG_OFFSET refers to the VF number with respect to a
particular PF; note that this is not the FIRST_VF_OFFSET of each PF.
For example, if both PF0 and PF1 have 8 VFs, the FIRST_VF_OFFSET values for PF0 and PF1 are
4 and 11, respectively.
Each host initiated access can be uniquely mapped to the 64 bit AXI address space through the PCIe
to AXI BAR translation.
Since all functions share the same AXI Master address space, a mechanism is needed to map
requests from different functions to a distinct address space on the AXI master side. An example
provided below shows how PCIe to AXI translation vector is used. Note that all VFs belonging to the
same PF share the same PCIe to AXI translation vector. Therefore, the AXI address space of each VF
is concatenated together. Use VFG_OFFSET to calculate the actual starting address of AXI for a
particular VF.
To summarize, the AXI master write or read address is determined from the PCIe to AXI BAR
translation vector, the VFG_OFFSET, and the address offset within the requested target space,
where pcie2axi_vec is the PCIe to AXI BAR translation (set when the IP core is configured from the
Vivado IP Catalog) and axi_offset is the address offset in the requested target space.
For each physical function, the PCIe configuration space consists of a set of six 32-bit memory BARs
and one 32-bit Expansion ROM BAR. When SR-IOV is enabled, an additional six 32-bit BARs are
enabled for each virtual function. These BARs provide address translation to the AXI4 memory
mapped space, interface routing, and AXI4 request attribute configuration. Any pair of BARs can be
configured as a single 64-bit BAR. Each BAR can be configured to route its requests to the QDMA
register space or to the AXI MM bridge master interface.
AxCache[1] is set to 1 for modifiable, and 0 for non-modifiable. Selecting the AxCache box will
set AxCache[1] to 1.
The AXI-MM Bridge Slave interface is used for high bandwidth memory transfers between the user
logic and the Host. AXI to PCIe translation is supported through the AXI to PCIe BARs.
In the Bridge Slave interface, there are six BARs which can be configured as 32 bits or 64 bits. These
BARs provide address translation from an AXI address space to the PCIe address space. The
address translation is configured through the GUI; for more information, see BAR and Address
Translation.
Interrupt Module
The IRQ module aggregates interrupts from various sources. The interrupt sources are queue-based
interrupts, user interrupts and error interrupts.
Queue-based interrupts and user interrupts are allowed on PFs and VFs, but error interrupts are
allowed only on PFs. If the SR-IOV is not enabled, each PF has the choice of MSI-X or Legacy
Interrupts. With SR-IOV enabled, only MSI-X interrupts are supported across all functions.
MSI-X interrupt is enabled by default. The host system (Root Complex) enables one or more of the
interrupt types supported in hardware. If MSI-X is enabled, it takes precedence.
Up to eight interrupts per function are available. To allow many queues on a given function and each
to have interrupts, the QDMA offers a novel way of aggregating interrupts from multiple queues to a
single interrupt vector. In this way, all 2048 queues could in principle be mapped to a single interrupt
vector. QDMA offers 256 interrupt aggregation rings that can be flexibly allocated among the 256
available functions.
PCIe CQ/CC
The PCIe Completer Request (CQ)/Completer Completion (CC) modules receive and process TLP
requests from the remote PCIe agent. This interface to the PCIE Controller operates in address
aligned mode. The module uses the BAR information from the Integrated Block for PCIE Controller to
determine where the request should be forwarded. The possible destinations for these requests are:
Non-posted requests are expected to receive completions from the destination, which are forwarded
to the remote PCIe agent. For further details, see the Versal Adaptive SoC CPM Mode for PCI
Express Product Guide (PG346).
PCIe RQ/RC
The PCIe Requester Request (RQ)/Requester Completion (RC) interface generates PCIe TLPs on the
RQ bus and processes PCIe Completion TLPs from the RC bus. This interface to the PCIE Controller
operates in DWord aligned mode. With a 512-bit interface, straddling is enabled.
PCIe Configuration
Several factors can throttle outgoing non-posted transactions. Outgoing non-posted transactions are
throttled based on flow control information from the PCIE Controller to prevent head of line blocking of
posted requests. The DMA will meter non-posted transactions based on the PCIe Receive FIFO
space.
The multi-queue DMA engine of the QDMA uses the RDMA-model queue pair approach to allow
RNIC implementation in the user logic. Each queue set consists of a Host to Card (H2C) queue, a
Card to Host (C2H) queue, and a C2H Stream Completion (CMPT) queue. The elements of each
queue are descriptors.
H2C and C2H are always written by the driver/software; hardware always reads from these queues.
H2C carries the descriptors for the DMA read operations from Host. C2H carries the descriptors for
the DMA write operations to the Host.
In internal mode, H2C descriptors carry address and length information and are called gather
descriptors. They support 32 bits of metadata that can be passed from software to hardware along
with every descriptor. The descriptor can be memory mapped (where it carries host address, card
address, and length of DMA transfer) or streaming (only host address, and length of DMA transfer)
based on context settings. Through descriptor bypass, an arbitrary descriptor format can be defined,
where software can pass immediate data and/or additional metadata along with packet.
C2H queue memory mapped descriptors include the card address, the host address and the length.
In streaming internal cached mode, descriptors carry only the host address. The buffer size of the
descriptor, which is programmed by the driver, is expected to be of fixed size for the whole queue.
Actual data transferred associated with each descriptor does not need to be the full length of the
buffer size.
The software advertises valid descriptors for H2C and C2H queues by writing its producer index
(PIDX) to the hardware. The status descriptor is the last entry of the descriptor ring, except for a C2H
stream ring. The status descriptor carries the consumer index (CIDX) of the hardware so that the
driver knows when to reclaim the descriptor and deallocate the buffers in the host.
For the C2H stream mode, C2H descriptors will be reclaimed based on the CMPT queue entry.
Typically, this carries one entry per C2H packet, indicating one or more C2H descriptors is consumed.
The CMPT queue entry carries enough information for software to claim all the descriptors consumed.
Through external logic, this can be extended to carry other kinds of completions or information to the
host.
CMPT entries written by the hardware to the ring can be detected by the driver using either the color
bit in the descriptor or the status descriptor at the end of the CMPT ring. Each CMPT entry can carry
metadata for a C2H stream packet and can also serve as a custom completion or immediate
notification for the user application.
The base address of all ring buffers (H2C, C2H, and CMPT) should be aligned to a 4 KB address.
The software can program 16 different ring sizes. The ring size for each queue can be selected
through context programming. The last queue entry holds the status descriptor, so the number of
allowable entries is (queue size - 1).
For example, if the queue size is 8, containing entry indices 0 to 7, the last entry (index 7) is
reserved for status. This index should never be used for a PIDX update, and a PIDX update should
never make PIDX equal to CIDX. For this case, if CIDX is 0, the maximum PIDX update would be 6.
In the same example, if traffic has already started and the CIDX is 4, the maximum PIDX update is 3.
H2C/C2H queues are rings located in host memory. For both types of queues, the producer is
software and the consumer is the descriptor engine. The software maintains its producer index
(PIDX) and a copy of the hardware consumer index (HW CIDX) to avoid overwriting unread
descriptors. The descriptor engine likewise maintains its consumer index (CIDX) and a copy of the
SW PIDX to make sure the engine does not read unwritten descriptors. The last entry in the queue is
dedicated to the status descriptor, where the engine writes the HW CIDX and other status.
The engine maintains a total of 2048 H2C and 2048 C2H contexts in local memory. The context
stores properties of the queue, such as base address (BADDR), SW PIDX, CIDX, and depth of the
queue.
The figure above shows the H2C and C2H fetch operation.
1. For H2C, the driver writes payload into host buffer, forms the H2C descriptor with the payload
buffer information and puts it into H2C queue at the PIDX location. For C2H, the driver forms the
descriptor with available buffer space reserved to receive the packet write from the DMA.
2. The driver sends the posted write to PIDX register in the descriptor engine for the associated
Queue ID (QID) with its current PIDX value.
3. Upon reception of the PIDX update, the engine calculates the absolute QID of the pointer update
based on address offset and function ID. Using the QID, the engine will fetch the context for the
absolute QID from the memory associated with the QDMA.
4. The engine determines the number of descriptors that are allowed to be fetched based on the
context. The engine calculates the descriptor address using the base address (BADDR), CIDX,
and descriptor size, and the engine issues the DMA read request.
5. After the descriptor engine receives the read completion from the host memory, the descriptor
engine delivers them to the H2C Engine or C2H Engine in internal mode. In case of bypass, the
descriptors are sent out to the associated descriptor bypass output interface.
6. For memory mapped or H2C stream queues programmed as internal mode, after the fetched
descriptor is completely processed, the engine writes the CIDX value to the status descriptor.
For queues programmed as bypass mode, user logic controls the write back through bypass in
interface. The status descriptor could be moderated based on context settings. C2H stream
queues always use the CMPT ring for the completions.
For C2H, the fetch operation is implicit through the CMPT ring.
Completion Queue
The Completion (CMPT) queue is a ring located in host memory. The consumer is software, and the
producer is the CMPT engine. The software maintains the consumer index (CIDX) and a copy of the
hardware producer index (HW PIDX) to avoid reading unwritten completions. The CMPT engine also
maintains the PIDX and a copy of the software consumer index (SW CIDX) to make sure that the
engine does not overwrite completions that the software has not yet processed.
C2H stream is expected to use the CMPT queue for completions to the host, but it can also be used
for other types of completions or for sending messages to the driver. A message through the CMPT is
guaranteed to not bypass the corresponding C2H stream packet DMA.
The simple flow of DMA CMPT queue operation, with respect to the numbering in the figure, follows:
1. The CMPT engine receives the completion message through the CMPT interface, but the QID
for the completion message comes from the C2H stream interface. The engine reads the QID
index of CMPT context RAM.
2. The DMA writes the CMPT entry to address BASE+PIDX.
3. If all conditions are met, optionally writes PIDX to the status descriptor of the CMPT queue with
color bit.
4. If interrupt mode is enabled, the CMPT engine generates the interrupt event message to the
interrupt module.
5. The driver can be in polling or interrupt mode. Either way, the driver identifies the new CMPT
entry either by matching the color bit or by comparing the PIDX value in the status descriptor
against its current software CIDX value.
6. The driver updates CIDX for that queue. This allows the hardware to reuse the descriptors again.
After the software finishes processing the CMPT, that is, before it stops polling or leaving the
interrupt handler, the driver issues a write to CIDX update register for the associated queue.
The QDMA provides an optional feature to support Single Root I/O Virtualization (SR-IOV). The PCI-
SIG® Single Root I/O Virtualization and Sharing (SR-IOV) specification (available from PCI-SIG
Specifications, www.pcisig.com/specifications) standardizes the method for bypassing VMM
involvement in datapath transactions and allows a single endpoint to appear as multiple separate
endpoints. SR-IOV classifies functions as:
Physical Functions (PF): Full featured PCIe® functions which include SR-IOV capabilities among
others.
Virtual Functions (VF): PCIe functions featuring configuration space with Base Address
Registers (BARs) but lacking the full configuration resources, and controlled by the PF
configuration. The main role of the VF is data transfer.
Apart from the PCIe defined configuration space, the QDMA Subsystem for PCI Express virtualizes
datapath operations, such as pointer updates for queues, and interrupts. The rest of the management
and configuration functionality is deferred to the physical function driver. Drivers that do not have
sufficient privilege must communicate with the privileged driver through the mailbox interface, which
is provided as part of the QDMA Subsystem for PCI Express.
Security is an important aspect of virtualization. The QDMA Subsystem for PCI Express offers the
following security functionality:
QDMA allows only privileged PF to configure the per queue context and registers. VFs inform the
corresponding PFs of any queue context programming.
Drivers are allowed to do pointer updates only for the queue allocated to them.
The system IOMMU can be turned on to check that the DMA access is being requested by PFs
or VFs. The ARID comes from queue context programmed by a privileged function.
Any PF or VF can communicate with a PF (other than itself) through the mailbox. Each function
implements one 128 B inbox and one 128 B outbox. These mailboxes are visible to the driver in the
DMA BAR (typically BAR0) of its own function. At any given time, each function can have one
outgoing and one incoming mailbox message outstanding.
The diagram below shows how a typical system can use QDMA with different functions and operating
systems. Different Queues can be allocated to different functions, and each function can transfer DMA
packets independent of each other.
Limitations
Applications
The QDMA is used in a broad range of networking, computing, and data storage applications. A
common usage example for the QDMA is to implement Data Center and Telco applications, such as
Compute acceleration, Smart NIC, NVMe, RDMA-enabled NIC (RNIC), server virtualization, and NFV
in the user logic. Multiple applications can be implemented to share the QDMA by assigning different
queue sets and PCIe functions to each application. These Queues can then be scaled in the user
logic to implement rate limiting, traffic priority, and custom work queue entry (WQE).
Product Specification
Table: Buffer Depths
C2H Descriptor Cache Depth | 512 | Total number of outstanding C2H stream descriptor fetches for cache bypass and internal mode. This cache depth is not relevant in simple bypass mode, where the user logic can implement a deeper descriptor cache.
C2H Payload FIFO Depth | 512 | Units of 64 B. Amount of C2H data that the C2H engine can buffer. This amount of buffering can sustain a host read latency of up to 2 us (512 × 4 ns). If the latency is more than 2 us, there can be performance degradation.
Desc Eng Reorder Buffer Depth | 512 | Units of 64 B. Amount of descriptor fetch data that can be stored to absorb host read latency.
H2C-ST Reorder Buffer Depth | 512 | Units of 64 B. Amount of H2C-ST data that can be stored to absorb host read latency.
QDMA Operations
Descriptor Engine
The descriptor engine is responsible for managing the consumer side of the Host to Card (H2C) and
Card to Host (C2H) descriptor ring buffers for each queue. The context for each queue determines
how the descriptor engine will process each queue individually. When descriptors are available and
other conditions are met, the descriptor engine will issue read requests to PCIe to fetch the descriptors.
Descriptor Context
The Descriptor Engine stores per-queue configuration, status, and control information in a descriptor context that can be stored in block RAM or UltraRAM; the context is indexed by the H2C or C2H QID.
Prior to enabling the queue, the hardware and credit context must first be cleared. After this is done,
the software context can be programmed and the qen bit can be set to enable the queue. After the
queue is enabled, the software context should only be updated through the direct mapped address
space to update the Producer Index and Interrupt ARM bit, unless the queue is being disabled. The
hardware context and credit context contain only status. It is only necessary to interact with the
hardware and credit contexts as part of queue initialization in order to clear them to all zeros. Once
the queue is enabled, context is dynamically updated by hardware. Any modification of the context
through the indirect bus when the queue is enabled can result in unexpected behavior. Reading the
context when the queue is enabled is not recommended as it can result in reduced performance.
The descriptor context is used by the descriptor engine. All descriptor rings must be aligned to a 4K address boundary.
The credit descriptor context is for internal DMA use only and can be read from the indirect bus for
debug. This context stores credits for each queue that have been received through the Descriptor
Credit Interface with the CREDIT_ADD operation. If the credit operation has the fence bit, credits are
added only as the read request for the descriptor is generated.
Descriptor Fetch
✎ Note: Available descriptors are always <ring size> - 2. At any time, the software should not update
the PIDX to more than <ring size> - 2.
For example, if queue size is 8, which contains the entry index 0 to 7, the last entry (index 7) is
reserved for status. This index should never be used for the PIDX update, and the PIDX update
should never be equal to CIDX. For this case, if CIDX is 0, the maximum PIDX update would be 6.
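The ring arithmetic above can be sketched as a small model (the helper names are hypothetical; only the modulo wrap and the reserved status entry are encoded):

```python
def max_pidx(ring_size: int, cidx: int) -> int:
    """Highest PIDX software may post for a given CIDX.

    One ring entry is reserved for the status descriptor, and PIDX must
    never catch up to CIDX, so at most ring_size - 2 descriptors can be
    outstanding at any time.
    """
    return (cidx - 2) % ring_size

def available(pidx: int, cidx: int, ring_size: int) -> int:
    """Number of descriptors the engine sees as available to fetch."""
    return (pidx - cidx) % ring_size
```

For the worked example above (ring size 8, CIDX 0), `max_pidx` returns 6, and posting PIDX = 6 makes all ring_size − 2 descriptors available.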
A queue can be configured to operate in Descriptor Bypass mode or Internal mode by setting the
software context bypass field. In internal mode, the queue requires no external user logic to handle
descriptors. Descriptors that are fetched by the descriptor engine are delivered directly to the
appropriate DMA engine and processed. Internal mode allows credit fetching and status updates to
the user logic for run time customization of the descriptor fetch behavior.
Status writebacks and/or interrupts are generated automatically by hardware based on the queue
context. When wbi_intvl_en is set, writebacks/interrupts will be sent based on the interval selected
in the register QDMA_GLBL_DSC_CFG.wb_intvl. Due to the slow nature of interrupts, in interval
mode, interrupts may be late or skip intervals. If the wbi_chk context bit is set, a writeback/interrupt
will be sent when the descriptor engine has detected that the last descriptor at the current PIDX has
completed. It is recommended the wbi_chk bit be set for all internal mode operation, including when
interval mode is enabled. An interrupt will not be generated until the irq_arm bit has been set by
software. Once an interrupt has been sent the irq_arm bit is cleared by hardware. Should an interrupt
be needed when the irq_arm bit is not set, the interrupt will be held in a pending state until the
irq_arm bit is set.
Descriptor completion is defined to be when the descriptor data transfer has completed and its write
data has been acknowledged on AXI (H2C bresp for AXI MM, Valid/Ready of ST), or been accepted
by the PCIE Controller’s transaction layer for transmission (C2H MM).
Descriptor Bypass mode also supports crediting and status updates to user logic. In addition,
Descriptor Bypass mode allows the user logic to customize processing of descriptors and status
updates. Descriptors fetched by the descriptor engine are delivered to user logic through the
descriptor bypass out interface. This allows user logic to pre-process or store the descriptors, if
desired. On the descriptor bypass out interface, the descriptors can be a custom format (adhering to
the descriptor size). To perform DMA operations, the user logic drives descriptors (must be QDMA
format) into the descriptor bypass input interface.
In bypass mode, the user logic has explicit control over status updates to the host, and marker
responses back to user logic. Along with each descriptor submitted to the Descriptor Bypass Input
Port for a Memory Mapped Engine (H2C and C2H) or H2C Stream DMA engine, there is a CIDX, and
sdi field. The CIDX is used to identify which descriptor has completed in any status update (host
writeback, marker response, or coalesced interrupt) generated at the completion of the descriptor. If the sdi field was set on the descriptor input, a writeback to the host is generated if the context wbk_en bit is set. An interrupt can also be sent for an sdi descriptor if the context irq_en and irq_arm bits are set.
If interrupts are enabled, the user logic must monitor the traffic manager output for the irq_arm bit. After the irq_arm bit has been observed for the queue, a descriptor with the sdi bit set can be submitted to generate the interrupt.
Marker Response
Marker responses can be generated for any descriptor by setting the mrkr_req bit. Marker responses
are generated after the descriptor is completed. Similar to host writebacks, excessive marker
response requests can reduce descriptor engine performance. Marker responses to the user logic can
also be sent with the sbi bit if configured in the context. The Marker responses are sent on Queue
Status ports which can be identified by the queue id.
The traffic manager interface provides details of a queue’s status to user logic, allowing user logic to
manage descriptor fetching and execution. In normal operation, for an enabled queue, each time the
irq_arm bit is asserted or PIDX of a queue is updated, the descriptor engine asserts
tm_dsc_sts_valid. The tm_dsc_sts_avl signal indicates the number of new descriptors available
since the last update. Through this mechanism, user logic can track the amount of work available for
each queue. This can be used for prioritizing fetches through the descriptor engine’s fetch crediting
mechanism or other user optimizations. On the valid cycle, the tm_dsc_sts_irq_arm indicates that
the irq_arm bit was zero and was set. In bypass mode, this is essentially a credit for an interrupt for
this queue. When a queue is invalidated by software or due to error, the tm_dsc_sts_qinv bit will be
set. If this bit is observed, the descriptor engine will have halted new descriptor fetches for that queue.
In this case, the contents on tm_dsc_sts_avl indicate the number of available fetch credits held by
the descriptor engine. This information can be used to help user logic reconcile the number of credits
given to the descriptor engine, and the number of descriptors it should expect to receive. Even after tm_dsc_sts_qinv is asserted, valid descriptors already in the fetch pipeline will continue to be delivered to the DMA engine (internal mode) or to the descriptor bypass output port (bypass mode).
Other fields of the tm_dsc_sts interface identify the queue id, DMA direction (H2C or C2H), internal
or bypass mode, stream or memory mapped mode, queue enable status, queue error status, and port
ID.
While the tm_dsc_sts interface is a valid/ready interface, it should not be back-pressured for optimal
performance. Since multiple events trigger a tm_dsc_sts cycle, if internal buffering is filled, descriptor
fetching will be halted to prevent generation of new events.
The credit interface is relevant when a queue’s fcrd_en context bit is set. It allows the user logic to
prioritize and meter descriptors fetched for each queue. You can specify the DMA direction, qid, and
credit value. For a typical use case, the descriptor engine uses credit inputs to fetch descriptors.
Internally, credits received and consumed are tracked for each queue. If credits are added when the
queue is not enabled, the credits will be returned through the Traffic Manager Output Interface with
tm_dsc_sts_qinv asserted, and the credits in tm_dsc_sts_avl are not valid. Monitor the tm_dsc_sts interface to keep a per-queue account of how many credits have been consumed.
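The reconciliation described above can be modeled with a toy per-queue ledger (a hypothetical class, not part of any QDMA driver API):

```python
class QueueCredits:
    """Toy per-queue credit ledger, assuming fcrd_en is set.

    Mirrors credits handed to the descriptor engine via the Descriptor
    Credit Interface, and reconciles them when the engine reports a queue
    invalidation (tm_dsc_sts_qinv) with the unused credit count carried
    in tm_dsc_sts_avl.
    """
    def __init__(self):
        self.given = 0     # credits pushed on the credit interface
        self.consumed = 0  # descriptors delivered back to user logic

    def credit_add(self, n):
        self.given += n

    def descriptor_delivered(self):
        self.consumed += 1

    def outstanding(self):
        return self.given - self.consumed

    def on_qinv(self, tm_dsc_sts_avl):
        # avl reports credits still held by the engine; anything else
        # outstanding is a descriptor still in flight, which will be
        # delivered before fetching halts.
        return self.outstanding() - tm_dsc_sts_avl
```

For example, after giving 4 credits and receiving 1 descriptor, a qinv reporting avl = 2 implies 1 descriptor still in the pipeline.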
Errors
Errors can potentially occur during both descriptor fetch and descriptor execution. In both cases, once
an error is detected for a queue it will invalidate the queue, log an error bit in the context, stop fetching
new descriptors for the queue which encountered the error, and can also log errors in status registers.
If enabled for writeback, interrupts, or marker response, the DMA will generate a status update to
these interfaces. Once this is done, no additional writeback, interrupts, or marker responses (internal
mode) will be sent for the queue until the queue context is cleared. As a result of the queue
invalidation due to an error, a Traffic Manager Output cycle will also be generated to indicate the error
and queue invalidation. After the queue is invalidated, you can determine the cause of the error by reading the error registers and the context for that queue. You must clear and remove that queue, and then add it back later when needed.
Although additional descriptor fetches will be halted, fetches already in the pipeline will continue to be
processed and descriptors will be delivered to a DMA engine or Descriptor Bypass Out interface as
usual. If the descriptor fetch itself encounters an error, the descriptor will be marked with an error bit.
If the error bit is set, the contents of the descriptor should be considered invalid. It is possible that
subsequent descriptor fetches for the same queue do not encounter an error and will not have the
error bit set.
In memory mapped DMA operations, both the source and destination of the DMA are memory
mapped space. In an H2C transfer, the source address belongs to PCIe address space while the
destination address belongs to AXI MM address space. In a C2H transfer, the source address belongs
to AXI MM address space while the destination address belongs to PCIe address space. PCIe-to-PCIe and AXI MM-to-AXI MM DMAs are not supported. Aside from the direction of the transfer, H2C and C2H DMA behave similarly and share the same descriptor format.
Operation
The memory mapped DMA engines (H2C and C2H) are enabled by setting the run bit in the Memory
Mapped Engine Control Register. When the run bit is deasserted, descriptors can be dropped. Any
descriptors that have already started the source buffer fetch will continue to be processed.
Reassertion of the run bit will result in resetting internal engine state and should only be done when
the engine is quiesced. Descriptors are received either from the descriptor engine directly or from the Descriptor Bypass Input interface. Any queue that is in internal mode should not be given descriptors through the Descriptor Bypass Input interface.
Errors
There are two primary error categories for the DMA Memory Mapped Engine. The first is an error bit
that is set with an incoming descriptor. In this case, the DMA operation of the descriptor is not
processed but the descriptor will proceed through the engine to status update phase with an error
indication. This should result in a writeback, interrupt, and/or marker response depending on context
and configuration. It will also result in the queue being invalidated. The second category of errors for the DMA Memory Mapped Engine is errors encountered during the execution of the DMA itself. These can include PCIe read completion errors and AXI bresp errors (H2C), or AXI bresp errors and PCIe write errors due to bus master enable or function level reset (FLR), as well as RAM ECC errors. The first enabled error is logged in the DMA engine. Refer to the Memory Mapped Engine error
logs. If an error occurs on the read, the DMA write will be aborted if possible. If the error was detected
when pulling write data from RAM, it is not possible to abort the request. Instead invalid data parity
will be generated to ensure the destination is aware of the problem. After the descriptor which
encountered the error has gone through the DMA engine, it will proceed to generate status updates
with an error indication. As with descriptor errors, it will result in the queue being invalidated.
Table: AXI Memory Mapped Descriptor Structure for H2C and C2H
AXI Memory Mapped Writeback Status Structure for H2C and C2H
The MM writeback status register is located after the last entry of the (H2C or C2H) descriptor ring.
Table: AXI Memory Mapped Writeback Status Structure for H2C and C2H
The H2C Stream Engine is responsible for transferring streaming data from the host and delivering it
to the user logic. The H2C Stream Engine operates on H2C stream descriptors. Each descriptor
specifies the start address and the length of the data to be transferred to the user logic. The H2C
Stream Engine parses the descriptor and issues read requests to the host over PCIe, splitting the
read requests at the MRRS boundary. There can be up to 256 requests outstanding in the H2C
Stream Engine to hide the host read latency. The H2C Stream Engine implements a re-ordering buffer
of 32 KB to re-order the TLPs as they come back. Data is issued to the user logic in order of the
requests sent to PCIe.
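The MRRS splitting rule can be illustrated with a short sketch (the 512-byte MRRS value and the (address, length) return format are illustrative assumptions, not the engine's internal representation):

```python
def split_at_mrrs(addr: int, length: int, mrrs: int = 512):
    """Split a host read into PCIe read requests at MRRS boundaries.

    A read request may not cross the next MRRS-aligned address, so a
    transfer is chopped into an initial partial chunk (if unaligned),
    full MRRS-sized chunks, and a final partial chunk.
    """
    reqs = []
    while length > 0:
        # Distance to the next MRRS-aligned boundary caps this request.
        chunk = min(length, mrrs - (addr % mrrs))
        reqs.append((addr, chunk))
        addr += chunk
        length -= chunk
    return reqs
```

An unaligned 1 KB read starting at 0x100 yields three requests: 256 B up to the boundary, one full 512 B request, and a trailing 256 B request.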
If the status descriptor is enabled in the associated H2C context, the engine can additionally send a status writeback to the host once it is done issuing the data to the user logic.
Each queue in QDMA can be programmed in either of the two H2C Stream modes: internal and
bypass. This is done by specifying the mode in the queue context. The H2C Stream Engine knows
whether the descriptor being processed is for a queue in internal or bypass mode.
The following figures show the internal mode and bypass mode flows.
Bypass mode enables the following user-logic customizations:
The user logic can have a custom descriptor format. This is possible because QDMA does not
parse descriptors for queues in bypass mode. The user logic parses these descriptors and
provides the information required by the QDMA on the H2C Stream bypass-in interface.
Immediate data can be passed from the software to the user logic without DMA operation.
The user logic can do traffic management by sending the descriptors to the QDMA when it is
ready to sink all the data. Descriptors can be cached in local RAM.
The user logic can perform address translation.
There are some requirements imposed on the user logic when using the bypass mode. Because the
bypass mode allows a packet to span multiple descriptors, the user logic needs to indicate to QDMA
which descriptor marks the Start-Of-Packet (SOP) and which marks the End-Of-Packet (EOP). At the
QDMA H2C Stream bypass-in interface, among other pieces of information, the user logic needs to
provide: Address, Length, SOP, and EOP. It is required that once the user logic feeds SOP descriptor
information into QDMA, it must eventually feed EOP descriptor information also. Descriptors for these
multi-descriptor packets must be fed in sequentially. Other descriptors not belonging to the packet
must not be interleaved within the multi-descriptor packet. The user logic must accumulate the
descriptors up to the EOP descriptor, before feeding them back to QDMA. Not doing so can result in a
hang. The QDMA will generate a TLAST at the QDMA H2C AXI Stream data output once it issues the
last beat for the EOP descriptor. This is guaranteed because the user is required to submit the
descriptors for a given packet sequentially.
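A minimal sanity check for the SOP/EOP rules above might look like this (a hypothetical user-logic helper, with descriptors reduced to their SOP/EOP flags):

```python
def batch_is_valid(batch):
    """Check a bypass-in descriptor batch for one packet.

    Each element is a (sop, eop) flag pair. Per the rules above, the
    batch must start with SOP, end with EOP, and contain no foreign
    descriptors interleaved in the middle. A single-descriptor packet
    carries both SOP and EOP.
    """
    if not batch:
        return False
    for i, (sop, eop) in enumerate(batch):
        if sop != (i == 0):          # SOP only on the first descriptor
            return False
        if eop != (i == len(batch) - 1):  # EOP only on the last
            return False
    return True
```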
The H2C stream interface is shared by all the queues, and has the potential for a head of line
blocking issue if the user logic does not reserve the space to sink the packet. Quality of service can
be severely affected if the packet sizes are large. The Stream engine is designed to saturate PCIe for
packet sizes as low as 128B, so AMD recommends that you restrict the packet size to be host page
size or maximum transfer unit as required by the user application.
A performance control provided in the H2C Stream Engine is the ability to stall requests from being
issued to the PCIe RQ/RC if a certain amount of data is outstanding on the PCIe side as seen by the
H2C Stream Engine. To use this feature, the software must program a threshold value in the H2C_REQ_THROT (0xE24) register. When the H2C Stream Engine has more data outstanding than this threshold, it stalls new read requests until the outstanding data drops below the threshold.
This H2C descriptor format is only applicable for internal mode. For bypass mode, the user logic can
define its own format as needed by the user application.
Descriptor Metadata
Similar to bypass mode, the internal mode also provides a mechanism to pass information directly
from the software to the user logic. In addition to address and length, the H2C Stream descriptor also
has a 32b metadata field. This field is not used by the QDMA for the DMA operation. Instead, it is
passed on to the user logic on the H2C AXI4-Stream tuser on every beat of the packet.
The length field in a descriptor can be zero. In this case, the H2C Stream Engine will issue a zero
byte read request on PCIe. After the QDMA receives the completion for the request, the H2C Stream
Engine will send out one beat of data with tlast on the QDMA H2C AXI4-Stream interface. The zero
byte packet will be indicated on the interface by setting the zero_b_dma bit in the tuser. The user
logic must set both the SOP and EOP for a zero byte descriptor. If not done, an error will be flagged
by the H2C Stream Engine.
When feeding the descriptor information on the bypass input interface, the user logic can request the
QDMA to send a status write back to the host when it is done fetching the data from the host. The
user logic can also request that a status be issued to it when the DMA is done. These behaviors can
be controlled using the sdi and mrkr_req inputs in the bypass input interface.
The H2C writeback status register is located after the last entry of the H2C descriptor list.
✎ Note: The format of the H2C-ST status descriptor written to the descriptor ring is different from
that written into the interrupt coalesce entry.
The H2C engine has a data aligner that aligns the data to zero Bytes (0B) boundary before issuing it
to the user logic. This allows the start address of a descriptor to be arbitrarily aligned and still receive
the data on the H2C AXI4-Stream data bus without any holes at the beginning of the data. The user
logic can send a batch of descriptors from SOP to EOP with arbitrary address and length alignments
for each descriptor. The aligner will align and pack the data from the different descriptors and present it to the user logic as a contiguous packet.
If an error is encountered while fetching a descriptor, the QDMA Descriptor Engine flags the descriptor
with error. For a queue in internal mode, the H2C Stream Engine handles the error descriptor by not
performing any PCIe or DMA activity. Instead, it waits for the error descriptor to pass through the
pipeline and forces a writeback after it is done. For a queue in bypass mode, it is the responsibility of
the user logic to not issue a batch of descriptors with an error descriptor. Instead, it must send just
one descriptor with error input asserted on the H2C Stream bypass-in interface and set the SOP,
EOP, no_dma signal, and sdi or mrkr_req signal to make the H2C Stream Engine send a writeback
to Host.
If the H2C Stream Engine encounters an error coming from PCIe on the data, it keeps the error sticky
across the full packet. The error is indicated to the user on the err bit on the H2C Stream Data
Output. Once the H2C Stream sends out the last beat of a packet that saw a PCIe data error, it also
sends a Writeback to the Software to inform it about the error.
The C2H Stream Engine DMA writes the stream packets to the host memory into the descriptors provided by the host driver through the C2H descriptor queue.
The Prefetch Engine is responsible for calculating the number of descriptors needed for the DMA that
is writing the packet. The buffer size is fixed on a per-queue basis. For internal and cached bypass mode,
the prefetch module can fetch up to 512 descriptors for a maximum of 64 different queues at any
given time.
The Prefetch Engine also offers a low-latency feature (pfch_en = 1), where the engine can prefetch up to qdma_c2h_pfch_cfg.num_pfch descriptors upon receiving a packet, so that subsequent packets can avoid the PCIe latency.
The QDMA requires software to post the full ring size so the C2H stream engine can fetch the needed number of descriptors for all received packets. If there are not enough descriptors in the descriptor
ring, the QDMA will stall the packet transfer. For performance reasons, the software is required to post
the PIDX as soon as possible to ensure there are always enough descriptors in the ring.
C2H stream packet data length is limited to 7 * C2H buffer size. C2H buffer size can be programmed
from 0xAB0 to 0xAEC address. For details see cpm4-qdma-v2-1-registers.csv available in the register
map files.
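Under the per-queue buffer size and the 7 × buffer-size length limit stated above, the descriptor count for a packet can be sketched as follows (a hypothetical helper; buffer sizes are the per-queue values programmed by software):

```python
def c2h_descriptors_needed(pkt_len: int, buf_size: int) -> int:
    """Number of fixed-size C2H descriptors a stream packet consumes.

    Raises if the packet exceeds the 7 * buf_size limit. Zero-length
    (immediate-data/marker) packets consume no descriptor.
    """
    if pkt_len > 7 * buf_size:
        raise ValueError("packet exceeds 7 * C2H buffer size")
    return -(-pkt_len // buf_size)  # ceiling division
```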
The prefetch engine interacts between the descriptor fetch engine and C2H DMA write engine to pair
up the descriptor and its payload.
The C2H descriptors can be from the descriptor fetch engine or C2H bypass input interfaces. The
descriptors from the descriptor fetch engine are always in cache mode. The prefetch engine keeps
the order of the descriptors to pair with the C2H data packets from the user. The descriptors from the
C2H bypass input interfaces have one interface for both simple mode and cache mode (note that both
simple bypass and cache bypass use the same interface). For simple mode, the user application
keeps the order of the descriptors to pair with the C2H data packets. For cache mode, the prefetch
engine keeps the order of the descriptors to pair with the C2H data packet from the user.
The prefetch context has a bypass bit. When it is 1'b1, the user application sends the credits for the
descriptors. When it is 1'b0, the prefetch engine handles the credits for the descriptors.
If you already have descriptors, there is no need to update the pointers or provide credits. Instead,
send the descriptors in the descriptor bypass interface, and send the data and Completion (CMPT)
packets.
Immediate Data Packet
1 beat of data
dma0_s_axis_c2h_ctrl_imm_data = 1'b1
dma0_s_axis_c2h_ctrl_len = data path width in bytes (for example, 64 if the data width is 512 bits)
dma0_s_axis_c2h_mty = 0
Marker Packet
The C2H Stream Engine of the QDMA provides a way for the user logic to insert a marker into the
QDMA along with a C2H packet. This marker then propagates through the C2H Engine pipeline and
comes out on the C2H Stream Descriptor Bypass Out interface. The marker is inserted by setting the
marker bit in the C2H Stream Control input. The marker response is indicated by QDMA to the user
logic by setting the mrkr_rsp bit on the C2H Stream Descriptor Bypass Out interface. For a marker,
QDMA does not send out a payload packet but still writes to the Completion Ring. Not all marker
responses are generated because of a corresponding marker request. The QDMA sometimes
generates marker responses when it encounters exceptional events. See the following section for
details about when QDMA internally generates marker responses.
The primary purpose of giving the user logic the ability to send a marker into the QDMA is to determine when all traffic prior to the marker has been flushed. This can be used in the shutdown sequence in the user logic. Although not a requirement, the marker should be sent by the user logic with the user_trig bit set. This allows the QDMA to generate an interrupt and ensures that all traffic prior to the marker is flushed out. The QDMA Completion
Engine takes the following actions when it receives a marker from the user logic:
Sends the Completion that came along with the marker to the C2H Stream Completion Ring
Generates a Status Descriptor if enabled (if user_trig was set when the marker was inserted)
Generates an Interrupt if enabled and not outstanding
Sends the marker response. If an Interrupt was not sent due to it being enabled but outstanding,
the ‘retry_mrkr’ bit in the marker response is set to inform the user that an Interrupt could not be
sent for this marker request. See the C2H Stream Descriptor Bypass Output interface
description for details of these fields.
1 beat of data
dma0_s_axis_c2h_ctrl_marker = 1'b1
dma0_s_axis_c2h_ctrl_len = data path width in bytes (for example, 64 if the data width is 512 bits)
dma0_s_axis_c2h_mty = 0
The immediate data packet and the marker packet don't consume the descriptor, but they write to the
C2H Completion Ring. The software needs to size the C2H Completion Ring large enough to
accommodate the outstanding immediate packets and the marker packets.
Zero Length Packet
1 beat of data
dma0_s_axis_c2h_ctrl_len = 0
dma0_s_axis_c2h_mty = 0
dma0_s_axis_c2h_ctrl_disable_cmp = 1
If an error is encountered while fetching a descriptor (in pre-fetch or regular mode), the QDMA
Descriptor Engine flags the descriptor with error. For a queue in internal mode, the C2H Stream
Engine handles the error descriptor by not performing any PCIe or DMA activity. Instead, it waits for
the error descriptor to pass through the pipeline and forces a writeback after it is done. For a queue in
bypass mode, it is the responsibility of the user logic to not issue a batch of descriptors with an error
descriptor. Instead, it must send just one descriptor with error input asserted on the C2H Stream
bypass-in interface and set the SOP, EOP, no_dma signal, and sdi or mrkr_req signal to make the
C2H Stream Engine send a writeback to Host.
C2H Completion
When the DMA write of the data payload is done, the QDMA writes the CMPT packet into the CMPT
queue. Besides the user defined data, the CMPT packet also includes some other information, such
as error, color, and the length. It also has a desc_used bit to indicate if the packet consumes a
descriptor. A C2H data packet of immediate-data or marker type does not consume any descriptor.
The C2H completion status is located at the last location of the completion ring, that is, Completion Ring Base Address + (completion entry size in bytes (8, 16, or 32) × (Completion Ring Size − 1)).
When C2H Streaming Completion is enabled, after the packet is transferred, CMPT entry and CMPT
status are written to C2H Completion ring. PIDX in the Completion status can be used to indicate the
currently available completion to be processed.
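The status-entry address formula above transcribes directly (the helper name is hypothetical; entry_bytes is the 8/16/32-byte completion entry size):

```python
def cmpt_status_addr(base: int, entry_bytes: int, ring_size: int) -> int:
    """Address of the C2H completion status entry.

    The status entry occupies the last slot of the completion ring:
    base + entry_size * (ring_size - 1).
    """
    assert entry_bytes in (8, 16, 32)
    return base + entry_bytes * (ring_size - 1)
```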
The following is the C2H Completion ring entry structure for User format when the data format bit is
set to 1’b1.
Field | Bit Width | Bit Range
desc_used | 1 | [3:3]
err | 1 | [2:2]
color | 1 | [1:1]
Len | 16 | [19:4]
desc_used | 1 | [3:3]
err | 1 | [2:2]
color | 1 | [1:1]
rsvd | 8 | [19:12]
Qid | 11 | [11:1]
rsvd | 2 | [2:1]
The CMPT packet has three types: 8B, 16B, or 32B. When it is 8B or 16B, it only needs one beat of
the data. When it is 32B, it needs two beats of data. Each data beat is 128 bits.
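The beat count follows from the 128-bit (16-byte) beat width (a sketch assuming exactly the 8, 16, and 32-byte entry sizes listed above):

```python
def cmpt_beats(entry_bytes: int) -> int:
    """Data beats needed for a CMPT entry on a 128-bit (16 B) bus.

    8 B and 16 B entries fit in one beat; 32 B entries take two.
    """
    assert entry_bytes in (8, 16, 32)
    return -(-entry_bytes // 16)  # ceiling of entry_bytes / 16
```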
The QDMA provides a means to moderate the C2H completion interrupts and Completion status
writes on a per queue basis. The software can select one out of five modes for each queue. The
selected mode for a queue is stored in the QDMA in the C2H completion ring context for that queue.
After a mode has been selected for a queue, the driver can always select another mode when it
sends the completion ring CIDX update to QDMA.
The C2H completion interrupt moderation is handled by the completion engine inside the C2H engine.
The completion engine stores the C2H completion ring contexts of all the queues. It is possible to
individually enable or disable the sending of interrupts and C2H completion status descriptors for
every queue and this information is present in the completion ring context. It is worth mentioning that
the modes being described here moderate not only interrupts but also completion status writes. Also,
since interrupts and completion status writes can be individually enabled/disabled for each queue,
these modes will work only if the interrupt/completion status is enabled in the Completion context for
that queue.
The QDMA keeps only one interrupt outstanding per queue. This policy is enforced by QDMA even if
all other conditions to send an interrupt have been met for the mode. The QDMA considers an interrupt serviced when it receives a CIDX update for that queue from the driver.
The basic policy followed in all the interrupt moderation modes is that when there is no interrupt
outstanding for a queue, the QDMA keeps monitoring the trigger conditions to be met for that mode.
Once the conditions are met, an interrupt is sent out. While the QDMA subsystem is waiting for the
interrupt to be served, it remains sensitive to interrupt conditions being met and remembers them.
When the CIDX update is received, the QDMA subsystem evaluates whether the conditions are still
being met. If they are still being met, another interrupt is sent out. If they are not met, no interrupt is
sent out and QDMA resumes monitoring for the conditions to be met again.
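The one-outstanding-interrupt policy can be modeled as a small state machine (a simplification: all trigger conditions are collapsed into one event, whereas real hardware re-evaluates the per-mode triggers on each CIDX update):

```python
class IrqModerator:
    """Toy model of the one-outstanding-interrupt policy per queue."""

    def __init__(self):
        self.outstanding = False  # an interrupt is awaiting service
        self.pending = False      # conditions met while outstanding
        self.sent = 0

    def trigger(self):
        # Conditions met: send immediately if nothing is outstanding,
        # otherwise remember that the conditions were met.
        if self.outstanding:
            self.pending = True
        else:
            self.outstanding = True
            self.sent += 1

    def cidx_update(self, conditions_still_met: bool):
        # Driver serviced the interrupt; re-evaluate the triggers.
        self.outstanding = False
        self.pending = False
        if conditions_still_met:
            self.trigger()
```

Two triggers arriving back-to-back produce a single interrupt; the second is sent only if the conditions still hold when the CIDX update arrives.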
Note that the interrupt moderation modes that the QDMA subsystem provides are not necessarily
precise. Thus, if the user application sends two C2H packets with an indication to send an interrupt, it
is not necessary that two interrupts will be generated. The main reason for this behavior is that when the driver is interrupted to read the completion ring, it is under no obligation to read exactly up to the completion for which the interrupt was generated. Thus, the driver may not read up to the interrupting completion descriptor, or it may even read beyond it if more completions have arrived in the meantime.
TRIGGER_EVERY
This mode is the most aggressive in terms of interruption frequency. The idea behind this mode
is to send an interrupt whenever the completion engine determines that an unread completion
descriptor is present in the completion ring.
TRIGGER_USER
The QDMA provides a way to send a C2H packet to the subsystem with an indication to send out
an interrupt when the subsystem is done sending the packet to the host. This allows the user
application to perform interrupt moderation when the TRIGGER_USER mode is set.
TRIGGER_USER_COUNT
In this mode, the QDMA is sensitive to either of two triggers. One of these triggers is sent by
the user along with the C2H packet. The other trigger is the presence of more than a
programmed threshold of unread Completion entries in the Completion Ring, as seen by the HW.
This threshold is driver programmable on a per-queue basis. The QDMA evaluates whether or
not to send an interrupt when either of these triggers is detected. As explained in the preceding
sections, other conditions must be satisfied in addition to the triggers for an interrupt to be sent.
TRIGGER_USER_TIMER
In this mode, the QDMA is sensitive to either of two triggers. One of these triggers is sent by the
user along with the C2H packet. The other trigger is the expiration of the timer that is associated
with the C2H queue. The period of the timer is driver programmable on a per-queue basis. The
QDMA evaluates whether or not to send an interrupt when either of these triggers is detected. As
explained in the preceding sections, other conditions must be satisfied in addition to the triggers
for an interrupt to be sent. For more information, see C2H Timer.
TRIGGER_USER_TIMER_COUNT
In this mode, the QDMA is sensitive to any of three triggers. One of these triggers is sent by
the user along with the C2H packet. The second trigger is the expiration of the timer that is
associated with the C2H queue. The period of the timer is driver programmable on a per-queue
basis. The third trigger is the presence of more than a programmed threshold of unread
Completion entries in the Completion Ring, as seen by the HW. This threshold is driver
programmable on a per-queue basis. The QDMA evaluates whether or not to send an interrupt
when any of these triggers is detected. As explained in the preceding sections, other conditions
must be satisfied in addition to the triggers for an interrupt to be sent.
TRIGGER_DIS
In this mode, the QDMA does not send C2H completion interrupts even if they are enabled
for a given queue. The only way that the driver can read the completion ring in this case is when
it regularly polls the ring. The driver will have to make use of the color bit feature provided in the
completion ring when this mode is set as this mode also disables the sending of any completion
status descriptors to the completion ring.
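As a summary of the modes above, the following sketch shows how software might model the per-queue trigger selection. This is illustrative only: the enum encoding and the cmpt_ctx field names are hypothetical and do not mirror the actual completion context layout.

```c
#include <stdint.h>

/* Hypothetical encodings and field names -- the real values live in the
 * completion context definition of the QDMA register reference. */
enum cmpt_trig_mode {
    TRIGGER_DIS = 0,
    TRIGGER_EVERY,
    TRIGGER_USER,
    TRIGGER_USER_COUNT,
    TRIGGER_USER_TIMER,
    TRIGGER_USER_TIMER_COUNT,
};

struct cmpt_ctx {
    enum cmpt_trig_mode trig_mode;
    uint16_t counter_threshold; /* used by the *_COUNT modes         */
    uint8_t  timer_idx;         /* index into QDMA_C2H_TIMER_CNT[16] */
    int      int_en;            /* interrupt enable for this queue   */
};

/* Returns nonzero when an observed event is a valid trigger for the
 * queue's configured mode. TRIGGER_DIS never triggers; polling with the
 * color bit is the only way to consume the ring in that mode. */
static int trigger_applies(const struct cmpt_ctx *c,
                           int user_trig, int timer_exp, int over_thresh)
{
    if (!c->int_en)
        return 0;
    switch (c->trig_mode) {
    case TRIGGER_EVERY:            return 1;
    case TRIGGER_USER:             return user_trig;
    case TRIGGER_USER_COUNT:       return user_trig || over_thresh;
    case TRIGGER_USER_TIMER:       return user_trig || timer_exp;
    case TRIGGER_USER_TIMER_COUNT: return user_trig || timer_exp || over_thresh;
    case TRIGGER_DIS:
    default:                       return 0;
    }
}
```

Remember that even when a trigger applies, the QDMA still keeps only one interrupt outstanding per queue.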
The following are the flowcharts of the different modes. These flowcharts are from the point of view of the
C2H Completion Engine. The Completion packets come in from the user logic and are written to the Completion Ring.
C2H Timer
The C2H timer is a trigger mode in the Completion context. It supports 2048 queues, and each queue
has its own timer. When a timer expires, a timer expire signal is sent to the Completion module. If
multiple timers expire at the same time, the expirations are sent out in a round-robin manner.
Reference Timer
The reference timer is based on the timer tick. The register QDMA_C2H_INT (0xB0C) defines the
value for a timer tick. The 16 registers QDMA_C2H_TIMER_CNT (0xA00-0xA3C) hold the timer
counts based on the timer tick. The timer_idx in the Completion context is the index into the 16
QDMA_C2H_TIMER_CNT registers. Each queue can choose its own timer_idx.
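The effective timer period for a queue follows directly from these two registers. The helper below is a sketch; the function name is ours, and the unit of the tick depends on the configured QDMA_C2H_INT value and clock.

```c
#include <stdint.h>

/* Period of a queue's C2H timer: the global tick value (QDMA_C2H_INT,
 * 0xB0C) multiplied by the count that the queue's timer_idx selects from
 * the 16 QDMA_C2H_TIMER_CNT registers (0xA00-0xA3C). */
static uint64_t c2h_timer_period(uint32_t tick, const uint32_t timer_cnt[16],
                                 uint8_t timer_idx)
{
    return (uint64_t)tick * timer_cnt[timer_idx & 0xF];
}
```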
The Bridge core is an interface between the AXI4 and the PCI Express integrated block. It contains
the memory mapped AXI4 to AXI4-Stream Bridge, and the AXI4-Stream Enhanced Interface Block for
PCIe. The memory mapped AXI4 to AXI4-Stream Bridge contains a register block and two functional
half bridges, referred to as the Slave Bridge and Master Bridge.
The slave bridge connects to the AXI4 Interconnect as a slave device to handle any issued AXI4
master read or write requests.
The master bridge connects to the AXI4 Interconnect as a master to process the PCIe generated
read or write TLPs.
The register block contains registers used in the Bridge core for dynamically mapping the AXI4
memory mapped (MM) address range provided using the AXIBAR parameters to an address for
PCIe range.
The core uses a set of interrupts to detect and flag error conditions.
The following tables are the translation tables for AXI4-Stream and memory-mapped transactions.
For PCIe® requests with lengths greater than 1 Dword, the size of the data burst on the Master AXI
interface will always equal the width of the AXI data bus even when the request received from the
PCIe link is shorter than the AXI bus width.
slave axi wstrb can be used to facilitate data alignment to an address boundary. slave axi
wstrb can equal 0 at the beginning of a valid data cycle, and the bridge will appropriately calculate an offset to the starting address.
The functional mode conforms to PCIe® transaction ordering rules. See the PCI-SIG Specifications
for the complete rule set. The following behaviors are implemented in the functional mode to enforce
the PCIe transaction ordering rules on the highly-parallel AXI bus of the bridge.
The bresp to the remote (requesting) AXI4 master device for a write to a remote PCIe device is
not issued until the MemWr TLP transmission is guaranteed to be sent on the PCIe link before any
subsequent TX-transfers.
If Relaxed Ordering bit is not set within the TLP header, then a remote PCIe device read to a
remote AXI slave is not permitted to pass any previous remote PCIe device writes to a remote
AXI slave received by the functional mode. The AXI read address phase is held until the
previous AXI write transactions have completed and bresp has been received for the AXI write
transactions. If the Relaxed Ordering attribute bit is set within the TLP header, then the remote
PCIe device read is permitted to pass.
Read completion data received from a remote PCIe device are not permitted to pass any remote
PCIe device writes to a remote AXI slave received by the functional mode prior to the read
completion data. The bresp for the AXI write(s) must be received before the completion data is
presented on the AXI read data channel.
✎ Note: The transaction ordering rules for PCIe might have an impact on data throughput in heavy
bidirectional traffic.
BAR Addressing
Aperture_Base_Address_n provides the low address where AXI BAR n starts and will be
regarded as address offset 0x0 when the address is translated.
Aperture_High_Address_n is the high address of the last valid byte address of AXI BAR n. (For
more details on how the address gets translated, see Address Translation.)
Address Translation
The address space for PCIe® is different than the AXI address space. To access one address space
from another address space requires an address translation process. On the AXI side, the bridge
supports mapping to PCIe on up to six 32-bit or 64-bit AXI base address registers (BARs).
Four examples follow:
Example 1 (32-bit PCIe Address Mapping) demonstrates how to set up three AXI BARs and
translate the AXI address to a 32-bit address for PCIe.
Example 2 (64-bit PCIe Address Mapping) demonstrates how to set up three AXI BARs and
translate the AXI address to a 64-bit address for PCIe.
Example 3 demonstrates how to set up two 64-bit PCIe BARs and translate the address for PCIe
to an AXI address.
Example 4 demonstrates how to set up a combination of two 32-bit AXI BARs and two 64 bit AXI
BARs, and translate the AXI address to an address for PCIe.
This example shows the generic settings to set up three independent AXI BARs and address
translation of AXI addresses to a remote 32-bit address space for PCIe. This setting of AXI BARs
does not depend on the BARs for PCIe in the functional mode.
In this example, the number of AXI BARs is three, and the following assignments for each range are made:
Aperture_Base_Address_0  =0x00000000_12340000
Aperture_High_Address_0  =0x00000000_1234FFFF (64 Kbytes)
AXI_to_PCIe_Translation_0=0x00000000_56710000 (Bits 63-32 are zero in order to produce a 32-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1  =0x00000000_ABCDE000
Aperture_High_Address_1  =0x00000000_ABCDFFFF (8 Kbytes)
AXI_to_PCIe_Translation_1=0x00000000_FEDC0000 (Bits 63-32 are zero in order to produce a 32-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Aperture_Base_Address_2  =0x00000000_FE000000
Aperture_High_Address_2  =0x00000000_FFFFFFFF (32 Mbytes)
AXI_to_PCIe_Translation_2=0x00000000_40000000 (Bits 63-32 are zero in order to produce a 32-bit PCIe TLP. Bits 24-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 25 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the AXI bus yields
0x56710ABC on the bus for PCIe.
This example shows the generic settings to set up three independent AXI BARs and address
translation of AXI addresses to a remote 64-bit address space for PCIe. This setting of AXI BARs
does not depend on the BARs for PCIe within the Bridge.
In this example, the number of AXI BARs is three, and the following assignments for each range are made:
Aperture_Base_Address_0  =0x00000000_12340000
Aperture_High_Address_0  =0x00000000_1234FFFF (64 Kbytes)
AXI_to_PCIe_Translation_0=0x50000000_56710000 (Bits 63-32 are non-zero in order to produce a 64-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1  =0x00000000_ABCDE000
Aperture_High_Address_1  =0x00000000_ABCDFFFF (8 Kbytes)
AXI_to_PCIe_Translation_1=0x60000000_FEDC0000 (Bits 63-32 are non-zero in order to produce a 64-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Aperture_Base_Address_2  =0x00000000_FE000000
Aperture_High_Address_2  =0x00000000_FFFFFFFF (32 Mbytes)
AXI_to_PCIe_Translation_2=0x70000000_40000000 (Bits 63-32 are non-zero in order to produce a 64-bit PCIe TLP. Bits 24-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 25 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the AXI bus yields
0x5000000056710ABC on the bus for PCIe.
Accessing the Bridge AXI BAR_1 with address 0x0000_ABCDF123 on the bus yields
0x60000000FEDC1123 on the bus for PCIe.
Accessing the Bridge AXI BAR_2 with address 0x0000_FFFEDCBA on the bus yields
0x7000000041FEDCBA on the bus for PCIe.
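Both examples apply the same rule: the offset inside the AXI BAR aperture is preserved, and the bits above the aperture are replaced by the translation value. A minimal sketch follows (the helper name is ours; apertures are assumed to be power-of-two sized and aligned, as in the examples):

```c
#include <stdint.h>

/* Translate an AXI address that falls inside an AXI BAR into an address
 * for PCIe. The aperture mask comes from the base/high pair; the low
 * bits of the translation value must be zero, as noted above. */
static uint64_t axi_to_pcie(uint64_t axi_addr, uint64_t aperture_base,
                            uint64_t aperture_high, uint64_t translation)
{
    uint64_t mask = aperture_high - aperture_base; /* 64 KB -> 0xFFFF */
    return translation | (axi_addr & mask);
}
```

With the Example 2 values, axi_to_pcie(0xFFFEDCBA, 0xFE000000, 0xFFFFFFFF, 0x70000000_40000000) reproduces 0x70000000_41FEDCBA.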
Example 3
This example shows the generic settings to set up two independent BARs for PCIe® and address
translation of addresses for PCIe to a remote AXI address space. This setting of BARs for PCIe does
not depend on the AXI BARs within the bridge.
In this example, the number of PCIe BARs is two, and the following range assignments are made:
Aperture_Base_Address_0  =0x00000000_12340000
Aperture_High_Address_0  =0x00000000_1234FFFF (64 KB)
AXI_to_PCIe_Translation_0=0x00000000_56710000 (Bits 63-32 are zero to produce a 32-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1  =0x00000000_ABCDE000
Aperture_High_Address_1  =0x00000000_ABCDFFFF (8 KB)
AXI_to_PCIe_Translation_1=0x50000000_FEDC0000 (Bits 63-32 are non-zero to produce a 64-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the AXI bus yields
0x56710ABC on the bus for PCIe.
Accessing the Bridge AXI BAR_1 with address 0x0000_ABCDF123 on the AXI bus yields
0x50000000FEDC1123 on the bus for PCIe.
Example 4
This example shows the generic settings of four AXI BARs and address translation of AXI addresses
to remote 32-bit and 64-bit addresses for PCIe®. This setting of AXI BARs does not depend on the
BARs for PCIe within the Bridge.
In this example, the number of AXI BARs is four, and the following assignments for each range are made:
Aperture_Base_Address_0  =0x00000000_12340000
Aperture_High_Address_0  =0x00000000_1234FFFF (64 KB)
AXI_to_PCIe_Translation_0=0x00000000_56710000 (Bits 63-32 are zero to produce a 32-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1  =0x00000000_ABCDE000
Aperture_High_Address_1  =0x00000000_ABCDFFFF (8 KB)
AXI_to_PCIe_Translation_1=0x50000000_FEDC0000 (Bits 63-32 are non-zero to produce a 64-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Aperture_Base_Address_2  =0x00000000_FE000000
Aperture_High_Address_2  =0x00000000_FFFFFFFF (32 MB)
AXI_to_PCIe_Translation_2=0x00000000_40000000 (Bits 63-32 are zero to produce a 32-bit PCIe TLP. Bits 24-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 25 bits are invalid translation values.)
Aperture_Base_Address_3  =0x00000000_00000000
Aperture_High_Address_3  =0x00000000_00000FFF (4 KB)
AXI_to_PCIe_Translation_3=0x60000000_87654000 (Bits 63-32 are non-zero to produce a 64-bit PCIe TLP. Bits 11-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 12 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the AXI bus yields
0x56710ABC on the bus for PCIe.
Accessing the Bridge AXI BAR_1 with address 0x0000_ABCDF123 on the AXI bus yields
0x50000000FEDC1123 on the bus for PCIe.
Accessing the Bridge AXI BAR_2 with address 0x0000_FFFEDCBA on the AXI bus yields
0x41FEDCBA on the bus for PCIe.
Accessing the Bridge AXI BAR_3 with address 0x0000_00000071 on the AXI bus yields
0x6000000087654071 on the bus for PCIe.
Addressing Checks
When setting the following parameters for PCIe® address mapping, C_PCIEBAR2AXIBAR_n and
PF0_BARn_APERTURE_SIZE, be sure these are set to allow for the addressing space on the AXI
system. For example, the following setting is illegal and results in an invalid AXI address.
C_PCIEBAR2AXIBAR_n=0x00000000_FFFFF000
PF0_BARn_APERTURE_SIZE=0x06 (8 KB)
For an 8 Kilobyte BAR, the lower 13 bits must be zero. As a result, the C_PCIEBAR2AXIBAR_n value
should be modified to 0x00000000_FFFFE000. Also check for a PF0_BARn_APERTURE_SIZE that
implies a larger aperture than the alignment of the value assigned to the C_PCIEBAR2AXIBAR_n
parameter. An example of such a parameter setting follows.
C_PCIEBAR2AXIBAR_n=0xFFFF_E000
PF0_BARn_APERTURE_SIZE=0x0D (1 MB)
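The alignment rule above can be checked programmatically. This sketch (helper name ours) validates a translation value against an aperture size given in bytes:

```c
#include <stdint.h>

/* A C_PCIEBAR2AXIBAR_n value is legal only if its low bits, as dictated
 * by the BAR aperture size, are zero. aperture_bytes must be a power of
 * two (an 8 KB BAR requires the lower 13 bits to be zero, and so on). */
static int bar2axibar_valid(uint64_t translation, uint64_t aperture_bytes)
{
    return (translation & (aperture_bytes - 1)) == 0;
}
```

The illegal setting above fails this check for an 8 KB aperture, and the 0xFFFF_E000 value fails it for a 1 MB aperture.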
Malformed TLP
The integrated block for PCI Express® detects a malformed TLP. For the IP configured as an
Endpoint core, a malformed TLP results in a fatal error message being sent upstream if error reporting
is enabled in the Device Control register.
Abnormal Conditions
This section describes how the Slave side and Master side (see the following tables) of the functional
mode handle abnormal conditions.
Slave bridge abnormal conditions are classified as: Illegal Burst Type and Completion TLP Errors. The
following sections describe the manner in which the Bridge handles these errors.
Illegal Burst Type
The slave bridge monitors AXI read and write burst type inputs to ensure that only the INCR
(incrementing burst) type is requested. Any other value on these inputs is treated as an error condition
and the Slave Illegal Burst (SIB) interrupt is asserted. In the case of a read request, the Bridge
asserts SLVERR for all data beats and arbitrary data is placed on the Slave AXI4-MM read data bus.
In the case of a write request, the Bridge asserts SLVERR for the write response and all write data is
discarded.
Completion TLP Errors
Any request to the bus for PCIe (except for a posted Memory write) requires a completion TLP to
complete the associated AXI request. The Slave side of the Bridge checks the received completion
TLPs for errors and checks for completion TLPs that are never returned (Completion Timeout). Each
of the completion TLP error types are discussed in the subsequent sections.
Unexpected Completion
When the slave bridge receives a completion TLP, it matches the header RequesterID and Tag to the
outstanding RequesterID and Tag. A match failure indicates the TLP is an Unexpected Completion
which results in the completion TLP being discarded and a Slave Unexpected Completion (SUC)
interrupt strobe being asserted. Normal operation then continues.
Unsupported Request
A device for PCIe might not be capable of satisfying a specific read request. For example, if the read
request targets an unsupported address for PCIe, the completer returns a completion TLP with a
completion status of 0b001 - Unsupported Request. A completion TLP returned with a reserved
completion status must also be treated as an Unsupported Request, according to the PCI Express
Base Specification v3.0. When the slave bridge receives an unsupported request
response, the Slave Unsupported Request (SUR) interrupt is asserted and the DECERR response is
asserted with arbitrary data on the AXI4 memory mapped bus.
Completion Timeout
A Completion Timeout occurs when a completion (Cpl) or completion with data (CplD) TLP is not
returned after an AXI to PCIe memory read request, or after a PCIe Configuration Read/Write request.
To service error messages received in Root Port mode, the driver can follow these steps:
1. Read register 0xE10 (INT_DEC) and check whether any of bits [9] (correctable), [10]
(non_fatal), or [11] (fatal) is set.
2. Read register 0xE20 (RP_CSR) and check whether bit [16] (efifo_not_empty) is set.
3. If the FIFO is not empty, read it by reading 0xE2C (RP_FIFO_READ).
a. The error message indicates where the error came from (that is, the requester ID) and the error type.
4. To clear the error, write to 0xE2C (RP_FIFO_READ). The value does not matter.
5. Repeat steps 2 and 3 until the 0xE2C (RP_FIFO_READ) valid bit [18] is cleared.
6. Write 1 to register 0xE10 (INT_DEC) to clear bits [9] (correctable), [10] (non_fatal), or [11] (fatal).
To service a received PM_PME message:
1. Read register 0xE10 (INT_DEC) and check whether bit [17] is set, which indicates a PM_PME
message has been received.
2. Read register 0xE20 (RP_CSR) and check whether bit [18] (pfifo_not_empty) is set.
3. If the FIFO is not empty, read it by reading 0xE30 (RP_PFIFO).
a. The message indicates where it came from (that is, the requester ID).
4. To clear, write to 0xE30 (RP_PFIFO). The value does not matter.
5. Repeat steps 2 and 3 until the 0xE30 (RP_PFIFO) valid bit [31] is cleared.
6. Write 1 to register 0xE10 (INT_DEC) to clear bit [17].
The following sections describe the manner in which the master bridge handles abnormal conditions.
When the master bridge receives a DECERR response from the AXI bus, the request is discarded
and the Master DECERR (MDE) interrupt is asserted. If the request was non-posted, a completion
packet with the Completion Status = Unsupported Request (UR) is returned on the bus for PCIe.
When the master bridge receives a SLVERR response from the addressed AXI slave, the request is
discarded and the Master SLVERR (MSE) interrupt is asserted. If the request was non-posted, a
completion packet with the Completion Status = Completer Abort (CA) is returned on the bus for PCIe.
Completion Packets
When the MAX_READ_REQUEST_SIZE is greater than the MAX_PAYLOAD_SIZE, a read request for PCIe
can ask for more data than the master bridge can insert into a single completion packet. When this
situation occurs, multiple completion packets are generated up to MAX_PAYLOAD_SIZE, with the Read
Completion Boundary (RCB) observed.
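As a rough model of this splitting, ignoring Read Completion Boundary alignment at the start of the request, the number of completion TLPs is bounded by a ceiling division (helper name ours):

```c
/* Upper-bound count of completion TLPs for one read request when
 * MAX_READ_REQUEST_SIZE > MAX_PAYLOAD_SIZE. Assumes an RCB-aligned
 * start address; RCB-straddling corner cases can add packets. */
static unsigned completion_count(unsigned req_bytes, unsigned max_payload)
{
    return (req_bytes + max_payload - 1) / max_payload;
}
```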
Poison Bit
When the poison bit is set in a transaction layer packet (TLP) header, the payload following the
header is corrupted. When the master bridge receives a memory request TLP with the poison bit set,
it discards the TLP and asserts the Master Error Poison (MEP) interrupt strobe.
When the master bridge receives a read request with the Length = 0x1, FirstBE = 0x00, and LastBE =
0x00, it responds by sending a completion with Status = Successful Completion.
When the master bridge receives a write request with the Length = 0x1, FirstBE = 0x00, and LastBE
= 0x00, there is no effect.
The normal operation of the functional mode is dependent on the integrated block for PCIe
establishing and maintaining the point-to-point link with an external device for PCIe. If the link has
been lost, it must be re-established to return to normal operation.
When a Hot Reset is received by the functional mode, the link goes down and the PCI Configuration
Space must be reconfigured.
Initiated AXI4 write transactions that have not yet completed on the AXI4 bus when the link goes down
have a SLVERR response given and the write data is discarded. Initiated AXI4 read transactions that
have not yet completed on the AXI4 bus when the link goes down have a SLVERR response given,
with arbitrary read data returned.
Any MemWr TLPs for PCIe that have been received, but the associated AXI4 write transaction has not
started when the link goes down, are discarded.
Interrupts
The QDMA supports up to 2K total MSI-X vectors. A single MSI-X vector can be used to support
multiple queues.
The QDMA supports Interrupt Aggregation. Each vector has an associated Interrupt Aggregation Ring.
The QID and status of queues requiring service are written into the Interrupt Aggregation Ring. When
a PCIe® MSI-X interrupt is received by the Host, the software reads the Interrupt Aggregation Ring to
determine which queue needs service. The mapping of queues to vectors is programmable through
the vector number provided in the queue context. MSI-X interrupt modes are supported for both SR-IOV and non-SR-IOV.
Interrupt Engine
The QDMA Interrupt Engine handles the queue based interrupts and the error interrupt.
This block diagram is of the Interrupt Engine.
When an H2C or C2H interrupt occurs, the engine first reads the QID-to-vector table. The table has 2K entries to
support up to 2K queues. Each entry of the table includes two portions: one for H2C interrupts, and
one for C2H interrupts. The table maps the QID to the vector, and indicates if the interrupt is direct
interrupt mode or indirect interrupt mode. If it is direct interrupt mode, the vector is used to generate
the PCIe MSI-X message. If it is indirect interrupt mode, the vector is the ring index, which is the index
of the Interrupt Context for the Interrupt Aggregation Ring.
The following is the data in the QID to vector table.
For direct interrupt, the QDMA processes the interrupt with the following steps.
For indirect interrupt, it does interrupt aggregation. The following are some restrictions for the interrupt
aggregation.
Each Interrupt Aggregation Ring can only be associated with one function. But multiple rings can
be associated with the same function.
The interrupt engine supports up to three interrupts from the same source, until software services
the interrupts.
In the indirect interrupt, the QDMA processes the interrupt with the following steps.
The Interrupt Context includes the information of the Interrupt Aggregation Ring. It has 256 entries to
support up to 256 Interrupt Aggregation Rings.
The following is the Interrupt Context Structure (0x8).
The software needs to size the Interrupt Aggregation Ring appropriately. Each source can send up to
three messages to the ring. Therefore, the size of the ring needs to satisfy the following formula:
Number of entries >= 3 * (number of queues + error interrupts that are mapped to this ring)
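This sizing rule is simple enough to encode directly; the helper name is ours:

```c
/* Minimum Interrupt Aggregation Ring size: each source (queue or error
 * interrupt) mapped to the ring can have up to three messages pending. */
static unsigned min_ring_entries(unsigned num_queues, unsigned num_err_ints)
{
    return 3 * (num_queues + num_err_ints);
}
```

For example, a ring serving 64 queues plus the error interrupt needs at least 195 entries.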
The Interrupt Context is programmed by the context access. The QDMA_IND_CTXT_CMD.Qid has
the ring index, which is from the Qid to Vector Table. The operation of MDMA_CTXT_CMD_CLR can
clear all of the bits in the Interrupt Context. The MDMA_CTXT_CMD_INV can clear the valid bit.
After it looks up the Interrupt Context, it then writes to the Interrupt Aggregation Ring. It also updates
the Interrupt Context with the new PIDX, color, and the interrupt state.
This is the Interrupt Aggregation Ring entry structure. It has 8B data.
When the software allocates the memory space for the Interrupt Aggregation Ring, the coal_color
starts at 1'b0. The software needs to initialize the color bit of the Interrupt Context to 1'b1.
When the hardware writes to the Interrupt Aggregation Ring, it reads the color bit from the Interrupt
Context and writes it into the coal_color field of the entry.
If the interrupt source is the C2H stream, the entry reflects the status descriptor of the C2H
Completion Ring, and the software can read the pidx of the C2H Completion Ring.
If the interrupt source is anything else (H2C stream, H2C MM, C2H MM), the entry reflects the
status descriptor of that source, and the software can read the cidx.
Finally, the QDMA sends out the PCIe MSI-X message using the interrupt vector from the Interrupt
Context.
When the PCIe MSI-X interrupt is received by the Host, the software reads the Interrupt Aggregation
Ring to determine which queue needs service. After the software reads the Interrupt Aggregation
Ring, it does a dynamic pointer update of the software CIDX to indicate the cumulative pointer that
the software has read up to. The software does the dynamic pointer update using the register
QDMA_DMAP_SEL_INT_CIDX[2048] (0x6400). If the software cidx is equal to the pidx, this triggers a
write of the interrupt state of that queue to the Interrupt Context, indicating to the QDMA that the
software has already read all of the entries in the Interrupt Aggregation Ring. If the software cidx is
not equal to the pidx, the QDMA sends out another PCIe MSI-X message so that the software can
read the Interrupt Aggregation Ring again. After that, the software can perform a pointer
update of the interrupt source ring. For example, for a C2H stream interrupt, the software will update
the pointer of the interrupt source ring, which is the C2H Completion Ring.
These are the steps for the software:
1. After the software gets the PCIe MSI-X message, it reads the Interrupt Aggregation Ring entries.
2. The software uses the coal_color bit to identify the written entries. Each entry has Qid and
Int_type (H2C or C2H). From the Qid and Int_type, the software can check if it is stream or
MM. This points to a corresponding source ring. For example, if it is C2H stream, the source ring
is the C2H Completion Ring. The software can then read the source ring to get information, and
do a dynamic pointer update of the source ring after that.
3. After the software finishes reading all written entries, it does one dynamic pointer update of the
software cidx using the register QDMA_DMAP_SEL_INT_CIDX[2048] (0x6400). The Qid in the
register is the Qid in the last written entry. This tells hardware the pointer of the Interrupt
Aggregation Ring that the software reads to.
If the software cidx is not equal to the PIDX, the hardware will send out another PCIE MSI-X
message, so that the software can read the Interrupt Aggregation Ring again.
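The driver-side loop in steps 1-3 can be sketched as follows. The entry layout and field names here are hypothetical; the real 8-byte entry format is defined by the Interrupt Aggregation Ring entry structure.

```c
#include <stdint.h>

/* Hypothetical software view of an aggregation ring entry. */
struct intr_ring_entry {
    uint16_t qid;
    uint8_t  int_type;   /* 0 = H2C, 1 = C2H */
    uint8_t  coal_color; /* written by hardware, toggles on each wrap */
};

/* Consume entries whose coal_color matches the expected color and
 * return the CIDX to report through QDMA_DMAP_SEL_INT_CIDX. Software
 * flips its expected color on wrap, which distinguishes newly written
 * entries from stale ones left over from the previous pass. */
static uint32_t drain_ring(const struct intr_ring_entry *ring,
                           uint32_t ring_size, uint32_t cidx, uint8_t *color)
{
    while (ring[cidx].coal_color == *color) {
        /* service ring[cidx].qid here: read its source ring (for
         * example, the C2H Completion Ring) and update its pointer */
        if (++cidx == ring_size) {
            cidx = 0;
            *color ^= 1;
        }
    }
    return cidx;
}
```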
The following diagram shows the indirect interrupt flow. The Interrupt module gets the interrupt
requests. It first writes to the Interrupt Aggregation Ring. Then it waits for the write completions. After
that, it sends out the PCIe MSI-X message. The interrupt requests can keep on coming, and the
Interrupt module keeps on processing them. In the meantime, the software reads the Interrupt
Aggregation Ring and it does the dynamic pointer update. If the software CIDX is not equal to the
PIDX, it will send out another PCIe MSI-X message.
Legacy Interrupt
The QDMA supports the legacy interrupt for the physical function, and it is expected that a single
queue will be associated with the interrupt.
To enable the legacy interrupt, the software needs to set the en_lgcy_intr bit in the register
QDMA_GLBL_INTERRUPT_CFG (0x288). When en_lgcy_intr is set, the QDMA will not send out
MSI-X interrupt.
When the legacy interrupt wire INTA, INTB, INTC, or INTD is asserted, the QDMA hardware sets the
lgcy_intr_pending bit in the QDMA_GLBL_INTERRUPT_CFG (0x288) register. When the software
receives the legacy interrupt, it needs to clear the lgcy_intr_pending bit. The hardware will keep
the legacy interrupt wire asserted until the software clears the lgcy_intr_pending bit.
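A sketch of the enable/acknowledge sequence follows. The bit positions are placeholders; consult the QDMA_GLBL_INTERRUPT_CFG (0x288) definition for the actual layout.

```c
#include <stdint.h>

/* Placeholder bit positions in QDMA_GLBL_INTERRUPT_CFG (0x288); the
 * actual positions are defined by the register reference. */
#define EN_LGCY_INTR      (1u << 0) /* assumed position */
#define LGCY_INTR_PENDING (1u << 1) /* assumed position */

/* With en_lgcy_intr set, the QDMA raises INTA-INTD instead of MSI-X. */
static uint32_t enable_legacy(uint32_t cfg)
{
    return cfg | EN_LGCY_INTR;
}

/* ISR tail: clearing lgcy_intr_pending lets the hardware deassert the
 * legacy interrupt wire. */
static uint32_t ack_legacy(uint32_t cfg)
{
    return cfg & ~LGCY_INTR_PENDING;
}
```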
User Interrupt
Figure: Interrupt
There are Leaf Error Aggregators in different places. They log the errors and propagate the errors to
the Central Error Aggregator. Each Leaf Error Aggregator has an error status register and an error
mask register. The error mask is an enable mask. Irrespective of the enable mask value, the error status
register always logs the errors; only when the error mask is enabled does the Leaf Error Aggregator
propagate the error to the Central Error Aggregator.
The Central Error Aggregator aggregates all of the errors together. When any error occurs, it can
generate an Error Interrupt if the err_int_arm bit is set in the error interrupt register
QDMA_GLBL_ERR_INT (0xB04). The err_int_arm bit is set by the software and cleared by the
hardware when the Error Interrupt is taken by the Interrupt Engine. The Error Interrupt is for all of the
errors, including the H2C errors and C2H errors. The software must set the err_int_arm bit again
to generate another interrupt.
The Error Interrupt supports the direct interrupt only. Register QDMA_GLBL_ERR_INT bit[23],
en_coal must always be programmed to 0 (direct interrupt).
The Error Interrupt gets the vector from the error interrupt register QDMA_GLBL_ERR_INT. For the
direct interrupt, the vector is the interrupt vector index of the MSI-X table.
The Error Interrupt is processed as follows:
1. The QDMA reads the Error Interrupt register QDMA_GLBL_ERR_INT (0xB04) to get the function
and vector numbers.
2. It sends out the PCIe MSI-X message.
The following figure shows the error interrupt register block diagram.
Queue Management
The Function Map Table is used to allocate queues to each function. The index into the RAM is the
function number. Each entry contains the base number of the physical QID and the number of queues
allocated to the function. It provides a function based, queue access protection mechanism by
translating and checking accesses to logical queues (through QDMA_TRQ_SEL_QUEUE_PF and
QDMA_TRQ_SEL_QUEUE_VF address space) to their physical queues. Direct register accesses to
queue space beyond what is allocated to the function in the table will be canceled and an error will be
logged.
The table can be programmed through the QDMA_TRQ_SEL_FMAP address space. Because this
space only exists in the PF address map, only a physical function can modify this table.
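The protection check that the Function Map Table performs can be modeled as below; the structure and helper are illustrative, not the hardware encoding.

```c
#include <stdint.h>

/* Software model of one Function Map Table entry (indexed by function
 * number): base physical QID plus the count allocated to the function. */
struct fmap_entry {
    uint32_t qid_base;
    uint32_t qid_count;
};

/* Translate a function-relative (logical) QID to a physical QID,
 * rejecting accesses beyond the function's allocation (returns -1,
 * mirroring the canceled access and logged error). */
static int64_t fmap_translate(const struct fmap_entry *e, uint32_t logical_qid)
{
    if (logical_qid >= e->qid_count)
        return -1;
    return (int64_t)e->qid_base + logical_qid;
}
```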
Queue Setup
Virtualization
QDMA implements SR-IOV pass-through virtualization, where the adapter exposes a separate virtual
function (VF) for use by a virtual machine (VM). A physical function (PF) can optionally be made
privileged with full access to QDMA registers and resources, while VFs implement only per-queue
pointer update registers and interrupts. VF drivers must communicate with the driver attached to the
PF through the mailbox for configuration, resource allocation, and exception handling. The QDMA
implements function level reset (FLR) to enable the operating system on a VM to reset the device
without interfering with the rest of the platform.
Type: Queue context/other control registers
Notes: Registers for context access, controlled only by PFs (all 4 PFs).

Type: Status and statistics registers
Notes: Mainly PF-only registers. VFs need to coordinate with a PF driver for error handling,
communicating through the mailbox with the driver attached to the PF.

Type: Data path registers
Notes: Both PFs and VFs must be able to write the registers involved in the data path without
needing to go through a hypervisor. Pointer updates for H2C/C2H Descriptor Fetch can be done
directly by a VF or PF for the queues associated with the function using its own BAR space. Any
pointer update to a queue that does not belong to the function is dropped and an error is logged.

Type: Other protection recommendations
Notes: Turn on the IOMMU to protect against bad memory accesses from VMs.

Type: PF driver and VF driver communication
Notes: The VF driver needs to communicate with the PF driver to request operations that have a
global effect. This communication channel needs the ability to pass messages and generate
interrupts, and it utilizes a set of hardware mailboxes for each VF.
Mailbox
In a virtualized environment, the driver attached to a PF has enough privilege to program and access
QDMA registers. All lesser-privileged functions (certain PFs and all VFs) must communicate with the
privileged driver using the mailbox mechanism. The communication API must be defined by the
driver; the QDMA IP does not define it.
Each function (both PF and VF) has an inbox and an outbox, each of which can hold a 128B
message. A VF accesses its own mailbox, while a PF accesses its own mailbox and the mailboxes of
all the functions (PF or VF) associated with that PF.
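Because the message format is defined by the driver rather than the IP, a driver is free to choose its own layout. The following is one purely hypothetical layout that fits the 128B inbox/outbox; none of these field names come from this guide:

```c
#include <stdint.h>

#define QDMA_MBOX_MSG_BYTES 128u   /* mailbox message size from the text */

/* Hypothetical driver-defined mailbox message: a small header followed
 * by an opcode-specific payload, filling the 128B mailbox exactly. */
struct mbox_msg {
    uint8_t  opcode;    /* driver-defined operation code            */
    uint8_t  src_func;  /* function ID of the sender                */
    uint16_t seq;       /* sequence number for matching replies     */
    uint8_t  payload[QDMA_MBOX_MSG_BYTES - 4];
};

_Static_assert(sizeof(struct mbox_msg) == QDMA_MBOX_MSG_BYTES,
               "message must fill the 128B mailbox exactly");
```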
✎ Note: Enabling mailbox will increase PL utilization.
The QDMA mailbox allows the following access:
Figure: Mailbox
VF To PF Messaging
A VF can post only one message at a time to a target PF mailbox, and must wait until the target
function (PF) accepts it. Before posting a message, the source function should make sure its
o_msg_status is cleared; the VF can then write the message to its Outgoing Message Registers. After
finishing the message write, the VF issues the msg_send command, which posts the message and
sets o_msg_status.
PF To VF Messaging
The messaging flow from a PF to the VFs that belong to its VFG is slightly different than the VF to PF
flow because:
A PF can send messages to multiple destination functions; therefore, it may receive multiple
acknowledgments at the moment it checks the status. As illustrated in the following figure, a PF
driver must set the Mailbox Target Function Register to the destination function ID before performing
any mailbox operation.
The mailbox hardware asserts the ack_status field in the Status Register (0x22400) when any bit is
asserted in the Acknowledge Status Register (ASR). The PF driver can poll ack_status before actually
reading out the Acknowledge Status Registers, and may detect multiple completions through one
register access. After processing them, the PF driver should write the value back to the same register
address to clear the status.
Mailbox Interrupts
The mailbox module supports interrupts as an alternative event notification mechanism. Each mailbox
has an Interrupt Control Register (at offset 0x22410 for a PF, or offset 0x5010 for a VF). Write 1 to
this register to enable the interrupt. Once the interrupt is enabled, the mailbox sends an interrupt to
the QDMA whenever there is a pending event for the mailbox to process, namely any pending
incoming message or any acknowledgment for the outgoing messages. Configure the
Function Level Reset
The function level reset (FLR) mechanism enables the software to quiesce and reset Endpoint
hardware with function-level granularity. When a VF is reset, only the resources associated with that
VF are reset. When a PF is reset, all resources of the PF, including those of its associated VFs, are
reset. Because FLR is a privileged operation, it must be performed by the PF driver running in the
management system.
Use Mode
The hypervisor requests an FLR when a function is attached or detached (that is, powered on or off).
You can request an FLR as follows:
where $BDF is the bus device function number of the targeted function.
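The command itself was not recovered in this section. On Linux, an FLR is conventionally requested through the function's sysfs reset node, for example `echo 1 > /sys/bus/pci/devices/$BDF/reset`; the exact node is an assumption here, not a quote from this guide. A small helper formatting that path:

```c
#include <stdio.h>

/* Format the standard Linux sysfs reset path for a function. Writing
 * "1" to this node requests a reset of the function identified by bdf
 * (e.g. "0000:01:00.1"). The sysfs location is the usual Linux one and
 * is an assumption of this sketch. Returns the formatted length. */
int flr_reset_path(char *buf, size_t len, const char *bdf)
{
    return snprintf(buf, len, "/sys/bus/pci/devices/%s/reset", bdf);
}
```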
FLR Process
A complete FLR process involves three major steps.
1. Pre-FLR: Pre-FLR resets all QDMA context structure, mailbox, and user logic of the target
function.
Each function has a register called MDMA_PRE_FLR_STATUS, which keeps track of the
pre-FLR status of the function. The offset is calculated as
MDMA_PRE_FLR_STATUS_OFFSET = MB_base + 0x100, which is located at offset
0x100 from the mailbox memory space of the function. Note that PF and VF have different
MB_base. The definition of MDMA_PRE_FLR_STATUS is shown in the table below.
The software writes 1 to MDMA_PRE_FLR_STATUS[0] (bit 0) of the target function to
initiate pre-FLR. Hardware will clear MDMA_PRE_FLR_STATUS[0] when pre-FLR
completes. The software keeps polling on MDMA_PRE_FLR_STATUS[0], and only
proceeds to the next step when it returns 0.
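The initiate-and-poll handshake above can be sketched as follows. The register is modeled here as plain memory rather than a mapped mailbox BAR, and the retry bound is an addition of this sketch (a real driver would size its own timeout):

```c
#include <stdbool.h>
#include <stdint.h>

/* Offset rule from the text: MB_base + 0x100 (PF and VF have
 * different MB_base values). */
#define MDMA_PRE_FLR_STATUS_OFFSET(mb_base) ((mb_base) + 0x100u)

/* Software sets bit 0 of MDMA_PRE_FLR_STATUS to initiate pre-FLR,
 * then polls until hardware clears it. Returns true when hardware
 * cleared the bit, false if the poll budget was exhausted. */
bool pre_flr(volatile uint32_t *pre_flr_status, unsigned max_polls)
{
    *pre_flr_status |= 1u;                 /* initiate pre-FLR       */
    while (max_polls--) {
        if ((*pre_flr_status & 1u) == 0)   /* hardware cleared bit 0 */
            return true;                   /* safe to proceed        */
    }
    return false;                          /* timed out              */
}
```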
OS Support
If the PF driver is loaded and alive (that is, use mode 1), all three of the aforementioned steps are
performed by the driver. However, for Versal devices, if you want to perform an FLR before loading
the PF driver (as defined in Use Mode above), an OS kernel patch is provided that allows the OS to
perform the correct FLR sequence through functions defined in //…/source/drivers/pci/quick.c.
Mailbox IP
You need to add a new IP from the IP catalog to instantiate the pcie_qdma_mailbox Mailbox IP. This
IP is needed for function virtualization. The pcie_qdma_mailbox IP should be connected to the
versal_cips IP as shown in the following diagram:
Follow the above diagram to make all necessary connections. The Mailbox IP has two clocks,
axi_aclk and ip_clk, and two resets, axi_aresetn and ip_resetn. Connect the two clocks
together, and the two resets together.
✎ Note: Mailbox access can be steered to the NoC0 or NoC1 port based on the CIPS GUI
configuration. You should configure the NoC based on the CIPS GUI selection.
Port ID is a categorization applied to queues on the FPGA side. When the DMA is shared by more
than one user application, the port ID provides a level of indirection on top of the QID so that all the
interfaces can be demultiplexed at lower cost. When the DMA is used by a single application,
however, the port ID can be ignored; drive the port ID inputs to 0.
System Management
Resets
The QDMA supports all the PCIe defined resets, such as link down, reset, hot reset, and function level
reset (FLR) (supports only Quiesce mode).
VDM
Vendor Defined Messages (VDMs) are an expansion of the existing messaging capabilities with PCI
Express. PCI Express Specification defines additional requirements for Vendor Defined Messages,
header formats and routing information. For details, see PCI-SIG Specifications
(https://www.pcisig.com/specifications).
QDMA allows the transmission and reception of VDMs. To enable this feature, select Enable Bridge
Slave Mode in the Vivado Customize IP dialog box. This enables the st_rx_msg interface.
RX Vendor Defined Messages are stored in a shallow FIFO before they are forwarded to the output
port. When many back-to-back VDM messages arrive, the FIFO overflows and the excess messages
are dropped, so it is better to send VDM messages at regular intervals.
Throughput for VDMs depends on several factors: PCIe speed, data width, message length, and the
internal VDM pipeline.
For network on chip (NoC) access, the internal VDM pipeline is replaced with the internal RX VDM
FIFO interface, which has a shallow 64B buffer.
✎ Note: New VDM messages are dropped if more than 64B of VDM data is received before the FIFO
is serviced through the NoC.
The internal RX VDM FIFO interface cannot handle back-to-back messages. The pipeline can accept
only one in every four accesses, which is about 25% efficiency relative to host access.
‼ Important: Do not use back-to-back VDM access.
RX Vendor Defined Messages:
1. When the QDMA receives a VDM, the incoming message is received on the st_rx_msg port.
2. The incoming data stream is captured on the st_rx_msg_data port (one DW per beat).
3. The user application drives st_rx_msg_rdy to signal whether it can accept incoming VDMs.
4. Once st_rx_msg_rdy is High, the incoming VDM is forwarded to the user application.
5. The user application needs to store the incoming VDMs and keep track of how many packets
were received.
Config Extend
The PCIe Configuration Extend interface can be selected to provide more configuration space. When
the Configuration Extend interface is selected, you are responsible for adding the logic that extends
the interface so that it works properly.
Expansion ROM
If selected, the Expansion ROM is activated and can be assigned a size from 2 KB to 4 GB.
According to the PCI Local Bus Specification (PCI-SIG Specifications
(https://www.pcisig.com/specifications)), the maximum size for the Expansion ROM BAR should be
no larger than 16 MB. Selecting an address space larger than 16 MB can result in a non-compliant
core.
Errors
Bridge Errors
Slave bridge abnormal conditions are classified as Illegal Burst Type errors and Completion TLP
errors. The following sections describe how the bridge handles these errors.
Illegal Burst Type
The slave bridge monitors AXI read and write burst type inputs to ensure that only the INCR
(incrementing burst) type is requested. Any other value on these inputs is treated as an error condition
and the Slave Illegal Burst (SIB) interrupt is asserted. In the case of a read request, the Bridge
asserts SLVERR for all data beats and arbitrary data is placed on the Slave AXI4-MM read data bus.
In the case of a write request, the Bridge asserts SLVERR for the write response and all write data is
discarded.
The following sections describe the manner in which the master bridge handles abnormal conditions.
AXI DECERR Response
When the master bridge receives a DECERR response from the AXI bus, the request is discarded
and the Master DECERR (MDE) interrupt is asserted. If the request was non-posted, a completion
packet with the Completion Status = Unsupported Request (UR) is returned on the bus for PCIe.
AXI SLVERR Response
When the master bridge receives a SLVERR response from the addressed AXI slave, the request is
discarded and the Master SLVERR (MSE) interrupt is asserted. If the request was non-posted, a
completion packet with the Completion Status = Completer Abort (CA) is returned on the bus for PCIe.
Max Payload Size for PCIe, Max Read Req
Completion Packets
When the MAX_READ_REQUEST_SIZE is greater than the MAX_PAYLOAD_SIZE, a read request for PCIe
can ask for more data than the master bridge can insert into a single completion packet. When this
situation occurs, multiple completion packets are generated up to MAX_PAYLOAD_SIZE, with the Read
Completion Boundary (RCB) observed.
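A minimal software model of this splitting rule (illustrative only; it is not the hardware implementation) counts the completion packets generated for a read request, cutting each completion at an RCB-aligned address within the payload limit:

```c
#include <stdint.h>

/* Split a read request into completion packets of at most max_payload
 * bytes, with every boundary between adjacent completions falling on
 * an RCB-aligned address (only the final completion may end
 * unaligned). Returns the number of completion packets.
 * max_payload and rcb must be powers of two, max_payload >= rcb. */
unsigned split_completions(uint64_t addr, uint32_t len,
                           uint32_t max_payload, uint32_t rcb)
{
    unsigned count = 0;
    uint64_t end = addr + len;

    while (addr < end) {
        /* Largest RCB-aligned cut point within max_payload of addr. */
        uint64_t cut = (addr + max_payload) & ~(uint64_t)(rcb - 1);
        if (cut >= end)
            cut = end;            /* last completion: take the rest */
        addr = cut;
        count++;
    }
    return count;
}
```

For example, a 0x300-byte read starting at the unaligned address 0x10, with a 256B payload limit and a 64B RCB, is returned as four completions.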
Linkdown Errors
If the PCIe link goes down during DMA operations, transactions may be lost and the DMA may not be
able to complete. In such cases, the AXI4 interfaces will continue to operate. Outstanding read
requests on the C2H Bridge AXI4 MM interface receive correct completions or completions with a
slave error response. The DMA will log a link down error in the status register. It is the responsibility of
the driver to have a timeout and handle recovery of a link down situation.
Data protection is supported on the primary data paths. CRC errors can occur on C2H streaming and
H2C streaming. Parity errors can occur on the Memory Mapped, Bridge Master, and Bridge Slave
interfaces. Errors on the write payload can occur on C2H streaming, Memory Mapped, and Bridge
Slave. A double-bit error on the write payload or on read completions for the Bridge Slave interface
causes a parity error.
DMA Errors
All DMA errors are logged in their respective error status registers. Each block has an error status
register and an error mask register so that errors can be propagated to the higher level and eventually
to the QDMA_GLBL_ERR_STAT register.
Errors can be marked fatal based on register settings. On a fatal error, the DMA stops the transfer
and sends an interrupt if enabled. After debug and analysis, you must invalidate and restart the queue
to resume DMA transfers.
Error Aggregator
There are Leaf Error Aggregators in different places. They log the errors and propagate them to the
central place. The Central Error Aggregator aggregates the errors from all of the Leaf Error
Aggregators.
The QDMA_GLBL_ERR_STAT register is the error status register of the Central Error Aggregator. Its
bit fields indicate which Leaf Error Aggregators have logged errors. Then, read the error status
register of the individual Leaf Error Aggregator to find the exact error.
The QDMA_GLBL_ERR_MASK register is the error mask register of the Central Error Aggregator. It
has the mask bits for the corresponding errors. When a mask bit is set to 1'b1, it enables the
corresponding error to be propagated to the next level to generate an interrupt. Detailed information
about the error-generated interrupt is described in the interrupt section. The error interrupt is
controlled by the QDMA_GLBL_ERR_INT register (0xB04).
Each Leaf Error Aggregator has an error status register and an error mask register. The error status
register logs the error. The hardware sets the bit when the error happens, and the software can write
1'b1 to clear the bit if needed. The error mask register has the mask bits for the corresponding errors.
When the mask bit is set to 1'b1, it will enable the propagation of the corresponding error to the
Central Error Aggregator. The error mask register does not affect the error logging to the error status
register.
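The two-level status walk and write-1-to-clear sequence can be sketched as follows. A flat array stands in for MMIO space (an assumption of this sketch, not a driver API), and the mapping from central status bits to leaf status offsets is supplied by the caller from the register lists that follow:

```c
#include <stdint.h>

/* Walk the error hierarchy over a modeled register file indexed by
 * (offset / 4): read the central status, and for each leaf that
 * reported an error, read its status register and write the value back
 * to the same address. In hardware that write-back is W1C and clears
 * the logged bits; here it is only modeled. Returns the number of
 * leaves serviced. leaf_off[i] is the status offset for central bit i. */
unsigned clear_leaf_errors(uint32_t *regs, uint32_t central_off,
                           const uint32_t *leaf_off, unsigned nleaf)
{
    uint32_t central = regs[central_off / 4];
    unsigned cleared = 0;

    for (unsigned i = 0; i < nleaf; i++) {
        if (!(central & (1u << i)))
            continue;                    /* this leaf has no error    */
        uint32_t sts = regs[leaf_off[i] / 4];
        regs[leaf_off[i] / 4] = sts;     /* W1C: write back to clear  */
        cleared++;
    }
    return cleared;
}
```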
The error status registers and the error mask registers of the Leaf Error Aggregators are as follows.
C2H MM Error
QDMA_C2H MM Status (0x1040)
C2H MM Error Code Enable Mask (0x1054)
C2H MM Error Code (0x1058)
C2H MM Error Info (0x105C)
TRQ Error
QDMA_GLBL_TRQ_ERR_STS (0x264): This is the error status register of the Trq errors.
QDMA_GLBL_TRQ_ERR_MSK (0x268): This is the error mask register.
QDMA_GLBL_TRQ_ERR_LOG_A (0x26C): This is the error logging register. It shows the select,
function and the address of the access when the error happens.
Descriptor Error
QDMA_GLBL_DSC_ERR_STS (0x254): This is the error status register.
QDMA_GLBL_DSC_ERR_MSK (0x258): This is the error mask register.
QDMA_GLBL_DSC_ERR_LOG0 (0x25C): This is the error logging register. It has the QID, DMA
direction, and the consumer index of the error.
C2H Streaming Fatal Error
QDMA_C2H_FATAL_ERR_STAT (0xAF8): The error status register of the C2H streaming fatal
errors.
QDMA_C2H_FATAL_ERR_MASK (0xAFC): The error mask register. The software can set a bit to
enable the corresponding C2H fatal error to be sent to the C2H fatal error handling logic.
QDMA_C2H_FATAL_ERR_ENABLE (0xB00): This register enables two C2H streaming fatal
error handling processes:
1. Stop the data transfer by disabling the WRQ from the C2H DMA Write Engine.
2. Invert the WPL parity on the data transfer.
Port Descriptions
pcie0_user_clk O User clock out. PCIe derived clock output for all
interface signals output/input to the QDMA. Use this
clock to drive inputs and gate outputs from QDMA.
dma0_axi_aresetn O User reset out. AXI reset signal synchronous with the
clock provided on the pcie0_user_clk output. This
reset should drive all corresponding AXI Interconnect
aresetn signals.
AXI Bridge Slave ports are connected from the AMD Versal device Network on Chip (NoC) to the
CPM DMA internally. For Slave Bridge AXI4 details, see the Versal Adaptive SoC Programmable
Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).
To access QDMA registers, you must follow the protocols outlined in the AXI Slave Bridge Register
Limitations section.
Related Information
Slave Bridge Registers Limitations
AXI4 (MM) Master ports are connected from the CPM to the AMD Versal device Network on Chip
(NoC) internally. For details, see the Versal Adaptive SoC Programmable Network on Chip and
Integrated Memory Controller LogiCORE IP Product Guide (PG313). The AXI4 Master interface can
be connected to the DDR memory or to the PL user logic, depending on the NoC configuration.
AMD Versal device Network on Chip (NoC) provides only AXI4 interface. If you need AXI4-Lite
interface, use SmartConnect IP to convert NoC output AXI4 interface to AXI4-Lite interface. For
details, see the SmartConnect LogiCORE IP Product Guide (PG247).
dma0_m_axis_h2c_tuser_qid[10:0] O Queue ID
dma0_m_axis_h2c_tuser_port_id[2:0] O Port ID
dma0_m_axis_h2c_mdata[31:0] O Metadata
In internal mode, QDMA passes the lower 32 bits of
the H2C AXI4-Stream descriptor on this field.
dma0_m_axis_h2c_mty[5:0] O The number of bytes that are invalid on the last beat
of the transaction. This field is 0 for a 64B transfer.
dma0_m_axis_h2c_tvalid O Valid
dma0_m_axis_h2c_tready I Ready
dma0_s_axis_c2h_ctrl_user_trig I User trigger. This can trigger the interrupt and the
status descriptor write if they are enabled.
dma0_s_axis_c2h_tvalid I Valid.
dma0_s_axis_c2h_tready O Ready.
dma0_s_axis_c2h_cmpt_tvalid I Valid
dma0_s_axis_c2h_cmpt_tready O Ready
VDM Interface
dma0_st_rx_msg_tvalid O Valid
dma0_st_rx_msg_tdata[31:0] O Beat 1:
{REQ_ID[15:0], VDM_MSG_CODE[7:0],
VDM_MSG_ROUTING[2:0],
dma0_st_rx_msg_tready I Ready.
✎ Note: When this interface is not used, Ready must
be tied-off to 1.
✎ Recommended: RX Vendor Defined Messages are stored in shallow FIFO before they are
transmitted to output ports. When there are many back-to-back VDM messages, the FIFO overflows
and these messages are dropped. It is best to repeat VDM messages at regular intervals.
FLR Interface
dma0_usr_flr_set O Set
Asserted for 1 cycle indicating that the FLR status of
the function indicated on dma0_usr_flr_fnc[7:0] is
active.
dma0_usr_flr_clr O Clear
Asserted for 1 cycle indicating that the FLR status of
the function indicated on dma0_usr_flr_fnc[7:0] is
completed.
dma0_h2c_byp_in_st_error I This bit can be set to indicate an error for the queue.
The descriptor will not be processed. Context will be
updated to reflect an error in the queue
dma0_h2c_byp_in_st_cidx[15:0] I The CIDX that will be used for the status descriptor
update and/or interrupt (aggregation mode). Generally
the CIDX should be left unchanged from when it was
received from the descriptor bypass output interface.
dma0_h2c_byp_in_mm_error I This bit can be set to indicate an error for the queue.
The descriptor will not be processed. Context will be
updated to reflect an error in the queue.
dma0_h2c_byp_in_mm_cidx[15:0] I The CIDX that will be used for the status descriptor
update and/or interrupt (aggregation mode). Generally
the CIDX should be left unchanged from when it was
received from the descriptor bypass output interface.
dma0_c2h_byp_in_st_csh_error I This bit can be set to indicate an error for the queue.
The descriptor will not be processed. Context will be
dma0_c2h_byp_in_st_csh_port_id[2:0] I QDMA port ID
dma0_c2h_byp_in_st_sim_error I This bit can be set to indicate an error for the queue.
The descriptor will not be processed. Context will be
updated to reflect an error in the queue.
dma0_c2h_byp_in_st_sim_port_id[2:0] I QDMA port ID
dma0_c2h_byp_in_mm_error I This bit can be set to indicate an error for the queue.
The descriptor will not be processed. Context will be
updated to reflect an error in the queue.
dma0_c2h_byp_in_mm_cidx[15:0] I The user must echo the CIDX from the descriptor that
it received on the bypass-out interface.
dma0_dsc_crdt_in_qid[10:0] I The QID associated with the descriptor ring for which
the credits are being added.
dma0_tm_dsc_sts_port_id[2:0] O The port ID associated with the queue from the queue
context.
User Interrupts
dma0_usr_irq_vld I Valid
An assertion indicates that an interrupt associated
with the vector, function, and pending fields on the
bus should be generated to PCIe. Once asserted,
dma0_usr_irq_vld must remain high until
dma0_usr_irq_ack is asserted by the DMA.
NoC Ports
✎ Note: NoC ports are always connected to the NoC. You cannot leave them unconnected or connect
them to any other blocks; doing so results in synthesis and implementation errors. For connection
reference, see the following figure:
✎ Note: All NoC-related clock frequencies can be modified in the PS PMC GUI settings. By default,
all the clock frequencies are set to the maximum for the corresponding configuration.
✎ Note: Mailbox ports are always connected to the Mailbox IP. If the Mailbox IP is not used, leave
the ports unconnected (floating). For connection reference, see the following figure.
Register Space
All the physical function (PF) registers are found in cpm4-qdma-v2-1-registers.csv available in the
register map files.
To locate the register space information:
Register Name Base (Hex) Byte Size (Dec) Register List and Details
Maximum of 32 vectors per
function.
QDMA_CSR (0x0000)
QDMA_TRQ_MSIX (0x2000)
Mailbox Addressing
PF addressing
Addr = PF_Bar_offset + CSR_addr
VF addressing
Addr = VF_Bar_offset + VF_Start_offset + VF_offset + CSR_addr
[1] 0 RO o_msg_status For VF: The status bit is set when the VF
driver writes msg_send to its command
register. When the associated PF driver
sends an acknowledgment to this VF, the
hardware clears this field. The VF driver is
not allowed to update any content in its
outgoing mailbox memory (OMM) while
o_msg_status is asserted. Any illegal
write to the OMM is discarded
(optionally, this can cause an error on the
AXI4-Lite response channel).
For PF: The field indicates the message
status of the target function specified
in the Target FN Register. The status bit
is set when the PF driver sends the
msg_send command. When the
corresponding function driver sends an
acknowledgment by sending msg_rcv, the
hardware clears this field. The PF driver is
not allowed to update any content in its
outgoing mailbox memory (OMM) while
o_msg_status(target_fn_id) is asserted.
Any illegal write to the OMM is
discarded (optionally, this can cause an
error on the AXI4-Lite response channel).
QDMA_TRQ_SEL_QUEUE_PF (0x6400)
QDMA_DMAP_SEL_H2C_DSC_PIDX[2048] (0x6404) 0x6404-0xB3F4 H2C Descriptor Producer
Index (PIDX)
QDMA_DMAP_SEL_C2H_DSC_PIDX[2048] (0x6408) 0x6408-0xB3F8 C2H Descriptor Producer
Index (PIDX)
There are 2048 queues, and each queue has its own set of these registers. All of these registers can
be dynamically updated at any time and are accessed based on the queue number.
For Queue 0:
For Queue 1:
For Queue 2:
QDMA_DMAP_SEL_INT_CIDX[2048] (0x6400)
QDMA_DMAP_SEL_H2C_DSC_PIDX[2048] (0x6404)
QDMA_DMAP_SEL_C2H_DSC_PIDX[2048] (0x6408)
QDMA_DMAP_SEL_CMPT_CIDX[2048] (0x640C)
QDMA_TRQ_SEL_QUEUE_VF (0x3000)
VF functions can access the per-queue direct update registers at offset (0x3000). The description for
this register space is the same as QDMA_TRQ_SEL_QUEUE_PF (0x6400).
This set of registers is accessed based on the queue number. The queue number is the absolute
queue number, [0 to 256].
For Queue 0:
For Queue 1:
QDMA_TRQ_MSIX_VF (0x400)
VF functions can access the MSI-X table with offset (0x0000) from that function. The description for
this register space is the same as QDMA_TRQ_MSIX (0x2000).
QDMA_VF_MAILBOX (0x1000)
[1] 0 RO o_msg_status For VF: The status bit is set when the VF
driver writes msg_send to its command
register. When the associated PF driver
sends an acknowledgment to this VF, the
hardware clears this field. The VF driver is
not allowed to update any content in its
The DMA register space can be accessed using the AXI Slave interface. When AXI Slave Bridge
mode is enabled (based on GUI settings), you can also access the Bridge registers and the host
memory space.
Slave Bridge access to host memory space: 0xE001_0000 - 0xEFFF_FFFF,
0x6_1101_0000 - 0x7_FFFF_FFFF, and 0x80_0000_0000 - 0xBF_FFFF_FFFF. The address range for
Slave Bridge access is set during IP customization in the Address Editor tab of the Vivado IDE.
If you want to access the QDMA register space from the AXI Slave interface, the offset is
0x6_1000_0000. The QDMA supports 256 functions, and 64 KB of space is allocated for each
function. To access any CSR or queue space register for a function, use "AXI Slave address offset +
function offset + register offset".
Function offset (64 KB): 0x1_0000
Register offset within each function is listed at QDMA PF Address Register Space.
For example, to access queues space registers for function 0: 0x6_1000_6400.
For example, to access queues space registers for function 1: 0x6_1001_6400.
For example, to access queues space registers for function 2: 0x6_1002_6400.
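The addressing rule above can be captured in a small helper; the constants mirror the offsets stated in this section, and the function is checked against the three worked examples:

```c
#include <stdint.h>

#define QDMA_AXI_SLAVE_BASE 0x610000000ULL  /* 0x6_1000_0000          */
#define QDMA_FUNC_STRIDE    0x10000ULL      /* 64 KB per function     */

/* AXI Slave address = slave offset + function offset + register
 * offset, with a 64 KB stride per function. */
uint64_t qdma_axi_reg_addr(unsigned func, uint32_t reg_offset)
{
    return QDMA_AXI_SLAVE_BASE + func * QDMA_FUNC_STRIDE + reg_offset;
}
```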
Bridge register addresses start at 0xE00. Addresses from 0x00 to 0xE00 are directed to the PCIe
configuration register space.
All the bridge registers are listed in the cpm4-bridge-v2-1-registers.csv available in the register map
files.
To locate the register space information:
This lab describes the process of generating an AMD Versal™ device QDMA design with AXI4
interface connected to network on chip (NoC) IP and DDR memory. This design has the following
configurations:
AXI4 memory mapped (AXI MM) connected to DDR through the NoC IP
Gen3 x 16
MSI-X interrupts
This lab provides step by step instructions to configure a Control, Interfaces and Processing System
(CIPS) QDMA design and network on chip (NoC) IP. The following figure shows the AXI4 Memory
Mapped (AXI-MM) interface to DDR using the NoC IP. At the end of this lab, you can synthesize and
implement the design, and generate a Programmable Device Image (PDI) file. The PDI file is used to
program the Versal device and run data traffic on a system. For the AXI-MM interface host to chip
(H2C) transfers, data is read from Host and sent to DDR memory. For chip to host (C2H) transfers,
data is read from DDR memory and written to host.
This lab targets a xcvc1902-vsvd1760-1LP-e-S-es1 part on a VCK5000 board. This lab connects to
DDR memory found outside the Versal device. For more information, see QDMA AXI MM Interface to
NoC and DDR Lab.
This lab describes the process of generating an AMD Versal™ device QDMA design containing 4
PFs, 252 VFs and AXI4 Memory Mapped interface connected to the network on chip (NoC) IP and
DDR memory. This design has the following configurations:
AXI4 memory mapped (AXI MM) connected to DDR through the NoC IP
Gen3 x 16
4 physical functions (PFs) and 252 virtual functions (VFs) with Mailbox connections.
MSI-X interrupts
This lab provides step by step instructions to configure a Control, Interfaces and Processing System
(CIPS) QDMA design and network on chip (NoC) IP. The following figure shows the AXI4 Memory
Mapped (AXI-MM) interface to DDR using the NoC IP. At the end of this lab, you can synthesize and
implement the design, and generate a Programmable Device Image (PDI) file. The PDI file is used to
program the Versal device and run data traffic on a system. For the AXI-MM interface host to chip
(H2C) transfers, data is read from Host and sent to DDR memory. For chip to host (C2H) transfers,
data is read from DDR memory and written to host.
This lab targets a xcvc1902-vsvd1760-1LP-e-S-es1 part on a VCK5000 board. This lab connects to
DDR memory found outside the Versal device. For more information, see QDMA AXI MM Interface to
NoC and DDR with Mailbox.
CPM4_QDMA_Gen4x8_MM_ST_Design
Implementation Functional
example design.
Versal_CPM_QDMA_EP_Design
CPM4_QDMA_Gen4x8_MM_ST_Performance_Design
Implementation Performance
example design.
Versal_CPM_QDMA_EP_Simulation_Design
No preset Simulation Functional
example design.
The associated drivers can be downloaded from GitHub. For more information about CED, see Versal
Adaptive SoC CPM Example Designs.
1. Launch Vivado.
2. Check whether the designs are installed and update if required.
5. Click install for any newly added designs or click Update for any updates to the designs and
close it.
6. Click Open Example Project > Next, select the appropriate design, click Next.
8. Choose the board or part option available. Based on the board selected, appropriate CPM block
is enabled. For example, for VCK190 board, CPM4 block is enabled. Similarly, for VPK120
board, CPM5 block is enabled.
Versal_CPM_QDMA_EP_Design
This design has CPM4 – QDMA0 enabled in Gen4x8 configuration as an End Point
The design targets VCK190 board and it supports Synthesis and Implementation flows
The associated drivers can be downloaded from GitHub
Enables QDMA AXI4-MM and QDMA AXI-ST functionality with 4 PF and 252 VFs
Capable of exercising AXI4-MM, AXI-ST path, and descriptor bypass
Design also includes DDR
Example design registers can only be controlled through the AXI4-Lite master interface. To test the
QDMA's AXI4-Stream interface, ensure that the AXI4-Lite master interface is present. Following are
the example design registers:
C2H_ST_QID (0x000)
[31:11] 0 NA Reserved
C2H_ST_LEN (0x004)
[31:16] 0 NA Reserved
C2H_CONTROL_REG (0x008)
[31:6] 0 NA Reserved
[4] 0 NA reserved
H2C_CONTROL_REG (0x00C)
[31:30] 0 NA Reserved
[31:15] 0 NA Reserved
[3:1] 0 NA Reserved
C2H_STATUS (0x018)
[31:30] 0 NA Reserved
C2H_PACKET_COUNT (0x020)
[31:10] 0 NA Reserved
C2H_PREFETCH_TAG(0x024)
[31:27] 0 NA Reserved
[15:7] 0 NA Reserved
C2H_COMPLETION_DATA_1 (0x034)
C2H_COMPLETION_DATA_2 (0x038)
C2H_COMPLETION_DATA_3 (0x03C)
C2H_COMPLETION_DATA_4 (0x040)
C2H_COMPLETION_DATA_5 (0x044)
C2H_COMPLETION_DATA_7 (0x04C)
C2H_COMPLETION_SIZE (0x050)
[31:13] 0 NA Reserved
SCRATCH_REG1 (0x064)
C2H_PACKETS_DROP (0x088)
Each AXI-ST C2H transfer can contain one or more descriptors depending on transfer size and C2H
buffer size. This register represents how many of the descriptors were dropped in the current transfer.
This register will reset to 0 at the beginning of each transfer.
C2H_PACKETS_ACCEPTED (0x08C)
Each AXI-ST C2H transfer can contain one or more descriptors depending on the transfer size and
C2H buffer size. This register represents how many of the descriptors were accepted in the current
transfer. This register will reset to 0 at the beginning of each transfer.
DESCRIPTOR_BYPASS (0x090)
[31:3] 0 NA Reserved
USER_INTERRUPT (0x094)
[31:20] 0 NA Reserved
[11:9] 0 NA Reserved
[3:1] 0 NA Reserved
1. Write the function number at bits [19:12]. This corresponds to the function that generates the
usr_irq_in_fnc user interrupt.
2. Write MSI-X Vector number at bits [8:4]. This corresponds to the entry in the MSI-X table that is
set up for usr_irq_in_vec user interrupt.
All three above steps can be done at the same time, with a single write.
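The single-write composition can be sketched as follows. The function and vector bit positions come from the steps above; the trigger bit position (bit [0]) is an assumption of this sketch, since the text for the third step was not recovered and bits [3:1] are reserved:

```c
#include <stdint.h>

/* Compose one USER_INTERRUPT (0x094) write: function number in bits
 * [19:12], MSI-X vector in bits [8:4], and (assumed) trigger in
 * bit [0]. Bits [3:1] and [11:9] are reserved and left at 0. */
uint32_t user_interrupt_word(uint8_t func, uint8_t vector, int trigger)
{
    return ((uint32_t)func << 12) |
           ((uint32_t)(vector & 0x1Fu) << 4) |   /* 5-bit vector field */
           (trigger ? 1u : 0u);
}
```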
Following is the user interrupt timing diagram:
Figure: Interrupt
USER_INTERRUPT_MASK (0x098)
USER_INTERRUPT_VECTOR (0x09C)
Write to user_interrupt[0], or
Write to the user_interrupt_vector[31:0] register with mask set.
DMA_CONTROL (0x0A0)
[31:1] NA Reserved
VDM_MESSAGE_READ (0x0A4)
Vendor Defined Message (VDM) messages, st_rx_msg_data, are stored in a FIFO in the example
design. A read to this register (0x0A4) pops out one 32-bit message at a time.
CPM4_QDMA_Gen4x8_MM_ST_Performance_Design
This design has CPM4 – QDMA0 enabled in Gen4x8 configuration as an End Point
The design targets VCK190 board and it supports Synthesis and Implementation flows
The associated drivers can be downloaded from: https://github.com/Xilinx/dma_ip_drivers
Enables QDMA AXI4-MM and QDMA AXI-ST functionality
Capable of demonstrating the QDMA MM and ST performance
Design also includes DDR
Versal_CPM_QDMA_EP_Simulation_Design
CPM4 Controller0 QDMA Gen4x8. Functional example design that can be used for simulation.
This design has CPM4 – QDMA0 AXI Bridge mode enabled in Gen4x8 configuration as Root
Port
The design targets VCK190 board and it supports Synthesis and Implementation flows
The design implements the Root Complex functionality. It includes the CIPS IP, which enables both
CPM and PS
Debugging
This appendix includes details about resources available on the AMD Support website and debugging
tools.
If the IP requires a license key, the key must be verified. The AMD Vivado™ design tools have several
license checkpoints for gating licensed IP through the flow. If the license check succeeds, the IP can
continue generation. Otherwise, generation halts with an error. License checkpoints are enforced by
the following tools:
Vivado Synthesis
Vivado Implementation
write_bitstream (Tcl command)
To help in the design and debug process when using the functional mode, the Support web page
contains key resources such as product documentation, release notes, answer records, information
about known issues, and links for obtaining further product support. The Community Forums are also
available where members can learn, participate, share, and ask questions about AMD Adaptive
Computing solutions.
Documentation
This product guide is the main document associated with the functional mode. This guide, along with
documentation related to all products that aid in the design process, can be found on the AMD
Adaptive Support web page or by using the AMD Adaptive Computing Documentation Navigator.
Download the Documentation Navigator from the Downloads page. For more information about this
tool and the features available, open the online help after installation.
Answer Records
Answer Records include information about commonly encountered problems, helpful information on
how to resolve these problems, and any known issues with an AMD Adaptive Computing product.
Answer Records are created and maintained daily to ensure that users have access to the most
accurate information available.
Answer Records for this functional mode can be located by using the Search Support box on the main
AMD Adaptive Support web page. To maximize your search results, use keywords such as:
Product name
Tool message(s)
Summary of the issue encountered
A filter search is available after results are returned to further target the results.
See AR 75396.
Technical Support
AMD Adaptive Computing provides technical support on the Community Forums for this AMD
LogiCORE™ IP product when used as described in the product documentation. AMD Adaptive
Computing cannot guarantee timing, functionality, or support if you do any of the following:
Implement the solution in devices that are not defined in the documentation.
Customize the solution beyond that allowed in the product documentation.
Change any section of the design labeled DO NOT MODIFY.
Hardware Debug
Hardware issues can range from link bring-up to problems seen after hours of testing. This section
provides debug steps for common issues. The AMD Vivado™ debug feature is a valuable resource to
use in hardware debug. The signal names mentioned in the following individual sections can be
probed using the debug feature for debugging the specific problems.
General Checks
Ensure that all the timing constraints for the core were properly incorporated from the example design
and that all constraints were met during implementation.
Soft Reset
Reset the QDMA logic through the dma0_soft_reset_n port. This port needs to be held in reset for a
minimum of 100 clock cycles (pcie0_user_clk cycles).
This signal resets only the DMA portion of the logic; it does not reset the PCIe hard block. A soft reset can be used in situations such as the following:
DMA does not respond, and the user application is not getting proper values.
DMA transfers have errors, but the PCIe links are good.
DMA records some asynchronous errors.
After dma0_soft_reset, you must reinitialize the queues and program all queue context.
Registers
A complete list of registers and attributes for the QDMA Subsystem is available in the Versal Adaptive
SoC Register Reference (AM012). Reviewing the registers and attributes might be helpful for
advanced debugging.
✎ Note: The attributes are set during IP customization in the Vivado IP catalog. After core
customization, attributes are read-only.
Debug Tools
There are many tools available to address QDMA design issues. It is important to know which tools
are useful for debugging various situations.
The AMD Vivado™ Design Suite debug feature inserts logic analyzer and virtual I/O cores directly into
your design. The debug feature also allows you to set trigger conditions to capture application and
integrated block port signals in hardware. Captured signals can then be analyzed. This feature in the
Vivado IDE is used for logic debugging and validation of a design running in AMD devices.
The Vivado logic analyzer is used to interact with the logic debug LogiCORE IP cores, including:
The above figure shows the usage model of Linux and Windows QDMA software drivers. The QDMA
example design is implemented on an AMD adaptive SoC, which is connected to an X86 host through
PCI Express.
In the first use mode, the QDMA driver in kernel space runs on Linux, whereas the test
application runs in user space.
In the second use mode, the Data Plane Development Kit (DPDK) is used to develop a QDMA Poll Mode
Driver (PMD) running entirely in user space, which uses the UIO and VFIO kernel frameworks to
communicate with the adaptive SoC.
In the third usage mode, the QDMA driver runs in kernel space on Windows, whereas the test
application runs in the user space.
Device control tool: Creates a netlink socket for PCIe device query, queue management,
reading the context of a queue, and so on.
DMA tool: The user space application used to initiate DMA transactions. You can use the standard
Linux utilities dd or fio, or use the example application in the driver package.
Kernel space driver: Creates the descriptors and translates user space function calls into low-level
commands that interact with the Versal device.
The Linux, DPDK, and Windows drivers and the corresponding documentation are available at AMD DMA
IP Drivers.
‼ Important: Eight MSI-X vectors are needed on every function (PF/VF) to use the AMD QDMA IP driver.
Upgrading
This appendix is not applicable for the first release of the functional mode.
QDMA Architecture
DMA Engines
Descriptor Engine
The Host to Card (H2C) and Card to Host (C2H) descriptors are fetched by the Descriptor Engine in
one of two modes: Internal mode, and Descriptor bypass mode. The descriptor engine maintains per
queue contexts where it tracks software (SW) producer index pointer (PIDX), consumer index pointer
(CIDX), base address of the queue (BADDR), and queue configurations for each queue. The
descriptor engine uses a round robin algorithm for fetching the descriptors. The descriptor engine has
separate buffers for H2C and C2H queues, and ensures it never fetches more descriptors than
available space. The descriptor engine has only one DMA read outstanding per queue at a time
and can read as many descriptors as fit within the MRRS. The descriptor engine is responsible for
reordering out-of-order completions and ensures that descriptors for each queue are delivered in order.
The descriptor bypass can be enabled on a per-queue basis; the fetched descriptors, after
buffering, are sent to the respective bypass output interface instead of directly to the H2C or C2H
engine. In internal mode, based on the context settings, the descriptors are delivered directly to the
H2C memory mapped (MM), C2H MM, H2C Stream, or C2H Stream engine.
The descriptor engine is also responsible for generating the status descriptor for the completion of the
DMA operations. With the exception of C2H Stream mode, all modes use this mechanism to convey
completion of each DMA operation so that software can reclaim descriptors and free up any
associated buffers. This is indicated by the CIDX field of the status descriptor.
✎ Recommended: If a queue is associated with interrupt aggregation, AMD recommends that the
status descriptor be turned off, and instead the DMA status be received from the interrupt aggregation
ring.
To put a limit on the number of fetched descriptors (for example, to limit the amount of buffering
required to store the descriptors), it is possible to turn on credit-based throttling on a per-queue basis.
H2C MM Engine
The H2C MM Engine moves data from the host memory to card memory through the H2C AXI-MM
interface. The engine generates reads on PCIe, splitting descriptors into multiple read requests based
on the MRRS and the requirement that PCIe reads do not cross 4 KB boundaries. Once completion
data for a read request is received, an AXI write is generated on the H2C AXI-MM interface. For
source and destination addresses that are not aligned, the hardware will shift the data and split writes
on AXI-MM to prevent 4 KB boundary crossing. Each completed descriptor is checked to determine
whether a writeback and/or interrupt is required.
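The splitting rules described above (reads capped at the MRRS and never crossing a 4 KB boundary) can be expressed as a short Python sketch. This is an illustrative model of the rule only, not the hardware implementation, and the function name is ours:

```python
def split_pcie_reads(addr, length, mrrs=512):
    """Model of descriptor splitting: each emitted read is at most MRRS
    bytes and never crosses a 4 KB address boundary (illustrative only)."""
    reads = []
    while length > 0:
        # bytes remaining until the next 4 KB boundary
        to_boundary = 0x1000 - (addr & 0xFFF)
        chunk = min(length, mrrs, to_boundary)
        reads.append((addr, chunk))
        addr += chunk
        length -= chunk
    return reads
```

For example, a 768-byte transfer starting at 0xF00 with a 512-byte MRRS splits into a 256-byte read up to the 4 KB boundary followed by a 512-byte read.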
For Internal mode, the descriptor engine delivers memory mapped descriptors straight to the H2C MM
engine. The user logic can also inject the descriptor into the H2C descriptor bypass interface to move
data from host to card memory. This gives the ability to do interesting things such as mixing control
and DMA commands in the same queue. Control information can be sent to a control processor
indicating the completion of DMA operation.
C2H MM Engine
The C2H MM Engine moves data from card memory to host memory through the C2H AXI-MM
interface. The engine generates AXI reads on the C2H AXI-MM bus, splitting descriptors into multiple
requests based on 4 KB boundaries. Once completion data for the read request is received on the
AXI4 interface, a PCIe write is generated using the data from the AXI read as the contents of the
write. For source and destination addresses that are not aligned, the hardware will shift the data and
split writes on PCIe to obey Maximum Payload Size (MPS) and prevent 4 KB boundary crossings.
Each completed descriptor is checked to determine whether a writeback and/or interrupt is required.
For Internal mode, the descriptor engine delivers memory mapped descriptors straight to the C2H MM
engine. As with H2C MM Engine, the user logic can also inject the descriptor into the C2H descriptor
bypass interface to move data from card to host memory.
For multi-function configuration support, the PCIe function number information will be provided in the
aruser bits of the AXI-MM interface bus to help virtualization of card memory by the user logic. A
parity bus, separate from the data and user bus, is also provided for end-to-end parity support.
H2C Stream Engine
The H2C stream engine moves data from the host to the H2C Stream interface. For internal mode,
descriptors are delivered straight to the H2C stream engine; for a queue in bypass mode, the
descriptors can be reformatted and fed to the bypass input interface. The engine is responsible for
breaking up DMA reads to MRRS size, guaranteeing the space for completions, and also makes sure
completions are reordered to ensure H2C stream data is delivered to user logic in-order.
C2H Stream Engine
The C2H streaming engine is responsible for receiving data from the user logic and writing to the host
memory address provided by the C2H descriptor for a given Queue.
The C2H engine has two major blocks to accomplish C2H streaming DMA: the Descriptor Prefetch Cache
(PFCH) and the C2H-ST DMA Write Engine. The PFCH maintains per-queue context to enhance the
performance of its function, and the software is expected to program that context.
The PFCH cache has three main modes, selectable on a per-queue basis: Simple Bypass Mode, Internal
Cache Mode, and Cached Bypass Mode.
In Simple Bypass Mode, the engine does not track anything for the queue, and the user logic
can define its own method to receive descriptors. The user logic is then responsible for
delivering the packet and associated descriptor through the simple bypass interface. The
ordering of the descriptors fetched by a queue in the bypass interface and the C2H stream
interface must be maintained across all queues in bypass mode.
In Internal Cache Mode and Cached Bypass Mode, the PFCH module offers storage for up to
512 descriptors and these descriptors can be used by up to 64 different queues. In this mode,
the engine controls the descriptors to be fetched by managing the C2H descriptor queue credit
on demand based on received packets in the pipeline. Pre-fetch mode can be enabled on a per
queue basis, and when enabled, causes the descriptors to be opportunistically pre-fetched so
that descriptors are available before the packet data is available. The status can be found in
prefetch context. This significantly reduces the latency by allowing packet data to be transferred
to the PCIe integrated block almost immediately, instead of having to wait for the relevant
descriptor to be fetched. The size of the data buffer is fixed for a queue (PFCH context), and the
engine can scatter a packet across as many as seven descriptors. In Cached Bypass Mode, the
descriptor is bypassed to the user logic for further processing, such as address translation, and is sent
back on the bypass-in interface. This mode does not assume any ordering between the descriptors
and the C2H stream packet interface; the pre-fetch engine matches packets to descriptors. When
pre-fetch mode is enabled, do not give credits to the IP. The pre-fetch engine takes care of credit
management.
Completion Engine
The Completion (CMPT) Engine is used to write to the completion queues. Although the Completion
Engine can be used with an AXI-MM interface and Stream DMA engines, the C2H Stream DMA
engine is designed to work closely with the Completion Engine. The Completion Engine can also be
used to pass immediate data to the Completion Ring. The Completion Engine can be used to write
Completions of up to 64B in the Completion ring. When used with a DMA engine, the completion is
used by the driver to determine how many bytes of data were transferred with every packet. This
allows the driver to reclaim the descriptors.
The Completion Engine maintains the Completion Context. This context is programmed by the Driver
and is maintained on a per-queue basis. The Completion Context stores information like the base
address of the Completion Ring, PIDX, CIDX and a number of aspects of the Completion Engine,
which can be controlled by setting the fields of the Completion Context.
The engine also can be configured on a per-queue basis to generate an interrupt or a completion
status update, or both, based on the needs of the software. If the interrupts for multiple queues are
aggregated into the interrupt aggregation ring, the status descriptor information is available in the
interrupt aggregation ring as well.
The CMPT Engine has a cache of up to 64 entries to coalesce the multiple smaller CMPT writes into
64B writes to improve the PCIe efficiency. At any time, completions can be simultaneously coalesced
for up to 64 queues. Beyond this, any additional queue that needs to write a CMPT entry causes the
eviction of the least recently used queue from the cache. The depth of the cache is set to 64.
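The least-recently-used eviction behavior described above can be illustrated with a toy software model. The class and its bookkeeping are hypothetical; the real coalescing is implemented in hardware:

```python
from collections import OrderedDict

class CoalesceCache:
    """Toy model of per-queue completion coalescing with LRU eviction."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()   # qid -> pending CMPT bytes

    def write(self, qid, nbytes):
        """Coalesce a CMPT write for qid; return an evicted qid, if any."""
        evicted = None
        if qid not in self.entries and len(self.entries) == self.capacity:
            # a new queue beyond capacity evicts the least recently used one
            evicted, _ = self.entries.popitem(last=False)
        self.entries[qid] = self.entries.get(qid, 0) + nbytes
        self.entries.move_to_end(qid)  # mark qid as most recently used
        return evicted
```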
Bridge Interfaces
The AXI MM Bridge Master interface is used for high bandwidth access to AXI Memory Mapped space
from the host. The interface supports up to 32 outstanding AXI reads and writes. One or more PCIe
BAR of any physical function (PF) or virtual function (VF) can be mapped to the AXI-MM bridge
master interface. This selection must be made prior to design compilation.
Virtual function group (VFG) refers to the VF group number; it is equivalent to the PF number
associated with the corresponding VF. VFG_OFFSET refers to the VF number with respect to a
particular PF. Note that this is not the FIRST_VF_OFFSET of each PF.
For example, if both PF0 and PF1 have 8 VFs, the FIRST_VF_OFFSET values for PF0 and PF1 are 16 and 23, respectively.
Each host initiated access can be uniquely mapped to the 64-bit AXI address space through the PCIe
to AXI BAR translation.
Since all functions share the same AXI Master address space, a mechanism is needed to map
requests from different functions to a distinct address space on the AXI master side. An example
provided below shows how the PCIe to AXI translation vector is used. Note that all VFs belonging to the
same PF share the same PCIe to AXI translation vector. Therefore, the AXI address space of each VF
is concatenated together. Use VFG_OFFSET to calculate the actual starting address of AXI for a
particular VF.
To summarize, the AXI master write or read address is formed from the per-PF PCIe to AXI translation vector, with each VF's window offset by VFG_OFFSET.
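The translation itself is programmed through the PCIe-to-AXI BAR translation; the following is a purely illustrative model of the per-VF concatenation described above. All names and parameters here are our assumptions, not IP register names:

```python
def axi_master_addr(pcie2axi_base, bar_aperture, vfg_offset, bar_offset):
    """Hypothetical model: every VF of a PF gets a slice of that PF's
    PCIe-to-AXI window, concatenated in VFG_OFFSET order.

    pcie2axi_base : AXI base from the PF's PCIe-to-AXI translation vector
    bar_aperture  : size of one function's BAR window, in bytes
    vfg_offset    : VF number within this PF
    bar_offset    : offset of the host access within the BAR
    """
    return pcie2axi_base + vfg_offset * bar_aperture + bar_offset
```

For example, with a 64 KB aperture per VF, the third VF's accesses land 3 × 0x10000 above the PF's translated base.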
For each physical function, the PCIe configuration space consists of a set of five 32-bit memory BARs
and one 32-bit Expansion ROM BAR. When SR-IOV is enabled, an additional five 32-bit BARs are
enabled for each Virtual Function. These BARs provide address translation to the AXI4 memory
mapped space, interface routing, and AXI4 request attribute configuration. Any pair of
BARs can be configured as a single 64-bit BAR. Each BAR can be configured to route its requests to
the QDMA register space, or the AXI MM bridge master interface.
AxCache[1] is set to 1 for modifiable, and 0 for non-modifiable. Selecting the AxCache box sets
AxCache[1] to 1.
The AXI-MM Bridge Slave interface is used for high bandwidth memory transfers between the user
logic and the Host. AXI to PCIe translation is supported through the AXI to PCIe BARs. The interface
will split requests as necessary to obey PCIe MPS and 4 KB boundary crossing requirements. Up to
32 outstanding read and write requests are supported.
In the Bridge Slave interface, there is one BAR, which can be configured as 32-bit or 64-bit. This
BAR provides address translation from the AXI address space to the PCIe address space. The address
translation is configured through BDF table programming; refer to the Slave Bridge section for BDF
programming details.
Interrupt Module
The IRQ module aggregates interrupts from various sources. The interrupt sources are queue-based
interrupts, user interrupts and error interrupts.
Queue-based interrupts and user interrupts are allowed on PFs and VFs, but error interrupts are
allowed only on PFs. If the SR-IOV is not enabled, each PF has the choice of MSI-X or Legacy
Interrupts. With SR-IOV enabled, only MSI-X interrupts are supported across all functions.
MSI-X interrupt is enabled by default. The host system (Root Complex) enables one or all of the
interrupt types supported in hardware; if MSI-X is enabled, it takes precedence.
Up to eight interrupts per function are available. To allow many queues on a given function, each
with interrupts, the QDMA offers a way of aggregating interrupts from multiple queues onto a
single interrupt vector. In this way, all 2048 queues could in principle be mapped to a single interrupt vector.
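How queues are spread across a function's vectors is a driver policy choice; the following is a trivial sketch of one possible mapping. The helper is hypothetical and is not the AMD driver's algorithm:

```python
def assign_vectors(num_queues, num_vectors=8):
    """Spread queue IDs round-robin across the function's interrupt
    vectors (one illustrative policy; the real mapping is whatever the
    driver programs into each queue's interrupt context)."""
    return {qid: qid % num_vectors for qid in range(num_queues)}
```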
PCIe CQ/CC
The PCIe Completer Request (CQ)/Completer Completion (CC) modules receive and process TLP
requests from the remote PCIe agent. This interface to the PCIe Controller operates in an address
aligned mode. The module uses the BAR information from the Integrated Block for IP PCIe Controller
to determine where the request should be forwarded, such as the QDMA register space or the AXI MM
bridge master interface.
Non-posted requests are expected to receive completions from the destination, which are forwarded
to the remote PCIe agent. For further details, see the Versal Adaptive SoC CPM Mode for PCI
Express Product Guide (PG346).
PCIe RQ/RC
The PCIe Requester Request (RQ)/Requester Completion (RC) interface generates PCIe TLPs on
the RQ bus and processes PCIe Completion TLPs from the RC bus. This interface to the PCIe
Controller operates in DWord aligned mode. With a 512-bit interface, straddling is enabled. While
straddling is supported, all combinations of RQ straddled transactions might not be implemented. For
further details, see the Versal Adaptive SoC CPM Mode for PCI Express Product Guide (PG346).
PCIe Configuration
Outgoing non-posted transactions can be throttled based on flow control information from the PCIe
Controller, to prevent head-of-line blocking of posted requests. The DMA meters non-posted
transactions based on the PCIe receive FIFO space.
The multi-queue DMA engine of the QDMA uses the RDMA model of queue pairs to allow RNIC
implementation in the user logic. Each queue set consists of a Host to Card (H2C) queue, a Card to
Host (C2H) queue, and a C2H Stream Completion (CMPT) queue. The elements of each queue are descriptors.
H2C and C2H queues are always written by the driver/software; hardware always reads from these queues.
H2C carries the descriptors for DMA read operations from the host. C2H carries the descriptors for
DMA write operations to the host.
In internal mode, H2C descriptors carry address and length information and are called gather
descriptors. They support 32 bits of metadata that can be passed from software to hardware along
with every descriptor. A descriptor can be memory mapped (carrying host address, card
address, and length of the DMA transfer) or streaming (carrying only host address and length of the DMA transfer).
H2C/C2H queues are rings located in host memory. For both types of queues, the producer is software
and consumer is the descriptor engine. The software maintains producer index (PIDX) and a copy of
hardware consumer index (HW CIDX) to avoid overwriting unread descriptors. The descriptor engine
also maintains consumer index (CIDX) and a copy of SW PIDX, which is to make sure the engine
does not read unwritten descriptors. The last entry in the queue is dedicated for the status descriptor
where the engine writes the HW CIDX and other status.
The engine maintains a total of 2048 H2C and 2048 C2H contexts in local memory. The context
stores properties of the queue, such as base address (BADDR), SW PIDX, CIDX, and depth of the
queue.
The figure above shows the H2C and C2H fetch operation.
1. For H2C, the driver writes payload into host buffer, forms the H2C descriptor with the payload
buffer information and puts it into H2C queue at the PIDX location. For C2H, the driver forms the
descriptor with available buffer space reserved to receive the packet write from the DMA.
2. The driver sends the posted write to PIDX register in the descriptor engine for the associated
Queue ID (QID) with its current PIDX value.
For C2H, the fetch operation is implicit through the CMPT ring.
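The driver-side steps above can be sketched as a toy software model of an H2C ring. The names are hypothetical and the posted PIDX write is modeled as a comment only:

```python
class H2CRing:
    """Toy model of the driver (producer) side of an H2C descriptor ring."""
    def __init__(self, size):
        self.size = size          # last entry is the status descriptor
        self.ring = [None] * size
        self.pidx = 0             # software producer index
        self.hw_cidx = 0          # software's copy of the HW consumer index

    def free_slots(self):
        # usable entries are size - 2: one slot is the status descriptor,
        # and PIDX must never become equal to CIDX
        used = (self.pidx - self.hw_cidx) % self.size
        return (self.size - 2) - used

    def submit(self, desc):
        if self.free_slots() == 0:
            return False                     # ring full: wait for CIDX update
        self.ring[self.pidx] = desc          # step 1: write the descriptor
        self.pidx = (self.pidx + 1) % self.size
        # step 2: posted write of PIDX to the queue's PIDX register
        # (not modeled here)
        return True
```

With a ring of size 8 and CIDX at 0, the model accepts exactly six descriptors before reporting the ring full, matching the PIDX limit described later in this section.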
Completion Queue
The Completion (CMPT) queue is a ring located in host memory. The consumer is software, and the
producer is the CMPT engine. The software maintains the consumer index (CIDX) and a copy of
hardware producer index (HW PIDX) to avoid reading unwritten completions. The CMPT engine also
maintains PIDX and a copy of software consumer index (SW CIDX) to make sure that the engine
does not overwrite unread completions. The last entry in the queue is dedicated for the status
descriptor which is where the engine writes the hardware producer index (HW PIDX) and other status.
The engine maintains a total of 2048 CMPT contexts in local memory. The context stores properties of
the queue, such as base address, SW CIDX, PIDX, and depth of the queue.
C2H stream is expected to use the CMPT queue for completions to host, but it can also be used for
other types of completions or for sending messages to the driver. The message through the CMPT is
guaranteed to not bypass the corresponding C2H stream packet DMA.
The simple flow of the DMA CMPT queue operation follows:
1. The CMPT engine receives the completion message through the CMPT interface, but the QID
for the completion message comes from the C2H stream interface. The engine reads the CMPT
context RAM at the QID index.
2. The DMA writes the CMPT entry to address BASE+PIDX.
3. If all conditions are met, the engine optionally writes the PIDX to the status descriptor of the
CMPT queue with the color bit.
4. If interrupt mode is enabled, the CMPT engine generates the interrupt event message to the
interrupt module.
5. The driver can be in polling or interrupt mode. Either way, the driver identifies the new CMPT
entry either by matching the color bit or by comparing the PIDX value in the status descriptor
against its current software CIDX value.
6. The driver updates the CIDX for that queue, which allows the hardware to reuse the descriptors.
After the software finishes processing the CMPT, that is, before it stops polling or leaves the
interrupt handler, the driver issues a write to the CIDX update register for the associated queue.
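The color-bit matching in step 5 can be sketched as a small software model. This is illustrative only; the model ignores the status descriptor entry and other details of the real ring:

```python
def poll_cmpt(ring, cidx, color):
    """Toy model of driver-side completion polling using the color bit.
    Each ring entry is (color_bit, payload); hardware flips the color it
    writes on every wrap, so an entry matching the expected color is new."""
    new_entries = []
    size = len(ring)
    while True:
        entry_color, payload = ring[cidx]
        if entry_color != color:
            break                 # no new completion at CIDX
        new_entries.append(payload)
        cidx += 1
        if cidx == size:          # wrap: the expected color flips
            cidx = 0
            color ^= 1
    return new_entries, cidx, color
```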
SR-IOV Support
The QDMA provides an optional feature to support Single Root I/O Virtualization (SR-IOV). The PCI-
SIG® Single Root I/O Virtualization and Sharing (SR-IOV) specification (available from PCI-SIG
Specifications, www.pcisig.com/specifications) standardizes a method for bypassing VMM involvement
in data movement. Two function types are defined:
Physical Functions (PF): Full-featured PCIe® functions, which include SR-IOV capabilities among
others.
Virtual Functions (VF): PCIe functions featuring configuration space with Base Address
Registers (BARs) but lacking the full configuration resources and controlled by the PF
configuration. The main role of the VF is data transfer.
Apart from the PCIe-defined configuration space, the QDMA Subsystem for PCI Express virtualizes data path
operations, such as pointer updates for queues and interrupts. The rest of the management and
configuration functionality is deferred to the physical function driver. Drivers that do not have
sufficient privilege must communicate with the privileged driver through the mailbox interface, which is
provided as part of the QDMA Subsystem for PCI Express.
Security is an important aspect of virtualization. The QDMA Subsystem for PCI Express offers the
following security functionality:
QDMA allows only privileged PF to configure the per queue context and registers. VFs inform the
corresponding PFs of any queue context programming.
Drivers are allowed to do pointer updates only for the queue allocated to them.
The system IOMMU can be turned on to check that the DMA access is being requested by PFs
or VFs. The ARID comes from queue context programmed by a privileged function.
Any PF or VF can communicate with a PF (other than itself) through the mailbox. Each function implements one
128B inbox and one 128B outbox. These mailboxes are visible to the driver in the DMA BAR (typically
BAR0) of its own function. At any given time, each function can have one outgoing and one
incoming mailbox message outstanding.
The diagram below shows how a typical system can use QDMA with different functions and operating
systems. Different Queues can be allocated to different functions, and each function can transfer DMA
packets independent of each other.
Limitations
Applications
The QDMA is used in a broad range of networking, computing, and data storage applications. A
common usage example for the QDMA is to implement Data Center and Telco applications, such as
Compute acceleration, Smart NIC, NVMe, RDMA-enabled NIC (RNIC), server virtualization, and NFV
in the user logic. Multiple applications can be implemented to share the QDMA by assigning different
queue sets and PCIe functions to each application. These Queues can then be scaled in the user
logic to implement rate limiting, traffic priority, and custom work queue entry (WQE).
Product Specification
AMD provides multiple example designs for you to experiment with. All example designs can be
downloaded from GitHub. The performance example design can be selected from the CED example
designs.
Clock Settings
To achieve the maximum performance of 200 Gb/s line rate, the PCIe configuration needs to be set to Gen5
speed and x8 width.
The NoC needs to run at a much higher frequency to reach the maximum data throughput. Set the NoC
clock to 1000 MHz as shown below.
PL logic that interacts with the QDMA IP needs to run at 420 to 430 MHz. This clock can be
generated from the CPM5 IP as shown in the following settings. The PL CLK 0 is set to 430 MHz to
achieve the desired performance.
Not all PL logic needs to run at this rate, only the data path and the control path that interact with
the QDMA IP. The pcie_qdma_mailbox IP's internal logic can run at a lower frequency of 250 MHz.
This clock is generated from the IP PL CLK1.
✎ Note: You can get all clock details from the performance CED.
The following are the QDMA register settings recommended by AMD for better performance. Performance
numbers vary depending on the system and OS used.
0xB08 PFCH_CFG = 0x100_0100
  evt_pfch_fl_th[15:0] = 256
  pfch_fl_th[15:0] = 256
0xA80 PFCH_CFG_1 = 0x78_007C
  evt_qcnt_th[15:0] = 120
  pfch_qcnt[15:0] = 124
0xA84 PFCH_CFG_2 = 0x8040_03C8
  fence = 1
  rsvd[1:0] = 0
  var_desc_no_drop = 0
  pfch_ll_sz_th[15:0] = 1024
  var_desc_num_pfch[5:0] = 15
  num_pfch[5:0] = 8
0x1400 CRDT_COAL_CFG_1 = 0x4010
  rsvd[12:0] = 0
  dis_fence_fix = 0
  pld_fifo_th[7:0] = 16
  crdt_timer_th[9:0] = 16
0x1404 CRDT_COAL_CFG_2 = 0x78_0060
  rsv2[7:0] = 0
  crdt_fifo_th[7:0] = 120
  rsv1[4:0] = 0
  crdt_cnt_th[10:0] = 96
0xE24 H2C_REQ_THROT_PCIE = 0x8E04_E000
  req_throt_en_req = 1
  req_throt = 448
  req_throt_en_data = 1
  data_thresh = 57344
0xE2C H2C_REQ_THROT_AXIMM = 0x8E05_0000
  req_throt_en_req = 0
  req_throt = 448
  req_throt_en_data = 0
  data_thresh = 65536
0x250 QDMA_GLBL_DSC_CFG = 0x00_0015
  c2h_uodsc_limit = 0
  h2c_uodsc_limit = 0
  reserved = 0
  max_dsc_fetch = 5
  wb_acc_int = 1
0x4C CONFIG_BLOCK_MISC_CONTROL = 0x81_001F
  10b_tag_en = 1
  reserved = 0
  axi_wbk = 0
  axi_dsc = 0
  num_tags = 512
  reserved = 0
  rq_metering_multiplier = 31
✎ Recommended: AMD recommends that you limit the total outstanding descriptor fetch to be less
than 8 KB on the PCIe. For example, limit the outstanding credits across all queues to 512 for a 16B
descriptor.
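That guideline reduces to a simple division; a sketch follows (the helper name is ours):

```python
def max_outstanding_credits(desc_size_bytes, fetch_limit=8 * 1024):
    """Keep the total outstanding descriptor fetch under 8 KB on PCIe:
    the credit cap across all queues is the limit divided by the
    descriptor size."""
    return fetch_limit // desc_size_bytes
```

For a 16B descriptor this gives the 512-credit figure quoted above; a 32B descriptor would allow 256 outstanding credits.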
When the dma<0/1>_h2c_byp_in_st_sdi port is set for a bypass-in descriptor, the QDMA IP generates a status
writeback for every packet. AMD recommends that this port be asserted once in every 32 or
64 packets and, if there are no more descriptors left, on the last descriptor. This requirement
applies on a per-queue basis, to AXI4 (H2C and C2H) bypass transfers and AXI4-Stream H2C transfers.
For AXI4-Stream C2H Simple Bypass Mode, the dma<0/1>_dsc_crdt_in_fence port should be
set to 1 for performance reasons. This recommendation assumes the user design has already
coalesced credits for each queue and sent them to the IP. In internal mode, set the fence bit in
the QDMA_C2H_PFCH_CFG_2 (0xA84) register.
Prefetch Cache Depth: 128. C2H prefetch tags available. If there are more than 128 active queues
for packets smaller than 512B, performance can be reduced depending on the data pattern. If
performance degradation is seen, Simple Bypass Mode can be used instead, where the user logic
maintains all descriptor flow.
C2H Payload FIFO Depth: 1024. Units of 64B. Amount of C2H data that the C2H engine can buffer.
This buffer can sustain a host read latency of up to 2 us (1024 × 2 ns); if the latency is more than
2 us, there can be performance degradation.
MM Reorder Buffer Depth: 512. Units of 64B. Amount of MM read data that can be stored to absorb
host read latency.
Desc Eng Reorder Buffer Depth: 512. Units of 64B. Amount of descriptor fetch data that can be
stored to absorb host read latency.
H2C-ST Reorder Buffer Depth: 1024. Units of 64B. Amount of H2C-ST data that can be stored to
absorb host read latency.
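As a worked check of the payload FIFO figure above: 1024 entries of 64B, drained or filled at one 64B beat per 2 ns clock cycle, tolerate roughly 1024 × 2 ns ≈ 2 µs of host read latency. A trivial sketch of that arithmetic (the helper name is ours):

```python
def latency_tolerance_ns(depth_64b, clock_period_ns=2.0):
    """Buffer depth (in 64B units) times the clock period approximates
    the host read latency the buffer can absorb at full line rate."""
    return depth_64b * clock_period_ns
```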
QDMA Operations
Descriptor Engine
The descriptor engine is responsible for managing the consumer side of the Host to Card (H2C) and
Card to Host (C2H) descriptor ring buffers for each queue. The context for each queue determines
how the descriptor engine will process each queue individually. When descriptors are available and
other conditions are met, the descriptor engine will issue read requests to PCIe to fetch the
descriptors. Received descriptors are offloaded to either the descriptor bypass out interface (bypass
mode) or delivered directly to a DMA engine (internal mode). When an H2C Stream or Memory
Mapped DMA engine completes a descriptor, a status writeback, an interrupt, and/or a marker
response can be generated to inform software and the user logic of the current DMA progress.
logic of certain status for each queue. This allows the user logic to make informed decisions if
customization and optimization of DMA behavior is desired.
Descriptor Context
The Descriptor Engine stores per queue configuration, status and control information in descriptor
context that can be stored in block RAM or UltraRAM, and the context is indexed by H2C or C2H QID.
Prior to enabling the queue, the hardware and credit contexts must first be cleared. After this is done,
the software context can be programmed and the qen bit can be set to enable the queue. After the
queue is enabled, the software context should only be updated through the direct mapped address
space to update the producer index and the interrupt arm bit, unless the queue is being disabled. The
hardware context and credit context contain only status. It is only necessary to interact with the
hardware and credit contexts as part of queue initialization to clear them to all zeros. Once the queue
is enabled, context is dynamically updated by hardware. Any modification of the context through the
indirect bus when the queue is enabled can result in unexpected behavior. Reading the context when
the queue is enabled is not recommended as it can result in reduced performance.
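The bring-up order above (clear the hardware and credit contexts, program the software context, set qen) can be sketched as follows. This is only an illustration of the documented ordering; the ctx_write helper and the context field names are hypothetical, not the real driver API.

```python
def bring_up_queue(ctx_write, qid, ring_base, ring_size_idx):
    """Enable one H2C/C2H queue in the documented order (sketch)."""
    # 1. Clear the status-only hardware and credit contexts to all zeros.
    ctx_write(qid, "hardware", 0)
    ctx_write(qid, "credit", 0)
    # 2. Program the software context with qen set to enable the queue.
    ctx_write(qid, "software", {"ring_base": ring_base,
                                "ring_size_idx": ring_size_idx,
                                "qen": 1})
    # From here on, only the direct-mapped PIDX/irq_arm update space may
    # touch the software context; the indirect context bus must be left alone.
```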
[138:128] [10:0] vec MSI-X vector used for interrupts for direct interrupt, or interrupt aggregation entry for aggregated interrupts.
[39:32] 8 Reserved
Descriptor Fetch
✎ Note: Available descriptors are always <ring size> - 2. At any time, the software should not update
the PIDX to more than <ring size> - 2.
For example, if queue size is 8, which contains the entry index 0 to 7, the last entry (index 7) is
reserved for status. This index should never be used for the PIDX update, and the PIDX update
should never be equal to CIDX. For this case, if CIDX is 0, the maximum PIDX update would be 6.
Internal Mode
A queue can be configured to operate in Descriptor Bypass mode or Internal mode by setting the
software context bypass field. In internal mode, the queue requires no external user logic to handle
descriptors. Descriptors that are fetched by the descriptor engine are delivered directly to the
appropriate DMA engine and processed. Internal mode allows credit fetching and status updates to
the user logic for run time customization of the descriptor fetch behavior.
Status writebacks and/or interrupts are generated automatically by hardware based on the queue
context. When wbi_intvl_en is set, writebacks/interrupts are sent based on the interval selected in
the register QDMA_GLBL_DSC_CFG (0x250) bits[2:0]. Due to the slow nature of interrupts, in
interval mode, interrupts might be late or skip intervals. If the wbi_chk context bit is set, a
writeback/interrupt is sent when the descriptor engine has detected that the last descriptor at the
current PIDX has completed. It is recommended the wbi_chk bit be set for all internal mode
operation, including when interval mode is enabled. An interrupt is not generated until the irq_arm bit
is set by the software. After an interrupt is sent, the irq_arm bit is cleared by the hardware. Should an
interrupt be needed when the irq_arm bit is not set, the interrupt is held in a pending state until the
irq_arm bit is set.
Descriptor completion is defined to be when the descriptor data transfer has completed and its write
data is acknowledged on AXI (H2C bresp for AXI MM, Valid/Ready of ST), or it is accepted by the
PCIe Controller’s transaction layer for transmission (C2H MM).
Descriptor Bypass mode also supports crediting and status updates to user logic. In addition,
Descriptor Bypass mode allows the user logic to customize processing of descriptors and status
updates. Descriptors fetched by the descriptor engine are delivered to user logic through the
Descriptor Bypass Output interface.
In bypass mode, the user logic has explicit control over status updates to the host, and marker
responses back to user logic. Along with each descriptor submitted to the Descriptor Bypass Input
Port for a Memory Mapped Engine (H2C and C2H) or H2C Stream DMA engine, there are CIDX and
sdi fields. The CIDX is used to identify which descriptor has completed in any status update (host
writeback, marker response, or coalesced interrupt) generated at the completion of the descriptor. If
the sdi field is set on the input descriptor, a writeback to the host is generated if the context wbk_en
bit is set. An interrupt can also be sent for an sdi descriptor if the context irq_en and irq_arm bits are
set.
If interrupts are enabled, the user logic must monitor the traffic manager output for the irq_arm bit.
After the irq_arm bit is observed for the queue, a descriptor with the sdi bit can be sent to the DMA.
Once a descriptor with the sdi bit is sent, another irq_arm assertion must be observed before another
descriptor with the sdi bit can be sent. If you set the sdi bit when the arm bit has not been properly
observed, an interrupt might or might not be sent, and software might hang indefinitely waiting for an
interrupt. When interrupts are not enabled, setting the sdi bit has no restriction. However, excessive
writeback events can severely reduce the descriptor engine performance and consume write
bandwidth to the host.
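The irq_arm/sdi handshake described above can be modeled as a small state machine: one sdi descriptor per observed irq_arm assertion. The class and method names are illustrative, not part of the hardware interface.

```python
class SdiGate:
    """Tracks the irq_arm handshake: a descriptor may carry sdi only after
    a fresh irq_arm assertion has been observed on the traffic manager output."""
    def __init__(self):
        self.armed = False

    def observe_irq_arm(self):
        # Traffic manager output reported irq_arm for this queue.
        self.armed = True

    def send_descriptor(self, want_sdi):
        # Returns the sdi value actually driven with the descriptor.
        if want_sdi and self.armed:
            self.armed = False      # consume the arm; wait for the next one
            return True
        return False                # suppress sdi: arm not observed yet
```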
Descriptor completion is defined to be when the descriptor data transfer has completed and its write
data has been acknowledged on AXI4 (H2C bresp for AXI MM, Valid/Ready of ST), or been accepted
by the PCIe Controller’s transaction layer for transmission (C2H MM).
Marker Response
Marker responses can be generated for any descriptor by setting the mrkr_req bit. Marker responses
are generated after the descriptor is completed. Similar to host writebacks, excessive marker
response requests can reduce descriptor engine performance. Along with mrkr_req signals, sdi can
also be set. In this case, the marker response is sent on queue status ports and writeback is sent to
the host. The marker responses are sent on queue status ports that can be identified by the queue id.
Descriptor completion is defined as when the descriptor data transfer has completed and its write data
is acknowledged on AXI (H2C bresp for AXI MM, Valid/Ready of ST), or is accepted by the PCIe
Controller’s transaction layer for transmission (C2H MM).
✎ Note: The ports described below have a prefix of dma<n>_, which can be either dma0_ for QDMA
Port 0 or dma1_ for QDMA Port 1.
The traffic manager interface provides details of a queue’s status to user logic, allowing user logic to
manage descriptor fetching and execution. In normal operation, for an enabled queue, each time the
irq_arm bit is asserted or PIDX of a queue is updated, the descriptor engine asserts
dma<n>_tm_dsc_sts_valid. The dma<n>_tm_dsc_sts_avl signal indicates the number of new
descriptors available to be fetched for that queue.
The credit interface is relevant when a queue’s fcrd_en context bit is set. It allows the user logic to
prioritize and meter descriptors fetched for each queue. You can specify the DMA direction, qid, and
credit value. For a typical use case, the descriptor engine uses credit inputs to fetch descriptors.
Internally, credits received and consumed are tracked for each queue. If credits are added when the
queue is not enabled, the credits will be returned through the Traffic Manager Output Interface with
tm_dsc_sts_qinv asserted, and the credits in tm_dsc_sts_avl are not valid. Monitor the tm_dsc_sts
interface to keep a per-queue account of how many credits are consumed.
Errors
Errors can potentially occur during both descriptor fetch and descriptor execution. In both cases, once
an error is detected for a queue, the DMA invalidates the queue, logs an error bit in the context, stops
fetching new descriptors for that queue, and can also log errors in status registers.
If enabled for writeback, interrupts, or marker response, the DMA will generate a status update to
these interfaces. Once this is done, no additional writeback, interrupts, or marker responses (internal
mode) will be sent for the queue until the queue context is cleared. As a result of the queue
invalidation due to an error, a Traffic Manager Output cycle will also be generated to indicate the error
and queue invalidation. After the queue is invalidated, you can determine the cause of the error by
reading the error registers and the context for that queue. You must clear and remove that queue, and
then add the queue back later when needed.
Although additional descriptor fetches will be halted, fetches already in the pipeline will continue to be
processed and descriptors will be delivered to a DMA engine or Descriptor Bypass Out interface as
usual. If the descriptor fetch itself encounters an error, the descriptor will be marked with an error bit.
If the error bit is set, the contents of the descriptor should be considered invalid.
In memory mapped DMA operations, both the source and destination of the DMA are memory
mapped space. In an H2C transfer, the source address belongs to PCIe address space while the
destination address belongs to AXI MM address space. In a C2H transfer, the source address belongs
to AXI MM address space while the destination address belongs to PCIe address space. PCIe-to-
PCIe and AXI MM-to-AXI MM DMAs are not supported. Aside from the direction of the DMA transfer,
H2C and C2H DMA behave similarly and share the same descriptor format.
Operation
The memory mapped DMA engines (H2C and C2H) are enabled by setting the run bit in the Memory
Mapped Engine Control Register. When the run bit is deasserted, descriptors can be dropped. Any
descriptors that have already started the source buffer fetch will continue to be processed.
Reassertion of the run bit will result in resetting internal engine state and should only be done when
the engine is quiesced. Descriptors are received from either the descriptor engine directly or the
Descriptor Bypass Input interface. Any queue that is in internal mode should not be given descriptors
through the Descriptor Bypass Input interface. Any descriptor sent to an MM engine that is not running
will be dropped. For configurations where a mix of Internal Mode queues and Bypass Mode queues
are enabled, round robin arbitration is performed to establish order.
The DMA Memory Mapped engine first generates the read request to the source interface, splitting
the descriptor at alignment boundaries specific to the interface. Both PCIe and AXI read interfaces
can be configured to split at different alignments. Completion space for read data is preallocated
when the read is issued. Likewise for the write requests, the DMA engine will split at appropriate
alignments. On the AXI interface each engine will use a single AXI ID. The DMA engine will reorder
the read completion/write data to the order in which the reads were issued. Once sufficient read
completion data is received the write request will be issued to the destination interface in the same
order that the read data was requested. Before the request is retired, the destination interfaces must
accept all the write data and provide a completion response. For PCIe the write completion is issued
when the write request has been accepted by the transaction layer and will be sent on the link next.
For the AXI Memory Mapped interface, the bresp is the completion criteria. Once the completion
criteria has been met, the host writeback, interrupt and/or marker response is generated for the
descriptor as appropriate.
The DMA Memory Mapped engines also support the no_dma field of the Descriptor Bypass Input, and
zero-length DMA. Both cases are treated identically in the engine. The descriptors propagate through
the DMA engine as all other descriptors, so descriptor ordering within a queue is still observed.
However no DMA read or write requests are generated. The status update (writeback, interrupt,
and/or marker response) for zero-length/no_dma descriptors is processed when all previous
descriptors have completed their status update checks.
There are two primary error categories for the DMA Memory Mapped Engine. The first is an error bit
that is set with an incoming descriptor. In this case, the DMA operation of the descriptor is not
processed but the descriptor proceeds through the engine to status update phase with an error
indication. This should result in a writeback, interrupt, and/or marker response depending on context
and configuration. It also results in the queue being invalidated. The second category of errors for the
DMA Memory Mapped Engine are errors encountered during the execution of the DMA itself. This can
include PCIe read completions errors, and AXI bresp errors (H2C), or AXI bresp errors and PCIe
write errors due to bus master enable or function level reset (FLR), as well as RAM ECC errors. The
first enabled error is logged in the DMA engine. Refer to the Memory Mapped Engine error logs. If an
error occurs on the read, the DMA write is aborted if possible. If the error was detected when pulling
write data from RAM, it is not possible to abort the request. Instead invalid data parity is generated to
ensure the destination is aware of the problem. After the descriptor which encountered the error has
gone through the DMA engine, it proceeds to generate status updates with an error indication. As with
descriptor errors, it results in the queue being invalidated.
Table: AXI Memory Mapped Descriptor Structure for H2C and C2H
For internal mode memory mapped DMA, the descriptor ring entries must be 32B and follow the
above descriptor format. In bypass mode, the descriptor format is defined by the user logic, which must
drive the H2C or C2H MM bypass input port.
AXI Memory Mapped Writeback Status Structure for H2C and C2H
The MM writeback status register is located after the last entry of the (H2C or C2H) descriptor.
Table: AXI Memory Mapped Writeback Status Structure for H2C and C2H
The H2C Stream Engine is responsible for transferring streaming data from the host and delivering it
to the user logic. The H2C Stream Engine operates on H2C stream descriptors. Each descriptor
specifies the start address and the length of the data to be transferred to the user logic. The H2C
Stream Engine parses the descriptor and issues read requests to the host over PCIe, splitting the
read requests at the MRRS boundary. There can be up to 256 requests outstanding in the H2C
Stream Engine to hide the host read latency. The H2C Stream Engine implements a re-ordering buffer
of 32 KB to re-order the TLPs as they come back. Data is issued to the user logic in order of the
requests sent to PCIe.
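The splitting of a descriptor's read into MRRS-bounded PCIe read requests works roughly as follows. This is a software sketch of what the hardware does internally, with an assumed MRRS of 512 bytes; the function name is illustrative.

```python
def split_read(addr, length, mrrs=512):
    """Split one descriptor's read into PCIe read requests that never
    cross an MRRS-aligned boundary (sketch of the hardware behavior)."""
    reqs = []
    while length > 0:
        # Bytes remaining before the next MRRS-aligned boundary.
        chunk = min(length, mrrs - (addr % mrrs))
        reqs.append((addr, chunk))
        addr += chunk
        length -= chunk
    return reqs
```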
If the status descriptor is enabled in the associated H2C context, the engine can additionally send a
status writeback to the host once it is done issuing data to the user logic.
Each queue in QDMA can be programmed in either of the two H2C Stream modes: internal and
bypass. This is done by specifying the mode in the queue context. The H2C Stream Engine knows
whether the descriptor being processed is for a queue in internal or bypass mode.
The following figures show the internal mode and bypass mode flows.
For a queue in the Internal mode, after the descriptor is fetched from the host it is fed straight to the
H2C Stream Engine for processing. In this case, a packet of data cannot span over multiple
descriptors. Thus for a queue in internal mode, each descriptor generates exactly one AXI4-Stream
packet on the QDMA H2C AXI4-Stream output. If the packet is present in host memory in non-
contiguous space, then it has to be defined by more than one descriptor and this requires that the
queue be programmed in bypass mode.
In the Bypass mode, after the descriptors are fetched from the host they are sent straight to the user
logic using the QDMA bypass output port. The QDMA does not parse these descriptors at all. The
user logic can store these descriptors and then send the required information from these descriptors
back to QDMA using the QDMA H2C Stream descriptor bypass-in interface. Using this information,
the QDMA constructs descriptors which are then fed to the H2C Stream Engine for processing.
When fcrd_en is enabled in the software context, the DMA waits for the user application to provide
credits (Credit return in the figure above). When fcrd_en is not set, the DMA fetches descriptors on a
pointer update and sends them out; the user application should not send in credits, and Credit return
in the figure does not apply in this case.
The following are the advantages of using the bypass mode:
The user logic can have a custom descriptor format. This is possible because QDMA does not
parse descriptors for queues in bypass mode. The user logic parses these descriptors and
provides the information required by the QDMA on the H2C Stream bypass-in interface.
Immediate data can be passed from the software to the user logic without DMA operation.
The user logic can do traffic management by sending the descriptors to the QDMA when it is
ready to sink all the data. Descriptors can be cached in local RAM.
Perform address translation.
Descriptor Metadata
Similar to bypass mode, the internal mode also provides a mechanism to pass information directly
from the software to the user logic. In addition to address and length, the H2C Stream descriptor also
has a 32b metadata field. This field is not used by the QDMA for the DMA operation. Instead, it is
passed on to the user logic on the H2C AXI4-Stream tuser on every beat of the packet. Passing
metadata on the tuser is not supported for a queue in bypass mode and consequently there is no
input to provide the metadata on the QDMA H2C Stream bypass-in interface.
The length field in a descriptor can be zero. In this case, the H2C Stream Engine will issue a zero
byte read request on PCIe. After the QDMA receives the completion for the request, the H2C Stream
Engine will send out one beat of data with tlast on the QDMA H2C AXI4-Stream interface. The zero
byte packet will be indicated on the interface by setting the zero_b_dma bit in the tuser. The user
logic must set both the SOP and EOP for a zero byte descriptor. If not done, an error will be flagged
by the H2C Stream Engine.
When feeding the descriptor information on the bypass input interface, the user logic can request the
QDMA to send a status write back to the host when it is done fetching the data from the host. The
user logic can also request that a status be issued to it when the DMA is done. These behaviors can
be controlled using the sdi and mrkr_req inputs in the bypass input interface.
The H2C writeback status register is located after the last entry of the H2C descriptor list.
✎ Note: The format of the H2C-ST status descriptor written to the descriptor ring is different from
that written into the interrupt coalesce entry.
The H2C engine has a data aligner that aligns the data to a zero byte (0B) boundary before issuing it
to the user logic. This allows the start address of a descriptor to be arbitrarily aligned and still receive
the data on the H2C AXI4-Stream data bus without any holes at the beginning of the data. The user
logic can send a batch of descriptors from SOP to EOP with arbitrary address and length alignments
for each descriptor. The aligner will align and pack the data from the different descriptors and will
issue a continuous stream of data on the H2C AXI4-Stream data bus. The tlast on that interface will
be asserted when the last beat for the EOP descriptor is being issued.
If an error is encountered while fetching a descriptor, the QDMA Descriptor Engine flags the descriptor
with error. For a queue in internal mode, the H2C Stream Engine handles the error descriptor by not
performing any PCIe or DMA activity. Instead, it waits for the error descriptor to pass through the
pipeline and forces a writeback after it is done. For a queue in bypass mode, it is the responsibility of
the user logic to not issue a batch of descriptors with an error descriptor. Instead, it must send just
one descriptor with error input asserted on the H2C Stream bypass-in interface and set the SOP,
EOP, no_dma signal, and sdi or mrkr_req signal to make the H2C Stream Engine send a writeback
to Host.
If the H2C Stream Engine encounters an error coming from PCIe on the data, it keeps the error sticky
across the full packet. The error is indicated to the user on the err bit on the H2C Stream Data
Output. Once the H2C Stream sends out the last beat of a packet that saw a PCIe data error, it also
sends a Writeback to the Software to inform it about the error.
The C2H Stream Engine DMA writes the stream packets to host memory, into the buffers described
by the descriptors that the host driver provides through the C2H descriptor queue.
The Prefetch Engine is responsible for calculating the number of descriptors needed for the DMA that
is writing the packet. The buffer size is fixed on a per-queue basis. For internal and cached bypass mode,
the prefetch module can fetch up to 512 descriptors for a maximum of 64 different queues at any
given time.
The Prefetch Engine also offers a low-latency feature (pfch_en = 1), where the engine can prefetch
up to qdma_c2h_pfch_cfg.num_pfch descriptors upon receiving a packet, so that subsequent packets
can avoid the PCIe latency.
The QDMA requires software to post full ring size so the C2H stream engine can fetch the needed
number of descriptors for all received packets. If there are not enough descriptors in the descriptor
ring, the QDMA will stall the packet transfer. For performance reasons, the software is required to post
the PIDX as soon as possible to ensure there are always enough descriptors in the ring.
C2H stream packet data length is limited to 31 * C2H buffer size. The C2H buffer size can be
programmed through registers 0xAB0 to 0xAEC; for details, refer to the cpm5-qdma-v4-0-pf-registers.csv file.
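The 31 * buffer size limit translates into a simple check on how many fixed-size buffers one packet consumes. The helper below is a sketch of that arithmetic only; the function name is illustrative.

```python
import math

def c2h_descriptors_needed(pkt_len, buf_size):
    """Number of fixed-size buffers (descriptors) the prefetch engine must
    pair with one C2H packet; a packet spanning more than 31 buffers
    exceeds the documented limit."""
    n = max(1, math.ceil(pkt_len / buf_size))
    if n > 31:
        raise ValueError("packet exceeds 31 * C2H buffer size")
    return n
```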
The prefetch engine interacts between the descriptor fetch engine and C2H DMA write engine to pair
up the descriptor and its payload.
[0] 1 bypass C2H bypass mode; set this bit for simple bypass mode.
The C2H descriptors can be from the descriptor fetch engine or C2H bypass input interfaces. The
descriptors from the descriptor fetch engine are always in cache mode. The prefetch engine keeps
the order of the descriptors to pair with the C2H data packets from the user. The descriptors from the
C2H bypass input arrive on a single interface that serves both simple bypass and cache bypass
modes. For simple mode, the user application
keeps the order of the descriptors to pair with the C2H data packets. For cache mode, the prefetch
engine keeps the order of the descriptors to pair with the C2H data packet from the user.
The prefetch context has a bypass bit. When it is 1'b1, the user application sends the credits for the
descriptors. When it is 1'b0, the prefetch engine handles the credits for the descriptors.
The descriptor context has a bypass bit. When it is 1'b1, the descriptor fetch engine sends out the
descriptors on the C2H bypass output interface. The user application can convert it and later loop it
back to the QDMA on the C2H bypass input interface. When the bypass context bit is 1'b0, the
descriptor fetch engine sends the descriptors to the prefetch engine directly.
There is a 2K-entry buffer that takes in descriptors from the bypass input ports; this buffer is shared
across all the queues.
On a per queue basis, three modes are supported.
The selection between Simple Bypass Mode and Cache Bypass Mode is done by setting the bypass
bits in Software Descriptor Context and C2H Prefetch Context as shown in the following table.
✎ Note: If you already have the descriptor cached on the device, there is no need to fetch one from
the host, and you should use simple bypass mode for the C2H Stream application. In simple bypass
mode, do not provide credits to fetch descriptors; instead, send the descriptors in on the descriptor
bypass interface.
✎ Note: AXI4-Stream C2H Simple Bypass mode and Cache Bypass mode both use the same
bypass-in ports (c2h_byp_in_st_csh_* ports).
1. Software instruction:
a. Initialize a queue (qid).
b. Write to MDMA_C2H_PFCH_BYP_QID 0x1408 with valid qid.
c. Read MDMA_C2H_PFCH_BYP_TAG 0x140C to obtain the prefetch tag.
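Steps b and c above amount to two register accesses. The sketch below uses the register offsets from the text; the reg_write/reg_read helpers are hypothetical stand-ins for MMIO accessors, not a documented API.

```python
# Register offsets from the text.
MDMA_C2H_PFCH_BYP_QID = 0x1408
MDMA_C2H_PFCH_BYP_TAG = 0x140C

def get_prefetch_tag(reg_write, reg_read, qid):
    """Publish the (already initialized) qid, then read back the tag."""
    reg_write(MDMA_C2H_PFCH_BYP_QID, qid)   # step b
    return reg_read(MDMA_C2H_PFCH_BYP_TAG)  # step c
```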
The simple bypass flow shown below does not include the fetch of the "prefetch_tag".
✎ Note: No sequence is required between descriptor bypass in, data payload and completion
packets.
If you already have descriptors, there is no need to update the pointers or provide credits. Instead,
send the descriptors in the descriptor bypass interface, and send the data and Completion (CMPT)
packets.
When simple bypass mode is selected, the queue that is used to fetch the prefetch tag acts as a
management queue and controls the buffer sizes.
The buffer size that is set for this management queue is used for all the queues, irrespective of the
buffer size set for other queues. In simple bypass mode, you provide the descriptors and data packets,
so the buffer size should be set to accommodate all packet sizes in all queues.
Regular Packet
The regular C2H packet has both the data packet and Completion (CMPT) packet. They are a one-to-
one match.
The regular C2H data packet can be multiple beats.
When the user application sends the data packet, it must count the packet ID for each packet. The
first data packet has a packet ID of 1, and it increments for each data packet.
For the regular C2H packet, the data packet and the completion packet is a one-to-one match.
Therefore, the number of data packets with dma<n>_s_axis_c2h_ctrl_has_cmpt as 1'b1 should be
equal to the number of CMPT packet with dma<n>_s_axis_c2h_cmpt_ctrl_cmpt_type as HAS_PLD.
The QDMA has a shallow completion input FIFO of depth 2. For better performance, add a FIFO on
the completion input as shown in the diagram below. The depth and width of the FIFO depend on the
use case: width depends on the largest CMPT size for the application, and depth on performance
needs. For best performance with 64-byte CMPT, a depth of 512 is recommended.
When the user application sends the data payload, it counts every packet. The first packet starts with
a pkt_pld_id of 1. The second packet has a pkt_pld_id of 2, and so on. It is a 16-bit counter; once
the count reaches 16'hFFFF, it wraps around to 0 and continues counting.
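The pkt_pld_id counting rule above (start at 1, wrap from 16'hFFFF to 0) can be captured in a few lines; the class name is illustrative.

```python
class PldIdCounter:
    """16-bit payload packet counter: the first packet gets id 1, and the
    count wraps from 0xFFFF back to 0."""
    def __init__(self):
        self.val = 0

    def next_id(self):
        # Increment modulo 2**16 and return the id for the next packet.
        self.val = (self.val + 1) & 0xFFFF
        return self.val
```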
The user application defines the CMPT type.
A zero byte packet is sent as 1 beat of data with:
dma<n>_s_axis_c2h_ctrl_len = 0
dma<n>_s_axis_c2h_mty = 0
dma<n>_s_axis_c2h_ctrl_has_cmpt = 1'b0
✎ Note: Zero Byte packets are not supported in Internal mode and Cache bypass mode. The QDMA
might hang if zero byte packets are dropped because no descriptor is available. Zero Byte packets are
supported in Simple bypass mode.
If an error is encountered while fetching a descriptor (in pre-fetch or regular mode), the QDMA
Descriptor Engine flags the descriptor with error. For a queue in internal mode, the C2H Stream
Engine handles the error descriptor by not performing any PCIe or DMA activity. Instead, it waits for
the error descriptor to pass through the pipeline and forces a writeback after it is done. For a queue in
bypass mode, it is the responsibility of the user logic to not issue a batch of descriptors with an error
descriptor. Instead, it must send just one descriptor with error input asserted on the C2H Stream
bypass-in interface and set the SOP, EOP, no_dma signal, and sdi or mrkr_req signal to make the
C2H Stream Engine send a writeback to Host.
Completion Engine
The Completion Engine writes the C2H AXI4-Stream Completion (CMPT) in the CMPT queue. The
user application sends a CMPT packet and other information, such as, but not limited to, CMPT QID
and CMPT_TYPE to the QDMA. The QDMA uses this information to process the CMPT packet. The
QDMA can be instructed to write the CMPT packet unchanged in the CMPT queue. Alternatively, the
user application can instruct the QDMA to insert certain fields, like error and color, in the CMPT packet
before writing it into the CMPT queue. Additionally, using the CMPT interface signals, the user
application instructs the QDMA to order the writing of the CMPT packet in a specific way, relative to
traffic on the C2H data input. Although not a requirement, a CMPT is typically used with a C2H queue.
In such a case, the CMPT is used to inform the SW that a certain number of C2H descriptors have been used.
The Completion Status is located at the last location of the Completion ring, that is, Completion Ring
Base Address + (completion entry size (8, 16, 32, or 64 bytes) * (Completion Ring Size – 1)).
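The Completion Status address formula above can be written as a one-line helper; this is a sketch of the arithmetic only, and the function name is illustrative.

```python
def cmpt_status_addr(ring_base, entry_bytes, ring_size):
    """Completion Status occupies the last slot of the Completion ring:
    base + entry size * (ring size - 1)."""
    return ring_base + entry_bytes * (ring_size - 1)
```

For example, a ring at 0x1000 with 16-byte entries and 256 slots places the status entry at 0x1000 + 16 * 255.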
In order to make the QDMA write Completion Status to the Completion ring, Completion Status must
be enabled in the Completion context. In addition to affecting Interrupts, the trigger mode defined in
the Completion context also moderates the writing of Completion Statuses. Subject to Interrupt/Status
moderation, a Completion Status can be written when either of the following happens:
[63:37] 27 Reserved
The size of a Completion (CMPT) Ring entry is 512-bits. This includes user defined data, an optional
error bit, and an optional color bit. The user defined data has four size options: 8B, 16B, 32B and 64B.
The bit locations of the optional error and color bits in the CMPT entry are configurable individually.
This is done by specifying the locations of these fields using the AMD Vivado™ IDE IP customization
options while compiling the QDMA. There are seven color bit location options and eight error bit
location options. The location is specified as an offset from the LSB bit of the Completion entry.
When the user application drives a Completion packet into the QDMA, it provides a
dma<n>_s_axis_cmpt_ctrl_col_idx[2:0] value and a
dma<n>_s_axis_cmpt_ctrl_err_idx[2:0] value at the interface. These indices are used by the
QDMA to use the correct locations of the color and error bits. For example, if
dma<n>_s_axis_cmpt_ctrl_col_idx[2:0] = 0 and dma<n>_s_axis_cmpt_ctrl_err_idx[2:0] =
1, then the QDMA uses the C2H Stream Completion Color bits position option 0 for color location, and
C2H Stream Completion Error bits position option 1 for error location. An index of seven for color or
error signals implies that the DMA will not update the corresponding color or error bits when
Completion entry is updated (those fields are ignored). The C2H Stream Completions bits options are
set in the PCIe DMA Tab in the AMD Vivado™ IDE.
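Inserting the color and error bits at their configured offsets amounts to simple bit surgery. The sketch below treats the CMPT entry as an integer; a location of None models the documented index of seven, where the DMA leaves the corresponding field untouched. Names are illustrative.

```python
def insert_cmpt_bits(entry, color, err, col_loc, err_loc):
    """Set or clear the color and error bits at their configured bit
    offsets in a CMPT entry; None means 'do not update this field'."""
    if col_loc is not None:
        entry = (entry & ~(1 << col_loc)) | (int(color) << col_loc)
    if err_loc is not None:
        entry = (entry & ~(1 << err_loc)) | (int(err) << err_loc)
    return entry
```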
The error and color bit location values that are used at compile time are available for the software to
read from the MMIO registers. There are seven registers for this purpose,
QDMA_C2H_CMPT_FORMAT_0 (0xBC4) through QDMA_C2H_CMPT_FORMAT_6. Each of these
registers holds one color and one error bit location.
C2H Stream Completions bits option 0 for color bit location and option 0 for error bit location are
available through the QDMA_C2H_CMPT_FORMAT_0 register.
C2H Stream Completions bits option 1 for color bit location and option 1 for error bit location are
available through the QDMA_C2H_CMPT_FORMAT_1 register.
And so on.
Based on the CMPT data size selection (8, 16, 32 or 64 Bytes), the data in
s_axis_c2h_cmpt_tdata[511:0] signal is registered in the completion entry as shown in the
following table.
User-defined bits for 8-byte setting: 62 to 64, depending on whether the color and error bits are present.
The CMPT packet has four size options (8, 16, 32, or 64 bytes) and is transferred as one beat of 512-bit data.
The QDMA provides a means to moderate the Completion interrupts and Completion Status writes on
a per queue basis. The software can select one out of five modes for each queue. The selected mode
for a queue is stored in the QDMA in the Completion ring context for that queue. After a mode has
been selected for a queue, the driver can always select another mode when it sends the completion
ring CIDX update to the QDMA.
The Completion interrupt moderation is handled by the Completion engine. The Completion engine
stores the Completion ring contexts of all the queues. It is possible to individually enable or disable
the sending of interrupts and Completion Statuses for every queue and this information is present in
the Completion ring context. It is worth mentioning that the modes being described here moderate not
only interrupts but also Completion Status writes. Also, since interrupts and Completion Status writes
can be individually enabled/disabled for each queue, these modes work only if the
interrupt/Completion Status is enabled in the Completion context for that queue.
The QDMA keeps only one interrupt outstanding per queue. This policy is enforced by the QDMA even
if all other conditions to send an interrupt are met for the mode. The QDMA considers an interrupt
serviced when it receives a CIDX update for that queue from the driver.
The basic policy followed in all the interrupt moderation modes is that when there is no interrupt
outstanding for a queue, the QDMA keeps monitoring the trigger conditions to be met for that mode.
Once the conditions are met, an interrupt is sent out. While the QDMA subsystem is waiting for the
interrupt to be served, it remains sensitive to interrupt conditions being met and remembers them.
When the CIDX update is received, the QDMA subsystem evaluates whether the conditions are still
being met. If they are still being met, another interrupt is sent out. If they are not met, no interrupt is
sent out and the QDMA resumes monitoring for the conditions to be met again.
The interrupt moderation modes that the QDMA subsystem provides are not necessarily precise.
Thus, if the user application sends two CMPT packets with an indication to send an interrupt, it is not
necessary that two interrupts are generated. The main reason for this behavior is that when the driver
is interrupted to read the Completion ring, it is under no obligation to read exactly up to the
Completion for which the interrupt was generated. Thus, the driver might not read up to the
interrupting Completion, or it might even read beyond the interrupting Completion descriptor if there
are valid descriptors to be read there. This behavior requires the QDMA to re-evaluate the trigger
conditions every time it receives the CIDX update from the driver.
The detailed description of each mode is given below:
TRIGGER_USER
The QDMA provides a way to send a CMPT packet to the subsystem with an indication to send
out an interrupt when the subsystem is done sending the packet to the host. This allows the user
application to perform interrupt moderation when the TRIGGER_USER mode is set.
TRIGGER_USER_TIMER
In this mode, the QDMA is sensitive to either of two triggers. One of these triggers is sent by the
user along with the CMPT packet. The other trigger is the expiration of the timer that is
associated with the CMPT queue. The period of the timer is driver programmable on a per-
queue basis. The QDMA evaluates whether or not to send an interrupt when either of these
triggers is detected. As explained in the preceding sections, other conditions must be satisfied in
addition to the triggers for an interrupt to be sent. For more information, see Completion Timer.
TRIGGER_DIS
In this mode, the QDMA does not send Completion interrupts even if they are enabled for a given
queue. The only way the driver can read the completion ring in this case is by regularly polling
the ring. The driver must use the color bit feature provided in the Completion ring when this
mode is set, because this mode also disables the sending of any Completion Status descriptors to
the Completion ring.
Completion Timer
The Completion Timer engine supports the timer trigger mode in the Completion context. It supports
2048 queues, and each queue has its own timer. When the timer expires, a timer expire signal is sent
to the Completion module. If multiple timers expire at the same time, they are sent out in a round
robin manner.
Reference Timer
The reference timer is based on the timer tick. The register QDMA_C2H_INT (0xB0C) defines the
value of a timer tick. The 16 registers QDMA_C2H_TIMER_CNT (0xA00-0xA3C) hold the timer counts
based on the timer tick. The timer_idx in the Completion context is the index into the 16
QDMA_C2H_TIMER_CNT registers. Each queue can choose its own timer_idx.
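The relationship between the tick and the per-queue count can be sketched as follows. This is a minimal illustration, not driver code: the function name, the nanosecond unit, and the sample count values are assumptions; only the register names, the 16-entry count table, and the timer_idx selection come from the text.

```python
# Illustrative sketch: deriving a queue's completion-timer period.
# Register names follow the text; the tick unit and counts are assumed.

def timer_period_ns(tick_ns, timer_cnt_regs, timer_idx):
    """Period = timer tick (QDMA_C2H_INT, 0xB0C) multiplied by the count
    selected by timer_idx from the 16 QDMA_C2H_TIMER_CNT registers
    (0xA00-0xA3C)."""
    assert 0 <= timer_idx < 16, "timer_idx indexes 16 registers"
    return tick_ns * timer_cnt_regs[timer_idx]

# Example: a hypothetical 1000 ns tick and a count of 30 give 30 us.
counts = [1, 5, 10, 30, 50, 100, 200, 400] + [0] * 8
print(timer_period_ns(1000, counts, 3))  # 30000
```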
Port ID Mismatch
The CMPT context specifies the port over which CMPTs are expected for that CMPT queue. If the
port_id in the incoming CMPT does not match the port_id in the CMPT context, the CMPT is treated as an error.
Bridge
The Bridge core is an interface between the AXI4 and the PCI Express integrated block. It contains
the memory mapped AXI4 to AXI4-Stream Bridge, and the AXI4-Stream Enhanced Interface Block for
PCIe. The memory mapped AXI4 to AXI4-Stream Bridge contains a register block and two functional
half bridges, referred to as the Slave Bridge and Master Bridge.
The slave bridge connects to the AXI4 Interconnect as a slave device to handle any issued AXI4
master read or write requests.
The master bridge connects to the AXI4 Interconnect as a master to process the PCIe generated
read or write TLPs.
The register block contains registers used in the Bridge core for dynamically mapping the memory
mapped AXI4 (MM) address range, provided using the AXIBAR parameters, to an address range for
PCIe.
The core uses a set of interrupts to detect and flag error conditions.
Related Information
Bridge Register Space
Slave Bridge
The slave bridge provides termination of memory-mapped AXI4 transactions from an AXI4 master
device (such as a processor). The slave bridge provides a way to translate addresses that are
mapped within the AXI4 memory mapped address domain to the domain addresses for PCIe. Write
transactions to the Slave Bridge are converted into one or more MemWr TLPs, depending on the
configured Max Payload Size setting, which are passed to the integrated block for PCI Express. When
a remote AXI master initiates a read transaction to the slave bridge, the read address and qualifiers
are captured and a MemRd request TLP is passed to the core and a completion timeout timer is
started. Completions received through the core are correlated with pending read requests and read
data is returned to the AXI4 master. The slave bridge can support up to 32 AXI4 write requests, and
32 AXI4 read requests.
CPM does not do any SMID checks for slave AXI4 transfers. Any value is accepted.
✎ Note: If slave reads and writes are both valid, the IP prioritizes reads over writes. Proper
arbitration is recommended (leave some gaps between reads so that writes can pass through).
BDF Table
Address translation for AXI addresses is done based on the BDF table programming (0x2420 to 0x2434).
These BDF table entries can be programmed through the NoC AXI Slave interface. There are three
regions that you can use for slave data transfers. Each region can be further divided into multiple
windows, each with a different address translation. The regions and the number of windows should be
configured in the IP wizard. Each entry in the BDF table represents one window: if you need two
windows, program two entries, and so on.
1. All PCIe slave bridge data transfers must be quiesced before programming the BDF table.
2. There are six registers for each BDF table entry. All six registers must be programmed to make a
valid entry. Even if some registers have 0s, you need to program 0s in those registers.
3. All six registers must be programmed in order for an entry to be valid. The order is listed
below.
a. 0x2420
b. 0x2424
c. 0x2428
d. 0x242C
e. 0x2430
f. 0x2434
BDF table entry start address = 0x2420 + (0x20 * i), where i = table entry number.
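The addressing above can be sketched as follows. Only the 0x2420 base, the 0x20 per-entry stride, and the six-registers-in-order rule come from the text; the write callback is a placeholder for the real NoC AXI slave register access.

```python
# Sketch of BDF-table entry addressing and programming order.
# The write() callback stands in for the real NoC AXI slave access.

BDF_TABLE_BASE = 0x2420
ENTRY_STRIDE = 0x20

def bdf_entry_addresses(entry):
    """Return the six register addresses for BDF table entry `entry`,
    in the required programming order (0x2420 .. 0x2434, plus 0x20*i)."""
    base = BDF_TABLE_BASE + ENTRY_STRIDE * entry
    return [base + 4 * reg for reg in range(6)]

def program_bdf_entry(write, entry, values):
    """Program all six registers in order; zero values must be written too,
    because an entry is valid only after all six registers are written."""
    assert len(values) == 6, "all six registers make a valid entry"
    for addr, val in zip(bdf_entry_addresses(entry), values):
        write(addr, val)

print([hex(a) for a in bdf_entry_addresses(0)])
# ['0x2420', '0x2424', '0x2428', '0x242c', '0x2430', '0x2434']
```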
Protection
Specifying protection levels for different windows within a BAR is facilitated by the AXI4 prot field
via TrustZone. Any access from the PMC has a*prot[1]=0 and therefore gets full access.
For the BDF space, the protection domain ID itself is stored in the BDF table. When a request comes
in with a*prot[1]=0, it is allowed full access. Requests with a*prot[1]=1 are only allowed to
access BDF entries that have a lower protection level.
The following table describes this behavior:
For example, a non-secure access (a*prot bit 1 = 1) to a less secure entry (3'hX1X) is allowed only
if bits [2] and [0] match between a*prot and the BDF entry.
Address Translation
Slave bridge data transfers can be performed over three regions. You have options to set the size of
each region and how many windows are needed for different address translations per region. If
address translation is not needed for a window, you still need to program the BDF table with an
address translation value of 0x0.
Address translation for Slave Bridge transfer are described in the following examples:
For this example, the Slave address 0x0000_0000_0000_0100 will be address translated to
0x0000_0000_0000_C100.
1. Selections in AMD Vivado™ IP configuration in the AXI BARs tab are as follows:
AXI BAR size 32G: 0x7_FFFF_FFFF bits [34:0].
Set Aperture Base Address: 0x0000_0000_0000_0000.
Set Aperture High Address: 0x0000_0007_FFFF_FFFF.
2. The BDF table programming:
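The arithmetic behind this example can be sketched as follows. The window base of 0x0 and the translation value of 0xC000 are assumptions chosen to reproduce the documented result (0x...0100 becomes 0x...C100); the real values come from the programmed BDF table entries.

```python
# Illustrative sketch of slave-bridge address translation for the
# example above. The window base (0x0) and translation value (0xC000)
# are assumptions for illustration only.

def translate(axi_addr, window_base, translation_value):
    """PCIe address = translation value + offset of the AXI address
    within its window."""
    return translation_value + (axi_addr - window_base)

# Reproduces the documented example: 0x...0100 -> 0x...C100.
print(hex(translate(0x0000_0000_0000_0100, 0x0, 0xC000)))  # 0xc100
```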
The slave bridge does not support narrow-burst AXI transfers. To avoid narrow-burst transfers,
insert the AXI SmartConnect module, which converts narrow-burst transfers to full-burst AXI transfers.
Master Bridge
The master bridge processes both PCIe MemWr and MemRd request TLPs received from the integrated
block for PCI Express and provides a means to translate addresses that are mapped within the
address for PCIe domain to the memory mapped AXI4 address domain. Each PCIe MemWr request
TLP header is used to create an address and qualifiers for the memory mapped AXI4 bus and the
associated write data is passed to the addressed memory mapped AXI4 Slave. The Master Bridge
can support up to 32 active PCIe MemWr request TLPs. PCIe MemWr request TLPs support is as
follows:
Each PCIe MemRd request TLP header is used to create an address and qualifiers for the memory
mapped AXI4 bus. Read data is collected from the addressed memory mapped AXI4 bridge slave and
used to generate completion TLPs which are then passed to the integrated block for PCI Express.
The Master Bridge in AXI Bridge mode can support up to 32 active PCIe MemRd request TLPs with
pending completions for improved AXI4 pipe-lining performance.
All AXI4_MM master transfers can be directed to modules based on the QDMA controller selection and
the steering selection in the GUI as shown in the following table:
CTRL 0: CPM PCIE NoC 0, CPM PCIE NoC 1, CCI PS AXI 0
CTRL 1: CPM PCIE NoC 0, CPM PCIE NoC 1, CCI PS AXI 0, PL AXI0, PL AXI1
Interrupts
The QDMA supports up to 2K total MSI-X vectors. A single MSI-X vector can be used to support
multiple queues. Each function can support up to 8 vectors (8 × 256 functions = 2K vectors).
The QDMA supports Interrupt Aggregation. Each vector has an associated Interrupt Aggregation Ring.
The QID and status of queues requiring service are written into the Interrupt Aggregation Ring. When
a PCIe® MSI-X interrupt is received by the Host, the software reads the Interrupt Aggregation Ring to
determine which queue needs service. The mapping of queues to vectors is programmable through the
vector number provided in the queue context. MSI-X interrupt mode is supported for both SR-IOV and
non-SR-IOV.
Interrupt Engine
The Interrupt Engine handles the queue based interrupts and the error interrupt.
The following figure shows the Interrupt Engine block diagram.
The Interrupt Engine gets the interrupts from H2C MM, H2C stream, C2H MM, C2H stream, or error
interrupt.
It handles interrupts in two ways: direct interrupt or indirect interrupt. Each interrupt source
carries information that indicates whether it is a direct or an indirect interrupt, along with the
vector. For a direct interrupt, the vector is the interrupt vector used to generate the PCIe MSI-X
message (the interrupt vector index of the MSI-X table). For an indirect interrupt, the vector is the
ring index of the Interrupt Aggregation Ring. The interrupt source gets the interrupt type and
vector from the Descriptor Software Context, the Completion Context, or the error interrupt register.
Direct Interrupt
For direct interrupt, the Interrupt Engine gets the interrupt vector from the source, and it then sends
out the PCIe MSI-X message directly.
Indirect Interrupt
For the indirect interrupt, the Interrupt Engine performs interrupt aggregation. The following
restrictions apply to interrupt aggregation.
Each Interrupt Aggregation Ring can only be associated with one function. But multiple rings can
be associated with the same function.
The interrupt engine supports up to three outstanding interrupts from the same source until the
software services the interrupts.
The Interrupt Aggregation Ring size needs to be greater than 3 × the number of queues.
The Interrupt Engine processes the indirect interrupt with the following steps.
The interrupt source provides the index of the Interrupt Aggregation Ring it belongs to.
Reads interrupt context for that queue.
Writes to the Interrupt Aggregation Ring.
Sends out the PCIe MSI-X message.
The Interrupt Context includes the information of the Interrupt Aggregation Ring. It has 256 entries to
support up to 256 Interrupt Aggregation Rings.
A color bit is added so that the software does not read more entries than it should. When the software
allocates the memory space for the Interrupt Aggregation Ring, coal_color starts at 1'b0. The
software must initialize the color bit of the Interrupt Context to 1'b1. When the hardware
completes the entire ring and wraps to the first entry for the next pass, it flips the color value to 0
and starts writing 0 in the color bit space. The software does the same: after it completes the last
entry with color value 1, it moves to the first entry for the second pass and expects a color value of
0. If the software does not see a color value of 0, which indicates an old entry, it waits for a new
entry with a color value of 0.
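The color-bit convention can be sketched as follows. The tuple-based entry layout is a simplification for illustration; real entries carry the Qid, int_type, and other fields described below.

```python
# Sketch of the coal_color convention: hardware writes entries with the
# current color and flips it each time it wraps; software consumes
# entries only while the color matches the value it expects.
# Entries are simplified to (color, qid) tuples.

def read_new_entries(ring, start_idx, expected_color):
    """Consume entries whose color matches; stop at the first stale one.
    Returns (consumed entries, next index, color to expect next)."""
    entries, idx, color = [], start_idx, expected_color
    while ring[idx][0] == color:
        entries.append(ring[idx])
        idx = (idx + 1) % len(ring)
        if idx == 0:          # wrapped: the expected color flips too
            color ^= 1
        if len(entries) == len(ring):
            break             # consumed one full ring's worth
    return entries, idx, color

# First pass: hardware wrote 3 entries with color 1; the rest are stale.
ring = [(1, 5), (1, 7), (1, 5), (0, 0)]
got, nxt, col = read_new_entries(ring, 0, 1)
print(len(got), nxt, col)  # 3 3 1
```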
The software reads the Interrupt Aggregation Ring to get the Qid, and the int_type (H2C or C2H).
From the Qid, the software can identify whether the queue is stream or MM.
The stat_desc in the Interrupt Aggregation Ring is the status descriptor from the Interrupt source.
When the status descriptor is disabled, the software can get the status descriptor information from the
Interrupt Aggregation Ring.
There can be two cases:
The interrupt source is C2H stream. Then it is the status descriptor of the C2H Completion Ring.
The software can read the pidx of the C2H Completion Ring.
The interrupt source is others (H2C stream, H2C MM, C2H MM). Then it is the status descriptor
of that source. The software can read the cidx.
Finally, the Interrupt Engine sends out the PCIe MSI-X message using the interrupt vector from the
Interrupt Context. When there is an interrupt from any source, the interrupt engine updates the PIDX
and checks the int_st of that Interrupt Context. If int_st is 0 (WAITING_TRIGGER), the interrupt
engine sends an interrupt and updates int_st to 1. If int_st is 1 (ISR_RUNNING), the interrupt
engine does not send an interrupt. Once the software updates the CIDX and the CIDX matches the PIDX,
int_st is cleared. The process is explained below.
When the PCIe MSI-X interrupt is received by the Host, the software reads the Interrupt Aggregation
Ring to determine which queue needs service. After the software reads the ring, it will do a dynamic
pointer update for the software CIDX to indicate the cumulative pointer that the software reads to. The
software does the dynamic pointer update using the register QDMA_DMAP_SEL_INT_CIDX[2048]
(0x18000). If the software CIDX is equal to the PIDX, this triggers a write to the Interrupt Context
to clear int_st, the interrupt state of that queue. This indicates to the QDMA that the software
has already read all of the entries in the Interrupt Aggregation Ring. If the software CIDX is not
equal to the PIDX, the interrupt engine sends out another PCIe MSI-X message so that the software
can read the Interrupt Aggregation Ring again. The complete flow is as follows:
1. After the software gets the PCIe MSI-X message, it reads the Interrupt Aggregation Ring entries.
2. The software uses the coal_color bit to identify the written entries. Each entry has Qid and
Int_type (H2C or C2H). From the Qid and Int_type, the software can check if it is stream or
MM. This points to a corresponding source ring. For example, if it is C2H stream, the source ring
is the C2H Completion Ring. The software can then read the source ring to get information, and
do a dynamic pointer update of the source ring after that.
3. After the software finishes reading all written entries in the Interrupt Aggregation Ring, it does
one dynamic pointer update of the software CIDX using the register
QDMA_DMAP_SEL_INT_CIDX[2048] (0x18000). This communicates to the hardware the
Interrupt Aggregation Ring pointer used by the software.
If the software cidx is not equal to the pidx, the hardware will send out another PCIe MSI-X
message, so that the software can read the Interrupt Aggregation Ring again.
When the software does the dynamic pointer update for the Interrupt Aggregation Ring using the
register QDMA_DMAP_SEL_INT_CIDX[2048] (0x18000), it sends the ring index of the Interrupt
Aggregation Ring.
The following diagram shows the indirect interrupt flow. The Interrupt module gets the interrupt
requests. It first writes to the Interrupt Aggregation Ring. Then it waits for the write completions. After
that, it sends out the PCIe MSI-X message. The interrupt requests can keep on coming, and the
Interrupt module keeps on processing them. In the meantime, the software reads the Interrupt
Aggregation Ring, and it does the dynamic pointer update. If the software CIDX is not equal to the
PIDX, it will send out another PCIe MSI-X message.
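The handshake above can be modeled with a small sketch. The names (WAITING_TRIGGER, ISR_RUNNING, pidx, cidx, int_st) follow the text; the MSI-X send is modeled as a counter, and the class itself is an illustration, not hardware-accurate behavior.

```python
# Minimal model of the indirect-interrupt handshake: one outstanding
# interrupt per ring, re-interrupt on a CIDX update that has not
# caught up with the PIDX.

WAITING_TRIGGER, ISR_RUNNING = 0, 1

class AggRing:
    def __init__(self):
        self.pidx = 0
        self.cidx = 0
        self.int_st = WAITING_TRIGGER
        self.msix_sent = 0

    def source_event(self):
        """Interrupt source event: the engine writes an entry (pidx++)
        and sends MSI-X only if no interrupt is outstanding."""
        self.pidx += 1
        if self.int_st == WAITING_TRIGGER:
            self.msix_sent += 1
            self.int_st = ISR_RUNNING

    def cidx_update(self, cidx):
        """Software CIDX update via QDMA_DMAP_SEL_INT_CIDX (0x18000).
        A matching PIDX clears int_st; otherwise another MSI-X goes out."""
        self.cidx = cidx
        if self.cidx == self.pidx:
            self.int_st = WAITING_TRIGGER
        else:
            self.msix_sent += 1

r = AggRing()
r.source_event()   # first event: MSI-X sent
r.source_event()   # interrupt outstanding: no new MSI-X
r.cidx_update(1)   # software read only 1 of 2 entries: re-interrupt
r.cidx_update(2)   # caught up: int_st cleared
print(r.msix_sent, r.int_st)  # 2 0
```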
Interrupt Context Structure
The following is the Interrupt Context Structure (0x8).
After the interrupt engine looks up the Interrupt Context, the interrupt engine writes to the Interrupt
Aggregation Ring. The interrupt engine also updates the Interrupt Context with the new PIDX, color,
and the interrupt state.
Interrupt Aggregation Entry
This is the Interrupt Aggregation Ring entry structure. It has 8B data.
Interrupt Flow
Error Interrupt
There are Leaf Error Aggregators in different places. They log the errors and propagate the errors to
the Central Error Aggregator. Each Leaf Error Aggregator has an error status register and an error
mask register. The error mask is an enable mask. Irrespective of the enable mask value, the error
status register always logs the errors. Only when the error mask is enabled does the Leaf Error
Aggregator propagate the error to the Central Error Aggregator.
The Central Error Aggregator aggregates all of the errors together. When any error occurs, it can
generate an Error Interrupt if the err_int_arm bit is set in the error interrupt register
QDMA_GLBL_ERR_INT (0xB04). The err_int_arm bit is set by the software and cleared by the
hardware when the Error Interrupt is taken by the Interrupt Engine. The Error Interrupt is for all of the
errors including the H2C errors and C2H errors. The Software must set this err_int_arm bit to
generate interrupt again.
The Error Interrupt supports the direct interrupt only. Register QDMA_GLBL_ERR_INT bit[23],
en_coal must always be programmed to 0 (direct interrupt).
The Error Interrupt gets the vector from the error interrupt register QDMA_GLBL_ERR_INT. For the
direct interrupt, the vector is the interrupt vector index of the MSI-X table.
The Error Interrupt process is as follows:
1. Reads the Error Interrupt register QDMA_GLBL_ERR_INT (0xB04) to get the function and vector
numbers.
2. Sends out the PCIe MSI-X message.
The following figure shows the error interrupt register block diagram.
User Interrupt
You can generate an interrupt to the host system using the user interrupt interface. You must drive
the usr_irq_in_fnc, usr_irq_in_vec, and usr_irq_in_vld signals and hold them active
until usr_irq_out_ack is returned. usr_irq_in_fnc is the function number associated with an
interrupt. For an MSI-X interrupt, usr_irq_in_vec must be provided; for a legacy interrupt,
the vector is not needed.
Figure: Interrupt
Queue Management
The Function Map Table is used to allocate queues to each function. The index into the RAM is the
function number. Each entry contains the base number of the physical QID and the number of queues
allocated to the function. It provides a function based, queue access protection mechanism by
translating and checking accesses to logical queues (through QDMA_TRQ_SEL_QUEUE_PF and
QDMA_TRQ_SEL_QUEUE_VF address space) to their physical queues. Direct register accesses to
queue space beyond what is allocated to the function in the table will be canceled and an error will be
logged.
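The Function Map check described above can be sketched as follows. The dictionary-based table and the function name are illustrative assumptions; only the base/count lookup and the reject-out-of-range behavior come from the text.

```python
# Sketch of the Function Map Table check: each function's entry gives a
# physical-QID base and a queue count; accesses to logical queues
# outside the allocation are canceled (modeled here as returning None).

def map_qid(fmap, func, logical_qid):
    """Translate a logical QID for `func` to a physical QID, or return
    None if the access must be canceled and an error logged."""
    base, count = fmap[func]
    if logical_qid >= count:
        return None
    return base + logical_qid

fmap = {0: (0, 64), 1: (64, 32)}   # func 0: QIDs 0-63, func 1: 64-95
print(map_qid(fmap, 1, 10))  # 74
print(map_qid(fmap, 1, 40))  # None
```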
The function map can be accessed through the indirect context register space QDMA_IND_CTXT_CMD
registers, with QDMA_IND_CTXT_CMD.sel = 0xC. When accessed through the indirect context register
space, the context structure is defined by the Function Map Context Structure table. Along with the
FMAP table programming in the IP, you must also program the FMAP table in the Mailbox IP. This is
needed for the function level reset (FLR) procedure.
The address and data fields are described in the following table (bits [31:22] are reserved).
Queue Teardown
Virtualization
QDMA implements SR-IOV passthrough virtualization, where the adapter exposes a separate virtual
function (VF) for use by a virtual machine (VM). A physical function (PF) can optionally be made
privileged with full access to QDMA registers and resources, while VFs implement only the per-queue
pointer update registers and interrupts. VF drivers must communicate with the driver attached to the
PF through the mailbox for configuration, resource allocation, and exception handling. The QDMA
implements function level reset (FLR) to enable the operating system on a VM to reset the device
without interfering with the rest of the platform.
Type: Queue context/other control registers. Notes: Registers for context access are controlled only
by PFs (all 4 PFs).
Type: Status and statistics registers. Notes: Mainly PF-only registers. VFs need to coordinate with a
PF driver for error handling; VFs communicate through the mailbox with the driver attached to the PF.
Type: Data path registers. Notes: Both PFs and VFs must be able to write the registers involved in
the data path without needing to go through a hypervisor. Pointer updates for H2C/C2H Descriptor
Fetch can be done directly by a VF or PF for the queues associated with the function, using its own
BAR space. Any pointer update to a queue that does not belong to the function is dropped, with an
error logged.
Type: Other protection recommendations. Notes: Turn on the IOMMU to protect against bad memory
accesses from VMs.
Type: PF driver and VF driver communication. Notes: The VF driver needs to communicate with the PF
driver to request operations that have a global effect. This communication channel needs the ability
to pass messages and generate interrupts, and it utilizes a set of hardware mailboxes for each VF.
Mailbox
In a virtualized environment, the driver attached to a PF has enough privilege to program and access
QDMA registers. All lesser privileged functions (certain PFs and all VFs) must communicate with the
privileged drivers using the mailbox mechanism. The communication API must be defined by the
driver; the QDMA IP does not define it.
Each function (both PF and VF) has an inbox and an outbox that can fit a message size of 128B. A VF
accesses its own mailbox, and a PF accesses its own mailbox and the mailboxes of all the functions
(PF or VF) associated with that PF.
✎ Note: The pcie_qdma_mailbox IP supports up to 4 PFs and 240 VFs. You can build a mailbox system in
the PL to support a larger number of PFs and VFs (the CPM5 limit is 16 PFs and 2K VFs). Adding the
pcie_qdma_mailbox IP increases PL utilization.
The QDMA mailbox allows the following access:
Figure: Mailbox
VF To PF Messaging
A VF is allowed to post one message to a target PF mailbox at a time, until the target function (PF)
accepts it. Before posting a message, the source function should make sure its o_msg_status is
cleared; then the VF can write the message to its Outgoing Message Registers. After finishing the
message write, the VF driver issues the msg_send command by writing 0x1 to the control/status
register (CSR) address 0x5004. The mailbox hardware then informs the PF driver by asserting the
i_msg_status field.
The function driver should enable periodic polling of i_msg_status to check the availability of
incoming messages. On the PF side, i_msg_status = 0x1 indicates that one or more messages are
pending for the PF driver to pick up. The cur_src_fn field in the Mailbox Status Register gives the
function ID of the first pending message. The PF driver should then set the Mailbox Target Function
Register to the source function ID of the first pending message. Access to a PF's Incoming Message
Registers is indirect, which means the mailbox hardware always returns the corresponding message
bytes sent by the target function. Upon finishing the message read, the PF driver should send the
msg_rcv command by writing 0x2 to the CSR address. The hardware then deasserts the
o_msg_status at the source function side. The following figure illustrates the messaging flow from a
VF to a PF at both the source and destination sides.
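The flow above can be sketched with a small model. The command encodings (msg_send = 0x1, msg_rcv = 0x2, VF CSR at 0x5004) come from the text; the register-file model, the status encoding, and the byte-string message are simplifications for illustration only.

```python
# Sketch of the VF-to-PF messaging handshake: one pending message per
# VF, gated by o_msg_status, acknowledged by the PF with msg_rcv.

MSG_SEND, MSG_RCV = 0x1, 0x2
VF_CSR_CMD = 0x5004   # VF control/status register address (from the text)

class Mailbox:
    def __init__(self):
        self.o_msg_status = 0   # set while the VF's message is pending
        self.i_msg_status = 0   # tells the PF a message is waiting
        self.message = None

    def vf_post(self, msg_128b):
        if self.o_msg_status:
            return False        # previous message not yet accepted
        self.message = msg_128b # write the Outgoing Message Registers
        # modeled effect of writing MSG_SEND to VF_CSR_CMD:
        self.o_msg_status = 1
        self.i_msg_status = 1
        return True

    def pf_receive(self):
        if not self.i_msg_status:
            return None
        msg = self.message      # read the Incoming Message Registers
        # modeled effect of the PF issuing MSG_RCV:
        self.i_msg_status = 0
        self.o_msg_status = 0   # hardware clears the source status
        return msg

mb = Mailbox()
mb.vf_post(b"hello")
mb.vf_post(b"again")          # rejected: must wait for msg_rcv
print(mb.pf_receive())        # b'hello'
```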
PF To VF Messaging
The messaging flow from a PF to the VFs that belong to its VFG is slightly different than the VF to PF
flow because:
A PF can send messages to multiple destination functions; therefore, it might receive multiple
acknowledgments at the moment it checks the status. As illustrated in the following figure, a PF
driver must set the Mailbox Target Function Register to the destination function ID before doing any
message operation, for example, checking the incoming message status, writing a message, or sending
the command. On the VF (receiving) side, whenever a VF driver gets i_msg_status = 0x1, the
VF driver should read its Incoming Message Registers to pick up the message. Depending on the
application, the VF driver can send msg_rcv immediately after reading the message or after the
corresponding message has been processed.
To avoid one-by-one polling of the status of outgoing messages, the mailbox hardware provides a set
of Acknowledge Status Registers (ASR) for each PF. Upon the mailbox receiving the msg_rcv
command from a VF, it deasserts the o_msg_status field of the source PF and it also sets the
corresponding bit in the Acknowledge Status Registers. For a given VF with function ID <N>,
acknowledge status is at:
The mailbox hardware asserts the ack_status field in the Status Register (0x22400) when any bit is
asserted in the Acknowledge Status Register (ASR). The PF driver can poll the ack_status field
instead of checking each outgoing message status one by one.
Mailbox Interrupts
The mailbox module supports interrupt as the alternative event notification mechanism. Each mailbox
has an Interrupt Control Register (at offset 0x22410 for a PF, or offset 0x5010 for a VF). Write 1
to this register to enable the interrupt. Once the interrupt is enabled, the mailbox sends an
interrupt to the QDMA whenever there is a pending event for the mailbox to process, namely an
incoming message pending or an acknowledgment for an outgoing message. Configure the
interrupt vector through the Function Interrupt Vector Register (0x22408 for a PF, or 0x5008 for a VF)
according to the driver configuration.
Enabling the interrupt does not change the event logging mechanism, which means you must still
check the pending events by reading the Function Status Registers. The first step in responding to
an interrupt request is to disable the interrupt. The actual number of pending events can be greater
than the number of events at the moment the mailbox sent the interrupt.
✎ Recommended: AMD recommends that the user application interrupt handler process all the
pending events that are present in the status register. Upon finishing the interrupt response, the user
application re-enables the interrupt.
The mailbox checks its event status at the time the interrupt control changes from disabled to
enabled. If any new events arrived at the mailbox between reading the interrupt status and
re-enabling the interrupt, the mailbox generates a new interrupt request immediately.
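The recommended disable-drain-re-enable flow can be sketched as follows. The class models the behavior described above; it is not real MMIO access, and the method names are illustrative.

```python
# Sketch of the recommended mailbox interrupt handling: disable the
# interrupt, drain every pending event from the status register, then
# re-enable. The mailbox re-raises immediately if events arrived in
# between, so no event is lost.

class MailboxIrq:
    def __init__(self):
        self.enabled = False
        self.pending = 0        # models the Function Status Registers
        self.raised = 0         # count of interrupt requests sent

    def enable(self, on):
        self.enabled = on
        if on and self.pending:      # events arrived while disabled
            self.raised += 1         # mailbox re-raises immediately

    def event(self):
        self.pending += 1
        if self.enabled:
            self.raised += 1

    def isr(self, process):
        self.enable(False)           # step 1: disable the interrupt
        while self.pending:          # step 2: drain all pending events
            process()
            self.pending -= 1
        self.enable(True)            # step 3: re-enable

m = MailboxIrq()
m.enable(True)
m.event()
handled = []
m.isr(lambda: handled.append(1))
print(m.raised, m.pending, len(handled))  # 1 0 1
```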
Function Level Reset (FLR)
The function level reset (FLR) mechanism enables the software to quiesce and reset Endpoint
hardware with function-level granularity. When a VF is reset, only the resources associated with that
VF are reset. When a PF is reset, all resources of the PF, including those of its associated VFs, are
reset. Because FLR is a privileged operation, it must be performed by the PF driver running in the
management system.
Use Mode
The hypervisor requests FLR when a function is attached or detached (that is, powered on or
off).
You can request FLR as follows:
where $BDF is the bus device function number of the targeted function.
FLR Process
A complete FLR process involves three major steps.
1. Pre-FLR: Pre-FLR resets all QDMA context structure, mailbox, and user logic of the target
function.
Each function has a register called the FLR Control Status register, which keeps track of the
pre-FLR status of the function. Its offset is calculated as FLR Control Status register offset
= MB_base + 0x100; that is, it is located at offset 0x100 from the mailbox memory space of
the function. Note that PF and VF have different MB_base values. The definition of the FLR
Control Status register is shown in the following table.
The software writes 1 to Bit[0] flr_status of the target function to initiate pre-FLR. The
hardware clears Bit[0] flr_status when pre-FLR completes. The software keeps polling
on Bit[0] flr_status, and only proceeds to the next step when it returns 0.
2. Quiesce: The software must ensure that all pending transactions are completed. This can be done by
polling the Transaction Pending bit in the Device Status register (in the PCIe Configuration Space).
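The pre-FLR handshake in step 1 can be sketched as follows. The FLR Control Status offset (MB_base + 0x100) and the set-then-poll behavior of flr_status come from the text; the register file and callbacks are simulated, and the poll bound is an assumption.

```python
# Sketch of the pre-FLR handshake: software sets flr_status (bit 0 of
# the FLR Control Status register at MB_base + 0x100) and polls until
# hardware clears it. read/write stand in for real register access.

FLR_CTRL_STATUS_OFF = 0x100

def pre_flr(read, write, mb_base, max_polls=1000):
    addr = mb_base + FLR_CTRL_STATUS_OFF
    write(addr, read(addr) | 0x1)            # initiate pre-FLR
    for _ in range(max_polls):
        if (read(addr) & 0x1) == 0:          # hardware cleared flr_status
            return True                      # safe to proceed to quiesce
    return False                             # timed out

# Simulated hardware that clears the bit after two reads.
regs = {0x1100: 0}
polls = {"n": 0}
def read(a):
    polls["n"] += 1
    if polls["n"] > 2:
        regs[a] &= ~0x1
    return regs[a]
def write(a, v):
    regs[a] = v

print(pre_flr(read, write, 0x1000))  # True
```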
OS Support
If the PF driver is loaded and alive (that is, use mode 1), all three steps described above are
performed by the driver. However, for an AMD Versal device, if you want to perform FLR before
loading the PF driver (as defined in Use Mode above), an OS kernel patch is provided to allow the OS
to perform the correct FLR sequence through functions defined in //…/source/drivers/pci/quick.c.
CPM5 Mailbox IP
You need to add a new IP from the IP catalog to instantiate the pcie_qdma_mailbox IP. This IP is
needed for function virtualization. The pcie_qdma_mailbox IP should be connected to the versal_cips
IP as shown in the following diagram:
✎ Note: Mailbox ports are always connected to the Mailbox IP. If the Mailbox IP is not used, leave
the ports unconnected (floating). See the preceding figure for connection reference.
The following connections are related to the above example design. To connect the Mailbox IP, follow
these steps:
Mailbox IP has two clocks and two resets as shown in the preceding figure. In this example, both the
clocks are generated from the PMC block.
axi_aclk
The Mailbox IP runs at 250 MHz; this clock is used internally in the Mailbox IP.
ip_clk
Depending on the configuration, the PL might need to run at a higher frequency to satisfy the
data throughput. For example, a Gen5x8 PL needs to run at 433 MHz to satisfy the data
throughput; in this case, ip_clk should be connected to a 433 MHz clock.
ip_resetn
It is synchronous with ip_clk and it is derived from the CIPS IP.
axi_aresetn
It is synchronous with axi_aclk. Use the proc_sys_reset IP to generate a reset synchronous to
axi_aclk.
For some configurations, such as Gen3x16 and Gen4x8, the PL clock's maximum speed is 250 MHz. In
these cases the PL clock runs at 250 MHz and the pcie_qdma_mailbox IP also runs at 250 MHz, so
connect ip_clk to axi_aclk and ip_resetn to axi_aresetn.
Follow the CPM5 Mailbox Connection figure to make the following connections:
✎ Note: Mailbox access can be steered to NoC0 or NoC1 port based on the CIPS GUI configuration.
You should configure the NoC based on the CIPS GUI selection.
Port ID
Port ID is the categorization of some queues on the FPGA side. When the DMA is shared by more
than one user application, the port ID provides indirection to QID so that all the interfaces can be
Host Profile
The host profile must be programmed to represent the Root Port host. It can be programmed
through indirect context programming: select QDMA_CTXT_SELC_HOST_PROFILE (4'hA) in
QDMA_IND_CTXT_CMD. The host profile context structure is given in the following table.
The Host Profile Context must be programmed for any QDMA AXI4-MM transfers, and it must be
programmed before any queue context programming. This affects only AXI4-MM DMA transfers, not
streaming transfers.
There are some restrictions based on the QDMA selection. For more information, see Controller
Steering Options table in Master Bridge section.
The following example illustrates how a host profile can be programmed to direct some queues to a
specific location.
The example uses QDMA0 and two Host IDs. Host ID 0 targets Queues from 0 to 9 to NoC Channel 0
and Host ID 1 targets Queues from 10 to 19 to NoC Channel 1.
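The indirect context write described above can be sketched in C. The QDMA_IND_CTXT_CMD field layout used here (busy at bit 0, context selector at bits [4:1], op at bits [6:5], and the target ID from bit 7), the write op encoding, and the use of the qid field to carry the host ID are assumptions for illustration; verify them against the register map CSV before use.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed QDMA_IND_CTXT_CMD field layout (verify against the register CSV):
 * bit 0 = busy, bits [4:1] = context selector, bits [6:5] = op, id from bit 7. */
#define CTXT_CMD_BUSY               (1u << 0)
#define CTXT_CMD_SEL_SHIFT          1
#define CTXT_CMD_OP_SHIFT           5
#define CTXT_CMD_ID_SHIFT           7

#define QDMA_CTXT_SELC_HOST_PROFILE 0xAu /* 4'hA, from the text above */
#define CTXT_OP_WR                  0x1u /* assumed encoding of a context write */

/* Build the command word for an indirect host-profile context write.
 * For this sketch, the qid field of the command carries the host ID. */
static uint32_t host_profile_wr_cmd(uint32_t host_id)
{
    return (host_id << CTXT_CMD_ID_SHIFT) |
           (CTXT_OP_WR << CTXT_CMD_OP_SHIFT) |
           (QDMA_CTXT_SELC_HOST_PROFILE << CTXT_CMD_SEL_SHIFT) |
           CTXT_CMD_BUSY;
}
```

In the two-host example above, the driver would issue one such command per host ID (0 and 1) after loading the context data registers with the corresponding NoC channel steering.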
System Management
Resets
The QDMA supports all the PCIe defined resets, such as link down, reset, hot reset, and function level
reset (FLR) (supports only Quiesce mode).
VDM
Vendor Defined Messages (VDMs) are an expansion of the existing messaging capabilities with PCI
Express. PCI Express Specification defines additional requirements for Vendor Defined Messages,
header formats and routing information. For details, see PCI-SIG Specifications
(https://www.pcisig.com/specifications).
QDMA allows the transmission and reception of VDMs. To enable this feature, select Enable Bridge
Slave Mode in the Vivado Customize IP dialog box. This enables the st_rx_msg interface.
RX Vendor Defined Messages are stored in a shallow FIFO before they are transmitted to the output
port. When many back-to-back VDM messages arrive, the FIFO can overflow and those messages are
dropped, so it is better to repeat VDM messages at regular intervals.
Throughput for VDMs depends on several factors: PCIe speed, data width, message length, and the
internal VDM pipeline.
The internal VDM pipeline feeds the internal RX VDM FIFO interface for network on chip
(NoC) access, which has a shallow buffer of 64B.
✎ Note: New VDM messages will be dropped if more than 64B of VDM are received before the FIFO
is serviced through NoC.
The internal RX VDM FIFO interface cannot handle back-to-back messages; the pipeline can
service only one in every four accesses, which is about 25% efficiency relative to host access.
‼ Important: Do not use back-to-back VDM access.
RX Vendor Defined Messages:
1. When QDMA receives a VDM, the incoming messages will be received on the st_rx_msg port.
2. The incoming data stream will be captured on the st_rx_msg_data port (per-DW).
TX Vendor Defined Messages:
1. To enable transmission of VDM from QDMA, program the TX Message registers in the Bridge
through the AXI4 Slave interface.
2. Bridge has TX Message Control, Header L (bytes 8-11), Header H (bytes 12-15) and TX
Message Data registers as shown in the PCIe TX Message Data FIFO Register
(TX_MSG_DFIFO).
3. Issue a Write to offset 0xE64 through AXI4 Slave interface for the TX Message Header L
register.
4. Program offset 0xE68 for the required VDM TX Header H register.
5. Program up to 16DW of Payload for the VDM message starting from DW0 – DW15 by sending
Writes to offset 0xE6C one by one.
6. Program the msg_routing, msg_code, data length, requester function field and msg_execute
field in the TX_MSG_CTRL register in offset 0xE60 to send the VDM TX packet.
7. The TX Message Control register also indicates the completion status of the message in bit 23.
Read this bit to confirm successful transmission of the VDM packet.
8. All fields in these registers are RW except bit 23 (msg_fail) in the TX Control register,
which is cleared by writing a 1.
9. VDM TX packet will be sent on the AXI-ST RQ transmit interface.
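The programming sequence above can be exercised against a mock register window, as in the sketch below. The register offsets (0xE60 through 0xE6C) come from the steps above; the TX_MSG_CTRL field packing (length in the low bits, msg_execute at bit 31) is a placeholder, not the documented layout, so consult the bridge register CSV for the real positions.

```c
#include <assert.h>
#include <stdint.h>

#define TX_MSG_CTRL   0xE60u
#define TX_MSG_HDR_L  0xE64u
#define TX_MSG_HDR_H  0xE68u
#define TX_MSG_DFIFO  0xE6Cu

/* Simple model of the bridge AXI4 Slave register window; the DFIFO entry
 * keeps only the last write, unlike the real FIFO. */
static uint32_t regs[0x1000 / 4];

static void     reg_wr(uint32_t off, uint32_t val) { regs[off / 4] = val; }
static uint32_t reg_rd(uint32_t off)               { return regs[off / 4]; }

static void send_vdm(uint32_t hdr_l, uint32_t hdr_h,
                     const uint32_t *payload, unsigned ndw)
{
    reg_wr(TX_MSG_HDR_L, hdr_l);                 /* step 3 */
    reg_wr(TX_MSG_HDR_H, hdr_h);                 /* step 4 */
    for (unsigned i = 0; i < ndw && i < 16; i++) /* step 5: up to 16 DW */
        reg_wr(TX_MSG_DFIFO, payload[i]);
    reg_wr(TX_MSG_CTRL, ndw | (1u << 31));       /* step 6: assumed packing */
}

/* Run the sequence once and return the control word that was written. */
static uint32_t send_example_vdm(void)
{
    const uint32_t payload[2] = { 0xAA, 0xBB };
    send_vdm(0x01020304u, 0x05060708u, payload, 2);
    return reg_rd(TX_MSG_CTRL);
}
```

On real hardware, step 7 would follow: poll bit 23 of TX_MSG_CTRL to confirm the packet was sent.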
Expansion ROM
If selected, the Expansion ROM is activated and can be a value from 2 KB to 4 GB. According to the
PCI Local Bus Specification ( PCI-SIG Specifications (https://www.pcisig.com/specifications)), the
maximum size for the Expansion ROM BAR should be no larger than 16 MB. Selecting an address
space larger than 16 MB can result in a non-compliant core.
Errors
Bridge Errors
Slave bridge abnormal conditions are classified as: Illegal Burst Type and Completion TLP Errors. The
following sections describe the manner in which the Bridge handles these errors.
Illegal Burst Type
The slave bridge monitors AXI read and write burst type inputs to ensure that only the INCR
(incrementing burst) type is requested. Any other value on these inputs is treated as an error condition
and the Slave Illegal Burst (SIB) interrupt is asserted. In the case of a read request, the Bridge
asserts SLVERR for all data beats and arbitrary data is placed on the Slave AXI4-MM read data bus.
The following sections describe the manner in which the master bridge handles abnormal conditions.
AXI DECERR Response
When the master bridge receives a DECERR response from the AXI bus, the request is discarded
and the Master DECERR (MDE) interrupt is asserted. If the request was non-posted, a completion
packet with the Completion Status = Unsupported Request (UR) is returned on the bus for PCIe.
AXI SLVERR Response
When the master bridge receives a SLVERR response from the addressed AXI slave, the request is
discarded and the Master SLVERR (MSE) interrupt is asserted. If the request was non-posted, a
completion packet with the Completion Status = Completer Abort (CA) is returned on the bus for PCIe.
Max Payload Size for PCIe, Max Read Request Size
Completion Packets
When the MAX_READ_REQUEST_SIZE is greater than the MAX_PAYLOAD_SIZE, a read request for PCIe
can ask for more data than the master bridge can insert into a single completion packet. When this
Linkdown Errors
If the PCIe link goes down during DMA operations, transactions might be lost and the DMA might not
be able to complete. In such cases, the AXI4 interfaces continue to operate. Outstanding read
requests on the C2H Bridge AXI4 MM interface receive correct completions or completions with a
slave error response. The DMA logs a link down error in the status register. It is the responsibility of
the driver to have a timeout and handle recovery of a link down situation.
Data protection is supported on the primary data paths. CRC errors can occur on C2H and H2C
streaming. Parity errors can occur on the Memory Mapped, Bridge Master, and Bridge Slave interfaces.
DMA Errors
All DMA errors are logged in their respective error status registers. Each block has error status
and error mask registers so that errors can be propagated to a higher level and eventually to the
QDMA_GLBL_ERR_STAT register.
Errors can be marked fatal based on register settings. On a fatal error, the DMA stops the transfer
and sends an interrupt if enabled. After debug and analysis, you must invalidate and restart the
queue to restart the DMA transfer.
Error Aggregator
There are Leaf Error Aggregators in different places. They log the errors and propagate them to the
central place. The Central Error Aggregator aggregates the errors from all of the Leaf Error
Aggregators.
The QDMA_GLBL_ERR_STAT register is the error status register of the Central Error Aggregator. The
bit fields indicate the locations of Leaf Error Aggregators. Then, look for the error status register of the
individual Leaf Error Aggregator to find the exact error.
The QDMA_GLBL_ERR_MASK register is the error mask register of the Central Error Aggregator. It
has the mask bits for the corresponding errors. When a mask bit is set to 1'b1, the
corresponding error is propagated to the next level to generate an interrupt. Details of
error-generated interrupts are described in the interrupt section. The error interrupt is
controlled by the QDMA_GLBL_ERR_INT (0xB04) register.
Each Leaf Error Aggregator has an error status register and an error mask register. The error status
register logs the error. The hardware sets the bit when the error happens, and the software can write
1'b1 to clear the bit if needed. The error mask register has the mask bits for the corresponding errors.
When the mask bit is set to 1'b1, it will enable the propagation of the corresponding error to the
Central Error Aggregator. The error mask register does not affect the error logging to the error status
register.
The error status registers and the error mask registers of the Leaf Error Aggregators are as follows.
C2H MM Error
QDMA_C2H MM Status (0x1040)
C2H MM Error Code Enable Mask (0x1054)
C2H MM Error Code (0x1058)
C2H MM Error Info (0x105C)
TRQ Error
QDMA_GLBL_TRQ_ERR_STS (0x264): This is the error status register of the Trq errors.
QDMA_GLBL_TRQ_ERR_MSK (0x268): This is the error mask register.
QDMA_GLBL_TRQ_ERR_LOG_A (0x26C): This is the error logging register. It shows the select,
function and the address of the access when the error happens.
Descriptor Error
QDMA_GLBL_DSC_ERR_STS (0x254): This is the error status register.
QDMA_GLBL_DSC_ERR_MSK (0x258): This is the error mask register.
QDMA_GLBL_DSC_ERR_LOG0 (0x25C): This is the error logging register. It has the QID, DMA
direction, and the consumer index of the error.
C2H Streaming Fatal Error
QDMA_C2H_FATAL_ERR_STAT (0xAF8): The error status register of the C2H streaming fatal
errors.
QDMA_C2H_FATAL_ERR_MASK (0xAFC): The error mask register. The SW can set the bit to
enable the corresponding C2H fatal error to be sent to the C2H fatal error handling logic.
QDMA_C2H_FATAL_ERR_ENABLE (0xB00): This register enables two C2H streaming fatal
error handling processes:
bit[0]
Stop the data transfer by disabling the write request from the C2H DMA write engine by
setting enable_wrq_dis bit [0] to 1.
bit[1]
Invert the write payload parity on the data transfer by setting enable_wpl_par_inv bit [1]
to 1.
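The two-level lookup described above (central status register first, then the flagged leaf's status register) can be sketched as a table walk. The mapping from GLBL_ERR_STAT bits to leaf offsets below is hypothetical; the real bit assignments are in the QDMA_GLBL_ERR_STAT description of the register map files. Only the leaf status offsets themselves are taken from the list above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical map from Central Error Aggregator status bits to the
 * leaf status-register offsets listed above. */
struct leaf { uint32_t bit; uint32_t sts_off; };

static const struct leaf leaves[] = {
    { 0, 0x254  }, /* descriptor errors: QDMA_GLBL_DSC_ERR_STS   */
    { 1, 0x264  }, /* TRQ errors:        QDMA_GLBL_TRQ_ERR_STS   */
    { 2, 0xAF8  }, /* C2H ST fatal:      QDMA_C2H_FATAL_ERR_STAT */
    { 3, 0x1040 }, /* C2H MM:            C2H MM status           */
};

/* Given the central QDMA_GLBL_ERR_STAT value, return the status-register
 * offset of the first flagged leaf aggregator, or 0 if none is flagged. */
static uint32_t next_leaf(uint32_t glbl_err_stat)
{
    for (unsigned i = 0; i < sizeof(leaves) / sizeof(leaves[0]); i++)
        if (glbl_err_stat & (1u << leaves[i].bit))
            return leaves[i].sts_off;
    return 0;
}
```

A driver would read the returned leaf status register to find the exact error, then write 1'b1 to clear the logged bit.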
Port Descriptions
AMD Versal™ device CPM5 has two QDMA IPs, which can be selected in the AMD Vivado™ IP
customization GUI.
Based on the GUI selection, QDMA0 or QDMA1 ports will be enabled. Ports described below have a
prefix of dma<n>_, which can be dma0_ for QDMA Port 0 or dma1_ for QDMA Port 1.
✎ Note: Ports without prefix dma0_ or dma1_ are global ports.
Table: Parameters
AXI Bridge Slave ports are connected from the AMD Versal device Network on Chip (NoC) to the
CPM DMA internally. For Slave Bridge AXI4 details, see the Versal Adaptive SoC Programmable
Network on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).
To access QDMA registers, you must follow the protocols outlined in the AXI Slave Bridge Register
Limitations section.
CPM_PL_AXI0_ruser[C_M_AXI_DATA_WIDTH/8-1:0] I Master read odd data parity, per byte. This
port is enabled only in Data Protection mode.
CPM_PL_AXI1_ruser[C_M_AXI_DATA_WIDTH/8-1:0] I Master read odd data parity, per byte. This
port is enabled only in Data Protection mode.
The AMD Versal device Network on Chip (NoC) provides only an AXI4 interface. If you need an
AXI4-Lite interface, use the SmartConnect IP to convert the NoC AXI4 output to AXI4-Lite.
For details, see SmartConnect LogiCORE IP Product Guide (PG247).
dma<n>_m_axis_h2c_qid[1:0] O Queue ID
dma<n>_m_axis_h2c_port_id[2:0] O Port ID
dma<n>_m_axis_h2c_mdata[31:0] O Metadata
In internal mode, QDMA passes the lower 32 bits of
the H2C AXI4-Stream descriptor on this field.
dma<n>_m_axis_h2c_mty[5:0] O The number of bytes that are invalid on the last beat
of the transaction. This field is 0 for a 64B transfer.
dma<n>_m_axis_h2c_tvalid O Valid
dma<n>_m_axis_h2c_tready I Ready
dma<n>_s_axis_c2h_ctrl_len I Length of the packet. For a zero-byte write, the length
[15:0] is 0.
C2H stream packet data length is limited to 31 * C2H
buffer size.
ctrl_len is in bytes and must be valid in the first beat of
the packet.
dma<n>_s_axis_c2h_ctrl_qid I Queue ID
[11:0]
dma<n>_s_axis_c2h_ctrl_port_id I Port ID
[2:0]
dma<n>_s_axis_c2h_tvalid I Valid
dma<n>_s_axis_c2h_tready O Ready
To generate ECC signals for the C2H control bus dma<n>_s_axis_c2h_ctrl_ecc[6:0], use the AMD
Error Correction Code (ECC) IP. Input signals to the ECC IP are listed below, and you must
maintain the order as listed.
dma<n>_s_axis_c2h_cmpt_data[511:0]
I Completion data from the user application. This
contains information that is written to the completion
ring in the host.
dma<n>_s_axis_c2h_cmpt_ctrl_qid[1:0]
I Completion queue ID.
dma<n>_s_axis_c2h_cmpt_ctrl_marker
I Marker message used to make sure the pipeline is
completely flushed. After the marker completes, you
can safely invalidate the queue.
dma<n>_s_axis_c2h_cmpt_ctrl_user_trig
I User can trigger the interrupt and the status descriptor
write if they are enabled.
dma<n>_s_axis_c2h_cmpt_ctrl_cmpt_type[1:0]
I 2’b00: NO_PLD_NO_WAIT. The CMPT packet does
not have a corresponding payload packet, and it does
not need to wait.
2’b01: NO_PLD_BUT_WAIT. The CMPT packet does
not have a corresponding payload packet; however, it
still needs to wait for the payload packet to be sent
before sending the CMPT packet.
2’b10: RSVD.
2’b11: HAS_PLD. The CMPT packet has a
corresponding payload packet, and it needs to wait for
the payload packet to be sent before sending the
CMPT packet.
dma<n>_s_axis_c2h_cmpt_ctrl_wait_pld_pkt_id[15:0]
I The data payload packet ID that the CMPT packet
needs to wait for before it can be sent.
dma<n>_s_axis_c2h_cmpt_ctrl_port_id[2:0]
I Port ID.
dma<n>_s_axis_c2h_cmpt_ctrl_col_idx[2:0]
I Color index that defines if the user wants to have the
color bit in the CMPT packet and the bit location of the
color bit if present.
dma<n>_s_axis_c2h_cmpt_ctrl_err_idx[2:0]
I Error index that defines if the user wants to have the
error bit in the CMPT packet and the bit location of the
error bit if present.
dma<n>_s_axis_c2h_cmpt_ctrl_no_wrb_marker
I Disables CMPT packet during Marker transfer.
1'b0 : CMPT packets are sent to CMPT ring
1'b1 : CMPT packets are not sent to CMPT ring.
dma<n>_s_axis_c2h_cmpt_tready O Ready.
dma<n>_axis_c2h_dmawr_target_vch
O Target virtual channel
dma<n>_c2h_dmawr_port_id O Port ID
VDM Ports
dma<n>_st_rx_msg_tvalid O Valid
dma<n>_st_rx_msg_tdata[31:0] O Beat 1:
{REQ_ID[15:0], VDM_MSG_CODE[7:0],
VDM_MSG_ROUTING[2:0],
VDM_DW_LENGTH[4:0]}
Beat 2:
VDM Lower Header [31:0]
or
{(Payload_length=0), VDM Higher Header [31:0]}
Beat 3 to Beat <n>:
VDM Payload
dma<n>_st_rx_msg_tready I Ready.
✎ Note: When this interface is not used, Ready must
be tied-off to 1.
RX Vendor Defined Messages are stored in a shallow FIFO before they are transmitted to the output
ports. When many back-to-back VDM messages arrive, the FIFO overflows and those messages are
dropped. It is best to repeat VDM messages at regular intervals.
FLR Ports
dma<n>_usr_flr_set O Set
Asserted for 1 cycle indicating that the FLR status of
the function indicated on
dma<n>_usr_flr_fnc[12:0] is active.
dma<n>_usr_flr_clear O Reserved
dma<n>_h2c_byp_in_st_error I This bit can be set to indicate an error for the queue.
The descriptor is not processed. Context is updated to
reflect an error in the queue.
dma<n>_h2c_byp_in_st_cidx I The CIDX that is used for the status descriptor update
[15:0] and/or interrupt (aggregation mode). Generally the
CIDX should be left unchanged from when it is
received from the descriptor bypass output interface.
dma<n>_h2c_byp_in_mm_<y>_radr[63:0]
I The read address for the DMA data.
dma<n>_h2c_byp_in_mm_<y>_wadr[63:0]
I The write address for the DMA data.
dma<n>_h2c_byp_in_mm_<y>_no_dma
I H2C Bypass In No DMA
When sending in a descriptor through this interface
with this signal asserted, this signal informs the
QDMA to not send any PCIe requests for this
descriptor. Because no PCIe request is sent out, no
corresponding DMA data is issued on the H2C MM
output interface.
This is typically used in conjunction with
h2c_byp_in_mm_sdi to cause Status
Descriptor/Interrupt when the user logic is out of the
actual descriptors and still wants to drive the
h2c_byp_in_mm_sdi signal.
If h2c_byp_in_mm_mrkr_req and
h2c_byp_in_mm_sdi are reset when sending in a no-
DMA descriptor, the descriptor is treated as a No
Operation (NOP) and is completely consumed inside
the QDMA without any interface activity.
If h2c_byp_in_mm_no_dma is set, the QDMA ignores
the address. The length field should be set to 0.
dma<n>_h2c_byp_in_mm_<y>_len[27:0]
I The DMA data length.
The upper 12 bits must be tied to 0. Thus only the
lower 16 bits of this field can be used for specifying
the length.
dma<n>_h2c_byp_in_mm_<y>_mrkr_req
I H2C-MM Bypass In Marker Request
dma<n>_h2c_byp_in_mm_<y>_error I This bit can be set to indicate an error for the queue.
The descriptor is not processed. Context is updated to
reflect an error in the queue.
dma<n>_h2c_byp_in_mm_<y>_cidx I The CIDX that is used for the status descriptor update
[15:0] and/or interrupt (aggregation mode). Generally the
CIDX should be left unchanged from when it was
received from the descriptor bypass output interface.
dma<n>_h2c_byp_in_mm_<y>_port_id
I QDMA port ID
[2:0]
dma<n>_h2c_byp_in_mm_<y>_ready
O Ready to take in descriptor
The following is an example timing diagram for H2C AXI-MM Bypass input:
dma<n>_c2h_byp_in_st_csh_error I This bit can be set to indicate an error for the queue.
The descriptor is not processed. Context is updated to
reflect an error in the queue. The error port is not valid in
Simple Bypass mode. You are responsible for feeding in
good descriptors. If a descriptor has an error on
Bypass out, you must fix the error first.
dma<n>_c2h_byp_in_st_csh_port_id[2:0]
I QDMA port ID
dma<n>_c2h_byp_in_st_csh_pfch_tag[6:0]
I Prefetch tag. The prefetch tag points to the cam that
stores the active queues in the prefetch engine. In
Cache Bypass mode, you must loop back
dma<n>_c2h_byp_out_pfch_tag[6:0]to
dma<n>_c2h_byp_in_st_csh_pfch_tag[6:0].In
1. AXI-Stream C2H Simple Bypass mode and Cache Bypass mode both use the same bypass
ports, dma<n>_c2h_byp_in_st_csh_*.
The following is an example timing diagram for C2H AXI-Stream Bypass Input:
dma<n>_c2h_byp_in_mm_<y>_wadr[63:0]
I The write address for the DMA data.
dma<n>_c2h_byp_in_mm_<y>_no_dma
I C2H Bypass In No DMA
When sending in a descriptor through this interface
with this signal asserted, this signal informs the
QDMA to not send any PCIe requests for this
descriptor. Because no PCIe request is sent out, no
corresponding DMA data is read from C2H MM
interface.
dma<n>_c2h_byp_in_mm_<y>_len[27:0]
I The DMA data length. The upper 12 bits must be tied
to 0. Thus, only the lower 16 bits of this field can be
used for specifying the length.
dma<n>_c2h_byp_in_mm_<y>_mrkr_req
I C2H Bypass In Marker Request
You must send an indication that the QDMA must
send a completion status after the QDMA completes
the data transfer of this descriptor.
dma<n>_c2h_byp_in_mm_<y>_error I This bit can be set to indicate an error for the queue.
The descriptor is not processed. Context is updated to
reflect an error in the queue.
dma<n>_c2h_byp_in_mm_<y>_cidx I You must echo the CIDX from the descriptor that was
[15:0] received on the bypass-out interface.
dma<n>_c2h_byp_in_mm_<y>_port_id[2:0]
I QDMA port ID
dma<n>_c2h_byp_in_mm_<y>_ready
O Ready to take in descriptor.
The following is an example timing diagram for C2H AXI-MM Bypass input:
dma<n>_h2c_byp_out_fmt[2:0] O Format
The encoding for this field is as follows.
0x0: Standard descriptor
0x1 - 0x7: Reserved
dma<n>_c2h_byp_out_pfch_tag[6:0]O Prefetch tag. The prefetch tag points to the cam that
stores the active queues in prefetch engine
dma<n>_c2h_byp_out_fmt[2:0] O Format
The encoding for this field is as follows.
0x0 : Standard descriptor
0x1 - 0x7 : Reserved
dma<n>_dsc_crdt_in_fence I If the fence bit is set, the credits are not coalesced,
and the queue is guaranteed to generate a descriptor
fetch before subsequent credit updates are
processed. The fence bit should only be set for a
queue that is enabled, and has both descriptors and
credits available, otherwise a hang condition might
occur.
dma<n>_dsc_crdt_in_qid I The QID associated with the descriptor ring for which
[1:0] credits are being added.
dma<n>_tm_dsc_sts_port_id O The port id associated with the queue from the queue
[2:0] context.
User Interrupts
dma<n>_usr_irq_valid I Valid
An assertion indicates that an interrupt associated
with the vector, function, and pending fields on the
bus should be generated to PCIe. Once asserted,
dma<n>_usr_irq_valid must remain high until
dma<n>_usr_irq_ack is asserted by the DMA.
dma<n>_qsts_out_data[63:0] O The data field for the individual opcodes are defined
in the tables below.
dma<n>_qsts_out_port_id[2:0] O Port ID
dma<n>_qsts_out_qid[12:0] O Queue ID
NoC Ports
✎ Note: NoC ports must always be connected to the NoC IP block; you cannot leave them
unconnected or connect them to any other block. Doing so results in synthesis and implementation
errors. For connection reference, see the following figure:
noc_cpm_pcie_axi0_clk O Clock for the AXI4 MM port from the NoC. This port is
enabled when AXI Slave Bridge is enabled.
dma0_mgmt_req_rdy O Ready
dma0_mgmt_cpl_rdy I Ready
dma0_mgmt_cpl_sts[1:0] O
bit[0]: 1 = error, 0 = good
bit[1]: 1 = write response, 0 = read response
QDMA Management port should be connected to mailbox ports as described in CPM5 Mailbox IP.
Register Space
NA
Reserved
RO
Read-Only - Register bits are read-only and cannot be altered by the software.
RW
Read-Write - Register bits are read-write and are permitted to be either Set or Cleared by the
software to the desired state.
RW1C
Write-1-to-clear-status - Register bits indicate status when read. A Set bit indicates a status
event which is Cleared by writing a 1b. Writing a 0b to RW1C bits has no effect.
W1C
Non-readable-write-1-to-clear-status - Register will return 0 when read. Writing 1b Clears the
status for that bit index. Writing a 0b to W1C bits has no effect.
W1S
Non-readable-write-1-to-set - Register will return 0 when read. Writing 1b Sets the control set for
that bit index. Writing a 0b to W1S bits has no effect.
All the physical function (PF) registers are listed in the cpm5-qdma-v4-0-pf-registers.csv available in
the register map files.
Register Name Base (Hex) Byte Size (Dec) Register List and Details
QDMA_CSR (0x0000)
QDMA_TRQ_SEL_QUEUE_PF (0x18000)
QDMA_DMAP_SEL_H2C_DSC_PIDX[2048]
0x18004- H2C Descriptor Producer index
(0x18004) 0x1CFF4 (PIDX)
QDMA_DMAP_SEL_C2H_DSC_PIDX[2048]
0x18008- C2H Descriptor Producer Index
(0x18008) 0x1CFF8 (PIDX)
There are 2048 queues, and each queue has the four registers listed below. All these registers
can be dynamically updated at any time. This set of registers is accessed based on the queue number.
For Queue 0:
QDMA_DMAP_SEL_INT_CIDX[2048] (0x18000)
QDMA_DMAP_SEL_H2C_DSC_PIDX[2048] (0x18004)
QDMA_DMAP_SEL_C2H_DSC_PIDX[2048] (0x18008)
QDMA_DMAP_SEL_CMPT_CIDX[2048] (0x1800C)
For Queue 1:
QDMA_DMAP_SEL_INT_CIDX[2048] (0x18010)
QDMA_DMAP_SEL_H2C_DSC_PIDX[2048] (0x18014)
QDMA_DMAP_SEL_C2H_DSC_PIDX[2048] (0x18018)
QDMA_DMAP_SEL_CMPT_CIDX[2048] (0x1801C)
For Queue 2:
QDMA_DMAP_SEL_INT_CIDX[2048] (0x18020)
QDMA_DMAP_SEL_H2C_DSC_PIDX[2048] (0x18024)
QDMA_DMAP_SEL_C2H_DSC_PIDX[2048] (0x18028)
QDMA_DMAP_SEL_CMPT_CIDX[2048] (0x1802C)
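A driver typically computes these per-queue direct-update (DMAP) addresses rather than tabulating them. The sketch below assumes a 0x10-byte stride per queue, which matches the four 4-byte registers starting at 0x18000; confirm the stride against cpm5-qdma-v4-0-pf-registers.csv before relying on it.

```c
#include <assert.h>
#include <stdint.h>

#define QDMA_TRQ_SEL_QUEUE_PF 0x18000u
#define DMAP_STRIDE           0x10u /* assumed per-queue stride */

/* Offsets of the four direct-update registers within a queue's block. */
#define DMAP_INT_CIDX  0x0u
#define DMAP_H2C_PIDX  0x4u
#define DMAP_C2H_PIDX  0x8u
#define DMAP_CMPT_CIDX 0xCu

/* Byte address of a direct-update register for a given queue ID. */
static uint32_t dmap_addr(uint32_t qid, uint32_t reg)
{
    return QDMA_TRQ_SEL_QUEUE_PF + qid * DMAP_STRIDE + reg;
}
```

For example, updating the H2C producer index of queue 0 writes dmap_addr(0, DMAP_H2C_PIDX), which is 0x18004 as listed above.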
QDMA_PF_MAILBOX (0x42400)
Mailbox Addressing
PF addressing
Addr = PF_Bar_offset + CSR_addr
VF addressing
Addr = VF_Bar_offset + VF_Start_offset + VF_offset + CSR_addr
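The two addressing formulas above transcribe directly into helper functions. The operand values used in the usage note are illustrative only; the actual BAR offsets and VF start offsets come from the device configuration.

```c
#include <assert.h>
#include <stdint.h>

/* PF addressing: Addr = PF_Bar_offset + CSR_addr */
static uint64_t pf_mbox_addr(uint64_t pf_bar_offset, uint32_t csr_addr)
{
    return pf_bar_offset + csr_addr;
}

/* VF addressing: Addr = VF_Bar_offset + VF_Start_offset + VF_offset + CSR_addr */
static uint64_t vf_mbox_addr(uint64_t vf_bar_offset, uint64_t vf_start_offset,
                             uint64_t vf_offset, uint32_t csr_addr)
{
    return vf_bar_offset + vf_start_offset + vf_offset + csr_addr;
}
```

For instance, a PF access at CSR offset 0 within QDMA_PF_MAILBOX (0x42400) resolves to the mailbox base itself.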
[1] 0 RO o_msg_status For VF: The status bit is set when the VF
driver writes msg_send to its command
register. When the associated PF driver
sends an acknowledgment to this VF, the
hardware clears this field. The VF driver is
not allowed to update any content in its
outgoing mailbox memory (OMM) while
o_msg_status is asserted. Any illegal
write to the OMM is discarded
(optionally, this can cause an error on the
AXI4-Lite response channel).
For PF: This field indicates the message
status of the target FN specified
in the Target FN register. The status bit
is set when the PF driver sends the
msg_send command. When the
corresponding function driver sends an
acknowledgment by sending msg_rcv, the
hardware clears this field. The PF driver is
not allowed to update any content in its
outgoing mailbox memory (OMM) while
o_msg_status(target_fn_id) is asserted.
Any illegal write to the OMM is
discarded (optionally, this can cause an
error on the AXI4-Lite response channel).
QDMA_TRQ_MSIX (0x50000)
✎ Note: The table above represents one MSI-X table entry 0. There are 2K MSI-X table entries for
the QDMA.
All the virtual function (VF) registers are listed in the cpm5-qdma-v4-0-vf-registers.csv available in the
register map files.
QDMA_TRQ_SEL_QUEUE_VF (0x3000)
VF functions can access direct update registers per queue with offset (0x3000). The description for
this register space is the same as QDMA_TRQ_SEL_QUEUE_PF (0x18000).
For Queue 0:
For Queue 1:
QDMA_TRQ_MSIX_VF (0x4000)
VF functions can access the MSIX table with offset (0x0000) from that function. The description for
this register space is the same as QDMA_TRQ_MSIX (0x50000).
QDMA_VF_MAILBOX (0x5000)
[1] 0 RO o_msg_status For VF: The status bit is set when the VF
driver writes msg_send to its command
register. When the associated PF driver
sends an acknowledgment to this VF, the
hardware clears this field. The VF driver is
not allowed to update any content in its
outgoing mailbox memory (OMM) while
o_msg_status is asserted. Any illegal
writes to the OMM are discarded
(optionally, this can cause an error on the
AXI4-Lite response channel).
For PF: This field indicates the message
status of the target FN specified
in the Target FN register. The status bit is
set when the PF driver sends the
msg_send command. When the
corresponding function driver sends an
acknowledgment through msg_rcv, the
hardware clears this field. The PF driver is
not allowed to update any content in its
outgoing mailbox memory (OMM) while
You can access the QDMA0 or QDMA1 register space using the AXI Slave interface. When AXI Slave
Bridge mode is enabled (based on GUI settings), you can also access the Bridge registers in
QDMA0 or QDMA1, and the host memory space. The host memory address offset varies based on the
QDMA0 and/or QDMA1 selection.
If only QDMA0 is enabled, the table below shows address ranges and limitations.
✎ Note: You cannot access QDMA CSR register space through AXI Slave Bridge interface. You can
only access QDMA Queue space register.
Slave Bridge access to host memory space: 0xE000_0000 - 0xEFFF_FFFF and
0x6_1101_0000 - 0x7_FFFF_FFFF. The address range for Slave Bridge access is set during IP
customization in the Address Editor tab of the Vivado IDE.
Slave Bridge access to host memory space: 0xE800_0000 - 0xEFFF_FFFF,
0x7_1101_0000 - 0x7_FFFF_FFFF, and 0xA0_0000_0000 - 0xBF_FFFF_FFFF. The address range for
Slave Bridge access is set during IP customization in the Address Editor tab of the Vivado IDE.
When both QDMA0 and QDMA1 controllers are enabled, the table above remains the same for
QDMA1 controller. The table shown below represents the QDMA0 controller.
Slave Bridge access to host memory space: 0xE000_0000 - 0xE7FF_FFFF,
0x6_1101_0000 - 0x6_FFFF_FFFF, and 0x80_0000_0000 - 0x9F_FFFF_FFFF. The address range for
Slave Bridge access is set during IP customization in the Address Editor tab of the Vivado IDE.
Bridge register addresses start at 0xE00. Addresses from 0x00 to 0xE00 are directed to the PCIe
configuration register space.
All the bridge registers are listed in the cpm5-qdma-v4-0-bridge-registers.csv available in the register
map files.
To locate the register space information:
QDMA_TRQ_SEL_QUEUE_PF (0x18000)
QDMA_TRQ_SEL_QUEUE_VF (0x3000)
Queue space register access from the AXI Slave interface should use the slave address range low
(see the table above) + 0x18000.
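The addressing rule above can be expressed as a small helper. The 0xE000_0000 range-low value in the usage check is just one of the example ranges from the tables above; the actual range depends on the IP customization.

```c
#include <assert.h>
#include <stdint.h>

#define QDMA_TRQ_SEL_QUEUE_PF 0x18000u

/* Queue-space register address through the AXI Slave Bridge:
 * slave address range low + 0x18000 + the per-queue register offset. */
static uint64_t slave_queue_reg_addr(uint64_t slave_range_low,
                                     uint32_t queue_reg_off)
{
    return slave_range_low + QDMA_TRQ_SEL_QUEUE_PF + queue_reg_off;
}
```

For example, with a range low of 0xE000_0000, the H2C PIDX register of queue 0 (offset 0x4) lands at 0xE001_8004.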
Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
Vivado Design Suite User Guide: Designing with IP (UG896)
Vivado Design Suite User Guide: Getting Started (UG910)
Vivado Design Suite User Guide: Logic Simulation (UG900)
This lab describes the process of generating an AMD Versal™ device QDMA design with AXI4
interface connected to network on chip (NoC) IP and DDR memory. This design has the following
configurations:
AXI4 memory mapped (AXI MM) connected to DDR through the NoC IP
Gen3 x 16
4 physical functions (PFs) and 252 virtual functions (VFs)
MSI-X interrupts
This lab provides step by step instructions to configure a Control, Interfaces and Processing System
(CIPS) QDMA design and network on chip (NoC) IP. The following figure shows the AXI4 Memory
Mapped (AXI-MM) interface to DDR using the NoC IP. At the end of this lab, you can synthesize and
implement the design, and generate a Programmable Device Image (PDI) file. The PDI file is used to
program the Versal device and run data traffic on a system. For the AXI-MM interface host to chip
(H2C) transfers, data is read from Host and sent to DDR memory. For chip to host (C2H) transfers,
data is read from DDR memory and written to host.
This lab targets xcvp1202-vsva2785-2MP-e-S part. This lab connects to DDR memory found outside
the Versal device. A constraints file is provided and added to the design during the lab. The
constraints file lists all DDR pins and their placement. You can modify the constraint file based on your
requirements and DDR part selection. For more information, see QDMA AXI MM Interface to NoC and
DDR.
Simulation
Simulation example designs are listed in the customizable example design (CED). You can download
the simulation example design from the Vivado store (Versal_CPM5_QDMA_Simulation_Design). The
list of Versal PCIe example designs is available here. The simulation design has a fixed
configuration as follows:
Gen4x8
AXI4 and AXI-ST
4 PFs and 250 VFs
Each function with two BARs: one for the QDMA configuration space and one for bypass access to
the PL.
Descriptor bypass and Internal Mode
1. Open Vivado and select Open Example Project option under Quick Start.
✎ Note: In simulation, you might receive warnings from internal RAMs within the CPM5 block
indicating that a write/read contention has occurred on a multi-port RAM. This is normal as the
contention is resolved separately outside of the RAM blocks. These warnings can be safely
ignored.
2. From the Template options, select Versal CPM5 QDMA Simulation Design under the PCIe section.
You can see the corresponding diagram in the description section on the right-hand side. Click Next.
The Vivado wizard guides you through the board/part selection. This example design is fixed to
the VPK120 board.
3. Select the project name and directory and click Next to generate project.
4. A new simulation project is displayed as shown below.
Basic Simulation
Simulation models for the AXI-MM and AXI-ST options can be generated and simulated. These
simple simulation models give you a starting point for developing more complex designs.
AXI-MM Mode
The example design for the AXI4 Memory Mapped (AXI-MM) mode has 512 KB block RAM on the
user side, where the data can be written to the block RAM, and read from block RAM to the host.
After the host to Card (H2C) transfer is started, the DMA reads data from the host memory, and writes
to the block RAM. After the transfer is completed, the DMA updates the write back status and
generates an interrupt (if enabled). Then, the card to host (C2H) transfer is started, and the DMA
reads data from the block RAM and writes to the host memory. The original data is compared with the
C2H write data. H2C and C2H are set up with one descriptor each, and the total transfer size is 128
bytes.
AXI-ST Mode
The example design for the AXI4-Stream (AXI-ST) mode has a data check that checks the data from
the H2C transfer, and has a data generator that generates the data for C2H transfer.
After the H2C transfer is started, the DMA engine reads data from the host memory, and writes to the
user side. After the transfer is completed, the DMA updates write back status and generates an
interrupt (if enabled). The data checker on the user side checks for a predefined data to be present,
and the result is posted in a predefined address for the user application to read.
After the C2H transfer is started, the data generator generates predefined data and associated control
signals, and sends them to the DMA. The DMA transfers data to the Host, updates the completion
(CMPT) ring entry/status, and generates an interrupt (if enabled).
H2C and C2H are set up with 16 descriptors each, and the total transfer size is 128 bytes.
The QDMA supports the PIPE mode simulation where the PIPE interface of the core is connected to
the PIPE interface of the link partner. This mode increases the simulation speed.
Use the Enable PIPE Simulation option on the Basic tab of the Customize IP dialog box to enable
PIPE mode simulation in the current AMD Vivado™ Design Suite solution example design, in either
Endpoint or Root Port mode. The External PIPE interface signals are generated at the core boundary
for access to the external device. Enabling this feature also provides the necessary hooks to use
third-party PCI Express® VIPs/BFMs instead of the Root Port model provided with the example
design.
CPM5 QDMA
The following table describes the available CPM5 example designs. All the listed example designs are
based on the VPK120 evaluation board or an equivalent part.
Table: CPM5 Example Designs
Versal_CPM_QDMA_EP_Design:
  CPM5_QDMA_Gen4x8_MM_ST_Design (Implementation): Functional example design.
  CPM5_QDMA_Gen5x8_MM_Performance_Design (Implementation): AXI4 performance design.
  CPM5_QDMA_Gen4x8_ST_Performance_Design (Implementation): AXI-ST performance design.
  CPM5_QDMA_Dual_Gen4x8_MM_ST_Design (Implementation): Functional example design.
Versal_CPM_QDMA_EP_Simulation_Design (no preset):
  Versal_CPM_QDMA_EP_Simulation_Design (Simulation): QDMA full functional simulation design.
Versal_CPM_Bridge_RP_Design:
  CPM5_PCIe_Controller0_Gen4x8_RootPort_Design (Implementation): RP design.
  CPM5_PCIe_Controller1_Gen4x8_RootPort_Design (Implementation): RP design.
Versal_CPM_QDMA_EP_Design (Part Based):
  CPM5_QDMA_Gen5x8_ST_Performance_Design (Implementation): AXI-ST performance design.
  CPM5_QDMA_Dual_Gen5x8_ST_Performance_Design (Implementation): AXI-ST performance design.
1. Launch Vivado.
5. Click Install for any newly added designs, or click Update for any updates to the designs, and
then close the dialog.
6. Click Open Example Project > Next, select the appropriate design, and click Next.
8. Choose from the available board or part options. Based on the board selected, the appropriate
CPM block is enabled. For example, for the VCK190 board, the CPM4 block is enabled; similarly,
for the VPK120 board, the CPM5 block is enabled.
a. In CPM5, the Gen 5 link speed is available only for the -2 MHP speed grade variant of the VPK120
board. Ensure that you choose the -2 MHP variant in the Switch Part option while selecting the board.
Versal_CPM_QDMA_EP_Design
CPM5_QDMA_Gen4x8_MM_ST_Design
This design has CPM5 – QDMA1 enabled in Gen4x8 configuration as an End Point
The design targets VPK120 board and it supports synthesis and implementation flows
Enables QDMA AXI4 and QDMA AXI-ST functionality with 4 PF and 252 VFs
Capable of exercising AXI4, AXI-ST path, and descriptor bypass
C2H_ST_QID (0x000)
[31:11] 0 NA Reserved
C2H_ST_LEN (0x004)
[31:16] 0 NA Reserved
C2H_CONTROL_REG (0x008)
[31:6] 0 NA Reserved
[4] 0 NA Reserved
H2C_CONTROL_REG (0x00C)
[31:30] 0 NA Reserved
H2C_STATUS (0x010)
[31:15] 0 NA Reserved
[3:1] 0 NA Reserved
C2H_STATUS (0x018)
[31:30] 0 NA Reserved
C2H_PACKET_COUNT (0x020)
[31:10] 0 NA Reserved
C2H_PREFETCH_TAG (0x024)
[31:27] 0 NA Reserved
[15:7] 0 NA Reserved
C2H_COMPLETION_DATA_0 (0x030)
C2H_COMPLETION_DATA_1 (0x034)
C2H_COMPLETION_DATA_2 (0x038)
C2H_COMPLETION_DATA_3 (0x03C)
C2H_COMPLETION_DATA_4 (0x040)
C2H_COMPLETION_DATA_5 (0x044)
C2H_COMPLETION_DATA_6 (0x048)
C2H_COMPLETION_DATA_7 (0x04C)
C2H_COMPLETION_SIZE (0x050)
[31:13] 0 NA Reserved
SCRATCH_REG0 (0x060)
SCRATCH_REG1 (0x064)
C2H_PACKETS_DROP (0x088)
C2H_PACKETS_ACCEPTED (0x08C)
Each AXI-ST C2H transfer can contain one or more descriptors depending on the transfer size and
C2H buffer size. This register represents how many of the descriptors were accepted in the current
transfer. This register will reset to 0 at the beginning of each transfer.
DESCRIPTOR_BYPASS (0x090)
[31:3] 0 NA Reserved
USER_INTERRUPT (0x094)
[31:20] 0 NA Reserved
[11:9] 0 NA Reserved
[3:1] 0 NA Reserved
1. Write the function number at bits [19:12]. This corresponds to the function that generates the
usr_irq_in_fnc user interrupt.
2. Write MSI-X Vector number at bits [8:4]. This corresponds to the entry in the MSI-X table that is
set up for usr_irq_in_vec user interrupt.
3. Write 1 to bit [0] to generate user interrupt. This bit clears itself after usr_irq_out_ack from the
DMA is generated.
All three steps above can be performed at the same time with a single write.
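The single write described above packs all three fields into one word. The following sketch composes that word; the helper name is hypothetical, and the field widths follow the bit positions listed in the steps:

```c
#include <stdint.h>

/* Compose the USER_INTERRUPT (0x094) word from the fields described
 * above: function number in bits [19:12], MSI-X vector in bits [8:4],
 * and bit [0] set to fire the interrupt. A single write of this value
 * performs all three steps at once. */
static uint32_t user_irq_word(uint32_t func, uint32_t vector)
{
    return ((func & 0xFFu) << 12) |   /* bits [19:12]: function number */
           ((vector & 0x1Fu) << 4) |  /* bits [8:4]:  MSI-X vector     */
           0x1u;                      /* bit  [0]:    generate IRQ     */
}

/* Hypothetical usage against a memory-mapped example-design BAR:
 *   volatile uint32_t *bar2 = ...;
 *   bar2[0x094 / 4] = user_irq_word(0, 3);
 */
```

Bit [0] self-clears once the DMA returns usr_irq_out_ack, so the word can be rewritten for the next interrupt without a separate clear step.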
Following is the user interrupt timing diagram:
Figure: Interrupt
USER_INTERRUPT_VECTOR (0x09C)
Write to user_interrupt[0], or
Write to the user_interrupt_vector[31:0] register with mask set.
DMA_CONTROL (0x0A0)
[31:1] NA Reserved
VDM_MESSAGE_READ (0x0A4)
Vendor Defined Message (VDM) messages received on st_rx_msg_data are stored in a FIFO in the
example design. A read of this register (0x0A4) pops one 32-bit message at a time.
CPM5_QDMA_Gen5x8_MM_Performance_Design
This design has CPM5 – QDMA1 enabled in Gen5x8 configuration as an End Point
The design targets VPK120 board and it supports synthesis and implementation flows
The design has AXI4 datapath accessing DDR over NoC
Capable of demonstrating the QDMA MM performance
To achieve maximum performance for AXI4 transfers, you need to use both NoC channels 0 and 1.
Both NoC channels can be used by programming the Host Profile settings. For more information,
see Host Profile.
For example, during queue context programming, all even queues can be assigned host ID 0 and all
odd queues can be assigned host ID 1. In this manner there is equal traffic on the NoC0 and NoC1
channels, which maximizes the MM throughput from the CPM through the NoC to DDR memory.
CPM5_QDMA_Gen4x8_ST_Performance_Design
This design has CPM5 – QDMA1 enabled in Gen4x8 configuration as an End Point
The design targets VPK120 board and it supports Synthesis and Implementation flows
Capable of demonstrating the QDMA AXI4-Stream performance
CPM5 Dual Controller QDMA0 and QDMA1 with Gen4x8 AXI4 and AXI4-Stream functional example
design:
This design has CPM5–QDMA0 and CPM5-QDMA1 enabled in Gen4x8 configuration as an End
Point
The design targets VPK120 board and it supports Synthesis and Implementation flows
Enables QDMA AXI4 and QDMA AXI-ST functionality on each QDMA, with 4 PFs and 252 VFs
Capable of exercising AXI4, AXI-ST path, and descriptor bypass
Versal_CPM_QDMA_EP_Simulation_Design
This design has CPM5 – QDMA1 enabled in Gen4x8 configuration as an End Point
The design supports simulation
Enables QDMA AXI4 and QDMA AXI-ST functionality with 4 PF and 252 VFs
The design includes Root Port testbench, which simulates QDMA AXI4 and AXI4-Stream
datapath.
This design has CPM5 – QDMA1 AXI Bridge mode enabled in Gen4x8 configuration as a Root
Port
The design targets the VPK120 board and supports synthesis and implementation flows
The design implements the Root Complex functionality. It includes the CIPS IP, which enables
both the CPM and the PS
CPM5_QDMA_Gen5x8_ST_Performance_Design
1. Put the example design in pause mode: set offset 0x8 bit [30] to 1. All other bit values must not
be changed.
2. Set the example design in simple bypass mode: set offset 0x98 to 0x4.
3. Set the desired packet size: set offset 0x90 to the desired value.
4. Enable and start the data transfer from the host application/driver.
At this time no data transfer happens because the example design is paused.
5. Fetch the prefetch tag from the QDMA IP (configuration BAR):
Write 0 to the QDMA configuration BAR (BAR0) offset 0x1048. Write offset 0x1408 with value
0x0.
Read the prefetch tag from the QDMA configuration BAR (BAR0) at offset 0x140C.
Write that tag value to the example design (BAR2) at offset 0x24.
6. After the prefetch tag exchange, release the example design: set offset 0x8 bit [30] to 0.
All other bits must not be changed.
7. The data transfer from the example design to the host can now be observed.
CPM5_QDMA_Dual_Gen5x8_ST_Performance_Design
CPM5 Dual Controller QDMA0 and QDMA1 with Gen5x8 AXI4 and AXI4-Stream performance
example design
The design targets "vsva2785-3HP-e-S" part and it supports synthesis and implementation flows
This design has both CPM5–QDMA0 and CPM5-QDMA1 enabled in Gen5x8, AXI4-Stream
configuration as an End Point
Capable of demonstrating AXI-ST performance
Capable of operating in internal mode, cache bypass mode, or simple bypass mode
To change the data packet size in the C2H direction, set example design register offset 0x90 to
the desired value
The above figure shows the usage model of the Linux QDMA software drivers. The QDMA example
design is implemented on an AMD adaptive SoC, which is connected to an x86 host through PCI
Express®.
In the first use mode, the QDMA driver in kernel space runs on Linux, whereas the test
application runs in user space.
In the second use mode, the Data Plane Development Kit (DPDK) is used to develop a QDMA Poll
Mode Driver (PMD) that runs entirely in user space and uses the UIO and VFIO kernel frameworks
to communicate with the adaptive SoC.
DMA tool
The user-space application used to initiate a DMA transaction. You can use a standard Linux
utility such as dd or fio, or use the example application in the driver package.
Debugging
This appendix includes details about resources available on the AMD Support website and debugging
tools.
To help in the design and debug process when using the functional mode, the Support web page
contains key resources such as product documentation, release notes, answer records, information
about known issues, and links for obtaining further product support. The Community Forums are also
available where members can learn, participate, share, and ask questions about AMD Adaptive
Computing solutions.
This product guide is the main document associated with the functional mode. This guide, along with
documentation related to all products that aid in the design process, can be found on the AMD
Adaptive Support web page or by using the AMD Adaptive Computing Documentation Navigator.
Download the Documentation Navigator from the Downloads page. For more information about this
tool and the features available, open the online help after installation.
Debug Guide
Answer Records
Answer Records include information about commonly encountered problems, helpful information on
how to resolve these problems, and any known issues with an AMD Adaptive Computing product.
Answer Records are created and maintained daily to ensure that users have access to the most
accurate information available.
Answer Records for this functional mode can be located by using the Search Support box on the main
AMD Adaptive Support web page. To maximize your search results, use keywords such as:
Product name
Tool message(s)
Summary of the issue encountered
A filter search is available after results are returned to further target the results.
AR 75396.
Technical Support
AMD Adaptive Computing provides technical support on the Community Forums for this AMD
LogiCORE™ IP product when used as described in the product documentation. AMD Adaptive
Computing cannot guarantee timing, functionality, or support if you do any of the following:
Implement the solution in devices that are not defined in the documentation.
Customize the solution beyond that allowed in the product documentation.
Change any section of the design labeled DO NOT MODIFY.
Hardware Debug
Hardware issues can range from link bring-up to problems seen after hours of testing. This section
provides debug steps for common issues. The AMD Vivado™ debug feature is a valuable resource to
use in hardware debug.
General Checks
Ensure that all the timing constraints for the core were properly incorporated from the example design
and that all constraints were met during implementation.
Does it work in post-place and route timing simulation? If problems are seen in hardware but not
in timing simulation, this could indicate a PCB issue. Ensure that all clock sources are active and
clean.
If using MMCMs in the design, ensure that all MMCMs have obtained lock by monitoring the
locked port.
If your outputs go to 0, check your licensing.
Registers
A complete list of registers and attributes for the QDMA Subsystem is available in the Versal Adaptive
SoC Register Reference (AM012). Reviewing the registers and attributes might be helpful for
advanced debugging.
✎ Note: The attributes are set during IP customization in the Vivado IP catalog. After core
customization, attributes are read-only.
Upgrading
This appendix is not applicable for the first release of the functional mode.
Limitations
1. The achievable bandwidth for this subsystem depends on multiple factors, including but not
limited to the IP configuration, the data path options used with the IP, the host system
performance, and the methods by which data movements are programmed. The bandwidth
ceiling is the lower of the raw capacity of the designed PCIe link configuration and that of the
internal data interface used. For CPM4 AXI Bridge, the Data Bandwidth and Performance Tuning
section provides guidance on the related clock frequency settings and high-level guidance on
performance expectations. Achievable bandwidth might vary.
2. The Bridge is compliant with all MPS and MRRS settings; however, all traffic initiated from the
Bridge is limited to a maximum of 256 bytes.
3. AXI address width is limited to 48 bits.
Product Specification
The Register block contains registers used in the AXI Bridge functional mode for dynamically mapping
the AXI4 memory-mapped (MM) address range, provided using the AXIBAR parameters, to an
address range for PCIe®.
The slave bridge provides termination of memory-mapped AXI4 transactions from an AXI master
device (such as a processor). The slave bridge provides a way to translate addresses that are
mapped within the AXI4 memory mapped address domain to the domain addresses for PCIe. Write
transactions to the Slave Bridge are converted into one or more MemWr TLPs, depending on the
configured Max Payload Size setting, which are passed to the integrated block for PCI Express. The
slave bridge can support up to 32 active AXI4 write requests. When a remote AXI master initiates a
read transaction to the slave bridge, the read address and qualifiers are captured, a MemRd
request TLP is passed to the core, and a completion timeout timer is started. Completions received
for the request are returned as read data on the AXI4 bus.
Each PCIe MemRd request TLP header is used to create an address and qualifiers for the memory-
mapped AXI4 bus. Read data is collected from the addressed memory mapped AXI4 slave and used
to generate completion TLPs, which are then passed to the integrated block for PCI Express. The
Master Bridge can support up to 32 active PCIe MemRd request TLPs with pending completions for
improved AXI4 pipelining performance.
The instantiated AXI4-Stream Enhanced PCIe block contains submodules including the
Requester/Completer interfaces to the AXI bridge and the Register block. The Register block contains
the status, control, and interrupt registers.
The following tables are the translation tables for AXI4-Stream and memory-mapped transactions.
The AXI Bridge functional mode conforms to PCIe® transaction ordering rules. See the PCI-SIG
Specifications for the complete rule set. The following behaviors are implemented in the AXI Bridge
functional mode to enforce the PCIe transaction ordering rules on the highly-parallel AXI bus of the
bridge.
✎ Note: The transaction ordering rules for PCIe might have an impact on data throughput in heavy
bidirectional traffic.
BAR Addressing
Aperture_Base_Address_n provides the low address where AXI BAR n starts and will be
regarded as address offset 0x0 when the address is translated.
Aperture_High_Address_n is the high address of the last valid byte address of AXI BAR n. (For
more details on how the address gets translated, see Address Translation.)
Example 1 (32-bit PCIe Address Mapping) demonstrates how to set up three AXI BARs and
translate the AXI address to a 32-bit address for PCIe.
Example 2 (64-bit PCIe Address Mapping) demonstrates how to set up three AXI BARs and
translate the AXI address to a 64-bit address for PCIe.
Example 3 demonstrates how to set up two 64-bit PCIe BARs and translate the address for PCIe
to an AXI address.
Example 4 demonstrates how to set up a combination of two 32-bit AXI BARs and two 64 bit AXI
BARs, and translate the AXI address to an address for PCIe.
This example shows the generic settings to set up three independent AXI BARs and address
translation of AXI addresses to a remote 32-bit address space for PCIe. This setting of AXI BARs
does not depend on the BARs for PCIe in the functional mode.
In this example, the number of AXI BARs is three, and the following assignments are made for each range:
Aperture_Base_Address_0 =0x00000000_12340000
Aperture_High_Address_0 =0x00000000_1234FFFF (64 Kbytes)
AXI_to_PCIe_Translation_0=0x00000000_56710000 (Bits 63-32 are zero in order to produce a 32-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1 =0x00000000_ABCDE000
Aperture_High_Address_1 =0x00000000_ABCDFFFF (8 Kbytes)
AXI_to_PCIe_Translation_1=0x00000000_FEDC0000 (Bits 63-32 are zero in order to produce a 32-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Aperture_Base_Address_2 =0x00000000_FE000000
Aperture_High_Address_2 =0x00000000_FFFFFFFF (32 Mbytes)
AXI_to_PCIe_Translation_2=0x00000000_40000000 (Bits 63-32 are zero in order to produce a 32-bit PCIe TLP. Bits 24-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 25 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the AXI bus yields
0x56710ABC on the bus for PCIe.
Accessing the Bridge AXI BAR_1 with address 0x0000_ABCDF123 on the AXI bus yields
0xFEDC1123 on the bus for PCIe.
Accessing the Bridge AXI BAR_2 with address 0x0000_FFFEDCBA on the AXI bus yields
0x41FEDCBA on the bus for PCIe.
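The translation illustrated by these results is a base-offset substitution: the offset of the AXI address within its BAR aperture is added to the translation value. A minimal sketch:

```c
#include <stdint.h>

/* AXI-to-PCIe address translation as used in the examples above: the
 * offset of the AXI address within its BAR aperture [base, high] is
 * added to the AXI_to_PCIe_Translation value. The caller must first
 * check that axi_addr actually falls inside the aperture. */
static uint64_t axi_to_pcie(uint64_t axi_addr, uint64_t base,
                            uint64_t translation)
{
    return translation + (axi_addr - base);
}
```

With the BAR_0 settings above, `axi_to_pcie(0x12340ABC, 0x12340000, 0x56710000)` yields 0x56710ABC, matching the first result.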
This example shows the generic settings to set up three independent AXI BARs and address
translation of AXI addresses to a remote 64-bit address space for PCIe. This setting of AXI BARs
does not depend on the BARs for PCIe within the Bridge.
In this example, the number of AXI BARs is three, and the following assignments are made for each range:
Aperture_Base_Address_0 =0x00000000_12340000
Aperture_High_Address_0 =0x00000000_1234FFFF (64 Kbytes)
AXI_to_PCIe_Translation_0=0x50000000_56710000 (Bits 63-32 are non-zero in order to produce a 64-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1 =0x00000000_ABCDE000
Aperture_High_Address_1 =0x00000000_ABCDFFFF (8 Kbytes)
AXI_to_PCIe_Translation_1=0x60000000_FEDC0000 (Bits 63-32 are non-zero in order to produce a 64-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Aperture_Base_Address_2 =0x00000000_FE000000
Aperture_High_Address_2 =0x00000000_FFFFFFFF (32 Mbytes)
AXI_to_PCIe_Translation_2=0x70000000_40000000 (Bits 63-32 are non-zero in order to produce a 64-bit PCIe TLP. Bits 24-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 25 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the bus yields
0x5000000056710ABC on the bus for PCIe.
Accessing the Bridge AXI BAR_1 with address 0x0000_ABCDF123 on the bus yields
0x60000000FEDC1123 on the bus for PCIe.
Accessing the Bridge AXI BAR_2 with address 0x0000_FFFEDCBA on the bus yields
0x7000000041FEDCBA on the bus for PCIe.
Example 3
This example shows the generic settings to set up two independent BARs for PCIe® and address
translation of addresses for PCIe to a remote AXI address space. This setting of BARs for PCIe does
not depend on the AXI BARs within the bridge.
In this example, the number of BARs for PCIe is two, and the following range assignments are made:
Aperture_Base_Address_0 =0x00000000_12340000
Aperture_High_Address_0 =0x00000000_1234FFFF (64 KB)
AXI_to_PCIe_Translation_0=0x00000000_56710000 (Bits 63-32 are zero to produce a 32-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1 =0x00000000_ABCDE000
Aperture_High_Address_1 =0x00000000_ABCDFFFF (8 KB)
AXI_to_PCIe_Translation_1=0x50000000_FEDC0000 (Bits 63-32 are non-zero to produce a 64-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the AXI bus yields
0x56710ABC on the bus for PCIe.
Accessing the Bridge AXI BAR_1 with address 0x0000_ABCDF123 on the AXI bus yields
0x50000000FEDC1123 on the bus for PCIe.
Example 4
This example shows the generic settings of four AXI BARs and address translation of AXI addresses
to remote 32-bit and 64-bit addresses for PCIe®. This setting of AXI BARs does not depend on the
BARs for PCIe within the Bridge.
In this example, the number of AXI BARs is four, and the following assignments are made for each range:
Aperture_Base_Address_0 =0x00000000_12340000
Aperture_High_Address_0 =0x00000000_1234FFFF (64 KB)
AXI_to_PCIe_Translation_0=0x00000000_56710000 (Bits 63-32 are zero to produce a 32-bit PCIe TLP. Bits 15-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 16 bits are invalid translation values.)
Aperture_Base_Address_1 =0x00000000_ABCDE000
Aperture_High_Address_1 =0x00000000_ABCDFFFF (8 KB)
AXI_to_PCIe_Translation_1=0x50000000_FEDC0000 (Bits 63-32 are non-zero to produce a 64-bit PCIe TLP. Bits 12-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 13 bits are invalid translation values.)
Aperture_Base_Address_2 =0x00000000_FE000000
Aperture_High_Address_2 =0x00000000_FFFFFFFF (32 MB)
AXI_to_PCIe_Translation_2=0x00000000_40000000 (Bits 63-32 are zero to produce a 32-bit PCIe TLP. Bits 24-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 25 bits are invalid translation values.)
Aperture_Base_Address_3 =0x00000000_00000000
Aperture_High_Address_3 =0x00000000_00000FFF (4 KB)
AXI_to_PCIe_Translation_3=0x60000000_87654000 (Bits 63-32 are non-zero to produce a 64-bit PCIe TLP. Bits 11-0 must be zero based on the AXI BAR aperture size. Non-zero values in the lower 12 bits are invalid translation values.)
Accessing the Bridge AXI BAR_0 with address 0x0000_12340ABC on the AXI bus yields
0x56710ABC on the bus for PCIe.
Accessing the Bridge AXI BAR_1 with address 0x0000_ABCDF123 on the AXI bus yields
0x50000000FEDC1123 on the bus for PCIe.
Accessing the Bridge AXI BAR_2 with address 0x0000_FFFEDCBA on the AXI bus yields
0x41FEDCBA on the bus for PCIe.
Accessing the Bridge AXI BAR_3 with address 0x0000_00000071 on the AXI bus yields
0x6000000087654071 on the bus for PCIe.
Addressing Checks
When setting the following parameters for PCIe® address mapping, C_PCIEBAR2AXIBAR_n and
PF0_BARn_APERTURE_SIZE, be sure these are set to allow for the addressing space on the AXI
system. For example, the following setting is illegal and results in an invalid AXI address.
C_PCIEBAR2AXIBAR_n=0x00000000_FFFFF000
PF0_BARn_APERTURE_SIZE=0x06 (8 KB)
For an 8 KB BAR, the lower 13 bits must be zero. As a result, the C_PCIEBAR2AXIBAR_n value
should be modified to 0x00000000_FFFFE000. Also check that the aperture implied by
PF0_BARn_APERTURE_SIZE does not exceed the alignment of the value assigned to the
C_PCIEBAR2AXIBAR_n parameter. An example parameter setting follows.
C_PCIEBAR2AXIBAR_n=0xFFFF_E000
PF0_BARn_APERTURE_SIZE=0x0D (1 MB)
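The alignment rule can be checked mechanically. The sketch below takes the aperture as a plain bit count (for example, 13 bits for an 8 KB BAR) rather than the encoded PF0_BARn_APERTURE_SIZE value, which is an IP-specific encoding:

```c
#include <stdbool.h>
#include <stdint.h>

/* A C_PCIEBAR2AXIBAR_n translation value is legal only if its low
 * aperture_bits are all zero (e.g. 13 bits for an 8 KB aperture). */
static bool bar2axibar_valid(uint64_t xlate, unsigned aperture_bits)
{
    uint64_t mask = (aperture_bits >= 64)
                        ? ~0ull
                        : ((1ull << aperture_bits) - 1);
    return (xlate & mask) == 0;
}
```

The illegal example above, 0x00000000_FFFFF000 with an 8 KB (13-bit) aperture, fails this check; the corrected value 0x00000000_FFFFE000 passes.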
Malformed TLP
The integrated block for PCI Express® detects a malformed TLP. For the IP configured as an
Endpoint core, a malformed TLP results in a fatal error message being sent upstream if error reporting
is enabled in the Device Control register.
Abnormal Conditions
This section describes how the Slave side and Master side (see the following tables) of the AXI Bridge
functional mode handle abnormal conditions.
The slave bridge monitors AXI read and write burst type inputs to ensure that only the INCR
(incrementing burst) type is requested. Any other value on these inputs is treated as an error condition
and the Slave Illegal Burst (SIB) interrupt is asserted. In the case of a read request, the Bridge
asserts SLVERR for all data beats and arbitrary data is placed on the Slave AXI4-MM read data bus.
In the case of a write request, the Bridge asserts SLVERR for the write response and all write data is
discarded.
Any request to the bus for PCIe (except for a posted Memory write) requires a completion TLP to
complete the associated AXI request. The Slave side of the Bridge checks the received completion
TLPs for errors and checks for completion TLPs that are never returned (Completion Timeout). Each
of the completion TLP error types are discussed in the subsequent sections.
Unexpected Completion
When the slave bridge receives a completion TLP, it matches the header RequesterID and Tag to the
outstanding RequesterID and Tag. A match failure indicates the TLP is an Unexpected Completion
which results in the completion TLP being discarded and a Slave Unexpected Completion (SUC)
interrupt strobe being asserted. Normal operation then continues.
Unsupported Request
A device for PCIe might not be capable of satisfying a specific read request. For example, if the read
request targets an unsupported address for PCIe, the completer returns a completion TLP with a
completion status of 0b001 - Unsupported Request. Per the PCI Express Base Specification v3.0, a
completion TLP returned with a Reserved completion status must be treated as having an
Unsupported Request status. When the slave bridge receives an unsupported request
response, the Slave Unsupported Request (SUR) interrupt is asserted and the DECERR response is
asserted with arbitrary data on the AXI4 memory mapped bus.
Completion Timeout
1. Read register 0xE10 (INT_DEC) and check whether one of the following bits is set: [9]
(correctable), [10] (non_fatal), or [11] (fatal).
2. Read register 0xE20 (RP_CSR) and check whether bit [16] (efifo_not_empty) is set.
3. If the FIFO is not empty, read the FIFO by reading 0xE2C (RP_FIFO_READ).
a. The error message indicates where the error comes from (that is, the requester ID) and the error type.
4. To clear the error, write to 0xE2C (RP_FIFO_READ). The value does not matter.
5. Repeat steps 2 and 3 until the 0xE2C (RP_FIFO_READ) bit [18] valid bit is cleared.
6. Write 1 to register 0xE10 (INT_DEC) bits [9] (correctable), [10] (non_fatal), or [11] (fatal) to clear them.
1. Read register 0xE10 (INT_DEC) and check whether bit [17] is set, which indicates that a PM_PME
message has been received.
2. Read register 0xE20 (RP_CSR) and check whether bit [18] (pfifo_not_empty) is set.
3. If the FIFO is not empty, read the FIFO by reading 0xE30 (RP_PFIFO).
a. The message indicates where the message comes from (that is, the requester ID).
4. To clear the error, write to 0xE30 (RP_PFIFO). The value does not matter.
5. Repeat steps 2 and 3 until the 0xE30 (RP_PFIFO) bit [31] valid bit is cleared.
6. Write 1 to register 0xE10 (INT_DEC) bit [17] to clear it.
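The first procedure (the error FIFO at 0xE2C) can be sketched as follows. The accessor callbacks are hypothetical; the PM_PME procedure is identical in shape, using offset 0xE30 and bits [17], [18], and [31] instead:

```c
#include <stdint.h>

/* Hypothetical MMIO accessors; in a real driver these would be reads
 * and writes into the mapped Bridge register space. */
typedef uint32_t (*reg_read_fn)(void *ctx, uint32_t off);
typedef void (*reg_write_fn)(void *ctx, uint32_t off, uint32_t val);

#define INT_DEC      0xE10
#define RP_CSR       0xE20
#define RP_FIFO_READ 0xE2C

/* Drain the Root Port error FIFO per the documented steps: check
 * INT_DEC bits [9]/[10]/[11], pop RP_FIFO_READ entries until its valid
 * bit [18] clears, then write-1-to-clear INT_DEC. Returns the number
 * of error messages popped. */
static int drain_error_fifo(void *ctx, reg_read_fn rd, reg_write_fn wr)
{
    uint32_t dec = rd(ctx, INT_DEC);
    if (!(dec & ((1u << 9) | (1u << 10) | (1u << 11))))
        return 0;                        /* no error interrupt pending */

    int popped = 0;
    if (rd(ctx, RP_CSR) & (1u << 16)) {  /* efifo_not_empty */
        for (;;) {
            uint32_t msg = rd(ctx, RP_FIFO_READ);
            if (!(msg & (1u << 18)))     /* valid bit clear: FIFO empty */
                break;
            /* msg encodes the requester ID and the error type */
            popped++;
            wr(ctx, RP_FIFO_READ, 0);    /* pop; value does not matter */
        }
    }
    /* write-1-to-clear the error bits in INT_DEC */
    wr(ctx, INT_DEC, (1u << 9) | (1u << 10) | (1u << 11));
    return popped;
}
```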
The following sections describe the manner in which the master bridge handles abnormal conditions.
When the master bridge receives a DECERR response from the AXI bus, the request is discarded
and the Master DECERR (MDE) interrupt is asserted. If the request was non-posted, a completion
packet with the Completion Status = Unsupported Request (UR) is returned on the bus for PCIe.
When the master bridge receives a SLVERR response from the addressed AXI slave, the request is
discarded and the Master SLVERR (MSE) interrupt is asserted. If the request was non-posted, a
completion packet with the Completion Status = Completer Abort (CA) is returned on the bus for PCIe.
Max Payload Size for PCIe, Max Read Request Size or 4K Page Violated
Completion Packets
When the MAX_READ_REQUEST_SIZE is greater than the MAX_PAYLOAD_SIZE, a read request for PCIe
can ask for more data than the master bridge can insert into a single completion packet. When this
situation occurs, multiple completion packets are generated up to MAX_PAYLOAD_SIZE, with the Read
Completion Boundary (RCB) observed.
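As an illustration only (the exact splitting policy is implementation-specific), the following simplified model caps each completion at MAX_PAYLOAD_SIZE and ends every completion except the last on an RCB boundary:

```c
#include <stdint.h>

/* Simplified model of completion splitting when MAX_READ_REQUEST_SIZE
 * exceeds MAX_PAYLOAD_SIZE: each completion carries at most `mps`
 * bytes and, except for the final one, ends on an `rcb` boundary
 * (RCB is typically 64 or 128 bytes; `mps` is assumed to be a
 * multiple of `rcb`). Returns the number of completion TLPs for a
 * read of `len` bytes starting at `addr`. */
static unsigned completion_count(uint64_t addr, uint32_t len,
                                 uint32_t mps, uint32_t rcb)
{
    unsigned n = 0;
    while (len) {
        /* first candidate: run to the next RCB boundary... */
        uint32_t chunk = rcb - (uint32_t)(addr % rcb);
        /* ...then extend in RCB steps up to the MPS cap */
        while (chunk + rcb <= mps && chunk < len)
            chunk += rcb;
        if (chunk > len)
            chunk = len;     /* final completion ends the request */
        n++;
        addr += chunk;
        len  -= chunk;
    }
    return n;
}
```

For example, a 4 KB aligned read with a 256-byte MPS splits into 16 completions under this model.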
Poison Bit
When the poison bit is set in a transaction layer packet (TLP) header, the payload following the
header is corrupted. When the master bridge receives a memory request TLP with the poison bit set,
it discards the TLP and asserts the Master Error Poison (MEP) interrupt strobe.
When the master bridge receives a read request with the Length = 0x1, FirstBE = 0x00, and LastBE =
0x00, it responds by sending a completion with Status = Successful Completion.
When the master bridge receives a write request with the Length = 0x1, FirstBE = 0x00, and LastBE
= 0x00 there is no effect.
The normal operation of the functional mode is dependent on the integrated block for PCIe
establishing and maintaining the point-to-point link with an external device for PCIe. If the link has
been lost, it must be re-established to return to normal operation.
When a Hot Reset is received by the functional mode, the link goes down and the PCI Configuration
Space must be reconfigured.
Initiated AXI4 write transactions that have not yet completed on the AXI4 bus when the link goes down
have a SLVERR response given and the write data is discarded. Initiated AXI4 read transactions that
have not yet completed on the AXI4 bus when the link goes down have a SLVERR response given,
with arbitrary read data returned.
Any MemWr TLPs for PCIe that have been received, but the associated AXI4 write transaction has not
started when the link goes down, are discarded.
Endpoint
When configured to support Endpoint functionality, the AXI Bridge functional mode fully supports
Endpoint operation as supported by the underlying block. There are a few details that need special
consideration. The following subsections contain information and design considerations about
Endpoint support.
Interrupts
The interrupt modes described in the following section apply to AXI Bridge mode only.
Multiple interrupt modes can be configured during IP configuration; however, only one interrupt mode
is used at runtime. If multiple interrupt modes are enabled by the host after PCI bus enumeration at
runtime, MSI-X interrupt takes precedence over MSI interrupt, and MSI interrupt takes precedence
over legacy interrupt. All of these interrupt modes are sent using the same xdma0_usr_irq_*
interface, and the core automatically picks the best available interrupt mode at runtime.
Legacy Interrupts
Asserting one or more bits of xdma0_usr_irq_req when legacy interrupts are enabled causes the IP
to issue a legacy interrupt over PCIe. Multiple bits can be asserted simultaneously, but each
xdma0_usr_irq_req bit must remain asserted until the corresponding xdma0_usr_irq_ack bit has
been asserted and the interrupt has been serviced and cleared by the Host.
Asserting one or more bits of xdma0_usr_irq_req causes the generation of an MSI or MSI-X
interrupt if MSI or MSI-X is enabled. If both MSI and MSI-X capabilities are enabled, an MSI-X
interrupt is generated. The Internal MSI-X interrupts mode is enabled when you set the MSI-X
Implementation Location option to Internal in the PCIe Misc Tab.
After a xdma0_usr_irq_req bit is asserted, it must remain asserted until the corresponding
xdma0_usr_irq_ack bit is asserted and the interrupt has been serviced and cleared by the Host. The
xdma0_usr_irq_ack assertion indicates the requested interrupt has been sent on the PCIe block.
This ensures that the interrupt pending register within the IP remains asserted when queried by the
Host's Interrupt Service Routine (ISR) to determine the source of interrupts. You must implement a
mechanism in the user application to know when the interrupt routine has been serviced. This
detection can be done in many different ways depending on your application and your use of this
interrupt pin. This typically involves a register (or array of registers) implemented in the user
application that is cleared, read, or modified by the Host software when an Interrupt is serviced.
Configuration registers are available to map xdma0_usr_irq_req to MSI or MSI-X vectors. For MSI-X
support, there is also a vector table and PBA table. The following figure shows the MSI interrupt.
This figure shows only the handshake between xdma0_usr_irq_req and xdma0_usr_irq_ack. Your
application might not clear or service the interrupt immediately, in which case, you must keep
xdma0_usr_irq_req asserted past xdma0_usr_irq_ack.
Root Port
When configured to support Root Port functionality, the AXI Bridge functional mode fully supports
Root Port operation as supported by the underlying block. There are a few details that need special
consideration. The following subsections contain information and design considerations about Root
Port support.
When the functional mode is configured as a Root Port, configuration traffic is generated by using the
PCI Express enhanced configuration access mechanism (ECAM). ECAM functionality is available
only when the core is configured as a Root Port. Reads and writes to a certain memory aperture are
translated to configuration reads and writes, as specified in the PCI Express Base Specification
(v3.0), §7.2.2.
The address breakdown is defined in the following table. ECAM is used in conjunction with the Bridge Register Memory Map in both the AXI Bridge for PCIe Gen3 core and the DMA/Bridge Subsystem for PCIe in AXI Bridge mode. The DMA/Bridge Subsystem for PCIe Register Memory Map does not have ECAM functionality.
When an ECAM access is attempted to the primary bus number, which defaults as bus 0 from reset,
then access to the type 1 PCI™ Configuration Header of the integrated block in the Enhanced
Interface for PCIe is performed. When an ECAM access is attempted to the secondary bus number,
then type 0 configuration transactions are generated. When an ECAM access is attempted to a bus
number that is in the range defined by the secondary bus number and subordinate bus number (not
including the secondary bus number), then type 1 configuration transactions are generated. The
primary, secondary, and subordinate bus numbers are written by Root Port software to the type 1 PCI
Configuration Header of the Enhanced Interface for PCIe in the beginning of the enumeration
procedure.
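The decode rules above can be sketched in C. The ECAM field packing follows the standard layout from the PCI Express Base Specification, §7.2.2 (bits [11:8] carry the Extended Register Number noted in the address breakdown); the helper and enum names are illustrative, not part of the IP:

```c
#include <stdint.h>

/* Compose a standard ECAM offset (PCI Express Base Spec, Section 7.2.2):
 * bits [27:20] bus, [19:15] device, [14:12] function, [11:2] dword register
 * number, where bits [11:8] are the Extended Register Number that reaches
 * the PCI Express Extended Configuration Space. */
uint32_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
    return ((uint32_t)bus << 20) |
           ((uint32_t)(dev & 0x1F) << 15) |
           ((uint32_t)(fn & 0x7) << 12) |
           (uint32_t)(reg & 0xFFC);      /* dword-aligned register offset */
}

/* Which configuration access the bridge generates for a target bus number,
 * per the rules above (pri/sec/sub are the primary, secondary, and
 * subordinate bus numbers programmed by Root Port software). */
typedef enum { CFG_SELF, CFG_TYPE0, CFG_TYPE1, CFG_SLVERR } cfg_access_t;

cfg_access_t cfg_access_type(uint8_t bus, uint8_t pri, uint8_t sec, uint8_t sub)
{
    if (bus == pri)              return CFG_SELF;   /* internal Type 1 header */
    if (bus == sec)              return CFG_TYPE0;
    if (bus > sec && bus <= sub) return CFG_TYPE1;
    return CFG_SLVERR;  /* out of range: no request, SLVERR response */
}
```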
When an ECAM access is attempted to a bus number that is outside the range defined by the secondary bus number and subordinate bus number, the bridge does not generate a configuration request and signals a SLVERR response on the AXI4-Lite bus. When the Bridge is configured for EP (PL_UPSTREAM_FACING = TRUE), the underlying Integrated Block configuration space and the core memory map are available at the beginning of the memory space. The memory space looks like a simple PCI Express® configuration space.
Bits [11:8] (Extended Register Number): along with the Register Number, allows access to the PCI Express Extended Configuration Space.
Any of the six BARs can be programmed within the three regions listed below. The full space of some regions is not accessible because DMA registers reside there.
✎ Note: For Root Port, most operating systems do not support address translation; therefore, AMD recommends Region 0 for 32-bit address space or non-prefetchable memory allocation. If the Root Port design needs to support a 64-bit BAR, it is recommended that the Endpoint select Region 1 or 2. Refer to Root Port BAR for more information.
In the Versal architecture, address maps are fixed. If the AXI Bridge Master BAR is not selected, all transactions pass through as is, that is, with no address translation. For information about the Versal adaptive SoC global address map, see Versal Adaptive SoC Technical Reference Manual (AM011).
Root Port configuration address offsets are not listed correctly: the Next pointer below AER does not point to the proper address, which can result in wrong configuration values. All listed values up to AER are correct. You can read the configuration extended space below AER using fixed target addresses. The target address values are as follows:
AER: 0x100
VC: 0x1F0
16 GT Cap: 0x3B0
ACS: 0x450
PASID: 0x5F0
Extend-Large: 0x600
Extend-Small: 0xE00
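For software that walks this space, the fixed target addresses above can be kept in a small lookup table; the shortened entry names and the helper function are illustrative only:

```c
#include <stdint.h>
#include <string.h>

/* Fixed extended-configuration-space target addresses listed above. */
struct ext_cap { const char *name; uint16_t offset; };

static const struct ext_cap rp_ext_caps[] = {
    { "AER",          0x100 },
    { "VC",           0x1F0 },
    { "16GT",         0x3B0 },
    { "ACS",          0x450 },
    { "PASID",        0x5F0 },
    { "Extend-Large", 0x600 },
    { "Extend-Small", 0xE00 },
};

/* Return the fixed target address for a capability, or 0 if unknown. */
uint16_t rp_ext_cap_offset(const char *name)
{
    for (size_t i = 0; i < sizeof rp_ext_caps / sizeof rp_ext_caps[0]; i++)
        if (strcmp(rp_ext_caps[i].name, name) == 0)
            return rp_ext_caps[i].offset;
    return 0;
}
```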
Root Port Bridge IP has two choices for AXI4 data path. Coherent data path is routed to the NoC by
selecting CPM to NoC port 0 in the CIPS CPM IP customization GUI. Alternatively, non-coherent data
The AXI Bridge functional mode automatically sends a Power Limit Message TLP when the Master
Enable bit of the Command Register is set. The software must set the Requester ID register before
setting the Master Enable bit to ensure that the desired Requester ID is used in the Message TLP.
When an ECAM access is performed to the primary bus number, self-configuration of the integrated
block for PCIe is performed. A PCIe configuration transaction is not performed and is not presented
on the link. When an ECAM access is performed to the bus number that is equal to the secondary bus
value in the Enhanced PCIe Type 1 configuration header, then Type 0 configuration transactions are
generated.
When an ECAM access is attempted to a bus number that is in the range defined by the secondary
bus number and subordinate bus number range (not including secondary bus number), then Type 1
configuration transactions are generated. The primary, secondary and subordinate bus numbers are
written and updated by Root Port software to the Type 1 PCI™ Configuration Header of the AXI
Bridge functional mode in the enumeration procedure.
When an ECAM access is attempted to a bus number that is outside the range defined by the secondary bus number and subordinate bus number, the bridge does not generate a configuration request and signals a SLVERR response on the AXI4 bus.
When an Unsupported Request (UR) response is received for a configuration read request, all ones
are returned on the AXI4 bus to signify that a device does not exist at the requested device address. It
is the responsibility of the software to ensure configuration write requests are not performed to device
addresses that do not exist. However, the AXI Bridge functional mode asserts SLVERR response on
the AXI4 bus when a configuration write request is performed on device addresses that do not exist or
a UR response is received.
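Software commonly exploits this all-1s return value when probing for device presence during enumeration, since a read of a valid device's Vendor ID/Device ID dword never returns all 1s. A minimal, hypothetical helper:

```c
#include <stdint.h>
#include <stdbool.h>

/* A UR completion for a configuration read returns all 1s on the AXI4 bus,
 * signifying that no device exists at the requested device address. The
 * helper name is illustrative, not part of any driver API. */
bool cfg_read_device_present(uint32_t vendor_device_dword)
{
    return vendor_device_dword != 0xFFFFFFFFu;
}
```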
The Root Port BAR does not support packet filtering (all TLPs received from the PCIe link are forwarded to the user logic); however, address translation can be enabled or disabled, depending on the IP configuration.
During core customization in the AMD Vivado™ Design Suite, when there is no BAR enabled, RP
passes all received packets to the user application without address translation or address filtering.
When BAR is enabled, by default the BAR address starts at 0x0000_0000 unless programmed
separately. Any packet received from the PCIe® link that hits a BAR is translated according to the
PCIE-to-AXI Address Translation rules.
✎ Note: The IP must not receive any TLPs outside of the PCIe BAR range from the PCIe link when the RP BAR is enabled. If this rule cannot be enforced, it is recommended that the PCIe BAR be disabled and that address filtering and/or translation be performed outside the IP.
The Root Port BAR customization options in the Vivado Design Suite are found in the PCIe BARs
Tab.
Configuration transactions are non-posted transactions. The AXI Bridge functional mode has a timer
for timeout termination of configuration transactions that have not completed on the PCIe link. An
OKAY response and 0s data are given on the AXI4 memory mapped bus.
✎ Note: Multiple Configuration reads (PCIe CFG Read) can block Configuration writes (PCIe CFG Write). Implement a throttling mechanism for CFG reads so that CFG writes can pass.
Responses to abnormal terminations to configuration transactions are shown in the following table.
Port Description
Global Signals
The interface signals for the Bridge are described in the following table.
cpm_cor_irq (O): Reserved
cpm_misc_irq (O): Reserved
cpm_uncor_irq (O): Reserved
cpm_irq0 (I): Reserved
cpm_irq1 (I): Reserved
AXI Bridge Slave ports are connected from the AMD Versal™ device programmable Network on Chip
(NoC) to the CPM DMA internally. For slave bridge AXI4 details and configuration, see Versal
Adaptive SoC Programmable Network on Chip and Integrated Memory Controller LogiCORE IP
Product Guide (PG313).
AXI4 (MM) Master ports are connected from the AMD Versal device Network on Chip (NoC) to the
CPM DMA internally. For details, see Versal Adaptive SoC Programmable Network on Chip and
Integrated Memory Controller LogiCORE IP Product Guide (PG313). The AXI4 Master interface can
be connected to the DDR or the PL, depending on the NoC configuration.
NUM_USR_IRQ is selectable and ranges from 0 to 15. Each bit in the bridge0_usr_irq_req bus corresponds to the same bit in bridge0_usr_irq_ack. For example, bridge0_usr_irq_ack[0] represents an acknowledgment for bridge0_usr_irq_req[0].
Register Space
The Bridge register space can be accessed using the AXI Slave interface, and you can also access the Host memory space.
Bridge register descriptions are found in cpm4-bridge-v2-1-registers.csv available in the register map
files.
To locate the register space information:
The register space mentioned in this document is also accessible through the AXI4 Memory Mapped Slave interface. All accesses to these registers are based on the following AXI base addresses:
The offsets within each register space are the same as listed for the PCIe BAR accesses.
Make sure that all transactions targeting these register spaces have AWCACHE[1] and ARCACHE[1] set to 1'b0 (Non-Modifiable) and access them only as 4-byte transactions.
All transactions originating from the Programmable Logic (PL) region must have an AXI master that sets AxCACHE[1] = 1'b0 before they enter the AXI NoC.
All transactions originating from the APU or RPU must be defined with a Memory Attribute of nGnRnE or nGnRE to ensure AxCACHE[1] = 1'b0.
Transactions originating from the PPU have no additional requirements.
Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
Vivado Design Suite User Guide: Designing with IP (UG896)
Vivado Design Suite User Guide: Getting Started (UG910)
Vivado Design Suite User Guide: Logic Simulation (UG900)
Debugging
This appendix includes details about resources available on the AMD Support website and debugging
tools.
To help in the design and debug process when using the functional mode, the Support web page
contains key resources such as product documentation, release notes, answer records, information
about known issues, and links for obtaining further product support. The Community Forums are also
available where members can learn, participate, share, and ask questions about AMD Adaptive
Computing solutions.
This product guide is the main document associated with the functional mode. This guide, along with
documentation related to all products that aid in the design process, can be found on the AMD
Adaptive Support web page or by using the AMD Adaptive Computing Documentation Navigator.
Download the Documentation Navigator from the Downloads page. For more information about this
tool and the features available, open the online help after installation.
Debug Guide
Answer Records
Answer Records include information about commonly encountered problems, helpful information on
how to resolve these problems, and any known issues with an AMD Adaptive Computing product.
Answer Records are created and maintained daily to ensure that users have access to the most
accurate information available.
Answer Records for this functional mode can be located by using the Search Support box on the main
AMD Adaptive Support web page. To maximize your search results, use keywords such as:
Product name
Tool message(s)
Summary of the issue encountered
A filter search is available after results are returned to further target the results.
See AR 75396.
Technical Support
AMD Adaptive Computing provides technical support on the Community Forums for this AMD
LogiCORE™ IP product when used as described in the product documentation. AMD Adaptive
Computing cannot guarantee timing, functionality, or support if you do any of the following:
Implement the solution in devices that are not defined in the documentation.
Customize the solution beyond that allowed in the product documentation.
Change any section of the design labeled DO NOT MODIFY.
Hardware Debug
Hardware issues can range from link bring-up to problems seen after hours of testing. This section provides debug steps for common issues. The AMD Vivado™ debug feature is a valuable resource to use in hardware debug.
General Checks
Ensure that all the timing constraints for the core were properly incorporated from the example design
and that all constraints were met during implementation.
Does it work in post-place and route timing simulation? If problems are seen in hardware but not
in timing simulation, this could indicate a PCB issue. Ensure that all clock sources are active and
clean.
If using MMCMs in the design, ensure that all MMCMs have obtained lock by monitoring the
locked port.
If your outputs go to 0, check your licensing.
Registers
A complete list of registers and attributes for the AXI Bridge Subsystem is available in the Versal
Adaptive SoC Register Reference (AM012). Reviewing the registers and attributes might be helpful
for advanced debugging.
✎ Note: The attributes are set during IP customization in the Vivado IP catalog. After core
customization, attributes are read-only.
Upgrading
This appendix is not applicable for the first release.
Limitations
1. The achievable bandwidth for this subsystem depends on multiple factors, including but not
limited to the IP configuration, the data path options used with the IP, the host system
performance, and the methods by which data movements are programmed. See Data Bandwidth
and Performance Tuning for more information on CPM4 AXI Bridge. The bandwidth ceiling is
limited by the lower of the raw capacity of the designed PCIe link configuration and the internal
data interface used. The Data Bandwidth and Performance Tuning section provides guidance on
the related clock frequency settings and high-level guidance on performance expectations.
Achievable bandwidth might vary.
2. The Bridge is compliant with all MPS and MRRS settings; however, all traffic initiated from the bridge is limited to 256 bytes (maximum).
3. The AXI address width is limited to 48 bits.
4. Writes to the Slot Capability register in ECAM space do not retain their values, but the functionality executes as expected. Reads of the Slot Capability register do not return the written values.
5. AXI Bridge in Root Port mode does not support ASPM L0s or L1.
Product Specification
The Register block contains registers used in the AXI Bridge functional mode for dynamically mapping the AXI4 memory mapped (MM) address range, provided using the AXIBAR parameters, to an address range for PCIe®.
The slave bridge provides termination of memory-mapped AXI4 transactions from an AXI master
device (such as a processor). The slave bridge provides a way to translate addresses that are
mapped within the AXI4 memory mapped address domain to the domain addresses for PCIe. Write
transactions to the Slave Bridge are converted into one or more MemWr TLPs, depending on the
configured Max Payload Size setting, which are passed to the integrated block for PCI Express. The
slave bridge can support up to 32 active AXI4 write requests. When a remote AXI master initiates a read transaction to the slave bridge, the read address and qualifiers are captured, a MemRd request TLP is passed to the core, and a completion timeout timer is started. Completions received through the core are correlated with pending read requests, and read data is returned to the AXI4 master.
Each PCIe MemRd request TLP header is used to create an address and qualifiers for the memory-
mapped AXI4 bus. Read data is collected from the addressed memory mapped AXI4 slave and used
to generate completion TLPs which are then passed to the integrated block for PCI Express. The
Master Bridge can support up to 32 active PCIe MemRd request TLPs with pending completions for
improved AXI4 pipelining performance.
The instantiated AXI4-Stream Enhanced PCIe block contains submodules including the
Requester/Completer interfaces to the AXI bridge and the Register block. The Register block contains
the status, control, and interrupt registers.
The following tables are the translation tables for AXI4-Stream and memory-mapped transactions.
The AXI Bridge functional mode conforms to PCIe® transaction ordering rules. See the PCI-SIG
Specifications for the complete rule set. The following behaviors are implemented in the AXI Bridge
functional mode to enforce the PCIe transaction ordering rules on the highly-parallel AXI bus of the
bridge.
The bresp to the remote (requesting) AXI4 master device for a write to a remote PCIe device is
not issued until the MemWr TLP transmission is guaranteed to be sent on the PCIe link before any
subsequent TX-transfers.
If Relaxed Ordering bit is not set within the TLP header, then a remote PCIe device read to a
remote AXI slave is not permitted to pass any previous remote PCIe device writes to a remote
AXI slave received by the AXI Bridge functional mode. The AXI read address phase is held until
the previous AXI write transactions have completed and bresp has been received for the AXI
write transactions. If the Relaxed Ordering attribute bit is set within the TLP header, then the
remote PCIe device read is permitted to pass.
Read completion data received from a remote PCIe device are not permitted to pass any remote
PCIe device writes to a remote AXI slave received by the AXI Bridge functional mode prior to the
read completion data. The bresp for the AXI write(s) must be received before the completion
data is presented on the AXI read data channel.
✎ Note: The transaction ordering rules for PCIe might have an impact on data throughput in heavy
bidirectional traffic.
Bridge
The Bridge core is an interface between the AXI4 and the PCI Express integrated block. It contains
the memory mapped AXI4 to AXI4-Stream Bridge, and the AXI4-Stream Enhanced Interface Block for
PCIe. The memory mapped AXI4 to AXI4-Stream Bridge contains a register block and two functional
half bridges, referred to as the Slave Bridge and Master Bridge.
The core uses a set of interrupts to detect and flag error conditions.
Slave Bridge
The slave bridge provides termination of memory-mapped AXI4 transactions from an AXI4 master
device (such as a processor). The slave bridge provides a way to translate addresses that are
mapped within the AXI4 memory mapped address domain to the domain addresses for PCIe. Write
transactions to the Slave Bridge are converted into one or more MemWr TLPs, depending on the
configured Max Payload Size setting, which are passed to the integrated block for PCI Express. When
a remote AXI master initiates a read transaction to the slave bridge, the read address and qualifiers
are captured and a MemRd request TLP is passed to the core and a completion timeout timer is
started. Completions received through the core are correlated with pending read requests and read
data is returned to the AXI4 master. The slave bridge can support up to 32 AXI4 write requests, and
32 AXI4 read requests.
CPM does not do any SMID checks for slave AXI4 transfers. Any value is accepted.
✎ Note: If slave reads and writes are both valid, the IP prioritizes reads over writes. You should implement proper arbitration (leave some gaps between reads so that writes can pass through).
BDF Table
Address translation for an AXI address is done based on the BDF table programming (0x2420 to 0x2434). These BDF table entries can be programmed through the NoC AXI Slave interface. There are three regions that you can use for slave data transfers. Each region can be further divided into multiple windows, each with a different address translation. The regions and number of windows are configured in the IP wizard configuration. Each entry in the BDF table represents one window. If you need two windows, then two entries need to be programmed, and so on.
There are some restrictions on programming the BDF table.
1. All PCIe slave bridge data transfers must be quiesced before programming the BDF table.
2. There are six registers for each BDF table entry. All six registers must be programmed to make a valid entry. Even if some registers are all 0s, you need to program 0s in those registers.
3. All six registers must be programmed in order for an entry to be valid. The order is listed below.
a. 0x2420
b. 0x2424
c. 0x2428
d. 0x242C
e. 0x2430
f. 0x2434
BDF table entry start address = 0x2420 + (0x20 * i), where i = table entry number.
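The entry-address arithmetic above can be sketched as follows, assuming the six registers of entry i sit at consecutive 4-byte offsets from the entry start address (consistent with the 0x2420 to 0x2434 range for entry 0); the function names and the register-write callback are hypothetical:

```c
#include <stdint.h>

/* BDF table entry start address = 0x2420 + (0x20 * i); register reg (0..5)
 * of entry i is assumed to sit at a consecutive 4-byte offset from there. */
uint32_t bdf_reg_offset(unsigned entry, unsigned reg)
{
    return 0x2420u + 0x20u * entry + 4u * reg;
}

/* Hypothetical register-write callback standing in for an access to the
 * bridge register space (for example, through the NoC AXI Slave interface). */
typedef void (*reg_wr_fn)(uint32_t offset, uint32_t value);

/* Program one complete BDF table entry. All six registers are written in
 * ascending order, even when a value is 0; quiesce all PCIe slave bridge
 * data transfers before calling this. */
void bdf_program_entry(reg_wr_fn wr, unsigned entry, const uint32_t vals[6])
{
    for (unsigned r = 0; r < 6; r++)
        wr(bdf_reg_offset(entry, r), vals[r]);
}
```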
Address Translation
Slave bridge data transfers can be performed over three regions. You have options to set the size of each region and also how many windows are needed for different address translations per region. If address translation is not needed for a window, you still need to program the BDF table, with the address translation value set to 0x0.
Address translation for Slave Bridge transfer are described in the following examples:
0x0 reserved
For this example, the slave address 0x0000_0000_0000_0100 is translated to 0x0000_0000_0000_C100.
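A minimal sketch of the window translation in this example, assuming a power-of-two window size whose aperture bits pass through while the upper bits are replaced by the programmed translation value (the function name and signature are hypothetical, not the IP's implementation):

```c
#include <stdint.h>

/* Translate a slave-bridge AXI address into the PCIe address domain.
 * window_size must be a power of two; pcie_base is the translation value
 * programmed into the BDF table (0x0 when no translation is wanted). */
uint64_t slv_translate(uint64_t axi_addr, uint64_t window_size,
                       uint64_t pcie_base)
{
    uint64_t mask = window_size - 1;
    /* Low bits pass through the aperture; high bits come from the table. */
    return (pcie_base & ~mask) | (axi_addr & mask);
}
```

With a 4 KB window and a translation value of 0xC000, the text's example address 0x100 maps to 0xC100.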
The slave bridge does not support narrow burst AXI transfers. To avoid narrow burst transfers, connect the AXI SmartConnect module, which converts narrow bursts to full burst AXI transfers.
Master Bridge
The master bridge processes both PCIe MemWr and MemRd request TLPs received from the integrated
block for PCI Express and provides a means to translate addresses that are mapped within the
address for PCIe domain to the memory mapped AXI4 address domain. Each PCIe MemWr request
TLP header is used to create an address and qualifiers for the memory mapped AXI4 bus and the
associated write data is passed to the addressed memory mapped AXI4 Slave. The Master Bridge
can support up to 32 active PCIe MemWr request TLPs. PCIe MemWr request TLPs support is as
follows:
Each PCIe MemRd request TLP header is used to create an address and qualifiers for the memory
mapped AXI4 bus. Read data is collected from the addressed memory mapped AXI4 bridge slave and
used to generate completion TLPs which are then passed to the integrated block for PCI Express.
The Master Bridge in AXI Bridge mode can support up to 32 active PCIe MemRd request TLPs with
pending completions for improved AXI4 pipe-lining performance.
All AXI4 MM master transfers can be directed to modules based on the QDMA controller selection and the steering selection in the GUI, as shown in the following table:
CTRL 0
CPM PCIE NoC 0
CPM PCIE NoC 1
CCI PS AXI 0
CTRL 1
CPM PCIE NoC 0
CPM PCIE NoC 1
CCI PS AXI 0
PL AXI0
PL AXI1
Root Port
When the AXI bridge is configured as a root port, the transfers are directed based on the GUI
selections as shown in the following table:
✎ Note: Root Port mode is not supported in the QDMA0 and QDMA1 controllers at the same time. You can enable Root Port mode in only one controller at a time.
Malformed TLP
The integrated block for PCI Express® detects a malformed TLP. For the IP configured as an
Endpoint core, a malformed TLP results in a fatal error message being sent upstream if error reporting
is enabled in the Device Control register.
Abnormal Conditions
This section describes how the Slave side and Master side (see the following tables) of the AXI Bridge
functional mode handle abnormal conditions.
Slave bridge abnormal conditions are classified as: Illegal Burst Type and Completion TLP Errors. The
following sections describe the manner in which the Bridge handles these errors.
The slave bridge monitors AXI read and write burst type inputs to ensure that only the INCR
(incrementing burst) type is requested. Any other value on these inputs is treated as an error condition
and the Slave Illegal Burst (SIB) interrupt is asserted. In the case of a read request, the Bridge
asserts SLVERR for all data beats and arbitrary data is placed on the Slave AXI4-MM read data bus.
In the case of a write request, the Bridge asserts SLVERR for the write response and all write data is
discarded.
Any request to the bus for PCIe (except for a posted Memory write) requires a completion TLP to
complete the associated AXI request. The Slave side of the Bridge checks the received completion
TLPs for errors and checks for completion TLPs that are never returned (Completion Timeout). Each
of the completion TLP error types are discussed in the subsequent sections.
Unexpected Completion
When the slave bridge receives a completion TLP, it matches the header RequesterID and Tag to the
outstanding RequesterID and Tag. A match failure indicates the TLP is an Unexpected Completion
which results in the completion TLP being discarded and a Slave Unexpected Completion (SUC)
interrupt strobe being asserted. Normal operation then continues.
Unsupported Request
A device for PCIe might not be capable of satisfying a specific read request. For example, if the read
request targets an unsupported address for PCIe, the completer returns a completion TLP with a
completion status of 0b001 - Unsupported Request. A completion TLP returned with a completion status of Reserved must be treated as an Unsupported Request status, according to the PCI Express Base Specification v3.0. When the slave bridge receives an unsupported request
response, the Slave Unsupported Request (SUR) interrupt is asserted and the DECERR response is
asserted with arbitrary data on the AXI4 memory mapped bus.
Completion Timeout
A Completion Timeout occurs when a completion (Cpl) or completion with data (CplD) TLP is not
returned after an AXI to PCIe memory read request, or after a PCIe Configuration Read/Write request.
For PCIe Configuration Read/Write request, completions must complete within the C_COMP_TIMEOUT
parameter selected value from the time the request is issued. For PCIe Memory Read request,
completions must complete within the value set in the Device Control 2 register in the PCIe
Configuration Space register. When a completion timeout occurs, an OKAY response is asserted with
all 1s data on the memory mapped AXI4 bus.
Poison Bit Received on Completion Packet
An Error Poison occurs when the completion TLP EP bit is set, indicating that there is poisoned data
in the payload. When the slave bridge detects the poisoned packet, the Slave Error Poison (SEP)
interrupt is asserted and the SLVERR response is asserted with arbitrary data on the memory
mapped AXI4 bus.
Completer Abort
A Completer Abort occurs when the completion TLP completion status is 0b100 - Completer Abort.
This indicates that the completer has encountered a state in which it was unable to complete the
transaction. When the slave bridge receives a completer abort response, the Slave Completer Abort
1. Read register 0xE10 (INT_DEC) and check if one of the following bits is set: [9] (correctable), [10] (non_fatal), or [11] (fatal).
2. Read register 0xE20 (RP_CSR) and check if bit [16] (efifo_not_empty) is set.
3. If the FIFO is not empty, read the FIFO by reading 0xE2C (RP_FIFO_READ).
a. The error message indicates where the error comes from (that is, the requester ID) and the error type.
4. To clear the error, write to 0xE2C (RP_FIFO_READ). The value does not matter.
5. Repeat steps 2 and 3 until the 0xE2C (RP_FIFO_READ) bit [18] valid bit is cleared.
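The steps above can be sketched in C. The register offsets and bit positions come from the procedure itself, while the accessor types and function names are hypothetical:

```c
#include <stdint.h>

/* Register offsets from the procedure above. */
#define INT_DEC      0xE10u  /* bits [9]/[10]/[11]: correctable/non_fatal/fatal */
#define RP_CSR       0xE20u  /* bit [16]: efifo_not_empty */
#define RP_FIFO_READ 0xE2Cu  /* bit [18]: entry valid */

/* Bit tests used by steps 1, 2, and 5. */
int rp_error_pending(uint32_t int_dec)
{
    return (int_dec & (7u << 9)) != 0;   /* any of bits [11:9] set */
}

int rp_efifo_not_empty(uint32_t rp_csr)
{
    return (rp_csr & (1u << 16)) != 0;
}

int rp_fifo_entry_valid(uint32_t fifo_word)
{
    return (fifo_word & (1u << 18)) != 0;
}

/* Hypothetical register accessors for the bridge register space. */
typedef uint32_t (*reg_rd_fn)(uint32_t offset);
typedef void (*reg_wr_fn)(uint32_t offset, uint32_t value);

/* Drain the error FIFO per steps 2-5; returns the number of valid error
 * messages read (each carries the requester ID and the error type). */
unsigned rp_drain_error_fifo(reg_rd_fn rd, reg_wr_fn wr)
{
    unsigned count = 0;
    while (rp_efifo_not_empty(rd(RP_CSR))) {
        uint32_t msg = rd(RP_FIFO_READ);
        if (!rp_fifo_entry_valid(msg))
            break;                       /* step 5: valid bit cleared */
        count++;
        wr(RP_FIFO_READ, 0);             /* step 4: pop; value ignored */
    }
    return count;
}
```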
1. Read register 0xE10 (INT_DEC) and check if bit [17] is set, which indicates a PM_PME message has been received.
2. Read register 0xE20 (RP_CSR) and check if bit [18] (pfifo_not_empty) is set.
3. If the FIFO is not empty, read the FIFO by reading 0xE30 (RP_PFIFO).
a. The message indicates where it comes from (that is, the requester ID).
4. To clear the message, write to 0xE30 (RP_PFIFO). The value does not matter.
5. Repeat steps 2 and 3 until the 0xE30 (RP_PFIFO) bit [31] valid bit is cleared.
6. Write 1 to register 0xE10 (INT_DEC) to clear bit [17].
The following sections describe the manner in which the master bridge handles abnormal conditions.
When the master bridge receives a DECERR response from the AXI bus, the request is discarded
and the Master DECERR (MDE) interrupt is asserted. If the request was non-posted, a completion
packet with the Completion Status = Unsupported Request (UR) is returned on the bus for PCIe.
When the master bridge receives a SLVERR response from the addressed AXI slave, the request is
discarded and the Master SLVERR (MSE) interrupt is asserted. If the request was non-posted, a
completion packet with the Completion Status = Completer Abort (CA) is returned on the bus for PCIe.
Max Payload Size for PCIe, Max Read Request Size or 4K Page Violated
When the master bridge receives a SLVERR response from the addressed AXI slave, the request is
discarded and the Master SLVERR (MSE) interrupt is asserted. If the request was non-posted, a
completion packet with the Completion Status = Completer Abort (CA) is returned on the bus for PCIe.
Completion Packets
When the MAX_READ_REQUEST_SIZE is greater than the MAX_PAYLOAD_SIZE, a read request for PCIe
can ask for more data than the master bridge can insert into a single completion packet. When this
situation occurs, multiple completion packets are generated up to MAX_PAYLOAD_SIZE, with the Read
Completion Boundary (RCB) observed.
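The split described above can be modeled as follows, assuming byte-granular sizes, an MPS that is a multiple of the RCB, and that every completion except possibly the last ends on an RCB boundary; this is an illustrative model, not the bridge's exact packetization:

```c
#include <stdint.h>
#include <stddef.h>

/* Split one MemRd completion into CplD payloads, each bounded by
 * MAX_PAYLOAD_SIZE (mps) and ending on a Read Completion Boundary (rcb),
 * except possibly the last. Sizes are in bytes; mps and rcb are powers of
 * two with rcb <= mps. Fills sizes[] and returns the completion count. */
size_t split_completions(uint64_t addr, uint32_t len,
                         uint32_t mps, uint32_t rcb,
                         uint32_t *sizes, size_t max_n)
{
    size_t n = 0;
    while (len > 0 && n < max_n) {
        /* A chunk carries up to MPS minus the starting offset within the
         * current RCB block, so every chunk boundary is RCB-aligned. */
        uint32_t chunk = mps - (uint32_t)(addr & (rcb - 1));
        if (chunk > len)
            chunk = len;
        sizes[n++] = chunk;
        addr += chunk;
        len  -= chunk;
    }
    return n;
}
```

For example, a 512-byte read starting at address 0x10 with a 256-byte MPS and 64-byte RCB splits into 240-, 256-, and 16-byte completions, each intermediate one ending on an RCB boundary.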
When the poison bit is set in a transaction layer packet (TLP) header, the payload following the
header is corrupted. When the master bridge receives a memory request TLP with the poison bit set,
it discards the TLP and asserts the Master Error Poison (MEP) interrupt strobe.
When the master bridge receives a read request with the Length = 0x1, FirstBE = 0x00, and LastBE =
0x00, it responds by sending a completion with Status = Successful Completion.
When the master bridge receives a write request with the Length = 0x1, FirstBE = 0x00, and LastBE
= 0x00 there is no effect.
The normal operation of the functional mode is dependent on the integrated block for PCIe
establishing and maintaining the point-to-point link with an external device for PCIe. If the link has
been lost, it must be re-established to return to normal operation.
When a Hot Reset is received by the functional mode, the link goes down and the PCI Configuration
Space must be reconfigured.
Initiated AXI4 write transactions that have not yet completed on the AXI4 bus when the link goes down
have a SLVERR response given and the write data is discarded. Initiated AXI4 read transactions that
have not yet completed on the AXI4 bus when the link goes down have a SLVERR response given,
with arbitrary read data returned.
Endpoint
When configured to support Endpoint functionality, the AXI Bridge functional mode fully supports
Endpoint operation as supported by the underlying block. There are a few details that need special
consideration. The following subsections contain information and design considerations about
Endpoint support.
Interrupts
The interrupt modes in the following section apply to AXI Bridge mode only.
Multiple interrupt modes can be configured during IP configuration, however only one interrupt mode
is used at runtime. If multiple interrupt modes are enabled by the host after PCI bus enumeration at
runtime, MSI-X interrupt takes precedence over MSI interrupt, and MSI interrupt takes precedence
over Legacy interrupt. All of these interrupt modes are sent using the same xdma0_usr_irq_*
interface and the core automatically picks the best available interrupt mode at runtime.
Legacy Interrupts
Asserting one or more bits of xdma0_usr_irq_req when legacy interrupts are enabled causes the IP
to issue a legacy interrupt over PCIe. Multiple bits may be asserted simultaneously but each bit must
remain asserted until the corresponding xdma0_usr_irq_ack bit has been asserted. After a
xdma0_usr_irq_req bit is asserted, it must remain asserted until the corresponding
xdma0_usr_irq_ack bit is asserted and the interrupt has been serviced and cleared by the Host. The
xdma0_usr_irq_ack assertion indicates the requested interrupt has been sent on the PCIe block.
This will ensure the interrupt pending register within the IP remains asserted when queried by the Host's
Interrupt Service Routine (ISR) to determine the source of interrupts. You must implement a
mechanism in the user application to know when the interrupt routine has been serviced. This
detection can be done in many different ways depending on your application and your use of this
interrupt pin. This typically involves a register (or array of registers) implemented in the user
application that is cleared, read, or modified by the Host software when an interrupt is serviced.
After the xdma0_usr_irq_req bit is deasserted, it cannot be reasserted until the corresponding
xdma0_usr_irq_ack bit has been asserted for a second time. This indicates the deassertion
message for the legacy interrupt has been sent over PCIe. After a second xdma0_usr_irq_ack
occurred, the xdma0_usr_irq_req wire can be reasserted to generate another legacy interrupt.
The xdma0_usr_irq_req bit can be mapped to legacy interrupt INTA, INTB, INTC, INTD through the
configuration registers. The following figure shows the legacy interrupts.
✎ Note: This figure shows only the handshake between xdma0_usr_irq_req and
xdma0_usr_irq_ack. The user application might not clear or service the interrupt immediately, in
which case, you must keep xdma0_usr_irq_req asserted past xdma0_usr_irq_ack.
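The request/ack rules above can be summarized as a small state machine: request, wait for the first ack (interrupt message sent), deassert after service, then wait for the second ack (deassertion message sent) before requesting again. The following is a minimal C sketch of that protocol for one request bit; the type and function names are illustrative, not part of the IP interface.

```c
#include <stdbool.h>

/* Toy model of the legacy-interrupt handshake for one xdma0_usr_irq_req
 * bit: req may only be reasserted after the second ack (the deassertion
 * message) has been observed. */
typedef enum { IRQ_IDLE, IRQ_ASSERTED, IRQ_WAIT_DEASSERT_ACK } irq_state_t;

typedef struct { irq_state_t state; bool req; } irq_model_t;

/* Try to raise a new interrupt; legal only when fully idle. */
static bool irq_raise(irq_model_t *m)
{
    if (m->state != IRQ_IDLE) return false;
    m->req = true;
    m->state = IRQ_ASSERTED;
    return true;
}

/* Called for each ack pulse from the IP. First ack: interrupt message sent,
 * req can drop once the Host has serviced the interrupt. Second ack:
 * deassertion message sent, req may be raised again. */
static void irq_ack(irq_model_t *m)
{
    if (m->state == IRQ_ASSERTED) {
        m->req = false;                 /* assume Host serviced the IRQ */
        m->state = IRQ_WAIT_DEASSERT_ACK;
    } else if (m->state == IRQ_WAIT_DEASSERT_ACK) {
        m->state = IRQ_IDLE;
    }
}
```

A real design tracks the Host-side service status (the register array described above) before dropping req; this sketch folds that into the first ack for brevity.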
MSI and MSI-X Interrupts
Asserting one or more bits of xdma0_usr_irq_req causes the generation of an MSI or MSI-X
interrupt if MSI or MSI-X is enabled. If both MSI and MSI-X capabilities are enabled, an MSI-X
interrupt is generated. The Internal MSI-X interrupts mode is enabled when you set the MSI-X
Implementation Location option to Internal in the PCIe Misc Tab.
After an xdma0_usr_irq_req bit is asserted, it must remain asserted until the corresponding
xdma0_usr_irq_ack bit is asserted and the interrupt has been serviced and cleared by the Host. The
xdma0_usr_irq_ack assertion indicates the requested interrupt has been sent on the PCIe block.
This ensures the interrupt pending register within the IP remains asserted when queried by the
Host's Interrupt Service Routine (ISR) to determine the source of interrupts. You must implement a
mechanism in the user application to know when the interrupt routine has been serviced. This
detection can be done in many different ways depending on your application and your use of this
interrupt pin. This typically involves a register (or array of registers) implemented in the user
application that is cleared, read, or modified by the Host software when an interrupt is serviced.
Configuration registers are available to map xdma0_usr_irq_req and DMA interrupts to MSI or MSI-
X vectors. For MSI-X support, there is also a vector table and PBA table. The following figure shows
the MSI interrupt.
✎ Note: This figure shows only the handshake between xdma0_usr_irq_req and
xdma0_usr_irq_ack. Your application might not clear or service the interrupt immediately, in which
case, you must keep xdma0_usr_irq_req asserted past xdma0_usr_irq_ack.
Root Port
When configured to support Root Port functionality, the AXI Bridge functional mode fully supports
Root Port operation as supported by the underlying block. There are a few details that need special
consideration.
When the functional mode is configured as a Root Port, configuration traffic is generated by using the
PCI Express enhanced configuration access mechanism (ECAM). ECAM functionality is available
only when the core is configured as a Root Port. Reads and writes to a certain memory aperture are
translated to configuration reads and writes, as specified in the PCI Express Base Specification
(v3.0), §7.2.2.
The address breakdown is defined in the following table. ECAM is used in conjunction with the Bridge
Register Memory Map only when used in both AXI Bridge for PCIe Gen3 core as well as DMA/Bridge
Subsystem for PCIe in AXI Bridge mode core. The DMA/Bridge Subsystem for PCIe Register Memory
Map does not have ECAM functionality.
When an ECAM access is attempted to the primary bus number, which defaults as bus 0 from reset,
then access to the type 1 PCI™ Configuration Header of the integrated block in the Enhanced
Interface for PCIe is performed. When an ECAM access is attempted to the secondary bus number,
then type 0 configuration transactions are generated. When an ECAM access is attempted to a bus
number that is in the range defined by the secondary bus number and subordinate bus number (not
including the secondary bus number), then type 1 configuration transactions are generated. The
primary, secondary, and subordinate bus numbers are written by Root Port software to the type 1 PCI
Configuration Header of the Enhanced Interface for PCIe in the beginning of the enumeration
procedure.
When an ECAM access is attempted to a bus number that is outside the range defined by the
secondary bus number and the subordinate bus number, the bridge does not generate a
configuration request and signals a SLVERR response on the AXI4-Lite bus. When the Bridge is
configured for EP (PL_UPSTREAM_FACING = TRUE), the
underlying Integrated Block configuration space and the core memory map are available at the
beginning of the memory space. The memory space looks like a simple PCI Express® configuration
space. When the Bridge is configured for RC (PL_UPSTREAM_FACING = FALSE), the same is true, but
it also looks like an ECAM access to primary bus, Device 0, Function 0.
When the functional mode is configured as a Root Port, the reads and writes of the local ECAM are
Bus 0. Because the adaptive SoC only has a single Integrated Block for PCIe core, all local ECAM
operations to Bus 0 return the ECAM data for Device 0, Function 0.
Configuration write accesses across the PCI Express bus are non-posted writes and block the AXI4-
Lite interface while they are in progress. Because of this, system software is not able to service an
interrupt if one were to occur. However, interrupts due to abnormal terminations of configuration
transactions can generate interrupts. ECAM read transactions block subsequent Requester read
TLPs until the configuration read completions packet is returned to allow unique identification of the
completion packet.
Bits [11:8], Extended Register Number: along with the Register Number, allows access to the
PCI Express Extended Configuration Space.
Root Port PCIe enumeration is done through ECAM register space. Each PCIe device or function is
allocated 4 KB address space which holds their PCIe Configuration Space register. The upper
address field of the ECAM register space consists of the PCIe Bus Device Function number to select
the target device. ECAM register space automatically routes and generates the appropriate PCIe
Configuration Request TLP Type based on the target PCIe Bus Device Function number as well as
the programmed Primary Bus Number, Secondary Bus Number, and Subordinate Bus Number field.
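The routing rules above can be illustrated with a short sketch. The bit packing follows the standard PCI Express ECAM encoding (bus[27:20], device[19:15], function[14:12], register[11:2]); the helper names and the enum are illustrative, not part of the IP.

```c
#include <stdint.h>

/* Hypothetical helper: byte offset into the ECAM aperture for a given
 * Bus/Device/Function and configuration-space register, per the standard
 * PCI Express ECAM encoding. */
static uint32_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
    return ((uint32_t)bus << 20) |
           ((uint32_t)(dev & 0x1F) << 15) |
           ((uint32_t)(fn  & 0x7)  << 12) |
           (reg & 0xFFC);            /* 4-byte aligned register offset */
}

/* Which configuration access the bridge performs for a target bus, given
 * the programmed primary/secondary/subordinate bus numbers. */
typedef enum { CFG_LOCAL, CFG_TYPE0, CFG_TYPE1, CFG_SLVERR } cfg_access_t;

static cfg_access_t cfg_access_type(uint8_t bus, uint8_t primary,
                                    uint8_t secondary, uint8_t subordinate)
{
    if (bus == primary)
        return CFG_LOCAL;   /* self-configuration of the integrated block */
    if (bus == secondary)
        return CFG_TYPE0;   /* directly attached device */
    if (bus > secondary && bus <= subordinate)
        return CFG_TYPE1;   /* forwarded downstream */
    return CFG_SLVERR;      /* out of range: no request, SLVERR response */
}
```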
Enumeration process through the Root Port PCIe Bridge IP follows the standard PCIe Bus discovery
as well as PCIe Configuration Space programming sequence, as defined by the PCIe Base
Specification. Root Port PCIe Bridge lists all the PCIe capabilities enabled in the Root Port up to AER
Capabilities register. The remaining PCIe capabilities registers in the Root Port Configuration Space
are not visible in the standard PCIe Configuration Space link list, however they follow the standard
PCIe Configuration Space layout. Bridge register in the Root Port use one of the PCIe User Extended
Configuration Space region and is accessible when targeting the Root Port Bus Device Function
number. All downstream devices (PCIe switches, Endpoints) attached to the Root Port show all PCIe
capabilities registers without any limitation.
The Root Port configuration address offsets are not listed correctly: below AER, the Next Capability
pointer does not point to the proper address, which can result in incorrect configuration values being
read. All listed values up to AER are correct. You can read the extended configuration space below
AER using fixed target addresses; the target address values are listed below.
AER 0x100
VC 0x1F0
16 GT Cap 0x3B0
The AXI Bridge functional mode automatically sends a Power Limit Message TLP when the Master
Enable bit of the Command Register is set. The software must set the Requester ID register before
setting the Master Enable bit to ensure that the desired Requester ID is used in the Message TLP.
When an ECAM access is performed to the primary bus number, self-configuration of the integrated
block for PCIe is performed. A PCIe configuration transaction is not performed and is not presented
on the link. When an ECAM access is performed to the bus number that is equal to the secondary bus
value in the Enhanced PCIe Type 1 configuration header, then Type 0 configuration transactions are
generated.
When an ECAM access is attempted to a bus number that is in the range defined by the secondary
bus number and subordinate bus number range (not including secondary bus number), then Type 1
configuration transactions are generated. The primary, secondary and subordinate bus numbers are
written and updated by Root Port software to the Type 1 PCI™ Configuration Header of the AXI
Bridge functional mode in the enumeration procedure.
When an ECAM access is attempted to a bus number that is out of the range defined by the
secondary bus number and the subordinate bus number, the bridge does not generate a
configuration request and signals a SLVERR response on the AXI4 MM bus.
When an Unsupported Request (UR) response is received for a configuration read request, all ones
are returned on the AXI bus to signify that a device does not exist at the requested device address. It
is the responsibility of the software to ensure configuration write requests are not performed to device
addresses that do not exist. However, the AXI Bridge functional mode asserts SLVERR response on
the AXI bus when a configuration write request is performed on device addresses that do not exist or
a UR response is received.
The Root Port BAR does not support packet filtering (all TLPs received from the PCIe link are
forwarded to the user logic); however, Address Translation can be enabled or disabled, depending on
the IP configuration.
During core customization in the AMD Vivado™ Design Suite, when there is no BAR enabled, RP
passes all received packets to the user application without address translation or address filtering.
When BAR is enabled, by default the BAR address starts at 0x0000_0000 unless programmed
separately. Any packet received from the PCIe® link that hits a BAR is translated according to the
PCIE-to-AXI Address Translation rules.
✎ Note: The IP must not receive any TLPs outside of the PCIe BAR range from the PCIe link when
the RP BAR is enabled. If this rule cannot be enforced, it is recommended that the PCIe BAR be
disabled and that address filtering and/or translation be performed outside of the IP.
The Root Port BAR customization options in the Vivado Design Suite are found in the PCIe BARs
Tab.
Configuration transactions are non-posted transactions. The AXI Bridge functional mode has a timer
for timeout termination of configuration transactions that have not completed on the PCIe link. SLVERR
is returned when a configuration timeout occurs. Timeouts of configuration transactions are flagged by
an interrupt as well.
✎ Note: Multiple configuration reads (PCIe CFG reads) can block configuration writes (PCIe CFG
writes). You must implement a throttling mechanism for CFG reads so that CFG writes can proceed.
Responses on AXI to abnormal terminations of configuration transactions are shown in the following
table.
Config Read or Write Completion timeout: SLVERR response is asserted.
Receiving Interrupts
In Root Port mode, you can choose one of two ways to handle incoming interrupts:
Legacy Interrupt FIFO mode: Legacy Interrupt FIFO mode is the default. It is available in earlier
Bridge IP variants and versions, and will continue to be available. Legacy Interrupt FIFO mode is
geared towards compatibility for legacy designs.
Interrupt Decode mode: Interrupt Decode mode is available in the CPM AXI Bridge. Interrupt
Decode mode can be used to mitigate the Interrupt FIFO overflow condition that can occur in a
design that receives interrupts at a high rate, and it avoids the performance penalty incurred when
such a condition occurs.
Port Description
Global Signals
The interface signals for the Bridge are described in the following table.
cpm_cor_irq O Reserved
cpm_misc_irq O Reserved
cpm_uncor_irq O Reserved
cpm_irq0 I Reserved
cpm_irq1 I Reserved
AXI Bridge Slave ports are connected from the AMD Versal™ device programmable Network on Chip
(NoC) to the CPM DMA internally. For slave bridge AXI-MM details and configuration, see Versal
Adaptive SoC Programmable Network on Chip and Integrated Memory Controller LogiCORE IP
Product Guide (PG313).
AXI4 (MM) Master ports are connected from the AMD Versal device Network on Chip (NoC) to the
CPM DMA internally. For details, see Versal Adaptive SoC Programmable Network on Chip and
Integrated Memory Controller LogiCORE IP Product Guide (PG313). The AXI4 Master interface can
be connected to the DDR or the PL, depending on the NoC configuration.
The CIPS IP does not support the AXI4-Lite Master interface. Use the SmartConnect IP to connect
the NoC to the AXI4-Lite Master interface. For details, see SmartConnect LogiCORE IP Product
Guide (PG247).
xdma0_usr_irq_fnc[7:0] I Function
The function of the vector to be sent.
✎ Note: The xdma0_ prefix in the above signal names will be changed to dma0_* in a future release.
NUM_USR_IRQ is selectable and ranges from 0 to 15. Each bit in the xdma0_usr_irq_req bus
corresponds to the same bit in xdma0_usr_irq_ack. For example, xdma0_usr_irq_ack[0]
represents an ack for xdma0_usr_irq_req[0].
Register Space
You can access the register space when AXI Slave Bridge mode is enabled (based on GUI settings).
You can also access Bridge registers in Controller0 or in Controller1, and you can also access the
Host memory space. The Host memory address offset varies based on the Controller0 and/or
Controller1 selection. If only Controller0 is enabled, the table below shows the address ranges and
limitations.
The table below shows the address ranges and limitations for when Controller1 is enabled.
Slave Bridge access to Host memory space: 0xE800_0000 - 0xEFFF_FFFF, 0x7_1101_0000 -
0x7_FFFF_FFFF, and 0xA0_0000_0000 - 0xBF_FFFF_FFFF. The address range for Slave Bridge
access is set during IP customization in the Address Editor tab of the Vivado IDE.
When both Controller0 and Controller1 are enabled, the table above remains the same for
Controller1. The table below represents Controller0.
Slave Bridge access to Host memory space: 0xE000_0000 - 0xE7FF_FFFF, 0x6_1101_0000 -
0x6_FFFF_FFFF, and 0x80_0000_0000 - 0x9F_FFFF_FFFF. The address range for Slave Bridge
access is set during IP customization in the Address Editor tab of the Vivado IDE.
The Register Space mentioned in this document is also accessible through the AXI4 Memory
Mapped Slave interface. All accesses to these registers are based on the following AXI Base
Addresses:
The offsets within each register space are the same as listed for the PCIe BAR accesses.
Make sure that all transactions targeting these register spaces have AWCACHE[1] and
ARCACHE[1] set to 1'b0 (Non-Modifiable) and access them only with 4-byte transactions.
All transactions originating from the Programmable Logic (PL) region must have an AXI Master
that sets AxCACHE[1] = 1'b0 before they enter the AXI NoC.
All transactions originating from the APU or RPU must be defined by a Memory Attribute of
nGnRnE or nGnRE to ensure AxCACHE[1] = 1'b0.
Transactions originating from the PPU have no additional requirements.
Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
Vivado Design Suite User Guide: Designing with IP (UG896)
Vivado Design Suite User Guide: Getting Started (UG910)
Vivado Design Suite User Guide: Logic Simulation (UG900)
Debugging
This appendix includes details about resources available on the AMD Support website and debugging
tools.
To help in the design and debug process when using the functional mode, the Support web page
contains key resources such as product documentation, release notes, answer records, information
about known issues, and links for obtaining further product support. The Community Forums are also
available where members can learn, participate, share, and ask questions about AMD Adaptive
Computing solutions.
This product guide is the main document associated with the functional mode. This guide, along with
documentation related to all products that aid in the design process, can be found on the Support web
page or by using the AMD Adaptive Computing Documentation Navigator. Download the
Documentation Navigator from the Downloads page. For more information about this tool and the
features available, open the online help after installation.
Debug Guide
Answer Records
Answer Records include information about commonly encountered problems, helpful information on
how to resolve these problems, and any known issues with an AMD Adaptive Computing product.
Answer Records are created and maintained daily to ensure that users have access to the most
accurate information available.
Answer Records for this functional mode can be located by using the Search Support box on the main
Support web page. To maximize your search results, use keywords such as:
Product name
Tool message(s)
Summary of the issue encountered
A filter search is available after results are returned to further target the results.
AR 75396.
Technical Support
AMD Adaptive Computing provides technical support on the Community Forums for this AMD
LogiCORE™ IP product when used as described in the product documentation. AMD Adaptive
Computing cannot guarantee timing, functionality, or support if you do any of the following:
Implement the solution in devices that are not defined in the documentation.
Customize the solution beyond that allowed in the product documentation.
Change any section of the design labeled DO NOT MODIFY.
Hardware Debug
Hardware issues can range from link bring-up to problems seen after hours of testing. This section
provides debug steps for common issues. The AMD Vivado™ debug feature is a valuable resource to
use in hardware debug.
General Checks
Ensure that all the timing constraints for the core were properly incorporated from the example design
and that all constraints were met during implementation.
If using MMCMs in the design, ensure that all MMCMs have obtained lock by monitoring the
locked port.
Registers
A complete list of registers and attributes for the AXI Bridge Subsystem is available in the Versal
Adaptive SoC Register Reference (AM012). Reviewing the registers and attributes might be helpful
for advanced debugging.
✎ Note: The attributes are set during IP customization in the Vivado IP catalog. After core
customization, attributes are read-only.
Upgrading
This appendix is not applicable for the first release.
This diagram refers to the Requester Request (RQ)/Requester Completion (RC) interfaces, and the
Completer Request (CQ)/Completer Completion (CC) interfaces.
Limitations
SR-IOV
Example design not supported for all configurations
Narrow burst (not supported on the master interface)
Invalid descriptors can cause a system crash; the user/driver is responsible for generating valid
descriptors.
Architecture
Internally, the subsystem can be configured to implement up to eight independent physical DMA
engines (up to four H2C and four C2H). These DMA engines can be mapped to individual AXI4-
Stream interfaces or a shared AXI4 memory mapped (MM) interface to the user application. On the
AXI4 MM interface, the XDMA Subsystem generates requests and expected completions. The AXI4-
Stream interface is data-only.
The type of channel configured determines which bus the transactions occur on:
A Host-to-Card (H2C) channel generates read requests to PCIe and provides the data or
generates a write request to the user application.
A Card-to-Host (C2H) channel either waits for data on the user side or generates a read request
on the user side and then generates a write request containing the data received to PCIe.
Target Bridge
The target bridge receives requests from the host. Based on BARs, the requests are directed to the
internal registers, or the CQ bypass port. After the downstream user logic has returned data for a non-
posted request, the target bridge generates a read completion TLP and sends it to the PCIe IP over
the CC bus.
In the following tables, the PCIe BARs selection corresponds to the options set in the PCIe BARs tab
in the IP Configuration GUI.
H2C Channel
The number of H2C channels is configured in the AMD Vivado™ Integrated Design Environment
(IDE). The H2C channel handles DMA transfers from the host to the card. It is responsible for splitting
read requests based on maximum read request size, and available internal resources. The DMA
channel maintains a maximum number of outstanding requests based on RNUM_RIDS, the number of
outstanding H2C channel request IDs parameter. Each split, if any, of a read request consumes an
additional read request entry. A request is outstanding from when the DMA channel issues the read
to the PCIe RQ block until it receives confirmation that the write has completed on the user interface
in order. After a transfer is complete, the DMA channel issues a writeback or interrupt to inform the
host.
The H2C channel also splits transactions on both its read and write interfaces. On the read interface to
the host, transactions are split to meet the maximum read request size configured, and based on
available Data FIFO space. Data FIFO space is allocated at the time of the read request to ensure
space for the read completion. The PCIe RC block returns completion data to the allocated Data
Buffer locations. To minimize latency, upon receipt of any completion data, the H2C channel begins
issuing write requests to the user interface. It also breaks the write requests into maximum payload
size. On an AXI4-Stream user interface, this splitting is transparent.
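The splitting described above can be sketched with a small helper. The cap at the configured maximum read request size comes from this guide; the rule that a PCIe request must not cross a 4 KB address boundary comes from the PCIe Base Specification. The function name and signature are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: length of the next read request when splitting a transfer so that
 * no request exceeds the MRRS and no request crosses a 4 KB boundary. */
static size_t next_read_len(uint64_t addr, size_t remaining, size_t mrrs)
{
    size_t to_4k = (size_t)(0x1000 - (addr & 0xFFF)); /* bytes to 4K edge */
    size_t len = remaining < mrrs ? remaining : mrrs; /* cap at MRRS */
    return len < to_4k ? len : to_4k;                 /* cap at boundary */
}
```

A transfer is consumed by calling this in a loop, advancing the address and decrementing the remaining count by the returned length each iteration.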
When multiple channels are enabled, transactions on the AXI4 Master interface are interleaved
between all selected channels. Simple round robin protocol is used to service all channels.
Transaction granularity depends on host Max Payload Size (MPS), page size, and other host
settings.
C2H Channel
The C2H channel handles DMA transfers from the card to the host. The instantiated number of C2H
channels is controlled in the AMD Vivado™ IDE. Similarly, the number of outstanding transfers is
configured through WNUM_RIDS, the number of C2H channel request IDs. In an AXI4-
Stream configuration, the details of the DMA transfer are set up in advance of receiving data on the
AXI4-Stream interface. This is normally accomplished through receiving a DMA descriptor. After the
request ID has been prepared and the channel is enabled, the AXI4-Stream interface of the channel
can receive data and perform the DMA to the host. In an AXI4 MM interface configuration, the request
IDs are allocated as the read requests to the AXI4 MM interface are issued. Similar to the H2C
channel, a given request ID is outstanding until the write request has been completed. In the case of
the C2H channel, write request completion is when the write request has been issued as indicated by
the PCIe IP.
When multiple channels are enabled, transactions on the AXI4 Master interface are interleaved
between all selected channels. Simple round robin protocol is used to service all channels.
Transaction granularity depends on host Max Payload Size (MPS), page size, and other host
settings.
Host requests that reach the PCIe to DMA bypass BAR are sent to this module. The bypass master
port is an AXI4 MM interface and supports read and write accesses.
IRQ Module
The IRQ module receives a configurable number of interrupt wires from the user logic and one
interrupt wire from each DMA channel. This module is responsible for generating an interrupt over
PCIe. Support for MSI-X, MSI, and legacy interrupts can be specified during IP configuration.
✎ Note: The Host can enable one or more interrupt types from the specified list of supported
interrupts during IP configuration. The IP only generates one interrupt type at a given time even when
there is more than one enabled. The MSI-X interrupt takes precedence over the MSI interrupt, and
the MSI interrupt takes precedence over the Legacy interrupt. The Host software must not switch
(either enable or disable) an interrupt type while there is an interrupt asserted or pending.
Legacy Interrupts
Asserting one or more bits of xdma0_usr_irq_req when legacy interrupts are enabled causes the
DMA to issue a legacy interrupt over PCIe. Multiple bits can be asserted simultaneously, but each bit
must remain asserted until the corresponding xdma0_usr_irq_ack bit has been asserted. After an
xdma0_usr_irq_req bit is asserted, it must remain asserted until the corresponding
xdma0_usr_irq_ack bit is asserted and the interrupt is serviced and cleared by the Host. This
ensures the interrupt pending register within the IP remains asserted when queried by the Host's
Interrupt Service Routine (ISR) to determine the source of interrupts. The xdma0_usr_irq_ack
assertion indicates the requested interrupt has been sent to the PCIe block. You must implement a
mechanism in the user application to know when the interrupt routine has been serviced. This
detection can be done in many different ways depending on your application and your use of this
interrupt pin. This typically involves a register (or array of registers) implemented in the user
application that is cleared, read, or modified by the Host software when an interrupt is serviced.
MSI and MSI-X Interrupts
Asserting one or more bits of xdma0_usr_irq_req causes the generation of an MSI or MSI-X
interrupt if MSI or MSI-X is enabled. If both MSI and MSI-X capabilities are enabled, an MSI-X
interrupt is generated.
After an xdma0_usr_irq_req bit is asserted, it must remain asserted until the corresponding
xdma0_usr_irq_ack bit is asserted and the interrupt has been serviced and cleared by the Host. The
xdma0_usr_irq_ack assertion indicates the requested interrupt has been sent to the PCIe block.
This ensures the interrupt pending register within the IP remains asserted when queried by the
Host's Interrupt Service Routine (ISR) to determine the source of interrupts. You must implement a
mechanism in the user application to know when the interrupt routine has been serviced. This
detection can be done in many different ways depending on your application and your use of this
interrupt pin. This typically involves a register (or array of registers) implemented in the user
application that is cleared, read, or modified by the Host software when an interrupt is serviced.
Configuration registers are available to map xdma0_usr_irq_req and DMA interrupts to MSI or MSI-
X vectors. For MSI-X support, there is also a vector table and PBA table. The following figure shows
the MSI interrupt.
✎ Note: This figure shows only the handshake between xdma0_usr_irq_req and
xdma0_usr_irq_ack. Your application might not clear or service the interrupt immediately, in which
case, you must keep xdma0_usr_irq_req asserted past xdma0_usr_irq_ack.
Config Block
The config module is the DMA register space, which contains PCIe® solution IP configuration
information and DMA control registers. It stores the PCIe IP configuration information that is relevant
to the XDMA. This configuration information can be read through register reads to the appropriate
register offset within the config module.
Product Specification
XDMA Operations
Descriptors
The XDMA Subsystem uses a linked list of descriptors that specify the source, destination, and length
of the DMA transfers. Descriptor lists are created by the driver and stored in host memory. The DMA
channel is initialized by the driver with a few control registers to begin fetching the descriptor lists and
executing the DMA operations.
Descriptors describe the memory transfers that the XDMA should perform. Each channel has its own
descriptor list. The start address of each channel's descriptor list is initialized in hardware registers by
the driver. After the channel is enabled, the descriptor channel begins to fetch descriptors from the
initial address. Thereafter, it fetches from the Nxt_adr[63:0] field of the last descriptor that was
fetched. Descriptors must be aligned to a 32 byte boundary.
The size of the initial block of adjacent descriptors is specified with the Dsc_Adj register. After the
initial fetch, the descriptor channel uses the Nxt_adj field of the last fetched descriptor to determine
the number of descriptors at the next descriptor address. A block of adjacent descriptors must not
cross a 4K address boundary. The descriptor channel fetches as many descriptors in a single request
as it can, limited by the MRRS, the number of adjacent descriptors, and the available space in the
channel's descriptor buffer.
✎ Note: Because the MRRS in most host systems is 512 bytes or 1024 bytes, fetching more than 32
adjacent descriptors in a single request is not allowed. However, the design allows a maximum of 64
descriptors in a single block of adjacent descriptors if needed.
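These limits can be expressed as a rough sketch, assuming the 32-byte descriptor size given below and that a single fetch never exceeds the MRRS; the function name is illustrative.

```c
/* Sketch: maximum descriptors fetchable in one read request. Each
 * descriptor is 32 bytes, so one request of MRRS bytes covers at most
 * MRRS/32 descriptors, and an adjacent block is capped at 64. */
static unsigned max_desc_per_request(unsigned mrrs_bytes)
{
    unsigned n = mrrs_bytes / 32;
    return n > 64 ? 64 : n;
}
```

With the common MRRS values of 512 and 1024 bytes, this gives 16 and 32 descriptors per request, matching the note above.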
Offset Fields
0x08 Src_adr[31:0]
0x0C Src_adr[63:32]
0x10 Dst_adr[31:0]
0x14 Dst_adr[63:32]
0x18 Nxt_adr[31:0]
0x1C Nxt_adr[63:32]
The Control field at offset 0x0 contains the following bits:
Bits 7, 6, 5: Reserved
Bits 3, 2: Reserved
Bit 1 (Completed): Set to 1 to interrupt after the engine has completed this descriptor. This requires
the global IE_DESCRIPTOR_COMPLETED control flag to be set in the H2C/C2H Channel control
register.
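Assembling the offsets listed above, one descriptor can be sketched as a C struct. The field names mirror the offset table; the exact bit positions within the control word are not reproduced here, and the struct name is illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of one 32-byte XDMA descriptor, built from the offsets listed
 * above (0x00 control, 0x04 length, 0x08 source, 0x10 destination,
 * 0x18 next descriptor address). */
struct xdma_desc {
    uint32_t control;       /* 0x00: magic, Nxt_adj, control flags */
    uint32_t length;        /* 0x04: transfer length in bytes */
    uint32_t src_adr_lo;    /* 0x08 */
    uint32_t src_adr_hi;    /* 0x0C */
    uint32_t dst_adr_lo;    /* 0x10 */
    uint32_t dst_adr_hi;    /* 0x14 */
    uint32_t nxt_adr_lo;    /* 0x18: next descriptor address */
    uint32_t nxt_adr_hi;    /* 0x1C */
} __attribute__((aligned(32)));   /* descriptors must be 32-byte aligned */

_Static_assert(sizeof(struct xdma_desc) == 32, "descriptor must be 32 bytes");
```

A block of such descriptors placed contiguously in host memory, kept within a 4K boundary, forms the adjacent-descriptor block described above.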
Descriptor Bypass
The descriptor fetch engine can be bypassed on a per-channel basis through AMD Vivado™ IDE
parameters. A channel with descriptor bypass enabled accepts descriptors from its respective
c2h_dsc_byp or h2c_dsc_byp bus. Before the channel accepts descriptors, the Control register Run
bit must be set. The NextDescriptorAddress, NextAdjacentCount, and Magic descriptor fields are not
used when descriptors are bypassed. The ie_descriptor_stopped bit in the Control register does
not prevent the user logic from writing additional descriptors. All descriptors written to the channel
are processed, barring the writing of new descriptors when the channel buffer is full.
When the XDMA is configured in descriptor bypass mode, there is an 8-deep descriptor FIFO that is
common to all descriptor channels from the user logic.
✎ Note: To enable descriptor bypass for any channel, you should write to register 0x3060. Refer to
cpm4-xdma-v2-1-registers.csv available in the register map files.
Poll Mode
Each engine is capable of writing back completed descriptor counts to host memory. This allows the
driver to poll host memory to determine when the DMA is complete instead of waiting for an interrupt.
For a given DMA engine, the completed descriptor count writeback occurs when the DMA completes
a transfer for a descriptor, and ie_descriptor_completed and Pollmode_wb_enable are set. The
completed descriptor count reported is the total number of completed descriptors since the DMA was
initiated.
Sts_err: The bitwise OR of any error status bits in the channel Status register.
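A driver-side poll loop over this writeback might look like the following sketch. The word layout (count in the low bits, Sts_err as the top bit) and the function name are assumptions for illustration, not taken from the register reference.

```c
#include <stdint.h>

/* Hypothetical poll-mode writeback layout: completed descriptor count in
 * the low bits, Sts_err flagged in the top bit (bit positions assumed). */
#define WB_STS_ERR   (1u << 31)
#define WB_COUNT(w)  ((w) & 0x00FFFFFFu)

/* Poll the writeback location in host memory until the engine reports that
 * at least `expected` descriptors completed, or an error is flagged.
 * Returns 0 on success, -1 on Sts_err. */
static int poll_dma_complete(volatile uint32_t *wb, uint32_t expected)
{
    for (;;) {
        uint32_t w = *wb;
        if (w & WB_STS_ERR)
            return -1;
        if (WB_COUNT(w) >= expected)
            return 0;
        /* a real driver would relax or sleep here between polls */
    }
}
```

This replaces the interrupt-driven completion path: no xdma0_usr_irq handshake is needed when polling.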
For host-to-card transfers, data is read from the host at the source address, but the destination
address in the descriptor is unused. Packets can span multiple descriptors. The termination of a
packet is indicated by the EOP control bit. A descriptor with an EOP bit asserts tlast on the AXI4-
Stream user interface on the last beat of data.
Data delivered to the AXI4-Stream interface will be packed for each descriptor. tkeep is all 1s except
for the last cycle of a data transfer of the descriptor if it is not a multiple of the datapath width. The
DMA does not pack data across multiple descriptors.
For card-to-host transfers, the data is received from the AXI4-Stream interface and written to the
destination address. Packets can span multiple descriptors. The C2H channel accepts data when it is
enabled, and has valid descriptors. As data is received, it fills descriptors in order. When a descriptor
is filled completely or closed due to an end of packet on the interface, the C2H channel writes back
information to the writeback address on the host with pre-defined WB Magic value 16'h52b4 (Table
2), and updated EOP and Length as appropriate. For valid data cycles on the C2H AXI4-Stream
interface, all data associated with a given packet must be contiguous.
✎ Note: C2H Channel Writeback information is different from Poll mode updates. C2H Channel
Writeback information provides the driver with the current length status of a particular descriptor.
This is different from Pollmode_*, as described in Poll Mode.
The tkeep bits must be all 1s except for the last data transfer of a packet. On the last transfer of a
packet, when tlast is asserted, you can specify a tkeep that is not all 1s to specify a data cycle that
is not the full datapath width. The asserted tkeep bits must be packed to the LSB, indicating
contiguous data. tlast asserted with tkeep all zeros is not a valid combination, and the DMA does
not function properly if it occurs.
The length of a C2H Stream descriptor (the size of the destination buffer) must always be a multiple of
64 bytes.
Offset Fields
0x04 Length[31:0]
Length Granularity
1. Each C2H descriptor must be sized as a multiple of 64 bytes. However, there are no
restrictions on the total number of bytes in the actual C2H transfer.
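The sizing rule above amounts to rounding the destination buffer length up to the next 64-byte multiple; a minimal sketch:

```c
#include <stdint.h>

/* Round a requested buffer size up to the next multiple of 64 bytes,
 * satisfying the C2H Stream descriptor sizing rule. */
static uint32_t c2h_desc_len(uint32_t transfer_bytes)
{
    return (transfer_bytes + 63u) & ~UINT32_C(63);
}
```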
Parity
Set the Propagate Parity option in the PCIe DMA Tab in the AMD Vivado™ IDE to check for parity.
Otherwise, no parity checking occurs.
When Propagate Parity is enabled, the XDMA propagates parity to the user AXI interface. You are
responsible for checking and generating parity in the AXI Interface. Parity is valid every clock cycle
when a data valid signal is asserted, and parity bits are valid only for valid data bytes. Parity is
calculated for every byte; total parity bits are DATA_WIDTH/8.
Parity information is sent and received on *_tuser ports in AXI4-Stream (AXI_ST) mode.
Parity information is sent and received on *_ruser and *_wuser ports in AXI4 Memory Mapped
(AXI-MM) mode.
Odd parity is used for parity checking. By default, parity checking is not enabled.
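The per-byte odd-parity computation can be sketched in C. The 512-bit datapath (64 parity bits) is an assumption for illustration.

```c
#include <stdint.h>

/* Odd-parity bit for one byte: returns 1 when the byte has an even
 * number of 1 bits, so data byte plus parity bit have odd weight. */
static uint8_t odd_parity(uint8_t b)
{
    b ^= b >> 4;
    b ^= b >> 2;
    b ^= b >> 1;
    return (uint8_t)((b & 1u) ^ 1u);
}

/* One parity bit per valid byte of a beat; with a 512-bit datapath the
 * result occupies DATA_WIDTH/8 = 64 bits. Parity bits are meaningful
 * only for valid data bytes. */
static uint64_t beat_parity(const uint8_t *data, unsigned valid_bytes)
{
    uint64_t p = 0;
    for (unsigned i = 0; i < valid_bytes && i < 64; i++)
        p |= (uint64_t)odd_parity(data[i]) << i;
    return p;
}
```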
Port Description
Global Signals
pcie0_user_clk (O): User clock out. PCIe-derived clock output for all interface signals output/input to
QDMA. Use this clock to drive inputs and gate outputs from QDMA.
dma0_axi_aresetn (O): User reset out. AXI reset signal synchronous with the clock provided on the
pcie0_user_clk output. This reset should drive all corresponding AXI Interconnect aresetn signals.
AXI Bridge Slave ports are connected from the AMD Versal device Network on Chip (NoC) to the
CPM DMA internally. For Slave Bridge AXI4 details, see Versal Adaptive SoC Programmable Network
on Chip and Integrated Memory Controller LogiCORE IP Product Guide (PG313).
To access XDMA registers, you must follow the protocols outlined in the AXI Slave Bridge Register
Limitations section.
Related Information
Slave Bridge Registers Limitations
AXI4 (MM) Master ports are connected from the CPM to the AMD Versal device Network on Chip
(NoC) internally. For details, see Versal Adaptive SoC Programmable Network on Chip and Integrated
Memory Controller LogiCORE IP Product Guide (PG313). The AXI4 Master interface can be
connected to DDR or to the PL user logic, depending on the NoC configuration.
The CIPS IP does not support the AXI4-Lite Master interface. Use the SmartConnect IP to connect
the NoC to the AXI4-Lite Master interface.
For details, see SmartConnect LogiCORE IP Product Guide (PG247).
dma0_m_axis_h2c_x_tdata[DATA_WIDTH-1:0] (O): Transmit data from the DMA to the user logic.
1. _x in the signal name changes based on the channel number 0, 1, 2, and 3. For example, for
channel 0 use the dma0_m_axis_h2c_tready_0 port, and for channel 1 use the
dma0_m_axis_h2c_tready_1 port.
dma0_s_axis_c2h_x_tdata[DATA_WIDTH-1:0] (I): Transmit data from the user logic to the DMA.
1. _x in the signal name changes based on the channel number 0, 1, 2, and 3. For example, for
channel 0 use the dma0_s_axis_c2h_tready_0 port, and for channel 1 use the
dma0_s_axis_c2h_tready_1 port.
Interrupt Interface
1. _x in the signal name changes based on the channel number 0, 1, 2, and 3. For example, for
channel 0 use the dma0_c2h_sts_0 port, and for channel 1 use the dma0_c2h_sts_1 port.
These ports are present if either Descriptor Bypass for Read (H2C) or Descriptor Bypass for Write
(C2H) is selected in the PCIe DMA Tab in the Vivado IDE. Each binary bit corresponds to a channel,
with the LSB corresponding to Channel 0. A value of 1 in a bit position means that descriptor bypass
is enabled for the corresponding channel.
1. _x in the signal name changes based on the channel number 0, 1, 2, and 3. For example, for
channel 0 use the dma0_h2c_dsc_byp_0_ctl[15:0] port, and for channel 1 use the
dma0_h2c_dsc_byp_1_ctl[15:0] port.
1. _x in the signal name changes based on the channel number 0, 1, 2, and 3. For example, for
channel 0 use the dma0_c2h_dsc_byp_0_ctl[15:0] port, and for channel 1 use the
dma0_c2h_dsc_byp_1_ctl[15:0] port.
The following timing diagram shows how to input the descriptor in descriptor bypass mode. When
dma0_<h2c|c2h>_dsc_byp_ready is asserted, a new descriptor can be pushed in with the
dma0_<h2c|c2h>_dsc_byp_load signal.
NoC Ports
✎ Note: NoC ports must always be connected to the NoC IP. Leaving the ports unconnected, or
connecting them to any other block, results in synthesis and implementation errors. See the figure
below for a connection reference.
Register Space
Configuration and status registers internal to the XDMA Subsystem and those in the user logic can be
accessed from the host through mapping the read or write request to a Base Address Register (BAR).
Based on the BAR hit, the request is routed to the appropriate location. For PCIe BAR assignments,
see Target Bridge.
All the registers are found in cpm4-xdma-v2-1-registers.csv available in the register map files.
To locate the register space information:
1. Download the register map files.
2. Extract the ZIP file contents into any write-accessible location.
3. Refer to the cpm4-xdma-v2-1-registers.csv file.
Transactions that hit the PCIe to AXI Bridge Master are routed to the AXI4 Memory Mapped user
interface.
Transactions that hit the PCIe to DMA space are routed to the XDMA Subsystem internal
configuration register bus. This bus supports 32 bits of address space and 32-bit read and write
requests.
XDMA registers can be accessed from the host or from the AXI Slave interface. These registers
should be used for programming the DMA and checking status.
The DMA register space can be accessed using the AXI Slave interface. When AXI Slave Bridge
mode is enabled (based on GUI settings), you can also access the bridge registers and the host
memory space.
Slave Bridge access to host memory space:
0xE001_0000 - 0xEFFF_FFFF
0x6_1100_0000 - 0x7_FFFF_FFFF
0x80_0000_0000 - 0xBF_FFFF_FFFF
The address range for slave bridge access is set during IP customization in the Address Editor tab of
the Vivado IDE.
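A range check over the three host-access windows listed above can be sketched as follows; the window bounds come from the guide, while the range actually used is set in the Address Editor.

```c
#include <stdint.h>

/* Return nonzero when addr falls inside one of the three slave bridge
 * host-access windows from the guide. */
static int in_slave_bridge_window(uint64_t addr)
{
    return (addr >= 0xE0010000ull   && addr <= 0xEFFFFFFFull)  ||
           (addr >= 0x611000000ull  && addr <= 0x7FFFFFFFFull) ||
           (addr >= 0x8000000000ull && addr <= 0xBFFFFFFFFFull);
}
```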
Bridge register addresses start at 0xE00. Addresses below 0xE00 are directed to the PCIe
configuration register space.
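A minimal sketch of this address split, assuming that offset 0xE00 itself belongs to the bridge registers:

```c
#include <stdint.h>

/* Route a register offset per the split above: offsets below 0xE00
 * reach PCIe configuration space; 0xE00 and above reach the bridge
 * registers. The boundary ownership is an assumption. */
static const char *xdma_bridge_space(uint32_t offset)
{
    return (offset < 0xE00u) ? "pcie-config" : "bridge";
}
```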
All the bridge registers are listed in the cpm4-bridge-v2-1-registers.csv available in the register map
files.
To locate the register space information:
Vivado Design Suite User Guide: Designing IP Subsystems using IP Integrator (UG994)
Vivado Design Suite User Guide: Designing with IP (UG896)
Vivado Design Suite User Guide: Getting Started (UG910)
Vivado Design Suite User Guide: Logic Simulation (UG900)
This lab describes the process of generating an AMD Versal™ device XDMA design with AXI4
interface connecting to DDR memory. This lab explains a step by step procedure to configure a
Device Drivers
The above figure shows the usage model of Linux XDMA software drivers. The XDMA example
design is implemented on an AMD adaptive SoC, which is connected to an X86 host through PCI
Express. In this mode, the XDMA driver in kernel space runs on Linux, whereas the test application
runs in user space.
The Linux device driver has the following character device interfaces:
XDMA0_control
Used to access XDMA Subsystem registers.
XDMA0_user
Used to access AXI-Lite master interface.
XDMA0_bypass
Used to access DMA Bypass interface.
XDMA0_events_*
Used to recognize user interrupts.
The XDMA drivers can be downloaded from the DMA IP Drivers page.
Interrupt Processing
Legacy Interrupts
There are four types of legacy interrupts: A, B, C, and D. You can select any of these interrupts in the
PCIe Misc tab under Legacy Interrupt Settings. You must program the corresponding values for both
the IRQ Block Channel Vector and the IRQ Block User Vector. The values for the legacy interrupts are
A = 0, B = 1, C = 2, and D = 3. The host recognizes interrupts only based on these values.
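The mapping above is small enough to capture directly; a sketch of the values to program into both vector registers:

```c
/* Values the guide specifies for the IRQ Block Channel Vector and the
 * IRQ Block User Vector for each legacy interrupt line. */
static int legacy_intx_vector(char line) /* 'A'..'D' */
{
    return (line >= 'A' && line <= 'D') ? (line - 'A') : -1;
}
```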
MSI Interrupts
For MSI interrupts, you can select from 1 to 32 vectors in the PCIe Misc tab under MSI Capabilities,
which consist of a maximum of 16 usable DMA interrupt vectors and a maximum of 16 usable user
interrupt vectors. The Linux operating system (OS) supports only one vector. Other operating systems
might support more vectors, and you can program different vector values in the IRQ Block Channel
Vector and in the IRQ Block User Vector to represent different interrupt sources. The AMD Linux
driver supports only one MSI vector.
MSI-X Interrupts
The DMA supports up to 32 different interrupt sources for MSI-X, which consist of a maximum of 16
usable DMA interrupt vectors and a maximum of 16 usable user interrupt vectors. The DMA has 32
MSI-X table entries, one for each source. For MSI-X channel interrupt processing, the driver should
use the engine's Interrupt Enable Mask for H2C and C2H to disable and enable interrupts.
The user logic must hold usr_irq_req active-High even after receiving usr_irq_ack (acks) to keep
the interrupt pending register asserted. This enables the Interrupt Service Routine (ISR) within the
driver to determine the source of the interrupt. Once the driver receives user interrupts, the driver or
software can clear the user interrupt source, to which the hardware should respond by deasserting
usr_irq_req.
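The handshake ordering described above can be modeled as a tiny state sketch. The struct and helper names are illustrative, not driver API; only the ordering comes from the guide.

```c
#include <stdbool.h>

/* Hedged model of the usr_irq_req/usr_irq_ack contract: the user logic
 * keeps usr_irq_req asserted after the ack so the ISR can identify the
 * source, and deasserts it only after software clears the source. */
struct usr_irq {
    bool source_pending; /* user-logic interrupt condition */
    bool irq_req;        /* usr_irq_req level seen by the DMA */
};

/* User logic raises an interrupt and holds usr_irq_req high. */
static void user_logic_raise(struct usr_irq *u)
{
    u->source_pending = true;
    u->irq_req = true;
}

/* Driver ISR: identify the source while req is still held, clear it;
 * the user logic then deasserts usr_irq_req. */
static void driver_isr(struct usr_irq *u)
{
    if (u->irq_req && u->source_pending) {
        u->source_pending = false;
        u->irq_req = false;
    }
}
```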
In the example H2C flow, loaddriver.sh loads devices for all available channels. The dma_to_device
user program transfers data from the host to the card.
The example H2C flow sequence is as follows:
In the example C2H flow, loaddriver.sh loads the devices for all available channels. The
dma_from_device user program transfers data from the card to the host.
The example C2H flow sequence is as follows:
Debugging
This appendix includes details about resources available on the AMD Support website and debugging
tools.
To help in the design and debug process when using the functional mode, the Support web page
contains key resources such as product documentation, release notes, answer records, information
about known issues, and links for obtaining further product support. The Community Forums are also
available where members can learn, participate, share, and ask questions about AMD Adaptive
Computing solutions.
Documentation
This product guide is the main document associated with the functional mode. This guide, along with
documentation related to all products that aid in the design process, can be found on the AMD
Adaptive Support web page or by using the AMD Adaptive Computing Documentation Navigator.
Download the Documentation Navigator from the Downloads page. For more information about this
tool and the features available, open the online help after installation.
Debug Guide
Answer Records
Answer Records include information about commonly encountered problems, helpful information on
how to resolve these problems, and any known issues with an AMD Adaptive Computing product.
Answer Records are created and maintained daily to ensure that users have access to the most
accurate information available.
Answer Records for this functional mode can be located by using the Search Support box on the main
AMD Adaptive Support web page. To maximize your search results, use keywords such as:
Product name
Tool message(s)
Summary of the issue encountered
A filter search is available after results are returned to further target the results.
AR 75396.
AMD Adaptive Computing provides technical support on the Community Forums for this AMD
LogiCORE™ IP product when used as described in the product documentation. AMD Adaptive
Computing cannot guarantee timing, functionality, or support if you do any of the following:
Implement the solution in devices that are not defined in the documentation.
Customize the solution beyond that allowed in the product documentation.
Change any section of the design labeled DO NOT MODIFY.
Hardware Debug
Hardware issues can range from link bring-up to problems seen after hours of testing. This section
provides debug steps for common issues. The AMD Vivado™ debug feature is a valuable resource to
use in hardware debug. The signal names mentioned in the following individual sections can be
probed using the debug feature for debugging the specific problems.
General Checks
Ensure that all the timing constraints for the core were properly incorporated from the example design
and that all constraints were met during implementation.
Does it work in post-place and route timing simulation? If problems are seen in hardware but not
in timing simulation, this could indicate a PCB issue. Ensure that all clock sources are active and
clean.
If using MMCMs in the design, ensure that all MMCMs have obtained lock by monitoring the
locked port.
If your outputs go to 0, check your licensing.
Registers
A complete list of registers and attributes for the XDMA Subsystem is available in the Versal Adaptive
SoC Register Reference (AM012). Reviewing the registers and attributes might be helpful for
advanced debugging.
✎ Note: The attributes are set during IP customization in the Vivado IP catalog. After core
customization, attributes are read-only.
Upgrading
This appendix is not applicable for the first release of the functional mode.
Link widths of x1, x2, and x4 require one bonded GT Quad and should not split lanes between
two GT Quads.
A link width of x8 requires two adjacent GT Quads that are bonded and are in the same SLR.
A link width of x16 requires four adjacent GT Quads that are bonded and are in the same SLR.
PL PCIE blocks should use GTs adjacent to the PCIe block where possible.
CPM has a fixed connectivity to GTs based on the CPM configuration.
For GTs on the left side of the device, it is suggested that PCIe lane 0 is placed in the bottom-most
GT of the bottom-most GT Quad. Subsequent lanes use the next available GTs moving vertically up
the device as the lane number increments. This means that the highest PCIe lane number uses the
top-most GT in the top-most GT Quad that is used for PCIe.
For GTs on the right side of the device, it is suggested that PCIe lane 0 is placed in the top-most GT
of the top-most GT Quad. Subsequent lanes use the next available GTs moving vertically down the
device as the lane number increments. This means that the highest PCIe lane number uses the
bottom-most GT in the bottom-most GT Quad that is used for PCIe.
✎ Note: For more information on GT Quad location, refer to the Device Diagram Overview section in
Versal Adaptive SoC Packaging and Pinouts Architecture Manual (AM013). Note that the device
diagram view might not be the same as the I/O device view in Vivado because of the device
packaging.
✎ Note: The implemented device view in Vivado shows lane 0 on the bottom-most GT of the bottom-
most Quad on the right side of the device, but lane re-ordering is handled in logic to place lane 0 on
the top-most GT of the top-most GT Quad. The GT Quad IP does not allow channel level control to
remap the GT pins.
The PCIe reference clock uses GTREFCLK0 in the PCIe lane 0 GT Quad for x1, x2, x4, and x8
configurations. For x16 configurations the PCIe reference clock should use GTREFCLK0 on a GT
Quad associated with lanes 8-11. This allows the clock to be forwarded to all 16 PCIe lanes.
✎ Note: The reference clock cannot be forwarded between the CPM4 GTs and GTs used by the PL.
CPM4 and PL IPs must have separate reference clocks.
The PCIe reset pins for CPM designs must connect to one of the specified pins for each of the two
PCIe controllers. The PCIe reset pin for PL PCIE and PHY IP designs can be connected to any
compatible PL pin location, or to the CPM PCIE reset pins when the corresponding CPM PCIE
controller is not in use. This is summarized in the following table:
Controller 0: PMC MIO 24 or PMC MIO 38
Controller 1: PMC MIO 25 or PMC MIO 39
For example, in a x2 design, by default the PIPE signals of PCIe MAC[1:0] connect to the PIPE
signals of GT QUAD[1:0]. When you apply lane_reversal {true}, the PIPE signals of PCIe MAC[1:0]
connect to the PIPE signals of GT QUAD[0:1]. When you apply lane_order {Top}, the PIPE signals of
PCIe MAC[1:0] connect to the PIPE signals of GT QUAD[3:2].
Following are the commands for using lane_reversal and lane_order:
lane_reversal
set_property -dict [list CONFIG.lane_reversal {true}] [get_bd_cells
<ip_name>]
lane_order
set_property -dict [list CONFIG.lane_order {Top}] [get_bd_cells <ip_name>]
CPM4 GT Selection
The CPM block within Versal devices has a fixed set of GTs that can be used for each of the two PCIE
controllers. These GTs are shared between the two PCIE controllers and the High Speed Debug Port
(HSDP); as such, x16 link widths are supported only when a single PCIE controller is in use and
HSDP is disabled. When two CPM PCIE controllers, or one PCIE controller and HSDP, are enabled,
each link is limited to a x8 link width. GT Quad allocation for CPM happens at GT Quad granularity
and must include all GT Quads from the one most adjacent to the CPM up to the topmost GT Quad in
use by the CPM. GT Quads that are used by the CPM, or that lie between GT Quads used by the
CPM (for either PCIe or HSDP), cannot be shared with PL resources even if GTs within the quad are
not in use.
When this option is enabled the PCIe reset for each disabled CPM4 PCIE controller is routed to the
PL. The same CPM4 pin selection limitations apply and the additional PCIe reset output pins are
The following table identifies the GT Quad(s) that can be used for each PCIE controller location. The
Quad shown in bold is the most adjacent GT Quad for each PCIe block location.
CPM PCIE Controller 0:
x16: GTY_QUAD_X0Y1 (most adjacent), GTY_QUAD_X0Y2, GTY_QUAD_X0Y3, GTY_QUAD_X0Y4
x8: GTY_QUAD_X0Y1, GTY_QUAD_X0Y2
x4 and narrower: GTY_QUAD_X0Y1
CPM5 has dedicated connectivity to a specific set of four GTYP quads that are adjacent to each
other and to CPM5. If unused by CPM5, certain quads might be available for use with the high-speed
debug port (HSDP), but no quad can be bypassed to the programmable logic. The remaining GTs in
the device are available for other use cases, provided the GTs of interest offer the necessary protocol
support for the desired use cases.
For the GTYP quads with dedicated connectivity to the CPM5, specific REFCLK inputs must be used
to provide a reference clock; the quads internally provide derived clocks to the CPM5. In the common
case of add-in-card designs, the reference clock is sourced from the edge connector. In other cases,
such as system-board designs, embedded designs, and cabled interconnect, a local oscillator is
typically required.
✎ Recommended: Although the CPM5 can support a variety of reference clock frequencies, AMD
recommends that designers selecting local oscillators use a 100 MHz reference clock as described in
the PCI Express Card Electromechanical Specification unless there is a compelling reason to use a
different supported frequency.
As part of the AMD Versal™ architecture integrated shell, specific reset inputs must be used to
provide a reset to the GTYP and the CPM5. In the common case of add-in-card designs, the reset is
sourced from the edge connector. In the case of a system-board or embedded design, the system is
responsible for generating reset signals and sourcing them to devices as required for the desired use
case. Where cabled interconnect is used, consult the cable specification for information about
whether and how it accommodates sideband signaling for reset.
The remainder of this appendix is divided into these sections:
The allowable GTYP quad placements are shown in the following table. Placements are determined
by the CIPS IP configuration GUI as part of the CPM configuration selections.
Table: Allowed GTYP Quad Placements (Quad 3 = Bank 105, Quad 2 = Bank 104, Quad 1 = Bank
103, Quad 0 = Bank 102)
x4 (Controller 1 only): Controller 1 [3:0]; REFCLK: Quad 104, REFCLK 0; lane reversal on PCB
x4 (Controller 0 only): Controller 0 [3:0]; REFCLK: Quad 102, REFCLK 0; lane reversal on PCB
1. A x1 link width uses lane 0 of the x4 configuration. A x2 link width uses lanes 1:0 of the x4
configuration. Lane reversal should be implemented on the PCB.
Board designs for x2 and x1 must use x4 guidance. For x2 board designs, based on the controller to
be used, connect to controller lane numbers Controller 0 [1:0] or Controller 1 [1:0]. Similarly, for x1
board designs, connect to controller lane numbers Controller 0 [0] or Controller 1 [0]. Note that
controller lane numbers might not be the same as physical GTYP channel numbers in a quad.
Consult the provided placement table.
RESET Placements
Allowed placements are shown in the table below. Placements are selected in CIPS IP configuration
GUI as part of PS PMC peripheral and I/O configuration selections.
Table: CPM5 PCIE Controller and Port Type RESET Pin Location Options
0: Endpoint, Switch Ports (Up/Down): PS MIO 18 (Default); PMC MIO 24; PMC MIO 38
1: Endpoint, Switch Ports (Up/Down): PS MIO 19 (Default); PS MIO 25; PS MIO 39
0: Root Port: PS MIO 0 (Default); PS MIO 0 – 25; PMC MIO 0 – 51
1. Both CPM5 PCIE Controller0 and Controller1 cannot be enabled with the AXI Bridge functional
mode and the Root Port port type at the same time; all other combinations are possible.
The GTYP lane and quad ordering above typically results in lane crossing for x4 and x2 endpoint
configurations. In this scenario, AMD recommends physically reversing the lanes in the PCB design
board traces. This typically results in a bow-tie in the PCB traces between the device and the PCIe
edge connector.
Allowed placements are shown in the table below. Placements are determined by CIPS IP
configuration GUI as part of CPM configuration selections.
Table: Allowed GTYP Quad Placements (Quad 3 = Bank 105, Quad 2 = Bank 104, Quad 1 = Bank
103, Quad 0 = Bank 102)
x16: Controller 0 [15:0]; REFCLK: Quad 104, Refclk 0; lane reversal by IP
x8, x8: Controller 1 [7:0] and Controller 0 [7:0]; REFCLKs: Quad 104, Refclk 0 and Quad 102,
Refclk 0; lane reversal by IP
x8 (Controller 1 only): Controller 1 [7:0]; REFCLK: Quad 104, Refclk 0; lane reversal by IP
x8 (Controller 0 only): Controller 0 [7:0]; REFCLK: Quad 102, Refclk 0; lane reversal by IP
x4, x4: Controller 1 [3:0] and Controller 0 [3:0]; REFCLKs: Quad 104, Refclk 0 and Quad 102,
Refclk 0; lane reversal on PCB
x4 (Controller 1 only): Controller 1 [3:0]; REFCLK: Quad 104, Refclk 0; lane reversal on PCB
x4 (Controller 0 only): Controller 0 [3:0]; REFCLK: Quad 102, Refclk 0; lane reversal on PCB
x4, x8: Controller 1 [3:0] and Controller 0 [7:0]; REFCLKs: Quad 104, Refclk 0 and Quad 102,
Refclk 0; x4 lane reversal on PCB, x8 lane reversal by IP
x8, x4: Controller 1 [7:0] and Controller 0 [3:0]; REFCLKs: Quad 104, Refclk 0 and Quad 102,
Refclk 0; x8 lane reversal by IP, x4 lane reversal on PCB
Board designs for x2 and x1 must use x4 guidance. For x2 board designs, based on the controller to
be used, connect to controller lane numbers Controller 0 [1:0] or Controller 1 [1:0]. Similarly, for x1
board designs, connect to controller lane numbers Controller 0 [0] or Controller 1 [0].
✎ Note: Controller lane numbers might not be the same as physical GTYP channel numbers in a
quad. Refer to the provided placement table.
RESET Placements
Allowed placements are shown in the following table. Placements are selected in CIPS IP
configuration GUI as part of PS PMC peripheral and I/O configuration selections.
Table: CPM5 PCIE Controller and Port Type RESET Pin Location Options
0: Endpoint, Switch Ports (Up/Down): PS MIO 18 (Default); PMC MIO 24; PMC MIO 38
1: Endpoint, Switch Ports (Up/Down): PS MIO 19 (Default); PS MIO 25; PS MIO 39
0: Root Port: PS MIO 0 (Default); PMC MIO 0 – 51
1: Root Port: PS MIO 1 (Default); PMC MIO 0 – 51
1. Root port mode can be enabled simultaneously on both controller 0 and controller 1 as long
as one of them is in non-DMA mode.
In many cases that arise naturally from the allowed GTYP quad placements and lane ordering, the
PCB designer might conclude that it is not feasible to meet length, loss, or other signaling
requirements while physically implementing lane reversal on the PCB.
This is likely with x16 and x8 link widths; therefore, use lane reversal by the IP rather than physically
implementing lane reversal on the PCB. With lane reversal by the IP, the CPM5 link width selection in
the CIPS IP configuration GUI must match the PCB-designed link width to ensure that lane reversal
by the IP functions.
For x4 or narrower link widths, the feasibility of physically implementing lane reversal on the PCB is
greater, therefore use this approach instead.
During migration, the lane ordering for each controller configured for x16 or x8 link widths reverses
within the GTYP quads accessible to each controller. For these designs, the lane reversal is
transparent under the assumption that lane reversal by the IP is used. REFCLK placements for x16 or
x8 link widths do not change.
For designs using x4 or narrower link widths, the lane ordering is unchanged during migration.
REFCLK placements also do not change.
Refer to the provided placement tables. For additional migration support, contact your AMD
representative.
RESET Considerations
Design migration requires IP update of the CPM5 as well as a re-implementation of the design to
generate a new programmable design image (PDI).
Overview
The AMD Versal™ device has an integrated debug subsystem that resides in the PMC. The integrated debug
subsystem includes the test access port (TAP) controller, the Arm® debug access port (DAP)
controller, and the debug packet controller (DPC). The DPC receives command packets, referred to
as debug and trace packets (DTP), from one or more debug host interfaces, then generates reply
packets and transmits them back to the debug host. The Versal device has four debug host interfaces
that are connected to the DPC for interaction with external debug hosts.
The focus of this appendix is employing the PCIe link as the communication channel with the DPC.
This takes two forms: management (mgmt) mode for HSDP-over-PCIe, and user mode for
HSDP-over-PCIe, which is a slower and more restrictive way to enable debug over a PCIe link. For
more information about Versal device integrated debug and other communication channels, refer to
Versal Adaptive SoC Technical Reference Manual (AM011).
There are three primary components that enable HSDP-over-PCIe debug:
The following figure shows the role of each component when performing debug over PCIe.
The HSDP-PCIe driver provides connectivity to the debug over PCIe enabled FPGA hardware
resource that is connected to the Host PC via PCIe link. It acts as a bridge between the user-space
hw_server application and the FPGA hardware. The driver, depending on the FPGA design and the
arguments specified to hw_server, can function in mgmt mode or user mode. The hardware design
requirements for HSDP-over-PCIe for both mgmt and user mode and their differences are specified in
a later section. You must supply the required parameters associated with the hardware design to the
driver via a configuration header file before compilation and module installation. The HSDP-PCIe
driver for Linux can be downloaded from GitHub.
The Vivado IDE is used to connect to the hw_server application to use the debug feature for remote
or local FPGA targets, including when using HSDP-over-PCIe. The Vivado IDE application can run on
a remote or local PC and connects over a TCP/IP socket to the host PC, which runs hw_server and
is connected to the hardware target via the PCIe link. The HSDP-over-PCIe driver acts as a conduit
for hw_server to send and receive debug and trace data to and from the target device and display it
in the Vivado IDE.
Traditionally, hardware debug through AMD Vivado is performed over a JTAG interface. For Versal
devices, the JTAG datapath to the DPC is hardened and abstracted away by the Vivado IDE. Making
debug seamless requires that connections are established among the debug target(s) and the DPC,
that the debug target clock(s) are active, and that reset(s) are deasserted. To enable the
HSDP-over-PCIe feature on a Versal device, several design requirements must be met. As mentioned
previously, there are two distinct methods to exercise the HSDP-over-PCIe feature: mgmt mode and
user mode. Each method has its own design requirements and supporting driver code.
User Mode
The user mode method for HSDP-over-PCIe imposes fewer requirements on the hardware design,
but it is also slower than management mode and does not allow for debug access to hardened blocks
like SYSMON, DDRMC, and IBERT. User mode must employ a PCIe BAR to access a fabric debug
hub from the host PC, which bypasses the DPC entirely and does not operate using DTPs. Instead,
the host PC uses memory mapped reads and writes to directly access the debug hub to issue and
collect debug data. User mode is identical on CPM4 and CPM5 capable devices and it is
recommended to use CPM in DMA mode to easily make use of the AXI Bridge Master type for the
PCIe BAR required to reach the debug hub. Debug transactions are then routed through the NoC to
the fabric, which opens the possibility to use NoC NMU remapping to ensure the PCIe BAR size
remains small and the device address map does not become fragmented.
Figure: Example Block Diagram for HSDP-over-PCIe User Mode Debug for CPM4/CPM5
Figure: Example Minimum Address Map for HSDP-over-PCIe User Mode Debug for
CPM4/CPM5
The management mode method for HSDP-over-PCIe imposes more design requirements on the
target device and requires more setup, but its throughput is faster and allows for debug of hardened
debug cores. Management mode for HSDP-over-PCIe uses the HSDP DMA block to transfer DTPs to
DPC and coordinates responses to the Host PC. The HSDP DMA block must be accessible for setup
from a PCIe BAR through CPM’s AXI Master Bridge at base address 0xFE5F0000. The physical
address is fixed and cannot be remapped in the FPGA address space, as the HSDP DMA is only
accessible from an interconnect switch between the CPM interconnect and the NoC. This means that
NoC NMU address remapping cannot be employed, and the PCIe BAR must be large enough to
reach the HSDP DMA or the master bridge itself must perform address translation.
Figure: CIPS IP PS PMC Configuration for Management Mode Debug Using DPC
Figure: Block Diagram for HSDP-over-PCIe Management Mode Debug for CPM4
Figure: Address Map for HSDP-over-PCIe Management Mode Debug for CPM4
Figure: AXI BARs for HSDP-over-PCIe Management Mode Debug for CPM4
To enable mgmt mode debug for a Versal device that has CPM5 within it, CPM must have at least one
AXI BAR enabled and the slave bridge must also be enabled for the DMA transfers to target. The
PMC master must have the debug slave mapped into its address space and the CPM master must
have the CPM registers mapped into its address space. The address translation schemes of CPM4
and CPM5 differ significantly. For CPM5, the concept of the BDF table was introduced, which allows for
significantly more granularity for address translation within each AXI BAR, even with fewer AXI BARs.
The BDF table registers are located in the CPM register space, which are only accessible through the
PMC interface by the Host PC from the master bridge.
Figure: Block Diagram for HSDP-over-PCIe Management Mode Debug for CPM5
Figure: Address Map for HSDP-over-PCIe Management Mode Debug for CPM5
Figure: AXI BARs for HSDP-over-PCIe Management Mode Debug for CPM5
A set of example designs, including the HSDP-over-PCIe example design, is hosted on GitHub in the
AMD CED Store repository and displayed through Vivado; the list can be refreshed with a valid
internet connection. You can also download or clone the GitHub repository to your local machine and
point to that location on your PC. To open the example design, perform the following steps:
1. Launch Vivado.
2. Navigate to the set of example designs for selection
From the Quick Start menu, select Open Example Project, or
Select File > Project > Open Example.
3. From the Select Project Template window, select Versal CPM PCIe Debug and navigate through
the menus to select a project location and board part.
4. In the Flow Navigator, click Generate Device Image to run synthesis, implementation, and
generate a programmable device image (.pdi) file that can be loaded to the target Versal device
and a probes (.ltx) file used to specify debug information.
✎ Note: You can download or clone the GitHub repository to a local machine from
https://github.com/Xilinx/XilinxCEDStore and set the following parameter so that local example
designs are displayed in the Select Project template window.
System Bring-Up
The first step is to program the FPGA and power on the system so that the PCIe link is detected by
the host system. If the card is powered by the Host PC, the host must be powered on to perform this
programming using JTAG and then restarted to allow the PCIe link to enumerate. After the system is
up and running, you can use the Linux lspci utility to list the details of the FPGA-based PCIe device.
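For example, the enumerated endpoint can be located by filtering on the AMD/Xilinx PCI vendor ID (0x10ee); the device ID shown depends on the CPM controller configuration:

```shell
# List PCI devices with the AMD/Xilinx vendor ID (0x10ee); the device ID
# depends on the CPM controller configuration, so filter by vendor only.
lspci -d 10ee:

# Verbose output additionally shows the BARs and the negotiated link
# speed and width for the endpoint.
lspci -vd 10ee:
```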
The provided HSDP-PCIe driver can be downloaded from GitHub. The driver should be compiled and
installed on the Host PC that is connected to the target FPGA over the PCIe link. Before compiling the
driver, you must specify the relevant parameters of the hardware design to the driver through a
configuration header file for management and/or user mode. Refer to the comments within the header
file for more information on cross-referencing the variable values with the hardware design parameters.
The values provided within the driver should already match the example design’s hardware
configuration, but you may need to selectively comment a specific section of the configuration file in
or out. A Makefile is provided with the driver to simplify compilation, installation, and removal.
✎ Note: You can download the driver from GitHub.
1. Navigate to the driver directory
$> cd <parent-path>/hsdp-pcie-driver
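The subsequent build and install steps can be sketched as follows; the `make` and `make install` target names are assumptions, so check the driver's Makefile or README for the targets it actually provides.

```shell
# Build the kernel module against the running kernel's headers
# ("make"/"make install" are assumed target names -- check the driver's
# Makefile for the targets it actually provides).
make
# Install and load the module (requires root).
sudo make install
# Check the kernel log for the driver's probe messages.
dmesg | tail
```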
After installing the HSDP-PCIe driver in the previous step, if module compilation and installation were
successful, character device file(s) for user mode and/or management mode are located at /dev/ on
the system, with the BDF (Bus:Device.Function) of the PCIe device appended to the name.
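As a quick check, you can look for the new nodes under /dev. The exact node prefix comes from the driver, so the sketch below matches on the BDF alone; the BDF value shown is illustrative.

```shell
# The BDF reported by lspci for the device (illustrative value).
BDF="01:00.0"
# The driver appends the BDF to the device node name, so matching on the
# BDF alone finds the node regardless of the driver's naming prefix.
ls /dev/ | grep "$BDF" || echo "no device node for $BDF yet"
```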
To launch hw_server and specify mgmt mode connection to the target FPGA, issue the following
command on the remote Host PC and replace <BB:DD.F> with the BDF of the PCIe device.
To launch hw_server and specify a user mode connection to the target FPGA, issue the following
command on the remote Host PC, replacing <BB:DD.F> with the BDF of the PCIe device and <name>
with the name specified in the configuration header file, if any.
Connecting the Vivado IDE to the hw_server Application for Debug Over PCIe
At this point, the FPGA design has been loaded, the PCIe link has been established, the HSDP-PCIe
driver has been compiled and installed with the correct configuration values, and the hw_server
application has been started on the debug Host PC. The remaining step is to connect to hw_server
and begin connecting to the debug cores to exchange and display debug data.
1. Launch Vivado.
2. Select Open Hardware Manager from the Flow Navigator.
3. In the Hardware Manager, select Open target > Open New Target.
4. Connect to the hw_server application from the Vivado IDE.
If the debug host is remote, in the Hardware Server Settings window, modify the host name
field to the remote server that is running hw_server and the port number field, if using the
non-default port.
If the debug host is local, in the Hardware Server Settings window, select the Local Server
option for the Connect to: field.
5. If successful, a hardware target is populated for selection; click through to Finish.
6. The target device appears in the Hardware window. After opening the hardware target, you can
specify a probes file in the Hardware Device Properties window, and the debug core data is
displayed.
✎ Note: If using mgmt mode for debug, the hard block debug cores are accessible for debug,
while only user debug cores are present when using user mode for debug.
7. If using mgmt mode for debug, you can connect to the debug host PC through the XSDB
application and issue direct AXI reads and writes through the PMC.
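For example, the standard XSDB `connect`, `targets`, and `mrd`/`mwr` commands can be used; the port, target index, and register address below are illustrative only.

```shell
# xsdb ships with the Vivado install. connect/targets/mrd/mwr are
# standard XSDB commands; the port, target index, and address below are
# illustrative only.
xsdb <<'EOF'
connect -url TCP:localhost:3121
targets                 ;# list targets visible through HSDP-over-PCIe
targets 1               ;# select a target (index depends on the design)
mrd 0xF1110000          ;# example AXI read through the PMC
EOF
```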
Migrating
For information about migrating QDMA/AXI Bridge 4.0/5.0 Soft IP to Versal CPM4 QDMA/AXI Bridge
Hard IP, see AR 33054.
For information about migrating QDMA/AXI Bridge 4.0/5.0 Soft IP to Versal CPM5 QDMA/AXI Bridge
Hard IP, see AR 33056.
Limitations
Speed Change Related Issue #1
Description
Repeated speed changes can result in the link not coming up to the intended targeted speed.
Workaround
A follow-on attempt should bring the link back up. In extremely rare scenarios, a full reboot might be
required.
Workaround
In the case of PM D3, AMD recommends using any valid EP address except ECAM space in the
pre-read before initiating the PM D3 sequence.
In all other cases, waiting approximately 20 ms after the link rate change and before attempting
any PCIe access can help.
However, in scenarios where the transaction still does not complete, a full reboot (power cycle
and image re-programming) is required.
Workaround
An additional write of 1 to the Perform Equalization bit in the Link Control 3 register of the Root
Complex PCIe configuration space is required when the rate change to Gen3, Gen4, or Gen5 speeds
is performed from Gen1/Gen2.
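With a Linux root complex, this write can be issued with setpci. Link Control 3 sits at offset 0x04 within the Secondary PCIe Extended Capability (capability ID 0x19), and Perform Equalization is bit 0; the BDF and capability offset below are illustrative and must be read from your own Root Port with lspci.

```shell
# Illustrative Root Port BDF; find the real one with lspci.
BDF="00:01.0"
# Locate the Secondary PCI Express Extended Capability (ID 0x19) offset.
sudo lspci -s "$BDF" -vv | grep -i "secondary"
# If the capability were at offset 0x410, Link Control 3 would be at
# 0x414; write 1 to bit 0 (Perform Equalization) only, via a masked write.
sudo setpci -s "$BDF" 414.L=1:1
```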
1. Assign a separate BAR to access the QDMA queue space registers and set the steering to route
it to the NoC. You can then loop back AXI Master transfers onto the AXI slave interface. Set the
BAR size to 256K.
2. Do not make any BAR a DMA BAR; instead, make it a separate AXI BAR that maps the QDMA
base registers, and set the steering to route it to the NoC. Making it a DMA BAR terminates the
access internally, and ordering is not maintained. To work around this issue, the access must go
to the AXI Master interface and then be looped back to the AXI slave interface.
3. The address offsets of the queue space registers are listed in the AXI Slave register space
section. For controller 0, the DMA registers are at 0x6_1000_0000. For controller 1, they are at
0x7_1000_0000.
Bridge mode
SR-IOV is not supported in Bridge mode.
MPS Limitation
Description
Only an MPS size of up to 512 bytes is supported in DMA and Bridge modes.
Workaround
If strict ordering is required, users should wait for the appropriate AXI response before issuing
the dependent transaction.
1. Enabling ASPM L0s/ASPM L1 can result in correctable errors being reported on the link by
both link partners, such as replay timer timeout, replay timer rollover, and receiver error.
2. A PCIe Endpoint device might also log errors when a Configuration PM D3 transition request
arrives during non-quiesced traffic.
3. A PCIe Root Port device does not support ASPM L1 or L0s.
Workaround
1. For XDMA and AXI4 Bridge modes, the internal MSI-X capability is used; therefore, no
workaround is available. The choice to enable either the MSI-X or MSI capability must be made
when configuring the CPM5 IP.
2. This limitation does not apply to QDMA mode because MSI interrupts are not supported.
Workaround
For dual-controller use cases, where two PCIe controllers of CPM5 are needed as Root Ports,
at least one of them must be used in a non-DMA mode. For example, the CPM5 PCIe Controller0
can be enabled in the AXI Bridge functional mode and Root Port type; the Processor Subsystem (PS)
via the NoC can enable the datapath for it. The CPM5 PCIe Controller1 cannot be enabled
in the same configuration as CPM PCIe Controller0. However, the AXI Bridge functional mode
and Root Port type with CPM PCIe Controller1 can be realized by configuring it in PCIE mode as
Root Port type and implementing the AXI Bridge functionality in PL logic. For the latter, the
existing soft QDMA IP in AXI Bridge mode can be used.
Documentation Navigator
Documentation Navigator (DocNav) is an installed tool that provides access to AMD Adaptive
Computing documents, videos, and support resources, which you can filter and search to find
information. To open DocNav:
From the AMD Vivado™ IDE, select Help > Documentation and Tutorials.
On Windows, click the Start button and select Xilinx Design Tools > DocNav.
At the Linux command prompt, enter docnav.
✎ Note: For more information on DocNav, refer to the Documentation Navigator User Guide
(UG968).
Design Hubs
AMD Design Hubs provide links to documentation organized by design tasks and other topics, which
you can use to learn key concepts and address frequently asked questions. To access the Design
Hubs:
Support Resources
For support resources such as Answers, Documentation, Downloads, and Forums, see Support.
References
These documents provide supplemental material useful with this guide:
Revision History
The following table shows the revision history for this document.
Limitations: Updated.
CPM4_QDMA_Gen4x8_MM_ST_Performance_Design: Added new section.
CPM5_QDMA_Dual_Gen5x8_ST_Performance_Design: Added new section.
QDMA Features and XDMA Features: Added clarifying details regarding AXI4-Stream
interface data rate support.
Copyright
© Copyright 2020-2024 Advanced Micro Devices, Inc. AMD, the AMD Arrow logo, UltraScale,
UltraScale+, Versal, Vivado, Zynq, and combinations thereof are trademarks of Advanced Micro
Devices, Inc. PCI, PCIe, and PCI Express are trademarks of PCI-SIG and used under license. AMBA,
AMBA Designer, Arm, ARM1176JZ-S, CoreSight, Cortex, PrimeCell, Mali, and MPCore are
trademarks of Arm Limited in the US and/or elsewhere. Other product names used in this publication
are for identification purposes only and may be trademarks of their respective companies.