Multi-Root I/O Virtualization and Sharing Specification
Revision 0.7
mr-iov-07-2007-06-08
June 8, 2007
PCISIG Confidential
PCI-SIG disclaims all warranties and liability for the use of this document and the information
contained herein and assumes no responsibility for any errors that may appear in this document, nor
does PCI-SIG make a commitment to update the information contained herein.
Contact the PCI-SIG office to obtain the latest revision of the specification.
Questions regarding this document or membership in PCI-SIG may be forwarded to:
Membership Services
http://www.pcisig.com
E-mail: [email protected]
Phone: 503-291-2569
Fax: 503-297-1090
Technical Support
Technical support for this specification is available to members. For information, please
visit: http://www.pcisig.com/developers/technical_support.
DISCLAIMER
This document is provided “as is” with no warranties whatsoever, including any warranty of
merchantability, non-infringement, fitness for any particular purpose, or any warranty otherwise
arising out of any proposal, specification, or sample. PCI-SIG disclaims all liability for infringement
of proprietary rights, relating to use of information in this specification. No license, express or
implied, by estoppel or otherwise, to any intellectual property rights is granted herein.
Contents
1. ARCHITECTURAL OVERVIEW
  1.1. HOW DOES MR-IOV WORK?
    1.1.1. MRA Components
      1.1.1.1. Multi-Root Aware Root Port (MRA RP)
      1.1.1.2. Multi-Root Aware PCIe Device (MRA PCIe Device)
      1.1.1.3. Multi-Root PCI Manager (MR-PCIM)
      1.1.1.4. Multi-Root Aware PCIe Switch (MRA PCIe Switch)
    1.1.2. MR Initialization Overview
    1.1.3. MR Transaction Encapsulation Overview
    1.1.4. MR Congestion Management Overview
    1.1.5. MR Error and Event Handling Overview
    1.1.6. MR-IOV and ARI (Alternative Routing Identifier)
    1.1.7. MR-IOV Relationship to SR-IOV and ATS
  1.2. OVERVIEW OF MR TRANSACTION LAYER
2. MR PROTOCOL CHANGES
  2.1. MR LINK PROTOCOL NEGOTIATION
    2.1.1. MR Link Protocol Negotiation
    2.1.2. MR Flow Control Initialization Protocol
      2.1.2.1. MR Flow Control DLLP Encoding
      2.1.2.2. MR Flow Control Initialization State Machine Rules
  2.2. TLP PREFIX TAGGING
    2.2.1. MR Switch Transaction Layer Processing
    2.2.2. MR Device Transaction Layer Processing
      2.2.2.1. Receiving TLPs
      2.2.2.2. Transmitting TLPs
    2.2.3. Global Key Processing
    2.2.4. MR TLP Dataflow Examples
  2.3. PER-VH RESET
    2.3.1. Per-VH Reset Example
    2.3.2. RESET DLLP Format
    2.3.3. RESET DLLP Processing
      2.3.3.1. Upstream State Machine
      2.3.3.2. Downstream State Machine
      2.3.3.3. Reset DLLP Reliability
      2.3.3.4. Flow Control and Reset / DL_DOWN
  2.4. MR FLOW CONTROL
    2.4.1. FC Information Tracked by Transmitter
    2.4.2. Information Tracked by Receiver
  2.5. MR MESSAGE PROCESSING
    2.5.1. Interrupts
      2.5.1.1. INTx Device Processing
      2.5.1.2. INTx Switch Processing
Figures
Figure 1-1: Generic Server Blade Configuration
Figure 1-2: Example Multi-Root Topology
Figure 1-3: Example Multi-Root Topology as viewed from Host A
Figure 1-4: Example Multi-Root Topology as viewed from Host C
Figure 2-1: MRInit DLLP Format
Figure 2-2: MR to MR Initialization Sequence
Tables
Table 2-1: MRInit DLLP Fields
Table 2-2: MR Flow Control DLLP Fields
Table 2-3: Reset DLLP Example: Events and Actions
Table 3-1: Port Types – Example MR Topology
Table 3-2: Example MR Topology VH and VF Mapping Policy
Table 3-3: Example Topology: Switch A VS Bridge Table Contents
Table 3-4: Example Topology: Switch B VS Bridge Table Contents
Table 3-5: Valid MR State Transitions for VF Migration
Table 3-6: Example MR Root Topology: RP Associations
Table 4-1: MR-IOV Fields
Table 4-2: Device MR-IOV Extended Capability Header
Table 4-3: MR-IOV Capabilities
Table 4-4: Device MR-IOV Control
Table 4-5: Device MR-IOV Status
Table 4-6: Device MR-IOV VH Counts
Table 4-7: Device Function Table Offset
Table 4-8: VF MVF Region
Table 4-9: LVF Table Offset
Table 4-10: Device VL Arbitration Capability and Status
Table 4-11: Device VL Arbitration Control
Table 4-12: Device VL Arbitration Table Offset
Table 4-13: Device MR Error Status
Table 4-14: Device MR Error Control
Table 4-15: LVF Table Entry
Table 4-16: Function Capability 1 (00h)
Table 4-17: Function Capability 2 (04h)
Table 4-18: Function Control 1 (08h)
Table 4-19: Function Control 2 (0Ch)
Table 4-20: Function Control 3 (10h)
Table 4-21: Function Status
Table 4-22: Function Table VC to VL Map 1 (VC Capability)
Table 4-23: Function Table VC to VL Map 2 (VC Capability)
Table 4-24: Function Table VC Resource State
Table 4-25: VH Table MFVC Resource State
Document Organization
Chapter 1 provides an architectural overview of Multi-Root I/O Virtualization and Sharing.
Chapter 2 specifies the MR protocol changes.
Documentation Conventions
Capitalization
Some terms are capitalized to distinguish their definition in the context of this document from their
common English meaning. Words not capitalized have their common English meaning. When terms
such as “memory write” or “memory read” appear completely in lower case, they include all
transactions of that type.
Register names and the names of fields and bits in registers and headers are presented with the first
letter capitalized and the remainder in lower case.
Reference Information
Reference information is provided in various places to assist the reader and does not represent a
requirement of this document. Such references are indicated by the abbreviation “(ref).” For
example, in some places, a clock that is specified to have a minimum period of 400 ps also includes
the reference information maximum clock frequency of “2.5 GHz (ref).” Requirements of other
specifications also appear in various places throughout this document and are marked as reference
information. Every effort has been made to guarantee that this information accurately reflects the
referenced document; however, in case of a discrepancy, the original document takes precedence.
Implementation Notes
Implementation Notes should not be considered to be part of this specification. They are included
for clarification and illustration only. Implementation Notes within this document are enclosed in a
box and set apart from other text.
1. Architectural Overview
Within the industry, significant effort has been expended to increase the effective hardware resource
utilization through the use of virtualization technology. The Multi-Root I/O Virtualization (MR-
IOV) specification defines the extensions to the PCI Express (PCIe) specification suite to enable
multiple non-coherent Root Complexes (RCs) to share PCI hardware resources.
To illustrate how this technology can be used to increase effective resource utilization, consider the
following generic server blade configuration as illustrated in Figure 1-1 below.
The server blade configuration contains four server blades and two external fabric switches.
In a high availability configuration, nominally there would be two external fabric switches of
each type to avoid any single point of failure (SPOF) for a total of four switches.
In this example, each switch provides two external connectivity ports, though more can be
configured to deliver high availability solutions as well as increased aggregate performance.
Each server blade is provisioned with two PCIe Endpoint Devices – an Ethernet and a storage
area network (SAN) device. This translates to a total of eight PCIe Endpoint Devices. These
PCIe Devices are point-to-point connected to a Root Port (RP) [not shown], provided by either
a chipset or a processor.
Each I/O device and switch port is typically provisioned to enable any I/O device to operate
at full bandwidth.
Depending upon workload, the example configuration’s I/O resource capacity may be excessive,
resulting in under-utilized hardware.
[Figure 1-1: Generic Server Blade Configuration – a blade enclosure with four server blades (each containing a PCIe Ethernet Device and a PCIe Storage Device), an Ethernet switch, and a Storage Area Network switch providing external connectivity.]
PCISIG Confidential 17
Multi-Root I/O Virtualization and Sharing, Rev. 0.7
mr-iov-07-2007-06-08
Through the application of MR-IOV technology, the prior example server blade configuration can
be transformed as illustrated in Figure 1-2.
• Unlike a PCIe Switch, which contains a single upstream Port and can only be claimed by
a single RP, an MRA PCIe Switch contains multiple upstream Ports, enabling it to
connect to multiple RPs. This enables the MRA PCIe Switch to be a shared component
within the configuration.
• Multiple MRA PCIe Switches can be interconnected in a variety of topologies to create
high availability solutions as well as provide increased I/O fan-out capacity.
In place of eight PCIe Endpoint Devices – four of each type – the example MR-IOV
configuration contains four MRA PCIe Endpoint Devices – two of each type. Each MRA
PCIe Endpoint Device is attached to an MRA PCIe Switch downstream Port, enabling each to
be accessed, and thus shared, by any of the server blades.
Unlike the prior example configuration where I/O is dedicated to each server blade, an MR-
IOV based configuration enables the I/O to be dynamically assigned. A fraction of an I/O
Device, or an entire I/O Device, can be assigned to each server blade based on its workload
requirements.
As noted above, an MR-IOV configuration reduces the component count and changes the
component composition. This specification covers the elements involved in delivering an MR-IOV
configuration.
[Figure: Generic virtualized platform – multiple System Images (SIs) above a Virtualization Intermediary, with a processor, memory, a PCIe Switch, and PCIe Devices containing Address Translation Caches (ATCs).]
A Translation Agent (TA) and Address Translation and Protection Table (ATPT).
• A TA parses the contents of a PCIe DMA request transaction (TLP) to index an ATPT
to derive the physical address translation and access rights. The purposes for having
DMA address translation vary and include:
♦ Limiting the destructiveness of a ‘broken’ or mis-programmed DMA I/O Function
♦ Providing for scatter/gather
♦ Ability to redirect message-signaled interrupts (e.g., MSI or MSI-X) to different
address ranges without requiring coordination with the underlying I/O Function
♦ Address space conversion (32-bit I/O Function to larger system address space)
♦ Virtualization support
• A PCIe Endpoint may contain an Address Translation Cache (ATC) in support of the
PCI-SIG Address Translation Services Specification.
A PCIe Root Complex (RC) containing one or more Root Ports (RP) with direct-attached or
PCIe Switch-attached PCIe Devices or PCI / PCI-X Bridges. Each RP defines a unique
hierarchy domain (see PCI Express Base Specification).
Now examine a platform that supports SR-IOV technology as illustrated in Figure 1-4 below.
[Figure: Example platform supporting SR-IOV technology.]
In order to support either example platform while preserving these semantics, the PCI components
underneath each RP must be virtualized and logically overlaid on the MRA PCIe Switches and
Devices as illustrated in Figure 1-5. The virtualized PCI components are referred to as a Virtual
Hierarchy (VH). A VH has the following attributes:
Each VH must contain at least one PCIe Switch.
• The PCIe Switch will be a virtualized component implemented over an MRA Switch.
• The PCIe Switch functionality and semantics are per the PCI Express Base Specification.
Each VH may contain any mix of PCIe Devices, MRA PCIe Devices, or PCIe to PCI / PCI-X
Bridges as illustrated in Figure 1-6.
• A PCIe Device is a device that does not support the MR-IOV Capability. Such a device
must only be visible in a single VH at a time. A PCIe Device can be serially shared
among a set of accessible VHs within the MR-IOV topology. The PCIe Device is serially
deleted from the current source VH and added to the destination VH.
• A PCIe Device may support the SR-IOV Capability which enables it to be shared by
multiple SI executing above a single RP.
• A PCIe to PCI / PCI-X Bridge can only be visible in a single VH at a time. As with a
PCIe Device, a PCIe to PCI / PCI-X Bridge can be serially shared among a set of
accessible VHs within the MR-IOV topology using a conceptually similar deletion /
addition process as a PCIe Device.
♦ The SR-IOV Capability does not apply to the PCIe to PCI / PCI-X Bridge. The
bridge and all associated PCI / PCI-X devices can only be configured in a single OS,
VI, or SI at a time.
• An MRA PCIe Device is a device which supports the MR-IOV Capability. Such a device
can be visible in multiple VHs at a time depending upon the MR-IOV resources
provisioned. An MRA PCIe Device can be added to or deleted from any accessible VH
within an MR-IOV topology.
• An MRA PCIe Device must support the SR-IOV Capability. This enables it to be shared
by multiple SI executing above each RP.
The MR-IOV topology must contain at least one MRA PCIe Switch.
[Figure 1-5: Two Virtual Hierarchies (VH) Implemented over Shared Physical Components – Hierarchy A and Hierarchy B, each containing a virtual PCIe Switch, overlaid on a single physical MRA Switch.]
[Figure 1-6: Example VH component mix – four Root Complexes (RCs) attached to MRA Switches, with a PCIe Device and a PCI / PCI-X Device below.]
As illustrated in Figure 1-6, a PCIe RP supporting either a single OS or a VI with multiple SI can be
connected to an MRA Switch and access multiple downstream devices and bridges. The PCIe RP,
though, is restricted to a single VH. In order to enable multiple VHs to be accessed, an MRA PCIe
RP is required. An MRA PCIe RP differs from a PCIe RP in the following ways:
An MRA PCIe RP maintains state to delineate each VH. At a high level, this amounts to a set
of resource mapping tables to translate the I/O function associated with each SI into a VH
and MR I/O function identifier (a hypothetical sketch of such a table follows this list).
An MRA PCIe RP participates in the MR transaction encapsulation protocol (see subsequent
section for details) to enable an MRA PCIe Switch to derive the VH and associated routing
information.
An MRA PCIe RP emits an MRA Link. An MRA Link is identical at the physical layer to a PCIe
Link as defined in the PCI Express Base Specification. An MRA Link differs at the data link layer,
where a new set of DLLPs is defined to support the MR-IOV protocol.
An MRA PCIe RP may implement MRA congestion management (see subsequent chapter for
details).
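
Implementation Note
The following C sketch illustrates one hypothetical shape for the per-SI mapping state an MRA PCIe RP maintains. The structure, field, and function names are invented for illustration and are not defined by this specification.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical per-SI mapping entry: translates the I/O function seen
     * by an SI into a VH and MR I/O function identifier. */
    typedef struct {
        uint16_t si_requester_id;  /* I/O function as seen by the SI       */
        uint8_t  vh;               /* Virtual Hierarchy assigned to the SI */
        uint16_t mr_function_id;   /* MR I/O function identifier           */
    } mra_rp_map_entry;

    /* Look up the (VH, MR function) pair used when encapsulating a
     * transaction onto the MRA Link. */
    static const mra_rp_map_entry *
    mra_rp_lookup(const mra_rp_map_entry *tbl, size_t n, uint16_t si_rid)
    {
        for (size_t i = 0; i < n; i++)
            if (tbl[i].si_requester_id == si_rid)
                return &tbl[i];
        return NULL;  /* no mapping: the function is not part of any VH */
    }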
An MR-IOV platform may contain any mix of PCIe Devices, SR-IOV PCIe Devices, or MR-IOV
PCIe Devices. Figure 1-8 illustrates a functional block comparison between these three types of
devices.
Figure 1-8: PCIe Device, SR-IOV, and MRA PCIe Device Functional Block Comparison
An MRA PCIe Device differs from a PCIe Device and an SR-IOV PCIe Device in the following
ways:
The MRA PCIe Device must support the new MR DLLP protocol.
• A PCIe Device or an SR-IOV PCIe Device does not support the MR-IOV capability and
is therefore unable to participate in this protocol. An MRA PCIe Switch must subsume
all responsibility for forwarding transactions and event handling on behalf of these
devices through the MR-IOV topology. The MRA PCIe Switch will perform all
encapsulation or de-encapsulation as appropriate.
The MRA PCIe Device must support the MR-IOV transaction encapsulation protocol.
• Each PF supports a full PCI Configuration and PCIe Extended Configuration Space.
• Each PF supports a full BAR.
• Each PF must only be assigned to a single accessible VH at a given time.
• One or more PF may be assigned to any accessible VH.
• A PF and its associated resources may be migrated from one VH to another.
• Each PF may support zero or more Virtual Functions (VF).
♦ VFs share resources, including some portions of the configuration space, with the
associated PF.
♦ A VF exists only within the VH associated with the PF.
♦ An MRA PCIe Device with multiple PFs and zero VFs per PF is conceptually
equivalent to a single-function PCIe Device configured per VH.
• The number of VFs provisioned per PF may vary on a per-PF basis.
• Each PF represents a single device-specific functionality, e.g., an Ethernet controller, a
SATA controller, etc. Consequently, each VF must represent the same device-specific
functionality. This enables the existing device driver models to be supported.
Each MRA component must support a corresponding MR-IOV capability. This capability is
accessed and configured by the Multi-Root PCI Manager (MR-PCIM). MR-PCIM can be
implemented anywhere within the MR-IOV topology – for example, above an RP as illustrated in
Figure 1-9, or through a private interface provided by an MRA PCIe Switch.
MR-PCIM's responsibilities include:
Enumeration of the physical components within the MR-IOV topology. MR-PCIM must
determine which components are or are not MR-IOV capable, how components are
interconnected, and what PCIe and MR-IOV resources they provide.
MR-PCIM configures the components and resources that comprise each VH. The policies to
determine this are outside the scope of this specification.
Given the physical hardware is shared among a set of VH, MR-PCIM configures all PCIe and
MR-IOV attributes including: link signaling rate, VC arbitration, Port arbitration, Access
Control Services (ACS), etc.
Given the physical hardware is shared among a set of VH, MR-PCIM processes or controls
various events, e.g. RESET, physical hardware failure, surprise add / remove, error handling,
etc.
<continue to build up list of responsibilities>
[Figure: MR-PCIM implemented above a Root Port in an example MR topology of four Root Complexes and two MRA Switches.]
In a prior section it was noted that an MRA Switch is conceptually the overlay of multiple PCIe
Switches onto a single physical package. This is illustrated in more detail in Figure 1-9.
Figure 1-9: PCIe Switch and MRA PCIe Switch Functional Block Comparison
The PCIe Switch is composed of a set of logical P2P bridges with a single upstream Port attached to
a PCIe RP and one or more downstream Ports attached to either a PCIe Device or a PCIe to PCI /
PCI-X Bridge. A PCIe Switch also operates using a single address space.
In contrast to a PCIe Switch, an MRA Switch is as follows:
An MRA Switch is composed of one or more upstream Ports, each attached to either a PCIe RP,
an MRA PCIe RP, or the downstream Port of an MRA Switch.
• If the upstream Port is attached to a PCIe RP, the MRA Switch must transparently
provide all MRA-related services on behalf of the PCIe RP.
An MRA Switch is composed of one or more downstream Ports attached to PCIe Devices,
MRA PCIe Devices, PCIe Switch upstream Ports, MRA Switch upstream Ports, or PCIe to
PCI / PCI-X Bridges.
A set of logical P2P bridges constitutes each VH. An MRA Switch must support two or
more VHs.
Each VH represents a separate address space. The combination of a VH identifier and the
address contained within the PCIe TLP enables the MRA Switch to forward a TLP to the
appropriate egress Port, and enables an MRA RP or MRA PCIe Device to delineate which PF
or VF is the source or sink of the PCIe TLP.
<fill in additional attributes / operational semantics to flesh out this section>
• An SR-IOV PCIe Device must support the PCI Express Base Specification.
• By requiring these specifications to be supported, the number of permutations is reduced,
further enhancing the ability to deploy and interoperate across a wide range of solution
options.
An MRA PCIe Device may support the Address Translation Services Specification.
1.2. Overview of MR Transaction Layer

[Figure 1-2: Example Multi-Root Topology – Hosts A, B, and C with SIs 1-5 (Hosts A and C running a VI), connected through MRA Switches by PCIe Links and MR Enabled Links forming an MR Fabric.]
In this example, all Root Ports use the PCIe protocol. Each Root Port is the root of a PCIe Hierarchy.
There are three VHs (A, B, C), each associated with one of the Root Ports. Hosts A and C are
running a Virtual Intermediary that supports SR sharing. B is running an Operating System directly
(no Virtual Intermediary is involved).
In this example, there are four Devices, one of each variety.
Device W is a Single-Root IOV Device. It is assigned to VH A. Two System Images (SIs) are
in use on Host A. The Virtual Intermediary on Host A has further assigned VF 0,1 to SI 1 and
VF 0,2 to SI 2.
Device X is using both Single Root and Multi-Root sharing. PF 1:0 is assigned to VH B.
PF 0:1 is assigned to VH C. The Virtual Intermediary running on Host C has further assigned
VF 0:1,1 to SI 4 and VF 0:1,2 to SI 5. The MR features of the device are managed through the
Base Function which is assigned to VH C.
Device Y is using only Multi-Root sharing. F 1:0 is assigned to VH A and F 0:0 is assigned to
VH C. In VH A, the Virtual Intermediary has further assigned F 1:0 to SI 2. In VH C, the
Virtual Intermediary has further assigned F 0:0 to SI 4. The MR features of the device are
managed through the Base Function which is assigned to VH C.
Device Z is a 3 Function PCIe Device. It is assigned to VH C. Virtual Intermediary software
on VH C has further assigned F 0 and F 1 to SI 4 and F 2 to SI 5.
All Switches shown in this example are Multi-Root Aware (MRA). Non-MRA Switches are also
possible; however, such Switches and all components below them will be associated with a single
Root Port in a single VH. Note that this non-MR sub-tree can be a mixture of SR aware and non-SR
aware PCIe components.
Multi-Root Aware Components enforce separation between VHs. Software running in one VH is
not allowed to affect other VHs. For example, every VH has a complete and independent address
space.
Figure 1-3 shows the same example as Figure 1-2 but only shows the components visible to Host A.
The MRA Switches and Devices appear to software as Single Root equivalents.
Similarly, Figure 1-4 also shows the example from Figure 1-2 but shows only components visible to
Host C. Note that the link between Switch 0 and Switch 1 changes direction between these views of
the topology. In Multi-Root systems, the logical upstream / downstream direction of a link is a per-
VH concept and is distinct from physical link direction that was established during link bring up.
[Figure 1-3: Example Multi-Root Topology as viewed from Host A – Host A (SI 1, SI 2, VI), SR Device W, and MR Device Y.]
[Figure 1-4: Example Multi-Root Topology as viewed from Host C – Host C (SI 4, SI 5, VI) and the components visible in VH C.]
2. MR Protocol Changes
There are five parts of the PCIe Protocol that are changed to support Multi-Root operation.
Negotiating use of the MR link protocol (Section 2.1)
TLP Prefix tagging (Section 2.2)
Per-VH Reset (Section 2.3)
MR Flow Control (Section 2.4)
MR Message processing (Section 2.5)

[Figure 2-1: MRInit DLLP Format]
Table 2-1: MRInit DLLP Fields

Byte 0, Bits 7:0 – Type: The value 0000 0001b indicates an MRInit DLLP.
Byte 1, Bit 7 – Phase: Indicates the phase of the MR Negotiation protocol.
Byte 1, Bit 6 – VH FC: If Set, indicates that the sender supports per-VH, VL flow control. If Clear, indicates that the sender only supports per-VL flow control. Must be Set for Switches.
Byte 1, Bits 5:4 – Reserved.
Byte 1, Bits 3:0 – Device / Port Type: Device / Port Type of the sender. Encoding is identical to the Device / Port Type field in the PCI Express Capability (Offset 02, Bits 7:4).
Byte 2, Bit 7 – Authorized: If Device / Port Type indicates a Switch, indicates that the sender is an Authorized port on an MR Capable Switch. Must be 0b if Device / Port Type does not indicate a Switch.
Byte 2, Bits 6:4 – Protocol Version: Must be 001b for this version of the specification.
Byte 2, Bit 3 – Reserved: Transmit 0b.
Byte 2, Bits 2:0 – MaxVL: Maximum number of Virtual Links supported by the sender.
Byte 3 – MaxVH: Maximum number of Virtual Hierarchies supported by the sender.
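
Implementation Note
To make the Table 2-1 field layout concrete, the following C sketch packs and unpacks the 4-byte MRInit DLLP body. It is a sketch only: the structure and helper names are invented here, and the 16-bit CRC that protects every DLLP (per the PCI Express Base Specification) is omitted.

    #include <stdint.h>

    #define MRINIT_TYPE 0x01u              /* Byte 0: 0000 0001b */

    typedef struct {
        uint8_t phase;      /* Byte 1, bit 7    - negotiation phase           */
        uint8_t vh_fc;      /* Byte 1, bit 6    - per-VH, VL FC supported     */
        uint8_t port_type;  /* Byte 1, bits 3:0 - Device / Port Type          */
        uint8_t authorized; /* Byte 2, bit 7    - Authorized Switch port      */
        uint8_t version;    /* Byte 2, bits 6:4 - Protocol Version (001b)     */
        uint8_t max_vl;     /* Byte 2, bits 2:0 - Maximum Virtual Links       */
        uint8_t max_vh;     /* Byte 3           - Maximum Virtual Hierarchies */
    } mrinit_dllp;

    static void mrinit_pack(const mrinit_dllp *d, uint8_t out[4])
    {
        out[0] = MRINIT_TYPE;
        out[1] = (uint8_t)((d->phase << 7) | (d->vh_fc << 6) |
                           (d->port_type & 0x0F));
        out[2] = (uint8_t)((d->authorized << 7) | ((d->version & 0x7) << 4) |
                           (d->max_vl & 0x7));
        out[3] = d->max_vh;
    }

    static int mrinit_unpack(const uint8_t in[4], mrinit_dllp *d)
    {
        if (in[0] != MRINIT_TYPE)
            return -1;                      /* not an MRInit DLLP */
        d->phase      = (in[1] >> 7) & 0x1;
        d->vh_fc      = (in[1] >> 6) & 0x1;
        d->port_type  =  in[1]       & 0x0F;
        d->authorized = (in[2] >> 7) & 0x1;
        d->version    = (in[2] >> 4) & 0x7;
        d->max_vl     =  in[2]       & 0x7;
        d->max_vh     =  in[3];
        return 0;
    }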
The MRInit DLLP is a new encoding, not defined in PCI Express. PCI Express components are
required to ignore DLLPs not defined in the Base PCI Express Specification (see section 3.5.2.2 of
the PCI Express 2.0 Specification, or section 3.5.2.1 of the PCI Express 1.1 Specification).
MR Devices will always negotiate to use the MR link protocol. MRA Switches and Root Ports will
only negotiate if enabled to do so.
The negotiation sequence between two MR components that are enabled to use the link in MR
mode is shown in Figure 2-2.
[Figure 2-2: MR to MR Initialization Sequence – the components exchange MRInit Phase 0 and Phase 1 DLLPs (MR Negotiated), then MR InitFC1 / InitFC2 DLLPs for VL0 (MR VL0 Negotiated), then MR InitFC1 DLLPs for VH0 / VL0.]
DL_Init – Physical Layer reporting Link is operational, initialize Base Flow Control for the default Virtual
Channel
DL_NegotiateMR – Physical Layer reporting Link is operational, negotiate the use of the MR
Link Protocol
DL_InitMR – Physical Layer reporting Link is operational, initialize MR Flow Control for VL0
and for VH0 / VL0
DL_Active – Normal operation mode
Figure 2-4: MR Data Link Control and Management State Machine (MR-DLCMSM)
The DL_Inactive state rules are modified as follows:
DL_Inactive
…
• Exit to DL_Init if:
♦ Indication from the Transaction Layer that the Link is not disabled by software, the link is not
enabled for MR operation and the Physical Layer reports Physical LinkUp = 1b
• Exit to DL_NegotiateMR if:
♦ Indication from the Transaction Layer that the Link is not disabled by software, the
link is enabled for MR operation and the Physical Layer reports Physical LinkUp =
1b
Rules are added for the new DL_NegotiateMR and DL_InitMR states:
DL_NegotiateMR
• While in DL_NegotiateMR:
♦ Negotiate MR Link Protocol usage following the MR Link Protocol Negotiation
described in Section 2.1.1
♦ Report DL_Down status
♦ The Data Link Layer of a Port with DL_Down status is permitted to discard any
received TLPs provided that it does not acknowledge those TLPs by sending one or
more Ack DLLPs
• Exit to DL_Init if:
♦ MR Link Protocol negotiation completes indicating PCIe Link Mode and the
Physical Layer continues to report Physical LinkUp = 1b
• Exit to DL_InitMR if:
♦ MR Link Protocol negotiation completes indicating MR Link Mode and the Physical
Layer continues to report Physical LinkUp = 1b
• Terminate attempt to negotiate MR Link Protocol and Exit to DL_Inactive if:
♦ Physical Layer reports Physical LinkUp = 0b
DL_InitMR
• While in DL_InitMR:
♦ Initialize Flow Control for the default Virtual Link, VL0, and default Virtual
Hierarchy on the default Virtual Link, VH0/VL0, following the Flow Control
initialization protocol described in Section 2.1.2
♦ Report DL_Down status while in state MRFC_INIT1_VL, MRFC_INIT2_VL or
MRFC_INIT1_VH; DL_Up status in state MRFC_INIT2_VH
♦ The Data Link Layer of a Port with DL_Down status is permitted to discard any
received TLPs provided that it does not acknowledge those TLPs by sending one or
more Ack DLLPs
• Exit to DL_Active if:
♦ Flow Control initialization completes successfully, and the Physical Layer continues
to report Physical LinkUp = 1b
• Terminate attempt to initialize Flow Control and Exit to DL_Inactive if:
♦ Physical Layer reports Physical LinkUp = 0b
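
Implementation Note
The modified state machine rules above can be summarized by the following C sketch of the MR-DLCMSM transitions. The enum and input names are invented, and the per-state behaviors (blocking TLPs, reporting DL_Down, and so on) are not modeled.

    #include <stdbool.h>

    typedef enum {
        DL_INACTIVE, DL_NEGOTIATE_MR, DL_INIT, DL_INIT_MR, DL_ACTIVE
    } dl_state;

    typedef struct {
        bool link_up;      /* Physical Layer: Physical LinkUp = 1b       */
        bool sw_enabled;   /* Transaction Layer: Link not disabled by SW */
        bool mr_enabled;   /* Link enabled for MR operation              */
        bool nego_done;    /* MR Link Protocol negotiation complete      */
        bool nego_mr_mode; /* negotiation result was MR Link Mode        */
        bool fc_init_done; /* Flow Control initialization complete       */
    } dl_inputs;

    static dl_state dl_step(dl_state s, const dl_inputs *in)
    {
        if (!in->link_up)
            return DL_INACTIVE;             /* LinkUp = 0b from any state */
        switch (s) {
        case DL_INACTIVE:
            if (!in->sw_enabled)
                return DL_INACTIVE;
            return in->mr_enabled ? DL_NEGOTIATE_MR : DL_INIT;
        case DL_NEGOTIATE_MR:
            if (!in->nego_done)
                return DL_NEGOTIATE_MR;
            return in->nego_mr_mode ? DL_INIT_MR : DL_INIT;
        case DL_INIT:      /* Base FC init for the default VC  */
        case DL_INIT_MR:   /* MR FC init for VL0 and VH0 / VL0 */
            return in->fc_init_done ? DL_ACTIVE : s;
        default:
            return DL_ACTIVE;
        }
    }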
While in Phase 0 of the MR Link Protocol:
• Transaction Layer must block transmission of TLPs and DLLPs other than the MRInit
DLLP
• Continuously transmit an MRInit DLLP as shown in Figure 2-1. MaxVL, MaxVH, Auth
and Device / Port Type reflect the sender's values. Protocol Version is 1h. Phase is 0b.
♦ This does not block Physical Layer initiated transmissions (for example, Ordered
Sets)
• Process received MRInit DLLPs:
♦ Record the MaxVL, MaxVH, VH FC, Device / Port Type and Authorized values
• Exit to Phase 1 of the MR Link Protocol Negotiation if:
♦ MRInit DLLP was received with Protocol Version 1h (with either Phase).
• Exit indicating PCIe Link Mode if:
♦ InitFC1 DLLP was received.
While in Phase 1 of the MR Link Protocol:
• Transaction Layer must block transmission of TLPs and DLLPs other than the MRInit
DLLP
• Continuously transmit an MRInit DLLP as shown in Figure 2-1. MaxVL, MaxVH, Auth
and Device / Port Type reflect the sender's values. Protocol Version is 1h. Phase is 1b.
♦ This does not block Physical Layer initiated transmissions (for example, Ordered
Sets)
• Process received MRInit DLLPs:
♦ Ignore the MaxVL, MaxVH, VH FC, Device / Port Type and Authorized values
• Exit indicating MR Link Mode if either:
♦ MRInit DLLP was received with Protocol Version 1h and Phase 1b
♦ Any MRInitFC1_VL DLLP was received
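
Implementation Note
The Phase 0 / Phase 1 exit rules can be expressed as the following C sketch, which classifies each received DLLP. The helper predicates are assumed to exist and are not defined by this specification.

    #include <stdbool.h>
    #include <stdint.h>

    bool is_base_initfc1(const uint8_t dllp[4]); /* assumed: Base PCIe InitFC1? */
    bool is_mrinitfc1_vl(const uint8_t dllp[4]); /* assumed: MRInitFC1_VL?      */

    typedef enum {
        NEGO_CONTINUE, NEGO_GO_PHASE1, NEGO_PCIE_MODE, NEGO_MR_MODE
    } nego_result;

    static bool is_mrinit_v1(const uint8_t d[4])
    {
        return d[0] == 0x01 && ((d[2] >> 4) & 0x7) == 0x1;
    }

    static nego_result nego_phase0_rx(const uint8_t d[4])
    {
        if (is_mrinit_v1(d))
            return NEGO_GO_PHASE1;      /* Version 1h, either Phase       */
        if (is_base_initfc1(d))
            return NEGO_PCIE_MODE;      /* link partner is not MR enabled */
        return NEGO_CONTINUE;
    }

    static nego_result nego_phase1_rx(const uint8_t d[4])
    {
        if ((is_mrinit_v1(d) && ((d[1] >> 7) & 0x1)) || is_mrinitfc1_vl(d))
            return NEGO_MR_MODE;        /* MRInit Phase 1b, or any
                                           MRInitFC1_VL DLLP */
        return NEGO_CONTINUE;
    }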
active prior to initialization of VL0 and VH0. However, when additional VLs or VHs are being
initialized there will typically be TLP traffic flowing on other, already enabled, VLs and VHs. Such
traffic has no direct effect on the initialization process for the additional VL(s) and VH(s).
There are four states in the MR VL/VH initialization process. These states are shown in Figure 2-5.
[Figure 2-5: MR VL/VH Flow Control initialization states – MRFC_INIT1_VL, MRFC_INIT2_VL, MRFC_INIT1_VH, and MRFC_INIT2_VH.]
MR Flow Control DLLPs need to communicate the VH number in addition to the Base PCIe
information. It is no longer possible to fit this information in a single DLLP. Consequently, for MR-
IOV, Header and Data credits are communicated using different DLLPs. The formats for the
various MR Flow Control DLLPs are shown in Figure 2-6 through Figure 2-15. The DLLP fields are
described in Table 2-2. If a Receiver advertises infinite VH credits, then the Receiver must transmit
PCIe Base UpdateFC DLLPs for that VL instead of the MRUpdateFC DLLPs described below.
The VL Credit Type is implicitly determined from the DLLP Type Encoding used by the
UpdateFC DLLP
The VC ID field in the UpdateFC DLLP contains the VL number.
The HdrFC field in the UpdateFC DLLP contains the VL header credit value for the indicated
type (P, NP, or Cpl)
The DataFC field in the UpdateFC DLLP contains the VL data credit value for the indicated
type (P, NP, or Cpl)
[Figures 2-6 through 2-15: MR Flow Control DLLP formats.]
Table 2-2: MR Flow Control DLLP Fields

Byte 0, Bits 7:4 – DLLP Type: 1011b indicates an MRUpdateFC DLLP; 0111b indicates an MRInitFC1_VL or MRInitFC1_VH DLLP; 1111b indicates an MRInitFC2_VL or MRInitFC2_VH DLLP.
Byte 0, Bits 2:0 – VL Number: Indicates the Virtual Link.
Byte 1 – VH Number: Indicates the Virtual Hierarchy. This field is reserved if VHO is Set.
Byte 2, Bit 7 – VH Omitted (VHO): Indicates whether the VH Number field is present in the DLLP. If Set, this indicates the DLLP is MRInitFC1_VL or MRInitFC2_VL and the VH Number is omitted.
Byte 2, Bits 6:5 – TT (TLP Type): 00b indicates Posted credit; 01b indicates Non-Posted credit; 10b indicates Completion credit; 11b is Reserved.
Byte 3, Bit 4 – Credit Type: 0 indicates Header Credit; 1 indicates Data Credit.
Byte 3, Bits 3:0 and Byte 4 – Data Credit Value: Valid if Credit Type is Set. PCIe encoding applies (i.e., during initialization, zero means infinite).
Byte 4 – Header Credit Value: Valid if Credit Type is Clear. PCIe encoding applies (i.e., during initialization, zero means infinite).
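
Implementation Note
Following Table 2-2 as written (the Data Credit Value spans Byte 3, bits 3:0 and Byte 4), an MR Flow Control DLLP body could be decoded as in this C sketch; the structure and names are invented for illustration.

    #include <stdint.h>

    typedef struct {
        uint8_t  dllp_type; /* Byte 0, bits 7:4 (Bh update; 7h InitFC1; Fh InitFC2) */
        uint8_t  vl;        /* Byte 0, bits 2:0 - Virtual Link number         */
        uint8_t  vho;       /* Byte 2, bit 7    - VH Number omitted (_VL)     */
        uint8_t  vh;        /* Byte 1           - Virtual Hierarchy (if !vho) */
        uint8_t  tt;        /* Byte 2, bits 6:5 - 00b P, 01b NP, 10b Cpl      */
        uint8_t  is_data;   /* Byte 3, bit 4    - 0 header, 1 data credit     */
        uint16_t credit;    /* 12-bit data or 8-bit header credit value       */
    } mr_fc_dllp;

    static void mr_fc_decode(const uint8_t b[5], mr_fc_dllp *d)
    {
        d->dllp_type = b[0] >> 4;
        d->vl        = b[0] & 0x7;
        d->vho       = (b[2] >> 7) & 0x1;
        d->vh        = d->vho ? 0 : b[1];   /* Byte 1 reserved when VHO set */
        d->tt        = (b[2] >> 5) & 0x3;
        d->is_data   = (b[3] >> 4) & 0x1;
        d->credit    = d->is_data
                     ? (uint16_t)(((b[3] & 0x0F) << 8) | b[4]) /* data   */
                     : b[4];                                   /* header */
    }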
If at any time during initialization for VLs 1-7 a VL is disabled, any flow control initialization
process involving that VL is terminated.
Rules for state MRFC_INIT1_VL:
Rules for state MRFC_INIT2_VL:
• While in MRFC_INIT2_VL:
♦ Transaction Layer must block transmission of TLPs using VLx
♦ Transmit the following six MRInitFC2 DLLPs for VLx in the following relative
order:
■ MRInitFC2_VL – P – Header (first)
■ MRInitFC2_VL – P – Data (second)
■ MRInitFC2_VL – NP – Header (third)
■ MRInitFC2_VL – NP – Data (fourth)
■ MRInitFC2_VL – Cpl – Header (fifth)
■ MRInitFC2_VL – Cpl – Data (sixth)
♦ The six MRInitFC2_VL DLLPs must be transmitted at least once every 34 μs.
■ Time spent in the Recovery LTSSM state does not contribute to this limit.
■ It is strongly encouraged that the MRInitFC2_VL DLLP transmissions are
repeated frequently, particularly when there are no other TLPs or DLLPs
available for transmission.
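
Implementation Note
A minimal sketch of the required MRInitFC2_VL transmit order is shown below in C. The transmit primitive is a stand-in, and an implementation would also repeat the burst at least once every 34 μs while in MRFC_INIT2_VL.

    #include <stdint.h>

    typedef enum { TT_P, TT_NP, TT_CPL } tlp_type;
    typedef enum { CT_HDR, CT_DATA }     credit_type;

    /* Assumed transmit primitive, not a defined interface. */
    void send_mrinitfc2_vl(uint8_t vl, tlp_type t, credit_type c);

    static void mrfc_init2_vl_burst(uint8_t vl)
    {
        /* Required relative order: P-Hdr, P-Data, NP-Hdr, NP-Data,
         * Cpl-Hdr, Cpl-Data. */
        static const struct { tlp_type t; credit_type c; } order[6] = {
            { TT_P,   CT_HDR }, { TT_P,   CT_DATA },
            { TT_NP,  CT_HDR }, { TT_NP,  CT_DATA },
            { TT_CPL, CT_HDR }, { TT_CPL, CT_DATA },
        };
        for (int i = 0; i < 6; i++)
            send_mrinitfc2_vl(vl, order[i].t, order[i].c);
    }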
Rules for state MRFC_INIT2_VH:
• While in MRFC_INIT2_VH:
♦ Transaction Layer must block transmission of TLPs using VHx VLy
♦ If the link partner indicated no support for per-VH, VL flow control during
DL_NegotiateMR (i.e., the VH FC bit was Clear in the MRInit DLLPs it sent), then
infinite VH credits must be advertised for all VH Credit Types.
♦ Transmit the following six MRInitFC2 DLLPs for VHx VLy in the following relative
order:
■ MRInitFC2_VH – P – Header (first)
■ MRInitFC2_VH – P – Data (second)
■ MRInitFC2_VH – NP – Header (third)
■ MRInitFC2_VH – NP – Data (fourth)
■ MRInitFC2_VH – Cpl – Header (fifth)
■ MRInitFC2_VH – Cpl – Data (sixth)
♦ The six MRInitFC2_VH DLLPs must be transmitted at least once every 34 μs.
■ Time spent in the Recovery LTSSM state does not contribute to this limit.
■ It is strongly encouraged that the MRInitFC2_VH DLLP transmissions are
repeated frequently, particularly when there are no other TLPs or DLLPs
available for transmission.
♦ Except as needed to ensure at least the required frequency of MRInitFC2 DLLP
transmission, the Data Link Layer must not block other transmissions
■ Note that this includes all Physical Layer initiated transmissions (for example,
Ordered Sets), Ack and Nak DLLPs (when applicable), and TLPs using VLs and
VHs that have previously completed initialization (when applicable)
♦ Process received MRInitFC1_VH and MRInitFC2_VH DLLPs for VHx VLy:
2.2. TLP Prefix Tagging

[Figure: MR TLP format – an MR TLP Tag is prepended to the Base PCIe TLP; the Tag is covered by the LCRC but not by the optional ECRC, while the TLP Header (unchanged from PCIe), ECRC, LCRC, and END framing follow as in Base PCIe.]
Virtual Hierarchy Numbers (VHN) are link-local. On a link that supports n VHs, valid VHNs are
[0 .. n-1]. A given TLP may take on different VHN values on each Multi-Root link that it traverses.
The VL # field contains the Virtual Link Number (VL #). It contains information used to support
Congestion Management and Isolation. TLPs are assigned to VLs using a combination of PCIe TC
to VC mapping rules and new MR VC to VL mapping rules. See Chapter 8 Congestion Management
for details.
The Global Key field is used to guard against “VH Hopping”. VH Hopping is when a TLP in one
VH inadvertently ends up on a different VH (either due to MR-PCIM table configuration errors or
due to a hardware error inside a MR Switch). The Global Key is added to the TLP when the TLP
Prefix is attached at the MR Ingress point. The Global Key value selected is based on which VH is
being used. This Global Key value is preserved, unchanged, through subsequent MR Switches. The
Global Key value is validated against the expected value at various points in the MR topology. The
Global Key is always validated at MR Egress. The Global Key value is optionally validated at either
(or both) MR Switch Input or MR Switch Output. Global Key checking is similar to PCIe ECRC
checking. Failure of any Global Key validation is an unrecoverable error.
MR-PCIM software configures the tables used to generate and check the Global Key values. To
avoid Global Key mismatch errors, MR-PCIM must configure tables such that all TLPs in a given
VH have the same Global Key value. To provide maximum protection, MR-PCIM should configure
tables so that TLPs in different VHs have different Global Keys (this protection is not provided
between VHs with duplicate Global Key values).
1. TLPs arrive on an Input Port of the switch with a link-local Input VH Number.
a. For a link operating in MR mode, the Input VHN is contained in the TLP Prefix
Header.
a. For an output link operating in Base PCIe mode, the Output VHN is zero. The
Virtual Circuit (VC) used to transmit the TLP was determined in step 4 above. The
Output VL value is not used.
b. For an output link operating in MR mode new Output VHN and Output VL values
are placed in the TLP Prefix Header overwriting the input values (if any). The
Output VL is used to transmit the TLP.
13. Global Key output processing occurs. If the output link is operating in MR mode, the
Global Key from the TLP is optionally validated against the global key associated with the
VF. This is the “exiting check” as described in Section 4.3.5.2. If the output link is operating
in Base PCIe mode, the “terminating check” in step 5 is used instead.
14. When receive buffer space is made available, Flow Control is returned to the VL and
(VH, VL) contained in the TLP Prefix. The TC to VC maps and VC to VL maps on the
receiver are not used (PCIe does not transmit the VC so the TC to VC map is used for this
purpose).
MR Switches can optionally check Global Keys as TLPs exit the switch on an MR link.
Hardware support for the check as TLPs enter and exit MR Switches is optional. Hardware support
for the MR Egress check is mandatory.
The Global Key value is a 12-bit value, assigned by software to each VH. To achieve maximum
protection, software should assign each VH a distinct Global Key value.
A Global Key check passes if the value in the TLP and the expected value match. A Global Key
check also passes if either the expected value is 000h or the TLP value is 000h (the wild card value).
In other words, when MR Ingress points assign a TLP the Global Key value of 000h, Global Key
checks will always pass for that TLP. When a Global Key is programmed with an expected value of
000h, any checks that use that value will always pass.
Global Key checking is disabled by default. This allows software time to program the Global Key
registers before enabling checking.
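
Implementation Note
The pass / fail rule above reduces to a small predicate, sketched here in C under the assumption that checking has been enabled; the function name is illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    #define GLOBAL_KEY_WILDCARD 0x000u   /* the 12-bit wild card value */

    static bool global_key_check(uint16_t expected, uint16_t tlp_value)
    {
        expected  &= 0x0FFF;             /* Global Keys are 12-bit values */
        tlp_value &= 0x0FFF;
        if (expected == GLOBAL_KEY_WILDCARD || tlp_value == GLOBAL_KEY_WILDCARD)
            return true;                 /* wild card always passes */
        return expected == tlp_value;    /* otherwise values must match */
    }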
[Figure: MR TLP dataflow example – the example Multi-Root topology of Figure 1-2.]
4. The TLP arrives at MRA Switch 2 Port p labeled as belonging to VH 0. MRA Switch 2 has
been programmed so that all TLPs arriving at Port p labeled VH 0 are associated with the VS
for VH C.
5. The VS inside MRA Switch 2 address routes the TLP to a Virtual Downstream Port associated
with physical Port q (the link headed towards Device Y). MR-PCIM has assigned VH 1 to this
Virtual Downstream Port. The TLP exits MRA Switch 2 with a TLP Prefix having VH
Number 1.
6. The TLP arrives at Device Y labeled as belonging to VH 1. The Device hands the transaction to
PF 1:0 for execution.
7. PF 1:0 completes the transaction and emits a completion TLP. Device Y sends this TLP out
Port r labeled with VH 1.
8. The TLP arrives at MRA Switch 2 Port q labeled as belonging to VH 1. MRA Switch 2 has
been programmed so that all TLPs arriving at Port q labeled VH 1 are associated with the VS
for VH C.
9. The VS inside MRA Switch 2 ID routes the TLP to a Virtual Upstream Port associated with
physical Port p (the link headed towards MRA Switch 0). MR-PCIM has assigned VH 0 to this
Virtual Upstream Port. The TLP exits MRA Switch 2 with a TLP Prefix having VH Number 0.
10. The TLP arrives at MRA Switch 0 Port o labeled as belonging to VH 0. MRA Switch 0 has
been programmed so that all TLPs arriving at Port o labeled VH 0 are associated with the VS
for VH C.
11. The VS inside MRA Switch 0 ID routes the TLP to a Virtual Upstream Port associated with
physical Port n (the link headed towards Host C). MR-PCIM has designated this link a PCIe
link and has assigned this Virtual Upstream Port to it. The TLP exits Switch 0 with no TLP
Prefix.
12. SI 4 in Host C sees a completion for the Memory Read Transaction.
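
Implementation Note
The MR-PCIM-programmed behavior in steps 4, 8, and 10 (ingress classification) and steps 5, 9, and 11 (egress rewrite) can be pictured as two tables, sketched here in C. The table shapes and names are invented; a real MRA Switch defines this mapping through its VS Bridge tables.

    #include <stdint.h>

    #define MAX_PORTS 16
    #define MAX_VHN    8

    /* Steps 4 / 8 / 10: (ingress Port, link-local VHN) selects a
     * Virtual Switch. */
    typedef struct { uint8_t vs_id; } ingress_map_entry;
    ingress_map_entry ingress_map[MAX_PORTS][MAX_VHN];  /* set by MR-PCIM */

    static uint8_t classify_ingress(uint8_t port, uint8_t in_vhn)
    {
        return ingress_map[port][in_vhn].vs_id;
    }

    /* Steps 5 / 9 / 11: each virtual egress Port carries its own
     * link-local VHN, or strips the Prefix on a Base PCIe link. */
    typedef struct {
        uint8_t phys_port;  /* physical egress Port                    */
        uint8_t out_vhn;    /* link-local VHN placed in the TLP Prefix */
        uint8_t mr_mode;    /* 0: PCIe link, TLP exits with no Prefix  */
    } egress_map_entry;

    static void apply_egress(const egress_map_entry *e,
                             uint8_t *vhn, int *has_prefix)
    {
        if (e->mr_mode) {
            *vhn = e->out_vhn;   /* rewrite for the next MR link */
            *has_prefix = 1;
        } else {
            *has_prefix = 0;     /* PCIe link: no TLP Prefix */
        }
    }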
[Figure: Per-VH Reset example topology – SIs 1-5 above two VIs, with Links labeled X0, X1, X2, Y0, Y1, Y2, and Z1.]
[Figure: Per-VH Reset example, VH A components – Host A (SI 1, SI 2, VI), SR Device W, and MR Device Y, with Links X0, X1, X2, Y0, Y1, Y2, and Z1.]
Table 2-3: Reset DLLP Example: Events and Actions (continued)

… A out Port Y2
Device W – Sees PCIe Hot Reset: W.1 Discards all TLPs, enters Reset; W.2 Acks Hot Reset (PCIe TS1/2)
Device Y – Sees Reset DLLP Request on VH A: Y.1 Starts discarding TLPs for VH A
Device Y – All TLPs for VH A are discarded or marked for discard and no TLPs for VH A are in the Retry Buffer: Y.2 Sends Reset Ack DLLP
Device X and Device Z – Nothing, not part of VH A
may be using the upstream state machine for some VHs and the downstream state machine for
other VHs.
The upstream and downstream state machines run in parallel on every VH.
The following steps are used to enter Reset. This is triggered by a request to send a Reset DLLP on
a link. This request can occur for a variety of reasons (e.g., DL_DOWN, Reset DLLP Request, or
Hot Reset on an upstream Switch link, setting the Secondary Bus Reset bit in some Type 1
configuration header, etc.).
1. Reset Requested.
2. Start discarding new TLPs received for this VH from this link.
3. Start discarding new TLPs to be transmitted for this VH on this link.
4. Discard or mark for discard any TLPs waiting to be sent for this VH on this link.
5. Wait until the Retry Buffer contains no TLPs for this VH (i.e. they have been acknowledged
and thus removed from the Retry Buffer).1
6. Schedule a Reset Request DLLP to be sent with the Assert bit = 1
7. Schedule resending the Reset Request DLLP approximately every 30 μsec until a Reset Ack
DLLP with the Assert bit = 1 is received (see Section 2.3.3.3 for details).
8. If a Reset Ack DLLP is received with Assert bit = 1, the remote end has entered Reset, stop
scheduling Reset Request DLLPs for this VH (Reset Request DLLPs may continue to be
sent if needed by other VHs).
The following steps are used to exit Reset. This is triggered by the condition causing the entry into
Reset going away (e.g., clearing Secondary Bus Reset, Physical LinkUp transitions from 0 to 1, etc.).
1. Schedule a Reset Request DLLP to be sent with the Assert bit = 0
2. Schedule resending the Reset Request DLLP approximately every 30 μsec until Reset Ack
DLLP with the Assert bit = 0 is received (see Section 2.3.3.3 for details).
3. If a Reset Ack DLLP has been received with Assert bit = 0, the remote end has exited Reset,
stop scheduling Reset Request DLLPs for this VH (Reset Request DLLPs may continue to
be sent if needed by other VHs).
Reset Request DLLPs may be coalesced so that multiple schedule events result in a single DLLP
being transmitted.
Timely propagation of Reset is important. Components should send Reset Request DLLPs as soon
as possible after starting to enter or exit Reset.
1This could be implemented by waiting until all TLPs, for any VH, have been flushed from the Retry Buffer that
were in the Retry Buffer prior to Step 3 (Step 3 ensures that no new TLPs for this VH will enter the Retry Buffer).
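
Implementation Note
The entry and exit sequences above are sketched in C below. The TLP and DLLP primitives are placeholders, and the polling loop stands in for whatever completion mechanism an implementation provides.

    #include <stdbool.h>
    #include <stdint.h>

    void discard_vh_tlps(uint8_t vh);           /* steps 2-4 (assumed) */
    bool retry_buffer_empty_for_vh(uint8_t vh); /* step 5 (assumed)    */
    void schedule_reset_request(uint8_t vh, bool assert_bit);

    static void upstream_enter_reset(uint8_t vh)
    {
        discard_vh_tlps(vh);            /* RX, pending TX, and TX queue */
        while (!retry_buffer_empty_for_vh(vh))
            ;                           /* wait for Acks to drain (poll
                                           shown for illustration only) */
        /* Send Reset Request (Assert = 1); resend approximately every
         * 30 usec until a Reset Ack with Assert = 1 arrives. */
        schedule_reset_request(vh, true);
    }

    static void upstream_exit_reset(uint8_t vh)
    {
        /* Send Reset Request (Assert = 0); resend approximately every
         * 30 usec until a Reset Ack with Assert = 0 arrives. */
        schedule_reset_request(vh, false);
    }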
[Figure: Upstream Reset state machine – states include Init, US Deassert Wait 1, US Assert Wait 1, US Assert Wait 2, US Assert Wait 3, US Deassert Wait 2, and VH Down, with transitions on sending Reset Request DLLPs and receiving Reset Ack DLLPs (Assert bit = 1 or 0).]
The following steps are used to enter Reset. This is triggered by receiving a Reset DLLP on a link
with the Assert bit = 1.
1. Reset Request DLLP is received with the Assert bit = 1
2. Start discarding new TLPs received for this VH from this link.
3. Start discarding new TLPs to be transmitted for this VH on this link.
4. Discard or mark for discard any TLPs waiting to be sent for this VH on this link.
5. All Functions in the VH enter Reset.
6. Wait until the Retry Buffer contains no TLPs for this VH (i.e. they have been acknowledged
and thus removed from the Retry Buffer).
7. Schedule a Reset Ack DLLP to be sent with the Assert bit = 1
8. Schedule another Reset Ack DLLP to be sent whenever a Reset Request DLLP is received
and the Assert bits in the Reset Request match what would be transmitted in the Reset Ack
(i.e. retransmit the Ack in case it got lost).
The following steps are used to exit Reset. This is triggered by receiving a Reset DLLP on a link
with the Assert bit = 0.
1. Reset Request DLLP is received with the Assert bit = 0
2. If all Functions are ready to exit Reset, Schedule a Reset Ack DLLP to be sent with the
Assert bit = 0
3. Schedule another Reset Ack DLLP to be sent whenever a Reset Request DLLP is received.
Reset Ack DLLPs may be coalesced so that multiple schedule events result in one DLLP being
transmitted.
Timely acknowledgement of Reset is important. Components shall respond with Reset Ack within
1.9 ms (+0%, -100%) and are strongly encouraged to respond much quicker. The 1.9 ms value is
chosen to avoid inadvertent link retraining caused by the Reset DLLP Forward Progress Timer (see
section 2.3.3.3).
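
Implementation Note
A sketch of the downstream responder follows in C: on each Reset Request DLLP, mirror the Assert bit back in a Reset Ack once the local conditions hold. All helper names are placeholders.

    #include <stdbool.h>
    #include <stdint.h>

    void discard_vh_tlps(uint8_t vh);            /* steps 2-4 (assumed)   */
    void enter_reset_all_functions(uint8_t vh);  /* step 5 (assumed)      */
    bool retry_buffer_empty_for_vh(uint8_t vh);  /* step 6 (assumed)      */
    bool all_functions_ready(uint8_t vh);        /* exit step 2 (assumed) */
    void schedule_reset_ack(uint8_t vh, bool assert_bit);

    static void downstream_on_reset_request(uint8_t vh, bool assert_bit)
    {
        if (assert_bit) {
            discard_vh_tlps(vh);
            enter_reset_all_functions(vh);
            if (retry_buffer_empty_for_vh(vh))
                schedule_reset_ack(vh, true);  /* step 7; deferred until
                                                  the Retry Buffer drains,
                                                  within 1.9 ms */
        } else {
            if (all_functions_ready(vh))
                schedule_reset_ack(vh, false); /* within 1.9 ms */
        }
    }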
A VH is considered to be waiting for Reset Ack when it is in states US Assert Wait 3 or US Deassert
Wait 2.
The Upstream component retransmits Reset Requests approximately every 30 µsec as controlled by
the Reset Request Retransmit timer. This timer is enabled whenever any VH is waiting for Reset
Ack. It restarts when a Reset Request DLLP is transmitted for any VH group (either initial DLLP or
a resend). It expires 30 µsec after being started (+50%, -0%). When the timer expires, Reset Request
DLLPs are scheduled to be sent for all VH groups that have some VH waiting for a Reset Ack.
The Upstream component also includes a Reset DLLP Forward Progress timer. This timer is
enabled when any VH is waiting for Reset Ack. It restarts when some VH enters either of the
waiting-for-Reset-Ack states or a Reset Ack is received for any VH. It expires 2 ms after being
started (+50%, -0%). When the timer expires, a link retrain is requested.
The Upstream component also includes a 2-bit Reset Retrain counter. This counter is incremented
when a link retrain is requested due to the expiration of the Reset DLLP Forward Progress timer.
This counter resets to 00b whenever the Reset DLLP Forward Progress timer is restarted. If this
counter rolls over from 11b to 00b, the link shall enter Detect.
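The interplay between the Forward Progress timer and the Reset Retrain counter can be sketched as follows. All names are illustrative (request_link_retrain, enter_detect and arm_timer_2ms are hypothetical hooks into the link layer), and the re-arming policy after expiry is an assumption of this sketch.

```c
#include <stdint.h>

void request_link_retrain(void);   /* hypothetical: ask the LTSSM to retrain */
void enter_detect(void);           /* hypothetical: force the link to Detect */
void arm_timer_2ms(void);          /* hypothetical: (re)start 2 ms timer */

static uint8_t reset_retrain_counter;        /* 2-bit counter */

/* Forward Progress timer restart: a VH entered a waiting-for-Reset-Ack
 * state, or a Reset Ack was received for any VH. */
void forward_progress_restart(void)
{
    reset_retrain_counter = 0;               /* counter resets to 00b */
    arm_timer_2ms();                         /* 2 ms (+50%, -0%) */
}

/* Forward Progress timer expiry. */
void forward_progress_expired(void)
{
    request_link_retrain();
    reset_retrain_counter = (reset_retrain_counter + 1) & 0x3;
    if (reset_retrain_counter == 0)          /* rolled over from 11b to 00b */
        enter_detect();
    arm_timer_2ms();                         /* keep timing while VHs wait */
}
```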
Reset state machines are not affected by link retraining (either initiated by the Reset DLLP Forward
Progress Timer or through other means). DLLPs that were scheduled during the link retrain shall be
sent when the link retrain completes.
Requests to retrain the link initiated by this mechanism may be coalesced with requests to retrain the
link initiated by other mechanisms. For example, two “simultaneous” requests for a link retrain, (1)
due to a REPLAY_NUM rollover and (2) due to the Reset Forward Progress timer, may result in
either one or two retrain sequences.
In Multi-Root systems, Flow Control DLLPs can affect more than one VH. It is critical that a VH
entering or exiting the Reset state does not disrupt other VHs. Flow Control credits on MR Enabled
Links must be returned to the originator even when the VH that originated the discarded TLPs is in
or is entering Reset.
For example, suppose TLPs enter a Switch on the MR Link associated with Port 1 destined for
Port 2. If Port 2 enters Reset or DL_DOWN and thus discards these TLPs, credits must be
returned in the appropriate VH of Port 1. This must occur even if the associated VH at Port 1 is
also in or is entering Reset (there may be other VHs not in Reset sharing Port 1). This must occur
whether or not Port 2's link is an MR Link.
Reset propagation must not be affected by Flow Control. Reset DLLPs must be sent independent of
any Flow Control state.
Reset propagation is affected by the TLP Ack / Nak protocol. Allowing this avoids the complexity
of editing the Retry Buffer to remove TLPs for VHs that are now in Reset.
• PD, NPD or CPLD data VL credits may be required. The number of credits needed depends
on the TLP size and uses the same rules as PCIe.
• PD, NPD or CPLD data (VH, VL) credits may be required. The number of credits needed
depends on the TLP size and uses the same rules as PCIe.
The transmitter gating function for a TLP is said to pass if it passes for all four gates. If any gate
fails, the transmitter must block transmission of the TLP. If CREDIT_LIMIT was specified as
“infinite” during Flow Control initialization, then the corresponding gating function is
unconditionally satisfied for that type of credit.
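A minimal sketch of this check, assuming (as the surrounding text suggests) that the four gates are the VL header, VL data, (VH, VL) header and (VH, VL) data credit checks. The field width, the encoding of “infinite” and all names are assumptions of the sketch; the modulo comparison follows the PCIe Base style.

```c
#include <stdbool.h>
#include <stdint.h>

#define CREDIT_INFINITE UINT32_MAX   /* illustrative "infinite" marker */
#define FIELD_MASK 0xFFFu            /* assumes 12-bit credit fields */

typedef struct {
    uint32_t credit_limit;           /* CREDIT_LIMIT for this gate */
    uint32_t credits_consumed;       /* CREDITS_CONSUMED for this gate */
} credit_gate_t;

/* A gate passes if its credit type was advertised infinite, or if
 * consuming 'needed' credits stays within CREDIT_LIMIT. */
static bool gate_passes(const credit_gate_t *g, uint32_t needed)
{
    if (g->credit_limit == CREDIT_INFINITE)
        return true;                 /* unconditionally satisfied */
    uint32_t margin = (g->credit_limit - (g->credits_consumed + needed))
                      & FIELD_MASK;
    return margin <= (FIELD_MASK + 1) / 2;   /* PCIe-style modulo check */
}

/* The TLP may be transmitted only if all four gates pass. */
bool tlp_may_transmit(const credit_gate_t gates[4], const uint32_t needed[4])
{
    for (int i = 0; i < 4; i++)
        if (!gate_passes(&gates[i], needed[i]))
            return false;
    return true;
}
```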
The Transmitter must follow the same ordering and deadlock avoidance rules as specified in the
PCIe protocol. TLPs mapped to different VLs have no ordering relationship, and must not block
each other.
The transmitter gating function rules for (VH, VL)s and VLs are the same as that specified in the
PCIe protocol.
TLPs associated with different VHs represent different flows and have no ordering relationship.
TLPs blocked due to failure of the (VH, VL) gating function should not block TLPs associated with
other VHs mapped to the same VL.
If any CREDIT_LIMIT_VHVL for a Credit Type is infinite, then all CREDIT_LIMIT_VHVL
Credit Type values for that VL must also be infinite.
If the CREDIT_LIMIT_VL and CREDIT_LIMIT_VHVL are both infinite, no UpdateFC or
MRUpdateFC DLLPs are sent. A transmitter may optionally raise a Receiver Error if one is received
(as in PCIe).
Otherwise, if CREDIT_LIMIT_VHVL is non-infinite, then MRUpdateFC DLLPs are sent.
2.6.1. Flow Control Rules
MRUpdateFC DLLPs update the values of VL and (VH, VL) CREDIT_LIMIT as follows:
♦ Credits_Received = (Update Value – CREDIT_LIMIT_VHVL) mod 2^Field_Size
♦ CREDIT_LIMIT_VL = (CREDIT_LIMIT_VL + Credits_Received) mod 2^Field_Size
♦ CREDIT_LIMIT_VHVL = Update Value
♦ The CREDIT_LIMIT_VL value is updated using the CREDIT_LIMIT_VHVL value before it is
updated by this DLLP.
The transmitter computes but otherwise ignores the CREDIT_LIMIT values associated with
infinite CREDIT_LIMIT_VL credits.
Otherwise, if CREDIT_LIMIT_VHVL is infinite, then PCIe Base UpdateFC DLLPs are sent. A
transmitter may optionally raise a Receiver Error if an MRUpdateFC DLLP is received.
If an Infinite Credit advertisement (value of 00h or 000h) has been made during initialization, no
Flow Control updates are required following initialization. If UpdateFC DLLPs are sent, the credit
value fields must be set to zero and must be ignored by the Receiver. The Receiver may optionally
check for non-zero update values (in violation of this rule). If a component implementing this check
determines a violation of this rule, the violation is a Flow Control Protocol Error (FCPE).
UpdateFC DLLPs update the value of VL CREDIT_LIMIT as follows:
• The initial value is Vendor Specific. This value was communicated to the transmitter
using the MR Flow Control Initialization Protocol.
• This value is incremented as processing (or discarding) of received TLPs makes
additional receiver buffer space available.
• Changes to this value are communicated to the transmitter using flow control update
DLLPs.
CREDITS_RECEIVED
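The MRUpdateFC arithmetic above amounts to modular bookkeeping. A minimal sketch, assuming a 12-bit credit field (the actual Field_Size depends on the credit type) and illustrative structure names:

```c
#include <stdint.h>

#define FIELD_SIZE 12u
#define FIELD_MASK ((1u << FIELD_SIZE) - 1u)

typedef struct {
    uint32_t credit_limit_vl;        /* CREDIT_LIMIT_VL */
    uint32_t credit_limit_vhvl;      /* CREDIT_LIMIT_VHVL */
} fc_state_t;

/* Apply one MRUpdateFC DLLP carrying 'update_value' for this (VH, VL). */
void apply_mr_update_fc(fc_state_t *fc, uint32_t update_value)
{
    /* Credits_Received is computed against CREDIT_LIMIT_VHVL *before*
     * this DLLP updates it, exactly as the rules above require. */
    uint32_t credits_received =
        (update_value - fc->credit_limit_vhvl) & FIELD_MASK;

    fc->credit_limit_vl =
        (fc->credit_limit_vl + credits_received) & FIELD_MASK;
    fc->credit_limit_vhvl = update_value & FIELD_MASK;
}
```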
2.5.1. Interrupts
Interrupt processing occurs within a VH. The associated TLPs contain a TLP Prefix allowing all
components to route them appropriately.
MSI and MSI-X Interrupts are indistinguishable from other Memory Write TLPs.
INTx Interrupts are represented using ASSERT_INTx / DEASSERT_INTx messages per the PCI
Express Base Specification. In MR, these messages are queued and ordered within a VH.
In PCIe, INTx messages are emitted by Devices. An ASSERT_INTx message is emitted when an
interrupt condition is signaled. A DEASSERT_INTx message is emitted when the interrupt
condition has been satisfied.
In MR, INTx messages are emitted within a VH. If a function within a VH signals or satisfies an
interrupt, ASSERT_INTx / DEASSERT_INTx messages are emitted within that VH. These
messages are unrelated to INTx messages issued in any other VH.
In PCIe, INTx messages are processed by Switches. Each downstream Switch Port tracks an internal
INTx wire for each of INTA/B/C/D. These INTx wires are combined into four INTx wires at the
upstream Switch Port. Transitions of the combined INTx wires trigger sending of INTx messages
out the upstream Switch Port.
Similarly, in MR, INTx messages are processed by Virtual Switches. Each virtual downstream Switch
Port tracks an internal INTx wire for each of INTA/B/C/D. The INTx wires are combined into
four INTx wires at the virtual upstream Switch Port. Transitions of the combined INTx wires
trigger sending of INTx messages out the virtual upstream Switch Port.
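The wire-combining rule can be pictured with a short sketch. The virtual switch structure and the message helper are illustrative, not normative:

```c
#include <stdbool.h>

#define NUM_INTX 4                        /* INTA, INTB, INTC, INTD */
#define MAX_DOWNSTREAM 32

typedef struct {
    int  num_downstream;
    bool wire[MAX_DOWNSTREAM][NUM_INTX];  /* per virtual downstream Port */
    bool combined[NUM_INTX];              /* at the virtual upstream Port */
} virtual_switch_t;

/* Hypothetical: emit ASSERT_INTx (asserted=true) or DEASSERT_INTx. */
void send_intx_message(int intx, bool asserted);

/* Recompute the combined wires; transitions trigger INTx messages out
 * the virtual upstream Switch Port. */
void update_combined_intx(virtual_switch_t *vs)
{
    for (int i = 0; i < NUM_INTX; i++) {
        bool next = false;
        for (int p = 0; p < vs->num_downstream; p++)
            next = next || vs->wire[p][i];      /* OR of internal wires */
        if (next != vs->combined[i]) {          /* transition detected */
            vs->combined[i] = next;
            send_intx_message(i, next);
        }
    }
}
```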
Switch reconfiguration will also affect INTx. For example, when a Virtual Device is unmapped from
a Virtual Switch, the virtual downstream Switch Port sees DL_Down. This virtual downstream
Switch Port follows PCIe rules by deasserting the internal INTx wires. If this results in a transition
of the combined INTx wires a DEASSERT_INTx message is sent out the virtual upstream Switch
Port.
In PCIe, the Host Processor tracks the INTx assert/deassert state. If INTx is asserted, and software
has not masked processing, an interrupt is sent to the host processor.
In MR with a PCIe Root Port, this is unchanged. By the time the INTx message is seen by the Root
Port, the TLP Prefix has been dropped and the message is indistinguishable from non-MR usage.
In MR with an MR Root Port, the INTx messages are tagged with a VH and the RP must track
independent INTx assert/deassert state for each VH.
TLPs in a retry buffer are slightly bigger since each includes a TLP Prefix. This has a minor ripple
effect (e.g. Ack/Nak timer values may need minor adjustments).
The TLP Prefix is queued with the TLP. This increases the amount of buffering required for
the TLP Header.
Unlike conventional PCI Express, MR topologies need not be a tree. MR Topologies can be an
arbitrary mesh that may contain loops. Each VH sees a tree structure consisting of a subset of the
overall MR Topology.
Because MR Topologies are not trees, they have no single notion of link upstream and downstream
directions. Links have a Physical direction (used in the Physical Layer link training process). In
addition, each VH using a Link has a Logical direction which may differ from the Link's Physical
direction.
There are a number of steps or phases used to initialize and configure the above components into a
MR Topology. These steps are:
1. Initial State after Reset
2. PCIM Location Policy Decision
3. Topology Discovery
4. Component Discovery
5. Mapping Policy Decision
6. Mapping Implementation
7. Virtual Hierarchy Enumeration
Each step is described in more detail below.
2 Useful MR Topologies will eventually have at least one device, but this is not strictly required. In particular, no
devices might initially be present, with devices being added later using Hot-Add operations.
A PCIM Capable Switch Port i is defined as a Switch Port with the following characteristics:
• Port[i].Link_Direction indicates an upstream Port.
• The upstream P2P Bridge Configuration Header for VH 0, VS[j] has a full MR-IOV
Capability.
• Bit j of the VS Authorization Bitmap is Set.
• The Management VS value is Vendor Specific. It could be j or it could be some other VS.
• VS[j] contains sufficient Enabled Downstream P2P Bridges to access all Ports of the MR
Switch.
These settings ensure that MR-PCIM could manage the Switch using this Port. MR-PCIM is not
required to be present on this Port. The RP running MR-PCIM could be directly attached to this Port
or could be connected indirectly using one or more MR or SR Switches.
This Port is authorized. In particular, any software using it will be allowed to manage the Switch.
Software can later de-authorize the Port if desired.
Note: Since a VS has a single upstream bridge, these rules imply that every PCIM Capable Port will
be associated with a distinct VS.
Implementation Note: PCIM Capable Switch Ports need not be full width “expensive” Ports. They
could be x1 ports intended just to support management.
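A hedged sketch of testing the PCIM Capable conditions for a candidate Port i and VS j. The structures, the 64-VS bitmap truncation and the “enough downstream bridges >= port count” simplification are all assumptions of the sketch, not normative definitions:

```c
#include <stdbool.h>
#include <stdint.h>

enum { LINK_DIR_UPSTREAM = 1 };

typedef struct { int link_direction; } port_t;
typedef struct {
    bool full_mr_iov_capability;      /* in the VH 0 upstream bridge header */
    int  enabled_downstream_bridges;  /* Enabled Downstream P2P Bridges */
} vs_t;

/* True if Port i (with candidate VS j) satisfies the PCIM Capable rules.
 * The bitmap is truncated to 64 VSs here purely for brevity. */
bool is_pcim_capable(const port_t *port, const vs_t *vs,
                     uint64_t vs_authorization_bitmap, unsigned j,
                     int num_switch_ports)
{
    return port->link_direction == LINK_DIR_UPSTREAM
        && vs->full_mr_iov_capability
        && ((vs_authorization_bitmap >> j) & 1u)        /* bit j is Set */
        && vs->enabled_downstream_bridges >= num_switch_ports;
}
```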
A Non-PCIM Capable Switch Port j is defined as a Switch Port with the following characteristics:
• Port[j].Link_Direction is Vendor Specific.
• No Authorized VS has mapped {Port[j], any Port VH} to its upstream bridge.
• Any Port VH of Port[j] may be mapped to any downstream bridge of any VS.
• Any Port VH of Port[j] may be mapped to the upstream bridge of any non-authorized
VS. This upstream bridge may or may not have an MR-IOV capability in its Type 1
Configuration header. If it has one, only selected fields in the Type 1 header are
operational and none of the MR-IOV tables located in memory are visible. See <REF>
for details.
• Port VHs of Port[j] need not be mapped into any VS.
These settings ensure that the Initial MR-PCIM will never be present on this Port. This Port can be
connected, directly or indirectly via Switches, to Devices, Root Ports or Bridges.
This Port is not authorized. Attempts by software attached to this Port to configure the MR Switch
will fail unless the Port is later Authorized.
A non-PCIe port may also be used to manage MR Topologies. These ports consist of Vendor
Specific hardware that appears to an MR Switch as an upstream Port. Such Ports have a Port Table
entry with the Non-PCIe bit set.
This Vendor Specific hardware allows MR-PCIM to issue and respond to the subset of PCIe
transactions needed to manage the MR Topology.
Such hardware must allow MR-PCIM to issue:
• Configuration Read and Write Requests (Type 0 and Type 1) of sizes 1, 2 or 4 bytes,
naturally aligned.
• Memory Space Read and Write Requests (32 bit and 64 bit addressing) of sizes 1, 2, 4
and 8 bytes, naturally aligned.
• Message Transactions normally issued by Root Ports (e.g. PME Turn Off messages).
Such hardware must allow MR-PCIM to respond to:
• Completions related to the above (including completion status and support for
Completion Timeout).
• Posted Memory Write transactions for MSI Interrupts
• Message Transactions (e.g. INTx, PME_TO_Ack, PM_PME, Errors, …)
This is the minimum support needed to configure the MR Topology. Additional support may be
required if MR-PCIM uses the MR Topology for general I/O (e.g. Logging). Additional support may
also be required if Vendor Specific device management software also needs to use this port.
A Non-PCIe Switch Management Port i is defined as a Port with the following characteristics:
• Port[i].Link_Direction indicates an upstream Port.
• The Port operates in Base PCIe mode (the link cannot be MR Enabled).
• Some VS[j] has mapped {Port[i], Port VH 0} to its upstream bridge. This mapping may be
fixed.
• The upstream P2P Bridge Configuration Header for VH 0, VS[j] has a full MR-IOV
Capability.
• Bit j of the VS Authorization Bitmap is Set. This bit may be read only.
• The Management VS value is Vendor Specific. It could be j or it could be some other VS.
• VS[j] contains enough Enabled Downstream P2P Bridges to access all Ports of the MR
Switch.
These settings ensure that MR-PCIM could manage the Switch using this Port. MR-PCIM is not
required to be present on this Port.
This Port is authorized. In particular, any software using it will be allowed to manage the Switch.
Note: Since a VS has a single upstream bridge, these rules imply that every Non-PCIe Switch
Management Port will be associated with a distinct VS.
Figure 3-1 shows an example MR Topology. Both Switches are MRA Switches. The three green
Root Ports labeled RP 0, RP 1 and RP 2 are capable of running MR-PCIM software and configuring
the topology. The blue RP 3 is not permitted to run MR-PCIM. Devices are a mixture of Base PCIe,
SR Aware and MR Aware Devices.
Switch Ports K, L, M, N are Non-PCIM Capable; Devices never run MR-PCIM.
Continuing with the example from Section 3.1.1.4, assume that RP 0 is chosen for the Initial MR-
PCIM. Vendor Specific mechanisms are used to prevent RP 1 and RP 2 from accessing the MR
Topology. For example, the affected processors might be powered off or held in reset.
For each Port that trains as a downstream Port, MR-PCIM will examine the Link Partner Training
Status fields in the Port Table. For each Link Partner that is an Authorized Port on an MR Switch,
MR-PCIM will determine whether the Port is mapped into the MR-PCIM VS and if necessary map
the Port into an unused downstream bridge. It will then establish a Bus Number range for the
downstream bridge so that PCIe Configuration transactions can be issued to the Link Partner.
MR-PCIM then probes the configuration header of the MR Switch attached to the Port.
• If the component is a “new” MR Switch because the Switch's MR Switch Number field has a
number not assigned by MR-PCIM in this enumeration cycle, the enumeration process repeats
to configure the new Switch.
• If the component is an “old” MR Switch that MR-PCIM has seen before, the MR Switch
Number and connection information is noted but further enumeration is not needed via this
connection. Note that enumeration of the “old” Switch may not be complete, but it will be
completed using other links into the Switch.
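The “new” versus “old” test reduces to remembering which MR Switch Numbers this enumeration cycle has handed out. A schematic sketch, with all accessors hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_SWITCH_NUM 256
static bool assigned_this_cycle[MAX_SWITCH_NUM];
static uint16_t next_switch_number = 1;

/* Hypothetical accessors for the link partner's MR-IOV capability. */
uint16_t read_mr_switch_number(void *sw);
void write_mr_switch_number(void *sw, uint16_t n);
void enumerate_switch(void *sw);          /* full per-switch enumeration */
void note_extra_path(void *sw);           /* record a redundant connection */

void probe_mr_switch(void *sw)
{
    uint16_t n = read_mr_switch_number(sw);
    if (n < MAX_SWITCH_NUM && assigned_this_cycle[n]) {
        note_extra_path(sw);              /* "old" switch: no re-enumeration */
    } else {                              /* "new" switch this cycle */
        write_mr_switch_number(sw, next_switch_number);
        assigned_this_cycle[next_switch_number] = true;
        next_switch_number++;
        enumerate_switch(sw);
    }
}
```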
Continuing the example, assume the link between RP 0 and Switch A trains. Vendor Specific
mechanisms are used to prevent RP 1 and RP 2 from starting (links may train but no transactions
will be generated from these RPs). The link between RP 3 and Port D can't train since it consists of
two downstream Ports.
MR-PCIM software operating in RP 0 could proceed as follows:
1. Reads Type 1 Configuration Header at Port A, detects MR-IOV capability indicating an MR
Switch.
2. Configures BAR registers in Switch A so that the Port Table can be examined.
3. Assigns MR Switch Number 42 to Switch A.
4. Detects that Port A, VH 0 was mapped to VS n.
5. Notices link attached to Port E detected something present but did not train. Switches
Link_Direction for link E to downstream thus allowing the E to F link to train.
6. Notices link attached to Port X detected something present but did not train. Switches
Link_Direction for link X to downstream thus allowing the X to Y link to train.
7. Notices that Links G, H, I and J trained as downstream. Each Port’s Link Partner Training
Status indicates no MR Switch is connected so no additional enumeration is needed at this
time.
8. Notices that Port E is connected to an Authorized MR Switch. If needed, maps Port E to
some downstream bridge of VS n of Switch A (it might already be mapped). Using this
downstream bridge Switch B is enumerated.
9. Reads the Type 1 Configuration Header of Port F, detects MR-IOV capability indicating an
MR Switch.
10. Configures BAR registers in Switch B so that the Port Table can be examined.
11. Assigns MR Switch Number 86 to Switch B.
12. Detects that Port F, VH 0 was mapped to VS m.
13. Notices that Links K, L, M and N trained as downstream. Each Port’s Link Partner Training
Status indicates no MR Switch is connected so no additional enumeration is needed at this
time.
14. Notices link attached to Port D detected something present but did not train. Switches
Link_Direction for link D to upstream thus allowing the D to RP 3 link to train. This step
might be delayed until later if preventing link training was the Vendor Specific mechanism
used to hold off RP 3 from enumerating the topology.
15. Notices that Port X is connected to an Authorized MR Switch. If needed, maps Port X to
some downstream bridge of VS n of Switch A (it might already be mapped). Using this
downstream bridge Switch B is enumerated. Reads the Type 1 Configuration Header of
Port Y, detects MR-IOV capability indicating an MR Switch. The previously assigned MR
Switch Number of 86 indicates that Port X is a second path to Switch B. Since Switch B
has been (or is still being) enumerated using Port E, no further enumeration is needed using
Port X.
This is an example; other enumeration orderings are also valid. Note that the link directions of the
E to F and X to Y links depend on the chosen ordering. If, for example, instead of changing the
Link_Direction of Port X in Step 6 the Link_Direction of Port Y were changed later during the
enumeration of Switch B, the X to Y link would train in the opposite direction.
Note: MR-PCIM assigns MR Switch register locations in PCIe Memory Space during this discovery
process. These assignments may be changed by subsequent software.
Information may not be available for Root Ports (MR or PCIe). The Vendor Specific mechanisms
used to hold off transactions might also prevent the link from training so the Link Partner Training
Status may not yet be meaningful.
[Figure 3-3: Example device assignment: a mixture of MR, PCIe and SR Devices, each supporting 8 VHs, some with no VF Map and some with VF Map (one also with Migrate) support.]
Table 3-2 expands on the example shown in Figure 3-3, adding VF and VF Mapping Policy Decision
information.
The view from each Root Port is shown in Figure 3-4 through Figure 3-7 below. Switch VS Table
programming to implement this is shown in Table 3-3 and Table 3-4.
[Figures 3-4 through 3-7: the virtual hierarchy seen from RP 0, RP 1, RP 2 (running MR-PCIM) and RP 3, each showing the subset of Switch A and Switch B Ports (A through N, X, Y) visible in that Root Port's VH.]
5 VS 0 Downstream 4 No
6 VS 0 Downstream 5 No
7 VS 0 Downstream 6 No
8 VS 0 Downstream 7 No
14 RP 1 VS 1 Downstream 4 Yes No
15 RP 1 VS 1 Downstream 5 Yes No
16 RP 1 VS 1 Downstream 6 Yes No
17 RP 1 VS 1 Downstream 7 Yes No
23 RP 2 VS 2 Downstream 4 Yes No
24 RP 2 VS 2 Downstream 5 Yes
25 VS 2 Downstream 6 No
26 VS 2 Downstream 7 No
27 VS 3 Upstream No
28 VS 3 Downstream 0 No
29 VS 3 Downstream 1 No
30 VS 3 Downstream 2 No
31 VS 3 Downstream 3 No
32 VS 3 Downstream 4 No
33 VS 3 Downstream 5 No
34 VS 3 Downstream 6 No
35 VS 3 Downstream 7 No
5 VS 0 Downstream 4 No
6 VS 0 Downstream 5 No
7 VS 0 Downstream 6 No
8 VS 0 Downstream 7 No
14 RP 0 VS 1 Downstream 4 Yes No
15 RP 0 VS 1 Downstream 5 Yes No
16 RP 0 VS 1 Downstream 6 Yes No
17 RP 0 VS 1 Downstream 7 Yes No
23 VS 2 Downstream 4 No
24 VS 2 Downstream 5 No
25 VS 2 Downstream 6 No
26 VS 2 Downstream 7 No
32 VS 3 Downstream 4 No
3 Port I1 in Figure 3-7. Two PFs of Device I are assigned to RP 3. As far as RP 3 is concerned, these are
independent Devices.
4 Port I2 in Figure 3-7.
33 VS 3 Downstream 5 No
34 VS 3 Downstream 6 No
35 VS 3 Downstream 7 No
PF 0:0: BaseLVF = 0, InitialVFs = 8, TotalVFs = 8
PF 1:0: BaseLVF = 8, InitialVFs = 16, TotalVFs = 16
PF 2:0: BaseLVF = 24, InitialVFs = 8, TotalVFs = 8

LVF #      VF State   VF          VF Map     MVF Context
LVF 0,0    A.A        VF 0:0,1    MVF 0,0    MVF 0,0
LVF 0,1    A.A        VF 0:0,2    MVF 0,1    MVF 0,1
LVF 0,2    A.A        VF 0:0,3    MVF 0,2    MVF 0,2
LVF 0,3    A.A        VF 0:0,4    MVF 0,3    MVF 0,3
LVF 0,4    A.A        VF 0:0,5    MVF 0,4    MVF 0,4
LVF 0,5    A.A        VF 0:0,6    MVF 0,5    MVF 0,5
LVF 0,6    A.A        VF 0:0,7    MVF 0,6    MVF 0,6
LVF 0,7    A.A        VF 0:0,8    MVF 0,7    MVF 0,7
LVF 0,8    A.A        VF 1:0,1    MVF 0,8    MVF 0,8
LVF 0,9    A.A        VF 1:0,2    MVF 0,9    MVF 0,9
LVF 0,10   A.A        VF 1:0,3    MVF 0,10   MVF 0,10
LVF 0,11   A.A        VF 1:0,4    MVF 0,11   MVF 0,11
LVF 0,12   A.A        VF 1:0,5    MVF 0,12   MVF 0,12
LVF 0,13   A.A        VF 1:0,6    MVF 0,13   MVF 0,13
LVF 0,14   A.A        VF 1:0,7    MVF 0,14   MVF 0,14
LVF 0,15   A.A        VF 1:0,8    MVF 0,15   MVF 0,15
LVF 0,16   A.A        VF 1:0,9    MVF 0,16   MVF 0,16
LVF 0,17   A.A        VF 1:0,10   MVF 0,17   MVF 0,17
LVF 0,18   A.A        VF 1:0,11   MVF 0,18   MVF 0,18
LVF 0,19   A.A        VF 1:0,12   MVF 0,19   MVF 0,19
LVF 0,20   A.A        VF 1:0,13   MVF 0,20   MVF 0,20
LVF 0,21   A.A        VF 1:0,14   MVF 0,21   MVF 0,21
LVF 0,22   A.A        VF 1:0,15   MVF 0,22   MVF 0,22
LVF 0,23   A.A        VF 1:0,16   MVF 0,24   MVF 0,23
LVF 0,24   A.A        VF 2:0,1    MVF 0,23   MVF 0,24
LVF 0,25   A.A        VF 2:0,2    MVF 0,26   MVF 0,25
LVF 0,26   A.A        VF 2:0,3    MVF 0,25   MVF 0,26
LVF 0,27   A.A        VF 2:0,4    MVF 0,27   MVF 0,27
LVF 0,28   A.A        VF 2:0,5    MVF 0,28   MVF 0,28
LVF 0,29   A.A        VF 2:0,6    MVF 0,29   MVF 0,29
LVF 0,30   A.A        VF 2:0,7    MVF 0,30   MVF 0,30
LVF 0,31   A.A        VF 2:0,8    MVF 0,31   MVF 0,31

(A.A = Active.Available; the VF Map column shows the MVF each LVF is mapped to.)
Device M is assigned NumVHs of 3. VF Mapping is shown in Figure 3-9. VF Migration Capable is
Set.
PF 0:0: BaseLVF = 0, InitialVFs = 6, TotalVFs = 8
PF 1:0: BaseLVF = 8, InitialVFs = 14, TotalVFs = 16
PF 2:0: BaseLVF = 24, InitialVFs = 6, TotalVFs = 8

LVF #      VF State   VF          VF Map     MVF Context
LVF 0,0    A.A        VF 0:0,1    MVF 0,0    MVF 0,0
LVF 0,1    A.A        VF 0:0,2    MVF 0,1    MVF 0,1
LVF 0,2    A.A        VF 0:0,3    MVF 0,2    MVF 0,2
LVF 0,3    A.A        VF 0:0,4    MVF 0,3    MVF 0,3
LVF 0,4    A.A        VF 0:0,5    MVF 0,4    MVF 0,4
LVF 0,5    A.A        VF 0:0,6    MVF 0,5    MVF 0,5
LVF 0,6    I.U        VF 0:0,7    none       MVF 0,6
LVF 0,7    I.U        VF 0:0,8    none       MVF 0,7
LVF 0,8    A.A        VF 1:0,1    MVF 0,6    MVF 0,8
LVF 0,9    A.A        VF 1:0,2    MVF 0,7    MVF 0,9
LVF 0,10   A.A        VF 1:0,3    MVF 0,8    MVF 0,10
LVF 0,11   A.A        VF 1:0,4    MVF 0,9    MVF 0,11
LVF 0,12   A.A        VF 1:0,5    MVF 0,10   MVF 0,12
LVF 0,13   A.A        VF 1:0,6    MVF 0,11   MVF 0,13
LVF 0,14   A.A        VF 1:0,7    MVF 0,12   MVF 0,14
LVF 0,15   A.A        VF 1:0,8    MVF 0,13   MVF 0,15
LVF 0,16   A.A        VF 1:0,9    MVF 0,14   MVF 0,16
LVF 0,17   A.A        VF 1:0,10   MVF 0,15   MVF 0,17
LVF 0,18   A.A        VF 1:0,11   MVF 0,16   MVF 0,18
LVF 0,19   A.A        VF 1:0,12   MVF 0,17   MVF 0,19
LVF 0,20   A.A        VF 1:0,13   MVF 0,18   MVF 0,20
LVF 0,21   A.A        VF 1:0,14   MVF 0,19   MVF 0,21
LVF 0,22   I.U        VF 1:0,15   none       MVF 0,22
LVF 0,23   I.U        VF 1:0,16   none       MVF 0,23
LVF 0,24   A.A        VF 2:0,1    MVF 0,20   MVF 0,24
LVF 0,25   A.A        VF 2:0,2    MVF 0,21   MVF 0,25
LVF 0,26   A.A        VF 2:0,3    MVF 0,22   MVF 0,26
LVF 0,27   A.A        VF 2:0,4    MVF 0,25   MVF 0,27
LVF 0,28   A.A        VF 2:0,5    MVF 0,26   MVF 0,28
LVF 0,29   A.A        VF 2:0,6    MVF 0,27   MVF 0,29
LVF 0,30   I.U        VF 2:0,7    none       –
LVF 0,31   I.U        VF 2:0,8    none       –

(A.A = Active.Available; I.U = Inactive.Unavailable; “none” indicates a migration hole with no MVF mapped.)
There may be multiple Authorized VSs. Software running in these VSs is allowed to view and
change the Switch MR-IOV configuration. To avoid confusion, coordination between such software
is needed but is not specified by this specification.
A “backup” MR-PCIM operating in any Authorized VS may become the active MR-PCIM simply by
changing the Management VS register in the affected Switch(es). This feature can support failover
from one MR-PCIM to a backup MR-PCIM. The Suppress Reset Propagation feature of a VS can
be used to prevent a reset caused by the failure of one MR-PCIM from destroying state needed to
continue operation using a backup MR-PCIM.
Mechanisms used to detect MR-PCIM failure, to select the new MR-PCIM location, to initiate the
failover, etc. are undefined by this specification.
TLPs targeting MR-PCIM are regular PCIe TLPs within the appropriate VH. Such TLPs are
forwarded based on the switch configuration when they are received. Old TLPs are not re-routed
due to switch reconfiguration or change in VS Authorization.
In addition, the optional VF Mapping and VF Migration features use the terms Logical Virtual
Function (LVF) and Mission Virtual Function (MVF). These are similarly designated as follows:
• LVF f,s indicates LVF table slot number s attached to PF number f.
• MVF f,s indicates MVF number s attached to PF number f. MVFs do not have a VH (a VH
is associated with the LVF that an MVF is mapped to; see below).
During topology enumeration, MR-PCIM detects MR Devices by noticing the presence of an MR-
IOV Capability in PF 0’s Configuration Header.
Initializing and managing a Device in MR mode involves managing four aspects of the Device:
• Configuring and enabling the VHs.
• Enabling and managing the optional MR flow control features: This involves configuring the
number of Virtual Links used, configuring the number of VCs offered in each VH,
configuring the {VH, VC} to VL mapping hardware, configuring any VH to VL arbitration
hardware and configuring any VL to Link arbitration hardware.
• Enabling and managing the optional VF Mapping features: This involves configuring the
number of LVFs offered by each PF in each VH and configuring each PF's LVF to MVF
map.
• Enabling and managing the optional VF Migration features: This involves leaving “holes” in
the LVF to MVF map to support migration, enabling VF Migration, responding to requests
to initiate VF migration and interacting with SR-PCIM software to accomplish VF migration.
These aspects will be described separately in the following sections.
The NumVH register in the Main BF indicates to the Device how many VHs are going to be
used by MR-PCIM. Software should set this value based on the value of MaxVH, on the
number of VHs implemented at the upstream end of the link and on the number of VHs
needed by the system. Once software has enabled additional VHs, the NumVH value may not
be changed.
The MR-IOV Capability of every BF contains a pointer to the Function Table. This table
contains one entry for every Function associated with the BF. The table is indexed by VH
number since every BF contains a single Function in each VH (exception: VH 0 need not have
a Function, so the first Function Table entry might not be used; see the Function Offset field in
Section 4.2.4.1).
The Function Table entry for the Main BF contains a VC ID to VL Map. This map includes a
Map Enable bit. A VH is considered Enabled when, for some VC, the VC ID to VL Map
entry is Enabled, points to a VL that is Enabled, and software operating in the VH enables
that VC. See section 4.2.4.4 for additional details.
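A minimal sketch of the VH Enabled test just described. Field names and widths are illustrative (eight VC map entries, a VL enable bit vector), not normative:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    map_enable;   /* VC ID to VL Map entry Enable */
    uint8_t vl;           /* VL this VC maps to */
    bool    vc_enabled;   /* VC enabled by software operating in the VH */
} vc_map_entry_t;

/* A VH is Enabled when, for some VC, the map entry is Enabled, it points
 * to an Enabled VL, and software in the VH has enabled that VC. */
bool vh_enabled(const vc_map_entry_t map[8], uint32_t vl_enable_bits)
{
    for (int vc = 0; vc < 8; vc++)
        if (map[vc].map_enable &&
            ((vl_enable_bits >> map[vc].vl) & 1u) &&
            map[vc].vc_enabled)
            return true;
    return false;
}
```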
The PF h:f VC Enabled fields allow MR-PCIM to determine which VC Resources are
enabled by software operating in the VH. When VC Enabled changes value, the VC Config
Changed interrupt is signaled to MR-PCIM.
The PF h:f VC ID fields allow MR-PCIM to determine the VC ID assignments made by
software operating in the VH.
The PF h:f TC to VC map field allows MR-PCIM to determine the mapping between TCs
and VCs made by software operating in the VH.
The number of populated LVFs offered to SR-PCIM is contained in InitialVFs. If SR-PCIM does
not enable VF Migration, only these slots are used and any additional unpopulated slots remain
unused.
Initially, MR-PCIM must populate LVFs for a given PF using the lower numbered VFs first. Holes
left for migration (if any) follow the last populated LVF.
Before it enables SR-IOV operation, SR-PCIM in each VH must set the NumVFs value to indicate
the number of VFs it wishes to use. The value set must be less than or equal to TotalVFs. If VF
Migration is not enabled by SR-PCIM, the value set must also be less than or equal to InitialVFs.
In some Devices, the Device Programming Model assumes that operations on one Function can
affect another Function of the Device. In MR, this dependency manifests itself as a relationship
between MVFs of some of the Device's BFs. The Function Dependency Link field in a BF indicates
the presence of a dependency. See section 4.2.1.2 for details. A similar dependency exists in the SR-
IOV specification, but there it deals with dependencies associated with VF assignment to SIs.
VF Migration is enabled for a PF h:f when both of the following are true:
• In at least one VH (h), MR-PCIM software has Set the VF Migration Capable bit in the
Function Table entry controlling PF h:f.
• In that VH (h), SR-PCIM software controlling PF h:f has also Set the VF Migration Enabled
bit.
VF Migration centers around the LVF Table:
• The VF State field manages the SR-IOV and MR-IOV combined view of the migration state
of a VF. This table is used to gracefully add and remove VFs to or from VHs.
• The VF Map field maps LVFs onto MVFs. When the VF State is Inactive.Unavailable,
software can write this field to implement a change.
VF migration follows the state diagram shown in Figure 3-11. The state values shown are contained
in the VF State field associated with the VF. State transitions indicated by solid lines are initiated by
MR software by writing the new state value to the VF State field. State transitions indicated by
dashed lines are typically initiated by SR-PCIM and are visible to MR-PCIM via the VF State field.
The mechanisms used for this are described in the SR-IOV Specification.
[Figure 3-11: VF State diagram; SR-initiated transitions include SR: Activate and SR: Deactivate.]
VFs that are in the Inactive.Unavailable state are not usable by software in the VH. Configuration,
IO and Memory Requests within the VH targeting the associated VF return UR. Within 100 ms of
transitioning to this state, a VF must stop issuing Requests.
VFs that are in the Inactive.MigrateIn state (1) will respond to Configuration Requests issued by
software running in the VH, (2) if MSE is Set, will respond to Memory requests issued by software
running in the VH and (3) will not issue Requests.
The state transition from Active.MigrateOut to Inactive.Unavailable Sets the MR VF Migration
Status bit. If the MR VF Migration Interrupt Enable is Set, this in turn causes an MSI to be queued
to MR-PCIM. MR-PCIM software can then scan the LVF Table to determine the cause of the
interrupt. Specifically, MR-PCIM is looking for the VFs that it previously placed in the
Active.MigrateOut state and are now in the Inactive.Unavailable state.
State transitions with the notation “SR: Set VF Migration Status” cause similar behavior within the
VH. See the SR-IOV Specification for details.
The following steps are used by MR-PCIM to migrate a VF from one VH to another.
1. A request arrives from higher level software requesting that VF h:f,s be migrated to VF a:f,c.
For the request to be valid, the VF Table Entry associated with VF h:f,s must be in the
Active.Available state and the VF Table Entry associated with VF a:f,c must be in the
Inactive.Unavailable state.
2. Initiate a Migrate Out operation in VH h by writing the VF State entry associated with
VF h:f,s to Active.MigrateOut.
3. Wait for SR-PCIM to stop using the VF and to indicate so by transitioning the VF State
to Inactive.Unavailable. This transition sets the MR VF Migration Status bit and can raise an
interrupt to MR-PCIM.
4. Save the value of the VF Map entry associated with VF h:f,s.
5. Set the VF Map entry associated with VF h:f,s to zero to indicate an empty slot.
6. Set the VF Map entry associated with VF a:f,c to the value saved in Step 4.
7. Initiate a Migrate In operation in VH a by writing the VF State entry associated with VF a:f,c
to the Inactive.MigrateIn state.
8. At some point, SR-PCIM will transition the VF to the Active.Available state and start using it.
In addition to the graceful migration described above, MR-PCIM can retract a Migrate In or Migrate
Out request that it previously issued.
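The eight-step sequence above can be sketched directly in C. The LVF Table accessors are hypothetical stand-ins; in a real implementation Step 3 is driven by the MR VF Migration Status interrupt rather than a blocking wait:

```c
#include <stdint.h>

typedef enum {
    INACTIVE_UNAVAILABLE,
    INACTIVE_MIGRATE_IN,
    ACTIVE_AVAILABLE,
    ACTIVE_MIGRATE_OUT
} vf_state_t;

/* Hypothetical LVF Table accessors, indexed by (VH, PF, slot). */
vf_state_t read_vf_state(int vh, int f, int s);
void write_vf_state(int vh, int f, int s, vf_state_t st);
uint16_t read_vf_map(int vh, int f, int s);
void write_vf_map(int vh, int f, int s, uint16_t mvf);
void wait_for_state(int vh, int f, int s, vf_state_t st);

int migrate_vf(int h, int a, int f, int s, int c)
{
    /* Step 1: validate the request. */
    if (read_vf_state(h, f, s) != ACTIVE_AVAILABLE ||
        read_vf_state(a, f, c) != INACTIVE_UNAVAILABLE)
        return -1;

    write_vf_state(h, f, s, ACTIVE_MIGRATE_OUT);      /* Step 2 */
    wait_for_state(h, f, s, INACTIVE_UNAVAILABLE);    /* Step 3 */

    uint16_t mvf = read_vf_map(h, f, s);              /* Step 4 */
    write_vf_map(h, f, s, 0);                         /* Step 5: empty slot */
    write_vf_map(a, f, c, mvf);                       /* Step 6 */
    write_vf_state(a, f, c, INACTIVE_MIGRATE_IN);     /* Step 7 */
    /* Step 8: SR-PCIM in VH a later activates and uses the VF. */
    return 0;
}
```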
Software in a VH expects to see the initial VF configuration shown in Figure 3-12. MR-PCIM
must ensure this condition by appropriate programming of the VF Migration tables.
[Figure 3-12: Initial VF configuration: VF 1 through VF n (n = InitialVFs) are Active.Available; VF n+1 through VF m (m = TotalVFs) are Inactive.Unavailable; TotalVFs ≥ InitialVFs.]
• The LVF table region assigned to the PF, [BaseLVF .. BaseLVF + TotalVFs – 1], does not
overlap with the region assigned to any other PF of the BF.
• If InitialVFs > 0, all LVF entries in the range [BaseLVF .. BaseLVF + InitialVFs – 1] are in
state Active.Available and are mapped to valid MVFs.
• If InitialVFs != TotalVFs, all LVF entries in the range [BaseLVF + InitialVFs ..
BaseLVF + TotalVFs – 1] are in state Inactive.Unavailable.
After a PF is Reset or when VF Enable is Cleared and then Set, a valid initial VF configuration must
be re-established. The InitialVFs value may be different from an earlier initial configuration so long
as the configuration meets the rules described in Section 3.2.4.1.
This process can be accomplished by hardware or software and must be completed within 1 sec (to
avoid an SR software timeout resulting in the hardware being declared broken).
This process starts by adjusting InitialVFs to reflect the number of active VFs associated with the
PF and then rearranging those active VFs into lower numbered VFs, keeping the same relative VF
ordering.
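The compaction step can be sketched as follows, whether performed by hardware or software. The structure layout is illustrative, and the enum values are stand-ins for the VF State encodings:

```c
#include <stdint.h>

enum { INACTIVE_UNAVAILABLE, ACTIVE_AVAILABLE };

typedef struct {
    int      state;     /* ACTIVE_AVAILABLE or INACTIVE_UNAVAILABLE */
    uint16_t mvf;       /* mapped MVF, 0 if none */
} lvf_entry_t;

/* Operates on the PF's LVF region [base .. base+total_vfs-1]; compacts
 * active entries into the lowest-numbered slots, preserving relative VF
 * order, and returns the new InitialVFs value. */
int reestablish_initial_config(lvf_entry_t *lvf, int base, int total_vfs)
{
    int dst = base;
    for (int src = base; src < base + total_vfs; src++) {
        if (lvf[src].state == ACTIVE_AVAILABLE) {
            lvf[dst] = lvf[src];                 /* keep relative order */
            if (dst != src)
                lvf[src] = (lvf_entry_t){ INACTIVE_UNAVAILABLE, 0 };
            dst++;
        }
    }
    for (int i = dst; i < base + total_vfs; i++)
        lvf[i] = (lvf_entry_t){ INACTIVE_UNAVAILABLE, 0 };
    return dst - base;                           /* new InitialVFs */
}
```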
Implementation Note: As described in the SR-IOV Specification, if VF Migration Enable was Set,
SR software must wait 1 second after clearing VF Enable for InitialVFs and TotalVFs to become
valid.
Any MR Root Complex must address the following areas. The mechanisms used are vendor specific
and outside the scope of this specification.
• Method for deciding whether to enable MR operation on a given Link
• Method for enabling and disabling VLs (VL0 is always enabled, others are controlled by
software)
• Method for enabling and disabling VHs (VH0 is always enabled, others are controlled by
software)
• Method for controlling the number of VCs offered to each RP (could be fixed)
In addition, software running above the RC must determine how to configure and use the MR
topology. This involves determining, for example, how the various VHs on each MR Link should be
used, how VCs and VLs should be managed, etc.
The mechanism used to determine this is outside the scope of this specification. A variety of
mechanisms can be used (in any combination) including:
• Using arbitrary out-of-band communication paths
• Using the MR Switch “Link Partner” registers in MR Switches to see the values sent in MRInit
DLLPs.
Note that the MR RP is always upstream on all VHs. This means that MR-PCIM cannot, in general,
access MR RP Configuration Space (exception: MR-PCIM can access the RP it is running above and
may be able to access other RPs in the same RC).
If Switch X authorizes the port connected to Link X, MR-PCIM could manage Switch X via RP 0.
Similarly, if Switch Y authorizes the port connected to Link Y, MR-PCIM could manage Switch Y
via RP 3.
If Switches X and Y are managed from the MR Root Complex and are in distinct MR topologies
with no connection between them, the two MR-PCIMs above RP 0 and RP 3 are independent as
well. If there is a single interconnected MR topology, there must be a single MR-PCIM and it can use
either RP 0 or RP 3 but not both (i.e. the Management VH must always be a tree).
4. Configuration
The following sections list the configuration requirements for Base Function (BF), Physical
Functions, MRA Switches and MR-PCIMs.
There are up to five tables provided by each Base Function. These tables are located using the MR-
IOV Capability block located in the Type 0 Configuration Space of the associated BF. An overview
of the tables is shown in Figure 4-1.
• The MR-IOV Capability contains information concerning the BF.
• The VH Table contains information concerning each VH supported by the Device. There is
exactly one VH Table per Device and it is associated with an arbitrary BF.
• The optional VL Arbitration Table contains information describing how VL Arbitration is
performed by the Device. If present, this table is associated with the BF that contains the VH
Table.
• The optional LVF Table is used to control VF Mapping and VF Migration. This table is absent
if neither VF Mapping nor VF Migration is supported by the BF.
• The Function Table contains one entry for each Function or PF in each non-management VH.
[Figure 4-1: BF MR-IOV Capability and table layout. The Capability (in Configuration Space) holds NumVH, MaxVH, VL and Function Mapping Control/Status, the Function Table Offset, LVF Table Offset, VL Arb Table Offset, Statistics Start/Busy bits for Statistics Blocks 0 through 31 (offset 48h) and, at offset 4Ch, the Statistics Descriptor Table Offset and BIR. These locate the Function Table (entries 0 through MaxVH-1), the LVF Table, the VL Arbitration Table, and the Statistics Descriptor and Statistics Block tables with their counters.]
Table 4-2 defines the Device MR-IOV Extended Capability Header. The Capability ID for the
Device MR-IOV Extended Capability is 0011h.
Table 4-2: Device MR-IOV Extended Capability Header
This includes both the Function Table and the Function Interrupt Bitmap.
The Function Interrupt Bitmap immediately precedes the Function Table. The size of the Function
Interrupt Bitmap is 32 bytes (supporting a maximum of 256 VHs). Bit 0 of the first DWORD
corresponds to VH 0; bit 1 corresponds to VH 1, and so on. Any unused bits are Read Only Zero.
If VF Mapping is supported, this register indicates the MVF Index values that may be assigned to a
VF. Values in the range [1..Max MVF] may be written to the LVF Table mapping field.
Table 4-8: VF MVF Region
The total size of the LVF Table (in bytes) is: Max LVF * 4
These fields contain the TLP Prefix and TLP Header corresponding to the error described by the
First Error Pointer in the MR Error Status register.
The value of these fields is undefined if the First Error Pointer is zero or points to a bit number that
is not Set.
Headers are not logged and the First Error Pointer is not updated for DLLP Errors.
Device and Switch Statistics related fields are described in Section 4.3.7.
[Figure: LVF Table and Function Table entry layout. The LVF Table (in BAR Memory Space) holds, for each of LVF0 through LVFMax, an MVF # and a VF State field. The Function Table entry holds NumVFs; the VF Migration Enabled, VF Enabled, VF Initialization Pending Status, VF Enable Changed, VF Migration Status, PF Reset Initiated, MFVC Config Changed and VC Config Changed bits; the VC ID to VL Map (VC0 through VC7); read-only VC State (VC Resource 0 through 7 State: TC→VC Map, VC ID, VC Negotiation Pending, VC Enabled); and read-only MFVC State (MFVC Resource 0 through 7 State with the same fields).]
Setting any of bits 7:0 of this register Sets the corresponding Function Interrupt Bitmap entry if
the Function Interrupt Enable bit is Set. When software clears all of these bits, or clears the
Function Interrupt Enable bit, the corresponding Function Interrupt Bitmap entry is also cleared.
Table 4-21: Function Status
This table maps VCs in a VH to VLs. To preserve PCI Express ordering and flow control
independence assumptions, each VC must be assigned to a distinct VL.
This map is only meaningful on the Main BF of the Device. In all other BFs, all fields of this map
are Read Only Zero.
These fields return data from the VC Capability of the associated Type 1 Configuration Header.
They allow MR-PCIM software to track the enabling and mapping of VCs with each VH.
VC Resource fields for resource numbers above Num VC Resource Hardware Present are not
implemented and return 0 when read. VC Resource fields for resource numbers above Extended VC
Count are Undefined.
VC Resource State 0 is located at offset 1Ch; VC Resource 1 is located at offset 1Eh; etc. Fields
within VC Resource State are described in Table 4-25.
Table 4-24: Function Table VC Resource State
These fields return data from the MFVC Capability of the associated Type 1 Configuration Header.
They allow MR-PCIM software to track the enabling and mapping of VCs with each VH.
VC Resource fields for resource numbers above Num MFVC Resource Hardware Present are not
implemented and return 0 when read. VC Resource fields for resource numbers above Extended
MFVC Count are Undefined.
VC Resource State 0 is located at offset 30h; VC Resource 1 is located at offset 32h; etc. Fields
within VC Resource State are described in Table 4-25.
The Function Interrupt Status bitmap precedes the Function Table. It is always 32 bytes (supporting
a maximum of 256 VHs).
Bits in this table are Read Only. A bit is Set to indicate the Function has an interrupt pending and
Clear otherwise. These Interrupt Status bits are cleared either by clearing the appropriate Function
Interrupt Pending bit or by masking the interrupt using the Function Interrupt Enable.
An MSI Interrupt is requested on any zero to one transition of any of these bits.
BIST remains optional in MR-IOV. The results of invoking BIST in any non-BF Function must not
affect any other VH.
The results are undefined if software invokes BIST in a BF when any VC to VL Map Enable bit in
any Function Table entry of any BF of the Device is Set.
There are nine tables provided by the Switch. These tables are located using the MR-IOV Capability
block located in the Type 1 Configuration Space of the upstream P2P Bridge(s). An overview of the
tables is shown in Figure 4-5.
[Figure 4-5: MRA Switch MR-IOV Capability and table layout. The Capability holds the MR-IOV Capability Header, Capability/Control/Status bits, VS# and VS Bridge#, table offsets and entry sizes, the VS Authorization Bitmap, Backup VS Authorization, Statistics Start/Busy bits and the Statistics Descriptor and Block offsets. These locate the Port Table (Port0 through PortN entries, each pointing to a per-Port VL Arb Table, preceded by the Port Interrupt Bitmap), the VS Table (VS0 through VSN entries), the VS Bridge Table (Bridge0 through BridgeM entries per VS) and the Statistics Descriptor, Statistics Block and counter tables.]
• The Port Table contains an entry for every Port on the Switch. The table may be sparse (i.e.
contain unused table entries) for hardware implementation flexibility. The Port Table contains
the fields that control the physical link. The Port Table also points to the optional VL
Arbitration Table. Preceding the Port Table is a 256 bit Port Interrupt Summary.
• The optional VL Arbitration Table supports controlling the arbitration between Virtual Links
for access to a Port. This table is modeled after the VC Arbitration Table in PCI Express.
• The VS Table contains an entry for every Virtual Switch in the MR Switch. The table may be
sparse for hardware implementation flexibility. Preceding the VS Table is a 256 bit VS
Interrupt Summary.
• The VS Bridge Table contains an entry for every PCI-to-PCI bridge in every Virtual Switch.
This table is a two dimensional array indexed by VS number and by P2P Bridge number.
Within a VS, the first entry corresponds to the upstream P2P Bridge and the remaining entries
are the downstream bridge(s). This table may also be sparse, but the upstream entry of each VS
must be present if the associated VS is present. No entries are present in this table unless the
associated VS Table is also present.
• The optional Statistics Descriptor Table contains descriptions of the varieties of performance
counters and statistics information supported. This table is read only and contains one entry
for each counter style supported by the component.
• The optional Statistics Block Table contains an entry for every block of related statistics
counters. Each entry contains controls for the block and an offset to the array of counters.
• The optional Statistics Counter Table contains the actual counters and sampled values. There
is one table for each Statistics Block. The offset and size of this table is contained in the
associated Statistics Block Table.
These tables must be visible in the Upstream P2P Bridge of all authorized VHs. A subset of the
MR-IOV Capability (but none of the other tables) may optionally be present in the Upstream P2P
Bridges of non-authorized VHs. This subset MR-IOV Capability must be present in VH 0
Upstream P2P Bridges that are attached to MR aware Root Ports.
00h  MR-IOV Capability Header (Next Cap Offset / Vers / Capability ID = 0011h)
04h  MR-IOV Capability Bits (MSI Vector #)
08h  MR Switch Number
38h  Statistics Start / Busy (Statistics Block 0 through Statistics Block 31 Start / Busy bits)
3Ch  Statistics Descriptor Table Offset / BIR
Table 4-26 defines the Switch MR-IOV Extended Capability Header. The Capability ID for the
Switch MR-IOV Extended Capability is 0011h.
Table 4-26: Switch MR-IOV Extended Capability Header
This Read Only register returns the VS number and P2P Bridge number within the VS of this
Type 1 Configuration Header. In the upstream P2P Bridge of a VS, it also indicates the
Authorization status of the VS.
Table 4-30: Switch MR-IOV This Bridge Map
The Watchdog Timer is used to ensure that a backup MR-PCIM can recover and take over from the
primary MR-PCIM. One key area is the situation where discovery has failed and the Initial MR-
PCIM has altered the switch configuration(s) such that the backup MR-PCIM can't manage a switch
(e.g. the Backup MR-PCIM VS could have been de-authorized, some inter-switch link directions
could be configured inappropriately, etc.).
Two Watchdog Timers are provided:
• Timer 1 supports sending an interrupt, reauthorizing selected Virtual Switches and resetting
the Link Direction on selected Ports.
• Timer 2 supports a complete reset of the switch. This can be used as a “fall back” mechanism
in the case that Timer 1 was not able to reconfigure the switch.
This register determines which Virtual Switches are authorized to manage the MR Switch. It also
indicates which VS receives “route to MR-PCIM” messages initiated by this switch.
Table 4-32: Switch MR-IOV Authorization Control
The Port Table starts at the Port Table Offset. The Port Interrupt Bitmap immediately precedes the
Port Table (i.e. it starts 32 bytes before the Port_Table_Offset).
The VS Table starts at the VS Table Offset. The VS Interrupt Bitmap immediately precedes the VS
Table (i.e. it starts 32 bytes before the VS_Table_Offset).
The VS Bridge Table Entry associated with Bridge 0 of VS N immediately follows the last Bridge
Table Entry associated with VS N-1.
Device and Switch Statistics related fields are described in section TBD.
[Figure: Port Table entry layout. 00h Port Capability (PCIe offset, MaxVH, Non-PCIe Port (Management Port), Port Present); 04h Port MR-IOV Control (NumVH, Port Interrupt Enable, VL Enable, Send PME_Enter_L23 DLLP, PM_PME Triggers Beacon / Wake#, Backup Link Direction Control, Port Enable, Link Direction Control); 0Ch Port Status (Port Interrupt Pending, VL Negotiation Pending); 10h Link Partner Training Status (Link Partner MaxVH, Link Partner VH FC, Link Partner Type, Link Direction, Link Trained in MR Mode, Link Partner Authorized, Link Partner Protocol Version, Link Partner MaxVL, Detected). The Port Interrupt Bitmap precedes the Port Table.]
Table 4-44: Switch Link Partner Training Status – MRInit DLLP Bits

Bit(s)  Description                                                      Attr
18:16   Link Partner MaxVL – Maximum number of VLs that the Link         RO
        Partner can support.
19      Reserved                                                         RO
22:20   Link Partner Protocol Version – MR-IOV Protocol version          RO
        supported by the Link Partner. For this version of the
        specification, this is the value 1h.
23      Link Partner was Authorized – If the Link Partner is an MR       RO
        Switch, this bit indicates that at the time of Link Training the
        VS associated with VH0 of this link was allowed to manage the
        switch (authorization can be revoked so this may no longer be
        accurate). This corresponds to the VS is Authorized bit in the
        Link Partner's MR-IOV Capability.
27:24   Link Partner Type – Indicates what kind of PCI Express Device    RO
        is present as Function 0 of the Link Partner. Encoding is
        identical to the Device/Port Type field in the PCI Express
        Capabilities (Offset 02h, Bits 7:4).
29:28   Reserved                                                         RO
30      Link Partner VH FC – If Set, indicates the Link Partner          RO
        supports per-VH and per-VL Flow Control. If Clear, indicates
        that the Link Partner supports only per-VL Flow Control.
31      Reserved                                                         RO
These fields contain the TLP Prefix and TLP Header corresponding to the error described by the
First Error Pointer in the MR Error Status register.
The value of these fields is undefined if the First Error Pointer is zero or points to a bit number that
is not Set.
Headers are not logged and the First Error Pointer is not updated for DLLP Errors.
These fields are the PCI Bridge controls that affect the Physical Port.
Table 4-50: PCI Bridge Control
These fields are the PCI Express controls that affect the Physical Port. Values in various
Configuration Spaces are either Virtual values or map to these values (see the Bridge Controls
Physical bit described in Section 4.3.6.2).
The layout of this structure is similar to the PCI Express Capability, dropping the Root Capability,
Control and Status words. All fields are implemented as defined in the PCI Express Specification
except as indicated in Table 4-51.
[Figure: Port Interrupt Bitmap layout. One Interrupt Status bit per Port, packed 32 Ports per DWORD (Port 0..31 Interrupt Status in the first DWORD, Port 32..63 in the second, and so on), immediately preceding the Port Table in MMIO Space.]
The Port Interrupt Status bitmap precedes the Port Table. It is always 32 bytes (supporting switches
with a maximum of 256 ports).
Bits in this table are Read Only. A bit is Set to indicate the Port has an interrupt pending and Clear
otherwise. These Interrupt Status bits are cleared either by clearing the appropriate Port Interrupt
Pending bit or by masking the interrupt using the Port Interrupt Enable.
An MSI Interrupt is requested on any zero to one transition of any of these bits.
Bits corresponding to Ports that are not Present are Read Only Zero.
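A minimal sketch of how software might scan such a 256-bit Interrupt Status bitmap on an MSI. The register accessor and service routine are hypothetical, and __builtin_ctz is a GCC/Clang builtin used here for brevity:

```c
#include <stdint.h>

/* Hypothetical register accessor: DWORD i of the 32-byte bitmap. */
uint32_t read_bitmap_dword(int i);
/* Hypothetical: handle one entry and clear its underlying Pending bit. */
void service_entry(int index);

void on_msi_interrupt(void)
{
    for (int i = 0; i < 8; i++) {            /* 8 DWORDs = 256 bits */
        uint32_t w = read_bitmap_dword(i);
        while (w) {
            int bit = __builtin_ctz(w);      /* lowest Set bit */
            service_entry(i * 32 + bit);
            w &= w - 1;                      /* strip that bit */
        }
    }
}
```

The same scan applies to the VS and Function Interrupt Status bitmaps, which share this layout.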
[Figure: VS Table entry layout. 00h VS Capability (VS Present); 04h VS Control (VS Global Key Value, VS Global Key Check Enable Bits, VS Suppress Reset Propagation, VS Interrupt Enable, VS Enable); 08h-??h VS Bridge Interrupt Status (Bridge0 through BridgeN Interrupt Status bits). The VS Interrupt Bitmap precedes the VS Table.]
This field contains one bit per P2P Bridge in the VS. Bit 0 of the first DWORD corresponds to VS
Bridge Table Entry 0 (i.e. the Upstream Bridge); bit 1 corresponds to VS Bridge Table Entry 1.
This field contains INT((Num_Bridge_Table_Entries + 31) / 32) DWORDs.
Bits in this field are Set only when any of the following bits are Set in the VS Bridge Table:
• VS Status Changed
Bits in this field are Cleared by clearing one of these “Changed” bits.
If VS Interrupt Enable is Set and any of the bits in this field are Set, the corresponding VS Interrupt
Summary bitmap bit is Set.
[Figure: VS Interrupt Bitmap layout. One Interrupt Status bit per Virtual Switch, packed 32 VSs per DWORD (VS 0..31 Interrupt Status in the first DWORD, VS 32..63 in the second, and so on), immediately preceding the VS Table in MMIO Space.]
The VS Interrupt Status bitmap precedes the VS Table. It is always 32 bytes (supporting switches
with a maximum of 256 Virtual Switches).
Bits in this table are Read Only. A bit is Set to indicate the VS has an interrupt pending and Clear
otherwise. These Interrupt Status bits are cleared either by clearing the appropriate VS Bridge
Interrupt Pending bit using the VS Bridge Table or by masking the interrupt using the VS Interrupt
Enable.
An MSI Interrupt is requested on any zero to one transition of any of these bits.
Bits corresponding to Virtual Switches that are not Present are Read Only Zero.
[Figure: VS Bridge Table entry layout. 00h Bridge Capability & Status (Entry Size, Num Bridges, Power Controller Changed, Power Indicator Changed, Attention Indicator Changed, VC Config Changed, PME Turn Off State & Status, Num VC Resources Present, Max Payload Size Supported, Hot-Plug Hardware Present, Bridge Hardware Present); 04h Bridge Control (Bridge Port, Bridge Port VHN, Port Mapped to Bridge, VC Config Interrupt Enable, Bridge Controls Physical Link, Bridge Enable); 08h Low Priority Extended VC Count, Extended VC Count, Max Payload Size Offered; 0Ch-10h VC ID to VL Map (VC0 through VC7 map fields); 14h-20h read-only VC State (VC Resource 0 through 7 State: TC→VC Map, VC ID, VC Negotiation Pending, VC Enabled); 24h Physical Slot Number, Slot Power Limit Scale/Value, Hot Plug Capable, Hot Plug Surprise, Power Controller Present; 28h Virtual Hot-Plug “Virtual Signals Interface” (Signal Power Fault, Force Presence Detect, Reset, Attention Button Push, Hot-Plug Interrupt Enable, Slot Implemented, Power Controller State, Attention Indicator State, Power Indicator State).]
VS Bridge Table Entry 1 corresponds to the downstream P2P Bridge located at Device 0, Function
0 on the VS internal bus. VS Bridge Table Entry N (1≤N≤32) corresponds to the downstream P2P
Bridge located at Device N-1, Function 0 on the VS Internal bus. VS Bridge Table Entries above 32
are used for P2P Bridges located at non-zero Function numbers as shown in the following table.
VS Bridge Table Entry N Device Function
1..32 N-1 0
33..64 N-33 1
65..96 N-65 2
97..128 N-97 3
129..160 N-129 4
161..192 N-161 5
193..224 N-193 6
225..256 N-225 7
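For illustration, the table above reduces to a simple division/modulus rule; the following sketch (names are assumptions) derives the Device and Function numbers from an entry number N:

    /* Informative sketch: VS Bridge Table Entry N (1 <= N <= 256) maps
     * to Device (N-1) mod 32 and Function (N-1) / 32 on the VS internal
     * bus, matching the table above. */
    static void entry_to_devfn(unsigned n, unsigned *device, unsigned *function)
    {
        *device   = (n - 1u) % 32u;   /* e.g., Entry 33 -> Device 0   */
        *function = (n - 1u) / 32u;   /* e.g., Entry 33 -> Function 1 */
    }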
Behavior is undefined if a Bridge Port / Bridge VHN combination is simultaneously mapped into
more than one VS Bridge Table entry.
These fields contain the Virtual Link to be used for traffic sent out of this P2P Bridge for the indicated VC. VC to VL mapping is not needed, and these fields are Read Only Zero, if the MaxVL value is zero for all Ports of the switch.
Software may not map multiple VCs to the same VL. Specifically, within a single VS Bridge Table
entry, behavior is undefined if multiple enabled VL Map entries contain the same map value.
These fields return data from the VC Capability of the associated Type 1 Configuration Header.
They allow MR-PCIM software to track the enabling and mapping of VCs with each VH.
VC Resource fields for resource numbers above Num VC Resource Hardware Present are not
implemented and return 0 when read. VC Resource fields for resource numbers above Extended VC
Count are Undefined.
VC Resource State 0 is located at offset 0Ch; VC Resource 1 is located at offset 0Eh; etc. Fields within VC Resource State are described in Table 4-59.
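For illustration, the stride implied above is two bytes per VC Resource State; a minimal sketch (the helper name is an assumption):

    /* Informative sketch: offset of VC Resource State n, per the text
     * above (0Ch, 0Eh, 10h, ...). */
    static unsigned vc_resource_state_offset(unsigned n)
    {
        return 0x0Cu + 2u * n;   /* n = 0 -> 0Ch, n = 1 -> 0Eh, ... */
    }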
These bits form the “Virtual Signals Interface” of the Virtual Hot-Plug controller. These registers
allow MR-PCIM to indicate to software what Hot-Plug features are supported and to control those
features.
See Chapter 6 Hot Plug for additional details.
Virtual Hot-Plug controller hardware is optional. Presence of hardware is indicated by the Hot Plug Hardware Present bit. If no hardware is present, the Slot Implemented bit is Read Only zero and some of the bits in this section are Undefined. Virtual Hot-Plug hardware is only present in downstream Ports.
If the Bridge Controls Physical Link bit is Set, the Virtual Hot-Plug Signals Interface Registers are
not used and their content is Undefined.
Alternate RID Interpretation (ARI) support must be provided in all downstream PCI-to-PCI bridges
of MRA Switches. Specifically, the ARI Forwarding Supported bit located in the Device Capabilities
2 register must be set and the ARI Forwarding Enable bit located in the Device Control 2 register
must be implemented.
In addition to the fields described below, each Capability contains Statistics Interrupt Status and
Statistics Interrupt Enable bits.
These fields define the sizes of the Statistics Tables pointed to by the MR-IOV Capability.
If the Performance Monitoring and Statistics Collection Capability is not implemented, these fields
are Read Only Zero.
Table 4-64: Statistics Table Sizes
Vendor Specific statistics are designated CSEL[Vendor ID, Collection ID, n]. Vendor ID is assigned by the PCI SIG. Collection ID is a Vendor defined value used to select a set of S bit definitions. The value n is in the range [0..39] (inclusive) and the corresponding S bit number is n + 168 (if mapped using Group 1) or n + 192 (if mapped using Group 2). The meaning of a Vendor Specific statistic is not affected by whether it is mapped using Group 1 or Group 2.
This mechanism allows a single counter to support any mixture of standard events and vendor
defined events from up to two sets of S bit definitions (from either the same or different Vendors).
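For illustration, a minimal sketch of the S bit numbering rule above (the function name is an assumption):

    /* Informative sketch: S bit number for Vendor Specific statistic
     * CSEL[Vendor ID, Collection ID, n], n in [0..39]. Group 1 maps to
     * n + 168; Group 2 maps to n + 192. Returns -1 on bad arguments. */
    static int vendor_statistic_s_bit(unsigned n, unsigned group)
    {
        if (n > 39u || (group != 1u && group != 2u))
            return -1;
        return (int)(n + (group == 1u ? 168u : 192u));
    }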
Table 4-69 contains Standard Statistics defined by this standard. Items marked Sample correspond to sampled values while items marked Count correspond to counted values. Standard filters are defined in Section 4.5.2.2.
Table 4-69: Standard Statistics
33 | Available {VH, VL} Transmit Credits — Number of available transmit credits associated with a {VH, VL}, computed as (CREDIT_LIMIT − CREDITS_CONSUMED) mod 2^[Field Size] | Sample | Required Credit Filters: VL, VH, Credit Type
97 | Available {VH, VL} Receive Credits — Number of available receive credits associated with a {VH, VL}, computed as (CREDITS_ALLOCATED − CREDITS_RECEIVED) mod 2^[Field Size] | Sample | Required Credit Filters: VL, VH, Credit Type
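For illustration, both formulas are modular subtractions over a [Field Size]-bit counter space; a minimal C sketch under that reading (names are assumptions):

    /* Informative sketch: Table 4-69 credit computations. Counters wrap
     * modulo 2^field_size, so the difference is masked to field_size
     * bits. */
    #include <stdint.h>

    static uint32_t available_credits(uint32_t limit_or_allocated,
                                      uint32_t consumed_or_received,
                                      unsigned field_size)
    {
        uint32_t mask = (field_size >= 32u) ? 0xFFFFFFFFu
                                            : ((1u << field_size) - 1u);
        /* Transmit: (CREDIT_LIMIT - CREDITS_CONSUMED)      mod 2^[Field Size]
         * Receive:  (CREDITS_ALLOCATED - CREDITS_RECEIVED) mod 2^[Field Size] */
        return (limit_or_allocated - consumed_or_received) & mask;
    }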
Credit Filtering consists of three filters: Credit Type, VL and VH. Unsupported credit filters are
hardwired to zero.
Table 4-71: Credit Filters
DLLP Filtering is optional and consists of a number of filters. Unsupported DLLP filters are
hardwired to zero.
Table 4-72: DLLP Filters
5. Error Handling
Error Name | Error Type | Detecting Agent Action (PCIe) | Detecting Agent Action (MR-IOV)
Receiver Error | Correctable | Receiver (if checking): Send ERR_COR to Root Complex. | Receiver: Same as PCIe but send on all enabled VHs not in Reset.
Bad TLP | Correctable | Receiver: Send ERR_COR to Root Complex. | Receiver: Same as PCIe but send to all enabled VHs not in Reset.
Bad DLLP | Correctable | Receiver: Send ERR_COR to Root Complex. | Receiver: Same as PCIe but send on all enabled VHs not in Reset.
Replay Timeout | Correctable | Transmitter: Send ERR_COR to Root Complex. | Transmitter: Same as PCIe but send on all enabled VHs not in Reset.
REPLAY NUM Rollover | Correctable | Transmitter: Send ERR_COR to Root Complex. | Transmitter: Same as PCIe but send on all enabled VHs not in Reset.
Data Link Layer Protocol Error | Uncorrectable (Fatal) | If checking, send ERR_FATAL to Root Complex. | Same as PCIe but send to all enabled VHs not in Reset.
Receiver Overflow | Uncorrectable (Fatal) | Receiver (if checking): Send ERR_FATAL to Root Complex. | Receiver (if checking): Same as PCIe. Send ERR_FATAL to all Root Complexes that have a VC on the affected link mapped to the affected VL.
Flow Control Protocol Error | Uncorrectable (Fatal) | Receiver (if checking): Send ERR_FATAL to Root Complex. | Receiver (if checking): Same as PCIe. Send ERR_FATAL to all Root Complexes that have a VC on the affected link mapped to the affected VL.
Malformed TLP | Uncorrectable (Fatal) | Receiver: Send ERR_FATAL to Root Complex. Log the header of the TLP that encountered the error. | Receiver: Same as PCIe. Error message sent only for affected VH. Log the header for the affected VH only.
5.2. MR Errors
Table 5-4: MR Error List
Error Name | Error Type | Detecting Agent Action (PCIe) | Detecting Agent Action (MR-IOV)
Invalid TLP Prefix | Uncorrectable | N/A | Receiver: Signal MR Uncorrectable TLP Error, discard TLP, do not update flow control (VL / VH can’t be trusted).
TLP received on VH in reset (sender of TLP has Acked entering Reset) | Uncorrectable | N/A | Receiver: Signal MR Uncorrectable TLP Error, discard TLP, update flow control normally.
TLP Prefix with Global Key Mismatch: TLP at destination or forwarded on PCIe Link | Uncorrectable | N/A | Receiver: Signal MR Global Key Error, discard TLP, update flow control normally.
TLP Prefix with Global Key Mismatch: TLP being forwarded on MR Link | Correctable | N/A | Receiver: Signal MR Global Key Error, forward TLP normally.
TLP Prefix {VH, VL} that is invalid, not enabled or has not finished Flow Control Initialization | Uncorrectable | N/A | Receiver: Signal MR Uncorrectable TLP Error, discard TLP, do not update flow control (VL / VH can’t be trusted).
MRUpdateFC for {VH, VL} that is invalid, not enabled or has not finished Flow Control Initialization | Correctable | N/A | Receiver: Signal MR DLLP Error, discard DLLP.
Invalid VH Group in Reset DLLP | Correctable | N/A | Receiver: Signal MR DLLP Error, discard DLLP.
Out of range Assert bit set in Reset DLLP | Correctable | N/A | Receiver: Signal MR DLLP Error, ignore the offending Assert bit(s) and process remainder of Reset DLLP normally.
6. Hot Plug
2. Push virtual buttons of the Virtual Hot-Plug controller (i.e. change bits in the Virtual Slot
Status register)
3. Detect Indicator and Control changes made to the Virtual Hot-Plug controller (i.e. detect
certain changes to the Virtual Slot Control and Virtual Slot Capabilities registers)
The following register fields are defined for this function:
Bit Field | MR-PCIM View | Purpose
Virtual Slot Number | Read / Write | 1
Virtual Slot Power Limit Scale / Value | Read Only | 3
Virtual Slot Implemented | Read / Write | 1
Virtual Hot-Plug Capable | Read / Write | 1
Virtual Hot-Plug Surprise | Read / Write | 1
Virtual Data Link State | Read / Write | 2
Virtual Power Controller State | Read Only | 3
Virtual Power Controller State Changed | Read / Write One to Clear | 3
Virtual Power Controller Present | Read / Write | 1
Virtual Power Indicator State | Read Only | 3
Virtual Power Indicator State Changed | Read / Write One to Clear | 3
Virtual Attention Indicator State | Read Only | 3
Virtual Attention Indicator State Changed | Read / Write One to Clear | 3
Virtual Presence Detect State | Read / Write | 2
Press Virtual Attention Button | Read Zero, Write One to Set | 2
Signal Virtual Power Fault | Read Zero, Write One to Set | 2
when an MRA Switch is being used as a Base PCIe Switch. It is also useful if MR-PCIM chooses to
delegate authority for managing the Physical Link to software running in a specific VH (typically
because the associated Port is attached to a Base PCIe Device).
7. Power Management
MR systems continue to need power management capabilities. Two varieties of power management
are involved:
• Virtual power management allows software running in a VH to believe that it has turned off power to one or more virtual functions.
• Actual power management allows MR-PCIM software to control device power.
7.1. Overview
ASPM and PCI-PM are expanded for MR-IOV. MR Components have both virtual and physical D-states. Slots have virtual and physical power states. Virtual and physical ASPM controls also exist.
Figure 7-1: Multi-Root Wake-Up Scenarios
To address this, MR Switches implement a number of power management features:
• The ability to send Beacon / WAKE# on receipt of certain PM_PME messages.
• The ability to detect Beacon / WAKE# and convert it into a PM_PME message.
notices the Beacon / WAKE# event and powers up the Device using the Physical Power Controller
associated with Port 3. Once the link comes up, Scenario A applies.
Scenario C: This mechanism is also used to power up the Device in Figure 7-1 Scenario C. The Device is powered up as in Scenario B. When the PM_PME Message is sent, it is addressed to either Root 1 or Root 2 and Scenario A applies to the addressed Root.
and the Bridge Controls Physical Link bit is Clear, when the VH turns off power using the PCIe Hot-Plug controller, the VH is sent a reset.
The virtual power controller has no effect on physical power. Turning off virtual power causes the
MR Switch to send Reset DLLPs to cleanse state from the affected Components.
A form factor can allow MR-PCIM to control the physical power to a slot. As in PCIe, doing so is optional. This control occurs through the Port Table (see Section 4.3.3.12 for details).
8. Congestion Management
Exceeding the bandwidth of a link or capacity of a buffer can lead to congestion in a topology. In a
Multi-Root Topology, congestion may affect the performance of unrelated VHs and lead to
Completion Timeout errors. This chapter defines mechanisms for detecting and controlling
congestion in a Multi-Root Topology.
8.1. Overview
There are three possible causes of congestion in a PCIe topology:
• A fault in hardware or software configuration of a device in the topology.
• A static rate mismatch in the capacity of the path from a component injecting traffic into the topology (e.g., a Device) and the ultimate destination (e.g., a Root Port). Congestion due to a static rate mismatch would occur even if the topology were otherwise idle.
• Traffic merging of multiple flows, none of which individually suffer from a static rate mismatch, causing the capacity of an element in the topology to be exceeded.
While the causes of congestion are the same in both single and multi-root PCIe topologies, it is
desirable to provide the ability to manage, limit and contain the congestion caused by one VH on
other VHs in the system. The congestion management mechanisms outlined in this section allow
management of congestion that is unique to MR topologies (i.e., due to traffic merging from
different VHs). These mechanisms do not address congestion within a VH since this congestion
would have been present in an equivalent PCIe Base topology.
MR congestion management mechanisms provide the following benefits:
• Preserve the behavior of Virtual Channels (VCs) defined by the PCIe Base specification within a VH.
• Allow systems to be constructed where a fault in one VH does not result in errors (e.g., Completion Timeouts) in another VH.
• Allow systems to be constructed that support forward progress guarantees on a VH or groups of VHs when congestion exists on an unrelated VH.
• Support a wide range of implementation options. At one extreme, they allow creation of MR Devices through incremental changes to SR Devices. At the other extreme, they allow implementations that support complete isolation between virtual hierarchies.
A port of an MRA component may support from one to eight VLs. Virtual Links are uniquely identified using a Virtual Link Identifier (VL ID). There is a fixed one-to-one mapping between VL IDs and VL resources (e.g., VL resource 0 always has a fixed ID of zero (i.e., VL0)).
A port of an MRA component may support from one to 256 Virtual Hierarchies. Virtual Hierarchies
associated with a port are uniquely identified using a Virtual Hierarchy Number (VHN). VHNs are
Link specific and do not represent a global VH identifier.
Each port is independently configured and managed, allowing implementations to vary the number of VLs and VHs supported per Port based on usage model-specific requirements.
MR DLLPs used for flow control accounting contain VHN and VL ID information. Unlike PCIe
Base TLPs that contain only TC and no VC information in the header, the MR TLP prefix tag
contains both VHN and VL ID information simplifying the Flow Control accounting done at each
Port of a Link.
Rules for allocating VL IDs to VL hardware resources associated with a Port are as follows:
• VL ID assignment must be one-to-one.
• The same VL ID cannot be assigned to different VL hardware resources within the same Port.
Rules for assigning VH VCs to VL hardware resources associated with a Port are as follows:
• (VH, VC) assignment must be the same (matching) for the two Ports on both sides of a Link.
• (VH0, VC0) is assigned at initialization, but not fixed, to the default VL.
MR-PCIM is responsible for configuring ports on both sides of an MR link in a consistent manner.
Support for VLs beyond the default VL0 is optional. VL0 is always enabled and, while not fixed or “hardwired,” by default there is a one-to-one mapping between VCs and VLs. Therefore, MR topology initialization may proceed using (VH 0, VC 0) mapped to VL 0 and does not require any specific hardware or software configuration.
MR-PCIM is responsible for enabling VLs and configuring the mapping of VCs associated with VHs to VLs:
• VL0 is always enabled.
• For VLs 1-7, a VL is considered enabled when the corresponding VL Enable bit in the MR-IOV Control register has been set to 1b in the BF, and once FC negotiation for that VL has exited the MRFC_INIT2_VL state.
For VLs 1-7, MR-PCIM must use the VL Negotiation Pending bit in the MR-IOV Status register to determine when a VL is enabled.
Every VC resource of a VH associated with a Port visible to software operating in the VH must be mapped to an enabled VL. Since the number of VLs supported by components on a Link is implementation specific, and only one VC of any VH may be mapped to a given VL, the number of advertised VC resources to software operating in the VH must not exceed the number of enabled VLs associated with the Port.
If a function only implements the default VC0 resource, then no configuration is necessary.
• For Virtual Hierarchies that do not have a MFVC Capability structure associated with
the Port, then the VC Extended VC Count field in the Function Control 2 register must
be initialized to a value such that the number of VC resources advertised to software
operating in the VH is less than or equal to the number of enabled VLs. As a result of
this configuration, the VC Low Priority Extended VC Count field in the Function
Control 2 register may need to be initialized to a value consistent with the VC Extended
VC Count field.
• For virtual hierarchies that have a MFVC Capability structure associated with a Port, the
MFVC Extended VC Count field in the Function Control 1 register must be initialized
to a value such that the number of VC resources advertised to software operating in the
VH is less than or equal to the number of enabled VLs. As a result of this initialization,
the MFVC Low Priority Extended VC Count field in the Function Control 1 register
may need to be initialized to a value consistent with the MFVC Extended VC Count
field.
For Switches this is managed through the corresponding Virtual Switch (VS) Bridge Table
Entry as follows:
• The VC Extended VC Count field in the Switch VS Bridge Control 2 register must be initialized to a value such that the number of VC resources advertised to software operating in the VH is less than or equal to the number of enabled VLs. As a result of this configuration, the VC Low Priority Extended VC Count in the Function Control 2 register may need to be initialized to a value consistent with the VC Extended VC Count field.
8.2.1.3. VC to VL Mapping
A Virtual Link is established when one or more VC IDs from different VHs are associated with a physical resource designated by a VL ID.
Components with Ports that implement VLs beyond the default VL must also implement an
associated VC to VL mapping capability. The VC to VL mapping capability is optional for Ports that
implement only the default VL0. Ports that do not have an associated VC to VL mapping capability
must map VC0 from all supported VHs to VL0.
In order to preserve the semantics of a VC defined in the PCIe Base specification, only one VC of a
given VH may be mapped to a VL. The behavior when two or more VCs from the same VH are
mapped to a single VL is undefined.
Given the above requirement, knowledge that a VH is mapped to a VL together with the VC to VL mapping function associated with that VH is sufficient to determine the (VH, VC). Thus, indicating that a VL has a mapped VH, or (VL, VH), is synonymous with specifying the (VH, VC).
VC to VL mapping is from virtual VC IDs to VLs and is controlled as follows (a validation sketch follows this list):
• For Devices, the VC to VL mapping is controlled by fields in the Function VC to VL Map register associated with the BF of each VH. This map is from Virtual Hierarchy VC resources as they would have appeared on the Link in an equivalent PCIe Base component. Thus, if function 0 in the VH contains a MFVC Capability structure then this mapping is from VC IDs managed by the MFVC Capability structure. Otherwise, this mapping is from VC IDs managed by the BF VC Capability structure.
• For Switches, the VC to VL mapping is controlled by fields in the VC to VL Map register in a VS Bridge Table entry.
• A VC ID x is mapped to a VL y when the VCx VL Map field has been initialized with a value of y and the corresponding VCx VL Map Enable bit has been set.
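For illustration, the following sketch (register model and names are assumptions, not defined by this specification) validates a proposed VC to VL map for one entry against the rules above before it is programmed:

    /* Informative sketch: check a VC to VL map for a single VS Bridge
     * Table entry (or BF). Within one entry, no two enabled VCs may map
     * to the same VL, and each mapping must target an enabled VL. */
    #include <stdbool.h>
    #include <stdint.h>

    struct vc_vl_map_entry {
        bool    enabled;   /* VCx VL Map Enable bit */
        uint8_t vl;        /* VCx VL Map field: target VL ID (0..7) */
    };

    /* enabled_vls: bit N set => VL N is enabled (bit 0, VL0, always set) */
    static bool vc_vl_map_is_valid(const struct vc_vl_map_entry map[8],
                                   uint8_t enabled_vls)
    {
        uint8_t used = 0;                       /* VLs already claimed */
        for (unsigned vc = 0; vc < 8; vc++) {
            if (!map[vc].enabled)
                continue;
            if (map[vc].vl > 7u || !(enabled_vls & (1u << map[vc].vl)))
                return false;                   /* target VL not enabled */
            if (used & (1u << map[vc].vl))
                return false;                   /* two VCs on one VL: undefined */
            used |= (uint8_t)(1u << map[vc].vl);
        }
        return true;
    }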
VC to VL mapping may be performed during MR-PCIM initialization or on an as-needed basis as dictated by software operating in the VH. The PCIe Base specification supports the arbitrary mapping of VC IDs to VC resources. Thus, in general, MR-PCIM has no a priori knowledge of which VC IDs will be used or how they will be allocated by software operating in the VH. In systems where MR-PCIM possesses this knowledge or in which MR-PCIM can communicate desired allocation to software operating in the VH, VC IDs may be mapped to VLs during MR-PCIM initialization (i.e., prior to the instantiation of software operating in the VH). Otherwise, this allocation must be performed by MR-PCIM on an as-needed basis as VC IDs are allocated by software operating in the VH to VC resources.
8.2.1.4. Arbitration
Figure tbd illustrates both VH to VL and VL to Link arbitration associated with an egress Port of an MRA Enabled PCIe Component such as an MR
Root Port, Switch, Bridge, or Device. The component in this example implements two VHs, with
both VHs implementing two VCs. Flows (A, VC 0) and (B, VC 0) are mapped to VL 0 and require
VH to VL arbitration to control the multiplexing of TLPs onto this VL. VLs one and two only have
a single mapped flow and therefore require trivial arbitration. The physical link associated with the
egress Port in this example implements three VLs. VL to Link arbitration is required to control the
multiplexing of VLs onto the physical link.
8.2.1.4.1. VH to VL Arbitration
MRA components with ports that support multiple Virtual Links must implement a VL to Link
arbitration mechanism.
A component may support a hardwired-fixed arbitration algorithm or optional software-configurable algorithm selection. Support for the optional software-configurable arbitration algorithm selection is indicated by the state of the VL Arbitration Table Present bit in the MR-IOV Capabilities register.
If an implementation does not support software configurable algorithm selection, then it must
implement a hardwired-fixed arbitration scheme (e.g., Round Robin) that guarantees forward
progress on all enabled VLs.
The remainder of this section describes the behavior and requirements of software configurable VL
arbitration algorithm selection.
VL arbitration algorithm selection is controlled as follows:
• For Devices, registers associated with VL arbitration are located in the MR-IOV Capability.
• For Switches, registers associated with VL arbitration are located in the Port Table Entry of the corresponding port.
• The VL arbitration table in all components, if present, is located in BAR memory space.
VLs may be partitioned into two priority groups – a lower and an upper group. VLs in the upper
group are arbitrated using strict priority based on VL number while VLs in the lower group are
arbitrated only when there are no packets to process in the upper group. Arbitration within the
lower group may be configured to one of the supported arbitration algorithms described below.
Membership of a VL in the low or high priority group is determined by the state of the corresponding bit in the VL Strict Priority Arbitration field in the VL Arbitration Control register. Since the VL Strict Priority Arbitration field represents a bit vector, VL-to-group assignment is flexible and need not be allocated sequentially based on VL ID.
Among VLs configured for strict priority, priority is based on increasing VL number. VL0 has the
lowest priority while VL 7 has the highest.
The arbitration algorithm for VLs in the low priority group is selected by the VL Arbitration Select
field in the VL Arbitration Control register. Arbitration algorithms supported by an implementation
are advertised in the VL Arbitration Capability field in the VL Arbitration Capability and Status
register and may include the following architected schemes (a selection sketch follows this list):
• Hardware-fixed arbitration, e.g. Round-Robin
• Weighted Round Robin (WRR) arbitration scheme with 32, 64, 128 or 256 phases
• Time-Based Weighted Round Robin (time-based WRR) arbitration scheme with 128 phases
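For illustration, the following sketch (queue probe and names are assumptions) selects the next VL using the two-group scheme described above, with simple Round-Robin standing in for the configured low-priority algorithm:

    /* Informative sketch: strict-priority upper group first (higher VL
     * number wins), then Round-Robin across the lower group. */
    #include <stdbool.h>
    #include <stdint.h>

    extern bool vl_has_pending_tlp(unsigned vl);   /* hypothetical probe */

    /* strict_prio: VL Strict Priority Arbitration bit vector (bit N =>
     * VL N is in the upper group). rr_cursor: persistent Round-Robin
     * position for the lower group. Returns chosen VL or -1 if idle. */
    static int pick_next_vl(uint8_t strict_prio, unsigned *rr_cursor)
    {
        for (int vl = 7; vl >= 0; vl--)            /* upper group */
            if ((strict_prio & (1u << vl)) && vl_has_pending_tlp((unsigned)vl))
                return vl;

        for (unsigned i = 0; i < 8u; i++) {        /* lower group, RR */
            unsigned vl = (*rr_cursor + i) % 8u;
            if (!(strict_prio & (1u << vl)) && vl_has_pending_tlp(vl)) {
                *rr_cursor = (vl + 1u) % 8u;
                return (int)vl;
            }
        }
        return -1;
    }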
This specification establishes a standard framework within which vendors may specify their own
vendor specific arbitration scheme. The definition of vendor-defined arbitration is outside the
scope of this document.
VL arbitration algorithms, e.g., WRR and time-based WRR, operate in a manner analogous to the
schemes defined for VC arbitration in the PCI Express Base specification.
As in the PCI Express Base specification, flow control distinguishes between TLP type (Posted, Non-Posted, and Completion) and Header/Data. Thus, there are six types of tracked flow control information.
The unit of Flow Control credit is 4 DW for data
The unit of Flow Control credit for headers is one maximum-size header plus TLP prefix and
TLP digest
Flow Control is initialized autonomously by hardware only for the default virtual link (VL0)
and virtual hierarchy (VH0).
When other Virtual Links are enabled by MR-PCIM, each newly created VL will follow the
flow control initialization protocol.
(VL, VH) credits are negotiated using the flow control initialization protocol outlined in
Section 2.1.2 whenever MR-PCIM increases NumVH.
A Receiver must never cumulatively issue more than 2047 outstanding unused credits to the
Transmitter for data and 127 for header.
If an Infinite VL and (VL, VH) credit advertisement has been made during initialization, no
Flow Control updates are required for that VL following initialization.
A Receiver that advertises non-infinite VH credits must utilize MRUpdateFC DLLPs for that
VL.
• Independent MRUpdateFC DLLPs are used to track header and data credits associated with VLs and (VH,VC)s.
• As described in Section tbd, Receivers and Transmitters track independent flow control
information for each VL and for each supported VH. For each VL and (VL, VH), the
six types of flow control information outlined above are tracked.
• A TLP in a Receiver’s VLQ or BQ consumes both VL and corresponding VH credits.
• Both VL and corresponding VH credits are released when a TLP is processed and
removed from the logical queuing structure associated with the Receiver for a VL.
• MRUpdateFC DLLPs are only associated with VH credits related with a VL. VL credits
are implicitly computed using state information and MRUpdateFC DLLPs as outlined in
Section 2.4.1.
A Receiver that advertises non-infinite VL credits and infinite VH credits must utilize PCIe
Base UpdateFC DLLPs for that VL.
• UpdateFC DLLPs are used to explicitly track VL header and data credits.
• The VH gating function is unconditionally satisfied for all Credit Types associated with
that VL.
• Receivers and Transmitters track independent flow control information for each VL. For
each VL, the six types of flow control information are tracked.
• A Receiver that advertises infinite VH credits may only implement a VLQ for that VL.
• VL credits are released when a TLP is processed and removed from the logical queuing structure associated with the Receiver for a VL.
If a receiver advertises infinite VH credits on one VL, then it must advertise infinite VH
credits on all VLs.
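For illustration, the dual gating implied by these rules can be modeled as two independent credit gates that must both be satisfied before a TLP is scheduled; a minimal sketch (structure and names are assumptions):

    /* Informative sketch: a TLP on (VH, VL) may be scheduled only when
     * both the VL gate and the corresponding VH gate have sufficient
     * credits; an infinite advertisement satisfies its gate
     * unconditionally. */
    #include <stdbool.h>
    #include <stdint.h>

    struct fc_gate {
        bool     infinite;    /* infinite credit advertisement made */
        uint32_t available;   /* credits currently available */
    };

    static bool may_transmit(const struct fc_gate *vl_gate,
                             const struct fc_gate *vh_gate,
                             uint32_t tlp_cost)
    {
        bool vl_ok = vl_gate->infinite || vl_gate->available >= tlp_cost;
        bool vh_ok = vh_gate->infinite || vh_gate->available >= tlp_cost;
        return vl_ok && vh_ok;   /* both gates must pass */
    }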
Completion of the statistics collection process (i.e., the end of the counting period) may be signaled
via an interrupt.
A component that implements the Performance Monitoring and Statistics Collection Capability
must implement a Statistics Block Table and a Statistics Descriptor Table. These tables are located
in memory space and their location is specified by the Statistics Descriptor Table and Statistics
Block Table registers in the MR-IOV Extended Capability structure.
A set of Statistics Counters that share a common initiation mechanism and statistics collection
process periods is referred to as a Statistics Block. A component that implements the Performance
Monitoring and Statistics Collection Capability may implement one to 32 Statistics Blocks. Each
Statistics Block has an associated Statistics Block Table entry that contains a pointer to a Statistics
Counter Table that holds the Statistics Counters associated with the Statistics Block. The Statistics
Block Table entry also specifies the statistics collection process state (i.e., Idle, Waiting, Counting),
number of entries in the Statistics Counter Table, waiting period, and counting period.
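For illustration, the bookkeeping a Statistics Block Table entry carries might be modeled as follows; the field widths and names are assumptions, not a register layout defined by this specification:

    /* Informative sketch of Statistics Block Table entry content as
     * described above. */
    #include <stdint.h>

    enum stats_state { STATS_IDLE, STATS_WAITING, STATS_COUNTING };

    struct statistics_block_entry {
        enum stats_state state;        /* collection process state */
        uint64_t counter_table_ptr;    /* -> Statistics Counter Table */
        uint32_t num_counters;         /* entries in the Counter Table */
        uint32_t waiting_period;       /* delay before counting starts */
        uint32_t counting_period;      /* length of the counting period */
    };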
Statistics Counters associated with a Statistics Block may have different characteristics and be
associated with different Ports. Associated with each Statistics Counter is a Statistics Descriptor
Index that points to the Statistics Descriptor Table entry that describes statistics that may be
recorded by the Statistics Counter. The actual statistic recorded by a Statistics Counter is selected by
the Statistics Select field.
Associated with each Statistics Counter is a 64-bit counter that is used to report the captured statistic. The counter is formed by the Statistics Counter Low and Statistics Counter High registers. While the field is specified as 64-bits, an implementation is free to implement fewer bits. The number of implemented counter bits is specified by the Statistics Width field and, for standard counters, must be 32-bits or greater.
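For illustration, reading such a counter amounts to concatenating the Low and High registers and masking to the implemented width; a minimal sketch (names are assumptions):

    /* Informative sketch: assemble the 64-bit statistic and mask it to
     * the implemented width given by the Statistics Width field. */
    #include <stdint.h>

    static uint64_t read_statistic(uint32_t low, uint32_t high,
                                   unsigned statistics_width)
    {
        uint64_t raw = ((uint64_t)high << 32) | low;
        if (statistics_width >= 64u)
            return raw;
        return raw & ((UINT64_C(1) << statistics_width) - 1u);
    }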
Associated with each Statistics Counter is an optional filter specified by the Statistics Filter Enable
and Control register. Filters allow refinement in a recorded statistic. For example, rather than count
all transmitted TLPs on a Port, a filter may be used to only count transmitted TLPs from a particular
VH on a particular VL. Each entry in a Statistics Descriptor (i.e., an S-bit) defines required filters
that must be implemented and optional filters that may be implemented.
The format of Statistics Descriptors, standard statistics, filters and requirements are specified in
Section 4.5.2.
A component that implements the Performance Monitoring and Statistics Collection Capability
must implement at least one Statistics Block and at least two Statistics Counters per Port. At least
two Statistics Counters per port must implement Statistics Descriptor standard statistics specified as
required in Table 4-69.