Introduction to PCI Express
A Hardware and Software Developers Guide

Adam H. Wilen
Justin P. Schade
Ron Thornburg
Chapter 1
Introduction
A key similarity between PCI Express and PCI/PCI-X is that many software and configuration space models are preserved among the three technologies.
Aside from the opportunity of introducing a brand new general in-
put/output (I/O) architecture, there are several motivations for writing
this book. One of the primary motivations is to give the reader an easy-to-
follow, introductory technological overview of PCI Express. This book is
not a replacement for reading the PCI Express Specification. The opinion
of the authors is that this book makes the PCI Express Specification eas-
ier to comprehend by giving it a context with extra background and in-
sights into many areas of the technology. The second motivation is to
prepare the industry for a transition to PCI Express architecture by dis-
cussing system-level impact, application-specific transitions, and the gen-
eral timeline for consumer market introduction.
A Quick Overview
PCI Express is a high-performance interconnect that gives more for less,
meaning more bandwidth with fewer pins. PCI Express is designed to
leverage the strengths of yesterday’s general I/O architectures while ad-
dressing immediate and future I/O architectural and mechanical issues
with current technologies such as bandwidth constraints, protocol limita-
tions and pin count. More technically speaking, PCI Express is a high
speed, low voltage, differential serial pathway for two devices to com-
municate with each other. PCI Express uses a protocol that allows de-
vices to communicate simultaneously by implementing dual
unidirectional paths between two devices, as shown in Figure 1.1.
[Figure 1.1: Dual unidirectional paths between Device A and Device B, shown as x1 and x2 scaled links]
Bus frequencies of 250 megahertz and higher are plagued with electrical
challenges and a limited set of solutions. Advancing the bus frequency
beyond 500 megahertz will require massive efforts and yield less than
friendly results, if those results are useable at all. There is no question
that something beyond PCI is required going forward. This is the oppor-
tunity to go beyond the stopgap approach of trying to squeeze more life
out of PCI by simply bumping up the frequency. This is the chance to
make changes that will carry general I/O architecture comfortably into
the next decade.
PCI is based on a protocol that is nearly a decade old. As usage mod-
els change, the protocol must adapt to deal with new models. The ag-
gressive multimedia nature of today’s applications such as streaming
audio and video require the ability to guarantee certain amounts of
bandwidth. The PCI protocol does not have the ability to deal appropri-
ately with these types of deterministic transactions. There is a need to de-
fine an architecture that is equipped to deal with these multimedia usage
models.
System Integrators
System integrators will benefit from the system level information that is
discussed in many sections of this book. Of particular interest is the im-
pact to the current infrastructure, cost structure, flexibility, applications,
and technology timelines. System integrators can take advantage of this
information for developing strategic short-range and long-range goals for
incorporating PCI Express into current and future designs.
Silicon Designers
Silicon designers can use the information in this book to assist in inter-
preting the PCI Express Specification. The technology sections of this
book are written to address and answer many of the “why” questions that
the specification does not clearly articulate. This book can be used to
bring silicon designers up to speed quickly on key PCI Express concepts
before and while reading through the PCI Express Specification.
Software Engineers
Software engineers can use the information in this book to understand
what must be done by BIOS code and drivers to take advantage of PCI
Express features. Software engineers should focus on which features can
take advantage of the existing PCI configuration model and which fea-
tures cannot. The information in the technology section of the book is
helpful in outlining the general flow of software routines in setting up
the advanced features.
Application Engineers
Application engineers will find the entire book useful to support and
drive their customer base through the transition to this new technology.
As with silicon designers, application engineers can use this book to pro-
vide insight and additional understanding to many key areas of the PCI
Express Specification.
Beyond PCI
The first section of this book sets the stage for understanding the motiva-
tions, goals and applications of PCI Express. As a baseline, a brief history
of PCI is explored in Chapter 2, as are the successes and challenges PCI
has encountered during its lifetime. The successes of PCI are discussed as
a foundation for PCI Express, while the challenges PCI faces are disclosed
to reveal areas that need to be addressed by a next generation technol-
ogy.
Chapter 3 includes an investigation of the goals and requirements of
PCI Express. This section explores the metrics and criteria for PCI Ex-
press adoption, with a focus on preserving key commonalities such as in-
frastructure, manufacturing, multi-segment support and cost. In addition
to this, many new capabilities are discussed.
The first section ends with a discussion of next generation applica-
tions, looking closely at the applications for which PCI Express offers
significant benefits beyond existing I/O architectures. This discussion
takes into account the various segments such as desktop, mobile, server,
and communications. New and revolutionary usage models are also discussed as a natural solution to evolving system and I/O requirements.
The Technology
The second section of this book is the general hardware and software ar-
chitecture of PCI Express. This section examines what it means to have a
layered architecture, how those layers interact with each other, with
software, and with the outside world. This section also introduces and
explains the advantages of the PCI Express transaction flow control
mechanisms, closing with a look at PCI Express power management.
Chapter 5 introduces PCI Express as an advanced layered architec-
ture. It includes an introduction to the three key PCI Express architec-
tural layers and their interaction with each other, with software, and with
the outside world. A top-down description follows where the uppermost
layer (the Transaction Layer), which interacts directly with software, is
discussed first, followed by the intermediate (Data Link Layer) and final
layer (Physical Layer). Chapters 6, 7, and 8 examine each of the three ar-
chitectural layers in detail. Following the discussion of the individual PCI
Express layers is a discussion in Chapter 9 on the various transaction flow
control mechanisms within PCI Express. This section describes the or-
dering requirements for the various PCI Express transaction types. The
bulk of this section, however, focuses on the newer flow control policies
that PCI Express utilizes such as virtual channels, traffic classes and flow
control credits.
Chapter 10 presents insights into PCI Express software architecture.
This section focuses on identifying the PCI Express features available in a
legacy software environment. It also includes a discussion of software
configuration stacks and also broader (device driver model, auxiliary)
software stacks for control of advanced features.
Chapter 11 concludes this section with a discussion on PCI Express
power management. This chapter discusses the existing PCI power man-
agement model as a base for PCI Express power management. This base
is used to discuss PCI Express system-level, device-level and bus/link
power management states.
The positions of both the early and the late adopter will be examined. This allows companies to assess
what type of adopter they are and what opportunities, challenges, and
tools are available to them.
Chapter 14 closes the book with a case study of several different PCI
Express-based products. The development phase of each product is ex-
amined along with a discussion of some of the challenges of implement-
ing PCI Express technology.
Chapter 2
The PCI Legacy,
Successes, and
Challenges
PCI arrived alongside the new Pentium® processor line that Intel
was preparing to market. PCI was viewed as the vehicle that would fully
exploit the processing capabilities of the new Pentium line of processors.
The PCI-SIG was given the charter to manage, develop, and promote
PCI. Over the last ten years the PCI-SIG has done exactly that. The PCI
specification is currently in its third revision and many other PCI-related
technologies and concepts have evolved at the hands of this group. The
founding of the PCI-SIG has done more to unify and shape general I/O in
the computing industry than any other forum. PCI Express will use the
influence of this group to assist in its adoption as the next I/O standard.
PCI Successes
As fast as technology has evolved and advanced over the last ten years, it
is amazing how long PCI has remained a viable piece of the computing
platform. The original architects of PCI had no idea that this architecture
would still be integral to the computing platform ten years later. PCI has
survived and thrived as long as it has because of the successes it has en-
joyed. The most noted success of PCI is the wide industry and segment
acceptance achieved through the promotion and evolution of the tech-
nology. This is followed by general compatibility as defined by the PCI
specification. Combine the above with processor architecture independ-
ence, full-bus mastering, Plug and Play operation, and high-performance, low-cost implementation, and you have a recipe for success.
Industry Acceptance
Few technologies have influenced general PC architecture as has PCI.
The way in which this influence can be gauged is by analyzing segment
acceptance and technology lifespan. PCI has forged its way into the three
computing segments (desktop, server, and mobile), as well as communi-
cations, and has become the I/O standard for the last ten years. The pri-
mary force that has made this possible is the PCI-SIG. The PCI-SIG placed
ownership of PCI in the hands of its member companies. These member
companies banded together to drive standardization of I/O into the mar-
ket through the promotion of PCI. A list of current PCI-SIG members can
be found on the PCI-SIG web site at http://www.pcisig.com.
There are two key ways in which PCI has proven flexible enough to keep member companies banded together under the PCI-SIG. The first is that PCI is processor-agnostic in both its frequency and its voltage. This allows PCI
to function in the server market, mobile market, and desktop market
with little to no change. Each of these markets supports multiple proces-
sors that operate at different voltages and frequencies. This allows mem-
bers to standardize their I/O across multiple product groups and
generations. The net effect to the vendor is lower system cost through
the use of common elements that can be secured at lower pricing
through higher volume contracts. For example, a system integrator can
use the same PCI based networking card in all of their product lines for
three to four generations. Along the same line of thought, multiple segments can use the same I/O product, which invokes the economic concept of reduced pricing through economy of scale.
The second way that PCI is flexible is in its ability to support multiple
form factors. The PCI-SIG members defined connectors, add-in cards, and
I/O brackets to standardize the I/O back panel and form factors for the
server and desktop market. The standardization of add-in cards, I/O
brackets and form factors in particular has had a massive impact on the cost
structure of PCI from not only a system integrator’s perspective, but from
a consumer perspective as well. This standardization made the distribu-
tion of PCI-based add-in cards and form-factor-based computer chassis
possible through the consumer channel. For a product to be successful in
the computer consumer market it must be standardized in order to sus-
tain sufficient volumes to meet general consumer price targets.
Defined Specifications
PCI add-in cards and discrete silicon are available from hundreds of dif-
ferent vendors fulfilling just about every conceivable I/O application.
Consumers can choose from over thirty brands of PCI add-in modem
cards alone ranging from several dollars to several hundred dollars in
cost. These PCI add-in solutions can function in systems that feature host
silicon from multiple vendors like Intel and others.
[Figure: PC architecture before and after PCI. In the earlier design, the memory controller, expansion bus controller, and graphics controller share an ISA/EISA bus below the CPU. With PCI, a bridge/memory controller drives a PCI bus hosting the graphics controller, IDE controller, expansion bus controller (ISA, EISA), and PCI cards]
As a full bus-mastering architecture, PCI lets devices initiate transactions instead of waiting for the host bridge to service the device. The net effect to the
system is a reduction of overall latency in servicing I/O transactions.
Because of the flexibility of the architecture, system vendors could get away with building fewer motherboard
variations. A single variation could support different features depending
on the add-in device socketed in the PCI slot. Many of the details around
high performance and low cost are addressed in much more detail in
Chapter 3.
PCI Challenges
Equal to the successes that PCI has enjoyed are the challenges that PCI
now faces. These challenges pave the way for defining a new architec-
ture. The challenges that PCI faces are in essence areas where PCI has
become inadequate. The key inadequacies are bandwidth limitations,
host pin limitations, the inability to support real time (isochronous) data
transfers, and the inability to address future I/O requirements.
Bandwidth Limitations
PCI transfers data at a frequency of 33 megahertz across either a 32-bit or
64-bit bus. This results in a theoretical bandwidth of 132 megabytes per
second (MB/s) for a 32-bit bus and a theoretical bandwidth of 264 mega-
bytes per second for a 64-bit bus. In 1995 the PCI Specification added
support for 66 megahertz PCI, which is backward-compatible with 33
megahertz PCI in a 32-bit or 64-bit bus configuration. Despite that compatibility, the server market has been the only market to make use of 66 megahertz PCI and 64-bit PCI, as shown in Table 2.2. This is probably because the
64-bit PCI requires so much space on the platform due to the connector
size and signal routing space. The server market is much less sensitive to
physical space constraints than the desktop and mobile market.
These faster, wider variants have sustained momentum in the server market for PCI. The desktop and mobile mar-
kets continue to use only 33 megahertz PCI in the 32-bit bus flavor. In
light of this, when PCI is mentioned in this book it is in reference to 33
megahertz PCI, which is used exclusively in the mobile and desktop sys-
tems that account for over 85 percent of the total computer market.
The actual bandwidth of the PCI bus is much less than the theoretical
bandwidth (approximately 90 megabytes per second) due to protocol
overhead and general bus topology issues, such as shared bandwidth,
that are discussed in more detail in Chapter 3. Since PCI is a shared bus,
the available bandwidth decreases as the number of users increases. When
PCI was introduced, 90 megabytes per second was more than adequate
for the I/O usage models and applications that had been defined. Today’s
I/O usage models and applications have grown to require far more
bandwidth than can be supplied by PCI (take Gigabit Ethernet, for example, which requires 125 megabytes per second). While PCI has been im-
proved over the years (the current PCI Specification is version 3.0), the
bandwidth of PCI has only been increased once. Comparatively, proces-
sor frequencies have increased dramatically. Ten years ago 66 megahertz
was a pretty fast processor speed, but today’s processor speeds are two
orders of magnitude larger, already passing the 3000 megahertz (or 3 gi-
gahertz) mark, as shown in Figure 2.3. PCI bandwidth hardly dents the I/O
processing capability of today’s processors.
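To make the mismatch concrete, here is a small Python sketch of the arithmetic above; the 90 megabytes per second usable figure and the 125 megabytes per second Gigabit Ethernet requirement come from this chapter, while the function and constant names are ours, purely for illustration.

```python
# Rough bandwidth arithmetic for the shared 32-bit, 33 MHz PCI bus.
# Figures follow the text above; the names are purely illustrative.

PCI_CLOCK_MHZ = 33
PCI_BUS_BYTES = 4                       # 32-bit bus = 4 bytes per clock

def pci_theoretical_mb_s() -> int:
    """Peak rate: one 4-byte transfer per 33 MHz clock."""
    return PCI_CLOCK_MHZ * PCI_BUS_BYTES    # 132 MB/s

PCI_USABLE_MB_S = 90        # after protocol overhead (see text above)
GIGABIT_ETHERNET_MB_S = 125 # 1000 Mb/s divided by 8 bits per byte

print(f"PCI theoretical: {pci_theoretical_mb_s()} MB/s")
print(f"PCI usable (approx.): {PCI_USABLE_MB_S} MB/s")
# A single Gigabit Ethernet port already exceeds what the whole
# shared bus can actually deliver:
print(f"Shortfall: {GIGABIT_ETHERNET_MB_S - PCI_USABLE_MB_S} MB/s")
```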
[Figure 2.3: Processor frequencies, 1993 to 2002, climbing from roughly 66 megahertz to beyond 3000 megahertz (MHz scale from 0 to 3500)]
Desktop and mobile systems can neither bear the cost nor the system space to add additional host de-
vices to support multiple PCI-X segments. This is one of the primary rea-
sons that PCI-X is not used in desktop or mobile systems. Desktop and
mobile systems are constrained to use PCI to maintain upgradeability
through available connectors.
[Figure: topside view of an I/O host silicon device, with pin regions labeled PCI-X, LAN, USB, IDE, and Other]
Two other technologies deserve mention here: Mini-PCI and PCI-X. Both of these technologies are based on PCI and use
a subset of the same signal protocol, electrical definitions, and configura-
tion definitions as PCI. Mini-PCI defines an alternate implementation for
small form factor PCI cards. PCI-X was designed with the goal of increas-
ing the overall clock speed of the bus and improving bus efficiencies
while maintaining backward compatibility with conventional PCI de-
vices. PCI-X is used exclusively in server-based systems that require extra
bandwidth and can tolerate PCI-X bus width, connector lengths, and card
lengths.
The PCI-SIG will continue to work on enhancements to the existing
base of standards (like the 533 megahertz PCI-X). However, the future of
PCI is PCI Express. PCI Express is not the next stretch of PCI architec-
ture, but rather an architectural leap that keeps the core of PCI’s soft-
ware infrastructure to minimize the delays to adoption that were
experienced with PCI. PCI Express completely replaces the hardware in-
frastructure with a radically new forward-looking architecture. The goal
of this leap is to hit the technology sweet-spot like PCI did nearly a dec-
ade ago.
Chapter 3
Goals and
Requirements
This chapter explores key goals and requirements for smooth migration of systems and designs to PCI Express architecture. A basic as-
sumption is that PCI Express must be stable and scalable for the next ten
years. This chapter discusses the need for PCI Express scalability and pin and link usage efficiency. Another primary requirement is multiple segment
support, focusing on the three key computing segments, desktop PCs,
servers, and mobile PCs. Following the discussion on multiple segment
support, the chapter explores system level cost parity with PCI as a key
requirement for technology migration. Afterwards, I/O simplification
goals are explored with an emphasis on consolidation of general I/O. Fi-
nally, the chapter investigates backward compatibility as it relates to cur-
rent software environments and form factors.
The PC now sits at the center of the digital world. Over a period of a few years the concept
of I/O has changed dramatically. The usage models for general I/O have
grown to include not only streaming audio and video, but the entire array
of digital devices such as PDAs, MP3 players, cameras and more. At first
introduction PCI Express will have sufficient bandwidth to support the
available applications. However, future applications and evolving current
applications will require PCI Express to be scalable, as seen in Figure 3.1.
[Figure 3.1: PCI Express scalability, showing a possible Generation 2 at 5 gigahertz and a possible Generation 3 at 10 gigahertz. The next generation frequencies shown are based upon current speculation.]
Pin Efficiency
First consider conventional PCI, which uses a 32-bit wide (4 byte) bi-
directional bus. In addition to the 32 data pins, there are 52 side band
and power and ground pins.
33 megahertz × 4 bytes ≅ 132 megabytes per second TX and RX

[Figure: Device A connected to Device B over a x1 PCI Express link]

2500 megahertz × 1 bit × (1 byte / 10 bits) ≅ 250 megabytes per second TX

2500 megahertz × 1 bit × (1 byte / 10 bits) ≅ 250 megabytes per second RX
Note—In PCI Express, 10 bits are equated to a single byte due to 8-bit/10-bit encoding, which reduces the efficiency of the bus. In a PCI Express x1 link, 4 pins are used for signaling; the remaining 2 pins are used for power and ground. A link that has multiple transmit and receive pairs is actually more pin-efficient with regard to power and ground balls. As link width increases, the bandwidth per pin increases toward 100 megabytes per second.
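A short sketch, under the pin-count assumptions just stated (4 signal pins per lane, plus a power and ground allocation that is amortized on wider links), shows how bandwidth per pin grows with link width; the power/ground budgets below are illustrative assumptions, not values from the specification.

```python
# Sketch: bandwidth per pin as a PCI Express link widens.
# Assumes 250 MB/s per lane per direction (2.5 Gb/s with 8b/10b
# encoding) and 4 signal pins per lane (one TX pair, one RX pair).

LANE_MB_S_PER_DIRECTION = 250      # 2.5 Gb/s / 10 bits per byte
SIGNAL_PINS_PER_LANE = 4

def link_bandwidth_mb_s(lanes: int) -> int:
    """Total bandwidth across both directions for an xN link."""
    return lanes * LANE_MB_S_PER_DIRECTION * 2

# Power/ground pin counts here are illustrative assumptions.
for lanes, pg_pins in [(1, 2), (4, 4), (16, 16)]:
    bandwidth = link_bandwidth_mb_s(lanes)
    pins = lanes * SIGNAL_PINS_PER_LANE + pg_pins
    print(f"x{lanes:<2}: {bandwidth:>5} MB/s over {pins:>2} pins "
          f"= {bandwidth / pins:.0f} MB/s per pin")
```

Run as written, the per-pin figure climbs from roughly 83 megabytes per second for a x1 link toward the 100 megabytes per second noted above.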
Link Efficiency
To foster high link efficiencies, PCI Express uses a split transaction pro-
tocol that benefits from the adoption of advanced flow control mecha-
nisms. This allows PCI Express to maximize the bandwidth capability of
the architecture by minimizing the possibility of bottleneck contentions
and link inactivity. This type of efficiency is extremely critical when deal-
ing with a serial architecture such as PCI Express.
Within any architecture there are devices that exhibit latencies that
do not allow a transaction to complete immediately (considered a latent
transaction). This requires mechanisms to be defined to handle these
types of transactions. For example, consider two interconnected devices, A and B, that are defined to have multiple functions. Devices A and B
could represent a real world chip-to-chip connection where Device A is a
Host Bridge and Device B is a PCI-to-PCI bridge to other bus segments
supporting multiple devices. Similarly Devices A and B could represent a
Host/PCI Bridge communicating to a slot-based device that has multiple
functions, as shown in Figure 3.4.
If Device A desires to receive some data from Function 1 on Device
B, it is likely that Device B will require time to obtain the requested in-
formation since Function 1 exists outside of Device B. The delay from the
time the request is comprehended in Device B until it can be serviced to
Device A is considered the latency period. If no further transactions are
allowed until the outstanding transaction is finished, the system effi-
ciency is reduced severely. Take the case where Device A also requires
information from Function 2 of Device B. Without an intelligent mecha-
nism for maximizing the efficiency of the system, Device A would have
to wait for Device B to complete the first request before beginning the
next transaction, which will most likely have a latency associated with it,
as shown in Figure 3.4.
[Figure 3.4: Device A linked to Device B, which contains Function 1 and Function 2]
Consider, as an analogy, a four lane highway running in both directions. This four lane highway has a carpool
lane that allows carpoolers an easier path to travel during rush hour traf-
fic congestion. There are also fast lanes for swifter moving traffic and
slow lanes for big trucks and other slow traffic. Drivers can use different
lanes in either direction to get to a particular destination. Each driver oc-
cupies a lane based upon the type of driver he or she is. Carpoolers take
the carpool lane while fast drivers and slow drivers occupy the fast and
slow lanes respectively. The four lane highway example represents what
is referred to in PCI Express as a virtual channel. The link, or connec-
tion, formed between two PCI Express devices can support multiple vir-
tual channels regardless of the actual link width.
Virtual channels are exactly what one might suspect: they are virtual
wires between two devices. The finite physical link bandwidth is divided
up amongst the supported virtual channels as appropriate. Each virtual
channel has its own set of queues and buffers, control logic, and a credit-
based mechanism to track how full or empty those buffers are on each
side of the link. Thinking back to the four lane highway example, in the
real world those four lanes can become congested and blocked as well.
The advancement of cars on the four lane highway is in direct proportion
to the amount of road available in front of each vehicle. Some lanes may
have more space for traffic to move than others. Likewise, if the receive
queues and buffers for a virtual channel on one side of the link or the
other are full, then no further transactions can be sent until they are
freed up by completing outstanding transactions. Additionally, on the
transmit side, if the transmit queues and buffers become full, no further
transactions are accepted until they are freed up by completing out-
standing transactions. Bottlenecked transactions on one virtual channel
do not cause bottlenecks on another virtual channel since each virtual
channel has its own set of queues and buffers.
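The per-channel queues and credit tracking described above can be sketched in a few lines of Python; the class, units, and numbers below are ours, simplified to show why a stall on one virtual channel leaves the others unaffected.

```python
# Sketch: credit-based flow control for one virtual channel. The
# receiver advertises buffer space as credits; the transmitter may
# only send while enough credits remain. Units are illustrative.

class VirtualChannelTx:
    def __init__(self, advertised_credits: int):
        self.credits = advertised_credits   # receiver buffer space

    def try_send(self, packet_cost: int) -> bool:
        """Send only if the receiver has room; otherwise stall."""
        if packet_cost > self.credits:
            return False        # blocked on THIS channel only
        self.credits -= packet_cost
        return True

    def return_credits(self, amount: int) -> None:
        """Receiver completed transactions and freed buffer space."""
        self.credits += amount

vc0 = VirtualChannelTx(advertised_credits=4)
vc1 = VirtualChannelTx(advertised_credits=4)
print(vc0.try_send(3), vc0.try_send(3))   # True False: vc0 stalls
print(vc1.try_send(3))                    # True: vc1 is unaffected
vc0.return_credits(3)
print(vc0.try_send(3))                    # True once credits return
```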
System traffic is broken down into classes that are based on device
class and negotiation with the operating system. In the traffic example
above, the traffic classes would consist of carpoolers, fast drivers, and
slow drivers. PCI Express supports up to eight different traffic classes and
hence, eight different virtual channels. Each traffic class may be mapped
to a unique virtual channel; however, this is not a requirement. Unlike
drivers on a four lane highway who may continually change lanes, once a
device is assigned a traffic class it cannot change to another traffic class.
Figure 3.5 illustrates how PCI Express links can support multiple vir-
tual channels. Each virtual channel can support one or multiple traffic
classes; however, a single traffic class may not be mapped to multiple vir-
tual channels. Again recall that virtual channels are in fact virtual. You
cannot infer simply because a PCI Express link is defined as a x2 link that
there are two virtual channels. A x1 PCI Express link can have as many as
eight virtual channels and a x32 link can have as few as one virtual chan-
nel. Additional details of PCI Express flow control are examined in Chap-
ter 9.
Figure 3.5 PCI Express Flow Control through Virtual Channels and Traffic
Classes
Traffic from the various classes is multiplexed and flows across the link. The combination of these tech-
niques gives PCI Express a technical advantage over both conventional
PCI and PCI-X. More importantly this type of link efficiency will be nec-
essary to support the device usage models of the immediate future.
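The mapping rules above (up to eight traffic classes, each assigned to exactly one virtual channel, while a single virtual channel may carry several classes) lend themselves to a small validator; this sketch and its names are ours, not part of the specification.

```python
# Sketch: checking a traffic-class-to-virtual-channel mapping
# against the rules described above. Names are illustrative.

NUM_TRAFFIC_CLASSES = 8     # PCI Express defines eight traffic classes

def validate_tc_to_vc(mapping: dict[int, int], num_vcs: int) -> None:
    """mapping: traffic class -> virtual channel, for one link."""
    for tc, vc in mapping.items():
        if not 0 <= tc < NUM_TRAFFIC_CLASSES:
            raise ValueError(f"TC{tc} does not exist")
        if not 0 <= vc < num_vcs:
            raise ValueError(f"VC{vc} not supported on this link")
    # A dict already forces each TC onto at most one VC, which is
    # the rule: one traffic class may not span multiple channels.
    # Several TCs sharing one VC is allowed, so nothing to check.

# Link width and channel count are independent: even a x1 link may
# support several virtual channels.
validate_tc_to_vc({0: 0, 1: 0, 7: 1}, num_vcs=2)   # legal
try:
    validate_tc_to_vc({3: 5}, num_vcs=2)           # VC5 not present
except ValueError as err:
    print("rejected:", err)
```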
Multi-Segment Support
PCI Express must support multiple market segments. An architectural
change this large requires integration into multiple segments to gain the
appropriate momentum for mass acceptance. In addition to this, it has
become apparent that the various market segments are becoming more
unified as time goes on. Mobile and desktop segments have been merging
for years. Many corporate stable systems have now shifted to become
desktop/mobile hybrids that also require manageability features found
primarily in the server market. Leveraging the mass adoption and econ-
omy of scale in the computing sectors, PCI has been adopted as a control
mechanism in the communications sector. To this end PCI Express has
been defined to support primary feature requirements of all four seg-
ments.
Some electrical and hot-plug events are not defined as being known to the system before they happen. PCI Express
transmit and receive buffers have been designed to withstand sustained
shorts to ground of the actual data lines. Additionally receive buffers re-
main in high impedance whenever power is not present to protect the
device from circuit damage.
Cost strongly governs adoption: Digital Versatile Disc (DVD) technology took many years to become widely adopted. One
of the primary reasons that it took so long to adopt was the fact that DVD
players cost far more to manufacture than video cassette players. As a re-
sult they were sold for four to five times the cost of a good qual-
ity video cassette player. DVDs have much higher quality than video
tapes. However, if the cost is prohibitive for the market in question, adop-
tion will be incredibly slow or will fail altogether. As a consequence, the
PCI Express architects formed a set of requirements that use current fab-
rication technologies in four key areas: printed circuit board fabrication
technology, connector manufacturing technology, four-layer routability,
and silicon design process technology.
However, given the fact that PCI Express connectors are much smaller than
conventional PCI connectors, there are some material savings realized
from a manufacturing standpoint that may balance out the cost.
[Figure: a four-layer board stackup, from top to bottom: signal, power, glass laminate “dielectric”, ground, signal]
Four-layer stackups are used primarily in the desktop computer market,
which makes up approximately 70 percent of overall computer sales.
Many silicon vendors today use a 0.25 micron process or smaller for their devices. Voltage constraints also exist. PCI
Express was designed to operate at I/O voltage levels compatible with
0.25 micron and future processes. In short PCI Express has some design
flexibility in that it can be designed on multiple silicon processes. PCI
Express devices manufactured on different processes can still be con-
nected to one another since PCI Express is voltage independent through
AC coupling on the signal lines. For additional consideration and insights
into manufacturing choices see Chapter 14.
I/O Simplification
PCI Express seeks to simplify general I/O by consolidating the I/O strat-
egy for the desktop, mobile, and server segments. I/O consolidation gives
a sense of validity to the architecture by removing technological con-
straints and architectural constraints that have generally separated the
segments. PCI Express is defined as an open specification and will allow
all hardware vendors alike to adopt the technology without the burden of
paying royalties. PCI Express is not currently defined to replace all I/O
that currently exists across the multiple segments. However, it is ex-
pected that as time passes, the I/O that was not originally consolidated
will soon become so.
[Figure: early consolidated I/O, with graphics, audio, LAN, and ATA on the PCI bus below the bridge/memory controller, and a PCI-to-PCI bridge feeding PCI expansion slots]
Within a few years, however, the architectural picture was quite dif-
ferent. In a relatively short period of time the demand for bandwidth in-
creased beyond what conventional PCI could deliver. Since conventional
PCI was not designed to be scalable, chipset manufacturers had to ex-
plore other options, resulting in the integration of high bandwidth I/O
elements such as graphics and ATA into the memory and I/O controller
respectively. While the integration of high bandwidth I/O gave some
bandwidth relief to the expansion slots portion of the platform, it made
the PCI-based chip-to-chip interconnect bandwidth problems even
worse. The natural result of feature integration into the I/O controller
was the simultaneous development of proprietary high-bandwidth chip-
to-chip solutions. The end result was segmentation of a once consoli-
dated I/O, as illustrated in Figure 3.9.
[Figure 3.9: segmented I/O, with graphics on AGP off the memory controller, a proprietary chip-to-chip link to the I/O controller, and audio (AC'97), LAN, IDE/ATA, SATA, and miscellaneous I/O attached to the I/O controller core]
In many systems audio and LAN still exist as expansion cards on the PCI bus.
[Figure: I/O re-consolidated around PCI Express, with graphics, audio (AC'97), miscellaneous I/O, conventional PCI, and PCI Express mobile docking attached through the root complex]
Conventional PCI may coexist with PCI Express as an I/O controller feature
during the transition phase. The system level architecture will be PCI
Express-based.
Backward Compatibility
Over the last ten years PCI has developed an extensive infrastructure that
ranges from operating system support to chassis form-factor solutions.
This infrastructure is the result of many years of coordinated efforts be-
tween hardware vendors and software vendors. The infrastructure estab-
lished through the adoption of PCI has been the springboard of success
for the personal computing platform. The most significant requirement
for smooth migration of PCI Express architecture is the level of backward
compatibility it has with the existing infrastructure.
Devices that need the bandwidth and features of PCI Express will likely migrate to the architecture quickly, whereas devices that do not require the benefits
of PCI Express (56K PCI modem cards for example) can make the change
slowly.
The coexistence of PCI and PCI Express needs to be clarified. The
core architecture of systems should change to PCI Express as rapidly as
possible. The coexistence model is defined to be a PCI Express core ar-
chitecture with supporting PCI Express-to-PCI bridges.
Chapter 4
PCI Express
Applications
This chapter looks more closely at the applications where PCI Express
offers significant benefits beyond existing interconnect technologies.
PCI Express is a unique technology in that it provides immediate benefits
across multiple market segments from desktop PCs, mobile PCs, enter-
prise servers, to communications switches and routers. This chapter
starts with a brief overview of the key benefits of PCI Express and then
covers the applications where PCI Express is a natural solution due to
evolving requirements. Finally this chapter reviews some of the applica-
tions where PCI Express provides a new and revolutionary usage model.
High Performance
A key metric of performance is bandwidth, or the amount of data that
can be transferred in a given time. True usable bandwidth, typically measured in millions of bytes or megabytes per second (MB/s), is the product of total theoretical peak bandwidth and efficiency. For example, recall PCI is a 32-bit bus running at 33 megahertz, which is 132 megabytes per second. Although the final PCI specification evolved to a
64-bit bus running at 66 megahertz for a total of 533 megabytes per sec-
ond, approximately 85 percent of the computing industry continues to
use the 33-megahertz version. However, the PCI bus cannot actually
transfer data at these rates due to overhead required for commands as
well as the inability to perform reads and writes at the same time. To de-
termine the actual data transfer capability requires an understanding of
the bus efficiency. The bus efficiency is determined by several factors
such as protocol and design limitations and is beyond the scope of this discussion.
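As a small illustration of the relationship just described, the sketch below computes usable bandwidth as peak bandwidth times efficiency; the 0.68 efficiency value is our assumption, chosen only to reproduce the roughly 90 megabytes per second figure quoted in Chapter 2.

```python
# Sketch: usable bandwidth as the product of theoretical peak
# bandwidth and bus efficiency. Efficiency values are assumptions.

def usable_mb_s(peak_mb_s: float, efficiency: float) -> float:
    """True usable bandwidth = theoretical peak x efficiency."""
    return peak_mb_s * efficiency

# 32-bit/33 MHz PCI peaks at 132 MB/s; ~0.68 efficiency yields the
# roughly 90 MB/s usable figure discussed in Chapter 2.
print(usable_mb_s(132, 0.68))   # ~90 MB/s
# 64-bit/66 MHz PCI peaks at 533 MB/s; the same efficiency scales.
print(usable_mb_s(533, 0.68))   # ~362 MB/s
```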
I/O Simplification
A look inside several computing platforms today illustrates that there is
an overabundance of I/O technologies. Today’s platforms have PCI-X for
servers, Cardbus (PCMCIA slot for expansion) on mobile PCs, and PCI for
desktop PCs. In addition, several I/O technologies have evolved to appli-
cation-specific usage models such as IDE and SCSI for disk drives, USB or
IEEE 1394 for PC peripherals, AGP for graphics cards, and proprietary
chip-to-chip interconnects such as Intel’s Hub Link. Although many of
these technologies will continue to coexist moving forward, PCI Express
provides a unique interface technology serving multiple market seg-
ments. For example, a PC chipset designer may implement an x16 PCI
Express configuration for graphics, an x1 configuration for general pur-
pose I/O, and an x4 configuration as a high-speed chip-to-chip intercon-
nect. Notice the platform of the future consolidates the design and
development effort to a single PCI Express core away from three separate
and distinct I/O technologies (AGP, PCI, and Hub Link respectively). Re-
fer to Figure 12.2 and Figure 4.13 for examples of what a future desktop
PC and server platform could look like.
Layered Architecture
PCI Express establishes a unique divergence from historical PCI evolutions through a layered architecture, improving serviceability and scalability.
[Figure: the PCI Express layers paired with their roles, e.g. the Data Link Layer with data integrity and the Physical Layer with electrical signaling]
If the signaling rate evolves from 2.5 gigabits per second to probably more than 5.0 gigabits per sec-
ond, only the Physical Layer needs to evolve. The remaining layers can
continue to operate flawlessly, reducing development cost and time for
each incremental evolution of PCI Express.
Ease of Use
PCI Express will revolutionize the way users install upgrades and repair
failures. PCI Express natively supports hot plug and hot swap. Hot swap
is the ability to swap I/O cards without software interaction whereas hot
plug may require operating system interaction. PCI Express as a hard-
ware specification defines the capability to support both hot swap and
hot plug, but hot plug support will depend on the operating system. In
the future, systems will not need to be powered down to replace faulty
equipment or install upgrades. In conjunction with the PCMCIA and PCI
SIG industry groups defining standard plug-in modules for mobile, desk-
top PCs, and servers, PCI Express enables systems to be easier to config-
ure and use.
For example, compare the events following the failure of a PCI
Ethernet controller in the office today with what the future could look
like. If a PCI card fails today, a technician is dispatched to the location of
the PC. The PC must be powered down, opened up, and the card must
be physically removed. Opening the PC chassis can be cumbersome, as
screws need to be removed, cables disconnected and pushed out of the
way, and the card unseated from the motherboard. Once the faulty card
is replaced with an identical unit, the system is then reassembled, recon-
nected, and powered back on. Hopefully all goes well and the PC is up
and running after a short two-hour delay. In the future, PCI Express
modules could be plugged into the external slot on the PC without pow-
ering down, disassembling, and disconnecting the PC. Refer to Figure 4.4
for a picture of the modules. In the same scenario, the technician arrives
with a new module, swaps the good with the bad, and the user is off and
running in less than ten minutes. In addition to serviceability, PCI Ex-
press provides the interconnect capability to perform upgrades without
powering down the system.
Although the modules are still under definition and expected to be
finalized in 2003, the proposals currently being discussed highlight the
benefits of easy-to-use modules. See Figure 14.4. The PC on the left is
based on today’s capabilities. In order to install an upgrade, the user must
open the box and navigate through the cables and connectors. The PC
on the right has the ability to install either in the front or back of the sys-
tem. PC add-in cards are expected to continue to be supported within
the box for standard OEM configurations, but PCI Express provides a
revolutionary and easier method to install upgrades versus internal slots.
The small module also enables OEMs to provide expansion capability on
extremely small form-factor PCs where the PCI Express connectors con-
sume valuable space.
Evolutionary Applications
The following sections within this chapter review both evolutionary and
revolutionary applications for PCI Express. To some extent or another, all
the following applications are expected to leverage one or more of the key benefits described above.
[Figure: today's platform, with the CPU on its bus to the memory controller, graphics attached over AGP, main memory, and a chip-to-chip link down to the I/O controller]
Graphics Evolution
Initial graphics devices were based on the ISA (Industry Standard Archi-
tecture) system bus in the early 1980s with a text-only display. The ISA
bus provided a 16-bit bus operating at 8.33 megahertz for a total theo-
retical bandwidth of approximately 16 megabytes per second. As the
CPU and main memory continued to improve in performance, the graph-
ics interconnect also scaled to match system performance and improve
the overall end user experience. The early 1990s saw the introduction of
the PCI architecture providing a 32-bit bus operating at 33 megahertz for
a total bandwidth of approximately 132 megabytes per second as well as
the evolution to two-dimensional rendering of objects improving the
user’s visual experience. Although the PCI interface added support for a faster 64-bit interface on a 66 megahertz clock, the graphics interface evolved to AGP implementations. In the mid-1990s to the early years of
the following decade, the AGP interface evolved from the 1x mode even-
tually to the 8x mode. Today the AGP 8x mode operates on a 32-bit bus
with a 66 megahertz clock that is sampled eight times in a given clock
period for a total bandwidth of approximately 2,100 megabytes per sec-
ond. Additional enhancements such as three-dimensional rendering also
evolved to improve the overall experience as well as drive the demand
for continued bandwidth improvements. See Figure 4.6 for the band-
width evolution of the graphics interconnect. The graphics interconnect
has continued to double in bandwidth to take full advantage of increased
computing capability between main memory, the CPU, and the graphics controller.
[Figure 4.6: PC graphics interconnect bandwidth evolution]
ISA (1985): 16 MB/s
PCI (1993): 133 MB/s
AGP (1997): 266 MB/s
AGP2x (1998): 533 MB/s
AGP4x (1999): 1066 MB/s
AGP8x (2002): 2133 MB/s
PCI Express x16 (2004): 4000 MB/s
Ethernet Evolution
Ethernet has continually demonstrated resilience and flexibility in evolv-
ing to meet increasing networking demands. Ethernet first hit the market
in the early 1980s. As the PC market grew, so did the requirement for
computers and users to share data. By 1990, a 10-megabit per second
networking technology across standard UTP (unshielded twisted pair)
wiring was approved as the IEEE 10BASE-T standard and the next year
Ethernet sales nearly doubled (Riley, Switched Fast Gigabit Ethernet, pp
15). Networking requirements quickly demanded the evolution to Fast
Ethernet and the IEEE 100Base-T standard capable of 100 megabits per
second, published in 1994. Fast Ethernet enjoyed a rapid adoption as
network interface card suppliers offered both Fast Ethernet (100 megabit
per second standard) and Ethernet (10 megabit per second standard) ca-
pability on the same card providing backward compatibility and com-
monly referred to as 10/100Mbps capable. In fact, almost all Fast Ethernet network interface cards (NICs) are 10/100Mbps-capable and represent a commanding 70 percent of the market only four years after introduction, as shown in Figure 4.7.
Immediate Benefits
Gigabit Ethernet requires a high performance interface and is well suited
for PCI Express. PCI Express provides an immediate boost in perform-
ance due to the dedicated bandwidth per link, a direct increase in the us-
able bandwidth, and the ability to perform concurrent cycles. PCI
Express provides a dedicated link with 100 percent of the bandwidth on
each port independent of the system configuration, unlike today’s PCI
shared bus. An immediate performance gain is realized with PCI Express
due to the increase in available bandwidth. PCI provides a total band-
width of 132 megabytes per second whereas PCI Express operates on a
2.5 gigabits-per-second encoded link providing 250 megabytes per sec-
ond. PCI Express additionally supports concurrent data transmissions for
a maximum concurrent data transfer of 500 megabytes per second. De-
vices are able to transmit 250 megabytes per second of data during a
write operation while simultaneously receiving 250 megabytes per sec-
ond of read data due to separate differential pairs. PCI on the other hand
is only capable of performing one read or write operation at any given
time. PCI Express is the next obvious connection to deliver Gigabit
Ethernet speeds.
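The duplex arithmetic can be sketched directly; the assumption that a shared PCI bus must split its single budget between transmit and receive traffic follows the discussion above, and the function name is ours.

```python
# Sketch: can an interconnect feed full-duplex Gigabit Ethernet?
# Gigabit Ethernet can move up to 125 MB/s in EACH direction at once.

GBE_MB_S_EACH_WAY = 125

def sustains_full_duplex(tx_mb_s: float, rx_mb_s: float) -> bool:
    """tx/rx are the interconnect's concurrent per-direction rates."""
    return tx_mb_s >= GBE_MB_S_EACH_WAY and rx_mb_s >= GBE_MB_S_EACH_WAY

# Shared PCI: one 132 MB/s bus, reads OR writes but never both at
# once, so the budget is split between the two directions.
print(sustains_full_duplex(132 / 2, 132 / 2))   # False
# PCI Express x1: a dedicated 250 MB/s each direction, concurrently.
print(sustains_full_duplex(250, 250))           # True
```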
[Figure: a Gigabit Ethernet network, with end users connected through building and network switches, an inter-campus link, a firewall, servers, and the Internet]
Interconnect History
In 1998, Intel announced the 400 megahertz Pentium® II processor to
operate along with the Intel® 440BX chipset. The 440BX used the stan-
dard PCI bus to interconnect the memory and processor controller to the
PIIX4E I/O controller, as shown in Figure 4.10. Prior to the evolution of
bandwidth-intensive I/O devices, the PCI bus provided sufficient band-
width and performance at 132 megabytes per second. Adding up the bandwidth of the major attached I/O, as Figure 4.10 does, shows why.
[Figure 4.10: the Intel 440BX platform: a 400 megahertz CPU, PC100 SDRAM (800MB/s), AGP2x graphics (533MB/s), and a PCI bus (133MB/s) linking PCI expansion slots and the PIIX4E I/O controller, which provides DMA 66 IDE (66MB/s, 4 drives), USB 1.1 (2 ports), and ISA expansion slots]
[Figure: a later platform: a 1,200 megahertz CPU, PC133 SDRAM (1,066MB/s), AGP4x graphics (1,066MB/s), and the ICH2 I/O controller with ATA 100 IDE (4 drives), USB 1.1 (4 ports), and PCI expansion slots on a 133MB/s PCI bus]
[Figure: a recent platform: a 3,000 megahertz CPU, PC266 DDR memory (2,100MB/s), AGP4x graphics (1,066MB/s), and the ICH4 I/O controller with ATA 100 IDE (4 drives), USB 2.0 (6 ports), and PCI expansion slots on a 133MB/s PCI bus]
Immediate Benefits
The immediate benefits of PCI Express as a high-speed chip interconnect
are the bandwidth improvements, scalability, and isochrony. I/O evolu-
tion will likely continue and several technological changes are underway.
The Serial ATA specification was published in 2001 paving the road for
new disk transfer rates. The first generation of Serial ATA interconnects
will be able to support up to 150 megabytes per second, a 50 percent in-
crease over the existing ATA100 connections. Serial ATA has a plan for
reaching 600 megabytes per second by the third generation. In addition,
Gigabit Ethernet adoption will eventually drive the requirement for addi-
tional bandwidth between the host controller and the I/O controller.
History has demonstrated that bandwidth and performance continue
to rise as systems increase the computing power. The inherent scalability
in both the numbers of PCI Express lanes as well as the ability to scale
the individual frequency of the PCI Express link provides a robust plan
for the next decade. For example, if a high-speed interconnect within a
system requires 1 gigabyte per second of data initially, the system manu-
facturer could implement an x4 PCI Express link. With changes to the
Physical Layer only, the system would be able to scale up to 4 gigabytes
per second with a modification to the signaling rate, but leaving the
software and existing connectors intact. Due to the benefits, high-speed
chip-to-chip interconnects will likely implement PCI Express or proprie-
tary solutions leveraging the PCI Express technology by 2004.
The other immediate benefit evolves from new capabilities within PCI
Express. Specifically, isochrony provides revolutionary capabilities for multimedia. Isochrony is discussed later in this chapter and in Chapter 9.
[Figure 4.13: Server architecture evolution with PCI Express]
Revolutionary Applications
Where the previous section covered the applications that will take ad-
vantage of the benefits within PCI Express enabling a natural evolution,
this section reviews revolutionary applications that PCI Express enables.
Specifically, these include isochrony, a unique implementation to guarantee glitchless media; future modules that improve ease of use; and communications applications with advanced switching.
Imagine users who begin capturing an MPEG movie clip from a digital video camera and then step away to check their e-mail and their favorite stock price. When they come back to review the MPEG, they notice several dropped frames and an excessive number of glitches in the movie clip. What happened?
This scenario is actually more common than most people think. Es-
sentially, some data types require dedicated bandwidth and a mechanism
to guarantee delivery of time-critical data in a deterministic manner. The
video stream data needs to be updated on a regular basis between the
camera and the application to ensure the time dependencies are met to
prevent the loss of video frames. PCI Express-based isochrony solves this
problem by providing an interconnect that can deliver time-sensitive data
in a predetermined and deterministic method. In addition to providing
the interconnect solution, PCI Express provides a standardized software
register set and programming interface, easing the burden on the software developer.
[Figure: proposed client module form factors, with roughly 40W and 20W power envelopes: single wide (33.7mm wide, 60mm long, 5mm thick), double wide (68mm wide, 60mm long, 5mm thick), and extended double wide variants]
Figure 4.15 Client PCI Express Modules for Desktops and Laptops
Some of the initial applications to embrace the module are flash me-
dia, microdrives, wireless LAN, and broadband modems (Cable and DSL
routers) because of ease of use. For example, consider the user who or-
ders DSL from the local service provider. Currently, a technician is dis-
patched to install the unit and must disassemble the PC to install a
network connection to an external router via a PCI slot add-in card. The
module creates a compelling business scenario for service providers. The
service provider could simply ship the module to the user with instruc-
tions to connect the phone line to the module and then insert the mod-
ule into the slot. The user does not need to disassemble or reboot the
system. When the card is installed, the operating system detects the
presence of a new device and loads the necessary drivers. Because of the
native hot-plug and hot-swap capability within PCI Express, desktop and
notebook systems will no longer be burdened with the CardBus controller and the additional system cost issues with the current PC Card.
Chapter 5
PCI Express
Architecture Overview

This chapter introduces the PCI Express architecture, starting off with a system level view. This addresses the basics of a point-to-point ar-
chitecture, the various types of devices and the methods for information
flow through those devices. Next, the chapter drops down one level to
further investigate the transaction types, mainly the types of information
that can be exchanged and the methods for doing so. Lastly, the chapter
drops down one level further to see how a PCI Express device actually
goes about building those transactions. PCI Express uses three transac-
tion build layers, the Transaction Layer, the Data Link Layer and the
Physical Layer. These architectural build layers are touched upon in this
chapter with more details in Chapters 6 through 8.
PCI are no longer directly applicable to PCI Express. For example, de-
vices no longer need to arbitrate for the right to be the bus driver prior to
sending out a transaction. A PCI Express device is always the driver for its
transmission pair(s) and is always the target for its receiver pair(s). Since
only one device ever resides at the other end of a PCI Express link, only
one device can drive each signal and only one device receives that signal.
In Figure 5.1, Device B always drives data out its differential transmis-
sion pair (traces 1 and 2) and always receives data on its differential re-
ceiver pair (traces 3 and 4). Device A follows the same rules, but its
transmitter and receive pairs are mirrored to Device B. Traces 3 and 4
connect to Device A’s transmitter pair (TX), while traces 1 and 2 connect
to its receiver pair (RX). This is a very important difference from parallel
busses such as PCI; the transmit pair of one device must be the receiver pair for the other device. They must be point-to-point, one device to a second device. TX of one is RX of the other and vice versa.
[Figure 5.1: a x1 link between Device A and Device B; traces 1 and 2 carry Device B's transmit pair (T+/T-) to Device A's receive pair (R+/R-), while traces 3 and 4 carry Device A's transmit pair to Device B's receive pair]
[Figure 5.2: a x4 link; each device's port connects four lanes, and the four lanes together form the link]
Please note that the signaling scheme for PCI Express is tremendously
simple. Each lane is just a unidirectional transmit pair and receive pair.
There are no separate address and data signals, no control signals like the
FRAME#, IRDY# or PME# signals used in PCI, not even a sideband clock
sent along with the data. Because of this modularity, the architecture can
more easily scale into the future, provide additional bandwidth and sim-
plify the adoption of new usage models. However, it also requires the
adoption of techniques vastly different from traditional PCI.
Embedded Clocking
PCI Express utilizes 8-bit/10-bit encoding to embed the clock within the
data stream being transmitted. At initialization, the two devices deter-
mine the fastest signaling rate supported by both devices. The current
78 Introduction to PCI Express: A Hardware and Software Developers Guide
specification only identifies a single signaling rate, 2.5 gigabits per sec-
ond (per lane per direction), so that negotiation is pretty simple. Since
the transfer rate is determined ahead of time, the only other function of
the clock would be for sampling purposes at the receiver. That is where
8-bit/10-bit encoding with an embedded clock comes into play. By
transmitting each byte of data as 10 encoded bits, you can increase the
number of transitions associated with each transmission character —
simplifying the sampling procedures on the receiver side. Chapter 8,
“Physical Layer Architecture” contains more information on this topic.
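Since no separate clock travels with the data, a receiver must also locate the 10-bit symbol boundaries within the recovered bit stream. One way this is done is by scanning for a comma character such as K28.5, whose bit pattern cannot occur across a symbol boundary; the 10-bit patterns below are the standard K28.5 code groups, while the scanning code itself is our illustrative sketch.

```python
# Sketch: finding 10-bit symbol alignment in an 8b/10b stream by
# scanning for the K28.5 comma character. Framing logic is ours.

K28_5_RD_MINUS = "0011111010"   # K28.5, negative running disparity
K28_5_RD_PLUS  = "1100000101"   # K28.5, positive running disparity

def find_symbol_alignment(bits: str) -> int:
    """Return the offset of the first comma character, or -1."""
    for offset in range(len(bits) - 9):
        window = bits[offset:offset + 10]
        if window in (K28_5_RD_MINUS, K28_5_RD_PLUS):
            return offset
    return -1

# Once aligned, every following 10-bit symbol decodes to one byte,
# which is why 2.5 Gb/s of signaling yields 250 MB/s of data.
stream = "110" + K28_5_RD_MINUS + "0101010101"  # 3 stray bits first
offset = find_symbol_alignment(stream)
print(offset)                                   # 3
print([stream[i:i + 10] for i in range(offset, len(stream) - 9, 10)])
```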
Multiple Lanes
You might now be asking yourself, “If the transfer rate is fixed ahead of
time at 2.5 gigabits per second per lane per direction, how can this inter-
face scale to meet the needs of high-bandwidth interfaces?” After all, 2.5
gigabits per second per direction is only 250 megabytes per second of ac-
tual data that can flow each way (recall that with 8-bit/10-bit encoding,
each byte of data is transferred as 10 bits, so you need to divide 2.5 giga-
bits per second by 10 to get theoretical data transfer rates). A data trans-
fer rate of 250 megabytes per second per direction might be better than
traditional PCI, but it certainly is not in the same league as higher band-
width interfaces, like AGP (AGP4x runs at 1 gigabyte per second and
AGP8x runs at 2 gigabytes per second total bandwidth). When you add to
this the fact that a parallel bus, like PCI or AGP, is substantially more effi-
cient than a serial interface like PCI Express, the bandwidth of this new
interface seems to be at a disadvantage to some existing platform tech-
nologies. Well, that is where PCI Express’ scalability comes into play.
Much like lanes can be added to a highway to increase the total traffic
throughput, multiple lanes can be used within a PCI Express link to in-
crease the available bandwidth. In order to make its capabilities clear, a
link is named for the number of lanes it has. For example, the link shown
in Figure 5.2 is called a x4 (read as: “by four”) link since it consists of four lanes. A link with only a single lane, as in Figure 5.1, is called a x1
link. As previously noted, the maximum bandwidth of a x1 is 250 mega-
bytes per second in each direction. Because PCI Express is dual unidirec-
tional, this offers a maximum theoretical bandwidth of 500 megabytes
per second between the two devices (250 megabytes per second in both
directions). The x4 link shown in Figure 5.2 has a maximum bandwidth
of 4 × 250 megabytes per second = 1 gigabyte per second in each direc-
tion. Going up to a x16 link provides 16 × 250 megabytes per second = 4
gigabytes per second in each direction. This means that PCI Express can scale from modest to very high bandwidth needs simply by varying the number of lanes in a link.
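To see how extra lanes multiply throughput, consider byte striping: outbound bytes are dealt round-robin across the lanes of the link, so an xN link moves N bytes per symbol time. The sketch below is ours and omits the framing and ordered sets a real Physical Layer adds.

```python
# Sketch: byte striping across the lanes of a link. Each outbound
# byte goes to the next lane in round-robin order.

def stripe(data: bytes, lanes: int) -> list[list[int]]:
    """Distribute a byte stream round-robin across `lanes` lanes."""
    per_lane: list[list[int]] = [[] for _ in range(lanes)]
    for i, byte in enumerate(data):
        per_lane[i % lanes].append(byte)
    return per_lane

def unstripe(per_lane: list[list[int]]) -> bytes:
    """Receiver side: interleave the lanes back into one stream."""
    lanes = len(per_lane)
    total = sum(len(lane) for lane in per_lane)
    return bytes(per_lane[i % lanes][i // lanes] for i in range(total))

payload = bytes(range(8))
lanes = stripe(payload, 4)           # a x4 link: 2 bytes per lane
print(lanes)                         # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(unstripe(lanes) == payload)    # True
```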
Device Types
The PCI Express specification identifies several types of PCI Express ele-
ments: a root complex, a PCI Express-PCI bridge, an endpoint and a
switch. These device elements emulate the PCI configuration model, but
apply it more closely to the variety of potential point-to-point PCI Express
topologies. Figure 5.3 demonstrates how these elements play together
within a PCI Express world.
[Figure 5.3: a PCI Express topology: the CPU sits above the root complex, which connects over PCI Express links to endpoints, a switch, and a PCI Express-PCI bridge leading to PCI/PCI-X]
■ The root complex is the head or root of the connection of the I/O
system to the CPU and memory. For example, in today’s PC chip-
set system architecture, the (G)MCH (Graphics & Memory Con-
troller Hub) or a combination of the (G)MCH and ICH (I/O
Controller Hub) could be considered the root complex. Each in-
terface off of the root complex defines a separate hierarchy do-
main. Supporting transactions across hierarchies is not a required
80 Introduction to PCI Express: A Hardware and Software Developers Guide
Even though PCI Express links are point-to-point, this does not always
mean that one of the devices on the link is the requester and the other
the completer. For example, say that the root complex in Figure 5.3
wants to communicate with a PCI Express endpoint that is downstream
of the switch. The root complex is the requester and the endpoint is the
completer. Even though the switch receives the transaction from the root
complex, it is not considered a completer of that transaction. Even
though the endpoint receives the transaction from the switch, it does not
consider the switch to be the requester of that transaction. The requester
identifies itself within the request packet it sends out, and this informs
the completer (and/or switch) where it should return the completion
packets (if needed).
Transaction Types
The PCI Express architecture defines four transaction types: memory,
I/O, configuration and message. This is similar to the traditional PCI
transactions, with the notable difference being the addition of a message
transaction type.
Memory Transactions
Transactions targeting the memory space transfer data to or from a mem-
ory-mapped location. There are several types of memory transactions:
Memory Read Request, Memory Read Completion, and Memory Write
Request. Memory transactions use one of two different address formats,
either 32-bit addressing (short address) or 64-bit addressing (long ad-
dress).
I/O Transactions
Transactions targeting the I/O space transfer data to or from an I/O-
mapped location. PCI Express supports this address space for compatibil-
ity with existing devices that utilize this space. There are several types of
I/O transactions: I/O Read Request, I/O Read Completion, I/O Write Re-
quest, and I/O Write Completion. I/O transactions use only 32-bit ad-
dressing (short address format).
Configuration Transactions
Transactions targeting the configuration space are used for device con-
figuration and setup. These transactions access the configuration regis-
ters of PCI Express devices. Compared to traditional PCI, PCI Express
allows for many more configuration registers. For each function of each
device, PCI Express defines a 4-kilobyte configuration register block,
sixteen times the 256 bytes defined by PCI. There are several types of configuration transactions: Con-
figuration Read Request, Configuration Read Completion, Configuration
Write Request, and Configuration Write Completion.
Message Transactions
PCI Express adds a new transaction type to communicate a variety of
miscellaneous messages between PCI Express devices. Referred to simply
as messages, these transactions are used for things like interrupt signal-
ing, error signaling or power management. This address space is a new
addition for PCI Express and is necessary since these functions are no
longer available via sideband signals such as PME#, IERR#, and so on.
Build Layers
The specification defines three abstract layers that “build” a PCI Express
transaction, as shown in Figure 5.4. The first layer, logically enough, is re-
ferred to as the Transaction Layer. The main responsibility of this layer
is to begin the process of turning requests or completion data from the
device core into a PCI Express transaction. The Data Link Layer is the
second architectural build layer. The main responsibility of this layer is to
ensure that the transactions going back and forth across the link are re-
ceived properly. The third architectural build layer is called the Physical
Layer. This layer is responsible for the actual transmitting and receiving
of the transaction across the PCI Express link.
[Figure 5.4: The three architectural build layers — Transaction Layer, Data Link Layer, and Physical Layer — sit between the device core and the PCI Express link, each with transmit (Tx) and receive (Rx) functions.]
Since each PCI Express link is dual unidirectional, each of these ar-
chitectural layers has transmit as well as receive functions associated
with it. Outgoing PCI Express transactions may proceed from the trans-
mit side of the Transaction Layer to the transmit side of the Data Link
Layer to the transmit side of the Physical Layer. Incoming transactions
may proceed from the receive side of the Physical Layer to the receive
side of the Data Link Layer and then on to the receive side of the Transac-
tion Layer.
Packet Formation
In a traditional parallel interface like AGP, sideband signals (such as
C/BE[3:0]#, SBA[7:0] and so on) transmit the information for command
type, address location, length, and so on. As discussed previously, no
such sideband signals exist in PCI Express. Therefore, the packets that
are being sent back and forth must incorporate this sort of information.
The three architectural build layers accomplish this by “building up”
the packets into a full scale PCI Express transaction. This buildup is
shown in Figure 5.5.
[Figure 5.5: Packet buildup — at the Transaction Layer the packet consists of a header, data, and an optional ECRC; the Data Link Layer prepends a sequence number and appends an LCRC; the Physical Layer adds framing at each end.]
A CRC (cyclic redundancy check) value allows a receiver to detect when the information it has received
is unreliable.
The differences between the three CRC types deal with their sizes (32 bits long
versus 16 bits long), and the PCI Express layer that is responsible for generat-
ing and checking the values. Additional details on ECRCs are contained in
Chapter 6, “Transaction Layer Architecture” and additional details on LCRCs
and DLLP CRCs are contained in Chapter 7, “Data Link Layer Architecture.”
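As a rough illustration of this buildup, the following Python sketch concatenates the pieces each layer contributes. The function names are ours, the framing byte is a stand-in rather than a real 8-bit/10-bit control character, and the CRC values are assumed to be computed elsewhere:

    # Hypothetical sketch of the three-layer packet buildup.
    def transaction_layer(header: bytes, data: bytes, ecrc: bytes) -> bytes:
        # Header, optional data payload, optional ECRC digest
        return header + data + ecrc

    def data_link_layer(tlp: bytes, seq_num: int, lcrc: bytes) -> bytes:
        # Prepend 4 reserved bits + 12-bit sequence number, append LCRC
        return seq_num.to_bytes(2, "big") + tlp + lcrc

    def physical_layer(packet: bytes) -> bytes:
        FRAME = b"\xfb"   # placeholder framing symbol, not a real K-code
        return FRAME + packet + FRAME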
Figure 5.6 PCI Express in the Pathway between the CPU and System Firmware
The following example details how PCI Express may be used to help
boot up a standard computer. Once the system has powered up, the CPU
sends out a memory read request for the first BIOS instruction. This re-
quest comes to Device A across the processor’s system bus. Device A’s
core decodes this transaction and realizes that the requested address is
not its responsibility and this transaction needs to be forwarded out to
Device B. This is where PCI Express comes into play.
Device A’s core passes this memory read request to its PCI Express
block. This block is then responsible for turning the request into a le-
gitimate PCI Express request transaction and sending it out across the
PCI Express link. On the other side of the link, Device B’s PCI Express
block is responsible for receiving and decoding the request transaction,
verifying its integrity, and passing it along to the Device B core.
Now the Device B core has just received a memory read request, so it
sends that request out its LPC (low pin count) bus to read that address
location from the system’s flash/BIOS device. Once the Device B core re-
ceives the requested data back, it passes the data along to its PCI Express
block.
Device B’s PCI Express block is then responsible for turning this data
into a legitimate PCI Express completion transaction and sending it back
up the PCI Express link. On the other side of the link, Device A’s PCI Ex-
press block is responsible for receiving and decoding the transaction,
verifying its integrity, and passing it along to the Device A core. The De-
vice A core now has the appropriate information and forwards it along to
the CPU. The computer is now ready to start executing instructions.
With this big picture in mind, the following sections start to examine
how each of the PCI Express architectural layers contributes to accom-
plishing this task.
Transaction Layer
As mentioned earlier, the Transaction Layer is the uppermost PCI Express
architectural layer and starts the process of turning request or data pack-
ets from the device core into PCI Express transactions. This layer re-
ceives requests (such as “read from BIOS location FFF0h”) or completion
packets (“here is the result of that read”) from the device core. It is then
responsible for turning that request/data into a Transaction Layer Packet
(TLP). A TLP is simply a packet that is sent from the Transaction Layer of
one device to the Transaction Layer of the other device. The TLP uses a
header to identify the type of transaction that it is (for example, I/O ver-
sus memory, read versus write, request versus completion, and so on).
Please note that the Transaction Layer has direct interaction only
with its device core and its Data Link Layer, as shown in Figure 5.7. It re-
lies on its device core to provide valid requests and completion data, and
on its Data Link to get that information to and from the Transaction Layer
on the other side of the link.
[Figure 5.7: The Transaction Layer interfaces directly with the device core above it and the Data Link Layer below it.]
How might this layer behave in the previous “Big Picture” startup ex-
ample? The Device A core issues a memory read request with associated
length and address to its PCI Express block. The Transaction Layer’s
transmit functions turn that information into a TLP by building a memory
read request header. Once the TLP is created, it is passed along to the
transmit side of the Data Link layer. Some time later, Device A’s Transac-
tion Layer receives the completion packet for that request from the re-
ceive side of its Data Link Layer. The Transaction Layer’s receive side
then decodes the header associated with that packet and passes the data
along to its device core.
The Transaction Layer also has several other functions, such as flow
control and power management. Chapter 6, “Transaction Layer Architec-
ture” contains additional details on the Transaction Layer and TLPs.
Chapter 9, “Flow Control” contains additional details on the flow control
mechanisms for those TLPs, and Chapter 11, “Power Management” con-
tains additional details on the various power management functions.
Data Link Layer
The middle architectural layer is the Data Link Layer, which ensures that everything sent
across the link is wholesome. It is responsible for making sure that each
packet makes it across the link, and makes it across intact.
[Figure 5.8: The Data Link Layer sits between the Transaction Layer and the Physical Layer, with transmit and receive functions for each.]
This layer receives TLPs from the transmit side of the Transaction
Layer and continues the process of building that into a PCI Express trans-
action. It does this by adding a sequence number to the front of the
packet and an LCRC error checker to the end. The sequence number
serves the purpose of making sure that each packet makes it across the
link. For example, if the last sequence number that Device A successfully
received was #6, it expects the next packet to have a sequence number
of 7. If it instead sees #8, it knows that packet #7 got lost somewhere and
notifies Device B of the error. The LCRC serves to make sure that each
packet makes it across intact. As mentioned previously, if the LCRC does
not check out at the receiver side, the device knows that there was a bit
error sometime during the transmission of this packet. This scenario also
generates an error condition. Once the transmit side of the Data Link
Layer applies the sequence number and LCRC to the TLP, it submits the
result to the Physical Layer.
The receiver side of the Data Link Layer accepts incoming packets
from the Physical Layer and checks the sequence number and LCRC to
make sure the packet is correct. If it is correct, it then passes it up to the
receiver side of the Transaction Layer. If an error occurs (either wrong
sequence number or bad data), it does not pass the packet on to the
Transaction Layer until the issue has been resolved. In this way, the Data
Link Layer acts a lot like the security guard of the link. It makes sure that
only the packets that are “supposed to be there” are allowed through.
The Data Link Layer is also responsible for several link management
functions. To do this, it generates and consumes Data Link Layer Packets
(DLLPs). Unlike TLPs, these packets are created at the Data Link Layer.
These packets are used for link management functions such as error noti-
fication, power management, and so on.
How might this layer behave in the previous “Big Picture” startup ex-
ample? The Transaction Layer in Device A creates a memory read request
TLP and passes it along to the Data Link Layer. This layer adds the appro-
priate sequence number and generates an LCRC to append to the end of
the packet. Once these two functions are performed, the Data Link Layer
passes this new, larger packet along to the Physical Layer. Some time
later, Device A’s Data Link Layer receives the completion packet for that
request from the receive side of its Physical Layer. The Data Link Layer
then checks the sequence number and LCRC to make sure the received
read completion packet is correct.
What happens if the received packet at Device A was incorrect (as-
sume the LCRC did not check out)? The Data Link Layer in Device A then
creates a DLLP that states that there was an error and that Device B
should resend the packet. Device A’s Data Link Layer passes that DLLP on
to its Physical Layer, which sends it over to Device B. The Data Link
Layer in Device B receives that DLLP from its Physical Layer and decodes
the packet. It sees that there was an error on the read completion packet
and resubmits that packet to its Physical Layer. Please note that the Data
Link Layer of Device B does this on its own; it does not send it on to its
Transaction Layer. The Transaction Layer of Device B is not responsible
for the retry attempt.
Eventually, Device A receives that resent packet and it proceeds from
the receive side of the Physical Layer to the receive side of the Data Link
Layer. If the sequence number and LCRC check out this time around, it
then passes that packet along to the Transaction Layer. The Transaction
Layer in Device A has no idea that a retry was needed for this packet; it is
totally dependent on its Data Link Layer to make sure the packet is cor-
rect.
Additional details on this layer’s functions, sequence numbers, LCRCs
and DLLPs are explained in Chapter 7, “Data Link Layer Architecture.”
Physical Layer
Finally, the lowest PCI Express architectural layer is the Physical Layer.
This layer is responsible for actually sending and receiving all the data to
be sent across the PCI Express link. The Physical Layer interacts with its
Data Link Layer and the physical PCI Express link (wires, cables, optical
fiber, and so on), as shown in Figure 5.9. This layer contains all the cir-
cuitry for the interface operation: input and output buffers, parallel-to-serial
and serial-to-parallel converters, and the clocking and configuration logic
needed to operate the link.
[Figure 5.9: The Physical Layer sits between the Data Link Layer and the physical PCI Express link, with transmit and receive functions.]
How might this layer behave in the previous “Big Picture” startup ex-
ample? Once power up occurs, the Physical Layers on both Device A and
Device B are responsible for initializing the link to get it up and running
and ready for transactions. This initialization process includes determin-
ing how many lanes should be used for the link. To make this example
simple, both devices support a x1 link. Sometime after the link is prop-
erly initialized, that memory read request starts to work its way through
Device A. Eventually it makes its way down to Device A’s Physical Layer,
complete with a sequence number, memory read request header, and
LCRC. The Physical Layer takes that packet of data and transforms it into
a serial data stream after it applies data scrambling and 8-bit/10-bit
encoding to each character. The Physical Layer knows the link consists of a
single lane running at 2.5 gigabits per second, so it sends that data stream out its
single transmit differential pair at that speed. In doing this, it needs to meet certain
electrical and timing rules that are discussed in Chapter 8, “Physical Layer
Architecture.” The Physical Layer on Device B sees this data stream ap-
pear on its differential receiver input buffers and samples it accordingly.
It then decodes the stream, builds it back into a data packet and passes it
along to its Data Link Layer.
Please note that the Physical Layers of both devices completely insu-
late the rest of the layers and devices from the physical details for the
transmission of the data. How that data is transmitted across the link is
completely a function of the Physical Layer. In a traditional computer sys-
tem, the two devices would be located on the same FR4 motherboard
planar and connected via copper traces. There is nothing in the PCI Ex-
press specification, however, that would require this sort of implementa-
tion. If designed properly, the two devices could implement their PCI
Express buffers as optical circuits that are connected via a 6-foot-long op-
tical fiber cable. The rest of the layers would not know the difference.
This provides PCI Express an enormous amount of flexibility in the ways
it can be implemented. As speed or transmission media changes from sys-
tem to system, those modifications can be localized to one architectural
layer.
Additional details on this layer’s functions, 8-bit/10-bit encoding, elec-
trical requirements and timing requirements are explained in Chapter 8,
“Physical Layer Architecture.”
Chapter 6
Transaction
Layer
Architecture
This chapter goes into the details of the uppermost architectural layer:
the Transaction Layer. This layer creates and consumes the request
and completion packets that are the backbone of data transfer across PCI
Express. The chapter discusses the specifics for Transaction Layer Packet
(TLP) generation, how the header is used to identify the transaction, and
how the Transaction Layer handles incoming TLPs. Though TLP flow
control is a function of the Transaction Layer, that topic is discussed in
Chapter 9, “Flow Control” and is not discussed in this chapter.
On the transmit side, the Transaction Layer receives request data (such as “read
from BIOS location FFF0h”) or completion data (“here is the result of that
read”) from the device core, and then turns that information into an out-
going PCI Express transaction. On the receive side, the Transaction Layer
also accepts incoming PCI Express transactions from its Data Link Layer
(refer to Figure 6.1). This layer assumes all incoming information is cor-
rect, because it relies on its Data Link Layer to ensure that all incoming
information is error-free and properly ordered.
[Figure 6.1: The Transaction Layer sits between the device core and the Data Link Layer, with transmit and receive functions.]
[Figure 6.2: TLP format — a header, followed by an optional data payload (data byte 0 through data byte N-1), followed by an optional TLP digest.]
The TLP always begins with a header. The header is DWord aligned (al-
ways a multiple of four bytes) but varies in length based on the type of
transaction. Depending on the type of packet, TLPs may contain a data
payload. If present, the data payload is also DWord-aligned for both the
first and last DWord of data. DWord Byte Enable fields within the header
indicate whether “garbage” bytes are appended to either the beginning
or ending of the payload to achieve this DWord alignment. Finally, the
TLP may include a digest at the end of the packet.
Like the data payload, the digest is optional and is not always used. If
used, the digest field contains an ECRC (end-to-end CRC) that ensures the
contents of the TLP are properly conveyed from the source of the trans-
action to its ultimate destination. The Data Link Layer ensures that the
TLP makes it across a given link properly, but does not necessarily guar-
antee that the TLP makes it to its destination intact. For example, if the
TLP is routed through an intermediate device (such as a switch), it is pos-
sible that during the handling of the TLP, the switch introduces an error
within the TLP. An ECRC may be appended to the TLP to ensure that this
sort of error does not go undetected.
TLP Headers
All TLPs consist of a header that contains the basic identifying informa-
tion for the transaction. The TLP header may be either 3 or 4 DWords in
length, depending on the type of transaction. This section covers the de-
tails of the TLP header fields, beginning with the first DWord (bytes 0
through 3) for all TLP headers. The format for this DWord is shown in
Figure 6.3.
[Figure 6.3: The first DWord common to all TLP headers — reserved (R) bits, Fmt, Type, TC, TD, EP, Attr, and Length fields.]
TLP fields marked with an R indicate a reserved bit or field. Reserved bits
are filled with 0’s during TLP formation, and are ignored by receivers.
The format (Fmt) field indicates the format of the TLP itself. Table 6.1
shows the associated values for that field.
As can be seen in Table 6.1, the format field indicates the length of the
TLP header, but does not directly identify the type of transaction. This is
determined by the combination of the Format and Type fields, as shown
in Table 6.2.
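Treating the four bytes of Figure 6.3 as one big-endian 32-bit value, the common fields can be pulled out with simple shifts and masks. The bit positions below follow Figure 6.3; the Python function is an illustrative sketch of ours, not specification reference code:

    # Sketch: decode the common fields of a TLP header's first DWord.
    def decode_first_dword(dw: int) -> dict:
        return {
            "fmt":    (dw >> 29) & 0x3,    # header length, data attached?
            "type":   (dw >> 24) & 0x1F,   # with Fmt, selects the transaction
            "tc":     (dw >> 20) & 0x7,    # traffic class
            "td":     (dw >> 15) & 0x1,    # TLP digest (ECRC) present?
            "ep":     (dw >> 14) & 0x1,    # poisoned-data indication
            "attr":   (dw >> 12) & 0x3,    # attributes
            "length": dw & 0x3FF,          # payload length in DWords
        }

    # A 32-bit memory read request of one DWord decodes as
    # fmt=0, type=0, length=1.
    print(decode_first_dword(0x00000001))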
[Figure 6.4: Memory request header with 64-bit addressing (4 DWords) — the common first DWord; Requester ID, Tag, Last DW BE, and 1st DW BE; Address[63:32]; Address[31:2] plus reserved bits.]
[Figure 6.5: Memory request header with 32-bit addressing (3 DWords) — the common first DWord; Requester ID, Tag, Last DW BE, and 1st DW BE; Address[31:2] plus reserved bits.]
The address mapping for TLP headers is outlined in Table 6.3. All TLP
headers, not just memory requests, use this address scheme. Please note
that address bits [31:2] are not in the same location for 64-bit address
formats as they are for 32-bit addressing formats. If addressing a location
below 4 gigabytes, requesters must use the 32-bit address format.
The Requester ID field (bytes 4 and 5 in Figure 6.5) contains the bus, de-
vice and function number of the requester. This is a 16-bit value that is
unique for every PCI Express function within a hierarchy. Bus and device
numbers within a root complex may be assigned in an implementation
specific manner, but all other PCI Express devices (or functions within a
multi-function device) must comprehend the bus and device number
they are assigned during configuration. PCI Express devices (other than
the root complex) cannot make assumptions about their bus or device
number. Each device receives a configuration write that identifies its as-
signed bus and device number. Since this information is necessary to
generate any request TLP, a device cannot initiate a request until it re-
ceives that initial configuration write containing its assigned bus and de-
vice numbers. This model is consistent with the existing PCI model for
system initialization and configuration. Figure 6.6 shows the requester ID
format.
[Figure 6.6: Requester ID format — an 8-bit bus number, a 5-bit device number, and a 3-bit function number.]
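Assuming the conventional field widths shown in Figure 6.6 (an 8-bit bus number, a 5-bit device number, and a 3-bit function number), forming the 16-bit requester ID is a simple packing operation; the sketch below is ours:

    # Sketch: pack bus/device/function numbers into a 16-bit requester ID.
    def requester_id(bus: int, device: int, function: int) -> int:
        assert 0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8
        return (bus << 8) | (device << 3) | function

    print(hex(requester_id(1, 2, 0)))   # bus 1, device 2, function 0 -> 0x110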
The Tag field (byte 6 in Figure 6.5) is an 8-bit field that helps to uniquely
identify outstanding requests. The requester generates a unique tag value
for each of its outstanding requests that requires a completion. Requests
that do not require a completion do not have a tag assigned to them (the
tag field is undefined and may have any value). If a completion is re-
quired, the requester ID and tag value are copied into the completion
header. This allows the system to route that completion packet back up
to the original requester. The returned tag value identifies which request
the completion packet is responding to. These two values form a global
identification (referred to as a transaction ID) that uniquely identifies
each outstanding request within the hierarchy.
The Last DW BE and 1st DW BE fields (byte 7 in Figure 6.5) are byte enables
that indicate which bytes within the first and last DWords of the payload are valid.
If the request indicates a length greater than a single DWord, neither the
First DW BE field nor the Last DW BE field can be 0000b. Both must spec-
ify at least a single valid byte within their respective DWord. For exam-
ple, if a device wanted to write six bytes to memory, it needs to send a
data payload of two DWords, but only six of the accompanying eight
bytes of data would be legitimately intended for that write. In order to
make sure the completer knows which bytes are to be written, the re-
quester could indicate a First DW BE field of 1111b and a Last DW BE
field of 0011b. This indicates that the four bytes of the first DWord and
the first two bytes of the second (and last) DWord are the six bytes in-
tended to be written. The completer knows that the final two bytes of
the accompanying data payload are not to be written.
If the request indicates a data length of a single DWord, the Last DW
BE field must equal 0000b. If the request is for a single DWord, the First
DW BE field can also be 0000b. If a write request of a single DWord is
accompanied by a First DW BE field of 0000b, that request should have
no effect at the completer and is not considered a malformed (improp-
erly built) packet. If a read request of a single DWord is accompanied by
a First DW BE field of 0000b, the corresponding completion for that re-
quest should contain (and indicate in its Length field) a one-DWord data payload. The
contents of that data payload are unspecified, however, and may be any
value. A memory read request of one DWord with no bytes enabled is re-
ferred to as a “zero length read”. These reads may be used by devices as a
type of flush request, allowing a device to ensure that previously issued
posted writes have been completed.
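The byte enable rules above can be made concrete with a small helper that, given a starting byte address and a transfer length, produces the Length field (in DWords) and the two byte enable fields. This is an illustrative sketch of the contiguous-transfer rules only (bit i of a BE field enables byte i of its DWord); it does not model the non-contiguous byte enable cases the specification also permits:

    # Sketch: Length (in DWords) plus First/Last DW byte enables for a
    # contiguous, byte-aligned transfer.
    def byte_enables(addr: int, nbytes: int):
        assert nbytes > 0
        first_off = addr & 0x3                  # byte offset into first DWord
        total = first_off + nbytes
        ndw = (total + 3) // 4                  # DWords spanned
        first_be = 0xF & (0xF << first_off)
        used_in_last = total - 4 * (ndw - 1)    # bytes used in last DWord
        last_be = (1 << used_in_last) - 1
        if ndw == 1:
            return ndw, first_be & last_be, 0b0000  # Last DW BE must be 0000b
        return ndw, first_be, last_be

    # The six-byte example from the text: 2 DWords, First DW BE = 1111b,
    # Last DW BE = 0011b.
    print(byte_enables(0x100, 6))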
[Figure 6.7: I/O request header (3 DWords) — the common first DWord with TC = 000b, Attr = 00b, and Length = 1; Requester ID, Tag, Last DW BE = 0000b, and 1st DW BE; Address[31:2] plus reserved bits.]
[Figure 6.8: Configuration request header (3 DWords) — the common first DWord with TC = 000b, Attr = 00b, and Length = 1; Requester ID, Tag, Last DW BE = 0000b, and 1st DW BE; Bus Number, Device Number, Function Number, Extended Register Number, and Register Number fields.]
Message Headers
Recall that since PCI Express has no sideband signals (such as INTA#,
PME#, and so on), all special events must be transmitted as packets
(called messages) across the PCI Express link. There are two different
types of messages, those classified as baseline messages, and those
needed for advanced switching. Baseline messages are used for INTx in-
terrupt signaling, power management, error signaling, locked transaction
support, slot power limit support, hot plug signaling, and for other ven-
dor defined messaging. Advanced switching messages are used for data
packet messages or signal packet messages.
Baseline Messages
All baseline messages have the common DWord shown in Figure 6.3 as
the first DWord of the header. The second DWord for all baseline mes-
sages uses the transaction ID (requester ID + tag) in the same location as
memory, I/O and configuration requests. It then adds a Message Code
field to specify the type of message. Figure 6.9 shows the format for a
baseline message header.
[Figure 6.9: Baseline message header — the common first DWord with Attr = 00b; Requester ID, Tag, and Message Code; bytes 8 through 15 depend on the type of message.]
Most messages use the Msg encoding for the Type field. Exceptions
to this include the Slot Power Limit message, which uses the MsgD for-
mat, and vendor defined messages, which may use either the Msg or
MsgD encoding. Recall from Table 6.2 that the Msg encoding is 01b for
Format and 10r2r1r0b for Type, where r[2:0] indicates message routing.
The MsgD encoding is similar, but with 11b in the Fmt field indicating
that a data payload is attached. In addition to the address-based routing
used by memory and I/O requests, and the ID-based routing employed by
configuration requests, messages may use several other routing schemes.
The r[2:0] sub-field indicates the type of routing scheme that a particular
message employs. Table 6.5 outlines the various routing options.
INTx interrupt signaling messages must use the default traffic class, TC0 (this is different than MSI interrupts,
which are not restricted to the default traffic class).
Power Management Messages. These messages are used to support PCI Express power management.
Error Signaling Messages. Correctable errors are error conditions where the PCI Express proto-
col (and specifically hardware) can recover without any loss of informa-
tion. An example of this type of error is an LCRC error that is detected by
the Data Link Layer and corrected through normal retry means. An un-
correctable error is one that impacts the functionality of the interface and
may be classified as either fatal or nonfatal. A fatal error is uncorrectable
and renders that particular link unreliable. A reset of the link may be re-
quired to return to normal, reliable operation. Platform handling of fatal
errors is implementation-specific.
Locked Transaction Support Messages. The Unlock message does not include a data payload and treats the
Length field as reserved. Unlock messages must use the default traffic
class, TC0. As evidenced by the r[2:0] value, the root complex initiates
and broadcasts this message.
Slot Power Limit Messages. PCI Express provides a mechanism for an upstream device to communicate a slot power budget to the device attached below it.
The Set Slot Power Limit message contains a one DWord data payload
with the relevant power information. This data payload is a copy of the
slot capabilities register of the upstream device and is written into the
device capabilities register of the downstream device. Slot Power mes-
sages must use the default traffic class, TC0. As evidenced by the r[2:0]
value, this message is only intended to be sent from an upstream device
(root complex or switch) to its link mate.
Hot Plug Messages. The PCI Express architecture is defined to natively
support both hot plug and hot removal of devices. There are seven dis-
tinct Hot Plug messages. As shown in Table 6.11, these messages simu-
late the various states of the power indicator, attention button, and
attention indicator.
Hot plug messages do not contain a data payload and treat the Length
field as reserved. Hot plug messages must use the default traffic class,
TC0. Additional details on PCI Express hot plug support are found in
Chapter 12.
Completion Packet/Header
Some, but not all, of the requests outlined so far in this chapter may re-
quire a completion packet. Completion packets always contain a comple-
tion header and, depending on the type of completion, may contain a
number of DWords of data as well. Since completion packets are really
only differentiated based on the completion header, this section focuses
on that header format.
Completion headers are three DWords in length and have the com-
mon DWord shown in Figure 6.3 as the first DWord of the header. The
second DWord for completion headers makes use of some unique fields: a
Completer ID, Completion Status, Byte Count Modified (BCM) and Byte
Count. The third and final DWord contains the requester ID and tag val-
ues, along with a Lower Address field. Figure 6.10 shows the format for a
completion header.
[Figure 6.10: Completion header (3 DWords) — the common first DWord; Completer ID, Completion Status, BCM, and Byte Count; Requester ID, Tag, and Lower Address plus a reserved bit.]
Completion packets are routed by ID, and more specifically, the re-
quester ID that was supplied with the original request. The Completer ID
field (bytes 4 and 5) is a 16-bit value that is unique for every PCI Express
function within the hierarchy. It essentially follows the exact same for-
mat as the requester ID, except that it contains the component informa-
tion for the completer instead of the requester. This format is shown in
Figure 6.11.
[Figure 6.11: Completer ID format — an 8-bit bus number, a 5-bit device number, and a 3-bit function number.]
The Completion Status field (bits [7:5] of byte 6) indicates if the request
has been completed successfully. There are four defined completion
status responses, as shown in Table 6.12. The TLP Handling section later
in this chapter contains the details for when each of these completion
options is used.
The Length field indicates the size of the data payload in DWords. A value of 00 0000 0001b
in this location indicates a data payload that is one DWord long. A value
of 00 0000 0010b indicates a two DWord value, and so on up to a maxi-
mum of 1024 DWords. The data payload for a TLP must not exceed the
maximum allowable payload size, as defined in the device’s control regis-
ter (and more specifically, the Max_Payload_Size field of that register).
TLPs that use a data payload must have the value in the Length field
match the actual amount of data contained in the payload. Receivers
must check to verify this rule and, if violated, consider that TLP to be
malformed and report the appropriate error. Additionally, requests must
not specify an address and length combination that crosses a 4 kilobyte
boundary.
When a data payload is included in a TLP, the first byte of data corre-
sponds to the lowest byte address (that is to say, closest to zero) and sub-
sequent bytes of data are in increasing byte address sequence. For
example, a 16 byte write to location 100h would place the data in the
payload as shown in Figure 6.12.
[Figure 6.12: Data payload ordering for a 16-byte write to location 100h — the byte at address 100h is the first byte of the payload, with subsequent bytes in increasing address order.]
TLP Digest
The Data Link Layer provides the basic data reliability mechanism within
PCI Express via the use of a 32-bit LCRC. This LCRC code can detect er-
rors in TLPs on a link-by-link basis and allows for a retransmit mechanism
for error recovery. This LCRC, however, is based upon the TLP the Data
Link is provided by its Transaction Layer. If an error is induced within the
TLP prior to being provided to the Data Link Layer (for example, by a
switch processing the TLP), the resultant LCRC has no ability to detect
that the TLP itself was in error.
To ensure end-to-end data integrity, the TLP may contain a digest that
has an end-to-end CRC. This optional field protects the contents of the
TLP through the entire system, and can be used in systems that require
high data reliability. The Transaction Layer of the source component
generates the 32-bit ECRC. The ECRC calculation begins with bit 0 of
byte 0 and proceeds from bit 0 to bit 7 of each subsequent byte in the
TLP. It incorporates the entire TLP header and, if present, the data pay-
load. The exact details for the ECRC algorithm are contained in the PCI
Express Base Specification, Rev 1.0. Once calculated, that ECRC value is
placed in the digest field at the end of the TLP (refer to Figure 6.2). If the
ECRC is present and support is enabled, the destination device applies
the same ECRC calculation and compares the value to what is received in
the TLP digest.
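The flavor of the calculation can be sketched with a bit-serial CRC-32 that consumes bit 0 through bit 7 of each byte, as described above. The polynomial below (04C11DB7h) is the one commonly associated with the PCI Express CRCs, but this sketch deliberately omits the specification's exact rules (seed value, variant bits, and final bit ordering), so treat it as illustrative only:

    # Illustrative bit-serial CRC-32 over a TLP header and payload.
    def crc32_serial(data: bytes, poly: int = 0x04C11DB7) -> int:
        crc = 0xFFFFFFFF                      # assumed seed; see the spec
        for byte in data:
            for bit in range(8):              # bit 0 first, per the text
                inbit = (byte >> bit) & 1
                top = (crc >> 31) & 1
                crc = (crc << 1) & 0xFFFFFFFF
                if top ^ inbit:
                    crc ^= poly
        return crc

    tlp = bytes.fromhex("00000001deadbeef")   # header DWord + 1 DWord of data
    print(hex(crc32_serial(tlp)))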
The TD bit (bit 7 of byte 2 in the header) indicates whether a TLP di-
gest is provided at the end of the TLP. A value of 1b in this location indi-
cates that a TLP digest is attached, while a value of 0b indicates that no
TLP digest is present. Now what happens if, during the handling of the
TLP, a switch induces an error on the TD bit? It could accidentally switch
it from a 1 to a 0, which would negate the use of the ECRC and could
lead to other undetected errors. The PCI Express specification does not
really have a way to avoid this potential issue, other than to highlight that
it is of the utmost importance that switches maintain the integrity of the
TD bit.
The capability to generate and check ECRCs is reported to software
(by an Advanced Error Capabilities and Control register), which also con-
trols whether the capability is enabled. If a device is enabled to generate
and/or check ECRCs, it must do so for all TLPs.
TLP Handling
This section details how the Transaction Layer handles incoming
TLPs, once they have been verified by the Data Link Layer. A TLP that
makes it through the Data Link Layer has been verified to have traversed
the link properly, but that does not necessarily mean that the TLP is cor-
rect. A TLP may make it across the link intact, but may have been im-
properly formed by its originator. As such, the receiver side of the Trans-
action Layer performs some checks on the TLP to make sure it has fol-
lowed the rules described in this chapter. If the incoming TLP does not
check out properly, it is considered a malformed packet, is discarded
(without updating receiver flow control information) and generates an
error condition. If the TLP is legitimate, the Transaction Layer updates its
flow control tracking and continues to process the packet. This is seen in
the flowchart in Figure 6.13.
[Figure 6.13: Incoming TLP flow — update flow control tracking, then process the TLP as either a request TLP or a completion TLP.]
Request Handling
If the TLP is a request packet, the Transaction Layer first checks to make
sure that the request type is supported. If it is not supported, it generates
a non-fatal error and notifies the root complex. If that unsupported re-
quest requires a completion, the Transaction Layer generates a completion
with a completion status of Unsupported Request (UR).
[Figure 6.14: Request handling flow — if the request type is not supported, or a message's code value is not defined, the request is handled as an Unsupported Request: if it requires a completion, a completion with status UR is sent; otherwise it is simply discarded. Supported messages are handled as messages, and all other supported requests are processed.]
The shaded Process Request box indicates that there are optional im-
plementation methods that may be employed by a PCI Express compo-
nent. For example, if a component wanted to restrict the supported
characteristics of requests (for performance optimizations), it is permit-
ted to issue a Completer Abort if it receives a request that violates its re-
stricted model.
Another implementation-specific option may arise with configuration
requests. Some devices may require a lengthy self-initialization sequence
before they are able to properly handle configuration requests. Rather
than force all configuration requests to wait for the maximum allowable
time, such a device may respond with a completion status of Configuration
Request Retry Status, prompting the configuration request to be reissued later.
Completion Handling
If a device receives a completion that does not correspond to any out-
standing request, that completion is referred to as an unexpected com-
pletion. Receipt of an unexpected completion causes the completion to
be discarded and results in a nonfatal error condition. The receipt of
unsuccessful completion packets generates an error condition that is de-
pendent on the completion status. The details for how successful com-
pletions are handled and impact flow control logic are contained in
Chapter 9, “Flow Control.”
Chapter 7
Data Link Layer
Architecture
This chapter describes the details of the middle architectural layer, the
Data Link Layer. The Data Link Layer’s main responsibility is error
detection and correction. The chapter discusses the sequence number
and LCRC (Link CRC), and how they are added to the Transaction Layer
Packet (TLP) to ensure data integrity. It then describes the functions spe-
cific to the Data Link Layer, particularly the creation and consumption of
Data Link Layer Packets (DLLPs).
The Data Link Layer adds a sequence number to the front of the
packet and an LCRC error checker to the tail. Once the transmit side of
the Data Link Layer has applied these to the TLP, the Data Link Layer
forwards it on to the Physical Layer. Like the Transaction Layer, the Data
Link Layer has unique duties for both outgoing packets and incoming
packets. For incoming TLPs, the Data Link Layer accepts the packets
from the Physical Layer and checks the sequence number and LCRC to
make sure the packet is correct. If it is correct, the Data Link Layer re-
moves the sequence number and LCRC, then passes the packet up to the
receiver side of the Transaction Layer. If an error is detected (either
wrong sequence number or LCRC does not match), the Data Link Layer
does not pass the “bad” packet on to the Transaction Layer. Instead, the
Data Link Layer communicates with its link mate to try and resolve the is-
sue through a retry attempt. The Data Link Layer only passes a TLP
through to the Transaction Layer if the packet’s sequence number and
LCRC values check out. It is important to note this because this “gate-
keeping” allows the Transaction Layer to assume that everything it re-
ceives from the link is correct. As seen in Figure 7.1, the Data Link Layer
forwards outgoing transactions from the Transaction Layer to the Physi-
cal Layer, and incoming transactions from the Physical Layer to the
Transaction Layer.
[Figure 7.1: The Data Link Layer sits between the Transaction Layer and the Physical Layer, forwarding outgoing TLPs downward to the Physical Layer and incoming TLPs upward to the Transaction Layer.]
Sequence Number
The Data Link Layer assigns a 12-bit sequence number to each TLP as it is
passed from the transmit side of its Transaction Layer. The Data Link
Layer applies the sequence number, along with a 4-bit reserved field to
the front of the TLP. Refer to Figure 7.3 for the sequence number format.
To accomplish this, the transmit side of this layer needs to implement
two simple counters, one indicating what the next transmit sequence
number should be, and one indicating the most recently acknowledged
sequence number. When a sequence number is applied to an outgoing
TLP, the Data Link Layer refers to its next sequence counter for the ap-
propriate value. Once that sequence number is applied, the Data Link
Layer increments its next sequence counter by one.
[Figure 7.3: Sequence number format — a 4-bit reserved field and the 12-bit TLP sequence number are prepended to the front of the TLP header.]
On the receiver side, the Data Link Layer checks the sequence num-
ber (and LCRC). If they check out properly, the TLP is passed on to the
Transaction Layer. If the sequence number (or LCRC) is incorrect, the
Data Link Layer requests a retry. To accomplish this, the receive side of
this layer needs to implement a counter for the next receiver sequence
number, which indicates the next expected sequence number. If the re-
ceived sequence number matches that counter (and the LCRC checks),
the Data Link Layer then removes the sequence number, associated re-
served bits, and the LCRC. Once the layer removes that data, it forwards
the incoming TLP on to the receive side of the Transaction Layer. When
this occurs, the Data Link Layer increments its next receiver sequence
counter.
If the sequence number does not match the value stored in the re-
ceiver’s next sequence counter, that Data Link Layer discards that TLP.
The Data Link Layer checks to see if the TLP is a duplicate. If it is, it
schedules an acknowledgement (Ack) DLLP to be sent out for that
packet. If the TLP is not a duplicate, it schedules a negative acknowl-
edgement (Nak) DLLP to report a missing TLP. The “Retries” section of
this chapter explains this procedure in more detail.
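A minimal sketch of this receive-side bookkeeping might look like the following. The class and function names are ours, the 12-bit arithmetic wraps modulo 4096, and the duplicate test (a sequence number “behind” the expected one) is a simplification of the specification's rules:

    # Sketch of receive-side sequence number handling.
    SEQ_MOD = 1 << 12                    # sequence numbers are 12 bits

    def deliver_to_transaction_layer(tlp):
        pass                             # placeholder handoff for this sketch

    class DataLinkReceiver:
        def __init__(self):
            self.next_rcv_seq = 0        # next expected sequence number

        def receive(self, seq: int, lcrc_ok: bool, tlp) -> str:
            if not lcrc_ok:
                return "Nak"             # bad LCRC: discard, schedule a Nak
            if seq == self.next_rcv_seq:
                self.next_rcv_seq = (seq + 1) % SEQ_MOD
                deliver_to_transaction_layer(tlp)
                return "Ack"
            # A "stale" number means a duplicate we already accepted, so
            # acknowledge it again; otherwise a TLP was lost: schedule a Nak.
            behind = (self.next_rcv_seq - seq) % SEQ_MOD
            return "Ack" if 0 < behind < SEQ_MOD // 2 else "Nak"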
The Data Link Layer does not differentiate among types of TLP when
assigning the sequence number. Transactions destined to I/O space do
not have a different set of sequence numbers than memory transactions.
Sequence numbers are not dependent on the completer of the transac-
tion. The Data Link Layer of the transmitting device is the sole determi-
nant of the sequence number assigned to a TLP.
The sequence number is completely link-dependent. If a TLP passes
through a PCI Express device (such as a switch), it has different sequence
numbers associated with it on a link-to-link basis. The TLP header con-
tains all the global identifying information. The sequence numbers only
have meaning for a single transmitter and receiver. For example, if a PCI
Express switch receives a request TLP from its upstream link, it processes
that packet through its upstream receiver logic. That packet has a se-
quence number associated with it that the upstream Data Link Layer veri-
fies. Once verified and acknowledged on the upstream side, that se-
quence number no longer means anything. After the request TLP is
passed through the Transaction Layer of the upstream port, it is sent
along to the appropriate downstream port. There, the TX side of the
downstream Data Link Layer appends its own sequence number as the
request TLP is sent out the downstream port. The downstream receiver
verifies and acknowledges this sequence number. If the TLP requires a
completion packet, the sequence numbers for the completion TLP is also
completely independent. The sequence number for the completion TLP
on the downstream link has no relationship to the request TLP’s se-
quence number or the upstream link’s sequence number (once it is for-
warded). Refer to Figure 7.4 for additional clarification: sequence
numbers A, B, C and D are completely independent of one another.
[Figure 7.4: In this example, a request is made from the root complex to the PCI Express endpoint through a switch. The request and completion sequence numbers on the upstream and downstream links (A, B, C, and D) have no relationship to one another.]
LCRC
The Data Link Layer protects the contents of the TLP by using a 32-bit
LCRC value. The Data Link Layer calculates the LCRC value based on the
TLP received from the Transaction Layer and the sequence number it has
just applied. The LCRC calculation utilizes each bit in the packet, including
the reserved bits (such as bits 7:4 of byte 0). The exact details for the LCRC
algorithm are contained in the PCI Express Base Specification, Rev 1.0.
On the receiver side, the first step that the Data Link Layer takes is to
check the LCRC value. It does this by applying the same LCRC algorithm
to the received TLP (not including the attached 32-bit LCRC). If a single
or multiple-bit error occurs during transmission, the calculated LCRC
value should not match the received LCRC value. If the calculated value
equals the received value, the Data Link Layer then proceeds to check
the sequence number. If the calculated LCRC value does not equal the
received value, the TLP is discarded and a Nak DLLP is scheduled for
transmission.
Like sequence numbers, the LCRC protects the contents of a TLP on
a link-by-link basis. If a TLP travels across several links (for example,
passes through a switch on its way to the root complex), an LCRC value
is generated and checked for each link. In this way, it is different than the
ECRC value that may be generated for a TLP. The ECRC serves to protect
the TLP contents from one end of the PCI Express topology to the other
end (refer to Chapter 6), while the LCRC only ensures TLP reliability for a
given link. The 32-bit LCRC value for TLPs is also differentiated from the
16-bit CRC value that is used for DLLP packets.
Retries
The transmitter cannot assume that a transaction has been properly re-
ceived until it gets a proper acknowledgement back from the receiver. If
the receiver sends back a Nak (for something like a bad sequence num-
ber or LCRC), or fails to send back an Ack in an appropriate amount of
time, the transmitter needs to retry all unacknowledged TLPs. To accom-
plish this, the transmitter implements a Data Link Layer retry buffer.
A copy of each transmitted TLP must be stored in the Data Link Layer
retry buffer. Once the transmitter receives an appropriate acknowledge-
ment back, it purges the appropriate TLPs from its retry buffer. It also
updates its acknowledged sequence number counter.
Note A quick note on retry terminology: the PCI Express specification often flips back
and forth between the terms retry and replay. For example, the buffer that is
used during retry attempts is called a retry buffer, but the timeout counter asso-
ciated with that buffer is called a replay timer. To avoid as much confusion as
possible, this chapter sticks to the term retry as much as possible and only uses
replay when referring to a function that uses that term expressly within the
specification.
TLPs may be retried for two reasons. First, a TLP is retried if the receiver
sends back a Nak DLLP indicating some sort of transmission error. The
second reason for a retry deals with a replay timer, which helps ensure
that forward progress is being made. The transmitter side of the Data
Link Layer needs to implement a replay timer that counts the time since
the last Ack or Nak DLLP was received. This timer runs anytime there is
an outstanding TLP and is reset every time an Ack or Nak DLLP is re-
ceived. When no TLPs are outstanding, the timer should reset and hold
so that it does not unnecessarily cause a time-out. The replay timer limit
depends upon the link width and maximum payload size. The larger the
maximum payload size and the narrower the link width, the longer the
replay timer can run before timing out (since each packet requires more
time to transmit). If the replay timer times out, the Data Link Layer re-
ports an error condition.
If either of these events occurs—either a Nak reception or a replay
timer expiration—the transmitter’s Data Link Layer begins a retry. The
Data Link Layer increments a replay number counter. This is a 2-bit
counter that keeps track of the number of times the retry buffer has been
retransmitted. If the replay counter rolls over from 11b to 00b (that is,
this is the fourth retry attempt) the Data Link Layer indicates an error
condition that requires the Physical Layer to retrain the link (refer to
Chapter 8, “Physical Layer Architecture” for details on retraining). The
Data Link Layer resets its replay counter every time it successfully re-
ceives an acknowledgement, so the retrain procedure only occurs if a re-
try attempt continuously fails. In other words, four unsuccessful attempts
at a single retry create this error. Four unsuccessful retry attempts across
numerous packets with numerous intermediate acknowledgements do
not.
If the replay counter does not roll over, then the Data Link Layer pro-
ceeds with a normal retry attempt. It blocks acceptance of any new out-
going TLPs from its Transaction Layer and completes the transmission of
any TLPs currently in transmission. The Data Link Layer then retransmits
all unacknowledged TLPs. It begins with the oldest unacknowledged TLP
and retransmits in the same order as the original transmission. Once all
unacknowledged TLPs have been retransmitted, the Data Link Layer re-
sumes normal operation and once again accepts outgoing TLPs from its
Transaction Layer.
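The transmit-side machinery just described can be sketched in the same spirit. The structure and names below are ours, and a real implementation would also track the replay timer and flow control state:

    # Sketch of the retry buffer and the 2-bit replay number counter.
    from collections import OrderedDict

    class RetryBuffer:
        def __init__(self):
            self.unacked = OrderedDict()   # seq -> TLP copy, oldest first
            self.replay_num = 0            # 2-bit replay number counter

        def transmit(self, seq, tlp):
            self.unacked[seq] = tlp        # keep a copy until acknowledged

        def ack(self, seq):
            for s in list(self.unacked):   # an Ack covers everything older
                del self.unacked[s]
                if s == seq:
                    break
            self.replay_num = 0            # reset on a successful Ack

        def retry(self, resend):
            self.replay_num = (self.replay_num + 1) & 0x3
            if self.replay_num == 0:       # rolled over: fourth attempt
                raise RuntimeError("link retrain required")
            for seq, tlp in self.unacked.items():   # oldest first, in order
                resend(seq, tlp)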
During the retry attempt, the Data Link Layer still needs to accept in-
coming TLPs and DLLPs. If the layer receives an Ack or Nak DLLP during
the retry attempt it must be properly processed. If this occurs, the
transmitter may fully complete the retry attempt or may skip the re-
transmission of any newly acknowledged TLPs. However, once the Data
Link Layer starts to retransmit a TLP it must complete the transmission of
that TLP. For example, imagine the transmitter has sequence numbers #5-
8 sitting unacknowledged in its retry buffer and initiates a retry attempt.
The transmitter starts to retransmit all four TLPs, beginning with se-
quence number #5. If, during the retransmission of TLP #5, the transmit-
ter receives an Ack associated with sequence number #7, it must
complete the retransmission of TLP #5. Depending on the implementa-
tion, the transmitter either continues with the retransmission of TLPs #6,
#7, and #8, or skips the newly acknowledged TLPs (that is, up through
#7) and continues retransmitting the remaining unacknowledged TLPs—
in this example, #8.
If the transmitter receives multiple Acks during a retry, it can “col-
lapse” them into only the most recent. If in the previous example the
transmitter had seen separate individual Acks for #5, #6, and then #7, it
could discard the individual Acks for #5 and #6 and only process the Ack
for #7. Acknowledging #7 implies that all previous outstanding sequence
numbers (#5 and #6) are also acknowledged. Likewise, if, during retry,
the transmitter receives a Nak followed by an Ack with a later sequence
number, the Ack supersedes the Nak and that Nak is ignored.
[Figure: Ack/Nak DLLP format — byte 0 identifies the DLLP as an Ack or Nak, followed by reserved bits, the 12-bit Ack/Nak sequence number, and a 16-bit CRC.]
[Figure: Flow control DLLP format — a type field (P, NP, or Cpl), the VC ID, the HdrFC and DataFC credit fields, and a 16-bit CRC.]
[Figure: Power management DLLP format — a type field of 00100xxxb, reserved bits, and a 16-bit CRC.]
Processing a DLLP
The Physical Layer passes the received DLLP up to its Data Link Layer. If
the Physical Layer indicates a receiver error, it (and not the Data Link
Layer) reports the error condition. In this situation, the Data Link Layer
discards that DLLP. If the Physical Layer does not indicate a receiver er-
ror, the Data Link Layer calculates the CRC for the incoming DLLP. The
Data Link Layer then checks to see if the calculated value matches the
CRC attached to that DLLP. If the CRCs check out, the DLLP is processed.
In the event that the CRCs do not match, the DLLP is discarded and an
error is reported. This flow can be seen in Figure 7.8. Please note that
neither device expects to retry a DLLP. As such, DLLPs are not placed
into the retry buffer.
[Figure 7.8: DLLP processing flow — if the Physical Layer indicates a receiver error, the DLLP is discarded; otherwise the Data Link Layer calculates the CRC of the DLLP (not including the attached CRC) and compares it to the received value. If they match, the DLLP is processed; if not, the DLLP is discarded and an error is reported.]
[Figure: The Data Link Layer state machine — a reset event places the state machine in DL_Inactive, which may proceed to DL_Init and then to DL_Active.]
The DL_Inactive state is the initial state following a reset event. Upon
entry into this state, all Data Link Layer state information resets to default
values. Additionally, the Data Link Layer purges any entries in the retry
buffer. While in this state, the Data Link Layer reports DL_Down to the
Transaction Layer. This causes the Transaction Layer to discard any out-
standing transactions and cease any attempts to transmit TLPs. This is just as
well, because while in this state the Data Link Layer does not accept any
TLPs from either the Transaction Layer or the Physical Layer. The Data Link
Layer also does not generate or accept any DLLPs while in the Inactive
state. The state machine proceeds to the Init state if two conditions are
met: the Transaction Layer indicates the link is not disabled by software,
and the Physical Layer reports that the link is up (Physical LinkUp = 1).
The DL_Init state takes care of flow control initialization for the de-
fault virtual channel. While in this state, the Data Link Layer initializes the
default virtual channel according to the methods outlined in Chapter 9.
The DL status output changes during this state. It reports out DL_Down
while in FC_Init1 and switches over to DL_Up when it gets to FC_Init2.
The state machine proceeds to the Active state if FC initialization com-
pletes successfully and the Physical Layer continues to report that the
Physical Link is up. If the Physical Layer does not continue to indicate the
link is up (Physical LinkUp = 0), the state machine will return to the
DL_Inactive state.
The DL_Active state is the normal operating state. The Data Link
Layer accepts and processes incoming and outgoing TLPs, and generates
and accepts DLLPs as described in this chapter. While in this state, the
Data Link Layer reports DL_Up. If the Physical Layer does not continue to
indicate the link is up (Physical LinkUp = 0), the state machine returns to
the DL_Inactive state.
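The three states reduce to a small state machine. The sketch below encodes only the transition conditions named in the text (the link not being disabled by software, Physical LinkUp, and flow control initialization completing); the DL_Down/DL_Up reporting during FC_Init1 and FC_Init2 is collapsed into a single status function:

    # Sketch of the Data Link Layer state machine transitions.
    def next_state(state: str, link_up: bool, sw_enabled: bool,
                   fc_init_done: bool) -> str:
        if state == "DL_Inactive":
            return "DL_Init" if (sw_enabled and link_up) else "DL_Inactive"
        if state == "DL_Init":
            if not link_up:
                return "DL_Inactive"
            return "DL_Active" if fc_init_done else "DL_Init"
        if state == "DL_Active":
            return "DL_Active" if link_up else "DL_Inactive"
        return state

    def dl_status(state: str) -> str:
        # Simplification: DL_Up is also reported late in DL_Init (FC_Init2).
        return "DL_Up" if state == "DL_Active" else "DL_Down"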
Chapter 8
Physical Layer
Architecture
have been reached; however, to satisfy curiosity it can be noted that op-
tical wires are a likely solution.
The Physical Layer contains all the necessary digital and analog cir-
cuits required to configure and maintain the link. Additionally, the Physi-
cal Layer could contain a phase locked loop (PLL) to provide the
necessary clocking for the internal state machines. Given the understand-
ing that PCI Express supports data rates greater than 2.5 gigabits per sec-
ond, the data rate detect mechanisms have been predefined to minimize
the changes to support future generations of PCI Express. Additionally,
the Physical Layer of PCI Express is organized to provide isolation of the
circuits and logic that need to be modified and/or tuned in order to sup-
port next generation speeds. As illustrated in Figure 8.1, the architectural
forethought of layering and isolating the Physical Layer eases the transi-
tion for upgrading the technology by allowing maximum reuse of the
upper layers.
[Figure 8.1: PCI Express layering — software and the Transaction, Data Link, and Physical Layers sit above the mechanical layer (connectors, wire). Isolating the speed-dependent circuits within the Physical Layer allows the upper layers to be reused as the technology advances.]
There are two key sub-blocks that make up the Physical Layer archi-
tecture: a logical sub-block and an electrical sub-block. Both sub-blocks
have dedicated transmit and receive paths that allow dual unidirectional
communication (also referred to as dual simplex) between two PCI Ex-
press devices. These sub-blocks ensure that data gets to and from its des-
tination quickly and in good order, as shown in Figure 8.2.
[Figure 8.2: Each device's Physical Layer consists of a logical sub-block and an electrical sub-block, each with dedicated transmit and receive paths.]
Logical Sub-Block
The logical sub-block is the key decision maker for the Physical Layer. As
mentioned above, the logical sub-block has separate transmit and receive
paths, referred to hereafter as the transmit unit and receive unit. Both
units are capable of operating independently of one another.
The primary function of the transmit unit is to prepare data link
packets received from the Data Link Layer for transmission. This process
involves three primary stages: data scrambling, 8-bit/10-bit encoding, and
packet framing. The receive unit functions similarly to the transmit unit, but in reverse: it takes the deserialized physical packet pulled off the wire by the electrical sub-block, removes the framing, decodes it, and finally descrambles it. Figure 8.3 describes each of these stages along with the benefits each one provides.
[Figure 8.3: Stages of the logical sub-block. The transmit unit scrambles the packet, 8-bit/10-bit encodes it, and frames it before handing it to the electrical sub-block; the receive unit removes the framing, decodes, and de-scrambles packets arriving from the electrical sub-block.]
Data Scrambling
PCI Express employs a technique called data scrambling to reduce the
possibility of electrical resonances on the link. Electrical resonances can
cause unwanted effects such as data corruption and in some cases circuit
damage, due to electrical overstresses caused by large concentrations of
voltage. Since electrical resonances are somewhat difficult to predict, the
simplest solution is usually to prevent conditions that can cause electrical
resonances. Most electrical resonance conditions are caused by repeated
data patterns at the system’s preferred frequency. The preferred fre-
quency of a system depends on many factors, which are beyond the scope of this book; it is worth noting, however, that very few systems share the same preferred frequency. To avoid repeated data patterns, the PCI Express specification defines a scrambling/descrambling algorithm that is implemented using a linear feedback shift register (LFSR). PCI Express accomplishes scrambling or descrambling by performing a serial XOR operation on the data with the seed output of an LFSR that is synchronized between PCI Express devices. Scrambling is enabled by default; however, it can be disabled for diagnostic purposes.
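As an illustration of the mechanism, here is a minimal C sketch of a serial LFSR scrambler. The polynomial and seed shown (G(X) = X^16 + X^5 + X^4 + X^3 + 1, seed FFFFh) are the ones given in the PCI Express Base Specification for the 2.5 gigabit per second generation, but the bit-serial loop below is a simplification; the specification also has rules this sketch omits, such as never scrambling special symbols and resetting the LFSR on a COM symbol.

#include <stdint.h>

/* 16-bit LFSR in Galois form for G(X) = X^16 + X^5 + X^4 + X^3 + 1,
   seeded to FFFFh as the specification describes. */
static uint16_t lfsr = 0xFFFF;

/* Advance the LFSR one bit time and return the bit fed back. */
static int lfsr_step(void)
{
    int fb = (lfsr >> 15) & 1;      /* coefficient of X^16 after the shift */
    lfsr = (uint16_t)(lfsr << 1);
    if (fb)
        lfsr ^= 0x0039;             /* reduce by X^5 + X^4 + X^3 + 1 */
    return fb;
}

/* Scramble one data byte by XORing it, bit-serially, with the LFSR
   output. Descrambling is the identical operation performed with an
   identically seeded and synchronized LFSR at the receiver. */
uint8_t scramble_byte(uint8_t data)
{
    uint8_t out = 0;
    for (int bit = 0; bit < 8; bit++)
        out |= (uint8_t)((((data >> bit) & 1) ^ lfsr_step()) << bit);
    return out;
}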
8-Bit/10-Bit Encoding
The primary purpose of 8-bit/10-bit encoding is to embed a clock signal
into the data stream. By embedding a clock into the data, this encoding
scheme renders external clock signals unnecessary. An investigation of
parallel multi-drop bus technologies, like conventional PCI, has shown that as clock frequencies increase, the length-matching requirements become increasingly stringent. The dependency of a group of signals on a single clock source severely reduces the setup and hold margins of a particular data transaction. Take for example two data lines named Data Line 1 and Data Line 2 that are both referenced to a high-speed clock signal called Data Clock. At the transmitting source, both Data Line 1 and Data Line 2 have signals placed on the bus at the same instant in reference to
Data Clock. However, due to a slight mismatch in interconnect length
between Data Line 1, Data Line 2, and Data Clock, they all reach the re-
ceiving device at slightly different times. Since the receiving device sam-
ples the data based on the reception of the Data Clock, the overall
margin may be reduced significantly or the wrong data may be clocked
into the receiving device if the mismatch is bad enough. As bus frequen-
cies increase, the amount of allowable mismatch decreases or essentially
becomes zero. For an illustration of this concept see Figure 8.4.
Because PCI Express embeds a clock into the data, setup and hold
times are not compromised due to length mismatch between individual
PCI Express lanes within a link.
The concept of 8-bit/10-bit encoding is not something new that is
unique to PCI Express. This data encoding concept was actually patented
by IBM and used in Fibre Channel to increase data transfer lengths and
rates. Since then it has also been adopted and used in Serial ATA and Gi-
gabit Ethernet because of the benefits that can be had through its adoption.
[Table fragment: special symbols used for framing.
K29.7, End, FDh, 111 11101: marks the end of a Transaction Layer Packet or a Data Link Layer Packet.
K30.7, End Bad, FEh, 111 11110: marks the end of a nullified TLP.
Note: Reserved characters have not been given a name.]

[Figure: Example of 8-bit/10-bit encoding, showing the byte value 00h encoded as the 10-bit value 1101000110.]

[Figure 8.4: The left side shows a cut-out of a routing example in which the traces are "snaked" to length-match them to the clock in order to guarantee data is sampled with the clock. The right side shows a PCI Express routing solution, which does not require length matching to a clock signal, thereby freeing up board space and simplifying the routing.]
possible. This allows the receiving device to determine the health of the transmitted character by registering the effect the received character had on disparity.
Benefit 3: DC Balance. DC balancing is accomplished through running disparity. It is called out separately here to discuss the benefit of maintaining the balance of 1s and 0s from an electrical perspective rather than as an error-checking mechanism. Maintaining a proportionate number of 1s and 0s allows an individual data line to have an average DC voltage of approximately half of the logical threshold. This reduces the possibility of inter-symbol interference, which is the inability to switch from one logic level to the next because of system capacitive charging. Inter-symbol interference is discussed in more detail in the electrical sub-block section.
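A small C sketch of the bookkeeping behind this idea: the disparity of a 10-bit code group is simply its count of 1s minus its count of 0s, and 8-bit/10-bit encoding selects code groups so that the running sum of these values stays bounded, keeping the line DC-balanced. The function name is illustrative.

/* Disparity of a 10-bit code group: count of 1s minus count of 0s.
   Valid 8-bit/10-bit code groups have a disparity of -2, 0, or +2,
   and the encoder picks between alternate encodings to keep the
   running total near zero. */
int disparity10(unsigned code)   /* 10-bit value in bits 9..0 */
{
    int ones = 0;
    for (int i = 0; i < 10; i++)
        ones += (code >> i) & 1;
    return ones - (10 - ones);
}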
Packet Framing
In order to let the receiving device know where one packet starts and ends, identifying 10-bit special symbols are prepended and appended to a previously 8-bit/10-bit encoded data packet. The particular special symbols added to the data packet depend upon where the packet originated. If the packet originated from the Transaction Layer, the special symbol Start TLP (encoding K27.7) is added to the front of the data packet. If the packet originated from the Data Link Layer, the special symbol Start DLLP (encoding K28.2) is added to the beginning of the data packet. To end either a TLP or DLLP, the special symbol END (encoding K29.7) is appended, as shown in Figure 8.6.
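The framing step itself is mechanical, as the C sketch below shows. The symbol constants are hypothetical placeholders standing in for the 10-bit encodings of K27.7 and K29.7, and the function is illustrative rather than anything defined by the specification.

#include <stddef.h>

/* Hypothetical stand-ins for the 10-bit special symbols; the real
   values are the 8-bit/10-bit encodings of K27.7 (Start TLP) and
   K29.7 (END). */
enum { SYM_STP = 0x100, SYM_END = 0x101 };

/* Frame an encoded TLP by prepending Start TLP and appending END.
   'out' must have room for n + 2 symbols; returns the framed length. */
size_t frame_tlp(const unsigned *encoded, size_t n, unsigned *out)
{
    out[0] = SYM_STP;            /* K27.7 marks the start of a TLP */
    for (size_t i = 0; i < n; i++)
        out[i + 1] = encoded[i]; /* 8-bit/10-bit encoded payload symbols */
    out[n + 1] = SYM_END;        /* K29.7 marks the end of the packet */
    return n + 2;
}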
Electrical Sub-Block
As the logical sub-block of the Physical Layer fulfills the role of key decision maker, the electrical sub-block functions as the delivery mechanism for the physical architecture. The electrical sub-block contains transmit and receive buffers that transform the data into electrical signals that can be transmitted across the link. The electrical sub-block may also contain the PLL circuitry, which provides internal clocks for the device. The following paragraphs describe how and why the signaling of PCI Express works, and what a PLL (phase locked loop) actually does.
The concepts of AC coupling and de-emphasis are also discussed briefly.
Serial/Parallel Conversion
The transmit buffer in the electrical sub-block takes the en-
coded/packetized data from the logical sub-block and converts it into se-
rial format. Once the data has been serialized it is then routed to an
associated lane for transmission across the link. On the receive side the
receivers deserialize the data and feed it back to the logical sub-block for
further processing.
Clock Extraction
In addition to the serial/parallel conversion described above, the receive buffer in the electrical sub-block is responsible for recovering the
link clock that has been embedded in the data. With every incoming bit
transition, the receive side PLL circuits are resynchronized to maintain bit
and symbol (10 bits) lock.
Lane-to-Lane De-Skew
The receive buffer in the electrical sub-block de-skews data from the
various lanes of the link prior to assembling the serial data into a parallel
data packet. This is necessary to compensate for the allowable 20 nano-
seconds of lane-to-lane skew. Depending on the flight time characteristics
of a given transmission medium this could correlate to nearly 7 inches of
variance from lane to lane. The actual amount of skew the receive buffer
must compensate for is discovered during the training process for the
link.
Differential Signaling
PCI Express signaling differs considerably from the signaling technology
used in conventional PCI. Conventional PCI uses a parallel multi-drop
bus, which sends a signal across the wire at a given amplitude referenced to the system ground. In order for that signal to be received properly, it must reach its destination at a given time in reference to some external clock line. In addition, the signal must arrive at the destination with a given amplitude in order to register at the receiver. For relatively slow signals this type of signaling has worked quite well. However, as signals are transmitted at very high frequencies over distances of 12 inches or more, the low-pass filter effects of the common four-layer FR4 PC platform cause the electrical signals to become highly attenuated. In many cases the attenuation is so great that a parallel multi-drop bus receiver cannot detect the signal as valid. Electrically, there are two options to overcome this signal attenuation. One option is to shorten the length of the transmission path in order to reduce signal attenuation. In some cases this is possible; in most cases, however, it makes design extremely difficult, if not impossible. The other option is to use a different type of signaling technique that can help overcome the effects of attenuation.
PCI Express transmit and receive buffers are designed to convert the
logical data symbols into a differential signal. Differential signaling, as its
name might give away, is based on a relative difference between two dif-
ferent signals referred to as a differential pair. A differential pair is usually
signified by a positively notated signal and a negatively notated signal.
Logical bits are represented by the relative swing of the differential pair.
To illustrate how logical bits are represented electrically on a differential
pair, take the following example, as illustrated in Figure 8.7. A differential pair has a given voltage swing of around 1 volt, which means the positively notated signal swings to +1 volt when representing a logical 1 and to -1 volt when representing a logical 0. The negatively notated signal likewise swings to -1 volt when representing a logical 1 and to +1 volt when representing a logical 0. The peak-to-peak difference between the differential pair is thus 2 volts whichever logical bit is represented; the logical bit is determined by the direction in which the signals swing.
[Figure 8.7: Parallel multi-drop signaling of the bit pattern 0 1 0 referenced to a 0V system ground, contrasted with differential signaling of the same pattern on D+ and D-, referenced to 0V or some common mode DC voltage.]
AC Coupling
PCI Express uses AC coupling on the transmit side of the differential pair
to eliminate the DC Common Mode element. By removing the DC Com-
mon Mode element, the buffer design process for PCI Express becomes
much simpler. Each PCI Express device can have a unique DC Common
Mode voltage element, which is used during the detection process. The
link AC coupling removes the common mode element from view of the
receiving device. The range of AC coupling capacitance permitted by the PCI Express specification is 75 to 200 nanofarads.
De-Emphasis
PCI Express utilizes a concept referred to as de-emphasis to reduce the
effects of inter-symbol interference. In order to best explain how de-
emphasis works it is important to understand what inter-symbol interfer-
ence is. As frequencies increase, bit times decrease. As bit times decrease
the capacitive effects of the platform become much more apparent. Inter-
symbol interference comes into play when bits change rapidly on a bus
after being held constant for some time prior. Consider a differential bus that transmits five logical 1s in a row, the maximum number of same-bit transmissions allowable under 8-bit/10-bit encoding, followed by a logical 0 and then another logical 1. The transmission of the first five logical 1s charges the system capacitance formed by the layering process of the PCB stackup (a plate capacitor). When the system follows the five logical 1s with a logical 0 and then another logical 1, the system cannot discharge quickly enough to register the logical 0 before the next logical 1. The effect is inter-symbol interference, as shown in Figure 8.8.
[Figure 8.8: Inter-symbol interference, with a transmitted bit pattern arriving at the receiver too distorted to resolve correctly.]

[Figure: Physical Layer link training states: Detect (initial state), Polling, Configuration.]
Electrical Idle
Before describing the link configuration states, it seems appropriate to define electrical idle, since it is referred to throughout the remainder of this chapter. Upon initial power-up the device enters the electrical idle state, which is a steady-state condition where the transmit and receive voltages are held constant. The PCI Express specification defines constant as meaning that the differential pair lines have no more than 20 millivolts of difference between the pair after factoring out any DC common mode element. The minimum time that a transmitter must remain in electrical idle is 20 nanoseconds; however, the transmitter must attempt to detect a receiving device within 100 milliseconds. Electrical idle is primarily used for power saving modes and common mode voltage initialization.
Detect State
Upon power-up, the first Physical Layer state that the PCI Express link enters is the detect state. The detect state is also entered upon a
link reset condition, a surprise removal of a device, or an exit from the
link disabled state. The detect state determines whether or not there is a
device connected on the other side of the link. The detection process
takes place in the progression through three sub-states called quiet, ac-
tive, and charge.
Quiet Sub-State. During the quiet sub-state, four primary tasks are com-
pleted. The first task, completed by the electrical sub-block, is that the
transmitter in the downstream port (upstream device) begins driving its DC
Common Mode voltage while remaining in high impedance. The relationship between an upstream and a downstream port is shown in Figure 8.12.
[Figure 8.12: A downstream port (on the upstream component) connected by a link to an upstream port (on the downstream component).]
The PCI Express specification defines the upstream and downstream port relationship as follows: All ports on a root complex are downstream ports.
The downstream device on a link is the device farther from the root complex.
The port on a switch that is closest topologically to the root complex is the
upstream port. The port on an endpoint device or bridge component is an
upstream port. The upstream component on a link is the component closer to
the root complex.
The downstream device next selects the data rate, which is always 2.5
gigahertz during link training even when PCI Express speeds go beyond
the immediate generation. Finally, the downstream device clears the status of the linkup indicator to inform the system that a link connection is not currently established; a register in the Data Link Layer monitors the linkup status. The system remains in the quiet sub-state for only 12 milliseconds before attempting to proceed to the next sub-state.
Active Sub-State. Primary detection is completed during the active sub-
state. Detection is done by analyzing the effect that the upstream port
(downstream device) receiver loading has on the operating DC Common
Mode voltage output from the transmitter. If there is no upstream port connected, the rate of change of the applied DC Common Mode voltage is much faster than if a terminated upstream port receiver were sitting on the link. The detection process is done on a per-lane basis. The downstream device holds the transmitter in high impedance to disable any lanes on the downstream port that are not connected. During the detection process the downstream port transmitter is always in high impedance, even when driving the operating DC Common Mode voltage for detection. If an upstream device is detected, the sub-state machine proceeds through the charge sub-state to the polling state. If no upstream device can be detected, the sub-state machine returns to the quiet state and waits for 12 milliseconds before checking again for an upstream device.
Charge Sub-State. The final sub-state of the detect state is the charge
state. During this state the electrical sub-block of the downstream port
continues to drive the DC Common Mode voltage while remaining in a
high impedance electrical idle state. A timer is also set to count off 12 milliseconds. As soon as the DC Common Mode voltage is stable and within specification, or the 12-millisecond timer has expired, the state machine transitions to the polling state. Figure 8.13 illustrates the detect
sub-state machine.
[Figure 8.13: Detect sub-state machine. From entry, the quiet sub-state waits out a 12 ms timeout; the active sub-state returns to quiet on no detect, or, once a receiver is detected, proceeds through the 12 ms charge sub-state and exits to polling.]
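The progression can be summarized in a short C sketch. The names are illustrative, the detection predicate stands in for the DC Common Mode rate-of-change measurement described above, and the charge sub-state here advances only on its timer, whereas the text also allows an early exit once the common mode voltage is stable.

#include <stdbool.h>

/* Detect sub-states as described above: quiet, active, charge. */
typedef enum {
    DETECT_QUIET, DETECT_ACTIVE, DETECT_CHARGE, EXIT_TO_POLLING
} detect_state_t;

detect_state_t detect_step(detect_state_t s,
                           bool receiver_detected,  /* per-lane loading test */
                           bool timer_12ms_expired)
{
    switch (s) {
    case DETECT_QUIET:
        /* drive DC Common Mode in high impedance, select the data
           rate, clear the linkup indicator; wait out 12 ms */
        return timer_12ms_expired ? DETECT_ACTIVE : DETECT_QUIET;
    case DETECT_ACTIVE:
        /* detection via the common mode charging rate */
        return receiver_detected ? DETECT_CHARGE : DETECT_QUIET;
    case DETECT_CHARGE:
        /* keep driving DC Common Mode until stable or 12 ms elapse */
        return timer_12ms_expired ? EXIT_TO_POLLING : DETECT_CHARGE;
    default:
        return s;
    }
}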
Polling State
The polling state is the first state where training instructions called train-
ing ordered sets are sent out on all the individual PCI Express lanes. PCI
Express currently defines two training ordered sets called TS1 and TS2.
The two sets differ only in the identifier used to distinguish which training ordered set it actually is. TS1 ordered sets are used during the configuration process. Once all of the lanes of the link are trained, TS2 ordered sets are used to mark a successful training. During the polling state TS1 ordered sets are used to estab-
lish bit and symbol lock, to determine whether a single or multiple links
should be formed, and to select the data rate for the link.
Training ordered sets are simply a group of sixteen 8-bit/10-bit encoded special symbols and data characters, and they are never scrambled. These training instructions are used to establish the link data rate,
establish clock synchronization down to the bit level, and check lane po-
larity. Table 8.3 shows the training ordered set that is sent out during the
polling state.
Table 8.3 TS1 Ordered Sets Used during the Polling State

Symbol 0 (encoded value K28.5): COMMA code group for symbol alignment.
Symbol 1 (allowed values 0-255; encoded values D0.0-D31.7, K23.7): Link Number within device.
Symbol 2 (allowed values 0-31; encoded values D0.0-D31.0, K23.7): Lane Number within Port.
Symbol 3 (allowed values 0-255; encoded values D0.0-D31.7): N_FTS, the number of fast training ordered sets required by the receiver to obtain reliable bit and symbol lock.
Symbol 4 (allowed value 2; encoded value D2.0): Data Rate Identifier. Bit 0: reserved, set to 0. Bit 1 = 1: generation 1 (2.5 Gb/s) data rate supported. Bits 2:7: reserved, set to 0.
Symbol 5 (encoded values D0.0, D1.0, D2.0, D4.0, D8.0): Training control. Bit 0 = 0: de-assert reset; bit 0 = 1: assert reset. Bit 1 = 0: enable link; bit 1 = 1: disable link. Bit 2 = 0: no loopback; bit 2 = 1: enable loopback. Bit 3 = 0: enable scrambling; bit 3 = 1: disable scrambling. Bits 4:7: reserved.
Symbols 6-15 (encoded value D10.2): TS1 Identifier.

Note: The TS2 training ordered set is exactly the same as the TS1 training ordered set with one exception: in the place of symbols 6-15 is the TS2 encoded value D5.2.
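For readers who think in data structures, the sixteen symbols of Table 8.3 map naturally onto a struct. This C sketch shows a pre-encoding byte view of a training ordered set; the type and field names are illustrative, not from the specification.

#include <stdint.h>

/* A training ordered set as sixteen pre-encoding characters,
   following the symbol layout of Table 8.3. */
typedef struct {
    uint8_t com;          /* symbol 0: K28.5 COMMA, for symbol alignment */
    uint8_t link_number;  /* symbol 1: 0-255, or PAD (K23.7) when unset */
    uint8_t lane_number;  /* symbol 2: 0-31, or PAD (K23.7) when unset */
    uint8_t n_fts;        /* symbol 3: fast training sets the receiver needs */
    uint8_t data_rate;    /* symbol 4: bit 1 set = 2.5 Gb/s supported */
    uint8_t train_ctl;    /* symbol 5: reset/link/loopback/scrambling bits */
    uint8_t ts_id[10];    /* symbols 6-15: D10.2 for TS1, D5.2 for TS2 */
} training_ordered_set_t;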
Similar in concept to the detect state, the polling state has five defined sub-states that are used in the link training process. The polling sub-states are referred to as quiet, active, configuration, speed, and compliance. A short description of the transitions into and out of these sub-states follows.
Quiet Sub-State. The first polling sub-state that is entered into upon the
completion of the detect state is the quiet sub-state. Upon entry into this
sub-state the 12-millisecond countdown timer is set. During the quiet
sub-state each downstream port receiver looks for a training ordered set or its complement. The polling state responds to either a TS1 or TS2 training ordered set by progressing to the next sub-state. As mentioned above, the receiver also responds to the complement of either training ordered set: a complemented set indicates that the differential pair is polarity-inverted, a condition the receiver corrects by logically inverting the incoming data, as illustrated below.
[Figure: Logical inversion. A PCI Express device receiving an inverted 1010... pattern on a polarity-swapped differential pair (D+ and D- crossed) logically inverts the incoming data to recover the original stream.]
The ability to perform a logical inversion on incoming signals due to polarity inversion of the
differential pair gives the designer extra freedom in design in cases where it would otherwise be
necessary to bow-tie the signals.
[Figure: Polling sub-state machine, showing Polling.Quiet, Polling.Active, Polling.Compliance, Polling.Configuration, and Polling.Speed. Transitions include a 12 ms timeout or TSx received, a fall to compliance when no TSx set is received, 1024 TS1 sets, bypassing the speed change when only the initial 2.5 Gbps data rate is supported by both devices, and exit to Configuration.]
Configuration State
The configuration state establishes link width and lane ordering. Prior to
this state, bit and symbol lock should have been established, link data
rate determined, and polarity corrections made on incoming data if nec-
essary. Within the configuration state there are two sub-states, rcvrcfg
and idle.
Rcvrcfg Sub-State. During the rcvrcfg sub-state, link width and lane ordering are established. For links wider than a x1 configuration, the receiver must compensate for the 20 nanoseconds of lane-to-lane skew that the PCI Express Specification Revision 1.0 allows between the lanes that form the link. TS1 training ordered sets transmitted to the upstream port contain link numbers assigned by the downstream port, as shown in Figure 8.16. If the downstream port is capable of forming two individual links, it sends out two separate link numbers, N and N+1. The upstream port responds to the downstream port by sending the desired link number in symbol 1 of the training ordered set, as shown in Table 8.3; that is, the upstream port establishes a link number by sending a TS1 training ordered set to the downstream port with the preferred link number in place of the special character K23.7 (PAD). If the link number is not established within 2 milliseconds, the state machine returns to the polling state. Until the link number is established or the 2-millisecond timer expires, the downstream port continues to broadcast its link numbering preference.
[Figure 8.16: TS1 exchange during link configuration. On each lane, the downstream port transmits TS1 sets carrying a link number (N or N+1) with the lane number padded (K23.7) and data rate D2.0; the upstream port responds on both lanes with TS1 sets carrying the preferred link number N.]
In the picture above, PCI Express Device B transmits the link preference of N and N+1 since this
x2 device can bifurcate into two independent x1 links. Since PCI Express Device A is a single
device with both lanes connected to PCI Express Device B, it responds back to PCI Express
Device B with the preferred link number N indicating that a single link should be formed.
Idle Sub-State. Upon entry into this sub-state the link has been configured: bit and symbol lock are established, the link data rate is selected, and link and lane numbering are fixed. In this sub-state the
downstream port sends the special symbol K28.3 (Idle) to the upstream
port. As soon as a port receives the idle symbol it transmits at least 16
consecutive idle symbols in return. As soon as a port receives eight con-
secutive idle symbols the port transitions to the L0 state, which is the
operating state. If the ports receive no idle symbols within 2 millisec-
onds, a timeout condition occurs and the state returns to the polling
state, as shown in Figure 8.17.
[Figure 8.17: Configuration sub-state machine. Config.RcvCfg proceeds to Config.Idle once the link is configured; receipt of eight idle symbols completes the state, while a link error or 2 ms timeout forces an exit from the configuration state.]
Surprise Insertion/Removal
PCI Express physical architecture is designed with ease of use in mind.
To support this concept PCI Express has built in the ability to handle
surprise insertion and removal of PCI Express devices. All transmitters
and receivers must support surprise hot insertion/removal without dam-
age to the device. The transmitter and receiver must also be capable of withstanding a sustained short circuit to ground on the differential inputs/outputs D+ and D-.
A PCI Express device can assume the form of an add-in card, module,
or a soldered-down device on a PC platform. In the case of an add-in card
or cartridge, PCI Express allows a user to insert or remove a device (an upstream port) while the system is powered. This does not mean that nothing more is required to support surprise insertion; the key objective is to identify that a mechanism exists to check for the presence of a device on a link.
Surprise Insertion
A broken link that is missing an upstream port causes the downstream
device to remain in the detect state. Every 12 milliseconds the down-
stream port checks the link to see whether or not any upstream ports
have been connected. As soon as a user inserts a device into the system it
is detected and the link training process as previously described begins.
Surprise Removal
If a PCI Express device is removed from the system during normal opera-
tion, the downstream port receivers detect an electrical idle condition (a
loss of activity). Because the electrical idle condition was not preceded
by the electrical idle ordered set, the link changes to the detect state.
In the L0s state, the transmit path of a link can be placed in a power-saving standby while the receive path to the downstream port could remain in the fully functional L0 state. Because the link will likely transition into and out of
this state often, the latencies associated with coming in and out of this
state must be relatively small (a maximum of several microseconds). Dur-
ing this state the transmitter continues to drive the DC common mode voltage, and the device's on-chip clocks (PLL clocks and so on) continue to run.
[Figure: Link power-state transitions among L0 (normal operation), L0s, L1, and Recovery.]

At 2.5 gigabits per second, each bit time is 1/(2.5 × 10^9) seconds = 0.4 nanoseconds, so 20 bit times correspond to 0.4 ns × 20 = 8 ns.
To exit the L0s state the transmitter must begin sending out Fast
Training Sequences to the receiver. A Fast Training Sequence is an or-
dered set composed of one K28.5 (COM) special character and three
K28.1 special characters. The fast training sequences are used to resyn-
chronize the bit and symbol times of the link in question. The exit la-
tency from this state depends upon the amount of time it takes the
receiving device to acquire bit and symbol synchronization. If the re-
ceiver is unable to obtain bit and symbol lock from the Fast Training Se-
quence the link must enter a recovery state where the link can be
reconfigured if necessary.
Chapter 9

Flow Control

This chapter goes into the details of the various flow control mechanisms within PCI Express. It begins with a description of the ordering requirements for the various transaction types. The rest of the chapter then deals with some of the newer flow control policies that PCI Express uses: virtual channels, traffic classes, and flow control credits. Following that, the chapter briefly describes how these flow control mechanisms are used to support isochronous data streams.
Transaction Ordering
The PCI Express specification defines several ordering rules to govern
which types of transactions are allowed to pass or be passed. Passing oc-
curs when a newer transaction bypasses a previously issued transaction
and the device executes the newer transaction first. The ordering rules
apply uniformly to all transaction types (memory, I/O, configuration, and messages), but only within a given traffic class. There are no ordering rules between transactions with different traffic classes. It follows that there are no ordering rules between different virtual channels, since transactions in different virtual channels never share a traffic class.
The ordering rules are summarized in the following table. Rows represent the subsequently issued transaction and columns the previously issued transaction; "Yes" means the row transaction must be allowed to pass the column transaction, "No" means it must not pass, and "Y/N" means it may either pass or be blocked.

                                       (1) Memory Write    (2) Read   (3) I/O or Config   (4) Read         (5) I/O or Config
                                       or Message Request  Request    Write Request       Completion       Write Completion
(A) Memory Write or Message Request    a) No, b) Y/N       Yes        Yes                 a) Y/N, b) Yes   a) Y/N, b) Yes
(B) Read Request                       No                  Y/N        Y/N                 Y/N              Y/N
(C) I/O or Config Write Request        No                  Y/N        Y/N                 Y/N              Y/N
(D) Read Completion                    a) No, b) Y/N       Yes        Yes                 a) Y/N, b) No    Y/N
(E) I/O or Config Write Completion     Y/N                 Yes        Yes                 Y/N              Y/N
A subsequent memory write or message request interacts with previous transactions as follows. As seen in cell A1, there are two potential ordering rules. If the relaxed ordering bit (bit 5 of byte 2 in the TLP header) contains a value of 0, then the second transaction is not permitted to bypass the previously submitted request (A1a). If that bit is set to 1, then the subsequent transaction is permitted to bypass the previous transaction (A1b). A memory write or message request must be allowed to pass read
requests (A2) as well as I/O or configuration write requests (A3) in order
to avoid deadlock. The ordering rules between memory write or message
requests and completion packets depend on the type of PCI Express de-
vice. Endpoints, switches, and root complexes may allow memory write
or message requests to pass or be blocked by completions (A4a and A5a).
PCI Express to PCI or PCI-X bridges, on the other hand, must allow
memory write or message requests to pass completions in order to avoid
deadlock (A4b and A5b). This scenario only occurs for traffic flowing
from the upstream (PCI Express) side of the bridge to the downstream
(PCI or PCI-X) side of the bridge.
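As a concrete illustration of the A1 rule, the following C sketch tests the relaxed ordering attribute at the position the text gives (bit 5 of byte 2 of the TLP header). The function names and the raw-buffer view of the header are illustrative scaffolding, not part of the specification.

#include <stdint.h>
#include <stdbool.h>

/* Test the relaxed ordering attribute: bit 5 of byte 2 of the TLP
   header, per the discussion above. */
static bool tlp_relaxed_ordering(const uint8_t *header)
{
    return (header[2] >> 5) & 1;
}

/* Ordering decision for cell A1: may a subsequent memory write or
   message request pass a previously posted one? Only if its relaxed
   ordering bit is set (A1b); otherwise it must wait (A1a). */
static bool may_pass_posted(const uint8_t *subsequent_header)
{
    return tlp_relaxed_ordering(subsequent_header);
}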
A subsequent non-posted request (any read request or an I/O or con-
figuration write request) interacts with previous transactions in the fol-
lowing way. As seen in cells B1 and C1, these requests are not allowed to
pass previously issued memory write or message requests. Non-posted
requests may pass or be blocked by all other transaction types (B2, B3,
B4, B5, C2, C3, C4, C5).
A subsequent read completion interacts with previous transactions as
follows. As seen in cell D1, there are two potential ordering rules when
determining if a read completion can pass a previously issued memory
write or message request. If the relaxed ordering bit (bit 5 of byte 2 in
the TLP header) contains a value of 0, then the read completion is not
permitted to bypass the previously submitted request (D1a). If that bit is
set to a 1, then the read completion may bypass the previously enqueued
transaction (D1b). A read completion must be allowed to pass read re-
quests (D2) as well as I/O or configuration write requests (D3) in order
to avoid deadlock. Read completions from different read requests are
treated in a similar fashion to I/O or configuration write completions. In
either case (D4a or D5), the subsequent read completion may pass or be
blocked by the previous completion transaction. Recall, however, that a
single completion may be split up amongst several completion packets.
In this scenario, a subsequent read completion packet is not allowed to
pass a previously enqueued read completion packet for that same re-
quest/completion (D4b). This is done in order to ensure that read com-
pletions return in the proper order.
A subsequent I/O or configuration write completion interacts with previous transactions as follows. As seen in cell E1, these completions may pass or be blocked by a previously issued memory write or message request.
Flow control in PCI can be compared to managing access to a road: a car must wait for an opening in traffic before pulling onto the road, which is essentially how PCI and PCI-X flow control works. Additionally, once a car gains access to the
road, it needs to determine how fast it can go. If there is a lot of traffic al-
ready on the road, the driver may need to throttle his or her advance-
ment to keep from colliding with other cars on the road. PCI
accomplishes this through signals such as IRDY# and TRDY#.
Now consider that the road is changed into a highway with four lanes
in both directions. This highway has a carpool lane that allows carpoolers
an easier path to travel during rush hour traffic congestion. There are also
fast lanes for swifter moving traffic and slow lanes for big trucks and
other slow moving traffic. Drivers can use different lanes in either direc-
tion to get to a particular destination. Each driver occupies a lane based
upon the type of driver he or she is. Carpoolers take the carpool lane
while fast drivers and slow drivers occupy the fast and slow lanes respec-
tively. This highway example represents the PCI Express flow control
model. Providing additional lanes of traffic increases the total number of
cars or bandwidth that can be supported. Additionally, dividing up that
bandwidth based on traffic class (carpoolers versus slow trucks) allows
certain packets to be prioritized over others during high traffic times.
PCI Express does not have the same sideband signals (IRDY#,
TRDY#, RBF#, WBF#, and so on) that PCI or AGP have in order to im-
plement this sort of flow control model. Instead, PCI Express uses a flow
control credit model. Data Link Layer Packets (DLLPs) are exchanged be-
tween link mates indicating how much free space is available for various
types of traffic. This information is exchanged at initialization, and then
updated throughout the active time of the link. The exchange of this in-
formation allows the transmitter to know how much traffic it can allow
on to the link, and when the transmitter needs to throttle that traffic to
avoid an overflow condition at the receiver.
System traffic is broken down into a variety of traffic classes (TCs). In
the traffic example above, the traffic classes would consist of carpoolers,
fast drivers, and slow drivers. PCI Express supports up to eight different
traffic classes. Each traffic class can be assigned to a separate virtual chan-
nel (VC), which means that there can be at most eight virtual channels.
Support for traffic classes and virtual channels beyond the defaults (TC0
and VC0) is optional. Each supported traffic class is assigned to a sup-
ported virtual channel for flow control purposes. TC0 is always associ-
ated with VC0, but beyond that, traffic class to virtual channel mapping is
flexible and device-dependent. Although each traffic class may be
mapped to a unique virtual channel, this is not a requirement. Multiple
traffic classes can share a single virtual channel, but multiple virtual
channels cannot share a single traffic class; a traffic class may only be assigned to a single virtual channel.
[Figure 9.1: Flow Control through Virtual Channels and Traffic Classes.]

[Figure 9.2: Virtual Channel and Traffic Class Setup Prior to Configuration. Both devices support VC0 with TC0 mapped to it; additional virtual channels and traffic class mappings are not yet configured.]

[Figure 9.3: Virtual Channel and Traffic Class Setup after Configuration.]
Again, these are example associations and not the only possible traffic
class to virtual channel associations in these configurations.
There are several additional traffic class/virtual channel configuration
details to make note of. As seen in Figure 9.2, all ports support VC0 and
map TC0 to that virtual channel by default. This allows traffic to flow
across the link without (or prior to) any VC-specific hardware or software
configuration. Secondly, implementations may adjust their buffering per
virtual channel based on implementation-specific policies. For example,
in Figure 9.3, the queues or buffers in Device A that are identified with a
VC ID of x may be reassigned to provide additional buffering for VC0 or
VC1, or they may be left unassigned and unused.
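The traffic class to virtual channel association is easy to picture as a small lookup table. In this C sketch the mapping values are an arbitrary example of the many-to-one rule described above (several TCs may share one VC, and TC0 is always on VC0); none of it is a required configuration.

/* Traffic class to virtual channel mapping: eight traffic classes,
   each assigned to exactly one virtual channel (-1 = unmapped). */
#define NUM_TCS 8

typedef struct {
    int vc_for_tc[NUM_TCS];
} vc_map_t;

/* Example only: TC0-TC3 share VC0 (TC0 -> VC0 is mandatory),
   TC4-TC7 share VC1. */
static const vc_map_t example_map = {
    /* TC0 TC1 TC2 TC3 TC4 TC5 TC6 TC7 */
    {   0,  0,  0,  0,  1,  1,  1,  1 }
};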
Flow Control
PCI Express enacts flow control (FC) mechanisms to prevent receiver
buffer overflow and to enable compliance with the ordering rules out-
lined previously. Flow control is done on a per-link basis, managing the
traffic between a device and its link mate. Flow control mechanisms do
not manage traffic on an end-to-end basis, as shown in Figure 9.4.
[Figure 9.4: Flow control operates per link. Separate FC blocks manage traffic on the link between the root complex and the switch and on the link between the switch and the PCI Express endpoint; no flow control mechanism spans the path end to end.]
If the root complex issues a request packet destined for the PCI Ex-
press endpoint, it transmits that packet across the outgoing portion of its
link to the switch. The switch then sends that packet across its down-
stream port to the endpoint. The flow control mechanisms that PCI Ex-
press implements, however, are link-specific. The flow control block in
the root complex only deals with managing the traffic between the root
complex and the switch. The downstream portion of the switch and the
endpoint then manage the flow control for that packet between the
switch and the endpoint. There are no flow control mechanisms in the
root complex that track that packet all the way down to the endpoint.
Link mates share flow control details to ensure that no device trans-
mits a packet that its link mate is unable to accept. Each device indicates
how many flow control credits it has available for use. If the next packet
allocated for transmission exceeds the available credits at the receiver,
that packet cannot be transmitted. Within a given link, each virtual chan-
nel maintains its own flow control credit pool.
As mentioned in Chapter 7, DLLPs carry flow control details between
link mates. These DLLPs may initialize or update the various flow control
credit pools used by a link. Though the flow control packets are DLLPs
and not TLPs, the actual flow control procedures are a function of the
Transaction Layer in cooperation with the Data Link Layer. The Transac-
tion Layer performs flow control accounting for received TLPs and gates
outgoing TLPs if, as mentioned previously, they exceed the credits available. The flow control mechanisms are independent of the data integrity
mechanisms of the Data Link Layer (that is to say that the flow control
logic does not know if the Data Link Layer was forced to retry a given
TLP).
The units used for data credits come from rounding the data length up to the next multiple of 16 bytes. For example, a memory read completion with a data length of 10 DWords (40 bytes) uses 1 CplH unit and 3 (40/16 = 2.5, which rounds up to 3) CplD units. Please note that there are no credits, and hence no flow control processes, for DLLPs. The receiver must therefore process these packets at the rate that they arrive.
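The rounding rule is one line of C; the function name is illustrative.

/* One data credit covers 16 bytes, so a payload consumes
   ceil(bytes / 16) data credits of the matching type; each TLP
   additionally consumes one header credit. */
static unsigned data_credits(unsigned payload_bytes)
{
    return (payload_bytes + 15) / 16;   /* 40 bytes -> 3 units */
}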
Each virtual channel has independent flow control, and thus main-
tains independent flow control pools (buffers) for PH, PD, NPH, NPD,
CplH, and CplD credits. Each device autonomously initializes the flow
control for its default virtual channel (VC0). As discussed in Chapter 7,
this is done during the DL_Init portion of the Data Link Layer state ma-
chine. The initialization procedures for the other virtual channels' flow control are quite similar to those of VC0, except that VC0 undergoes
initialization by default (and before the link is considered active) while
other virtual channels undergo initialization after the link is active. Once
enabled by software, multiple virtual channels may progress through the
various stages of initialization simultaneously. They need not initialize in
numeric VC ID order (that is to say, VC1 initializes before VC2 initializes
before VC3, and so on) nor does one channel’s initialization need to
complete before another can begin (aside from VC0, which must be ini-
tialized before the link is considered active). Additionally, since VC0 is
active prior to the initialization of any other virtual channels, there may
already be TLP traffic flowing across that virtual channel. Such traffic has
no direct impact on the initialization procedures for other virtual chan-
nels.
Tying this all together, this section has shown that arbitration for
bandwidth of a given link is dependent on several factors. Since multiple
virtual channels may be implemented across a single link, those virtual
channels must arbitrate for the right to transmit. VC-to-VC arbitration can
take several forms, but it is important that regardless of the policy, no vir-
tual channel is locked out or starved for bandwidth. Once a virtual chan-
nel is set to transmit, it must arbitrate amongst its supported traffic
classes. If a virtual channel has only one traffic class assigned to it, that
arbitration is quite simple. If a virtual channel has numerous traffic
classes, there needs to be some arbitration policy to determine which
traffic class has priority. Like VC-to-VC arbitration, TC-to-TC arbitration
(within a virtual channel) can take several forms, with a similar priority of
ensuring that no traffic class is locked out or starved for bandwidth. Fi-
nally, once a specified traffic class is ready to transmit, the transaction
ordering rules from the beginning of the chapter are used to determine
which transaction should be transmitted. Transaction ordering only de-
tails the rules for traffic within a set traffic class, which is why this is the
last step in the arbitration process.
[Figure: FC_Init1 flowchart. The device records each indicated FC unit value for VCx and sets the associated flag; once values for all credit types have been recorded, it proceeds to FC_Init2.]
While in FC_Init1, the device must transmit InitFC1-P, then InitFC1-NP, and then InitFC1-Cpl DLLPs for the virtual channel being initialized; this sequence must progress in that order and must not be interrupted. While in this state for VC0 no other traffic is possible. As such,
this pattern should be retransmitted in this order continuously until exit
into FC_Init2. For other virtual channels, this pattern is not repeated con-
tinuously. Since other traffic may wish to use the link during nonzero vir-
tual channel initialization, this pattern does not need to be repeated
continuously, but it does need to be repeated (uninterrupted) at least
every 17 microseconds.
While in this state, the FC logic also needs to process incoming
InitFC1 (and InitFC2) DLLPs. Upon receipt of an InitFC DLLP, the device
records the appropriate flow control unit value. Each InitFC packet con-
tains a value for both the header units and data payload units. Once the
device has recorded values for all types (P, NP and Cpl, both header and
data) of credits for a given virtual channel, it sets a flag (FI1) to indicate
that the virtual channel has successfully completed FC_Init1. At this
point, InitFC1 packets are no longer transmitted, and the device pro-
ceeds to the FC_Init2 stage. Figure 9.6 shows the flowchart for flow con-
trol initialization state FC_Init2.
[Figure 9.6: FC_Init2 flowchart. The device transmits InitFC2 DLLPs, setting a flag upon receipt of any InitFC2 DLLP; on each timer roll-over it checks whether the flag is set, and if so flow control initialization ends.]
For all virtual channels, the entrance to FC_Init2 occurs after success-
ful completion of the FC_Init1 stage. While in FC_Init2, the Transaction
Layer no longer needs to block transmission of TLPs that use that virtual
channel. While in this state, the device must first transmit InitFC2-P, then
InitFC2-NP, and then InitFC2-Cpl. This sequence must progress in this
order and must not be interrupted. While in this state for VC0, this pat-
tern should be retransmitted in this order continuously until successful
completion of FC_Init2. For other virtual channels, this pattern is not re-
peated continuously, but is repeated uninterrupted at least every 17 mi-
croseconds until FC_Init2 is completed.
While in this state, it is also necessary to process incoming InitFC2
DLLPs. The values contained in the DLLP can be ignored, and InitFC1
packets are ignored entirely. Receiving any InitFC2 DLLP for a given vir-
tual channel should set a flag (FI2) that terminates FC_Init2 and the flow
control initialization process. Please note that the FI2 flag is dependent on receipt of a single InitFC2 DLLP, not all three (P, NP, and Cpl). Additionally, the FI2 flag may also be set upon receipt of any TLP or UpdateFC DLLP that uses that virtual channel.
What exactly is the purpose of this FC_Init2 state? It seems as if it is
just retransmitting the same flow control DLLPs, but with a single bit
flipped to indicate that it is in a new state. The purpose for this state is to
ensure that both devices on a link can successfully complete the flow
control initialization process. Without it, it could be possible for one de-
vice to make it through flow control while its link mate had not. For ex-
ample, say that there is no FC_Init2 state and a device can proceed
directly from FC_Init1 to normal operation mode. While in FC_Init1, De-
vice A transmits its three InitFC1 DLLPs to Device B and vice versa. De-
vice B successfully receives all three DLLPs and proceeds on to normal
operation. Unfortunately, one of Device B’s flow control DLLPs gets lost
on its way to Device A. Since Device A has not received all three types of
flow control DLLPs, it stays in FC_Init1 and continues to transmit flow
control DLLPs. Device B is no longer transmitting flow control packets,
so Device A never gets out of FC_Init1 and all traffic from Device A to B
is blocked.
Having the FC_Init2 state ensures that both devices can successfully complete the flow control initialization process. In the above
example, Device B would transfer into the FC_Init2 state and could begin
to transmit TLPs and other DLLPs. However, Device B still needs to peri-
odically transmit FC2 DLLPs for all three flow control types (P, NP, Cpl).
If Device A does not see the three original FC1 DLLPs, it can still eventu-
ally complete FC_Init1 since it periodically receives FC2 packets that
contain the needed flow control configuration information.
Tying this all together, what would a real flow control initialization
look like? Figure 9.7 illustrates the first step in an example flow control
initialization.
[Figure 9.7: Start of VC0 flow control initialization. Device B transmits InitFC1-P, InitFC1-NP, and InitFC1-Cpl DLLPs, each framed by SDP and END and advertising 01h header and 040h data credits for VC0, while Device A begins its own InitFC1 sequence slightly later.]
Devices A and B exit out of reset and begin the default initialization
of VC0. In this example, Device B happens to begin the initialization first,
so it begins to transmit Init_FC1 packets before Device A does. It starts
with an SDP symbol (Start DLLP Packet—refer to Chapter 8 for additional
details on framing) and then begins to transmit the DLLP itself. The first
packet that Device B must transmit is Init_FC1 for type P. It does so, dif-
ferentiating it as an FC1 initialization packet for credit type P in the first
four bits of the DLLP (refer to Chapter 7 for more details on the format of
these DLLPs). Device B then indicates that this packet pertains to VC0 by
placing a 0 in the DLLP’s VC ID field. The next portions of the packet
identify that Device B can support 01h (one decimal) posted header re-
quest units and 040h (64 decimal) posted request data units. At 16 bytes
per unit, this equates to a maximum of 1024 bytes of data payload. Fol-
lowing this information is the CRC that is associated with this DLLP. The
Init_FC1-P DLLP then completes with the END framing symbol. Device B
continues on with the transmission of the Init_FC1-NP and Init_FC1-Cpl
DLLPs.
Device A also begins transmitting Init_FC1 DLLPs, but does so just a
little later than Device B. At the point in time of this example, Device A
has just completed the transmission of the second initialization packet,
Init_FC1-NP DLLP, whereas Device B is already well into the transmission
[Figure 9.8: Device A has finished its InitFC1 sequence and begun transmitting InitFC2 DLLPs, while Device B is still retransmitting InitFC1s; Device B must complete the second set of InitFC1s before it can progress to FC_Init2.]
Device B has sent out all three Init_FC1 packets (P, NP, and Cpl), but
has not yet received all three Init_FC1 packets from Device A. This
means that Device B cannot yet exit from the FC_Init1 state and must
therefore retransmit all three Init_FC1 packets. Device A, on the other
hand, has already received all three Init_FC1 packets by the time it com-
pletes transmitting its own Init_FC1 packets. This means that Device A
can exit from the FC_Init1 state after only one pass and proceed on to
FC_Init2.
In Figure 9.8, Device A has begun to send out Init_FC2 packets. It
begins, as required, with the P type. Only this time, it is identified as an
Init_FC2 packet and not an Init_FC1. It proceeds to send out the Init_FC2
packet for NP and has started to send out the Init_FC2 packet for Cpl at
the time of this example.
Device B, on the other hand, has had to continue through the second
transmission of Init_FC1 packets. Once it completes the set of three, it
can transition to FC_Init2 and begin to transmit Init_FC2 packets. In this
example, Device B has just started to send out an Init_FC2 packet for
type P.
Now that each device has entered into the FC_Init2 stage, what do
things look like as they exit the flow control initialization and enter into
normal link operation? Figure 9.9 illustrates the completion of the flow
control initialization process.
[Figure 9.9: Completion of flow control initialization. Device A finishes its trio of InitFC2 DLLPs and begins transmitting TLPs on VC0, while Device B completes its InitFC2-Cpl DLLP and likewise starts TLP traffic.]
Device A has sent out all three Init_FC2 packets (P, NP, and Cpl), but
has not yet received all three Init_FC2 packets from Device B. As dis-
cussed previously, however, exit from FC_Init2 is not dependent on re-
ceiving all three Init_FC2 packets. Receipt of any Init_FC2 DLLP allows
that device to exit from the FC_Init2 state (as long as it does not interrupt
the transmission of a trio of its Init_FC2 DLLPs). As such, Device A does
not need to retransmit its Init_FC2 packets and has completed the flow
control initialization. If it had not successfully received and processed the first Init_FC2 DLLP from Device B by the time the transmission of its own trio of Init_FC2 DLLPs was complete, it would simply have continued transmitting that trio until one arrived.
At initialization, Device B indicates that it has 04h NPH credits and 040h NPD
credits. Device A logs those in its NPH and NPD Credit_Limit count-
ers/registers. After initialization, Device A has set its NPH and NPD Cred-
its_Consumed counters/registers to zero.
Device A then sends out two NP requests (sequence numbers #1 and #2) that
each utilizes a single NPH unit and 10h NPD units. It therefore updates its NPH
and NPD Credits_Consumed counters/registers to 02h and 20h, respectively.
Device A now wants to send out another NP request (sequence #3) that uses a
single NPH unit and 30h NPD units. In this example, however, the Transaction
Layer of Device A must gate that TLP and not transmit it just yet: while it has the necessary NPH credits for this transaction, it does not have the proper number of NPD credits left. Device B originally advertised support for 040h NPD
units and Device A has already sent out packets that consumed 020h of that.
That leaves only 020h credits available, not enough to cover the 030h that TLP
#3 requires.
Device A must gate this TLP and, based on the transaction ordering rules dis-
cussed previously, potentially stall other TLPs on the same virtual channel. Once
Device B issues an UpdateFC-NP packet that indicates that one or both of the
outstanding TLPs (sequence #’s 1 and/or 2) has been cleared from its queues,
Device A can release TLP #3 and transmit it as appropriate.
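The gating test in this example reduces to a simple comparison between the advertised credit limit and the credits consumed so far. The C sketch below mirrors the numbers above; real hardware tracks these counters with modulo arithmetic, which this simplified version ignores.

#include <stdbool.h>

/* Transmitter-side view of one credit type (e.g., NPD) on one
   virtual channel. */
typedef struct {
    unsigned credit_limit;      /* advertised by the receiver */
    unsigned credits_consumed;  /* units already spent */
} credit_pool_t;

static bool may_transmit(const credit_pool_t *p, unsigned units_needed)
{
    return p->credits_consumed + units_needed <= p->credit_limit;
}

/* With limit = 0x40 and consumed = 0x20, a TLP needing 0x30 NPD
   units is gated (0x20 + 0x30 > 0x40) until an UpdateFC-NP DLLP
   frees receiver buffer space. */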
It should be noted that return of flow control credits does not necessarily
mean that the TLP has reached its destination or has been completed. It
simply means that the buffer or queue space allocated to that TLP at the
receiver has been cleared. In Figure 9.4, the upstream port of the switch
may send an UpdateFC that indicates it has freed up the buffer space
from a given TLP that is destined for the endpoint. The root complex
should not infer that this has any meaning other than that the TLP has
been cleared from the upstream receive buffers of the switch. That TLP
may be progressing through the core logic of the switch, may be in the
outgoing queue on the downstream port, or may be already received
down at the endpoint.
Isochronous Support
Servicing isochronous traffic requires a system to provide not only guaranteed data bandwidth but also a specified service latency. PCI Express is
designed to meet the needs of isochronous traffic while assuring that
other traffic is not starved for support. Isochronous support may be real-
ized through the use of the standard flow control mechanisms described
above: traffic class traffic labeling, virtual channel data transfer protocol
[Figure: Isochronous traffic flows through a switch, both from an endpoint to the root complex and from endpoint to endpoint.]
Chapter 10

PCI Express Software Overview

Those parts of the system that you can hit with a hammer are called hardware; those program instructions that you can only curse at are called software.
—Anonymous
this model should ease the adoption of PCI Express, because it removes
the dependency on operating system support for PCI Express in order to
have baseline functionality. The second PCI Express configuration model
is referred to as the enhanced mechanism. The enhanced mechanism in-
creases the size of available configuration space and provides some op-
timizations for access to that space.
When a Type 1 configuration transaction is presented on a PCI bus, only the PCI-PCI bridges on that bus need to pay attention to the configura-
tion transaction. If the target PCI bus for the configuration is a bridge’s
subordinate but not secondary bus, the bridge claims the transaction
from its primary bus and forwards it along to its secondary bus (still as a
Type 1). If the target PCI bus for the configuration is a bridge’s secondary
bus, the bridge claims the transaction from its primary bus and forwards
it along to its secondary bus, but only after modifying it to a Type 0 con-
figuration transaction. This indicates that the devices on that bus need to
determine whether they should claim that transaction. Please refer to the
PCI Local Bus Specification Revision 2.3 for additional details on PCI con-
figuration.
For PCI Express, each link within the system originates from a PCI-
PCI bridge and is mapped as the secondary side of that bridge. Figure
10.1 shows an example of how this configuration mechanism applies to a
PCI Express switch. In this example, the upstream PCI Express link that
feeds the primary side of the upstream bridge originates from the secon-
dary side of a PCI bridge (either from the root complex or another
switch). A PCI Express endpoint is represented as a single logical device
with one or more functions.
Bridge Key
Primary Side of Bridge Upstream
PCI Express
Secondary Side of Bridge
Switch
PCI-PCI
Bridge
PCI Express PCI Express
PCI-PCI PCI-PCI
Bridge Bridge
PCI-PCI
Bridge
PCI Express PCI Express
Endpoint Endpoint
PCI Express
Endpoint
Configuration Mechanisms
PCI 2.3 allowed for 256 bytes of configuration space for each device
function within the system. PCI Express extends the allowable configura-
tion space to 4096 bytes per device function, but does so in a way that
maintains compatibility with existing PCI enumeration and configuration
software. This is accomplished by dividing the PCI Express configuration
space into two regions, the PCI 2.3-compatible region and the extended
region. The PCI 2.3-compatible region is made up of the first 256 bytes of
a device’s configuration space. This area can be accessed via the tradi-
tional configuration mechanism (as defined in the PCI 2.3 specification)
or the new PCI Express enhanced mechanism. The extended region of
configuration space consists of the configuration space between 256 and
4096 bytes. This area can be accessed only through the enhanced PCI
Express mechanism, and not via the traditional PCI 2.3 access mecha-
nism. This is shown in Figure 10.2. The extension of the configuration
space is useful for complex devices that require large numbers of registers to control and monitor the device (for example, a Memory Controller Hub). With only the 256 bytes of configuration space offered by PCI, such devices might need to be implemented as multiple devices or as multi-function devices just to have enough configuration space.
Under this mechanism, the memory address identifies the bus, device, function, and register that is being addressed, and the memory data contains the contents for the configuration register being accessed. The mapping from memory address A[27:0] to PCI Express configuration space is shown in Table 10.1.
Again, both the enhanced PCI Express and the PCI 2.3-compatible access
mechanisms use this request format. PCI 2.3-compatible configuration
requests must fill the Extended Register Address field with all 0s.
The PCI Express host bridge is required to translate the memory-
mapped PCI Express configuration accesses from the host processor to
legitimate PCI Express configuration transactions. Refer to Chapter 6 for
additional details on how configuration transactions are communicated
through PCI Express.
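As a sketch of what this memory mapping looks like in practice, the C helper below builds a configuration address from bus, device, function, and register numbers. The field positions assume the conventional A[27:0] breakdown of the enhanced mechanism (bus in A[27:20], device in A[19:15], function in A[14:12], and the register offset, including the extended register bits, in A[11:2]); Table 10.1 and the specification are the authoritative reference, and the base address is platform-specific.

#include <stdint.h>

/* Build a memory-mapped configuration space address, assuming the
   A[27:0] layout described above. 'base' is the host bridge's
   configuration window, which is platform-specific. */
static uint64_t cfg_address(uint64_t base,
                            unsigned bus,        /* 0..255 */
                            unsigned device,     /* 0..31  */
                            unsigned function,   /* 0..7   */
                            unsigned reg_offset) /* 0..4095 bytes */
{
    return base | ((uint64_t)(bus      & 0xFF) << 20)
                | ((uint64_t)(device   & 0x1F) << 15)
                | ((uint64_t)(function & 0x07) << 12)
                |  (uint64_t)(reg_offset & 0xFFF);
}

/* PCI 2.3-compatible software can only generate offsets 0-255, so
   the extended register bits A[11:8] are zero for such accesses. */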
Error Reporting
This section explains the error signaling and logging requirements for
PCI Express. PCI Express defines two error reporting mechanisms. The
first is referred to as baseline and defines the minimum error reporting
capabilities required by all PCI Express devices. The second is referred to
as advanced error reporting and allows for more robust error reporting.
Advanced error reporting requires specific capability structures within
the configuration space. This is touched upon briefly in this section, but
not to the same level of detail as in the PCI Express specification.
In order to maintain compatibility with existing software that is not
aware of PCI Express, PCI Express errors are mapped to existing PCI re-
porting mechanisms. Naturally, this legacy software would not have ac-
cess to the advanced error reporting capabilities offered by PCI Express.
Error Classification
There are two types of PCI Express errors: uncorrectable errors and cor-
rectable errors. Uncorrectable errors are further classified as either fatal
or nonfatal. Specifying these error types provides the platform with a
method for dealing with the error in a suitable fashion. For instance, if a
correctable error such as a bad TLP (due to an LCRC error) is reported,
the platform may want to respond with some monitoring software to de-
termine the frequency of the TLP errors. If the errors become frequent
enough, the software may initiate a link specific reset (such as retraining
the link). Conversely, if a fatal error is detected, the platform may want to
initiate a system-wide reset. These responses are merely shown as exam-
ples. It is up to platform designers to map appropriate platform re-
sponses to error conditions.
Correctable errors are identified as errors where the PCI Express pro-
tocol can recover without any loss of information. Hardware corrects
these errors (for example, through a Data Link Layer initiated retry at-
tempt for a bad LCRC on a TLP). As mentioned previously, logging the
frequency of these types of errors may be useful for understanding the
overall health of a link.
Uncorrectable errors are identified as errors that impact the function-
ality of the interface. Fatal errors are uncorrectable errors that render a
given link unreliable. Handling of fatal errors is platform-specific, and
may require a link reset to return to a reliable condition. Nonfatal errors
are uncorrectable errors that render a given transaction unreliable, but do
not otherwise impact the reliability of the link. Differentiating between
fatal and nonfatal errors allows system software greater flexibility when
dealing with uncorrectable errors. For example, if an error is deemed to
be nonfatal, system software can react in a manner that does not upset
(or reset) the link and other transactions already in progress. Table 10.2
shows the various PCI Express errors.
Table 10.2 PCI Express Errors (excerpt)

Error              Classification   Detecting Agent and Required Action
Unexpected         Uncorrectable    Receiver:
Completion         (Nonfatal)       Send ERR_NONFATAL to root complex.
                                    Log the header of the completion
                                    that encountered the error.
                                    This error is a result of misrouting.
Receiver           Uncorrectable    Receiver (if checking):
Overflow           (Fatal)          Send ERR_FATAL to root complex.
Flow Control       Uncorrectable    Receiver (if checking):
Protocol Error     (Fatal)          Send ERR_FATAL to root complex.
Malformed TLP      Uncorrectable    Receiver:
                   (Fatal)          Send ERR_FATAL to root complex.
                                    Log the header of the TLP that
                                    encountered the error.
Error Signaling
The PCI Express device that detects an error is responsible for the ap-
propriate signaling of that error. PCI Express provides two mechanisms
for devices to alert the system or the initiating device that an error has
occurred. The first mechanism is through the Completion Status field in
the completion header. As discussed in Chapter 6, the completion packet
indicates if the request has been completed successfully. Signaling an er-
ror in this manner allows the requester to associate that error with a spe-
cific request.
The second method for error signaling is through in-band error mes-
sages. These messages are sent to the root complex in order to “adver-
tise” that an error of a particular severity has occurred. These messages
are routed up to the root complex, and indicate the severity of the error
(correctable versus fatal versus nonfatal) as well as the ID of the initiator
of the error message. If multiple error messages of the same type are de-
tected, the corresponding error messages may be merged into a single er-
ror message. Error messages of differing severity (or from differing
initiators) may not be merged together. Refer to Chapter 6 for additional
details on the format and details of error messages. Once the root com-
plex receives the error message, it is responsible for translating the error
into the appropriate system event.
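The merging rule just described is easy to capture in code. The following C sketch is illustrative only; the err_msg_t layout is a hypothetical stand-in, not the on-the-wire message format.

#include <stdbool.h>
#include <stdint.h>

typedef enum { ERR_COR, ERR_NONFATAL, ERR_FATAL } err_severity_t;

typedef struct {
    err_severity_t severity;
    uint16_t       requester_id;  /* ID of the error message initiator */
} err_msg_t;

/* Two pending error messages may be merged into one only if they
 * carry the same severity and come from the same initiator. */
bool can_merge(const err_msg_t *a, const err_msg_t *b)
{
    return a->severity == b->severity &&
           a->requester_id == b->requester_id;
}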
Baseline error handling does not allow for severity programming, but
advanced error reporting allows a device to identify each uncorrectable
error as either fatal or nonfatal. This is accomplished via the Uncorrect-
able Errors Severity register that is implemented if a device supports ad-
vanced error reporting.
Error messages may be blocked through the use of error masking.
When an error is masked, the status bit for that type of error is still af-
fected by an error detection, but no message is sent out to the root com-
plex. Devices with advanced error reporting capabilities can
independently mask or transmit different error conditions.
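The interplay between the status, severity, and mask bits might be sketched as follows in C. The register layout here is a simplified stand-in for illustration, not the actual advanced error reporting capability structure.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t status;    /* one bit per uncorrectable error type */
    uint32_t severity;  /* 1 = treat as fatal, 0 = nonfatal     */
    uint32_t mask;      /* 1 = do not send an error message     */
} adv_err_regs_t;

/* Record an uncorrectable error of type 'bit'; returns true if an
 * error message should be sent to the root complex, and reports
 * whether it is fatal. */
bool report_uncorrectable(adv_err_regs_t *r, int bit, bool *is_fatal)
{
    r->status |= (1u << bit);    /* status updates even when masked */
    if (r->mask & (1u << bit))
        return false;            /* masked: no message is sent      */
    *is_fatal = (r->severity >> bit) & 1u;
    return true;
}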
The Header Log register captures the
header for the TLP that encounters an error. Table 10.2 identifies the er-
rors that make use of this register.
Root complexes that support advanced error reporting must imple-
ment several additional registers. Among them are the Root Error Com-
mand and Root Error Status registers, which allow the root complex to
differentiate the system response to a given error severity. Finally, if sup-
porting advanced error reporting, the root complex also implements reg-
isters to log the Requester ID if either a correctable or uncorrectable
error is received.
Error Logging
Figure 10.4 shows the sequence for signaling and logging a PCI Express
error. The boxes shaded in gray are only for advanced error handling and
not used in baseline error handling.
[Figure 10.4 flow chart: when an error is detected, the flow branches on whether it is correctable. For a correctable error, an ERR_COR message is sent if correctable error reporting is enabled; otherwise the flow ends. For an uncorrectable error, the First Error Pointer and Header Log registers are updated (advanced error handling only), and if uncorrectable error reporting is enabled, the error's severity determines whether a fatal or nonfatal error message is sent.]
Devices that do not support the advanced error handling ignore the
boxes shaded in gray and only log the Device Status register bits as
shown in the white boxes. Some errors are also reported using the PCI-
compatible configuration registers, using the parity error and system er-
ror status bits (refer to the PCI Express specification for full details on
this topic).
Legacy operating systems that are not aware of PCI Express do not have
mechanisms to configure and enable this feature. Active state power
management can be implemented on legacy operating systems through
updated BIOS or drivers.
The specifics of PCI Express power management are covered in de-
tail in Chapter 11 and therefore are not included here.
Table 10.3 Primary Elements of the Standard Hot Plug Usage Model

Primary Element                Objective
Indicators                     Reveals the power and attention state of the slot
Manually Operated Retention    Holds add-in cards in place
Latches (MRLs)
MRL Sensor                     A mechanism that allows the port and system
                               software to detect a change in the MRL state
Electromechanical Interlock    Prevents add-in cards from being removed while
                               the slot is powered
Attention Button               Hardware mechanism to notify the system that a
                               Hot Plug event is desired
Software User Interface        A GUI mechanism to notify the system that a Hot
                               Plug event is desired
Slot Numbering                 Provides visual identification of slots
PCI Express adopts the standard usage model for several reasons. One of
the primary reasons, unrelated to software, is the ability to preserve the ex-
isting dominant Hot Plug usage model that many customers have become
used to. Another reason is the ability to reuse code bits and flow processes
already defined for legacy Hot Plug implementations with PCI-X.
Once software has enabled and configured the PCI Express device for
Hot Plug functionality, if supported, system interrupts and power man-
agement events are generated based upon Hot Plug activity (attention
buttons being pressed, power faults, manual retention latches open-
ing/closing). When PCI Express Hot Plug events generate interrupts, the
system Hot Plug mechanism services those interrupts. The Hot Plug
mechanism is dependent upon the operating system. Legacy operating
systems will likely use an ACPI implementation with vendor specific filter
drivers. A contrast between a PCI Express aware and a legacy ACPI-
capable operating system Hot Plug service model is provided in Figure
10.5. For additional information on ACPI refer to the Advanced Configu-
ration and Power Interface, Specification Revision 2.0b.
[Figure 10.5: Generic Hot Plug Service Model for PCI Express Hot Plug; the service path depends on whether the operating system is PCI Express-aware and whether firmware control of the Hot Plug registers is enabled]
byte limit for PCI 2.3-compatible configuration space). The PCI Express
enhanced configuration mechanism is required to access this feature
space. This means that isochronous services such as traffic class/virtual
channel mapping for priority servicing cannot be supported by legacy
operating systems. The following discussion assumes that software sup-
ports the PCI Express enhanced configuration mechanism.
would get the highest priority. This arbitration mechanism is the default
arbitration mechanism for PCI Express virtual channel arbitration. The
use of this arbitration scheme does require some amount of software
regulation in order to prevent possible starvation of low priority devices.
Round Robin arbitration is a common technique that allows equal access
opportunities to all virtual channel traffic. This method does not guaran-
tee that all virtual channels are given equal bandwidth usage, only the
opportunity to use some of the available bandwidth. Weighted Round
Robin arbitration is a cross between Strict Priority and Round Robin. This
mechanism provides fairness during times of traffic contention by allow-
ing lower priority devices at least one arbitration win per arbitration
loop. The latency of a particular virtual channel is bounded by a mini-
mum and maximum amount. This is where the term “weighted” comes
in. Weights can be fixed through hardware, or preferably, programmable
by software. If configurable by software, the ability is reported through
the PCI Express Virtual Channel Extended Capability Structure outlined
in Figure 10.6.
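A Weighted Round Robin arbiter of the kind just described can be sketched with a phase table in which each slot names the winning virtual channel, so a channel's weight is simply the number of slots it occupies per loop. All names below are illustrative; the actual mechanism is the VC Arbitration Table described in the specification.

#include <stdio.h>
#include <stdbool.h>

#define NUM_VCS 3
#define TABLE_PHASES 8

/* Example table: VC0 gets 4 slots, VC1 gets 3, VC2 gets 1 per loop,
 * bounding each channel's worst-case wait to one trip around the table. */
static const int arb_table[TABLE_PHASES] = {0, 1, 0, 2, 0, 1, 0, 1};

/* Queues with pending traffic (stubbed as flags for this sketch). */
static bool vc_has_pending[NUM_VCS] = {true, true, true};

/* Return the VC that wins the next arbitration phase, skipping idle
 * VCs so no transmission time is wasted on empty channels. */
int next_winner(int *phase)
{
    for (int tries = 0; tries < TABLE_PHASES; tries++) {
        int vc = arb_table[*phase];
        *phase = (*phase + 1) % TABLE_PHASES;
        if (vc_has_pending[vc])
            return vc;
    }
    return -1;  /* nothing pending on any VC */
}

int main(void)
{
    int phase = 0;
    for (int i = 0; i < 8; i++)
        printf("phase %d -> VC%d\n", i, next_winner(&phase));
    return 0;
}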
The Virtual Channel Arbitration Table and the Port Arbitration Table
It can be confusing, when two different arbitration tables are in play, to
keep track of what each table is used for, so it is worth drawing a clear
distinction between the two. The Virtual
Channel Arbitration Table contains the arbitration mechanism for priori-
tizing virtual channels competing for a single port. The Port Arbitration
Table contains the arbitration mechanism for prioritizing traffic that is
mapped onto the same virtual channel, but originates from another re-
ceiving (also called ingress) port. Port Arbitration Tables are only found
in switches and root complexes. Figure 10.7 illustrates the arbitration
structure that is configured through the PCI Express virtual channel
structure.
[Figure 10.7: Arbitration structure configured through the PCI Express virtual channel structure; traffic arriving on several receiving (ingress) ports is first subject to port arbitration within each virtual channel (VC0, VC1, and so on), and the virtual channels then compete through virtual channel arbitration at the transmitting (egress) port]
set 08h), bits [7:0], as shown in Figure 10.6. Table 10.5 is an excerpt
from this register as documented in the PCI Express Specification, Revi-
sion 1.0. Software can select an available arbitration option by setting bits
in the Port VC Control register, as shown in Figure 10.6.
[Figure: A single Device X with four functions supporting different power states; Function 1 supports D0 and D3, Function 2 supports D0, D1, and D3, Function 3 supports D0, D1, D2, and D3, and Function 4 supports D0 and D3]
Each PCI and PCI Express device function maintains 256 bytes of con-
figuration space. PCI Express extends the configuration space to 4096
bytes per device function. However, only the first 256 bytes of the con-
figuration space are PCI 2.3-compatible. Additionally, the first 256 bytes of
the extended PCI Express configuration space are all that is visible to cur-
rent (Microsoft† Windows† XP) and legacy operating systems. Refer to
Chapter 10 for additional information.
As mentioned before, the PCI Express Link states replace the Bus
states defined by the PCI-PM Specification. The newly defined Link states
are not a radical departure from the Bus states of PCI-PM. These Link
states have been defined to support the new advanced power manage-
ment concepts and clock architecture of PCI Express. For the most part,
the general functionality of the PCI Express Link states parallels the Bus
states of PCI, with the exception of a couple of new states added by PCI Ex-
press. PCI Express defines the following Link states: L0, L0s, L1, L2/L3
Ready, L2, and L3. As with the PCI-PM Bus states, in the L0 state the link
is fully on; in the L3 state the link is fully off; and L0s, L1, L2/L3
Ready, and L2 are in-between sleep states, as shown in Figure 11.5 and
Table 11.3. Of the above PCI Express Link states, the only state that is not
PCI-PM-compatible is the L0s state, which is part of the advanced power
management features of PCI Express.
[Figure 11.5: PCI Express Link states and training states; Detect, Polling, Configuration, and Recovery lead to L0 (full on), with L0s, L1, L2/L3 Ready, L2, and L3 as progressively deeper power states]
PCI Express Link States L1, L2, and PCI-PM Bus State B2
PCI Express defines two Link states that are similar to the optional PCI-
PM Bus state B2. At a high level the similarity between these PCI Express
Link states and the B2 PCI-PM Bus state is an idle bus and the absence of
a clock. In terms of PCI Express, the Link states L1 and L2 correspond to
a high-latency low-power state and a high-latency deep-sleep state, respectively.
The link does not enter the Recovery state if main power has been re-
moved from the system. In the case that main power has been removed
the link must be entirely retrained.
Table 11.3 PCI Bus State and PCI Express Link State Comparison (excerpt)

PCI Bus   Characterization      Device   PCI Express   Characterization          Device
State                           Power    Link State                              Power
B0        Bus is fully on;      Vcc      L0            Link is fully on;         Vcc
          free-running bus                             component reference
          clock                                        clock running;
                                                       component PLLs running
Active State Power Management does require new software to enable
the capability. In this case the definition of new software does not mean
a new operating system, but rather new BIOS code that works in
conjunction with a legacy operating system. Active State Power Management
does not coincide with the basic PCI-PM software-compatible features.
You access the PCI Express Active State Power Management capabilities
through the PCI Express Capability Structure that exists within the PCI 2.3-
compatible configuration space, as shown in Figure 11.7. You manage
Active State Power Management at each PCI Express port through regis-
ters in the PCI Express Capability Structure, as shown in Figure 11.8.
[Figures 11.7 and 11.8: The PCI Express Capability Structure resides in the PCI 2.3-compatible region (below offset FFh) of the 4096-byte (FFFh) configuration space; its first dword holds the PCI Express Capabilities register, Next Item Pointer, and Capability ID at offset 00h]
Software must query the Link Status register (offset 12h, bit 12) of the PCI
Express Capability Structure to determine whether the device utilizes the
reference clock provided to the slot by the system or a clock provided on
the add-in card itself. The results of the previous query are used to update
the Link Control register located at offset 10h, bit 6 of the PCI Express
Capability Structure. Bit 6 corresponds to the common clock configura-
tion and, when set, causes the appropriate L0s and L1 exit latencies to be
reported in the Link Capabilities register (offset 0Ch, bits 14:12). It is
no surprise that exit latencies will vary dependent upon whether two PCI
Express devices on a link utilize the same reference clock or different
reference clocks.
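The register sequence described above might look as follows in C. The offsets and bit positions are those cited in the text (Link Status at offset 12h bit 12, Link Control at offset 10h bit 6, Link Capabilities at offset 0Ch bits 14:12); the cfg_read16/cfg_write16 accessors are stand-ins for a platform's real configuration-space routines.

#include <stdint.h>

/* Stub accessors backed by a small array; a real implementation
 * would issue actual configuration reads and writes. */
static uint16_t cfg_space[128];
static uint16_t cfg_read16(uint16_t off)          { return cfg_space[off / 2]; }
static void cfg_write16(uint16_t off, uint16_t v) { cfg_space[off / 2] = v; }

enum {
    LINK_CAP_OFFSET    = 0x0C,  /* L0s/L1 exit latencies in bits 14:12 */
    LINK_CTRL_OFFSET   = 0x10,  /* Common Clock Configuration is bit 6 */
    LINK_STATUS_OFFSET = 0x12,  /* slot clock usage reported in bit 12 */
};

/* Copy the slot-clock status bit into the common-clock control bit. */
void configure_common_clock(void)
{
    uint16_t status = cfg_read16(LINK_STATUS_OFFSET);
    uint16_t ctrl   = cfg_read16(LINK_CTRL_OFFSET);

    if (status & (1u << 12))     /* device uses the slot's reference clock */
        ctrl |= (1u << 6);
    else
        ctrl &= (uint16_t)~(1u << 6);

    cfg_write16(LINK_CTRL_OFFSET, ctrl);
    /* Hardware then reports the matching exit latencies through the
     * Link Capabilities register. */
}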
A Quick Example
Consider the following example. On a certain link there is a root com-
plex and two endpoint devices. Software determines that L0s Active State
Power Management is supported on the link. After polling the individual
devices for L0s exit latency information it finds that the root complex has
an exit latency of 256 nanoseconds and both endpoint devices have exit
latencies of 64 nanoseconds. Further investigation reveals that the end-
points can tolerate up to 512 nanoseconds of exit latency before risking,
for example, the possibility of internal FIFO overruns. Based upon this in-
formation software enables Active State Power Management on the link.
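The enabling decision in this example reduces to a simple comparison, sketched below in C with the same numbers used in the text.

#include <stdbool.h>
#include <stdio.h>

/* Enable L0s only if every device's reported exit latency is within
 * what the endpoints can tolerate before, say, FIFOs overrun. */
bool l0s_safe(const unsigned exit_latency_ns[], int n, unsigned tolerance_ns)
{
    for (int i = 0; i < n; i++)
        if (exit_latency_ns[i] > tolerance_ns)
            return false;
    return true;
}

int main(void)
{
    unsigned latencies[] = {256, 64, 64};  /* root complex, two endpoints */
    printf("Enable L0s: %s\n",
           l0s_safe(latencies, 3, 512) ? "yes" : "no");  /* prints "yes" */
    return 0;
}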
This chapter touches on some of the basics for PCI Express implemen-
tation. It begins with some examples of chipset partitioning, explain-
ing how PCI Express could be used in desktop, mobile, or server envi-
ronments. The rest of the chapter identifies some of the ways that PCI
Express lives within, or can expand, today’s computer systems. This fo-
cuses on example connectors and add-in cards, revolutionary form fac-
tors, and system level implementation details such as routing constraints.
Chipset Partitioning
PCI Express provides a great amount of flexibility in the ways that it can
be used within a system. Rather than try to explain all the various ways
that this architecture could be used, this section focuses on how the
chipset may implement a PCI Express topology. Generically speaking, the
chipset is the way that the CPU talks to the rest of the components
within a system. It connects the CPU with memory, graphics, I/O com-
ponents, and storage. As discussed in Chapter 5, a common chipset divi-
sion is to have a (G)MCH and an ICH. The GMCH (Graphics & Memory
Controller Hub) connects the CPU to system memory, graphics (option-
ally), and to the ICH. The ICH (I/O Controller Hub), then branches out to
communicate with generic I/O devices, storage, and so on.
How exactly could a chipset like this make use of PCI Express? First,
recall the generic PCI Express topology discussed in Chapter 5 and
shown in Figure 12.1.
[Figure 12.1: Generic PCI Express topology with the CPU, a root complex, PCI Express links, and a PCI/PCI-X bridge]
These examples are just that, examples, and actual PCI Express designs may or
may not be implemented as such.
Desktop Partitioning
Desktop chipsets generally follow the (G)MCH and ICH divisions dis-
cussed above. An example PCI Express topology in the desktop space is
shown in Figure 12.2.
[Figure 12.2: Hypothetical desktop partitioning; the (G)MCH connects the CPU, memory, a x16 PCI Express graphics port, and a x1 PCI Express GbE port, and links over PCI Express to the ICH, which fans out to USB 2.0, PCI, SATA HDD, SIO, a motherboard-down PCI Express device, and PCI Express add-in connectors]
In the above hypothetical example, the GMCH acts as the root com-
plex—interacting with the CPU and system memory, and fanning out to
three separate hierarchy domains. One goes to the graphics device or
connector, the second domain goes to the GbE (Gigabit Ethernet) LAN
device or connector and the third domain goes to the ICH (domain iden-
tifying numbers are arbitrary). The connection to the ICH may occur via a
direct connection on a motherboard or through several connectors or
cables if the GMCH and ICH reside on separate boards or modules (more
on this later in the chapter). Recall that this is a theoretical example only,
and actual PCI Express products may or may not follow the topology
breakdowns described here.
In this example, the chipset designers may have identified graphics
and Gigabit Ethernet as high priority devices. By providing them with
separate PCI Express domains off of the root complex, it may facilitate
flow control load balancing throughout the system. Thanks to the traffic
classes and virtual channels defined by the specification, it would be pos-
sible to place all these devices on a single domain and prioritize traffic via
those specified means. However, if both graphics and Gigabit Ethernet
require large amounts of bandwidth, they may compete with each other
and other applications for the available flow control credits and physical
link transmission time. Separating these devices onto separate domains
may facilitate bandwidth tuning on all domains.
Naturally, the downside to this possibility is that the GMCH/root
complex is required to be slightly larger and more complex. Supporting
multiple domains requires the GMCH to implement some arbitration
mechanisms to efficiently handle traffic flow between all three PCI Ex-
press domains, the CPU and main memory interfaces. Additionally, the
GMCH needs to physically support PCI Express logic, queues, TX and RX
buffers, and package pins for all three domains. For these reasons, it may
be just as likely that the Gigabit Ethernet connection is located off of the
ICH instead of the GMCH.
Since graphics tends to be a bandwidth-intensive application, the
GMCH may implement a x16 port for this connection. This allows for a
maximum of 16 × 250 megabytes per second = 4 gigabytes per second in
each direction. The graphics device may make use of this port via a direct
connection down on the motherboard or, more likely, through the use of
a x16 PCI Express connector (more on PCI Express connectors later in
this chapter). Through this connector, a graphics path is provided that is
very similar to today’s AGP (Accelerated Graphics Port) environment, but
provides additional bandwidth and architectural capabilities.
Gigabit Ethernet bandwidth requirements are much less than those
for graphics, so the GMCH may only implement a x1 port for this con-
nection. This allows for a maximum of 1 × 250 megabytes per second =
250 megabytes per second in each direction. The Gigabit Ethernet device
may make use of this port via a x1 connector or may be placed down on
the motherboard and tied to the GMCH directly.
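The per-direction figures used throughout this section all derive from the same arithmetic: 2.5 gigabits per second per lane, less the 8-bit/10-bit encoding overhead, yields 250 megabytes per second per lane per direction. A quick C check:

#include <stdio.h>

/* 2.5 Gbit/s per lane * 8/10 encoding = 2.0 Gbit/s = 250 MB/s per
 * lane in each direction; wider links scale linearly. */
unsigned mb_per_sec(unsigned lanes)
{
    return lanes * 250;
}

int main(void)
{
    printf("x1:  %u MB/s per direction\n", mb_per_sec(1));   /* 250  */
    printf("x4:  %u MB/s per direction\n", mb_per_sec(4));   /* 1000 */
    printf("x16: %u MB/s per direction\n", mb_per_sec(16));  /* 4000 */
    return 0;
}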
The bandwidth requirements for the PCI Express connection be-
tween the GMCH and ICH depend mostly on the bandwidth require-
ments for the devices attached to the ICH. For this example, assume that
the bandwidth needs of the ICH can be met via a x4 PCI Express connec-
tion (with a maximum of 1 gigabyte per second in each direction).
Mobile Partitioning
Mobile chipsets also tend to follow the (G)MCH and ICH divisions dis-
cussed above. An example PCI Express topology in the mobile space is
shown in Figure 12.3. Again, this is a hypothetical example only, and ac-
tual PCI Express products may or may not follow the topology break-
downs described here.
[Figure 12.3: Hypothetical mobile partitioning; identical to Figure 12.2 except that the (G)MCH's x1 PCI Express port serves docking rather than GbE, and the ICH additionally supports PCMCIA alongside USB 2.0, PCI, SATA HDD, SIO, a motherboard-down PCI Express device, and PCI Express add-in connectors]
This example looks remarkably similar to the desktop model just dis-
cussed. The GMCH still acts as the root complex, interacting with the
CPU and system memory, and fanning out to three separate hierarchy
domains. One still goes to the graphics device or connector and another
still goes to the ICH. The only noticeable difference between Figure 12.2
and Figure 12.3 is that the mobile platform has identified one of the
GMCH/root complex’s domains for docking, whereas the desktop model
had identified it for Gigabit Ethernet. If the GMCH does not supply a x1
port for Gigabit Ethernet (desktop) or docking (mobile), that functionality
would likely be located on the ICH.
Just like on the desktop model, mobile graphics tends to be a band-
width intensive application so the GMCH may implement a x16 port for
this connection (with a maximum of 4 gigabytes per second in each di-
rection). The graphics device may make use of this port via a mobile-
specific x16 connector, or more likely, through a direct connection if it is
placed on the motherboard. Docking bandwidth requirements are much
less than those for graphics, so the GMCH may only implement a x1 port
for this connection (with a maximum of 250 megabytes per second in
each direction). PCI Express allows for a variety of docking options due
to its hot-plug and low-power capabilities.
As on the desktop model, the bandwidth requirements for the PCI
Express connection between the GMCH and ICH depend mostly on the
bandwidth requirements for the devices attached to the ICH. For this ex-
ample, assume that the bandwidth needs of the ICH can still be met via a
x4 PCI Express connection (with a maximum of 1 gigabyte per second in
each direction). In order to prioritize and differentiate between the vari-
ous types of traffic flowing between the GMCH and ICH, this interface
likely includes support for multiple traffic classes and virtual channels.
The ICH in this example is also almost identical to that in the desktop
model. It continues to act as a switch that fans out the third PCI Express
domain. The three (downstream) PCI Express ports shown on the ICH
are likely x1 ports. These provide high speed (250 megabytes per second
maximum each way) connections to generic I/O functions. In the exam-
ple shown in Figure 12.3, one of those generic I/O functions is located
on the motherboard, while the other two ports are accessed via x1 con-
nectors. The x1 connectors used for a mobile system are obviously not
going to be the same as those used in a desktop system. There will likely
be specifications that define mobile specific add-in cards, similar to mini-
PCI (from PCI SIG) or the PC Card (from PCMCIA) in existing systems.
The PCI Express specification provides a great amount of flexibility in the
types of connectors and daughter cards that it can support.
For example, the power and voltage requirements in a mobile system and
for a mobile x1 PCI Express connector likely need to meet much differ-
ent standards than those used in a desktop environment. Since PCI Ex-
press is AC coupled, this allows a wide range of options for the common
mode voltages required by a PCI Express device.
This example demonstrates another of the benefits of PCI Express—
functionality across multiple segment types. The GMCH and ICH used in
the desktop model could, in fact, be directly reused for this mobile
model. Even though the x1 port off the GMCH is intended as a Gigabit
Ethernet port for desktops, it could just as easily be a x1 docking port for
mobile systems. Since PCI Express accounts for cross-segment features
such as hot-plugging and reduced power capabilities, it can span a wide
variety of platforms.
Server Partitioning
Server chipsets generally follow the MCH and ICH divisions discussed
above, with the difference being that the MCH generally has more I/O
functionality than a desktop or mobile MCH. An example PCI Express to-
pology in the server space is shown in Figure 12.4.
[Figure 12.4: Hypothetical server partitioning; two CPUs attach to the chipset (MCH), which connects to memory and fans out over PCI Express to a dual GbE device, add-in connectors, a RAID I/O processor, a PCI-X bridge, and an InfiniBand switched fabric]
In the above example, the MCH acts as the root complex, interacting
with the CPU and system memory, and fanning out to multiple hierarchy
domains. In this example, the MCH has implemented three x8 interfaces,
but supports each as two separate x4 ports. In this example, the MCH is
running one of the interfaces as a x8 port and is running the other two as x4
ports, providing a total of five PCI Express ports (four x4 ports and a x8
port). The full x8 port is connected to an Infiniband device. The second
x8 port splits into two x4 ports, with one x4 port connected to a PCI-X
bridge and the other x4 port connected to an I/O Processor (RAID: Re-
dundant Array of Independent Disks controller). The third x8 port is also
split into two x4 ports, with one x4 port going to a dual Gigabit Ethernet
part and the other x4 port going to a connector.
In this example, the dual Gigabit Ethernet, RAID controller, PCI-X
bridge, and generic add-in connector are each provided a x4 port (with a
maximum of 1 gigabyte per second in each direction). If a function, such
as the PCI-X bridge, requires more bandwidth, this platform is flexible
enough to accommodate that need. The system designer could provide
that function with a full x8 connection (with a maximum of 2 gigabytes
per second in each direction) if they were willing to sacrifice one of the
other x4 ports (that is, a generic add-in x4 port). The example shown
here has prioritized the Infiniband device by providing it with a full x8
port, rather than providing an additional x4 port.
This example further demonstrates the great flexibility that PCI Ex-
press offers. The chipset designers have simply provided three x8 PCI
Express interfaces, but have allowed a wide variety of implementation
options. Depending on the platform’s needs, those x8 interfaces could be
configured as identified here or in a much different manner. If a system
does not need to provide PCI Express or PCI-X connectors, this same
chip could be used to provide three full x8 interfaces to RAID, Gigabit
Ethernet, and Infiniband. Nor do the chip designers need to identify
ahead of time if the port is used on the motherboard, through a single
connector on the main board, or through a riser connector in addition to
the card connector. PCI Express inherently allows for all of those op-
tions. In the above example, any one of the identified functions could be
located directly down on the main board, through a connector on the
main board, up on a riser, or through a connector located on a riser.
One important item to note at this point is that PCI Express does not
require larger interfaces to be able to be divided and run as multiple
smaller ports. The chipset designers in this example could have simply
implemented three x8 ports and not supported the bifurcation into mul-
tiple x4 ports. Each PCI Express port must be able to “downshift” and
run as a x1 port, but that does not mean that a x8 port needs to run as 8
separate x1 ports. Implementing multiple port options as discussed here
is an option left to the chip designers.
Form Factors
PCI Express can be used in a variety of form factors and can leverage ex-
isting infrastructure. Motherboards, connectors, and cards can be de-
signed to incorporate existing form factors such as ATX/µATX in the
desktop space, or rack mount chassis in the server space. This is shown
in Figure 12.5.
[Figure 12.5: µATX motherboard expansion connectors; a CNR connector sharing an expansion slot with a PCI connector, a second PCI connector, a x1 PCI Express connector, and a x16 PCI Express connector]
In the example shown in Figure 12.5, the µATX motherboard has in-
corporated five connectors using a total of four expansion slots in the
chassis. This design incorporates two PCI slots, one of which shares an
expansion slot with the CNR (Communication and Networking Riser)
connector. In addition to these three connectors, there is also a x1 PCI
Express connector along with a x16 PCI Express connector. The PCI Ex-
press connectors are offset (from the back edge of the chassis) by a dif-
ferent amount than CNR, PCI or AGP connectors. Additionally, PCI
Express connectors and cards are keyed differently than other standards.
Neither of these modifications inhibits PCI Express from properly meet-
ing ATX/µATX expansion slot specifications. Rather, these modifications
are needed to prevent improper insertion of non-PCI Express cards into
PCI Express connectors, and vice versa.
Similarly, PCI Express can meet existing form factor requirements in
both the mobile and server space. The electrical specifications for the in-
Modular Designs
Because of PCI Express's flexibility, it is not necessarily confined to exist-
ing form factors. It can be used to help expand new concepts in form
factors, and help in evolutionary and revolutionary system designs. For
example, PCI Express can facilitate the use of modular or split-system de-
signs. The system core can be separated from peripherals and add-in
cards, and be connected through a PCI Express link. For the desktop
chipset shown in Figure 12.2, there is no reason that the GMCH and ICH
need to be located on the same motherboard. A system designer could
decide to separate the ICH into a separate module, then connect that
module back to the GMCH’s module via a PCI Express connection. Natu-
rally, PCI Express electrical and timing requirements would still need to
be met, and the connectors and/or cables needed for such a design
would need extensive simulation and validation. Example modular de-
signs are shown in Figure 12.6.
Connectors
In order to fit into existing form factors and chassis infrastructure, PCI
Express connectors need to be designed to meet the needs of today’s sys-
tem environment. Since PCI Express is highly scalable, however, it also
needs to have connectors flexible enough to meet the variety of func-
tions that PCI Express can be used for. As such, PCI Express does not in-
herently require a single connector. Rather, connector standards are
likely to emerge that define connectors and cards for a variety of different
needs. There is already work being done on generic add-in cards for
desktop, mini-PCI and PC Card replacements for communications and
mobile, and modules for server systems. Generic add-in cards are likely to
use the connector family shown in Figure 12.7.
[Figure 12.7: The PCI Express generic add-in connector family: x1, x4, x8, and x16]
These connectors are simple through-hole designs that fit within the
existing ATX/µATX form factor. The scalable design allows for connec-
tors from x1 up to x16. The cards associated with these connectors use
the existing PCI I/O bracket and follow PCI card form factor require-
ments for height (standard versus low profile) and length (half versus
full). The connectors are designed in a modular manner such that each
successively larger connector acts like the superset connector for its
smaller brethren. For example, the x8 connector has all the same con-
nections (in the same places) as the x4 connector, but then adds the four
additional lanes to the "end" of the connector. This unique design allows
PCI Express connectors to support multiple card sizes. For example, a
x8 connector can support x1 and x4 cards as well as x8 cards. This flexibility is
shown in Table 12.1.
Table 12.1 PCI Express Card and Connector Compatibility

Card   x1 Connector   x4 Connector   x8 Connector   x16 Connector
x1     Yes            Yes            Yes            Yes
x4     No             Yes            Yes            Yes
x8     No             No             Yes            Yes
x16    No             No             No             Yes
A system that implements a x16 connector can support all four card
sizes, but this does not necessarily mean that the interface will run at all
four port widths. If the motherboard uses a x16 connector, the chip at-
tached to that connector is likely to support a port width of x16 (since it
does not make much sense to use a connector larger than the port at-
tached to it). Also following the specification, that port needs to be able
to downshift and run as a x1 port. Whether that port can also run as a x4
port and/or a x8 port is dependent on the implementation details of that
chip.
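The downshift rule can be pictured as a search for the widest width common to both ends of the link, with x1 as the guaranteed fallback. The sketch below treats supported widths as a bitmask; it is illustrative only and greatly simplifies the actual link training process.

#include <stdio.h>

/* Bitmask of supported widths, one bit per width. */
#define W(x) (1u << (x))

/* Return the widest width supported by both ports; every compliant
 * port supports x1, so the result is never zero. */
unsigned negotiated_width(unsigned a, unsigned b)
{
    unsigned common = a & b;
    for (unsigned w = 16; w >= 1; w >>= 1)   /* try x16, x8, x4, x2, x1 */
        if (common & W(w))
            return w;
    return 1;
}

int main(void)
{
    /* A x16-capable port that also trains at x1, paired with a x4
     * card that supports x4 and x1: the link trains at x1 here,
     * because this particular x16 port does not implement x4. */
    unsigned port = W(16) | W(1);
    unsigned card = W(4)  | W(1);
    printf("link width: x%u\n", negotiated_width(port, card));  /* x1 */
    return 0;
}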
The ability to support multiple connector sizes, as well as card sizes
within each connector, poses some interesting problems for shock and
vibration. In addition to the connector and card/module standards that
are to emerge for PCI Express, there is a need for new retention mecha-
nisms for those cards and connectors. The retention mechanisms cur-
rently in use (for example, with AGP) are not necessarily well suited for
the shock and vibration issues that face PCI Express.
As mentioned in previous chapters, these connectors are very similar,
in terms of materials and manufacturing methods, to those used for con-
ventional PCI. By using the same contact style and through-hole design,
the manufacturing costs are less than they would be for a completely
new connector design. Additionally, the same processes for securing
connectors to the printed circuit board can be reused.
Since PCI Express connectors vary in length (in relationship to the
maximum supported link width), connector and card costs are also likely
to vary. For generic add-in support, akin to the multiple PCI connectors
found in existing desktop systems, system designers are likely to use a x1
connector (providing a maximum of 250 megabytes per second in each
direction, or 500 megabytes per second of total bandwidth). Not only
does this provide increased bandwidth capabilities (PCI provides a theo-
retical maximum of 132 megabytes per second in total bandwidth), but it
uses a smaller connector as well. The smaller x1 PCI Express connector
should help motherboard designs by freeing up additional real estate for
component placement and routing. Since PCI Express requires a smaller
connector than PCI, there are also some potential material savings from a
manufacturing standpoint. Figure 12.8 shows the comparative size of a
x8 PCI Express connector.
Presence Detection
The PCI Express connectors shown here provide support for presence
detection. Specific presence detection pins, located throughout the con-
nector, allow the motherboard to determine if and when a card is in-
serted or removed. This allows the motherboard to react properly to
these types of events. For example, a motherboard may gate power de-
livery to the connector until it is sure that the card is fully plugged in. Al-
ternatively, the presence detect functionality may be used to log an error
event if a card is unexpectedly removed.
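A motherboard's reaction to the presence-detect pins might be modeled as a small state machine like the following C sketch; the names and the printf placeholders are illustrative.

#include <stdbool.h>
#include <stdio.h>

typedef enum { SLOT_EMPTY, SLOT_POWERED } slot_state_t;

/* Called whenever the presence-detect pins change state: gate power
 * until the card is fully seated, and log unexpected removals. */
slot_state_t on_presence_change(slot_state_t s, bool card_present)
{
    if (s == SLOT_EMPTY && card_present) {
        printf("card seated: enabling slot power\n");
        return SLOT_POWERED;
    }
    if (s == SLOT_POWERED && !card_present) {
        printf("unexpected removal: logging error event\n");
        return SLOT_EMPTY;
    }
    return s;
}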
Routing Implications
Stackup
Advances in computer-based electronics, especially cutting edge ad-
vances, often require the advancement of printed circuit board (PCB)
manufacturing capabilities. This is usually needed to accommodate new
requirements for electrical characteristics and tolerances.
The printed circuit board industry uses a variety of glass laminates to
manufacture PCBs for various industries. Each laminate exhibits different
electrical characteristics and properties. The most common glass lami-
nate used in the computer industry is FR4. This glass laminate is pre-
ferred because it has good electrical characteristics and can be used in a
wide variety of manufacturing processes. Processes that use FR4 have
relatively uniform control of trace impedance, which allows the material
to be used in systems that support high speed signaling. PCI Express does
not require system designers to use specialized glass laminates for
printed circuit boards. PCI Express can be implemented on FR4-based
PCBs.
The majority of desktop motherboard and add-in card designs are
based on a four-layer stackup to save money on system fabrication costs.
A traditional four-layer stackup consists of a signal layer, a power layer, a
ground layer, and another signal layer (see Figure 12.9). There is signifi-
cant cost associated with adding additional signal layers (in multiples of
two to maintain symmetry), which is usually highly undesirable from a
desktop standpoint. Due to increased routing and component density,
mobile and server systems typically require stackups with additional lay-
ers. In these designs, there are often signal layers in the interior portion
of the board to alleviate much of the congestion on the outer signal lay-
ers. Signal routing on internal layers (referred to as stripline) has differ-
ent electrical characteristics than signal routing on external layers
(referred to as micro-strip). PCI Express electrical requirements are speci-
fied in order to accommodate either type of routing.
[Figure 12.9: A traditional four-layer stackup; signal layer, power layer, glass laminate "dielectric", ground layer, and signal layer]
Four-layer stackups are used primarily in the desktop computer market,
which equates to approximately 70 percent of overall computer sales.
Routing Requirements
As discussed in Chapter 8, PCI Express uses differential signaling. This
requires that motherboards and cards use differential routing techniques.
Routing should target 100 ohms differential impedance. The PCB stackup
(micro-strip versus stripline, dielectric thickness, and so on) impacts
what trace thickness and spacing meets that target. For micro-strip rout-
ing on a typical desktop stackup, 5-mil wide traces with 7-mil spacing to
a differential partner and 20-mil spacing to other signals (5-7-20) meet
the 100 ohm differential target.
From a length-matching perspective, PCI Express offers some nice
advances over parallel busses such as conventional PCI. In many in-
stances designers have to weave or “snake” traces across the platform in
order to meet the length-matching requirement between the clock and
data signals for a parallel bus. This is needed to ensure that all the data
and clocks arrive at the receiver at the same time. The length-matching
requirements of parallel busses, especially as bus speeds increase, come
at a high cost to system designers. The snaking required to meet those
requirements leads to extra design time as well as platform real estate, as
shown on the left side of Figure 12.10. Since each PCI Express lane uses
8-bit/10-bit encoding with an embedded clock (refer to Chapter 8), the
lanes’ length-matching requirements are greatly relaxed. A PCI Express
link can be routed without much consideration for length matching the
individual lanes within the link. This is shown on the right side of Figure
12.10.
Figure 12.10 note: The left side shows a parallel bus routing example where the traces are
"snaked" to length-match them to the clock in order to guarantee that data and clock arrive
simultaneously. The right side shows a PCI Express routing solution; note that the freedom
from length matching frees up board space and simplifies the routing.
Polarity Inversion
PCI Express offers several other interesting items to facilitate the routing.
One example of this is the support for polarity inversion. PCI Express
devices can invert a signal after it has been received if its polarity has
been reversed. This occurs if the TX+ pin of one device is connected to
the RX- pin of its link-mate. As discussed in Chapter 8, polarity inversion
is determined during link initialization.
Polarity inversion may occur due to a routing error, or it may be de-
liberate to facilitate routing. For example, as shown in Figure 12.11, the
natural alignment between these two devices has the D+ of one device
aligned with the D- of its link-mate (naturally one D+ would be a TX
while the other would be an RX). In this scenario, the system designer
may want to purposely use polarity inversion to simplify the routing. Try-
ing to force the D+ of one device to connect to the D+ of the other
would force a crisscross of the signals. That crisscross would require an
extra layer change and would force the routing to be non-differential for
a time. Polarity inversion helps to simplify the routing.
[Figure 12.11: Logical inversion; two PCI Express devices wired with the D+ of one aligned to the D- of the other, so a transmitted 1010... pattern arrives as 0101... and the receiver logically inverts the bit stream]
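Conceptually, polarity inversion amounts to latching an invert flag during link initialization and negating every received symbol thereafter. The following C sketch is a simplification for illustration; the real mechanism detects inversion from training sequence symbols, as described in Chapter 8.

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool invert;   /* latched during link initialization */
} lane_rx_t;

/* During training: if the received pattern is the bitwise inverse of
 * the expected training pattern, the lane's polarity is reversed. */
void train_polarity(lane_rx_t *rx, uint16_t received, uint16_t expected)
{
    rx->invert = (received == (uint16_t)~expected);
}

/* Datapath: apply the latched correction to each received symbol. */
uint16_t rx_symbol(const lane_rx_t *rx, uint16_t raw)
{
    return rx->invert ? (uint16_t)~raw : raw;
}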
Lane Reversal
Lane reversal is another technique that PCI Express offers to facilitate the
routing. Lane reversal allows a port to essentially reverse the ordering of
its lanes. For instance, if a port is a x2 port, lane 0 may be at the top of
the device with lane 1 at the bottom, as shown in Figure 12.12. If a de-
vice supports lane reversal, it can reverse its lane ordering and have lane
1 act like lane 0 and lane 0 act like lane 1.
[Figure 12.12: A x2 link in its natural alignment; Device A's lane 0 (TX0/RX0) sits opposite Device B's lane 1 (TX1/RX1), and vice versa]
Why would this be useful? As with polarity inversion, the natural align-
ment between devices may line up such that lane 0 of Device A does not
always line up with lane 0 of Device B. Rather than force the connection
of lane 0 to lane 0, forcing a complete crisscross of the interface (referred
to as a “bowtie”), lane reversal allows for an easier and more natural rout-
ing. This is shown in Figure 12.13.
[Figure 12.13: The same x2 link with lane reversal; one device reverses its lane ordering so the lanes connect straight across without a "bowtie" crisscross]
AC Coupling
PCI Express signals are AC coupled to eliminate the DC Common Mode
element. By removing the DC Common Mode element, the buffer design
process for PCI Express becomes much simpler. Each PCI Express device
can also have a unique DC Common Mode voltage element, eliminating
the need to have all PCI Express devices and buffers share a common
voltage.
This impacts the system in several ways. First, it requires AC coupling
capacitors on all PCI Express traces to remove the common mode voltage
element. As can be seen near the connector in Figure 12.10, each PCI
Express signal has a discrete series AC capacitor on it (capacitor packs
This chapter looks more closely at the timeline for products based on
PCI Express to enter the market. Several factors play a role in the in-
troduction of applications. This chapter looks at the factors that affect
adoption and discusses the profiles, benefits, challenges, and tools for
early adopters as well as late adopters of the technology.
Anticipated Schedule
The applications that can take advantage of the benefits of PCI Express
will be the first to enter the market. Graphics, Gigabit Ethernet, IEEE
1394, and high-speed chip interconnects are a few examples of the types
of applications to adopt PCI Express.
Sufficient Resources
A few companies have sufficient resources to develop the necessary in-
tellectual property and building blocks for a PCI Express interface inter-
nally. Larger companies can afford to absorb the development costs and
time. For example, Intel is a large company that can afford to develop
PCI Express building blocks to be used across multiple divisions and
markets. Intel plans to offer a wide range of products and support across
multiple market segments that use the PCI Express architecture. Through
Market Dynamics
Market dynamics also play a major role in the adoption of PCI Express
applications. Compare the differences between the graphics suppliers
and analog modem suppliers. In the case of the graphics market, the end
user and customers continually drive for greater performance. Due to this
market demand, graphics suppliers such as ATI and nVidia capture addi-
tional value through higher average selling prices by providing the latest
technology and performance over the existing technology. If a graphics
supplier can show demonstrable performance gains with PCI Express
over AGP8x, that supplier will likely capture more value and more reve-
nue for their latest product as the older technology continues to experi-
ence price erosion. The immediate realization in revenue plays a major
role in recouping the development costs. The analog modem market
[Figure: An ideal product life cycle curve; sales and profit ($$) plotted against time]
When new products enter the market, it takes some time before sales
and profits ramp. As products reach the growth stage, sales turn into prof-
its as volume becomes the largest contributing factor. After
achieving a peak volume, products eventually enter the decline stage
where the profits and sales decline. The challenge a business faces is de-
termining where the industry is on the ideal curve.
Industry Enabling
Industry enabling and collaboration between companies and within SIGs
(Special Interest Groups) such as the PCI-SIG will have a significant im-
pact on the adoption of new technologies. As with many new technology
introductions, the term “bleeding edge” is more commonly used than
testing workshops and plugfests. Figure 13.3 indicates the types of pro-
grams the PCI-SIG provides.
The intellectual property provider market also provides key tools for
developing products for emerging technologies. Companies exploring
new product plans perform cost/benefit analysis of making versus buying
for the PCI Express block. The benefit of purchasing the PCI Express
Physical Layer transceiver (PCI Express PHY) for example, is that pur-
chasing enables the component manufacturer to expedite product devel-
opment on the core of the device and not spend time and resources on
the PCI Express interface. Intellectual property providers typically pro-
vide tools for testing, debug, and designing with the core block. In this
example, the intellectual property providers are a tool early adopters can
use.
providers, tools, and industry deploy PCI Express in volume, there will
be an initial cost impact relative to PCI. The later adopter must overcome
the advances the early adopters have made by being further down the
learning curve, and usually these companies have a fundamental difference
in their operating model to support a lower cost structure.
Applications that show little difference in migrating from PCI to PCI
Express will transition later, or potentially never. For example, 10/100
Ethernet Network Interface Cards (NICs) are in abundance on PCI today.
Migrating from PCI (133 megabytes per second) to PCI Express (250
megabytes per second per direction) for a maximum line connection of
100 megabits per second (12.5 megabytes per second) would give few
performance gains because the performance today on PCI is sufficient.
The LAN market is quickly adopting Gigabit Ethernet as the desired net-
working standard, and it is unlikely that 10/100 NIC cards will migrate to
PCI Express rapidly, if ever.
[Figure 13.4: Sales and profit ($$) over time for two market ramp scenarios; scenario 1 peaks earlier, scenario 2 later]
Late adopters must decide when to enter the market. As most new
architectures take 12 to 18 months to develop, this means companies
must predict accurately or possibly miss the opportunity. For example, in
scenario 1 of Figure 13.4, products should enter the market within 12
months of introduction, ahead of the peak volume opportunity. This al-
lows the intellectual property providers and the industry to work
through some initial iterations while not burdening the late adopter with
additional costs. The challenge comes in that, if the company believes
the ramp will follow scenario 2, they will develop product plans and re-
source plans later than required. The end result is that the company en-
ters the market later than anticipated and misses the peak profit
opportunity. As mentioned previously, late adopters also need to time
the market correctly to overcome the benefits of the learning curve and
brand awareness the early adopter has established.
This chapter looks more closely at the aspects behind planning and
defining PCI Express-based products from two different perspectives.
The first example represents the challenges and decisions a component
manufacturer must make in developing a PCI Express-based device.
There are some unique aspects PCI Express presents to silicon device
manufacturers. The second example takes a look at the challenges and
decisions a motherboard manufacturer must make in using PCI Express
devices.
Market Assessment
As covered in Chapter 4, graphics is a unique application that has con-
tinuously evolved with faster and faster interfaces. Figure 14.1 shows the
bandwidth evolution as discussed in Chapter 4 (Note the bandwidth is
shown for one direction in Figure 14.1). PC graphics is an application
where suppliers can achieve higher prices on initial implementations to
recoup the development costs associated with new development. Refer
to Chapter 4 for more information, but rather than revisit the content,
the underlying assumption is that there is a compelling reason to migrate
to a higher bandwidth interface since it has proven to be the natural evo-
lution since the early 1990s.
[Figure 14.1: Graphics interface bandwidth from 1992 to 2004; PCI, AGP, AGP4x, AGP8x, and x16 PCI Express, climbing from well under 1000 MB/s to roughly 4000 MB/s in one direction]
This chapter walks through three hypothetical scenarios and discusses the relevant topics in buy versus make
for:
■ Using an ASIC Manufacturing and Design Flow
■ Using a Foundry Manufacturing and Design Flow
■ Using Internal Manufacturing and Design Flow
The three scenarios are listed in Figure 14.2.
[Figure 14.2: The three buy-versus-make options, ordered by increasing reliance on third parties. Visible excerpt, ASIC Flow ("Vendor A"): PROs are freeing up design resources for other activities and partnering for manufacturing core competency; CONs are the design limitations of the ASIC process flow and reliance on a 3rd party for manufacturing]
still critical pieces of the design. Key factors to consider in this scenario
are costs, which are typically on a per-unit basis and include an up-front
payment, and the long term strategic impact. Typically, ASIC flows do
not offer multiple sources of intellectual property from other suppliers.
ASIC flows become difficult for multiple product SKUs and multiple
product generations due to the generally limited flexibility. ASIC flows
usually encompass wafer and final component testing. The graphics ven-
dor receives the finished product that is ready to ship.
The other category of silicon graphics suppliers who do not operate
fabrication facilities use a fabless semiconductor business model or foun-
dry flow. Here the tradeoffs are still to “buy” versus “make” the necessary
building blocks, but the differences vary significantly. Vendor B, for ex-
ample, operates a fabless business model and partners with foundries
such as TSMC and UMC. In this business model, the graphics vendor pays
for the mask set for the specific foundry’s fab. The mask is used in the
production facility to create wafers that contain multiple chips. At the
end of the process flow, the vendor receives untested raw wafers. The
foundry typically does not provide intellectual property, but provides the
core libraries necessary to design the end product. Unlike the ASIC flow,
there is a wide availability of intellectual property building blocks from
multiple intellectual property suppliers.
However, the decision to make in this scenario is whether or not to
buy intellectual property from the intellectual property suppliers targeted
at a specific foundry or to develop the necessary building blocks inter-
nally. This decision boils down to time to market, cost, and strategic
relevance. For example, if the company can expedite product develop-
ment by several months at the expense of a million dollars in fees to the
intellectual property provider, this may be a worthwhile tradeoff to gain
market segment share and several months of profits. Alternatively, if the
vendor determines the PCI Express core is critical to its future and
wants to own the rights to the intellectual property outright, it may opt
to develop the cores internally. The graphics vendors typically partner
with a single foundry to attempt to gain the largest
bargaining position on wafer costs. Unlike the ASIC flow, the foundry
model still requires the vendor to determine packaging and testing op-
tions for the end device.
Finally, Vendor C is a vertically integrated company and has an estab-
lished manufacturing process capability (such as Intel and SiS). The deci-
sion of “buy” versus “make” is significantly altered if the vendor has
internal fabrication facilities. In this scenario, the process manufacturing
capability (ability to build chips) is likely to be protected. The vendor
[Figure: AGP 4X and AGP 8X signaling; parallel data lines AD[31:0] are sampled by strobes (AD_STB/AD_STB# for AGP 4X, AD_STBF/AD_STBS for AGP 8X) against a 66 MHz AGPCLK; AGP8x introduced an increase in strobe sampling rate]
from the previous parallel data transfer to the serial PCI Express technol-
ogy.
[Figure: AGP 8X versus a x1 PCI Express lane; the 66 MHz AGPCLK parallel interface gives way to a 2.5 GHz differential pair carrying embedded clock and data, a departure from parallel data lines with the clock running roughly 38 times faster (66 MHz versus 2.5 GHz)]
Along with the new high-speed challenges come device testing tradeoffs.
Graphics suppliers will want to ensure that the product leaving their
factory is of a high standard of quality, such that OEMs and end users will
not experience field failures. The objective of product test programs is to
ensure that outgoing products meet a high enough standard of quality to
reduce return and failure costs. Quality is typically measured in defects
per million (DPM). To achieve these standards, the industry has embraced
two methodologies: structural testing and at-speed testing. The objective
of structural testing is to catch manufacturing defects. If the device has
a damaged transistor, the structural test should identify the failure by
detecting that the transistor did not turn on. Structural tests typically
implement a scan chain, in which a pattern of 0s and 1s is shifted serially
into the device and the tester captures a chain of output values that is
compared with the expected outcome. In the
damaged transistor example where the transistor failed to turn on, the
tester detects an error in the output.
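To make the scan-chain mechanics concrete, here is a minimal sketch in
Python (illustrative only; the chain length, pattern values, and the
stuck-at-0 fault location are hypothetical, not drawn from any real test
program). It shifts a stimulus through a modeled device, captures the
output chain, and diffs it against the expected golden response:

    # Minimal sketch of a scan-chain structural-test comparison.
    # The "device" is modeled as a function from input bits to output bits;
    # a damaged transistor is simulated as a stuck-at-0 scan cell.

    def scan_test(device, stimulus, expected):
        """Shift the stimulus through the device and diff the captured chain."""
        captured = device(stimulus)
        return [i for i, (got, want) in enumerate(zip(captured, expected))
                if got != want]          # empty list means the test passed

    def healthy_device(bits):
        return list(bits)                # every scan cell behaves as designed

    def stuck_at_zero(bits, cell=5):
        out = list(bits)
        out[cell] = 0                    # damaged transistor: cell never drives a 1
        return out

    stimulus = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical serial scan pattern
    expected = healthy_device(stimulus)  # golden response for a good part

    print(scan_test(healthy_device, stimulus, expected))  # [] -> pass
    print(scan_test(stuck_at_zero, stimulus, expected))   # [5] -> defect caught

A real scan insertion involves thousands of flip-flops and automatically
generated patterns, but the pass/fail decision reduces to exactly this kind
of captured-versus-expected comparison.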
The objective of at-speed testing is to ensure that the device operates
as expected under "real world conditions." In this scenario, the transistor
may have turned on as expected in the structural test, but at-speed testing
ensures that it turned on in the right amount of time. The departure from
the 66 megahertz clocking domain to the 2.5 gigahertz clocking capability
will pose substantial initial challenges for at-speed testing. Although
these speeds are not beyond the capabilities of some of the testers
currently on the market, the cost of outfitting an entire test floor with
at-speed testers would be prohibitive. Vendors must balance cost and risk
in developing sufficient test coverage plans.
Figure: Packet encapsulation. At the transaction layer a packet carries a
header, data, and an optional ECRC; the data link layer prepends a sequence
number and appends an LCRC; the physical layer adds framing symbols around
the whole packet. A companion efficiency-comparison chart plots the
resulting link efficiencies on an 80 to 100 percent scale.
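The efficiency tradeoff sketched in the chart can be approximated with a
few lines of arithmetic. The Python sketch below assumes the per-packet
overheads described above for a packet with a three-doubleword header:
2 bytes of physical layer framing (start and end symbols), a 2-byte
sequence number, a 12-byte header, an optional 4-byte ECRC, and a 4-byte
LCRC; the payload sizes are arbitrary examples:

    # Link efficiency = payload / (payload + per-packet overhead).
    FRAMING = 2     # 1-byte start + 1-byte end symbol at the physical layer
    SEQ_NUM = 2     # sequence number added at the data link layer
    HEADER  = 12    # 3DW transaction layer header (16 for 64-bit addressing)
    ECRC    = 4     # optional end-to-end CRC at the transaction layer
    LCRC    = 4     # link CRC added at the data link layer

    def efficiency(payload_bytes, with_ecrc=False):
        overhead = FRAMING + SEQ_NUM + HEADER + LCRC + (ECRC if with_ecrc else 0)
        return payload_bytes / (payload_bytes + overhead)

    for size in (64, 128, 256, 512, 1024):   # example payload sizes
        print(f"{size:5d}-byte payload: {efficiency(size):.1%} "
              f"({efficiency(size, with_ecrc=True):.1%} with ECRC)")

Note that this sketch ignores the 20 percent 8b/10b encoding overhead and
data link layer packet traffic (acknowledgments and flow control), both of
which reduce delivered bandwidth further.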
A comparison of the connectors for both PCI and PCI Express clearly
indicates that, in the long run, PCI Express will succeed in being the
lower-cost solution. The PCI connector consumes 120 pins and is roughly
84 millimeters long. The PCI Express x1 connector is much smaller at
25 millimeters long with 36 pins. Refer to Figure 14.9 for a comparison
of the various connectors. Although cost parity will likely be achieved
over time, the initially higher cost of the necessary components (silicon,
connectors, and so on) may delay adoption in the most price-sensitive
markets.
Performance Point    Connector Length: mm (inches)
x1                   25.00 (0.984)
x4                   39.00 (1.535)
x8                   56.00 (2.205)
x16                  89.00 (3.504)
PCI                  84.84 (3.400)
PCI-X                128.02 (5.040)
AGP                  73.87 (2.908)
At 2.5 gigahertz, designers face sources of signal deformation that are
trivial for most PCI implementations. Unlike previous technologies, vendors
can no longer ignore the effects of vias, capacitive parasitics, and
connectors.
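A back-of-the-envelope calculation shows why. At 2.5 gigabits per second a
bit time (unit interval) is only 400 picoseconds, so even a small parasitic
capacitance consumes a meaningful fraction of each bit. The following
Python sketch uses hypothetical but representative values (a 50-ohm line
and roughly 1 picofarad of combined via and pad parasitics) to estimate the
first-order rise time such a parasitic introduces:

    # Why parasitics matter at 2.5 Gb/s: a first-order RC estimate.
    # R and C are assumed, representative values, not specification limits.
    BIT_RATE = 2.5e9        # PCI Express signaling rate, bits per second
    UI = 1 / BIT_RATE       # unit interval: 400 ps per bit

    R = 50.0                # ohms: single-ended trace impedance
    C = 1.0e-12             # farads: assumed via + pad parasitic capacitance

    tau = R * C             # RC time constant of the parasitic low-pass
    t_rise = 2.2 * tau      # 10-90 percent rise time of a first-order RC

    print(f"unit interval: {UI * 1e12:.0f} ps")
    print(f"RC rise time : {t_rise * 1e12:.0f} ps "
          f"({t_rise / UI:.0%} of the bit time)")

With these assumptions the parasitic alone eats more than a quarter of the
bit time, whereas the same network consumes well under one percent of a
15-nanosecond PCI clock cycle at 66 megahertz.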
Figure: A representative 2.5 gigahertz signal route, from the device
package across PCB traces, through a series capacitor, vias, and the
connector, within a typical board stackup (signal, prepreg, power, core,
ground, prepreg, signal layers).
Conclusion
Hopefully, this book has helped show that PCI Express is an exciting
new architecture that will help move both the computing and communications
industries forward through the next ten years. The technology is flexible
enough to span computing platforms from servers to laptops to desktop PCs,
and to serve as the interconnect for Gigabit Ethernet, graphics, and
numerous other generic I/O devices.
This flexibility is afforded by PCI Express's layered architecture. The
three architectural layers offer increased error detection and handling
capabilities, flexible traffic prioritization and flow control policies,
and the modularity to scale into the future. Additionally, PCI Express
provides revolutionary capabilities for streaming media, hot plugging, and
advanced power management.
PCI Express does this while maintaining compatibility with much of the
existing hardware and software infrastructure to enable a smooth
transition.
Compliance – Ability to meet the design criteria and test requirements
defined by the PCI-Express Specifications.
Interoperability – Ability to coexist, function, and interact as needed.
Figure: The compliance process flow. The PCI-Express architecture
specifications define the design criteria, which in turn drive the test
checklists; a product either passes or fails compliance testing against
those checklists.
Compliance Testing
It is expected that all major test equipment vendors will provide
compliance test tools applicable at the various layers. In addition, the
PCI-SIG is likely to identify a set of equipment for the compliance testing
process. A PCI-Express product vendor should check the websites listed
below to see what test equipment is relevant to their product, and obtain
that equipment to test on their own premises before going to a plugfest.
The following figures show example topologies that may be used for testing
at the platform level (BIOS, Root Complex) and at the add-in device level
(endpoints, switches, bridges). The actual test topologies employed in
compliance testing may differ from what is shown here, but are expected to
be functionally equivalent.
The Functional Compliance Test (FCT) card, or an entity providing a
similar function, is intended to test at the link layer and above
(including BIOS). A separate electrical tester card is intended to verify
electrical compliance at multiple link widths. In all cases it is expected
that the tests executed with such cards will clearly map the test results
to one or more assertions. In addition, test equipment such as protocol
analyzers and oscilloscopes is expected to support automatic checking of
the results as much as possible, reducing human intervention and any
associated errors. It is worth noting that
Platform Components
This example topology is suited to testing the BIOS's ability to configure
PCI-Express and PCI devices properly and to program resources for
supporting power management and hot-plug, as well as a Root Complex's
ability to handle messages, legacy interrupts, error conditions on the root
ports' links, and so on. For electrical testing, the electrical tester card
is inserted into an appropriate slot (matching the width that slot
supports), and the root ports' transmitter and receiver characteristics
(such as jitter and voltage levels) are measured via oscilloscopes and
analysis software.
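As a rough illustration of the automatic result checking mentioned above,
the following Python sketch maps captured measurements onto pass/fail
assertions the way a compliance tool might. The assertion identifiers,
field names, and limit values here are invented for illustration (the
800-millivolt transmitter eye figure echoes the specification's minimum
differential swing, but the rest is hypothetical):

    # Hypothetical sketch: map captured test results onto compliance assertions.
    ASSERTIONS = {
        "TL-1.1": ("max_payload_bytes",  lambda v: v in (128, 256, 512)),
        "DL-2.3": ("lcrc_errors",        lambda v: v == 0),
        "PL-4.7": ("tx_eye_voltage_mv",  lambda v: v >= 800),
    }

    def check(results):
        """Return {assertion_id: 'PASS' or 'FAIL'} from raw captured results."""
        report = {}
        for assertion_id, (field, ok) in ASSERTIONS.items():
            value = results.get(field)
            report[assertion_id] = "PASS" if value is not None and ok(value) else "FAIL"
        return report

    captured = {"max_payload_bytes": 256, "lcrc_errors": 0, "tx_eye_voltage_mv": 742}
    print(check(captured))  # {'TL-1.1': 'PASS', 'DL-2.3': 'PASS', 'PL-4.7': 'FAIL'}

The point is simply that every measurement traces back to a named
assertion, so a failing product yields an actionable list of failures
rather than a bare fail flag.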
Figures: Example compliance test topologies. In each, compliance tests are
driven through an FCT tester card with a protocol analyzer observing the
link, once at the platform level on an Intel® Architecture platform and
once against an add-in card as the device under test (DUT).
Interoperability Testing
Compliance is a prerequisite to interoperability testing. Once compliance
is established, it is necessary to verify that the device or function in
question works with other devices and functions (not just PCI-Express-based
ones) in a system. A typical way to test for this is to introduce the
device into a well-known operating environment and run applications that
measure the electrical characteristics of the link to which it is
connected, its power consumption in the various power management states it
supports, and functional characteristics such as interrupt behavior. An xN
device is tested in its natural xN mode, which exposes any issues with
multi-lane operation such as lane-to-lane skew and lane reversal. The test
is repeated using the devices and functions in as many platforms as
applicable.
Plugfests
The PCI-SIG periodically arranges plugfests (multi-day events) where
multiple PCI-Express product vendors bring their products and
participate in a structured testing environment. While it is expected that
vendors would have tested their products to a good extent on their
premises, early plugfest events provide a great venue to test against other
implementations which otherwise may not be accessible. If necessary,
there are usually opportunities to test and debug informally outside the
established process. For these reasons it is recommended that vendors
should plan on sending a mix of developers and test engineers to these
events. The bottom line is vendors should take advantage of these events
to refine their products and gain time to market advantage.
Useful References
These links provide a starting point for finding the latest information on
the Compliance and Interoperability (C&I) Test Specifications and events,
architecture specifications, and tools.
1. www.agilent.com
2. www.catc.com
3. http://developer.intel.com/technology/pciexpress/devnet/
4. http://www.pcisig.com/home
5. http://www.tektronix.com/