Design and Simulation of a PCI Express Based Embedded System
PCIe is a point-to-point, dual simplex, low-pin-count, differential signalling link for interconnecting devices. The PCIe link, shown in Figure 1, implements the physical connection between two devices in the PCIe topology. A PCIe interconnect is constructed of either a x1, x2, x4, x8, x12, x16, or x32 point-to-point link. A x1 link has 1 lane, i.e. 1 differential signal pair in each direction (transmit and receive), for a total of 4 signals. Correspondingly, a x32 link has 32 lanes, or 32 signal pairs in each direction, for a total of 128 signals [2].

PCIe employs a packet-based communication protocol with split transactions. Communication in this bus system consists of the transmission and reception of packets called Transaction Layer Packets (TLPs). The transactions supported by the PCIe protocol can be grouped into four categories: Memory, IO, Configuration, and Message transactions [2]. A memory write is a posted transaction: the completer does not return a completion TLP back to the requester, unlike the memory read TLP, where the completer is supposed to return a completion TLP back to the requester. The completer returns either a Completion with Data (CPLD), if it is able to provide the requested data, or a Completion without data (CPL), if it fails to obtain the requested data.

2.2 PCI Express Topology

[Figure 1. PCI Express topology: the CPU connects over the FSB (Front Side Bus) to the Root Complex, which attaches to the system memory (memory bus) and graphics, and reaches Endpoints (EP) through a Switch over x1 PCIe links.]

[Figure 2. PCI Express Architecture and Transaction Layer Packet (TLP) Assembly/Disassembly: in each of Device A and Device B, the Device Core produces the TLP Header (HDR) and Data; the Transaction Layer appends a 1 DW ECRC; the Data Link Layer adds a 12-bit sequence number (SEQ #) and an LCRC; the Physical Layer frames the packet with 1-byte Start and End characters before it crosses the TX/RX link, and each layer strips its fields again on reception.]
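To make the TLP categories concrete, the following Verilog fragment assembles the first header DW of a 1-DW Memory Write request. This is an illustrative sketch based on the common 3-DW header layout described in [2]; the names are ours and are not taken from the design discussed here.

// Illustrative first header DW (DW0) of a Memory Write TLP with a
// 3-DW header and one DW of payload (layout per [2]):
// [30:29] Fmt, [28:24] Type, [22:20] TC, [15] TD, [14] EP,
// [13:12] Attr, [9:0] Length in DWs. A Memory Read differs only
// in Fmt (2'b00: 3-DW header, no data).
localparam [1:0] FMT_3DW_DATA = 2'b10;    // header with data
localparam [4:0] TYPE_MEM     = 5'b00000; // memory request
localparam [9:0] LEN_1DW      = 10'd1;    // payload length: 1 DW

wire [31:0] mwr_hdr_dw0 =
    {1'b0, FMT_3DW_DATA, TYPE_MEM, // reserved, Fmt, Type
     1'b0, 3'b000, 4'b0000,        // reserved, TC 0, reserved
     1'b0, 1'b0,                   // TD (no digest), EP (not poisoned)
     2'b00, 2'b00,                 // Attr, reserved
     LEN_1DW};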
The TLPs are transferred to this layer for transmission across the link. This layer also receives the incoming TLPs from the link and sends them to the Data Link Layer. This layer appends the 8-bit Start and End framing characters to the packet before it is transmitted. The physical layer of the receiving device in turn strips out these characters after recognizing the start and the end of the received packet, and then forwards it to the Data Link Layer. In addition, the physical layer of the transmitter issues Physical Layer Packets (PLPs), which are terminated at the physical layer of the receiver. Such PLPs are used during the Link Training and Initialization process, in which the link is automatically configured and initialized for normal operation; no software is involved. During this process the following features are defined: link width, data rate of the link, polarity inversion, lane reversal, bit/symbol lock per lane, and lane-to-lane deskew (in the case of a multi-lane link) [2].

3 PCI Express Endpoint Design

3.1 Design Overview
In this paper, the x1 PCIe Endpoint is considered. As shown in Figure 1, the Endpoint is an intelligent device which acts as a target for downstream TLPs from the CPU through the Root Complex, and as an initiator of upstream TLPs to the CPU. This Endpoint generates or responds to Memory Write/Read transactions.

When the Endpoint acts as a receiver, the CPU issues a store register command to a memory mapped location in the Endpoint. This is done by having the Root Complex generate a Memory Write TLP with the required memory mapped address in the Endpoint, the payload size (one DW in this design), the byte enables and other Header contents. This TLP moves downstream through the PCIe fabric to the Endpoint; routing of the TLP in this case is based on the address within its Header. The transaction terminates when the Endpoint receives the TLP and writes the data to the targeted local register.

To read this data back, the CPU issues a load register command from the same memory mapped location in the Endpoint. This is done by having the Root Complex generate a Memory Read TLP with the same memory mapped address and other Header contents. This TLP moves downstream through the PCIe fabric to the Endpoint; again, routing is based on the address within the Header. Once the Endpoint receives this Memory Read TLP, it generates a Completion with Data TLP (CPLD). The Header of this CPLD TLP includes the ID number of the Root Complex, which is used to route the TLP upstream through the fabric to the Root Complex, which in turn updates the targeted CPU register and terminates the transaction.

The other way around, the Endpoint can act as a bus master and initiate a Memory Write TLP to write 1 DW to a location within the system memory. This TLP is routed upstream toward the Root Complex, which in turn writes the data to the targeted location in the system memory. If the Endpoint wants to read back the data it has written, it generates a Memory Read TLP with the same address. This is steered to the Root Complex, which in turn accesses the system memory, gets the required data and generates a Completion with Data TLP. This CPLD TLP is routed downstream to the Endpoint through the PCIe fabric. The Endpoint receives this TLP, updates its local register and terminates the transaction. Figure 3 shows the layered structure of the PCIe Endpoint device.
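As an illustration of the CPLD described above, the following Verilog fragment sketches the three header DWs the Endpoint would return for a 1-DW read. The field layout follows [2]; the signal names are illustrative and not taken from the original design.

// Illustrative CPLD header (Fmt = 2'b10, Type = 5'b01010, 3-DW header
// with data). DW1 carries the completer ID, status and byte count;
// DW2 echoes the requester ID and tag, which the fabric uses to route
// the completion upstream to the requester.
wire [15:0] completer_id;  // ID of this Endpoint (bus/device/function)
wire [15:0] requester_id;  // copied from the Memory Read TLP
wire [7:0]  req_tag;       // copied from the Memory Read TLP
wire [6:0]  lower_addr;    // low address bits of the request

wire [31:0] cpld_dw0 = {1'b0, 2'b10, 5'b01010, 1'b0, 3'b000,
                        4'b0000, 2'b00, 2'b00, 2'b00, 10'd1};
wire [31:0] cpld_dw1 = {completer_id,
                        3'b000,   // completion status: successful
                        1'b0,     // BCM
                        12'd4};   // byte count: 4 bytes (1 DW)
wire [31:0] cpld_dw2 = {requester_id, req_tag, 1'b0, lower_addr};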
There are two different solutions for the physical layer (PHY). In the first solution, this layer is integrated with the other layers in the same chip. Doing so increases the complexity of the chip but provides a higher level of integration. This integrated solution has one key advantage when designing with an FPGA: it uses a smaller number of IO pins, which enables easier timing closure. An example of this integrated solution is offered by Xilinx in their newly introduced Virtex-5 PCIe Endpoint block [5].
[Figure 3. Endpoint Design: the Microblaze with its ILMB/DLMB controllers and BRAM, the 32-bit OPB, the OPB IPIF and USER LOGIC bridge, the PCIe core protocol layers (62.5 MHz transaction clock) and the external Philips PHY (50 MHz reference clock), which connects to the PCI Express fabric.]

[Figure (caption not recovered): the Spartan-3 PCI Express Starter Kit — the Xilinx® Spartan-3 FPGA connects over the 8-bit, 250 MHz PXPIPE interface to the Philips PHY, which drives the 2.5 GHz serial PCI Express link.]

In the second solution, only the physical layer exists in one chip, and the other layers are designed in another chip. In this two-chip solution, a smaller FPGA with an external PHY can be used; the external PHY used here supports x1 PCIe designs. Since the practical bandwidth provided by x1 PCIe is 2.0 Gbps, an internal interface of 8 bits running at 250 MHz, or an interface of 16 bits running at 125 MHz, is required. This solution has the disadvantage of needing a larger number of IO pins, which makes timing closure harder.
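As a quick check of these numbers (assuming the usual 8b/10b encoding on the 2.5 Gbps serial lane):

2.5 Gbps x 8/10 = 2.0 Gbps usable bandwidth
2.0 Gbps / 8 bits = 250 MHz
2.0 Gbps / 16 bits = 125 MHz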
3.2 Protocol and Application Layers
The protocol layers, comprising the logical sub-layer of the physical layer, the data link layer and the transaction layer, are implemented using the Xilinx PCI Express PIPE (Physical Interface for PCI Express) Endpoint 1-Lane IP core [7].

A Microblaze based system was built to implement the application layer of the designed PCIe Endpoint. In this system, the PCIe core is attached as a slave to the processor, which in turn accesses the configuration space of this core, reading from and writing to this space. In the application layer, the Microblaze is responsible for sending the required Header and data payload to the transaction layer. When a TLP is received by the PCIe Endpoint, the Header and the payload, if one exists, are forwarded to the Microblaze for further processing.

The Microblaze has different bus interfaces connecting it with different peripherals. For example, the Local Memory Bus (LMB) allows communication between the processor and the Block Random Access Memory (BRAM), which is loaded with the application program to be executed by the Microblaze. This program is written in C, using a library provided by Xilinx, and is compiled into the Executable and Linkable Format (ELF). The Microblaze has a Harvard architecture, in which the BRAM consists of two sections, data and instructions. These sections are accessed by the processor through memory controllers over the local memory bus.

The Xilinx On-Chip Peripheral Bus (OPB), which implements the IBM CoreConnect On-Chip Peripheral Bus, has two separate 32-bit paths for data and address [8]. This bus is used to connect peripherals to the Microblaze, which masters the bus.
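For illustration, a minimal OPB-style slave register might look as follows. This is a simplified sketch: the Sl_/OPB_ names follow the CoreConnect naming convention, but a compliant OPB slave (and the OPB IPIF actually used in this design) involves further handshake, retry and timeout signals.

module opb_reg_sketch (
  input             OPB_Clk,
  input             OPB_Rst,
  input      [31:0] OPB_ABus,   // 32-bit address path
  input      [31:0] OPB_DBus,   // 32-bit write data path
  input             OPB_select, // master drives a valid transfer
  input             OPB_RNW,    // 1 = read, 0 = write
  output reg [31:0] Sl_DBus,    // read data back to the master
  output reg        Sl_xferAck  // single-cycle transfer acknowledge
);
  localparam REG_ADDR = 32'h4000_0000; // illustrative mapped address
  reg [31:0] data_reg;

  always @(posedge OPB_Clk) begin
    Sl_xferAck <= 1'b0;
    Sl_DBus    <= 32'h0; // drive zero when idle (OPB read data is OR-ed)
    if (OPB_Rst) begin
      data_reg <= 32'h0;
    end else if (OPB_select && (OPB_ABus == REG_ADDR) && !Sl_xferAck) begin
      if (OPB_RNW)
        Sl_DBus  <= data_reg; // read: return the register contents
      else
        data_reg <= OPB_DBus; // write: update the register
      Sl_xferAck <= 1'b1;     // acknowledge completion of the transfer
    end
  end
endmodule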
The PCIe core cannot be directly connected to the OPB as a slave, because its interfaces are incompatible with the OPB protocol. To solve this compatibility issue, a bridge was developed between the OPB and the PCIe core. This bridge interfaces the OPB, with its standard protocol, through the OPB Intellectual Property Interface (OPB IPIF) on one side, and the PCIe core through the USER LOGIC model on the other side. The internal structure of this USER LOGIC model is shown in Figure 5. This model implements the logic needed to transmit/receive TLPs across the PCIe link and to access the configuration space of the PCIe core. The PCIe core transaction interfaces are synchronized with a 62.5 MHz clock generated by the core, as indicated in Figure 5.

[Figure 5. Internal structure of the USER LOGIC model: register read logic, the transmission state machine, and the transmit and receive transaction paths of the PCIe transaction interface.]

In the simulation environment (configured for the x1 PCIe design), the PCIe downstream port model, the Philips PHY and the Design Under Test (DUT) are instantiated.

[Figure (caption not recovered): simulation environment — the Boardx01 test module instantiates the DUT, the Philips PHY model PX1011A (attached over PXPIPE and the serial link) and the PCIe downstream port model; an application program (.elf) and a test program (.v) drive the simulation, which produces output logs (.txt).]

The PX1011A behavioural model is a packaged model, which can be simulated in ModelSim or other standard Hardware Description Language (HDL) simulators. The IP Model Packager from Cadence was used to generate this model. This model can be integrated in any standard HDL simulation environment.
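The structure of such a simulation top can be sketched in Verilog as follows. All module and port names here are hypothetical placeholders standing in for the actual Boardx01 test module, downstream port model and PX1011A behavioural model, whose real interfaces are defined by the respective packages; this sketch only compiles alongside those models.

// Hypothetical simulation top, mirroring the described environment:
// downstream port model <-> serial link <-> PHY model <-> PXPIPE <-> DUT.
module boardx01_sketch;
  wire ep_txp, ep_txn, ep_rxp, ep_rxn; // 2.5 Gbps serial lane
  wire [7:0] pipe_txd, pipe_rxd;       // 8-bit, 250 MHz PXPIPE data
  wire       pipe_clk;

  // Root Complex side: sources configuration and Memory TLPs
  downstream_port_model rc (
    .txp(ep_rxp), .txn(ep_rxn),  // its TX feeds the Endpoint's RX
    .rxp(ep_txp), .rxn(ep_txn)
  );

  // External PHY (the packaged PX1011A model in the real bench)
  phy_model phy (
    .rxp(ep_rxp), .rxn(ep_rxn), .txp(ep_txp), .txn(ep_txn),
    .pclk(pipe_clk), .txdata(pipe_txd), .rxdata(pipe_rxd)
  );

  // Design Under Test: PIPE Endpoint core, bridge and Microblaze system
  endpoint_dut dut (
    .pclk(pipe_clk), .txdata(pipe_txd), .rxdata(pipe_rxd)
  );
endmodule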
The transmission state machine drives the PCIe Transmit Transaction interfaces as shown in Figure 7. First, after receiving an active high control signal (compl_gen) from the processor requesting the generation of a CPLD, the machine activates, at the next positive edge of the transaction clock (trn_clk), the active low start of frame signal (trn_tsof_n) and the active low transmit source ready signal (trn_tsrc_rdy_n) to indicate the availability of valid data from the user logic application, and then presents the first DW of the TLP on the 32-bit transaction data signal (trn_td). Note that, when transmitting, the Endpoint is enabled as a master through the signal master_enable. Second, at the next clock cycle, the state machine deactivates trn_tsof_n and presents the rest of the TLP's DWs on trn_td; the PCIe core keeps trn_tdst_rdy_n activated throughout. Third, the state machine activates trn_tsrc_rdy_n and the active low end of frame signal (trn_teof_n) together with the last DW of data. It also activates the signal cpld_transmitted to indicate that a CPLD TLP has been transmitted. Finally, at the next clock cycle, the state machine deactivates trn_tsrc_rdy_n to indicate the end of valid data transfer on trn_td.

[Figure 7. PCIe Endpoint's Completion with Data (CPLD) Transaction Layer Packet (TLP) — waveform not recovered.]
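The four steps can be condensed into a Verilog sketch of such a state machine. This is a minimal illustration of the handshake as described; it assumes trn_tdst_rdy_n remains asserted by the core and omits how the Header and payload DWs are sourced, so it is a sketch of the idea rather than the design's actual implementation.

module cpld_tx_fsm_sketch (
  input             trn_clk,
  input             trn_reset_n,
  input             compl_gen,      // request from the processor
  input      [31:0] tlp_dw,         // current Header/payload DW to send
  input             last_dw,        // high while the final DW is presented
  output reg        trn_tsof_n,     // start of frame, active low
  output reg        trn_teof_n,     // end of frame, active low
  output reg        trn_tsrc_rdy_n, // source ready, active low
  output reg [31:0] trn_td,         // 32-bit transmit data
  output reg        cpld_transmitted
);
  localparam IDLE = 2'd0, SOF = 2'd1, BODY = 2'd2, DONE = 2'd3;
  reg [1:0] state;

  always @(posedge trn_clk) begin
    if (!trn_reset_n) begin
      state <= IDLE;
      trn_tsof_n <= 1'b1; trn_teof_n <= 1'b1; trn_tsrc_rdy_n <= 1'b1;
      cpld_transmitted <= 1'b0;
    end else begin
      // defaults: all active low strobes deasserted
      trn_tsof_n <= 1'b1; trn_teof_n <= 1'b1; trn_tsrc_rdy_n <= 1'b1;
      cpld_transmitted <= 1'b0;
      case (state)
        IDLE: if (compl_gen) state <= SOF;
        SOF: begin            // step 1: SOF + SRC_RDY with the first DW
          trn_tsof_n     <= 1'b0;
          trn_tsrc_rdy_n <= 1'b0;
          trn_td         <= tlp_dw;
          state          <= BODY;
        end
        BODY: begin           // step 2: SOF deasserted, next DWs follow
          trn_tsrc_rdy_n <= 1'b0;
          trn_td         <= tlp_dw;
          if (last_dw) begin  // step 3: EOF with the last DW
            trn_teof_n       <= 1'b0;
            cpld_transmitted <= 1'b1;
            state            <= DONE;
          end
        end
        DONE: state <= IDLE;  // step 4: SRC_RDY deasserted, TLP complete
      endcase
    end
  end
endmodule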
5 Conclusion
Within this paper, the various capabilities of the PCIe bus protocol were demonstrated. In a platform based on a PCIe topology, an Endpoint device was designed. This Endpoint embeds the Microblaze soft core from Xilinx, which is bridged to the PCIe protocol layers implemented by the PCIe core, to serve the data communication between this intelligent Endpoint and the CPU/system memory through the Root Complex.

A basic and simplified OPB to PCIe bridge was developed to connect the Microblaze and the PCIe protocol layers. The PCIe core was generated, configured and customized using the Xilinx CORE Generator. A packaged simulation model, provided by NXP Semiconductors, was used to simulate the functionality of the PCIe physical layer. This model interfaces with the simulation tool through the Verilog HDL Programming Language Interface (PLI).

In a modified version of a PCIe testbench (provided by Xilinx), and with the help of the simulation tool ModelSim, the functionality of the designed Endpoint was simulated and verified. In addition to that, the designed Endpoint system was prepared for implementation in the Xilinx Spartan-3 FPGA located on the Xilinx PCIe Spartan-3 Starter Kit.

It can also be concluded that working with PCIe requires knowledge of the PCIe protocol, because most of the available PCIe IP cores do not provide compatible interfaces that would allow them to be connected directly to the processor in question. Therefore, in most cases, an effort must be made to develop a bridge that allows an easy connection of the PCIe peripheral to the processor. Furthermore, the functionality of this designed Endpoint can go beyond this simple data transfer task; one can further extend its capabilities by reconfiguring the PCIe core to include IO mapped space.

Acknowledgment
The results in this paper are from work conducted as a Master's thesis in the field of Microelectronics, in cooperation with the Institute of Computer Technology at the Technical University of Vienna and the CES Design Services business unit of Siemens IT Solutions and Services PSE, Austria. I am very grateful to those who supported and helped me during this work.

References
[1] Don Anderson and Tom Shanley, "PCI System Architecture", MindShare Inc., 1999.
[2] Don Anderson, Ravi Budruk, and Tom Shanley, "PCI Express System Architecture", MindShare Inc., 2004.
[3] Ajay V. Bhatt, "Creating a PCI Express™ Interconnect", Technology and Research Labs, Intel Corporation, 2002.
[4] "PCI Express™ Base Specification", Revision 1.1, March 28, 2005.
[5] "Virtex-5 Integrated Endpoint Block for PCI Express Designs", User Guide, UG197 (v1.1), March 20, 2007.
[6] Koninklijke Philips Electronics N.V., "NXP x1 PHY single-lane transceiver PX1011A (I)", September 2006.
[7] "LogiCORE™ PCI Express PIPE Endpoint 1-Lane v1.5", User Guide, UG167, September 21, 2006.
[8] "MicroBlaze Processor Reference Guide", User Guide, UG081 (v6.3), August 29, 2006.
[9] Xilinx, "LogiCORE™ PCI Express® Endpoint Block Plus v1.2", User Guide, UG341, February 15, 2007.