HTAX Specification v014
University of Heidelberg
ver. 0.14
Heiner Litz
ver 0.13:
ver 0.14:
HT3.0-Core: The HyperTransport 3.0 core developed by the Computer Architecture Group
(CAG).
HT fabric: This term defines the HyperTransport device chain as described in the HyperTransport
3.0 specification. The HT fabric ends at the HT3.0-Core.
HTOC fabric: This term defines the fabric behind the HT3.0-Core. This is the area where the
user application connected to the core is located. The HTOC fabric defines a protocol which is
very similar to the original HyperTransport protocol.
HTOC protocol: This protocol is an on-chip protocol derived from the HyperTransport protocol.
It is kept very similar to allow for simple protocol conversion, and it is enhanced to meet the
needs of an on-chip protocol.
HTOC transaction: Derived from the HyperTransport specification, this term describes a complete
HyperTransport message in the HTOC fabric. It consists of a command part and a data part, each
of which is made up of packets.
HTOC packet: Derived from the HyperTransport specification, this term describes the physical
units which are combined to form an HTOC transaction. Two different types of packets exist,
namely command and data. HTOC packets are defined for two different sizes, 64 and 128 bit,
depending on the HTOC fabric implementation.
HTAX: Abbreviation for HyperTransport Advanced X-Bar. The HTAX defines a switching
structure which resides in the HTOC fabric and allows bidirectional communication between
different communication endpoints. The HTAX is a protocol-agnostic switch and by definition
able to switch arbitrary protocols. It is, however, optimized to support the HTOC protocol.
Functional Unit (FU): A communication endpoint connected to the HTAX. This term
includes the HT-Core.
The HTAX consists of bidirectional ports that are interconnected by a switch matrix, as shown
in figure 1. The ports provide a transmit (TX) and a receive (RX) interface and can be input
buffered or bufferless. The HTAX supports an arbitrary number of virtual channels for Quality of
Service (QoS) and deadlock avoidance. To handle congestion in high-radix switches, both virtual
output queueing (VOQ) and the RECN mechanism are supported.
[Figure 1: HTAX overview with the system interface (clk, reset) and the TX interfaces of the ports]
• PORTS:
The number of ports connected to the HTAX. Each port is bidirectional and is subdivided
into the TX interface and the RX interface. If a unit connected to the HTAX requires only
unidirectional communication, the signals of the unused interface have to be tied to zero. They
will be removed by synthesis.
• VC:
The number of virtual channels.
• WIDTH:
The width of the data bus in bits.
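As an illustration only, the three parameters can be collected in a small Python sketch that derives the widths of the per-port TX signals named in section 4.2. The class name HtaxParams and the method tx_signal_widths are hypothetical and not part of this specification. The example values (4 ports, 3 virtual channels, 64 bit) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HtaxParams:
    """Hypothetical container for the three HTAX parameters."""
    ports: int   # PORTS: number of bidirectional ports
    vc: int      # VC: number of virtual channels
    width: int   # WIDTH: data bus width in bits

    def __post_init__(self):
        # All three parameters only make sense as positive integers.
        if self.ports < 1 or self.vc < 1 or self.width < 1:
            raise ValueError("PORTS, VC and WIDTH must all be positive")

    def tx_signal_widths(self):
        """Bit widths of the TX signals as declared in section 4.2."""
        return {
            "tx_vc_req": self.vc,       # tx_vc_req[VC-1:0]
            "tx_vc_gnt": self.vc,       # tx_vc_gnt[VC-1:0]
            "tx_data": self.width,      # tx_data[WIDTH-1:0]
            "tx_sot": self.vc,          # tx_sot[VC-1:0]
            "tx_eot": 1,
            "tx_release_gnt": 1,
        }


# Example values are assumptions, not requirements of the HTAX.
params = HtaxParams(ports=4, vc=3, width=64)
print(params.tx_signal_widths())
```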
[Figure: HTAX system interface with the signals clk and res_n]
4.2 TX Interface
Data packets are injected into the HTAX through the TX interface. Figure 2 shows the TX
interface signaling.
[Figure 2: TX interface between FU-TX and HTAX with the signals tx_vc_req[VC-1:0],
tx_vc_gnt[VC-1:0], tx_data[W-1:0], tx_sot[VC-1:0], tx_eot and tx_release_gnt]
tx_vc_req[VC-1:0]
This output signal is used by the transmitter to specify the virtual channel used for the
current outport request. It is allowed to request multiple virtual channels simultaneously. The
signal width depends on the number of supported virtual channels. tx_vc_req has to be
asserted and deasserted simultaneously with the tx_outport_req signal. As long as an outport
is requested it is mandatory to request at least a single virtual channel. It is not allowed
to withdraw an asserted tx_vc_req. If multiple tx_vc_req bits are asserted and multiple
tx_vc_gnt bits are granted, the transmitter may decide which virtual channel to use for the next
transaction. In the clock cycle after tx_vc_gnt is asserted the tx_vc_req signal has to be
deasserted. If multiple tx_vc_req bits are granted, a transaction has to be started in the clock
cycle after tx_vc_req && tx_vc_gnt is active, and only the tx_vc_req bit for the virtual channel
that is started, as defined by the tx_sot signal, must be deasserted. The FU-TX may assert
tx_vc_req for the next packet concurrently with a data transfer to hide arbitration latency. The
tx_vc_req for the next packet has to be asserted simultaneously with the tx_release_gnt of the
previous packet.
tx_data[WIDTH-1:0]
This signal bus transports the payload data. The width of the data bus is represented by the
WIDTH parameter.
tx_sot[VC-1:0]
The tx_sot signal indicates the start of a transaction for a specific virtual channel. It is a
one-hot encoded signal. The corresponding request has to be asserted at the same time as or
before tx_sot is asserted. As soon as a grant for the requested outport has been received the
tx_sot signal has to be deasserted. tx_sot has to be asserted simultaneously with the first data
packet. It is recommended to assert tx_sot as soon as possible after tx_vc_gnt has been
received to avoid blocking of the requested outport. It is allowed to assert tx_sot and tx_data
for a single requested virtual channel prior to receiving tx_vc_gnt. In this case the tx_sot
and tx_data signals are registered by the HTAX during assertion of tx_vc_gnt. This minimum
latency feature (MLF) reduces the startup latency by a single clock cycle. If the MLF
is used, only a single virtual channel may be requested, that is, the concurrent
tx_vc_req signal must be one-hot encoded.
tx_eot
The end of transaction signal is asserted simultaneously with the last data packet of a
transaction and marks the end of the transaction. It is asserted for a single clock cycle.
tx_release_gnt
This signal triggers the HTAX outport to perform a new arbitration cycle. Asserting
tx_release_gnt during an active transfer makes it possible to overlap and hide the arbitration
latency, which is needed for streaming data back-to-back. It is asserted a single clock cycle
prior to tx_eot. tx_release_gnt may only be asserted if a grant has been received previously.
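The TX signal rules above can be summarized as a cycle-by-cycle trace. The following Python sketch is not part of the specification. It models a single transaction on one virtual channel without the MLF and makes several assumptions: the function name tx_trace is hypothetical, the grant is modelled as a single-cycle pulse after grant_delay cycles, and the payload has at least two data words so that tx_release_gnt can be placed one cycle before tx_eot.

```python
def tx_trace(vc, num_vc, payload, grant_delay=1):
    """Return one dict of TX signal values per clock cycle (illustrative only)."""
    if not 0 <= vc < num_vc:
        raise ValueError("vc must be in the range 0..num_vc-1")
    if len(payload) < 2:
        raise ValueError("this sketch assumes at least two data words")

    one_hot = 1 << vc
    trace = []

    # Request phase: tx_vc_req stays asserted until the grant arrives.
    # The grant itself is driven by the HTAX; here it is simply assumed
    # to pulse for one cycle after grant_delay cycles.
    for cycle in range(grant_delay + 1):
        trace.append({
            "tx_vc_req": one_hot,
            "tx_vc_gnt": one_hot if cycle == grant_delay else 0,
            "tx_sot": 0, "tx_data": None,
            "tx_release_gnt": 0, "tx_eot": 0,
        })

    # Data phase: tx_vc_req is deasserted in the cycle after the grant,
    # tx_sot accompanies the first word, tx_release_gnt is asserted one
    # cycle before tx_eot, and tx_eot accompanies the last word.
    last = len(payload) - 1
    for i, word in enumerate(payload):
        trace.append({
            "tx_vc_req": 0,
            "tx_vc_gnt": 0,
            "tx_sot": one_hot if i == 0 else 0,
            "tx_data": word,
            "tx_release_gnt": int(i == last - 1),
            "tx_eot": int(i == last),
        })
    return trace


for cycle, signals in enumerate(tx_trace(0, 3, [0xA, 0xB, 0xC], grant_delay=2)):
    print(cycle, signals)
```

A request for the next packet would be asserted in the same cycle as tx_release_gnt; this is left out to keep the sketch short.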
[Figures: TX interface timing diagrams for the signals tx_outport_req[3:0], tx_vc_req[2:0],
tx_vc_gnt[2:0], tx_sot[2:0], tx_eot and tx_release_gnt]
4.4 RX Interface
Data packets are retrieved from the switch through the RX interface. The outport forwards
packets to the receiving functional unit and contains an arbiter that hands out grants to the
inports. It is a virtual channel aware multi-request arbiter which can signal to the receiver
that multiple packets of different virtual channels are waiting for transfer. The receiving
functional unit needs to implement arbitration logic to determine which virtual channel to
grant. The outport’s RX interface is shown in figure 6. The receiver unit is allowed to grant
several virtual channels concurrently.
[Figure 6: RX interface with the signals rx_vc_req[VC-1:0], rx_vc_gnt[VC-1:0] and rx_eot]
rx_vc_gnt[VC-1:0]
This output signal acknowledges that transactions can be received over specific virtual
channels. Its width is equal to the number of virtual channels. Once it is asserted the unit is
required to accommodate the complete transaction. It is allowed to assert rx_vc_gnt without
rx_vc_req of that specific virtual channel being asserted. Once rx_vc_gnt is asserted
the unit is not allowed to deassert it until a transaction starts, that is, until rx_vc_req and
rx_vc_gnt have both been active for a single clock cycle. The functional unit is responsible for
determining which virtual channel the current transaction is being processed on. To grant a
transaction for a specific virtual channel, rx_vc_req and rx_vc_gnt for the same virtual channel
have to be asserted for a single clock cycle. If rx_vc_req and rx_vc_gnt for the same virtual
channel are asserted for multiple cycles this may be regarded as a single grant or as multiple
grants for transactions, depending on whether a transaction is currently running or not.
rx_data[WIDTH-1:0]
This signal bus carries the data of the transaction. Its width is set by the parameter WIDTH.
rx_sot[VC-1:0]
This one-hot encoded signal is asserted simultaneously with the first packet of a transaction
and indicates that the rx_data bus contains the first data packet. It identifies the virtual
channel which is used to transmit the transaction. It is asserted for a single clock cycle.
rx_eot
This signal is asserted simultaneously with the last packet of a transaction and indicates
that the transaction will be finished in the next clock cycle. It is asserted for a single clock
cycle. If rx_vc_gnt is asserted while rx_eot is high and rx_sot is low, the rx_vc_gnt is
regarded as a grant for another packet. If rx_vc_gnt is asserted while rx_eot is high and
rx_sot is high, it is not regarded as a grant for another packet.
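The two RX grant rules can be written down as small predicates. The Python sketch below is illustrative only and not part of the specification. The function names are hypothetical, the signals are modelled as plain integers (bit masks for the vectors), and "rx_sot is low" is assumed to mean that no bit of rx_sot is set.

```python
def granted_vcs(rx_vc_req, rx_vc_gnt):
    """Bit mask of virtual channels whose request and grant are both asserted."""
    return rx_vc_req & rx_vc_gnt


def gnt_counts_for_next_packet(rx_vc_gnt, rx_sot, rx_eot):
    """Interpretation of rx_vc_gnt in a cycle where rx_eot is high.

    Returns True if the grant is regarded as a grant for another packet,
    which is the case only when rx_sot is low in the same cycle.
    """
    if not rx_eot:
        raise ValueError("this rule only applies in a cycle where rx_eot is high")
    return rx_vc_gnt != 0 and rx_sot == 0


# One requested channel, all three channels granted: only VC 0 starts.
print(granted_vcs(0b001, 0b111))  # prints 1
```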
[Figures: RX interface timing diagrams for the signals rx_vc_req[2:0], rx_vc_gnt[2:0],
rx_sot[2:0] and rx_eot]
• Address interval routing. Non-coherent HT transactions like posted writes are routed to
their endpoint by interpreting the address embedded in the command.
• SrcTag lookup routing. Non-coherent response transactions are routed according to their
embedded srcTag or ext_srcTag, respectively.
• NodeID-unitID based routing. Coherent HT transactions are routed according to their
embedded nodeID and unitID.
Mode of Operation
This chapter provides an insight into the inport and outport internals. The switch itself is
simply a set of wires, without any functionality of its own, that connects the various inports
and outports. The only impact of the switch on the design is the number of wires which have
to be routed, which in turn affects the operating frequency. However, any additional pipeline
stages will have to be implemented in the inports or outports and not in the switch.
5.2 Inport
The purpose of the inport is to request the correct outport for a specific packet. To do so it
needs to interpret the command of a HyperTransport packet. HyperTransport implements two
possible ways to route packets. Posted and non-posted request commands include an address
field which specifies their target. These types of packets are handled by the address interpreter
logic. Response packets carry no address but only a srcTag field which contains a unique number
that matches the response to a certain non-posted request. Routing of responses is handled
by the tag management unit.
Address Interpreter
The address interpreter unit routes posted and non-posted packets. To do so it interprets the
part of the address which is used to determine the functional unit and the base address register
(BAR) field of the command. This information is used to index the Address Lookup Table
(ALT) which in return provides the target functional unit. The inport can then request the
corresponding outport for a specific virtual channel (posted or non-posted in this case). A
non-posted request has to obtain a srcTag before an outport is requested. In this case the inport
requests a tag from the SrcTag Management Unit before requesting the outport.
The TAQ keeps track of available and handed-out tags. It is implemented as a 5-bit-wide FIFO
that holds all available srcTags. Every new non-posted request is assigned the first item from
the queue, reducing the number of available tags by one. If the TAQ is empty, all tags have been
handed out and no more non-posted requests can be processed. Each time a response returns, the
freed tag is inserted into the FIFO, incrementing the number of available tags by one.
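The behaviour of the TAQ can be sketched in a few lines of Python. This is a behavioural illustration, not the hardware implementation. The class name TagQueue and its methods are hypothetical, and it is assumed that a 5 bit wide tag yields 32 tags (0 to 31) which all sit in the FIFO after reset.

```python
from collections import deque


class TagQueue:
    """Illustrative model of a FIFO holding the free srcTags."""

    TAG_BITS = 5  # tag width taken from the text above

    def __init__(self):
        # Assumption: after reset every tag 0..31 is available.
        self._free = deque(range(1 << self.TAG_BITS))

    def available(self):
        """Number of tags that can still be handed out."""
        return len(self._free)

    def allocate(self):
        """Hand out the first free tag, or None if all tags are in use."""
        return self._free.popleft() if self._free else None

    def release(self, tag):
        """Return the tag of a completed response to the FIFO."""
        if not 0 <= tag < (1 << self.TAG_BITS):
            raise ValueError("tag out of range")
        if tag in self._free:
            raise ValueError("tag is not outstanding")
        self._free.append(tag)


taq = TagQueue()
first = taq.allocate()          # a non-posted request takes a tag
print(first, taq.available())   # prints: 0 31
taq.release(first)              # the matching response frees it again
```

When allocate returns None the inport has to stall the non-posted request until a response frees a tag.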
5.3 Outport
The outport’s purpose is to transfer data from the inports to the functional units. Each outport
is connected to several inports, but only one inport can be allowed to send data to a certain
outport at a time. This constraint is enforced by the arbiter. An arbitration cycle is performed
by issuing a request at the inport, which then has to wait until it is assigned a grant by the
arbiter residing in the outport. In detail the HTAX actually has to support two request-grant
mechanisms. Besides multiple inports, multiple virtual channels from a single or several inports
can also request an outport. The grant of a specific inport is carried out by the outport and is
totally transparent to the functional unit. The grant policy is first come, first served. If two
inports request the same outport at the same time, the inport with the higher priority is favored.
Granting a specific virtual channel, however, is carried out by the functional unit residing
behind the outport. This allows the functional unit to be in full control of the virtual channel
arbitration.
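The inport grant policy described above (first come, first served, with a priority tie-break) can be illustrated with the following Python sketch. It is not part of the specification. The function name select_inport is hypothetical, and the static priority list is an assumption, since this document does not state how inport priorities are assigned.

```python
def select_inport(pending, priority):
    """Pick the inport that receives the next grant of an outport.

    pending:  dict mapping inport number to the clock cycle in which its
              request was first asserted.
    priority: list of inport numbers, highest priority first (assumed
              to be static for this sketch).
    """
    if not pending:
        return None
    rank = {inport: position for position, inport in enumerate(priority)}
    # The oldest request wins; equal arrival times fall back to priority.
    return min(pending, key=lambda inport: (pending[inport], rank[inport]))


# Inports 1 and 2 requested in the same cycle, inport 0 arrived later.
print(select_inport({2: 5, 0: 7, 1: 5}, priority=[0, 1, 2]))  # prints 1
```

The selection of the virtual channel is deliberately missing from this sketch because it is left to the functional unit behind the outport.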
functional unit
The cyclic dependency between posted transactions and responses discussed above can be
broken by implementing a virtual channel aware functional unit interface. Arbitration
of an R(i, v) is carried out by the functional unit rather than by the outport. In this case the
FU can select whether it is able to process new jobs by granting the posted VC or to complete a
previous job by granting the response VC. The disadvantage of this approach is the
required number of i + v control signals at the completer interface.
• Bandwidth
Back-to-back streaming of data packets is mandatory. The arbitration latency needs to be hidden
by requesting outports while messages are still in flight.
• Low Latency
Switch latency should be kept to a minimum. The more complex the design is regarding
buffering, flow control, virtual channels, etc. the higher the switch latency will be.
• Efficiency
Switch efficiency should be maximized by avoiding congestion. The key goal is to
avoid head-of-line (HOL) blocking.
Note: The request of an outport performed by the routing interpreter is determined differently
depending on the type of transaction:
• Static: The source has static information to determine the target for the transaction. The
lookahead is therefore easy to implement. The content of the command (address, SrcTag) is
ignored.
• Address Interval based: Regular posted and non-posted transactions are routed based on
the address in their command packet. Each outport is assigned a specific address interval.
The routing interpreter has to look up the address in a table to determine the outport.
• SrcTag based: Response transactions are routed based on their SrcTag. In a HT fabric every
non-posted request is assigned a SrcTag. Corresponding responses carry the same SrcTag
and have to be forwarded to the initiator of the non-posted request. EXTENDED SRCTAG
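The address interval based case can be illustrated with a small lookup table. The Python sketch below is not part of the specification and covers only this one routing type. The class name AddressIntervalTable is hypothetical, the intervals are assumed to be non-overlapping with inclusive limits, and the example addresses are invented for the illustration.

```python
from bisect import bisect_right


class AddressIntervalTable:
    """Illustrative mapping from address intervals to outport numbers."""

    def __init__(self, intervals):
        # intervals: iterable of (base, limit, outport), limit inclusive.
        self._entries = sorted(intervals)
        for base, limit, _ in self._entries:
            if base > limit:
                raise ValueError("interval base must not exceed its limit")
        for (_, limit, _), (next_base, _, _) in zip(self._entries,
                                                    self._entries[1:]):
            if next_base <= limit:
                raise ValueError("address intervals must not overlap")
        self._bases = [base for base, _, _ in self._entries]

    def lookup(self, address):
        """Return the outport for the address, or None if no interval matches."""
        index = bisect_right(self._bases, address) - 1
        if index >= 0:
            _, limit, outport = self._entries[index]
            if address <= limit:
                return outport
        return None


# Example intervals are assumptions made for this sketch.
table = AddressIntervalTable([(0x1000, 0x1FFF, 1), (0x0000, 0x0FFF, 0)])
print(table.lookup(0x1234))  # prints 1
```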
Low Latency
To achieve better latency, store-and-forward handling of packets should be avoided at all
times. Virtual cut-through makes it possible to inject the transaction header into the crossbar
before the complete transaction has arrived. As we cannot guarantee a continuous data flow at
all times (HyperTransport may interleave transactions with other transactions), there may be
bubbles in the data stream. An interrupted transaction may therefore block other transactions.
Issue: Maybe we can (have to) define that transactions may not be interrupted. As transactions
always have to be CRC checked (on HT and IB side) we may be able to guarantee an uninterrupted
data flow. Should we allow speculative forwarding?
Efficiency
Switch efficiency is limited by HOL blocking. A common way to alleviate this problem is to
implement a large buffer in every inport of the crossbar. If one outport is blocked, other
transactions can bypass the congested transactions if they request another outport. A common
implementation is virtual output queuing (VOQ). This buffering technique can also be extended to
hide arbitration latency. The disadvantage of a buffered crossbar is the increased
complexity, which increases latency and the amount of additional resources needed.
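The effect of HOL blocking and its removal by VOQ can be shown with a toy model. The Python sketch below is illustrative only. Both function names are hypothetical, a packet is represented only by the number of its destination outport, and timing, arbitration and buffer limits are ignored.

```python
from collections import deque


def fifo_sendable(queue, blocked):
    """Single input FIFO: only the packet at the head may leave."""
    if queue and queue[0] not in blocked:
        return [queue[0]]
    return []


def voq_sendable(queue, blocked, num_outports):
    """Virtual output queues: one queue per outport inside the inport."""
    voqs = [deque() for _ in range(num_outports)]
    for destination in queue:
        voqs[destination].append(destination)
    # Every non-empty queue whose outport is not blocked can send.
    return [outport for outport, q in enumerate(voqs)
            if q and outport not in blocked]


arrivals = [0, 1, 2, 1]   # destination outport of each waiting packet
blocked = {0}             # outport 0 is congested

print(fifo_sendable(arrivals, blocked))    # prints []
print(voq_sendable(arrivals, blocked, 3))  # prints [1, 2]
```

With the single FIFO the packet for outport 0 at the head stalls everything behind it, while the per-outport queues let the packets for outports 1 and 2 proceed.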