
0015 Switching Notes

This document discusses circuit switching, a technique for message switching in multiprocessor networks. In circuit switching, a physical path is reserved from source to destination before data transmission by injecting a routing header flit. This header propagates through routers, reserving links along the way until it reaches the destination and an acknowledgment is sent back. The message contents can then be transmitted at full bandwidth along the reserved circuit. The base latency of a circuit-switched message includes the time to set up the path by propagating the header and acknowledgment, plus the time required to transmit the data along the reserved circuit.


CHAPTER 2 Message Switching Layer

Figure 2.4 An example of synchronous physical channel flow control. (Waveforms shown: Clock, Data.)

each data item, or the channel may utilize block acknowledgments, that is, each
acknowledgment signal indicates the availability of buffer space for some fixed
number of data items. Such an approach reduces both the acknowledgment traffic
and the signaling rate of acknowledgments. It also enables other optimizations
for high-speed channel operation that are discussed in Chapter 7.
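The block-acknowledgment scheme above behaves like a credit counter at the sender. The following is a minimal sketch of that sender-side bookkeeping; the class and parameter names are illustrative and not drawn from any particular router design.

```python
class CreditSender:
    """Sender-side view of block-acknowledgment flow control.

    Instead of one acknowledgment per data item, each acknowledgment
    returns block_size credits, i.e., it certifies that buffer space
    for block_size items has been freed at the receiver.
    """

    def __init__(self, receiver_buffer, block_size):
        self.credits = receiver_buffer  # free receiver slots known to the sender
        self.block_size = block_size

    def can_send(self):
        return self.credits > 0

    def send(self):
        if not self.can_send():
            raise RuntimeError("no credits: would overrun receiver buffer")
        self.credits -= 1  # one data item consumes one buffer slot

    def receive_block_ack(self):
        # One acknowledgment frees a whole block of buffer slots.
        self.credits += self.block_size


# With a 16-slot buffer and a block size of 4, only one acknowledgment
# is needed for every 4 data items, cutting acknowledgment traffic 4x.
s = CreditSender(receiver_buffer=16, block_size=4)
for _ in range(16):
    s.send()
assert not s.can_send()
s.receive_block_ack()
assert s.credits == 4
```

The sender never transmits without a credit, so receiver buffers cannot overflow even though acknowledgments arrive at a reduced signaling rate.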

While interrouter transfers are necessarily constructed in terms of phits, the switching
technique deals with flits (which could be defined to be the complete message packet!).
The switching techniques set the internal switch to connect input buffers to output buffers
and forward flits along this path. These techniques are distinguished by the time at which
they occur relative to the message flow control operation and the routing operation. For
example, switching may take place after a flit has been received in its entirety. Alternatively,
the transfer of a flit through the switch may begin as soon as the routing operation has
been completed, but before the remainder of the flit has been received from the preceding
router. In this case switching is overlapped with message-level flow control. In at least
one proposed switching technique, switching begins after the first phit is received and
even before the routing operation is complete! In general, high-performance switching
techniques seek to overlap switching and message flow control as much as possible.
While such an approach provides low-latency communication, it does complicate link-
level diagnosis and error recovery.
This chapter describes the prevalent switching techniques that have been developed
to date for use in current-generation multiprocessors. Switching layers can share the same
physical channel flow control mechanism, but differ in the choice of message flow control.
Unless otherwise stated, flow control will refer to message flow control.

2.3 Basic Switching Techniques


For the purposes of comparison, for each switching technique we will consider the
computation of the base latency of an L-bit message in the absence of any traffic. The
phit size and flit size are assumed to be equivalent and equal to the physical data channel
width of W bits. The routing header is assumed to be 1 flit; thus the message size is
L + W bits. A router can make a routing decision in tr seconds. The physical channel
between two routers operates at B Hz; that is, the physical channel bandwidth is BW bits
per second. In this chapter, we assume that channel wires are short enough to complete a transmission in one clock cycle. Therefore, the propagation delay across this channel is denoted by tw = 1/B. This assumption will be relaxed in Section 7.1.5. Once a path has been set up through the router, the intrarouter delay or switching delay is denoted by ts. The router internal data paths are assumed to be matched to the channel width of W bits. Thus, in ts seconds a W-bit flit can be transferred from the input of the router to the output. The source and destination processors are assumed to be D links apart. The relationship between these components as they are used to compute the no-load message latency is shown in Figure 2.5.

Figure 2.5 View of the network path for computing the no-load latency. (R = router; the source and destination processors are D links apart, with per-hop delays tw, tr, and ts.)

2.3.1 Circuit Switching

In circuit switching, a physical path from the source to the destination is reserved prior to
the transmission of the data. This is realized by injecting the routing header flit into the
network. This routing probe contains the destination address and some additional control
information. The routing probe progresses toward the destination reserving physical links
as it is transmitted through intermediate routers. When the probe reaches the destination,
a complete path has been set up and an acknowledgment is transmitted back to the source.
The message contents may now be transmitted at the full bandwidth of the hardware path.
The circuit may be released by the destination or by the last few bits of the message. In the
Intel iPSC/2 routers [258], the acknowledgments are multiplexed in the reverse direction
on the same physical line as the message. Alternatively, implementations may provide
separate signal lines to transmit acknowledgment signals. A time-space diagram of the
transmission of a message over three links is shown in Figure 2.6. The header probe is
forwarded across three links, followed by the return of the acknowledgment. The shaded
boxes represent the times during which a link is busy. The space between these boxes
represents the time to process the routing header plus the intrarouter propagation delays.
The clear box represents the duration that the links are busy transmitting data through the circuit. Note that the routing and intrarouter delays at the source router are not included and would precede the box corresponding to the first busy link.

Figure 2.6 Time-space diagram of a circuit-switched message. (The header probe and the acknowledgment occupy each link in turn, separated by tr + ts at each router; the setup interval tsetup is followed by tdata, during which all links are busy transmitting data.)

Figure 2.7 An example of the format of a circuit probe. (CHN = channel number; DEST = destination address; XXX = not defined.)

    Word 0:  [31:20] CHN ... CHN | [19:17] XXX | [16] 1 | [15:12] 0000 | [11:1] DEST | [0] 0
    Word 1:  [31:0] CHN ... CHN
An example of a routing probe used in the JPL Mark III binary hypercube is shown
in Figure 2.7. The network of the Mark III was quite flexible, supporting several distinct
switching mechanisms in configurations up to 2,048 nodes. Bits 0 and 16 of the header
define the switching technique being employed. The values shown in Figure 2.7 are for
circuit switching. Bits 17–19 are unused, while the destination address is provided in bits
1–11. The remaining 4-bit fields are used to address 1 of 11 output links at each individual
router. There are 11 such fields supporting an 11-dimensional hypercube and requiring a
two-word, 64-bit header. The path is computed at the source node. An alternative could
have been to compute the value of the output port at each node rather than storing the
addresses of all intermediate ports in the header. This would significantly reduce the
size of the routing header probe. However, this scheme would require routing time and
buffering logic within the router. In contrast, the format shown in Figure 2.7 enables a fast
lookup using the header and simple processing within the router.
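Such a source-routed probe can be packed and unpacked with a few shifts and masks. The sketch below follows the field positions of Figure 2.7 as described in the text (bit 0 = 0 and bit 16 = 1 select circuit switching, bits 1–11 hold the destination, and eleven 4-bit CHN fields fill bits 20–31 of word 0 plus all of word 1); the ordering of CHN fields within each word and all function names are illustrative assumptions.

```python
def pack_probe(dest, channels):
    """Pack a circuit probe per Figure 2.7 into two 32-bit words.

    dest:     destination address, placed in bits 1-11 of word 0.
    channels: eleven 4-bit output-port selections, one per hop of an
              11-dimensional hypercube; the first 3 occupy bits 20-31
              of word 0, the remaining 8 fill word 1 (assumed order).
    """
    assert 0 <= dest < 2**11 and len(channels) == 11
    word0 = 0                      # bit 0 = 0: circuit switching
    word0 |= dest << 1             # bits 1-11: DEST
    word0 |= 1 << 16               # bit 16 = 1: circuit switching
    for i, chn in enumerate(channels[:3]):
        word0 |= (chn & 0xF) << (20 + 4 * i)
    word1 = 0
    for i, chn in enumerate(channels[3:]):
        word1 |= (chn & 0xF) << (4 * i)
    return word0, word1


def unpack_dest(word0):
    """Fast lookup: the destination is recovered with one shift and mask."""
    return (word0 >> 1) & 0x7FF


w0, w1 = pack_probe(dest=0x2A5, channels=list(range(11)))
assert unpack_dest(w0) == 0x2A5
assert w0 & 1 == 0 and (w0 >> 16) & 1 == 1
```

Because each router only strips off its own 4-bit CHN field, the per-hop processing reduces to a mask and shift, which is what makes the precomputed-path format fast despite its two-word size.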
Circuit switching is generally advantageous when messages are infrequent and long;
that is, the message transmission time is long compared to the path setup time. The
disadvantage is that the physical path is reserved for the duration of the message and
may block other messages. For example, consider the case where the probe is blocked
waiting for a physical link to become free. All of the links reserved by the probe up to
that point remain reserved, cannot be used by other circuits, and may be blocking other
circuits, preventing them from being set up. Thus, if the size of the message is not that
much greater than the size of the probe, it would be advantageous to transmit the message
along with the header and buffer the message within the routers while waiting for a free
link. This alternative technique is referred to as packet switching, and will be studied in
Section 2.3.2.
The base latency of a circuit-switched message is determined by the time to set up a
path and the subsequent time the path is busy transmitting data. The router operation differs
a bit from that shown in Figure 2.1. While the routing probe is buffered at each router,
data bits are not. There are no intervening data buffers in the circuit, which operates
effectively as a single wire from source to destination. This physical circuit may use
asynchronous or synchronous flow control, as shown in Figures 2.3 or 2.4. In this case
the time for the transfer of each flit from source to destination is determined by the
clock speed of the synchronous circuit or signaling speed of the asynchronous handshake
lines. The signaling period or clock period must be greater than the propagation delay
through this circuit. This places a practical limit on the speed of circuit switching as a
function of system size. More recent techniques have begun to investigate the use of this
delay as a form of storage. At very high signal speeds, multiple bits may be present on
a wire concurrently, proceeding as waves of data. Such techniques have been referred to
as wave pipelining [111]. Using such techniques, the technological limits of router and
network designs have been reexamined [101, 311], and it has been found that substantial
improvements in wire bandwidth are possible. The challenges to widespread use remain
the design of circuits that can employ wave pipelining with stable and predictable delays,
while in large designs the signal skew remains particularly challenging.
Without wave pipelining, from Figure 2.6 we can write an expression for the base
latency of a message as follows:

    tcircuit = tsetup + tdata

    tsetup = D [tr + 2(ts + tw)]                    (2.1)

    tdata = (1/B) ⌈L/W⌉
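Equation (2.1) can be evaluated directly. The sketch below uses illustrative parameter values, not figures from any particular machine.

```python
from math import ceil

def circuit_latency(L, W, D, t_r, t_s, t_w, B):
    """No-load latency of an L-bit circuit-switched message (Eq. 2.1).

    The probe pays t_r + t_s + t_w per hop on the way out; the factor
    of 2 on (t_s + t_w) accounts for the returning acknowledgment.
    Data then streams over the reserved circuit at B Hz, one W-bit
    flit per cycle, with no intervening buffering.
    """
    t_setup = D * (t_r + 2 * (t_s + t_w))
    t_data = (1.0 / B) * ceil(L / W)
    return t_setup + t_data


# Illustrative values: B = 1 GHz (so t_w = 1 ns), t_r = 20 ns,
# t_s = 2 ns, W = 16 bits, D = 4 hops, L = 1,024-bit message.
t = circuit_latency(L=1024, W=16, D=4, t_r=20e-9, t_s=2e-9, t_w=1e-9, B=1e9)
assert abs(t - (4 * (20e-9 + 2 * 3e-9) + 64e-9)) < 1e-15
```

Note that with these values setup (104 ns) exceeds data transmission (64 ns); circuit switching only pays off once L is large relative to the probe round-trip.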
Actual latencies clearly depend on a myriad of implementation details. Figure 2.6 represents some simplifying assumptions about the time necessary for various events, such as processing an acknowledgment or initiating the transmission of the first data flit. In particular, it is assumed that, once the circuit has been established, the propagation delay through the entire circuit is negligible compared to the clock cycle. Hence, tdata does not depend on that delay. The factor of 2 in the setup cost represents the time for the forward progress of the header and the return of the acknowledgment. The use of B Hz as the channel speed represents the transmission across a hardwired path from source to destination.

Figure 2.8 Time-space diagram of a packet-switched message. (Each packet, of duration tpacket on a link, is received in its entirety at a router, incurring tr, before being forwarded over the next link.)

2.3.2 Packet Switching

In circuit switching, the complete message is transmitted after the circuit has been set up.
Alternatively, the message can be partitioned and transmitted as fixed-length packets, for
example, 128 bytes. The first few bytes of a packet contain routing and control information
and are referred to as the packet header. Each packet is individually routed from source
to destination. This technique is referred to as packet switching. A packet is completely
buffered at each intermediate node before it is forwarded to the next node. This is the reason
why this switching technique is also referred to as store-and-forward (SAF) switching.
The header information is extracted by the intermediate router and used to determine the
output link over which the packet is to be forwarded. A time-space diagram of the progress
of a packet across three links is shown in Figure 2.8. From the figure we can see that the
latency experienced by a packet is proportional to the distance between the source and
destination nodes. Note that the figure has omitted the packet latency, ts , through the router.
Packet switching is advantageous when messages are short and frequent. Unlike
circuit switching, where a segment of a reserved path may be idle for a significant period
of time, a communication link is fully utilized when there are data to be transmitted.
Many packets belonging to a message can be in the network simultaneously even if the
first packet has not yet arrived at the destination. However, splitting a message into packets
produces some overhead. In addition to the time required at source and destination nodes,
every packet must be routed at each intermediate node. An example of the format of a
data packet header is shown in Figure 2.9. This is the header format used in the JPL
Hyperswitch.

Figure 2.9 An example packet header format. (DEST = destination address; LEN = packet length in units of 192 bytes; XXX = not defined. Layout: [31:16] LEN/XXX | [15:12] 0001 | [11:1] DEST | [0] 0.)

Since the Hyperswitch can operate in one of many modes, bit field 12–15
and bit 0 collectively identify the switching technique being used: in this case it is packet
switching using a fixed-path routing algorithm. Bits 1–11 identify the destination address,
limiting the format to systems of 2,048 processors or less. The LEN field identifies the
packet size in units of 192 bytes. For the current implementation, packet size is limited to
384 bytes. If packets are routed adaptively through the network, packets from the same
message may arrive at the destination out of order. In this case the packet headers must
also contain sequencing information so that the messages can be reconstructed at the
destination.
In multidimensional, point-to-point networks it is evident that the storage require-
ments at the individual router nodes can become extensive if packets can become large
and multiple packets must be buffered at a node. In the JPL implementation, packets
are not stored in the router, but rather are stored in the memory of the local node, and
a special-purpose message coprocessor is used to process the message, that is, compute
the address of an output channel and forward the message. Other multicomputers using
packet switching also buffer packets in the memory of the local node (Cosmic Cube [314],
Intel iPSC/1 [163]). This implementation is no doubt a carryover from implementations in
local and wide area networks where packets are buffered in memory and special-purpose
coprocessors and network interfaces have been dedicated to processing messages. In mod-
ern multiprocessors, the overhead and impact on message latency render such message
processing impractical. To be viable, messages must be buffered and processed within the
routers. Storage requirements can be reduced by using central queues in the router that are
shared by all input channels rather than providing buffering at each input channel, output
channel, or both. In this case, internal and external flow control delays will typically take
many cycles.
The base latency of a packet-switched message can be computed as follows:

  
    tpacket = D [ tr + (ts + tw) ⌈(L + W)/W⌉ ]                    (2.2)

This expression follows the router model in Figure 2.1 and, as a result, includes factors to
represent the time for the transfer of a packet of length L + W bits across the channel (tw )
as well as from the input buffer of the router to the output buffer (ts ). However, in practice,
the router could be only input buffered, output buffered, or use central queues. The above
expression would be modified accordingly. The important point to note is that the latency
is directly proportional to the distance between the source and destination nodes.
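That proportionality is visible when Equation (2.2) is evaluated directly. A sketch with illustrative parameter values:

```python
from math import ceil

def packet_latency(L, W, D, t_r, t_s, t_w):
    """No-load latency of a packet-switched message (Eq. 2.2).

    Each of the D routers fully receives the (L + W)-bit packet
    (data plus a one-flit header) before forwarding it, so routing,
    switching, and wire traversal are paid in series on every link.
    """
    return D * (t_r + (t_s + t_w) * ceil((L + W) / W))


# Doubling the distance doubles the latency, unlike the pipelined
# techniques of Sections 2.3.3 and 2.3.4 (illustrative values).
t4 = packet_latency(L=1024, W=16, D=4, t_r=20e-9, t_s=2e-9, t_w=1e-9)
t8 = packet_latency(L=1024, W=16, D=8, t_r=20e-9, t_s=2e-9, t_w=1e-9)
assert abs(t8 - 2 * t4) < 1e-15
```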

2.3.3 Virtual Cut-Through (VCT) Switching

Packet switching is based on the assumption that a packet must be received in its entirety
before any routing decision can be made and the packet forwarded to the destination.
This is not generally true. Consider a 128-byte packet and the router model shown in
Figure 2.1. In the absence of 128-byte-wide physical channels, the transfer of the packet
across the physical channel will take multiple cycles. However, the first few bytes will
contain routing information that is typically available after the first few cycles. Rather
than waiting for the entire packet to be received, the packet header can be examined as
soon as it is received. The router can start forwarding the header and following data bytes
as soon as routing decisions have been made and the output buffer is free. In fact, the
message does not even have to be buffered at the output and can cut through to the input
of the next router before the complete packet has been received at the current router. This
switching technique is referred to as virtual cut-through switching (VCT). In the absence
of blocking, the latency experienced by the header at each node is the routing latency
and propagation delay through the router and along the physical channels. The message
is effectively pipelined through successive switches. If the header is blocked on a busy
output channel, the complete message is buffered at the node. Thus, at high network loads,
VCT switching behaves like packet switching.
Figure 2.10 illustrates a time-space diagram of a message transferred using VCT
switching where the message is blocked after the first link waiting for an output channel
to become free. In this case we see that the complete packet has to be transferred to the
first router where it remains blocked waiting for a free output port. However, from the
figure we can see that the message is successful in cutting through the second router and
across the third link.
The base latency of a message that successfully cuts through each intermediate router
can be computed as follows:

 
    tvct = D(tr + ts + tw) + max(ts, tw) ⌈L/W⌉                    (2.3)
Cut-through routing is assumed to occur at the flit level with the routing information
contained in 1 flit. This model assumes that there is no time penalty for cutting through
a router if the output buffer and output channel are free. Depending on the speed of
operation of the routers, this may not be realistic. Note that only the header experiences
routing delay, as well as the switching delay and wire delay at each router. This is because
the transmission is pipelined and the switch is buffered at the input and output. Once the
header flit reaches the destination, the cycle time of this message pipeline is determined
by the maximum of the switch delay and wire delay between routers. If the switch had
been buffered only at the input, then in one cycle of operation, a flit traverses the switch
and channel between the routers. In this case the coefficient of the second term and the pipeline cycle time would be (ts + tw). Note that the unit of message flow control is a packet. Therefore, even though the message may cut through the router, sufficient buffer space must be allocated for a complete packet in case the header is blocked.

Figure 2.10 Time-space diagram of a virtual cut-through switched message. (tblocking = waiting time for a free output link.)
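Equation (2.3) makes the pipelining explicit: only the header pays the per-hop cost, while the payload drains at the pipeline cycle time. A sketch with illustrative parameter values:

```python
from math import ceil

def vct_latency(L, W, D, t_r, t_s, t_w):
    """No-load latency of a virtual cut-through message (Eq. 2.3).

    The header flit pays t_r + t_s + t_w at each of the D hops; once
    it arrives, the remaining flits stream behind it at one flit per
    max(t_s, t_w), assuming input- and output-buffered switches.
    """
    return D * (t_r + t_s + t_w) + max(t_s, t_w) * ceil(L / W)


# Each extra hop adds only t_r + t_s + t_w (23 ns here), not a full
# packet transmission time as in store-and-forward switching.
t4 = vct_latency(L=1024, W=16, D=4, t_r=20e-9, t_s=2e-9, t_w=1e-9)
t5 = vct_latency(L=1024, W=16, D=5, t_r=20e-9, t_s=2e-9, t_w=1e-9)
assert abs((t5 - t4) - 23e-9) < 1e-15
```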

2.3.4 Wormhole Switching

The need to buffer complete packets within a router can make it difficult to construct small,
compact, and fast routers. In wormhole switching, message packets are also pipelined
through the network. However, the buffer requirements within the routers are substantially
reduced over the requirements for VCT switching. A message packet is broken up into
flits. The flit is the unit of message flow control, and input and output buffers at a router
are typically large enough to store a few flits. For example, the message buffers in the
Cray T3D are 1 flit deep, and each flit is comprised of eight 16-bit phits. The message is
pipelined through the network at the flit level and is typically too large to be completely
buffered within a router. Thus, at any instant in time a blocked message occupies buffers
in several routers. The time-space diagram of a wormhole-switched message is shown
in Figure 2.11. The clear rectangles illustrate the propagation of single flits across the
physical channel. The shaded rectangles illustrate the propagation of header flits across
the physical channel. Routing delays and intrarouter propagation of the header flits are
also captured in this figure. The primary difference between wormhole switching and VCT
switching is that, in the former, the unit of message flow control is a single flit and, as a
consequence, small buffers can be used. Just a few flits need to be buffered at a router.
In the absence of blocking, the message packet is pipelined through the network.
However, the blocking characteristics are very different from those of VCT.

Figure 2.11 Time-space diagram of a wormhole-switched message. (Header flits, shown shaded, incur tr + ts at each router; single data flits, shown clear, follow in the pipeline of total duration twormhole.)

Figure 2.12 An example of a blocked wormhole-switched message.

If the required output channel is busy, the message is blocked "in place." For example, Figure 2.12
illustrates a snapshot of a message being transmitted through routers R1 , R2 , and R3 . Input
and output buffers are 2 flits deep, and the routing header is 2 flits. At router R3 , message A
requires an output channel that is being used by message B. Therefore, message A blocks in
place. The small buffer sizes at each node (< message size) cause the message to occupy
buffers in multiple routers, similarly blocking other messages. In effect dependencies
between buffers span multiple routers. This property complicates the issue of deadlock
freedom. However, it is no longer necessary to use the local processor memory to buffer
messages, significantly reducing average message latency. The small buffer requirements
and message pipelining enable the construction of routers that are small, compact, and
fast.
Examples of the format of wormhole-switched packets in the Cray T3D are shown
in Figure 2.13. In this machine, a phit is 16 bits wide—the width of a T3D physical

channel—and a flit is comprised of 8 phits. A word is 64 bits and thus 4 phits.

Figure 2.13 Format of wormhole-switched packets in the Cray T3D. (A read request packet consists of 6 header phits, phits 0–5. A read response packet has 3 header phits, phits 0–2, followed by four 64-bit words occupying phits 3–7, 8–12, 13–17, and 18–22, respectively.)

A message
is comprised of header phits and possibly data phits. The header phits contain the routing
tag, destination node address, and control information. The routing tag identifies a fixed
path through the network. The control information is interpreted by the receiving node to
determine any local operations that may have to be performed (e.g., read and return a local
datum). Depending on the type of packet, additional header information may include the
source node address and memory address at the receiving node. For example, in the figure,
a read request packet is comprised of only header phits, while the read response packet
contains four 64-bit words. Each word has an additional phit that contains 14 check bits
for error correction and detection.
From the example in Figure 2.13 we note that routing information is associated only
with the header phits (flits) and not with the data flits. As a result, each incoming data flit of a message packet is simply forwarded along the same output channel as the preceding data flit. Consequently, the transmission of distinct messages cannot be interleaved or multiplexed
over a physical channel. The message must cross the channel in its entirety before the
channel can be used by another message. This is why messages A and B in Figure 2.12
cannot be multiplexed over the physical channel without some additional architectural
support.
The base latency of a wormhole-switched message can be computed as follows:

 
    twormhole = D(tr + ts + tw) + max(ts, tw) ⌈L/W⌉                    (2.4)

This expression assumes flit buffers at the router inputs and outputs. Note that in the
absence of contention, VCT and wormhole switching have the same latency. Once the
header flit arrives at the destination, the message pipeline cycle time is determined by the
maximum of the switch delay and wire delay. For an input-only or output-only buffered
switch, this cycle time would be given by the sum of the switch and wire delays.
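Equation (2.4) and the single-buffered variant just described can be sketched as follows; the parameter values in the check are illustrative.

```python
from math import ceil

def wormhole_latency(L, W, D, t_r, t_s, t_w):
    """No-load latency of a wormhole-switched message (Eq. 2.4).

    Identical to virtual cut-through in the absence of contention:
    the header pays t_r + t_s + t_w per hop, then the flit pipeline
    drains at one flit per max(t_s, t_w), assuming flit buffers at
    both the router inputs and outputs.
    """
    return D * (t_r + t_s + t_w) + max(t_s, t_w) * ceil(L / W)


def wormhole_latency_single_buffered(L, W, D, t_r, t_s, t_w):
    """Variant for an input-only (or output-only) buffered switch:
    the pipeline cycle is the sum t_s + t_w, not the maximum."""
    return D * (t_r + t_s + t_w) + (t_s + t_w) * ceil(L / W)


# Sanity check: the single-buffered pipeline can never be faster,
# since t_s + t_w >= max(t_s, t_w).
args = dict(L=1024, W=16, D=4, t_r=20e-9, t_s=2e-9, t_w=1e-9)
assert wormhole_latency_single_buffered(**args) >= wormhole_latency(**args)
```

The two techniques differ only under contention: VCT must still reserve a full packet buffer per node, whereas wormhole switching needs only a few flit buffers, at the cost of blocked messages spanning multiple routers.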
