A Generic Architecture For On-Chip Packet-Switched Interconnections
buffer space. The receiver notifies the sender of every datum consumed, with a dedicated feedback wire. This inexpensive mechanism provides latency-independence: the links can be pipelined transparently to achieve the desired clock speed in a submicron process.

[Figure: IP blocks (dataflow, cache, slave/memory) reach the packet-switched communication medium (a lossless, out-of-order datagram protocol) through network wrappers providing a stream service (dataflow protocol), an address-space service and traffic regulation.]

[Figure residue: 4 equivalent tree roots, 2 stages of routers; router datapath labels: child and parent paths, 4-word input buffers, routing (crossbar control logic), output path connectors. Table residue: 64, 568 Gbit/s; .18 µ, < 20 mm2.]
testability issues, as no off-the-shelf test method can handle a system comprising dozens of crossbars and hundreds of FIFOs. We use a graph property of the fat-tree shown in figure 8: the graph is Eulerian, so a common predefined connection scheme can be applied to all routers to create paths covering all links and buffers in the network. In test mode, these paths are daisy-chained and fed by a pattern generator. All the global interconnect is tested by this method. Stalls and bubbles are included in the stream to stimulate the FIFOs and test them without any scan path. In addition, an input-output loop-back test is applied to every switching matrix individually.

[Figure 6 residue: crossbar of 36-bit busses (2304 tristates), shared output buffers.]
Figure 6: Router datapaths synopsis
We used the symbolic layout standard cell library of the ALLIANCE system to synthesize the router. The regular parts (buffers and crossbar) were placed manually, using metal layers 4, 5 and 6 to interconnect the switch matrix. Figure 7 shows this placement. The area is 1×0.8 mm2 in a .25 µ process. The control logic is synthesized, placed and routed by automatic tools in the remaining empty space, using metal layers 2 and 3. From this geometry and from table 1, we can forecast the network costs for future systems, summarized in table 2.

Figure 8: Global test paths (test chaining taps)
The control logic can be pipelined to match virtually any speed the switching matrix may reach; the timing is therefore constrained by the delay of the wires in the matrix. Electrical simulation taking crosstalk capacitances into account suggests a cycle time below 5 ns for a .25 µ CMOS process. Table 2 assumes this delay will scale in future processes.

5. Performance evaluation

Figure 7: Router floorplan

A fast cycle-true, bit-true simulation model of the SPIN router was written in C for the CASS system-level simulation tool [10]. This model allowed us to test large networks (up to 256 terminals) under a range of benchmark loads. The most common benchmark in the networking literature is a uniform random distribution of packet destinations. The results of this test are a pessimistic estimate of the practical performance of a network, because the load does not exhibit any locality that the tree clustering could exploit. Since all paths of the network are equally loaded, it also reduces the advantages of router adaptivity. Despite these artifacts, we tested the network under a random load to compare it against past realizations, while still getting a worst-case performance prediction.

Figure 9 summarizes the key data for a 32-terminal network (16 routers) with 20-word packets spread uniformly. The simulations were run for one million clock cycles, enough to provide ±1% accurate population counts. The offered load is the average proportion of cycles at which data is injected into unbounded buffers connected to each input terminal. These buffers become saturated when the network cannot absorb the offered load. Figure 9 shows how the packet latency grows as the network is loaded, until it reaches saturation. Note that this latency includes the time spent in the injection buffers before entering the network; the time actually spent inside the network is always lower.

Figure 9: Load benchmarks [plot: average latency (cycles) versus offered load (fraction of maximum, 10% to 60%); the no-buffer, one-buffer and two-buffer curves saturate near 41%, 48% and 52% load respectively]

The curves show the performance impact of the two shared output buffers. These simulations allowed us to optimize the cost/performance of the realization presented above. The results are satisfactory by (discrete) networking standards, although somewhat lower than the state of the art. This is remarkable given the very small depth of the input buffers, and it demonstrates the relevance of our topology-specific buffering policy. Simulations with larger networks have shown little performance degradation. We conclude that a network clocked at 200 MHz would easily deliver 2 Gbit/s per terminal, and still scale nicely to aggregated throughputs up to 100 Gbit/s for 32 terminals.

Nevertheless, figure 10 shows an important drawback of the switched network: all latencies tend to occur, with an exponentially decreasing probability. This means that nearly all packets are delivered in a short time, but there are random hiccups, all the more frequent as the network is heavily loaded. These simulations are for a 64-terminal network (48 routers) under a uniform load, where most packets take the maximum number of routing hops (i.e. 5 hops).

Figure 10: Latency distributions [plot: probability (1% down to .01%, logarithmic) versus latency (25 to 200 cycles) at 30%, 49% and 51% load]

6. The communication protocols

This section presents results on the hardware implementation (i.e. the wrappers) of the protocols enabling dataflow or address-space communications through the packet-switched network. The goal is to emulate these models using the message-passing model. In addition, the two bandwidth-effective features of SPIN, adaptivity and output buffering, have the unpleasant property of reordering packets. Protocols must therefore also enforce strict in-order delivery for dataflow and for some address-space transactions.

Regarding stream communications, we devised a protocol based on our experience with sender-based protocols for adaptive networks [11]. The key is an unambiguous chronological tagging of every packet transmitted in a stream. Upon reception, packets are stored in the wrapper, which may then deliver them in order to the dataflow processor. Buffer storage is reserved for the missing packets until they arrive. If a packet is very late, the processor stalls, the entire buffer becomes busy, and other incoming packets have to be rejected into the network, which is used as a delay line. An end-to-end credit-based traffic regulation bounds the amount of outstanding stream data and thus prevents rejection from causing catastrophic network congestion. Both statistical modeling and cycle-true simulations show that this spurious traffic is negligible given the latency distribution of SPIN. The total bandwidth overhead introduced by the protocol is 31% of the stream throughput.

We synthesized a VHDL description of this hardware protocol stack, supporting four concurrent streams per terminal. The basic network access layer, necessary to plug in any service wrapper, test the network and check data integrity, represents about 0.15 mm2 in a .25 µ process. The stream protocol itself is only 0.1 mm2, plus a dual-port SRAM of 1 to 4 kbits (0.1 to 0.3 mm2 with our symbolic SRAM generator). In any configuration, the full network interface is smaller than 0.6 mm2 in this process, making it affordable for systems comprising up to a dozen stream processors, each with gigabit/s throughput.

Regarding address-space communications, we are defining a protocol for wrappers matching the "Virtual Component Interface" address-oriented standard [2]. The basic principle is to translate the VCI request/response packets into SPIN packets. However, for optimal performance the initiator must take advantage of the VCI split transactions: because the delay of round trips through the switched network is large, several transactions must be overlapped, as shown in figure 11, to use the full link bandwidth, e.g. for cache-to-memory traffic. Memory throughput can then be scaled by distributing the memory over several network terminals sustaining concurrent accesses, as in Distributed Shared Memory multicomputers.
Figure 11: The split transaction model [diagram: master and slave sides exchange Read, Write Data and Acknowledge messages across the communication medium along a time axis]

…with new caveats and new refinements. Although all silicon engineers have seamlessly used commodity networks (through telephones, NFS or the Internet), they remain unfamiliar and ill-at-ease with key network aspects such as true concurrency or statistical behavior prediction. This problem cannot be solved by improvements to the proposed architecture, although protocols and CAD tools can help hide its internal intricacies. The true solution is the education of the designer community in the broader perspectives opened by deep submicron systems [13].
Bibliography

[1] K. Keutzer, "Chip Level Assembly (and not Integration of Synthesis and Physical) is the Key to DSM Design", Proceedings of the ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (Tau'99), Monterey, CA, USA, March 1999.

[2] Virtual Socket Interface Alliance, "On-Chip Bus Attributes" and "Virtual Component Interface - Draft Specification, v. 2.0.4", http://www.vsia.com, September 1999 (document access may be limited to members only).

[3] C. Clos, "A Study of Nonblocking Switching Networks", Bell System Technical Journal, vol. 32, no. 2, pp. 406-424, 1953.

[4] J. Leijten et al., "Stream Communication between Real-Time Tasks in a High-Performance Multiprocessor", Proceedings of the 1998 DATE Conference, Paris, France, March 1998.

[5] D. C. Chen and J. M. Rabaey, "A Reconfigurable Multiprocessor IC for Rapid Prototyping of Algorithmic-Specific High-Speed DSP Data Paths", IEEE Journal of Solid-State Circuits, vol. 27, no. 12, pp. 1895-1904, December 1992.

[6] W. Dally, C. Seitz, "Deadlock-free Message Routing in Multiprocessor Interconnection Networks", IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547-553, May 1987.

[7] C. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing", IEEE Transactions on Computers, vol. C-34, no. 10, pp. 892-901, October 1985.

[8] M. Karol et al., "Input versus Output Queueing on a Space-Division Packet Switch", IEEE Transactions on Communications, pp. 1347-1356, December 1987.

[9] B. Zerrouk et al., "RCube: A Gigabit Serial Link Low Latency Adaptive Router", Records of the IEEE Hot Interconnects IV Symposium, Palo Alto, CA, USA, August 1996.

[10] F. Pétrot et al., "Cycle-Precise Core Based Hardware/Software System Simulation with Predictable Event Propagation", Proceedings of the 23rd Euromicro Conference, IEEE Computer Society Press, Budapest, Hungary, pp. 182-187, September 1997.

[11] F. Wajsbürt et al., "An Integrated PCI Component for IEEE 1355", Proceedings of the 1997 EMMSEC Conference and Exhibition, Florence, Italy, November 1997.

[12] J. Hennessy, D. Patterson, "Computer Architecture: A Quantitative Approach", 2nd Edition, Morgan Kaufmann Publishers, San Francisco, CA, USA, 1996.

[13] H. de Man, "Education for the Deep Submicron Age: Business As Usual?", Proceedings of the 34th Design Automation Conference, Anaheim, CA, USA, March 1997.