
A Generic Architecture for On-Chip Packet-Switched Interconnections

Pierre Guerrier, Alain Greiner


Université Pierre et Marie Curie
4, place Jussieu, F-75252 PARIS CEDEX 05
[email protected]

Abstract

This paper presents an architectural study of a scalable system-level interconnection template. We explain why the shared bus, which is today's dominant template, will not meet the performance requirements of tomorrow's systems. We present an alternative interconnection in the form of switching networks. This technology originates in parallel computing, but it is also well suited to heterogeneous communication between embedded processors, and it addresses many of the deep submicron integration issues. We discuss the necessity of providing high-level services on top of the bare network packet protocol, such as dataflow and address-space communication services, and the ways to do so. Finally, we present our first results on the cost/performance assessment of an integrated switching network.

1. Introduction

In 1999, the first 0.18 µ fabs moved into volume production. With such process technology, chip designers can create systems-on-a-chip (SoC) by incorporating several dozen IP blocks, each in the 50-100 kGate range, together with large amounts of embedded DRAM. These IPs can be CPU or DSP cores, video stream processors, and high-bandwidth I/O (such as IEEE 1394, Gigabit Ethernet, or DACs).

SoC designs pose a number of daunting methodology problems: how to specify the system? How to map the specification onto a collection of available IPs? How to evaluate design options? These are mostly CAD issues, and some commercial solutions are appearing. However, they build on the implicit assumption that SoCs will stick to the "central bus" architecture template, or to slightly enhanced multi-bus variants.

Yet virtually every IP in the portfolio evoked above has I/O requirements in the Gbit/s range: fast CPUs, fast network controllers, and multimedia processors. Figure 1 is a synopsis of the latency and throughput required by a few kinds of traffic. They may all be featured in a single piece of consumer equipment, for instance an advanced multimedia PDA. If we want to exploit task-level parallelism between processing IPs, with concurrent reconfigurable communications, then aggregated interconnection throughputs on the order of 50 Gbit/s are needed. This requirement will continue to grow as the number of IPs incorporated on chip increases, along with the individual performance of each IP (faster processors, better video quality, and so on).

[Figure 1: Requirements for some traffics. A throughput (bit/s) versus latency (µs) chart: uncompressed video sits near 1 Gbit/s, CPU cache to main memory traffic requires low latency, compressed video sits near 10 Mbit/s, and interrupt handling near 100 kbit/s, over latencies ranging from 0.1 to 100 µs.]

Bus-based architectures will not meet this requirement because a bus is inherently non-scalable. The bandwidth of a bus is shared by all attached devices, and it is simply not sufficient: firstly because the bus width cannot reasonably exceed a hundred bits, and secondly because the clocking frequency of global wiring becomes tightly constrained by the electrical properties of deep submicron processes [1]. New bus proposals are still being made, e.g. in the framework of [2], but in our opinion their only clear advantage is to standardize the IP interfaces. These architectures advocate multiple on-chip busses, requiring case-specific grouping of IPs and the design of transversal bridges, which does not make for a truly scalable and reusable interconnection.

This paper proposes another generic interconnection template that addresses the performance and scalability requirements of systems-on-chip, using integrated switching networks. We aim to prove that present manufacturing technology already enables the realization of a switching network, together with the wrappers to carry both address-space and dataflow communications.

The paper is organized as follows: Section 2 provides the reader with a practical understanding of switching concepts. Section 3 then explains how these concepts should be tuned for the on-chip environment. Section 4 is a cost analysis of a tentative realization of such a network. Section 5 analyzes its performance through simulation benchmarks. Section 6 shows how hardwired protocols can make the network compatible with the synchronous dataflow I/O semantics used in high-performance video processors, and with the address-space semantics of the VCI standard. Finally, Section 7 discusses the present shortcomings of the proposed architecture and possible solutions.
2. Switching network basics

We call a switching network a set of switching elements connected together by full-duplex point-to-point links. The reader is referred to [3] for a milestone presentation. The switching elements can operate in space (S), by establishing physical connections between their terminals; the typical S-switch is the crossbar. Other switches operate in time (T), using buffers to swap the order of timeslices on time-division multiplexed links. Historically, combinations of S and T switches have been used to build telephony networks.

Such networks implement a circuit-switching technique, where connections are established between two terminals by assigning them a set of time-slices on the network links. This set has to be determined by clever computations when the connection is requested, and it subsequently remains constant during the entire connection. The main advantage is the formal guarantee of bandwidth resulting from the static establishment of the circuit. The PROPHID architecture template [4] has demonstrated a remarkable application of circuit-switching to on-chip communication. PROPHID uses a T-S-T switch (figure 2) to cascade dataflow video processings. It is also, to our knowledge, the only example of a non-bus SoC interconnection. (We do not consider prototyping platforms such as [5], which featured a single S-switch connecting dataflow computing resources, because those systems were not intended for production and could only be reconfigured at start-up time.)

[Figure 2: The PROPHID interconnection. Stream input terminals feed a full switch matrix (a crossbar, or space-switch), whose outputs pass through two-port buffers (time-division switches) to the stream output terminals.]

However, the drawback of circuit-switching is its lack of reactivity to rapidly changing communications. For instance, the PROPHID interconnection cannot dynamically increase a bandwidth allocation to match bursts in an MPEG bitstream. It is even less suited to the random traffic between a CPU master and several slaves on a bus. For this reason, networks in parallel computers and LANs have used a different technique known as packet-switching.

A packet-switched network moves data from one of its terminals to another in small formatted chunks called packets. These consist of a header identifying the destination of the data, followed by a payload (the data itself), unambiguously terminated by a trailer. The switching elements of the network are called routers and operate in space. When it receives a packet, a router forwards it to one of its neighbors, chosen according to the header information. Packets repeatedly undergo this process until they reach their final destination. Since routing decisions are distributed over the routers, the network can remain very reactive even at very large sizes.

Despite the many routing hops, low latency can be maintained if routers forward the header of a packet as soon as possible, without waiting for the trailer (figure 3). This technique is called wormhole routing and has been used extensively in high-performance parallel computer networks [6]. In a low-cost wormhole switch, buffers are small and packets typically span a handful of routers. If a packet requests a link which is already busy, and this packet cannot be entirely buffered in the router, the macropipeline formed by the other links and routers spanned by the packet will be stalled. This may result in many links being inefficiently kept busy. Such cascaded contention practically prevents any packet-switched network from delivering its full theoretical bandwidth.

[Figure 3: A wormhole packet switch. A packet whose header is entering Router 3 still has its tail in Router 1, spanning Routers 1, 2 and 3 simultaneously.]
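As an illustration, the C fragment below sketches one forwarding step of a wormhole router. The flit and port types, the 4-word FIFO depth and the route() function are assumptions made for the sketch, not the logic of any router described in this paper:

    #include <stddef.h>

    typedef enum { HEADER, PAYLOAD, TRAILER } flit_kind;
    typedef struct { flit_kind kind; unsigned word; } flit;

    typedef struct port {
        flit fifo[4];            /* small input buffer of a low-cost switch */
        int  head, count;
        struct port *locked_out; /* output reserved by the current worm     */
    } port;

    /* Hypothetical routing function: a real router decodes the header
       word; this one merely picks one of four downstream ports. */
    static port *route(port outputs[], unsigned header_word) {
        return &outputs[header_word & 3];
    }

    /* One cycle of wormhole forwarding on one input port. */
    static void forward_step(port *in, port outputs[]) {
        flit f;
        port *out;
        if (in->count == 0) return;
        f = in->fifo[in->head];
        if (f.kind == HEADER && in->locked_out == NULL)
            in->locked_out = route(outputs, f.word); /* forward header ASAP */
        out = in->locked_out;
        if (out == NULL || out->count == 4)
            return;                      /* output busy or full: the stall
                                            propagates back along the worm */
        out->fifo[(out->head + out->count) % 4] = f; /* move flit forward */
        out->count++;
        in->head = (in->head + 1) % 4;
        in->count--;
        if (f.kind == TRAILER)
            in->locked_out = NULL;       /* tail has passed: release path */
    }

The sketch exhibits the behavior described above: the header cuts through without waiting for the trailer, and a busy downstream link stalls the whole macropipeline of upstream buffers.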
3. An on-chip switched network

This section is an overview of the global design options for a Scalable, Programmable, Integrated Network (SPIN) for packet-switched system-on-chip interconnections. The design decisions concern the nature of the elementary links, the topology of the network, the packet structure, and the network access protocols.

• The point-to-point link should be able to stand the throughput of a bus devoted to a single master. We therefore decided on a parallel link built of two one-way 32-bit data paths. In contrast to a bus, there are no bi-directional wires. We use a credit-based flow control on the paths: buffer overflows at the target end of a path are checked at the source, using a counter to track the amount of free buffer space. The receiver notifies the sender of every datum consumed, with a dedicated feedback wire. This inexpensive mechanism provides latency-independence: the links can be pipelined transparently to achieve the desired clock speed in a submicron process.
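The sender-side half of this mechanism fits in a few lines of C. The names (link_state, try_send) and the buffer depth are illustrative assumptions, not SPIN specifics:

    #define BUF_WORDS 4          /* assumed free space at the target end */

    typedef struct { int credits; } link_state;

    static void link_init(link_state *l) { l->credits = BUF_WORDS; }

    /* Returns 1 if the word could be sent this cycle. */
    static int try_send(link_state *l, unsigned word) {
        if (l->credits == 0) return 0;  /* would overflow the target: wait */
        l->credits--;                   /* one word now in flight/buffered */
        (void)word;  /* the word is driven onto the 32-bit data path here */
        return 1;
    }

    /* Called on each pulse of the dedicated feedback wire, i.e. each
       time the receiver consumes one datum. Pipelining the link only
       delays these pulses; it can never cause an overflow, which is
       what makes the mechanism latency-independent. */
    static void credit_pulse(link_state *l) { l->credits++; }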
• The network topology impacts the complexity of the distributed routing decisions. Regular topologies like meshes or hypercubes make the decision functions simple and identical for every router, so the same macrocell can be reinstantiated to build the network. Among those, we preferred the fat-tree (figure 4), because Leiserson has formally proved that it is the most cost-efficient for VLSI realizations [7].

[Figure 4: A fat-tree network, drawn here with 2 stages of routers, 4 equivalent tree roots, and 16 terminals at the leaves of the tree.]

A fat-tree is a tree structure with routers on the nodes and terminals on the leaves, except that every node has replicated fathers. If there are as many fathers as children on all nodes, then there are as many roots as leaves and the bisection throughput is preserved: the network is non-blocking. We chose a four-child tree because the 8x8 router seemed most convenient for VLSI implementation. The size of this network grows like (n.log n)/8 with the number of terminals. Table 1 lists the costs of some network instances.

                        Resources
    Attached Units   Routers   Links
          8              2       12
         16              8       32
         32             16       96
         64             48      192
        128             96      448

Table 1: Scaling the SPIN fat-tree
• Packets consist of sequences of 32-bit words. This width allows the header to fit in a single word. A byte in this word identifies the destination (allowing 256 terminals), and other bits are used for packet tagging and routing options. The routers are free to use any of the redundant paths in the fat-tree to route a packet. This feature is called adaptivity, and it reduces contention hot-spots. The packet payload may be of any size, possibly even infinite. Finally, the trailer is a special word marked by a dedicated control line. It contains a checksum of the payload data.
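The text fixes only the byte-wide destination field; the C rendering below assumes plausible positions for the tag and option bits, and uses a simple additive sum as a stand-in for the unspecified trailer checksum:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative 32-bit header layout; only the 8-bit destination
       field is mandated by the text, the other positions are assumed. */
    typedef struct {
        uint32_t destination : 8;   /* one of up to 256 terminals         */
        uint32_t service_tag : 8;   /* associates the packet to a service */
        uint32_t options     : 16;  /* routing options, e.g. adaptivity   */
    } spin_header;

    /* Stand-in for the trailer checksum over the 32-bit payload words;
       the actual checksum algorithm is not specified in the paper. */
    static uint32_t payload_checksum(const uint32_t *payload, size_t n) {
        uint32_t sum = 0;
        size_t i;
        for (i = 0; i < n; i++) sum += payload[i];
        return sum;
    }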
• Packets natively implement the message-passing communication model. Messages can be used to build protocols emulating other models, like dataflow streams and address spaces. We believe that future systems-on-chip will be heterogeneous, featuring at least these two communication mechanisms. They could be used together on some units, since they serve different purposes: for instance, application software configures and monitors dataflow processors through addressable registers. Hence, we specified a modular network access terminal that can be shared by several wrappers, each providing a different service to the attached unit. The packets are tagged to associate them with services. Each service uses a private payload format for its packets. Figure 5 illustrates the parallel between this modular, fully hardwired approach and a layered stack of APIs.

[Figure 5: Hardwired protocol stack. Masters (e.g. a CPU), dataflow processors and legacy IP blocks attach through network wrappers: a stream service (dataflow protocol) for dataflow processors, an address-space service for cache memories and slave/memory units, plus a traffic regulation layer, all sitting on the packet-switched communication medium, a lossless, out-of-order datagram protocol.]

4. Router design and cost

The critical component of the network described in section 3 is the router macrocell. On one hand, it must be carefully optimized for area, because it is reinstantiated many times in a single system. On the other hand, the packet buffering strategy of individual routers strongly impacts the global performance of the network: the most conservative model, where packets are queued in FIFOs at the router inputs, is known to generate the highest contention [8]. But the smarter buffering schemes used in general-purpose discrete routers require expensive hardware to clear the inputs by queuing packets on additional crossbar channels called "output buffers".

However, careful analysis of our topology shows that contention will mostly result from packets flowing down the tree towards child links, because alternate paths are only available for the father links, and also because those packets span more routers on average. We based our design (figure 6) on this property and on the experience of the successful realization of the discrete router RCube™ [9]. Small (4-word) input buffers are necessary for hiding the delay of the control logic and the link latency. In case of output contention, child-bound packets can use two output buffers of 18 words each. They reduce cascaded contention by providing a longer side track for halted packets. This is more hardware-efficient than simply making all input buffers longer, because in most practical conditions only a couple of inputs will be subject to contention. The chosen size handles most efficiently packets shorter than 18 words, with payloads as large as typical bus bursts. The crossbar grows to 10×10 because of the output buffers, but it is not full, because not all routes are possible.

[Figure 6: Router datapaths synopsis. Child paths and parent paths connect through 4-word input buffers to a 10x10 partial crossbar of 36-bit busses (2304 tristates), steered by the routing (crossbar control) logic, with the two shared output buffers attached to the extra crossbar channels.]
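As a sketch, the diversion decision just described can be written as follows in C; all names, and the packet-size test in particular, are illustrative assumptions rather than the actual crossbar control logic:

    #include <stddef.h>

    enum { OUT_BUFFERS = 2, OUT_WORDS = 18 };

    typedef struct { int busy; int free_words; } out_buffer;

    static out_buffer shared[OUT_BUFFERS] = { {0, OUT_WORDS}, {0, OUT_WORDS} };

    /* On output contention: returns the side track to use, or NULL to
       stall in place like an ordinary wormhole switch. */
    static out_buffer *on_contention(int child_bound, int pkt_words) {
        int i;
        if (!child_bound) return NULL;  /* father-bound packets rely on
                                           adaptive alternate paths instead */
        for (i = 0; i < OUT_BUFFERS; i++)
            if (!shared[i].busy && pkt_words <= shared[i].free_words)
                return &shared[i];      /* longer side track for the worm */
        return NULL;                    /* both taken: cascaded stall */
    }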
We used the symbolic layout standard cell library of the ALLIANCE system to synthesize the router. The regular parts (buffers and crossbar) were placed manually, using metal layers 4, 5 and 6 to interconnect the switch matrix. Figure 7 shows this placement. The area is 1×0.8 mm2 in a .25 µ process. The control logic is synthesized, placed and routed by automatic tools in the remaining empty space, using metal layers 2 and 3. From this geometry and from table 1, we can forecast the network costs for future systems, which are summarized in table 2.

[Figure 7: Router floorplan]

The control logic can be pipelined to match virtually any speed the switching matrix may reach. The timing is therefore constrained by the delay of the wires in the matrix. Electrical simulation taking crosstalk capacitances into account suggests a cycle time below 5 ns for a .25 µ CMOS process. Table 2 assumes this delay will scale in accordance with the SIA roadmap projections. A test silicon of the router macrocell is scheduled for the end of this year, which will enable accurate performance and power measurements.

    Process      Attached Units   Peak Bandwidth   Network Area
    .25 µ, 6ML         32           205 Gbit/s      < 13 mm2
    .18 µ              64           568 Gbit/s      < 20 mm2
    .13 µ             128          1.82 Tbit/s      < 20 mm2

Table 2: Projected network costs

Finally, we paid careful attention to network testability issues, as no off-the-shelf test method can handle a system comprising dozens of crossbars and hundreds of FIFOs. We use a graph property of the fat-tree shown on figure 8: the graph is Eulerian, so a common predefined connection scheme can be applied to all routers to create paths covering all links and buffers in the network. In test mode, these paths are daisy-chained and fed with a pattern generator. All the global interconnect is tested by this method. Stalls and bubbles are included in the stream to stimulate the FIFOs and test them without any scan path. In addition, an input-output loop-back test is applied to every switching matrix individually.

[Figure 8: Global test paths, with test chaining taps at the tree roots.]
5. Performance evaluation
A fast cycle-true, bit-true simulation model of the SPIN router was written in the C language for the CASS system-level simulation tool [10]. This model allowed us to test large networks (up to 256 terminals) under a range of benchmark loads. The most common benchmark in the networking literature is the uniform random distribution of packet destinations. The results of this test are a pessimistic estimate of the practical performance of a network, because the load does not exhibit any locality that the tree clustering could exploit. Since all paths of the network are equally loaded, it also reduces the advantages of router adaptivity. Despite these artifacts, we tested the network under a random load to compare it against past realizations, while still getting a worst-case performance prediction.

Figure 9 summarizes the key data for a 32-terminal network (16 routers) with 20-word packets spread uniformly. The simulations were run for one million clock cycles, which is enough to provide ±1% accurate population counts.
The offered load is the average proportion of cycles at which data is injected into unbounded buffers connected to each input terminal. These buffers become saturated when the network cannot absorb the offered load. Figure 9 shows how the packet latency grows as the network is loaded, until it reaches saturation. Note that this latency includes the time spent in the injection buffers before entering the network; the time actually spent inside the network is always lower.

[Figure 9: Load benchmarks. Average latency (in cycles, 20 to 100) versus offered load (fraction of maximum, 10% to 60%), for three router configurations: no output buffer (saturating near 41% load), one shared output buffer (48%), and two shared output buffers (52%).]
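For reference, such a load is simple to generate. The C sketch below, with illustrative names and a stub for the injection queue, shows how an offered load translates into a per-cycle, per-terminal injection probability with uniformly drawn destinations:

    #include <stdlib.h>

    enum { N = 32, PKT_WORDS = 20 };   /* 32-terminal run, 20-word packets */

    static void enqueue_packet(int src, int dst, int words) {
        /* Stub: queue the packet in src's unbounded injection buffer;
           its latency is counted from this moment on. */
        (void)src; (void)dst; (void)words;
    }

    /* One simulated cycle of the uniform random benchmark. */
    static void inject_cycle(double offered_load) {
        int src;
        for (src = 0; src < N; src++)
            if ((double)rand() / RAND_MAX < offered_load) {
                int dst = rand() % N;   /* uniform destinations: no locality
                                           for the tree clustering to exploit */
                enqueue_packet(src, dst, PKT_WORDS);
            }
    }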
The curves show the performance impact of the two shared output buffers. These simulations made it possible to optimize the cost/performance of the realization presented above. The results are satisfactory by (discrete) networking standards, although somewhat lower than the state of the art. This is remarkable given the very small depth of the input buffers, and it demonstrates the relevance of our topology-specific buffering policy. Simulations with larger networks have shown little performance degradation. We conclude that a network clocked at 200 MHz would easily deliver 2 Gbit/s per terminal, and still scale nicely to aggregated throughputs of up to 100 Gbit/s for 32 terminals.
[Figure 10: Latency distributions. Probability of occurrence (log scale, from 1% down to .01%) of latencies from 25 to 200 cycles, for offered loads of 30%, 49% and 51%.]

Nevertheless, figure 10 shows an important drawback of the switched network: all latencies tend to occur, with an exponentially decreasing probability. This means that nearly all packets will be delivered within a small time, but there will be some random hiccups, all the more frequent as the network is heavily loaded. These simulations are for a 64-terminal network (48 routers), with a uniform load, where most packets have the maximum number of routing hops (i.e. 5 hops).

6. The communication protocols

This section presents results on the hardware implementation (i.e. the wrappers) of the protocols enabling dataflow or address communications through the packet-switched network. The goal is to emulate these models using the message-passing model. In addition, the two bandwidth-effective features of SPIN, adaptivity and output buffering, have the unpleasant property of swapping packets. The protocols must therefore also enforce strict in-order delivery for dataflow and for some address-space transactions.

Regarding stream communications, we devised a protocol based on our experience with sender-based protocols for adaptive networks [11]. The key is an unambiguous chronological tagging of every packet transmitted in a stream. Upon reception, packets are stored in the wrapper, which may deliver them in order to the dataflow processor. Buffer storage is reserved for the missing packets until they arrive. If a packet is very late, the processor stalls, the entire buffer becomes busy, and other incoming packets have to be rejected into the network, which is used as a delay line. An end-to-end credit-based traffic regulation bounds the amount of outstanding stream data and thus prevents rejection from causing catastrophic network congestion. Both statistical modeling and cycle-true simulations show that this spurious traffic is negligible given the latency distribution of SPIN. The total bandwidth overhead introduced by the protocol is 31% of the stream throughput.
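A minimal C sketch of the receiving wrapper's reorder logic follows; the window size and packet fields are hypothetical (the real wrapper stores packets in the dual-port SRAM mentioned in the next paragraph):

    enum { WINDOW = 8 };   /* assumed reorder window, in packets */

    typedef struct {
        unsigned next_seq;            /* tag expected by the dataflow unit */
        int      present[WINDOW];     /* slot reserved/filled flags        */
        unsigned payload[WINDOW];     /* stand-in for real packet storage  */
    } reorder_buf;

    /* Returns 0 if the packet falls outside the window: the wrapper has
       no reserved slot left and must reject it back into the network,
       which then acts as a delay line. */
    static int accept_packet(reorder_buf *rb, unsigned seq, unsigned data) {
        if (seq - rb->next_seq >= WINDOW) return 0;
        rb->present[seq % WINDOW] = 1;
        rb->payload[seq % WINDOW] = data;
        return 1;
    }

    /* Deliver as many chronologically consecutive packets as available. */
    static void drain_in_order(reorder_buf *rb, void (*deliver)(unsigned)) {
        while (rb->present[rb->next_seq % WINDOW]) {
            rb->present[rb->next_seq % WINDOW] = 0;
            deliver(rb->payload[rb->next_seq % WINDOW]);
            rb->next_seq++;
        }
    }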
We synthesized a VHDL description of this hardware protocol stack, supporting four concurrent streams per terminal. The basic network access layer, necessary to plug in any service wrapper, test the network and check data integrity, represents about 0.15 mm2 in a .25 µ process. The stream traffic regulation layer is less than 0.1 mm2. The stream protocol itself is only 0.1 mm2, plus a dual-port SRAM of 1 to 4 kbits (0.1 to 0.3 mm2 with our symbolic SRAM generator). In any configuration, the full network interface is smaller than 0.6 mm2 in this process, making it affordable for systems comprising up to a dozen stream processors, each with gigabit/s throughput.

Regarding address-space communications, we are defining a protocol for wrappers matching the "Virtual Component Interface" (VCI) address-oriented standard [2]. The basic principle is to translate the VCI request/response packets into SPIN packets. For optimal performance, however, the initiator must take advantage of the VCI split transactions: because the delay of round-trips through the switched network is large, several transactions must be overlapped, as shown in figure 11, to use the full link bandwidth, e.g. for cache-to-memory traffic. Memory throughput can then be scaled by distributing it over several network terminals sustaining concurrent accesses, as in Distributed Shared Memory multi-computers.

[Figure 11: The split transaction model. Along a time axis, read and write-data requests cross the communication medium from the master side while the slave side's acknowledgements return, so that several transactions are in flight at once.]
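The initiator-side pipelining can be sketched as follows in C; the function names and the outstanding-transaction limit are illustrative, not part of the VCI standard:

    enum { MAX_OUTSTANDING = 4 };   /* assumed overlap depth */

    static void issue_read(unsigned addr, unsigned tag) {
        /* Stub: emit a SPIN request packet carrying `tag` to the slave. */
        (void)addr; (void)tag;
    }

    static unsigned wait_response(void) {
        /* Stub: block until any response packet returns and yield its tag;
           responses may come back out of order over the adaptive network. */
        return 0;
    }

    /* Read a burst without serializing the network round-trips. */
    static void read_burst(const unsigned addrs[], unsigned n) {
        unsigned next = 0, done = 0, outstanding = 0;
        while (done < n) {
            while (next < n && outstanding < MAX_OUTSTANDING) {
                issue_read(addrs[next], next);  /* issue without waiting */
                outstanding++; next++;
            }
            (void)wait_response();              /* each round-trip is hidden
                                                   behind the others */
            outstanding--; done++;
        }
    }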

Addressing the memory over a switching network raises a software compatibility concern for those features that relied upon snooping a central bus, like semaphore synchronization and cache invalidation. DSM multi-computers have shown that comprehensive support for cache consistency is possible, but it is too complex for an embedded system [12]. Fortunately, it is not needed, because these systems are rather heterogeneous, asymmetric multiprocessors, with explicit invalidations and simple synchronization primitives. These can be supported more easily by simple protocols. We do not presently have a cost measurement of the VCI wrappers for SPIN. However, we believe they will be smaller than the dataflow wrappers, because they do not require any data buffering.

7. Present limitations

A well-known criticism of our approach is the complexity of switching network concepts. The design space for networks is different from, and larger than, that for busses, with new caveats and new refinements. Although all silicon engineers have seamlessly used commodity networks (through telephones, NFS or the Internet), they are unfamiliar and ill-at-ease with key network aspects like true concurrency or statistical behavior prediction. This problem cannot be solved by improvements to the proposed architecture, although protocols and CAD tools can help hide its internal intricacies. The true solution is the education of the designer community in the broader perspectives opened by deep submicron systems [13]. Table 3 summarizes the arguments on both sides.
    Bus pros & cons                          | Network pros & cons
    -----------------------------------------+------------------------------------------
    (-) Every unit attached adds parasitic   | (+) Only point-to-point one-way wires
        capacitance, so electrical           |     are used, for all network sizes.
        performance degrades with growth.    |
    (-) Bus timing is difficult in a deep    | (+) Network wires can be pipelined,
        submicron process.                   |     because the network protocol is
                                             |     globally asynchronous.
    (-) Bus testability is problematic       | (+) Dedicated BIST is fast and complete.
        and slow.                            |
    (-) The arbiter delay grows with the     | (+) Routing decisions are distributed,
        number of masters; the arbiter is    |     and the same router is reinstantiated
        also instance-specific.              |     for all network sizes.
    (-) Bandwidth is limited and shared by   | (+) Aggregated bandwidth scales with
        all attached units.                  |     the network size.
    (+) Bus latency is zero once the arbiter | (-) Internal network contention causes
        has granted control.                 |     a small latency.
    (+) The silicon cost of a bus is near    | (-) The network has a significant
        zero.                                |     silicon area.
    (+) Any bus is almost directly           | (-) Bus-oriented IPs need smart wrappers,
        compatible with most available IPs,  |     and software needs clean
        including software running on CPUs.  |     synchronization in multiprocessor
                                             |     systems.
    (+) The concepts are simple and well     | (-) System designers need reeducation
        understood.                          |     in the new concepts.

Table 3: The bus-versus-network arguments

8. Conclusions

We have shown a new architecture template for system-level interconnection, based on switching networks. We assessed the cost and the performance of this template through joint functional modeling and physical implementation of key parts of the architecture. Our results demonstrate that it matches the throughput, latency and area requirements of future systems-on-chip.

We acknowledged the need for legacy communication protocols to enable straightforward reuse of existing IPs in industrial designs. We have therefore devised a layered protocol architecture to provide a set of industry-standard communication mechanisms on top of a switching network. Results of such a mechanism for dataflow communication have been presented, including a detailed silicon area evaluation. VCI-compliant wrappers for address-space communications are being actively investigated.

Bibliography

[1] K. Keutzer, "Chip Level Assembly (and not Integration of Synthesis and Physical) is the Key to DSM Design", Proceedings of the ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (Tau'99), Monterey, CA, USA, March 1999.
[2] Virtual Socket Interface Alliance, "On-Chip Bus Attributes" and "Virtual Component Interface - Draft Specification, v. 2.0.4", http://www.vsia.com, September 1999 (document access may be limited to members only).
[3] C. Clos, "A Study of Non-blocking Switching Networks", Bell System Technical Journal, vol. 32, no. 2, pp. 406-424, 1953.
[4] J. Leijten et al., "Stream Communication between Real-Time Tasks in a High-Performance Multiprocessor", Proceedings of the 1998 DATE Conference, Paris, France, March 1998.
[5] D. C. Chen and J. M. Rabaey, "A Reconfigurable Multiprocessor IC for Rapid Prototyping of Algorithmic-Specific High-Speed DSP Data Paths", IEEE Journal of Solid-State Circuits, vol. 27, no. 12, pp. 1895-1904, December 1992.
[6] W. Dally and C. Seitz, "Deadlock-free Message Routing in Multiprocessor Interconnection Networks", IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547-553, May 1987.
[7] C. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing", IEEE Transactions on Computers, vol. C-34, no. 10, pp. 892-901, October 1985.
[8] M. Karol et al., "Input versus Output Queueing on a Space-Division Packet Switch", IEEE Transactions on Communications, pp. 1347-1356, December 1987.
[9] B. Zerrouk et al., "RCube: A Gigabit Serial Link Low Latency Adaptive Router", Records of the IEEE Hot Interconnects IV Symposium, Palo Alto, CA, USA, August 1996.
[10] F. Pétrot et al., "Cycle-Precise Core Based Hardware/Software System Simulation with Predictable Event Propagation", Proceedings of the 23rd Euromicro Conference, Budapest, Hungary, pp. 182-187, IEEE Computer Society Press, September 1997.
[11] F. Wajsbürt et al., "An Integrated PCI Component for IEEE 1355", Proceedings of the 1997 EMMSEC Conference and Exhibition, Florence, Italy, November 1997.
[12] J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach", 2nd edition, Morgan Kaufmann Publishers, San Francisco, CA, USA, 1996.
[13] H. de Man, "Education for the Deep Submicron Age: Business As Usual?", Proceedings of the 34th Design Automation Conference, Anaheim, CA, USA, March 1997.
