
An Intra-Chip Free-Space Optical Interconnect ∗

Jing Xue, Alok Garg, Berkehan Ciftcioglu, Jianyun Hu, Shang Wang
Ioannis Savidis, Manish Jain† , Rebecca Berman† , Peng Liu
Michael Huang, Hui Wu, Eby Friedman, Gary Wicks† , Duncan Moore†
Dept. Electrical & Computer Engineering and † Institute of Optics
University of Rochester, Rochester, NY 14627, USA

ABSTRACT

Continued device scaling enables microprocessors and other systems-on-chip (SoCs) to increase their performance, functionality, and hence, complexity. Simultaneously, relentless scaling, if uncompensated, degrades the performance and signal integrity of on-chip metal interconnects. These systems have therefore become increasingly communications-limited. The communications-centric nature of future high-performance computing devices demands a fundamental change in intra- and inter-chip interconnect technologies. Optical interconnect is a promising long-term solution. However, while significant progress in optical signaling has been made in recent years, networking issues for on-chip optical interconnect still require much investigation. Taking the underlying optical signaling systems as a drop-in replacement for conventional electrical signaling while maintaining conventional packet-switching architectures is unlikely to realize the full potential of optical interconnects. In this paper, we propose and study the design of a fully distributed interconnect architecture based on free-space optics. The architecture leverages a suite of newly-developed or emerging devices, circuits, and optics technologies. The interconnect avoids packet relay altogether, offers an ultra-low transmission latency and scalable bandwidth, and provides fresh opportunities for coherency substrate designs and optimizations.

Categories and Subject Descriptors: C.1.4 [Processor Architecture]: Parallel Architecture; C.2.1 [Computer-Communication Networks]: Network Architecture and Design
General Terms: Design, Performance
Keywords: 3D, intra-chip, free-space optical interconnect

* This work is supported by NSF under grant 0829915 and also in part by grants 0747324 and 0901701.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA'10, June 19-23, 2010, Saint-Malo, France.
Copyright 2010 ACM 978-1-4503-0053-7/10/06 ...$10.00.

1. INTRODUCTION

Continued device scaling enables microprocessors and other systems-on-chip (SoCs) to increase their performance, functionality, and complexity, which is evident in the recent technology trend toward multi-core systems [1]. Simultaneously, uncompensated scaling degrades wire performance and signal integrity. Conventional copper interconnects face significant challenges in meeting the increasingly stringent design requirements on bandwidth, delay, power, and noise, especially for on-chip global interconnects.

Optical interconnects have fundamental advantages over metal interconnects, particularly in delay and potential bandwidth [2, 3], and significant progress in the technology has been made in recent years [4]. However, while signaling issues have received a lot of attention [5], networking issues in the general-purpose domain remain under-explored. The latter cannot be neglected, as conventional packet-switched interconnects are ill-suited for optics: without major breakthroughs, storing packets optically remains impractical. Hence packet switching would require repeated optoelectronic (O/E) and electro-optic (E/O) conversions that significantly diminish the advantages of optical signaling. Alternative topologies such as buses or rings [6, 7] avoid packet switching by sharing the transmission media (optical waveguides), and rely on wavelength division multiplexing (WDM) to achieve large bandwidth. Purely relying on WDM, however, poses rather stringent challenges to the design and implementation of on-chip E/O modulators, e.g., requiring precise wavelength alignment and extremely low insertion loss. Furthermore, on-chip interconnect poses different constraints and challenges from off-chip interconnect, and offers a new set of opportunities. Hence architecting on-chip interconnects for future microprocessors requires novel holistic solutions and deserves more attention.

In this paper, we propose to leverage a suite of newly-developed or emerging device, circuit, and optics technologies to build a relay-free interconnect architecture:

• Signaling: VCSELs (vertical-cavity surface-emitting lasers) provide light emission without the need for external laser sources or for routing the "optical power supply" all over the chip. VCSELs, photodetectors (PDs), and supporting micro-optic components can be implemented in GaAs technologies and 3-D integrated with the silicon chip, the latter including the CMOS digital electronics as well as the transmitters and receivers.

• Propagation medium: Free-space optics using integrated micro-optic components provides an economic medium allowing speed-of-light signal propagation with low loss and low dispersion.

• Networking: Direct communications through dedicated VCSELs, PDs, and micro-mirrors (in small-scale systems) or via phase-array beam-steering (in large-scale systems) allows a quasi-crossbar structure that avoids packet switching, offers ultra-low communication latency in the common case, and provides scalable bandwidth thanks to the fully distributed nature of the interconnect.

The rest of the paper is organized as follows: Section 2 discusses the background of on-chip optical interconnect; Section 3 introduces our free-space optical interconnect and the array of enabling technologies; Sections 4 and 5 discuss the architectural design issues and optimizations; Section 6 presents the quantitative analysis; Section 7 discusses related work; and Section 8 concludes.

2. CHALLENGES FOR ON-CHIP OPTICAL INTERCONNECT

First, it is worth noting that on-chip electrical interconnects have made tremendous progress in recent years, driven by continuous device scaling, reverse scaling of top metal layers, and the adoption of low-k inter-layer dielectric. The bandwidth density is projected to reach 100 Gbps/µm with 20-ps/mm delay at the 22-nm technology node by 2016 [8]. Assisted by advanced signal processing techniques such as equalization, echo/crosstalk cancellation, and error correction coding, the performance of electrical interconnects is expected to continue advancing at a steady pace. Therefore, on-chip optical interconnects can only justify replacing their electrical counterpart by offering significantly higher aggregated bandwidth with lower power dissipation and without significant complexity overhead.

Current optical interconnect research efforts focus on using planar optical waveguides integrated onto the same chip as the CMOS electronics. This in-plane waveguide approach, however, presents some significant challenges. First, all-optical switching and storage devices in silicon technologies remain far from practical. Without these capabilities, routing and flow control in a packet-switched network, as typically envisioned for an on-chip optical interconnect system, require repeated O/E and E/O conversions, which can significantly increase signal delay, circuit complexity, and energy consumption. Simultaneously, efficient silicon E/O modulators remain challenging due to the inherently poor nonlinear optical properties of silicon [9]. Hence the modulator design requires a long optical length, which results in large device size, e.g., typically in centimeters for a Mach-Zehnder interferometer (MZI) device [10]. Resonant devices such as micro-ring resonators (e.g., [11]) can effectively slow the light and hence reduce the required device size. These high-Q resonant devices, however, have relatively small bandwidth and must meet very stringent spectral and loss requirements, which translates into extremely fine device geometries and little tolerance for fabrication variability. Fine-resolution processing technologies such as electron-beam lithography are needed for device fabrication, which poses cost and yield challenges that are even greater than integrating non-silicon components at present. Further, accurate wavelength tuning is required at runtime, especially when facing the large process and temperature variations and hostile thermal environment on-chip. Typical wavelength tuning using resistive thermal bias [12] substantially increases the system complexity and static energy consumption [13].

Further, there is a fundamental bandwidth density challenge for the in-plane waveguided approach: the mode diameter of optical waveguides, which determines the minimum distance required between optical waveguides to avoid crosstalk, is significantly larger than the metal wire pitch of electrical interconnect in nanoscale CMOS technologies, and will deteriorate with scaling [8]. Wavelength division multiplexing (WDM), proven in long-distance fiber-optic communications, has been proposed to solve the problem and achieve the bandwidth-density goal. WDM, however, is much more challenging in an intra-chip environment due to a whole array of issues. First, wavelength multiplexing devices such as micro-ring-based wavelength add-drop filters [11] require fine wavelength resolution and superior wavelength stability, which exacerbates the device fabrication and thermal tuning challenges. Second, these multiplexers introduce insertion loss (on the order of 0.01-0.1 dB per device) to the optical signals on the shared optical waveguide. Using multiple wavelengths exponentially amplifies the losses, and significantly degrades the link performance. This problem would be almost prohibitive in a bus or ring topology with a large number of nodes. Lastly, a multi-wavelength light source (laser array, supercontinuum generation, or spectrum slicing) is needed, which is more complex and expensive than a single-wavelength laser.

Another challenge facing the in-plane waveguide approach is the optical loss and crosstalk from the large number of waveguide crossings [14], which severely limit the topology of the interconnect system [13] and hence the total aggregated system bandwidth. Placing waveguides onto a dedicated optics plane with multiple levels would require multiple silicon-on-insulator (SOI) layers, increasing the process complexity, and the performance gain is not scalable.

In summary, we believe that (a) it is critical to achieve the highest possible data rate in each optic channel at a fixed wavelength in an on-chip optical interconnect system in order to replace the electrical interconnects; (b) using WDM and in-plane optical waveguides may not be the best solution to achieve the bandwidth goal and certainly should not be the sole focus of our effort; and (c) electronics and photonics have different physics, follow different scaling rules, and probably should be fabricated separately.

3. OVERVIEW

To address the challenges of building high-performance on-chip optical interconnects, we seek to use free-space optics and supporting device, circuit, and architecture techniques to create a high-performance, complexity-effective interconnect system. We envision a system where a free-space optical communication layer, consisting of arrays of lasers, photodetectors, and micro-optics devices such as micro-mirrors and micro-lenses, is superimposed on top of the CMOS electronics layer via 3-D chip integration. This free-space optical interconnect (FSOI) system provides all-to-all direct communication links between processor cores, regardless of their topological distance. As shown in Figure 1, in a particular link, digital data streams modulate an array of lasers; each modulated light beam emitted by a laser is collimated by a micro-lens, guided by a series of micro-mirrors, focused by another micro-lens, and then detected by a photodetector (PD); the received electrical signals are finally converted to digital data. Note that the optical links run at multiples of the core clock speed.
Figure 1: Illustration of the overall interconnect structure and 3-D integrated chip stack: (a) side view (mirror-guided only); (b) side view (with phase-array beamforming); (c) top view. (a) and (b) show two different optics configurations, with the free-space optics layer (micro-mirrors and micro-lenses) above the GaAs photonics layer (VCSEL/OPA and PD), which is flip-chip bonded through through-silicon vias to the CMOS electronics on the silicon substrate and the package. In the top view (c), the VCSEL arrays are in the center and the photodetectors are on the periphery within each core.
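The free-space links sketched in Figure 1 propagate at the speed of light, and it is easy to check what that means in absolute terms. The back-of-the-envelope calculation below (a sketch, using the worst-case link parameters quantified later in Table 1: a 2 cm cross-chip diagonal path and a 40 Gbps link rate) puts the flight time at a few bit times:

```python
# Back-of-the-envelope propagation delay for the worst-case diagonal
# free-space link (2 cm path, 40 Gbps link rate, per Table 1).
C = 2.998e8            # speed of light in air, m/s
PATH_M = 0.02          # cross-chip diagonal path length, m
DATA_RATE = 40e9       # link data rate, bits/s

tof_ps = PATH_M / C * 1e12        # time of flight in picoseconds (~66.7 ps)
bit_ps = 1.0 / DATA_RATE * 1e12   # one bit time at 40 Gbps (25 ps)

print(f"time of flight: {tof_ps:.1f} ps = {tof_ps / bit_ps:.1f} bit times")
```

The ~67 ps flight time is on the same scale as the path-to-path skew of "tens of picoseconds, or about 3 communication cycles" mentioned in the link-design discussion later, which is why the faster paths can be aligned with simple bit padding and delay lines.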
Without packet switching, this design eliminates the intermediate routing and buffering delays and makes the signal propagation delay approach the ultimate lower bound, i.e., the speed of light. These links can operate at a much higher speed than core logic, making it easy to provide high throughput. On the energy-efficiency front, bypassing packet relaying clearly keeps energy cost low. Compared to waveguided optical interconnect, FSOI also avoids the loss and crosstalk associated with modulators and waveguide crossings. In the future, by utilizing the beam-steering capability of an optical phase array (OPA) of lasers, the number of lasers and photodetectors in each node can be kept constant, providing crucial scalability.

3.1 Lasers and Photodetectors

The lasers used in this FSOI system are vertical-cavity surface-emitting lasers (VCSELs) [15]. A VCSEL is a nanoscale heterostructure, consisting of an InGaAs quantum-well active region, a resonant cavity constructed with top and bottom dielectric mirrors (distributed Bragg reflectors), and a pn-junction structure for carrier injection. They are fabricated on a GaAs substrate using molecular beam epitaxy (MBE) or metal-organic chemical vapor deposition (MOCVD). A VCSEL is typically a mesa structure several microns in diameter and height. A large 2-D array with millions of VCSELs can be fabricated on the same GaAs chip. The light can be emitted from the top of the VCSEL mesa. Alternatively, at optical wavelengths of 980 nm and shorter, where the GaAs substrate is transparent, the VCSELs can also be made to emit from the back side and then through the GaAs substrate. A VCSEL's optical output can be directly modulated by its current, and the modulation speed can reach tens of Gbps [16, 17].

The photodetectors can be either integrated on the CMOS chip as silicon p-i-n photodiodes [18], or fabricated on the same GaAs chip using the VCSELs as resonant-cavity photodiodes [19, 20]. In the latter case, an InGaAs active region is enhanced by a resonant cavity similar to a VCSEL's; such devices offer a larger bandwidth and are well suited for this FSOI system.

3.2 Micro-lenses and Micro-mirrors

In the free-space optical interconnect, passive micro-optics devices such as micro-lenses and micro-mirrors collimate, guide, and focus the light beams in free space. Collimating and focusing allow smaller VCSELs and PDs to be used, which reduces their parasitic capacitance and improves their bandwidth. Micro-lenses can be fabricated either on top of VCSELs when the latter are top-emitting [21, 22], or on the backside of the GaAs substrate for substrate-emitting VCSELs [23, 24].

Micro-mirrors will be fabricated on silicon or polymer by micro-molding techniques [25, 26]. Note that commercial micro-mirror arrays (e.g., Digital Micromirror Device chips from Texas Instruments) have mirrors that can turn on and off thousands of times per second at full-HD density (millions of pixels). Our application requires only fixed mirrors at a scale of at most n² (n is the number of nodes).

3.3 3-D Integration and Thermal Issues

In this FSOI system, 3-D integration technologies are applied to electrically connect the free-space and photonics layers with the electronics layer, forming an electro-optical system-in-package (SiP). For example, the GaAs chip is flip-chip bonded to the back side of the silicon chip, and connected to the transceiver circuits there using through-silicon vias (TSVs). Note that the silicon chip is flip-chip bonded to the package in the normal fashion. In general, such an electro-optical SiP reduces the latency and power consumption of global signaling through optical interconnect, while permitting the microprocessors to be implemented using standard CMOS technologies. Significant work has explored merging various analog, digital, and memory technologies in a 3-D stack. Adding an optical layer to the 3-D stack is the next logical step to improve overall system performance.

Thermal problems have long been a major issue in 2-D integrated circuits, degrading both maximum achievable speed and reliability [27]. By introducing a layer of free space, our proposed design further adds to the challenge of air cooling. However, even without this free-space layer, continued scaling and the trend towards 3-D integration are already making air cooling increasingly insufficient, as demonstrated by researchers who have explored alternative heat-removal techniques for stacked 3-D systems [28-30].

One such technique delivers liquid coolants to microchannel heat sinks on the back side of each chip in the 3-D stack using fluidic TSVs [28]. Fluidic pipes [29] are used to propagate heat produced by the devices to the microchannel heat sinks. The heat is further dissipated through external fluidic tubes that can be located on either side of the 3-D stack.
A second technique exploits the advanced thermal conductive properties of emerging materials. Materials such as diamond, carbon nanotubes, and graphene have been proposed for heat removal. The thermal conductivity of diamond ranges from 1000 to 2200 W/m·K. Carbon nanotubes have an even higher thermal conductivity of 3000 to 3500 W/m·K, and graphene is better still [30]. These materials can be used to produce highly heat-conductive paths from the heat sources to the periphery of the 3-D stack through both thermal vias (vertical direction) and in-plane heat spreaders (lateral direction).

In both alternatives, thermal pipes are guided to the side of the 3-D stack, allowing placement of the free-space optical system. Finally, we note that replacing air cooling in high-end chips is perhaps not only inevitable but also desirable. For instance, researchers from IBM showed that liquid cooling allows the reuse of the heat, reducing the overall carbon footprint of a facility [31, 32].

4. ARCHITECTURAL DESIGN

4.1 Overall Interconnect Structure

As illustrated in Figure 1, in an FSOI link, a single light beam is analogous to a single wire, and similarly, an array of VCSELs can form essentially a multi-bit bus, which we call a lane. An interesting feature of using free-space optics is that signaling is not confined to fixed, prearranged waveguides, and the optical path can change relatively easily. For instance, we can use a group of VCSELs to form a phase array [33] - essentially a single tunable-direction laser, as shown in Figure 1(b). This feature makes an all-to-all network topology much easier to implement.

For small- and medium-scale chip multiprocessors, fixed-direction lasers should be used for simplicity: each outgoing lane can be implemented by a dedicated array of VCSELs. In a system with N processors, each having a total of k bits in all lanes, N·(N−1)·k VCSELs are needed for transmission. Note that even though the number scales with N², the actual hardware requirement is far from overwhelming. For a rough sense of scale, for N = 16, k = 9 (our default configuration for evaluation), we need approximately 2000 VCSELs. Existing VCSELs are about 20 µm × 20 µm in dimension [16, 17]. Assuming, conservatively, 30 µm spacing, 2000 VCSELs occupy a total area of about 5 mm². Note that on the receiving side, we do not use dedicated receivers. Instead, multiple light beams from different nodes share the same receiver. We do not try to arbitrate the shared receivers but simply allow packet collisions to happen. As will be discussed in more detail later, at the expense of having packet collisions, this strategy simplifies a number of other design issues.

4.2 Optical Links

To facilitate the architectural evaluation, a single-bit FSOI link is constructed (Figure 2) and the link performance is estimated for the most challenging scenario: communication across the chip diagonally. Note that the transceiver here is based on a conventional architecture, and can be simplified for lower power dissipation. Since the whole chip is synchronous (e.g., using optical clock distribution), no clock recovery circuit is needed.¹ The optical wavelength is chosen as 980 nm, which is a good compromise between VCSEL and PD performance. The serialized transmitted data is fed to the laser driver driving a VCSEL. The light from the back-emitting VCSEL is collimated through a microlens on the backside of the 430-µm-thick GaAs substrate. Using a device simulator, DAVINCI, and 2007 ITRS device parameters for the 45-nm CMOS technology, the performance and energy parameters of the optical link are calculated and detailed in Table 1.

¹ There will be delay differences between different optical paths, which can be up to tens of picoseconds, or equivalent to about 3 communication cycles. To maintain chip-wide synchronous operation, we delay the faster paths by padding extra bits in the serializer, and fine-tune the delay using digital delay lines in the transmitter.

Figure 2: Intra-chip FSOI link calculation.

Our transmitter is much less power hungry than a commercial one because (a) a more advanced technology (45-nm CMOS) is used; (b) the load is smaller (the integrated VCSEL exhibits a resistance of over 200 Ω, compared to the typical 25 Ω when driving an external laser or modulator); and (c) the signal swing is much smaller (the VCSEL voltage swing is about 100 mV instead of several hundred mV). Further, the transmitter goes into standby when not transmitting to save power: the VCSEL is biased below threshold, and the laser driver is turned off. The receiver is kept on all the time. Note that the power dissipation of the serializer in the transmitter and the deserializer in the receiver is much smaller than that of the laser driver and TIA, and hence is not included in the estimate.

Free-Space Optics
  Transmission distance:  2 cm
  Optical wavelength:     980 nm
  Optical path loss:      2.6 dB
  Microlens aperture:     90 µm @ transmitter, 190 µm @ receiver
Transmitter & Receiver
  Laser driver:       bandwidth = 43 GHz
  VCSEL:              aperture = 5 µm, parasitic = 235 Ω / 90 fF,
                      threshold = 0.14 mA, extinction ratio = 11:1
  PD:                 responsivity = 0.5 A/W, capacitance = 100 fF
  TIA & limiting amp: bandwidth = 36 GHz, gain = 15000 V/A
Link
  Data rate:              40 Gbps
  Signal-to-noise ratio:  7.5 dB
  Bit-error rate (BER):   10^-10
  Cycle-to-cycle jitter:  1.7 ps
Power Consumption
  Laser driver:           6.3 mW
  VCSEL:                  0.96 mW (0.48 mA @ 2 V)
  Transmitter (standby):  0.43 mW
  Receiver:               4.2 mW

Table 1: Optical link parameters.

4.3 Network Design

4.3.1 Tradeoff to Allow Collision

In our system, optical communication channels are built directly between communicating nodes within the network in a totally distributed fashion, without arbitration. An important consequence is that packets destined for the same receiver at the same time will collide. Such collisions require
detection, retransmission, and extra bandwidth margin to prevent them from becoming a significant issue. However, for this one disadvantage, our design allows a number of other significant advantages (and later we will show that no significant over-provisioning is necessary):

• Compared to a conventional crossbar design, we do not need a centralized arbitration system. This makes the design scalable and reduces unnecessary arbitration latency for the common cases.

• Compared to a packet-switched interconnect, this design
  1. Avoids relaying and thus repeated O/E and E/O conversions in an optical network;
  2. Guarantees the absence of network deadlocks;²
  3. Provides point-to-point message ordering in a straightforward fashion and thus allows simplification in coherence protocol designs;
  4. Reduces the circuit needs for each node to just drivers, receivers, and their control circuit. A significant amount of logic specific to packet relaying and switching is avoided (e.g., virtual channel allocation, switch allocators, and credit management for flow control).

• The design allows errors and collisions to be handled by the same mechanism, essentially requiring no extra support beyond that needed to handle errors, which is necessary in any system. Furthermore, once we accept collisions (with a probability on the order of 10^-2), the bit-error rates of the signaling chain can be relaxed significantly (from 10^-10 to, say, 10^-5) without any tangible impact on performance. This provides important engineering margins for practical implementations and further opportunities for energy optimization on the entire signaling chain.

² Note that fetch deadlock is an independent issue that is not caused by the interconnect design itself. It has to be either prevented with multiple virtual networks, which is very resource intensive, or probabilistically avoided using NACKs [34]. We use the latter approach in all configurations.

4.3.2 Collision Handling

Collision detection: Since we use simple on-off keying (OOK), when multiple light beams from different source nodes collide at the same receiver node, the received light pulse becomes the logical "OR" of the multiple underlying pulses. Detecting a collision is simple, thanks to the synchrony of the entire interconnect. In the packet header, we encode both the sender node ID (PID) and its bitwise complement (~PID). When more than one packet arrives at the same receiver array, at least one bit of the IDs (say PID_i) differs between the senders. Because of the effective "OR" operation, the received PID_i and ~PID_i would both be 1, indicating a collision.

Structuring: We take a few straightforward structuring steps to reduce the probability of collision.

1. Multiple receivers: It is beneficial to have a few receivers at each node so that different transmitter nodes target different receivers within the same node, reducing the probability of a collision. The effect can be better understood with some simple theoretical analysis. Using a simplified transmission model assuming equal probability of transmission and random destinations, the probability of a collision in a cycle in any node can be described as

  1 − [(1 − p/(N−1))^n + n·(p/(N−1))·(1 − p/(N−1))^(n−1)]^R,

where N is the number of nodes; p is the transmission probability of a node; R is the number of receivers (evenly divided among the N−1 potential transmitters); and n = (N−1)/R is the number of nodes sharing the same receiver.

Numerical results are shown visually in Figure 3. It is worth noting that the simplifying assumptions do not distort reality significantly. As can be seen from the plot, the experimental results agree well with the trend of the theoretical calculations.

Figure 3: Collision probability (normalized to packet transmission probability) as a function of transmission probability p and the number of receivers per node (R). The result has an extremely weak dependency on the number of nodes in a system (N) as long as it is not too small. The plot shown is drawn with N = 16. To see that this simplified theoretical analysis is meaningful, we show experimental data points using two receivers (R=2). We separate the channels ("meta" and "data" channels, as explained later).

To a first-order approximation, collision frequency is inversely proportional to the number of receivers. Therefore, having a few (e.g., 2-3) receivers per node is a good option. Further increasing the number leads to diminishing returns.

2. Slotting and lane separation: In a non-arbitrated shared medium, when a packet takes multiple cycles to transmit, it is well known that "slotting" reduces collision probability [35]. For instance, suppose data packets take 5 processor cycles to transmit; then they can only start at the beginning of a 5-cycle slot. In our system, we define two packet lengths, one for meta packets (e.g., requests and acknowledgments) and one for data packets (which is about 5 times the former). Each type will thus have a different slot length. In that case, slotting only reduces the chance of collision between two packets of the same length (and thus the same slot length). Furthermore, the different packet lengths (especially because one is much longer than the other) also make retransmission difficult to manage. One option to deal with both problems is to separate the packets into their own lanes and manage each lane differently.

3. Bandwidth allocation: Given a fixed bandwidth, we need to determine how to allocate the bandwidth between the two lanes for optimal performance. Even though a precise analytical expression relating bandwidth allocation and performance is difficult to obtain, some approximate analysis can still be derived: each packet has an expected total latency of L + Pc·Lr, where L, Pc, and Lr are the basic transmission latency, probability of collision, and collision resolution latency, respectively. L, Pc, and Lr are inversely proportional to the bandwidth allocated to a lane.³

³ Pc is not exactly inversely proportional to bandwidth [9].

The overall latency can be expressed as C1/BM + C2/BM² + C3/(1−BM) + C4/(1−BM)²,
where BM is the portion of total bandwidth allocated to the meta packets, and the constants (C1..C4) are a function of statistics related to application behavior and parameters that can be calculated analytically.⁴ In our setup, the optimal latency value occurs at BM = 0.285: about 30% of the bandwidth should be allocated to transmit meta packets. In our system, we use 3 VCSELs for the meta lane and 6 for the data lane, with a serialization latency of 2 (processor) cycles for a (72-bit) meta packet and 5 cycles for a (360-bit) data packet. Because we are using 2 separate receivers to reduce collisions, the receiving bandwidth is twice the transmitting bandwidth. For comparison, we use a baseline mesh network where the meta and data packets have a serialization latency of 1 and 5 cycles, respectively.

Confirmation:

Because a packet can get corrupted due to collision, some mechanism is needed to infer or to explicitly communicate the transmission status. For instance, a requester can time out and retry. However, solely relying on timeouts is not enough, as certain packets (e.g., acknowledgments) generate no response and the transmitter thus has no basis to infer whether the transmission was successful.

A simple hardware mechanism can be devised to confirm uncorrupted transmissions. We dedicate a single-VCSEL lane per node just to transmit a beam for confirmation: upon receiving an uncorrupted packet, the receiver node activates the confirmation VCSEL and sends the confirmation to the sender. Note that by design, confirmation beams will never collide with one another: when a packet is received in cycle n, the confirmation is sent after a fixed delay (in our case, in cycle n + 2, after a cycle for any delay in decoding and error-checking). Since at any cycle n, only one packet (per lane) will be transmitted by any node, only one confirmation (per lane) will be received by that node in cycle n + 2. Other than confirming successful packet receipt, the confirmation can also piggy-back limited information, as we show later.

Retransmission:

Once packets are involved in a collision, the senders retry. In a straightforward way, the packet is retransmitted in a random slot within a window of W slots after the detection of the collision. The chance of further collision depends on W. A large W results in a smaller probability of secondary collisions, but a longer average delay in retransmission. Furthermore, as the retry continues, other packets may arrive and make collisions even more likely, greatly increasing the delay and energy waste. If we simply retry using the same window size, in the pathological case when too many packets arrive in a concentrated period, they can reach a critical mass such that it is more likely to have a new packet from a different node join the existing set of competing senders than to have one successfully transmitted and leave the competition. This leads to a virtual livelock that we have to guard against.

Thus, we adopt an exponential back-off heuristic and set the window size to grow as the number of retries increases. Specifically, the window size for the r-th retry, Wr, is set to W × B^(r−1), where B is the base of the exponential function. While doubling the window size is a classic approach [36], we believe setting B to 2 is an over-correction, since the pathological case is a very remote possibility. Note that B need not be an integer. To estimate the optimal values of W and B without blindly relying on expensive simulations, we use a simplified analytical model of the network to derive the expression of the average collision resolution delay given W and B, taking into account the confirmation laser delay (2 cycles). Although the calculation does not lead to a simple closed-form expression, numerical computation using the packet transmission probability measured in our system leads to the results shown in Figure 4.

Figure 4: Average collision resolution delay for meta packets as a function of starting window size and back-off speed. While retransmission is attempted, other nodes continue regular transmission. This "background" transmission rate (G=1% and 10% shown) has a negligible impact on the optimal values of W and B.

The minimum collision resolution delay occurs at W = 2.7, B = 1.1. We selected a few data points on the curve and verified that the theoretical computation agrees with execution-driven simulation rather well. For instance, for W = 2.7, B = 1.1, the computed delay is 7.26 cycles and the simulated result is between 6.8 and 9.6 with an average of 7.4 cycles. The graph clearly shows that B = 1.1 produces a decidedly lower resolution delay in the common case than when B = 2. This does not come at the expense of unacceptable delay in the pathological case. For example, in a 64-node system, when all other nodes send one packet to a particular node at about the same time, it takes an average of about 26 retries (for a total of 416 cycles) to get one packet through. In contrast, with a fixed window size of 3, it would take 8.2 × 10¹⁰ retries. Setting B to 2 shortens this to about 5 retries (199 cycles).

4.4 Protocol Considerations

The delivery-order property of the interconnect can impact the complexity of the coherence protocol [34]. Our system does not rely on relaying and thus it is easy to enforce point-to-point message ordering. We delay the transmission of another message about a cache line until a previous message about that line has been confirmed. This serialization reduces the number of transient states the coherence protocol has to handle. We summarize the stable and transient state transitions in [9].

5. OPTIMIZATIONS

While the basic design described above can already support the coherency substrate and provide low-latency communication, a perhaps more interesting aspect of using an optical interconnect is to explore new communication or protocol opportunities. Below, we describe a few optimizations that we have explored in the proposed interconnect architecture.

5.1 Leveraging Confirmation Signals

In a cache coherence system, we often send a message where the whole point is to convey timing, such as the release of a barrier or lock. In these cases, the information content of the payload is extremely low, and yet carrying out synchronization accounts for about a quarter of total traffic in our simulated 64-node mesh-based chip-multiprocessor. Since usually the receiver is anticipating such a message, and it is often latency-sensitive, we can quickly convey such timing information using the confirmation laser. Compared to sending a full-blown packet, we can achieve even lower latency and higher energy efficiency, while reducing traffic and thus collisions on the regular channels.

Take invalidation acknowledgments for example. They are needed to determine write completion, so as to help ensure write atomicity and determine when memory barriers can finish in a relaxed consistency model [34]. In our system, we can eliminate the need for acknowledgments altogether by using the confirmation (of receiving the request) as a commitment to carrying out the invalidation [34]. This commitment logically serializes the invalidation before any subsequent externally visible transaction.⁵

Now let us consider a typical implementation of locks using load-linked (ll) and store-conditional (sc) instructions, and of barriers. Both can involve spinning on boolean values, which incurs a number of invalidations, confirmations, and reloading requests when the value changes. We choose to (a) transmit certain boolean values over the confirmation channel and (b) use an update protocol for boolean synchronization variables when feasible.

When a ll or sc misses in the L1 cache, we send a special request to the directory indicating reserved timing slots on the confirmation channel. Recall that each CPU cycle contains multiple communication cycles, or mini-cycles. If, for example, mini-cycle i is reserved, the directory can use that mini-cycle in any cycle to respond with the value or the state of the store-conditional directly. In other words, the information is encoded in the relative position of the mini-cycle.

Using such a mechanism over the confirmation channel, a requester can receive single-bit replies for ll requests. The value received is then recorded in the link register, essentially forming a special cache line with just one single-bit word. Such a "line" lends itself to an update protocol. Nodes holding these single bits can be thought of as having subscribed to the word location and will continue to receive updates via the same mini-cycle reserved on the confirmation lane earlier. The directory, on the other hand, uses one or more registers to track the subscriptions. When a node issues a sc with a boolean value, it sends the value directly through the request (rather than just seeking write permission for the entire line). The directory can thus perform updates to subscribers. Note that our design does not assume any specific implementation of lock or barrier. It merely implements the semantics of ll and sc differently when feasible, which expedites the dissemination of single-bit values. Also, this change has little impact on regular coherence handling. A normal store request to the line containing subscribed words simply invalidates all subscribers.

5.2 Ameliorating Data Packet Collisions

Since data packets are longer than meta packets, their collisions cause more damage and take longer to resolve. Fortunately, data packets have unique properties that can be leveraged in managing collisions: they are often the result of earlier requests. This has two implications. First, the receiver has some control over the timing of their arrival and can use that control to reduce the probability of a collision to begin with. Second, the receiver also has a general idea which nodes may be involved in the collision and can play a role in coordinating retransmissions.

Request spacing:

When a request results in a data packet reply, the most likely slot into which the reply falls can be calculated. The overall latency includes queuing delays for both the request and the reply, the collision resolution time for the request, and the memory access latency. All these components can be analyzed as independent discrete random variables. Figure 5 shows an example of the distribution of the overall latency of a read-miss request averaged over all application runs in our environment for illustration. As we can see, the probability is heavily concentrated in a few choices. Accordingly, we can reserve slots on the receiver. If a slot is already reserved, a request gets delayed to minimize the chance of collision.

Figure 5: Probability distribution of the overall latency of a request resulting in a data reply.

Hints in collision resolution:

When packets collide, each sender retries with the exponential back-off algorithm that tries to balance the wait time and the probability of secondary collisions (Section 4.3.2). However, the design of the algorithm assumes no coordination among the senders. Indeed, the senders do not even know the packet is involved in a collision until cycles after the fact, nor do they know the identities of the other parties involved.

In the case of the data packet lane, the receiver knows of the collision early, immediately after receiving the header that encodes PID and PID‾. It can thus send a no-collision notification to the sender before the slot is over. The absence of this notification is an indication that a collision has occurred. Moreover, even though in a collision the PID and PID‾ are corrupted and only indicate a super-set of potential transmitters,⁶ the receiver has

⁴ For example, the composition of packets (requests, data replies, forwarded requests, memory fetches, etc.), the percentage of meta and data packets that are on the critical path, and the average number of expected retries in a back-off algorithm.

⁵ For instance, in a sequentially consistent system, any load (to the invalidated cache line) following that externally visible transaction needs to reflect the effect of the invalidation and replay if it is speculatively executed out of order. For a practical implementation, we freeze the retirement of any memory instructions until we have applied all pending invalidations in the input packet queue and performed the necessary replays [37].

⁶ Clearly, for small-scale networks, one could use a bit vector encoding of PID and thus allow the receiver to definitively identify the colliding parties all the time.
the benefit of additional knowledge of the potential candidates – those nodes that are expected to send a data packet reply. Based on this knowledge, the receiver can select one transmitting node as the winner of the right to re-transmit immediately in the next slot. This selection is beamed back through a notification signal (via the confirmation laser) to the winner only. All other nodes that have not received this notification will avoid the next slot and start the retransmission with back-off process from the slot after the next. This way, the winning node suffers a minimal extra delay and the remaining nodes will have less retransmission contention. Note that this whole process is probabilistic and the notification is only used as a hint.

Finally, we note that packet collisions are ultimately infrequent. So a scheduling-based approach that avoids all possible collisions does not seem beneficial, unless the scheduling overhead is extremely low.

6. EXPERIMENTAL ANALYSIS

The proposed intra-chip free-space optical interconnect has many different design tradeoffs compared with a conventional wire-based interconnect or newer proposals of optical versions. Some of these tradeoffs cannot be easily expressed in quantitative terms, and are discussed in the architectural design and later in the related work section. Here, we attempt to demonstrate that the proposed design offers ultra-low latency, excellent scalability, and superior energy efficiency. We also show that accepting collisions does not necessitate drastic bandwidth over-provisioning.

6.1 Experimental Setup

We use an execution-driven simulator to model in great detail the coherence substrate, the processor microarchitecture, the communication substrate, and the power consumption of both a 16-way and a 64-way chip-multiprocessor (CMP). We leave the details of the simulator to [9] and only show the system configuration in Table 2.

Processor core:
- Fetch/Decode/Commit: 4 / 4 / 4
- ROB: 64 entries
- Functional units: INT 1+1 mul/div, FP 2+1 mul/div
- Issue Q/Reg. (int, fp): (16, 16) / (64, 64)
- LSQ (LQ, SQ): 32 (16, 16), 2 search ports
- Branch predictor: Bimodal + Gshare; Gshare: 8K entries, 13-bit history; Bimodal/Meta/BTB: 4K/8K/4K (4-way) entries
- Branch misprediction penalty: at least 7 cycles
- Process specifications: feature size 45nm, frequency 3.3 GHz, Vd 1 V

Memory hierarchy:
- L1 D cache (private): 8KB [38], 2-way, 32B line, 2 cycles, 2 ports, dual tags
- L1 I cache (private): 32KB, 2-way, 64B line, 2 cycles
- L2 cache (shared): 64KB slice/node, 64B line, 15 cycles, 2 ports
- Directory request queue: 64 entries
- Memory channel: 52.8 GB/s bandwidth, memory latency 200 cycles
- Number of channels: 4 in 16-node system, 8 in 64-node system
- Prefetch logic: stream prefetcher
- Network packets: flit size 72-bit, data packet 5 flits, meta packet 1 flit
- Wired interconnect: 4 VCs; latency: router 4 cycles, link 1 cycle; buffers: 5x12 flits

Optical interconnect (each node):
- VCSEL: 40 GHz, 12 bits per CPU cycle
- Array: dedicated (16-node), phase-array w/ 1-cycle setup delay (64-node)
- Lane widths: 6/3/1 bit(s) for data/meta/confirmation lane
- Receivers: 2 data (6b), 2 meta (3b), 1 confirmation (1b)
- Outgoing queue: 8 packets each for data and meta lanes

Table 2: System configuration.

Evaluation is performed using a suite of parallel applications [9] including the SPLASH2 benchmark suite [38], a program to solve an electromagnetic problem in 3 dimensions (em3d), a parallel genetic linkage analysis program (ilink), a program to iteratively solve partial differential equations (jacobi), a 3-dimensional particle simulator (mp3d), a shallow water benchmark from the National Center for Atmospheric Research to solve differential equations on a two-dimensional grid for weather prediction (shallow), and a branch-and-bound based implementation of the non-polynomial (NP) traveling salesman problem (tsp). We follow the observations in [38] to scale down the L1 cache to mimic realistic cache miss rates.⁷

6.2 Performance Analysis

We start our evaluation with the performance analysis of the proposed interconnect. We model a number of conventional interconnect configurations for comparison. To normalize performance, we use a baseline system with canonical 4-cycle routers. Note that while the principles of conventional routers and even newer designs with shorter pipelines are well understood, practical designs require careful consideration of flow control, deadlock avoidance, QoS, and load-balancing, and are by no means simple and easy to implement. For instance, the router in the Alpha 21364 has hundreds of packet buffers and occupies a chip area equal to 20% of the combined area of the core and 128KB of L1 caches. The processing by the router itself adds 7 cycles of latency [39]. Nevertheless, we provide comparison with conventional interconnects with aggressive latency assumptions.

In Figure 6-(a), we show the average latency of transferring a packet in our free-space optical interconnect and in the baseline mesh interconnect. Latency in the optical interconnect is further broken down into queuing delay, intentionally scheduled delay to minimize collision, the actual network delay, and collision resolution delay. Clearly, even with the overhead of collision and its prevention, the overall delay of 7.5 cycles is very low.

Figure 6: Performance of 16-node systems. (a) Total packet latency in the free-space optical interconnect (left) broken down into 4 components (queuing delay, scheduling delay, network latency, and collision resolution delay) and the conventional mesh (right). (b) Speedups of free-space optical interconnect (FSOI) and various configurations of conventional mesh relative to the baseline.

⁷ Our studies also show that not scaling down the cache size only affects the quantitative results marginally [9].
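The header-based collision detection recapped earlier works by sending the sender ID (PID) together with its bitwise complement; simultaneous free-space transmissions merge into a logical "OR" at the receiver, so a collision shows up as some bit position reading 1 in both fields, and the merged header only constrains the senders to a super-set. A small illustrative Python sketch of this logic (the function names and the 6-bit ID width used below are our choices for illustration, not the paper's hardware):

```python
def encode_header(node_id: int, n_bits: int) -> list[int]:
    """Header = sender ID (PID) followed by its bitwise complement."""
    pid = [(node_id >> i) & 1 for i in range(n_bits)]
    return pid + [b ^ 1 for b in pid]

def channel_or(headers: list[list[int]]) -> list[int]:
    """Simultaneous transmissions behave like a logical OR at the receiver."""
    return [int(any(bits)) for bits in zip(*headers)]

def is_collision(received: list[int], n_bits: int) -> bool:
    """If any bit position is 1 in both the PID field and the complement
    field, more than one header was merged: a collision."""
    pid, comp = received[:n_bits], received[n_bits:]
    return any(p and c for p, c in zip(pid, comp))

def candidate_senders(received: list[int], n_bits: int, nodes) -> list[int]:
    """The OR-merged header only identifies a super-set of senders: any
    node whose header is bitwise covered by the received pattern."""
    return [n for n in nodes
            if all(r >= h for r, h in zip(received, encode_header(n, n_bits)))]
```

For example, with 6-bit IDs, OR-merging the headers of nodes 5 and 6 trips the collision check (bit 0 reads 1 in both fields), and `candidate_senders` returns a super-set containing 5 and 6 along with other compatible IDs, which is exactly the ambiguity the hint mechanism has to work around.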
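The retransmission back-off of Section 4.3.2 sets the window for the r-th retry to Wr = W × B^(r−1), where the base B need not be an integer. The following toy Monte Carlo is a deliberately simplified stand-in for the paper's analytical model: it assumes exactly two senders that have just collided and ignores background traffic, but it shows how candidate (W, B) pairs can be compared cheaply:

```python
import random

def resolve_two(W=2.7, B=1.1, confirm_delay=2, rng=random):
    """Two senders have just collided.  Each retries in a uniformly random
    slot of a window that grows as W * B**(r-1) for retry r.  Returns the
    slots elapsed until one packet gets through, plus the fixed
    confirmation-laser delay."""
    elapsed, r = 0, 1
    while True:
        w = max(1, round(W * B ** (r - 1)))   # current window size
        a, b = rng.randrange(w), rng.randrange(w)
        if a != b:                            # distinct slots: resolved
            return elapsed + min(a, b) + 1 + confirm_delay
        elapsed += w                          # repeat collision: window wasted
        r += 1

def avg_delay(W, B, trials=20000, seed=1):
    """Average resolution delay over many independent trials."""
    rng = random.Random(seed)
    return sum(resolve_two(W, B, rng=rng) for _ in range(trials)) / trials
```

Comparing `avg_delay(2.7, 1.1)` against `avg_delay(2.7, 2.0)` under this toy model gives a feel for the window-growth trade-off; the paper's chosen operating point (W = 2.7, B = 1.1) comes from its full model with measured transmission probabilities, not from this sketch.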
The application speedups are shown in Figure 6-(b). We use the ultimate execution time⁸ of the applications to compute speedups against the baseline using a conventional mesh interconnect. For relative comparison, we model a number of conventional configurations: L0, Lr1, and Lr2. In L0, the transmission latency is idealized to 0 and only the throughput is modeled. In other words, the only delay a packet experiences is the serialization delay (1 cycle for meta packets and 5 cycles for data packets) and any queuing delay at the source node. L0 is essentially an idealized interconnect. Lr1 and Lr2 represent the cases where the overall latency accounts for the number of hops traveled: each hop consumes 1 cycle for link traversal and 1 or 2 cycles, respectively, for router processing. Like in L0, we do not model any contention or delays inside the network. Thus, they only serve to illustrate (loose) performance upper bounds when aggressively designed routers are used.

While the performance gain varies from application to application, our design tracks the ideal L0 configuration well, achieving a geometric mean speedup of 1.36 versus the ideal's 1.43. It also outperforms the aggressive Lr1 (1.32) and Lr2 (1.22) configurations.

Although a mesh interconnect is scalable in terms of the aggregate bandwidth provided, latency worsens as the network scales up. In comparison, our design offers a direct-communication system that is scalable while maintaining low latency. The simulation results of the 64-node CMP are shown in Figure 7.

Figure 7: Performance of 64-node systems.

As expected, latency in the mesh interconnect increases significantly. The latency does increase in our network too, from 7.5 cycles (16-node system) to 12.6 cycles. However, in addition to the 1-cycle phase array setup delay, much of this increase is due to an increase of 2.7 cycles (from 1.4 to 4.1 cycles) in queuing delays on average. In certain applications (e.g., raytrace), the increase is significant. This increase in queuing delays is not a result of an interconnect scalability bottleneck, but rather a result of how the interconnect is used in applications with a larger number of threads. For example, having more sharers means more invalidations that cause large temporary queuing delays. Indeed, the queuing delay of 4.1 cycles in our system is only marginally higher than the 3.1 cycles in the ideal L0 configuration.

Understandably, the better scalability led to wider performance gaps between our optical interconnect and the non-ideal mesh configurations. The speedup of our FSOI continues to track that of the ideal L0 configuration (with a geometric mean of 1.75 vs 1.91), and pulls further ahead of those of Lr1 (1.55) and Lr2 (1.29). Not surprisingly, interconnect-bound applications show more significant benefits. If we take the eight applications that experience above-average performance gain from the ideal L0, the geometric mean of their speedup in FSOI is 2.30, compared to L0's 2.59 and Lr1's 1.92.

To summarize, the proposed interconnect offers an ultra-low communication latency and maintains a low latency as the system scales up. The system outperforms aggressively configured packet-switched interconnects, and the performance gap is wider for larger-scale systems and for applications whose performance has a higher dependence on the interconnect.

6.3 Energy Consumption Analysis

We have also performed a preliminary analysis of the energy characteristics of the proposed interconnect. Figure 8 shows the total energy consumption of the 16-node system normalized to the baseline configuration using mesh. Our direct communication substrate avoids the inherent inefficiency of repeated buffering and processing in a packet-switched network. Thanks to the integrated VCSELs, we can keep them powered off when not in use. This leads to an insignificant 1.8W of average power consumption in the optical interconnect subsystem. The overall energy consumption in the interconnect is 20X smaller than that in a mesh-based system. The faster execution also saves energy overhead elsewhere. On average, our system achieves a 40.6% energy savings. The reduction in energy consumption outstrips the reduction in execution time, resulting in a 22% reduction in average power: 156W for the conventional system and 121W for our design. The energy-delay product of FSOI is 2.7X (geometric mean) better than baseline in the 16-node system and 4.4X better in the 64-node system.

Figure 8: Energy relative to baseline mesh interconnect.

6.4 Analysis of Optimization Effectiveness

Meta packet collision reduction:

Our design does not rely on any arbiter to coordinate the distributed communication, making the system truly scalable. The tradeoff is the presence of occasional packet collisions. Several mechanisms are used to reduce the collision probability. The most straightforward of these mechanisms is using more receivers. We use 2 receivers per lane. Our

⁸ For applications too long to finish, we measure the same workload, e.g., between a fixed number of barrier instances.
detailed simulations show that this indeed roughly reduces collisions by half in both cases, as predicted by the simplified theoretical calculation and Monte Carlo simulations. This partly validates the use of simpler analytical means to make early design decisions.

Leveraging confirmation signals:

Using the confirmation of successful invalidation delivery as a substitute for an explicit acknowledgment packet is a particularly effective approach to further reduce unnecessary traffic and collisions. Figure 9 shows the impact of this optimization. The figure represents each application by a pair of points. The coordinates show the packet transmission probability and the collision rate of the meta packet lane.

Figure 9: Change in packet transmission probability and collision rate with and without the optimization of using the confirmation signal to substitute for acknowledgment. For clarity, the applications are separated into two distinctive regions.

In general, as we reduce the number of packets (acknowledgments), we reduce the transmission probability and naturally the collision rate. However, if the reduction of the transmission probability were the only factor in reducing collisions, the movement of the points would follow the slope of the curve, which shows the theoretical collision rate given a transmission probability. Clearly, the reduction in collisions is much sharper than can be attributed simply to the reduction of packets. This is because a burst of invalidation messages leads to acknowledgments coming back at approximately the same time, which are much more likely to collide than predicted by a theory assuming independent messages. Indeed, after eliminating these "quasi-synchronized" packets, the points move much closer to the theoretical predictions. Clearly, avoiding these acknowledgments is particularly helpful. Note that, because of this optimization, some applications speed up and the per-cycle transmission probability actually increases. Overall, this optimization reduces traffic by only 5.1% but eliminates about 31.5% of meta packet collisions.

Confirmation can also be used to speed up the dissemination of boolean variables used in load-linked and store-conditional. Other than latency reduction, we also cut down the packets transmitted over regular channels. Clearly, the impact of this optimization depends on the synchronization intensity of the application. Some of our codes have virtually no locks or barriers in the simulated window. Seven applications have non-trivial synchronization activities in the 64-way system. For these applications, the optimization reduces data and meta packets sent by an average of 8% and 11%, respectively, and achieves a speedup of 1.07 (geometric mean). Note that the benefit comes from the combination of fast optical signaling and leveraging the confirmation mechanism that is already in place. A similar optimization in a conventional network still requires sending full-blown packets, resulting in negligible impact.

Data packet collision reduction:

We also looked at a few ways to reduce collisions in the data lane. These techniques include probabilistically scheduling the receiver for the incoming replies, applying split transactions for writebacks to minimize unexpected data packets, and using hints to coordinate retransmissions (Section 5.2). Figure 10 shows the breakdown of the types of collisions in the data packet lane with and without these optimizations. The result shows the general effectiveness of the techniques: about 38% of all collisions are avoided.

Figure 10: Breakdown of data packet collisions by type: involving memory packets (Memory packets), between replies (Reply), involving writebacks (Writeback), and involving re-transmitted packets (Retransmission). The left and the right bars show the result without and with the optimizations, respectively. The collision rate for data packets ranges from 3.0% to 21.2%, with an average of 9.4%. After optimization, the collision rate is between 1.2% and 12.2%, with an average of 5.8%.

Data packet collision resolution hint:

As discussed in Section 5.2, when a data lane collision happens we can guess the identities of the senders involved. From the simulations, we can see that, based on the information about potential senders and the corrupted pattern of PID and PID‾, we can correctly identify a colliding sender 94% of the time. Even for the rest of the time, when we mis-identify the sender, it is usually harmless: if the mis-identified node is not sending any data packet at the time, it simply ignores the hint. Overall, the hints are quite accurate and, on average, only 2.3% of the hints cause a node to wrongly believe it is selected as a winner to re-transmit. As a result, the hint improves the collision resolution latency from an average of 41 cycles to about 29 cycles.

Finally, note that all these measures that reduce collisions may not lead to significant performance gains when the collision probability is low. Nevertheless, these measures lower the probability of collisions when traffic is high and thus improve the resource utilization and the performance robustness of the system.

6.5 Sensitivity Analysis

As discussed before, we need to over-provision the network capacity to avoid excessive collisions in our design. However, such over-provisioning is not unique to our design. Packet-switched interconnects also need capacity margins to avoid excessive queuing delays, an increased chance of network deadlocks, etc. In our comparison so far, the aggregate bandwidth of the conventional network and of our design are comparable: the configuration in the optical network design has about half the transmitting bandwidth and roughly the same receiving bandwidth as the baseline conventional mesh. To understand the sensitivity of the system performance to the communication bandwidth provided, we progressively reduce the bandwidth until it is halved. For our design, this
involves reducing the number of VCSELs, rearranging them between the two lanes, and adjusting the cycle-slotting as the serialization latency for packets increases.⁹ Figure 11 shows the overall performance impact. Each network's result is normalized to that of its full-bandwidth configuration. For brevity, only the average slowdown over all applications is shown.

Figure 11: Performance impact due to reduction in bandwidth.

We see that both interconnects demonstrate noticeable performance sensitivity to the communication bandwidth provided. In fact, our system shows less sensitivity. In other words, both interconnects need to over-provision bandwidth to achieve low latency and high execution speed. The issue that higher traffic leads to a higher collision rate in our proposed system is no more significant than factors such as queuing delays in a packet-relaying interconnect; it does not demand drastic over-provisioning. In the configuration space that we are likely to operate in, collisions are reasonably infrequent and accepting them is a worthwhile tradeoff. Finally, thanks to the superior energy efficiency of the integrated optical signaling chain, bandwidth provisioning is rather affordable energy-wise.

fort. In contrast to both designs, our solution does not rely on any optical switch component.

Among the enabling technologies of our proposed design, free-space optics have been discussed in general terms in [3, 41]. There are also discussions of how free-space optics can serve as a part of the global backbone of a packet-switched interconnect [42] or as an inter-chip communication mechanism (e.g., [43]). On the integration side, leveraging 3D integration to build on-chip optoelectronic circuits has also been mentioned as an elegant solution to address various integration issues [6].

Many proposals exist that use a globally shared medium for the optical network and use the multiple wavelengths available in an optical medium to compensate for the network topology's non-scalable nature. [44] discussed dividing the channels and using some for coherence broadcasts. [7] also uses broadcasts on the shared bus for coherence. A recent design from HP [13, 45] uses a microring-based EO modulator to allow fast token-ring arbitration of access to the shared medium. A separate channel is also reserved for broadcast. Such wavelength division multiplexing (WDM) schemes have been proven highly effective in long-haul fiber-optic communications and inter-chip interconnects [46, 47]. However, as discussed in Section 2, there are several critical challenges in adopting these WDM systems for intra-chip interconnects: the need for stringent control of device geometry and runtime conditions; practical limits on the number of devices that can be allowed on a single waveguide before the insertion loss becomes prohibitive; and the large hidden cost of an external multi-wavelength laser.

In summary, while nano-photonic devices provide tremendous possibilities, integrating them into microprocessors at scale is not straightforward. Network and system level solu
tions and optimizations are a necessary venue to relax the
7. RELATED WORK demands on devices.
The effort to leverage optics for on-chip communication
spans multiple disciplines and there is a vast body of related 8. CONCLUSION
work, especially on the physics side. Our main focus in this While optics are believed to be a promising long-term so-
paper is to address the challenge in building a scalable inter- lution to address the worsening processor interconnect prob-
connect for general-purpose chip-multiprocessors, and doing lem as technology scales, significant technical challenges re-
so without relying on repeated O/E and E/O conversions main to allow scalable optical interconnect using conven-
or future breakthroughs that enable efficient pure-optical tional packet switching technology. In this paper, we have
packet switching. In this regards, the most closely related proposed a scalable, fully-distributed interconnect based on
design that we are aware of is [4]. free-space optics. The design leverages a suite of matur-
In [4], packets do not need any buffering (and thus conver- ing technologies to build an architecture that supports a
sions) at switches within the Omega network because when direct communication mechanism between nodes and does
a conflict occurs at any switch, one of the contenders is not rely on any packet switching functionality and thus side-
dropped. Even though this design addresses part of the steps the challenges involved in implementing efficient opti-
challenge of optical packet switching by removing the need cal switches. The tradeoff is the occasional packet collisions
to buffer a packet, it still needs high-speed optical switches from uncoordinated packet transmissions. The negative im-
to decode the header of the packet in a just-in-time fashion in pact of collisions is minimized by careful architecting of the
order to allow the rest of the packet to be switched correctly interconnect and novel optimizations in the communication
to the next stage. In a related design [40], a circuit-switched and coherence substrates of the multiprocessor.
photonic network relies on an electrical interconnect to route Based on parameters extracted from device and circuit
special circuit setup requests. Only when an optical route is simulations, we have performed faithful architectural sim-
completely set up can the actual transfer take place. Clearly, ulations with detailed modeling of the microarchitecture,
only bulk transfers can amortize the delay of the setup ef- the memory subsystems, the communication substrate, and
9
the coherence substrates to study the performance and en-
For easier configuration of the optical network, we use a ergy metrics of the design. The study shows that compared
slightly different base configuration for normalization. In to conventional electrical interconnect, our design provides
this configuration, both data and meta lanes have 6 VCSELs
and as a result, the serialization latency for a meta packet good performance (superior than even the most aggressively
and a data packet is 1 and 5 cycles respectively – the same configured mesh interconnect), better scalability, and a far
as in the mesh networks. better energy efficiency. With the proposed architectural
optimizations to minimize the negative consequences of collisions, the design is also shown to be rather insensitive to bandwidth capacity. Overall, we believe the proposed ideas point to promising design spaces for further exploration.

REFERENCES
[1] SIA. International Technology Roadmap for Semiconductors. Technical report, 2008.
[2] J. Goodman et al. Optical Interconnections for VLSI Systems. Proc. IEEE, 72:850–866, Jul. 1984.
[3] D. Miller. Optical Interconnects to Silicon. IEEE J. of Selected Topics in Quantum Electronics, 6(6):1312–1317, Nov/Dec 2000.
[4] A. Shacham and K. Bergman. Building Ultralow-Latency Interconnection Networks Using Photonic Integration. IEEE Micro, 27(4):6–20, July/August 2007.
[5] Y. Vlasov, W. Green, and F. Xia. High-Throughput Silicon Nanophotonic Wavelength-Insensitive Switch for On-Chip Optical Networks. Nature Photonics, (2):242–246, March 2008.
[6] R. Beausoleil et al. Nanoelectronic and Nanophotonic Interconnect. Proceedings of the IEEE, February 2008.
[7] N. Kirman et al. Leveraging Optical Technology in Future Bus-based Chip Multiprocessors. In Proc. Int'l Symp. on Microarch., pages 492–503, December 2006.
[8] M. Haurylau et al. On-Chip Optical Interconnect Roadmap: Challenges and Critical Directions. IEEE J. Sel. Quantum Electronics, (6):1699–1705, 2006.
[9] J. Xue et al. An Intra-Chip Free-Space Optical Interconnect: Extended Technical Report. Technical report, Dept. of Electrical & Computer Engineering, Univ. of Rochester, April 2010. http://www.ece.rochester.edu/~mihuang/.
[10] L. Liao et al. High Speed Silicon Mach-Zehnder Modulator. Opt. Express, 13(8):3129–3135, 2005.
[11] Q. Xu et al. Micrometre-scale Silicon Electro-optic Modulator. Nature, 435(7040):325–327, May 2005.
[12] S. Manipatruni et al. Wide Temperature Range Operation of Micrometer-scale Silicon Electro-Optic Modulators. Opt. Lett., 33(19):2185–2187, 2008.
[13] R. Beausoleil et al. A Nanophotonic Interconnect for High-Performance Many-Core Computation. IEEE LEOS Newsletter, June 2008.
[14] W. Bogaerts et al. Low-loss, Low-cross-talk Crossings for Silicon-on-insulator Nanophotonic Waveguides. Opt. Lett., 32(19):2801–2803, 2007.
[15] R. Michalzik and K. Ebeling. Vertical-Cavity Surface-Emitting Laser Devices, chapter 3, pages 53–98. Springer, 2003.
[16] K. Yashiki et al. 1.1-µm-Range Tunnel Junction VCSELs with 27-GHz Relaxation Oscillation Frequency. In Proc. Optical Fiber Communications Conf., pages 1–3, 2007.
[17] Y. Chang, C. Wang, and L. Coldren. High-efficiency, High-speed VCSELs with 35 Gbit/s Error-free Operation. Elec. Lett., 43(19):1022–1023, 2007.
[18] B. Ciftcioglu et al. 3-GHz Silicon Photodiodes Integrated in a 0.18-µm CMOS Technology. IEEE Photonics Tech. Lett., 20(24):2069–2071, Dec. 15 2008.
[19] A. Chin and T. Chang. Enhancement of Quantum Efficiency in Thin Photodiodes through Absorptive Resonance. J. Vac. Sci. and Tech., (339), 1991.
[20] G. Ortiz et al. Monolithic Integration of In0.2Ga0.8As Vertical-cavity Surface-emitting Lasers with Resonance-enhanced Quantum-well Photodetectors. Elec. Lett., (1205), 1996.
[21] S. Park et al. Microlensed Vertical-cavity Surface-emitting Laser for Stable Single Fundamental Mode Operation. Applied Physics Lett., 80(2):183–185, 2002.
[22] K. Chang, Y. Song, and Y. Lee. Self-Aligned Microlens-Integrated Vertical-Cavity Surface-Emitting Lasers. IEEE Photonics Tech. Lett., 18(21):2203–2205, Nov. 1 2006.
[23] E. Strzelecka et al. Monolithic Integration of Vertical-cavity Laser Diodes with Refractive GaAs Microlenses. Electronics Lett., 31(9):724–725, Apr. 1995.
[24] D. Louderback et al. Modulation and Free-space Link Characteristics of Monolithically Integrated Vertical-cavity Lasers and Photodetectors with Microlenses. IEEE J. of Selected Topics in Quantum Electronics, 5(2):157–165, Mar/Apr 1999.
[25] S. Chou et al. Sub-10 nm Imprint Lithography and Applications. J. Vac. Sci. Tech. B., 15:2897–2904, 1997.
[26] M. Austin et al. Fabrication of 5 nm Linewidth and 14 nm Pitch Features by Nanoimprint Lithography. Appl. Phys. Lett., 84:5299–5301, 2004.
[27] K. Banerjee et al. On Thermal Effects in Deep Sub-Micron VLSI Interconnects. Proc. of the IEEE/ACM Design Automation Conf., pages 885–890, Jun. 1999.
[28] D. Tuckerman and R. Pease. High Performance Heat Sinking for VLSI. IEEE Electron Device Lett., 2(5):126–129, May 1981.
[29] B. Dang. Integrated Input/Output Interconnection and Packaging for GSI. PhD thesis, Georgia Inst. of Tech., 2006.
[30] A. Balandin. Chill Out. IEEE Spec., 46(10):34–39, Oct. 2009.
[31] C. Hammerschmidt. IBM Brings Back Water Cooling Concepts. EE Times, June 2009. http://www.eetimes.com/showArticle.jhtml?articleID=218000152.
[32] C. Hammerschmidt. IBM, ETH Zurich Save Energy with Water-Cooled Supercomputer. EE Times, June 2009. http://eetimes.eu/showArticle.jhtml?articleID=218100798.
[33] P. McManamon et al. Optical Phased Array Technology. Proc. of the IEEE, 84(2):268–298, Feb. 1996.
[34] D. Culler and J. Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999.
[35] L. Roberts. ALOHA Packet System With and Without Slots and Capture. ACM SIGCOMM Computer Communication Review, 5(2):28–42, April 1975.
[36] R. Metcalfe and D. Boggs. Ethernet: Distributed Packet Switching for Local Computer Networks. Communications of the ACM, 26(1):90–95, January 1983.
[37] K. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28–40, April 1996.
[38] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proc. Int'l Symp. on Comp. Arch., pages 24–36, June 1995.
[39] S. Mukherjee, P. Bannon, S. Lang, A. Spink, and D. Webb. The Alpha 21364 Network Architecture. IEEE Micro, 22(1):26–35, January 2002.
[40] A. Shacham, K. Bergman, and L. Carloni. On the Design of a Photonic Network-on-Chip. In Proc. First Int'l Symp. on Networks-on-Chip, pages 53–64, May 2007.
[41] A. Krishnamoorthy and D. Miller. Firehose Architectures for Free-Space Optically Interconnected VLSI Circuits. Journal of Parallel and Distributed Computing, 41:109–114, 1997.
[42] P. Marchand et al. Optically Augmented 3-D Computer: System Technology and Architecture. Journal of Parallel and Distributed Computing, 41:20–35, 1997.
[43] A. Walker et al. Optoelectronic Systems Based on InGaAs Complementary-Metal-Oxide-Semiconductor Smart-Pixel Arrays and Free-Space Optical Interconnects. Applied Optics, 37(14):2822–2830, May 1998.
[44] J. Ha and T. Pinkston. SPEED DMON: Cache Coherence on an Optical Multichannel Interconnect Architecture. Journal of Parallel and Distributed Computing, 41:78–91, 1997.
[45] D. Vantrease et al. Corona: System Implications of Emerging Nanophotonic Technology. In Proc. Int'l Symp. on Comp. Arch., June 2008.
[46] E. de Souza et al. Wavelength-division Multiplexing with Femtosecond Pulses. Opt. Lett., 20(10):1166, 1995.
[47] B. Nelson et al. Wavelength Division Multiplexed Optical Interconnect Using Short Pulses. IEEE J. of Selected Topics in Quantum Electronics, 9(2):486–491, Mar/Apr 2003.