Isca 10
Jing Xue, Alok Garg, Berkehan Ciftcioglu, Jianyun Hu, Shang Wang,
Ioannis Savidis, Manish Jain†, Rebecca Berman†, Peng Liu,
Michael Huang, Hui Wu, Eby Friedman, Gary Wicks†, Duncan Moore†
Dept. Electrical & Computer Engineering and † Institute of Optics
University of Rochester, Rochester, NY 14627, USA
[Figure 1 panels: (a) Side view (mirror-guided only); (b) Side view (with phase array beamforming); (c) Top view. Labeled elements: free-space optics layer with micro-mirrors; GaAs photonics layer with micro-lenses, VCSELs, PDs, and (in panel b) the OPA on the GaAs substrate; flip-chip bonding; CMOS electronics layer on the Si substrate with through-silicon vias; package.]
Figure 1: Illustration of the overall interconnect structure and 3-D integrated chip stack. (a) and (b) also show two different optics
configurations. In the top view (c), the VCSEL arrays are in the center and the photodetectors are on the periphery within each core.
Without packet switching, this design eliminates the intermediate routing and buffering delays and makes the signal propagation delay approach the ultimate lower bound, i.e., the speed of light. These links can operate at a much higher speed than core logic, making it easy to provide high throughput. On the energy efficiency front, bypassing packet relaying clearly keeps energy cost low. Compared to waveguided optical interconnect, FSOI also avoids the loss and cross-talk associated with modulators and waveguide crossings. In the future, by utilizing the beam-steering capability of an optical phase array (OPA) of lasers, the number of lasers and photodetectors in each node can be constant, providing crucial scalability.
3.1 Lasers and Photodetectors
The lasers used in this FSOI system are vertical-cavity surface-emitting lasers (VCSELs) [15]. A VCSEL is a nanoscale heterostructure, consisting of an InGaAs quantum well active region, a resonant cavity constructed with top and bottom dielectric mirrors (distributed Bragg reflectors), and a pn junction structure for carrier injection. They are fabricated on a GaAs substrate using molecular beam epitaxy (MBE) or metal-organic chemical vapor deposition (MOCVD). A VCSEL is typically a mesa structure several microns in diameter and height. A large 2-D array with millions of VCSELs can be fabricated on the same GaAs chip. The light can be emitted from the top of the VCSEL mesa. Alternatively, at optical wavelengths of 980 nm and longer, where the GaAs substrate is transparent, the VCSELs can also be made to emit from the back side and then through the GaAs substrate. A VCSEL's optical output can be directly modulated by its current, and the modulation speed can reach tens of Gbps [16, 17].
The photodetectors can be either integrated on the CMOS chip as silicon p-i-n photodiodes [18], or fabricated on the same GaAs chip using the VCSELs as resonant cavity photodiodes [19, 20]. In the latter case, an InGaAs active region is enhanced by the resonant cavity similar to a VCSEL, and the devices offer a larger bandwidth and are well suited for this FSOI system.
3.2 Micro-lenses and Micro-mirrors
In the free-space optical interconnect, passive micro-optics devices such as micro-lenses and micro-mirrors collimate, guide, and focus the light beams in free space. Collimating and focusing allow smaller VCSELs and PDs to be used, which reduces their parasitic capacitance and improves their bandwidth. Micro-lenses can be fabricated either on top of VCSELs when the latter are top emitting [21, 22], or on the backside of the GaAs substrate for substrate-emitting VCSELs [23, 24].
Micro-mirrors will be fabricated on silicon or polymer by micro-molding techniques [25, 26]. Note that commercial micro-mirror arrays (e.g., Digital Micromirror Device chips from Texas Instruments) have mirrors that can turn on and off thousands of times per second and are in full HD density (millions of pixels). Our application requires only fixed mirrors at the scale of at most n^2 (n is the number of nodes).
3.3 3-D Integration and Thermal Issues
In this FSOI system, 3-D integration technologies are applied to electrically connect the free space and photonics layers with the electronics layer, forming an electro-optical system-in-package (SiP). For example, the GaAs chip is flip-chip bonded to the back side of the silicon chip, and connected to the transceiver circuits there using through-silicon vias (TSVs). Note that the silicon chip is flip-chip bonded to the package in a normal fashion. In general, such an electro-optical SiP reduces the latency and power consumption of the global signaling through optical interconnect, while permitting the microprocessors to be implemented using standard CMOS technologies. Significant work has explored merging various analog, digital, and memory technologies in a 3-D stack. Adding an optical layer to the 3-D stack is the next logical step to improve overall system performance.
Thermal problems have long been a major issue in 2-D integrated circuits, degrading both maximum achievable speed and reliability [27]. By introducing a layer of free space, our proposed design further adds to the challenge of air cooling. However, even without this free space layer, continued scaling and the trend towards 3-D integration are already making air cooling increasingly insufficient, as demonstrated by researchers that explored alternative heat removal techniques for stacked 3-D systems [28–30].
One such technique delivers liquid coolants to microchannel heat sinks on the back side of each chip in the 3-D stack using fluidic TSVs [28]. Fluidic pipes [29] are used to propagate heat produced by the devices to the microchannel heat sinks. The heat is further dissipated through external fluidic tubes that can be located on either side of the 3-D stack.
A second technique exploits the advanced thermal conductive properties of emerging materials. Materials such as diamond, carbon nanotubes, and graphene have been proposed for heat removal. The thermal conductivity of diamond ranges from 1000 to 2200 W/m·K. Carbon nanotubes have an even higher thermal conductivity of 3000 to 3500 W/m·K, and graphene is reported to be better still [30]. These materials can be used to produce highly heat-conductive paths from the heat sources to the periphery of the 3-D stack through both thermal vias (vertical direction) and in-plane heat spreaders (lateral direction).
In both alternatives, thermal pipes are guided to the side of the 3-D stack, allowing placement of the free space optical system. Finally, we note that replacing air cooling in high-end chips is perhaps not only inevitable but also desirable. For instance, researchers from IBM showed that liquid cooling allows the reuse of the heat, reducing the overall carbon footprint of a facility [31, 32].
4. ARCHITECTURAL DESIGN
4.1 Overall Interconnect Structure
As illustrated in Figure 1, in an FSOI link, a single light beam is analogous to a single wire and similarly, an array of VCSELs can form essentially a multi-bit bus which we call a lane. An interesting feature of using free-space optics is that signaling is not confined to fixed, prearranged waveguides and the optical path can change relatively easily. For instance, we can use a group of VCSELs to form a phase-array [33], essentially a single tunable-direction laser as shown in Figure 1(b). This feature makes an all-to-all network topology much easier to implement.
For small- and medium-scaled chip-multiprocessors, fixed-direction lasers should be used for simplicity: each outgoing lane can be implemented by a dedicated array of VCSELs. In a system with N processors, each having a total of k bits in all lanes, N ∗ (N − 1) ∗ k VCSELs are needed for transmission. Note that even though the number scales with N^2, the actual hardware requirement is far from overwhelming. For a rough sense of scale, for N = 16, k = 9 (our default configuration for evaluation), we need approximately 2000 VCSELs (16 ∗ 15 ∗ 9 = 2160). Existing VCSELs are about 20 µm × 20 µm in dimension [16, 17]. Assuming, conservatively, 30 µm spacing (a 50 µm pitch, or 2500 µm^2 per device), 2000 VCSELs occupy a total area of about 5 mm^2. Note that on the receiving side, we do not use dedicated receivers. Instead, multiple light beams from different nodes share the same receiver. We do not try to arbitrate the shared receivers but simply allow packet collisions to happen. As will be discussed in more detail later, at the expense of having packet collisions, this strategy simplifies a number of other design issues.
4.2 Optical Links
To facilitate the architectural evaluation, a single-bit FSOI link is constructed (Figure 2) and the link performance is estimated for the most challenging scenario: communication across the chip diagonally. Note that the transceiver here is based on a conventional architecture, and can be simplified for lower power dissipation. Since the whole chip is synchronous (e.g., using optical clock distribution), no clock recovery circuit is needed (see footnote 1). The optical wavelength is chosen as 980 nm, which is a good compromise between VCSEL and PD performance. The serialized transmitted data is fed to the laser driver driving a VCSEL. The light from the back-emitting VCSEL is collimated through a microlens on the backside of the 430-µm thick GaAs substrate. Using a device simulator, DAVINCI, and 2007 ITRS device parameters for the 45-nm CMOS technology, the performance and energy parameters of the optical link are calculated and detailed in Table 1.
Footnote 1: There will be delay differences between different optical paths, which can be up to tens of picoseconds, or equivalent to about 3 communication cycles. To maintain chip-wide synchronous operation, we delay the faster paths by padding extra bits in the serializer, and fine tuning the delay using digital delay lines in the transmitter.
Figure 2: Intra-chip FSOI link calculation.
Our transmitter is much less power hungry than a commercial one because (a) more advanced technology (45-nm CMOS) is used; (b) the load is smaller (the integrated VCSEL exhibits a resistance of over 200 Ω, as compared to typically 25 Ω when driving an external laser or modulator); and (c) signal swing is much smaller (the VCSEL voltage swing is about 100 mV instead of several hundred mV). Further, the transmitter goes into standby when not transmitting to save power: the VCSEL is biased below threshold, and the laser driver is turned off. The receiver is kept on all the time. Note that the power dissipation of the serializer in the transmitter and deserializer in the receiver is much smaller compared to that of the laser driver and TIA, and hence is not included in the estimate.
Free-Space Optics
  Trans. distance          2 cm
  Optical wavelength       980 nm
  Optical path loss        2.6 dB
  Microlens aperture       90 µm @ transmitter, 190 µm @ receiver
Transmitter & Receiver
  Laser driver             bandwidth = 43 GHz
  VCSEL                    aperture = 5 µm; parasitic = 235 Ω, 90 fF; threshold = 0.14 mA; extinction ratio = 11:1
  PD                       responsivity = 0.5 A/W; capacitance = 100 fF
  TIA & Limiting amp       bandwidth = 36 GHz; gain = 15000 V/A
Link
  Data rate                40 Gbps
  Signal-to-noise ratio    7.5 dB
  Bit-error-rate (BER)     10^-10
  Cycle-to-cycle jitter    1.7 ps
Power Consumption
  Laser driver             6.3 mW
  VCSEL                    0.96 mW (0.48 mA @ 2 V)
  Transmitter (standby)    0.43 mW
  Receiver                 4.2 mW
Table 1: Optical link parameters.
4.3 Network Design
4.3.1 Tradeoff to Allow Collision
In our system, optical communication channels are built directly between communicating nodes within the network in a totally distributed fashion, without arbitration. An important consequence is that packets destined for the same receiver at the same time will collide. Such collisions require
detection, retransmission, and extra bandwidth margin to prevent them from becoming a significant issue. However, for this one disadvantage, our design allows a number of other significant advantages (and later we will show that no significant over-provisioning is necessary):
• Compared to a conventional crossbar design, we do not need a centralized arbitration system. This makes the design scalable and reduces unnecessary arbitration latency for the common cases.
• Compared to a packet-switched interconnect, this design
1. Avoids relaying and thus repeated O/E and E/O conversions in an optical network;
2. Guarantees the absence of network deadlocks (see footnote 2);
3. Provides point-to-point message ordering in a straight-
[...] a simplified transmission model assuming equal probability of transmission and random destination, the probability of a collision in a cycle in any node can be described as
$$1 - \left[\left(1 - \frac{p}{N-1}\right)^{n} + \binom{n}{1}\,\frac{p}{N-1}\left(1 - \frac{p}{N-1}\right)^{n-1}\right]^{R},$$
where N is the number of nodes; p is the transmission probability of a node; R is the number of receivers (evenly divided among the N − 1 potential transmitters); and n = (N − 1)/R is the number of nodes sharing the same receiver.
Numerical results are shown visually in Figure 3. It is worth noting that the simplifying assumptions do not distort the reality significantly. As can be seen from the plot, experimental results agree well with the trend of theoretical calculations.
[Figure 3 plot: collision probability (vertical axis, up to 30%); series R=1, R=2, R=3, R=4, R=2(meta), R=2(data).]
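As a minimal sketch, the collision model of Section 4.3.1 can be evaluated numerically. The function name `collision_probability` is hypothetical, and treating n = (N − 1)/R as a real number when R does not divide N − 1 evenly (for example N = 16, R = 2) is an assumption made here for illustration; since the binomial coefficient of n over 1 equals n, the code uses n directly.

```python
def collision_probability(p: float, N: int, R: int) -> float:
    """Per-cycle collision probability at one node under the simplified model.

    p: probability that a node transmits in a given cycle
    N: number of nodes
    R: number of receivers per node
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if N < 2 or not 1 <= R <= N - 1:
        raise ValueError("need N >= 2 and 1 <= R <= N - 1")
    # Senders sharing one receiver. Assumption: kept as a real number so
    # that an uneven split of the N - 1 senders can still be evaluated.
    n = (N - 1) / R
    # Chance that a given sender transmits to this node in this cycle.
    q = p / (N - 1)
    # One receiver is collision-free if it sees zero or exactly one packet.
    idle_or_single = (1 - q) ** n + n * q * (1 - q) ** (n - 1)
    # The node is collision-free only if all R receivers are.
    return 1 - idle_or_single ** R


if __name__ == "__main__":
    # Sweep a few transmission probabilities for a 16-node system.
    for R in (1, 2, 3, 4):
        row = ", ".join(
            f"p={p:.2f}: {collision_probability(p, 16, R):.3%}"
            for p in (0.05, 0.10, 0.20)
        )
        print(f"R={R}: {row}")
```

Running the sweep shows how adding receivers lowers the collision probability at a fixed transmission probability, which is the trend the theoretical curves describe.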
Processor core
  Fetch/Decode/Commit      4 / 4 / 4
  ROB                      64
  Functional units         INT 1+1 mul/div, FP 2+1 mul/div
  Issue Q/Reg. (int,fp)    (16, 16) / (64, 64)
  LSQ(LQ,SQ)               32 (16,16), 2 search ports
  Branch predictor         Bimodal + Gshare
  - Gshare                 8K entries, 13 bit history
  - Bimodal/Meta/BTB       4K/8K/4K (4-way) entries
  Br. mispred. penalty     at least 7 cycles
  Process specifications   Feature size: 45 nm, Freq: 3.3 GHz, Vdd: 1 V
Memory hierarchy
  L1 D cache (private)     8KB [38], 2-way, 32B line, 2 cycles, 2 ports, dual tags
  L1 I cache (private)     32KB, 2-way, 64B line, 2 cycles
  L2 cache (shared)        64KB slice/node, 64B line, 15 cycles, 2 ports
  Dir. request queue       64 entries
[Figure 7 panels: (a) Latency; (b) Speedup. Series: FSOI, L0, Lr1, Lr2. Horizontal axis: applications ba, ch, fmm, fft, lu, oc, ro, rx, ray, ws, em, ilink, ja, mp, sh, tsp, plus the average (geometric mean for speedup).]
Figure 7: Performance of 64-node systems.
As expected, latency in mesh interconnect increases sig-
[Figure 9 plot: collision rate versus transmission probability (p), with a baseline marked for meta packets.]
Figure 9: Change in packet transmission probability and collision rate with and without the optimization of using confirmation signal to substitute acknowledgment. For clarity, the applications are separated into two distinctive regions.
[Figure 10 plot: per-application bars for ba, ch, fmm, fft, lu, oc, ro, rx, ray, ws, em, ilink, ja, mp, sh, tsp, and the average.]
Figure 10: Breakdown of data packet collisions by type: involving memory packets (Memory packets), between replies (Reply), involving writebacks (Writeback), and involving re-transmitted packets (Retransmission). The left and the right bars show the result without and with the optimizations, respectively. The collision rate for data packets ranges from 3.0% to 21.2%, with an average of 9.4%. After optimization, the collision rate is between 1.2% and 12.2% with an average of 5.8%.
Confirmation: [...]
In general, as we reduce the number of packets (acknowledgments), we reduce the transmission probability and naturally the collision rate. However, if reduction of the transmission probability is the only factor in reducing collisions, the movement of the points would follow the slope of the curve which shows the theoretical collision rate given a transmission probability. Clearly, the reduction in collision is much sharper than simply due to the reduction of packets. This is because the burst of the invalidation messages sent leads to acknowledgments coming back at approximately the same time and much more likely to collide than predicted by theory assuming independent messages. Indeed, after eliminating these “quasi-synchronized” packets, the points move much closer to the theoretical predictions. Clearly, avoiding these acknowledgments is particularly helpful. Note that, because of this optimization, some applications speed up and the per-cycle transmission probability actually increases. Overall, this optimization reduces traffic by only 5.1% but eliminates about 31.5% of meta packet collisions.
Confirmation can also be used to speed up the dissemination of boolean variables used in load-linked and store-conditional. Other than latency reduction, we also cut down the packets transmitted over regular channels. Clearly, the impact of this optimization depends on synchronization intensity of the application. Some of our codes have virtually no locks or barriers in the simulated window. Seven applications have non-trivial synchronization activities in the 64-way system. For these applications, the optimization reduces data and meta packets sent by an average of 8% and 11%, respectively, and achieves a speedup of 1.07 (geometric mean). Note that the benefit comes from the combination of fast optical signaling and leveraging the confirmation mechanism that is already in place. A similar optimization in a conventional network still requires sending full-blown packets, resulting in negligible impacts.
Data packet collision resolution hint:
As discussed in Section 5.2, when a data lane collision happens we can guess the identities of the senders involved. From the simulations, we can see that based on the information of potential senders and the corrupted pattern of PID and $\overline{\mathrm{PID}}$, we can correctly identify a colliding sender 94% of the time. Even for the rest of the time when we mis-identify the sender, it is usually harmless: if the mis-identified node is not sending any data packet at the time, it simply ignores the hint. Overall, the hints are quite accurate and on average, only 2.3% of the hints cause a node to wrongly believe it is selected as a winner to re-transmit. As a result, the hint improves the collision resolution latency from an average of 41 cycles to about 29 cycles.
Finally, note that all these measures that reduce collisions may not lead to significant performance gain when the collision probability is low. Nevertheless, these measures lower the probability of collisions when traffic is high and thus improve the resource utilization and the performance robustness of the system.
6.5 Sensitivity Analysis
As discussed before, we need to over-provision the network capacity to avoid excessive collisions in our design. However, such over-provisioning is not unique to our design. Packet-switched interconnects also need capacity margins to avoid excessive queuing delays, increased chance of network deadlocks, etc. In our comparison so far, the aggregate bandwidth of the conventional network and of our design are comparable: the configuration in the optical network design has about half the transmitting bandwidth and roughly the same receiving bandwidth as the baseline conventional mesh. To understand the sensitivity of the system performance to the communication bandwidth provided, we progressively reduce the bandwidth until it is halved. For our design, this
involves reducing the number of VCSELs, rearranging them between the two lanes, and adjusting the cycle-slotting as the serialization latency for packets increases (see footnote 9). Figure 11 shows the overall performance impact. Each network's result is normalized to that of its full-bandwidth configuration. For brevity, only the average slowdown of all applications is shown.
Footnote 9: For easier configuration of the optical network, we use a slightly different base configuration for normalization. In this configuration, both data and meta lanes have 6 VCSELs and as a result, the serialization latency for a meta packet and a data packet is 1 and 5 cycles respectively, the same as in the mesh networks.
[Figure 11 plot: relative performance (0.8 to 1) versus relative bandwidth (100% down to 50%) for FSOI and Mesh.]
Figure 11: Performance impact due to reduction in bandwidth.
We see that both interconnects demonstrate noticeable performance sensitivity to the communication bandwidth provided. In fact, our system shows less sensitivity. In other words, both interconnects need to over-provision bandwidth to achieve low latency and high execution speed. The issue that higher traffic leads to higher collision rate in our proposed system is no more significant than factors such as queuing delays in a packet-relaying interconnect; it does not demand drastic over-provisioning. In the configuration space that we are likely to operate in, collisions are reasonably infrequent and accepting them is a worthwhile tradeoff. Finally, thanks to the superior energy efficiency of the integrated optical signaling chain, bandwidth provisioning is rather affordable energy-wise.
7. RELATED WORK
The effort to leverage optics for on-chip communication spans multiple disciplines and there is a vast body of related work, especially on the physics side. Our main focus in this paper is to address the challenge in building a scalable interconnect for general-purpose chip-multiprocessors, and doing so without relying on repeated O/E and E/O conversions or future breakthroughs that enable efficient pure-optical packet switching. In this regard, the most closely related design that we are aware of is [4].
In [4], packets do not need any buffering (and thus conversions) at switches within the Omega network because when a conflict occurs at any switch, one of the contenders is dropped. Even though this design addresses part of the challenge of optical packet switching by removing the need to buffer a packet, it still needs high-speed optical switches to decode the header of the packet in a just-in-time fashion in order to allow the rest of the packet to be switched correctly to the next stage. In a related design [40], a circuit-switched photonic network relies on an electrical interconnect to route special circuit setup requests. Only when an optical route is completely set up can the actual transfer take place. Clearly, only bulk transfers can amortize the delay of the setup effort. In contrast to both designs, our solution does not rely on any optical switch component.
Among the enabling technologies of our proposed design, free-space optics have been discussed in general terms in [3, 41]. There are also discussions of how free-space optics can serve as a part of the global backbone of a packet-switched interconnect [42] or as an inter-chip communication mechanism (e.g., [43]). On the integration side, leveraging 3D integration to build on-chip optoelectronic circuits has also been mentioned as an elegant solution to address various integration issues [6].
Many proposals exist that use a globally shared medium for the optical network and use multiple wavelengths available in an optical medium to compensate for the network topology's non-scalable nature. [44] discussed dividing the channels and using some for coherence broadcasts. [7] also uses broadcasts on the shared bus for coherence. A recent design from HP [13, 45] uses a microring-based EO modulator to allow fast token-ring arbitration of access to the shared medium. A separate channel is also reserved for broadcast. Such wavelength division multiplexing (WDM) schemes have been proven highly effective in long-haul fiber-optic communications and inter-chip interconnects [46, 47]. However, as discussed in Section 2, there are several critical challenges in adopting these WDM systems for intra-chip interconnects: the need for stringent device geometry and runtime condition control; practical limits on the number of devices that can be allowed on a single waveguide before the insertion loss becomes prohibitive; and the large hidden cost of an external multi-wavelength laser.
In summary, while nano-photonic devices provide tremendous possibilities, integrating them into microprocessors at scale is not straightforward. Network and system level solutions and optimizations are a necessary avenue to relax the demands on devices.
8. CONCLUSION
While optics are believed to be a promising long-term solution to address the worsening processor interconnect problem as technology scales, significant technical challenges remain to allow scalable optical interconnect using conventional packet switching technology. In this paper, we have proposed a scalable, fully-distributed interconnect based on free-space optics. The design leverages a suite of maturing technologies to build an architecture that supports a direct communication mechanism between nodes and does not rely on any packet switching functionality and thus side-steps the challenges involved in implementing efficient optical switches. The tradeoff is the occasional packet collisions from uncoordinated packet transmissions. The negative impact of collisions is minimized by careful architecting of the interconnect and novel optimizations in the communication and coherence substrates of the multiprocessor.
Based on parameters extracted from device and circuit simulations, we have performed faithful architectural simulations with detailed modeling of the microarchitecture, the memory subsystems, the communication substrate, and the coherence substrates to study the performance and energy metrics of the design. The study shows that compared to conventional electrical interconnect, our design provides good performance (superior even to the most aggressively configured mesh interconnect), better scalability, and a far better energy efficiency. With the proposed architectural
optimizations to minimize the negative consequences of collisions, the design is also shown to be rather insensitive to bandwidth capacity. Overall, we believe the proposed ideas point to promising design spaces for further exploration.
REFERENCES
[1] SIA. International Technology Roadmap for Semiconductors. Technical report, 2008.
[2] J. Goodman et al. Optical Interconnections for VLSI Systems. Proc. IEEE, 72:850–866, Jul. 1984.
[3] D. Miller. Optical Interconnects to Silicon. IEEE J. of Selected Topics in Quantum Electronics, 6(6):1312–1317, Nov/Dec 2000.
[4] A. Shacham and K. Bergman. Building Ultralow-Latency Interconnection Networks Using Photonic Integration. IEEE Micro, 27(4):6–20, July/August 2007.
[5] Y. Vlasov, W. Green, and F. Xia. High-Throughput Silicon Nanophotonic Wavelength-Insensitive Switch for On-Chip Optical Networks. Nature Photonics, (2):242–246, March 2008.
[6] R. Beausoleil et al. Nanoelectronic and Nanophotonic Interconnect. Proceedings of the IEEE, February 2008.
[7] N. Kirman et al. Leveraging Optical Technology in Future Bus-based Chip Multiprocessors. In Proc. Int'l Symp. on Microarch., pages 492–503, December 2006.
[8] M. Haurylau et al. On-Chip Optical Interconnect Roadmap: Challenges and Critical Directions. IEEE J. Sel. Quantum Electronics, (6):1699–1705, 2006.
[9] J. Xue et al. An Intra-Chip Free-Space Optical Interconnect: Extended Technical Report. Technical report, Dept. Electrical & Computer Engineering, Univ. of Rochester, April 2010. http://www.ece.rocehster.edu/~mihuang/.
[10] L. Liao et al. High Speed Silicon Mach-Zehnder Modulator. Opt. Express, 13(8):3129–3135, 2005.
[11] Q. Xu et al. Micrometre-scale Silicon Electro-optic Modulator. Nature, 435(7040):325–327, May 2005.
[12] S. Manipatruni et al. Wide Temperature Range Operation of Micrometer-scale Silicon Electro-Optic Modulators. Opt. Lett., 33(19):2185–2187, 2008.
[13] R. Beausoleil et al. A Nanophotonic Interconnect for High-Performance Many-Core Computation. IEEE LEOS Newsletter, June 2008.
[14] W. Bogaerts et al. Low-loss, Low-cross-talk Crossings for Silicon-on-insulator Nanophotonic Waveguides. Opt. Lett., 32(19):2801–2803, 2007.
[15] R. Michalzik and K. Ebeling. Vertical-Cavity Surface-Emitting Laser Devices, chapter 3, pages 53–98. Springer, 2003.
[16] K. Yashiki et al. 1.1-µm-Range Tunnel Junction VCSELs with 27-GHz Relaxation Oscillation Frequency. In Proc. Optical Fiber Communications Conf., pages 1–3, 2007.
[17] Y. Chang, C. Wang, and L. Coldren. High-efficiency, High-speed VCSELs with 35 Gbit/s Error-free Operation. Elec. Lett., 43(19):1022–1023, 2007.
[18] B. Ciftcioglu et al. 3-GHz Silicon Photodiodes Integrated in a 0.18-µm CMOS Technology. IEEE Photonics Tech. Lett., 20(24):2069–2071, Dec. 15 2008.
[19] A. Chin and T. Chang. Enhancement of Quantum Efficiency in Thin Photodiodes through Absorptive Resonance. J. Vac. Sci. and Tech., (339), 1991.
[20] G. Ortiz et al. Monolithic Integration of In0.2Ga0.8As Vertical-cavity Surface-emitting Lasers with Resonance-enhanced Quantumwell Photodetectors. Elec. Lett., (1205), 1996.
[21] S. Park et al. Microlensed Vertical-cavity Surface-emitting Laser for Stable Single Fundamental Mode Operation. Applied Physics Lett., 80(2):183–185, 2002.
[22] K. Chang, Y. Song, and Y. Lee. Self-Aligned Microlens-Integrated Vertical-Cavity Surface-Emitting Lasers. IEEE Photonics Tech. Lett., 18(21):2203–2205, Nov. 1 2006.
[23] E. Strzelecka et al. Monolithic Integration of Vertical-cavity Laser Diodes with Refractive GaAs Microlenses. Electronics Lett., 31(9):724–725, Apr. 1995.
[24] D. Louderback et al. Modulation and Free-space Link Characteristics of Monolithically Integrated Vertical-cavity Lasers and Photodetectors with Microlenses. IEEE J. of Selected Topics in Quantum Electronics, 5(2):157–165, Mar/Apr 1999.
[25] S. Chou et al. Sub-10 nm Imprint Lithography and Applications. J. Vac. Sci. Tech. B., 15:2897–2904, 1997.
[26] M. Austin et al. Fabrication of 5 nm Linewidth and 14 nm Pitch Features by Nanoimprint Lithography. Appl. Phys. Lett., 84:5299–5301, 2004.
[27] K. Banerjee et al. On Thermal Effects in Deep Sub-Micron VLSI Interconnects. Proc. of the IEEE/ACM Design Automation Conf., pages 885–890, Jun. 1999.
[28] D. Tuckerman and R. Pease. High Performance Heat Sinking for VLSI. IEEE Electron Device Lett., 2(5):126–129, May 1981.
[29] B. Dang. Integrated Input/Output Interconnection and Packaging for GSI. PhD thesis, Georgia Inst. of Tech., 2006.
[30] A. Balandin. Chill out. IEEE Spec., 46(10):34–39, Oct. 2009.
[31] C. Hammerschmidt. IBM Brings Back Water Cooling Concepts. EE Times, June 2009. http://www.eetimes.com/showArticle.jhtml?articleID=218000152.
[32] C. Hammerschmidt. IBM, ETH Zurich Save Energy with Water-Cooled Supercomputer. EE Times, June 2009. http://eetimes.eu/showArticle.jhtml?articleID=218100798.
[33] P. McManamon et al. Optical Phased Array Technology. Proc. of the IEEE, 84(2):268–298, Feb. 1996.
[34] D. Culler and J. Singh. Parallel Computer Architecture: a Hardware/Software Approach. Morgan Kaufmann, 1999.
[35] L. Roberts. ALOHA Packet System With and Without Slots and Capture. ACM SIGCOMM Computer Communication Review, 5(2):28–42, April 1975.
[36] R. Metcalfe and D. Boggs. Ethernet: Distributed Packet Switching for Local Computer Networks. Communications of the ACM, 26(1):90–95, January 1983.
[37] K. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28–40, April 1996.
[38] S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proc. Int'l Symp. on Comp. Arch., pages 24–36, June 1995.
[39] S. Mukherjee, P. Bannon, S. Lang, A. Spink, and D. Webb. The Alpha 21364 Network Architecture. IEEE Micro, 22(1):26–35, January 2002.
[40] A. Shacham, K. Bergman, and L. Carloni. On the Design of a Photonic Network-on-Chip. In First Proc. Int'l Symp. on Networks-on-Chip, pages 53–64, May 2007.
[41] A. Krishnamoorthy and D. Miller. Firehose Architectures for Free-Space Optically Interconnected VLSI Circuits. Journal of Parallel and Distributed Computing, 41:109–114, 1997.
[42] P. Marchand et al. Optically Augmented 3-D Computer: System Technology and Architecture. Journal of Parallel and Distributed Computing, 41:20–35, 1997.
[43] A. Walker et al. Optoelectronic Systems Based on InGaAs Complementary-Metal-Oxide-Semiconductor Smart-Pixel Arrays and Free-Space Optical Interconnects. Applied Optics, 37(14):2822–2830, May 1998.
[44] J. Ha and T. Pinkston. SPEED DMON: Cache Coherence on an Optical Multichannel Interconnect Architecture. Journal of Parallel and Distributed Computing, 41:78–91, 1997.
[45] D. Vantrease et al. Corona: System Implications of Emerging Nanophotonic Technology. In Proc. Int'l Symp. on Comp. Arch., June 2008.
[46] E. de Souza et al. Wavelength-division Multiplexing with Femtosecond Pulses. Opt. Lett., 20(10):1166, 1995.
[47] B. Nelson et al. Wavelength Division Multiplexed Optical Interconnect Using Short Pulses. IEEE J. of Selected Topics in Quantum Electronics, 9(2):486–491, Mar/Apr 2003.