Datacenter Photonics
DATACENTER NETWORKS
Al Davis
Hewlett Packard Laboratories & University of Utah
3 July, 2012
TODAY’S DOMINANT INFORMATION LANDSCAPE
MY OTHER COMPUTER IS
WHAT’S THE POINT?
– End point is increasingly mobile
• battery longevity limits both processing and memory/storage
− memory != storage
• typical driving applications access non-local information
– Key observation
• “the network is the computer” – John Gage, Sun Microsystems employee #2
HENCE: FOCUS ON THE INTERCONNECT
– Bill Clinton and Al Gore had the same focus
• March 9, 1996 at Ygnacio Valley High School
− bizarre: I was there as a sophomore 33 years earlier when the school opened
THE FIRST STEP
– Information endpoint to the datacenter
• 1st hop: wireless (802.11x, 3G) or wired to the “edge”
• 2nd hop: telecom, mostly fiber, to the backbone
• 3rd hop: to the datacenter internet routers
TODAY’S DATA CENTERS
– Mostly or all electrical
• 50K+ cores already in play
− larger configurations in the HPC realm
– Configuration [3]
• rows of racks
− rack: 0.6 m wide, 1 m deep, 2 m high
− each rack has 42 vertical 44.45 mm (1U) slots; 175 kg empty, max loaded weight 900 kg (sizing sketch after this list)
− each RU holds a 2-4 socket (multi-core) processor motherboard
• # of cores growing – maybe even at Moore’s rate if you believe the pundits
• cold and hot aisles (heat is a huge issue) – front side cold, back side hot
− front to front and back to back row placement
− >= 1.22 m cold row allows human access to blades but not the cables
− >= .9 m hot row holds cables and is the key to CRAC heat extraction strategy
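A quick sizing sketch in Python based on the rack figures above; the 100K-server target and the per-server socket/core counts are illustrative assumptions, not numbers from this slide.

```python
# Back-of-the-envelope rack math using the figures above.
# Assumptions (not from the slide): 1U servers, 100,000 servers total,
# 4 sockets per server, 8 cores per socket.

SLOTS_PER_RACK = 42        # 44.45 mm (1U) slots per rack
SERVERS = 100_000          # assumed datacenter scale
SOCKETS_PER_SERVER = 4     # upper end of the 2-4 socket range
CORES_PER_SOCKET = 8       # assumed core count

racks = -(-SERVERS // SLOTS_PER_RACK)   # ceiling division
cores = SERVERS * SOCKETS_PER_SERVER * CORES_PER_SOCKET

print(f"racks needed: {racks:,}")       # ~2,381 racks
print(f"total cores : {cores:,}")       # 3,200,000 cores
```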
THE CABLE NIGHTMARE
[Photos (random web sources): ugly hot-aisle cabling and poor airflow]
– For HPC
• prices are much higher due to router ASICs & better bisection topologies
• bisection bandwidth improves significantly
− important in the datacenter where high locality is not the predominant workload
EXAMPLE DATA-CENTRIC WORKLOADS
– Financial trading
• 350 billion transactions and updates per year
– Sensor networks
• increased data glut
• CENSE project
MAPREDUCE/HADOOP
– Another example of non-local communication patterns
• “Customers Who Bought This Item Also Bought……”
– Sorting 1 PB with MapReduce*
• 4000-node cluster
• 48,000 disks
• 1 petabyte of 100-byte records
• sort time: 6 hours & 2 minutes (throughput sketch below)
*Google blog, November 2008
[Diagram: MAP phase is storage intensive, REDUCE phase is network intensive]
– Currently storage-bandwidth limited; moving towards network-bandwidth limited with increased SSD use
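A rough throughput calculation in Python for the 1 PB sort quoted above; it assumes 1 PB = 10^15 bytes and ignores the fact that a sort reads and writes the data more than once, so treat the per-disk number as a lower bound.

```python
# Rough throughput estimate for the 1 PB MapReduce sort (Google blog, Nov 2008).

PETABYTE = 1e15                      # assume decimal petabyte
NODES, DISKS = 4_000, 48_000
SORT_SECONDS = 6 * 3600 + 2 * 60     # 6 hours 2 minutes

aggregate = PETABYTE / SORT_SECONDS  # bytes/s moved end to end
print(f"aggregate: {aggregate / 1e9:5.1f} GB/s")        # ~46 GB/s
print(f"per node : {aggregate / NODES / 1e6:5.1f} MB/s")
print(f"per disk : {aggregate / DISKS / 1e6:5.2f} MB/s")
```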
DATACENTER TRENDS [1]
– Server count ~30M in 2007
• 5-year forward CAGR = 7%
− EPA CAGR estimate is 17%
• doesn’t account for server consolidation trend
• “whacked on the Cloud” is a likely accelerant
– Storage growth
• 5-year forward CAGR = 52%
• added 5 exabytes in 2007, roughly 10^5 x LoC (the printed Library of Congress)
– Internet traffic
• 5-year forward CAGR = 46% (6.5 exabytes per month in 2007)
• 650K LoC equivalents sent every month in 2007
– Internet nodes
• 5-year backward CAGR = 27%
• public fascination with mobile information appliances has accelerated this rate
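A small Python sketch of what the quoted compound annual growth rates imply if they hold over the five forward years; the 2007 baselines are the figures above, and the projections are illustrative only.

```python
# Project the 2007 baselines forward using the stated 5-year CAGRs.

def project(base, cagr, years):
    """Value after `years` of compound growth at annual rate `cagr`."""
    return base * (1 + cagr) ** years

servers_2012 = project(30e6, 0.07, 5)   # ~42M servers
traffic_2012 = project(6.5, 0.46, 5)    # exabytes/month, ~43 EB
print(f"servers in 2012: {servers_2012 / 1e6:.0f} M")
print(f"internet traffic in 2012: {traffic_2012:.0f} EB/month")
```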
COMMUNICATION ESTIMATES [1]
– Server count growing slower than anything else
– exponential communication growth per server in the data center
– Estimate [1] (within +/- 10x)
• for every byte written to or read from a disk
− ~10 KB are transmitted over some network in the data center
• for every byte transmitted over the internet
− ~1 GB is transmitted within or between data centers (scaled examples below)
– Clear conclusion
• improving data center communication efficiency is likely more important than improving individual socket performance (which will happen anyway)
− includes socket to socket & socket to main memory and storage
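Taking the order-of-magnitude ratios above at face value, a tiny Python sketch scales two familiar quantities through them; the inputs (1 GB of disk I/O, 1 MB served over the internet) are arbitrary illustrations.

```python
# Scale example quantities through the slide's +/- 10x estimates.

KB, MB, GB, TB, PB = 1e3, 1e6, 1e9, 1e12, 1e15

NET_PER_DISK_BYTE = 10 * KB        # network bytes per disk byte
DC_PER_INTERNET_BYTE = 1 * GB      # datacenter bytes per internet byte

disk_io = 1 * GB                   # read 1 GB from disk ...
print(f"1 GB of disk I/O   -> ~{disk_io * NET_PER_DISK_BYTE / TB:.0f} TB on the DC network")

served = 1 * MB                    # serve 1 MB over the internet ...
print(f"1 MB over internet -> ~{served * DC_PER_INTERNET_BYTE / PB:.0f} PB within/between DCs")
```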
OTHER DATA CENTER CHALLENGES
– Consume too much power, generate too much heat & CO2
• 2007 EPA report to Congress – 2 socket server (2 cores/socket)
Component       Peak Power (W)
CPU             80
Memory          36
Disks           12
Communication   50
Motherboard     25
Fan             10
PSU losses      38
TOTAL           251
– 2006 US data center electricity use: 61 billion kWh (doubled since 2000), roughly $4.5B in electricity costs; doesn’t include the telecom component
– Total power / IT equipment power (PUE): 2 is common, 1.7 is good, 1.2 is claimed but hard to validate (see the sketch below)
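A minimal Python sketch converting the per-server budget above into facility power at the three PUE values quoted; the 100,000-server scale is an assumption borrowed from the grand-challenge slide, not a figure in the table.

```python
# Facility power implied by the 251 W per-server peak and a range of PUEs.

SERVER_W = 251         # peak power per 2-socket server (table above)
SERVERS = 100_000      # assumed datacenter scale

it_power_mw = SERVER_W * SERVERS / 1e6        # IT load in MW
for pue in (2.0, 1.7, 1.2):                   # total power / IT power
    print(f"PUE {pue:3.1f}: {it_power_mw * pue:5.1f} MW facility power")
```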
– Grand challenge
• future datacenters with 100K nodes (each with 10s to 100s of cores)
• O(10^3) increase in communication & memory pressure expected
• without a commensurate increase in communication latency & power consumption
− shrinking transistors will help but not enough; the cm-to-100 m scale problem remains
• how do we achieve these goals?
DATA CENTER NETWORK REQ’S
– High dimension networks
• to reduce hop count
• scalable without significant re-cabling
− scale-out to accommodate more racks and rows
− scale-up to higher performance blades
• regularity will be important
− minimize cable complexity
− minimize number of cable SKUs for cost purposes
− enable adaptive routing to meet load balance demands
• path diversity
− increased availability and fault tolerance
[Image source: Luxtera]
ITRS EYE CHART FOR INTERCONNECT
– Advantages
• mature technology and volume production reduces cost
• manufacturing and packaging have been optimized for electrical technology
• “Always ride your horse in the direction it’s going”
− Texas proverb
− good questions: better horse? time to change direction??
– Conclusion
• computation gets better with technology shrink, but communication improves slowly or not at all in terms of bit transport energy (BTE) & delay
RECENT SERDES PUBLICATIONS
– Two classes of SerDes, short reach and long reach (memory & backplane)
– Still seeing improvement in SerDes power (historically ~20% per year; projection sketch below)
– Numbers in system publications tend to be higher
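A short Python sketch of what the historical ~20%-per-year improvement implies; the 10 pJ/bit starting point is an assumed round number for illustration, not a value from these publications.

```python
# Project SerDes energy per bit assuming a steady 20%/year improvement.

START_PJ_PER_BIT = 10.0      # assumed starting point
for year in range(0, 11, 2):
    energy = START_PJ_PER_BIT * 0.8 ** year
    print(f"year {year:2d}: {energy:5.2f} pJ/bit")
```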
LOW POWER SERDES COMPARISON
PHOTONIC SIGNALING
– Problems
• immature technology
− waveguides, modulators, detectors all exist in various forms in lab scale demonstrations
− improvements likely and the technology exists now, but the path from the lab to volume production & low cost is risky
• photonic elements don’t shrink with feature size
− resonance properties are set by λ, and hence by physical size
• maintaining proper resonance requires thermal tuning
• currently: cables, connectors, etc. all cost more than their electrical counterparts
– Advantages
• power consumption is independent of length for lengths of interest in the datacenter
− due to the very low loss nature of the waveguides
− energy consumption is at the EO or OE endpoints
• relatively immune to signal integrity & stub electronic problems
− buses are not a problem
• built-in bandwidth multiplier per waveguide: CWDM & DWDM (scaling sketch below)
− 10 Gb/s per λ demonstrated; 4 λ now (MZ), doubling every 3 years likely, ~67 λ limit?
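A small Python sketch of per-waveguide bandwidth under the scaling claim above (10 Gb/s per wavelength, 4 wavelengths today, count doubling every ~3 years, capped at ~67); the timeline is illustrative.

```python
# Per-waveguide bandwidth if the wavelength count doubles every ~3 years.

GBPS_PER_LAMBDA = 10
LAMBDA_LIMIT = 67            # suggested practical limit

wavelengths = 4
for year in range(0, 16, 3):
    bw = wavelengths * GBPS_PER_LAMBDA
    print(f"year {year:2d}: {wavelengths:3d} wavelengths -> {bw:4d} Gb/s per waveguide")
    wavelengths = min(wavelengths * 2, LAMBDA_LIMIT)
```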
DWDM POINT-TO-POINT PHOTONIC LINK
OPTICAL LOSSES
[Loss-budget chart: 2 cm of waveguide and 10 m of fiber; an illustrative budget sketch follows]
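An illustrative loss-budget sketch in Python for the 2 cm waveguide + 10 m fiber path; every coefficient below is an assumed, typical-of-the-era value, not a number taken from the slide's chart.

```python
# Illustrative optical loss budget; all coefficients are assumptions.

losses_db = {
    "waveguide (2 cm @ ~2 dB/cm)":  2 * 2.0,
    "chip<->fiber couplers (x2)":   2 * 1.0,
    "fiber (10 m @ ~0.2 dB/km)":    (10 / 1000) * 0.2,
    "connectors (x2)":              2 * 0.5,
}
total = sum(losses_db.values())
for item, db in losses_db.items():
    print(f"{item:30s} {db:5.2f} dB")
print(f"{'TOTAL':30s} {total:5.2f} dB")
```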
INTEGRATED CMOS PHOTONICS POINT-TO-POINT POWER BUDGET
[Link energy-budget diagram: receiver 23 fJ, modulator 44 fJ, tuning 50 fJ, laser 60 fJ]
– 10 Gbit/s per wavelength
– 177 fJ/bit assuming a 32 nm process (power-conversion sketch below)
– No clock recovery and latching, so not directly comparable to electronic numbers
– Tuning and laser power are required even when idle
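Summing the budget above and converting to power at the stated 10 Gb/s per wavelength, as a simple Python check of the 177 fJ/bit figure; the component labels follow the diagram.

```python
# Sum the per-bit energy budget and convert to power per wavelength.

budget_fj = {"receiver": 23, "modulator": 44, "tuning": 50, "laser": 60}
LINE_RATE = 10e9                               # bits/s per wavelength

total_fj_per_bit = sum(budget_fj.values())     # 177 fJ/bit
power_mw = total_fj_per_bit * 1e-15 * LINE_RATE * 1e3
print(f"total energy: {total_fj_per_bit} fJ/bit")
print(f"link power  : {power_mw:.2f} mW per wavelength at 10 Gb/s")
```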
HIGH-PERFORMANCE SWITCH: STATE-OF-THE-ART ELECTRONIC
MELLANOX INFINISWITCH IV ISSUES
IMPROVING DATA CENTER NETWORKS
– Step 1: Use optical cables
• already in limited use
– Step 2: Move optics into the core switch backplane (Interop 2011)
• current core switch backplane limitations are hitting a rather hard wall
− more power and higher cost are not feasible as bisection bandwidth demands advance
−CWDM bandwidth scaling is an attractive proposition
TACKLING THE BANDWIDTH BOTTLENECK WITH PHOTONICS
[Diagram: photonic integration path from active optical cables to hybrid-laser cables, silicon PICs, an optical bus, and on-chip interconnect, with arrays of Tx/Rx blocks]
ALL-OPTICALLY CONNECTED DATA CENTER CORE SWITCH
– 10x bandwidth scaling
• core switch requirement doubling every 18 months (projection sketch below)
• electronic technologies can no longer keep up
– Equivalent cost
• historically the main obstacle to adoption of optics
– Future scaling
• VCSEL BW scaling: 10G to 25G
• single λ to CWDM: 2 λ, then 4 λ
• optical backplane remains unchanged
[Diagram: optical backplane connecting NODE 0 through NODE 3]
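A tiny Python projection of the "doubling every 18 months" demand curve above; the 10 Tb/s starting point is an assumed round number, not a figure from the slide.

```python
# Core-switch bandwidth demand if it doubles every 18 months.

START_TBPS = 10.0        # assumed aggregate demand today
for years in range(0, 10, 3):
    demand = START_TBPS * 2 ** (years / 1.5)
    print(f"+{years} yr: ~{demand:5.0f} Tb/s aggregate")
```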
INTEGRATED CMOS PHOTONIC SWITCH
CHARACTERISTICS
• 64-128 DWDM ports
• <400 fJ/bit IO power
• 160-640 Gbps per port
ADVANTAGES
• switch size unconstrained by device IO limits
• port bandwidth scalable by increasing the number of wavelengths
• optical link ports can directly connect to anywhere within the data centre
• greatly increased connector density, reduced cable bulk (aggregate-bandwidth sketch below)
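The aggregate figures implied by the characteristics above, taking the upper end of each range; a quick Python check, not a measured result.

```python
# Aggregate bandwidth and IO power at the upper end of the quoted ranges.

PORTS = 128
GBPS_PER_PORT = 640
FJ_PER_BIT = 400e-15

aggregate_bps = PORTS * GBPS_PER_PORT * 1e9    # ~82 Tb/s
io_power_w = aggregate_bps * FJ_PER_BIT        # upper bound on IO power
print(f"aggregate bandwidth    : {aggregate_bps / 1e12:.1f} Tb/s")
print(f"IO power at <400 fJ/bit: < {io_power_w:.0f} W")
```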
MINIMIZE ELECTRONICS
[Diagram: buffering & routing kept in electronics; optical crossbar on the switch die]
– Basic idea
• fully connected in each dimension
• one link to each mirror in all other dimensions
– Regularity benefits
• simple adaptive routing (DAL)
• set L, S, K, T values to match needs
− packaging & configuration
NEW NETWORK TOPOLOGIES: HYPERX [5]
– Direct network: the switch is embedded with the processors
• avoids wiring complexity of central/core switches (e.g. fat trees)
• much lower hop count than grids and tori (see the sketch below)
• but many different interconnect lengths
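A minimal Python sketch of basic regular-HyperX properties as defined in [5] (L dimensions, S switches per dimension, K parallel links to each dimension-neighbor, T terminals per switch); the example parameters are illustrative, not from the slide.

```python
# Basic properties of a regular HyperX(L, S, K, T): fully connected in
# each dimension, so the diameter is at most L switch-to-switch hops.

def hyperx(L, S, K, T):
    switches = S ** L                 # total switches
    radix = T + L * K * (S - 1)       # terminal ports + inter-switch ports
    terminals = T * switches          # attached processors
    diameter = L                      # at most one hop per dimension
    return switches, radix, terminals, diameter

sw, radix, term, diam = hyperx(L=3, S=8, K=1, T=8)
print(f"switches={sw}, switch radix={radix}, terminals={term}, diameter={diam} hops")
```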
GENERAL CONCLUSIONS
– Advances in electronics will continue BUT
• processing benefits from these advances
• data center communications will benefit but not as much
• optics is the transport choice, electronics is the processor choice in an ideal world
− NOTE: we don’t live in an ideal world
– Power wall is here to stay (I don’t see the magic technology which moves the wall)
• going green is not going to be easy if consumption is based on MORE
• getting more performance for less power is problematic
• replacing long wires with optical paths is a good idea
− telecom did this in the 80’s
− definition of long for computing is changing however
• maybe it should be relative to transistor speed
PHOTONICS CONCLUSIONS
a somewhat personal view
– The switch to photonics is inevitable
• the technology is already demonstrated in multiple labs around the world
• however it’s not mature
− costs need to come down
− improvements will be made & a lot of smart people are making this happen
ACKNOWLEDGMENTS
– HPL/ECL
• Moray McLaren (who provided some of these slides) – the rest is my fault
• Jung-Ho Ahn, Nate Binkert, Naveen Muralimanohar, Norm Jouppi, Rob
Schreiber, Partha Ranganathan, Dana Vantrease …
– HPL/IQSL
• Ray Beausoleil, Marco Fiorentino, Zhen Peng, David Fattal, Charlie Santori, Di
Liang (UCSB), Mike Tan, Paul Rosenberg, Sagi Mathai …
FOR FURTHER STUDY
Some referenced in this presentation