14-740: Networks: Lecture 24 Spring 2018 Kesden

The document describes the venerable 3-tier data center topology and how it scales up. It then introduces Clos networks, leaf-spine networks, and fat-tree networks as improved topologies that provide full bisection bandwidth. The Portland solution is presented as using a fat-tree network with commodity switches and offloading services to software on servers. It assigns hierarchical MAC addresses to enable location awareness and uses a fabric manager for coordination.

14-740: Networks

Lecture 24 • Spring 2018 • Kesden


DC Topology: Venerable 3-Tier
• Since, beyond a certain point, we can’t make switches wider and/or faster, we need to “fan out”, most commonly with a tree topology
• The venerable 3-tier network is a straightforward example:
[Figure: three-tier tree topology with Core, Aggregation, and Leaf layers]
DC Topology: Venerable 3-Tier
• Can add a redundant core for increased throughput and resilience

[Figure: three-tier tree topology with a redundant core above the Aggregation and Leaf layers]
DC Topology: Venerable 3-Tier
• Scales nicely, but …
• Higher layers get over-subscribed, since everything passes through them
• Over-subscription increases with scale
• Request-to-stream and host-to-host cases generate bottlenecks
[Figure: throughput demanded per level: 1x switch throughput at the leaf, Wx at aggregation, W²x at the core]
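To make the over-subscription concrete, here is a minimal sketch in Python of the scaling shown in the figure, assuming every level fans out by the same factor W and every switch has identical 1x capacity (the function name is just for illustration):

```python
# Worst-case throughput demanded at each level of a 3-tier tree when each
# level fans out by a factor of W. The W and W^2 factors mirror the figure.

def demanded_throughput(w: int, leaf_throughput: float = 1.0) -> dict:
    return {
        "leaf": leaf_throughput,             # 1x: one leaf switch's traffic
        "aggregation": w * leaf_throughput,  # Wx: aggregates W leaf switches
        "core": w * w * leaf_throughput,     # W^2 x: aggregates W aggregation switches
    }

# With identical 1x switches everywhere, a fan-out of W = 8 leaves the core
# over-subscribed by up to 64x in the worst case.
print(demanded_throughput(w=8))  # {'leaf': 1.0, 'aggregation': 8.0, 'core': 64.0}
```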
Clos Networks
• Allocating an input port, and its associated throughput, allocates a path the whole way through
• NxN connectivity from switches with less than NxN connectivity
• Basically a way to make a large NxN switch
• Still an expensive expansion, and not likely to need all of the throughput capacity simultaneously
[Figure: 3-stage Clos network. Image by Piggly, Public Domain, https://commons.wikimedia.org/w/index.php?curid=61536102]
Leaf and Spine
• Type of Clos network
• Essentially a folded Clos, but still N-to-N connections
• Derived from the old phone company architecture, invented in the 1950s
• All paths are the same length from edge to edge
• Great for switch vendors
• Need to pick a path, since any middle (spine) switch can be chosen
• Very redundant
• Can implement at layer-2 or layer-3
• More soon
Fat-Tree Networks
• More throughput at higher levels, making throughput more even across levels
• Not easy to do, since buying more powerful switches is harder
• To the extent it is possible, it costs more per unit of capacity
• Not possible beyond a modest point
• This is somewhat necessarily the case: if bigger switches were more readily available and economical, they’d be used at the bottom, and we’d be back where we started.
Fat-Trees With Skinny Switches: Goals
• Use all commodity switches
• Full throughput from host-to-host
• Compatible with usual TCP/IP stack
• Better energy efficiency per unit throughput from many smaller switches than from fewer bigger switches
[Figure: Fat Tree (K=4). Note the replacement of the aggregation-layer switches with 2 layers of K/2 K-port switches per pod; (K/2)² core switches; (K/2)² servers per pod; K-port switches support K³/4 servers.]


Fat Tree Details
• K-ary fat tree: three layers (core, aggregation, edge)
• Each pod consists of (K/2)² servers and 2 layers of K/2 K-port switches
• Each edge switch connects (K/2) servers to (K/2) aggregation switches
• Each aggregation switch connects (K/2) edge and (K/2) core switches
• (K/2)² core switches, each ultimately connecting to K pods
• Providing (K/2)² different roots, not 1. The trick is to pick different ones
• K-port switches support K³/4 servers/hosts:
• (K/2 hosts/edge switch * K/2 edge switches per pod * K pods)
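The counts above follow directly from K; a minimal sketch of that arithmetic (the function name is illustrative):

```python
# Component counts for a K-ary fat tree, following the bullets above.
# K must be even, since each switch splits its K ports half down, half up.

def fat_tree_sizes(k: int) -> dict:
    assert k % 2 == 0, "K must be even"
    return {
        "pods": k,
        "edge_switches_per_pod": k // 2,
        "aggregation_switches_per_pod": k // 2,
        "core_switches": (k // 2) ** 2,
        "servers_per_pod": (k // 2) ** 2,
        "servers_total": (k // 2) * (k // 2) * k,  # = K^3 / 4
    }

print(fat_tree_sizes(4))   # 4 core switches, 16 servers in total
print(fat_tree_sizes(48))  # 576 core switches, 27648 servers in total
```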
Using Multiple Paths
• Must pick different paths (“path diversity”) or will have a hotspot
• Unless each session keeps to a single path, reordering will be a problem and will need to be resolved with buffering higher up (see the hashing sketch after this list)
• Static paths may not respond to actual, dynamic workloads
• Can be done at different levels.
• Higher levels, e.g. transport, are more flexible, but likely more effort and slower
• Lower levels are likely less adaptive, but simpler and faster.
• Ability to weight or remove paths can aid fault tolerance
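One common low-level way to get path diversity without reordering is to hash each flow’s 5-tuple, so every packet of a flow follows the same path while different flows spread across the available paths; a minimal sketch (plain hashing for illustration, not any specific switch’s or PortLand’s mechanism):

```python
import hashlib

# Hash-based path selection: all packets of a flow map to the same path
# (no intra-flow reordering), while different flows spread across paths.

def pick_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

k = 4
num_paths = (k // 2) ** 2   # (K/2)^2 shortest paths in a K-ary fat tree
print(pick_path("10.0.1.2", "10.2.0.3", 40000, 80, 6, num_paths))  # stable for this flow
print(pick_path("10.0.1.2", "10.2.0.3", 40001, 80, 6, num_paths))  # likely a different path
```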
Portland Solution
• Use commodity switches and offload services into software on commodity servers
• Start with a fat tree for a topology without hot spots
• Use layer-2 to avoid routing, forwarding, and related complexity
• Separate host identifier from host location
• IP addresses identify the host, but not its location; they are just an ID
• Use “Pseudo MAC addresses” to identify location at layer-2
PortLand Addresses
• Normally MAC addresses are arbitrary – no clue about location
• IP normally is hierarchical, but here we are using it only as a host identifier
• If MAC addresses are not tied to location, switch tables grow linearly with the growth of the network, i.e. O(n)
• PortLand uses hierarchical MAC addresses, called “Pseudo MAC” or PMAC addresses, to encode location
• <pod:position:port:vmid>
• <16,8,8,16> bits
PortLand PMAC Addresses
[Figure: fat tree (K=4) with pods numbered 0–3; edge switches labeled by position (0, 1) within each pod]
PMAC: <pod.position.port.vmid>, 48 bits: <16-bits.8-bits.8-bits.16-bits>


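Since a PMAC is just 48 bits laid out as <pod.position.port.vmid> with <16.8.8.16>-bit fields, packing and unpacking are plain bit arithmetic; a minimal sketch (helper names are illustrative):

```python
# Pack and unpack a PortLand PMAC: <pod.position.port.vmid>, <16.8.8.16> bits.

def make_pmac(pod: int, position: int, port: int, vmid: int) -> int:
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def split_pmac(pmac: int):
    return ((pmac >> 32) & 0xFFFF,  # pod
            (pmac >> 24) & 0xFF,    # position
            (pmac >> 16) & 0xFF,    # port
            pmac & 0xFFFF)          # vmid

def pmac_str(pmac: int) -> str:
    # Render the 48-bit value as a conventional colon-separated MAC address.
    return ":".join(f"{(pmac >> s) & 0xFF:02x}" for s in range(40, -1, -8))

pmac = make_pmac(pod=2, position=1, port=0, vmid=1)
print(pmac_str(pmac))    # 00:02:01:00:00:01
print(split_pmac(pmac))  # (2, 1, 0, 1)
```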
VM Migration
• Flat address space.
• IP address unchanged after migration, so higher levels don’t see a state change
• After migration the IP<->PMAC mapping changes, as the PMAC is location dependent
• The VM sends a gratuitous ARP with the new mapping
• The Fabric Manager receives the ARP and sends an invalidation to the old switch
• The old switch sets a flow table entry trapping to software, causing an ARP with the new mapping to be sent in response to any stray packets
• Forwarding the stray packet is optional, since a retransmission (if the transport is reliable) will fix delivery
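A hedged sketch of that invalidation sequence (the class and method names are hypothetical, not PortLand’s actual implementation):

```python
# Sketch of the PMAC invalidation flow after a VM migrates.

class EdgeSwitch:
    def __init__(self):
        self.stale = {}  # old PMAC -> (IP, new PMAC)

    def invalidate(self, old_pmac, ip, new_pmac):
        # Flow-table entry trapping the stale PMAC to software.
        self.stale[old_pmac] = (ip, new_pmac)

    def on_packet(self, dst_pmac, sender):
        if dst_pmac in self.stale:
            ip, new_pmac = self.stale[dst_pmac]
            sender.receive_arp(ip, new_pmac)  # redirect the stray sender
            # Forwarding the stray packet itself is optional (see above).

class FabricManager:
    def __init__(self):
        self.ip_to_pmac = {}  # IP -> (PMAC, edge switch)

    def on_gratuitous_arp(self, ip, new_pmac, new_edge_switch):
        old = self.ip_to_pmac.get(ip)
        self.ip_to_pmac[ip] = (new_pmac, new_edge_switch)
        if old is not None:
            old_pmac, old_switch = old
            old_switch.invalidate(old_pmac, ip, new_pmac)
```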
Location Discovery: Configuring Switch IDs
• Humans = not the right answer
• Discovery = the right answer
• Send messages to neighbors – get tree level
• Hosts don’t reply, so edge switches only hear back from above
• Aggregation switches hear back from both levels
• Core switches hear back only from aggregation switches
• Contact Fabric Manager with tree level to get ID
• Fabric Manager is service running on commodity host
• Assigns ID
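A minimal sketch of the level-inference idea from the bullets above (not the full PortLand location discovery protocol; names and inputs are illustrative):

```python
# Classify a switch's tree level from which neighbors answered its discovery
# messages. Levels propagate over successive rounds, so early rounds may
# return None until neighbors have announced their own levels.

def infer_level(ports_with_replies, total_ports, neighbor_levels):
    if len(ports_with_replies) < total_ports:
        # Hosts never reply, so silence on some ports marks an edge switch.
        return "edge"
    if "edge" in neighbor_levels:
        # Hears back from edge switches below (and core above): aggregation.
        return "aggregation"
    if neighbor_levels and neighbor_levels <= {"aggregation"}:
        # Hears back only from aggregation switches: core.
        return "core"
    return None  # not enough information yet; wait for another round

print(infer_level({2, 3}, 4, set()))                   # edge
print(infer_level({0, 1, 2, 3}, 4, {"edge", None}))    # aggregation
print(infer_level({0, 1, 2, 3}, 4, {"aggregation"}))   # core
```

Once a switch knows its level, it contacts the Fabric Manager to be assigned an ID, as the bullets above describe.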
Name Resolution: MAC ↔ PMAC ↔ IP
• End hosts continue to use Actual MAC (AMAC) addresses
• Switches convert PMAC<->AMAC for the host
• The edge switch is responsible for creating the PMAC:AMAC mapping and telling the Fabric Manager
• Software on a commodity server; can be replicated, etc. Simplicity is a virtue.
• Mappings are timed out of the Fabric Manager’s cache if not used
• ARPs resolve to PMACs
• First ask the Fabric Manager, which keeps a cache; then, if needed, broadcast (see the sketch after this list)
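A minimal sketch of that lookup order at an edge switch (the interfaces here are hypothetical):

```python
# ARP resolution per the bullets above: ask the Fabric Manager's cache first;
# only on a miss fall back to a network-wide broadcast.

def resolve_arp(target_ip, fabric_manager, broadcast_arp):
    pmac = fabric_manager.lookup(target_ip)  # central cache, usually a hit
    if pmac is not None:
        return pmac
    return broadcast_arp(target_ip)          # rare, expensive fallback
```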
No loops, No Spanning Trees
• Forwarding only goes up the tree and then back down; a packet heading down is never sent back up
• Cycles are therefore not possible
Failure
• Keep-alives, like the link discovery messages
• Miss a keep-alive? Tattle to the Fabric Manager
• The Fabric Manager tells affected switches, which adjust their own tables
• O(N) vs O(N²) for traditional routing algorithms (the Fabric Manager tells every switch vs. every switch tells every switch)
Looking Back
• Connectivity – Hosts can talk! No possibility of loops
• Efficiency – Much less memory needed in switches, O(N) fault handling
• Self configuring – Discovery protocol + ARP
• Robust – Failure handling coordinated by FM
• VMs and Migration – Each has own IP address, each has own MAC address
• Commodity hardware – Nothing magic.
Flow Classification
• Type of “diffusion optimization”
• Mitigate local congestion
• Assign traffic to ports based upon flow, not host.
• One host can have many flows, thus many assigned routings
• Fairly distribute flows
• (K/2)² shortest paths available – but that doesn’t help if all flows pick the same one, e.g. the same root of the multi-rooted tree
• Periodically reassign output port to free up corresponding input port
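A minimal sketch of the periodic reassignment idea (the data structure and the rebalancing rule are illustrative, not the exact heuristic from the fat-tree paper):

```python
# Periodically move one flow from the busiest output port to the least-busy
# one. `flows` maps flow-id -> (assigned_port, bytes_sent_recently).

def rebalance(flows, num_ports):
    load = [0] * num_ports
    for port, nbytes in flows.values():
        load[port] += nbytes
    busiest = max(range(num_ports), key=lambda p: load[p])
    lightest = min(range(num_ports), key=lambda p: load[p])
    if busiest == lightest:
        return flows
    # Move the smallest flow off the busiest port (a simple, conservative choice).
    victim = min((f for f, (p, _) in flows.items() if p == busiest),
                 key=lambda f: flows[f][1], default=None)
    if victim is not None:
        flows[victim] = (lightest, flows[victim][1])
    return flows

flows = {"a": (0, 900), "b": (0, 400), "c": (1, 100)}
print(rebalance(flows, num_ports=2))  # flow "b" moves to port 1
```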
Flow Scheduling
• Also a “Diffusion Optimization”
• Detect and deconflict large, long-lived flows
• Thresholds for throughput and longevity
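A minimal sketch of threshold-based detection (the specific threshold values are invented for illustration):

```python
import time

# A flow counts as "large and long-lived" once it exceeds both a byte-count
# threshold and an age threshold; these numbers are placeholders.
BYTES_THRESHOLD = 10 * 1024 * 1024  # 10 MB of payload
AGE_THRESHOLD_S = 1.0               # alive for at least 1 second

def is_large_long_lived(flow_bytes, first_seen, now=None):
    now = time.time() if now is None else now
    return flow_bytes >= BYTES_THRESHOLD and (now - first_seen) >= AGE_THRESHOLD_S

print(is_large_long_lived(64 * 1024 * 1024, first_seen=0.0, now=5.0))  # True
print(is_large_long_lived(2 * 1024, first_seen=0.0, now=5.0))          # False
```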
Fat-Tree Solution: “Special” IP Addressing
• “10.0.0.0/8” private addresses
• Pod-level uses “10.pod.switch.1“
• pod,switch < K
• Core-level uses "10.K.j.i“
• K is the same K as elsewhere, the number of ports/switch
• View cores as logical square. i, j denote position in square.
• Hosts use “10.pod.switch.ID" addresses
• 2 <= ID <= (K/2)+1
• ID = 1 is the pod-level switch; a larger ID would mean too many hosts
• 8-bits implies K < 256
• Will pre-bake the paths to ensure diversity, while maintaining ordering
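A minimal sketch that enumerates those addresses for a given K, assuming the ranges above (host IDs run from 2 up to K/2 + 1 so that .1 stays reserved for the edge switch):

```python
# Enumerate the "special" fat-tree addressing scheme described above.

def fat_tree_addresses(k: int) -> dict:
    assert k % 2 == 0 and k < 256
    pod_switches = [f"10.{pod}.{switch}.1"
                    for pod in range(k) for switch in range(k)]
    core_switches = [f"10.{k}.{j}.{i}"                  # cores as a logical square
                     for j in range(1, k // 2 + 1)
                     for i in range(1, k // 2 + 1)]
    hosts = [f"10.{pod}.{switch}.{host_id}"             # hosts hang off edge switches
             for pod in range(k)
             for switch in range(k // 2)
             for host_id in range(2, k // 2 + 2)]
    return {"pod_switches": pod_switches,
            "core_switches": core_switches,
            "hosts": hosts}

addrs = fat_tree_addresses(4)
print(len(addrs["core_switches"]))  # (K/2)^2 = 4
print(len(addrs["hosts"]))          # K^3/4 = 16
print(addrs["hosts"][:2])           # ['10.0.0.2', '10.0.0.3']
```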
