Modernizing The BSD Networking Stack: Dennis Ferguson
About Me
I've been a Unix (v7) user since about 1980
I learned to love networking by reading, using and
modifying the BSD network stack, starting with 4.2BSD (I
think a beta version) in 1983
I have developed software for four BSD-based host-in-
router projects
the original CA*net routers, ca. 1990
the T3 NSFnet routers (AIX host), ca. 1992
Ipsilon Networks routers for a while, ca. 1995
Juniper Networks routers, since 1996
I've acquired some opinions in the process
About This Talk
I started off thinking I might write a polemic about the BSD network stack and
how many things I had to do to make it useful as a host-in-a-router, but I found I
couldn't. The BSD network stack is good; it is just a bit old, and its roots in a simpler
time are showing.
Nevertheless, I did have to do a fair amount of work to turn this into something that
could be used by a good (for its time) router. While the implementation was a quick
hack, the basic organization turned out to be good and has lasted far longer than I'd
imagined it might.
What I want to talk about are some basic organizational bits which were changed in
the stack which turned out to have significant utility. This mostly deals with interface
representation and route maintenance; changing these two bits made all other
issues easier
I also believe the best result for multicore scaling is obtained, in both a router and a
host, if the basic stateless packet forwarding can be done without locks at all. This
requires that packet processing access global data through a limited number of
structures which can be modified while being read. We didn't start off with this in
mind, but what we ended up doing turned out to be the right shape. I won't talk much
about this, but keep in mind that this is one of the places the work is heading.
Hosts versus Routers
Hosts deal in packets which are somehow addressed to
or originated by, or for, applications running locally
Routers additionally deal with packets which are not of
interest to local applications, but are instead forwarded
directly from one interface to another.
You can route packets through the same kernel that
supports local applications (BSD has always supported
this) but at some scale it makes sense to move interface-to-interface
forwarding to dedicated hardware. All but the first router I
worked on did the latter.
Even when forwarding is done elsewhere the host-in-the-
router remains to support local applications.
Hosts versus Hosts-in-Routers
Apart from packet forwarding, networking loads for
hosts-in-routers may differ from those for hosts
Routers (and hence hosts-in-routers) typically
have more interfaces, more direct neighbours and
more local addresses.
Routers typically need to maintain a larger number
of routes to make correct routing choices for
locally-originated packets.
These are solely differences of scale. A good host-
in-a-router will also be a good, scalable host.
BSD as a Host-in-a-Router
It is difficult to use an unmodified BSD network stack for a modern
host-in-a-router
The fundamental difficulties with this seem to have three root
causes:
The representation of interfaces in a BSD kernel
The representation of routes and next hops, and the procedures
for determining routing
The embedded assumptions of limited scale
I would like to explore some aspects of these, describing what the
issues are and what we ended up doing to improve the situation
while reorganizing for better MP scaling
As noted previously, I believe work to provide better networking
support for a host-in-a-router also provides better host networking
Network Interfaces: The Hardware View
Most people are pretty clear about the hardware view of
a network interface:
It has a connector (or antenna) to attach it to the network
physically
It has a physical/logical layer state (up/down/...)
It may require configuration of physical layer parameters,
like media selection or clocking
It has an output queue to buffer mismatches in capacity
and demand
It has an L2 encapsulation (or several alternatives) which
typically may include L2 station addresses and packet
protocol identifiers
Network Interfaces: The Protocol View
IP (including IPv6) sends packets to and receives packets from interfaces, but
what it defines as an interface is not the same as the hardware view
For IP an interface is now almost always defined by multicast scope. The
neighbours on a single interface are those that can receive a single
multicast/broadcast addressed packet sent out the interface. Neighbours that can't
receive that packet are on another interface.
This view of interface is imposed or encouraged by:
The IP multicast service
IP routing protocols, generally
ARP and ND
This is a fairly new problem. When the BSD network stack was originally
implemented none of those protocols were used (not even ARP). See RFC 796
and RFC 895 for the types of networks for which the original stack was designed.
Note that while NBMA networks still find support in IP routing protocols, in practice
more recent NBMA networks (ATM, Frame Relay) are (or, were) generally treated
as multiple point-to-point interfaces to maintain the scope definition
Where Hardware and Protocol Interfaces are Mismatched
For some types of configurations one hardware interface may need to be
represented by multiple protocol interfaces
Ethernet with VLANs
Frame Relay and ATM (i.e. relatively modern NBMA)
Encapsulation-defined interfaces like PPPoE or tunnelling protocols
For some interface arrangements several hardware interfaces may need to be
treated as one protocol interface
Aggregated Ethernet
MLPPP
Bridging
For some arrangements the relationship can be several to several
VLANs and PPPoE on aggregated Ethernet
Making the relationship between hardware and protocol interfaces transparent
is useful. While many VLAN protocol interfaces can share a single hardware
ethernet there is still only one hardware down flag
(De)Multiplexing Protocol Interfaces
The protocol interface of an incoming packet is recognized by a combination of
the hardware interface it arrived on and some value in the packet's L2 header
For ethernet VLANs, the VLAN tag
For frame relay, the DLCI
For ATM, the VPI/VCI
For PPPoE, the ethertype, session ID and source MAC address
To send a packet out a protocol interface requires encapsulating the packet
with the appropriate L2 header and queuing it to the hardware interface output
queue
Each protocol interface sharing a single hardware interface is distinguished
solely by an L2 header construction; there is still only a single output queue and
input stream, that of the hardware interface
While hardware interface configuration may be heavy weight, each protocol
interface added to it adds only an additional L2 header variant. The marginal
cost of adding a new protocol interface to existing hardware should be cheap
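The input side of this can be sketched in a few lines of C. This is only an illustration of the idea that a protocol interface is selected by the pair (hardware interface, L2 value); the names `proto_if` and `proto_if_demux` are hypothetical and do not come from any BSD source, and the linear search stands in for whatever table or hash a real stack would use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure: one entry per protocol interface. */
struct proto_if {
    int         hw_index;  /* hardware interface the packet arrived on */
    uint32_t    l2_key;    /* e.g. VLAN tag, DLCI, or a packed VPI/VCI */
    const char *name;      /* protocol interface name, e.g. "wm0.20" */
};

/*
 * Find the protocol interface for an arriving packet. A linear search
 * keeps the sketch short; it is an assumption, not a recommendation.
 */
static const struct proto_if *
proto_if_demux(const struct proto_if *tbl, size_t n,
               int hw_index, uint32_t l2_key)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].hw_index == hw_index && tbl[i].l2_key == l2_key)
            return &tbl[i];
    }
    return NULL;  /* nothing configured for this (hardware, L2 value) pair */
}
```

Adding a VLAN here is one more table entry; no new queue, lock, or device state is needed, which is the sense in which the marginal cost should be cheap.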
What BSD Provides to Deal With Interfaces
An interface is represented by a struct ifnet. Hardware,
protocol and pseudo interfaces all need to be represented
by a struct ifnet since that is all there is.
A struct ifnet is quite large since it always includes the
locking and queue structures, and procedure handles,
required for a hardware interface
The fact that while there may be many protocol-level
interfaces there is only one piece of hardware is opaque.
There are many hardware down flags, many physical
layer configuration variables and many output queues.
For anything beyond the most basic configuration this is
exceedingly messy.
What BSD Provides to Configure Protocols on Interfaces
To enable a protocol on a struct ifnet you add an address for that protocol to its list of
addresses. This has problems:
Some protocols an interface might be configured for, like IEEE 802 bridging, have
no address to configure and need special casing.
IPv4 doesn't need interface addresses in some cases, like point-to-point links
between routers, but there is no way to enable the protocol without them. It also
wants an interface enabled for IPv4 without addresses to run a DHCP client to learn
the addresses to configure (dhcpcd works around this by sending and receiving
raw packets with bpf instead, a sure sign the protocol stack is missing something).
IPv6 has the opposite problem. It never needs to run without an interface address
since it can (must?) make one up and add it to the interface automatically. Since
adding an address is how you tell it you want a protocol enabled, adding one
automatically leaves you no way to tell it when you don't want IPv6 enabled, which
requires another special case.
This also ignores the fact that there may be other protocol-specific things (in and out
firewall filters, a routing instance, a protocol MTU, procedure handles) to be configured
when a protocol is enabled on an interface even if you don't have an address.
Hierarchical Interface Configuration
A happier result can be obtained by splitting the struct ifnet into
components that can be assembled into an interface configuration
in a bigger variety of ways. The interface configuration
components are:
The ifdevice_t, or ifd. Represents the configuration of a thing
that has an output queue, i.e. the hardware view of an
interface.
The iflogical_t, or ifl. Represents the protocol view of an
interface, and the L2 values associated with that view.
The iffamily_t, or iff. Configures a particular protocol (IPv4,
IPv6, bridging, ...) onto an interface.
The ifaddr_t, or ifa. Configures addresses onto a protocol family.
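A rough C sketch of how the four components might hang together follows. Only the type names `ifdevice_t`, `iflogical_t`, `iffamily_t` and `ifaddr_t` come from the talk; every field name, and the helpers `ifl_is_down` and `iff_is_down`, are assumptions made for illustration. The point it shows is that many ifls can share one ifd, so there is exactly one hardware down flag no matter how many protocol interfaces sit on the device.

```c
#include <stdbool.h>
#include <stddef.h>

/* Field names below are hypothetical; only the type names are the talk's. */
typedef struct ifdevice {
    const char *name;   /* e.g. "wm0" */
    bool        down;   /* the single hardware down flag */
    unsigned    mru;    /* maximum frame size */
} ifdevice_t;

typedef struct iflogical {
    ifdevice_t       *ifd;     /* parent device, or NULL if the parent is an ifl */
    struct iflogical *parent;  /* parent ifl, or NULL if the parent is an ifd */
    unsigned          unit;    /* the ".%u" part of the name */
    bool              down;    /* e.g. reflects PPP LCP state */
} iflogical_t;

typedef struct iffamily {
    iflogical_t *ifl;     /* parent logical interface */
    int          family;  /* protocol family selector */
    unsigned     mtu;     /* protocol MTU */
} iffamily_t;

typedef struct ifaddr {
    iffamily_t   *iff;        /* parent protocol family */
    unsigned char addr[16];   /* large enough for an IPv6 address */
    unsigned      prefixlen;
} ifaddr_t;

/* An ifl is down if it, any ancestor ifl, or the underlying ifd is down. */
static bool
ifl_is_down(const iflogical_t *ifl)
{
    for (; ifl != NULL; ifl = ifl->parent) {
        if (ifl->down)
            return true;
        if (ifl->ifd != NULL && ifl->ifd->down)
            return true;
    }
    return false;
}

/* An iff inherits the collected state of its parent ifl. */
static bool
iff_is_down(const iffamily_t *iff)
{
    return ifl_is_down(iff->ifl);
}
```

With two VLAN ifls on the same `wm0` ifd, setting the device's `down` flag makes both report down, while setting one ifl's own flag leaves its sibling unaffected.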
The ifdevice_t
The ifd is the structure autoconf will add when it finds appropriate
hardware configured in a system.
Holds configuration related to media selection, framing and clocking, i.e.
the things needed by anything with a connector.
Has a down flag representing the state of the hardware.
Has an MRU, the maximum frame size that can be received and sent
Has hardware-related procedure handles and an output queue
Has an encapsulation selector (needed for HDLC interfaces) which
defines what is acceptable to configure on top of this.
Two types of structures may be configured as children of an ifd:
Zero or more ifds (this is probably no longer needed)
Zero or more ifls
Naming is, e.g., wm%u, the same as a struct ifnet that might sit in the
same place
The iflogical_t
An ifl can be added as the child of an ifd or another ifl.
Has an L2 identifier defining how packets for this ifl are distinguished, e.g. a VLAN
tag for the child of an ethernet ifd (NULL is acceptable) or a VPI/VCI for the child of
an ATM ifd
Has a type. A PPP ifl might be the child of an HDLC interface or an ethernet ifl
(for PPPoE).
The parent structure is the arbiter of good taste for the types and L2 identifiers a
child may use.
Has a down flag for use, e.g., to reflect the state of PPP LCP. Also collects the
state of down flags in parent structures.
Two types of structures may be added as children:
Zero or more ifls
Zero or more iffs
Naming is the name of its parent structure with a .%u appended, e.g. wm0.2. If
an ifl name omits the .%u suffix, .0 is assumed.
Also has the if_index as an alternative name
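The naming rule can be illustrated with a small parser. This is a sketch under assumptions: the helper name `ifl_name_parse` is hypothetical, and splitting at the last dot (so that a nested name is read as parent name plus unit) is my reading of the rule rather than something the talk specifies.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: split an ifl name such as "wm0.2" into the parent
 * name ("wm0") and the unit (2). A name with no ".%u" suffix is treated
 * as unit 0. Returns 0 on success, -1 on a malformed name.
 */
static int
ifl_name_parse(const char *name, char *parent, size_t parentlen,
               unsigned *unit)
{
    const char *dot = strrchr(name, '.');
    size_t n = (dot != NULL) ? (size_t)(dot - name) : strlen(name);

    if (n == 0 || n >= parentlen)
        return -1;
    memcpy(parent, name, n);
    parent[n] = '\0';

    if (dot == NULL) {
        *unit = 0;              /* ".0" is assumed when the suffix is omitted */
        return 0;
    }

    const char *p = dot + 1;
    if (*p < '0' || *p > '9')   /* require at least one digit after the dot */
        return -1;

    char *end;
    errno = 0;
    unsigned long v = strtoul(p, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT_MAX)
        return -1;
    *unit = (unsigned)v;
    return 0;
}
```

Under these assumptions both "wm0" and "wm0.0" resolve to parent "wm0" and unit 0, while "wm0." and "wm0.x" are rejected.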
The iffamily_t
An iff can be added as the child of an ifl.
A protocol family is configured up on an interface (to the extent possible without
address configuration) when an iff for the family is added to the interface (but not
before that).
Has a type indicating the protocol family, e.g. IPv4, IPv6, bridging. Only one iff of
a type can be configured on the same ifl, and not all iff types are compatible
Has a routing instance identifier, i.e. selects a table to do route lookups for arriving
packets (more later).
Has input and output firewall filter specifications for forwarded and locally received
packets
Has the collected down flags of its parents
Has an MTU for the protocol. This is configured (with a standard default) but needs
to fit in the MRU of the parent ifd when the L2 header overheads are added in.
Has procedure handles for, e.g. multicast operations
Zero or more ifas can be added as children
Is named by the name or if_index of its parent ifl plus the protocol type
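The MTU constraint above amounts to a one-line check, sketched here in C. The function name `iff_mtu_fits` is hypothetical, and exactly which bytes the MRU counts (for example, whether a frame check sequence is included) is an assumption left to the caller.

```c
#include <stdbool.h>

/*
 * Hypothetical check: a protocol MTU is acceptable only if the MTU plus
 * the L2 header overhead of the ifl chain fits in the parent ifd's MRU.
 * The subtraction is guarded so the unsigned arithmetic cannot wrap.
 */
static bool
iff_mtu_fits(unsigned mtu, unsigned l2_overhead, unsigned ifd_mru)
{
    return l2_overhead <= ifd_mru && mtu <= ifd_mru - l2_overhead;
}
```

For instance, taking an Ethernet header of 14 bytes plus a 4-byte VLAN tag as the overhead, a 1500-byte MTU fits an MRU of 1518 but not one of 1514, where the largest acceptable MTU would be 1496.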
The ifaddr_t
An ifa is added to an iff to configure a protocol
address for that family on the interface
ifa configuration adds routes to the routing table as