
Journey to the Center of the Linux Kernel:

Traffic Control, Shaping and QoS


Julien Vehent - see revisions

1 Introduction
This document describes the Traffic Control subsystem of the Linux kernel in depth, algorithm
by algorithm, and shows how it can be used to manage the outgoing traffic of a Linux system.
Throughout the chapters, we will discuss both the theory behind Traffic Control and its usage,
and demonstrate how one can gain complete control over the packets passing through a
system.

a QoS graph
The initial target of this paper was to gain better control over a small DSL uplink, but it grew
over time to cover a lot more than that. 99% of the information provided here can be applied to
any type of server, as well as to routers, firewalls, etc.

The Traffic Control topic is large and in constant evolution, as is the Linux Kernel. The real
credit goes to the developers behind the /net directory of the kernel, and all of the researchers
who created and improved all of these algorithms. This is merely an attempt to document some
of this work for the masses. Any participation and comments are welcome, in particular if you
spotted an inconsistency somewhere. Please email julien[at]linuxwall.info, your messages are
always most appreciated.
For technical discussion, since the LARTC mailing list doesn't exist anymore, try these two:
* the Netfilter users mailing list, for general discussions
* the NetDev mailing list, where the magic happens (developers' list)

2 Motivation

This article was initially published in the French issue of Gnu/Linux Magazine France #127, in
May 2010. GLMF is kind enough to provide a contract that releases the content of the article
under a Creative Commons license after some time. I have extended the initial article quite a bit
since, but you can still find the original French version here.
My interest in the traffic shaping subsystem of Linux started around 2005, when I decided to
host most of the services I use myself. I read the documentation available on the subject
(essentially LARTC) but found it incomplete, and ended up reading the original research
publications and the source code of Linux.
I host most of my Internet services myself, at home or on small end servers (dedibox and so on).
This includes web hosting (this wiki, some blogs and a few websites), SMTP and IMAP servers,
some FTP servers, XMPP, DNS and so on. French ISPs allow this, but only give you
1Mbps/128KBps of uplink, which corresponds to the TCP ACK rate necessary for a 20Mbps
downlink.
1Mbps is enough for most usage, but without traffic control, any weekly backup to an external
location fills up the DSL link and slows down everyone on the network. During that time, both
the visitors of this wiki and my wife chatting on Skype will experience high latency. This is not
acceptable, because the priority of the weekly backup is a lot lower than the two others. Linux
gives you the flexibility to shape the network traffic and use all of your bandwidth efficiently,
without penalizing real-time applications. But this comes at a price, and the learning curve to
implement an efficient traffic control policy is quite steep. This document provides an accurate
and comprehensive introduction to the most used QoS algorithms, and gives you the tools to
implement and understand your own policy.

3 The basics of Traffic Control

In the Internet world, everything is packets. Managing a network means managing packets: how
they are generated, routed, transmitted, reordered, fragmented, etc. Traffic Control works on
packets leaving the system. Its objective is not, initially, to manipulate packets
entering the system (although you could do that, if you really want to slow down the rate at
which you receive packets). The Traffic Control code operates between the IP layer and the
hardware driver that transmits data on the network. We are discussing a portion of code that
works on the lower layers of the network stack of the kernel. In fact, the Traffic Control code is
the very one in charge of constantly furnishing packets to send to the device driver.
This means that the TC module, the packet scheduler, is permanently active in the kernel. Even
when you do not explicitly want to use it, it is there, scheduling packets for transmission. By
default, this scheduler maintains a basic queue (similar to a FIFO queue) in which the first
packet to arrive is the first to be transmitted.
At its core, TC is composed of queuing disciplines, or qdiscs, that represent the scheduling
policies applied to a queue. Several types of qdisc exist. If you are familiar with the way CPU
schedulers work, you will find that some of the TC qdiscs are similar. We have FIFO, FIFO with
multiple queues, and FIFO with hashing and round robin (SFQ). We also have a Token Bucket Filter
(TBF) that assigns tokens to a qdisc to limit its flow rate (no token = no transmission = wait for a
token). This last algorithm was later extended to a hierarchical mode called HTB (Hierarchical
Token Bucket). There are also Quick Fair Queuing (QFQ), Hierarchical Fair Service Curve (HFSC),
Random Early Detection (RED), etc.
For a complete list of algorithms, check out the source code at kernel.org.

3.1 First contact


Let's skip the theory for now and start with an easy example. We have a web server on which we
would like to limit the flow rate of packets leaving the server. We want to fix that limit at 200
kilobits per second (25 KB/s).
This set up is fairly simple, and we need three things:
1. a Netfilter rule to mark the packets that we want to limit
2. a Traffic Control policy
3. a Filter to bind the packets to the policy

3.2 Netfilter MARK

Netfilter can be used to interact directly with the structure representing a packet in the kernel.
This structure, the sk_buff, contains a field called __u32 nfmark that we are going to modify.
TC will then read that value to select the destination class of a packet.
The following iptables rule will apply the mark '80' to outgoing packets (OUTPUT chain) sent by
the web server (TCP source port is 80).
# iptables -t mangle -A OUTPUT -o eth0 -p tcp --sport 80 -j MARK --set-mark 80

We can check that the rule is applied via the Netfilter statistics:
# iptables -L OUTPUT -t mangle -v
Chain OUTPUT (policy ACCEPT 74107 packets, 109M bytes)
 pkts bytes target prot opt in  out   source    destination
73896  109M MARK   tcp  --  any eth0  anywhere  anywhere     tcp spt:www MARK set 0x50/0xffffffff

You probably noticed that the rule is located in the mangle table. We will go back to that a little
bit later.

3.3 Two classes in a tree


To manipulate TC policies, we need the /sbin/tc binary from the **iproute** package (aptitude
install iproute).
The iproute package must match your kernel version. Your distribution's package manager will
normally take care of that.
We are going to create a tree that represents our scheduling policy, and that uses the HTB
scheduler. This tree will contain two classes: one for the marked traffic (TCP sport 80), and one
for everything else.
# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 200kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 824kbit ceil 1024kbit prio 2 mtu 1500

The two classes are attached to the root. Each class has a guaranteed bandwidth (rate value) and
an opportunistic bandwidth (ceil value). If the totality of the bandwidth is not used, a class will
be allowed to increase its flow rate up to the ceil value. Otherwise, the rate value is applied. This
means that the sum of the rate values must correspond to the total bandwidth available.
In the previous example, we consider the total upload bandwidth to be 1024kbits/s, so class 10
(web server) gets 200kbits/s and class 20 (everything else) gets 824 kbits/s.
TC accepts both the kbit and kbps notations, but they don't have the same meaning: kbit is the
rate in kilobits per second, and kbps is in kilobytes per second. In this article, I will use the kbit
notation only.
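Since mixing the two up silently changes a policy by a factor of eight, a quick sanity check helps (plain Python arithmetic, nothing tc-specific):

```python
# kbit = kilobits per second; kbps = kilobytes per second in tc's notation.
def kbit_to_kilobytes_per_sec(rate_kbit):
    """Convert a tc 'kbit' rate to kilobytes per second."""
    return rate_kbit / 8

# The 200kbit limit of the web server class is 25 kilobytes per second,
# and the full 1024kbit uplink is 128 kilobytes per second.
print(kbit_to_kilobytes_per_sec(200))   # 25.0
print(kbit_to_kilobytes_per_sec(1024))  # 128.0
```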

3.4 Connecting the marks to the tree


We now have on one side a traffic shaping policy, and on the other side packets marking. To
connect the two, we need a filter.
A filter is a rule that identifies packets (handle parameter) and directs them to a class (fw flowid
parameter). Since several filters can work in parallel, they can also have a priority. A filter must
be attached to the root of the QoS policy; otherwise, it won't be applied.
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 80 fw flowid
1:10

We can test the policy using a simple client/server setup. Netcat is very useful for such testing.
Start a listening process on the server that applies the policy using:
# nc -l -p 80 < /dev/zero

And connect to it from another machine using:


# nc 192.168.1.1 80 > /dev/null

The server process will send zeros (taken from /dev/zero) as fast as it can, and the client will
receive them and throw them away, as fast as it can.
Using iptraf to monitor the connection, we can supervise the bandwidth usage (bottom right
corner).

The value is 199.20kbits/s, which is close enough to the 200kbits/s target. The precision of the
scheduler depends on a few parameters that we will discuss later on.
Any other connection from the server that uses a source port different from TCP/80 will have a
flow rate between 824kbits/s and 1024kbits/s (depending on the presence of other connections in
parallel).

4 Twenty Thousand Leagues Under the Code


Now that we enjoyed this first contact, it is time to go back to the fundamentals of the Quality of
Service of Linux. The goal of this chapter is to dive into the algorithms that compose the traffic
control subsystem. Later on, we will use that knowledge to build our own policy.
The code of TC is located in the net/sched directory of the sources of the kernel. The kernel
separates the flows entering the system (ingress) from the flows leaving it (egress). And, as we
said earlier, it is the responsibility of the TC module to manage the egress path.

The illustration below shows the path of a packet inside the kernel: where it enters (ingress) and
where it leaves (egress). If we focus on the egress path, a packet arrives from layer 4 (TCP,
UDP, etc.) and then enters the IP layer (not represented here). The Netfilter chains OUTPUT and
POSTROUTING are integrated into the IP layer and are located between the IP manipulation
functions (header creation, fragmentation, etc.). At the exit of the NAT table of the
POSTROUTING chain, the packet is transmitted to the egress queue, and this is where TC starts
its work.
Almost all devices use a queue to schedule the egress traffic. The kernel possesses algorithms
that can manipulate these queues; they are the queuing disciplines (FIFO, SFQ, HTB, etc.). The
job of TC is to apply these queuing disciplines to the egress queue in order to select a packet for
transmission.
TC works with the sk_buff structure that represents a packet in the kernel. It doesn't manipulate
the packet data directly. sk_buff is a shared structure accessible anywhere in the kernel, thus
avoiding unnecessary duplication of data. This method is a lot more flexible and a lot faster
because sk_buff contains all of the necessary information to manipulate the packet; the kernel
therefore avoids copying headers and payloads between different memory areas, which would
ruin performance.

On a regular basis, the packet scheduler will wake up and launch any pre-configured scheduling
algorithms to select a packet to transmit.
Most of the work is launched by the function dev_queue_xmit from net/core/dev.c, which only
takes an sk_buff structure as input (this is enough: sk_buff contains everything needed, such as
skb->dev, a pointer to the output NIC).
dev_queue_xmit makes sure the packet is ready to be transmitted on the network, that the
fragmentation is compatible with the capacity of the NIC, and that the checksums are calculated
(if this is handled by the NIC, this step is skipped). Once those checks are done, and if the device
has a queue in skb->dev->qdisc, then the sk_buff structure of the packet is added to this queue
(via the enqueue function) and the qdisc_run function is called to select a packet to send.
This means that the packet that has just been added to the NIC's queue might not be the one
immediately transmitted, but we know that it is ready for subsequent transmission the moment it
is added to the queue.
To each device is attached a root queuing discipline. This is what we defined earlier when
creating the root qdisc to limit the flow rate of the web server:
# tc qdisc add dev eth0 root handle 1: htb default 20

This command means: attach a root qdisc identified by id 1 to the device eth0, use htb as a
scheduler, and send everything to class 20 by default.
We will find this qdisc at the pointer skb->dev->qdisc. The enqueue and dequeue functions are
also located there, respectively in skb->dev->qdisc->enqueue() and
skb->dev->qdisc->dequeue(). This last dequeue function will be in charge of forwarding the
sk_buff structure of the packet selected for transmission to the NIC's driver.
The root qdisc can have leaves, known as classes, that can also possess their own qdiscs, thus
constructing a tree.

4.1 Classless Disciplines


This is a pretty long way down. For those who wish to deepen the subject, I recommend
reading Understanding Linux Network Internals, by Christian Benvenuti at O'Reilly.
We now have a dequeue function whose role is to select the packet to send to the network
interface. To do so, this function calls the scheduling algorithms that we are now going to study.
There are two types of algorithms: classless and classful. The classful algorithms are composed
of qdiscs that can contain classes, as we did in the previous example with HTB. By contrast,
classless algorithms cannot contain classes, and are (supposedly) simpler.
4.1.1 PFIFO_FAST
Let's start with a small one. pfifo_fast is the default scheduling algorithm, used when no other is
explicitly defined. In other words, it's the one used on 99.9% of Linux systems. It is classless
and a tiny bit more evolved than a basic FIFO queue, since it implements 3 bands working in
parallel. These bands are numbered 0, 1 and 2 and emptied sequentially: while 0 is not empty, 1
will not be processed, and 2 is served last. So we have 3 priorities: the packets in queue 0 will
be sent before the packets in queue 2.
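As an illustration of that strict-priority behaviour, here is a toy model in Python (a sketch of the dequeue order only, not the actual code from net/sched/sch_generic.c):

```python
from collections import deque

class PfifoFast:
    """Toy model of pfifo_fast: 3 FIFO bands, band 0 always served first."""
    def __init__(self):
        self.bands = [deque(), deque(), deque()]

    def enqueue(self, packet, band):
        self.bands[band].append(packet)

    def dequeue(self):
        # Band 1 is only served while band 0 is empty; band 2 comes last.
        for band in self.bands:
            if band:
                return band.popleft()
        return None  # nothing to transmit

q = PfifoFast()
q.enqueue("bulk", 2)         # arrived first, lowest priority
q.enqueue("interactive", 0)  # arrived last, highest priority
q.enqueue("normal", 1)
print(q.dequeue())  # interactive
print(q.dequeue())  # normal
print(q.dequeue())  # bulk
```

Note that arrival order only matters within a band: the "bulk" packet arrived first but leaves last.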
The kernel uses the Type of Service field (the 8-bit field from bit 8 to bit 15 of the IP header,
see below) to select the destination band of a packet.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version|  IHL  |Type of Service|          Total Length         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Identification        |Flags|      Fragment Offset    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  Time to Live |    Protocol   |         Header Checksum       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       Source Address                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Destination Address                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Options                    |    Padding    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Example Internet Datagram Header from RFC 791

This algorithm is defined in **net/sched/sch_generic.c** and represented in the diagram below.

dia source
The length of a band, representing the number of packets it can contain, is set to 1000 by default
and defined outside of TC. It's a parameter that can be set using ifconfig, and visualized in /sys:
# cat /sys/class/net/eth0/tx_queue_len
1000

Once the default value of 1000 is exceeded, TC will start dropping packets. This should very rarely
happen, because TCP makes sure to adapt its sending speed to the capacity of both systems
participating in the communication (that's the role of the TCP slow start). But experiments have
shown that increasing that limit to 10,000, or even 100,000, in some very specific cases of
gigabit networks can improve performance. I wouldn't recommend touching this value
unless you really know what you are doing. Increasing a buffer size to too large a value can have
very negative side effects on the quality of a connection. Jim Gettys called that bufferbloat, and
we will talk about it in the last chapter.
Because this algorithm is classless, it is not possible to plug another scheduler after it.
4.1.2 SFQ - Stochastic Fairness Queuing
Stochastic Fairness Queuing is an algorithm that shares the bandwidth without giving any
privilege of any sort. The sharing is fair by being stochastic, or random if you prefer. The idea is
to take a fingerprint, or hash, of the packet based on its header, and to use this hash to send the
packet to one of the 1024 buckets available. The buckets are then emptied in a round-robin
fashion.
The main advantage of this method is that no packet will have priority over another. No
connection can take over the others, and everybody has to share. The repartition of the bandwidth
across the packets will almost always be fair, but there are some minor limitations. The main
limitation is that the hashing algorithm might produce the same result for several packets, and
thus send them to the same bucket. One bucket will then fill up faster than the others, breaking the
fairness of the algorithm. To mitigate this, SFQ modifies the parameters of its hashing
algorithm on a regular basis, by default every 10 seconds.
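The behaviour described so far can be sketched in a few lines of Python (a simplified model: the bucket count and round-robin dequeue match the description above, but the hash is a stand-in for the kernel's sfq_hash):

```python
import random
from collections import deque

class ToySFQ:
    """Hash flows into buckets; dequeue the buckets round-robin."""
    def __init__(self, divisor=1024):
        self.divisor = divisor
        # Stand-in for the 'perturb' value, re-generated periodically in SFQ.
        self.perturbation = random.randrange(2**32)
        self.buckets = [deque() for _ in range(divisor)]
        self.cursor = 0  # round-robin position

    def classify(self, src, dst, proto, ports):
        # Stand-in for sfq_hash(): mix header fields with the perturbation.
        return hash((src, dst, proto, ports, self.perturbation)) % self.divisor

    def enqueue(self, packet, src, dst, proto, ports):
        self.buckets[self.classify(src, dst, proto, ports)].append(packet)

    def dequeue(self):
        # Visit buckets in round-robin order; serve the first non-empty one.
        for _ in range(self.divisor):
            bucket = self.buckets[self.cursor]
            self.cursor = (self.cursor + 1) % self.divisor
            if bucket:
                return bucket.popleft()
        return None
```

Two competing flows end up in different buckets (barring a hash collision) and are served alternately, which is exactly the fairness property described above.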
The diagram below shows how the packets are processed by SFQ, from entering the scheduler at
the top, to being dequeued and transmitted at the bottom. The source code is available in
net/sched/sch_sfq.c. Some variables are hard-coded in the source:
* SFQ_DEFAULT_HASH_DIVISOR gives the number of buckets, and defaults to 1024
* SFQ_DEPTH defines the depth of each bucket, and defaults to 128 packets

#define SFQ_DEPTH                128 /* max number of packets per flow */
#define SFQ_DEFAULT_HASH_DIVISOR 1024

These two values determine the maximum number of packets that can be queued at the same time:
1024 * 128 = 131072 packets can be waiting in the SFQ scheduler before it starts dropping
packets. But this theoretical number is very unlikely to ever be reached, because it would require
the algorithm to fill up every single bucket evenly before starting to dequeue or drop packets.
The tc command line lists the options that can be fed to SFQ:
# tc qdisc add sfq help
Usage: ... sfq [ limit NUMBER ] [ perturb SECS ] [ quantum BYTES ]

limit: the size of the buckets. It can be reduced below SFQ_DEPTH but not increased
past it. If you try to put a value above 128, TC will simply ignore it.

perturb: the frequency, in seconds, at which the hashing algorithm is updated.

quantum: the maximum number of bytes that the round-robin dequeue
process is allowed to dequeue from a bucket at once. At a bare minimum, this should be equal
to the MTU of the connection (1500 bytes on ethernet). Imagine that we set this value to
500 bytes: all packets bigger than 500 bytes would never be dequeued by SFQ, and would
stay in their bucket. We would, very quickly, arrive at a point where all buckets are
blocked and no packets are transmitted anymore.
DIA source

4.1.2.1 SFQ Hashing algorithm

The hash used by SFQ is computed by the sfq_hash function (see the source code), and takes into
account the source and destination IP addresses, the layer 4 protocol number (still in the IP header),
and the TCP ports (if the IP packet is not fragmented). These parameters are mixed with a
random number regenerated every 10 seconds (the perturb value defines that).
Let's walk through a simplified version of the algorithm. The connection has the following
parameters:

source IP: 126.255.154.140

source port: 8080

destination IP: 175.112.129.215

destination port: 2146

/* IP source address in hexadecimal */
h1 = 7eff9a8c
/* IP destination address in hexadecimal */
h2 = af7081d7
/* 06 is the protocol number for TCP (bits 72 to 80 of the IP header).
   We perform a XOR between the variable h2 obtained in the previous step
   and the TCP protocol number */
h2 = h2 XOR 06
/* If the IP packet is not fragmented, we include the TCP ports in the hash.
   1f900862 is the hexadecimal representation of the source and destination
   ports. We perform another XOR with this value and the h2 variable */
h2 = h2 XOR 1f900862
/* And finally, we use the Jenkins algorithm with some additional "golden
   numbers". The jhash function is defined elsewhere in the kernel source */
h = jhash(h1, h2, perturbation)

The result obtained is a 32-bit hash value that SFQ uses to select the destination bucket of
the packet. Because the perturb value is regenerated every 10 seconds, the packets
from a reasonably long connection will be directed to different buckets over time.
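For illustration, the walkthrough above can be reproduced in Python. Only the h1/h2 construction follows the description; the final jhash call is replaced here by a placeholder mix, since the real Jenkins hash lives elsewhere in the kernel:

```python
import ipaddress

def simplified_sfq_hash(src_ip, dst_ip, proto, sport, dport, perturbation):
    """Follow the walkthrough: build h1 and h2, then mix them together."""
    h1 = int(ipaddress.ip_address(src_ip))   # source address as a 32-bit int
    h2 = int(ipaddress.ip_address(dst_ip))   # destination address
    h2 ^= proto                              # XOR with the protocol number (6 = TCP)
    h2 ^= (sport << 16) | dport              # XOR with both TCP ports packed together
    # Placeholder for the kernel's jhash(h1, h2, perturbation):
    return (h1 * 0x9e3779b9 ^ h2 ^ perturbation) & 0xffffffff

# The intermediate values from the walkthrough:
print(hex(int(ipaddress.ip_address("126.255.154.140"))))  # 0x7eff9a8c
print(hex(int(ipaddress.ip_address("175.112.129.215"))))  # 0xaf7081d7
print(hex((8080 << 16) | 2146))                           # 0x1f900862
```

Changing the perturbation argument changes the resulting bucket, which is how SFQ re-shuffles flows every perturb seconds.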
But this also means that SFQ might break the sequencing of the sending process: if two
packets from the same connection are placed in two different buckets, it is possible that the
second bucket will be processed before the first one, thereby sending the second packet
before the first one. This is not a problem for TCP, which uses sequence numbers to reorder the
packets when they reach their destination, but for UDP it might be.
For example, imagine that you have a syslog daemon sending logs to a central syslog server.
With SFQ, it might happen that a log message arrives before the message that precedes it. If you
don't like it, use TCP.
TC gives us some more flexibility on the hashing algorithm. We can, for example, modify the
fields considered by the hashing process. This can be done using TC filters, as follows:
# tc qdisc add dev eth0 root handle 1: sfq perturb 10 quantum 3000 limit 64
# tc filter add dev eth0 parent 1:0 protocol ip handle 1 flow hash keys src,dst divisor 1024

The filter above simplifies the hash to keep only the source and destination IP addresses as input
parameters. The divisor value is the number of buckets, as seen before. We could, then, create an
SFQ scheduler that works with only 10 buckets and considers only the IP addresses of the packets
in the hash.
This discipline is classless as well, which means we cannot direct packets to another scheduler
when they leave SFQ. Packets are transmitted to the network interface only.

4.2 Classful Disciplines


4.2.1 TBF - Token Bucket Filter
Until now, we looked at algorithms that do not allow controlling the amount of bandwidth. SFQ
and PFIFO_FAST give the ability to smooth the traffic, and even to prioritize it a bit, but not to
control its throughput.
In fact, the main problem when controlling bandwidth is to find an efficient accounting
method. Because counting in memory is extremely difficult and costly to do in real-time,
computer scientists took a different approach here.
Instead of counting the packets (or the bits transmitted by the packets, which is the same thing),
the Token Bucket Filter algorithm sends, at a regular interval, a token into a bucket. This is
disconnected from the actual packet transmission, but when a packet enters the scheduler, it will
consume a certain number of tokens. If there are not enough tokens for it to be transmitted, the
packet waits.
Until now, with SFQ and PFIFO_FAST, we were talking about packets, but with TBF we now
have to look at the bits contained in the packets. Let's take an example: a packet carrying 8000
bits (1KB) wishes to be transmitted. It enters the TBF scheduler, and TBF checks the content of
its bucket: if there are 8000 tokens in the bucket, TBF destroys them and the packet can pass.
Otherwise, the packet waits until the bucket has enough tokens.
The frequency at which tokens are added into the bucket determines the transmission speed, or
rate. This is the parameter at the core of the TBF algorithm, shown in the diagram below.

DIA source
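A bare-bones model of that token accounting, in Python (one token per bit, no peakrate, and none of the real timer granularity; a sketch of the principle, not of net/sched/sch_tbf.c):

```python
class ToyTBF:
    """Token bucket: tokens accrue at `rate` per second, up to `burst`."""
    def __init__(self, rate_bps, burst_bits):
        self.rate = rate_bps      # tokens (bits) added per second
        self.burst = burst_bits   # bucket capacity
        self.tokens = burst_bits  # the bucket starts full
        self.last = 0.0           # time of the last refill

    def refill(self, now):
        # Add the tokens accumulated since the last refill, capped at `burst`.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_send(self, packet_bits, now):
        """Return True if the packet may pass, consuming its tokens."""
        self.refill(now)
        if self.tokens >= packet_bits:
            self.tokens -= packet_bits
            return True
        return False  # not enough tokens: the packet must wait

tbf = ToyTBF(rate_bps=200_000, burst_bits=16_000)
print(tbf.try_send(8000, now=0.0))   # True: the bucket starts full
print(tbf.try_send(8000, now=0.0))   # True: 16000 tokens covered both packets
print(tbf.try_send(8000, now=0.0))   # False: bucket empty, the packet waits
print(tbf.try_send(8000, now=0.05))  # True: 0.05s * 200kbit/s = 10000 new tokens
```

The first two calls illustrate the burst behaviour discussed next: an idle bucket fills up, and the accumulated tokens let several packets pass back-to-back.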
Another particularity of TBF is that it allows bursts. This is a natural side effect of the algorithm:
the bucket fills up at a continuous rate, but if no packets are transmitted for some time, the
bucket gets completely full. The next packets to enter TBF will then be transmitted right
away, without having to wait and without any limit applied to them, until the bucket is
empty. This is called a burst, and in TBF the burst parameter defines the size of the bucket.

So with a very large burst value, say 1,000,000 tokens, we would let a maximum of 83 fully
loaded packets (roughly 124KBytes if they all carry their maximum MTU) traverse the scheduler
without applying any sort of limit to them.
To overcome this problem, and provide better control over bursts, TBF implements a second
bucket, smaller and generally the same size as the MTU. This second bucket cannot store a large
amount of tokens, but its replenishing rate is a lot faster than that of the big bucket. This
second rate is called peakrate, and it determines the maximum speed of a burst.
Let's take a step back and look at those parameters again. We have:

peakrate > rate : the second bucket fills up faster than the main one, to allow and control
bursts. If the peakrate value is infinite, then TBF behaves as if the second bucket didn't
exist: packets are dequeued according to the main bucket, at the speed of rate.

burst > MTU : the size of the first bucket is a lot larger than the size of the second bucket.
If burst is equal to the MTU, then peakrate is equal to rate and there is no burst.

So, to summarize: when everything works smoothly, packets are enqueued and dequeued at the
speed of rate. If tokens are available when packets enter TBF, those packets are transmitted at the
speed of peakrate until the first bucket is empty. This is represented in the diagram above, and in
the source code at net/sched/sch_tbf.c, the interesting function being tbf_dequeue().
The configurable options of the TBF scheduler are listed in TC:
# tc qdisc add tbf help
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
           [ peakrate KBPS ] [ latency TIME ] [ overhead BYTES ] [ linklayer TYPE ]

We recognize the burst, rate, mtu and peakrate parameters discussed above. The limit parameter
is the size of the packet queue (see diagram above). latency represents another way of calculating
the limit parameter, by setting the maximum time a packet can wait in the queue (the size of the
queue is then derived from it; the calculation includes the values of burst, rate, peakrate
and mtu). overhead and linklayer are two other parameters whose story is quite interesting.
Let's take a look at those now.
4.2.1.1 DSL and ATM, or the Ox that believed it was a Frog

If you have ever read Jean de La Fontaine, you probably know the story of the frog that
wanted to be as big as an ox. Well, in our case, it's the opposite, and your packets might not be
as small as they think they are.
While most local networks use Ethernet, up until recently a lot of communications (at least in
Europe) were done over the ATM protocol. Nowadays, ISPs are moving toward all-over-IP, but
ATM is still around. The particularity of ATM is to split large ethernet packets into much smaller
ones, called cells. A 1500-byte ethernet packet is split into ~30 smaller ATM cells of
just 53 bytes each. And of those 53 bytes, only 48 are from the original packet; the rest is
occupied by the ATM headers.
So where is the problem? Consider the following network topology.

The QoS box is in charge of performing the packet scheduling before transmitting to the
modem. The packets are then split by the modem into ATM cells. So our initial 1.5KB ethernet
packet is split into 32 ATM cells, for a total size of 32 * 5 bytes of headers per cell + 1500 bytes
of data = (32*5)+1500 = 1660 bytes. 1660 bytes is 10.6% bigger than 1500. When ATM is used,
we lose about 10% of bandwidth compared to an ethernet network (this is an estimate that depends on
the average packet size, etc.).
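The same arithmetic, generalized in a few lines of Python. Note that the 10.6% figure above counts only the 5-byte cell headers; if the padding of the last cell's 48-byte payload is also counted, the overhead on the wire is slightly higher:

```python
import math

ATM_CELL = 53     # total size of one ATM cell, in bytes
ATM_PAYLOAD = 48  # payload bytes per cell (the other 5 bytes are headers)

def atm_wire_bytes(packet_bytes):
    """Bytes actually sent on the wire for one packet: every cell costs
    53 bytes, and the last cell's payload is padded up to 48 bytes."""
    cells = math.ceil(packet_bytes / ATM_PAYLOAD)
    return cells * ATM_CELL

wire = atm_wire_bytes(1500)
print(wire)                                   # 1696: 32 cells of 53 bytes
print(round((wire - 1500) / 1500 * 100, 1))   # 13.1% overhead with padding
```

This is exactly why a shaper that accounts in ethernet bytes underestimates the load it puts on an ATM link.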
If TBF doesn't know about that, and calculates its rate based on the sole knowledge of the
ethernet MTU, then it will transmit 10% more packets than the modem can transmit. The modem
will start queuing, and eventually dropping, packets. The TCP stacks will have to adjust their
speed, traffic gets erratic, and we lose the benefit of TC as a traffic shaper.
Jesper Dangaard Brouer did his Master's thesis on this topic, and wrote a few patches for the
kernel and TC. These patches implement the overhead and linklayer parameters, and can be used
to inform the TBF scheduler of the type of link to account for.
* overhead represents the ADSL encapsulation overhead. Figuring out the ADSL
encapsulation overhead for a specific line is difficult; one should assume a minimum of 10
bytes of overhead (see adsl encapsulation)
* linklayer defines the type of link, either ethernet or {atm,adsl}. atm and adsl are the same
thing and represent a 5-byte header overhead per cell
We can use these parameters to fine-tune the creation of a TBF scheduling discipline:
# tc qdisc add dev eth0 root tbf rate 1mbit burst 10k latency 30ms linklayer
atm
# tc -s qdisc show dev eth0
qdisc tbf 8005: root rate 1000Kbit burst 10Kb lat 30.0ms
Sent 738 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
rate 0bit 0pps backlog 0b 0p requeues 0

By setting the linklayer to atm, we take into account an overhead of 5 bytes per cell later
in the transmission, therefore preventing the modem from buffering the packets.
4.2.1.2 The limits of TBF

TBF gives pretty accurate control over the bandwidth assigned to a qdisc, but it also imposes
that all packets pass through a single queue. If a big packet is blocked because there are not
enough tokens to send it, smaller packets - that could potentially be sent instead - are blocked
behind it as well. This is the case represented in the diagram above, where packet #2 is stuck
behind packet #1. We could optimize the bandwidth usage by allowing the smaller packet to be
sent instead of the bigger one. We would, however, fall into the same problem of reordering
packets that we discussed with the SFQ algorithm.
The other solution would be to give more flexibility to the traffic shaper, declare several TBF
queues in parallel, and route the packets to one or the other using filters. We could also allow
those parallel queues to borrow tokens from each other, in case one is idle and the other one is
not.
We just prepared the ground for classful qdiscs, and the Hierarchical Token Bucket.
4.2.2 HTB - Hierarchical Token Bucket
The Hierarchical Token Bucket (HTB) is an improved version of TBF that introduces the notion
of classes. Each class is, in fact, a TBF-like qdisc, and classes are linked together as a tree, with a
root and leaves. HTB introduces a number of features to improve the management of
bandwidth, such as the priority between classes, a way to borrow bandwidth from another
class, or the possibility to plug another qdisc as an exit point (a SFQ for example).
Let's take a simple example, represented in the diagram below.

htb_en.dia.zip
The tree is created with the commands below:
# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 400kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 200kbit ceil 200kbit prio 2 mtu 1500

HTB uses similar terminology to TBF and SFQ:

* burst is identical to the burst of TBF: it's the size of the token bucket
* rate is the speed at which tokens are generated and put in the bucket, the speed of the leaf, like in TBF
* quantum is similar to the quantum discussed in SFQ: it's the amount of bytes to serve from the leaf at once

The new parameters are ceil and cburst. Let us walk through the tree to see how they work.
In the example above, we have a root qdisc handle 1 and two leaf qdiscs, handle 10 and
handle 20. The root will apply filters to decide where a packet should be directed; we
will discuss those later, and by default packets are sent to leaf #20 (default 20).
Leaf #10 has a rate value of 200kbits/s, a ceil value of 400kbits/s (which means it can
borrow 200kbits/s more than its rate) and a priority (prio) of 1.
Leaf #20 has a rate value of 200kbits/s, a ceil value of 200kbits/s (which means it cannot
borrow anything, rate == ceil) and a priority of 2.
Each HTB leaf will, at any point, be in one of the three following states:

* HTB_CANT_SEND: the class can neither send nor borrow, no packets are allowed to leave the class
* HTB_MAY_BORROW: the class cannot send using its own tokens, but can try to borrow from another class
* HTB_CAN_SEND: the class can send using its own tokens
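The decision between these three states can be sketched as follows. This is a simplification of the kernel's class-mode logic (the real code in sch_htb.c also accounts for hysteresis and time deltas): tokens track credit against rate, ctokens track credit against ceil.

```python
HTB_CANT_SEND, HTB_MAY_BORROW, HTB_CAN_SEND = 0, 1, 2

def htb_mode(tokens, ctokens):
    # Positive 'tokens' means the class is under its rate: it can send
    # on its own. Otherwise, positive 'ctokens' means it is still under
    # its ceil: it may borrow from the tree. If both are exhausted, the
    # class must wait for the buckets to refill.
    if tokens >= 0:
        return HTB_CAN_SEND
    if ctokens >= 0:
        return HTB_MAY_BORROW
    return HTB_CANT_SEND

print(htb_mode(500, 800))    # under rate: HTB_CAN_SEND
print(htb_mode(-100, 800))   # over rate, under ceil: HTB_MAY_BORROW
print(htb_mode(-100, -50))   # over ceil: HTB_CANT_SEND
```

This is why a leaf with rate == ceil (like leaf #20 above) can never be in the HTB_MAY_BORROW state: the moment its rate tokens run out, its ceil tokens are gone too.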

Imagine a group of packets that enter TC and are marked with the flag #10, and are therefore
directed to leaf #10. The bucket for leaf #10 does not contain enough tokens to let the first
packets pass, so it will try to borrow some from its neighbor, leaf #20. The quantum value of leaf
#10 is set to the MTU (1500 bytes), which means the maximal amount of data that leaf #10 will
try to send is 1500 bytes. If packet #1 is 1400 bytes large, and the bucket in leaf #10 has enough
tokens for 1000 bytes, then the leaf will try to borrow the remaining 400 bytes from its neighbor,
leaf #20.
The quantum is the maximal amount of bytes that a leaf will try to send at once. The closer the
value is to the MTU, the more accurate the scheduling will be, because we reschedule after
every 1500 bytes. And the larger the value of quantum is, the more a leaf will be privileged:
it will be allowed to borrow more tokens from its neighbor. But of course, since the total amount
of tokens in the tree is not unlimited, if a token is borrowed by a leaf, another leaf cannot use it
anymore. Therefore, the bigger the value of quantum is, the more a leaf is able to steal from its
neighbors. This is tricky because those neighbors might very well have packets to send as well.
When configuring TC, we do not manipulate the value of quantum directly. There is an
intermediary parameter called r2q that calculates the quantum automatically from the rate:
quantum = rate / r2q. By default, r2q is set to 10, so for a rate of 200kbits/s, quantum will have a
value of 20kbits.
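Note that tc performs this division on the rate expressed in bytes per second, since quantum is a byte count. A quick check of the numbers above (the helper name is ours):

```python
def htb_quantum(rate_bits_per_sec, r2q=10):
    # tc computes: quantum = rate in bytes per second, divided by r2q
    return rate_bits_per_sec // 8 // r2q

# For a 200kbit/s rate and the default r2q of 10:
q = htb_quantum(200_000)
print(q, "bytes")    # 2500 bytes, i.e. 20 kbit worth of data
```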

For very small or very large bandwidths, it is important to tune r2q properly. If r2q is too large,
too many packets will leave a queue at once. If r2q is too small, not enough packets are sent.
One important detail is that r2q is set on the root qdisc once and for all. It cannot be configured
for each leaf separately.
TC offers the following configuration options for HTB:
Usage: ... qdisc add ... htb [default N] [r2q N]
 default   minor id of class to which unclassified packets are sent {0}
 r2q       DRR quantums are computed as rate in Bps/r2q {10}
 debug     string of 16 numbers each 0-3 {0}

... class add ... htb rate R1 [burst B1] [mpu B] [overhead O]
                      [prio P] [slot S] [pslot PS]
                      [ceil R2] [cburst B2] [mtu MTU] [quantum Q]
 rate      rate allocated to this class (class can still borrow)
 burst     max bytes burst which can be accumulated during idle period {computed}
 mpu       minimum packet size used in rate computations
 overhead  per-packet size overhead used in rate computations
 linklayer adapting to a linklayer e.g. atm
 ceil      definite upper class rate (no borrows) {rate}
 cburst    burst but for ceil {computed}
 mtu       max packet size we create rate map for {1600}
 prio      priority of leaf; lower are served first {0}
 quantum   how much bytes to serve from leaf at once {use r2q}

As you can see, we are now familiar with all of these parameters. If you just jumped to this section
without reading about SFQ and TBF, please read those chapters for a detailed explanation of
what these parameters do.
Remember that, when configuring leaves in HTB, the sum of the rates of the leaves cannot be
higher than the rate of the root. It makes sense, right?
4.2.2.1 Hysteresis and HTB

Hysteresis. If this barbarian word is not familiar to you, as it wasn't to me, here is how Wikipedia
defines it: Hysteresis is the dependence of a system not just on its current environment but also
on its past.
Hysteresis is a side effect introduced by an optimization of HTB. In order to reduce the load on
the CPU, HTB initially did not recalculate the content of the bucket often enough, therefore
allowing some classes to consume more tokens than they actually held, without borrowing.
The problem was corrected and a parameter introduced to allow or block the usage of estimates in
HTB calculations. The kernel developers kept the optimization feature simply because it can
prove useful in high traffic networks, where recalculating the content of the bucket each time is
simply not doable.

But in most cases, this optimization is simply deactivated, as shown below:


# cat /sys/module/sch_htb/parameters/htb_hysteresis
0

4.2.3 CoDel
http://queue.acm.org/detail.cfm?id=2209336

4.2.4 FQ_CoDel

4.2.5 HFSC - Hierarchical Fair Service Curve

http://nbd.name/gitweb.cgi?p=openwrt.git;a=tree;f=package/qosscripts/files;h=71d89f8ad63b0dda0585172ef01f77c81970c8cc;hb=HEAD
4.2.6 QFQ - Quick Fair Queueing

http://www.youtube.com/watch?v=r8vBmybeKlE
4.2.7 RED - Random Early Detection

4.2.8 CHOKe - CHOose and {Keep,Kill}

5 Shaping the traffic on the Home Network

Home networks are tricky to shape, because everybody wants the priority and it's difficult to
predetermine a usage pattern. In this chapter, we will build a TC policy that answers general
needs. Those are:

* Low latency. The uplink is only 1.5Mbps and the latency shouldn't be more than 30ms under high load. We can tune the buffers in the qdiscs to ensure that our packets will not stay in a large queue for 500ms waiting to be processed.
* High UDP responsiveness, for applications like Skype and DNS queries.
* Guaranteed HTTP/S bandwidth: half of the uplink is dedicated to the HTTP traffic (although other classes can borrow from it) to ensure that web browsing, probably 80% of a home network's usage, is smooth and responsive.
* TCP ACKs and SSH traffic get higher priority. In the age of Netflix and HD VoD, it's necessary to ensure fast download speeds. And for that, you need to be able to send TCP ACKs as fast as possible. This is why those packets get a higher priority than the HTTP traffic.
* A general class for everything else.

This policy is represented in the diagram below. We will use PFIFO_FAST and SFQ terminating
qdiscs once we exit HTB to perform some additional scheduling (and prevent a single HTTP
connection from eating all of the bandwidth, for example).

DIA source
The script that generates this policy is available on github via the icon below, with comments to
help you follow through.

get the bash script from github


Below is one of the sections, in charge of creating the class for SSH. I have replaced the
variables with their values for readability.
# SSH class: for outgoing connections to
# avoid lag when somebody else is downloading
# however, an SSH connection cannot fill up
# the connection to more than 70%

echo "#---ssh - id 300 - rate 160 kbit ceil 1120 kbit"


/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb \
rate 160kbit ceil 1120kbit burst 15k prio 3
# SFQ will mix the packets if there are several
# SSH connections in parallel
# and ensure that none has the priority
echo "#--- ~ sub ssh: sfq"
/sbin/tc qdisc add dev eth0 parent 1:300 handle 1300: \
sfq perturb 10 limit 32
echo "#--- ~ ssh filter"
/sbin/tc filter add dev eth0 parent 1:0 protocol ip \
prio 3 handle 300 fw flowid 1:300
echo "#--- ~ netfilter rule - SSH at 300"
/sbin/iptables -t mangle -A POSTROUTING -o eth0 -p tcp \
--tcp-flags SYN SYN --dport 22 -j CONNMARK \
--set-mark 300

The first rule is the definition of the HTB class, the leaf. It connects back to its parent 1:1, defines
a rate of 160kbit/s and can use up to 1120kbit/s by borrowing the difference from other leaves.
The burst value is set to 15k, which is 10 full packets with an MTU of 1500 bytes.
The second rule defines a SFQ qdisc connected to the HTB one above. That means that once
packets have passed the HTB leaf, they will pass through a SFQ leaf before being transmitted.
The SFQ will ensure that multiple parallel connections are mixed before being transmitted, and
that one connection cannot eat the whole bandwidth. We limit the size of the SFQ queue to 32
packets, instead of the default of 128.
Then comes the TC filter in the third rule. This filter will check the handle of each packet or, to be
more accurate, the value of nf_mark in the sk_buff representation of the packet in the kernel.
Using this mark, the filter will direct SSH packets to the HTB leaf above.
Even though this rule is located in the SSH class block for clarity, you might have noticed that
the filter has the root qdisc for parent (parent 1:0). Filters are always attached to the root qdisc,
and not to the leaves. That makes sense, because the filtering must be done at the entrance of the
traffic control layer.
And finally, the fourth rule is the iptables rule that applies a mark to SYN packets leaving the
gateway (connection establishments). Why SYN packets only? To avoid performing complex
matching on all the packets of all the connections. We will rely on netfilter's capability to maintain
stateful information to propagate a mark placed on the first packet of a connection to all of the
other packets. This is done by the following rule at the end of the script:
echo "#--- ~ propagating marks on connections"
iptables -t mangle -A POSTROUTING -j CONNMARK --restore-mark

Let us now load the script on our gateway, and visualize the qdiscs created.
# /etc/network/if-up.d/lnw_gateway_tc.sh start
~~~~ LOADING eth0 TRAFFIC CONTROL RULES FOR ramiel ~~~~
#-cleanup
RTNETLINK answers: No such file or directory
#-define a HTB root qdisc
#--uplink - rate 1600 kbit ceil 1600 kbit
#---interactive - id 100 - rate 160 kbit ceil 1600 kbit
#--- ~ sub interactive: pfifo
#--- ~ interactive filter
#--- ~ netfilter rule - all UDP traffic at 100
#---tcp acks - id 200 - rate 320 kbit ceil 1600 kbit
#--- ~ sub tcp acks: pfifo
#--- ~ filtre tcp acks
#--- ~ netfilter rule for TCP ACKs will be loaded at the end
#---ssh - id 300 - rate 160 kbit ceil 1120 kbit
#--- ~ sub ssh: sfq
#--- ~ ssh filter
#--- ~ netfilter rule - SSH at 300
#---http branch - id 400 - rate 800 kbit ceil 1600 kbit
#--- ~ sub http branch: sfq
#--- ~ http branch filter
#--- ~ netfilter rule - http/s
#---default - id 999 - rate 160kbit ceil 1600kbit
#--- ~ sub default: sfq
#--- ~ filtre default
#--- ~ propagating marks on connections
#--- ~ Mark TCP ACKs flags at 200
Traffic Control is up and running
# /etc/network/if-up.d/lnw_gateway_tc.sh show
---qdiscs details---
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0 ver 3.17
qdisc pfifo 1100: parent 1:100 limit 10p
qdisc pfifo 1200: parent 1:200 limit 10p
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b flows 32/1024 perturb 10sec

---qdiscs statistics---
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0
 Sent 16776950 bytes 125321 pkt (dropped 4813, overlimits 28190 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1100: parent 1:100 limit 10p
 Sent 180664 bytes 1985 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1200: parent 1:200 limit 10p
 Sent 5607402 bytes 100899 pkt (dropped 4813, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b perturb 10sec
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b perturb 10sec
 Sent 9790497 bytes 15682 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b perturb 10sec
 Sent 1198387 bytes 6755 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

The output above shows just two of the types of output tc can generate. You might find the class
statistics helpful to diagnose the leaves' consumption:
# tc -s class show dev eth0
[...truncated...]
class htb 1:400 parent 1:1 leaf 1400: prio 4 rate 800000bit ceil 1600Kbit
burst 30Kb cburst 1600b
Sent 10290035 bytes 16426 pkt (dropped 0, overlimits 0 requeues 0)
rate 23624bit 5pps backlog 0b 0p requeues 0
lended: 16424 borrowed: 2 giants: 0
tokens: 4791250 ctokens: 120625

Above are the detailed statistics for the HTTP leaf: you can see the accumulated rate,
statistics of packets per second, but also the tokens accumulated, lended, borrowed, etc. This is
the most helpful output to diagnose your policy in depth.

6 A word about "BufferBloat"


We mentioned that overly large buffers can have a negative impact on the performance of a
connection. But how bad is it exactly?
The answer to that question was investigated by Jim Gettys when he found his home network to
be inexplicably slow.
He found that, while we were increasing the bandwidth of network connections, we didn't worry
about the latency at all. Those two factors are quite different and both critical to the good quality
of a network. Allow me to quote Gettys's FAQ here:
A 100 Gigabit network is always faster than a 1 megabit network, isn't it?
More bandwidth is always better! I want a faster network!
No, such a network can easily be much slower.
Bandwidth is a measure of capacity, not a measure of how
fast the network can respond. You pick up the phone to
send a message to Shanghai immediately, but dispatching a
cargo ship full of blu-ray disks will be amazingly slower
than the telephone call, even though the bandwidth of the
ship is billions and billions of times larger than the
telephone line.
So more bandwidth is better only if its latency (speed)
meets your needs. More of what you don't need is useless.
Bufferbloat destroys the speed we really need.

More information on Gettys's page, and in this paper from 1996: It's the Latency, Stupid.
Long story short: if you have bad latency but large bandwidth, you will be able to transfer very
large files efficiently, but a simple DNS query will take a lot longer than it should. And since
those DNS queries, and other small messages such as VoIP, are very often time sensitive, bad
latency impacts them a lot (a single packet takes several hundred milliseconds to traverse the
network).
So how does that relate to buffers? Gettys proposes a simple experiment to illustrate the problem.
We said earlier that Linux ships with a default txqueuelen of 1000. This value was increased in
the kernel when gigabit ethernet cards became the standard. But not all links are gigabit, far
from it. Consider the following two computers:

We will call them desktop and laptop. They are both gigabit, and the switch is gigabit.
If we verify the configuration of their network interfaces, we will confirm that:

the interfaces are configured in gigabits via ethtool

the txqueuelen is set to 1000

# ifconfig eth0|grep txqueuelen


collisions:0 txqueuelen:1000
# ethtool eth0|grep Speed
Speed: 1000Mb/s

On one machine, launch nttcp with the '-i' switch to make it wait for connections:
# nttcp -i

On the laptop, launch nttcp -t -D -n2048000 <serverIP>, where

* -t means this machine is transmitting, or sending data
* -D disables TCP_NODELAY (see the Red Hat doc)
* -n is the number of buffers of 4096 bytes given to the socket

# nttcp -t -D -n2048000 192.168.1.220

And at the same time, on the laptop, launch a ping of the desktop.
64 bytes from 192.168.1.220: icmp_req=1 ttl=64 time=0.300 ms
64 bytes from 192.168.1.220: icmp_req=2 ttl=64 time=0.386 ms
64 bytes from 192.168.1.220: icmp_req=3 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=4 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=5 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=6 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=7 ttl=64 time=19.3 ms
64 bytes from 192.168.1.220: icmp_req=8 ttl=64 time=19.0 ms
64 bytes from 192.168.1.220: icmp_req=9 ttl=64 time=0.281 ms
64 bytes from 192.168.1.220: icmp_req=10 ttl=64 time=0.362 ms

The first two pings were launched before nttcp started. When nttcp starts, the latency increases,
but it is still acceptable.
Now, reduce the speed of each network card on the desktop and the laptop to 100Mbps. The
command is:
# ethtool -s eth0 speed 100 duplex full
# ethtool eth0|grep Speed
Speed: 100Mb/s

And run the same test again. After 60 seconds, here are the latencies I get:
64 bytes from 192.168.1.220: icmp_req=75 ttl=64 time=183 ms
64 bytes from 192.168.1.220: icmp_req=76 ttl=64 time=179 ms
64 bytes from 192.168.1.220: icmp_req=77 ttl=64 time=181 ms

And one last time, with an Ethernet speed of 10Mbps:


64 bytes from 192.168.1.220: icmp_req=187 ttl=64 time=940 ms
64 bytes from 192.168.1.220: icmp_req=188 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=189 ttl=64 time=934 ms

Almost one second of latency between two machines next to each other. Every time we divide
the speed of the interface by an order of magnitude, we increase the latency by an order of
magnitude as well. Under that load, opening an SSH connection from the laptop to the desktop
takes more than 10 seconds, because we have a latency of almost 1 second per packet.
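These numbers line up with simple serialization math. Assuming (our assumption, for illustration) a txqueuelen of 1000 full-MTU packets sitting ahead of the ping, the worst-case delay is just the time needed to drain the queue at line speed:

```python
def queue_delay_ms(packets, pkt_bytes, link_bits_per_sec):
    # Time for a packet at the tail of the queue to reach the wire:
    # total queued bits divided by the link rate, in milliseconds.
    return packets * pkt_bytes * 8 / link_bits_per_sec * 1000

for mbps in (1000, 100, 10):
    d = queue_delay_ms(1000, 1500, mbps * 1_000_000)
    print(f"{mbps:>5} Mbit/s: up to {d:,.0f} ms of queueing delay")
```

This gives roughly 12ms, 120ms and 1200ms: the same order of magnitude as the measured pings (~19ms, ~180ms, ~940ms), the difference being that the queue is not completely full at all times.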
Now, while this last test is running, and while you are enjoying the ridiculous latency of your
SSH session to the desktop, we will get rid of the transmit and ethernet buffers.
# ifconfig eth0 |grep txqueuelen

collisions:0 txqueuelen:1000
# ethtool -g eth0
Ring parameters for eth0:
[...]
Current hardware settings:
[...]
TX: 511

We start by changing the txqueuelen value on the laptop machine from 1000 to zero. The latency
will not change.
# ifconfig eth0 txqueuelen 0
64 bytes from 192.168.1.220: icmp_req=1460 ttl=64 time=970 ms
64 bytes from 192.168.1.220: icmp_req=1461 ttl=64 time=967 ms

Then we reduce the size of the TX ring of the ethernet card. Now that we have almost no
buffering left, let's see what happens:
# ethtool -G eth0 tx 32
64 bytes from 192.168.1.220: icmp_req=1495 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=1499 ttl=64 time=0.865 ms
64 bytes from 192.168.1.220: icmp_req=1500 ttl=64 time=60.3 ms
64 bytes from 192.168.1.220: icmp_req=1501 ttl=64 time=53.1 ms
64 bytes from 192.168.1.220: icmp_req=1502 ttl=64 time=49.2 ms
64 bytes from 192.168.1.220: icmp_req=1503 ttl=64 time=45.7 ms

The latency just got divided by 20! We dropped from almost one second to barely 50ms. This is
the effect of excessive buffering in a network, and this is what happens, today, in most Internet
routers.

6.1 What happens in the buffer ?


If we take a look at the Linux networking stack, we see that the TCP stack sits a long way above
the transmit queue and ethernet buffer. During a normal TCP connection, the TCP stack starts
sending and receiving packets at a normal rate, and accelerates its sending speed at an
exponential rate: send 2 packets, receive ACKs, send 4 packets, receive ACKs, send 8 packets,
receive ACKs, send 16 packets, receive ACKs, etc.
This is known as TCP Slow Start. This mechanism works fine in practice, but the presence
of large buffers will break it.
A buffer of 1MB on a 1Gbit/s link will empty in ~8 milliseconds. But the same buffer on a
1Mbit/s link will take 8 seconds to empty. During those 8 seconds, the TCP stack thinks that all
of the packets it sent have been transmitted, and will probably continue to increase its sending
speed. The subsequent packets will get dropped, the TCP stack will panic, drop its sending rate,
and restart the slow start procedure from 0: 2 packets, get ACK, 4 packets, get ACK, etc.
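The drain times quoted above follow directly from the buffer size and the link rate, and the window growth is a simple doubling per round trip. A quick check of both figures:

```python
def drain_time_s(buffer_bytes, link_bits_per_sec):
    # Time to push a full buffer onto the wire
    return buffer_bytes * 8 / link_bits_per_sec

print(drain_time_s(1_000_000, 1e9))  # 1MB at 1Gbit/s: ~8 milliseconds
print(drain_time_s(1_000_000, 1e6))  # 1MB at 1Mbit/s: 8 seconds

# Slow start doubles the congestion window every round trip: within
# those 8 seconds of false feedback, only 7 round trips are needed to
# go from 2 to 256 packets in flight.
cwnd = 2
for _ in range(7):
    cwnd *= 2
print(cwnd)
```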

But while the TCP stack was filling up the TX buffers, all the other packets that our system
wanted to send got either stuck somewhere in the queue, with several hundreds of milliseconds
of delay before being transmitted, or purely dropped.
The problem happens on the TX queue of the sending machine, but also on all the buffers of
the intermediary network devices. And this is why Gettys went to war against the home router
vendors.
