QoS - Linux - NSM - HTB

The document provides an overview of the Linux HTB Qdisc (Hierarchical Token Bucket). HTB improves upon the Token Bucket Filter (TBF) by introducing the concept of classes arranged in a tree structure. Each class acts as its own TBF queue. HTB allows for prioritization between classes, bandwidth borrowing between classes, and the ability to attach other queueing disciplines like SFQ as exit points. The document describes how packets are classified and queued in HTB based on filters, priorities, and available bandwidth at each node in the class tree.

Overview of

LINUX HTB Qdisc

NOMUS COMM SYSTEMS PVT LTD


HYDERABAD
Presented by NSS MURTHY on 27-12-2019
Hierarchical Token Bucket (HTB) - Classful Qdisc

HTB Home:
http://luxik.cdi.cz/~devik/qos/htb/

net/pkt_sched.h
net/sched/sch_htb.c
 The limits of TBF:

 TBF gives fairly accurate control over the bandwidth assigned to a qdisc.
o But it also forces all packets to pass through a single queue.

o If a big packet is blocked because there are not enough tokens to send it, smaller packets - that could potentially be sent instead - are blocked behind it as well.
o This is the case represented in the diagram above, where packet #2 is stuck behind packet #1.
o We could optimize the bandwidth usage by allowing the smaller packet to be sent instead of the bigger one.
 We would, however, run into the same packet-reordering problem that we discussed with the SFQ algorithm.

 The other solution would be to give more flexibility to the traffic shaper:
o Declare several TBF queues in parallel, and route packets to one or the other using filters.
o We could allow those parallel queues to borrow tokens from each other, in case one is idle and another is not.

 We have just prepared the ground for classful qdiscs, and the Hierarchical Token Bucket (HTB).
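As a sketch of the head-of-line problem that motivates HTB, here is a minimal single-queue token bucket (hypothetical types and names, purely illustrative - not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-queue token bucket: tokens refill at `rate` bytes
 * per timer tick, capped at `burst` bytes. In plain TBF the packet at the
 * head of the queue must fit in the available tokens before anything
 * queued behind it may be sent. */
struct tbucket {
    long tokens;   /* currently available tokens, in bytes */
    long burst;    /* bucket depth: maximum accumulated tokens */
    long rate;     /* tokens added per timer tick */
};

static void tb_tick(struct tbucket *tb)
{
    tb->tokens += tb->rate;
    if (tb->tokens > tb->burst)
        tb->tokens = tb->burst;
}

/* Try to send a packet of pkt_len bytes; returns false if it must wait. */
static bool tb_send(struct tbucket *tb, long pkt_len)
{
    if (pkt_len > tb->tokens)
        return false;
    tb->tokens -= pkt_len;
    return true;
}
```

Note that while the big packet waits, the bucket already holds enough tokens for a small one - exactly the bandwidth that parallel TBF queues (and ultimately HTB classes) can exploit.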
 The HTB is an improved version of TBF that introduces the notion of classes.

 Each class is, in fact, a TBF-like qdisc, and classes are linked together as a tree, with a root and leaves.

 HTB introduces a number of features to improve the management of bandwidth, such as:

• Priorities between classes,
• A way to borrow bandwidth from another class, and
• The possibility to plug in another qdisc as an exit point (an SFQ, for example).

 HTB is meant as a more understandable and intuitive replacement for the CBQ qdisc in Linux.

 Both CBQ and HTB help you to control the use of the outbound bandwidth on a given link.
 Both allow you to use one physical link to simulate several slower links and
 To send different kinds of traffic on different simulated links.

• In both cases, you have to specify


o How to divide the physical link into simulated links and
o How to decide which simulated link to use for a given packet to be sent.

• Unlike CBQ,
o HTB shapes traffic based on the Token Bucket Filter algorithm which does not depend on interface characteristics and so
 Does not need to know the underlying bandwidth of the outgoing interface.

 Shaping Algorithm

 Shaping works as documented in tc-tbf (8).


Classification:

 Within one HTB instance many classes may exist.

 Each of these classes contains another qdisc, by default tc-pfifo(8).


 When enqueueing a packet,
• HTB starts at the root and uses various methods to determine which class should receive the data.

 In the absence of uncommon configuration options, the process is rather easy.


 At each node we look for an instruction, and then go to the class the instruction refers us to.
 If the class found is a barren leaf-node (without children), we enqueue the packet there.
 If it is not yet a leaf node, we do the whole thing over again starting from that node.

 The following actions are performed, in order at each node we visit, until one sends us to another node, or terminates the
process.

 1. Consult filters attached to the class. If sent to a leaf node, we are done. Otherwise, restart.

 2. If none of the above returned with an instruction, enqueue at this node (default class).

 This algorithm makes sure that a packet always ends up somewhere, even while you are busy building your
configuration.
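The walk described in steps 1-2 can be sketched as follows; the types and the single `child` pointer standing in for the filter lookup are simplifications for illustration, not the kernel's actual data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <stdbool.h>

/* Simplified sketch of HTB's classification walk. Each node's filter
 * result is reduced to a single `child` pointer (NULL meaning "no
 * instruction"). */
struct class_node {
    struct class_node *child;  /* class this node's filters refer us to */
    bool is_leaf;              /* only leaf classes hold packet queues */
};

/* Walk from the root: terminate at a leaf, or fall back to the default
 * class when a node gives no instruction. */
static struct class_node *htb_classify_sketch(struct class_node *root,
                                              struct class_node *def)
{
    struct class_node *n = root;
    while (n) {
        if (n->is_leaf)
            return n;          /* step 1: sent to a leaf - we are done */
        if (!n->child)
            return def;        /* step 2: no instruction - default class */
        n = n->child;          /* otherwise restart from that node */
    }
    return def;
}
```

This mirrors the guarantee stated above: the loop cannot fail to place a packet, because every path ends either at a leaf or at the default class.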
Qdisc:

 The root of a HTB qdisc class tree has the following parameters:

 parent major:minor | root


• This mandatory parameter determines the place of the HTB instance,
either at the root of an interface or within an existing class.

• handle major:
o Like all other qdiscs, the HTB can be assigned a handle.
o Should consist only of a major number, followed by a colon.
o Optional, but very useful if classes will be generated within this qdisc.

 default minor-id:
• Unclassified traffic gets sent to the class with this minor-id.

 tc qdisc ... dev dev ( parent classid | root) [ handle major: ] htb [ default minor-id ]

 tc qdisc add dev eth0 root handle 1: htb default 20 (parent is used for intermediate qdiscs.)

 tc class ... dev dev parent major:[minor] [ classid major:minor ] htb rate rate [ ceil rate ] burst bytes [ cburst bytes ] [ prio priority ]

Display:
tc -s -d qdisc
tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 400kbit prio 1 mtu 1500
tc class add dev eth0 parent 1:0 classid 1:20 htb rate 200kbit ceil 200kbit prio 2 mtu 1500
tc -s -d class show dev your_htb_device1_here
tc -s -d class show dev your_htb_device2_here
 Classes: Classes have a host of parameters to configure their operation.

 parent major:minor Mandatory.


• Place of this class within the hierarchy.
• If attached directly to a qdisc and not to another class, minor can be omitted.
 classid major:minor
• Like qdiscs, classes can be named.
• The major number must be equal to the major number of the qdisc to which it belongs.
• The minor number is optional, but needed if this class is going to have children.

 prio priority: In the round-robin process, classes with the lowest priority field are tried for packets first. Mandatory.

 rate rate: Maximum rate this class and all its children (put together) are guaranteed. Mandatory.
 ceil rate: Maximum rate at which a class can send, if its parent has bandwidth to spare.
• Defaults to the configured rate, which implies no borrowing

 burst bytes: Amount of bytes that can be burst at ceil speed, in excess of the configured rate. TBD
• Should be at least as high as the highest burst of all children.

 cburst bytes: Amount of bytes that can be burst at 'infinite' speed, in other words, as fast as the interface can transmit them. TBD
• For perfect evening out, should be equal to at most one average packet. Should be at least as high as the highest cburst of all
children.
Notes
 Due to Unix timing constraints, the maximum ceil rate is not infinite and may in fact be quite low.
 On i386, there are 100 timer events per second; the maximum rate is the rate at which 'burst' bytes can be sent on each timer tick.
 From this, the minimum burst size for a given rate can be calculated.
• For i386, a 10mbit rate requires a roughly 12 kilobyte burst, as 100 * 12.5kB * 8 equals 10mbit.
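The calculation above can be checked with a small helper (hypothetical name, purely illustrative):

```c
#include <assert.h>

/* Minimum burst for a sustained rate on a low-resolution timer: at most
 * `burst` bytes go out per tick, so sustaining rate_bps with hz timer
 * events per second requires burst >= rate_bps / 8 / hz bytes. */
static long min_burst_bytes(long rate_bps, int hz)
{
    return rate_bps / 8 / hz;
}
```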
Problem Statement:
 We have two customers, A and B, both connected to the internet via eth0.
• We want to allocate 60 kbps to B and 40 kbps to A.
• Next we want to subdivide A's bandwidth: 30kbps for WWW and 10kbps for everything else.
• Any unused bandwidth can be used by any class which needs it (in proportion to its allocated share).

[Class tree: root -> A (40) and B (60); A -> A1 WWW (30) and A2 other (10)]
 HTB ensures that the amount of service provided to each class is at least the minimum of amount requested and amount assigned to the class.
• When a class requests less than the amount assigned, the remaining (excess) bandwidth is distributed to the other classes that request service.
• This is called borrowing; there is no concept of repayment.

• This command attaches queue discipline HTB to eth0 with handle 1:. The "default 12" means unclassified traffic will be assigned to class 1:12.

 tc qdisc add dev eth0 root handle 1: htb default 12 // handles are local to an interface; remember that a qdisc is attached to an interface.

 tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps // rate 30+10+60

Notes:
• This creates a root class 1:1 under qdisc 1:; the qdisc is its parent.
o A root class cannot borrow from another root class.
o A root class, like other classes under an htb qdisc, allows its children to borrow from each other.
o If we created the other three classes directly under the htb qdisc, the excess bandwidth from one would not be available to the others.

• Three subclasses/children of the root class are created so that they can borrow from each other.

 tc class add dev eth0 parent 1:1 classid 1:10 htb rate 30kbps ceil 100kbps // A1
 tc class add dev eth0 parent 1:1 classid 1:11 htb rate 10kbps ceil 100kbps // A2
 tc class add dev eth0 parent 1:1 classid 1:12 htb rate 60kbps ceil 100kbps // B
Filter usage:
 Let us define filters to send traffic to a specific class.
• Filters are always added to the root Qdisc.

tc filter add dev eth0 protocol ip parent 1:0 prio 1 \
   u32 match ip src 1.2.3.4 match ip dport 80 0xffff flowid 1:10 // class A1

tc filter add dev eth0 protocol ip parent 1:0 prio 1 \
   u32 match ip src 1.2.3.4 flowid 1:11 // class A2

• We identify class A packets by its IP address 1.2.3.4.


• Notice that we didn't create a filter for the 1:12 class (class B).
o It might be more clear to do so, but this illustrates the use of the default.
o Any packet not classified by the two rules (any packet not from source address 1.2.3.4) will be put in class 1:12.

 Now we can optionally attach queuing disciplines to the leaf classes. If none is specified, the default is pfifo.
tc qdisc add dev eth0 parent 1:10 handle 20: pfifo limit 5 // limit is the pfifo queue length, in packets
tc qdisc add dev eth0 parent 1:11 handle 30: pfifo limit 5
tc qdisc add dev eth0 parent 1:12 handle 40: sfq perturb 10

Quantum:
 When more than one class wants to borrow bandwidth, each is given some number of bytes before the other competing classes are served. This number is called the quantum. If several classes are competing for a parent's bandwidth, they receive it in proportion to their quanta. For precise operation, quanta should be as small as possible, but larger than the MTU.

 The quantum need not be configured manually, as HTB calculates it on its own: quantum for a class = rate / r2q. The default value of r2q is 10. Because a typical MTU is 1500 bytes, the default is good for rates from 15 kBps (120 kbit) upward. For smaller minimal rates, specify r2q 1 when creating the qdisc - it is good from 12 kbit, which should be enough. If needed, you can specify the quantum manually when adding or changing a class; this also avoids warnings in the log when the precomputed value would be bad.

 When you specify the quantum on the command line, r2q is ignored for that class.
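The quantum rule can be sanity-checked with a one-line helper (hypothetical name; rate is in bytes per second, as HTB computes it internally):

```c
#include <assert.h>

/* quantum = rate / r2q, per the rule above; for precise operation the
 * result should end up just above the MTU. */
static long htb_quantum_sketch(long rate_Bps, int r2q)
{
    return rate_Bps / r2q;
}
```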
Sharing hierarchy:

 The earlier solution looks good only if A and B are not different customers.
• However, if A is paying for 40kbps, then he would probably prefer his unused WWW bandwidth to go to his own other service rather than to B.
• This requirement is represented in HTB by the class hierarchy.

[Class tree: root -> A and B; A -> A1 and A2]

 Customer A is now explicitly represented by its own class.

• Leaf Classes:
o Recall from above that the amount of service provided to each class is at least the minimum of the amount it requests and amount assigned to it.
o This applies to htb classes that are not parents of other htb classes.

• Interior Classes:
o Htb classes that are parents of other htb classes are called interior classes.
o The rule is that the amount of service is at least the minimum of the amount assigned to it and the sum of the amounts requested by its children.
o In this case we assign 40kbps to customer A. That means that if A1 requests less than the allocated rate for WWW, the excess will be used for A's other traffic, i.e. A2 (if there is demand for it), at least until the sum is 40kbps.

tc class add dev eth0 parent 1: classid 1:1 htb rate 100kbps ceil 100kbps // root class
tc class add dev eth0 parent 1:1 classid 1:2 htb rate 40kbps ceil 100kbps // inner class A
tc class add dev eth0 parent 1:2 classid 1:10 htb rate 30kbps ceil 100kbps // leaf class A1
tc class add dev eth0 parent 1:2 classid 1:11 htb rate 10kbps ceil 100kbps // leaf class A2
tc class add dev eth0 parent 1:1 classid 1:12 htb rate 60kbps ceil 100kbps // leaf class B

• When A's WWW traffic (A1) stops, its assigned bandwidth is reallocated to A's other traffic (A2), so that A's total bandwidth is still the assigned 40kbps.
• If A were to request less than 40kbps in total, the excess would be given to B.

• Notes: Packet classification rules can assign packets to inner nodes too; then you have to attach another filter list to the inner node. Finally you should reach a leaf or the special 1:0 class. The rate supplied for a parent should be the sum of the rates of its children.
Rate ceiling:

 The ceil argument specifies the maximum bandwidth that a class can use.
• This limits how much bandwidth that class can borrow.
• The default ceil is the same as the rate.
• In the above examples, borrowing was shown.

 We now change the ceil 100kbps for classes 1:2 (A) and 1:11 (A's other A2) from the previous example to ceil 60kbps and ceil 20kbps.

tc class add dev eth0 parent 1:1 classid 1:2 htb rate 40kbps ceil 60kbps (earlier 100kbps) // inner class A
tc class add dev eth0 parent 1:2 classid 1:10 htb rate 30kbps ceil 100kbps // leaf class A1
tc class add dev eth0 parent 1:2 classid 1:11 htb rate 10kbps ceil 20kbps (earlier 100kbps) // leaf class A2
tc class add dev eth0 parent 1:1 classid 1:12 htb rate 60kbps ceil 100kbps // independent leaf class B

• Now when A1's traffic stops and A2's ceil is limited to 20kbps, customer A gets only 20kbps in total, and the unused 20kbps is allocated to B.
• When B stops, without the ceil all of its bandwidth would be given to A; but now A is only allowed to use 60kbps, so the remaining 40kbps goes unused.

• Note that root classes are not allowed to borrow, so there's really no point in specifying a ceil for them.

• This feature should be useful for ISPs because they probably want to limit the amount of service a given customer gets even when other customers are
not requesting service. (ISPs probably want customers to pay more money for better service.)

 Notes:
• The ceil for a class should always be at least as high as the rate.
• Also, the ceil for a class should always be at least as high as the ceil of any of its children.
Burst: TBD

• Networking HW can only send one packet at a time and only at a hardware dependent rate.

• Link sharing software can only use this ability to approximate the effects of multiple links running at different (lower) speeds.

• Therefore the rate and ceil are not really instantaneous measures but averages over the time that it takes to send many packets.

• What really happens is that the traffic from one class is sent a few packets at a time at the maximum speed and then other classes are served for a while.
o The burst and cburst parameters control the amount of data that can be sent at the maximum (hardware) speed without trying to serve another class.

• If cburst is small (ideally one packet size), it shapes bursts so that they do not exceed the ceil rate, in the same way as TBF's peakrate does.

• When you set the burst for a parent class smaller than for some child, you should expect the parent class to get stuck sometimes (because the child will drain more than the parent can handle).
o HTB will remember these negative bursts for up to 1 minute.

• Why do we need bursts?

o It is a cheap and simple way to improve response times on a congested link.
o For example, WWW traffic is bursty: you ask for a page, get it in a burst, and then read it. During that idle period the burst will "charge" again.

• Note: The burst and cburst of a class should always be at least as high as those of any of its children.

Limitation:
• When you operate at high rates on a computer with a low-resolution timer, you need some minimal burst and cburst to be set for all classes.

• The timer resolution is 10ms on i386 systems and 1ms on Alphas. The minimal burst can be computed as max_rate * timer_resolution; so for 10Mbit on plain i386 you need a burst of about 12kB.
• If you set too small a burst you will encounter a smaller rate than you configured. Recent tc tools compute and set the smallest possible burst when it is not specified.
Prioritizing bandwidth share: use prio

 Prioritizing traffic affects how the excess bandwidth is distributed among siblings.
• Up to now we have seen that excess bandwidth was distributed according to rate ratios (quantum).
• Set the priority of all classes to 1 except SMTP (green), which is set to 0 (higher priority).
• The rule is that classes with higher priority are offered excess bandwidth first, but the rules about guaranteed rate and ceil are still met.

 Which classes should we prioritize?

• Generally those classes where you really need low delays.
• Examples are video or audio traffic (and you will really need to use a correct rate here to prevent this traffic from starving the others), or interactive (telnet, SSH) traffic, which is bursty in nature and will not negatively affect other flows.
• A common trick is to prioritize ICMP to get nice ping delays even on fully utilized links (though from a technical point of view this is not what you want when measuring connectivity).
Understanding statistics: The tc tool allows us to gather statistics of queuing disciplines in Linux.

1. tc -s -d qdisc show dev eth0 // Qdisc Stats

qdisc pfifo 22: limit 5p Sent 0 bytes 0 pkts (dropped 0, overlimits 0)


qdisc pfifo 21: limit 5p Sent 2891500 bytes 5783 pkts (dropped 820, overlimits 0)
qdisc pfifo 20: limit 5p Sent 1760000 bytes 3520 pkts (dropped 3320, overlimits 0) // First three disciplines are HTB's children.
qdisc htb 1: r2q 10 default 1 direct_packets_stat 0 Sent 4651500 bytes 9303 pkts (dropped 4140, overlimits 34251)
// overlimits tells you how many times the discipline delayed a packet.
// direct_packets_stat tells you how many packets were sent through the direct queue. The other stats are self-explanatory.

2. tc -s -d class show dev eth0 // Class Stats

class htb 1:1 root prio 0 rate 800Kbit ceil 800Kbit burst 2Kb/8 mpu 0b cburst 2Kb/8 mpu 0b quantum 10240 level 3 // See slide 11 for diagram. Level is actually 2
Sent 5914000 bytes 11828 pkts (dropped 0, overlimits 0) rate 70196bps 141pps lended: 6872 borrowed: 0 giants: 0
class htb 1:2 parent 1:1 prio 0 rate 320Kbit ceil 4000Kbit burst 2Kb/8 mpu 0b cburst 2Kb/8 mpu 0b quantum 4096 level 2 // Level is 1 not 2
Sent 5914000 bytes 11828 pkts (dropped 0, overlimits 0) rate 70196bps 141pps lended: 1017 borrowed: 6872 giants: 0
class htb 1:10 parent 1:2 leaf 20: prio 1 rate 224Kbit ceil 800Kbit burst 2Kb/8 mpu 0b cburst 2Kb/8 mpu 0b quantum 2867 level 0
Sent 2269000 bytes 4538 pkts (dropped 4400, overlimits 36358) rate 14635bps 29pps lended: 2939 borrowed: 1599 giants: 0

Classes 1:11 and 1:12 are deleted to make output shorter.

The configured parameters are shown.
The level and DRR quantum information are shown.
overlimits shows how many times the class was asked to send a packet but could not due to rate/ceil constraints.
rate and pps tell you the actual (10-second averaged) rate going through the class. It is the same rate as used by gating.
lended is the number of packets donated by this class (from its rate), and borrowed counts packets for which we borrowed from the parent. Lends are always computed class-locally, while borrows are transitive (when 1:10 borrows from 1:2, which in turn borrows from 1:1, both the 1:10 and 1:2 borrow counters are incremented).

giants is the number of packets larger than the mtu set in the tc command.
Add mtu to your tc command (it defaults to 1600 bytes).
tc qdisc add dev eth0 root handle 1: htb default 20

tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 400kbit prio 1 mtu 1500
tc class add dev eth0 parent 1:0 classid 1:20 htb rate 200kbit ceil 200kbit prio 2 mtu 1500
HTB theory:

• Each class is associated with

o AR (assured rate), CR (ceil rate), priority P, level, quantum Q, a parent, and
o R (the actual rate).

• Actual rate (R): the rate of the packet flow leaving the class, measured over a small period.
o For inner classes, R is the sum of R over all descendant leaves.

• Leaf class is a class with no children.


o Only leaf class can hold a packet queue.
o Classes A1, A2 and B shown in slide 11 are leaf classes

• Level of a class: determines its position in the hierarchy.

o Leaves have level 0,
o Root classes have level LEVEL_COUNT-1, and
o Each inner class has a level one less than its parent. // See pictures below (LEVEL_COUNT=3 there).

• Mode of a class: computed from R, AR and CR. The possible modes are

Red: R > CR;  Yellow: R <= CR and R > AR;  Green: otherwise.

• D(c): the list of all backlogged leaves which are descendants of class c,
o where all classes between such a leaf (including it) and c are yellow.
o In other words, D(c) is the list of all leaves which would want to borrow from c.
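The mode rule above can be written as a tiny helper (a sketch with hypothetical names; the kernel's equivalent is enum htb_cmode in sch_htb.c):

```c
#include <assert.h>

/* The three-colour mode rule, as a function of the actual rate R,
 * assured rate AR and ceil rate CR. */
enum cmode { RED, YELLOW, GREEN };

static enum cmode class_mode(long R, long AR, long CR)
{
    if (R > CR)
        return RED;     /* can't send and can't borrow */
    if (R > AR)
        return YELLOW;  /* can't send from its own rate, but may borrow */
    return GREEN;       /* can send */
}
```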

Auxiliary delay goal:

• Ensure class isolation, so that a change in the rate of one class does not affect the delay in another class unless both are actually borrowing from the same ancestor.
• Also, a high-priority class should have lower delays than a low-priority one, assuming they are both borrowing from the same level.
Link sharing goal:

Now we can define link sharing goal as definition for R. For each class c it should hold

Rc = min(CRc, ARc + Bc) [eq1]

where Bc is rate borrowed from ancestors and is defined as

                  Qc * Rp
Bc = -------------------------------------   if min[Pi over D(p)] >= Pc   [eq2]
      sum[Qi over D(p) where Pi = Pc]

Bc = 0   otherwise                                                        [eq3]

• where p is c's parent class. If there is no p then Bc = 0.

• The two definitions of Bc above reflect priority queuing:

• When there are backlogged descendants with numerically lower prio in D(p), those should be served, not us.
• The fraction above shows that the excess rate (Rp) is divided between all leaves at the same priority according to their Q values.
Because Rp in [eq2] is itself defined by [eq1], the formulas are recursive.

• We can say that the ARs of all classes are maintained if there is demand and no CRs are exceeded.
o Excess bandwidth is subdivided between backlogged descendant leaves of the highest priority, according to Q (quantum) ratios.
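Under simplifying assumptions (a single borrowing step, all competing leaves at the same priority), one evaluation of [eq1]-[eq3] for a single class can be sketched as (hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>

/* One step of the link-sharing rule for class c: the borrowed share Bc is
 * the parent's excess rate Rp split by quantum ratio among competing
 * leaves of the same priority ([eq2]), or zero when a higher priority is
 * backlogged ([eq3]); the resulting rate is clamped by the ceil ([eq1]). */
static long class_rate(long CR, long AR, long Qc, long sumQ,
                       long Rp, bool highest_prio)
{
    long B = highest_prio ? (Qc * Rp) / sumQ : 0;  /* [eq2] / [eq3] */
    long R = AR + B;                               /* [eq1]: min(CR, AR+B) */
    return R < CR ? R : CR;
}
```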
HTB Scheduler: Linux implementation is considered.

 There is a tree of classes (struct htb_class) in HTB scheduler (struct htb_sched).

 There is also a global self feed list (htb_sched::row).

o It is the rightmost column in the pictures.
o The self feed is comprised of self slots.
o There is one slot per priority per level, so there are six self slots in the example (ignore the white slot for now).
o Each self slot holds a list of classes - the list is depicted by a coloured line going from the slot to the class(es).
o All classes in a slot have the same level and priority as the slot.
o The self slot contains the list of all green classes which have demand (are in the D(c) set).

 Each inner (non-leaf) class has inner feed slots (htb_class::inner.feed).

o Again there is one inner feed slot per priority (red - high, blue - low) and per inner class.
o As with self slots, each holds a list of classes with the same priority as the slot, and the classes must be the slot owner's children.
o Again the inner slot holds the list of yellow children which are in D(c).

http://luxik.cdi.cz/~devik/qos/htb/manual/theory.htm
/* interior & leaf nodes; props specific to leaves are marked L: */
struct htb_class
{
	struct Qdisc_class_common common;

	/* topology */
	int level;			/* our level (see above) */
	unsigned int children;
	struct htb_class *parent;	/* parent class */

	int prio;			/* these two are used only by leaves... */
	int quantum;			/* but stored for parent-to-leaf return */

	union {
		struct htb_class_leaf {
			struct Qdisc *q;
			int deficit[TC_HTB_MAXDEPTH];
			struct list_head drop_list;
		} leaf;
		struct htb_class_inner {
			struct rb_root feed[TC_HTB_NUMPRIO];	/* feed trees */
			struct rb_node *ptr[TC_HTB_NUMPRIO];	/* current class ptr */
			/* When class changes from state 1->2 and disconnects from
			 * parent's feed then we lost ptr value and start from the
			 * first child again. Here we store classid of the last
			 * valid ptr (used when ptr is NULL). */
			u32 last_ptr_id[TC_HTB_NUMPRIO];
		} inner;
	} un;

	struct rb_node node[TC_HTB_NUMPRIO];	/* node for self or feed tree */
	struct rb_node pq_node;			/* node for event queue */
	psched_time_t pq_key;

	int prio_activity;	/* for which prios are we active */
	enum htb_cmode cmode;	/* current mode of the class */

	/* class attached filters */
	struct tcf_proto *filter_list;
	int filter_cnt;

	/* token bucket parameters */
	struct qdisc_rate_table *rate;	/* rate table of the class itself */
	struct qdisc_rate_table *ceil;	/* ceiling rate (limits borrows too) */
	long buffer, cbuffer;		/* token bucket depth/rate */
	psched_tdiff_t mbuffer;		/* max wait time */
	long tokens, ctokens;		/* current number of tokens */
	psched_time_t t_c;		/* checkpoint time */
};

struct Qdisc_class_common
{
	u32 classid;
	struct hlist_node hnode;
};
struct htb_sched
{
	struct Qdisc_class_hash clhash;
	struct list_head drops[TC_HTB_NUMPRIO];	/* active leaves (for drops) */

	/* self list - roots of self generating tree; self feed list */
	struct rb_root row[TC_HTB_MAXDEPTH][TC_HTB_NUMPRIO];	/* one node per level and priority */
	int row_mask[TC_HTB_MAXDEPTH];
	struct rb_node *ptr[TC_HTB_MAXDEPTH][TC_HTB_NUMPRIO];
	u32 last_ptr_id[TC_HTB_MAXDEPTH][TC_HTB_NUMPRIO];

	/* self wait list - roots of wait PQs per row */
	struct rb_root wait_pq[TC_HTB_MAXDEPTH];

	/* time of nearest event per level (row) */
	psched_time_t near_ev_cache[TC_HTB_MAXDEPTH];

	int defcls;			/* class where unclassified flows go to */

	/* filters for qdisc itself */
	struct tcf_proto *filter_list;

	int rate2quantum;		/* quant = rate / rate2quantum */
	psched_time_t now;		/* cached dequeue time */
	struct qdisc_watchdog watchdog;

	/* non shaped skbs; let them go directly thru */
	struct sk_buff_head direct_queue;
	int direct_qlen;		/* max qlen of above */
	long direct_pkts;
};

struct rb_root
{
	struct rb_node *rb_node;
};

#define RB_RED		0
#define RB_BLACK	1

struct rb_node
{
	unsigned long rb_parent_color;
	struct rb_node *rb_right;
	struct rb_node *rb_left;
};
HTB (Hierarchical Token Bucket) Algorithm:

 HTB is like TBF with multiple classes.
 It allows assigning a priority to each class in the hierarchy.
 Levels: each class is assigned a level.
• A leaf ALWAYS has level 0,
• Root classes have level TC_HTB_MAXDEPTH-1, and
• Interior nodes have a level one less than their parent.

static int __init htb_module_init(void)
{
	return register_qdisc(&htb_qdisc_ops);
}

static struct Qdisc_ops htb_qdisc_ops = {
	.next		= NULL,
	.cl_ops		= &htb_class_ops,
	.id		= "htb",
	.priv_size	= sizeof(struct htb_sched),
	.enqueue	= htb_enqueue,
	.dequeue	= htb_dequeue,
	.peek		= qdisc_peek_dequeued,
	.drop		= htb_drop,
	.init		= htb_init,
	.reset		= htb_reset,
	.destroy	= htb_destroy,
	.change		= NULL,		/* htb_change */
	.dump		= htb_dump,
	.owner		= THIS_MODULE,
};

static const struct Qdisc_class_ops htb_class_ops = {
	.graft		= htb_graft,
	.leaf		= htb_leaf,
	.qlen_notify	= htb_qlen_notify,
	.get		= htb_get,
	.put		= htb_put,
	.change		= htb_change_class,
	.delete		= htb_delete,
	.walk		= htb_walk,
	.tcf_chain	= htb_find_tcf,
	.bind_tcf	= htb_bind_filter,
	.unbind_tcf	= htb_unbind_filter,
	.dump		= htb_dump_class,
	.dump_stats	= htb_dump_class_stats,
};
struct htb_sched
{
	struct Qdisc_class_hash clhash;
	struct list_head drops[TC_HTB_NUMPRIO];	// active leaves (for drops)

	// self list - roots of self generating tree
	struct rb_root row[TC_HTB_MAXDEPTH][TC_HTB_NUMPRIO];
	int row_mask[TC_HTB_MAXDEPTH];		// level
	struct rb_node *ptr[TC_HTB_MAXDEPTH][TC_HTB_NUMPRIO];
	u32 last_ptr_id[TC_HTB_MAXDEPTH][TC_HTB_NUMPRIO];

	// self wait list - roots of wait PQs per row
	struct rb_root wait_pq[TC_HTB_MAXDEPTH];

	// time of nearest event per level (row)
	psched_time_t near_ev_cache[TC_HTB_MAXDEPTH];

	int defcls;				// class where unclassified flows go to

	// filters for qdisc itself
	struct tcf_proto *filter_list;

	int rate2quantum;			// quant = rate / rate2quantum
	psched_time_t now;			// cached dequeue time
	struct qdisc_watchdog watchdog;

	// non shaped skbs; let them go directly thru
	struct sk_buff_head direct_queue;
	int direct_qlen;			// max qlen of above
	long direct_pkts;

#define HTB_WARN_TOOMANYEVENTS	0x1
	unsigned int warned;			// only one warning
	struct work_struct work;
};

struct list_head
{
	struct list_head *next, *prev;
};

struct rb_root
{
	struct rb_node *rb_node;
};

#define RB_RED		0
#define RB_BLACK	1

struct rb_node
{
	unsigned long rb_parent_color;
	struct rb_node *rb_right;
	struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));
struct htb_class	// interior & leaf nodes; props specific to leaves are marked L:
{
	struct Qdisc_class_common common;	// { u32 classid; struct hlist_node hnode; }

	// general class parameters
	bstats, qstats, rate_est, xstats;	// our special stats

	// topology
	int level;			// our level (see above)
	unsigned int children;
	struct htb_class *parent;	// parent class

	int prio;			// these two are used only by leaves...
	int quantum;			// but stored for parent-to-leaf return

	union {
		struct htb_class_leaf {
			struct Qdisc *q;
			int deficit[TC_HTB_MAXDEPTH];
			struct list_head drop_list;
		} leaf;
		struct htb_class_inner {
			struct rb_root feed[TC_HTB_NUMPRIO];	// feed trees
			struct rb_node *ptr[TC_HTB_NUMPRIO];	// current class ptr
			// When class changes from state 1->2 and disconnects from
			// parent's feed then we lost ptr value and start from the
			// first child again. Here we store classid of the last
			// valid ptr (used when ptr is NULL).
			u32 last_ptr_id[TC_HTB_NUMPRIO];
		} inner;
	} un;

	struct rb_node node[TC_HTB_NUMPRIO];	// node for self or feed tree
	struct rb_node pq_node;			// node for event queue
	psched_time_t pq_key;

	int prio_activity;	// for which prios are we active
	enum htb_cmode cmode;	// current mode of the class

	// class attached filters
	struct tcf_proto *filter_list;
	int filter_cnt;

	// token bucket parameters
	struct qdisc_rate_table *rate;	// rate table of the class itself
	struct qdisc_rate_table *ceil;	// ceiling rate (limits borrows too)
	long buffer, cbuffer;		// token bucket depth/rate
	psched_tdiff_t mbuffer;		// max wait time
	long tokens, ctokens;		// current number of tokens
	psched_time_t t_c;		// checkpoint time
};

// used internally to keep the status of a single class
enum htb_cmode
{
	HTB_CANT_SEND,		// class can't send and can't borrow.	RED
	HTB_MAY_BORROW,		// class can't send but may borrow.	YELLOW
	HTB_CAN_SEND		// class can send.			GREEN
};
static void htb_reset(struct Qdisc *sch) // reset all classes
{
struct htb_sched *q = qdisc_priv(sch);
struct htb_class *cl;
struct hlist_node *n;
unsigned int i; // always called under BH & queue lock

for (i = 0; i < q->clhash.hashsize; i++)


{
hlist_for_each_entry(cl, n, &q->clhash.hash[i], common.hnode)
{
if(cl->level)
{
memset(&cl->un.inner, 0, sizeof(cl->un.inner));
}
else
{
if(cl->un.leaf.q)
qdisc_reset(cl->un.leaf.q);
INIT_LIST_HEAD(&cl->un.leaf.drop_list);
}

cl->prio_activity = 0;
cl->cmode = HTB_CAN_SEND;
}
}
qdisc_watchdog_cancel(&q->watchdog);
__skb_queue_purge(&q->direct_queue);
sch->q.qlen = 0;
memset(q->row, 0, sizeof(q->row));
memset(q->row_mask, 0, sizeof(q->row_mask));
memset(q->wait_pq, 0, sizeof(q->wait_pq));
memset(q->ptr, 0, sizeof(q->ptr));
for(i = 0; i < TC_HTB_NUMPRIO; i++)
INIT_LIST_HEAD(q->drops + i);
}
static int htb_init(struct Qdisc *sch, struct nlattr *opt)
{
    struct htb_sched *q = qdisc_priv(sch);
    struct nlattr *tb[TCA_HTB_INIT + 1];
    struct tc_htb_glob *gopt;
    int err, i;

    if (!opt)
        return -EINVAL;

    err = nla_parse_nested(tb, TCA_HTB_INIT, opt, htb_policy);
    if (err < 0)
        return err;

    if (tb[TCA_HTB_INIT] == NULL)
    {
        printk(KERN_ERR "HTB: hey probably you have bad tc tool ?\n");
        return -EINVAL;
    }
    gopt = nla_data(tb[TCA_HTB_INIT]);

    if (gopt->version != HTB_VER >> 16)
    {
        printk(KERN_ERR
               "HTB: need tc/htb version %d (minor is %d), you have %d\n",
               HTB_VER >> 16, HTB_VER & 0xffff, gopt->version);
        return -EINVAL;
    }

    err = qdisc_class_hash_init(&q->clhash);
    if (err < 0)
        return err;

    for (i = 0; i < TC_HTB_NUMPRIO; i++)
        INIT_LIST_HEAD(q->drops + i);

    qdisc_watchdog_init(&q->watchdog, sch);
    INIT_WORK(&q->work, htb_work_func);
    skb_queue_head_init(&q->direct_queue);

    q->direct_qlen = qdisc_dev(sch)->tx_queue_len;
    if (q->direct_qlen < 2)    // some devices have zero tx_queue_len
        q->direct_qlen = 2;

    if ((q->rate2quantum = gopt->rate2quantum) < 1)
        q->rate2quantum = 1;

    q->defcls = gopt->defcls;

    return 0;
}
// htb_classify - classify a packet into a class
//
// 1. Returns:
//    • NULL, if the packet should be dropped.
//    • HTB_DIRECT (-1), if the packet should be passed directly through.
//    • In all other cases a leaf class is returned.
//
// 2. We allow direct class selection by classid in skb->priority.
//    Then we examine filters in the qdisc and in inner nodes (if a higher
//    filter points to the inner node).
//    If we end up with classid MAJOR:0 we enqueue the skb into a special
//    internal fifo (direct). These packets then go directly through.
//    If we still have no valid leaf, we try to use the MAJOR:default leaf.
//    If still unsuccessful, we finish and return the direct queue.

#define HTB_DIRECT (struct htb_class *)-1

struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
{
    struct htb_sched *q = qdisc_priv(sch);
    struct htb_class *cl;
    struct tcf_result res;
    struct tcf_proto *tcf;
    int result;

    // Allow class selection by setting skb->priority to a valid classid;
    // note that nfmark can be used too by attaching filter fw with no rules in it.
    if (skb->priority == sch->handle)
        return HTB_DIRECT;    // X:0 (direct flow) selected

    if ((cl = htb_find(skb->priority, sch)) != NULL && cl->level == 0)
        return cl;
    // Contd ….
}

// Find the class in the global hash table using the given handle.
struct htb_class *htb_find(u32 handle, struct Qdisc *sch)
{
    struct htb_sched *q = qdisc_priv(sch);
    struct Qdisc_class_common *clc;

    clc = qdisc_class_find(&q->clhash, handle);
    if (clc == NULL)
        return NULL;
    return container_of(clc, struct htb_class, common);
}

// container_of - cast a member of a structure out to the containing structure
// ptr:    the pointer to the member.
// type:   the type of the container struct this is embedded in.
// member: the name of the member within the struct.
#define container_of(ptr, type, member) ({ \
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    (type *)( (char *)__mptr - offsetof(type,member) );})

struct Qdisc_class_common *qdisc_class_find(struct Qdisc_class_hash *hash, u32 id)
{
    struct Qdisc_class_common *cl;
    struct hlist_node *n;
    unsigned int h;

    h = qdisc_class_hash(id, hash->hashmask);
    hlist_for_each_entry(cl, n, &hash->hash[h], hnode)
        if (cl->classid == id)
            return cl;
    return NULL;
}

unsigned int qdisc_class_hash(u32 id, u32 mask)
{
    id ^= id >> 8;
    id ^= id >> 4;
    return id & mask;
}
1. struct htb_sched contains a hash list of all classes:

   struct Qdisc_class_hash clhash;

2. htb_sched::row[][] - the global self feed list:

   //                 level            priority
   struct rb_root row[TC_HTB_MAXDEPTH][TC_HTB_NUMPRIO];

   • The self feed is comprised of self slots (self list - roots of the self-generating trees).
   • There is one self slot for each combination of level and priority.
   • All classes in a slot have the same level and priority as the slot.
   • Each self slot holds a list of green classes only.

3. htb_class::un.inner.feed[] - the inner feed slots:

   struct rb_root feed[TC_HTB_NUMPRIO]; // feed trees

   • Each inner (non-leaf) class has inner feed slots as shown above. This is not global.
   • There is one inner feed slot per priority (red-high, blue-low).
   • Each feed slot contains a list of classes having the same priority as the slot.
   • The classes must be the slot owner's (inner class's) children.
   • A child class is moved from its self feed list to the feed list of its parent class when
     the child class turns yellow and can now borrow from its parent. It is assumed here
     that the child class moved from green to yellow while the parent class is in the green
     state. The parent class is also connected to a self feed list (identified by the level
     of the parent and the priority of the child class), since it is green.
   • The inner class moves its child class from its feed slots to the wait_pq or D(c) list
     when the inner class (parent) itself moves to yellow.

4. htb_sched::wait_pq - the self wait lists:

   //                     level
   struct rb_root wait_pq[TC_HTB_MAXDEPTH];

   • Roots of the wait PQs, one per row (level).
   • Each wait queue holds all classes on that level which are either red or yellow.

5. htb_class::pq_node

   struct rb_node pq_node; // node for event queue

6. htb_class::pq_key

   psched_time_t pq_key;

   • Classes in a wait queue are sorted by this wall time at which they will change colour,
     because the colour change is asynchronous.

7. htb_class::cmode

   enum htb_cmode cmode; // current mode of the class

   // used internally to keep status of a single class
   enum htb_cmode
   {
       HTB_CANT_SEND,   // class can't send and can't borrow. Red
       HTB_MAY_BORROW,  // class can't send but may borrow.   Yellow
       HTB_CAN_SEND     // class can send.                    Green
   };

8. htb_class::prio_activity

   int prio_activity; // for which priorities we are active; represented by a bit map.

9. htb_sched::ptr
static int htb_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
    int uninitialized_var(ret);
    struct htb_sched *q = qdisc_priv(sch);
    struct htb_class *cl = htb_classify(skb, sch, &ret);

    if (cl == HTB_DIRECT)
    {
        // enqueue to helper queue
        if (q->direct_queue.qlen < q->direct_qlen)
        {
            __skb_queue_tail(&q->direct_queue, skb);
            q->direct_pkts++;
        }
        else
        {
            kfree_skb(skb);
            sch->qstats.drops++;
            return NET_XMIT_DROP;
        }
#ifdef CONFIG_NET_CLS_ACT
    }
    else if (!cl)
    {
        if (ret & __NET_XMIT_BYPASS)
            sch->qstats.drops++;
        kfree_skb(skb);
        return ret;
#endif
    }
    else if ((ret = qdisc_enqueue(skb, cl->un.leaf.q)) != NET_XMIT_SUCCESS)
    {
        if (net_xmit_drop_count(ret))
        {
            sch->qstats.drops++;
            cl->qstats.drops++;
        }
        return ret;
    }
    else
    {
        cl->bstats.packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
        cl->bstats.bytes += qdisc_pkt_len(skb);
        htb_activate(q, cl);
    }

    sch->q.qlen++;
    sch->bstats.packets += skb_is_gso(skb) ? skb_shinfo(skb)->gso_segs : 1;
    sch->bstats.bytes += qdisc_pkt_len(skb);
    return NET_XMIT_SUCCESS;
}

int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
#ifdef CONFIG_NET_SCHED
    if (sch->stab)
        qdisc_calculate_pkt_len(skb, sch->stab);
#endif
    return sch->enqueue(skb, sch);  // enqueues skb to the leaf qdisc (pfifo by default)
}

// htb_activate - inserts leaf class cl into the appropriate active feeds.
// The routine learns the (new) priority of the leaf and activates the feed
// chain for that prio. It can safely be called on an already active leaf.
// It also adds the leaf into the drop list.
void htb_activate(struct htb_sched *q, struct htb_class *cl)  // activate the class
{
    WARN_ON(cl->level || !cl->un.leaf.q || !cl->un.leaf.q->q.qlen);

    if (!cl->prio_activity)  // if this leaf class is not already active
    {
        cl->prio_activity = 1 << cl->prio;  // priority values start from 0
        htb_activate_prios(q, cl);
        list_add_tail(&cl->un.leaf.drop_list, q->drops + cl->prio);
    }
}
// htb_activate_prios - creates active classes' feed chain
// The class is connected to ancestors and/or appropriate rows
// for the priorities it is participating on.
// It does nothing if cl->prio_activity == 0.
static void htb_activate_prios(struct htb_sched *q, struct htb_class *cl)
{
    struct htb_class *p = cl->parent;
    long m, mask = cl->prio_activity;

    while (cl->cmode == HTB_MAY_BORROW && p && mask)
    {
        m = mask;
        while (m)
        {
            int prio = ffz(~m);
            m &= ~(1 << prio);

            if (p->un.inner.feed[prio].rb_node)
                // parent already has its feed in use, so
                // reset the bit in mask as parent is already ok
                mask &= ~(1 << prio);

            htb_add_to_id_tree(p->un.inner.feed + prio, cl, prio);
        }
        p->prio_activity |= mask;
        cl = p;
        p = cl->parent;
    }
    if (cl->cmode == HTB_CAN_SEND && mask)
        htb_add_class_to_row(q, cl, mask);
}

// htb_add_class_to_row - add class to its row
// The class is added to the row at the priorities marked in mask.
// It does nothing if mask == 0.
// cl->cmode must be the new (activated) mode.
void htb_add_class_to_row(struct htb_sched *q, struct htb_class *cl, int mask)
{
    q->row_mask[cl->level] |= mask;
    while (mask)
    {
        int prio = ffz(~mask);
        // ffz finds the index of the first zero bit:
        // if mask is 1, ~mask ends in ...fe, and the index of the
        // first zero bit in ...fe is 0, so prio = 0.
        mask &= ~(1 << prio);
        htb_add_to_id_tree(q->row[cl->level] + prio, cl, prio);
        // the row structure is global, present in htb_sched
    }
}

// hlist_for_each_entry - iterate over a list of the given type
// tpos:   the type * to use as a loop cursor.
// pos:    the &struct hlist_node to use as a loop cursor.
// head:   the head for your list.
// member: the name of the hlist_node within the struct.
#define hlist_for_each_entry(tpos, pos, head, member) \
    for (pos = (head)->first; \
         pos && ({ prefetch(pos->next); 1; }) && \
         ({ tpos = hlist_entry(pos, typeof(*tpos), member); 1; }); \
         pos = pos->next)
// htb_add_to_id_tree - adds class to the round-robin list.
// The routine adds the class to the list (actually a tree) sorted by classid.
// Make sure that the class is not already on such a list for the given prio.
void htb_add_to_id_tree(struct rb_root *root, struct htb_class *cl, int prio)
{
    struct rb_node **p = &root->rb_node, *parent = NULL;

    // This loop traverses the row[level][prio] global structure
    // (in htb_sched) and finds the insertion point for class cl
    // based on its classid.
    while (*p)
    {
        struct htb_class *c;
        parent = *p;
        c = rb_entry(parent, struct htb_class, node[prio]);  // see next sheet

        if (cl->common.classid > c->common.classid)
            p = &parent->rb_right;
        else
            p = &parent->rb_left;
    }

    rb_link_node(&cl->node[prio], parent, p);
    rb_insert_color(&cl->node[prio], root);
}

void rb_link_node(struct rb_node *node,
                  struct rb_node *parent,
                  struct rb_node **rb_link)
{
    node->rb_parent_color = (unsigned long)parent;
    node->rb_left = node->rb_right = NULL;
    *rb_link = node;
}
struct htb_class
{
    struct Qdisc_class_common common;   // general class parameters

    ------------------------

    struct rb_node node[TC_HTB_NUMPRIO];   // node for self or feed tree

    -------------------------
};

struct rb_node *parent = NULL;

c = rb_entry(parent, struct htb_class, node[prio]);
// Get the htb_class that contains the given node[prio] member.

#define rb_entry(ptr, type, member) container_of(ptr, type, member)

// container_of - cast a member of a structure out to the containing structure
// ptr:    the pointer to the member.
// type:   the type of the container struct this is embedded in.
// member: the name of the member within the struct.
#define container_of(ptr, type, member) ({ \
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    (type *)( (char *)__mptr - offsetof(type,member) );})
void rb_insert_color(struct rb_node *node, struct rb_root *root)
{
    struct rb_node *parent, *gparent;

    while ((parent = rb_parent(node)) && rb_is_red(parent))
    {
        gparent = rb_parent(parent);

        if (parent == gparent->rb_left)
        {
            {
                register struct rb_node *uncle = gparent->rb_right;

                if (uncle && rb_is_red(uncle))
                {
                    rb_set_black(uncle);
                    rb_set_black(parent);
                    rb_set_red(gparent);
                    node = gparent;
                    continue;
                }
            }

            if (parent->rb_right == node)
            {
                register struct rb_node *tmp;
                __rb_rotate_left(parent, root);
                tmp = parent;
                parent = node;
                node = tmp;
            }

            rb_set_black(parent);
            rb_set_red(gparent);
            __rb_rotate_right(gparent, root);
        }
        else
        {
            {
                register struct rb_node *uncle = gparent->rb_left;

                if (uncle && rb_is_red(uncle))
                {
                    rb_set_black(uncle);
                    rb_set_black(parent);
                    rb_set_red(gparent);
                    node = gparent;
                    continue;
                }
            }

            if (parent->rb_left == node)
            {
                register struct rb_node *tmp;
                __rb_rotate_right(parent, root);
                tmp = parent;
                parent = node;
                node = tmp;
            }

            rb_set_black(parent);
            rb_set_red(gparent);
            __rb_rotate_left(gparent, root);
        }
    } // while

    rb_set_black(root->rb_node);
}
static struct sk_buff *htb_dequeue(struct Qdisc *sch)
{
    struct sk_buff *skb = NULL;
    struct htb_sched *q = qdisc_priv(sch);
    int level;
    psched_time_t next_event;
    unsigned long start_at;

    // try to dequeue direct packets as high prio (!) to minimize CPU work
    skb = __skb_dequeue(&q->direct_queue);
    if (skb != NULL)
    {
        sch->flags &= ~TCQ_F_THROTTLED;
        sch->q.qlen--;
        return skb;
    }

    if (!sch->q.qlen)
        goto fin;

    q->now = psched_get_time();
    start_at = jiffies;

    next_event = q->now + 5 * PSCHED_TICKS_PER_SEC;

    for (level = 0; level < TC_HTB_MAXDEPTH; level++)
    {
        // common case optimization - skip event handler quickly
        int m;
        psched_time_t event;

        if (q->now >= q->near_ev_cache[level])
        {
            event = htb_do_events(q, level, start_at);
            if (!event)
                event = q->now + PSCHED_TICKS_PER_SEC;
            q->near_ev_cache[level] = event;
        }
        else
            event = q->near_ev_cache[level];

        if (next_event > event)
            next_event = event;

        m = ~q->row_mask[level];
        while (m != (int)(-1))
        {
            int prio = ffz(m);
            m |= 1 << prio;
            skb = htb_dequeue_tree(q, prio, level);
            if (likely(skb != NULL))
            {
                sch->q.qlen--;
                sch->flags &= ~TCQ_F_THROTTLED;
                goto fin;
            }
        }
    }
    sch->qstats.overlimits++;

    if (likely(next_event > q->now))
        qdisc_watchdog_schedule(&q->watchdog, next_event);
    else
        schedule_work(&q->work);
fin:
    return skb;
}
enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
{
    struct qdisc_watchdog *wd = container_of(timer, struct qdisc_watchdog, timer);

    wd->qdisc->flags &= ~TCQ_F_THROTTLED;
    __netif_schedule(qdisc_root(wd->qdisc));

    return HRTIMER_NORESTART;
}

// htb
void htb_work_func(struct work_struct *work)
{
    struct htb_sched *q = container_of(work, struct htb_sched, work);
    struct Qdisc *sch = q->watchdog.qdisc;

    __netif_schedule(qdisc_root(sch));
}

// cbq
enum hrtimer_restart cbq_undelay(struct hrtimer *timer)
{
    -------------------------------------------------------------------------
    sch->flags &= ~TCQ_F_THROTTLED;
    __netif_schedule(qdisc_root(sch));
    return HRTIMER_NORESTART;
}
