Monitoring and Tuning The Linux Networking Stack - Sending Data PDF
TL;DR
https://blog.packagecloud.io/eng/2017/02/06/monitoring-tuning-linux-networking-stack-sending-data/ 1/151
3/26/2019 Monitoring and Tuning the Linux Networking Stack: Sending Data - Packagecloud Blog
This blog post explains how computers running the Linux kernel send
packets, as well as how to monitor and tune each component of the
networking stack as packets flow from user programs to network hardware.
This post forms a pair with our previous post Monitoring and Tuning the
Linux Networking Stack: Receiving Data.
TL;DR
General advice on monitoring and tuning the Linux networking stack
Overview
Detailed Look
Protocol family registration
Sending network data via a socket
sock_sendmsg , __sock_sendmsg , and __sock_sendmsg_nosec
inet_sendmsg
ip_finish_output
neigh_resolve_output
netdev_pick_tx
__netdev_pick_tx
Resuming __dev_queue_xmit
__dev_xmit_skb
Tuning: Transmit Packet Steering (XPS)
Queuing disciplines!
qdisc_run_begin and qdisc_run_end
__qdisc_run
qdisc_restart
dequeue_skb
sch_direct_xmit
handle_dev_cpu_collision
dev_requeue_skb
Using ethtool -S
Using sysfs
Using /proc/net/dev
Monitoring dynamic queue limits
Tuning network devices
Check the number of TX queues being used
Adjust the number of TX queues used
Adjust the size of the TX queues
The End
Extras
Reducing ARP traffic ( MSG_CONFIRM )
UDP Corking
Timestamping
Conclusion
Help with Linux networking or other systems
Related posts
considerable amount of time, effort, and money into understanding how the
various parts of networking system interact.
Many of the example settings provided in this blog post are used solely for
illustrative purposes and are not a recommendation for or against a certain
configuration or default setting. Before adjusting any setting, you should
develop a frame of reference around what you need to be monitoring to
notice a meaningful change.
Overview
For reference, you may want to have a copy of the device data sheet handy.
This post will examine the Intel I350 Ethernet controller, controlled by the
igb device driver. You can find that data sheet (warning: LARGE PDF) here
for your reference.
The high-level path network data takes from a user program to a network
device is as follows:
1. Data is written using a system call (like sendto , sendmsg , et al.).
2. Data passes through the socket subsystem on to the socket’s
protocol family’s system (in our case, AF_INET ).
3. The protocol family passes data through the protocol layers which
(in many cases) arrange the data into packets.
4. The data passes through the routing layer, populating the
destination and neighbour caches along the way (if they are cold).
This can generate ARP traffic if an ethernet address needs to be
looked up.
5. After passing through the protocol layers, packets reach the device
agnostic layer.
6. The output queue is chosen using XPS (if enabled) or a hash
function.
7. The device driver’s transmit function is called.
8. The data is then passed on to the queue discipline (qdisc) attached
to the output device.
9. The qdisc will either transmit the data directly if it can, or queue it
up to be sent during the NET_TX softirq.
10. Eventually the data is handed down to the driver from the qdisc.
11. The driver creates the needed DMA mappings so the device can
read the data from RAM.
12. The driver signals the device that the data is ready to be transmitted.
13. The device fetches the data from RAM and transmits it.
14. Once transmission is complete, the device raises an interrupt to
signal transmit completion.
15. The driver’s registered IRQ handler for transmit completion runs.
For many devices, this handler simply triggers the NAPI poll loop to
start running via the NET_RX softirq.
16. The poll function runs via a softIRQ and calls down into the driver
to unmap DMA regions and free packet data.
This entire flow will be examined in detail in the following sections.
The protocol layers examined below are the IP and UDP protocol layers.
Much of the information presented will serve as a reference for other
protocol layers, as well.
Detailed Look
This blog post will be examining the Linux kernel version 3.13.0 with links
to code on GitHub and code snippets throughout this post, much like the
companion post.
Let’s begin by examining how protocol families are registered in the kernel
and used by the socket subsystem, then we can proceed to sending data.
What happens when you run a piece of code like this in a user program to
create a UDP socket?
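A call of that kind might look like the following minimal sketch. The helper name open_udp_socket is an assumption made for illustration; only the socket call itself and its three arguments matter for the discussion that follows.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: returns a UDP socket descriptor, or -1 on error. */
int open_udp_socket(void)
{
    /* AF_INET selects the IPv4 protocol family; SOCK_DGRAM together
     * with IPPROTO_UDP selects the UDP protocol within that family. */
    return socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
}
```

The three arguments are the values the kernel uses to pick a protocol family and then a protocol stack within it.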
In short, the Linux kernel looks up a set of functions exported by the UDP
protocol stack that deal with many things including sending and receiving
network data. To understand exactly how this works, we have to look into the
AF_INET address family code.
The Linux kernel executes the inet_init function early during kernel
initialization. This function registers the AF_INET protocol family, the
individual protocol stacks within that family (TCP, UDP, ICMP, and RAW), and
calls initialization routines to get protocol stacks ready to process network
data. You can find the code for inet_init in ./net/ipv4/af_inet.c.
The AF_INET protocol family exports a structure that has a create function.
This function is called by the kernel when a socket is created from a user
program:
The inet_create function takes the arguments passed to the socket system
call and searches the registered protocols to find a set of operations to link
to the socket. Take a look:
                err = 0;
                /* Check the non-wild match. */
                if (protocol == answer->protocol) {
                        if (protocol != IPPROTO_IP)
                                break;
                } else {
                        /* Check for the two wild cases. */
                        if (IPPROTO_IP == protocol) {
                                protocol = answer->protocol;
                                break;
                        }
                        if (IPPROTO_IP == answer->protocol)
                                break;
                }
                err = -EPROTONOSUPPORT;
        }
Later, answer , which holds a reference to a particular protocol stack, has its
ops fields copied into the socket structure:
sock->ops = answer->ops;
You can find the structure definitions for all of the protocol stacks in
af_inet.c . Let’s take a look at the TCP and UDP protocol structures:
{
        .type =       SOCK_DGRAM,
        .protocol =   IPPROTO_UDP,
        .prot =       &udp_prot,
        .ops =        &inet_dgram_ops,
        .no_check =   UDP_CSUM_DEFAULT,
        .flags =      INET_PROTOSW_PERMANENT,
},
        /* ... */
        .sendmsg       = inet_sendmsg,
        .recvmsg       = inet_recvmsg,
        /* ... */
};
EXPORT_SYMBOL(inet_dgram_ops);

        /* ... */
        .sendmsg       = udp_sendmsg,
        .recvmsg       = udp_recvmsg,
        /* ... */
};
EXPORT_SYMBOL(udp_prot);
Now, let’s turn to a user program that sends UDP data to see how
udp_sendmsg is called in the kernel!
A user program wants to send UDP network data and so it uses the sendto
system call, maybe like this:
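As a minimal sketch, such a call could be wrapped as follows. The helper name send_datagram and its parameters are assumptions for illustration, not code from the original post; the only call that matters here is sendto.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: send one datagram to ip:port.
 * Returns the number of bytes sent, or -1 on error
 * (including an address string that cannot be parsed). */
ssize_t send_datagram(int fd, const char *ip, unsigned short port,
                      const void *buf, size_t len)
{
    struct sockaddr_in dst;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);          /* port in network byte order */
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
        return -1;

    /* The destination address travels with the call; this is the
     * addr / addr_len pair the kernel copies in, as shown below. */
    return sendto(fd, buf, len, 0, (struct sockaddr *)&dst, sizeof(dst));
}
```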
This system call passes through the Linux system call layer and lands in this
function in ./net/socket.c :
/*
* Send a datagram to a given address. We move the address into kernel
* space and check the user space data area is readable before invoking
* the protocol.
*/
The SYSCALL_DEFINE6 macro unfolds into a pile of macros, which in turn, set
up the infrastructure needed to create a system call with 6 arguments
(hence DEFINE6 ). One of the results of this is that inside the kernel, system
call function names have sys_ prepended to them.
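The name-prefixing effect can be illustrated with a toy macro. This is only a sketch of the token-pasting idea: TOY_SYSCALL_DEFINE2 is a made-up name, and the real kernel macros do considerably more than this.

```c
/* Toy stand-in, NOT the kernel's SYSCALL_DEFINE6: it only shows how
 * a macro can paste "sys_" in front of the name it is given. */
#define TOY_SYSCALL_DEFINE2(name, t1, a1, t2, a2) \
    long sys_##name(t1 a1, t2 a2)

/* Expands to: long sys_add(int a, int b) { ... } */
TOY_SYSCALL_DEFINE2(add, int, a, int, b)
{
    return (long)a + b;
}
```

After expansion the function is reachable only under its prefixed name, sys_add, which mirrors how sendto ends up as sys_sendto inside the kernel.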
The system call code for sendto calls sock_sendmsg after arranging the
data in a way that the lower layers will be able to handle. In particular, it
takes the destination address passed into sendto and arranges it into a
structure, let’s take a look:
iov.iov_base = buff;
iov.iov_len = len;
msg.msg_name = NULL;
msg.msg_iov = &iov;
msg.msg_iovlen = 1;
msg.msg_control = NULL;
msg.msg_controllen = 0;
msg.msg_namelen = 0;
if (addr) {
        err = move_addr_to_kernel(addr, addr_len, &address);
        if (err < 0)
                goto out_put;
        msg.msg_name = (struct sockaddr *)&address;
        msg.msg_namelen = addr_len;
}
This code is copying addr , passed in via the user program, into the kernel
data structure address , which is then embedded into the struct msghdr
structure as msg_name .
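The same arrangement can be reproduced from user space, which may make the kernel snippet above easier to follow. The sketch below fills a struct msghdr the way the sendto code does: one iovec, no control data, and an optional destination address. The helper name fill_msghdr is an assumption for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper mirroring the kernel's setup for sendto. */
void fill_msghdr(struct msghdr *msg, struct iovec *iov,
                 void *buf, size_t len,
                 struct sockaddr *addr, socklen_t addr_len)
{
    iov->iov_base = buf;
    iov->iov_len = len;

    memset(msg, 0, sizeof(*msg));   /* msg_name, msg_control start as NULL */
    msg->msg_iov = iov;
    msg->msg_iovlen = 1;
    if (addr) {
        msg->msg_name = addr;
        msg->msg_namelen = addr_len;
    }
}
```

Passing the resulting structure to sendmsg(fd, &msg, 0) should behave like the corresponding sendto call, since both paths converge on the same struct msghdr.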
inet_sendmsg
As you may have guessed from the name, this is a generic function provided
by the AF_INET protocol family. This function starts by calling
sock_rps_record_flow to record the last CPU that the flow was processed
on; this is used by Receive Packet Steering. Next, this function looks up the
sendmsg function on the socket’s internal protocol operations structure and
calls it:
int inet_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
                 size_t size)
{
        struct sock *sk = sock->sk;

        sock_rps_record_flow(sk);
        /* ... */
        return sk->sk_prot->sendmsg(iocb, sk, msg, size);
}
udp_sendmsg
UDP corking
After variable declarations and some basic error checking, one of the first
things udp_sendmsg does is check if the socket is “corked”. UDP corking is a
feature that allows a user program to request that the kernel accumulate data
from multiple calls to send into a single datagram before sending. There
are two ways to enable this option in your user program:
1. Use the setsockopt system call and pass UDP_CORK as the socket
option.
2. Pass MSG_MORE as one of the flags when calling send , sendto , or
sendmsg from your program.
These options are documented in the UDP man page and the send / sendto
/ sendmsg man page, respectively.
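Both options can be sketched in a few lines of user-space C. The helper names set_udp_cork and send_two_pieces are assumptions for illustration, and send_two_pieces further assumes the socket has already been connected so that send can be used without an address.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>   /* UDP_CORK */

/* Option 1: toggle corking for the whole socket. Returns 0 or -1. */
int set_udp_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_UDP, UDP_CORK, &on, sizeof(on));
}

/* Option 2: cork per call. The first piece is held back because of
 * MSG_MORE; the second call, without MSG_MORE, lets the kernel emit
 * both pieces as a single datagram. Returns the result of the
 * second send, or -1 if either call fails. */
ssize_t send_two_pieces(int fd, const void *a, size_t alen,
                        const void *b, size_t blen)
{
    if (send(fd, a, alen, MSG_MORE) < 0)
        return -1;
    return send(fd, b, blen, 0);
}
```

With option 1, data accumulates until the application clears the option again with set_udp_cork(fd, 0), at which point the pending datagram is sent.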
int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t len)
{
        fl4 = &inet->cork.fl.u.ip4;
        if (up->pending) {
                /*
                 * There are pending frames.
                 * The socket lock must be held while it's corked.
                 */
                lock_sock(sk);
                if (likely(up->pending)) {
                        if (unlikely(up->pending != AF_INET)) {
                                release_sock(sk);
                                return -EINVAL;
                        }
                        goto do_append_data;
                }
                release_sock(sk);
        }
Next, the destination address and port are determined from one of two
possible sources:
1. The socket itself has the destination address stored because the
socket was connected at some point.
2. The address is passed in via an auxiliary structure, as we saw in the
kernel code for sendto .
        /*
         *      Get and verify the address.
         */
        if (msg->msg_name) {
                struct sockaddr_in *usin = (struct sockaddr_in *)msg->msg_name;
                if (msg->msg_namelen < sizeof(*usin))
                        return -EINVAL;
                if (usin->sin_family != AF_INET) {
                        if (usin->sin_family != AF_UNSPEC)
                                return -EAFNOSUPPORT;
                }

                daddr = usin->sin_addr.s_addr;
                dport = usin->sin_port;
                if (dport == 0)
                        return -EINVAL;
        } else {
                if (sk->sk_state != TCP_ESTABLISHED)
                        return -EDESTADDRREQ;
                daddr = inet->inet_daddr;
                dport = inet->inet_dport;
                /* Open fast path for connected socket.
                   Route will not be used, if at least one option is set.
                 */
                connected = 1;
        }
Yes, that is a TCP_ESTABLISHED in the UDP protocol layer! The socket states
for better or worse use TCP state descriptions.
Recall earlier that we saw how the kernel arranges a struct msghdr
structure on behalf of the user when the user program calls sendto . The
code above shows how the kernel parses that data back out in order to set
daddr and dport .
If the udp_sendmsg function was reached by a kernel function which did not
arrange a struct msghdr structure, the destination address and port are
retrieved from the socket itself and the socket is marked as “connected.”
In either case daddr and dport will be set to the destination address and
port.
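The else branch above is observable from user space: calling send on a UDP socket that was never connected, with no destination address, should fail with EDESTADDRREQ on Linux. The probe below is a minimal sketch of that behavior; the function name errno_for_unconnected_send is an assumption for illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical probe: send() on a fresh, unconnected UDP socket.
 * Returns the errno reported by send(), 0 if the call unexpectedly
 * succeeded, or -1 if the socket could not be created. */
int errno_for_unconnected_send(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    int saved;

    if (fd < 0)
        return -1;
    /* No address is supplied and connect() was never called, so
     * msg_name is NULL and the socket is not in TCP_ESTABLISHED. */
    saved = (send(fd, "x", 1, 0) < 0) ? errno : 0;
    close(fd);
    return saved;
}
```

Calling connect on the socket first, or switching to sendto with an address, takes one of the two address sources described above and avoids the error.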
Next, the source address, device index, and any timestamping options which
were set on the socket (like SOCK_TIMESTAMPING_TX_HARDWARE ,
SOCK_TIMESTAMPING_TX_SOFTWARE , SOCK_WIFI_STATUS ) are retrieved and
stored:
ipc.addr = inet->inet_saddr;
ipc.oif = sk->sk_bound_dev_if;
sock_tx_timestamp(sk, &ipc.tx_flags);
The sendmsg and recvmsg system calls allow the user to set or request
ancillary data in addition to sending or receiving packets. User programs
can make use of this ancillary data by crafting a struct msghdr with the
request embedded in it. Many of the ancillary data types are documented in
the man page for IP.
Similarly, the IP_TTL and IP_TOS ancillary messages allow the user to set
the IP packet TTL and TOS values on a per-packet basis, when passed with
data to sendmsg from the user program. Note that both IP_TTL and IP_TOS
may be set at the socket level for all outgoing packets by using setsockopt.
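To make the per-packet case concrete, here is a minimal sketch of how a user program might attach an IP_TTL ancillary message to a struct msghdr before calling sendmsg. The helper name attach_ttl is an assumption for illustration; the CMSG_* macros are the standard ones for building control messages.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: attach an IP_TTL control message to msg.
 * cbuf must be suitably aligned and hold at least
 * CMSG_SPACE(sizeof(int)) bytes.
 * Returns 0 on success, -1 if the buffer is too small. */
int attach_ttl(struct msghdr *msg, void *cbuf, size_t cbuflen, int ttl)
{
    struct cmsghdr *cmsg;

    if (cbuflen < CMSG_SPACE(sizeof(int)))
        return -1;

    memset(cbuf, 0, CMSG_SPACE(sizeof(int)));
    msg->msg_control = cbuf;
    msg->msg_controllen = CMSG_SPACE(sizeof(int));

    cmsg = CMSG_FIRSTHDR(msg);
    cmsg->cmsg_level = IPPROTO_IP;          /* handled by the IP layer */
    cmsg->cmsg_type = IP_TTL;               /* per-packet TTL */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &ttl, sizeof(ttl));
    return 0;
}
```

A non-zero msg_controllen is what sends udp_sendmsg into the ancillary message handling shown next.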
We can see how the kernel handles ancillary messages for sendmsg on UDP
sockets:
        if (msg->msg_controllen) {
                err = ip_cmsg_send(sock_net(sk), msg, &ipc,
                                   sk->sk_family == AF_INET6);
                if (err)
                        return err;
                if (ipc.opt)
                        free = 1;
                connected = 0;
        }
Next, sendmsg will check to see if the user specified any custom IP options
with ancillary messages. If options were set, they will be used. If not, the
options already in use by this socket will be used:
        if (!ipc.opt) {
                struct ip_options_rcu *inet_opt;

                rcu_read_lock();
                inet_opt = rcu_dereference(inet->inet_opt);
                if (inet_opt) {
                        memcpy(&opt_copy, inet_opt,
                               sizeof(*inet_opt) + inet_opt->opt.optlen);
                        ipc.opt = &opt_copy.opt;
                }
                rcu_read_unlock();
        }
Next up, the function checks to see if the source record route (SRR) IP
option is set. There are two types of source record routing: loose and strict
source record routing. If this option was set, the first hop address is
recorded and stored as faddr and the socket is marked as “not connected”.
This will be used later:
After the SRR option is handled, the TOS IP flag is retrieved either from the
value the user set via an ancillary message or the value currently in use by
the socket. This is followed by a check to determine if:
Then, the tos has 0x1 ( RTO_ONLINK ) added to its bit set and the socket is
considered not “connected”:
Multicast or unicast?
Next, the code attempts to deal with multicast. This is a bit tricky, as the
user could specify an alternate source address or device index of where to
send the packet from by sending an ancillary IP_PKTINFO message, as
explained earlier. If the destination address is a multicast address:
1. The device index of where to write the packet will be set to the
multicast device index, and
2. The source address on the packet will be set to the multicast
source address.
Both values are set only if the user has not already overridden them, for
example by sending the IP_PKTINFO ancillary message. Let’s take a look:
        if (ipv4_is_multicast(daddr)) {
                if (!ipc.oif)
                        ipc.oif = inet->mc_index;
                if (!saddr)
                        saddr = inet->mc_addr;
                connected = 0;
        } else if (!ipc.oif)
                ipc.oif = inet->uc_index;
If the destination address is not a multicast address, the device index is set
unless it was overridden by the user with IP_PKTINFO .
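The device index decision can be restated as a small user-space model. This is only a sketch: pick_oif is a made-up name, and its parameters stand in for ipc.oif , inet->mc_index , and inet->uc_index from the kernel code above.

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <netinet/in.h>

/* User-space model only. daddr is in network byte order; the other
 * parameters stand in for ipc.oif, inet->mc_index, inet->uc_index. */
int pick_oif(uint32_t daddr, int ipc_oif, int mc_index, int uc_index)
{
    if (ipc_oif)        /* already chosen, e.g. via IP_PKTINFO */
        return ipc_oif;
    /* IN_MULTICAST expects host byte order and matches 224.0.0.0/4. */
    return IN_MULTICAST(ntohl(daddr)) ? mc_index : uc_index;
}
```

A non-zero index supplied by the user always wins; otherwise the destination address alone decides between the multicast and unicast defaults.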
Routing
The code in the UDP layer that deals with routing begins with a fast path. If
the socket is connected, try to get the routing structure:
if (connected)
rt = (struct rtable *)sk_dst_check(sk, 0);
If the socket was not connected, or if it was but the routing helper
sk_dst_check decided the route was obsolete, the code moves into the slow
path to generate a routing structure. This begins by calling
flowi4_init_output to construct a structure describing this UDP flow:
        if (rt == NULL) {
                struct net *net = sock_net(sk);
                fl4 = &fl4_stack;
                flowi4_init_output(fl4, ipc.oif, sk->sk_mark, tos,
                                   RT_SCOPE_UNIVERSE, sk->sk_protocol,
                                   inet_sk_flowi_flags(sk)|FLOWI_FLAG_CAN_SLEEP,
                                   faddr, saddr, dport, inet->inet_sport);
Once this flow structure has been constructed, the socket and its flow
structure are passed along to the security subsystem so that systems like
SELinux or SMACK can set a security id value on the flow structure. Next,
ip_route_output_flow is called to generate a routing structure for this flow:
                security_sk_classify_flow(sk, flowi4_to_flowi(fl4));
                rt = ip_route_output_flow(net, fl4, sk);
                if (IS_ERR(rt)) {
                        err = PTR_ERR(rt);
                        rt = NULL;
                        if (err == -ENETUNREACH)
                                IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
                        goto out;
                }
The location of the file holding these statistics counters and the other
available counters and their meanings will be discussed below in the UDP
monitoring section.
Next, if the route is for broadcast, but the socket option SOCK_BROADCAST
was not set on the socket, the code terminates. If the socket is considered
“connected” (as described throughout this function), the routing structure is
cached on the socket:
        err = -EACCES;
        if ((rt->rt_flags & RTCF_BROADCAST) &&
            !sock_flag(sk, SOCK_BROADCAST))
                goto out;
        if (connected)
                sk_dst_set(sk, dst_clone(&rt->dst));

        if (msg->msg_flags&MSG_CONFIRM)
                goto do_confirm;
back_from_confirm:
The MSG_CONFIRM flag indicates to the system to confirm that the ARP cache
entry is still valid and prevents it from being garbage collected. The
dst_confirm function simply sets a flag on the destination cache entry which
will be checked much later when the neighbour cache has been queried and an
entry has been found. We’ll see this again later. This feature is commonly
used in UDP networking applications to reduce unnecessary ARP traffic. The
do_confirm label is found near the end of this function, but it is
straightforward:
do_confirm:
        dst_confirm(&rt->dst);
        if (!(msg->msg_flags&MSG_PROBE) || len)
                goto back_from_confirm;
        err = 0;
        goto out;
This code confirms the cache entry and jumps back to back_from_confirm , if
this was not a probe.
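From the application side, requesting this behavior is a matter of passing the flag on the send call. The sketch below assumes the application already has evidence that the peer is reachable, for example because it just received a reply; the helper name send_confirmed is an assumption for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Hypothetical helper: send a datagram and tell the kernel that the
 * peer is known to be reachable, so the neighbour entry can be
 * confirmed instead of being re-probed. */
ssize_t send_confirmed(int fd, const struct sockaddr_in *dst,
                       const void *buf, size_t len)
{
    return sendto(fd, buf, len, MSG_CONFIRM,
                  (const struct sockaddr *)dst, sizeof(*dst));
}
```

The flag should only be used when that evidence really exists; confirming an entry for a peer that has gone away would delay the discovery that it is unreachable.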
Fast path for uncorked UDP sockets: Prepare data for transmit
If UDP corking is not requested, the data can be packed into a struct
sk_buff and passed on to udp_send_skb to move down the stack and closer
to the IP protocol layer. This is done by calling ip_make_skb . Note that the
routing structure generated earlier by calling ip_route_output_flow is
passed in as well. It will be affixed to the skb and used later in the IP
protocol layer.
The MTU.
UDP corking (if enabled).
UDP Fragmentation Offloading (UFO).
Fragmentation, if UFO is unsupported and the size of the data to
transmit is larger than the MTU.
Most network device drivers do not support UFO because the network
hardware itself does not support this feature. Let’s take a look through this
code, keeping in mind that corking is disabled. We’ll look at the corking
enabled path next.
ip_make_skb
Let’s take a look at how the faux corking structure and queue are set up:
        __skb_queue_head_init(&queue);
        cork.flags = 0;
        cork.addr = 0;
        cork.opt = NULL;
        err = ip_setup_cork(sk, &cork, /* more args */);
        if (err)
                return ERR_PTR(err);
As seen above, both the corking structure ( cork ) and the queue ( queue ) are
stack-allocated; neither is needed by the time ip_make_skb has
completed. The faux corking structure is set up with a call to ip_setup_cork
which allocates memory and initializes the structure. Next,
__ip_append_data is called and the queue and corking structure are passed
in:
We’ll see how this function works later, as it is used in both cases whether
the socket is corked or not. For now, all we need to know is that
__ip_append_data will create an skb, append data to it, and add that skb to
the queue passed in. If appending the data failed,
__ip_flush_pending_frames is called to drop the data on the floor and the
error code is passed back upward:
        if (err) {
                __ip_flush_pending_frames(sk, &queue, &cork);
                return ERR_PTR(err);
        }
If no errors occurred, the skb is handed to udp_send_skb which will pass the
skb to the next layer of the networking stack, the IP protocol stack:
        err = PTR_ERR(skb);
        if (!IS_ERR_OR_NULL(skb))
                err = udp_send_skb(skb, fl4);
        goto out;
If there was an error, it will be accounted later. See the “Error Accounting”
section below the UDP corking case for more information.
If UDP corking is being used, but no preexisting data is corked, the slow
path commences:
You can see this in the next piece of code, continuing down udp_sendmsg :
        lock_sock(sk);
        if (unlikely(up->pending)) {
                /* The socket is already corked while preparing it. */
                /* ... which is an evident application bug. --ANK */
                release_sock(sk);
                goto out;
        }
        /*
         *      Now cork the socket to pend data.
         */
        fl4 = &inet->cork.fl.u.ip4;
        fl4->daddr = daddr;
        fl4->saddr = saddr;
        fl4->fl4_dport = dport;
        fl4->fl4_sport = inet->inet_sport;
        up->pending = AF_INET;

do_append_data:
        up->len += ulen;
        err = ip_append_data(sk, fl4, getfrag, msg->msg_iov, ulen,
                             sizeof(struct udphdr), &ipc, &rt,
                             corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
ip_append_data
Once the above conditions are dealt with, the __ip_append_data function is
called, which contains the bulk of the logic for processing data into packets.
__ip_append_data
The way this works centers around the socket’s send queue. Existing data
waiting to be sent (for example, if the socket is corked) will have an entry in
the queue where additional data can be appended.
In the uncorked case, the queue holding the skb is passed to __ip_make_skb
described above where it is dequeued and prepared to be sent to the lower
layers via udp_send_skb .
Now, udp_sendmsg will move on to check the return value ( err below) from
ip_append_data :
        if (err)
                udp_flush_pending_frames(sk);
        else if (!corkreq)
                err = udp_push_pending_frames(sk);
        else if (unlikely(skb_queue_empty(&sk->sk_write_queue)))
                up->pending = 0;
        release_sock(sk);

        ip_rt_put(rt);
        if (free)
                kfree(ipc.opt);
        if (!err)
                return len;
Error accounting
If:
        /*
         * ENOBUFS = no kernel mem, SOCK_NOSPACE = no sndbuf space. Reporting
         * ENOBUFS might not be good (it's not tunable per se), but otherwise
         * we don't have a good statistic (IpOutDiscards but it can be too many
         * things). We could add another new stat but at least for now that
         * seems like overkill.
         */
        if (err == -ENOBUFS || test_bit(SOCK_NOSPACE, &sk->sk_socket->flags)) {
                UDP_INC_STATS_USER(sock_net(sk),
                                   UDP_MIB_SNDBUFERRORS, is_udplite);
        }
        return err;
We’ll see how to read these counters in the monitoring section below.
udp_send_skb
        /*
         * Create a UDP header
         */
        uh = udp_hdr(skb);
        uh->source = inet->inet_sport;
        uh->dest = fl4->fl4_dport;
        uh->len = htons(len);
        uh->check = 0;

        if (is_udplite)                                  /*     UDP-Lite      */
                csum = udplite_csum(skb);
        else if (sk->sk_no_check == UDP_CSUM_NOXMIT) {   /* UDP csum disabled */
                skb->ip_summed = CHECKSUM_NONE;
                goto send;
        } else if (skb->ip_summed == CHECKSUM_PARTIAL) { /* UDP hardware csum */
                udp4_hwcsum(skb, fl4->saddr, fl4->daddr);
                goto send;
        } else
                csum = udp_csum(skb);

send:
        err = ip_send_skb(sock_net(sk), skb);
        if (err) {
                if (err == -ENOBUFS && !inet->recverr) {
                        UDP_INC_STATS_USER(sock_net(sk),
                                           UDP_MIB_SNDBUFERRORS, is_udplite);
                        err = 0;
                }
        } else
                UDP_INC_STATS_USER(sock_net(sk),
                                   UDP_MIB_OUTDATAGRAMS, is_udplite);
        return err;
Two very useful files for getting UDP protocol statistics are:
/proc/net/snmp
/proc/net/udp
/proc/net/snmp
Note that some errors discovered by the UDP protocol layer are reported in
the statistics files for other protocol layers. One example of this: routing
errors. A routing error discovered by udp_sendmsg will cause an increment
to the IP protocol layer’s OutNoRoutes statistic.
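For monitoring scripts it can be handy to pull a single counter out of this file. /proc/net/snmp pairs a header line of field names with a line of values for each protocol, so a lookup has to walk both lines in step. The sketch below does that on text that has already been read into memory; the function name snmp_lookup is an assumption for illustration, and the field names passed to it must match whatever the running kernel prints.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: find counter `name` for protocol `proto`
 * (e.g. "Udp") in the text of /proc/net/snmp.
 * Returns 0 and stores the value in *out, or -1 if not found. */
int snmp_lookup(const char *text, const char *proto, const char *name,
                unsigned long long *out)
{
    size_t plen = strlen(proto), nlen = strlen(name);
    const char *hdr = text, *val;

    /* Find the header line, which starts with "<proto>:". */
    while (hdr && !(strncmp(hdr, proto, plen) == 0 && hdr[plen] == ':')) {
        hdr = strchr(hdr, '\n');
        if (hdr)
            hdr++;
    }
    if (!hdr)
        return -1;

    /* The values line follows directly and has the same prefix. */
    val = strchr(hdr, '\n');
    if (!val || strncmp(val + 1, proto, plen) != 0 || val[1 + plen] != ':')
        return -1;
    hdr += plen + 1;
    val += 1 + plen + 1;

    /* Walk both lines column by column. */
    for (;;) {
        size_t flen;
        char *end;
        unsigned long long v;

        while (*hdr == ' ') hdr++;
        while (*val == ' ') val++;
        if (*hdr == '\n' || *hdr == '\0' || *val == '\n' || *val == '\0')
            return -1;                       /* ran out of columns */

        flen = strcspn(hdr, " \n");
        v = strtoull(val, &end, 10);
        if (end == val)
            return -1;                       /* not a number */
        if (flen == nlen && strncmp(hdr, name, flen) == 0) {
            *out = v;
            return 0;
        }
        hdr += flen;
        val = end;
    }
}
```

Looking up "SndbufErrors" and "OutDatagrams" under "Udp" this way yields the two counters incremented in udp_send_skb above.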
/proc/net/udp
$ cat /proc/net/udp
sl local_address rem_address st tx_queue rx_queue tr tm->when r
515: 00000000:B346 00000000:0000 07 00000000:00000000 00:00000000
558: 00000000:0371 00000000:0000 07 00000000:00000000 00:00000000
588: 0100007F:038F 00000000:0000 07 00000000:00000000 00:00000000
769: 00000000:0044 00000000:0000 07 00000000:00000000 00:00000000
812: 00000000:006F 00000000:0000 07 00000000:00000000 00:00000000
The first line describes each of the fields in the lines following:
this to help you determine which user process has this socket open.
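The address columns in the output above are hexadecimal. As a minimal sketch, the helper below decodes one ADDR:PORT column into dotted-quad form. It assumes the file was produced on a little-endian machine such as x86, where the lowest byte of the hex word is the first octet of the address; the name decode_udp_endpoint is an assumption for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: decode one "ADDR:PORT" column from
 * /proc/net/udp. Assumes a little-endian source machine.
 * Returns 0 on success, -1 on a parse error. */
int decode_udp_endpoint(const char *col, char *ip, size_t iplen,
                        unsigned int *port)
{
    unsigned int addr;

    if (sscanf(col, "%8x:%4x", &addr, port) != 2)
        return -1;
    /* The port is already a plain number; only the address needs
     * its bytes taken lowest-first. */
    snprintf(ip, iplen, "%u.%u.%u.%u",
             addr & 0xff, (addr >> 8) & 0xff,
             (addr >> 16) & 0xff, (addr >> 24) & 0xff);
    return 0;
}
```

Applied to the third row of the sample output, 0100007F:038F decodes to 127.0.0.1 with port 911.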
The maximum size of the send queue (also called the write queue) can be
adjusted by setting the net.core.wmem_max sysctl.
You can also set the sk->sk_write_queue size by calling setsockopt from
your application and passing SO_SNDBUF. The maximum you can set with
setsockopt is net.core.wmem_max.
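As a rough illustration of the setsockopt call described above, the sketch below opens a UDP socket, requests a send buffer size with SO_SNDBUF, and reads back what the kernel actually applied. The function name is hypothetical. Note that, per the socket(7) man page, Linux doubles the requested value for bookkeeping overhead and getsockopt reports the doubled value, so the number returned will usually not equal the number requested.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Open a UDP socket, ask for a send buffer of `requested` bytes, and return
 * the size the kernel reports afterwards, or -1 on any error.
 */
int probe_send_buffer(int requested)
{
    int actual = -1;
    socklen_t len = sizeof(actual);
    int fd;

    assert(requested > 0);
    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* The kernel silently caps this request at net.core.wmem_max. */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &requested, sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) < 0)
        actual = -1;

    close(fd);
    return actual;
}
```

Comparing the returned value against the request is a quick way to notice that net.core.wmem_max is capping your application.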
IP protocol layer
The UDP protocol layer hands skbs down to the IP protocol by simply
calling ip_send_skb , so let’s start there and map out the IP protocol layer!
ip_send_skb
        err = ip_local_out(skb);
        if (err) {
                if (err > 0)
                        err = net_xmit_errno(err);
                if (err)
                        IP_INC_STATS(net, IPSTATS_MIB_OUTDISCARDS);
        }

        return err;
}
As seen above, ip_local_out is called and the return value is dealt with
after that. The call to net_xmit_errno helps to “translate” any errors from
lower levels into an error that is understood by the IP and UDP protocol
layers. If any error happens, the IP protocol statistic “OutDiscards” is
incremented. We’ll see later which files to read to obtain this statistic. For
now, let’s continue down the rabbit hole and see where ip_local_out takes
us.
err = __ip_local_out(skb);
if (likely(err == 1))
err = dst_output(skb);
return err;
}
We can see from the source to __ip_local_out that the function does two
important things first:
Next, the IP protocol layer will call down into netfilter by calling nf_hook.
The return value of the nf_hook function will be passed back up to
ip_local_out . If nf_hook returns 1 , this indicates that the packet was
allowed to pass and that the caller should pass it along itself. As we saw
above, this is precisely what happens: ip_local_out checks for the return
value of 1 and passes the packet on by calling dst_output itself. Let’s take
a look at the code for __ip_local_out :
iph->tot_len = htons(skb->len);
ip_send_check(iph);
return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT, skb, NULL,
skb_dst(skb)->dev, dst_output);
}
In the interest of brevity (and my RSI), I’ve decided to skip my deep dive into
netfilter, iptables, and conntrack. You can dive into the source for netfilter
by starting here and here.
Keep in mind: if you have numerous or very complex netfilter or iptables
rules, those rules will be executed in the CPU context of the user process
which initiated the original sendmsg call. If you have CPU pinning set up to
restrict execution of this process to a particular CPU (or set of CPUs), be
aware that the CPU will spend system time processing outbound iptables
rules. Depending on your system’s workload, you may want to carefully pin
processes to CPUs or reduce the complexity of your ruleset if you measure a
performance regression here.
Destination cache
subsystems can all be examined in extreme detail on their own. For our
purposes, we can take a quick look to see how this all fits together.
The code we’ve seen above calls dst_output(skb) . This function simply
looks up the dst entry attached to the skb and calls the output function.
Let’s take a look:
Seems simple enough, but how does that output function get attached to
the dst entry in the first place?
ip_output
So, dst_output executes the output function, which in the UDP IPv4 case
is ip_output. The ip_output function is straightforward:
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
allows the packet to pass, the okfn is called. In this case, the okfn is
ip_finish_output .
ip_finish_output
The ip_finish_output function is also short and clear. Let’s take a look:
If netfilter and packet transformation are enabled in this kernel, the skb’s
flags are updated and it is sent back through dst_output. The two more
common cases are:
Let’s take a short detour to talk about Path MTU Discovery before continuing
our way through the kernel.
Linux provides a feature I’ve avoided mentioning until now: Path MTU
Discovery. This feature allows the kernel to automatically determine the
largest MTU for a particular route. Determining this value and sending
packets that are less than or equal to the MTU for the route means that IP
fragmentation can be avoided. This is the preferred setting because
fragmenting packets consumes system resources and is seemingly easy to
avoid: simply send small enough packets and fragmentation is unnecessary.
You can adjust the Path MTU Discovery settings on a per-socket basis by
calling setsockopt in your application with the SOL_IP level and
IP_MTU_DISCOVER optname. The optval can be one of the several values
described in the IP protocol man page. The value you’ll likely want to set is:
IP_PMTUDISC_DO which means “Always do Path MTU Discovery.” More
advanced network applications or diagnostic tools may choose to
implement RFC 4821 themselves to determine the PMTU at application
start for a particular route or routes. In this case, you can use the
IP_PMTUDISC_PROBE option which tells the kernel to set the “Don’t
Fragment” bit, but allows you to send data larger than the PMTU.
Your application can retrieve the PMTU by calling getsockopt , with the
SOL_IP and IP_MTU optname. You can use this to help guide the size of the
UDP datagrams your application will construct prior to attempting
transmissions.
If you have enabled PMTU discovery, any attempt to send UDP data larger
than the PMTU will result in the application receiving the error code
EMSGSIZE. The application can then retry, but with less data.
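Here is a small sketch of the two socket options discussed in this detour. The function names are hypothetical. The code passes IPPROTO_IP as the level, which has the same value as SOL_IP on Linux, and it assumes a Linux libc that exposes IP_MTU_DISCOVER, IP_PMTUDISC_DO, and IP_MTU through netinet/in.h. Per the ip(7) man page, IP_MTU is only valid on a connected socket, so the second helper expects one.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open a UDP socket, request "always do Path MTU Discovery", and return the
 * mode the kernel reports back, or -1 on error.
 */
int probe_pmtu_discover_mode(void)
{
    int want = IP_PMTUDISC_DO;
    int got = -1;
    socklen_t len = sizeof(got);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &want, sizeof(want)) < 0 ||
        getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &got, &len) < 0)
        got = -1;
    close(fd);
    return got;
}

/*
 * Return the kernel's current PMTU estimate for an already connected
 * socket, or -1 on error (for example, if the socket is not connected).
 */
int connected_path_mtu(int connected_fd)
{
    int mtu = -1;
    socklen_t len = sizeof(mtu);

    assert(connected_fd >= 0);
    if (getsockopt(connected_fd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
        return -1;
    return mtu;
}
```

An application could call connected_path_mtu after connect and subtract the IP and UDP header sizes to pick a datagram payload size that should avoid EMSGSIZE.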
ip_finish_output2
/* variable declarations */
if (rt->rt_type == RTN_MULTICAST) {
IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTMCAST, skb->len);
} else if (rt->rt_type == RTN_BROADCAST)
IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUTBCAST, skb->len);
return -ENOMEM;
}
if (skb->sk)
skb_set_owner_w(skb2, skb->sk);
consume_skb(skb);
skb = skb2;
}
If the routing structure associated with this packet is of type multicast, both
the OutMcastPkts and OutMcastOctets counters are bumped by using the
IP_UPD_PO_STATS macro. Otherwise, if the route type is broadcast the
OutBcastPkts and OutBcastOctets counters are bumped.
Next, a check is performed to ensure that the skb structure has enough
room for any link layer headers that need to be added. If not, additional
room is allocated with a call to skb_realloc_headroom and the cost of the
new skb is charged to the associated socket.
rcu_read_lock_bh();
nexthop = (__force u32) rt_nexthop(rt, ip_hdr(skb)->daddr);
neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
if (unlikely(!neigh))
neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
Continuing on, we can see that the next hop is computed by querying the
routing layer followed by a lookup against the neighbour cache. If the
neighbour is not found, one is created by calling __neigh_create. This
could be the case, for example, the first time data is sent to another host.
Note that this function is called with arp_tbl (defined in ./net/ipv4/arp.c) to
create the neighbour entry in the ARP table. Other systems (like IPv6 or
DECnet) maintain their own neighbour tables and would pass a different
structure into __neigh_create. This post does not aim to cover the neighbour
cache in full detail, but it is worth noting that if the neighbour has to be
created, this creation can cause the cache to grow. This post will
cover some more details about the neighbour cache in the sections below.
At any rate, the neighbour cache exports its own set of statistics so that this
growth can be measured. See the monitoring sections below for more
information.
if (!IS_ERR(neigh)) {
int res = dst_neigh_output(dst, neigh, skb);
rcu_read_unlock_bh();
return res;
}
rcu_read_unlock_bh();
dst_neigh_output
The dst_neigh_output function does two important things for us. First,
recall from earlier in this blog post that if a user specified
MSG_CONFIRM via an ancillary message to the sendmsg function, a flag is
flipped to indicate that the destination cache entry for the remote host is
still valid and should not be garbage collected. That check happens here
and the confirmed field on the neighbour is set to the current jiffies count.
if (dst->pending_confirm) {
unsigned long now = jiffies;
dst->pending_confirm = 0;
/* avoid dirtying neighbour */
if (n->confirmed != now)
n->confirmed = now;
}
hh = &n->hh;
if ((n->nud_state & NUD_CONNECTED) && hh->hh_len)
return neigh_hh_output(hh, skb);
else
return n->output(n, skb);
}
and the “hardware header” ( hh ) is cached (because we’ve sent data before
and have previously generated it), call neigh_hh_output . Otherwise, call the
output function. Both code paths end with dev_queue_xmit, which passes the
skb down to the Linux net device subsystem where it will be processed a bit
more before hitting the device driver layer. Let’s follow both the
neigh_hh_output and n->output code paths until we reach
dev_queue_xmit .
neigh_hh_output
static inline int neigh_hh_output(const struct hh_cache *hh, struct sk_buff *skb)
{
        unsigned int seq;
        int hh_len;

        do {
                seq = read_seqbegin(&hh->hh_lock);
                hh_len = hh->hh_len;
                /* ... */

        skb_push(skb, hh_len);
        return dev_queue_xmit(skb);
}
Once the data is copied to the skb and the skb’s internal pointers tracking
the data are updated with skb_push , the skb is passed to dev_queue_xmit
to enter the Linux net device subsystem.
n->output
neigh_resolve_output
        if (!dst)
                goto discard;

        if (!neigh_event_send(neigh, skb)) {
                int err;
The code starts by doing some basic checks and proceeds to calling
neigh_event_send. The neigh_event_send function is a short wrapper around
__neigh_event_send which will do the heavy lifting to resolve the
neighbour. You can read the source for __neigh_event_send in
./net/core/neighbour.c, but the high-level takeaway from the code is that
there are three cases users will be most interested in:
do {
__skb_pull(skb, skb_network_offset(skb));
seq = read_seqbegin(&neigh->ha_lock);
err = dev_hard_header(skb, dev, ntohs(skb->protocol),
neigh->ha, NULL, skb->len);
} while (read_seqretry(&neigh->ha_lock, seq));
if (err >= 0)
rc = dev_queue_xmit(skb);
else
goto out_kfree_skb;
}
If the ethernet header was written without returning an error, the skb is
handed down to dev_queue_xmit to pass through the Linux network device
subsystem for transmit. If there was an error, a goto will drop the skb, set
the return code and return the error:
out:
return rc;
discard:
neigh_dbg(1, "%s: dst=%p neigh=%p\n", __func__, dst, neigh);
out_kfree_skb:
rc = -EINVAL;
kfree_skb(skb);
goto out;
}
EXPORT_SYMBOL(neigh_resolve_output);
Before we proceed into the Linux network device subsystem, let’s take a
look at some files for monitoring and tuning the IP protocol layer.
/proc/net/snmp
$ cat /proc/net/snmp
Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDa
Ip: 1 64 25922988125 0 0 15771700 0 0 25898327616 22789396404 129878
...
enum
{
IPSTATS_MIB_NUM = 0,
/* frequently written fields in fast path, kept in same cache line */
IPSTATS_MIB_INPKTS, /* InReceives */
IPSTATS_MIB_INOCTETS, /* InOctets */
IPSTATS_MIB_INDELIVERS, /* InDelivers */
IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */
IPSTATS_MIB_OUTPKTS, /* OutRequests */
IPSTATS_MIB_OUTOCTETS, /* OutOctets */
/* ... */
/proc/net/netstat
The format is similar to /proc/net/snmp, except the lines are prefixed with
IpExt .
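Both /proc/net/snmp and /proc/net/netstat come as pairs of lines: a line of field names followed by a line of values, each starting with the same prefix. A minimal sketch of looking up one counter by name from such a pair is shown here. The function names are hypothetical, and reading the two lines from the file is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Advance to the next whitespace-delimited token; NULL when none is left. */
static const char *next_token(const char *s, size_t *len)
{
    s += strspn(s, " \t\n");
    *len = strcspn(s, " \t\n");
    return *len ? s : NULL;
}

/*
 * Find `field` in a names line and return the value at the same position in
 * the values line. Returns 0 and fills *out on success, -1 if the prefixes
 * differ or the field is absent.
 */
int snmp_lookup(const char *names, const char *values,
                const char *field, unsigned long long *out)
{
    size_t nlen, vlen, flen;
    const char *n, *v;

    assert(names && values && field && out);
    flen = strlen(field);
    n = next_token(names, &nlen);
    v = next_token(values, &vlen);

    /* Both lines must carry the same prefix, e.g. "Ip:" or "IpExt:". */
    if (!n || !v || nlen != vlen || strncmp(n, v, nlen) != 0)
        return -1;

    for (;;) {
        n = next_token(n + nlen, &nlen);
        v = next_token(v + vlen, &vlen);
        if (!n || !v)
            return -1;
        if (nlen == flen && strncmp(n, field, flen) == 0) {
            *out = strtoull(v, NULL, 10);
            return 0;
        }
    }
}
```

With the Ip lines from the sample output above, looking up DefaultTTL gives 64. The same helper can be pointed at OutDiscards or OutNoRoutes when given the complete lines from a live system.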
Linux supports a feature called traffic control. This feature allows system
administrators to control how packets are transmitted from a machine. This
blog post will not dive into the details of every aspect of Linux traffic
control. This document provides a great in-depth examination of the
system, its control, and its features. There are a few concepts that are worth
mentioning to make the code seen next easier to understand.
The traffic control system contains several different sets of queuing systems
that provide different features for controlling traffic flow. Individual queuing
systems are commonly called qdiscs and are also known as queuing disciplines.
You can think of qdiscs as schedulers; qdiscs decide when and how packets
are transmitted.
On Linux every interface has a default qdisc associated with it. For network
hardware that supports only a single transmit queue, the default qdisc
pfifo_fast is used. Network hardware that supports multiple transmit
queues uses the default qdisc of mq . You can check your system by running
tc qdisc .
Now that those ideas have been introduced, let’s proceed down
dev_queue_xmit from ./net/core/dev.c.
Following that, __dev_queue_xmit is where the heavy lifting gets done. Let’s
take a look and step through this code piece by piece. Follow along:
skb_reset_mac_header(skb);
skb_update_prio(skb);
1. Declaring variables.
2. Preparing the skb to be processed by calling
skb_reset_mac_header . This resets the skb’s internal pointers so
that the ethernet header can be accessed.
3. rcu_read_lock_bh is called to prepare for reading RCU protected
data structures in the code below. Read more about safely using
RCU.
4. skb_update_prio is called to set the skb’s priority, if the network
priority cgroup is being used.
Here the code attempts to determine which transmit queue to use. As you’ll
see later in this post, some network devices expose multiple transmit
queues for transmitting data. Let’s see how this works in detail.
netdev_pick_tx
if (dev->real_num_tx_queues != 1) {
const struct net_device_ops *ops = dev->netdev_ops;
if (ops->ndo_select_queue)
queue_index = ops->ndo_select_queue(dev, skb,
accel_priv);
else
queue_index = __netdev_pick_tx(dev, skb);
if (!accel_priv)
queue_index = dev_cap_txqueue(dev, queue_index);
}
skb_set_queue_mapping(skb, queue_index);
return netdev_get_tx_queue(dev, queue_index);
}
As you can see above, if the network device supports only a single TX
queue, the more complex code is skipped and that single TX queue is
returned. Most devices used on higher end servers will have multiple TX
queues. There are two cases for devices with multiple TX queues:
__netdev_pick_tx
Let’s take a look at how the kernel chooses the TX queue to use for
transmitting data. From ./net/core/flow_dissector.c:
queue_index = new_index;
}
return queue_index;
}
The code begins first by checking if the transmit queue has already been
cached on the socket by calling sk_tx_queue_get. If it hasn’t been cached,
-1 is returned.
The queue_index is < 0. This will happen if the queue hasn’t been
set yet.
If the ooo_okay flag is set. If this flag is set, this means that out of
order packets are allowed now. The protocol layers must set this
flag appropriately. The TCP protocol layer sets this flag when all
outstanding packets for a flow have been acknowledged. When this
happens, the kernel can choose a different TX queue for this
packet. The UDP protocol layer does not set this flag, so UDP
packets will never have ooo_okay set to a non-zero value.
If the queue index is larger than the number of queues. This can
happen if the user has recently changed the queue count on the
device via ethtool . More on this later.
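The three conditions above boil down to a single predicate. The sketch below is a userspace model with a hypothetical function name and plain parameters standing in for the socket's cached index, the skb's ooo_okay flag, and the device's real_num_tx_queues. It treats an index equal to the queue count as out of range too, since queue indices start at zero.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when the cached transmit queue index cannot be reused and the
 * slow path has to pick a queue again.
 */
bool needs_queue_repick(int cached_index, bool ooo_okay,
                        unsigned int real_num_tx_queues)
{
    assert(real_num_tx_queues > 0);
    return cached_index < 0 ||                 /* nothing cached yet */
           ooo_okay ||                         /* reordering is acceptable now */
           (unsigned int)cached_index >= real_num_tx_queues; /* queue is gone */
}
```

For UDP the middle condition never fires, so once a valid index is cached on the socket it keeps being used until the queue count drops below it.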
In any of those cases, the code descends into the slow path to get the
transmit queue. This begins with get_xps_queue, which attempts to use a
user-specified map linking transmit queues to CPUs. This is called Transmit
Packet Steering (XPS).
You can read more about how XPS works by checking the kernel
documentation for XPS. We’ll examine how to tune XPS for your system
below, but for now, all you need to know is that to configure XPS the system
administrator can define a bitmap mapping transmit queues to CPUs.
The function call in the code above to get_xps_queue will consult this
user-specified map in order to determine which transmit queue should be used.
If get_xps_queue returns -1 , skb_tx_hash will be used instead.
skb_tx_hash
/*
 * Returns a Tx hash for the given packet when dev->real_num_tx_queues is used
 * as a distribution range limit for the returned value.
 */
static inline u16 skb_tx_hash(const struct net_device *dev,
                              const struct sk_buff *skb)
{
        return __skb_tx_hash(dev, skb, dev->real_num_tx_queues);
}
/*
 * Returns a Tx hash based on the given packet descriptor a Tx queues' number
 * to be used as a distribution range.
 */
u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
                  unsigned int num_tx_queues)
{
        u32 hash;
        u16 qoffset = 0;
        u16 qcount = num_tx_queues;

        if (skb_rx_queue_recorded(skb)) {
                hash = skb_get_rx_queue(skb);
                while (unlikely(hash >= num_tx_queues))
                        hash -= num_tx_queues;
                return hash;
        }
The first if stanza in this function is an interesting short circuit. The function
name skb_rx_queue_recorded is a bit misleading. An skb has a
queue_mapping field that is used both for rx and tx. At any rate, this if
statement can be true if your system is receiving packets and forwarding
them elsewhere. If that isn’t the case, the code continues.
if (dev->num_tc) {
u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
qoffset = dev->tc_to_txq[tc].offset;
qcount = dev->tc_to_txq[tc].count;
}
Note that if you have used the setsockopt option IP_TOS to set the TOS
flags on the IP packets sent on a particular socket (or on a per-packet basis
if passed as an ancillary message to sendmsg) in your application, the kernel
will translate the TOS options set by you to a priority which ends up in
skb->priority.
Next, the range of appropriate transmit queues for the traffic class will be
generated. They will be used to determine the transmit queue.
If num_tc was zero (because the network device does not support hardware
based traffic control), the qcount and qoffset variables are set to the
number of transmit queues and 0, respectively.
Using qcount and qoffset , the index of the transmit queue will be
calculated:
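To give a feel for how a 32-bit hash, qcount, and qoffset can be combined into a queue index, here is a userspace sketch with hypothetical function names. The first helper mirrors the wrap-around loop from the rx queue short circuit shown earlier. The second uses a multiply-and-shift to scale the hash into the range of queues, which to my knowledge is the approach this kernel version takes, but treat it as an assumption and check the source for your kernel. How the hash itself is derived is outside the scope of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a recorded rx queue number into the range of available tx queues. */
uint16_t wrap_rx_queue(uint16_t rx_queue, uint16_t num_tx_queues)
{
    assert(num_tx_queues > 0);
    while (rx_queue >= num_tx_queues)
        rx_queue -= num_tx_queues;
    return rx_queue;
}

/*
 * Scale a 32-bit hash into [qoffset, qoffset + qcount) without a division:
 * (hash * qcount) / 2^32 is always in [0, qcount).
 */
uint16_t scale_hash_to_queue(uint32_t hash, uint16_t qoffset, uint16_t qcount)
{
    assert(qcount > 0);
    return (uint16_t)((((uint64_t)hash * qcount) >> 32) + qoffset);
}
```

Because the result depends only on the hash and the queue range, packets that hash to the same value land on the same transmit queue.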
Resuming __dev_queue_xmit
q = rcu_dereference_bh(txq->qdisc);
#ifdef CONFIG_NET_CLS_ACT
skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
#endif
trace_net_dev_queue(skb);
if (q->enqueue) {
rc = __dev_xmit_skb(skb, q, dev, txq);
goto out;
}
Next, the code assigns a traffic classification “verdict” to the outgoing data,
if the packet classification API has been enabled in your kernel. Next, the
queue discipline is checked to see if there is a way to queue data. Some
queuing disciplines like the noqueue qdisc do not have a queue. If there is a
queue, the code calls down to __dev_xmit_skb to continue processing the
data for transmit. Afterward, execution jumps to the end of this function.
We’ll take a look at __dev_xmit_skb shortly. For now, let’s see what happens
if there is no queue, starting with a very helpful comment:
Check this and shot the lock. It is not prone from deadlocks.
Either shot noqueue qdisc, it is even simpler 8)
*/
if (dev->flags & IFF_UP) {
int cpu = smp_processor_id(); /* ok because BHs are off */
As the comment illustrates, the only devices that could have a qdisc with no
queues are the loopback device and tunnel devices. If the device is currently
up, then the current CPU is saved. It is used for the next check, which is a bit
tricky. Let’s take a look:
if (txq->xmit_lock_owner != cpu) {
There are two cases: the transmit lock on this device queue is owned by this
CPU or not. If it is not, a counter variable xmit_recursion, which is allocated
per-CPU, is checked here to determine if the count is over the RECURSION_LIMIT.
It is possible that one program could attempt to send data and get
preempted right around this place in the code. Another program could be
selected by the scheduler to run. That second program may attempt to send
data as well and land here. So, the xmit_recursion counter is used to cap
how many times this code can be entered on one CPU at RECURSION_LIMIT.
                if (!netif_xmit_stopped(txq)) {
                        __this_cpu_inc(xmit_recursion);
                        rc = dev_hard_start_xmit(skb, dev, txq);
                        __this_cpu_dec(xmit_recursion);
                        if (dev_xmit_complete(rc)) {
                                HARD_TX_UNLOCK(dev, txq);
                                goto out;
                        }
                }
                HARD_TX_UNLOCK(dev, txq);
                net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                     dev->name);
        } else {
                /* Recursion is detected! It is possible,
                 * unfortunately
                 */
recursion_alert:
                net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
                                     dev->name);
        }
}
The remainder of the code starts by trying to take the transmit lock. The
device’s transmit queue to be used is checked to see if transmit is stopped.
If not, the xmit_recursion variable is incremented and the data is passed
down closer to the device to be transmitted. We’ll see dev_hard_start_xmit in
more detail later. Once this completes, the locks are released and a warning
is printed.
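The increment, transmit, decrement pattern around dev_hard_start_xmit can be modelled in userspace to show how a counter like xmit_recursion bounds re-entry. In this sketch the function name is hypothetical, a thread-local variable stands in for the per-CPU counter, and the RECURSION_LIMIT value of 10 is an assumption made for illustration. The nesting argument simulates a virtual device whose transmit routine hands the packet to yet another device.

```c
#include <assert.h>

#define RECURSION_LIMIT 10  /* assumed value for this sketch */

/* Stand-in for the kernel's per-CPU xmit_recursion counter. */
static _Thread_local int xmit_recursion;

/*
 * Simulate a transmit that re-enters the transmit path `nesting` more times.
 * Returns 0 if every level got to transmit, -1 if the guard tripped.
 */
int guarded_xmit(int nesting)
{
    int rc = 0;

    assert(nesting >= 0);
    if (xmit_recursion > RECURSION_LIMIT)
        return -1;                        /* the "Dead loop" case */

    xmit_recursion++;
    if (nesting > 0)
        rc = guarded_xmit(nesting - 1);   /* device re-enters the path */
    xmit_recursion--;
    return rc;
}
```

A shallow stack of devices passes straight through, while a loop of devices feeding each other is cut off once the counter exceeds the limit. The counter returns to zero in both cases.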
Since we are interested in real ethernet devices, let’s continue down the
code path that would have been taken for those earlier via __dev_xmit_skb .
__dev_xmit_skb
qdisc_pkt_len_init(skb);
qdisc_calculate_pkt_len(skb, q);
/*
* Heuristic to force contended enqueues to serialize on a
* separate lock before trying to get qdisc main lock.
* This permits __QDISC_STATE_RUNNING owner to get the lock more often
* and dequeue packets faster.
*/
contended = qdisc_is_running(q);
if (unlikely(contended))
spin_lock(&q->busylock);
will be used by the qdisc later. This is necessary for skbs that will pass
through hardware based send offloading (such as UDP Fragmentation
Offloading, as we saw earlier) as the additional headers that will be added
when fragmentation occurs need to be taken into account.
Next, a lock is used to help reduce contention on the qdisc’s main lock (a
second lock we’ll see later). If qdisc is currently running, then other
programs attempting to transmit will contend on the qdisc’s busylock . This
allows the running qdisc to process packets and contend with a smaller
number of programs for the second, main lock. This trick increases
throughput as the number of contenders is reduced. You can read the
original commit message describing this here. Next the main lock is taken:
spin_lock(root_lock);
Let’s take a look at what happens in each of these cases, in order starting
with a deactivated qdisc:
        if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
                kfree_skb(skb);
                rc = NET_XMIT_DROP;
This is straightforward. If the qdisc is deactivated, free the data and set the
return code to NET_XMIT_DROP . Next, a qdisc allowing packet bypass, with
no other outstanding packets, that is not currently running:
qdisc_bstats_update(q, skb);
rc = NET_XMIT_SUCCESS;
This if statement is a bit tricky. The entire statement evaluates as true if all
of the following are true:
1. The queue is not empty ( >0 returned). In this case, the lock preventing
contention from other programs is released and __qdisc_run is
called to restart the qdisc processing.
2. The queue was empty ( 0 is returned). In this case qdisc_run_end is
used to turn off qdisc processing.
In either case, the return value NET_XMIT_SUCCESS is set as the return code.
That wasn’t too bad. Let’s check the last case, which is the catch-all:
} else {
skb_dst_force(skb);
rc = q->enqueue(skb, q) & NET_XMIT_MASK;
if (qdisc_run_begin(q)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
contended = false;
}
__qdisc_run(q);
}
}
The function then finishes up by releasing some locks and returning the
return code:
spin_unlock(root_lock);
if (unlikely(contended))
spin_unlock(&q->busylock);
return rc;
For XPS to work, it must be enabled in the kernel configuration (it is on
Ubuntu for kernel 3.13.0), and a bitmask must be configured describing which
CPUs should process packets for a given interface and TX queue.
These bitmasks are similar to the RPS bitmasks and you can find some
documentation about these bitmasks in the kernel documentation.
/sys/class/net/DEVICE_NAME/queues/QUEUE/xps_cpus
So, for eth0 and transmit queue 0, you would modify the le:
/sys/class/net/eth0/queues/tx-0/xps_cpus with a hexadecimal number
indicating which CPUs should process transmit completions from eth0 ’s
transmit queue 0. As the documentation points out, XPS may be
unnecessary in certain con gurations.
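To make the bitmask idea concrete, here is a minimal user-space sketch of building the hexadecimal value for a set of CPUs. The helper names (xps_mask, xps_mask_to_hex) are made up for illustration, and the sketch assumes CPUs 0 through 31 only, so that the value fits in a single 32-bit group.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Build a CPU bitmask: bit N set means CPU N is included.
 * Hypothetical helper; limited to CPUs 0..31 for this sketch. */
static uint32_t xps_mask(const int *cpus, size_t n)
{
    uint32_t mask = 0;

    for (size_t i = 0; i < n; i++) {
        assert(cpus[i] >= 0 && cpus[i] < 32);
        mask |= UINT32_C(1) << cpus[i];
    }
    return mask;
}

/* Format the mask as a hexadecimal string, the form written to xps_cpus. */
static int xps_mask_to_hex(uint32_t mask, char *buf, size_t len)
{
    return snprintf(buf, len, "%" PRIx32, mask);
}
```

For example, CPUs 0 and 2 give the mask 5, which is the string you would then write (as root) to the xps_cpus file for the queue in question.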
Queuing disciplines!
To follow the path of network data, we’ll need to move into the qdisc code a
bit. This post does not intend to cover the specific details of each of the
different transmit queue options. If you are interested in that, check this
excellent guide.
For the purpose of this blog post, we’ll continue the code path by examining
how the generic packet scheduler code works. In particular, we’ll explore
how qdisc_run_begin , qdisc_run_end , __qdisc_run , and sch_direct_xmit
work to move network data closer to the driver for transmit.
        while (qdisc_restart(q)) {
                /*
                 * Ordered by possible occurrence: Postpone processing if
                 * 1. we've exceeded packet quota
                 * 2. another process needs the CPU;
                 */
                if (--quota <= 0 || need_resched()) {
                        __netif_schedule(q);
                        break;
                }
        }

        qdisc_run_end(q);
}
This function begins by obtaining the weight_p value. This is typically set
via a sysctl and is also used in the receive path. We’ll see later how to adjust
this value. This loop does two things:
in the kernel, need_resched will return true. If there’s still available quota
and the user program hasn’t used its time slice up yet, qdisc_restart will
be called over again.
Let’s see how qdisc_restart(q) works and then we’ll dive into
__netif_schedule(q) .
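Before moving on, the quota logic in the loop above can be modeled in user space. The following is only a sketch: sim_qdisc, sim_restart and sim_qdisc_run are invented names, need_resched is left out, and "sending" a packet is just a counter update.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a qdisc: not a kernel structure. */
struct sim_qdisc {
    int qlen;          /* packets waiting in the queue */
    int sent;          /* packets handed to the "driver" */
    bool rescheduled;  /* set when the loop defers the rest of the work */
};

/* Stand-in for qdisc_restart: send one packet, return packets remaining. */
static int sim_restart(struct sim_qdisc *q)
{
    if (q->qlen == 0)
        return 0;
    q->qlen--;
    q->sent++;
    return q->qlen;
}

/* Same shape as the loop in __qdisc_run, minus need_resched(). */
static void sim_qdisc_run(struct sim_qdisc *q, int quota)
{
    assert(quota > 0);
    while (sim_restart(q)) {
        if (--quota <= 0) {
            q->rescheduled = true; /* stands in for __netif_schedule(q) */
            break;
        }
    }
}
```

With 100 packets queued and a quota of 64, this model sends 64 packets and then defers the remaining 36. With only 10 packets queued, the queue drains and nothing is deferred.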
qdisc_restart
/*
* NOTE: Called under qdisc_lock(q) with locally disabled BH.
*
* __QDISC_STATE_RUNNING guarantees only one CPU can process
* this qdisc at a time. qdisc_lock(q) serializes queue accesses for
* this queue.
*
* netif_tx_lock serializes accesses to device driver.
*
* qdisc_lock(q) and netif_tx_lock are mutually exclusive,
* if one is grabbed, another must be free.
*
* Note, that this procedure can be called by a watchdog timer
*
* Returns to the caller:
* 0 - queue is empty or throttled.
* >0 - queue is not empty.
*
*/
static inline int qdisc_restart(struct Qdisc *q)
{
        struct netdev_queue *txq;
        struct net_device *dev;
        spinlock_t *root_lock;
        struct sk_buff *skb;

        /* Dequeue packet */
        skb = dequeue_skb(q);
        if (unlikely(!skb))
                return 0;
        WARN_ON_ONCE(skb_dst_is_noref(skb));
        root_lock = qdisc_lock(q);
        dev = qdisc_dev(q);
        txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));
dequeue_skb
        if (unlikely(skb)) {
                /* check the reason of requeuing without tx lock first */
                txq = netdev_get_tx_queue(txq->dev, skb_get_queue_mapping(skb));
                if (!netif_xmit_frozen_or_stopped(txq)) {
                        q->gso_skb = NULL;
                        q->q.qlen--;
                } else
                        skb = NULL;
Note that the code begins by taking a reference to the gso_skb field of the
qdisc. This field holds a reference to data that was requeued. If no data was
requeued, this field will be NULL. If that field is not NULL, the code
continues by getting the transmit queue for the data and checking if the
queue is stopped. If the queue is not stopped, the gso_skb field is cleared
and the queue length counter is decreased. If the queue is stopped, the data
remains attached to gso_skb, but NULL will be returned from this function.
Let’s check the next case, where there is no data that was requeued:
        } else {
                if (!(q->flags & TCQ_F_ONETXQUEUE) || !netif_xmit_frozen_or_stopped(txq))
                        skb = q->dequeue(q);
        }

        return skb;
}
Then, the qdisc’s dequeue function will be called to obtain new data. The
internal implementation of dequeue will vary depending on the qdisc’s
implementation and features.
sch_direct_xmit
/*
 * Transmit one skb, and handle the return status as required. Holding the
 * __QDISC_STATE_RUNNING bit guarantees that only one CPU can execute this
 * function.
 *
HARD_TX_UNLOCK(dev, txq);
The code begins by unlocking the qdisc lock and then locking the transmit
lock. Note that HARD_TX_LOCK is a macro:
This macro is checking if the device has the NETIF_F_LLTX flag set in its
feature flags. This flag is deprecated and should not be used by new device
drivers. Most drivers in this kernel version do not use this flag, so this check
will evaluate to true and the lock for the transmit queue for this data will
be obtained.
Next, the transmit queue is checked to ensure that it is not stopped and
then dev_hard_start_xmit is called. As we’ll see later,
dev_hard_start_xmit handles transitioning the network data from the
Linux kernel’s network device subsystem into the device driver itself for
transmission. The return code from this function is stored and will be
checked next to determine if the transmit succeeded.
Once this has run (or been skipped because the queue is stopped), the
queue’s transmit lock is released. Let’s continue:
spin_lock(root_lock);
if (dev_xmit_complete(ret)) {
/* Driver sent out skb successfully or skb was consumed */
ret = qdisc_qlen(q);
} else if (ret == NETDEV_TX_LOCKED) {
/* Driver try lock failed */
ret = handle_dev_cpu_collision(skb, txq, q);
Next, the lock for this qdisc is taken again and then the return value of
dev_hard_start_xmit is examined. The first case is checked by calling
dev_xmit_complete which simply checks the return value to determine if
the data was sent successfully. If so, the qdisc queue length is set as the
return value.
} else {
/* Driver returned NETDEV_TX_BUSY - requeue skb */
if (unlikely(ret != NETDEV_TX_BUSY))
net_warn_ratelimited("BUG %s code %d qlen %d\n",
dev->name, ret, q->q.qlen);
So if the driver did not transmit the data and it was not due to the transmit
lock being held, it is probably due to NETDEV_TX_BUSY (if not, a warning is
printed). NETDEV_TX_BUSY can be returned by a driver to indicate that either
the device or the driver was “busy” and the data cannot be transmitted right
now. In this case, dev_requeue_skb is used to queue the data to be retried.
return ret;
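The three-way decision made on the driver’s return code can be summarized with a small stand-alone sketch. The constants below are copied from memory of include/linux/netdevice.h in 3.x kernels and are prefixed with SK_ to make clear they are this sketch’s own; treat the exact values, and the names classify_xmit and xmit_outcome, as assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values, mirroring what the 3.x headers are believed to define. */
enum {
    SK_NET_XMIT_MASK    = 0x0f,
    SK_NETDEV_TX_OK     = 0x00,
    SK_NETDEV_TX_BUSY   = 0x10,
    SK_NETDEV_TX_LOCKED = 0x20,
};

/* Hypothetical labels for the three branches described in the text. */
enum xmit_outcome { XMIT_DONE, XMIT_LOCK_COLLISION, XMIT_REQUEUE };

/* Models dev_xmit_complete: success, or an error that consumed the skb. */
static bool sk_xmit_complete(int rc)
{
    return rc < SK_NET_XMIT_MASK;
}

static enum xmit_outcome classify_xmit(int rc)
{
    if (sk_xmit_complete(rc))
        return XMIT_DONE;            /* return the qdisc queue length */
    if (rc == SK_NETDEV_TX_LOCKED)
        return XMIT_LOCK_COLLISION;  /* handle_dev_cpu_collision */
    return XMIT_REQUEUE;             /* dev_requeue_skb (BUSY expected) */
}
```

Note that in this model a negative return code counts as “complete”: the data was consumed even though it was not sent, so there is nothing to requeue.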
handle_dev_cpu_collision
In the first case, this is handled as a configuration problem and thus a
warning is printed. In the second case a statistic counter cpu_collision is
incremented and the data is sent through dev_requeue_skb to be requeued
for transmission later. Recall earlier we saw code in dequeue_skb that dealt
specifically with requeued skbs.
        if (unlikely(dev_queue->xmit_lock_owner == smp_processor_id())) {
                /*
                 * Same CPU holding the lock. It may be a transient
                 * configuration error, when hard_start_xmit() recurses. We
                 * detect it by checking xmit owner and drop the packet when
                 * deadloop is detected. Return OK to try the next skb.
                 */
                kfree_skb(skb);
                net_warn_ratelimited("Dead loop on netdevice %s, fix it urgently!\n",
                                     dev_queue->dev->name);
                ret = qdisc_qlen(q);
        } else {
                /*
                 * Another cpu is holding lock, requeue & delay xmits for
                 * some time.
                 */
                __this_cpu_inc(softnet_data.cpu_collision);
                ret = dev_requeue_skb(skb, q);
        }

        return ret;
}
Let’s take a look at what dev_requeue_skb does, as we’ll see this function
called from sch_direct_xmit .
dev_requeue_skb
return 0;
}
Simple and straightforward. Let’s refresh how we got here and then
examine __netif_schedule .
        while (qdisc_restart(q)) {
                /*
                 * Ordered by possible occurrence: Postpone processing if
                 * 1. we've exceeded packet quota
                 * 2. another process needs the CPU;
                 */
                if (--quota <= 0 || need_resched()) {
                        __netif_schedule(q);
                        break;
                }
        }

        qdisc_run_end(q);
}
__netif_schedule
This code checks and sets the __QDISC_STATE_SCHED bit in the qdisc’s state.
If the bit was flipped (meaning that it was not previously in the
__QDISC_STATE_SCHED state), the code will call __netif_reschedule, which
is not much longer but has very interesting side effects. Let’s take a look:
        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        q->next_sched = NULL;
        *sd->output_queue_tailp = q;
        sd->output_queue_tailp = &q->next_sched;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
}
1. Save the current local IRQ state and disable IRQs with a call to
local_irq_save .
2. Get the current CPU’s softnet_data structure.
3. Add the qdisc to the softnet_data ’s output queue.
4. Raise the NET_TX_SOFTIRQ softirq.
5. Restore the IRQ state and re-enable interrupts.
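Step 3 uses a classic C idiom: a singly linked list whose tail is tracked as a pointer to the last “next” pointer, so appending never requires walking the list. Here is a user-space sketch of just that idiom. The struct and function names are invented for illustration; only next_sched and the tailp naming echo the kernel code above.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a qdisc on the output queue. */
struct node {
    int id;
    struct node *next_sched;
};

/* Illustrative stand-in for the output queue part of softnet_data. */
struct out_queue {
    struct node *head;
    struct node **tailp; /* address of the last next_sched (or of head) */
};

static void out_queue_init(struct out_queue *oq)
{
    oq->head = NULL;
    oq->tailp = &oq->head; /* empty list: the tail pointer aims at head */
}

static void out_queue_append(struct out_queue *oq, struct node *n)
{
    assert(n != NULL);
    n->next_sched = NULL;
    *oq->tailp = n;              /* link after the current last element */
    oq->tailp = &n->next_sched;  /* the new element is now the tail */
}
```

Because tailp points at head when the list is empty, the same two assignments handle both the empty and non-empty cases, which is why the kernel code needs no branch here.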
You can read more about the initialization of the softnet_data data
structures by reading our previous post about the receive side of the
networking stack.
As you’ll see from the previous post, the NET_TX_SOFTIRQ softirq has the
function net_tx_action registered to it. This means that there is a kernel
thread executing net_tx_action . That thread is occasionally paused and
raise_softirq_irqoff resumes it. Let’s take a look at what net_tx_action
does so we can understand how the kernel processes transmit requests.
net_tx_action
In fact, the code for the function is two large if blocks. Let’s take them one
at a time, remembering all the while that this code is executing in the
softirq context as an independent kernel thread. The purpose of
net_tx_action is to execute code that cannot be executed in hot paths
throughout the transmit side of the network stack; work is deferred and
later processed by the thread executing net_tx_action .
Take a look at the net_tx_action code which deals with freeing skbs on the
completion queue:
        if (sd->completion_queue) {
                struct sk_buff *clist;

                local_irq_disable();
                clist = sd->completion_queue;
                sd->completion_queue = NULL;
                local_irq_enable();

                while (clist) {
                        struct sk_buff *skb = clist;
                        clist = clist->next;

                        WARN_ON(atomic_read(&skb->users));
                        trace_kfree_skb(skb, net_tx_action);
                        __kfree_skb(skb);
                }
        }
If the completion queue has entries, the while loop will walk through the
linked list of skbs and call __kfree_skb on each of them to free their
memory. Remember, this code is running in a separate “thread” called a
softirq – it is not running on behalf of any user program in particular.
if (sd->output_queue) {
struct Qdisc *head;
local_irq_disable();
head = sd->output_queue;
sd->output_queue = NULL;
sd->output_queue_tailp = &sd->output_queue;
local_irq_enable();
This block simply ensures that there are qdiscs on the output queue, and if
so, it sets head to the first entry and moves the tail pointer of the queue.
Next, the while loop for traversing the list of qdiscs starts:
while (head) {
struct Qdisc *q = head;
spinlock_t *root_lock;
head = head->next_sched;
smp_mb__before_clear_bit();
clear_bit(__QDISC_STATE_SCHED,
&q->state);
qdisc_run(q);
spin_unlock(root_lock);
The above section of code moves the head pointer forward and obtains a
reference to the qdisc lock. spin_trylock is used to check if the lock can be
obtained; note that this call is used specifically because it does not block. If
the lock is already held, spin_trylock will return immediately instead of
waiting to obtain the lock.
Your program’s system time will include time spent calling down to
the driver to try to send data, regardless of whether the send
completes or the driver returns an error.
If that send is unsuccessful at the driver layer (e.g. because the
device was busy sending something else), the qdisc will be added
to the output queue and processed later by a softirq thread. In this
case, softirq (si) time will be spent attempting to transmit your
data.
So, the total time spent sending data is a combination of both the system
time of send-related system calls and the softirq time for the NET_TX
softirq.
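If you want to watch how often the NET_TX softirq runs, /proc/softirqs exposes one counter per CPU on each line. The sketch below parses a single such line and sums the counters. The function name softirq_total is made up for this example; reading the file and picking out the line is left to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sum the per-CPU counters on one /proc/softirqs line, for example
 * "      NET_TX:       12          7". Returns -1 if the line is not
 * the one for `name`. Hypothetical helper for illustration. */
static long long softirq_total(const char *line, const char *name)
{
    assert(line != NULL && name != NULL);

    while (*line == ' ' || *line == '\t')
        line++;

    size_t n = strlen(name);
    if (strncmp(line, name, n) != 0 || line[n] != ':')
        return -1;

    const char *p = line + n + 1;
    long long total = 0;
    for (;;) {
        char *end;
        long long v = strtoll(p, &end, 10);
        if (end == p) /* no more numbers on this line */
            break;
        total += v;
        p = end;
    }
    return total;
}
```

Sampling this total twice and taking the difference gives a rough idea of how many times the NET_TX softirq ran in between.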
At any rate, the code above completes by releasing the qdisc lock. If the
spin_trylock call above fails to obtain the lock, the following code is
executed:
} else {
if (!test_bit(__QDISC_STATE_DEACTIVATED,
&q->state)) {
__netif_reschedule(q);
} else {
smp_mb__before_clear_bit();
clear_bit(__QDISC_STATE_SCHED,
&q->state);
}
}
}
}
This code, which only executes if the qdisc lock couldn’t be obtained,
handles two cases. Either:
1. The qdisc is not deactivated, but the lock couldn’t be obtained for
executing qdisc_run . So, call __netif_reschedule . Calling
__netif_reschedule here puts the qdisc back on the queue that
this function is currently dequeuing from. This allows the qdisc to
be checked again later when perhaps the lock has been given up.
2. The qdisc is marked as deactivated, so ensure that the
__QDISC_STATE_SCHED state flag is cleared as well.
Finally time to meet our friend dev_hard_start_xmit
We’ll see how both cases are handled, starting with the case of network
data that is ready to send. Let’s take a look (follow along here:
./net/core/dev.c):
        if (likely(!skb->next)) {
                netdev_features_t features;

                /*
                 * If device doesn't need skb->dst, release it right now while
                 * its hot in this cpu cache
                 */
                features = netif_skb_features(skb);
Next, the vlan tag will be checked and if the device can’t offload VLAN
tagging, __vlan_put_tag will be used to do this in software:
                if (vlan_tx_tag_present(skb) &&
                    !vlan_hw_offload_capable(features, skb->vlan_proto)) {
                        skb = __vlan_put_tag(skb, skb->vlan_proto,
                                             vlan_tx_tag_get(skb));
                        if (unlikely(!skb))
                                goto out;

                        skb->vlan_tci = 0;
                }
if (netif_needs_gso(skb, features)) {
if (unlikely(dev_gso_segment(skb, features)))
goto out_kfree_skb;
if (skb->next)
goto gso;
}
If the data does not need segmentation, a few other cases are handled.
First: does the data need to be linearized? That is, can the device support
sending network data if the data is spread out across multiple buffers, or
does it all need to be combined into a single linear buffer first? The vast
majority of network cards do not require the data to be linearized before
transmit, so in almost all cases this will evaluate to false and will be
skipped.
else {
if (skb_needs_linearize(skb, features) &&
__skb_linearize(skb))
goto out_kfree_skb;
A helpful comment is provided next, explaining the next case. The packet
will be checked to determine if it still needs a checksum. If the device does
not support checksumming, a checksum will be generated in software now:
Now we move on to packet taps! Recall in the receive side blog post, we
saw how packets were passed off to packet taps (like PCAP). The next chunk
of code in this function hands packets which are about to be transmitted over
to the packet taps (if there are any).
if (!list_empty(&ptype_all))
dev_queue_xmit_nit(skb, dev);
Finally, the driver’s ops are used to pass the data down to the device by
calling ndo_start_xmit :
skb_len = skb->len;
rc = ops->ndo_start_xmit(skb, dev);
Let’s take a look at the GSO case. This code will run if the skb was already
separated into a chain of packets due to segmentation which happened in
this function or a packet that was previously segmented, but failed to send
and was queued to be sent again.
gso:
        do {
                struct sk_buff *nskb = skb->next;

                skb->next = nskb->next;
                nskb->next = NULL;

                if (!list_empty(&ptype_all))
                        dev_queue_xmit_nit(nskb, dev);

                skb_len = nskb->len;
                rc = ops->ndo_start_xmit(nskb, dev);
                trace_net_dev_xmit(nskb, rc, dev, skb_len);
                if (unlikely(rc != NETDEV_TX_OK)) {
                        if (rc & ~NETDEV_TX_MASK)
                                goto out_kfree_gso_skb;
                        nskb->next = skb->next;
                        skb->next = nskb;
                        return rc;
                }
                txq_trans_update(txq);
                if (unlikely(netif_xmit_stopped(txq) && skb->next))
                        return NETDEV_TX_BUSY;
        } while (skb->next);
As you may have guessed, this code is a do-while loop that iterates over the
list of skbs that were generated when the data was segmented.
Any error in transmitting a packet is dealt with by adjusting the list of skbs
that need to be sent. The error will be returned up the stack and the unsent
skbs may be requeued to be sent again later.
The last piece of this function handles cleaning up and potentially freeing
data in the event of any errors hit above:
out_kfree_gso_skb:
        if (likely(skb->next == NULL)) {
                skb->destructor = DEV_GSO_CB(skb)->destructor;
                consume_skb(skb);
                return rc;
        }
out_kfree_skb:
        kfree_skb(skb);
out:
        return rc;
}
EXPORT_SYMBOL_GPL(dev_hard_start_xmit);
Before continuing into the device driver, let’s take a look at some
monitoring and tuning that can be done for the code that we just walked
through.
Monitoring qdiscs
bytes : The number of bytes that were pushed down to the driver
for transmit.
pkt : The number of packets that were pushed down to the driver
for transmit.
dropped : The number of packets that were dropped by the qdisc.
This can happen if transmit queue length is not large enough to fit
the data being queued to it.
overlimits : Depends on the queuing discipline, but can be either
the number of packets that could not be enqueued due to a limit
being hit, and/or the number of packets which triggered a
throttling event when dequeued.
requeues : Number of times dev_requeue_skb has been called to
requeue an skb. Note that an skb which is requeued multiple times
will bump this counter each time it is requeued.
backlog : Number of bytes currently on the qdisc’s queue. This
number is usually bumped each time a packet is enqueued.
Some qdiscs may export additional statistics. Each qdisc is different and
may bump these counters at different times. You may want to study the
source for the qdisc you are using to understand precisely when these
values can be incremented on your system to help understand what the
consequences are for you.
Tuning qdiscs
You can adjust the weight of the __qdisc_run loop seen earlier (the quota
variable seen above) which will cause more calls to __netif_schedule to be
executed. The result will be the current qdisc added to the output_queue
list for the current CPU more times, which should result in additional
processing of transmit packets.
Example: increase the `__qdisc_run` quota for all qdiscs with `sysctl`.
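Before changing anything it helps to know the current value. The sketch below reads it from procfs, assuming the sysctl in question is net.core.dev_weight (the knob that, as far as I know, backs weight_p) and therefore lives at /proc/sys/net/core/dev_weight. The function names are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of a single-integer sysctl file such as "64\n". */
static int parse_sysctl_int(const char *text, long *out)
{
    assert(text != NULL && out != NULL);

    char *end;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (end == text || errno != 0)
        return -1;
    *out = v;
    return 0;
}

/* Read the current quota. The path is an assumption of this sketch. */
static int read_dev_weight(long *out)
{
    char buf[32];
    FILE *f = fopen("/proc/sys/net/core/dev_weight", "r");

    if (!f)
        return -1;
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    return line ? parse_sysctl_int(line, out) : -1;
}
```

Writing a new number to the same file as root changes the value until the next reboot. Keep in mind that this value is shared with the receive path, as noted earlier.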
Each network device has a txqueuelen tuning knob that can be modified.
Most qdiscs will check whether the queue is still within the device’s
txqueuelen (which is counted in packets) when enqueuing data that should
eventually be transmitted by the qdisc. You can adjust this parameter to
increase the number of packets that may be queued by a qdisc.
The default value for ethernet devices is 1000 . You can check the
txqueuelen for network devices by reading the output of ifconfig .
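The same value can also be read programmatically. Here is a small sketch using the SIOCGIFTXQLEN ioctl on Linux. The function name get_txqueuelen is made up, and returning -1 for every kind of failure is a simplification of this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

/* Return the transmit queue length of `ifname`, or -1 on any error. */
static int get_txqueuelen(const char *ifname)
{
    assert(ifname != NULL);

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

    /* Any socket works as a handle for interface ioctls. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    int rc = ioctl(fd, SIOCGIFTXQLEN, &ifr);
    close(fd);
    return rc < 0 ? -1 : ifr.ifr_qlen;
}
```

Calling get_txqueuelen("eth0") on a machine with that interface should return the same number ifconfig prints. An unknown interface name yields -1.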
netdev->netdev_ops = &igb_netdev_ops;
The higher layers of the networking stack use the net_device_ops structure
to call into a driver to perform various operations. As we saw earlier, the
qdisc code calls ndo_start_xmit to pass data down to the driver for
transmit. The ndo_start_xmit function is called while a lock is held, for
most hardware devices, as we saw above.
} else {
count += skb_shinfo(skb)->nr_frags;
}
The code then obtains a reference to the next available buffer info in the
transmit queue. This structure will track the information needed for setting
up a buffer descriptor later. A reference to the packet and its size are copied
into the buffer info structure.
skb_tx_timestamp(skb);
if (!(adapter->ptp_tx_skb)) {
skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
tx_flags |= IGB_TX_FLAGS_TSTAMP;
adapter->ptp_tx_skb = skb_get(skb);
adapter->ptp_tx_start = jiffies;
if (adapter->hw.mac.type == e1000_82576)
schedule_work(&adapter->ptp_tx_work);
}
}
if (vlan_tx_tag_present(skb)) {
tx_flags |= IGB_TX_FLAGS_VLAN;
tx_flags |= (vlan_tx_tag_get(skb) << IGB_TX_FLAGS_VLAN_SHIFT);
}
The code above will check if the vlan_tci field of the skb was set. If it is
set, then the IGB_TX_FLAGS_VLAN flag is enabled and the vlan ID is stored.
The flags and protocol are recorded to the buffer info structure.
( first ) will have its flags updated to indicate to the hardware that TSO is
required.
/* Make sure there is space in the ring for the next send. */
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
return NETDEV_TX_OK;
Once the transmit is complete, the driver checks to ensure that there is
sufficient space available for another transmit. If not, the queue is
shut down. In either case NETDEV_TX_OK is returned to the higher layer (the
qdisc code).
out_drop:
igb_unmap_and_free_tx_resource(tx_ring, first);
return NETDEV_TX_OK;
}
Finally, some error handling code. This code is only hit if igb_tso hits an
error of some kind. The igb_unmap_and_free_tx_resource is used to clean
up data. NETDEV_TX_OK is returned in this case as well. The transmit was not
successful, but the driver freed the resources associated and there is
nothing left to do. Note that this driver does not increment packet drops in
this case, but it probably should.
igb_tx_map
The igb_tx_map function handles the details of mapping skb data to DMA-
able regions of RAM. It also updates the transmit queue’s tail pointer on the
device, which is what triggers the device to “wake up”, fetch the data from
RAM, and begin transmitting the data.
size = skb_headlen(skb);
data_len = skb->data_len;
What follows next is a very dense loop in the driver to deal with generating
a valid mapping for each fragment of a skb. The details of how exactly this
happens are not particularly important, but are worth mentioning:
The code for the loop is provided below for reference with the above
description. This should illustrate further to readers that avoiding
fragmentation, if at all possible, is a good idea. Lots of additional code
needs to run to deal with it at every layer of the stack, including the driver.
tx_buffer = first;
tx_desc->read.buffer_addr = cpu_to_le64(dma);
i++;
tx_desc++;
if (i == tx_ring->count) {
tx_desc = IGB_TX_DESC(tx_ring, 0);
i = 0;
}
tx_desc->read.olinfo_status = 0;
dma += IGB_MAX_DATA_PER_TXD;
size -= IGB_MAX_DATA_PER_TXD;
tx_desc->read.buffer_addr = cpu_to_le64(dma);
if (likely(!data_len))
break;
i++;
tx_desc++;
if (i == tx_ring->count) {
tx_desc = IGB_TX_DESC(tx_ring, 0);
i = 0;
}
tx_desc->read.olinfo_status = 0;
size = skb_frag_size(frag);
data_len -= size;
tx_buffer = &tx_ring->tx_buffer_info[i];
}
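Two small pieces of arithmetic carry most of the loop above: how many descriptors a buffer needs when each descriptor can only reference a bounded number of bytes, and how the ring index wraps. The sketch below shows both in isolation. The function names are made up, and the per-descriptor limit is passed in as a parameter rather than assuming the value of IGB_MAX_DATA_PER_TXD.

```c
#include <assert.h>
#include <stddef.h>

/* Number of descriptors needed to cover `size` bytes when each
 * descriptor can reference at most `max_per_desc` bytes. */
static size_t descs_needed(size_t size, size_t max_per_desc)
{
    assert(max_per_desc > 0);
    return (size + max_per_desc - 1) / max_per_desc; /* round up */
}

/* Advance a ring index, wrapping to 0 at the end of the ring. */
static unsigned ring_next(unsigned i, unsigned count)
{
    assert(count > 0 && i < count);
    i++;
    if (i == count)
        i = 0;
    return i;
}
```

Every extra fragment costs at least one more descriptor and one more DMA mapping, which is the concrete price of fragmentation at the driver level.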
Once all the necessary descriptors have been constructed and all of the skb’s
data has been mapped to DMA-able addresses, the driver proceeds to its
final steps to trigger a transmit:
netdev_tx_sent_queue(txring_txq(tx_ring), first->bytecount);
i++;
if (i == tx_ring->count)
i = 0;
tx_ring->next_to_use = i;
writel(i, tx_ring->tail);
/* we need this if more than one processor can write to our tail
 * at a time, it synchronizes IO on IA64/Altix systems
 */
mmiowb();
return;
Finally, the code wraps up with some error handling. This code only
executes if an error was returned from the DMA API when attempting to
map skb data addresses to DMA-able addresses.
dma_error:
dev_err(tx_ring->dev, "TX DMA map failed\n");
tx_ring->next_to_use = i;
As you’ve seen throughout this post, network data spends a lot of time
sitting in queues at various stages as it moves closer and closer to the device
for transmission. As queue sizes increase, packets spend longer sitting in
queues not being transmitted, i.e. packet transmit latency increases as queue
size increases.
One way to fight this is with back pressure. The dynamic queue limit (DQL)
system is a mechanism that device drivers can use to apply back pressure to
the networking system to prevent too much data from being queued for
transmit when the device is unable to transmit.
To use this system, network device drivers need to make a few simple API
calls during their transmit and completion routines. Internally, the DQL
system uses an algorithm to determine when sufficient data is in flight.
Once this limit is reached, the transmit queue will be temporarily
disabled. This queue disabling is what produces the back pressure against
the networking system. The queue will be automatically re-enabled when
the DQL system determines enough data has finished transmission.
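To make the back pressure idea concrete, here is a deliberately simplified model. It is not the kernel's DQL algorithm: the real system adapts its limit over time, while this sketch assumes a fixed byte limit. All names (`toy_dql`, `toy_dql_sent`, `toy_dql_completed`) are hypothetical; in the driver code shown in this post, the corresponding calls are netdev_tx_sent_queue and netdev_tx_completed_queue.

```c
#include <stdbool.h>

/* Hypothetical, simplified model: the real DQL adapts `limit` over time. */
struct toy_dql {
    unsigned int limit;     /* max bytes allowed in flight (fixed here) */
    unsigned int inflight;  /* bytes handed to the device, not yet completed */
    bool stopped;           /* true while the queue is disabled */
};

/* Transmit path: account for queued bytes; stop the queue at the limit. */
static void toy_dql_sent(struct toy_dql *q, unsigned int bytes)
{
    q->inflight += bytes;
    if (q->inflight >= q->limit)
        q->stopped = true;   /* back pressure on the upper layers */
}

/* Completion path: account for finished bytes; restart below the limit. */
static void toy_dql_completed(struct toy_dql *q, unsigned int bytes)
{
    /* Guard against underflow if more is completed than was recorded. */
    q->inflight = (bytes > q->inflight) ? 0 : q->inflight - bytes;
    if (q->stopped && q->inflight < q->limit)
        q->stopped = false;
}
```

With a limit of 3000 bytes, two 1500 byte sends stop the queue, and the first 1500 byte completion re-enables it. The sent and completed calls must stay paired, which is why drivers make one call in the transmit routine and the other in the completion routine.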
Check out this excellent set of slides about the DQL system for some
performance data and an explanation of the internal algorithm in DQL.
DQL exports statistics and tuning knobs in sysfs. Tuning DQL should not be
necessary; the algorithm will adjust its parameters over time. Nevertheless,
in the interest of completeness we’ll see later how to monitor and tune
DQL.
Transmit completions
Once the device has transmitted the data, it will generate an interrupt to
signal that transmission is complete. The device driver can then schedule
some long running work to be completed, like unmapping memory regions and
freeing data. How exactly this works is device specific. In the case of the
igb driver (and its associated devices), the same IRQ is fired for transmit
completion and packet receive. This means that for the igb driver the
NET_RX softirq is used to process both transmit completions and incoming
packet receives.
Let me re-state that to emphasize the importance of this: your device may
raise the same interrupt for receiving packets that it raises to signal that a
packet transmit has completed. If it does, the NET_RX softirq runs to process
both incoming packets and transmit completions.
Since both operations share the same IRQ, only a single IRQ handler
function can be registered and it must deal with both possible cases. Recall
the following flow when network data is received:
Step 5 above in the igb driver (and the ixgbe driver [greetings, tyler])
processes TX completions before processing incoming data. Keep in mind
that depending on the implementation of the driver, both processing
functions for TX completions and incoming data may share the same
processing budget. The igb and ixgbe drivers track the TX completion and
incoming packet budgets separately, so processing TX completions will not
necessarily exhaust the RX budget.
That said, the entire NAPI poll loop runs within a hard coded time slice. This
means that if you have a lot of TX completion processing to handle, TX
completions may eat more of the time slice than processing incoming data
does. This may be an important consideration for those running network
hardware in very high load environments.
Let’s see how this happens in practice for the igb driver.
3. The user program sends data to a network socket. The data travels
through the network stack until the device fetches it from memory and
transmits it.
4. The device finishes transmitting the data and raises an IRQ to
signal transmit completion.
5. The driver’s IRQ handler executes to handle the interrupt.
6. The IRQ handler calls napi_schedule in response to the IRQ.
7. The NAPI code triggers the NET_RX softirq to execute.
8. The NET_RX softirq function, net_rx_action, begins to execute.
9. The net_rx_action function calls the driver’s registered NAPI poll
function.
10. The NAPI poll function, igb_poll, is executed.
The poll function igb_poll is where the code splits off and processes both
incoming packets and transmit completions. Let’s dive into the code for this
function and see where that happens.
igb_poll
/**
* igb_poll - NAPI Rx polling callback
* @napi: napi polling structure
* @budget: count of how many packets we should handle
**/
static int igb_poll(struct napi_struct *napi, int budget)
{
struct igb_q_vector *q_vector = container_of(napi,
struct igb_q_vector,
napi);
bool clean_complete = true;
#ifdef CONFIG_IGB_DCA
if (q_vector->adapter->flags & IGB_FLAG_DCA_ENABLED)
igb_update_dca(q_vector);
#endif
if (q_vector->tx.ring)
clean_complete = igb_clean_tx_irq(q_vector);
if (q_vector->rx.ring)
clean_complete &= igb_clean_rx_irq(q_vector, budget);
return 0;
}
To learn more about how igb_clean_rx_irq works, read this section of the
previous blog post.
This blog post is concerned primarily with the transmit side, so we’ll
continue by examining how igb_clean_tx_irq above works.
igb_clean_tx_irq
It’s a bit long, so we’ll break it into chunks and go through it:
if (test_bit(__IGB_DOWN, &adapter->state))
return true;
Moving on, the snippet of code above finishes by checking if the network
device is down. If so, it returns true and exits igb_clean_tx_irq .
tx_buffer = &tx_ring->tx_buffer_info[i];
tx_desc = IGB_TX_DESC(tx_ring, i);
i -= tx_ring->count;
do {
union e1000_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
This inner loop will loop over each transmit descriptor until tx_desc arrives
at the eop_desc . This code unmaps data referenced by any of the additional
descriptors.
/* move us one more past the eop_desc for start of next pkt */
tx_buffer++;
tx_desc++;
i++;
if (unlikely(!i)) {
i -= tx_ring->count;
tx_buffer = tx_ring->tx_buffer_info;
tx_desc = IGB_TX_DESC(tx_ring, 0);
}
The outer loop increments iterators and reduces the budget value. The loop
condition is checked to determine if the loop should continue.
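The `i -= tx_ring->count` lines in the snippets above are easy to misread. The index is kept offset by the ring size so that the wrap test becomes a cheap comparison against zero. Here is a minimal userspace sketch of that trick; `toy_ring_walk` is a hypothetical name, and this version uses a signed int for readability rather than mirroring the driver's declarations.

```c
/*
 * Walk `steps` entries forward from `start` in a ring of `count` slots,
 * using the biased-index trick: keep i in the range [-count, -1] so the
 * wrap test is a comparison against zero.
 */
static int toy_ring_walk(int start, int steps, int count)
{
    int i = start;

    i -= count;              /* bias: real index = i + count */
    while (steps-- > 0) {
        i++;
        if (!i)              /* stepped past the last slot */
            i -= count;      /* wrap back to slot 0 */
    }
    i += count;              /* remove the bias */
    return i;
}
```

Starting at slot 3 of a four-slot ring, one step lands on slot 0. The final `i += count` matches the `i += tx_ring->count` that the driver performs before storing next_to_clean.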
netdev_tx_completed_queue(txring_txq(tx_ring),
total_packets, total_bytes);
i += tx_ring->count;
tx_ring->next_to_clean = i;
u64_stats_update_begin(&tx_ring->tx_syncp);
tx_ring->tx_stats.bytes += total_bytes;
tx_ring->tx_stats.packets += total_packets;
u64_stats_update_end(&tx_ring->tx_syncp);
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;
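This code reports the completed packets and bytes to DQL with
netdev_tx_completed_queue, records the new next_to_clean position, and adds
the totals to the ring and q_vector statistics. Next, the driver checks for
a transmit hang: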
if (test_bit(IGB_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags)) {
struct e1000_hw *hw = &adapter->hw;
clear_bit(IGB_RING_FLAG_TX_DETECT_HANG, &tx_ring->flags);
if (tx_buffer->next_to_watch &&
time_after(jiffies, tx_buffer->time_stamp +
(adapter->tx_timeout_factor * HZ)) &&
!(rd32(E1000_STATUS) & E1000_STATUS_TXOFF)) {
If those three tests are all true, then an error is printed that a hang has been
detected. netif_stop_subqueue is used to turn off the queue and true is
returned.
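The timeout part of that check relies on the kernel's time_after helper, which compares jiffies values in a way that tolerates counter wraparound. The sketch below shows the underlying idea (a signed difference) in userspace. `toy_time_after` and `toy_tx_timed_out` are hypothetical names, and the parameters stand in for jiffies, the buffer's time_stamp, tx_timeout_factor and HZ.

```c
#include <stdbool.h>

/* Wraparound-safe "a is later than b" for a free-running tick counter. */
static bool toy_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Hypothetical helper mirroring the timeout test in the snippet above. */
static bool toy_tx_timed_out(unsigned long now, unsigned long stamp,
                             unsigned long factor, unsigned long hz)
{
    return toy_time_after(now, stamp + factor * hz);
}
```

Because the comparison uses the signed difference, a `now` value that has wrapped past zero still counts as later than a stamp taken just before the wrap. A plain `now > deadline` comparison would get that case wrong.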
Let’s continue reading the code to see what happens if there was no
transmit hang check, or if there was, but no hang was detected:
u64_stats_update_begin(&tx_ring->tx_syncp);
tx_ring->tx_stats.restart_queue++;
u64_stats_update_end(&tx_ring->tx_syncp);
}
}
return !!budget;
In the above code the driver will restart the transmit queue if it was
previously disabled. It first checks if:
If all conditions are satisfied, a full memory barrier is used ( smp_mb ).
Next another set of conditions is checked:
if (q_vector->rx.ring)
clean_complete &= igb_clean_rx_irq(q_vector, budget);
If either RX or TX processing could not complete (because there was more
work to do), the entire budget amount (which is hardcoded to 64 for most
drivers, including igb ) is returned so that polling continues. Otherwise,
all work is done: NAPI is disabled with a call to napi_complete and 0 is
returned:
return 0;
}
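A compact model of how a NAPI poll routine reports its progress may help here. This is a userspace sketch, not the driver's code: `toy_poll`, `toy_done` and `toy_more` are hypothetical, and the two callbacks stand in for igb_clean_tx_irq and igb_clean_rx_irq.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's TX and RX cleaning routines. */
typedef bool (*toy_clean_fn)(int budget);

/* Example callbacks: one that finished its work, one that did not. */
static bool toy_done(int budget) { (void)budget; return true; }
static bool toy_more(int budget) { (void)budget; return false; }

/*
 * Report the whole budget while work remains so that polling continues;
 * otherwise report 0 (the point at which a real driver would call
 * napi_complete and re-enable its interrupt).
 */
static int toy_poll(toy_clean_fn clean_tx, toy_clean_fn clean_rx, int budget)
{
    bool clean_complete = true;

    if (clean_tx)
        clean_complete = clean_tx(budget);
    if (clean_rx)
        clean_complete &= clean_rx(budget);

    if (!clean_complete)
        return budget;   /* more work pending: keep polling */

    return 0;            /* all done: polling can stop */
}
```

Calling `toy_poll(toy_done, toy_more, 64)` yields 64, while `toy_poll(toy_done, toy_done, 64)` yields 0. Either side having work left is enough to keep the poll loop going.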
There are several different ways to monitor your network devices offering
different levels of granularity and complexity. Let’s start with most granular
and move to least granular.
Using ethtool -S
Monitor detailed NIC device statistics (e.g., transmit errors) with
`ethtool -S`.
Monitoring this data can be difficult. It is easy to obtain, but there is no
standardization of the field values. Different drivers, or even different
versions of the same driver might produce different field names that have
the same meaning.
You should look for values with “drop”, “buffer”, “miss”, “errors” etc in the
label. Next, you will have to read your driver source. You’ll be able to
determine which values are accounted for totally in software (e.g.,
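If you collect these statistics programmatically, the keyword scan is easy to automate. The helper below is a minimal sketch with a hypothetical name ( `toy_label_is_suspect` ). It does a case-sensitive substring match on the four keywords listed above, so treat it as a starting point rather than a complete filter, since field names vary by driver.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if a statistics label contains one of the keywords. */
static bool toy_label_is_suspect(const char *label)
{
    static const char *const keywords[] = {
        "drop", "buffer", "miss", "errors"
    };
    size_t i;

    for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++) {
        if (strstr(label, keywords[i]))
            return true;
    }
    return false;
}
```

A label such as tx_dropped matches, while tx_packets does not. Both labels are only illustrative; check the names your own driver reports.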
Using sysfs
sysfs also provides a lot of statistics values, but they are slightly higher
level than the direct NIC level stats provided.
You can find, for example, the number of aborted transmit errors for eth0
by using cat on a file.
$ cat /sys/class/net/eth0/statistics/tx_aborted_errors
2
If these values are critical to you, you will need to read your driver source
and device data sheet to understand exactly what your driver thinks each of
these values means.
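For monitoring agents, reading one of these files from C is straightforward. The function below is a minimal sketch: `toy_read_counter` is a hypothetical name, and it assumes the file holds a single unsigned decimal number, as in the cat example above.

```c
#include <stdio.h>

/*
 * Read a single unsigned counter from a text file such as the
 * statistics files under /sys/class/net/<dev>/statistics/.
 * Returns 0 on success, -1 on failure.
 */
static int toy_read_counter(const char *path, unsigned long long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;

    ok = (fscanf(f, "%llu", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Sampling the same file at a fixed interval and subtracting consecutive readings gives a rate, which is usually more useful for alerting than the raw running total.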
Using /proc/net/dev
$ cat /proc/net/dev
Inter-| Receive |
face |bytes packets errs drop fifo frame compressed multicast|by
eth0: 110346752214 597737500 0 2 0 0 0 2096
lo: 428349463836 1579868535 0 0 0 0 0
This file shows a subset of the values you’ll find in the sysfs files
mentioned above, but it may serve as a useful general reference.
The caveat mentioned above applies here, as well: if these values are
important to you, you will still need to read your driver source to
understand exactly when, where, and why they are incremented to ensure
your understanding of an error, drop, or fifo is the same as your driver’s.
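Since this post is about the transmit side, here is a sketch of pulling the transmit counters out of one /proc/net/dev row. The names `toy_tx_stats` and `toy_parse_netdev_line` are hypothetical. The transmit column order used here (bytes, packets, errs, drop, fifo, colls, carrier, compressed) is the usual layout of this file, but it is an assumption, since the header in the output above is cut off before the transmit columns.

```c
#include <stdio.h>

/* Transmit-side columns of one /proc/net/dev row (assumed order). */
struct toy_tx_stats {
    unsigned long long bytes, packets, errs, drop;
    unsigned long long fifo, colls, carrier, compressed;
};

/*
 * Parse one interface row. The first eight numeric columns are the
 * receive counters (skipped here); the next eight are transmit counters.
 * Returns 0 on success, -1 if the line does not match.
 */
static int toy_parse_netdev_line(const char *line, char name[32],
                                 struct toy_tx_stats *tx)
{
    int n = sscanf(line,
                   " %31[^:]: %*llu %*llu %*llu %*llu"
                   " %*llu %*llu %*llu %*llu"
                   " %llu %llu %llu %llu %llu %llu %llu %llu",
                   name,
                   &tx->bytes, &tx->packets, &tx->errs, &tx->drop,
                   &tx->fifo, &tx->colls, &tx->carrier, &tx->compressed);

    /* One name plus eight transmit counters must have been converted. */
    return n == 9 ? 0 : -1;
}
```

The two header lines of the file contain no interface name followed by a colon and sixteen numbers, so they fail to parse and can simply be skipped by the caller.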
You can monitor dynamic queue limits for a network device by reading the
files located under /sys/class/net/NIC/queues/tx-QUEUE_NUMBER/byte_queue_limits/ ,
replacing NIC with your device name ( eth0 , eth1 , etc) and
tx-QUEUE_NUMBER with the transmit queue number ( tx-0 , tx-1 , tx-2 , etc).
hold_time : Initialized to HZ (the number of jiffies in one second). If the
queue has been full for hold_time , then the maximum size is decreased.
$ cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight
350
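If you monitor several queues, building these paths by hand gets tedious. The helper below is a small sketch that fills in the device name, queue number and file name; `toy_dql_path` is a hypothetical name, and the file name (for example inflight, as in the command above) is passed in by the caller.

```c
#include <stdio.h>

/*
 * Build the path of a DQL file for a given device and TX queue.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int toy_dql_path(char *buf, size_t len, const char *nic,
                        unsigned int queue, const char *file)
{
    int n = snprintf(buf, len,
                     "/sys/class/net/%s/queues/tx-%u/byte_queue_limits/%s",
                     nic, queue, file);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Checking the snprintf return value against the buffer length catches truncation, so a too-short buffer is reported as an error instead of producing a path to the wrong file.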
If your NIC and the device driver loaded on your system support multiple
transmit queues, you can usually adjust the number of TX queues (also
called TX channels), by using ethtool .
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 8
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
This output is displaying the pre-set maximums (enforced by the driver and
the hardware) and the current settings.
Note: not all device drivers will have support for this operation.
This means that your driver has not implemented the ethtool get_channels
operation. This could be because the NIC doesn’t support adjusting the
number of queues, doesn’t support multiple transmit queues, or your driver
has not been updated to handle this feature.
Once you’ve found the current and maximum queue count, you can adjust
the values by using sudo ethtool -L .
Note: some devices and their drivers only support combined queues that
are paired for transmit and receive, as in the example in the above
section.
If your device and driver support individual settings for RX and TX and you’d
like to change only the TX queue count to 8, you would run:
Note: making these changes will, for most drivers, take the interface
down and then bring it back up; connections to this interface will be
interrupted. This may not matter much for a one-time change, though.
Some NICs and their drivers also support adjusting the size of the TX queue.
Exactly how this works is hardware specific, but luckily ethtool provides a
generic way for users to adjust the size. Increasing the size of the TX
queue may not make a drastic difference because DQL is used to prevent
higher layer networking code from queueing more data at times. Nevertheless,
you may want to increase the TX queues to the maximum size and let DQL sort
everything else out for you:
The above output indicates that the hardware supports up to 4096 receive
and transmit descriptors, but it is currently only using 512.
Note: making these changes will, for most drivers, take the interface
down and then bring it back up; connections to this interface will be
interrupted. This may not matter much for a one-time change, though.
The End
The end! Now you know everything about how packet transmit works on
Linux: from the user program to the device driver and back.
Extras
There are a few extra things worth mentioning which didn’t seem to fit
anywhere else.
The send , sendto , and sendmsg system calls all take a flags
parameter. If you pass the MSG_CONFIRM flag to these system calls from your
application, it will cause the dst_neigh_output function in the kernel on
the send path to update the timestamp of the neighbour structure. The
consequence of this is that the neighbour structure will not be garbage
collected. This prevents additional ARP traffic from being generated as the
neighbour cache entry will stay warmer, longer.
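From an application's point of view, using this flag is a one-line change. The wrapper below is a minimal sketch ( `toy_send_confirmed` is a hypothetical name). It assumes a Linux system, since MSG_CONFIRM is Linux-specific, and a datagram socket that the caller has already created.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Send a datagram and tell the kernel that the destination is known to
 * be reachable. MSG_CONFIRM is Linux-specific.
 * Returns the number of bytes sent, or -1 with errno set.
 */
static ssize_t toy_send_confirmed(int fd, const void *buf, size_t len,
                                  const struct sockaddr *dst,
                                  socklen_t dstlen)
{
    return sendto(fd, buf, len, MSG_CONFIRM, dst, dstlen);
}
```

Only pass this flag when your application really has evidence that the peer is alive, such as a reply it just received. Otherwise you keep a neighbour entry fresh for a host that may be gone.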
UDP Corking
Timestamping
As mentioned in the above blog post, the networking stack can collect
timestamps of outgoing data. See the above network stack walkthrough to
see where transmit timestamping happens in software. Some NICs even
support timestamping in hardware, too.
This is a useful feature if you’d like to try to determine how much latency
the kernel network stack is adding to sending packets.
Determine which timestamp modes your driver and device support with
ethtool -T .
Conclusion
This highlights what I believe to be the core of the issue: optimizing and
monitoring the network stack is impossible unless you carefully read and
understand how it works. You cannot monitor code you don’t understand at
a deep level.
Need some extra help navigating the network stack? Have questions about
anything in this post or related things not covered? Send us an email and
let us know how we can help.