MHVLUG 2017-04 Network Receive Stack
Patrick Ladd
Technical Account Manager
Red Hat
[email protected]
[Diagram: softirq subsystem setup — net_dev_init() in net/core/dev.c runs at boot; 2. per-CPU ksoftirqd processing loops (ksoftirqd/0 on CPU 0, ksoftirqd/1 on CPU 1, ...) are started; 4. the softirq handler for NET_RX_SOFTIRQ (net_rx_action) is registered in softirq_vec; hardware interrupts request processing by setting per-CPU softirq_pending bits.]
Network Device Initialization
● net_device_ops Data Structure
– Function pointers to the driver's implementation of each operation
static const struct net_device_ops igb_netdev_ops = {
	.ndo_open		= igb_open,		/* interface brought up */
	.ndo_stop		= igb_close,		/* interface taken down */
	.ndo_start_xmit		= igb_xmit_frame,	/* transmit a packet */
	.ndo_get_stats64	= igb_get_stats64,
	.ndo_set_rx_mode	= igb_set_rx_mode,	/* multicast/promiscuous */
	.ndo_set_mac_address	= igb_set_mac,
	.ndo_change_mtu		= igb_change_mtu,
	.ndo_do_ioctl		= igb_ioctl,
	/* ... further ops elided ... */
};
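The kernel core reaches the driver only through this table. A minimal sketch of the dispatch pattern (simplified from __dev_open() in net/core/dev.c; locking, notifier calls, and error paths are omitted):

#include <linux/netdevice.h>

/* Sketch: how the core invokes the driver's ndo_open on interface up.
 * The real __dev_open() does considerably more than this. */
static int dev_open_sketch(struct net_device *dev)
{
	const struct net_device_ops *ops = dev->netdev_ops;
	int ret = 0;

	if (ops->ndo_open)			/* entries may be NULL */
		ret = ops->ndo_open(dev);	/* for igb: igb_open() */
	return ret;
}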
[Diagram: hardware interrupt flow between the NIC, the CPU/chipset, and the driver — 3. an IRQ is raised; 4. the driver's IRQ handler runs; 5. the IRQ is cleared; 6. NAPI processing is started.]
NAPI (New API) Processing
[Diagram: NAPI scheduling on CPU 0 — 1. the driver's NAPI poller is added to the softnet_data poll_list; 2. the NET_RX_SOFTIRQ softirq_pending bit is set; 3. run_ksoftirqd() in ksoftirqd/0 checks the pending bits and calls __do_softirq(), which invokes the registered handler net_rx_action(); packets are processed by polling from the poll_list until all pending packets are processed or specified limits are reached.]
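On the driver side, steps 1 and 2 are triggered from the hard IRQ handler. A hedged sketch of the standard pattern (mydrv_* follows the mydrv_poll() named later in these slides; the rest is illustrative, using the NAPI API of 2017-era kernels):

#include <linux/netdevice.h>
#include <linux/interrupt.h>

struct mydrv_priv {
	struct napi_struct napi;	/* per-queue NAPI context */
};

static int mydrv_poll(struct napi_struct *napi, int budget);	/* sketched below */

/* Hard IRQ handler: do almost nothing, defer the real work to NAPI. */
static irqreturn_t mydrv_isr(int irq, void *data)
{
	struct mydrv_priv *priv = data;

	/* (A real driver would also mask/ack device interrupts here.) */
	napi_schedule(&priv->napi);	/* 1. poller onto poll_list;
					   2. NET_RX_SOFTIRQ pending bit set */
	return IRQ_HANDLED;
}

/* At probe/open time: register the poll function (weight 64 is the common default). */
static void mydrv_napi_init(struct net_device *dev, struct mydrv_priv *priv)
{
	netif_napi_add(dev, &priv->napi, mydrv_poll, 64);
	napi_enable(&priv->napi);
}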
NAPI Advantages
● Reduced interrupt load
– Without NAPI: 1 interrupt per packet → high CPU load
– With NAPI: polling during high packet arrival times
● No work to drop packets if the kernel is too busy
– The NIC simply overwrites older entries in the ring buffer
# ethtool -l eth0
Channel parameters for eth0:
Cannot get device channel parameters: Operation not supported
(Output like this means the NIC/driver does not support multiple receive queues.)
Multiqueue /
RSS (Receive Side Scaling)
● Recommendations:
– Enable for latency concerns or when interrupt bottlenecks form
– Lowest latency:
● 1 queue per CPU or max supported by NIC
– Best efficiency:
● Smallest number with no overflows due to CPU saturation
● Aggressive techniques:
– Lock IRQ & userspace process to CPU
– Custom n-tuple setups (e.g. "all TCP/80 to CPU1")
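Illustrative commands for the above (the queue count, IRQ number, and n-tuple support are assumptions that vary by NIC and driver): set the queue count, enable n-tuple filtering, steer TCP/80 to RX queue 1, then pin that queue's IRQ to CPU1 via its hex CPU mask.
# ethtool -L eth0 combined 4
# ethtool -K eth0 ntuple on
# ethtool -N eth0 flow-type tcp4 dst-port 80 action 1
# echo 2 > /proc/irq/105/smp_affinity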
Network Data Processing “Bottom Half”
[Diagram: the receive "bottom half" — 1. a poll_list entry in softnet_data is received; 2. net_rx_action checks the budget and elapsed time; 3. the driver's poll function (mydrv_poll()) is called; 4. packets are harvested from the ring buffer in RAM; 5. packets are passed to napi_gro_receive for possible GRO and held on the GRO list. In the backlog path, process_backlog harvests packets from the input queue and returns them to the main flow. All packets then pass through __netif_receive_skb_core, which copies them to any packet taps (PCAP) and delivers them to the protocol layers.]
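A hedged sketch of the poll routine named in the diagram, continuing the mydrv_* sketch above (mydrv_rx_next() is a hypothetical stand-in for the driver's hardware-specific descriptor-ring harvesting):

/* 3. Called by net_rx_action() with a budget of packets we may consume. */
static int mydrv_poll(struct napi_struct *napi, int budget)
{
	struct mydrv_priv *priv = container_of(napi, struct mydrv_priv, napi);
	struct sk_buff *skb;
	int work_done = 0;

	/* 4. Harvest packets from the RX ring until it is empty or budget is spent. */
	while (work_done < budget && (skb = mydrv_rx_next(priv)) != NULL) {
		napi_gro_receive(napi, skb);	/* 5. hand up, possibly coalesced by GRO */
		work_done++;
	}

	if (work_done < budget) {
		/* Ring drained: leave polled mode and re-enable NIC interrupts. */
		napi_complete_done(napi, work_done);
	}
	return work_done;
}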
Monitoring
● ethtool -S {ifname} – Direct NIC level Statistics
– Hard to use: no standards; counters vary between drivers, or even between releases of the same driver
– May have to resort to reading the driver source or the NIC datasheet to determine a counter's true meaning
● /sys/class/net/{ifname}/statistics/ – Kernel Statistics
– Slightly higher level
– Still some ambiguity in what values are incremented when
– May need to read source to get exact meanings
● /proc/net/dev – Kernel Device Statistics
– Subset of statistics from above for all interfaces
– Same caveats as above
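For example, a single counter can be read straight from sysfs, or all interfaces compared at once (interface name is illustrative):
# cat /sys/class/net/eth0/statistics/rx_dropped
# cat /proc/net/dev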
Monitoring
● Monitoring SoftIRQs
– watch -n1 'grep RX /proc/softirqs'
● Packets dropped by the kernel: dropwatch
# dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring…
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at skb_queue_purge+18 (0xffffffff8151a968)
41 drops at __brk_limit+1e6c5938 (0xffffffffa0a1d938)
1 drops at skb_release_data+eb (0xffffffff8151a80b)
2 drops at nf_hook_slow+f3 (0xffffffff8155d083)
Finding the Bottleneck
● Drops at NIC level:
– ethtool -S {ifname}
rx_errors: 0
tx_errors: 0
rx_dropped: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 3295
rx_crc_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 3295
rx_missed_errors: 3295
Finding the Bottleneck
● IRQs out of balance
– egrep "CPU0|{ifname}" /proc/interrupts
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5
105: 1430000 0 0 0 0 0 IR-PCI-MSI-edge eth2-rx-0
106: 1200000 0 0 0 0 0 IR-PCI-MSI-edge eth2-rx-1
107: 1399999 0 0 0 0 0 IR-PCI-MSI-edge eth2-rx-2
108: 1350000 0 0 0 0 0 IR-PCI-MSI-edge eth2-rx-3
109: 80000 0 0 0 0 0 IR-PCI-MSI-edge eth2-tx
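One manual remedy (illustrative; IRQ numbers are from the output above, and the values written are hex CPU bitmasks: 2 = CPU1, 4 = CPU2, 8 = CPU3) is to spread the RX IRQs across CPUs:
# echo 2 > /proc/irq/106/smp_affinity
# echo 4 > /proc/irq/107/smp_affinity
# echo 8 > /proc/irq/108/smp_affinity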
– Alternatively, activate a tuned profile
# tuned-adm profile throughput-performance
Switching to profile 'throughput-performance'
...
Numad
● Intelligently moves processes and memory among NUMA domains
– Activate
# systemctl enable numad
# systemctl start numad
● Persisting kernel (sysctl) tunables:
– RHEL 7: add a {myname}.conf file in /etc/sysctl.d/
– Prior to RHEL 7: insert or update the parameter in /etc/sysctl.conf
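A minimal sketch of such a drop-in file (the filename and the value are illustrative; net.core.netdev_budget caps how many packets net_rx_action may process per softirq run, and defaults to 300):
# cat /etc/sysctl.d/10-network-rx.conf
net.core.netdev_budget = 600
# sysctl --system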
Adapter Buffer Sizes
● Customize the size of RX ring buffer(s)
– "ethtool -g {ifname}" to view
● # ethtool -g eth3
Ring parameters for eth3:
Pre-set maximums:
RX: 8192
RX Mini: 0
RX Jumbo: 0
TX: 8192
Current hardware settings:
RX: 1024
RX Mini: 0
RX Jumbo: 0
TX: 512
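To grow the RX ring toward the pre-set maximum (the value is illustrative; larger rings absorb bursts at the cost of memory and slightly higher latency):
# ethtool -G eth3 rx 4096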
Interrupt coalescing settings (viewed with "ethtool -c {ifname}", changed with "ethtool -C"):
rx-usecs: 16
rx-frames: 44
rx-usecs-irq: 0
rx-frames-irq: 0
THANK YOU
Patrick Ladd
Technical Account Manager
Red Hat
[email protected]