Performance Benchmark Analysis of Egress Filtering on Linux
Traffic filtering is a common technique used to protect networks and hosts from malicious activity. Filtering
can be applied to incoming or outgoing traffic, better known as ingress and egress filtering. There are many
use cases for ingress filtering: mitigating DDoS attacks, blocking spam, restricting access to services
from specific geographic regions and so on.
One common application of egress filtering is to prevent applications and users from reaching remote hosts.
These kinds of filters are usually based on a large deny-list of IP address ranges, like Reputation Block Lists.
The technology used to perform this task has to be able to handle a high number of restricted hosts, up to
millions in some cases.
Our friends at SAP asked us to perform a benchmark of the common Linux technologies available for this task.
This blog post presents the methodology and results from benchmarking some of the Linux filtering
technologies: eBPF, IP sets and iptables.
Update 2020-09-14: we have received some great feedback and suggestions about other options to try, including
BPF JIT optimization (that we left inadvertently disabled) and inline BPF map lookup. We will take a look at this
and update this post as soon as possible - expected at the end of the month.
Goals
We had the following goals going into this study:
Provide a reproducible benchmark framework that anyone else can download and use.
Evaluate the different mechanisms in the Linux networking stack that can filter out a large number of IP addresses,
and assess their scalability when the number of IP ranges to block reaches 1 million.
Scenario
We aim to understand the cost of performing egress filtering based on the destination IP in a large set of IP
ranges.
The scenario we consider is composed of a client and a server computer that communicate through an IP
network. The egress filtering is performed on the client machine and there is no filtering performed on the server
side.
Metrics
An egress filter has to perform a matching operation to decide whether a packet can continue on its way or has to be
dropped. This operation can be costly when the number of blocked IPs is high: such a filter impacts the
throughput of IP connections, consumes extra CPU and increases latency. We therefore decided to measure the
following metrics:
Throughput
CPU usage
Latency
eBPF
eBPF is a virtual machine built into the Linux kernel. In the networking area, it allows the user to load programs
that are executed each time a packet arrives at, or is sent out of, a network interface. Those eBPF programs can
modify, redirect or drop packets.
There are different types of networking eBPF programs: CGROUP_SKB for per-cgroup filtering, SCHED_CLS and
SCHED_ACT for filtering at the traffic control level, and XDP for filtering at the network interface level.
In our scenario, we want to apply the same filtering regardless of the cgroup of the application generating the
traffic, and traffic coming from non-local applications or forwarded traffic should be filtered as well. This makes
filtering at the socket level unfit for our scenario. XDP programs can only be attached to the ingress path [1],
removing them from consideration too. By process of elimination, we only consider eBPF programs
attached at the traffic control layer.
Given that we want to keep the tests as simple as possible, we decided to implement our own filtering program in
eBPF and attach it to the traffic control layer with a clsact qdisc. The clsact qdisc, introduced in Linux 4.5, is a
pseudo qdisc that allows attaching eBPF programs on ingress and egress using the direct-action mode.
Our filter implementation uses an LPM map to store the list of IP address ranges to block. eBPF implements the LPM
algorithm with a digital tree (trie).
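What follows is a minimal sketch of such a filter, not the exact program used in the benchmark (that one lives in the repository linked below); the map name blocked_nets, its size and the value type are illustrative. It drops IPv4 packets whose destination address matches an entry in an LPM trie map.

// Sketch of a tc/clsact egress filter using an LPM trie map (illustrative).
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct lpm_key {
    __u32 prefixlen;   /* an LPM trie key must start with the prefix length */
    __u32 addr;        /* IPv4 address in network byte order */
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 1000000);
    __type(key, struct lpm_key);
    __type(value, __u32);
    __uint(map_flags, BPF_F_NO_PREALLOC);   /* required for LPM tries */
} blocked_nets SEC(".maps");

SEC("classifier")
int egress_filter(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return TC_ACT_OK;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return TC_ACT_OK;

    /* Longest-prefix-match lookup on the destination address. */
    struct lpm_key key = { .prefixlen = 32, .addr = ip->daddr };
    if (bpf_map_lookup_elem(&blocked_nets, &key))
        return TC_ACT_SHOT;   /* destination is in the deny-list: drop it */

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";

An object file built from such a program could be attached with something like tc qdisc add dev eth0 clsact followed by tc filter add dev eth0 egress bpf direct-action obj egress_filter.o sec classifier.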
iptables
We tested the standard iptables implementation by adding DROP rules to the OUTPUT chain in the filter table.
The rules we used have the following format:
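A representative rule, with an illustrative destination network (the actual networks and any extra matches are generated by the benchmark tooling), is:
-A OUTPUT -d 192.0.2.0/24 -o eth0 -j DROP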
While implementing this filter, we found that appending each rule independently was too slow, so we had to use
the iptables-restore utility to load the rules in bulk.
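As a sketch of that approach (networks and file name are illustrative), the rules are written to a file in iptables-restore format,
*filter
-A OUTPUT -d 192.0.2.0/24 -o eth0 -j DROP
-A OUTPUT -d 198.51.100.0/24 -o eth0 -j DROP
COMMIT
and loaded in a single invocation with iptables-restore --noflush < rules.txt, where --noflush keeps the rules already present in the table.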
iptables uses a linear search algorithm for rule matching.
IP Sets
We also performed the testing with the IP sets framework. We used an IP set of type hash:net and linked it to
iptables with the following rule:
-A OUTPUT -o eth0 -m set --match-set myset dst -m comment --comment benchmark -j DROP
The IP address to check is hashed and the result of the hash is used as the index in a hash table. Each bucket of
the hash table contains an array of networks for that hash. When an IP address is checked against an IP set of
type “hash:net”, it is not known what network size will match. The IP set implementation hashes the IP address
for all possible network sizes. For instance, if the IP to match is 1.2.3.4, it looks for 1.2.3.4/32, for 1.2.3.4/31 and so
on until all the 32 possible network sizes have been tested.
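For reference, a hash:net set like the one referenced by that rule could be created and populated with commands along these lines (the maxelem value is an assumption, since the default limit of 65536 entries is too small for the larger tests):
ipset create myset hash:net maxelem 1000000
ipset add myset 192.0.2.0/24
Bulk loading is typically done by piping an ipset save-style file into ipset restore.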
Benchmark Set-up
We used iperf3 to measure TCP and UDP performance as well as CPU usage. The standard ping utility was
used to measure latency. We tested with 10, 100, 1k, 10k, 100k and 1M rules, and each test was run 5 times for
statistical robustness.
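The exact invocations are scripted in the benchmark repository; representative commands would look like the following, where the server address and the test duration are placeholders and the unlimited UDP rate (-b 0) is an assumption:
iperf3 -c 192.0.2.10 -t 30 (TCP throughput)
iperf3 -c 192.0.2.10 -u -b 0 -t 30 (UDP throughput)
iperf3 -c 192.0.2.10 -u -b 1G -t 30 (CPU usage test at a 1Gbps target)
ping -i 0.001 -c 1000 192.0.2.10 (latency)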
We used two bare metal servers, running on Packet, as the client and server machines. The specifications of these
servers are:
c2.medium.x86
1x AMD EPYC 7401P 24-Core Processor @ 2.0GHz
2x 120GB SSD
2x 480GB SSD
64GB RAM
2x 10Gbps
Reproducibility
The https://github.com/kinvolk/egress-filtering-benchmark repository contains all the tools and instructions needed
to reproduce these tests.
Results
Test #1 - TCP Throughput
The throughput graph shows that the test reaches line rate in most of the cases. That rate is 10Gbps because, with a
single connection, only one of the two bonded interfaces is used. Only iptables with more than 10k rules is unable
to sustain that rate. From this test we can conclude that iptables doesn't scale well beyond 100k rules; unfortunately,
we cannot conclude anything about the other filters because network performance is the bottleneck.
Test #2 - UDP Throughput
In order to stress the CPU more, we performed another test using UDP.
We can see that iptables does not scale well beyond 1k rules and that IP sets are a bit faster than eBPF on the
clsact qdisc when the number of rules is high.
Test #3 - CPU Usage
We used iperf with a target bandwidth of 1Gbps (-b 1G) to avoid saturating the CPU and to be able to compare the
CPU usage of the different filters. We take the CPU usage reported by iperf, which includes all the applications
running on the host. From the above graph we can see that the CPU usage with iptables increases when using
more than 1k rules. The CPU usage with the eBPF and IP sets filters stays almost constant, with slightly higher
usage in the eBPF case.
Test #4 - Latency
We used the standard Linux ping utility with the -i 0.001 and -c 1000 parameters,
sending 1000 pings at a 1 millisecond interval. The above graph shows that the latency is not affected by the
number of rules with the eBPF and IP sets filters; it stays at ~27µs in those cases. On the other hand, latency
increases linearly with the number of rules in the iptables case.
Setup Time
The above graph shows that the time required to set up the rules in all the filters increases linearly with the
number of rules.
With 100,000 IPs, the setup time is still less than one second. For 1 million IPs, the setup time is less than 10
seconds. This shows that in practice, all 3 methods have an acceptable setup time.
The setup time is very dependent on the benchmark implementation rather than being inherent to the filtering
method. Each filtering method could be optimised in the benchmark code if it were deemed necessary.
For the bpf filter: the benchmark performs one bpf() system call for every IP range. From Linux 5.6, it is possible to
perform several updates in an eBPF map with a single bpf() call (see the sketch below).
For the iptables and IP sets filters, the benchmark executes external commands and feeds each IP range
individually over a pipe. We could avoid that by communicating the list to the kernel using the Netlink
protocol directly. There are some attempts to implement IP Sets support in Golang with the
vishvananda/netlink library.
For this reason, the reader should be cautious before comparing the setup time of one filter to another. These
numbers should not be interpreted as one method being better than another; they show the expected setup
performance and that setup time is unlikely to be a problem for any of the filtering methods.
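As an illustration of the batch update mentioned above, here is a minimal sketch based on libbpf's bpf_map_update_batch() wrapper for BPF_MAP_UPDATE_BATCH; the key layout is assumed to match the LPM filter sketch shown earlier, and error handling is reduced to the bare minimum.

// Load many IP ranges into an LPM trie map with a single bpf() syscall
// (requires Linux 5.6+ and a libbpf version that exposes the batch API).
#include <bpf/bpf.h>
#include <linux/types.h>

struct lpm_key {
    __u32 prefixlen;   /* an LPM trie key must start with the prefix length */
    __u32 addr;        /* IPv4 network address in network byte order */
};

/* Returns 0 on success. On error, *count is updated by the kernel to the
 * number of elements that were processed before the failure. */
int load_ranges_batched(int map_fd, const struct lpm_key *keys,
                        const __u32 *values, __u32 *count)
{
    /* NULL opts: default flags (BPF_ANY) for every element. */
    return bpf_map_update_batch(map_fd, keys, values, count, NULL);
}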
We found that it’s not possible to run a test with 100M entries in the eBPF case as this would require more than
4GB of memory for the LPM map and it’s not allowed by the Linux kernel2. We decided to test with 10M and 50M
rules.
From the above chart we can see that there is not a big difference in throughput and that the same tendency
holds even with 10M and 50M rules.
In the UDP tests, the CPU is saturated and is not able to reach the 10Gbps line rate. We can use the BCC profile
tool to get the list of functions that are executed in the tests for each filtering method.
We have to run the tool with the -K parameter to get the traces from the kernel.
sudo /usr/share/bcc/tools/profile -K
iptables
For the iptables case, it prints the following trace:
ipt_do_table
ipt_do_table
nf_hook_slow
__ip_local_out
ip_local_out
__ip_queue_xmit
__tcp_transmit_skb
The order of calls is from the bottom to the top. We can see how the nf_hook_slow function calls ipt_do_table,
which iterates over each rule with a simple "for" loop. This explains why the performance decreases linearly with
the number of rules.
eBPF
We got the following trace for the eBPF case:
longest_prefix_match.isra.0
longest_prefix_match.isra.0
trie_lookup_elem
___bpf_prog_run
__bpf_prog_run32
cleanup_module
tcf_classify
__dev_queue_xmit
ip_finish_output2
ip_do_fragment
ip_output
In this case the first important call is to tcf_classify, then the BPF program is run in ___bpf_prog_run and the
lookup in the LPM map is performed in trie_lookup_elem and longest_prefix_match, which walk the trie. This
explains the better performance, since walking the trie takes at most 32 steps for IPv4 addresses.
IP Sets
The traces for IP sets are the following:
hash_net4_test
hash_net4_test
hash_net4_kadt
ip_set_test
set_match_v4
ipt_do_table
iptable_filter_hook
nf_hook_slow
__ip_local_out
ip_local_out
ip_send_sk
udp_send_skb.isra.0
udp_sendmsg
For the IP sets case we can see that the first part of the stack is the same as for iptables: nf_hook_slow ->
iptable_filter_hook -> ipt_do_table. Then set_match_v4 is called, after which the IP set specific logic is executed:
ip_set_test, hash_net4_kadt and hash_net4_test, which looks up all possible network sizes for the given IP.
Conclusion
The throughput, CPU usage and latency tests show that IP sets and eBPF filters scale well even with 1 million IPs.
On the other hand, iptables doesn't scale that well: even with a relatively low number of rules like 1k or 10k, it
shows a considerable difference compared to the other filters. This is expected, as iptables performs a linear
search over the rule list. IP sets slightly outperform eBPF with an LPM map on a clsact qdisc: they show a slightly
higher throughput in the UDP test and use less CPU.
Update 2020-09-14: we only ran the benchmark with the eBPF interpreter instead of generating efficient JIT
code, so we cannot compare the performance between IP sets and eBPF in the general case at this point.
eBPF cannot handle more than ~89M entries due to restrictions in the Linux kernel.
We suggest that the decision of which tool to use shouldn't be based strictly on the results of this benchmark.
Other aspects should be taken into account as well: whether the technology is already used on the hosts, whether
other features that are needed now or in the future could be provided by one of the technologies, the way traffic
statistics can be exposed to a monitoring tool, and so on.
Keep in mind that this benchmark only tests a specific use case: filtering a large deny-list of IP address ranges on
egress. A more realistic cluster would have other filters, including on ingress.
Notes
1. There are some proposals to add support for the egress path, but they aren't merged at the time of writing.
2. Each element of the LPM map requires 48 bytes in our case (40 from sizeof(struct lpm_trie_node), plus
4 from attr->value_size and 4 from trie->data_size). This means the maximum number of possible
elements in the map is 2^32 / 48 ≈ 89 million.
WRITTEN BY Mauricio Vásquez / Alban Crequy · September 10, 2020