Updated: Nov 17
eBPF has been making quite a splash lately. An elegant way to extend the Linux kernel (and now Windows too), it has far-reaching implications. Although eBPF was initially used to enhance system observability beyond existing tools, in this post we will explore how eBPF can be used to boost Linux networking performance.
A quick recap:
As far as networking is concerned, let’s take a quick look at the hook points for eBPF programs inside the Linux kernel:
The hooks of particular interest for this discussion are the NIC hook (invoked just after a packet is received at the NIC) and the TC hook (invoked just before Linux starts processing a packet with its TCP/IP stack). Programs loaded at the former hook are known as XDP programs, while those at the latter are called eBPF-TC programs. Although both use eBPF’s restricted C syntax, there are significant differences between the two types (we will cover them in a separate blog post). For now, we just need to remember that for container-to-container or container-to-outside communication, eBPF-TC makes much more sense, since memory allocation (for the skb) happens either way in such scenarios.
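To make the TC hook concrete, here is a minimal sketch of what an eBPF-TC program looks like in restricted C. This is an illustrative skeleton, not Loxilight's actual code; it assumes a clang/BPF toolchain (built with `clang -O2 -target bpf -c tc_pass.c -o tc_pass.o`) and simply passes every packet on to the kernel stack.

```c
// Minimal eBPF-TC classifier sketch (hypothetical, not Loxilight code).
#include <linux/bpf.h>
#include <linux/pkt_cls.h>

#ifndef SEC
#define SEC(name) __attribute__((section(name), used))
#endif

SEC("classifier")
int tc_ingress(struct __sk_buff *skb)
{
    // A real fast path would parse the packet headers here and consult
    // eBPF maps (conntrack, NAT tables, etc.) before rewriting or
    // redirecting the packet. This skeleton just lets every packet
    // continue into the regular kernel stack.
    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```

The object file produced by clang can then be attached at a device's TC ingress hook with iproute2's `tc` tool.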
The performance bottlenecks:
Coming back to the focus of our discussion, which is of course performance, let us step back and take a look at why Linux sucks at networking performance (or rather, why it could perform much faster). Linux networking evolved from the days of dial-up modems, when speed was not of utmost importance. Over the years, code kept accumulating. Although it is extremely feature-rich and RFC-compliant, it hardly resembles a high-performance data-path networking engine. The following figure shows a call trace of the Linux kernel networking stack:
The point is that it has become incredibly complex over the years. Once features like NAT, VXLAN and conntrack come into play, Linux networking stops scaling due to cache degradation, lock contention and so on.
One problem leads to the other:
To avoid these performance penalties, user-space frameworks like DPDK have been widely adopted; they bypass the Linux kernel networking stack entirely and process packets directly in user space. As simple as that may sound, such frameworks have serious drawbacks: cores must be dedicated to packet processing (no multitasking), applications written against one user-space driver (PMD) might not run on another as-is, and apps frequently break across DPDK releases. Finally, various parts of the TCP/IP stack and the associated provisioning have to be redone. In short, it leads to a massive and completely unnecessary reinvention of the wheel. We will discuss these factors in a detailed post later. For now, suffice it to say that if we want a box to do more than just networking, DPDK is not the right choice. In the age of distributed edge computing and the immersive metaverse, the need to do more with less is of utmost importance.
eBPF comes to the rescue:
Now, eBPF changes all of this. Since eBPF is hosted inside the kernel, its biggest advantage is that it can co-exist with Linux/OS without dedicating cores, bypassing the kernel stack, or breaking tools the community has used for ages.
At NetLOX, we developed an extremely scalable eBPF stack, “Loxilight”, from scratch, which can hook in either as XDP or TC-eBPF. Loxilight acts as a fast path on top of the Linux networking stack and closely mimics a hardware-like data-path pipeline. In other words, it simply accelerates the Linux networking stack without ripping apart the existing software landscape. The Linux kernel continues to act as the “slow path” whenever Loxilight encounters a packet outside its scope of operation. Loxilight also implements its own conntrack, stateful firewall/NAT, DDoS handling and load balancer on top of its eBPF stack, scaling connections up to a million entries. We further made sure that tools like iptables work transparently with Loxilight. There are tons of other exciting features already available or planned (more info on our website).
What matters most are the performance numbers:
As a final part of this blog (and the most interesting part), let's take a quick look at the kind of performance boost Loxilight provides compared to default Linux networking.
The server specs that we used for our tests:
Node 1: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz (40-core), Ubuntu 20.04.2 LTS, 125 GB RAM
Node 2: Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz (40-core), Ubuntu 20.04.2 LTS, 125 GB RAM
For the first scenario, we create a single-node (server) networking topology consisting of various types of networking – plain L2, L3, L3 with NAT, VLAN to VXLAN, and routing with NAT over VXLAN.
There are a total of 8 pods/containers/hosts in a single node, as seen above.
Loxilight-eBPF is attached at the ingress TC hook of all veth-pair endpoints, which are named hs1, hs2 and so on.
Connection tracking is enabled by default for all communication.
All measurements are between two pods/containers on the same node/server (with or without Loxilight).
The scripts to create this network can be found here (if you are brave enough to experiment).
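To give a flavor of what those scripts do, here is a hedged sketch of the kind of commands involved: creating two pod namespaces joined to the host by veth pairs, and attaching an eBPF object at each veth's TC ingress hook. All names and addresses are illustrative (`loxilight.o` stands in for the real object file), and the commands require root and iproute2.

```shell
# Create two network namespaces acting as "pods" (names hypothetical)
ip netns add pod1
ip netns add pod2

# veth pairs: host-side ends hs1/hs2, peer ends inside the namespaces
ip link add hs1 type veth peer name hs1-p netns pod1
ip link add hs2 type veth peer name hs2-p netns pod2
ip link set hs1 up
ip link set hs2 up
ip netns exec pod1 ip addr add 10.0.1.2/24 dev hs1-p
ip netns exec pod1 ip link set hs1-p up
ip netns exec pod2 ip addr add 10.0.1.3/24 dev hs2-p
ip netns exec pod2 ip link set hs2-p up

# Attach the eBPF program at the TC ingress hook of each veth endpoint
# ("da" = direct-action mode; section name must match the program's SEC)
tc qdisc add dev hs1 clsact
tc filter add dev hs1 ingress bpf da obj loxilight.o sec classifier
tc qdisc add dev hs2 clsact
tc filter add dev hs2 ingress bpf da obj loxilight.o sec classifier
```

The `clsact` qdisc provides the ingress (and egress) attachment points without affecting packet scheduling, which is why it is the usual choice for eBPF-TC programs.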
1. Single stream - Traffic bandwidth (Tool - iperf/iperf3)
2. 50 streams - Traffic bandwidth (Tool - iperf/iperf3 -P 50)
3. Traffic latency (Tool - ping/qperf)
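The three measurements above can be driven with commands along these lines (a sketch; the namespace names and addresses are hypothetical, and the iperf3/qperf servers must be started in the target pod first):

```shell
# Start servers in one pod (iperf3 daemonized, qperf in the background)
ip netns exec pod1 iperf3 -s -D
ip netns exec pod1 qperf &

# 1. Single-stream bandwidth from the other pod
ip netns exec pod2 iperf3 -c 10.0.1.2 -t 30

# 2. 50 parallel streams
ip netns exec pod2 iperf3 -c 10.0.1.2 -P 50 -t 30

# 3. Latency
ip netns exec pod2 ping -c 100 10.0.1.2
ip netns exec pod2 qperf 10.0.1.2 tcp_lat udp_lat
```

Running each test with and without the eBPF program attached gives the before/after comparison used below.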
1. Loxilight (eBPF) shows roughly a 5x boost over Linux for advanced and complicated topologies.
2. It also cuts down latency considerably.
For the second scenario, we create a two-node (server) networking topology that mimics the popular Kubernetes node-to-node communication scheme. All hosts/pods in a single node are in the same subnet, but the subnets differ from those of the hosts/pods in the other node(s). A VXLAN tunnel is used as the node-to-node overlay, so we need to do routing on top of VXLAN for inter-node communication. This is quite a standard affair in the life of a packet inside Kubernetes. We also configure (D)NAT to emulate Kubernetes service networking for the performance measurements.
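For reference, the overlay side of this scenario can be sketched as follows (run on one node; the VNI, interface names, addresses and service VIP are all illustrative, and root is required):

```shell
# VXLAN overlay interface towards the peer node (VNI 100, standard port)
ip link add vxlan100 type vxlan id 100 dev eth0 \
    remote 192.168.1.2 dstport 4789
ip addr add 172.16.1.1/24 dev vxlan100
ip link set vxlan100 up

# Route the remote node's pod subnet over the overlay
ip route add 10.0.2.0/24 via 172.16.1.2

# Emulate a Kubernetes service with DNAT: traffic to the service VIP
# is rewritten to a backend pod on the remote node
iptables -t nat -A PREROUTING -d 10.96.0.10 -p tcp --dport 80 \
    -j DNAT --to-destination 10.0.2.5:80
```

With this in place, pod-to-pod traffic crosses the VXLAN tunnel and is routed (and optionally NATed) on the way, which is exactly the path the benchmarks below exercise.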
1. Pod-to-Pod across nodes - Traffic Bandwidth (Tool - iperf/iperf3)
Node1 <-> Node2 Interconnect - Port speed : 40G
2. Pod-to-Pod across nodes - Traffic Latency (Tool - ping/qperf)
1. Loxilight (eBPF) shows roughly a 2x boost compared to Linux.
2. 40G line rate (with iperf3) is reached with around 4 cores, while the Linux kernel takes ~9 cores for line rate.
3. When run as multiple processes, iperf3 naturally runs on separate cores and can therefore pump more data, so we use iperf3 in both multi-threaded and multi-process mode to check performance.
Loxilight is geared to enhance performance in cloud-native environments, but as this post shows, it can be used in potentially any server (or user) networking use-case. Apart from top-notch performance, it offers great programmability and visibility, inheriting these traits from the underlying eBPF technology and enhancing them with its architecture. For a demo or a free 3-month trial, kindly contact us. We are also working towards an open-source fork for the greater good of the community.
In future posts of this series, we will explore the following :
How Loxilight/eBPF tackles security and integrates IPsec and WireGuard.
The role of eBPF and DPUs in edge networking, security and rich analytics.
Loxilight's use-case for DDoS protection on Android/Windows.