BPF has some wow-presentations, showing how it enables new performance measuring and tracing. Brendan Gregg has a whole bunch, for example. But I don’t think it’s very well explained just why BPF is such a big deal.

Most of the demos are essentially cool and useful looking tools, with an “oh by the way BPF made this happen”. It’s similar to how software announcements often have a title that ends with “written in Go”. It gives a vibe of “so what?”.

If you’re interested in system tooling and configuration, and aren’t already aware of BPF, then this is for you.

I’m not an expert on BPF, but this will hopefully help someone else bootstrap faster.


bpftrace is really cool. Clearly it’s inspired by DTrace. But one should not mistake bpftrace for BPF. bpftrace is just one more tool that uses BPF, albeit one that allows you to create trace points in a domain specific language.

This is not the full power of BPF. It’s not at all the big picture.

BPF and configs

Let’s take packet filtering as an example. Once upon a time in Linux there was ipfwadm. I bet there are people reading this who were not born when we moved off of that. Hell, its sequel ipchains is not exactly new either. After that came iptables. iptables still works, and is probably the most popular still, but technically it’s replaced by nftables.

What all these have in common is that they are configs. They’re data. They provide a list of rules, and a rule engine goes through the rules, one by one, and finds a “matching rule” and performs an “action”.

In other words the user starts with an intent, and encodes that in a configuration, which in turn is checked for every packet.
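
To make the “config as data” model concrete, here is a minimal sketch of a first-match rule engine in C. The struct layouts, field names, and the evaluate function are all hypothetical, made up for illustration; no real packet filter is this simple.

```c
#include <stddef.h>
#include <stdint.h>

enum action { ACCEPT, DROP };

/* One rule: match on protocol and destination port; 0 means "any". */
struct rule {
    uint8_t     proto;     /* 6 = TCP, 17 = UDP, 0 = any */
    uint16_t    dst_port;  /* 0 = any */
    enum action action;
};

struct packet {
    uint8_t  proto;
    uint16_t src_port;
    uint16_t dst_port;
};

/* Walk the rules in order; the first rule that matches decides. */
static enum action evaluate(const struct rule *rules, size_t n,
                            const struct packet *pkt,
                            enum action default_action)
{
    for (size_t i = 0; i < n; i++) {
        const struct rule *r = &rules[i];
        if (r->proto != 0 && r->proto != pkt->proto)
            continue;
        if (r->dst_port != 0 && r->dst_port != pkt->dst_port)
            continue;
        return r->action;
    }
    return default_action;
}
```

Note that a rule here can only compare a packet field against a constant. There is no slot in the struct for comparing two packet fields with each other, and that is the kind of limit a config format bakes in.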

The problem is that if something cannot be encoded in the config, then it’s not possible to make the packet filter do it. Sure, every generation of tooling made more things possible to configure, but it’ll never be complete.

OpenBSD has instead improved and expanded what pf can do, but there too you are at the mercy of what’s possible to encode in the configuration.

E.g. what if you want to filter all packets whose source and destination port are the same? The only option I can think of is to create 65536 separate rules (times two for TCP/UDP), which is not only messy, but also affects performance, since they’ll be evaluated for every packet sequentially.

Classic solution

The classic solution to these special configurations is to add an interface between userspace and the kernel, and have the kernel ask a daemon what to do.

This has many drawbacks:

  • What should happen if the userspace daemon crashes?
  • Userspace/kernel context switches are expensive, and here we need two of them per packet.
  • Both sides have to be configured. The kernel needs a configuration to know it’s supposed to ask userspace, and userspace needs to “connect” to that.
  • Complex kernel/userspace interfaces are where kernel bugs happen.
  • Ok, that’s great for iptables. But what about all the other configs? Should they all create this type of link?

BPF solution

BPF does not deal in configuration, but in code. BPF allows you to load code into the kernel (in a safe way), and you’re not constrained by what can be encoded in rules.

For our example you don’t tell the kernel to drop packets with source port equal to destination port. Instead you upload a program that hooks into “packet received”. You tell the kernel: “run this code to know what to do with that packet”.
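
As a rough idea of what such a program’s logic looks like, here is the same-port check written as plain C that runs in userspace. It is a model, not a loadable BPF program: the verdict enum and the function names are hypothetical, and a real version would be built with clang -target bpf, attached to a hook such as XDP or tc, and return that hook’s own verdict codes.

```c
#include <stddef.h>
#include <stdint.h>

enum verdict { VERDICT_PASS, VERDICT_DROP };

/* Read a 16-bit big-endian (network byte order) value. */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * l4 points at the start of a TCP or UDP header. Both begin with the
 * source port (2 bytes) followed by the destination port (2 bytes).
 */
static enum verdict filter_same_port(const uint8_t *l4, size_t len)
{
    /* Explicit bounds check before touching packet bytes; the BPF
     * verifier requires checks like this in a real program. */
    if (len < 4)
        return VERDICT_PASS;
    return read_be16(l4) == read_be16(l4 + 2) ? VERDICT_DROP : VERDICT_PASS;
}
```

One comparison replaces the 65536 rules per protocol, and it costs the same no matter which port is involved.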

There’s even bpftables now. I wouldn’t say that it completely removes the need for iptables & nftables, but it removes the need for the kernel portion of them. iptables/nftables userspace could compile to BPF, and all rule processing code could be removed from the kernel.

And so it goes for all configurations and settings.

This is a big deal.

Taking this to the extreme

You could remove the routing table code from the kernel (a routing table is a configuration), and have a BPF program decide the next hop. This may sound strange to you, but if you’ve ever done policy routing under Linux, with ip rule and numbered routing tables, then this may make more sense.

Maybe more realistic is to rip out the support for more than the default routing table, and allow creation of userspace domain specific languages that just upload the code to the kernel.

You could implement filesystem access controls in a BPF program, and rip out unix filesystem permissions code. Maybe not a good idea, but you could.

Did you know that in Linux you can customize the TCP initial congestion window and receive window by applying attributes to the routing table entry? That’s clever. But if you want to set the cwnd based on port, then the “cleanest” way is probably to use iptables to -j MARK packets on those ports, then ip rule to change routing table, and then create a new routing table as a copy of the default one but with these settings changed. And remember to update both tables when you need to change one. I said “cleanest”, not clean.

With BPF you can just hook into sockops on the BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB operation, and change it based on any criteria you want.
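
The “any criteria you want” part is just ordinary code. Here is a sketch of such a policy as a plain C function. The ports and window sizes are made up for illustration, and the function name and USE_DEFAULT constant are hypothetical. As far as I know, a real sockops program would apply the result with bpf_setsockopt(skops, SOL_TCP, TCP_BPF_IW, &iw, sizeof(iw)), but treat that as an assumption to check against your kernel’s headers.

```c
#include <stdint.h>

#define USE_DEFAULT (-1)

/*
 * Pick an initial congestion window (in segments) from the ports of a
 * connection. Hypothetical policy: the numbers are illustrative only.
 */
static int choose_initial_cwnd(uint16_t local_port, uint16_t remote_port)
{
    if (local_port == 443 || remote_port == 443)
        return 40;   /* bulk-ish HTTPS traffic: start larger */
    if (local_port == 22 || remote_port == 22)
        return 4;    /* interactive SSH: start small */
    return USE_DEFAULT;
}
```

Compare that to the MARK, ip rule, and duplicated routing table dance: the whole policy is one function, and changing it means editing one place.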

What I have actually done with BPF

Measure USB traffic, but the idea has more potential

This uses the bpftrace tool, and thus doesn’t use the full power of BPF. My code in the domain specific language compiles into BPF that hooks into specific function calls in the USB stack.

As I understand it, BPF allows me to hook into those same places and override the function return value. That way I could create a firewall for USB.

Imagine how much work it would be to create a USB firewall without BPF. You would have to:

  • specify a configuration language
  • add a way to upload a config to the kernel
  • create the user language
  • create command line tooling for it
  • convince kernel developers to accept your patches
  • convince distribution vendors to add your tools to the OS

And then it wouldn’t be available on all Linux systems until a few years later when kernels and distributions pick it up.

It would literally take years to make this available to people. With BPF you could do it in a weekend, and other people could start using it without them even needing to reboot.

Set default retransmit time

By default Linux retransmits a SYN packet after 1 second, if there is no response. Then it backs off exponentially: the next retransmit comes after 2 more seconds, then 4 more, and so on.
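
The schedule is easy to work out by hand, but a small helper makes it clearer how much a larger initial timeout shifts everything. This is a simplified sketch: the function name is hypothetical, and it ignores the kernel’s upper bound on the timeout and the limit on the number of SYN retries.

```c
/*
 * Seconds after the first SYN at which retransmission number n
 * (1-based) is sent, for an initial timeout of rto_s seconds that
 * doubles after every retransmission.
 */
static unsigned syn_retransmit_time(unsigned n, unsigned rto_s)
{
    unsigned t = 0;
    for (unsigned i = 0; i < n; i++) {
        t += rto_s;   /* wait out the current timeout */
        rto_s *= 2;   /* then double it for the next round */
    }
    return t;
}
```

With the default 1 second timeout the retransmits land at 1, 3, and 7 seconds. With a 20 second initial timeout the first one does not go out until 20 seconds, and the second at 60.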

While playing with LoRa and AX.25 this timer has caused problems. Under some settings it can take more than a second to send the SYN packet (over 5 seconds in the slowest LoRa mode). This means that as soon as the SYN is done sending, the retransmit gets sent too. This ties up the radio for another 5 seconds, and likely prevents the reply from being received. My LoRa hardware doesn’t seem to be able to listen before transmit (LBT), and therefore doesn’t do CSMA.

BPF lets me change that timer. I wouldn’t say that the interface is intuitive (it took me a few hours to find a way to compile and load the code), but once you know how, it’s simple.

#include <linux/bpf.h>

#define SEC(NAME) __attribute__((section(NAME), used))

// TODO: assumes little-endian (x86, amd64)
#define bpf_ntohl(x)  __builtin_bswap32(x)

SEC("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
  // -1 means "keep the kernel default".
  skops->reply = -1;
  // TODO: filter on outgoing interface
  // remote_port is in network byte order, local_port in host byte order.
  if (bpf_ntohl(skops->remote_port) != 12345 && skops->local_port != 12345) {
    return 0;
  }

  const int op = (int) skops->op;
  // The kernel is asking for the initial SYN retransmit timeout.
  if (op == BPF_SOCK_OPS_TIMEOUT_INIT) {
    // TODO: this is in jiffies, and despite `getconf CLK_TCK` returning 100,
    // HZ is clearly 250 on my kernel.
    // 5000 / 250 = 20 seconds
    skops->reply = 5000;
    return 1;
  }
  return 0;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = 1;

Compile & load using clang (apt install clang llvm) and bpftool (kernel sources in tools/bpf/bpftool):

# I needed kernel headers. Preferably for the kernel you're actually running.
# Compile
clang $CFLAGS -target bpf  -Wall -g -O2 -c set_rto.c -o set_rto.o
# Remove any previous version from the kernel
sudo bpftool cgroup detach "/sys/fs/cgroup/unified/" sock_ops \
    pinned "/sys/fs/bpf/bpf_sockop"
sudo rm -f /sys/fs/bpf/bpf_sockop
# Load the new version and pin it under /sys/fs/bpf.
sudo bpftool prog load set_rto.o /sys/fs/bpf/bpf_sockop
# Attach the pinned program to the cgroup.
sudo bpftool cgroup attach /sys/fs/cgroup/unified/ sock_ops \
    pinned /sys/fs/bpf/bpf_sockop

And now connections to or from port 12345 have a 20s initial retransmit timer. I’ll update this post with the improved code (see TODOs) once I code that up.
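
The 5000 in the code only means 20 seconds because HZ is 250 on that kernel. Here is a small sketch of the arithmetic, with hypothetical helper names, showing how the same constant turns into a different timeout on a kernel built with another HZ.

```c
/* Convert whole seconds to jiffies for a kernel built with the given HZ. */
static unsigned long seconds_to_jiffies(unsigned long seconds, unsigned long hz)
{
    return seconds * hz;
}

/* Convert jiffies back to whole seconds, rounding down. */
static unsigned long jiffies_to_seconds(unsigned long jiffies, unsigned long hz)
{
    return jiffies / hz;
}
```

So 20 seconds is 5000 jiffies at HZ=250, but the same 5000 would be 50 seconds at HZ=100 and only 5 seconds at HZ=1000. On many distributions the build-time value can be found by searching for CONFIG_HZ in the kernel config under /boot, though where that file lives is an assumption about your system.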

In theory I could create a domain specific language on top of this (all in userspace) for creating rules about when the timer is set to what, and have that compile to BPF.