I’ve previously mentioned my io-uring webserver tarweb. I’ve now added another interesting aspect to it.

As you may or may not be aware, on Linux it’s possible to send a file descriptor from one process to another over a unix domain socket. That’s actually pretty magic if you think about it.

You can also send unix credentials and SELinux security contexts, but that’s a story for another day.
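
The file descriptor rides as SCM_RIGHTS ancillary data on a sendmsg() call. Here’s a minimal sketch of the sending side, using the nix crate (not this project’s actual code, and the exact nix signatures shift a bit between versions):

// Pass an already-open file descriptor to another process over a connected
// unix domain socket. Sketch only; error handling and retries omitted.
use std::io::IoSlice;
use std::os::fd::RawFd;

use nix::sys::socket::{sendmsg, ControlMessage, MsgFlags, UnixAddr};

fn send_fd(unix_sock: RawFd, fd_to_pass: RawFd) -> nix::Result<()> {
    // At least one byte of ordinary payload has to ride along with the
    // ancillary data on a stream socket.
    let payload = [0u8];
    let iov = [IoSlice::new(&payload)];
    let fds = [fd_to_pass];
    let cmsg = [ControlMessage::ScmRights(&fds)];
    sendmsg::<UnixAddr>(unix_sock, &iov, &cmsg, MsgFlags::empty(), None)?;
    Ok(())
}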

My goal

I want to run some domains using my webserver “tarweb”. But not all. And I want to host them on a single IP address, on the normal HTTPS port 443.

Simple, right? Just use nginx’s proxy_pass?

Ah, but I don’t want nginx to stay in the path. After SNI (read: “browser saying which domain it wants”) has been identified I want the TCP connection to go directly from the browser to the correct backend.

I’m sure somewhere on the internet there’s already an SNI router that does this, but all the ones I found stay inline in the request path, adding a hop.

Why?

A few reasons:

  1. Having all bytes bounce on the SNI router triples the total number of file descriptors for the connection (one on the backend, plus one each on the router for the downstream and upstream sides). There are limits per process and system wide, and the more descriptors you have, the more you need to juggle them in code.
  2. It also wastes CPU and RAM.
  3. I want the backend to know the real client IP address, via getpeername() or similar, on the socket itself.
  4. I don’t want restarting nginx to cut existing connections to backends.
  5. I’d like to use TLS keys that the nginx user doesn’t have access to.
  6. I used proxy_pass for livecount, and last time my blog posts hit Hacker News, nginx ran out of file descriptors and started serving 500s even for plain old static files on disk. For now I’ve moved livecount to a different port, but in the long run I want it back on port 443, yet isolated from nginx so that the latter keeps working even if livecount is overloaded.

Livecount has an open websocket to every open browser tab in the world reading a given page, so they add up. (no, it doesn’t log. It just keeps count)

What I built

I built a proof of concept SNI router. It is a frontline server that accepts TCP connections, snoops the SNI from the TLS ClientHello, and routes each connection according to its configured rules.
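
Extracting the SNI is mostly dull offset-walking through the ClientHello. Roughly like this sketch (not the actual code; it assumes the whole ClientHello arrived in one read and fits in a single TLS record):

// Pull the SNI hostname out of a raw TLS ClientHello. Returns None on
// anything unexpected; a real implementation must also buffer partial reads.
fn sni_from_client_hello(buf: &[u8]) -> Option<String> {
    // TLS record header: type 0x16 (handshake), legacy version, 2-byte length.
    if buf.len() < 5 || buf[0] != 0x16 {
        return None;
    }
    let record_len = u16::from_be_bytes([buf[3], buf[4]]) as usize;
    let rec = buf.get(5..5 + record_len)?;

    // Handshake header: type 0x01 (ClientHello), 3-byte length.
    if rec.len() < 4 || rec[0] != 0x01 {
        return None;
    }
    let mut p = 4;
    p += 2 + 32; // legacy_version + random
    p += 1 + *rec.get(p)? as usize; // legacy_session_id
    p += 2 + u16::from_be_bytes([*rec.get(p)?, *rec.get(p + 1)?]) as usize; // cipher_suites
    p += 1 + *rec.get(p)? as usize; // legacy_compression_methods

    // Extensions: 2-byte total length, then (type, length, data) entries.
    let ext_total = u16::from_be_bytes([*rec.get(p)?, *rec.get(p + 1)?]) as usize;
    let mut ext = rec.get(p + 2..p + 2 + ext_total)?;
    while ext.len() >= 4 {
        let etype = u16::from_be_bytes([ext[0], ext[1]]);
        let elen = u16::from_be_bytes([ext[2], ext[3]]) as usize;
        let data = ext.get(4..4 + elen)?;
        if etype == 0 {
            // server_name extension: 2-byte list length, then entries of
            // 1-byte name_type (0 = hostname), 2-byte length, name bytes.
            if data.len() >= 5 && data[2] == 0 {
                let nlen = u16::from_be_bytes([data[3], data[4]]) as usize;
                return String::from_utf8(data.get(5..5 + nlen)?.to_vec()).ok();
            }
            return None;
        }
        ext = &ext[4 + elen..];
    }
    None
}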

Anything the router reads from the socket is sent along to the real backend, together with the file descriptor. So the backend (in my case that’s tarweb) needs cooperating code to receive the new connection.
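
On the backend side that cooperation boils down to one recvmsg() that delivers both the snooped bytes and the passed descriptor. A sketch of what that might look like, again with the nix crate (not tarweb’s actual code):

// Receive one handed-over connection from the SNI router: the already-read
// ClientHello bytes as normal payload, the client's TCP socket as SCM_RIGHTS.
use std::io::IoSliceMut;
use std::os::fd::RawFd;

use nix::cmsg_space;
use nix::sys::socket::{recvmsg, ControlMessageOwned, MsgFlags, UnixAddr};

fn receive_connection(router_sock: RawFd) -> nix::Result<(Vec<u8>, Option<RawFd>)> {
    let mut buf = vec![0u8; 4096];
    let mut cmsg_buf = cmsg_space!(RawFd); // room for one passed descriptor

    let mut iov = [IoSliceMut::new(&mut buf)];
    let msg = recvmsg::<UnixAddr>(
        router_sock,
        &mut iov,
        Some(&mut cmsg_buf),
        MsgFlags::empty(),
    )?;

    let mut client_fd = None;
    for c in msg.cmsgs() {
        if let ControlMessageOwned::ScmRights(fds) = c {
            client_fd = fds.first().copied();
        }
    }
    let n = msg.bytes;

    Ok((buf[..n].to_vec(), client_fd))
}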

The router isn’t the cleanest code, but it works. I got ChatGPT to write the boring “parse the TLS record / ClientHello” parts. Rust is a memory safe language, so “how bad could it be?”. :-)

It seems to work for all the currently used TLS versions.

It’s not plug and play

As I said, it requires the backend to be ready to receive “hey, here’s a file descriptor, and here’s the first few hundred bytes you should treat as if you’ve read them from the client”.

File descriptors don’t have an operation to “unread”. If they did then this would be easier. Then it would “just” be a matter of giving a backend webserver a file descriptor. For some use cases that could mean starting a new webserver process that reads and writes from stdin/stdout.

Not super efficient to go back to the fork-exec-per-connection model from the previous century, but it would achieve the direct connection.

But the details are academic. We do need to pass along the snooped bytes somehow, or the TLS handshake won’t succeed. Which means it does need cooperation from the backend.

But it is privacy preserving

Because the SNI router never writes to the client, and therefore doesn’t perform a TLS handshake, it doesn’t need any private keys or certificates.

The SNI router has no secrets, and sees no secrets.

I also added a mode that proxies the TCP connection, if some SNI should be routed to a different server. But of course then it’s not possible to pass the file descriptor. So encrypted bytes will bounce on the SNI router for that kind of flow. But still the SNI router is not able to decrypt anything.

A downside is of course that bouncing the connection around the world will slow it down, add latency, and waste resources. So pass the file descriptor where possible.

The hot potato

So now my setup has the SNI router accept the connection, and then throw the very file descriptor over to tarweb, saying “you deal with this TCP connection”. Tarweb does the TLS handshake, and then throws the TLS session keys over to the kernel, saying “I can’t be bothered doing encryption, you do it”, and then actually handles the HTTP requests.
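
That key handover is the kernel TLS (kTLS) setsockopt dance. Here’s a rough sketch of the transmit half, with the constants copied from linux/tcp.h and linux/tls.h since libc doesn’t export them all (illustrative only, not tarweb’s code, and it only covers TLS 1.3 with AES-128-GCM):

use std::os::fd::RawFd;

// From linux/tcp.h and linux/tls.h.
const TCP_ULP: libc::c_int = 31;
const SOL_TLS: libc::c_int = 282;
const TLS_TX: libc::c_int = 1;
const TLS_1_3_VERSION: u16 = 0x0304;
const TLS_CIPHER_AES_GCM_128: u16 = 51;

// Layout of struct tls12_crypto_info_aes_gcm_128.
#[repr(C)]
struct TlsCryptoInfoAesGcm128 {
    version: u16,
    cipher_type: u16,
    iv: [u8; 8],   // low 8 bytes of the per-record nonce
    key: [u8; 16], // the negotiated write key
    salt: [u8; 4], // high 4 bytes of the per-record nonce
    rec_seq: [u8; 8],
}

fn enable_ktls_tx(fd: RawFd, key: [u8; 16], salt: [u8; 4], iv: [u8; 8]) -> std::io::Result<()> {
    let check = |rc: libc::c_int| {
        if rc == 0 { Ok(()) } else { Err(std::io::Error::last_os_error()) }
    };

    // Step 1: attach the "tls" upper layer protocol to the TCP socket.
    let ulp = b"tls\0";
    check(unsafe {
        libc::setsockopt(
            fd,
            libc::IPPROTO_TCP,
            TCP_ULP,
            ulp.as_ptr().cast(),
            ulp.len() as libc::socklen_t,
        )
    })?;

    // Step 2: hand the kernel the keys for the transmit direction. From here
    // on, plain write()s on the socket come out as encrypted TLS records.
    let info = TlsCryptoInfoAesGcm128 {
        version: TLS_1_3_VERSION,
        cipher_type: TLS_CIPHER_AES_GCM_128,
        iv,
        key,
        salt,
        rec_seq: [0; 8], // TLS 1.3 application keys start at sequence 0
    };
    check(unsafe {
        libc::setsockopt(
            fd,
            SOL_TLS,
            TLS_TX,
            (&info as *const TlsCryptoInfoAesGcm128).cast(),
            std::mem::size_of::<TlsCryptoInfoAesGcm128>() as libc::socklen_t,
        )
    })
}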

Well actually, there’s another strange indirection. When tarweb receives a file descriptor, it uses io-uring “registered files” to turn it into a “fixed file handle”, and closes the original file descriptor. On the kernel side there’s still a file descriptor of course, but there’s nothing in /proc/<pid>/fd/:

$ ls /proc/699874/fd -l
total 0
lrwx------ 1 thomas thomas 64 Oct 26 21:47 0 -> /dev/pts/5
lrwx------ 1 thomas thomas 64 Oct 26 21:47 1 -> /dev/pts/5
lrwx------ 1 thomas thomas 64 Oct 26 21:47 2 -> /dev/pts/5
lrwx------ 1 thomas thomas 64 Oct 26 21:47 3 -> anon_inode:[io_uring]

This improves performance a bit on the Linux kernel side.
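
With the io-uring crate that adoption step looks roughly like this (a sketch of the mechanism, not tarweb’s code; a real server would register a sparse fixed-file table up front and update individual slots per connection):

use std::os::fd::RawFd;

use io_uring::{opcode, types, IoUring};

/// Adopt a received descriptor as an io_uring "fixed file": register it,
/// close the regular fd, and from then on refer to it only by slot index.
fn adopt_fd(ring: &mut IoUring, client_fd: RawFd) -> std::io::Result<()> {
    // Put the descriptor in the fixed-file table (slot 0 in this toy example).
    ring.submitter().register_files(&[client_fd])?;

    // The kernel now holds its own reference, so the plain fd can go away;
    // it disappears from /proc/<pid>/fd/ but the socket stays open.
    unsafe { libc::close(client_fd) };

    // Later operations name the socket as types::Fixed(slot), not types::Fd(fd).
    let mut buf = vec![0u8; 4096];
    let sqe = opcode::Read::new(types::Fixed(0), buf.as_mut_ptr(), buf.len() as u32)
        .build()
        .user_data(1);
    unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    ring.submit_and_wait(1)?;
    Ok(())
}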

The SNI router does not use io-uring. At least not yet. The SNI router’s job is much smaller (it doesn’t even do a TLS handshake), much briefer (it almost immediately passes the file descriptor to tarweb), and involves much less concurrency (the connections are very short lived as far as it’s concerned), so it may not be worth it.

In normal use the SNI router only needs these syscalls per connection:

  • accept() for the new connection,
  • read() a few hundred bytes of ClientHello,
  • sendmsg() of those same bytes (plus the file descriptor) to pass it on,
  • close() to forget the file descriptor.

HTTP/3 can redirect connections

At the risk of going off on an unrelated tangent, HTTP/3 (QUIC-based) has an interesting way of telling a client to “go over there”. A built-in load balancer inside the protocol, you could say, sparing the load balancer from having to proxy everything.

This opens up opportunities to steer on more than just SNI, and is much more flexible than DNS, all without needing the “proxy” to be inline.

E.g. say a browser is in Sweden, and you have servers in Norway and Italy. And say you have measured, and find that it would be best if the browser connected to your Norway server. But due to peering agreements and other fun stuff, Italy will be preferred on any BGP anycasted address.

You then have a few possible options, and I do mean they’re all possible:

  1. Have the browser connect to norway.example.com, with Norway-specific IP addresses. Not great. People will start bookmarking these URLs, and what happens when you move your Norway servers to Denmark? norway.example.com now goes to servers in Denmark?
  2. Use DNS based load balancing, giving Swedish browsers the Norway unicast IPs. Yes… but this is WAY more work than you probably think. And WAY less reliable at giving the best experience for the long tail. And sometimes your most important customer is in that long tail.
  3. Try to traffic engineer the whole Internet with BGP announcement tweaks. Good luck with that, for the medium to long tail.
  4. Install servers in Sweden, and any other place you may have users. Then you can anycast your addresses from there, and have full control of how you proxy them (or traffic engineer them packet by packet over tunnels). Expensive if you have many locations to cover, and some traffic will still go to the wrong anycast entry point, but it’s pretty feasible.

The two DNS-based ones also have the valid concern that screwing up DNS can have bad consequences. If you can leave DNS alone that’s better.

Back to HTTP/3. If you’ve set up HTTP/3 it may be because you care about latency. It’s then easier to act on information you have about every single connection. On an individual connection basis you can tell the browser in Sweden that it should now talk to the servers in Norway. All without DNS or anycast.

Which is nice, because running a webserver is hard enough. Also running a dynamic DNS service or anycast has even more opportunities to blow up fantastically.

Where was I? Oh yeah, file descriptors

I should add that HTTP/3 doesn’t have the “running out of file descriptors” problem. Being based on UDP you can run your entire service with just a single file descriptor. Connections are identified by IDs, not 5-tuples.

So why didn’t I just use HTTP/3?

  1. HTTP/3 is complex. You can build a weird io-uring kTLS based webserver on a weekend, and control everything (except TLS handshakes). Implementing HTTP/3 from scratch, and controlling everything, is a different beast.
  2. HTTP/1 needs to still work. Not all clients support HTTP/3, and HTTP/1 or 2 is even used to bootstrap HTTP/3 via its Alt-Svc header.
  3. Preferred address in HTTP/3 is just a suggestion. Browsers don’t have to actually move.

What about encrypted SNI (ESNI), or encrypted ClientHello (ECH)?

No support for that (yet). From some skimming, ESNI should “just work”, with only a minor decryption operation in the SNI router.

ECH seems harder. It should still be doable, but the SNI router will need to do the full handshake, or close to it. And after taking its routing decision it needs to transfer the encryption state to the backend, along with the file descriptor.

This is not impossible, of course. It’s similar to how tarweb passes the TLS session keys to the kernel. But it likely does mean that the SNI router needs to have access to both the TLS session keys and maybe even the domain TLS private keys.

But that’s a problem for another day.