I’ll just assert that there’s no way to use
correctly. Just like how there’s no way to use
correctly, causing it to eventually be removed from the C and C++ standards.
seccomp allows you to filter syscalls with a ruleset.
The obvious thing is to filter anything your program isn’t supposed to be doing. If it doesn’t do file IO, don’t let it open files. If it’s not supposed to execute anything, don’t let it do that.
But whether you use a whitelist (e.g. only allow working with already open file descriptors), or a blacklist (e.g. don’t allow it to open these files), it’s fundamentally flawed.
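To make the discussion concrete, here is a rough sketch of what an allowlist looks like when written as raw seccomp BPF. The helper names (`ALLOW`, `install_allowlist`, `open_errno_under_filter`) are made up for illustration, the three allowed syscalls are an arbitrary choice, and the architecture check that a real filter needs is left out to keep it short.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Two BPF instructions: if the loaded syscall number equals nr, allow it;
 * otherwise skip ahead to whatever rule comes next. */
#define ALLOW(nr) \
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (nr), 0, 1), \
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)

/* Installs the allowlist on the calling thread. Returns 0 on success.
 * ASSUMPTION: single architecture. A real filter must also check
 * seccomp_data.arch before trusting the syscall number. */
static int install_allowlist(void) {
    struct sock_filter rules[] = {
        /* Load the syscall number. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        ALLOW(SYS_read),
        ALLOW(SYS_write),
        ALLOW(SYS_exit_group),
        /* Everything else fails with EPERM instead of killing the process. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(rules) / sizeof(rules[0])),
        .filter = rules,
    };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Forks a child, installs the filter there, and reports the errno that
 * open() produced (0 if it succeeded, -1 if the setup itself failed). */
static int open_errno_under_filter(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (install_allowlist() != 0)
            _exit(255);
        int fd = open("/dev/null", O_RDONLY);
        _exit(fd < 0 ? errno : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 255 ? -1 : code;
}
```

Note that the filter never mentions `open()`. It only knows syscall numbers, so whatever number libc picks for `open()` simply falls through to the final EPERM rule.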
1. Syscalls change. Sometimes without even recompiling
open() in your code actually becomes the openat syscall. Maybe. At least today. At least on my machine, today.
select() actually becomes pselect6. At least on Fridays.
If you upgrade libc or distribute a binary to other systems, this may start to fail.
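One way to see how shaky the mapping is: the set of syscall numbers is not even the same across architectures. The sketch below (the `sysno_table` and `sysno_lookup` names are hypothetical) records which of the four syscalls mentioned above exist at compile time. On aarch64, for example, the legacy `open` and `select` syscalls are absent, so a filter written in terms of them would not even compile there.

```c
#include <stddef.h>
#include <string.h>
#include <sys/syscall.h>

/* nr is the syscall number, or -1 if this architecture does not define it. */
struct sysno {
    const char *name;
    long nr;
};

static const struct sysno sysno_table[] = {
#ifdef SYS_open
    {"open", SYS_open},
#else
    {"open", -1},
#endif
    {"openat", SYS_openat},
#ifdef SYS_select
    {"select", SYS_select},
#else
    {"select", -1},
#endif
#ifdef SYS_pselect6
    {"pselect6", SYS_pselect6},
#else
    {"pselect6", -1},
#endif
};

/* Returns the syscall number for name, or -1 if it is unknown to the
 * table or missing on this architecture. */
static long sysno_lookup(const char *name) {
    for (size_t i = 0; i < sizeof(sysno_table) / sizeof(sysno_table[0]); i++) {
        if (strcmp(sysno_table[i].name, name) == 0)
            return sysno_table[i].nr;
    }
    return -1;
}
```

This only shows what the kernel headers offer. Which of these a given libc version actually uses for `open()` or `select()` is a separate question that the headers cannot answer.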
2. Surprising syscalls
printf() will call the syscall
newfstatat, a syscall hard
to even parse into words. But only the first time you call it! So
after your first
printf() you can block
Maybe this will all work just fine, normally. But then an unrelated
bug happens, and your tool tries to log it, but can’t because
newfstatat is blocked. So you get no logs.
So it’s not just what you call. It also depends heavily on the order in which you call things relative to dropping privileges.
In my example it worked fine when I ran with verbose mode turned on,
but not with it off. That’s because in verbose mode I called
printf() before dropping privs.
3. (hinting at the solution): There’s no grouping
I would say that the most common thing everyone wants to do is this: After everything’s set up, don’t allow anything done by the process to interact with anything else, except via already open file descriptors.
That’s almost true. Getting the current time and allocating memory are probably also safe.
(But the original binary on/off
seccomp() blocked even those)
But there’s no way to express this. In order to actually interact with open network sockets in the most minimal of ways I’d need at least:
And that’s just for the most trivial of examples, where you have some unsafe code (e.g. a parser) that takes input on one fd and gives output on another. For example, if you implement an oracle that takes an X.509 certificate (famously tricky to parse) and a hostname, and returns whether it’s valid.
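The "input on one fd, output on another" shape can be sketched with the original strict mode, which permits only `read`, `write`, `_exit` and `sigreturn`. This is a toy under loud assumptions: `toy_verdict` is a hypothetical stand-in for the risky parser (it just checks the first byte), `ask_oracle` is a made-up name, and only short inputs are handled.

```c
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the unsafe parser: "valid" means the input
 * starts with 'A'. A real oracle would parse the certificate here. */
static char toy_verdict(const char *buf, ssize_t n) {
    return (n > 0 && buf[0] == 'A') ? 'Y' : 'N';
}

/* Sends input (at most 256 bytes are looked at) to a sandboxed child and
 * returns its one-byte verdict, or '?' if anything went wrong. */
static char ask_oracle(const char *input, size_t len) {
    int to_child[2], from_child[2];
    if (pipe(to_child) != 0)
        return '?';
    if (pipe(from_child) != 0) {
        close(to_child[0]);
        close(to_child[1]);
        return '?';
    }
    pid_t pid = fork();
    if (pid < 0)
        return '?';
    if (pid == 0) {
        close(to_child[1]);
        close(from_child[0]);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(1);
        /* From here on: read, write, exit and sigreturn only. */
        char buf[256];
        ssize_t n = read(to_child[0], buf, sizeof(buf));
        char verdict = toy_verdict(buf, n);
        ssize_t w = write(from_child[1], &verdict, 1);
        /* Raw exit: libc's _exit() uses exit_group, which strict mode kills. */
        syscall(SYS_exit, w == 1 ? 0 : 1);
    }
    close(to_child[0]);
    close(from_child[1]);
    char verdict = '?';
    if (write(to_child[1], input, len) == (ssize_t)len) {
        close(to_child[1]);
        if (read(from_child[0], &verdict, 1) != 1)
            verdict = '?';
    } else {
        close(to_child[1]);
    }
    close(from_child[0]);
    waitpid(pid, NULL, 0);
    return verdict;
}
```

Called as `ask_oracle("Alice", 5)`, the parent gets back `'Y'`. The catch is visible in the child: it cannot allocate memory through the kernel, ask for the time, or even exit the normal way, which is why strict mode is rarely usable for real parsers.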
And what’s worse: This is completely dynamic and depends on the architecture. It can change from execution to execution, or millisecond to millisecond. This is just not part of the ABI.
There’s nothing stopping libc from changing to implementing
as a special case of
select() could be implemented in
There are 300+ syscalls, and that number will likely grow. Do you know which ones are “just read or write from the sockets”?
So I don’t think the
seccomp(2) manpage is realistic when it says:
It is strongly recommended to use an allow-list approach whenever possible because such an approach is more robust and simple. A deny-list will have to be updated whenever a potentially dangerous system call is added
Good luck with that.
OpenBSD clearly got this right. Don’t list syscalls. Who cares if it’s
Go on, think about it. Even with full control of the process, what
could you possibly do after it runs
pledge("stdio", "")? Print
profanities to the user? Exit with the wrong exit code? Yeah, but
that’s about it.
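For comparison, the whole privilege drop on OpenBSD is roughly one call. The wrapper below is only a sketch: `drop_to_stdio` is a hypothetical name, and on anything other than OpenBSD it deliberately fails with ENOSYS rather than pretending to sandbox. Passing NULL as the second argument leaves the exec promises unchanged.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Restricts the process to the "stdio" promise set on OpenBSD.
 * Returns 0 on success, -1 with errno set on failure.
 * ASSUMPTION: there is no equivalent elsewhere, so other systems get ENOSYS. */
static int drop_to_stdio(void) {
#ifdef __OpenBSD__
    return pledge("stdio", NULL);
#else
    errno = ENOSYS;
    return -1;
#endif
}
```

There is no syscall number anywhere in this code, so a libc that switches from one syscall to another for the same job changes nothing for the caller.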
seccomp() allows more restrictions. In arping I blocked so that
it can only write to
stderr, not read. But so what? I
may have to eat my hat on this, but being able to read from stdout
doesn’t sound like it’ll cause a security problem.
unveil(), are clearly the right solution here.
But what about Linux?
Maybe one day Landlock will be the thing. But considering the previous nightmare, with many generations of Linux solutions getting it wrong, I’m not holding my breath.
For now I guess
unshare() is the way to go. But even that’s tricky
(and doesn’t block as much). I’m planning a follow-up post about how
to drop access to the outside world using available tools.
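As a rough idea of what the namespace route looks like (a sketch, not necessarily what the follow-up will recommend): a child moves itself into fresh user and network namespaces and then tries to connect to loopback. The function name is hypothetical. It uses the raw syscall only so the sketch does not depend on `_GNU_SOURCE` being defined before the first include; `unshare()` from `<sched.h>` is the usual spelling. Unprivileged user namespaces are assumed to be enabled, and many containers disable them.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <linux/sched.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the errno from connect() inside a new network namespace,
 * 0 if connect() unexpectedly succeeded, or -1 if the namespace (or the
 * socket) could not be created. */
static int connect_errno_after_unshare(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWNET) != 0)
            _exit(255);
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            _exit(254);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(9);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        /* The new namespace has only a loopback interface, and it is down. */
        int rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        _exit(rc == 0 ? 0 : errno);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code >= 254 ? -1 : code;
}
```

Where the namespaces can be created, the connect should fail (ENETUNREACH is the likely errno), since the child has no usable interface at all. File descriptors opened before the unshare keep working, which is the property the post is after, but the filesystem is still fully reachable, which matches the "doesn't block as much" caveat.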