If you’re writing a tool that takes untrusted input, and you should treat almost all input as untrusted, then it’s a good idea to add a layer of defense against bugs in your code.

What good is a buffer overflow, if the process is fully sandboxed?

This applies both to processes running as root and to processes running as normal users, though there are some differences.

Standard POSIX

In POSIX you can only sandbox if you are root. The filesystem can be hidden with chroot(), and you can then change user to be non-root using setuid() and setgid().

There have been ways to break out of a chroot() jail, but if you make sure to drop root privileges then chroot() is pretty effective at preventing opening new files and running any new programs.

But which directory? Ideally you want it to be:

  • read-only by the process (after dropping root)
  • empty
  • not shared by any other process that might write to it

The best way to ensure this is probably to create a temporary directory yourself, owned by root.

This is pretty tricky to do, though:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

// Return 0 on success.
int do_chroot()
{
  const char* tmpdir = getenv("TMPDIR");
  if (tmpdir == NULL) {
    tmpdir = "/tmp";
  }
  char jail[PATH_MAX];
  if (0 > snprintf(jail, PATH_MAX, "%s/jail-XXXXXX", tmpdir)) {
    // Only encoding errors are caught here.
    // If truncated then mkdtemp() will complain.
    return 1;
  }
  if (mkdtemp(jail) == NULL) {
    return 1;
  }
  if (chdir(jail)) {
    return 1;
  }
  // Caveat: Deleting the current working directory and then chrooting into it
  // may not be portable. If it's not, then skip the rmdir step and just leak
  // the directory.
  if (rmdir(jail)) {
    return 1;
  }
  if (chroot(".")) {
    return 1;
  }
  return 0;
}

Ok, now the filesystem is gone. Well, do make sure that you don’t have any file descriptors open to any directories, or openat() could be used to open a file outside the chroot.

The second POSIX step is to drop root privileges.

First we need to drop any group IDs (GID), since we can only do that while running as user ID (UID) root.

Then we drop UID from root.

#include <grp.h>
#include <stdio.h>
#include <unistd.h>

// Return 0 on success.
int drop_uid(uid_t uid, gid_t gid)
{
  if (setgroups(0, NULL)) {
    perror("setgroups(0, NULL)");
    return 1;
  }
  if (setgid(gid)) {
    return 1;
  }
  if (setuid(uid)) {
    return 1;
  }
  return 0;
}
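A common extra precaution after drop_uid() is to check that the drop cannot be undone. The following is only a sketch, and can_regain_root() is a name I made up for it. It simply tries to become root again, which should fail once privileges are really gone.

```c
#include <unistd.h>

// Hypothetical check: returns 1 if the process can (still) become root,
// and 0 if not. Note that when it returns 1 the process really has
// switched back to UID 0, so the caller should treat that as fatal.
int can_regain_root(void)
{
  return setuid(0) == 0 ? 1 : 0;
}
```

If this returns 1 right after drop_uid() reported success, something is wrong and the safest reaction is probably to abort rather than continue.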

What UID to use? Well, first of all you need to resolve the user before you do your chroot, since the mapping exists in /etc/passwd and /etc/group. So that’s some boilerplate getpwnam().
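That boilerplate could look something like this. It is a sketch, and lookup_user() is a hypothetical helper name. The important part is only that it runs before the chroot, while /etc/passwd is still reachable.

```c
#include <pwd.h>
#include <stddef.h>
#include <sys/types.h>

// Hypothetical helper: resolve a user name to its UID and primary GID.
// Call it before chroot(). Returns 0 on success.
int lookup_user(const char* name, uid_t* uid, gid_t* gid)
{
  const struct passwd* pw = getpwnam(name);
  if (pw == NULL) {
    return 1;  // Unknown user, or the lookup itself failed.
  }
  *uid = pw->pw_uid;
  *gid = pw->pw_gid;
  return 0;
}
```

The resulting pair can then be passed to drop_uid() once the chroot is done.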

Ideally you should run every binary as a separate user, since otherwise they can send signals (kill()) to each other. That’s not always feasible though, so maybe the least bad option is nobody/nogroup.

Except if you need to access the network, then on Android you need to use the group inet.

Well, that was easy, right?


OpenBSD

OpenBSD has pledge() and unveil(), which work even for unprivileged users. So if running as root you should first drop your root privileges, and then call pledge() and unveil() to list only what’s needed from then on.

#include <unistd.h>

// Return 0 on success.
int openbsd_drop()
{
  if (unveil("/", "")) {
    return 1;
  }
  if (pledge("stdio", "")) {
    return 1;
  }
  return 0;
}

For belts and suspenders you can chroot, drop root privs, unveil, and then pledge. In that order. For non-root processes just unveil and pledge.

OpenBSD clearly has the best solution here. You can even skip unveil() in this case, since pledge("stdio", "") doesn’t allow opening any new files.

One single call sandboxes a very common post-init setup. Nice.


Linux

As I said in a previous post, seccomp is basically unusable.

But there are two other ways to restrict a process’s ability to affect the outside world:

  1. Capabilities
  2. unshare()

(and a minor one: prctl(PR_SET_NO_NEW_PRIVS), which effectively turns off the setuid and setgid bits on binaries)
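For reference, the prctl() call is a one-liner and needs no privileges. Here is a small sketch with two hypothetical wrapper names, set_no_new_privs() and get_no_new_privs(). Once set, the flag cannot be cleared again, and it is inherited across fork() and exec().

```c
#include <sys/prctl.h>

// Hypothetical wrapper: set the no_new_privs flag. Returns 0 on success.
int set_no_new_privs(void)
{
  return prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ? 1 : 0;
}

// Hypothetical wrapper: returns 1 if the flag is set, 0 if it is not,
// and -1 on error.
int get_no_new_privs(void)
{
  return prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
}
```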

Linux — Capabilities

Capabilities are a fine grained way to give “root powers” to processes.

A process running as user root can have all its special powers taken away, by stripping its capabilities. But that still leaves it as a normal user, and a normal user that owns some key files, like /etc/shadow and… uh… the root directory. So basically the whole filesystem.

So the root user is powerful even without any capabilities.

This is why it’s not enough to merely drop all capabilities.

Capabilities can also be granted to non-root processes. E.g. you can remove the suid bit and grant /bin/ping just the CAP_NET_RAW capability instead. In fact that’s the case on my system:

$ getcap /bin/ping
/bin/ping cap_net_raw=ep

A full compromise of ping can only lead to sniffing my traffic. Bad, but full root access is worse.

If you can avoid having your tool run as root in the first place, that’s strictly better. But still don’t forget to drop that capability as soon as you no longer need it. ping does this:

$ sudo strace -esocket,capset ping
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=1<<CAP_NET_RAW, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=120 time=9.48 ms

Uhm… hmm… actually no that’s not right. It drops the capabilities from the effective set, but surely it should drop it from the permitted set too?

Gah, it does for IPv6, but not IPv4:

$ sudo strace -esocket,capset ping -6 ns1.google.com
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
PING ns1.google.com(ns1.google.com (2001:4860:4802:32::a)) 56 data bytes
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
64 bytes from ns1.google.com (2001:4860:4802:32::a): icmp_seq=1 ttl=110 time=14.2 ms

Looks like it’s been broken since 2012. I’ve sent a pull request.

I’m actually not convinced it’s a good idea to replace root with just CAP_NET_RAW in this case. If there’s a security hole in ping then my normal user gets compromised. If ping also had the CAP_SETUID capability then it could limit its blast radius to the nobody user instead.

Compromising my user account is worse than sniffing my traffic, since basically all traffic is encrypted nowadays.

As it is now a bug in ping can lead to a complete account takeover, and system ownage.

Same thing with unprivileged ICMP sockets (SOCK_DGRAM with IPPROTO_ICMP).

I’ve filed a bug requesting some thoughts on this.

Linux — unshare()

unshare() creates a new universe that can never be joined back to the old one. Instead of dropping root privileges, you can create a new namespace where even root can’t affect anything important. And then you can drop privileges inside even that universe.

It’s a bit tricky to use, though. And there are some gotchas. Yes, trickier than chroot() + setuid().

E.g. if you create a new namespace as root then you may be under the misapprehension that you have no way to touch the outside world, but that’s not the case.

real$ sudo unshare --user
new$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
new$ touch /only-root-can-break-your-heart && echo success
new$ ls -l /only-root-can-break-your-heart
-rw-r----- 1 nobody nogroup 0 Mar 11 15:49 /only-root-can-break-your-heart
new$ exit
real$ ls -l /only-root-can-break-your-heart
-rw-r----- 1 root root 0 Mar 11 15:49 /only-root-can-break-your-heart

So you need to drop privileges before creating the new namespace, too.

Linux — Combining them all

Let’s say you start off with root, and you want to:

  • chroot
  • drop capabilities
  • unshare
  • change uid

The restrictions are:

  • unshare(CLONE_NEWUSER) before chroot, because CLONE_NEWUSER is not allowed in a chrooted environment.
  • chroot() before drop_uid(), because you can’t chroot() as non-root
  • drop_uid() before unshare(CLONE_NEWUSER), because the new user namespace still maps back to the real root user.

Oh… and now it looks like we have a circular dependency.

But not really. What you can do is run chroot() after unshare(CLONE_NEWUSER), because while you aren’t real root, you have all the capabilities inside your new domain:

$ capsh --decode=$(unshare --user --map-root-user awk '/^CapEff/ {print $2}' /proc/self/status)

--map-root-user can be a bit confusing here. If your program doesn’t do the extra work this option does then you may be fooled into thinking all the capabilities are lost by default:

$ capsh --decode=$(unshare --user awk '/^CapEff/ {print $2}' /proc/self/status)

but unfortunately that’s not the case. The capabilities are there until you exec() something (in this case awk), because capabilities are not by default part of the inherited set here.

To illustrate:

$ cat not_gone.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main()
{
  if (unshare(CLONE_NEWUSER)) {
    // err() appends strerror(errno) by itself.
    err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
  }
  printf("--- Actual permissions ---\n");
  FILE* f = fopen("/proc/self/status", "r");
  if (!f) {
    err(EXIT_FAILURE, "fopen(/proc/self/status)");
  }
  char* line = NULL;
  size_t len = 0;
  while (getline(&line, &len, f) != -1) {
    if (!strncmp(line, "Cap", 3)) {
      printf("%s", line);
    }
  }
  free(line);
  fclose(f);
  printf("--- Post-exec permissions ---\n");
  // Flush before exec, or buffered output is lost if stdout is not a terminal.
  fflush(stdout);
  execlp("grep", "grep", "^Cap", "/proc/self/status", (char*)NULL);
  err(EXIT_FAILURE, "execlp");
}
$ ./not_gone
--- Actual permissions ---
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
--- Post-exec permissions ---
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000

This is usually what a user wants, so it makes sense that unshare() would work this way. This allows some setup before executing the ultimate command, and optionally adding to the inherited set of permissions.

In our case though we’re just dropping privileges, and we are the ultimate command (there will be no exec()) so we need them gone now, not merely prevent inheritance.

This is a second gotcha, because it’s easy to be fooled into thinking all the capabilities are gone already, just because getuid() returns nonzero.
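Getting rid of them explicitly is not much code. Below is a sketch that uses the raw capset() and capget() system calls, to avoid assuming that libcap is installed. The helper names drop_all_caps() and has_any_caps() are made up for this example. It clears the effective, permitted and inheritable sets of the calling thread, and leaves the bounding set alone.

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: clear the effective, permitted and inheritable
// capability sets of the calling thread. Returns 0 on success.
int drop_all_caps(void)
{
  struct __user_cap_header_struct hdr;
  struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
  memset(&hdr, 0, sizeof(hdr));
  memset(data, 0, sizeof(data));  // All zero means "no capabilities".
  hdr.version = _LINUX_CAPABILITY_VERSION_3;
  hdr.pid = 0;  // 0 means the calling thread.
  return syscall(SYS_capset, &hdr, data) ? 1 : 0;
}

// Hypothetical helper: returns 1 if any effective or permitted capability
// remains, 0 if there are none, and -1 on error.
int has_any_caps(void)
{
  struct __user_cap_header_struct hdr;
  struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
  memset(&hdr, 0, sizeof(hdr));
  memset(data, 0, sizeof(data));
  hdr.version = _LINUX_CAPABILITY_VERSION_3;
  if (syscall(SYS_capget, &hdr, data)) {
    return -1;
  }
  for (int i = 0; i < _LINUX_CAPABILITY_U32S_3; i++) {
    if (data[i].effective || data[i].permitted) {
      return 1;
    }
  }
  return 0;
}
```

Dropping is always allowed, so this works whether or not the process started as root. Checking with has_any_caps() afterwards is cheap insurance against the getuid() confusion described above.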

Other namespaces

Once all needed network sockets have been opened we can drop support for creating new ones. And we can detach from other namespaces too.

Making the chroot dir read only

There appear to be two ways to create a “safe” working directory.

As far as I can tell, once the directory we’re chrooted into has been deleted, it’s impossible to create any new files or directories in it. It’ll fail with ENOENT.

High level

dir = mkdtemp()
setuid_setgid(nobody, nogroup);  // This will fail if not root / suid, but that's fine.
drop_capabilities();      // Capabilities in parent namespace.
drop_capabilities();      // Capabilities in new namespace.

The downside here is that if the chroot() fails inside the new namespace, but would have succeeded if we’d just gone the POSIX way, then it’s too late to go back and try again.
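To make the namespace part of this flow concrete, here is a rough sketch, not the actual code from my project. It assumes Linux with unprivileged user namespaces enabled, hardcodes /tmp instead of honoring TMPDIR, and leaves out the setuid and capability dropping steps. The function name try_jail_in_child() and its step numbering are invented for the example. It runs everything in a forked child so that the caller can see which step failed and still fall back to something else.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: try the "new namespace, then chroot into a deleted
// directory" sequence in a child process. Returns 0 if every step worked,
// 1 to 5 for the step that failed, and -1 if fork() or waitpid() failed.
int try_jail_in_child(void)
{
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    char jail[] = "/tmp/jail-XXXXXX";  // Assumption: /tmp is usable.
    if (mkdtemp(jail) == NULL) {
      _exit(1);
    }
    // New user namespace (full capabilities inside it) plus an empty
    // network namespace, so no new sockets can reach the outside.
    if (unshare(CLONE_NEWUSER | CLONE_NEWNET)) {
      rmdir(jail);
      _exit(2);
    }
    if (chdir(jail)) {
      _exit(3);
    }
    if (rmdir(jail)) {
      _exit(4);
    }
    if (chroot(".")) {
      _exit(5);
    }
    // A real program would now drop the capabilities it holds in the new
    // namespace and carry on with its work here.
    _exit(0);
  }
  int status;
  if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
    return -1;
  }
  return WEXITSTATUS(status);
}
```

A return value of 2 typically means that the kernel or a container policy refuses unprivileged user namespaces. Probing in a child like this is one possible way around the "too late to go back" problem, at the cost of an extra fork().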

An alternative way

For the superuser case (where setuid() succeeds) the new root file system is empty, deleted, and owned by a user other than the currently running one.

But in the non-superuser case the new file system is all inside the normal user’s UID. User namespaces merely map their UIDs to real UIDs, they don’t create new ones. Maybe there’s then something they can do to create files in there, and possibly fill up the disk.

I don’t think so, but let’s explore another trick: A cloned mount namespace, with a read-only filesystem. Good luck creating files in a read-only file system.

Unfortunately while it’s possible to rmdir the current working directory, it’s not possible to rmdir a directory that’s a mount point. So here we’d leak the temporary directory.

Unless we don’t create one.

We can mount “over” an existing directory, and use that. Then we won’t leak anything. It can probably be any directory, except the root.

setuid_setgid(nobody, nogroup);  // This will fail if not root / suid, but that's fine.
drop_capabilities();      // Capabilities in parent namespace.
mount(tmpfs read only on /tmp);
drop_capabilities();      // Capabilities in new namespace.

I’ve implemented both, and they work. I’m undecided on which is best.
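The read-only mount variant can be sketched in the same style. Again this is an illustration under assumptions, not the code from my project: it assumes Linux with unprivileged user namespaces, uses /tmp as the directory to mount over, and skips the setuid and capability steps. try_readonly_jail_in_child() and its step numbers are made-up names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: in a child, mount a read-only tmpfs over /tmp
// inside new user and mount namespaces, and chroot into it. Returns 0 if
// every step worked, 1 to 5 for the step that failed, -1 on fork/wait
// failure.
int try_readonly_jail_in_child(void)
{
  pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    if (unshare(CLONE_NEWUSER | CLONE_NEWNS)) {
      _exit(1);
    }
    // Only visible in this mount namespace. Nothing is leaked on disk.
    if (mount("tmpfs", "/tmp", "tmpfs",
              MS_RDONLY | MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL)) {
      _exit(2);
    }
    if (chdir("/tmp")) {
      _exit(3);
    }
    if (chroot(".")) {
      _exit(4);
    }
    // Sanity check: creating a file should now fail with EROFS.
    int fd = open("/probe", O_CREAT | O_WRONLY, 0600);
    if (fd >= 0 || errno != EROFS) {
      _exit(5);
    }
    _exit(0);
  }
  int status;
  if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
    return -1;
  }
  return WEXITSTATUS(status);
}
```

The final open() is only there to demonstrate the point of the exercise, and would not be needed in a real program.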

Actual code

It’s over 100 lines, so I’m stuffing it into a project on GitHub, but here’s how it looks when it’s running:

Inspecting it when running as root / suid:

$ ps auxww | grep drop
root      115398  0.6  0.0  17308  6748 pts/13   S+   16:13   0:00 sudo ./drop
nobody    115399  0.0  0.0   2480  1632 pts/13   S+   16:13   0:00 ./drop
$ grep Cap /proc/115399/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
$ ls -l /proc/115399/{cwd,root}
lrwxrwxrwx 1 root root 0 Mar 11 16:15 /proc/115399/cwd -> '/tmp/jail-W54mvW (deleted)'
lrwxrwxrwx 1 root root 0 Mar 11 16:15 /proc/115399/root -> '/tmp/jail-W54mvW (deleted)'

When running as normal user:

$ ps auxww | grep drop
thomas    115615  0.0  0.0   2480  1724 pts/13   S+   16:16   0:00 ./drop
$ grep Cap /proc/115615/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
$ ls -l /proc/115615/{cwd,root}
lrwxrwxrwx 1 thomas thomas 0 Mar 11 16:17 /proc/115615/cwd -> '/tmp/jail-G3RXdm (deleted)'
lrwxrwxrwx 1 thomas thomas 0 Mar 11 16:17 /proc/115615/root -> '/tmp/jail-G3RXdm (deleted)'

You can also verify that the network namespaces are different:

$ ls -l /proc/{self,115615}/ns/net
lrwxrwxrwx 1 thomas thomas       0 Mar 11 16:17 /proc/115615/ns/net -> 'net:[4026532208]'
lrwxrwxrwx 1 root   root         0 Mar 11 16:17 /proc/self/ns/net -> 'net:[4026532008]'

This means that (aside from any already open sockets) this process cannot use the network. It doesn’t have any network interfaces. Not even loopback.

To be root or not to be root

With user and mount namespaces you’d think that it doesn’t matter if you’re root or not. You can drop privs equally well anyway.

But really what Linux needs is a setuid_ephemeral() callable by a nonprivileged user that sets UID and GID to a one-time ephemeral value. That way normal file system, semaphore, and signal management takes care of ACLs. And all tooling can be isolated from each other.

setuid() to nobody/nogroup is better than nothing, but it would be better if they could all be unique.

What attack surfaces are still exposed?

Lots still, probably. The process can still kill other processes running as the same user.

pledge() is just so much better than this.