Dropping privileges
If you’re writing a tool that takes untrusted input, and you should treat almost all input as untrusted, then it’s a good idea to add a layer of defense against bugs in your code.
What good is a buffer overflow, if the process is fully sandboxed?
This applies to processes running as root and to those running as normal users, though there are some differences.
Standard POSIX
In POSIX you can only sandbox if you are root. The filesystem can be
hidden with chroot(), and you can then change user to be non-root
using setuid() and setgid().
There have been ways to break out of a chroot() jail, but if you
make sure to drop root privileges then chroot() is pretty effective
at preventing opening new files and running any new programs.
But which directory? Ideally you want it to be:
- read-only by the process (after dropping root)
- empty
- not shared by any other process that might write to it
The best way to ensure this is probably to create a temporary directory yourself, owned by root.
This is pretty tricky to do, though:
// Return 0 on success.
int do_chroot()
{
  const char* tmpdir = getenv("TMPDIR");
  if (tmpdir == NULL) {
    tmpdir = "/tmp";
  }
  char jail[PATH_MAX];
  const int n = snprintf(jail, sizeof(jail), "%s/jail-XXXXXX", tmpdir);
  if (n < 0 || (size_t)n >= sizeof(jail)) {
    // Reject truncation explicitly rather than relying on mkdtemp().
    fprintf(stderr, "snprintf(): TMPDIR too long\n");
    return 1;
  }
  if (mkdtemp(jail) == NULL) {
    perror("mkdtemp()");
    return 1;
  }
  if (chdir(jail)) {
    perror("chdir()");
    return 1;
  }
  // Caveat: Deleting the current working directory and then chrooting into it
  // may not be portable. If it's not, then skip the rmdir step and just leak
  // the directory.
  if (rmdir(jail)) {
    perror("rmdir()");
  }
  if (chroot(".")) {
    perror("chroot()");
    return 1;
  }
  return 0;
}
Ok, now the filesystem is gone. Well, do make sure that you don’t have
any file descriptors open to any directories, or openat() could be
used to open a file outside the chroot.
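One way to check for stray directory descriptors before calling chroot() is to walk the descriptor table and fstat() each entry. This is only a sketch, not part of the code above: count_dir_fds() is a hypothetical helper name, and the scan limit of 65536 is an arbitrary assumption for systems where the descriptor limit is huge or can't be determined.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: return the number of open file descriptors
// that refer to directories.
int count_dir_fds(void)
{
  // Assumption: descriptors above this arbitrary limit are not scanned.
  const long scan_limit = 65536;
  long max_fd = sysconf(_SC_OPEN_MAX);
  if (max_fd < 0 || max_fd > scan_limit) {
    max_fd = scan_limit;
  }
  int found = 0;
  for (int fd = 0; fd < max_fd; fd++) {
    struct stat st;
    if (fstat(fd, &st)) {
      continue;  // Not an open descriptor.
    }
    if (S_ISDIR(st.st_mode)) {
      fprintf(stderr, "fd %d refers to a directory\n", fd);
      found++;
    }
  }
  return found;
}
```

If this returns anything other than zero right before the chroot() call, some descriptor could still serve as a starting point for openat().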
The second POSIX step is to drop root privileges.
First we need to drop any group IDs (GID), since we can only do that while running as user ID (UID) root.
Then we drop UID from root.
// Return 0 on success.
int drop_uid(uid_t uid, gid_t gid)
{
  if (setgroups(0, NULL)) {
    perror("setgroups(0, NULL)");
    return 1;
  }
  if (setgid(gid)) {
    perror("setgid()");
    return 1;
  }
  if (setuid(uid)) {
    perror("setuid()");
    return 1;
  }
  return 0;
}
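A common extra precaution, not shown in the function above, is to confirm afterwards that root really can't be regained. A minimal sketch, assuming the target UID is non-zero; can_regain_root() is a hypothetical name:

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: return 1 if the process is still root or can
// become root again, 0 if the drop looks permanent.
int can_regain_root(void)
{
  if (getuid() == 0 || geteuid() == 0) {
    return 1;
  }
  // After a successful setuid() from root, the real, effective, and saved
  // UIDs are all non-root, so asking for UID 0 again should fail.
  if (setuid(0) == 0) {
    fprintf(stderr, "setuid(0) unexpectedly succeeded\n");
    return 1;
  }
  return 0;
}
```

Calling this right after drop_uid() and aborting if it returns 1 turns a silent failure into a loud one.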
What UID to use? Well, first of all you need to resolve the user
before you do your chroot, since the mapping exists in /etc/passwd
and /etc/group. So that’s some boilerplate getpwnam().
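That boilerplate could look something like the sketch below. The helper name lookup_ids() and the convention that a NULL group means "use the user's primary group" are assumptions made for illustration.

```c
#include <errno.h>
#include <grp.h>
#include <pwd.h>
#include <stdio.h>
#include <sys/types.h>

// Hypothetical helper: resolve user and group names to numeric IDs.
// If group is NULL, the user's primary group is used.
// Return 0 on success.
int lookup_ids(const char* user, const char* group, uid_t* uid, gid_t* gid)
{
  errno = 0;
  const struct passwd* pw = getpwnam(user);
  if (pw == NULL) {
    if (errno) {
      perror("getpwnam()");
    } else {
      fprintf(stderr, "No such user: %s\n", user);
    }
    return 1;
  }
  *uid = pw->pw_uid;
  *gid = pw->pw_gid;
  if (group != NULL) {
    errno = 0;
    const struct group* gr = getgrnam(group);
    if (gr == NULL) {
      if (errno) {
        perror("getgrnam()");
      } else {
        fprintf(stderr, "No such group: %s\n", group);
      }
      return 1;
    }
    *gid = gr->gr_gid;
  }
  return 0;
}
```

The order of calls would then be lookup_ids(), do_chroot(), and finally drop_uid() with the IDs resolved up front.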
Ideally you should run every binary as a separate user, since
otherwise they can send signals (kill()) to each other. That’s not
always feasible though, so maybe the least bad option is
nobody/nogroup.
Except if you need to access the network, then on Android you need to
use the group inet.
Well, that was easy, right?
OpenBSD
OpenBSD has pledge() and unveil(), which work even for unprivileged
users. So if running as root you should first drop your root
privileges, and then call pledge() and unveil() to only list
what’s needed from here.
// Return 0 on success
int openbsd_drop()
{
  if (unveil("/", "")) {
    perror("unveil()");
    return 1;
  }
  if (pledge("stdio", "")) {
    perror("pledge()");
    return 1;
  }
  return 0;
}
For belts and suspenders you can chroot, drop root privs, unveil, and then pledge. In that order. For non-root processes just unveil and pledge.
OpenBSD clearly has the best solution, here. You can even skip
unveil() in this case, since pledge("stdio", "") doesn’t allow
opening any new files.
A single call sandboxes a very common post-init setup. Nice.
Linux
As I said in a previous post, seccomp is basically
unusable.
But there are two other ways to restrict a process’s ability to affect the outside world:
- Capabilities
- unshare()
(and a minor one: prctl(PR_SET_NO_NEW_PRIVS), that effectively turns
off suid and sgid bits on binaries)
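Setting that flag is a one-liner. A minimal sketch, with set_no_new_privs() being a hypothetical wrapper name:

```c
#include <stdio.h>
#include <sys/prctl.h>

// Return 0 on success.
int set_no_new_privs(void)
{
  // Once set, the flag can't be cleared, and it is inherited by children.
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
    perror("prctl(PR_SET_NO_NEW_PRIVS)");
    return 1;
  }
  return 0;
}
```

The current value can be read back with prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0), which returns 1 once the flag is set.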
Linux — Capabilities
Capabilities are a fine grained way to give “root powers” to processes.
A process running as user root can have all its special powers taken
away, by stripping its capabilities. But that still leaves it as a
normal user, and a normal user that owns some key files, like
/etc/shadow and… uh… the root directory. So basically the whole
filesystem.
So the root user is powerful even without any capabilities.
This is why it’s not enough to merely drop all capabilities.
Capabilities can also be granted to non-root processes. E.g. you can
remove the suid bit and grant /bin/ping just the CAP_NET_RAW
capability instead. In fact that’s the case on my system:
$ getcap /bin/ping
/bin/ping cap_net_raw=ep
A full compromise of ping can only lead to sniffing my traffic. Bad,
but full root access is worse.
If you can avoid having your tool run as root in the first place,
that’s strictly better. But still don’t forget to drop that capability
as soon as you no longer need it. ping does this:
$ sudo strace -esocket,capset ping 8.8.8.8
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=1<<CAP_NET_RAW, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
socket(AF_INET, SOCK_RAW, IPPROTO_ICMP) = 3
socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6) = 4
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=120 time=9.48 ms
Uhm… hmm… actually no that’s not right. It drops the capabilities from the effective set, but surely it should drop them from the permitted set too?
Gah, it does for IPv6, but not IPv4:
$ sudo strace -esocket,capset ping -6 ns1.google.com
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=1<<CAP_NET_ADMIN|1<<CAP_NET_RAW, inheritable=0}) = 0
socket(AF_INET6, SOCK_RAW, IPPROTO_ICMPV6) = 3
PING ns1.google.com(ns1.google.com (2001:4860:4802:32::a)) 56 data bytes
capset({version=_LINUX_CAPABILITY_VERSION_3, pid=0}, {effective=0, permitted=0, inheritable=0}) = 0
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 4
64 bytes from ns1.google.com (2001:4860:4802:32::a): icmp_seq=1 ttl=110 time=14.2 ms
Looks like it’s been broken since 2012. I’ve sent a pull request.
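The reason the permitted set matters is that anything left in it can be raised back into the effective set with another capset() call, including by injected code. A sketch for inspecting both sets from inside a process, using the raw capget() syscall so that no extra library is needed; read_caps() is a hypothetical helper name:

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: read the calling thread's effective and permitted
// capability sets as 64-bit masks. Return 0 on success.
int read_caps(uint64_t* effective, uint64_t* permitted)
{
  struct __user_cap_header_struct hdr;
  struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
  memset(&hdr, 0, sizeof(hdr));
  memset(data, 0, sizeof(data));
  hdr.version = _LINUX_CAPABILITY_VERSION_3;
  hdr.pid = 0;  // 0 means the calling thread.
  if (syscall(SYS_capget, &hdr, data)) {
    perror("capget()");
    return 1;
  }
  // Version 3 splits each 64-bit set into two 32-bit halves.
  *effective = ((uint64_t)data[1].effective << 32) | data[0].effective;
  *permitted = ((uint64_t)data[1].permitted << 32) | data[0].permitted;
  return 0;
}
```

Any bit that is set in the permitted mask but clear in the effective mask is a capability that is merely switched off, not gone.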
I’m actually not convinced it’s a good idea to replace root with just
CAP_NET_RAW in this case. If there’s a security hole in ping then
my normal user gets compromised. If ping also had the CAP_SETUID
capability then it could limit its blast radius to the nobody user
instead.
Compromising my user account is worse than sniffing my traffic, since basically all traffic is encrypted nowadays.
As it is now a bug in ping can lead to a complete account takeover,
and system ownage.
Same thing with AF_ICMP.
I’ve filed a bug requesting some thoughts on this.
Linux — unshare()
unshare() creates a new universe that can never be joined back to
the old one. Instead of dropping root privileges, you can create a new
namespace where even root can’t affect anything important. And then
you can drop privileges inside even that universe.
It’s a bit tricky to use, though. And there are some gotchas. Yes,
trickier than chroot() + setuid().
E.g. if you create a new namespace as root then you may be under the misapprehension that you have no way to touch the outside world, but that’s not the case.
real$ sudo unshare --user
new$ id
uid=65534(nobody) gid=65534(nogroup) groups=65534(nogroup)
new$ touch /only-root-can-break-your-heart && echo success
success
new$ ls -l /only-root-can-break-your-heart
-rw-r----- 1 nobody nogroup 0 Mar 11 15:49 /only-root-can-break-your-heart
new$ exit
real$ ls -l /only-root-can-break-your-heart
-rw-r----- 1 root root 0 Mar 11 15:49 /only-root-can-break-your-heart
So you need to drop privileges before creating the new namespace, too.
Linux — Combining them all
Let’s say you start off with root, and you want to:
- chroot
- drop capabilities
- unshare
- change uid
The restrictions are:
- unshare(CLONE_NEWUSER) before chroot, because CLONE_NEWUSER is not allowed in a chrooted environment.
- chroot() before drop_uid(), because you can’t chroot() as non-root.
- drop_uid() before unshare(CLONE_NEWUSER), because the new user namespace still maps back to the real root user.
Oh… and now it looks like we have a circular dependency.
But not really. What you can do is run chroot() after
unshare(CLONE_NEWUSER), because while you aren’t real root, you have
all the capabilities inside your new domain:
$ capsh --decode=$(unshare --user --map-root-user awk '/^CapEff/ {print $2}' /proc/self/status)
0x000001ffffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read,cap_perfmon,cap_bpf,cap_checkpoint_restore
--map-root-user can be a bit confusing here. If your program
doesn’t do the extra work this option does then you may be fooled into
thinking all the capabilities are lost by default:
$ capsh --decode=$(unshare --user awk '/^CapEff/ {print $2}' /proc/self/status)
0x0000000000000000=
but unfortunately that’s not the case. The capabilities are there
until you exec() something (in this case awk), because
capabilities are not by default part of the inheritable set here.
To illustrate:
$ cat not_gone.c
#define _GNU_SOURCE
#include <err.h>
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main()
{
  if (unshare(CLONE_NEWUSER)) {
    // err() appends strerror(errno) itself.
    err(EXIT_FAILURE, "unshare(CLONE_NEWUSER)");
  }
  printf("--- Actual permissions ---\n");
  FILE* f = fopen("/proc/self/status", "r");
  if (!f) {
    err(EXIT_FAILURE, "fopen(/proc/self/status)");
  }
  char* line = NULL;
  size_t len = 0;
  while (getline(&line, &len, f) != -1) {
    if (!strncmp(line, "Cap", 3)) {
      printf("%s", line);
    }
  }
  fclose(f);
  printf("--- Post-exec permissions ---\n");
  execlp("grep", "grep", "^Cap", "/proc/self/status", NULL);
  err(EXIT_FAILURE, "execlp");
}
$ ./not_gone
--- Actual permissions ---
CapInh: 0000000000000000
CapPrm: 000001ffffffffff
CapEff: 000001ffffffffff
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
--- Post-exec permissions ---
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
This is usually what a user wants, so it makes sense that unshare()
would work this way. This allows some setup before executing the
ultimate command, and optionally adding to the inherited set of
permissions.
In our case though we’re just dropping privileges, and we are the
ultimate command (there will be no exec()) so we need them gone
now, not merely prevent inheritance.
This is a second gotcha, because it’s easy to be fooled into thinking
all the capabilities are gone already, just because getuid() returns
nonzero.
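Getting rid of them for real means clearing the sets explicitly. A minimal sketch using the raw capset() syscall (libcap would also work); drop_all_caps() is a hypothetical name for what the outline further down calls drop_capabilities():

```c
#define _GNU_SOURCE
#include <linux/capability.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

// Clear the effective, permitted, and inheritable capability sets of the
// calling thread. Return 0 on success.
int drop_all_caps(void)
{
  struct __user_cap_header_struct hdr;
  struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];
  memset(&hdr, 0, sizeof(hdr));
  memset(data, 0, sizeof(data));  // All-zero data means empty sets.
  hdr.version = _LINUX_CAPABILITY_VERSION_3;
  hdr.pid = 0;  // 0 means the calling thread.
  if (syscall(SYS_capset, &hdr, data)) {
    perror("capset()");
    return 1;
  }
  return 0;
}
```

Note that capabilities are per thread, so this sketch assumes it's called before any other threads are started.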
Other namespaces
Once all needed network sockets have been opened we can drop support for creating new ones. And we can detach from other namespaces too.
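In practice that could mean adding more flags to the same unshare() call that creates the user namespace. A sketch, where the exact set of namespaces (mount, network, IPC, UTS) is my assumption about what a typical tool can live without, and unshare_namespaces() is a hypothetical name:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

// Detach from the user, mount, network, IPC, and UTS namespaces in one go.
// Must run before chroot(), and while the process is single-threaded.
// Return 0 on success.
int unshare_namespaces(void)
{
  // With CLONE_NEWUSER in the same call the kernel creates the user
  // namespace first, which is what lets a non-root caller create the rest.
  const int flags = CLONE_NEWUSER | CLONE_NEWNS | CLONE_NEWNET |
                    CLONE_NEWIPC | CLONE_NEWUTS;
  if (unshare(flags)) {
    perror("unshare()");
    return 1;
  }
  return 0;
}
```

Sockets that were opened before the call keep working, since they stay attached to the namespace they were created in.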
Making the chroot dir read only
There appear to be two ways to create a “safe” working directory.
As far as I can tell, once the directory we’re chrooted into has been
deleted, it’s impossible to create any new files or directories
in it. Any attempt fails with ENOENT.
High level
dir = mkdtemp()
chdir(dir)
rmdir(dir)
setuid_setgid(nobody, nogroup);  // This will fail if not root / suid, but that's fine.
drop_capabilities();      // Capabilities in parent namespace.
unshare(CLONE_NEWUSER|CLONE_NEWNS);
chroot(".");
drop_capabilities();      // Capabilities in new namespace.
The downside here is that if the chroot() fails inside the new
namespace, but would have succeeded if we’d just gone the POSIX way,
then it’s too late to go back and try again.
An alternative way
For the superuser case (where setuid() succeeds) the new root file
system is empty, deleted, and owned by a user other than the currently
running one.
But in the non-superuser case the new file system is all inside the normal user’s UID. User namespaces merely map their UIDs to real UIDs; they don’t create new ones. Maybe there’s then something they can do to create files in there, and possibly fill up the disk.
I don’t think so, but let’s explore another trick: A cloned mount namespace, with a read-only filesystem. Good luck creating files in a read-only file system.
Unfortunately while it’s possible to rmdir the current working
directory, it’s not possible to rmdir a directory that’s a mount
point. So here we’d leak the temporary directory.
Unless we don’t create one.
We can mount “over” an existing directory, and use that. Then we won’t leak anything. It can probably be any directory, except the root.
setuid_setgid(nobody, nogroup);  // This will fail if not root / suid, but that's fine.
drop_capabilities();      // Capabilities in parent namespace.
unshare(CLONE_NEWUSER|CLONE_NEWNS);
mount(tmpfs read only on /tmp);
chdir("/tmp");
chroot("/tmp");
drop_capabilities();      // Capabilities in new namespace.
I’ve implemented both, and they work. I’m undecided on which is best.
Actual code
It’s over 100 lines, so I’m stuffing it into a project on GitHub, but here’s how it looks when it’s running:
Inspecting it when running as root / suid:
$ ps auxww | grep drop
root      115398  0.6  0.0  17308  6748 pts/13   S+   16:13   0:00 sudo ./drop
nobody    115399  0.0  0.0   2480  1632 pts/13   S+   16:13   0:00 ./drop
$ grep Cap /proc/115399/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
$ ls -l /proc/115399/{cwd,root}
lrwxrwxrwx 1 root root 0 Mar 11 16:15 /proc/115399/cwd -> '/tmp/jail-W54mvW (deleted)'
lrwxrwxrwx 1 root root 0 Mar 11 16:15 /proc/115399/root -> '/tmp/jail-W54mvW (deleted)'
When running as normal user:
$ ps auxww | grep drop
thomas    115615  0.0  0.0   2480  1724 pts/13   S+   16:16   0:00 ./drop
$ grep Cap /proc/115615/status
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 000001ffffffffff
CapAmb: 0000000000000000
$ ls -l /proc/115615/{cwd,root}
lrwxrwxrwx 1 thomas thomas 0 Mar 11 16:17 /proc/115615/cwd -> '/tmp/jail-G3RXdm (deleted)'
lrwxrwxrwx 1 thomas thomas 0 Mar 11 16:17 /proc/115615/root -> '/tmp/jail-G3RXdm (deleted)'
You can also verify that the network namespaces are different:
$ ls -l /proc/{self,115615}/ns/net
lrwxrwxrwx 1 thomas thomas       0 Mar 11 16:17 /proc/115615/ns/net -> 'net:[4026532208]'
lrwxrwxrwx 1 root   root         0 Mar 11 16:17 /proc/self/ns/net -> 'net:[4026532008]'
This means that (aside from any already open sockets) this process cannot use the network. It doesn’t have any network interfaces. Not even loopback.
To be root or not to be root
With user and mount namespaces you’d think that it doesn’t matter if you’re root or not. You can drop privs equally well anyway.
But really what Linux needs is a setuid_ephemeral() callable by a
nonprivileged user that sets UID and GID to a one-time ephemeral
value. That way normal file system, semaphore, and signal management
would take care of access control. And all tooling can be isolated from each other.
setuid() to nobody/nogroup is better than nothing, but it would be
better if they could all be unique.
What attack surfaces are still exposed?
Lots still, probably. The process can still kill other processes running as the same user.
pledge() is just so much better than this.