pam-unshare: a PAM module that switches into a PID namespace

Posted on 15 April 2016 in Linux, Programming, PythonAnywhere

Today in my 10% time at PythonAnywhere (we're a bit less lax than Google) I wrote a PAM module that lets you configure a Linux system so that when someone sus, sudos, or sshes in, they are put into a private PID namespace. This means that they can't see anyone else's processes, either via ps or via /proc. It's definitely not production-ready, but any feedback on it would be very welcome.

In this blog post I explain why I wrote it, and how it all works, including some of the pitfalls of using PID namespaces like this and how I worked around them.

Why write it?

At PythonAnywhere we use a variety of tools to sandbox our users. To a certain extent, we've hand-rolled our own containerisation system using the amazing primitives provided by the Linux kernel.

One of the problems with our sandboxes right now is that they don't allow listing of processes using normal tools like ps. This is because, for security, we don't mount a /proc inside the filesystem visible from our users' code. The reason for that is that we don't want people to see each other's processes, because -- if you're careless -- there can be secret information on the command lines, and command lines are visible from /proc and thus from ps. Our one and only security incident so far came from an error in the system that handles this.

The right way to solve this kind of problem in Linux is to use a combination of PID namespaces and mount namespaces.

Namespaces

There are two kinds of namespaces we're interested in for this module:

PID namespaces

As the docs say, "PID namespaces isolate the process ID number space, meaning that processes in different PID namespaces can have the same PID." Allowing different processes to have the same PID isn't important to us for this -- but the isolation is what we want. We want the processes that a user uses when they log in to the system to be in a separate namespace to every other user's.

Mount namespaces

These were the first kind of namespaces to be introduced into Linux, so they're sometimes confusingly referred to simply as "namespaces". Again, going to the docs: "Mount namespaces isolate the set of filesystem mount points, meaning that processes in different mount namespaces can have different views of the filesystem hierarchy." This is useful because we want each of our process namespaces to have access to its own /proc. When you go into a process namespace, you may have a set of process IDs that are different to the external system. But if you have access to the external filesystem, then you can still see the /proc on the external filesystem -- so, ps ax will show you processes outside.

What we need is to get our processes into both a PID namespace and a mount namespace, then umount /proc so that we don't see the external filesystem's one, then mount it again so that we see the one appropriate to our PID namespace.

This is actually pretty simple to do from the command line, if you have a recent version of Linux with util-linux 2.23 or higher (for Ubuntu, that's Vivid or later -- or you can upgrade Trusty using this PPA from Ivan Larionov). If you're on a Linux command line (as root) and you have the right version, you can try it out:

# unshare --pid -- /bin/sh -c /bin/bash
# echo $$
1

The first command is a slightly complicated way of getting into a PID namespace -- unshare --pid on its own doesn't work, for reasons that are still hazy in my mind... Anyway, once that's done, we echo the PID of the current bash process, and we get 1 -- so we're definitely in our own process namespace. However, if you run ps ax you'll see all of the processes in the parent PID namespace, because (as I said before) the /proc that we see in our filesystem is the one associated with the parent. Naturally, we can't umount /proc because we'd be trying to umount the directory everyone else in the system is using -- the system would complain that it's busy. So the next thing is to switch into our own mount namespace, then umount our own private /proc, then mount a fresh one:

# unshare --mount
# umount /proc
# mount -t proc proc /proc
# ps ax
  PID TTY      STAT   TIME COMMAND
    1 pts/0    S      0:00 /bin/bash
   42 pts/0    S      0:00 -bash
   57 pts/0    R+     0:00 ps ax
# ls /proc
1          consoles   execdomains  ipmi       kpagecount     misc          schedstat  sys            version
42         cpuinfo    fb           irq        kpageflags     modules       scsi       sysrq-trigger  version_signature
58         crypto     filesystems  kallsyms   latency_stats  mounts        self       sysvipc        vmallocinfo
buddyinfo  devices    fs           kcore      loadavg        net           slabinfo   timer_list     vmstat
bus        diskstats  interrupts   keys       locks          pagetypeinfo  softirqs   timer_stats    xen
cgroups    dma        iomem        key-users  mdstat         partitions    stat       tty            zoneinfo
cmdline    driver     ioports      kmsg       meminfo        sched_debug   swaps      uptime

Awesome! We're in our own namespace.
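One way to double-check which namespaces a process is in, without relying on ps, is to read the symlinks under /proc/self/ns: two processes share a namespace exactly when the corresponding links read the same. Here's a minimal sketch in C. The helper name namespace_id is made up for illustration and isn't part of pam-unshare, and it assumes a kernel recent enough (3.8 or later) to expose /proc/self/ns/pid and /proc/self/ns/mnt.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Read the identity of one of the calling process's namespaces.
 * `kind` is the name of an entry under /proc/self/ns, e.g. "pid" or "mnt".
 * On success, writes a string such as "pid:[4026531836]" into buf and
 * returns 0; returns -1 on failure (for example, if /proc is not mounted).
 * namespace_id is a hypothetical helper, not something from pam-unshare.
 */
int namespace_id(const char *kind, char *buf, size_t len) {
    char path[64];
    ssize_t n;

    assert(kind != NULL && buf != NULL && len > 1);
    snprintf(path, sizeof(path), "/proc/self/ns/%s", kind);
    n = readlink(path, buf, len - 1);   /* readlink does not NUL-terminate */
    if (n < 0) {
        return -1;
    }
    buf[n] = '\0';
    return 0;
}
```

Calling namespace_id("pid", ...) from a process outside the unshared shell and again from one inside it should give two different strings, and the same goes for "mnt". It needs a mounted /proc, so it will fail in the window between the umount and the fresh mount.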

PAM

Now, if we had complete control over the code that runs whenever we want to switch into these namespaces, the above would be entirely sufficient. For example, on PythonAnywhere we have web-based consoles. When someone connects to one of those, we have complete control over the code that is executed before they can start typing. We could do the two unshare commands, then the /proc remount, then su to the appropriate user account, and then we'd be done.

But we don't always have control over this code path. For example, people can log in using ssh. And controlling what's done when someone does that is the domain of PAM.

PAM is Pluggable Authentication Modules. A program can link with PAM and hand over all of its authentication to it. For example, when you ssh in, the ssh daemon asks PAM to authenticate your credentials.

PAM itself delegates the authentication process to a set of modules that are implemented as shared libraries. For example, there's one to do normal Unix authentication using /etc/passwd or nsswitch -- but you could also have ones to do biometric authentication or whatever.

The directory /etc/pam.d contains configuration files saying which auth modules should be used for each PAM client app -- what to use to auth ssh, what to use to auth sudo, and so on, along with some common stuff for everything. The syntax is, frankly, vile, but it's just about understandable if you put your mind to it.

Anyway, what's all this got to do with our problem? Well, PAM has four kinds of plugins:

Authentication management modules, which handle checking people's credentials.
Account management modules, which can allow/disallow access even for people who'd be otherwise authorised, based on other factors (eg. time of day).
Authentication token management modules, which do things like allowing people to change their passwords.
Session management modules, which do session setup and teardown stuff. A standard module of this type is pam_env, which sets up environment variables.

The last kind of module is the place where we can hook in our code. There's already a pam-chroot, which is a session management module that puts the user into a chroot jail. So my goal with this module was essentially to write something like that, which did the same kind of thing but for process namespaces.

Implementation

Here's a minimal PAM session module that just prints stuff when people enter and leave a session (for example, when their su session starts, and when it ends):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define  PAM_SM_SESSION
#include <security/pam_modules.h>

PAM_EXTERN int pam_sm_open_session(pam_handle_t *pamh, int flags, int argc, const char **argv) {
    printf("pam_basic pam_sm_open_session\n");
    return PAM_SUCCESS;
}


PAM_EXTERN int pam_sm_close_session(pam_handle_t *pamh, int flags, int argc, const char **argv) {
    printf("pam_basic pam_sm_close_session\n");
    return PAM_SUCCESS;
}

Save it as pam_basic.c and you can compile it with this:

gcc -c -fPIC -fno-stack-protector -Wall pam_basic.c
ld --shared -o pam_basic.so pam_basic.o -lpam

...then install it like this:

sudo cp pam_basic.so /lib/security/

...and enable it by adding this line towards the end of /etc/pam.d/su (before the @includes):

session required      pam_basic.so

Then try suing to another user. You'll see the open_session and close_session messages as you enter and exit the sued environment.

Enter the namespaces

So, you'd think that getting this to work with PID namespaces would be really simple; just make the appropriate system calls in the pam_sm_open_session function to switch to a new PID namespace, then to a new mount namespace, then umount and then mount /proc, and you're all set. The system function to switch into a new namespace is even called unshare, just like the command-line tool.

But, of course, it's a little bit more complicated than that. It comes down to processes.

When you make the unshare system call to enter a PID namespace, your current process's PID namespace is unaffected. Instead, the new namespace is used for any child processes you create using (eg.) fork. When you spin off your first child process after calling unshare, then that process is the "owner" of the PID namespace -- kind of like init is for the machine as a whole.

By contrast, the unshare for mount namespaces switches you into a new namespace right away.
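The PID namespace half of this is easy to see in a small experiment. The sketch below is not part of the module. It needs root (without it, unshare fails and the function returns -1), and the function name is purely illustrative. It unshares the PID namespace in a throwaway helper process, checks that the helper's own PID is unchanged, and then asks the helper's first child what PID it sees for itself.

```c
#define _GNU_SOURCE             /* for unshare() and CLONE_NEWPID */
#include <assert.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns  1 if the caller of unshare kept its PID and its first child saw
 *            itself as PID 1,
 *          0 if either of those did not hold,
 *         -1 if the check could not run (typically: not root).
 * The unshare happens in a helper process so that the caller's own future
 * children are unaffected. Hypothetical function, for illustration only.
 */
int first_child_becomes_pid_1(void) {
    int status;
    pid_t helper = fork();

    if (helper == -1) {
        return -1;
    }
    if (helper == 0) {
        pid_t before = getpid();
        pid_t child;
        int child_status;

        if (unshare(CLONE_NEWPID) != 0) {
            _exit(2);                       /* not allowed to unshare */
        }
        if (getpid() != before) {
            _exit(1);                       /* the caller itself changed PID */
        }
        child = fork();                     /* "init" of the new namespace */
        if (child == -1) {
            _exit(2);
        }
        if (child == 0) {
            _exit(getpid() == 1 ? 0 : 1);   /* report the PID we see */
        }
        if (waitpid(child, &child_status, 0) != child || !WIFEXITED(child_status)) {
            _exit(2);
        }
        _exit(WEXITSTATUS(child_status));
    }

    assert(helper > 0);                     /* only the original process gets here */
    if (waitpid(helper, &status, 0) != helper || !WIFEXITED(status)) {
        return -1;
    }
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

Run as root, this would be expected to return 1: the process that called unshare carries on with its old PID, and only the child it forks afterwards lands in the new namespace.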

Now, when you're doing an su, your PAM module is executed in-process by su, before it spins off the child process that will handle the user-switched session. So you can do the two unshares in there, and you'll wind up with a child process that has its own mount and PID namespaces. But that will still have the external system's /proc mounted, so ps ax will still show all processes. No problem -- you can also umount /proc inside the PAM code. Now the user can't do ps at all.

But the re-mounting of /proc can't happen in the PAM process, because it's not in the new PID namespace. Remember, only its children will be. If we were to do the re-mount in the PAM process, we'd still get the /proc for the parent PID namespace.

So the trick is to do the re-mount in a child process. But the child process that's spun off by su is out of our control; it's a shell or whatever the user specified. Even worse, the child process will be run as the user we're suing to, and only root can mount /proc.

OK, you might think -- perhaps, after setting things up so that the su process, thanks to the PAM module, is in the right mount namespace, and its children will be in the right PID namespace, we could umount /proc, then spin off a short-lived child process to do the re-mount of /proc, then when it's exited, continue?

What happens when you do that is that the PID namespace dies when your short-lived child process exits. Remember, the first child process you create after doing the unshare to enter the PID namespace is the "init" equivalent. When it dies, the PID namespace dies with it (and the kernel kills all of its child processes). (BTW I think this is why, when you kill the process you've specified in a docker run command, all of its child processes die -- even if you've detached them.)
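This behaviour can be reproduced in isolation. The following is a sketch, not code from the module: it needs root (otherwise it returns -1), and the function name is invented. It relies on the kernel's documented behaviour that fork fails with ENOMEM once a PID namespace's first process has exited.

```c
#define _GNU_SOURCE             /* for unshare() and CLONE_NEWPID */
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns  1 if, after the new namespace's first process had exited, a
 *            second fork() failed with ENOMEM,
 *          0 if the second fork() behaved in any other way,
 *         -1 if the check could not run (typically: not root).
 * Hypothetical function, for illustration only.
 */
int namespace_dies_with_init(void) {
    int status;
    pid_t helper = fork();

    if (helper == -1) {
        return -1;
    }
    if (helper == 0) {
        pid_t init_pid, second;

        if (unshare(CLONE_NEWPID) != 0) {
            _exit(2);                   /* not allowed to unshare */
        }
        init_pid = fork();              /* becomes PID 1 in the new namespace */
        if (init_pid == -1) {
            _exit(2);
        }
        if (init_pid == 0) {
            _exit(0);                   /* short-lived "init": exits at once */
        }
        if (waitpid(init_pid, NULL, 0) != init_pid) {
            _exit(2);
        }
        second = fork();                /* the namespace has lost its init */
        if (second == 0) {
            _exit(0);                   /* fork unexpectedly worked; we are the child */
        }
        if (second > 0) {
            waitpid(second, NULL, 0);
            _exit(1);                   /* fork unexpectedly worked */
        }
        _exit(errno == ENOMEM ? 0 : 1);
    }

    assert(helper > 0);                 /* only the original process gets here */
    if (waitpid(helper, &status, 0) != helper || !WIFEXITED(status)) {
        return -1;
    }
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

In PAM terms, the helper plays the part of su, and the short-lived first child is the process that would have re-mounted /proc and exited. The fork that fails afterwards is the one that should have started the user's shell.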

My solution to this is a bit of a hack. I spin off a child process, which, being in a fresh PID namespace, will have PID 1. This is our parent process, our "init", and when it exits, the PID namespace will be shut down. But it's running as root, so it can mount /proc. We know that the next process to be started in the namespace will have the PID 2. So, the child process mounts /proc, then waits until it sees a process with PID 2 -- then it waits for that process to die:

        while (kill(2, 0) == -1 && errno == ESRCH) {
            // short-lived busy wait
        }
        // kill returns 0 while PID 2 exists; errno only matters after a failure
        while (kill(2, 0) == 0 || errno != ESRCH) {
            // long-lived, poll twice a second
            usleep(500000);
        }

(If you're wondering why I'm using kill(pid, 0) and polling, rather than waitpid for the process to die, it's because process 2 isn't a child of process 1, and you can only use waitpid with your own child processes.)
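The existence check can be wrapped up in a small helper, which also makes the errno handling easier to get right: errno is only meaningful when kill has actually failed, so it should only be consulted after a -1 return. This is a sketch rather than the code used in the module; process_exists and wait_until_gone are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Report whether a process with the given PID exists, using the "null
 * signal": kill(pid, 0) does the existence and permission checks but sends
 * nothing. Returns 1 if the process exists, 0 if it does not. A zombie that
 * has not been reaped yet still counts as existing.
 */
int process_exists(pid_t pid) {
    assert(pid > 0);        /* 0 and negative values address process groups */
    if (kill(pid, 0) == 0) {
        return 1;           /* exists, and we are allowed to signal it */
    }
    /* EPERM means the process exists but we may not signal it;
     * ESRCH means there is no such process. */
    return errno == ESRCH ? 0 : 1;
}

/* Poll until the process has gone, sleeping between checks. */
void wait_until_gone(pid_t pid, unsigned int poll_interval_us) {
    while (process_exists(pid)) {
        usleep(poll_interval_us);
    }
}
```

With these, the two waits become a busy loop on !process_exists(2) followed by wait_until_gone(2, 500000).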

This seems to work fine! Here's the complete source code of the current version, annotated. GitHub repo here.

#define _GNU_SOURCE

#include <syslog.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdarg.h>
#include <errno.h>
#include <unistd.h>
#include <signal.h>

#include <sched.h>

#include <sys/mount.h>

#define  PAM_SM_SESSION
#include <security/pam_modules.h>

The standard import-y stuff. The only points of note are the #define _GNU_SOURCE, which is needed to use the unshare function, and the #define PAM_SM_SESSION, which sets things up so that PAM knows we're writing a session management module.

static void _pam_log(int err, const char *format, ...) {
  va_list args;

  va_start(args, format);
  openlog("pam_unshare", LOG_PID, LOG_AUTHPRIV);
  vsyslog(err, format, args);
  va_end(args);
  closelog();
}

A nice wrapper around syslog, shamelessly stolen from pam-chroot.

PAM_EXTERN int pam_sm_open_session(pam_handle_t *pamh, int flags, int argc, const char **argv) {

So this is our entry point when a PAM session is started:

    const char *username;
    if (pam_get_user(pamh, &username, NULL) != PAM_SUCCESS) {
        _pam_log(LOG_ERR, "pam_unshare pam_sm_open_session: could not get username");
        return PAM_SESSION_ERR;
    }
    _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: start", username);

Get the username of the person we're suing to, or who we're sshing in as, or whatever. Useful for logging.

    _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: about to unshare", username);
    int unshare_err = unshare(CLONE_NEWPID | CLONE_NEWNS);
    if (unshare_err) {
        _pam_log(LOG_ERR, "pam_unshare pam_sm_open_session: %s: error unsharing: %s", username, strerror(errno));
        return PAM_SESSION_ERR;
    }
    _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: successfully unshared", username);

This does both of the unshares; the CLONE_NEWPID means that our child processes will be in their own PID namespace, and the CLONE_NEWNS puts the current process, and all of its future children, into a new mount namespace.

    if (access("/proc/cpuinfo", R_OK)) {
        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: no need to umount /proc", username);
    } else {

If we're already in a situation where we don't have /proc then we don't want to blow up when we try to umount it, so this is a simple guard against that...

        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: about to umount /proc", username);
        int umount_err = umount("/proc");
        if (umount_err) {
            _pam_log(LOG_ERR, "pam_unshare pam_sm_open_session: %s: error umounting /proc: %s", username, strerror(errno));
            return PAM_SESSION_ERR;
        }
        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: successfully umounted /proc", username);
    }

And here we do the umount if we need to.

    _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: about to kick off a subprocess", username);
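    int pid = fork();
    if (pid == -1) {
        _pam_log(LOG_ERR, "pam_unshare pam_sm_open_session: %s: error forking: %s", username, strerror(errno));
        return PAM_SESSION_ERR;
    }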

We've kicked off our subprocess:

    if (pid == 0) {

If we're in the new child process...

        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: in subprocess, about to mount /proc", username);
        if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NOEXEC|MS_NODEV, NULL)) {
            _pam_log(LOG_ERR, "pam_unshare pam_sm_open_session: %s: subprocess: error mounting /proc: %s", username, strerror(errno));
            exit(1);
        }
        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: in subprocess, successfully mounted /proc", username);

Do the mount.

        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: in subprocess, about to busy-wait for second child", username);
        while (kill(2, 0) == -1 && errno == ESRCH) {
        }
        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: in subprocess, second child has appeared, switching to slow-poll", username);

        // kill returns 0 while PID 2 exists; errno only matters after a failure
        while (kill(2, 0) == 0 || errno != ESRCH) {
            usleep(500000);
        }
        _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: in subprocess, done waiting, exiting", username);

        exit(0);
    }

Then do the wait for PID 2.

    _pam_log(LOG_DEBUG, "pam_unshare pam_sm_open_session: %s: done", username);
    return PAM_SUCCESS;
}

This is run if we're not in the child process -- just continue as normal.

PAM_EXTERN int pam_sm_close_session(pam_handle_t *pamh, int flags, int argc, const char **argv) {
    const char *username;
    if (pam_get_user(pamh, &username, NULL) != PAM_SUCCESS) {
        _pam_log(LOG_ERR, "pam_unshare pam_sm_close_session: could not get username");
        return PAM_SESSION_ERR;
    }
    _pam_log(LOG_DEBUG, "pam_unshare pam_sm_close_session: %s: start", username);
    _pam_log(LOG_DEBUG, "pam_unshare pam_sm_close_session: %s: done", username);
    return PAM_SUCCESS;
}

And that, of course, is just a dummy pam_sm_close_session, which needs to be there for completeness.

That's basically it.

What's next?

I'm pretty pleased with how this worked out (especially given that I didn't really understand PAM or namespaces when I started working on this stuff this morning). But it's not quite what we need. We already have some pretty powerful code that sets up sandboxed filesystems, and this wouldn't be compatible with the module as I've written it. Possibly we'll simply use the unsharing portion of this, and then use another mechanism to handle the remounting of /proc.

But I figured it might be worth putting this code out there, just in case anyone else is interested in how PAM and namespaces interact, and what some of the pitfalls -- and their workarounds -- are.

Comments welcome!

Acknowledgements

Many thanks to Ed Schmollinger for pam-chroot, which was the inspiration for all this, and to Jameson Little for simple-pam, which was simple enough that I had the confidence to start off coding a PAM module.