An Overview of ptrace

Interactive Security Sandbox - This article is part of a series.

Part 1: Supply Chain Security

Part 2: This Article

Part 3: Linux Security Modules (LSMs)

Part 4: Arbitrary Userspace Blocking eBPF

Part 5: Creating an Interactive Security Sandbox

Introduction

Regular binary executables do not have an extra I/O layer the way interpreters do. Instead, on Linux we typically rely on the ptrace syscall to introspect another arbitrary process. ptrace (short for “process trace”) is how GDB is implemented, for example. As one process (the tracer), we can ask the kernel to interrupt another process (the tracee) under certain circumstances, for example when the tracee reaches a particular instruction (a breakpoint) or when it executes a syscall. The tracer can also read and write the tracee’s memory and registers arbitrarily (you can see how a debugger would find this useful). Syscalls are the I/O layer of a Linux system, so they are where we’ll want to implement our access control. Basically, if a process we are interested in calls the open syscall on a file, we check whether we have authorized that path before; if not, we show a dialog asking the user whether they’re okay with the tracee performing that action. Repeat for a few other important syscalls (like connect and listen) and call it a day.
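The check-then-ask flow described above can be sketched as plain decision logic. This is only a minimal model: every name in it (SyscallEvent, Policy, fake_dialog) is hypothetical, and it leaves out the ptrace plumbing that would actually deliver syscall events and read their arguments.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class SyscallEvent:
    """Hypothetical summary of one intercepted syscall."""
    name: str    # e.g. "open", "connect", "listen"
    target: str  # e.g. a file path or "host:port"


@dataclass
class Policy:
    """Hypothetical policy: remember approvals, ask the user otherwise."""
    ask_user: Callable[[SyscallEvent], bool]
    watched: frozenset = frozenset({"open", "connect", "listen"})
    allowed: set = field(default_factory=set)

    def decide(self, event: SyscallEvent) -> bool:
        """Return True if the tracee may proceed with this syscall."""
        if event.name not in self.watched:
            return True               # not a syscall we police
        if event in self.allowed:
            return True               # authorized earlier
        if self.ask_user(event):      # stands in for the dialog
            self.allowed.add(event)
            return True
        return False


prompts = []


def fake_dialog(event: SyscallEvent) -> bool:
    """Stand-in for a real dialog: approve everything except one path."""
    prompts.append(event)
    return event.target != "/etc/shadow"


policy = Policy(ask_user=fake_dialog)
for ev in [SyscallEvent("open", "/tmp/a"),
           SyscallEvent("open", "/tmp/a"),      # second time: no prompt
           SyscallEvent("open", "/etc/shadow")]:
    print(ev.name, ev.target, "->", "allow" if policy.decide(ev) else "deny")
```

In a real tool, denying would mean making the tracee's syscall fail (or killing the tracee) rather than just returning False.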

If you have heard of ptrace before, you might have some reservations about this approach.

Performance

The typical fear with ptrace is bad performance. Normally a syscall incurs at least one context switch into kernel space, which is expensive. With ptrace this becomes much worse, because another userspace process now has to get involved in every syscall: the kernel stops the tracee, wakes the tracer, and resumes the tracee only once the tracer says so. For I/O-bound workloads this results in a severe slowdown.

A feature called seccomp-bpf lets us filter which syscalls we want to trap. We can ignore high-volume syscalls like write and focus on the ones we care about, like open, which typically runs once per file rather than on every write to that file. This alleviates the performance deficit somewhat.
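To make the filtering idea concrete, here is a minimal sketch of such a filter, expressed as classic BPF instructions and run through a tiny interpreter so it can be tried without touching the kernel. It assumes x86_64 syscall numbers and an illustrative watch list; it is not the filter strace generates, and the helper names (build_filter, pack, run) are made up for this example. Watched syscalls return SECCOMP_RET_TRACE, which notifies a tracer that has set PTRACE_O_TRACESECCOMP, and everything else returns SECCOMP_RET_ALLOW and never stops.

```python
import struct

# Classic BPF opcodes and seccomp return values from the Linux UAPI headers.
BPF_LD_W_ABS = 0x20   # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word
BPF_JEQ_K = 0x15      # BPF_JMP | BPF_JEQ | BPF_K: jump if equal to constant
BPF_RET_K = 0x06      # BPF_RET | BPF_K: return a constant
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_TRACE = 0x7FF00000
SECCOMP_RET_KILL_PROCESS = 0x80000000
AUDIT_ARCH_X86_64 = 0xC000003E

# Illustrative watch list with x86_64 syscall numbers.
WATCHED = {"open": 2, "connect": 42, "bind": 49, "execve": 59, "openat": 257}


def build_filter(watched):
    """Return a list of (code, jt, jf, k) classic BPF instructions."""
    nrs = sorted(watched.values())
    prog = [
        (BPF_LD_W_ABS, 0, 0, 4),               # load seccomp_data.arch
        (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),  # expected arch: skip the kill
        (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        (BPF_LD_W_ABS, 0, 0, 0),               # load seccomp_data.nr
    ]
    for i, nr in enumerate(nrs):
        # On a match, jump past the remaining comparisons and the ALLOW.
        prog.append((BPF_JEQ_K, len(nrs) - i, 0, nr))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    prog.append((BPF_RET_K, 0, 0, SECCOMP_RET_TRACE))
    return prog


def pack(prog):
    """Encode as an array of struct sock_filter (u16 code, u8 jt, u8 jf, u32 k)."""
    return b"".join(struct.pack("=HBBI", *ins) for ins in prog)


def run(prog, nr, arch=AUDIT_ARCH_X86_64):
    """Tiny interpreter covering only the three opcodes used above."""
    data = struct.pack("=iI", nr, arch)  # first two fields of seccomp_data
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


prog = build_filter(WATCHED)
print("openat ->", hex(run(prog, 257)))  # watched: handed to the tracer
print("write  ->", hex(run(prog, 1)))    # not watched: runs at full speed
```

The bytes returned by pack(prog) are what would be handed to the kernel in a struct sock_fprog; installing the filter is deliberately left out of this sketch.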

To illustrate, here is a non-scientific, ad hoc “benchmark” of compiling the Linux kernel on my laptop. This represents a real-world workload for developers, and it is also somewhat I/O bound. I use strace, a well-known program that simply prints the syscalls the tracee makes, and discard its output to reduce the performance impact of printing so much to my terminal emulator. First I run a control with no tracer attached. Then I run strace normally, so that it looks at all syscalls. Finally I run strace looking only at the syscalls I know a security sandbox should inspect.

$ cd linux
$ make defconfig
# control
$ time make -j$(nproc) > /dev/null
real    5m54.015s
user    75m7.188s
sys     5m55.926s

$ make clean
# strace all syscalls
$ time strace -f -o /dev/null make -j$(nproc) > /dev/null
real    8m34.852s
user    74m16.341s
sys     14m39.447s

$ make clean
# optimize to filter for some syscalls
$ time strace -f -eopen,openat,openat2,stat,mmap,mprotect,ioctl,mremap,connect,bind,execve,rename,mkdir,rmdir,creat,link,unlink,symlink,readlink,chmod,chown,mknod,statfs,prctl,mount,umount2,reboot,sethostname,setdomainname,setxattr,lsetxattr,getxattr,lgetxattr,listxattr,llistxattr,removexattr,lremovexattr --seccomp-bpf -o /dev/null make -j$(nproc) > /dev/null
real    6m37.486s
user    70m52.145s
sys     10m37.741s

Instructing strace to use seccomp-bpf and tracing only the syscalls we care about improves on strace-ing all syscalls (roughly 12% slower than the control in wall-clock time, versus roughly 45%), but it is still noticeably slower than the control.
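For reference, the relative slowdown can be worked out from the wall-clock ("real") times in the transcript above. This is just arithmetic on those three numbers; it is a single run on one machine, so treat the percentages as rough.

```python
def seconds(minutes, secs):
    """Convert the m/s format printed by `time` into seconds."""
    return minutes * 60 + secs


# Wall-clock ("real") times copied from the transcript above.
runs = {
    "control": seconds(5, 54.015),
    "strace, all syscalls": seconds(8, 34.852),
    "strace, filtered with seccomp-bpf": seconds(6, 37.486),
}

baseline = runs["control"]
overhead = {name: (t / baseline - 1) * 100 for name, t in runs.items()}

for name, pct in overhead.items():
    print(f"{name}: {runs[name]:.1f}s ({pct:+.1f}% vs control)")
```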

Security

Building a secure sandbox on ptrace is actually possible, but you do have to be careful. Fortunately there is some prior art here in mbox, a project and research paper by Taesoo Kim and Nickolai Zeldovich from MIT CSAIL in 2013. In it, they describe a technique to safely apply policy on syscalls:

Using ptrace to intercept system call entry allows us to examine, sanitize, and rewrite the system call’s arguments. If an argument points to process memory, we can read remote memory and interpret it as the system call handler does. However, the read value can be different from what the system call handler will see in the kernel. For example, an adversary’s thread can overwrite the memory that the current argument points to, right after the tracer checks the argument.
[…]
MBOX avoids TOCTTOU problems by mapping a page of read-only memory in the tracee process. When MBOX needs to examine, sanitize, or rewrite an in-memory data structure, such as a path name, used as a system call argument, MBOX copies the data structure to the read-only memory (using PTRACE_POKEDATA or the more efficient process_vm_writev()), and changes the system call argument pointer to point to this copy. For example, at the entry of an open(path, O_WRONLY) system call, the tracer first gets the system call’s arguments, rewrites the path argument to point to the read-only memory, and updates the read-only memory with a new path pointing to the sandbox filesystem. Since no other threads can overwrite the read-only memory without invoking a system call (e.g., mprotect()), MBOX avoids TOCTTOU problem when rewriting path arguments. To ensure that the sandboxed process cannot change this read-only virtual memory mapping (e.g., using mprotect(), mmap(), or mremap()), MBOX intercepts these system call and kills the process if it detects an attempt to modify MBOX’s special read-only page.

This is also why I included mprotect and friends in my seccomp-bpf benchmark above.
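The race and the fix described in the quoted passage can be modeled with a toy simulation. This makes no real ptrace calls: tracee memory is a dict, the "racing thread" is a callback that runs between the check and the use, and all names (ToyTracee, naive_check, copy_then_check) are hypothetical.

```python
# Toy model only: memory is a dict of address -> bytes, and every name
# here is hypothetical. No real ptrace calls are made.
ALLOWED_PATHS = {b"/tmp/build.log"}


class ToyTracee:
    def __init__(self):
        self.memory = {0x1000: b"/tmp/build.log"}
        self.read_only = set()  # addresses the tracee itself cannot write

    def write(self, addr, data):
        """A write issued by one of the tracee's own threads."""
        if addr in self.read_only:
            raise PermissionError(hex(addr))
        self.memory[addr] = data


def kernel_open(tracee, path_ptr):
    """Stand-in for the kernel reading the path when the syscall runs."""
    return tracee.memory[path_ptr]


def naive_check(tracee, path_ptr, racing_thread):
    """Check the argument in place, then let the syscall proceed."""
    if tracee.memory[path_ptr] not in ALLOWED_PATHS:
        return None      # denied
    racing_thread()      # adversary runs between check and use
    return kernel_open(tracee, path_ptr)


def copy_then_check(tracee, path_ptr, racing_thread, ro_page=0x7000):
    """Copy the argument to a read-only page and repoint the syscall at it."""
    # In the quoted design the tracer fills this page from outside the
    # tracee; here we simply write to the dict directly.
    tracee.memory[ro_page] = tracee.memory[path_ptr]
    tracee.read_only.add(ro_page)
    if tracee.memory[ro_page] not in ALLOWED_PATHS:
        return None
    racing_thread()
    return kernel_open(tracee, ro_page)  # argument now points at the copy


def attack(tracee):
    """Adversary thread: swap the path after the tracer has looked at it."""
    return lambda: tracee.write(0x1000, b"/etc/shadow")


victim = ToyTracee()
print("naive:", naive_check(victim, 0x1000, attack(victim)))
victim = ToyTracee()
print("copy: ", copy_then_check(victim, 0x1000, attack(victim)))
```

In the naive version the "kernel" ends up opening the path the adversary swapped in, while in the copy version it opens the path that was checked. The model also shows why the mapping itself has to be guarded: if the tracee could make the page writable again, the protection would disappear.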

So why not just use mbox? Well, as of writing this post it hasn’t been updated since 2014. It also doesn’t have an interactive mode for network access, and it allows all file reads (among other protections I would want to add). It was written as a fork of strace 4.7 and designed for Linux 3.8. It fulfilled its purpose as a proof of concept for a research paper, but in my opinion a rewrite using similar techniques would be the sanest approach.

Complexity

In my opinion the biggest reason not to work with ptrace is its sheer complexity. There are so many edge cases to worry about (as seen in the hoops MBOX had to jump through). Portability between architectures is also poor, because the set of syscalls and their numbers vary. Deep knowledge of every syscall (to determine its safety characteristics) on every supported architecture is required. New syscalls that may need to be restricted can be added at any time, so the tool needs to disable itself on kernels that are too new.
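As a small illustration of the portability problem, here are a few syscall numbers from the x86_64 and aarch64 kernel tables. The watch list and the helper name resolve_watchlist are hypothetical; the point is only that the same policy has to be resolved separately per architecture, and that some syscalls (such as open on aarch64) do not exist at all.

```python
# A handful of syscall numbers per architecture. aarch64 has no open
# syscall; programs there go through openat instead.
SYSCALL_NUMBERS = {
    "x86_64": {"open": 2, "openat": 257, "connect": 42, "bind": 49, "execve": 59},
    "aarch64": {"openat": 56, "connect": 203, "bind": 200, "execve": 221},
}


def resolve_watchlist(arch, names):
    """Map syscall names to numbers for one arch; report names it lacks."""
    table = SYSCALL_NUMBERS[arch]
    found = {name: table[name] for name in names if name in table}
    missing = [name for name in names if name not in table]
    return found, missing


watchlist = ["open", "openat", "connect", "execve"]
for arch in SYSCALL_NUMBERS:
    found, missing = resolve_watchlist(arch, watchlist)
    print(arch, found, "missing:", missing)
```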

There is an interesting project called Reverie, which provides a high-level syscall interception framework using ptrace for x86_64 and aarch64. However, it is very experimental and is designed for observability tools, not security tools. Reverie also disables ASLR in the tracee, which I do not want. I have hope for the future of this project, but currently it would not be a good fit for this use case.

Another wrench in the works is that a tracee can only have one tracer. Since I am building a developer tool, it is not inconceivable that someone may want to securely run their program and attach GDB (or another debugger) to it. Reverie actually runs a GDB server that you can attach to, so you can step through the tracee while it’s running under a Reverie-based tool. My solution would have to do that as well. Honestly, it might be best to contribute to upstream Reverie to get it where I would want it.

Ultimately, I don’t think ptrace is a suitable foundation for this interactive sandbox. I’ll need to look for another approach.
