Introduction#
eBPF is a special instruction set built for the Linux kernel. The kernel includes a VM to run the code and a JIT engine to turn it into native code for maximum performance. There are hook points all throughout the kernel to inject eBPF code into. One of the supported use cases is to write an LSM with eBPF.
However, unlike kernel modules, eBPF is sandboxed and has a verifier that ensures it never accesses memory it is not allowed to, that all pointers are checked, and that all loops are bounded so that the program will eventually return. Instead of solving the halting problem to do this, the verifier just statically evaluates all code paths.
The sandboxed environment has all sorts of helper functions to interact with the outside world.
For example, there is bpf_redirect available to networking eBPF programs, which can redirect a packet to an interface.
There are also Maps which are data structures that can be shared between different eBPF programs or with a userspace program. Some map types are hash maps, arrays, and ring buffers.
Sleepable eBPF Programs#
Most eBPF programs cannot sleep. Specifically, they are guaranteed to not switch between CPUs, and they will never be interrupted by the scheduler. This makes sense because traditionally eBPF programs are in the networking code paths in the kernel, which are designed for very high performance. The verifier enforces not sleeping by simply not giving helper functions which would require sleeping.
However, some eBPF programs can sleep if they are explicitly marked as such.
That means they have access to more helper functions, which could sleep.
Luckily many of the hooks for eBPF LSMs are marked with BPF_F_SLEEPABLE, meaning it is a sleepable context.
As far as I can tell this is entirely undocumented in the kernel.
There is the commit message which introduced it and an LWN article about it…and that’s all.
Only by reading the source code could I find a list of which LSM hooks are sleepable, for example.
But allowing sleeping in a kernel context doesn’t mean we can simply sleep(2000) to sleep for 2000 milliseconds.
It all depends on the helper functions which are allowed.
The first sleepable function added was bpf_copy_from_user, which allows the eBPF program to copy memory from userspace into the kernel where the eBPF program is running.
This might not sound useful, given that I literally do need to sleep and wait for a userspace daemon to respond. And there are many Stack Overflow answers saying that this isn’t possible. But like, I just don’t believe them.
Sleepable Sleepable eBPF Programs#
After scouring the eBPF docs for helper functions and kfuncs, I came across an interesting one. (By the way, a kfunc is basically just a helper function.)
/**
* bpf_get_file_xattr - get xattr of a file
* @file: file to get xattr from
* @name__str: name of the xattr
* @value_p: output buffer of the xattr value
*
* Get xattr *name__str* of *file* and store the output in *value_ptr*.
*
* For security reasons, only *name__str* with prefixes "user." or
* "security.bpf." are allowed.
*
* Return: length of the xattr value on success, a negative value on error.
*/
__bpf_kfunc int bpf_get_file_xattr(struct file *file, const char *name__str,
struct bpf_dynptr *value_p)
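To make the name restriction from that doc comment concrete, here is a small userspace C model of it. This is only a sketch of the documented rule, not the kernel's actual check, and the function name xattr_name_allowed is made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Userspace model of the rule quoted above: only names starting with
 * "user." or "security.bpf." are accepted. Hypothetical helper, not
 * kernel code.
 */
static bool xattr_name_allowed(const char *name)
{
    static const char user_prefix[] = "user.";
    static const char bpf_prefix[] = "security.bpf.";

    assert(name != NULL);
    /* sizeof - 1 drops the trailing NUL so only the prefix is compared */
    return strncmp(name, user_prefix, sizeof(user_prefix) - 1) == 0 ||
           strncmp(name, bpf_prefix, sizeof(bpf_prefix) - 1) == 0;
}
```

So a name like "user.sleep" passes, while something like "security.selinux" does not.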
By spelunking through the git blame we can find the commit which added this for some more context. In the commit message they mention this:
It is common practice for security solutions to store tags/labels in xattrs. To implement similar functionalities in BPF LSM, add new kfunc
bpf_get_file_xattr().
Indeed it makes sense. Imagine trying to implement an LSM like SELinux with eBPF LSM. Clearly, eBPF programs need to be able to read xattrs off the file. So why do I find this interesting?
Reading off a file, even if an xattr, clearly must sleep. I mean it could hit the disk if it is not in the cache. And then what about NFS or some other network filesystem? Now it has to wait for a network roundtrip before the helper can continue.
So it should be possible to wait for a pretty long time if we can try to read an xattr off of an NFS server on the other side of the world. Then we can just loop 100 times (loops must be bounded), to have a maximum timeout of like a minute.
Then I can poll userspace for a response through a regular eBPF map, sleeping for a little bit in between each poll.
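The control flow I have in mind looks roughly like the plain C model below. It is not eBPF: the ready callback stands in for a map lookup, the delay callback stands in for the slow xattr read, and fake_daemon with its two functions is a made-up stand-in for the userspace side so the loop can be exercised on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_POLLS 100 /* fixed bound, since the verifier wants bounded loops */

typedef bool (*ready_fn)(void *ctx); /* stands in for checking a map */
typedef void (*delay_fn)(void *ctx); /* stands in for the slow xattr read */

/* Returns how many delays happened before the answer arrived, or -1 on timeout. */
static int poll_for_response(ready_fn ready, delay_fn delay, void *ctx)
{
    assert(ready != NULL && delay != NULL);
    for (int i = 0; i < MAX_POLLS; i++) {
        if (ready(ctx))
            return i;
        delay(ctx);
    }
    return -1;
}

/* Hypothetical stand-in for the daemon: it answers after ready_after delays. */
struct fake_daemon {
    int delays;
    int ready_after;
};

static bool fake_ready(void *ctx)
{
    struct fake_daemon *d = ctx;
    return d->delays >= d->ready_after;
}

static void fake_delay(void *ctx)
{
    struct fake_daemon *d = ctx;
    d->delays++;
}
```

With 100 polls, the worst-case wait is 100 times however long one delay takes, which is where the "maximum timeout" comes from.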
So how do we get a struct file * in eBPF?
The verifier is very strict.
We can’t just pass NULL or try to construct one ourselves.
One way is by the file_open hook which we want to attach to anyway:
SEC("lsm.s/file_open")
int BPF_PROG(file_open, struct file *file, int ret) { ... }
So we could install this hook, open a file on the NFS, and take that struct file *file and stick it in a global variable.
That way we can reuse that for other hooks.
Unfortunately, the verifier ensures that it is a valid pointer to a file that currently exists.
We’ll need to find some other way.
Let’s take a look at bpf_get_task_exe_file:
/**
* bpf_get_task_exe_file - get a reference on the exe_file struct file member of
* the mm_struct that is nested within the supplied
* task_struct
* @task: task_struct of which the nested mm_struct exe_file member to get a
* reference on
*
* Get a reference on the exe_file struct file member field of the mm_struct
* nested within the supplied *task*. The referenced file pointer acquired by
* this BPF kfunc must be released using bpf_put_file(). Failing to call
* bpf_put_file() on the returned referenced struct file pointer that has been
* acquired by this BPF kfunc will result in the BPF program being rejected by
* the BPF verifier.
*
* This BPF kfunc may only be called from BPF LSM programs.
*
* Internally, this BPF kfunc leans on get_task_exe_file(), such that calling
* bpf_get_task_exe_file() would be analogous to calling get_task_exe_file()
* directly in kernel context.
*
* Return: A referenced struct file pointer to the exe_file member of the
* mm_struct that is nested within the supplied *task*. On error, NULL is
* returned.
*/
__bpf_kfunc struct file *bpf_get_task_exe_file(struct task_struct *task)
A task_struct is a user thread/process.
An exe_file is the ELF file of a task, exposed to userspace as ls -l /proc/<pid>/exe.
So this kfunc gets the ELF file which spawned a particular process.
But then this kfunc needs a struct task_struct *. To solve this let’s have a look at bpf_task_from_pid:
/**
* bpf_task_from_pid - Find a struct task_struct from its pid by looking it up
* in the root pid namespace idr. If a task is returned, it must either be
* stored in a map, or released with bpf_task_release().
* @pid: The pid of the task being looked up.
*/
__bpf_kfunc struct task_struct *bpf_task_from_pid(s32 pid)
This one takes in a PID, and gives a valid struct task_struct * to it in eBPF.
So here’s the plan:
- Write a program which just sleeps forever, our sleeper process
- Copy the sleeper to our NFS server: /some/nfs/directory/sleeper
- Write some xattr to it starting with user.: setfattr -n user.sleep -v "foo"
- Start the program and get its pid
- Pass this pid to our eBPF program
- Get a valid struct task_struct *task with bpf_task_from_pid(pid);
- Get a valid struct file *file with bpf_get_task_exe_file(task). This file is on NFS!
- Call bpf_get_file_xattr(file, "user.sleep", ...) and hope it takes a while
Let’s write a more complete eBPF program to do this (ignore all the dynptr plumbing):
s32 pid;

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 4096);
} ringbuf SEC(".maps");

SEC("lsm.s/socket_connect")
int BPF_PROG(restrict_connect, struct socket *sock, struct sockaddr *address, int addrlen, int ret) {
    if (ret != 0)
        return ret;
    // Only IPv4 in this example
    if (address->sa_family != AF_INET)
        return 0;
    bpf_printk("lsm.s/socket_connect time=%llu", bpf_ktime_get_boot_ns());
    struct task_struct *task = bpf_task_from_pid(pid);
    if (!task)
        return 0;
    struct file *file = bpf_get_task_exe_file(task);
    if (!file) {
        bpf_task_release(task);
        return 0;
    }
    struct bpf_dynptr dynp;
    bpf_ringbuf_reserve_dynptr(&ringbuf, 64, 0, &dynp);
    for (__u32 i = 0; i < 100; i++) {
        bpf_get_file_xattr(file, "user.sleep", &dynp);
    }
    bpf_ringbuf_discard_dynptr(&dynp, 0);
    bpf_put_file(file);
    bpf_task_release(task);
    bpf_printk("lsm.s/socket_connect time=%llu", bpf_ktime_get_boot_ns());
    return 0;
}
Then I run an NFS server, make a sleeper program, etc. and finally:
$ curl 1.1.1.1
[...]
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
curl-1029331 [006] ...11 399963.525705: bpf_trace_printk: lsm.s/socket_connect time=399962426773045
curl-1029331 [006] ...11 399965.875346: bpf_trace_printk: lsm.s/socket_connect time=399964776414568
That took 2.349642 seconds between traces! We slept in a sleepable eBPF program!
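For anyone who wants to check the arithmetic, the two time= values are nanoseconds since boot, so the gap is just a subtraction. A tiny C sketch (the function names are mine, purely for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Difference between two bpf_ktime_get_boot_ns() readings, in nanoseconds. */
static uint64_t elapsed_ns(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return end_ns - start_ns;
}

/* Convert nanoseconds to (fractional) seconds for printing. */
static double ns_to_seconds(uint64_t ns)
{
    return (double)ns / (double)NSEC_PER_SEC;
}
```

Feeding in the two values from the trace, elapsed_ns(399962426773045, 399964776414568) comes out to 2349641523 ns, which is the roughly 2.35 seconds quoted above.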
Arbitrary Userspace Helpers#
This might seem a little bit hacky. And indeed we can make it much more robust, with FUSE (Filesystem in Userspace).
Normally FUSE is used for things like mirroring Google Drive into a directory on your system. Or similarly, having an S3 backed directory. Clearly these filesystems should not be baked into the kernel directly, and it is cool that these custom filesystems are possible.
So why don’t I just write a custom FUSE filesystem, where all it does is respond to xattrs? This will replace my use of NFS, and will allow me to control how much I sleep for directly.
There is a good library for Go to help implement FUSE filesystems, so doing this is relatively easy:
func (f *ExecFile) Getxattr(ctx context.Context, attr string, dest []byte) (uint32, syscall.Errno) {
	slog.Info("Getxattr", "attr", attr)
	time.Sleep(5 * time.Second)
	return 0, syscall.ENODATA
}
Then after copying the sleeper binary over to somewhere in the running FUSE filesystem, I can run through the rest of the steps. Finally,
$ curl 1.1.1.1
[...]
$ sudo cat /sys/kernel/debug/tracing/trace_pipe
curl-1029206 [008] ...11 433208.365331: bpf_trace_printk: lsm.s/socket_connect time=433207215274204
curl-1029206 [008] ...11 433213.366560: bpf_trace_printk: lsm.s/socket_connect time=433212216491263
There were 5.001217 seconds between those traces!
Let’s level up some more.
I want to send data from eBPF land to userspace, wait for the userspace to do something with it, then send data back.
Sending data back is actually pretty trivial. We are reading xattrs, so we can just have the xattr value be the result. I discarded the value of the xattr before, but it is already in a dynptr. It gets a little more tricky to send data up.
bpf_get_file_xattr’s argument const char *name__str has to be known at compile time, and has to start with user. or security.bpf..
So we can’t just do user.sleep.<number of seconds> for example.
Instead, we can keep a ring of user.helper.0, user.helper.1, …, user.helper.31 to allow up to 32 concurrent executions.
In eBPF we keep an index which we atomically increment, and a giant switch statement to select the right user.helper.X string.
Then we can keep a BPF_MAP_TYPE_ARRAY of size 32 with an arbitrary struct.
Each user.helper.X has X be an index into this array.
We can read this map from userspace, to get our input data for our helper.
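Here is a userspace C model of that slot scheme, as a sketch under a few assumptions. It uses C11 atomics where a real eBPF program would use something like __sync_fetch_and_add on a global, and the names claim_slot, slot_xattr_name, and parse_slot are made up for illustration. The switch is generated with a macro so that every name is still a string literal known at compile time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HELPER_SLOTS 32u
#define HELPER_PREFIX "user.helper."

/* Power of two, so the rotation stays even when the counter wraps around. */
static_assert((HELPER_SLOTS & (HELPER_SLOTS - 1)) == 0, "slot count must be a power of two");

static atomic_uint next_ticket;

/* eBPF side: hand out slots 0..31 round-robin. */
static uint32_t claim_slot(void)
{
    return atomic_fetch_add(&next_ticket, 1u) % HELPER_SLOTS;
}

/* Expands to: case n: return "user.helper.n" (a compile-time constant). */
#define SLOT_CASE(n) case n: return HELPER_PREFIX #n

/* eBPF side: the "giant switch" that picks the xattr name for a slot. */
static const char *slot_xattr_name(uint32_t slot)
{
    switch (slot) {
    SLOT_CASE(0);  SLOT_CASE(1);  SLOT_CASE(2);  SLOT_CASE(3);
    SLOT_CASE(4);  SLOT_CASE(5);  SLOT_CASE(6);  SLOT_CASE(7);
    SLOT_CASE(8);  SLOT_CASE(9);  SLOT_CASE(10); SLOT_CASE(11);
    SLOT_CASE(12); SLOT_CASE(13); SLOT_CASE(14); SLOT_CASE(15);
    SLOT_CASE(16); SLOT_CASE(17); SLOT_CASE(18); SLOT_CASE(19);
    SLOT_CASE(20); SLOT_CASE(21); SLOT_CASE(22); SLOT_CASE(23);
    SLOT_CASE(24); SLOT_CASE(25); SLOT_CASE(26); SLOT_CASE(27);
    SLOT_CASE(28); SLOT_CASE(29); SLOT_CASE(30); SLOT_CASE(31);
    default: return NULL;
    }
}

/* Daemon side: map "user.helper.X" back to the array index X, or -1. */
static int parse_slot(const char *attr)
{
    size_t prefix_len = strlen(HELPER_PREFIX);

    if (attr == NULL || strncmp(attr, HELPER_PREFIX, prefix_len) != 0)
        return -1;
    const char *digits = attr + prefix_len;
    if (*digits < '0' || *digits > '9')
        return -1; /* empty or non-numeric suffix */
    if (digits[0] == '0' && digits[1] != '\0')
        return -1; /* reject leading zeros such as "07" */
    char *end;
    unsigned long value = strtoul(digits, &end, 10);
    if (*end != '\0' || value >= HELPER_SLOTS)
        return -1;
    return (int)value;
}
```

The daemon would call something like parse_slot on the attr it receives in Getxattr, then read that index out of the BPF_MAP_TYPE_ARRAY to get its input.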
This FUSE xattr RPC technique allows us to write arbitrary userspace helpers for our sleepable eBPF programs. You can send any data to a userspace daemon, and get any value back after waiting for any amount of time. You could sleep, read from a database, make a network request, or even ask an LLM if it is secure to perform the action (please don’t do this one).
The helpers which were used apply to BPF_PROG_TYPE_LSM, BPF_PROG_TYPE_PERF_EVENT, BPF_PROG_TYPE_TRACEPOINT, BPF_PROG_TYPE_TRACING, as long as they are sleepable.
Also, it only works on kernel version 6.12 (released on 2024-11-17) or greater.
Honestly this should probably be considered a kernel bug. It’s pretty easy to write an eBPF program with this technique which sleeps forever. You can imagine what permanently sleeping all processes which try to open a file will do to a system. Also, it’s really easy to recursively trigger a security hook while executing a userspace helper. Normally the verifier prevents this, but we are doing crazy stuff, so you have to be careful to exclude the daemon doing the work from the LSM.
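The guard against recursion can be as simple as comparing the caller's tgid against the daemon's before doing anything else. Below is a plain C model of that idea, not real eBPF: it assumes the 64-bit value has the layout of bpf_get_current_pid_tgid() (tgid in the upper 32 bits), and guarded_hook and deny_everything are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*helper_fn)(void); /* stands in for the slow userspace round trip */

/*
 * Model of an LSM hook body: let the helper daemon's own actions through
 * without calling back into the daemon, otherwise ask userspace.
 */
static int guarded_hook(uint64_t pid_tgid, uint32_t daemon_tgid, helper_fn ask_userspace)
{
    assert(ask_userspace != NULL);
    if ((uint32_t)(pid_tgid >> 32) == daemon_tgid)
        return 0; /* the daemon itself: allow, never recurse */
    return ask_userspace();
}

/* Hypothetical helper that denies every request, for demonstration only. */
static int deny_everything(void)
{
    return -EPERM;
}
```

Without the early return, the daemon handling a request could trigger the same hook again and wait on itself.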
Also, keep in mind that the performance is not great. It is best to keep as much as possible inside eBPF, and only reach out to a userspace helper for I/O tasks. For my use case, already approved paths can be determined in pure eBPF, so it should be as fast as any other LSM like AppArmor (which is to say, very little overhead). If I reach out to userspace, it is to block intentionally for a long time, so I don’t even really care about the performance of it.
Stacking LSMs#
Calling bpf_get_file_xattr actually triggers standard UNIX file permission (DAC) checks, and the security_file_permission MAC hook to make sure that the process can read from that file.
It’s trivial to allow all users to read from /path/to/fuse/sleeper, but it gets tricky when another LSM gets involved.
bpf_get_file_xattr runs within the context of the currently running process which triggered the security hook.
In other words, the eBPF program makes the original program look a bit like malware, trying to read a path it wouldn’t normally.
If SELinux or AppArmor is also installed on the system, its policies may disallow a binary from reading /path/to/fuse/sleeper.
For example, sudo is usually very strictly confined by default policy.
These LSMs don’t have a way to allow every program to read /path/to/fuse/sleeper.
For SELinux I can add context=system_u:object_r:tmp_t:s0 to the fusermount options to set the type of the FUSE filesystem to tmp_t, which is usually what temporary files are for.
That makes it work on Fedora anyways.
Any sane policy allows this by default.
I haven’t tried an AppArmor system for this, so it is possible I will have to mount the FUSE filesystem at /lib/cordon/fuse or similar.
Sleepable Hooks#
There is a list of eBPF program hooks, and whether they are sleepable or not in the documentation. LSM hooks are useful, because we can easily analyze and deny, but not all LSM hooks are sleepable. The list of sleepable LSM hooks is here. While most are sleepable, there are some notable exceptions.
For example, you can’t check when a process calls setuid, which you might reasonably want to know about.
For this one, there is a workaround though.
We can place a regular non-sleepable hook on fentry/__x64_sys_setuid, which is the function entry of the setuid syscall.
Internally, the setuid syscall triggers security_cred_prepare which is sleepable, and hookable in eBPF with lsm.s/cred_prepare.
Then we can correlate the process ID and thread ID between these calls to associate a particular invocation of the setuid syscall with somewhere where we can sleep.
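A sketch of that correlation in plain C, assuming the key has the same layout as bpf_get_current_pid_tgid() (tgid in the upper 32 bits, thread ID in the lower 32). In a real program the table would more likely be a BPF hash map keyed by that value; the fixed array and the function names here are made up to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_SLOTS 64u

struct pending_call {
    bool used;
    uint64_t pid_tgid;      /* key: (tgid << 32) | thread id */
    uint32_t requested_uid; /* what the fentry hook saw */
};

static struct pending_call pending[PENDING_SLOTS];

static uint64_t make_pid_tgid(uint32_t tgid, uint32_t tid)
{
    return ((uint64_t)tgid << 32) | tid;
}

/* fentry side: remember that this thread entered setuid(uid). */
static bool note_setuid_entry(uint64_t pid_tgid, uint32_t uid)
{
    struct pending_call *slot = NULL;

    for (size_t i = 0; i < PENDING_SLOTS; i++) {
        if (pending[i].used && pending[i].pid_tgid == pid_tgid) {
            slot = &pending[i]; /* overwrite a stale entry for this thread */
            break;
        }
        if (!pending[i].used && slot == NULL)
            slot = &pending[i]; /* first free slot */
    }
    if (slot == NULL)
        return false; /* table full */
    slot->used = true;
    slot->pid_tgid = pid_tgid;
    slot->requested_uid = uid;
    return true;
}

/* cred_prepare side: consume the entry for this thread, if there is one. */
static bool take_setuid_entry(uint64_t pid_tgid, uint32_t *uid_out)
{
    assert(uid_out != NULL);
    for (size_t i = 0; i < PENDING_SLOTS; i++) {
        if (pending[i].used && pending[i].pid_tgid == pid_tgid) {
            *uid_out = pending[i].requested_uid;
            pending[i].used = false;
            return true;
        }
    }
    return false;
}
```

If take_setuid_entry finds a match inside the sleepable hook, we know this cred_prepare call came from a setuid syscall on the same thread, and that is where the waiting can happen.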
Another example is that security_task_kill is a non-sleepable LSM hook.
This means that we can’t wait for user approval if one process tries to send a signal to another process (potentially killing it).
I have not found a similar workaround, but admittedly I have not looked much into it.
I do not particularly like hooking syscalls directly, because then it makes it more architecture dependent. But, if it is for only a few edge cases like this then the maintenance burden shouldn’t be too high.
Also, the astute reader may have noticed that fentry has a sleepable variant too, fentry.s.
In fact, there are even kselftests hooking syscalls with fentry.s eBPF.
So why don’t we just use that?
Well, it turns out to require a kernel debug feature, set by CONFIG_FUNCTION_ERROR_INJECTION=y.
I’ll let the commit disabling it by default do the talking:
error-injection: Add prompt for function error injection
The config to be able to inject error codes into any function annotated with ALLOW_ERROR_INJECTION() is enabled when FUNCTION_ERROR_INJECTION is enabled. But unfortunately, this is always enabled on x86 when KPROBES is enabled, and there’s no way to turn it off.
As kprobes is useful for observability of the kernel, it is useful to have it enabled in production environments. But error injection should be avoided. Add a prompt to the config to allow it to be disabled even when kprobes is enabled, and get rid of the “def_bool y”.
This is a kernel debug feature (it’s in Kconfig.debug), and should have never been something enabled by default.
The kselftests enable this kconfig option to pass those tests.
There are fentry.s, fexit.s, and fmod_ret.s, but I personally have never gotten any of them to work on my system.
Though maybe this is a skill issue.
If you know of any functions which can be hooked by any of these, please let me know.
I may develop a tool in the future which hooks all syscalls and sleepable LSM hooks, to make finding sleepable insertion points easier.
Use Cases#
Besides waiting for user input in an eBPF program which is my goal, there are a couple other use cases for this technique.
For example, more sophisticated EDRs for Linux could be developed where the eBPF LSM triggers a database lookup, a network lookup, or an ML model.
