Introduction
By default, a program has the same permissions as the user who ran it, and those permissions are coarse-grained: the traditional Unix model of users and groups, known as Discretionary Access Control (DAC). What if there were a way to fine-tune access?
The Linux kernel defines a set of hooks, known as Linux Security Module (LSM) hooks, which allow the implementation of Mandatory Access Control (MAC) systems. These hooks sit at the point where the kernel acts on an object rather than at the syscall boundary.
For example, instead of hooking on every syscall like open, openat, and openat2, all of these syscalls (and any other open-like syscalls on all architectures in the future) will reach the security_file_open(struct file *file) hook.
Then an LSM implementer can determine whether the process wanting to open that file is actually allowed to open the file.
This is a much less fragile approach compared to syscall interposition when implementing security software.
Note that this check happens after the standard user/group permissions checks, and that an LSM can even deny root the ability to perform actions.
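To make the hook idea concrete, here is a toy userspace model in Python. It is not kernel code: the `ToyKernel` class, the `deny_shadow` hook, and the simplified DAC rule are all invented for illustration. It only shows the shape of the design, with several open-like entry points funnelling into one hook that runs after the DAC check and can deny even root.

```python
from dataclasses import dataclass, field
from typing import Callable, List

EACCES = 13  # errno value used for "permission denied"


@dataclass
class File:
    path: str
    owner: str
    world_readable: bool = False


# A hook gets the requesting user and the file; 0 allows, negative errno denies.
FileOpenHook = Callable[[str, File], int]


@dataclass
class ToyKernel:
    file_open_hooks: List[FileOpenHook] = field(default_factory=list)

    def _dac_allows(self, user: str, f: File) -> bool:
        # Grossly simplified DAC: root, the owner, or anyone if world-readable.
        return user == "root" or user == f.owner or f.world_readable

    def _do_open(self, user: str, f: File) -> int:
        # Shared code path: DAC first, then every registered hook.
        if not self._dac_allows(user, f):
            return -EACCES
        for hook in self.file_open_hooks:
            rc = hook(user, f)
            if rc != 0:
                return rc
        return 0

    # Three "syscalls" that all end up in the same place.
    def sys_open(self, user: str, f: File) -> int:
        return self._do_open(user, f)

    def sys_openat(self, user: str, f: File) -> int:
        return self._do_open(user, f)

    def sys_openat2(self, user: str, f: File) -> int:
        return self._do_open(user, f)


def deny_shadow(user: str, f: File) -> int:
    """Hypothetical policy: nobody, not even root, may open /etc/shadow."""
    return -EACCES if f.path == "/etc/shadow" else 0
```

Registering `deny_shadow` once covers all three entry points, which is the point of hooking below the syscall layer instead of interposing on each syscall.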
The primary goal of an LSM is to limit what a compromised program can do, for example by making sure it can’t read files it isn’t supposed to. LSMs were arguably more useful when production servers ran binaries directly on the host. Nowadays we have containers such as Docker, so processes are more isolated by default. Nevertheless, it is still worth taking a look at the LSM ecosystem.
By the way, the nomenclature surrounding this can be a bit confusing. LSM is a subsystem of the Linux kernel, which defines LSM hooks. LSMs are also the individual security systems implemented using those hooks. For example, SELinux and AppArmor are LSMs. Speaking of which…
SELinux
Originally developed by the NSA and now also maintained by Red Hat, SELinux is one of the oldest LSMs and still the most popular today.
It is classified as an attribute-based LSM, meaning it stores its security labels in the extended attributes (xattrs) of a file, under the name security.selinux.
The security label can be viewed with ls -Z.
For example, let’s take a look at httpd in Fedora.
$ sudo dnf install httpd
$ cd /var/www/html
$ touch index.html
$ ls -lZ
-rw-r--r--. 1 apache apache unconfined_u:object_r:httpd_sys_content_t:s0 0 Aug 28 14:16 index.html
$ ls -lZ $(which httpd)
-rwxr-xr-x. 1 root root system_u:object_r:httpd_exec_t:s0 573120 Dec 31 1969 /usr/sbin/httpd*
The important parts are httpd_exec_t and httpd_sys_content_t.
That part of the context string is the type. When a type labels a running process it is called a domain: httpd_exec_t labels the executable file, and the server process started from it runs in the httpd_t domain.
Some other examples of types are bin_t, which labels files under /bin, and postgresql_port_t, which labels TCP port 5432.
There is a policy which allows the httpd_t domain to read files with the type httpd_sys_content_t, but not others.
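As a rough illustration of how such a rule is evaluated, the sketch below parses a context string into its four fields and looks up a toy allow table. The table is invented for this example and is not taken from any real policy. It also assumes the web server process runs in a domain named `httpd_t`, following the reference policy's naming.

```python
from typing import NamedTuple


class SecurityContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(label: str) -> SecurityContext:
    """Split 'user:role:type:level' into fields.

    The level may itself contain colons (e.g. 's0-s0:c0.c1023'),
    so only the first three colons are treated as separators.
    """
    user, role, type_, level = label.split(":", 3)
    return SecurityContext(user, role, type_, level)


# Toy allow rules: (source domain, target type, object class) -> permissions.
# Hypothetical contents, for illustration only.
ALLOW = {
    ("httpd_t", "httpd_sys_content_t", "file"): {"getattr", "open", "read"},
}


def is_allowed(domain: str, target_label: str, tclass: str, perm: str) -> bool:
    """Default-deny lookup: anything not explicitly allowed is refused."""
    target = parse_context(target_label)
    return perm in ALLOW.get((domain, target.type, tclass), set())
```

With this table, `is_allowed("httpd_t", "unconfined_u:object_r:httpd_sys_content_t:s0", "file", "read")` is true, while the same request with `"write"` or against a file of another type is refused.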
A default set of recommended policies is available at SELinuxProject/refpolicy, though many distros customize these quite a bit.
I’ve glossed over a lot here, and honestly I am not very good at using SELinux myself.
In general though, you’ll notice that SELinux is a very complicated and powerful system.
Another fun fact about SELinux is that as one of the first LSMs, it existed before LSM hooks were added to Linux. To quote from the Linux documentation:
In March 2001, the National Security Agency (NSA) gave a presentation about Security-Enhanced Linux (SELinux) at the 2.5 Linux Kernel Summit.
[…]
In response to the NSA presentation, Linus Torvalds made a set of remarks that described a security framework he would be willing to consider for inclusion in the mainstream Linux kernel. He described a general framework that would provide a set of security hooks to control operations on kernel objects and a set of opaque security fields in kernel data structures for maintaining security attributes.
AppArmor
AppArmor is a path-based LSM.
Policies are written per executable, defining what each program can and cannot touch.
For example, here is a shortened version of Void Linux’s nginx AppArmor policy:
include <tunables/global>
profile nginx /usr/bin/nginx {
  include <abstractions/base>
  include <abstractions/nameservice>
  include <abstractions/nis>
  include <abstractions/openssl>
  capability setgid,
  capability setuid,
  /etc/nginx/** r,
  /run/nginx.pid rw,
  /usr/bin/nginx mr,
  /usr/share/nginx/html/* r,
  /var/log/nginx/* w,
}
The include syntax works like #include in C.
If you are wondering how network access is allowed, it is defined in abstractions/nameservice as:
# TCP/UDP network access
network inet stream,
network inet6 stream,
network inet dgram,
network inet6 dgram,
This is a much simpler way of defining policy in my opinion. There are far fewer moving parts, which makes it easier to grok while being just about as powerful as SELinux.
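The path rules in the nginx profile above rely on two globs: `*` stays within one directory level, while `**` also crosses `/`. The Python sketch below imitates just that distinction. It is a simplification and not AppArmor's actual matcher: other glob syntax (`?`, `[...]`, `{a,b}`), directory rules with trailing slashes, and edge cases around empty matches are ignored, and the `RULES` list and `allowed` helper are hypothetical.

```python
import re
from typing import List, Tuple


def glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Translate a path glob using only '*' and '**' into a regex."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")       # any characters, including '/'
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")    # any characters except '/'
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


# File rules copied from the shortened nginx profile: (path glob, permissions).
RULES: List[Tuple[str, str]] = [
    ("/etc/nginx/**", "r"),
    ("/run/nginx.pid", "rw"),
    ("/usr/bin/nginx", "mr"),
    ("/usr/share/nginx/html/*", "r"),
    ("/var/log/nginx/*", "w"),
]


def allowed(path: str, perm: str) -> bool:
    """True if any rule matches the whole path and grants the permission."""
    return any(
        perm in perms and glob_to_regex(glob).fullmatch(path) is not None
        for glob, perms in RULES
    )
```

Under this model `/usr/share/nginx/html/index.html` is readable but `/usr/share/nginx/html/css/site.css` is not, because the rule uses a single `*`. Logs under `/var/log/nginx/` can be written but not read back.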
Landlock
Most LSMs are installed globally, and enforce policy across the entire system.
Landlock takes a different approach.
An unprivileged program can restrict itself, without acquiring any additional permissions.
Once you landlock yourself, there is no way to give your process back the access you restricted (hence the name). You may be familiar with OpenBSD’s pledge and unveil syscalls, which behave similarly.
For example, nginx could read all its configs at startup and then landlock itself so that, once it starts serving requests, it can only read and write its content directories and no others.
That way, if there is a critical bug in nginx, the attack surface is reduced.
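For a feel of what self-restriction involves, here is a minimal sketch that calls the Landlock syscalls from Python through ctypes. Treat the details as assumptions to verify against linux/landlock.h: the syscall numbers (444 to 446), the flag values, and the struct layouts are written from my understanding of the kernel headers, and the helper names are made up. It assumes Linux and that the path passed in is a directory.

```python
import ctypes
import os
import struct
import sys

# Assumed values from linux/landlock.h and the kernel syscall tables.
SYS_LANDLOCK_CREATE_RULESET = 444
SYS_LANDLOCK_ADD_RULE = 445
SYS_LANDLOCK_RESTRICT_SELF = 446
LANDLOCK_CREATE_RULESET_VERSION = 1 << 0
LANDLOCK_RULE_PATH_BENEATH = 1
LANDLOCK_ACCESS_FS_WRITE_FILE = 1 << 1
LANDLOCK_ACCESS_FS_READ_FILE = 1 << 2
LANDLOCK_ACCESS_FS_READ_DIR = 1 << 3
PR_SET_NO_NEW_PRIVS = 38

libc = ctypes.CDLL(None, use_errno=True)
libc.syscall.restype = ctypes.c_long


def _syscall(number, *args):
    """Invoke a raw syscall and raise OSError if it fails."""
    result = libc.syscall(ctypes.c_long(number), *args)
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def landlock_abi_version():
    """Return the Landlock ABI version, or 0 if Landlock is unavailable."""
    if not sys.platform.startswith("linux"):
        return 0
    try:
        return _syscall(SYS_LANDLOCK_CREATE_RULESET, None, ctypes.c_size_t(0),
                        ctypes.c_uint32(LANDLOCK_CREATE_RULESET_VERSION))
    except OSError:
        return 0


def restrict_reads_to(directory):
    """Irreversibly deny file reads and writes except reads under `directory`."""
    read_access = LANDLOCK_ACCESS_FS_READ_FILE | LANDLOCK_ACCESS_FS_READ_DIR
    handled = read_access | LANDLOCK_ACCESS_FS_WRITE_FILE
    # struct landlock_ruleset_attr, first field only (handled_access_fs).
    attr = struct.pack("=Q", handled)
    ruleset_fd = _syscall(SYS_LANDLOCK_CREATE_RULESET, attr,
                          ctypes.c_size_t(len(attr)), ctypes.c_uint32(0))
    try:
        parent_fd = os.open(directory, os.O_PATH | os.O_CLOEXEC)
        try:
            # struct landlock_path_beneath_attr is packed: u64, then s32.
            rule = struct.pack("=Qi", read_access, parent_fd)
            _syscall(SYS_LANDLOCK_ADD_RULE, ctypes.c_int(ruleset_fd),
                     ctypes.c_int(LANDLOCK_RULE_PATH_BENEATH), rule,
                     ctypes.c_uint32(0))
        finally:
            os.close(parent_fd)
        # Unprivileged callers must set no_new_privs before restricting.
        if libc.prctl(ctypes.c_int(PR_SET_NO_NEW_PRIVS), ctypes.c_ulong(1),
                      ctypes.c_ulong(0), ctypes.c_ulong(0),
                      ctypes.c_ulong(0)) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
        _syscall(SYS_LANDLOCK_RESTRICT_SELF, ctypes.c_int(ruleset_fd),
                 ctypes.c_uint32(0))
    finally:
        os.close(ruleset_fd)
```

A caller would check `landlock_abi_version() > 0` first, finish any setup that needs wider access, and only then call something like `restrict_reads_to("/var/www")`. Only the three access types listed in `handled` are restricted here, and a real program (including the Python interpreter itself, which loads modules lazily) would need a more carefully chosen rule set.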
Of course, this requires buy-in from developers.
nginx does not implement what I described, and as far as I can tell Landlock has seen very little adoption.
However, there are some other use cases for Landlock.
There is a tool called landrun which parses command line arguments to set up the sandbox you want.
Then it execs a subprogram, so that the new program is already restricted when it starts.
This makes for a pretty slick CLI to “add” Landlock to any program.
It looks like this:
$ landrun \
--rox /usr/bin \
--ro /lib,/lib64,/var/www \
--rwx /var/log \
--bind-tcp 80,443 \
/usr/bin/nginx
This is really close to what I want.
However, there are some issues with Landlock (and by extension, landrun).
For one, you cannot restrict UDP, or raw sockets in general.
This is important for things like DNS, QUIC (for HTTP/3), and ping.
Also, it is not interactive.
Once you’ve landlocked yourself, there is absolutely nothing (not even root) that could grant you those permissions back.
Interactivity is a core requirement for me, so unfortunately I cannot use landrun or Landlock.
Drawbacks
SELinux and AppArmor policies are highly specific to the distro.
For example, some distros put their programs in /usr/sbin, others in /usr/bin, and good distros put them in /nix/store.
Also, support must be compiled into the kernel with a Kconfig option, and distros typically enable either SELinux or AppArmor (or potentially neither).
I want my tool to be easy to use and distro agnostic.
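One practical consequence is that a distro-agnostic tool has to discover at runtime which LSMs are actually active. On kernels that expose securityfs, the list can be read from /sys/kernel/security/lsm as a comma-separated string. The sketch below assumes that file is present and readable, and falls back to an empty list otherwise; the function names are my own.

```python
from pathlib import Path
from typing import List

LSM_LIST_PATH = Path("/sys/kernel/security/lsm")


def parse_lsm_list(text: str) -> List[str]:
    """Split the comma-separated LSM list, ignoring surrounding whitespace."""
    return [name for name in text.strip().split(",") if name]


def active_lsms(path: Path = LSM_LIST_PATH) -> List[str]:
    """Return the active LSMs, or [] if the list cannot be read."""
    try:
        return parse_lsm_list(path.read_text())
    except OSError:
        # securityfs not mounted, not Linux, or no permission to read.
        return []
```

A tool could then branch on, say, `"landlock" in active_lsms()` instead of assuming what a given distro ships.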
And how do administrators create policies for SELinux and AppArmor? Both include a non-enforcing mode of operation (SELinux calls it “permissive”, AppArmor calls it “complain”), where running a program logs everything it accesses. Then you can take that log and turn it into a policy: SELinux uses audit2allow and AppArmor uses aa-logprof.
The problem is that this requires running the program at least once, untrusted, to create a safety profile for it.
Then, once the policy is enforced, if the program hits a code path that reads another file (one that was not exercised while generating the profile), the access fails with a permission error such as -EACCES, which will likely crash the program.
This is not a fun development cycle.
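The failure mode is easy to reproduce in a toy model. In the sketch below, `ToyProfile` stands in for a learn-then-enforce policy and `program` for an application with a rarely taken branch. Every name and path is invented, and the errno is chosen for illustration.

```python
import errno


class ToyProfile:
    """Logs accesses while learning, denies unknown paths once enforcing."""

    def __init__(self):
        self.allowed = set()
        self.log = []
        self.enforcing = False

    def access(self, path: str) -> bool:
        if path in self.allowed:
            return True
        if self.enforcing:
            return False
        self.log.append(path)  # learning mode: record and let it through
        return True

    def learn_from_log(self) -> None:
        """Turn everything seen so far into allow rules."""
        self.allowed.update(self.log)
        self.log.clear()


def program(profile: ToyProfile, rare_branch: bool = False) -> int:
    """Pretend application: opens two files, plus a third on a rare path."""
    paths = ["/etc/app.conf", "/var/lib/app/data.db"]
    if rare_branch:
        paths.append("/etc/app.d/extra.conf")
    for path in paths:
        if not profile.access(path):
            raise PermissionError(errno.EACCES, "Permission denied", path)
    return len(paths)
```

Running `program(profile)` once in learning mode, calling `learn_from_log()`, and switching to enforcing works fine until `program(profile, rare_branch=True)` is hit, at which point the third access raises `PermissionError` even though the program is behaving legitimately.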
I need cordon to halt the program’s execution and wait for user input before continuing, like Deno’s permission prompts.
Implementing an LSM
To implement my interactive security sandbox idea, I’ll likely need to implement my own LSM, and a corresponding userspace agent.
A kernel module might seem like a fairly natural way to create an LSM out-of-tree.
However, it is not possible to implement an LSM as a kernel module.
security_add_hooks (the function called to register an LSM’s hooks) is marked with __init in the kernel, which means that it is only available during kernel startup.
So by the time you can insmod a kernel module, it is too late to register an LSM.
This means every LSM needs to be built into the kernel directly, which at minimum requires a system restart.
It is possible to implement a “pseudo-LSM” in a kernel module by using kretprobes on the security_* hooks as described in this post, but this technique seems really fragile to me.
Also, it is not portable, because overwriting the return value with -EPERM means directly modifying the architecture-specific struct pt_regs *.
In general, kernel modules have version compatibility issues, and would need to be compiled per architecture. Additionally, any error in the implementation of a kernel module can easily crash the system, or worse yet allow an attacker to compromise the system at the kernel level. This is a little annoying, but fortunately there is a better way.
