Securing Containers With Seccomp Filters
Many businesses are adopting containers as a foundational technology used to manage and run their applications. If you’ve worked much with containers, it’s easy to see why: they enable entirely new levels of portability and scalability. But the adoption of containers, like any other new technology, also means new ways to exploit applications.
Depending on the container’s configuration, an exploited application can eventually lead to the compromise of the host that the container is running on. There are also other implications to consider, such as secrets stored as environment variables in the container and the systems those secrets grant access to. If you want to know more about security best practices for Docker containers specifically, GitGuardian offers a useful cheat sheet.
A mature software development lifecycle already includes security processes such as vulnerability scanning and software composition analysis, but there is a need for more. Most available application security technology exists to prevent an application from being vulnerable, but few tools contain the damage that can be done once an application is successfully exploited. To help with that, I’ve been researching a novel way to protect your containerized applications post-exploitation. In this post, I’ll share what it is and how it can be integrated into your existing software development processes. The additional protection I’m referring to is called Seccomp-BPF, and I need to explain a little about what it is before diving into how to use it.
Background
The programs that we run on computers rely heavily on the underlying operating system to do anything. Tasks like opening files and spawning new processes are abstracted in modern programming languages, but under the hood, the code is making kernel requests called system calls (or syscalls). How important are syscalls for a program to function? Well, there are around 400 syscalls available in the Linux kernel, and even a basic “Hello, World!” program written in C needs at least two of them to do its job: write to print the message and exit to terminate.
Code running in so-called “user space” can’t do anything useful without going through the kernel. Eventually, some smart Linux kernel developers decided to use that fact to create a powerful security feature: Linux 3.5, released in July 2012, added support for something called Seccomp-BPF.
Seccomp-BPF is a Linux kernel feature that allows you to restrict the syscalls that a process can make by creating a special filter.
In theory, you can create a Seccomp-BPF filter that only allows a process to make the exact syscalls that it needs to function and nothing more. This is useful when an app turns out to be exploitable in a way that would let an adversary spawn additional processes. If the filter doesn’t allow the syscalls the attacker’s payload depends on, such as those needed to start a new process, there’s a good chance it will thwart the attack.
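To make the allow-list idea concrete, here is a minimal sketch of such a filter in the JSON profile format that container runtimes accept. The file path and the two-entry syscall list are my own illustrative assumptions; a list this short would not be enough to run a real program.

```shell
# Hypothetical path, chosen only for this illustration.
profile=/tmp/example-seccomp.json

# defaultAction: any syscall that is not listed fails with an error.
# syscalls: the only calls that are allowed through.
# The two names below are an assumption for illustration, not a usable profile.
cat > "$profile" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["write", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF

echo "wrote $profile"
```

Everything that follows in this post is about getting the names list right without writing it by hand.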
Seccomp is super cool, and it’s even integrated into container runtimes and orchestration tools like Docker and Kubernetes. That raises the question: “Why isn’t Seccomp widely used?” I think the answer is that there aren’t enough resources out there that bridge the gap between a low-level kernel feature like Seccomp and modern software development processes. Not every organization has a low-level developer who knows a ton about syscalls. There’s also the overhead of figuring out which syscalls your program needs and updating that list with every new feature you implement in your code.
I was thinking about how to solve that problem, and I thought of an idea: “What if we record the syscalls that a program makes while it’s running?” I was telling one of my co-workers about my idea, and the next day he sent me a link to a tool he found on GitHub. It turned out that some folks at Red Hat had already made a tool called oci-seccomp-bpf-hook that does exactly what I wanted!
Creating a Seccomp-BPF Filter
The tool oci-seccomp-bpf-hook was made to work with Linux containers. OCI stands for “Open Container Initiative,” and it’s a set of standards for container runtimes that defines what kinds of interfaces they should be able to provide. OCI-compliant container runtimes (like Docker) provide a mechanism called “hooks” that allows you to run code before a container is spun up and after a container is torn down. Rather than explain how Red Hat’s tool uses these hooks, I think a demonstration will be clearer.
Red Hat developed oci-seccomp-bpf-hook for use with their container runtime, podman. Podman’s command line is, for the most part, compatible with Docker’s, so the syntax in my examples will look familiar if you’ve used Docker. Additionally, the OCI hook is currently only available in Red Hat-related DNF repositories unless you build it from source. To keep things simple for this demo, I’m just using a Fedora server (if you don’t have a Fedora environment, I recommend running a Fedora virtual machine on something like VirtualBox or VMware to follow along).
The first thing you’ll need to do to start using oci-seccomp-bpf-hook is to make sure you have it installed along with podman. To do that, we can run the following command:
sudo dnf install podman oci-seccomp-bpf-hook
Now that we have podman and the OCI hook, we can finally dive into how to generate a Seccomp-BPF filter. From the readme, the syntax is:
sudo podman run --annotation io.containers.trace-syscall="if:[absolute path to the input file];of:[absolute path to the output file]" IMAGE COMMAND
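As I read the README syntax, of: names the file the generated profile is written to, and if: optionally names an existing profile to start from. The helper below is a hypothetical convenience wrapper (the function name and the example paths are mine, not part of the tool). It only assembles the annotation string, so you can try it without podman installed.

```shell
# trace_annotation OUTPUT [INPUT]
# Prints the annotation string for oci-seccomp-bpf-hook.
# Both paths must be absolute, as the README syntax requires.
trace_annotation() {
    output=$1
    input=${2:-}

    for path in "$output" $input; do
        case $path in
            /*) ;;
            *) echo "path must be absolute: $path" >&2; return 1 ;;
        esac
    done

    if [ -n "$input" ]; then
        printf 'io.containers.trace-syscall=if:%s;of:%s\n' "$input" "$output"
    else
        printf 'io.containers.trace-syscall=of:%s\n' "$output"
    fi
}

# Record into a fresh profile.
trace_annotation /tmp/ls.json

# Record into a new profile, starting from an existing one.
trace_annotation /tmp/lsl.json /tmp/ls.json

# Intended use (requires podman and the hook):
#   sudo podman run --annotation "$(trace_annotation /tmp/ls.json)" IMAGE COMMAND
```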
Let’s run the ls command in a basic container and redirect the output to /dev/null. While we’re doing that, we’re going to record the syscalls that the ls command makes and save them to a file at /tmp/ls.json.
sudo podman run --annotation io.containers.trace-syscall=of:/tmp/ls.json fedora:35 ls / > /dev/null
Since we are redirecting the output of the ls command to /dev/null, there should be no output in the terminal. However, after the command is done, we can look at the file that we saved the syscalls to. There we see that the command did work, and the syscalls were captured:
cat /tmp/ls.json
{"defaultAction":"SCMP_ACT_ERRNO","architectures":["SCMP_ARCH_X86_64"],"syscalls":[{"names":["access","arch_prctl","brk","capset","chdir","close","close_range","dup2","execve","exit_group","fchdir","fchown","fstatfs","getdents64","getegid","geteuid","getgid","getrandom","getuid","ioctl","lseek","mmap","mount","mprotect","munmap","newfstatat","openat","openat2","pivot_root","prctl","pread64","prlimit64","pselect6","read","rt_sigaction","rt_sigprocmask","seccomp","set_robust_list","set_tid_address","sethostname","setresgid","setresuid","setsid","statfs","statx","umask","umount2","write"],"action":"SCMP_ACT_ALLOW","args":[],"comment":"","includes":{},"excludes":{}}]}
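A one-line JSON blob is hard to read, so here is a small sketch that prints the allowed syscalls one per line using only standard text tools. The function name and the sample file are my own, and the approach assumes the compact layout shown above; a proper JSON tool would be more robust.

```shell
# list_syscalls PROFILE
# Prints every syscall name found in the profile's "names" arrays, sorted.
# Assumes each "names" array sits on a single line, as in the hook's output.
list_syscalls() {
    grep -o '"names":\[[^]]*]' "$1" |
        sed -e 's/^"names":\[//' -e 's/]$//' |
        tr ',' '\n' |
        tr -d '" ' |
        sort -u
}

# A shortened stand-in for /tmp/ls.json so the sketch runs anywhere.
sample=/tmp/sample-profile.json
printf '%s\n' '{"defaultAction":"SCMP_ACT_ERRNO","architectures":["SCMP_ARCH_X86_64"],"syscalls":[{"names":["write","brk","close"],"action":"SCMP_ACT_ALLOW","args":[],"comment":"","includes":{},"excludes":{}}]}' > "$sample"

list_syscalls "$sample"
```

Running list_syscalls /tmp/ls.json on the real file gives the same kind of listing for the recorded profile.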
This file is our Seccomp filter, and we can now use it with any container runtime that supports it. Let’s try using the filter with the same containerized ls command that we just ran:
sudo podman run --security-opt seccomp=/tmp/ls.json fedora ls / > /dev/null
There is no output and there are no errors, indicating that the command ran successfully with the Seccomp filter applied. Now comes the fun part. We will make the container do something that it didn’t do when we recorded the syscalls for our Seccomp filter. All we’re going to do is add the -l flag to our ls command.
sudo podman run --security-opt seccomp=/tmp/ls.json fedora ls -l / > /dev/null
ls: /: Operation not permitted
ls: /proc: Operation not permitted
ls: /root: Operation not permitted
…
As you can see, we now get a bunch of errors telling us that an operation our command attempted was not permitted. Adding the -l flag to our ls command made the process use a few syscalls that weren’t in our Seccomp filter’s allow list. If we generate a new Seccomp filter with the ls -l command, we can see that the new filter works because it now has all the required syscalls.
sudo podman run --annotation io.containers.trace-syscall=of:/tmp/lsl.json fedora ls -l / > /dev/null
sudo podman run --security-opt seccomp=/tmp/lsl.json fedora ls -l / > /dev/null
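If you are curious which syscalls the second recording added, you can compare the two profiles. This is a rough sketch with standard text tools; the function names are mine, and the two stand-in profiles contain made-up data so the example runs anywhere. They are not the real difference between ls and ls -l.

```shell
# Prints the syscall names in a profile, one per line, sorted.
# Assumes the compact single-line layout that the hook writes.
list_syscalls() {
    grep -o '"names":\[[^]]*]' "$1" |
        sed -e 's/^"names":\[//' -e 's/]$//' |
        tr ',' '\n' |
        tr -d '" ' |
        sort -u
}

# added_syscalls OLD NEW
# Prints the syscalls that appear in NEW but not in OLD.
added_syscalls() {
    old_names=$(mktemp)
    new_names=$(mktemp)
    list_syscalls "$1" > "$old_names"
    list_syscalls "$2" > "$new_names"
    # comm -13 keeps only the lines unique to the second file.
    comm -13 "$old_names" "$new_names"
    rm -f "$old_names" "$new_names"
}

# Stand-in profiles with invented contents, for illustration only.
printf '%s\n' '{"syscalls":[{"names":["brk","close","write"],"action":"SCMP_ACT_ALLOW"}]}' > /tmp/old-profile.json
printf '%s\n' '{"syscalls":[{"names":["brk","close","lgetxattr","write"],"action":"SCMP_ACT_ALLOW"}]}' > /tmp/new-profile.json

added_syscalls /tmp/old-profile.json /tmp/new-profile.json
```

With the real recordings, the call would be added_syscalls /tmp/ls.json /tmp/lsl.json.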
As you can see, applying a Seccomp filter to a container greatly restricts what it can do. In a scenario where an attacker can exploit your application, the filter may stop them from doing damage or even prevent exploitation altogether.
By using Red Hat’s OCI hook, you no longer need to have a deep knowledge of the Linux kernel’s syscalls to create a Seccomp filter. You can easily create an application-specific filter that doesn’t allow your container to do anything more than what it needs to be able to do. This is a huge step in bridging the gap between the kernel feature and high-level software development.
Conclusion
As great as oci-seccomp-bpf-hook is, the tool alone doesn’t fully live up to my expectations for integrating Seccomp into a mature software engineering workflow. There is still overhead involved in running the tool, and as a software developer, you don’t want to spend time manually updating your Seccomp filter for every update of your application. To bridge that final gap and make it as easy as possible to use Seccomp in enterprise applications, we need to find a way to automate the generation of Seccomp-BPF filters. Fortunately, when we look at how modern software development happens, there is already a perfect place for this automation to happen: during Continuous Integration (CI).
CI workflows are already a well-established part of a mature software development lifecycle. For those who aren’t familiar with CI, it enables you to do things like automated unit testing and code security scanning every time you commit code to your git repository. Because a CI pipeline already runs on every change, it’s the perfect place to automate the generation of a Seccomp filter for your containerized application.
We are running out of time for this post, so I’ll be back in another post with a demonstration of how to create a CI workflow that generates a Seccomp filter every time you update your code. Then you will finally be equipped to take advantage of Seccomp’s syscall restriction and secure your applications!