The Rabbit Hole of Building a Filesystem Watcher

Some of the systems I work with are highly customized environments, and often need support engineers to maintain them. A lot of automation exists, but sometimes they need to manually go into a VM and change things. This is normal , but with these manual tasks, mistakes are inevitable. One such case is a service that would only work if all the files and directories it manages are owned by a special user. But sometimes people run commands in the service directories as root. This doesn’t impact the service as it’s running, but it won’t restart. While the fix is simple, just chown -R the service directory. There are many easy ways to prevent this, e.g. setting file permissions, File ACLs. These are less strict as root user can override these. Setting SELinux policies would be a much stricter solution. These are very sensible solutions. But what is the fun it that? How about we build an entire filesystem event watcher ourselves?

Attempt 1 - `fanotify`#

fanotify is a set of APIs in the Linux kernel by which we could get filesystem events sent to userspace. Let’s dive in, according to man page, we first need to call fanotify_init with proper flags; This sets up a kernel-space notification group. We can set up the directories we need to watch via fanotify_mark. fanotify_init sets up a file descriptor for the event queue, which can be accessed by reading the file descriptor. This is a great built-in API, but we have a few issues.

We cannot monitor a directory recursively. This feature is only available for whole filesystem mounts.
Another limitation is that fanotify only gives us the PID of the process that triggered the event (metadata->pid), not the full credentials. If we want to know who (which UID/GID) actually performed the operation, we must do an extra lookup in /proc/<pid> (for example, reading /proc/<pid>/status) to fetch the task’s credentials. That means for every single event, we would need to open and parse a /proc file, and then apply our filtering logic.

Attempt 2 - eBPF#

This was when I put this idea on the back burner, but then I stumbled on eBPF while working on another project with Falco.

eBPF enables running programs in kernel space. Programs are first compiled into bytecode, then verified by an in-kernel static verifier, then run using JIT for native execution performance. To communicate with the user-space, we can instantiate various forms of data structures too. The official intro docs do a great job of explaining this, see What is eBPF?.

I have been fortunate enough to be writing this at a time when tooling around eBPF has evolved a lot. Earlier tools had to include kernel headers by either a) compiling the program with the exact kernel source present locally or b) compiling the program on the server where it will run. Thanks to improvements around adding lightweight type info BTF(BPF Type Format) , CO-RE (Compile Once - Run Everywhere) and libbpf loader. The user interface for writing eBPF programs is a bit easier.

Now the question comes, what do you hook into? We can directly hook into kernel VFS layer functions such as vfs_mkdir and vfs_create, which abstract out various filesystem implementations and expose a single filesystem interface to user-space. We could read the arguments and filter out the events shipped to userspace, saving on a lot of context switches.

This method again has its own slew of annoyances.

Using kprobes on functions like vfs_* does not guarantee a stable ABI, i.e the arguments can change anytime, or functions themselves can disappear across kernel releases. In my case, this is not a big deal since I would be running this in a standardized environment with consistent kernel versions. But this is a solvable problem, though requiring more engineering effort. See this section about handling kernel change in the BPF-CORE reference
We will have to write the path filtering logic in kernelspace using eBPF, since vfs_* probes will trigger for all events. We will have to walk the filesystem tree up and see if some dir matches our monitored dir. Aside from the complexity of writing this, each eBPF program is statically verified. It must not contain unbounded loops, and we have a limited stack size (typically 512 bytes).

Walking up the file tree in eBPF#

With the generous help of Andrii Nakryiko’s excellent BPF CO-RE reference guide, I was able to come up with a good enough solution. We can use the dentry struct to walk up the tree. But since we can’t have unbounded loops in BPF, I had to truncate the walk at MAX_DEPTH, which is acceptable for my problem statement since the expected depth of the directory I want to monitor is known.

static bool is_monitored_dir(struct dentry *dentry, __u64 target_ino) {
  bpf_rcu_read_lock();
  struct dentry *curr_dentry = BPF_CORE_READ(dentry, d_parent);
  struct inode *curr_inode;
  __u64 curr_ino;
  bool result = false;

  #pragma unroll
  for(int i=0; i < MAX_DEPTH; i++) {
    if (!curr_dentry) {
      break;
    }

    curr_inode = BPF_CORE_READ(curr_dentry, d_inode);
    curr_ino = BPF_CORE_READ(curr_inode, i_ino);
    if (curr_ino == target_ino) {
      result = true;
      break;
    }

    struct dentry *parent_dentry = BPF_CORE_READ(curr_dentry, d_parent);
    if (curr_dentry == parent_dentry) {
      break; // curr_dentry is its own root, we have reached the top of
                    // the tree.
    }

    curr_dentry = parent_dentry;
  }

  bpf_rcu_read_unlock();
  return result;
}

Note the kernel RCU (Read, Copy, Update) locks are needed since the dentry tree can change while we are traversing it. The RCU mechanism lets the readers safely traverse without blocking the writers.

For a complete, working example of this approach, please refer to the fs-watcher GitHub repository . This repository contains the full source code.

Better Probes#

LSM hooks provide a more stable and semantically meaningful API for monitoring filesystem events, since they are part of the kernel’s Linux Security Module framework. They can reduce the number of events you need to filter and eliminate some of the brittleness associated with probing low-level VFS functions. However, these hooks were not available in the kernel I was working with. With LSM hooks, we have access to the path struct with which we can resolve the name into a buffer using bpf_path_d_path. Then we can do a substring search to see if the path is monitored or not. I will be sure to try this out after our next infra update.

Wrapping Up#

This little experiment turned out to be a great deep dive into Linux kernel internals, eBPF and various trade-offs of running kernel-space programs. eBPF is a very powerful tool, but also has very sharp edges if you are not careful. This has also been my most rigorous exercise in RTFM’ing. A lot of information about these tools exists, but it’s scattered across kernel docs, blog posts, and reference guides. Piecing it all together was a journey in itself.

The Rabbit Hole of Building a Filesystem Watcher

Attempt 1 - `fanotify`#

Attempt 2 - eBPF#

Walking up the file tree in eBPF#

Better Probes#

Wrapping Up#

Read Next

Intuition behind Power of 2 Choices Load balancing

Attempt 1 - fanotify#

Attempt 2 - eBPF#

Walking up the file tree in eBPF#

Better Probes#

Wrapping Up#

Read Next

Intuition behind Power of 2 Choices Load balancing

Attempt 1 - `fanotify`#