I'm back, after a somewhat long hiatus of multiple conferences, lots of travel, and getting engaged. I am in the process of writing another article as part of the "What is memory?" series, but in the meantime, I wanted to share another article that I wrote which was published in LWN back in July. If you're a fan of BPF, or think it's cool that it's possible to write code that runs in kernel space which verifiably cannot crash the system, then I recommend reading this article!
The BPF subsystem allows programmers to write programs that can run safely in kernel space. All memory accesses and function calls in BPF programs are statically checked for safety using the in-kernel verifier, which analyzes programs in their entirety before allowing them to be loaded. While this allows the kernel to safely run BPF programs, it heavily restricts what those programs are able to do. Among these constraints is a rule that programs cannot store pointers into BPF maps for use (such as dereferencing them or passing them to the kernel in kfunc and BPF helper invocations) at a later time. A patch set by Kumar Kartikeya Dwivedi adds this capability to BPF.
Interacting with kernel pointers in BPF programs
Some operations are always safe to execute from within BPF. For example, the
bpf_probe_read_kernel() helpers allow BPF programs to safely read user and kernel memory respectively by registering a page-fault handler to catch any faulting accesses. When BPF programs receive pointers from event handlers, for example, they can safely dereference them without having to worry about whether they will cause a crash, though they do have to ensure that the read succeeded by checking the returned error code.
BPF programs can also receive pointers from BPF helpers and kfuncs. BPF helpers are functions defined in the kernel that provide the core APIs that can be invoked by any BPF program. Kfuncs are also functions in the kernel that can be invoked by BPF programs but, unlike BPF helpers, their APIs do not need to be applicable to all types of BPF programs. To make a function available to BPF programs as a kfunc, it must be aggregated into one or more BPF Type Format (BTF) kfunc-sets, which are then registered with the BPF subsystem.
Kfunc-sets can also specify properties about their kfuncs that inform the verifier about how they need to be invoked in order to ensure safe execution in BPF programs. One such property specifies that the kfuncs in a kfunc-set will return an "acquired" pointer that must be passed to another kfunc that is part of a kfunc-set that can release it. Pointers that are subject to this constraint are called "referenced pointers" within the BPF community.
When loading BPF programs, the verifier will enforce this contract, and reject any program that fails to release a referenced pointer before returning, or which passes a pointer that was not previously returned as a referenced pointer to a BPF helper or kfunc. Note that the implementation of the "acquire" and "release" semantics of a kfunc is completely opaque to BPF, and is entirely up to the developer implementing the kfunc. The only thing that BPF requires is a guarantee that an acquired pointer will remain valid until it is released.
Extending BPF usability with referenced pointers
The ability to ensure that a kernel pointer is valid affords several advantages to BPF programs. The first and perhaps most straightforward is that BPF programs no longer need to use probed reads to dereference the pointers. Probed reads use the exception table mechanism used to safely read user memory from kernel space, and while they have nearly the same performance as a normal load instruction for successful reads, they impose a tax on the programmer by requiring them to always check if a read was successful. Avoiding probed reads allows a simpler programming model which can significantly cut down on the raw amount of code needed in BPF programs to satisfy the verifier.
In addition to providing simplification, referenced pointers also improve the extensibility of BPF by allowing BPF programs to safely pass those pointers back to the kernel in subsequent kfunc and BPF helper function invocations. While the kernel could use a mechanism such as
copy_from_user() to read pointers received from BPF programs, it is less complex and less error prone to, instead, provide a guarantee to the kernel that pointers received from BPF programs are safe to read. This guarantee also makes it possible to export many internal kernel functions to BPF programs without modifying them.
While referenced pointers are a powerful tool for extending the kernel using BPF, a significant limitation of the feature is the requirement that all of the interactions between BPF programs and kfuncs take place in a synchronous context. Every time it needs to get a referenced pointer from the kernel, a BPF program must invoke a kfunc, and then release the pointer in another kfunc invocation before returning. This may have performance implications, as having to call two kfuncs is quite a lot of overhead relative to performing a single memory read. This workflow is also somewhat orthogonal to the traditional mechanics of reference counting, wherein pointers are stored in a data structure with the intention of being safely accessed at a later time. What would instead be useful is to store kernel pointers in a map, allowing them to be accessed whenever the program requires, possibly over multiple separate calls.
kptrs – storing kernel pointers in BPF maps
Dwivedi's patch set adds this capability via a new feature called "kptrs". A kptr is a strongly typed pointer that is received from a kfunc or BPF helper function and which may be stored into and retrieved from BPF maps throughout the run time of a program. Kptrs may be either ordinary ("unreferenced") or referenced pointers. Unreferenced kptrs have no guarantee of validity and are highly restricted in how they can be used; like normal pointers in BPF programs, they can only be accessed using a probed read. They also cannot be passed to the kernel via a kfunc or BPF helper function, as the pointers may reference invalid memory. Referenced kptrs, on the other hand, may be safely dereferenced by BPF programs, and passed to the kernel via kfunc or BPF helper function invocations.
From the BPF subsystem's point of view, a referenced kptr always has exactly one reference associated with it. In order to transfer a referenced kptr between different contexts, a new
bpf_kptr_xchg() helper function was added that atomically swaps ownership of the reference between a map value and a local pointer. If the reference is transferred from a map value to a local pointer, the semantics enforced by the verifier are the same as for references returned by kfuncs in a synchronous context: the verifier will ensure that the reference is either transferred back to a map value or released via a call to a kfunc before the current execution context returns. On the other hand, if the kptr reference is stored in a map, the current execution context can safely return without releasing it.
If the kptr is never transferred back out of the map with
bpf_kptr_xchg() and manually released, it will be automatically released when the program is unloaded and the map is destroyed. In order to enable this automatic releasing mechanism, Dwivedi extended the kfunc subsystem to allow developers to specify a kfunc destructor function that should be used for a given type.
Some of the use cases for kptrs were discussed in an earlier revision of the patch set, with Dwivedi describing the most common one as being for performance:
The common use case is caching references to objects inside BPF maps, to avoid costly lookups, and being able to raise it once for the duration of program invocation when passing it to multiple helpers (to avoid further re-lookups).
Storing a referenced kptr in a map obviates the need to invoke a kfunc every time the pointer is required, which provides performance benefits and reduces complexity. BPF programs need only make an initial kfunc invocation to first get the pointer and, after storing it in a map, can simply load it from that map and dereference it directly when it's needed.
Referenced kptrs also provide strong safety and correctness guarantees to developers. It is a ubiquitous paradigm in managed-object frameworks that a reference should only be owned by a single context, and the semantics of kptr reference handling bear a striking resemblance to std::boxed::Box in Rust, and std::unique_ptr in C++ in that regard. In Rust, a Box is a pointer to a heap allocation that can only have a single reference at any given time, and which is automatically freed when that reference goes out of scope. When the Rust program is compiled, the Rust compiler will verify that only a single reference can ever exist to the Box by applying Rust's ownership model. If the Rust program compiles, you have a guarantee that the memory pointed to by a Box is always valid and only has a single reference. In C++, some verification takes place at compile time by, for example, prohibiting an std::unique_ptr from being copied, but problems can still arise at run time. For example, a user could invoke std::unique_ptr::release() twice, and would receive a nullptr on the second invocation.
Kptrs seem to draw inspiration from both languages. On the one hand, the BPF verifier provides the compile-time guarantees that are afforded by the Rust compiler by analyzing BPF programs to ensure that there is only ever a single owner of a reference, and that the reference can never be leaked. On the other hand,
bpf_kptr_xchg() closely matches the semantics of std::unique_ptr::swap(), so the mechanics of using the feature will feel more like C++. Managed-object leaks and use-after-free bugs are a common and pervasive source of pain when correct accounting is the responsibility of individual developers. Providing the guarantees of Rust's ownership model, and the semantics of C++'s std::unique_ptr APIs, to C and kernel development using BPF and the verifier therefore seems powerful indeed.
It will be interesting to see if the advantages afforded by these features will motivate more development to take place in BPF as opposed to the kernel.
Future kptr types
While we have been referring to both unreferenced kptrs and referenced kptrs as just "kptrs", they are actually represented as two different types in BTF. If BPF programs wish to use a kernel pointer as an unreferenced or a referenced pointer, they must annotate it with the BTF type tag "kptr" or "kptr_ref" respectively. It makes sense to enforce separate types for each kptr variant, as the verifier needs to use BTF to know a kptr's type, and then enforce its safe use accordingly.
While the current implementation of kptrs only enables the unreferenced and referenced variants, a natural question is whether the implementation could be expanded to include other types of pointers as well. In the first version of the patch set, Dwivedi proposed adding variants for per-CPU and user-space kptrs. Alexei Starovoitov responded by asking what the use case was for storing per-CPU pointers in maps, but Dwivedi did not have a concrete use case in mind. It was decided to drop the feature until a more concrete use case appears, so for now we will have to wait and see if that happens
This patch set is currently in linux-next, and so will presumably be merged during the next development cycle. In the meantime, it will be interesting to see how kfuncs and kptrs will be used to extend the kernel in ways that currently are not possible with BPF.